CUB
 All Classes Namespaces Files Functions Variables Typedefs Enumerations Enumerator Friends Groups
Classes | List of all members
cub::BlockStore< T, BLOCK_DIM_X, ITEMS_PER_THREAD, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH > Class Template Reference

Detailed description

template< typename T, int BLOCK_DIM_X, int ITEMS_PER_THREAD, BlockStoreAlgorithm ALGORITHM = BLOCK_STORE_DIRECT, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
class cub::BlockStore< T, BLOCK_DIM_X, ITEMS_PER_THREAD, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >

The BlockStore class provides collective data movement methods for writing a blocked arrangement of items partitioned across a CUDA thread block to a linear segment of memory.

block_store_logo.png
.
Template Parameters
TThe type of data to be written.
BLOCK_DIM_XThe thread block length in threads along the X dimension
ITEMS_PER_THREADThe number of consecutive items partitioned onto each thread.
ALGORITHM[optional] cub::BlockStoreAlgorithm tuning policy enumeration. default: cub::BLOCK_STORE_DIRECT.
WARP_TIME_SLICING[optional] Whether or not only one warp's worth of shared memory should be allocated and time-sliced among block-warps during any load-related data transpositions (versus each warp having its own storage). (default: false)
BLOCK_DIM_Y[optional] The thread block length in threads along the Y dimension (default: 1)
BLOCK_DIM_Z[optional] The thread block length in threads along the Z dimension (default: 1)
PTX_ARCH[optional] The PTX compute capability for which to to specialize this collective, formatted as per the CUDA_ARCH macro (e.g., 350 for sm_35). Useful for determining the collective's storage requirements for a given device from the host. (Default: the value of CUDA_ARCH during the current compiler pass)
Overview
A Simple Example
Every thread in the block uses the BlockStore class by first specializing the BlockStore type, then instantiating an instance with parameters for communication, and finally invoking one or more collective member functions.
The code snippet below illustrates the storing of a "blocked" arrangement of 512 integers across 128 threads (where each thread owns 4 consecutive items) into a linear segment of memory. The store is specialized for BLOCK_STORE_WARP_TRANSPOSE, meaning items are locally reordered among threads so that memory references will be efficiently coalesced using a warp-striped access pattern.
#include <cub/cub.cuh> // or equivalently <cub/block/block_store.cuh>
__global__ void ExampleKernel(int *d_data, ...)
{
// Specialize BlockStore for a 1D block of 128 threads owning 4 integer items each
// Allocate shared memory for BlockStore
__shared__ typename BlockStore::TempStorage temp_storage;
// Obtain a segment of consecutive items that are blocked across threads
int thread_data[4];
...
// Store items to linear memory
int thread_data[4];
BlockStore(temp_storage).Store(d_data, thread_data);
Suppose the set of thread_data across the block of threads is { [0,1,2,3], [4,5,6,7], ..., [508,509,510,511] }. The output d_data will be 0, 1, 2, 3, 4, 5, ....
Examples:
example_block_scan.cu.

Classes

struct  TempStorage
 The operations exposed by BlockStore require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse. More...
 

Public Methods

Collective constructors
__device__ __forceinline__ BlockStore ()
 Collective constructor using a private static allocation of shared memory as temporary storage.
 
__device__ __forceinline__ BlockStore (TempStorage &temp_storage)
 Collective constructor using the specified memory allocation as temporary storage. More...
 
Data movement
template<typename OutputIteratorT >
__device__ __forceinline__ void Store (OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD])
 Store items into a linear segment of memory. More...
 
template<typename OutputIteratorT >
__device__ __forceinline__ void Store (OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD], int valid_items)
 Store items into a linear segment of memory, guarded by range. More...
 

Constructor & Destructor Documentation

template<typename T , int BLOCK_DIM_X, int ITEMS_PER_THREAD, BlockStoreAlgorithm ALGORITHM = BLOCK_STORE_DIRECT, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ cub::BlockStore< T, BLOCK_DIM_X, ITEMS_PER_THREAD, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::BlockStore ( TempStorage temp_storage)
inline

Collective constructor using the specified memory allocation as temporary storage.

Parameters
[in]temp_storageReference to memory allocation having layout type TempStorage

Member Function Documentation

template<typename T , int BLOCK_DIM_X, int ITEMS_PER_THREAD, BlockStoreAlgorithm ALGORITHM = BLOCK_STORE_DIRECT, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename OutputIteratorT >
__device__ __forceinline__ void cub::BlockStore< T, BLOCK_DIM_X, ITEMS_PER_THREAD, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::Store ( OutputIteratorT  block_itr,
T(&)  items[ITEMS_PER_THREAD] 
)
inline

Store items into a linear segment of memory.

  • Assumes a blocked arrangement of (block-threads*items-per-thread) items across the thread block, where threadi owns the ith range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.
  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates the storing of a "blocked" arrangement of 512 integers across 128 threads (where each thread owns 4 consecutive items) into a linear segment of memory. The store is specialized for BLOCK_STORE_WARP_TRANSPOSE, meaning items are locally reordered among threads so that memory references will be efficiently coalesced using a warp-striped access pattern.
#include <cub/cub.cuh> // or equivalently <cub/block/block_store.cuh>
__global__ void ExampleKernel(int *d_data, ...)
{
// Specialize BlockStore for a 1D block of 128 threads owning 4 integer items each
// Allocate shared memory for BlockStore
__shared__ typename BlockStore::TempStorage temp_storage;
// Obtain a segment of consecutive items that are blocked across threads
int thread_data[4];
...
// Store items to linear memory
int thread_data[4];
BlockStore(temp_storage).Store(d_data, thread_data);
Suppose the set of thread_data across the block of threads is { [0,1,2,3], [4,5,6,7], ..., [508,509,510,511] }. The output d_data will be 0, 1, 2, 3, 4, 5, ....
Parameters
[in]block_itrThe thread block's base output iterator for storing to
[in]itemsData to store
template<typename T , int BLOCK_DIM_X, int ITEMS_PER_THREAD, BlockStoreAlgorithm ALGORITHM = BLOCK_STORE_DIRECT, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename OutputIteratorT >
__device__ __forceinline__ void cub::BlockStore< T, BLOCK_DIM_X, ITEMS_PER_THREAD, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::Store ( OutputIteratorT  block_itr,
T(&)  items[ITEMS_PER_THREAD],
int  valid_items 
)
inline

Store items into a linear segment of memory, guarded by range.

  • Assumes a blocked arrangement of (block-threads*items-per-thread) items across the thread block, where threadi owns the ith range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.
  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates the guarded storing of a "blocked" arrangement of 512 integers across 128 threads (where each thread owns 4 consecutive items) into a linear segment of memory. The store is specialized for BLOCK_STORE_WARP_TRANSPOSE, meaning items are locally reordered among threads so that memory references will be efficiently coalesced using a warp-striped access pattern.
#include <cub/cub.cuh> // or equivalently <cub/block/block_store.cuh>
__global__ void ExampleKernel(int *d_data, int valid_items, ...)
{
// Specialize BlockStore for a 1D block of 128 threads owning 4 integer items each
// Allocate shared memory for BlockStore
__shared__ typename BlockStore::TempStorage temp_storage;
// Obtain a segment of consecutive items that are blocked across threads
int thread_data[4];
...
// Store items to linear memory
int thread_data[4];
BlockStore(temp_storage).Store(d_data, thread_data, valid_items);
Suppose the set of thread_data across the block of threads is { [0,1,2,3], [4,5,6,7], ..., [508,509,510,511] } and valid_items is 5. The output d_data will be 0, 1, 2, 3, 4, ?, ?, ?, ..., with only the first two threads being unmasked to store portions of valid data.
Parameters
[in]block_itrThe thread block's base output iterator for storing to
[in]itemsData to store
[in]valid_itemsNumber of valid items to write

The documentation for this class was generated from the following file: