template<
typename T,
int BLOCK_DIM_X,
int ITEMS_PER_THREAD,
BlockStoreAlgorithm ALGORITHM = BLOCK_STORE_DIRECT,
int BLOCK_DIM_Y = 1,
int BLOCK_DIM_Z = 1,
int PTX_ARCH = CUB_PTX_ARCH>
class cub::BlockStore< T, BLOCK_DIM_X, ITEMS_PER_THREAD, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >
The BlockStore class provides collective data movement methods for writing a blocked arrangement of items partitioned across a CUDA thread block to a linear segment of memory.
.
- Template Parameters
-
T | The type of data to be written. |
BLOCK_DIM_X | The thread block length in threads along the X dimension |
ITEMS_PER_THREAD | The number of consecutive items partitioned onto each thread. |
ALGORITHM | [optional] cub::BlockStoreAlgorithm tuning policy enumeration. default: cub::BLOCK_STORE_DIRECT. |
BLOCK_DIM_Y | [optional] The thread block length in threads along the Y dimension (default: 1) |
BLOCK_DIM_Z | [optional] The thread block length in threads along the Z dimension (default: 1) |
PTX_ARCH | [optional] The PTX compute capability for which to to specialize this collective, formatted as per the CUDA_ARCH macro (e.g., 350 for sm_35). Useful for determining the collective's storage requirements for a given device from the host. (Default: the value of CUDA_ARCH during the current compiler pass) |
- Overview
-
- A Simple Example
- Every thread in the block uses the BlockStore class by first specializing the BlockStore type, then instantiating an instance with parameters for communication, and finally invoking one or more collective member functions.
- The code snippet below illustrates the storing of a "blocked" arrangement of 512 integers across 128 threads (where each thread owns 4 consecutive items) into a linear segment of memory. The store is specialized for
BLOCK_STORE_WARP_TRANSPOSE
, meaning items are locally reordered among threads so that memory references will be efficiently coalesced using a warp-striped access pattern.
__global__ void ExampleKernel(int *d_data, ...)
{
__shared__ typename BlockStore::TempStorage temp_storage;
int thread_data[4];
...
BlockStore(temp_storage).Store(d_data, thread_data);
- Suppose the set of
thread_data
across the block of threads is { [0,1,2,3], [4,5,6,7], ..., [508,509,510,511] }
. The output d_data
will be 0, 1, 2, 3, 4, 5, ...
.
- Re-using dynamically allocating shared memory
- The following example under the examples/block folder illustrates usage of dynamically shared memory with BlockReduce and how to re-purpose the same memory region: example_block_reduce_dyn_smem.cu
This example can be easily adapted to the storage required by BlockStore.
- Examples:
- example_block_scan.cu.
|
struct | TempStorage |
| The operations exposed by BlockStore require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union 'd with other storage allocation types to facilitate memory reuse. More...
|
|
|
|
__device__ __forceinline__ | BlockStore () |
| Collective constructor using a private static allocation of shared memory as temporary storage. More...
|
|
__device__ __forceinline__ | BlockStore (TempStorage &temp_storage) |
| Collective constructor using the specified memory allocation as temporary storage. More...
|
|
|
template<typename OutputIteratorT > |
__device__ __forceinline__ void | Store (OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD]) |
| Store items into a linear segment of memory. More...
|
|
template<typename OutputIteratorT > |
__device__ __forceinline__ void | Store (OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD], int valid_items) |
| Store items into a linear segment of memory, guarded by range. More...
|
|
template<typename T , int BLOCK_DIM_X, int ITEMS_PER_THREAD, BlockStoreAlgorithm ALGORITHM = BLOCK_STORE_DIRECT, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ cub::BlockStore< T, BLOCK_DIM_X, ITEMS_PER_THREAD, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::BlockStore |
( |
) | |
|
|
inline |
Collective constructor using a private static allocation of shared memory as temporary storage.
template<typename T , int BLOCK_DIM_X, int ITEMS_PER_THREAD, BlockStoreAlgorithm ALGORITHM = BLOCK_STORE_DIRECT, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
Collective constructor using the specified memory allocation as temporary storage.
- Parameters
-
[in] | temp_storage | Reference to memory allocation having layout type TempStorage |
template<typename T , int BLOCK_DIM_X, int ITEMS_PER_THREAD, BlockStoreAlgorithm ALGORITHM = BLOCK_STORE_DIRECT, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename OutputIteratorT >
__device__ __forceinline__ void cub::BlockStore< T, BLOCK_DIM_X, ITEMS_PER_THREAD, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::Store |
( |
OutputIteratorT |
block_itr, |
|
|
T(&) |
items[ITEMS_PER_THREAD] |
|
) |
| |
|
inline |
Store items into a linear segment of memory.
- Assumes a blocked arrangement of (block-threads*items-per-thread) items across the thread block, where threadi owns the ith range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.
- A subsequent
__syncthreads()
threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage
) is to be reused or repurposed.
- Snippet
- The code snippet below illustrates the storing of a "blocked" arrangement of 512 integers across 128 threads (where each thread owns 4 consecutive items) into a linear segment of memory. The store is specialized for
BLOCK_STORE_WARP_TRANSPOSE
, meaning items are locally reordered among threads so that memory references will be efficiently coalesced using a warp-striped access pattern.
__global__ void ExampleKernel(int *d_data, ...)
{
__shared__ typename BlockStore::TempStorage temp_storage;
int thread_data[4];
...
int thread_data[4];
BlockStore(temp_storage).Store(d_data, thread_data);
- Suppose the set of
thread_data
across the block of threads is { [0,1,2,3], [4,5,6,7], ..., [508,509,510,511] }
. The output d_data
will be 0, 1, 2, 3, 4, 5, ...
.
- Parameters
-
[out] | block_itr | The thread block's base output iterator for storing to |
[in] | items | Data to store |
template<typename T , int BLOCK_DIM_X, int ITEMS_PER_THREAD, BlockStoreAlgorithm ALGORITHM = BLOCK_STORE_DIRECT, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename OutputIteratorT >
__device__ __forceinline__ void cub::BlockStore< T, BLOCK_DIM_X, ITEMS_PER_THREAD, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::Store |
( |
OutputIteratorT |
block_itr, |
|
|
T(&) |
items[ITEMS_PER_THREAD], |
|
|
int |
valid_items |
|
) |
| |
|
inline |
Store items into a linear segment of memory, guarded by range.
- Assumes a blocked arrangement of (block-threads*items-per-thread) items across the thread block, where threadi owns the ith range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.
- A subsequent
__syncthreads()
threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage
) is to be reused or repurposed.
- Snippet
- The code snippet below illustrates the guarded storing of a "blocked" arrangement of 512 integers across 128 threads (where each thread owns 4 consecutive items) into a linear segment of memory. The store is specialized for
BLOCK_STORE_WARP_TRANSPOSE
, meaning items are locally reordered among threads so that memory references will be efficiently coalesced using a warp-striped access pattern.
__global__ void ExampleKernel(int *d_data, int valid_items, ...)
{
__shared__ typename BlockStore::TempStorage temp_storage;
int thread_data[4];
...
int thread_data[4];
BlockStore(temp_storage).Store(d_data, thread_data, valid_items);
- Suppose the set of
thread_data
across the block of threads is { [0,1,2,3], [4,5,6,7], ..., [508,509,510,511] }
and valid_items
is 5
. The output d_data
will be 0, 1, 2, 3, 4, ?, ?, ?, ...
, with only the first two threads being unmasked to store portions of valid data.
- Parameters
-
[out] | block_itr | The thread block's base output iterator for storing to |
[in] | items | Data to store |
[in] | valid_items | Number of valid items to write |
The documentation for this class was generated from the following file: