CUB
The WarpLoad class provides collective data movement methods for loading a linear segment of items from memory into a blocked arrangement across a CUDA warp.
InputT | The data type to read into (which must be convertible from the input iterator's value type). |
ITEMS_PER_THREAD | The number of consecutive items partitioned onto each thread. |
ALGORITHM | [optional] cub::WarpLoadAlgorithm tuning policy. Default: cub::WARP_LOAD_DIRECT. |
LOGICAL_WARP_THREADS | [optional] The number of threads per "logical" warp (may be less than the number of hardware warp threads). Default is the warp size of the targeted CUDA compute-capability (e.g., 32 threads for SM86). Must be a power of two. |
PTX_ARCH | [optional] The PTX compute capability for which to specialize this collective, formatted as per the __CUDA_ARCH__ macro (e.g., 350 for sm_35). Useful for determining the collective's storage requirements for a given device from the host. Default: the value of __CUDA_ARCH__ during the current compiler pass. |
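As a minimal sketch of how the template parameters above fit together, the following specializes WarpLoad for 16-thread logical warps that each own 4 consecutive int items per thread. The constants, the kernel name `WarpLoadKernel`, the output buffer `d_out`, and the host helper `RunWarpLoadExample` are illustrative assumptions, not part of CUB. Error handling is kept to a single check for brevity.

```cuda
#include <cub/warp/warp_load.cuh>
#include <cuda_runtime.h>

#include <cassert>
#include <numeric>
#include <vector>

// Illustrative sizes (assumptions, not library defaults).
constexpr int kWarpThreads    = 16;  // threads per logical warp (power of two)
constexpr int kBlockThreads   = 64;  // four logical warps per thread block
constexpr int kItemsPerThread = 4;
constexpr int kWarpsPerBlock  = kBlockThreads / kWarpThreads;
constexpr int kTileSize       = kWarpThreads * kItemsPerThread;  // items per warp

// Specialize WarpLoad for logical warps of 16 threads owning 4 ints each.
using WarpLoadT =
  cub::WarpLoad<int, kItemsPerThread, cub::WARP_LOAD_TRANSPOSE, kWarpThreads>;

__global__ void WarpLoadKernel(const int* d_data, int* d_out)
{
  // One TempStorage instance per logical warp in the block.
  __shared__ WarpLoadT::TempStorage temp_storage[kWarpsPerBlock];

  const int warp_id = static_cast<int>(threadIdx.x) / kWarpThreads;

  // Each logical warp loads its own tile into a blocked arrangement.
  int thread_data[kItemsPerThread];
  WarpLoadT(temp_storage[warp_id]).Load(d_data + warp_id * kTileSize, thread_data);

  // Write every thread's items back in blocked order so the host can inspect them.
  for (int i = 0; i < kItemsPerThread; ++i)
  {
    d_out[threadIdx.x * kItemsPerThread + i] = thread_data[i];
  }
}

// Hypothetical host helper: fills the input with 0, 1, 2, ... and returns
// the per-thread items as observed on the device.
std::vector<int> RunWarpLoadExample()
{
  const int n = kBlockThreads * kItemsPerThread;
  std::vector<int> h_data(n);
  std::iota(h_data.begin(), h_data.end(), 0);

  int* d_data = nullptr;
  int* d_out  = nullptr;
  cudaMalloc(&d_data, n * sizeof(int));
  cudaMalloc(&d_out, n * sizeof(int));
  cudaMemcpy(d_data, h_data.data(), n * sizeof(int), cudaMemcpyHostToDevice);

  WarpLoadKernel<<<1, kBlockThreads>>>(d_data, d_out);

  std::vector<int> h_out(n);
  const cudaError_t err =
    cudaMemcpy(h_out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
  assert(err == cudaSuccess);
  (void) err;

  cudaFree(d_data);
  cudaFree(d_out);
  return h_out;
}
```

In this sketch, thread 0 of the first logical warp ends up holding items 0 to 3, thread 1 holds items 4 to 7, and so on, so writing the items back in blocked order reproduces the input sequence.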
With WARP_LOAD_TRANSPOSE, memory references are efficiently coalesced using a warp-striped access pattern (after which items are locally reordered among threads). Suppose the input d_data is 0, 1, 2, 3, 4, 5, ... With 16 threads per logical warp and 4 items per thread, the set of thread_data across the first logical warp of threads will be { [0,1,2,3], [4,5,6,7], ..., [60,61,62,63] }.
Classes | |
struct | TempStorage |
The operations exposed by WarpLoad require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse. | |
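The memory-reuse option described above can be sketched as follows: a union lets one shared-memory allocation serve first as WarpLoad temporary storage and then as cub::WarpStore temporary storage. The union name `WarpTempStorage`, the kernel `LoadIncrementStoreKernel`, the host helper `RunUnionExample`, and all sizes are hypothetical names chosen for illustration. The barrier between the two phases follows the guidance on this page for repurposed storage.

```cuda
#include <cub/warp/warp_load.cuh>
#include <cub/warp/warp_store.cuh>
#include <cuda_runtime.h>

#include <cassert>
#include <numeric>
#include <vector>

// Illustrative sizes (assumptions, not library defaults).
constexpr int kWarpThreads    = 16;
constexpr int kBlockThreads   = 64;
constexpr int kItemsPerThread = 4;
constexpr int kWarpsPerBlock  = kBlockThreads / kWarpThreads;
constexpr int kTileSize       = kWarpThreads * kItemsPerThread;

using WarpLoadT =
  cub::WarpLoad<int, kItemsPerThread, cub::WARP_LOAD_TRANSPOSE, kWarpThreads>;
using WarpStoreT =
  cub::WarpStore<int, kItemsPerThread, cub::WARP_STORE_TRANSPOSE, kWarpThreads>;

// The load and store phases never overlap, so they can share one allocation.
union WarpTempStorage
{
  WarpLoadT::TempStorage load;
  WarpStoreT::TempStorage store;
};

__global__ void LoadIncrementStoreKernel(const int* d_in, int* d_out)
{
  __shared__ WarpTempStorage temp_storage[kWarpsPerBlock];

  const int warp_id = static_cast<int>(threadIdx.x) / kWarpThreads;
  const int offset  = warp_id * kTileSize;

  int thread_data[kItemsPerThread];
  WarpLoadT(temp_storage[warp_id].load).Load(d_in + offset, thread_data);

  // Barrier before the same storage is repurposed for the store.
  __syncthreads();

  for (int i = 0; i < kItemsPerThread; ++i)
  {
    thread_data[i] += 1;
  }

  WarpStoreT(temp_storage[warp_id].store).Store(d_out + offset, thread_data);
}

// Hypothetical host helper: returns the stored output for input 0, 1, 2, ...
std::vector<int> RunUnionExample()
{
  const int n = kBlockThreads * kItemsPerThread;
  std::vector<int> h_in(n);
  std::iota(h_in.begin(), h_in.end(), 0);

  int* d_in  = nullptr;
  int* d_out = nullptr;
  cudaMalloc(&d_in, n * sizeof(int));
  cudaMalloc(&d_out, n * sizeof(int));
  cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

  LoadIncrementStoreKernel<<<1, kBlockThreads>>>(d_in, d_out);

  std::vector<int> h_out(n);
  const cudaError_t err =
    cudaMemcpy(h_out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
  assert(err == cudaSuccess);
  (void) err;

  cudaFree(d_in);
  cudaFree(d_out);
  return h_out;
}
```

The shared-memory footprint is then the larger of the two TempStorage types rather than their sum, at the cost of an explicit barrier between the phases.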
Public Methods | |
Collective constructors | |
__device__ __forceinline__ | WarpLoad () |
Collective constructor using a private static allocation of shared memory as temporary storage. | |
__device__ __forceinline__ | WarpLoad (TempStorage &temp_storage) |
Collective constructor using the specified memory allocation as temporary storage. | |
Data movement | |
template<typename InputIteratorT > | |
__device__ __forceinline__ void | Load (InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD]) |
Load a linear segment of items from memory. | |
template<typename InputIteratorT , typename DefaultT > | |
__device__ __forceinline__ void | Load (InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items, DefaultT oob_default) |
Load a linear segment of items from memory, guarded by range. | |
__device__ __forceinline__ WarpLoad () | inline |
Collective constructor using a private static allocation of shared memory as temporary storage.
__device__ __forceinline__ WarpLoad (TempStorage &temp_storage) | inline |
Collective constructor using the specified memory allocation as temporary storage.
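template<typename InputIteratorT >
__device__ __forceinline__ void Load (InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD]) | inline |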
Load a linear segment of items from memory.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose the input d_data is 0, 1, 2, 3, 4, 5, ... The set of thread_data across the first logical warp of threads will be { [0,1,2,3], [4,5,6,7], ..., [60,61,62,63] }.
[in] | block_itr | The logical warp's base input iterator to load from |
[out] | items | Data to load |
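For comparison, here is a minimal sketch of the unguarded Load with the optional template parameters left at their defaults, which this page gives as cub::WARP_LOAD_DIRECT and a logical warp equal to the hardware warp. It assumes a 32-thread hardware warp. The kernel name `DirectLoadSumKernel`, the buffer `d_sums`, and the host helper `RunDirectLoadExample` are hypothetical.

```cuda
#include <cub/warp/warp_load.cuh>
#include <cuda_runtime.h>

#include <cassert>
#include <numeric>
#include <vector>

constexpr int kItemsPerThread = 4;
constexpr int kHwWarpThreads  = 32;  // assumed hardware warp size

// Only InputT and ITEMS_PER_THREAD are given; the rest use their defaults.
using WarpLoadT = cub::WarpLoad<int, kItemsPerThread>;

__global__ void DirectLoadSumKernel(const int* d_data, int* d_sums)
{
  // A single warp is launched, so one TempStorage instance is enough.
  __shared__ WarpLoadT::TempStorage temp_storage;

  int thread_data[kItemsPerThread];
  WarpLoadT(temp_storage).Load(d_data, thread_data);

  // Reduce each thread's blocked items to one value for easy inspection.
  int sum = 0;
  for (int i = 0; i < kItemsPerThread; ++i)
  {
    sum += thread_data[i];
  }
  d_sums[threadIdx.x] = sum;
}

// Hypothetical host helper: returns one sum per thread for input 0, 1, 2, ...
std::vector<int> RunDirectLoadExample()
{
  const int n = kHwWarpThreads * kItemsPerThread;
  std::vector<int> h_data(n);
  std::iota(h_data.begin(), h_data.end(), 0);

  int* d_data = nullptr;
  int* d_sums = nullptr;
  cudaMalloc(&d_data, n * sizeof(int));
  cudaMalloc(&d_sums, kHwWarpThreads * sizeof(int));
  cudaMemcpy(d_data, h_data.data(), n * sizeof(int), cudaMemcpyHostToDevice);

  DirectLoadSumKernel<<<1, kHwWarpThreads>>>(d_data, d_sums);

  std::vector<int> h_sums(kHwWarpThreads);
  const cudaError_t err = cudaMemcpy(
    h_sums.data(), d_sums, kHwWarpThreads * sizeof(int), cudaMemcpyDeviceToHost);
  assert(err == cudaSuccess);
  (void) err;

  cudaFree(d_data);
  cudaFree(d_sums);
  return h_sums;
}
```

Because thread t owns items 4t to 4t+3 in a blocked arrangement, its sum in this sketch is 16t + 6.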
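template<typename InputIteratorT , typename DefaultT >
__device__ __forceinline__ void Load (InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items, DefaultT oob_default) | inline |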
Load a linear segment of items from memory, guarded by range.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose the input d_data is 0, 1, 2, 3, 4, 5, ..., valid_items is 5, and the out-of-bounds default is -1. The set of thread_data across the first logical warp of threads will be { [0,1,2,3], [4,-1,-1,-1], ..., [-1,-1,-1,-1] }, with only the first two threads being unmasked to load portions of valid data (and other items are assigned -1).
[in] | block_itr | The logical warp's base input iterator to load from |
[out] | items | Data to load |
[in] | valid_items | Number of valid items to load |
[in] | oob_default | Default value to assign out-of-bound items |
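The guarded overload can be sketched in the same way. Here each 16-thread logical warp loads only the first 5 items of its tile and fills the remaining slots with -1. The kernel name `GuardedLoadKernel`, the host helper `RunGuardedLoadExample`, and the sizes are illustrative assumptions.

```cuda
#include <cub/warp/warp_load.cuh>
#include <cuda_runtime.h>

#include <cassert>
#include <numeric>
#include <vector>

// Illustrative sizes (assumptions, not library defaults).
constexpr int kWarpThreads    = 16;
constexpr int kBlockThreads   = 32;  // two logical warps
constexpr int kItemsPerThread = 4;
constexpr int kWarpsPerBlock  = kBlockThreads / kWarpThreads;
constexpr int kTileSize       = kWarpThreads * kItemsPerThread;

using WarpLoadT =
  cub::WarpLoad<int, kItemsPerThread, cub::WARP_LOAD_TRANSPOSE, kWarpThreads>;

__global__ void GuardedLoadKernel(const int* d_data, int valid_items, int* d_out)
{
  __shared__ WarpLoadT::TempStorage temp_storage[kWarpsPerBlock];

  const int warp_id = static_cast<int>(threadIdx.x) / kWarpThreads;

  // Items at positions >= valid_items within the tile are not read from
  // memory; they are assigned the out-of-bounds default (-1) instead.
  int thread_data[kItemsPerThread];
  WarpLoadT(temp_storage[warp_id])
    .Load(d_data + warp_id * kTileSize, thread_data, valid_items, -1);

  for (int i = 0; i < kItemsPerThread; ++i)
  {
    d_out[threadIdx.x * kItemsPerThread + i] = thread_data[i];
  }
}

// Hypothetical host helper: returns every thread's items for input 0, 1, 2, ...
std::vector<int> RunGuardedLoadExample(int valid_items)
{
  const int n = kBlockThreads * kItemsPerThread;
  std::vector<int> h_data(n);
  std::iota(h_data.begin(), h_data.end(), 0);

  int* d_data = nullptr;
  int* d_out  = nullptr;
  cudaMalloc(&d_data, n * sizeof(int));
  cudaMalloc(&d_out, n * sizeof(int));
  cudaMemcpy(d_data, h_data.data(), n * sizeof(int), cudaMemcpyHostToDevice);

  GuardedLoadKernel<<<1, kBlockThreads>>>(d_data, valid_items, d_out);

  std::vector<int> h_out(n);
  const cudaError_t err =
    cudaMemcpy(h_out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
  assert(err == cudaSuccess);
  (void) err;

  cudaFree(d_data);
  cudaFree(d_out);
  return h_out;
}
```

Calling `RunGuardedLoadExample(5)` should reproduce the arrangement described above for the first logical warp; the second logical warp in this sketch starts at item 64 and is guarded in the same way.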