CUB  
 All Classes Namespaces Files Functions Variables Typedefs Enumerations Enumerator Friends Macros Groups Pages
Classes | List of all members
cub::BlockLoad< InputT, BLOCK_DIM_X, ITEMS_PER_THREAD, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH > Class Template Reference

Detailed description

template< typename InputT, int BLOCK_DIM_X, int ITEMS_PER_THREAD, BlockLoadAlgorithm ALGORITHM = BLOCK_LOAD_DIRECT, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
class cub::BlockLoad< InputT, BLOCK_DIM_X, ITEMS_PER_THREAD, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >

The BlockLoad class provides collective data movement methods for loading a linear segment of items from memory into a blocked arrangement across a CUDA thread block.

block_load_logo.png
.
Template Parameters
InputTThe data type to read into (which must be convertible from the input iterator's value type).
BLOCK_DIM_XThe thread block length in threads along the X dimension
ITEMS_PER_THREADThe number of consecutive items partitioned onto each thread.
ALGORITHM[optional] cub::BlockLoadAlgorithm tuning policy. default: cub::BLOCK_LOAD_DIRECT.
WARP_TIME_SLICING[optional] Whether or not only one warp's worth of shared memory should be allocated and time-sliced among block-warps during any load-related data transpositions (versus each warp having its own storage). (default: false)
BLOCK_DIM_Y[optional] The thread block length in threads along the Y dimension (default: 1)
BLOCK_DIM_Z[optional] The thread block length in threads along the Z dimension (default: 1)
PTX_ARCH[optional] The PTX compute capability for which to to specialize this collective, formatted as per the CUDA_ARCH macro (e.g., 350 for sm_35). Useful for determining the collective's storage requirements for a given device from the host. (Default: the value of CUDA_ARCH during the current compiler pass)
Overview
A Simple Example
Every thread in the block uses the BlockLoad class by first specializing the BlockLoad type, then instantiating an instance with parameters for communication, and finally invoking one or more collective member functions.
The code snippet below illustrates the loading of a linear segment of 512 integers into a "blocked" arrangement across 128 threads where each thread owns 4 consecutive items. The load is specialized for BLOCK_LOAD_WARP_TRANSPOSE, meaning memory references are efficiently coalesced using a warp-striped access pattern (after which items are locally reordered among threads).
#include <cub/cub.cuh> // or equivalently <cub/block/block_load.cuh>
__global__ void ExampleKernel(int *d_data, ...)
{
// Specialize BlockLoad for a 1D block of 128 threads owning 4 integer items each
// Allocate shared memory for BlockLoad
__shared__ typename BlockLoad::TempStorage temp_storage;
// Load a segment of consecutive items that are blocked across threads
int thread_data[4];
BlockLoad(temp_storage).Load(d_data, thread_data);
Suppose the input d_data is 0, 1, 2, 3, 4, 5, .... The set of thread_data across the block of threads in those threads will be { [0,1,2,3], [4,5,6,7], ..., [508,509,510,511] }.
Re-using dynamically allocating shared memory
The following example under the examples/block folder illustrates usage of dynamically shared memory with BlockReduce and how to re-purpose the same memory region: example_block_reduce_dyn_smem.cu

This example can be easily adapted to the storage required by BlockLoad.

Examples:
example_block_radix_sort.cu, and example_block_scan.cu.

Classes

struct  TempStorage
 The operations exposed by BlockLoad require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse. More...
 

Public Methods

Collective constructors
__device__ __forceinline__ BlockLoad ()
 Collective constructor using a private static allocation of shared memory as temporary storage. More...
 
__device__ __forceinline__ BlockLoad (TempStorage &temp_storage)
 Collective constructor using the specified memory allocation as temporary storage. More...
 
Data movement
template<typename InputIteratorT >
__device__ __forceinline__ void Load (InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD])
 Load a linear segment of items from memory. More...
 
template<typename InputIteratorT >
__device__ __forceinline__ void Load (InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items)
 Load a linear segment of items from memory, guarded by range. More...
 
template<typename InputIteratorT , typename DefaultT >
__device__ __forceinline__ void Load (InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items, DefaultT oob_default)
 Load a linear segment of items from memory, guarded by range, with a fall-back assignment of out-of-bound elements. More...
 

Constructor & Destructor Documentation

template<typename InputT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, BlockLoadAlgorithm ALGORITHM = BLOCK_LOAD_DIRECT, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ cub::BlockLoad< InputT, BLOCK_DIM_X, ITEMS_PER_THREAD, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::BlockLoad ( )
inline

Collective constructor using a private static allocation of shared memory as temporary storage.

template<typename InputT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, BlockLoadAlgorithm ALGORITHM = BLOCK_LOAD_DIRECT, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ cub::BlockLoad< InputT, BLOCK_DIM_X, ITEMS_PER_THREAD, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::BlockLoad ( TempStorage temp_storage)
inline

Collective constructor using the specified memory allocation as temporary storage.

Parameters
[in]temp_storageReference to memory allocation having layout type TempStorage

Member Function Documentation

template<typename InputT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, BlockLoadAlgorithm ALGORITHM = BLOCK_LOAD_DIRECT, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename InputIteratorT >
__device__ __forceinline__ void cub::BlockLoad< InputT, BLOCK_DIM_X, ITEMS_PER_THREAD, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::Load ( InputIteratorT  block_itr,
InputT(&)  items[ITEMS_PER_THREAD] 
)
inline

Load a linear segment of items from memory.

  • Assumes a blocked arrangement of (block-threads*items-per-thread) items across the thread block, where threadi owns the ith range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.
  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates the loading of a linear segment of 512 integers into a "blocked" arrangement across 128 threads where each thread owns 4 consecutive items. The load is specialized for BLOCK_LOAD_WARP_TRANSPOSE, meaning memory references are efficiently coalesced using a warp-striped access pattern (after which items are locally reordered among threads).
#include <cub/cub.cuh> // or equivalently <cub/block/block_load.cuh>
__global__ void ExampleKernel(int *d_data, ...)
{
// Specialize BlockLoad for a 1D block of 128 threads owning 4 integer items each
// Allocate shared memory for BlockLoad
__shared__ typename BlockLoad::TempStorage temp_storage;
// Load a segment of consecutive items that are blocked across threads
int thread_data[4];
BlockLoad(temp_storage).Load(d_data, thread_data);
Suppose the input d_data is 0, 1, 2, 3, 4, 5, .... The set of thread_data across the block of threads in those threads will be { [0,1,2,3], [4,5,6,7], ..., [508,509,510,511] }.
Parameters
[in]block_itrThe thread block's base input iterator for loading from
[out]itemsData to load
template<typename InputT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, BlockLoadAlgorithm ALGORITHM = BLOCK_LOAD_DIRECT, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename InputIteratorT >
__device__ __forceinline__ void cub::BlockLoad< InputT, BLOCK_DIM_X, ITEMS_PER_THREAD, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::Load ( InputIteratorT  block_itr,
InputT(&)  items[ITEMS_PER_THREAD],
int  valid_items 
)
inline

Load a linear segment of items from memory, guarded by range.

  • Assumes a blocked arrangement of (block-threads*items-per-thread) items across the thread block, where threadi owns the ith range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.
  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates the guarded loading of a linear segment of 512 integers into a "blocked" arrangement across 128 threads where each thread owns 4 consecutive items. The load is specialized for BLOCK_LOAD_WARP_TRANSPOSE, meaning memory references are efficiently coalesced using a warp-striped access pattern (after which items are locally reordered among threads).
#include <cub/cub.cuh> // or equivalently <cub/block/block_load.cuh>
__global__ void ExampleKernel(int *d_data, int valid_items, ...)
{
// Specialize BlockLoad for a 1D block of 128 threads owning 4 integer items each
// Allocate shared memory for BlockLoad
__shared__ typename BlockLoad::TempStorage temp_storage;
// Load a segment of consecutive items that are blocked across threads
int thread_data[4];
BlockLoad(temp_storage).Load(d_data, thread_data, valid_items);
Suppose the input d_data is 0, 1, 2, 3, 4, 5, 6... and valid_items is 5. The set of thread_data across the block of threads in those threads will be { [0,1,2,3], [4,?,?,?], ..., [?,?,?,?] }, with only the first two threads being unmasked to load portions of valid data (and other items remaining unassigned).
Parameters
[in]block_itrThe thread block's base input iterator for loading from
[out]itemsData to load
[in]valid_itemsNumber of valid items to load
template<typename InputT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, BlockLoadAlgorithm ALGORITHM = BLOCK_LOAD_DIRECT, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename InputIteratorT , typename DefaultT >
__device__ __forceinline__ void cub::BlockLoad< InputT, BLOCK_DIM_X, ITEMS_PER_THREAD, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::Load ( InputIteratorT  block_itr,
InputT(&)  items[ITEMS_PER_THREAD],
int  valid_items,
DefaultT  oob_default 
)
inline

Load a linear segment of items from memory, guarded by range, with a fall-back assignment of out-of-bound elements.

  • Assumes a blocked arrangement of (block-threads*items-per-thread) items across the thread block, where threadi owns the ith range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.
  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates the guarded loading of a linear segment of 512 integers into a "blocked" arrangement across 128 threads where each thread owns 4 consecutive items. The load is specialized for BLOCK_LOAD_WARP_TRANSPOSE, meaning memory references are efficiently coalesced using a warp-striped access pattern (after which items are locally reordered among threads).
#include <cub/cub.cuh> // or equivalently <cub/block/block_load.cuh>
__global__ void ExampleKernel(int *d_data, int valid_items, ...)
{
// Specialize BlockLoad for a 1D block of 128 threads owning 4 integer items each
// Allocate shared memory for BlockLoad
__shared__ typename BlockLoad::TempStorage temp_storage;
// Load a segment of consecutive items that are blocked across threads
int thread_data[4];
BlockLoad(temp_storage).Load(d_data, thread_data, valid_items, -1);
Suppose the input d_data is 0, 1, 2, 3, 4, 5, 6..., valid_items is 5, and the out-of-bounds default is -1. The set of thread_data across the block of threads in those threads will be { [0,1,2,3], [4,-1,-1,-1], ..., [-1,-1,-1,-1] }, with only the first two threads being unmasked to load portions of valid data (and other items are assigned -1)
Parameters
[in]block_itrThe thread block's base input iterator for loading from
[out]itemsData to load
[in]valid_itemsNumber of valid items to load
[in]oob_defaultDefault value to assign out-of-bound items

The documentation for this class was generated from the following file: