CUB  
WarpLoad< InputT, ITEMS_PER_THREAD, ALGORITHM, LOGICAL_WARP_THREADS, PTX_ARCH > Class Template Reference

Detailed description

template< typename InputT, int ITEMS_PER_THREAD, WarpLoadAlgorithm ALGORITHM = WARP_LOAD_DIRECT, int LOGICAL_WARP_THREADS = CUB_PTX_WARP_THREADS, int PTX_ARCH = CUB_PTX_ARCH>
class WarpLoad< InputT, ITEMS_PER_THREAD, ALGORITHM, LOGICAL_WARP_THREADS, PTX_ARCH >

The WarpLoad class provides collective data movement methods for loading a linear segment of items from memory into a blocked arrangement across a CUDA thread warp.

Template Parameters
InputT: The data type to read into (which must be convertible from the input iterator's value type).
ITEMS_PER_THREAD: The number of consecutive items partitioned onto each thread.
ALGORITHM: [optional] cub::WarpLoadAlgorithm tuning policy. Default: cub::WARP_LOAD_DIRECT.
LOGICAL_WARP_THREADS: [optional] The number of threads per "logical" warp (may be less than the number of hardware warp threads). Default is the warp size of the targeted CUDA compute capability (e.g., 32 threads for SM86). Must be a power of two.
PTX_ARCH: [optional] The PTX compute capability for which to specialize this collective, formatted as per the __CUDA_ARCH__ macro (e.g., 350 for sm_35). Useful for determining the collective's storage requirements for a given device from the host. Default: the value of __CUDA_ARCH__ during the current compiler pass.
Overview
  • The WarpLoad class provides a single data movement abstraction that can be specialized to implement different cub::WarpLoadAlgorithm strategies. This facilitates different performance policies for different architectures, data types, granularity sizes, etc.
  • WarpLoad can be optionally specialized by different data movement strategies:
    1. cub::WARP_LOAD_DIRECT: A blocked arrangement of data is read directly from memory.
    2. cub::WARP_LOAD_STRIPED: A striped arrangement of data is read directly from memory.
    3. cub::WARP_LOAD_VECTORIZE: A blocked arrangement of data is read directly from memory using CUDA's built-in vectorized loads as a coalescing optimization.
    4. cub::WARP_LOAD_TRANSPOSE: A striped arrangement of data is read directly from memory and is then locally transposed into a blocked arrangement.
A Simple Example
The code snippet below illustrates the loading of a linear segment of 64 integers into a "blocked" arrangement across 16 threads where each thread owns 4 consecutive items. The load is specialized for WARP_LOAD_TRANSPOSE, meaning memory references are efficiently coalesced using a warp-striped access pattern (after which items are locally reordered among threads).
#include <cub/cub.cuh> // or equivalently <cub/warp/warp_load.cuh>

__global__ void ExampleKernel(int *d_data, ...)
{
    constexpr int warp_threads = 16;
    constexpr int block_threads = 256;
    constexpr int items_per_thread = 4;

    // Specialize WarpLoad for a warp of 16 threads owning 4 integer items each,
    // using the WARP_LOAD_TRANSPOSE algorithm
    using WarpLoadT = WarpLoad<int,
                               items_per_thread,
                               cub::WARP_LOAD_TRANSPOSE,
                               warp_threads>;

    constexpr int warps_in_block = block_threads / warp_threads;
    constexpr int tile_size = items_per_thread * warp_threads;
    const int warp_id = static_cast<int>(threadIdx.x) / warp_threads;

    // Allocate shared memory for WarpLoad
    __shared__ typename WarpLoadT::TempStorage temp_storage[warps_in_block];

    // Load a segment of consecutive items that are blocked across threads
    int thread_data[items_per_thread];
    WarpLoadT(temp_storage[warp_id]).Load(d_data + warp_id * tile_size,
                                          thread_data);
}
Suppose the input d_data is 0, 1, 2, 3, 4, 5, .... The set of thread_data across the first logical warp of threads will be: { [0,1,2,3], [4,5,6,7], ..., [60,61,62,63] }.

Classes

struct  TempStorage
 The operations exposed by WarpLoad require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse.
 

Public Methods

Collective constructors
__device__ __forceinline__ WarpLoad ()
 Collective constructor using a private static allocation of shared memory as temporary storage.
 
__device__ __forceinline__ WarpLoad (TempStorage &temp_storage)
 Collective constructor using the specified memory allocation as temporary storage.
 
Data movement
template<typename InputIteratorT >
__device__ __forceinline__ void Load (InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD])
 Load a linear segment of items from memory.
 
template<typename InputIteratorT , typename DefaultT >
__device__ __forceinline__ void Load (InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items, DefaultT oob_default)
 Load a linear segment of items from memory, guarded by range.
 

Constructor & Destructor Documentation

template<typename InputT , int ITEMS_PER_THREAD, WarpLoadAlgorithm ALGORITHM = WARP_LOAD_DIRECT, int LOGICAL_WARP_THREADS = CUB_PTX_WARP_THREADS, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ WarpLoad< InputT, ITEMS_PER_THREAD, ALGORITHM, LOGICAL_WARP_THREADS, PTX_ARCH >::WarpLoad ( )
inline

Collective constructor using a private static allocation of shared memory as temporary storage.

template<typename InputT , int ITEMS_PER_THREAD, WarpLoadAlgorithm ALGORITHM = WARP_LOAD_DIRECT, int LOGICAL_WARP_THREADS = CUB_PTX_WARP_THREADS, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ WarpLoad< InputT, ITEMS_PER_THREAD, ALGORITHM, LOGICAL_WARP_THREADS, PTX_ARCH >::WarpLoad ( TempStorage &temp_storage )
inline

Collective constructor using the specified memory allocation as temporary storage.

Member Function Documentation

template<typename InputT , int ITEMS_PER_THREAD, WarpLoadAlgorithm ALGORITHM = WARP_LOAD_DIRECT, int LOGICAL_WARP_THREADS = CUB_PTX_WARP_THREADS, int PTX_ARCH = CUB_PTX_ARCH>
template<typename InputIteratorT >
__device__ __forceinline__ void WarpLoad< InputT, ITEMS_PER_THREAD, ALGORITHM, LOGICAL_WARP_THREADS, PTX_ARCH >::Load ( InputIteratorT block_itr,
InputT (&items)[ITEMS_PER_THREAD] )
inline

Load a linear segment of items from memory.

A subsequent __syncthreads() thread block barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
#include <cub/cub.cuh> // or equivalently <cub/warp/warp_load.cuh>

__global__ void ExampleKernel(int *d_data, ...)
{
    constexpr int warp_threads = 16;
    constexpr int block_threads = 256;
    constexpr int items_per_thread = 4;

    // Specialize WarpLoad for a warp of 16 threads owning 4 integer items each,
    // using the WARP_LOAD_TRANSPOSE algorithm
    using WarpLoadT = WarpLoad<int,
                               items_per_thread,
                               cub::WARP_LOAD_TRANSPOSE,
                               warp_threads>;

    constexpr int warps_in_block = block_threads / warp_threads;
    constexpr int tile_size = items_per_thread * warp_threads;
    const int warp_id = static_cast<int>(threadIdx.x) / warp_threads;

    // Allocate shared memory for WarpLoad
    __shared__ typename WarpLoadT::TempStorage temp_storage[warps_in_block];

    // Load a segment of consecutive items that are blocked across threads
    int thread_data[items_per_thread];
    WarpLoadT(temp_storage[warp_id]).Load(d_data + warp_id * tile_size,
                                          thread_data);
}
Suppose the input d_data is 0, 1, 2, 3, 4, 5, .... The set of thread_data across the first logical warp of threads will be: { [0,1,2,3], [4,5,6,7], ..., [60,61,62,63] }.
Parameters
[in] block_itr: The thread block's base input iterator for loading from
[out] items: Data to load
template<typename InputT , int ITEMS_PER_THREAD, WarpLoadAlgorithm ALGORITHM = WARP_LOAD_DIRECT, int LOGICAL_WARP_THREADS = CUB_PTX_WARP_THREADS, int PTX_ARCH = CUB_PTX_ARCH>
template<typename InputIteratorT , typename DefaultT >
__device__ __forceinline__ void WarpLoad< InputT, ITEMS_PER_THREAD, ALGORITHM, LOGICAL_WARP_THREADS, PTX_ARCH >::Load ( InputIteratorT block_itr,
InputT (&items)[ITEMS_PER_THREAD],
int valid_items,
DefaultT oob_default )
inline

Load a linear segment of items from memory, guarded by range.

A subsequent __syncthreads() thread block barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
#include <cub/cub.cuh> // or equivalently <cub/warp/warp_load.cuh>
__global__ void ExampleKernel(int *d_data, int valid_items, ...)
{
constexpr int warp_threads = 16;
constexpr int block_threads = 256;
constexpr int items_per_thread = 4;
// Specialize WarpLoad for a warp of 16 threads owning 4 integer items each
using WarpLoadT = WarpLoad<int,
items_per_thread,
warp_threads>;
constexpr int warps_in_block = block_threads / warp_threads;
constexpr int tile_size = items_per_thread * warp_threads;
const int warp_id = static_cast<int>(threadIdx.x) / warp_threads;
// Allocate shared memory for WarpLoad
__shared__ typename WarpLoadT::TempStorage temp_storage[warps_in_block];
// Load a segment of consecutive items that are blocked across threads
int thread_data[items_per_thread];
WarpLoadT(temp_storage[warp_id]).Load(d_data + warp_id * tile_size,
thread_data,
valid_items);
@endcod
Suppose the input d_data is 0, 1, 2, 3, 4, 5, ..., valid_items is 5, and the out-of-bounds default is -1. The set of thread_data across the first logical warp of threads will be: { [0,1,2,3], [4,-1,-1,-1], ..., [-1,-1,-1,-1] }, with only the first two threads being unmasked to load valid data (all other items are assigned -1).
Parameters
[in] block_itr: The thread block's base input iterator for loading from
[out] items: Data to load
[in] valid_items: Number of valid items to load
[in] oob_default: Default value to assign out-of-bound items

The documentation for this class was generated from the following file: