CUB  
WarpStore< T, ITEMS_PER_THREAD, ALGORITHM, LOGICAL_WARP_THREADS, PTX_ARCH > Class Template Reference

Detailed description

template< typename T, int ITEMS_PER_THREAD, WarpStoreAlgorithm ALGORITHM = WARP_STORE_DIRECT, int LOGICAL_WARP_THREADS = CUB_PTX_WARP_THREADS, int PTX_ARCH = CUB_PTX_ARCH>
class WarpStore< T, ITEMS_PER_THREAD, ALGORITHM, LOGICAL_WARP_THREADS, PTX_ARCH >

The WarpStore class provides collective data movement methods for writing a blocked arrangement of items partitioned across a CUDA warp to a linear segment of memory.

Template Parameters
T — The type of data to be written.
ITEMS_PER_THREAD — The number of consecutive items partitioned onto each thread.
ALGORITHM — [optional] cub::WarpStoreAlgorithm tuning policy enumeration. Default: cub::WARP_STORE_DIRECT.
LOGICAL_WARP_THREADS — [optional] The number of threads per "logical" warp (may be less than the number of hardware warp threads). Default is the warp size of the targeted CUDA compute capability (e.g., 32 threads for SM86). Must be a power of two.
PTX_ARCH — [optional] The PTX compute capability for which to specialize this collective, formatted as per the __CUDA_ARCH__ macro (e.g., 350 for sm_35). Useful for determining the collective's storage requirements for a given device from the host. Default: the value of __CUDA_ARCH__ during the current compiler pass.
Overview
  • The WarpStore class provides a single data movement abstraction that can be specialized to implement different cub::WarpStoreAlgorithm strategies. This facilitates different performance policies for different architectures, data types, granularity sizes, etc.
  • WarpStore can be optionally specialized by different data movement strategies:
    1. cub::WARP_STORE_DIRECT. A blocked arrangement of data is written directly to memory. More...
    2. cub::WARP_STORE_STRIPED. A striped arrangement of data is written directly to memory. More...
    3. cub::WARP_STORE_VECTORIZE. A blocked arrangement of data is written directly to memory using CUDA's built-in vectorized stores as a coalescing optimization. More...
    4. cub::WARP_STORE_TRANSPOSE. A blocked arrangement is locally transposed into a striped arrangement which is then written to memory. More...
  • For multi-dimensional blocks, threads are linearly ranked in row-major order.
A Simple Example
The code snippet below illustrates the storing of a "blocked" arrangement of 64 integers across 16 threads (where each thread owns 4 consecutive items) into a linear segment of memory. The store is specialized for WARP_STORE_TRANSPOSE, meaning items are locally reordered among threads so that memory references will be efficiently coalesced using a warp-striped access pattern.
#include <cub/cub.cuh> // or equivalently <cub/warp/warp_store.cuh>

__global__ void ExampleKernel(int *d_data, ...)
{
    constexpr int warp_threads = 16;
    constexpr int block_threads = 256;
    constexpr int items_per_thread = 4;

    // Specialize WarpStore for a virtual warp of 16 threads owning 4 integer items each
    using WarpStoreT = WarpStore<int,
                                 items_per_thread,
                                 cub::WARP_STORE_TRANSPOSE,
                                 warp_threads>;

    constexpr int warps_in_block = block_threads / warp_threads;
    constexpr int tile_size = items_per_thread * warp_threads;
    const int warp_id = static_cast<int>(threadIdx.x) / warp_threads;

    // Allocate shared memory for WarpStore
    __shared__ typename WarpStoreT::TempStorage temp_storage[warps_in_block];

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // Store items to linear memory
    WarpStoreT(temp_storage[warp_id]).Store(d_data + warp_id * tile_size, thread_data);
}
Suppose the set of thread_data across the warp threads is { [0,1,2,3], [4,5,6,7], ..., [60,61,62,63] }. The output d_data will be 0, 1, 2, 3, 4, 5, ....

Classes

struct  TempStorage
 

Public Methods

Collective constructors
__device__ __forceinline__ WarpStore ()
 Collective constructor using a private static allocation of shared memory as temporary storage. More...
 
__device__ __forceinline__ WarpStore (TempStorage &temp_storage)
 Collective constructor using the specified memory allocation as temporary storage. More...
 
Data movement
template<typename OutputIteratorT >
__device__ __forceinline__ void Store (OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD])
 Store items into a linear segment of memory. More...
 
template<typename OutputIteratorT >
__device__ __forceinline__ void Store (OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD], int valid_items)
 Store items into a linear segment of memory, guarded by range. More...
 

Constructor & Destructor Documentation

template<typename T , int ITEMS_PER_THREAD, WarpStoreAlgorithm ALGORITHM = WARP_STORE_DIRECT, int LOGICAL_WARP_THREADS = CUB_PTX_WARP_THREADS, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ WarpStore< T, ITEMS_PER_THREAD, ALGORITHM, LOGICAL_WARP_THREADS, PTX_ARCH >::WarpStore ( )
inline

Collective constructor using a private static allocation of shared memory as temporary storage.

template<typename T , int ITEMS_PER_THREAD, WarpStoreAlgorithm ALGORITHM = WARP_STORE_DIRECT, int LOGICAL_WARP_THREADS = CUB_PTX_WARP_THREADS, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ WarpStore< T, ITEMS_PER_THREAD, ALGORITHM, LOGICAL_WARP_THREADS, PTX_ARCH >::WarpStore ( TempStorage &temp_storage )
inline

Collective constructor using the specified memory allocation as temporary storage.

Member Function Documentation

template<typename T , int ITEMS_PER_THREAD, WarpStoreAlgorithm ALGORITHM = WARP_STORE_DIRECT, int LOGICAL_WARP_THREADS = CUB_PTX_WARP_THREADS, int PTX_ARCH = CUB_PTX_ARCH>
template<typename OutputIteratorT >
__device__ __forceinline__ void WarpStore< T, ITEMS_PER_THREAD, ALGORITHM, LOGICAL_WARP_THREADS, PTX_ARCH >::Store ( OutputIteratorT block_itr, T (&items)[ITEMS_PER_THREAD] )
inline

Store items into a linear segment of memory.

A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates the storing of a "blocked" arrangement of 64 integers across 16 threads (where each thread owns 4 consecutive items) into a linear segment of memory. The store is specialized for WARP_STORE_TRANSPOSE, meaning items are locally reordered among threads so that memory references will be efficiently coalesced using a warp-striped access pattern.
#include <cub/cub.cuh> // or equivalently <cub/warp/warp_store.cuh>

__global__ void ExampleKernel(int *d_data, ...)
{
    constexpr int warp_threads = 16;
    constexpr int block_threads = 256;
    constexpr int items_per_thread = 4;

    // Specialize WarpStore for a virtual warp of 16 threads owning 4 integer items each
    using WarpStoreT = WarpStore<int,
                                 items_per_thread,
                                 cub::WARP_STORE_TRANSPOSE,
                                 warp_threads>;

    constexpr int warps_in_block = block_threads / warp_threads;
    constexpr int tile_size = items_per_thread * warp_threads;
    const int warp_id = static_cast<int>(threadIdx.x) / warp_threads;

    // Allocate shared memory for WarpStore
    __shared__ typename WarpStoreT::TempStorage temp_storage[warps_in_block];

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // Store items to linear memory
    WarpStoreT(temp_storage[warp_id]).Store(d_data + warp_id * tile_size, thread_data);
}
Suppose the set of thread_data across the warp threads is { [0,1,2,3], [4,5,6,7], ..., [60,61,62,63] }. The output d_data will be 0, 1, 2, 3, 4, 5, ....
Parameters
[out] block_itr — The thread block's base output iterator for storing to
[in]  items — Data to store
template<typename T , int ITEMS_PER_THREAD, WarpStoreAlgorithm ALGORITHM = WARP_STORE_DIRECT, int LOGICAL_WARP_THREADS = CUB_PTX_WARP_THREADS, int PTX_ARCH = CUB_PTX_ARCH>
template<typename OutputIteratorT >
__device__ __forceinline__ void WarpStore< T, ITEMS_PER_THREAD, ALGORITHM, LOGICAL_WARP_THREADS, PTX_ARCH >::Store ( OutputIteratorT block_itr, T (&items)[ITEMS_PER_THREAD], int valid_items )
inline

Store items into a linear segment of memory, guarded by range.

A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates the storing of a "blocked" arrangement of 64 integers across 16 threads (where each thread owns 4 consecutive items) into a linear segment of memory. The store is specialized for WARP_STORE_TRANSPOSE, meaning items are locally reordered among threads so that memory references will be efficiently coalesced using a warp-striped access pattern.
#include <cub/cub.cuh> // or equivalently <cub/warp/warp_store.cuh>

__global__ void ExampleKernel(int *d_data, int valid_items, ...)
{
    constexpr int warp_threads = 16;
    constexpr int block_threads = 256;
    constexpr int items_per_thread = 4;

    // Specialize WarpStore for a virtual warp of 16 threads owning 4 integer items each
    using WarpStoreT = WarpStore<int,
                                 items_per_thread,
                                 cub::WARP_STORE_TRANSPOSE,
                                 warp_threads>;

    constexpr int warps_in_block = block_threads / warp_threads;
    constexpr int tile_size = items_per_thread * warp_threads;
    const int warp_id = static_cast<int>(threadIdx.x) / warp_threads;

    // Allocate shared memory for WarpStore
    __shared__ typename WarpStoreT::TempStorage temp_storage[warps_in_block];

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // Store items to linear memory
    WarpStoreT(temp_storage[warp_id]).Store(
        d_data + warp_id * tile_size, thread_data, valid_items);
}
Suppose the set of thread_data across the warp threads is { [0,1,2,3], [4,5,6,7], ..., [60,61,62,63] } and valid_items is 5. The output d_data will be 0, 1, 2, 3, 4, ?, ?, ..., with only the first two threads being unmasked to store portions of valid data.
Parameters
[out] block_itr — The thread block's base output iterator for storing to
[in]  items — Data to store
[in]  valid_items — Number of valid items to write

The documentation for this class was generated from the following file: