CUB  
 All Classes Namespaces Files Functions Variables Typedefs Enumerations Enumerator Friends Macros Groups Pages
Classes | List of all members
cub::BlockShuffle< T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH > Class Template Reference

Detailed description

template< typename T, int BLOCK_DIM_X, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
class cub::BlockShuffle< T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >

The BlockShuffle class provides collective methods for shuffling data partitioned across a CUDA thread block.

Template Parameters
TThe data type to be exchanged.
BLOCK_DIM_XThe thread block length in threads along the X dimension
BLOCK_DIM_Y[optional] The thread block length in threads along the Y dimension (default: 1)
BLOCK_DIM_Z[optional] The thread block length in threads along the Z dimension (default: 1)
PTX_ARCH[optional] The PTX compute capability for which to to specialize this collective, formatted as per the CUDA_ARCH macro (e.g., 350 for sm_35). Useful for determining the collective's storage requirements for a given device from the host. (Default: the value of CUDA_ARCH during the current compiler pass)
Overview
It is commonplace for blocks of threads to rearrange data items between threads. The BlockShuffle abstraction allows threads to efficiently shift items either (a) up to their successor or (b) down to their predecessor.

Classes

struct  TempStorage
 The operations exposed by BlockShuffle require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse. More...
 

Public Methods

Collective constructors
__device__ __forceinline__ BlockShuffle ()
 Collective constructor using a private static allocation of shared memory as temporary storage. More...
 
__device__ __forceinline__ BlockShuffle (TempStorage &temp_storage)
 Collective constructor using the specified memory allocation as temporary storage. More...
 
Shuffle movement
__device__ __forceinline__ void Offset (T input, T &output, int distance=1)
 Each threadi obtains the input provided by threadi+distance. The offset distance may be negative. More...
 
__device__ __forceinline__ void Rotate (T input, T &output, unsigned int distance=1)
 Each threadi obtains the input provided by threadi+distance. More...
 
template<int ITEMS_PER_THREAD>
__device__ __forceinline__ void Up (T(&input)[ITEMS_PER_THREAD], T(&prev)[ITEMS_PER_THREAD])
 The thread block rotates its blocked arrangement of input items, shifting it up by one item. More...
 
template<int ITEMS_PER_THREAD>
__device__ __forceinline__ void Up (T(&input)[ITEMS_PER_THREAD], T(&prev)[ITEMS_PER_THREAD], T &block_suffix)
 The thread block rotates its blocked arrangement of input items, shifting it up by one item. All threads receive the input provided by threadBLOCK_THREADS-1. More...
 
template<int ITEMS_PER_THREAD>
__device__ __forceinline__ void Down (T(&input)[ITEMS_PER_THREAD], T(&prev)[ITEMS_PER_THREAD])
 The thread block rotates its blocked arrangement of input items, shifting it down by one item. More...
 
template<int ITEMS_PER_THREAD>
__device__ __forceinline__ void Down (T(&input)[ITEMS_PER_THREAD], T(&prev)[ITEMS_PER_THREAD], T &block_prefix)
 The thread block rotates its blocked arrangement of input items, shifting it down by one item. All threads receive input[0] provided by thread0. More...
 

Constructor & Destructor Documentation

template<typename T , int BLOCK_DIM_X, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ cub::BlockShuffle< T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::BlockShuffle ( )
inline

Collective constructor using a private static allocation of shared memory as temporary storage.

template<typename T , int BLOCK_DIM_X, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ cub::BlockShuffle< T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::BlockShuffle ( TempStorage temp_storage)
inline

Collective constructor using the specified memory allocation as temporary storage.

Parameters
[in]temp_storageReference to memory allocation having layout type TempStorage

Member Function Documentation

template<typename T , int BLOCK_DIM_X, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ void cub::BlockShuffle< T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::Offset ( input,
T &  output,
int  distance = 1 
)
inline

Each threadi obtains the input provided by threadi+distance. The offset distance may be negative.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Parameters
[in]inputThe input item from the calling thread (threadi)
[out]outputThe input item from the successor (or predecessor) thread threadi+distance (may be aliased to input). This value is only updated for for threadi when 0 <= (i + distance) < BLOCK_THREADS-1
[in]distanceOffset distance (may be negative)
template<typename T , int BLOCK_DIM_X, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ void cub::BlockShuffle< T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::Rotate ( input,
T &  output,
unsigned int  distance = 1 
)
inline

Each threadi obtains the input provided by threadi+distance.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Parameters
[in]inputThe calling thread's input item
[out]outputThe input item from thread thread(i+distance>)%BLOCK_THREADS (may be aliased to input). This value is not updated for threadBLOCK_THREADS-1
[in]distanceOffset distance (0 < distance < BLOCK_THREADS)
template<typename T , int BLOCK_DIM_X, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<int ITEMS_PER_THREAD>
__device__ __forceinline__ void cub::BlockShuffle< T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::Up ( T(&)  input[ITEMS_PER_THREAD],
T(&)  prev[ITEMS_PER_THREAD] 
)
inline

The thread block rotates its blocked arrangement of input items, shifting it up by one item.

  • Assumes a blocked arrangement of (block-threads*items-per-thread) items across the thread block, where threadi owns the ith range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.
  • Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Parameters
[in]inputThe calling thread's input items
[out]prevThe corresponding predecessor items (may be aliased to input). The item prev[0] is not updated for thread0.
template<typename T , int BLOCK_DIM_X, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<int ITEMS_PER_THREAD>
__device__ __forceinline__ void cub::BlockShuffle< T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::Up ( T(&)  input[ITEMS_PER_THREAD],
T(&)  prev[ITEMS_PER_THREAD],
T &  block_suffix 
)
inline

The thread block rotates its blocked arrangement of input items, shifting it up by one item. All threads receive the input provided by threadBLOCK_THREADS-1.

  • Assumes a blocked arrangement of (block-threads*items-per-thread) items across the thread block, where threadi owns the ith range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.
  • Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Parameters
[in]inputThe calling thread's input items
[out]prevThe corresponding predecessor items (may be aliased to input). The item prev[0] is not updated for thread0.
[out]block_suffixThe item input[ITEMS_PER_THREAD-1] from threadBLOCK_THREADS-1, provided to all threads
template<typename T , int BLOCK_DIM_X, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<int ITEMS_PER_THREAD>
__device__ __forceinline__ void cub::BlockShuffle< T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::Down ( T(&)  input[ITEMS_PER_THREAD],
T(&)  prev[ITEMS_PER_THREAD] 
)
inline

The thread block rotates its blocked arrangement of input items, shifting it down by one item.

  • Assumes a blocked arrangement of (block-threads*items-per-thread) items across the thread block, where threadi owns the ith range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.
  • Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Parameters
[in]inputThe calling thread's input items
[out]prevThe corresponding predecessor items (may be aliased to input). The value prev[0] is not updated for threadBLOCK_THREADS-1.
template<typename T , int BLOCK_DIM_X, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<int ITEMS_PER_THREAD>
__device__ __forceinline__ void cub::BlockShuffle< T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::Down ( T(&)  input[ITEMS_PER_THREAD],
T(&)  prev[ITEMS_PER_THREAD],
T &  block_prefix 
)
inline

The thread block rotates its blocked arrangement of input items, shifting it down by one item. All threads receive input[0] provided by thread0.

  • Assumes a blocked arrangement of (block-threads*items-per-thread) items across the thread block, where threadi owns the ith range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.
  • Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Parameters
[in]inputThe calling thread's input items
[out]prevThe corresponding predecessor items (may be aliased to input). The value prev[0] is not updated for threadBLOCK_THREADS-1.
[out]block_prefixThe item input[0] from thread0, provided to all threads

The documentation for this class was generated from the following file: