CUB  
 All Classes Namespaces Files Functions Variables Typedefs Enumerations Enumerator Friends Macros Groups Pages
Classes | List of all members
WarpExchange< InputT, ITEMS_PER_THREAD, LOGICAL_WARP_THREADS, PTX_ARCH > Class Template Reference

Detailed description

template< typename InputT, int ITEMS_PER_THREAD, int LOGICAL_WARP_THREADS = CUB_PTX_WARP_THREADS, int PTX_ARCH = CUB_PTX_ARCH>
class WarpExchange< InputT, ITEMS_PER_THREAD, LOGICAL_WARP_THREADS, PTX_ARCH >

The WarpExchange class provides collective methods for rearranging data partitioned across a CUDA warp.

Template Parameters
TThe data type to be exchanged.
ITEMS_PER_THREADThe number of items partitioned onto each thread.
LOGICAL_WARP_THREADS[optional] The number of threads per "logical" warp (may be less than the number of hardware warp threads). Default is the warp size of the targeted CUDA compute-capability (e.g., 32 threads for SM86). Must be a power of two.
PTX_ARCH[optional] The PTX compute capability for which to to specialize this collective, formatted as per the CUDA_ARCH macro (e.g., 350 for sm_35). Useful for determining the collective's storage requirements for a given device from the host. (Default: the value of CUDA_ARCH during the current compiler pass)
Overview
  • It is commonplace for a warp of threads to rearrange data items between threads. For example, the global memory accesses prefer patterns where data items are "striped" across threads (where consecutive threads access consecutive items), yet most warp-wide operations prefer a "blocked" partitioning of items across threads (where consecutive items belong to a single thread).
  • WarpExchange supports the following types of data exchanges:
A Simple Example
The code snippet below illustrates the conversion from a "blocked" to a "striped" arrangement of 64 integer items partitioned across 16 threads where each thread owns 4 items.
#include <cub/cub.cuh> // or equivalently <cub/warp/warp_exchange.cuh>
__global__ void ExampleKernel(int *d_data, ...)
{
constexpr int warp_threads = 16;
constexpr int block_threads = 256;
constexpr int items_per_thread = 4;
constexpr int warps_per_block = block_threads / warp_threads;
const int warp_id = static_cast<int>(threadIdx.x) / warp_threads;
// Specialize WarpExchange for a virtual warp of 16 threads owning 4 integer items each
using WarpExchangeT =
cub::WarpExchange<int, items_per_thread, warp_threads>;
// Allocate shared memory for WarpExchange
__shared__ typename WarpExchangeT::TempStorage temp_storage[warps_per_block];
// Load a tile of data striped across threads
int thread_data[items_per_thread];
// ...
// Collectively exchange data into a blocked arrangement across threads
WarpExchangeT(temp_storage[warp_id]).StripedToBlocked(thread_data, thread_data);
Suppose the set of striped input thread_data across the block of threads is { [0,16,32,48], [1,17,33,49], ..., [15, 32, 47, 63] }. The corresponding output thread_data in those threads will be { [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [60,61,62,63] }.

Classes

struct  TempStorage
 The operations exposed by WarpExchange require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse. More...
 

Public Methods

Collective constructors
 WarpExchange ()=delete
 
__device__ __forceinline__ WarpExchange (TempStorage &temp_storage)
 Collective constructor using the specified memory allocation as temporary storage. More...
 
Data movement
template<typename OutputT >
__device__ __forceinline__ void BlockedToStriped (const InputT(&input_items)[ITEMS_PER_THREAD], OutputT(&output_items)[ITEMS_PER_THREAD])
 Transposes data items from blocked arrangement to striped arrangement. More...
 
template<typename OutputT >
__device__ __forceinline__ void StripedToBlocked (const InputT(&input_items)[ITEMS_PER_THREAD], OutputT(&output_items)[ITEMS_PER_THREAD])
 Transposes data items from striped arrangement to blocked arrangement. More...
 
template<typename OffsetT >
__device__ __forceinline__ void ScatterToStriped (InputT(&items)[ITEMS_PER_THREAD], OffsetT(&ranks)[ITEMS_PER_THREAD])
 Exchanges valid data items annotated by rank into striped arrangement. More...
 
template<typename OutputT , typename OffsetT >
__device__ __forceinline__ void ScatterToStriped (const InputT(&input_items)[ITEMS_PER_THREAD], OutputT(&output_items)[ITEMS_PER_THREAD], OffsetT(&ranks)[ITEMS_PER_THREAD])
 Exchanges valid data items annotated by rank into striped arrangement. More...
 

Constructor & Destructor Documentation

template<typename InputT , int ITEMS_PER_THREAD, int LOGICAL_WARP_THREADS = CUB_PTX_WARP_THREADS, int PTX_ARCH = CUB_PTX_ARCH>
WarpExchange< InputT, ITEMS_PER_THREAD, LOGICAL_WARP_THREADS, PTX_ARCH >::WarpExchange ( )
delete
template<typename InputT , int ITEMS_PER_THREAD, int LOGICAL_WARP_THREADS = CUB_PTX_WARP_THREADS, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ WarpExchange< InputT, ITEMS_PER_THREAD, LOGICAL_WARP_THREADS, PTX_ARCH >::WarpExchange ( TempStorage temp_storage)
inlineexplicit

Collective constructor using the specified memory allocation as temporary storage.

Member Function Documentation

template<typename InputT , int ITEMS_PER_THREAD, int LOGICAL_WARP_THREADS = CUB_PTX_WARP_THREADS, int PTX_ARCH = CUB_PTX_ARCH>
template<typename OutputT >
__device__ __forceinline__ void WarpExchange< InputT, ITEMS_PER_THREAD, LOGICAL_WARP_THREADS, PTX_ARCH >::BlockedToStriped ( const InputT(&)  input_items[ITEMS_PER_THREAD],
OutputT(&)  output_items[ITEMS_PER_THREAD] 
)
inline

Transposes data items from blocked arrangement to striped arrangement.

A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates the conversion from a "blocked" to a "striped" arrangement of 64 integer items partitioned across 16 threads where each thread owns 4 items.
#include <cub/cub.cuh> // or equivalently <cub/warp/warp_exchange.cuh>
__global__ void ExampleKernel(int *d_data, ...)
{
constexpr int warp_threads = 16;
constexpr int block_threads = 256;
constexpr int items_per_thread = 4;
constexpr int warps_per_block = block_threads / warp_threads;
const int warp_id = static_cast<int>(threadIdx.x) / warp_threads;
// Specialize WarpExchange for a virtual warp of 16 threads owning 4 integer items each
using WarpExchangeT = cub::WarpExchange<int, items_per_thread, warp_threads>;
// Allocate shared memory for WarpExchange
__shared__ typename WarpExchangeT::TempStorage temp_storage[warps_per_block];
// Obtain a segment of consecutive items that are blocked across threads
int thread_data[items_per_thread];
// ...
// Collectively exchange data into a striped arrangement across threads
WarpExchangeT(temp_storage[warp_id]).BlockedToStriped(thread_data, thread_data);
Suppose the set of striped input thread_data across the block of threads is { [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [60,61,62,63] }. The corresponding output thread_data in those threads will be { [0,16,32,48], [1,17,33,49], ..., [15, 32, 47, 63] }.
Parameters
[in]input_itemsItems to exchange, converting between blocked and striped arrangements.
[out]output_itemsItems from exchange, converting between striped and blocked arrangements. May be aliased to input_items.
template<typename InputT , int ITEMS_PER_THREAD, int LOGICAL_WARP_THREADS = CUB_PTX_WARP_THREADS, int PTX_ARCH = CUB_PTX_ARCH>
template<typename OutputT >
__device__ __forceinline__ void WarpExchange< InputT, ITEMS_PER_THREAD, LOGICAL_WARP_THREADS, PTX_ARCH >::StripedToBlocked ( const InputT(&)  input_items[ITEMS_PER_THREAD],
OutputT(&)  output_items[ITEMS_PER_THREAD] 
)
inline

Transposes data items from striped arrangement to blocked arrangement.

A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates the conversion from a "striped" to a "blocked" arrangement of 64 integer items partitioned across 16 threads where each thread owns 4 items.
#include <cub/cub.cuh> // or equivalently <cub/warp/warp_exchange.cuh>
__global__ void ExampleKernel(int *d_data, ...)
{
constexpr int warp_threads = 16;
constexpr int block_threads = 256;
constexpr int items_per_thread = 4;
constexpr int warps_per_block = block_threads / warp_threads;
const int warp_id = static_cast<int>(threadIdx.x) / warp_threads;
// Specialize WarpExchange for a virtual warp of 16 threads owning 4 integer items each
using WarpExchangeT = cub::WarpExchange<int, items_per_thread, warp_threads>;
// Allocate shared memory for WarpExchange
__shared__ typename WarpExchangeT::TempStorage temp_storage[warps_per_block];
// Load a tile of data striped across threads
int thread_data[items_per_thread];
// ...
// Collectively exchange data into a blocked arrangement across threads
WarpExchangeT(temp_storage[warp_id]).StripedToBlocked(thread_data, thread_data);
Suppose the set of striped input thread_data across the block of threads is { [0,16,32,48], [1,17,33,49], ..., [15, 32, 47, 63] }. The corresponding output thread_data in those threads will be { [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [60,61,62,63] }.
Parameters
[in]input_itemsItems to exchange
[out]output_itemsItems from exchange. May be aliased to input_items.
template<typename InputT , int ITEMS_PER_THREAD, int LOGICAL_WARP_THREADS = CUB_PTX_WARP_THREADS, int PTX_ARCH = CUB_PTX_ARCH>
template<typename OffsetT >
__device__ __forceinline__ void WarpExchange< InputT, ITEMS_PER_THREAD, LOGICAL_WARP_THREADS, PTX_ARCH >::ScatterToStriped ( InputT(&)  items[ITEMS_PER_THREAD],
OffsetT(&)  ranks[ITEMS_PER_THREAD] 
)
inline

Exchanges valid data items annotated by rank into striped arrangement.

A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates the conversion from a "scatter" to a "striped" arrangement of 64 integer items partitioned across 16 threads where each thread owns 4 items.
#include <cub/cub.cuh> // or equivalently <cub/warp/warp_exchange.cuh>
__global__ void ExampleKernel(int *d_data, ...)
{
constexpr int warp_threads = 16;
constexpr int block_threads = 256;
constexpr int items_per_thread = 4;
constexpr int warps_per_block = block_threads / warp_threads;
const int warp_id = static_cast<int>(threadIdx.x) / warp_threads;
// Specialize WarpExchange for a virtual warp of 16 threads owning 4 integer items each
using WarpExchangeT = cub::WarpExchange<int, items_per_thread, warp_threads>;
// Allocate shared memory for WarpExchange
__shared__ typename WarpExchangeT::TempStorage temp_storage[warps_per_block];
// Obtain a segment of consecutive items that are blocked across threads
int thread_data[items_per_thread];
int thread_ranks[items_per_thread];
// ...
// Collectively exchange data into a striped arrangement across threads
WarpExchangeT(temp_storage[warp_id]).ScatterToStriped(
thread_data, thread_ranks);
Suppose the set of input thread_data across the block of threads is { [0,1,2,3], [4,5,6,7], ..., [60,61,62,63] }, and the set of thread_ranks is { [63,62,61,60], ..., [7,6,5,4], [3,2,1,0] }. The corresponding output thread_data in those threads will be { [63, 47, 31, 15], [62, 46, 30, 14], ..., [48, 32, 16, 0] }.
Template Parameters
OffsetT[inferred] Signed integer type for local offsets
Parameters
[in,out]itemsItems to exchange
[in]ranksCorresponding scatter ranks
template<typename InputT , int ITEMS_PER_THREAD, int LOGICAL_WARP_THREADS = CUB_PTX_WARP_THREADS, int PTX_ARCH = CUB_PTX_ARCH>
template<typename OutputT , typename OffsetT >
__device__ __forceinline__ void WarpExchange< InputT, ITEMS_PER_THREAD, LOGICAL_WARP_THREADS, PTX_ARCH >::ScatterToStriped ( const InputT(&)  input_items[ITEMS_PER_THREAD],
OutputT(&)  output_items[ITEMS_PER_THREAD],
OffsetT(&)  ranks[ITEMS_PER_THREAD] 
)
inline

Exchanges valid data items annotated by rank into striped arrangement.

A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates the conversion from a "scatter" to a "striped" arrangement of 64 integer items partitioned across 16 threads where each thread owns 4 items.
#include <cub/cub.cuh> // or equivalently <cub/warp/warp_exchange.cuh>
__global__ void ExampleKernel(int *d_data, ...)
{
constexpr int warp_threads = 16;
constexpr int block_threads = 256;
constexpr int items_per_thread = 4;
constexpr int warps_per_block = block_threads / warp_threads;
const int warp_id = static_cast<int>(threadIdx.x) / warp_threads;
// Specialize WarpExchange for a virtual warp of 16 threads owning 4 integer items each
using WarpExchangeT = cub::WarpExchange<int, items_per_thread, warp_threads>;
// Allocate shared memory for WarpExchange
__shared__ typename WarpExchangeT::TempStorage temp_storage[warps_per_block];
// Obtain a segment of consecutive items that are blocked across threads
int thread_input[items_per_thread];
int thread_ranks[items_per_thread];
// ...
// Collectively exchange data into a striped arrangement across threads
int thread_output[items_per_thread];
WarpExchangeT(temp_storage[warp_id]).ScatterToStriped(
thread_input, thread_output, thread_ranks);
Suppose the set of input thread_input across the block of threads is { [0,1,2,3], [4,5,6,7], ..., [60,61,62,63] }, and the set of thread_ranks is { [63,62,61,60], ..., [7,6,5,4], [3,2,1,0] }. The corresponding thread_output in those threads will be { [63, 47, 31, 15], [62, 46, 30, 14], ..., [48, 32, 16, 0] }.
Template Parameters
OffsetT[inferred] Signed integer type for local offsets
Parameters
[in]input_itemsItems to exchange
[out]output_itemsItems from exchange. May be aliased to input_items.
[in]ranksCorresponding scatter ranks

The documentation for this class was generated from the following file: