CUB
The WarpExchange class provides collective methods for rearranging data partitioned across a CUDA warp.
T | The data type to be exchanged. |
ITEMS_PER_THREAD | The number of items partitioned onto each thread. |
LOGICAL_WARP_THREADS | [optional] The number of threads per "logical" warp (may be less than the number of hardware warp threads). Default is the warp size of the targeted CUDA compute-capability (e.g., 32 threads for SM86). Must be a power of two. |
PTX_ARCH | [optional] The PTX compute capability for which to specialize this collective, formatted as per the __CUDA_ARCH__ macro (e.g., 350 for sm_35). Useful for determining the collective's storage requirements for a given device from the host. (Default: the value of __CUDA_ARCH__ during the current compiler pass) |
Suppose the set of striped input thread_data across the block of threads (16 threads, 4 items per thread) is { [0,16,32,48], [1,17,33,49], ..., [15,31,47,63] }. The corresponding output thread_data in those threads will be { [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [60,61,62,63] }.
Classes | |
struct | TempStorage |
The operations exposed by WarpExchange require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse. | |
Public Methods | |
Collective constructors | |
WarpExchange ()=delete | |
__device__ __forceinline__ | WarpExchange (TempStorage &temp_storage) |
Collective constructor using the specified memory allocation as temporary storage. | |
Data movement | |
template<typename OutputT > | |
__device__ __forceinline__ void | BlockedToStriped (const InputT(&input_items)[ITEMS_PER_THREAD], OutputT(&output_items)[ITEMS_PER_THREAD]) |
Transposes data items from blocked arrangement to striped arrangement. | |
template<typename OutputT > | |
__device__ __forceinline__ void | StripedToBlocked (const InputT(&input_items)[ITEMS_PER_THREAD], OutputT(&output_items)[ITEMS_PER_THREAD]) |
Transposes data items from striped arrangement to blocked arrangement. | |
template<typename OffsetT > | |
__device__ __forceinline__ void | ScatterToStriped (InputT(&items)[ITEMS_PER_THREAD], OffsetT(&ranks)[ITEMS_PER_THREAD]) |
Exchanges valid data items annotated by rank into striped arrangement. | |
template<typename OutputT , typename OffsetT > | |
__device__ __forceinline__ void | ScatterToStriped (const InputT(&input_items)[ITEMS_PER_THREAD], OutputT(&output_items)[ITEMS_PER_THREAD], OffsetT(&ranks)[ITEMS_PER_THREAD]) |
Exchanges valid data items annotated by rank into striped arrangement. | |
WarpExchange () | delete |
WarpExchange (TempStorage &temp_storage) | inline explicit |
Collective constructor using the specified memory allocation as temporary storage.
BlockedToStriped | inline |
Transposes data items from blocked arrangement to striped arrangement.
A __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose the set of blocked input thread_data across the block of threads is { [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [60,61,62,63] }. The corresponding output thread_data in those threads will be { [0,16,32,48], [1,17,33,49], ..., [15,31,47,63] }.
[in] | input_items | Items to exchange, converting between blocked and striped arrangements. |
[out] | output_items | Items from exchange, converting between striped and blocked arrangements. May be aliased to input_items. |
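The index mapping behind this transposition can be sketched on the host. The following is a CPU reference model only, not the CUB implementation: the names `kWarpThreads`, `kItemsPerThread`, `WarpItems`, and `blocked_to_striped_model` are hypothetical, and the sizes (16 logical warp threads, 4 items per thread) are assumed to match the example above.

```cpp
#include <array>

// Assumed sizes, chosen to match the documented example.
constexpr int kWarpThreads = 16;     // hypothetical LOGICAL_WARP_THREADS
constexpr int kItemsPerThread = 4;   // hypothetical ITEMS_PER_THREAD

// One row per (simulated) thread, one column per item slot.
using WarpItems = std::array<std::array<int, kItemsPerThread>, kWarpThreads>;

// CPU reference model: in a blocked arrangement, thread t's slot i holds the
// item of rank t * kItemsPerThread + i. In a striped arrangement, the item of
// rank r lives in thread r % kWarpThreads, slot r / kWarpThreads.
WarpItems blocked_to_striped_model(const WarpItems& blocked) {
    WarpItems striped{};
    for (int t = 0; t < kWarpThreads; ++t) {
        for (int i = 0; i < kItemsPerThread; ++i) {
            const int rank = t * kItemsPerThread + i;
            striped[rank % kWarpThreads][rank / kWarpThreads] = blocked[t][i];
        }
    }
    return striped;
}
```

The model writes into a separate output array, so it sidesteps the aliasing question that the device collective handles through its temporary storage.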
StripedToBlocked | inline |
Transposes data items from striped arrangement to blocked arrangement.
A __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose the set of striped input thread_data across the block of threads is { [0,16,32,48], [1,17,33,49], ..., [15,31,47,63] }. The corresponding output thread_data in those threads will be { [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [60,61,62,63] }.
[in] | input_items | Items to exchange |
[out] | output_items | Items from exchange. May be aliased to input_items. |
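The inverse mapping can be modeled the same way. This is again a host-side sketch under assumed sizes (16 threads, 4 items per thread), with hypothetical names; it illustrates the documented input/output relationship and makes no claim about how CUB performs the exchange on the device.

```cpp
#include <array>

// Assumed sizes, chosen to match the documented example.
constexpr int kWarpThreads = 16;     // hypothetical LOGICAL_WARP_THREADS
constexpr int kItemsPerThread = 4;   // hypothetical ITEMS_PER_THREAD

// One row per (simulated) thread, one column per item slot.
using WarpItems = std::array<std::array<int, kItemsPerThread>, kWarpThreads>;

// CPU reference model: in a striped arrangement, thread t's slot i holds the
// item of rank i * kWarpThreads + t. In a blocked arrangement, the item of
// rank r lives in thread r / kItemsPerThread, slot r % kItemsPerThread.
WarpItems striped_to_blocked_model(const WarpItems& striped) {
    WarpItems blocked{};
    for (int t = 0; t < kWarpThreads; ++t) {
        for (int i = 0; i < kItemsPerThread; ++i) {
            const int rank = i * kWarpThreads + t;
            blocked[rank / kItemsPerThread][rank % kItemsPerThread] = striped[t][i];
        }
    }
    return blocked;
}
```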
ScatterToStriped (items, ranks) | inline |
Exchanges valid data items annotated by rank into striped arrangement.
A __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose the set of input thread_data across the block of threads is { [0,1,2,3], [4,5,6,7], ..., [60,61,62,63] }, and the set of thread_ranks is { [63,62,61,60], ..., [7,6,5,4], [3,2,1,0] }. The corresponding output thread_data in those threads will be { [63,47,31,15], [62,46,30,14], ..., [48,32,16,0] }.
OffsetT | [inferred] Signed integer type for local offsets |
[in,out] | items | Items to exchange |
[in] | ranks | Corresponding scatter ranks |
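A host-side sketch of the rank-based scatter may help when checking the example above. It is a CPU reference model with hypothetical names, assuming 16 threads with 4 items each and assuming the ranks form a permutation of 0..63 (every item valid, no duplicates); it does not reflect how CUB stages data through temporary storage.

```cpp
#include <array>
#include <cassert>

// Assumed sizes, chosen to match the documented example.
constexpr int kWarpThreads = 16;     // hypothetical LOGICAL_WARP_THREADS
constexpr int kItemsPerThread = 4;   // hypothetical ITEMS_PER_THREAD

// One row per (simulated) thread, one column per item slot.
using WarpItems = std::array<std::array<int, kItemsPerThread>, kWarpThreads>;

// CPU reference model: each item is placed at the striped position of its
// rank r, i.e. thread r % kWarpThreads, slot r / kWarpThreads.
// Assumption: ranks are a permutation of 0 .. kWarpThreads * kItemsPerThread - 1.
WarpItems scatter_to_striped_model(const WarpItems& items, const WarpItems& ranks) {
    WarpItems striped{};
    for (int t = 0; t < kWarpThreads; ++t) {
        for (int i = 0; i < kItemsPerThread; ++i) {
            const int r = ranks[t][i];
            assert(r >= 0 && r < kWarpThreads * kItemsPerThread);
            striped[r % kWarpThreads][r / kWarpThreads] = items[t][i];
        }
    }
    return striped;
}
```

With the reversed ranks from the example, the item holding value v is scattered to rank 63 - v, so thread 0 ends up with the values at ranks 0, 16, 32, and 48.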
ScatterToStriped (input_items, output_items, ranks) | inline |
Exchanges valid data items annotated by rank into striped arrangement.
A __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose the set of input thread_input across the block of threads is { [0,1,2,3], [4,5,6,7], ..., [60,61,62,63] }, and the set of thread_ranks is { [63,62,61,60], ..., [7,6,5,4], [3,2,1,0] }. The corresponding thread_output in those threads will be { [63,47,31,15], [62,46,30,14], ..., [48,32,16,0] }.
OffsetT | [inferred] Signed integer type for local offsets |
[in] | input_items | Items to exchange |
[out] | output_items | Items from exchange. May be aliased to input_items. |
[in] | ranks | Corresponding scatter ranks |