template<
typename T,
int LOGICAL_WARP_THREADS = CUB_PTX_WARP_THREADS,
int PTX_ARCH = CUB_PTX_ARCH>
class cub::WarpReduce< T, LOGICAL_WARP_THREADS, PTX_ARCH >
The WarpReduce class provides collective methods for computing a parallel reduction of items partitioned across a CUDA thread warp.
- Template Parameters
-
T | The reduction input/output element type |
LOGICAL_WARP_THREADS | [optional] The number of threads per "logical" warp (may be less than the number of hardware warp threads). Default is the warp size of the targeted CUDA compute-capability (e.g., 32 threads for SM20). |
PTX_ARCH | [optional] The PTX compute capability for which to specialize this collective, formatted as per the __CUDA_ARCH__ macro (e.g., 350 for sm_35). Useful for determining the collective's storage requirements for a given device from the host (see the sketch below). (Default: the value of __CUDA_ARCH__ during the current compiler pass) |
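A minimal sketch of that host-side use case (the sm_35 target and the int specialization are assumed for illustration):
#include <cub/cub.cuh>

// Sketch: inspect the shared-memory footprint that a WarpReduce
// specialization targeting sm_35 (PTX_ARCH = 350) would require.
size_t storage_bytes = sizeof(cub::WarpReduce<int, 32, 350>::TempStorage);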
- Overview
- A reduction (or fold) uses a binary combining operator to compute a single aggregate from a list of input elements.
- Supports "logical" warps smaller than the physical warp size (e.g., logical warps of 8 threads), as in the sketch following this list
- The number of entrant threads must be a multiple of LOGICAL_WARP_THREADS
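A minimal sketch of a logical-warp specialization (the 16-thread logical warp width and the 128-thread block are assumed for illustration):
__global__ void ExampleKernel(...)
{
    // Specialize WarpReduce for 16-thread logical warps of type int
    typedef cub::WarpReduce<int, 16> WarpReduce;
    // A 128-thread block spans eight 16-thread logical warps
    __shared__ typename WarpReduce::TempStorage temp_storage[8];
    int thread_data = ...
    int warp_id = threadIdx.x / 16;
    // Lane0 of each logical warp receives that warp's sum
    int aggregate = WarpReduce(temp_storage[warp_id]).Sum(thread_data);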
- Performance Considerations
- Uses special instructions when applicable (e.g., warp SHFL instructions)
- Uses synchronization-free communication between warp lanes when applicable
- Incurs zero bank conflicts for most types
- Computation is slightly more efficient (i.e., having lower instruction overhead) for:
- Summation (vs. generic reduction)
- The architecture's warp size is a whole multiple of LOGICAL_WARP_THREADS
- Simple Examples
- Every thread in the warp uses the WarpReduce class by first specializing the WarpReduce type, then instantiating an instance with parameters for communication, and finally invoking one or more collective member functions.
- The code snippet below illustrates four concurrent warp sum reductions within a block of 128 threads (one for each of the four 32-thread warps).
__global__ void ExampleKernel(...)
{
    // Specialize WarpReduce for type int
    typedef cub::WarpReduce<int> WarpReduce;
    // Allocate WarpReduce shared memory for 4 warps
    __shared__ typename WarpReduce::TempStorage temp_storage[4];
    // Obtain one input item per thread
    int thread_data = ...
    // Return the warp-wide sum to lane0 of each warp
    int warp_id = threadIdx.x / 32;
    int aggregate = WarpReduce(temp_storage[warp_id]).Sum(thread_data);
- Suppose the set of input thread_data across the block of threads is {0, 1, 2, 3, ..., 127}. The corresponding output aggregate in threads 0, 32, 64, and 96 will be 496, 1520, 2544, and 3568, respectively (and is undefined in other threads). A complete, compilable driver for this example is sketched at the end of this section.
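To check the arithmetic: warp k owns inputs 32k through 32k + 31, so its aggregate is 32*(32k) + (0 + 1 + ... + 31) = 1024k + 496, giving 496, 1520, 2544, and 3568 for k = 0, 1, 2, 3.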
- The code snippet below illustrates a single warp sum reduction within a block of 128 threads.
__global__ void ExampleKernel(...)
{
    // Specialize WarpReduce for type int
    typedef cub::WarpReduce<int> WarpReduce;
    // Allocate WarpReduce shared memory for one warp
    __shared__ typename WarpReduce::TempStorage temp_storage;
    ...
    // Only the first warp performs the reduction
    if (threadIdx.x < 32)
    {
        int thread_data = ...
        // Return the warp-wide sum to lane0
        int aggregate = WarpReduce(temp_storage).Sum(thread_data);
    }
- Suppose the set of input thread_data across the warp of threads is {0, 1, 2, 3, ..., 31}. The corresponding output aggregate in thread0 will be 496 (and is undefined in other threads).
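For completeness, a minimal self-contained driver for the four-warp example above might look like the following sketch; the kernel name WarpSumKernel and the host-side scaffolding are illustrative (not part of CUB), and error checking is omitted:
#include <cstdio>
#include <cub/cub.cuh>

__global__ void WarpSumKernel(int *d_in, int *d_out)
{
    // Specialize WarpReduce for type int
    typedef cub::WarpReduce<int> WarpReduce;
    // Allocate WarpReduce shared memory for 4 warps
    __shared__ typename WarpReduce::TempStorage temp_storage[4];
    int thread_data = d_in[threadIdx.x];
    int warp_id = threadIdx.x / 32;
    // Lane0 of each warp receives that warp's sum
    int aggregate = WarpReduce(temp_storage[warp_id]).Sum(thread_data);
    if (threadIdx.x % 32 == 0)
        d_out[warp_id] = aggregate;
}

int main()
{
    int h_in[128], h_out[4];
    for (int i = 0; i < 128; ++i) h_in[i] = i;
    int *d_in, *d_out;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    WarpSumKernel<<<1, 128>>>(d_in, d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 4; ++i)
        printf("%d ", h_out[i]);   // Expected: 496 1520 2544 3568
    printf("\n");
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}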
Collective constructors

__device__ __forceinline__ WarpReduce (TempStorage &temp_storage)
    Collective constructor using the specified memory allocation as temporary storage. Logical warp and lane identifiers are constructed from threadIdx.x.

Summation reductions

__device__ __forceinline__ T Sum (T input)
    Computes a warp-wide sum in the calling warp. The output is valid in warp lane0.

__device__ __forceinline__ T Sum (T input, int valid_items)
    Computes a partially-full warp-wide sum in the calling warp. The output is valid in warp lane0.

template<typename FlagT >
__device__ __forceinline__ T HeadSegmentedSum (T input, FlagT head_flag)
    Computes a segmented sum in the calling warp where segments are defined by head-flags. The sum of each segment is returned to the first lane in that segment (which always includes lane0).

template<typename FlagT >
__device__ __forceinline__ T TailSegmentedSum (T input, FlagT tail_flag)
    Computes a segmented sum in the calling warp where segments are defined by tail-flags. The sum of each segment is returned to the first lane in that segment (which always includes lane0).

Generic reductions

template<typename ReductionOp >
__device__ __forceinline__ T Reduce (T input, ReductionOp reduction_op)
    Computes a warp-wide reduction in the calling warp using the specified binary reduction functor. The output is valid in warp lane0.

template<typename ReductionOp >
__device__ __forceinline__ T Reduce (T input, ReductionOp reduction_op, int valid_items)
    Computes a partially-full warp-wide reduction in the calling warp using the specified binary reduction functor. The output is valid in warp lane0.

template<typename ReductionOp , typename FlagT >
__device__ __forceinline__ T HeadSegmentedReduce (T input, FlagT head_flag, ReductionOp reduction_op)
    Computes a segmented reduction in the calling warp where segments are defined by head-flags. The reduction of each segment is returned to the first lane in that segment (which always includes lane0).

template<typename ReductionOp , typename FlagT >
__device__ __forceinline__ T TailSegmentedReduce (T input, FlagT tail_flag, ReductionOp reduction_op)
    Computes a segmented reduction in the calling warp where segments are defined by tail-flags. The reduction of each segment is returned to the first lane in that segment (which always includes lane0).
template<typename T , int LOGICAL_WARP_THREADS = CUB_PTX_WARP_THREADS, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ T cub::WarpReduce< T, LOGICAL_WARP_THREADS, PTX_ARCH >::Sum ( T input )  [inline]
Computes a warp-wide sum in the calling warp. The output is valid in warp lane0.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
- Snippet
- The code snippet below illustrates four concurrent warp sum reductions within a block of 128 threads (one for each of the four 32-thread warps).
__global__ void ExampleKernel(...)
{
    // Specialize WarpReduce for type int
    typedef cub::WarpReduce<int> WarpReduce;
    // Allocate WarpReduce shared memory for 4 warps
    __shared__ typename WarpReduce::TempStorage temp_storage[4];
    // Obtain one input item per thread
    int thread_data = ...
    // Return the warp-wide sum to lane0 of each warp
    int warp_id = threadIdx.x / 32;
    int aggregate = WarpReduce(temp_storage[warp_id]).Sum(thread_data);
- Suppose the set of input thread_data across the block of threads is {0, 1, 2, 3, ..., 127}. The corresponding output aggregate in threads 0, 32, 64, and 96 will be 496, 1520, 2544, and 3568, respectively (and is undefined in other threads).
- Parameters
-
[in] | input | Calling thread's input |
template<typename T , int LOGICAL_WARP_THREADS = CUB_PTX_WARP_THREADS, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ T cub::WarpReduce< T, LOGICAL_WARP_THREADS, PTX_ARCH >::Sum ( T input, int valid_items )  [inline]
Computes a partially-full warp-wide sum in the calling warp. The output is valid in warp lane0.
All threads across the calling warp must agree on the same value for valid_items. Otherwise the result is undefined.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
- Snippet
- The code snippet below illustrates a sum reduction within a single, partially-full block of 32 threads (one warp).
__global__ void ExampleKernel(int *d_data, int valid_items)
{
    // Specialize WarpReduce for type int
    typedef cub::WarpReduce<int> WarpReduce;
    // Allocate WarpReduce shared memory for one warp
    __shared__ typename WarpReduce::TempStorage temp_storage;
    // Obtain one input item per thread if in range
    int thread_data;
    if (threadIdx.x < valid_items)
        thread_data = d_data[threadIdx.x];
    // Return the warp-wide sum to lane0
    int aggregate = WarpReduce(temp_storage).Sum(
        thread_data, valid_items);
- Suppose the input d_data is {0, 1, 2, 3, 4, ...} and valid_items is 4. The corresponding output aggregate in thread0 is 6 (and is undefined in other threads).
- Parameters
-
[in] | input | Calling thread's input |
[in] | valid_items | Total number of valid items in the calling thread's logical warp (may be less than LOGICAL_WARP_THREADS ) |
template<typename T , int LOGICAL_WARP_THREADS = CUB_PTX_WARP_THREADS, int PTX_ARCH = CUB_PTX_ARCH>
template<typename FlagT >
__device__ __forceinline__ T cub::WarpReduce< T, LOGICAL_WARP_THREADS, PTX_ARCH >::HeadSegmentedSum ( T input, FlagT head_flag )  [inline]
Computes a segmented sum in the calling warp where segments are defined by head-flags. The sum of each segment is returned to the first lane in that segment (which always includes lane0).
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
- Snippet
- The code snippet below illustrates a head-segmented warp sum reduction within a block of 32 threads (one warp).
__global__ void ExampleKernel(...)
{
    // Specialize WarpReduce for type int
    typedef cub::WarpReduce<int> WarpReduce;
    // Allocate WarpReduce shared memory for one warp
    __shared__ typename WarpReduce::TempStorage temp_storage;
    // Obtain one input item and segment head-flag per thread
    int thread_data = ...
    int head_flag = ...
    // Return the warp-wide segmented sums to each segment's first lane
    int aggregate = WarpReduce(temp_storage).HeadSegmentedSum(
        thread_data, head_flag);
- Suppose the set of input thread_data and head_flag across the block of threads is {0, 1, 2, 3, ..., 31} and {1, 0, 0, 0, 1, 0, 0, 0, ..., 1, 0, 0, 0}, respectively. The corresponding output aggregate in threads 0, 4, 8, etc. will be 6, 22, 38, etc. (and is undefined in other threads).
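To check the arithmetic: each four-element segment k covers inputs 4k through 4k + 3, so its sum is 4*(4k) + (0 + 1 + 2 + 3) = 16k + 6, giving 6, 22, 38, and so on.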
- Template Parameters
-
FlagT | [inferred] The flag type (must be an integer type) |
- Parameters
-
[in] | input | Calling thread's input |
[in] | head_flag | Head flag denoting whether or not input is the start of a new segment |
template<typename T , int LOGICAL_WARP_THREADS = CUB_PTX_WARP_THREADS, int PTX_ARCH = CUB_PTX_ARCH>
template<typename FlagT >
__device__ __forceinline__ T cub::WarpReduce< T, LOGICAL_WARP_THREADS, PTX_ARCH >::TailSegmentedSum ( T input, FlagT tail_flag )  [inline]
Computes a segmented sum in the calling warp where segments are defined by tail-flags. The sum of each segment is returned to the first lane in that segment (which always includes lane0).
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
- Snippet
- The code snippet below illustrates a tail-segmented warp sum reduction within a block of 32 threads (one warp).
__global__ void ExampleKernel(...)
{
    // Specialize WarpReduce for type int
    typedef cub::WarpReduce<int> WarpReduce;
    // Allocate WarpReduce shared memory for one warp
    __shared__ typename WarpReduce::TempStorage temp_storage;
    // Obtain one input item and segment tail-flag per thread
    int thread_data = ...
    int tail_flag = ...
    // Return the warp-wide segmented sums to each segment's first lane
    int aggregate = WarpReduce(temp_storage).TailSegmentedSum(
        thread_data, tail_flag);
- Suppose the set of input thread_data and tail_flag across the block of threads is {0, 1, 2, 3, ..., 31} and {0, 0, 0, 1, 0, 0, 0, 1, ..., 0, 0, 0, 1}, respectively. The corresponding output aggregate in threads 0, 4, 8, etc. will be 6, 22, 38, etc. (and is undefined in other threads).
- Template Parameters
-
FlagT | [inferred] The flag type (must be an integer type) |
- Parameters
-
[in] | input | Calling thread's input |
[in] | tail_flag | Tail flag denoting whether or not input is the end of the current segment |
template<typename T , int LOGICAL_WARP_THREADS = CUB_PTX_WARP_THREADS, int PTX_ARCH = CUB_PTX_ARCH>
template<typename ReductionOp >
__device__ __forceinline__ T cub::WarpReduce< T, LOGICAL_WARP_THREADS, PTX_ARCH >::Reduce ( T input, ReductionOp reduction_op )  [inline]
Computes a warp-wide reduction in the calling warp using the specified binary reduction functor. The output is valid in warp lane0.
Supports non-commutative reduction operators
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
- Snippet
- The code snippet below illustrates four concurrent warp max reductions within a block of 128 threads (one for each of the four 32-thread warps).
__global__ void ExampleKernel(...)
{
    // Specialize WarpReduce for type int
    typedef cub::WarpReduce<int> WarpReduce;
    // Allocate WarpReduce shared memory for 4 warps
    __shared__ typename WarpReduce::TempStorage temp_storage[4];
    // Obtain one input item per thread
    int thread_data = ...
    // Return the warp-wide max to lane0 of each warp
    int warp_id = threadIdx.x / 32;
    int aggregate = WarpReduce(temp_storage[warp_id]).Reduce(
        thread_data, cub::Max());
- Suppose the set of input thread_data across the block of threads is {0, 1, 2, 3, ..., 127}. The corresponding output aggregate in threads 0, 32, 64, and 96 will be 31, 63, 95, and 127, respectively (and is undefined in other threads).
- Template Parameters
-
ReductionOp | [inferred] Binary reduction operator type having member T operator()(const T &a, const T &b) (see the functor sketch below) |
- Parameters
-
[in] | input | Calling thread's input |
[in] | reduction_op | Binary reduction operator |
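As a sketch of what such a functor can look like (AbsMax is a hypothetical user-defined operator, not part of CUB):
// Hypothetical reduction functor: selects the operand with the larger
// absolute value. Satisfies the ReductionOp interface named above.
struct AbsMax
{
    __device__ __forceinline__ int operator()(const int &a, const int &b) const
    {
        int abs_a = (a < 0) ? -a : a;
        int abs_b = (b < 0) ? -b : b;
        return (abs_a > abs_b) ? a : b;
    }
};
It can be passed wherever the snippet above passes cub::Max(), e.g., WarpReduce(temp_storage[warp_id]).Reduce(thread_data, AbsMax());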
template<typename T , int LOGICAL_WARP_THREADS = CUB_PTX_WARP_THREADS, int PTX_ARCH = CUB_PTX_ARCH>
template<typename ReductionOp >
__device__ __forceinline__ T cub::WarpReduce< T, LOGICAL_WARP_THREADS, PTX_ARCH >::Reduce ( T input, ReductionOp reduction_op, int valid_items )  [inline]
Computes a partially-full warp-wide reduction in the calling warp using the specified binary reduction functor. The output is valid in warp lane0.
All threads across the calling warp must agree on the same value for valid_items. Otherwise the result is undefined.
Supports non-commutative reduction operators
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
- Snippet
- The code snippet below illustrates a max reduction within a single, partially-full block of 32 threads (one warp).
__global__ void ExampleKernel(int *d_data, int valid_items)
{
    // Specialize WarpReduce for type int
    typedef cub::WarpReduce<int> WarpReduce;
    // Allocate WarpReduce shared memory for one warp
    __shared__ typename WarpReduce::TempStorage temp_storage;
    int thread_data;
    if (threadIdx.x < valid_items)
        thread_data = d_data[threadIdx.x];
    // Return the warp-wide max to lane0
    int aggregate = WarpReduce(temp_storage).Reduce(
        thread_data, cub::Max(), valid_items);
- Suppose the input d_data is {0, 1, 2, 3, 4, ...} and valid_items is 4. The corresponding output aggregate in thread0 is 3 (and is undefined in other threads).
- Template Parameters
-
ReductionOp | [inferred] Binary reduction operator type having member T operator()(const T &a, const T &b) |
- Parameters
-
[in] | input | Calling thread's input |
[in] | reduction_op | Binary reduction operator |
[in] | valid_items | Total number of valid items in the calling thread's logical warp (may be less than LOGICAL_WARP_THREADS ) |
template<typename T , int LOGICAL_WARP_THREADS = CUB_PTX_WARP_THREADS, int PTX_ARCH = CUB_PTX_ARCH>
template<typename ReductionOp , typename FlagT >
__device__ __forceinline__ T cub::WarpReduce< T, LOGICAL_WARP_THREADS, PTX_ARCH >::HeadSegmentedReduce ( T input, FlagT head_flag, ReductionOp reduction_op )  [inline]
Computes a segmented reduction in the calling warp where segments are defined by head-flags. The reduction of each segment is returned to the first lane in that segment (which always includes lane0).
Supports non-commutative reduction operators
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
- Snippet
- The code snippet below illustrates a head-segmented warp max reduction within a block of 32 threads (one warp).
__global__ void ExampleKernel(...)
{
    // Specialize WarpReduce for type int
    typedef cub::WarpReduce<int> WarpReduce;
    // Allocate WarpReduce shared memory for one warp
    __shared__ typename WarpReduce::TempStorage temp_storage;
    int thread_data = ...
    int head_flag = ...
    // Return the warp-wide segmented maxes to each segment's first lane
    int aggregate = WarpReduce(temp_storage).HeadSegmentedReduce(
        thread_data, head_flag, cub::Max());
- Suppose the set of input thread_data and head_flag across the block of threads is {0, 1, 2, 3, ..., 31} and {1, 0, 0, 0, 1, 0, 0, 0, ..., 1, 0, 0, 0}, respectively. The corresponding output aggregate in threads 0, 4, 8, etc. will be 3, 7, 11, etc. (and is undefined in other threads).
- Template Parameters
-
ReductionOp | [inferred] Binary reduction operator type having member T operator()(const T &a, const T &b) |
FlagT | [inferred] The flag type (must be an integer type) |
- Parameters
-
[in] | input | Calling thread's input |
[in] | head_flag | Head flag denoting whether or not input is the start of a new segment |
[in] | reduction_op | Reduction operator |
template<typename T , int LOGICAL_WARP_THREADS = CUB_PTX_WARP_THREADS, int PTX_ARCH = CUB_PTX_ARCH>
template<typename ReductionOp , typename FlagT >
__device__ __forceinline__ T cub::WarpReduce< T, LOGICAL_WARP_THREADS, PTX_ARCH >::TailSegmentedReduce ( T input, FlagT tail_flag, ReductionOp reduction_op )  [inline]
Computes a segmented reduction in the calling warp where segments are defined by tail-flags. The reduction of each segment is returned to the first lane in that segment (which always includes lane0).
Supports non-commutative reduction operators
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
- Snippet
- The code snippet below illustrates a tail-segmented warp max reduction within a block of 32 threads (one warp).
__global__ void ExampleKernel(...)
{
    // Specialize WarpReduce for type int
    typedef cub::WarpReduce<int> WarpReduce;
    // Allocate WarpReduce shared memory for one warp
    __shared__ typename WarpReduce::TempStorage temp_storage;
    int thread_data = ...
    int tail_flag = ...
    // Return the warp-wide segmented maxes to each segment's first lane
    int aggregate = WarpReduce(temp_storage).TailSegmentedReduce(
        thread_data, tail_flag, cub::Max());
- Suppose the set of input thread_data and tail_flag across the block of threads is {0, 1, 2, 3, ..., 31} and {0, 0, 0, 1, 0, 0, 0, 1, ..., 0, 0, 0, 1}, respectively. The corresponding output aggregate in threads 0, 4, 8, etc. will be 3, 7, 11, etc. (and is undefined in other threads).
- Template Parameters
-
ReductionOp | [inferred] Binary reduction operator type having member T operator()(const T &a, const T &b) |
FlagT | [inferred] The flag type (must be an integer type) |
- Parameters
-
[in] | input | Calling thread's input |
[in] | tail_flag | Tail flag denoting whether or not input is the end of the current segment |
[in] | reduction_op | Reduction operator |