template<
typename T,
int BLOCK_DIM_X,
BlockReduceAlgorithm ALGORITHM = BLOCK_REDUCE_WARP_REDUCTIONS,
int BLOCK_DIM_Y = 1,
int BLOCK_DIM_Z = 1,
int PTX_ARCH = CUB_PTX_ARCH>
class cub::BlockReduce< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >
The BlockReduce class provides collective methods for computing a parallel reduction of items partitioned across a CUDA thread block.
- Template Parameters
  T | Data type being reduced
  BLOCK_DIM_X | The thread block length in threads along the X dimension
  ALGORITHM | [optional] cub::BlockReduceAlgorithm enumerator specifying the underlying algorithm to use (default: cub::BLOCK_REDUCE_WARP_REDUCTIONS)
  BLOCK_DIM_Y | [optional] The thread block length in threads along the Y dimension (default: 1)
  BLOCK_DIM_Z | [optional] The thread block length in threads along the Z dimension (default: 1)
  PTX_ARCH | [optional] The PTX compute capability for which to specialize this collective, formatted as per the __CUDA_ARCH__ macro (e.g., 350 for sm_35). Useful for determining the collective's storage requirements for a given device from the host. (Default: the value of __CUDA_ARCH__ during the current compiler pass)
- Overview
- A reduction (or fold) uses a binary combining operator to compute a single aggregate from a list of input elements.
- For multi-dimensional blocks, threads are linearly ranked in row-major order.
- BlockReduce can be optionally specialized by algorithm to accommodate different latency/throughput workload profiles:
- cub::BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY. An efficient "raking" reduction algorithm that only supports commutative reduction operators.
- cub::BLOCK_REDUCE_RAKING. An efficient "raking" reduction algorithm that supports commutative and non-commutative reduction operators.
- cub::BLOCK_REDUCE_WARP_REDUCTIONS. A quick "tiled warp-reductions" reduction algorithm that supports commutative and non-commutative reduction operators.
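The algorithm is selected through the third template parameter at specialization time. The following is a minimal sketch (the 128-thread block, int data type, and kernel/pointer names are assumptions for illustration) that chooses cub::BLOCK_REDUCE_RAKING instead of the default:
#include <cub/block/block_reduce.cuh>

__global__ void RakingSumKernel(const int *d_in, int *d_out)
{
    // Third template parameter overrides the default BLOCK_REDUCE_WARP_REDUCTIONS
    typedef cub::BlockReduce<int, 128, cub::BLOCK_REDUCE_RAKING> BlockReduce;
    __shared__ typename BlockReduce::TempStorage temp_storage;

    int thread_data = d_in[blockIdx.x * 128 + threadIdx.x];

    // thread0 receives the block-wide sum
    int aggregate = BlockReduce(temp_storage).Sum(thread_data);
    if (threadIdx.x == 0)
        d_out[blockIdx.x] = aggregate;
}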
- Performance Considerations
- Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads (a sketch pairing the two appears after this list).
- Very efficient (only one synchronization barrier).
- Incurs zero bank conflicts for most types.
- Computation is slightly more efficient (i.e., has lower instruction overhead) when:
- Summing (vs. performing a generic reduction)
- BLOCK_THREADS is a multiple of the architecture's warp size
- Every thread has a valid input (i.e., full vs. partial tiles)
- See cub::BlockReduceAlgorithm for performance details regarding algorithmic alternatives.
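As a hedged sketch of that pairing (the 128-thread / 4-items-per-thread tiling, kernel name, and pointer names are assumptions for illustration), cub::BlockLoad can gather each thread's blocked segment before the same shared memory is repurposed for the reduction:
#include <cub/block/block_load.cuh>
#include <cub/block/block_reduce.cuh>

template <int BLOCK_THREADS, int ITEMS_PER_THREAD>
__global__ void LoadThenReduceKernel(const int *d_in, int *d_out)
{
    typedef cub::BlockLoad<int, BLOCK_THREADS, ITEMS_PER_THREAD, cub::BLOCK_LOAD_VECTORIZE> BlockLoad;
    typedef cub::BlockReduce<int, BLOCK_THREADS> BlockReduce;

    // One shared memory union serves both collectives in turn
    __shared__ union {
        typename BlockLoad::TempStorage   load;
        typename BlockReduce::TempStorage reduce;
    } temp_storage;

    // Gather this block's tile into a blocked arrangement across threads
    int thread_data[ITEMS_PER_THREAD];
    BlockLoad(temp_storage.load).Load(d_in + blockIdx.x * BLOCK_THREADS * ITEMS_PER_THREAD, thread_data);

    // Barrier before the shared memory union is repurposed for the reduction
    __syncthreads();

    // thread0 receives the block-wide sum
    int aggregate = BlockReduce(temp_storage.reduce).Sum(thread_data);
    if (threadIdx.x == 0)
        d_out[blockIdx.x] = aggregate;
}

// Example launch: LoadThenReduceKernel<128, 4><<<num_tiles, 128>>>(d_in, d_out);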
- A Simple Example
- Every thread in the block uses the BlockReduce class by first specializing the BlockReduce type, then instantiating an instance with parameters for communication, and finally invoking one or more collective member functions.
- The code snippet below illustrates a sum reduction of 512 integer items that are partitioned in a blocked arrangement across 128 threads, where each thread owns 4 consecutive items.
__global__ void ExampleKernel(...)
{
    // Specialize BlockReduce for a 1D block of 128 threads of type int
    typedef cub::BlockReduce<int, 128> BlockReduce;
    // Allocate shared memory for BlockReduce
    __shared__ typename BlockReduce::TempStorage temp_storage;
    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...
    // Compute the block-wide sum for thread0
    int aggregate = BlockReduce(temp_storage).Sum(thread_data);
}
- Re-using dynamically allocated shared memory
- The following example under the examples/block folder illustrates usage of dynamically allocated shared memory with BlockReduce and how to re-purpose the same memory region: example_block_reduce_dyn_smem.cu
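The shipped example is more complete; the fragment below is only a minimal sketch of the idea (kernel name, pointer names, and the 128-thread block are assumptions), carving TempStorage out of dynamically allocated shared memory:
#include <cub/block/block_reduce.cuh>

__global__ void DynSmemSumKernel(const int *d_in, int *d_out)
{
    typedef cub::BlockReduce<int, 128> BlockReduce;

    // Dynamic shared memory, sized at launch to hold BlockReduce's TempStorage
    extern __shared__ char smem[];
    typename BlockReduce::TempStorage &temp_storage =
        *reinterpret_cast<typename BlockReduce::TempStorage *>(smem);

    int thread_data = d_in[blockIdx.x * 128 + threadIdx.x];

    // thread0 receives the block-wide sum
    int aggregate = BlockReduce(temp_storage).Sum(thread_data);

    // Barrier before the same shared memory region is re-purposed for other data
    __syncthreads();

    if (threadIdx.x == 0)
        d_out[blockIdx.x] = aggregate;
}

// Example launch:
//   DynSmemSumKernel<<<num_blocks, 128, sizeof(cub::BlockReduce<int, 128>::TempStorage)>>>(d_in, d_out);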
- Examples:
- example_block_reduce.cu.
__device__ __forceinline__ | BlockReduce () | Collective constructor using a private static allocation of shared memory as temporary storage.
__device__ __forceinline__ | BlockReduce (TempStorage &temp_storage) | Collective constructor using the specified memory allocation as temporary storage.
template<typename ReductionOp> __device__ __forceinline__ T | Reduce (T input, ReductionOp reduction_op) | Computes a block-wide reduction for thread0 using the specified binary reduction functor. Each thread contributes one input element.
template<int ITEMS_PER_THREAD, typename ReductionOp> __device__ __forceinline__ T | Reduce (T(&inputs)[ITEMS_PER_THREAD], ReductionOp reduction_op) | Computes a block-wide reduction for thread0 using the specified binary reduction functor. Each thread contributes an array of consecutive input elements.
template<typename ReductionOp> __device__ __forceinline__ T | Reduce (T input, ReductionOp reduction_op, int num_valid) | Computes a block-wide reduction for thread0 using the specified binary reduction functor. The first num_valid threads each contribute one input element.
__device__ __forceinline__ T | Sum (T input) | Computes a block-wide reduction for thread0 using addition (+) as the reduction operator. Each thread contributes one input element.
template<int ITEMS_PER_THREAD> __device__ __forceinline__ T | Sum (T(&inputs)[ITEMS_PER_THREAD]) | Computes a block-wide reduction for thread0 using addition (+) as the reduction operator. Each thread contributes an array of consecutive input elements.
__device__ __forceinline__ T | Sum (T input, int num_valid) | Computes a block-wide reduction for thread0 using addition (+) as the reduction operator. The first num_valid threads each contribute one input element.
template<typename T , int BLOCK_DIM_X, BlockReduceAlgorithm ALGORITHM = BLOCK_REDUCE_WARP_REDUCTIONS, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ cub::BlockReduce< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::BlockReduce ( )
inline
Collective constructor using a private static allocation of shared memory as temporary storage.
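The two construction modes listed in the summary above can be contrasted in a brief, hedged sketch (int data, a 1D block of 128 threads, and the kernel/pointer names are assumptions made for illustration):
#include <cub/block/block_reduce.cuh>

__global__ void ConstructorModesKernel(const int *d_in, int *d_out)
{
    typedef cub::BlockReduce<int, 128> BlockReduce;
    int thread_data = d_in[blockIdx.x * 128 + threadIdx.x];

    // (1) Default constructor: the collective uses its own private static shared memory
    int sum_a = BlockReduce().Sum(thread_data);

    // (2) User-supplied storage: the caller places (and may later reuse) the allocation
    __shared__ typename BlockReduce::TempStorage temp_storage;
    int sum_b = BlockReduce(temp_storage).Sum(thread_data);

    // Both aggregates are valid only in thread0
    if (threadIdx.x == 0)
        d_out[blockIdx.x] = sum_a + sum_b;
}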
template<typename T , int BLOCK_DIM_X, BlockReduceAlgorithm ALGORITHM = BLOCK_REDUCE_WARP_REDUCTIONS, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename ReductionOp >
__device__ __forceinline__ T cub::BlockReduce< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::Reduce ( T input, ReductionOp reduction_op )
inline
Computes a block-wide reduction for thread0 using the specified binary reduction functor. Each thread contributes one input element.
- The return value is undefined in threads other than thread0.
- For multi-dimensional blocks, threads are linearly ranked in row-major order.
- A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
- Snippet
- The code snippet below illustrates a max reduction of 128 integer items that are partitioned across 128 threads.
__global__ void ExampleKernel(...)
{
    // Specialize BlockReduce for a 1D block of 128 threads of type int
    typedef cub::BlockReduce<int, 128> BlockReduce;
    __shared__ typename BlockReduce::TempStorage temp_storage;
    int thread_data;
    ...
    // Compute the block-wide max for thread0
    int aggregate = BlockReduce(temp_storage).Reduce(thread_data, cub::Max());
}
- Template Parameters
  ReductionOp | [inferred] Binary reduction functor type having member T operator()(const T &a, const T &b)
- Parameters
  [in] | input | Calling thread's input
  [in] | reduction_op | Binary reduction functor
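As an illustration of a conforming ReductionOp, the hedged sketch below defines a hypothetical user functor (CustomMin; it, the kernel name, and the pointer names are not part of CUB) and passes it to Reduce; any callable exposing the member operator shown above works the same way:
#include <cub/block/block_reduce.cuh>

// Hypothetical user-defined reduction functor (illustrative only)
struct CustomMin
{
    __device__ __forceinline__ int operator()(const int &a, const int &b) const
    {
        return (b < a) ? b : a;
    }
};

__global__ void BlockMinKernel(const int *d_in, int *d_out)
{
    typedef cub::BlockReduce<int, 128> BlockReduce;
    __shared__ typename BlockReduce::TempStorage temp_storage;

    int thread_data = d_in[blockIdx.x * 128 + threadIdx.x];

    // thread0 receives the block-wide minimum
    int block_min = BlockReduce(temp_storage).Reduce(thread_data, CustomMin());
    if (threadIdx.x == 0)
        d_out[blockIdx.x] = block_min;
}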
template<typename T , int BLOCK_DIM_X, BlockReduceAlgorithm ALGORITHM = BLOCK_REDUCE_WARP_REDUCTIONS, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<int ITEMS_PER_THREAD, typename ReductionOp >
__device__ __forceinline__ T cub::BlockReduce< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::Reduce ( T (&inputs)[ITEMS_PER_THREAD], ReductionOp reduction_op )
inline
Computes a block-wide reduction for thread0 using the specified binary reduction functor. Each thread contributes an array of consecutive input elements.
- The return value is undefined in threads other than thread0.
- Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
- A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
- Snippet
- The code snippet below illustrates a max reduction of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.
__global__ void ExampleKernel(...)
{
    // Specialize BlockReduce for a 1D block of 128 threads of type int
    typedef cub::BlockReduce<int, 128> BlockReduce;
    __shared__ typename BlockReduce::TempStorage temp_storage;
    int thread_data[4];
    ...
    // Compute the block-wide max for thread0
    int aggregate = BlockReduce(temp_storage).Reduce(thread_data, cub::Max());
}
- Template Parameters
  ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread.
  ReductionOp | [inferred] Binary reduction functor type having member T operator()(const T &a, const T &b)
- Parameters
  [in] | inputs | Calling thread's input segment
  [in] | reduction_op | Binary reduction functor
template<typename T , int BLOCK_DIM_X, BlockReduceAlgorithm ALGORITHM = BLOCK_REDUCE_WARP_REDUCTIONS, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename ReductionOp >
__device__ __forceinline__ T cub::BlockReduce< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::Reduce ( T input, ReductionOp reduction_op, int num_valid )
inline
Computes a block-wide reduction for thread0 using the specified binary reduction functor. The first num_valid threads each contribute one input element.
- The return value is undefined in threads other than thread0.
- For multi-dimensional blocks, threads are linearly ranked in row-major order.
- A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
- Snippet
- The code snippet below illustrates a max reduction of a partially-full tile of integer items that are partitioned across 128 threads.
__global__ void ExampleKernel(int num_valid, ...)
{
    // Specialize BlockReduce for a 1D block of 128 threads of type int
    typedef cub::BlockReduce<int, 128> BlockReduce;
    __shared__ typename BlockReduce::TempStorage temp_storage;
    int thread_data;
    if (threadIdx.x < num_valid) thread_data = ...
    // Compute the block-wide max for thread0
    int aggregate = BlockReduce(temp_storage).Reduce(thread_data, cub::Max(), num_valid);
}
- Template Parameters
  ReductionOp | [inferred] Binary reduction functor type having member T operator()(const T &a, const T &b)
- Parameters
  [in] | input | Calling thread's input
  [in] | reduction_op | Binary reduction functor
  [in] | num_valid | Number of threads containing valid elements (may be less than BLOCK_THREADS)
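A typical use for the num_valid overloads is the last, partially-full tile of a larger input. The following hedged sketch (the kernel name, d_in/d_out/num_items, and the one-item-per-thread tiling are assumptions for illustration) clamps the tile size before passing it as num_valid:
#include <cub/block/block_reduce.cuh>

__global__ void PartialTileMaxKernel(const int *d_in, int num_items, int *d_out)
{
    typedef cub::BlockReduce<int, 128> BlockReduce;
    __shared__ typename BlockReduce::TempStorage temp_storage;

    int block_offset = blockIdx.x * 128;
    // Number of valid items in this block's tile (only the last tile is partial)
    int num_valid = min(128, num_items - block_offset);

    int thread_data = 0;
    if (threadIdx.x < num_valid)
        thread_data = d_in[block_offset + threadIdx.x];

    // thread0 receives the max over the first num_valid threads' inputs
    int aggregate = BlockReduce(temp_storage).Reduce(thread_data, cub::Max(), num_valid);
    if (threadIdx.x == 0)
        d_out[blockIdx.x] = aggregate;
}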
template<typename T , int BLOCK_DIM_X, BlockReduceAlgorithm ALGORITHM = BLOCK_REDUCE_WARP_REDUCTIONS, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ T cub::BlockReduce< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::Sum ( T input )
inline
Computes a block-wide reduction for thread0 using addition (+) as the reduction operator. Each thread contributes one input element.
- The return value is undefined in threads other than thread0.
- For multi-dimensional blocks, threads are linearly ranked in row-major order.
- A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
- Snippet
- The code snippet below illustrates a sum reduction of 128 integer items that are partitioned across 128 threads.
__global__ void ExampleKernel(...)
{
    // Specialize BlockReduce for a 1D block of 128 threads of type int
    typedef cub::BlockReduce<int, 128> BlockReduce;
    // Allocate shared memory for BlockReduce
    __shared__ typename BlockReduce::TempStorage temp_storage;
    // Each thread obtains an input item
    int thread_data;
    ...
    // Compute the block-wide sum for thread0
    int aggregate = BlockReduce(temp_storage).Sum(thread_data);
}
- Parameters
  [in] | input | Calling thread's input
template<typename T , int BLOCK_DIM_X, BlockReduceAlgorithm ALGORITHM = BLOCK_REDUCE_WARP_REDUCTIONS, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<int ITEMS_PER_THREAD>
__device__ __forceinline__ T cub::BlockReduce< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::Sum ( T (&inputs)[ITEMS_PER_THREAD] )
inline
Computes a block-wide reduction for thread0 using addition (+) as the reduction operator. Each thread contributes an array of consecutive input elements.
- The return value is undefined in threads other than thread0.
- Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
- A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
- Snippet
- The code snippet below illustrates a sum reduction of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.
__global__ void ExampleKernel(...)
{
    // Specialize BlockReduce for a 1D block of 128 threads of type int
    typedef cub::BlockReduce<int, 128> BlockReduce;
    // Allocate shared memory for BlockReduce
    __shared__ typename BlockReduce::TempStorage temp_storage;
    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...
    // Compute the block-wide sum for thread0
    int aggregate = BlockReduce(temp_storage).Sum(thread_data);
}
- Template Parameters
  ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread.
- Parameters
  [in] | inputs | Calling thread's input segment
template<typename T , int BLOCK_DIM_X, BlockReduceAlgorithm ALGORITHM = BLOCK_REDUCE_WARP_REDUCTIONS, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ T cub::BlockReduce< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::Sum ( T input, int num_valid )
inline
Computes a block-wide reduction for thread0 using addition (+) as the reduction operator. The first num_valid threads each contribute one input element.
- The return value is undefined in threads other than thread0.
- For multi-dimensional blocks, threads are linearly ranked in row-major order.
- A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
- Snippet
- The code snippet below illustrates a sum reduction of a partially-full tile of integer items that are partitioned across 128 threads.
__global__ void ExampleKernel(int num_valid, ...)
{
    // Specialize BlockReduce for a 1D block of 128 threads of type int
    typedef cub::BlockReduce<int, 128> BlockReduce;
    // Allocate shared memory for BlockReduce
    __shared__ typename BlockReduce::TempStorage temp_storage;
    // Each thread obtains an input item
    int thread_data;
    if (threadIdx.x < num_valid)
        thread_data = ...
    // Compute the block-wide sum for thread0
    int aggregate = BlockReduce(temp_storage).Sum(thread_data, num_valid);
}
- Parameters
  [in] | input | Calling thread's input
  [in] | num_valid | Number of threads containing valid elements (may be less than BLOCK_THREADS)