CUB
cub::BlockReduce< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH > Class Template Reference

Detailed description

template< typename T, int BLOCK_DIM_X, BlockReduceAlgorithm ALGORITHM = BLOCK_REDUCE_WARP_REDUCTIONS, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
class cub::BlockReduce< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >

The BlockReduce class provides collective methods for computing a parallel reduction of items partitioned across a CUDA thread block.

Template Parameters
T: Data type being reduced
BLOCK_DIM_X: The thread block length in threads along the X dimension
ALGORITHM: [optional] cub::BlockReduceAlgorithm enumerator specifying the underlying algorithm to use (default: cub::BLOCK_REDUCE_WARP_REDUCTIONS)
BLOCK_DIM_Y: [optional] The thread block length in threads along the Y dimension (default: 1)
BLOCK_DIM_Z: [optional] The thread block length in threads along the Z dimension (default: 1)
PTX_ARCH: [optional] The PTX compute capability for which to specialize this collective, formatted as per the __CUDA_ARCH__ macro (e.g., 350 for sm_35). Useful for determining the collective's storage requirements for a given device from the host. (Default: the value of __CUDA_ARCH__ during the current compiler pass)
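As the PTX_ARCH note suggests, fixing the architecture parameter lets host code query the collective's shared memory footprint before launch. A minimal sketch of that query (the sm_35 target and 128-thread block shape are illustrative assumptions, not part of this reference):

```cuda
#include <cub/cub.cuh>
#include <cstdio>

int main()
{
    // Pin PTX_ARCH to 350 (sm_35) so the specialization is independent of
    // whatever __CUDA_ARCH__ happens to be during the current compiler pass
    typedef cub::BlockReduce<int, 128, cub::BLOCK_REDUCE_WARP_REDUCTIONS,
                             1, 1, 350> BlockReduce;

    // The nested TempStorage type reports the shared memory requirement
    printf("TempStorage bytes: %zu\n", sizeof(typename BlockReduce::TempStorage));
    return 0;
}
```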
Overview
  • A reduction (or fold) uses a binary combining operator to compute a single aggregate from a list of input elements.
  • For multi-dimensional blocks, threads are linearly ranked in row-major order.
  • BlockReduce can be optionally specialized by algorithm to accommodate different latency/throughput workload profiles:
    1. cub::BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY. An efficient "raking" reduction algorithm that only supports commutative reduction operators. More...
    2. cub::BLOCK_REDUCE_RAKING. An efficient "raking" reduction algorithm that supports commutative and non-commutative reduction operators. More...
    3. cub::BLOCK_REDUCE_WARP_REDUCTIONS. A quick "tiled warp-reductions" reduction algorithm that supports commutative and non-commutative reduction operators. More...
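The algorithm is selected through the ALGORITHM template parameter. As a sketch, a block that only ever sums (a commutative operator) could opt into the commutative-only raking variant; the kernel name and 128-thread block shape here are illustrative:

```cuda
#include <cub/cub.cuh>

__global__ void CommutativeSumKernel(int *d_out)
{
    // Explicitly select the commutative-only raking algorithm
    typedef cub::BlockReduce<int, 128,
                             cub::BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY> BlockReduce;

    __shared__ typename BlockReduce::TempStorage temp_storage;

    // Summation is commutative, so the restricted algorithm is valid here
    int aggregate = BlockReduce(temp_storage).Sum((int) threadIdx.x);

    if (threadIdx.x == 0) *d_out = aggregate;
}
```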
Performance Considerations
  • Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
  • Very efficient (only one synchronization barrier).
  • Incurs zero bank conflicts for most types.
  • Computation is slightly more efficient (i.e., having lower instruction overhead) when:
    • Performing summation (vs. generic reduction)
    • BLOCK_THREADS is a multiple of the architecture's warp size
    • Every thread has a valid input (i.e., full vs. partial tiles)
  • See cub::BlockReduceAlgorithm for performance details regarding algorithmic alternatives.
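The BlockLoad suggestion above can be sketched as follows: load a blocked arrangement of four items per thread, then reduce them. The kernel name, tile shape, and load algorithm are illustrative assumptions:

```cuda
#include <cub/cub.cuh>

__global__ void LoadAndReduceKernel(const int *d_in, int *d_out)
{
    // Four items per thread raises granularity, per the note above
    typedef cub::BlockLoad<int, 128, 4, cub::BLOCK_LOAD_VECTORIZE> BlockLoad;
    typedef cub::BlockReduce<int, 128> BlockReduce;

    __shared__ typename BlockLoad::TempStorage   load_storage;
    __shared__ typename BlockReduce::TempStorage reduce_storage;

    // Gather 4 consecutive items per thread from this block's 512-item tile
    int thread_data[4];
    BlockLoad(load_storage).Load(d_in + blockIdx.x * 512, thread_data);

    // Reduce the blocked arrangement; only thread0's result is defined
    int aggregate = BlockReduce(reduce_storage).Sum(thread_data);

    if (threadIdx.x == 0) d_out[blockIdx.x] = aggregate;
}
```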
A Simple Example
Every thread in the block uses the BlockReduce class by first specializing the BlockReduce type, then instantiating an instance with parameters for communication, and finally invoking one or more collective member functions.
The code snippet below illustrates a sum reduction of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.
#include <cub/cub.cuh>   // or equivalently <cub/block/block_reduce.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize BlockReduce for a 1D block of 128 threads on type int
    typedef cub::BlockReduce<int, 128> BlockReduce;

    // Allocate shared memory for BlockReduce
    __shared__ typename BlockReduce::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // Compute the block-wide sum for thread0
    int aggregate = BlockReduce(temp_storage).Sum(thread_data);
}
Examples:
example_block_reduce.cu.

Classes

struct  TempStorage
 The operations exposed by BlockReduce require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse. More...
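A sketch of the union'd-storage pattern mentioned above, sharing one allocation between BlockReduce and a subsequent BlockScan (the pairing and block shape are illustrative; the barrier between the two uses is mandatory):

```cuda
#include <cub/cub.cuh>

__global__ void ReuseStorageKernel(int *d_out)
{
    typedef cub::BlockReduce<int, 128> BlockReduce;
    typedef cub::BlockScan<int, 128>   BlockScan;

    // One shared allocation serves both collectives, used sequentially
    __shared__ union {
        typename BlockReduce::TempStorage reduce;
        typename BlockScan::TempStorage   scan;
    } temp_storage;

    int thread_data = threadIdx.x;

    int aggregate = BlockReduce(temp_storage.reduce).Sum(thread_data);

    // Barrier before the union'd allocation is repurposed
    __syncthreads();

    BlockScan(temp_storage.scan).ExclusiveSum(thread_data, thread_data);

    if (threadIdx.x == 0) *d_out = aggregate;
}
```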
 

Public Methods

Collective constructors
__device__ __forceinline__ BlockReduce ()
 Collective constructor using a private static allocation of shared memory as temporary storage.
 
__device__ __forceinline__ BlockReduce (TempStorage &temp_storage)
 Collective constructor using the specified memory allocation as temporary storage. More...
 
Generic reductions
template<typename ReductionOp >
__device__ __forceinline__ T Reduce (T input, ReductionOp reduction_op)
 Computes a block-wide reduction for thread0 using the specified binary reduction functor. Each thread contributes one input element. More...
 
template<int ITEMS_PER_THREAD, typename ReductionOp >
__device__ __forceinline__ T Reduce (T(&inputs)[ITEMS_PER_THREAD], ReductionOp reduction_op)
 Computes a block-wide reduction for thread0 using the specified binary reduction functor. Each thread contributes an array of consecutive input elements. More...
 
template<typename ReductionOp >
__device__ __forceinline__ T Reduce (T input, ReductionOp reduction_op, int num_valid)
 Computes a block-wide reduction for thread0 using the specified binary reduction functor. The first num_valid threads each contribute one input element. More...
 
Summation reductions
__device__ __forceinline__ T Sum (T input)
 Computes a block-wide reduction for thread0 using addition (+) as the reduction operator. Each thread contributes one input element. More...
 
template<int ITEMS_PER_THREAD>
__device__ __forceinline__ T Sum (T(&inputs)[ITEMS_PER_THREAD])
 Computes a block-wide reduction for thread0 using addition (+) as the reduction operator. Each thread contributes an array of consecutive input elements. More...
 
__device__ __forceinline__ T Sum (T input, int num_valid)
 Computes a block-wide reduction for thread0 using addition (+) as the reduction operator. The first num_valid threads each contribute one input element. More...
 

Constructor & Destructor Documentation

template<typename T , int BLOCK_DIM_X, BlockReduceAlgorithm ALGORITHM = BLOCK_REDUCE_WARP_REDUCTIONS, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ cub::BlockReduce< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::BlockReduce ( TempStorage &temp_storage )
inline

Collective constructor using the specified memory allocation as temporary storage.

Parameters
[in] temp_storage: Reference to memory allocation having layout type TempStorage

Member Function Documentation

template<typename T , int BLOCK_DIM_X, BlockReduceAlgorithm ALGORITHM = BLOCK_REDUCE_WARP_REDUCTIONS, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename ReductionOp >
__device__ __forceinline__ T cub::BlockReduce< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::Reduce ( T  input,
ReductionOp  reduction_op 
)
inline

Computes a block-wide reduction for thread0 using the specified binary reduction functor. Each thread contributes one input element.

  • The return value is undefined in threads other than thread0.
  • For multi-dimensional blocks, threads are linearly ranked in row-major order.
  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates a max reduction of 128 integer items that are partitioned across 128 threads.
#include <cub/cub.cuh>   // or equivalently <cub/block/block_reduce.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize BlockReduce for a 1D block of 128 threads on type int
    typedef cub::BlockReduce<int, 128> BlockReduce;

    // Allocate shared memory for BlockReduce
    __shared__ typename BlockReduce::TempStorage temp_storage;

    // Each thread obtains an input item
    int thread_data;
    ...

    // Compute the block-wide max for thread0
    int aggregate = BlockReduce(temp_storage).Reduce(thread_data, cub::Max());
}
Template Parameters
ReductionOp: [inferred] Binary reduction functor type having member T operator()(const T &a, const T &b)
Parameters
[in] input: Calling thread's input
[in] reduction_op: Binary reduction functor
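The __syncthreads() requirement noted above applies whenever the same temporary storage backs consecutive collective calls; a sketch (the kernel name and block shape are illustrative):

```cuda
#include <cub/cub.cuh>

__global__ void TwoReductionsKernel(int *d_out)
{
    typedef cub::BlockReduce<int, 128> BlockReduce;
    __shared__ typename BlockReduce::TempStorage temp_storage;

    int thread_data = threadIdx.x;

    // First collective call: block-wide max for thread0
    int block_max = BlockReduce(temp_storage).Reduce(thread_data, cub::Max());

    // Barrier before temp_storage is reused by the second call
    __syncthreads();

    // Second collective call: block-wide min for thread0
    int block_min = BlockReduce(temp_storage).Reduce(thread_data, cub::Min());

    if (threadIdx.x == 0) { d_out[0] = block_max; d_out[1] = block_min; }
}
```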
template<typename T , int BLOCK_DIM_X, BlockReduceAlgorithm ALGORITHM = BLOCK_REDUCE_WARP_REDUCTIONS, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<int ITEMS_PER_THREAD, typename ReductionOp >
__device__ __forceinline__ T cub::BlockReduce< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::Reduce ( T (&inputs)[ITEMS_PER_THREAD],
ReductionOp  reduction_op 
)
inline

Computes a block-wide reduction for thread0 using the specified binary reduction functor. Each thread contributes an array of consecutive input elements.

  • The return value is undefined in threads other than thread0.
  • Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates a max reduction of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.
#include <cub/cub.cuh>   // or equivalently <cub/block/block_reduce.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize BlockReduce for a 1D block of 128 threads on type int
    typedef cub::BlockReduce<int, 128> BlockReduce;

    // Allocate shared memory for BlockReduce
    __shared__ typename BlockReduce::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // Compute the block-wide max for thread0
    int aggregate = BlockReduce(temp_storage).Reduce(thread_data, cub::Max());
}
Template Parameters
ITEMS_PER_THREAD: [inferred] The number of consecutive items partitioned onto each thread.
ReductionOp: [inferred] Binary reduction functor type having member T operator()(const T &a, const T &b)
Parameters
[in] inputs: Calling thread's input segment
[in] reduction_op: Binary reduction functor
template<typename T , int BLOCK_DIM_X, BlockReduceAlgorithm ALGORITHM = BLOCK_REDUCE_WARP_REDUCTIONS, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename ReductionOp >
__device__ __forceinline__ T cub::BlockReduce< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::Reduce ( T  input,
ReductionOp  reduction_op,
int  num_valid 
)
inline

Computes a block-wide reduction for thread0 using the specified binary reduction functor. The first num_valid threads each contribute one input element.

  • The return value is undefined in threads other than thread0.
  • For multi-dimensional blocks, threads are linearly ranked in row-major order.
  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates a max reduction of a partially-full tile of integer items that are partitioned across 128 threads.
#include <cub/cub.cuh>   // or equivalently <cub/block/block_reduce.cuh>

__global__ void ExampleKernel(int num_valid, ...)
{
    // Specialize BlockReduce for a 1D block of 128 threads on type int
    typedef cub::BlockReduce<int, 128> BlockReduce;

    // Allocate shared memory for BlockReduce
    __shared__ typename BlockReduce::TempStorage temp_storage;

    // Each thread obtains an input item
    int thread_data;
    if (threadIdx.x < num_valid) thread_data = ...

    // Compute the block-wide max for thread0
    int aggregate = BlockReduce(temp_storage).Reduce(thread_data, cub::Max(), num_valid);
}
Template Parameters
ReductionOp: [inferred] Binary reduction functor type having member T operator()(const T &a, const T &b)
Parameters
[in] input: Calling thread's input
[in] reduction_op: Binary reduction functor
[in] num_valid: Number of threads containing valid elements (may be less than BLOCK_THREADS)
template<typename T , int BLOCK_DIM_X, BlockReduceAlgorithm ALGORITHM = BLOCK_REDUCE_WARP_REDUCTIONS, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ T cub::BlockReduce< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::Sum ( T  input )
inline

Computes a block-wide reduction for thread0 using addition (+) as the reduction operator. Each thread contributes one input element.

  • The return value is undefined in threads other than thread0.
  • For multi-dimensional blocks, threads are linearly ranked in row-major order.
  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates a sum reduction of 128 integer items that are partitioned across 128 threads.
#include <cub/cub.cuh>   // or equivalently <cub/block/block_reduce.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize BlockReduce for a 1D block of 128 threads on type int
    typedef cub::BlockReduce<int, 128> BlockReduce;

    // Allocate shared memory for BlockReduce
    __shared__ typename BlockReduce::TempStorage temp_storage;

    // Each thread obtains an input item
    int thread_data;
    ...

    // Compute the block-wide sum for thread0
    int aggregate = BlockReduce(temp_storage).Sum(thread_data);
}
Parameters
[in] input: Calling thread's input
template<typename T , int BLOCK_DIM_X, BlockReduceAlgorithm ALGORITHM = BLOCK_REDUCE_WARP_REDUCTIONS, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<int ITEMS_PER_THREAD>
__device__ __forceinline__ T cub::BlockReduce< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::Sum ( T (&inputs)[ITEMS_PER_THREAD] )
inline

Computes a block-wide reduction for thread0 using addition (+) as the reduction operator. Each thread contributes an array of consecutive input elements.

  • The return value is undefined in threads other than thread0.
  • Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates a sum reduction of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.
#include <cub/cub.cuh>   // or equivalently <cub/block/block_reduce.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize BlockReduce for a 1D block of 128 threads on type int
    typedef cub::BlockReduce<int, 128> BlockReduce;

    // Allocate shared memory for BlockReduce
    __shared__ typename BlockReduce::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // Compute the block-wide sum for thread0
    int aggregate = BlockReduce(temp_storage).Sum(thread_data);
}
Template Parameters
ITEMS_PER_THREAD: [inferred] The number of consecutive items partitioned onto each thread.
Parameters
[in] inputs: Calling thread's input segment
template<typename T , int BLOCK_DIM_X, BlockReduceAlgorithm ALGORITHM = BLOCK_REDUCE_WARP_REDUCTIONS, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ T cub::BlockReduce< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::Sum ( T  input,
int  num_valid 
)
inline

Computes a block-wide reduction for thread0 using addition (+) as the reduction operator. The first num_valid threads each contribute one input element.

  • The return value is undefined in threads other than thread0.
  • For multi-dimensional blocks, threads are linearly ranked in row-major order.
  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates a sum reduction of a partially-full tile of integer items that are partitioned across 128 threads.
#include <cub/cub.cuh>   // or equivalently <cub/block/block_reduce.cuh>

__global__ void ExampleKernel(int num_valid, ...)
{
    // Specialize BlockReduce for a 1D block of 128 threads on type int
    typedef cub::BlockReduce<int, 128> BlockReduce;

    // Allocate shared memory for BlockReduce
    __shared__ typename BlockReduce::TempStorage temp_storage;

    // Each thread obtains an input item (up to num_valid)
    int thread_data;
    if (threadIdx.x < num_valid)
        thread_data = ...

    // Compute the block-wide sum for thread0
    int aggregate = BlockReduce(temp_storage).Sum(thread_data, num_valid);
}
Parameters
[in] input: Calling thread's input
[in] num_valid: Number of threads containing valid elements (may be less than BLOCK_THREADS)

The documentation for this class was generated from the following file: block_reduce.cuh