CUB
Classes | Enumerations | Functions
cub Namespace Reference

Optional outer namespace(s) More...

Classes

struct  CachingDeviceAllocator
 A simple caching allocator for device memory allocations. More...
 
struct  If
 Type selection (IF ? ThenType : ElseType) More...
 
struct  Equals
 Type equality test. More...
 
struct  Log2
 Statically determine log2(N), rounded up. More...
 
struct  PowerOfTwo
 Statically determine if N is a power-of-two. More...
 
struct  IsPointer
 Pointer vs. iterator. More...
 
struct  IsVolatile
 Volatile modifier test. More...
 
struct  RemoveQualifiers
 Removes const and volatile qualifiers from type Tp. More...
 
class  ArgIndexInputIterator
 A random-access input wrapper for pairing dereferenced values with their corresponding indices (forming KeyValuePair tuples). More...
 
class  CacheModifiedInputIterator
 A random-access input wrapper for dereferencing array values using a PTX cache load modifier. More...
 
class  CacheModifiedOutputIterator
 A random-access output wrapper for storing array values using a PTX cache-modifier. More...
 
class  ConstantInputIterator
 A random-access input generator for dereferencing a sequence of homogeneous values. More...
 
class  CountingInputIterator
 A random-access input generator for dereferencing a sequence of incrementing integer values. More...
 
class  TexObjInputIterator
 A random-access input wrapper for dereferencing array values through texture cache. Uses newer Kepler-style texture objects. More...
 
class  TexRefInputIterator
 A random-access input wrapper for dereferencing array values through texture cache. Uses older Tesla/Fermi-style texture references. More...
 
class  TransformInputIterator
 A random-access input wrapper for transforming dereferenced values. More...
 
struct  Equality
 Default equality functor. More...
 
struct  Inequality
 Default inequality functor. More...
 
struct  InequalityWrapper
 Inequality functor (wraps equality functor) More...
 
struct  Sum
 Default sum functor. More...
 
struct  Max
 Default max functor. More...
 
struct  ArgMax
 Arg max functor (keeps the value and offset of the first occurrence of the larger item) More...
 
struct  Min
 Default min functor. More...
 
struct  ArgMin
 Arg min functor (keeps the value and offset of the first occurrence of the smallest item) More...
 
struct  CastOp
 Default cast functor. More...
 
class  SwizzleScanOp
 Binary operator wrapper for switching non-commutative scan arguments. More...
 
struct  ReduceBySegmentOp
 Reduce-by-segment functor. More...
 
struct  ReduceByKeyOp
 Reduce-by-key functor (binary reduction operator to apply to values). More...
 
class  BlockDiscontinuity
 The BlockDiscontinuity class provides collective methods for flagging discontinuities within an ordered set of items partitioned across a CUDA thread block.

More...
 
class  BlockExchange
 The BlockExchange class provides collective methods for rearranging data partitioned across a CUDA thread block.

More...
 
class  BlockHistogram
 The BlockHistogram class provides collective methods for constructing block-wide histograms from data samples partitioned across a CUDA thread block.

More...
 
class  BlockLoad
 The BlockLoad class provides collective data movement methods for loading a linear segment of items from memory into a blocked arrangement across a CUDA thread block.

More...
 
class  BlockRadixSort
 The BlockRadixSort class provides collective methods for sorting items partitioned across a CUDA thread block using a radix sorting method.

More...
 
class  BlockReduce
 The BlockReduce class provides collective methods for computing a parallel reduction of items partitioned across a CUDA thread block.

More...
 
class  BlockScan
 The BlockScan class provides collective methods for computing a parallel prefix sum/scan of items partitioned across a CUDA thread block.

More...
 
class  BlockStore
 The BlockStore class provides collective data movement methods for writing a blocked arrangement of items partitioned across a CUDA thread block to a linear segment of memory.

More...
 
struct  DeviceHistogram
 DeviceHistogram provides device-wide parallel operations for constructing histogram(s) from a sequence of samples data residing within device-accessible memory.

More...
 
struct  DevicePartition
 DevicePartition provides device-wide, parallel operations for partitioning sequences of data items residing within device-accessible memory.

More...
 
struct  DeviceRadixSort
 DeviceRadixSort provides device-wide, parallel operations for computing a radix sort across a sequence of data items residing within device-accessible memory.

More...
 
struct  DeviceReduce
 DeviceReduce provides device-wide, parallel operations for computing a reduction across a sequence of data items residing within device-accessible memory.

More...
 
struct  DeviceRunLengthEncode
 DeviceRunLengthEncode provides device-wide, parallel operations for demarcating "runs" of same-valued items within a sequence residing within device-accessible memory.

More...
 
struct  DeviceScan
 DeviceScan provides device-wide, parallel operations for computing a prefix scan across a sequence of data items residing within device-accessible memory.

More...
 
struct  DeviceSegmentedRadixSort
 DeviceSegmentedRadixSort provides device-wide, parallel operations for computing a batched radix sort across multiple, non-overlapping sequences of data items residing within device-accessible memory.

More...
 
struct  DeviceSegmentedReduce
 DeviceSegmentedReduce provides device-wide, parallel operations for computing a reduction across multiple sequences of data items residing within device-accessible memory.

More...
 
struct  DeviceSelect
 DeviceSelect provides device-wide, parallel operations for compacting selected items from sequences of data items residing within device-accessible memory.

More...
 
struct  DeviceSpmv
 DeviceSpmv provides device-wide parallel operations for performing sparse-matrix * dense-vector multiplication (SpMV). More...
 
class  WarpScan
 The WarpScan class provides collective methods for computing a parallel prefix scan of items partitioned across a CUDA thread warp.

More...
 
class  WarpReduce
 The WarpReduce class provides collective methods for computing a parallel reduction of items partitioned across a CUDA thread warp.

More...
 

Enumerations

enum  CacheLoadModifier {
  LOAD_DEFAULT, LOAD_CA, LOAD_CG, LOAD_CS,
  LOAD_CV, LOAD_LDG, LOAD_VOLATILE
}
 Enumeration of cache modifiers for memory load operations. More...
 
enum  CacheStoreModifier {
  STORE_DEFAULT, STORE_WB, STORE_CG, STORE_CS,
  STORE_WT, STORE_VOLATILE
}
 Enumeration of cache modifiers for memory store operations. More...
 
enum  BlockHistogramAlgorithm { BLOCK_HISTO_SORT, BLOCK_HISTO_ATOMIC }
 BlockHistogramAlgorithm enumerates alternative algorithms for the parallel construction of block-wide histograms. More...
 
enum  BlockLoadAlgorithm {
  BLOCK_LOAD_DIRECT, BLOCK_LOAD_VECTORIZE, BLOCK_LOAD_TRANSPOSE, BLOCK_LOAD_WARP_TRANSPOSE,
  BLOCK_LOAD_WARP_TRANSPOSE_TIMESLICED
}
 cub::BlockLoadAlgorithm enumerates alternative algorithms for cub::BlockLoad to read a linear segment of data from memory into a blocked arrangement across a CUDA thread block. More...
 
enum  BlockReduceAlgorithm { BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY, BLOCK_REDUCE_RAKING, BLOCK_REDUCE_WARP_REDUCTIONS }
 BlockReduceAlgorithm enumerates alternative algorithms for parallel reduction across a CUDA thread block. More...
 
enum  BlockScanAlgorithm { BLOCK_SCAN_RAKING, BLOCK_SCAN_RAKING_MEMOIZE, BLOCK_SCAN_WARP_SCANS }
 BlockScanAlgorithm enumerates alternative algorithms for cub::BlockScan to compute a parallel prefix scan across a CUDA thread block. More...
 
enum  BlockStoreAlgorithm {
  BLOCK_STORE_DIRECT, BLOCK_STORE_VECTORIZE, BLOCK_STORE_TRANSPOSE, BLOCK_STORE_WARP_TRANSPOSE,
  BLOCK_STORE_WARP_TRANSPOSE_TIMESLICED
}
 cub::BlockStoreAlgorithm enumerates alternative algorithms for cub::BlockStore to write a blocked arrangement of items across a CUDA thread block to a linear segment of memory. More...
 

Functions

__host__ __device__
__forceinline__ cudaError_t 
Debug (cudaError_t error, const char *filename, int line)
 CUB error reporting macro (prints error messages to stderr) More...
 
CUB_RUNTIME_FUNCTION
__forceinline__ cudaError_t 
PtxVersion (int &ptx_version)
 Retrieves the PTX version that will be used on the current device (major * 100 + minor * 10) More...
 
CUB_RUNTIME_FUNCTION
__forceinline__ cudaError_t 
SmVersion (int &sm_version, int device_ordinal)
 Retrieves the SM version (major * 100 + minor * 10)
 
__device__ __forceinline__
unsigned int 
SHR_ADD (unsigned int x, unsigned int shift, unsigned int addend)
 Shift-right then add. Returns (x >> shift) + addend.
 
__device__ __forceinline__
unsigned int 
SHL_ADD (unsigned int x, unsigned int shift, unsigned int addend)
 Shift-left then add. Returns (x << shift) + addend.
 
template<typename UnsignedBits >
__device__ __forceinline__
unsigned int 
BFE (UnsignedBits source, unsigned int bit_start, unsigned int num_bits)
 Bitfield-extract. Extracts num_bits from source starting at bit-offset bit_start. The input source may be an 8b, 16b, 32b, or 64b unsigned integer type.
 
__device__ __forceinline__ void BFI (unsigned int &ret, unsigned int x, unsigned int y, unsigned int bit_start, unsigned int num_bits)
 Bitfield insert. Inserts the num_bits least significant bits of y into x at bit-offset bit_start.
 
__device__ __forceinline__
unsigned int 
IADD3 (unsigned int x, unsigned int y, unsigned int z)
 Three-operand add. Returns x + y + z.
 
__device__ __forceinline__ int PRMT (unsigned int a, unsigned int b, unsigned int index)
 Byte-permute. Pick four arbitrary bytes from two 32-bit registers, and reassemble them into a 32-bit destination register. For SM2.0 or later. More...
 
__device__ __forceinline__ void ThreadExit ()
 Terminates the calling thread.
 
__device__ __forceinline__ void ThreadTrap ()
 Abort execution and generate an interrupt to the host CPU.
 
__device__ __forceinline__ int RowMajorTid (int block_dim_x, int block_dim_y, int block_dim_z)
 Returns the row-major linear thread identifier for a multidimensional thread block.
 
__device__ __forceinline__
unsigned int 
LaneId ()
 Returns the warp lane ID of the calling thread.
 
__device__ __forceinline__
unsigned int 
WarpId ()
 Returns the warp ID of the calling thread. Warp ID is guaranteed to be unique among warps, but may not correspond to a zero-based ranking within the thread block.
 
__device__ __forceinline__
unsigned int 
LaneMaskLt ()
 Returns the warp lane mask of all lanes less than the calling thread.
 
__device__ __forceinline__
unsigned int 
LaneMaskLe ()
 Returns the warp lane mask of all lanes less than or equal to the calling thread.
 
__device__ __forceinline__
unsigned int 
LaneMaskGt ()
 Returns the warp lane mask of all lanes greater than the calling thread.
 
__device__ __forceinline__
unsigned int 
LaneMaskGe ()
 Returns the warp lane mask of all lanes greater than or equal to the calling thread.
 
template<int LOGICAL_WARP_THREADS, typename T >
__device__ __forceinline__ T ShuffleUp (T input, int src_offset, int first_thread, unsigned int member_mask)
 Shuffle-up for any data type. Each warp-lane i obtains the value input contributed by warp-lane i-src_offset. For thread lanes i < src_offset, the thread's own input is returned to the thread.

More...
 
template<int LOGICAL_WARP_THREADS, typename T >
__device__ __forceinline__ T ShuffleDown (T input, int src_offset, int last_thread, unsigned int member_mask)
 Shuffle-down for any data type. Each warp-lane i obtains the value input contributed by warp-lane i+src_offset. For thread lanes i >= WARP_THREADS, the thread's own input is returned to the thread.

More...
 
template<int LOGICAL_WARP_THREADS, typename T >
__device__ __forceinline__ T ShuffleIndex (T input, int src_lane, unsigned int member_mask)
 Shuffle-broadcast for any data type. Each warp-lane i obtains the value input contributed by warp-lane src_lane. If src_lane < 0 or src_lane >= WARP_THREADS, the thread's own input is returned to the thread.

More...
 
template<int LABEL_BITS>
__device__ unsigned int MatchAny (unsigned int label)
 Compute a 32b mask of threads having the same least-significant LABEL_BITS of label as the calling thread. More...
 
Thread I/O (cache modified)
template<CacheLoadModifier MODIFIER, typename InputIteratorT >
__device__ __forceinline__
std::iterator_traits
< InputIteratorT >::value_type 
ThreadLoad (InputIteratorT itr)
 Thread utility for reading memory using cub::CacheLoadModifier cache modifiers. Can be used to load any data type. More...
 
template<CacheStoreModifier MODIFIER, typename OutputIteratorT , typename T >
__device__ __forceinline__ void ThreadStore (OutputIteratorT itr, T val)
 Thread utility for writing memory using cub::CacheStoreModifier cache modifiers. Can be used to store any data type. More...
 
Blocked arrangement I/O (direct)
template<typename InputT , int ITEMS_PER_THREAD, typename InputIteratorT >
__device__ __forceinline__ void LoadDirectBlocked (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD])
 Load a linear segment of items into a blocked arrangement across the thread block. More...
 
template<typename InputT , int ITEMS_PER_THREAD, typename InputIteratorT >
__device__ __forceinline__ void LoadDirectBlocked (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items)
 Load a linear segment of items into a blocked arrangement across the thread block, guarded by range. More...
 
template<typename InputT , typename DefaultT , int ITEMS_PER_THREAD, typename InputIteratorT >
__device__ __forceinline__ void LoadDirectBlocked (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items, DefaultT oob_default)
 Load a linear segment of items into a blocked arrangement across the thread block, guarded by range, with a fall-back assignment of out-of-bound elements. More...
 
template<typename T , int ITEMS_PER_THREAD>
__device__ __forceinline__ void LoadDirectBlockedVectorized (int linear_tid, T *block_ptr, T(&items)[ITEMS_PER_THREAD])
 Load a linear segment of items into a blocked arrangement across the thread block. More...
 
template<typename T , int ITEMS_PER_THREAD, typename OutputIteratorT >
__device__ __forceinline__ void StoreDirectBlocked (int linear_tid, OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD])
 Store a blocked arrangement of items across a thread block into a linear segment of items. More...
 
template<typename T , int ITEMS_PER_THREAD, typename OutputIteratorT >
__device__ __forceinline__ void StoreDirectBlocked (int linear_tid, OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD], int valid_items)
 Store a blocked arrangement of items across a thread block into a linear segment of items, guarded by range. More...
 
template<typename T , int ITEMS_PER_THREAD>
__device__ __forceinline__ void StoreDirectBlockedVectorized (int linear_tid, T *block_ptr, T(&items)[ITEMS_PER_THREAD])
 Store a blocked arrangement of items across a thread block into a linear segment of items. More...
 
Striped arrangement I/O (direct)
template<int BLOCK_THREADS, typename InputT , int ITEMS_PER_THREAD, typename InputIteratorT >
__device__ __forceinline__ void LoadDirectStriped (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD])
 Load a linear segment of items into a striped arrangement across the thread block. More...
 
template<int BLOCK_THREADS, typename InputT , int ITEMS_PER_THREAD, typename InputIteratorT >
__device__ __forceinline__ void LoadDirectStriped (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items)
 Load a linear segment of items into a striped arrangement across the thread block, guarded by range. More...
 
template<int BLOCK_THREADS, typename InputT , typename DefaultT , int ITEMS_PER_THREAD, typename InputIteratorT >
__device__ __forceinline__ void LoadDirectStriped (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items, DefaultT oob_default)
 Load a linear segment of items into a striped arrangement across the thread block, guarded by range, with a fall-back assignment of out-of-bound elements. More...
 
template<int BLOCK_THREADS, typename T , int ITEMS_PER_THREAD, typename OutputIteratorT >
__device__ __forceinline__ void StoreDirectStriped (int linear_tid, OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD])
 Store a striped arrangement of data across the thread block into a linear segment of items. More...
 
template<int BLOCK_THREADS, typename T , int ITEMS_PER_THREAD, typename OutputIteratorT >
__device__ __forceinline__ void StoreDirectStriped (int linear_tid, OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD], int valid_items)
 Store a striped arrangement of data across the thread block into a linear segment of items, guarded by range. More...
 
Warp-striped arrangement I/O (direct)
template<typename InputT , int ITEMS_PER_THREAD, typename InputIteratorT >
__device__ __forceinline__ void LoadDirectWarpStriped (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD])
 Load a linear segment of items into a warp-striped arrangement across the thread block. More...
 
template<typename InputT , int ITEMS_PER_THREAD, typename InputIteratorT >
__device__ __forceinline__ void LoadDirectWarpStriped (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items)
 Load a linear segment of items into a warp-striped arrangement across the thread block, guarded by range. More...
 
template<typename InputT , typename DefaultT , int ITEMS_PER_THREAD, typename InputIteratorT >
__device__ __forceinline__ void LoadDirectWarpStriped (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items, DefaultT oob_default)
 Load a linear segment of items into a warp-striped arrangement across the thread block, guarded by range, with a fall-back assignment of out-of-bound elements. More...
 
template<typename T , int ITEMS_PER_THREAD, typename OutputIteratorT >
__device__ __forceinline__ void StoreDirectWarpStriped (int linear_tid, OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD])
 Store a warp-striped arrangement of data across the thread block into a linear segment of items. More...
 
template<typename T , int ITEMS_PER_THREAD, typename OutputIteratorT >
__device__ __forceinline__ void StoreDirectWarpStriped (int linear_tid, OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD], int valid_items)
 Store a warp-striped arrangement of data across the thread block into a linear segment of items, guarded by range. More...
 

Detailed Description

Optional outer namespace(s)

CUB namespace

Enumeration Type Documentation

BlockHistogramAlgorithm enumerates alternative algorithms for the parallel construction of block-wide histograms.

Enumerator
BLOCK_HISTO_SORT 
Overview
Sorting followed by differentiation. Execution is comprised of two phases:
  1. Sort the data using efficient radix sort
  2. Look for "runs" of same-valued keys by detecting discontinuities; the run-lengths are histogram bin counts.
Performance Considerations
Delivers consistent throughput regardless of sample bin distribution.
BLOCK_HISTO_ATOMIC 
Overview
Use atomic addition to update byte counts directly
Performance Considerations
Performance is strongly tied to the hardware implementation of atomic addition, and may be significantly degraded for non uniformly-random input distributions where many concurrent updates are likely to be made to the same bin counter.

cub::BlockLoadAlgorithm enumerates alternative algorithms for cub::BlockLoad to read a linear segment of data from memory into a blocked arrangement across a CUDA thread block.

Enumerator
BLOCK_LOAD_DIRECT 
Overview

A blocked arrangement of data is read directly from memory.

Performance Considerations
  • The utilization of memory transactions (coalescing) decreases as the access stride between threads (i.e., the number of items per thread) increases.
BLOCK_LOAD_VECTORIZE 
Overview

A blocked arrangement of data is read from memory using CUDA's built-in vectorized loads as a coalescing optimization. For example, ld.global.v4.s32 instructions will be generated when T = int and ITEMS_PER_THREAD % 4 == 0.

Performance Considerations
  • The utilization of memory transactions (coalescing) remains high until the access stride between threads (i.e., the number of items per thread) exceeds the maximum vector load width (typically 4 items or 64B, whichever is lower).
  • The following conditions will prevent vectorization and loading will fall back to cub::BLOCK_LOAD_DIRECT:
    • ITEMS_PER_THREAD is odd
    • The InputIteratorT is not a simple pointer type
    • The block input offset is not quadword-aligned
    • The data type T is not a built-in primitive or CUDA vector type (e.g., short, int2, double, float2, etc.)
BLOCK_LOAD_TRANSPOSE 
Overview

A striped arrangement of data is read efficiently from memory and then locally transposed into a blocked arrangement.

Performance Considerations
  • The utilization of memory transactions (coalescing) remains high regardless of items loaded per thread.
  • The local reordering incurs slightly longer latencies and lower throughput than the direct cub::BLOCK_LOAD_DIRECT and cub::BLOCK_LOAD_VECTORIZE alternatives.
BLOCK_LOAD_WARP_TRANSPOSE 
Overview

A warp-striped arrangement of data is read efficiently from memory and then locally transposed into a blocked arrangement.

Usage Considerations
  • BLOCK_THREADS must be a multiple of WARP_THREADS
Performance Considerations
  • The utilization of memory transactions (coalescing) remains high regardless of items loaded per thread.
  • The local reordering incurs slightly larger latencies than the direct cub::BLOCK_LOAD_DIRECT and cub::BLOCK_LOAD_VECTORIZE alternatives.
  • Provisions more shared storage, but incurs smaller latencies than the BLOCK_LOAD_WARP_TRANSPOSE_TIMESLICED alternative.
BLOCK_LOAD_WARP_TRANSPOSE_TIMESLICED 
Overview

Like BLOCK_LOAD_WARP_TRANSPOSE, a warp-striped arrangement of data is read directly from memory and then is locally transposed into a blocked arrangement. To reduce the shared memory requirement, only one warp's worth of shared memory is provisioned and is subsequently time-sliced among warps.

Usage Considerations
  • BLOCK_THREADS must be a multiple of WARP_THREADS
Performance Considerations
  • The utilization of memory transactions (coalescing) remains high regardless of items loaded per thread.
  • Provisions less shared memory temporary storage, but incurs larger latencies than the BLOCK_LOAD_WARP_TRANSPOSE alternative.

BlockReduceAlgorithm enumerates alternative algorithms for parallel reduction across a CUDA thread block.

Enumerator
BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY 
Overview
An efficient "raking" reduction algorithm that only supports commutative reduction operators (true for most operations, e.g., addition).
Execution is comprised of three phases:
  1. Upsweep sequential reduction in registers (if threads contribute more than one input each). Threads in warps other than the first warp place their partial reductions into shared memory.
  2. Upsweep sequential reduction in shared memory. Threads within the first warp continue to accumulate by raking across segments of shared partial reductions
  3. A warp-synchronous Kogge-Stone style reduction within the raking warp.
Figure: BLOCK_REDUCE_RAKING data flow for a hypothetical 16-thread thread block and 4-thread raking warp.
Performance Considerations
  • This variant performs less communication than BLOCK_REDUCE_RAKING and is preferable when the reduction operator is commutative. It applies fewer reduction operators than BLOCK_REDUCE_WARP_REDUCTIONS, and can provide higher overall throughput across the GPU when suitably occupied. However, turn-around latency may be higher than that of BLOCK_REDUCE_WARP_REDUCTIONS, making it less desirable when the GPU is under-occupied.
BLOCK_REDUCE_RAKING 
Overview
An efficient "raking" reduction algorithm that supports commutative (e.g., addition) and non-commutative (e.g., string concatenation) reduction operators. Assumes a blocked arrangement of (block-threads * items-per-thread) items across the thread block, where thread i owns the ith range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.
Execution is comprised of three phases:
  1. Upsweep sequential reduction in registers (if threads contribute more than one input each). Each thread then places the partial reduction of its item(s) into shared memory.
  2. Upsweep sequential reduction in shared memory. Threads within a single warp rake across segments of shared partial reductions.
  3. A warp-synchronous Kogge-Stone style reduction within the raking warp.
Figure: BLOCK_REDUCE_RAKING data flow for a hypothetical 16-thread thread block and 4-thread raking warp.
Performance Considerations
  • This variant performs more communication than BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY and is only preferable when the reduction operator is non-commutative. It applies fewer reduction operators than BLOCK_REDUCE_WARP_REDUCTIONS, and can provide higher overall throughput across the GPU when suitably occupied. However, turn-around latency may be higher than that of BLOCK_REDUCE_WARP_REDUCTIONS, making it less desirable when the GPU is under-occupied.
BLOCK_REDUCE_WARP_REDUCTIONS 
Overview
A quick "tiled warp-reductions" reduction algorithm that supports commutative (e.g., addition) and non-commutative (e.g., string concatenation) reduction operators.
Execution is comprised of three phases:
  1. Upsweep sequential reduction in registers (if threads contribute more than one input each). Each thread then places the partial reduction of its item(s) into shared memory.
  2. Compute a shallow, but inefficient warp-synchronous Kogge-Stone style reduction within each warp.
  3. A propagation phase where the warp reduction outputs in each warp are updated with the aggregate from each preceding warp.
Figure: BLOCK_REDUCE_WARP_REDUCTIONS data flow for a hypothetical 16-thread thread block and 4-thread raking warp.
Performance Considerations
  • This variant applies more reduction operators than BLOCK_REDUCE_RAKING or BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY, which may result in lower overall throughput across the GPU. However, turn-around latency may be lower, making it useful when the GPU is under-occupied.

BlockScanAlgorithm enumerates alternative algorithms for cub::BlockScan to compute a parallel prefix scan across a CUDA thread block.

Enumerator
BLOCK_SCAN_RAKING 
Overview
An efficient "raking reduce-then-scan" prefix scan algorithm. Execution is comprised of five phases:
  1. Upsweep sequential reduction in registers (if threads contribute more than one input each). Each thread then places the partial reduction of its item(s) into shared memory.
  2. Upsweep sequential reduction in shared memory. Threads within a single warp rake across segments of shared partial reductions.
  3. A warp-synchronous Kogge-Stone style exclusive scan within the raking warp.
  4. Downsweep sequential exclusive scan in shared memory. Threads within a single warp rake across segments of shared partial reductions, seeded with the warp-scan output.
  5. Downsweep sequential scan in registers (if threads contribute more than one input), seeded with the raking scan output.
Figure: BLOCK_SCAN_RAKING data flow for a hypothetical 16-thread thread block and 4-thread raking warp.
Performance Considerations
  • Although this variant may suffer longer turnaround latencies when the GPU is under-occupied, it can often provide higher overall throughput across the GPU when suitably occupied.
BLOCK_SCAN_RAKING_MEMOIZE 
Overview
Similar to cub::BLOCK_SCAN_RAKING, but with fewer shared memory reads at the expense of higher register pressure. Raking threads preserve their "upsweep" segment of values in registers while performing warp-synchronous scan, allowing the "downsweep" not to re-read them from shared memory.
BLOCK_SCAN_WARP_SCANS 
Overview
A quick "tiled warpscans" prefix scan algorithm. Execution is comprised of four phases:
  1. Upsweep sequential reduction in registers (if threads contribute more than one input each). Each thread then places the partial reduction of its item(s) into shared memory.
  2. Compute a shallow, but inefficient warp-synchronous Kogge-Stone style scan within each warp.
  3. A propagation phase where the warp scan outputs in each warp are updated with the aggregate from each preceding warp.
  4. Downsweep sequential scan in registers (if threads contribute more than one input), seeded with the raking scan output.
Figure: BLOCK_SCAN_WARP_SCANS data flow for a hypothetical 16-thread thread block and 4-thread raking warp.
Performance Considerations
  • Although this variant may suffer lower overall throughput across the GPU due to a heavy reliance on inefficient warpscans, it can often provide lower turnaround latencies when the GPU is under-occupied.

cub::BlockStoreAlgorithm enumerates alternative algorithms for cub::BlockStore to write a blocked arrangement of items across a CUDA thread block to a linear segment of memory.

Enumerator
BLOCK_STORE_DIRECT 
Overview

A blocked arrangement of data is written directly to memory.

Performance Considerations
  • The utilization of memory transactions (coalescing) decreases as the access stride between threads (i.e., the number of items per thread) increases.
BLOCK_STORE_VECTORIZE 
Overview

A blocked arrangement of data is written directly to memory using CUDA's built-in vectorized stores as a coalescing optimization. For example, st.global.v4.s32 instructions will be generated when T = int and ITEMS_PER_THREAD % 4 == 0.

Performance Considerations
  • The utilization of memory transactions (coalescing) remains high until the access stride between threads (i.e., the number of items per thread) exceeds the maximum vector store width (typically 4 items or 64B, whichever is lower).
  • The following conditions will prevent vectorization and writing will fall back to cub::BLOCK_STORE_DIRECT:
    • ITEMS_PER_THREAD is odd
    • The OutputIteratorT is not a simple pointer type
    • The block output offset is not quadword-aligned
    • The data type T is not a built-in primitive or CUDA vector type (e.g., short, int2, double, float2, etc.)
BLOCK_STORE_TRANSPOSE 
Overview
A blocked arrangement is locally transposed and then efficiently written to memory as a striped arrangement.
Performance Considerations
  • The utilization of memory transactions (coalescing) remains high regardless of items written per thread.
  • The local reordering incurs slightly longer latencies and lower throughput than the direct cub::BLOCK_STORE_DIRECT and cub::BLOCK_STORE_VECTORIZE alternatives.
BLOCK_STORE_WARP_TRANSPOSE 
Overview
A blocked arrangement is locally transposed and then efficiently written to memory as a warp-striped arrangement.
Usage Considerations
  • BLOCK_THREADS must be a multiple of WARP_THREADS
Performance Considerations
  • The utilization of memory transactions (coalescing) remains high regardless of items written per thread.
  • The local reordering incurs slightly longer latencies and lower throughput than the direct cub::BLOCK_STORE_DIRECT and cub::BLOCK_STORE_VECTORIZE alternatives.
BLOCK_STORE_WARP_TRANSPOSE_TIMESLICED 
Overview
A blocked arrangement is locally transposed and then efficiently written to memory as a warp-striped arrangement. To reduce the shared memory requirement, only one warp's worth of shared memory is provisioned and is subsequently time-sliced among warps.
Usage Considerations
  • BLOCK_THREADS must be a multiple of WARP_THREADS
Performance Considerations
  • The utilization of memory transactions (coalescing) remains high regardless of items written per thread.
  • Provisions less shared memory temporary storage, but incurs larger latencies than the BLOCK_STORE_WARP_TRANSPOSE alternative.

Function Documentation

template<int LABEL_BITS>
__device__ unsigned int cub::MatchAny ( unsigned int  label)
inline

Compute a 32b mask of threads having the same least-significant LABEL_BITS of label as the calling thread.