CUB Namespace Reference

Optional outer namespace(s). More...
Classes | |
| struct | CachingDeviceAllocator |
| A simple caching allocator for device memory allocations. More... | |
| struct | If |
| Type selection (IF ? ThenType : ElseType). More... | |
| struct | Equals |
| Type equality test. More... | |
| struct | Log2 |
| Statically determine log2(N), rounded up. More... | |
| struct | PowerOfTwo |
| Statically determine if N is a power-of-two. More... | |
| struct | IsPointer |
| Pointer vs. iterator. More... | |
| struct | IsVolatile |
| Volatile modifier test. More... | |
| struct | RemoveQualifiers |
| Removes const and volatile qualifiers from type Tp. More... | |
| class | ArgIndexInputIterator |
| A random-access input wrapper for pairing dereferenced values with their corresponding indices (forming KeyValuePair tuples). More... | |
| class | CacheModifiedInputIterator |
| A random-access input wrapper for dereferencing array values using a PTX cache load modifier. More... | |
| class | CacheModifiedOutputIterator |
| A random-access output wrapper for storing array values using a PTX cache-modifier. More... | |
| class | ConstantInputIterator |
| A random-access input generator for dereferencing a sequence of homogeneous values. More... | |
| class | CountingInputIterator |
| A random-access input generator for dereferencing a sequence of incrementing integer values. More... | |
| class | TexObjInputIterator |
| A random-access input wrapper for dereferencing array values through texture cache. Uses newer Kepler-style texture objects. More... | |
| class | TexRefInputIterator |
| A random-access input wrapper for dereferencing array values through texture cache. Uses older Tesla/Fermi-style texture references. More... | |
| class | TransformInputIterator |
| A random-access input wrapper for transforming dereferenced values. More... | |
| struct | Equality |
| Default equality functor. More... | |
| struct | Inequality |
| Default inequality functor. More... | |
| struct | InequalityWrapper |
| Inequality functor (wraps equality functor) More... | |
| struct | Sum |
| Default sum functor. More... | |
| struct | Max |
| Default max functor. More... | |
| struct | ArgMax |
| Arg max functor (keeps the value and offset of the first occurrence of the largest item) More... | |
| struct | Min |
| Default min functor. More... | |
| struct | ArgMin |
| Arg min functor (keeps the value and offset of the first occurrence of the smallest item) More... | |
| struct | Cast |
| Default cast functor. More... | |
| class | SwizzleScanOp |
| Binary operator wrapper for switching non-commutative scan arguments. More... | |
| struct | ReduceBySegmentOp |
| Reduce-by-segment functor. More... | |
| struct | ReduceByKeyOp |
| Reduce-by-key functor (wraps a binary reduction operator to apply to values). More... | |
| class | BlockDiscontinuity |
| The BlockDiscontinuity class provides collective methods for flagging discontinuities within an ordered set of items partitioned across a CUDA thread block. More... | |
| class | BlockExchange |
| The BlockExchange class provides collective methods for rearranging data partitioned across a CUDA thread block. More... | |
| class | BlockHistogram |
| The BlockHistogram class provides collective methods for constructing block-wide histograms from data samples partitioned across a CUDA thread block. More... | |
| class | BlockLoad |
| The BlockLoad class provides collective data movement methods for loading a linear segment of items from memory into a blocked arrangement across a CUDA thread block. More... | |
| class | BlockRadixSort |
| The BlockRadixSort class provides collective methods for sorting items partitioned across a CUDA thread block using a radix sorting method. More... | |
| class | BlockReduce |
| The BlockReduce class provides collective methods for computing a parallel reduction of items partitioned across a CUDA thread block. More... | |
| class | BlockScan |
| The BlockScan class provides collective methods for computing a parallel prefix sum/scan of items partitioned across a CUDA thread block. More... | |
| class | BlockStore |
| The BlockStore class provides collective data movement methods for writing a blocked arrangement of items partitioned across a CUDA thread block to a linear segment of memory. More... | |
| struct | DeviceHistogram |
| DeviceHistogram provides device-wide parallel operations for constructing histogram(s) from a sequence of sample data residing within device-accessible memory. More... | |
| struct | DevicePartition |
| DevicePartition provides device-wide, parallel operations for partitioning sequences of data items residing within device-accessible memory. More... | |
| struct | DeviceRadixSort |
| DeviceRadixSort provides device-wide, parallel operations for computing a radix sort across a sequence of data items residing within device-accessible memory. More... | |
| struct | DeviceReduce |
| DeviceReduce provides device-wide, parallel operations for computing a reduction across a sequence of data items residing within device-accessible memory. More... | |
| struct | DeviceRunLengthEncode |
| DeviceRunLengthEncode provides device-wide, parallel operations for demarcating "runs" of same-valued items within a sequence residing within device-accessible memory. More... | |
| struct | DeviceScan |
| DeviceScan provides device-wide, parallel operations for computing a prefix scan across a sequence of data items residing within device-accessible memory. More... | |
| struct | DeviceSegmentedRadixSort |
| DeviceSegmentedRadixSort provides device-wide, parallel operations for computing a batched radix sort across multiple, non-overlapping sequences of data items residing within device-accessible memory. More... | |
| struct | DeviceSegmentedReduce |
| DeviceSegmentedReduce provides device-wide, parallel operations for computing a reduction across multiple sequences of data items residing within device-accessible memory. More... | |
| struct | DeviceSelect |
| DeviceSelect provides device-wide, parallel operations for compacting selected items from sequences of data items residing within device-accessible memory. More... | |
| struct | DeviceSpmv |
| DeviceSpmv provides device-wide parallel operations for performing sparse-matrix * dense-vector multiplication (SpMV). More... | |
| class | WarpScan |
| The WarpScan class provides collective methods for computing a parallel prefix scan of items partitioned across a CUDA thread warp. More... | |
| class | WarpReduce |
| The WarpReduce class provides collective methods for computing a parallel reduction of items partitioned across a CUDA thread warp. More... | |
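The Device* primitives above share a common two-phase usage pattern: a first call made with a NULL temporary-storage pointer only computes the required scratch size, and a second call with the allocated scratch performs the work. A minimal host-side sketch using cub::DeviceReduce::Sum (the helper name and the assumption that d_in, d_out, and num_items are already set up on the device are illustrative):

```cuda
#include <cub/cub.cuh>

// Sum num_items integers in d_in, writing the single result to d_out.
void SumOnDevice(int *d_in, int *d_out, int num_items)
{
    void   *d_temp_storage     = NULL;
    size_t  temp_storage_bytes = 0;

    // Phase 1: no work is done; temp_storage_bytes is filled in.
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

    // Allocate scratch space (a cub::CachingDeviceAllocator can serve this too).
    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    // Phase 2: the actual reduction.
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

    cudaFree(d_temp_storage);
}
```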
Functions | |
| __host__ __device__ __forceinline__ cudaError_t | Debug (cudaError_t error, const char *filename, int line) |
| CUB error reporting macro (prints error messages to stderr). More... | |
| CUB_RUNTIME_FUNCTION __forceinline__ cudaError_t | PtxVersion (int &ptx_version) |
| Retrieves the PTX version that will be used on the current device (major * 100 + minor * 10). More... | |
| CUB_RUNTIME_FUNCTION __forceinline__ cudaError_t | SmVersion (int &sm_version, int device_ordinal) |
| Retrieves the SM version (major * 100 + minor * 10). | |
| __device__ __forceinline__ unsigned int | SHR_ADD (unsigned int x, unsigned int shift, unsigned int addend) |
| Shift-right then add. Returns (x >> shift) + addend. | |
| __device__ __forceinline__ unsigned int | SHL_ADD (unsigned int x, unsigned int shift, unsigned int addend) |
| Shift-left then add. Returns (x << shift) + addend. | |
| template<typename UnsignedBits > | |
| __device__ __forceinline__ unsigned int | BFE (UnsignedBits source, unsigned int bit_start, unsigned int num_bits) |
| Bitfield-extract. Extracts num_bits from source starting at bit-offset bit_start. The input source may be an 8b, 16b, 32b, or 64b unsigned integer type. | |
| __device__ __forceinline__ void | BFI (unsigned int &ret, unsigned int x, unsigned int y, unsigned int bit_start, unsigned int num_bits) |
| Bitfield insert. Inserts the num_bits least significant bits of y into x at bit-offset bit_start. | |
| __device__ __forceinline__ unsigned int | IADD3 (unsigned int x, unsigned int y, unsigned int z) |
| Three-operand add. Returns x + y + z. | |
| __device__ __forceinline__ int | PRMT (unsigned int a, unsigned int b, unsigned int index) |
| Byte-permute. Picks four arbitrary bytes from two 32-bit registers and reassembles them into a 32-bit destination register. For SM2.0 or later. More... | |
| __device__ __forceinline__ void | ThreadExit () |
| Terminates the calling thread. | |
| __device__ __forceinline__ void | ThreadTrap () |
| Aborts execution and generates an interrupt to the host CPU. | |
| __device__ __forceinline__ int | RowMajorTid (int block_dim_x, int block_dim_y, int block_dim_z) |
| Returns the row-major linear thread identifier for a multidimensional thread block. | |
| __device__ __forceinline__ unsigned int | LaneId () |
| Returns the warp lane ID of the calling thread. | |
| __device__ __forceinline__ unsigned int | WarpId () |
| Returns the warp ID of the calling thread. Warp IDs are guaranteed to be unique among warps, but may not correspond to a zero-based ranking within the thread block. | |
| __device__ __forceinline__ unsigned int | LaneMaskLt () |
| Returns the warp lane mask of all lanes less than the calling thread. | |
| __device__ __forceinline__ unsigned int | LaneMaskLe () |
| Returns the warp lane mask of all lanes less than or equal to the calling thread. | |
| __device__ __forceinline__ unsigned int | LaneMaskGt () |
| Returns the warp lane mask of all lanes greater than the calling thread. | |
| __device__ __forceinline__ unsigned int | LaneMaskGe () |
| Returns the warp lane mask of all lanes greater than or equal to the calling thread. | |
| template<typename T > | |
| __device__ __forceinline__ T | ShuffleUp (T input, int src_offset, int first_lane, unsigned int member_mask) |
| Shuffle-up for any data type. Each warp-lane_i obtains the value input contributed by warp-lane_(i - src_offset). For thread lanes i < src_offset, the thread's own input is returned to the thread. | |
| template<typename T > | |
| __device__ __forceinline__ T | ShuffleDown (T input, int src_offset, int last_lane, unsigned int member_mask) |
| Shuffle-down for any data type. Each warp-lane_i obtains the value input contributed by warp-lane_(i + src_offset). For thread lanes where i + src_offset >= WARP_THREADS, the thread's own input is returned to the thread. | |
| template<typename T > | |
| __device__ __forceinline__ T | ShuffleIndex (T input, int src_lane, int logical_warp_threads, unsigned int member_mask) |
| Shuffle-broadcast for any data type. Each warp-lane_i obtains the value input contributed by warp-lane_src_lane. If src_lane < 0 or src_lane >= WARP_THREADS, the thread's own input is returned to the thread. | |
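The lane-mask utilities above combine naturally with CUDA's warp-vote intrinsics: for example, a thread's rank among the warp's predicate-holding threads falls out of a ballot and a population count. A sketch, assuming a full 32-thread warp whose lanes are all converged (the helper name is illustrative):

```cuda
#include <cub/cub.cuh>

// Returns this thread's exclusive rank among warp threads whose
// predicate is true (assumes a full, converged 32-thread warp).
__device__ int WarpPredicateRank(bool predicate)
{
    unsigned int ballot = __ballot_sync(0xFFFFFFFFu, predicate);
    // cub::LaneMaskLt() keeps only the ballot bits of lower-ranked lanes.
    return __popc(ballot & cub::LaneMaskLt());
}
```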
Thread I/O (cache modified) | |
| template<CacheLoadModifier MODIFIER, typename InputIteratorT > | |
| __device__ __forceinline__ std::iterator_traits < InputIteratorT >::value_type | ThreadLoad (InputIteratorT itr) |
| Thread utility for reading memory using cub::CacheLoadModifier cache modifiers. Can be used to load any data type. More... | |
| template<CacheStoreModifier MODIFIER, typename OutputIteratorT , typename T > | |
| __device__ __forceinline__ void | ThreadStore (OutputIteratorT itr, T val) |
| Thread utility for writing memory using cub::CacheStoreModifier cache modifiers. Can be used to store any data type. More... | |
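These per-thread primitives apply a cache modifier to an otherwise ordinary load or store. A minimal sketch using the LOAD_CG/STORE_CG (cache-at-global-level) modifiers; the kernel name is illustrative, and the grid is assumed to exactly cover the input:

```cuda
#include <cub/cub.cuh>

// Copy one element per thread, caching at the global (L2) level and
// bypassing L1 on architectures where that distinction exists.
__global__ void CopyCG(const int *d_in, int *d_out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int val = cub::ThreadLoad<cub::LOAD_CG>(d_in + idx);
    cub::ThreadStore<cub::STORE_CG>(d_out + idx, val);
}
```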
Blocked arrangement I/O (direct) | |
| template<typename InputT , int ITEMS_PER_THREAD, typename InputIteratorT > | |
| __device__ __forceinline__ void | LoadDirectBlocked (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD]) |
| Load a linear segment of items into a blocked arrangement across the thread block. More... | |
| template<typename InputT , int ITEMS_PER_THREAD, typename InputIteratorT > | |
| __device__ __forceinline__ void | LoadDirectBlocked (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items) |
| Load a linear segment of items into a blocked arrangement across the thread block, guarded by range. More... | |
| template<typename InputT , typename DefaultT , int ITEMS_PER_THREAD, typename InputIteratorT > | |
| __device__ __forceinline__ void | LoadDirectBlocked (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items, DefaultT oob_default) |
| Load a linear segment of items into a blocked arrangement across the thread block, guarded by range, with a fall-back assignment of out-of-bound elements. More... | |
| template<typename T , int ITEMS_PER_THREAD> | |
| __device__ __forceinline__ void | LoadDirectBlockedVectorized (int linear_tid, T *block_ptr, T(&items)[ITEMS_PER_THREAD]) |
| Load a linear segment of items into a blocked arrangement across the thread block. More... | |
| template<typename T , int ITEMS_PER_THREAD, typename OutputIteratorT > | |
| __device__ __forceinline__ void | StoreDirectBlocked (int linear_tid, OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD]) |
| Store a blocked arrangement of items across a thread block into a linear segment of items. More... | |
| template<typename T , int ITEMS_PER_THREAD, typename OutputIteratorT > | |
| __device__ __forceinline__ void | StoreDirectBlocked (int linear_tid, OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD], int valid_items) |
| Store a blocked arrangement of items across a thread block into a linear segment of items, guarded by range. More... | |
| template<typename T , int ITEMS_PER_THREAD> | |
| __device__ __forceinline__ void | StoreDirectBlockedVectorized (int linear_tid, T *block_ptr, T(&items)[ITEMS_PER_THREAD]) |
| Store a blocked arrangement of items across a thread block into a linear segment of items. More... | |
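In a blocked arrangement, thread_i owns the ITEMS_PER_THREAD consecutive items starting at offset i * ITEMS_PER_THREAD. A guarded load paired with a guarded store might look like the following sketch (128 threads, 4 items each; the kernel name and the fall-back value 0 are illustrative):

```cuda
#include <cub/cub.cuh>

__global__ void ScaleKernel(const int *d_in, int *d_out, int valid_items)
{
    int items[4];
    int linear_tid = threadIdx.x;

    // Guarded load: out-of-range slots receive the fall-back value 0.
    cub::LoadDirectBlocked(linear_tid, d_in, items, valid_items, 0);

    for (int i = 0; i < 4; ++i)
        items[i] *= 2;

    // Guarded store: only in-range items are written back.
    cub::StoreDirectBlocked(linear_tid, d_out, items, valid_items);
}
```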
Striped arrangement I/O (direct) | |
| template<int BLOCK_THREADS, typename InputT , int ITEMS_PER_THREAD, typename InputIteratorT > | |
| __device__ __forceinline__ void | LoadDirectStriped (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD]) |
| Load a linear segment of items into a striped arrangement across the thread block. More... | |
| template<int BLOCK_THREADS, typename InputT , int ITEMS_PER_THREAD, typename InputIteratorT > | |
| __device__ __forceinline__ void | LoadDirectStriped (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items) |
| Load a linear segment of items into a striped arrangement across the thread block, guarded by range. More... | |
| template<int BLOCK_THREADS, typename InputT , typename DefaultT , int ITEMS_PER_THREAD, typename InputIteratorT > | |
| __device__ __forceinline__ void | LoadDirectStriped (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items, DefaultT oob_default) |
| Load a linear segment of items into a striped arrangement across the thread block, guarded by range, with a fall-back assignment of out-of-bound elements. More... | |
| template<int BLOCK_THREADS, typename T , int ITEMS_PER_THREAD, typename OutputIteratorT > | |
| __device__ __forceinline__ void | StoreDirectStriped (int linear_tid, OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD]) |
| Store a striped arrangement of data across the thread block into a linear segment of items. More... | |
| template<int BLOCK_THREADS, typename T , int ITEMS_PER_THREAD, typename OutputIteratorT > | |
| __device__ __forceinline__ void | StoreDirectStriped (int linear_tid, OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD], int valid_items) |
| Store a striped arrangement of data across the thread block into a linear segment of items, guarded by range. More... | |
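In a striped arrangement, thread_i owns items i, i + BLOCK_THREADS, i + 2 * BLOCK_THREADS, and so on, which keeps consecutive threads touching consecutive addresses. Note that BLOCK_THREADS is supplied as a template parameter. A sketch with 128 threads and 4 items per thread (kernel name illustrative, input assumed to cover one 512-item tile):

```cuda
#include <cub/cub.cuh>

__global__ void StripedCopy(const int *d_in, int *d_out)
{
    int items[4];
    int linear_tid = threadIdx.x;

    // Thread i reads d_in[i], d_in[i + 128], d_in[i + 256], d_in[i + 384].
    cub::LoadDirectStriped<128>(linear_tid, d_in, items);
    cub::StoreDirectStriped<128>(linear_tid, d_out, items);
}
```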
Warp-striped arrangement I/O (direct) | |
| template<typename InputT , int ITEMS_PER_THREAD, typename InputIteratorT > | |
| __device__ __forceinline__ void | LoadDirectWarpStriped (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD]) |
| Load a linear segment of items into a warp-striped arrangement across the thread block. More... | |
| template<typename InputT , int ITEMS_PER_THREAD, typename InputIteratorT > | |
| __device__ __forceinline__ void | LoadDirectWarpStriped (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items) |
| Load a linear segment of items into a warp-striped arrangement across the thread block, guarded by range. More... | |
| template<typename InputT , typename DefaultT , int ITEMS_PER_THREAD, typename InputIteratorT > | |
| __device__ __forceinline__ void | LoadDirectWarpStriped (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items, DefaultT oob_default) |
| Load a linear segment of items into a warp-striped arrangement across the thread block, guarded by range, with a fall-back assignment of out-of-bound elements. More... | |
| template<typename T , int ITEMS_PER_THREAD, typename OutputIteratorT > | |
| __device__ __forceinline__ void | StoreDirectWarpStriped (int linear_tid, OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD]) |
| Store a warp-striped arrangement of data across the thread block into a linear segment of items. More... | |
| template<typename T , int ITEMS_PER_THREAD, typename OutputIteratorT > | |
| __device__ __forceinline__ void | StoreDirectWarpStriped (int linear_tid, OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD], int valid_items) |
| Store a warp-striped arrangement of data across the thread block into a linear segment of items, guarded by range. More... | |
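A warp-striped arrangement stripes each warp's items across only that warp's lanes, so no block-size template parameter is needed. A sketch mirroring the striped example, again with 4 items per thread (kernel name illustrative):

```cuda
#include <cub/cub.cuh>

__global__ void WarpStripedCopy(const int *d_in, int *d_out)
{
    int items[4];
    int linear_tid = threadIdx.x;

    // Within each warp, lane i reads the items at offsets
    // i, i + 32, i + 64, i + 96 of that warp's 128-item segment.
    cub::LoadDirectWarpStriped(linear_tid, d_in, items);
    cub::StoreDirectWarpStriped(linear_tid, d_out, items);
}
```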
BlockHistogramAlgorithm enumerates alternative algorithms for the parallel construction of block-wide histograms.
Definition at line 56 of file block_histogram.cuh.
cub::BlockLoadAlgorithm enumerates alternative algorithms for cub::BlockLoad to read a linear segment of data from memory into a blocked arrangement across a CUDA thread block.
| Enumerator | |
|---|---|
| BLOCK_LOAD_DIRECT |
A blocked arrangement of data is read directly from memory.
|
| BLOCK_LOAD_VECTORIZE |
A blocked arrangement of data is read from memory using CUDA's built-in vectorized loads as a coalescing optimization.
|
| BLOCK_LOAD_TRANSPOSE |
A striped arrangement of data is read efficiently from memory and then locally transposed into a blocked arrangement.
|
| BLOCK_LOAD_WARP_TRANSPOSE |
A warp-striped arrangement of data is read efficiently from memory and then locally transposed into a blocked arrangement.
|
| BLOCK_LOAD_WARP_TRANSPOSE_TIMESLICED |
Like BLOCK_LOAD_WARP_TRANSPOSE, but only one warp's worth of shared memory is provisioned and is time-sliced among the block's warps.
|
Definition at line 473 of file block_load.cuh.
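The algorithm is selected as a template parameter of cub::BlockLoad. A sketch selecting BLOCK_LOAD_TRANSPOSE for 128 threads with 4 items each (kernel name illustrative; each block is assumed to own one 512-item tile):

```cuda
#include <cub/cub.cuh>

__global__ void LoadKernel(const int *d_in)
{
    typedef cub::BlockLoad<int, 128, 4, cub::BLOCK_LOAD_TRANSPOSE> BlockLoad;
    __shared__ typename BlockLoad::TempStorage temp_storage;

    int thread_data[4];
    // Coalesced striped reads from global memory, followed by a
    // shared-memory transpose into a blocked arrangement.
    BlockLoad(temp_storage).Load(d_in + blockIdx.x * 512, thread_data);
    // ... thread_data now holds this thread's 4 consecutive items ...
}
```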
BlockReduceAlgorithm enumerates alternative algorithms for parallel reduction across a CUDA threadblock.
| Enumerator | |
|---|---|
| BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY |
An efficient "raking" reduction algorithm that only supports commutative reduction operators.
|
| BLOCK_REDUCE_RAKING |
An efficient "raking" reduction algorithm that supports commutative and non-commutative reduction operators.
|
| BLOCK_REDUCE_WARP_REDUCTIONS |
A quick reduction algorithm based on warp-wide reductions whose partial results are then combined.
|
Definition at line 60 of file block_reduce.cuh.
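cub::BlockReduce takes the algorithm as an optional template parameter. A sketch reducing one 128-item tile per block with BLOCK_REDUCE_WARP_REDUCTIONS (kernel name illustrative; the grid is assumed to cover the input exactly):

```cuda
#include <cub/cub.cuh>

__global__ void BlockSumKernel(const int *d_in, int *d_out)
{
    typedef cub::BlockReduce<int, 128, cub::BLOCK_REDUCE_WARP_REDUCTIONS> BlockReduce;
    __shared__ typename BlockReduce::TempStorage temp_storage;

    int thread_data = d_in[blockIdx.x * 128 + threadIdx.x];
    int block_sum   = BlockReduce(temp_storage).Sum(thread_data);

    // Only thread 0 holds the valid block-wide aggregate.
    if (threadIdx.x == 0)
        d_out[blockIdx.x] = block_sum;
}
```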
BlockScanAlgorithm enumerates alternative algorithms for cub::BlockScan to compute a parallel prefix scan across a CUDA thread block.
| Enumerator | |
|---|---|
| BLOCK_SCAN_RAKING |
An efficient "raking reduce-then-scan" prefix scan algorithm.
|
| BLOCK_SCAN_RAKING_MEMOIZE |
Similar to BLOCK_SCAN_RAKING, but memoizes intermediate partials to incur fewer shared memory reads at the expense of higher register pressure.
|
| BLOCK_SCAN_WARP_SCANS |
A quick prefix scan algorithm based on warp-wide scans.
|
Definition at line 57 of file block_scan.cuh.
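As with the other block-wide collectives, the scan algorithm is a template parameter of cub::BlockScan. A sketch computing an in-place exclusive prefix sum over one 128-item tile with BLOCK_SCAN_WARP_SCANS (kernel name illustrative):

```cuda
#include <cub/cub.cuh>

__global__ void BlockPrefixSumKernel(int *d_data)
{
    typedef cub::BlockScan<int, 128, cub::BLOCK_SCAN_WARP_SCANS> BlockScan;
    __shared__ typename BlockScan::TempStorage temp_storage;

    int thread_data = d_data[threadIdx.x];
    // Exclusive prefix sum across the block's 128 items.
    BlockScan(temp_storage).ExclusiveSum(thread_data, thread_data);
    d_data[threadIdx.x] = thread_data;
}
```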
cub::BlockStoreAlgorithm enumerates alternative algorithms for cub::BlockStore to write a blocked arrangement of items across a CUDA thread block to a linear segment of memory.
| Enumerator | |
|---|---|
| BLOCK_STORE_DIRECT |
A blocked arrangement of data is written directly to memory.
|
| BLOCK_STORE_VECTORIZE |
A blocked arrangement of data is written to memory using CUDA's built-in vectorized stores as a coalescing optimization.
|
| BLOCK_STORE_TRANSPOSE |
A blocked arrangement is locally transposed into a striped arrangement, which is then written efficiently to memory.
|
| BLOCK_STORE_WARP_TRANSPOSE |
A blocked arrangement is locally transposed into a warp-striped arrangement, which is then written efficiently to memory.
|
| BLOCK_STORE_WARP_TRANSPOSE_TIMESLICED |
Like BLOCK_STORE_WARP_TRANSPOSE, but only one warp's worth of shared memory is provisioned and is time-sliced among the block's warps.
|
Definition at line 354 of file block_store.cuh.
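cub::BlockStore mirrors cub::BlockLoad. A sketch writing one 512-item tile per block with BLOCK_STORE_WARP_TRANSPOSE (kernel name and the synthetic thread_data values are illustrative):

```cuda
#include <cub/cub.cuh>

__global__ void StoreKernel(int *d_out)
{
    typedef cub::BlockStore<int, 128, 4, cub::BLOCK_STORE_WARP_TRANSPOSE> BlockStore;
    __shared__ typename BlockStore::TempStorage temp_storage;

    int thread_data[4];
    for (int i = 0; i < 4; ++i)
        thread_data[i] = threadIdx.x * 4 + i;

    // Transpose blocked -> warp-striped in shared memory, then write coalesced.
    BlockStore(temp_storage).Store(d_out + blockIdx.x * 512, thread_data);
}
```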