CUB  
Class List
Here are the classes, structs, unions and interfaces with brief descriptions:
namespace cub
    namespace detail
        ChooseOffsetT
    CachingDeviceAllocator - A simple caching allocator for device memory allocations
    SwitchDevice - RAII helper that saves the current device and switches to the specified device on construction, then switches back to the saved device on destruction
    KernelConfig
    ChainedPolicy - Helper for dispatching into a policy chain
    ChainedPolicy< PTX_VERSION, PolicyT, PolicyT > - Helper for dispatching into a policy chain (end-of-chain specialization)
    BlockAdjacentDifference - Provides collective methods for computing the differences of adjacent elements partitioned across a CUDA thread block
        TempStorage - The operations exposed by BlockAdjacentDifference require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse
    BlockDiscontinuity - Provides collective methods for flagging discontinuities within an ordered set of items partitioned across a CUDA thread block (allocation sketch below)
        TempStorage - The operations exposed by BlockDiscontinuity require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse
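
The TempStorage pattern described above is shared by every block- and warp-level collective in this list. A minimal sketch of the usual allocation idiom, using BlockDiscontinuity::FlagHeads and assuming a 1D block of 128 threads with 4 items per thread (the kernel name and the generated data are illustrative):

#include <cub/cub.cuh>

__global__ void FlagHeadsKernel(int *d_flags)
{
    // Specialize BlockDiscontinuity for a 1D block of 128 threads owning ints
    using BlockDiscontinuity = cub::BlockDiscontinuity<int, 128>;

    // Allocate the opaque TempStorage in shared memory
    __shared__ typename BlockDiscontinuity::TempStorage temp_storage;

    // Each thread owns a blocked segment of 4 consecutive items (made-up data
    // that forms runs of equal values)
    int thread_data[4];
    for (int i = 0; i < 4; ++i)
        thread_data[i] = (threadIdx.x * 4 + i) / 8;

    // Collectively flag the first item of each run of equal values
    int head_flags[4];
    BlockDiscontinuity(temp_storage).FlagHeads(head_flags, thread_data, cub::Inequality());

    for (int i = 0; i < 4; ++i)
        d_flags[threadIdx.x * 4 + i] = head_flags[i];
}

Aliasing TempStorage to externally allocated shared memory, or union'ing it with the TempStorage of another collective, changes only the declaration; the call sites stay the same.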
    BlockExchange - Provides collective methods for rearranging data partitioned across a CUDA thread block
        TempStorage - The operations exposed by BlockExchange require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse
    BlockHistogram - Provides collective methods for constructing block-wide histograms from data samples partitioned across a CUDA thread block
        TempStorage - The operations exposed by BlockHistogram require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse
    BlockLoad - Provides collective data movement methods for loading a linear segment of items from memory into a blocked arrangement across a CUDA thread block
        TempStorage - The operations exposed by BlockLoad require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse
    BlockLoadType
    BlockRadixSort - Provides collective methods for sorting items partitioned across a CUDA thread block using a radix sorting method
        TempStorage - The operations exposed by BlockRadixSort require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse
    BlockReduce - Provides collective methods for computing a parallel reduction of items partitioned across a CUDA thread block
        TempStorage - The operations exposed by BlockReduce require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse
    BlockRunLengthDecode - Supports decoding a run-length encoded array of items: given the two arrays run_value[N] and run_lengths[N], run_value[i] is repeated run_lengths[i] times in the output array. Because of the nature of run-length decoding ("decompression"), the size of the decoded output is runtime-dependent and potentially unbounded. To address this, BlockRunLengthDecode retrieves a "window" of the run-length decoded array: the window's offset can be specified, and BLOCK_THREADS * DECODED_ITEMS_PER_THREAD decoded items (referred to as window_size) from that window are returned (see the sketch below)
        TempStorage
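
A host-side sketch of what one such window contains, purely to illustrate the windowing semantics described above. This is not the BlockRunLengthDecode API (which performs the decode cooperatively across the thread block); the DecodeWindow function and its parameter names are illustrative:

#include <cstddef>
#include <vector>

// Reference behavior of one decoded "window": skip window_offset decoded items,
// then collect up to window_size items produced by expanding the runs.
std::vector<int> DecodeWindow(const std::vector<int> &run_values,
                              const std::vector<int> &run_lengths,
                              int window_offset,
                              int window_size)
{
    std::vector<int> window;
    int decoded_offset = 0;  // position within the full (virtual) decoded array
    for (std::size_t i = 0; i < run_values.size(); ++i)
    {
        for (int j = 0; j < run_lengths[i]; ++j)
        {
            if (static_cast<int>(window.size()) == window_size)
                return window;                    // window is full
            if (decoded_offset++ >= window_offset)
                window.push_back(run_values[i]);  // item falls inside the window
        }
    }
    return window;  // fewer than window_size items remained past the offset
}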
    BlockScan - Provides collective methods for computing a parallel prefix sum/scan of items partitioned across a CUDA thread block
        TempStorage - The operations exposed by BlockScan require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse
    BlockShuffle - Provides collective methods for shuffling data partitioned across a CUDA thread block
        TempStorage - The operations exposed by BlockShuffle require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse
    BlockStore - Provides collective data movement methods for writing a blocked arrangement of items partitioned across a CUDA thread block to a linear segment of memory
        TempStorage - The operations exposed by BlockStore require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse
    BlockStoreType
    RadixSortTwiddle - Twiddling keys for radix sort
    BaseDigitExtractor - Base struct for digit extractors. Contains common code to provide special handling for floating-point -0.0
    BFEDigitExtractor - A wrapper type to extract digits. Uses the BFE intrinsic to extract a digit from a key
    ShiftDigitExtractor - A wrapper type to extract digits. Uses a combination of shift and bitwise-and to extract digits
    DeviceAdjacentDifference - Provides device-wide, parallel operations for computing the differences of adjacent elements residing within device-accessible memory
    DeviceHistogram - Provides device-wide, parallel operations for constructing histogram(s) from a sequence of sample data residing within device-accessible memory
    DeviceMergeSort - Provides device-wide, parallel operations for computing a merge sort across a sequence of data items residing within device-accessible memory
    DevicePartition - Provides device-wide, parallel operations for partitioning sequences of data items residing within device-accessible memory
    DeviceRadixSort - Provides device-wide, parallel operations for computing a radix sort across a sequence of data items residing within device-accessible memory
    DeviceReduce - Provides device-wide, parallel operations for computing a reduction across a sequence of data items residing within device-accessible memory (usage sketch below)
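
The device-wide algorithms in this list follow the same two-phase temporary-storage idiom: call the entry point once with a null workspace pointer to query the required size, allocate, then call it again to do the work. A minimal sketch using DeviceReduce::Sum (the SumOnDevice wrapper and the assumption that d_in/d_out are valid device allocations are illustrative):

#include <cub/cub.cuh>
#include <cuda_runtime.h>

// d_in: device array of num_items ints; d_out: device location for the result
void SumOnDevice(const int *d_in, int *d_out, int num_items)
{
    void  *d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;

    // First call only queries the required workspace size
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

    // Allocate the workspace, then run the actual reduction
    cudaMalloc(&d_temp_storage, temp_storage_bytes);
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

    cudaFree(d_temp_storage);
}

The same query-then-run pattern applies to DeviceScan, DeviceRadixSort, DeviceSelect, and the other device-wide entry points in this list.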
    DeviceRunLengthEncode - Provides device-wide, parallel operations for demarcating "runs" of same-valued items within a sequence residing within device-accessible memory
    DeviceScan - Provides device-wide, parallel operations for computing a prefix scan across a sequence of data items residing within device-accessible memory
    DeviceSegmentedRadixSort - Provides device-wide, parallel operations for computing a batched radix sort across multiple, non-overlapping sequences of data items residing within device-accessible memory
    DeviceSegmentedReduce - Provides device-wide, parallel operations for computing a reduction across multiple sequences of data items residing within device-accessible memory
    DeviceSelect - Provides device-wide, parallel operations for compacting selected items from sequences of data items residing within device-accessible memory
    DeviceSpmv - Provides device-wide, parallel operations for performing sparse-matrix * dense-vector multiplication (SpMV)
    GridBarrier - Implements a software global barrier among thread blocks within a CUDA grid
    GridBarrierLifetime - Extends GridBarrier to provide lifetime management of the temporary device storage needed for cooperation
    GridEvenShare - A descriptor utility for distributing input among CUDA thread blocks in an "even-share" fashion; each thread block gets roughly the same number of input tiles
    GridQueue - A descriptor utility for dynamic queue management
    ArgIndexInputIterator - A random-access input wrapper for pairing dereferenced values with their corresponding indices (forming KeyValuePair tuples)
    CacheModifiedInputIterator - A random-access input wrapper for dereferencing array values using a PTX cache load modifier
    CacheModifiedOutputIterator - A random-access output wrapper for storing array values using a PTX cache modifier
    ConstantInputIterator - A random-access input generator for dereferencing a sequence of homogeneous values
    CountingInputIterator - A random-access input generator for dereferencing a sequence of incrementing integer values
    DiscardOutputIterator - A discard iterator
    TexObjInputIterator - A random-access input wrapper for dereferencing array values through texture cache; uses newer Kepler-style texture objects
    TexRefInputIterator - A random-access input wrapper for dereferencing array values through texture cache; uses older Tesla/Fermi-style texture references
    TransformInputIterator - A random-access input wrapper for transforming dereferenced values (composition sketch below)
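
These iterator adaptors compose with one another and can be passed directly to the device-wide algorithms above in place of raw pointers. A small sketch computing a sum of squares without materializing an input array; the Square functor and SumOfSquares wrapper are illustrative, and the two-phase workspace pattern is the one shown for DeviceReduce:

#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Illustrative transform op: squares each dereferenced value on the fly
struct Square
{
    __host__ __device__ int operator()(int x) const { return x * x; }
};

void SumOfSquares(int *d_out, int n)
{
    // Generates 0, 1, 2, ... on demand
    cub::CountingInputIterator<int> counting(0);

    // Wrap the counting iterator so dereferencing yields squared values
    cub::TransformInputIterator<int, Square, cub::CountingInputIterator<int>>
        squares(counting, Square());

    void  *d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, squares, d_out, n);
    cudaMalloc(&d_temp_storage, temp_storage_bytes);
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, squares, d_out, n);
    cudaFree(d_temp_storage);
}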
    Equality - Default equality functor
    Inequality - Default inequality functor
    InequalityWrapper - Inequality functor (wraps an equality functor)
    Sum - Default sum functor
    Difference - Default difference functor
    Division - Default division functor
    Max - Default max functor
    ArgMax - Arg max functor (keeps the value and offset of the first occurrence of the larger item)
    Min - Default min functor
    ArgMin - Arg min functor (keeps the value and offset of the first occurrence of the smaller item)
    CastOp - Default cast functor
    SwizzleScanOp - Binary operator wrapper for switching non-commutative scan arguments
    ReduceBySegmentOp - Reduce-by-segment functor
    ReduceByKeyOp - Reduce-by-key functor (wraps a binary reduction operator to apply to values)
    BinaryFlip
    WarpReduce - Provides collective methods for computing a parallel reduction of items partitioned across a CUDA thread warp (usage sketch below)
        TempStorage - The operations exposed by WarpReduce require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse
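
Warp-level collectives use the same TempStorage idiom as their block-level counterparts, but typically need one storage instance per warp. A minimal sketch assuming a 128-thread block (the kernel name, data, and output layout are illustrative):

#include <cub/cub.cuh>

__global__ void WarpSumKernel(const int *d_in, int *d_warp_sums)
{
    // Specialize WarpReduce for ints on full 32-thread logical warps
    using WarpReduce = cub::WarpReduce<int>;

    // One TempStorage per warp in a 128-thread block (4 warps)
    __shared__ typename WarpReduce::TempStorage temp_storage[4];

    int warp_id     = threadIdx.x / 32;
    int thread_data = d_in[blockIdx.x * blockDim.x + threadIdx.x];

    // Collectively sum across each warp; the aggregate is valid in lane 0 only
    int aggregate = WarpReduce(temp_storage[warp_id]).Sum(thread_data);

    if (threadIdx.x % 32 == 0)
        d_warp_sums[blockIdx.x * 4 + warp_id] = aggregate;
}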
    WarpScan - Provides collective methods for computing a parallel prefix scan of items partitioned across a CUDA thread warp
        TempStorage - The operations exposed by WarpScan require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse
namespace detail
BlockMergeSort - Provides methods for sorting items partitioned across a CUDA thread block using a merge sorting method
BlockMergeSortStrategy - Generalized merge sort algorithm
    TempStorage - The operations exposed by BlockMergeSort require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse
DeviceSegmentedSort - Provides device-wide, parallel operations for computing a batched sort across multiple, non-overlapping sequences of data items residing within device-accessible memory
Equals - Type equality test
If - Type selection (IF ? ThenType : ElseType)
IsPointer - Pointer vs. iterator
IsVolatile - Volatile modifier test
Log2 - Statically determine log2(N), rounded up
PowerOfTwo - Statically determine if N is a power-of-two
RemoveQualifiers - Removes const and volatile qualifiers from type Tp
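
The metaprogramming utilities above (Equals, If, IsPointer, IsVolatile, Log2, PowerOfTwo, RemoveQualifiers) are compile-time traits. A quick compile-time sketch, assuming the VALUE/Type members suggested by their descriptions (member spellings may differ between CUB versions):

#include <cub/cub.cuh>
#include <type_traits>

// All of these are evaluated entirely at compile time.
static_assert(cub::Log2<256>::VALUE == 8, "log2(256), rounded up");
static_assert(cub::Log2<300>::VALUE == 9, "log2(300), rounded up");
static_assert(cub::PowerOfTwo<64>::VALUE, "64 is a power of two");
static_assert(!cub::PowerOfTwo<48>::VALUE, "48 is not a power of two");
static_assert(cub::Equals<int, int>::VALUE, "type equality test");
static_assert(std::is_same<cub::If<true, int, float>::Type, int>::value,
              "IF ? ThenType : ElseType selects ThenType when IF is true");
static_assert(std::is_same<cub::RemoveQualifiers<const volatile int>::Type, int>::value,
              "const and volatile qualifiers stripped");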
WarpExchange - Provides collective methods for rearranging data partitioned across a CUDA warp
    TempStorage - The operations exposed by WarpExchange require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse
WarpLoad - Provides collective data movement methods for loading a linear segment of items from memory into a blocked arrangement across a CUDA warp
    TempStorage - The operations exposed by WarpLoad require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse
WarpMergeSort - Provides methods for sorting items partitioned across a CUDA warp using a merge sorting method
WarpStore - Provides collective data movement methods for writing a blocked arrangement of items partitioned across a CUDA warp to a linear segment of memory
    TempStorage