 cub | |
  detail | |
   ChooseOffsetT | |
  CachingDeviceAllocator | A simple caching allocator for device memory allocations |
  SwitchDevice | RAII helper that saves the current device and switches to the specified device on construction, then switches back to the saved device on destruction |
  KernelConfig | |
  ChainedPolicy | Helper for dispatching into a policy chain |
  ChainedPolicy< PTX_VERSION, PolicyT, PolicyT > | Helper for dispatching into a policy chain (end-of-chain specialization) |
  BlockAdjacentDifference | BlockAdjacentDifference provides collective methods for computing the differences of adjacent elements partitioned across a CUDA thread block |
   TempStorage | The operations exposed by BlockAdjacentDifference require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse |
  BlockDiscontinuity | The BlockDiscontinuity class provides collective methods for flagging discontinuities within an ordered set of items partitioned across a CUDA thread block |
   TempStorage | The operations exposed by BlockDiscontinuity require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse |
  BlockExchange | The BlockExchange class provides collective methods for rearranging data partitioned across a CUDA thread block |
   TempStorage | The operations exposed by BlockExchange require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse |
  BlockHistogram | The BlockHistogram class provides collective methods for constructing block-wide histograms from data samples partitioned across a CUDA thread block |
   TempStorage | The operations exposed by BlockHistogram require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse |
  BlockLoad | The BlockLoad class provides collective data movement methods for loading a linear segment of items from memory into a blocked arrangement across a CUDA thread block |
   TempStorage | The operations exposed by BlockLoad require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse |
  BlockLoadType | |
  BlockRadixSort | The BlockRadixSort class provides collective methods for sorting items partitioned across a CUDA thread block using a radix sorting method |
   TempStorage | The operations exposed by BlockRadixSort require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse |
  BlockReduce | The BlockReduce class provides collective methods for computing a parallel reduction of items partitioned across a CUDA thread block (a usage sketch follows this table) |
   TempStorage | The operations exposed by BlockReduce require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse |
  BlockRunLengthDecode | The BlockRunLengthDecode class supports decoding a run-length encoded array of items. That is, given the two arrays run_value[N] and run_lengths[N], run_value[i] is repeated run_lengths[i] times in the output array. Due to the nature of the run-length decoding algorithm ("decompression"), the output size of the run-length decoded array is runtime-dependent and potentially without any upper bound. To address this, BlockRunLengthDecode allows retrieving a "window" from the run-length decoded array: the window's offset can be specified, and BLOCK_THREADS * DECODED_ITEMS_PER_THREAD decoded items (referred to as the window_size) from the specified window are returned |
   TempStorage | |
  BlockScan | The BlockScan class provides collective methods for computing a parallel prefix sum/scan of items partitioned across a CUDA thread block |
   TempStorage | The operations exposed by BlockScan require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse |
  BlockShuffle | The BlockShuffle class provides collective methods for shuffling data partitioned across a CUDA thread block |
   TempStorage | The operations exposed by BlockShuffle require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse |
  BlockStore | The BlockStore class provides collective data movement methods for writing a blocked arrangement of items partitioned across a CUDA thread block to a linear segment of memory |
   TempStorage | The operations exposed by BlockStore require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse |
  BlockStoreType | |
  RadixSortTwiddle | Twiddling keys for radix sort |
  BaseDigitExtractor | Base struct for digit extractor. Contains common code to provide special handling for floating-point -0.0 |
  BFEDigitExtractor | A wrapper type to extract digits. Uses the BFE intrinsic to extract a digit from a key |
  ShiftDigitExtractor | A wrapper type to extract digits. Uses a combination of shift and bitwise AND to extract digits |
  DeviceAdjacentDifference | DeviceAdjacentDifference provides device-wide, parallel operations for computing the differences of adjacent elements residing within device-accessible memory |
  DeviceHistogram | DeviceHistogram provides device-wide parallel operations for constructing histogram(s) from a sequence of sample data residing within device-accessible memory |
  DeviceMergeSort | DeviceMergeSort provides device-wide, parallel operations for computing a merge sort across a sequence of data items residing within device-accessible memory |
  DevicePartition | DevicePartition provides device-wide, parallel operations for partitioning sequences of data items residing within device-accessible memory |
  DeviceRadixSort | DeviceRadixSort provides device-wide, parallel operations for computing a radix sort across a sequence of data items residing within device-accessible memory |
  DeviceReduce | DeviceReduce provides device-wide, parallel operations for computing a reduction across a sequence of data items residing within device-accessible memory (see the temp-storage sketch after this table) |
  DeviceRunLengthEncode | DeviceRunLengthEncode provides device-wide, parallel operations for demarcating "runs" of same-valued items within a sequence residing within device-accessible memory |
  DeviceScan | DeviceScan provides device-wide, parallel operations for computing a prefix scan across a sequence of data items residing within device-accessible memory |
  DeviceSegmentedRadixSort | DeviceSegmentedRadixSort provides device-wide, parallel operations for computing a batched radix sort across multiple, non-overlapping sequences of data items residing within device-accessible memory |
  DeviceSegmentedReduce | DeviceSegmentedReduce provides device-wide, parallel operations for computing a reduction across multiple sequences of data items residing within device-accessible memory |
  DeviceSelect | DeviceSelect provides device-wide, parallel operations for compacting selected items from sequences of data items residing within device-accessible memory |
  DeviceSpmv | DeviceSpmv provides device-wide parallel operations for performing sparse-matrix * dense-vector multiplication (SpMV) |
  GridBarrier | GridBarrier implements a software global barrier among thread blocks within a CUDA grid |
  GridBarrierLifetime | GridBarrierLifetime extends GridBarrier to provide lifetime management of the temporary device storage needed for cooperation |
  GridEvenShare | GridEvenShare is a descriptor utility for distributing input among CUDA thread blocks in an "even-share" fashion. Each thread block gets roughly the same number of input tiles |
  GridQueue | GridQueue is a descriptor utility for dynamic queue management |
  ArgIndexInputIterator | A random-access input wrapper for pairing dereferenced values with their corresponding indices (forming KeyValuePair tuples) |
  CacheModifiedInputIterator | A random-access input wrapper for dereferencing array values using a PTX cache load modifier |
  CacheModifiedOutputIterator | A random-access output wrapper for storing array values using a PTX cache-modifier |
  ConstantInputIterator | A random-access input generator for dereferencing a sequence of homogeneous values |
  CountingInputIterator | A random-access input generator for dereferencing a sequence of incrementing integer values |
  DiscardOutputIterator | A discard iterator |
  TexObjInputIterator | A random-access input wrapper for dereferencing array values through texture cache. Uses newer Kepler-style texture objects |
  TexRefInputIterator | A random-access input wrapper for dereferencing array values through texture cache. Uses older Tesla/Fermi-style texture references |
  TransformInputIterator | A random-access input wrapper for transforming dereferenced values (see the iterator sketch after this table) |
  Equality | Default equality functor |
  Inequality | Default inequality functor |
  InequalityWrapper | Inequality functor (wraps equality functor) |
  Sum | Default sum functor |
  Difference | Default difference functor |
  Division | Default division functor |
  Max | Default max functor |
  ArgMax | Arg max functor (keeps the value and offset of the first occurrence of the larger item) |
  Min | Default min functor |
  ArgMin | Arg min functor (keeps the value and offset of the first occurrence of the smallest item) |
  CastOp | Default cast functor |
  SwizzleScanOp | Binary operator wrapper for switching non-commutative scan arguments |
  ReduceBySegmentOp | Reduce-by-segment functor |
  ReduceByKeyOp | Reduce-by-key functor (wraps the binary reduction operator to apply to values) |
  BinaryFlip | |
  WarpReduce | The WarpReduce class provides collective methods for computing a parallel reduction of items partitioned across a CUDA thread warp |
   TempStorage | The operations exposed by WarpReduce require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse |
  WarpScan | The WarpScan class provides collective methods for computing a parallel prefix scan of items partitioned across a CUDA thread warp |
   TempStorage | The operations exposed by WarpScan require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse |
 detail | |
 BlockMergeSort | The BlockMergeSort class provides methods for sorting items partitioned across a CUDA thread block using a merge sorting method |
 BlockMergeSortStrategy | Generalized merge sort algorithm |
  TempStorage | The operations exposed by BlockMergeSort require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse |
 DeviceSegmentedSort | DeviceSegmentedSort provides device-wide, parallel operations for computing a batched sort across multiple, non-overlapping sequences of data items residing within device-accessible memory |
 Equals | Type equality test |
 If | Type selection (IF ? ThenType : ElseType) |
 IsPointer | Pointer vs. iterator |
 IsVolatile | Volatile modifier test |
 Log2 | Statically determine log2(N), rounded up |
 PowerOfTwo | Statically determine if N is a power-of-two |
 RemoveQualifiers | Removes const and volatile qualifiers from type Tp |
 WarpExchange | The WarpExchange class provides collective methods for rearranging data partitioned across a CUDA warp |
  TempStorage | The operations exposed by WarpExchange require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse |
 WarpLoad | The WarpLoad class provides collective data movement methods for loading a linear segment of items from memory into a blocked arrangement across a CUDA warp |
  TempStorage | The operations exposed by WarpLoad require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse |
 WarpMergeSort | The WarpMergeSort class provides methods for sorting items partitioned across a CUDA warp using a merge sorting method |
 WarpStore | The WarpStore class provides collective data movement methods for writing a blocked arrangement of items partitioned across a CUDA warp to a linear segment of memory |
  TempStorage | |
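
Usage sketch (block-level collectives). The nested TempStorage rows above all follow the same pattern; below is a minimal sketch using cub::BlockReduce, assuming 128 threads per block and one int per thread. The kernel and buffer names are hypothetical, not part of the CUB API.

    #include <cub/cub.cuh>

    // Hypothetical kernel: compute one partial sum per thread block.
    __global__ void BlockSumKernel(const int *d_in, int *d_block_sums)
    {
        // Specialize the collective for the data type and block size.
        using BlockReduce = cub::BlockReduce<int, 128>;

        // Opaque shared-memory allocation required by the collective,
        // as described in the TempStorage rows above.
        __shared__ typename BlockReduce::TempStorage temp_storage;

        int item = d_in[blockIdx.x * blockDim.x + threadIdx.x];

        // Block-wide reduction; the valid aggregate is returned to thread 0 only.
        int block_sum = BlockReduce(temp_storage).Sum(item);

        if (threadIdx.x == 0)
            d_block_sums[blockIdx.x] = block_sum;
    }

The same TempStorage idiom applies to BlockScan, BlockLoad, BlockStore, and the warp-level collectives (WarpReduce, WarpScan, and so on).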
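Usage sketch (device-wide algorithms). The Device* entry points above share a two-phase temporary-storage protocol: when called with a null d_temp_storage pointer, no work is done and only the required allocation size is written; a second call performs the actual work. A minimal sketch with cub::DeviceReduce::Sum; the wrapper and buffer names are hypothetical.

    #include <cub/cub.cuh>

    // Hypothetical host wrapper: sum num_items ints from d_in into d_out[0].
    void DeviceSum(const int *d_in, int *d_out, int num_items)
    {
        void   *d_temp_storage     = nullptr;
        size_t  temp_storage_bytes = 0;

        // Phase 1: d_temp_storage is null, so only the required size is computed.
        cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

        cudaMalloc(&d_temp_storage, temp_storage_bytes);

        // Phase 2: perform the reduction.
        cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

        cudaFree(d_temp_storage);
    }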
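Usage sketch (fancy iterators). The iterator wrappers above compose with the device-wide algorithms, so inputs need not be materialized in memory. A minimal sketch that sums the squares 0*0 + 1*1 + ... + (n-1)*(n-1) using CountingInputIterator and TransformInputIterator; the Square functor and wrapper name are hypothetical.

    #include <cub/cub.cuh>

    // Hypothetical transform functor: squares its argument.
    struct Square
    {
        __host__ __device__ int operator()(int x) const { return x * x; }
    };

    // Hypothetical host wrapper: writes the sum of squares of 0..n-1 to d_out[0].
    void SumOfSquares(int *d_out, int n)
    {
        // Generates 0, 1, 2, ... on the fly.
        cub::CountingInputIterator<int> counting(0);

        // Applies Square to each dereferenced value.
        cub::TransformInputIterator<int, Square, cub::CountingInputIterator<int>>
            squares(counting, Square());

        // Same two-phase temp-storage protocol as in the previous sketch.
        void  *d_temp_storage     = nullptr;
        size_t temp_storage_bytes = 0;
        cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, squares, d_out, n);
        cudaMalloc(&d_temp_storage, temp_storage_bytes);
        cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, squares, d_out, n);
        cudaFree(d_temp_storage);
    }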