CUB
Namespaces | |
detail | |
Classes | |
struct | CachingDeviceAllocator |
A simple caching allocator for device memory allocations. More... | |
struct | SwitchDevice |
RAII helper which saves the current device and switches to the specified device on construction and switches to the saved device on destruction. More... | |
struct | KernelConfig |
struct | ChainedPolicy |
Helper for dispatching into a policy chain. More... | |
struct | ChainedPolicy< PTX_VERSION, PolicyT, PolicyT > |
Helper for dispatching into a policy chain (end-of-chain specialization) More... | |
class | BlockAdjacentDifference |
BlockAdjacentDifference provides collective methods for computing the differences of adjacent elements partitioned across a CUDA thread block. More... | |
class | BlockDiscontinuity |
The BlockDiscontinuity class provides collective methods for flagging discontinuities within an ordered set of items partitioned across a CUDA thread block. More... | |
class | BlockExchange |
The BlockExchange class provides collective methods for rearranging data partitioned across a CUDA thread block. More... | |
class | BlockHistogram |
The BlockHistogram class provides collective methods for constructing block-wide histograms from data samples partitioned across a CUDA thread block. More... | |
class | BlockLoad |
The BlockLoad class provides collective data movement methods for loading a linear segment of items from memory into a blocked arrangement across a CUDA thread block. More... | |
struct | BlockLoadType |
class | BlockRadixSort |
The BlockRadixSort class provides collective methods for sorting items partitioned across a CUDA thread block using a radix sorting method. More... | |
class | BlockReduce |
The BlockReduce class provides collective methods for computing a parallel reduction of items partitioned across a CUDA thread block. More... | |
class | BlockRunLengthDecode |
The BlockRunLengthDecode class supports decoding a run-length encoded array of items. That is, given the two arrays run_value[N] and run_lengths[N], run_value[i] is repeated run_lengths[i] times in the output array. Due to the nature of the run-length decoding algorithm ("decompression"), the size of the run-length decoded output is runtime-dependent and potentially unbounded. To address this, BlockRunLengthDecode allows retrieving a "window" from the run-length decoded array: the window's offset can be specified, and BLOCK_THREADS * DECODED_ITEMS_PER_THREAD (referred to as window_size) decoded items from the specified window are returned. More... | |
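The windowed decode semantics described above can be sketched on the host as follows. This is an illustrative sketch, not CUB's API: `DecodeWindow` is a hypothetical helper name, and the real class decodes the window cooperatively across a thread block.

```cpp
#include <cstddef>
#include <vector>

// Host-side sketch of run-length decode "window" semantics: given
// run_values[i] repeated run_lengths[i] times in the (virtual) decoded
// array, return up to window_size decoded items starting at from_offset.
std::vector<int> DecodeWindow(const std::vector<int>& run_values,
                              const std::vector<std::size_t>& run_lengths,
                              std::size_t from_offset,
                              std::size_t window_size)
{
    std::vector<int> window;
    std::size_t decoded_offset = 0;  // position in the virtual decoded array
    for (std::size_t i = 0; i < run_values.size() && window.size() < window_size; ++i)
    {
        for (std::size_t j = 0; j < run_lengths[i]; ++j, ++decoded_offset)
        {
            if (decoded_offset >= from_offset && window.size() < window_size)
                window.push_back(run_values[i]);
        }
    }
    return window;
}
```

For runs {7, 3, 5} with lengths {2, 3, 1}, the virtual decoded array is {7, 7, 3, 3, 3, 5}; a window of size 4 at offset 1 yields {7, 3, 3, 3}.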
class | BlockScan |
The BlockScan class provides collective methods for computing a parallel prefix sum/scan of items partitioned across a CUDA thread block. More... | |
class | BlockShuffle |
The BlockShuffle class provides collective methods for shuffling data partitioned across a CUDA thread block. More... | |
class | BlockStore |
The BlockStore class provides collective data movement methods for writing a blocked arrangement of items partitioned across a CUDA thread block to a linear segment of memory. More... | |
struct | BlockStoreType |
struct | RadixSortTwiddle |
Twiddling keys for radix sort. More... | |
struct | BaseDigitExtractor |
Base struct for digit extractor. Contains common code to provide special handling for floating-point -0.0. More... | |
struct | BFEDigitExtractor |
A wrapper type to extract digits. Uses the BFE intrinsic to extract a key from a digit. More... | |
struct | ShiftDigitExtractor |
A wrapper type to extract digits. Uses a combination of shift and bitwise and to extract digits. More... | |
struct | DeviceAdjacentDifference |
DeviceAdjacentDifference provides device-wide, parallel operations for computing the differences of adjacent elements residing within device-accessible memory. More... | |
struct | DeviceHistogram |
DeviceHistogram provides device-wide parallel operations for constructing histogram(s) from a sequence of samples residing within device-accessible memory. More... | |
struct | DeviceMergeSort |
DeviceMergeSort provides device-wide, parallel operations for computing a merge sort across a sequence of data items residing within device-accessible memory. More... | |
struct | DevicePartition |
DevicePartition provides device-wide, parallel operations for partitioning sequences of data items residing within device-accessible memory. More... | |
struct | DeviceRadixSort |
DeviceRadixSort provides device-wide, parallel operations for computing a radix sort across a sequence of data items residing within device-accessible memory. More... | |
struct | DeviceReduce |
DeviceReduce provides device-wide, parallel operations for computing a reduction across a sequence of data items residing within device-accessible memory. More... | |
struct | DeviceRunLengthEncode |
DeviceRunLengthEncode provides device-wide, parallel operations for demarcating "runs" of same-valued items within a sequence residing within device-accessible memory. More... | |
struct | DeviceScan |
DeviceScan provides device-wide, parallel operations for computing a prefix scan across a sequence of data items residing within device-accessible memory. More... | |
struct | DeviceSegmentedRadixSort |
DeviceSegmentedRadixSort provides device-wide, parallel operations for computing a batched radix sort across multiple, non-overlapping sequences of data items residing within device-accessible memory. More... | |
struct | DeviceSegmentedReduce |
DeviceSegmentedReduce provides device-wide, parallel operations for computing a reduction across multiple sequences of data items residing within device-accessible memory. More... | |
struct | DeviceSelect |
DeviceSelect provides device-wide, parallel operations for compacting selected items from sequences of data items residing within device-accessible memory. More... | |
struct | DeviceSpmv |
DeviceSpmv provides device-wide parallel operations for performing sparse-matrix * dense-vector multiplication (SpMV). More... | |
class | GridBarrier |
GridBarrier implements a software global barrier among thread blocks within a CUDA grid. More... | |
class | GridBarrierLifetime |
GridBarrierLifetime extends GridBarrier to provide lifetime management of the temporary device storage needed for cooperation. More... | |
struct | GridEvenShare |
GridEvenShare is a descriptor utility for distributing input among CUDA thread blocks in an "even-share" fashion. Each thread block gets roughly the same number of input tiles. More... | |
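The "even-share" distribution described above can be illustrated with simple host-side index math. This is a sketch of the partitioning idea, not CUB's implementation; `EvenShareRange` is a hypothetical helper name, and the convention that the first `num_tiles % grid_size` blocks take one extra tile is an assumption made for illustration.

```cpp
#include <utility>

// Sketch of even-share tile partitioning: block `block_id` of `grid_size`
// processes tiles [begin, end) out of `num_tiles` total, with the first
// (num_tiles % grid_size) blocks each taking one extra tile.
std::pair<int, int> EvenShareRange(int block_id, int grid_size, int num_tiles)
{
    int base  = num_tiles / grid_size;   // tiles every block receives
    int extra = num_tiles % grid_size;   // leftover tiles, one per leading block
    int begin = block_id * base + (block_id < extra ? block_id : extra);
    int end   = begin + base + (block_id < extra ? 1 : 0);
    return {begin, end};
}
```

With 10 tiles and 4 blocks, the shares are [0,3), [3,6), [6,8), [8,10): every block gets either 2 or 3 tiles, and the ranges tile the input exactly.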
class | GridQueue |
GridQueue is a descriptor utility for dynamic queue management. More... | |
class | ArgIndexInputIterator |
A random-access input wrapper for pairing dereferenced values with their corresponding indices (forming KeyValuePair tuples). More... | |
class | CacheModifiedInputIterator |
A random-access input wrapper for dereferencing array values using a PTX cache load modifier. More... | |
class | CacheModifiedOutputIterator |
A random-access output wrapper for storing array values using a PTX cache-modifier. More... | |
class | ConstantInputIterator |
A random-access input generator for dereferencing a sequence of homogeneous values. More... | |
class | CountingInputIterator |
A random-access input generator for dereferencing a sequence of incrementing integer values. More... | |
class | DiscardOutputIterator |
A discard iterator. More... | |
class | TexObjInputIterator |
A random-access input wrapper for dereferencing array values through texture cache. Uses newer Kepler-style texture objects. More... | |
class | TexRefInputIterator |
A random-access input wrapper for dereferencing array values through texture cache. Uses older Tesla/Fermi-style texture references. More... | |
class | TransformInputIterator |
A random-access input wrapper for transforming dereferenced values. More... | |
struct | Equality |
Default equality functor. More... | |
struct | Inequality |
Default inequality functor. More... | |
struct | InequalityWrapper |
Inequality functor (wraps equality functor) More... | |
struct | Sum |
Default sum functor. More... | |
struct | Difference |
Default difference functor. More... | |
struct | Division |
Default division functor. More... | |
struct | Max |
Default max functor. More... | |
struct | ArgMax |
Arg max functor (keeps the value and offset of the first occurrence of the larger item) More... | |
struct | Min |
Default min functor. More... | |
struct | ArgMin |
Arg min functor (keeps the value and offset of the first occurrence of the smallest item) More... | |
struct | CastOp |
Default cast functor. More... | |
class | SwizzleScanOp |
Binary operator wrapper for switching non-commutative scan arguments. More... | |
struct | ReduceBySegmentOp |
Reduce-by-segment functor. More... | |
struct | ReduceByKeyOp |
Reduce-by-key functor (applies a binary reduction operator to values sharing the same key) More... | |
struct | BinaryFlip |
class | WarpReduce |
The WarpReduce class provides collective methods for computing a parallel reduction of items partitioned across a CUDA thread warp. More... | |
class | WarpScan |
The WarpScan class provides collective methods for computing a parallel prefix scan of items partitioned across a CUDA thread warp. More... | |
Functions | |
__host__ __device__ __forceinline__ cudaError_t | Debug (cudaError_t error, const char *filename, int line) |
CUB error reporting macro (prints error messages to stderr) More... | |
CUB_RUNTIME_FUNCTION int | CurrentDevice () |
Returns the current device or -1 if an error occurred. More... | |
CUB_RUNTIME_FUNCTION int | DeviceCountUncached () |
Returns the number of CUDA devices available or -1 if an error occurred. More... | |
CUB_RUNTIME_FUNCTION int | DeviceCount () |
Returns the number of CUDA devices available. More... | |
CUB_RUNTIME_FUNCTION cudaError_t | PtxVersionUncached (int &ptx_version) |
Retrieves the PTX version that will be used on the current device (major * 100 + minor * 10). More... | |
__host__ cudaError_t | PtxVersionUncached (int &ptx_version, int device) |
Retrieves the PTX version that will be used on device (major * 100 + minor * 10). More... | |
__host__ cudaError_t | PtxVersion (int &ptx_version, int device) |
Retrieves the PTX version that will be used on device (major * 100 + minor * 10). More... | |
CUB_RUNTIME_FUNCTION cudaError_t | PtxVersion (int &ptx_version) |
Retrieves the PTX version that will be used on the current device (major * 100 + minor * 10). More... | |
CUB_RUNTIME_FUNCTION cudaError_t | SmVersionUncached (int &sm_version, int device=CurrentDevice()) |
Retrieves the SM version of device (major * 100 + minor * 10) More... | |
CUB_RUNTIME_FUNCTION cudaError_t | SmVersion (int &sm_version, int device=CurrentDevice()) |
Retrieves the SM version of device (major * 100 + minor * 10) More... | |
CUB_RUNTIME_FUNCTION cudaError_t | SyncStream (cudaStream_t stream) |
template<typename KernelPtr > | |
CUB_RUNTIME_FUNCTION cudaError_t | MaxSmOccupancy (int &max_sm_occupancy, KernelPtr kernel_ptr, int block_threads, int dynamic_smem_bytes=0) |
Computes maximum SM occupancy in thread blocks for executing the given kernel function pointer kernel_ptr on the current device with block_threads per thread block. More... | |
__device__ __forceinline__ unsigned int | SHR_ADD (unsigned int x, unsigned int shift, unsigned int addend) |
Shift-right then add. Returns (x >> shift ) + addend . More... | |
__device__ __forceinline__ unsigned int | SHL_ADD (unsigned int x, unsigned int shift, unsigned int addend) |
Shift-left then add. Returns (x << shift ) + addend . More... | |
template<typename UnsignedBits > | |
__device__ __forceinline__ unsigned int | BFE (UnsignedBits source, unsigned int bit_start, unsigned int num_bits) |
Bitfield-extract. Extracts num_bits from source starting at bit-offset bit_start . The input source may be an 8b, 16b, 32b, or 64b unsigned integer type. More... | |
__device__ __forceinline__ void | BFI (unsigned int &ret, unsigned int x, unsigned int y, unsigned int bit_start, unsigned int num_bits) |
Bitfield insert. Inserts the num_bits least significant bits of y into x at bit-offset bit_start . More... | |
__device__ __forceinline__ unsigned int | IADD3 (unsigned int x, unsigned int y, unsigned int z) |
Three-operand add. Returns x + y + z . More... | |
__device__ __forceinline__ int | PRMT (unsigned int a, unsigned int b, unsigned int index) |
Byte-permute. Pick four arbitrary bytes from two 32-bit registers, and reassemble them into a 32-bit destination register. For SM2.0 or later. More... | |
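The bit-manipulation utilities above map to PTX instructions on the device; their semantics can be expressed in portable C++ as follows. This is a host-side sketch for reference (the function names `ShrAdd`, `Bfe`, and `Bfi` are illustrative, not CUB's), matching the documented behavior of SHR_ADD, BFE, and BFI.

```cpp
#include <cstdint>

// SHR_ADD: shift-right then add, i.e. (x >> shift) + addend.
uint32_t ShrAdd(uint32_t x, uint32_t shift, uint32_t addend)
{
    return (x >> shift) + addend;
}

// BFE: extract num_bits from source starting at bit-offset bit_start.
uint32_t Bfe(uint64_t source, uint32_t bit_start, uint32_t num_bits)
{
    uint64_t mask = (num_bits >= 64) ? ~0ull : ((1ull << num_bits) - 1);
    return static_cast<uint32_t>((source >> bit_start) & mask);
}

// BFI: insert the num_bits least-significant bits of y into x at bit_start.
void Bfi(uint32_t& ret, uint32_t x, uint32_t y, uint32_t bit_start, uint32_t num_bits)
{
    uint32_t mask = ((num_bits >= 32) ? ~0u : ((1u << num_bits) - 1)) << bit_start;
    ret = (x & ~mask) | ((y << bit_start) & mask);
}
```

For example, extracting 8 bits of 0xABCD starting at bit 4 yields 0xBC, and inserting the 4-bit value 0xF into 0 at bit-offset 4 yields 0xF0.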
__device__ __forceinline__ void | ThreadExit () |
Terminates the calling thread. More... | |
__device__ __forceinline__ void | ThreadTrap () |
Abort execution and generate an interrupt to the host CPU. More... | |
__device__ __forceinline__ int | RowMajorTid (int block_dim_x, int block_dim_y, int block_dim_z) |
Returns the row-major linear thread identifier for a multidimensional thread block. More... | |
__device__ __forceinline__ unsigned int | LaneId () |
Returns the warp lane ID of the calling thread. More... | |
__device__ __forceinline__ unsigned int | WarpId () |
Returns the warp ID of the calling thread. Warp ID is guaranteed to be unique among warps, but may not correspond to a zero-based ranking within the thread block. More... | |
template<int LOGICAL_WARP_THREADS, int PTX_ARCH = CUB_PTX_ARCH> | |
__host__ __device__ __forceinline__ unsigned int | WarpMask (unsigned int warp_id) |
Returns the warp mask for a warp of LOGICAL_WARP_THREADS threads. More... | |
__device__ __forceinline__ unsigned int | LaneMaskLt () |
Returns the warp lane mask of all lanes less than the calling thread. More... | |
__device__ __forceinline__ unsigned int | LaneMaskLe () |
Returns the warp lane mask of all lanes less than or equal to the calling thread. More... | |
__device__ __forceinline__ unsigned int | LaneMaskGt () |
Returns the warp lane mask of all lanes greater than the calling thread. More... | |
__device__ __forceinline__ unsigned int | LaneMaskGe () |
Returns the warp lane mask of all lanes greater than or equal to the calling thread. More... | |
template<int LOGICAL_WARP_THREADS, typename T > | |
__device__ __forceinline__ T | ShuffleUp (T input, int src_offset, int first_thread, unsigned int member_mask) |
Shuffle-up for any data type. Each warp-lane i obtains the value input contributed by warp-lane i - src_offset. For thread lanes i < src_offset, the thread's own input is returned to the thread. More... | |
template<int LOGICAL_WARP_THREADS, typename T > | |
__device__ __forceinline__ T | ShuffleDown (T input, int src_offset, int last_thread, unsigned int member_mask) |
Shuffle-down for any data type. Each warp-lane i obtains the value input contributed by warp-lane i + src_offset. For thread lanes i >= WARP_THREADS, the thread's own input is returned to the thread. More... | |
template<int LOGICAL_WARP_THREADS, typename T > | |
__device__ __forceinline__ T | ShuffleIndex (T input, int src_lane, unsigned int member_mask) |
Shuffle-broadcast for any data type. Each warp-lane i obtains the value input contributed by warp-lane src_lane. For src_lane < 0 or src_lane >= WARP_THREADS, the thread's own input is returned to the thread. More... | |
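The index math behind the shuffle primitives above can be simulated on the host. This is an illustrative sketch only (`SimulateShuffleUp` is a hypothetical name): it models a warp as a vector of lane inputs and applies the documented ShuffleUp rule that lane i reads the input of lane i - src_offset, with the leading lanes keeping their own input.

```cpp
#include <vector>

// CPU simulation of ShuffleUp semantics over a logical warp: each lane i
// receives input[i - src_offset]; lanes with i < src_offset keep input[i].
std::vector<int> SimulateShuffleUp(const std::vector<int>& input, int src_offset)
{
    std::vector<int> output(input.size());
    for (int i = 0; i < static_cast<int>(input.size()); ++i)
        output[i] = (i < src_offset) ? input[i] : input[i - src_offset];
    return output;
}
```

ShuffleDown and ShuffleIndex follow the same pattern with `i + src_offset` and a fixed `src_lane`, respectively.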
template<int LABEL_BITS> | |
__device__ unsigned int | MatchAny (unsigned int label) |
template<typename BinaryOpT > | |
__device__ __host__ BinaryFlip < BinaryOpT > | MakeBinaryFlip (BinaryOpT binary_op) |
template<typename T > | |
__device__ __forceinline__ void | Swap (T &lhs, T &rhs) |
template<typename KeyT , typename ValueT , typename CompareOp , int ITEMS_PER_THREAD> | |
__device__ __forceinline__ void | StableOddEvenSort (KeyT(&keys)[ITEMS_PER_THREAD], ValueT(&items)[ITEMS_PER_THREAD], CompareOp compare_op) |
Sorts data using odd-even sort method. More... | |
Blocked arrangement I/O (direct) | |
template<typename InputT , int ITEMS_PER_THREAD, typename InputIteratorT > | |
__device__ __forceinline__ void | LoadDirectBlocked (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD]) |
Load a linear segment of items into a blocked arrangement across the thread block. More... | |
template<typename InputT , int ITEMS_PER_THREAD, typename InputIteratorT > | |
__device__ __forceinline__ void | LoadDirectBlocked (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items) |
Load a linear segment of items into a blocked arrangement across the thread block, guarded by range. More... | |
template<typename InputT , typename DefaultT , int ITEMS_PER_THREAD, typename InputIteratorT > | |
__device__ __forceinline__ void | LoadDirectBlocked (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items, DefaultT oob_default) |
Load a linear segment of items into a blocked arrangement across the thread block, guarded by range, with a fall-back assignment of out-of-bound elements. More... | |
template<typename T , int ITEMS_PER_THREAD> | |
__device__ __forceinline__ void | LoadDirectBlockedVectorized (int linear_tid, T *block_ptr, T(&items)[ITEMS_PER_THREAD]) |
Load a linear segment of items into a blocked arrangement across the thread block. More... | |
template<typename T , int ITEMS_PER_THREAD, typename OutputIteratorT > | |
__device__ __forceinline__ void | StoreDirectBlocked (int linear_tid, OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD]) |
Store a blocked arrangement of items across a thread block into a linear segment of items. More... | |
template<typename T , int ITEMS_PER_THREAD, typename OutputIteratorT > | |
__device__ __forceinline__ void | StoreDirectBlocked (int linear_tid, OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD], int valid_items) |
Store a blocked arrangement of items across a thread block into a linear segment of items, guarded by range. More... | |
template<typename T , int ITEMS_PER_THREAD> | |
__device__ __forceinline__ void | StoreDirectBlockedVectorized (int linear_tid, T *block_ptr, T(&items)[ITEMS_PER_THREAD]) |
Store a blocked arrangement of items across a thread block into a linear segment of items. More... | |
Striped arrangement I/O (direct) | |
template<int BLOCK_THREADS, typename InputT , int ITEMS_PER_THREAD, typename InputIteratorT > | |
__device__ __forceinline__ void | LoadDirectStriped (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD]) |
Load a linear segment of items into a striped arrangement across the thread block. More... | |
template<int BLOCK_THREADS, typename InputT , int ITEMS_PER_THREAD, typename InputIteratorT > | |
__device__ __forceinline__ void | LoadDirectStriped (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items) |
Load a linear segment of items into a striped arrangement across the thread block, guarded by range. More... | |
template<int BLOCK_THREADS, typename InputT , typename DefaultT , int ITEMS_PER_THREAD, typename InputIteratorT > | |
__device__ __forceinline__ void | LoadDirectStriped (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items, DefaultT oob_default) |
Load a linear segment of items into a striped arrangement across the thread block, guarded by range, with a fall-back assignment of out-of-bound elements. More... | |
template<int BLOCK_THREADS, typename T , int ITEMS_PER_THREAD, typename OutputIteratorT > | |
__device__ __forceinline__ void | StoreDirectStriped (int linear_tid, OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD]) |
Store a striped arrangement of data across the thread block into a linear segment of items. More... | |
template<int BLOCK_THREADS, typename T , int ITEMS_PER_THREAD, typename OutputIteratorT > | |
__device__ __forceinline__ void | StoreDirectStriped (int linear_tid, OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD], int valid_items) |
Store a striped arrangement of data across the thread block into a linear segment of items, guarded by range. More... | |
Warp-striped arrangement I/O (direct) | |
template<typename InputT , int ITEMS_PER_THREAD, typename InputIteratorT > | |
__device__ __forceinline__ void | LoadDirectWarpStriped (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD]) |
Load a linear segment of items into a warp-striped arrangement across the thread block. More... | |
template<typename InputT , int ITEMS_PER_THREAD, typename InputIteratorT > | |
__device__ __forceinline__ void | LoadDirectWarpStriped (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items) |
Load a linear segment of items into a warp-striped arrangement across the thread block, guarded by range. More... | |
template<typename InputT , typename DefaultT , int ITEMS_PER_THREAD, typename InputIteratorT > | |
__device__ __forceinline__ void | LoadDirectWarpStriped (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items, DefaultT oob_default) |
Load a linear segment of items into a warp-striped arrangement across the thread block, guarded by range, with a fall-back assignment of out-of-bound elements. More... | |
template<typename T , int ITEMS_PER_THREAD, typename OutputIteratorT > | |
__device__ __forceinline__ void | StoreDirectWarpStriped (int linear_tid, OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD]) |
Store a warp-striped arrangement of data across the thread block into a linear segment of items. More... | |
template<typename T , int ITEMS_PER_THREAD, typename OutputIteratorT > | |
__device__ __forceinline__ void | StoreDirectWarpStriped (int linear_tid, OutputIteratorT block_itr, T(&items)[ITEMS_PER_THREAD], int valid_items) |
Store a warp-striped arrangement of data across the thread block into a linear segment of items, guarded by range. More... | |
Thread I/O (cache modified) | |
template<CacheLoadModifier MODIFIER, typename InputIteratorT > | |
__device__ __forceinline__ cub::detail::value_t < InputIteratorT > | ThreadLoad (InputIteratorT itr) |
Thread utility for reading memory using cub::CacheLoadModifier cache modifiers. Can be used to load any data type. More... | |
template<CacheStoreModifier MODIFIER, typename OutputIteratorT , typename T > | |
__device__ __forceinline__ void | ThreadStore (OutputIteratorT itr, T val) |
Thread utility for writing memory using cub::CacheStoreModifier cache modifiers. Can be used to store any data type. More... | |
BlockHistogramAlgorithm enumerates alternative algorithms for the parallel construction of block-wide histograms.
cub::BlockLoadAlgorithm enumerates alternative algorithms for cub::BlockLoad to read a linear segment of data from memory into a blocked arrangement across a CUDA thread block.
Enumerator | |
---|---|
BLOCK_LOAD_DIRECT | A blocked arrangement of data is read directly from memory. |
BLOCK_LOAD_STRIPED | A striped arrangement of data is read directly from memory. |
BLOCK_LOAD_VECTORIZE | A blocked arrangement of data is read from memory using CUDA's built-in vectorized loads as a coalescing optimization. |
BLOCK_LOAD_TRANSPOSE | A striped arrangement of data is read efficiently from memory and then locally transposed into a blocked arrangement. |
BLOCK_LOAD_WARP_TRANSPOSE | A warp-striped arrangement of data is read efficiently from memory and then locally transposed into a blocked arrangement. |
BLOCK_LOAD_WARP_TRANSPOSE_TIMESLICED | Like BLOCK_LOAD_WARP_TRANSPOSE, but only one warp's worth of shared memory is provisioned and is time-sliced among warps. |
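The blocked and striped arrangements referenced by these enumerators differ only in index math. As a sketch (these helper names are illustrative, not CUB code), item j held by thread `tid` maps to the following positions in the linear memory segment:

```cpp
// Blocked arrangement: each thread owns a contiguous run of items
// (used by BLOCK_LOAD_DIRECT and BLOCK_LOAD_VECTORIZE).
int BlockedIndex(int tid, int j, int items_per_thread)
{
    return tid * items_per_thread + j;
}

// Striped arrangement: consecutive threads own consecutive items
// (used by BLOCK_LOAD_STRIPED); loads from it are naturally coalesced.
int StripedIndex(int tid, int j, int block_threads)
{
    return j * block_threads + tid;
}
```

The TRANSPOSE variants read in the coalescing-friendly striped (or warp-striped) order and then exchange items through shared memory to produce the blocked arrangement.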
BlockReduceAlgorithm enumerates alternative algorithms for parallel reduction across a CUDA thread block.
Enumerator | |
---|---|
BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY | An efficient raking reduction that supports only commutative reduction operators. |
BLOCK_REDUCE_RAKING | An efficient raking reduction that supports both commutative and non-commutative reduction operators. |
BLOCK_REDUCE_WARP_REDUCTIONS | A quick reduction built from warp-synchronous reductions, followed by a reduction of the per-warp aggregates. |
BlockScanAlgorithm enumerates alternative algorithms for cub::BlockScan to compute a parallel prefix scan across a CUDA thread block.
Enumerator | |
---|---|
BLOCK_SCAN_RAKING | An efficient raking reduce-then-scan prefix scan. |
BLOCK_SCAN_RAKING_MEMOIZE | Like BLOCK_SCAN_RAKING, but memoizes raking segments in registers, trading register pressure for fewer shared memory reads. |
BLOCK_SCAN_WARP_SCANS | A quick prefix scan built from tiled warp-synchronous scans. |
cub::BlockStoreAlgorithm enumerates alternative algorithms for cub::BlockStore to write a blocked arrangement of items across a CUDA thread block to a linear segment of memory.
Enumerator | |
---|---|
BLOCK_STORE_DIRECT | A blocked arrangement of data is written directly to memory. |
BLOCK_STORE_STRIPED | A striped arrangement of data is written directly to memory. |
BLOCK_STORE_VECTORIZE | A blocked arrangement of data is written directly to memory using CUDA's built-in vectorized stores as a coalescing optimization. |
BLOCK_STORE_TRANSPOSE | A blocked arrangement is locally transposed into a striped arrangement, which is then written to memory. |
BLOCK_STORE_WARP_TRANSPOSE | A blocked arrangement is locally transposed into a warp-striped arrangement, which is then written to memory. |
BLOCK_STORE_WARP_TRANSPOSE_TIMESLICED | Like BLOCK_STORE_WARP_TRANSPOSE, but only one warp's worth of shared memory is provisioned and is time-sliced among warps. |
__device__ unsigned int cub::MatchAny (unsigned int label) [inline]
Compute a 32b mask of threads having the same least-significant LABEL_BITS of label as the calling thread.
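The mask MatchAny computes can be simulated on the host by comparing the low bits of every pair of lane labels. This is an illustrative sketch only (`SimulateMatchAny` is a hypothetical name); on the device the comparison happens warp-synchronously across the 32 lanes.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// For each lane i, build the mask of lanes j whose least-significant
// `label_bits` of label match lane i's (bit j of match[i] is set iff so).
std::vector<uint32_t> SimulateMatchAny(const std::vector<uint32_t>& labels,
                                       int label_bits)
{
    uint32_t low_mask = (label_bits >= 32) ? ~0u : ((1u << label_bits) - 1);
    std::vector<uint32_t> match(labels.size(), 0);
    for (std::size_t i = 0; i < labels.size(); ++i)
        for (std::size_t j = 0; j < labels.size(); ++j)
            if ((labels[i] & low_mask) == (labels[j] & low_mask))
                match[i] |= (1u << j);
    return match;
}
```

With labels {0, 1, 0, 1} and one label bit, lane 0 matches lanes {0, 2} (mask 0x5) and lane 1 matches lanes {1, 3} (mask 0xA).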
__device__ __forceinline__ void cub::Swap (T &lhs, T &rhs)
__device__ __forceinline__ void cub::StableOddEvenSort (KeyT (&keys)[ITEMS_PER_THREAD], ValueT (&items)[ITEMS_PER_THREAD], CompareOp compare_op)
Sorts data using odd-even sort method.
The sorting method is stable. Further details can be found in: A. Nico Habermann. Parallel neighbor sort (or the glory of the induction principle). Technical Report AD-759 248, Carnegie Mellon University, 1972.
Template Parameters
KeyT | Key type |
ValueT | Value type. If cub::NullType is used as ValueT, only keys are sorted. |
CompareOp | Functor type having member bool operator()(KeyT lhs, KeyT rhs) |
ITEMS_PER_THREAD | The number of items per thread |
Parameters
[in,out] | keys | Keys to sort |
[in,out] | items | Values to sort |
[in] | compare_op | Comparison function object which returns true if the first argument is ordered before the second |
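The stable odd-even (transposition) sort described above can be sketched on the host over vectors instead of per-thread arrays. This is an illustrative sketch, not CUB's device code: alternating passes compare-exchange even-indexed and odd-indexed adjacent pairs, swapping only on strict order so equal keys never reorder (which is what makes the sort stable).

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Host-side stable odd-even sort of key/value pairs. n passes of
// alternating even/odd compare-exchanges guarantee a sorted result.
template <typename KeyT, typename ValueT, typename CompareOp>
void StableOddEvenSortHost(std::vector<KeyT>& keys, std::vector<ValueT>& items,
                           CompareOp compare_op)
{
    const std::size_t n = keys.size();
    for (std::size_t pass = 0; pass < n; ++pass)
        for (std::size_t i = pass & 1; i + 1 < n; i += 2)  // even then odd pairs
            if (compare_op(keys[i + 1], keys[i]))          // swap only on strict order
            {
                std::swap(keys[i], keys[i + 1]);
                std::swap(items[i], items[i + 1]);
            }
}
```

Because adjacent equal keys are never swapped, values attached to equal keys retain their input order, matching the stability guarantee cited from Habermann's report.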