CUB
|
DeviceRadixSort provides device-wide, parallel operations for computing a radix sort across a sequence of data items residing within device-accessible memory.
unsigned char
, int
, double
, etc.) as well as CUDA's __half
and __nv_bfloat16
16-bit floating-point types.[begin_bit, end_bit)
range, as the bitwise transformations will occur before the bit-range truncation.Any transformations applied to the keys prior to sorting are reversed while writing to the final output buffer.
For floating point types, positive and negative zero are a special case and will be considered equivalent during sorting.
uint32
keys. Performance plots for other scenarios can be found in the detailed method descriptions below.Static Public Methods | |
KeyT-value pairs | |
template<typename KeyT , typename ValueT , typename NumItemsT > | |
static CUB_RUNTIME_FUNCTION cudaError_t | SortPairs (void *d_temp_storage, size_t &temp_storage_bytes, const KeyT *d_keys_in, KeyT *d_keys_out, const ValueT *d_values_in, ValueT *d_values_out, NumItemsT num_items, int begin_bit=0, int end_bit=sizeof(KeyT)*8, cudaStream_t stream=0, bool debug_synchronous=false) |
Sorts key-value pairs into ascending order. (~2N auxiliary storage required) More... | |
template<typename KeyT , typename ValueT , typename NumItemsT > | |
static CUB_RUNTIME_FUNCTION cudaError_t | SortPairs (void *d_temp_storage, size_t &temp_storage_bytes, DoubleBuffer< KeyT > &d_keys, DoubleBuffer< ValueT > &d_values, NumItemsT num_items, int begin_bit=0, int end_bit=sizeof(KeyT)*8, cudaStream_t stream=0, bool debug_synchronous=false) |
Sorts key-value pairs into ascending order. (~N auxiliary storage required) More... | |
template<typename KeyT , typename ValueT , typename NumItemsT > | |
static CUB_RUNTIME_FUNCTION cudaError_t | SortPairsDescending (void *d_temp_storage, size_t &temp_storage_bytes, const KeyT *d_keys_in, KeyT *d_keys_out, const ValueT *d_values_in, ValueT *d_values_out, NumItemsT num_items, int begin_bit=0, int end_bit=sizeof(KeyT)*8, cudaStream_t stream=0, bool debug_synchronous=false) |
Sorts key-value pairs into descending order. (~2N auxiliary storage required). More... | |
template<typename KeyT , typename ValueT , typename NumItemsT > | |
static CUB_RUNTIME_FUNCTION cudaError_t | SortPairsDescending (void *d_temp_storage, size_t &temp_storage_bytes, DoubleBuffer< KeyT > &d_keys, DoubleBuffer< ValueT > &d_values, NumItemsT num_items, int begin_bit=0, int end_bit=sizeof(KeyT)*8, cudaStream_t stream=0, bool debug_synchronous=false) |
Sorts key-value pairs into descending order. (~N auxiliary storage required). More... | |
Keys-only | |
template<typename KeyT , typename NumItemsT > | |
static CUB_RUNTIME_FUNCTION cudaError_t | SortKeys (void *d_temp_storage, size_t &temp_storage_bytes, const KeyT *d_keys_in, KeyT *d_keys_out, NumItemsT num_items, int begin_bit=0, int end_bit=sizeof(KeyT)*8, cudaStream_t stream=0, bool debug_synchronous=false) |
Sorts keys into ascending order. (~2N auxiliary storage required) More... | |
template<typename KeyT , typename NumItemsT > | |
static CUB_RUNTIME_FUNCTION cudaError_t | SortKeys (void *d_temp_storage, size_t &temp_storage_bytes, DoubleBuffer< KeyT > &d_keys, NumItemsT num_items, int begin_bit=0, int end_bit=sizeof(KeyT)*8, cudaStream_t stream=0, bool debug_synchronous=false) |
Sorts keys into ascending order. (~N auxiliary storage required). More... | |
template<typename KeyT , typename NumItemsT > | |
static CUB_RUNTIME_FUNCTION cudaError_t | SortKeysDescending (void *d_temp_storage, size_t &temp_storage_bytes, const KeyT *d_keys_in, KeyT *d_keys_out, NumItemsT num_items, int begin_bit=0, int end_bit=sizeof(KeyT)*8, cudaStream_t stream=0, bool debug_synchronous=false) |
Sorts keys into descending order. (~2N auxiliary storage required). More... | |
template<typename KeyT , typename NumItemsT > | |
static CUB_RUNTIME_FUNCTION cudaError_t | SortKeysDescending (void *d_temp_storage, size_t &temp_storage_bytes, DoubleBuffer< KeyT > &d_keys, NumItemsT num_items, int begin_bit=0, int end_bit=sizeof(KeyT)*8, cudaStream_t stream=0, bool debug_synchronous=false) |
Sorts keys into descending order. (~N auxiliary storage required). More... | |
|
inlinestatic |
Sorts key-value pairs into ascending order. (~2N auxiliary storage required)
[d_keys_in, d_keys_in + num_items)
[d_keys_out, d_keys_out + num_items)
[d_values_in, d_values_in + num_items)
[d_values_out, d_values_out + num_items)
[begin_bit, end_bit)
of differentiating key bits can be specified. This can reduce overall sorting overhead and yield a corresponding performance improvement.N+P
), where N
is the length of the input and P
is the number of streaming multiprocessors on the device. For sorting using only O(P
) temporary storage, see the sorting interface using DoubleBuffer wrappers below.d_temp_storage
is NULL
, no work is done and the required allocation size is returned in temp_storage_bytes
.uint32,uint32
and uint64,uint64
pairs, respectively.int
keys with associated vector of int
values. KeyT | [inferred] KeyT type |
ValueT | [inferred] ValueT type |
NumItemsT | [inferred] Type of num_items |
[in] | d_temp_storage | Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done. |
[in,out] | temp_storage_bytes | Reference to size in bytes of d_temp_storage allocation |
[in] | d_keys_in | Pointer to the input data of key data to sort |
[out] | d_keys_out | Pointer to the sorted output sequence of key data |
[in] | d_values_in | Pointer to the corresponding input sequence of associated value items |
[out] | d_values_out | Pointer to the correspondingly-reordered output sequence of associated value items |
[in] | num_items | Number of items to sort |
[in] | begin_bit | [optional] The least-significant bit index (inclusive) needed for key comparison |
[in] | end_bit | [optional] The most-significant bit index (exclusive) needed for key comparison (e.g., sizeof(unsigned int) * 8) |
[in] | stream | [optional] CUDA stream to launch kernels within. Default is stream0. |
[in] | debug_synchronous | [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false . |
|
inlinestatic |
Sorts key-value pairs into ascending order. (~N auxiliary storage required)
[d_keys.Current(), d_keys.Current() + num_items)
[d_keys.Alternate(), d_keys.Alternate() + num_items)
[d_values.Current(), d_values.Current() + num_items)
[d_values.Alternate(), d_values.Alternate() + num_items)
[begin_bit, end_bit)
of differentiating key bits can be specified. This can reduce overall sorting overhead and yield a corresponding performance improvement.P
), where P
is the number of streaming multiprocessors on the device (and is typically a small constant relative to the input size N
).d_temp_storage
is NULL
, no work is done and the required allocation size is returned in temp_storage_bytes
.uint32,uint32
and uint64,uint64
pairs, respectively.int
keys with associated vector of int
values. KeyT | [inferred] KeyT type |
ValueT | [inferred] ValueT type |
NumItemsT | [inferred] Type of num_items |
[in] | d_temp_storage | Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done. |
[in,out] | temp_storage_bytes | Reference to size in bytes of d_temp_storage allocation |
[in,out] | d_keys | Reference to the double-buffer of keys whose "current" device-accessible buffer contains the unsorted input keys and, upon return, is updated to point to the sorted output keys |
[in,out] | d_values | Double-buffer of values whose "current" device-accessible buffer contains the unsorted input values and, upon return, is updated to point to the sorted output values |
[in] | num_items | Number of items to sort |
[in] | begin_bit | [optional] The least-significant bit index (inclusive) needed for key comparison |
[in] | end_bit | [optional] The most-significant bit index (exclusive) needed for key comparison (e.g., sizeof(unsigned int) * 8) |
[in] | stream | [optional] CUDA stream to launch kernels within. Default is stream0. |
[in] | debug_synchronous | [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false . |
|
inlinestatic |
Sorts key-value pairs into descending order. (~2N auxiliary storage required).
[d_keys_in, d_keys_in + num_items)
[d_keys_out, d_keys_out + num_items)
[d_values_in, d_values_in + num_items)
[d_values_out, d_values_out + num_items)
[begin_bit, end_bit)
of differentiating key bits can be specified. This can reduce overall sorting overhead and yield a corresponding performance improvement.N+P
), where N
is the length of the input and P
is the number of streaming multiprocessors on the device. For sorting using only O(P
) temporary storage, see the sorting interface using DoubleBuffer wrappers below.d_temp_storage
is NULL
, no work is done and the required allocation size is returned in temp_storage_bytes
.int
keys with associated vector of int
values. KeyT | [inferred] KeyT type |
ValueT | [inferred] ValueT type |
NumItemsT | [inferred] Type of num_items |
[in] | d_temp_storage | Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done. |
[in,out] | temp_storage_bytes | Reference to size in bytes of d_temp_storage allocation |
[in] | d_keys_in | Pointer to the input data of key data to sort |
[out] | d_keys_out | Pointer to the sorted output sequence of key data |
[in] | d_values_in | Pointer to the corresponding input sequence of associated value items |
[out] | d_values_out | Pointer to the correspondingly-reordered output sequence of associated value items |
[in] | num_items | Number of items to sort |
[in] | begin_bit | [optional] The least-significant bit index (inclusive) needed for key comparison |
[in] | end_bit | [optional] The most-significant bit index (exclusive) needed for key comparison (e.g., sizeof(unsigned int) * 8) |
[in] | stream | [optional] CUDA stream to launch kernels within. Default is stream0. |
[in] | debug_synchronous | [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false . |
|
inlinestatic |
Sorts key-value pairs into descending order. (~N auxiliary storage required).
[d_keys.Current(), d_keys.Current() + num_items)
[d_keys.Alternate(), d_keys.Alternate() + num_items)
[d_values.Current(), d_values.Current() + num_items)
[d_values.Alternate(), d_values.Alternate() + num_items)
[begin_bit, end_bit)
of differentiating key bits can be specified. This can reduce overall sorting overhead and yield a corresponding performance improvement.P
), where P
is the number of streaming multiprocessors on the device (and is typically a small constant relative to the input size N
).d_temp_storage
is NULL
, no work is done and the required allocation size is returned in temp_storage_bytes
.int
keys with associated vector of int
values. KeyT | [inferred] KeyT type |
ValueT | [inferred] ValueT type |
NumItemsT | [inferred] Type of num_items |
[in] | d_temp_storage | Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done. |
[in,out] | temp_storage_bytes | Reference to size in bytes of d_temp_storage allocation |
[in,out] | d_keys | Reference to the double-buffer of keys whose "current" device-accessible buffer contains the unsorted input keys and, upon return, is updated to point to the sorted output keys |
[in,out] | d_values | Double-buffer of values whose "current" device-accessible buffer contains the unsorted input values and, upon return, is updated to point to the sorted output values |
[in] | num_items | Number of items to sort |
[in] | begin_bit | [optional] The least-significant bit index (inclusive) needed for key comparison |
[in] | end_bit | [optional] The most-significant bit index (exclusive) needed for key comparison (e.g., sizeof(unsigned int) * 8) |
[in] | stream | [optional] CUDA stream to launch kernels within. Default is stream0. |
[in] | debug_synchronous | [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false . |
|
inlinestatic |
Sorts keys into ascending order. (~2N auxiliary storage required)
[d_keys_in, d_keys_in + num_items)
[d_keys_out, d_keys_out + num_items)
[begin_bit, end_bit)
of differentiating key bits can be specified. This can reduce overall sorting overhead and yield a corresponding performance improvement.N+P
), where N
is the length of the input and P
is the number of streaming multiprocessors on the device. For sorting using only O(P
) temporary storage, see the sorting interface using DoubleBuffer wrappers below.d_temp_storage
is NULL
, no work is done and the required allocation size is returned in temp_storage_bytes
.uint32
and uint64
keys, respectively.int
keys. KeyT | [inferred] KeyT type |
NumItemsT | [inferred] Type of num_items |
NumItemsT | [inferred] Type of num_items |
[in] | d_temp_storage | Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done. |
[in,out] | temp_storage_bytes | Reference to size in bytes of d_temp_storage allocation |
[in] | d_keys_in | Pointer to the input data of key data to sort |
[out] | d_keys_out | Pointer to the sorted output sequence of key data |
[in] | num_items | Number of items to sort |
[in] | begin_bit | [optional] The least-significant bit index (inclusive) needed for key comparison |
[in] | end_bit | [optional] The most-significant bit index (exclusive) needed for key comparison (e.g., sizeof(unsigned int) * 8) |
[in] | stream | [optional] CUDA stream to launch kernels within. Default is stream0. |
[in] | debug_synchronous | [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false . |
|
inlinestatic |
Sorts keys into ascending order. (~N auxiliary storage required).
[d_keys.Current(), d_keys.Current() + num_items)
[d_keys.Alternate(), d_keys.Alternate() + num_items)
[begin_bit, end_bit)
of differentiating key bits can be specified. This can reduce overall sorting overhead and yield a corresponding performance improvement.P
), where P
is the number of streaming multiprocessors on the device (and is typically a small constant relative to the input size N
).d_temp_storage
is NULL
, no work is done and the required allocation size is returned in temp_storage_bytes
.uint32
and uint64
keys, respectively.int
keys. KeyT | [inferred] KeyT type |
NumItemsT | [inferred] Type of num_items |
[in] | d_temp_storage | Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done. |
[in,out] | temp_storage_bytes | Reference to size in bytes of d_temp_storage allocation |
[in,out] | d_keys | Reference to the double-buffer of keys whose "current" device-accessible buffer contains the unsorted input keys and, upon return, is updated to point to the sorted output keys |
[in] | num_items | Number of items to sort |
[in] | begin_bit | [optional] The least-significant bit index (inclusive) needed for key comparison |
[in] | end_bit | [optional] The most-significant bit index (exclusive) needed for key comparison (e.g., sizeof(unsigned int) * 8) |
[in] | stream | [optional] CUDA stream to launch kernels within. Default is stream0. |
[in] | debug_synchronous | [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false . |
|
inlinestatic |
Sorts keys into descending order. (~2N auxiliary storage required).
[d_keys_in, d_keys_in + num_items)
[d_keys_out, d_keys_out + num_items)
[begin_bit, end_bit)
of differentiating key bits can be specified. This can reduce overall sorting overhead and yield a corresponding performance improvement.N+P
), where N
is the length of the input and P
is the number of streaming multiprocessors on the device. For sorting using only O(P
) temporary storage, see the sorting interface using DoubleBuffer wrappers below.d_temp_storage
is NULL
, no work is done and the required allocation size is returned in temp_storage_bytes
.int
keys. KeyT | [inferred] KeyT type |
NumItemsT | [inferred] Type of num_items |
[in] | d_temp_storage | Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done. |
[in,out] | temp_storage_bytes | Reference to size in bytes of d_temp_storage allocation |
[in] | d_keys_in | Pointer to the input data of key data to sort |
[out] | d_keys_out | Pointer to the sorted output sequence of key data |
[in] | num_items | Number of items to sort |
[in] | begin_bit | [optional] The least-significant bit index (inclusive) needed for key comparison |
[in] | end_bit | [optional] The most-significant bit index (exclusive) needed for key comparison (e.g., sizeof(unsigned int) * 8) |
[in] | stream | [optional] CUDA stream to launch kernels within. Default is stream0. |
[in] | debug_synchronous | [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false . |
|
inlinestatic |
Sorts keys into descending order. (~N auxiliary storage required).
[d_keys.Current(), d_keys.Current() + num_items)
[d_keys.Alternate(), d_keys.Alternate() + num_items)
[begin_bit, end_bit)
of differentiating key bits can be specified. This can reduce overall sorting overhead and yield a corresponding performance improvement.P
), where P
is the number of streaming multiprocessors on the device (and is typically a small constant relative to the input size N
).d_temp_storage
is NULL
, no work is done and the required allocation size is returned in temp_storage_bytes
.int
keys. KeyT | [inferred] KeyT type |
NumItemsT | [inferred] Type of num_items |
[in] | d_temp_storage | Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done. |
[in,out] | temp_storage_bytes | Reference to size in bytes of d_temp_storage allocation |
[in,out] | d_keys | Reference to the double-buffer of keys whose "current" device-accessible buffer contains the unsorted input keys and, upon return, is updated to point to the sorted output keys |
[in] | num_items | Number of items to sort |
[in] | begin_bit | [optional] The least-significant bit index (inclusive) needed for key comparison |
[in] | end_bit | [optional] The most-significant bit index (exclusive) needed for key comparison (e.g., sizeof(unsigned int) * 8) |
[in] | stream | [optional] CUDA stream to launch kernels within. Default is stream0. |
[in] | debug_synchronous | [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false . |