CUB  
 All Classes Namespaces Files Functions Variables Typedefs Enumerations Enumerator Friends Macros Groups Pages
List of all members
cub::DeviceRadixSort Struct Reference

Detailed description

DeviceRadixSort provides device-wide, parallel operations for computing a radix sort across a sequence of data items residing within device-accessible memory.

sorting_logo.png
.
Overview
The radix sorting method arranges items into ascending (or descending) order. The algorithm relies upon a positional representation for keys, i.e., each key is comprised of an ordered sequence of symbols (e.g., digits, characters, etc.) specified from least-significant to most-significant. For a given input sequence of keys and a set of rules specifying a total ordering of the symbolic alphabet, the radix sorting method produces a lexicographic ordering of those keys.
Supported Types
DeviceRadixSort can sort all of the built-in C++ numeric primitive types (unsigned char, int, double, etc.) as well as CUDA's __half and __nv_bfloat16 16-bit floating-point types.
Floating-Point Special Cases
  • Positive and negative zeros are considered equivalent, and will be treated as such in the output.
  • No special handling is implemented for NaN values; these are sorted according to their bit representations after any transformations.
Transformations
Although the direct radix sorting method can only be applied to unsigned integral types, DeviceRadixSort is able to sort signed and floating-point types via simple bit-wise transformations that ensure lexicographic key ordering. Additional transformations occur for descending sorts. These transformations must be considered when restricting the [begin_bit, end_bit) range, as the bitwise transformations will occur before the bit-range truncation.

Any transformations applied to the keys prior to sorting are reversed while writing to the final output buffer.

Type Specific Bitwise Transformations
To convert the input values into a radix-sortable bitwise representation, the following transformations take place prior to sorting:
  • For unsigned integral values, the keys are used directly.
  • For signed integral values, the sign bit is inverted.
  • For positive floating point values, the sign bit is inverted.
  • For negative floating point values, the full key is inverted.

For floating point types, positive and negative zero are a special case and will be considered equivalent during sorting.

Descending Sort Bitwise Transformations
If descending sort is used, the keys are inverted after performing any type-specific transformations, and the resulting keys are sorted in ascending order.
Stability
DeviceRadixSort is stable. For floating-point types, -0.0 and +0.0 are considered equal and appear in the result in the same order as they appear in the input.
Usage Considerations
  • Dynamic parallelism. DeviceRadixSort methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported.
Performance
The work-complexity of radix sort as a function of input size is linear, resulting in performance throughput that plateaus with problem sizes large enough to saturate the GPU. The following chart illustrates DeviceRadixSort::SortKeys performance across different CUDA architectures for uniform-random uint32 keys. Performance plots for other scenarios can be found in the detailed method descriptions below.
lsb_radix_sort_int32_keys.png

Static Public Methods

KeyT-value pairs
template<typename KeyT , typename ValueT , typename NumItemsT >
static CUB_RUNTIME_FUNCTION
cudaError_t 
SortPairs (void *d_temp_storage, size_t &temp_storage_bytes, const KeyT *d_keys_in, KeyT *d_keys_out, const ValueT *d_values_in, ValueT *d_values_out, NumItemsT num_items, int begin_bit=0, int end_bit=sizeof(KeyT)*8, cudaStream_t stream=0, bool debug_synchronous=false)
 Sorts key-value pairs into ascending order. (~2N auxiliary storage required) More...
 
template<typename KeyT , typename ValueT , typename NumItemsT >
static CUB_RUNTIME_FUNCTION
cudaError_t 
SortPairs (void *d_temp_storage, size_t &temp_storage_bytes, DoubleBuffer< KeyT > &d_keys, DoubleBuffer< ValueT > &d_values, NumItemsT num_items, int begin_bit=0, int end_bit=sizeof(KeyT)*8, cudaStream_t stream=0, bool debug_synchronous=false)
 Sorts key-value pairs into ascending order. (~N auxiliary storage required) More...
 
template<typename KeyT , typename ValueT , typename NumItemsT >
static CUB_RUNTIME_FUNCTION
cudaError_t 
SortPairsDescending (void *d_temp_storage, size_t &temp_storage_bytes, const KeyT *d_keys_in, KeyT *d_keys_out, const ValueT *d_values_in, ValueT *d_values_out, NumItemsT num_items, int begin_bit=0, int end_bit=sizeof(KeyT)*8, cudaStream_t stream=0, bool debug_synchronous=false)
 Sorts key-value pairs into descending order. (~2N auxiliary storage required). More...
 
template<typename KeyT , typename ValueT , typename NumItemsT >
static CUB_RUNTIME_FUNCTION
cudaError_t 
SortPairsDescending (void *d_temp_storage, size_t &temp_storage_bytes, DoubleBuffer< KeyT > &d_keys, DoubleBuffer< ValueT > &d_values, NumItemsT num_items, int begin_bit=0, int end_bit=sizeof(KeyT)*8, cudaStream_t stream=0, bool debug_synchronous=false)
 Sorts key-value pairs into descending order. (~N auxiliary storage required). More...
 
Keys-only
template<typename KeyT , typename NumItemsT >
static CUB_RUNTIME_FUNCTION
cudaError_t 
SortKeys (void *d_temp_storage, size_t &temp_storage_bytes, const KeyT *d_keys_in, KeyT *d_keys_out, NumItemsT num_items, int begin_bit=0, int end_bit=sizeof(KeyT)*8, cudaStream_t stream=0, bool debug_synchronous=false)
 Sorts keys into ascending order. (~2N auxiliary storage required) More...
 
template<typename KeyT , typename NumItemsT >
static CUB_RUNTIME_FUNCTION
cudaError_t 
SortKeys (void *d_temp_storage, size_t &temp_storage_bytes, DoubleBuffer< KeyT > &d_keys, NumItemsT num_items, int begin_bit=0, int end_bit=sizeof(KeyT)*8, cudaStream_t stream=0, bool debug_synchronous=false)
 Sorts keys into ascending order. (~N auxiliary storage required). More...
 
template<typename KeyT , typename NumItemsT >
static CUB_RUNTIME_FUNCTION
cudaError_t 
SortKeysDescending (void *d_temp_storage, size_t &temp_storage_bytes, const KeyT *d_keys_in, KeyT *d_keys_out, NumItemsT num_items, int begin_bit=0, int end_bit=sizeof(KeyT)*8, cudaStream_t stream=0, bool debug_synchronous=false)
 Sorts keys into descending order. (~2N auxiliary storage required). More...
 
template<typename KeyT , typename NumItemsT >
static CUB_RUNTIME_FUNCTION
cudaError_t 
SortKeysDescending (void *d_temp_storage, size_t &temp_storage_bytes, DoubleBuffer< KeyT > &d_keys, NumItemsT num_items, int begin_bit=0, int end_bit=sizeof(KeyT)*8, cudaStream_t stream=0, bool debug_synchronous=false)
 Sorts keys into descending order. (~N auxiliary storage required). More...
 

Member Function Documentation

template<typename KeyT , typename ValueT , typename NumItemsT >
static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceRadixSort::SortPairs ( void *  d_temp_storage,
size_t &  temp_storage_bytes,
const KeyT *  d_keys_in,
KeyT *  d_keys_out,
const ValueT *  d_values_in,
ValueT *  d_values_out,
NumItemsT  num_items,
int  begin_bit = 0,
int  end_bit = sizeof(KeyT) * 8,
cudaStream_t  stream = 0,
bool  debug_synchronous = false 
)
inlinestatic

Sorts key-value pairs into ascending order. (~2N auxiliary storage required)

  • The contents of the input data are not altered by the sorting operation.
  • Pointers to contiguous memory must be used; iterators are not currently supported.
  • In-place operations are not supported. There must be no overlap between any of the provided ranges:
    • [d_keys_in, d_keys_in + num_items)
    • [d_keys_out, d_keys_out + num_items)
    • [d_values_in, d_values_in + num_items)
    • [d_values_out, d_values_out + num_items)
  • An optional bit subrange [begin_bit, end_bit) of differentiating key bits can be specified. This can reduce overall sorting overhead and yield a corresponding performance improvement.
  • This operation requires an allocation of temporary device storage that is O(N+P), where N is the length of the input and P is the number of streaming multiprocessors on the device. For sorting using only O(P) temporary storage, see the sorting interface using DoubleBuffer wrappers below.
  • When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
Performance
The following charts illustrate saturated sorting performance across different CUDA architectures for uniform-random uint32,uint32 and uint64,uint64 pairs, respectively.
lsb_radix_sort_int32_pairs.png
lsb_radix_sort_int64_pairs.png
Snippet
The code snippet below illustrates the sorting of a device vector of int keys with associated vector of int values.
#include <cub/cub.cuh> // or equivalently <cub/device/device_radix_sort.cuh>
// Declare, allocate, and initialize device-accessible pointers for sorting data
int num_items; // e.g., 7
int *d_keys_in; // e.g., [8, 6, 7, 5, 3, 0, 9]
int *d_keys_out; // e.g., [ ... ]
int *d_values_in; // e.g., [0, 1, 2, 3, 4, 5, 6]
int *d_values_out; // e.g., [ ... ]
...
// Determine temporary device storage requirements
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes,
d_keys_in, d_keys_out, d_values_in, d_values_out, num_items);
// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run sorting operation
cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes,
d_keys_in, d_keys_out, d_values_in, d_values_out, num_items);
// d_keys_out <-- [0, 3, 5, 6, 7, 8, 9]
// d_values_out <-- [5, 4, 3, 1, 2, 0, 6]
Template Parameters
KeyT[inferred] KeyT type
ValueT[inferred] ValueT type
NumItemsT[inferred] Type of num_items
Parameters
[in]d_temp_storageDevice-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
[in,out]temp_storage_bytesReference to size in bytes of d_temp_storage allocation
[in]d_keys_inPointer to the input data of key data to sort
[out]d_keys_outPointer to the sorted output sequence of key data
[in]d_values_inPointer to the corresponding input sequence of associated value items
[out]d_values_outPointer to the correspondingly-reordered output sequence of associated value items
[in]num_itemsNumber of items to sort
[in]begin_bit[optional] The least-significant bit index (inclusive) needed for key comparison
[in]end_bit[optional] The most-significant bit index (exclusive) needed for key comparison (e.g., sizeof(unsigned int) * 8)
[in]stream[optional] CUDA stream to launch kernels within. Default is stream0.
[in]debug_synchronous[optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false.
Examples:
example_device_radix_sort.cu.
template<typename KeyT , typename ValueT , typename NumItemsT >
static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceRadixSort::SortPairs ( void *  d_temp_storage,
size_t &  temp_storage_bytes,
DoubleBuffer< KeyT > &  d_keys,
DoubleBuffer< ValueT > &  d_values,
NumItemsT  num_items,
int  begin_bit = 0,
int  end_bit = sizeof(KeyT) * 8,
cudaStream_t  stream = 0,
bool  debug_synchronous = false 
)
inlinestatic

Sorts key-value pairs into ascending order. (~N auxiliary storage required)

  • The sorting operation is given a pair of key buffers and a corresponding pair of associated value buffers. Each pair is managed by a DoubleBuffer structure that indicates which of the two buffers is "current" (and thus contains the input data to be sorted).
  • The contents of both buffers within each pair may be altered by the sorting operation.
  • In-place operations are not supported. There must be no overlap between any of the provided ranges:
    • [d_keys.Current(), d_keys.Current() + num_items)
    • [d_keys.Alternate(), d_keys.Alternate() + num_items)
    • [d_values.Current(), d_values.Current() + num_items)
    • [d_values.Alternate(), d_values.Alternate() + num_items)
  • Upon completion, the sorting operation will update the "current" indicator within each DoubleBuffer wrapper to reference which of the two buffers now contains the sorted output sequence (a function of the number of key bits specified and the targeted device architecture).
  • An optional bit subrange [begin_bit, end_bit) of differentiating key bits can be specified. This can reduce overall sorting overhead and yield a corresponding performance improvement.
  • This operation requires a relatively small allocation of temporary device storage that is O(P), where P is the number of streaming multiprocessors on the device (and is typically a small constant relative to the input size N).
  • When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
Performance
The following charts illustrate saturated sorting performance across different CUDA architectures for uniform-random uint32,uint32 and uint64,uint64 pairs, respectively.
lsb_radix_sort_int32_pairs.png
lsb_radix_sort_int64_pairs.png
Snippet
The code snippet below illustrates the sorting of a device vector of int keys with associated vector of int values.
#include <cub/cub.cuh> // or equivalently <cub/device/device_radix_sort.cuh>
// Declare, allocate, and initialize device-accessible pointers for sorting data
int num_items; // e.g., 7
int *d_key_buf; // e.g., [8, 6, 7, 5, 3, 0, 9]
int *d_key_alt_buf; // e.g., [ ... ]
int *d_value_buf; // e.g., [0, 1, 2, 3, 4, 5, 6]
int *d_value_alt_buf; // e.g., [ ... ]
...
// Create a set of DoubleBuffers to wrap pairs of device pointers
cub::DoubleBuffer<int> d_keys(d_key_buf, d_key_alt_buf);
cub::DoubleBuffer<int> d_values(d_value_buf, d_value_alt_buf);
// Determine temporary device storage requirements
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes, d_keys, d_values, num_items);
// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run sorting operation
cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes, d_keys, d_values, num_items);
// d_keys.Current() <-- [0, 3, 5, 6, 7, 8, 9]
// d_values.Current() <-- [5, 4, 3, 1, 2, 0, 6]
Template Parameters
KeyT[inferred] KeyT type
ValueT[inferred] ValueT type
NumItemsT[inferred] Type of num_items
Parameters
[in]d_temp_storageDevice-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
[in,out]temp_storage_bytesReference to size in bytes of d_temp_storage allocation
[in,out]d_keysReference to the double-buffer of keys whose "current" device-accessible buffer contains the unsorted input keys and, upon return, is updated to point to the sorted output keys
[in,out]d_valuesDouble-buffer of values whose "current" device-accessible buffer contains the unsorted input values and, upon return, is updated to point to the sorted output values
[in]num_itemsNumber of items to sort
[in]begin_bit[optional] The least-significant bit index (inclusive) needed for key comparison
[in]end_bit[optional] The most-significant bit index (exclusive) needed for key comparison (e.g., sizeof(unsigned int) * 8)
[in]stream[optional] CUDA stream to launch kernels within. Default is stream0.
[in]debug_synchronous[optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false.
template<typename KeyT , typename ValueT , typename NumItemsT >
static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceRadixSort::SortPairsDescending ( void *  d_temp_storage,
size_t &  temp_storage_bytes,
const KeyT *  d_keys_in,
KeyT *  d_keys_out,
const ValueT *  d_values_in,
ValueT *  d_values_out,
NumItemsT  num_items,
int  begin_bit = 0,
int  end_bit = sizeof(KeyT) * 8,
cudaStream_t  stream = 0,
bool  debug_synchronous = false 
)
inlinestatic

Sorts key-value pairs into descending order. (~2N auxiliary storage required).

  • The contents of the input data are not altered by the sorting operation.
  • Pointers to contiguous memory must be used; iterators are not currently supported.
  • In-place operations are not supported. There must be no overlap between any of the provided ranges:
    • [d_keys_in, d_keys_in + num_items)
    • [d_keys_out, d_keys_out + num_items)
    • [d_values_in, d_values_in + num_items)
    • [d_values_out, d_values_out + num_items)
  • An optional bit subrange [begin_bit, end_bit) of differentiating key bits can be specified. This can reduce overall sorting overhead and yield a corresponding performance improvement.
  • This operation requires an allocation of temporary device storage that is O(N+P), where N is the length of the input and P is the number of streaming multiprocessors on the device. For sorting using only O(P) temporary storage, see the sorting interface using DoubleBuffer wrappers below.
  • When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
Performance
Performance is similar to DeviceRadixSort::SortPairs.
Snippet
The code snippet below illustrates the sorting of a device vector of int keys with associated vector of int values.
#include <cub/cub.cuh> // or equivalently <cub/device/device_radix_sort.cuh>
// Declare, allocate, and initialize device-accessible pointers for sorting data
int num_items; // e.g., 7
int *d_keys_in; // e.g., [8, 6, 7, 5, 3, 0, 9]
int *d_keys_out; // e.g., [ ... ]
int *d_values_in; // e.g., [0, 1, 2, 3, 4, 5, 6]
int *d_values_out; // e.g., [ ... ]
...
// Determine temporary device storage requirements
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceRadixSort::SortPairsDescending(d_temp_storage, temp_storage_bytes,
d_keys_in, d_keys_out, d_values_in, d_values_out, num_items);
// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run sorting operation
cub::DeviceRadixSort::SortPairsDescending(d_temp_storage, temp_storage_bytes,
d_keys_in, d_keys_out, d_values_in, d_values_out, num_items);
// d_keys_out <-- [9, 8, 7, 6, 5, 3, 0]
// d_values_out <-- [6, 0, 2, 1, 3, 4, 5]
Template Parameters
KeyT[inferred] KeyT type
ValueT[inferred] ValueT type
NumItemsT[inferred] Type of num_items
Parameters
[in]d_temp_storageDevice-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
[in,out]temp_storage_bytesReference to size in bytes of d_temp_storage allocation
[in]d_keys_inPointer to the input data of key data to sort
[out]d_keys_outPointer to the sorted output sequence of key data
[in]d_values_inPointer to the corresponding input sequence of associated value items
[out]d_values_outPointer to the correspondingly-reordered output sequence of associated value items
[in]num_itemsNumber of items to sort
[in]begin_bit[optional] The least-significant bit index (inclusive) needed for key comparison
[in]end_bit[optional] The most-significant bit index (exclusive) needed for key comparison (e.g., sizeof(unsigned int) * 8)
[in]stream[optional] CUDA stream to launch kernels within. Default is stream0.
[in]debug_synchronous[optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false.
template<typename KeyT , typename ValueT , typename NumItemsT >
static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceRadixSort::SortPairsDescending ( void *  d_temp_storage,
size_t &  temp_storage_bytes,
DoubleBuffer< KeyT > &  d_keys,
DoubleBuffer< ValueT > &  d_values,
NumItemsT  num_items,
int  begin_bit = 0,
int  end_bit = sizeof(KeyT) * 8,
cudaStream_t  stream = 0,
bool  debug_synchronous = false 
)
inlinestatic

Sorts key-value pairs into descending order. (~N auxiliary storage required).

  • The sorting operation is given a pair of key buffers and a corresponding pair of associated value buffers. Each pair is managed by a DoubleBuffer structure that indicates which of the two buffers is "current" (and thus contains the input data to be sorted).
  • The contents of both buffers within each pair may be altered by the sorting operation.
  • In-place operations are not supported. There must be no overlap between any of the provided ranges:
    • [d_keys.Current(), d_keys.Current() + num_items)
    • [d_keys.Alternate(), d_keys.Alternate() + num_items)
    • [d_values.Current(), d_values.Current() + num_items)
    • [d_values.Alternate(), d_values.Alternate() + num_items)
  • Upon completion, the sorting operation will update the "current" indicator within each DoubleBuffer wrapper to reference which of the two buffers now contains the sorted output sequence (a function of the number of key bits specified and the targeted device architecture).
  • An optional bit subrange [begin_bit, end_bit) of differentiating key bits can be specified. This can reduce overall sorting overhead and yield a corresponding performance improvement.
  • This operation requires a relatively small allocation of temporary device storage that is O(P), where P is the number of streaming multiprocessors on the device (and is typically a small constant relative to the input size N).
  • When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
Performance
Performance is similar to DeviceRadixSort::SortPairs.
Snippet
The code snippet below illustrates the sorting of a device vector of int keys with associated vector of int values.
#include <cub/cub.cuh> // or equivalently <cub/device/device_radix_sort.cuh>
// Declare, allocate, and initialize device-accessible pointers for sorting data
int num_items; // e.g., 7
int *d_key_buf; // e.g., [8, 6, 7, 5, 3, 0, 9]
int *d_key_alt_buf; // e.g., [ ... ]
int *d_value_buf; // e.g., [0, 1, 2, 3, 4, 5, 6]
int *d_value_alt_buf; // e.g., [ ... ]
...
// Create a set of DoubleBuffers to wrap pairs of device pointers
cub::DoubleBuffer<int> d_keys(d_key_buf, d_key_alt_buf);
cub::DoubleBuffer<int> d_values(d_value_buf, d_value_alt_buf);
// Determine temporary device storage requirements
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceRadixSort::SortPairsDescending(d_temp_storage, temp_storage_bytes, d_keys, d_values, num_items);
// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run sorting operation
cub::DeviceRadixSort::SortPairsDescending(d_temp_storage, temp_storage_bytes, d_keys, d_values, num_items);
// d_keys.Current() <-- [9, 8, 7, 6, 5, 3, 0]
// d_values.Current() <-- [6, 0, 2, 1, 3, 4, 5]
Template Parameters
KeyT[inferred] KeyT type
ValueT[inferred] ValueT type
NumItemsT[inferred] Type of num_items
Parameters
[in]d_temp_storageDevice-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
[in,out]temp_storage_bytesReference to size in bytes of d_temp_storage allocation
[in,out]d_keysReference to the double-buffer of keys whose "current" device-accessible buffer contains the unsorted input keys and, upon return, is updated to point to the sorted output keys
[in,out]d_valuesDouble-buffer of values whose "current" device-accessible buffer contains the unsorted input values and, upon return, is updated to point to the sorted output values
[in]num_itemsNumber of items to sort
[in]begin_bit[optional] The least-significant bit index (inclusive) needed for key comparison
[in]end_bit[optional] The most-significant bit index (exclusive) needed for key comparison (e.g., sizeof(unsigned int) * 8)
[in]stream[optional] CUDA stream to launch kernels within. Default is stream0.
[in]debug_synchronous[optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false.
template<typename KeyT , typename NumItemsT >
static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceRadixSort::SortKeys ( void *  d_temp_storage,
size_t &  temp_storage_bytes,
const KeyT *  d_keys_in,
KeyT *  d_keys_out,
NumItemsT  num_items,
int  begin_bit = 0,
int  end_bit = sizeof(KeyT) * 8,
cudaStream_t  stream = 0,
bool  debug_synchronous = false 
)
inlinestatic

Sorts keys into ascending order. (~2N auxiliary storage required)

  • The contents of the input data are not altered by the sorting operation.
  • Pointers to contiguous memory must be used; iterators are not currently supported.
  • In-place operations are not supported. There must be no overlap between any of the provided ranges:
    • [d_keys_in, d_keys_in + num_items)
    • [d_keys_out, d_keys_out + num_items)
  • An optional bit subrange [begin_bit, end_bit) of differentiating key bits can be specified. This can reduce overall sorting overhead and yield a corresponding performance improvement.
  • This operation requires an allocation of temporary device storage that is O(N+P), where N is the length of the input and P is the number of streaming multiprocessors on the device. For sorting using only O(P) temporary storage, see the sorting interface using DoubleBuffer wrappers below.
  • When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
Performance
The following charts illustrate saturated sorting performance across different CUDA architectures for uniform-random uint32 and uint64 keys, respectively.
lsb_radix_sort_int32_keys.png
lsb_radix_sort_int64_keys.png
Snippet
The code snippet below illustrates the sorting of a device vector of int keys.
#include <cub/cub.cuh> // or equivalently <cub/device/device_radix_sort.cuh>
// Declare, allocate, and initialize device-accessible pointers for sorting data
int num_items; // e.g., 7
int *d_keys_in; // e.g., [8, 6, 7, 5, 3, 0, 9]
int *d_keys_out; // e.g., [ ... ]
...
// Determine temporary device storage requirements
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceRadixSort::SortKeys(d_temp_storage, temp_storage_bytes, d_keys_in, d_keys_out, num_items);
// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run sorting operation
cub::DeviceRadixSort::SortKeys(d_temp_storage, temp_storage_bytes, d_keys_in, d_keys_out, num_items);
// d_keys_out <-- [0, 3, 5, 6, 7, 8, 9]
Template Parameters
KeyT[inferred] KeyT type
NumItemsT[inferred] Type of num_items
NumItemsT[inferred] Type of num_items
Parameters
[in]d_temp_storageDevice-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
[in,out]temp_storage_bytesReference to size in bytes of d_temp_storage allocation
[in]d_keys_inPointer to the input data of key data to sort
[out]d_keys_outPointer to the sorted output sequence of key data
[in]num_itemsNumber of items to sort
[in]begin_bit[optional] The least-significant bit index (inclusive) needed for key comparison
[in]end_bit[optional] The most-significant bit index (exclusive) needed for key comparison (e.g., sizeof(unsigned int) * 8)
[in]stream[optional] CUDA stream to launch kernels within. Default is stream0.
[in]debug_synchronous[optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false.
template<typename KeyT , typename NumItemsT >
static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceRadixSort::SortKeys ( void *  d_temp_storage,
size_t &  temp_storage_bytes,
DoubleBuffer< KeyT > &  d_keys,
NumItemsT  num_items,
int  begin_bit = 0,
int  end_bit = sizeof(KeyT) * 8,
cudaStream_t  stream = 0,
bool  debug_synchronous = false 
)
inlinestatic

Sorts keys into ascending order. (~N auxiliary storage required).

  • The sorting operation is given a pair of key buffers managed by a DoubleBuffer structure that indicates which of the two buffers is "current" (and thus contains the input data to be sorted).
  • The contents of both buffers may be altered by the sorting operation.
  • In-place operations are not supported. There must be no overlap between any of the provided ranges:
    • [d_keys.Current(), d_keys.Current() + num_items)
    • [d_keys.Alternate(), d_keys.Alternate() + num_items)
  • Upon completion, the sorting operation will update the "current" indicator within the DoubleBuffer wrapper to reference which of the two buffers now contains the sorted output sequence (a function of the number of key bits specified and the targeted device architecture).
  • An optional bit subrange [begin_bit, end_bit) of differentiating key bits can be specified. This can reduce overall sorting overhead and yield a corresponding performance improvement.
  • This operation requires a relatively small allocation of temporary device storage that is O(P), where P is the number of streaming multiprocessors on the device (and is typically a small constant relative to the input size N).
  • When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
Performance
The following charts illustrate saturated sorting performance across different CUDA architectures for uniform-random uint32 and uint64 keys, respectively.
lsb_radix_sort_int32_keys.png
lsb_radix_sort_int64_keys.png
Snippet
The code snippet below illustrates the sorting of a device vector of int keys.
#include <cub/cub.cuh> // or equivalently <cub/device/device_radix_sort.cuh>
// Declare, allocate, and initialize device-accessible pointers for sorting data
int num_items; // e.g., 7
int *d_key_buf; // e.g., [8, 6, 7, 5, 3, 0, 9]
int *d_key_alt_buf; // e.g., [ ... ]
...
// Create a DoubleBuffer to wrap the pair of device pointers
cub::DoubleBuffer<int> d_keys(d_key_buf, d_key_alt_buf);
// Determine temporary device storage requirements
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceRadixSort::SortKeys(d_temp_storage, temp_storage_bytes, d_keys, num_items);
// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run sorting operation
cub::DeviceRadixSort::SortKeys(d_temp_storage, temp_storage_bytes, d_keys, num_items);
// d_keys.Current() <-- [0, 3, 5, 6, 7, 8, 9]
Template Parameters
KeyT[inferred] KeyT type
NumItemsT[inferred] Type of num_items
Parameters
[in]d_temp_storageDevice-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
[in,out]temp_storage_bytesReference to size in bytes of d_temp_storage allocation
[in,out]d_keysReference to the double-buffer of keys whose "current" device-accessible buffer contains the unsorted input keys and, upon return, is updated to point to the sorted output keys
[in]num_itemsNumber of items to sort
[in]begin_bit[optional] The least-significant bit index (inclusive) needed for key comparison
[in]end_bit[optional] The most-significant bit index (exclusive) needed for key comparison (e.g., sizeof(unsigned int) * 8)
[in]stream[optional] CUDA stream to launch kernels within. Default is stream0.
[in]debug_synchronous[optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false.
template<typename KeyT , typename NumItemsT >
static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceRadixSort::SortKeysDescending ( void *  d_temp_storage,
size_t &  temp_storage_bytes,
const KeyT *  d_keys_in,
KeyT *  d_keys_out,
NumItemsT  num_items,
int  begin_bit = 0,
int  end_bit = sizeof(KeyT) * 8,
cudaStream_t  stream = 0,
bool  debug_synchronous = false 
)
inlinestatic

Sorts keys into descending order. (~2N auxiliary storage required).

  • The contents of the input data are not altered by the sorting operation.
  • Pointers to contiguous memory must be used; iterators are not currently supported.
  • In-place operations are not supported. There must be no overlap between any of the provided ranges:
    • [d_keys_in, d_keys_in + num_items)
    • [d_keys_out, d_keys_out + num_items)
  • An optional bit subrange [begin_bit, end_bit) of differentiating key bits can be specified. This can reduce overall sorting overhead and yield a corresponding performance improvement.
  • This operation requires an allocation of temporary device storage that is O(N+P), where N is the length of the input and P is the number of streaming multiprocessors on the device. For sorting using only O(P) temporary storage, see the sorting interface using DoubleBuffer wrappers below.
  • When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
Performance
Performance is similar to DeviceRadixSort::SortKeys.
Snippet
The code snippet below illustrates the sorting of a device vector of int keys.
#include <cub/cub.cuh> // or equivalently <cub/device/device_radix_sort.cuh>
// Declare, allocate, and initialize device-accessible pointers for sorting data
int num_items; // e.g., 7
int *d_keys_in; // e.g., [8, 6, 7, 5, 3, 0, 9]
int *d_keys_out; // e.g., [ ... ]
...
// Create a DoubleBuffer to wrap the pair of device pointers
cub::DoubleBuffer<int> d_keys(d_key_buf, d_key_alt_buf);
// Determine temporary device storage requirements
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceRadixSort::SortKeysDescending(d_temp_storage, temp_storage_bytes, d_keys_in, d_keys_out, num_items);
// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run sorting operation
cub::DeviceRadixSort::SortKeysDescending(d_temp_storage, temp_storage_bytes, d_keys_in, d_keys_out, num_items);
// d_keys_out <-- [9, 8, 7, 6, 5, 3, 0]s
Template Parameters
KeyT[inferred] KeyT type
NumItemsT[inferred] Type of num_items
Parameters
[in]d_temp_storageDevice-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
[in,out]temp_storage_bytesReference to size in bytes of d_temp_storage allocation
[in]d_keys_inPointer to the input data of key data to sort
[out]d_keys_outPointer to the sorted output sequence of key data
[in]num_itemsNumber of items to sort
[in]begin_bit[optional] The least-significant bit index (inclusive) needed for key comparison
[in]end_bit[optional] The most-significant bit index (exclusive) needed for key comparison (e.g., sizeof(unsigned int) * 8)
[in]stream[optional] CUDA stream to launch kernels within. Default is stream0.
[in]debug_synchronous[optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false.
template<typename KeyT , typename NumItemsT >
static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceRadixSort::SortKeysDescending ( void *  d_temp_storage,
size_t &  temp_storage_bytes,
DoubleBuffer< KeyT > &  d_keys,
NumItemsT  num_items,
int  begin_bit = 0,
int  end_bit = sizeof(KeyT) * 8,
cudaStream_t  stream = 0,
bool  debug_synchronous = false 
)
inlinestatic

Sorts keys into descending order. (~N auxiliary storage required).

  • The sorting operation is given a pair of key buffers managed by a DoubleBuffer structure that indicates which of the two buffers is "current" (and thus contains the input data to be sorted).
  • The contents of both buffers may be altered by the sorting operation.
  • In-place operations are not supported. There must be no overlap between any of the provided ranges:
    • [d_keys.Current(), d_keys.Current() + num_items)
    • [d_keys.Alternate(), d_keys.Alternate() + num_items)
  • Upon completion, the sorting operation will update the "current" indicator within the DoubleBuffer wrapper to reference which of the two buffers now contains the sorted output sequence (a function of the number of key bits specified and the targeted device architecture).
  • An optional bit subrange [begin_bit, end_bit) of differentiating key bits can be specified. This can reduce overall sorting overhead and yield a corresponding performance improvement.
  • This operation requires a relatively small allocation of temporary device storage that is O(P), where P is the number of streaming multiprocessors on the device (and is typically a small constant relative to the input size N).
  • When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
Performance
Performance is similar to DeviceRadixSort::SortKeys.
Snippet
The code snippet below illustrates the sorting of a device vector of int keys.
#include <cub/cub.cuh> // or equivalently <cub/device/device_radix_sort.cuh>
// Declare, allocate, and initialize device-accessible pointers for sorting data
int num_items; // e.g., 7
int *d_key_buf; // e.g., [8, 6, 7, 5, 3, 0, 9]
int *d_key_alt_buf; // e.g., [ ... ]
...
// Create a DoubleBuffer to wrap the pair of device pointers
cub::DoubleBuffer<int> d_keys(d_key_buf, d_key_alt_buf);
// Determine temporary device storage requirements
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceRadixSort::SortKeysDescending(d_temp_storage, temp_storage_bytes, d_keys, num_items);
// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run sorting operation
cub::DeviceRadixSort::SortKeysDescending(d_temp_storage, temp_storage_bytes, d_keys, num_items);
// d_keys.Current() <-- [9, 8, 7, 6, 5, 3, 0]
Template Parameters
KeyT[inferred] KeyT type
NumItemsT[inferred] Type of num_items
Parameters
[in]d_temp_storageDevice-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
[in,out]temp_storage_bytesReference to size in bytes of d_temp_storage allocation
[in,out]d_keysReference to the double-buffer of keys whose "current" device-accessible buffer contains the unsorted input keys and, upon return, is updated to point to the sorted output keys
[in]num_itemsNumber of items to sort
[in]begin_bit[optional] The least-significant bit index (inclusive) needed for key comparison
[in]end_bit[optional] The most-significant bit index (exclusive) needed for key comparison (e.g., sizeof(unsigned int) * 8)
[in]stream[optional] CUDA stream to launch kernels within. Default is stream0.
[in]debug_synchronous[optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false.

The documentation for this struct was generated from the following file: