template<
typename KeyT,
int BLOCK_DIM_X,
int ITEMS_PER_THREAD,
typename ValueT = NullType,
int RADIX_BITS = 4,
bool MEMOIZE_OUTER_SCAN = (CUB_PTX_ARCH >= 350) ? true : false,
BlockScanAlgorithm INNER_SCAN_ALGORITHM = BLOCK_SCAN_WARP_SCANS,
cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte,
int BLOCK_DIM_Y = 1,
int BLOCK_DIM_Z = 1,
int PTX_ARCH = CUB_PTX_ARCH>
class cub::BlockRadixSort< KeyT, BLOCK_DIM_X, ITEMS_PER_THREAD, ValueT, RADIX_BITS, MEMOIZE_OUTER_SCAN, INNER_SCAN_ALGORITHM, SMEM_CONFIG, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >
The BlockRadixSort class provides collective methods for sorting items partitioned across a CUDA thread block using a radix sorting method.
- Template Parameters
-
KeyT | Key type |
BLOCK_DIM_X | The thread block length in threads along the X dimension |
ITEMS_PER_THREAD | The number of items per thread |
ValueT | [optional] ValueT type (default: cub::NullType, which indicates a keys-only sort) |
RADIX_BITS | [optional] The number of radix bits per digit place (default: 4 bits) |
MEMOIZE_OUTER_SCAN | [optional] Whether or not to buffer outer raking scan partials to incur fewer shared memory reads at the expense of higher register pressure (default: true for architectures SM35 and newer, false otherwise). |
INNER_SCAN_ALGORITHM | [optional] The cub::BlockScanAlgorithm algorithm to use (default: cub::BLOCK_SCAN_WARP_SCANS) |
SMEM_CONFIG | [optional] Shared memory bank mode (default: cudaSharedMemBankSizeFourByte ) |
BLOCK_DIM_Y | [optional] The thread block length in threads along the Y dimension (default: 1) |
BLOCK_DIM_Z | [optional] The thread block length in threads along the Z dimension (default: 1) |
PTX_ARCH | [optional] The PTX compute capability for which to specialize this collective, formatted as per the __CUDA_ARCH__ macro (e.g., 350 for sm_35). Useful for determining the collective's storage requirements for a given device from the host. (Default: the value of __CUDA_ARCH__ during the current compiler pass) |
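- For example, a hedged host-side sketch of that use case (the 700 value assumes an sm_70 target; the typedef name is illustrative): instantiate the type with an explicit PTX_ARCH and take sizeof its TempStorage:
#include <cub/cub.cuh>
// Host-side: query the storage footprint of the sm_70 specialization without launching
typedef cub::BlockRadixSort<int, 128, 4, cub::NullType, 4, true,
    cub::BLOCK_SCAN_WARP_SCANS, cudaSharedMemBankSizeFourByte, 1, 1, 700> BlockRadixSort70;
size_t smem_bytes = sizeof(BlockRadixSort70::TempStorage);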
- Overview
- The radix sorting method arranges items into ascending order. It relies upon a positional representation for keys, i.e., each key is comprised of an ordered sequence of symbols (e.g., digits, characters, etc.) specified from least-significant to most-significant. For a given input sequence of keys and a set of rules specifying a total ordering of the symbolic alphabet, the radix sorting method produces a lexicographic ordering of those keys.
For multi-dimensional blocks, threads are linearly ranked in row-major order.
- Supported Types
- BlockRadixSort can sort all of the built-in C++ numeric primitive types (unsigned char, int, double, etc.) as well as CUDA's __half half-precision floating-point type.
- Floating-Point Special Cases
- Positive and negative zeros are considered equivalent, and will be treated as such in the output.
- No special handling is implemented for NaN values; these are sorted according to their bit representations after any transformations.
- Bitwise Key Transformations
- Although the direct radix sorting method can only be applied to unsigned integral types, BlockRadixSort is able to sort signed and floating-point types via simple bit-wise transformations that ensure lexicographic key ordering.
These transformations must be considered when restricting the [begin_bit, end_bit) range, as the bitwise transformations will occur before the bit-range truncation.
Any transformations applied to the keys prior to sorting are reversed while writing to the final output buffer.
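- For example (a hedged illustration reusing the specialization, temp_storage, and thread_keys from the snippets below), an integer keys-only sort restricted to the low 16 bits of each transformed key would be invoked as:
// Consider only bits [0, 16) of each transformed key
BlockRadixSort(temp_storage).Sort(thread_keys, 0, 16);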
- Type Specific Bitwise Transformations
- To convert the input values into a radix-sortable bitwise representation, the following transformations take place prior to sorting:
- For unsigned integral values, the keys are used directly.
- For signed integral values, the sign bit is inverted.
- For positive floating point values, the sign bit is inverted.
- For negative floating point values, the full key is inverted.
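- As an illustration, the following host-side sketch (a hypothetical helper, not CUB's internal API) applies the floating-point rules above to a 32-bit float; reversing the transform after sorting restores the original key values:
#include <cstdint>
#include <cstring>
// Hypothetical helper mirroring the float transform described above
uint32_t FloatToSortableBits(float key)
{
    uint32_t bits;
    std::memcpy(&bits, &key, sizeof(bits));  // reinterpret the float's bit pattern
    if (bits & 0x80000000u)
        bits = ~bits;            // negative value: invert the full key
    else
        bits |= 0x80000000u;     // positive value: invert (set) the sign bit
    return bits;
}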
- No Descending Sort Transformations
- Unlike DeviceRadixSort, BlockRadixSort does not invert the input key bits when performing a descending sort. Instead, it has special logic to reverse the order of the keys while sorting.
- Stability
- BlockRadixSort is stable. For floating-point types, -0.0 and +0.0 are considered equal and appear in the result in the same order as they appear in the input.
- Performance Considerations
- Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
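- A hedged sketch of that pattern (the kernel name and d_in argument are assumptions), pairing cub::BlockLoad with BlockRadixSort and sharing their temporary storage through a union since the two collectives are used in sequence:
#include <cub/cub.cuh>
__global__ void LoadAndSortKernel(int *d_in)
{
    // Specializations for a 1D block of 128 threads owning 4 integer keys each
    typedef cub::BlockLoad<int, 128, 4, cub::BLOCK_LOAD_TRANSPOSE> BlockLoad;
    typedef cub::BlockRadixSort<int, 128, 4>                       BlockRadixSort;
    // Used in sequence, so the two collectives' storage can occupy the same memory
    __shared__ union {
        typename BlockLoad::TempStorage      load;
        typename BlockRadixSort::TempStorage sort;
    } temp_storage;
    int thread_keys[4];
    BlockLoad(temp_storage.load).Load(d_in + blockIdx.x * 512, thread_keys);
    __syncthreads();   // barrier before repurposing the union for sorting
    BlockRadixSort(temp_storage.sort).Sort(thread_keys);
}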
- A Simple Example
- Every thread in the block uses the BlockRadixSort class by first specializing the BlockRadixSort type, then instantiating an instance with parameters for communication, and finally invoking one or more collective member functions.
- The code snippet below illustrates a sort of 512 integer keys that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.
__global__ void ExampleKernel(...)
{
    // Specialize BlockRadixSort for a 1D block of 128 threads owning 4 integer keys each
    typedef cub::BlockRadixSort<int, 128, 4> BlockRadixSort;
    // Allocate shared memory for BlockRadixSort
    __shared__ typename BlockRadixSort::TempStorage temp_storage;
    // Obtain a segment of consecutive keys that are blocked across threads
    int thread_keys[4];
    ...
    // Collectively sort the keys
    BlockRadixSort(temp_storage).Sort(thread_keys);
    ...
- Suppose the set of input thread_keys across the block of threads is { [0,511,1,510], [2,509,3,508], [4,507,5,506], ..., [254,257,255,256] }. The corresponding output thread_keys in those threads will be { [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [508,509,510,511] }.
- Re-using dynamically allocated shared memory
- The following example under the examples/block folder illustrates usage of dynamically allocated shared memory with BlockReduce and how to re-purpose the same memory region: example_block_reduce_dyn_smem.cu. This example can be easily adapted to the storage required by BlockRadixSort.
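A minimal sketch of that adaptation (the kernel name, d_keys pointer, and indexing are hypothetical): back TempStorage with the dynamic shared memory region sized at launch time:
__global__ void DynSmemSortKernel(int *d_keys)
{
    typedef cub::BlockRadixSort<int, 128, 4> BlockRadixSort;
    extern __shared__ char smem[];  // sized via the third kernel launch parameter
    typename BlockRadixSort::TempStorage &temp_storage =
        *reinterpret_cast<typename BlockRadixSort::TempStorage *>(smem);
    int thread_keys[4];
    for (int i = 0; i < 4; ++i)      // load a blocked segment of 4 consecutive keys
        thread_keys[i] = d_keys[(blockIdx.x * 128 + threadIdx.x) * 4 + i];
    BlockRadixSort(temp_storage).Sort(thread_keys);
}
// Launch with the storage size as the dynamic shared memory amount:
// DynSmemSortKernel<<<grid, 128, sizeof(cub::BlockRadixSort<int, 128, 4>::TempStorage)>>>(d_keys);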
- Examples:
- example_block_radix_sort.cu.
- Collective constructors
__device__ __forceinline__ BlockRadixSort ()
    Collective constructor using a private static allocation of shared memory as temporary storage.
__device__ __forceinline__ BlockRadixSort (TempStorage &temp_storage)
    Collective constructor using the specified memory allocation as temporary storage.
- Sorting (blocked arrangements)
__device__ __forceinline__ void Sort (KeyT (&keys)[ITEMS_PER_THREAD], int begin_bit=0, int end_bit=sizeof(KeyT)*8)
    Performs an ascending block-wide radix sort over a blocked arrangement of keys.
__device__ __forceinline__ void Sort (KeyT (&keys)[ITEMS_PER_THREAD], ValueT (&values)[ITEMS_PER_THREAD], int begin_bit=0, int end_bit=sizeof(KeyT)*8)
    Performs an ascending block-wide radix sort across a blocked arrangement of keys and values.
__device__ __forceinline__ void SortDescending (KeyT (&keys)[ITEMS_PER_THREAD], int begin_bit=0, int end_bit=sizeof(KeyT)*8)
    Performs a descending block-wide radix sort over a blocked arrangement of keys.
__device__ __forceinline__ void SortDescending (KeyT (&keys)[ITEMS_PER_THREAD], ValueT (&values)[ITEMS_PER_THREAD], int begin_bit=0, int end_bit=sizeof(KeyT)*8)
    Performs a descending block-wide radix sort across a blocked arrangement of keys and values.
- Sorting (blocked arrangement -> striped arrangement)
__device__ __forceinline__ void SortBlockedToStriped (KeyT (&keys)[ITEMS_PER_THREAD], int begin_bit=0, int end_bit=sizeof(KeyT)*8)
    Performs an ascending radix sort across a blocked arrangement of keys, leaving them in a striped arrangement.
__device__ __forceinline__ void SortBlockedToStriped (KeyT (&keys)[ITEMS_PER_THREAD], ValueT (&values)[ITEMS_PER_THREAD], int begin_bit=0, int end_bit=sizeof(KeyT)*8)
    Performs an ascending radix sort across a blocked arrangement of keys and values, leaving them in a striped arrangement.
__device__ __forceinline__ void SortDescendingBlockedToStriped (KeyT (&keys)[ITEMS_PER_THREAD], int begin_bit=0, int end_bit=sizeof(KeyT)*8)
    Performs a descending radix sort across a blocked arrangement of keys, leaving them in a striped arrangement.
__device__ __forceinline__ void SortDescendingBlockedToStriped (KeyT (&keys)[ITEMS_PER_THREAD], ValueT (&values)[ITEMS_PER_THREAD], int begin_bit=0, int end_bit=sizeof(KeyT)*8)
    Performs a descending radix sort across a blocked arrangement of keys and values, leaving them in a striped arrangement.
|
template<typename KeyT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, typename ValueT = NullType, int RADIX_BITS = 4, bool MEMOIZE_OUTER_SCAN = (CUB_PTX_ARCH >= 350) ? true : false, BlockScanAlgorithm INNER_SCAN_ALGORITHM = BLOCK_SCAN_WARP_SCANS, cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ cub::BlockRadixSort< KeyT, BLOCK_DIM_X, ITEMS_PER_THREAD, ValueT, RADIX_BITS, MEMOIZE_OUTER_SCAN, INNER_SCAN_ALGORITHM, SMEM_CONFIG, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::BlockRadixSort ()   [inline]
Collective constructor using a private static allocation of shared memory as temporary storage.
template<typename KeyT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, typename ValueT = NullType, int RADIX_BITS = 4, bool MEMOIZE_OUTER_SCAN = (CUB_PTX_ARCH >= 350) ? true : false, BlockScanAlgorithm INNER_SCAN_ALGORITHM = BLOCK_SCAN_WARP_SCANS, cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ cub::BlockRadixSort< KeyT, BLOCK_DIM_X, ITEMS_PER_THREAD, ValueT, RADIX_BITS, MEMOIZE_OUTER_SCAN, INNER_SCAN_ALGORITHM, SMEM_CONFIG, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::BlockRadixSort (TempStorage &temp_storage)   [inline]
Collective constructor using the specified memory allocation as temporary storage.
- Parameters
-
[in] | temp_storage | Reference to memory allocation having layout type TempStorage |
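Illustratively (a sketch using the specialization from the snippets on this page; the instance names are hypothetical), the two constructor forms look like:
// Private static shared-memory allocation
BlockRadixSort sort_default;
// User-provided allocation (enables re-use or union with other collectives' storage)
__shared__ typename BlockRadixSort::TempStorage temp_storage;
BlockRadixSort sort_provided(temp_storage);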
template<typename KeyT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, typename ValueT = NullType, int RADIX_BITS = 4, bool MEMOIZE_OUTER_SCAN = (CUB_PTX_ARCH >= 350) ? true : false, BlockScanAlgorithm INNER_SCAN_ALGORITHM = BLOCK_SCAN_WARP_SCANS, cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ void cub::BlockRadixSort< KeyT, BLOCK_DIM_X, ITEMS_PER_THREAD, ValueT, RADIX_BITS, MEMOIZE_OUTER_SCAN, INNER_SCAN_ALGORITHM, SMEM_CONFIG, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::Sort (KeyT (&keys)[ITEMS_PER_THREAD], int begin_bit = 0, int end_bit = sizeof(KeyT) * 8)   [inline]
Performs an ascending block-wide radix sort over a blocked arrangement of keys.
- Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
- A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
- Snippet
- The code snippet below illustrates a sort of 512 integer keys that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive keys.
__global__ void ExampleKernel(...)
{
    // Specialize BlockRadixSort for a 1D block of 128 threads owning 4 integer keys each
    typedef cub::BlockRadixSort<int, 128, 4> BlockRadixSort;
    __shared__ typename BlockRadixSort::TempStorage temp_storage;
    int thread_keys[4];
    ...
    // Collectively sort the keys
    BlockRadixSort(temp_storage).Sort(thread_keys);
- Suppose the set of input thread_keys across the block of threads is { [0,511,1,510], [2,509,3,508], [4,507,5,506], ..., [254,257,255,256] }. The corresponding output thread_keys in those threads will be { [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [508,509,510,511] }.
- Parameters
-
[in,out] | keys | Keys to sort |
[in] | begin_bit | [optional] The beginning (least-significant) bit index needed for key comparison |
[in] | end_bit | [optional] The past-the-end (most-significant) bit index needed for key comparison |
template<typename KeyT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, typename ValueT = NullType, int RADIX_BITS = 4, bool MEMOIZE_OUTER_SCAN = (CUB_PTX_ARCH >= 350) ? true : false, BlockScanAlgorithm INNER_SCAN_ALGORITHM = BLOCK_SCAN_WARP_SCANS, cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ void cub::BlockRadixSort< KeyT, BLOCK_DIM_X, ITEMS_PER_THREAD, ValueT, RADIX_BITS, MEMOIZE_OUTER_SCAN, INNER_SCAN_ALGORITHM, SMEM_CONFIG, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::Sort (KeyT (&keys)[ITEMS_PER_THREAD], ValueT (&values)[ITEMS_PER_THREAD], int begin_bit = 0, int end_bit = sizeof(KeyT) * 8)   [inline]
Performs an ascending block-wide radix sort across a blocked arrangement of keys and values.
- BlockRadixSort can only accommodate one associated tile of values. To "truck along" more than one tile of values, simply perform a key-value sort of the keys paired with a temporary value array that enumerates the key indices. The reordered indices can then be used as a gather-vector for exchanging other associated tile data through shared memory.
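- A hedged sketch of that idea (the thread_other array, the tile_other buffer, and the 128-thread / 4-items configuration are assumptions): sort enumerated indices alongside the keys, then gather the extra tile through shared memory:
// Assumed specialization carrying int indices as the value tile
typedef cub::BlockRadixSort<int, 128, 4, int> BlockRadixSortKV;
__shared__ typename BlockRadixSortKV::TempStorage temp_storage;
__shared__ float tile_other[512];   // a second, hypothetical tile of associated data
int thread_indices[4];
for (int i = 0; i < 4; ++i)
    thread_indices[i] = threadIdx.x * 4 + i;   // enumerate original key ranks
BlockRadixSortKV(temp_storage).Sort(thread_keys, thread_indices);
// Scatter the extra tile to shared memory in its original blocked order
for (int i = 0; i < 4; ++i)
    tile_other[threadIdx.x * 4 + i] = thread_other[i];
__syncthreads();
// Gather it back using the sorted indices as a gather-vector
for (int i = 0; i < 4; ++i)
    thread_other[i] = tile_other[thread_indices[i]];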
- Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
- A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
- Snippet
- The code snippet below illustrates a sort of 512 integer keys and values that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive pairs.
__global__ void ExampleKernel(...)
{
    // Specialize BlockRadixSort for a 1D block of 128 threads owning 4 integer keys and values each
    typedef cub::BlockRadixSort<int, 128, 4, int> BlockRadixSort;
    __shared__ typename BlockRadixSort::TempStorage temp_storage;
    int thread_keys[4];
    int thread_values[4];
    ...
    // Collectively sort the keys and values among block threads
    BlockRadixSort(temp_storage).Sort(thread_keys, thread_values);
- Suppose the set of input thread_keys across the block of threads is { [0,511,1,510], [2,509,3,508], [4,507,5,506], ..., [254,257,255,256] }. The corresponding output thread_keys in those threads will be { [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [508,509,510,511] }.
- Parameters
-
[in,out] | keys | Keys to sort |
[in,out] | values | Values to sort |
[in] | begin_bit | [optional] The beginning (least-significant) bit index needed for key comparison |
[in] | end_bit | [optional] The past-the-end (most-significant) bit index needed for key comparison |
template<typename KeyT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, typename ValueT = NullType, int RADIX_BITS = 4, bool MEMOIZE_OUTER_SCAN = (CUB_PTX_ARCH >= 350) ? true : false, BlockScanAlgorithm INNER_SCAN_ALGORITHM = BLOCK_SCAN_WARP_SCANS, cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ void cub::BlockRadixSort< KeyT, BLOCK_DIM_X, ITEMS_PER_THREAD, ValueT, RADIX_BITS, MEMOIZE_OUTER_SCAN, INNER_SCAN_ALGORITHM, SMEM_CONFIG, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::SortDescending (KeyT (&keys)[ITEMS_PER_THREAD], int begin_bit = 0, int end_bit = sizeof(KeyT) * 8)   [inline]
Performs a descending block-wide radix sort over a blocked arrangement of keys.
- Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
- A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
- Snippet
- The code snippet below illustrates a sort of 512 integer keys that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive keys.
__global__ void ExampleKernel(...)
{
    // Specialize BlockRadixSort for a 1D block of 128 threads owning 4 integer keys each
    typedef cub::BlockRadixSort<int, 128, 4> BlockRadixSort;
    __shared__ typename BlockRadixSort::TempStorage temp_storage;
    int thread_keys[4];
    ...
    // Collectively sort the keys in descending order
    BlockRadixSort(temp_storage).SortDescending(thread_keys);
- Suppose the set of input thread_keys across the block of threads is { [0,511,1,510], [2,509,3,508], [4,507,5,506], ..., [254,257,255,256] }. The corresponding output thread_keys in those threads will be { [511,510,509,508], [507,506,505,504], [503,502,501,500], ..., [3,2,1,0] }.
- Parameters
-
[in,out] | keys | Keys to sort |
[in] | begin_bit | [optional] The beginning (least-significant) bit index needed for key comparison |
[in] | end_bit | [optional] The past-the-end (most-significant) bit index needed for key comparison |
template<typename KeyT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, typename ValueT = NullType, int RADIX_BITS = 4, bool MEMOIZE_OUTER_SCAN = (CUB_PTX_ARCH >= 350) ? true : false, BlockScanAlgorithm INNER_SCAN_ALGORITHM = BLOCK_SCAN_WARP_SCANS, cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ void cub::BlockRadixSort< KeyT, BLOCK_DIM_X, ITEMS_PER_THREAD, ValueT, RADIX_BITS, MEMOIZE_OUTER_SCAN, INNER_SCAN_ALGORITHM, SMEM_CONFIG, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::SortDescending (KeyT (&keys)[ITEMS_PER_THREAD], ValueT (&values)[ITEMS_PER_THREAD], int begin_bit = 0, int end_bit = sizeof(KeyT) * 8)   [inline]
Performs a descending block-wide radix sort across a blocked arrangement of keys and values.
- BlockRadixSort can only accommodate one associated tile of values. To "truck along" more than one tile of values, simply perform a key-value sort of the keys paired with a temporary value array that enumerates the key indices. The reordered indices can then be used as a gather-vector for exchanging other associated tile data through shared memory.
- Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
- A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
- Snippet
- The code snippet below illustrates a sort of 512 integer keys and values that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive pairs.
__global__ void ExampleKernel(...)
{
    // Specialize BlockRadixSort for a 1D block of 128 threads owning 4 integer keys and values each
    typedef cub::BlockRadixSort<int, 128, 4, int> BlockRadixSort;
    __shared__ typename BlockRadixSort::TempStorage temp_storage;
    int thread_keys[4];
    int thread_values[4];
    ...
    // Collectively sort the keys and values among block threads, in descending order
    BlockRadixSort(temp_storage).SortDescending(thread_keys, thread_values);
- Suppose the set of input thread_keys across the block of threads is { [0,511,1,510], [2,509,3,508], [4,507,5,506], ..., [254,257,255,256] }. The corresponding output thread_keys in those threads will be { [511,510,509,508], [507,506,505,504], [503,502,501,500], ..., [3,2,1,0] }.
- Parameters
-
[in,out] | keys | Keys to sort |
[in,out] | values | Values to sort |
[in] | begin_bit | [optional] The beginning (least-significant) bit index needed for key comparison |
[in] | end_bit | [optional] The past-the-end (most-significant) bit index needed for key comparison |
template<typename KeyT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, typename ValueT = NullType, int RADIX_BITS = 4, bool MEMOIZE_OUTER_SCAN = (CUB_PTX_ARCH >= 350) ? true : false, BlockScanAlgorithm INNER_SCAN_ALGORITHM = BLOCK_SCAN_WARP_SCANS, cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ void cub::BlockRadixSort< KeyT, BLOCK_DIM_X, ITEMS_PER_THREAD, ValueT, RADIX_BITS, MEMOIZE_OUTER_SCAN, INNER_SCAN_ALGORITHM, SMEM_CONFIG, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::SortBlockedToStriped (KeyT (&keys)[ITEMS_PER_THREAD], int begin_bit = 0, int end_bit = sizeof(KeyT) * 8)   [inline]
Performs an ascending radix sort across a blocked arrangement of keys, leaving them in a striped arrangement.
- Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
- A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
- Snippet
- The code snippet below illustrates a sort of 512 integer keys that are initially partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive keys. The final partitioning is striped.
__global__ void ExampleKernel(...)
{
    // Specialize BlockRadixSort for a 1D block of 128 threads owning 4 integer keys each
    typedef cub::BlockRadixSort<int, 128, 4> BlockRadixSort;
    __shared__ typename BlockRadixSort::TempStorage temp_storage;
    int thread_keys[4];
    ...
    // Collectively sort the keys, leaving them in a striped arrangement
    BlockRadixSort(temp_storage).SortBlockedToStriped(thread_keys);
- Suppose the set of input thread_keys across the block of threads is { [0,511,1,510], [2,509,3,508], [4,507,5,506], ..., [254,257,255,256] }. The corresponding output thread_keys in those threads will be { [0,128,256,384], [1,129,257,385], [2,130,258,386], ..., [127,255,383,511] }.
- Parameters
-
[in,out] | keys | Keys to sort |
[in] | begin_bit | [optional] The beginning (least-significant) bit index needed for key comparison |
[in] | end_bit | [optional] The past-the-end (most-significant) bit index needed for key comparison |
template<typename KeyT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, typename ValueT = NullType, int RADIX_BITS = 4, bool MEMOIZE_OUTER_SCAN = (CUB_PTX_ARCH >= 350) ? true : false, BlockScanAlgorithm INNER_SCAN_ALGORITHM = BLOCK_SCAN_WARP_SCANS, cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ void cub::BlockRadixSort< KeyT, BLOCK_DIM_X, ITEMS_PER_THREAD, ValueT, RADIX_BITS, MEMOIZE_OUTER_SCAN, INNER_SCAN_ALGORITHM, SMEM_CONFIG, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::SortBlockedToStriped (KeyT (&keys)[ITEMS_PER_THREAD], ValueT (&values)[ITEMS_PER_THREAD], int begin_bit = 0, int end_bit = sizeof(KeyT) * 8)   [inline]
Performs an ascending radix sort across a blocked arrangement of keys and values, leaving them in a striped arrangement.
- BlockRadixSort can only accommodate one associated tile of values. To "truck along" more than one tile of values, simply perform a key-value sort of the keys paired with a temporary value array that enumerates the key indices. The reordered indices can then be used as a gather-vector for exchanging other associated tile data through shared memory.
- Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
- A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
- Snippet
- The code snippet below illustrates a sort of 512 integer keys and values that are initially partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive pairs. The final partitioning is striped.
__global__ void ExampleKernel(...)
{
    // Specialize BlockRadixSort for a 1D block of 128 threads owning 4 integer keys and values each
    typedef cub::BlockRadixSort<int, 128, 4, int> BlockRadixSort;
    __shared__ typename BlockRadixSort::TempStorage temp_storage;
    int thread_keys[4];
    int thread_values[4];
    ...
    // Collectively sort the keys and values, leaving them in a striped arrangement
    BlockRadixSort(temp_storage).SortBlockedToStriped(thread_keys, thread_values);
- Suppose the set of input thread_keys across the block of threads is { [0,511,1,510], [2,509,3,508], [4,507,5,506], ..., [254,257,255,256] }. The corresponding output thread_keys in those threads will be { [0,128,256,384], [1,129,257,385], [2,130,258,386], ..., [127,255,383,511] }.
- Parameters
-
[in,out] | keys | Keys to sort |
[in,out] | values | Values to sort |
[in] | begin_bit | [optional] The beginning (least-significant) bit index needed for key comparison |
[in] | end_bit | [optional] The past-the-end (most-significant) bit index needed for key comparison |
template<typename KeyT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, typename ValueT = NullType, int RADIX_BITS = 4, bool MEMOIZE_OUTER_SCAN = (CUB_PTX_ARCH >= 350) ? true : false, BlockScanAlgorithm INNER_SCAN_ALGORITHM = BLOCK_SCAN_WARP_SCANS, cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ void cub::BlockRadixSort< KeyT, BLOCK_DIM_X, ITEMS_PER_THREAD, ValueT, RADIX_BITS, MEMOIZE_OUTER_SCAN, INNER_SCAN_ALGORITHM, SMEM_CONFIG, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::SortDescendingBlockedToStriped (KeyT (&keys)[ITEMS_PER_THREAD], int begin_bit = 0, int end_bit = sizeof(KeyT) * 8)   [inline]
Performs a descending radix sort across a blocked arrangement of keys, leaving them in a striped arrangement.
- Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
- A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
- Snippet
- The code snippet below illustrates a sort of 512 integer keys that are initially partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive keys. The final partitioning is striped.
__global__ void ExampleKernel(...)
{
    // Specialize BlockRadixSort for a 1D block of 128 threads owning 4 integer keys each
    typedef cub::BlockRadixSort<int, 128, 4> BlockRadixSort;
    __shared__ typename BlockRadixSort::TempStorage temp_storage;
    int thread_keys[4];
    ...
    // Collectively sort the keys in descending order, leaving them in a striped arrangement
    BlockRadixSort(temp_storage).SortDescendingBlockedToStriped(thread_keys);
- Suppose the set of input thread_keys across the block of threads is { [0,511,1,510], [2,509,3,508], [4,507,5,506], ..., [254,257,255,256] }. The corresponding output thread_keys in those threads will be { [511,383,255,127], [386,258,130,2], [385,257,129,1], ..., [384,256,128,0] }.
- Parameters
-
[in,out] | keys | Keys to sort |
[in] | begin_bit | [optional] The beginning (least-significant) bit index needed for key comparison |
[in] | end_bit | [optional] The past-the-end (most-significant) bit index needed for key comparison |
template<typename KeyT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, typename ValueT = NullType, int RADIX_BITS = 4, bool MEMOIZE_OUTER_SCAN = (CUB_PTX_ARCH >= 350) ? true : false, BlockScanAlgorithm INNER_SCAN_ALGORITHM = BLOCK_SCAN_WARP_SCANS, cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ void cub::BlockRadixSort< KeyT, BLOCK_DIM_X, ITEMS_PER_THREAD, ValueT, RADIX_BITS, MEMOIZE_OUTER_SCAN, INNER_SCAN_ALGORITHM, SMEM_CONFIG, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::SortDescendingBlockedToStriped (KeyT (&keys)[ITEMS_PER_THREAD], ValueT (&values)[ITEMS_PER_THREAD], int begin_bit = 0, int end_bit = sizeof(KeyT) * 8)   [inline]
Performs a descending radix sort across a blocked arrangement of keys and values, leaving them in a striped arrangement.
- BlockRadixSort can only accommodate one associated tile of values. To "truck along" more than one tile of values, simply perform a key-value sort of the keys paired with a temporary value array that enumerates the key indices. The reordered indices can then be used as a gather-vector for exchanging other associated tile data through shared memory.
- Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
- A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
- Snippet
- The code snippet below illustrates a sort of 512 integer keys and values that are initially partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive pairs. The final partitioning is striped.
__global__ void ExampleKernel(...)
{
    // Specialize BlockRadixSort for a 1D block of 128 threads owning 4 integer keys and values each
    typedef cub::BlockRadixSort<int, 128, 4, int> BlockRadixSort;
    __shared__ typename BlockRadixSort::TempStorage temp_storage;
    int thread_keys[4];
    int thread_values[4];
    ...
    // Collectively sort the keys and values in descending order, leaving them striped
    BlockRadixSort(temp_storage).SortDescendingBlockedToStriped(thread_keys, thread_values);
- Suppose the set of input thread_keys across the block of threads is { [0,511,1,510], [2,509,3,508], [4,507,5,506], ..., [254,257,255,256] }. The corresponding output thread_keys in those threads will be { [511,383,255,127], [386,258,130,2], [385,257,129,1], ..., [384,256,128,0] }.
- Parameters
-
[in,out] | keys | Keys to sort |
[in,out] | values | Values to sort |
[in] | begin_bit | [optional] The beginning (least-significant) bit index needed for key comparison |
[in] | end_bit | [optional] The past-the-end (most-significant) bit index needed for key comparison |