CUB  
 All Classes Namespaces Files Functions Variables Typedefs Enumerations Enumerator Friends Macros Groups Pages
Static Public Methods | List of all members
cub::DeviceSelect Struct Reference

Detailed description

DeviceSelect provides device-wide, parallel operations for compacting selected items from sequences of data items residing within device-accessible memory.

select_logo.png
.
Overview
These operations apply a selection criterion to selectively copy items from a specified input sequence to a compact output sequence.
Usage Considerations
  • Dynamic parallelism. DeviceSelect methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported.
Performance
The work-complexity of select-flagged, select-if, and select-unique as a function of input size is linear, resulting in performance throughput that plateaus with problem sizes large enough to saturate the GPU.
The following chart illustrates DeviceSelect::If performance across different CUDA architectures for int32 items, where 50% of the items are randomly selected.
select_if_int32_50_percent.png
The following chart illustrates DeviceSelect::Unique performance across different CUDA architectures for int32 items where segments have lengths uniformly sampled from [1,1000].
select_unique_int32_len_500.png
Performance plots for other scenarios can be found in the detailed method descriptions below.

Static Public Methods

template<typename InputIteratorT , typename FlagIterator , typename OutputIteratorT , typename NumSelectedIteratorT >
CUB_RUNTIME_FUNCTION static
__forceinline__ cudaError_t 
Flagged (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, FlagIterator d_flags, OutputIteratorT d_out, NumSelectedIteratorT d_num_selected_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
 Uses the d_flags sequence to selectively copy the corresponding items from d_in into d_out. The total number of items selected is written to d_num_selected_out.

select_flags_logo.png
.
More...
 
template<typename InputIteratorT , typename OutputIteratorT , typename NumSelectedIteratorT , typename SelectOp >
CUB_RUNTIME_FUNCTION static
__forceinline__ cudaError_t 
If (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, NumSelectedIteratorT d_num_selected_out, int num_items, SelectOp select_op, cudaStream_t stream=0, bool debug_synchronous=false)
 Uses the select_op functor to selectively copy items from d_in into d_out. The total number of items selected is written to d_num_selected_out.

select_logo.png
.
More...
 
template<typename InputIteratorT , typename OutputIteratorT , typename NumSelectedIteratorT >
CUB_RUNTIME_FUNCTION static
__forceinline__ cudaError_t 
Unique (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, NumSelectedIteratorT d_num_selected_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
 Given an input sequence d_in having runs of consecutive equal-valued keys, only the first key from each run is selectively copied to d_out. The total number of items selected is written to d_num_selected_out.

unique_logo.png
.
More...
 
template<typename KeyInputIteratorT , typename ValueInputIteratorT , typename KeyOutputIteratorT , typename ValueOutputIteratorT , typename NumSelectedIteratorT >
CUB_RUNTIME_FUNCTION static
__forceinline__ cudaError_t 
UniqueByKey (void *d_temp_storage, size_t &temp_storage_bytes, KeyInputIteratorT d_keys_in, ValueInputIteratorT d_values_in, KeyOutputIteratorT d_keys_out, ValueOutputIteratorT d_values_out, NumSelectedIteratorT d_num_selected_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
 Given an input sequence d_keys_in and d_values_in with runs of key-value pairs with consecutive equal-valued keys, only the first key and its value from each run is selectively copied to d_keys_out and d_values_out. The total number of items selected is written to d_num_selected_out.

unique_logo.png
.
More...
 

Member Function Documentation

template<typename InputIteratorT , typename FlagIterator , typename OutputIteratorT , typename NumSelectedIteratorT >
CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t cub::DeviceSelect::Flagged ( void *  d_temp_storage,
size_t &  temp_storage_bytes,
InputIteratorT  d_in,
FlagIterator  d_flags,
OutputIteratorT  d_out,
NumSelectedIteratorT  d_num_selected_out,
int  num_items,
cudaStream_t  stream = 0,
bool  debug_synchronous = false 
)
inlinestatic

Uses the d_flags sequence to selectively copy the corresponding items from d_in into d_out. The total number of items selected is written to d_num_selected_out.

select_flags_logo.png
.

  • The value type of d_flags must be castable to bool (e.g., bool, char, int, etc.).
  • Copies of the selected items are compacted into d_out and maintain their original relative ordering.
  • When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
Snippet
The code snippet below illustrates the compaction of items selected from an int device vector.
#include <cub/cub.cuh> // or equivalently <cub/device/device_select.cuh>
// Declare, allocate, and initialize device-accessible pointers for input, flags, and output
int num_items; // e.g., 8
int *d_in; // e.g., [1, 2, 3, 4, 5, 6, 7, 8]
char *d_flags; // e.g., [1, 0, 0, 1, 0, 1, 1, 0]
int *d_out; // e.g., [ , , , , , , , ]
int *d_num_selected_out; // e.g., [ ]
...
// Determine temporary device storage requirements
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceSelect::Flagged(d_temp_storage, temp_storage_bytes, d_in, d_flags, d_out, d_num_selected_out, num_items);
// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run selection
cub::DeviceSelect::Flagged(d_temp_storage, temp_storage_bytes, d_in, d_flags, d_out, d_num_selected_out, num_items);
// d_out <-- [1, 4, 6, 7]
// d_num_selected_out <-- [4]
Template Parameters
InputIteratorT[inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
FlagIterator[inferred] Random-access input iterator type for reading selection flags (may be a simple pointer type)
OutputIteratorT[inferred] Random-access output iterator type for writing selected items (may be a simple pointer type)
NumSelectedIteratorT[inferred] Output iterator type for recording the number of items selected (may be a simple pointer type)
Parameters
[in]d_temp_storageDevice-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
[in,out]temp_storage_bytesReference to size in bytes of d_temp_storage allocation
[in]d_inPointer to the input sequence of data items
[in]d_flagsPointer to the input sequence of selection flags
[out]d_outPointer to the output sequence of selected data items
[out]d_num_selected_outPointer to the output total number of items selected (i.e., length of d_out)
[in]num_itemsTotal number of input items (i.e., length of d_in)
[in]stream[optional] CUDA stream to launch kernels within. Default is stream0.
[in]debug_synchronous[optional] Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is false.
Examples:
example_device_select_flagged.cu.
template<typename InputIteratorT , typename OutputIteratorT , typename NumSelectedIteratorT , typename SelectOp >
CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t cub::DeviceSelect::If ( void *  d_temp_storage,
size_t &  temp_storage_bytes,
InputIteratorT  d_in,
OutputIteratorT  d_out,
NumSelectedIteratorT  d_num_selected_out,
int  num_items,
SelectOp  select_op,
cudaStream_t  stream = 0,
bool  debug_synchronous = false 
)
inlinestatic

Uses the select_op functor to selectively copy items from d_in into d_out. The total number of items selected is written to d_num_selected_out.

select_logo.png
.

  • Copies of the selected items are compacted into d_out and maintain their original relative ordering.
  • When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
Performance
The following charts illustrate saturated select-if performance across different CUDA architectures for int32 and int64 items, respectively. Items are selected with 50% probability.
select_if_int32_50_percent.png
select_if_int64_50_percent.png
The following charts are similar, but 5% selection probability:
select_if_int32_5_percent.png
select_if_int64_5_percent.png
Snippet
The code snippet below illustrates the compaction of items selected from an int device vector.
#include <cub/cub.cuh> // or equivalently <cub/device/device_select.cuh>
// Functor type for selecting values less than some criteria
struct LessThan
{
int compare;
CUB_RUNTIME_FUNCTION __forceinline__
LessThan(int compare) : compare(compare) {}
CUB_RUNTIME_FUNCTION __forceinline__
bool operator()(const int &a) const {
return (a < compare);
}
};
// Declare, allocate, and initialize device-accessible pointers for input and output
int num_items; // e.g., 8
int *d_in; // e.g., [0, 2, 3, 9, 5, 2, 81, 8]
int *d_out; // e.g., [ , , , , , , , ]
int *d_num_selected_out; // e.g., [ ]
LessThan select_op(7);
...
// Determine temporary device storage requirements
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceSelect::If(d_temp_storage, temp_storage_bytes, d_in, d_out, d_num_selected_out, num_items, select_op);
// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run selection
cub::DeviceSelect::If(d_temp_storage, temp_storage_bytes, d_in, d_out, d_num_selected_out, num_items, select_op);
// d_out <-- [0, 2, 3, 5, 2]
// d_num_selected_out <-- [5]
Template Parameters
InputIteratorT[inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT[inferred] Random-access output iterator type for writing selected items (may be a simple pointer type)
NumSelectedIteratorT[inferred] Output iterator type for recording the number of items selected (may be a simple pointer type)
SelectOp[inferred] Selection operator type having member bool operator()(const T &a)
Parameters
[in]d_temp_storageDevice-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
[in,out]temp_storage_bytesReference to size in bytes of d_temp_storage allocation
[in]d_inPointer to the input sequence of data items
[out]d_outPointer to the output sequence of selected data items
[out]d_num_selected_outPointer to the output total number of items selected (i.e., length of d_out)
[in]num_itemsTotal number of input items (i.e., length of d_in)
[in]select_opUnary selection operator
[in]stream[optional] CUDA stream to launch kernels within. Default is stream0.
[in]debug_synchronous[optional] Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is false.
Examples:
example_device_select_if.cu.
template<typename InputIteratorT , typename OutputIteratorT , typename NumSelectedIteratorT >
CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t cub::DeviceSelect::Unique ( void *  d_temp_storage,
size_t &  temp_storage_bytes,
InputIteratorT  d_in,
OutputIteratorT  d_out,
NumSelectedIteratorT  d_num_selected_out,
int  num_items,
cudaStream_t  stream = 0,
bool  debug_synchronous = false 
)
inlinestatic

Given an input sequence d_in having runs of consecutive equal-valued keys, only the first key from each run is selectively copied to d_out. The total number of items selected is written to d_num_selected_out.

unique_logo.png
.

  • The == equality operator is used to determine whether keys are equivalent
  • Copies of the selected items are compacted into d_out and maintain their original relative ordering.
  • When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
Performance
The following charts illustrate saturated select-unique performance across different CUDA architectures for int32 and int64 items, respectively. Segments have lengths uniformly sampled from [1,1000].
select_unique_int32_len_500.png
select_unique_int64_len_500.png
The following charts are similar, but with segment lengths uniformly sampled from [1,10]:
select_unique_int32_len_5.png
select_unique_int64_len_5.png
Snippet
The code snippet below illustrates the compaction of items selected from an int device vector.
#include <cub/cub.cuh> // or equivalently <cub/device/device_select.cuh>
// Declare, allocate, and initialize device-accessible pointers for input and output
int num_items; // e.g., 8
int *d_in; // e.g., [0, 2, 2, 9, 5, 5, 5, 8]
int *d_out; // e.g., [ , , , , , , , ]
int *d_num_selected_out; // e.g., [ ]
...
// Determine temporary device storage requirements
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceSelect::Unique(d_temp_storage, temp_storage_bytes, d_in, d_out, d_num_selected_out, num_items);
// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run selection
cub::DeviceSelect::Unique(d_temp_storage, temp_storage_bytes, d_in, d_out, d_num_selected_out, num_items);
// d_out <-- [0, 2, 9, 5, 8]
// d_num_selected_out <-- [5]
Template Parameters
InputIteratorT[inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT[inferred] Random-access output iterator type for writing selected items (may be a simple pointer type)
NumSelectedIteratorT[inferred] Output iterator type for recording the number of items selected (may be a simple pointer type)
Parameters
[in]d_temp_storageDevice-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
[in,out]temp_storage_bytesReference to size in bytes of d_temp_storage allocation
[in]d_inPointer to the input sequence of data items
[out]d_outPointer to the output sequence of selected data items
[out]d_num_selected_outPointer to the output total number of items selected (i.e., length of d_out)
[in]num_itemsTotal number of input items (i.e., length of d_in)
[in]stream[optional] CUDA stream to launch kernels within. Default is stream0.
[in]debug_synchronous[optional] Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is false.
Examples:
example_device_select_unique.cu.
template<typename KeyInputIteratorT , typename ValueInputIteratorT , typename KeyOutputIteratorT , typename ValueOutputIteratorT , typename NumSelectedIteratorT >
CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t cub::DeviceSelect::UniqueByKey ( void *  d_temp_storage,
size_t &  temp_storage_bytes,
KeyInputIteratorT  d_keys_in,
ValueInputIteratorT  d_values_in,
KeyOutputIteratorT  d_keys_out,
ValueOutputIteratorT  d_values_out,
NumSelectedIteratorT  d_num_selected_out,
int  num_items,
cudaStream_t  stream = 0,
bool  debug_synchronous = false 
)
inlinestatic

Given an input sequence d_keys_in and d_values_in with runs of key-value pairs with consecutive equal-valued keys, only the first key and its value from each run is selectively copied to d_keys_out and d_values_out. The total number of items selected is written to d_num_selected_out.

unique_logo.png
.

  • The == equality operator is used to determine whether keys are equivalent
  • Copies of the selected items are compacted into d_out and maintain their original relative ordering.
  • When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
Snippet
The code snippet below illustrates the compaction of items selected from an int device vector.
#include <cub/cub.cuh> // or equivalently <cub/device/device_select.cuh>
// Declare, allocate, and initialize device-accessible pointers for input and output
int num_items; // e.g., 8
int *d_keys_in; // e.g., [0, 2, 2, 9, 5, 5, 5, 8]
int *d_values_in; // e.g., [1, 2, 3, 4, 5, 6, 7, 8]
int *d_keys_out; // e.g., [ , , , , , , , ]
int *d_values_out; // e.g., [ , , , , , , , ]
int *d_num_selected_out; // e.g., [ ]
...
// Determine temporary device storage requirements
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceSelect::UniqueByKey(d_temp_storage, temp_storage_bytes, d_keys_in, d_values_in, d_keys_out, d_values_out, d_num_selected_out, num_items);
// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run selection
cub::DeviceSelect::UniqueByKey(d_temp_storage, temp_storage_bytes, d_keys_in, d_values_in, d_keys_out, d_values_out, d_num_selected_out, num_items);
// d_keys_out <-- [0, 2, 9, 5, 8]
// d_values_out <-- [1, 2, 4, 5, 8]
// d_num_selected_out <-- [5]
Template Parameters
KeyInputIteratorT[inferred] Random-access input iterator type for reading input keys (may be a simple pointer type)
ValueInputIteratorT[inferred] Random-access input iterator type for reading input values (may be a simple pointer type)
KeyOutputIteratorT[inferred] Random-access output iterator type for writing selected keys (may be a simple pointer type)
ValueOutputIteratorT[inferred] Random-access output iterator type for writing selected values (may be a simple pointer type)
NumSelectedIteratorT[inferred] Output iterator type for recording the number of items selected (may be a simple pointer type)
Parameters
[in]d_temp_storageDevice-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
[in,out]temp_storage_bytesReference to size in bytes of d_temp_storage allocation
[in]d_keys_inPointer to the input sequence of keys
[in]d_values_inPointer to the input sequence of values
[out]d_keys_outPointer to the output sequence of selected keys
[out]d_values_outPointer to the output sequence of selected values
[out]d_num_selected_outPointer to the total number of items selected (i.e., length of d_keys_out or d_values_out)
[in]num_itemsTotal number of input items (i.e., length of d_keys_in or d_values_in)
[in]stream[optional] CUDA stream to launch kernels within. Default is stream0.
[in]debug_synchronous[optional] Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is false.

The documentation for this struct was generated from the following file: