DeviceSelect provides device-wide, parallel operations for compacting selected items from sequences of data items residing within device-accessible memory.
- Overview
- These operations apply a selection criterion to selectively copy items from a specified input sequence to a compact output sequence.
- Usage Considerations
- Dynamic parallelism. DeviceSelect methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported.
- Performance
- The work-complexity of select-flagged, select-if, and select-unique as a function of input size is linear, resulting in performance throughput that plateaus with problem sizes large enough to saturate the GPU.
- The following chart illustrates DeviceSelect::If performance across different CUDA architectures for int32 items, where 50% of the items are randomly selected.
- The following chart illustrates DeviceSelect::Unique performance across different CUDA architectures for int32 items, where segments have lengths uniformly sampled from [1, 1000].
- Performance plots for other scenarios can be found in the detailed method descriptions below.
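The selection semantics described above can be sketched sequentially on the host. This is only an illustrative reference, not CUB's implementation: `select_if_reference` and `select_flagged_reference` are hypothetical names, and `std::vector` stands in for device-accessible memory. Both do linear work in the input size and preserve the relative order of the selected items.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference (not part of CUB) for select-if:
// keeps every item for which select_op returns true, in input order.
template <typename T, typename SelectOp>
std::vector<T> select_if_reference(const std::vector<T>& in, SelectOp select_op) {
    std::vector<T> out;
    for (const T& item : in) {
        if (select_op(item)) out.push_back(item);
    }
    return out;
}

// Hypothetical host-side reference (not part of CUB) for select-flagged:
// keeps in[i] whenever flags[i] converts to true, in input order.
template <typename T, typename FlagT>
std::vector<T> select_flagged_reference(const std::vector<T>& in,
                                        const std::vector<FlagT>& flags) {
    assert(in.size() == flags.size());  // one flag per input item
    std::vector<T> out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (flags[i]) out.push_back(in[i]);
    }
    return out;
}
```

A reference like this may be useful in a unit test that copies the device output back to the host and compares it against the sequential result; the size of the returned vector corresponds to the value written to d_num_selected_out.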
template<typename InputIteratorT, typename FlagIterator, typename OutputIteratorT, typename NumSelectedIteratorT>
CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t Flagged (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, FlagIterator d_flags, OutputIteratorT d_out, NumSelectedIteratorT d_num_selected_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
Uses the d_flags sequence to selectively copy the corresponding items from d_in into d_out. The total number of items selected is written to d_num_selected_out.

template<typename InputIteratorT, typename OutputIteratorT, typename NumSelectedIteratorT, typename SelectOp>
CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t If (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, NumSelectedIteratorT d_num_selected_out, int num_items, SelectOp select_op, cudaStream_t stream=0, bool debug_synchronous=false)
Uses the select_op functor to selectively copy items from d_in into d_out. The total number of items selected is written to d_num_selected_out.

template<typename InputIteratorT, typename OutputIteratorT, typename NumSelectedIteratorT>
CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t Unique (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, NumSelectedIteratorT d_num_selected_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
Given an input sequence d_in having runs of consecutive equal-valued keys, only the first key from each run is selectively copied to d_out. The total number of items selected is written to d_num_selected_out.

template<typename KeyInputIteratorT, typename ValueInputIteratorT, typename KeyOutputIteratorT, typename ValueOutputIteratorT, typename NumSelectedIteratorT>
CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t UniqueByKey (void *d_temp_storage, size_t &temp_storage_bytes, KeyInputIteratorT d_keys_in, ValueInputIteratorT d_values_in, KeyOutputIteratorT d_keys_out, ValueOutputIteratorT d_values_out, NumSelectedIteratorT d_num_selected_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
Given input sequences d_keys_in and d_values_in containing runs of key-value pairs with consecutive equal-valued keys, only the first key and its value from each run are selectively copied to d_keys_out and d_values_out. The total number of items selected is written to d_num_selected_out.
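The run-based behavior of Unique can be illustrated with a small sequential sketch. It is a host-side reference under stated assumptions, not CUB's implementation: `unique_reference` is a hypothetical name, and `std::vector` stands in for device-accessible memory. Only adjacent items are compared, so equal values that are not consecutive are both kept.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference (not part of CUB) for the Unique semantics:
// copies the first item of every run of consecutive equal items.
template <typename T>
std::vector<T> unique_reference(const std::vector<T>& in) {
    std::vector<T> out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        // An item starts a new run if it is the first item overall
        // or differs (via ==) from its immediate predecessor.
        if (i == 0 || !(in[i] == in[i - 1])) out.push_back(in[i]);
    }
    assert(out.size() <= in.size());  // compaction never grows the sequence
    return out;
}
```

For the input [0, 2, 2, 9, 5, 5, 5, 8] this sketch returns [0, 2, 9, 5, 8]. The C++ standard library's std::unique_copy performs the same sequential operation.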
template<typename KeyInputIteratorT, typename ValueInputIteratorT, typename KeyOutputIteratorT, typename ValueOutputIteratorT, typename NumSelectedIteratorT>
CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t cub::DeviceSelect::UniqueByKey (void *d_temp_storage, size_t &temp_storage_bytes, KeyInputIteratorT d_keys_in, ValueInputIteratorT d_values_in, KeyOutputIteratorT d_keys_out, ValueOutputIteratorT d_values_out, NumSelectedIteratorT d_num_selected_out, int num_items, cudaStream_t stream = 0, bool debug_synchronous = false) [inline, static]
Given input sequences d_keys_in and d_values_in containing runs of key-value pairs with consecutive equal-valued keys, only the first key and its value from each run are selectively copied to d_keys_out and d_values_out. The total number of items selected is written to d_num_selected_out.
- The == equality operator is used to determine whether keys are equivalent.
- Copies of the selected items are compacted into d_keys_out and d_values_out and maintain their original relative ordering.
- When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
- Snippet
- The code snippet below illustrates the compaction of items selected from an int device vector.
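// Declare, allocate, and initialize device-accessible pointers for input and output
int num_items;
int *d_keys_in;
int *d_values_in;
int *d_keys_out;
int *d_values_out;
int *d_num_selected_out;
...
// Determine temporary device storage requirements (d_temp_storage is NULL, so no work is done)
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceSelect::UniqueByKey(d_temp_storage, temp_storage_bytes, d_keys_in, d_values_in, d_keys_out, d_values_out, d_num_selected_out, num_items);
// Allocate temporary storage of the size reported by the first call
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run selection
cub::DeviceSelect::UniqueByKey(d_temp_storage, temp_storage_bytes, d_keys_in, d_values_in, d_keys_out, d_values_out, d_num_selected_out, num_items);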
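To check the device results of a UniqueByKey call, a sequential host-side reference may help. The sketch below is an assumption-laden illustration, not CUB's implementation: `unique_by_key_reference` is a hypothetical name, `std::vector` stands in for device-accessible memory, and the example values in the usage note are chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical host-side reference (not part of CUB) for the UniqueByKey
// semantics: for every run of consecutive equal keys, keeps the first key
// together with the value stored at the same position.
template <typename KeyT, typename ValueT>
std::pair<std::vector<KeyT>, std::vector<ValueT>>
unique_by_key_reference(const std::vector<KeyT>& keys,
                        const std::vector<ValueT>& values) {
    assert(keys.size() == values.size());  // one value per key
    std::vector<KeyT> keys_out;
    std::vector<ValueT> values_out;
    for (std::size_t i = 0; i < keys.size(); ++i) {
        // Keys are compared with ==, and only against the previous key.
        if (i == 0 || !(keys[i] == keys[i - 1])) {
            keys_out.push_back(keys[i]);
            values_out.push_back(values[i]);
        }
    }
    return std::make_pair(std::move(keys_out), std::move(values_out));
}
```

With keys [0, 2, 2, 9, 5, 5, 5, 8] and values [1, 2, 3, 4, 5, 6, 7, 8], the sketch returns keys [0, 2, 9, 5, 8] and values [1, 2, 4, 5, 8]; the length of either output (5) corresponds to the count written to d_num_selected_out.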
- Template Parameters
-
KeyInputIteratorT | [inferred] Random-access input iterator type for reading input keys (may be a simple pointer type) |
ValueInputIteratorT | [inferred] Random-access input iterator type for reading input values (may be a simple pointer type) |
KeyOutputIteratorT | [inferred] Random-access output iterator type for writing selected keys (may be a simple pointer type) |
ValueOutputIteratorT | [inferred] Random-access output iterator type for writing selected values (may be a simple pointer type) |
NumSelectedIteratorT | [inferred] Output iterator type for recording the number of items selected (may be a simple pointer type) |
- Parameters
-
[in] | d_temp_storage | Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done. |
[in,out] | temp_storage_bytes | Reference to size in bytes of d_temp_storage allocation |
[in] | d_keys_in | Pointer to the input sequence of keys |
[in] | d_values_in | Pointer to the input sequence of values |
[out] | d_keys_out | Pointer to the output sequence of selected keys |
[out] | d_values_out | Pointer to the output sequence of selected values |
[out] | d_num_selected_out | Pointer to the total number of items selected (i.e., length of d_keys_out or d_values_out) |
[in] | num_items | Total number of input items (i.e., length of d_keys_in or d_values_in) |
[in] | stream | [optional] CUDA stream to launch kernels within. Default is stream 0. |
[in] | debug_synchronous | [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is false. |