CUB
 All Classes Namespaces Files Functions Variables Typedefs Enumerations Enumerator Friends Groups
Static Public Methods | List of all members
cub::DevicePartition Struct Reference

Detailed description

DevicePartition provides device-wide, parallel operations for partitioning sequences of data items residing within device-accessible memory.

partition_logo.png
.
Overview
These operations apply a selection criterion to construct a partitioned output sequence from items selected/unselected from a specified input sequence.
Usage Considerations
  • Dynamic parallelism. DevicePartition methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported.
Performance
The work-complexity of partition as a function of input size is linear, resulting in performance throughput that plateaus with problem sizes large enough to saturate the GPU.
The following chart illustrates DevicePartition::If performance across different CUDA architectures for int32 items, where 50% of the items are randomly selected for the first partition. Performance plots for other scenarios can be found in the detailed method descriptions below.
partition_if_int32_50_percent.png

Static Public Methods

template<typename InputIteratorT , typename FlagIterator , typename OutputIteratorT , typename NumSelectedIteratorT >
CUB_RUNTIME_FUNCTION static
__forceinline__ cudaError_t 
Flagged (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, FlagIterator d_flags, OutputIteratorT d_out, NumSelectedIteratorT d_num_selected_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
 Uses the d_flags sequence to split the corresponding items from d_in into a partitioned sequence d_out. The total number of items copied into the first partition is written to d_num_selected_out.

partition_flags_logo.png
.
More...
 
template<typename InputIteratorT , typename OutputIteratorT , typename NumSelectedIteratorT , typename SelectOp >
CUB_RUNTIME_FUNCTION static
__forceinline__ cudaError_t 
If (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, NumSelectedIteratorT d_num_selected_out, int num_items, SelectOp select_op, cudaStream_t stream=0, bool debug_synchronous=false)
 Uses the select_op functor to split the corresponding items from d_in into a partitioned sequence d_out. The total number of items copied into the first partition is written to d_num_selected_out.

partition_logo.png
.
More...
 

Member Function Documentation

template<typename InputIteratorT , typename FlagIterator , typename OutputIteratorT , typename NumSelectedIteratorT >
CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t cub::DevicePartition::Flagged ( void *  d_temp_storage,
size_t &  temp_storage_bytes,
InputIteratorT  d_in,
FlagIterator  d_flags,
OutputIteratorT  d_out,
NumSelectedIteratorT  d_num_selected_out,
int  num_items,
cudaStream_t  stream = 0,
bool  debug_synchronous = false 
)
inlinestatic

Uses the d_flags sequence to split the corresponding items from d_in into a partitioned sequence d_out. The total number of items copied into the first partition is written to d_num_selected_out.

partition_flags_logo.png
.

  • The value type of d_flags must be castable to bool (e.g., bool, char, int, etc.).
  • Copies of the selected items are compacted into d_out and maintain their original relative ordering, however copies of the unselected items are compacted into the rear of d_out in reverse order.
  • When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
Snippet
The code snippet below illustrates the compaction of items selected from an int device vector.
#include <cub/cub.cuh> // or equivalently <cub/device/device_partition.cuh>
// Declare, allocate, and initialize device-accessible pointers for input, flags, and output
int num_items; // e.g., 8
int *d_in; // e.g., [1, 2, 3, 4, 5, 6, 7, 8]
char *d_flags; // e.g., [1, 0, 0, 1, 0, 1, 1, 0]
int *d_out; // e.g., [ , , , , , , , ]
int *d_num_selected_out; // e.g., [ ]
...
// Determine temporary device storage requirements
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DevicePartition::Flagged(d_temp_storage, temp_storage_bytes, d_in, d_flags, d_out, d_num_selected_out, num_items);
// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run selection
cub::DevicePartition::Flagged(d_temp_storage, temp_storage_bytes, d_in, d_flags, d_out, d_num_selected_out, num_items);
// d_out <-- [1, 4, 6, 7, 8, 5, 3, 2]
// d_num_selected_out <-- [4]
Template Parameters
InputIteratorT[inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
FlagIterator[inferred] Random-access input iterator type for reading selection flags (may be a simple pointer type)
OutputIteratorT[inferred] Random-access output iterator type for writing output items (may be a simple pointer type)
NumSelectedIteratorT[inferred] Output iterator type for recording the number of items selected (may be a simple pointer type)
Parameters
[in]d_temp_storageDevice-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
[in,out]temp_storage_bytesReference to size in bytes of d_temp_storage allocation
[in]d_inPointer to the input sequence of data items
[in]d_flagsPointer to the input sequence of selection flags
[out]d_outPointer to the output sequence of partitioned data items
[out]d_num_selected_outPointer to the output total number of items selected (i.e., the offset of the unselected partition)
[in]num_itemsTotal number of items to select from
[in]stream[optional] CUDA stream to launch kernels within. Default is stream0.
[in]debug_synchronous[optional] Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is false.
Examples:
example_device_partition_flagged.cu.
template<typename InputIteratorT , typename OutputIteratorT , typename NumSelectedIteratorT , typename SelectOp >
CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t cub::DevicePartition::If ( void *  d_temp_storage,
size_t &  temp_storage_bytes,
InputIteratorT  d_in,
OutputIteratorT  d_out,
NumSelectedIteratorT  d_num_selected_out,
int  num_items,
SelectOp  select_op,
cudaStream_t  stream = 0,
bool  debug_synchronous = false 
)
inlinestatic

Uses the select_op functor to split the corresponding items from d_in into a partitioned sequence d_out. The total number of items copied into the first partition is written to d_num_selected_out.

partition_logo.png
.

  • Copies of the selected items are compacted into d_out and maintain their original relative ordering, however copies of the unselected items are compacted into the rear of d_out in reverse order.
  • When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
Performance
The following charts illustrate saturated partition-if performance across different CUDA architectures for int32 and int64 items, respectively. Items are selected for the first partition with 50% probability.
partition_if_int32_50_percent.png
partition_if_int64_50_percent.png
The following charts are similar, but 5% selection probability for the first partition:
partition_if_int32_5_percent.png
partition_if_int64_5_percent.png
Snippet
The code snippet below illustrates the compaction of items selected from an int device vector.
#include <cub/cub.cuh> // or equivalently <cub/device/device_partition.cuh>
// Functor type for selecting values less than some criteria
struct LessThan
{
int compare;
CUB_RUNTIME_FUNCTION __forceinline__
LessThan(int compare) : compare(compare) {}
CUB_RUNTIME_FUNCTION __forceinline__
bool operator()(const int &a) const {
return (a < compare);
}
};
// Declare, allocate, and initialize device-accessible pointers for input and output
int num_items; // e.g., 8
int *d_in; // e.g., [0, 2, 3, 9, 5, 2, 81, 8]
int *d_out; // e.g., [ , , , , , , , ]
int *d_num_selected_out; // e.g., [ ]
LessThan select_op(7);
...
// Determine temporary device storage requirements
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceSelect::If(d_temp_storage, temp_storage_bytes, d_in, d_out, d_num_selected_out, num_items, select_op);
// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run selection
cub::DeviceSelect::If(d_temp_storage, temp_storage_bytes, d_in, d_out, d_num_selected_out, num_items, select_op);
// d_out <-- [0, 2, 3, 5, 2, 8, 81, 9]
// d_num_selected_out <-- [5]
Template Parameters
InputIteratorT[inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT[inferred] Random-access output iterator type for writing output items (may be a simple pointer type)
NumSelectedIteratorT[inferred] Output iterator type for recording the number of items selected (may be a simple pointer type)
SelectOp[inferred] Selection functor type having member bool operator()(const T &a)
Parameters
[in]d_temp_storageDevice-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
[in,out]temp_storage_bytesReference to size in bytes of d_temp_storage allocation
[in]d_inPointer to the input sequence of data items
[out]d_outPointer to the output sequence of partitioned data items
[out]d_num_selected_outPointer to the output total number of items selected (i.e., the offset of the unselected partition)
[in]num_itemsTotal number of items to select from
[in]select_opUnary selection operator
[in]stream[optional] CUDA stream to launch kernels within. Default is stream0.
[in]debug_synchronous[optional] Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is false.
Examples:
example_device_partition_if.cu.

The documentation for this struct was generated from the following file: