CUB  
 All Classes Namespaces Files Functions Variables Typedefs Enumerations Enumerator Friends Macros Groups Pages
Static Public Methods | List of all members
cub::DevicePartition Struct Reference

Detailed description

DevicePartition provides device-wide, parallel operations for partitioning sequences of data items residing within device-accessible memory.

partition_logo.png
.
Overview
These operations apply a selection criterion to construct a partitioned output sequence from items selected/unselected from a specified input sequence.
Usage Considerations
  • Dynamic parallelism. DevicePartition methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported.
Performance
The work-complexity of partition as a function of input size is linear, resulting in performance throughput that plateaus with problem sizes large enough to saturate the GPU.
The following chart illustrates DevicePartition::If performance across different CUDA architectures for int32 items, where 50% of the items are randomly selected for the first partition. Performance plots for other scenarios can be found in the detailed method descriptions below.
partition_if_int32_50_percent.png

Static Public Methods

template<typename InputIteratorT , typename FlagIterator , typename OutputIteratorT , typename NumSelectedIteratorT >
CUB_RUNTIME_FUNCTION static
__forceinline__ cudaError_t 
Flagged (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, FlagIterator d_flags, OutputIteratorT d_out, NumSelectedIteratorT d_num_selected_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
 Uses the d_flags sequence to split the corresponding items from d_in into a partitioned sequence d_out. The total number of items copied into the first partition is written to d_num_selected_out.

partition_flags_logo.png
.
More...
 
template<typename InputIteratorT , typename OutputIteratorT , typename NumSelectedIteratorT , typename SelectOp >
CUB_RUNTIME_FUNCTION static
__forceinline__ cudaError_t 
If (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, NumSelectedIteratorT d_num_selected_out, int num_items, SelectOp select_op, cudaStream_t stream=0, bool debug_synchronous=false)
 Uses the select_op functor to split the corresponding items from d_in into a partitioned sequence d_out. The total number of items copied into the first partition is written to d_num_selected_out.

partition_logo.png
.
More...
 
template<typename InputIteratorT , typename FirstOutputIteratorT , typename SecondOutputIteratorT , typename UnselectedOutputIteratorT , typename NumSelectedIteratorT , typename SelectFirstPartOp , typename SelectSecondPartOp >
CUB_RUNTIME_FUNCTION static
__forceinline__ cudaError_t 
If (void *d_temp_storage, std::size_t &temp_storage_bytes, InputIteratorT d_in, FirstOutputIteratorT d_first_part_out, SecondOutputIteratorT d_second_part_out, UnselectedOutputIteratorT d_unselected_out, NumSelectedIteratorT d_num_selected_out, int num_items, SelectFirstPartOp select_first_part_op, SelectSecondPartOp select_second_part_op, cudaStream_t stream=0, bool debug_synchronous=false)
 Uses two functors to split the corresponding items from d_in into a three partitioned sequences d_first_part_out d_second_part_out and d_unselected_out. The total number of items copied into the first partition is written to d_num_selected_out[0], while the total number of items copied into the second partition is written to d_num_selected_out[1]. More...
 

Member Function Documentation

template<typename InputIteratorT , typename FlagIterator , typename OutputIteratorT , typename NumSelectedIteratorT >
CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t cub::DevicePartition::Flagged ( void *  d_temp_storage,
size_t &  temp_storage_bytes,
InputIteratorT  d_in,
FlagIterator  d_flags,
OutputIteratorT  d_out,
NumSelectedIteratorT  d_num_selected_out,
int  num_items,
cudaStream_t  stream = 0,
bool  debug_synchronous = false 
)
inlinestatic

Uses the d_flags sequence to split the corresponding items from d_in into a partitioned sequence d_out. The total number of items copied into the first partition is written to d_num_selected_out.

partition_flags_logo.png
.

  • The value type of d_flags must be castable to bool (e.g., bool, char, int, etc.).
  • Copies of the selected items are compacted into d_out and maintain their original relative ordering, however copies of the unselected items are compacted into the rear of d_out in reverse order.
  • When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
Snippet
The code snippet below illustrates the compaction of items selected from an int device vector.
#include <cub/cub.cuh>
// or equivalently <cub/device/device_partition.cuh>
// Declare, allocate, and initialize device-accessible pointers for
// input, flags, and output
int num_items; // e.g., 8
int *d_in; // e.g., [1, 2, 3, 4, 5, 6, 7, 8]
char *d_flags; // e.g., [1, 0, 0, 1, 0, 1, 1, 0]
int *d_out; // e.g., [ , , , , , , , ]
int *d_num_selected_out; // e.g., [ ]
...
// Determine temporary device storage requirements
void *d_temp_storage = nullptr;
std::size_t temp_storage_bytes = 0;
d_temp_storage, temp_storage_bytes,
d_in, d_flags, d_out, d_num_selected_out, num_items);
// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run selection
d_temp_storage, temp_storage_bytes,
d_in, d_flags, d_out, d_num_selected_out, num_items);
// d_out <-- [1, 4, 6, 7, 8, 5, 3, 2]
// d_num_selected_out <-- [4]
Template Parameters
InputIteratorT**[inferred]** Random-access input iterator type for reading input items (may be a simple pointer type)
FlagIterator**[inferred]** Random-access input iterator type for reading selection flags (may be a simple pointer type)
OutputIteratorT**[inferred]** Random-access output iterator type for writing output items (may be a simple pointer type)
NumSelectedIteratorT**[inferred]** Output iterator type for recording the number of items selected (may be a simple pointer type)
Parameters
[in]d_temp_storageDevice-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.
[in,out]temp_storage_bytesReference to size in bytes of d_temp_storage allocation
[in]d_inPointer to the input sequence of data items
[in]d_flagsPointer to the input sequence of selection flags
[out]d_outPointer to the output sequence of partitioned data items
[out]d_num_selected_outPointer to the output total number of items selected (i.e., the offset of the unselected partition)
[in]num_itemsTotal number of items to select from
[in]stream**[optional]** CUDA stream to launch kernels within. Default is stream0.
[in]debug_synchronous**[optional]** Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is false.
Examples:
example_device_partition_flagged.cu.
template<typename InputIteratorT , typename OutputIteratorT , typename NumSelectedIteratorT , typename SelectOp >
CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t cub::DevicePartition::If ( void *  d_temp_storage,
size_t &  temp_storage_bytes,
InputIteratorT  d_in,
OutputIteratorT  d_out,
NumSelectedIteratorT  d_num_selected_out,
int  num_items,
SelectOp  select_op,
cudaStream_t  stream = 0,
bool  debug_synchronous = false 
)
inlinestatic

Uses the select_op functor to split the corresponding items from d_in into a partitioned sequence d_out. The total number of items copied into the first partition is written to d_num_selected_out.

partition_logo.png
.

  • Copies of the selected items are compacted into d_out and maintain their original relative ordering, however copies of the unselected items are compacted into the rear of d_out in reverse order.
  • When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
Performance
The following charts illustrate saturated partition-if performance across different CUDA architectures for int32 and int64 items, respectively. Items are selected for the first partition with 50% probability.
partition_if_int32_50_percent.png
partition_if_int64_50_percent.png
The following charts are similar, but 5% selection probability for the first partition:
partition_if_int32_5_percent.png
partition_if_int64_5_percent.png
Snippet
The code snippet below illustrates the compaction of items selected from an int device vector.
#include <cub/cub.cuh>
// or equivalently <cub/device/device_partition.cuh>
// Functor type for selecting values less than some criteria
struct LessThan
{
int compare;
CUB_RUNTIME_FUNCTION __forceinline__
explicit LessThan(int compare) : compare(compare) {}
CUB_RUNTIME_FUNCTION __forceinline__
bool operator()(const int &a) const
{
return (a < compare);
}
};
// Declare, allocate, and initialize device-accessible pointers for
// input and output
int num_items; // e.g., 8
int *d_in; // e.g., [0, 2, 3, 9, 5, 2, 81, 8]
int *d_out; // e.g., [ , , , , , , , ]
int *d_num_selected_out; // e.g., [ ]
LessThan select_op(7);
...
// Determine temporary device storage requirements
void *d_temp_storage = nullptr;
std::size_t temp_storage_bytes = 0;
d_temp_storage, temp_storage_bytes,
d_in, d_out, d_num_selected_out, num_items, select_op);
// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run selection
d_temp_storage, temp_storage_bytes,
d_in, d_out, d_num_selected_out, num_items, select_op);
// d_out <-- [0, 2, 3, 5, 2, 8, 81, 9]
// d_num_selected_out <-- [5]
Template Parameters
InputIteratorT**[inferred]** Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT**[inferred]** Random-access output iterator type for writing output items (may be a simple pointer type)
NumSelectedIteratorT**[inferred]** Output iterator type for recording the number of items selected (may be a simple pointer type)
SelectOp**[inferred]** Selection functor type having member bool operator()(const T &a)
Parameters
[in]d_temp_storageDevice-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.
[in,out]temp_storage_bytesReference to size in bytes of d_temp_storage allocation
[in]d_inPointer to the input sequence of data items
[out]d_outPointer to the output sequence of partitioned data items
[out]d_num_selected_outPointer to the output total number of items selected (i.e., the offset of the unselected partition)
[in]num_itemsTotal number of items to select from
[in]select_opUnary selection operator
[in]stream**[optional]** CUDA stream to launch kernels within. Default is stream0.
[in]debug_synchronous**[optional]** Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is false.
Examples:
example_device_partition_if.cu.
template<typename InputIteratorT , typename FirstOutputIteratorT , typename SecondOutputIteratorT , typename UnselectedOutputIteratorT , typename NumSelectedIteratorT , typename SelectFirstPartOp , typename SelectSecondPartOp >
CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t cub::DevicePartition::If ( void *  d_temp_storage,
std::size_t &  temp_storage_bytes,
InputIteratorT  d_in,
FirstOutputIteratorT  d_first_part_out,
SecondOutputIteratorT  d_second_part_out,
UnselectedOutputIteratorT  d_unselected_out,
NumSelectedIteratorT  d_num_selected_out,
int  num_items,
SelectFirstPartOp  select_first_part_op,
SelectSecondPartOp  select_second_part_op,
cudaStream_t  stream = 0,
bool  debug_synchronous = false 
)
inlinestatic

Uses two functors to split the corresponding items from d_in into a three partitioned sequences d_first_part_out d_second_part_out and d_unselected_out. The total number of items copied into the first partition is written to d_num_selected_out[0], while the total number of items copied into the second partition is written to d_num_selected_out[1].

  • Copies of the items selected by select_first_part_op are compacted into d_first_part_out and maintain their original relative ordering.
  • Copies of the items selected by select_second_part_op are compacted into d_second_part_out and maintain their original relative ordering.
  • Copies of the unselected items are compacted into the d_unselected_out in reverse order.
Snippet
The code snippet below illustrates how this algorithm can partition an input vector into small, medium, and large items so that the relative order of items remain deterministic.

Let's consider any value that doesn't exceed six a small one. On the other hand, any value that exceeds 50 will be considered a large one. Since the value used to define a small part doesn't match one that defines the large part, the intermediate segment is implied.

These definitions partition a value space into three categories. We want to preserve the order of items in which they appear in the input vector. Since the algorithm provides stable partitioning, this is possible.

Since the number of items in each category is unknown beforehand, we need three output arrays of num_items elements each. To reduce the memory requirements, we can combine the output storage for two categories.

Since each value falls precisely in one category, it's safe to add "large" values into the head of the shared output vector and the "middle" values into its tail. To add items into the tail of the output array, we can use thrust::reverse_iterator.

#include <cub/cub.cuh>
// or equivalently <cub/device/device_partition.cuh>
// Functor type for selecting values less than some criteria
struct LessThan
{
int compare;
CUB_RUNTIME_FUNCTION __forceinline__
explicit LessThan(int compare) : compare(compare) {}
CUB_RUNTIME_FUNCTION __forceinline__
bool operator()(const int &a) const
{
return a < compare;
}
};
// Functor type for selecting values greater than some criteria
struct GreaterThan
{
int compare;
CUB_RUNTIME_FUNCTION __forceinline__
explicit GreaterThan(int compare) : compare(compare) {}
CUB_RUNTIME_FUNCTION __forceinline__
bool operator()(const int &a) const
{
return a > compare;
}
};
// Declare, allocate, and initialize device-accessible pointers for
// input and output
int num_items; // e.g., 8
int *d_in; // e.g., [0, 2, 3, 9, 5, 2, 81, 8]
int *d_large_and_unselected_out; // e.g., [ , , , , , , , ]
int *d_small_out; // e.g., [ , , , , , , , ]
int *d_num_selected_out; // e.g., [ , ]
thrust::reverse_iterator<T> unselected_out(d_large_and_unselected_out + num_items);
LessThan small_items_selector(7);
GreaterThan large_items_selector(50);
...
// Determine temporary device storage requirements
void *d_temp_storage = nullptr;
std::size_t temp_storage_bytes = 0;
d_temp_storage, temp_storage_bytes,
d_in, d_large_and_medium_out, d_small_out, unselected_out,
d_num_selected_out, num_items,
large_items_selector, small_items_selector);
// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run selection
d_temp_storage, temp_storage_bytes,
d_in, d_large_and_medium_out, d_small_out, unselected_out,
d_num_selected_out, num_items,
large_items_selector, small_items_selector);
// d_large_and_unselected_out <-- [ 81, , , , , , 8, 9 ]
// d_small_out <-- [ 0, 2, 3, 5, 2, , , ]
// d_num_selected_out <-- [ 1, 5 ]
Template Parameters
InputIteratorT**[inferred]** Random-access input iterator type for reading input items (may be a simple pointer type)
FirstOutputIteratorT**[inferred]** Random-access output iterator type for writing output items selected by first operator (may be a simple pointer type)
SecondOutputIteratorT**[inferred]** Random-access output iterator type for writing output items selected by second operator (may be a simple pointer type)
UnselectedOutputIteratorT**[inferred]** Random-access output iterator type for writing unselected items (may be a simple pointer type)
NumSelectedIteratorT**[inferred]** Output iterator type for recording the number of items selected (may be a simple pointer type)
SelectFirstPartOp**[inferred]** Selection functor type having member bool operator()(const T &a)
SelectSecondPartOp**[inferred]** Selection functor type having member bool operator()(const T &a)
Parameters
[in]d_temp_storageDevice-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.
[in,out]temp_storage_bytesReference to size in bytes of d_temp_storage allocation
[in]d_inPointer to the input sequence of data items
[out]d_first_part_outPointer to the output sequence of data items selected by select_first_part_op
[out]d_second_part_outPointer to the output sequence of data items selected by select_second_part_op
[out]d_unselected_outPointer to the output sequence of unselected data items
[out]d_num_selected_outPointer to the output array with two elements, where total number of items selected by select_first_part_op is stored as d_num_selected_out[0] and total number of items selected by select_second_part_op is stored as d_num_selected_out[1], respectively
[in]num_itemsTotal number of items to select from
[in]select_first_part_opUnary selection operator to select d_first_part_out
[in]select_second_part_opUnary selection operator to select d_second_part_out
[in]stream**[optional]** CUDA stream to launch kernels within. Default is stream0.
[in]debug_synchronous**[optional]** Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is false.

The documentation for this struct was generated from the following file: