CUB
DeviceSegmentedReduce provides device-wide, parallel operations for computing a reduction across multiple sequences of data items residing within device-accessible memory.
Static Public Methods | |
template<typename InputIteratorT , typename OutputIteratorT , typename BeginOffsetIteratorT , typename EndOffsetIteratorT , typename ReductionOp , typename T > | |
static CUB_RUNTIME_FUNCTION cudaError_t | Reduce (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, ReductionOp reduction_op, T initial_value, cudaStream_t stream=0, bool debug_synchronous=false) |
Computes a device-wide segmented reduction using the specified binary reduction_op functor. | |
template<typename InputIteratorT , typename OutputIteratorT , typename BeginOffsetIteratorT , typename EndOffsetIteratorT > | |
static CUB_RUNTIME_FUNCTION cudaError_t | Sum (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, cudaStream_t stream=0, bool debug_synchronous=false) |
Computes a device-wide segmented sum using the addition ('+') operator. | |
template<typename InputIteratorT , typename OutputIteratorT , typename BeginOffsetIteratorT , typename EndOffsetIteratorT > | |
static CUB_RUNTIME_FUNCTION cudaError_t | Min (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, cudaStream_t stream=0, bool debug_synchronous=false) |
Computes a device-wide segmented minimum using the less-than ('<') operator. | |
template<typename InputIteratorT , typename OutputIteratorT , typename BeginOffsetIteratorT , typename EndOffsetIteratorT > | |
static CUB_RUNTIME_FUNCTION cudaError_t | ArgMin (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, cudaStream_t stream=0, bool debug_synchronous=false) |
Finds the first device-wide minimum in each segment using the less-than ('<') operator, also returning the in-segment index of that item. | |
template<typename InputIteratorT , typename OutputIteratorT , typename BeginOffsetIteratorT , typename EndOffsetIteratorT > | |
static CUB_RUNTIME_FUNCTION cudaError_t | Max (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, cudaStream_t stream=0, bool debug_synchronous=false) |
Computes a device-wide segmented maximum using the greater-than ('>') operator. | |
template<typename InputIteratorT , typename OutputIteratorT , typename BeginOffsetIteratorT , typename EndOffsetIteratorT > | |
static CUB_RUNTIME_FUNCTION cudaError_t | ArgMax (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, cudaStream_t stream=0, bool debug_synchronous=false) |
Finds the first device-wide maximum in each segment using the greater-than ('>') operator, also returning the in-segment index of that item. | |
Reduce() | inline static |
Computes a device-wide segmented reduction using the specified binary reduction_op functor.
For a contiguous sequence of segments, a single array segment_offsets (of length num_segments+1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets+1).
When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
InputIteratorT | [inferred] Random-access input iterator type for reading input items (may be a simple pointer type) |
OutputIteratorT | [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type) |
BeginOffsetIteratorT | [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type) |
EndOffsetIteratorT | [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type) |
ReductionOp | [inferred] Binary reduction functor type having member T operator()(const T &a, const T &b) |
T | [inferred] Data element type that is convertible to the value type of InputIteratorT |
[in] | d_temp_storage | Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done. |
[in,out] | temp_storage_bytes | Reference to size in bytes of d_temp_storage allocation |
[in] | d_in | Pointer to the input sequence of data items |
[out] | d_out | Pointer to the output aggregate |
[in] | num_segments | The number of segments that comprise the input data |
[in] | d_begin_offsets | Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the index of the first element of the ith data segment in d_in |
[in] | d_end_offsets | Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i]-1 is the index of the last element of the ith data segment in d_in. If d_end_offsets[i] <= d_begin_offsets[i], the ith segment is considered empty. |
[in] | reduction_op | Binary reduction functor |
[in] | initial_value | Initial value of the reduction for each segment |
[in] | stream | [optional] CUDA stream to launch kernels within. Default is stream 0. |
[in] | debug_synchronous | [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false. |
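The semantics described above can be modeled on the host. The following is a minimal CPU reference sketch, not CUB's implementation: segmented_reduce_reference, CustomMin and demo_custom_min are hypothetical names introduced for illustration. It assumes each segment covers the half-open range [d_begin_offsets[i], d_end_offsets[i]) and that an empty segment yields initial_value.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Host-side reference model (an assumption-based sketch, not CUB code):
// reduces each segment [begin_offsets[s], end_offsets[s]) of `in` with
// `reduction_op`, seeding every segment with `initial_value`.
template <typename T, typename ReductionOp>
std::vector<T> segmented_reduce_reference(const std::vector<T>& in,
                                          const int* begin_offsets,
                                          const int* end_offsets,
                                          int num_segments,
                                          ReductionOp reduction_op,
                                          T initial_value)
{
    std::vector<T> out(static_cast<std::size_t>(num_segments), initial_value);
    for (int s = 0; s < num_segments; ++s) {
        assert(begin_offsets[s] >= 0);
        assert(end_offsets[s] <= static_cast<int>(in.size()));
        T acc = initial_value;  // an empty segment keeps the initial value
        for (int i = begin_offsets[s]; i < end_offsets[s]; ++i) {
            acc = reduction_op(acc, in[i]);
        }
        out[static_cast<std::size_t>(s)] = acc;
    }
    return out;
}

// Hypothetical user-defined functor providing the member
// T operator()(const T &a, const T &b) expected of ReductionOp.
struct CustomMin {
    template <typename T>
    T operator()(const T& a, const T& b) const { return (b < a) ? b : a; }
};

// Three segments [0,3), [3,3) and [3,7), described by one offsets array
// that is aliased as both the begin and the end offsets.
inline std::vector<int> demo_custom_min()
{
    const std::vector<int> in = {8, 6, 7, 5, 3, 0, 9};
    const std::vector<int> segment_offsets = {0, 3, 3, 7};  // num_segments + 1
    const int num_segments = 3;
    return segmented_reduce_reference(in,
                                      segment_offsets.data(),      // begin
                                      segment_offsets.data() + 1,  // end
                                      num_segments,
                                      CustomMin(),
                                      std::numeric_limits<int>::max());
}
```

In this sketch the middle segment is empty, so its output is the initial value rather than an element of the input.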
Sum() | inline static |
Computes a device-wide segmented sum using the addition ('+') operator.
Uses 0 as the initial value of the reduction for each segment.
For a contiguous sequence of segments, a single array segment_offsets (of length num_segments+1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets+1).
Does not support + operators that are non-commutative.
When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
InputIteratorT | [inferred] Random-access input iterator type for reading input items (may be a simple pointer type) |
OutputIteratorT | [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type) |
BeginOffsetIteratorT | [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type) |
EndOffsetIteratorT | [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type) |
[in] | d_temp_storage | Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done. |
[in,out] | temp_storage_bytes | Reference to size in bytes of d_temp_storage allocation |
[in] | d_in | Pointer to the input sequence of data items |
[out] | d_out | Pointer to the output aggregate |
[in] | num_segments | The number of segments that comprise the input data |
[in] | d_begin_offsets | Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the index of the first element of the ith data segment in d_in |
[in] | d_end_offsets | Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i]-1 is the index of the last element of the ith data segment in d_in. If d_end_offsets[i] <= d_begin_offsets[i], the ith segment is considered empty. |
[in] | stream | [optional] CUDA stream to launch kernels within. Default is stream 0. |
[in] | debug_synchronous | [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false. |
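To make the segment_offsets aliasing concrete, the sketch below builds the shared offsets array from per-segment lengths and then passes it as both the begin and the end offsets. It is a host-side reference model under the same half-open-range assumption, not CUB's implementation; offsets_from_lengths, segmented_sum_reference and demo_segmented_sum are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Builds segment_offsets (length num_segments + 1) from per-segment lengths
// with an exclusive prefix sum. Hypothetical helper for illustration.
inline std::vector<int> offsets_from_lengths(const std::vector<int>& lengths)
{
    std::vector<int> offsets(lengths.size() + 1, 0);
    for (std::size_t s = 0; s < lengths.size(); ++s) {
        assert(lengths[s] >= 0);
        offsets[s + 1] = offsets[s] + lengths[s];
    }
    return offsets;
}

// Host-side reference model of a segmented sum (not CUB code).
// Every segment starts from 0, so an empty segment sums to 0.
inline std::vector<int> segmented_sum_reference(const std::vector<int>& in,
                                                const int* d_begin_offsets,
                                                const int* d_end_offsets,
                                                int num_segments)
{
    std::vector<int> out(static_cast<std::size_t>(num_segments), 0);
    for (int s = 0; s < num_segments; ++s) {
        int sum = 0;
        for (int i = d_begin_offsets[s]; i < d_end_offsets[s]; ++i) {
            sum += in[static_cast<std::size_t>(i)];
        }
        out[static_cast<std::size_t>(s)] = sum;
    }
    return out;
}

// Segment lengths 3, 0 and 4 give segment_offsets = {0, 3, 3, 7}.
inline std::vector<int> demo_segmented_sum()
{
    const std::vector<int> in = {8, 6, 7, 5, 3, 0, 9};
    const std::vector<int> segment_offsets = offsets_from_lengths({3, 0, 4});
    const int num_segments = static_cast<int>(segment_offsets.size()) - 1;
    return segmented_sum_reference(in,
                                   segment_offsets.data(),      // d_begin_offsets
                                   segment_offsets.data() + 1,  // d_end_offsets
                                   num_segments);
}
```

A single array is enough here because the end of segment i coincides with the start of segment i+1.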
Min() | inline static |
Computes a device-wide segmented minimum using the less-than ('<') operator.
Uses std::numeric_limits<T>::max() as the initial value of the reduction for each segment.
For a contiguous sequence of segments, a single array segment_offsets (of length num_segments+1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets+1).
Does not support < operators that are non-commutative.
When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
InputIteratorT | [inferred] Random-access input iterator type for reading input items (may be a simple pointer type) |
OutputIteratorT | [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type) |
BeginOffsetIteratorT | [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type) |
EndOffsetIteratorT | [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type) |
[in] | d_temp_storage | Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done. |
[in,out] | temp_storage_bytes | Reference to size in bytes of d_temp_storage allocation |
[in] | d_in | Pointer to the input sequence of data items |
[out] | d_out | Pointer to the output aggregate |
[in] | num_segments | The number of segments that comprise the input data |
[in] | d_begin_offsets | Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the index of the first element of the ith data segment in d_in |
[in] | d_end_offsets | Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i]-1 is the index of the last element of the ith data segment in d_in. If d_end_offsets[i] <= d_begin_offsets[i], the ith segment is considered empty. |
[in] | stream | [optional] CUDA stream to launch kernels within. Default is stream 0. |
[in] | debug_synchronous | [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false. |
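The begin and end offsets do not have to come from one aliased array. The host-side sketch below uses two separate offset arrays so that some input items belong to no segment; it assumes, as the separate d_begin_offsets and d_end_offsets parameters suggest, that segments need not be contiguous. It is a reference model only, and segmented_min_reference and demo_min_with_gaps are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Host-side reference model of a segmented minimum (not CUB code).
// Each output starts at std::numeric_limits<T>::max(), which is therefore
// the result for an empty segment.
template <typename T>
std::vector<T> segmented_min_reference(const std::vector<T>& in,
                                       const std::vector<int>& begin_offsets,
                                       const std::vector<int>& end_offsets)
{
    assert(begin_offsets.size() == end_offsets.size());
    std::vector<T> out(begin_offsets.size(), std::numeric_limits<T>::max());
    for (std::size_t s = 0; s < begin_offsets.size(); ++s) {
        for (int i = begin_offsets[s]; i < end_offsets[s]; ++i) {
            const T& item = in[static_cast<std::size_t>(i)];
            if (item < out[s]) {
                out[s] = item;
            }
        }
    }
    return out;
}

// Segments [0,2), [4,4) (empty) and [4,6).
// Items at indices 2, 3 and 6 are not part of any segment.
inline std::vector<int> demo_min_with_gaps()
{
    const std::vector<int> in    = {8, 6, 7, 5, 3, 0, 9};
    const std::vector<int> begin = {0, 4, 4};
    const std::vector<int> end   = {2, 4, 6};
    return segmented_min_reference(in, begin, end);
}
```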
ArgMin() | inline static |
Finds the first device-wide minimum in each segment using the less-than ('<') operator, also returning the in-segment index of that item.
The output value type of d_out is cub::KeyValuePair<int, T> (assuming the value type of d_in is T).
The minimum of the ith segment is written to d_out[i].value and its offset in that segment is written to d_out[i].key.
The {1, std::numeric_limits<T>::max()} tuple is produced for zero-length inputs.
For a contiguous sequence of segments, a single array segment_offsets (of length num_segments+1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets+1).
Does not support < operators that are non-commutative.
When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
InputIteratorT | [inferred] Random-access input iterator type for reading input items (of some type T) (may be a simple pointer type) |
OutputIteratorT | [inferred] Output iterator type for recording the reduced aggregate (having value type KeyValuePair<int, T> ) (may be a simple pointer type) |
BeginOffsetIteratorT | [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type) |
EndOffsetIteratorT | [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type) |
[in] | d_temp_storage | Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done. |
[in,out] | temp_storage_bytes | Reference to size in bytes of d_temp_storage allocation |
[in] | d_in | Pointer to the input sequence of data items |
[out] | d_out | Pointer to the output aggregate |
[in] | num_segments | The number of segments that comprise the input data |
[in] | d_begin_offsets | Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the index of the first element of the ith data segment in d_in |
[in] | d_end_offsets | Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i]-1 is the index of the last element of the ith data segment in d_in. If d_end_offsets[i] <= d_begin_offsets[i], the ith segment is considered empty. |
[in] | stream | [optional] CUDA stream to launch kernels within. Default is stream 0. |
[in] | debug_synchronous | [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false. |
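The key/value output convention can be illustrated with a small host-side model. This is a sketch under stated assumptions, not CUB's implementation: KeyValue is a hypothetical stand-in for cub::KeyValuePair<int, T>, and segmented_argmin_reference and demo_argmin are illustrative names. Ties are resolved in favor of the earliest item, matching the "first minimum" wording above.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical stand-in for cub::KeyValuePair<int, T>.
template <typename T>
struct KeyValue {
    int key;   // offset of the minimum within its segment
    T value;   // the minimum itself
};

// Host-side reference model of a segmented ArgMin (not CUB code).
// `offsets` has num_segments + 1 entries and is used for both the
// begin offsets (offsets[s]) and the end offsets (offsets[s + 1]).
template <typename T>
std::vector<KeyValue<T>> segmented_argmin_reference(const std::vector<T>& in,
                                                    const std::vector<int>& offsets)
{
    assert(!offsets.empty());
    const std::size_t num_segments = offsets.size() - 1;
    std::vector<KeyValue<T>> out(num_segments);
    for (std::size_t s = 0; s < num_segments; ++s) {
        // Documented result for a zero-length segment.
        KeyValue<T> best = {1, std::numeric_limits<T>::max()};
        for (int i = offsets[s]; i < offsets[s + 1]; ++i) {
            const T& item = in[static_cast<std::size_t>(i)];
            // Strict '<' keeps the first of several equal minima.
            if (i == offsets[s] || item < best.value) {
                best.key = i - offsets[s];
                best.value = item;
            }
        }
        out[s] = best;
    }
    return out;
}

// Segments [0,3), [3,3) and [3,7); both non-empty segments contain a tie.
inline std::vector<KeyValue<int>> demo_argmin()
{
    const std::vector<int> in      = {6, 8, 6, 5, 3, 3, 9};
    const std::vector<int> offsets = {0, 3, 3, 7};
    return segmented_argmin_reference(in, offsets);
}
```

Note that key is relative to the start of the segment, not an index into the whole input.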
Max() | inline static |
Computes a device-wide segmented maximum using the greater-than ('>') operator.
Uses std::numeric_limits<T>::lowest() as the initial value of the reduction.
For a contiguous sequence of segments, a single array segment_offsets (of length num_segments+1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets+1).
Does not support > operators that are non-commutative.
When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
InputIteratorT | [inferred] Random-access input iterator type for reading input items (may be a simple pointer type) |
OutputIteratorT | [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type) |
BeginOffsetIteratorT | [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type) |
EndOffsetIteratorT | [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type) |
[in] | d_temp_storage | Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done. |
[in,out] | temp_storage_bytes | Reference to size in bytes of d_temp_storage allocation |
[in] | d_in | Pointer to the input sequence of data items |
[out] | d_out | Pointer to the output aggregate |
[in] | num_segments | The number of segments that comprise the input data |
[in] | d_begin_offsets | Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the index of the first element of the ith data segment in d_in |
[in] | d_end_offsets | Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i]-1 is the index of the last element of the ith data segment in d_in. If d_end_offsets[i] <= d_begin_offsets[i], the ith segment is considered empty. |
[in] | stream | [optional] CUDA stream to launch kernels within. Default is stream 0. |
[in] | debug_synchronous | [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false. |
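The choice of std::numeric_limits<T>::lowest() as the seed matters for floating-point types, where std::numeric_limits<T>::min() is the smallest positive normal value rather than the most negative one. The host-side sketch below shows a segment of negative floats reducing correctly with the lowest() seed. It is a reference model, not CUB's implementation; segmented_max_reference and demo_max_negative_floats are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Host-side reference model of a segmented maximum (not CUB code).
// `offsets` has num_segments + 1 entries and serves as both the begin
// and the end offsets. Each output is seeded with lowest(), so that is
// what an empty segment produces.
template <typename T>
std::vector<T> segmented_max_reference(const std::vector<T>& in,
                                       const std::vector<int>& offsets)
{
    assert(!offsets.empty());
    std::vector<T> out(offsets.size() - 1, std::numeric_limits<T>::lowest());
    for (std::size_t s = 0; s + 1 < offsets.size(); ++s) {
        for (int i = offsets[s]; i < offsets[s + 1]; ++i) {
            const T& item = in[static_cast<std::size_t>(i)];
            if (item > out[s]) {
                out[s] = item;
            }
        }
    }
    return out;
}

// One segment of negative values followed by one empty segment.
// Seeding with numeric_limits<float>::min() (a small positive number)
// would wrongly report a positive maximum for the first segment.
inline std::vector<float> demo_max_negative_floats()
{
    const std::vector<float> in      = {-2.5f, -0.5f, -7.0f};
    const std::vector<int>   offsets = {0, 3, 3};
    return segmented_max_reference(in, offsets);
}
```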
ArgMax() | inline static |
Finds the first device-wide maximum in each segment using the greater-than ('>') operator, also returning the in-segment index of that item.
The output value type of d_out is cub::KeyValuePair<int, T> (assuming the value type of d_in is T).
The maximum of the ith segment is written to d_out[i].value and its offset in that segment is written to d_out[i].key.
The {1, std::numeric_limits<T>::lowest()} tuple is produced for zero-length inputs.
For a contiguous sequence of segments, a single array segment_offsets (of length num_segments+1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets+1).
Does not support > operators that are non-commutative.
When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
InputIteratorT | [inferred] Random-access input iterator type for reading input items (of some type T) (may be a simple pointer type) |
OutputIteratorT | [inferred] Output iterator type for recording the reduced aggregate (having value type KeyValuePair<int, T> ) (may be a simple pointer type) |
BeginOffsetIteratorT | [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type) |
EndOffsetIteratorT | [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type) |
[in] | d_temp_storage | Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done. |
[in,out] | temp_storage_bytes | Reference to size in bytes of d_temp_storage allocation |
[in] | d_in | Pointer to the input sequence of data items |
[out] | d_out | Pointer to the output aggregate |
[in] | num_segments | The number of segments that comprise the input data |
[in] | d_begin_offsets | Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the index of the first element of the ith data segment in d_in |
[in] | d_end_offsets | Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i]-1 is the index of the last element of the ith data segment in d_in. If d_end_offsets[i] <= d_begin_offsets[i], the ith segment is considered empty. |
[in] | stream | [optional] CUDA stream to launch kernels within. Default is stream 0. |
[in] | debug_synchronous | [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false. |
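Because d_out[i].key is an offset within segment i, callers that need a position in the whole input have to add the segment's beginning offset. The host-side sketch below models ArgMax and then performs that conversion. It is an illustration under stated assumptions, not CUB's implementation: ArgMaxResult is a hypothetical stand-in for cub::KeyValuePair<int, int>, the other function names are invented for the example, and the -1 marker for empty segments is a convention of this sketch only.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical stand-in for cub::KeyValuePair<int, int>.
struct ArgMaxResult {
    int key;    // in-segment index of the first maximum
    int value;  // the maximum itself
};

// Host-side reference model of a segmented ArgMax (not CUB code).
// `offsets` has num_segments + 1 entries (begin = offsets[s],
// end = offsets[s + 1]).
inline std::vector<ArgMaxResult> segmented_argmax_reference(
    const std::vector<int>& in, const std::vector<int>& offsets)
{
    assert(!offsets.empty());
    std::vector<ArgMaxResult> out(offsets.size() - 1);
    for (std::size_t s = 0; s + 1 < offsets.size(); ++s) {
        // Documented result for a zero-length segment.
        ArgMaxResult best = {1, std::numeric_limits<int>::lowest()};
        for (int i = offsets[s]; i < offsets[s + 1]; ++i) {
            const int item = in[static_cast<std::size_t>(i)];
            // Strict '>' keeps the first of several equal maxima.
            if (i == offsets[s] || item > best.value) {
                best.key = i - offsets[s];
                best.value = item;
            }
        }
        out[s] = best;
    }
    return out;
}

// Converts in-segment keys to indices into the whole input.
// Empty segments are marked with -1 (a convention of this sketch).
inline std::vector<int> global_argmax_indices(
    const std::vector<ArgMaxResult>& results, const std::vector<int>& offsets)
{
    assert(results.size() + 1 == offsets.size());
    std::vector<int> global(results.size(), -1);
    for (std::size_t s = 0; s < results.size(); ++s) {
        const bool empty = offsets[s + 1] <= offsets[s];
        if (!empty) {
            global[s] = offsets[s] + results[s].key;
        }
    }
    return global;
}

// Segments [0,3), [3,3) and [3,7); both non-empty segments contain a tie.
inline std::vector<int> demo_global_argmax()
{
    const std::vector<int> in      = {8, 6, 8, 5, 3, 9, 9};
    const std::vector<int> offsets = {0, 3, 3, 7};
    return global_argmax_indices(segmented_argmax_reference(in, offsets), offsets);
}
```

Checking the offsets for emptiness, rather than testing for key == 1, avoids confusing the zero-length result with a genuine maximum at in-segment index 1.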