CUB
The BlockScan class provides collective methods for computing a parallel prefix sum/scan of items partitioned across a CUDA thread block.
T | Data type being scanned |
BLOCK_DIM_X | The thread block length in threads along the X dimension |
ALGORITHM | [optional] cub::BlockScanAlgorithm enumerator specifying the underlying algorithm to use (default: cub::BLOCK_SCAN_RAKING) |
BLOCK_DIM_Y | [optional] The thread block length in threads along the Y dimension (default: 1) |
BLOCK_DIM_Z | [optional] The thread block length in threads along the Z dimension (default: 1) |
PTX_ARCH | [optional] The PTX compute capability for which to specialize this collective, formatted as per the __CUDA_ARCH__ macro (e.g., 350 for sm_35). Useful for determining the collective's storage requirements for a given device from the host. (Default: the value of __CUDA_ARCH__ during the current compiler pass) |
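Efficiency generally increases with granularity ITEMS_PER_THREAD. Performance also typically increases until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
Special instructions are used when applicable (e.g., warp SHFL).
Suppose 512 integer items are partitioned in a blocked arrangement across 128 threads, each thread owning 4 consecutive items, and the set of input thread_data across the block of threads is {[1,1,1,1], [1,1,1,1], ..., [1,1,1,1]}. After an exclusive prefix sum, the corresponding output thread_data in those threads will be {[0,1,2,3], [4,5,6,7], ..., [508,509,510,511]}.
This example can be easily adapted to the storage required by BlockScan.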
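The 512-item exclusive prefix sum described above can be sketched as follows. This is a minimal illustration, assuming `<cub/cub.cuh>` is on the include path and the kernel is launched as a single block of 128 threads. The names `BlockPrefixSumKernel` and `RunBlockPrefixSum` are hypothetical, and pairing the scan with cub::BlockLoad and cub::BlockStore through a union is one possible arrangement rather than a required one.

```cuda
#include <cub/cub.cuh>
#include <cassert>
#include <vector>

// Hypothetical kernel: one block of 128 threads scans 512 ints in place.
__global__ void BlockPrefixSumKernel(int *d_data)
{
    // Specialize the collectives for 128 threads owning 4 ints each
    typedef cub::BlockLoad<int, 128, 4, cub::BLOCK_LOAD_TRANSPOSE>   BlockLoad;
    typedef cub::BlockStore<int, 128, 4, cub::BLOCK_STORE_TRANSPOSE> BlockStore;
    typedef cub::BlockScan<int, 128>                                 BlockScan;

    // The collectives run one after another, so their storage can be union'd
    __shared__ union TempStorage
    {
        typename BlockLoad::TempStorage  load;
        typename BlockScan::TempStorage  scan;
        typename BlockStore::TempStorage store;
    } temp_storage;

    // Gather a blocked arrangement: thread i receives items 4i .. 4i+3
    int thread_data[4];
    BlockLoad(temp_storage.load).Load(d_data, thread_data);
    __syncthreads();   // barrier before the shared memory is repurposed

    // Collectively compute the block-wide exclusive prefix sum
    BlockScan(temp_storage.scan).ExclusiveSum(thread_data, thread_data);
    __syncthreads();

    BlockStore(temp_storage.store).Store(d_data, thread_data);
}

// Hypothetical host helper: scans 512 ones and returns the result.
std::vector<int> RunBlockPrefixSum()
{
    std::vector<int> h_data(512, 1);
    const size_t bytes = h_data.size() * sizeof(int);
    int *d_data = nullptr;
    cudaMalloc(&d_data, bytes);
    cudaMemcpy(d_data, h_data.data(), bytes, cudaMemcpyHostToDevice);
    BlockPrefixSumKernel<<<1, 128>>>(d_data);
    cudaError_t err = cudaMemcpy(h_data.data(), d_data, bytes, cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);   // sketch-level error handling only
    (void)err;
    cudaFree(d_data);
    return h_data;
}
```

The barriers between the three collectives are there because the union makes them share the same bytes of shared memory.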
Classes | |
struct | TempStorage |
The operations exposed by BlockScan require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse. More... | |
Public Methods | |
Collective constructors | |
__device__ __forceinline__ | BlockScan () |
Collective constructor using a private static allocation of shared memory as temporary storage. More... | |
__device__ __forceinline__ | BlockScan (TempStorage &temp_storage) |
Collective constructor using the specified memory allocation as temporary storage. More... | |
Exclusive prefix sum operations | |
__device__ __forceinline__ void | ExclusiveSum (T input, T &output) |
Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. The value of 0 is applied as the initial value, and is assigned to output in thread0. More... | |
__device__ __forceinline__ void | ExclusiveSum (T input, T &output, T &block_aggregate) |
Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. The value of 0 is applied as the initial value, and is assigned to output in thread0. Also provides every thread with the block-wide block_aggregate of all inputs. More... | |
template<typename BlockPrefixCallbackOp > | |
__device__ __forceinline__ void | ExclusiveSum (T input, T &output, BlockPrefixCallbackOp &block_prefix_callback_op) |
Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Instead of using 0 as the block-wide prefix, the call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the thread block's scan inputs. The block-wide aggregate of all inputs is passed to the functor as block_aggregate. More... | |
Exclusive prefix sum operations (multiple data per thread) | |
template<int ITEMS_PER_THREAD> | |
__device__ __forceinline__ void | ExclusiveSum (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD]) |
Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. The value of 0 is applied as the initial value, and is assigned to output [0] in thread0. More... | |
template<int ITEMS_PER_THREAD> | |
__device__ __forceinline__ void | ExclusiveSum (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], T &block_aggregate) |
Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. The value of 0 is applied as the initial value, and is assigned to output [0] in thread0. Also provides every thread with the block-wide block_aggregate of all inputs. More... | |
template<int ITEMS_PER_THREAD, typename BlockPrefixCallbackOp > | |
__device__ __forceinline__ void | ExclusiveSum (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], BlockPrefixCallbackOp &block_prefix_callback_op) |
Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. Instead of using 0 as the block-wide prefix, the call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the thread block's scan inputs. The block-wide aggregate of all inputs is passed to the functor as block_aggregate. More... | |
Exclusive prefix scan operations | |
template<typename ScanOp > | |
__device__ __forceinline__ void | ExclusiveScan (T input, T &output, T initial_value, ScanOp scan_op) |
Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. More... | |
template<typename ScanOp > | |
__device__ __forceinline__ void | ExclusiveScan (T input, T &output, T initial_value, ScanOp scan_op, T &block_aggregate) |
Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. Also provides every thread with the block-wide block_aggregate of all inputs. More... | |
template<typename ScanOp , typename BlockPrefixCallbackOp > | |
__device__ __forceinline__ void | ExclusiveScan (T input, T &output, ScanOp scan_op, BlockPrefixCallbackOp &block_prefix_callback_op) |
Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. The call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the thread block's scan inputs. The block-wide aggregate of all inputs is passed to the functor as block_aggregate. More... | |
Exclusive prefix scan operations (multiple data per thread) | |
template<int ITEMS_PER_THREAD, typename ScanOp > | |
__device__ __forceinline__ void | ExclusiveScan (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], T initial_value, ScanOp scan_op) |
Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. More... | |
template<int ITEMS_PER_THREAD, typename ScanOp > | |
__device__ __forceinline__ void | ExclusiveScan (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], T initial_value, ScanOp scan_op, T &block_aggregate) |
Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide block_aggregate of all inputs. More... | |
template<int ITEMS_PER_THREAD, typename ScanOp , typename BlockPrefixCallbackOp > | |
__device__ __forceinline__ void | ExclusiveScan (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], ScanOp scan_op, BlockPrefixCallbackOp &block_prefix_callback_op) |
Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. The call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the thread block's scan inputs. The block-wide aggregate of all inputs is passed to the functor as block_aggregate. More... | |
Inclusive prefix sum operations | |
__device__ __forceinline__ void | InclusiveSum (T input, T &output) |
Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. More... | |
__device__ __forceinline__ void | InclusiveSum (T input, T &output, T &block_aggregate) |
Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Also provides every thread with the block-wide block_aggregate of all inputs. More... | |
template<typename BlockPrefixCallbackOp > | |
__device__ __forceinline__ void | InclusiveSum (T input, T &output, BlockPrefixCallbackOp &block_prefix_callback_op) |
Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Instead of using 0 as the block-wide prefix, the call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the thread block's scan inputs. The block-wide aggregate of all inputs is passed to the functor as block_aggregate. More... | |
Inclusive prefix sum operations (multiple data per thread) | |
template<int ITEMS_PER_THREAD> | |
__device__ __forceinline__ void | InclusiveSum (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD]) |
Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. More... | |
template<int ITEMS_PER_THREAD> | |
__device__ __forceinline__ void | InclusiveSum (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], T &block_aggregate) |
Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide block_aggregate of all inputs. More... | |
template<int ITEMS_PER_THREAD, typename BlockPrefixCallbackOp > | |
__device__ __forceinline__ void | InclusiveSum (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], BlockPrefixCallbackOp &block_prefix_callback_op) |
Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. Instead of using 0 as the block-wide prefix, the call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the thread block's scan inputs. The block-wide aggregate of all inputs is passed to the functor as block_aggregate. More... | |
Inclusive prefix scan operations | |
template<typename ScanOp > | |
__device__ __forceinline__ void | InclusiveScan (T input, T &output, ScanOp scan_op) |
Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. More... | |
template<typename ScanOp > | |
__device__ __forceinline__ void | InclusiveScan (T input, T &output, ScanOp scan_op, T &block_aggregate) |
Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. Also provides every thread with the block-wide block_aggregate of all inputs. More... | |
template<typename ScanOp , typename BlockPrefixCallbackOp > | |
__device__ __forceinline__ void | InclusiveScan (T input, T &output, ScanOp scan_op, BlockPrefixCallbackOp &block_prefix_callback_op) |
Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. The call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the thread block's scan inputs. The block-wide aggregate of all inputs is passed to the functor as block_aggregate. More... | |
Inclusive prefix scan operations (multiple data per thread) | |
template<int ITEMS_PER_THREAD, typename ScanOp > | |
__device__ __forceinline__ void | InclusiveScan (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], ScanOp scan_op) |
Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. More... | |
template<int ITEMS_PER_THREAD, typename ScanOp > | |
__device__ __forceinline__ void | InclusiveScan (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], ScanOp scan_op, T &block_aggregate) |
Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide block_aggregate of all inputs. More... | |
template<int ITEMS_PER_THREAD, typename ScanOp , typename BlockPrefixCallbackOp > | |
__device__ __forceinline__ void | InclusiveScan (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], ScanOp scan_op, BlockPrefixCallbackOp &block_prefix_callback_op) |
Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. The call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the thread block's scan inputs. The block-wide aggregate of all inputs is passed to the functor as block_aggregate. More... | |
__device__ __forceinline__ BlockScan () | inline |
Collective constructor using a private static allocation of shared memory as temporary storage.
__device__ __forceinline__ BlockScan (TempStorage &temp_storage) | inline |
Collective constructor using the specified memory allocation as temporary storage.
[in] | temp_storage | Reference to memory allocation having layout type TempStorage |
__device__ __forceinline__ void ExclusiveSum (T input, T &output) | inline |
Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. The value of 0 is applied as the initial value, and is assigned to output in thread0.
This operation assumes that the value obtained by T's default constructor (or by zero-initialization if no user-defined default constructor exists) is suitable as the identity value zero for addition.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose 128 integer items are partitioned across 128 threads and the set of input thread_data across the block of threads is 1, 1, ..., 1. The corresponding output thread_data in those threads will be 0, 1, ..., 127.
[in] | input | Calling thread's input item |
[out] | output | Calling thread's output item (may be aliased to input ) |
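The 128-thread scenario above can be sketched as below. This is an illustration only, assuming `<cub/cub.cuh>` is available and the kernel is launched with exactly one block of 128 threads. `ExclusiveSumKernel` and `RunExclusiveSum` are hypothetical names.

```cuda
#include <cub/cub.cuh>
#include <cassert>
#include <vector>

// Hypothetical kernel name; expects a launch of exactly one block of 128 threads.
__global__ void ExclusiveSumKernel(int *d_data)
{
    // Specialize BlockScan for a 1D block of 128 threads on type int
    typedef cub::BlockScan<int, 128> BlockScan;

    // Shared memory used by the collective for thread communication
    __shared__ typename BlockScan::TempStorage temp_storage;

    // One item per thread
    int thread_data = d_data[threadIdx.x];

    // Collectively compute the exclusive prefix sum (output aliases input)
    BlockScan(temp_storage).ExclusiveSum(thread_data, thread_data);

    d_data[threadIdx.x] = thread_data;
}

// Hypothetical host helper: scans 128 ones and returns the result.
std::vector<int> RunExclusiveSum()
{
    std::vector<int> h_data(128, 1);
    const size_t bytes = h_data.size() * sizeof(int);
    int *d_data = nullptr;
    cudaMalloc(&d_data, bytes);
    cudaMemcpy(d_data, h_data.data(), bytes, cudaMemcpyHostToDevice);
    ExclusiveSumKernel<<<1, 128>>>(d_data);
    cudaError_t err = cudaMemcpy(h_data.data(), d_data, bytes, cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);   // sketch-level error handling only
    (void)err;
    cudaFree(d_data);
    return h_data;
}
```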
__device__ __forceinline__ void ExclusiveSum (T input, T &output, T &block_aggregate) | inline |
Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. The value of 0 is applied as the initial value, and is assigned to output in thread0. Also provides every thread with the block-wide block_aggregate of all inputs.
This operation assumes that the value obtained by T's default constructor (or by zero-initialization if no user-defined default constructor exists) is suitable as the identity value zero for addition.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose 128 integer items are partitioned across 128 threads and the set of input thread_data across the block of threads is 1, 1, ..., 1. The corresponding output thread_data in those threads will be 0, 1, ..., 127. Furthermore, the value 128 will be stored in block_aggregate for all threads.
[in] | input | Calling thread's input item |
[out] | output | Calling thread's output item (may be aliased to input ) |
[out] | block_aggregate | block-wide aggregate reduction of input items |
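A possible sketch of the aggregate-returning overload follows, again assuming `<cub/cub.cuh>` and a single block of 128 threads. The names `ExclusiveSumAggregateKernel` and `RunExclusiveSumAggregate`, and the convention of writing the aggregate into an extra slot 128 of the buffer, are choices made for this illustration.

```cuda
#include <cub/cub.cuh>
#include <cassert>
#include <vector>

// Hypothetical kernel: d_data holds 128 inputs plus one extra slot for the aggregate.
__global__ void ExclusiveSumAggregateKernel(int *d_data)
{
    typedef cub::BlockScan<int, 128> BlockScan;
    __shared__ typename BlockScan::TempStorage temp_storage;

    int thread_data = d_data[threadIdx.x];

    // Exclusive prefix sum; block_aggregate receives the sum of all 128 inputs
    int block_aggregate;
    BlockScan(temp_storage).ExclusiveSum(thread_data, thread_data, block_aggregate);

    d_data[threadIdx.x] = thread_data;

    // Every thread holds the aggregate; the last thread records it in slot 128
    if (threadIdx.x == 127) d_data[128] = block_aggregate;
}

// Hypothetical host helper: returns 128 scan outputs followed by the aggregate.
std::vector<int> RunExclusiveSumAggregate()
{
    std::vector<int> h_data(129, 1);
    const size_t bytes = h_data.size() * sizeof(int);
    int *d_data = nullptr;
    cudaMalloc(&d_data, bytes);
    cudaMemcpy(d_data, h_data.data(), bytes, cudaMemcpyHostToDevice);
    ExclusiveSumAggregateKernel<<<1, 128>>>(d_data);
    cudaError_t err = cudaMemcpy(h_data.data(), d_data, bytes, cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);   // sketch-level error handling only
    (void)err;
    cudaFree(d_data);
    return h_data;
}
```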
template<typename BlockPrefixCallbackOp >
__device__ __forceinline__ void ExclusiveSum (T input, T &output, BlockPrefixCallbackOp &block_prefix_callback_op) | inline |
Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Instead of using 0 as the block-wide prefix, the call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the thread block's scan inputs. The block-wide aggregate of all inputs is passed to the functor as block_aggregate.
This operation assumes that the value obtained by T's default constructor (or by zero-initialization if no user-defined default constructor exists) is suitable as the identity value zero for addition.
The block_prefix_callback_op functor must implement a member function T operator()(T block_aggregate). The functor's input parameter block_aggregate is the block-wide aggregate of all inputs. The functor will be invoked by the first warp of threads in the block; however, only the return value from lane0 is applied as the block-wide prefix. The functor can be stateful.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose a single thread block of 128 threads scans d_data in consecutive segments of 128 integer items, using the functor to carry a running total from one segment to the next, and the input d_data is 1, 1, 1, 1, 1, 1, 1, 1, .... The corresponding output for the first segment will be 0, 1, ..., 127. The output for the second segment will be 128, 129, ..., 255.
BlockPrefixCallbackOp | [inferred] Call-back functor type having member T operator()(T block_aggregate) |
[in] | input | Calling thread's input item |
[out] | output | Calling thread's output item (may be aliased to input ) |
[in,out] | block_prefix_callback_op | [warp0 only] Call-back functor for specifying a block-wide prefix to be applied to the logical input sequence. |
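The segmented scenario above might be written as in the following sketch. It assumes `<cub/cub.cuh>`, a single block of 128 threads, and an input length that is a multiple of 128. The functor `RunningPrefixOp`, the kernel `TiledExclusiveSumKernel`, and the helper `RunTiledExclusiveSum` are hypothetical names.

```cuda
#include <cub/cub.cuh>
#include <cassert>
#include <vector>

// Hypothetical stateful functor that carries a running total between tiles.
struct RunningPrefixOp
{
    int running_total;
    __device__ RunningPrefixOp(int initial) : running_total(initial) {}

    // Invoked by the first warp; lane0's return value seeds the block-wide scan
    __device__ int operator()(int block_aggregate)
    {
        int old_prefix = running_total;
        running_total += block_aggregate;
        return old_prefix;
    }
};

// Hypothetical kernel: one block of 128 threads scans num_items ints tile by tile.
__global__ void TiledExclusiveSumKernel(int *d_data, int num_items)
{
    typedef cub::BlockScan<int, 128> BlockScan;
    __shared__ typename BlockScan::TempStorage temp_storage;

    RunningPrefixOp prefix_op(0);

    // num_items is assumed to be a multiple of 128
    for (int block_offset = 0; block_offset < num_items; block_offset += 128)
    {
        int thread_data = d_data[block_offset + threadIdx.x];

        BlockScan(temp_storage).ExclusiveSum(thread_data, thread_data, prefix_op);
        __syncthreads();   // temp_storage is reused by the next tile

        d_data[block_offset + threadIdx.x] = thread_data;
    }
}

// Hypothetical host helper: scans 256 ones (two tiles) and returns the result.
std::vector<int> RunTiledExclusiveSum()
{
    std::vector<int> h_data(256, 1);
    const size_t bytes = h_data.size() * sizeof(int);
    int *d_data = nullptr;
    cudaMalloc(&d_data, bytes);
    cudaMemcpy(d_data, h_data.data(), bytes, cudaMemcpyHostToDevice);
    TiledExclusiveSumKernel<<<1, 128>>>(d_data, 256);
    cudaError_t err = cudaMemcpy(h_data.data(), d_data, bytes, cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);   // sketch-level error handling only
    (void)err;
    cudaFree(d_data);
    return h_data;
}
```

Each thread keeps its own copy of the functor. Only the copies in the first warp are invoked, and they all see the same block_aggregate, so their state stays consistent from tile to tile.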
template<int ITEMS_PER_THREAD>
__device__ __forceinline__ void ExclusiveSum (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD]) | inline |
Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. The value of 0 is applied as the initial value, and is assigned to output[0] in thread0.
This operation assumes that the value obtained by T's default constructor (or by zero-initialization if no user-defined default constructor exists) is suitable as the identity value zero for addition.
Efficiency generally increases with granularity ITEMS_PER_THREAD. Performance also typically increases until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose 512 integer items are partitioned in a blocked arrangement across 128 threads, each thread owning 4 consecutive items, and the set of input thread_data across the block of threads is { [1,1,1,1], [1,1,1,1], ..., [1,1,1,1] }. The corresponding output thread_data in those threads will be { [0,1,2,3], [4,5,6,7], ..., [508,509,510,511] }.
ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
[in] | input | Calling thread's input items |
[out] | output | Calling thread's output items (may be aliased to input ) |
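A compact sketch of the array overload follows, assuming `<cub/cub.cuh>` and one block of 128 threads. For brevity it gathers the blocked arrangement with plain indexing rather than cub::BlockLoad. `ExclusiveSumItemsKernel` and `RunExclusiveSumItems` are hypothetical names.

```cuda
#include <cub/cub.cuh>
#include <cassert>
#include <vector>

// Hypothetical kernel: 128 threads, 4 consecutive ints per thread, scanned in place.
__global__ void ExclusiveSumItemsKernel(int *d_data)
{
    typedef cub::BlockScan<int, 128> BlockScan;
    __shared__ typename BlockScan::TempStorage temp_storage;

    // Blocked arrangement: thread i owns items 4i .. 4i+3
    int thread_data[4];
    for (int i = 0; i < 4; ++i) thread_data[i] = d_data[threadIdx.x * 4 + i];

    // ITEMS_PER_THREAD (4) is inferred from the array type
    BlockScan(temp_storage).ExclusiveSum(thread_data, thread_data);

    for (int i = 0; i < 4; ++i) d_data[threadIdx.x * 4 + i] = thread_data[i];
}

// Hypothetical host helper: scans 512 ones and returns the result.
std::vector<int> RunExclusiveSumItems()
{
    std::vector<int> h_data(512, 1);
    const size_t bytes = h_data.size() * sizeof(int);
    int *d_data = nullptr;
    cudaMalloc(&d_data, bytes);
    cudaMemcpy(d_data, h_data.data(), bytes, cudaMemcpyHostToDevice);
    ExclusiveSumItemsKernel<<<1, 128>>>(d_data);
    cudaError_t err = cudaMemcpy(h_data.data(), d_data, bytes, cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);   // sketch-level error handling only
    (void)err;
    cudaFree(d_data);
    return h_data;
}
```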
template<int ITEMS_PER_THREAD>
__device__ __forceinline__ void ExclusiveSum (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], T &block_aggregate) | inline |
Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. The value of 0 is applied as the initial value, and is assigned to output[0] in thread0. Also provides every thread with the block-wide block_aggregate of all inputs.
This operation assumes that the value obtained by T's default constructor (or by zero-initialization if no user-defined default constructor exists) is suitable as the identity value zero for addition.
Efficiency generally increases with granularity ITEMS_PER_THREAD. Performance also typically increases until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose 512 integer items are partitioned in a blocked arrangement across 128 threads, each thread owning 4 consecutive items, and the set of input thread_data across the block of threads is { [1,1,1,1], [1,1,1,1], ..., [1,1,1,1] }. The corresponding output thread_data in those threads will be { [0,1,2,3], [4,5,6,7], ..., [508,509,510,511] }. Furthermore, the value 512 will be stored in block_aggregate for all threads.
ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
[in] | input | Calling thread's input items |
[out] | output | Calling thread's output items (may be aliased to input ) |
[out] | block_aggregate | block-wide aggregate reduction of input items |
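The same idea with the aggregate output could look like the sketch below, under the same assumptions (`<cub/cub.cuh>`, one block of 128 threads, plain indexing for the blocked arrangement). The names and the extra slot 512 used to return the aggregate are illustrative choices.

```cuda
#include <cub/cub.cuh>
#include <cassert>
#include <vector>

// Hypothetical kernel: d_data holds 512 inputs plus one extra slot for the aggregate.
__global__ void ExclusiveSumItemsAggregateKernel(int *d_data)
{
    typedef cub::BlockScan<int, 128> BlockScan;
    __shared__ typename BlockScan::TempStorage temp_storage;

    // Blocked arrangement: thread i owns items 4i .. 4i+3
    int thread_data[4];
    for (int i = 0; i < 4; ++i) thread_data[i] = d_data[threadIdx.x * 4 + i];

    // Exclusive prefix sum; block_aggregate receives the sum of all 512 inputs
    int block_aggregate;
    BlockScan(temp_storage).ExclusiveSum(thread_data, thread_data, block_aggregate);

    for (int i = 0; i < 4; ++i) d_data[threadIdx.x * 4 + i] = thread_data[i];

    // Every thread holds the aggregate; the last thread records it in slot 512
    if (threadIdx.x == 127) d_data[512] = block_aggregate;
}

// Hypothetical host helper: returns 512 scan outputs followed by the aggregate.
std::vector<int> RunExclusiveSumItemsAggregate()
{
    std::vector<int> h_data(513, 1);
    const size_t bytes = h_data.size() * sizeof(int);
    int *d_data = nullptr;
    cudaMalloc(&d_data, bytes);
    cudaMemcpy(d_data, h_data.data(), bytes, cudaMemcpyHostToDevice);
    ExclusiveSumItemsAggregateKernel<<<1, 128>>>(d_data);
    cudaError_t err = cudaMemcpy(h_data.data(), d_data, bytes, cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);   // sketch-level error handling only
    (void)err;
    cudaFree(d_data);
    return h_data;
}
```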
template<int ITEMS_PER_THREAD, typename BlockPrefixCallbackOp >
__device__ __forceinline__ void ExclusiveSum (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], BlockPrefixCallbackOp &block_prefix_callback_op) | inline |
Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. Instead of using 0 as the block-wide prefix, the call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the thread block's scan inputs. The block-wide aggregate of all inputs is passed to the functor as block_aggregate.
This operation assumes that the value obtained by T's default constructor (or by zero-initialization if no user-defined default constructor exists) is suitable as the identity value zero for addition.
The block_prefix_callback_op functor must implement a member function T operator()(T block_aggregate). The functor's input parameter block_aggregate is the block-wide aggregate of all inputs. The functor will be invoked by the first warp of threads in the block; however, only the return value from lane0 is applied as the block-wide prefix. The functor can be stateful.
Efficiency generally increases with granularity ITEMS_PER_THREAD. Performance also typically increases until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose a single thread block of 128 threads, each owning 4 consecutive items, scans d_data in consecutive segments of 512 integer items, using the functor to carry a running total from one segment to the next, and the input d_data is 1, 1, 1, 1, 1, 1, 1, 1, .... The corresponding output for the first segment will be 0, 1, 2, 3, ..., 510, 511. The output for the second segment will be 512, 513, 514, 515, ..., 1022, 1023.
ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
BlockPrefixCallbackOp | [inferred] Call-back functor type having member T operator()(T block_aggregate) |
[in] | input | Calling thread's input items |
[out] | output | Calling thread's output items (may be aliased to input ) |
[in,out] | block_prefix_callback_op | [warp0 only] Call-back functor for specifying a block-wide prefix to be applied to the logical input sequence. |
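One way to write the 512-item tiled scenario is sketched below, assuming `<cub/cub.cuh>`, a single block of 128 threads, and an input length that is a multiple of 512. `RunningPrefixOp`, `TiledExclusiveSumItemsKernel`, and `RunTiledExclusiveSumItems` are hypothetical names, and plain indexing stands in for cub::BlockLoad and cub::BlockStore.

```cuda
#include <cub/cub.cuh>
#include <cassert>
#include <vector>

// Hypothetical stateful functor that carries a running total between tiles.
struct RunningPrefixOp
{
    int running_total;
    __device__ RunningPrefixOp(int initial) : running_total(initial) {}

    // Invoked by the first warp; lane0's return value seeds the block-wide scan
    __device__ int operator()(int block_aggregate)
    {
        int old_prefix = running_total;
        running_total += block_aggregate;
        return old_prefix;
    }
};

// Hypothetical kernel: one block of 128 threads scans tiles of 128 * 4 = 512 ints.
__global__ void TiledExclusiveSumItemsKernel(int *d_data, int num_items)
{
    typedef cub::BlockScan<int, 128> BlockScan;
    __shared__ typename BlockScan::TempStorage temp_storage;

    RunningPrefixOp prefix_op(0);

    // num_items is assumed to be a multiple of 512
    for (int block_offset = 0; block_offset < num_items; block_offset += 128 * 4)
    {
        int *tile = d_data + block_offset;
        int thread_data[4];
        for (int i = 0; i < 4; ++i) thread_data[i] = tile[threadIdx.x * 4 + i];

        BlockScan(temp_storage).ExclusiveSum(thread_data, thread_data, prefix_op);
        __syncthreads();   // temp_storage is reused by the next tile

        for (int i = 0; i < 4; ++i) tile[threadIdx.x * 4 + i] = thread_data[i];
    }
}

// Hypothetical host helper: scans 1024 ones (two tiles) and returns the result.
std::vector<int> RunTiledExclusiveSumItems()
{
    std::vector<int> h_data(1024, 1);
    const size_t bytes = h_data.size() * sizeof(int);
    int *d_data = nullptr;
    cudaMalloc(&d_data, bytes);
    cudaMemcpy(d_data, h_data.data(), bytes, cudaMemcpyHostToDevice);
    TiledExclusiveSumItemsKernel<<<1, 128>>>(d_data, 1024);
    cudaError_t err = cudaMemcpy(h_data.data(), d_data, bytes, cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);   // sketch-level error handling only
    (void)err;
    cudaFree(d_data);
    return h_data;
}
```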
template<typename ScanOp >
__device__ __forceinline__ void ExclusiveScan (T input, T &output, T initial_value, ScanOp scan_op) | inline |
Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose 128 integer items are partitioned across 128 threads, the scan operator is maximum with INT_MIN as the initial value, and the set of input thread_data across the block of threads is 0, -1, 2, -3, ..., 126, -127. The corresponding output thread_data in those threads will be INT_MIN, 0, 0, 2, ..., 124, 126.
ScanOp | [inferred] Binary scan functor type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input item |
[out] | output | Calling thread's output item (may be aliased to input ) |
[in] | initial_value | Initial value to seed the exclusive scan (and is assigned to output [0] in thread0) |
[in] | scan_op | Binary scan functor |
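The max-scan scenario above can be sketched as follows, assuming `<cub/cub.cuh>` and one block of 128 threads. The functor `MaxOp` is a hypothetical user-defined operator written for this illustration so that the example does not depend on any particular library-provided maximum functor. `ExclusiveMaxScanKernel` and `RunExclusiveMaxScan` are hypothetical names as well.

```cuda
#include <cub/cub.cuh>
#include <cassert>
#include <climits>
#include <vector>

// Hypothetical binary functor matching T operator()(const T &a, const T &b)
struct MaxOp
{
    __host__ __device__ int operator()(const int &a, const int &b) const
    {
        return (b > a) ? b : a;
    }
};

// Hypothetical kernel: one block of 128 threads, one int per thread.
__global__ void ExclusiveMaxScanKernel(int *d_data)
{
    typedef cub::BlockScan<int, 128> BlockScan;
    __shared__ typename BlockScan::TempStorage temp_storage;

    int thread_data = d_data[threadIdx.x];

    // Exclusive prefix max scan seeded with INT_MIN
    BlockScan(temp_storage).ExclusiveScan(thread_data, thread_data, INT_MIN, MaxOp());

    d_data[threadIdx.x] = thread_data;
}

// Hypothetical host helper: scans 0, -1, 2, -3, ..., 126, -127.
std::vector<int> RunExclusiveMaxScan()
{
    std::vector<int> h_data(128);
    for (int i = 0; i < 128; ++i) h_data[i] = (i % 2 == 0) ? i : -i;
    const size_t bytes = h_data.size() * sizeof(int);
    int *d_data = nullptr;
    cudaMalloc(&d_data, bytes);
    cudaMemcpy(d_data, h_data.data(), bytes, cudaMemcpyHostToDevice);
    ExclusiveMaxScanKernel<<<1, 128>>>(d_data);
    cudaError_t err = cudaMemcpy(h_data.data(), d_data, bytes, cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);   // sketch-level error handling only
    (void)err;
    cudaFree(d_data);
    return h_data;
}
```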
template<typename ScanOp >
__device__ __forceinline__ void ExclusiveScan (T input, T &output, T initial_value, ScanOp scan_op, T &block_aggregate) | inline |
Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. Also provides every thread with the block-wide block_aggregate of all inputs.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose 128 integer items are partitioned across 128 threads, the scan operator is maximum with INT_MIN as the initial value, and the set of input thread_data across the block of threads is 0, -1, 2, -3, ..., 126, -127. The corresponding output thread_data in those threads will be INT_MIN, 0, 0, 2, ..., 124, 126. Furthermore, the value 126 will be stored in block_aggregate for all threads.
ScanOp | [inferred] Binary scan functor type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input item |
[out] | output | Calling thread's output item (may be aliased to input ) |
[in] | initial_value | Initial value to seed the exclusive scan (and is assigned to output [0] in thread0) |
[in] | scan_op | Binary scan functor |
[out] | block_aggregate | block-wide aggregate reduction of input items |
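A sketch of the aggregate-returning generic scan follows, assuming `<cub/cub.cuh>` and one block of 128 threads. `MaxOp` is a hypothetical user-defined functor, and the kernel and helper names, along with the extra slot 128 used to return the aggregate, are illustrative.

```cuda
#include <cub/cub.cuh>
#include <cassert>
#include <climits>
#include <vector>

// Hypothetical binary functor matching T operator()(const T &a, const T &b)
struct MaxOp
{
    __host__ __device__ int operator()(const int &a, const int &b) const
    {
        return (b > a) ? b : a;
    }
};

// Hypothetical kernel: d_data holds 128 inputs plus one extra slot for the aggregate.
__global__ void ExclusiveMaxScanAggregateKernel(int *d_data)
{
    typedef cub::BlockScan<int, 128> BlockScan;
    __shared__ typename BlockScan::TempStorage temp_storage;

    int thread_data = d_data[threadIdx.x];

    // Exclusive prefix max scan; block_aggregate receives the max of all inputs
    int block_aggregate;
    BlockScan(temp_storage).ExclusiveScan(thread_data, thread_data, INT_MIN, MaxOp(), block_aggregate);

    d_data[threadIdx.x] = thread_data;

    // Every thread holds the aggregate; the last thread records it in slot 128
    if (threadIdx.x == 127) d_data[128] = block_aggregate;
}

// Hypothetical host helper: returns 128 scan outputs followed by the aggregate.
std::vector<int> RunExclusiveMaxScanAggregate()
{
    std::vector<int> h_data(129, 0);
    for (int i = 0; i < 128; ++i) h_data[i] = (i % 2 == 0) ? i : -i;
    const size_t bytes = h_data.size() * sizeof(int);
    int *d_data = nullptr;
    cudaMalloc(&d_data, bytes);
    cudaMemcpy(d_data, h_data.data(), bytes, cudaMemcpyHostToDevice);
    ExclusiveMaxScanAggregateKernel<<<1, 128>>>(d_data);
    cudaError_t err = cudaMemcpy(h_data.data(), d_data, bytes, cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);   // sketch-level error handling only
    (void)err;
    cudaFree(d_data);
    return h_data;
}
```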
template<typename ScanOp , typename BlockPrefixCallbackOp >
__device__ __forceinline__ void ExclusiveScan (T input, T &output, ScanOp scan_op, BlockPrefixCallbackOp &block_prefix_callback_op) | inline |
Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. The call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the thread block's scan inputs. The block-wide aggregate of all inputs is passed to the functor as block_aggregate.
The block_prefix_callback_op functor must implement a member function T operator()(T block_aggregate). The functor's input parameter block_aggregate is the block-wide aggregate of all inputs. The functor will be invoked by the first warp of threads in the block; however, only the return value from lane0 is applied as the block-wide prefix. The functor can be stateful.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose a single thread block of 128 threads computes a prefix max scan over d_data in consecutive segments of 128 integer items, using the functor (seeded with INT_MIN) to carry a running maximum from one segment to the next, and the input d_data is 0, -1, 2, -3, 4, -5, .... The corresponding output for the first segment will be INT_MIN, 0, 0, 2, ..., 124, 126. The output for the second segment will be 126, 128, 128, 130, ..., 252, 254.
ScanOp | [inferred] Binary scan functor type having member T operator()(const T &a, const T &b) |
BlockPrefixCallbackOp | [inferred] Call-back functor type having member T operator()(T block_aggregate) |
[in] | input | Calling thread's input item |
[out] | output | Calling thread's output item (may be aliased to input ) |
[in] | scan_op | Binary scan functor |
[in,out] | block_prefix_callback_op | [warp0 only] Call-back functor for specifying a block-wide prefix to be applied to the logical input sequence. |
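The tiled max-scan scenario above might be sketched as below, assuming `<cub/cub.cuh>`, a single block of 128 threads, and an input length that is a multiple of 128. `MaxOp`, `RunningMaxOp`, `TiledExclusiveMaxScanKernel`, and `RunTiledExclusiveMaxScan` are hypothetical names introduced for the illustration.

```cuda
#include <cub/cub.cuh>
#include <cassert>
#include <climits>
#include <vector>

// Hypothetical binary functor matching T operator()(const T &a, const T &b)
struct MaxOp
{
    __host__ __device__ int operator()(const int &a, const int &b) const
    {
        return (b > a) ? b : a;
    }
};

// Hypothetical stateful functor that carries a running maximum between tiles.
struct RunningMaxOp
{
    int running_max;
    __device__ RunningMaxOp(int initial) : running_max(initial) {}

    // Invoked by the first warp; lane0's return value seeds the block-wide scan
    __device__ int operator()(int block_aggregate)
    {
        int old_prefix = running_max;
        running_max = (block_aggregate > old_prefix) ? block_aggregate : old_prefix;
        return old_prefix;
    }
};

// Hypothetical kernel: one block of 128 threads scans num_items ints tile by tile.
__global__ void TiledExclusiveMaxScanKernel(int *d_data, int num_items)
{
    typedef cub::BlockScan<int, 128> BlockScan;
    __shared__ typename BlockScan::TempStorage temp_storage;

    RunningMaxOp prefix_op(INT_MIN);

    // num_items is assumed to be a multiple of 128
    for (int block_offset = 0; block_offset < num_items; block_offset += 128)
    {
        int thread_data = d_data[block_offset + threadIdx.x];

        BlockScan(temp_storage).ExclusiveScan(thread_data, thread_data, MaxOp(), prefix_op);
        __syncthreads();   // temp_storage is reused by the next tile

        d_data[block_offset + threadIdx.x] = thread_data;
    }
}

// Hypothetical host helper: scans 0, -1, 2, -3, ... over 256 items (two tiles).
std::vector<int> RunTiledExclusiveMaxScan()
{
    std::vector<int> h_data(256);
    for (int i = 0; i < 256; ++i) h_data[i] = (i % 2 == 0) ? i : -i;
    const size_t bytes = h_data.size() * sizeof(int);
    int *d_data = nullptr;
    cudaMalloc(&d_data, bytes);
    cudaMemcpy(d_data, h_data.data(), bytes, cudaMemcpyHostToDevice);
    TiledExclusiveMaxScanKernel<<<1, 128>>>(d_data, 256);
    cudaError_t err = cudaMemcpy(h_data.data(), d_data, bytes, cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);   // sketch-level error handling only
    (void)err;
    cudaFree(d_data);
    return h_data;
}
```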
template<int ITEMS_PER_THREAD, typename ScanOp >
__device__ __forceinline__ void ExclusiveScan (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], T initial_value, ScanOp scan_op) | inline |
Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements.
Efficiency generally increases with granularity ITEMS_PER_THREAD. Performance also typically increases until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose 512 integer items are partitioned in a blocked arrangement across 128 threads, each thread owning 4 consecutive items, the scan operator is maximum with INT_MIN as the initial value, and the set of input thread_data across the block of threads is { [0,-1,2,-3], [4,-5,6,-7], ..., [508,-509,510,-511] }. The corresponding output thread_data in those threads will be { [INT_MIN,0,0,2], [2,4,4,6], ..., [506,508,508,510] }.
ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
ScanOp | [inferred] Binary scan functor type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input items |
[out] | output | Calling thread's output items (may be aliased to input ) |
[in] | initial_value | Initial value to seed the exclusive scan (and is assigned to output [0] in thread0) |
[in] | scan_op | Binary scan functor |
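A minimal sketch of the array-based max scan follows, assuming `<cub/cub.cuh>` and one block of 128 threads. `MaxOp` is a hypothetical user-defined functor, plain indexing stands in for cub::BlockLoad, and the kernel and helper names are illustrative.

```cuda
#include <cub/cub.cuh>
#include <cassert>
#include <climits>
#include <vector>

// Hypothetical binary functor matching T operator()(const T &a, const T &b)
struct MaxOp
{
    __host__ __device__ int operator()(const int &a, const int &b) const
    {
        return (b > a) ? b : a;
    }
};

// Hypothetical kernel: 128 threads, 4 consecutive ints per thread, scanned in place.
__global__ void ExclusiveMaxScanItemsKernel(int *d_data)
{
    typedef cub::BlockScan<int, 128> BlockScan;
    __shared__ typename BlockScan::TempStorage temp_storage;

    // Blocked arrangement: thread i owns items 4i .. 4i+3
    int thread_data[4];
    for (int i = 0; i < 4; ++i) thread_data[i] = d_data[threadIdx.x * 4 + i];

    // Exclusive prefix max scan seeded with INT_MIN
    BlockScan(temp_storage).ExclusiveScan(thread_data, thread_data, INT_MIN, MaxOp());

    for (int i = 0; i < 4; ++i) d_data[threadIdx.x * 4 + i] = thread_data[i];
}

// Hypothetical host helper: scans 0, -1, 2, -3, ..., 510, -511.
std::vector<int> RunExclusiveMaxScanItems()
{
    std::vector<int> h_data(512);
    for (int i = 0; i < 512; ++i) h_data[i] = (i % 2 == 0) ? i : -i;
    const size_t bytes = h_data.size() * sizeof(int);
    int *d_data = nullptr;
    cudaMalloc(&d_data, bytes);
    cudaMemcpy(d_data, h_data.data(), bytes, cudaMemcpyHostToDevice);
    ExclusiveMaxScanItemsKernel<<<1, 128>>>(d_data);
    cudaError_t err = cudaMemcpy(h_data.data(), d_data, bytes, cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);   // sketch-level error handling only
    (void)err;
    cudaFree(d_data);
    return h_data;
}
```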
template<int ITEMS_PER_THREAD, typename ScanOp >
__device__ __forceinline__ void ExclusiveScan (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], T initial_value, ScanOp scan_op, T &block_aggregate) | inline |
Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide block_aggregate of all inputs.
Efficiency generally increases with granularity ITEMS_PER_THREAD. Performance also typically increases until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose 512 integer items are partitioned in a blocked arrangement across 128 threads, each thread owning 4 consecutive items, the scan operator is maximum with INT_MIN as the initial value, and the set of input thread_data across the block of threads is { [0,-1,2,-3], [4,-5,6,-7], ..., [508,-509,510,-511] }. The corresponding output thread_data in those threads will be { [INT_MIN,0,0,2], [2,4,4,6], ..., [506,508,508,510] }. Furthermore, the value 510 will be stored in block_aggregate for all threads.
ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
ScanOp | [inferred] Binary scan functor type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input items |
[out] | output | Calling thread's output items (may be aliased to input ) |
[in] | initial_value | Initial value to seed the exclusive scan (and is assigned to output [0] in thread0) |
[in] | scan_op | Binary scan functor |
[out] | block_aggregate | block-wide aggregate reduction of input items |
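The aggregate variant of the array-based max scan could be sketched like this, under the same assumptions (`<cub/cub.cuh>`, one block of 128 threads, plain indexing). `MaxOp`, the kernel and helper names, and the extra slot 512 used to return the aggregate are all illustrative choices.

```cuda
#include <cub/cub.cuh>
#include <cassert>
#include <climits>
#include <vector>

// Hypothetical binary functor matching T operator()(const T &a, const T &b)
struct MaxOp
{
    __host__ __device__ int operator()(const int &a, const int &b) const
    {
        return (b > a) ? b : a;
    }
};

// Hypothetical kernel: d_data holds 512 inputs plus one extra slot for the aggregate.
__global__ void ExclusiveMaxScanItemsAggregateKernel(int *d_data)
{
    typedef cub::BlockScan<int, 128> BlockScan;
    __shared__ typename BlockScan::TempStorage temp_storage;

    // Blocked arrangement: thread i owns items 4i .. 4i+3
    int thread_data[4];
    for (int i = 0; i < 4; ++i) thread_data[i] = d_data[threadIdx.x * 4 + i];

    // Exclusive prefix max scan; block_aggregate receives the max of all inputs
    int block_aggregate;
    BlockScan(temp_storage).ExclusiveScan(thread_data, thread_data, INT_MIN, MaxOp(), block_aggregate);

    for (int i = 0; i < 4; ++i) d_data[threadIdx.x * 4 + i] = thread_data[i];

    // Every thread holds the aggregate; the last thread records it in slot 512
    if (threadIdx.x == 127) d_data[512] = block_aggregate;
}

// Hypothetical host helper: returns 512 scan outputs followed by the aggregate.
std::vector<int> RunExclusiveMaxScanItemsAggregate()
{
    std::vector<int> h_data(513, 0);
    for (int i = 0; i < 512; ++i) h_data[i] = (i % 2 == 0) ? i : -i;
    const size_t bytes = h_data.size() * sizeof(int);
    int *d_data = nullptr;
    cudaMalloc(&d_data, bytes);
    cudaMemcpy(d_data, h_data.data(), bytes, cudaMemcpyHostToDevice);
    ExclusiveMaxScanItemsAggregateKernel<<<1, 128>>>(d_data);
    cudaError_t err = cudaMemcpy(h_data.data(), d_data, bytes, cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);   // sketch-level error handling only
    (void)err;
    cudaFree(d_data);
    return h_data;
}
```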
template<int ITEMS_PER_THREAD, typename ScanOp , typename BlockPrefixCallbackOp >
__device__ __forceinline__ void ExclusiveScan (T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], ScanOp scan_op, BlockPrefixCallbackOp &block_prefix_callback_op) | inline |
Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. The call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the thread block's scan inputs. The block-wide aggregate of all inputs is passed to the functor as block_aggregate.
The block_prefix_callback_op functor must implement a member function T operator()(T block_aggregate). The functor's input parameter block_aggregate is the block-wide aggregate of all inputs. The functor will be invoked by the first warp of threads in the block; however, only the return value from lane0 is applied as the block-wide prefix. The functor can be stateful.
Efficiency generally increases with granularity ITEMS_PER_THREAD. Performance also typically increases until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose a single thread block of 128 threads, each owning 4 consecutive items, computes a prefix max scan over d_data in consecutive segments of 512 integer items, using the functor (seeded with INT_MIN) to carry a running maximum from one segment to the next, and the input d_data is 0, -1, 2, -3, 4, -5, .... The corresponding output for the first segment will be INT_MIN, 0, 0, 2, 2, 4, ..., 508, 510. The output for the second segment will be 510, 512, 512, 514, 514, 516, ..., 1020, 1022.
ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
ScanOp | [inferred] Binary scan functor type having member T operator()(const T &a, const T &b) |
BlockPrefixCallbackOp | [inferred] Call-back functor type having member T operator()(T block_aggregate) |
[in] | input | Calling thread's input items |
[out] | output | Calling thread's output items (may be aliased to input ) |
[in] | scan_op | Binary scan functor |
[in,out] | block_prefix_callback_op | [warp0 only] Call-back functor for specifying a block-wide prefix to be applied to the logical input sequence. |
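The 512-item tiled max scan above can be sketched as follows, assuming `<cub/cub.cuh>`, a single block of 128 threads, and an input length that is a multiple of 512. `MaxOp`, `RunningMaxOp`, and the kernel and helper names are hypothetical, and plain indexing is used in place of cub::BlockLoad and cub::BlockStore.

```cuda
#include <cub/cub.cuh>
#include <cassert>
#include <climits>
#include <vector>

// Hypothetical binary functor matching T operator()(const T &a, const T &b)
struct MaxOp
{
    __host__ __device__ int operator()(const int &a, const int &b) const
    {
        return (b > a) ? b : a;
    }
};

// Hypothetical stateful functor that carries a running maximum between tiles.
struct RunningMaxOp
{
    int running_max;
    __device__ RunningMaxOp(int initial) : running_max(initial) {}

    // Invoked by the first warp; lane0's return value seeds the block-wide scan
    __device__ int operator()(int block_aggregate)
    {
        int old_prefix = running_max;
        running_max = (block_aggregate > old_prefix) ? block_aggregate : old_prefix;
        return old_prefix;
    }
};

// Hypothetical kernel: one block of 128 threads scans tiles of 128 * 4 = 512 ints.
__global__ void TiledExclusiveMaxScanItemsKernel(int *d_data, int num_items)
{
    typedef cub::BlockScan<int, 128> BlockScan;
    __shared__ typename BlockScan::TempStorage temp_storage;

    RunningMaxOp prefix_op(INT_MIN);

    // num_items is assumed to be a multiple of 512
    for (int block_offset = 0; block_offset < num_items; block_offset += 128 * 4)
    {
        int *tile = d_data + block_offset;
        int thread_data[4];
        for (int i = 0; i < 4; ++i) thread_data[i] = tile[threadIdx.x * 4 + i];

        BlockScan(temp_storage).ExclusiveScan(thread_data, thread_data, MaxOp(), prefix_op);
        __syncthreads();   // temp_storage is reused by the next tile

        for (int i = 0; i < 4; ++i) tile[threadIdx.x * 4 + i] = thread_data[i];
    }
}

// Hypothetical host helper: scans 0, -1, 2, -3, ... over 1024 items (two tiles).
std::vector<int> RunTiledExclusiveMaxScanItems()
{
    std::vector<int> h_data(1024);
    for (int i = 0; i < 1024; ++i) h_data[i] = (i % 2 == 0) ? i : -i;
    const size_t bytes = h_data.size() * sizeof(int);
    int *d_data = nullptr;
    cudaMalloc(&d_data, bytes);
    cudaMemcpy(d_data, h_data.data(), bytes, cudaMemcpyHostToDevice);
    TiledExclusiveMaxScanItemsKernel<<<1, 128>>>(d_data, 1024);
    cudaError_t err = cudaMemcpy(h_data.data(), d_data, bytes, cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);   // sketch-level error handling only
    (void)err;
    cudaFree(d_data);
    return h_data;
}
```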
__device__ __forceinline__ void InclusiveSum (T input, T &output) | inline |
Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose 128 integer items are partitioned across 128 threads and the set of input thread_data across the block of threads is 1, 1, ..., 1. The corresponding output thread_data in those threads will be 1, 2, ..., 128.
[in] | input | Calling thread's input item |
[out] | output | Calling thread's output item (may be aliased to input ) |
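The inclusive counterpart can be sketched in the same way, assuming `<cub/cub.cuh>` and one block of 128 threads. `InclusiveSumKernel` and `RunInclusiveSum` are hypothetical names.

```cuda
#include <cub/cub.cuh>
#include <cassert>
#include <vector>

// Hypothetical kernel name; expects a launch of exactly one block of 128 threads.
__global__ void InclusiveSumKernel(int *d_data)
{
    // Specialize BlockScan for a 1D block of 128 threads on type int
    typedef cub::BlockScan<int, 128> BlockScan;
    __shared__ typename BlockScan::TempStorage temp_storage;

    int thread_data = d_data[threadIdx.x];

    // Inclusive prefix sum: each output includes the calling thread's own input
    BlockScan(temp_storage).InclusiveSum(thread_data, thread_data);

    d_data[threadIdx.x] = thread_data;
}

// Hypothetical host helper: scans 128 ones and returns the result.
std::vector<int> RunInclusiveSum()
{
    std::vector<int> h_data(128, 1);
    const size_t bytes = h_data.size() * sizeof(int);
    int *d_data = nullptr;
    cudaMalloc(&d_data, bytes);
    cudaMemcpy(d_data, h_data.data(), bytes, cudaMemcpyHostToDevice);
    InclusiveSumKernel<<<1, 128>>>(d_data);
    cudaError_t err = cudaMemcpy(h_data.data(), d_data, bytes, cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);   // sketch-level error handling only
    (void)err;
    cudaFree(d_data);
    return h_data;
}
```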
__device__ __forceinline__ void InclusiveSum (T input, T &output, T &block_aggregate) | inline |
Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Also provides every thread with the block-wide block_aggregate of all inputs.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose 128 integer items are partitioned across 128 threads and the set of input thread_data across the block of threads is 1, 1, ..., 1. The corresponding output thread_data in those threads will be 1, 2, ..., 128. Furthermore, the value 128 will be stored in block_aggregate for all threads.
[in] | input | Calling thread's input item |
[out] | output | Calling thread's output item (may be aliased to input ) |
[out] | block_aggregate | block-wide aggregate reduction of input items |
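The relationship between the scan outputs and block_aggregate can be shown with a short host-side C++ model built on the standard library. The function name is hypothetical and the code only mirrors the input/output/block_aggregate parameters described above; it is not the device implementation.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Host-side model: inclusive sum plus the aggregate of all inputs.
void inclusive_sum_with_aggregate(const std::vector<int>& input,
                                  std::vector<int>& output,
                                  int& block_aggregate) {
    assert(output.size() == input.size());
    // Reduce first so the aggregate stays correct when output aliases input.
    block_aggregate = std::accumulate(input.begin(), input.end(), 0);
    // std::partial_sum permits the destination to start at the source.
    std::partial_sum(input.begin(), input.end(), output.begin());
}
```

For a non-empty input the aggregate equals the last inclusive output, which is why both are 128 in the example above.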
|
inline |
Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Instead of using 0 as the block-wide prefix, the call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the thread block's scan inputs. Also provides every thread with the block-wide block_aggregate of all inputs.
The block_prefix_callback_op functor must implement a member function T operator()(T block_aggregate). The functor's input parameter block_aggregate is the same value also returned by the scan operation. The functor will be invoked by the first warp of threads in the block; however, only the return value from lane0 is applied as the block-wide prefix. The functor can be stateful.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose the input d_data is 1, 1, 1, 1, 1, 1, 1, 1, .... The corresponding output for the first segment will be 1, 2, ..., 128. The output for the second segment will be 129, 130, ..., 256.
BlockPrefixCallbackOp | [inferred] Call-back functor type having member T operator()(T block_aggregate) |
[in] | input | Calling thread's input item |
[out] | output | Calling thread's output item (may be aliased to input ) |
[in,out] | block_prefix_callback_op | [warp0 only] Call-back functor for specifying a block-wide prefix to be applied to the logical input sequence. |
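A stateful prefix functor of the kind described above might look like the following host-side C++ sketch. `RunningSumPrefix` and `segmented_inclusive_sum` are hypothetical names; the model assumes the functor starts at 0 and that each segment covers one item per thread, and it calls the functor once per segment where a real kernel would apply only lane0's return value.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stateful functor: returns the running total accumulated so
// far (the seed for the current segment), then adds the new aggregate.
struct RunningSumPrefix {
    int running_total;
    explicit RunningSumPrefix(int initial) : running_total(initial) {}
    int operator()(int block_aggregate) {
        const int old_prefix = running_total;
        running_total += block_aggregate;
        return old_prefix;
    }
};

// Host-side model: inclusive sum over consecutive segments of block_threads
// items, carrying the running total from one segment to the next.
std::vector<int> segmented_inclusive_sum(const std::vector<int>& d_data,
                                         std::size_t block_threads) {
    assert(block_threads > 0);
    std::vector<int> out(d_data.size());
    RunningSumPrefix prefix_op(0);  // assumed initial running total
    for (std::size_t base = 0; base < d_data.size(); base += block_threads) {
        const std::size_t end = std::min(base + block_threads, d_data.size());
        int block_aggregate = 0;
        for (std::size_t i = base; i < end; ++i) block_aggregate += d_data[i];
        int running = prefix_op(block_aggregate);  // seed for this segment
        for (std::size_t i = base; i < end; ++i) {
            running += d_data[i];                  // inclusive: combine, then write
            out[i] = running;
        }
    }
    return out;
}
```

Keeping the running total inside the functor lets a single thread block walk a long input one segment at a time without an extra pass to compute per-segment offsets.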
|
inline |
Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements.
Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose the set of input thread_data across the block of threads is { [1,1,1,1], [1,1,1,1], ..., [1,1,1,1] }. The corresponding output thread_data in those threads will be { [1,2,3,4], [5,6,7,8], ..., [509,510,511,512] }.
ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
[in] | input | Calling thread's input items |
[out] | output | Calling thread's output items (may be aliased to input ) |
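One common way to scan a blocked arrangement is to reduce each thread's items, scan the per-thread totals, and then rescan each thread's items from its own offset. The host-side C++ sketch below models that decomposition; it is an assumption made for illustration, not a statement of how BlockScan is implemented, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model of an inclusive sum over a blocked arrangement:
// "thread" t owns items [t * items_per_thread, (t + 1) * items_per_thread).
void blocked_inclusive_sum(std::vector<int>& items, std::size_t items_per_thread) {
    assert(items_per_thread > 0 && items.size() % items_per_thread == 0);
    const std::size_t num_threads = items.size() / items_per_thread;

    // Phase 1: each thread reduces its own consecutive items.
    std::vector<int> thread_prefix(num_threads);
    for (std::size_t t = 0; t < num_threads; ++t) {
        int sum = 0;
        for (std::size_t i = 0; i < items_per_thread; ++i)
            sum += items[t * items_per_thread + i];
        thread_prefix[t] = sum;
    }

    // Phase 2: exclusive scan of the per-thread totals gives each thread
    // the sum of everything owned by lower-ranked threads.
    int running = 0;
    for (std::size_t t = 0; t < num_threads; ++t) {
        const int total = thread_prefix[t];
        thread_prefix[t] = running;
        running += total;
    }

    // Phase 3: each thread scans its items, seeded with its offset.
    for (std::size_t t = 0; t < num_threads; ++t) {
        int acc = thread_prefix[t];
        for (std::size_t i = 0; i < items_per_thread; ++i) {
            acc += items[t * items_per_thread + i];
            items[t * items_per_thread + i] = acc;
        }
    }
}
```

Only the second phase involves values from other threads, so with more items per thread a larger share of the work is thread-local.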
|
inline |
Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide block_aggregate of all inputs.
Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose the set of input thread_data across the block of threads is { [1,1,1,1], [1,1,1,1], ..., [1,1,1,1] }. The corresponding output thread_data in those threads will be { [1,2,3,4], [5,6,7,8], ..., [509,510,511,512] }. Furthermore, the value 512 will be stored in block_aggregate for all threads.
ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
[in] | input | Calling thread's input items |
[out] | output | Calling thread's output items (may be aliased to input ) |
[out] | block_aggregate | block-wide aggregate reduction of input items |
|
inline |
Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. Instead of using 0 as the block-wide prefix, the call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the thread block's scan inputs. Also provides every thread with the block-wide block_aggregate of all inputs.
The block_prefix_callback_op functor must implement a member function T operator()(T block_aggregate). The functor's input parameter block_aggregate is the same value also returned by the scan operation. The functor will be invoked by the first warp of threads in the block; however, only the return value from lane0 is applied as the block-wide prefix. The functor can be stateful.
Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose the input d_data is 1, 1, 1, 1, 1, 1, 1, 1, .... The corresponding output for the first segment will be 1, 2, 3, 4, ..., 511, 512. The output for the second segment will be 513, 514, 515, 516, ..., 1023, 1024.
ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
BlockPrefixCallbackOp | [inferred] Call-back functor type having member T operator()(T block_aggregate) |
[in] | input | Calling thread's input items |
[out] | output | Calling thread's output items (may be aliased to input ) |
[in,out] | block_prefix_callback_op | [warp0 only] Call-back functor for specifying a block-wide prefix to be applied to the logical input sequence. |
|
inline |
Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose the set of input thread_data across the block of threads is 0, -1, 2, -3, ..., 126, -127. The corresponding output thread_data in those threads will be 0, 0, 2, 2, ..., 126, 126.
ScanOp | [inferred] Binary scan functor type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input item |
[out] | output | Calling thread's output item (may be aliased to input ) |
[in] | scan_op | Binary scan functor |
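The functor-based variant can be modeled on the host with any type that provides the documented member T operator()(const T &a, const T &b). In the C++ sketch below, `MaxOp` is a hypothetical stand-in for a maximum functor such as cub::Max, and `inclusive_scan_model` is an illustrative name rather than part of the library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical maximum functor with the documented call signature.
struct MaxOp {
    template <typename T>
    T operator()(const T& a, const T& b) const { return (b > a) ? b : a; }
};

// Host-side model of an inclusive scan under an arbitrary binary operator.
// output may alias input: each input is read before its slot is written.
template <typename T, typename ScanOp>
void inclusive_scan_model(const std::vector<T>& input,
                          std::vector<T>& output,
                          ScanOp scan_op) {
    assert(output.size() == input.size());
    if (input.empty()) return;
    T running = input[0];  // no identity value is needed for an inclusive scan
    output[0] = running;
    for (std::size_t i = 1; i < input.size(); ++i) {
        running = scan_op(running, input[i]);
        output[i] = running;
    }
}
```

Applied to 0, -1, 2, -3, ..., 126, -127 with `MaxOp`, the model yields 0, 0, 2, 2, ..., 126, 126, matching the example above.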
|
inline |
Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. Also provides every thread with the block-wide block_aggregate of all inputs.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose the set of input thread_data across the block of threads is 0, -1, 2, -3, ..., 126, -127. The corresponding output thread_data in those threads will be 0, 0, 2, 2, ..., 126, 126. Furthermore, the value 126 will be stored in block_aggregate for all threads.
ScanOp | [inferred] Binary scan functor type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input item |
[out] | output | Calling thread's output item (may be aliased to input ) |
[in] | scan_op | Binary scan functor |
[out] | block_aggregate | block-wide aggregate reduction of input items |
|
inline |
Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. The call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the thread block's scan inputs. Also provides every thread with the block-wide block_aggregate of all inputs.
The block_prefix_callback_op functor must implement a member function T operator()(T block_aggregate). The functor's input parameter block_aggregate is the same value also returned by the scan operation. The functor will be invoked by the first warp of threads in the block; however, only the return value from lane0 is applied as the block-wide prefix. The functor can be stateful.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose the input d_data is 0, -1, 2, -3, 4, -5, .... The corresponding output for the first segment will be 0, 0, 2, 2, ..., 126, 126. The output for the second segment will be 128, 128, 130, 130, ..., 254, 254.
ScanOp | [inferred] Binary scan functor type having member T operator()(const T &a, const T &b) |
BlockPrefixCallbackOp | [inferred] Call-back functor type having member T operator()(T block_aggregate) |
[in] | input | Calling thread's input item |
[out] | output | Calling thread's output item (may be aliased to input ) |
[in] | scan_op | Binary scan functor |
[in,out] | block_prefix_callback_op | [warp0 only] Call-back functor for specifying a block-wide prefix to be applied to the logical input sequence. |
|
inline |
Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements.
Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose the set of input thread_data across the block of threads is { [0,-1,2,-3], [4,-5,6,-7], ..., [508,-509,510,-511] }. The corresponding output thread_data in those threads will be { [0,0,2,2], [4,4,6,6], ..., [508,508,510,510] }.
ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
ScanOp | [inferred] Binary scan functor type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input items |
[out] | output | Calling thread's output items (may be aliased to input ) |
[in] | scan_op | Binary scan functor |
|
inline |
Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide block_aggregate of all inputs.
Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose the set of input thread_data across the block of threads is { [0,-1,2,-3], [4,-5,6,-7], ..., [508,-509,510,-511] }. The corresponding output thread_data in those threads will be { [0,0,2,2], [4,4,6,6], ..., [508,508,510,510] }. Furthermore, the value 510 will be stored in block_aggregate for all threads.
ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
ScanOp | [inferred] Binary scan functor type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input items |
[out] | output | Calling thread's output items (may be aliased to input ) |
[in] | scan_op | Binary scan functor |
[out] | block_aggregate | block-wide aggregate reduction of input items |
|
inline |
Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. The call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the thread block's scan inputs. Also provides every thread with the block-wide block_aggregate of all inputs.
The block_prefix_callback_op functor must implement a member function T operator()(T block_aggregate). The functor's input parameter block_aggregate is the same value also returned by the scan operation. The functor will be invoked by the first warp of threads in the block; however, only the return value from lane0 is applied as the block-wide prefix. The functor can be stateful.
Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Suppose the input d_data is 0, -1, 2, -3, 4, -5, .... The corresponding output for the first segment will be 0, 0, 2, 2, 4, 4, ..., 510, 510. The output for the second segment will be 512, 512, 514, 514, 516, 516, ..., 1022, 1022.
ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
ScanOp | [inferred] Binary scan functor type having member T operator()(const T &a, const T &b) |
BlockPrefixCallbackOp | [inferred] Call-back functor type having member T operator()(T block_aggregate) |
[in] | input | Calling thread's input items |
[out] | output | Calling thread's output items (may be aliased to input ) |
[in] | scan_op | Binary scan functor |
[in,out] | block_prefix_callback_op | [warp0 only] Call-back functor for specifying a block-wide prefix to be applied to the logical input sequence. |