CUB
|
The WarpScan class provides collective methods for computing a parallel prefix scan of items partitioned across a CUDA thread warp.
T | The scan input/output element type |
LOGICAL_WARP_THREADS | [optional] The number of threads per "logical" warp (may be less than the number of hardware warp threads). Default is the warp size associated with the CUDA Compute Capability targeted by the compiler (e.g., 32 threads for SM20). |
PTX_ARCH | [optional] The PTX compute capability for which to to specialize this collective, formatted as per the CUDA_ARCH macro (e.g., 350 for sm_35). Useful for determining the collective's storage requirements for a given device from the host. (Default: the value of CUDA_ARCH during the current compiler pass) |
LOGICAL_WARP_THREADS
SHFL
)LOGICAL_WARP_THREADS
thread_data
across the block of threads is {1, 1, 1, 1, ...}
. The corresponding output thread_data
in each of the four warps of threads will be 0, 1, 2, 3, ..., 31}
.thread_data
across the warp of threads is {1, 1, 1, 1, ...}
. The corresponding output thread_data
will be {0, 1, 2, 3, ..., 31}
. Classes | |
struct | TempStorage |
The operations exposed by WarpScan require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union 'd with other storage allocation types to facilitate memory reuse. More... | |
Public Methods | |
Collective constructors | |
__device__ __forceinline__ | WarpScan (TempStorage &temp_storage) |
Collective constructor using the specified memory allocation as temporary storage. Logical warp and lane identifiers are constructed from threadIdx.x . More... | |
Inclusive prefix sums | |
__device__ __forceinline__ void | InclusiveSum (T input, T &inclusive_output) |
Computes an inclusive prefix sum across the calling warp. More... | |
__device__ __forceinline__ void | InclusiveSum (T input, T &inclusive_output, T &warp_aggregate) |
Computes an inclusive prefix sum across the calling warp. Also provides every thread with the warp-wide warp_aggregate of all inputs. More... | |
Exclusive prefix sums | |
__device__ __forceinline__ void | ExclusiveSum (T input, T &exclusive_output) |
Computes an exclusive prefix sum across the calling warp. The value of 0 is applied as the initial value, and is assigned to exclusive_output in thread0. More... | |
__device__ __forceinline__ void | ExclusiveSum (T input, T &exclusive_output, T &warp_aggregate) |
Computes an exclusive prefix sum across the calling warp. The value of 0 is applied as the initial value, and is assigned to exclusive_output in thread0. Also provides every thread with the warp-wide warp_aggregate of all inputs. More... | |
Inclusive prefix scans | |
template<typename ScanOp > | |
__device__ __forceinline__ void | InclusiveScan (T input, T &inclusive_output, ScanOp scan_op) |
Computes an inclusive prefix scan using the specified binary scan functor across the calling warp. More... | |
template<typename ScanOp > | |
__device__ __forceinline__ void | InclusiveScan (T input, T &inclusive_output, ScanOp scan_op, T &warp_aggregate) |
Computes an inclusive prefix scan using the specified binary scan functor across the calling warp. Also provides every thread with the warp-wide warp_aggregate of all inputs. More... | |
Exclusive prefix scans | |
template<typename ScanOp > | |
__device__ __forceinline__ void | ExclusiveScan (T input, T &exclusive_output, ScanOp scan_op) |
Computes an exclusive prefix scan using the specified binary scan functor across the calling warp. Because no initial value is supplied, the output computed for warp-lane0 is undefined. More... | |
template<typename ScanOp > | |
__device__ __forceinline__ void | ExclusiveScan (T input, T &exclusive_output, T initial_value, ScanOp scan_op) |
Computes an exclusive prefix scan using the specified binary scan functor across the calling warp. More... | |
template<typename ScanOp > | |
__device__ __forceinline__ void | ExclusiveScan (T input, T &exclusive_output, ScanOp scan_op, T &warp_aggregate) |
Computes an exclusive prefix scan using the specified binary scan functor across the calling warp. Because no initial value is supplied, the output computed for warp-lane0 is undefined. Also provides every thread with the warp-wide warp_aggregate of all inputs. More... | |
template<typename ScanOp > | |
__device__ __forceinline__ void | ExclusiveScan (T input, T &exclusive_output, T initial_value, ScanOp scan_op, T &warp_aggregate) |
Computes an exclusive prefix scan using the specified binary scan functor across the calling warp. Also provides every thread with the warp-wide warp_aggregate of all inputs. More... | |
Combination (inclusive & exclusive) prefix scans | |
template<typename ScanOp > | |
__device__ __forceinline__ void | Scan (T input, T &inclusive_output, T &exclusive_output, ScanOp scan_op) |
Computes both inclusive and exclusive prefix scans using the specified binary scan functor across the calling warp. Because no initial value is supplied, the exclusive_output computed for warp-lane0 is undefined. More... | |
template<typename ScanOp > | |
__device__ __forceinline__ void | Scan (T input, T &inclusive_output, T &exclusive_output, T initial_value, ScanOp scan_op) |
Computes both inclusive and exclusive prefix scans using the specified binary scan functor across the calling warp. More... | |
Data exchange | |
__device__ __forceinline__ T | Broadcast (T input, unsigned int src_lane) |
Broadcast the value input from warp-lanesrc_lane to all lanes in the warp. More... | |
|
inline |
Collective constructor using the specified memory allocation as temporary storage. Logical warp and lane identifiers are constructed from threadIdx.x
.
[in] | temp_storage | Reference to memory allocation having layout type TempStorage |
|
inline |
Computes an inclusive prefix sum across the calling warp.
__syncthreads()
threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage
) is to be reused or repurposed.thread_data
across the block of threads is {1, 1, 1, 1, ...}
. The corresponding output thread_data
in each of the four warps of threads will be 1, 2, 3, ..., 32}
. [in] | input | Calling thread's input item. |
[out] | inclusive_output | Calling thread's output item. May be aliased with input . |
|
inline |
Computes an inclusive prefix sum across the calling warp. Also provides every thread with the warp-wide warp_aggregate
of all inputs.
__syncthreads()
threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage
) is to be reused or repurposed.thread_data
across the block of threads is {1, 1, 1, 1, ...}
. The corresponding output thread_data
in each of the four warps of threads will be 1, 2, 3, ..., 32}
. Furthermore, warp_aggregate
for all threads in all warps will be 32
. [in] | input | Calling thread's input item. |
[out] | inclusive_output | Calling thread's output item. May be aliased with input . |
[out] | warp_aggregate | Warp-wide aggregate reduction of input items. |
|
inline |
Computes an exclusive prefix sum across the calling warp. The value of 0 is applied as the initial value, and is assigned to exclusive_output
in thread0.
T
's default constructor (or by zero-initialization if no user-defined default constructor exists) is suitable as the identity value zero for addition.__syncthreads()
threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage
) is to be reused or repurposed.thread_data
across the block of threads is {1, 1, 1, 1, ...}
. The corresponding output thread_data
in each of the four warps of threads will be 0, 1, 2, ..., 31}
. [in] | input | Calling thread's input item. |
[out] | exclusive_output | Calling thread's output item. May be aliased with input . |
|
inline |
Computes an exclusive prefix sum across the calling warp. The value of 0 is applied as the initial value, and is assigned to exclusive_output
in thread0. Also provides every thread with the warp-wide warp_aggregate
of all inputs.
T
's default constructor (or by zero-initialization if no user-defined default constructor exists) is suitable as the identity value zero for addition.__syncthreads()
threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage
) is to be reused or repurposed.thread_data
across the block of threads is {1, 1, 1, 1, ...}
. The corresponding output thread_data
in each of the four warps of threads will be 0, 1, 2, ..., 31}
. Furthermore, warp_aggregate
for all threads in all warps will be 32
. [in] | input | Calling thread's input item. |
[out] | exclusive_output | Calling thread's output item. May be aliased with input . |
[out] | warp_aggregate | Warp-wide aggregate reduction of input items. |
|
inline |
Computes an inclusive prefix scan using the specified binary scan functor across the calling warp.
__syncthreads()
threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage
) is to be reused or repurposed.thread_data
across the block of threads is {0, -1, 2, -3, ..., 126, -127}
. The corresponding output thread_data
in the first warp would be 0, 0, 2, 2, ..., 30, 30
, the output for the second warp would be 32, 32, 34, 34, ..., 62, 62
, etc.ScanOp | [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input item. |
[out] | inclusive_output | Calling thread's output item. May be aliased with input . |
[in] | scan_op | Binary scan operator |
|
inline |
Computes an inclusive prefix scan using the specified binary scan functor across the calling warp. Also provides every thread with the warp-wide warp_aggregate
of all inputs.
__syncthreads()
threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage
) is to be reused or repurposed.thread_data
across the block of threads is {0, -1, 2, -3, ..., 126, -127}
. The corresponding output thread_data
in the first warp would be 0, 0, 2, 2, ..., 30, 30
, the output for the second warp would be 32, 32, 34, 34, ..., 62, 62
, etc. Furthermore, warp_aggregate
would be assigned 30
for threads in the first warp, 62
for threads in the second warp, etc.ScanOp | [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input item. |
[out] | inclusive_output | Calling thread's output item. May be aliased with input . |
[in] | scan_op | Binary scan operator |
[out] | warp_aggregate | Warp-wide aggregate reduction of input items. |
|
inline |
Computes an exclusive prefix scan using the specified binary scan functor across the calling warp. Because no initial value is supplied, the output
computed for warp-lane0 is undefined.
__syncthreads()
threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage
) is to be reused or repurposed.thread_data
across the block of threads is {0, -1, 2, -3, ..., 126, -127}
. The corresponding output thread_data
in the first warp would be ?, 0, 0, 2, ..., 28, 30
, the output for the second warp would be ?, 32, 32, 34, ..., 60, 62
, etc. (The output thread_data
in warp lane0 is undefined.)ScanOp | [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input item. |
[out] | exclusive_output | Calling thread's output item. May be aliased with input . |
[in] | scan_op | Binary scan operator |
|
inline |
Computes an exclusive prefix scan using the specified binary scan functor across the calling warp.
__syncthreads()
threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage
) is to be reused or repurposed.thread_data
across the block of threads is {0, -1, 2, -3, ..., 126, -127}
. The corresponding output thread_data
in the first warp would be INT_MIN, 0, 0, 2, ..., 28, 30
, the output for the second warp would be 30, 32, 32, 34, ..., 60, 62
, etc.ScanOp | [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input item. |
[out] | exclusive_output | Calling thread's output item. May be aliased with input . |
[in] | initial_value | Initial value to seed the exclusive scan |
[in] | scan_op | Binary scan operator |
|
inline |
Computes an exclusive prefix scan using the specified binary scan functor across the calling warp. Because no initial value is supplied, the output
computed for warp-lane0 is undefined. Also provides every thread with the warp-wide warp_aggregate
of all inputs.
__syncthreads()
threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage
) is to be reused or repurposed.thread_data
across the block of threads is {0, -1, 2, -3, ..., 126, -127}
. The corresponding output thread_data
in the first warp would be ?, 0, 0, 2, ..., 28, 30
, the output for the second warp would be ?, 32, 32, 34, ..., 60, 62
, etc. (The output thread_data
in warp lane0 is undefined.) Furthermore, warp_aggregate
would be assigned 30
for threads in the first warp, 62
for threads in the second warp, etc.ScanOp | [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input item. |
[out] | exclusive_output | Calling thread's output item. May be aliased with input . |
[in] | scan_op | Binary scan operator |
[out] | warp_aggregate | Warp-wide aggregate reduction of input items. |
|
inline |
Computes an exclusive prefix scan using the specified binary scan functor across the calling warp. Also provides every thread with the warp-wide warp_aggregate
of all inputs.
__syncthreads()
threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage
) is to be reused or repurposed.thread_data
across the block of threads is {0, -1, 2, -3, ..., 126, -127}
. The corresponding output thread_data
in the first warp would be INT_MIN, 0, 0, 2, ..., 28, 30
, the output for the second warp would be 30, 32, 32, 34, ..., 60, 62
, etc. Furthermore, warp_aggregate
would be assigned 30
for threads in the first warp, 62
for threads in the second warp, etc.ScanOp | [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input item. |
[out] | exclusive_output | Calling thread's output item. May be aliased with input . |
[in] | initial_value | Initial value to seed the exclusive scan |
[in] | scan_op | Binary scan operator |
[out] | warp_aggregate | Warp-wide aggregate reduction of input items. |
|
inline |
Computes both inclusive and exclusive prefix scans using the specified binary scan functor across the calling warp. Because no initial value is supplied, the exclusive_output
computed for warp-lane0 is undefined.
__syncthreads()
threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage
) is to be reused or repurposed.thread_data
across the block of threads is {0, -1, 2, -3, ..., 126, -127}
. The corresponding output inclusive_partial
in the first warp would be 0, 0, 2, 2, ..., 30, 30
, the output for the second warp would be 32, 32, 34, 34, ..., 62, 62
, etc. The corresponding output exclusive_partial
in the first warp would be ?, 0, 0, 2, ..., 28, 30
, the output for the second warp would be ?, 32, 32, 34, ..., 60, 62
, etc. (The output thread_data
in warp lane0 is undefined.)ScanOp | [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input item. |
[out] | inclusive_output | Calling thread's inclusive-scan output item. |
[out] | exclusive_output | Calling thread's exclusive-scan output item. |
[in] | scan_op | Binary scan operator |
|
inline |
Computes both inclusive and exclusive prefix scans using the specified binary scan functor across the calling warp.
__syncthreads()
threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage
) is to be reused or repurposed.thread_data
across the block of threads is {0, -1, 2, -3, ..., 126, -127}
. The corresponding output inclusive_partial
in the first warp would be 0, 0, 2, 2, ..., 30, 30
, the output for the second warp would be 32, 32, 34, 34, ..., 62, 62
, etc. The corresponding output exclusive_partial
in the first warp would be INT_MIN, 0, 0, 2, ..., 28, 30
, the output for the second warp would be 30, 32, 32, 34, ..., 60, 62
, etc.ScanOp | [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) |
[in] | input | Calling thread's input item. |
[out] | inclusive_output | Calling thread's inclusive-scan output item. |
[out] | exclusive_output | Calling thread's exclusive-scan output item. |
[in] | initial_value | Initial value to seed the exclusive scan |
[in] | scan_op | Binary scan operator |
|
inline |
Broadcast the value input
from warp-lanesrc_lane
to all lanes in the warp.
__syncthreads()
threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage
) is to be reused or repurposed.thread_data
across the block of threads is {0, 1, 2, 3, ..., 127}
. The corresponding output thread_data
will be {0, 0, ..., 0}
in warp0, {32, 32, ..., 32}
in warp1, {64, 64, ..., 64}
in warp2, etc. [in] | input | The value to broadcast |
[in] | src_lane | Which warp lane is to do the broadcasting |