CUB
Classes | |
class | WarpExchange< InputT, ITEMS_PER_THREAD, LOGICAL_WARP_THREADS, PTX_ARCH > |
The WarpExchange class provides collective methods for rearranging data partitioned across a CUDA warp. | |
class | WarpLoad< InputT, ITEMS_PER_THREAD, ALGORITHM, LOGICAL_WARP_THREADS, PTX_ARCH > |
The WarpLoad class provides collective data movement methods for loading a linear segment of items from memory into a blocked arrangement across a CUDA warp. | |
class | WarpMergeSort< KeyT, ITEMS_PER_THREAD, LOGICAL_WARP_THREADS, ValueT, PTX_ARCH > |
The WarpMergeSort class provides methods for sorting items partitioned across a CUDA warp using a merge sorting method. | |
class | cub::WarpReduce< T, LOGICAL_WARP_THREADS, PTX_ARCH > |
The WarpReduce class provides collective methods for computing a parallel reduction of items partitioned across a CUDA thread warp. | |
class | cub::WarpScan< T, LOGICAL_WARP_THREADS, PTX_ARCH > |
The WarpScan class provides collective methods for computing a parallel prefix scan of items partitioned across a CUDA thread warp. | |
class | WarpStore< T, ITEMS_PER_THREAD, ALGORITHM, LOGICAL_WARP_THREADS, PTX_ARCH > |
The WarpStore class provides collective data movement methods for writing a blocked arrangement of items partitioned across a CUDA warp to a linear segment of memory. | |
Functions | |
template<int LOGICAL_WARP_THREADS, typename T > | |
__device__ __forceinline__ T | cub::ShuffleUp (T input, int src_offset, int first_thread, unsigned int member_mask) |
Shuffle-up for any data type. Each warp-lane i obtains the value input contributed by warp-lane (i - src_offset). For thread lanes i < src_offset, the thread's own input is returned to the thread. | |
template<int LOGICAL_WARP_THREADS, typename T > | |
__device__ __forceinline__ T | cub::ShuffleDown (T input, int src_offset, int last_thread, unsigned int member_mask) |
Shuffle-down for any data type. Each warp-lane i obtains the value input contributed by warp-lane (i + src_offset). For thread lanes where i + src_offset > last_thread, the thread's own input is returned to the thread. | |
template<int LOGICAL_WARP_THREADS, typename T > | |
__device__ __forceinline__ T | cub::ShuffleIndex (T input, int src_lane, unsigned int member_mask) |
Shuffle-broadcast for any data type. Each warp-lane i obtains the value input contributed by warp-lane src_lane. If src_lane < 0 or src_lane >= WARP_THREADS, the thread's own input is returned to the thread. | |
__device__ __forceinline__ T cub::ShuffleUp (T input, int src_offset, int first_thread, unsigned int member_mask)
Shuffle-up for any data type. Each warp-lane i obtains the value input contributed by warp-lane (i - src_offset). For thread lanes i < src_offset, the thread's own input is returned to the thread.
LOGICAL_WARP_THREADS | The number of threads per "logical" warp. Must be a power-of-two <= 32. |
T | [inferred] The input/output element type |
In this example, each thread obtains a double value from the predecessor of its predecessor. Suppose thread_data across the first warp of threads is {1.0, 2.0, 3.0, 4.0, 5.0, ..., 32.0}. The corresponding output peer_data will be {1.0, 2.0, 1.0, 2.0, 3.0, ..., 30.0}.
Implementation note: the 5-bit SHFL mask for logically splitting warps into sub-segments starts 8 bits up.
[in] | input | The value to broadcast |
[in] | src_offset | The relative down-offset of the peer to read from |
[in] | first_thread | Index of first lane in logical warp (typically 0) |
[in] | member_mask | 32-bit mask of participating warp lanes |
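The call described above can be sketched as follows. This is a minimal illustration, assuming a single full 32-thread warp and a mask of 0xffffffff; the kernel name ShuffleUpKernel and the host helper RunShuffleUp are hypothetical, and CUDA error checking is omitted for brevity.

```cuda
#include <cub/cub.cuh>      // umbrella header; provides cub::ShuffleUp
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: one double per lane, launched with a single 32-thread warp.
__global__ void ShuffleUpKernel(const double* d_in, double* d_out)
{
    // Obtain one input item per thread
    double thread_data = d_in[threadIdx.x];

    // Read the item held two lanes below (src_offset = 2, first_thread = 0).
    // Lanes 0 and 1 have no such peer and get their own input back.
    double peer_data = cub::ShuffleUp<32>(thread_data, 2, 0, 0xffffffffu);

    d_out[threadIdx.x] = peer_data;
}

// Hypothetical host helper: fills the lanes with 1.0 .. 32.0, runs the kernel,
// and returns the 32 output values.
std::vector<double> RunShuffleUp()
{
    std::vector<double> h(32);
    for (int i = 0; i < 32; ++i) h[i] = i + 1.0;

    double* d_in = nullptr;
    double* d_out = nullptr;
    cudaMalloc(&d_in, 32 * sizeof(double));
    cudaMalloc(&d_out, 32 * sizeof(double));
    cudaMemcpy(d_in, h.data(), 32 * sizeof(double), cudaMemcpyHostToDevice);

    ShuffleUpKernel<<<1, 32>>>(d_in, d_out);

    // The blocking copy on the default stream waits for the kernel to finish.
    cudaMemcpy(h.data(), d_out, 32 * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h;
}
```

Calling RunShuffleUp() from a host main and printing the returned vector should reproduce the peer_data sequence given above.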
__device__ __forceinline__ T cub::ShuffleDown (T input, int src_offset, int last_thread, unsigned int member_mask)
Shuffle-down for any data type. Each warp-lane i obtains the value input contributed by warp-lane (i + src_offset). For thread lanes where i + src_offset > last_thread, the thread's own input is returned to the thread.
LOGICAL_WARP_THREADS | The number of threads per "logical" warp. Must be a power-of-two <= 32. |
T | [inferred] The input/output element type |
In this example, each thread obtains a double value from the successor of its successor. Suppose thread_data across the first warp of threads is {1.0, 2.0, 3.0, 4.0, 5.0, ..., 32.0}. The corresponding output peer_data will be {3.0, 4.0, 5.0, 6.0, 7.0, ..., 32.0}.
Implementation note: the 5-bit SHFL mask for logically splitting warps into sub-segments starts 8 bits up.
[in] | input | The value to broadcast |
[in] | src_offset | The relative up-offset of the peer to read from |
[in] | last_thread | Index of last thread in logical warp (typically 31 for a 32-thread warp) |
[in] | member_mask | 32-bit mask of participating warp lanes |
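The call described above can be sketched as follows. This is a minimal illustration, assuming a single full 32-thread warp, last_thread = 31, and a mask of 0xffffffff; the kernel name ShuffleDownKernel and the host helper RunShuffleDown are hypothetical, and CUDA error checking is omitted for brevity.

```cuda
#include <cub/cub.cuh>      // umbrella header; provides cub::ShuffleDown
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: one double per lane, launched with a single 32-thread warp.
__global__ void ShuffleDownKernel(const double* d_in, double* d_out)
{
    // Obtain one input item per thread
    double thread_data = d_in[threadIdx.x];

    // Read the item held two lanes above (src_offset = 2, last_thread = 31).
    // Lanes 30 and 31 would read past lane 31, so they get their own input back.
    double peer_data = cub::ShuffleDown<32>(thread_data, 2, 31, 0xffffffffu);

    d_out[threadIdx.x] = peer_data;
}

// Hypothetical host helper: fills the lanes with 1.0 .. 32.0, runs the kernel,
// and returns the 32 output values.
std::vector<double> RunShuffleDown()
{
    std::vector<double> h(32);
    for (int i = 0; i < 32; ++i) h[i] = i + 1.0;

    double* d_in = nullptr;
    double* d_out = nullptr;
    cudaMalloc(&d_in, 32 * sizeof(double));
    cudaMalloc(&d_out, 32 * sizeof(double));
    cudaMemcpy(d_in, h.data(), 32 * sizeof(double), cudaMemcpyHostToDevice);

    ShuffleDownKernel<<<1, 32>>>(d_in, d_out);

    // The blocking copy on the default stream waits for the kernel to finish.
    cudaMemcpy(h.data(), d_out, 32 * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h;
}
```

In this sketch the last two lanes have no successor two positions up, which is why the sequence above ends with the lanes' own values rather than continuing past 32.0.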
__device__ __forceinline__ T cub::ShuffleIndex (T input, int src_lane, unsigned int member_mask)
Shuffle-broadcast for any data type. Each warp-lane i obtains the value input contributed by warp-lane src_lane. If src_lane < 0 or src_lane >= WARP_THREADS, the thread's own input is returned to the thread.
LOGICAL_WARP_THREADS | The number of threads per "logical" warp. Must be a power-of-two <= 32. |
T | [inferred] The input/output element type |
In this example, each thread obtains a double value from warp-lane 0. Suppose thread_data across the first warp of threads is {1.0, 2.0, 3.0, 4.0, 5.0, ..., 32.0}. The corresponding output peer_data will be {1.0, 1.0, 1.0, 1.0, 1.0, ..., 1.0}.
Implementation note: the 5-bit SHFL mask for logically splitting warps into sub-segments starts 8 bits up.
[in] | input | The value to broadcast |
[in] | src_lane | Which warp lane is to do the broadcasting |
[in] | member_mask | 32-bit mask of participating warp lanes |
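The broadcast described above can be sketched as follows. This is a minimal illustration, assuming a single full 32-thread warp, src_lane = 0, and a mask of 0xffffffff; the kernel name ShuffleIndexKernel and the host helper RunShuffleIndex are hypothetical, and CUDA error checking is omitted for brevity.

```cuda
#include <cub/cub.cuh>      // umbrella header; provides cub::ShuffleIndex
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: one double per lane, launched with a single 32-thread warp.
__global__ void ShuffleIndexKernel(const double* d_in, double* d_out)
{
    // Obtain one input item per thread
    double thread_data = d_in[threadIdx.x];

    // Every lane reads the item held by lane 0 (src_lane = 0).
    double peer_data = cub::ShuffleIndex<32>(thread_data, 0, 0xffffffffu);

    d_out[threadIdx.x] = peer_data;
}

// Hypothetical host helper: fills the lanes with 1.0 .. 32.0, runs the kernel,
// and returns the 32 output values.
std::vector<double> RunShuffleIndex()
{
    std::vector<double> h(32);
    for (int i = 0; i < 32; ++i) h[i] = i + 1.0;

    double* d_in = nullptr;
    double* d_out = nullptr;
    cudaMalloc(&d_in, 32 * sizeof(double));
    cudaMalloc(&d_out, 32 * sizeof(double));
    cudaMemcpy(d_in, h.data(), 32 * sizeof(double), cudaMemcpyHostToDevice);

    ShuffleIndexKernel<<<1, 32>>>(d_in, d_out);

    // The blocking copy on the default stream waits for the kernel to finish.
    cudaMemcpy(h.data(), d_out, 32 * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h;
}
```

Because all lanes name the same source lane, every entry of the returned vector should equal the value lane 0 contributed.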