CUB  

Classes

class  cub::WarpExchange< InputT, ITEMS_PER_THREAD, LOGICAL_WARP_THREADS, PTX_ARCH >
 The WarpExchange class provides collective methods for rearranging data partitioned across a CUDA warp.
 
class  cub::WarpLoad< InputT, ITEMS_PER_THREAD, ALGORITHM, LOGICAL_WARP_THREADS, PTX_ARCH >
 The WarpLoad class provides collective data movement methods for loading a linear segment of items from memory into a blocked arrangement across a CUDA warp.
 
class  cub::WarpMergeSort< KeyT, ITEMS_PER_THREAD, LOGICAL_WARP_THREADS, ValueT, PTX_ARCH >
 The WarpMergeSort class provides methods for sorting items partitioned across a CUDA warp using a merge sorting method.
 
class  cub::WarpReduce< T, LOGICAL_WARP_THREADS, PTX_ARCH >
 The WarpReduce class provides collective methods for computing a parallel reduction of items partitioned across a CUDA warp.
 
class  cub::WarpScan< T, LOGICAL_WARP_THREADS, PTX_ARCH >
 The WarpScan class provides collective methods for computing a parallel prefix scan of items partitioned across a CUDA warp.
 
class  cub::WarpStore< T, ITEMS_PER_THREAD, ALGORITHM, LOGICAL_WARP_THREADS, PTX_ARCH >
 The WarpStore class provides collective data movement methods for writing a blocked arrangement of items partitioned across a CUDA warp to a linear segment of memory.
 

Functions

template<int LOGICAL_WARP_THREADS, typename T >
__device__ __forceinline__ T cub::ShuffleUp (T input, int src_offset, int first_thread, unsigned int member_mask)
 Shuffle-up for any data type. Each warp lane i obtains the value input contributed by warp lane i - src_offset. For lanes where i - src_offset falls below first_thread (that is, i < src_offset when first_thread is 0), the thread's own input is returned.
 
template<int LOGICAL_WARP_THREADS, typename T >
__device__ __forceinline__ T cub::ShuffleDown (T input, int src_offset, int last_thread, unsigned int member_mask)
 Shuffle-down for any data type. Each warp lane i obtains the value input contributed by warp lane i + src_offset. For lanes where i + src_offset exceeds last_thread, the thread's own input is returned.
 
template<int LOGICAL_WARP_THREADS, typename T >
__device__ __forceinline__ T cub::ShuffleIndex (T input, int src_lane, unsigned int member_mask)
 Shuffle-broadcast for any data type. Each warp lane i obtains the value input contributed by warp lane src_lane. If src_lane < 0 or src_lane >= LOGICAL_WARP_THREADS, the thread's own input is returned.
 

Function Documentation

template<int LOGICAL_WARP_THREADS, typename T >
__device__ __forceinline__ T cub::ShuffleUp ( T  input,
int  src_offset,
int  first_thread,
unsigned int  member_mask 
)

Shuffle-up for any data type. Each warp lane i obtains the value input contributed by warp lane i - src_offset. For lanes where i - src_offset falls below first_thread (that is, i < src_offset when first_thread is 0), the thread's own input is returned.

Template Parameters
LOGICAL_WARP_THREADS  The number of threads per "logical" warp. Must be a power of two <= 32.
T  [inferred] The input/output element type
  • Available only for SM3.0 or newer
Snippet
The code snippet below illustrates each thread obtaining a double value from the predecessor of its predecessor.
#include <cub/cub.cuh> // or equivalently <cub/util_ptx.cuh>
__global__ void ExampleKernel(...)
{
    // Obtain one input item per thread
    double thread_data = ...
    // Obtain item from two ranks below
    double peer_data = cub::ShuffleUp<32>(thread_data, 2, 0, 0xffffffff);
}

Suppose the set of input thread_data across the first warp of threads is {1.0, 2.0, 3.0, 4.0, 5.0, ..., 32.0}. The corresponding output peer_data will be {1.0, 2.0, 1.0, 2.0, 3.0, ..., 30.0}.

Implementation note: the 5-bit SHFL segment mask that logically splits a warp into sub-segments starts 8 bits up in the shuffle control word.
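
The note above is terse, so here is a minimal host-side sketch of how such a control word could be packed. It assumes the PTX shfl operand layout, in which bits 8 to 12 carry the segment mask and bits 0 to 4 carry the clamp value. The helper names shfl_segment_bits, is_valid_logical_warp and shfl_control_word are hypothetical and are not part of CUB.

```cpp
#include <cassert>

// Hypothetical helper (not part of CUB): places the segment mask for a logical
// warp of `logical_warp_threads` lanes into bits 8..12 of a control word.
constexpr unsigned int shfl_segment_bits(int logical_warp_threads)
{
    return static_cast<unsigned int>(32 - logical_warp_threads) << 8;
}

// Hypothetical helper: true when n is a power of two no larger than 32,
// matching the documented constraint on LOGICAL_WARP_THREADS.
constexpr bool is_valid_logical_warp(int n)
{
    return n >= 1 && n <= 32 && (n & (n - 1)) == 0;
}

// Hypothetical helper: combines the segment bits with a 5-bit clamp value
// (assumed layout: clamp in bits 0..4, segment mask in bits 8..12).
inline unsigned int shfl_control_word(int logical_warp_threads, int clamp)
{
    assert(is_valid_logical_warp(logical_warp_threads));
    assert(clamp >= 0 && clamp < 32);
    return shfl_segment_bits(logical_warp_threads) | static_cast<unsigned int>(clamp);
}

static_assert(shfl_segment_bits(32) == 0x0000u, "a full warp needs no segment bits");
static_assert(shfl_segment_bits(16) == 0x1000u, "two 16-lane segments");
static_assert(shfl_segment_bits(8)  == 0x1800u, "four 8-lane segments");
```

Under these assumptions, a full 32-thread warp contributes no segment bits, while a 16-thread logical warp sets bit 12 so that lanes 0 to 15 and lanes 16 to 31 shuffle independently.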

Parameters
[in] input  The value this thread contributes to the shuffle
[in] src_offset  The relative down-offset of the peer to read from
[in] first_thread  Index of first lane in logical warp (typically 0)
[in] member_mask  32-bit mask of participating warp lanes

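
To make the boundary rule concrete, the following host-side reference model reproduces the documented ShuffleUp result for one warp. It is a sketch, not CUB code: it assumes LOGICAL_WARP_THREADS is 32 and that all 32 lanes participate, and shuffle_up_model is a hypothetical name.

```cpp
#include <array>
#include <cassert>

// Host-side reference model (not CUB code). Emulates the documented result of
// ShuffleUp for a single 32-lane warp in which every lane participates.
template <typename T>
std::array<T, 32> shuffle_up_model(const std::array<T, 32>& input,
                                   int src_offset,
                                   int first_thread)
{
    assert(src_offset >= 0);
    assert(first_thread >= 0 && first_thread < 32);

    std::array<T, 32> output{};
    for (int lane = 0; lane < 32; ++lane) {
        const int src = lane - src_offset;
        // A lane whose source would fall below first_thread keeps its own input.
        output[lane] = (src >= first_thread) ? input[src] : input[lane];
    }
    return output;
}
```

Feeding this model the values 1.0 through 32.0 with src_offset 2 and first_thread 0 yields 1.0 and 2.0 unchanged in lanes 0 and 1, then 1.0, 2.0, 3.0 and so on up to 30.0 in lane 31, which matches the snippet above.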
template<int LOGICAL_WARP_THREADS, typename T >
__device__ __forceinline__ T cub::ShuffleDown ( T  input,
int  src_offset,
int  last_thread,
unsigned int  member_mask 
)

Shuffle-down for any data type. Each warp lane i obtains the value input contributed by warp lane i + src_offset. For lanes where i + src_offset exceeds last_thread, the thread's own input is returned.

Template Parameters
LOGICAL_WARP_THREADS  The number of threads per "logical" warp. Must be a power of two <= 32.
T  [inferred] The input/output element type
  • Available only for SM3.0 or newer
Snippet
The code snippet below illustrates each thread obtaining a double value from the successor of its successor.
#include <cub/cub.cuh> // or equivalently <cub/util_ptx.cuh>
__global__ void ExampleKernel(...)
{
    // Obtain one input item per thread
    double thread_data = ...
    // Obtain item from two ranks above
    double peer_data = cub::ShuffleDown<32>(thread_data, 2, 31, 0xffffffff);
}

Suppose the set of input thread_data across the first warp of threads is {1.0, 2.0, 3.0, 4.0, 5.0, ..., 32.0}. The corresponding output peer_data will be {3.0, 4.0, 5.0, 6.0, 7.0, ..., 32.0}.

Implementation note: the 5-bit SHFL segment mask that logically splits a warp into sub-segments starts 8 bits up in the shuffle control word.

Parameters
[in] input  The value this thread contributes to the shuffle
[in] src_offset  The relative up-offset of the peer to read from
[in] last_thread  Index of last thread in logical warp (typically 31 for a 32-thread warp)
[in] member_mask  32-bit mask of participating warp lanes

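
The elided tail of the example output is where the boundary rule matters, so here is a host-side reference model of the documented ShuffleDown result. As a sketch under stated assumptions, it covers only LOGICAL_WARP_THREADS equal to 32 with all lanes participating. The name shuffle_down_model is hypothetical and the code is not taken from CUB.

```cpp
#include <array>
#include <cassert>

// Host-side reference model (not CUB code). Emulates the documented result of
// ShuffleDown for a single 32-lane warp in which every lane participates.
template <typename T>
std::array<T, 32> shuffle_down_model(const std::array<T, 32>& input,
                                     int src_offset,
                                     int last_thread)
{
    assert(src_offset >= 0);
    assert(last_thread >= 0 && last_thread < 32);

    std::array<T, 32> output{};
    for (int lane = 0; lane < 32; ++lane) {
        const int src = lane + src_offset;
        // A lane whose source would lie beyond last_thread keeps its own input.
        output[lane] = (src <= last_thread) ? input[src] : input[lane];
    }
    return output;
}
```

With inputs 1.0 through 32.0, src_offset 2 and last_thread 31, lane 29 receives 32.0, while lanes 30 and 31 have no peer two ranks above and keep their own values, 31.0 and 32.0.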
template<int LOGICAL_WARP_THREADS, typename T >
__device__ __forceinline__ T cub::ShuffleIndex ( T  input,
int  src_lane,
unsigned int  member_mask 
)

Shuffle-broadcast for any data type. Each warp lane i obtains the value input contributed by warp lane src_lane. If src_lane < 0 or src_lane >= LOGICAL_WARP_THREADS, the thread's own input is returned.

Template Parameters
LOGICAL_WARP_THREADS  The number of threads per "logical" warp. Must be a power of two <= 32.
T  [inferred] The input/output element type
  • Available only for SM3.0 or newer
Snippet
The code snippet below illustrates each thread obtaining a double value from warp-lane0.
#include <cub/cub.cuh> // or equivalently <cub/util_ptx.cuh>
__global__ void ExampleKernel(...)
{
    // Obtain one input item per thread
    double thread_data = ...
    // Obtain item from thread 0
    double peer_data = cub::ShuffleIndex<32>(thread_data, 0, 0xffffffff);
}

Suppose the set of input thread_data across the first warp of threads is {1.0, 2.0, 3.0, 4.0, 5.0, ..., 32.0}. The corresponding output peer_data will be {1.0, 1.0, 1.0, 1.0, 1.0, ..., 1.0}.

Implementation note: the 5-bit SHFL segment mask that logically splits a warp into sub-segments starts 8 bits up in the shuffle control word.

Parameters
[in] input  The value to broadcast
[in] src_lane  The warp lane whose value is broadcast
[in] member_mask  32-bit mask of participating warp lanes
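
The following host-side reference model sketches how a broadcast might behave when a warp is split into logical sub-warps. It assumes that each logical warp broadcasts independently, in the same way as the width argument of the CUDA __shfl_sync intrinsic, and it covers only in-range values of src_lane. The name shuffle_index_model is hypothetical and the code is not taken from CUB.

```cpp
#include <array>
#include <cassert>

// Host-side reference model (not CUB code). Emulates a broadcast within each
// logical warp of LOGICAL_WARP_THREADS lanes inside one 32-lane physical warp.
// Assumption: src_lane is an index relative to the start of each logical warp.
template <int LOGICAL_WARP_THREADS, typename T>
std::array<T, 32> shuffle_index_model(const std::array<T, 32>& input, int src_lane)
{
    static_assert(LOGICAL_WARP_THREADS >= 1 && LOGICAL_WARP_THREADS <= 32 &&
                  (LOGICAL_WARP_THREADS & (LOGICAL_WARP_THREADS - 1)) == 0,
                  "LOGICAL_WARP_THREADS must be a power of two <= 32");
    // Only in-range source lanes are modeled here.
    assert(src_lane >= 0 && src_lane < LOGICAL_WARP_THREADS);

    std::array<T, 32> output{};
    for (int lane = 0; lane < 32; ++lane) {
        // First physical lane of the logical warp that contains this lane.
        const int segment_base = lane - (lane % LOGICAL_WARP_THREADS);
        output[lane] = input[segment_base + src_lane];
    }
    return output;
}
```

With inputs 1.0 through 32.0, shuffle_index_model<32>(in, 0) gives 1.0 in every lane, as in the snippet above. Under the stated assumption, shuffle_index_model<16>(in, 3) gives 4.0 in lanes 0 to 15 and 20.0 in lanes 16 to 31.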