CUB  
Classes | Enumerations
warp_load.cuh File Reference
#include <iterator>
#include <type_traits>
#include <cub/block/block_load.cuh>
#include <cub/config.cuh>
#include <cub/iterator/cache_modified_input_iterator.cuh>
#include <cub/util_ptx.cuh>
#include <cub/util_type.cuh>
#include <cub/warp/warp_exchange.cuh>

Classes

class  WarpLoad< InputT, ITEMS_PER_THREAD, ALGORITHM, LOGICAL_WARP_THREADS, PTX_ARCH >
 The WarpLoad class provides collective data movement methods for loading a linear segment of items from memory into a blocked arrangement across a CUDA warp. More...
 
struct  WarpLoad< InputT, ITEMS_PER_THREAD, ALGORITHM, LOGICAL_WARP_THREADS, PTX_ARCH >::LoadInternal< WARP_LOAD_TRANSPOSE, DUMMY >::_TempStorage
 
struct  WarpLoad< InputT, ITEMS_PER_THREAD, ALGORITHM, LOGICAL_WARP_THREADS, PTX_ARCH >::LoadInternal< WARP_LOAD_TRANSPOSE, DUMMY >::TempStorage
 
struct  WarpLoad< InputT, ITEMS_PER_THREAD, ALGORITHM, LOGICAL_WARP_THREADS, PTX_ARCH >::TempStorage
 The operations exposed by WarpLoad require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse. More...
 

Enumerations

enum  WarpLoadAlgorithm { WARP_LOAD_DIRECT, WARP_LOAD_STRIPED, WARP_LOAD_VECTORIZE, WARP_LOAD_TRANSPOSE }
 cub::WarpLoadAlgorithm enumerates alternative algorithms for cub::WarpLoad to read a linear segment of data from memory into a CUDA warp. More...
 

Detailed Description

Operations for reading linear tiles of data into the CUDA warp.
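
Based on the interface described on this page, the following is a minimal usage sketch of cub::WarpLoad. The kernel name, block dimensions, and tile-offset arithmetic are illustrative assumptions, not part of the library:

```cuda
#include <cub/warp/warp_load.cuh>

// Hypothetical example kernel: each warp loads its own tile of
// WARP_THREADS * ITEMS_PER_THREAD consecutive ints into a blocked
// arrangement (each thread holds ITEMS_PER_THREAD consecutive items).
__global__ void ExampleKernel(int *d_data)
{
    constexpr int WARP_THREADS     = 32;
    constexpr int BLOCK_THREADS    = 128;                        // assumed launch config
    constexpr int ITEMS_PER_THREAD = 4;
    constexpr int WARPS_PER_BLOCK  = BLOCK_THREADS / WARP_THREADS;

    // Specialize WarpLoad for this warp width, item count, and algorithm.
    using WarpLoadT = cub::WarpLoad<int,
                                    ITEMS_PER_THREAD,
                                    cub::WARP_LOAD_TRANSPOSE,
                                    WARP_THREADS>;

    // One TempStorage allocation per warp in the block.
    __shared__ typename WarpLoadT::TempStorage temp_storage[WARPS_PER_BLOCK];

    const int warp_id     = threadIdx.x / WARP_THREADS;
    const int tile_offset = warp_id * WARP_THREADS * ITEMS_PER_THREAD;

    // Collectively load this warp's tile into per-thread registers.
    int thread_data[ITEMS_PER_THREAD];
    WarpLoadT(temp_storage[warp_id]).Load(d_data + tile_offset, thread_data);
}
```

Because WARP_LOAD_TRANSPOSE is selected here, the load is performed in a coalesced striped pattern and then transposed through the shared-memory TempStorage into the blocked arrangement held in thread_data.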

Enumeration Type Documentation

cub::WarpLoadAlgorithm enumerates alternative algorithms for cub::WarpLoad to read a linear segment of data from memory into a CUDA warp.

Enumerator
WARP_LOAD_DIRECT 
Overview

A blocked arrangement of data is read directly from memory.

Performance Considerations
The utilization of memory transactions (coalescing) decreases as the access stride between threads increases (i.e., the number of items per thread).
WARP_LOAD_STRIPED 
Overview

A striped arrangement of data is read directly from memory.

Performance Considerations
The utilization of memory transactions (coalescing) does not depend on the number of items per thread.
WARP_LOAD_VECTORIZE 
Overview

A blocked arrangement of data is read from memory using CUDA's built-in vectorized loads as a coalescing optimization. For example, ld.global.v4.s32 instructions will be generated when T = int and ITEMS_PER_THREAD % 4 == 0.

Performance Considerations
  • The utilization of memory transactions (coalescing) remains high until the access stride between threads (i.e., the number of items per thread) exceeds the maximum vector load width (typically 4 items or 64B, whichever is lower).
  • The following conditions will prevent vectorization and loading will fall back to cub::WARP_LOAD_DIRECT:
    • ITEMS_PER_THREAD is odd
    • The InputIteratorT is not a simple pointer type
    • The block input offset is not quadword-aligned
    • The data type T is not a built-in primitive or CUDA vector type (e.g., short, int2, double, float2, etc.)
WARP_LOAD_TRANSPOSE 
Overview

A striped arrangement of data is read efficiently from memory and then locally transposed into a blocked arrangement.

Performance Considerations
  • The utilization of memory transactions (coalescing) remains high regardless of items loaded per thread.
  • The local reordering incurs slightly higher latency and lower throughput than the direct cub::WARP_LOAD_DIRECT and cub::WARP_LOAD_VECTORIZE alternatives.