cub::BlockHistogram< T, BLOCK_DIM_X, ITEMS_PER_THREAD, BINS, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH > Class Template Reference

Detailed description

template< typename T, int BLOCK_DIM_X, int ITEMS_PER_THREAD, int BINS, BlockHistogramAlgorithm ALGORITHM = BLOCK_HISTO_SORT, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
class cub::BlockHistogram< T, BLOCK_DIM_X, ITEMS_PER_THREAD, BINS, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >

The BlockHistogram class provides collective methods for constructing block-wide histograms from data samples partitioned across a CUDA thread block.

Template Parameters
T                 The sample type being histogrammed (must be castable to an integer bin identifier)
BLOCK_DIM_X       The thread block length in threads along the X dimension
ITEMS_PER_THREAD  The number of items per thread
BINS              The number of bins within the histogram
ALGORITHM         [optional] cub::BlockHistogramAlgorithm enumerator specifying the underlying algorithm to use (default: cub::BLOCK_HISTO_SORT)
BLOCK_DIM_Y       [optional] The thread block length in threads along the Y dimension (default: 1)
BLOCK_DIM_Z       [optional] The thread block length in threads along the Z dimension (default: 1)
PTX_ARCH          [optional] The PTX compute capability for which to specialize this collective, formatted as per the __CUDA_ARCH__ macro (e.g., 350 for sm_35). Useful for determining the collective's storage requirements for a given device from the host. (Default: the value of __CUDA_ARCH__ during the current compiler pass)
Overview
  • A histogram counts the number of observations that fall into each of the disjoint categories (known as bins).
  • The T type must be implicitly castable to an integer type.
  • BlockHistogram expects each integral input[i] value to satisfy 0 <= input[i] < BINS. Values outside of this range result in undefined behavior.
  • BlockHistogram can be optionally specialized to use different algorithms (a specialization sketch follows this list):
    1. cub::BLOCK_HISTO_SORT. Sorting followed by differentiation. More...
    2. cub::BLOCK_HISTO_ATOMIC. Use atomic addition to update byte counts directly. More...
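As a hedged illustration (not part of the generated reference), the typedefs below sketch how the ALGORITHM parameter selects between the two strategies; the unsigned char / 128-thread / 4-sample / 256-bin configuration mirrors the examples on this page.

    // Default: sorting followed by differentiation (cub::BLOCK_HISTO_SORT)
    typedef cub::BlockHistogram<unsigned char, 128, 4, 256> SortHistogramT;

    // Explicitly select shared-memory atomic updates instead
    typedef cub::BlockHistogram<unsigned char, 128, 4, 256, cub::BLOCK_HISTO_ATOMIC> AtomicHistogramT;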
Performance Considerations
  • Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads, as in the sketch below.
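For example, cub::BlockLoad can gather a tile of samples into a blocked arrangement before the histogram is built. The following is a minimal sketch, not taken from the library documentation; the kernel name, the d_samples pointer, the one-tile-per-block indexing, and the 128-thread / 4-sample / 256-bin configuration are illustrative assumptions.

    #include <cub/cub.cuh>

    __global__ void HistogramTileKernel(unsigned char *d_samples)   // hypothetical kernel
    {
        // Specialize the collectives for a 1D block of 128 threads with 4 samples each and 256 bins
        typedef cub::BlockLoad<unsigned char, 128, 4, cub::BLOCK_LOAD_TRANSPOSE> BlockLoadT;
        typedef cub::BlockHistogram<unsigned char, 128, 4, 256>                  BlockHistogramT;

        // Separate temporary storage for each collective (no reuse, so no barrier is needed between them)
        __shared__ typename BlockLoadT::TempStorage      load_storage;
        __shared__ typename BlockHistogramT::TempStorage histogram_storage;
        __shared__ unsigned int smem_histogram[256];

        // Gather this block's tile of 512 samples into a blocked arrangement across threads
        unsigned char thread_samples[4];
        BlockLoadT(load_storage).Load(d_samples + blockIdx.x * 128 * 4, thread_samples);

        // Compute the block-wide histogram from the gathered samples
        BlockHistogramT(histogram_storage).Histogram(thread_samples, smem_histogram);
    }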
A Simple Example
Every thread in the block uses the BlockHistogram class by first specializing the BlockHistogram type, then instantiating an instance with parameters for communication, and finally invoking one or more collective member functions.
The code snippet below illustrates a 256-bin histogram of 512 integer samples that are partitioned across 128 threads where each thread owns 4 samples.
#include <cub/cub.cuh>   // or equivalently <cub/block/block_histogram.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize a 256-bin BlockHistogram type for a 1D block of 128 threads having 4 character samples each
    typedef cub::BlockHistogram<unsigned char, 128, 4, 256> BlockHistogram;

    // Allocate shared memory for BlockHistogram
    __shared__ typename BlockHistogram::TempStorage temp_storage;

    // Allocate shared memory for block-wide histogram bin counts
    __shared__ unsigned int smem_histogram[256];

    // Obtain input samples per thread
    unsigned char data[4];
    ...

    // Compute the block-wide histogram
    BlockHistogram(temp_storage).Histogram(data, smem_histogram);
Performance and Usage Considerations
  • All input values must fall within the range [0, BINS); otherwise the behavior is undefined.
  • The histogram output can be constructed in shared or device-accessible memory.
  • See cub::BlockHistogramAlgorithm for performance details regarding algorithmic alternatives
Re-using dynamically allocated shared memory
The following example under the examples/block folder illustrates usage of dynamically allocated shared memory with BlockReduce and how to re-purpose the same memory region: example_block_reduce_dyn_smem.cu

This example can be easily adapted to the storage required by BlockHistogram.
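A minimal sketch of that adaptation is given below; it assumes the 128-thread / 4-sample / 256-bin specialization used elsewhere on this page, and the kernel name is an illustrative assumption.

    __global__ void DynSmemHistogramKernel(unsigned char *d_samples)   // hypothetical kernel
    {
        typedef cub::BlockHistogram<unsigned char, 128, 4, 256> BlockHistogramT;

        // Alias the collective's TempStorage onto dynamically allocated shared memory
        extern __shared__ char dyn_smem[];
        typename BlockHistogramT::TempStorage &temp_storage =
            *reinterpret_cast<typename BlockHistogramT::TempStorage *>(dyn_smem);

        __shared__ unsigned int smem_histogram[256];

        // Obtain input samples per thread
        unsigned char thread_samples[4];
        ...

        // Compute the block-wide histogram using the aliased temporary storage
        BlockHistogramT(temp_storage).Histogram(thread_samples, smem_histogram);
    }

    // Host side: launch with at least sizeof(TempStorage) bytes of dynamic shared memory, e.g.
    // DynSmemHistogramKernel<<<grid_size, 128, sizeof(cub::BlockHistogram<unsigned char, 128, 4, 256>::TempStorage)>>>(d_samples);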

Classes

struct  TempStorage
 The operations exposed by BlockHistogram require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse. More...
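For instance, the nested TempStorage can be union'd with that of another collective used later in the same kernel; the sketch below is illustrative only (cub::BlockReduce and the surrounding declarations are assumptions, not part of this class's documentation).

    __global__ void HistogramThenReduceKernel(...)   // hypothetical kernel
    {
        typedef cub::BlockHistogram<unsigned char, 128, 4, 256> BlockHistogramT;
        typedef cub::BlockReduce<unsigned int, 128>              BlockReduceT;

        // One shared allocation serves both collectives, which run one after the other
        __shared__ union
        {
            typename BlockHistogramT::TempStorage histogram;
            typename BlockReduceT::TempStorage    reduce;
        } temp_storage;

        __shared__ unsigned int smem_histogram[256];

        unsigned char thread_samples[4];
        ...

        // Build the block-wide histogram
        BlockHistogramT(temp_storage.histogram).Histogram(thread_samples, smem_histogram);

        // Barrier before the union'd storage is repurposed by the second collective
        __syncthreads();

        // Illustrative use of the reused storage
        unsigned int block_total = BlockReduceT(temp_storage.reduce).Sum((unsigned int) thread_samples[0]);
    }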
 

Public Methods

Collective constructors
__device__ __forceinline__ BlockHistogram ()
 Collective constructor using a private static allocation of shared memory as temporary storage. More...
 
__device__ __forceinline__ BlockHistogram (TempStorage &temp_storage)
 Collective constructor using the specified memory allocation as temporary storage. More...
 
Histogram operations
template<typename CounterT >
__device__ __forceinline__ void InitHistogram (CounterT histogram[BINS])
 Initialize the shared histogram counters to zero. More...
 
template<typename CounterT >
__device__ __forceinline__ void Histogram (T(&items)[ITEMS_PER_THREAD], CounterT histogram[BINS])
 Constructs a block-wide histogram in shared/device-accessible memory. Each thread contributes an array of input elements. More...
 
template<typename CounterT >
__device__ __forceinline__ void Composite (T(&items)[ITEMS_PER_THREAD], CounterT histogram[BINS])
 Updates an existing block-wide histogram in shared/device-accessible memory. Each thread composites an array of input elements. More...
 

Constructor & Destructor Documentation

template<typename T , int BLOCK_DIM_X, int ITEMS_PER_THREAD, int BINS, BlockHistogramAlgorithm ALGORITHM = BLOCK_HISTO_SORT, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ cub::BlockHistogram< T, BLOCK_DIM_X, ITEMS_PER_THREAD, BINS, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::BlockHistogram ( )
inline

Collective constructor using a private static allocation of shared memory as temporary storage.
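As a hedged illustration, with this constructor no TempStorage needs to be declared because the collective privately allocates its own; the specialization below mirrors the examples elsewhere on this page.

    typedef cub::BlockHistogram<unsigned char, 128, 4, 256> BlockHistogramT;

    __shared__ unsigned int smem_histogram[256];
    unsigned char thread_samples[4];
    ...

    // Temporary storage is allocated privately (statically) by the collective itself
    BlockHistogramT().Histogram(thread_samples, smem_histogram);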

template<typename T , int BLOCK_DIM_X, int ITEMS_PER_THREAD, int BINS, BlockHistogramAlgorithm ALGORITHM = BLOCK_HISTO_SORT, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ cub::BlockHistogram< T, BLOCK_DIM_X, ITEMS_PER_THREAD, BINS, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::BlockHistogram ( TempStorage &temp_storage )
inline

Collective constructor using the specified memory allocation as temporary storage.

Parameters
[in]  temp_storage  Reference to memory allocation having layout type TempStorage

Member Function Documentation

template<typename T , int BLOCK_DIM_X, int ITEMS_PER_THREAD, int BINS, BlockHistogramAlgorithm ALGORITHM = BLOCK_HISTO_SORT, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename CounterT >
__device__ __forceinline__ void cub::BlockHistogram< T, BLOCK_DIM_X, ITEMS_PER_THREAD, BINS, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::InitHistogram ( CounterT  histogram[BINS])
inline

Initialize the shared histogram counters to zero.

Snippet
The code snippet below illustrates the initialization and update of a histogram of 512 integer samples that are partitioned across 128 threads where each thread owns 4 samples.
#include <cub/cub.cuh>   // or equivalently <cub/block/block_histogram.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize a 256-bin BlockHistogram type for a 1D block of 128 threads having 4 character samples each
    typedef cub::BlockHistogram<unsigned char, 128, 4, 256> BlockHistogram;

    // Allocate shared memory for BlockHistogram
    __shared__ typename BlockHistogram::TempStorage temp_storage;

    // Allocate shared memory for block-wide histogram bin counts
    __shared__ unsigned int smem_histogram[256];

    // Obtain input samples per thread
    unsigned char thread_samples[4];
    ...

    // Initialize the block-wide histogram
    BlockHistogram(temp_storage).InitHistogram(smem_histogram);

    // Update the block-wide histogram
    BlockHistogram(temp_storage).Composite(thread_samples, smem_histogram);
Template Parameters
CounterT  [inferred] Histogram counter type
template<typename T , int BLOCK_DIM_X, int ITEMS_PER_THREAD, int BINS, BlockHistogramAlgorithm ALGORITHM = BLOCK_HISTO_SORT, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename CounterT >
__device__ __forceinline__ void cub::BlockHistogram< T, BLOCK_DIM_X, ITEMS_PER_THREAD, BINS, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::Histogram ( T (&items)[ITEMS_PER_THREAD], CounterT histogram[BINS] )
inline

Constructs a block-wide histogram in shared/device-accessible memory. Each thread contributes an array of input elements.

  • Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates a 256-bin histogram of 512 integer samples that are partitioned across 128 threads where each thread owns 4 samples.
#include <cub/cub.cuh>   // or equivalently <cub/block/block_histogram.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize a 256-bin BlockHistogram type for a 1D block of 128 threads having 4 character samples each
    typedef cub::BlockHistogram<unsigned char, 128, 4, 256> BlockHistogram;

    // Allocate shared memory for BlockHistogram
    __shared__ typename BlockHistogram::TempStorage temp_storage;

    // Allocate shared memory for block-wide histogram bin counts
    __shared__ unsigned int smem_histogram[256];

    // Obtain input samples per thread
    unsigned char thread_samples[4];
    ...

    // Compute the block-wide histogram
    BlockHistogram(temp_storage).Histogram(thread_samples, smem_histogram);
Template Parameters
CounterT  [inferred] Histogram counter type
Parameters
[in]   items      Calling thread's input values to histogram
[out]  histogram  Reference to shared/device-accessible memory histogram
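Once the block-wide histogram has been built (and a barrier has made the shared-memory counts visible to all threads), the counts are typically read back or written out. The sketch below is illustrative only; d_histogram is an assumed output pointer and the 128-thread / 256-bin configuration matches the snippet above.

    // Make the shared-memory bin counts visible to all threads before reading them
    __syncthreads();

    // Each of the 128 threads writes two of the 256 bins to global memory
    for (int bin = threadIdx.x; bin < 256; bin += 128)
        d_histogram[blockIdx.x * 256 + bin] = smem_histogram[bin];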
template<typename T , int BLOCK_DIM_X, int ITEMS_PER_THREAD, int BINS, BlockHistogramAlgorithm ALGORITHM = BLOCK_HISTO_SORT, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename CounterT >
__device__ __forceinline__ void cub::BlockHistogram< T, BLOCK_DIM_X, ITEMS_PER_THREAD, BINS, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::Composite ( T (&items)[ITEMS_PER_THREAD], CounterT histogram[BINS] )
inline

Updates an existing block-wide histogram in shared/device-accessible memory. Each thread composites an array of input elements.

  • Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates the initialization and update of a histogram of 512 integer samples that are partitioned across 128 threads where each thread owns 4 samples.
#include <cub/cub.cuh>   // or equivalently <cub/block/block_histogram.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize a 256-bin BlockHistogram type for a 1D block of 128 threads having 4 character samples each
    typedef cub::BlockHistogram<unsigned char, 128, 4, 256> BlockHistogram;

    // Allocate shared memory for BlockHistogram
    __shared__ typename BlockHistogram::TempStorage temp_storage;

    // Allocate shared memory for block-wide histogram bin counts
    __shared__ unsigned int smem_histogram[256];

    // Obtain input samples per thread
    unsigned char thread_samples[4];
    ...

    // Initialize the block-wide histogram
    BlockHistogram(temp_storage).InitHistogram(smem_histogram);

    // Update the block-wide histogram
    BlockHistogram(temp_storage).Composite(thread_samples, smem_histogram);
Template Parameters
CounterT  [inferred] Histogram counter type
Parameters
[in]   items      Calling thread's input values to histogram
[out]  histogram  Reference to shared/device-accessible memory histogram
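Composite also lends itself to accumulating a histogram over several tiles per thread block. The loop below is a hedged sketch: d_samples, num_tiles, the grid-stride tiling, and the 128-thread / 4-sample / 256-bin configuration are illustrative assumptions, not part of this class's documentation.

    typedef cub::BlockHistogram<unsigned char, 128, 4, 256> BlockHistogramT;

    __shared__ typename BlockHistogramT::TempStorage temp_storage;
    __shared__ unsigned int smem_histogram[256];

    // Zero the bin counts once
    BlockHistogramT(temp_storage).InitHistogram(smem_histogram);

    // Accumulate one tile of 128 x 4 samples per iteration
    for (int tile = blockIdx.x; tile < num_tiles; tile += gridDim.x)
    {
        // Load this tile's samples in a blocked arrangement (4 consecutive samples per thread)
        unsigned char thread_samples[4];
        for (int i = 0; i < 4; ++i)
            thread_samples[i] = d_samples[(tile * 128 + threadIdx.x) * 4 + i];

        // Barrier: ensures the initialization (first iteration) or the previous Composite has
        // finished before the collective's temporary storage is reused
        __syncthreads();
        BlockHistogramT(temp_storage).Composite(thread_samples, smem_histogram);
    }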

The documentation for this class was generated from the following file: block_histogram.cuh