template<
typename T,
int BLOCK_DIM_X,
int ITEMS_PER_THREAD,
int BINS,
BlockHistogramAlgorithm ALGORITHM = BLOCK_HISTO_SORT,
int BLOCK_DIM_Y = 1,
int BLOCK_DIM_Z = 1,
int PTX_ARCH = CUB_PTX_ARCH>
class cub::BlockHistogram< T, BLOCK_DIM_X, ITEMS_PER_THREAD, BINS, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >
The BlockHistogram class provides collective methods for constructing block-wide histograms from data samples partitioned across a CUDA thread block.
- Template Parameters
  - T: The sample type being histogrammed (must be castable to an integer bin identifier)
  - BLOCK_DIM_X: The thread block length in threads along the X dimension
  - ITEMS_PER_THREAD: The number of items per thread
  - BINS: The number of bins within the histogram
  - ALGORITHM: [optional] cub::BlockHistogramAlgorithm enumerator specifying the underlying algorithm to use (default: cub::BLOCK_HISTO_SORT)
  - BLOCK_DIM_Y: [optional] The thread block length in threads along the Y dimension (default: 1)
  - BLOCK_DIM_Z: [optional] The thread block length in threads along the Z dimension (default: 1)
  - PTX_ARCH: [optional] The PTX compute capability for which to specialize this collective, formatted as per the CUDA_ARCH macro (e.g., 350 for sm_35). Useful for determining the collective's storage requirements for a given device from the host. (Default: the value of CUDA_ARCH during the current compiler pass)
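For instance, a minimal host-side sketch of this use of PTX_ARCH (the unsigned char/128/4/256 specialization and the sm_35 target are illustrative assumptions, not requirements):
#include <cub/block/block_histogram.cuh>
#include <cstdio>

int main()
{
    // Specialize explicitly for sm_35 so the storage footprint can be inspected from host code
    typedef cub::BlockHistogram<unsigned char, 128, 4, 256, cub::BLOCK_HISTO_SORT, 1, 1, 350> BlockHistogram350;
    std::printf("TempStorage bytes for sm_35: %zu\n", sizeof(BlockHistogram350::TempStorage));
    return 0;
}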
- Overview
- A histogram counts the number of observations that fall into each of the disjoint categories (known as bins).
- The T type must be implicitly castable to an integer type.
- BlockHistogram expects each integral input[i] value to satisfy 0 <= input[i] < BINS. Values outside of this range result in undefined behavior.
- BlockHistogram can be optionally specialized to use different algorithms:
  - cub::BLOCK_HISTO_SORT: Sorting followed by differentiation.
  - cub::BLOCK_HISTO_ATOMIC: Use atomic addition to update byte counts directly.
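For example, selecting the atomic variant only changes the ALGORITHM template argument (the unsigned char/128/4/256 specialization below is illustrative):
// Histogram updates are performed with atomic additions rather than sort-and-differentiate
typedef cub::BlockHistogram<unsigned char, 128, 4, 256, cub::BLOCK_HISTO_ATOMIC> BlockHistogramAtomic;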
- Performance Considerations
- Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads (a sketch combining BlockLoad with BlockHistogram follows the example below).
- A Simple Example
- Every thread in the block uses the BlockHistogram class by first specializing the BlockHistogram type, then instantiating an instance with parameters for communication, and finally invoking one or more collective member functions.
- The code snippet below illustrates a 256-bin histogram of 512 integer samples that are partitioned across 128 threads where each thread owns 4 samples.
__global__ void ExampleKernel(...)
{
    // Specialize a 256-bin BlockHistogram type for a 1D block of 128 threads having 4 character samples each
    typedef cub::BlockHistogram<unsigned char, 128, 4, 256> BlockHistogram;
    // Allocate shared memory for BlockHistogram and for the block-wide histogram bin counts
    __shared__ typename BlockHistogram::TempStorage temp_storage;
    __shared__ unsigned int smem_histogram[256];
    // Obtain input samples per thread
    unsigned char data[4];
    ...
    // Compute the block-wide histogram
    BlockHistogram(temp_storage).Histogram(data, smem_histogram);
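The sample-loading step above is elided. One plausible way to fill it in, following the cub::BlockLoad suggestion under Performance Considerations, is sketched below (assuming <cub/block/block_load.cuh> and <cub/block/block_histogram.cuh> are included); the d_samples kernel parameter and the BLOCK_LOAD_TRANSPOSE choice are illustrative assumptions rather than part of the library's own example.
__global__ void ExampleKernel(unsigned char *d_samples)
{
    // Specialize collectives for a 1D block of 128 threads owning 4 character samples each
    typedef cub::BlockLoad<unsigned char, 128, 4, cub::BLOCK_LOAD_TRANSPOSE> BlockLoad;
    typedef cub::BlockHistogram<unsigned char, 128, 4, 256>                  BlockHistogram;
    // Union the collectives' temporary storage so the same shared memory region is reused
    __shared__ union {
        typename BlockLoad::TempStorage      load;
        typename BlockHistogram::TempStorage histo;
    } temp_storage;
    __shared__ unsigned int smem_histogram[256];
    // Gather a blocked arrangement of this block's 512-sample tile across threads
    unsigned char thread_samples[4];
    BlockLoad(temp_storage.load).Load(d_samples + blockIdx.x * 512, thread_samples);
    // Barrier before the union'd storage is repurposed by the histogram collective
    __syncthreads();
    // Compute the block-wide histogram
    BlockHistogram(temp_storage.histo).Histogram(thread_samples, smem_histogram);
}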
- Performance and Usage Considerations
- All input values must fall between [0, BINS), or behavior is undefined.
- The histogram output can be constructed in shared or device-accessible memory.
- See cub::BlockHistogramAlgorithm for performance details regarding algorithmic alternatives.
- Re-using dynamically allocated shared memory
- The following example under the examples/block folder illustrates usage of dynamically allocated shared memory with BlockReduce and how to re-purpose the same memory region: example_block_reduce_dyn_smem.cu
  This example can be easily adapted to the storage required by BlockHistogram.
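A minimal sketch of that adaptation for BlockHistogram (the unsigned char/128/4/256 specialization, the kernel name, and the sample values are illustrative assumptions; the launch must supply at least sizeof(BlockHistogram::TempStorage) bytes of dynamic shared memory):
__global__ void DynSmemExampleKernel()
{
    typedef cub::BlockHistogram<unsigned char, 128, 4, 256> BlockHistogram;
    // Alias the collective's temporary storage onto dynamically allocated shared memory,
    // e.g. launched as DynSmemExampleKernel<<<grid, 128, sizeof(typename BlockHistogram::TempStorage)>>>();
    extern __shared__ char dyn_smem[];
    typename BlockHistogram::TempStorage &temp_storage =
        *reinterpret_cast<typename BlockHistogram::TempStorage *>(dyn_smem);
    __shared__ unsigned int smem_histogram[256];
    // Obtain input samples per thread (derived from thread rank purely for illustration)
    unsigned char thread_samples[4];
    for (int i = 0; i < 4; ++i)
        thread_samples[i] = static_cast<unsigned char>(threadIdx.x + i);
    // Compute the block-wide histogram using the dynamically allocated temporary storage
    BlockHistogram(temp_storage).Histogram(thread_samples, smem_histogram);
}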
Data Structures

struct TempStorage
    The operations exposed by BlockHistogram require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse.

Public Member Functions

__device__ __forceinline__ BlockHistogram()
    Collective constructor using a private static allocation of shared memory as temporary storage.

__device__ __forceinline__ BlockHistogram(TempStorage &temp_storage)
    Collective constructor using the specified memory allocation as temporary storage.

template<typename CounterT>
__device__ __forceinline__ void InitHistogram(CounterT histogram[BINS])
    Initialize the shared histogram counters to zero.

template<typename CounterT>
__device__ __forceinline__ void Histogram(T (&items)[ITEMS_PER_THREAD], CounterT histogram[BINS])
    Constructs a block-wide histogram in shared/device-accessible memory. Each thread contributes an array of input elements.

template<typename CounterT>
__device__ __forceinline__ void Composite(T (&items)[ITEMS_PER_THREAD], CounterT histogram[BINS])
    Updates an existing block-wide histogram in shared/device-accessible memory. Each thread composites an array of input elements.
template<typename T, int BLOCK_DIM_X, int ITEMS_PER_THREAD, int BINS, BlockHistogramAlgorithm ALGORITHM = BLOCK_HISTO_SORT, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ cub::BlockHistogram< T, BLOCK_DIM_X, ITEMS_PER_THREAD, BINS, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::BlockHistogram()

Collective constructor using a private static allocation of shared memory as temporary storage.

template<typename T, int BLOCK_DIM_X, int ITEMS_PER_THREAD, int BINS, BlockHistogramAlgorithm ALGORITHM = BLOCK_HISTO_SORT, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
__device__ __forceinline__ cub::BlockHistogram< T, BLOCK_DIM_X, ITEMS_PER_THREAD, BINS, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::BlockHistogram(TempStorage &temp_storage)

Collective constructor using the specified memory allocation as temporary storage.
- Parameters
  - [in] temp_storage: Reference to memory allocation having layout type TempStorage
template<typename T, int BLOCK_DIM_X, int ITEMS_PER_THREAD, int BINS, BlockHistogramAlgorithm ALGORITHM = BLOCK_HISTO_SORT, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename CounterT>
__device__ __forceinline__ void cub::BlockHistogram< T, BLOCK_DIM_X, ITEMS_PER_THREAD, BINS, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::InitHistogram(CounterT histogram[BINS])
Initialize the shared histogram counters to zero.
- Snippet
- The code snippet below illustrates the initialization and update of a histogram of 512 integer samples that are partitioned across 128 threads where each thread owns 4 samples.
__global__ void ExampleKernel(...)
{
    // Specialize a 256-bin BlockHistogram type for a 1D block of 128 threads having 4 character samples each
    typedef cub::BlockHistogram<unsigned char, 128, 4, 256> BlockHistogram;
    // Allocate shared memory for BlockHistogram and for the block-wide histogram bin counts
    __shared__ typename BlockHistogram::TempStorage temp_storage;
    __shared__ unsigned int smem_histogram[256];
    // Obtain input samples per thread
    unsigned char thread_samples[4];
    ...
    // Initialize the block-wide histogram, barrier so all bins are zeroed, then composite the thread samples
    BlockHistogram(temp_storage).InitHistogram(smem_histogram);
    __syncthreads();
    BlockHistogram(temp_storage).Composite(thread_samples, smem_histogram);
- Template Parameters
  - CounterT: [inferred] Histogram counter type
template<typename T, int BLOCK_DIM_X, int ITEMS_PER_THREAD, int BINS, BlockHistogramAlgorithm ALGORITHM = BLOCK_HISTO_SORT, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename CounterT>
__device__ __forceinline__ void cub::BlockHistogram< T, BLOCK_DIM_X, ITEMS_PER_THREAD, BINS, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::Histogram(T (&items)[ITEMS_PER_THREAD], CounterT histogram[BINS])
Constructs a block-wide histogram in shared/device-accessible memory. Each thread contributes an array of input elements.
- Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
- A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
- Snippet
- The code snippet below illustrates a 256-bin histogram of 512 integer samples that are partitioned across 128 threads where each thread owns 4 samples.
__global__ void ExampleKernel(...)
{
    // Specialize a 256-bin BlockHistogram type for a 1D block of 128 threads having 4 character samples each
    typedef cub::BlockHistogram<unsigned char, 128, 4, 256> BlockHistogram;
    // Allocate shared memory for BlockHistogram and for the block-wide histogram bin counts
    __shared__ typename BlockHistogram::TempStorage temp_storage;
    __shared__ unsigned int smem_histogram[256];
    // Obtain input samples per thread
    unsigned char thread_samples[4];
    ...
    // Compute the block-wide histogram
    BlockHistogram(temp_storage).Histogram(thread_samples, smem_histogram);
- Template Parameters
  - CounterT: [inferred] Histogram counter type
- Parameters
  - [in] items: Calling thread's input values to histogram
  - [out] histogram: Reference to shared/device-accessible memory histogram
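A related usage note: because the bin counters in shared memory are written by many threads, a __syncthreads() barrier is needed before any thread reads back arbitrary bins of the shared histogram. The sketch below extends the snippet into a complete kernel that publishes the per-block counts to global memory; the d_samples and d_histogram parameters, the indexing, and the 128-thread launch shape are illustrative assumptions.
__global__ void ExampleHistogramToGlobal(unsigned char *d_samples, unsigned int *d_histogram)
{
    typedef cub::BlockHistogram<unsigned char, 128, 4, 256> BlockHistogram;
    __shared__ typename BlockHistogram::TempStorage temp_storage;
    __shared__ unsigned int smem_histogram[256];
    // Load a blocked arrangement: 4 consecutive samples per thread from this block's 512-sample tile
    unsigned char thread_samples[4];
    for (int i = 0; i < 4; ++i)
        thread_samples[i] = d_samples[blockIdx.x * 512 + threadIdx.x * 4 + i];
    // Compute the block-wide histogram
    BlockHistogram(temp_storage).Histogram(thread_samples, smem_histogram);
    // Barrier: make every thread's bin updates visible before any thread reads them
    __syncthreads();
    // Each thread writes a strided share of the 256 bin counts to this block's output segment
    for (int bin = threadIdx.x; bin < 256; bin += blockDim.x)
        d_histogram[blockIdx.x * 256 + bin] = smem_histogram[bin];
}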
template<typename T, int BLOCK_DIM_X, int ITEMS_PER_THREAD, int BINS, BlockHistogramAlgorithm ALGORITHM = BLOCK_HISTO_SORT, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
template<typename CounterT>
__device__ __forceinline__ void cub::BlockHistogram< T, BLOCK_DIM_X, ITEMS_PER_THREAD, BINS, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >::Composite(T (&items)[ITEMS_PER_THREAD], CounterT histogram[BINS])
Updates an existing block-wide histogram in shared/device-accessible memory. Each thread composites an array of input elements.
- Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.
- A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
- Snippet
- The code snippet below illustrates the initialization and update of a histogram of 512 integer samples that are partitioned across 128 threads where each thread owns 4 samples.
__global__ void ExampleKernel(...)
{
    // Specialize a 256-bin BlockHistogram type for a 1D block of 128 threads having 4 character samples each
    typedef cub::BlockHistogram<unsigned char, 128, 4, 256> BlockHistogram;
    // Allocate shared memory for BlockHistogram and for the block-wide histogram bin counts
    __shared__ typename BlockHistogram::TempStorage temp_storage;
    __shared__ unsigned int smem_histogram[256];
    // Obtain input samples per thread
    unsigned char thread_samples[4];
    ...
    // Initialize the block-wide histogram, barrier so all bins are zeroed, then composite the thread samples
    BlockHistogram(temp_storage).InitHistogram(smem_histogram);
    __syncthreads();
    BlockHistogram(temp_storage).Composite(thread_samples, smem_histogram);
- Template Parameters
  - CounterT: [inferred] Histogram counter type
- Parameters
  - [in] items: Calling thread's input values to histogram
  - [out] histogram: Reference to shared/device-accessible memory histogram
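Because Composite() does not zero the bin counters first, it is the natural primitive for accumulating a histogram over multiple input tiles: initialize once, then composite repeatedly. The sketch below illustrates that pattern; the grid-stride tiling, the d_samples/num_tiles/d_histogram parameters, and the final atomic write-out are illustrative assumptions.
#include <cub/block/block_histogram.cuh>

__global__ void MultiTileHistogramKernel(unsigned char *d_samples, int num_tiles, unsigned int *d_histogram)
{
    typedef cub::BlockHistogram<unsigned char, 128, 4, 256> BlockHistogram;
    __shared__ typename BlockHistogram::TempStorage temp_storage;
    __shared__ unsigned int smem_histogram[256];

    // Zero the bin counters once, before any compositing
    BlockHistogram(temp_storage).InitHistogram(smem_histogram);

    // Accumulate one 512-sample tile per iteration into the same shared histogram
    for (int tile = blockIdx.x; tile < num_tiles; tile += gridDim.x)
    {
        unsigned char thread_samples[4];
        for (int i = 0; i < 4; ++i)
            thread_samples[i] = d_samples[tile * 512 + threadIdx.x * 4 + i];

        // Barrier so the bins are zeroed (first iteration) and prior updates/temp_storage use are complete
        __syncthreads();
        BlockHistogram(temp_storage).Composite(thread_samples, smem_histogram);
    }

    // Publish this block's accumulated counts (assumes d_histogram was zero-initialized by the host)
    __syncthreads();
    for (int bin = threadIdx.x; bin < 256; bin += blockDim.x)
        atomicAdd(&d_histogram[bin], smem_histogram[bin]);
}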