CUB  
 All Classes Namespaces Files Functions Variables Typedefs Enumerations Enumerator Friends Macros Groups Pages
Static Public Methods | List of all members
cub::DeviceRunLengthEncode Struct Reference

Detailed description

DeviceRunLengthEncode provides device-wide, parallel operations for demarcating "runs" of same-valued items within a sequence residing within device-accessible memory.

run_length_encode_logo.png
.
Overview
A run-length encoding computes a simple compressed representation of a sequence of input elements such that each maximal "run" of consecutive same-valued data items is encoded as a single data value along with a count of the elements in that run.
Usage Considerations
  • Dynamic parallelism. DeviceRunLengthEncode methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported.
Performance
The work-complexity of run-length encode as a function of input size is linear, resulting in performance throughput that plateaus with problem sizes large enough to saturate the GPU.
The following chart illustrates DeviceRunLengthEncode::RunLengthEncode performance across different CUDA architectures for int32 items. Segments have lengths uniformly sampled from [1,1000].
rle_int32_len_500.png
Performance plots for other scenarios can be found in the detailed method descriptions below.

Static Public Methods

template<typename InputIteratorT , typename UniqueOutputIteratorT , typename LengthsOutputIteratorT , typename NumRunsOutputIteratorT >
CUB_RUNTIME_FUNCTION static
__forceinline__ cudaError_t 
Encode (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, UniqueOutputIteratorT d_unique_out, LengthsOutputIteratorT d_counts_out, NumRunsOutputIteratorT d_num_runs_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
 Computes a run-length encoding of the sequence d_in. More...
 
template<typename InputIteratorT , typename OffsetsOutputIteratorT , typename LengthsOutputIteratorT , typename NumRunsOutputIteratorT >
CUB_RUNTIME_FUNCTION static
__forceinline__ cudaError_t 
NonTrivialRuns (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OffsetsOutputIteratorT d_offsets_out, LengthsOutputIteratorT d_lengths_out, NumRunsOutputIteratorT d_num_runs_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
 Enumerates the starting offsets and lengths of all non-trivial runs (of length > 1) of same-valued keys in the sequence d_in. More...
 

Member Function Documentation

template<typename InputIteratorT , typename UniqueOutputIteratorT , typename LengthsOutputIteratorT , typename NumRunsOutputIteratorT >
CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t cub::DeviceRunLengthEncode::Encode ( void *  d_temp_storage,
size_t &  temp_storage_bytes,
InputIteratorT  d_in,
UniqueOutputIteratorT  d_unique_out,
LengthsOutputIteratorT  d_counts_out,
NumRunsOutputIteratorT  d_num_runs_out,
int  num_items,
cudaStream_t  stream = 0,
bool  debug_synchronous = false 
)
inlinestatic

Computes a run-length encoding of the sequence d_in.

  • For the ith run encountered, the first key of the run and its length are written to d_unique_out[i] and d_counts_out[i], respectively.
  • The total number of runs encountered is written to d_num_runs_out.
  • The == equality operator is used to determine whether values are equivalent
  • When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
Performance
The following charts illustrate saturated encode performance across different CUDA architectures for int32 and int64 items, respectively. Segments have lengths uniformly sampled from [1,1000].
rle_int32_len_500.png
rle_int64_len_500.png
The following charts are similar, but with segment lengths uniformly sampled from [1,10]:
rle_int32_len_5.png
rle_int64_len_5.png
Snippet
The code snippet below illustrates the run-length encoding of a sequence of int values.
#include <cub/cub.cuh> // or equivalently <cub/device/device_run_length_encode.cuh>
// Declare, allocate, and initialize device-accessible pointers for input and output
int num_items; // e.g., 8
int *d_in; // e.g., [0, 2, 2, 9, 5, 5, 5, 8]
int *d_unique_out; // e.g., [ , , , , , , , ]
int *d_counts_out; // e.g., [ , , , , , , , ]
int *d_num_runs_out; // e.g., [ ]
...
// Determine temporary device storage requirements
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceRunLengthEncode::Encode(d_temp_storage, temp_storage_bytes, d_in, d_unique_out, d_counts_out, d_num_runs_out, num_items);
// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run encoding
cub::DeviceRunLengthEncode::Encode(d_temp_storage, temp_storage_bytes, d_in, d_unique_out, d_counts_out, d_num_runs_out, num_items);
// d_unique_out <-- [0, 2, 9, 5, 8]
// d_counts_out <-- [1, 2, 1, 3, 1]
// d_num_runs_out <-- [5]
Template Parameters
InputIteratorT[inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
UniqueOutputIteratorT[inferred] Random-access output iterator type for writing unique output items (may be a simple pointer type)
LengthsOutputIteratorT[inferred] Random-access output iterator type for writing output counts (may be a simple pointer type)
NumRunsOutputIteratorT[inferred] Output iterator type for recording the number of runs encountered (may be a simple pointer type)
Parameters
[in]d_temp_storageDevice-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
[in,out]temp_storage_bytesReference to size in bytes of d_temp_storage allocation
[in]d_inPointer to the input sequence of keys
[out]d_unique_outPointer to the output sequence of unique keys (one key per run)
[out]d_counts_outPointer to the output sequence of run-lengths (one count per run)
[out]d_num_runs_outPointer to total number of runs
[in]num_itemsTotal number of associated key+value pairs (i.e., the length of d_in_keys and d_in_values)
[in]stream[optional] CUDA stream to launch kernels within. Default is stream0.
[in]debug_synchronous[optional] Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is false.
template<typename InputIteratorT , typename OffsetsOutputIteratorT , typename LengthsOutputIteratorT , typename NumRunsOutputIteratorT >
CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t cub::DeviceRunLengthEncode::NonTrivialRuns ( void *  d_temp_storage,
size_t &  temp_storage_bytes,
InputIteratorT  d_in,
OffsetsOutputIteratorT  d_offsets_out,
LengthsOutputIteratorT  d_lengths_out,
NumRunsOutputIteratorT  d_num_runs_out,
int  num_items,
cudaStream_t  stream = 0,
bool  debug_synchronous = false 
)
inlinestatic

Enumerates the starting offsets and lengths of all non-trivial runs (of length > 1) of same-valued keys in the sequence d_in.

  • For the ith non-trivial run, the run's starting offset and its length are written to d_offsets_out[i] and d_lengths_out[i], respectively.
  • The total number of runs encountered is written to d_num_runs_out.
  • The == equality operator is used to determine whether values are equivalent
  • When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
Performance
Snippet
The code snippet below illustrates the identification of non-trivial runs within a sequence of int values.
#include <cub/cub.cuh> // or equivalently <cub/device/device_run_length_encode.cuh>
// Declare, allocate, and initialize device-accessible pointers for input and output
int num_items; // e.g., 8
int *d_in; // e.g., [0, 2, 2, 9, 5, 5, 5, 8]
int *d_offsets_out; // e.g., [ , , , , , , , ]
int *d_lengths_out; // e.g., [ , , , , , , , ]
int *d_num_runs_out; // e.g., [ ]
...
// Determine temporary device storage requirements
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceRunLengthEncode::NonTrivialRuns(d_temp_storage, temp_storage_bytes, d_in, d_offsets_out, d_lengths_out, d_num_runs_out, num_items);
// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run encoding
cub::DeviceRunLengthEncode::NonTrivialRuns(d_temp_storage, temp_storage_bytes, d_in, d_offsets_out, d_lengths_out, d_num_runs_out, num_items);
// d_offsets_out <-- [1, 4]
// d_lengths_out <-- [2, 3]
// d_num_runs_out <-- [2]
Template Parameters
InputIteratorT[inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OffsetsOutputIteratorT[inferred] Random-access output iterator type for writing run-offset values (may be a simple pointer type)
LengthsOutputIteratorT[inferred] Random-access output iterator type for writing run-length values (may be a simple pointer type)
NumRunsOutputIteratorT[inferred] Output iterator type for recording the number of runs encountered (may be a simple pointer type)
Parameters
[in]d_temp_storageDevice-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
[in,out]temp_storage_bytesReference to size in bytes of d_temp_storage allocation
[in]d_inPointer to input sequence of data items
[out]d_offsets_outPointer to output sequence of run-offsets (one offset per non-trivial run)
[out]d_lengths_outPointer to output sequence of run-lengths (one count per non-trivial run)
[out]d_num_runs_outPointer to total number of runs (i.e., length of d_offsets_out)
[in]num_itemsTotal number of associated key+value pairs (i.e., the length of d_in_keys and d_in_values)
[in]stream[optional] CUDA stream to launch kernels within. Default is stream0.
[in]debug_synchronous[optional] Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is false.

The documentation for this struct was generated from the following file: