DeviceRunLengthEncode provides device-wide, parallel operations for demarcating "runs" of same-valued items within a sequence residing within device-accessible memory.
.
- Overview
- A run-length encoding computes a simple compressed representation of a sequence of input elements such that each maximal "run" of consecutive same-valued data items is encoded as a single data value along with a count of the elements in that run.
- Usage Considerations
- Dynamic parallelism. DeviceRunLengthEncode methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported.
- Performance
- The work-complexity of run-length encode as a function of input size is linear, resulting in performance throughput that plateaus with problem sizes large enough to saturate the GPU.
- The following chart illustrates DeviceRunLengthEncode::RunLengthEncode performance across different CUDA architectures for
int32
items. Segments have lengths uniformly sampled from [1,1000].
- Performance plots for other scenarios can be found in the detailed method descriptions below.
|
template<typename InputIteratorT , typename UniqueOutputIteratorT , typename LengthsOutputIteratorT , typename NumRunsOutputIteratorT > |
CUB_RUNTIME_FUNCTION static
__forceinline__ cudaError_t | Encode (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, UniqueOutputIteratorT d_unique_out, LengthsOutputIteratorT d_counts_out, NumRunsOutputIteratorT d_num_runs_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false) |
| Computes a run-length encoding of the sequence d_in . More...
|
|
template<typename InputIteratorT , typename OffsetsOutputIteratorT , typename LengthsOutputIteratorT , typename NumRunsOutputIteratorT > |
CUB_RUNTIME_FUNCTION static
__forceinline__ cudaError_t | NonTrivialRuns (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OffsetsOutputIteratorT d_offsets_out, LengthsOutputIteratorT d_lengths_out, NumRunsOutputIteratorT d_num_runs_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false) |
| Enumerates the starting offsets and lengths of all non-trivial runs (of length > 1) of same-valued keys in the sequence d_in . More...
|
|
template<typename InputIteratorT , typename UniqueOutputIteratorT , typename LengthsOutputIteratorT , typename NumRunsOutputIteratorT >
CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t cub::DeviceRunLengthEncode::Encode |
( |
void * |
d_temp_storage, |
|
|
size_t & |
temp_storage_bytes, |
|
|
InputIteratorT |
d_in, |
|
|
UniqueOutputIteratorT |
d_unique_out, |
|
|
LengthsOutputIteratorT |
d_counts_out, |
|
|
NumRunsOutputIteratorT |
d_num_runs_out, |
|
|
int |
num_items, |
|
|
cudaStream_t |
stream = 0 , |
|
|
bool |
debug_synchronous = false |
|
) |
| |
|
inlinestatic |
Computes a run-length encoding of the sequence d_in
.
- For the ith run encountered, the first key of the run and its length are written to
d_unique_out[i]
and d_counts_out[i]
, respectively.
- The total number of runs encountered is written to
d_num_runs_out
.
- The
==
equality operator is used to determine whether values are equivalent
- When
d_temp_storage
is NULL
, no work is done and the required allocation size is returned in temp_storage_bytes
.
- Performance
- The following charts illustrate saturated encode performance across different CUDA architectures for
int32
and int64
items, respectively. Segments have lengths uniformly sampled from [1,1000].
- The following charts are similar, but with segment lengths uniformly sampled from [1,10]:
- Snippet
- The code snippet below illustrates the run-length encoding of a sequence of
int
values.
int num_items;
int *d_in;
int *d_unique_out;
int *d_counts_out;
int *d_num_runs_out;
...
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cudaMalloc(&d_temp_storage, temp_storage_bytes);
- Template Parameters
-
InputIteratorT | [inferred] Random-access input iterator type for reading input items (may be a simple pointer type) |
UniqueOutputIteratorT | [inferred] Random-access output iterator type for writing unique output items (may be a simple pointer type) |
LengthsOutputIteratorT | [inferred] Random-access output iterator type for writing output counts (may be a simple pointer type) |
NumRunsOutputIteratorT | [inferred] Output iterator type for recording the number of runs encountered (may be a simple pointer type) |
- Parameters
-
[in] | d_temp_storage | Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done. |
[in,out] | temp_storage_bytes | Reference to size in bytes of d_temp_storage allocation |
[in] | d_in | Pointer to the input sequence of keys |
[out] | d_unique_out | Pointer to the output sequence of unique keys (one key per run) |
[out] | d_counts_out | Pointer to the output sequence of run-lengths (one count per run) |
[out] | d_num_runs_out | Pointer to total number of runs |
[in] | num_items | Total number of associated key+value pairs (i.e., the length of d_in_keys and d_in_values ) |
[in] | stream | [optional] CUDA stream to launch kernels within. Default is stream0. |
[in] | debug_synchronous | [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is false . |
template<typename InputIteratorT , typename OffsetsOutputIteratorT , typename LengthsOutputIteratorT , typename NumRunsOutputIteratorT >
CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t cub::DeviceRunLengthEncode::NonTrivialRuns |
( |
void * |
d_temp_storage, |
|
|
size_t & |
temp_storage_bytes, |
|
|
InputIteratorT |
d_in, |
|
|
OffsetsOutputIteratorT |
d_offsets_out, |
|
|
LengthsOutputIteratorT |
d_lengths_out, |
|
|
NumRunsOutputIteratorT |
d_num_runs_out, |
|
|
int |
num_items, |
|
|
cudaStream_t |
stream = 0 , |
|
|
bool |
debug_synchronous = false |
|
) |
| |
|
inlinestatic |
Enumerates the starting offsets and lengths of all non-trivial runs (of length > 1) of same-valued keys in the sequence d_in
.
- For the ith non-trivial run, the run's starting offset and its length are written to
d_offsets_out[i]
and d_lengths_out[i]
, respectively.
- The total number of runs encountered is written to
d_num_runs_out
.
- The
==
equality operator is used to determine whether values are equivalent
- When
d_temp_storage
is NULL
, no work is done and the required allocation size is returned in temp_storage_bytes
.
- Performance
- Snippet
- The code snippet below illustrates the identification of non-trivial runs within a sequence of
int
values.
int num_items;
int *d_in;
int *d_offsets_out;
int *d_lengths_out;
int *d_num_runs_out;
...
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cudaMalloc(&d_temp_storage, temp_storage_bytes);
- Template Parameters
-
InputIteratorT | [inferred] Random-access input iterator type for reading input items (may be a simple pointer type) |
OffsetsOutputIteratorT | [inferred] Random-access output iterator type for writing run-offset values (may be a simple pointer type) |
LengthsOutputIteratorT | [inferred] Random-access output iterator type for writing run-length values (may be a simple pointer type) |
NumRunsOutputIteratorT | [inferred] Output iterator type for recording the number of runs encountered (may be a simple pointer type) |
- Parameters
-
[in] | d_temp_storage | Device-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done. |
[in,out] | temp_storage_bytes | Reference to size in bytes of d_temp_storage allocation |
[in] | d_in | Pointer to input sequence of data items |
[out] | d_offsets_out | Pointer to output sequence of run-offsets (one offset per non-trivial run) |
[out] | d_lengths_out | Pointer to output sequence of run-lengths (one count per non-trivial run) |
[out] | d_num_runs_out | Pointer to total number of runs (i.e., length of d_offsets_out ) |
[in] | num_items | Total number of associated key+value pairs (i.e., the length of d_in_keys and d_in_values ) |
[in] | stream | [optional] CUDA stream to launch kernels within. Default is stream0. |
[in] | debug_synchronous | [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is false . |
The documentation for this struct was generated from the following file: