CUB
The BlockRunLengthDecode class supports decoding a run-length encoded array of items. That is, given the two arrays run_values[N] and run_lengths[N], run_values[i] is repeated run_lengths[i] times in the output array. Because run-length decoding is a form of decompression, the size of the decoded output is known only at runtime and has no fixed upper bound. To address this, BlockRunLengthDecode allows retrieving a "window" from the run-length decoded array. The window's offset can be specified, and the BLOCK_THREADS * DECODED_ITEMS_PER_THREAD decoded items (referred to as window_size) starting at that offset are returned.
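To make the window semantics concrete, the following is a minimal host-side sketch in plain C++. It is a sequential reference model, not the CUB implementation, and the function name decode_window and its parameters are hypothetical. It assumes that a window covers the decoded positions from window_offset up to, but not including, window_offset + window_size, and that a window reaching past the end of the decoded output simply yields fewer items.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference model (NOT the CUB implementation).
// Returns the decoded items at positions
// [window_offset, window_offset + window_size), clipped to the end of the
// decoded sequence.
template <typename ItemT>
std::vector<ItemT> decode_window(const std::vector<ItemT>& run_values,
                                 const std::vector<std::size_t>& run_lengths,
                                 std::size_t window_offset,
                                 std::size_t window_size)
{
  assert(run_values.size() == run_lengths.size());

  std::vector<ItemT> window;
  const std::size_t window_end = window_offset + window_size;
  std::size_t run_begin = 0; // decoded position at which the current run starts

  for (std::size_t run = 0; run < run_values.size(); ++run)
  {
    const std::size_t run_end = run_begin + run_lengths[run];

    // Intersect the run's decoded range with the requested window.
    const std::size_t lo = std::max(run_begin, window_offset);
    const std::size_t hi = std::min(run_end, window_end);
    for (std::size_t pos = lo; pos < hi; ++pos)
    {
      window.push_back(run_values[run]);
    }
    run_begin = run_end;
  }
  return window;
}
```

With run values 3, 1, 4 and run lengths 2, 1, 3, a window of size 4 at offset 0 holds 3, 3, 1, 4, and the next window at offset 4 holds the remaining 4, 4.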
Suppose run_values across the block of threads is { [0, 1], [2, 3], [4, 5], [6, 7], ..., [254, 255] } and run_lengths is { [1, 2], [3, 4], [5, 1], [2, 3], ..., [5, 1] }, i.e., 128 threads that each contribute two runs. The corresponding output decoded_items in those threads will be { [0, 1, 1, 2], [2, 2, 3, 3], [3, 3, 4, 4], [4, 4, 4, 5], ..., [169, 169, 170, 171] } and relative_offsets will be { [0, 0, 1, 0], [1, 2, 0, 1], [2, 3, 0, 1], [2, 3, 4, 0], ..., [3, 4, 0, 0] } during the first iteration of the while loop, i.e., for the first window.
Template Parameters | |
ItemT | The data type of the items being run-length decoded |
BLOCK_DIM_X | The thread block length in threads along the X dimension |
RUNS_PER_THREAD | The number of consecutive runs that each thread contributes |
DECODED_ITEMS_PER_THREAD | The maximum number of decoded items that each thread holds |
DecodedOffsetT | Type used to index into the block's decoded items (large enough to hold the sum over all the runs' lengths) |
BLOCK_DIM_Y | The thread block length in threads along the Y dimension |
BLOCK_DIM_Z | The thread block length in threads along the Z dimension |
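The worked example above can be checked on the host with a short C++ sketch that uses the same parameter values (BLOCK_DIM_X = 128, RUNS_PER_THREAD = 2, DECODED_ITEMS_PER_THREAD = 4). This is a sequential model and not CUB code. The struct ThreadOutput and the function decode_first_window are hypothetical names, and the run-length pattern 1, 2, 3, 4, 5, 1, 2, ... is an assumption inferred from the values listed in the example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int BLOCK_DIM_X              = 128; // threads in the block
constexpr int RUNS_PER_THREAD          = 2;   // runs contributed per thread
constexpr int DECODED_ITEMS_PER_THREAD = 4;   // decoded items held per thread

// What one thread holds after decoding a window (blocked arrangement).
struct ThreadOutput
{
  std::array<int, DECODED_ITEMS_PER_THREAD> decoded_items;
  std::array<int, DECODED_ITEMS_PER_THREAD> relative_offsets;
};

// Sequential model of the first window (window offset 0) of the example.
std::vector<ThreadOutput> decode_first_window()
{
  std::vector<int> items;   // fully decoded sequence
  std::vector<int> offsets; // each item's index within its run

  const int num_runs = BLOCK_DIM_X * RUNS_PER_THREAD;
  for (int run = 0; run < num_runs; ++run)
  {
    const int run_length = run % 5 + 1; // assumed pattern 1, 2, 3, 4, 5, 1, ...
    for (int k = 0; k < run_length; ++k)
    {
      items.push_back(run); // in the example, run_values[run] == run
      offsets.push_back(k);
    }
  }

  const std::size_t window_size =
    static_cast<std::size_t>(BLOCK_DIM_X) * DECODED_ITEMS_PER_THREAD;
  assert(items.size() >= window_size);

  // Blocked arrangement: thread t holds DECODED_ITEMS_PER_THREAD consecutive
  // items starting at position t * DECODED_ITEMS_PER_THREAD.
  std::vector<ThreadOutput> out(BLOCK_DIM_X);
  for (int t = 0; t < BLOCK_DIM_X; ++t)
  {
    for (int i = 0; i < DECODED_ITEMS_PER_THREAD; ++i)
    {
      const std::size_t pos =
        static_cast<std::size_t>(t) * DECODED_ITEMS_PER_THREAD + i;
      out[t].decoded_items[i]    = items[pos];
      out[t].relative_offsets[i] = offsets[pos];
    }
  }
  return out;
}
```

Under these assumptions, thread 0 holds [0, 1, 1, 2] and the last thread holds [169, 169, 170, 171] with relative offsets [3, 4, 0, 0], matching the values listed in the example.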
Classes | |
struct | TempStorage |
Public Methods | |
template<typename RunLengthT , typename TotalDecodedSizeT > | |
__device__ __forceinline__ | BlockRunLengthDecode (TempStorage &temp_storage, ItemT(&run_values)[RUNS_PER_THREAD], RunLengthT(&run_lengths)[RUNS_PER_THREAD], TotalDecodedSizeT &total_decoded_size) |
Constructor specialised for user-provided temporary storage, initializing using the runs' lengths. The algorithm's temporary storage may not be repurposed between the constructor call and subsequent RunLengthDecode calls. | |
template<typename UserRunOffsetT > | |
__device__ __forceinline__ | BlockRunLengthDecode (TempStorage &temp_storage, ItemT(&run_values)[RUNS_PER_THREAD], UserRunOffsetT(&run_offsets)[RUNS_PER_THREAD]) |
Constructor specialised for user-provided temporary storage, initializing using the runs' offsets. The algorithm's temporary storage may not be repurposed between the constructor call and subsequent RunLengthDecode calls. | |
template<typename RunLengthT , typename TotalDecodedSizeT > | |
__device__ __forceinline__ | BlockRunLengthDecode (ItemT(&run_values)[RUNS_PER_THREAD], RunLengthT(&run_lengths)[RUNS_PER_THREAD], TotalDecodedSizeT &total_decoded_size) |
Constructor specialised for static temporary storage, initializing using the runs' lengths. | |
template<typename UserRunOffsetT > | |
__device__ __forceinline__ | BlockRunLengthDecode (ItemT(&run_values)[RUNS_PER_THREAD], UserRunOffsetT(&run_offsets)[RUNS_PER_THREAD]) |
Constructor specialised for static temporary storage, initializing using the runs' offsets. | |
template<typename RelativeOffsetT > | |
__device__ __forceinline__ void | RunLengthDecode (ItemT(&decoded_items)[DECODED_ITEMS_PER_THREAD], RelativeOffsetT(&item_offsets)[DECODED_ITEMS_PER_THREAD], DecodedOffsetT from_decoded_offset=0) |
Run-length decodes the runs previously passed to the constructor and returns the run-length decoded items in a blocked arrangement to decoded_items. If the number of run-length decoded items exceeds the run-length decode buffer (i.e., DECODED_ITEMS_PER_THREAD * BLOCK_THREADS), only the items that fit within the buffer are returned. Subsequent calls to RunLengthDecode with an adjusted from_decoded_offset can be used to retrieve the remaining run-length decoded items. Calling __syncthreads() between any two calls to RunLengthDecode is not required. item_offsets can be used to retrieve each run-length decoded item's relative index within its run. For example, the run-length encoded array 3, 1, 4 with the respective run lengths 2, 1, 3 yields the run-length decoded array 3, 3, 1, 4, 4, 4 with the relative offsets 0, 1, 0, 0, 1, 2. A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed. | |
__device__ __forceinline__ void | RunLengthDecode (ItemT(&decoded_items)[DECODED_ITEMS_PER_THREAD], DecodedOffsetT from_decoded_offset=0) |
Run-length decodes the runs previously passed to the constructor and returns the run-length decoded items in a blocked arrangement to decoded_items. If the number of run-length decoded items exceeds the run-length decode buffer (i.e., DECODED_ITEMS_PER_THREAD * BLOCK_THREADS), only the items that fit within the buffer are returned. Subsequent calls to RunLengthDecode with an adjusted from_decoded_offset can be used to retrieve the remaining run-length decoded items. Calling __syncthreads() between any two calls to RunLengthDecode is not required. | |
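The constructors above accept either the runs' lengths or the runs' offsets. The sketch below shows how the two representations relate, assuming that a run's offset is the position in the decoded output at which that run begins, i.e., the exclusive prefix sum of the run lengths. That convention is not spelled out in this section, so treat it as an assumption. The function name run_lengths_to_offsets is hypothetical, and the code is a sequential host-side model rather than the block-wide computation.

```cpp
#include <cstddef>
#include <vector>

// Converts run lengths into run offsets (exclusive prefix sum) and reports
// the total number of decoded items.
// Assumption: run_offsets[i] is the decoded position where run i begins.
std::vector<std::size_t>
run_lengths_to_offsets(const std::vector<std::size_t>& run_lengths,
                       std::size_t& total_decoded_size)
{
  std::vector<std::size_t> run_offsets(run_lengths.size());
  std::size_t running_sum = 0;
  for (std::size_t run = 0; run < run_lengths.size(); ++run)
  {
    run_offsets[run] = running_sum; // sum of all preceding run lengths
    running_sum += run_lengths[run];
  }
  total_decoded_size = running_sum;
  return run_offsets;
}
```

For the run lengths 2, 1, 3 this gives the run offsets 0, 2, 3 and a total decoded size of 6. Note that only the length-based constructors take a total_decoded_size output argument, so a caller that supplies offsets directly presumably has to know the total size by other means.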
template<typename RelativeOffsetT > | |
__device__ __forceinline__ void | RunLengthDecode (ItemT(&decoded_items)[DECODED_ITEMS_PER_THREAD], RelativeOffsetT(&item_offsets)[DECODED_ITEMS_PER_THREAD], DecodedOffsetT from_decoded_offset=0) |
Run-length decodes the runs previously passed to the constructor and returns the run-length decoded items in a blocked arrangement to decoded_items. If the number of run-length decoded items exceeds the run-length decode buffer (i.e., DECODED_ITEMS_PER_THREAD * BLOCK_THREADS), only the items that fit within the buffer are returned. Subsequent calls to RunLengthDecode with an adjusted from_decoded_offset can be used to retrieve the remaining run-length decoded items. Calling __syncthreads() between any two calls to RunLengthDecode is not required. item_offsets can be used to retrieve each run-length decoded item's relative index within its run. For example, the run-length encoded array 3, 1, 4 with the respective run lengths 2, 1, 3 yields the run-length decoded array 3, 3, 1, 4, 4, 4 with the relative offsets 0, 1, 0, 0, 1, 2. A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.
Parameters | |
[out] | decoded_items | The run-length decoded items to be returned in a blocked arrangement |
[out] | item_offsets | The run-length decoded items' relative offset within the run they belong to |
[in] | from_decoded_offset | The offset into the run-length decoded output at which the returned window begins. Invoking the method with a from_decoded_offset that is larger than total_decoded_size results in undefined behavior. |
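The relative offsets in the 3, 1, 4 example can be derived per decoded position with a binary search over the runs' starting offsets. The following host-side C++ sketch shows one way to do this. It is illustrative only and makes no claim about how CUB computes the result internally. The names DecodedItem and decode_at are hypothetical, and run_offsets is assumed to hold each run's starting position in the decoded output, beginning at 0.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct DecodedItem
{
  int value;                   // the run's value
  std::size_t relative_offset; // index of the item within its run
};

// Looks up the decoded item at decoded_index.
// Assumptions: run_offsets is sorted, run_offsets[0] == 0, and decoded_index
// is smaller than the total decoded size.
DecodedItem decode_at(const std::vector<int>& run_values,
                      const std::vector<std::size_t>& run_offsets,
                      std::size_t decoded_index)
{
  assert(!run_offsets.empty());
  assert(run_values.size() == run_offsets.size());

  // Find the first run that starts after decoded_index; the run just before
  // it is the one that contains decoded_index.
  const auto next_run =
    std::upper_bound(run_offsets.begin(), run_offsets.end(), decoded_index);
  const std::size_t run =
    static_cast<std::size_t>(next_run - run_offsets.begin()) - 1;

  return {run_values[run], decoded_index - run_offsets[run]};
}
```

With run values 3, 1, 4 and run offsets 0, 2, 3, the decoded positions 0 through 5 map to the values 3, 3, 1, 4, 4, 4 and the relative offsets 0, 1, 0, 0, 1, 2.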
__device__ __forceinline__ void | RunLengthDecode (ItemT(&decoded_items)[DECODED_ITEMS_PER_THREAD], DecodedOffsetT from_decoded_offset=0) |
Run-length decodes the runs previously passed to the constructor and returns the run-length decoded items in a blocked arrangement to decoded_items. If the number of run-length decoded items exceeds the run-length decode buffer (i.e., DECODED_ITEMS_PER_THREAD * BLOCK_THREADS), only the items that fit within the buffer are returned. Subsequent calls to RunLengthDecode with an adjusted from_decoded_offset can be used to retrieve the remaining run-length decoded items. Calling __syncthreads() between any two calls to RunLengthDecode is not required.
Parameters | |
[out] | decoded_items | The run-length decoded items to be returned in a blocked arrangement |
[in] | from_decoded_offset | The offset into the run-length decoded output at which the returned window begins. Invoking the method with a from_decoded_offset that is larger than total_decoded_size results in undefined behavior. |
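Retrieving all decoded items takes repeated calls with an advancing from_decoded_offset. The host-side C++ sketch below models only the loop bookkeeping: which offsets a caller would pass and how many items of each window are valid. The names Window and plan_windows are hypothetical. The sketch assumes the caller advances the offset by the window size (BLOCK_THREADS * DECODED_ITEMS_PER_THREAD) and ignores any items past the valid count in the final window, since this section does not specify their contents.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Window
{
  std::size_t from_decoded_offset; // offset to pass for this window
  std::size_t num_valid_items;     // items of this window that are valid
};

// Lists the windows needed to cover total_decoded_size decoded items.
std::vector<Window> plan_windows(std::size_t total_decoded_size,
                                 std::size_t window_size)
{
  assert(window_size > 0);

  std::vector<Window> windows;
  // The loop condition keeps every offset below total_decoded_size, which
  // stays within the documented precondition on from_decoded_offset.
  for (std::size_t from = 0; from < total_decoded_size; from += window_size)
  {
    windows.push_back({from, std::min(window_size, total_decoded_size - from)});
  }
  return windows;
}
```

For a total decoded size of 10 and a window size of 4, this yields three windows at offsets 0, 4, and 8, of which the last holds only 2 valid items.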