Part 2: CUDA Implementation

Inline vs Lookaside Acceleration

Fig. 22 Inline and lookaside acceleration in the gNB DU processing pipeline. The figure is from [Kundu2023B]. Note that the Jetson AGX Orin platform uses a unified memory architecture, which can be seen as a hybrid of inline and lookaside memory management.

The second part of this tutorial focuses on the implementation of the LDPC decoder using CUDA and explains common pitfalls when offloading compute-intensive functions to the GPU. Further, we show how the GPU-accelerated LDPC decoding can be integrated into the OAI stack.

Function offloading to dedicated accelerators requires careful consideration of the data flow with respect to the underlying memory architecture. This is especially critical for wireless communications applications, where strict latency requirements in the receiver processing pipeline necessitate efficient memory management when utilizing dedicated signal processing accelerators.

As shown in Fig. 22, one needs to distinguish between inline and lookaside acceleration. While lookaside acceleration moves data between the CPU and the hardware accelerator, inline acceleration avoids such data movement by executing the entire processing pipeline on the hardware accelerator.

Strictly speaking, the Jetson platform still moves data between the CPU and GPU. However, the overhead is significantly reduced compared to traditional split-memory architectures as the Jetson AGX Orin platform shares the same physical memory between CPU and GPU. This has implications for the caching behavior and requires a careful implementation of the CUDA kernels to avoid performance degradation. Nevertheless, we will consider it as inline acceleration for the following discussion.

For further details on CUDA, we refer to the NVIDIA CUDA Programming Guide [CUDA2024] and the CUDA for Tegra Memory Model [Tegra2024].

Overview

LDPC Acceleration Overview

Fig. 23 Overview of CUDA implementation of the LDPC BP decoding algorithm.

The CUDA implementation can be found in tutorials/ldpc_cuda/runtime/ldpc_decoder.cu. The core decoding algorithm is implemented in the update_cn_kernel(.) and update_vn_kernel(.) functions. Both kernels are called iteratively and perform the check node (CN) and variable node (VN) updates, respectively. The decoder stops when the maximum number of iterations is reached. An additional early stopping condition could also be applied to reduce the average number of iterations.
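
Conceptually, the host-side iteration loop launches both kernels once per iteration. The following is a minimal sketch only; the grid sizes, buffer names, and the stream handle are placeholders and the actual loop in ldpc_decoder.cu may differ:

// sketch of the iteration loop launching the CN and VN update kernels
for (uint32_t iter = 0; iter < max_num_iter; ++iter) {
    // check node update: one thread per (check node, lifting index) pair
    update_cn_kernel<<<blocks_cn, threads, 0, stream>>>(
        llr_total, llr_msg, Z, bg_cn, bg_cn_degree, num_rows, iter == 0);

    // variable node update: one thread per (variable node, lifting index) pair
    update_vn_kernel<<<blocks_vn, threads, 0, stream>>>(
        llr_msg, llr_ch, llr_total, Z, bg_vn, bg_vn_degree, num_cols, num_rows);

    // an optional syndrome-based early stopping check could break out here
}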

The pack_bits_kernel(.) kernel maps the soft values to hard-decided bits and packs them into a more compact byte representation, which is required by the OAI processing pipeline.

CUDA Integration in OAI

Running the code from this tutorial requires integration into the OAI stack. After patching, the Sionna Research Kit and the related Dockerfiles are already configured for CUDA support. Thus, before building the Docker images, you need to ensure that the OAI stack is patched via

# Patch the OAI stack
./scripts/patch_oai-tutorials.sh

If you followed the Quickstart tutorial, this automatically patches the OAI stack and adds the CUDA flag to the build pipeline.

After patching, the CUDA implementation is located in

openairinterface5g/tutorials/ldpc_cuda/runtime/ldpc_decoder.cu

You can now modify the ldpc_decoder.cu file and implement your own decoder variants.

After modifying the ldpc_decoder.cu file, you need to rebuild the Docker images via

./scripts/build-oai-images.sh

Running the Decoder

The LDPC decoder is implemented as a shared library that can be loaded using the OAI shared library loader. Thus, the CUDA-based decoder can be used as a drop-in replacement for the existing decoder implementations. It can be loaded when running the gNB via the following GNB_EXTRA_OPTIONS in the .env file of the config folder.

GNB_EXTRA_OPTIONS=--loader.ldpc.shlibversion _cuda --thread-pool 5,6,7,8,9,10,11

We strongly recommend additionally assigning dedicated CPU cores to PHY-layer processing via the thread-pool option. The option above assigns cores 5-11 to the PHY-layer thread pool. Note that the lower CPU cores are assigned to handle USRP-related tasks such as time synchronization.

You can now start the gNB with the CUDA-based decoder by running

scripts/start_system.sh b200_arm64 # replace b200_arm64 with your config folder

The GPU load can be monitored via

jtop

Congratulations! You have now successfully accelerated the LDPC decoder using CUDA.

Implementation Aspects

The following sections focus on various technical aspects of the CUDA implementation and the performance implications of different memory transfer patterns.

For debugging and profiling, please refer to the tutorial on Debugging & Troubleshooting.

Memory Management

In order to offload compute-intensive processing to the GPU, data needs to be shared between the host (CPU) and the device (GPU). We can leverage the shared system memory architecture of the Jetson platform to avoid the bottleneck of costly memory transfers on traditional split-memory platforms.

In fact, we can avoid the overhead of any complex API calls and memory transition operations by allocating page-locked memory on the host using cudaHostAlloc. To make input LLRs visible to the GPU, we then use a simple memcpy() operation to copy inputs from the 5G stack over into such a page-locked buffer, where unified memory addressing allows direct reads and writes both on the host and the device. For output bits, we first synchronize the parallel CUDA command stream and then simply use memcpy() to copy packed bits from the shared page-locked buffer into the 5G stack output buffer.
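
For illustration, such a page-locked buffer could be allocated as follows. This is a minimal sketch; the buffer name and size are placeholders rather than the exact allocation code of ldpc_decoder.cu:

// allocate page-locked (pinned), device-mapped host memory
int8_t* mapped_llr_in = nullptr;
cudaCheck(cudaHostAlloc((void**)&mapped_llr_in, num_llrs * sizeof(int8_t), cudaHostAllocMapped));

// on platforms with unified addressing such as the Jetson AGX Orin, the host
// pointer can be used directly in kernels; otherwise, the device alias can be
// obtained explicitly
int8_t* dev_llr_in = nullptr;
cudaCheck(cudaHostGetDevicePointer((void**)&dev_llr_in, mapped_llr_in, 0));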

Note that this simplicity is achieved via some of the implied cache semantics [Tegra2024]: Host memory caches are active to allow fast reads and writes on the host side, while device memory caching is disabled on the page-locked buffers to directly write to or read from the shared system memory. For fast reads and writes in device-side compute kernels, we ensure maximally coalesced read-once and write-once memory access patterns (addressing consecutive memory in consecutive threads).
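
As a generic illustration (not taken from the decoder code), a coalesced access pattern looks as follows:

// consecutive threads access consecutive elements, so each warp's reads and
// writes coalesce into a small number of memory transactions
__global__ void coalesced_copy(const int8_t* __restrict__ in, int8_t* __restrict__ out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = in[idx]; // thread i touches element i: coalesced
    // by contrast, an access such as in[idx * 32] would scatter the warp's
    // reads across many cache lines and serialize the memory transaction
}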

Traditional memory transfer patterns designed for split-memory architectures can also be used on the Jetson platform, but can lead to higher overheads depending on the exact architecture and generation. Explicit memory transfer calls such as cudaMemcpyAsync(inputs..., cudaMemcpyHostToDevice, stream) and cudaMemcpyAsync(outputs..., cudaMemcpyDeviceToHost, stream) incur additional API overheads and may depend on the availability of additional copy engines; explicitly transitioning managed pageable memory allocated by cudaMallocManaged() via cudaStreamAttachMemAsync() may also incur excessive API overhead for small buffer sizes. Therefore, we instead use the patterns described above to optimize for shared system memory and low latency, particularly since input and output buffers are typically small in the real-time 5G stack.

For comparison, we show both variants side by side in the following code, where the latency-optimized code path is the one with USE_UNIFIED_MEMORY defined. The input LLRs are copied from the host to device-visible memory in the ldpc_decoder.cu file:

    // copy input data to device-visible memory
#ifdef USE_UNIFIED_MEMORY
    memcpy(const_cast<int8_t*>(mapped_llr_in), llr_in, num_llrs * sizeof(*llr_in));
#else
    cudaCheck(cudaMemcpyAsync(const_cast<int8_t*>(mapped_llr_in), llr_in, num_llrs * sizeof(*llr_in), cudaMemcpyHostToDevice, stream));
#endif

After decoding, we make the output bits available via a host-side copy in parallel to an asynchronous syndrome check on the device:

    // pack LDPC output bits on the device
    int pack_thread_blocks = (block_length + 127) / 128;
    pack_decoded_bit_kernel<<<pack_thread_blocks, 128, 0, context.stream>>>(context.dev_llr_out, context.dev_bits_out, block_length);

    // allow CPU access of output bits while computing syndrome
#ifdef USE_UNIFIED_MEMORY
    cudaStreamSynchronize(context.stream);

    // ... schedule syndrome or CRC check ...

    // while syndrome computations are running on the device, copy output bits to 5G stack output buffer
    memcpy(p_out, context.dev_bits_out, memorySize_bits_out);
#else
    cudaCheck(cudaMemcpyAsync(p_out, context.dev_bits_out, memorySize_bits_out, cudaMemcpyDeviceToHost, context.stream));

    // ... schedule syndrome or CRC check ...

    // allow CPU access of output bits and syndrome
    cudaStreamSynchronize(context.stream);
#endif

Kernel Optimization

The core idea of CUDA programming is to parallelize the execution of the kernel on the GPU. The kernel is executed by a grid of blocks, each containing a number of independent threads. This level of parallelism allows for significant speedups compared to the serial execution on the CPU.

For a detailed introduction to CUDA programming, we refer to the CUDA Programming Guide. Conceptually, a CUDA kernel is defined as follows

// Kernel definition: __global__ indicates that the function is a CUDA kernel
__global__ void my_kernel(float *data, int N) {
   // Each thread processes one element of the array

   // We calculate the global thread index from the block and thread indices
   // idx is unique for each thread
   int idx = threadIdx.x + blockIdx.x * blockDim.x;
   if (idx < N) {
      // process the idx-th element of the array
      data[idx] = ...;
   }
}

This kernel can now be launched by specifying its grid and block dimensions via

// Launch the kernel
// <<<1, 32>>> specifies the grid and block sizes
my_kernel<<<1, 32>>>(d_data, N);
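
Putting both pieces together, a minimal self-contained example could look as follows. The scaling operation and array size are arbitrary and unrelated to the decoder; the example merely illustrates the allocate-copy-launch-copy-back pattern:

#include <cuda_runtime.h>
#include <cstdio>

__global__ void my_kernel(float *data, int N) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < N)
        data[idx] = 2.0f * data[idx]; // example operation: scale each element
}

int main() {
    const int N = 32;
    float h_data[N];
    for (int i = 0; i < N; ++i) h_data[i] = float(i);

    // allocate device memory and copy the inputs to the device
    float *d_data = nullptr;
    cudaMalloc(&d_data, N * sizeof(float));
    cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice);

    // one block of 32 threads covers all N = 32 elements
    my_kernel<<<1, 32>>>(d_data, N);

    // copy the results back and release the device memory
    cudaMemcpy(h_data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);

    printf("data[5] = %f\n", h_data[5]); // expected output: 10.0
    return 0;
}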

For LDPC decoding, the decoder can be parallelized over the variable node (VN) and check node (CN) updates. An overview of the CUDA implementation is given in Fig. 23. The VN update kernel is given as

static __global__ void update_vn_kernel(llr_msg_t const* llr_msg, int8_t const* llr_ch, llr_accumulator_t* llr_total,
                                        uint32_t Z, uint32_t const* bg_vn, uint32_t const* bg_vn_degree, uint32_t num_cols, uint32_t num_rows) {
    uint32_t tid = blockIdx.x * blockDim.x + threadIdx.x;

    uint32_t i = tid % Z; // for i in range(Z)
    uint32_t idx_col = tid / Z; // for idx_col in range(num_cols)
    if (idx_col >= num_cols) return;

    uint32_t vn_degree = bg_vn_degree[idx_col];

    // list of tuples (idx_row = index_cn, s)
    // idx_col = idx_vn and msg_offset omitted,
    // msg spread out to idx_cn + idx_vn * num_cn
    uint32_t const* variable_nodes = &bg_vn[idx_col]; // len(vn) = vn_degree

    int msg_sum = 0;
    // accumulate all incoming LLRs
    for (uint32_t j = 0; j < vn_degree; ++j) {
        uint32_t vn = variable_nodes[j * num_cols];

        // see packing layout above
        uint32_t idx_row = vn & 0xffffu; // note: little endian
        uint32_t s = vn >> 16;           // ...
        uint32_t msg_offset = idx_row + idx_col * num_rows;

        // index of the msg in the LLR array
        // it is the idx_col-th variable node, and the j-th message from the idx_row-th check node
        uint32_t msg_idx = msg_offset * Z + (i-s+(Z<<8))%Z;

        // accumulate all incoming LLRs
        msg_sum += llr_msg[msg_idx];
    }

    // add the channel LLRs
    msg_sum += llr_ch[idx_col*Z + i];

    msg_sum = min(max(msg_sum, -MAX_LLR_ACCUMULATOR_VALUE), MAX_LLR_ACCUMULATOR_VALUE);

    llr_total[idx_col*Z + i] = llr_accumulator_t(msg_sum);
}

As the OAI processing pipeline uses multiple threads, we need to ensure proper multi-threading support in the CUDA code. This is done via a thread-specific context that holds a dedicated CUDA stream and the per-thread device buffers. This ensures that each thread operates on its own CUDA stream and, thus, can execute in parallel without interference.

struct ThreadContext {
    cudaStream_t stream = 0;

    // Device memory declarations - use raw pointers instead of device symbols
    int8_t* llr_in_buffer = nullptr;
    uint8_t* llr_bits_out_buffer = nullptr;
    uint8_t* syndrome_buffer = nullptr;
#ifndef USE_UNIFIED_MEMORY
    uint8_t* host_syndrome_buffer = nullptr;
#endif
    llr_msg_t* llr_msg_buffer = nullptr;
    llr_accumulator_t* llr_total_buffer = nullptr;

    // list of thread contexts for shutdown
    ThreadContext* next_initialized_context = nullptr;
};
static __thread ThreadContext thread_context = { };
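
For illustration, the per-thread stream could be created lazily on first use. The helper below is a sketch under that assumption; the actual initialization in ldpc_decoder.cu may be structured differently:

// create a dedicated non-blocking stream the first time a thread calls the decoder
static void ensure_thread_context_initialized() {
    if (thread_context.stream == 0) {
        cudaStreamCreateWithFlags(&thread_context.stream, cudaStreamNonBlocking);
        // ... allocate the per-thread device buffers here ...
    }
}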

Further, the decoder applies clipping to the VN accumulator values and to the extrinsic CN messages. In the VN update kernel, the accumulated sum is clipped before it is stored:

    // add the channel LLRs
    msg_sum += llr_ch[idx_col*Z + i];

    msg_sum = min(max(msg_sum, -MAX_LLR_ACCUMULATOR_VALUE), MAX_LLR_ACCUMULATOR_VALUE);

    llr_total[idx_col*Z + i] = llr_accumulator_t(msg_sum);

In the CN update kernel, the outgoing message magnitudes are clipped accordingly:

    // clip msg magnitudes to MAX_LLR_MSG_VALUE
    min_1 = min(max(min_1, -MAX_LLR_MSG_VALUE), MAX_LLR_MSG_VALUE);
    min_2 = min(max(min_2, -MAX_LLR_MSG_VALUE), MAX_LLR_MSG_VALUE);

Following the same principles, the CN update kernel is given as

static __global__ void update_cn_kernel(llr_accumulator_t const* llr_total, llr_msg_t* llr_msg,
                                        uint32_t Z, uint32_t const* bg_cn, uint32_t const* bg_cn_degree, uint32_t num_rows,
                                        bool first_iter) {
    uint32_t tid = blockIdx.x * blockDim.x + threadIdx.x;

    uint32_t i = tid % Z; // for i in range(Z)
    uint32_t idx_row = tid / Z; // for idx_row in range(num_rows)
    if (idx_row >= num_rows) return;

    uint32_t cn_degree = bg_cn_degree[idx_row];

    // list of tuples (idx_col = idx_vn, s),
    // idx_row = idx_cn and msg_offset omitted,
    // msg spread out to idx_cn + idx_vn * num_cn
    uint32_t const* check_nodes = &bg_cn[idx_row]; // len(cn) = cn_degree

    // search the "extrinsic" min of all incoming LLRs
    // this means we need to find the min and the second min of all incoming LLRs
    int min_1 = INT_MAX;
    int min_2 = INT_MAX;
    int idx_min = -1;
    int node_sign = 1;
    uint32_t msg_signs = 0; // bitset, 0 == positive; max degree is 19

    for (uint32_t ii = 0; ii < cn_degree; ++ii) {
        uint32_t cn = check_nodes[ii * num_rows];

        // see packing layout above
        uint32_t idx_col = cn & 0xffffu; // note: little endian
        uint32_t s = cn >> 16;           // ...
        uint32_t msg_offset = idx_row + idx_col * num_rows;

        uint32_t msg_idx = msg_offset * Z + i;

        // total VN message
        int t = llr_total[idx_col*Z + (i+s)%Z];

        // make extrinsic by subtracting the previous msg
        if (!first_iter)
            t -= llr_msg[msg_idx];

        // store sign for 2nd recursion
        // note: could be also used for syndrome-based check or early termination
        int sign = (t >= 0 ? 1 : -1);
        node_sign *= sign;
        msg_signs |= (t < 0) << ii; // for later sign calculation

        // find min and second min
        int t_abs = abs(t);
        if (t_abs < min_1) {
            min_2 = min_1;
            min_1 = t_abs;
            idx_min = msg_idx;
        } else if (t_abs < min_2)
            min_2 = t_abs;
    }

    // apply damping factor
    min_1 = APPLY_DAMPING_INT(min_1); // min_1 * DAMPING_FACTOR, e.g. *3/4
    min_2 = APPLY_DAMPING_INT(min_2); // min_2 * DAMPING_FACTOR, e.g. *3/4

    // clip msg magnitudes to MAX_LLR_MSG_VALUE
    min_1 = min(max(min_1, -MAX_LLR_MSG_VALUE), MAX_LLR_MSG_VALUE);
    min_2 = min(max(min_2, -MAX_LLR_MSG_VALUE), MAX_LLR_MSG_VALUE);

    // apply min and second min to the outgoing LLR
    for (uint32_t ii = 0; ii < cn_degree; ++ii) {
        uint32_t cn = check_nodes[ii * num_rows];

        // see packing layout above
        uint32_t idx_col = cn & 0xffffu; // note: little endian
        uint32_t msg_offset = idx_row + idx_col * num_rows;

        uint32_t msg_idx = msg_offset * Z + i;
        int min_val;
        if (msg_idx == idx_min)
            min_val = min_2;
        else
            min_val = min_1;

        int msg_sign = (msg_signs >> ii) & 0x1 ? -1 : 1;

        // and update outgoing msg including sign
        llr_msg[msg_idx] = llr_msg_t(min_val * node_sign * msg_sign);
    }
}

Note that the choice of the clipping values is critical for the performance of the decoder, in particular as the decoder mostly uses signed 8-bit (int8_t) variables. This keeps the memory footprint low and can be configured via the following macros

static const int MAX_LLR_ACCUMULATOR_VALUE = 127;
typedef int8_t llr_accumulator_t;
static const int MAX_LLR_MSG_VALUE = 127;
typedef int8_t llr_msg_t;

For a given lifting factor Z and the number of rows num_rows and columns num_cols determined by the BG selection, the following grid and block dimensions are used

dim3 threads(256);
// check node update
dim3 blocks_cn(blocks_for(bg.num_rows * Z, threads.x));
// VN update
dim3 blocks_vn(blocks_for(bg.num_cols * Z, threads.x));

with the helper blocks_for(.) computing the required number of thread blocks:

inline __host__ __device__ uint32_t blocks_for(uint32_t elements, int block_size) {
    return int( uint32_t(elements + (block_size-1)) / uint32_t(block_size) );
}
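
For example, base graph 1 (BG1) has 46 check-node rows and 68 variable-node columns. With the maximum lifting factor Z = 384 and 256 threads per block, this yields blocks_for(46 * 384, 256) = 69 blocks for the CN update and blocks_for(68 * 384, 256) = 102 blocks for the VN update.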

Note that the grid and block dimensions are architecture dependent and should be tuned for the specific GPU.

After decoding, the output is hard-decided and packed into a byte-array.

static __global__ void pack_bits_kernel(llr_accumulator_t const* llr_total, uint8_t* bits, uint32_t block_length) {
    uint32_t tid = blockIdx.x * blockDim.x + threadIdx.x;

    uint32_t coop_byte = 0;
    // 1 bit per thread
    if (tid < block_length)
        coop_byte = (llr_total[tid] < 0) << (7 - (threadIdx.x & 7)); // note: highest to lowest bit

    // use fast lane shuffles to assemble one byte per group of 8 adjacent threads
    coop_byte += __shfl_xor_sync(0xffffffff, coop_byte, 1); // xxyyzzww
    coop_byte += __shfl_xor_sync(0xffffffff, coop_byte, 2); // xxxxyyyy
    coop_byte += __shfl_xor_sync(0xffffffff, coop_byte, 4); // xxxxxxxx

    // share bytes across thread group to allow one coalesced write by first N threads
    __shared__ uint32_t bit_block_shared[PACK_BITS_KERNEL_THREADS / 8];
    if ((threadIdx.x & 0x7) == 0)
        bit_block_shared[threadIdx.x / 8] = coop_byte;

    __syncthreads();

    // the first (PACK_BITS_KERNEL_THREADS / 8) threads pack 8 bits each
    if (threadIdx.x < PACK_BITS_KERNEL_THREADS / 8 && blockIdx.x * PACK_BITS_KERNEL_THREADS + threadIdx.x * 8 < block_length) {
        bits[blockIdx.x * PACK_BITS_KERNEL_THREADS / 8 + threadIdx.x] = bit_block_shared[threadIdx.x];
    }
}

Note

The decoder returns the number of iterations and declares a decoding failure by returning max_num_iter + 1 if the CRC or the syndrome check fails.
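
A hypothetical caller-side check could therefore look as follows; decode_block() is a placeholder name for illustration, not the actual OAI entry point:

// treat the transport block as erroneous if the decoder did not converge
uint32_t num_iter = decode_block(llr_in, p_out /* , ... */);
if (num_iter > max_num_iter) {
    // CRC/syndrome check failed: report a decoding failure so that
    // HARQ can request a retransmission
}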

Unittest

We have implemented unittests using pytest to verify the correctness of the CUDA LDPC decoder implementation. The tests can be found in tutorials/ldpc_cuda/tests/ and validate the decoder against Sionna's reference implementation. The tests use nanobind to interface between Python and the CUDA code, allowing us to test the decoder independently from the full 5G stack. For details on the Python-CUDA binding, please refer to the nanobind documentation.

A script for building and running the tests is provided in tutorials/ldpc_cuda/tests/build_and_run.sh

cd tutorials/ldpc_cuda/tests
./build_and_run.sh

Note that the test requires the sionna and pytest packages to be installed.

Outlook - Weighted Belief Propagation

An extension of classical belief propagation is weighted belief propagation [Nachmani2016], which uses gradient descent to optimize trainable weights applied during message passing. An example implementation can be found in the Sionna tutorial on Weighted Belief Propagation.

When looking at the current implementation of the CN update, we can already see a damping factor of 3/4 applied to each outgoing CN message [Pretti2005].

#define APPLY_DAMPING_INT(x) (x*3/4)

    // apply damping factor
    min_1 = APPLY_DAMPING_INT(min_1); // min_1 * DAMPING_FACTOR, e.g. *3/4
    min_2 = APPLY_DAMPING_INT(min_2); // min_2 * DAMPING_FACTOR, e.g. *3/4

One could now implement a weighted CN update by simply replacing the damping factor with a trainable parameter from the weighted BP tutorial.
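
A minimal sketch of such a weighted CN update is shown below, assuming per-iteration weights in Q8 fixed point. The array cn_weights_q8, the helper apply_weight_q8(), and the additional iteration index argument are assumptions for illustration, not part of the current implementation:

// one trained weight per decoding iteration, stored in Q8 fixed point
// (192 corresponds to 192/256 = 0.75, i.e., the original damping factor)
__constant__ int cn_weights_q8[32];

__device__ inline int apply_weight_q8(int x, int w_q8) {
    return (x * w_q8) >> 8; // fixed-point multiply by the trained weight
}

// inside update_cn_kernel(..., uint32_t iter), replacing APPLY_DAMPING_INT:
//     min_1 = apply_weight_q8(min_1, cn_weights_q8[iter]);
//     min_2 = apply_weight_q8(min_2, cn_weights_q8[iter]);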

We hope you enjoyed this tutorial. Following the principles outlined here, you are now well equipped to accelerate other compute-intensive parts of the 5G stack as well.