Part 2: CUDA Implementation

Fig. 22 Inline and lookaside in the gNB DU processing pipeline. The figure is from [Kundu2023B]. Note that the Jetson AGX Orin platform uses a unified memory architecture, which can be seen as a hybrid of the inline and lookaside approaches from a memory-management perspective.
The second part of this tutorial focuses on the implementation of the LDPC decoder using CUDA and explains common pitfalls when offloading compute-intensive functions to the GPU. Further, we show how the GPU-accelerated LDPC decoder can be integrated into the OAI stack.
Function offloading to dedicated accelerators requires careful consideration of the data flow with respect to the underlying memory architecture. This is especially critical for wireless communications applications, where strict latency requirements in the receiver processing pipeline necessitate efficient memory management when utilizing dedicated signal processing accelerators.
As shown in Fig. 22, one needs to distinguish between inline and lookaside acceleration. While lookaside acceleration moves data between the CPU and the hardware accelerator, inline acceleration avoids such data movement by applying the entire processing pipeline on the hardware accelerator.
Strictly speaking, the Jetson platform still moves data between the CPU and GPU. However, the overhead is significantly reduced compared to traditional split-memory architectures as the Jetson AGX Orin platform shares the same physical memory between CPU and GPU. This has implications for the caching behavior and requires a careful implementation of the CUDA kernels to avoid performance degradation. Nevertheless, we will consider it as inline acceleration for the following discussion.
For further details on CUDA, we refer to the NVIDIA CUDA Programming Guide [CUDA2024] and the CUDA for Tegra Memory Model [Tegra2024].
Overview

Fig. 23 Overview of CUDA implementation of the LDPC BP decoding algorithm.
The CUDA implementation can be found in tutorials/ldpc_cuda/runtime/ldpc_decoder.cu. The core decoding algorithm is implemented in the update_cn_kernel(.) and update_vn_kernel(.) functions. Both kernels are iteratively called and perform the check node (CN) and variable node (VN) updates, respectively. The decoder stops when the maximum number of iterations is reached. An additional early stopping condition could also be applied to reduce the average number of iterations.
The pack_bits_kernel(.) kernel maps the soft values to hard-decided bits and packs them into a more compact byte representation, which is required by the OAI processing pipeline.
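To illustrate how these kernels interact, the following sketch outlines the host-side control flow. The launch parameters and buffer names (blocks_cn, blocks_vn, blocks_pack, llr_msg, llr_total, etc.) are simplified placeholders and do not exactly match the code in ldpc_decoder.cu.
// Illustrative host-side decoding loop (simplified sketch; see ldpc_decoder.cu for
// the actual buffers, launch dimensions, and stream handling)
for (uint32_t iter = 0; iter < num_iterations; ++iter) {
    // check node update: compute extrinsic min-sum messages
    update_cn_kernel<<<blocks_cn, threads, 0, stream>>>(
        llr_total, llr_msg, Z, bg_cn, bg_cn_degree, num_rows, /*first_iter=*/iter == 0);

    // variable node update: accumulate channel LLRs and incoming CN messages
    update_vn_kernel<<<blocks_vn, threads, 0, stream>>>(
        llr_msg, llr_ch, llr_total, Z, bg_vn, bg_vn_degree, num_cols, num_rows);
}

// hard-decide the soft values and pack them into bytes for the OAI pipeline
pack_bits_kernel<<<blocks_pack, PACK_BITS_KERNEL_THREADS, 0, stream>>>(
    llr_total, bits_out, block_length);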
CUDA Integration in OAI
Running the code from the tutorial requires integration into the OAI stack. After patching, the Sionna Research Kit and the related Dockerfiles are already configured for CUDA support. Thus, before building the Docker images, you need to ensure that the OAI stack is patched via
# Patch the OAI stack
./scripts/patch_oai-tutorials.sh
If you followed the Quickstart tutorial, this automatically patches the OAI stack and adds the CUDA flag to the build pipeline.
After patching, the CUDA implementation is located in
openairinterface5g/tutorials/ldpc_cuda/runtime/ldpc_decoder.cu
You can now modify the ldpc_decoder.cu file and implement your own decoder variants.
After modifying the ldpc_decoder.cu file, you need to rebuild the Docker images via
./scripts/build-oai-images.sh
Running the Decoder
The LDPC decoder is implemented as a shared library that can be loaded using the OAI shared library loader. Thus, the CUDA-based decoder can be used as a drop-in replacement for the existing decoder implementations. It can be loaded when running the gNB via the following GNB_EXTRA_OPTIONS in the .env file of the config folder.
GNB_EXTRA_OPTIONS=--loader.ldpc.shlibversion _cuda --thread-pool 5,6,7,8,9,10,11
We strongly recommend additionally assigning dedicated CPU cores to the PHY-layer processing via the thread-pool option. The option above assigns cores 5-11 to the PHY-layer thread pool. Note that the lower CPU cores are assigned to handle USRP-related tasks such as time synchronization.
You can now start the gNB with the CUDA-based decoder by running
scripts/start_system.sh b200_arm64 # replace b200_arm64 with your config folder
The GPU load can be monitored via
jtop
Congratulations! You have now successfully accelerated the LDPC decoder using CUDA.
Implementation Aspects
The following sections focus on various technical aspects of the CUDA implementation and the performance implications of different memory transfer patterns.
For debugging and profiling, please refer to the tutorial on Debugging & Troubleshooting.
Memory Management
In order to offload compute-intensive processing to the GPU, data needs to be shared between the host (CPU) and the device (GPU). We can leverage the shared system memory architecture of the Jetson platform to avoid the bottleneck of costly memory transfers on traditional split-memory platforms.
In fact, we can avoid the overhead of any complex API calls and memory transition operations by allocating page-locked memory on the host using cudaHostAlloc. To make the input LLRs visible to the GPU, we then use a simple memcpy() operation to copy the inputs from the 5G stack into such a page-locked buffer, where unified memory addressing allows direct reads and writes both on the host and the device. For the output bits, we first synchronize the parallel CUDA command stream and then simply use memcpy() to copy the packed bits from the shared page-locked buffer into the 5G stack output buffer.
Note that this simplicity is achieved via some of the implied cache semantics [Tegra2024]: Host memory caches are active to allow fast reads and writes on the host side, while device memory caching is disabled on the page-locked buffers to directly write to or read from the shared system memory. For fast reads and writes in device-side compute kernels, we ensure maximally coalesced read-once and write-once memory access patterns (addressing consecutive memory in consecutive threads).
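As a minimal sketch, assuming a hypothetical helper and placeholder sizes (the actual allocation code in ldpc_decoder.cu differs in detail), the page-locked buffers can be allocated as follows.
#include <cuda_runtime.h>
#include <cstdint>

// Sketch of the allocation pattern (alloc_llr_input_buffer and max_num_llrs are
// illustrative placeholders; error checking is omitted for brevity)
static int8_t* alloc_llr_input_buffer(size_t max_num_llrs) {
    int8_t* llr_in_host = nullptr;
    // page-locked host memory that is also mapped into the device address space
    cudaHostAlloc((void**)&llr_in_host, max_num_llrs * sizeof(int8_t), cudaHostAllocMapped);

    // with unified addressing (as on the Jetson AGX Orin), llr_in_host can be passed
    // directly to kernels; otherwise, the device alias is queried explicitly:
    int8_t* llr_in_dev = nullptr;
    cudaHostGetDevicePointer((void**)&llr_in_dev, llr_in_host, 0);
    (void)llr_in_dev; // a real implementation would store this alongside the host pointer

    return llr_in_host;
}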
Traditional memory transfer patterns designed for split-memory architectures can also be used on the Jetson platform, but they can lead to higher overheads depending on the exact architecture and generation. Explicit memory transfer calls such as cudaMemcpyAsync(inputs..., cudaMemcpyHostToDevice, stream) and cudaMemcpyAsync(outputs..., cudaMemcpyDeviceToHost, stream) incur additional API overheads and may depend on the availability of additional copy engines; explicitly transitioning managed pageable memory allocated by cudaMallocManaged() via cudaStreamAttachMemAsync() may also incur excessive API overhead for small buffer sizes. Therefore, we instead use the patterns described above to optimize for shared system memory and low latency, particularly since input and output buffers are typically small in the real-time 5G stack.
For comparison, we show both variants side by side in the following code, where the latency-optimized code path is the one with USE_UNIFIED_MEMORY defined.
We copy the input LLRs from the host to the device memory in the ldpc_decoder.cu file:
    // copy input data to device-visible memory
#ifdef USE_UNIFIED_MEMORY
    memcpy(const_cast<int8_t*>(mapped_llr_in), llr_in, num_llrs * sizeof(*llr_in));
#else
    cudaCheck(cudaMemcpyAsync(const_cast<int8_t*>(mapped_llr_in), llr_in, num_llrs * sizeof(*llr_in), cudaMemcpyHostToDevice, stream));
#endif
After decoding, we make the output bits available via a host-side copy in parallel to an asynchronous syndrome check on the device:
    // pack LDPC output bits on the device
    int pack_thread_blocks = (block_length + 127) / 128;
    pack_decoded_bit_kernel<<<pack_thread_blocks, 128, 0, context.stream>>>(context.dev_llr_out, context.dev_bits_out, block_length);

    // allow CPU access of output bits while computing syndrome
#ifdef USE_UNIFIED_MEMORY
    cudaStreamSynchronize(context.stream);

    // ... schedule syndrome or CRC check ...

    // while syndrome computations are running on the device, copy output bits to 5G stack output buffer
    memcpy(p_out, context.dev_bits_out, memorySize_bits_out);
#else
    cudaCheck(cudaMemcpyAsync(p_out, context.dev_bits_out, memorySize_bits_out, cudaMemcpyDeviceToHost, context.stream));

    // ... schedule syndrome or CRC check ...

    // allow CPU access of output bits and syndrome
    cudaStreamSynchronize(context.stream);
#endif
Kernel Optimization
The core idea of CUDA programming is to parallelize the execution of the kernel on the GPU. The kernel is executed by a grid of blocks, each containing a number of independent threads. This level of parallelism allows for significant speedups compared to the serial execution on the CPU.
For a detailed introduction to CUDA programming, we refer to the CUDA Programming Guide. Conceptually, a CUDA kernel is defined as follows
// Kernel definition: __global__ indicates that the function is a CUDA kernel
__global__ void my_kernel(float *data, int N) {
    // Each thread processes one element of the array
    // We calculate the global thread index from the block and thread indices
    // idx is unique for each thread
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < N) {
        // process the idx-th element of the array
        data[idx] = ...;
    }
}
This kernel can now be launched by specifying its grid and block dimensions via
// Launch the kernel
// <<<1, 32>>> specifies the grid and block sizes
my_kernel<<<1, 32>>>(d_data, N);
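For completeness, a minimal self-contained host program following this pattern might look as follows; the kernel, array size, and launch configuration are purely illustrative and unrelated to the LDPC decoder.
#include <cuda_runtime.h>

// minimal self-contained example: scale every element of an array on the GPU
__global__ void scale_kernel(float* data, int N) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < N)
        data[idx] *= 2.0f; // concrete stand-in for the "..." above
}

int main() {
    const int N = 1024;
    float* d_data = nullptr;
    cudaMalloc(&d_data, N * sizeof(float));

    // launch enough blocks of 256 threads to cover all N elements
    int threads = 256;
    int blocks = (N + threads - 1) / threads;
    scale_kernel<<<blocks, threads>>>(d_data, N);

    // wait for the kernel to finish before reading back or freeing the buffer
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}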
For LDPC decoding, the decoder can be parallelized over the variable node (VN) and check node (CN) updates. An overview of the CUDA implementation is given in Fig. 23. The VN update kernel is given as
static __global__ void update_vn_kernel(llr_msg_t const* llr_msg, int8_t const* llr_ch, llr_accumulator_t* llr_total,
                                        uint32_t Z, uint32_t const* bg_vn, uint32_t const* bg_vn_degree, uint32_t num_cols, uint32_t num_rows) {
    uint32_t tid = blockIdx.x * blockDim.x + threadIdx.x;

    uint32_t i = tid % Z;       // for i in range(Z)
    uint32_t idx_col = tid / Z; // for idx_col in range(num_cols)
    if (idx_col >= num_cols) return;

    uint32_t vn_degree = bg_vn_degree[idx_col];

    // list of tuples (idx_row = index_cn, s)
    // idx_col = idx_vn and msg_offset omitted,
    // msg spread out to idx_cn + idx_vn * num_cn
    uint32_t const* variable_nodes = &bg_vn[idx_col]; // len(vn) = vn_degree

    int msg_sum = 0;
    // accumulate all incoming LLRs
    for (uint32_t j = 0; j < vn_degree; ++j) {
        uint32_t vn = variable_nodes[j * num_cols];

        // see packing layout above
        uint32_t idx_row = vn & 0xffffu; // note: little endian
        uint32_t s = vn >> 16;           // ...
        uint32_t msg_offset = idx_row + idx_col * num_rows;

        // index of the msg in the LLR array
        // it is the idx_col-th variable node, and the j-th message from the idx_row-th check node
        uint32_t msg_idx = msg_offset * Z + (i-s+(Z<<8))%Z;

        // accumulate all incoming LLRs
        msg_sum += llr_msg[msg_idx];
    }

    // add the channel LLRs
    msg_sum += llr_ch[idx_col*Z + i];

    msg_sum = min(max(msg_sum, -MAX_LLR_ACCUMULATOR_VALUE), MAX_LLR_ACCUMULATOR_VALUE);

    llr_total[idx_col*Z + i] = llr_accumulator_t(msg_sum);
}
As the OAI processing pipeline uses multiple threads, we need to ensure proper multi-threading support in the CUDA code. This is done via a thread-specific context that is used for all kernel launches of a given thread. It ensures that each thread operates on its own CUDA stream and, thus, can execute in parallel without interference.
struct ThreadContext {
    cudaStream_t stream = 0;

    // Device memory declarations - use raw pointers instead of device symbols
    int8_t* llr_in_buffer = nullptr;
    uint8_t* llr_bits_out_buffer = nullptr;
    uint8_t* syndrome_buffer = nullptr;
#ifndef USE_UNIFIED_MEMORY
    uint8_t* host_syndrome_buffer = nullptr;
#endif
    llr_msg_t* llr_msg_buffer = nullptr;
    llr_accumulator_t* llr_total_buffer = nullptr;

    // list of thread contexts for shutdown
    ThreadContext* next_initialized_context = nullptr;
};
static __thread ThreadContext thread_context = { };
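A lazy, per-thread initialization of such a context could look roughly like the following sketch; the helper name is hypothetical and the buffer allocations are omitted, so this does not reproduce the exact initialization in ldpc_decoder.cu.
// Sketch: lazily create a dedicated, non-blocking stream per decoder thread
// (get_thread_context() is a hypothetical helper; buffer allocations omitted)
static ThreadContext* get_thread_context() {
    ThreadContext* ctx = &thread_context;
    if (ctx->stream == 0) {
        // non-blocking stream so decoder threads do not implicitly synchronize
        // with the default stream or with each other
        cudaStreamCreateWithFlags(&ctx->stream, cudaStreamNonBlocking);

        // ... allocate the per-thread device and page-locked buffers here ...
    }
    return ctx;
}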
Further, the decoder uses clipping values for the extrinsic messages and the VN accumulator.
    // add the channel LLRs
    msg_sum += llr_ch[idx_col*Z + i];

    msg_sum = min(max(msg_sum, -MAX_LLR_ACCUMULATOR_VALUE), MAX_LLR_ACCUMULATOR_VALUE);

    llr_total[idx_col*Z + i] = llr_accumulator_t(msg_sum);
    // clip msg magnitudes to MAX_LLR_VALUE
    min_1 = min(max(min_1, -MAX_LLR_MSG_VALUE), MAX_LLR_MSG_VALUE);
    min_2 = min(max(min_2, -MAX_LLR_MSG_VALUE), MAX_LLR_MSG_VALUE);
Following the same principles, the CN update kernel is given as
static __global__ void update_cn_kernel(llr_accumulator_t const* llr_total, llr_msg_t* llr_msg,
                                        uint32_t Z, uint32_t const* bg_cn, uint32_t const* bg_cn_degree, uint32_t num_rows,
                                        bool first_iter) {
    uint32_t tid = blockIdx.x * blockDim.x + threadIdx.x;

    uint32_t i = tid % Z;       // for i in range(Z)
    uint32_t idx_row = tid / Z; // for idx_row in range(num_rows)
    if (idx_row >= num_rows) return;

    uint32_t cn_degree = bg_cn_degree[idx_row];

    // list of tuples (idx_col = idx_vn, s),
    // idx_row = idx_cn and msg_offset omitted,
    // msg spread out to idx_cn + idx_vn * num_cn
    uint32_t const* check_nodes = &bg_cn[idx_row]; // len(cn) = cn_degree

    // search the "extrinsic" min of all incoming LLRs
    // this means we need to find the min and the second min of all incoming LLRs
    int min_1 = INT_MAX;
    int min_2 = INT_MAX;
    int idx_min = -1;
    int node_sign = 1;
    uint32_t msg_signs = 0; // bitset, 0 == positive; max degree is 19

    for (uint32_t ii = 0; ii < cn_degree; ++ii) {
        uint32_t cn = check_nodes[ii * num_rows];

        // see packing layout above
        uint32_t idx_col = cn & 0xffffu; // note: little endian
        uint32_t s = cn >> 16;           // ...
        uint32_t msg_offset = idx_row + idx_col * num_rows;

        uint32_t msg_idx = msg_offset * Z + i;

        // total VN message
        int t = llr_total[idx_col*Z + (i+s)%Z];

        // make extrinsic by subtracting the previous msg
        if (!first_iter)
            t -= llr_msg[msg_idx];

        // store sign for 2nd recursion
        // note: could be also used for syndrome-based check or early termination
        int sign = (t >= 0 ? 1 : -1);
        node_sign *= sign;
        msg_signs |= (t < 0) << ii; // for later sign calculation

        // find min and second min
        int t_abs = abs(t);
        if (t_abs < min_1) {
            min_2 = min_1;
            min_1 = t_abs;
            idx_min = msg_idx;
        } else if (t_abs < min_2)
            min_2 = t_abs;
    }

    // START marker-cnp-damping
    // apply damping factor
    min_1 = APPLY_DAMPING_INT(min_1); // min_1 * DAMPING_FACTOR, e.g. *3/4
    min_2 = APPLY_DAMPING_INT(min_2); // min_2 * DAMPING_FACTOR, e.g. *3/4
    // END marker-cnp-damping

    // clip msg magnitudes to MAX_LLR_VALUE
    min_1 = min(max(min_1, -MAX_LLR_MSG_VALUE), MAX_LLR_MSG_VALUE);
    min_2 = min(max(min_2, -MAX_LLR_MSG_VALUE), MAX_LLR_MSG_VALUE);
    // END marker-vnp-clipping

    // apply min and second min to the outgoing LLR
    for (uint32_t ii = 0; ii < cn_degree; ++ii) {
        uint32_t cn = check_nodes[ii * num_rows];

        // see packing layout above
        uint32_t idx_col = cn & 0xffffu; // note: little endian
        uint32_t msg_offset = idx_row + idx_col * num_rows;

        uint32_t msg_idx = msg_offset * Z + i;
        int min_val;
        if (msg_idx == idx_min)
            min_val = min_2;
        else
            min_val = min_1;

        int msg_sign = (msg_signs >> ii) & 0x1 ? -1 : 1;

        // and update outgoing msg including sign
        llr_msg[msg_idx] = llr_msg_t(min_val * node_sign * msg_sign);
    }
}
Note that the choice of the clipping values is critical for the performance of the decoder, in particular as the decoder uses mostly signed char variables. This keeps the memory footprint low and can be configured via the following macros
static const int MAX_LLR_ACCUMULATOR_VALUE = 127;
typedef int8_t llr_accumulator_t;
static const int MAX_LLR_MSG_VALUE = 127;
typedef int8_t llr_msg_t;
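If more numerical headroom is desired for experimentation, one could, as an assumption-based variation (not a configuration shipped with the tutorial), widen the accumulator type at the cost of a larger memory footprint and more memory traffic.
// Hypothetical wider-accumulator configuration (doubles the accumulator memory footprint)
static const int MAX_LLR_ACCUMULATOR_VALUE = 32767;
typedef int16_t llr_accumulator_t;
static const int MAX_LLR_MSG_VALUE = 127; // extrinsic messages can remain 8 bit
typedef int8_t llr_msg_t;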
For a given lifting factor Z and the numbers of rows num_rows and columns num_cols determined by the base graph (BG) selection, the following grid and block dimensions are used
dim3 threads(256);
// check node update
dim3 blocks_cn(blocks_for(bg.num_rows * Z, threads.x));
// VN update
dim3 blocks_vn(blocks_for(bg.num_cols * Z, threads.x));
inline __host__ __device__ uint32_t blocks_for(uint32_t elements, int block_size) {
    return int( uint32_t(elements + (block_size-1)) / uint32_t(block_size) );
}
Note that the grid and block dimensions are architecture-dependent and should be tuned for the specific GPU.
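As a concrete example, for base graph 1 (46 rows and 68 columns) with the maximum lifting factor Z = 384 and 256 threads per block, the launch configuration evaluates to
// Worked example for BG1 (46 x 68) with Z = 384 and 256 threads per block
dim3 threads(256);
dim3 blocks_cn(blocks_for(46 * 384, threads.x)); // 17664 / 256 = 69 blocks
dim3 blocks_vn(blocks_for(68 * 384, threads.x)); // 26112 / 256 = 102 blocks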
After decoding, the output is hard-decided and packed into a byte-array.
static __global__ void pack_bits_kernel(llr_accumulator_t const* llr_total, uint8_t* bits, uint32_t block_length) {
    uint32_t tid = blockIdx.x * blockDim.x + threadIdx.x;

    uint32_t coop_byte = 0;
    // 1 bit per thread
    if (tid < block_length)
        coop_byte = (llr_total[tid] < 0) << (7 - (threadIdx.x & 7)); // note: highest to lowest bit

    // use fast lane shuffles to assemble one byte per group of 8 adjacent threads
    coop_byte += __shfl_xor_sync(0xffffffff, coop_byte, 1); // xxyyzzww
    coop_byte += __shfl_xor_sync(0xffffffff, coop_byte, 2); // xxxxyyyy
    coop_byte += __shfl_xor_sync(0xffffffff, coop_byte, 4); // xxxxxxxx

    // share bytes across thread group to allow one coalesced write by first N threads
    __shared__ uint32_t bit_block_shared[PACK_BITS_KERNEL_THREADS / 8];
    if ((threadIdx.x & 0x7) == 0)
        bit_block_shared[threadIdx.x / 8] = coop_byte;

    __syncthreads();

    // the first (PACK_BITS_KERNEL_THREADS / 8) threads pack 8 bits each
    if (threadIdx.x < PACK_BITS_KERNEL_THREADS / 8 && blockIdx.x * PACK_BITS_KERNEL_THREADS + threadIdx.x * 8 < block_length) {
        bits[blockIdx.x * PACK_BITS_KERNEL_THREADS / 8 + threadIdx.x] = bit_block_shared[threadIdx.x];
    }
}
Note
The decoder returns the number of iterations and declares a decoding failure by returning max_num_iter + 1 if the CRC or the syndrome check fails.
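A caller can therefore detect a failure with a simple comparison; the function name below is a placeholder and not the actual OAI decoder entry point.
// Hypothetical caller-side check (decode_ldpc() is a placeholder name)
uint32_t n_iter = decode_ldpc(/* ... */);
if (n_iter > max_num_iter) {
    // CRC / syndrome check failed: treat the code block as erroneous
}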
Unittest
We have implemented unittests using pytest to verify the correctness of the CUDA LDPC decoder implementation. The tests can be found in tutorials/ldpc_cuda/test/ and validate the decoder against Sionna’s reference implementation. The tests use nanobind to interface between Python and the CUDA code, allowing us to test the decoder independently from the full 5G stack. For details on the Python-CUDA binding, please refer to the nanobind documentation.
A script for building and running the tests is provided in tutorials/ldpc_cuda/tests/build_and_run.sh
cd tutorials/ldpc_cuda/tests
./build_and_run.sh
Note that the tests require the sionna and pytest packages to be installed.
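A minimal nanobind binding could look roughly like the following sketch; the module name, function name, and argument list are assumptions for illustration and do not reproduce the actual binding code used by the tests.
#include <nanobind/nanobind.h>
#include <nanobind/ndarray.h>

namespace nb = nanobind;

// Hypothetical binding sketch: expose a decode() function that takes the channel
// LLRs as a 1-D int8 array and returns the number of iterations used
NB_MODULE(ldpc_cuda_ext, m) {
    m.def("decode",
          [](nb::ndarray<int8_t, nb::ndim<1>, nb::c_contig> llr_in,
             int BG, int Z, int max_num_iter) -> int {
              // ... call into the CUDA decoder from ldpc_decoder.cu here ...
              (void)llr_in; (void)BG; (void)Z;
              return max_num_iter; // placeholder return value
          });
}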
Outlook - Weighted Belief Propagation
An extension of classical belief propagation is weighted belief propagation [Nachmani2016], which uses gradient descent to optimize trainable parameters applied during message passing. An example implementation can be found in the Sionna tutorial on Weighted Belief Propagation.
When looking at the current implementation of the CN update, we can already see a damping factor of 3/4 applied to each outgoing CN message [Pretti2005].
#define APPLY_DAMPING_INT(x) (x*3/4)
    // apply damping factor
    min_1 = APPLY_DAMPING_INT(min_1); // min_1 * DAMPING_FACTOR, e.g. *3/4
    min_2 = APPLY_DAMPING_INT(min_2); // min_2 * DAMPING_FACTOR, e.g. *3/4
One could now implement a weighted CN update by simply replacing the damping factor with a trainable parameter from the weighted BP tutorial.
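As a sketch (not part of the shipped code), the constant damping macro could be replaced by a per-iteration weight that is passed to the CN kernel as an additional argument; the weight would be trained offline, e.g., following the Sionna weighted BP tutorial, and then quantized to fixed point.
// Hypothetical fixed-point weighting: w_q is a trained per-iteration weight quantized
// to Q1.7 fixed point (e.g., 0.75 -> 96) and passed to update_cn_kernel as an argument
#define APPLY_WEIGHT_INT(x, w_q) (((x) * (w_q)) >> 7)

// inside the CN update, instead of APPLY_DAMPING_INT:
//   min_1 = APPLY_WEIGHT_INT(min_1, w_q);
//   min_2 = APPLY_WEIGHT_INT(min_2, w_q);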
We hope you enjoyed this tutorial. Following the principles outlined here, you are now well equipped to accelerate other compute-intensive parts of the 5G stack as well.