Part 2: GPU-Accelerated Inference

Neural Demapper Overview

We now discuss how to integrate the TensorRT engine for inference into the OAI stack. To keep the inference latency as low as possible [Gadiyar2023], [Kundu2023], we use CUDA graphs [Gray2019] to launch the TensorRT inference engine.

You will learn about:

  • How to accelerate the neural demapper using TensorRT

  • How to pre- and post-process input and output data using CUDA

  • How to use CUDA graphs for latency reductions

For details on efficient memory management when offloading compute-intensive functions to the GPU, we refer to the GPU-Accelerated LDPC Decoding tutorial.

Demapper Implementation Overview

The neural demapper is implemented in TensorFlow and exported to TensorRT; the source code of the inference logic can be found in tutorials/neural_demapper/runtime/trt_demapper.cpp. The implementation is explained in the following sections.

The TRT demapper receives noisy input symbols from the OpenAirInterface stack via the function trt_demapper_decode(), which chunks a given array of symbols into batches of at most MAX_BLOCK_LEN symbols and then calls trt_demapper_decode_block() to carry out the actual inference on each batch. To leverage data-parallel execution on the GPU, inference is performed on batches of symbols and across multiple threads in parallel. The output of the neural demapper is passed back in the form of num_bits_per_symbol LLRs per input symbol.
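
For orientation, a minimal sketch of what this batching entry point could look like, using the per-thread buffers and the trt_demapper_decode_block() signature introduced below; the actual function in trt_demapper.cpp may differ in its parameters:

// Sketch only: split the incoming symbol array into batches of at most
// MAX_BLOCK_LEN symbols and decode each batch. The parameter layout is an
// assumption for illustration.
extern "C" void trt_demapper_decode(int16_t const* symbols, int16_t const* mags,
                                    size_t num_symbols, int16_t* llrs,
                                    uint32_t num_bits_per_symbol) {
    TRTContext* ctx = trt_demapper_init();  // lazy global + per-thread setup (see below)
    for (size_t offset = 0; offset < num_symbols; offset += MAX_BLOCK_LEN) {
        size_t batch = num_symbols - offset < MAX_BLOCK_LEN ? num_symbols - offset : MAX_BLOCK_LEN;
        trt_demapper_decode_block(ctx, ctx->default_stream,
                                  symbols + 2 * offset, mags + 2 * offset, batch,
                                  ctx->symbol_buffer, ctx->magnitude_buffer, batch,
                                  llrs + num_bits_per_symbol * offset, num_bits_per_symbol,
                                  ctx->llr_buffer);
    }
}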

To run the TensorRT inference engine on the given int16_t-quantized data, we dequantize input symbols to half-precision floating-point format on the GPU using a data-parallel CUDA kernel (see norm_int16_symbols_to_float16()), and re-quantize output LLRs using another CUDA kernel (see float16_llrs_to_int16()).

Setting up the TensorRT Inference Engine

To be compatible with the multi-threaded OpenAirInterface implementation, we load the neural demapper network into a TensorRT ICudaEngine once and share the inference engine between multiple IExecutionContext objects which are created per worker thread. We store the global and per-thread state, respectively, as follows:

#include "NvInfer.h"
#include <cuda_fp16.h>

using namespace nvinfer1;  // TensorRT types (IRuntime, ICudaEngine, IExecutionContext, ...)

// global
static IRuntime* runtime = nullptr;
static ICudaEngine* engine = nullptr;

// per thread
struct TRTContext {
    cudaStream_t default_stream = 0;     // asynchronous CUDA command stream
    IExecutionContext* trt = nullptr;    // TensorRT execution context
    void* prealloc_memory = nullptr;     // memory block for temporary per-inference data
    __half* input_buffer = nullptr;      // device-side network inputs after CUDA pre-processing
    __half* output_buffer = nullptr;     // device-side network outputs before CUDA post-processing

    int16_t* symbol_buffer = nullptr;    // write-through buffer for symbols written by CPU and read by GPU
    int16_t* magnitude_buffer = nullptr; // write-through buffer for magnitude estimates written by CPU and read by GPU
    int16_t* llr_buffer = nullptr;       // host-cached buffer for LLR estimates written by GPU and read by CPU

    // list of thread contexts for shutdown
    TRTContext* next_initialized_context = nullptr;
};
static __thread TRTContext thread_context = { };

We call the following global initialization routine on program startup:

static char const* trt_weight_file = "models/neural_demapper_qam16_2.plan"; // Training result / trtexec output
static bool trt_normalized_inputs = true;

extern "C" TRTContext* trt_demapper_init() {
    if (runtime)  // lazy, global
        return &trt_demapper_init_context();

    printf("Initializing TRT runtime\n");
    runtime = createInferRuntime(logger);
    printf("Loading TRT engine %s (normalized inputs: %d)\n", trt_weight_file, trt_normalized_inputs);
    std::vector<char> modelData = readModelFromFile(trt_weight_file);
    engine = runtime->deserializeCudaEngine(modelData.data(), modelData.size());

    return &trt_demapper_init_context();
}

// Utilities

std::vector<char> readModelFromFile(char const* filepath) {
    std::vector<char> bytes;
    FILE* f = fopen(filepath, "rb");
    if (!f) {
        logger.log(Logger::Severity::kERROR, filepath);
        return bytes;
    }
    fseek(f, 0, SEEK_END);
    bytes.resize((size_t) ftell(f));
    fseek(f, 0, SEEK_SET);
    if (bytes.size() != fread(bytes.data(), 1, bytes.size(), f))
        logger.log(Logger::Severity::kWARNING, filepath);
    fclose(f);
    return bytes;
}

struct Logger : public ILogger
{
    void log(Severity severity, const char* msg) noexcept override
    {
        // suppress info-level messages
        if (severity <= Severity::kWARNING)
            printf("TRT %s: %s\n", severity == Severity::kWARNING ? "WARNING" : "ERROR", msg);
    }
};
static Logger logger;

On startup of each worker thread, we initialize the per-thread contexts as follows:

TRTContext& trt_demapper_init_context() {
    auto& context = thread_context;
    if (context.trt) // lazy
        return context;

    printf("Initializing TRT context (TID %d)\n", (int) gettid());

    // create execution context with its own pre-allocated temporary memory attached
    context.trt = engine->createExecutionContextWithoutDeviceMemory();
    size_t preallocSize = engine->getDeviceMemorySize();
    CHECK_CUDA(cudaMalloc(&context.prealloc_memory, preallocSize));
    context.trt->setDeviceMemory(context.prealloc_memory);

    // create own asynchronous CUDA command stream for this thread
    CHECK_CUDA(cudaStreamCreateWithFlags(&context.default_stream, cudaStreamNonBlocking));

    // allocate neural network input and output buffers (device access memory)
    cudaMalloc((void**) &context.input_buffer, sizeof(*context.input_buffer) * 4 * MAX_BLOCK_LEN);
    cudaMalloc((void**) &context.output_buffer, sizeof(*context.output_buffer) * MAX_BITS_PER_SYMBOL * MAX_BLOCK_LEN);

    // OAI decoder input buffers that can be written and read with unified addressing from CPU and GPU, respectively
    // note: GPU reads are uncached, but read-once coalesced
    cudaHostAlloc((void**) &context.symbol_buffer, sizeof(*context.symbol_buffer) * 2 * MAX_BLOCK_LEN, cudaHostAllocMapped | cudaHostAllocWriteCombined);
    cudaHostAlloc((void**) &context.magnitude_buffer, sizeof(*context.magnitude_buffer) * 2 * MAX_BLOCK_LEN, cudaHostAllocMapped | cudaHostAllocWriteCombined);
    // OAI decoder output buffers that can be written and read with unified addressing from GPU and CPU, respectively
    // note: GPU writes are uncached, but write-once coalesced
    cudaHostAlloc((void**) &context.llr_buffer, sizeof(*context.llr_buffer) * MAX_BITS_PER_SYMBOL * MAX_BLOCK_LEN, cudaHostAllocMapped);

    // keep track of active thread contexts for shutdown
    TRTContext* self = &context;
    __atomic_exchange(&initialized_thread_contexts, &self, &self->next_initialized_context, __ATOMIC_ACQ_REL);

    return context;
}
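
The CHECK_CUDA macro used in the listings is not shown in the excerpts; a typical definition (an assumption, the macro in the actual sources may differ) is:

#include <cstdio>
#include <cstdlib>

// Hypothetical error-checking macro: prints the CUDA error string and aborts
// on failure of a CUDA runtime API call.
#define CHECK_CUDA(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            abort();                                                  \
        }                                                             \
    } while (0)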

Running Batched Inference

If the decoder symbols are already available in half-precision floating-point format, running the TensorRT inference engine boils down to a single call that enqueues the corresponding inference commands on the asynchronous CUDA command stream of the calling thread’s context:

void trt_demapper_run(TRTContext* context, cudaStream_t stream, __half const* inputs, size_t numInputs, size_t numInputComponents, __half* outputs) {
    if (stream == 0)
        stream = context->default_stream;

    context->trt->setTensorAddress("y", (void*) inputs);
    context->trt->setInputShape("y", Dims2(numInputs, numInputComponents));
    context->trt->setTensorAddress("output_1", outputs);
    context->trt->enqueueV3(stream);
}

Converting Data Types between Host and Device

In the OAI 5G stack, received symbols arrive from the host side in quantized int16_t format, together with a channel magnitude estimate. To convert the inputs to half-precision floating-point format, we first copy the symbols to a pinned memory buffer mapped_symbols that resides in unified addressable memory, and then run a CUDA kernel on the GPU for dequantization and normalization. After inference, the conversion back to quantized LLRs follows the same pattern: first, a CUDA kernel quantizes the half-precision floating-point inference outputs; then, the quantized data written by the GPU is read by the CPU through the unified addressable memory buffer mapped_outputs. Note that the CUDA command stream runs asynchronously and therefore needs to be synchronized with the calling thread before the output data is accessed.

extern "C" void trt_demapper_decode_block(TRTContext* context_, cudaStream_t stream, int16_t const* in_symbols, int16_t const* in_mags, size_t num_symbols,
                                          int16_t const *mapped_symbols, int16_t const *mapped_mags, size_t num_batch_symbols,
                                          int16_t* outputs, uint32_t num_bits_per_symbol, int16_t* mapped_outputs) {
    auto& context = *context_;

    memcpy((void*) mapped_symbols, in_symbols, sizeof(*in_symbols) * 2 * num_symbols);
    memcpy((void*) mapped_mags, in_mags, sizeof(*in_mags) * 2 * num_symbols);

    size_t num_in_components;
    if (trt_normalized_inputs) {
        norm_int16_symbols_to_float16(stream, mapped_symbols, mapped_mags, num_batch_symbols,
                                      (uint16_t*) context.input_buffer, 1);
        num_in_components = 2;
    }
    else {
        [...]
        num_in_components = 4;
    }

    trt_demapper_run(&context, stream, context.input_buffer, num_batch_symbols, num_in_components, context.output_buffer);

    float16_llrs_to_int16(stream, (uint16_t const*) context.output_buffer, num_batch_symbols,
                          mapped_outputs, num_bits_per_symbol);

    CHECK_CUDA(cudaStreamSynchronize(stream));
    memcpy(outputs, mapped_outputs, sizeof(*outputs) * num_bits_per_symbol * num_symbols);
}

The CUDA kernel for normalization runs in a straightforward 1D CUDA grid, reading the tuples of int16_t-quantized components that make up each complex value in a coalesced (consecutive) way, as one int32_t value each. The symbol values are then normalized with respect to the magnitude values and written back in a coalesced way, fusing each complex symbol into one __half2 value:

__global__ void
norm_int16_symbols_to_float16_kernel(
    const int16_t* __restrict__ symbols_i,
    const int16_t* __restrict__ magnitudes_i,
    uint32_t num_symbols,
    __half2* __restrict__ symbols_h,
    uint32_t output_int32_stride
    ) {
    uint32_t globalIdx = threadIdx.x + blockDim.x * blockIdx.x;
    if (globalIdx >= num_symbols)
        return;

    uint32_t symbolBits = reinterpret_cast<const uint32_t*>(symbols_i)[globalIdx];
    int16_t s_r = int16_t(uint16_t(symbolBits & 0xffff)); // note: little endian
    int16_t s_i = int16_t(uint16_t(symbolBits >> 16));    // ...

    uint32_t magBits = reinterpret_cast<const uint32_t*>(magnitudes_i)[globalIdx];
    int16_t m_r = int16_t(uint16_t(magBits & 0xffff)); // note: little endian
    int16_t m_i = int16_t(uint16_t(magBits >> 16));    // ...

    float2 sf;
    sf.x = float(s_r) / float(m_r);
    sf.y = float(s_i) / float(m_i);
    symbols_h[globalIdx * output_int32_stride] = __float22half2_rn(sf);
}

void norm_int16_symbols_to_float16(
    cudaStream_t stream,
    const int16_t* symbols_i,
    const int16_t* magnitudes_i,
    uint32_t num_symbols,
    uint16_t* symbols_h,
    uint32_t output_int32_stride
    ) {
    dim3 threads(256);
    dim3 blocks(blocks_for(num_symbols, threads.x));

    norm_int16_symbols_to_float16_kernel<<<blocks, threads, 0, stream>>>(
        symbols_i,
        magnitudes_i,
        num_symbols,
        reinterpret_cast<__half2*>(symbols_h),
        output_int32_stride
    );
}

inline __host__ __device__ int blocks_for(uint32_t elements, int block_size) {
    return int( uint32_t(elements + (block_size-1)) / uint32_t(block_size) );
}

The CUDA kernel for re-quantization of output LLRs works analogously, converting half-precision floating-point LLR tuples to quantized int16_t values by fixed-point scaling and rounding:

__global__ void
float16_llrs_to_int16_kernel(
    __half const* __restrict__ llrs_h,
    uint32_t num_llrs,
    int16_t* __restrict__ llrs_i
    ) {
    uint32_t globalIdx = threadIdx.x + blockDim.x * blockIdx.x;

    float2 tuple = {};
    if (2 * globalIdx + 1 < num_llrs)
        tuple = __half22float2( reinterpret_cast<const __half2*>(llrs_h)[globalIdx] );
    else if (2 * globalIdx < num_llrs)
        tuple.x = llrs_h[2 * globalIdx];
    else
        return;

    int16_t s1 = int16_t(__float2int_rn(ldexpf(tuple.x, 8)));
    int16_t s2 = int16_t(__float2int_rn(ldexpf(tuple.y, 8)));

    if (2 * globalIdx + 1 < num_llrs)
        reinterpret_cast<uint32_t*>(llrs_i)[globalIdx] = (uint32_t(s2 & 0xffffu) << 16) + uint32_t(s1 & 0xffffu); // note: little endian
    else
        llrs_i[2 * globalIdx] = s1;
}

void float16_llrs_to_int16(
    cudaStream_t stream,
    uint16_t const* llrs_h,
    uint32_t num_symbols,
    int16_t* llrs_i,
    uint32_t num_bits
    ) {
    dim3 threads(256);
    dim3 blocks(blocks_for(blocks_for(num_symbols * num_bits, 2u), threads.x));

    float16_llrs_to_int16_kernel<<<blocks, threads, 0, stream>>>(
        reinterpret_cast<__half const*>(llrs_h),
        num_symbols * num_bits,
        llrs_i
    );
}

Demapper Integration in OAI

Note

Ensure that you have built the TRT engine in the first part of the tutorial.

Finally, we integrate the TensorRT demapper implementation into the OAI stack. After patching, the Dockerfiles in the Sionna Research Kit are already configured for CUDA support. If patching manually, you need to patch the system before building the Docker images via

./scripts/patch_oai-tutorials.sh

This automatically patches the OAI stack and adds the CUDA flag to the build pipeline. Note that this is already done if you followed the Quickstart tutorial.

After patching, the demapper implementation is located in tutorials/neural_demapper/…. You can now modify the runtime/trt_demapper.cpp file and implement your own demapper variants.

In order to mount the TensorRT models and config files, you can extend the oai-gnb config in docker-compose.override.yaml:

services:
  oai-gnb:
    volumes:
      - ./models/:/opt/oai-gnb/models
      - ./demapper_trt.config:/opt/oai-gnb/demapper_trt.config

Pre-trained models are available in tutorials/neural_demapper/models.

The TRT config file format has the following schema:

<trt_engine_file:string> # file name of the TensorRT engine
<trt_normalized_inputs:int> # flag to indicate if the inputs are normalized

For example, the following config file will use the TensorRT engine models/neural_demapper.2xfloat16.plan and normalize the inputs:

models/neural_demapper.2xfloat16.plan
1
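
For reference, a config file following this schema could be read with a few lines of standard C I/O; the following is a hypothetical sketch only, and the actual parsing in trt_demapper.cpp may differ:

#include <cstdio>

// Hypothetical config reader (sketch): fills the globals trt_weight_file and
// trt_normalized_inputs shown earlier from a two-line config file.
static void trt_demapper_read_config(char const* path) {
    FILE* f = fopen(path, "r");
    if (!f)
        return; // keep compiled-in defaults
    static char engine_path[512];
    int normalized = 1;
    if (fscanf(f, "%511s", engine_path) == 1)
        trt_weight_file = engine_path;
    if (fscanf(f, "%d", &normalized) == 1)
        trt_normalized_inputs = (normalized != 0);
    fclose(f);
}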

Running the Demapper

The neural demapper is implemented as a shared library (see Plugins & Data Acquisition) that can be loaded using the OAI shared library loader and used as a drop-in replacement for the default QAM-16 implementation. It is loaded when running the gNB via the following GNB_EXTRA_OPTIONS in the .env file of the config folder:

GNB_EXTRA_OPTIONS=--loader.demapper.shlibversion _trt --MACRLCs.[0].dl_max_mcs 10 --MACRLCs.[0].ul_max_mcs 10 --thread-pool 5,6,7,8,9,10,11

We limit the MCS indices to 10 in order to stay within the QAM-16 modulation order. We strongly recommend additionally assigning dedicated CPU cores to PHY-layer processing via the thread-pool option; here, cores 5-11 are assigned to the PHY-layer thread pool. Note that the lower CPU cores are reserved for USRP handling, such as time synchronization.

Congratulations! You have now successfully implemented demapping using a neural network.

You can track the GPU load via

$ jtop

Implementation Aspects

In the following, we focus on various technical aspects of the CUDA implementation and on the performance implications of different memory transfer patterns and command scheduling optimizations.

Memory Management

Similar to the GPU-Accelerated LDPC Decoding tutorial, we use the shared system memory architecture of the Jetson platform to avoid the bottleneck of costly memory transfers on traditional split-memory platforms.

As previously covered in the GPU-Accelerated LDPC Decoding tutorial, optimizing memory operations is essential for real-time performance. For the neural demapper implementation, we use the same efficient approach of page-locked memory (via cudaHostAlloc()) to enable direct GPU-CPU memory sharing. This allows for simple memcpy() operations instead of complex memory management calls, with host caching enabled for CPU access while device caching is disabled for direct memory access. This approach is particularly well-suited for the small buffer sizes used in neural demapping, avoiding the overhead of traditional GPU memory management methods like cudaMemcpyAsync() or cudaMallocManaged().

For comparison, we show both variants side-by-side in the following inference code, where the latency-optimized code path is the one with USE_UNIFIED_MEMORY defined:

extern "C" void trt_demapper_decode_block(TRTContext* context_, cudaStream_t stream, int16_t const* in_symbols, int16_t const* in_mags, size_t num_symbols,
                                          int16_t const *mapped_symbols, int16_t const *mapped_mags, size_t num_batch_symbols,
                                          int16_t* outputs, uint32_t num_bits_per_symbol, int16_t* mapped_outputs) {
    auto& context = *context_;

#if defined(USE_UNIFIED_MEMORY)
    memcpy((void*) mapped_symbols, in_symbols, sizeof(*in_symbols) * 2 * num_symbols);
    memcpy((void*) mapped_mags, in_mags, sizeof(*in_mags) * 2 * num_symbols);
#else
    cudaMemcpyAsync((void*) mapped_symbols, in_symbols, sizeof(*in_symbols) * 2 * num_symbols, cudaMemcpyHostToDevice, stream);
    cudaMemcpyAsync((void*) mapped_mags, in_mags, sizeof(*in_mags) * 2 * num_symbols, cudaMemcpyHostToDevice, stream);
#endif

    size_t num_in_components;
    if (trt_normalized_inputs) {
        norm_int16_symbols_to_float16(stream, mapped_symbols, mapped_mags, num_batch_symbols,
                                      (uint16_t*) context.input_buffer, 1);
        num_in_components = 2;
    }
    else {
        [...]
        num_in_components = 4;
    }

    trt_demapper_run(&context, stream, context.input_buffer, num_batch_symbols, num_in_components, context.output_buffer);

    float16_llrs_to_int16(stream, (uint16_t const*) context.output_buffer, num_batch_symbols,
                          mapped_outputs, num_bits_per_symbol);

#if defined(USE_UNIFIED_MEMORY)
    // note: synchronize the asynchronous command queue before accessing from the host
    CHECK_CUDA(cudaStreamSynchronize(stream));
    memcpy(outputs, mapped_outputs, sizeof(*outputs) * num_bits_per_symbol * num_symbols);
#else
    cudaMemcpyAsync(outputs, mapped_outputs, sizeof(*outputs) * num_bits_per_symbol * num_symbols, cudaMemcpyDeviceToHost, stream);
    // note: synchronize after the asynchronous command queue has executed the copy to host
    CHECK_CUDA(cudaStreamSynchronize(stream));
#endif
}

CUDA Graph Optimization

CUDA command graph APIs [Gray2019] were introduced to front-load the overhead of scheduling repetitive sequences of compute kernels on the GPU, allowing pre-recorded, pre-optimized command sequences, such as our neural network inference, to be scheduled with a single API call. This further reduces latency by spending runtime on the actual computations rather than on dynamic command scheduling. We pre-record CUDA graphs that include demapper inference, data pre-processing, and post-processing for two different batch sizes: one for common small batches and one for the maximum expected parallel batch size.
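
Independent of the demapper specifics below, the basic capture-instantiate-launch lifecycle of a CUDA graph looks roughly as follows (a minimal illustration using the standard CUDA runtime API):

// Minimal illustration of the CUDA graph lifecycle (not demapper-specific):
// capture work submitted to a stream once, then replay it with a single launch.
cudaStream_t stream;
CHECK_CUDA(cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking));

cudaGraph_t graph = nullptr;
cudaGraphExec_t graphExec = nullptr;
CHECK_CUDA(cudaStreamBeginCapture(stream, cudaStreamCaptureModeRelaxed));
// ... enqueue pre-processing kernels, enqueueV3(), post-processing kernels on `stream` ...
CHECK_CUDA(cudaStreamEndCapture(stream, &graph));
CHECK_CUDA(cudaGraphInstantiate(&graphExec, graph, 0));

// later, once per inference: replay the entire recorded sequence with one call
CHECK_CUDA(cudaGraphLaunch(graphExec, stream));
CHECK_CUDA(cudaStreamSynchronize(stream));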

Command graphs are pre-recorded per thread because each thread uses its own intermediate storage buffers. We run the recording at the end of the thread context initialization introduced above: for each batch size, one size-0 inference triggers any lazy runtime allocations, and a second inference on dummy inputs performs the actual recording:

#ifdef USE_GRAPHS
    // record graphs for optimal and max block size
    for (int i = 0; i < 2; ++i) {
        int16_t in_symbols[2], in_mags[2], outputs[MAX_BITS_PER_SYMBOL];
        unsigned num_batch_symbols = i == 0 ? MAX_BLOCK_LEN : OPT_BLOCK_LEN;
        unsigned num_bits_per_symbol = 4;

        // pre-allocate, then record
        for (int a = 0; a < 2; ++a) {
            trt_demapper_decode_block(&context, context.default_stream, in_symbols, in_mags, a,
                                      context.symbol_buffer, context.magnitude_buffer, num_batch_symbols,
                                      outputs, num_bits_per_symbol, context.llr_buffer);
        }
    }
#endif

To extend the function trt_demapper_decode_block() with CUDA graph recording and execution, we introduce the following code paths when USE_GRAPHS is defined:

extern "C" void trt_demapper_decode_block(TRTContext* context_, cudaStream_t stream, int16_t const* in_symbols, int16_t const* in_mags, size_t num_symbols,
                                          int16_t const *mapped_symbols, int16_t const *mapped_mags, size_t num_batch_symbols,
                                          int16_t* outputs, uint32_t num_bits_per_symbol, int16_t* mapped_outputs) {
    auto& context = *context_;

    uint32_t block_size = num_batch_symbols > OPT_BLOCK_LEN ? MAX_BLOCK_LEN : OPT_BLOCK_LEN;
    cudaGraph_t& graph = block_size == OPT_BLOCK_LEN ? context.graph_opt : context.graph_max;
    cudaGraphExec_t& graphCtx = block_size == OPT_BLOCK_LEN ? context.record_opt : context.record_max;

    if (num_symbols > 0) {
        memcpy((void*) mapped_symbols, in_symbols, sizeof(*in_symbols) * 2 * num_symbols);
        memcpy((void*) mapped_mags, in_mags, sizeof(*in_mags) * 2 * num_symbols);
    }

    // graph capture
    if (!graph) {
        bool recording = false;
#ifdef USE_GRAPHS
        // allow pre-allocation before recording
        if (num_symbols > 0) {
            // in pre-recording phase
            CHECK_CUDA(cudaStreamBeginCapture(stream, cudaStreamCaptureModeRelaxed));
            num_batch_symbols = block_size;
            recording = true;
        }
        // else: pre-allocation phase
#endif

        size_t num_in_components;
        if (trt_normalized_inputs) {
            norm_int16_symbols_to_float16(stream, mapped_symbols, mapped_mags, num_batch_symbols,
                                          (uint16_t*) context.input_buffer, 1);
            num_in_components = 2;
        }
        else {
            int16_symbols_to_float16(stream, mapped_symbols, num_batch_symbols,
                                     (uint16_t*) context.input_buffer, 2);
            int16_symbols_to_float16(stream, mapped_mags, num_batch_symbols,
                                     (uint16_t*) context.input_buffer + 2, 2);
            num_in_components = 4;
        }

        trt_demapper_run(&context, stream, recording ? nullptr : context.input_buffer, block_size, num_in_components, recording ? nullptr : context.output_buffer);

        float16_llrs_to_int16(stream, (uint16_t const*) context.output_buffer, num_batch_symbols,
                              mapped_outputs, num_bits_per_symbol);

#ifdef USE_GRAPHS
        if (num_symbols > 0) {
            // in pre-recording phase
            CHECK_CUDA(cudaStreamEndCapture(stream, &graph));
            printf("Recorded CUDA graph (TID %d), stream %llX\n", (int) gettid(), (unsigned long long) stream);
        }
#endif
    }

#ifdef USE_GRAPHS
    if (graph && !graphCtx) {
        // in pre-recording phase
        CHECK_CUDA(cudaGraphInstantiate(&graphCtx, graph, 0));
    }
    else if (num_symbols > 0) {
        // in runtime inference, run pre-recorded graph
        cudaGraphLaunch(graphCtx, stream);
    }
#endif

    CHECK_CUDA(cudaStreamSynchronize(stream));
    memcpy(outputs, mapped_outputs, sizeof(*outputs) * num_bits_per_symbol * num_symbols);
}

Unit Tests

We have implemented unit tests using pytest to allow testing individual parts of the implementation outside of the full 5G stack. The unit tests can be found in tutorials/neural_demapper/tests/. The unit tests use nanobind to call the TensorRT and CUDA modules from Python and to test against Python-based reference implementations. For more details on how to use nanobind, please refer to the nanobind documentation.
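
As an illustration of the binding approach, a hypothetical nanobind module exposing the block decoder to Python could look as follows; the module name, array types, and argument handling are assumptions, and the actual bindings in the tutorial sources may differ:

#include <nanobind/nanobind.h>
#include <nanobind/ndarray.h>

namespace nb = nanobind;

// Hypothetical binding sketch: exposes trt_demapper_decode_block() so that
// pytest can compare its outputs against a Python reference implementation.
NB_MODULE(trt_demapper_ext, m) {
    m.def("decode_block",
          [](nb::ndarray<const int16_t, nb::ndim<1>, nb::c_contig> symbols,
             nb::ndarray<const int16_t, nb::ndim<1>, nb::c_contig> mags,
             nb::ndarray<int16_t, nb::ndim<1>, nb::c_contig> llrs,
             uint32_t num_bits_per_symbol) {
              TRTContext* ctx = trt_demapper_init();
              // interleaved (re, im) pairs; must not exceed MAX_BLOCK_LEN symbols
              size_t num_symbols = symbols.shape(0) / 2;
              trt_demapper_decode_block(ctx, 0, symbols.data(), mags.data(), num_symbols,
                                        ctx->symbol_buffer, ctx->magnitude_buffer, num_symbols,
                                        llrs.data(), num_bits_per_symbol, ctx->llr_buffer);
          });
}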

An example script for building and testing is provided in tests/build_and_run.sh (assuming that the requirements of the Sionna framework are installed):

set -e
cmake ../runtime -B build -G Ninja
ninja -C build
pytest -- ./test_data_processing.py ./test_demappers.py

Outlook

This was a first tutorial on accelerating neural network inference using TensorRT and CUDA graphs. The neural demapper itself is a simple network, and the focus was on the integration rather than the actual error-rate performance.

You are now able to deploy your own neural networks using this tutorial as a blueprint. An interesting starting point could be the Multi-user MIMO Neural Receiver, which offers a 5G-compliant implementation of a neural receiver and already includes a TensorRT export of the trained network.