Architecture Overview#

You have written a #[kernel] function in Rust. It type-checks. It borrow-checks. Now it needs to become PTX that runs on a GPU. This chapter explains how cuda-oxide gets it there – every stage, every crate, and the reasoning behind each choice.

If you just want to write kernels, you never need to read this page. But if you want to hack on the compiler, contribute a pass, or satisfy the “but how does it actually work?” itch, welcome. Grab a coffee.

Design Philosophy#

The guiding principle is short enough to fit on a sticky note:

Use the best tool for each stage – but own the full pipeline.

Compilers are layer cakes. Each layer has a wildly different job, and different tools excel at each one. cuda-oxide picks the strongest option per stage rather than building everything from scratch:

Frontend: rustc + rustc_public (Stable MIR). Why rewrite a type checker when one of the best ever built already exists? Rust’s compiler handles parsing, name resolution, type inference, borrow checking, trait resolution, monomorphization, and MIR optimization. We take all of that for free.
Middle-end: pliron (Pliron IR, MLIR-like). We need a place to transform Rust MIR into something LLVM-shaped. pliron is an extensible IR framework inspired by LLVM’s MLIR, but written in pure Rust. No C++ dependency, no CMake, no tablegen – just cargo build. We define three custom dialects here: one for MIR, one for LLVM IR, and one for NVIDIA GPU intrinsics.
Backend: LLVM NVPTX. NVIDIA has poured years of work into the NVPTX backend in LLVM. It knows every register class, every instruction encoding, every scheduling quirk. We emit LLVM IR textually and hand it to llc. Standing on the shoulders of giants beats reinventing the PTX assembler.

The payoff: the entire compiler is written in Rust (except the final llc invocation). There are no opaque handoffs to a C++ middle-end. You can set a breakpoint in any transformation pass, println! your way through the IR, and run the whole thing under Miri if you feel adventurous. Standard Rust tooling, all the way down.

Note

llc is the one external binary. The pinned Rust toolchain’s llvm-tools component ships llc with the NVPTX backend enabled, so on a fresh clone rustup component add llvm-tools (already listed in rust-toolchain.toml) is the only step needed. A system LLVM 21+ install also works and serves as the fallback. The CUDA Toolkit alone is not enough — it does not include llc. All cuda-oxide stages up to LLVM IR emission are implemented in Rust; after the backend writes the .ll file, it invokes external LLVM llc to generate PTX.

The Pipeline at a Glance#

Here is the full journey of a #[kernel] function, from source to silicon:

../_images/compiler-pipeline.svg — The full compilation pipeline. Rust source enters the rustc frontend, passes through Stable MIR, is translated into `dialect-mir` (with `mem2reg` promoting allocas back into SSA), lowered to `dialect-llvm`, exported as textual LLVM IR, and finally compiled to PTX by the NVPTX backend.#

Stage by stage:

Rust Source. You write a function, slap #[kernel] on it, and go about your day. The proc macro renames it into the reserved cuda_oxide_kernel_<hash>_<name> namespace so the backend can spot it later. The exact prefix lives in the workspace-internal reserved-oxide-symbols crate; the <hash> makes accidental collisions impossible.
rustc Frontend. rustc parses, type-checks, borrow-checks, monomorphizes generics, and runs MIR optimization passes (inlining, constant propagation, dead code elimination). All the hard work happens here.
Stable MIR. The codegen backend receives rustc’s internal MIR and bridges it to rustc_public’s stable types. This gives us a versioned, stable view of the MIR that won’t break on the next nightly.
dialect-mir (pliron). mir-importer translates Stable MIR into dialect-mir – a pliron dialect that models Rust MIR semantics (places, projections, Rvalue, BinOp, etc.). The initial form uses per-local mir.alloca slots with mir.load/mir.store for cross-block data flow; pliron::opts::mem2reg then promotes those slots back into SSA values.
dialect-llvm (pliron). mir-lower transforms dialect-mir operations into dialect-llvm operations: llvm.alloca, llvm.load, llvm.store, llvm.getelementptr, llvm.call, and friends. This is where Rust-level concepts get flattened to machine-oriented IR.
LLVM IR (.ll file). The dialect-llvm printer serializes the IR into textual LLVM IR. This is a plain .ll file – you can read it, feed it to opt, or diff it between compiler versions.
PTX (.ptx file). llc with the NVPTX target compiles the .ll file to PTX assembly. The result is a .ptx file ready to be loaded by the CUDA driver at runtime.

Crate Map#

cuda-oxide is split into focused crates. Here is every one and its role:

Crate	Role
`rustc-codegen-cuda`	Custom rustc codegen backend – intercepts `codegen_crate()`, splits host/device code
`mir-importer`	Translates Stable MIR into `dialect-mir`, orchestrates the full pipeline
`dialect-mir`	pliron dialect modeling Rust MIR semantics (places, rvalues, terminators)
`dialect-llvm`	pliron dialect modeling LLVM IR + textual `.ll` export
`dialect-nvvm`	pliron dialect for NVIDIA GPU intrinsics (`tid`, `ntid`, barriers, TMA)
`mir-lower`	Lowers `dialect-mir` to `dialect-llvm` – the main transformation pass
`cargo-oxide`	CLI tool: `cargo oxide build`, `cargo oxide run`, `cargo oxide pipeline`
`cuda-device`	Device-side API: intrinsics, `DisjointSlice`, barriers, shared memory, warp ops
`cuda-macros`	Proc macros: `#[kernel]`, `#[device]`
`cuda-host`	Host-side typed module loading and launch helpers
`cuda-core`	Safe bindings to the CUDA Driver API (`CudaContext`, `DeviceBuffer`, `CudaStream`)
`cuda-async`	Async GPU programming: `DeviceOperation`, combinators, stream pool scheduling
`cuda-bindings`	Low-level FFI bindings to CUDA driver (`libcuda.so`)

Dependency flow#

The compiler crates form a clear pipeline:

pliron sits underneath all three dialect crates as the shared IR framework – it provides the Context, Module, Region, Block, Operation, Type, and Attribute infrastructure. rustc_public provides the stable MIR types that mir-importer reads from rustc. The user-facing crates (cuda-device, cuda-macros, cuda-host, cuda-core, cuda-async) are independent of the compiler internals and depend only on each other.

The Two Key Dependencies#

Two external projects make cuda-oxide possible. Neither is optional, and both deserve a brief introduction before the deep-dive chapters that follow.

pliron – Pliron IR (MLIR-like)#

pliron is an extensible compiler IR framework inspired by LLVM’s MLIR, but written entirely in Rust. It provides the same core abstractions – dialects, operations, types, attributes, regions, and blocks – without requiring a C++ toolchain, CMake, or tablegen.

cuda-oxide chose pliron over upstream MLIR for a pragmatic reason: we wanted the entire compiler to build with cargo. Depending on MLIR means pulling in the LLVM monorepo, a C++ build system, and Rust-C++ FFI glue – all of which add build complexity, slow down CI, and make contributor onboarding painful. With pliron, dialects are defined using standard Rust traits and derive macros, and the IR can be inspected with any Rust debugger.

cuda-oxide defines three dialects on top of pliron: dialect-mir (models Rust MIR), dialect-llvm (models LLVM IR + textual export), and dialect-nvvm (NVIDIA GPU intrinsics).

rustc_public – Stable MIR#

rustc_public (historically known as Stable MIR or stable_mir) is Rust’s official stable interface to the compiler’s internals. MIR – the Mid-level Intermediate Representation – is where borrow checking, lifetime validation, and most optimizations happen. It is also a rich, high-level representation that retains type information, making it an ideal starting point for a GPU backend.

The problem: MIR is an internal representation. Its data structures change between nightly versions with no stability guarantees. A backend that reads internal MIR directly would break every time rustc refactors a field name or reorders an enum variant – which happens more often than you might hope. rustc_public solves this by providing a versioned, stable API that bridges internal types to a public surface. cuda-oxide hooks in at the CodegenBackend::codegen_crate() entry point, bridges internal types to stable MIR types, and hands the result to mir-importer for translation.

The Host/Device Split#

cuda-oxide is a single-source compiler. Host code and device code live in the same .rs files, and a single build command compiles both. Here is how that works, step by step:

1. cargo-oxide invokes rustc with the custom backend.

cargo oxide run vecadd

Under the hood, this sets -Z codegen-backend=librustc_codegen_cuda.so, which tells rustc to route code generation through cuda-oxide’s backend instead of the default LLVM one.

2. rustc calls codegen_crate() for every crate in the dependency tree.

This is not a cuda-oxide-specific step – it is how rustc works. For every crate being compiled (your binary, cuda-device, any other dependency), rustc invokes the codegen backend.

3. The backend scans for kernel entry points.

It looks for monomorphized functions whose names contain the reserved cuda_oxide_kernel_<hash>_ prefix. These are the functions that #[kernel] created.

4. If kernels are found: build the device call graph and emit PTX.

Starting from each kernel, the backend walks the call graph to collect every device function the kernel transitively calls. This set of functions is handed to mir-importer, which runs the full pipeline (dialect-mir -> dialect-llvm -> .ll -> PTX). The result is a .ptx file written next to the host binary.

5. Always: delegate host code to the standard LLVM backend.

Regardless of whether kernels were found, host code is compiled normally. cuda-oxide’s backend delegates to rustc’s default LLVM codegen for everything that is not device code. Your main() function, your CLI parsing, your async runtime – all compiled the usual way.

6. Result: a host binary + a .ptx file, from one build.

target/debug/vecadd          ← host binary (loads PTX at runtime)
target/debug/vecadd.ptx      ← device code (loaded by CUDA driver)

Note

Device code from dependencies (like cuda-device) is compiled lazily. Functions from external crates only get compiled to PTX when a kernel in your crate transitively calls them. The MIR is available from .rlib metadata, so there is no need to recompile dependencies from source – the backend reads their Stable MIR on demand.

A simplified mental model#

../_images/host-device-split.svg — One build command, two compilation targets. Every function goes through rustc’s frontend. At the codegen boundary, kernels go to cuda-oxide; everything else goes to LLVM.#

Every function goes through rustc’s frontend. At the codegen boundary, the backend looks at each function and asks: “Are you a kernel or called by a kernel?” If yes, you go right (cuda-oxide pipeline). If no, you go left (standard LLVM). Some functions go both ways – a generic helper used on both host and device will be compiled twice, once per target.

What rustc Gives Us For Free#

One of the nicest things about building on rustc rather than inventing a new language is the sheer volume of work we do not have to do. Here is what rustc handles before cuda-oxide ever sees the code:

What	Value for GPU Code
Type checking	Catch errors before GPU compilation – no cryptic PTX assembler failures
Lifetime tracking	Safety guarantees that span the host/device boundary
Borrow checking	Prevent data races at compile time, even across GPU threads
Monomorphization	Generics “just work” on the GPU – `map<f32, _>` becomes a concrete PTX kernel
MIR optimization	Inlining, constant propagation, dead code elimination – all applied before we begin
Trait resolution	Trait objects are resolved, vtables are gone, everything is static dispatch
Pattern matching	`match` arms are lowered to optimized `SwitchInt` MIR terminators

We do not reimplement any of this. rustc does the heavy lifting, and we pick up the fully optimized, monomorphized, borrow-checked MIR at the end. Our job is “just” the translation – which, to be fair, is still plenty of work. But it is a dramatically smaller problem than building a GPU language from scratch.

Note

This also means that Rust’s error messages work normally. If you make a type error in a kernel, you get the same helpful rustc diagnostic you would get in any other Rust code – complete with suggestions, span highlighting, and “did you mean?” hints. No separate GPU compiler error format to learn.

Where to Go Next#

The rest of this chapter zooms into each piece of the architecture:

Pliron – Pliron IR (MLIR-like) – the IR framework that holds the pipeline together.
rustc_public – Stable MIR – how we read MIR without breaking on every nightly.
The Code Generator: rustc-codegen-cuda – the codegen backend that intercepts rustc.
MIR Importer – translating Stable MIR into pliron.
Pliron Dialects – the three custom dialects and their operation sets.
The Lowering Pipeline – dialect-mir to dialect-llvm, pass by pass.
Adding New Intrinsics – a contributor’s guide to extending the compiler.