# The Lowering Pipeline
The previous chapters built the IR from both ends: MIR Importer
translated Stable MIR into dialect-mir, and
Pliron Dialects described dialect-mir, dialect-llvm,
and dialect-nvvm. This chapter is about the bridge between them – the pass
that takes Rust-flavored IR and turns it into something LLVM can actually
compile.
If you know Rust types, you are about to find out how many of them LLVM has never heard of.
## What Lowering Means
dialect-mir speaks Rust. It knows about tuples, enums, slices, checked
arithmetic, and GPU address spaces. LLVM IR knows about none of those things.
It has flat integer and float types, getelementptr, PHI nodes, and a general
suspicion toward anything with more than one level of abstraction.
Lowering is the process of replacing every dialect-mir operation with
an equivalent sequence of dialect-llvm operations, one by one, until no
dialect-mir operations remain. Tuples become anonymous structs. Slices
become pointer-length pairs. Checked addition becomes an LLVM overflow
intrinsic followed by an extract. Every Rust concept gets flattened to
something LLVM can digest.
The pass that does all of this lives in crates/mir-lower/ and uses pliron’s
DialectConversion framework. It is the single largest transformation in the
pipeline, and the rest of this chapter is about how it works.
## DialectConversion – The Lowering Framework
The lowering uses pliron’s DialectConversion + DialectConversionRewriter
rather than a manual walk-and-replace pass. The framework handles IR walking,
def-before-use ordering, type conversion, and block argument patching
automatically.
### How It Works
Each `dialect-mir` and `dialect-nvvm` op declares how to lower itself via the `MirToLlvmConversion`
op interface (defined in `conversion_interface.rs`). The interface has a single
method: `convert(ctx, rewriter, op, operands_info)`. Each op's implementation
lives in `convert/interface_impls.rs`, which dispatches to converter functions
organized by category.
For each MirFuncOp in the module, `convert_func` (in `lowering.rs`):

1. Creates an LLVM function with a flattened type signature (aggregate args become multiple scalar args for the CUDA ABI).
2. Propagates GPU metadata (`gpu_kernel`, `maxntid`, `cluster_dim_*`).
3. Uses `inline_region` to move MIR blocks into the LLVM function. The original blocks are preserved – no manual block mapping needed.
4. Builds an entry prologue that reconstructs aggregate values from the flat LLVM arguments via `insertvalue` (see Argument Scalarization below).
5. Runs `DialectConversion`, which walks every MIR operation and invokes its `MirToLlvmConversion::rewrite` implementation to replace it with LLVM operations.
### Converter Modules
The conversion functions are organized into modules by category:
| Category | Module | What It Handles |
|---|---|---|
| Arithmetic | | |
| Memory | | |
| Control Flow | | |
| Aggregate | | Struct/tuple field access → GEP or `extractvalue` |
| Cast | | |
| Call | | |
| GPU Intrinsic | | NVVM ops → LLVM intrinsic calls or inline PTX |
| Constants | | |
The framework dispatches to these automatically – each op’s MirToLlvmConversion
impl calls the right converter function. The complexity lives inside each converter,
where Rust semantics meet LLVM reality.
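The dispatch shape can be sketched in a few lines. This is an illustration only – the trait name mirrors the real one but the signature and types here are simplified stand-ins, not pliron's actual API:

```rust
// Sketch: every op implements one conversion trait; the framework calls it
// per op. Some ops convert 1:1, others expand into several LLVM ops.
trait MirToLlvmConversionSketch {
    fn convert(&self) -> Vec<String>; // emitted dialect-llvm ops, as text here
}

struct MirAdd;
impl MirToLlvmConversionSketch for MirAdd {
    fn convert(&self) -> Vec<String> {
        vec!["llvm.add".to_string()] // 1:1 conversion
    }
}

struct MirCheckedAdd;
impl MirToLlvmConversionSketch for MirCheckedAdd {
    fn convert(&self) -> Vec<String> {
        // 1:N conversion - one MIR op becomes an intrinsic call plus extracts.
        vec![
            "llvm.call @llvm.sadd.with.overflow.i32".to_string(),
            "llvm.extractvalue 0".to_string(),
            "llvm.extractvalue 1".to_string(),
        ]
    }
}

fn main() {
    let ops: Vec<Box<dyn MirToLlvmConversionSketch>> =
        vec![Box::new(MirAdd), Box::new(MirCheckedAdd)];
    let total: usize = ops.iter().map(|op| op.convert().len()).sum();
    assert_eq!(total, 4); // 1 + 3 emitted LLVM ops
}
```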
## Type Conversion
Before operations can be converted, their types must be converted. LLVM’s type system is deliberately simpler than Rust’s – there is no signedness on integers, no tuples, no enums, no fat pointers. Everything must be flattened.
| MIR Type | LLVM Type | Notes |
|---|---|---|
| `i32` / `u32` | `i32` | LLVM integers carry no sign – signedness is on the operation, not the type |
| `mir.tuple<...>` | anonymous struct | Tuples become anonymous structs |
| slice (`&[T]`) | `{ ptr, i64 }` | Fat pointer decomposition – pointer + length |
| struct | struct with `[N x i8]` fields | Explicit padding arrays to match rustc’s layout |
| reference / raw pointer | `ptr addrspace(N)` | Opaque pointers with address space preserved |
| `[T; N]` | `[N x T]` | Direct mapping – arrays are simple enough even for LLVM |
| `mir.enum<...>` | `{ iN, [M x i8] }` | Discriminant + payload sized to the largest variant |
The integer signedness case deserves emphasis. In Rust, i32 and u32 are
different types. In LLVM, both are just i32. The sign information shifts to
the operations: a signed less-than comparison is icmp slt, an unsigned one is
icmp ult. The type converter drops the signedness, and the operation
converters pick it back up when they emit comparison and division instructions.
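A small illustration (plain Rust, not compiler code) of why this works: the same 32-bit pattern compares differently depending on which operation interprets it, which is exactly the `icmp ult` vs `icmp slt` split:

```rust
// The type loses the sign; the comparison operation supplies it.
fn main() {
    let bits: u32 = 0xFFFF_FFFF; // all ones

    // Unsigned view ("icmp ult"): the largest u32, so not less than 1.
    assert!(!(bits < 1));

    // Signed view ("icmp slt"): the same bits reinterpreted are -1.
    let signed = bits as i32;
    assert_eq!(signed, -1);
    assert!(signed < 1);
}
```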
## Argument Scalarization
Kernel entry points need special treatment. The CUDA driver passes kernel
arguments as flat scalars – it does not understand Rust fat pointers or
structs. If a kernel takes a &[f32], the driver needs to pass a raw pointer
and a length as two separate arguments.
The lowering pass scalarizes aggregate kernel arguments and reconstructs them in the entry block:
```
MIR:  fn kernel(slice: &[f32])
      → entry arg: %slice : MirSliceType

LLVM: fn kernel(ptr addrspace(1) %ptr, i64 %len)
      → entry block reconstructs:
        %slice  = insertvalue {ptr, i64} undef, %ptr, 0
        %slice2 = insertvalue {ptr, i64} %slice, %len, 1
```
The rest of the function sees %slice2 and uses it normally. The scalarization
is invisible to everything downstream of the entry block – they just see a
struct value, as if nothing happened. This avoids ABI mismatches between the
host compiler (which calls cuLaunchKernel with flat arguments) and the device
compiler (which expects typed Rust values).
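The same round-trip can be demonstrated on the host side in plain Rust (a sketch of the idea, not the device codegen): split the fat pointer into the two flat scalars the driver passes, then reassemble it:

```rust
// "Scalarize" a slice into (ptr, len), then rebuild it - the host-side
// analogue of the two insertvalue instructions in the entry prologue.
fn main() {
    let data = [1.0f32, 2.0, 3.0];
    let slice: &[f32] = &data;

    // Scalarization: the two flat kernel arguments.
    let ptr = slice.as_ptr();
    let len = slice.len();

    // Entry prologue: reassemble the fat pointer from the scalars.
    let rebuilt: &[f32] = unsafe { std::slice::from_raw_parts(ptr, len) };
    assert_eq!(rebuilt, slice); // downstream code sees the original slice
}
```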
## Interesting Conversions
Most conversions are straightforward: mir.add on integers becomes llvm.add,
mir.load becomes llvm.load, mir.goto becomes llvm.br. The interesting
cases are the ones where a single MIR operation expands into multiple LLVM
operations, or where GPU-specific concerns change the translation entirely.
### Checked Arithmetic
In debug builds, Rust checks every integer arithmetic operation for overflow.
MIR models this with operations like mir.checked_add that return a
(result, overflow_flag) tuple. LLVM has no such concept, but it does have
overflow intrinsics:
```
MIR:  %result = mir.checked_add %a, %b : i32 → mir.tuple<i32, bool>

LLVM: %pair     = call {i32, i1} @llvm.sadd.with.overflow.i32(i32 %a, i32 %b)
      %sum      = extractvalue {i32, i1} %pair, 0
      %overflow = extractvalue {i32, i1} %pair, 1
```
The overflow flag feeds into an assert that the MIR importer already lowered to a conditional branch targeting an unreachable block. On the GPU, this effectively means: if you manage to trigger integer overflow, the kernel traps. Not the most graceful error handling, but the CUDA toolchain does not support stack unwinding today.
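On the language side, Rust's standard `overflowing_add` returns exactly the `(result, overflow_flag)` pair that `mir.checked_add` models and the LLVM intrinsic produces:

```rust
// The (result, flag) pair of checked arithmetic, observable from safe Rust.
fn main() {
    let (sum, overflow) = i32::MAX.overflowing_add(1);
    assert!(overflow);         // the flag that feeds the overflow branch
    assert_eq!(sum, i32::MIN); // two's-complement wraparound

    let (sum, overflow) = 40i32.overflowing_add(2);
    assert!(!overflow);
    assert_eq!(sum, 42);
}
```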
### Enum Lowering
Rust enums are algebraically rich. LLVM has no concept of tagged unions. The lowering pass bridges the gap by representing enums as a struct with two fields: a discriminant (telling you which variant is active) and a payload area sized to the largest variant:
```
MIR:  %opt = mir.construct_enum "Some", (%val) : mir.enum<"Option_i32">

LLVM: %tmp    = insertvalue { i8, [4 x i8] } zeroinitializer, i8 1, 0
      %result = insertvalue { i8, [4 x i8] } %tmp, <val into payload area>
```
The discriminant is i8 1 because Some is variant 1 of Option. The
payload is [4 x i8] – four bytes, enough to hold an i32. Variant access
works in reverse: read the discriminant, branch on it, then extractvalue the
payload and bitcast to the expected type.
It is not elegant, but it is exactly how C compilers have handled tagged unions for decades. LLVM’s optimizer is quite good at cleaning up the redundant insertvalue/extractvalue chains.
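The scheme is easy to model in plain Rust. This is an illustrative sketch of the representation, not compiler code: `Option<i32>` as a `{ discriminant, payload }` pair, with the payload carrying the raw bytes of the `i32`:

```rust
// Encode Option<i32> as (discriminant: u8, payload: [u8; 4]), then decode
// by branching on the discriminant - the round-trip the lowering relies on.
fn encode(opt: Option<i32>) -> (u8, [u8; 4]) {
    match opt {
        None => (0, [0; 4]),             // variant 0, payload unused
        Some(v) => (1, v.to_ne_bytes()), // variant 1, i32 bytes in the payload
    }
}

fn decode((tag, payload): (u8, [u8; 4])) -> Option<i32> {
    match tag {
        0 => None,
        1 => Some(i32::from_ne_bytes(payload)), // "bitcast" the payload back
        _ => unreachable!("invalid discriminant"),
    }
}

fn main() {
    assert_eq!(encode(Some(42)).0, 1); // Some is variant 1
    assert_eq!(decode(encode(Some(42))), Some(42));
    assert_eq!(decode(encode(None)), None);
}
```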
### GPU Intrinsic Conversion
dialect-nvvm operations – thread indexing, warp shuffles, barriers, TMA
bulk copies – are not lowered to generic dialect-llvm operations. They are
lowered to either LLVM intrinsic calls or inline PTX assembly, depending on
whether LLVM has a built-in intrinsic for the operation.
#### Strategy 1: LLVM Intrinsic Call
For operations where LLVM already provides a target-specific intrinsic, the
conversion emits a call to that intrinsic:
```
nvvm.read_ptx_sreg_tid_x
  → call i32 @llvm_nvvm_read_ptx_sreg_tid_x()

nvvm.shfl_sync_bfly_i32
  → call i32 @llvm_nvvm_shfl_sync_bfly_i32(i32 -1, i32 %val, i32 %mask, i32 31)
```
Notice the warp shuffle: the user-facing cuda_device API takes two arguments
(value and lane mask), but the LLVM intrinsic takes four (membermask, value,
delta, clamp). The lowering pass fills in the missing arguments – membermask = -1 (all lanes) and clamp = 31 (full warp width) – so the user never has to
think about them.
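The argument filling amounts to a thin wrapper. A hypothetical sketch, where `raw_shfl_bfly` stands in for the real four-argument intrinsic call and simply echoes the value:

```rust
// Stand-in for the intrinsic: a real GPU would exchange `value` across lanes.
fn raw_shfl_bfly(membermask: i32, value: i32, lane_mask: i32, clamp: i32) -> i32 {
    let _ = (membermask, lane_mask, clamp);
    value
}

// The user-facing two-argument API; the lowering fills in the rest:
// membermask = -1 (all lanes), clamp = 31 (full warp width).
fn shfl_bfly(value: i32, lane_mask: i32) -> i32 {
    raw_shfl_bfly(-1, value, lane_mask, 31)
}

fn main() {
    assert_eq!(shfl_bfly(7, 1), 7); // the stand-in returns the value unchanged
}
```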
#### Strategy 2: Inline PTX Assembly
Newer GPU instructions often lack LLVM intrinsics. For these, the lowering pass
emits inline PTX assembly using LLVM’s asm syntax:
```
nvvm.wgmma_fence_sync
  → call void asm sideeffect convergent "wgmma.fence.sync.aligned;", ""()

nvvm.mbarrier_arrive
  → call i64 asm sideeffect convergent "mbarrier.arrive.shared.b64 $0, [$1];", "=l,r"(ptr %bar)
```
The convergent attribute is critical here. It tells LLVM: “Do not move,
duplicate, or speculate this instruction across control flow.” Without it, LLVM
might hoist a barrier out of a conditional branch or sink a warp-level
instruction past a sync point, resulting in a GPU that hangs or computes garbage
– neither of which produces a helpful error message.
## Block Arguments to PHI Nodes
Pliron IR (MLIR-like) uses block arguments for value flow between basic blocks. LLVM uses PHI nodes. They express the same concept – “this value comes from different predecessors” – but the syntax is different enough that the export step needs a real transformation, not just pretty-printing.
Pliron style (block arguments):

```
^loop_header(%sum: f32, %i: i64):
    ...
    br ^loop_header(%new_sum, %new_i)
```

LLVM IR style (PHI nodes):

```
loop_header:
    %sum = phi float [ 0.0, %preheader ], [ %new_sum, %body ]
    %i   = phi i64   [ 0,   %preheader ], [ %new_i,   %body ]
```
The exporter handles this conversion with a two-pass approach:

1. **Pre-pass: name every value.** Before emitting any code, the exporter walks all blocks and assigns sequential SSA names (`%v0`, `%v1`, …) to every value. This is critical because PHI nodes can reference values from blocks that appear later in the listing – loop back-edges point forward in the text but backward in the control flow. Without pre-naming, those references would be undefined.
2. **Build a predecessor map.** For each block, the exporter collects `(predecessor_block, values_passed)` pairs by inspecting every branch instruction in the function.
3. **Emit PHI nodes.** At the entry of each non-entry block, the exporter emits one PHI node per block argument, populated with the values and predecessor labels from the predecessor map.
The pre-pass is the subtle part. Consider a loop: the PHI in the loop header
references %new_sum from the loop body, but the loop body appears after the
header in the textual output. If we assigned names on-the-fly during emission,
%new_sum would not have a name yet. The pre-pass eliminates this problem by
naming everything upfront.
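The pre-pass idea fits in a few lines of Rust. A sketch, not the exporter's actual code: name every value first, then emit, so the forward reference resolves immediately:

```rust
// Two passes: (1) assign sequential SSA names upfront, (2) emit text that
// may reference values defined later in the listing.
use std::collections::HashMap;

fn main() {
    // Values in listing order; the header's PHI references "new_sum",
    // which is defined later in the text.
    let values = ["sum_phi", "new_sum"];

    // Pass 1: name everything before emitting anything.
    let mut names: HashMap<&str, String> = HashMap::new();
    for (i, v) in values.iter().enumerate() {
        names.insert(v, format!("%v{i}"));
    }

    // Pass 2: emission can now print any reference, even a forward one.
    let phi = format!("{} = phi f32 [ {}, %body ]", names["sum_phi"], names["new_sum"]);
    assert_eq!(phi, "%v0 = phi f32 [ %v1, %body ]");
}
```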
## Symbol Name Sanitization
Function names flow through several stages, each applying its own constraints:
```
rustc_public (FQDN)   helper_fn::cuda_oxide_device_<hash>_vecadd
        ↓ body.rs (:: → __)
dialect-mir           helper_fn__cuda_oxide_device_<hash>_vecadd
        ↓ call.rs (:: → __)
dialect-llvm          helper_fn__cuda_oxide_device_<hash>_vecadd
        ↓ export.rs (strip prefix)
Textual LLVM IR       @vecadd
        ↓ llc
PTX                   vecadd
```
Three conversions happen along this path:

1. **`::` to `__`** – Both `body.rs` (function definitions) and `call.rs` (call targets) replace Rust path separators with double underscores to produce valid pliron/LLVM identifiers. Since both sides apply the same conversion, definitions and call sites match.
2. **Device prefix stripping** – `export.rs` strips the reserved `cuda_oxide_device_<hash>_` prefix (and any preceding FQDN crate prefix) from `#[device]` function names via `reserved_oxide_symbols::device_base_name`. This prefix exists for MIR-level detection but should not appear in the final LLVM IR, PTX, or LTOIR output.
3. **Device extern prefix stripping** – For `#[device] unsafe extern "C"` functions, `call.rs` strips the `cuda_oxide_device_extern_<hash>_` prefix via `reserved_oxide_symbols::device_extern_base_name` so the LLVM IR references the original symbol name exported by the external LTOIR (e.g., CCCL libraries).
> **Note:** This manual sanitization will be replaced by pliron’s Legaliser when the
> framework is upgraded. The Legaliser handles `::` to `_` conversion and
> collision detection systematically.
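The two string transformations can be sketched in plain Rust. Here `strip_device_prefix` is a hypothetical stand-in for the `reserved_oxide_symbols` helpers, and `abc123` is a placeholder hash, not a real one:

```rust
// Rust path separator -> valid LLVM identifier.
fn sanitize(path: &str) -> String {
    path.replace("::", "__")
}

// Keep only what follows the reserved prefix (also drops any crate prefix
// that precedes it). Hypothetical helper for illustration.
fn strip_device_prefix(name: &str, prefix: &str) -> String {
    match name.find(prefix) {
        Some(i) => name[i + prefix.len()..].to_string(),
        None => name.to_string(),
    }
}

fn main() {
    let mir_name = sanitize("helper_fn::cuda_oxide_device_abc123_vecadd");
    assert_eq!(mir_name, "helper_fn__cuda_oxide_device_abc123_vecadd");
    assert_eq!(strip_device_prefix(&mir_name, "cuda_oxide_device_abc123_"), "vecadd");
}
```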
## PTX Generation
After dialect-llvm is exported to a textual .ll file, the final step is
invoking llc – LLVM’s static compiler – to produce PTX assembly:
```shell
llc -march=nvptx64 -mcpu=sm_90 kernel.ll -o kernel.ptx
```
### Target Selection
The pipeline probes for llc on PATH in the order below. LLVM 21 is the
minimum — earlier releases reject the TMA / tcgen05 / WGMMA intrinsic
signatures that cuda-oxide emits.
| Priority | llc Version | Target | PTX Version |
|---|---|---|---|
| 1st | | | PTX 8.x |
| 2nd | | | PTX 8.x |
If neither is available, the pipeline fails with a clear error. You can opt
into a specific (possibly older) binary by setting
`CUDA_OXIDE_LLC=/path/to/llc`, but only simple kernels are guaranteed to
compile on LLVM 20 and below.
If the selected target does not match the physical GPU, the CUDA driver JIT-compiles the PTX at load time. First launch costs roughly 30ms while the driver translates; subsequent launches use a cached binary. In practice, you rarely notice – the JIT is fast and the cache is persistent across runs.
The CUDA_OXIDE_TARGET environment variable overrides auto-detection for cases
where you need a specific target. For example, sm_100a enables
Blackwell-specific tcgen05 features that are not available under the generic
sm_100 target.
> **Note:** Why LLVM 21? The 2-D bulk TMA load intrinsic used by `tma_copy`,
> `gemm_sol`, and `tcgen05_matmul` gained a 10-operand form with `addrspace(7)`
> and a `cta_group` parameter in LLVM 21. Older llc versions reject it with
> `Intrinsic has incorrect argument type!`. Rather than maintain separate
> intrinsic emitters per LLVM version, we set 21 as the minimum.
## Putting It All Together
Here is the full sequence of events when `lower_mir_to_llvm` processes a module:

```
1. Register dialect-llvm types and operations
2. For each MirFuncOp in the module:
   a. Create llvm.func with flattened type signature
   b. inline_region: move dialect-mir blocks into the LLVM function
   c. Build entry prologue (reconstruct aggregates from flat args)
   d. Run DialectConversion:
      ├── Walk every dialect-mir/dialect-nvvm op (def-before-use order)
      ├── Invoke MirToLlvmConversion::rewrite for each op
      ├── Converter emits dialect-llvm op(s) via DialectConversionRewriter
      └── Framework patches block arg types automatically
3. Export dialect-llvm to textual LLVM IR (.ll) (with PHI node conversion)
4. Invoke llc to produce .ptx
```
After step 4, you have a .ptx file that the CUDA driver can load and
execute. The journey from mir.checked_add to add.s32 is complete.
The lowering pipeline turns Rust-flavored IR into GPU-ready LLVM IR. For a hands-on walkthrough of adding new GPU operations, see Adding New Intrinsics.