# Pliron Dialects
cuda-oxide does not lower Rust to PTX in a single, heroic transformation. It uses three custom pliron dialects, each modeling a different level of abstraction. This chapter walks through all three – their types, their operations, and how they fit together to form the compilation pipeline.
If you have not read the Pliron IR (MLIR-like) chapter yet, now is a good
time. The concepts there (operations, types, attributes, regions,
Ptr&lt;T&gt;, def-use chains) are the building blocks of everything on this page.
## The Three Dialects at a Glance
| Dialect | Purpose | Level |
|---|---|---|
| dialect-mir | Models Rust MIR semantics | Highest – Rust types, tuples, enums, slices, checked arithmetic |
| dialect-llvm | Models LLVM IR | Middle – flat types, GEP, PHI-ready control flow |
| dialect-nvvm | Models NVIDIA GPU intrinsics | Orthogonal – thread indexing, warps, TMA, WGMMA, tcgen05 |
dialect-nvvm is “orthogonal” rather than a layer in the stack because its
operations appear alongside dialect-llvm operations, not below them. A
warp shuffle and an integer add coexist in the same function body.
Data flows through the pipeline like this:
```
dialect-mir ──(mem2reg)──▶ dialect-mir (SSA) ──(DialectConversion)──▶ dialect-llvm + dialect-nvvm ops ──(export.rs)──▶ textual LLVM IR ──(llc)──▶ PTX
```
Each arrow is a well-defined transformation. The first two happen inside pliron, the export step is cuda-oxide's own code, and the last one is LLVM's NVPTX backend doing what it does best.
## dialect-mir – The Rust Layer
dialect-mir preserves Rust’s type system and control flow semantics as
pliron operations. This is deliberate: we want to reason about Rust concepts
(tuples, enums, checked arithmetic, address spaces) before flattening them
to LLVM’s type system.
### Types
The dialect defines seven custom types that mirror Rust’s compound types:
| Type | Description |
|---|---|
| Tuple | Heterogeneous tuples |
| Pointer | Pointers with a GPU address space |
| Array | Fixed-size arrays |
| Struct | Named structs with layout info |
| Slice | Fat pointers (ptr + length) |
| Disjoint slice | Safety-checked slice – each thread accesses a unique element |
| Enum | Rust enums with discriminant and variant payloads |

The printed forms of these types appear in the IR examples later in this
chapter (`mir.tuple<...>`, `mir.ptr<...>`, `mir.struct<...>`).
The address spaces on mir.ptr and mir.slice track where data lives in
the GPU memory hierarchy:
| Address Space | Meaning |
|---|---|
| 0 | Generic (resolved at runtime) |
| 1 | Global (device DRAM) |
| 3 | Shared (per-block SRAM) |
| 4 | Constant (read-only cache) |
| 5 | Local (per-thread stack, spills to DRAM) |
| 6 | Tensor memory (Blackwell TMEM) |
### Operations
dialect-mir defines 54 operations across 11 categories:
| Category | Count |
|---|---|
| Function | 1 |
| Control flow | 5 |
| Constants | 3 |
| Memory | 9 |
| Arithmetic | 15 |
| Comparison | 6 |
| Aggregate | 8 |
| Enum | 3 |
| Cast | 1 |
| Storage | 2 |
| Call | 1 |
That is a lot of operations, but they fall into natural groups. If you know Rust MIR (or have read the rustc_public chapter), each operation maps directly to a MIR concept.
### What the IR Looks Like
Here are a few examples of dialect-mir operations in practice. These are
simplified for readability – the actual printed form includes more metadata.
Checked addition (Rust: `let sum = a + b` where `a, b: i32`):

```
// mir.checked_add returns a tuple (result, overflow_flag)
%checked = mir.checked_add %a, %b : i32
%sum = mir.extract_field %checked, 0 : mir.tuple<i32, i1>
%overflowed = mir.extract_field %checked, 1 : mir.tuple<i32, i1>
mir.assert %overflowed == false, "attempt to add with overflow" -> bb1
```
Struct construction and field access (Rust: `point.x`):

```
%point = mir.construct_struct %x, %y : mir.struct<"Point", [f32, f32]>
%x_val = mir.extract_field %point, 0 : mir.struct<"Point", [f32, f32]>
```
Shared memory allocation (the GPU-specific part):

```
%shmem = mir.shared_alloc : mir.ptr<f32, mutable, addrspace: 3>
mir.store %value, %shmem : f32
```
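On the NVPTX path, shared allocations like this typically end up as
address-space-3 globals in the exported LLVM IR – roughly the following
(the symbol name and exact shape are illustrative, not cuda-oxide's
verbatim output):

```llvm
; Illustrative: a per-block shared f32 slot as LLVM sees it.
@shmem = internal addrspace(3) global float undef, align 4
```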
### Verification
Every MIR operation verifies type consistency when constructed. This catches import bugs early – before they have a chance to propagate through the lowering pipeline and surface as cryptic LLVM errors three passes later.
Examples of what gets checked:
- `mir.add` verifies that both operands have the same type.
- `mir.cond_br` verifies that the condition is `i1` (a boolean).
- `mir.extract_field` verifies that the field index is in bounds and the result type matches the field's type.
- `mir.store` verifies that the value type matches the pointee type of the pointer.
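A minimal sketch of the first check above, using a stand-in type enum
rather than pliron's real interned type handles (the names here are
illustrative, not the actual verifier code):

```rust
/// Stand-in for pliron's type handles, for illustration only.
#[derive(Debug, PartialEq)]
enum Ty {
    I32,
    F32,
}

/// The shape of mir.add's structural check: both operands and the
/// result must share a single type.
fn verify_add(lhs: &Ty, rhs: &Ty, result: &Ty) -> Result<(), String> {
    if lhs != rhs {
        return Err(format!("mir.add: operand types differ: {lhs:?} vs {rhs:?}"));
    }
    if result != lhs {
        return Err(format!("mir.add: result {result:?} does not match operands {lhs:?}"));
    }
    Ok(())
}

fn main() {
    assert!(verify_add(&Ty::I32, &Ty::I32, &Ty::I32).is_ok());
    assert!(verify_add(&Ty::I32, &Ty::F32, &Ty::I32).is_err());
}
```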
The DisjointSlice safety guarantee (“one thread, one element”) is enforced
at the type-system level via ThreadIndex – only hardware-derived thread
indices can access the slice. There is no separate compiler pass for
disjoint-access verification; the safety comes from the Rust type system
and cuda-device’s API design.
## dialect-llvm – The LLVM Layer
dialect-llvm models LLVM IR as pliron operations. It provides a near-1:1
mapping to textual .ll files – every LLVM instruction has a corresponding
pliron operation, and the types map directly to LLVM’s type system.
Types#
| Type | Description |
|---|---|
| Integers | Pliron built-in, used directly |
| Floats | Pliron built-in, used directly |
| Pointers | Opaque pointers with optional address space |
| Structs | Named or anonymous, may be opaque |
| Arrays | Fixed-size arrays |
| Vectors | SIMD vectors |
| Functions | Function signatures |
| Void | The unit type |
Note the absence of Rust-specific types. By the time code reaches
dialect-llvm, tuples have become structs, enums have become
discriminant-indexed structs, and slices have become pointer-length pairs.
The lowering pass (covered in The Lowering Pipeline)
handles all of that flattening.
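A few concrete mappings, following the rules just described (the slice row
matches the exported kernel signature shown later in this chapter; the
other layouts are illustrative):

```
Rust type      →  dialect-llvm type
(f32, u32)     →  { float, i32 }
[f32; 8]       →  [8 x float]
&[f32]         →  ptr addrspace(1) + i64 length (two separate values)
```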
### Operations
The dialect defines 62 operations:
| Category | Count |
|---|---|
| Arithmetic | 19 |
| Cast | 13 |
| Control flow | 5 |
| Memory | 4 |
| Atomic | 5 |
| Comparison | 2 |
| Aggregate | 3 |
| Call | 2 |
| Inline asm | 2 |
| Constants | 3 |
| Symbol | 3 |
| Select | 1 |
If you have read LLVM IR before, nothing here will surprise you. The operation
names are intentionally the same as their LLVM counterparts, prefixed with
llvm. in the IR.
### The Export Engine
The crown jewel of dialect-llvm is export.rs – the module that converts a
pliron IR module into valid textual LLVM IR. This is not just “print each
operation”; several non-trivial transformations happen during export:
**Block arguments become PHI nodes.** Pliron (like MLIR) models merge points
as block arguments – a function-style calling convention between basic blocks.
LLVM IR uses PHI nodes instead. The exporter builds a predecessor map from
branch operands and emits phi instructions at the top of each non-entry
block.
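A sketch of that rewrite on a two-predecessor merge (value and block names
invented for the example):

```llvm
; Before export (pliron block-argument form, approximated):
;   bb1:  llvm.br bb3(%x)
;   bb2:  llvm.br bb3(%y)
;   bb3(%m: i32):  ...
;
; After export (LLVM PHI form):
bb3:
  %m = phi i32 [ %x, %bb1 ], [ %y, %bb2 ]
```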
**Value naming.** A pre-pass assigns sequential SSA names (%v0, %v1, …)
to every value. Constants are special-cased: llvm.constant results are
mapped to their literal value (not a %vN name), so PHIs can reference
constants from blocks that appear later in the output.
**NVVM intrinsic name conversion.** Pliron identifiers use underscores; LLVM
intrinsics use dots. The exporter converts all names starting with llvm_ by
replacing underscores with dots: llvm_nvvm_read_ptx_sreg_tid_x becomes
llvm.nvvm.read.ptx.sreg.tid.x. This is a mechanical transformation, not a
lookup table.
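The transformation is small enough to sketch in a few lines (the function
name is hypothetical; the real code lives in export.rs):

```rust
/// Convert a pliron intrinsic identifier to LLVM's dotted form:
/// any name starting with `llvm_` has its underscores replaced by dots.
fn intrinsic_name(pliron_name: &str) -> String {
    if pliron_name.starts_with("llvm_") {
        pliron_name.replace('_', ".")
    } else {
        pliron_name.to_string()
    }
}

fn main() {
    assert_eq!(
        intrinsic_name("llvm_nvvm_read_ptx_sreg_tid_x"),
        "llvm.nvvm.read.ptx.sreg.tid.x"
    );
}
```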
**Convergent attribute marking.** Barrier, shuffle, and vote intrinsics must
be marked convergent to prevent LLVM from hoisting them out of control flow.
The exporter recognizes these by prefix pattern matching on the (dot-form)
name and appends #0 to their call sites, emitting attributes #0 = { convergent } at module level.
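The recognition step amounts to a prefix test along these lines (the exact
prefix list is an assumption, not the exporter's verbatim set):

```rust
/// Does this intrinsic need the `convergent` attribute? Barriers,
/// shuffles, and votes must not be hoisted across control flow.
fn needs_convergent(dotted_name: &str) -> bool {
    const PREFIXES: &[&str] = &[
        "llvm.nvvm.barrier", // bar.sync and friends
        "llvm.nvvm.shfl.",   // warp shuffles
        "llvm.nvvm.vote.",   // warp votes
    ];
    PREFIXES.iter().any(|p| dotted_name.starts_with(p))
}

fn main() {
    assert!(needs_convergent("llvm.nvvm.barrier0"));
    assert!(!needs_convergent("llvm.nvvm.read.ptx.sreg.tid.x"));
}
```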
**Kernel metadata.** Functions marked as kernels get ptx_kernel calling
convention and an !nvvm.annotations metadata entry.
Here is what the exported LLVM IR looks like for a simple vector-add kernel:
```llvm
target datalayout = "e-i64:64-i128:128-v16:16-v32:32-n16:32:64"
target triple = "nvptx64-nvidia-cuda"

declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()

define ptx_kernel void @vecadd(ptr addrspace(1) %v0, i64 %v1,
                               ptr addrspace(1) %v2, i64 %v3,
                               ptr addrspace(1) %v4, i64 %v5) {
entry:
  %v6 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x() #0
  %v7 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x() #0
  %v8 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() #0
  %v9 = mul i32 %v8, %v7
  %v10 = add i32 %v9, %v6
  ; ... bounds check, load, add, store ...
  ret void
}

!nvvm.annotations = !{!0}
!0 = !{ptr @vecadd, !"kernel", i32 1}
attributes #0 = { convergent }
```
Notice the slices have been scalarized: each Rust &[f32] becomes a
ptr addrspace(1) and an i64 length. That happened in the lowering pass;
by the time dialect-llvm sees them, they are flat arguments.
## dialect-nvvm – The GPU Layer
dialect-nvvm wraps NVIDIA’s GPU intrinsics as typed pliron operations.
These operations do not form a “level” in the lowering chain – they are
inserted during the dialect-mir → dialect-llvm lowering pass and coexist
with dialect-llvm operations in the same function body. At export time,
they become call instructions to @llvm.nvvm.* intrinsics.
### Architecture Coverage
The dialect is organized into modules, each targeting a GPU feature set:
| Feature Set | Ops | Minimum SM | GPU Family |
|---|---|---|---|
| Thread/block indexing, barriers | 18 | All | All GPUs |
| Lane id, shuffle, vote, match | 18 | All | All GPUs |
| Cooperative grid sync | 1 | sm_70 | Volta+ |
| Clock, trap, breakpoint | 6 | All | All GPUs |
| Atomic load/store/RMW/cmpxchg | 4 | sm_70 | Volta+ |
| Thread Block Clusters + DSMEM | 11 | sm_90 | Hopper+ |
| Async barriers + fence proxy + nanosleep | 10 | sm_90 | Hopper+ |
| Tensor Memory Accelerator (bulk G2S/S2G) | 15 | sm_90 | Hopper+ |
| Warpgroup Matrix Multiply-Accumulate | 5 | sm_90 | Hopper+ |
| Shared memory matrix store + bf16 convert | 5 | sm_90 | Hopper+ |
| Tensor Core Gen 5 + TMEM | 24 | sm_100 | Blackwell+ |
| Cluster Launch Control | 6 | sm_100 | Blackwell+ |
That is 123 operations total. Most users will only encounter the first three modules (thread indexing, warp shuffles, barriers). The rest are for advanced GPU programming – TMA, matrix accelerators, and Blackwell’s tensor memory – covered in the Advanced GPU Features chapters.
### From Rust to PTX: An Intrinsic's Journey
Each NVVM operation maps through three levels of naming. A few
representative examples:

| Pliron operation | LLVM intrinsic | PTX instruction |
|---|---|---|
| `nvvm.read_ptx_sreg_tid_x` | `@llvm.nvvm.read.ptx.sreg.tid.x` | `mov.u32 %r, %tid.x` |
| `nvvm.read_ptx_sreg_ntid_x` | `@llvm.nvvm.read.ptx.sreg.ntid.x` | `mov.u32 %r, %ntid.x` |
| `nvvm.barrier0` | `@llvm.nvvm.barrier0` | `bar.sync 0` |
| `nvvm.shfl_sync_bfly_i32` | `@llvm.nvvm.shfl.sync.bfly.i32` | `shfl.sync.bfly.b32` |
The first column is the Rust struct name in dialect-nvvm. The second is what
export.rs emits (after the underscore-to-dot transformation). The third is
what llc produces. You never have to write any of these by hand – they are
generated by mir-lower when it sees calls to cuda-device intrinsic
functions like thread::index_x() or warp::shfl_sync_bfly().
### Verification Strategy
NVVM operations use minimal structural verification: each operation checks
its operand count and result count, and a handful verify result types (thread
indexing ops require i32 results; tcgen05 loads check exact result counts
for their 32-register and 4-register variants).
This is intentional. NVVM operations are machine-generated by mir-lower –
they are never hand-written by users. LLVM’s NVPTX backend provides
comprehensive type validation downstream. Adding full type checking to every
NVVM operation would double the dialect’s code size for zero practical benefit.
> **Note:** The GPU architecture requirements (sm_70, sm_90, sm_100) are
> documented but not enforced at the pliron level. Architecture validation
> happens later, when llc is invoked with a specific -mcpu=sm_XX flag. If you
> use a Hopper intrinsic and target Volta, llc will tell you – loudly.
## How the Dialects Interact
Here is the lifecycle of a single Rust operation as it passes through all three abstraction levels:
```
Rust source:   let sum = a + b;   // a, b: f32
dialect-mir:   %sum = mir.add %a, %b : f32
                 ↓ (DialectConversion)
dialect-llvm:  %v5 = fadd float %v3, %v4
                 ↓ (export.rs)
LLVM IR:       %v5 = fadd float %v3, %v4
                 ↓ (llc --mcpu=sm_80)
PTX:           add.f32 %f3, %f1, %f2;
```
The dialect-mir → dialect-llvm step is where the interesting work
happens: mir.add on f32 becomes fadd (floating-point add), while
mir.add on i32 becomes add (integer add). Checked operations like
mir.checked_add expand into an llvm.add, a constant i1 false for the
overflow flag, and an insertvalue into a struct – the GPU path omits
overflow detection (since GPU integer arithmetic wraps). The lowering pass
handles all of these translations.
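Spelled out, the checked-add expansion looks roughly like this (value names
invented for the example):

```llvm
; mir.checked_add %a, %b : i32 lowers to:
%sum  = add i32 %a, %b
%pair = insertvalue { i32, i1 } undef, i32 %sum, 0
%res  = insertvalue { i32, i1 } %pair, i1 false, 1  ; overflow flag hard-wired to false
```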
For GPU-specific operations, dialect-nvvm enters the picture:
```
Rust source:   let tid = thread::index_x();
dialect-mir:   %tid = mir.call @cuda_oxide_device_<hash>_thread_index_x()
                 ↓ (DialectConversion, recognizes the intrinsic)
dialect-nvvm:  %v2 = nvvm.read_ptx_sreg_tid_x : i32
                 ↓ (export.rs)
LLVM IR:       %v2 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x() #0
                 ↓ (llc)
PTX:           mov.u32 %r1, %tid.x;
```
The lowering pass recognizes calls to cuda_device intrinsic functions by
their fully qualified names (FQDNs) and replaces them with the corresponding
dialect-nvvm operations. No generic “function call” machinery is needed –
the intrinsic becomes a direct hardware instruction.
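A sketch of that recognition step (the helper name and matching strategy
are illustrative; mir-lower's real matcher covers every cuda-device
intrinsic, not just this one):

```rust
/// Map a callee's fully qualified name to the nvvm op that replaces
/// the call, or None if it is an ordinary function.
fn nvvm_replacement(callee_fqn: &str) -> Option<&'static str> {
    if callee_fqn.ends_with("thread_index_x") {
        return Some("nvvm.read_ptx_sreg_tid_x");
    }
    None
}

fn main() {
    let callee = "cuda_oxide_device_1a2b_thread_index_x"; // hash invented
    assert_eq!(nvvm_replacement(callee), Some("nvvm.read_ptx_sreg_tid_x"));
}
```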
### The Full Picture
Putting it all together, a compiled kernel body contains a mix of
dialect-llvm and dialect-nvvm operations:
```
llvm.func @vecadd(...) {
entry:
  %tid = nvvm.read_ptx_sreg_tid_x       // NVVM: thread index
  %ntid = nvvm.read_ptx_sreg_ntid_x     // NVVM: block size
  %ctaid = nvvm.read_ptx_sreg_ctaid_x   // NVVM: block index
  %offset = llvm.mul %ctaid, %ntid      // LLVM: integer math
  %idx = llvm.add %offset, %tid         // LLVM: integer math
  %cmp = llvm.icmp slt %idx, %len       // LLVM: bounds check
  llvm.cond_br %cmp, bb1, bb2           // LLVM: branch
bb1:
  %p_a = llvm.gep %a, %idx              // LLVM: pointer arithmetic
  %val_a = llvm.load %p_a               // LLVM: memory access
  %p_b = llvm.gep %b, %idx
  %val_b = llvm.load %p_b
  %sum = llvm.fadd %val_a, %val_b       // LLVM: floating-point add
  %p_c = llvm.gep %c, %idx
  llvm.store %sum, %p_c
  llvm.br bb2
bb2:
  llvm.return void
}
```
The dialect-nvvm operations at the top compute the global thread index.
Everything else is standard dialect-llvm – loads, stores, arithmetic,
branches. The export engine serializes all of it into a single .ll file,
and llc compiles it to PTX.
For how these dialects are connected by the lowering pass, see The Lowering Pipeline.