# Pliron Dialects
cuda-oxide does not lower Rust to PTX in a single, heroic transformation. It uses three custom pliron dialects, each modeling a different level of abstraction. This chapter walks through all three – their types, their operations, and how they fit together to form the compilation pipeline.
If you have not read the Pliron IR (MLIR-like) chapter yet, now is a good
time. The concepts there (operations, types, attributes, regions,
Ptr&lt;T&gt;, def-use chains) are the building blocks of everything on this page.
## The Three Dialects at a Glance
| Dialect | Purpose | Level |
|---|---|---|
| dialect-mir | Models Rust MIR semantics | Highest – Rust types, tuples, enums, slices, checked arithmetic |
| dialect-llvm | Models LLVM IR | Middle – flat types, GEP, PHI-ready control flow |
| dialect-nvvm | Models NVIDIA GPU intrinsics | Orthogonal – thread indexing, warps, TMA, WGMMA, tcgen05 |
dialect-nvvm is “orthogonal” rather than a layer in the stack because its
operations appear alongside dialect-llvm operations, not below them. A
warp shuffle and an integer add coexist in the same function body.
Data flows through the pipeline like this:
```
dialect-mir ──(mem2reg)──▶ dialect-mir (SSA) ──(DialectConversion)──▶ dialect-llvm + dialect-nvvm ops ──(export.rs)──▶ textual LLVM IR ──(llc)──▶ PTX
```
Each arrow is a well-defined transformation. The first two happen inside pliron, the export step is cuda-oxide's own code, and the last one is LLVM's NVPTX backend doing what it does best.
## dialect-mir – The Rust Layer
dialect-mir preserves Rust’s type system and control flow semantics as
pliron operations. This is deliberate: we want to reason about Rust concepts
(tuples, enums, checked arithmetic, address spaces) before flattening them
to LLVM’s type system.
### Types
The dialect defines seven custom types that mirror Rust’s compound types:
| Type | Description |
|---|---|
| Tuple | Heterogeneous tuples |
| Pointer | Pointers with a GPU address space |
| Array | Fixed-size arrays |
| Struct | Named structs with layout info |
| Slice | Fat pointers (ptr + length) |
| Disjoint slice | Safety-checked slice – each thread accesses a unique element |
| Enum | Rust enums with discriminant and variant payloads |

The printed forms of these types appear in the IR examples later in this
chapter (`mir.tuple<...>`, `mir.ptr<...>`, `mir.struct<...>`).
The address spaces on mir.ptr and mir.slice track where data lives in
the GPU memory hierarchy:
| Address Space | Meaning |
|---|---|
| 0 | Generic (resolved at runtime) |
| 1 | Global (device DRAM) |
| 3 | Shared (per-block SRAM) |
| 4 | Constant (read-only cache) |
| 5 | Local (per-thread stack, spills to DRAM) |
| 6 | Tensor memory (Blackwell TMEM) |
### Operations
dialect-mir defines 54 operations across 11 categories:
| Category | Count |
|---|---|
| Function | 1 |
| Control flow | 5 |
| Constants | 3 |
| Memory | 9 |
| Arithmetic | 15 |
| Comparison | 6 |
| Aggregate | 8 |
| Enum | 3 |
| Cast | 1 |
| Storage | 2 |
| Call | 1 |
That is a lot of operations, but they fall into natural groups. If you know Rust MIR (or have read the rustc_public chapter), each operation maps directly to a MIR concept.
### What the IR Looks Like
Here are a few examples of dialect-mir operations in practice. These are
simplified for readability – the actual printed form includes more metadata.
Checked addition (Rust: `let sum = a + b` where `a, b: i32`):

```
// mir.checked_add returns a tuple (result, overflow_flag)
%checked = mir.checked_add %a, %b : i32
%sum = mir.extract_field %checked, 0 : mir.tuple<i32, i1>
%overflowed = mir.extract_field %checked, 1 : mir.tuple<i32, i1>
mir.assert %overflowed == false, "attempt to add with overflow" -> bb1
```
Struct construction and field access (Rust: `point.x`):

```
%point = mir.construct_struct %x, %y : mir.struct<"Point", [f32, f32]>
%x_val = mir.extract_field %point, 0 : mir.struct<"Point", [f32, f32]>
```
Shared memory allocation (the GPU-specific part):

```
%shmem = mir.shared_alloc : mir.ptr<f32, mutable, addrspace: 3>
mir.store %value, %shmem : f32
```
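On the NVPTX path, shared allocations like this typically end up as
address-space-3 globals in the exported LLVM IR – roughly the following
(the symbol name and exact shape are illustrative, not cuda-oxide's
verbatim output):

```llvm
; Illustrative: a per-block shared f32 slot as LLVM sees it.
@shmem = internal addrspace(3) global float undef, align 4
```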
### Verification
Every MIR operation verifies type consistency when constructed. This catches import bugs early – before they have a chance to propagate through the lowering pipeline and surface as cryptic LLVM errors three passes later.
Examples of what gets checked:
- `mir.add` verifies that both operands have the same type.
- `mir.cond_br` verifies that the condition is `i1` (a boolean).
- `mir.extract_field` verifies that the field index is in bounds and the result type matches the field's type.
- `mir.store` verifies that the value type matches the pointee type of the pointer.
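A minimal sketch of the first check above, using a stand-in type enum
rather than pliron's real interned type handles (the names here are
illustrative, not the actual verifier code):

```rust
/// Stand-in for pliron's type handles, for illustration only.
#[derive(Debug, PartialEq)]
enum Ty {
    I32,
    F32,
}

/// The shape of mir.add's structural check: both operands and the
/// result must share a single type.
fn verify_add(lhs: &Ty, rhs: &Ty, result: &Ty) -> Result<(), String> {
    if lhs != rhs {
        return Err(format!("mir.add: operand types differ: {lhs:?} vs {rhs:?}"));
    }
    if result != lhs {
        return Err(format!("mir.add: result {result:?} does not match operands {lhs:?}"));
    }
    Ok(())
}

fn main() {
    assert!(verify_add(&Ty::I32, &Ty::I32, &Ty::I32).is_ok());
    assert!(verify_add(&Ty::I32, &Ty::F32, &Ty::I32).is_err());
}
```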
The DisjointSlice safety guarantee (“one thread, one element”) is enforced
at the type-system level via ThreadIndex – only hardware-derived thread
indices can access the slice. There is no separate compiler pass for
disjoint-access verification; the safety comes from the Rust type system
and cuda-device’s API design.
## dialect-llvm – The LLVM Layer
dialect-llvm models LLVM IR as pliron operations. It provides a near-1:1
mapping to textual .ll files – every LLVM instruction has a corresponding
pliron operation, and the types map directly to LLVM’s type system.
Types#
| Type | Description |
|---|---|
| Integers | Pliron built-in, used directly |
| Floats | Pliron built-in, used directly |
| Pointers | Opaque pointers with optional address space |
| Structs | Named or anonymous, may be opaque |
| Arrays | Fixed-size arrays |
| Vectors | SIMD vectors |
| Functions | Function signatures |
| Void | The unit type |
Note the absence of Rust-specific types. By the time code reaches
dialect-llvm, tuples have become structs, enums have become
discriminant-indexed structs, and slices have become pointer-length pairs.
The lowering pass (covered in The Lowering Pipeline)
handles all of that flattening.
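A few concrete mappings, following the rules just described (the slice row
matches the exported kernel signature shown later in this chapter; the
other layouts are illustrative):

```
Rust type      →  dialect-llvm type
(f32, u32)     →  { float, i32 }
[f32; 8]       →  [8 x float]
&[f32]         →  ptr addrspace(1) + i64 length (two separate values)
```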
### Operations
The dialect defines 62 operations:
| Category | Count |
|---|---|
| Arithmetic | 19 |
| Cast | 13 |
| Control flow | 5 |
| Memory | 4 |
| Atomic | 5 |
| Comparison | 2 |
| Aggregate | 3 |
| Call | 2 |
| Inline asm | 2 |
| Constants | 3 |
| Symbol | 3 |
| Select | 1 |
If you have read LLVM IR before, nothing here will surprise you. The operation
names are intentionally the same as their LLVM counterparts, prefixed with
llvm. in the IR.
### The Export Engine
The crown jewel of dialect-llvm is export.rs – the module that converts a
pliron IR module into valid textual LLVM IR. This is not just “print each
operation”; several non-trivial transformations happen during export:
**Block arguments become PHI nodes.** Pliron (like MLIR) models merge points
as block arguments – a function-style calling convention between basic blocks.
LLVM IR uses PHI nodes instead. The exporter builds a predecessor map from
branch operands and emits phi instructions at the top of each non-entry
block.
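A sketch of that rewrite on a two-predecessor merge (value and block names
invented for the example):

```llvm
; Before export (pliron block-argument form, approximated):
;   bb1:  llvm.br bb3(%x)
;   bb2:  llvm.br bb3(%y)
;   bb3(%m: i32):  ...
;
; After export (LLVM PHI form):
bb3:
  %m = phi i32 [ %x, %bb1 ], [ %y, %bb2 ]
```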
**Value naming.** A pre-pass assigns sequential SSA names (%v0, %v1, …)
to every value. Constants are special-cased: llvm.constant results are
mapped to their literal value (not a %vN name), so PHIs can reference
constants from blocks that appear later in the output.
**NVVM intrinsic name conversion.** Pliron identifiers use underscores; LLVM
intrinsics use dots. The exporter converts all names starting with llvm_ by
replacing underscores with dots: llvm_nvvm_read_ptx_sreg_tid_x becomes
llvm.nvvm.read.ptx.sreg.tid.x. This is a mechanical transformation, not a
lookup table.
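The transformation is small enough to sketch in a few lines (the function
name is hypothetical; the real code lives in export.rs):

```rust
/// Convert a pliron intrinsic identifier to LLVM's dotted form:
/// any name starting with `llvm_` has its underscores replaced by dots.
fn intrinsic_name(pliron_name: &str) -> String {
    if pliron_name.starts_with("llvm_") {
        pliron_name.replace('_', ".")
    } else {
        pliron_name.to_string()
    }
}

fn main() {
    assert_eq!(
        intrinsic_name("llvm_nvvm_read_ptx_sreg_tid_x"),
        "llvm.nvvm.read.ptx.sreg.tid.x"
    );
}
```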
**Convergent attribute marking.** Barrier, shuffle, and vote intrinsics must
be marked convergent to prevent LLVM from hoisting them out of control flow.
The exporter recognizes these by prefix pattern matching on the (dot-form)
name and appends #0 to their call sites, emitting attributes #0 = { convergent } at module level.
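The recognition step amounts to a prefix test along these lines (the exact
prefix list is an assumption, not the exporter's verbatim set):

```rust
/// Does this intrinsic need the `convergent` attribute? Barriers,
/// shuffles, and votes must not be hoisted across control flow.
fn needs_convergent(dotted_name: &str) -> bool {
    const PREFIXES: &[&str] = &[
        "llvm.nvvm.barrier", // bar.sync and friends
        "llvm.nvvm.shfl.",   // warp shuffles
        "llvm.nvvm.vote.",   // warp votes
    ];
    PREFIXES.iter().any(|p| dotted_name.starts_with(p))
}

fn main() {
    assert!(needs_convergent("llvm.nvvm.barrier0"));
    assert!(!needs_convergent("llvm.nvvm.read.ptx.sreg.tid.x"));
}
```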
**Kernel metadata.** Functions marked as kernels get ptx_kernel calling
convention and an !nvvm.annotations metadata entry.
Here is what the exported LLVM IR looks like for a simple vector-add kernel:
```llvm
target datalayout = "e-i64:64-i128:128-v16:16-v32:32-n16:32:64"
target triple = "nvptx64-nvidia-cuda"

declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()

define ptx_kernel void @vecadd(ptr addrspace(1) %v0, i64 %v1,
                               ptr addrspace(1) %v2, i64 %v3,
                               ptr addrspace(1) %v4, i64 %v5) {
entry:
  %v6 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x() #0
  %v7 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x() #0
  %v8 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() #0
  %v9 = mul i32 %v8, %v7
  %v10 = add i32 %v9, %v6
  ; ... bounds check, load, add, store ...
  ret void
}

!nvvm.annotations = !{!0}
!0 = !{ptr @vecadd, !"kernel", i32 1}
attributes #0 = { convergent }
```
Notice the slices have been scalarized: each Rust &[f32] becomes a
ptr addrspace(1) and an i64 length. That happened in the lowering pass;
by the time dialect-llvm sees them, they are flat arguments.
## dialect-nvvm – The GPU Layer
dialect-nvvm wraps NVIDIA’s GPU intrinsics as typed pliron operations.
These operations do not form a “level” in the lowering chain – they are
inserted during the dialect-mir → dialect-llvm lowering pass and coexist
with dialect-llvm operations in the same function body. At export time,
they become call instructions to @llvm.nvvm.* intrinsics.
### Architecture Coverage
The dialect is organized into modules, each targeting a GPU feature set:
| Feature Set | Ops | Minimum SM | GPU Family |
|---|---|---|---|
| Thread/block indexing, barriers | 18 | All | All GPUs |
| Lane id, shuffle, vote, match | 18 | All | All GPUs |
| Cooperative grid sync | 1 | sm_70 | Volta+ |
| Clock, trap, breakpoint | 6 | All | All GPUs |
| Atomic load/store/RMW/cmpxchg | 4 | sm_70 | Volta+ |
| Thread Block Clusters + DSMEM | 11 | sm_90 | Hopper+ |
| Async barriers + fence proxy + nanosleep | 10 | sm_90 | Hopper+ |
| Tensor Memory Accelerator (bulk G2S/S2G) | 15 | sm_90 | Hopper+ |
| Warpgroup Matrix Multiply-Accumulate | 5 | sm_90 | Hopper+ |
| Shared memory matrix store + bf16 convert | 5 | sm_90 | Hopper+ |
| Tensor Core Gen 5 + TMEM | 24 | sm_100 | Blackwell+ |
| Cluster Launch Control | 6 | sm_100 | Blackwell+ |
That is 123 operations total. Most users will only encounter the first three modules (thread indexing, warp shuffles, barriers). The rest are for advanced GPU programming – TMA, matrix accelerators, and Blackwell’s tensor memory – covered in the Advanced GPU Features chapters.
### From Rust to PTX: An Intrinsic's Journey
Each NVVM operation maps through three levels of naming. A few
representative examples:

| Pliron operation | LLVM intrinsic | PTX instruction |
|---|---|---|
| `nvvm.read_ptx_sreg_tid_x` | `@llvm.nvvm.read.ptx.sreg.tid.x` | `mov.u32 %r, %tid.x` |
| `nvvm.read_ptx_sreg_ntid_x` | `@llvm.nvvm.read.ptx.sreg.ntid.x` | `mov.u32 %r, %ntid.x` |
| `nvvm.barrier0` | `@llvm.nvvm.barrier0` | `bar.sync 0` |
| `nvvm.shfl_sync_bfly_i32` | `@llvm.nvvm.shfl.sync.bfly.i32` | `shfl.sync.bfly.b32` |
The first column is the Rust struct name in dialect-nvvm. The second is what
export.rs emits (after the underscore-to-dot transformation). The third is
what llc produces. You never have to write any of these by hand – they are
generated by mir-lower when it sees calls to cuda-device intrinsic
functions like thread::index_x() or warp::shfl_sync_bfly().
### Verification Strategy
NVVM operations use minimal structural verification: each operation checks
its operand count and result count, and a handful verify result types (thread
indexing ops require i32 results; tcgen05 loads check exact result counts
for their 32-register and 4-register variants).
This is intentional. NVVM operations are machine-generated by mir-lower –
they are never hand-written by users. LLVM’s NVPTX backend provides
comprehensive type validation downstream. Adding full type checking to every
NVVM operation would double the dialect’s code size for zero practical benefit.
> **Note:** The GPU architecture requirements (sm_70, sm_90, sm_100) are
> documented but not enforced at the pliron level. Architecture validation
> happens later, when llc is invoked with a specific -mcpu=sm_XX flag. If you
> use a Hopper intrinsic and target Volta, llc will tell you – loudly.
## How the Dialects Interact
Here is the lifecycle of a single Rust operation as it passes through all three abstraction levels:
```
Rust source:   let sum = a + b;   // a, b: f32
dialect-mir:   %sum = mir.add %a, %b : f32
                 ↓ (DialectConversion)
dialect-llvm:  %v5 = fadd float %v3, %v4
                 ↓ (export.rs)
LLVM IR:       %v5 = fadd float %v3, %v4
                 ↓ (llc --mcpu=sm_80)
PTX:           add.f32 %f3, %f1, %f2;
```
The dialect-mir → dialect-llvm step is where the interesting work
happens: mir.add on f32 becomes fadd (floating-point add), while
mir.add on i32 becomes add (integer add). Checked operations like
mir.checked_add expand into an llvm.add, a constant i1 false for the
overflow flag, and an insertvalue into a struct – the GPU path omits
overflow detection (since GPU integer arithmetic wraps). The lowering pass
handles all of these translations.
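Spelled out, the checked-add expansion looks roughly like this (value names
invented for the example):

```llvm
; mir.checked_add %a, %b : i32 lowers to:
%sum  = add i32 %a, %b
%pair = insertvalue { i32, i1 } undef, i32 %sum, 0
%res  = insertvalue { i32, i1 } %pair, i1 false, 1  ; overflow flag hard-wired to false
```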
For GPU-specific operations, dialect-nvvm enters the picture:
```
Rust source:   let tid = thread::index_x();
dialect-mir:   %tid = mir.call @cuda_oxide_device_<hash>_thread_index_x()
                 ↓ (DialectConversion, recognizes the intrinsic)
dialect-nvvm:  %v2 = nvvm.read_ptx_sreg_tid_x : i32
                 ↓ (export.rs)
LLVM IR:       %v2 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x() #0
                 ↓ (llc)
PTX:           mov.u32 %r1, %tid.x;
```
The lowering pass recognizes calls to cuda_device intrinsic functions by
their fully qualified names (FQDNs) and replaces them with the corresponding
dialect-nvvm operations. No generic “function call” machinery is needed –
the intrinsic becomes a direct hardware instruction.
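A sketch of that recognition step (the helper name and matching strategy
are illustrative; mir-lower's real matcher covers every cuda-device
intrinsic, not just this one):

```rust
/// Map a callee's fully qualified name to the nvvm op that replaces
/// the call, or None if it is an ordinary function.
fn nvvm_replacement(callee_fqn: &str) -> Option<&'static str> {
    if callee_fqn.ends_with("thread_index_x") {
        return Some("nvvm.read_ptx_sreg_tid_x");
    }
    None
}

fn main() {
    let callee = "cuda_oxide_device_1a2b_thread_index_x"; // hash invented
    assert_eq!(nvvm_replacement(callee), Some("nvvm.read_ptx_sreg_tid_x"));
}
```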
### The Full Picture
Putting it all together, a compiled kernel body contains a mix of
dialect-llvm and dialect-nvvm operations:
```
llvm.func @vecadd(...) {
entry:
  %tid = nvvm.read_ptx_sreg_tid_x       // NVVM: thread index
  %ntid = nvvm.read_ptx_sreg_ntid_x     // NVVM: block size
  %ctaid = nvvm.read_ptx_sreg_ctaid_x   // NVVM: block index
  %offset = llvm.mul %ctaid, %ntid      // LLVM: integer math
  %idx = llvm.add %offset, %tid         // LLVM: integer math
  %cmp = llvm.icmp slt %idx, %len       // LLVM: bounds check
  llvm.cond_br %cmp, bb1, bb2           // LLVM: branch
bb1:
  %p_a = llvm.gep %a, %idx              // LLVM: pointer arithmetic
  %val_a = llvm.load %p_a               // LLVM: memory access
  %p_b = llvm.gep %b, %idx
  %val_b = llvm.load %p_b
  %sum = llvm.fadd %val_a, %val_b       // LLVM: floating-point add
  %p_c = llvm.gep %c, %idx
  llvm.store %sum, %p_c
  llvm.br bb2
bb2:
  llvm.return void
}
```
The dialect-nvvm operations at the top compute the global thread index.
Everything else is standard dialect-llvm – loads, stores, arithmetic,
branches. The export engine serializes all of it into a single .ll file,
and llc compiles it to PTX.
For how these dialects are connected by the lowering pass, see The Lowering Pipeline.