# Pliron Dialects

cuda-oxide does not lower Rust to PTX in a single, heroic transformation. It uses three custom pliron dialects, each modeling a different level of abstraction. This chapter walks through all three – their types, their operations, and how they fit together to form the compilation pipeline.

If you have not read the Pliron IR (MLIR-like) chapter yet, now is a good time. The concepts there (operations, types, attributes, regions, `Ptr<T>`, def-use chains) are the building blocks of everything on this page.


## The Three Dialects at a Glance

| Dialect | Purpose | Level |
|---|---|---|
| `dialect-mir` | Models Rust MIR semantics | Highest – Rust types, tuples, enums, slices, checked arithmetic |
| `dialect-llvm` | Models LLVM IR | Middle – flat types, GEP, PHI-ready control flow |
| `dialect-nvvm` | Models NVIDIA GPU intrinsics | Orthogonal – thread indexing, warps, TMA, WGMMA, tcgen05 |

dialect-nvvm is “orthogonal” rather than a layer in the stack because its operations appear alongside dialect-llvm operations, not below them. A warp shuffle and an integer add coexist in the same function body.

Data flows through the pipeline like this:

```
dialect-mir ──(mem2reg)──▶ dialect-mir (SSA) ──(DialectConversion)──▶ dialect-llvm + dialect-nvvm ops ──(export.rs)──▶ textual LLVM IR ──(llc)──▶ PTX
```

Each arrow is a well-defined transformation. The first three happen inside pliron-based code; the last is LLVM’s NVPTX backend doing what it does best.


## dialect-mir – The Rust Layer

dialect-mir preserves Rust’s type system and control flow semantics as pliron operations. This is deliberate: we want to reason about Rust concepts (tuples, enums, checked arithmetic, address spaces) before flattening them to LLVM’s type system.

### Types

The dialect defines seven custom types that mirror Rust’s compound types:

| Type | Example | Description |
|---|---|---|
| `mir.tuple` | `mir.tuple<i32, f32, i64>` | Heterogeneous tuples |
| `mir.ptr` | `mir.ptr<f32, mutable, addrspace: 1>` | Pointers with GPU address space |
| `mir.array` | `mir.array<f32, 256>` | Fixed-size arrays |
| `mir.struct` | `mir.struct<"Point", [f32, f32]>` | Named structs with layout info |
| `mir.slice` | `mir.slice<f32, addrspace: 1>` | Fat pointers (ptr + length) |
| `mir.disjoint_slice` | `mir.disjoint_slice<f32>` | Safety-checked slice – each thread accesses a unique element |
| `mir.enum` | `mir.enum<"Option_i32", [("None", []), ("Some", [i32])]>` | Rust enums with discriminant and variant payloads |

The address spaces on mir.ptr and mir.slice track where data lives in the GPU memory hierarchy:

| Address Space | Meaning |
|---|---|
| 0 | Generic (resolved at runtime) |
| 1 | Global (device DRAM) |
| 3 | Shared (per-block SRAM) |
| 4 | Constant (read-only cache) |
| 5 | Local (per-thread stack, spills to DRAM) |
| 6 | Tensor memory (Blackwell TMEM) |
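
The numbering is NVVM’s and it is sparse (note that 2 is absent), which makes it worth modeling explicitly rather than passing raw integers around. A minimal sketch, using an illustrative enum rather than cuda-oxide’s actual type:

```rust
/// The NVVM address-space numbers from the table above, modeled as a
/// Rust enum. Illustrative stand-in only – not cuda-oxide's actual type.
#[repr(u32)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum AddrSpace {
    Generic = 0,  // resolved at runtime
    Global = 1,   // device DRAM
    Shared = 3,   // per-block SRAM
    Constant = 4, // read-only cache
    Local = 5,    // per-thread stack, spills to DRAM
    Tensor = 6,   // Blackwell TMEM
}
```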

### Operations

dialect-mir defines 54 operations across 11 categories:

| Category | Examples | Count |
|---|---|---|
| Function | `mir.func` | 1 |
| Control flow | `mir.goto`, `mir.cond_br`, `mir.return`, `mir.assert`, `mir.unreachable` | 5 |
| Constants | `mir.constant`, `mir.float_constant`, `mir.undef` | 3 |
| Memory | `mir.alloca`, `mir.load`, `mir.store`, `mir.assign`, `mir.ref`, `mir.ptr_offset`, `mir.shared_alloc`, `mir.global_alloc`, `mir.extern_shared` | 9 |
| Arithmetic | `mir.add`, `mir.sub`, `mir.mul`, `mir.div`, `mir.rem`, `mir.checked_add`, `mir.checked_sub`, `mir.checked_mul`, `mir.neg`, `mir.not`, `mir.shr`, `mir.shl`, `mir.bitand`, `mir.bitor`, `mir.bitxor` | 15 |
| Comparison | `mir.eq`, `mir.ne`, `mir.lt`, `mir.le`, `mir.gt`, `mir.ge` | 6 |
| Aggregate | `mir.extract_field`, `mir.insert_field`, `mir.construct_struct`, `mir.construct_tuple`, `mir.construct_array`, `mir.extract_array_element`, `mir.field_addr`, `mir.array_element_addr` | 8 |
| Enum | `mir.get_discriminant`, `mir.construct_enum`, `mir.enum_payload` | 3 |
| Cast | `mir.cast` | 1 |
| Storage | `mir.storage_live`, `mir.storage_dead` | 2 |
| Call | `mir.call` | 1 |

That is a lot of operations, but they fall into natural groups. If you know Rust MIR (or have read the rustc_public chapter), each operation maps directly to a MIR concept.

### What the IR Looks Like

Here are a few examples of dialect-mir operations in practice. These are simplified for readability – the actual printed form includes more metadata.

Checked addition (Rust: let sum = a + b where a, b: i32):

```
// mir.checked_add returns a tuple (result, overflow_flag)
%checked    = mir.checked_add %a, %b : i32
%sum        = mir.extract_field %checked, 0 : mir.tuple<i32, i1>
%overflowed = mir.extract_field %checked, 1 : mir.tuple<i32, i1>
mir.assert %overflowed == false, "attempt to add with overflow" -> bb1
```

Struct construction and field access (Rust: point.x):

```
%point = mir.construct_struct %x, %y : mir.struct<"Point", [f32, f32]>
%x_val = mir.extract_field %point, 0 : mir.struct<"Point", [f32, f32]>
```

Shared memory allocation (the GPU-specific part):

```
%shmem = mir.shared_alloc : mir.ptr<f32, mutable, addrspace: 3>
mir.store %value, %shmem : f32
```

### Verification

Every MIR operation verifies type consistency when constructed. This catches import bugs early – before they have a chance to propagate through the lowering pipeline and surface as cryptic LLVM errors three passes later.

Examples of what gets checked:

- `mir.add` verifies that both operands have the same type.
- `mir.cond_br` verifies that the condition is `i1` (a boolean).
- `mir.extract_field` verifies that the field index is in bounds and the result type matches the field’s type.
- `mir.store` verifies that the value type matches the pointee type of the pointer.
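
To make the pattern concrete, here is a minimal sketch of the first check – `mir.add`’s operand/result agreement – written against a stand-in type enum rather than pliron’s real `Ptr<T>`-based type objects:

```rust
/// Stand-in for a pliron type handle; the real verifier compares
/// interned `Ptr<T>` type objects instead.
#[derive(Debug, PartialEq)]
enum Type { I32, F32 }

/// Sketch of the mir.add rule: operands must agree, result must match.
fn verify_add(lhs: &Type, rhs: &Type, result: &Type) -> Result<(), String> {
    if lhs != rhs {
        return Err(format!("mir.add: operand types differ: {lhs:?} vs {rhs:?}"));
    }
    if result != lhs {
        return Err(format!("mir.add: result {result:?} does not match operands"));
    }
    Ok(())
}
```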

The `DisjointSlice` safety guarantee (“one thread, one element”) is enforced at the type-system level via `ThreadIndex` – only hardware-derived thread indices can access the slice. There is no separate compiler pass for disjoint-access verification; the safety comes from the Rust type system and cuda-device’s API design.
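
A rough sketch of how a type-level guarantee like that can be built. The names mirror the ones above, but the actual cuda-device API is richer than this and may differ in shape:

```rust
/// A hardware-derived thread index. In cuda-device the only way to get
/// one is from intrinsics such as thread::index_x() – there is no public
/// constructor from an arbitrary integer. (Sketch; details assumed.)
pub struct ThreadIndex(u32);

/// A slice in which each thread may touch only its own element.
pub struct DisjointSlice<'a, T>(&'a mut [T]);

impl<'a, T> DisjointSlice<'a, T> {
    /// The sole accessor: because every thread holds a distinct
    /// ThreadIndex, no two threads can alias the same element.
    pub fn at(&mut self, idx: ThreadIndex) -> &mut T {
        &mut self.0[idx.0 as usize]
    }
}
```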


## dialect-llvm – The LLVM Layer

dialect-llvm models LLVM IR as pliron operations. It provides a near-1:1 mapping to textual .ll files – every LLVM instruction has a corresponding pliron operation, and the types map directly to LLVM’s type system.

### Types

| Type | Example | Description |
|---|---|---|
| Integers | `i1`, `i8`, `i16`, `i32`, `i64`, `i128` | Pliron built-in, used directly |
| Floats | `half`, `float`, `double` | Pliron built-in (FP16Type, FP32Type, FP64Type) |
| `llvm.ptr` | `ptr addrspace(1)` | Opaque pointers with optional address space |
| `llvm.struct` | `{ i32, float }` or `%MyStruct` | Named or anonymous, may be opaque |
| `llvm.array` | `[256 x float]` | Fixed-size arrays |
| `llvm.vector` | `<4 x float>` | SIMD vectors |
| `llvm.func` | `(i32, ptr) -> void` | Function signatures |
| `llvm.void` | `void` | The unit type |

Note the absence of Rust-specific types. By the time code reaches dialect-llvm, tuples have become structs, enums have become discriminant-indexed structs, and slices have become pointer-length pairs. The lowering pass (covered in The Lowering Pipeline) handles all of that flattening.
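
As one concrete instance of that flattening, a `mir.slice` parameter splits into two flat parameters. The sketch below uses stand-in type enums, not the real lowering code:

```rust
/// Stand-ins for pliron type objects (illustrative only).
enum LlvmType {
    Ptr { addrspace: u32 }, // opaque pointer, llvm.ptr
    I64,                    // slice length
}

struct MirSlice {
    addrspace: u32,
}

/// A fat pointer becomes two scalars: data pointer + element count.
fn flatten_slice_param(slice: &MirSlice) -> [LlvmType; 2] {
    [LlvmType::Ptr { addrspace: slice.addrspace }, LlvmType::I64]
}
```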

### Operations

The dialect defines 62 operations:

| Category | Examples | Count |
|---|---|---|
| Arithmetic | `add`, `sub`, `mul`, `fadd`, `fsub`, `fmul`, `fdiv`, `frem`, `fneg`, … | 19 |
| Cast | `zext`, `sext`, `trunc`, `fpext`, `fptrunc`, `sitofp`, `uitofp`, `fptosi`, `fptoui`, `ptrtoint`, `inttoptr`, `addrspacecast`, `bitcast` | 13 |
| Control flow | `br`, `cond_br`, `switch`, `return`, `unreachable` | 5 |
| Memory | `load`, `store`, `alloca`, `gep` | 4 |
| Atomic | `atomic_load`, `atomic_store`, `atomicrmw`, `cmpxchg`, `fence` | 5 |
| Comparison | `icmp`, `fcmp` | 2 |
| Aggregate | `extract_value`, `insert_value`, `extractelement` | 3 |
| Call | `call`, `call_intrinsic` | 2 |
| Inline asm | `inline_asm`, `inline_asm_multi` | 2 |
| Constants | `constant`, `zero`, `undef` | 3 |
| Symbol | `func`, `global`, `addressof` | 3 |
| Select | `select` | 1 |

If you have read LLVM IR before, nothing here will surprise you. The operation names are intentionally the same as their LLVM counterparts, prefixed with llvm. in the IR.

### The Export Engine

The crown jewel of dialect-llvm is export.rs – the module that converts a pliron IR module into valid textual LLVM IR. This is not just “print each operation”; several non-trivial transformations happen during export:

**Block arguments become PHI nodes.** Pliron IR (MLIR-like) models merge points as block arguments – a function-style calling convention between basic blocks. LLVM IR uses PHI nodes instead. The exporter builds a predecessor map from branch operands and emits `phi` instructions at the top of each non-entry block.

**Value naming.** A pre-pass assigns sequential SSA names (`%v0`, `%v1`, …) to every value. Constants are special-cased: `llvm.constant` results are mapped to their literal value (not a `%vN` name), so PHIs can reference constants from blocks that appear later in the output.

**NVVM intrinsic name conversion.** Pliron identifiers use underscores; LLVM intrinsics use dots. The exporter converts all names starting with `llvm_` by replacing underscores with dots: `llvm_nvvm_read_ptx_sreg_tid_x` becomes `llvm.nvvm.read.ptx.sreg.tid.x`. This is a mechanical transformation, not a lookup table.
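
Because the rule is purely mechanical, the whole conversion fits in a couple of lines. A sketch of the rule as described (not the exporter’s literal code):

```rust
/// Rewrite a pliron intrinsic identifier into LLVM's dotted form.
/// Names that do not start with `llvm_` are left to other handling.
fn to_llvm_intrinsic(name: &str) -> Option<String> {
    name.starts_with("llvm_").then(|| name.replace('_', "."))
}

// to_llvm_intrinsic("llvm_nvvm_read_ptx_sreg_tid_x")
//   == Some("llvm.nvvm.read.ptx.sreg.tid.x".to_string())
```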

**Convergent attribute marking.** Barrier, shuffle, and vote intrinsics must be marked `convergent` to prevent LLVM from hoisting them out of control flow. The exporter recognizes these by prefix pattern matching on the (dot-form) name and appends `#0` to their call sites, emitting `attributes #0 = { convergent }` at module level.
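
The recognition step can be sketched the same way; the prefix list below is an assumption for illustration, not the exporter’s verbatim set:

```rust
/// Does this (dot-form) intrinsic need `#0` (convergent) at its call site?
/// Prefixes are illustrative; the real exporter's list may differ.
fn needs_convergent(intrinsic: &str) -> bool {
    ["llvm.nvvm.barrier", "llvm.nvvm.shfl.", "llvm.nvvm.vote."]
        .iter()
        .any(|prefix| intrinsic.starts_with(prefix))
}
```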

**Kernel metadata.** Functions marked as kernels get the `ptx_kernel` calling convention and an `!nvvm.annotations` metadata entry.

Here is what the exported LLVM IR looks like for a simple vector-add kernel:

```llvm
target datalayout = "e-i64:64-i128:128-v16:16-v32:32-n16:32:64"
target triple = "nvptx64-nvidia-cuda"

declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()

define ptx_kernel void @vecadd(ptr addrspace(1) %v0, i64 %v1,
                               ptr addrspace(1) %v2, i64 %v3,
                               ptr addrspace(1) %v4, i64 %v5) {
entry:
    %v6 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x() #0
    %v7 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x() #0
    %v8 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() #0
    %v9 = mul i32 %v8, %v7
    %v10 = add i32 %v9, %v6
    ; ... bounds check, load, add, store ...
    ret void
}

!nvvm.annotations = !{!0}
!0 = !{ptr @vecadd, !"kernel", i32 1}
attributes #0 = { convergent }
```

Notice that the slices have been scalarized: each Rust `&[f32]` becomes a `ptr addrspace(1)` and an `i64` length. That happened in the lowering pass; by the time dialect-llvm sees them, they are flat arguments.
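
The split itself is ordinary Rust fat-pointer anatomy. This stand-alone snippet (plain Rust, not compiler code) shows the same (pointer, length) decomposition the lowering performs:

```rust
/// A &[f32] is exactly a (data pointer, element count) pair – the two
/// flat kernel arguments visible in the exported signature above.
fn decompose(s: &[f32]) -> (*const f32, u64) {
    (s.as_ptr(), s.len() as u64)
}
```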


## dialect-nvvm – The GPU Layer

dialect-nvvm wraps NVIDIA’s GPU intrinsics as typed pliron operations. These operations do not form a “level” in the lowering chain – they are inserted during the dialect-mir → dialect-llvm lowering pass and coexist with dialect-llvm operations in the same function body. At export time, they become call instructions to `@llvm.nvvm.*` intrinsics.

### Architecture Coverage

The dialect is organized into modules, each targeting a GPU feature set:

| Module | Description | Ops | Minimum SM | GPU Family |
|---|---|---|---|---|
| `thread` | Thread/block indexing, barrier0, threadfences | 18 | All | All GPUs |
| `warp` | Lane id, shuffle, vote, match | 18 | All | All GPUs |
| `grid` | Cooperative grid_sync | 1 | sm_70 | Volta+ |
| `debug` | Clock, trap, breakpoint, vprintf | 6 | All | All GPUs |
| `atomic` | Atomic load/store/RMW/cmpxchg | 4 | sm_70 | Volta+ |
| `cluster` | Thread Block Clusters + DSMEM | 11 | sm_90 | Hopper+ |
| `mbarrier` | Async barriers + fence proxy + nanosleep | 10 | sm_90 | Hopper+ |
| `tma` | Tensor Memory Accelerator (bulk G2S/S2G) | 15 | sm_90 | Hopper+ |
| `wgmma` | Warpgroup Matrix Multiply-Accumulate | 5 | sm_90 | Hopper+ |
| `stmatrix` | Shared memory matrix store + bf16 convert | 5 | sm_90 | Hopper+ |
| `tcgen05` | Tensor Core Gen 5 + TMEM | 24 | sm_100 | Blackwell+ |
| `clc` | Cluster Launch Control | 6 | sm_100 | Blackwell+ |

That is 123 operations total. Most users will only encounter the first three modules (thread indexing, warp shuffles, barriers). The rest are for advanced GPU programming – TMA, matrix accelerators, and Blackwell’s tensor memory – covered in the Advanced GPU Features chapters.

### From Rust to PTX: An Intrinsic’s Journey

Each NVVM operation maps through three levels of naming:

| Pliron operation | LLVM intrinsic | PTX instruction |
|---|---|---|
| `ReadPtxSregTidXOp` | `llvm.nvvm.read.ptx.sreg.tid.x` | `mov.u32 %r1, %tid.x` |
| `Barrier0Op` | `llvm.nvvm.barrier0` | `bar.sync 0` |
| `ShflSyncBflyI32Op` | `llvm.nvvm.shfl.sync.bfly.i32` | `shfl.sync.bfly.b32` |
| `CpAsyncBulkTensorG2sTile2dOp` | `llvm.nvvm.cp.async.bulk.tensor.2d.tile.g2s.im2col.*` | `cp.async.bulk.tensor.2d.tile.g2s ...` |

The first column is the Rust struct name in dialect-nvvm. The second is what export.rs emits (after the underscore-to-dot transformation). The third is what llc produces. You never have to write any of these by hand – they are generated by mir-lower when it sees calls to cuda-device intrinsic functions like `thread::index_x()` or `warp::shfl_sync_bfly()`.
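
For a taste of the usage side, butterfly shuffles are the building block of warp reductions. The sketch below assumes `warp::shfl_sync_bfly` takes a value and an XOR lane mask, and that a warp has 32 lanes – check cuda-device’s actual signature before relying on this:

```rust
use cuda_device::warp; // module path assumed from the intrinsics named above

/// Warp-wide sum via butterfly shuffles: each step XORs the lane id
/// with `mask`, halving the reduction stride (32-lane warp assumed).
fn warp_sum(mut v: f32) -> f32 {
    for mask in [16u32, 8, 4, 2, 1] {
        v += warp::shfl_sync_bfly(v, mask);
    }
    v
}
```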

### Verification Strategy

NVVM operations use minimal structural verification: each operation checks its operand count and result count, and a handful verify result types (thread indexing ops require i32 results; tcgen05 loads check exact result counts for their 32-register and 4-register variants).

This is intentional. NVVM operations are machine-generated by mir-lower – they are never hand-written by users. LLVM’s NVPTX backend provides comprehensive type validation downstream. Adding full type checking to every NVVM operation would double the dialect’s code size for zero practical benefit.
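
In practice, “minimal structural verification” amounts to checks of this shape (stand-in operation type, not pliron’s real API):

```rust
/// Stand-in for a pliron operation's structural facts (illustrative).
struct OpShape {
    operands: usize,
    results: usize,
    result_is_i32: bool,
}

/// Thread-indexing ops: zero operands, one i32 result – nothing more.
fn verify_sreg_read(op: &OpShape) -> Result<(), &'static str> {
    match op {
        OpShape { operands: 0, results: 1, result_is_i32: true } => Ok(()),
        OpShape { operands: 0, results: 1, .. } => Err("result must be i32"),
        _ => Err("expected zero operands and one result"),
    }
}
```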

> **Note:** The GPU architecture requirements (sm_70, sm_90, sm_100) are documented but not enforced at the pliron level. Architecture validation happens later, when llc is invoked with a specific `-mcpu=sm_XX` flag. If you use a Hopper intrinsic and target Volta, llc will tell you – loudly.


## How the Dialects Interact

Here is the lifecycle of a single Rust operation as it passes through all three abstraction levels:

```
Rust source:   let sum = a + b;        // a, b: f32

dialect-mir:   %sum = mir.add %a, %b : f32
                ↓  (DialectConversion)
dialect-llvm:  %v5 = fadd float %v3, %v4
                ↓  (export.rs)
LLVM IR:       %v5 = fadd float %v3, %v4
                ↓  (llc --mcpu=sm_80)
PTX:           add.f32 %f3, %f1, %f2;
```

The dialect-mir → dialect-llvm step is where the interesting work happens: `mir.add` on `f32` becomes `fadd` (floating-point add), while `mir.add` on `i32` becomes `add` (integer add). Checked operations like `mir.checked_add` expand into an `llvm.add`, a constant `i1 false` for the overflow flag, and an `insertvalue` into a struct – the GPU path omits overflow detection (since GPU integer arithmetic wraps). The lowering pass handles all of these translations.
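
In plain Rust terms, the lowered sequence computes the following – this snippet restates the semantics described above, it is not the lowering code itself:

```rust
/// What the GPU lowering of mir.checked_add computes: a wrapping add
/// paired with a constant `false` overflow flag (the insertvalue-built
/// struct), since GPU integer arithmetic wraps.
fn lowered_checked_add(a: i32, b: i32) -> (i32, bool) {
    (a.wrapping_add(b), false)
}
```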

For GPU-specific operations, dialect-nvvm enters the picture:

```
Rust source:   let tid = thread::index_x();

dialect-mir:   %tid = mir.call @cuda_oxide_device_<hash>_thread_index_x()
                ↓  (DialectConversion, recognizes the intrinsic)
dialect-nvvm:  %v2 = nvvm.read_ptx_sreg_tid_x : i32
                ↓  (export.rs)
LLVM IR:       %v2 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x() #0
                ↓  (llc)
PTX:           mov.u32 %r1, %tid.x;
```

The lowering pass recognizes calls to cuda-device intrinsic functions by their fully qualified names and replaces them with the corresponding dialect-nvvm operations. No generic “function call” machinery is needed – the intrinsic becomes a direct hardware instruction.
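
A sketch of that recognition step. The symbol format follows the example above (`cuda_oxide_device_<hash>_…`), and the match table is a tiny illustrative subset – mir-lower’s real dispatch is larger:

```rust
/// Map a mangled cuda-device intrinsic symbol to an NVVM op name.
/// Symbol shape and match arms are illustrative, not mir-lower's code.
fn nvvm_op_for_symbol(sym: &str) -> Option<&'static str> {
    // "cuda_oxide_device_<hash>_thread_index_x" -> "thread_index_x"
    let tail = sym.strip_prefix("cuda_oxide_device_")?;
    let (_hash, intrinsic) = tail.split_once('_')?;
    match intrinsic {
        "thread_index_x" => Some("nvvm.read_ptx_sreg_tid_x"),
        "warp_shfl_sync_bfly" => Some("nvvm.shfl_sync_bfly_i32"),
        _ => None,
    }
}
```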

### The Full Picture

Putting it all together, a compiled kernel body contains a mix of dialect-llvm and dialect-nvvm operations:

```
llvm.func @vecadd(...) {
  entry:
    %tid    = nvvm.read_ptx_sreg_tid_x     // NVVM: thread index
    %ntid   = nvvm.read_ptx_sreg_ntid_x    // NVVM: block size
    %ctaid  = nvvm.read_ptx_sreg_ctaid_x   // NVVM: block index
    %offset = llvm.mul %ctaid, %ntid        // LLVM: integer math
    %idx    = llvm.add %offset, %tid        // LLVM: integer math
    %cmp    = llvm.icmp slt %idx, %len      // LLVM: bounds check
    llvm.cond_br %cmp, bb1, bb2             // LLVM: branch
  bb1:
    %p_a    = llvm.gep %a, %idx             // LLVM: pointer arithmetic
    %val_a  = llvm.load %p_a                // LLVM: memory access
    %p_b    = llvm.gep %b, %idx
    %val_b  = llvm.load %p_b
    %sum    = llvm.fadd %val_a, %val_b      // LLVM: floating-point add
    %p_c    = llvm.gep %c, %idx
    llvm.store %sum, %p_c
    llvm.br bb2
  bb2:
    llvm.return void
}
```

The dialect-nvvm operations at the top compute the global thread index. Everything else is standard dialect-llvm – loads, stores, arithmetic, branches. The export engine serializes all of it into a single .ll file, and llc compiles it to PTX.


For how these dialects are connected by the lowering pass, see The Lowering Pipeline.