The Safety Model#
A GPU kernel runs thousands of threads that all see the same memory at the same time. On a CPU, Rust prevents data races through ownership and borrowing – one mutable reference, no aliases, enforced at compile time. On a GPU, you have 2048 threads per SM, all launched from the same function, all pointing at the same output buffer. The borrow checker was not designed for this.
cuda-oxide solves the problem in layers. The common case – one thread writes
one element – is safe by construction, no unsafe required. The uncommon
cases – shared memory, warp shuffles, hardware intrinsics – require
unsafe with documented contracts. And the frontier cases – TMA, tensor
cores, cluster-level communication – are fully manual, matching the
complexity of the hardware they control.
This chapter explains the model, walks through each layer, and tells you
exactly when you need unsafe and why.
Three tiers#
cuda-oxide organizes kernel safety into three tiers based on how much the compiler can verify:
| Tier | Description | unsafe Required? |
|---|---|---|
| Tier 1 | Safe by construction – the type system prevents misuse | No |
| Tier 2 | Explicit unsafe with documented safety contracts | Yes, scoped |
| Tier 3 | Raw hardware intrinsics – full user responsibility | Yes, pervasive |
Most application kernels live entirely in Tier 1 or straddle Tier 1 and 2. Tier 3 is for performance engineers building at the level of CUTLASS or Triton IR. If you are writing a vecadd, a GEMM, or a reduction, you will rarely leave Tier 2.
Tier 1: safe by default#
The core idea: DisjointSlice<T> + ThreadIndex#
The primary safety abstraction is a pair of types that together guarantee
race-free parallel writes without unsafe at the call site:
- ThreadIndex – an opaque newtype around usize with no public constructor. You cannot create one from an arbitrary integer. The only way to obtain a ThreadIndex is through trusted functions (index_1d, index_2d) that derive it from hardware built-in variables (threadIdx, blockIdx, blockDim) – read-only special registers assigned by the runtime at kernel launch.
- DisjointSlice&lt;T&gt; – a slice-like type whose get_mut() method accepts only ThreadIndex, not raw usize. It returns Option<&mut T> – None for out-of-bounds indices, Some(&mut T) for valid ones.
Put them together and you get a kernel with zero unsafe:
use cuda_device::{kernel, thread, DisjointSlice};
#[kernel]
pub fn vecadd(a: &[f32], b: &[f32], mut c: DisjointSlice<f32>) {
let idx = thread::index_1d();
if let Some(c_elem) = c.get_mut(idx) {
*c_elem = a[idx.get()] + b[idx.get()];
}
}
Safety follows from three facts:
- index_1d() produces a unique value per thread (hardware guarantee: threadIdx.x < blockDim.x, so the linear index blockIdx.x * blockDim.x + threadIdx.x is unique across the grid).
- get_mut() is bounds-checked – out-of-range threads get None.
- Different threads get different ThreadIndex values, so different &mut T references. No aliasing, no data race.
The borrow checker sees a single &mut T per thread. The hardware
guarantees the indices are disjoint. The type system ties the two together.
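To make the mechanism concrete, here is a minimal sketch of how the pair might be defined – simplified and illustrative, not the actual cuda_device source:

pub struct ThreadIndex(usize); // no public constructor

impl ThreadIndex {
    pub fn get(&self) -> usize {
        self.0
    }
}

pub struct DisjointSlice<T> {
    ptr: *mut T,
    len: usize,
}

impl<T> DisjointSlice<T> {
    // Accepts only a ThreadIndex, never a raw usize.
    pub fn get_mut(&mut self, idx: ThreadIndex) -> Option<&mut T> {
        if idx.0 < self.len {
            // SAFETY: the index is in bounds, and the trusted constructors
            // described below hand out a distinct index to every thread, so
            // no two threads receive a &mut to the same element.
            Some(unsafe { &mut *self.ptr.add(idx.0) })
        } else {
            None
        }
    }
}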
Trusted index functions#
ThreadIndex is only as trustworthy as the functions that create it. Here
are the constructors cuda-oxide provides:
| Function | Formula | Return Type | Uniqueness Guarantee |
|---|---|---|---|
| index_1d() | blockIdx.x * blockDim.x + threadIdx.x | ThreadIndex | Unconditional – unique per thread across the grid |
| index_2d(row_stride) | row * row_stride + col | Option<ThreadIndex> | Unconditional – returns None for any thread whose col is not below row_stride, so every Some index is unique |
| index_2d_row() | blockIdx.y * blockDim.y + threadIdx.y | usize | Component accessor, not a ThreadIndex |
| index_2d_col() | blockIdx.x * blockDim.x + threadIdx.x | usize | Component accessor, not a ThreadIndex |
Notice that index_2d_row() and index_2d_col() return plain usize –
they give you the row and column for arithmetic, but they cannot be used to
index into a DisjointSlice. Only the linearized result, after a uniqueness
check, earns the ThreadIndex type.
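For comparison with the index_2d listing in the next section, here is a sketch of what index_1d amounts to – assuming accessor names (blockIdx_x() and friends) that match the ones used there:

pub fn index_1d() -> ThreadIndex {
    // Each thread's (blockIdx.x, threadIdx.x) pair is unique, and
    // threadIdx.x < blockDim.x, so the linearized index is unique across
    // the grid -- no extra condition is needed before handing out a
    // ThreadIndex.
    ThreadIndex((blockIdx_x() * blockDim_x() + threadIdx_x()) as usize)
}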
Why index_2d returns Option#
This one deserves a closer look, because it is the subtlest part of the safety model.
The formula row * stride + col is only injective (one-to-one)
when col < stride. In a 2D grid, col is derived from blockIdx.x
and threadIdx.x – it can exceed the matrix column count when the grid
dimensions overshoot (which they almost always do, since block dimensions
must be multiples of the warp size). If the function returned a bare
ThreadIndex for those threads, two distinct threads could compute the
same linear index – with a stride of 4, for instance, (row 0, col 4) and
(row 1, col 0) both linearize to index 4. That would mean two &mut T
references to the same element. That is undefined behavior.
index_2d eliminates this by checking col < row_stride internally:
pub fn index_2d(row_stride: usize) -> Option<ThreadIndex> {
let row = (blockIdx_y() * blockDim_y() + threadIdx_y()) as usize;
let col = (blockIdx_x() * blockDim_x() + threadIdx_x()) as usize;
if col < row_stride {
Some(ThreadIndex(row * row_stride + col))
} else {
None
}
}
Threads that fail the check get None. They simply do not participate in
the write – no aliasing, no race. Even if you pass a “wrong” stride
(say, index_2d(1)), the worst that happens is that most threads get
None and only the col == 0 threads write. You get fewer writers than
you intended, but never undefined behavior.
Tip
The informal proof: suppose two threads with (row_a, col_a) and
(row_b, col_b), both satisfying col < stride, produce the same index:
row_a * stride + col_a == row_b * stride + col_b
=> (row_a - row_b) * stride == col_b - col_a
The right side is in (-stride, stride) because both cols are in
[0, stride). The left side is a multiple of stride. The only value in
both sets is zero – so row_a == row_b and col_a == col_b. But distinct
hardware threads have distinct (row, col) pairs.
The GEMM pattern#
For 2D kernels, the typical pattern looks like this:
let row = thread::index_2d_row();
let col = thread::index_2d_col();
if let Some(c_idx) = thread::index_2d(n as usize) {
// col < n is guaranteed by Some -- no manual check needed
if row < m as usize {
// ... compute dot product ...
if let Some(c_elem) = c.get_mut(c_idx) {
*c_elem = alpha * sum + beta * (*c_elem);
}
}
}
The if let Some from index_2d replaces the manual col < n guard you
would write in CUDA C++. The row < m check remains because it guards
against reading garbage from the input matrices (though get_mut would also
return None for out-of-bounds writes).
What makes a kernel Tier 1#
A kernel is fully safe – Tier 1 – when:
- All mutable output access goes through DisjointSlice::get_mut(ThreadIndex)
- All inputs are shared immutable references (&[T])
- No shared memory, no raw pointers, no intrinsics beyond thread indexing
Examples in this tier include vecadd, helper_fn, generic, host_closure,
and the naive GEMM kernels in the gemm and async_mlp examples.
Tier 2: scoped unsafe#
Not every kernel fits the “one thread, one output element” pattern. When
threads need to cooperate – sharing data through fast on-chip memory,
communicating across lanes in a warp, or performing atomic updates – you
need unsafe. The key property of Tier 2 is that the unsafe is scoped
and auditable: each block has a documented safety contract, and the rest
of the kernel remains safe.
Warp intrinsics#
Warp-level primitives let threads within a warp exchange data without
touching memory at all – register-to-register transfers, coordinated in
hardware. They are unsafe because the hardware does not check thread
convergence: if you pass a mask that includes a diverged thread, you get
undefined behavior (typically a silent hang, which is worse than a crash).
| API | Safety Obligation |
|---|---|
| shfl_sync | Source lane must be active; mask must include calling thread |
| ballot_sync | All threads in mask must be converged |
| activemask | Result is only meaningful at the point of call |
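As an illustration of the style these primitives enable, here is a warp-sum sketch – the wrapper name warp::shfl_down_sync and its argument order are assumptions, not the confirmed cuda-oxide signature (see the Warp-Level Programming chapter for the real API):

// Sum a value across the 32 lanes of a warp, register-to-register.
const FULL_MASK: u32 = 0xffff_ffff;

let mut sum = my_value;
let mut offset = 16;
while offset > 0 {
    // SAFETY: every thread in the warp executes this loop with no
    // divergence, so the full mask is valid at each call site.
    sum += unsafe { warp::shfl_down_sync(FULL_MASK, sum, offset) };
    offset /= 2;
}
// Lane 0 now holds the sum of all 32 lanes.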
See also
Warp-Level Programming for shuffle patterns, reductions, and prefix sums using warp intrinsics.
Barriers and lifecycle#
The ManagedBarrier typestate API encodes the barrier lifecycle
(Uninit -> Ready -> Invalidated) in the type system, so you cannot
wait on a barrier that was never initialized or use one that has been
invalidated. The init() and inval() transitions still require unsafe
because they interact with the hardware, but the type states prevent the
most common mistakes at compile time.
| API | Safety Obligation |
|---|---|
| init() | Must be called by exactly one thread; barrier must be in shared memory |
| wait() | Barrier must be initialized; token must match |
| inval() | Barrier must be initialized; must not be used afterwards |
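The shape of the typestate pattern, sketched in isolation – the state names follow the lifecycle above, but the field layout and method signatures are illustrative, not the real ManagedBarrier definition:

use core::marker::PhantomData;

// Each lifecycle state is a distinct type, so "wait on an uninitialized
// barrier" fails to type-check instead of failing at runtime.
pub struct Uninit;
pub struct Ready;
pub struct Invalidated;

pub struct Barrier<State> {
    _state: PhantomData<State>,
}

impl Barrier<Uninit> {
    // SAFETY contract: called by exactly one thread; barrier lives in
    // shared memory.
    pub unsafe fn init(self) -> Barrier<Ready> {
        Barrier { _state: PhantomData }
    }
}

impl Barrier<Ready> {
    pub fn wait(&self) {
        // block until the current phase completes (elided)
    }

    // SAFETY contract: no thread may touch the barrier afterwards.
    pub unsafe fn inval(self) -> Barrier<Invalidated> {
        Barrier { _state: PhantomData }
    }
}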
Atomics#
Atomic operations are safe to call once you have a valid atomic reference.
The unsafe surface is at construction – creating a DeviceAtomicU32
from a raw pointer requires the caller to guarantee that the pointer is
valid and properly aligned:
let atom = unsafe { DeviceAtomicU32::new(ptr) };
atom.fetch_add(1, Ordering::Relaxed); // safe call
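One way to apply the "wrap it in a helper" mitigation from the cheat sheet at the end of this chapter – a sketch in which the helper name is illustrative, not part of cuda-oxide:

// The contract is written once, at the only unsafe construction site;
// everything downstream uses the safe fetch_* methods.
//
// SAFETY contract: ptr must be non-null, properly aligned, and point to a
// u32 that stays valid for the duration of the kernel.
unsafe fn counter_at(ptr: *mut u32) -> DeviceAtomicU32 {
    unsafe { DeviceAtomicU32::new(ptr) }
}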
Unchecked slice access#
When the “one thread, one element” model does not fit – for instance, in a
warp-level reduction where only lane 0 writes the result –
DisjointSlice::get_unchecked_mut(usize) provides an escape hatch:
if warp::lane_id() == 0 {
let warp_idx = gid.get() / 32;
// SAFETY: Only lane 0 of each warp writes; warp indices are unique
unsafe { *out.get_unchecked_mut(warp_idx) = sum; }
}
The safety obligation is the same as the ThreadIndex system enforces
automatically: index in bounds, no two threads share the same index. The
difference is that you prove it yourself instead of letting the type system
do it for you.
Tier 3: raw hardware#
At the bottom of the stack are the raw hardware intrinsics – the APIs
that talk directly to specific functional units on specific GPU
architectures. Every call is unsafe, the safety contracts are complex
and architecture-dependent, and the documentation lives in the PTX ISA
manual more than in Rust doc comments.
| Feature | Architectures |
|---|---|
| TMA (Tensor Memory Accelerator) | sm_90+ (Hopper) |
| tcgen05 (Tensor Core Gen 5) | sm_120 (Blackwell) |
| WGMMA (Warpgroup MMA) | sm_90+ (Hopper) |
| Cluster | sm_90+ (Hopper) |
| CLC (Cluster Launch Control) | sm_120 (Blackwell) |
| TMEM (Tensor Memory) | sm_120 (Blackwell) |
If you are writing application-level kernels, you should not need Tier 3 APIs. They exist for the people building the next CUTLASS – and for those people, cuda-oxide provides the same hardware access as inline PTX in CUDA C++, with Rust’s type system available (but not enforced) as a guardrail.
See also
Tensor Memory Accelerator, Matrix Multiply Accelerators, and Cluster Programming for detailed coverage of Tier 3 features.
What the borrow checker gives you#
cuda-oxide is not a DSL or a macro system – it runs the real rustc
frontend on your kernel code. That means every safety guarantee Rust
provides on the CPU is also enforced on the GPU:
| Guarantee | How It Works |
|---|---|
| Ownership and borrowing | Lifetime errors, use-after-free, and aliasing violations caught at compile time |
| Safe parallel writes | DisjointSlice::get_mut(ThreadIndex) hands each thread a &mut T to a distinct element |
| Explicit unsafe | Raw pointer access requires unsafe blocks |
| Convergent attribute enforcement | Sync primitives (barriers, fences, shuffles) marked convergent in the IR |
The first three are standard Rust. The fourth is GPU-specific: CUDA’s
bar.sync, fence, and warp shuffle instructions must not be duplicated or
reordered by the compiler. cuda-oxide marks them convergent in the IR so
that LLVM’s optimization passes leave them alone.
The hard problems#
Rust’s borrow checker was designed for single-threaded ownership with
Send/Sync for CPU concurrency. SIMT execution introduces patterns that
the borrow checker was never taught to reason about. Here is an honest
accounting of what cuda-oxide does not enforce today – and why these
problems are solvable.
Thread-divergent control flow#
Rustc’s JumpThreading MIR optimization duplicates function calls into both
branches of an if-statement – a perfectly sound optimization on CPUs, but
it breaks GPU barrier semantics where all threads in a block must converge
at the same bar.sync instruction. cuda-oxide currently disables
JumpThreading for device code (-Z mir-enable-passes=-JumpThreading). A
proper solution would teach the compiler about convergence requirements so
it can optimize around them instead of disabling the pass entirely.
Warp-level convergence#
Operations like shfl_sync and ballot_sync require that all threads
named in the participation mask are actually converged at the call site.
The type system cannot enforce this today. If threads have diverged and you
pass a full mask, you get a silent hang – the worst kind of bug, because
there is no crash and no error message, just a kernel that never finishes.
Memory space awareness#
GPU memory has distinct address spaces – global, shared, local, TMEM.
A &mut to shared memory is visible to every thread in the block; a
&mut to local memory is private to one thread. The borrow checker treats
them identically. This is conservative (it rejects some safe programs) but
never unsound (it does not accept unsafe ones). Still, a memory-space-aware
borrow checker could accept more programs without unsafe.
Why these are solvable#
The building blocks already exist in Rust’s type system. They need to be extended, not reinvented:
| Idea | What It Solves |
|---|---|
| Execution-resource-aware types | Functions annotated with their execution level (grid / block / warp / thread). A barrier call inside a divergent branch becomes a compile-time error. |
| Memory views | Generalized parallel access patterns – like DisjointSlice, extended to other memory spaces and access shapes. |
| Extended borrow checking for sync | Statically enforce that barriers cannot be forgotten, placed in divergent control flow, or duplicated by the optimizer. Convergence in the type system. |
All of this is compile-time analysis. The generated PTX is identical to what you would write by hand – the safety net disappears at code generation. Zero runtime cost.
cuda-oxide is well-positioned to deliver this incrementally. The real rustc
borrow checker already runs on device code. The IR infrastructure (pliron
dialects) supports GPU-aware analysis passes. The full compilation pipeline
from MIR to PTX is under our control. And each new safety check is additive
– existing kernels keep compiling while new ones get stronger guarantees.
Writing safe kernels: a cheat sheet#
The default path#
For most kernels, start here:
#[kernel]
pub fn my_kernel(input: &[f32], mut output: DisjointSlice<f32>) {
let idx = thread::index_1d();
if let Some(out) = output.get_mut(idx) {
*out = transform(input[idx.get()]);
}
}
The rules:
- Use DisjointSlice for all mutable outputs.
- Use &[T] for all read-only inputs.
- Use index_1d() for 1D grids, index_2d(stride) for 2D grids.
- Always bounds-check via get_mut() (returns Option).
If your kernel compiles without unsafe, it is race-free by construction.
When you need unsafe#
| Pattern | Why | Mitigation |
|---|---|---|
| Shared memory | Multiple threads access the same memory | Synchronize with block-wide barriers (bar.sync) |
| Warp shuffles | Thread convergence is not compiler-checked | Use full masks only from convergent code |
| Atomics | Construction from a raw pointer is unsafe | Wrap in a helper; the atomic operations themselves are safe |
| Non-uniform writes | Not every thread writes to its own index | Use get_unchecked_mut with a SAFETY comment proving index uniqueness |
| Hardware intrinsics | Complex, architecture-specific contracts | Follow the PTX ISA documentation; test on target hardware |
The SAFETY comment#
For every unsafe block, document why the invariants hold. Not what the
code does – the code already says that – but why this particular usage is
safe:
// SAFETY: Only lane 0 of each warp executes this branch.
// Warp indices (gid / 32) are unique across warps, so no two
// threads write to the same output element.
if warp::lane_id() == 0 {
let warp_idx = gid.get() / 32;
unsafe { *partial_sums.get_unchecked_mut(warp_idx) = warp_sum; }
}
This is not ceremony. When a kernel data-races at 2 AM and you are staring
at a compute-sanitizer log, past-you’s safety comments are the fastest
path to the bug.
Tip
If you cannot write a convincing SAFETY comment for an unsafe block,
that is a signal that the invariant is not actually maintained. Restructure
the code until the argument is obvious, or use a safe API instead.
Summary#
| Property | Status |
|---|---|
| Borrow checker on device code | Enforced (real rustc frontend) |
| Safe parallel writes (DisjointSlice + ThreadIndex) | Enforced for both 1D and 2D grids (index_1d, index_2d) |
| Explicit unsafe boundaries | Enforced (Rust language rules) |
| Convergent attribute on sync primitives | Enforced (IR-level convergent attribute) |
| Thread convergence for warp ops | NOT enforced (runtime obligation) |
| Memory space awareness (shared vs global) | NOT enforced (future work) |
The safety model is designed to make the common case safe by default while
providing explicit escape hatches for everything else. Write your kernel,
let the type system catch the races, and save unsafe for the parts where
you genuinely know something the compiler does not.