10. CUDA Graphs#
CUDA graphs let you capture an entire GPU workload once and replay it many times, eliminating per-launch overhead. This tutorial builds a multi-layer forward pass using DeviceOp combinators, captures it as a CUDA graph, and replays it in a token loop.
Motivation#
Every kernel launch involves CPU-side work: selecting a stream, setting up arguments, invoking the driver. For workloads that repeat the same graph of operations (e.g., the forward pass of a transformer), this per-launch overhead can dominate — especially at small batch sizes where kernels are fast relative to their launch cost.
A CUDA graph records the entire sequence of operations once, then replays it with a single driver call. The GPU sees the full graph up front and can schedule internal work more aggressively.
Without graphs:
  CPU: launch → wait → launch → wait → launch → wait
  GPU: ████   ████   ████      (gaps between kernels)

With graphs:
  CPU: launch_graph → wait     (single call)
  GPU: ████████████████        (no gaps)
The Model#
We’ll build a minimal transformer-style layer stack: each layer performs RMSNorm → Q projection (matvec) → O projection (matvec) → residual add. The hidden state flows through all layers sequentially.
input
│
├─ Layer 0: RMSNorm → Q MatVec → O MatVec → Add(residual, hidden)
│
├─ Layer 1: RMSNorm → Q MatVec → O MatVec → Add(residual, hidden)
│
└─ … (n_layers)
Kernels#
Three cutile kernels handle the compute. Each follows the output-first
convention (&mut Tensor as the first parameter):
#[cutile::module]
mod kernels {
use cutile::core::*;
/// RMS normalization: out = rms_norm(x) * w
#[cutile::entry()]
pub fn rms_norm<const D: i32, const BS: i32>(
out: &mut Tensor<f32, { [1, D] }>,
x: &Tensor<f32, { [-1, D] }>,
w: &Tensor<f32, { [D] }>,
eps: f32,
) { /* tile-level implementation */ }
/// Matrix-vector multiply: out = x @ w^T
#[cutile::entry()]
pub fn matvec<const BN: i32, const BK: i32, const K: i32>(
out: &mut Tensor<f32, { [BN] }>,
x: &Tensor<f32, { [-1, K] }>,
w: &Tensor<f32, { [-1, K] }>,
) { /* tile-level implementation */ }
/// Element-wise add: out = a + b
#[cutile::entry()]
pub fn add<const B: i32>(
out: &mut Tensor<f32, { [B] }>,
a: &Tensor<f32, { [-1] }>,
b: &Tensor<f32, { [-1] }>,
) { /* tile-level implementation */ }
}
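To make the output-first convention concrete, a standalone launch of add could look like the sketch below. This is a sketch only: it reuses calls introduced later in this tutorial (api::rand, try_partition, .first(), .unpartition(), .sync_on) and assumes a stream is already in scope; the shapes and the inferred const generics are illustrative assumptions, not a definitive API.
let a: Arc<Tensor<f32>> = api::rand([128], None).sync_on(&stream)?.into();
let b: Arc<Tensor<f32>> = api::rand([128], None).sync_on(&stream)?.into();
let out: Arc<Tensor<f32>> = api::rand([128], None).sync_on(&stream)?.into();
// `add` builds a lazy DeviceOp; nothing runs until the op is driven.
let sum: Tensor<f32> = add(
    out.try_partition([128]).expect("sole buffer owner"), // output first
    a.clone(),
    b.clone(),
)
.first()            // keep only the output (&mut) argument
.unpartition()      // Partition → Tensor
.sync_on(&stream)?; // drive the op eagerly on the stream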
Model State#
Weights are shared across all forward calls. Buffers are pre-allocated once and reused every token — the graph replays into the same memory:
struct LayerWeights {
norm_w: Arc<Tensor<f32>>,
wq: Arc<Tensor<f32>>,
wo: Arc<Tensor<f32>>,
}
struct LayerBuffers {
norm: Arc<Tensor<f32>>,
q: Arc<Tensor<f32>>,
o: Arc<Tensor<f32>>,
residual: Arc<Tensor<f32>>,
}
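main below calls LayerBuffers::allocate, which the tutorial elides. A minimal sketch, assuming an api::zeros constructor analogous to the api::ones and api::rand calls used elsewhere in this tutorial, and the Stream/Error types from main:
impl LayerBuffers {
    /// Sketch: allocate every buffer once, outside any capture, and wrap
    /// it in Arc so try_partition can later prove sole ownership.
    fn allocate(d: i32, stream: &Stream) -> Result<Self, Error> {
        let zeros = |shape: &[i32]| -> Result<Arc<Tensor<f32>>, Error> {
            Ok(api::zeros::<f32>(shape).sync_on(stream)?.into())
        };
        Ok(Self {
            norm: zeros(&[1, d])?,  // RMSNorm output, shaped [1, d]
            q: zeros(&[d])?,        // Q projection output
            o: zeros(&[d])?,        // O projection output
            residual: zeros(&[d])?, // residual-add output (next hidden)
        })
    }
}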
Building the Lazy Graph#
The core of the approach: build the entire forward pass as a DeviceOp
without executing anything. This is the graph that will be captured.
fn build_forward(
cfg: &Config,
weights: &[LayerWeights],
input: Arc<Tensor<f32>>,
buffers: Vec<LayerBuffers>,
) -> (DeviceOpVec<()>, SharedDeviceOp<Tensor<f32>>) {
let mut ops = Vec::with_capacity(buffers.len());
let mut hidden: SharedDeviceOp<Tensor<f32>> = shared(input);
for (w, bufs) in weights.iter().zip(buffers) {
// RMSNorm: hidden(1,d) × norm_w → norm(1,d)
let norm = rms_norm(
bufs.norm.try_partition([1, cfg.d]).expect("sole buffer owner"),
hidden.clone().reshape(&[1, cfg.d]),
w.norm_w.clone(),
cfg.eps,
)
.generics(cfg.rms_generics())
.first()
.unpartition()
.shared();
// Q projection: norm @ wq^T → q
let q = matvec(
bufs.q.try_partition([cfg.bn]).expect("sole buffer owner"),
norm.clone(),
w.wq.clone(),
)
.generics(cfg.mv_generics())
.first()
.unpartition()
.shared();
// O projection: q @ wo^T → o
let o = matvec(
bufs.o.try_partition([cfg.bn]).expect("sole buffer owner"),
q.clone().reshape(&[1, cfg.d]),
w.wo.clone(),
)
.generics(cfg.mv_generics())
.first()
.unpartition()
.shared();
// Residual add: hidden + o → residual
let residual = add(
bufs.residual.try_partition([cfg.block]).expect("sole buffer owner"),
hidden.clone().reshape(&[cfg.d]),
o.clone(),
)
.first()
.unpartition()
.shared();
hidden = residual.clone();
// Keep the layer's work in the graph; the fixed buffers hold the data.
ops.push(
zip!(norm, q, o, residual)
.map(|_| ())
.boxed(),
);
}
(DeviceOpVec::new(ops), hidden)
}
Key patterns to notice:
- .shared() — Each intermediate result is shared so it can feed into both the next kernel and the final buffer collection. The underlying computation runs once; downstream consumers get Arc::clone()s.
- .first() — Kernel launches return a tuple of all arguments. .first() extracts just the output (the &mut Tensor parameter).
- try_partition — Converts Arc<Tensor<T>> into a Partition by proving sole ownership (Arc refcount == 1); see the sketch after this list.
- DeviceOpVec — Collects boxed ops for each layer's graph work.
- No GPU work yet — Everything above is pure graph construction.
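The sole-ownership rule behind try_partition can be seen in isolation. A sketch (the tutorial only ever uses .expect on the result, so whether failure surfaces as an Err or a None is left open here):
let buf: Arc<Tensor<f32>> = api::rand([cfg.d], None).sync_on(&stream)?.into();
let alias = buf.clone(); // refcount is now 2
// Partitioning now would fail: another owner could still observe the
// writes that a Partition hands out.
// buf.try_partition([cfg.d]).expect("sole buffer owner"); // would panic
drop(alias); // refcount back to 1
let part = buf.try_partition([cfg.d]).expect("sole buffer owner"); // succeeds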
Capturing the Graph#
.graph_on(stream) executes the operation once in CUDA’s stream capture mode,
recording all GPU work into a replayable graph:
let input: Tensor<f32> = api::rand([cfg.d], None).sync_on(&stream)?;
let input_arc: Arc<Tensor<f32>> = unsafe { input.into_shared_alias() };
let buffers: Vec<_> = (0..cfg.n_layers)
.map(|_| LayerBuffers::allocate(cfg.d, &stream))
.collect::<Result<_, _>>()?;
unsafe { stream.synchronize() }?;
// Build lazy graph (no GPU work).
let (forward_op, output_shared) = build_forward(&cfg, &weights, input_arc, buffers);
// Capture: executes once, records everything, returns CudaGraph.
let mut graph = forward_op.graph_on(stream.clone())?;
// Retrieve the output from the capture execution.
let output = output_shared.sync_on(&stream)?;
After capture:
- graph holds the recorded CUDA graph, ready for replay.
- input is the fixed input buffer whose pointer is baked into the graph.
- output points to the final layer's residual buffer, which the graph rewrites on replay.
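The graph can be replayed immediately at this point; every launch rewrites the same fixed buffers. This is the same call shape the Module below wraps:
// Replay the captured forward pass; `output` is updated in place.
graph.launch().sync_on(&stream)?;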
The Module Pattern#
Wrap the graph in a Module trait for clean inference:
trait Module {
type Input: Send + ?Sized;
type Output: Send;
fn forward(&mut self, input: &Self::Input) -> Result<Self::Output, DeviceError>;
}
struct GraphModel {
graph: CudaGraph<()>,
input: Tensor<f32>,
output: Arc<Tensor<f32>>,
}
impl Module for GraphModel {
type Input = Tensor<f32>;
type Output = Arc<Tensor<f32>>;
fn forward(&mut self, input: &Self::Input) -> Result<Self::Output, DeviceError> {
// Copy new embedding into the baked-in input buffer.
self.graph.update(api::memcpy(&mut self.input, input))?;
// Replay the entire forward pass with a single driver call.
self.graph.launch().sync_on(self.graph.stream())?;
Ok(self.output.clone())
}
}
Each forward call:
- graph.update(memcpy(…)) — Copies new input data into the pre-allocated input buffer. This runs on the graph's stream, so it completes before the graph launches.
- graph.launch().sync_on(…) — Replays all captured kernels. The GPU sees the full operation sequence and can schedule aggressively.
- Returns output.clone() — The output Arc points to the same device memory the graph wrote into. No copy needed.
Putting It Together#
fn main() -> Result<(), Error> {
let device = Device::new(0)?;
let stream = device.new_stream()?;
let cfg = Config { d: 2048, n_layers: 22, block: 128, bn: 16, bk: 16, eps: 1e-5 };
// Allocate weights (random for this example).
let weights: Vec<LayerWeights> = (0..cfg.n_layers)
.map(|_| Ok(LayerWeights {
norm_w: api::rand([cfg.d], None).sync_on(&stream)?.into(),
wq: api::rand([cfg.d, cfg.d], None).sync_on(&stream)?.into(),
wo: api::rand([cfg.d, cfg.d], None).sync_on(&stream)?.into(),
}))
.collect::<Result<_, Error>>()?;
// Build and capture.
let mut model = GraphModel::new(&cfg, &weights, &stream)?;
// Inference loop — each call is a single graph launch.
let n_tokens = 512;
for _ in 0..n_tokens {
let embedding: Tensor<f32> = api::rand([cfg.d], None).sync_on(&stream)?;
let _output = model.forward(&embedding)?;
}
Ok(())
}
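GraphModel::new is not shown above. The sketch below simply packages the capture sequence from "Capturing the Graph". One assumption to flag loudly: it treats into_shared_alias as creating an aliasing Arc while leaving the original tensor usable for the later memcpy updates (presumably why the call is unsafe); consult the full reference example for the exact signature.
impl GraphModel {
    /// Sketch: build, capture, and package the forward graph.
    fn new(cfg: &Config, weights: &[LayerWeights], stream: &Stream) -> Result<Self, Error> {
        // Fixed input buffer; its device pointer gets baked into the graph.
        let input: Tensor<f32> = api::rand([cfg.d], None).sync_on(stream)?;
        // ASSUMPTION: the alias shares `input`'s memory without consuming it.
        let input_arc: Arc<Tensor<f32>> = unsafe { input.into_shared_alias() };
        let buffers: Vec<_> = (0..cfg.n_layers)
            .map(|_| LayerBuffers::allocate(cfg.d, stream))
            .collect::<Result<_, _>>()?;
        unsafe { stream.synchronize() }?;
        let (forward_op, output_shared) = build_forward(cfg, weights, input_arc, buffers);
        let graph = forward_op.graph_on(stream.clone())?; // capture runs once
        let output = output_shared.sync_on(stream)?;      // Arc view of last residual
        Ok(Self { graph, input, output })
    }
}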
Alternative: CudaGraph::scope#
The combinator approach above requires Arc<Tensor<T>> + try_partition
for pre-allocated buffers. CudaGraph::scope provides an imperative
alternative using &mut borrows:
let mut input: Tensor<f32> = api::ones::<f32>(&[cfg.d]).sync_on(&stream)?;
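// NB (assumption): for the scope variant, LayerBuffers holds plain
// Tensor<f32> fields rather than Arc<Tensor<f32>>, so each buffer can
// be borrowed &mut below.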
let mut buffers: Vec<LayerBuffers> = /* pre-allocate */;
let graph = CudaGraph::scope(&stream, |s| {
for (w, bufs) in weights.iter().zip(buffers.iter_mut()) {
let hidden_2d = input.view(&[1, cfg.d])?;
s.record(rms_norm(
(&mut bufs.norm).partition([1, cfg.d]),
&hidden_2d,
&w.norm_w,
cfg.eps,
).generics(cfg.rms_generics()))?;
s.record(matvec(
(&mut bufs.q).partition([cfg.bn]),
&bufs.norm,
&w.wq,
).generics(cfg.mv_generics()))?;
let q_2d = bufs.q.view(&[1, cfg.d])?;
s.record(matvec(
(&mut bufs.o).partition([cfg.bn]),
&q_2d,
&w.wo,
).generics(cfg.mv_generics()))?;
s.record(add(
(&mut bufs.residual).partition([cfg.block]),
&input,
&bufs.o,
))?;
s.record(api::memcpy(&mut input, &bufs.residual))?;
}
Ok(())
})?;
graph.launch().sync_on(&stream)?;
Key differences from the combinator approach:
| | Combinator (graph_on) | Scope (CudaGraph::scope) |
|---|---|---|
| Buffer ownership | Arc<Tensor<T>> + try_partition | &mut Tensor + partition |
| Write-then-read | Via .shared() | Via ordered s.record calls |
| Failure mode | Runtime panic (refcount != 1) | Compile error (borrow conflict) |
| Composability | Chains with zip!, map, and other DeviceOp combinators | Imperative sequential code |
s.record(op) only accepts operations that implement GraphNode — kernel
launches and memcpy. Allocation ops (api::zeros, dup, etc.) are rejected
at compile time because their addresses may change on graph replay.
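A sketch of the distinction (staging is a hypothetical pre-allocated source tensor; the rejected line is illustrative, not the actual compiler diagnostic):
let graph = CudaGraph::scope(&stream, |s| {
    // OK: memcpy and kernel launches implement GraphNode.
    s.record(api::memcpy(&mut input, &staging))?;
    // Rejected: allocation returns fresh memory whose address would be
    // stale on replay, so api::zeros is not a GraphNode.
    // s.record(api::zeros::<f32>(&[cfg.d]))?; // ← does not compile
    Ok(())
})?;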
Use cases#
| Scenario | Use CUDA graphs? | Why |
|---|---|---|
| Repeat the same operation graph many times | Yes | Amortizes capture cost; eliminates per-launch overhead |
| Dynamic shapes per iteration | No | Captured graphs bake in tensor dimensions |
| Dynamic control flow per iteration | No | Captured graphs bake in the branch structure |
| Small number of iterations | Maybe | Capture cost (~1 execution) must be amortized |
| Profiling individual kernels | No | Graph replay shows as a single event |
Key Takeaways#
| Concept | What it means |
|---|---|
| DeviceOp | Lazy composition of GPU work — no execution until driven |
| .graph_on(stream) | Capture the entire operation into a replayable CudaGraph |
| CudaGraph::scope | Imperative graph capture with s.record and &mut borrows |
| graph.launch() | Returns a DeviceOp that replays the captured graph |
| graph.update(op) | Run a DeviceOp on the graph's stream before replay (e.g., memcpy new input) |
| Pre-allocated buffers | Graph writes into fixed memory; vary inputs via memcpy |
| .shared() | Each intermediate executes once during capture; clones share the result |
Note
Weight tensors in build_forward are passed as Arc<Tensor<T>> because
they’re shared across layers via .clone(). In a sync context where weights
are only read (not shared across spawned tasks), you could pass &Tensor<T>
instead — the borrow checker ensures the weights outlive the graph capture.
See Tutorial 7 for
details on borrowed inputs.
Exercise 1: Add a Second Graph#
Capture a second graph that computes the backward pass (or a simplified version). How would you sequence the forward and backward graphs?
Exercise 2: Measure the Speedup#
Add timing around graph.launch() vs a non-graph path that rebuilds and
executes the DeviceOp each iteration. How does the speedup scale with
n_layers?
Exercise 3: Dynamic Input Shapes#
The current approach bakes in the tensor dimensions at capture time. What would need to change to support variable sequence lengths? (Hint: consider capturing multiple graphs for common sizes.)
Full Reference Examples#
The scope-based reference example with benchmarks:
cargo run -p cutile-examples --example cuda_graphs
The DeviceOp-combinator variant:
cargo run -p cutile-examples --example cuda_graphs_deviceop
See also#
Device Operations — where CUDA graphs fit alongside sync and async execution
Host API: CUDA Graph Integration —
.graph_on(stream) and CudaGraph::scope signatures