Launching Kernels#
Writing a kernel is only half the story. The host must load device code,
configure the launch grid, marshal arguments, and dispatch the work to the GPU.
The primary cuda-oxide launch path is #[cuda_module]: it embeds the generated
device artifact into the host binary and generates typed launch methods. The
lower-level load_kernel_module and cuda_launch! APIs remain available when
you need explicit sidecar loading or custom launch code.
See also
CUDA Programming Guide – Execution Configuration
for the authoritative reference on <<<grid, block, smem, stream>>> semantics.
The launch lifecycle#
Every kernel launch follows the same sequence:
Initialize a CUDA context – bind to a GPU device.
Load the device module – usually from the embedded artifact bundle.
Look up the kernel function – by its PTX entry point name.
Configure the grid – block dimensions, grid dimensions, shared memory.
Launch – enqueue the kernel on a stream.
Synchronize – wait for results (explicit or implicit).
The kernel launch lifecycle. The host initializes a context, loads the device module, configures the grid, and launches via a typed method. The GPU scheduler dispatches blocks to SMs.#
In practice, #[cuda_module] handles steps 2–5 behind a generated Rust API.
You normally interact with context creation, kernels::load, and a typed method
call.
#[cuda_module] – typed launch#
Wrap kernels in an inline #[cuda_module] module to generate a typed loader and
one method per #[kernel]. The method is “synchronous” in the CUDA sense: you
provide a specific stream and the kernel is enqueued immediately, though GPU
execution still overlaps the host until you synchronize.
use cuda_device::{cuda_module, kernel, thread, DisjointSlice};
use cuda_core::{CudaContext, DeviceBuffer, LaunchConfig};
#[cuda_module]
mod kernels {
use super::*;
#[kernel]
pub fn vecadd(a: &[f32], b: &[f32], mut c: DisjointSlice<f32>) {
let idx = thread::index_1d();
let i = idx.get();
if let Some(c_elem) = c.get_mut(idx) {
*c_elem = a[i] + b[i];
}
}
}
fn main() {
let ctx = CudaContext::new(0).unwrap();
let stream = ctx.default_stream();
let module = kernels::load(&ctx).unwrap();
let a = DeviceBuffer::from_host(&stream, &[1.0f32; 1024]).unwrap();
let b = DeviceBuffer::from_host(&stream, &[2.0f32; 1024]).unwrap();
let mut c = DeviceBuffer::<f32>::zeroed(&stream, 1024).unwrap();
module
.vecadd(&stream, LaunchConfig::for_num_elems(1024), &a, &b, &mut c)
.expect("Kernel launch failed");
let result = c.to_host_vec(&stream).unwrap();
assert_eq!(result[0], 3.0);
}
Field-by-field breakdown#
Piece |
Description |
|---|---|
|
Generates loader and launch methods |
|
Loads the embedded artifact bundle |
|
Enqueues a typed kernel launch |
|
Grid/block dimensions and smem |
Argument mapping#
The generated method maps kernel parameters to host values:
Kernel parameter |
Host argument |
GPU ABI |
|---|---|---|
|
|
Pointer + length |
|
|
Pointer + length |
|
|
Pointer + length |
scalar/raw pointer |
Same value |
Value directly |
Return value#
Typed launch methods return Result<(), DriverError>. The Ok case means the
kernel was successfully enqueued – not that it finished. To check for
runtime errors (e.g., out-of-bounds trap), synchronize the stream or context
afterward.
cuda_launch! – lower-level launch#
cuda_launch! is the explicit launch API used by older code and by examples
that intentionally load a specific module. It remains useful when you need to
choose a sidecar PTX/cubin/LTOIR artifact manually.
use cuda_host::{cuda_launch, load_kernel_module};
let module = load_kernel_module(&ctx, "vecadd").unwrap();
cuda_launch! {
kernel: vecadd,
stream: stream,
module: module,
config: LaunchConfig::for_num_elems(1024),
args: [slice(a), slice(b), slice_mut(c)]
}
.expect("Kernel launch failed");
The wrappers in args produce the same host packet as the generated
#[cuda_module] methods: slice(...) and slice_mut(...) push the
(ptr, len) pair, scalar arguments push their value directly, and a
closure or by-value struct pushes as a single byval value (the kernel
boundary receives it as one .param, not as per-field flattened
parameters).
Artifact policy#
#[cuda_module] is a launch-surface feature, not a target-selection feature. It
loads the embedded payload that the compiler placed in the host binary. Decisions
such as PTX versus LTOIR, cubin versus fatbin, or single-arch versus multi-arch
packaging live in the compiler and artifact/runtime loader layers. Keeping that
policy separate lets the generated Rust launch methods stay stable as payload
formats evolve.
PTX and cubin embedded payloads are loaded directly. Embedded NVVM IR/LTOIR is compiled or linked to a cubin through the same libNVVM/nvJitLink path used by the lower-level sidecar loader.
LaunchConfig#
LaunchConfig specifies the grid shape:
use cuda_core::LaunchConfig;
let config = LaunchConfig {
grid_dim: (num_blocks, 1, 1),
block_dim: (256, 1, 1),
shared_mem_bytes: 0,
};
Field |
Type |
Description |
|---|---|---|
|
|
Number of blocks in x, y, z |
|
|
Threads per block in x, y, z |
|
|
Dynamic shared memory per block |
for_num_elems helper#
For 1D data-parallel kernels, the common pattern is one thread per element:
let config = LaunchConfig::for_num_elems(N as u32);
This uses 256 threads per block and computes the grid size via ceiling
division: grid_x = (N + 255) / 256. It’s the right default for most
element-wise operations.
2D and 3D configurations#
For matrix operations, use 2D block and grid dimensions:
let config = LaunchConfig {
grid_dim: ((cols + 15) / 16, (rows + 15) / 16, 1),
block_dim: (16, 16, 1),
shared_mem_bytes: 0,
};
Inside the kernel, combine threadIdx_x() / blockIdx_x() with their _y()
counterparts to compute row and column indices.
Choosing block size#
The block size is the single most important tuning parameter (see the Execution Model chapter for details). Quick guidelines:
256 is a safe default for most kernels.
Powers of 2 (128, 256, 512) align with warp boundaries.
Use
#[launch_bounds]to hint the compiler about your intended block size.
Typed async launch#
With the cuda-host async feature enabled, #[cuda_module] also generates
borrowed and owned async methods. These return lazy DeviceOperation values
instead of enqueuing immediately. No stream is specified at launch time – the
scheduling policy chooses one when the operation is executed:
use cuda_async::device_context::init_device_contexts;
use cuda_async::device_operation::DeviceOperation;
init_device_contexts(0, 1)?;
let module = kernels::load_async(0)?;
let op = module.vecadd_async(
LaunchConfig::for_num_elems(1024),
&a_dev,
&b_dev,
&mut c_dev,
)?;
// Execute and wait
op.sync()?;
Use the owned form when the operation must be spawned or stored as a 'static
future:
use std::future::IntoFuture;
let op = module.vecadd_async_owned(
LaunchConfig::for_num_elems(1024),
a_dev,
b_dev,
c_dev,
)?;
let (a_dev, b_dev, c_dev) = tokio::spawn(op.into_future()).await??;
Async buffer lifetimes#
Async launches are lazy, so pointer lifetimes matter:
raw pointer shape:
build op from CUdeviceptr
drop buffer
run op later -> stale pointer
borrowed typed shape:
build op from &DeviceBuffer
Rust keeps the buffer borrowed until op completes
owned typed shape:
move DeviceBox into op
spawned task owns the allocation until completion
cuda_launch_async! remains as a lower-level migration API, but prefer the
generated borrowed or owned methods for new code. Raw pointer async launches are
only correct when the caller can prove the pointed-to allocation outlives the
lazy operation.
.sync() vs .await#
Method |
What it does |
|---|---|
|
Execute on the default scheduling policy, block the current thread until complete |
|
Execute and yield the current async task (requires a Tokio runtime) |
Composing GPU work#
DeviceOperation supports functional composition. Chain operations with
and_then and run independent work in parallel with zip!:
use cuda_async::zip;
let forward_pass = layer1_op
.and_then(|output1| layer2_op(output1))
.and_then(|output2| layer3_op(output2));
// Run two independent operations concurrently
let combined = zip!(branch_a, branch_b);
let (result_a, result_b) = combined.sync()?;
Each operation in the chain is scheduled onto a stream only when it executes.
The and_then combinator passes the output of one operation as input to the
next, forming a lazy computation graph.
See also
The Async GPU Programming
section covers DeviceOperation, scheduling policies, and stream management in
depth.
Cluster launch#
Thread Block Clusters (Hopper and newer) allow blocks to cooperate beyond shared
memory via distributed shared memory (DSMEM). To launch with clusters, add
#[cluster_launch] to the kernel and include cluster_dim in the launch:
use cuda_device::{kernel, cluster, cluster_launch, DisjointSlice};
#[kernel]
#[cluster_launch(4, 1, 1)]
pub fn cluster_kernel(mut out: DisjointSlice<u32>) {
let rank = cluster::block_rank();
// Blocks 0-3 can communicate via DSMEM
}
On the host, the launch uses launch_kernel_ex (the extended launch API) with
cluster dimensions. cuda_launch! supports this via the cluster_dim field:
cuda_launch! {
kernel: cluster_kernel,
stream: stream,
module: module,
config: config,
cluster_dim: (4, 1, 1),
args: [slice_mut(out_dev)]
}
.expect("Cluster launch failed");
Tip
Cluster launch requires Hopper (sm_90) or newer. The maximum cluster size is
typically 16 blocks. Use cargo oxide build --arch sm_90 to target Hopper.
Common launch errors#
Error |
Likely cause |
Fix |
|---|---|---|
|
Grid or block dimensions are zero or exceed limits |
Check |
|
PTX entry point name doesn’t match |
Verify |
|
Too much shared memory or too many registers per block |
Reduce |
|
Kernel hit a trap (panic, assert failure, OOB) |
Debug with |
|
PTX compiled for wrong architecture |
Rebuild with |
See also
The Error Handling and Debugging chapter covers how to diagnose and fix kernel failures in detail.