Launching Kernels
Writing a kernel is only half the story. The host must load device code,
configure the launch grid, marshal arguments, and dispatch the work to the GPU.
The primary cuda-oxide launch path is #[cuda_module]: it embeds the generated
device artifact into the host binary and generates typed launch methods. The
lower-level load_kernel_module and cuda_launch! APIs remain available when
you need explicit sidecar loading or custom launch code; note cuda_launch!
is unsafe and must be wrapped in unsafe { }.
See also
CUDA Programming Guide – Execution Configuration
for the authoritative reference on <<<grid, block, smem, stream>>> semantics.
The launch lifecycle
Every kernel launch follows the same sequence:
1. Initialize a CUDA context – bind to a GPU device.
2. Load the device module – usually from the embedded artifact bundle.
3. Look up the kernel function – by its PTX entry point name.
4. Configure the grid – block dimensions, grid dimensions, shared memory.
5. Launch – enqueue the kernel on a stream.
6. Synchronize – wait for results (explicit or implicit).
Figure: The kernel launch lifecycle. The host initializes a context, loads the device module, configures the grid, and launches via a typed method. The GPU scheduler dispatches blocks to SMs.
In practice, #[cuda_module] handles steps 2–5 behind a generated Rust API.
You normally interact with context creation, kernels::load, and a typed method
call.
#[cuda_module] – typed launch
Wrap kernels in an inline #[cuda_module] module to generate a typed loader and
one method per #[kernel]. The method is “synchronous” only in the sense that it
enqueues the kernel immediately on the stream you pass in. GPU execution still
overlaps the host until you synchronize.
    use cuda_device::{cuda_module, kernel, thread, DisjointSlice};
    use cuda_core::{CudaContext, DeviceBuffer, LaunchConfig};

    #[cuda_module]
    mod kernels {
        use super::*;

        #[kernel]
        pub fn vecadd(a: &[f32], b: &[f32], mut c: DisjointSlice<f32>) {
            let idx = thread::index_1d();
            let i = idx.get();
            // Only write when this thread's index maps to an element of `c`.
            if let Some(c_elem) = c.get_mut(idx) {
                *c_elem = a[i] + b[i];
            }
        }
    }

    fn main() {
        let ctx = CudaContext::new(0).unwrap();
        let stream = ctx.default_stream();
        // Load the device code that #[cuda_module] embedded in this binary.
        let module = kernels::load(&ctx).unwrap();

        let a = DeviceBuffer::from_host(&stream, &[1.0f32; 1024]).unwrap();
        let b = DeviceBuffer::from_host(&stream, &[2.0f32; 1024]).unwrap();
        let mut c = DeviceBuffer::<f32>::zeroed(&stream, 1024).unwrap();

        // Enqueue the kernel on `stream` through the generated typed method.
        module
            .vecadd(&stream, LaunchConfig::for_num_elems(1024), &a, &b, &mut c)
            .expect("Kernel launch failed");

        let result = c.to_host_vec(&stream).unwrap();
        assert_eq!(result[0], 3.0);
    }
Field-by-field breakdown
| Piece | Description |
|---|---|
| #[cuda_module] | Generates loader and launch methods |
| kernels::load | Loads the embedded artifact bundle |
| module.vecadd | Enqueues a typed kernel launch |
| LaunchConfig | Grid/block dimensions and shared memory |
Argument mapping
The generated method maps kernel parameters to host values:
| Kernel parameter | Host argument | GPU ABI |
|---|---|---|
| | | Pointer + length |
| | | Pointer + length |
| | | Pointer + length |
| scalar/raw pointer | Same value | Value directly |
Return value
Typed launch methods return Result<(), DriverError>. The Ok case means the
kernel was successfully enqueued – not that it finished. To check for
runtime errors (e.g., out-of-bounds trap), synchronize the stream or context
afterward.
cuda_launch! – unsafe lower-level launch
cuda_launch! is the explicit, unsafe escape hatch below #[cuda_module].
Its niche is modules loaded at runtime by name (a sidecar PTX/cubin/LTOIR
artifact you choose manually), where no compile-time kernel signature exists
to check against.
Because the macro cannot verify the argument list, every use must sit inside
an unsafe { } block. The caller promises that argument count, order, and
types match the kernel’s actual signature, and that pointer arguments are
device-accessible. A mismatch is undefined behavior: the driver reads past
the end of the args array, or the device dereferences junk.
    use cuda_host::{cuda_launch, load_kernel_module};

    // Load the sidecar artifact by name at runtime.
    let module = load_kernel_module(&ctx, "vecadd").unwrap();

    // SAFETY: args match vecadd's signature (three slices); a, b, c are live
    // device buffers.
    unsafe {
        cuda_launch! {
            kernel: vecadd,
            stream: stream,
            module: module,
            config: LaunchConfig::for_num_elems(1024),
            args: [slice(a), slice(b), slice_mut(c)]
        }
    }
    .expect("Kernel launch failed");
The wrappers in args produce the same host packet as the generated
#[cuda_module] methods: slice(...) and slice_mut(...) push the
(ptr, len) pair, scalar arguments push their value directly, and a
closure or by-value struct pushes as a single byval value (the kernel
boundary receives it as one .param, not as per-field flattened
parameters).
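To make the packing rule concrete, here is a minimal host-only sketch in plain Rust. It does not use cuda-oxide: `Arg` and `pack` are hypothetical names, and modelling every parameter slot as a `u64` is an assumption made purely for illustration.

```rust
/// Hypothetical model of one launch argument; not a cuda-oxide type.
enum Arg {
    /// A slice contributes two slots: device pointer, then length.
    Slice { ptr: u64, len: u64 },
    /// A scalar contributes its value as a single slot.
    Scalar(u64),
}

/// Flatten arguments into the slot list that a kernel with a matching
/// signature would expect (assumed layout, for illustration only).
fn pack(args: &[Arg]) -> Vec<u64> {
    let mut slots = Vec::new();
    for arg in args {
        match arg {
            Arg::Slice { ptr, len } => {
                slots.push(*ptr);
                slots.push(*len);
            }
            Arg::Scalar(value) => slots.push(*value),
        }
    }
    slots
}
```

Under this model the three slices of the vecadd launch become six slots. An argument list that is one entry short therefore leaves the kernel reading slots that were never written, which is the kind of mismatch the unsafe contract asks the caller to rule out.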
Artifact policy
#[cuda_module] is a launch-surface feature, not a target-selection feature. It
loads the embedded payload that the compiler placed in the host binary. Decisions
such as PTX versus LTOIR, cubin versus fatbin, or single-arch versus multi-arch
packaging live in the compiler and artifact/runtime loader layers. Keeping that
policy separate lets the generated Rust launch methods stay stable as payload
formats evolve.
PTX and cubin embedded payloads are loaded directly. Embedded NVVM IR/LTOIR is compiled or linked to a cubin through the same libNVVM/nvJitLink path used by the lower-level sidecar loader.
LaunchConfig
LaunchConfig specifies the grid shape:
    use cuda_core::LaunchConfig;

    let config = LaunchConfig {
        grid_dim: (num_blocks, 1, 1),
        block_dim: (256, 1, 1),
        shared_mem_bytes: 0,
    };
| Field | Type | Description |
|---|---|---|
| grid_dim | | Number of blocks in x, y, z |
| block_dim | | Threads per block in x, y, z |
| shared_mem_bytes | | Dynamic shared memory per block |
for_num_elems helper
For 1D data-parallel kernels, the common pattern is one thread per element:
    let config = LaunchConfig::for_num_elems(N as u32);
This uses 256 threads per block and computes the grid size via ceiling
division: grid_x = (N + 255) / 256. It’s the right default for most
element-wise operations.
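The ceiling division above can be checked on the host without a GPU. The sketch below is plain Rust, not the cuda-oxide implementation: `blocks_for` and `idle_threads` are hypothetical helper names, and the block size of 256 is taken from the description of for_num_elems.

```rust
/// Threads per block, as described for `for_num_elems`.
const THREADS_PER_BLOCK: u32 = 256;

/// Hypothetical helper: number of blocks needed to cover `n` elements.
/// `div_ceil` gives the same result as `(n + 255) / 256` without the risk
/// of overflow when `n` is close to `u32::MAX`.
fn blocks_for(n: u32) -> u32 {
    n.div_ceil(THREADS_PER_BLOCK)
}

/// Hypothetical helper: threads in the last block that have no element.
/// The arithmetic is widened to u64 so the product cannot overflow.
fn idle_threads(n: u32) -> u32 {
    let launched = u64::from(blocks_for(n)) * u64::from(THREADS_PER_BLOCK);
    (launched - u64::from(n)) as u32
}
```

For 1025 elements this yields 5 blocks with 255 idle threads in the last one, which is why element-wise kernels such as vecadd guard the write with a bounds check. Note that `n = 0` produces zero blocks, and a zero-sized grid is not a valid launch.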
2D and 3D configurations
For matrix operations, use 2D block and grid dimensions:
    let config = LaunchConfig {
        // x covers columns, y covers rows; both round up to whole 16x16 tiles.
        grid_dim: ((cols + 15) / 16, (rows + 15) / 16, 1),
        block_dim: (16, 16, 1),
        shared_mem_bytes: 0,
    };
Inside the kernel, combine threadIdx_x() / blockIdx_x() with their _y()
counterparts to compute row and column indices.
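The index arithmetic for such a 2D launch can be modelled on the host. This is a sketch in plain Rust rather than device code: `grid_2d` and `cell` are hypothetical names, and the block and thread indices are passed in as ordinary tuples instead of being read from device intrinsics.

```rust
/// Tile edge used by the 16x16 block in the configuration above.
const TILE: u32 = 16;

/// Hypothetical helper: grid dimensions covering a rows x cols matrix.
fn grid_2d(rows: u32, cols: u32) -> (u32, u32, u32) {
    (cols.div_ceil(TILE), rows.div_ceil(TILE), 1)
}

/// Hypothetical helper: the (row, col) a thread would handle, or `None`
/// when the thread lies in the padding beyond the matrix edge.
/// Indices are given as (x, y) pairs.
fn cell(
    block_idx: (u32, u32),
    thread_idx: (u32, u32),
    rows: u32,
    cols: u32,
) -> Option<(u32, u32)> {
    let col = block_idx.0 * TILE + thread_idx.0; // x maps to columns
    let row = block_idx.1 * TILE + thread_idx.1; // y maps to rows
    (row < rows && col < cols).then_some((row, col))
}
```

A 100 x 50 matrix needs a 4 x 7 grid under this scheme, and threads whose computed column reaches 50 or beyond must skip their work, just as the 1D kernel skips indices past the end of its output.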
Choosing block size
The block size is the single most important tuning parameter (see the Execution Model chapter for details). Quick guidelines:
- 256 is a safe default for most kernels.
- Multiples of the 32-thread warp size align with warp boundaries; powers of 2 (128, 256, 512) are the usual choices.
- Use #[launch_bounds] to hint the compiler about your intended block size.
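The warp-boundary guideline can be quantified with a small host-side calculation. This is a sketch in plain Rust; `warp_usage` is a hypothetical helper, and the warp size of 32 is the value used by current NVIDIA GPUs.

```rust
/// Threads per warp on current NVIDIA GPUs.
const WARP_SIZE: u32 = 32;

/// Hypothetical helper: returns (warps occupied, idle lanes in the last warp)
/// for a given number of threads per block.
fn warp_usage(block_size: u32) -> (u32, u32) {
    let warps = block_size.div_ceil(WARP_SIZE);
    (warps, warps * WARP_SIZE - block_size)
}
```

A block of 256 threads fills 8 warps exactly, while a block of 100 threads still occupies 4 warps and leaves 28 lanes of the last one idle.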
Typed async launch
With the cuda-host async feature enabled, #[cuda_module] also generates
borrowed and owned async methods. These return lazy DeviceOperation values
instead of enqueuing immediately. No stream is specified at launch time – the
scheduling policy chooses one when the operation is executed:
    use cuda_async::device_context::init_device_contexts;
    use cuda_async::device_operation::DeviceOperation;

    init_device_contexts(0, 1)?;
    let module = kernels::load_async(0)?;

    // Build the lazy operation; nothing is enqueued yet.
    let op = module.vecadd_async(
        LaunchConfig::for_num_elems(1024),
        &a_dev,
        &b_dev,
        &mut c_dev,
    )?;

    // Execute and wait
    op.sync()?;
Use the owned form when the operation must be spawned or stored as a 'static
future:
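    use std::future::IntoFuture;

    // The owned form takes the buffers by value.
    let op = module.vecadd_async_owned(
        LaunchConfig::for_num_elems(1024),
        a_dev,
        b_dev,
        c_dev,
    )?;

    // The buffers are handed back once the spawned task completes.
    let (a_dev, b_dev, c_dev) = tokio::spawn(op.into_future()).await??;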
Async buffer lifetimes
Async launches are lazy, so pointer lifetimes matter:
    raw pointer shape:
        build op from CUdeviceptr
        drop buffer
        run op later -> stale pointer

    borrowed typed shape:
        build op from &DeviceBuffer
        Rust keeps the buffer borrowed until op completes

    owned typed shape:
        move DeviceBox into op
        spawned task owns the allocation until completion
cuda_launch_async! remains as a lower-level migration API, but prefer the
generated borrowed or owned methods for new code. Raw pointer async launches are
only correct when the caller can prove the pointed-to allocation outlives the
lazy operation.
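The borrowed shape can be illustrated without a GPU. The following is a minimal sketch in plain Rust, not the cuda_async implementation: `LazyFill` and `fill_lazily` are hypothetical names, and a `Vec<f32>` stands in for a device buffer.

```rust
/// Hypothetical stand-in for a lazy operation that borrows its output buffer.
struct LazyFill<'a> {
    out: &'a mut [f32],
    value: f32,
}

impl<'a> LazyFill<'a> {
    /// Building the operation does no work; it only records the borrow.
    fn new(out: &'a mut [f32], value: f32) -> Self {
        LazyFill { out, value }
    }

    /// Running consumes the operation, which also ends the borrow.
    fn run(self) {
        for x in self.out.iter_mut() {
            *x = self.value;
        }
    }
}

/// Build the operation, run it later, then hand the buffer back.
fn fill_lazily(len: usize, value: f32) -> Vec<f32> {
    let mut buf = vec![0.0; len];
    let op = LazyFill::new(&mut buf, value);
    // drop(buf); // would not compile: `buf` is still borrowed by `op`
    op.run();
    buf
}
```

Because the operation holds a borrow, the compiler rejects any attempt to free the buffer between building and running it. A raw pointer carries no such borrow, so the same mistake would compile and fail at run time.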
.sync() vs .await
| Method | What it does |
|---|---|
| .sync() | Execute on the default scheduling policy, block the current thread until complete |
| .await | Execute and yield the current async task (requires a Tokio runtime) |
Composing GPU work
DeviceOperation supports functional composition. Chain operations with
and_then and run independent work in parallel with zip!:
    use cuda_async::zip;

    // Sequential chain: each stage consumes the previous stage's output.
    let forward_pass = layer1_op
        .and_then(|output1| layer2_op(output1))
        .and_then(|output2| layer3_op(output2));

    // Run two independent operations concurrently
    let combined = zip!(branch_a, branch_b);
    let (result_a, result_b) = combined.sync()?;
Each operation in the chain is scheduled onto a stream only when it executes.
The and_then combinator passes the output of one operation as input to the
next, forming a lazy computation graph.
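The lazy-chaining idea can be reproduced in a few lines of ordinary Rust. The sketch below is a toy model, not DeviceOperation: `Lazy` is a hypothetical type, it runs closures on the CPU, and its `zip` runs the two halves one after the other rather than concurrently.

```rust
/// Hypothetical miniature of a lazy operation: nothing runs until `sync`.
struct Lazy<T>(Box<dyn FnOnce() -> T>);

impl<T: 'static> Lazy<T> {
    fn new<F>(f: F) -> Self
    where
        F: FnOnce() -> T + 'static,
    {
        Lazy(Box::new(f))
    }

    /// Chain: feed this operation's output into the next one.
    fn and_then<U, F>(self, next: F) -> Lazy<U>
    where
        U: 'static,
        F: FnOnce(T) -> Lazy<U> + 'static,
    {
        Lazy::new(move || next(self.sync()).sync())
    }

    /// Pair two independent operations (sequential in this toy model).
    fn zip<U: 'static>(self, other: Lazy<U>) -> Lazy<(T, U)> {
        Lazy::new(move || (self.sync(), other.sync()))
    }

    /// Execute the whole chain and return its result.
    fn sync(self) -> T {
        (self.0)()
    }
}
```

Calling `and_then` only wraps closures; the work happens when `sync` is finally called on the outermost value. That deferral is what gives a scheduler the chance to pick a stream at execution time.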
See also
The Async GPU Programming
section covers DeviceOperation, scheduling policies, and stream management in
depth.
Cluster launch
Thread Block Clusters (Hopper and newer) allow blocks to cooperate beyond shared
memory via distributed shared memory (DSMEM). To launch with clusters, add
#[cluster_launch] to the kernel and include cluster_dim in the launch:
    use cuda_device::{kernel, cluster, cluster_launch, DisjointSlice};

    #[kernel]
    #[cluster_launch(4, 1, 1)]
    pub fn cluster_kernel(mut out: DisjointSlice<u32>) {
        let rank = cluster::block_rank();
        // Blocks 0-3 can communicate via DSMEM
    }
On the host, the launch uses launch_kernel_ex (the extended launch API) with
cluster dimensions. cuda_launch! supports this via the cluster_dim field:
    // SAFETY: args match cluster_kernel's signature; out_dev is a live buffer.
    unsafe {
        cuda_launch! {
            kernel: cluster_kernel,
            stream: stream,
            module: module,
            config: config,
            cluster_dim: (4, 1, 1),
            args: [slice_mut(out_dev)]
        }
    }
    .expect("Cluster launch failed");
Tip
Cluster launch requires Hopper (sm_90) or newer. The portable maximum cluster
size is 8 blocks; larger clusters (up to 16 on H100) require opting in to
non-portable cluster sizes. Use cargo oxide build --arch sm_90 to target Hopper.
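CUDA also requires the grid dimensions to be a multiple of the cluster dimensions along each axis, a constraint the text above does not spell out. A host-side pre-check might look like the sketch below. It is plain Rust, and `cluster_fits` and `cluster_size` are hypothetical helpers rather than cuda-oxide API.

```rust
/// Hypothetical pre-launch check: every grid dimension must be a non-zero
/// multiple of the matching cluster dimension.
fn cluster_fits(grid: (u32, u32, u32), cluster: (u32, u32, u32)) -> bool {
    let dims = [
        (grid.0, cluster.0),
        (grid.1, cluster.1),
        (grid.2, cluster.2),
    ];
    dims.iter().all(|&(g, c)| c != 0 && g != 0 && g % c == 0)
}

/// Hypothetical helper: number of blocks in one cluster.
fn cluster_size(cluster: (u32, u32, u32)) -> u32 {
    cluster.0 * cluster.1 * cluster.2
}
```

With the (4, 1, 1) cluster from the example, a grid of (8, 1, 1) passes the check and a grid of (10, 1, 1) does not.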
Cooperative launch
Grid-wide barriers (cuda_device::grid::sync() or this_grid().sync()) only
work when every block in the grid is resident on the device at the same time.
A cooperative launch asks the driver to guarantee exactly that; without
it, blocks that have not been scheduled yet can never reach the barrier and
the kernel deadlocks.
On the typed #[cuda_module] path, mark the kernel with
#[cooperative_launch]. Every generated launch method (sync, async, and
owned-async) then submits through cuLaunchKernelEx with the
CU_LAUNCH_ATTRIBUTE_COOPERATIVE attribute set:
    use cuda_device::{cooperative_launch, grid, kernel, DisjointSlice};

    #[cuda_module]
    mod kernels {
        use super::*;

        #[kernel]
        #[cooperative_launch]
        pub fn grid_sync_kernel(mut out: DisjointSlice<u32>) {
            // ... per-block work ...
            grid::sync();
            // ... grid-wide post-barrier work ...
        }
    }

    let module = kernels::load(&ctx)?;
    module.grid_sync_kernel(&stream, config, &mut out_dev)?; // cooperative launch
Unlike #[cluster_launch], the attribute changes nothing in the PTX; it only
changes how the host submits the launch. The two attributes may be combined
on one kernel, in which case both launch attributes go into the same
cuLaunchKernelEx call.
The lower-level, caller-unsafe cuda_launch! macro offers the same behaviour
through its cooperative: true field.
Tip
The whole grid must fit on the device in a single wave, or the driver rejects
the launch with CUDA_ERROR_COOPERATIVE_LAUNCH_TOO_LARGE. Size the grid from
cuOccupancyMaxActiveBlocksPerMultiprocessor (blocks per SM × SM count) when
in doubt.
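The sizing rule in the tip reduces to a multiplication and a clamp. The sketch below is plain Rust with hypothetical helper names. In real code `blocks_per_sm` would come from the occupancy query named above and `sm_count` from the device properties; here both are plain inputs.

```rust
/// Hypothetical helper: the largest grid that can be resident at once.
fn max_cooperative_blocks(blocks_per_sm: u32, sm_count: u32) -> u32 {
    blocks_per_sm.saturating_mul(sm_count)
}

/// Hypothetical helper: clamp a desired grid so it fits in a single wave.
/// Returns `None` when no block can be resident or no work was requested.
fn cooperative_grid(wanted: u32, blocks_per_sm: u32, sm_count: u32) -> Option<u32> {
    let cap = max_cooperative_blocks(blocks_per_sm, sm_count);
    (cap > 0 && wanted > 0).then(|| wanted.min(cap))
}
```

A clamped grid has fewer blocks than the problem originally asked for, so the kernel would typically need a grid-stride loop to cover the remaining elements.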
Common launch errors
| Error | Likely cause | Fix |
|---|---|---|
| | Grid or block dimensions are zero or exceed limits | Check |
| | PTX entry point name doesn’t match | Verify |
| | Too much shared memory or too many registers per block | Reduce |
| | Kernel hit a trap (panic, assert failure, OOB) | Debug with |
| | PTX compiled for wrong architecture | Rebuild with |
See also
The Error Handling and Debugging chapter covers how to diagnose and fix kernel failures in detail.