Launching Kernels#
Writing a kernel is only half the story. The host must load the compiled PTX,
configure the launch grid, marshal arguments, and dispatch the work to the GPU.
cuda-oxide provides two launch paths: cuda_launch! for straightforward
synchronous enqueuing, and cuda_launch_async! for composable, lazy execution
graphs.
See also
CUDA Programming Guide – Execution Configuration
for the authoritative reference on <<<grid, block, smem, stream>>> semantics.
The launch lifecycle#
Every kernel launch follows the same sequence:
Initialize a CUDA context – bind to a GPU device.
Load the PTX module – the compiled device code produced by
cargo oxide build.Look up the kernel function – by its PTX entry point name.
Configure the grid – block dimensions, grid dimensions, shared memory.
Launch – enqueue the kernel on a stream.
Synchronize – wait for results (explicit or implicit).
The kernel launch lifecycle. The host initializes a context, loads the PTX module, configures the grid, and launches via cuda_launch! (which handles steps 3-5 in one call). The GPU scheduler dispatches blocks to SMs.#
In practice, cuda_launch! and cuda_launch_async! handle steps 3–5 in a
single macro invocation. You typically only interact with context creation and
module loading directly.
cuda_launch! – synchronous launch#
The cuda_launch! macro is the standard way to launch a kernel. It is
“synchronous” in the sense that you provide a specific stream and the kernel is
enqueued immediately (though execution on the GPU is asynchronous relative to the
host):
use cuda_device::{kernel, thread, DisjointSlice};
use cuda_core::{CudaContext, DeviceBuffer, LaunchConfig};
use cuda_host::{cuda_launch, load_kernel_module};
#[kernel]
pub fn vecadd(a: &[f32], b: &[f32], mut c: DisjointSlice<f32>) {
let idx = thread::index_1d();
if let Some(c_elem) = c.get_mut(idx) {
*c_elem = a[idx.get()] + b[idx.get()];
}
}
fn main() {
let ctx = CudaContext::new(0).unwrap();
let stream = ctx.default_stream();
let module = load_kernel_module(&ctx, "vecadd").unwrap();
let a = DeviceBuffer::from_host(&stream, &[1.0f32; 1024]).unwrap();
let b = DeviceBuffer::from_host(&stream, &[2.0f32; 1024]).unwrap();
let mut c = DeviceBuffer::<f32>::zeroed(&stream, 1024).unwrap();
cuda_launch! {
kernel: vecadd,
stream: stream,
module: module,
config: LaunchConfig::for_num_elems(1024),
args: [slice(a), slice(b), slice_mut(c)]
}
.expect("Kernel launch failed");
let result = c.to_host_vec(&stream).unwrap();
assert_eq!(result[0], 3.0);
}
Field-by-field breakdown#
Field |
Type |
Description |
|---|---|---|
|
Path |
Kernel name ( |
|
|
The stream to enqueue on |
|
|
Loaded PTX module containing the kernel |
|
|
Grid/block dimensions and shared memory |
|
|
Kernel arguments (see below) |
Argument wrappers#
The args list uses wrapper functions to handle
scalarization:
Syntax |
Kernel parameter |
What’s passed to the GPU |
|---|---|---|
|
|
Pointer + length |
|
|
Pointer + length |
|
|
Value directly |
|
Closure |
Captures individually |
Return value#
cuda_launch! returns Result<(), DriverError>. The Ok case means the
kernel was successfully enqueued – not that it finished. To check for
runtime errors (e.g., out-of-bounds trap), synchronize the stream or context
afterward.
LaunchConfig#
LaunchConfig specifies the grid shape:
use cuda_core::LaunchConfig;
let config = LaunchConfig {
grid_dim: (num_blocks, 1, 1),
block_dim: (256, 1, 1),
shared_mem_bytes: 0,
};
Field |
Type |
Description |
|---|---|---|
|
|
Number of blocks in x, y, z |
|
|
Threads per block in x, y, z |
|
|
Dynamic shared memory per block |
for_num_elems helper#
For 1D data-parallel kernels, the common pattern is one thread per element:
let config = LaunchConfig::for_num_elems(N as u32);
This uses 256 threads per block and computes the grid size via ceiling
division: grid_x = (N + 255) / 256. It’s the right default for most
element-wise operations.
2D and 3D configurations#
For matrix operations, use 2D block and grid dimensions:
let config = LaunchConfig {
grid_dim: ((cols + 15) / 16, (rows + 15) / 16, 1),
block_dim: (16, 16, 1),
shared_mem_bytes: 0,
};
Inside the kernel, combine threadIdx_x() / blockIdx_x() with their _y()
counterparts to compute row and column indices.
Choosing block size#
The block size is the single most important tuning parameter (see the Execution Model chapter for details). Quick guidelines:
256 is a safe default for most kernels.
Powers of 2 (128, 256, 512) align with warp boundaries.
Use
#[launch_bounds]to hint the compiler about your intended block size.
cuda_launch_async! – composable async launch#
The async launch macro returns a DeviceOperation instead of enqueuing
immediately. No stream is specified at launch time – the scheduling policy
chooses one when the operation is executed:
use cuda_async::device_context::init_device_contexts;
use cuda_async::device_operation::DeviceOperation;
use cuda_host::cuda_launch_async;
init_device_contexts(0, 1)?;
let op = cuda_launch_async! {
kernel: vecadd,
module: module,
config: LaunchConfig::for_num_elems(1024),
args: [slice(a_dev), slice(b_dev), slice_mut(c_dev)]
};
// Execute and wait
op.sync()?;
push_arg and push_args#
Under the hood, cuda_launch_async! builds an AsyncKernelLaunch and calls
push_arg for each argument. You can also build launches manually with
push_args for tuple-based argument passing:
let mut launch = AsyncKernelLaunch::new(func, config);
launch.push_args((ptr_a, len_a, ptr_b, len_b, ptr_c, len_c));
push_args accepts tuples of up to 32 elements, where each element implements
KernelArgument. Scalar types (u32, f32, u64, etc.) are boxed
automatically.
.sync() vs .await#
Method |
What it does |
|---|---|
|
Execute on the default scheduling policy, block the current thread until complete |
|
Execute and yield the current async task (requires a Tokio runtime) |
Composing GPU work#
DeviceOperation supports functional composition. Chain operations with
and_then and run independent work in parallel with zip!:
use cuda_async::zip;
let forward_pass = layer1_op
.and_then(|output1| layer2_op(output1))
.and_then(|output2| layer3_op(output2));
// Run two independent operations concurrently
let combined = zip!(branch_a, branch_b);
let (result_a, result_b) = combined.sync()?;
Each operation in the chain is scheduled onto a stream only when it executes.
The and_then combinator passes the output of one operation as input to the
next, forming a lazy computation graph.
See also
The Async GPU Programming
section covers DeviceOperation, scheduling policies, and stream management in
depth.
Cluster launch#
Thread Block Clusters (Hopper and newer) allow blocks to cooperate beyond shared
memory via distributed shared memory (DSMEM). To launch with clusters, add
#[cluster_launch] to the kernel and include cluster_dim in the launch:
use cuda_device::{kernel, cluster, cluster_launch, DisjointSlice};
#[kernel]
#[cluster_launch(4, 1, 1)]
pub fn cluster_kernel(mut out: DisjointSlice<u32>) {
let rank = cluster::block_rank();
// Blocks 0-3 can communicate via DSMEM
}
On the host, the launch uses launch_kernel_ex (the extended launch API) with
cluster dimensions. cuda_launch! supports this via the cluster_dim field:
cuda_launch! {
kernel: cluster_kernel,
stream: stream,
module: module,
config: config,
cluster_dim: (4, 1, 1),
args: [slice_mut(out_dev)]
}
.expect("Cluster launch failed");
Tip
Cluster launch requires Hopper (sm_90) or newer. The maximum cluster size is
typically 16 blocks. Use cargo oxide build --arch sm_90 to target Hopper.
Common launch errors#
Error |
Likely cause |
Fix |
|---|---|---|
|
Grid or block dimensions are zero or exceed limits |
Check |
|
PTX entry point name doesn’t match |
Verify |
|
Too much shared memory or too many registers per block |
Reduce |
|
Kernel hit a trap (panic, assert failure, OOB) |
Debug with |
|
PTX compiled for wrong architecture |
Rebuild with |
See also
The Error Handling and Debugging chapter covers how to diagnose and fix kernel failures in detail.