Launching Kernels#

Writing a kernel is only half the story. The host must load device code, configure the launch grid, marshal arguments, and dispatch the work to the GPU. The primary cuda-oxide launch path is #[cuda_module]: it embeds the generated device artifact into the host binary and generates typed launch methods. The lower-level load_kernel_module and cuda_launch! APIs remain available when you need explicit sidecar loading or custom launch code.

See also

CUDA Programming Guide – Execution Configuration for the authoritative reference on <<<grid, block, smem, stream>>> semantics.

The launch lifecycle#

Every kernel launch follows the same sequence:

  1. Initialize a CUDA context – bind to a GPU device.

  2. Load the device module – usually from the embedded artifact bundle.

  3. Look up the kernel function – by its PTX entry point name.

  4. Configure the grid – block dimensions, grid dimensions, shared memory.

  5. Launch – enqueue the kernel on a stream.

  6. Synchronize – wait for results (explicit or implicit).

../_images/launch-lifecycle.svg

The kernel launch lifecycle. The host initializes a context, loads the device module, configures the grid, and launches via a typed method. The GPU scheduler dispatches blocks to SMs.#

In practice, #[cuda_module] handles steps 2–5 behind a generated Rust API. You normally interact with context creation, kernels::load, and a typed method call.

#[cuda_module] – typed launch#

Wrap kernels in an inline #[cuda_module] module to generate a typed loader and one method per #[kernel]. The method is “synchronous” in the CUDA sense: you provide a specific stream and the kernel is enqueued immediately, though GPU execution still overlaps the host until you synchronize.

use cuda_device::{cuda_module, kernel, thread, DisjointSlice};
use cuda_core::{CudaContext, DeviceBuffer, LaunchConfig};

#[cuda_module]
mod kernels {
    use super::*;

    #[kernel]
    pub fn vecadd(a: &[f32], b: &[f32], mut c: DisjointSlice<f32>) {
        let idx = thread::index_1d();
        let i = idx.get();
        if let Some(c_elem) = c.get_mut(idx) {
            *c_elem = a[i] + b[i];
        }
    }
}

fn main() {
    let ctx = CudaContext::new(0).unwrap();
    let stream = ctx.default_stream();
    let module = kernels::load(&ctx).unwrap();

    let a = DeviceBuffer::from_host(&stream, &[1.0f32; 1024]).unwrap();
    let b = DeviceBuffer::from_host(&stream, &[2.0f32; 1024]).unwrap();
    let mut c = DeviceBuffer::<f32>::zeroed(&stream, 1024).unwrap();

    module
        .vecadd(&stream, LaunchConfig::for_num_elems(1024), &a, &b, &mut c)
        .expect("Kernel launch failed");

    let result = c.to_host_vec(&stream).unwrap();
    assert_eq!(result[0], 3.0);
}

Field-by-field breakdown#

Piece

Description

#[cuda_module]

Generates loader and launch methods

kernels::load(&ctx)

Loads the embedded artifact bundle

module.vecadd(...)

Enqueues a typed kernel launch

LaunchConfig

Grid/block dimensions and smem

Argument mapping#

The generated method maps kernel parameters to host values:

Kernel parameter

Host argument

GPU ABI

&[T]

&DeviceBuffer<T>

Pointer + length

&mut [T]

&mut DeviceBuffer<T>

Pointer + length

DisjointSlice<T>

&mut DeviceBuffer<T>

Pointer + length

scalar/raw pointer

Same value

Value directly

Return value#

Typed launch methods return Result<(), DriverError>. The Ok case means the kernel was successfully enqueued – not that it finished. To check for runtime errors (e.g., out-of-bounds trap), synchronize the stream or context afterward.

cuda_launch! – lower-level launch#

cuda_launch! is the explicit launch API used by older code and by examples that intentionally load a specific module. It remains useful when you need to choose a sidecar PTX/cubin/LTOIR artifact manually.

use cuda_host::{cuda_launch, load_kernel_module};

let module = load_kernel_module(&ctx, "vecadd").unwrap();

cuda_launch! {
    kernel: vecadd,
    stream: stream,
    module: module,
    config: LaunchConfig::for_num_elems(1024),
    args: [slice(a), slice(b), slice_mut(c)]
}
.expect("Kernel launch failed");

The wrappers in args produce the same host packet as the generated #[cuda_module] methods: slice(...) and slice_mut(...) push the (ptr, len) pair, scalar arguments push their value directly, and a closure or by-value struct pushes as a single byval value (the kernel boundary receives it as one .param, not as per-field flattened parameters).

Artifact policy#

#[cuda_module] is a launch-surface feature, not a target-selection feature. It loads the embedded payload that the compiler placed in the host binary. Decisions such as PTX versus LTOIR, cubin versus fatbin, or single-arch versus multi-arch packaging live in the compiler and artifact/runtime loader layers. Keeping that policy separate lets the generated Rust launch methods stay stable as payload formats evolve.

PTX and cubin embedded payloads are loaded directly. Embedded NVVM IR/LTOIR is compiled or linked to a cubin through the same libNVVM/nvJitLink path used by the lower-level sidecar loader.

LaunchConfig#

LaunchConfig specifies the grid shape:

use cuda_core::LaunchConfig;

let config = LaunchConfig {
    grid_dim: (num_blocks, 1, 1),
    block_dim: (256, 1, 1),
    shared_mem_bytes: 0,
};

Field

Type

Description

grid_dim

(u32, u32, u32)

Number of blocks in x, y, z

block_dim

(u32, u32, u32)

Threads per block in x, y, z

shared_mem_bytes

u32

Dynamic shared memory per block

for_num_elems helper#

For 1D data-parallel kernels, the common pattern is one thread per element:

let config = LaunchConfig::for_num_elems(N as u32);

This uses 256 threads per block and computes the grid size via ceiling division: grid_x = (N + 255) / 256. It’s the right default for most element-wise operations.

2D and 3D configurations#

For matrix operations, use 2D block and grid dimensions:

let config = LaunchConfig {
    grid_dim: ((cols + 15) / 16, (rows + 15) / 16, 1),
    block_dim: (16, 16, 1),
    shared_mem_bytes: 0,
};

Inside the kernel, combine threadIdx_x() / blockIdx_x() with their _y() counterparts to compute row and column indices.

Choosing block size#

The block size is the single most important tuning parameter (see the Execution Model chapter for details). Quick guidelines:

  • 256 is a safe default for most kernels.

  • Powers of 2 (128, 256, 512) align with warp boundaries.

  • Use #[launch_bounds] to hint the compiler about your intended block size.

Typed async launch#

With the cuda-host async feature enabled, #[cuda_module] also generates borrowed and owned async methods. These return lazy DeviceOperation values instead of enqueuing immediately. No stream is specified at launch time – the scheduling policy chooses one when the operation is executed:

use cuda_async::device_context::init_device_contexts;
use cuda_async::device_operation::DeviceOperation;

init_device_contexts(0, 1)?;
let module = kernels::load_async(0)?;

let op = module.vecadd_async(
    LaunchConfig::for_num_elems(1024),
    &a_dev,
    &b_dev,
    &mut c_dev,
)?;

// Execute and wait
op.sync()?;

Use the owned form when the operation must be spawned or stored as a 'static future:

use std::future::IntoFuture;

let op = module.vecadd_async_owned(
    LaunchConfig::for_num_elems(1024),
    a_dev,
    b_dev,
    c_dev,
)?;

let (a_dev, b_dev, c_dev) = tokio::spawn(op.into_future()).await??;

Async buffer lifetimes#

Async launches are lazy, so pointer lifetimes matter:

raw pointer shape:
  build op from CUdeviceptr
  drop buffer
  run op later  -> stale pointer

borrowed typed shape:
  build op from &DeviceBuffer
  Rust keeps the buffer borrowed until op completes

owned typed shape:
  move DeviceBox into op
  spawned task owns the allocation until completion

cuda_launch_async! remains as a lower-level migration API, but prefer the generated borrowed or owned methods for new code. Raw pointer async launches are only correct when the caller can prove the pointed-to allocation outlives the lazy operation.

.sync() vs .await#

Method

What it does

.sync()

Execute on the default scheduling policy, block the current thread until complete

.await

Execute and yield the current async task (requires a Tokio runtime)

Composing GPU work#

DeviceOperation supports functional composition. Chain operations with and_then and run independent work in parallel with zip!:

use cuda_async::zip;

let forward_pass = layer1_op
    .and_then(|output1| layer2_op(output1))
    .and_then(|output2| layer3_op(output2));

// Run two independent operations concurrently
let combined = zip!(branch_a, branch_b);
let (result_a, result_b) = combined.sync()?;

Each operation in the chain is scheduled onto a stream only when it executes. The and_then combinator passes the output of one operation as input to the next, forming a lazy computation graph.

See also

The Async GPU Programming section covers DeviceOperation, scheduling policies, and stream management in depth.

Cluster launch#

Thread Block Clusters (Hopper and newer) allow blocks to cooperate beyond shared memory via distributed shared memory (DSMEM). To launch with clusters, add #[cluster_launch] to the kernel and include cluster_dim in the launch:

use cuda_device::{kernel, cluster, cluster_launch, DisjointSlice};

#[kernel]
#[cluster_launch(4, 1, 1)]
pub fn cluster_kernel(mut out: DisjointSlice<u32>) {
    let rank = cluster::block_rank();
    // Blocks 0-3 can communicate via DSMEM
}

On the host, the launch uses launch_kernel_ex (the extended launch API) with cluster dimensions. cuda_launch! supports this via the cluster_dim field:

cuda_launch! {
    kernel: cluster_kernel,
    stream: stream,
    module: module,
    config: config,
    cluster_dim: (4, 1, 1),
    args: [slice_mut(out_dev)]
}
.expect("Cluster launch failed");

Tip

Cluster launch requires Hopper (sm_90) or newer. The maximum cluster size is typically 16 blocks. Use cargo oxide build --arch sm_90 to target Hopper.

Common launch errors#

Error

Likely cause

Fix

CUDA_ERROR_INVALID_VALUE

Grid or block dimensions are zero or exceed limits

Check LaunchConfig values; max block is 1024 threads

CUDA_ERROR_NOT_FOUND

PTX entry point name doesn’t match

Verify #[kernel] name matches the loaded module

CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES

Too much shared memory or too many registers per block

Reduce shared_mem_bytes or block size; use #[launch_bounds]

CUDA_ERROR_ILLEGAL_INSTRUCTION

Kernel hit a trap (panic, assert failure, OOB)

Debug with cargo oxide debug or gpu_printf!

CUDA_ERROR_NO_BINARY_FOR_GPU

PTX compiled for wrong architecture

Rebuild with --arch matching your GPU

See also

The Error Handling and Debugging chapter covers how to diagnose and fix kernel failures in detail.