The cuda-oxide Book

cuda-oxide is an experimental Rust-to-CUDA compiler that lets you write SIMT GPU kernels in safe(ish), idiomatic Rust. It compiles standard Rust code directly to PTX: no DSLs, no foreign language bindings, just Rust.

Note

This book assumes familiarity with the Rust programming language, including ownership, traits, and generics. Later chapters on async GPU programming also assume working knowledge of async/.await and runtimes like tokio.

For a refresher, see The Rust Programming Language, Rust by Example, or the Async Book.

Project Status

The v0.1.0 release is an early-stage alpha: expect bugs, incomplete features, and API breakage as we work to improve it. We hope you’ll try it and help shape its direction by sharing feedback on your experience.

🚀 Quick start

use cuda_device::{kernel, launch_bounds, launch_contract, thread, DisjointSlice};
use cuda_host::cuda_module;
use cuda_core::{CudaContext, DeviceBuffer, LaunchConfig1D};

#[cuda_module]
mod kernels {
    use super::*;

    #[kernel]
    #[launch_bounds(256)]
    #[launch_contract(domain = 1, block = (256, 1, 1))]
    pub fn vecadd(a: &[f32], b: &[f32], mut c: DisjointSlice<f32>) {
        let idx = thread::index_1d();
        let i = idx.get();
        if let Some(c_elem) = c.get_mut(idx) {
            *c_elem = a[i] + b[i];
        }
    }
}

fn main() {
    let ctx = CudaContext::new(0).unwrap();
    let stream = ctx.default_stream();

    // SAFETY: this package owns the embedded device bundle for `kernels`.
    let module = unsafe { kernels::load(&ctx).unwrap() };

    let a = DeviceBuffer::from_host(&stream, &[1.0f32; 1024]).unwrap();
    let b = DeviceBuffer::from_host(&stream, &[2.0f32; 1024]).unwrap();
    let mut c = DeviceBuffer::<f32>::zeroed(&stream, 1024).unwrap();

    let prepared = module
        .prepare_vecadd(LaunchConfig1D::new(1024u32.div_ceil(256), 256, 0))
        .unwrap();
    module
        .vecadd(&stream, &prepared, &a, &b, &mut c)
        .unwrap();

    let result = c.to_host_vec(&stream).unwrap();
    assert_eq!(result[0], 3.0);
}

After installing the prerequisites, build and run the example with cargo oxide run vecadd. cargo oxide new scaffolds the same launch-contract pattern; see Writing Your First Kernel.
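The launch geometry in the quick start (1024 elements, 256 threads per block, a grid of `1024u32.div_ceil(256)` blocks) and the `c.get_mut(idx)` guard are easier to follow with a CPU-only sketch. The code below is not cuda-oxide API: `grid_size` and `vecadd_emulated` are hypothetical helpers written in plain Rust, and the sketch assumes `thread::index_1d()` follows the usual CUDA convention of `block * block_dim + thread`.

```rust
/// Hypothetical helper: number of blocks needed so that
/// `blocks * block_dim >= n` (the same ceiling division as the quick start).
fn grid_size(n: u32, block_dim: u32) -> u32 {
    n.div_ceil(block_dim)
}

/// Hypothetical CPU emulation of a 1D launch. Every (block, thread) pair
/// computes a global index and writes only if that index is in bounds.
/// Assumes `a` and `b` are at least as long as `c`.
fn vecadd_emulated(a: &[f32], b: &[f32], c: &mut [f32], block_dim: u32) {
    let n = c.len() as u32;
    for block in 0..grid_size(n, block_dim) {
        for thread in 0..block_dim {
            // Assumed index convention: block * block_dim + thread.
            let i = (block * block_dim + thread) as usize;
            // Same shape as the kernel's `if let Some(..) = c.get_mut(idx)`:
            // threads in the last block that fall past the end do nothing.
            if let Some(c_elem) = c.get_mut(i) {
                *c_elem = a[i] + b[i];
            }
        }
    }
}
```

With 1000 elements and 256 threads per block, the grid still has 4 blocks (1024 threads), so the last 24 threads fail the bounds check and skip the write. That is why the kernel guards the store even though the quick start's 1024 elements happen to divide evenly.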

Note

#[cuda_module] embeds the generated device artifact into the host binary and generates a typed kernels::load function plus one launch method per kernel. Kernel arguments are type-checked. A declared #[launch_contract] unlocks the safe PreparedLaunch path (prepare_* + typed launch) described in Launching Kernels. A raw LaunchConfig call remains available as an unsafe escape hatch when you need a one-off geometry that the contract does not cover. The lower-level load_kernel_module and unsafe cuda_launch! APIs remain available when you need to load a specific sidecar artifact or build custom launch code.

Why cuda-oxide?

🦀 Rust on the GPU

Write GPU kernels with Rust’s type system and ownership model. Safety is a first-class goal, but GPUs have subtleties — read about the safety model.

💎 A SIMT Compiler

Not a DSL. A custom rustc codegen backend that compiles pure Rust to PTX.

⚡ Async Execution

Compose GPU work as lazy DeviceOperation graphs. Schedule across stream pools. Await results with .await.