The DeviceOperation Model#

In the Writing GPU Programs chapter, you saw two ways to launch kernels: cuda_launch! enqueues work on an explicit stream, while cuda_launch_async! returns a lazy handle that defers stream selection. This chapter digs into the abstraction behind that lazy handle – the DeviceOperation trait – and explains why decoupling what the GPU should do from which stream it runs on is the foundation of composable async GPU programming in cuda-oxide.

See also

CUDA Programming Guide – Asynchronous Concurrent Execution for the underlying CUDA stream and event model that DeviceOperation builds on.

Why lazy operations?#

In CUDA C++, you build concurrency by creating multiple cudaStream_t handles and placing kernel launches and memory copies onto them explicitly. The programmer decides at every call site which stream to use. This couples the definition of GPU work to the scheduling decision, making it hard to compose and rearrange work after the fact.

cuda-oxide takes a different approach. A DeviceOperation describes GPU work without binding to any stream. You can compose operations with combinators (and_then, zip!), pass them across function boundaries, store them in collections, and only decide how to schedule them at the last moment. This is the same idea behind Rust’s Iterator – build the pipeline lazily, execute it eagerly at the call site.

Approach

When is the stream chosen?

Composable?

cuda_launch!

At the call site (you pass a stream)

No – work is enqueued immediately

cuda_launch_async!

At execution time (scheduling policy picks)

Yes – returns a lazy DeviceOperation you can chain and combine

../_images/device-operation-lifecycle.svg

The DeviceOperation lifecycle. Phase 1: cuda_launch_async! builds a lazy recipe (no GPU work). Phase 2: the scheduling policy picks a stream from its pool. Phase 3: execute() submits GPU work and a cuLaunchHostFunc callback. Phase 4: the callback fires, wakes the async runtime, and delivers the result. Bottom: the four execution methods from simplest (.sync()) to most manual (async_on).#

Recipes and kitchens#

Think of a DeviceOperation as a recipe card. The card describes every step of the dish – what ingredients to combine, at what temperature, for how long – but it does not say which kitchen will cook it. You can hand the card to any kitchen, photocopy it, staple two cards together into a multi-course meal, or file it away for later. The dish only starts cooking when someone walks into a kitchen and begins following the instructions.

In cuda-oxide’s model:

  • A recipe is a DeviceOperation – a lazy description of GPU work.

  • A kitchen is a CUDA stream – the in-order queue where work actually runs.

  • The head chef is a SchedulingPolicy – the logic that decides which kitchen handles each recipe.

  • The meal is the Output – the result you get when everything is done.

This separation is what makes the system composable. You can write a function that returns a recipe for “upload data, run GEMM, apply ReLU” without caring which stream will execute it. The caller can chain more steps onto the recipe, run it on a specific stream, or hand it to the scheduling policy and walk away.

Your first async launch#

The simplest way to create a DeviceOperation is the cuda_launch_async! macro. It looks almost identical to cuda_launch!, but without the stream: field – and it returns a recipe instead of cooking immediately:

use cuda_async::device_context::init_device_contexts;
use cuda_host::cuda_launch_async;
use cuda_core::LaunchConfig;

// One-time setup: create a stream pool for scheduling
init_device_contexts(0, 1)?;

// Build the recipe (no GPU work yet)
let op = cuda_launch_async! {
    kernel: vecadd,
    module: module,
    config: LaunchConfig::for_num_elems(1024),
    args: [slice(a_dev), slice(b_dev), slice_mut(c_dev)]
};

// Now cook it: pick a stream, launch, wait for the result
op.sync()?;

At the point where op is created, nothing has happened on the GPU. The macro builds an AsyncKernelLaunch value that remembers which function to call, what arguments to pass, and how to configure the grid – but it does not touch any stream. It is a recipe card sitting on the counter.

When you call .sync(), the scheduling policy picks a stream from its pool, submits the kernel, and blocks until the stream is idle. That single line is where the recipe becomes a cooked meal.

What makes a DeviceOperation#

Behind the scenes, DeviceOperation is a trait. Any type that describes GPU work can implement it. The trait has one required method and one associated type:

pub trait DeviceOperation: Send + Sized + IntoFuture {
    type Output: Send;

    unsafe fn execute(
        self,
        context: &ExecutionContext,
    ) -> Result<Self::Output, DeviceError>;
}

Output is the Rust value the operation produces when it finishes. For a kernel launch this is () – the kernel runs for its side effects on device memory. For a device-to-host copy it might be Vec<f32>. For a memory allocation it could be a DeviceBox<[f32]> that owns the pointer.

execute is where the actual GPU work happens. It receives an ExecutionContext – the assigned kitchen – and submits work to the stream inside it. The method is unsafe because GPU work may still be in flight when it returns; the caller is responsible for synchronizing before reading results.

The Send bound means operations can move across threads (essential for tokio::spawn). The IntoFuture bound is what makes .await work – more on that shortly.

You rarely implement DeviceOperation yourself. The crate provides a set of types that implement it, and you compose them using combinators:

  • AsyncKernelLaunch – produced by cuda_launch_async!. Launches a kernel.

  • Value<T> – wraps a host-side value. No GPU work. Returns T immediately.

  • AndThen – chains two operations: run A, feed the result to B.

  • Zip – runs two operations and returns both results as a tuple.

  • StreamOperation – defers construction until the stream is known.

These are the building blocks of every async pipeline. The Combinators and Composition chapter covers each one in detail.

The ExecutionContext – where the stream lives#

When a recipe is executed, it needs to know which kitchen it is in. The ExecutionContext carries that information:

pub struct ExecutionContext {
    device: usize,              // which GPU
    cuda_stream: Arc<CudaStream>,   // which stream
    cuda_context: Arc<CudaContext>, // which CUDA context
}

Operations never create streams themselves. The scheduling policy (covered in Scheduling and Streams) creates the ExecutionContext and passes it into execute. This is the core of the separation: operations describe what, the context provides where.

Inside an execute implementation, you access the stream with ctx.get_cuda_stream() and the CUDA context with ctx.get_cuda_context(). For most operations this is all you need – enqueue a kernel or a memory copy on the stream, and you are done.

Running the recipe: four ways to execute#

Once you have a DeviceOperation, you need to trigger it. cuda-oxide gives you four paths, ranging from “do everything for me” to “I’ll handle it myself.”

.sync() – block and wait#

The simplest option. The scheduling policy picks a stream, runs the operation, and blocks the calling thread until the stream is idle:

let result: Vec<f32> = d2h_operation.sync()?;

This is perfect for scripts, tests, and any place where you just want the answer now. No Tokio runtime needed.

.await – yield and resume#

Inside an async runtime, .await does the same thing but without blocking the thread. It converts the operation into a DeviceFuture, submits the GPU work, and yields the current task. When the GPU finishes, it wakes the task and delivers the result:

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    init_device_contexts(0, 1)?;

    let result = cuda_launch_async! {
        kernel: vecadd,
        module: module,
        config: LaunchConfig::for_num_elems(1024),
        args: [slice(a_dev), slice(b_dev), slice_mut(c_dev)]
    }
    .await?;

    Ok(())
}

While the GPU is working, the Tokio runtime is free to poll other tasks – no thread sits idle waiting for hardware. This is the key to running multiple GPU pipelines concurrently, which we explore in Concurrent Execution.

.sync_on(&stream) – you choose the stream#

When you need a specific stream – for interop with an existing CUDA library, or to guarantee ordering with other work on that stream – sync_on lets you supply it directly and blocks until it completes:

let stream = ctx.new_stream()?;
operation.sync_on(&stream)?;

unsafe async_on(&stream) – fire and forget#

The most manual option. It submits work to a stream and returns immediately, without synchronizing. The caller must ensure the stream is synchronized before reading results. This is useful for batching many operations onto a stream before a single sync at the end:

let stream = ctx.new_stream()?;
unsafe { op_a.async_on(&stream)? };
unsafe { op_b.async_on(&stream)? };
stream.synchronize()?;  // now both are done

Lifting host data with value()#

Not every step in a pipeline involves the GPU. Sometimes you need to feed a host-side value – a configuration parameter, a set of dimensions, a pre-loaded weight vector – into a chain of device operations. The value() function wraps any Send type in a no-op DeviceOperation that returns it immediately:

use cuda_async::device_operation::value;

let weights = vec![1.0f32; 1024];
let op = value(weights);  // impl DeviceOperation<Output = Vec<f32>>

On its own, value() looks pointless. Its power shows up in composition. If you are zipping together a host-to-device transfer and a configuration struct, value() makes the configuration fit the pipeline:

let (device_buf, config) = zip!(
    h2d(raw_data),
    value(ModelConfig { dim: 64, layers: 3 })
).sync()?;

Both arms of zip! must be DeviceOperations. value() is the adapter that makes host data play nicely with device work.

Talking to the stream with with_context#

Some operations need access to the stream itself at execution time. Memory allocation (malloc_async), asynchronous copies (memcpy_htod_async), and event recording all require a raw CUstream handle. But remember – a DeviceOperation does not know which stream it will run on when it is created. The stream is assigned later, by the scheduling policy.

with_context bridges this gap. It creates an operation whose body is deferred until the ExecutionContext is available:

use cuda_async::device_operation::{with_context, value};
use cuda_core::memory::{malloc_async, memcpy_htod_async};

fn h2d(host_data: Vec<f32>) -> impl DeviceOperation<Output = DeviceBox<[f32]>> {
    with_context(move |ctx| {
        let stream = ctx.get_cuda_stream();
        let n = host_data.len();
        let num_bytes = n * std::mem::size_of::<f32>();
        unsafe {
            let dptr = malloc_async(stream.cu_stream(), num_bytes).unwrap();
            memcpy_htod_async(dptr, host_data.as_ptr(), num_bytes, stream.cu_stream())
                .unwrap();
            value(DeviceBox::from_raw_parts(dptr, n, ctx.get_device_id()))
        }
    })
}

The closure receives the ExecutionContext and must return another DeviceOperation. Here it returns a Value wrapping the freshly allocated device pointer. The inner operation is executed immediately on the same stream.

This pattern – with_context wrapping raw driver calls, returning value() at the end – is how you turn any low-level CUDA operation into a composable building block. The async_mlp example uses it for h2d, d2h, and zeros helpers that slot cleanly into and_then chains.

Tip

with_context is the escape hatch for raw driver calls that need a CUstream. For kernel launches, prefer cuda_launch_async! or AsyncKernelLaunch – they handle argument marshalling and are less error-prone.

How the GPU tells Rust it is done#

When you .await a DeviceOperation, something interesting happens under the hood. The operation becomes a DeviceFuture – a type that implements Rust’s std::future::Future – and the async runtime polls it. But how does a poll- based system know when hardware has finished its work?

The answer is cuLaunchHostFunc, a CUDA driver API that enqueues a host-side callback into a stream. When all preceding GPU work on that stream finishes, the driver calls the callback on a driver thread. cuda-oxide uses this to build a zero-busy-wait bridge between CUDA and Rust’s async model.

The DeviceFuture is a three-state machine:

  Idle ───poll()───► Executing ───callback fires───► Complete
                         │                               │
                   (submit GPU work               (return result
                    + enqueue callback)             to the runtime)

On the first poll, the future:

  1. Calls execute() on the operation, submitting GPU work to the stream.

  2. Enqueues a cuLaunchHostFunc callback on the same stream, right after the GPU work. CUDA guarantees stream ordering: this callback will not fire until the kernel finishes.

  3. Returns Poll::Pending. The async runtime parks the task and moves on.

When the GPU finishes the kernel, the CUDA driver calls the host callback on a driver thread. The callback sets an AtomicBool flag and wakes the task’s AtomicWaker. The async runtime notices the wake and re-polls the future.

On the second poll, the future sees the flag and returns Poll::Ready(Ok(result)). The task resumes with the value.

The critical property: no host thread spins or sleeps while the GPU works. The async executor is free to run other tasks – including other DeviceFutures on other streams. This is how cuda-oxide achieves true concurrent execution without dedicating a thread per GPU operation.

See also

The Combinators and Composition chapter shows how to build multi-stage pipelines from these primitives, and Scheduling and Streams explains how the scheduling policy selects a stream and creates the ExecutionContext that ties everything together.