Host API#

Reference for everything host-side: creating and transferring tensors, managing contexts and streams, configuring kernel launches, the DeviceOp trait and its combinators, and CUDA graph integration. For tutorial-style introductions, see Device Operations and Working with Data.


Tensor Creation and Views#

api::* constructors#

All creation functions return a DeviceOp — allocation and initialization happen when the operation runs, not when it is constructed.

| Function | Output | Description |
|---|---|---|
| api::zeros::<T>(shape: &[usize]) | DeviceOp<Output = Tensor<T>> | All zeros |
| api::ones::<T>(shape: &[usize]) | DeviceOp<Output = Tensor<T>> | All ones |
| api::full::<T>(val, shape: &[usize]) | DeviceOp<Output = Tensor<T>> | Fill with scalar value |
| api::fill::<T>(tensor, val) | DeviceOp<Output = Tensor<T>> | Fill an existing tensor and return it |
| api::arange::<T>(len: usize) | DeviceOp<Output = Tensor<T>> | [0, 1, 2, ..., len-1] (1D) |
| api::linspace(start: f32, stop: f32, n: usize) | DeviceOp<Output = Tensor<f32>> | n values evenly spaced from start to stop |
| api::eye(n: usize) | DeviceOp<Output = Tensor<f32>> | n × n identity matrix |
| api::eye_rect(rows: usize, cols: usize) | DeviceOp<Output = Tensor<f32>> | rows × cols, ones on main diagonal |
| api::convert::<From, To>(src: Arc<Tensor<From>>) | DeviceOp<Output = Tensor<To>> | Convert tensor element type |
| api::rand::<T, RANK>(shape: [usize; RANK], seed: Option<u64>) | DeviceOp<Output = Tensor<T>> | Uniform [0, 1) from cuRAND (T: RandUniform) |
| api::randn::<T, RANK>(mean: T, std: T, shape: [usize; RANK], seed: Option<u64>) | DeviceOp<Output = Tensor<T>> | Normal N(mean, std²) from cuRAND (T: RandNormal) |
| api::randn_f16(mean: f16, std: f16, shape: [usize; RANK], seed: Option<u64>) | DeviceOp<Output = Tensor<f16>> | Normal for f16 (generates f32 and converts; cuRAND has no native f16) |

Shape conventions vary across the module: zeros/ones/full take &[usize] slices of arbitrary length; rand/randn take [usize; RANK] arrays (rank is a const generic). The RANK parameter is usually inferred from the array literal.

use cutile::api;

let z = api::zeros::<f32>(&[1024]).sync_on(&stream)?;
let m = api::ones::<f32>(&[256, 256]).sync_on(&stream)?;
let r = api::randn(0.0f32, 1.0, [32, 64, 128], None).sync_on(&stream)?;   // 3D N(0, 1)
let u = api::rand::<f32, 1>([1024], Some(42)).sync_on(&stream)?;          // Uniform with fixed seed
let idx = api::arange::<i32>(1024).sync_on(&stream)?;
let identity = api::eye(64).sync_on(&stream)?;

Tensor and DeviceOp shape helpers#

Host-side reshapes are zero-copy metadata changes. The new shape must preserve the element count, and borrowed views additionally require contiguous storage.

| API | Description |
|---|---|
| tensor.reshape(&shape) | Consume and return Tensor<T> with a new shape. |
| (&arc_tensor).reshape(&shape) | Return a new Arc<Tensor<T>> sharing the same allocation with new shape metadata. |
| device_op.reshape(&shape) | Reshape the Tensor<T> or Arc<Tensor<T>> produced by a DeviceOp. |
| tensor.partition(shape) | Consume a tensor and create a mutable output partition for kernel launch. |
| arc_tensor.try_partition(shape) | Consume an Arc<Tensor<T>> only if it has a single owner, then partition it. |
| partition.unpartition() | Recover the owned tensor from a partition returned by a kernel. |

use cutile::api::{self, DeviceOpReshape};
use cutile::tensor::{Reshape, Tensor, TryPartition};
use cutile::tile_kernel::PartitionOp;
use std::sync::Arc;

let x = api::arange::<f32>(32).reshape(&[4, 8]).sync_on(&stream)?;
let z = api::zeros::<f32>(&[32]).partition([4]);

let weights: Arc<Tensor<f32>> = api::ones::<f32>(&[4, 8]).sync_on(&stream)?.into();
let weights_2d = (&weights).reshape(&[8, 4])?;
let partitioned = weights_2d.try_partition([2, 4])?;

Tensor metadata and reinterpretation#

Tensor<T> stores shape and layout metadata alongside the device allocation. These accessors do not synchronize with the GPU:

| API | Description |
|---|---|
| tensor.shape() | Runtime dimensions as &[i32] |
| tensor.strides() | Runtime strides as &[i32] |
| tensor.size() | Number of elements |
| tensor.num_bytes() | Number of bytes in the tensor view |
| tensor.is_contiguous() | Whether the view is contiguous |
| tensor.device_id() | CUDA device ordinal for the allocation |
| tensor.device_pointer() | Typed non-owning DevicePointer<T> for interop |
| arc_tensor.reinterpret::<U>(&shape) | Zero-copy reinterpretation as Arc<Tensor<U>> |

reinterpret requires an Arc<Tensor<T>>, contiguous storage, matching total byte size, and compatible pointer alignment:

use cutile::api;
use cutile::tensor::Tensor;
use std::sync::Arc;

let raw: Arc<Tensor<u32>> = api::arange::<u32>(4).sync_on(&stream)?.into();
let floats: Arc<Tensor<f32>> = raw.reinterpret::<f32>(&[4])?;
assert_eq!(floats.shape(), &[4]);

TensorView: zero-copy views and slices#

TensorView provides zero-copy borrowed views of a tensor with a different shape or offset. Views borrow the underlying tensor — the tensor cannot be mutated while a view exists. The offset is applied host-side, so passing a view to a kernel hands the kernel a pointer to the correct starting address without any data movement.

| Method | Description |
|---|---|
| tensor.view(&shape) | Reshape to the given shape without copying. Total element count must match. |
| tensor.slice(&ranges) | Borrow a rectangular sub-region (one numpy-style range per dimension). |
| view.slice(&ranges) | Chain-slice further; offsets accumulate. |

let tensor = api::arange::<f32>(1024).sync_on(&stream)?;

// Reshape without copying.
let matrix = tensor.view(&[32, 32])?;

// Slice: borrow a subregion (numpy-style ranges).
let first_half = tensor.slice(&[0..512])?;       // elements 0-511
let row_slice = matrix.slice(&[1..3])?;          // rows 1-2, all columns
let block = matrix.slice(&[1..3, 2..6])?;        // rows 1-2, cols 2-5

// Chained slices accumulate offsets.
let inner = tensor.slice(&[100..200])?.slice(&[10..20])?;  // = tensor[110..120]

Views and slices are passed to kernels as &Tensor parameters. They’re the right tool when you want to process a subregion of an existing tensor — an attention kernel over a sub-sequence, a GEMM over a sub-matrix, a scan over a contiguous slice — without allocation or copying.
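As a sketch, reusing the hypothetical my_kernel launcher that appears elsewhere on this page, a kernel can run over the first 512 elements of a larger tensor without copying:

let scores = api::rand::<f32, 1>([2048], None).sync_on(&stream)?;
let window = scores.slice(&[0..512])?;   // zero-copy view of elements 0-511

// The view is handed to the kernel's &Tensor parameter directly.
let out = my_kernel(api::zeros::<f32>(&[512]).partition([128]), &window)
    .first()
    .unpartition()
    .sync_on(&stream)?;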


Host-Device and Device-Device Transfers#

Moving data between CPU and GPU, or between two device tensors, uses APIs that return DeviceOps — the copy is scheduled when the op runs, not when it is constructed:

| API | Returns | Description |
|---|---|---|
| api::copy_host_vec_to_device::<T>(vec: &Arc<Vec<T>>) | DeviceOp<Output = Tensor<T>> | Copy host Vec<T> to a new device Tensor<T> |
| api::copy_device_to_host_vec::<T>(tensor: &Arc<Tensor<T>>) | DeviceOp<Output = Vec<T>> | Copy a device Tensor<T> to a host Vec<T> |
| tensor.to_host_vec() | DeviceOp<Output = Vec<T>> | Method form of copy_device_to_host_vec (preferred) |
| device_op.to_host_vec() | DeviceOp<Output = Vec<T>> | Copy the Tensor<T> produced by a DeviceOp to host |
| api::dup(&tensor) / tensor.dup() | DeviceOp<Output = Tensor<T>> | Allocate a new tensor and copy device-to-device |
| api::memcpy(&mut dst, &src) | DeviceOp<Output = ()> | Copy device-to-device into an existing tensor; used especially for CUDA graph updates |

// Host -> device
let data: Arc<Vec<f32>> = Arc::new(vec![1.0; 1024]);
let tensor: Tensor<f32> = api::copy_host_vec_to_device(&data).sync_on(&stream)?;

// Device -> host
let result: Vec<f32> = tensor.to_host_vec().sync_on(&stream)?;

// Device -> device
let copy = tensor.dup().sync_on(&stream)?;

The host-side Vec must remain alive until the op completes — the async copy reads from it until the stream synchronizes. Arc<Vec<T>> makes this straightforward for shared access. to_host_vec is available on Tensor<T>, Arc<Tensor<T>>, and &Arc<Tensor<T>>; each returns the same DeviceOp<Output = Vec<T>>. It is also available on a DeviceOp<Output = Tensor<T>>, which is the common form after a kernel chain:

let host: Vec<f32> = kernel(out.partition([128]), &input)
    .first()
    .unpartition()
    .to_host_vec()
    .sync_on(&stream)?;

api::memcpy copies between already allocated tensors and requires source and destination to have the same element count. It is the usual way to update graph input buffers before replay:

graph.update(api::memcpy(&mut input_buffer, &new_input))?;
graph.launch().sync_on(&stream)?;

Devices and Streams#

Every host program starts with a Device, plus one or more Streams for scheduling GPU work:

use cuda_core::Device;

let device = Device::new(0)?;              // Device ordinal 0
let stream = device.new_stream()?;         // A new stream owned by this device

| Method | Returns | Description |
|---|---|---|
| Device::new(ordinal: usize) | Result<Arc<Device>, DriverError> | Create a device handle bound to a GPU ordinal |
| Device::device_count() | Result<i32, DriverError> | Number of CUDA-capable devices |
| device.ordinal() | usize | GPU ordinal this handle represents |
| device.name() | Result<String, DriverError> | Device name |
| device.new_stream() | Result<Arc<Stream>, DriverError> | Create a new stream on this device |
| Device::borrow_raw(...) | Arc<Device> | Borrow an externally owned CUDA context/device for interop |
| Stream::borrow_raw(...) | Arc<Stream> | Borrow an externally owned CUDA stream for interop |
| Module::borrow_raw(...) / Function::borrow_raw(...) | CUDA module/function wrappers | Borrow externally owned CUDA handles |

Devices are Arc-wrapped for sharing across threads; streams are also Arc-wrapped and can be passed to .sync_on(&stream) for explicit stream scheduling.

The default round-robin scheduling policy handles stream assignment automatically for most workloads — these APIs are for when you need explicit stream control (debugging, deterministic ordering, pairing with AsyncKernelLaunch, or overlapping compute with transfers on dedicated streams).

The borrow_raw constructors do not take ownership of the underlying CUDA handles and therefore do not destroy them on drop. Use them when integrating with another runtime that owns the context, stream, module, or function.


Kernel Launch Configuration#

Several types configure how kernels compile and launch.

CompileOptions — runtime overrides for entry-level optimization_hints, typically used for autotuning:

use cutile::tile_kernel::CompileOptions;

let opts = CompileOptions::default()
    .occupancy(4)
    .num_cta_in_cga(2)
    .max_divisibility(16);

let result = my_kernel(args).compile_options(opts).grid(grid).await?;

Different CompileOptions values trigger separate JIT compilations and are part of the kernel cache key.

Generated #[cutile::entry] launchers also expose launch-time configuration methods:

| Method | Description |
|---|---|
| .grid((x, y, z)) | Set an explicit runtime launch grid instead of inferring it from partitioned tensor inputs. |
| .const_grid((x, y, z)) | Set a compile-time constant grid, enabling grid-dependent optimizations. |
| .compile_options(opts) | Override occupancy, cluster/CTA, and divisibility hints for this compilation. |
| .generics(values) | Bind type and const generic arguments manually when they cannot be inferred. |
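For example, a sketch using the hypothetical my_kernel launcher; grid tuples follow the (x, y, z) form shown above:

// Explicit runtime grid (skips inference from partitioned inputs).
let a = my_kernel(args).grid((128, 1, 1)).sync_on(&stream)?;

// Compile-time constant grid; each distinct value is a separate JIT compilation.
let b = my_kernel(args).const_grid((128, 1, 1)).sync_on(&stream)?;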

The JIT compiler invokes tileiras through normal PATH lookup by default. Set CUTILE_TILEIRAS_PATH to use a specific binary:

CUTILE_TILEIRAS_PATH=/opt/cuda-tile/bin/tileiras cargo test -p cutile

LaunchConfig — grid/block/shared-memory specification for AsyncKernelLaunch (raw CUDA kernels launched outside the #[cutile::entry] path):

use cuda_core::LaunchConfig;

LaunchConfig {
    grid_dim: ((n + 255) / 256, 1, 1),    // 3D grid of thread blocks
    block_dim: (256, 1, 1),                // 3D block of threads
    shared_mem_bytes: 0,                   // Dynamic shared memory per block
}

AsyncKernelLaunch — wraps a CUDA driver kernel launch as a DeviceOp. Build the argument list with push_arg (safe, for DType scalars) or push_device_ptr (unsafe, for raw device pointers), set the launch config, then .await or .sync_on():

use cuda_async::launch::AsyncKernelLaunch;

let mut launcher = AsyncKernelLaunch::new(function.clone());
launcher.push_arg(num_elements as u32);
launcher.push_arg(scale);
let input_ptr = input.device_pointer();
let output_ptr = output.device_pointer();
unsafe {
    launcher
        .push_device_ptr(input_ptr.cu_deviceptr())
        .push_device_ptr(output_ptr.cu_deviceptr());
}
launcher.set_launch_config(LaunchConfig {
    grid_dim: ((num_elements as u32 + 255) / 256, 1, 1),
    block_dim: (256, 1, 1),
    shared_mem_bytes: 0,
});
launcher.await?;  // Executes as a DeviceOp

See Interoperability for the full walkthrough and the wrapper pattern that hides unsafe at the call site.

.generics(Vec<String>)#

#[cutile::entry]-generated launchers accept this method to bind const generics and type parameters at runtime:

let generics = vec![
    "f32".to_string(),  // E
    "16".to_string(),   // BM
    "16".to_string(),   // BN
    "8".to_string(),    // BK
    "128".to_string(),  // K
];
gemm(z, x, y).generics(generics).sync_on(&stream)?;

Generic values are part of the kernel cache key: each unique combination triggers its own JIT compilation.


The Futures Analogy#

DeviceOp is to GPU work what Future is to async I/O. Both are lazy descriptions of work that don’t execute until driven:

| Concept | std::future::Future | DeviceOp |
|---|---|---|
| What it represents | Async computation | GPU computation |
| When it runs | On .await or poll() | On .sync(), .sync_on(), or .await |
| Chaining | .then(), .map() via FutureExt | .then(), .map() on DeviceOp |
| Fan-in | join! | zip! |
| Fan-out | N/A (single consumer) | .unzip() |
| Shared access | FutureExt::shared() | .shared() |
| Type erasure | BoxFuture | .boxed() → BoxedDeviceOp |
| Output wrapper | Poll<T> | Result<T, DeviceError> |

The key difference: a Future is pulled by an async runtime via poll(), while a DeviceOp is pushed to the GPU via execute(). When you convert a DeviceOp to a Future (via .await or .into_future()), cuTile bridges the two models — the runtime polls a DeviceFuture that checks whether the GPU has finished.
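A sketch of the same operation driven both ways, assuming an async context for the second form:

// Push: block the calling thread until the GPU finishes.
let a = api::ones::<f32>(&[64]).sync_on(&stream)?;

// Pull: .await bridges into a DeviceFuture that the async runtime polls.
let b = api::ones::<f32>(&[64]).await?;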


Combinator Reference#

All combinators follow established Rust conventions. The “Precedent” column shows which standard library or futures crate method inspired the design.

Composition#

| Combinator | Signature | Precedent | What it does |
|---|---|---|---|
| zip!(a, b, …) | (impl DeviceOp, …) → impl DeviceOp<Output=(A, B, …)> | Iterator::zip | Combine N operations into a single tuple-producing operation |
| .unzip() | impl DeviceOp<Output=(A, B, …)> → (impl DeviceOp<Output=A>, …) | Iterator::unzip | Split a tuple operation into independent per-element operations |
| .then(f) | self, f(Self::Output) → impl DeviceOp<Output=O> | FutureExt::then | Chain follow-up GPU work on the same stream |
| .map(f) | self, f(Self::Output) → O (no GPU work) | FutureExt::map | Transform output without issuing GPU work |
| .inspect(f) | self, f(&Self::Output) (passthrough) | FutureExt::inspect | Peek at output for debugging; returns it unchanged |
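A short sketch combining these, assuming the zip! macro is in scope:

// Fan-in: two independent ops become one tuple-producing op.
let (a, b) = zip!(api::zeros::<f32>(&[64]), api::ones::<f32>(&[64])).sync_on(&stream)?;

// Host-side peek and transform; neither issues GPU work.
let n = api::arange::<i32>(16)
    .inspect(|t| println!("created {:?}", t.shape()))
    .map(|t| t.size())
    .sync_on(&stream)?;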

Selection#

| Combinator | Signature | Precedent | What it does |
|---|---|---|---|
| .first() | impl DeviceOp<Output=(A, B, …)> → impl DeviceOp<Output=A> | slice::first | Extract the first element of a tuple output |
| .last() | impl DeviceOp<Output=(A, B, …)> → impl DeviceOp<Output=Z> | slice::last | Extract the last element of a tuple output |

Sharing and Erasure#

| Combinator | Signature | Precedent | What it does |
|---|---|---|---|
| .shared() | self → SharedDeviceOp<Self::Output> | FutureExt::shared | Cloneable, execute-once; output is Arc<T> |
| shared(arc) | Arc<T> → SharedDeviceOp<T> | — | Wrap an existing Arc as a pre-computed SharedDeviceOp |
| .boxed() | self → BoxedDeviceOp<Self::Output> | FutureExt::boxed | Type-erase for heterogeneous collections |

Execution#

| Method | Stream chosen by | Blocks? | Use case |
|---|---|---|---|
| .sync() | Default policy (round-robin) | Yes | Quick scripts |
| .sync_on(&stream) | The explicit stream | Yes | Deterministic ordering, debugging |
| .await | Default policy (round-robin) | No (suspends task) | Async production code |
| .into_future() | Default policy | No (returns DeviceFuture) | Manual future handling |
| .schedule(policy) | The policy you provide | No (returns DeviceFuture) | Multi-device dispatch |
| .graph() | Default policy (round-robin) | Yes (captures + syncs) | CUDA graph capture |
| .graph_on(stream) | The explicit stream | Yes (captures + syncs) | CUDA graph capture on a specific stream |

Note

If any kernel input is &Tensor<T> (borrowed), the operation is not 'static and cannot be used with tokio::spawn. Use .sync_on() or .await in the same scope, or switch to Arc<Tensor<T>> for spawned tasks.
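A sketch of the spawnable form, again with the hypothetical my_kernel launcher:

let x: Arc<Tensor<f32>> = api::ones::<f32>(&[1024]).sync_on(&stream)?.into();
let op = my_kernel(api::zeros::<f32>(&[1024]).partition([128]), x.clone());

// The Arc input keeps the op 'static, so it can cross the task boundary.
let joined = tokio::spawn(op.into_future()).await?;  // outer ?: tokio JoinError
let result = joined?;                                 // inner ?: DeviceError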


Supported Kernel Parameter Types#

| Kernel param | Host type | Return type |
|---|---|---|
| &Tensor<T, S> | Tensor<T>, Arc<Tensor<T>>, or &Tensor<T> | Same as input |
| &mut Tensor<T, S> | Partition<Tensor<T>> or Partition<&mut Tensor<T>> | Same as input |
| Scalar (f32, i32, etc.) | Same scalar | Same scalar |
| *mut T (unsafe only) | DevicePointer<T> | DevicePointer<T> |

The borrowed partition form (Partition<&mut Tensor<T>>) writes in place — no unpartition() needed. Create it with (&mut tensor).partition(shape).

Raw pointer entry points are unsafe fns. Obtain a typed device pointer from a tensor with tensor.device_pointer(), and make sure the pointer remains valid for the duration of the kernel launch:

let backing = api::zeros::<f32>(&[1024]).sync_on(&stream)?;
let ptr = backing.device_pointer();
unsafe { raw_ptr_kernel(ptr, 1024) }.sync_on(&stream)?;

Ownership Model#

The core invariant: you get back what you put in.

Read-only inputs (&Tensor params)#

| Input | Returned | tokio::spawn? |
|---|---|---|
| Tensor<T> | Tensor<T> | Yes |
| Arc<Tensor<T>> | Arc<Tensor<T>> | Yes |
| &'a Tensor<T> | &'a Tensor<T> | No (not 'static) |

Mutable outputs (&mut Tensor params)#

| Input | Returned | unpartition() needed? |
|---|---|---|
| Partition<Tensor<T>> (owned) | Partition<Tensor<T>> | Yes |
| Partition<&'a mut Tensor<T>> (borrowed) | Partition<&'a mut Tensor<T>> | No — tensor is written in place |

The borrowed form is created with (&mut tensor).partition(shape):
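A sketch, once more with the hypothetical my_kernel launcher:

let input = api::ones::<f32>(&[1024]).sync_on(&stream)?;
let mut out = api::zeros::<f32>(&[1024]).sync_on(&stream)?;

// Writes land in place; no unpartition() needed afterward.
my_kernel((&mut out).partition([128]), &input).sync_on(&stream)?;
// `out` now holds the kernel's result and is still owned here.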

Owned: Tensor<T>#

Pass a tensor directly — the launcher wraps it in Arc internally for the kernel, then unwraps it back afterward (safe because refcount is 1):

let output = my_kernel(
    api::zeros(&[1024]).partition([128]),
    api::ones::<f32>(&[1024]),  // DeviceOp<Output=Tensor<f32>>
)
.first()
.unpartition()
.sync_on(&stream)?;

Use this for single-use tensors where you don’t need shared access.

Shared: Arc<Tensor<T>>#

Wrap in Arc when the same tensor is passed to multiple kernels:

let x: Arc<Tensor<f32>> = api::ones(&[1024]).sync_on(&stream)?.into();

let a = kernel_a(out_a, x.clone()).sync_on(&stream)?;
let b = kernel_b(out_b, x.clone()).sync_on(&stream)?;

This is the most common pattern in existing code.

Borrowed: &Tensor<T>#

Pass a reference when you want to retain ownership and avoid Arc overhead. The borrow checker ensures the tensor outlives the kernel:

let weights: Tensor<f32> = api::ones(&[1024]).sync_on(&stream)?;

// Borrow — no Arc allocation, no refcount.
let result = my_kernel(out_partition, &weights).sync_on(&stream)?;

// weights is still available here.

Key safety property: because &Tensor<T> is not 'static, tokio::spawn rejects operations that borrow tensors:

let op = my_kernel(out, &weights);  // borrows weights
tokio::spawn(op.into_future());      // ← compile error: not 'static

This is enforced at compile time by Rust’s lifetime system — no runtime checks needed.

.shared(): Clone + Execute-Once#

.shared() converts a DeviceOp into a SharedDeviceOp<T> that is Clone. The underlying operation runs at most once; every clone receives Arc::clone() of the cached result:

let x = api::ones::<f32>(&[32, 32]).shared();

let a = kernel_a(x.clone()).sync()?;  // x executes here (first clone to run)
let b = kernel_b(x.clone()).sync()?;  // uses cached Arc<Tensor<f32>>

Output type changes: DeviceOp<Output=T> becomes SharedDeviceOp with Output=Arc<T>.

For pre-computed values (e.g., weight tensors), use the shared() free function to wrap an Arc<T> directly:

use cuda_async::device_operation::shared;

let w: Arc<Tensor<f32>> = /* loaded weights */;
let w_op: SharedDeviceOp<Tensor<f32>> = shared(w);

.unwrap_arc()#

.shared() and unzip produce Arc<T> outputs. When you need owned T back (e.g., to partition a tensor), use .unwrap_arc():

let x: Arc<Tensor<f32>> = api::ones(&[1024]).shared().sync()?;

let owned: Tensor<f32> = value(x).unwrap_arc().sync()?;
let partitioned = owned.partition([128]);

Panics if the Arc has multiple owners.

IntoDeviceOp: Automatic Wrapping#

The IntoDeviceOp trait lets kernel launchers accept both DeviceOps and plain values:

| Type | Wraps as |
|---|---|
| Any impl DeviceOp<Output=T> | Pass-through |
| Tensor<T> | Value<Tensor<T>> |
| Arc<T> | Value<Arc<T>> |
| &'a Tensor<T> | Value<&'a Tensor<T>> |
| &Arc<T> | Value<Arc<T>> (clones the Arc) |
| f32, f64, i32, i64, u32, u64, usize | Value<T> |
| Partition<Tensor<T>> | Value<Partition<Tensor<T>>> |

// All of these work as inputs to a &Tensor kernel param:
my_kernel(out, tensor);              // Tensor<T>
my_kernel(out, arc_tensor);          // Arc<Tensor<T>>
my_kernel(out, &tensor);             // &Tensor<T>
my_kernel(out, api::ones(&[1024]));  // DeviceOp<Output=Tensor<T>>

Scheduling Model#

Stream assignment#

When you call .sync() or .await, the operation asks the default device’s scheduling policy for a stream. The default policy is StreamPoolRoundRobin with 4 streams:

op_a.sync()  →  Stream 0
op_b.sync()  →  Stream 1
op_c.sync()  →  Stream 2
op_d.sync()  →  Stream 3
op_e.sync()  →  Stream 0  (wraps around)

Consecutive independent operations land on different streams, enabling GPU overlap. Operations chained with .then() share the parent’s stream, preserving data-dependency ordering.

Explicit Stream: .sync_on()#

Bypasses the policy entirely. All operations given the same stream execute in call order:

let stream = device.new_stream()?;
let a = op_a.sync_on(&stream)?;  // Stream X
let b = op_b.sync_on(&stream)?;  // Stream X — guaranteed after op_a

Available Policies#

| Policy | Behavior |
|---|---|
| StreamPoolRoundRobin (default) | Rotates through N streams (default 4) |
| SingleStream | All operations on one stream — strict ordering |
| Custom impl SchedulingPolicy | Implement fn next_stream() for your own strategy |
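A rough sketch of a custom policy. The exact trait signature is an assumption here (the table above only names next_stream), so check the SchedulingPolicy definition before copying this:

// Hypothetical: pin all work to one dedicated stream.
struct Pinned(Arc<Stream>);

impl SchedulingPolicy for Pinned {
    fn next_stream(&self) -> Arc<Stream> {
        self.0.clone()   // assumed signature; the real trait may differ
    }
}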

.then() Guarantees#

.then() is the recommended way to express data dependencies. Both operations share a single stream, so the second is guaranteed to see the first’s output fully written — no manual synchronization needed:

let result = allocate_buffer()
    .then(|buf| fill_kernel(buf))      // same stream
    .then(|buf| process_kernel(buf))   // same stream
    .sync()?;

Non-reentrancy: On any given thread, only one DeviceOp may be executing at a time. Calling sync_on, sync, or .await inside a then closure will return a runtime error. This prevents CUDA data races from cross-stream access to in-flight tensors. If you need nested execution and have verified there are no cross-stream data races, use unsafe then_unchecked.


Error Propagation#

All execution methods return Result<T, DeviceError>. Errors propagate through combinators: if any operation in a .then() chain fails, the error short-circuits to the caller.

DeviceError Variants#

Variant

When it occurs

Driver(DriverError)

CUDA driver call failed (OOM, invalid argument, etc.)

Context { device_id, message }

Device context assertion failed

KernelCache(String)

Kernel compilation or cache lookup failed

Scheduling(String)

No stream available or policy misconfigured

Launch(String)

Kernel launch precondition violated

Internal(String)

Bug in cuda-async internals

Anyhow(String)

Converted from anyhow::Error

Error Handling Patterns#

// Pattern 1: Propagate with ?
let x = api::zeros(&[1024]).sync_on(&stream)?;

// Pattern 2: Match specific errors
match my_kernel(args).sync_on(&stream) {
    Ok(result) => { /* use result */ }
    Err(DeviceError::Launch(msg)) => {
        eprintln!("kernel launch failed: {msg}");
    }
    Err(e) => return Err(e.into()),
}

cutile::error::Error vs DeviceError#

cutile::error::Error is the top-level error type that wraps DeviceError alongside other error categories (I/O, shape mismatches, etc.). Functions that only do GPU work return DeviceError; functions that mix host and device work (like the examples) return cutile::error::Error.
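A sketch of the split, assuming the usual From<DeviceError> conversion into the top-level type:

use cutile::error::Error;

// Mixed host/device work: `?` widens DeviceError into cutile::error::Error.
fn zeros_to_host(stream: &Arc<Stream>) -> Result<Vec<f32>, Error> {
    let t = api::zeros::<f32>(&[64]).sync_on(stream)?;
    Ok(t.to_host_vec().sync_on(stream)?)
}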


CUDA Graph Integration#

Combinator approach: .graph_on(stream)#

Any DeviceOp can be captured into a replayable CUDA graph:

let forward_op = build_forward(&cfg, &weights, input, buffers);
let mut graph = forward_op.graph_on(stream.clone())?;
let output = graph.take_output().unwrap();

// Replay loop — no graph rebuilding, no kernel re-compilation.
for token in tokens {
    graph.update(api::memcpy(&mut input_buf, &token))?;
    graph.launch().sync_on(&stream)?;
}

This requires Arc<Tensor<T>> + try_partition for shared buffers.

Scope approach: CudaGraph::scope#

CudaGraph::scope provides an imperative alternative using &mut borrows instead of Arc. Each s.record(op) records a graph node and releases borrows immediately. A buffer written by one record call can be read by the next:

let mut output = api::zeros::<f32>(&[d]).sync_on(&stream)?;
let weights = api::ones::<f32>(&[d]).sync_on(&stream)?;

let graph = CudaGraph::scope(&stream, |s| {
    s.record(kernel1((&mut output).partition([128]), &weights))?;
    s.record(kernel2((&mut output).partition([64]), &weights))?;
    Ok(())
})?;

graph.launch().sync_on(&stream)?;

record only accepts operations that implement GraphNode — kernel launches and memcpy. Allocation ops (zeros, ones, dup) are rejected at compile time because their addresses may change on replay.

GraphNode trait#

GraphNode is a marker trait for operations safe to record in a CUDA graph. Only operations that do not allocate or free device memory implement it:

| Implements GraphNode | Why safe |
|---|---|
| Macro-generated kernel launchers | Kernel launch only — no alloc/free |
| Memcpy (api::memcpy) | Copy between pre-allocated buffers |
| Value<T> (value(x)) | No GPU work |

CudaGraph methods#

| Method | What it does |
|---|---|
| .graph() / .graph_on(stream) | Capture a DeviceOp into a CudaGraph<T> |
| CudaGraph::scope(&stream, \|s\| { … }) | Scoped capture with &mut borrows |
| s.record(op: impl GraphNode) | Record a graph node inside a scope |
| graph.take_output() | Retrieve the output from the capture execution |
| graph.update(op) | Run a DeviceOp on the graph's stream (e.g., copy new input) |
| graph.launch() | Returns a DeviceOp that replays the captured graph |

All device pointers are baked in at capture time. To vary inputs, pre-allocate a buffer, pass it into the operation, and memcpy new data before each launch. See Tutorial 10 for a complete walkthrough.


See Also#

  • Device Operations — tutorial-style guide to streams, scheduling, and composition patterns

  • Tutorial 10 — end-to-end CUDA graph example

  • Interoperability — integrating custom CUDA C++ kernels into the DeviceOp model