Host API#
Reference for everything host-side: creating and transferring tensors, managing contexts and streams, configuring kernel launches, the DeviceOp trait and its combinators, and CUDA graph integration. For tutorial-style introductions, see Device Operations and Working with Data.
Tensor Creation and Views#
api::* constructors#
All creation functions return a DeviceOp — allocation and initialization happen when the operation runs, not when it is constructed.
| Function | Output | Description |
|---|---|---|
| `api::zeros` | `DeviceOp<Output = Tensor<T>>` | All zeros |
| `api::ones` | `DeviceOp<Output = Tensor<T>>` | All ones |
| `api::full` | `DeviceOp<Output = Tensor<T>>` | Fill with scalar value |
| | `DeviceOp<Output = Tensor<T>>` | Fill an existing tensor and return it |
| `api::arange` | `DeviceOp<Output = Tensor<T>>` | Sequential values `0..n` |
| `api::eye` | `DeviceOp<Output = Tensor<T>>` | Identity matrix |
| | | Convert tensor element type |
| `api::rand` | `DeviceOp<Output = Tensor<T>>` | Uniform random values (optional seed) |
| `api::randn` | `DeviceOp<Output = Tensor<T>>` | Normal random values with the given mean and standard deviation (optional seed) |
Shape conventions vary across the module: zeros/ones/full take &[usize] slices of arbitrary length; rand/randn take [usize; RANK] arrays (rank is a const generic). The RANK parameter is usually inferred from the array literal.
use cutile::api;
let z = api::zeros::<f32>(&[1024]).sync_on(&stream)?;
let m = api::ones::<f32>(&[256, 256]).sync_on(&stream)?;
let r = api::randn(0.0f32, 1.0, [32, 64, 128], None).sync_on(&stream)?; // 3D N(0, 1)
let u = api::rand::<f32, 1>([1024], Some(42)).sync_on(&stream)?; // Uniform with fixed seed
let idx = api::arange::<i32>(1024).sync_on(&stream)?;
let I = api::eye(64).sync_on(&stream)?;
Tensor and DeviceOp shape helpers#
Host-side reshapes are zero-copy metadata changes. They require the new shape to preserve the element count and, for borrowed views, to be contiguous.
| API | Description |
|---|---|
| `reshape` on `Tensor<T>` | Consume and return the tensor with a new shape |
| `reshape` on `&Arc<Tensor<T>>` | Return a new borrowed view with the given shape |
| `reshape` on a `DeviceOp` | Reshape the operation's output tensor |
| `partition` | Consume a tensor and create a mutable output partition for kernel launch |
| `try_partition` | Consume an `Arc`-backed view and create a partition for shared buffers |
| `unpartition` | Recover the owned tensor from a partition returned by a kernel |
use cutile::api::{self, DeviceOpReshape};
use cutile::tensor::{Reshape, Tensor, TryPartition};
use cutile::tile_kernel::PartitionOp;
use std::sync::Arc;
let x = api::arange::<f32>(32).reshape(&[4, 8]).sync_on(&stream)?;
let z = api::zeros::<f32>(&[32]).partition([4]);
let weights: Arc<Tensor<f32>> = api::ones::<f32>(&[4, 8]).sync_on(&stream)?.into();
let weights_2d = (&weights).reshape(&[8, 4])?;
let partitioned = weights_2d.try_partition([2, 4])?;
Tensor metadata and reinterpretation#
Tensor<T> stores shape and layout metadata alongside the device allocation.
These accessors do not synchronize with the GPU:
| API | Description |
|---|---|
| `shape()` | Runtime dimensions as `&[usize]` |
| `strides()` | Runtime strides as `&[usize]` |
| | Number of elements |
| | Number of bytes in the tensor view |
| | Whether the view is contiguous |
| | CUDA device ordinal for the allocation |
| `device_pointer()` | Typed non-owning device pointer |
| `reinterpret::<U>()` | Zero-copy reinterpretation as a tensor of a different element type |
reinterpret requires an Arc<Tensor<T>>, contiguous storage, matching total
byte size, and compatible pointer alignment:
use cutile::api;
use cutile::tensor::Tensor;
use std::sync::Arc;
let raw: Arc<Tensor<u32>> = api::arange::<u32>(4).sync_on(&stream)?.into();
let floats: Arc<Tensor<f32>> = raw.reinterpret::<f32>(&[4])?;
assert_eq!(floats.shape(), &[4]);
TensorView: zero-copy views and slices#
TensorView provides zero-copy borrowed views of a tensor with a different shape or offset. Views borrow the underlying tensor — the tensor cannot be mutated while a view exists. The offset is applied host-side, so passing a view to a kernel hands the kernel a pointer to the correct starting address without any data movement.
| Method | Description |
|---|---|
| `view(&shape)` | Reshape to the given shape without copying. Total element count must match. |
| `slice(&ranges)` | Borrow a rectangular sub-region (one numpy-style range per dimension). |
| `slice` on a view | Chain-slice further; offsets accumulate. |
let tensor = api::arange::<f32>(1024).sync_on(&stream)?;
// Reshape without copying.
let matrix = tensor.view(&[32, 32])?;
// Slice: borrow a subregion (numpy-style ranges).
let first_half = tensor.slice(&[0..512])?; // elements 0-511
let row_slice = matrix.slice(&[1..3])?; // rows 1-2, all columns
let block = matrix.slice(&[1..3, 2..6])?; // rows 1-2, cols 2-5
// Chained slices accumulate offsets.
let inner = tensor.slice(&[100..200])?.slice(&[10..20])?; // = tensor[110..120]
Views and slices are passed to kernels as &Tensor parameters. They’re the right tool when you want to process a subregion of an existing tensor — an attention kernel over a sub-sequence, a GEMM over a sub-matrix, a scan over a contiguous slice — without allocation or copying.
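For example, a slice can feed a kernel's `&Tensor` parameter directly; `scale` below is a hypothetical `#[cutile::entry]` launcher used only for illustration:

```rust
let data = api::arange::<f32>(1024).sync_on(&stream)?;
let first_half = data.slice(&[0..512])?; // zero-copy borrow of elements 0-511

// The view binds to a `&Tensor<f32>` kernel parameter; the kernel receives
// a pointer offset to element 0 of the slice, with no data movement.
let out = scale(api::zeros::<f32>(&[512]).partition([128]), &first_half)
    .first()
    .unpartition()
    .sync_on(&stream)?;
```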
Host-Device and Device-Device Transfers#
Moving data between CPU and GPU, or between two device tensors, uses APIs that return `DeviceOp`s — the copy is scheduled when the op runs, not when it is constructed:
| API | Returns | Description |
|---|---|---|
| `api::copy_host_vec_to_device` | `DeviceOp<Output = Tensor<T>>` | Copy host `Vec<T>` data to a new device tensor |
| | `DeviceOp<Output = Vec<T>>` | Copy a device tensor to a host `Vec<T>` |
| `to_host_vec` | `DeviceOp<Output = Vec<T>>` | Method form of the device-to-host copy |
| `to_host_vec` on a `DeviceOp` | `DeviceOp<Output = Vec<T>>` | Copy the output of a tensor-producing op to a host `Vec<T>` |
| `dup` | `DeviceOp<Output = Tensor<T>>` | Allocate a new tensor and copy device-to-device |
| `api::memcpy` | | Copy device-to-device into an existing tensor, used especially for CUDA graph updates |
// Host -> device
let data: Arc<Vec<f32>> = Arc::new(vec![1.0; 1024]);
let tensor: Tensor<f32> = api::copy_host_vec_to_device(&data).sync_on(&stream)?;
// Device -> host
let result: Vec<f32> = tensor.to_host_vec().sync_on(&stream)?;
// Device -> device
let copy = tensor.dup().sync_on(&stream)?;
The host-side Vec must remain alive until the op completes — the async copy
reads from it until the stream synchronizes. Arc<Vec<T>> makes this
straightforward for shared access. to_host_vec is available on Tensor<T>,
Arc<Tensor<T>>, and &Arc<Tensor<T>>; each returns the same
DeviceOp<Output = Vec<T>>. It is also available on a
DeviceOp<Output = Tensor<T>>, which is the common form after a kernel chain:
let host: Vec<f32> = kernel(out.partition([128]), &input)
.first()
.unpartition()
.to_host_vec()
.sync_on(&stream)?;
api::memcpy copies between already allocated tensors and requires source and
destination to have the same element count. It is the usual way to update graph
input buffers before replay:
graph.update(api::memcpy(&mut input_buffer, &new_input))?;
graph.launch().sync_on(&stream)?;
Devices and Streams#
Every host program starts with a Device, plus one or more Streams for scheduling GPU work:
use cuda_core::Device;
let device = Device::new(0)?; // Device ordinal 0
let stream = device.new_stream()?; // A new stream owned by this device
| Method | Description |
|---|---|
| `Device::new(ordinal)` | Create a device handle bound to a GPU ordinal |
| | Number of CUDA-capable devices |
| | GPU ordinal this handle represents |
| | Device name |
| `new_stream()` | Create a new stream on this device |
| `borrow_raw` (on `Device`) | Borrow an externally owned CUDA context/device for interop |
| `borrow_raw` (on `Stream`) | Borrow an externally owned CUDA stream for interop |
| `borrow_raw` (on the CUDA module/function wrappers) | Borrow externally owned CUDA handles |
Devices are Arc-wrapped for sharing across threads; streams are also Arc-wrapped and can be passed to .sync_on(&stream) for explicit stream scheduling.
The default round-robin scheduling policy handles stream assignment automatically for most workloads — these APIs are for when you need explicit stream control (debugging, deterministic ordering, paired with AsyncKernelLaunch, or overlapping compute with transfers on dedicated streams).
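For example, pinning a transfer and its consumer to one stream makes the ordering explicit; `double` is a hypothetical launcher here:

```rust
use std::sync::Arc;

let io = device.new_stream()?;
let host_data = Arc::new(vec![1.0f32; 1024]);

// Same stream, so the kernel is guaranteed to run after the copy.
let input = api::copy_host_vec_to_device(&host_data).sync_on(&io)?;
let out = double(api::zeros::<f32>(&[1024]).partition([128]), &input)
    .first()
    .unpartition()
    .sync_on(&io)?;
```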
The borrow_raw constructors do not take ownership of the underlying CUDA
handles and therefore do not destroy them on drop. Use them when integrating
with another runtime that owns the context, stream, module, or function.
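A sketch of that direction; the exact `borrow_raw` signatures are not shown in this reference, so the argument types below are assumptions:

```rust
// ASSUMPTION: borrow_raw takes raw driver handles and is unsafe because the
// caller must guarantee the handles outlive the borrow. Dropping these
// wrappers does not destroy the underlying context or stream.
let device = unsafe { Device::borrow_raw(raw_cu_context)? };
let stream = unsafe { Stream::borrow_raw(raw_cu_stream)? };
let z = api::zeros::<f32>(&[256]).sync_on(&stream)?;
```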
Kernel Launch Configuration#
Several types configure how kernels compile and launch.
CompileOptions — runtime overrides for entry-level optimization_hints, typically used for autotuning:
use cutile::tile_kernel::CompileOptions;
let opts = CompileOptions::default()
.occupancy(4)
.num_cta_in_cga(2)
.max_divisibility(16);
let result = my_kernel(args).compile_options(opts).grid(grid).await?;
Different CompileOptions values trigger separate JIT compilations and are part of the kernel cache key.
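That caching behavior makes a brute-force autotuning loop cheap to express; `my_kernel` and `make_args` are placeholders:

```rust
use std::time::Instant;

for occ in [2, 4, 8] {
    // First launch pays the JIT cost for this CompileOptions value...
    my_kernel(make_args())
        .compile_options(CompileOptions::default().occupancy(occ))
        .sync_on(&stream)?;
    // ...so time a second launch, which hits the kernel cache.
    let t = Instant::now();
    my_kernel(make_args())
        .compile_options(CompileOptions::default().occupancy(occ))
        .sync_on(&stream)?;
    println!("occupancy {occ}: {:.3} ms", t.elapsed().as_secs_f64() * 1e3);
}
```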
Generated #[cutile::entry] launchers also expose launch-time configuration
methods:
| Method | Description |
|---|---|
| `.grid(...)` | Set an explicit runtime launch grid instead of inferring it from partitioned tensor inputs |
| | Set a compile-time constant grid, enabling grid-dependent optimizations |
| `.compile_options(...)` | Override occupancy, cluster/CTA, and divisibility hints for this compilation |
| `.generics(...)` | Bind type and const generic arguments manually when they cannot be inferred |
The JIT compiler invokes tileiras through normal PATH lookup by default.
Set CUTILE_TILEIRAS_PATH to use a specific binary:
CUTILE_TILEIRAS_PATH=/opt/cuda-tile/bin/tileiras cargo test -p cutile
LaunchConfig — grid/block/shared-memory specification for AsyncKernelLaunch (raw CUDA kernels launched outside the #[cutile::entry] path):
use cuda_core::LaunchConfig;
let config = LaunchConfig {
    grid_dim: ((n + 255) / 256, 1, 1), // 3D grid of thread blocks
    block_dim: (256, 1, 1),            // 3D block of threads
    shared_mem_bytes: 0,               // Dynamic shared memory per block
};
AsyncKernelLaunch — wraps a CUDA driver kernel launch as a DeviceOp. Build the argument list with push_arg (safe, for DType scalars) or push_device_ptr (unsafe, for raw device pointers), set the launch config, then .await or .sync_on():
use cuda_async::launch::AsyncKernelLaunch;
let mut launcher = AsyncKernelLaunch::new(function.clone());
launcher.push_arg(num_elements as u32);
launcher.push_arg(scale);
let input_ptr = input.device_pointer();
let output_ptr = output.device_pointer();
unsafe {
launcher
.push_device_ptr(input_ptr.cu_deviceptr())
.push_device_ptr(output_ptr.cu_deviceptr());
}
launcher.set_launch_config(LaunchConfig {
grid_dim: ((num_elements as u32 + 255) / 256, 1, 1),
block_dim: (256, 1, 1),
shared_mem_bytes: 0,
});
launcher.await?; // Executes as a DeviceOp
See Interoperability for the full walkthrough and the wrapper pattern that hides unsafe at the call site.
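As a sketch of that wrapper pattern (the `Function` type name and the `scale` kernel are illustrative), the `unsafe` pointer pushes live inside one function with a safe-looking signature:

```rust
// Caller must keep `input` and `output` alive until the launch completes.
fn build_scale_launch(
    function: Function, // ASSUMPTION: the CUDA function wrapper type
    input: &Tensor<f32>,
    output: &Tensor<f32>,
    scale: f32,
) -> AsyncKernelLaunch {
    let n = input.shape()[0] as u32;
    let mut launcher = AsyncKernelLaunch::new(function);
    launcher.push_arg(n);
    launcher.push_arg(scale);
    unsafe {
        launcher
            .push_device_ptr(input.device_pointer().cu_deviceptr())
            .push_device_ptr(output.device_pointer().cu_deviceptr());
    }
    launcher.set_launch_config(LaunchConfig {
        grid_dim: ((n + 255) / 256, 1, 1),
        block_dim: (256, 1, 1),
        shared_mem_bytes: 0,
    });
    launcher
}
```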
.generics(Vec<String>) — #[cutile::entry]-generated launchers accept this method to bind const generics and type parameters at runtime:
let generics = vec![
"f32".to_string(), // E
"16".to_string(), // BM
"16".to_string(), // BN
"8".to_string(), // BK
"128".to_string(), // K
];
gemm(z, x, y).generics(generics).sync_on(&stream)?;
Generic values are part of the kernel cache key: each unique combination triggers its own JIT compilation.
The Futures Analogy#
DeviceOp is to GPU work what Future is to async I/O. Both are lazy
descriptions of work that don’t execute until driven:
| Concept | `Future` | `DeviceOp` |
|---|---|---|
| What it represents | Async computation | GPU computation |
| When it runs | On `poll()`, driven by an async runtime | On `execute()`, driven by `.sync()`, `.sync_on()`, or `.await` |
| Chaining | `then` (futures crate) | `.then()` |
| Fan-in | `join!` | |
| Fan-out | N/A (single consumer) | `unzip` |
| Shared access | `Shared` (futures crate) | `.shared()` |
| Type erasure | `Box<dyn Future>` | |
| Output wrapper | `Poll<T>` | `Result<T, DeviceError>` |
The key difference: a Future is pulled by an async runtime via poll(),
while a DeviceOp is pushed to the GPU via execute(). When you convert
a DeviceOp to a Future (via .await or .into_future()), cuTile bridges
the two models — the runtime polls a DeviceFuture that checks whether the
GPU has finished.
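Concretely, the same operation can be driven through either model:

```rust
// Push model: hand the op to the GPU and block the current thread.
let a = api::ones::<f32>(&[1024]).sync_on(&stream)?;

// Pull model (inside an async fn): convert to a Future that the runtime
// polls until the GPU reports completion.
let b = api::ones::<f32>(&[1024]).await?;
```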
Combinator Reference#
All combinators follow established Rust conventions. The “Precedent” column
shows which standard library or futures crate method inspired the design.
Composition#
| Combinator | Precedent | What it does |
|---|---|---|
| | | Combine N operations into a single tuple-producing operation |
| `unzip` | `Iterator::unzip` | Split a tuple operation into independent per-element operations |
| `then` | `FutureExt::then` | Chain follow-up GPU work on the same stream |
| `map` | `Iterator::map` / `FutureExt::map` | Transform output without issuing GPU work |
| `inspect` | `Iterator::inspect` | Peek at output for debugging; returns it unchanged |
Selection#
| Combinator | Precedent | What it does |
|---|---|---|
| `first` | `slice::first` | Extract the first element of a tuple output |
| `last` | `slice::last` | Extract the last element of a tuple output |
Execution#
| Method | Stream chosen by | Blocks? | Use case |
|---|---|---|---|
| `.sync()` | Default policy (round-robin) | Yes | Quick scripts |
| `.sync_on(&stream)` | The explicit stream | Yes | Deterministic ordering, debugging |
| `.await` | Default policy (round-robin) | No (suspends task) | Async production code |
| `.into_future()` | Default policy | No (returns a `Future`) | Manual future handling |
| | The policy you provide | No (returns a future) | Multi-device dispatch |
| | Default policy (round-robin) | Yes (captures + syncs) | CUDA graph capture |
| `.graph_on(&stream)` | The explicit stream | Yes (captures + syncs) | CUDA graph capture on a specific stream |
Note
If any kernel input is &Tensor<T> (borrowed), the operation is not
'static and cannot be used with tokio::spawn. Use .sync_on() or
.await in the same scope, or switch to Arc<Tensor<T>> for spawned tasks.
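A sketch of the spawn-friendly form, with a hypothetical `my_kernel` launcher:

```rust
use std::sync::Arc;

let weights: Arc<Tensor<f32>> = api::ones::<f32>(&[1024]).sync_on(&stream)?.into();

// Arc inputs make the op own its data, so it is 'static and spawnable.
let op = my_kernel(api::zeros::<f32>(&[1024]).partition([128]), weights.clone())
    .first()
    .unpartition();
let result = tokio::spawn(op.into_future()).await??;
```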
Supported Kernel Parameter Types#
| Kernel param | Host type | Return type |
|---|---|---|
| `&Tensor<T>` | `Tensor<T>`, `Arc<Tensor<T>>`, `&Tensor<T>`, or a tensor-producing `DeviceOp` | Same as input |
| `&mut Tensor<T>` | `Partition<Tensor<T>>` or `Partition<&mut Tensor<T>>` | Same as input |
| Scalar (`DType` values such as `f32`, `u32`) | Same scalar | Same scalar |
| Raw device pointer | Typed device pointer from `tensor.device_pointer()` | |
The borrowed partition form (`Partition<&mut Tensor<T>>`) writes in place — no
`unpartition()` needed. Create it with `(&mut tensor).partition(shape)`, as sketched below.
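A minimal sketch, with a hypothetical `my_kernel` launcher:

```rust
let mut output = api::zeros::<f32>(&[1024]).sync_on(&stream)?;
let input = api::ones::<f32>(&[1024]).sync_on(&stream)?;

// Partition<&mut Tensor<f32>>: the kernel writes `output` in place.
my_kernel((&mut output).partition([128]), &input).sync_on(&stream)?;
// `output` now holds the result; nothing to unpartition.
```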
Raw pointer entry points are unsafe fns. Obtain a typed device pointer from a
tensor with tensor.device_pointer(), and make sure the pointer remains valid
for the duration of the kernel launch:
let backing = api::zeros::<f32>(&[1024]).sync_on(&stream)?;
let ptr = backing.device_pointer();
unsafe { raw_ptr_kernel(ptr, 1024) }.sync_on(&stream)?;
Ownership Model#
The core invariant: you get back what you put in.
Read-only inputs (&Tensor params)#
| Input | Returned | `'static`? |
|---|---|---|
| `Tensor<T>` | `Tensor<T>` | Yes |
| `Arc<Tensor<T>>` | `Arc<Tensor<T>>` | Yes |
| `&Tensor<T>` | (borrow; caller keeps ownership) | No (not `'static`) |
Mutable outputs (&mut Tensor params)#
| Input | Returned | `unpartition()` needed? |
|---|---|---|
| `Partition<Tensor<T>>` | `Partition<Tensor<T>>` | Yes |
| `Partition<&mut Tensor<T>>` | (borrow; writes through the `&mut`) | No — tensor is written in place |
The borrowed form is created with `(&mut tensor).partition(shape)`, as sketched earlier under Supported Kernel Parameter Types.
Owned: Tensor<T>#
Pass a tensor directly — the launcher wraps it in Arc internally for the
kernel, then unwraps it back afterward (safe because refcount is 1):
let output = my_kernel(
api::zeros(&[1024]).partition([128]),
api::ones::<f32>(&[1024]), // DeviceOp<Output=Tensor<f32>>
)
.first()
.unpartition()
.sync_on(&stream)?;
Use this for single-use tensors where you don’t need shared access.
Borrowed: &Tensor<T>#
Pass a reference when you want to retain ownership and avoid Arc overhead.
The borrow checker ensures the tensor outlives the kernel:
let weights: Tensor<f32> = api::ones(&[1024]).sync_on(&stream)?;
// Borrow — no Arc allocation, no refcount.
let result = my_kernel(out_partition, &weights).sync_on(&stream)?;
// weights is still available here.
Key safety property: because &Tensor<T> is not 'static,
tokio::spawn rejects operations that borrow tensors:
let op = my_kernel(out, &weights); // borrows weights
tokio::spawn(op.into_future()); // ← compile error: not 'static
This is enforced at compile time by Rust’s lifetime system — no runtime checks needed.
.unwrap_arc()#
.shared() and unzip produce Arc<T> outputs. When you need owned T
back (e.g., to partition a tensor), use .unwrap_arc():
let x: Arc<Tensor<f32>> = api::ones(&[1024]).shared().sync()?;
let owned: Tensor<f32> = value(x).unwrap_arc().sync()?;
let partitioned = owned.partition([128]);
Panics if the Arc has multiple owners.
IntoDeviceOp: Automatic Wrapping#
The IntoDeviceOp trait lets kernel launchers accept both DeviceOps and
plain values:
| Type | Wraps as |
|---|---|
| Any `DeviceOp` | Pass-through |
| `Tensor<T>` | An op that immediately yields the tensor |
| `Arc<Tensor<T>>` | An op that immediately yields the shared tensor |
| `&Tensor<T>` | A borrowing op that immediately yields the reference |
| Scalar values | An op that immediately yields the value |
// All of these work as inputs to a &Tensor kernel param:
my_kernel(out, tensor); // Tensor<T>
my_kernel(out, arc_tensor); // Arc<Tensor<T>>
my_kernel(out, &tensor); // &Tensor<T>
my_kernel(out, api::ones(&[1024])); // DeviceOp<Output=Tensor<T>>
Scheduling Model#
Stream assignment#
When you call .sync() or .await, the operation asks the default
device’s scheduling policy for a stream. The default policy is
StreamPoolRoundRobin with 4 streams:
op_a.sync() → Stream 0
op_b.sync() → Stream 1
op_c.sync() → Stream 2
op_d.sync() → Stream 3
op_e.sync() → Stream 0 (wraps around)
Consecutive independent operations land on different streams, enabling GPU
overlap. Operations chained with .then() share the parent’s stream,
preserving data-dependency ordering.
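A sketch of that overlap from async code, using ops that own their tensors so they are `'static`:

```rust
// Independent ops: the default round-robin policy hands each its own
// stream, leaving the GPU free to overlap them.
let (a, b) = tokio::join!(
    api::randn(0.0f32, 1.0, [1 << 20], None).into_future(),
    api::rand::<f32, 1>([1 << 20], None).into_future(),
);
let (a, b) = (a?, b?);
```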
Explicit Stream: .sync_on()#
Bypasses the policy entirely. All operations given the same stream execute in call order:
let stream = device.new_stream()?;
let a = op_a.sync_on(&stream)?; // Stream X
let b = op_b.sync_on(&stream)?; // Stream X — guaranteed after op_a
Available Policies#
| Policy | Behavior |
|---|---|
| `StreamPoolRoundRobin` | Rotates through N streams (default 4) |
| | All operations on one stream — strict ordering |
| Custom | Implement the scheduling-policy trait yourself |
.then() Guarantees#
.then() is the recommended way to express data dependencies. Both
operations share a single stream, so the second is guaranteed to see the
first’s output fully written — no manual synchronization needed:
let result = allocate_buffer()
.then(|buf| fill_kernel(buf)) // same stream
.then(|buf| process_kernel(buf)) // same stream
.sync()?;
Non-reentrancy: On any given thread, only one DeviceOp may be
executing at a time. Calling sync_on, sync, or .await inside a
then closure will return a runtime error. This prevents CUDA data
races from cross-stream access to in-flight tensors. If you need
nested execution and have verified there are no cross-stream data
races, use unsafe then_unchecked.
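For example, with the placeholder kernels from above:

```rust
let result = allocate_buffer()
    .then(|buf| {
        // Runtime error if driven here: `fill_kernel(buf).sync_on(&stream)`
        // would nest execution inside the closure.
        fill_kernel(buf) // correct: return the op; the outer sync() drives it
    })
    .sync()?;
```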
Error Propagation#
All execution methods return Result<T, DeviceError>. Errors propagate
through combinators: if any operation in a .then() chain fails, the
error short-circuits to the caller.
DeviceError Variants#
| Variant | When it occurs |
|---|---|
| | CUDA driver call failed (OOM, invalid argument, etc.) |
| | Device context assertion failed |
| | Kernel compilation or cache lookup failed |
| | No stream available or policy misconfigured |
| `Launch` | Kernel launch precondition violated |
| | Bug in cuda-async internals |
| | Converted from a lower-level error type |
Error Handling Patterns#
// Pattern 1: Propagate with ?
let x = api::zeros(&[1024]).sync_on(&stream)?;
// Pattern 2: Match specific errors
match my_kernel(args).sync_on(&stream) {
Ok(result) => { /* use result */ }
Err(DeviceError::Launch(msg)) => {
eprintln!("kernel launch failed: {msg}");
}
Err(e) => return Err(e.into()),
}
cutile::error::Error vs DeviceError#
cutile::error::Error is the top-level error type that wraps
DeviceError alongside other error categories (I/O, shape mismatches,
etc.). Functions that only do GPU work return DeviceError; functions
that mix host and device work (like the examples) return
cutile::error::Error.
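A sketch of the typical split (type paths abbreviated; the conversion between the two error types via `?` is an assumption consistent with the wrapping described above):

```rust
// GPU-only helper: DeviceError is enough.
fn gpu_step(stream: &Stream) -> Result<Tensor<f32>, DeviceError> {
    api::zeros::<f32>(&[1024]).sync_on(stream)
}

// Mixed host and device work: return the top-level error type.
fn pipeline(stream: &Stream) -> Result<Vec<f32>, cutile::error::Error> {
    let t = gpu_step(stream)?; // DeviceError converts into the wrapper
    Ok(t.to_host_vec().sync_on(stream)?)
}
```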
CUDA Graph Integration#
Combinator approach: .graph_on(stream)#
Any DeviceOp can be captured into a replayable CUDA graph:
let forward_op = build_forward(&cfg, &weights, input, buffers);
let mut graph = forward_op.graph_on(stream.clone())?;
let output = graph.take_output().unwrap();
// Replay loop — no graph rebuilding, no kernel re-compilation.
for token in tokens {
graph.update(api::memcpy(&mut input_buf, &token))?;
graph.launch().sync_on(&stream)?;
}
This requires Arc<Tensor<T>> + try_partition for shared buffers.
Scope approach: CudaGraph::scope#
CudaGraph::scope provides an imperative alternative using &mut borrows
instead of Arc. Each s.record(op) records a graph node and releases
borrows immediately. A buffer written by one record call can be read
by the next:
let mut output = api::zeros::<f32>(&[d]).sync_on(&stream)?;
let weights = api::ones::<f32>(&[d]).sync_on(&stream)?;
let graph = CudaGraph::scope(&stream, |s| {
s.record(kernel1((&mut output).partition([128]), &weights))?;
s.record(kernel2((&mut output).partition([64]), &weights))?;
Ok(())
})?;
graph.launch().sync_on(&stream)?;
record only accepts operations that implement GraphNode — kernel
launches and memcpy. Allocation ops (zeros, ones, dup) are
rejected at compile time because their addresses may change on replay.
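The rejection is visible at the call site:

```rust
let graph = CudaGraph::scope(&stream, |s| {
    // Compile error if uncommented: allocation ops are not GraphNode,
    // because their device addresses could change on replay.
    // s.record(api::zeros::<f32>(&[128]))?;

    s.record(api::memcpy(&mut output, &weights))?; // OK: memcpy is a GraphNode
    Ok(())
})?;
```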
GraphNode trait#
GraphNode is a marker trait for operations safe to record in a CUDA
graph. Only operations that do not allocate or free device memory
implement it:
| Implements | Why safe |
|---|---|
| Macro-generated kernel launchers | Kernel launch only — no alloc/free |
| `api::memcpy` | Copy between pre-allocated buffers |
| | No GPU work |
CudaGraph methods#
| Method | What it does |
|---|---|
| `.graph_on(stream)` | Capture a `DeviceOp` into a replayable CUDA graph |
| `CudaGraph::scope` | Scoped capture with `&mut` borrows instead of `Arc` |
| `record(op)` | Record a graph node inside a scope |
| `take_output()` | Retrieve the output from the capture execution |
| `update(op)` | Run a graph-node op (such as `api::memcpy`) against captured buffers before replay |
| `launch()` | Returns a `DeviceOp` that replays the captured graph |
All device pointers are baked in at capture time. To vary inputs, pre-allocate
a buffer, pass it into the operation, and memcpy new data before each
launch. See Tutorial 10 for a
complete walkthrough.
See Also#
Device Operations — tutorial-style guide to streams, scheduling, and composition patterns
Tutorial 10 — end-to-end CUDA graph example
Interoperability — integrating custom CUDA C++ kernels into the DeviceOp model