Host vs. Device Code#
cuTile Rust programs have two parts. Host code runs on the CPU and owns normal Rust control flow, tensor allocation, stream selection, kernel launch, and result readback. Device code runs on the GPU and describes the work performed by each tile block.
use cutile::prelude::*;
use kernels::scale;

fn main() -> Result<(), cuda_async::error::DeviceError> {
    let device = cuda_core::Device::new(0)?;
    let stream = device.new_stream()?;
    let x = api::ones::<f32>(&[32, 32]).sync_on(&stream)?;
    let mut z = api::zeros::<f32>(&[32, 32]).sync_on(&stream)?;
    let _ = scale((&mut z).partition([4, 4]), &x, 2.0f32).sync_on(&stream)?;
    Ok(())
}

#[cutile::module]
mod kernels {
    use cutile::core::*;

    #[cutile::entry()]
    fn scale<const S: [i32; 2]>(
        z: &mut Tensor<f32, S>,
        x: &Tensor<f32, { [-1, -1] }>,
        alpha: f32,
    ) {
        let tile_x = load_tile_like(x, z);
        z.store(tile_x * alpha);
    }
}
main is host code. It constructs tensors, partitions the writable output, and synchronizes the returned DeviceOp. scale is device code. It runs once per tile block and sees one mutable output partition at a time.
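To make the "once per tile block" model concrete, the following is a minimal CPU-only sketch in plain Rust. It does not use cuTile; the function name `scale_by_tiles` and the row-major layout are assumptions made for illustration. Each iteration of the two outer loops stands in for one tile block, which handles exactly one output partition.

```rust
/// CPU model of the `scale` launch (hypothetical helper, not a cuTile API).
/// `x` and `z` are row-major `rows x cols` matrices; `tile` is the partition shape.
/// Returns the number of simulated tile blocks.
fn scale_by_tiles(
    x: &[f32],
    z: &mut [f32],
    rows: usize,
    cols: usize,
    tile: [usize; 2],
    alpha: f32,
) -> usize {
    assert_eq!(x.len(), rows * cols);
    assert_eq!(z.len(), rows * cols);
    assert!(rows % tile[0] == 0 && cols % tile[1] == 0);

    let grid = [rows / tile[0], cols / tile[1]];
    let mut blocks = 0;
    for by in 0..grid[0] {
        for bx in 0..grid[1] {
            // Body of one simulated tile block: read the matching tile of `x`,
            // scale it, and write it into this block's partition of `z`.
            for r in 0..tile[0] {
                for c in 0..tile[1] {
                    let idx = (by * tile[0] + r) * cols + (bx * tile[1] + c);
                    z[idx] = x[idx] * alpha;
                }
            }
            blocks += 1;
        }
    }
    blocks
}
```

Under this model, a 32 by 32 output partitioned into 4 by 4 tiles is covered by an 8 by 8 grid of simulated tile blocks. On the GPU the blocks run in parallel rather than in loop order.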
Modules and Entry Points#
#[cutile::module] marks a Rust module whose functions can be compiled for the GPU. #[cutile::entry()] marks a function as a kernel entry point. Entry points are callable from host code through generated launcher APIs.
Entry points follow four rules:
- They must be inside a #[cutile::module].
- Writable tensor parameters use static tile shapes, such as Tensor<f32, S> or Tensor<f32, { [BM, BN] }>.
- Read-only tensor parameters may use dynamic dimensions, such as Tensor<f32, { [-1, -1] }>.
- Kernels write results into tensor parameters instead of returning values.
Unmarked functions inside a #[cutile::module] are device functions. They can be called from entry points or other device functions and are inlined during compilation, but they cannot be launched directly.
Kernel Launchers#
The generated launcher accepts host-side values that correspond to the device-side kernel signature:
| Kernel parameter | Host input | Device view |
|---|---|---|
|  |  | Read-only tensor |
|  |  | Writable partition |
| Scalar | Same scalar | Same scalar |
Mutable tensors are partitioned before launch so each tile block writes a disjoint region:
let mut z = api::zeros::<f32>(&[32, 32]).sync_on(&stream)?;
let _ = scale((&mut z).partition([4, 4]), &x, 2.0f32).sync_on(&stream)?;
Read-only tensors can be borrowed, moved, or shared with Arc. Multiple tile blocks may read the same tensor concurrently.
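The same ownership idea can be sketched on the CPU with the standard library alone. In this one-dimensional analogy (not cuTile code; `scale_disjoint` is a hypothetical name), `chunks_mut` plays the role of the partition by handing each worker an exclusive, non-overlapping region of the output, while every worker shares the same read-only input.

```rust
use std::thread;

/// Scales `x` into `z`, giving each worker thread one disjoint chunk of `z`.
/// A plain-Rust analogy for partitioned writes and shared reads.
fn scale_disjoint(x: &[f32], z: &mut [f32], chunk: usize, alpha: f32) {
    assert_eq!(x.len(), z.len());
    assert!(chunk > 0);
    thread::scope(|s| {
        for (i, out) in z.chunks_mut(chunk).enumerate() {
            // `out` is an exclusive borrow of one region of `z`;
            // `x` is a shared borrow that every worker reads concurrently.
            s.spawn(move || {
                let start = i * chunk;
                for (j, slot) in out.iter_mut().enumerate() {
                    *slot = x[start + j] * alpha;
                }
            });
        }
    });
}
```

Because the chunks cannot overlap, the borrow checker accepts the concurrent writes without locks. Partitioning a mutable tensor before launch gives the same guarantee for tile blocks.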
Host and Device Types#
The same names appear on both sides, but host and device types carry different information:
| Type | Side | Parameterized by | Use |
|---|---|---|---|
|  | Host | Element type | GPU allocation managed from CPU code |
|  | Host | Element type | Owned writable launch partition |
|  | Host | Element type | Borrowed writable launch partition |
|  | Host | Element type | Shared read-only input |
|  | Device | Element type and shape | Kernel tensor parameter |
|  | Device | Element type and shape | Device-side tiled view of a read-only tensor |
|  | Device | Element type, tile shape, and map shape | Advanced writable output that lets one tile block process multiple logical output tiles |
|  | Device | Element type and static shape | Register-resident compute value |
Host tensors use runtime shapes because allocation sizes are ordinary runtime data. Device tensors and tiles carry shape information in the type system so the compiler can check operations and specialize generated code.
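The split between runtime shapes and type-level shapes can be illustrated in plain Rust with const generics. This is a sketch only: `HostTensor` and `StaticTile` are hypothetical stand-ins, not the cuTile types, and they model nothing beyond where the shape lives.

```rust
/// Host-style tensor (hypothetical): the shape is ordinary runtime data.
struct HostTensor {
    shape: Vec<usize>,
    data: Vec<f32>,
}

impl HostTensor {
    fn zeros(shape: &[usize]) -> Self {
        let len: usize = shape.iter().product();
        HostTensor { shape: shape.to_vec(), data: vec![0.0; len] }
    }
}

/// Device-style tile (hypothetical): the shape is part of the type.
#[derive(Clone, Copy)]
struct StaticTile<const R: usize, const C: usize> {
    data: [[f32; C]; R],
}

impl<const R: usize, const C: usize> StaticTile<R, C> {
    fn splat(v: f32) -> Self {
        StaticTile { data: [[v; C]; R] }
    }

    /// Elementwise add; it only type-checks when both tiles have the same shape.
    fn add(self, other: StaticTile<R, C>) -> StaticTile<R, C> {
        let mut out = self;
        for r in 0..R {
            for c in 0..C {
                out.data[r][c] += other.data[r][c];
            }
        }
        out
    }
}
```

With these definitions, adding a `StaticTile<4, 4>` to a `StaticTile<4, 8>` is rejected at compile time, whereas a shape mismatch between two `HostTensor` values could only be detected at run time.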
MappedPartitionMut is used by kernels that need a custom traversal over the output grid, such as persistent GEMM. The host creates one by mapping a mutable partition:
let z = z.partition([BM, BN]).map([4, 1], num_tile_blocks);
The device entry point takes it by value:
fn kernel<const MAP_SHAPE: [i32; 2]>(
    mut z: MappedPartitionMut<f32, { [BM, BN] }, MAP_SHAPE>,
) {
    for out_idx in z.iter_indices() {
        // Compute one logical output tile.
        z.store(tile, out_idx);
    }
}
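The text does not spell out how logical output tiles are assigned to tile blocks, so the following is only one plausible scheme: the grid-stride assignment commonly used by persistent kernels, written in plain Rust. The function `tiles_for_block` and the linear tile numbering are assumptions for illustration and are not taken from the cuTile API.

```rust
/// Logical output tile indices handled by one tile block under a
/// grid-stride scheme (an assumed mapping, not necessarily cuTile's):
/// block `b` handles tiles b, b + num_blocks, b + 2 * num_blocks, ...
fn tiles_for_block(block: usize, num_blocks: usize, num_tiles: usize) -> Vec<usize> {
    assert!(num_blocks > 0);
    (block..num_tiles).step_by(num_blocks).collect()
}
```

With 4 tile blocks and 10 logical tiles, block 1 would process tiles 1, 5, and 9. Every tile is visited by exactly one block, which keeps the writes disjoint even though each block stores more than one tile.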
DeviceOp Basics#
Tensor constructors, kernel launchers, host readbacks, and composition helpers return DeviceOps. A DeviceOp is a lazy description of GPU work. Nothing runs until it is synchronized, awaited, or captured into a CUDA graph.
let z = api::zeros::<f32>(&[32, 32]); // No allocation has run yet.
let z = z.sync_on(&stream)?; // The operation runs here.
Kernel launchers are DeviceOps too:
let op = scale((&mut z).partition([4, 4]), &x, 2.0f32);
let _ = op.sync_on(&stream)?;
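Lazy evaluation of this kind can be modeled with a boxed closure. The sketch below is a toy in plain Rust, not the DeviceOp implementation: `LazyOp`, `map`, `sync`, and `lazy_zeros` are hypothetical names, and the run counter exists only to make the laziness observable.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Toy lazy operation (hypothetical): stores work and runs it only on `sync`.
struct LazyOp<T> {
    work: Box<dyn FnOnce() -> T>,
}

impl<T: 'static> LazyOp<T> {
    fn new(work: impl FnOnce() -> T + 'static) -> Self {
        LazyOp { work: Box::new(work) }
    }

    /// Composes a follow-up step without running anything yet.
    fn map<U: 'static>(self, f: impl FnOnce(T) -> U + 'static) -> LazyOp<U> {
        LazyOp::new(move || f((self.work)()))
    }

    /// Executes the stored work and returns its result.
    fn sync(self) -> T {
        (self.work)()
    }
}

/// Describes a zero-filled allocation; `runs` counts how often it executes.
fn lazy_zeros(len: usize, runs: Rc<Cell<u32>>) -> LazyOp<Vec<f32>> {
    LazyOp::new(move || {
        runs.set(runs.get() + 1);
        vec![0.0; len]
    })
}
```

Building `lazy_zeros(4, runs.clone()).map(|v| v.len())` leaves the counter at zero. The allocation closure runs exactly once, when `sync` is called on the composed operation.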
The full host-side execution model is described in Device Operations.
First-Use Compilation#
Calling a generated launcher may compile the device code before it launches. A specialization is one compiled kernel variant for a particular entry function, target GPU, and set of compile-time inputs. The first launch of a specialization compiles the captured Rust AST to Tile IR bytecode and then to a cubin. Later launches with the same specialization reuse the cached binary.
Specialization depends on element types, const generic values, compile options, and other compile-time parameters. Dynamic tensor dimensions can vary across launches without creating a new specialization. See Compilation for the cache and specialization rules in detail.
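The caching behavior described above can be sketched as a map keyed by compile-time inputs. This is an illustrative model in plain Rust, not the cuTile cache: `SpecKey`, `KernelCache`, and the field names are hypothetical, and the "binary" is a placeholder byte vector. Runtime dimensions are passed in but deliberately left out of the key.

```rust
use std::collections::HashMap;

/// Compile-time inputs that identify one specialization (hypothetical fields).
#[derive(Clone, PartialEq, Eq, Hash)]
struct SpecKey {
    entry: &'static str,
    target_gpu: &'static str,
    element_type: &'static str,
    const_generics: Vec<i32>,
}

/// Toy cache that counts how many times a "compile" actually happens.
struct KernelCache {
    binaries: HashMap<SpecKey, Vec<u8>>,
    compiles: usize,
}

impl KernelCache {
    fn new() -> Self {
        KernelCache { binaries: HashMap::new(), compiles: 0 }
    }

    /// Returns the cached binary, compiling on first use of `key`.
    /// `_runtime_dims` is ignored on purpose: dynamic dimensions do not
    /// select a different specialization in this model.
    fn get_or_compile(&mut self, key: SpecKey, _runtime_dims: [usize; 2]) -> &Vec<u8> {
        let compiles = &mut self.compiles;
        self.binaries.entry(key).or_insert_with(|| {
            *compiles += 1;
            vec![0xCA, 0xFE] // placeholder bytes standing in for a compiled binary
        })
    }
}
```

In this model, two launches with the same key but different runtime dimensions trigger a single compile, while changing a const generic value produces a second cache entry.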
Continue to Tensors and Tiles.