Execution Model#
This page describes how cuTile Rust programs execute on NVIDIA GPUs.
Abstract Machine#
cuTile Rust is built on Tile IR — a tile virtual machine and instruction set that models the GPU as a tile-based processor. Unlike the traditional SIMT (Single Instruction Multiple Thread) model, Tile IR enables programming in terms of tiles (multi-dimensional array fragments) rather than individual threads.
cuTile Rust targets an abstract machine that maps to CUDA:
Abstract Concept |
What It Represents |
CUDA Mapping |
|---|---|---|
Tile Block |
A single logical thread of execution that processes one tile of data |
One or more thread blocks, clusters, or warps (compiler-determined) |
Tile |
An immutable multi-dimensional array fragment with compile-time static shape |
Data in registers |
Tensor |
A multi-dimensional array in global memory, accessed via structured views |
Data in HBM, accessed through typed pointers with shape and stride metadata |
The mapping of both the grid and individual tile blocks to the underlying hardware threads is abstracted away and handled entirely by the compiler. This includes:
Thread block and cluster configuration.
Register allocation.
Shared memory staging.
Memory coalescing.
Tensor Core utilization.
Execution Spaces#
cuTile Rust operates across two execution spaces, more commonly referred to as the host-side and device-side:
Host Side (CPU)#
Allocates GPU memory.
Launches kernels.
Manages data transfers.
Coordinates async operations.
// Host code: sets up data and launches the kernel
fn main() -> Result<(), Error> {
let ctx = CudaContext::new(0)?;
let stream = ctx.new_stream()?;
let x: Arc<Tensor<f32>> = ones([1024, 1024]).arc().sync_on(&stream)?;
let y: Arc<Tensor<f32>> = ones([1024, 1024]).arc().sync_on(&stream)?;
let z = zeros([1024, 1024]).sync_on(&stream)?.partition([64, 64]);
let (z, _x, _y) = add(z, x, y).sync_on(&stream)?;
Ok(())
}
Device Side (GPU)#
Concurrently executes kernel code on tile threads.
Operates on tiles in registers.
Accesses global memory through tensors.
// Device code: runs on GPU
#[cutile::entry()]
fn add<const S: [i32; 2]>(
c: &mut Tensor<f32, S>,
a: &Tensor<f32, {[-1, -1]}>,
b: &Tensor<f32, {[-1, -1]}>
) {
// The compiler performs automatic parallelization,
// executing the following code on hundreds of GPU threads in parallel
let tile_a = load_tile_like_2d(a, c);
let tile_b = load_tile_like_2d(b, c);
c.store(tile_a + tile_b);
}
Kernel Entry Points#
The #[cutile::entry()] attribute marks a function as a GPU kernel entry point:
#[cutile::module]
mod my_kernels {
use cutile::core::*;
#[cutile::entry()]
fn my_kernel<const TILE_SIZE: [i32; 2]>(
output: &mut Tensor<f32, TILE_SIZE>,
input: &Tensor<f32, {[-1, -1]}>
) {
// Kernel body
}
}
Entry Point Rules#
Must be in a module — Entry points must be inside a
#[cutile::module]block.Const generics for tile size — An output tensor’s shape must be static. It determines the output tensor’s tile size.
Tensor parameters — All data passes through
Tensorreferences.No return values — Results are written to output tensors.
Tile Concurrency#
When a kernel launches, the runtime creates a grid of tile blocks:
Each tile thread:
Processes one sub-tensor of data.
Runs independently of other tile threads.
Executes the same kernel function.
Tile Block Identification#
Within a kernel, each tile block can query its coordinates in the grid and the total grid dimensions:
#[cutile::entry()]
fn kernel<const S: [i32; 2]>(
output: &mut Tensor<f32, S>,
input: &Tensor<f32, {[-1, -1]}>
) {
let pid: (i32, i32, i32) = get_tile_block_id(); // This block's (x, y, z)
let grid: (i32, i32, i32) = get_num_tile_blocks(); // Grid dimensions
// For element-wise ops, load_tile_like_2d uses the output's
// partition to determine which region this block processes:
let tile = load_tile_like_2d(input, output);
output.store(tile);
}
Constantness#
cuTile Rust enforces compile-time constantness for key values:
Compile-Time Constants#
These must be known at compile time. Some compile-time constants include:
Tile dimensions — Shape of tiles in registers.
Dtype — Element data types.
Reduction axes — Which dimensions to reduce.
// Tile shape is a compile-time constant, specified via const generics
#[cutile::entry()]
fn kernel<const TILE_SIZE: [i32; 2]>(
output: &mut Tensor<f32, TILE_SIZE>, // [64, 64] known at compile time
input: &Tensor<f32, {[-1, -1]}> // Dynamic size
) {
let tile = load_tile_like_2d(input, output);
// Reduction axis is a compile-time constant
let max_vals = reduce_max(tile, 1i32); // Reduce along axis 1
output.store(tile);
}
Runtime Values#
These can vary at runtime:
Tensor dimensions — Size of input tensors.
Tensor data — Actual element values.
Grid size — Number of tile blocks to launch.
Python Subset (Comparison)#
If you’re familiar with cuTile Python, here’s how concepts map:
cuTile Python |
cuTile Rust |
|---|---|
|
|
|
|
|
|
|
Implicit via partition |
|
Async operation + |
Both use the same underlying compilation pipeline and generate equivalent GPU code.
Next Steps#
Learn about the Data Model for details on types and shapes
Explore Memory Hierarchy for performance optimization
See Async Execution for concurrent CPU/GPU work
See Interoperability for integrating custom CUDA kernels into the
DeviceOperationmodel