Tensors and Tiles#

cuTile Rust kernels operate by moving data from tensors into tiles, computing on those tiles, and storing the result back to tensors.

Property

Tensor

Tile

Location

Global memory (HBM)

Registers

Mutability

Mutable or read-only

Immutable

Shape

Static, dynamic, or mixed

Static

Operations

Load and store

Arithmetic, reductions, matrix multiply, shape ops

Lifetime

Persists across kernels

Exists only inside a kernel

Addressable

Yes

No

#[cutile::entry()]
fn add<const S: [i32; 2]>(
    z: &mut Tensor<f32, S>,
    x: &Tensor<f32, { [-1, -1] }>,
    y: &Tensor<f32, { [-1, -1] }>,
) {
    let tile_x = load_tile_like(x, z);
    let tile_y = load_tile_like(y, z);
    z.store(tile_x + tile_y);
}

The output tensor is mutable and already partitioned by the host launcher. The input tensors are read-only and can be loaded into tiles that match the output partition.

Tensor, Partition, Tile#

Tensor<E, S> is the device-side view of global memory. Kernels receive tensors as parameters. A tensor can be loaded into a Tile<E, S> or used to create a device-side partition.

Partition<E, S> is a tiled view of a tensor. On the host, partitioning a mutable tensor determines how many tile blocks launch and which region each block writes. On the device, partitioning a read-only tensor lets a kernel load arbitrary tiles by index.

Tile<E, S> is an immutable register-resident array fragment. Tile operations create new tiles instead of mutating the original value:

let tile = load_tile_like(x, z);
let shifted = tile + 1.0f32;
let scaled = shifted * 2.0f32;
z.store(scaled);

The core data flow is:

Data flow: Load from Tensor to Tile, Compute in registers, Store back to Tensor

Tensor -> Partition -> Tile -> Compute -> Store

Partitioning and the Grid#

Mutable outputs are partitioned on the host before launch:

let mut z = api::zeros::<f32>(&[1024, 1024]).sync_on(&stream)?;
let _ = add((&mut z).partition([64, 64]), &x, &y).sync_on(&stream)?;

The partition shape becomes the static shape seen by the kernel. A [1024, 1024] tensor partitioned as [64, 64] creates a logical grid of (16, 16, 1) tile blocks. Each tile block receives one writable sub-tensor.

Read-only inputs do not need host-side partitioning. Inside the kernel, partition them with the shape needed by the algorithm:

let pid: (i32, i32, i32) = get_tile_block_id();
let part_x = x.partition(const_shape![BM, BK]);
let tile_x = part_x.load([pid.0, i]);

The same read-only tensor can be partitioned multiple ways inside one kernel. This is common in matrix multiplication, where the left-hand side and right-hand side are loaded with different tile shapes.

The launch grid is inferred from mutable output partitions unless the launcher sets it explicitly:

kernel(z.partition([64, 64]), x).grid((16, 16, 1)).sync_on(&stream)?;

When a kernel has multiple mutable outputs, their inferred grids must match.

Static and Dynamic Shapes#

Static dimensions are compile-time constants. Dynamic dimensions use -1 and are resolved from the runtime tensor shape.

#[cutile::entry()]
fn normalize<const S: [i32; 2]>(
    z: &mut Tensor<f32, S>,          // Static tile shape from partition.
    x: &Tensor<f32, { [-1, -1] }>,   // Runtime full tensor shape.
) {
    let tile = load_tile_like(x, z);
    z.store(tile);
}

Static shapes let the compiler check operations and optimize layout. Dynamic dimensions let the same compiled variant handle different full tensor sizes. The common pattern is static output tile shape and dynamic read-only input shape.

Const generic arrays such as const S: [i32; 2] and const_shape![BM, BK] carry tile dimensions through the type system. Changing a const generic value can create a new compiled variant.

Loading, Computing, Storing#

load_tile_like(input, output) loads a read-only tensor region matching the output tensor’s tile shape and tile-block coordinates. For explicit device-side partitions, call partition.load(index):

let part_x = x.partition(const_shape![BM, BK]);
let tile_x = part_x.load([pid.0, k_tile]);

Writable tensors store tile results:

let result = tile_x + tile_y;
z.store(result);

Use load_tile_mut when a kernel needs to read the existing output value before writing a new one:

let acc = load_tile_mut(z);
z.store(acc + update);

Operations at a Glance#

The DSL API reference has complete signatures. These are the operation families used most often inside kernels:

Category

Examples

Load and store

load_tile_like, load_tile_mut, Partition::load, Tensor::store

Arithmetic

+, -, *, /, fma, true_div

Math

exp, log, sqrt, rsqrt, sin, cos, tanh

Reduction and scan

reduce_max, reduce_sum, reduce_min, scan

Matrix multiply

mma, mmaf_scaled

Shape manipulation

reshape, broadcast, transpose, const_shape!

Comparison

gt_tile, ge_tile, lt_tile, le_tile, eq_tile, select

Creation and conversion

constant, iota, convert_tile, pack, unpack

For element types, operation signatures, and lower-level memory operations, see the DSL API.

Broadcasting and Reductions#

Broadcasting expands a scalar or smaller tile to match a larger tile shape. It follows NumPy-style rules: align dimensions from the right; each dimension must either match or be 1.

let bias: Tile<f32, { [1, BN] }> = ...;
let x: Tile<f32, { [BM, BN] }> = ...;
let y = x + bias.broadcast(const_shape![BM, BN]);

Reductions collapse one axis. Reshape the reduced result before broadcasting it back:

fn softmax<const BM: i32, const BN: i32>(
    x: Tile<f32, { [BM, BN] }>,
) -> Tile<f32, { [BM, BN] }> {
    let max = reduce_max(x, 1i32)
        .reshape(const_shape![BM, 1])
        .broadcast(const_shape![BM, BN]);
    let stable = x - max;

    let exp_x = exp(stable);
    let sum = reduce_sum(exp_x, 1i32)
        .reshape(const_shape![BM, 1])
        .broadcast(const_shape![BM, BN]);

    true_div(exp_x, sum)
}

Numerically Stable Softmax#

Subtract the per-row maximum before exp when implementing softmax. This prevents overflow on large positive inputs and is the pattern used in fused softmax and attention kernels.

Tiled Matrix Multiply#

Matrix multiply accumulates repeated mma calls across the K dimension. Each tile block owns one output tile. Each loop iteration loads a [BM, BK] tile from the left input and a [BK, BN] tile from the right input.

fn tiled_gemm<
    E: ElementType,
    const BM: i32,
    const BN: i32,
    const BK: i32,
>(
    z: &mut Tensor<f32, { [BM, BN] }>,
    x: &Tensor<E, { [-1, -1] }>,
    y: &Tensor<E, { [-1, -1] }>,
) {
    let pid: (i32, i32, i32) = get_tile_block_id();
    let k_tiles = x.shape()[1] / BK;

    let part_x = x.partition(const_shape![BM, BK]);
    let part_y = y.partition(const_shape![BK, BN]);

    let mut acc = constant(0.0f32, const_shape![BM, BN]);
    for k_tile in 0i32..k_tiles {
        let tile_x = part_x.load([pid.0, k_tile]);
        let tile_y = part_y.load([k_tile, pid.1]);
        acc = mma(tile_x, tile_y, acc);
    }

    z.store(acc);
}

The output accumulator usually uses a wider type than the inputs. For example, FP16 or FP8 inputs often accumulate into FP32.

Type Safety and Generics#

The compiler catches shape mismatches, element-type mismatches, and invalid matrix multiply shapes before code runs:

let a: Tile<f32, { [16, 8] }> = ...;
let b: Tile<f32, { [16, 32] }> = ...;
let c = mma(a, b, acc); // Error: inner dimensions do not match.

Use explicit conversion when element types differ:

let y_float: Tile<f32, { [4, 4] }> = convert_tile(y_int);
let z = x_float + y_float;

Generic kernels can support multiple shapes and element types:

#[cutile::entry()]
fn scale<E: ElementType, const S: [i32; 2]>(
    z: &mut Tensor<E, S>,
    x: &Tensor<E, { [-1, -1] }>,
    alpha: E,
) {
    let tile = load_tile_like(x, z);
    z.store(tile * alpha);
}

Each concrete element type and const generic value can produce a separate compiled variant. Dynamic tensor dimensions do not.


Continue to Compilation.