Writing Computations#

cuTile Rust provides a rich set of operations that work on tiles and leverage GPU parallelism. The DSL API reference has the complete catalog with full signatures; this page focuses on how these building blocks compose into complete algorithms.


Operations at a Glance#

Everything you express inside a kernel boils down to a small set of operation categories:

| Category | Representative operations | Reference |
|---|---|---|
| Load and store | load_tile_mut, load_tile_like, Partition::load, Tensor::store | Memory: Load and Store |
| Arithmetic | +, -, *, /, fma, true_div | Arithmetic (Element-wise) |
| Math | exp, log, sqrt, rsqrt, sin, cos, tanh | Math (Floating-Point) |
| Reduction / scan | reduce_max, reduce_sum, reduce_min, reduce_prod, scan | Reduction and Scan |
| Matrix multiply | mma, permute | Matrix Multiply |
| Shape manipulation | reshape, broadcast, const_shape! | Shape Manipulation |
| Comparison | gt_tile, ge_tile, lt_tile, le_tile, eq_tile, select | Comparison |
| Creation | constant, iota, convert_tile | Creation |

Arithmetic operators apply element-wise to tiles of compatible shapes (broadcasting fills mismatched dimensions where one side is 1). Scalars can be combined with tiles directly or promoted to a tile with broadcast. Math functions, reductions, and shape operations compose freely with arithmetic to express most numerical algorithms.
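
The sketch below illustrates both rules. The biased_sqrt helper name is illustrative, and it assumes sqrt accepts a tile argument the same way exp does in the softmax example further down: a [BM, 1] column is expanded to [BM, BN] before the element-wise add, and a math function is applied to the result:

fn biased_sqrt<const BM: i32, const BN: i32>(
    x: Tile<f32, { [BM, BN] }>, bias: Tile<f32, { [BM, 1] }>
) -> Tile<f32, { [BM, BN] }> {
    // Expand the [BM, 1] column across BN so both operands share one shape.
    let b = bias.broadcast(const_shape![BM, BN]);
    // Math functions compose directly with element-wise arithmetic.
    sqrt(x + b)
}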


Composition Patterns#

The real value of the tile model shows up when you combine operations into full algorithms. Three representative patterns follow.

Scale and shift#

Multiply a tile by a scale and add a bias, combining element-wise arithmetic with broadcast:

fn scale_and_shift<const S: [i32; 2]>(
    x: Tile<f32, S>, scale: f32, shift: f32
) -> Tile<f32, S> {
    // Promote both scalars to tiles matching x's shape.
    let s = scale.broadcast(x.shape());
    let b = shift.broadcast(x.shape());
    // Element-wise multiply-add across the whole tile.
    x * s + b
}
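
The same multiply-add can also be expressed with the fused fma operation from the arithmetic category; its exact operand order is documented in the DSL API reference.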

Numerically stable softmax#

Reductions combine with broadcasting to normalize across a dimension. Subtracting the per-row max before exp prevents overflow on large inputs:

fn softmax<const BM: i32, const BN: i32>(
    x: Tile<f32, { [BM, BN] }>
) -> Tile<f32, { [BM, BN] }> {
    // Per-row maximum, reshaped to a column and broadcast back to [BM, BN].
    let max: Tile<f32, { [BM, BN] }> = reduce_max(x, 1i32)
        .reshape(const_shape![BM, 1])
        .broadcast(const_shape![BM, BN]);
    // Subtracting the row max keeps exp from overflowing on large inputs.
    let stable = x - max;

    let exp_x = exp(stable);
    // Per-row sum of the exponentials, expanded the same way for the division.
    let sum: Tile<f32, { [BM, BN] }> = reduce_sum(exp_x, 1i32)
        .reshape(const_shape![BM, 1])
        .broadcast(const_shape![BM, BN]);

    true_div(exp_x, sum)
}

The reshape(const_shape![BM, 1]) converts the reduced [BM] back to a column vector, then broadcast expands it across the column dimension to match the original [BM, BN] shape.

Figure: reduction operations along axis 0 (columns) and axis 1 (rows).
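
The same reshape-then-broadcast pattern applies along axis 0. The sketch below is illustrative rather than canonical: it assumes reduce_sum along axis 0 yields one value per column, as the figure suggests, and the column_normalize name is made up for the example. Each column is divided by its own sum:

fn column_normalize<const BM: i32, const BN: i32>(
    x: Tile<f32, { [BM, BN] }>
) -> Tile<f32, { [BM, BN] }> {
    // Per-column sums, reshaped to a [1, BN] row and expanded to [BM, BN].
    let sum: Tile<f32, { [BM, BN] }> = reduce_sum(x, 0i32)
        .reshape(const_shape![1, BN])
        .broadcast(const_shape![BM, BN]);
    true_div(x, sum)
}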

Tiled matrix multiply#

Repeated mma calls accumulate into a destination tile across a K loop. Each iteration loads a [BM, BK] tile from x and a [BK, BN] tile from y, multiplies them, and accumulates into acc. The accumulator lives in registers throughout the loop:

fn tiled_gemm<E: ElementType, const BM: i32, const BN: i32, const BK: i32, const K: i32>(
    z: &mut Tensor<E, { [BM, BN] }>,
    x: &Tensor<E, { [-1, K] }>,
    y: &Tensor<E, { [K, -1] }>,
) {
    // Carve the inputs into [BM, BK] and [BK, BN] tiles.
    let part_x = x.partition(const_shape![BM, BK]);
    let part_y = y.partition(const_shape![BK, BN]);
    let pid: (i32, i32, i32) = get_tile_block_id();

    // The accumulator stays in registers for the whole K loop.
    let mut acc = constant(0.0f32, const_shape![BM, BN]);
    for i in 0i32..(K / BK) {
        // Walk the K dimension: block row pid.0 of x against block column pid.1 of y.
        let tile_x = part_x.load([pid.0, i]);
        let tile_y = part_y.load([i, pid.1]);
        acc = mma(tile_x, tile_y, acc);
    }

    z.store(acc);
}
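
Because the loop runs K / BK iterations, the kernel as written assumes K is evenly divisible by BK; a ragged final tile would need separate handling.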

Debugging Computations#

cuda_tile_print! prints from within a kernel:

cuda_tile_print!("Value at tile ({}, {}): {}\n",
    pid.0, pid.1, some_value);

Warning

GPU printing is slow and serializes execution. Use it only for debugging small grids.

For more techniques — type conversion sanity checks, assertions, IR dumping, and profiler integration — see Debugging and Profiling.


Continue to The Execution Model to see how these operations map onto the hardware. For the full catalog of operations with signatures, see the DSL API.