Tuning for Performance#

GPU performance optimization balances three concerns: memory bandwidth (moving data efficiently), compute utilization (keeping ALUs busy), and occupancy (maximizing parallel execution). Good kernels are well-balanced across all three; poor kernels are bottlenecked on one.

Figure: The GPU performance triangle, showing memory bandwidth, compute utilization, and occupancy.

For algorithms where peak performance requires warp-level control or integration with hand-tuned CUDA C++ kernels, see Interoperability.


Compiler Hints and Specialization#

cuTile Rust provides optimization_hints at two levels: entry-level (kernel-wide) and per-op (on individual load/store operations).

Entry-level hints go on the entry annotation. They can also be overridden at runtime via CompileOptions for autotuning — different values trigger separate JIT compilations and are part of the kernel cache key:

#[cutile::entry(
    optimization_hints = (
        sm_120 = (                       // Blackwell-specific hints
            num_cta_in_cga = 2,
            occupancy = 2,
            max_divisibility = 16,
        ),
        sm_90 = (                        // Hopper-specific hints
            num_cta_in_cga = 1,
        ),
    )
)]
fn optimized_kernel<const S: [i32; 2]>(...) { ... }

// Runtime override for autotuning:
use cutile::tile_kernel::CompileOptions;

let result = my_kernel(input)
    .compile_options(CompileOptions::default().occupancy(4).num_cta_in_cga(2))
    .grid(grid)
    .await?;
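The runtime override makes a simple sweep possible. The following is a minimal host-side sketch in plain Rust, with no cuTile types: `HintConfig`, `time_once`, and `pick_fastest` are hypothetical names introduced here for illustration. In real code the `measure` closure would launch the kernel with a `CompileOptions` built from the candidate and return the measured time, ideally after a warm-up launch so that JIT compilation is not counted.

```rust
use std::time::{Duration, Instant};

/// Hypothetical pair of entry-level hints to sweep. The field names mirror the
/// hint names in the text, but this struct is not part of cuTile Rust.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct HintConfig {
    pub occupancy: u32,
    pub num_cta_in_cga: u32,
}

/// Times a single call of `run`, e.g. one kernel launch plus synchronization.
pub fn time_once<F: FnOnce()>(run: F) -> Duration {
    let start = Instant::now();
    run();
    start.elapsed()
}

/// Measures every candidate with `measure` and returns the fastest one.
/// Returns `None` when there are no candidates; ties keep the earlier entry.
pub fn pick_fastest<F>(
    candidates: &[HintConfig],
    mut measure: F,
) -> Option<(HintConfig, Duration)>
where
    F: FnMut(HintConfig) -> Duration,
{
    let mut best: Option<(HintConfig, Duration)> = None;
    for &cfg in candidates {
        let elapsed = measure(cfg);
        let is_better = match best {
            Some((_, best_time)) => elapsed < best_time,
            None => true,
        };
        if is_better {
            best = Some((cfg, elapsed));
        }
    }
    best
}
```

Because each distinct hint value is part of the kernel cache key, a sweep over N candidates pays for N compilations once; later launches with the winning configuration reuse the cached kernel.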

Per-op hints (latency, disallow_tma) apply to individual load/store operations:

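// View load: latency hint of 4 (Option<i32>), TMA allowed.
let tile: Tile<f32, S> =
    load_view_tko(&partition, idx, ordering::Weak, scope::TileBlock, Some(4), tma::Enabled);
// View store: latency left to the compiler (None), TMA disabled for this op.
unsafe {
    store_view_tko_mut(&mut partition, tile, idx, ordering::Weak, scope::TileBlock, None, tma::Disabled);
}
// Pointer load: the latency hint is passed as the type-level Latency::<4>.
let (values, token) =
    load_ptr_tko(ptrs, ordering::Weak, None::<scope::TileBlock>, None, None, None, Latency::<4>);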

| Level | Hint | Description | Default |
|---|---|---|---|
| Entry | max_divisibility | Cap on auto-inferred alignment divisor | 16 |
| Entry | num_cta_in_cga | CTAs in a Cooperative Grid Array (thread block cluster) | 1 |
| Entry | occupancy | Target occupancy level | Auto |
| Per-op | latency | Latency optimization hint (Option<i32> for view ops, Latency<N> for pointer ops) | Compiler decides / Latency<0> |
| Per-op | disallow_tma | Disable Tensor Memory Accelerator for this op | false (TMA allowed) |
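To make the idea behind max_divisibility concrete, here is a small sketch of a capped divisibility inference. It assumes the inferred divisor is the largest power of two dividing a value (a dimension, stride, or address); that is an assumption made for illustration and not a description of the cuTile Rust compiler's actual algorithm. `capped_divisor` is a hypothetical helper.

```rust
/// Largest power-of-two divisor of `value`, capped at `max_divisibility`.
/// Illustrative only: the real inference rules are not specified in this guide.
pub fn capped_divisor(value: u64, max_divisibility: u64) -> u64 {
    assert!(
        max_divisibility.is_power_of_two(),
        "this sketch assumes the cap is a power of two"
    );
    if value == 0 {
        // Zero is divisible by every power of two, so only the cap applies.
        return max_divisibility;
    }
    // The number of trailing zero bits gives the largest power-of-two divisor.
    let natural = 1u64 << value.trailing_zeros();
    natural.min(max_divisibility)
}
```

Under this model a dimension of 4096 is reported as divisible by 16 with the default cap, while a dimension of 24 is reported as divisible by 8. Lowering the cap would let one compiled kernel serve more shapes, at the cost of weaker alignment assumptions.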

Tile size significantly impacts performance. Larger tiles mean fewer memory transactions but more registers per block, reducing occupancy. General guidelines:

| GPU Architecture | Recommended Tile Sizes |
|---|---|
| Ampere (A100) | [128, 128], [64, 64], [256, 64] |
| Hopper (H100) | [128, 128], [64, 128], [128, 256] |
| Ada (RTX 4090) | [64, 64], [128, 64] |

| Tile Size | Registers (approx) | Max Occupancy |
|---|---|---|
| [32, 32] | ~32 | High |
| [64, 64] | ~64-128 | Medium-High |
| [128, 128] | ~256+ | Medium |
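The trade-off between register use and occupancy can be estimated with a few integer divisions. The sketch below is a simplified model: `SmLimits` and `register_limited_occupancy` are hypothetical names, the limits are supplied by the caller, and real hardware adds further constraints (register allocation granularity, shared memory, a cap on resident blocks) that are ignored here.

```rust
/// Per-SM limits used for the estimate. Values must come from the caller;
/// look up the figures for the actual target GPU.
pub struct SmLimits {
    pub registers_per_sm: u32,
    pub max_threads_per_sm: u32,
}

/// Fraction of the SM's thread slots that can be resident when occupancy is
/// limited only by registers and by the thread-slot count.
pub fn register_limited_occupancy(
    limits: &SmLimits,
    regs_per_thread: u32,
    threads_per_block: u32,
) -> f64 {
    assert!(regs_per_thread > 0 && threads_per_block > 0);
    let regs_per_block = regs_per_thread * threads_per_block;
    // How many whole blocks fit by each resource.
    let blocks_by_regs = limits.registers_per_sm / regs_per_block;
    let blocks_by_threads = limits.max_threads_per_sm / threads_per_block;
    let resident_blocks = blocks_by_regs.min(blocks_by_threads);
    f64::from(resident_blocks * threads_per_block) / f64::from(limits.max_threads_per_sm)
}
```

With an assumed 65,536 registers and 2,048 thread slots per SM and 256-thread blocks, 32 registers per thread allows full occupancy, 64 allows half, and 128 allows a quarter. Doubling register use per thread halves the number of resident threads once registers become the limit.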

Runtime bounds checks are another cost worth removing. The preferred safe way to do so is a mapped output partition: the output partition produces bounded, disjoint indices, while input partitions use with_bounds(...) to carry the matching logical grid:

fn gemm_persistent<
    T: ElementType,
    const BM: i32, const BN: i32, const BK: i32,
    const MAP_SHAPE: [i32; 2],
>(
    mut z: MappedPartitionMut<T, { [BM, BN] }, MAP_SHAPE>,
    x: &Tensor<T, { [-1, -1] }>,
    y: &Tensor<T, { [-1, -1] }>,
) {
    let m = num_tiles(&z, 0);
    let n = num_tiles(&z, 1);
    let k = Dim::new(x.shape()[1] / BK);

    let part_x = x.partition(const_shape![BM, BK]).with_bounds((m, k));
    let part_y = y.partition(const_shape![BK, BN]).with_bounds((k, n));

    for out_idx in z.iter_indices() {
        let (bid_m, bid_n) = out_idx.components();
        let acc = compute_tile(bid_m, bid_n, k, &part_x, &part_y);
        z.store(acc, out_idx);
    }
}

On the host side, .map(...) defines the output traversal and lets the launch grid be inferred from the mapped partition:

let z = z.partition([BM, BN]).map([4, 1], num_tile_blocks);
let (z, _x, _y) = gemm_persistent(z, x, y)
    .generics(generics)
    .sync_on(&stream)?;
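The property that matters for safety is that every tile block visits a bounded set of output tiles and no two blocks visit the same tile. The plain-Rust sketch below illustrates that property with a simple strided assignment. `tiles_for_block` is a hypothetical helper, and the traversal order is an assumption: it does not reproduce the order that `.map([4, 1], num_tile_blocks)` actually generates.

```rust
/// Output tile coordinates (row tile, column tile) visited by one tile block
/// when `num_blocks` blocks share an `m_tiles` x `n_tiles` grid of tiles.
/// Block `b` takes linear tile indices b, b + num_blocks, b + 2 * num_blocks, ...
pub fn tiles_for_block(
    block_id: usize,
    num_blocks: usize,
    m_tiles: usize,
    n_tiles: usize,
) -> Vec<(usize, usize)> {
    assert!(num_blocks > 0 && block_id < num_blocks);
    (block_id..m_tiles * n_tiles)
        .step_by(num_blocks)
        // Every linear index is below m_tiles * n_tiles, so both components
        // are in bounds by construction.
        .map(|linear| (linear / n_tiles, linear % n_tiles))
        .collect()
}
```

Collecting the results for all block ids yields each of the `m_tiles * n_tiles` coordinates exactly once, which is the "bounded, disjoint" guarantee in miniature. Because the indices are in range by construction, no per-access check is needed to keep the stores safe.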

unchecked_accesses = true remains available when the programmer wants to opt out of all runtime bounds checks explicitly:

#[cutile::entry(unchecked_accesses = true)]
unsafe fn fast_kernel<const S: [i32; 2]>(...) {
    // No bounds checking - programmer must ensure correctness
}

The older fully static GEMM pattern can also eliminate checks safely by making all tensor dimensions const generics and passing the launch grid with .const_grid(...). That path is mainly useful for legacy kernels or workloads with a very small fixed set of problem sizes, because every new full tensor shape triggers its own specialized JIT compilation.
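The compilation cost of the fully static path follows from how a kernel cache behaves. The sketch below models a cache keyed on kernel name, const generic values, and one hint; `KernelKey` and `KernelCache` are hypothetical types for illustration and do not reflect the layout of cuTile Rust's real cache key.

```rust
use std::collections::HashMap;

/// Hypothetical cache key: anything that changes the generated code.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct KernelKey {
    pub name: &'static str,
    pub const_generics: Vec<i32>,
    pub occupancy: Option<u32>,
}

/// Toy JIT cache that only counts how often a "compilation" happens.
#[derive(Default)]
pub struct KernelCache {
    compiled: HashMap<KernelKey, usize>,
    pub compilations: usize,
}

impl KernelCache {
    /// Returns the id of the compiled artifact, compiling on a cache miss.
    pub fn get_or_compile(&mut self, key: KernelKey) -> usize {
        if let Some(&id) = self.compiled.get(&key) {
            return id; // cache hit: no new compilation
        }
        self.compilations += 1;
        let id = self.compilations;
        self.compiled.insert(key, id);
        id
    }
}
```

When only tile sizes such as `[128, 128, 64]` are const generics, one entry serves every problem size. When full tensor shapes are const generics, each new shape is a new key and therefore a new compilation, as is each distinct hint value.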


Memory Optimization#

Coalesced access — adjacent threads reading adjacent memory locations — is how the GPU memory system is designed to be used. cuTile Rust’s tile load operations automatically generate coalesced access patterns, so you get this for free from load_tile_like, Partition::load, and the standard loading APIs.

Keep data in registers. Load once from global memory, compute many times in registers:

| Memory Level | Latency | Strategy |
|---|---|---|
| Registers | ~0 cycles | Keep data in tiles |
| Shared Memory | ~20 cycles | Reuse across iterations |
| L2 Cache | ~200 cycles | Temporal locality |
| Global Memory | ~400 cycles | Minimize accesses |

#[cutile::entry()]
fn fused_ops<const S: [i32; 2]>(
    output: &mut Tensor<f32, S>,
    input: &Tensor<f32, {[-1, -1]}>
) {
    // Single load from global memory
    let tile = load_tile_like(input, output);

    // Multiple operations in registers (free!)
    let normalized = tile - reduce_max(tile, 1i32);
    let exp_vals = exp(normalized);
    let softmax = true_div(exp_vals, reduce_sum(exp_vals, 1));

    // Single store to global memory
    output.store(softmax);
}
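When validating a fused kernel like this one, a host-side reference is useful. The following plain-Rust sketch computes the same row-wise softmax on the CPU, including the max subtraction that keeps `exp` from overflowing. `softmax_rows` is a hypothetical helper and assumes a row-major buffer.

```rust
/// Row-wise softmax over a row-major buffer with `cols` elements per row.
pub fn softmax_rows(input: &[f32], cols: usize) -> Vec<f32> {
    assert!(cols > 0 && input.len() % cols == 0, "input must hold whole rows");
    let mut out = Vec::with_capacity(input.len());
    for row in input.chunks(cols) {
        // Subtract the row maximum so the largest exponent is exp(0) = 1.
        let max = row.iter().copied().fold(f32::NEG_INFINITY, f32::max);
        let exps: Vec<f32> = row.iter().map(|&x| (x - max).exp()).collect();
        let sum: f32 = exps.iter().sum();
        out.extend(exps.iter().map(|&e| e / sum));
    }
    out
}
```

Comparing kernel output against this reference with a small tolerance catches both indexing mistakes and numerical issues such as a missing max subtraction.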

Kernel fusion is the register strategy scaled up: it combines multiple logical operations into a single kernel. In the three-step pipeline below, separate kernels would write each intermediate result to global memory and read it back; fusing them into one kernel keeps the intermediates in registers and eliminates those round-trips:

// UNFUSED: 3 kernels (a + b, then * c, then exp), 5 loads + 3 stores total.

// FUSED: 1 kernel, 3 loads + 1 store (half the global-memory accesses).
#[cutile::entry()]
fn fused<const S: [i32; 2]>(
    w: &mut Tensor<f32, S>,
    a: &Tensor<f32, {[-1, -1]}>,
    b: &Tensor<f32, {[-1, -1]}>,
    c: &Tensor<f32, {[-1, -1]}>
) {
    let tile_a = load_tile_like(a, w);
    let tile_b = load_tile_like(b, w);
    let tile_c = load_tile_like(c, w);

    // All in registers — no intermediate memory traffic
    let y = tile_a + tile_b;
    let z = y * tile_c;
    let result = exp(z);

    w.store(result);
}

For the full memory hierarchy model and arithmetic intensity analysis, see Where Data Lives.


Compute Optimization#

Tensor Cores deliver massive throughput for matrix operations when shapes align. Express matrix multiply through mma with compatible [M, K], [K, N], and [M, N] tile shapes; the compiler lowers supported dtype/shape combinations to Tensor Core instructions:

#[cutile::entry()]
fn tensor_core_matmul<const M: i32, const N: i32, const K: i32>(
    c: &mut Tensor<f32, {[M, N]}>,  // f32 accumulator
    a: &Tensor<f16, {[-1, -1]}>,
    b: &Tensor<f16, {[-1, -1]}>
) {
    let part_a = a.partition(const_shape![M, K]);
    let part_b = b.partition(const_shape![K, N]);
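    let pid: (i32, i32, i32) = get_tile_block_id();
    // Simplified: only the K tile at index 0 is loaded, so this is the full
    // product only when the inner dimension equals K; otherwise loop over K tiles.
    let tile_a = part_a.load([pid.0, 0i32]);
    let tile_b = part_b.load([0i32, pid.1]);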

    let acc = constant(0.0f32, c.shape());
    let result = mma(tile_a, tile_b, acc);
    c.store(result);
}
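The accumulate semantics of mma (result = a × b + acc) are what allow a matrix product to be built up one K tile at a time. The CPU sketch below shows that accumulation for a single output tile. `matmul_k_tiled` is a hypothetical reference function; it uses f32 inputs because stable Rust has no built-in f16 type, and it assumes the inner dimension is a multiple of the K tile size.

```rust
/// Computes C = A * B for row-major A (m x k) and B (k x n) by accumulating
/// one K tile of width `bk` at a time into an f32 accumulator.
pub fn matmul_k_tiled(
    a: &[f32],
    b: &[f32],
    m: usize,
    n: usize,
    k: usize,
    bk: usize,
) -> Vec<f32> {
    assert!(bk > 0 && k % bk == 0, "sketch assumes K is a multiple of BK");
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    let mut acc = vec![0.0f32; m * n]; // plays the role of the accumulator tile
    for k0 in (0..k).step_by(bk) {
        // One "mma" step: acc += A[:, k0..k0+bk] * B[k0..k0+bk, :]
        for i in 0..m {
            for j in 0..n {
                let mut partial = 0.0f32;
                for kk in k0..k0 + bk {
                    partial += a[i * k + kk] * b[kk * n + j];
                }
                acc[i * n + j] += partial;
            }
        }
    }
    acc
}
```

The result does not depend on the K tile size for inputs that are exactly representable, which makes this a convenient reference when checking a kernel across different tile shapes.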

Arithmetic intensity is FLOPs per byte transferred. Higher is better: high-intensity kernels are compute-bound rather than memory-bound.

| Operation | Arithmetic Intensity (FLOPs/byte) | Bound |
|---|---|---|
| Vector Add | ~0.1 | Memory |
| Matrix-Vector | 1-2 | Memory |
| Matrix-Matrix | O(N) | Compute |
| Fused Softmax | ~10+ | Compute |

See Where Data Lives: Arithmetic Intensity for the full treatment.
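As a worked example, arithmetic intensity can be computed directly from FLOP and byte counts and compared against a machine's ridge point (peak FLOP/s divided by peak bytes/s). The helpers below are hypothetical and use idealized traffic counts in which every operand is read once and every result written once; peak figures are parameters, not measurements of any particular GPU.

```rust
/// Element-wise add: 1 FLOP per element, two reads and one write.
pub fn vector_add_intensity(bytes_per_elem: f64) -> f64 {
    1.0 / (3.0 * bytes_per_elem)
}

/// Square N x N matmul: 2 * N^3 FLOPs, with A and B read once and C written once.
pub fn matmul_intensity(n: f64, bytes_per_elem: f64) -> f64 {
    (2.0 * n.powi(3)) / (3.0 * n * n * bytes_per_elem)
}

/// Intensity at which a machine moves from memory-bound to compute-bound.
pub fn ridge_point(peak_flops_per_sec: f64, peak_bytes_per_sec: f64) -> f64 {
    peak_flops_per_sec / peak_bytes_per_sec
}

/// True when a kernel with this intensity is limited by compute under the
/// simple roofline model.
pub fn is_compute_bound(intensity: f64, peak_flops_per_sec: f64, peak_bytes_per_sec: f64) -> bool {
    intensity >= ridge_point(peak_flops_per_sec, peak_bytes_per_sec)
}
```

For f32 data, vector add comes out at 1/12 ≈ 0.08 FLOPs/byte, while a 1024 × 1024 matmul comes out at 1024/6 ≈ 171 FLOPs/byte. On a hypothetical machine with a ridge point of 20 FLOPs/byte, the first is memory-bound and the second is compute-bound.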

Instruction-level parallelism (ILP) lets the compiler overlap independent operations. Write independent branches explicitly so the compiler can schedule them in parallel:

// Independent operations — compiler can overlap them
let sum1 = reduce_sum(tile1, 1i32);
let sum2 = reduce_sum(tile2, 1i32);  // Can execute concurrently

// Dependent operations — serialize
let step1 = tile * 2.0;
let step2 = step1 + 1.0;  // Must wait for step1
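
The same principle can be seen in a CPU-side analogy, offered here as an illustration rather than as cuTile code: a sum written as one accumulator is a single dependency chain, while two accumulators form independent chains that can be scheduled in parallel. `sum_single_chain` and `sum_two_chains` are hypothetical helpers.

```rust
/// One accumulator: every addition depends on the previous one.
pub fn sum_single_chain(xs: &[f32]) -> f32 {
    let mut acc = 0.0f32;
    for &x in xs {
        acc += x;
    }
    acc
}

/// Two accumulators: the `even` and `odd` chains do not depend on each other,
/// so their additions can overlap. They are combined once at the end.
pub fn sum_two_chains(xs: &[f32]) -> f32 {
    let mut even = 0.0f32;
    let mut odd = 0.0f32;
    let mut pairs = xs.chunks_exact(2);
    for pair in pairs.by_ref() {
        even += pair[0];
        odd += pair[1];
    }
    // Handle a leftover element when the length is odd.
    let tail: f32 = pairs.remainder().iter().sum();
    even + odd + tail
}
```

Note that reordering floating-point additions can change the rounding of the result, so the two versions agree exactly only for inputs such as small integers.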

Profiling and Pitfalls#

Focus on four metrics when profiling with Nsight Compute:

| Metric | Target |
|---|---|
| Memory Throughput | >80% of peak for memory-bound kernels |
| Compute Throughput | >70% for compute-bound kernels |
| Occupancy | >50% |
| Register Spills | 0 |

Identify the bottleneck from the profile:

  • High memory throughput, low compute → memory-bound; increase arithmetic intensity, fuse kernels.

  • Low memory throughput, high compute → compute-bound; already near-optimal for this algorithm.

  • Low on both, high stall cycles → latency-bound; increase parallelism, overlap independent operations.
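
The decision rules above can be written down as a small classifier. This is a sketch under stated assumptions: `Bottleneck` and `classify` are hypothetical names, the 60% threshold for "high" is an arbitrary choice for illustration, and stall cycles are not modelled, so "low on both" is simply reported as latency-bound.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Bottleneck {
    MemoryBound,
    ComputeBound,
    LatencyBound,
    /// Both throughputs are high: the kernel is well balanced.
    Balanced,
}

/// Classifies a kernel from its memory and compute throughput, each given as
/// a percentage of peak as reported by the profiler.
pub fn classify(memory_pct: f64, compute_pct: f64) -> Bottleneck {
    const HIGH: f64 = 60.0; // assumed cut-off between "low" and "high"
    match (memory_pct >= HIGH, compute_pct >= HIGH) {
        (true, false) => Bottleneck::MemoryBound,
        (false, true) => Bottleneck::ComputeBound,
        (false, false) => Bottleneck::LatencyBound,
        (true, true) => Bottleneck::Balanced,
    }
}
```

A classifier like this is mainly useful for triaging many kernels from exported profiler reports; for a single kernel, reading the profile directly gives more detail.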

Common pitfalls:

  • Wrong tile size. [8, 8] is usually too small (overhead dominates); [512, 512] is usually too large (register spills, low occupancy). Start with [64, 64] or [128, 128].

  • Wrong dtype. Using f32 when f16/bf16 would suffice leaves 2× Tensor Core throughput on the table.

  • Excessive synchronization. Let the compiler handle thread synchronization; avoid introducing extra sync points.

  • Algorithmic stride. Tile operations coalesce automatically, but strided access patterns in your algorithm logic defeat this.

Pre-ship checklist:

  • Tile size appropriate for workload and architecture.

  • Memory access coalesced.

  • Kernel fusion applied where possible.

  • Data types optimized (f16/bf16 for Tensor Cores).

  • Arithmetic intensity maximized.

  • Occupancy balanced against tile size.

  • Profiled with Nsight Compute.


Continue to Interoperability for the escape hatch when tile programming isn’t enough, or Debugging and Profiling for deeper troubleshooting.