Debugging and Profiling

Start debugging with small, deterministic inputs. Read results back to the host and compare them against a CPU reference. Once correctness is established, inspect the generated Tile IR or profile the GPU.

Printing and Assertions

cuda_tile_print! prints from inside a GPU kernel:

#[cutile::entry()]
fn debug_kernel<const S: [i32; 2]>(
    z: &mut Tensor<f32, S>,
    x: &Tensor<f32, { [-1, -1] }>,
) {
    let pid: (i32, i32, i32) = get_tile_block_id();
    let tile = load_tile_like(x, z);

    cuda_tile_print!("Block ({}, {}): loaded tile\n", pid.0, pid.1);
    z.store(tile);
}

GPU printing is slow and serializes tile block execution. Use it for small grids and remove it before measuring performance.

cuda_tile_assert! checks conditions inside a kernel:

let tile = load_tile_like(x, z);
cuda_tile_assert!(tile[0] > 0.0, "expected positive input");

Host Readback

Host readback is a DeviceOp; execute it before reading the host vector:

let z_host: Vec<f32> = z
    .unpartition()
    .to_host_vec()
    .sync_on(&stream)?;

assert!(!z_host.iter().any(|x| x.is_nan()));
assert!(!z_host.iter().any(|x| x.is_infinite()));
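
When one of these assertions fails, it helps to know where the bad value sits. A minimal sketch, assuming a hypothetical helper named `first_non_finite` (it is not part of cuTile Rust) that runs on the host vector after readback:

```rust
/// Hypothetical helper, not part of cuTile Rust: returns the index and value
/// of the first NaN or infinite element in a host-side readback buffer.
fn first_non_finite(values: &[f32]) -> Option<(usize, f32)> {
    values
        .iter()
        .copied()
        .enumerate()
        .find(|(_, v)| !v.is_finite())
}
```

If the tensor is stored row-major with `cols` columns (an assumption about the layout), the flat index maps back to row `index / cols` and column `index % cols`, which usually narrows the problem down to one tile block.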

If a fused kernel is wrong, split it into stages and read back each intermediate. Each stage should match a simple CPU implementation on a small input.
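
The stage-by-stage comparison can be sketched on the host as follows. The helper `first_divergent_stage` and the stage names are hypothetical; each GPU slice stands for whatever the readback of that intermediate returns.

```rust
/// Hypothetical helper: given (stage name, GPU readback, CPU reference) triples
/// in pipeline order, returns the first stage whose outputs differ by more
/// than `tol`, together with the index of the offending element.
fn first_divergent_stage<'a>(
    stages: &[(&'a str, &[f32], &[f32])],
    tol: f32,
) -> Option<(&'a str, usize)> {
    for (name, gpu, cpu) in stages {
        // A length mismatch is reported at the first index missing from one side.
        if gpu.len() != cpu.len() {
            return Some((*name, gpu.len().min(cpu.len())));
        }
        for (i, (g, c)) in gpu.iter().zip(cpu.iter()).enumerate() {
            // `!(diff <= tol)` also flags NaN, which `diff > tol` would miss.
            if !((g - c).abs() <= tol) {
                return Some((*name, i));
            }
        }
    }
    None
}
```

Because the stages are checked in pipeline order, the first reported stage is the earliest point where GPU and CPU results part ways; later stages only inherit that error.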

Correctness Tests

Use minimal inputs first:

#[test]
fn small_add_matches_cpu() {
    let a = vec![1.0, 2.0, 3.0, 4.0];
    let b = vec![10.0, 20.0, 30.0, 40.0];
    let expected = vec![11.0, 22.0, 33.0, 44.0];

    let result = run_add_kernel(&a, &b);
    assert_eq!(result, expected);
}

Then compare larger random inputs against a CPU reference with an appropriate tolerance:

for (cpu, gpu) in cpu_result.iter().zip(gpu_result.iter()) {
    assert!((cpu - gpu).abs() < 1e-5, "CPU={cpu}, GPU={gpu}");
}
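
A fixed absolute tolerance such as `1e-5` becomes too strict once values grow large, because `f32` carries only about seven significant decimal digits. The sketch below pairs a seeded input generator, so that "random" inputs stay reproducible, with a comparison that accepts either an absolute or a relative error. All three helpers and the tolerance values are illustrative assumptions, not part of cuTile Rust.

```rust
/// Hypothetical seeded generator (xorshift32): reproducible pseudo-random
/// inputs in [-1.0, 1.0) without an external crate.
fn seeded_inputs(seed: u32, len: usize) -> Vec<f32> {
    let mut state = seed.max(1); // xorshift state must be non-zero
    (0..len)
        .map(|_| {
            state ^= state << 13;
            state ^= state >> 17;
            state ^= state << 5;
            // Map the top 24 bits to [0, 2), then shift to [-1, 1).
            (state >> 8) as f32 / (1u32 << 23) as f32 - 1.0
        })
        .collect()
}

/// True when the values agree within `abs_tol`, or within `rel_tol`
/// relative to the larger magnitude of the two.
fn approx_eq(cpu: f32, gpu: f32, abs_tol: f32, rel_tol: f32) -> bool {
    if cpu == gpu {
        return true; // exact match, including equal infinities
    }
    if !cpu.is_finite() || !gpu.is_finite() {
        return false; // NaN or a one-sided infinity never matches
    }
    let diff = (cpu - gpu).abs();
    diff <= abs_tol || diff <= rel_tol * cpu.abs().max(gpu.abs())
}

/// Panics with the index and both values at the first mismatch.
fn assert_all_close(cpu: &[f32], gpu: &[f32], abs_tol: f32, rel_tol: f32) {
    assert_eq!(cpu.len(), gpu.len(), "length mismatch");
    for (i, (c, g)) in cpu.iter().zip(gpu.iter()).enumerate() {
        assert!(approx_eq(*c, *g, abs_tol, rel_tol), "index {i}: CPU={c}, GPU={g}");
    }
}
```

The absolute term handles values near zero, where a relative bound collapses; the relative term handles large values. Suitable tolerances depend on the kernel: reductions and reduced-precision types generally need looser bounds than elementwise `f32` arithmetic.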

For numerically sensitive kernels, test edge cases: zeros, large positive values, large negative values, non-divisible shapes if supported, and known overflow-prone inputs.
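
As one illustration, softmax is a typical overflow-prone computation. The sketch below shows a CPU reference that stays finite on the edge cases listed above, together with a small set of edge-case rows. Both `softmax_reference` and `edge_case_rows` are hypothetical names chosen for this example, and softmax itself is only a stand-in for whatever kernel is under test.

```rust
/// CPU reference softmax for one row. Subtracting the row maximum before
/// `exp` keeps every exponent <= 0, so large inputs do not overflow.
fn softmax_reference(row: &[f32]) -> Vec<f32> {
    let max = row.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = row.iter().map(|x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Hypothetical edge-case rows: zeros, large positive values,
/// large negative values, and a mix of magnitudes.
fn edge_case_rows() -> Vec<Vec<f32>> {
    vec![
        vec![0.0; 4],
        vec![1.0e4; 4],
        vec![-1.0e4; 4],
        vec![-1.0e4, 0.0, 1.0e4, 88.0],
    ]
}
```

Feeding the same rows to the GPU kernel and comparing against this reference shows whether the kernel applies an equivalent stabilization; a naive `exp(x)` on the second row would overflow to infinity.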

Inspecting Tile IR

print_ir = true prints the generated wrapper, source kernel, and Tile IR text during JIT compilation:

#[cutile::entry(print_ir = true)]
fn debug_ir_kernel<const S: [i32; 2]>(...) { ... }

dump_mlir_dir writes the compiled Tile IR text to files:

#[cutile::entry(dump_mlir_dir = "/tmp/cutile-ir")]
fn debug_ir_kernel<const S: [i32; 2]>(...) { ... }

use_debug_mlir loads hand-modified Tile IR text:

#[cutile::entry(use_debug_mlir = "/path/to/custom.mlir")]
fn kernel_with_custom_ir<const S: [i32; 2]>(...) { ... }

The same compiler-stage dumps are also available with environment variables:

| Variable | Description | Default |
| --- | --- | --- |
| `CUTILE_DUMP` | Dump compiler stages (`ast`, `resolved`, `typed`, `instantiated`, `ir`, `bytecode`, or `all`) | unset |
| `CUTILE_DUMP_FILTER` | Restrict dumps to matching function names or `module::function` paths | unset |

Errors and Crashes

Most cuTile Rust errors are caught before a kernel runs:

| Error | Cause | Fix |
| --- | --- | --- |
| Shape mismatch | Incompatible tile shapes | Align shapes or use `reshape` / `broadcast` |
| Element type mismatch | Different element types in one operation | Add explicit `convert_tile()` |
| Invalid reduction axis | Axis outside the tile rank | Use an axis in `0..rank` |
| Unsupported MMA shape or dtype | No lowering for that combination | Use a supported shape and element type |
| Missing entry | Function is not marked with `#[cutile::entry()]` | Add the entry attribute |

Runtime errors usually come from out-of-bounds accesses, toolkit issues, or invalid raw-pointer usage:

| Error | Cause | Fix |
| --- | --- | --- |
| CUDA error: no kernel image | Wrong GPU architecture or stale cubin | Clear cache, rebuild, verify target SM |
| Failed to load kernel | CUDA toolkit or driver issue | Check `nvidia-smi` and toolkit version |
| Out of memory | Tensor allocation or JIT memory pressure | Reduce allocation size or specialization count |
| Shape mismatch at runtime | Tensor size incompatible with partition | Ensure expected divisibility or bounds handling |
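
The runtime shape mismatch in the last row can often be caught on the host before launch. A minimal sketch, assuming the kernel requires every tensor dimension to be an exact multiple of the matching partition dimension; `check_partition` is a hypothetical helper, not a cuTile Rust API:

```rust
/// Hypothetical host-side check, not a cuTile Rust API: verifies that every
/// tensor dimension is evenly divisible by the matching partition dimension.
fn check_partition(tensor_shape: &[usize], tile_shape: &[usize]) -> Result<(), String> {
    if tensor_shape.len() != tile_shape.len() {
        return Err(format!(
            "rank mismatch: tensor rank {}, partition rank {}",
            tensor_shape.len(),
            tile_shape.len()
        ));
    }
    for (axis, (&dim, &tile)) in tensor_shape.iter().zip(tile_shape).enumerate() {
        // A zero-sized partition is rejected up front to avoid dividing by zero.
        if tile == 0 || dim % tile != 0 {
            return Err(format!(
                "axis {axis}: size {dim} is not divisible by partition size {tile}"
            ));
        }
    }
    Ok(())
}
```

For example, `check_partition(&[100, 64], &[32, 16])` returns an error naming axis 0, since 100 is not a multiple of 32. If the kernel handles partial tiles with its own bounds logic, this check is stricter than necessary.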

CPU segfaults usually mean the failure happened in host-side FFI, JIT compilation, or raw-pointer lifetime management rather than inside ordinary safe tile code. Get a backtrace first:

RUST_BACKTRACE=1 cargo run
RUST_BACKTRACE=full cargo run

gdb --args ./target/debug/my_program
(gdb) run
(gdb) bt

Check the CUDA driver, CUDA Toolkit path, raw pointer lifetimes, spawned task lifetimes, and host memory use during first-launch compilation.

Profiling

Use Nsight Compute for individual kernels:

ncu --target-processes all ./my_cutile_program
ncu --set full -o profile_report ./my_cutile_program

Watch memory throughput, compute throughput, occupancy, register spills, and stall reasons.

Use Nsight Systems for CPU/GPU scheduling:

nsys profile -o report ./my_cutile_program
nsys-ui report.nsys-rep

Look for launch gaps, unnecessary synchronization, memory transfer overlap, and whether independent kernels actually overlap on separate streams.

Debugging Checklist

Shapes match the operation and launch partition.
Tensor sizes are compatible with the partition shape.
Element types match or are explicitly converted.
Small inputs match a CPU reference.
Numerically sensitive code handles overflow and underflow.
Raw pointers outlive all GPU work that uses them.
print_ir shows the expected Tile IR operations.
Profiles are captured after correctness checks pass.

Review Performance for optimization strategies or Interoperability for custom CUDA kernels.