Debugging and Profiling
Start debugging with small, deterministic inputs. Read results back to the host and compare them against a CPU reference. Once correctness is established, inspect the generated Tile IR or profile the GPU.
Printing and Assertions
cuda_tile_print! prints from inside a GPU kernel:
#[cutile::entry()]
fn debug_kernel<const S: [i32; 2]>(
    z: &mut Tensor<f32, S>,
    x: &Tensor<f32, { [-1, -1] }>,
) {
    let pid: (i32, i32, i32) = get_tile_block_id();
    let tile = load_tile_like(x, z);
    cuda_tile_print!("Block ({}, {}): loaded tile\n", pid.0, pid.1);
    z.store(tile);
}
GPU printing is slow and serializes tile block execution. Use it for small grids and remove it before measuring performance.
cuda_tile_assert! checks conditions inside a kernel:
let tile = load_tile_like(x, z);
cuda_tile_assert!(tile[0] > 0.0, "expected positive input");
Host Readback
Host readback is a DeviceOp; execute it before reading the host vector:
let z_host: Vec<f32> = z
    .unpartition()
    .to_host_vec()
    .sync_on(&stream)?;
assert!(!z_host.iter().any(|x| x.is_nan()));
assert!(!z_host.iter().any(|x| x.is_infinite()));
If a fused kernel is wrong, split it into stages and read back each intermediate. Each stage should match a simple CPU implementation on a small input.
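The stage-by-stage comparison can be automated on the host. The following is a minimal sketch in plain Rust, not part of cuTile Rust: `StageMismatch` and `first_divergent_stage` are hypothetical names, and each stage is assumed to have already been read back into a `Vec<f32>` or slice alongside its CPU reference.

```rust
/// Location of the first disagreement between a CPU reference and a GPU readback.
/// (Hypothetical helper type for illustration.)
#[derive(Debug, Clone, PartialEq)]
pub struct StageMismatch {
    pub stage: String,
    pub index: usize,
    pub cpu: f32,
    pub gpu: f32,
}

/// Walks the stages in order and returns the first element whose GPU value
/// differs from the CPU value by more than `tol`. A NaN on either side counts
/// as a mismatch. Returns `None` when every stage agrees.
pub fn first_divergent_stage(
    stages: &[(&str, &[f32], &[f32])], // (stage name, CPU reference, GPU readback)
    tol: f32,
) -> Option<StageMismatch> {
    for &(name, cpu, gpu) in stages {
        assert_eq!(cpu.len(), gpu.len(), "stage {name}: length mismatch");
        for (index, (&c, &g)) in cpu.iter().zip(gpu.iter()).enumerate() {
            if c == g {
                continue; // exact match, including equal infinities
            }
            let diff = (c - g).abs();
            if diff.is_nan() || diff > tol {
                return Some(StageMismatch {
                    stage: name.to_string(),
                    index,
                    cpu: c,
                    gpu: g,
                });
            }
        }
    }
    None
}
```

Because stages are checked in pipeline order, the first reported mismatch points at the earliest stage that went wrong rather than at a downstream symptom.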
Correctness Tests
Use minimal inputs first:
#[test]
fn small_add_matches_cpu() {
    let a = vec![1.0, 2.0, 3.0, 4.0];
    let b = vec![10.0, 20.0, 30.0, 40.0];
    let expected = vec![11.0, 22.0, 33.0, 44.0];
    let result = run_add_kernel(&a, &b);
    assert_eq!(result, expected);
}
Then compare larger random inputs against a CPU reference with an appropriate tolerance:
for (cpu, gpu) in cpu_result.iter().zip(gpu_result.iter()) {
    assert!((cpu - gpu).abs() < 1e-5, "CPU={cpu}, GPU={gpu}");
}
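A fixed absolute tolerance such as `1e-5` is too strict for large magnitudes, where `f32` spacing is coarser, and too loose for tiny ones. One common alternative is a combined absolute and relative tolerance. The sketch below is plain Rust with hypothetical helper names (`is_close`, `find_mismatch`); the tolerances are illustrative and should be chosen per kernel.

```rust
/// Returns true when `gpu` is within `abs_tol + rel_tol * |cpu|` of `cpu`.
/// Exactly equal values (including equal infinities) are close; any other
/// non-finite value, such as NaN, is never close.
pub fn is_close(cpu: f32, gpu: f32, abs_tol: f32, rel_tol: f32) -> bool {
    if cpu == gpu {
        return true;
    }
    if !cpu.is_finite() || !gpu.is_finite() {
        return false;
    }
    (cpu - gpu).abs() <= abs_tol + rel_tol * cpu.abs()
}

/// Returns the index and values of the first element that is not close,
/// or `None` when the two slices agree everywhere.
pub fn find_mismatch(
    cpu: &[f32],
    gpu: &[f32],
    abs_tol: f32,
    rel_tol: f32,
) -> Option<(usize, f32, f32)> {
    assert_eq!(cpu.len(), gpu.len(), "length mismatch");
    cpu.iter()
        .zip(gpu.iter())
        .enumerate()
        .find(|(_, (&c, &g))| !is_close(c, g, abs_tol, rel_tol))
        .map(|(i, (&c, &g))| (i, c, g))
}
```

In a test, `assert_eq!(find_mismatch(&cpu_result, &gpu_result, 1e-6, 1e-5), None)` then reports the offending index and both values on failure.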
For numerically sensitive kernels, test edge cases: zeros, large positive values, large negative values, non-divisible shapes if supported, and known overflow-prone inputs.
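The edge cases listed above can be generated once and fed to both the CPU reference and the kernel under test. This is a minimal sketch in plain Rust; `edge_case_inputs` and `boundary_lengths` are hypothetical helpers, and the specific magnitudes (`1.0e30`, `f32::MAX`) are illustrative choices rather than values prescribed by cuTile Rust.

```rust
/// Named input vectors that tend to expose numerical problems.
pub fn edge_case_inputs(len: usize) -> Vec<(&'static str, Vec<f32>)> {
    vec![
        ("zeros", vec![0.0; len]),
        ("large_positive", vec![1.0e30; len]),
        ("large_negative", vec![-1.0e30; len]),
        // f32::MAX overflows to infinity under most additions or scalings.
        ("overflow_prone", vec![f32::MAX; len]),
    ]
}

/// Lengths just below, at, and just above multiples of `tile`, for kernels
/// that claim to support shapes that are not divisible by the tile size.
pub fn boundary_lengths(tile: usize) -> Vec<usize> {
    assert!(tile >= 2, "tile size must be at least 2");
    vec![tile - 1, tile, tile + 1, 2 * tile - 1, 2 * tile]
}
```

If the kernel only supports divisible shapes, skip the non-multiple lengths and instead test that the launch is rejected cleanly.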
Inspecting Tile IR
print_ir = true prints the generated wrapper, source kernel, and Tile IR text during JIT compilation:
#[cutile::entry(print_ir = true)]
fn debug_ir_kernel<const S: [i32; 2]>(...) { ... }
dump_mlir_dir writes the compiled Tile IR text to files:
#[cutile::entry(dump_mlir_dir = "/tmp/cutile-ir")]
fn debug_ir_kernel<const S: [i32; 2]>(...) { ... }
use_debug_mlir loads hand-modified Tile IR text:
#[cutile::entry(use_debug_mlir = "/path/to/custom.mlir")]
fn kernel_with_custom_ir<const S: [i32; 2]>(...) { ... }
The same compiler-stage dumps are also available with environment variables:
| Variable | Description | Default |
|---|---|---|
| | Dump compiler stages ( | unset |
| | Restrict dumps to matching function names or | unset |
Errors and Crashes
Most cuTile Rust errors are caught before a kernel runs:
| Error | Cause | Fix |
|---|---|---|
| Shape mismatch | Incompatible tile shapes | Align shapes or use |
| Element type mismatch | Different element types in one operation | Add explicit |
| Invalid reduction axis | Axis outside the tile rank | Use an axis in |
| Unsupported MMA shape or dtype | No lowering for that combination | Use a supported shape and element type |
| Missing entry | Function is not marked with | Add the entry attribute |
Runtime errors usually come from out-of-bounds accesses, toolkit issues, or invalid raw-pointer usage:
| Error | Cause | Fix |
|---|---|---|
| CUDA error: no kernel image | Wrong GPU architecture or stale cubin | Clear cache, rebuild, verify target SM |
| Failed to load kernel | CUDA toolkit or driver issue | Check |
| Out of memory | Tensor allocation or JIT memory pressure | Reduce allocation size or specialization count |
| Shape mismatch at runtime | Tensor size incompatible with partition | Ensure expected divisibility or bounds handling |
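A runtime shape mismatch can often be caught on the host before launch. The sketch below is plain Rust and does not use any cuTile Rust API: `tiles_per_axis` is a hypothetical helper that assumes the kernel requires every tensor dimension to be an exact multiple of the tile dimension.

```rust
/// Checks that every dimension of `shape` is a multiple of the matching
/// `tile` dimension and returns the number of tiles along each axis.
pub fn tiles_per_axis(shape: &[usize], tile: &[usize]) -> Result<Vec<usize>, String> {
    if shape.len() != tile.len() {
        return Err(format!(
            "rank mismatch: tensor rank {} vs tile rank {}",
            shape.len(),
            tile.len()
        ));
    }
    let mut grid = Vec::with_capacity(shape.len());
    for (axis, (&dim, &t)) in shape.iter().zip(tile.iter()).enumerate() {
        if t == 0 {
            return Err(format!("axis {axis}: tile size must be nonzero"));
        }
        if dim % t != 0 {
            return Err(format!(
                "axis {axis}: size {dim} is not divisible by tile size {t}"
            ));
        }
        grid.push(dim / t);
    }
    Ok(grid)
}
```

Calling this before partitioning turns an opaque launch failure into an error message that names the offending axis.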
CPU segfaults usually mean the failure happened in host-side FFI, JIT compilation, or raw-pointer lifetime management rather than inside ordinary safe tile code. Get a backtrace first:
RUST_BACKTRACE=1 cargo run
RUST_BACKTRACE=full cargo run
gdb --args ./target/debug/my_program
(gdb) run
(gdb) bt
Check the CUDA driver, CUDA Toolkit path, raw pointer lifetimes, spawned task lifetimes, and host memory use during first-launch compilation.
Profiling
Use Nsight Compute for individual kernels:
ncu --target-processes all ./my_cutile_program
ncu --set full -o profile_report ./my_cutile_program
Watch memory throughput, compute throughput, occupancy, register spills, and stall reasons.
Use Nsight Systems for CPU/GPU scheduling:
nsys profile ./my_cutile_program
nsys-ui report.nsys-rep
Look for launch gaps, unnecessary synchronization, memory transfer overlap, and whether independent kernels actually overlap on separate streams.
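Before opening a profiler, a coarse host-side timer can confirm whether a change moved the needle at all. Because kernels are JIT-compiled, the first launches include compilation time and should be discarded. The following sketch is plain Rust with hypothetical helper names; it assumes the closure passed in launches the kernel and blocks until the GPU work has finished, otherwise only launch overhead is measured.

```rust
use std::time::{Duration, Instant};

/// Runs `run` `warmup` times untimed (to absorb first-launch compilation),
/// then `iters` times timed, and returns one wall-clock sample per timed run.
pub fn time_after_warmup<F: FnMut()>(warmup: usize, iters: usize, mut run: F) -> Vec<Duration> {
    for _ in 0..warmup {
        run();
    }
    let mut samples = Vec::with_capacity(iters);
    for _ in 0..iters {
        let start = Instant::now();
        run(); // assumed to synchronize before returning
        samples.push(start.elapsed());
    }
    samples
}

/// Median of the samples (the upper middle element for even counts),
/// which is less sensitive to outliers than the mean.
pub fn median(samples: &[Duration]) -> Option<Duration> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort();
    Some(sorted[sorted.len() / 2])
}
```

Host timers only give end-to-end numbers; the per-kernel metrics above still require Nsight Compute.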
Debugging Checklist
Shapes match the operation and launch partition.
Tensor sizes are compatible with the partition shape.
Element types match or are explicitly converted.
Small inputs match a CPU reference.
Numerically sensitive code handles overflow and underflow.
Raw pointers outlive all GPU work that uses them.
print_ir shows the expected Tile IR operations.
Profiles are captured after correctness checks pass.
Review Performance for optimization strategies or Interoperability for custom CUDA kernels.