# Error Handling and Debugging
GPU kernels fail differently from CPU code. The CUDA toolchain does not
support exceptions or stack unwinding today; there are no stack traces in
kernel output, and no `println!`. When something goes wrong, the result is
either silent data corruption, a hardware trap, or a cryptic driver error on the
host. This chapter covers cuda-oxide's tools for diagnosing and fixing kernel
problems.
## What happens when a kernel goes wrong
GPU errors fall into three categories:
| Failure mode | What you see | Example |
|---|---|---|
| Silent corruption | Wrong results, no error | Race condition, off-by-one index |
| Hardware trap | Kernel aborts; the next synchronization returns an error | `gpu_assert!` failure, out-of-bounds raw-pointer access |
| Launch failure | `cuda_launch!` returns an error immediately | Wrong grid dims, missing module, out of resources |
The CUDA toolchain does not expose an exception mechanism today (the hardware could support it, but nvcc/ptxas do not wire it up). A trap instruction kills the kernel and poisons the CUDA context – subsequent operations on the same context will fail until you handle or recreate it.
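Silent corruption is the hardest category because nothing signals the bug. A minimal CPU-side sketch of the off-by-one case from the table (plain Rust, no GPU required) shows the failure shape:

```rust
// CPU sketch of "silent corruption": an off-by-one loop bound skips the last
// element. Nothing traps and no error is reported -- the output is simply
// wrong, which is exactly how the same bug presents inside a GPU kernel.
fn scale_buggy(data: &[f32], out: &mut [f32]) {
    for i in 0..data.len() - 1 { // bug: should be 0..data.len()
        out[i] = data[i] * 2.0;
    }
}

fn main() {
    let data = [1.0_f32, 2.0, 3.0, 4.0];
    let mut out = [0.0_f32; 4];
    scale_buggy(&data, &mut out);
    // The last element was never written and silently keeps its initial value.
    println!("{:?}", out);
}
```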
## `gpu_printf!` – printing from the GPU
`gpu_printf!` lets you print values from device code for quick debugging. It
uses CUDA's built-in `vprintf` mechanism:
```rust
use cuda_device::{kernel, thread, gpu_printf, DisjointSlice};

#[kernel]
pub fn debug_kernel(data: &[f32], mut out: DisjointSlice<f32>) {
    let idx = thread::index_1d();
    if idx.get() < 4 {
        gpu_printf!("Thread {} sees value {}\n", idx.get(), data[idx.get()]);
    }
    if let Some(out_elem) = out.get_mut(idx) {
        *out_elem = data[idx.get()] * 2.0;
    }
}
```
### Important details
- **Flush requires sync.** Output is buffered on the GPU and only appears on the host after a stream or device synchronization (e.g., `to_host_vec` or `ctx.synchronize()`).
- **Buffer size.** The default printf buffer is 1 MiB. If many threads print, output may be truncated. Enlarge it with `cudaDeviceSetLimit(cudaLimitPrintfFifoSize, size)`.
- **Thread ordering.** Output from different threads appears in arbitrary order.
- **Performance.** Printf serializes across threads – avoid it in hot paths. Use it for debugging, not logging.
- **Format conversion.** The macro converts Rust `{}` format specifiers to C printf equivalents (`%d`, `%f`, etc.) at compile time.
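To illustrate the format-conversion point, here is a runnable host-side sketch of how a `{}` placeholder could map to a C specifier based on the argument's type. This is not the real macro's implementation; `specifier_for` is a hypothetical helper:

```rust
// Sketch: choose a C printf specifier from a Rust type name, the way
// gpu_printf! is described as doing at compile time. Hypothetical helper,
// not the actual macro internals.
fn specifier_for(type_name: &str) -> &'static str {
    match type_name {
        "i32" | "u32" | "usize" => "%d",
        "f32" | "f64" => "%f",
        "i64" | "u64" => "%lld",
        _ => "%p", // fall back to pointer-style formatting
    }
}

fn main() {
    // "Thread {} sees value {}" with (usize, f32) arguments would lower to:
    let fmt = format!(
        "Thread {} sees value {}\n",
        specifier_for("usize"),
        specifier_for("f32")
    );
    println!("{}", fmt); // Thread %d sees value %f
}
```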
### Why not `println!` or `Debug`?
Standard Rust formatting (`fmt::Display`, `fmt::Debug`, `format!`, `println!`)
requires dynamic dispatch, string allocation, and I/O – none of which exist on
the GPU. `gpu_printf!` bypasses all of this by lowering directly to a CUDA
`vprintf` call.
## `gpu_assert!` and `trap()`
For fatal error checking on the device, use `gpu_assert!` or `debug::trap()`:
```rust
use cuda_device::{kernel, thread, debug, gpu_assert, DisjointSlice};

#[kernel]
pub fn checked_kernel(data: &[f32], len: u32, mut out: DisjointSlice<f32>) {
    let idx = thread::index_1d();
    gpu_assert!(idx.get() < len as usize); // traps if false
    if let Some(out_elem) = out.get_mut(idx) {
        *out_elem = data[idx.get()];
    }
}
```
| Intrinsic | What it does | Host effect |
|---|---|---|
| `gpu_assert!(cond)` | Traps if condition is false | Next synchronization returns an error |
| `debug::trap()` | Unconditional trap | Next synchronization returns an error |
| `debug::breakpoint()` | Emits a `brkpt` instruction | Pauses in cuda-gdb; crashes without debugger |
### The trap-and-check pattern
A common workflow for catching device-side errors:
```rust
// Launch kernel
cuda_launch! { /* ... */ }.expect("Launch failed");

// Synchronize and check for traps
stream.synchronize().expect("Kernel trapped -- check gpu_assert! conditions");
```
If a `gpu_assert!` fires, synchronization returns an error. The error message
doesn't tell you which assertion failed, so use `gpu_printf!` alongside
assertions to narrow down the problem.
## Host-side error handling
### `DriverError`
The synchronous launch path (`cuda_launch!`) returns
`Result<(), DriverError>`. The `DriverError` wraps a CUDA driver result code:
```rust
match cuda_launch! { /* ... */ } {
    Ok(()) => { /* launched successfully */ }
    Err(e) => eprintln!("Launch failed: {e}"),
}
```
### `DeviceError`
The async path (`cuda_launch_async!` / `DeviceOperation`) uses `DeviceError`,
which wraps driver errors alongside context and scheduling failures:
```rust
use cuda_async::error::DeviceError;

let result: Result<Vec<f32>, DeviceError> = operation.sync();
```
`DeviceError` variants include `Driver`, `Context`, `KernelCache`, `Scheduling`,
`Launch`, and `Internal`.
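One way to use the variants is to decide how to react to each failure class. The sketch below uses a local stand-in enum mirroring the documented variant names (not the real `cuda_async` type), and the reaction strings are illustrative:

```rust
// Local stand-in mirroring the documented DeviceError variant names.
#[derive(Debug)]
enum DeviceError {
    Driver(String),
    Context(String),
    KernelCache(String),
    Scheduling(String),
    Launch(String),
    Internal(String),
}

// Map each failure class to a suggested reaction (illustrative only).
fn describe(err: &DeviceError) -> &'static str {
    match err {
        // A poisoned context usually means a prior kernel trapped.
        DeviceError::Context(_) => "context error: recreate the context",
        DeviceError::Launch(_) => "launch error: check grid dims and resources",
        DeviceError::Driver(_) => "driver error: inspect the CUDA result code",
        DeviceError::KernelCache(_)
        | DeviceError::Scheduling(_)
        | DeviceError::Internal(_) => "internal/scheduling error: report with a repro",
    }
}

fn main() {
    let err = DeviceError::Launch("out of resources".into());
    println!("{}", describe(&err));
}
```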
### `CudaContext::check_err`
After a series of operations, call `check_err()` on the context to surface any
asynchronous errors that may have been recorded:
```rust
ctx.check_err().expect("Asynchronous GPU error detected");
```
## `cargo oxide debug` – cuda-gdb integration
`cargo oxide debug` builds your kernel with debug info and launches cuda-gdb:
```sh
cargo oxide debug vecadd        # Standard GDB
cargo oxide debug vecadd --tui  # GDB with TUI
cargo oxide debug vecadd --cgdb # cgdb front-end
```
### Breakpoint workflow
1. Build with debug: `cargo oxide debug <example>`
2. Set a breakpoint on your kernel: `break vecadd`
3. Run: `run`
4. Inspect threads: `cuda thread`, `cuda block`, `cuda warp`
5. Print variables: `print idx`, `print *c_elem`
For programmatic breakpoints, use `debug::breakpoint()` in your kernel code.
When cuda-gdb hits the `brkpt` instruction, it pauses execution and lets you
inspect the GPU state.
> **Tip**
>
> `debug::breakpoint()` will crash the kernel if no debugger is attached.
> Guard it with a compile-time flag or only use it during debugging sessions.
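One shape such a guard can take is a cargo feature that compiles the breakpoint in only for debug builds. In this runnable sketch, `debug_breakpoint` is a hypothetical feature name and `breakpoint_stub` stands in for `debug::breakpoint()`:

```rust
// Stand-in for debug::breakpoint(); in kernel code this would emit brkpt.
#[allow(dead_code)]
fn breakpoint_stub() {}

fn risky_step(value: f32) -> f32 {
    if value.is_nan() {
        // Compiled in only when the (hypothetical) feature is enabled, so a
        // release build never emits the breakpoint and cannot crash the kernel.
        #[cfg(feature = "debug_breakpoint")]
        breakpoint_stub();
    }
    value * 2.0
}

fn main() {
    // Without the feature, the breakpoint is compiled out entirely.
    println!("{}", risky_step(21.0));
}
```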
## `cargo oxide doctor` – environment validation
Before debugging kernel failures, verify your environment is correctly set up:
```sh
cargo oxide doctor
```
Doctor checks:
| Check | What it verifies |
|---|---|
| Rust toolchain | Nightly compiler with required components |
| CUDA toolkit | Toolkit installation is present and discoverable |
| libNVVM | libNVVM library is available (needed for libdevice math) |
| nvJitLink | nvJitLink library is available |
| libdevice | libdevice bitcode is available |
| LLVM | LLVM installation used by the codegen backend |
| Codegen backend | The cuda-oxide codegen backend is built and usable |
The libNVVM / nvJitLink / libdevice checks fire only when a kernel calls
CUDA libdevice math (`sin`, `cos`, `exp`, `pow`, `sqrt`, …). If your
kernel is pure arithmetic, those three failing is harmless. They all ship
with the CUDA Toolkit – no separate download. If any check fails, doctor
prints the standard install location for that component.
## `cargo oxide pipeline` – inspecting the compilation
When a kernel produces wrong results but no errors, inspect the compilation pipeline to see exactly what code was generated:
```sh
cargo oxide pipeline vecadd
```
This prints the full pipeline output:
- **MIR collection** – which functions the collector found
- **`dialect-mir`** – pliron IR modelling Rust MIR (before and after `mem2reg`)
- **`dialect-llvm`** – pliron IR modelling LLVM IR (after `mir-lower`)
- **Textual LLVM IR** – serialized `.ll` file
- **Final PTX** – the generated assembly
### Environment variables
For more targeted inspection:
| Variable | Effect |
|---|---|
| | Verbose compiler output |
| | Dump the rustc MIR before import |
## Profiling with Nsight Compute
For performance debugging, NVIDIA's Nsight Compute (`ncu`) provides
roofline analysis, memory throughput, and occupancy metrics:

```sh
ncu --set full ./target/release/my_example
```
cuda-oxide kernels can emit profiler triggers using
`debug::prof_trigger::<N>()`, which generates a `pmevent` instruction that
Nsight Compute and Nsight Systems can capture for timeline annotation.
> **See also**
>
> Nsight Compute Documentation for the full profiling toolkit.
## Common pitfalls
| Pitfall | Symptom | Fix |
|---|---|---|
| Race condition on output buffer | Wrong results, non-deterministic | Use `DisjointSlice` so each thread writes a disjoint element |
| Missing barrier | Stale shared memory reads | Add barrier between writes and reads |
| Wrong grid dims | Elements unprocessed or indexed out of bounds | Match launch config to the data size |
| Out-of-bounds with raw pointers | Trap or silent corruption | Use bounds-checked slice access (e.g. `get_mut`) |
| `println!` in kernel code | Compile error (fmt unavailable) | Use `gpu_printf!` instead |
| Forgetting to sync after launch | Host reads stale data | Call `ctx.synchronize()` (or read back via `to_host_vec`) |
| PTX built for wrong arch | Launch failure | Rebuild for your GPU's compute capability |
*Debugging decision tree: kernel problems fall into three categories (compile error, runtime trap, silent corruption), each with different diagnostic tools. Common fixes are shown at the bottom.*