# Kernels and Device Functions
A kernel is a function that runs on the GPU – the entry point that the host launches across thousands of threads. A device function is a helper that runs on the GPU but can only be called from another device function or kernel, never from the host. This chapter covers both, along with the Rust patterns that are (and aren’t) supported in device code.
> **See also**
> CUDA Programming Guide – Kernels, for the authoritative CUDA C++ reference on kernels and device functions.
## `#[kernel]` – the GPU entry point
Annotating a function with `#[kernel]` tells cuda-oxide to compile it as a GPU entry point. The function must return `()` – kernels communicate results by writing to output buffers, not by returning values.
```rust
use cuda_device::{kernel, thread, DisjointSlice};

#[kernel]
pub fn vecadd(a: &[f32], b: &[f32], mut c: DisjointSlice<f32>) {
    let idx = thread::index_1d();
    if let Some(c_elem) = c.get_mut(idx) {
        *c_elem = a[idx.get()] + b[idx.get()];
    }
}
```
Under the hood, `#[kernel]` does three things:

1. **Renames the function** into the reserved `cuda_oxide_kernel_<hash>_<name>` namespace so the compiler's collector can identify it as a device entry point. The exact prefix is owned by the workspace-internal `reserved-oxide-symbols` crate; the `<hash>` suffix makes the namespace unguessable for user code.
2. **Adds `#[no_mangle]`** to preserve the symbol name in the generated PTX.
3. **Generates a marker struct** implementing `CudaKernel` (or `GenericCudaKernel` for generic kernels) so that `cuda_launch!` can look up the correct PTX entry point at compile time.
In the generated PTX, a kernel becomes a `.entry` directive – the GPU equivalent of `main`:
```ptx
.entry vecadd(.param .u64 a, .param .u64 a_len, ...) { ... }
```
### Parameter constraints
Kernel parameters are flattened at the ABI boundary through a process called argument scalarization (covered in the Memory and Data Movement chapter). The key rules:
- Slices (`&[T]`, `DisjointSlice<T>`) become a pointer + length pair.
- Scalars (`u32`, `f32`, etc.) are passed directly.
- Structs are decomposed into their individual fields, as sketched below.
- No heap-allocated types (`Vec`, `String`, `Box`) – the `alloc` crate is allowed through the compiler, but no device-side `#[global_allocator]` is configured today. Even with one, device `malloc` is extremely slow.
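To make the struct rule concrete, here is a minimal sketch – the `Params` struct and the `scale_offset` kernel are illustrative, not part of the cuda-oxide API:

```rust
use cuda_device::{kernel, thread, DisjointSlice};

// Illustrative struct: scalarization decomposes it into its two f32
// fields at the ABI boundary, as if they were separate parameters.
#[derive(Clone, Copy)]
pub struct Params {
    scale: f32,
    offset: f32,
}

#[kernel]
pub fn scale_offset(input: &[f32], params: Params, mut out: DisjointSlice<f32>) {
    let idx = thread::index_1d();
    if let Some(out_elem) = out.get_mut(idx) {
        *out_elem = input[idx.get()] * params.scale + params.offset;
    }
}
```

At the ABI boundary this kernel therefore receives the `input` pointer + length pair, the two scalar fields of `params`, and the `out` pointer + length pair.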
## Device helper functions
Not all GPU code belongs in the kernel itself. You can factor logic into helper functions that the compiler will also compile for the GPU.
### Auto-discovered helpers
The simplest approach: just write a normal Rust function and call it from your kernel. The compiler's collector traverses the call graph from each `#[kernel]` entry point and automatically compiles every reachable function for the GPU – no annotation needed:
```rust
fn clamp(x: f32, lo: f32, hi: f32) -> f32 {
    if x < lo { lo } else if x > hi { hi } else { x }
}

#[kernel]
pub fn apply_clamp(input: &[f32], mut out: DisjointSlice<f32>) {
    let idx = thread::index_1d();
    if let Some(out_elem) = out.get_mut(idx) {
        *out_elem = clamp(input[idx.get()], 0.0, 1.0);
    }
}
```
The `clamp` function is compiled to a PTX `.func` (device function) and typically inlined by the compiler, so there is no call overhead.
### When `#[device]` is needed
The `#[device]` attribute is required in three specific scenarios where auto-discovery is not sufficient:
| Scenario | Why |
|---|---|
| Standalone device libraries | No `#[kernel]` entry point exists in the crate, so the collector has nothing to traverse from |
| Cross-crate device functions | The function is in a different crate from the kernel |
| Device FFI | The function is exposed as a named symbol for callers outside Rust |
```rust
use cuda_device::device;

#[device]
pub fn magnitude(x: f32, y: f32) -> f32 {
    (x * x + y * y).sqrt()
}
```
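A kernel in a different crate can then call the function like any other – a minimal sketch, assuming `magnitude` is exported from a hypothetical `gpu_math` crate:

```rust
use cuda_device::{kernel, thread, DisjointSlice};
// Hypothetical sibling crate that exports the #[device] function above.
use gpu_math::magnitude;

#[kernel]
pub fn norms(xs: &[f32], ys: &[f32], mut out: DisjointSlice<f32>) {
    let idx = thread::index_1d();
    if let Some(out_elem) = out.get_mut(idx) {
        *out_elem = magnitude(xs[idx.get()], ys[idx.get()]);
    }
}
```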
## `#[kernel]` vs `#[device]`
| Feature | `#[kernel]` | `#[device]` | Auto-discovered |
|---|---|---|---|
| PTX directive | `.entry` | `.func` | `.func` |
| Launchable from host | Yes, via `cuda_launch!` | No | No |
| Can return a value | No (must be `()`) | Yes | Yes |
| Callable from device code | Yes | Yes | Yes |
| Annotation required | Always | Only for standalone/cross-crate/FFI | Never |
## What Rust code works on the GPU
cuda-oxide compiles standard Rust through rustc – it is not a subset language. That said, GPU code runs in a `no_std` environment without a device-side heap allocator configured, so certain Rust features are unavailable today. Here is the current support matrix:
### Supported
| Feature | Notes |
|---|---|
| Primitive types (`u32`, `f32`, `bool`, etc.) | Full support |
| Structs and tuples | Decomposed at ABI boundary |
| Enums (`Option`, `Result`, user-defined) | Including data-carrying variants |
| `match` | Multi-way branching |
| `for` loops | Range-based and iterator-based |
| Iterators (`map`, `filter`, etc.) | Desugared through MIR |
| `break` / `continue` | Inside loops |
| Arrays (`[T; N]`) | Read, write, indexing |
| Slices (`&[T]`) | Read-only; mutable writes via `DisjointSlice<T>` |
| Closures (within device code) | Normal Rust semantics |
| Generic functions | Monomorphized per call site |
| `unsafe` blocks | For advanced patterns |
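As a quick illustration, the sketch below combines several rows from the table – an enum with `match`, an array, an iterator with a closure, and a helper function. The `Op` enum and `pipeline` kernel are invented for this example:

```rust
use cuda_device::{kernel, thread, DisjointSlice};

// Illustrative enum; variants carry data and are matched on below.
enum Op {
    Add(f32),
    Scale(f32),
}

fn apply(op: &Op, x: f32) -> f32 {
    // `match` compiles to ordinary multi-way branching in PTX.
    match op {
        Op::Add(d) => x + d,
        Op::Scale(s) => x * s,
    }
}

#[kernel]
pub fn pipeline(input: &[f32], mut out: DisjointSlice<f32>) {
    let idx = thread::index_1d();
    if let Some(out_elem) = out.get_mut(idx) {
        let ops = [Op::Add(1.0), Op::Scale(0.5)];
        // Iterator + closure: desugared through MIR like normal Rust.
        *out_elem = ops.iter().fold(input[idx.get()], |acc, op| apply(op, acc));
    }
}
```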
### Not supported
| Feature | Reason | Alternative |
|---|---|---|
| `Vec`, `String`, `Box` | Require heap allocator (no device-side `#[global_allocator]`) | Use fixed-size arrays or slices |
| `println!` / `print!` | Require formatting machinery + I/O | Use `gpu_printf!` |
| `std` (files, network, OS) | No OS on GPU | Communicate via buffers |
| Trait objects (`dyn Trait`) | Require vtable dispatch | Use generics (monomorphized) |
| `format!` | Formatting + allocation | Use `gpu_printf!` |
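Where host code would reach for `Vec`, device code typically uses a fixed-size array instead – a minimal sketch; the window size of 8 is arbitrary:

```rust
use cuda_device::{kernel, thread, DisjointSlice};

#[kernel]
pub fn windowed_sum(input: &[f32], mut out: DisjointSlice<f32>) {
    let idx = thread::index_1d();
    if let Some(out_elem) = out.get_mut(idx) {
        // Fixed-size scratch buffer on the stack instead of a Vec.
        let mut window = [0.0f32; 8];
        for (i, w) in window.iter_mut().enumerate() {
            // Guarded read: out-of-range elements default to 0.0
            // rather than triggering a bounds-check trap.
            *w = input.get(idx.get() + i).copied().unwrap_or(0.0);
        }
        *out_elem = window.iter().sum();
    }
}
```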
> **Tip**
> If you accidentally use an unsupported feature, the compiler will produce a clear error: `"CUDA-OXIDE: FORBIDDEN CRATE IN DEVICE CODE"` with a list of allowed crates (`core`, `alloc`, `cuda_device`, and your local crate).
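For example, pulling a `std` collection into a kernel is enough to trigger it – a sketch of code the collector rejects (the `histogram` kernel is invented for this example):

```rust
use cuda_device::kernel;
use std::collections::HashMap; // `std` is on the forbidden list

#[kernel]
pub fn histogram(input: &[u32]) {
    // Build fails with:
    // "CUDA-OXIDE: FORBIDDEN CRATE IN DEVICE CODE"
    let mut counts: HashMap<u32, u32> = HashMap::new();
    for &v in input {
        *counts.entry(v).or_insert(0) += 1;
    }
}
```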
## `#[launch_bounds]` – occupancy hints
The `#[launch_bounds]` attribute tells the compiler how many threads you intend to launch per block. This lets the PTX assembler make better register allocation decisions and can improve occupancy:
```rust
#[kernel]
#[launch_bounds(256, 2)]
pub fn optimized_kernel(mut out: DisjointSlice<f32>) {
    // ...
}
```
| Parameter | Required | PTX directive | Description |
|---|---|---|---|
| First argument (`256` in the example) | Yes | `.maxntid` | Maximum threads per block |
| Second argument (`2` in the example) | No | `.minnctapersm` | Minimum concurrent blocks per SM |
The generated PTX includes these directives:
```ptx
.entry optimized_kernel
.maxntid 256, 1, 1
.minnctapersm 2
{ ... }
```
> **Tip**
> `#[launch_bounds]` must appear after `#[kernel]`:

```rust
#[kernel]
#[launch_bounds(256, 2)] // correct
pub fn my_kernel(/* ... */) {}
```
## The collector – how device code is discovered
When you build with `cargo oxide`, the `rustc-codegen-cuda` backend runs a collector pass that determines which functions to compile for the GPU:
1. Scan all compilation units for functions in the reserved `cuda_oxide_kernel_<hash>_` namespace (generated by `#[kernel]`).
2. For each kernel, traverse the call graph and collect all transitively reachable functions.
3. Filter each callee against the allowed-crate list:
| Crate | Allowed | Why |
|---|---|---|
| Your local crate | Yes | Your kernel and helper code |
| `cuda_device` | Yes | GPU intrinsics (threads, warps, shared memory) |
| `core` | Yes | The `no_std` core of the standard library |
| `std` | No | Requires OS facilities not available on GPU |
| `alloc` | Allowed | Passes the collector, but no device-side allocator is wired up yet. Link-time error today. |
If the collector encounters a call into a forbidden crate, it reports a compile-time error rather than generating broken PTX.
*Figure: The device code collector. Starting from `#[kernel]` entry points, the compiler walks the call graph to discover all reachable device functions, then filters each callee against the allowed-crate list (local crate, `cuda_device`, `core`). The output is a PTX module with `.entry` and `.func` directives.*
## `no_std` and panic behavior
Device code runs in an implicit `#![no_std]` environment. You do not need to add this attribute yourself – the compiler backend handles it.
**Panic behavior:** all unwind paths in MIR are treated as unreachable. If a panic actually triggers at runtime (e.g., an array bounds check fails), the GPU executes a `trap` instruction, which causes the host to receive `CUDA_ERROR_ILLEGAL_INSTRUCTION`. This is semantically equivalent to `panic=abort` but does not require any special compiler flags.
In practice this means:
- `unwrap()` and `expect()` work but will trap the GPU on `None`/`Err`.
- `assert!` and `debug_assert!` work but trap on failure.
- `panic!("message")` is not supported (the formatting machinery is unavailable) – use `gpu_assert!` or `debug::trap()` instead.
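It therefore pays to make failure paths explicit. A minimal sketch contrasting the two styles – the `gpu_assert!` import path from `cuda_device` is an assumption:

```rust
// Assumed import path for gpu_assert! in this sketch.
use cuda_device::{gpu_assert, kernel, thread, DisjointSlice};

#[kernel]
pub fn safe_copy(input: &[f32], mut out: DisjointSlice<f32>) {
    let idx = thread::index_1d();
    if let Some(out_elem) = out.get_mut(idx) {
        // Implicit style: `input[idx.get()]` alone would trap the GPU
        // (CUDA_ERROR_ILLEGAL_INSTRUCTION) if the bounds check failed.
        // Explicit style: state the invariant first, then index.
        gpu_assert!(idx.get() < input.len());
        *out_elem = input[idx.get()];
    }
}
```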
> **See also**
> The Error Handling and Debugging chapter covers `gpu_printf!`, `gpu_assert!`, and `cargo oxide debug` for diagnosing kernel failures.