# Supported Features

This appendix presents the cuda-oxide feature matrix: every compiler capability, runtime API, and hardware feature, along with its current support status. The data is drawn from the compiler/runtime sources and the test suite.

Legend: Full = tested and working, Planned = on the roadmap, N/A = not applicable or no identified need.
## Compiler: Memory Model

| Feature | Status | Description |
|---|---|---|
| HMM / Unified Memory Management | Full | GPU directly reads/writes host memory without explicit host-to-device copies. |
| Unified Struct ABI (no repr annotations required) | Full | Device struct layout matches host exactly. The compiler queries rustc’s actual layout and reproduces it with explicit padding in LLVM IR. Works with ordinary Rust struct definitions. |
| Dynamic Layout Matching | Full | Compiler queries rustc’s layout engine rather than recomputing layouts independently, so field offsets, sizes, and padding track the host. |
## Compiler: Type System

| Feature | Status | Description |
|---|---|---|
| Generics and Monomorphization | Full | Generic kernels and device functions with trait bounds. Monomorphized instances collected from rustc MIR. Const generics supported. |
| Enums | Full | Full enum support including discriminant extraction and payload access. Pattern matching on enums works. |
| Struct Construction and Field Access | Full | Struct literals, field access, pass-by-value and return values. User-defined structs supported without annotations. |
| Array Types (`[T; N]`) | Full | Static array construction, constant-index and runtime-index access. Mutable arrays auto-promoted to memory-backed. |
| SIMD Register Type | Full | Generic SIMD register type with named accessors for its lanes. |
| ABI Scalarization | Full | Composite types (slices, structs) are scalarized at kernel boundaries. |
## Compiler: Closures

| Feature | Status | Description |
|---|---|---|
| Move Closures (`move`) | Full | Closures that capture by value. Captures are passed as kernel arguments. |
| Reference Closures | Full | Non-move closures that capture by reference. GPU reads host addresses via HMM. |
| Host-to-Device Closures | Full | Closures defined on the host and passed to generic kernels. Polynomial evaluation with captured coefficients is tested. |
| Device-Internal Closures | Full | Closures created and used entirely on device, including closures passed to device functions. |
## Compiler: Control Flow

| Feature | Status | Description |
|---|---|---|
| Match Expressions (integer switch) | Full | Multi-way match on integers. Generates a chain of conditional branches. |
| Match on Enums | Full | Pattern matching on enum variants, including payload binding. |
| For Loops (range, iterator, enumerate) | Full | Full iterator desugaring: range-based, iterator-based, and enumerate loops. |
| While Loops / If-Else | Full | Baseline control flow fully supported. |
| Break and Continue | Full | Early exit and skip-to-next-iteration in all loop forms. |
## Compiler: Arithmetic and Casting

| Feature | Status | Description |
|---|---|---|
| 64-bit Arithmetic | Full | Full 64-bit integer arithmetic including shifts, bitwise ops, and descriptor field packing. |
| Type Casting (all kinds) | Full | IntToInt, IntToFloat, FloatToInt, FloatToFloat, Transmute (bitcast), PtrToPtr, PtrToInt, IntToPtr, and pointer coercions. |
## Compiler: Interop

| Feature | Status | Description |
|---|---|---|
| Bi-directional LTOIR Support | Full | Rust kernels call CUDA C++ device functions and C++ calls Rust device functions. Via NVVM IR → libNVVM → LTOIR → nvJitLink. |
| Device FFI | Full | Foreign device functions declared in Rust and called from kernels. |
| MathDx FFI (cuFFTDx / cuBLASDx) | Full | cuFFTDx (8/16/32-point thread-level FFT) and cuBLASDx (32x32x32 block-level GEMM) via LTOIR. |
| Cross-Crate Kernels | Full | Kernels and device functions defined in library crates, with monomorphization at the binary crate’s use site. |
## Compiler: Functions

| Feature | Status | Description |
|---|---|---|
| Kernel Attribute | Full | Marks functions as GPU kernel entry points. |
| Device Function Attribute | Full | Device-side helper functions callable from kernels. Inlined aggressively by the optimizer. |
| Standalone Device Functions | Full | Device functions compiled without any kernel present. Clean export names for C++ consumption. |
| Multi-Kernel Modules | Full | Multiple kernel entry points per compiled module. |
## Compiler: Compilation Pipeline

| Feature | Status | Description |
|---|---|---|
| Unified Single-Source Compilation | Full | Host and device code in the same file. A custom rustc codegen backend intercepts codegen; no separate device-compilation pass is required. |
| PTX Output | Full | Default output: Rust MIR lowered through the codegen backend to PTX. |
| NVVM IR Output | Full | Alternative output for libNVVM consumption, with NVVM metadata. |
| LTOIR Output | Full | Device-side LTO for linking with CUDA C++. Via libNVVM. |
| Float Math Intrinsics (libdevice) | Full | Rust float math calls are lowered to libdevice functions. |
| Pipeline Inspection | Full | Intermediate output of each compilation stage can be inspected. |
| cuda-gdb Debug Support | Full | Build with debug info and launch under cuda-gdb. |
## Runtime Library: Safety

| Feature | Status | Description |
|---|---|---|
| Checked Output Slice | Full | Bounds-checked parallel write output slice. Access via trusted indices. |
| Trusted Index Newtype | Full | Newtype that can only be constructed by trusted index functions. Guarantees unique indices. |
| Typed Barrier Lifecycle | Full | Compile-time barrier lifecycle enforced through the type system. |
## Runtime Library: Atomics

| Feature | Status | Description |
|---|---|---|
| Device-Scope Atomics | Full | Atomic operations visible across the whole device. |
| Block-Scope Atomics | Full | Atomic operations scoped to a single thread block. |
| System-Scope Atomics | Full | Atomic operations visible to the host and peer devices. |
| Rust Standard Atomic Types | Full | Standard library atomic types lowered to PTX atomic instructions. |
## Runtime Library: Thread and Synchronization

| Feature | Status | Description |
|---|---|---|
| Thread/Block/Grid Intrinsics | Full | Thread, block, and grid index and dimension queries. |
| Block Synchronization | Full | Block-wide barrier synchronization. |
| Async Barriers (mbarrier) | Full | Hardware async barriers for Hopper+: init, arrive, test_wait, try_wait, inval. |
| Cluster Synchronization | Full | Cluster-wide synchronization (Hopper+). |
| Fence Operations | Full | Memory fences at block, device, and system scope. |
## Runtime Library: Warp

| Feature | Status | Description |
|---|---|---|
| Warp Shuffle Operations | Full | Register exchange between lanes of a warp. |
| Warp Vote Operations | Full | Warp-wide predicate voting (all/any/ballot). |
| Lane/Warp ID | Full | Lane and warp index queries. |
## Runtime Library: Cooperative Groups

| Feature | Status | Description |
|---|---|---|
| Typed Group Handles | Full | Distinct handle types for block, warp, and tile groups. |
| Group Universal API | Full | Common operations (size, rank, sync) across all group types. |
| Warp Tile Partitioning | Full | Partition a warp into statically sized tiles. |
| Warp Collectives | Full | Collective operations across the lanes of a warp. |
| Warp Reductions / Scans | Full | Warp-level reductions and prefix scans. |
| Block Reductions / Scans | Full | Block-level reductions and prefix scans. |
| Cooperative Kernel Launch | Full | Launch mode that permits grid-wide synchronization. |
## Runtime Library: Debug

| Feature | Status | Description |
|---|---|---|
| Device printf | Full | Formatted GPU output with full format specifier support. Lowers to the device-side vprintf. |
| Device assert | Full | Runtime GPU assertion. Calls the device assertion handler on failure. |
| Debug Intrinsics | Full | Breakpoint and trap intrinsics. |
## Runtime Library: Kernel Launch

| Feature | Status | Description |
|---|---|---|
| Launch API | Full | Synchronous kernel launch with argument passing, closure extraction, and cluster support. |
| Launch Bounds | Full | Occupancy hints: max threads per block, min blocks per SM. |
| Cluster Dimensions | Full | Compile-time cluster dimensions. Emits the required cluster metadata. |
## Runtime Library: TMA

| Feature | Status | Description |
|---|---|---|
| TMA Bulk Tensor Copy (1D–5D) | Full | Asynchronous bulk tensor copies between global and shared memory. |
| TMA Multicast | Full | Single TMA load broadcast to all CTAs in a cluster. sm_100a for full multicast. |
| TMA Commit/Wait Groups | Full | Grouped commit/wait tracking for outstanding bulk copies. |
## Not Yet Implemented

| Feature | Status | Notes |
|---|---|---|
| Inline Assembly (`asm!`) | Planned | Workaround: use the built-in intrinsics, or add new intrinsics to the runtime library. |
| FP8 / MX Data Types | Planned | Roadmap item for Blackwell. No architectural limitation. |
| Dynamic Dispatch (trait objects) | N/A | Use generics with static dispatch. No real need for this has been identified. |
| Heap Allocation | N/A | CUDA has a device-side heap (device-side `malloc`/`free`), but it is not exposed. |
| | N/A | Use |
| Panic / Unwinding | N/A | Panic paths exist in MIR, but the compiler strips them during codegen. |
| Standard Library (`std`) | N/A | Device code is compiled without the standard library. |
| Texture Memory | N/A | Lower priority given TMA availability on Hopper+. |