Interoperability
The tile model handles dense tensor algebra well — GEMM, element-wise operations, reductions, convolutions — but some algorithms depend on warp-level primitives (__shfl_sync, __ballot_sync, __reduce_sync) for things like custom scan/prefix-sum, cooperative groups, or irregular data access patterns. For these, write the kernel in CUDA C++ and integrate it using the approach below.
A custom CUDA kernel can participate in the same DeviceOperation execution model as your tile kernels — sharing streams, chaining with .and_then(), and avoiding unnecessary synchronization.
Step 1: Compile Your CUDA Kernel
Compile your CUDA C++ kernel to PTX (portable) or a .cubin (architecture-specific):
# PTX — portable across GPU architectures, JIT-compiled at load time.
nvcc -ptx -arch=compute_80 my_kernel.cu -o my_kernel.ptx
# cubin — pre-compiled for a single architecture, no JIT overhead.
nvcc -cubin -arch=sm_80 my_kernel.cu -o my_kernel.cubin
Architecture portability: A .cubin file is tied to the SM architecture it was compiled for. Code compiled with -arch=sm_80 will not load on an sm_100 GPU. PTX avoids this problem: the CUDA driver JIT-compiles it for the target GPU at load time, at the cost of a one-time compilation delay. Prefer PTX unless you need to eliminate JIT overhead. If you must ship .cubin files, compile one for each target architecture.
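The choice between a shipped .cubin and the PTX fallback can be made at startup from the device's compute capability. The sketch below is only an illustration of that decision: ComputeCapability, Artifact, and select_artifact are hypothetical names, not part of cuda-async, and the policy is deliberately conservative (a cubin is used only when its target matches the device exactly).

```rust
/// Compute capability of a GPU, e.g. major 8, minor 0 for sm_80.
/// Hypothetical helper type, not part of cuda-async.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ComputeCapability {
    pub major: u32,
    pub minor: u32,
}

/// Which compiled artifact to load for a given device (hypothetical).
#[derive(Debug, PartialEq, Eq)]
pub enum Artifact {
    /// A precompiled cubin built for exactly this target.
    Cubin(ComputeCapability),
    /// No matching cubin: load PTX and let the driver JIT-compile it.
    Ptx,
}

/// Pick a cubin only when one was built for exactly this device's
/// architecture; otherwise fall back to PTX.
pub fn select_artifact(device: ComputeCapability, cubins: &[ComputeCapability]) -> Artifact {
    if cubins.contains(&device) {
        Artifact::Cubin(device)
    } else {
        Artifact::Ptx
    }
}
```

With cubins built for sm_80 and sm_90, an sm_80 device would get its cubin while an sm_100 device would fall back to PTX. How the device's compute capability is queried is left out here because it depends on the driver bindings in use.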
Step 2: Load the Module and Function
Use cuda-async’s module loading functions to load the compiled kernel:
use cuda_async::device_context::load_module_from_file;
use std::sync::Arc;
let module = load_module_from_file("my_kernel.cubin", device_id)?;
let function = Arc::new(module.load_function("my_kernel_entry")?);
For PTX (JIT-compiled at runtime):
use cuda_async::device_context::load_module_from_ptx;
use std::sync::Arc;
let ptx_src = include_str!("my_kernel.ptx");
let module = load_module_from_ptx(ptx_src, device_id)?;
let function = Arc::new(module.load_function("my_kernel_entry")?);
Step 3: Launch via AsyncKernelLaunch
AsyncKernelLaunch is a DeviceOperation that wraps the CUDA driver’s kernel launch API:
use cuda_async::launch::AsyncKernelLaunch;
use cuda_core::LaunchConfig;
let mut launcher = AsyncKernelLaunch::new(function.clone());
launcher.push_arg(num_elements as u32);
launcher.push_arg(scale);
// SAFETY: input and output are valid device allocations with at least
// num_elements f32 elements. output is exclusively written; input is
// read-only. Both remain allocated until this operation completes.
unsafe {
launcher
.push_device_ptr(input.cu_deviceptr())
.push_device_ptr(output.cu_deviceptr());
}
launcher.set_launch_config(LaunchConfig {
grid_dim: ((num_elements as u32 + 255) / 256, 1, 1),
block_dim: (256, 1, 1),
shared_mem_bytes: 0,
});
// Execute as a DeviceOperation — integrates with the async model.
launcher.await?;
Scalar arguments (types implementing DType) can be pushed safely with push_arg. Device pointers must use unsafe { push_device_ptr() } — see Safety: Device Pointer Arguments below.
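The grid size in the launch configuration above is a ceiling division: enough 256-thread blocks to cover every element. A minimal sketch of that arithmetic as a standalone helper follows; grid_dim_1d is a hypothetical name, not a cuda-async or cuda_core function. It uses u32::div_ceil so the computation cannot overflow the way n + 255 can for very large n, and it rejects element counts that do not fit in u32 instead of truncating them with as.

```rust
/// Compute a 1-D grid that covers `num_elements` with `block_size` threads
/// per block. Hypothetical helper for illustration.
///
/// Returns None if `block_size` is zero or `num_elements` does not fit in u32.
pub fn grid_dim_1d(num_elements: usize, block_size: u32) -> Option<(u32, u32, u32)> {
    if block_size == 0 {
        return None;
    }
    // Checked conversion: `num_elements as u32` would silently truncate.
    let n = u32::try_from(num_elements).ok()?;
    // Ceiling division without the overflow risk of (n + block_size - 1).
    Some((n.div_ceil(block_size), 1, 1))
}
```

Because the grid is rounded up, the last block may contain threads past the end of the data, which is why the kernel receives num_elements and is expected to bounds-check its index. A result of zero blocks (for zero elements) is not a valid launch, so callers would skip the launch in that case.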
Safety: Device Pointer Arguments
push_device_ptr passes a raw address to the CUDA driver. The Rust compiler has no visibility into GPU kernel code and cannot verify that:
The pointer refers to a valid device memory allocation on the correct GPU.
The allocation is large enough for the kernel’s access pattern.
No other operation is concurrently reading or writing the same memory.
The argument order and types match the kernel’s parameter signature.
Neither the Rust compiler nor the CUDA driver validates these invariants — mistakes result in silent undefined behavior or hard-to-diagnose GPU faults. You must verify them manually.
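Some of these invariants can at least be asserted on the host before the unsafe block runs. The sketch below assumes you track the device, address, and length of each buffer yourself; BufferInfo, LaunchCheckError, and check_scale_args are hypothetical names invented for this example. It checks the device, the size, and the overlap of the accessed ranges for a kernel that reads n f32 values from one buffer and writes n to another. It cannot check argument order or concurrent use by other operations, so it reduces the risk rather than making the push safe.

```rust
/// Host-side bookkeeping for a device buffer (hypothetical, for illustration).
#[derive(Debug, Clone, Copy)]
pub struct BufferInfo {
    pub device_id: usize,
    pub addr: u64,
    pub len_elems: usize,
}

#[derive(Debug, PartialEq, Eq)]
pub enum LaunchCheckError {
    WrongDevice,
    TooSmall,
    Overlap,
}

/// Validate the arguments of a kernel that reads `n` f32 values from `input`
/// and writes `n` f32 values to `output` on device `device_id`.
pub fn check_scale_args(
    device_id: usize,
    n: usize,
    input: &BufferInfo,
    output: &BufferInfo,
) -> Result<(), LaunchCheckError> {
    // Both allocations must live on the GPU the kernel will run on.
    if input.device_id != device_id || output.device_id != device_id {
        return Err(LaunchCheckError::WrongDevice);
    }
    // Each allocation must hold at least the n elements the kernel touches.
    if input.len_elems < n || output.len_elems < n {
        return Err(LaunchCheckError::TooSmall);
    }
    // The byte ranges actually accessed must not overlap.
    let bytes = (n * std::mem::size_of::<f32>()) as u64;
    let (in_end, out_end) = (input.addr + bytes, output.addr + bytes);
    if input.addr < out_end && output.addr < in_end {
        return Err(LaunchCheckError::Overlap);
    }
    Ok(())
}
```

A call such as check_scale_args(device_id, num_elements, &input_info, &output_info)? placed just before the unsafe block would turn these three mistakes into ordinary errors instead of GPU faults.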
Scalar arguments (like num_elements as u32) are copied into the kernel’s parameter space — the kernel reads the value, not an address. Any type implementing DType can be pushed safely with push_arg.
To prevent data races, use stream ordering: operations chained with .and_then() on the same stream execute in order and see each other’s writes. Operations on different streams require explicit synchronization.
Why generated cuTile Rust kernels don’t require unsafe: When you write a tile kernel with #[cutile::entry], the generated launcher uses the KernelArgument and ArcKernelArgument implementations for Tensor<T> and Partition<Tensor<T>>. These implementations call push_device_ptr internally, but can do so safely because the framework controls both sides: device pointers come from framework-managed allocations (guaranteed valid), and the ownership model (Partition for exclusive access, Arc<Tensor> for shared reads) prevents aliasing at the type level. Custom kernels bypass this: you are pushing pointers that the framework didn’t allocate and can’t track, so the safety burden falls on you.
You can wrap a custom kernel launch in a struct that implements DeviceOperation. The struct’s typed fields enforce the correct argument signature, and unsafe is confined to execute:
use cuda_async::device_context::with_default_device_policy;
use cuda_async::device_future::DeviceFuture;
use cuda_async::device_operation::{DeviceOperation, ExecutionContext};
use cuda_async::error::DeviceError;
use cuda_async::launch::AsyncKernelLaunch;
use cuda_async::scheduling_policies::SchedulingPolicy;
use cuda_core::{CudaFunction, LaunchConfig};
use std::future::IntoFuture;
use std::sync::Arc;
pub struct ScaleKernel {
function: Arc<CudaFunction>,
n: u32,
scale: f32,
input: Arc<Tensor<f32>>,
output: Tensor<f32>,
}
impl DeviceOperation for ScaleKernel {
type Output = (Arc<Tensor<f32>>, Tensor<f32>);
// execute is unsafe because it enqueues async GPU work without
// synchronizing — the returned tensors may still be in-flight.
// Callers must synchronize (e.g. via DeviceFuture) before accessing
// the output.
unsafe fn execute(
self,
ctx: &ExecutionContext,
) -> Result<<Self as DeviceOperation>::Output, DeviceError> {
let mut launcher = AsyncKernelLaunch::new(self.function);
launcher.push_arg(self.n);
launcher.push_arg(self.scale);
// SAFETY: input and output are framework-managed Tensor allocations.
// input is shared (Arc, read-only); output is exclusively written.
unsafe {
launcher
.push_device_ptr(self.input.cu_deviceptr())
.push_device_ptr(self.output.cu_deviceptr());
}
launcher.set_launch_config(LaunchConfig {
grid_dim: ((self.n + 255) / 256, 1, 1),
block_dim: (256, 1, 1),
shared_mem_bytes: 0,
});
unsafe { launcher.execute(ctx)? };
Ok((self.input, self.output))
}
}
// IntoFuture is a supertrait of DeviceOperation. Every custom DeviceOperation
// needs this boilerplate to enable `.await` and `.sync()`.
impl IntoFuture for ScaleKernel {
type Output = Result<(Arc<Tensor<f32>>, Tensor<f32>), DeviceError>;
type IntoFuture = DeviceFuture<(Arc<Tensor<f32>>, Tensor<f32>), ScaleKernel>;
fn into_future(self) -> Self::IntoFuture {
match with_default_device_policy(|policy| policy.schedule(self)) {
Ok(Ok(future)) => future,
Ok(Err(e)) | Err(e) => DeviceFuture::failed(e),
}
}
}
This is the same pattern the #[cutile::entry] macro uses to generate safe launchers for tile kernels. No unsafe at the call site.
Step 4: Compose with Tile Kernels
AsyncKernelLaunch implements DeviceOperation, so it chains with tile kernels. This pipeline runs a tile add (z = x + y), then the custom scale wrapper (w = scale * z):
// Run the tile add kernel — z = x + y.
let (z_part, _x, _y) =
tile_add::add(z.partition([tile_size]), x.clone(), y.clone()).await?;
let z: Tensor<f32> = z_part.unpartition();
// Run the custom scale kernel — w = scale * z.
let w: Tensor<f32> = zeros::<1, f32>([num_elements]).await?;
let (_z, w) = ScaleKernel {
function: scale_function,
n: num_elements as u32,
scale,
input: Arc::new(z),
output: w,
}
.await?;
See interop.rs for a complete, runnable version.
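When wiring a custom kernel into a pipeline like this, it can help to compare the GPU result against a host computation of the same expression, w = scale * (x + y). The following is a minimal sketch of such a check; reference_scale_add and all_close are hypothetical helper names, and copying w back to the host is omitted because it depends on the tensor API.

```rust
/// CPU reference for the pipeline above: w[i] = scale * (x[i] + y[i]).
/// Hypothetical test helper, not part of cuTile Rust.
pub fn reference_scale_add(x: &[f32], y: &[f32], scale: f32) -> Vec<f32> {
    assert_eq!(x.len(), y.len(), "x and y must have the same length");
    x.iter().zip(y).map(|(a, b)| scale * (a + b)).collect()
}

/// Element-wise comparison with a tolerance that is relative for values
/// larger than 1 in magnitude and absolute for smaller ones.
pub fn all_close(got: &[f32], want: &[f32], tol: f32) -> bool {
    got.len() == want.len()
        && got
            .iter()
            .zip(want)
            .all(|(g, w)| (g - w).abs() <= tol * w.abs().max(1.0))
}
```

A tolerance is used instead of exact equality because the GPU may fuse or reorder floating-point operations differently from the host.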
Using with_context for Low-Level Control
For more direct control, use with_context to access the CUDA stream and issue driver API calls directly:
use cuda_async::device_operation::{with_context, value, DeviceOperation, ExecutionContext};
use cuda_core::{malloc_async, memcpy_htod_async, free_async};
let host_data: Vec<f32> = vec![1.0; num_elements];
let num_bytes = num_elements * std::mem::size_of::<f32>();
// host_data is captured by reference — it must outlive the await so that
// the async memcpy can read from it until the stream synchronizes.
let op = with_context(|ctx: &ExecutionContext| {
let stream = ctx.get_cuda_stream();
let dptr = unsafe {
let dptr = malloc_async(num_bytes, stream);
memcpy_htod_async(dptr, host_data.as_ptr(), num_elements, stream);
dptr
};
value(dptr)
});
let dptr = op.await?;
// host_data is safe to drop now — the await synchronized the stream.
// Clean up: free the device memory on a stream.
with_context(move |ctx: &ExecutionContext| {
unsafe { free_async(dptr, ctx.get_cuda_stream()) };
value(())
})
.await?;
This gives you full access to the CUDA driver API while participating in the DeviceOperation model. Everything inside the unsafe block is your responsibility to get right.
Coming from Triton
Triton and cuTile Rust both let you write kernels in terms of tile-level operations. Many patterns that require explicit warp specialization in Triton (e.g., warp_specialize in tl.range) are handled implicitly by the cuTile Rust compiler:
| Triton (manual) | cuTile Rust (automatic) |
|---|---|
| Assign producer warps to prefetch tiles from global → shared memory | Compiler generates shared memory staging for |
| Assign consumer warps to compute on shared memory tiles | Compiler maps tile arithmetic to Tensor Cores and registers |
| Software pipeline with | Compiler uses TMA for hardware-assisted pipelining on supported architectures |
| Manual | |
| Tune | |
For patterns that don’t map to the tile model, compile the kernel with Triton (or write it in CUDA C++) and integrate it via AsyncKernelLaunch as described above. Since Triton outputs PTX, you can load it directly:
let module = load_module_from_ptx(triton_generated_ptx, device_id)?;
let function = Arc::new(module.load_function("gemm_kernel")?);
Continue to Debugging for troubleshooting, or see Performance Tuning for optimization techniques. This chapter builds on the DeviceOperation model introduced in Async Execution.