Memory and Data Movement#

GPUs have their own memory, separate from the host. Getting data to and from the device – and choosing the right kind of memory once it’s there – is fundamental to every CUDA program. This chapter covers cuda-oxide’s memory abstractions, from host/device transfers to shared memory and the kernel ABI.

See also

CUDA Programming Guide – Device Memory for the authoritative reference on the CUDA memory hierarchy and access patterns.

The CUDA memory hierarchy#

NVIDIA GPUs expose several tiers of memory, each with different capacity, latency, and scope:

Memory                 Scope         Typical size                 Latency            cuda-oxide API

Registers              Per thread    ~255 × 32-bit                0 cycles           Local variables
Shared memory          Per block     48–228 KB (arch-dependent)   ~5 cycles          SharedArray, DynamicSharedArray
L1 cache               Per SM        Combined with shared         Hardware-managed   Automatic
L2 cache               Chip-wide     Up to 50 MB (Hopper)         ~30 cycles         Automatic
Global memory (DRAM)   All threads   16–80 GB (HBM)               ~400 cycles        DeviceBuffer, DeviceBox

The guiding principle: move frequently accessed data into faster, closer memory. Registers are fastest but per-thread; shared memory is fast and visible to the whole block; global memory is large but slow.

Figure (../_images/memory-hierarchy.svg): The CUDA memory hierarchy from fastest (registers, per-thread) to largest (global DRAM, all threads). Each tier trades capacity for latency. The right panel shows scope and the cuda-oxide API for each level.

Contexts and streams#

Before diving into the memory APIs, two host-side concepts that appear in every code example need a brief introduction: contexts and streams.

A CUDA context (CudaContext) binds the host thread to a specific GPU. It owns all resources on that device – modules, streams, allocations. You typically create one at the start of your program:

use cuda_core::CudaContext;

let ctx = CudaContext::new(0).unwrap();   // bind to GPU 0

A CUDA stream (CudaStream) is an ordered queue of GPU operations. Operations enqueued on the same stream execute in FIFO order – each one sees all side effects of the preceding operations. Operations on different streams may overlap and run concurrently with no ordering guarantees between them.

let stream = ctx.default_stream();     // the implicit, always-available stream
let work_stream = stream.fork()?;      // a new stream, joined to the parent's current position

Every memory transfer and kernel launch requires a stream. For single-stream programs (which cover most of the examples in this book), the default stream is all you need – everything is sequential and correct by construction. Multi-stream pipelines unlock overlap between compute and data movement, but require explicit synchronization via events or join:

Same stream:       [kernel_A] → [memcpy_B] → [kernel_C]     (automatic ordering)
Different streams: [kernel_A on stream 1] | [memcpy_B on stream 2]  (concurrent, needs events)

Figure (../_images/cuda-streams.svg): Top: single-stream execution where operations execute automatically in FIFO order. Bottom: multi-stream execution where streams A and B run concurrently, with an event establishing the data dependency between kernel_write and kernel_read.

See also

The Launching Kernels chapter covers stream usage in launch macros, and the Async GPU Programming section covers DeviceOperation which manages streams automatically.

DeviceBuffer – host/device transfers#

DeviceBuffer<T> in cuda_core is the primary way to allocate device memory and move data between host and GPU:

use cuda_core::{CudaContext, DeviceBuffer};

let ctx = CudaContext::new(0).unwrap();
let stream = ctx.default_stream();

// Host → device: copy a host slice to GPU memory
let host_data = vec![1.0f32; 1024];   // example input
let a_dev = DeviceBuffer::from_host(&stream, &host_data).unwrap();

// Allocate zeroed device memory
let mut c_dev = DeviceBuffer::<f32>::zeroed(&stream, 1024).unwrap();

// Device → host: read results back
let results = c_dev.to_host_vec(&stream).unwrap();

Key methods#

Method                             Direction       Description

from_host(&stream, &[T])           Host → Device   Allocate + async copy
zeroed(&stream, len)               —               Allocate + zero-fill
to_host_vec(&stream)               Device → Host   Async copy + return Vec<T>
copy_to_host(&stream, &mut [T])    Device → Host   Copy into existing slice
cu_deviceptr()                     —               Raw CUdeviceptr for FFI
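
When results are read back repeatedly – say, once per iteration of a host-side loop – copy_to_host avoids allocating a fresh Vec on every readback. A minimal sketch, assuming c_dev is the 1024-element buffer from the example above:

// Reuse a preallocated host buffer instead of allocating a new Vec each time
let mut host_out = vec![0.0f32; 1024];
c_dev.copy_to_host(&stream, &mut host_out).unwrap();

// Raw device pointer, e.g. for FFI or manual driver calls
let raw = c_dev.cu_deviceptr();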

Ownership and drop#

DeviceBuffer frees its allocation synchronously on drop via cuMemFree. This is a blocking driver call – it internally synchronizes the entire device to ensure no in-flight kernel is still touching the memory. In practice, this means:

  • Dropping while a kernel is running will stall the host thread until the GPU is completely idle, then free the memory.

  • Dropping after synchronization (e.g., after to_host_vec or stream.synchronize()) has no additional cost because the device is already idle.

For single-stream workloads this is fine – everything executes in FIFO order, so by the time you read results back, all kernels are done and the free is instant. The cost becomes visible in multi-stream scenarios where you want to overlap compute with memory operations; a synchronous free on one stream can stall work on all other streams.
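
A minimal sketch of the cheap-drop case, reusing host_data from the earlier example and eliding the kernel launches:

let a_dev = DeviceBuffer::from_host(&stream, &host_data).unwrap();

// ... launch kernels on `stream` that read from a_dev ...

stream.synchronize();   // wait for all enqueued work to finish
drop(a_dev);            // device is already idle, so the synchronous cuMemFree returns immediately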

DeviceBox – async-friendly device memory#

DeviceBox<T> in cuda_async solves the synchronous-free problem. On drop, it frees memory via cuMemFreeAsync on a dedicated deallocator stream. This is a stream-ordered operation – the free is enqueued on the deallocator stream and executes only after all preceding work on that stream completes. Critically, it does not synchronize the device:

use cuda_async::device_box::DeviceBox;
use cuda_async::device_context::init_device_contexts;

init_device_contexts(0, 1)?;  // Initialize device context map (default device 0)

// DeviceBox wraps a device pointer; freed asynchronously on drop
let dev_ptr: DeviceBox<f32> = /* allocated by DeviceOperation chain */;
// When dev_ptr is dropped, cuMemFreeAsync is called on the deallocator stream.
// Other streams continue running without stalling.

Choosing between them#

                  DeviceBuffer                         DeviceBox

Crate             cuda_core                            cuda_async
Free on drop      cuMemFree (sync – stalls device)     cuMemFreeAsync (async – no stall)
Use with          cuda_launch!                         cuda_launch_async!
Host readback     to_host_vec()                        Via explicit memcpy operation
Best for          Single-stream, blocking workloads    Multi-stream, pipelined workloads

Tip

For latency-sensitive teardown in multi-stream pipelines, prefer DeviceBox. For straightforward single-stream examples, DeviceBuffer is simpler and the synchronous free is effectively zero-cost.

Argument scalarization#

When you write a kernel that takes &[f32], the host and device don’t agree on how to represent a Rust slice in memory – the struct layout can differ between the host’s x86 ABI and the NVPTX ABI. cuda-oxide solves this by scalarizing aggregate types at the kernel boundary: decomposing them into primitive values that both sides interpret identically.

Kernel parameter type        What the host actually passes

&[T]                         ptr: *const T + len: u64
DisjointSlice<T>             ptr: *mut T + len: u64
T (scalar)                   T directly
Struct { a: u32, b: f32 }    a: u32 + b: f32 (flattened)
Zero-sized types             Stripped entirely

This is why cuda_launch! uses slice() and slice_mut() wrappers – they extract the pointer and length for you. Inside the kernel, the compiler reconstitutes the slice struct from the scalar parameters, so your kernel code sees normal &[T] types.
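
As a rough sketch of what this looks like at a call site – the full set of cuda_launch! fields (stream, grid and block configuration) is covered in the Launching Kernels chapter and omitted here, and the kernel name and buffers are illustrative:

// Hypothetical kernel: #[kernel] pub fn scale(input: &[f32], mut out: DisjointSlice<f32>, factor: f32)
cuda_launch! {
    kernel: scale,
    args: [slice(&input_dev), slice_mut(&mut out_dev), 2.0f32],
    // ... stream and launch configuration omitted ...
}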

Figure (../_images/scalarization-abi.svg): Argument scalarization: the host passes Rust slices as (ptr, len) pairs through the ABI boundary. The device kernel receives flat scalar parameters and the compiler reconstitutes the original Rust types inside the kernel.

Tip

Scalarization is completely invisible in normal kernel code. You write &[f32] in the signature and use it as a regular slice. The cuda_launch! macro and the compiler handle everything else.

DisjointSlice – safe parallel writes#

In CUDA C++, the standard pattern for parallel output is a raw pointer into global memory that every thread indexes into. This is inherently unsafe – nothing prevents two threads from writing to the same location.

cuda-oxide provides DisjointSlice<T> as a safe alternative. It wraps a mutable slice and only allows writes through a ThreadIndex, ensuring each thread accesses a unique element:

use cuda_device::{kernel, thread, DisjointSlice};

#[kernel]
pub fn double(input: &[f32], mut out: DisjointSlice<f32>) {
    let idx = thread::index_1d();
    if let Some(out_elem) = out.get_mut(idx) {
        *out_elem = input[idx.get()] * 2.0;
    }
}
  • get_mut(idx) returns Option<&mut T> – None for out-of-bounds indices, eliminating buffer overruns.

  • For patterns like reductions where multiple threads intentionally write to the same location, get_unchecked_mut (unsafe) provides an escape hatch.

Why ThreadIndex makes this safe#

The key to DisjointSlice’s safety is ThreadIndex – a newtype around usize that can only be constructed from hardware-provided thread identifiers like thread::index_1d(). You cannot create a ThreadIndex from an arbitrary integer:

let idx = thread::index_1d();          // ThreadIndex -- ok
let bad = ThreadIndex::new(42);        // does not exist -- no public constructor

This works because CUDA’s thread indices are unique, hardware-provided values: every thread in a block receives a distinct threadIdx from the GPU’s warp scheduler. For 1D grid launches (where only the x dimension is used), the global index derived from blockIdx.x * blockDim.x + threadIdx.x is unique across the entire grid. By restricting get_mut to accept only ThreadIndex, the type system enforces at compile time that each thread can only write to its own element – turning a data-race hazard into a type error.
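
A quick arithmetic check of the uniqueness claim, using plain variables rather than cuda-oxide API:

// block_dim = 256 threads per block; block 2, thread 5
let (block_idx, block_dim, thread_idx) = (2usize, 256usize, 5usize);
let global = block_idx * block_dim + thread_idx;   // 2 * 256 + 5 = 517
assert_eq!(global, 517);
// Because thread_idx is always < block_dim, no other (block_idx, thread_idx)
// pair maps to 517 – every thread in the grid gets a distinct global index.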

Shared memory#

Shared memory is fast on-chip memory visible to all threads within a block. It is the primary tool for inter-thread communication and data reuse within a block, sitting between registers (per-thread) and global memory (all threads) in both speed and scope.

Static shared memory – SharedArray#

When the size is known at compile time, use SharedArray<T, N>:

use cuda_device::{kernel, thread, SharedArray, DisjointSlice};

static mut TILE: SharedArray<f32, 256> = SharedArray::UNINIT;

#[kernel]
pub fn smem_example(input: &[f32], mut out: DisjointSlice<f32>) {
    let idx = thread::index_1d();
    let tid = thread::threadIdx_x() as usize;

    // Load from global to shared
    unsafe { TILE[tid] = input[idx.get()]; }
    thread::sync_threads();

    // Read neighbor from shared (much faster than global)
    let neighbor = if tid > 0 {
        unsafe { TILE[tid - 1] }
    } else {
        0.0
    };
    thread::sync_threads();

    if let Some(out_elem) = out.get_mut(idx) {
        *out_elem = unsafe { TILE[tid] } + neighbor;
    }
}

Each static mut SharedArray maps to a separate .shared allocation in PTX. The sync_threads() barrier ensures all threads have finished writing before any thread reads.

Dynamic shared memory – DynamicSharedArray#

When the size depends on runtime parameters, use DynamicSharedArray<T> and specify the allocation size via LaunchConfig::shared_mem_bytes:

use cuda_device::{kernel, thread, DynamicSharedArray, DisjointSlice};

#[kernel]
pub fn dynamic_smem_example(input: &[f32], mut out: DisjointSlice<f32>) {
    let smem = DynamicSharedArray::<f32>::get();
    let tid = thread::threadIdx_x() as usize;

    unsafe { *smem.add(tid) = input[thread::index_1d().get()]; }
    thread::sync_threads();
    // ... use smem ...
}

On the host, set the size at launch:

let config = LaunchConfig {
    grid_dim: (num_blocks, 1, 1),
    block_dim: (256, 1, 1),
    shared_mem_bytes: 256 * std::mem::size_of::<f32>() as u32,
};

Multiple dynamic arrays can share the same allocation by using DynamicSharedArray::offset(byte_offset) to partition it.
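
A sketch of that partitioning for a 256-thread block that wants an f32 staging tile followed by a u32 flag array in the same allocation – the turbofish form and return type of offset are assumed to mirror get():

let tile  = DynamicSharedArray::<f32>::get();              // bytes [0, 1024): 256 × f32
let flags = DynamicSharedArray::<u32>::offset(256 * 4);    // bytes [1024, 2048): 256 × u32

// The host must then request the sum of both partitions:
// shared_mem_bytes: (256 * 4 + 256 * 4) as u32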

Alignment#

Type                          Default alignment   Notes

SharedArray<T, N>             align_of::<T>()     Standard alignment
SharedArray<T, N, 128>        128 bytes           Required for TMA operations
DynamicSharedArray<T>         16 bytes            nvcc-compatible default
DynamicSharedArray<T, 128>    128 bytes           Required for TMA
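
For example, a statically sized tile destined for TMA transfers adds the alignment parameter from the table above (the element count is illustrative):

// 128-byte-aligned shared tile, as TMA requires
static mut TMA_TILE: SharedArray<f32, 1024, 128> = SharedArray::UNINIT;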

Common pitfalls#

  • Missing sync_threads(): Without a barrier between shared memory writes and reads, threads may read stale or uninitialized data.

  • Exceeding SM limits: Requesting too much shared memory causes CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES. Check your architecture’s limit.

See also

CUDA Programming Guide – Shared Memory for details on bank conflicts, broadcasting, and optimal access patterns.

Unified memory and HMM#

By default, GPUs operate in a separate address space from the CPU. A GPU cannot dereference an ordinary host pointer – the address simply doesn’t map to anything in the GPU’s page tables. The traditional CUDA workflow therefore requires explicit allocation in device memory followed by explicit copies:

┌──────────────────┐              ┌──────────────────┐
│   CPU Memory     │    PCIe /    │   GPU Memory     │
│   (host DRAM)    │◄────────────►│   (device HBM)   │
│                  │   NVLink     │                  │
└──────────────────┘    copy      └──────────────────┘
   Separate address spaces -- GPU cannot dereference host pointers

CUDA provides mechanisms that relax this restriction by letting the GPU access host memory transparently, at the cost of page-fault latency on first access.

Memory access modes at a glance#

Mode

What GPU can access

Allocation required

First-access cost

Hardware requirement

Explicit copy

Device memory only

DeviceBuffer

None (data copied upfront)

Any CUDA GPU

Pinned (mapped)

Specific host buffers

cudaHostAlloc

High (~10–20 µs per access)

Any CUDA GPU

Unified Memory

Managed allocations

cudaMallocManaged

Medium (page migration)

Kepler+ (sm_30+)

HMM

Any host memory

None

Medium (page fault + fetch)

Turing+ on Linux

cuda-oxide primarily uses explicit copies (DeviceBuffer, DeviceBox) for bulk data and HMM for non-move closure captures and small configuration data.

Unified Memory#

Unified Memory is CUDA’s managed-memory allocator (cudaMallocManaged). The resulting pointer is valid on both the CPU and GPU – the CUDA runtime tracks which processor “owns” each page and migrates it on demand. When the GPU accesses a page that currently resides in host DRAM, the runtime transparently copies it to device memory before the kernel reads the data. This migration is invisible to your code but not free: the first access from the “wrong” side incurs a page fault and a DMA transfer over the interconnect. Subsequent accesses to the same page hit the GPU’s local cache.

cuda-oxide does not currently wrap cudaMallocManaged directly. For managed-memory workflows you would use the CUDA driver API through raw bindings. In practice, DeviceBuffer::from_host (explicit copy) covers most use cases and gives predictable performance.
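
If you do need a managed allocation today, the raw driver call looks roughly like the sketch below; `driver` stands in for whatever CUDA driver bindings you link against (the module name and error handling are assumptions – cuMemAllocManaged itself is the driver-API counterpart of cudaMallocManaged):

unsafe {
    let mut dptr: u64 = 0;                                // CUdeviceptr
    let bytes = 1024 * std::mem::size_of::<f32>();
    // CU_MEM_ATTACH_GLOBAL (0x1): pointer is valid on the host and on any device
    let result = driver::cuMemAllocManaged(&mut dptr, bytes, 0x1);
    assert_eq!(result, 0);                                // CUDA_SUCCESS

    // ... dereference dptr on the CPU or pass it to kernels; pages migrate on demand ...

    driver::cuMemFree_v2(dptr);
}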

HMM (Heterogeneous Memory Management)#

HMM is a Linux kernel feature that extends Unified Memory’s demand-paging model to all system memory – heap allocations, mmap regions, and even stack variables. With HMM enabled, the GPU can dereference any valid host pointer without a special CUDA allocator:

let factor = 5i32;                           // ordinary stack variable
let scale = |x: i32| x * factor;            // captures &factor (non-move)
cuda_launch! { kernel: scale, args: [...] }  // GPU reads &factor via HMM

Unlike Unified Memory, HMM requires no special allocation API – the pointer is a plain host address. When ATS (Address Translation Services) is available on hardware-coherent platforms like Grace Hopper, it supersedes HMM and provides hardware coherence at cache-line granularity; HMM is automatically disabled.

What happens on a page fault#

When a kernel loads from an address whose page is not resident in device memory, the hardware and driver cooperate to fetch it:

  1. The SM executes a global load (ld.global) for a virtual address.

  2. The GPU MMU looks up the address in the TLB. On a miss, it walks the device page table.

  3. If the page table has no mapping, the GPU raises a page fault. The faulting warp stalls; other warps on the same SM can continue.

  4. The CUDA driver fault handler determines the source of the page:

    • Unified Memory – the CUDA runtime identifies the managed allocation and initiates migration.

    • HMM – the Linux kernel’s HMM layer resolves the host virtual address, pins the host page, and either migrates it or creates a remote mapping.

  5. A DMA transfer over PCIe or NVLink copies the page from host DRAM to device HBM. The GPU memory controller writes the data; the host memory controller services the read.

  6. The GPU page table is updated, the TLB is refilled, and the warp resumes. The page is now local and cached in L2; subsequent accesses cost only hundreds of cycles.

The latency of step 5 depends on the interconnect:

Interconnect       Bandwidth    Fault latency   Notes

PCIe 4.0 x16       ~25 GB/s     ~10–20 µs       Most desktop / workstation GPUs
PCIe 5.0 x16       ~50 GB/s     ~5–15 µs        Ada Lovelace + newer platforms
NVLink 4.0         ~900 GB/s    ~1–5 µs         Data-center GPUs (H100, B100)
Grace Hopper C2C   ~900 GB/s    <1 µs           Hardware coherent – uses ATS, not HMM

Because faults operate at page granularity (4 KB or 2 MB), a single fault can satisfy many threads. Warp-level coalescing also helps: 32 threads reading consecutive 4-byte elements touch at most one or two pages, not 32. On PCIe systems, a single fault costs roughly the same as a small cudaMemcpy – the advantage of demand paging is that you only pay for the pages you actually touch.

How cuda-oxide uses HMM#

cuda-oxide leverages HMM in two ways:

  1. Non-move closure captures. When a non-move closure is passed to a kernel, captured variables remain on the host stack and the GPU accesses them through HMM pointers. This avoids copying data that the kernel only reads once or infrequently.

  2. Struct ABI with dynamic layout. cuda-oxide matches Rust’s actual struct layout (including #[repr(Rust)] field reordering) on the device side, so HMM-accessed host structs are read correctly without #[repr(C)] or manual layout specification. The compiler queries rustc for field offsets and builds matching LLVM struct types with explicit padding.
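
A sketch of the second point, mirroring the closure example above – Params, its fields, and the kernel body are illustrative, not cuda-oxide API:

struct Params { n_iters: u32, tolerance: f32 }    // note: no #[repr(C)] required

let params = Params { n_iters: 50, tolerance: 1e-6 };
let solve = |mut x: f32| {
    // `params` is captured by reference; the GPU reads it through HMM on first access
    for _ in 0..params.n_iters {
        if x.abs() < params.tolerance { break; }
        x *= 0.5;
    }
    x
};
// cuda_launch! { kernel: solve, args: [...] }     // non-move closure captures &params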

HMM system requirements#

Requirement        Minimum

GPU architecture   Turing (compute capability 7.5+)
Linux kernel       6.1.24+, 6.2.11+, or 6.3+
CUDA driver        535+ with Open Kernel Modules

Check whether HMM is active on your system:

nvidia-smi -q | grep Addressing
# Addressing Mode : HMM  ← HMM is enabled

When to use HMM vs explicit copies#

Scenario                                       Recommended approach

Large arrays processed by many threads         DeviceBuffer::from_host (explicit copy)
Small read-only configuration data             HMM (pass pointer, let GPU page-fault)
Data shared between CPU and GPU iteratively    Explicit copies with double-buffering
Prototyping / quick experiments                HMM (simplest – no copies needed)

Tip

HMM is a convenience, not a performance strategy. For bandwidth-sensitive kernels, explicit copies to device memory will always be faster because they avoid page-fault overhead and use the full memory bus width.

See also

CUDA Programming Guide – Unified Memory and NVIDIA Blog – Simplifying GPU Development with HMM for the full details on page migration, prefetching, and system requirements.