# The CUDA Execution Model
This chapter covers the CUDA SIMT execution model – how work is organized into threads, warps, blocks, and grids – and how cuda-oxide exposes each level through safe, ergonomic Rust APIs.
> **See also:** *CUDA Programming Guide – Programming Model* for the authoritative reference on the CUDA execution model.
## Threads, blocks, and grids
Every kernel launch creates a grid of thread blocks. The three-level hierarchy is the foundation of GPU programming:
| Level | What it is | Size | Key property |
|---|---|---|---|
| Grid | All blocks launched by one kernel call | Up to 2³¹ - 1 blocks in x (65,535 in y and z) | Blocks execute independently |
| Block | A group of threads that can cooperate | Up to 1024 threads | Threads share fast on-chip memory |
| Warp | 32 consecutive threads within a block | Always 32 | Execute instructions in lockstep (SIMT) |
A kernel launch specifies two things: how many blocks in the grid (the grid dimensions) and how many threads in each block (the block dimensions). The hardware then groups every 32 consecutive threads into warps automatically – you never create warps explicitly.
*Figure: The three-level CUDA thread hierarchy. A 2×2 grid of blocks, each containing 256 threads arranged in 8 warps of 32. The bottom legend maps CUDA concepts to their cuda-oxide API equivalents.*
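To make the grouping concrete, here is a minimal host-side sketch – plain Rust arithmetic, not a cuda-oxide API – of how a thread's position within its block determines its warp and lane:

```rust
// Illustrative arithmetic only: the hardware forms warps automatically.
// Thread `t` within a block lands in warp `t / 32` at lane `t % 32`.
fn warp_and_lane(thread_in_block: u32) -> (u32, u32) {
    const WARP_SIZE: u32 = 32;
    (thread_in_block / WARP_SIZE, thread_in_block % WARP_SIZE)
}

fn main() {
    let block_dim = 256;
    println!("{} warps per block", block_dim / 32); // 8, as in the figure
    assert_eq!(warp_and_lane(70), (2, 6)); // thread 70 -> warp 2, lane 6
}
```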
## Thread indexing in cuda-oxide
Inside a kernel, every thread needs to know which element it should work on. CUDA provides built-in variables (`threadIdx`, `blockIdx`, `blockDim`, `gridDim`); cuda-oxide wraps these in the `cuda_device::thread` module:
```rust
use cuda_device::{kernel, thread, DisjointSlice};

#[kernel]
pub fn vecadd(a: &[f32], b: &[f32], mut c: DisjointSlice<f32>) {
    let idx = thread::index_1d();
    if let Some(c_elem) = c.get_mut(idx) {
        *c_elem = a[idx.get()] + b[idx.get()];
    }
}
```
`thread::index_1d()` computes `blockIdx.x * blockDim.x + threadIdx.x` – the global flat index that maps each thread to exactly one array element. This is the common case for 1D data-parallel kernels.
For cases where you need individual components, cuda-oxide exposes the raw accessors:
| cuda-oxide API | Equivalent CUDA C++ | Returns |
|---|---|---|
| `thread::index_1d()` | `blockIdx.x * blockDim.x + threadIdx.x` | Global 1D thread index |
| `thread::threadIdx_x()` | `threadIdx.x` | Thread's position within its block |
| `thread::blockIdx_x()` | `blockIdx.x` | Block's position within the grid |
| `thread::blockDim_x()` | `blockDim.x` | Number of threads per block (x) |
> **Tip:** For multi-dimensional indexing (e.g., 2D matrix operations), use `threadIdx_y()`, `blockIdx_y()`, and `blockDim_y()` alongside the `_x` variants to compute row/column indices.
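As a sketch of that pattern – assuming the accessors return plain `usize` values and that `DisjointSlice::get_mut` also accepts a `usize` index, neither of which is confirmed API – a 2D kernel might compute row/column indices like this:

```rust
use cuda_device::{kernel, thread, DisjointSlice};

// A sketch of 2D indexing with the _x/_y accessors from the table above.
// Assumes the accessors return `usize` and `get_mut` takes a plain `usize`;
// `width` is a hypothetical caller-supplied row length.
#[kernel]
pub fn scale_matrix(mut m: DisjointSlice<f32>, width: usize) {
    let col = thread::blockIdx_x() * thread::blockDim_x() + thread::threadIdx_x();
    let row = thread::blockIdx_y() * thread::blockDim_y() + thread::threadIdx_y();
    // Flatten (row, col) into the 1D backing storage; out-of-range
    // threads get None and simply do nothing.
    if let Some(elem) = m.get_mut(row * width + col) {
        *elem *= 2.0;
    }
}
```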
## Warps and SIMT execution
A warp is the fundamental scheduling unit on NVIDIA GPUs. Every 32 consecutive threads in a block form one warp, and all 32 threads in a warp execute the same instruction at the same time – but on different data. This model is called SIMT (Single Instruction, Multiple Thread).
When all threads in a warp follow the same control-flow path, the warp achieves
full throughput. When threads diverge (different threads take different if
branches), the hardware serializes the paths: it executes one branch with some
threads masked off, then the other branch, then reconverges. This is called
branch divergence and it directly reduces throughput.
*Figure: Left: uniform execution where all 32 threads run the same instruction in one cycle. Right: branch divergence where even and odd threads take different paths, requiring two serial passes.*
### Why this matters
You don’t need to think about warps to write correct kernels – cuda-oxide handles the details. But understanding SIMT helps you write fast ones:
- **Prefer uniform control flow.** When all threads in a warp evaluate the same branch, there is no divergence penalty.
- **Data-dependent branches are fine** as long as nearby threads (those in the same warp) tend to take the same path.
- **Avoid thread-ID-based branching** like `if thread::threadIdx_x() % 2 == 0` inside hot loops – this guarantees every warp diverges (see the sketch below).
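To illustrate, the two sketches below apply different updates to even and odd data (assuming the accessors return integer values). The first branches per thread and forces two serialized passes in every warp; the second branches on `blockIdx_x()`, which is identical for every thread in a block, so no warp diverges. The kernels compute different results – the comparison is about control-flow shape, not equivalence:

```rust
use cuda_device::{kernel, thread, DisjointSlice};

// Diverges: even and odd lanes of every warp take different paths,
// so the hardware executes both branches serially per warp.
#[kernel]
pub fn per_thread_branch(mut out: DisjointSlice<f32>) {
    let idx = thread::index_1d();
    if let Some(elem) = out.get_mut(idx) {
        if idx.get() % 2 == 0 {
            *elem += 1.0;
        } else {
            *elem -= 1.0;
        }
    }
}

// Uniform: blockIdx is the same for all threads in a block (and hence
// in each of its warps), so every warp follows a single path.
#[kernel]
pub fn per_block_branch(mut out: DisjointSlice<f32>) {
    let idx = thread::index_1d();
    if let Some(elem) = out.get_mut(idx) {
        if thread::blockIdx_x() % 2 == 0 {
            *elem += 1.0;
        } else {
            *elem -= 1.0;
        }
    }
}
```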
> **See also:** *CUDA Programming Guide – SIMT Architecture* for the full hardware specification of warp execution and reconvergence.
## Hardware mapping
When you launch a kernel, the GPU’s hardware scheduler assigns each block to a Streaming Multiprocessor (SM). Multiple blocks can run concurrently on the same SM – the exact number depends on the block’s resource usage (registers, shared memory, threads).
The key insight: you control the grid and block dimensions; the hardware controls everything else. You never specify which SM runs which block, or in what order blocks execute. This separation is what lets the same kernel scale from a laptop GPU with a handful of SMs to a data-center GPU with 100+.
*Figure: Eight blocks assigned to four SMs by the GPU scheduler. Each SM has its own warp schedulers, CUDA cores, and shared memory/L1 cache. Blocks 4-7 (dashed arrows) run after blocks 0-3 complete or are queued alongside them if resources permit.*
### What limits concurrency
Each SM has a fixed pool of resources. A block is assigned to an SM only if the SM has enough of all of the following:
| Resource | Typical limit (Ampere) | Controlled by |
|---|---|---|
| Threads | 2048 per SM | `block_dim` |
| Registers | 65,536 per SM | Compiler allocation |
| Shared memory | 164 KB per SM (configurable) | `shared_mem_bytes` |
| Block slots | 32 per SM | Grid size |
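To see how these limits interact, here is a back-of-the-envelope sketch in plain Rust using the Ampere figures from the table. The 48 KB shared-memory figure is a hypothetical example; real occupancy also depends on the compiler's register allocation:

```rust
// Illustrative occupancy arithmetic with the Ampere limits above:
// the tightest resource determines how many blocks fit on one SM.
fn main() {
    let threads_per_sm = 2048u32;
    let smem_per_sm = 164 * 1024u32; // 164 KB, in bytes
    let block_slots = 32u32;

    let block_dim = 256u32;
    let smem_per_block = 48 * 1024u32; // hypothetical 48 KB per block

    let by_threads = threads_per_sm / block_dim; // 8 blocks
    let by_smem = smem_per_sm / smem_per_block;  // 3 blocks
    let resident = by_threads.min(by_smem).min(block_slots);
    println!("{resident} blocks resident per SM"); // 3: shared memory binds
}
```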
When a block finishes, its resources are freed and the scheduler immediately assigns a queued block to that SM. This is why launching more blocks than the GPU has SMs is not just okay – it’s the normal and expected pattern.
> **See also:** *CUDA Programming Guide – Hardware Implementation* for architecture-specific SM resource limits and occupancy calculations.
## Launch configuration
On the host side, `LaunchConfig` tells the runtime how to shape the grid:
```rust
use cuda_core::LaunchConfig;

// Quick 1D launch: 256 threads per block, enough blocks to cover N elements
let cfg = LaunchConfig::for_num_elems(N as u32);
```
`for_num_elems` uses a block size of 256 and computes the grid size via ceiling division – the right default for most element-wise kernels. For more control, construct `LaunchConfig` directly:
```rust
let cfg = LaunchConfig {
    grid_dim: (4, 4, 1),    // 4×4 = 16 blocks
    block_dim: (16, 16, 1), // 16×16 = 256 threads per block
    shared_mem_bytes: 0,    // no dynamic shared memory
};
```
Then pass it to `cuda_launch!`:
```rust
cuda_launch! {
    kernel: vecadd,
    stream: stream,
    module: module,
    config: LaunchConfig::for_num_elems(N as u32),
    args: [slice(a_dev), slice(b_dev), slice_mut(c_dev)]
}
.expect("Kernel launch failed");
```
Or with the async API:
```rust
cuda_launch_async! {
    kernel: vecadd,
    module: module,
    config: LaunchConfig::for_num_elems(N as u32),
    args: [slice(a_dev), slice(b_dev), slice_mut(c_dev)]
}
.sync()?;
```
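If the value returned by `cuda_launch_async!` can be bound and synced later – an assumption based on the `.sync()` call above, not confirmed API – the async form lets host work overlap the kernel:

```rust
// Sketch only: assumes the macro's return value can be held and synced
// later. `do_other_host_work` is a hypothetical CPU-side function.
let pending = cuda_launch_async! {
    kernel: vecadd,
    module: module,
    config: LaunchConfig::for_num_elems(N as u32),
    args: [slice(a_dev), slice(b_dev), slice_mut(c_dev)]
};
do_other_host_work();
pending.sync()?;
```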
### Choosing block size
The block size is the single most important tuning knob:
- **256 threads** is a safe default. It balances occupancy (multiple blocks per SM) with register pressure on most architectures.
- **Powers of 2** (128, 256, 512) align naturally with warp boundaries and avoid wasting threads.
- **Too small** (< 128) may leave warp schedulers underutilized.
- **Too large** (1024) uses the full block thread limit, which may reduce the number of concurrent blocks per SM.
The grid size follows from the block size and the problem size: `grid_x = (N + block_x - 1) / block_x`. This is exactly what `LaunchConfig::for_num_elems` computes.
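Written out by hand – a sketch of the same arithmetic `for_num_elems` performs – covering N = 1000 elements with 256-thread blocks looks like this:

```rust
// Manual equivalent of LaunchConfig::for_num_elems for n = 1000:
// ceiling division rounds 1000/256 up to 4 blocks (1024 threads total),
// so the last 24 threads have no element to process.
let n: u32 = 1000;
let block_x: u32 = 256;
let grid_x = (n + block_x - 1) / block_x; // = 4
let cfg = LaunchConfig {
    grid_dim: (grid_x, 1, 1),
    block_dim: (block_x, 1, 1),
    shared_mem_bytes: 0,
};
```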
### Bounds checking
Because the grid size is rounded up, some threads will have indices beyond the array length. cuda-oxide's `DisjointSlice` handles this safely – `get_mut` returns `None` for out-of-bounds indices, so those threads simply do nothing:
```rust
#[kernel]
pub fn vecadd(a: &[f32], b: &[f32], mut c: DisjointSlice<f32>) {
    let idx = thread::index_1d();
    if let Some(c_elem) = c.get_mut(idx) { // out-of-bounds threads skip
        *c_elem = a[idx.get()] + b[idx.get()];
    }
}
```
This is a deliberate departure from CUDA C++, where bounds-checking is the programmer’s responsibility. cuda-oxide’s approach eliminates an entire class of out-of-bounds memory bugs at the cost of a single branch (which is uniform across the warp for all but the last block).