Definitions#
This glossary defines key terms as they are used throughout the cuTile Rust book.
Tile#
A multi-dimensional array fragment that lives in GPU registers during kernel execution. Tiles are the fundamental unit of computation in cuTile Rust: you load data from tensors into tiles, compute on tiles, and store the results back. Tiles have compile-time static shapes and are represented by the type Tile<E, S>, where E is the element type and S is the shape (e.g., Tile<f32, {[16, 16]}>).
Tensor#
A multi-dimensional array stored in GPU global memory (HBM). Tensors are passed as kernel arguments — &Tensor<E, S> for read-only inputs and &mut Tensor<E, S> for writable outputs. Tensors do not support direct arithmetic; data must first be loaded into tiles.
Partition#
A logical division of a tensor into a grid of equally sized sub-regions, each of which is processed by one tile block. The term “partition” appears on both the host side and the device side, but refers to different things.
Host-side partition (mutable tensors only). Calling .partition([M, N]) on a Tensor<T> produces a Partition<Tensor<T>>. This is a host-side wrapper that records the partition_shape (the tile dimensions) alongside the original tensor. A host-side Partition<Tensor<T>> is what you pass to a kernel launcher in the position of a &mut Tensor<E, S> parameter. The partition_shape stored in the host-side Partition determines the static shape S that the kernel sees — for example, passing a Partition with partition_shape = [32, 64] means the kernel receives a &mut Tensor<T, {[32, 64]}>.
Only mutable tensors must be partitioned on the host side. This is because each &mut Tensor sub-region is written to by exactly one tile block, satisfying Rust’s exclusive access requirement for mutable memory: at most one writer may access a given region at a time. By partitioning before launch, the system guarantees that no two tile blocks write to overlapping memory.
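The exclusive-access argument above has a CPU-side analogue in standard Rust. The sketch below is only an analogy and does not use any cuTile API: `fill_block_ids` is a hypothetical helper that splits a mutable slice into disjoint chunks with `chunks_mut` and lets one scoped thread write each chunk, much as one tile block writes one sub-region of a partitioned tensor.

```rust
use std::thread;

/// CPU analogy only (not cuTile code): split `data` into disjoint chunks of
/// `chunk_len` elements and let one scoped thread write each chunk.
/// `chunk_len` must be nonzero, otherwise `chunks_mut` panics.
fn fill_block_ids(data: &mut [u32], chunk_len: usize) {
    thread::scope(|s| {
        // `chunks_mut` hands out non-overlapping `&mut [u32]` slices, so the
        // borrow checker can prove that no two writers touch the same element.
        for (block_id, chunk) in data.chunks_mut(chunk_len).enumerate() {
            s.spawn(move || {
                for v in chunk.iter_mut() {
                    *v = block_id as u32;
                }
            });
        }
    });
}
```

Because the split happens before any thread starts, the disjointness is established up front, which mirrors why the mutable tensor is partitioned on the host before the kernel is launched.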
Shared tensor references (no host-side partition required). Read-only inputs are passed as Arc<Tensor<T>> on the host side, corresponding to &Tensor<E, S> in the kernel signature. These do not need to be partitioned on the host side — multiple tile blocks may safely read from the same tensor or overlapping regions simultaneously, so there is no exclusive-access constraint to enforce. Instead, shared tensors are partitioned on the device side for greater flexibility in how they are accessed.
Device-side partition. Inside a kernel, calling .partition(const_shape![M, N]) on a &Tensor creates a read-only Partition view that can be indexed to load individual tiles (e.g., part.load([i, j])). This is how shared tensor references are divided into tiles for loading. Because the partitioning happens on the device side, the same &Tensor can be partitioned in different ways — or accessed with different indexing patterns — within the same kernel. For example, in GEMM the input matrices x and y are each partitioned with a different shape inside the kernel body (const_shape![BM, BK] and const_shape![BK, BN] respectively), even though both were passed as plain Arc<Tensor<T>> from the host.
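To make the indexing concrete, here is a minimal host-side model of what loading tile (i, j) from a partitioned view means. It is a sketch in plain Rust, not the cuTile implementation: `load_tile` is a hypothetical function, the tensor is assumed to be a row-major slice, and boundary handling is left out.

```rust
/// Hypothetical model of `part.load([i, j])`: copy tile (i, j) of shape
/// TM x TN out of a row-major matrix that has `cols` columns.
/// Assumes the tile lies fully inside the matrix (no boundary handling).
fn load_tile<const TM: usize, const TN: usize>(
    data: &[f32],
    cols: usize,
    i: usize,
    j: usize,
) -> [[f32; TN]; TM] {
    let mut tile = [[0.0f32; TN]; TM];
    for r in 0..TM {
        for c in 0..TN {
            // Tile (i, j) starts at row i * TM and column j * TN.
            tile[r][c] = data[(i * TM + r) * cols + (j * TN + c)];
        }
    }
    tile
}
```

Calling `load_tile::<2, 2>` and `load_tile::<1, 4>` on the same slice corresponds to partitioning one read-only tensor in two different ways within the same kernel.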
The generated launcher code accepts Partition<Tensor<T>> for every &mut Tensor parameter and Arc<Tensor<T>> for every &Tensor parameter.
Grid dimensions. A host-side partition’s grid is computed by dividing the tensor’s shape by the partition shape, rounding up: grid[i] = ceil(tensor_shape[i] / partition_shape[i]). The result is mapped to a 3D tuple (x, y, z), with trailing dimensions set to 1 for tensors of rank less than 3. For example, a [128, 256] tensor partitioned with [32, 64] produces a grid of (4, 4, 1).
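The grid formula can be checked with a few lines of plain Rust. This is a sketch of the arithmetic only: `partition_grid` is a hypothetical name, and it assumes that dimension i of the tensor maps to position i of the (x, y, z) tuple.

```rust
/// Ceiling division per dimension, padded with 1s to a 3D (x, y, z) tuple.
/// Assumptions: both shapes have the same rank, the rank is at most 3,
/// every partition extent is nonzero, and dimension i maps to tuple slot i.
fn partition_grid(tensor_shape: &[usize], partition_shape: &[usize]) -> (usize, usize, usize) {
    assert_eq!(tensor_shape.len(), partition_shape.len(), "ranks must match");
    assert!(tensor_shape.len() <= 3, "at most three dimensions");
    let mut grid = [1usize; 3];
    for (g, (&t, &p)) in grid
        .iter_mut()
        .zip(tensor_shape.iter().zip(partition_shape.iter()))
    {
        // ceil(t / p) using integer arithmetic.
        *g = (t + p - 1) / p;
    }
    (grid[0], grid[1], grid[2])
}
```

With these assumptions, `partition_grid(&[128, 256], &[32, 64])` reproduces the (4, 4, 1) grid from the example, and a shape that does not divide evenly, such as `[130, 256]`, rounds up to 5 along the first dimension.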
Launch grid inference. At kernel launch time, the launcher calls .grid() on each &mut Tensor parameter’s host-side Partition and collects the resulting grids. If no explicit grid is specified via .grid() or .const_grid(), the launch grid is inferred from these partition grids. When multiple &mut Tensor parameters are present, all of their inferred grids must match or the launch will fail with an error. This is how partitioning a tensor on the host side determines how many tile blocks the kernel runs.
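The inference rule can be modeled as a small function. The sketch below is not the launcher's code: `infer_launch_grid` and the `Grid` alias are hypothetical, and it assumes that an explicitly specified grid is used as-is without being compared to the partition grids.

```rust
type Grid = (usize, usize, usize);

/// Hypothetical model of launch grid inference.
/// Assumption: an explicit grid, when given, takes precedence unchecked.
fn infer_launch_grid(explicit: Option<Grid>, partition_grids: &[Grid]) -> Result<Grid, String> {
    if let Some(grid) = explicit {
        return Ok(grid);
    }
    // Without an explicit grid, at least one partition grid is needed.
    let (first, rest) = partition_grids
        .split_first()
        .ok_or_else(|| "no partition grid to infer from".to_string())?;
    // Every remaining partition grid must agree with the first one.
    for grid in rest {
        if grid != first {
            return Err(format!("grid mismatch: {:?} vs {:?}", first, grid));
        }
    }
    Ok(*first)
}
```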
Tile Block#
A logical tile thread and the basic unit of concurrent execution on the GPU. Each tile block runs the kernel function once as a single logical thread of execution, operating on one partition of the data. A tile block is identified by its coordinates, obtained via get_tile_block_id(). The cuTile Rust compiler maps each tile block to one or more underlying CUDA execution units (thread blocks, clusters, or warps) depending on the target architecture — but from the programmer’s perspective, a tile block is simply a single-threaded context that processes one tile of data.
Tile Thread#
An alias for Tile Block, used throughout this book to emphasize the single-threaded programming model. Each tile thread executes the kernel function once as a single logical thread of execution. The terms “tile thread” and “tile block” are interchangeable — the API uses get_tile_block_id() and get_num_tile_blocks(), while the guides often say “tile thread” for clarity.
Concurrent Execution#
Multiple tile blocks making progress over a period of time by being scheduled onto available Streaming Multiprocessors (SMs). This aligns with Rust’s definition of concurrency (different parts of a program executing independently, not necessarily at the exact same instant), extended to the GPU context. When a kernel is launched with more tile blocks than the SMs can hold at once, the GPU’s hardware scheduler assigns tile blocks to SMs as resources become available. Some tile blocks execute in parallel while others are pending, but from the programmer’s perspective all tile blocks are logically concurrent: their relative order of execution is unspecified and they are independent of one another.
On the host side, concurrency also arises through CUDA streams and async/await: multiple DeviceOperations submitted to different streams can overlap in time, and the async runtime schedules them without requiring the programmer to specify an exact execution order.
Parallel Execution#
Multiple tile blocks executing at the same time on different SMs. All NVIDIA GPUs execute tile blocks in parallel — a modern GPU has tens to over a hundred SMs, each capable of running one or more tile blocks simultaneously. The distinction from concurrency is that parallelism refers specifically to simultaneous execution on separate hardware units, whereas concurrency is the broader concept of managing multiple in-progress tasks. In practice, a kernel launch exhibits both: tile blocks that fit on available SMs run in parallel, while the full set of tile blocks runs concurrently (scheduled over time as SMs become free).
This matches Rust’s distinction (see The Rust Programming Language, Ch. 17): parallelism is work happening at the exact same time on different hardware, while concurrency is independently executing tasks making progress over time — which may or may not involve parallelism.
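A rough back-of-the-envelope model may help separate the two terms. The sketch below is a deliberate simplification: `schedule_waves` is a hypothetical helper, the SM count and blocks-per-SM figures are inputs you would have to supply, and real hardware refills SMs continuously rather than in discrete waves.

```rust
/// Simplified model: returns (tile blocks running in parallel at the start,
/// number of "waves" needed to run all tile blocks).
/// Assumes `num_sms * blocks_per_sm` is nonzero.
fn schedule_waves(num_blocks: usize, num_sms: usize, blocks_per_sm: usize) -> (usize, usize) {
    // How many tile blocks the whole GPU can hold at the same time.
    let resident = num_sms * blocks_per_sm;
    // Parallel: the tile blocks that actually execute simultaneously.
    let parallel_now = num_blocks.min(resident);
    // Concurrent: all tile blocks make progress over ceil(num_blocks / resident) waves.
    let waves = (num_blocks + resident - 1) / resident;
    (parallel_now, waves)
}
```

For instance, 1000 tile blocks on a hypothetical GPU with 100 SMs and 2 resident tile blocks per SM gives 200 tile blocks in parallel and 5 waves overall; all 1000 are concurrent, but only 200 are parallel at any instant.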
Streaming Multiprocessor (SM)#
The primary processing unit on an NVIDIA GPU. Each SM has its own registers, shared memory, and execution pipelines including Tensor Cores. Tile blocks are scheduled onto SMs by the GPU’s hardware scheduler. A single SM can run multiple tile blocks concurrently if it has sufficient resources (registers, shared memory, thread slots). For architecture-specific details on SM resources, see the CUDA C++ Programming Guide.
Tensor Cores#
Specialized hardware units (available on Volta architecture and later) that perform small matrix multiply-accumulate operations in a single instruction. The mma() intrinsic in cuTile Rust maps to Tensor Core instructions. Tensor Cores impose alignment requirements on tile dimensions (e.g., dimensions must typically be multiples of 8 or 16 depending on the element type).
Global Memory (HBM)#
The GPU’s main memory, referred to in this book as HBM (High Bandwidth Memory). Global memory has the largest capacity of the GPU’s memory spaces but is slower to access than shared memory and registers. Tensor data resides in global memory.
Registers (RMEM)#
The fastest storage on the GPU, private to each thread within a tile block. Tile data lives in registers during computation. Each SM has a fixed-size register file, so larger tiles, which consume more registers, can reduce occupancy.
Const Generics#
Compile-time constant parameters on kernel functions, such as const BM: i32. Const generics enable the compiler to optimize register allocation, unroll loops, and generate architecture-specific code. Changing a const generic value triggers JIT recompilation. See also Const Generic Arrays.
Const Generic Arrays#
An extension to the Rust programming language that allows const generic parameters to have array types — for example, const S: [i32; 2]. Standard Rust only supports scalar const generics (integers, bool, char), so this syntax is not valid in ordinary Rust code. The cuTile Rust compiler recognizes array const generics and uses them to propagate tile shapes through the type system at compile time.
Const generic arrays are the idiomatic way to parameterize a kernel over its tile shape:
#[cutile::entry()]
fn add<const S: [i32; 2]>(
    z: &mut Tensor<f32, S>,
    x: &Tensor<f32, {[-1, -1]}>,
    y: &Tensor<f32, {[-1, -1]}>,
) { ... }
Here S is inferred from the host-side partition shape passed at launch time. Because S is a compile-time constant, the compiler can specialize the generated code for each distinct shape. A new value of S triggers JIT recompilation, just like scalar const generics.
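For comparison, the closest equivalent in standard Rust spells the shape as two scalar const generics. The sketch below is ordinary Rust on fixed-size arrays, not a cuTile kernel; it only illustrates how a shape carried in the type is inferred from the arguments and specialized per distinct value.

```rust
/// Standard-Rust analogue: the shape [M, N] is written as two scalar const
/// generics because stable Rust does not accept `const S: [i32; 2]`.
fn add<const M: usize, const N: usize>(
    z: &mut [[f32; N]; M],
    x: &[[f32; N]; M],
    y: &[[f32; N]; M],
) {
    // M and N are compile-time constants, so the compiler generates one
    // specialized copy of this function per distinct (M, N) pair.
    for r in 0..M {
        for c in 0..N {
            z[r][c] = x[r][c] + y[r][c];
        }
    }
}
```

Calling `add(&mut z, &x, &y)` with 2 x 2 arrays infers `M = 2, N = 2` from the argument types, loosely mirroring how `S` is inferred from the host-side partition shape.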
Dynamic Dimensions#
Tensor shape dimensions specified as -1 in the kernel signature (e.g., Tensor<f32, {[-1, -1]}>). Dynamic dimensions can vary across kernel launches without triggering recompilation. They carry no compile-time optimization benefit but provide flexibility for problem sizes that change often.
JIT Compilation#
cuTile Rust compiles kernels at first invocation through a multi-stage pipeline: Rust AST → MLIR → cubin. The compiled binary is cached in memory (in a thread-local HashMap), so subsequent launches from the same thread with the same generics skip compilation and reuse the cached binary. A new combination of const generics or type parameters triggers a new compilation.
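The caching behavior can be sketched with a thread-local map keyed by the kernel name and its const-generic values. Everything here is illustrative: `get_or_compile`, `compile`, `compile_count`, and the key layout are hypothetical, and the "binary" is just a string standing in for a cubin.

```rust
use std::cell::RefCell;
use std::collections::HashMap;

thread_local! {
    // Hypothetical cache: (kernel name, const-generic values) -> compiled artifact.
    static KERNEL_CACHE: RefCell<HashMap<(String, Vec<i32>), String>> =
        RefCell::new(HashMap::new());
    // Counts how often the stand-in compiler ran on this thread.
    static COMPILE_COUNT: RefCell<usize> = RefCell::new(0);
}

/// Stand-in for the real pipeline; returns a fake "binary".
fn compile(name: &str, generics: &[i32]) -> String {
    COMPILE_COUNT.with(|c| *c.borrow_mut() += 1);
    format!("cubin:{}:{:?}", name, generics)
}

/// Return the cached artifact, compiling only on the first request for a key.
fn get_or_compile(name: &str, generics: &[i32]) -> String {
    KERNEL_CACHE.with(|cache| {
        cache
            .borrow_mut()
            .entry((name.to_string(), generics.to_vec()))
            .or_insert_with(|| compile(name, generics))
            .clone()
    })
}

fn compile_count() -> usize {
    COMPILE_COUNT.with(|c| *c.borrow())
}
```

Requesting the same key twice compiles once; changing any const-generic value produces a new key and therefore a second compilation.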
DeviceOperation#
A lazy description of GPU work (an allocation, a kernel launch, or a data transfer) that is not executed until it is driven by .sync_on(&stream), .await, or tokio::spawn(). DeviceOperations can be composed with zip!, .apply(), and .and_then() to build dataflow graphs before any GPU work is submitted.
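The idea of describing work first and running it later can be modeled in a few lines of plain Rust. The `LazyOp` type below is a hypothetical stand-in, not the cuTile `DeviceOperation` trait, and its `and_then` signature is an assumption made for illustration; `execute` plays the role of a call such as `.sync_on(&stream)`.

```rust
/// Hypothetical stand-in for a lazy operation: it only stores a closure.
struct LazyOp<T> {
    run: Box<dyn FnOnce() -> T>,
}

impl<T: 'static> LazyOp<T> {
    /// Describe some work without running it.
    fn new(f: impl FnOnce() -> T + 'static) -> Self {
        LazyOp { run: Box::new(f) }
    }

    /// Chain a follow-up step that depends on the result; still nothing runs.
    fn and_then<U: 'static>(self, f: impl FnOnce(T) -> LazyOp<U> + 'static) -> LazyOp<U> {
        LazyOp::new(move || f((self.run)()).execute())
    }

    /// Drive the whole chain; this is the only place where work happens.
    fn execute(self) -> T {
        (self.run)()
    }
}
```

Building `LazyOp::new(...).and_then(...)` has no side effects; the closures run, in order, only when `execute` is called on the final value.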
DeviceFuture#
A DeviceFuture is a future that has been assigned resources — specifically, a device stream on which to execute — but has not yet started GPU work. A DeviceFuture is created when a DeviceOperation is scheduled (e.g., via into_future()), at which point the scheduling policy selects a stream. The actual GPU work is not submitted until the DeviceFuture is polled for the first time, which happens when you .await it or tokio::spawn it.
Broadcasting#
Replicating a smaller tile (or scalar) to match the shape of a larger tile. Broadcasting is a compile-time transformation — no extra memory is allocated. For example, a.broadcast(y.shape()) expands a scalar into a tile matching y’s partition shape.
Kernel Fusion#
Combining multiple logical operations into a single kernel so that intermediate results stay in registers rather than being written to and read back from global memory. Fused softmax is a canonical example: find-max, subtract, exponentiate, sum, and divide are all performed in one kernel launch.
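As a CPU-side illustration of the softmax example, the sketch below performs all five steps inside one function, so the intermediate values stay in local variables instead of being handed between separate passes. It is plain Rust, not a cuTile kernel, and `fused_softmax` is a hypothetical name.

```rust
/// Plain-Rust sketch of a numerically stable softmax over one row, with
/// find-max, subtract, exponentiate, sum, and divide in a single function.
fn fused_softmax(row: &[f32]) -> Vec<f32> {
    // Step 1: find the maximum for numerical stability.
    let max = row.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    // Steps 2 and 3: subtract the maximum and exponentiate.
    let exps: Vec<f32> = row.iter().map(|&v| (v - max).exp()).collect();
    // Step 4: sum the exponentials.
    let sum: f32 = exps.iter().sum();
    // Step 5: divide each exponential by the sum.
    exps.iter().map(|&e| e / sum).collect()
}
```

An unfused version would run each step as its own kernel and write its output to global memory before the next step reads it back; on the GPU, fusion removes those round trips.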
Arithmetic Intensity#
The ratio of compute operations (FLOPs) to memory traffic (bytes transferred), usually expressed in FLOPs per byte. A kernel with higher arithmetic intensity does more computation for each byte it moves, so it is less likely to be limited by memory bandwidth. A kernel with low arithmetic intensity (e.g., element-wise addition) is memory-bound; a kernel with high arithmetic intensity (e.g., matrix multiplication) is compute-bound.
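The two examples can be quantified with simple counting. The functions below are a sketch with hypothetical names, and the matrix multiplication estimate assumes an idealized kernel in which each matrix moves through global memory exactly once.

```rust
/// FLOPs per byte for element-wise `z = x + y` over `n` elements.
fn intensity_elementwise_add(n: u64, bytes_per_elem: u64) -> f64 {
    let flops = n; // one addition per element
    let bytes = 3 * n * bytes_per_elem; // read x, read y, write z
    flops as f64 / bytes as f64
}

/// FLOPs per byte for an (M x K) by (K x N) matrix multiplication.
/// Assumption: each of the three matrices is transferred exactly once.
fn intensity_matmul(m: u64, n: u64, k: u64, bytes_per_elem: u64) -> f64 {
    let flops = 2 * m * n * k; // one multiply and one add per inner-product term
    let bytes = (m * k + k * n + m * n) * bytes_per_elem;
    flops as f64 / bytes as f64
}
```

For f32 data, element-wise addition comes out at 1/12 FLOP per byte regardless of size, while a 1024 x 1024 x 1024 matrix multiplication reaches roughly 171 FLOPs per byte under the stated assumption.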
CUDA Stream#
An ordered queue of GPU operations. Operations on the same stream execute in submission order; operations on different streams may execute concurrently. cuTile Rust’s default async scheduling policy distributes work across a pool of four streams in round-robin fashion.
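Round-robin selection over a fixed pool is straightforward to model. The `RoundRobin` type below is a hypothetical sketch that hands out stream indices rather than real CUDA streams; it is not the scheduling policy's actual implementation.

```rust
/// Hypothetical round-robin policy over a fixed pool of stream indices.
struct RoundRobin {
    next: usize,
    pool_size: usize,
}

impl RoundRobin {
    fn new(pool_size: usize) -> Self {
        assert!(pool_size > 0, "pool must contain at least one stream");
        RoundRobin { next: 0, pool_size }
    }

    /// Return the index of the stream for the next operation, then advance.
    fn pick(&mut self) -> usize {
        let chosen = self.next;
        self.next = (self.next + 1) % self.pool_size;
        chosen
    }
}
```

With a pool of four, consecutive operations land on streams 0, 1, 2, 3 and then wrap around to 0, so independent operations submitted back to back end up on different streams and may overlap.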
Occupancy#
The ratio of active warps to the maximum number of warps an SM can support. Higher occupancy generally improves the GPU’s ability to hide memory latency by switching between warps. Occupancy is affected by register usage, shared memory usage, and thread block size.
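The effect of register usage on occupancy can be estimated with integer arithmetic. This is a rough sketch: the function names are hypothetical, the hardware limits are parameters you would look up for your GPU, and real allocators round register counts to hardware-specific granularities that are ignored here.

```rust
const WARP_SIZE: u32 = 32;

/// Warps that fit on one SM when registers are the only limit.
/// Simplification: ignores register allocation granularity.
fn register_limited_warps(regs_per_sm: u32, regs_per_thread: u32, threads_per_block: u32) -> u32 {
    // Whole thread blocks that fit in the register file.
    let blocks = regs_per_sm / (regs_per_thread * threads_per_block);
    // Warps per thread block, rounded up.
    let warps_per_block = (threads_per_block + WARP_SIZE - 1) / WARP_SIZE;
    blocks * warps_per_block
}

/// Occupancy as the ratio of active warps to the SM's maximum.
fn occupancy(active_warps: u32, max_warps_per_sm: u32) -> f64 {
    f64::from(active_warps.min(max_warps_per_sm)) / f64::from(max_warps_per_sm)
}
```

With illustrative numbers (a 65,536-register file, 64 registers per thread, 256 threads per block, and a limit of 64 warps per SM), 4 blocks of 8 warps fit, giving 32 active warps and an occupancy of 0.5.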
Warp#
A group of 32 GPU threads that execute instructions in lockstep. Warps are the smallest scheduling unit on an SM. Tile sizes that are multiples of 32 align well with warp-level execution.