Working with Data#
cuTile Rust leverages Rust’s type system to catch errors at compile time. Shape mismatches, type errors, and many common GPU programming bugs are caught before your code even runs.
Tensors vs Tiles#
cuTile Rust has two fundamental data abstractions that represent data at different levels of the memory hierarchy:
| Property | Tensor | Tile |
|---|---|---|
| Location | Global Memory (HBM) | GPU registers |
| Mutability | Mutable (`&mut`) | Immutable |
| Shape | Mixed static / dynamic | Compile-time (static) |
| Operations | Load, Store | Arithmetic, Reduction, etc. |
| Lifetime | Persists across kernels | Exists only during kernel execution |
| Addressable | Yes (pointers) | No (compiler-managed) |
// Tensors: live in global memory, passed as kernel arguments
fn kernel<const S: [i32; 2]>(
    output: &mut Tensor<f32, S>,    // Mutable tensor (can store to)
    input: &Tensor<f32, {[-1, -1]}> // Immutable tensor (read-only)
) {
    // Tiles: live in registers, created by loading
    let tile = load_tile_like(input, output); // Load creates a tile
    let result = tile * 2.0;                  // Operations create new tiles
    output.store(result);                     // Store tile back to tensor
}
Host-side and Device-side Types#
The Tensor and Partition types exist on both the host side (CPU) and the device side (GPU kernel), but they are different Rust types with similar semantics. Host-side types are parameterized by element type only; device-side types additionally carry shape information in the type system for compile-time optimization.
On the host, you allocate tensors, partition them, and pass them to kernel launchers:
// Host-side Tensor<T> — parameterized by element type only
let mut tensor: Tensor<f32> = zeros(&[1024, 1024]).sync_on(&stream)?;
// Owned partition — moves the tensor into the partition
let partitioned: Partition<Tensor<f32>> = tensor.partition([16, 16]);
// Borrowed partition (alternative to the above) — borrows mutably; the tensor is updated in place
let partitioned_ref = (&mut tensor).partition([16, 16]);
// Read-only inputs: borrow, Arc, or owned
let input: &Tensor<f32> = &tensor;
let shared: Arc<Tensor<f32>> = Arc::new(tensor);
The generated launcher accepts multiple forms for each parameter type. &Tensor params accept &Tensor<T>, Arc<Tensor<T>>, or Tensor<T>. &mut Tensor params accept Partition<Tensor<T>> or Partition<&mut Tensor<T>>.
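For example, with a hypothetical kernel scale<const BM: i32, const BN: i32>(output: &mut Tensor<f32, {[BM, BN]}>, input: &Tensor<f32, {[-1, -1]}>), a launch might look like the following sketch. It assumes the constructor and launcher patterns shown elsewhere on this page; the kernel name and tensor sizes are placeholders.
let mut out: Tensor<f32> = zeros(&[1024, 1024]).sync_on(&stream)?;
let inp: Tensor<f32> = zeros(&[1024, 1024]).sync_on(&stream)?;

// &mut Tensor parameter: pass a partition, owned or mutably borrowed.
let out_part = (&mut out).partition([16, 16]);

// &Tensor parameter: a borrow, an Arc<Tensor<f32>>, or an owned Tensor<f32> all work.
scale(out_part, &inp)
    .generics(vec!["16".to_string(), "16".to_string()]) // BM, BN
    .sync_on(&stream)?;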
The api::* module constructs tensors on the device. Each constructor returns a DeviceOp, so allocation and initialization are lazy until .sync() or .await:
use cutile::api;
// 3D tensor of random values from a standard normal distribution.
let weights: Tensor<f32> = api::randn(0.0f32, 1.0, [32, 64, 128], None).sync_on(&stream)?;
// Other common constructors: zeros, ones, full, arange, linspace, eye, rand, randn.
// Note: zeros/ones/full take &[usize] slices; rand/randn take [usize; RANK] arrays + Option<u64> seed.
TensorView provides zero-copy views and slices of an existing tensor, which matters for performance: when you want to process a subregion, views avoid the allocation and copy you’d otherwise need. The offset is applied host-side, so passing a view to a kernel hands it a pointer to the correct starting address with no data movement.
let tensor = api::arange::<f32>(1024).sync_on(&stream)?;
let matrix = tensor.view(&[32, 32])?; // Reshape without copying
let row_slice = matrix.slice(&[1..3])?; // Rows 1-2, all columns
let block = matrix.slice(&[1..3, 2..6])?; // Rows 1-2, cols 2-5
Views and slices are passed to kernels as &Tensor parameters. Use them for attention over a sub-sequence, GEMM over a sub-matrix, and similar sub-region patterns — no allocation, no copy. See the Host API for the full list of constructors and TensorView methods.
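As a hedged sketch of passing a slice (the process_block launcher and the output tensor here are hypothetical; block is the TensorView from the example above):
let mut out: Tensor<f32> = zeros(&[2, 4]).sync_on(&stream)?;
// The view is handed to the kernel's &Tensor parameter as a pointer offset into
// `tensor`; nothing is allocated or copied.
process_block((&mut out).partition([2, 4]), &block).sync_on(&stream)?;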
Inside a kernel, tensors and tiles carry their shape as a type parameter, enabling compile-time shape checking and optimization:
fn kernel<const BM: i32, const BN: i32, const BK: i32>(
    output: &mut Tensor<f32, { [BM, BN] }>, // Static shape from partition
    input: &Tensor<f32, { [-1, -1] }>,      // Dynamic shape
) {
    // Device-side Partition<E, S> — view of a tensor as tiles
    let part = input.partition(const_shape![BM, BK]);
    let tile = part.load([pid.0, i]); // pid and i are tile indices (not defined in this snippet)
    // Tile<E, S> — immutable data fragment in registers
    let tile_a: Tile<f32, { [BM, BN] }> = load_tile_like(input, output);
    let result = tile_a * 2.0;
    output.store(result);
}
| Type | Side | Parameterized By | Description |
|---|---|---|---|
| `Tensor<T>` | Host | Element type | Tensor in global memory; allocated and managed on the CPU |
| `Partition<Tensor<T>>` | Host | Element type | Host-side wrapper recording a tensor and its partition shape |
| `Arc<Tensor<T>>` | Host | Element type | Shared reference for read-only kernel inputs |
| `Tensor<E, S>` | Device | Element type + shape | Kernel parameter; shape may be static or dynamic |
| `Partition<E, S>` | Device | Element type + shape | Read-only view of a tensor as tiles |
| `Tile<E, S>` | Device | Element type + shape (always static) | Immutable data fragment in GPU registers |
For the full list of supported element types (f16, bf16, f32, f64, tf32, f8e4m3fn, f8e5m2, integer types, bool), see the DSL API: ElementType. For the api::* module and TensorView, see Host API: Tensor Creation and Views.
Shapes and Broadcasting#
Static shapes are compile-time constants. Dynamic shapes (written as -1) are determined at runtime.
// Static: known at compile time, fully optimized
fn kernel_static<const BM: i32, const BN: i32>(
    output: &mut Tensor<f32, { [BM, BN] }>,
) { /* BM and BN known; compiler can optimize layout and access patterns */ }

// Dynamic: -1 means "determined at runtime"
fn kernel_dynamic(
    input: &Tensor<f32, { [-1, -1] }>,
) {
    let shape = input.shape(); // Query actual dimensions at runtime
}
Static shapes let the compiler optimize layout and access patterns and catch shape errors at compile time, at the cost of re-compilation whenever a type or const generic changes. Dynamic shape dimensions that vary across launches do not trigger re-compilation.
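As a hedged sketch of this tradeoff, assume a hypothetical kernel init_tiles<const BM: i32, const BN: i32>(out: &mut Tensor<f32, {[BM, BN]}>) and the launcher pattern shown later on this page:
let tile_generics = vec!["16".to_string(), "16".to_string()]; // BM, BN

let mut a: Tensor<f32> = zeros(&[1024, 1024]).sync_on(&stream)?;
let mut b: Tensor<f32> = zeros(&[2048, 512]).sync_on(&stream)?;

// Same const generics, different runtime sizes: the dynamic dimensions vary
// across launches, so both reuse one compiled specialization.
init_tiles((&mut a).partition([16, 16])).generics(tile_generics.clone()).sync_on(&stream)?;
init_tiles((&mut b).partition([16, 16])).generics(tile_generics.clone()).sync_on(&stream)?;

// Switching to a 32x32 tile changes a const generic, so a new specialization
// is compiled on first use.
init_tiles((&mut a).partition([32, 32])).generics(vec!["32".to_string(), "32".to_string()]).sync_on(&stream)?;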
The common pattern is static output, dynamic input: the tile size is a const generic, while the full tensor dimensions are runtime values:
#[cutile::entry()]
fn add<const S: [i32; 2]>(
    z: &mut Tensor<f32, S>,      // Static: tile knows its size
    x: &Tensor<f32, {[-1, -1]}>, // Dynamic: full tensor
    y: &Tensor<f32, {[-1, -1]}>, // Dynamic: full tensor
) {
    let tile_x = load_tile_like(x, z); // Load matching z's shape
    let tile_y = load_tile_like(y, z);
    z.store(tile_x + tile_y);
}
Broadcasting expands a tile of one shape to operate against a tile of another, following NumPy rules: align dimensions from the right; two dimensions are compatible if they are equal or one of them is 1; the result shape is the maximum along each dimension.
// [64, 64] + [1, 64] -> [64, 64] (broadcast B along dim 0)
let tile_a: Tile<f32, {[64, 64]}> = ...;
let tile_b: Tile<f32, {[1, 64]}> = ...;
let result = tile_a + tile_b.broadcast(const_shape![64, 64]);
Type Safety and Generics#
The compiler catches shape mismatches, element-type mismatches, and matrix-multiply dimension errors before code runs:
// ❌ Shape mismatch
let a: Tile<f32, {[4, 4]}> = ...;
let b: Tile<f32, {[8, 8]}> = ...;
let c = a + b; // Error: cannot add [4,4] and [8,8]
// ❌ Element-type mismatch without conversion
let x: Tile<f32, {[4, 4]}> = ...;
let y: Tile<i32, {[4, 4]}> = ...;
let z = x + y; // Error: cannot add f32 and i32
// ✅ Explicit conversion
let y_float: Tile<f32, {[4, 4]}> = convert_tile(y);
let z = x + y_float;
// ❌ MMA inner dimensions don't match: [M=16, K=8] × [K=16, N=32]
let a: Tile<f32, {[16, 8]}>;
let b: Tile<f32, {[16, 32]}>;
let c = mma(a, b, zeros); // Error!
// ✅ MMA inner dims match: [M=16, K=8] × [K=8, N=32] -> [16, 32]
let a: Tile<f32, {[16, 8]}>;
let b: Tile<f32, {[8, 32]}>;
let c = mma(a, b, zeros);
Generic kernels let a single function handle multiple element types and shapes:
#[cutile::entry()]
fn flexible_gemm<
    E: ElementType, // Any element type
    const BM: i32,  // Tile rows
    const BN: i32,  // Tile cols
    const BK: i32,  // Inner tile dim
    const K: i32,   // Full inner dim
>(
    z: &mut Tensor<E, {[BM, BN]}>,
    x: &Tensor<E, {[-1, K]}>,
    y: &Tensor<E, {[K, -1]}>,
) {
    // Works for any element type and tile sizes!
}
// Specialize the kernel at launch by passing its const generics as strings:
let generics = vec![
    "f32".to_string(), // E
    "16".to_string(),  // BM
    "16".to_string(),  // BN
    "8".to_string(),   // BK
    "128".to_string(), // K
];
flexible_gemm(z, x, y).generics(generics).sync_on(&stream)?;
Custom element types implement the ElementType trait. The built-in numeric types all implement it.
Continue to Writing Computations for the operations you can apply to tiles. For type signatures, operator catalogs, and const_shape! / shape utilities, see the DSL API reference.