DSL API#
Status: This API is under active development. Expect changes.
The cuTile Rust DSL lets you write GPU kernel code in Rust syntax that compiles to CUDA Tile IR. Inside a #[cutile::module] block, you write Rust-like code using the types and operations documented here. The compiler translates this code into MLIR, which is then compiled to PTX/SASS for execution on NVIDIA GPUs.
All DSL types and functions are available via use cutile::core::*; inside a module block.
Functions#
Modules and Entry Points#
GPU kernels are defined inside #[cutile::module] blocks. Each function marked with #[cutile::entry()] becomes a launchable kernel.
#[cutile::module]
mod my_kernels {
use cutile::core::*;
#[cutile::entry()]
fn add<const S: [i32; 1]>(
output: &mut Tensor<f32, S>, // mutable: partitioned output
a: &Tensor<f32, { [-1] }>, // immutable: read-only input
b: &Tensor<f32, { [-1] }>, // immutable: read-only input
) {
let pid: (i32, i32, i32) = get_tile_block_id();
let tile_a: Tile<f32, S> = a.load_tile(const_shape!(S), [pid.0]);
let tile_b: Tile<f32, S> = b.load_tile(const_shape!(S), [pid.0]);
output.store(tile_a + tile_b);
}
}
Entry point parameters:
| Parameter type | Description | Host-side type |
|---|---|---|
| &mut Tensor<E, S> | Mutable output (must be first) | |
| &Tensor<E, S> | Immutable input | |
| Scalar (i32, f32, …) | Scalar value | Same type, passed by value |
| Raw pointer (*mut T) | Raw device pointer | |
Convention: The mutable output tensor is always the first parameter.
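As an illustration of a scalar parameter alongside the tensor arguments, here is a sketch of an element-wise scaling kernel; the kernel and its names (scale, factor) are ours and use only the operations documented below:
#[cutile::entry()]
fn scale<const S: [i32; 1]>(
    output: &mut Tensor<f32, S>,     // mutable output first, by convention
    input: &Tensor<f32, { [-1] }>,   // immutable input
    factor: f32,                     // scalar value, passed by value from the host
) {
    let pid: (i32, i32, i32) = get_tile_block_id();
    let tile: Tile<f32, S> = input.load_tile(const_shape!(S), [pid.0]);
    output.store(tile * broadcast_scalar(factor, tile.shape()));
}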
Entry attributes:
| Attribute | Type | Description |
|---|---|---|
| print_ir | bool | Print three stages at JIT compile time: (1) the generated entry point wrapper (Rust), (2) the original kernel function, (3) the compiled Tile IR MLIR |
| unchecked_accesses | bool | Disable partition bounds checks. Requires an unsafe fn |
| | expr | Pass an … |
| dump_mlir_dir | string | Write the compiled MLIR to a file in the specified directory |
#[cutile::entry(print_ir = true)]
fn debug_kernel<const S: [i32; 1]>(output: &mut Tensor<f32, S>) { ... }
#[cutile::entry(unchecked_accesses)]
unsafe fn fast_kernel<const S: [i32; 1]>(output: &mut Tensor<f32, S>) { ... }
#[cutile::entry(dump_mlir_dir = "/tmp/mlir")]
fn traced_kernel<const S: [i32; 1]>(output: &mut Tensor<f32, S>) { ... }
Module attributes:
| Attribute | Type | Description |
|---|---|---|
| | bool | Internal: marks the core module (…) |
| tile_rust_crate | bool | Internal: used when defining kernels inside the cutile crate itself (changes import paths from …) |
#[cutile::module] // standard user module
mod my_kernels { ... }
#[crate::module(tile_rust_crate = true)] // internal to cutile crate
pub mod creation { ... }
Entry Points vs Device Functions#
Functions marked with #[cutile::entry()] are compiled as GPU kernel entry points — they can be launched from the host with a grid configuration.
Unmarked functions inside a module are device functions — they are inlined at the call site during compilation. They cannot be launched directly but can be called from entry points or other device functions.
#[cutile::module]
mod my_kernels {
use cutile::core::*;
// Device function: inlined into callers
fn relu<const S: [i32; 1]>(x: Tile<f32, S>) -> Tile<f32, S> {
let zero: Tile<f32, S> = constant(0.0f32, x.shape());
select(gt_tile(x, zero), x, zero)
}
#[cutile::entry()]
fn apply_relu<const S: [i32; 1]>(
output: &mut Tensor<f32, S>,
input: &Tensor<f32, { [-1] }>,
) {
let pid: (i32, i32, i32) = get_tile_block_id();
let tile: Tile<f32, S> = input.load_tile(const_shape!(S), [pid.0]);
output.store(relu(tile)); // device function call, inlined
}
}
Inter-Module Device Function Calls#
Device functions defined in one #[cutile::module] can be called from
entry points in another module. When Module B does use crate::activations::*,
two things happen:
1. Rust’s type checker resolves the function signatures at compile time.
2. The macro records Module A as a dependency of Module B, including Module A’s AST in the collected module list for JIT compilation.
At JIT time, the compiler sees all dependent module ASTs and inlines device functions from any of them into the entry point.
Each module is identified by its fully-qualified path (via module_path!()),
so shared dependencies like cutile::core are automatically deduplicated
even when imported by multiple modules.
/// Module A: reusable activation device functions (no entry points).
#[cutile::module]
mod activations {
use cutile::core::*;
pub fn relu<const S: [i32; 1]>(x: Tile<f32, S>) -> Tile<f32, S> {
let zero: Tile<f32, S> = constant(0.0f32, x.shape());
max_tile(x, zero)
}
pub fn square<const S: [i32; 1]>(x: Tile<f32, S>) -> Tile<f32, S> {
x * x
}
}
/// Module B: kernels that call device functions from Module A.
#[cutile::module]
mod my_kernels {
use cutile::core::*;
use crate::activations::{relu, square}; // import from Module A
#[cutile::entry()]
fn apply_relu_square<const S: [i32; 1]>(
output: &mut Tensor<f32, S>,
input: &Tensor<f32, { [-1] }>,
) {
let pid: (i32, i32, i32) = get_tile_block_id();
let tile: Tile<f32, S> = input.load_tile(const_shape!(S), [pid.0]);
let activated: Tile<f32, S> = relu(tile); // inlined from Module A
output.store(square(activated)); // inlined from Module A
}
}
See cutile-examples/examples/inter_module.rs for a runnable version.
cutile::core internals#
cutile::core (defined in _core.rs) is itself a #[cutile::module]
containing the built-in DSL operations. When you write use cutile::core::*;,
the macro records it as a dependency and includes its AST. The JIT compiler
then resolves calls to constant, iota, mma, reduce_sum, etc. from
core’s AST and inlines them into your entry point — exactly the same
mechanism as user-defined inter-module calls.
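For instance, a minimal module (ours) whose only call, constant, is resolved from cutile::core and inlined at JIT time:
#[cutile::module]
mod fill_kernels {
    use cutile::core::*; // records cutile::core as a dependency of this module
    #[cutile::entry()]
    fn fill_ones<const S: [i32; 1]>(output: &mut Tensor<f32, S>) {
        // `constant` is defined in cutile::core's AST; the JIT compiler
        // inlines it here, just like a user-defined device function.
        output.store(constant(1.0f32, const_shape!(S)));
    }
}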
Types#
Tile#
Tile<E: ElementType, const S: [i32; N]> — A multi-dimensional array of GPU values. This is the fundamental compute type. Tiles are register-resident and processed in parallel by all threads in a warp/block.
// Shapes are const generics
let a: Tile<f32, { [128] }>; // 1D: 128 f32 elements
let b: Tile<f32, { [16, 16] }>; // 2D: 16x16 matrix
let c: Tile<i32, { [4, 8, 2] }>; // 3D
// Scalar tile (rank 0)
let s: Tile<f32, { [] }>;
Methods:
- tile.shape() — Returns the tile’s Shape<S>
- tile.broadcast(shape) — Broadcast to a larger shape
- tile.reshape(shape) — Reshape (must preserve element count)
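A short sketch of these methods used together (shapes chosen for illustration):
let t: Tile<f32, { [64] }> = constant(1.0f32, const_shape![64]);
let s: Shape<{ [64] }> = t.shape();
// reshape to a column, then broadcast the column across 32 columns
let col: Tile<f32, { [64, 1] }> = t.reshape(const_shape![64, 1]);
let wide: Tile<f32, { [64, 32] }> = col.broadcast(const_shape![64, 32]);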
Arithmetic operators: +, -, *, / are overloaded for element-wise operations between tiles of the same shape and type.
let z: Tile<f32, { [128] }> = a + b; // element-wise add
let w: Tile<f32, { [128] }> = a * b - c; // chained arithmetic
Tensor#
Tensor<E: ElementType, const S: [i32; N]> — A device-side tensor view. Represents memory that lives in GPU global memory. Unlike Tile, tensors must be loaded/stored explicitly.
Static dimensions are known at compile time; dynamic dimensions are marked with -1:
fn kernel(
output: &mut Tensor<f32, { [128, 128] }>, // static: 128x128
input: &Tensor<f32, { [-1, -1] }>, // dynamic: shape provided at runtime
)
Methods:
- tensor.shape() — Returns the tensor’s Shape<S>
- tensor.load() — Load the entire tile (only for &mut Tensor, loads the output tile)
- tensor.store(tile) — Store a tile to the tensor
- tensor.load_tile(shape, [indices]) — Load a tile at a specific partition index
- tensor.partition(shape) — Create a Partition view for block-indexed loading
- tensor.partition_permuted(shape, dim_map) — Create a permuted Partition view
- unsafe tensor.partition_mut(shape) — Create a mutable PartitionMut view
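For illustration, a fragment combining a few of these methods inside an entry point (the shapes and the element-wise add are ours):
let pid: (i32, i32, i32) = get_tile_block_id();
// Block-indexed load from the input, full load of this block's output tile
let x: Tile<f32, { [128] }> = input.load_tile(const_shape![128], [pid.0]);
let y: Tile<f32, { [128] }> = output.load();
output.store(x + y);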
Partition / PartitionMut#
Partition<'a, E, const D: [i32; N]> — A read-only partitioned view of a tensor, dividing it into tiles indexed by block ID.
PartitionMut<'a, E, const D: [i32; N]> — A mutable partitioned view.
let part: Partition<f32, { [128, 128] }> = input.partition(const_shape![128, 128]);
let tile: Tile<f32, { [128, 128] }> = part.load([pid.0, pid.1]);
Shape / Array#
Shape<const D: [i32; N]> — A compile-time shape descriptor. Created with const_shape!:
let shape: Shape<{ [128, 64] }> = const_shape![128, 64];
let shape: Shape<S> = tensor.shape();
Array<const D: [i32; N]> — A compile-time array (used for strides and indices).
Const Generic Array (CGA) Syntax#
Shapes in the DSL use Rust’s const generic arrays (const S: [i32; N]). There are two ways to specify them:
1. Explicit values — when dimensions are known literals:
fn kernel(
output: &mut Tensor<f32, { [128, 64] }>, // fixed 128x64
input: &Tensor<f32, { [-1, -1] }>, // dynamic (runtime) shape
)
2. Const generic parameters — when dimensions are specified at launch time:
fn gemm<const BM: i32, const BN: i32, const BK: i32>(
output: &mut Tensor<f32, { [BM, BN] }>, // shape from generics
a: &Tensor<f32, { [-1, -1] }>,
)
3. Const generic array (CGA) — when the entire shape is a single generic parameter:
fn add<const S: [i32; 1]>( // S is the whole shape
output: &mut Tensor<f32, S>,
input: &Tensor<f32, { [-1] }>,
)
The CGA form (const S: [i32; N]) is concise but has a limitation: the array length N must be fixed at definition time. You cannot write const S: [i32; N] where N is itself generic — the rank must be a literal. This means you cannot write a single kernel that works for both 1D and 2D tensors:
// NOT supported: generic rank
fn add<const N: usize, const S: [i32; N]>(output: &mut Tensor<f32, S>) { ... }
// Instead, write separate kernels per rank:
fn add_1d<const S: [i32; 1]>(output: &mut Tensor<f32, S>) { ... }
fn add_2d<const S: [i32; 2]>(output: &mut Tensor<f32, S>) { ... }
When to use which:
| Pattern | Use when |
|---|---|
| Explicit values, e.g. { [128, 64] } | Dimensions are fixed literals |
| Per-dimension const generics, e.g. { [BM, BN] } | Each dimension is an independent generic (e.g., GEMM tile sizes) |
| Const generic array, const S: [i32; N] | The whole shape is generic but rank is known |
| Dynamic dimension, -1 | Dimension is dynamic (provided at runtime by the host) |
Dynamic dimensions (-1) are resolved at runtime from the tensor’s actual shape. Static dimensions are baked into the compiled kernel. Mixing is allowed: { [-1, 128] } means “dynamic first axis, static second axis.”
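For example, a signature mixing a dynamic row count with a static column count (names are illustrative):
fn normalize_rows(
    output: &mut Tensor<f32, { [-1, 128] }>, // dynamic first axis, static second axis
    input: &Tensor<f32, { [-1, 128] }>,
)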
PointerTile#
PointerTile<P: Pointer, const D: [i32; N]> — A tile of device pointers. Used for scatter/gather and atomic operations.
let ptr: PointerTile<*mut f32, { [] }> = pointer_to_tile(raw_ptr);
let offset_ptrs: PointerTile<*mut f32, { [128] }> = ptr.broadcast(const_shape![128]).offset_tile(offsets);
Token#
Token — An ordering token for memory operations. Tokens enforce ordering between loads and stores without requiring full barriers.
let token: Token = new_token_unordered();
let joined: Token = join_tokens(&[token_a, token_b]);
ElementType#
Trait implemented by all scalar types usable in tiles:
| Type | Description |
|---|---|
| | IEEE 754 half-precision |
| | Brain floating-point |
| f32 | Single-precision |
| f64 | Double-precision |
| i8, i16, i32, i64 | Signed integers |
| u8, u16, u32, u64 | Unsigned integers |
| bool | Boolean |
| | TensorFloat-32 (NVIDIA) |
| | FP8 (storage only; convert to …) |
Operations#
Tile IR Operation Mapping#
The functions in this reference are the Rust DSL surface for
CUDA Tile IR operations.
Most low-level operations in cutile::core map directly to a cuda_tile.*
operation. The Rust name is usually either the Tile IR operation name without
the cuda_tile. prefix, or a small wrapper with a Rust-oriented name and type
signature.
Examples:
| Rust DSL surface | Tile IR operation family |
|---|---|
| constant, iota, reduce_sum, mma, … | Core tile ops such as … |
| exp, tanh, fma, maxf_ftz, … | Floating-point ops such as … |
| andi, xori, shli, absi, … | Integer and bitwise ops such as … |
| convert_tile, bitcast, exti, trunci, … | Conversion ops such as … |
| load_ptr_tko, store_ptr_tko, atomic_rmw_tko, join_tokens, … | Memory, atomic, and token ops such as … |
| make_partition_view, load_view_tko, num_tiles, … | View ops such as … |
| assume_div_by, cuda_tile_print!, cuda_tile_assert!, … | Compiler and debugging ops such as … |
Some Tile IR operations are intentionally compiler-owned rather than public DSL
functions. cuda_tile.module, cuda_tile.entry, cuda_tile.return, and
control-flow operations are generated from Rust modules, entry attributes,
return, if, for, loop, break, and continue syntax. This keeps user
code Rust-shaped while still lowering to the corresponding Tile IR operations.
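For example, ordinary Rust loops and conditionals inside a kernel lower to the corresponding Tile IR control flow; this accumulation sketch is ours and uses only operations documented below:
#[cutile::entry()]
fn sum_tiles<const S: [i32; 1]>(
    output: &mut Tensor<f32, S>,
    input: &Tensor<f32, { [-1] }>,
    n_tiles: i32, // scalar parameter: tiles summed per block
) {
    let pid: (i32, i32, i32) = get_tile_block_id();
    let mut acc: Tile<f32, S> = constant(0.0f32, const_shape!(S));
    // Plain Rust `for` syntax; the compiler emits the Tile IR loop constructs.
    for i in 0i32..n_tiles {
        acc = acc + input.load_tile(const_shape!(S), [pid.0 * n_tiles + i]);
    }
    output.store(acc);
}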
Tile IR attributes such as memory ordering, memory scope, comparison predicate, rounding mode, overflow behavior, and flush-to-zero mode are represented as Rust marker types and traits where possible. This lets the Rust type checker reject unsupported attribute combinations before the JIT compiler lowers the operation.
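For example (mirroring the pointer-load call shown later), ordering, scope, and latency all appear as marker types in the call, so an unsupported combination fails to type-check rather than failing in the JIT:
let (data, tok): (Tile<f32, { [128] }>, Token) =
    load_ptr_tko(ptrs, ordering::Weak, None::<scope::TileBlock>, None, None, None, Latency::<0>);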
Memory: Load and Store#
| Function | Signature | Description |
|---|---|---|
| tensor.load() | | Load the output tile |
| tensor.store(tile) | | Store a tile to the tensor |
| tensor.load_tile(shape, [indices]) | | Load at a partition index |
| load_tile_mut | | Load output tile (convenience) |
| load_tile_like | | Load from src at dst’s tile-block position (rank 1-3) |
// Pattern 1: Direct load/store on mutable tensor
let tile: Tile<f32, { [128] }> = load_tile_mut(output);
output.store(tile * scale_tile);
// Pattern 2: Load at block position
let pid: (i32, i32, i32) = get_tile_block_id();
let tile: Tile<f32, { [128] }> = input.load_tile(const_shape![128], [pid.0]);
// Pattern 3: Load-like (positional)
let tile_x: Tile<f32, { [16, 16] }> = load_tile_like(x, output);
Grid and Block#
| Function | Signature | Description |
|---|---|---|
| get_tile_block_id | () -> (i32, i32, i32) | Current thread block’s (x, y, z) index in the grid |
| get_num_tile_blocks | () -> (i32, i32, i32) | Total (x, y, z) dimensions of the grid |
let pid: (i32, i32, i32) = get_tile_block_id();
let grid: (i32, i32, i32) = get_num_tile_blocks();
// Grid-stride loop
for i in (pid.0..total).step_by(grid.0 as usize) { ... }
Arithmetic (Element-wise)#
In addition to operator overloading (+, -, *, /), these explicit functions are available:
| Function | Signature | Description |
|---|---|---|
| absi | | Absolute value (integer) |
| absf | | Absolute value (float) |
| negi | | Negation (integer) |
| negf | | Negation (float) |
| fma | | Fused multiply-add: a * b + c |
| fma_ftz | | Fused multiply-add (flush-to-zero) |
| pow | | Power |
| | | Ceiling division (scalar) |
| | | True (floating-point) division |
| | | High bits of integer multiply |
// Absolute value
let abs_x: Tile<f32, S> = absf(x);
let abs_i: Tile<i32, S> = absi(int_tile);
// Fused multiply-add: a * b + c (single instruction, no intermediate rounding)
let result: Tile<f32, S> = fma(a, b, c);
// Power
let squared: Tile<f32, S> = pow(x, broadcast_scalar(2.0f32, x.shape()));
// Negation
let neg_x: Tile<f32, S> = negf(x);
let neg_i: Tile<i32, S> = negi(int_tile);
Math (Floating-Point)#
| Function | Signature | Description |
|---|---|---|
| exp | | e^x |
| exp2 | | 2^x |
| | | 2^x with flush-to-zero |
| | | Natural logarithm |
| | | Base-2 logarithm |
| | | Square root |
| | | Square root with flush-to-zero |
| rsqrt | | Reciprocal square root (1/sqrt(x)) |
| | | Reciprocal square root with flush-to-zero |
| | | Sine |
| | | Cosine |
| | | Tangent |
| | | Hyperbolic sine |
| | | Hyperbolic cosine |
| tanh | | Hyperbolic tangent |
| | | Ceiling |
| | | Floor |
| | | Float max |
| | | Float min |
| maxf_ftz | | Float max (flush-to-zero) |
| | | Float min (flush-to-zero) |
| addf_ftz | | Float add (flush-to-zero) |
| | | Float sub (flush-to-zero) |
| mulf_ftz | | Float mul (flush-to-zero) |
| | | Float div (flush-to-zero) |
// Softmax numerics: subtract max, exponentiate
let max_val: Tile<f32, { [BM] }> = reduce_max(x, 1i32);
let shifted: Tile<f32, S> = x - max_val.reshape(const_shape![BM, 1]).broadcast(x.shape());
let softmax_exp: Tile<f32, S> = exp(shifted);
// RMS normalization
let sq: Tile<f32, S> = x * x;
let mean_sq: Tile<f32, { [BM] }> = reduce_sum(sq, 1i32);
let rms: Tile<f32, { [BM] }> = rsqrt(mean_sq + broadcast_scalar(1e-6f32, mean_sq.shape()));
// Activation functions
// GELU (tanh approximation): 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
let c0: Tile<f32, S> = broadcast_scalar(0.7978845608f32, x.shape()); // sqrt(2/pi)
let inner: Tile<f32, S> = c0 * (x + broadcast_scalar(0.044715f32, x.shape()) * x * x * x);
let gelu_approx: Tile<f32, S> =
    broadcast_scalar(0.5f32, x.shape()) * x * (constant(1.0f32, x.shape()) + tanh(inner));
let swish: Tile<f32, S> = x / (constant(1.0f32, x.shape()) + exp(negf(x)));
// exp2 is faster than exp on GPU — convert with exp(x) = exp2(x * log2(e))
let log2_e: f32 = 1.4426950408889634f32;
let fast_exp: Tile<f32, S> = exp2(x * broadcast_scalar(log2_e, x.shape()));
// Flush-to-zero variants: treat denormals as zero (faster on some hardware, f32 only)
let clamped: Tile<f32, S> = maxf_ftz(x, broadcast_scalar(0.0f32, x.shape()));
let sum: Tile<f32, S> = addf_ftz(a, b);
let product: Tile<f32, S> = mulf_ftz(a, b);
let fma_result: Tile<f32, S> = fma_ftz(a, b, c);
Comparison#
| Function | Signature | Description |
|---|---|---|
| | | Equal |
| | | Not equal |
| gt_tile | | Greater than |
| | | Greater or equal |
| lt_tile | | Less than |
| | | Less or equal |
| select | | Conditional select |
| | | Scalar min/max |
| max_tile, … | | Element-wise min/max |
let mask: Tile<bool, { [128] }> = lt_tile(indices, len_tile);
let result: Tile<f32, { [128] }> = select(mask, values, zeros);
Creation#
| Function | Signature | Description |
|---|---|---|
| constant | | Fill a tile with a constant |
| iota | | Sequential indices |
| broadcast_scalar | | Broadcast a scalar to a tile shape |
let zeros: Tile<f32, { [128] }> = constant(0.0f32, const_shape![128]);
let indices: Tile<i32, { [64] }> = iota(const_shape![64]); // [0, 1, 2, ..., 63]
let scale: Tile<f32, { [16, 16] }> = broadcast_scalar(2.0f32, const_shape![16, 16]);
Shape Manipulation#
| Function | Signature | Description |
|---|---|---|
| tile.reshape(shape) | | Reshape (preserves element count) |
| tile.broadcast(shape) | | Broadcast to a larger shape |
| | | Free function reshape |
| | | Free function broadcast |
| | | Transpose / permute dimensions |
| | | Concatenate along a dimension |
| | | Extract a sub-tile |
| get_shape_dim | | Read one runtime dimension from a shape |
let row: Tile<i32, { [128] }> = iota(const_shape![128]);
let col: Tile<i32, { [128, 1] }> = row.reshape(const_shape![128, 1]);
let matrix: Tile<i32, { [128, 64] }> = col.broadcast(const_shape![128, 64]);
let n_cols: i32 = get_shape_dim(matrix.shape(), 1i32);
Reduction and Scan#
| Function | Signature | Description |
|---|---|---|
| reduce_sum | | Sum reduction along one dimension |
| reduce_max | | Max reduction |
| | | Min reduction |
| | | Product reduction |
| | | Custom reduction |
| scan_sum | | Prefix sum |
| | | Custom prefix scan |
// Sum each row of a [128, 64] tile to [128] (reduce along axis 1)
let row_sums: Tile<f32, { [128] }> = reduce_sum(matrix, 1i32);
// Prefix sum along axis 0
let prefix: Tile<f32, { [128] }> = scan_sum(row, 0i32, reverse::Forward, 0.0f32);
Matrix Multiply#
| Function | Signature | Description |
|---|---|---|
| mma | | Matrix multiply-accumulate |
Maps to hardware tensor cores when available.
let mut acc: Tile<f32, { [16, 16] }> = constant(0.0f32, const_shape![16, 16]);
for k in 0i32..(K/BK) {
let a_tile: Tile<f32, { [16, 8] }> = a_part.load([pid.0, k]);
let b_tile: Tile<f32, { [8, 16] }> = b_part.load([k, pid.1]);
acc = mma(a_tile, b_tile, acc);
}
Low-Level Memory Ops#
These APIs are close to the Tile IR memory/view operations. Prefer the
high-level methods above (tensor.load_tile, partition.load,
partition_mut.store, load_tile_like, tensor.store) unless you are building
custom views, raw-pointer kernels, or compiler-facing helpers.
View construction and queries#
View constructors create typed tensor or partition views from lower-level
metadata. Mutable view construction and raw tensor construction are unsafe
because the caller must preserve aliasing, layout, and lifetime invariants.
| Function | Signature | Description |
|---|---|---|
| | | Build a tensor view from a base pointer |
| make_partition_view | | Build a read-only partition view |
| | | Build a mutable partition view |
| | | Query a tensor view’s runtime shape |
| | | Query a partition’s tile-grid shape |
| get_tensor_token | | Read a tensor view’s memory token |
| | | Update a tensor view’s memory token |
| | | Read a read-only partition token |
| | | Read a mutable partition token |
| num_tiles | | Number of tiles along one partition axis |
| | | Load a strided tensor view from an integer pointer tensor |
let shape = input.shape();
let token = get_tensor_token(input);
let part = make_partition_view(input, shape, padding::None, dim_map::Identity, token);
let tiles_m: i32 = num_tiles(&part, 0);
View loads and stores#
These are the direct memory operations on partition views. They expose Tile IR ordering, scope, latency, and TMA controls explicitly.
| Function | Signature | Description |
|---|---|---|
| load_view_tko | | Load a tile from a read-only partition |
| | | Load from a mutable partition; unsafe aliasing contract |
| | | Store a tile into a mutable partition |
let pid = get_tile_block_id();
let tile: Tile<f32, S> =
load_view_tko(&part, [pid.0], ordering::Weak, scope::TileBlock, None, tma::Enabled);
Pointer-based loads and stores#
| Function | Signature | Description |
|---|---|---|
| load_ptr_tko | | Scatter-gather load via pointers |
| store_ptr_tko | | Scatter-gather store via pointers |
| pointer_to_tile | | Convert raw pointer to scalar pointer tile |
| | | Convert back |
| | | Offset a pointer tile by a scalar |
| offset_tile | | Offset a pointer tile by an index tile |
| ptr_tile.broadcast(shape) | | Broadcast a pointer tile to a larger shape |
| | | Reshape a pointer tile |
let base: PointerTile<*mut f32, { [] }> = pointer_to_tile(ptr);
let ptrs: PointerTile<*mut f32, { [128] }> = base.broadcast(const_shape![128]).offset_tile(offsets);
let (values, token): (Tile<f32, { [128] }>, Token) =
load_ptr_tko(ptrs, ordering::Weak, None::<scope::TileBlock>, None, None, None, Latency::<0>);
Atomics#
| Function | Signature | Description |
|---|---|---|
| atomic_rmw_tko | | Atomic read-modify-write |
| atomic_cas_tko | | Atomic compare-and-swap |
RMW modes: atomic::{Add, AddF, And, Or, Xor, Max, Min, Umax, Umin, Xchg}
Memory orderings: ordering::{Relaxed, Acquire, Release, AcqRel} (atomics; load/store also accept Weak)
Scopes: scope::{TileBlock, Device, System}
atomic_rmw_tko(ptrs, increments, atomic::Add, ordering::Relaxed, scope::Device, None, None);
atomic_cas_tko(ptrs, expected, desired, ordering::AcqRel, scope::System, None, None);
Tokens#
Tokens track ordering dependencies between memory operations. A token returned from a load guarantees that the load has completed before any operation that consumes that token. join_tokens merges multiple tokens into one, ensuring all joined operations complete before the result token is used.
This enables fine-grained ordering without full barriers: independent loads can execute in parallel, and a store only waits for the specific loads it depends on.
| Function | Signature | Description |
|---|---|---|
| new_token_unordered | | Create a fresh ordering token (no ordering guarantee) |
| join_tokens | | Join multiple tokens: result waits for all inputs |
// Thread tokens through a load → compute → store sequence:
let token: Token = new_token_unordered();
// Load returns a new token guaranteeing the load completed
let (data, load_token): (Tile<f32, { [128] }>, Token) =
load_ptr_tko(src_ptrs, ordering::Weak, None::<scope::TileBlock>, None, None, None, Latency::<0>);
// Compute on the loaded data
let result: Tile<f32, { [128] }> = data * data;
// Store uses the load token: waits for the load before writing
let store_token: Token =
store_ptr_tko(dst_ptrs, result, ordering::Weak, None::<scope::TileBlock>, None, Some(load_token), Latency::<0>);
// Join tokens from two independent loads before a dependent store:
let (a_data, a_token): (Tile<f32, { [128] }>, Token) =
load_ptr_tko(a_ptrs, ordering::Weak, None::<scope::TileBlock>, None, None, None, Latency::<0>);
let (b_data, b_token): (Tile<f32, { [128] }>, Token) =
load_ptr_tko(b_ptrs, ordering::Weak, None::<scope::TileBlock>, None, None, None, Latency::<0>);
// Both loads must complete before the store
let combined: Token = join_tokens(&[a_token, b_token]);
let result: Tile<f32, { [128] }> = a_data + b_data;
let _: Token =
store_ptr_tko(out_ptrs, result, ordering::Weak, None::<scope::TileBlock>, None, Some(combined), Latency::<0>);
Bitwise#
| Function | Signature | Description |
|---|---|---|
| andi | | Bitwise AND |
| | | Bitwise OR |
| xori | | Bitwise XOR |
| shli | | Shift left |
| | | Shift right |
// Mask lower 8 bits
let mask: Tile<i32, S> = constant(0xFF, x.shape());
let low_byte: Tile<i32, S> = andi(x, mask);
// Shift left by 2 (multiply by 4)
let shift: Tile<i32, S> = constant(2, x.shape());
let shifted: Tile<i32, S> = shli(x, shift);
// Toggle bits with XOR
let toggled: Tile<i32, S> = xori(x, mask);
Type Conversion#
| Function | Signature | Description |
|---|---|---|
| convert_tile | | Convert element type |
| | | Convert scalar type |
| | | Wrap a scalar in a rank-0 tile |
| | | Convert a rank-0 tile back to a scalar |
| | | Float-to-float conversion with rounding mode |
| | | Float-to-integer conversion with rounding mode |
| | | Integer-to-float conversion with rounding mode |
| bitcast | | Reinterpret bits (no conversion) |
| exti | | Extend integer (sign/zero) |
| trunci | | Truncate integer with overflow mode |
| | | Integer to pointer |
| | | Pointer to integer tile |
| | | Pointer cast |
// Float to int conversion
let indices: Tile<i32, { [128] }> = iota(const_shape![128]);
let float_indices: Tile<f32, { [128] }> = convert_tile(indices);
// Bitcast: reinterpret f32 bits as u32 (no value conversion)
let float_tile: Tile<f32, { [128] }> = constant(1.0f32, const_shape![128]);
let bits: Tile<u32, { [128] }> = bitcast(float_tile); // 0x3F800000
// Integer extension and truncation
let small: Tile<i16, { [64] }> = constant(42i16, const_shape![64]);
let wide: Tile<i32, { [64] }> = exti(small); // sign-extend i16 -> i32
let narrow: Tile<i16, { [64] }> = trunci(wide, overflow::None);
Compiler Hints#
| Function | Signature | Description |
|---|---|---|
| assume_div_by | | Assert value is divisible by N |
| assume_bounds_lower | | Assert value >= L |
| | | Assert value <= U |
| assume_bounds | | Assert L <= value <= U |
These are unsafe — incorrect assumptions produce undefined behavior. They enable compiler optimizations like vectorized loads and simplified index arithmetic.
// Tell the compiler a dimension is a multiple of 16 (enables wider vector loads)
let dim: i32 = unsafe { assume_div_by::<_, 16>(dim) };
// Bound an index (enables range-based optimizations)
let idx: i32 = unsafe { assume_bounds::<_, 0, 1024>(idx) };
// Combine: non-negative and aligned
let stride: i32 = unsafe { assume_bounds_lower::<_, 0>(stride) };
let stride: i32 = unsafe { assume_div_by::<_, 4>(stride) };
Debugging#
| Macro | Description |
|---|---|
| cuda_tile_print! | Printf-style GPU print |
| cuda_tile_assert! | GPU assertion |
let pid: (i32, i32, i32) = get_tile_block_id();
cuda_tile_print!("Block ({}, {}, {})\n", pid.0, pid.1, pid.2);
// Assert a condition — aborts the kernel if false
cuda_tile_assert!(len > 0, "Length must be positive");
// Print scalars for debugging (runs on every block, so output may interleave)
cuda_tile_print!("offset = {}\n", pid.0 * 128);