# Tutorial 7: Intro to Async Execution
Note: While async concepts are taught using the `tokio` runtime, any async runtime can be used.
The sync API blocks the CPU until the GPU finishes:

```rust
let launcher = hello_world_kernel();
launcher.grid((2, 2, 1)).sync_on(&stream); // CPU waits here!
// CPU blocked until GPU finishes
```
With async, the CPU can do other work while the GPU computes:

- Preparing the next batch while the current one computes.
- Pipelining multiple operations.
- Overlapping data transfer with computation.
- Multi-GPU coordination.
## DeviceOperation
In cutile, GPU work is represented as a DeviceOperation — a description of work to be done, not yet executed:
`DeviceOperation` describes the work; `.await`, `tokio::spawn(...)`, or `.sync_on(...)` executes it.
```rust
// This creates a DeviceOperation, but doesn't execute yet!
let tensor_op = api::ones([1024, 1024]); // Returns impl DeviceOperation
// Nothing has happened on the GPU yet...

// NOW it executes:
let tensor: Tensor<f32> = tensor_op.sync_on(&stream); // Sync API
// or
let tensor: Tensor<f32> = tensor_op.await; // Async API
```
## Sync vs Async APIs
In cutile, a `DeviceOperation` can be executed with either sync or async APIs. Given a particular operation `op`:

| API | Description |
|---|---|
| `op.sync_on(&stream)` | Immediately executes, blocking the calling thread until the operation completes. |
| `op.await` | Immediately executes, suspending the enclosing async context until the operation completes. |
| `tokio::spawn(op)` | Submits a task to the async runtime, returning a handle that can later be awaited. Non-blocking. Can only be called from within an async context. |
Note: An async context is any code appearing in a block defined with the `async` keyword, e.g. `async fn ...`, `async { ... }`, `async || { ... }`.
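To make the note concrete, here is a self-contained plain-Rust illustration (no tokio or cutile; `load`, `compute`, and the tiny `block_on_ready` executor are hypothetical helpers for this sketch) of where `.await` is and is not legal:

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

async fn load() -> u32 { 42 }

// OK: `.await` inside an `async fn` (an async context).
async fn compute() -> u32 {
    load().await + 1
}

// NOT OK: a plain `fn` is not an async context.
fn sync_fn() {
    // load().await; // compile error: `await` is only allowed inside `async` fns and blocks
}

// Minimal executor for futures that never pend; normally the runtime
// (e.g. #[tokio::main]) plays this role.
struct NoopWaker;
impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

fn block_on_ready<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(v) = fut.as_mut().poll(&mut cx) {
            return v;
        }
    }
}

fn main() {
    sync_fn();
    // `async { ... }` blocks are async contexts too.
    let answer = block_on_ready(async { compute().await });
    println!("answer = {answer}"); // prints "answer = 43"
}
```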
## Async Vector Addition
```rust
use cutile::api::{ones, zeros};
use cutile::tensor::{Tensor, ToHostVec, Unpartition};
use cutile::tile_kernel::{IntoDeviceOperationPartition, TileKernel, TensorDeviceOpToHostVec};
use cuda_async::device_operation::*;
use std::sync::Arc;

#[cutile::module]
mod async_add_module {
    use cutile::core::*;

    #[cutile::entry()]
    fn add<const S: [i32; 2]>(
        z: &mut Tensor<f32, S>,
        x: &Tensor<f32, {[-1, -1]}>,
        y: &Tensor<f32, {[-1, -1]}>,
    ) {
        let tile_x = load_tile_like_2d(x, z);
        let tile_y = load_tile_like_2d(y, z);
        z.store(tile_x + tile_y);
    }
}

use async_add_module::add_apply;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let x: Arc<Tensor<f32>> = ones([32, 32]).arc().await?;
    let y: Arc<Tensor<f32>> = ones([32, 32]).arc().await?;
    let z_op = zeros::<2, f32>([32, 32]);

    let args = zip!(
        z_op.partition([4, 4]),  // Output, partitioned into tiles
        x.device_operation(),    // Input x as DeviceOperation
        y.device_operation()     // Input y as DeviceOperation
    );

    let (z, _x, _y) = args.apply(add_apply).unzip();
    let z_host: Vec<f32> = z.unpartition().to_host_vec().await?;
    println!("z[0] = {} (expected 2.0)", z_host[0]);
    Ok(())
}
```
Output:

```text
z[0] = 2 (expected 2.0)
```
## Overlapping Work with Spawn
.await lets the programmer control when to execute work, but it blocks the enclosing async context — no further code in that async block runs until the awaited operation completes. (The underlying thread is freed and can run other tasks in the meantime, but this async context is suspended.) tokio::spawn converts a future into a concurrently executing task, returning a non-blocking handle that can later be awaited to retrieve the result.
```rust
#[tokio::main]
async fn main() {
    let batch1_op = prepare_batch(1); // Returns DeviceOperation
    let batch2_op = prepare_batch(2); // Returns DeviceOperation

    let batch1 = batch1_op.await;
    let result1_op = process_kernel(batch1);
    let result1_handle = tokio::spawn(result1_op); // Non-blocking

    // batch 2 data can be prepared while batch 1's kernel runs
    let batch2 = batch2_op.await;
    let result2 = process_kernel(batch2).await;

    // Awaiting a JoinHandle yields a Result (the task may have panicked)
    let result1 = result1_handle.await.expect("task panicked");
}
```
## Composing DeviceOperations

### zip! — Combine Operations for Kernels
`zip!` combines multiple `DeviceOperation`s into a tuple that can be passed to kernels:

```rust
use cuda_async::device_operation::*;

let args = zip!(
    output_op.partition([4, 4]),  // Partitioned output
    input1.device_operation(),    // Input as DeviceOperation
    input2.device_operation()     // Another input
);
let (out, _in1, _in2) = args.apply(kernel_apply).unzip();
```
### apply — Run Kernels on DeviceOperations

```rust
let args = zip!(output_op, input_op);
let (output, _input) = args.apply(some_kernel_apply).unzip();
let result = output.await;
```
Use `kernel_op(...)` instead when the arguments are still separate `DeviceOperation`s rather than already grouped with `zip!`:

```rust
let output_op = kernel_op(z_op, input_op);
let output = output_op.await;
```
## When to Use Async

| Scenario | Use Sync | Use Async |
|---|---|---|
| Simple scripts | ✓ | |
| Interactive exploration | ✓ | |
| Production pipelines | | ✓ |
| Multi-batch processing | | ✓ |
| Multi-GPU workloads | | ✓ |
| Overlapping compute/transfer | | ✓ |
Start with sync for learning, move to async for production.
## Key Takeaways

| Concept | What It Means |
|---|---|
| `DeviceOperation` | A description of GPU work, not yet executed |
| `.await` | Execute the operation and get the result |
| Async enables overlap | CPU can do work while GPU computes |
| `zip!` | Combine multiple operations for kernel input |
| `apply` | Launch a kernel from one grouped `DeviceOperation` |
| `kernel_op(...)` | Launch a kernel from separate `DeviceOperation`s |
## Exercise 1: Async SAXPY
Convert the SAXPY kernel to use the async API.
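One possible shape of the answer, following the vector-add pattern above. This is an unverified sketch: `saxpy_apply` is a hypothetical entry generated from a SAXPY kernel analogous to `add`, and how the scalar `a` is passed into the kernel depends on cutile's entry signature and is left as part of the exercise.

```rust
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let a = 2.0f32; // passing `a` into the kernel is left to the reader
    let x: Arc<Tensor<f32>> = ones([32, 32]).arc().await?;
    let y: Arc<Tensor<f32>> = ones([32, 32]).arc().await?;
    let z_op = zeros::<2, f32>([32, 32]);

    // Same pattern as the vector-add example: zip the operations,
    // apply the kernel, then await the host copy.
    let args = zip!(
        z_op.partition([4, 4]),
        x.device_operation(),
        y.device_operation()
    );
    let (z, _x, _y) = args.apply(saxpy_apply).unzip(); // saxpy_apply: hypothetical
    let z_host: Vec<f32> = z.unpartition().to_host_vec().await?;
    println!("z[0] = {} (expected {})", z_host[0], a * 1.0 + 1.0);
    Ok(())
}
```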
## Exercise 2: Parallel Tensor Creation
Use zip! to create 4 tensors in parallel.
**Answer**

```rust
let (a, b, c, d) = zip!(
    ones([100, 100]).arc(),
    zeros([100, 100]).arc(),
    randn(0.0, 1.0, [100, 100]).arc(),
    arange(10000).arc()
).await?;
```
## Exercise 3: Measure the Difference

Time a sync version vs. an async version with overlapped work. Use `std::time::Instant` to measure.