Auto-Tuning¶

Created: 2026-04-18 17:02:44 Edited: 2026-05-22 12:50:00

WarpConvNet's spatially sparse convolution has many backend algorithms (see Sparse Convolution Internals — Algorithm taxonomy). None of them wins across all problem shapes: the optimal pick depends on coordinate count, input/output channels, kernel volume, dtype, and the GPU. This page describes how WarpConvNet chooses among them at runtime.

Why auto-tune¶

A single sparse-conv layer runs three math kernels per training step (forward = AB, dgrad = ABt, wgrad = AtB — see Three math kernels per layer), each with its own optimal algorithm:

Relative winners shift dramatically with channel count (e.g. 64 vs 256).
Small-\(N\) shapes favor mask-based fused kernels; large-\(N\) shapes favor CUTLASS.
Wgrad (AtB gather-gather) has different arithmetic intensity than AB/ABt and picks differently from forward at the same shape.
Dgrad (ABt) picks differently from fwd (AB) because the \(C_\text{in} \leftrightarrow C_\text{out}\) swap changes the optimal tile shape and split-K factor.

Picking by hand is infeasible. WarpConvNet benchmarks the candidate set at the runtime shape on first use and caches the winner, per op, in three independent cache namespaces (AB_gather_scatter, ABt_gather_scatter, AtB_gather_gather).

How it works¶

On the first forward (or backward) pass for a new problem shape, WarpConvNet:

Selects a set of candidate algorithms based on the convolution dimensions \((N, C_{\text{in}}, C_{\text{out}}, K)\) and dtype.
Runs each candidate with warmup=2, iters=5 and records median time via CUDA events.
Picks the fastest and caches the result keyed by \((\lceil\log_{10} N_{\text{in}}\rceil, \lceil\log_{10} N_{\text{out}}\rceil, C_{\text{in}}, C_{\text{out}}, K, G, \text{use\_fp16\_accum}, \text{dtype}, \text{SM})\).
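Subsequent calls whose shape maps to the same key hit the cache and skip benchmarking. Because \(N\) enters the key as \(\lceil\log_{10} N\rceil\), coordinate counts in the same power-of-ten bucket share one entry.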
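
The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not WarpConvNet's implementation: `shape_key`, `median_time`, `pick_algorithm`, and `_winner_cache` are hypothetical names, and `time.perf_counter` stands in for the CUDA events the library uses so that the sketch runs without a GPU.

```python
import math
import statistics
import time

# Hypothetical in-memory cache: key tuple -> name of the winning algorithm.
_winner_cache = {}


def shape_key(n_in, n_out, c_in, c_out, kernel_volume, groups,
              use_fp16_accum, dtype, sm):
    """Build a cache key; coordinate counts are bucketed by ceil(log10(N))."""
    return (
        math.ceil(math.log10(n_in)),
        math.ceil(math.log10(n_out)),
        c_in, c_out, kernel_volume, groups, use_fp16_accum, dtype, sm,
    )


def median_time(fn, warmup=2, iters=5):
    """Run fn `warmup` times untimed, then return the median of `iters` timed runs."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()  # stand-in for a CUDA start event
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def pick_algorithm(key, candidates):
    """candidates maps an algorithm name to a zero-argument callable.

    Benchmarks every candidate only on a cache miss; later calls with the
    same key return the cached winner without running anything.
    """
    if key not in _winner_cache:
        timings = {name: median_time(fn) for name, fn in candidates.items()}
        _winner_cache[key] = min(timings, key=timings.get)
    return _winner_cache[key]
```

With this key, a layer seeing 50,000 coordinates and one seeing 80,000 coordinates (same channels, kernel volume, dtype, and SM) resolve to the same entry, so only the first of them pays the benchmarking cost.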

Results are persisted to ~/.cache/warpconvnet/benchmark_cache_generic.msgpack and survive across Python sessions. The cache also merges in results from other ranks, so rank 0's autotune pass populates every rank.

Adaptive candidate selection¶

The candidate set adapts to the problem dimensions. The pools below are based on a benchmark analysis of 148 configs on SM 8.9 with cuBLAS 12.9.1.4.

AB gather-scatter (forward) and ABt gather-scatter (dgrad) share the same candidate pool — 7-11 candidates per op; each op is tuned independently against its own cache namespace:

\(N\) range	\(ch \le 256\)	\(ch > 256\)
Small (\(N \le 10{,}000\))	`mask_gemm` (92-100%)	`cute_grouped` (58%), `mask_gemm` (25%)
Medium (\(10{,}000 < N \le 100{,}000\))	`mask_gemm` (69%), `cutlass_implicit_gemm` (27%)	`cutlass_grouped_hybrid` (67%)
Large (\(N > 100{,}000\))	`mask_gemm` / `cutlass_grouped` / `cutlass`	`cutlass_implicit_gemm` (100%)

AtB gather-gather (wgrad) — 5-8 candidates:

\(N\) range	\(ch \le 64\)	\(ch > 64\)
Small	`cute_grouped` (57%), `implicit_gemm` (36%)	`cute_grouped` (100%)
Medium	`cute_grouped` (57%), `explicit_gemm_grouped` (43%)	`cute_grouped` (77%)
Large	`cutlass_grouped_hybrid` (57%), `explicit_gemm_grouped` (36%)	`cute_grouped` (100%)
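
As a reading aid for the AB table above, the lookup below maps a shape to its size class and to the first-listed algorithm in the corresponding cell. It is only a transcription of the table, not the library's selection logic: `size_class`, `AB_FIRST_LISTED`, and `first_listed_ab` are hypothetical names, and `ch` is taken as a single channel count because the table does not say how \(C_\text{in}\) and \(C_\text{out}\) are combined.

```python
def size_class(n):
    """Classify a coordinate count using the N thresholds from the AB table."""
    if n <= 10_000:
        return "small"
    if n <= 100_000:
        return "medium"
    return "large"


# (size class, ch > 256) -> first algorithm listed in that table cell.
# The large / ch <= 256 cell lists several algorithms without percentages,
# so it is deliberately left out.
AB_FIRST_LISTED = {
    ("small", False): "mask_gemm",
    ("small", True): "cute_grouped",
    ("medium", False): "mask_gemm",
    ("medium", True): "cutlass_grouped_hybrid",
    ("large", True): "cutlass_implicit_gemm",
}


def first_listed_ab(n, ch):
    """Return the first-listed AB algorithm for (n, ch), or None if the cell is mixed."""
    return AB_FIRST_LISTED.get((size_class(n), ch > 256))
```

Auto-tuning still benchmarks the whole candidate pool at runtime; the table only indicates which entries tended to win in the analysis.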

Modes¶

Three global modes select the candidate set for the AB/ABt and AtB ops:

Mode	Candidate set	When to use
`auto` (default)	Adaptive per shape.	Normal training / inference. Covers every winning algorithm at its optimal shape.
`trimmed`	Broader pool than `auto`.	Wider search. Includes slower-converging alternatives but excludes dead-weight candidates.
`all`	Full pool.	Exhaustive. For benchmarking new hardware or new backends; slowest first run.

# Default: adaptive reduced set (recommended)
export WARPCONVNET_FWD_ALGO_MODE=auto

# Exhaustive: benchmark every algorithm variant
export WARPCONVNET_FWD_ALGO_MODE=all

# Specific algorithm (no benchmarking, just use it)
export WARPCONVNET_FWD_ALGO_MODE=mask_gemm

# Algorithm list (benchmark only these)
export WARPCONVNET_FWD_ALGO_MODE="[mask_gemm,cutlass_implicit_gemm]"

The same options apply to WARPCONVNET_DGRAD_ALGO_MODE (dgrad ABt algorithm) and WARPCONVNET_WGRAD_ALGO_MODE (wgrad AtB algorithm).

Specifying algorithms¶

Forward, dgrad, and wgrad can be controlled independently:

from warpconvnet.nn.functional import spatially_sparse_conv

# Different algorithms for each op
output = spatially_sparse_conv(
    input_voxels, weight, kernel_size=3,
    fwd_algo="mask_gemm",        # AB for forward
    dgrad_algo="mask_gemm",      # ABt for dgrad
    wgrad_algo="cute_grouped",   # AtB for wgrad
)

# Algorithm list -- benchmark only these
output = spatially_sparse_conv(
    input_voxels, weight, kernel_size=3,
    fwd_algo=["mask_gemm", "cutlass_implicit_gemm"],
    dgrad_algo=["mask_gemm", "cute_grouped"],
    wgrad_algo=["cute_grouped", "cutlass_grouped_hybrid"],
)

Strict name filter¶

Named algorithms are resolved strictly. A typo raises ValueError rather than silently falling back to autotune:

spatially_sparse_conv(..., fwd_algo="explicit_gem")  # typo
# ValueError: Unknown algorithm(s) in filter: ['explicit_gem'].
# Not present in adaptive pool or exhaustive _ALL_AB_PARAMS.
# Fix the algo name or extend the pool.

Parameterless algorithms (explicit_gemm, cutlass_implicit_gemm, cute_implicit_gemm) are synthesised as (name, {}) when they are not in the current adaptive pool, so those names always work regardless of mode.

Environment variables¶

Variable	Default	Description
`WARPCONVNET_FWD_ALGO_MODE`	`auto`	AB gather-scatter algorithm for forward. Shared candidate pool with dgrad.
`WARPCONVNET_DGRAD_ALGO_MODE`	`auto`	ABt gather-scatter algorithm for dgrad. Shared candidate pool with forward; tuned + cached independently.
`WARPCONVNET_WGRAD_ALGO_MODE`	`auto`	AtB gather-gather algorithm for wgrad.
`WARPCONVNET_DEPTHWISE_CONV_FWD_ALGO_MODE`	`auto`	Depthwise forward algorithm.
`WARPCONVNET_DEPTHWISE_CONV_BWD_ALGO_MODE`	`auto`	Depthwise backward algorithm.
`WARPCONVNET_USE_FP16_ACCUM`	`false`	Global default for the fp16 accumulator flag. See Accumulator Precision.
`WARPCONVNET_BENCHMARK_CACHE_DIR`	`~/.cache/warpconvnet`	Cache directory.
`WARPCONVNET_AUTOTUNE_LOG`	`true`	Set to `false` to suppress auto-tuning log messages.

Accepted values for the mode variables: auto, all, trimmed, a single algorithm name, or a bracket list like [algo1,algo2].

Valid algorithm names: explicit_gemm, implicit_gemm, cutlass_implicit_gemm, cute_implicit_gemm, explicit_gemm_grouped, implicit_gemm_grouped, cutlass_grouped_hybrid, cute_grouped, mask_gemm. Unknown names raise ValueError.
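
The accepted value forms can be illustrated with a small parser. This is a sketch of the rules stated above, not the code in `warpconvnet/constants.py`; `parse_algo_mode` and its return convention are hypothetical, and the exact error text raised by the library may differ.

```python
VALID_MODES = {"auto", "trimmed", "all"}
VALID_ALGOS = {
    "explicit_gemm", "implicit_gemm", "cutlass_implicit_gemm",
    "cute_implicit_gemm", "explicit_gemm_grouped", "implicit_gemm_grouped",
    "cutlass_grouped_hybrid", "cute_grouped", "mask_gemm",
}


def parse_algo_mode(raw):
    """Parse a mode string into ("mode", name) or ("algos", [names]).

    Accepts a global mode, a single algorithm name, or a bracket list
    such as "[mask_gemm,cutlass_implicit_gemm]". Unknown names raise
    ValueError instead of falling back to autotune.
    """
    value = raw.strip()
    if value in VALID_MODES:
        return ("mode", value)
    if value.startswith("[") and value.endswith("]"):
        names = [n.strip() for n in value[1:-1].split(",") if n.strip()]
    else:
        names = [value]
    unknown = [n for n in names if n not in VALID_ALGOS]
    if unknown or not names:
        raise ValueError(f"Unknown algorithm(s) in filter: {unknown}")
    return ("algos", names)
```

For example, `parse_algo_mode("[mask_gemm, cute_grouped]")` yields `("algos", ["mask_gemm", "cute_grouped"])`, while `parse_algo_mode("explicit_gem")` raises `ValueError`.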

Cache¶

Results are keyed per problem shape and persisted to ~/.cache/warpconvnet/benchmark_cache_generic.msgpack.
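
To locate the cache file from a script, the path can be resolved roughly as below. This is a sketch, assuming that `WARPCONVNET_BENCHMARK_CACHE_DIR` only changes the directory and that the file name stays `benchmark_cache_generic.msgpack`; `benchmark_cache_path` is a hypothetical helper, not part of the library.

```python
import os
from pathlib import Path


def benchmark_cache_path(env=None):
    """Return the expected cache file path for the given environment mapping."""
    env = os.environ if env is None else env
    # Documented default directory; the override comes from the environment.
    cache_dir = env.get("WARPCONVNET_BENCHMARK_CACHE_DIR", "~/.cache/warpconvnet")
    return Path(cache_dir).expanduser() / "benchmark_cache_generic.msgpack"
```

Calling `benchmark_cache_path().exists()` before a run is a quick way to check whether a warm cache is present on the machine.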

Upgrade note¶

SM90 CuTe non-mask GEMM inner-autotune entries now use registry identity (op, backend, tile_id). Existing warm cache entries under cute_gemm_sm90_AD_gather_scatter are not migrated and will be rebenchmarked under nonmask_gemm_ad_gather_scatter.cute_sm90 after upgrade.

SpatiallySparseConvConfig now keys on conv stride, transposed, generative, and stride_mode in addition to channel/voxel shape. Strided downsample layers (N_in != N_out) now route through native strided fwd kernels (tile_ids 300-307). Pre-upgrade cache entries deserialize with empty stride metadata and will miss against new lookups; expect a one-time re-autotune pass on the first run after upgrade.

# Clear cache (e.g. after switching GPUs)
rm -rf ~/.cache/warpconvnet/

# Inspect cached entries
python scripts/inspect_benchmark_cache.py
python scripts/inspect_benchmark_cache.py namespace=AB_gather_scatter --best-only   # forward
python scripts/inspect_benchmark_cache.py namespace=ABt_gather_scatter --best-only  # dgrad
python scripts/inspect_benchmark_cache.py namespace=AtB_gather_gather --best-only   # wgrad

# Analyze win rates and margins across all configs
python scripts/analyze_benchmark_cache.py --markdown --output analysis.md

See Inspect Benchmark Cache for the full inspector CLI and Pre-Populate Benchmark Cache for filling the cache ahead of deployment.

Performance characteristics¶

Based on empirical analysis on RTX 6000 Ada with cuBLAS 12.9.1.4:

Condition	Best AB backend	Best AtB backend
\(ch \le 256\), any \(N\)	`mask_gemm`	`cute_grouped`
\(ch > 256\), small \(N\)	`cute_grouped`	`cute_grouped`
\(ch > 256\), large \(N\)	`cutlass_implicit_gemm`	`cute_grouped`
\(ch \le 64\), small \(N\) (wgrad)	—	`implicit_gemm` or `explicit_gemm_grouped`

The cost of the first autotune pass on a previously-unseen shape is roughly (warmup + iters) * n_candidates * kernel_time. For auto mode this is typically under one second on a warm GPU; for all mode it can take tens of seconds.
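
The estimate can be worked through numerically. The per-kernel time below is a hypothetical figure chosen for illustration, not a measurement; `autotune_cost_seconds` is likewise an illustrative helper.

```python
def autotune_cost_seconds(n_candidates, kernel_time_s, warmup=2, iters=5):
    """Rough first-pass cost: (warmup + iters) * n_candidates * kernel_time."""
    return (warmup + iters) * n_candidates * kernel_time_s


# Upper end of the AB pool (11 candidates) at an assumed 2 ms per kernel run:
# 7 runs per candidate * 11 candidates * 0.002 s, roughly 0.15 s for one op.
ab_cost = autotune_cost_seconds(n_candidates=11, kernel_time_s=2e-3)
print(f"{ab_cost:.3f} s")
```

Forward, dgrad, and wgrad are tuned independently, so a training step on a new shape pays this cost once per op, each with its own candidate count.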

Troubleshooting¶

Slow first run: this is expected, because autotune benchmarks each candidate the first time it sees a shape. Use auto (not all) to minimize tuning time. To skip autotune entirely, pre-populate the cache before your first run.

Cache mismatch across GPUs: the SM capability is embedded in cache keys, so entries from one GPU will not be picked up on another. Clear the cache with rm -rf ~/.cache/warpconvnet/ when switching hardware.

CUTLASS not available: some backends require specific compute capability. Fall back with an explicit list:

export WARPCONVNET_FWD_ALGO_MODE="[explicit_gemm,implicit_gemm,mask_gemm]"

ValueError: Unknown algorithm(s) in filter: you passed a name that is not in the adaptive or exhaustive pool. Check the valid names list above.

Source files¶

File	Contents
`warpconvnet/nn/functional/sparse_conv/detail/unified.py`	Top-level auto-tune dispatch.
`warpconvnet/nn/functional/sparse_conv/detail/algo_params.py`	Adaptive candidate selection, mode handling, strict filter.
`warpconvnet/nn/functional/sparse_conv/detail/autotune.py`	Benchmark runners, cache init/merge, callback registration.
`warpconvnet/utils/benchmark_cache.py`	Generic benchmark cache with msgpack persistence.
`warpconvnet/constants.py`	Environment variable parsing.