Accumulator Precision¶
Created: 2026-04-15 14:00:00 Edited: 2026-04-18 16:35:57
WarpConvNet's production mask GEMM kernels use tensor core MMA instructions that support two accumulator modes:
| Mode | MMA Atom | Throughput | Precision | Default |
|---|---|---|---|---|
| fp32 accumulator | SM80_16x8x16_F32F16F16F32_TN |
1x | Higher | Yes |
| fp16 accumulator | SM80_16x8x16_F16F16F16F16_TN |
2x | Lower | No |
The fp16 accumulator doubles tensor core throughput by accumulating partial sums in fp16 instead of fp32. This gives ~15% end-to-end speedup at large channels (C >= 64) where the computation is tensor-core-bound. The precision loss can be significant at wide channels (measured rd vs fp32: 1.4e-5 @ C=8, 2.4e-4 @ C=32, 5.3e-4 @ C=128, 7.5e-4 @ C=256) and training convergence on sensitive models can trail the fp32-accumulator baseline over many epochs. Default is off for that reason; enable explicitly for inference or after verifying training stability.
Configuration¶
Environment Variable (no code changes)¶
# Enable fp16 accumulator globally
export WARPCONVNET_USE_FP16_ACCUM=true
# Disable (default)
export WARPCONVNET_USE_FP16_ACCUM=false
Runtime API (no code changes)¶
import warpconvnet
# Enable fp16 accumulator for all subsequent convolutions
warpconvnet.set_fp16_accum(True)
# Query current setting
print(warpconvnet.get_fp16_accum()) # True
# Disable
warpconvnet.set_fp16_accum(False)
Per-Module Override¶
Individual convolution layers can override the global setting:
from warpconvnet.nn.modules.sparse_conv import SparseConv3d
# This layer always uses fp16 accumulator, regardless of global setting
conv_fast = SparseConv3d(64, 128, kernel_size=3, use_fp16_accum=True)
# This layer always uses fp32 accumulator
conv_precise = SparseConv3d(64, 128, kernel_size=3, use_fp16_accum=False)
# This layer follows the global setting (default)
conv_default = SparseConv3d(64, 128, kernel_size=3) # use_fp16_accum=None
Functional API¶
from warpconvnet.nn.functional.sparse_conv import spatially_sparse_conv
output = spatially_sparse_conv(
input_voxels,
weight,
kernel_size=3,
use_fp16_accum=True, # or None for global setting
)
Precedence¶
- Per-module
use_fp16_accum=True/False-- highest priority - Global runtime
warpconvnet.set_fp16_accum(True)-- used when module isNone - Environment variable
WARPCONVNET_USE_FP16_ACCUM=true-- sets initial global value - Default
False(fp32 accumulator)
What the Flag Does¶
When use_fp16_accum=True is resolved (per-module, global, or env):
- Production pool: F16Acc tiles (40/42 forward, equivalent dgrad variants) are added to the autotune candidate pool. When the flag is unset, only F32Acc production tiles are benchmarked.
- CUTLASS pool: Every CUTLASS entry's
accumulator_typeparameter is rewritten totorch.float16. Without the flag, CUTLASS runs with its default fp32 accumulator.
Both the adaptive (auto) and trimmed pools default to F32Acc only. Setting WARPCONVNET_AB_ALGO_MODE=all or explicitly naming production via the env config includes F16Acc tiles regardless of the flag; the flag is the normal gate for adaptive/trimmed pools.
Which Tiles Use FP16 Accumulator¶
The autotune system selects between F16Accum and F32 tile variants based on the use_fp16_accum setting:
| Tile ID | Name | Accumulator | Use Case | In adaptive/trimmed pool? |
|---|---|---|---|---|
| 40 | Prod_Fwd_32x32x32_F16Acc |
fp16 | Forward, C \<= 48, fp16 only | Only when use_fp16_accum=True |
| 42 | Prod_Fwd_64x128x32_F16Acc |
fp16 | Forward, C >= 128 | Only when use_fp16_accum=True |
| 53 | Prod_Dgrad_64x64x32_F16Acc |
fp16 | Dgrad, C = 64 | Only when use_fp16_accum=True |
| 54 | Prod_Dgrad_64x128x32_F16Acc |
fp16 | Dgrad, C >= 128 | Only when use_fp16_accum=True |
| 41 | Prod_Fwd_64x64x32 |
fp32 | Forward, C = 64 | Yes |
| 43 | Prod_Fwd_64x128x32_3s |
fp32 | Forward, C >= 128 | Yes |
| 44 | Prod_Fwd_128x64x32 |
fp32 | Forward, C = 64 | Yes |
| 51 | Prod_Dgrad_64x64x32 |
fp32 | Dgrad, C = 64 | Yes |
| 52 | Prod_Dgrad_64x128x32 |
fp32 | Dgrad, C >= 128 | Yes |
Wgrad always uses fp32 accumulator regardless of this setting, since weight gradient precision directly affects training convergence.
Recommendations¶
- Training: Use fp32 accumulator (default). Switch to fp16 only after verifying convergence is unaffected on your model. The F16Acc tiles' per-step relative difference (up to ~7.5e-4 at C=256) accumulates across epochs and has been observed to slow convergence on ScanNet MinkUNet-style models.
- Inference: fp16 accumulator is safe and recommended for maximum throughput.
- Large channels (C >= 128): Largest speedup from fp16 accumulator (~15%).
- Small channels (C \<= 32): Minimal benefit since the computation is memory-bound, not compute-bound.
- After switching: clear the cache (
rm ~/.cache/warpconvnet/benchmark_cache_generic.*) so the pool change triggers a fresh autotune pass.