Sparse Convolutions¶
WarpConvNet implements efficient spatially sparse convolutions on voxel grids using CUDA and Warp kernels.
Overview¶
WarpConvNet provides two types of sparse convolutions:
- Regular Sparse Convolution: General-purpose convolution for feature learning
- Depthwise Sparse Convolution: Channel-wise convolution for efficient feature processing
Both implementations feature a unified benchmarking system that automatically finds optimal algorithm configurations for your specific hardware and data characteristics.
Algorithm Selection¶
Available Algorithms¶
Regular Sparse Convolution¶
- EXPLICIT_GEMM: Traditional matrix-multiplication approach
- IMPLICIT_GEMM: Custom CUDA kernels with implicit GEMM operations
- WMMA_IMPLICIT_GEMM: Custom CUDA kernels with WMMA-accelerated implicit GEMM operations
- CUTLASS_IMPLICIT_GEMM: NVIDIA CUTLASS-based high-performance kernels
- AUTO: Automatically benchmark and select the best algorithm
Depthwise Sparse Convolution¶
- EXPLICIT: Element-wise multiplication approach
- IMPLICIT: Custom CUDA kernels for depthwise operations
- AUTO: Automatically benchmark and select the best algorithm
Unified Benchmarking System¶
The unified benchmarking system ensures consistent parameter optimization across all algorithm inputs:
- Single Algorithm: fwd_algo=IMPLICIT_GEMM → benchmarks all parameter combinations for IMPLICIT_GEMM
- Algorithm List: fwd_algo=[IMPLICIT_GEMM, CUTLASS_IMPLICIT_GEMM] → benchmarks both algorithms and selects the best
- AUTO Mode: fwd_algo=AUTO → benchmarks all available algorithms and selects the best
Key Insight: the algorithm input acts as a search-space filter; benchmarking always runs to find the optimal parameters within the specified space.
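As an illustration, the filtering step can be sketched as follows (the function and names below are illustrative, not WarpConvNet internals):

```python
def resolve_search_space(fwd_algo, all_algos):
    # AUTO expands to every available algorithm; a single algorithm or a
    # list restricts the space, but parameter benchmarking still runs
    # within whatever space remains.
    if fwd_algo == "auto":
        return list(all_algos)
    if isinstance(fwd_algo, (list, tuple)):
        return [a for a in all_algos if a in fwd_algo]
    return [fwd_algo] if fwd_algo in all_algos else []

algos = ["explicit_gemm", "implicit_gemm", "cutlass_implicit_gemm"]
print(resolve_search_space("auto", algos))             # full space
print(resolve_search_space("implicit_gemm", algos))    # single-algorithm space
```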
For viewing and interpreting cached benchmark results, see Inspecting the Benchmark Cache.
Benchmarking Cache Management¶
The benchmark cache is automatically managed:
- Persistent Storage: Results are saved to ~/.cache/warpconvnet/
- Configuration-Specific: Different cache entries are kept for different input sizes and types
- Background Saving: Cache updates happen in background threads
- Manual Reset: Clear the cache with rm -rf ~/.cache/warpconvnet/ if needed
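To see what has been cached, you can list the cache directory (a minimal sketch; the file layout inside the directory is implementation-defined):

```python
from pathlib import Path

# Default benchmark-cache location used by WarpConvNet.
cache_dir = Path.home() / ".cache" / "warpconvnet"

if cache_dir.exists():
    # Print each cached entry with its size; contents are implementation-defined.
    for entry in sorted(cache_dir.iterdir()):
        print(entry.name, entry.stat().st_size, "bytes")
else:
    print("No benchmark cache yet:", cache_dir)
```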
Usage Examples¶
Basic Usage¶
Basic functional API call (uses environment defaults):
import torch
from warpconvnet.nn.modules.sparse_conv import SpatiallySparseConv
from warpconvnet.nn.functional import spatially_sparse_conv

input_voxels = ...

# nn module usage
conv = SpatiallySparseConv(
    in_channels=64,
    out_channels=128,
    kernel_size=3,
)
output = conv(input_voxels)

# Minimal functional example (uses environment variables for algorithm selection if set)
output = spatially_sparse_conv(
    input_voxels,
    weight,
    kernel_size=3,
)
Depthwise Convolution¶
from warpconvnet.nn.functional import spatially_sparse_depthwise_conv
from warpconvnet.nn.functional.sparse_conv_depth import SPARSE_DEPTHWISE_CONV_FWD_ALGO_MODE

# Depthwise convolution (algorithm selection follows the same single/list/AUTO semantics)
output = spatially_sparse_depthwise_conv(
    input_features,
    depthwise_weight,
    kernel_map,
    num_out_coords,
)
Advanced Usage (NOT RECOMMENDED)¶
Please refer to the test_sparse_conv.py file for more advanced usage examples.
Environment Variables¶
You can set global defaults using environment variables that support both single algorithms and algorithm lists:
Regular Sparse Convolution¶
Specifying algorithms explicitly is generally not recommended, since it restricts auto-tuning to the parameter space of the given algorithms.
# Single algorithm
export WARPCONVNET_FWD_ALGO_MODE=implicit_gemm
export WARPCONVNET_BWD_ALGO_MODE=implicit_gemm
# WMMA single algorithm
export WARPCONVNET_FWD_ALGO_MODE=wmma_implicit_gemm
export WARPCONVNET_BWD_ALGO_MODE=wmma_implicit_gemm
# Algorithm list (limits search space)
export WARPCONVNET_FWD_ALGO_MODE="[implicit_gemm,wmma_implicit_gemm,cutlass_implicit_gemm]"
export WARPCONVNET_BWD_ALGO_MODE="[implicit_gemm,wmma_implicit_gemm,cutlass_implicit_gemm]"
# AUTO mode (benchmark all algorithms)
export WARPCONVNET_FWD_ALGO_MODE=auto
export WARPCONVNET_BWD_ALGO_MODE=auto
Depthwise Sparse Convolution¶
# Single algorithm
export WARPCONVNET_DEPTHWISE_CONV_FWD_ALGO_MODE=explicit
export WARPCONVNET_DEPTHWISE_CONV_BWD_ALGO_MODE=explicit
# Algorithm list
export WARPCONVNET_DEPTHWISE_CONV_FWD_ALGO_MODE="[explicit,implicit]"
export WARPCONVNET_DEPTHWISE_CONV_BWD_ALGO_MODE="[explicit,implicit]"
# AUTO mode
export WARPCONVNET_DEPTHWISE_CONV_FWD_ALGO_MODE=auto
export WARPCONVNET_DEPTHWISE_CONV_BWD_ALGO_MODE=auto
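When exporting variables in the shell is inconvenient, the same defaults can be set from Python before WarpConvNet reads the configuration (a sketch; whether the variables are read at import time or at call time depends on the library version, so set them as early as possible):

```python
import os

# Set algorithm search spaces before warpconvnet reads the configuration.
os.environ["WARPCONVNET_FWD_ALGO_MODE"] = "[implicit_gemm,cutlass_implicit_gemm]"
os.environ["WARPCONVNET_BWD_ALGO_MODE"] = "auto"
os.environ["WARPCONVNET_DEPTHWISE_CONV_FWD_ALGO_MODE"] = "[explicit,implicit]"
```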
Benchmarking and Performance Optimization¶
How Benchmarking Works¶
- Algorithm Filtering: The system determines which algorithms to benchmark based on your input
- Parameter Generation: For each algorithm, generates all possible parameter combinations
- Performance Testing: Runs each combination multiple times and measures execution time
- Optimal Selection: Chooses the fastest algorithm and parameter configuration
- Caching: Stores results for future use with similar configurations
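The steps above can be sketched with a minimal timing loop (illustrative only; the real system benchmarks CUDA kernels with proper device synchronization and caches the winner):

```python
import time

def pick_best(configs, run, warmup=2, iters=5):
    # Time each (algorithm, parameters) configuration and keep the fastest.
    best_cfg, best_time = None, float("inf")
    for cfg in configs:
        for _ in range(warmup):   # warm up caches before timing
            run(cfg)
        start = time.perf_counter()
        for _ in range(iters):
            run(cfg)
        elapsed = (time.perf_counter() - start) / iters
        if elapsed < best_time:
            best_cfg, best_time = cfg, elapsed
    return best_cfg
```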
Parameter Examples¶
The system automatically optimizes parameters like:
IMPLICIT_GEMM:
- fwd_block_size: 4, 16, 32
- gemm_block_size: 4, 16, 32
- split_k_factor: 2, 4, 8

CUTLASS_IMPLICIT_GEMM:
- mma_tile: 0, 1, 2, 3
- split_k_slices: 1, 2, 4, 8
- accumulator_type: float32

WMMA_IMPLICIT_GEMM:
- No user-tunable parameters (auto-configured internally)
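For example, the IMPLICIT_GEMM search space above expands to a Cartesian product of parameter values (the dictionary below simply mirrors the listed values; it is not the internal representation):

```python
from itertools import product

implicit_gemm_space = {
    "fwd_block_size": [4, 16, 32],
    "gemm_block_size": [4, 16, 32],
    "split_k_factor": [2, 4, 8],
}

# Enumerate every parameter combination the benchmark would test.
keys = list(implicit_gemm_space)
combos = [dict(zip(keys, values)) for values in product(*implicit_gemm_space.values())]
print(len(combos))  # 3 * 3 * 3 = 27 candidate configurations
```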
Performance Benefits¶
The unified benchmarking system provides:
- Consistent Optimization: All execution paths lead to parameter optimization
- Hardware Adaptation: Automatically finds best configuration for your GPU
- Future-Proof: New algorithm parameters are automatically optimized
- Search Space Control: Algorithm lists limit benchmarking scope for faster startup
String and Mixed Input Support¶
The system supports flexible input formats:
# String inputs
output = spatially_sparse_conv(
    input_voxels,
    weight,
    kernel_size=3,
    fwd_algo="implicit_gemm",  # String format
    bwd_algo="implicit_gemm",
)

# String lists
output = spatially_sparse_conv(
    input_voxels,
    weight,
    kernel_size=3,
    fwd_algo=["implicit_gemm", "wmma_implicit_gemm", "cutlass_implicit_gemm"],  # String list
    bwd_algo=["implicit_gemm", "wmma_implicit_gemm", "cutlass_implicit_gemm"],
)

# Mixed enum and string lists
output = spatially_sparse_conv(
    input_voxels,
    weight,
    kernel_size=3,
    fwd_algo=[SPARSE_CONV_FWD_ALGO_MODE.IMPLICIT_GEMM, "cutlass_implicit_gemm"],
    bwd_algo=[SPARSE_CONV_BWD_ALGO_MODE.IMPLICIT_GEMM, "cutlass_implicit_gemm"],
)
Best Practices¶
For Development¶
- Use AUTO mode to explore all available algorithms
- Use algorithm lists to limit the search space during hyperparameter tuning
- Monitor first-run performance (benchmarking overhead) vs. cached runs
For Production¶
- Use specific algorithms once you know what works best
- Set environment variables for consistent behavior across runs
- Consider using algorithm lists if you want to restrict to tested algorithms
For New Hardware¶
- Clear the cache when switching GPUs: rm -rf ~/.cache/warpconvnet/
- Use AUTO mode to discover optimal algorithms for new hardware
- Algorithm lists help compare specific algorithms on new hardware
Troubleshooting¶
Common Issues¶
Slow First Run:
- Normal behavior - benchmarking finds optimal parameters
- Subsequent runs use cached results and are fast
- Use algorithm lists to reduce initial benchmarking time
Cache Issues:
- Clear the cache: rm -rf ~/.cache/warpconvnet/
- Check permissions on cache directory
- Cache is configuration-specific (input size, types, etc.)
Algorithm Availability:
- Some algorithms require specific CUDA versions
- CUTLASS algorithms need a compatible GPU compute capability and may not work on all GPUs
- WMMA requires Tensor Cores and a compatible compute capability
- Check the logs for algorithm availability warnings
- To force the explicit and implicit GEMM paths only:
  export WARPCONVNET_FWD_ALGO_MODE="[explicit_gemm,implicit_gemm]"
  export WARPCONVNET_DEPTHWISE_CONV_FWD_ALGO_MODE="[explicit,implicit]"
Performance Tips¶
- Use algorithm lists to focus on algorithms known to work well
- Environment variables provide global defaults
- Cache results are persistent across Python sessions
- Benchmarking overhead is paid once per configuration