ScanNet Example

This example trains a semantic segmentation model on ScanNet indoor scenes using a MinkUNet-style encoder-decoder built with sparse convolutions.

Dataset

The script uses the pre-processed ScanNet 3D point clouds from the OpenScene project. Each scene is stored as (coords, colors, labels):

  • coords: (N, 3) float32 — 3D point positions
  • colors: (N, 3) float32 — RGB color features
  • labels: (N,) int — semantic class labels (20 classes, 255 = ignore)

The 20 semantic classes are: wall, floor, cabinet, bed, chair, sofa, table, door, window, bookshelf, picture, counter, desk, curtain, refrigerator, shower curtain, toilet, sink, bathtub, and other furniture.

The dataset is downloaded automatically on first run (~1.3 GB) to ./data/scannet_3d/.
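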
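Each scene file can be inspected directly with torch.load. The snippet below is a minimal sketch assuming the OpenScene-style layout described above; the filename and the train/ subdirectory are illustrative, and the stored arrays may be NumPy arrays rather than tensors depending on the export.

```python
import torch

# Hypothetical scene file under ./data/scannet_3d/ (actual names follow the OpenScene export).
coords, colors, labels = torch.load("./data/scannet_3d/train/scene0000_00.pth")

print(coords.shape)  # (N, 3) float32 xyz positions
print(colors.shape)  # (N, 3) float32 RGB features
print(labels.shape)  # (N,)   labels in [0, 19], 255 = ignore
```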

Data augmentations are opt-in

Augmentations are disabled by default so the script stays minimal. Set data.augmentations=true to apply the standard ScanNet recipe (random rotation around the up-axis, scale, horizontal flip, point dropout, chromatic auto-contrast / translation / jitter / drop). Expect roughly a 5–10 point mIoU improvement over the un-augmented baseline.

Network architecture

The default model is MinkUNet18, a U-Net with sparse convolution encoder and decoder blocks connected by skip connections. Available models:

| Model | Description |
| --- | --- |
| warpconvnet.models.MinkUNet18 | Lightweight U-Net (default) |
| warpconvnet.models.MinkUNet34 | Deeper encoder |
| warpconvnet.models.MinkUNet50 | ResNet-50 style blocks |
| warpconvnet.models.MinkUNet101 | ResNet-101 style blocks |
| warpconvnet.models.SpaCeFormer | Curve + space + window attention U-Net (see its documentation page) |

Input points are voxelized at voxel_size=0.02, and the model is wrapped with PointToVoxel, which handles the point-to-voxel conversion and maps output features back to the original point resolution.

The model outputs per-point logits with shape (N, 20).
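For reference, Hydra builds the model from the model.* keys with hydra.utils.instantiate. The snippet below is a sketch of that mechanism using the defaults from the configuration reference, not a copy of the script's code; the in_type=voxel handling (the PointToVoxel wrapping) is done by the script around the instantiated model.

```python
from hydra.utils import instantiate
from omegaconf import OmegaConf

# Mirrors the model.* defaults; _target_ selects the class to construct.
model_cfg = OmegaConf.create({
    "_target_": "warpconvnet.models.MinkUNet18",
    "in_channels": 3,    # RGB features
    "out_channels": 20,  # ScanNet semantic classes
})

model = instantiate(model_cfg)  # ~ MinkUNet18(in_channels=3, out_channels=20)
```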

Setup

Install the optional model and training dependencies:

```bash
pip install "warpconvnet[models]"
```

Additional requirements: hydra-core, omegaconf, torchmetrics.

Run

```bash
python examples/train/scannet.py
```

The script uses Hydra for configuration. Override any parameter on the command line:

```bash
# Smaller batch size for limited GPU memory
python examples/train/scannet.py train.batch_size=4

# Use a deeper model
python examples/train/scannet.py model._target_=warpconvnet.models.MinkUNet34

# Change voxel size and learning rate
python examples/train/scannet.py data.voxel_size=0.05 train.lr=0.01

# Swap the backbone for SpaCeFormer (curve+space attention U-Net). The
# +-prefixed keys add SpaCeFormer-specific args MinkUNet does not have.
python examples/train/scannet.py \
    model._target_=warpconvnet.models.SpaCeFormer \
    +model.enc_attn_types=ssccc \
    +model.dec_attn_types=ssca \
    +model.use_rope=true
```

Configuration reference

Paths:

| Key | Default | Description |
| --- | --- | --- |
| paths.data_dir | ./data/scannet_3d | Dataset directory |
| paths.output_dir | ./results/ | Output directory |
| paths.ckpt_path | null | Checkpoint path to resume from |

Training:

| Key | Default | Description |
| --- | --- | --- |
| train.batch_size | 12 | Training batch size |
| train.lr | 0.001 | AdamW learning rate |
| train.epochs | 100 | Number of training epochs |
| train.step_size | 20 | StepLR decay period (epochs) |
| train.gamma | 0.7 | StepLR decay factor |
| train.num_workers | 8 | DataLoader workers |
| train.precision | "16-mixed" | "32" (fp32) or "16-mixed" (fp16 forward + GradScaler) |
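
The train.precision values map onto standard PyTorch mixed-precision training. The snippet below is an illustrative sketch of what "16-mixed" implies (an autocast forward pass plus a GradScaler-scaled backward), not the script's actual training loop; the dummy linear model stands in for the sparse-conv network.

```python
import torch

precision = "16-mixed"                                   # or "32"
use_amp = precision == "16-mixed"

model = torch.nn.Linear(3, 20).cuda()                    # stand-in for the real network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss(ignore_index=255)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

feats = torch.randn(4096, 3, device="cuda")              # dummy per-point RGB features
labels = torch.randint(0, 20, (4096,), device="cuda")    # dummy semantic labels

optimizer.zero_grad(set_to_none=True)
with torch.autocast("cuda", enabled=use_amp):
    logits = model(feats)                                # fp16 forward when enabled
    loss = criterion(logits, labels)
scaler.scale(loss).backward()                            # scaled to avoid fp16 gradient underflow
scaler.step(optimizer)                                   # unscales, then steps in fp32
scaler.update()
```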

Test:

| Key | Default | Description |
| --- | --- | --- |
| test.batch_size | 12 | Test batch size |
| test.num_workers | 4 | DataLoader workers |

Data:

| Key | Default | Description |
| --- | --- | --- |
| data.num_classes | 20 | Number of semantic classes |
| data.voxel_size | 0.02 | Voxelization resolution (meters) |
| data.ignore_index | 255 | Label index to ignore in loss/metrics |
| data.augmentations | false | Apply geometric + chromatic training augs |
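
The data.num_classes and data.ignore_index values feed both the loss and the evaluation metrics. A short sketch of how they are typically wired up with PyTorch and torchmetrics (illustrative, not the script's exact code):

```python
import torch
from torchmetrics import JaccardIndex

num_classes, ignore_index = 20, 255

criterion = torch.nn.CrossEntropyLoss(ignore_index=ignore_index)
miou = JaccardIndex(task="multiclass", num_classes=num_classes, ignore_index=ignore_index)

logits = torch.randn(4096, num_classes)               # per-point logits
labels = torch.randint(0, num_classes, (4096,))
labels[:100] = ignore_index                            # unlabeled points contribute nothing

loss = criterion(logits, labels)                       # skips the 255-labeled points
score = miou(logits.argmax(dim=1), labels)             # mean IoU over the 20 classes
```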

Model:

| Key | Default | Description |
| --- | --- | --- |
| model._target_ | warpconvnet.models.MinkUNet18 | Model class to instantiate |
| model.in_channels | 3 | Input feature channels (RGB) |
| model.out_channels | 20 | Output channels (num classes) |
| model.in_type | voxel | Input type (voxel wraps the model with PointToVoxel) |

General:

| Key | Default | Description |
| --- | --- | --- |
| device | cuda | Device |
| use_wandb | false | Enable Weights & Biases logging |
| seed | 42 | Random seed |

Visualization (viser):

| Key | Default | Description |
| --- | --- | --- |
| viz.enabled | false | Spin up a viser server during training |
| viz.port | 8080 | Viser HTTP port |
| viz.interval_seconds | 10.0 | Min seconds between scene refreshes (per training loop) |

Data augmentations

Set data.augmentations=true to construct default_train_augmentations(colors_in_unit_range=True) and pass it to the dataset via ScanNetDataset(transform=...). The default recipe is a port of the chrischoy/SpatioTemporalSegmentation recipe and is applied per training sample, before voxelization:

| Transform | Probability | Range / parameters |
| --- | --- | --- |
| RandomRotation3D() | always | x: ±π/64, y: ±π/64, z: ±π (matches upstream) |
| RandomScale((0.9, 1.1)) | always | uniform scale factor |
| RandomHorizontalFlip(z) | 0.95 | independent flip per non-up axis |
| RandomTranslationRatio() | always | ±20 % of scene extent on x/y, none on z |
| ElasticDistortion(((0.2, 0.4), (0.8, 1.6))) | 0.95 | two-pass smooth coord warp; mean displacement ≈ 27 cm |
| RandomDropout(0.20) | 0.20 | drop 20 % of points |
| ChromaticAutoContrast | 0.20 | per-scene auto-contrast, blended |
| ChromaticTranslation(0.10) | 0.95 | scene-wide RGB tint, ±10 % of range |
| ChromaticJitter(σ=0.01) | 0.95 | per-point Gaussian RGB noise |
| ChromaticDrop | 0.05 | replace all RGB with mid-gray |

Test-time data is not augmented (test loader uses the bare ScanNetDataset).

```bash
# Default minimal training (no augs)
python examples/train/scannet.py

# Standard augmented recipe (recommended)
python examples/train/scannet.py data.augmentations=true
```

To customize the pipeline, build your own Compose([...]) and pass it as the transform argument to ScanNetDataset. See warpconvnet/dataset/transforms.py for the available transform classes.
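
For example, a reduced pipeline without ElasticDistortion might look like the sketch below. The class names are taken from the table above, but the import path and constructor arguments (e.g. the flip axis and jitter std) are assumptions; check warpconvnet/dataset/transforms.py for the exact signatures.

```python
# Import path and constructor arguments assumed; see warpconvnet/dataset/transforms.py.
from warpconvnet.dataset.transforms import (
    Compose,
    RandomRotation3D,
    RandomScale,
    RandomHorizontalFlip,
    ChromaticJitter,
)

transform = Compose([
    RandomRotation3D(),            # rotation, mostly around the up-axis
    RandomScale((0.9, 1.1)),       # uniform scale
    RandomHorizontalFlip("z"),     # flip axes other than the up-axis ("z" argument form assumed)
    ChromaticJitter(0.01),         # per-point Gaussian RGB noise (std argument assumed)
])

# Then pass it to the dataset: ScanNetDataset(..., transform=transform)
```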

About ElasticDistortion

ElasticDistortion needs scipy and is the slowest transform in the recipe (~0.4 s per sample). It produces a smooth random warp of coordinate space — flat walls bow, chair legs curl. Mean per-point displacement at the upstream parameters ((0.2, 0.4), (0.8, 1.6)) is ≈ 27 cm, with p95 ≈ 41 cm. If profiling shows the dataloader is your bottleneck, disable it by passing your own Compose([...]) without it.

Skipped from upstream

HueSaturationTranslation (marginal mIoU contribution) is not included by default. Add it yourself if you want the full upstream recipe.

Live Minecraft-style visualization (viser)

Set viz.enabled=true to launch an embedded viser server while training. The visualizer renders one scan from each batch as three side-by-side voxel scenes:

  • Left — input RGB (per-voxel mean color)
  • Middle — ground-truth segmentation (per-voxel majority label)
  • Right — model prediction (per-voxel majority argmax)

Each occupied voxel is drawn as an axis-aligned cube, giving the scene a Minecraft-like look that makes the discrete sparse-conv grid structure obvious. The scene refreshes at most once every viz.interval_seconds seconds so it never stalls the training loop.

```bash
# Train + visualize
python examples/train/scannet.py viz.enabled=true viz.port=8080 viz.interval_seconds=10
# Open http://localhost:8080
```

[Screenshot: viser full UI — three side-by-side cube scenes + GUI sidebar]

The three panels (input · GT · prediction) use the same camera and identical voxelization, so misclassified voxels jump out as color speckles when the right panel diverges from the middle one.

[Screenshot: viser cube panels close-up]

The GUI sidebar surfaces live metrics — total occupied voxels, voxel-level accuracy of the current frame, and the current epoch / step — alongside a class-color legend matching the standard ScanNet 20-class palette.

[Screenshot: viser GUI sidebar with class legend]

About these screenshots

The screenshots above were captured by running the visualizer against a hand-crafted synthetic mini-room (no GPU, no ScanNet data, no checkpoint required). Regenerate them with:

```bash
pip install viser trimesh playwright
playwright install chromium
python docs/examples/scripts/capture_viser_screenshots.py
```

Expected output

Each epoch prints a progress bar followed by test-set evaluation with accuracy and mean IoU:

```
Train Epoch: 1 Loss:  2.143: 100%|██████████| 104/104
Test set: Average loss:  1.8234, Accuracy:  42.15%, mIoU:  18.73%
```

After 100 epochs with default settings, expect roughly:

  • Overall accuracy: ~75-80%
  • mIoU: ~55-65%

Results will vary with augmentation, model choice, and voxel size. This example is intended as a starting point, not a benchmark-tuned recipe.