A proposal-free space-curve transformer for fast, open-vocabulary 3D instance segmentation: 0.14 seconds per scene, no external region proposals, no GT 3D supervision.
Open-vocabulary 3D instance segmentation is a core capability for robotics and AR/VR, but prior methods trade one bottleneck for another: multi-stage 2D+3D pipelines aggregate foundation-model outputs at hundreds of seconds per scene, while pseudo-labeled end-to-end approaches rely on fragmented masks and external region proposals. We present SpaCeFormer (Space-Curve Transformer), a proposal-free space-curve transformer that runs at 0.14 seconds per scene, 2–3 orders of magnitude faster than multi-stage 2D+3D pipelines.
We pair it with SpaCeFormer-3M, the largest open-vocabulary 3D instance segmentation dataset (3.0M multi-view-consistent captions over 604K instances from 7.4K scenes) built through multi-view mask clustering and multi-view VLM captioning; it reaches 21× higher mask recall than prior single-view pipelines (54.3% vs. 2.5% at IoU>0.5). SpaCeFormer combines spatial window attention with Morton curve serialization for spatially coherent features, and uses a RoPE-enhanced decoder to predict instance masks directly from learned queries without external proposals. On ScanNet200 we achieve 11.1 zero-shot mAP, a 2.8× improvement over the prior best proposal-free method; on ScanNet++ and Replica, we reach 22.9 and 24.1 mAP, surpassing all prior methods including those using multi-view 2D inputs.
We construct SpaCeFormer-3M, the largest open-vocabulary 3D instance segmentation training corpus. Using training-free multi-view mask clustering and structured VLM prompting, we generate 604K instance masks with 3M diverse captions across 7,361 indoor scenes — without any human annotation.
| Source Dataset | Scenes | Instance Masks | Captions | Avg Masks / Scene |
|---|---|---|---|---|
| ScanNet | 1,201 | 79,320 | 396,600 | 66.0 |
| ScanNet++ | 223 | 27,296 | 136,480 | 122.4 |
| ARKitScenes | 4,497 | 446,409 | 2,232,045 | 99.3 |
| Matterport3D | 1,440 | 51,102 | 255,510 | 35.5 |
| Total | 7,361 | 604,127 | 3,020,635 | 82.1 |
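To make the clustering step concrete, here is a minimal sketch of the training-free idea: per-view 2D masks are lifted into the shared point cloud and greedily merged by 3D overlap. The function names, the nearest-neighbor tolerance, and the IoU threshold below are illustrative assumptions, not the released pipeline.

```python
import numpy as np
from scipy.spatial import cKDTree

def lift_mask_to_points(mask_2d, depth, K, cam_to_world, scene_tree, tol=0.05):
    """Back-project one 2D mask into the scene point cloud (illustrative)."""
    v, u = np.nonzero(mask_2d)
    z = depth[v, u]
    u, v, z = u[z > 0], v[z > 0], z[z > 0]
    # Pinhole unprojection to camera space, then rigid transform to world.
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts = (cam_to_world @ np.stack([x, y, z, np.ones_like(z)]))[:3].T
    dist, idx = scene_tree.query(pts)      # nearest scene point per pixel
    return set(idx[dist < tol].tolist())   # scene points covered by the mask

def cluster_view_masks(point_sets, iou_thresh=0.5):
    """Greedily merge per-view point sets whose 3D IoU exceeds a threshold."""
    instances = []
    for s in point_sets:
        for inst in instances:
            if len(inst & s) / max(len(inst | s), 1) > iou_thresh:
                inst |= s
                break
        else:
            instances.append(set(s))
    return instances

# scene_tree = cKDTree(scene_points) is built once per scene; point_sets is
# the list of lifted masks across all views of the 2D foundation model.
```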
Auto-generated training masks are benchmarked against GT instance annotations at IoU > 0.5. Compared to Mosaic3D's SAM2-based 2D-to-3D lifting, SpaCeFormer-3M achieves 21× higher recall (54.3% vs. 2.5%): multi-view aggregation produces substantially more complete, geometry-consistent instances.
| Source | Masks/Scene | Mean best-IoU | Precision @0.5 | Recall @0.5 | IoU > 0.50 |
|---|---|---|---|---|---|
| Mosaic3D (SAM2 lifting) | 16.1 | 0.247 | 4.8% | 2.5% | 3.8% |
| SpaCeFormer-3M (Ours) | 65.2 | 0.251 | 33.6% | 54.3% | 25.9% |
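The precision/recall columns can be reproduced in spirit with one-to-one matching at an IoU threshold. A sketch, assuming boolean point-membership masks and greedy matching by descending IoU; the paper's exact matching protocol may differ.

```python
import numpy as np

def mask_pr_at_iou(pred_masks: np.ndarray, gt_masks: np.ndarray, thresh: float = 0.5):
    """Precision/recall of predicted masks vs. GT at an IoU threshold.

    pred_masks: (M, N) bool, gt_masks: (G, N) bool over N scene points.
    Greedy one-to-one matching by descending IoU (illustrative)."""
    inter = pred_masks.astype(np.int64) @ gt_masks.astype(np.int64).T   # (M, G)
    union = pred_masks.sum(1)[:, None] + gt_masks.sum(1)[None, :] - inter
    iou = inter / np.maximum(union, 1)

    matched_pred, matched_gt = set(), set()
    for m, g in sorted(np.ndindex(iou.shape), key=lambda t: -iou[t]):
        if iou[m, g] < thresh:
            break                          # all remaining pairs score lower
        if m not in matched_pred and g not in matched_gt:
            matched_pred.add(m)
            matched_gt.add(g)

    precision = len(matched_pred) / max(len(pred_masks), 1)
    recall = len(matched_gt) / max(len(gt_masks), 1)
    return precision, recall
```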
Each 3D instance mask is paired with 5 diverse captions generated from multiple viewpoints, capturing shape, material, texture, and spatial context.
Most 3D transformers serialize point clouds along Morton (Z-order) curves, space-filling curves that map 3D coordinates to 1D indices while preserving locality on average. The advantage: every attention block sees a fixed number of tokens, so compute is predictable. The downside: the curve periodically jumps across space, so voxels that are geometric neighbors can land in different fixed-length patches, fragmenting local neighborhoods.
For dense prediction tasks like instance segmentation, where sharp boundaries matter, this fragmentation is a real problem. SpaCeFormer interleaves curve attention with spatial window attention, which has a fixed geometric extent (H×W×D voxels) and a variable token count, to keep geometrically adjacent points in the same attention group. Shifted partitions across layers (à la Swin) restore connectivity beyond window boundaries.
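A standard Morton-encoding sketch (independent of SpaCeFormer's implementation) makes the fragmentation concrete: interleaving coordinate bits preserves locality on average, but the curve jumps at power-of-two boundaries.

```python
def part1by2(n: int) -> int:
    """Spread the bits of a 10-bit integer so they occupy every third bit."""
    n &= 0x3FF
    n = (n | (n << 16)) & 0x030000FF
    n = (n | (n << 8))  & 0x0300F00F
    n = (n | (n << 4))  & 0x030C30C3
    n = (n | (n << 2))  & 0x09249249
    return n

def morton3d(x: int, y: int, z: int) -> int:
    """Interleave x, y, z bits into a single Z-order index."""
    return part1by2(x) | (part1by2(y) << 1) | (part1by2(z) << 2)

# These two voxels are adjacent in space yet 55 apart on the curve:
print(morton3d(3, 0, 0))  # 9
print(morton3d(4, 0, 0))  # 64
```

With fixed-length patches cut from this ordering, such jumps are exactly where neighboring voxels end up in different attention groups.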
SpaCeFormer is a hierarchical sparse-voxel U-Net interleaving Space attention (3D windows, fixed metric extent) at shallow high-resolution stages with Curve attention (Morton/Hilbert serialized patches, length 1024) at deeper low-resolution stages. Both flavors share fused QKV + 3D RoPE CUDA kernels on top of sparse convolution shortcuts.
Figure: token-grouping patterns compared across ViT, CvT, Point Transformer, and SpaCeFormer (ours).
Space attention groups voxels into 3D windows of fixed metric extent (window_size), guaranteeing spatial proximity with variable per-window populations. It achieves ~28.6% smaller mean pairwise distance between attending voxels than curve-based grouping, and is deployed at shallow, high-resolution levels where local geometry dominates.
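A minimal sketch of this grouping, assuming voxel coordinates in an (N, 3) tensor; the released fused kernels do this with sorted offsets rather than Python-level bookkeeping, so treat window_partition below as illustrative.

```python
import torch

def window_partition(coords: torch.Tensor, window_size: float):
    """Group sparse voxels into 3D windows of fixed metric extent.

    coords: (N, 3) voxel coordinates. Returns a permutation that makes
    each window's voxels contiguous, plus the (variable) window sizes.
    """
    # Integer window index per voxel along each axis.
    win_idx = torch.div(coords, window_size, rounding_mode="floor").long()
    # Map each voxel to its window id; group sizes are NOT fixed.
    _, group = torch.unique(win_idx, dim=0, return_inverse=True)
    order = torch.argsort(group)        # window members become contiguous
    counts = torch.bincount(group)      # variable per-window population
    return order, counts
```

Attention then runs within each contiguous segment (e.g. via variable-length attention kernels); the variable counts are the price paid for guaranteed spatial proximity.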
Curve attention serializes voxels along space-filling curves and partitions them into fixed-length patches (patch_size=1024). Fixed compute per patch enables efficient long-range mixing; it is deployed at deeper, low-resolution stages where receptive field matters more than locality. For patch length L, attention complexity drops from O(N²) to O(N·L).
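A sketch of the serialize-then-chunk step, with a self-contained (slow) Morton key and zero-padding of the last patch; both are illustrative simplifications of the fused kernels.

```python
import torch

def morton_key(c) -> int:
    """Interleave the low 10 bits of (x, y, z) into one Z-order key."""
    key = 0
    for bit in range(10):
        for axis in range(3):
            key |= ((int(c[axis]) >> bit) & 1) << (3 * bit + axis)
    return key

def curve_patches(coords: torch.Tensor, feats: torch.Tensor, patch_size: int = 1024):
    """Serialize voxels along the Morton curve, then split into fixed-length
    patches so every attention call sees exactly patch_size tokens."""
    order = torch.argsort(torch.tensor([morton_key(c) for c in coords]))
    feats = feats[order]
    pad = (-len(feats)) % patch_size          # zero-pad the final partial patch
    feats = torch.cat([feats, feats.new_zeros(pad, feats.shape[1])])
    # Attention within each length-L patch costs O(L^2) over N/L patches,
    # i.e. O(N * L) overall instead of O(N^2).
    return feats.view(-1, patch_size, feats.shape[1]), order
```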
VoxelRotaryPositionalEmbeddings extends RoPE to 3D with block-diagonal rotations parameterized by relative displacements (Δx, Δy, Δz) and fused into the QKV CUDA kernel. A window-aware base (~4·L) is auto-selected via suggest_voxel_rope_base. In the PE ablation below, this yields a 17.6% relative mAP gain over the strongest alternative (7.60 vs. 6.46 for a learnable bias).
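The block-diagonal construction can be sketched as three independent 1D RoPE rotations over channel blocks, one per axis; because rotations compose, the q·k product depends only on relative displacements (Δx, Δy, Δz). The default base below is a placeholder assumption, not the value chosen by suggest_voxel_rope_base.

```python
import torch

def rope_1d(x: torch.Tensor, pos: torch.Tensor, base: float) -> torch.Tensor:
    """Standard RoPE along one axis. x: (N, D) with D even; pos: (N,) coords."""
    half = x.shape[-1] // 2
    freq = base ** (-torch.arange(half, device=x.device, dtype=x.dtype) / half)
    ang = pos[:, None].to(x.dtype) * freq            # (N, half) rotation angles
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def rope_3d(x: torch.Tensor, coords: torch.Tensor, base: float = 4096.0) -> torch.Tensor:
    """Block-diagonal 3D RoPE: one channel block rotated per axis.
    Since R(p_q)^T R(p_k) = R(p_k - p_q), attention scores see only
    relative displacements, never absolute positions."""
    assert x.shape[-1] % 6 == 0, "head dim must split into three even blocks"
    blocks = x.chunk(3, dim=-1)
    return torch.cat(
        [rope_1d(b, coords[:, axis], base) for axis, b in enumerate(blocks)],
        dim=-1,
    )
```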
Learned query embeddings (Q=200) iteratively refined through cross-attention with point features and self-attention between queries. Directly predicts instance masks, CLIP features, and foreground scores — no proposal generation, no class-agnostic pre-filtering.
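One refinement step of such a decoder can be sketched as follows; the dimensions, attention order, and einsum mask head are assumptions for illustration, with the CLIP-feature and foreground-score heads as extra linear projections on the refined queries.

```python
import torch
import torch.nn as nn

class QueryDecoderLayer(nn.Module):
    """One refinement step: Q learned queries cross-attend to point features,
    self-attend among themselves, then score every point per query."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(3))

    def forward(self, queries: torch.Tensor, point_feats: torch.Tensor):
        # queries: (B, Q, C), point_feats: (B, N, C)
        q = self.norms[0](queries + self.cross(queries, point_feats, point_feats)[0])
        q = self.norms[1](q + self.self_attn(q, q, q)[0])
        q = self.norms[2](q + self.ffn(q))
        # Per-point mask logits as query-to-point similarity; CLIP features
        # and foreground scores would be extra linear heads on q.
        mask_logits = torch.einsum("bqc,bnc->bqn", q, point_feats)
        return q, mask_logits
```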
604K instance masks from 7,361 scenes with 3M multi-view captions. Training-free multi-view mask clustering from 2D foundation models, plus structured VLM prompting for diverse, view-consistent descriptions of shape, texture, material, and spatial context.
Released as warpconvnet.models.space_former. Per-level attention is configurable via enc_attn_types string codes (e.g. "ssccc"), with three block layouts (pre_norm / post_norm / stream_norm) and a pluggable MaskFormer head wrapped in PointToVoxel.
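A hypothetical usage sketch: only the module path, the enc_attn_types codes, the block layouts, and the PointToVoxel-wrapped MaskFormer head are stated above; the class name and constructor arguments are illustrative assumptions, not a documented signature.

```python
# Hypothetical sketch: class and argument names are assumptions.
from warpconvnet.models.space_former import SpaceFormer  # module path per release notes

model = SpaceFormer(
    enc_attn_types="ssccc",    # per-level codes: 's' = space (window), 'c' = curve
    block_layout="pre_norm",   # or "post_norm" / "stream_norm"
)
```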
Under the matched setting (proposal-free, 3D-only, no GT 3D annotations), SpaCeFormer achieves 2.8× the mAP of the next-best method (11.1 vs. 3.9). Higher-scoring methods rely on either GT-trained Mask3D proposals or multi-view 2D streams with YOLO/SAM proposals.
| Method | Input | Proposals | No GT | mAP | mAP50 | mAP25 | Time (s) |
|---|---|---|---|---|---|---|---|
| OpenMask3D | 3D + 2D | Mask3D | No | 15.4 | 19.9 | 23.1 | 553.9 |
| Open-YOLO 3D | 3D + 2D | Mask3D | No | 24.7 | 31.7 | 36.2 | 21.8 |
| Open3DIS | 3D + 2D | Superpts + ISBNet + GSAM | No | 23.7 | 29.4 | 32.8 | 33.5 |
| Any3DIS | 3D + 2D | ISBNet + SAM2 | No | 25.8 | — | — | — |
| Details Matter | 3D + 2D | Mask3D + GSAM | No | 25.8 | 32.5 | 36.2 | — |
| SAI3D | 3D + 2D | Superpts + SAM | Yes | 12.7 | 18.8 | 24.1 | 75.2 |
| MaskClustering | 3D + 2D | CropFormer | Yes | 12.0 | 23.3 | 30.1 | — |
| SAM2Object | 3D + 2D | SAM2 | Yes | 13.3 | 19.0 | 23.8 | — |
| OpenTrack3D | 3D + 2D | YOLO-World + SAM2 | Yes | 26.0 | 37.7 | 45.4 | ~356 |
| Mosaic3D + Decoder | 3D | Proposal-free | Yes | 3.9 | 7.0 | 12.3 | 1.2 |
| SpaCeFormer (Ours) | 3D | Proposal-free | Yes | 11.1 | 18.8 | 24.3 | 0.14 |
"No GT" = method does not rely on ground-truth 3D mask annotations for proposal training. For reference, fully supervised closed-vocabulary Mask3D reaches 27.4 mAP and OneFormer3D reaches 30.6 mAP with GT labels.
| Method | Input | Proposals | mAP | mAP50 | mAP25 | Time (s) | Speedup |
|---|---|---|---|---|---|---|---|
| OpenMask3D | 3D + 2D | Mask3D | 2.0 | 2.7 | 3.4 | ~554 | 3957× |
| OVIR-3D | 3D + 2D | Mask3D | 3.6 | 5.7 | 7.3 | — | — |
| MaskClustering | 3D + 2D | CropFormer | 7.8 | 10.7 | 12.1 | 600 | 4286× |
| Segment3D | 3D + 2D | SAM | 10.1 | 17.7 | 20.2 | — | — |
| Open3DIS | 3D + 2D | SAM-HQ | 11.9 | 18.1 | 21.7 | ~360 | 2571× |
| Any3DIS | 3D + 2D | SAM2 | 12.9 | 19.0 | 21.9 | ~36 | 257× |
| OpenSplat3D | 3D + 2D | SAM + GS | 16.5 | 29.7 | 39.0 | — | — |
| OpenTrack3D | 3D + 2D | YOLO-World + SAM2 | 20.6 | 34.2 | 43.4 | ~320 | 2286× |
| SpaCeFormer (Ours) | 3D | Proposal-free | 22.9 | 33.7 | 41.6 | 0.14 | — |
SpaCeFormer surpasses the prior state of the art (OpenTrack3D, 20.6 mAP) using only 3D input while being over 2,000× faster.
Among open-vocabulary methods without GT 3D supervision, SpaCeFormer is Pareto-optimal on the latency–accuracy frontier; it is exceeded only by SOLE (24.7 vs. 24.1 mAP), which requires ScanNet200 GT mask supervision and runs roughly 9× slower.
| Method | Input | Proposals | No GT | mAP | mAP50 | mAP25 | Time (s) |
|---|---|---|---|---|---|---|---|
| OpenScene-3D | 3D | Mask3D | No | 8.2 | 10.5 | 12.6 | 4.3 |
| OpenMask3D | 3D + 2D | Mask3D | No | 13.1 | 18.4 | 24.2 | 547.3 |
| Open3DIS | 3D + 2D | ISBNet + SAM | No | 18.5 | 24.5 | 28.2 | 188.0 |
| Open-YOLO 3D | 3D + 2D | Mask3D | No | 23.7 | 28.6 | 34.8 | 16.6 |
| Details Matter | 3D + 2D | Mask3D + GSAM | No | 22.6 | 31.7 | 37.7 | 597 |
| OVIR-3D | 3D + 2D | Detic | Yes | 11.1 | 20.5 | 27.5 | 52.7 |
| PoVo | 3D + 2D | SAM | Yes | 20.8 | 28.7 | 34.4 | — |
| BoxOVIS | 3D + 2D | Box + SAM | Yes | 24.0 | 31.8 | 37.4 | 43.7 |
| OpenTrack3D | 3D + 2D | YOLO-World + SAM2 | Yes | 23.9 | 36.4 | 47.6 | ~62 |
| SOLE | 3D | Proposal-free (GT-supervised) | No | 24.7 | 31.8 | 40.3 | — |
| SpaCeFormer (Ours) | 3D | Proposal-free | Yes | 24.1 | 31.8 | 37.1 | 0.14 |
SpaCeFormer is 119× faster than Open-YOLO 3D and 3,909× faster than OpenMask3D — the only open-vocabulary method that operates at interactive rates.
SpaCeFormer's final model achieves strong class-agnostic mask quality (22.5 AP, 45.7 AP50, 64.4 AP25) in addition to open-vocabulary classification (11.1 mAP over 200 categories).
| Configuration | CA AP | CA AP50 | CA AP25 | OV mAP | OV mAP50 | OV mAP25 |
|---|---|---|---|---|---|---|
| *Space-Curve Attention Ablation* | | | | | | |
| Window only | 25.5 | 49.9 | 69.0 | 9.5 | 16.8 | 22.5 |
| Morton only | 23.3 | 47.3 | 67.4 | 9.5 | 16.4 | 22.3 |
| Window + Morton (Ours) | 25.2 | 50.4 | 69.1 | 11.1 | 18.8 | 24.3 |
| *Positional Encoding Ablation* | | | | | | |
| No PE | — | — | — | 5.97 | 11.5 | 17.3 |
| Absolute PE (APE) | — | — | — | 5.95 | 11.5 | 16.1 |
| Learnable Bias | — | — | — | 6.46 | 11.8 | 17.2 |
| 3D RoPE (Ours) | — | — | — | 7.60 | 13.7 | 18.6 |
| *Query Initialization Ablation* | | | | | | |
| Random | 14.6 | 32.3 | 52.8 | 5.6 | 10.2 | 14.9 |
| Farthest Point Sampling | 18.0 | 38.4 | 58.4 | 7.2 | 13.1 | 17.9 |
| Learned Queries (Ours) | 19.8 | 41.6 | 62.5 | 6.4 | 12.0 | 18.2 |

CA = class-agnostic mask quality; OV = 200-way open-vocabulary.
PE ablations trained for 25K iterations (shorter schedule); attention and query ablations use the full training configuration.
SpaCeFormer predictions on ScanNet200 validation scenes. Each color represents a distinct predicted instance with its open-vocabulary label.






```bibtex
@inproceedings{choy2026spaceformer,
  title         = {SpaCeFormer: Fast Proposal-Free Open-Vocabulary 3D Instance Segmentation},
  author        = {Choy, Chris and Lee, Junha and Park, Chunghyun and Cho, Minsu and Kautz, Jan},
  booktitle     = {Proceedings of the International Conference on Machine Learning (ICML)},
  year          = {2026},
  eprint        = {2604.20395},
  archivePrefix = {arXiv}
}
```