Fast Proposal-Free Open-Vocabulary 3D Instance Segmentation
¹NVIDIA ²POSTECH
Open-vocabulary 3D instance segmentation is a core capability for robotics and AR/VR, but prior methods trade one bottleneck for another: multi-stage 2D+3D pipelines aggregate foundation-model outputs at hundreds of seconds per scene, while pseudo-labeled end-to-end approaches rely on fragmented masks and external region proposals. We present SpaCeFormer (Space-Curve Transformer), a proposal-free space-curve transformer that runs at 0.14 seconds per scene, 2–3 orders of magnitude faster than multi-stage 2D+3D pipelines.
We pair it with SpaCeFormer-3M, the largest open-vocabulary 3D instance segmentation dataset (3.0M multi-view-consistent captions over 604K instances from 7.4K scenes) built through multi-view mask clustering and multi-view VLM captioning; it reaches 21× higher mask recall than prior single-view pipelines (54.3% vs. 2.5% at IoU>0.5). SpaCeFormer combines spatial window attention with Morton curve serialization for spatially coherent features, and uses a RoPE-enhanced decoder to predict instance masks directly from learned queries without external proposals. On ScanNet200 we achieve 11.1 zero-shot mAP, a 2.8× improvement over the prior best proposal-free method; on ScanNet++ and Replica, we reach 22.9 and 24.1 mAP, surpassing all prior methods including those using multi-view 2D inputs.
We construct SpaCeFormer-3M, the largest open-vocabulary 3D instance segmentation training corpus. Using training-free multi-view mask clustering and structured VLM prompting, we generate 604K instance masks with 3M diverse captions across 7,361 indoor scenes — without any human annotation.
| Source Dataset | Scenes | Instance Masks | Captions | Avg Masks/Scene |
|---|---|---|---|---|
| ScanNet | 1,201 | 79,320 | 396,600 | 66.0 |
| ScanNet++ | 223 | 27,296 | 136,480 | 122.4 |
| ARKitScenes | 4,497 | 446,409 | 2,232,045 | 99.3 |
| Matterport3D | 1,440 | 51,102 | 255,510 | 35.5 |
| Total | 7,361 | 604,127 | 3,020,635 | 82.1 |
We benchmark auto-generated training masks against GT instance annotations at IoU > 0.5. Compared to Mosaic3D's SAM2-based 2D-to-3D lifting, SpaCeFormer-3M achieves 21× higher recall (54.3% vs. 2.5%), confirming that multi-view aggregation produces substantially more complete, geometry-consistent instances.
| Source | Masks / Scene | Mean best-IoU | Precision @0.5 | Recall @0.5 | IoU > 0.50 |
|---|---|---|---|---|---|
| Mosaic3D (SAM2 lifting) | 16.1 | 0.247 | 4.8% | 2.5% | 3.8% |
| SpaCeFormer-3M (Ours) | 65.2 | 0.251 | 33.6% | 54.3% | 25.9% |
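The precision and recall figures above come from matching auto-generated masks to GT instances at an IoU threshold. A minimal sketch of that evaluation, using greedy one-to-one matching over boolean point masks (function names and the toy masks are illustrative, not the released evaluation code, which may match in a different order):

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU between two boolean per-point masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def precision_recall_at_iou(preds, gts, thr=0.5):
    """Greedily match each predicted mask to an unused GT mask at IoU >= thr."""
    matched_gt, tp = set(), 0
    for p in preds:
        ious = [(mask_iou(p, g), j) for j, g in enumerate(gts) if j not in matched_gt]
        if ious:
            best, j = max(ious)
            if best >= thr:
                matched_gt.add(j)
                tp += 1
    return tp / max(len(preds), 1), tp / max(len(gts), 1)

# Toy scene with 6 points, 2 GT instances, 2 predicted masks.
gts = [np.array([1, 1, 1, 0, 0, 0], bool), np.array([0, 0, 0, 1, 1, 1], bool)]
preds = [np.array([1, 1, 0, 0, 0, 0], bool), np.array([0, 0, 0, 0, 1, 1], bool)]
prec, rec = precision_recall_at_iou(preds, gts)   # both predictions match at IoU 2/3
```

Recall is computed over GT instances, precision over predictions, which is why a pipeline that produces few, fragmented masks (low masks/scene) can have both low at the same time.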
Each 3D instance mask is paired with 5 diverse captions generated from multiple viewpoints, capturing shape, material, texture, and spatial context.
Interactive demo: explore SpaCeFormer's instance segmentation predictions on real 3D scenes.
SpaCeFormer introduces Space-Curve partitioning and 3D RoPE on top of sparse 3D convolutions, compared to the standard attention blocks used in prior architectures (ViT, CvT, Point Transformer).
Combines spatial window attention (fixed geometric extent, variable tokens) with Morton curve serialization (fixed-length segments, variable spatial extent). Windows preserve local spatial neighborhoods; curves introduce structured long-range diversity. Reduces complexity from O(N²) to O(N·Lmax).
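The Morton-curve half of this scheme can be sketched in a few lines: quantize coordinates, interleave their bits into a single sort key, and cut the sorted order into fixed-length segments. A minimal illustration (function names, bit depth, and segment length are assumptions for exposition, not the paper's implementation):

```python
import numpy as np

def morton_code(coords: np.ndarray, bits: int = 10) -> np.ndarray:
    """Interleave the bits of integer (x, y, z) into a single Morton key."""
    codes = np.zeros(len(coords), dtype=np.int64)
    for b in range(bits):
        for axis in range(3):
            codes |= ((coords[:, axis] >> b) & 1) << (3 * b + axis)
    return codes

def curve_segments(points: np.ndarray, seg_len: int = 4, bits: int = 10):
    """Serialize points along the Morton curve, then cut fixed-length segments."""
    lo, hi = points.min(0), points.max(0)
    q = ((points - lo) / (hi - lo + 1e-9) * (2**bits - 1)).astype(np.int64)
    order = np.argsort(morton_code(q, bits))
    return [order[i:i + seg_len] for i in range(0, len(order), seg_len)]

pts = np.random.rand(16, 3)
segs = curve_segments(pts, seg_len=4)
# Each segment holds a fixed token count; its spatial extent varies — the
# complement of spatial windows (fixed extent, variable token count).
```

Attention restricted to such segments costs O(N·Lmax) rather than O(N²), since each of the N tokens attends only within its bounded-length group.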
Extends RoPE to 3D with block-diagonal rotation matrices parameterized by relative displacements (Δx, Δy, Δz). Enables geometry-aware attention that naturally encodes spatial proximity — +27.6% mAP improvement over the best alternative positional encoding.
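The property that makes RoPE attractive carries over to 3D: splitting the channel dimension into three axis groups and rotating each 2-channel pair by an angle proportional to that axis coordinate yields a block-diagonal rotation whose query–key dot product depends only on the displacement (Δx, Δy, Δz). A toy sketch under assumed frequencies and channel layout (not the model's exact parameterization):

```python
import numpy as np

def rope_3d(x: np.ndarray, pos: np.ndarray, base: float = 100.0) -> np.ndarray:
    """Rotate channel pairs by angles proportional to the 3D position.

    Channels split into three axis groups; within each group, pair i is
    rotated by pos[axis] * base**(-i/half) — a block-diagonal rotation.
    """
    d = x.shape[-1]
    assert d % 6 == 0, "need dim divisible by 6 (3 axes x 2 channels per pair)"
    per_axis = d // 3
    out = np.empty_like(x)
    for axis in range(3):
        seg = x[..., axis * per_axis:(axis + 1) * per_axis]
        half = per_axis // 2
        freqs = base ** (-np.arange(half) / half)     # per-pair frequencies
        ang = pos[..., axis:axis + 1] * freqs
        c, s = np.cos(ang), np.sin(ang)
        x1, x2 = seg[..., :half], seg[..., half:]
        out[..., axis * per_axis:(axis + 1) * per_axis] = np.concatenate(
            [x1 * c - x2 * s, x1 * s + x2 * c], axis=-1)
    return out

# Relative-position property: the rotated dot product is invariant to a
# common translation of both points, i.e. it sees only (Δx, Δy, Δz).
q, k = np.random.randn(12), np.random.randn(12)
p1, p2 = np.array([1.0, 2.0, 3.0]), np.array([1.5, 2.5, 2.0])
shift = np.array([10.0, -4.0, 7.0])
a = rope_3d(q, p1) @ rope_3d(k, p2)
b = rope_3d(q, p1 + shift) @ rope_3d(k, p2 + shift)   # a ≈ b
```

This is what lets attention logits encode spatial proximity without any learned positional table.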
Learned query embeddings (Q=200) iteratively refined through cross-attention with point features and self-attention between queries. Directly predicts instance masks, CLIP features, and foreground scores — no proposal generation, no class-agnostic pre-filtering.
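The query-based decoding loop can be illustrated with a small numpy sketch: queries cross-attend to point features, are residually updated, and predict a per-point mask logit for each instance by dot product. All dimensions, the single-head attention, and the foreground heuristic below are placeholders; the actual decoder also self-attends between queries and predicts per-instance CLIP features:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, Q = 500, 32, 8                  # points, feature dim, queries (paper: Q=200)

feats = rng.normal(size=(N, D))       # per-point features from the 3D backbone
queries = rng.normal(size=(Q, D))     # learned query embeddings

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def decoder_round(queries, feats):
    """One refinement round: cross-attend queries to points, then score masks."""
    attn = softmax(queries @ feats.T / np.sqrt(D))    # (Q, N) cross-attention
    queries = queries + attn @ feats                  # residual query update
    mask_logits = queries @ feats.T                   # (Q, N) per-point logits
    fg_scores = mask_logits.max(axis=1)               # toy foreground score
    return queries, mask_logits, fg_scores

for _ in range(3):                                    # iterative refinement
    queries, mask_logits, fg = decoder_round(queries, feats)

masks = mask_logits > 0                               # binary instance masks
```

The key point is that masks fall directly out of query–feature similarity, so no external proposal generator or class-agnostic pre-filter ever enters the pipeline.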
604K instance masks from 7,361 scenes with 3M multi-view captions. Training-free multi-view mask clustering from 2D foundation models, plus structured VLM prompting for diverse, view-consistent descriptions of shape, texture, material, and spatial context.
Under the matched setting (proposal-free, 3D-only, no GT 3D annotations), SpaCeFormer achieves 2.8× higher mAP than the next-best method. Higher-scoring methods rely on either GT-trained Mask3D proposals or multi-view 2D streams with YOLO/SAM proposals.
| Method | Input | Proposals | No GT | mAP | mAP50 | mAP25 | Time (s) |
|---|---|---|---|---|---|---|---|
| OpenMask3D | 3D + 2D | Mask3D | No | 15.4 | 19.9 | 23.1 | 553.9 |
| Open-YOLO 3D | 3D + 2D | Mask3D | No | 24.7 | 31.7 | 36.2 | 21.8 |
| Open3DIS | 3D + 2D | Superpts + ISBNet + GSAM | No | 23.7 | 29.4 | 32.8 | 33.5 |
| Any3DIS | 3D + 2D | ISBNet + SAM2 | No | 25.8 | — | — | — |
| Details Matter | 3D + 2D | Mask3D + GSAM | No | 25.8 | 32.5 | 36.2 | — |
| SAI3D | 3D + 2D | Superpts + SAM | Yes | 12.7 | 18.8 | 24.1 | 75.2 |
| MaskClustering | 3D + 2D | CropFormer | Yes | 12.0 | 23.3 | 30.1 | — |
| SAM2Object | 3D + 2D | SAM2 | Yes | 13.3 | 19.0 | 23.8 | — |
| OpenTrack3D | 3D + 2D | YOLO-World + SAM2 | Yes | 26.0 | 37.7 | 45.4 | ~356 |
| Mosaic3D + Decoder | 3D | Proposal-free | Yes | 3.9 | 7.0 | 12.3 | 1.2 |
| SpaCeFormer (Ours) | 3D | Proposal-free | Yes | 11.1 | 18.8 | 24.3 | 0.14 |
"No GT" = method does not rely on ground-truth 3D mask annotations for proposal training. For reference, fully supervised closed-vocabulary Mask3D reaches 27.4 mAP and OneFormer3D reaches 30.6 mAP with GT labels.
| Method | Input | Proposals | mAP | mAP50 | mAP25 | Time (s) | Speedup |
|---|---|---|---|---|---|---|---|
| OpenMask3D | 3D + 2D | Mask3D | 2.0 | 2.7 | 3.4 | ~554 | 3957× |
| OVIR-3D | 3D + 2D | Mask3D | 3.6 | 5.7 | 7.3 | — | — |
| MaskClustering | 3D + 2D | CropFormer | 7.8 | 10.7 | 12.1 | 600 | 4286× |
| Segment3D | 3D + 2D | SAM | 10.1 | 17.7 | 20.2 | — | — |
| Open3DIS | 3D + 2D | SAM-HQ | 11.9 | 18.1 | 21.7 | ~360 | 2571× |
| Any3DIS | 3D + 2D | SAM2 | 12.9 | 19.0 | 21.9 | ~36 | 257× |
| OpenSplat3D | 3D + 2D | SAM + GS | 16.5 | 29.7 | 39.0 | — | — |
| OpenTrack3D | 3D + 2D | YOLO-World + SAM2 | 20.6 | 34.2 | 43.4 | ~320 | 2286× |
| SpaCeFormer (Ours) | 3D | Proposal-free | 22.9 | 33.7 | 41.6 | 0.14 | — |
SpaCeFormer surpasses the prior state of the art (OpenTrack3D, 20.6 mAP) using only 3D input while being over 2,000× faster.
Among open-vocabulary methods without GT 3D supervision, SpaCeFormer is Pareto-optimal on the latency–accuracy frontier; matched in accuracy only by SOLE (24.7 mAP), which requires ScanNet200 GT mask supervision and is ~9× slower.
| Method | Input | Proposals | No GT | mAP | mAP50 | mAP25 | Time (s) |
|---|---|---|---|---|---|---|---|
| OpenScene-3D | 3D | Mask3D | No | 8.2 | 10.5 | 12.6 | 4.3 |
| OpenMask3D | 3D + 2D | Mask3D | No | 13.1 | 18.4 | 24.2 | 547.3 |
| Open3DIS | 3D + 2D | ISBNet + SAM | No | 18.5 | 24.5 | 28.2 | 188.0 |
| Open-YOLO 3D | 3D + 2D | Mask3D | No | 23.7 | 28.6 | 34.8 | 16.6 |
| Details Matter | 3D + 2D | Mask3D + GSAM | No | 22.6 | 31.7 | 37.7 | 597 |
| OVIR-3D | 3D + 2D | Detic | Yes | 11.1 | 20.5 | 27.5 | 52.7 |
| PoVo | 3D + 2D | SAM | Yes | 20.8 | 28.7 | 34.4 | — |
| BoxOVIS | 3D + 2D | Box + SAM | Yes | 24.0 | 31.8 | 37.4 | 43.7 |
| OpenTrack3D | 3D + 2D | YOLO-World + SAM2 | Yes | 23.9 | 36.4 | 47.6 | ~62 |
| SOLE | 3D | Proposal-free (GT-supervised) | No | 24.7 | 31.8 | 40.3 | — |
| SpaCeFormer (Ours) | 3D | Proposal-free | Yes | 24.1 | 31.8 | 37.1 | 0.14 |
SpaCeFormer is 119× faster than Open-YOLO 3D and 3,909× faster than OpenMask3D — the only open-vocabulary method that operates at interactive rates.
SpaCeFormer's final model achieves strong class-agnostic mask quality (22.5 AP, 45.7 AP50, 64.4 AP25) in addition to open-vocabulary classification (11.1 mAP over 200 categories).
| Configuration | CA AP | CA AP50 | CA AP25 | OV mAP | OV mAP50 | OV mAP25 |
|---|---|---|---|---|---|---|
| **Space-Curve attention ablation** | | | | | | |
| Window only | 25.5 | 49.9 | 69.0 | 9.5 | 16.8 | 22.5 |
| Morton only | 23.3 | 47.3 | 67.4 | 9.5 | 16.4 | 22.3 |
| Window + Morton (Ours) | 25.2 | 50.4 | 69.1 | 11.1 | 18.8 | 24.3 |
| **Positional encoding ablation** | | | | | | |
| No PE | — | — | — | 5.97 | 11.5 | 17.3 |
| Absolute PE (APE) | — | — | — | 5.95 | 11.5 | 16.1 |
| Learnable Bias | — | — | — | 6.46 | 11.8 | 17.2 |
| 3D RoPE (Ours) | — | — | — | 7.60 | 13.7 | 18.6 |
| **Query initialization ablation** | | | | | | |
| Random | 14.6 | 32.3 | 52.8 | 5.6 | 10.2 | 14.9 |
| Farthest Point Sampling | 18.0 | 38.4 | 58.4 | 7.2 | 13.1 | 17.9 |
| Learned Queries (Ours) | 19.8 | 41.6 | 62.5 | 6.4 | 12.0 | 18.2 |

CA = class-agnostic mask quality; OV = 200-way open-vocabulary.
PE ablations trained for 25K iterations (shorter schedule); attention and query ablations use the full training configuration.
SpaCeFormer predictions on ScanNet200 validation scenes. Each color represents a distinct predicted instance with its open-vocabulary label.
@article{choy2026spaceformer,
title = {SpaCeFormer: Fast Proposal-Free Open-Vocabulary
3D Instance Segmentation},
author = {Choy, Chris and Lee, Junha and Park, Chunghyun
and Cho, Minsu and Kautz, Jan},
journal = {arXiv preprint},
year = {2026}
}