SpaCeFormer

Fast Proposal-Free Open-Vocabulary 3D Instance Segmentation

Chris Choy¹, Junha Lee², Chunghyun Park², Minsu Cho², Jan Kautz¹

¹NVIDIA    ²POSTECH

0.14 s Inference Time
21× Mask Recall vs. Mosaic3D
24.1 mAP Replica (Zero-Shot)
22.9 mAP ScanNet++
3.0M Captions (SpaCeFormer-3M)

Abstract

Open-vocabulary 3D instance segmentation is a core capability for robotics and AR/VR, but prior methods trade one bottleneck for another: multi-stage 2D+3D pipelines aggregate foundation-model outputs at hundreds of seconds per scene, while pseudo-labeled end-to-end approaches rely on fragmented masks and external region proposals. We present SpaCeFormer, a proposal-free space-curve transformer that runs at 0.14 seconds per scene, 2–3 orders of magnitude faster than multi-stage 2D+3D pipelines.

We pair it with SpaCeFormer-3M, the largest open-vocabulary 3D instance segmentation dataset (3.0M multi-view-consistent captions over 604K instances from 7.4K scenes) built through multi-view mask clustering and multi-view VLM captioning; its masks reach 21× higher recall than prior single-view pipelines (54.3% vs. 2.5% at IoU>0.5). SpaCeFormer combines spatial window attention with Morton curve serialization for spatially coherent features, and uses a RoPE-enhanced decoder to predict instance masks directly from learned queries without external proposals. On ScanNet200 we achieve 11.1 zero-shot mAP, a 2.8× improvement over the prior best proposal-free method; on ScanNet++ and Replica, we reach 22.9 and 24.1 mAP, surpassing all prior methods that avoid GT 3D supervision, including those using multi-view 2D inputs.

SpaCeFormer-3M Dataset

We construct SpaCeFormer-3M, the largest open-vocabulary 3D instance segmentation training corpus. Using training-free multi-view mask clustering and structured VLM prompting, we generate 604K instance masks with 3M diverse captions across 7,361 indoor scenes — without any human annotation.

Source Dataset  Scenes  Instance Masks  Captions   Avg Masks/Scene
ScanNet         1,201   79,320          396,600    66.0
ScanNet++       223     27,296          136,480    122.4
ARKitScenes     4,497   446,409         2,232,045  99.3
Matterport3D    1,440   51,102          255,510    35.5
Total           7,361   604,127         3,020,635  82.1

Mask Quality vs. Ground Truth (312 ScanNet Train Scenes)

We benchmark auto-generated training masks against GT instance annotations at IoU > 0.5. Compared to Mosaic3D's SAM2-based 2D-to-3D lifting, SpaCeFormer-3M achieves 21× higher recall (54.3% vs. 2.5%), confirming that multi-view aggregation produces substantially more complete, geometry-consistent instances.

Source                   Masks / Scene  Mean best-IoU  Precision @0.5  Recall @0.5  IoU > 0.50
Mosaic3D (SAM2 lifting)  16.1           0.247          4.8%            2.5%         3.8%
SpaCeFormer-3M (Ours)    65.2           0.251          33.6%           54.3%        25.9%
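
The precision and recall columns above can be illustrated with a few lines of NumPy. This is a minimal sketch, assuming masks are boolean arrays over the same point set and that a mask counts as matched when its best IoU exceeds the threshold. The exact matching protocol behind the table is not stated on this page, and `pairwise_iou` and `precision_recall` are illustrative names, not part of any released code.

```python
import numpy as np

def pairwise_iou(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """IoU between every predicted and every GT mask.

    pred: (P, N) boolean, gt: (G, N) boolean, both over the same N points.
    Returns a (P, G) float array.
    """
    pred_f = pred.astype(np.float64)
    gt_f = gt.astype(np.float64)
    inter = pred_f @ gt_f.T                                   # intersection sizes
    union = pred_f.sum(1)[:, None] + gt_f.sum(1)[None, :] - inter
    return np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)

def precision_recall(pred, gt, thresh=0.5):
    """Assumed definitions: a mask is matched if its best IoU exceeds thresh."""
    iou = pairwise_iou(pred, gt)
    precision = float((iou.max(axis=1) > thresh).mean())  # preds that hit some GT
    recall = float((iou.max(axis=0) > thresh).mean())     # GT covered by some pred
    return precision, recall

# Toy scene with 10 points, 2 GT instances, and 3 predicted masks.
gt = np.zeros((2, 10), dtype=bool)
gt[0, 0:5] = True
gt[1, 5:10] = True
pred = np.zeros((3, 10), dtype=bool)
pred[0, 0:4] = True    # IoU 0.8 with GT 0
pred[1, 5:7] = True    # IoU 0.4 with GT 1 (fragment)
pred[2, 8:10] = True   # IoU 0.4 with GT 1 (fragment)

precision, recall = precision_recall(pred, gt)
print(precision, recall)
```

In the toy scene, the second GT instance is split into two fragments that each fall below the threshold, so it counts as missed. This is the failure mode the text attributes to single-view lifting.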

Dataset Examples

Each 3D instance mask is paired with 5 diverse captions generated from multiple viewpoints, capturing shape, material, texture, and spatial context.

Scene 0270 — Hotel Room

Entertainment Center
  1. The wooden entertainment center features a warm, honey-toned finish and provides storage and display space for electronics.
  2. A medium-sized, rectangular unit with a dark brown hue sits adjacent to a desk, offering a dedicated space for a television.
  3. This cabinet-like structure offers ample storage with its multiple drawers, suitable for organizing media components and accessories.
  4. Crafted from wood, the built-in unit's design integrates seamlessly into the room's decor, providing a sturdy base for a television.
  5. The brown wooden cabinet, with its integrated shelf, stands near a desk, offering a functional and aesthetically pleasing media storage solution.
Metal Rack
  1. Dark metal rack with a tiered design, likely used for drying dishes or storing small items.
  2. The black, angular rack stands near a light-colored cabinet, its open structure creating visual contrast.
  3. A medium-sized, sturdy rack provides a surface for air-drying, positioned beside a wooden cabinet.
  4. Constructed from dark metal bars, the rack's design allows for ventilation and easy access to items.
  5. The angled rack, with its rectangular openings, sits adjacent to a cabinet, offering a practical storage solution.
Wooden Table
  1. The wooden table features a warm, brown finish and a smooth, polished surface, ideal for supporting lamps and other decor.
  2. A medium-sized table with a scalloped edge sits adjacent to a cabinet, its shape complementing the room's traditional style.
  3. This small table offers a surface for placing items, providing a convenient spot for a phone or remote control.
  4. Crafted from wood, the table's design incorporates a single drawer and decorative cutouts, blending functionality with aesthetic appeal.
  5. Positioned near a chair, the brown wooden table provides a stable base for a lamp, creating a cozy reading nook.
Bed Frame
  1. The bedframe is constructed from wood, providing structural support for the mattress and bedding.
  2. A dark brown bed frame with a rectangular headboard sits against a wall, adjacent to a bedside table.
  3. This medium-sized bed frame offers a comfortable place to rest, featuring a flat surface for pillows.
  4. The wooden headboard's design complements the hotel room's decor, creating a cohesive aesthetic.
  5. The bed frame's upper edge supports a white pillow, positioned near a lamp and artwork.

Scene 0601 — Living Room

Whiteboard
  1. The red metal whiteboard serves as a surface for writing and displaying information, positioned against a dark red wall.
  2. A rectangular whiteboard with a red frame stands upright, offering a large, blank space for notes and diagrams.
  3. Medium-sized whiteboard, suitable for brainstorming or presentations, provides a writable surface near laundry machines.
  4. The whiteboard's flat surface and sturdy frame allow for easy cleaning and repeated use in a communal space.
  5. Red-painted metal whiteboard, mounted on the wall, offers a functional space for communication and collaboration.
Side Table
  1. The small side table features a smooth, wooden tabletop and a dark gray metal stem, providing a surface for drinks or books.
  2. A round table with a warm brown wood top sits adjacent to a patterned armchair, creating a cozy seating area.
  3. This medium-sized table offers a convenient surface for placing items, easily accessible for resting a cup or phone.
  4. The table's simple, cylindrical design complements the surrounding furniture, blending seamlessly into the waiting room decor.
  5. Positioned next to a chair, the table's wooden surface and dark metal leg provide a functional and stylish accent.
Upholstered Chair
  1. Upholstered chair with a textured, patterned fabric in shades of red, orange, and brown offers a comfortable seating option.
  2. The chair's rounded shape and reddish-brown hue complement the surrounding decor, positioned near a small table.
  3. A medium-sized chair provides a place to sit and rest, its sturdy frame suggesting durability and frequent use.
  4. The chair's design features a cutout back and a patterned upholstery, blending seamlessly into the waiting area's aesthetic.
  5. Located near a side table, the chair's fabric texture and reddish-brown color create a welcoming and functional seating arrangement.
Washing Machine
  1. The large, white washing machine features a stainless steel drum and digital control panel, designed for laundry tasks.
  2. A white appliance with a rounded door sits adjacent to other machines, displaying a modern, streamlined shape.
  3. This medium-sized washing machine offers a convenient space for loading clothes, facilitating household chores.
  4. The appliance's durable white plastic exterior integrates seamlessly into the laundry room's utilitarian design.
  5. Positioned between other washers, the white machine's front-loading door and digital display indicate its operational status.

Scene 0656 — Bedroom

Nightstand
  1. A small, dark wood nightstand provides storage near the bed, featuring brass-toned hardware and a traditional design.
  2. The reddish-brown bedside table sits adjacent to the bed, its rectangular shape complementing the room's layout.
  3. This medium-sized wooden cabinet offers a surface for lamps and books, easily accessible from the bed.
  4. Crafted from wood with a visible grain, the nightstand's design blends classic style with functional storage.
  5. The dark brown nightstand, positioned next to the bed, provides a convenient spot for personal items and a lamp.
Pillow
  1. The pillow's woven fabric offers a soft, comfortable surface for resting.
  2. A rectangular pillow with a muted brown and black checkered pattern rests against the bed.
  3. Medium-sized pillow, suitable for supporting the head during sleep or relaxation.
  4. The pillow's design complements the bedroom's decor, providing a cozy accent.
  5. Positioned against the headboard, the pillow offers support and adds visual texture to the bed.
Writing Desk
  1. Dark brown wooden table with ornate carved details, providing a surface for small items.
  2. Rectangular table with a glossy finish, positioned near a curtain and a chair.
  3. Medium-sized table offering a flat surface for holding stationery and decorative objects.
  4. The table's design features cabriole legs and a drawer, blending into the room's decor.
  5. A dark wood table sits adjacent to a chair, providing a functional space for writing or display.
Desk Chair
  1. The chair's dark fabric upholstery provides a comfortable seating surface, designed for supporting a person.
  2. A medium-sized chair with a curved backrest sits adjacent to a dark wooden desk, displaying a classic design.
  3. This chair offers a place to sit; its slender frame allows for easy movement around the room.
  4. The chair's metal legs and dark fabric seat create a simple, functional design, blending into the room's decor.
  5. Positioned near a desk, the chair's dark color contrasts with the lighter carpet, offering a spot for focused work.

Interactive 3D Predictions

Explore SpaCeFormer's instance segmentation predictions on real 3D scenes. Click and drag to orbit, scroll to zoom, right-click to pan. Click masks in the legend to toggle.


Method

RoPE-Enhanced Instance Segmentation Decoder. Learned queries are iteratively refined through cross-attention with 3D RoPE-encoded point features and self-attention, directly predicting instance masks, CLIP features, and foreground scores.

Attention Block Comparison

Unlike the standard attention blocks used in prior architectures, SpaCeFormer adds Space-Curve partitioning and 3D RoPE on top of sparse 3D convolutions.

Figure panels (attention blocks): ViT, CvT, Point Transformer, SpaCeFormer (Ours).

Space-Curve Attention

Combines spatial window attention (fixed geometric extent, variable tokens) with Morton curve serialization (fixed-length segments, variable spatial extent). Windows preserve local spatial neighborhoods; curves introduce structured long-range diversity. Reduces complexity from O(N²) to O(N·Lmax).
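
The two partitions can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the model's implementation: the voxel size, window size, segment length, and all function names are hypothetical, and the page does not say how the two groupings are combined inside a block. The sketch reads Lmax as the largest group size, which is also an assumption.

```python
import numpy as np

def morton_code(ijk: np.ndarray, bits: int = 10) -> np.ndarray:
    """Interleave the low `bits` bits of integer (x, y, z) voxel indices."""
    code = np.zeros(len(ijk), dtype=np.int64)
    for b in range(bits):
        for axis in range(3):
            code |= ((ijk[:, axis] >> b) & 1) << (3 * b + axis)
    return code

def curve_segments(coords, voxel=0.05, seg_len=64):
    """Fixed-length token groups along the Morton curve (variable spatial extent)."""
    ijk = np.floor((coords - coords.min(axis=0)) / voxel).astype(np.int64)
    order = np.argsort(morton_code(ijk), kind="stable")
    return [order[i:i + seg_len] for i in range(0, len(order), seg_len)]

def window_groups(coords, window=1.0):
    """Fixed-extent cubic windows (variable token count)."""
    cell = np.floor((coords - coords.min(axis=0)) / window).astype(np.int64)
    _, inverse = np.unique(cell, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # shape differs across NumPy versions
    return [np.flatnonzero(inverse == g) for g in range(inverse.max() + 1)]

def attention_pairs(groups):
    """Query-key pairs evaluated when attention is restricted to each group."""
    return sum(len(g) ** 2 for g in groups)

rng = np.random.default_rng(0)
coords = rng.uniform(0.0, 4.0, size=(2000, 3))  # toy scene, in metres
segments = curve_segments(coords)
windows = window_groups(coords)
full_pairs = len(coords) ** 2

print(full_pairs, attention_pairs(segments), attention_pairs(windows))
```

With segments capped at 64 tokens, the curve partition evaluates at most N·64 pairs instead of N², which matches the O(N·Lmax) bound stated above.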

3D Rotary Positional Embeddings

Extends RoPE to 3D with block-diagonal rotation matrices parameterized by relative displacements (Δx, Δy, Δz). Enables geometry-aware attention that naturally encodes spatial proximity — +27.6% mAP improvement over the best alternative positional encoding.
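
A minimal NumPy sketch of a 3D rotary embedding shows why the attention score ends up depending only on relative displacement. It assumes the feature channels are split evenly across the x, y, and z axes and uses a geometric frequency schedule. The `base` value, the channel layout, and the name `rope_3d` are assumptions made for illustration and may differ from the model.

```python
import numpy as np

def rope_3d(feats: np.ndarray, pos: np.ndarray, base: float = 100.0) -> np.ndarray:
    """Rotate consecutive feature pairs by angles proportional to x, y, z.

    feats: (N, D) float array with D divisible by 6.
    pos:   (N, 3) point coordinates.
    Equivalent to multiplying by a block-diagonal matrix of 2x2 rotations.
    """
    n, d = feats.shape
    assert d % 6 == 0, "need an even number of channels per axis"
    pairs_per_axis = d // 6
    freqs = base ** (-np.arange(pairs_per_axis) / pairs_per_axis)
    # One angle per 2x2 block: first third driven by x, then y, then z.
    angles = (pos[:, :, None] * freqs[None, None, :]).reshape(n, -1)  # (N, D/2)
    cos, sin = np.cos(angles), np.sin(angles)
    even, odd = feats[:, 0::2], feats[:, 1::2]
    out = np.empty((n, d), dtype=np.float64)
    out[:, 0::2] = even * cos - odd * sin
    out[:, 1::2] = even * sin + odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=(1, 12)), rng.normal(size=(1, 12))
p_q, p_k = rng.normal(size=(1, 3)), rng.normal(size=(1, 3))
shift = np.array([[0.3, -1.2, 2.0]])

score = rope_3d(q, p_q) @ rope_3d(k, p_k).T
score_shifted = rope_3d(q, p_q + shift) @ rope_3d(k, p_k + shift).T
print(np.allclose(score, score_shifted))  # True: only (Δx, Δy, Δz) matters
```

Translating both points by the same offset leaves the score unchanged, because the product of two rotations depends only on the difference of their angles.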

Proposal-Free Decoder

Learned query embeddings (Q=200) iteratively refined through cross-attention with point features and self-attention between queries. Directly predicts instance masks, CLIP features, and foreground scores — no proposal generation, no class-agnostic pre-filtering.
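
How the three decoder outputs might be turned into labeled instances is sketched below. This is a hypothetical post-processing step, not the authors' code: it assumes masks come from a dot product between per-query embeddings and point features, and that labels come from cosine similarity against text embeddings of the category prompts. The function name, argument names, and thresholds are illustrative.

```python
import numpy as np

def decode_instances(mask_embed, point_feats, clip_embed, fg_logit, text_embed,
                     fg_thresh=0.5):
    """Turn per-query outputs into instance masks with open-vocabulary labels.

    mask_embed:  (Q, D) per-query mask embeddings
    point_feats: (N, D) per-point features
    clip_embed:  (Q, C) per-query features in the text-embedding space
    fg_logit:    (Q,)   per-query foreground logits
    text_embed:  (K, C) embeddings of K category prompts
    """
    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    masks = sigmoid(mask_embed @ point_feats.T) > 0.5           # (Q, N)
    fg = sigmoid(fg_logit)                                      # (Q,)
    q = clip_embed / np.linalg.norm(clip_embed, axis=1, keepdims=True)
    t = text_embed / np.linalg.norm(text_embed, axis=1, keepdims=True)
    labels = (q @ t.T).argmax(axis=1)                           # nearest prompt
    keep = (fg > fg_thresh) & masks.any(axis=1)                 # drop background/empty
    return masks[keep], labels[keep], fg[keep]

# Toy inputs: 2 queries, 4 points, 2 category prompts.
mask_embed = np.array([[5.0, 0.0], [0.0, 5.0]])
point_feats = np.array([[1.0, -1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, 1.0]])
clip_embed = np.array([[1.0, 0.0], [0.0, 1.0]])
text_embed = np.array([[0.0, 2.0], [3.0, 0.0]])
fg_logit = np.array([2.0, -2.0])

masks, labels, scores = decode_instances(
    mask_embed, point_feats, clip_embed, fg_logit, text_embed)
print(masks.astype(int), labels, scores)
```

Only the first query survives the foreground filter in the toy example. No external proposal stage appears anywhere in the path from queries to masks.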

SpaCeFormer-3M Dataset

604K instance masks from 7,361 scenes with 3M multi-view captions. Training-free multi-view mask clustering from 2D foundation models, plus structured VLM prompting for diverse, view-consistent descriptions of shape, texture, material, and spatial context.

Benchmark Results

ScanNet200 Zero-Shot Instance Segmentation (200 classes)

Under the matched setting (proposal-free, 3D-only, no GT 3D annotations), SpaCeFormer reaches 11.1 mAP, 2.8× the next-best method (Mosaic3D + Decoder, 3.9 mAP). Higher-scoring methods rely on either GT-trained Mask3D proposals or multi-view 2D streams with YOLO/SAM proposals.

Method              Input    Proposals                 No GT  mAP   mAP50  mAP25  Time (s)
OpenMask3D          3D + 2D  Mask3D                    No     15.4  19.9   23.1   553.9
Open-YOLO 3D        3D + 2D  Mask3D                    No     24.7  31.7   36.2   21.8
Open3DIS            3D + 2D  Superpts + ISBNet + GSAM  No     23.7  29.4   32.8   33.5
Any3DIS             3D + 2D  ISBNet + SAM2             No     25.8  -      -      -
Details Matter      3D + 2D  Mask3D + GSAM             No     25.8  32.5   36.2   -
SAI3D               3D + 2D  Superpts + SAM            Yes    12.7  18.8   24.1   75.2
MaskClustering      3D + 2D  CropFormer                Yes    12.0  23.3   30.1   -
SAM2Object          3D + 2D  SAM2                      Yes    13.3  19.0   23.8   -
OpenTrack3D         3D + 2D  YOLO-World + SAM2         Yes    26.0  37.7   45.4   ~356
Mosaic3D + Decoder  3D       Proposal-free             Yes    3.9   7.0    12.3   1.2
SpaCeFormer (Ours)  3D       Proposal-free             Yes    11.1  18.8   24.3   0.14

"No GT" = method does not rely on ground-truth 3D mask annotations for proposal training. For reference, fully supervised closed-vocabulary Mask3D reaches 27.4 mAP and OneFormer3D reaches 30.6 mAP with GT labels.

ScanNet++ Zero-Shot Instance Segmentation (100 classes)

Method              Input    Proposals          mAP   mAP50  mAP25  Time (s)  Speedup
OpenMask3D          3D + 2D  Mask3D             2.0   2.7    3.4    ~554      3957×
OVIR-3D             3D + 2D  Mask3D             3.6   5.7    7.3    -         -
MaskClustering      3D + 2D  CropFormer         7.8   10.7   12.1   600       4286×
Segment3D           3D + 2D  SAM                10.1  17.7   20.2   -         -
Open3DIS            3D + 2D  SAM-HQ             11.9  18.1   21.7   ~360      2571×
Any3DIS             3D + 2D  SAM2               12.9  19.0   21.9   ~36       257×
OpenSplat3D         3D + 2D  SAM + GS           16.5  29.7   39.0   -         -
OpenTrack3D         3D + 2D  YOLO-World + SAM2  20.6  34.2   43.4   ~320      2286×
SpaCeFormer (Ours)  3D       Proposal-free      22.9  33.7   41.6   0.14      -

SpaCeFormer surpasses the prior state of the art (OpenTrack3D, 20.6 mAP) using only 3D input while being over 2,000× faster.

Replica Zero-Shot Instance Segmentation (8 scenes)

Among open-vocabulary methods without GT 3D supervision, SpaCeFormer is Pareto-optimal on the latency–accuracy frontier. Its accuracy is matched only by SOLE (24.7 mAP), which requires ScanNet200 GT mask supervision and is ~9× slower.

Method              Input    Proposals                      No GT  mAP   mAP50  mAP25  Time (s)
OpenScene-3D        3D       Mask3D                         No     8.2   10.5   12.6   4.3
OpenMask3D          3D + 2D  Mask3D                         No     13.1  18.4   24.2   547.3
Open3DIS            3D + 2D  ISBNet + SAM                   No     18.5  24.5   28.2   188.0
Open-YOLO 3D        3D + 2D  Mask3D                         No     23.7  28.6   34.8   16.6
Details Matter      3D + 2D  Mask3D + GSAM                  No     22.6  31.7   37.7   597
OVIR-3D             3D + 2D  Detic                          Yes    11.1  20.5   27.5   52.7
PoVo                3D + 2D  SAM                            Yes    20.8  28.7   34.4   -
BoxOVIS             3D + 2D  Box + SAM                      Yes    24.0  31.8   37.4   43.7
OpenTrack3D         3D + 2D  YOLO-World + SAM2              Yes    23.9  36.4   47.6   ~62
SOLE                3D       Proposal-free (GT-supervised)  No     24.7  31.8   40.3   -
SpaCeFormer (Ours)  3D       Proposal-free                  Yes    24.1  31.8   37.1   0.14

SpaCeFormer is 119× faster than Open-YOLO 3D and 3,909× faster than OpenMask3D — the only open-vocabulary method that operates at interactive rates.
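
The speedup figures follow directly from the per-scene times reported in the Replica table: each baseline time is divided by SpaCeFormer's 0.14 s. The short check below only restates that arithmetic; the variable names are illustrative.

```python
# Per-scene inference times in seconds, taken from the Replica table.
OURS_S = 0.14
replica_times = {"Open-YOLO 3D": 16.6, "OpenMask3D": 547.3}

# Speedup = baseline time / SpaCeFormer time, rounded to the nearest integer.
speedups = {method: round(t / OURS_S) for method, t in replica_times.items()}
print(speedups)  # {'Open-YOLO 3D': 119, 'OpenMask3D': 3909}
```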

Class-Agnostic vs. 200-Way Instance Segmentation

SpaCeFormer's final model achieves strong class-agnostic mask quality (22.5 AP, 45.7 AP50, 64.4 AP25) in addition to open-vocabulary classification (11.1 mAP over 200 categories).

Configuration            Class-Agnostic         200-Way Open-Vocab
                         AP     AP50   AP25     mAP    mAP50  mAP25
SPACE-CURVE ATTENTION ABLATION
Window only              25.5   49.9   69.0     9.5    16.8   22.5
Morton only              23.3   47.3   67.4     9.5    16.4   22.3
Window + Morton (Ours)   25.2   50.4   69.1     11.1   18.8   24.3
POSITIONAL ENCODING ABLATION
No PE                    -      -      -        5.97   11.5   17.3
Absolute PE (APE)        -      -      -        5.95   11.5   16.1
Learnable Bias           -      -      -        6.46   11.8   17.2
3D RoPE (Ours)           -      -      -        7.60   13.7   18.6
QUERY INITIALIZATION ABLATION
Random                   14.6   32.3   52.8     5.6    10.2   14.9
Farthest Point Sampling  18.0   38.4   58.4     7.2    13.1   17.9
Learned Queries (Ours)   19.8   41.6   62.5     6.4    12.0   18.2

PE ablations trained for 25K iterations (shorter schedule); attention and query ablations use the full training configuration.

Qualitative Results

SpaCeFormer predictions on ScanNet200 validation scenes. Each color represents a distinct predicted instance with its open-vocabulary label.

Scene 0011 (ScanNet scene0011_00)
Scene 0019 (ScanNet scene0019_00)
Scene 0025 (ScanNet scene0025_00)
Scene 0435 (ScanNet scene0435_03)
Scene 0462 (ScanNet scene0462_00)
Scene 0474 (ScanNet scene0474_00)

Citation

@article{choy2026spaceformer,
  title     = {SpaCeFormer: Fast Proposal-Free Open-Vocabulary
               3D Instance Segmentation},
  author    = {Choy, Chris and Lee, Junha and Park, Chunghyun
               and Cho, Minsu and Kautz, Jan},
  journal   = {arXiv preprint},
  year      = {2026}
}