NVLabs Research · Vol. 26 · No. 05 Folio 001 — Frontispiece

SpaCeFormer.

A proposal-free space-curve transformer for fast, open-vocabulary 3D instance segmentation — 0.14 seconds per scene, no proposals, no GT 3D supervision.

Authors of Record

Chris Choy1, Junha Lee2, Chunghyun Park2, Minsu Cho2, Jan Kautz1

1NVIDIA  ·  2POSTECH
Accepted to
ICML 2026
i. latency · 0.14 s inference per scene
ii. recall · 21× mask recall vs. Mosaic3D
iii. replica · 24.1 mAP, zero-shot Replica
iv. scannet++ · 22.9 mAP, ScanNet++
v. corpus · 3M captions in SpaCeFormer-3M
§ I  ·  Abstract

Open-vocabulary 3D instance segmentation is a core capability for robotics and AR/VR, but prior methods trade one bottleneck for another: multi-stage 2D+3D pipelines aggregate foundation-model outputs at hundreds of seconds per scene, while pseudo-labeled end-to-end approaches rely on fragmented masks and external region proposals. We present SpaCeFormer (Space-Curve Transformer), a proposal-free space-curve transformer that runs at 0.14 seconds per scene, 2–3 orders of magnitude faster than multi-stage 2D+3D pipelines.

We pair it with SpaCeFormer-3M, the largest open-vocabulary 3D instance segmentation dataset (3.0M multi-view-consistent captions over 604K instances from 7.4K scenes) built through multi-view mask clustering and multi-view VLM captioning; it reaches 21× higher mask recall than prior single-view pipelines (54.3% vs. 2.5% at IoU>0.5). SpaCeFormer combines spatial window attention with Morton curve serialization for spatially coherent features, and uses a RoPE-enhanced decoder to predict instance masks directly from learned queries without external proposals. On ScanNet200 we achieve 11.1 zero-shot mAP, a 2.8× improvement over the prior best proposal-free method; on ScanNet++ and Replica, we reach 22.9 and 24.1 mAP, surpassing all prior methods including those using multi-view 2D inputs.

§ II  ·  The SpaCeFormer-3M Corpus

An atlas of 7,361 scenes,
604K masks, 3M captions.

We construct SpaCeFormer-3M, the largest open-vocabulary 3D instance segmentation training corpus. Using training-free multi-view mask clustering and structured VLM prompting, we generate 604K instance masks with 3M diverse captions across 7,361 indoor scenes — without any human annotation.

Source Dataset | Scenes | Instance Masks | Captions  | Avg Masks / Scene
ScanNet        | 1,201  | 79,320         | 396,600   | 66.0
ScanNet++      | 223    | 27,296         | 136,480   | 122.4
ARKitScenes    | 4,497  | 446,409        | 2,232,045 | 99.3
Matterport3D   | 1,440  | 51,102         | 255,510   | 35.5
Total          | 7,361  | 604,127        | 3,020,635 | 82.1
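The multi-view mask clustering step is described only at a high level here. As an illustrative sketch (not the paper's exact algorithm), one can lift each 2D mask to a set of 3D point indices and merge masks whose lifted footprints overlap, using union-find; `cluster_view_masks` and `iou` are hypothetical names for this note:

```python
# Hypothetical sketch of training-free multi-view mask clustering:
# each 2D foundation-model mask is lifted to a set of 3D point indices,
# and masks from different views that overlap strongly in 3D are merged
# into a single instance. Illustrative only, not the paper's method.

def iou(a: set, b: set) -> float:
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter) if inter else 0.0

def cluster_view_masks(lifted_masks, thresh=0.5):
    """lifted_masks: list of sets of 3D point indices, one per 2D mask."""
    parent = list(range(len(lifted_masks)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Union-find merge of masks whose lifted footprints overlap in 3D.
    for i in range(len(lifted_masks)):
        for j in range(i + 1, len(lifted_masks)):
            if iou(lifted_masks[i], lifted_masks[j]) >= thresh:
                parent[find(i)] = find(j)

    # Each cluster's union of points becomes one 3D instance mask.
    clusters = {}
    for i in range(len(lifted_masks)):
        clusters.setdefault(find(i), set()).update(lifted_masks[i])
    return list(clusters.values())
```

Aggregating across many views is what makes the resulting instances more complete than single-view lifting, as the recall comparison below shows.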

Mask quality vs. ground truth · 312 ScanNet train scenes

Auto-generated training masks benchmarked against GT instance annotations at IoU > 0.5. Compared to Mosaic3D's SAM2-based 2D-to-3D lifting, SpaCeFormer-3M achieves 21× higher recall (54.3% vs. 2.5%) — multi-view aggregation produces substantially more complete, geometry-consistent instances.

Source                  | Masks/Scene | Mean best-IoU | Precision @0.5 | Recall @0.5 | IoU > 0.50
Mosaic3D (SAM2 lifting) | 16.1        | 0.247         | 4.8%           | 2.5%        | 3.8%
SpaCeFormer-3M (Ours)   | 65.2        | 0.251         | 33.6%          | 54.3%       | 25.9%
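The precision/recall numbers above follow a standard best-IoU matching protocol; a minimal sketch, assuming masks are represented as sets of point indices (`mask_quality` is a hypothetical helper, not the paper's evaluation code):

```python
# Sketch of the mask-quality protocol: for each GT mask take the
# best-IoU predicted mask; recall@0.5 is the fraction of GT masks whose
# best IoU exceeds 0.5, and precision@0.5 the analogous fraction over
# predictions. Masks are sets of point indices.

def iou(a: set, b: set) -> float:
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter) if inter else 0.0

def mask_quality(pred_masks, gt_masks, thr=0.5):
    best_per_gt = [max(iou(g, p) for p in pred_masks) for g in gt_masks]
    best_per_pred = [max(iou(p, g) for g in gt_masks) for p in pred_masks]
    return {
        "mean_best_iou": sum(best_per_gt) / len(best_per_gt),
        "recall": sum(x > thr for x in best_per_gt) / len(gt_masks),
        "precision": sum(x > thr for x in best_per_pred) / len(pred_masks),
    }
```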

Plates · Instance galleries from the corpus

Each 3D instance mask is paired with 5 diverse captions generated from multiple viewpoints, capturing shape, material, texture, and spatial context.

Scene 0270 — Hotel Room

Entertainment Center
  1. The wooden entertainment center features a warm, honey-toned finish and provides storage and display space for electronics.
  2. A medium-sized, rectangular unit with a dark brown hue sits adjacent to a desk, offering a dedicated space for a television.
  3. This cabinet-like structure offers ample storage with its multiple drawers, suitable for organizing media components and accessories.
  4. Crafted from wood, the built-in unit's design integrates seamlessly into the room's decor, providing a sturdy base for a television.
  5. The brown wooden cabinet, with its integrated shelf, stands near a desk, offering a functional and aesthetically pleasing media storage solution.
Metal Rack
  1. Dark metal rack with a tiered design, likely used for drying dishes or storing small items.
  2. The black, angular rack stands near a light-colored cabinet, its open structure creating visual contrast.
  3. A medium-sized, sturdy rack provides a surface for air-drying, positioned beside a wooden cabinet.
  4. Constructed from dark metal bars, the rack's design allows for ventilation and easy access to items.
  5. The angled rack, with its rectangular openings, sits adjacent to a cabinet, offering a practical storage solution.
Wooden Table
  1. The wooden table features a warm, brown finish and a smooth, polished surface, ideal for supporting lamps and other decor.
  2. A medium-sized table with a scalloped edge sits adjacent to a cabinet, its shape complementing the room's traditional style.
  3. This small table offers a surface for placing items, providing a convenient spot for a phone or remote control.
  4. Crafted from wood, the table's design incorporates a single drawer and decorative cutouts, blending functionality with aesthetic appeal.
  5. Positioned near a chair, the brown wooden table provides a stable base for a lamp, creating a cozy reading nook.
Bed Frame
  1. The bedframe is constructed from wood, providing structural support for the mattress and bedding.
  2. A dark brown bed frame with a rectangular headboard sits against a wall, adjacent to a bedside table.
  3. This medium-sized bed frame offers a comfortable place to rest, featuring a flat surface for pillows.
  4. The wooden headboard's design complements the hotel room's decor, creating a cohesive aesthetic.
  5. The bed frame's upper edge supports a white pillow, positioned near a lamp and artwork.

Scene 0601 — Living Room

Whiteboard
  1. The red metal whiteboard serves as a surface for writing and displaying information, positioned against a dark red wall.
  2. A rectangular whiteboard with a red frame stands upright, offering a large, blank space for notes and diagrams.
  3. Medium-sized whiteboard, suitable for brainstorming or presentations, provides a writable surface near laundry machines.
  4. The whiteboard's flat surface and sturdy frame allow for easy cleaning and repeated use in a communal space.
  5. Red-painted metal whiteboard, mounted on the wall, offers a functional space for communication and collaboration.
Side Table
  1. The small side table features a smooth, wooden tabletop and a dark gray metal stem, providing a surface for drinks or books.
  2. A round table with a warm brown wood top sits adjacent to a patterned armchair, creating a cozy seating area.
  3. This medium-sized table offers a convenient surface for placing items, easily accessible for resting a cup or phone.
  4. The table's simple, cylindrical design complements the surrounding furniture, blending seamlessly into the waiting room decor.
  5. Positioned next to a chair, the table's wooden surface and dark metal leg provide a functional and stylish accent.
Upholstered Chair
  1. Upholstered chair with a textured, patterned fabric in shades of red, orange, and brown offers a comfortable seating option.
  2. The chair's rounded shape and reddish-brown hue complement the surrounding decor, positioned near a small table.
  3. A medium-sized chair provides a place to sit and rest, its sturdy frame suggesting durability and frequent use.
  4. The chair's design features a cutout back and a patterned upholstery, blending seamlessly into the waiting area's aesthetic.
  5. Located near a side table, the chair's fabric texture and reddish-brown color create a welcoming and functional seating arrangement.
Washing Machine
  1. The large, white washing machine features a stainless steel drum and digital control panel, designed for laundry tasks.
  2. A white appliance with a rounded door sits adjacent to other machines, displaying a modern, streamlined shape.
  3. This medium-sized washing machine offers a convenient space for loading clothes, facilitating household chores.
  4. The appliance's durable white plastic exterior integrates seamlessly into the laundry room's utilitarian design.
  5. Positioned between other washers, the white machine's front-loading door and digital display indicate its operational status.

Scene 0656 — Bedroom

Nightstand
  1. A small, dark wood nightstand provides storage near the bed, featuring brass-toned hardware and a traditional design.
  2. The reddish-brown bedside table sits adjacent to the bed, its rectangular shape complementing the room's layout.
  3. This medium-sized wooden cabinet offers a surface for lamps and books, easily accessible from the bed.
  4. Crafted from wood with a visible grain, the nightstand's design blends classic style with functional storage.
  5. The dark brown nightstand, positioned next to the bed, provides a convenient spot for personal items and a lamp.
Pillow
  1. The pillow's woven fabric offers a soft, comfortable surface for resting.
  2. A rectangular pillow with a muted brown and black checkered pattern rests against the bed.
  3. Medium-sized pillow, suitable for supporting the head during sleep or relaxation.
  4. The pillow's design complements the bedroom's decor, providing a cozy accent.
  5. Positioned against the headboard, the pillow offers support and adds visual texture to the bed.
Writing Desk
  1. Dark brown wooden table with ornate carved details, providing a surface for small items.
  2. Rectangular table with a glossy finish, positioned near a curtain and a chair.
  3. Medium-sized table offering a flat surface for holding stationery and decorative objects.
  4. The table's design features cabriole legs and a drawer, blending into the room's decor.
  5. A dark wood table sits adjacent to a chair, providing a functional space for writing or display.
Desk Chair
  1. The chair's dark fabric upholstery provides a comfortable seating surface, designed for supporting a person.
  2. A medium-sized chair with a curved backrest sits adjacent to a dark wooden desk, displaying a classic design.
  3. This chair offers a place to sit; its slender frame allows for easy movement around the room.
  4. The chair's metal legs and dark fabric seat create a simple, functional design, blending into the room's decor.
  5. Positioned near a desk, the chair's dark color contrasts with the lighter carpet, offering a spot for focused work.
§ III  ·  Live Plate

Interactive 3D predictions.

Explore SpaCeFormer's instance segmentation predictions on real 3D scenes. Click and drag to orbit, scroll to zoom, right-click to pan. Click masks in the legend to toggle.

§ IV  ·  Method · Architecture

Window + curve attention,
3D RoPE, learned queries.

Why space-filling curves alone break.

Most 3D transformers serialize point clouds along Morton (Z-order) curves — space-filling curves that map 3D coordinates to 1D indices while preserving locality on average. The advantage: every attention block sees a fixed number of tokens, so compute is predictable.

“Two points adjacent in 3D can land far apart in the serialized order, breaking spatial coherence within an attention window.” — SpaCeFormer · §4.1

For dense prediction tasks like instance segmentation, where you need sharp boundaries, this fragmentation is a real problem. SpaCeFormer interleaves curve attention with spatial window attention — fixed geometric extent (H×W×D voxels) with variable token count — to keep geometrically adjacent points in the same attention group. Shifted partitions across layers (à la Swin) restore connectivity beyond window boundaries.

−28.6% mean pairwise distance vs. Morton alone
O(N·L) complexity, fixed L=1024 per patch
Fig. 02 Morton curve serialization (left) provides structured diversity but can split nearby points across windows. Spatial windows (right) keep geometrically adjacent points in the same attention group — preserving local boundaries critical for instance segmentation.
Fig. 01 RoPE-Enhanced Instance Segmentation Decoder. Learned queries are iteratively refined through cross-attention with 3D RoPE-encoded point features and self-attention, directly predicting instance masks, CLIP features, and foreground scores.

Attention block lineage.

SpaCeFormer is a hierarchical sparse-voxel U-Net interleaving Space attention (3D windows, fixed metric extent) at shallow high-resolution stages with Curve attention (Morton/Hilbert serialized patches, length 1024) at deeper low-resolution stages. Both flavors share fused QKV + 3D RoPE CUDA kernels on top of sparse convolution shortcuts.

ViT · CvT · Point Transformer · SpaCeFormer (Ours)
01  Window

Space Attention

Groups voxels into 3D windows of fixed metric extent (window_size), guaranteeing spatial proximity with variable per-window populations. Achieves ~28.6% smaller mean pairwise distance between attending voxels than curve-based grouping — deployed at shallow, high-resolution levels where local geometry dominates.
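The fixed-metric-extent grouping amounts to bucketing voxel coordinates by integer window index. A minimal sketch under that assumption (`window_partition` is a hypothetical name; the paper's implementation is a fused CUDA kernel):

```python
# Spatial window grouping with fixed metric extent: each voxel is
# assigned to the window containing it, so every group is spatially
# compact but its population varies. An optional shift (as in Swin)
# changes the partition between layers to restore cross-window mixing.
from collections import defaultdict

def window_partition(coords, window_size, shift=(0, 0, 0)):
    """coords: list of integer (x, y, z) voxel coordinates."""
    groups = defaultdict(list)
    for idx, (x, y, z) in enumerate(coords):
        key = ((x + shift[0]) // window_size,
               (y + shift[1]) // window_size,
               (z + shift[2]) // window_size)
        groups[key].append(idx)  # variable population per window
    return dict(groups)
```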

02  Morton/Hilbert

Curve Attention

Serializes voxels along space-filling curves and partitions into fixed-length patches (patch_size=1024). Fixed compute per patch enables efficient long-range mixing; deployed at deeper, low-resolution stages where receptive field matters more than locality. Reduces complexity from O(N²) to O(N·L).
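The complexity claim follows directly from the patching scheme: each patch of L tokens costs O(L²) attention, and there are N/L patches, giving O(N·L) overall. A sketch of the partitioning step (hypothetical helper, assuming precomputed Morton keys):

```python
# Curve attention's fixed-length patching: sort voxels by their Morton
# (or Hilbert) key, then split the sequence into patches of exactly
# patch_size tokens. With L = patch_size, total attention cost over
# N voxels is (N / L) * O(L^2) = O(N * L) instead of O(N^2).

def curve_patches(curve_keys, patch_size=1024):
    order = sorted(range(len(curve_keys)), key=lambda i: curve_keys[i])
    return [order[i:i + patch_size] for i in range(0, len(order), patch_size)]
```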

03  3D RoPE

Voxel Rotary PE

VoxelRotaryPositionalEmbeddings extends RoPE to 3D with block-diagonal rotations parameterized by relative displacements (Δx, Δy, Δz), fused into the QKV CUDA kernel. Window-aware base (~4·L) auto-selected via suggest_voxel_rope_base. +27.6% mAP over best alternative PE.
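Fused kernel aside, the mathematical core of 3D RoPE is easy to state: split the head dimension into three blocks and rotate each block by angles derived from one coordinate axis, so attention scores depend only on relative displacement. A minimal numpy sketch (`rope_3d` and its layout are illustrative, not the kernel's actual implementation; assumes the dimension is divisible by 6):

```python
# Minimal 3D rotary embedding: block-diagonal rotations, one block per
# axis. After rotating q at position p1 and k at position p2, their dot
# product depends only on the displacement (dx, dy, dz) = p1 - p2.
import numpy as np

def rope_3d(vec, coord, base=10000.0):
    d = len(vec) // 3                      # per-axis block size (even)
    out = np.empty_like(vec, dtype=float)
    for a in range(3):                     # x, y, z blocks
        block = vec[a * d:(a + 1) * d].astype(float)
        freqs = base ** (-np.arange(d // 2) / (d // 2))
        ang = coord[a] * freqs             # rotation angles for this axis
        cos, sin = np.cos(ang), np.sin(ang)
        x1, x2 = block[0::2], block[1::2]  # paired channels
        out[a * d:(a + 1) * d:2] = x1 * cos - x2 * sin
        out[a * d + 1:(a + 1) * d:2] = x1 * sin + x2 * cos
    return out
```

The relative-position property is what the sketch is meant to exhibit: translating both query and key positions by the same offset leaves their inner product unchanged.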

04  Decoder

Proposal-Free Head

Learned query embeddings (Q=200) iteratively refined through cross-attention with point features and self-attention between queries. Directly predicts instance masks, CLIP features, and foreground scores — no proposal generation, no class-agnostic pre-filtering.

05  Corpus

SpaCeFormer-3M

604K instance masks from 7,361 scenes with 3M multi-view captions. Training-free multi-view mask clustering from 2D foundation models, plus structured VLM prompting for diverse, view-consistent descriptions of shape, texture, material, and spatial context.

06  Library

WarpConvNet

Released as warpconvnet.models.space_former. Configurable per-level attention via enc_attn_types string codes (e.g. "ssccc"), three block layouts (pre_norm / post_norm / stream_norm), pluggable MaskFormer head wrapped in PointToVoxel.
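One plausible reading of the per-level code string, sketched as a tiny standalone parser (this is not WarpConvNet's actual parser; only the `enc_attn_types` name and the `"ssccc"` example come from the release notes above):

```python
# Hypothetical interpretation of enc_attn_types codes: one character
# per U-Net encoder level, 's' = space (window) attention at shallow
# levels, 'c' = curve (serialized) attention at deeper levels.

CODES = {"s": "space", "c": "curve"}

def parse_attn_types(spec: str):
    return [CODES[ch] for ch in spec]
```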

§ V  ·  Benchmarks

Three datasets,
one architecture, no proposals.

ScanNet200 · zero-shot, 200 classes

Under the matched setting (proposal-free, 3D-only, no GT 3D annotations), SpaCeFormer improves mAP 2.8× over the next-best method. Higher-scoring methods rely either on GT-trained Mask3D proposals or on multi-view 2D streams with YOLO/SAM proposals.

Method             | Input   | Proposals                | No GT | mAP  | mAP50 | mAP25 | Time (s)
OpenMask3D         | 3D + 2D | Mask3D                   | No    | 15.4 | 19.9  | 23.1  | 553.9
Open-YOLO 3D       | 3D + 2D | Mask3D                   | No    | 24.7 | 31.7  | 36.2  | 21.8
Open3DIS           | 3D + 2D | Superpts + ISBNet + GSAM | No    | 23.7 | 29.4  | 32.8  | 33.5
Any3DIS            | 3D + 2D | ISBNet + SAM2            | No    | 25.8 | –     | –     | –
Details Matter     | 3D + 2D | Mask3D + GSAM            | No    | 25.8 | 32.5  | 36.2  | –
SAI3D              | 3D + 2D | Superpts + SAM           | Yes   | 12.7 | 18.8  | 24.1  | 75.2
MaskClustering     | 3D + 2D | CropFormer               | Yes   | 12.0 | 23.3  | 30.1  | –
SAM2Object         | 3D + 2D | SAM2                     | Yes   | 13.3 | 19.0  | 23.8  | –
OpenTrack3D        | 3D + 2D | YOLO-World + SAM2        | Yes   | 26.0 | 37.7  | 45.4  | ~356
Mosaic3D + Decoder | 3D      | Proposal-free            | Yes   | 3.9  | 7.0   | 12.3  | 1.2
SpaCeFormer (Ours) | 3D      | Proposal-free            | Yes   | 11.1 | 18.8  | 24.3  | 0.14

"No GT" = method does not rely on ground-truth 3D mask annotations for proposal training. For reference, fully supervised closed-vocabulary Mask3D reaches 27.4 mAP and OneFormer3D reaches 30.6 mAP with GT labels.

ScanNet++ · zero-shot, 100 classes

Method             | Input   | Proposals         | mAP  | mAP50 | mAP25 | Time (s) | Speedup
OpenMask3D         | 3D + 2D | Mask3D            | 2.0  | 2.7   | 3.4   | ~554     | 3957×
OVIR-3D            | 3D + 2D | Mask3D            | 3.6  | 5.7   | 7.3   | –        | –
MaskClustering     | 3D + 2D | CropFormer        | 7.8  | 10.7  | 12.1  | 600      | 4286×
Segment3D          | 3D + 2D | SAM               | 10.1 | 17.7  | 20.2  | –        | –
Open3DIS           | 3D + 2D | SAM-HQ            | 11.9 | 18.1  | 21.7  | ~360     | 2571×
Any3DIS            | 3D + 2D | SAM2              | 12.9 | 19.0  | 21.9  | ~36      | 257×
OpenSplat3D        | 3D + 2D | SAM + GS          | 16.5 | 29.7  | 39.0  | –        | –
OpenTrack3D        | 3D + 2D | YOLO-World + SAM2 | 20.6 | 34.2  | 43.4  | ~320     | 2286×
SpaCeFormer (Ours) | 3D      | Proposal-free     | 22.9 | 33.7  | 41.6  | 0.14     | –

SpaCeFormer surpasses the prior state of the art (OpenTrack3D, 20.6 mAP) using only 3D input while being over 2,000× faster.

Replica · zero-shot, 8 scenes

Among open-vocabulary methods without GT 3D supervision, SpaCeFormer is Pareto-optimal on the latency–accuracy frontier; matched in accuracy only by SOLE (24.7 mAP), which requires ScanNet200 GT mask supervision and is ~9× slower.

Method             | Input   | Proposals                     | No GT | mAP  | mAP50 | mAP25 | Time (s)
OpenScene-3D       | 3D      | Mask3D                        | No    | 8.2  | 10.5  | 12.6  | 4.3
OpenMask3D         | 3D + 2D | Mask3D                        | No    | 13.1 | 18.4  | 24.2  | 547.3
Open3DIS           | 3D + 2D | ISBNet + SAM                  | No    | 18.5 | 24.5  | 28.2  | 188.0
Open-YOLO 3D       | 3D + 2D | Mask3D                        | No    | 23.7 | 28.6  | 34.8  | 16.6
Details Matter     | 3D + 2D | Mask3D + GSAM                 | No    | 22.6 | 31.7  | 37.7  | 597
OVIR-3D            | 3D + 2D | Detic                         | Yes   | 11.1 | 20.5  | 27.5  | 52.7
PoVo               | 3D + 2D | SAM                           | Yes   | 20.8 | 28.7  | 34.4  | –
BoxOVIS            | 3D + 2D | Box + SAM                     | Yes   | 24.0 | 31.8  | 37.4  | 43.7
OpenTrack3D        | 3D + 2D | YOLO-World + SAM2             | Yes   | 23.9 | 36.4  | 47.6  | ~62
SOLE               | 3D      | Proposal-free (GT-supervised) | No    | 24.7 | 31.8  | 40.3  | –
SpaCeFormer (Ours) | 3D      | Proposal-free                 | Yes   | 24.1 | 31.8  | 37.1  | 0.14

SpaCeFormer is 119× faster than Open-YOLO 3D and 3,909× faster than OpenMask3D — the only open-vocabulary method that operates at interactive rates.

Ablations · class-agnostic vs. 200-way

SpaCeFormer's final model achieves strong class-agnostic mask quality (22.5 AP, 45.7 AP50, 64.4 AP25) in addition to open-vocabulary classification (11.1 mAP over 200 categories).

Configuration           |    Class-Agnostic    |  200-Way Open-Vocab
                        | AP   | AP50 | AP25   | mAP  | mAP50 | mAP25
— Space-Curve Attention Ablation —
Window only             | 25.5 | 49.9 | 69.0   | 9.5  | 16.8  | 22.5
Morton only             | 23.3 | 47.3 | 67.4   | 9.5  | 16.4  | 22.3
Window + Morton (Ours)  | 25.2 | 50.4 | 69.1   | 11.1 | 18.8  | 24.3
— Positional Encoding Ablation —
No PE                   | –    | –    | –      | 5.97 | 11.5  | 17.3
Absolute PE (APE)       | –    | –    | –      | 5.95 | 11.5  | 16.1
Learnable Bias          | –    | –    | –      | 6.46 | 11.8  | 17.2
3D RoPE (Ours)          | –    | –    | –      | 7.60 | 13.7  | 18.6
— Query Initialization Ablation —
Random                  | 14.6 | 32.3 | 52.8   | 5.6  | 10.2  | 14.9
Farthest Point Sampling | 18.0 | 38.4 | 58.4   | 7.2  | 13.1  | 17.9
Learned Queries (Ours)  | 19.8 | 41.6 | 62.5   | 6.4  | 12.0  | 18.2

PE ablations trained for 25K iterations (shorter schedule); attention and query ablations use the full training configuration.

§ VI  ·  Plates · Predictions

Qualitative results.

SpaCeFormer predictions on ScanNet200 validation scenes. Each color represents a distinct predicted instance with its open-vocabulary label.

Scene 0011 (scene0011_00)
Scene 0019 (scene0019_00)
Scene 0025 (scene0025_00)
Scene 0435 (scene0435_03)
Scene 0462 (scene0462_00)
Scene 0474 (scene0474_00)
§ VII  ·  Citation

If this work informs yours.

@inproceedings{choy2026spaceformer,
  title     = {SpaCeFormer: Fast Proposal-Free Open-Vocabulary
               3D Instance Segmentation},
  author    = {Choy, Chris and Lee, Junha and Park, Chunghyun
               and Cho, Minsu and Kautz, Jan},
  booktitle = {Proceedings of the International Conference on
               Machine Learning (ICML)},
  year      = {2026},
  eprint    = {2604.20395},
  archivePrefix = {arXiv}
}