Fast Proposal-Free Open-Vocabulary 3D Instance Segmentation
¹NVIDIA ²POSTECH
Open-vocabulary 3D instance segmentation is a core capability for robotics and AR/VR, but prior methods trade one bottleneck for another: multi-stage 2D+3D pipelines aggregate foundation-model outputs at hundreds of seconds per scene, while pseudo-labeled end-to-end approaches rely on fragmented masks and external region proposals. We present SpaCeFormer (Space-Curve Transformer), a proposal-free space-curve transformer that runs at 0.14 seconds per scene, 2–3 orders of magnitude faster than multi-stage 2D+3D pipelines.
We pair it with SpaCeFormer-3M, the largest open-vocabulary 3D instance segmentation dataset (3.0M multi-view-consistent captions over 604K instances from 7.4K scenes) built through multi-view mask clustering and multi-view VLM captioning; it reaches 21× higher mask recall than prior single-view pipelines (54.3% vs. 2.5% at IoU>0.5). SpaCeFormer combines spatial window attention with Morton curve serialization for spatially coherent features, and uses a RoPE-enhanced decoder to predict instance masks directly from learned queries without external proposals. On ScanNet200 we achieve 11.1 zero-shot mAP, a 2.8× improvement over the prior best proposal-free method; on ScanNet++ and Replica, we reach 22.9 and 24.1 mAP, surpassing all prior methods including those using multi-view 2D inputs.
We construct SpaCeFormer-3M, the largest open-vocabulary 3D instance segmentation training corpus. Using training-free multi-view mask clustering and structured VLM prompting, we generate 604K instance masks with 3M diverse captions across 7,361 indoor scenes — without any human annotation.
| Source Dataset | Scenes | Instance Masks | Captions | Avg Masks/Scene |
|---|---|---|---|---|
| ScanNet | 1,201 | 79,320 | 396,600 | 66.0 |
| ScanNet++ | 223 | 27,296 | 136,480 | 122.4 |
| ARKitScenes | 4,497 | 446,409 | 2,232,045 | 99.3 |
| Matterport3D | 1,440 | 51,102 | 255,510 | 35.5 |
| Total | 7,361 | 604,127 | 3,020,635 | 82.1 |
We benchmark auto-generated training masks against GT instance annotations at IoU > 0.5. Compared to Mosaic3D's SAM2-based 2D-to-3D lifting, SpaCeFormer-3M achieves 21× higher recall (54.3% vs. 2.5%), confirming that multi-view aggregation produces substantially more complete, geometry-consistent instances.
| Source | Masks / Scene | Mean best-IoU | Precision @0.5 | Recall @0.5 | IoU > 0.50 |
|---|---|---|---|---|---|
| Mosaic3D (SAM2 lifting) | 16.1 | 0.247 | 4.8% | 2.5% | 3.8% |
| SpaCeFormer-3M (Ours) | 65.2 | 0.251 | 33.6% | 54.3% | 25.9% |
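The precision and recall figures above come from matching auto-generated masks to GT instances at an IoU threshold. A minimal sketch of that evaluation, using greedy one-to-one matching over boolean point masks (function names and the toy masks are illustrative, not the released evaluation code, which may match in a different order):

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU between two boolean per-point masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def precision_recall_at_iou(preds, gts, thr=0.5):
    """Greedily match each predicted mask to an unused GT mask at IoU >= thr."""
    matched_gt, tp = set(), 0
    for p in preds:
        ious = [(mask_iou(p, g), j) for j, g in enumerate(gts) if j not in matched_gt]
        if ious:
            best, j = max(ious)
            if best >= thr:
                matched_gt.add(j)
                tp += 1
    return tp / max(len(preds), 1), tp / max(len(gts), 1)

# Toy scene with 6 points, 2 GT instances, 2 predicted masks.
gts = [np.array([1, 1, 1, 0, 0, 0], bool), np.array([0, 0, 0, 1, 1, 1], bool)]
preds = [np.array([1, 1, 0, 0, 0, 0], bool), np.array([0, 0, 0, 0, 1, 1], bool)]
prec, rec = precision_recall_at_iou(preds, gts)   # both predictions match at IoU 2/3
```

Recall is computed over GT instances, precision over predictions, which is why a pipeline that produces few, fragmented masks (low masks/scene) can have both low at the same time.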
Each 3D instance mask is paired with 5 diverse captions generated from multiple viewpoints, capturing shape, material, texture, and spatial context.
Interactive demo: explore SpaCeFormer's instance segmentation predictions on real 3D scenes.
SpaCeFormer introduces Space-Curve partitioning and 3D RoPE on top of sparse 3D convolutions, compared to the standard attention blocks used in prior architectures (ViT, CvT, Point Transformer).
Combines spatial window attention (fixed geometric extent, variable tokens) with Morton curve serialization (fixed-length segments, variable spatial extent). Windows preserve local spatial neighborhoods; curves introduce structured long-range diversity. Reduces complexity from O(N²) to O(N·Lmax).
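The Morton-curve half of this scheme can be sketched in a few lines: quantize coordinates, interleave their bits into a single sort key, and cut the sorted order into fixed-length segments. A minimal illustration (function names, bit depth, and segment length are assumptions for exposition, not the paper's implementation):

```python
import numpy as np

def morton_code(coords: np.ndarray, bits: int = 10) -> np.ndarray:
    """Interleave the bits of integer (x, y, z) into a single Morton key."""
    codes = np.zeros(len(coords), dtype=np.int64)
    for b in range(bits):
        for axis in range(3):
            codes |= ((coords[:, axis] >> b) & 1) << (3 * b + axis)
    return codes

def curve_segments(points: np.ndarray, seg_len: int = 4, bits: int = 10):
    """Serialize points along the Morton curve, then cut fixed-length segments."""
    lo, hi = points.min(0), points.max(0)
    q = ((points - lo) / (hi - lo + 1e-9) * (2**bits - 1)).astype(np.int64)
    order = np.argsort(morton_code(q, bits))
    return [order[i:i + seg_len] for i in range(0, len(order), seg_len)]

pts = np.random.rand(16, 3)
segs = curve_segments(pts, seg_len=4)
# Each segment holds a fixed token count; its spatial extent varies — the
# complement of spatial windows (fixed extent, variable token count).
```

Attention restricted to such segments costs O(N·Lmax) rather than O(N²), since each of the N tokens attends only within its bounded-length group.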
Extends RoPE to 3D with block-diagonal rotation matrices parameterized by relative displacements (Δx, Δy, Δz). Enables geometry-aware attention that naturally encodes spatial proximity — +27.6% mAP improvement over the best alternative positional encoding.
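The property that makes RoPE attractive carries over to 3D: splitting the channel dimension into three axis groups and rotating each 2-channel pair by an angle proportional to that axis coordinate yields a block-diagonal rotation whose query–key dot product depends only on the displacement (Δx, Δy, Δz). A toy sketch under assumed frequencies and channel layout (not the model's exact parameterization):

```python
import numpy as np

def rope_3d(x: np.ndarray, pos: np.ndarray, base: float = 100.0) -> np.ndarray:
    """Rotate channel pairs by angles proportional to the 3D position.

    Channels split into three axis groups; within each group, pair i is
    rotated by pos[axis] * base**(-i/half) — a block-diagonal rotation.
    """
    d = x.shape[-1]
    assert d % 6 == 0, "need dim divisible by 6 (3 axes x 2 channels per pair)"
    per_axis = d // 3
    out = np.empty_like(x)
    for axis in range(3):
        seg = x[..., axis * per_axis:(axis + 1) * per_axis]
        half = per_axis // 2
        freqs = base ** (-np.arange(half) / half)     # per-pair frequencies
        ang = pos[..., axis:axis + 1] * freqs
        c, s = np.cos(ang), np.sin(ang)
        x1, x2 = seg[..., :half], seg[..., half:]
        out[..., axis * per_axis:(axis + 1) * per_axis] = np.concatenate(
            [x1 * c - x2 * s, x1 * s + x2 * c], axis=-1)
    return out

# Relative-position property: the rotated dot product is invariant to a
# common translation of both points, i.e. it sees only (Δx, Δy, Δz).
q, k = np.random.randn(12), np.random.randn(12)
p1, p2 = np.array([1.0, 2.0, 3.0]), np.array([1.5, 2.5, 2.0])
shift = np.array([10.0, -4.0, 7.0])
a = rope_3d(q, p1) @ rope_3d(k, p2)
b = rope_3d(q, p1 + shift) @ rope_3d(k, p2 + shift)   # a ≈ b
```

This is what lets attention logits encode spatial proximity without any learned positional table.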
Learned query embeddings (Q=200) iteratively refined through cross-attention with point features and self-attention between queries. Directly predicts instance masks, CLIP features, and foreground scores — no proposal generation, no class-agnostic pre-filtering.
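The query-based decoding loop can be illustrated with a small numpy sketch: queries cross-attend to point features, are residually updated, and predict a per-point mask logit for each instance by dot product. All dimensions, the single-head attention, and the foreground heuristic below are placeholders; the actual decoder also self-attends between queries and predicts per-instance CLIP features:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, Q = 500, 32, 8                  # points, feature dim, queries (paper: Q=200)

feats = rng.normal(size=(N, D))       # per-point features from the 3D backbone
queries = rng.normal(size=(Q, D))     # learned query embeddings

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def decoder_round(queries, feats):
    """One refinement round: cross-attend queries to points, then score masks."""
    attn = softmax(queries @ feats.T / np.sqrt(D))    # (Q, N) cross-attention
    queries = queries + attn @ feats                  # residual query update
    mask_logits = queries @ feats.T                   # (Q, N) per-point logits
    fg_scores = mask_logits.max(axis=1)               # toy foreground score
    return queries, mask_logits, fg_scores

for _ in range(3):                                    # iterative refinement
    queries, mask_logits, fg = decoder_round(queries, feats)

masks = mask_logits > 0                               # binary instance masks
```

The key point is that masks fall directly out of query–feature similarity, so no external proposal generator or class-agnostic pre-filter ever enters the pipeline.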
604K instance masks from 7,361 scenes with 3M multi-view captions. Training-free multi-view mask clustering from 2D foundation models, plus structured VLM prompting for diverse, view-consistent descriptions of shape, texture, material, and spatial context.
Under the matched setting (proposal-free, 3D-only, no GT 3D annotations), SpaCeFormer achieves 2.8× higher mAP than the next-best method. Higher-scoring methods rely on either GT-trained Mask3D proposals or multi-view 2D streams with YOLO/SAM proposals.
| Method | Input | Proposals | No GT | mAP | mAP50 | mAP25 | Time (s) |
|---|---|---|---|---|---|---|---|
| OpenMask3D | 3D + 2D | Mask3D | No | 15.4 | 19.9 | 23.1 | 553.9 |
| Open-YOLO 3D | 3D + 2D | Mask3D | No | 24.7 | 31.7 | 36.2 | 21.8 |
| Open3DIS | 3D + 2D | Superpts + ISBNet + GSAM | No | 23.7 | 29.4 | 32.8 | 33.5 |
| Any3DIS | 3D + 2D | ISBNet + SAM2 | No | 25.8 | — | — | — |
| Details Matter | 3D + 2D | Mask3D + GSAM | No | 25.8 | 32.5 | 36.2 | — |
| SAI3D | 3D + 2D | Superpts + SAM | Yes | 12.7 | 18.8 | 24.1 | 75.2 |
| MaskClustering | 3D + 2D | CropFormer | Yes | 12.0 | 23.3 | 30.1 | — |
| SAM2Object | 3D + 2D | SAM2 | Yes | 13.3 | 19.0 | 23.8 | — |
| OpenTrack3D | 3D + 2D | YOLO-World + SAM2 | Yes | 26.0 | 37.7 | 45.4 | ~356 |
| Mosaic3D + Decoder | 3D | Proposal-free | Yes | 3.9 | 7.0 | 12.3 | 1.2 |
| SpaCeFormer (Ours) | 3D | Proposal-free | Yes | 11.1 | 18.8 | 24.3 | 0.14 |
"No GT" = method does not rely on ground-truth 3D mask annotations for proposal training. For reference, fully supervised closed-vocabulary Mask3D reaches 27.4 mAP and OneFormer3D reaches 30.6 mAP with GT labels.
| Method | Input | Proposals | mAP | mAP50 | mAP25 | Time (s) | Speedup |
|---|---|---|---|---|---|---|---|
| OpenMask3D | 3D + 2D | Mask3D | 2.0 | 2.7 | 3.4 | ~554 | 3957× |
| OVIR-3D | 3D + 2D | Mask3D | 3.6 | 5.7 | 7.3 | — | — |
| MaskClustering | 3D + 2D | CropFormer | 7.8 | 10.7 | 12.1 | 600 | 4286× |
| Segment3D | 3D + 2D | SAM | 10.1 | 17.7 | 20.2 | — | — |
| Open3DIS | 3D + 2D | SAM-HQ | 11.9 | 18.1 | 21.7 | ~360 | 2571× |
| Any3DIS | 3D + 2D | SAM2 | 12.9 | 19.0 | 21.9 | ~36 | 257× |
| OpenSplat3D | 3D + 2D | SAM + GS | 16.5 | 29.7 | 39.0 | — | — |
| OpenTrack3D | 3D + 2D | YOLO-World + SAM2 | 20.6 | 34.2 | 43.4 | ~320 | 2286× |
| SpaCeFormer (Ours) | 3D | Proposal-free | 22.9 | 33.7 | 41.6 | 0.14 | — |
SpaCeFormer surpasses the prior state of the art (OpenTrack3D, 20.6 mAP) using only 3D input while being over 2,000× faster.
Among open-vocabulary methods without GT 3D supervision, SpaCeFormer is Pareto-optimal on the latency–accuracy frontier; matched in accuracy only by SOLE (24.7 mAP), which requires ScanNet200 GT mask supervision and is ~9× slower.
| Method | Input | Proposals | No GT | mAP | mAP50 | mAP25 | Time (s) |
|---|---|---|---|---|---|---|---|
| OpenScene-3D | 3D | Mask3D | No | 8.2 | 10.5 | 12.6 | 4.3 |
| OpenMask3D | 3D + 2D | Mask3D | No | 13.1 | 18.4 | 24.2 | 547.3 |
| Open3DIS | 3D + 2D | ISBNet + SAM | No | 18.5 | 24.5 | 28.2 | 188.0 |
| Open-YOLO 3D | 3D + 2D | Mask3D | No | 23.7 | 28.6 | 34.8 | 16.6 |
| Details Matter | 3D + 2D | Mask3D + GSAM | No | 22.6 | 31.7 | 37.7 | 597 |
| OVIR-3D | 3D + 2D | Detic | Yes | 11.1 | 20.5 | 27.5 | 52.7 |
| PoVo | 3D + 2D | SAM | Yes | 20.8 | 28.7 | 34.4 | — |
| BoxOVIS | 3D + 2D | Box + SAM | Yes | 24.0 | 31.8 | 37.4 | 43.7 |
| OpenTrack3D | 3D + 2D | YOLO-World + SAM2 | Yes | 23.9 | 36.4 | 47.6 | ~62 |
| SOLE | 3D | Proposal-free (GT-supervised) | No | 24.7 | 31.8 | 40.3 | — |
| SpaCeFormer (Ours) | 3D | Proposal-free | Yes | 24.1 | 31.8 | 37.1 | 0.14 |
SpaCeFormer is 119× faster than Open-YOLO 3D and 3,909× faster than OpenMask3D — the only open-vocabulary method that operates at interactive rates.
SpaCeFormer's final model achieves strong class-agnostic mask quality (22.5 AP, 45.7 AP50, 64.4 AP25) in addition to open-vocabulary classification (11.1 mAP over 200 categories).
| Configuration | CA AP | CA AP50 | CA AP25 | OV mAP | OV mAP50 | OV mAP25 |
|---|---|---|---|---|---|---|
| **Space-Curve attention ablation** | | | | | | |
| Window only | 25.5 | 49.9 | 69.0 | 9.5 | 16.8 | 22.5 |
| Morton only | 23.3 | 47.3 | 67.4 | 9.5 | 16.4 | 22.3 |
| Window + Morton (Ours) | 25.2 | 50.4 | 69.1 | 11.1 | 18.8 | 24.3 |
| **Positional encoding ablation** | | | | | | |
| No PE | — | — | — | 5.97 | 11.5 | 17.3 |
| Absolute PE (APE) | — | — | — | 5.95 | 11.5 | 16.1 |
| Learnable Bias | — | — | — | 6.46 | 11.8 | 17.2 |
| 3D RoPE (Ours) | — | — | — | 7.60 | 13.7 | 18.6 |
| **Query initialization ablation** | | | | | | |
| Random | 14.6 | 32.3 | 52.8 | 5.6 | 10.2 | 14.9 |
| Farthest Point Sampling | 18.0 | 38.4 | 58.4 | 7.2 | 13.1 | 17.9 |
| Learned Queries (Ours) | 19.8 | 41.6 | 62.5 | 6.4 | 12.0 | 18.2 |

CA = class-agnostic mask quality; OV = 200-way open-vocabulary.
PE ablations trained for 25K iterations (shorter schedule); attention and query ablations use the full training configuration.
SpaCeFormer predictions on ScanNet200 validation scenes. Each color represents a distinct predicted instance with its open-vocabulary label.
@article{choy2026spaceformer,
title = {SpaCeFormer: Fast Proposal-Free Open-Vocabulary
3D Instance Segmentation},
author = {Choy, Chris and Lee, Junha and Park, Chunghyun
and Cho, Minsu and Kautz, Jan},
journal = {arXiv preprint},
year = {2026}
}