We tackle open-vocabulary 3D scene segmentation by introducing a novel data generation pipeline and training framework. Our work targets three essential requirements for an effective dataset: precise 3D region segmentation, comprehensive textual descriptions, and sufficient dataset scale. By leveraging state-of-the-art open-vocabulary image segmentation models and region-aware vision-language models (VLMs), we develop an automatic pipeline that produces high-quality 3D mask-text pairs. Applying this pipeline to multiple 3D scene datasets, we create Mosaic3D-5.6M, a dataset of more than 30K annotated scenes with 5.6M mask-text pairs, significantly larger than existing datasets. Building on this data, we propose Mosaic3D, a 3D visual foundation model (3D-VFM) that combines a 3D encoder trained with contrastive learning and a lightweight mask decoder for open-vocabulary 3D semantic and instance segmentation. Our approach achieves state-of-the-art results on open-vocabulary 3D semantic and instance segmentation benchmarks, including ScanNet200, Matterport3D, and ScanNet++, with ablation studies validating the effectiveness of our large-scale training data.
Generating 3D mask-text pair datasets can be costly and requires meticulous attention. Recent work has leveraged 2D visual foundation models (VFMs) to partially automate data annotation; these methods use multi-view images to generate captions or features on different types of region proposals (e.g., bounding boxes, segmentation masks, or sliding windows). However, existing approaches either suffer from imprecise boundary delineation due to their reliance on coarse object detectors, or provide only simple attribute labels. To overcome these limitations, we propose a data generation pipeline that combines recent advances in open-vocabulary segmentation with robust region-aware vision-language models (VLMs), enabling both precise region boundaries and rich descriptions that capture object attributes, spatial relationships, and scene context.
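To make the pipeline concrete, the sketch below shows one way the per-frame steps could be composed, assuming hypothetical wrappers `segmenter.segment` and `captioner.describe` around the 2D open-vocabulary segmentation model and the region-aware VLM, plus a standard RGB-D unprojection helper. The names and interfaces are illustrative and not our released code.

```python
import numpy as np

def generate_mask_text_pairs(frames, segmenter, captioner):
    """Produce (3D points, caption) pairs from posed RGB-D frames.

    frames    : iterable of dicts with 'rgb', 'depth', 'intrinsics', 'pose'
    segmenter : open-vocabulary 2D segmentation model (assumed wrapper)
    captioner : region-aware vision-language model (assumed wrapper)
    """
    pairs = []
    for frame in frames:
        # 1. Open-vocabulary segmentation gives precise 2D region masks.
        masks_2d = segmenter.segment(frame["rgb"])
        for mask in masks_2d:
            # 2. The region-aware VLM describes the masked region in context:
            #    attributes, spatial relations, and surrounding scene.
            caption = captioner.describe(frame["rgb"], mask)
            # 3. Lift the 2D mask to 3D by unprojecting masked depth pixels
            #    into world coordinates with the camera intrinsics and pose.
            points_3d = unproject_mask(
                mask, frame["depth"], frame["intrinsics"], frame["pose"]
            )
            pairs.append({"points": points_3d, "caption": caption})
    return pairs

def unproject_mask(mask, depth, K, pose):
    """Back-project masked depth pixels to world-space 3D points."""
    v, u = np.nonzero(mask)
    z = depth[v, u]
    valid = z > 0
    u, v, z = u[valid], v[valid], z[valid]
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    cam_pts = np.stack([x, y, z, np.ones_like(z)], axis=1)  # homogeneous
    return (cam_pts @ pose.T)[:, :3]  # camera-to-world transform
```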
Mosaic3D-5.6M is the largest 3D mask-text paired dataset to date, covering over 30K indoor scenes and approximately 1M RGB-D frames and yielding 5.6M region captions comprising 30M total text tokens. Our dataset offers significant advantages over existing datasets in segmentation precision, description richness, and scale.
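For reference, a single mask-text pair in a dataset like this could be stored roughly as follows; the field names and values are purely illustrative and are not the actual Mosaic3D-5.6M schema.

```python
# Illustrative record layout for one mask-text pair (hypothetical fields,
# not the released Mosaic3D-5.6M format).
example_pair = {
    "scene_id": "scene0000_00",           # source 3D scene
    "point_indices": [1024, 1025, 1031],  # indices into the scene point cloud
    "caption": "a wooden chair tucked under the dining table near the window",
}
```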
We train the 3D open-vocabulary segmentation model on Mosaic3D-5.6M with a two-stage scheme: per-point language alignment, followed by mask decoder training that predicts instances from the aligned features.
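As a rough illustration of the first stage, a CLIP-style symmetric contrastive objective could align pooled point features within each captioned region to the frozen text embedding of its caption. This is a sketch under assumed tensor shapes and an assumed average-pooling choice, not the exact loss used in Mosaic3D.

```python
import torch
import torch.nn.functional as F

def per_point_language_alignment_loss(point_feats, region_masks, text_embeds,
                                      temperature=0.07):
    """Contrastive loss pulling region point features toward their caption embedding.

    point_feats  : (N, D) features from the 3D encoder for one scene
    region_masks : (M, N) boolean masks, one row per captioned region
    text_embeds  : (M, D) frozen text-encoder embeddings of the M captions
    """
    # Average-pool point features inside each region (assumed pooling choice).
    region_masks = region_masks.float()
    region_feats = region_masks @ point_feats / region_masks.sum(1, keepdim=True).clamp(min=1)

    region_feats = F.normalize(region_feats, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Symmetric InfoNCE over the M region-caption pairs.
    logits = region_feats @ text_embeds.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```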
Citation:
@article{lee2025mosaic3d,
  title={Mosaic3D: Foundation Dataset and Model for Open-Vocabulary 3D Segmentation},
  author={Junha Lee and Chunghyun Park and Jaesung Choe and Frank Wang and Jan Kautz and Minsu Cho and Chris Choy},
  journal={arXiv},
  year={2025}
}