
Mosaic3D

Foundation Dataset and Model for Open-Vocabulary 3D Segmentation

Junha Lee1,2, Chunghyun Park1,2, Jaesung Choe1, Frank Wang1, Jan Kautz1, Minsu Cho2, Chris Choy1
1NVIDIA 2POSTECH

TLDR

  • Propose a large-scale dataset of open-vocabulary 3D mask-text pairs.
  • Propose a foundation model for open-vocabulary 3D segmentation.

Dataset Visualization

Abstract

We tackle open-vocabulary 3D scene understanding by introducing a novel data generation pipeline and training framework. Our method addresses three critical requirements for effective training: precise 3D region segmentation, comprehensive textual descriptions, and sufficient dataset scale. By leveraging state-of-the-art open-vocabulary image segmentation models and region-aware Vision-Language Models (VLMs), we develop an automatic pipeline that generates high-quality 3D mask-text pairs. Applying this pipeline to multiple 3D scene datasets, we create the Mosaic3D dataset, comprising over 30K annotated scenes with 5.6M mask-text pairs, significantly larger than existing datasets. Building upon this data, we propose Mosaic3D, a foundation model that combines a 3D encoder trained with contrastive learning and a lightweight mask decoder for open-vocabulary 3D semantic and instance segmentation. Our approach achieves state-of-the-art results on open-vocabulary 3D semantic and instance segmentation benchmarks including ScanNet200, Matterport3D, and ScanNet++, and ablation studies validate the effectiveness of our large-scale training data.
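To make the training objective concrete, below is a minimal sketch of how per-mask 3D features could be aligned with caption embeddings via contrastive learning. This is an illustration only, not the released training code: the mean pooling over points within a mask, the temperature value, and the symmetric InfoNCE formulation are assumptions, and the random tensors stand in for real 3D encoder and text encoder outputs.

```python
import torch
import torch.nn.functional as F


def mask_text_contrastive_loss(point_feats, mask_indices, text_embeds, temperature=0.07):
    """InfoNCE-style loss aligning pooled per-mask 3D features with caption embeddings.

    point_feats:  (N, D) per-point features from a 3D encoder.
    mask_indices: list of M LongTensors, each holding the point indices of one 3D mask.
    text_embeds:  (M, D) embeddings of the captions paired with the M masks.
    """
    # Pool point features inside each mask to get one feature vector per mask (assumed mean pooling).
    mask_feats = torch.stack([point_feats[idx].mean(dim=0) for idx in mask_indices])  # (M, D)

    mask_feats = F.normalize(mask_feats, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Cosine-similarity logits between every mask and every caption.
    logits = mask_feats @ text_embeds.t() / temperature  # (M, M)
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: mask-to-text and text-to-mask directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    # Toy example with random features standing in for encoder and text-model outputs.
    N, M, D = 10_000, 8, 512
    point_feats = torch.randn(N, D, requires_grad=True)
    mask_indices = [torch.randint(0, N, (200,)) for _ in range(M)]
    text_embeds = torch.randn(M, D)
    loss = mask_text_contrastive_loss(point_feats, mask_indices, text_embeds)
    loss.backward()
    print(f"loss = {loss.item():.4f}")
```

At inference, the same normalized feature space would allow open-vocabulary queries by comparing point or mask features against text embeddings of arbitrary category names.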

Citation

@article{lee2025mosaic3d,
    title={Mosaic3D: Foundation Dataset and Model for Open-Vocabulary 3D Segmentation},
    author={Junha Lee and Chunghyun Park and Jaesung Choe and Frank Wang and Jan Kautz and Minsu Cho and Chris Choy},
    journal={arXiv},
    year={2025}
}