Memorize What Matters: Emergent scene decomposition from multitraverse

1 NYU
2 NVIDIA Research
3 USC
4 Stanford University
NeurIPS 2024 Spotlight

Core Contributions


• We propose a camera-only 3D environment mapping framework for self-driving scenes based on 3D Gaussian Splatting.

• We propose a self-supervised 2D ephemerality segmentation method that can be used as an autolabeling toolkit for dynamic scenes.

• We build the Mapverse benchmark to evaluate multitraverse 2D segmentation, 3D reconstruction, and neural rendering.

Without supervision, 3DGM converts multitraverse RGB videos into a 3DGS representation of the environment (EnvGS) and 2D ephemeral object masks (EmerSeg).

Abstract


Humans naturally retain memories of permanent elements, while ephemeral moments often slip through the cracks of memory. This selective retention is crucial for robotic perception, localization, and mapping. To endow robots with this capability, we introduce 3D Gaussian Mapping (3DGM), a self-supervised, camera-only offline mapping framework grounded in 3D Gaussian Splatting. 3DGM converts multitraverse RGB videos from the same region into a Gaussian-based environmental map while concurrently performing 2D ephemeral object segmentation. Our key observation is that the environment remains consistent across traversals, while objects frequently change. This allows us to exploit self-supervision from repeated traversals to achieve environment-object decomposition. More specifically, 3DGM formulates multitraverse environmental mapping as a robust differentiable rendering problem, treating pixels of the environment and objects as inliers and outliers, respectively. Using robust feature distillation, feature residuals mining, and robust optimization, 3DGM jointly performs 3D mapping and 2D segmentation without human intervention. We build the Mapverse benchmark, sourced from the Ithaca365 and nuPlan datasets, to evaluate our method in unsupervised 2D segmentation, 3D reconstruction, and neural rendering. Extensive results verify the effectiveness and potential of 3DGM for self-driving and robotics.
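The inlier/outlier framing can be illustrated with a small sketch. This is not the 3DGM loss itself: it assumes a plain L1 photometric error and a binary inlier mask, and the function name `masked_l1_loss` and the toy pixel values are hypothetical. It only shows why excluding outlier pixels keeps ephemeral objects from influencing the environment map.

```python
from typing import List, Sequence

Pixel = Sequence[float]  # one RGB pixel with channels in [0, 1]


def masked_l1_loss(
    rendered: List[Pixel],
    observed: List[Pixel],
    inlier_mask: List[bool],
) -> float:
    """Mean absolute error over inlier (environment) pixels only.

    Outlier (ephemeral object) pixels are skipped, so they cannot pull the
    environment map toward content that is absent in other traversals.
    """
    if not (len(rendered) == len(observed) == len(inlier_mask)):
        raise ValueError("all inputs must have the same number of pixels")

    total = 0.0
    count = 0
    for r_px, o_px, is_inlier in zip(rendered, observed, inlier_mask):
        if not is_inlier:
            continue
        total += sum(abs(r - o) for r, o in zip(r_px, o_px))
        count += len(r_px)
    # With no inliers there is nothing to supervise, so the loss is zero.
    return total / count if count else 0.0


# Toy data (made up): the second observed pixel belongs to a passing car.
rendered = [(0.5, 0.5, 0.5), (0.2, 0.2, 0.2)]
observed = [(0.5, 0.5, 0.5), (0.9, 0.1, 0.1)]

loss_all = masked_l1_loss(rendered, observed, [True, True])   # car counted
loss_env = masked_l1_loss(rendered, observed, [True, False])  # car ignored
```

With the car pixel masked out, the remaining error is zero, so an optimizer would see no reason to alter the environment to explain the car.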

Key Takeaways


  • Multitraverse consensus-dissensus can be exploited as a self-supervision signal to decompose the environment and objects.
  • Repeated traversals of the same location provide additional camera observations that improve 3D reconstruction without LiDAR.
  • Robust features produced by vision foundation models are crucial for identifying 2D consensus across traversals.
Method


    Given RGB camera observations collected at different times, we use COLMAP to obtain the camera poses and initial Gaussian points. We then use splatting-based rasterization to render both RGB images and robust features from the environmental Gaussians. Next, we extract object masks from the feature residuals by mining their spatial information. Finally, we use the resulting ephemerality masks to fine-tune the 3D Gaussians.
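The residual-to-mask step can be sketched as follows. The text above does not specify the residual metric or how spatial information is mined, so this example makes three labeled assumptions: the residual is one minus cosine similarity, the mask comes from a fixed threshold, and spatial information is used through a 3x3 majority vote. The names `cosine_residual` and `ephemerality_mask` are hypothetical.

```python
import math
from typing import List, Sequence

Feature = Sequence[float]
FeatureMap = List[List[Feature]]  # indexed as [row][col]


def cosine_residual(a: Feature, b: Feature) -> float:
    """Return 1 - cosine similarity; 0 means the two features agree."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm > 0.0 else 1.0


def ephemerality_mask(
    rendered: FeatureMap,
    extracted: FeatureMap,
    threshold: float = 0.5,  # assumed value, not taken from the paper
) -> List[List[bool]]:
    """Mark pixels whose rendered and extracted features disagree."""
    h, w = len(rendered), len(rendered[0])
    raw = [
        [cosine_residual(rendered[i][j], extracted[i][j]) > threshold
         for j in range(w)]
        for i in range(h)
    ]
    # Assumed spatial step: a 3x3 majority vote that drops isolated
    # high-residual pixels and keeps spatially coherent regions.
    smoothed = []
    for i in range(h):
        row = []
        for j in range(w):
            votes = [
                raw[y][x]
                for y in range(max(0, i - 1), min(h, i + 2))
                for x in range(max(0, j - 1), min(w, j + 2))
            ]
            row.append(sum(votes) * 2 > len(votes))
        smoothed.append(row)
    return smoothed


ENV, OBJ = (1.0, 0.0), (0.0, 1.0)  # toy 2-D features (made up)
rendered_feats = [[ENV] * 3 for _ in range(3)]  # the map contains only environment

# An object covering the two left columns survives the vote.
object_view = [[OBJ, OBJ, ENV] for _ in range(3)]
object_mask = ephemerality_mask(rendered_feats, object_view)

# A single noisy pixel in the centre is voted away.
noisy_view = [[ENV, ENV, ENV], [ENV, OBJ, ENV], [ENV, ENV, ENV]]
noise_mask = ephemerality_mask(rendered_feats, noisy_view)
```

In a real pipeline the rendered features would come from the Gaussian rasterizer and the extracted features from a vision foundation model; here both are tiny hand-written grids.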

    Mapverse Dataset


    We build the Mapverse benchmark from the Ithaca365 (Carlos A. Diaz-Ruiz et al., CVPR 2022) and nuPlan (Napat Karnchanachari et al., ICRA 2024) datasets. It features 40 locations, each with at least 10 traversals, totaling 467 driving video clips and 35,304 images. Ithaca365 emphasizes its multitraverse nature in the original paper, whereas nuPlan does not explicitly mention this feature. Together, the two datasets capture diverse scenes, allowing us to verify our method across various driving scenarios. Both datasets use the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License (CC BY-NC-SA 4.0).

    Mapverse-Ithaca365

    Visualizations of sample data in Mapverse-Ithaca365. Each row shows image observations of the same location captured during different traversals; five traversals are shown for brevity. Ithaca365 was recorded repeatedly along a 15 km route under diverse scene, weather, time-of-day, and traffic conditions. The dataset includes images and point clouds from four cameras and LiDAR sensors, along with high-precision GPS/INS to establish correspondence across routes. A distinctive feature of this dataset is that the same locations are observed under different weather and time conditions. Please check the official page of Ithaca365 for more details.

    Mapverse-nuPlan

    Visualizations of sample data in Mapverse-nuPlan. Each row represents image observations of the same location captured during different traversals, with five traversals shown for brevity. The nuPlan dataset is a comprehensive dataset designed to advance research and development in autonomous vehicle planning. Developed by Motional, it is considered the world's first and largest benchmark for AV planning. The dataset includes approximately 1,500 hours of driving data collected from four cities: Boston, Pittsburgh, Las Vegas, and Singapore. The authors provide 10% of the raw sensor data (120 hours). We find that the nuPlan dataset collected in Las Vegas has a number of repeated traversals of the same location. Hence, we extract the multitraverse driving data (from mid-May to late July 2021) by querying the GPS coordinates. Please check the official page of nuPlan for more details.
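Querying by GPS coordinates can be sketched as a radius search around a chosen location, grouped by traversal. This is only an illustration under assumptions: the `Frame` record, the traversal identifiers, the 30 m radius, and the coordinates are all hypothetical, and the actual nuPlan log format and the extraction procedure used for Mapverse are not shown here.

```python
import math
from typing import Dict, Iterable, List, NamedTuple


class Frame(NamedTuple):
    """Hypothetical per-image record: which drive it came from and where."""
    traversal_id: str
    lat: float  # degrees
    lon: float  # degrees


EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def traversals_near(
    frames: Iterable[Frame],
    lat: float,
    lon: float,
    radius_m: float = 30.0,  # assumed search radius
) -> Dict[str, List[Frame]]:
    """Group the frames within radius_m of (lat, lon) by traversal."""
    hits: Dict[str, List[Frame]] = {}
    for frame in frames:
        if haversine_m(frame.lat, frame.lon, lat, lon) <= radius_m:
            hits.setdefault(frame.traversal_id, []).append(frame)
    return hits


# Made-up frames around an illustrative query point.
frames = [
    Frame("log_a", 36.11470, -115.17280),  # at the query point
    Frame("log_b", 36.11475, -115.17280),  # a few metres away
    Frame("log_c", 36.12470, -115.17280),  # roughly a kilometre away
]
hits = traversals_near(frames, 36.11470, -115.17280)
nearby_traversals = sorted(hits)
```

A location would then qualify as multitraverse when the number of distinct traversal identifiers returned for it reaches the desired minimum.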

    Emerged Segmentation


    Left: original RGB; right: 2D ephemerality segmentation. Shown for Locations 600 and 2450 of Mapverse-Ithaca365 and Locations 24, 28, and 30 of Mapverse-nuPlan.

    Depth Map


    Left: original RGB; right: depth map (environment only). Shown for Locations 600 and 2450 of Mapverse-Ithaca365 and Locations 24, 28, and 30 of Mapverse-nuPlan.

    Neural Environment Rendering


    Left: original RGB; right: rendered RGB (environment only). Shown for Locations 600 and 2450 of Mapverse-Ithaca365 and Locations 24, 28, and 30 of Mapverse-nuPlan.

    Citation


    
          @article{li3dgm2024,
            title={Memorize What Matters: Emergent scene decomposition from multitraverse},
            author={Li, Yiming and Wang, Zehong and Wang, Yue and Yu, Zhiding and Gojcic, Zan 
              and Pavone, Marco and Feng, Chen and Alvarez, Jose M},
            journal={arXiv preprint},
            year={2024}
          }
      

    Acknowledgment


    We express our deep gratitude to Jiawei Yang and Sanja Fidler for their valuable feedback throughout the project. We also thank Yurong You and Carlos A. Diaz-Ruiz for their support with the Ithaca365 dataset, and Shijie Zhou for his help with high-dimensional feature rendering in 3DGS. Yiming Li gratefully acknowledges support from the NVIDIA Graduate Fellowship Program. This website design is borrowed from XCube.