- Hongchi Xia1,2*
- Xuan Li1
- Zhaoshuo Li1
- Qianli Ma1
- Jiashu Xu1
- Ming-Yu Liu1
- Yin Cui1
- Tsung-Yi Lin1
- Wei-Chiu Ma3
- Shenlong Wang2
- Shuran Song1,4
- Fangyin Wei1
- 1NVIDIA
- 2University of Illinois at Urbana-Champaign
- 3Cornell University
- 4Stanford University
- *Work done during internship at NVIDIA
Abstract
We present SAGE, an agentic framework that, given a user-specified embodied task (e.g., “pick up a bowl and place it on the table”), understands the intent and automatically generates simulation-ready environments at scale. The agent couples multiple generators for layout and object composition with critics that evaluate semantic plausibility, visual realism, and physical stability. Through iterative reasoning and adaptive tool selection, it self-refines the scenes until they meet the user's intent and are physically valid.
The resulting environments are realistic, diverse, and directly deployable in modern simulators for policy training. Policies trained purely on this data exhibit clear scaling trends and generalize to unseen objects and layouts, demonstrating the promise of simulation-driven scaling for embodied AI. We will release both the 3D scene generation and the action generation code to foster further research.
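To make the loop concrete, below is a minimal Python sketch of the generate-critique-refine cycle described above. All names (generate_scene, Critique, refine_until_valid) and the score threshold are illustrative placeholders, not the released SAGE API.

# Illustrative sketch of the generator-critic refinement loop described above.
# All names and thresholds are hypothetical placeholders, not the actual SAGE API.
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Critique:
    name: str        # e.g. "semantic", "visual", "physical"
    score: float     # higher is better
    feedback: str    # natural-language feedback returned to the generators

def generate_scene(task: str, scene=None, feedback: Optional[List[Critique]] = None):
    """Placeholder: compose a layout and objects for the task, optionally
    conditioning on the previous scene and the critics' feedback."""
    raise NotImplementedError

def refine_until_valid(task: str,
                       critics: List[Callable[[object], Critique]],
                       threshold: float = 0.8,
                       max_iters: int = 5):
    """Regenerate the scene until every critic is satisfied or the budget runs out."""
    scene, feedback = None, None
    for _ in range(max_iters):
        scene = generate_scene(task, scene, feedback)
        feedback = [critic(scene) for critic in critics]
        if all(c.score >= threshold for c in feedback):
            break  # scene meets the user's intent and is physically valid
    return scene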
SAGE-10k Dataset
Generated Single-Room Scenes
Bedroom
Living room
Gym
Fairy-tale princess room
Rusty and dusty restroom
Office
Cyberpunk game den
Starry-night bedroom
Golden and luxury bedroom
Meeting room
Children room
Muddy and dirty dining room
Generated Multi-Room Scenes
The student apartment with one bedroom
The student apartment with two bedrooms
Multilingual teacher's apartment
Mid-century modern family home
Craft supply hoarder's bungalow
Naturalist's cabin
Image-Conditioned Generated Scenes
Each example pairs a reference image with the scene generated from it.
Physical Stability of Generated Scenes
The progress bar below each video indicates how far the physics simulation has run.
Holodeck
SceneWeaver
Ours
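As a point of reference, physical stability of a generated scene can be checked by simulating it under gravity and measuring how far objects drift from their initial poses. The sketch below illustrates this idea; the simulator interface (sim.step, obj.position) and the drift threshold are assumptions, not the exact evaluation protocol used here.

# Illustrative stability check: simulate a generated scene under gravity and
# measure how far each object drifts from its initial pose. The simulator
# interface and the 2 cm threshold are assumptions, not the paper's protocol.
import numpy as np

def is_physically_stable(sim, objects, sim_seconds=5.0, dt=1.0 / 60.0,
                         max_drift_m=0.02):
    initial = {obj: np.asarray(obj.position, dtype=float) for obj in objects}
    for _ in range(int(sim_seconds / dt)):
        sim.step(dt)
    drifts = [np.linalg.norm(np.asarray(obj.position) - initial[obj])
              for obj in objects]
    return max(drifts) <= max_drift_m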
Augmentation of Generated Scenes
Object Category-Level Augmentation
Each example shows a base scene followed by three category-level augmentations (Aug. Scene 1–3).
Scene Layout-Level Augmentation
Each example shows a base scene followed by three layout-level augmentations (Aug. Scene 1–3).
Embodied Policy Training
Pick-and-Place Action Generation
Here we show the collected robot actions together with the camera views (three perspective cameras at the left, right, and wrist). Both RGB and depth from each view are fed to the policy network. Action generation is parallelized with 8 environments per GPU in the IsaacSim simulator.
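As a rough illustration of how the three RGB-D views could be combined into a single policy input, here is a minimal sketch; the camera names, image layout, and channel ordering are assumptions rather than the exact training pipeline.

# Illustrative assembly of the policy observation from the three cameras
# (left, right, wrist). Camera names, image size, and the channel layout
# (RGB + depth -> 4 channels per view) are assumptions.
import numpy as np

CAMERAS = ("left", "right", "wrist")

def build_observation(frames: dict) -> np.ndarray:
    """frames[name] = {"rgb": (H, W, 3) uint8, "depth": (H, W) float32 meters}.
    Returns a (num_views * 4, H, W) float32 tensor for the policy network."""
    views = []
    for name in CAMERAS:
        rgb = frames[name]["rgb"].astype(np.float32) / 255.0   # (H, W, 3)
        depth = frames[name]["depth"][..., None]                # (H, W, 1)
        views.append(np.concatenate([rgb, depth], axis=-1))     # (H, W, 4)
    obs = np.concatenate(views, axis=-1)       # (H, W, 12)
    return np.transpose(obs, (2, 0, 1))        # channels-first for the network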
Each example shows action generation under two scene configurations (Configuration 1 and Configuration 2).
More Examples of Pick-and-Place Action Generation
Additional examples, each shown under two configurations.
Policy Inference: Successful Examples
Policy Inference: Failure Cases
Mobile Manipulation Action Generation
Here we show the collected robot actions together with the camera views (two fisheye cameras, front and back, for navigation, and three perspective cameras, left, right, and wrist, for pick-and-place). Both RGB and depth from each view are fed to the policy network. Action generation is parallelized with 2 environments per GPU in the IsaacSim simulator.
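The sketch below summarizes this setup as a plain configuration dictionary; the keys and structure are descriptive placeholders, not an IsaacSim or SAGE configuration schema.

# Illustrative configuration capturing the camera roster and per-GPU
# parallelism described above. Keys and values are descriptive placeholders.
MOBILE_MANIPULATION_CFG = {
    "num_envs_per_gpu": 2,             # heavier scenes than pick-and-place (8 per GPU)
    "cameras": {
        "navigation": [                # fisheye views used while the base moves
            {"name": "front", "type": "fisheye"},
            {"name": "back",  "type": "fisheye"},
        ],
        "manipulation": [              # perspective views used for pick-and-place
            {"name": "left",  "type": "perspective"},
            {"name": "right", "type": "perspective"},
            {"name": "wrist", "type": "perspective"},
        ],
    },
    "observation": ["rgb", "depth"],   # both modalities are fed to the policy
}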
More Examples of Mobile Manipulation Action Generation
Policy Inference: Successful Examples
Policy Inference: Failure Cases
Articulated Objects in SAGE Generated Scenes
Each articulated object is shown in its closed and open states.
Citation
If you find our work useful in your research, please consider citing:
@article{xia2026sage,
title={SAGE: Scalable Agentic 3D Scene Generation for Embodied AI},
author={Xia, Hongchi and Li, Xuan and Li, Zhaoshuo and Ma, Qianli and Xu, Jiashu and Liu, Ming-Yu and Cui, Yin and Lin, Tsung-Yi and Ma, Wei-Chiu and Wang, Shenlong and Song, Shuran and Wei, Fangyin},
journal={arXiv preprint arXiv:TBD},
year={2026}
}