• 1 NVIDIA
  • 2 University of Illinois at Urbana-Champaign
  • 3 Cornell University
  • 4 Stanford University
  • * Work done during internship at NVIDIA
arXiv Preprint

Abstract

Real-world data collection for embodied agents remains costly and unsafe, calling for scalable, realistic, and simulator-ready 3D environments. However, existing scene-generation systems often rely on rule-based or task-specific pipelines, yielding artifacts and physically invalid scenes.

We present SAGE, an agentic framework that, given a user-specified embodied task (e.g., “pick up a bowl and place it on the table”), understands the intent and automatically generates simulation-ready environments at scale. The agent couples multiple generators for layout and object composition with critics that evaluate semantic plausibility, visual realism, and physical stability. Through iterative reasoning and adaptive tool selection, it self-refines the scenes until they meet the user's intent and are physically valid.

The resulting environments are realistic, diverse, and directly deployable in modern simulators for policy training. Policies trained purely on this data exhibit clear scaling trends and generalize to unseen objects and layouts, demonstrating the promise of simulation-driven scaling for embodied AI. We will release both 3D scene and action generation code to foster further research.

SAGE-10k Dataset

SAGE-10k is a large-scale interactive indoor scene dataset featuring realistic layouts, generated by the agent-driven pipeline introduced in "SAGE: Scalable Agentic 3D Scene Generation for Embodied AI". The dataset contains 10,000 diverse scenes spanning 50 room types and styles, along with 565K uniquely generated 3D objects. Download the dataset through this link.

Generated Single-Room Scenes

Our Agentic Scene Generation Framework generates realistic, diverse, and semantically coherent scenes spanning various styles and functionalities, from Bedroom and Office spaces to creative themes like “Cyberpunk game den” and “Starry-night bedroom”. We include two example run logs showing how the MCP agents call tools to generate scenes here and here (behavior may differ from the code-release version).
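
For intuition, below is a minimal, hypothetical sketch of the generate-critique-refine loop the agent runs: a generator tool proposes a scene, critics flag issues, and the loop repeats until the critics are satisfied. Function names and signatures here are illustrative placeholders, not the released MCP tool interfaces.

```python
# Minimal sketch (not the released implementation) of the generate-critique-refine loop.
# `generate_layout` and `run_critics` are hypothetical placeholders.

def generate_layout(task: str, feedback: list[str]) -> dict:
    """Hypothetical generator MCP tool: propose a layout and object list for the task."""
    return {"task": task, "objects": ["table", "bowl"], "feedback_applied": list(feedback)}

def run_critics(scene: dict) -> list[str]:
    """Hypothetical critics: return issues on semantics, realism, and stability."""
    issues = []
    if "bowl" not in scene["objects"]:
        issues.append("task object 'bowl' missing")
    return issues

def generate_scene(task: str, max_rounds: int = 5) -> dict:
    feedback: list[str] = []
    scene: dict = {}
    for _ in range(max_rounds):
        scene = generate_layout(task, feedback)   # generator tool call
        feedback = run_critics(scene)             # critic evaluation
        if not feedback:                          # all critics satisfied: stop refining
            break
    return scene

if __name__ == "__main__":
    print(generate_scene("pick up a bowl and place it on the table"))
```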

Bedroom

Living room

Gym

Fairy-tale princess room

Rusty and dusty restroom

Office

Cyberpunk game den

Starry-night bedroom

Golden and luxury bedroom

Meeting room

Children room

Muddy and dirty dining room

Generated Multi-Room Scenes

SAGE extends easily to multi-room scenes at scale: the agent first generates a floor plan and then calls the generator MCP tools for multiple rooms in parallel.
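
Below is an illustrative sketch of how such per-room fan-out could look; `plan_floor` and `generate_room` are hypothetical placeholders for the floor-plan step and the generator MCP tool, not the actual interfaces.

```python
# Sketch of parallel room generation after the floor plan fixes walls and doors.
from concurrent.futures import ThreadPoolExecutor

def plan_floor(prompt: str) -> list[dict]:
    """Hypothetical: return one spec per room (type, bounds, doors)."""
    return [{"type": "bedroom", "id": 0}, {"type": "living room", "id": 1}]

def generate_room(spec: dict) -> dict:
    """Hypothetical generator MCP tool: populate a single room from its spec."""
    return {"room_id": spec["id"], "objects": []}

def generate_apartment(prompt: str) -> list[dict]:
    rooms = plan_floor(prompt)
    # Each room is independent once the floor plan is fixed,
    # so the per-room generator calls can run concurrently.
    with ThreadPoolExecutor(max_workers=len(rooms)) as pool:
        return list(pool.map(generate_room, rooms))

if __name__ == "__main__":
    print(generate_apartment("The student apartment with one bedroom"))
```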

The student apartment with one bedroom

The student apartment with two bedrooms

Multilingual teacher's apartment

Mid-century modern family home

Craft supply hoarder's bungalow

Naturalist's cabin

Image-Conditioned Generated Scenes

How about generating scenes conditioned on a reference image? Thanks to the agentic VLM Qwen3-VL, we can feed a reference image directly to the agent. Although the agent cannot produce scenes that are pixel-aligned with the reference image, it generates scenes that are semantically coherent with it.
Reference image

Reference image

Reference image

Generated Scene

Generated Scene

Generated Scene

Physical Stability of Generated Scenes

Generated scenes are loaded into IsaacSim for physical validation. Both baselines exhibit displaced objects due to instability, whereas SAGE preserves scene stability before and after simulation.
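
The stability check can be summarized as a displacement test: record object positions before simulation, step the physics, and compare. Below is a minimal sketch of that metric; the 2 cm threshold and the helper structure are assumptions, and the simulator calls are only indicated in comments.

```python
# Minimal sketch of the displacement-based stability check.
import numpy as np

def max_displacement(before: dict, after: dict) -> float:
    """Largest per-object translation (meters) between the two pose snapshots."""
    return max(float(np.linalg.norm(after[name] - before[name])) for name in before)

def is_stable(before: dict, after: dict, threshold_m: float = 0.02) -> bool:
    """Scene counts as stable if no object moved more than `threshold_m` (assumed value)."""
    return max_displacement(before, after) <= threshold_m

if __name__ == "__main__":
    # In practice, `before`/`after` would be object world positions recorded
    # before and after N physics steps in IsaacSim.
    before = {"bowl": np.array([0.0, 0.0, 0.8]), "table": np.zeros(3)}
    after = {"bowl": np.array([0.0, 0.005, 0.8]), "table": np.zeros(3)}
    print(is_stable(before, after))  # True: nothing moved more than 2 cm
```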

Progress bars below each video show simulation progress

Holodeck

SceneWeaver

Ours

Augmentation of Generated Scenes

Object Category-Level Augmentation

Here we showcase the capability of our category augmentation method. We randomly select a subset of the objects in the scene for category augmentation. Given the text description of each selected object from the generation stage, we employ LLM-based text augmentation to produce variations in geometry and texture (e.g., shape, color, material, or finish) while maintaining the original object category. We then use TRELLIS to synthesize corresponding 3D assets from these augmented descriptions, which are placed into the scene to enrich visual and physical diversity across instances.
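
The sketch below illustrates this two-step recipe under stated assumptions: `augment_description` stands in for the LLM-based text augmentation (here a simple template), and `text_to_3d` is a placeholder for the TRELLIS call, not its actual API.

```python
# Sketch of category-level augmentation: vary attributes, keep the category, synthesize assets.
import random

COLORS = ["oak", "walnut", "matte black", "brushed steel"]
SHAPES = ["rounded", "angular", "slender", "wide"]

def augment_description(desc: str, category: str) -> str:
    """Stand-in for LLM augmentation: vary texture/geometry while keeping the category."""
    return f"a {random.choice(SHAPES)}, {random.choice(COLORS)} {category}, {desc}"

def text_to_3d(prompt: str) -> str:
    """Placeholder for the TRELLIS text-to-3D call; returns a (fake) asset path."""
    return f"assets/{abs(hash(prompt)) % 10**8}.glb"

def augment_object(desc: str, category: str, n_variants: int = 3) -> list[str]:
    return [text_to_3d(augment_description(desc, category)) for _ in range(n_variants)]

if __name__ == "__main__":
    print(augment_object("a ceramic mug with a handle", "mug"))
```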

Base Scene

Aug. Scene 1

Aug. Scene 2

Aug. Scene 3

Base Scene

Aug. Scene 1

Aug. Scene 2

Aug. Scene 3

Base Scene

Aug. Scene 1

Aug. Scene 2

Aug. Scene 3

Base Scene

Aug. Scene 1

Aug. Scene 2

Aug. Scene 3

Scene Layout-Level Augmentation

While the category-level augmentations modify the geometry and texture of the selected objects, the background environment layout remains unchanged. For tasks requiring full-scene exploration or navigation, we introduce layout-level augmentation, where the background scene, including the room geometry and all task-irrelevant objects, is regenerated through the agent-driven scene generation. This process produces diverse scene layouts sharing the same task specification, enabling the learning of policies that generalize across spatial configurations.
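
Conceptually, the augmentation pins the task-relevant objects and reruns the generator for everything else; the sketch below shows that contract with hypothetical helper names.

```python
# Sketch of layout-level augmentation: task-relevant objects are pinned, the rest is regenerated.

def regenerate_background(room_type: str, pinned: dict, seed: int) -> dict:
    """Placeholder for rerunning the agent-driven generator; only `pinned` is reused."""
    return {"room_type": room_type, "pinned": pinned, "background_seed": seed, "objects": []}

def augment_layout(pinned: dict, room_type: str, n_variants: int = 3) -> list[dict]:
    # Keep the task-relevant objects (and their poses) fixed across variants;
    # everything else in the room is regenerated from a different seed.
    return [regenerate_background(room_type, pinned, seed) for seed in range(n_variants)]

if __name__ == "__main__":
    pinned = {"desk": (1.0, 0.5, 0.0), "nightstand": (2.0, 0.3, 0.0), "mug": (2.0, 0.3, 0.45)}
    for scene in augment_layout(pinned, "bedroom"):
        print(scene["background_seed"], sorted(scene["pinned"]))
```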

Bedroom: We keep the desk, the nightstand, and the mug on the nightstand fixed, and generate diverse layouts around them.

Base Scene

Aug. Scene 1

Aug. Scene 2

Aug. Scene 3

Living room: We keep the sideboard, the coffee table, and the vase on the coffee table fixed, and generate diverse layouts around them.

Base Scene

Aug. Scene 1

Aug. Scene 2

Aug. Scene 3

Office: We keep the sofa, the desk, and the pen on the desk fixed, and generate diverse layouts around them.

Base Scene

Aug. Scene 1

Aug. Scene 2

Aug. Scene 3

Meeting room: We keep the meeting table, the cabinet, and the cup on the table fixed, and generate diverse layouts around them.

Base Scene

Aug. Scene 1

Aug. Scene 2

Aug. Scene 3

Embodied Policy Training

Pick-and-Place Action Generation

In Pick-and-Place, we use M2T2 to generate grasp pose candidates from rendered depth images. Collision-free trajectories are computed by integrating cuRobo into the motion planning and inverse kinematics pipeline, ensuring feasible and stable grasp execution.
Here we show the collected robot actions together with the camera views (three perspective cameras: left, right, and wrist); both RGB and depth from each view are fed to the policy network. Action generation is parallelized with 8 environments per GPU in the IsaacSim simulator.
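
A high-level sketch of the per-episode grasp generation loop is shown below; `predict_grasps` and `plan_to_pose` are placeholders standing in for M2T2 inference and cuRobo planning respectively, not the actual library APIs.

```python
# Sketch: try grasp candidates in order until one admits a collision-free trajectory.
from __future__ import annotations
import numpy as np

def predict_grasps(depth: np.ndarray, intrinsics: np.ndarray) -> list[np.ndarray]:
    """Placeholder for M2T2 inference: return candidate 4x4 grasp poses."""
    return [np.eye(4)]

def plan_to_pose(current_q: np.ndarray, target_pose: np.ndarray) -> np.ndarray | None:
    """Placeholder for IK + collision-free planning (cuRobo): joint trajectory or None."""
    return np.tile(current_q, (10, 1))

def generate_pick(depth, intrinsics, current_q):
    for grasp in predict_grasps(depth, intrinsics):
        traj = plan_to_pose(current_q, grasp)
        if traj is not None:        # first feasible candidate wins
            return grasp, traj
    return None, None

if __name__ == "__main__":
    grasp, traj = generate_pick(np.zeros((480, 640)), np.eye(3), np.zeros(7))
    print(traj.shape if traj is not None else "no feasible grasp")
```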

Configuration 1

Configuration 2

Configuration 1

Configuration 2

Configuration 1

Configuration 2

Configuration 1

Configuration 2

More Examples of Pick-and-Place Actions Generation

Configuration 1

Configuration 2

Configuration 1

Configuration 2

Configuration 1

Configuration 2

Policy Inference: Successful Examples

We show some successful examples of the trained policy inference.

Policy Inference: Failure Cases

Here are some failure cases of the trained policy inference. The failures are caused by the stochasticity of policy inference or by far-away objects that are hard for the robot to reach and grasp.

Mobile Manipulation Action Generation

This task combines navigation with object pick-and-place in between. For the navigation motions, we adopt RRT for robot path planning, generating collision-free trajectories between designated start and target positions.
Here we show the collected robot actions together with the camera views (two fisheye cameras, front and back, for navigation; three perspective cameras, left, right, and wrist, for pick-and-place); both RGB and depth from each view are fed to the policy network. Action generation is parallelized with 2 environments per GPU in the IsaacSim simulator.
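
For reference, a minimal 2D RRT planner of the kind used for the navigation segment is sketched below; the obstacle model, step size, goal bias, and tolerance are illustrative values rather than the ones used in SAGE, and edge-wise collision checking is omitted for brevity.

```python
# Minimal 2D RRT sketch for collision-free path planning between start and goal.
import math
import random

def collision_free(p, obstacles):
    """A point is valid if it lies outside every circular obstacle (x, y, radius)."""
    return all(math.hypot(p[0] - ox, p[1] - oy) > r for ox, oy, r in obstacles)

def rrt(start, goal, obstacles, bounds=10.0, step=0.5, iters=5000, tol=0.5):
    nodes, parent = [start], {0: None}
    for _ in range(iters):
        # Goal-biased sampling: head straight for the goal 10% of the time.
        sample = goal if random.random() < 0.1 else (
            random.uniform(-bounds, bounds), random.uniform(-bounds, bounds))
        i = min(range(len(nodes)), key=lambda k: math.dist(nodes[k], sample))
        d = math.dist(nodes[i], sample)
        if d == 0.0:
            continue
        s = min(step, d)  # extend the nearest node toward the sample by at most `step`
        new = (nodes[i][0] + s * (sample[0] - nodes[i][0]) / d,
               nodes[i][1] + s * (sample[1] - nodes[i][1]) / d)
        if not collision_free(new, obstacles):  # point check only; edge check omitted
            continue
        nodes.append(new)
        parent[len(nodes) - 1] = i
        if math.dist(new, goal) < tol:  # reached the goal region: backtrack the path
            path, k = [], len(nodes) - 1
            while k is not None:
                path.append(nodes[k])
                k = parent[k]
            return path[::-1]
    return None  # no collision-free path found within the iteration budget

if __name__ == "__main__":
    waypoints = rrt((0.0, 0.0), (8.0, 8.0), obstacles=[(4.0, 4.0, 1.0)])
    print(len(waypoints) if waypoints else "no path")
```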

More Examples of Mobile Manipulation Actions Generation

Policy Inference: Successful Examples

We show some successful examples of the trained policy inference.

Policy Inference: Failure Cases

Here are some failure cases of the trained policy inference. The failures are caused by the stochasticity of policy inference and by grasp poses that are difficult to reach in the test cases.

Articulated Objects in SAGE Generated Scenes

While the paper focuses on the overall framework, SAGE's modular design makes it easy to extend the current text-to-3D generation with object retrieval. The figure below shows a simple integration that places articulated assets from PartNet-Mobility into the generated scenes, with robot actions generated by grasp pose prediction and motion planning as described in the paper.
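
A hypothetical sketch of the retrieval hook is shown below: instead of calling text-to-3D, the agent looks up an articulated asset by category in a PartNet-Mobility index. The index format and asset paths here are placeholders for illustration only.

```python
# Sketch of retrieval-based asset selection; the index and paths are placeholders.
from __future__ import annotations

PARTNET_MOBILITY_INDEX = [
    ("Microwave", "partnet_mobility/asset_0001"),
    ("StorageFurniture", "partnet_mobility/asset_0002"),
    ("Refrigerator", "partnet_mobility/asset_0003"),
]

def retrieve_articulated(category: str) -> str | None:
    """Return the path of the first indexed asset whose category matches the request."""
    for cat, path in PARTNET_MOBILITY_INDEX:
        if cat.lower() == category.lower():
            return path
    return None

if __name__ == "__main__":
    print(retrieve_articulated("microwave"))  # -> the placeholder microwave asset
```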
Closed

Open

Closed

Open

Citation

If you find our work useful in your research, please consider citing:

@article{xia2026sage,
  title={SAGE: Scalable Agentic 3D Scene Generation for Embodied AI},
  author={Xia, Hongchi and Li, Xuan and Li, Zhaoshuo and Ma, Qianli and Xu, Jiashu and Liu, Ming-Yu and Cui, Yin and Lin, Tsung-Yi and Ma, Wei-Chiu and Wang, Shenlong and Song, Shuran and Wei, Fangyin},
  journal={arXiv preprint arXiv:TBD},
  year={2026}
}