SPOT: SE(3) Pose Trajectory Diffusion for Object-Centric Manipulation


NVIDIA, UT Austin, UCSD

* Work done during internship at NVIDIA

Abstract

We introduce SPOT, an object-centric imitation learning framework. The key idea is to capture each task with an object-centric representation, specifically the SE(3) object pose trajectory relative to the target. This approach decouples embodiment actions from sensory inputs, facilitating learning from various demonstration types, including both action-based and action-less human hand demonstrations, as well as cross-embodiment generalization. Additionally, object pose trajectories inherently capture planning constraints from demonstrations without the need for manually crafted rules. To guide the robot in executing the task, the object trajectory is used to condition a diffusion policy. Our method improves over prior work on simulated RLBench tasks, and in real-world evaluation, using only eight demonstrations captured with an iPhone, it completes all tasks while fully complying with task constraints.



6D Object Pose as Intermediate Representation

Given the observation, our framework estimates the object’s pose, predicts its future path in SE(3), and derives an action plan accordingly. Our diffusion model is trained on demonstration trajectories extracted from videos without needing action data from the same embodiment.
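To make the representation concrete, the sketch below shows one common way to flatten an SE(3) pose into a vector that a diffusion model can denoise: translation plus the continuous 6D rotation encoding. This is our own illustrative sketch, not the released SPOT code, and function names such as pose_to_vec are assumptions.

```python
# Illustrative sketch (not the SPOT release): encoding an SE(3) pose as a
# 9-D vector -- translation plus the first two rotation columns (the
# continuous "6D" rotation representation) -- so a length-H trajectory
# becomes an (H, 9) array that a diffusion model can denoise.
import numpy as np

def pose_to_vec(T: np.ndarray) -> np.ndarray:
    """Encode a 4x4 SE(3) pose as [tx, ty, tz, r_col1 (3), r_col2 (3)]."""
    t = T[:3, 3]
    r6 = T[:3, :2].T.reshape(-1)  # first two rotation columns
    return np.concatenate([t, r6])

def vec_to_pose(v: np.ndarray) -> np.ndarray:
    """Decode back to SE(3), re-orthonormalizing via Gram-Schmidt."""
    t, a, b = v[:3], v[3:6], v[6:9]
    a = a / np.linalg.norm(a)
    b = b - np.dot(a, b) * a
    b = b / np.linalg.norm(b)
    c = np.cross(a, b)              # third column from the right-hand rule
    T = np.eye(4)
    T[:3, :3] = np.stack([a, b, c], axis=1)
    T[:3, 3] = t
    return T

# A horizon of H relative object poses becomes an (H, 9) array.
H = 16
trajectory = np.stack([pose_to_vec(np.eye(4)) for _ in range(H)])
print(trajectory.shape)  # (16, 9)
```

The 6D rotation encoding avoids the discontinuities of Euler angles and quaternions, which is why it is a popular choice for regressing or denoising rotations; the paper may use a different parameterization.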



Framework Overview

During training, we extract object pose trajectories from demonstration RGBD videos (e.g., collected with an iPhone), which are independent of the embodiment. Using these extracted trajectories, we train a diffusion model to generate future object trajectories and determine task completion based on current and past poses.
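As a rough sketch of this extraction step (with our own placeholder names, not the released pipeline), the relative trajectory can be computed from per-frame world poses of the manipulated object and the target object, e.g. as produced by a mesh-based pose tracker:

```python
# Hedged sketch of trajectory extraction: express the manipulated object's
# per-frame pose in the target object's frame, giving an embodiment-agnostic
# SE(3) trajectory. Names below are our own placeholders.
import numpy as np

def relative_trajectory(obj_poses_world: np.ndarray,
                        target_poses_world: np.ndarray) -> np.ndarray:
    """Both inputs: (N, 4, 4) world-frame poses, one per video frame.
    Returns (N, 4, 4) poses of the object expressed in the target frame."""
    return np.stack([np.linalg.inv(T_tgt) @ T_obj
                     for T_obj, T_tgt in zip(obj_poses_world, target_poses_world)])
```

One such relative trajectory per demonstration video is the training signal for the trajectory diffusion model.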

During task execution, the task-relevant object is continuously tracked, and its pose is fed to the trajectory diffusion model, which predicts a future SE(3) object trajectory that accomplishes the task. Finally, we convert the generated trajectories from this embodiment-agnostic object space into robot action plans for closed-loop manipulation, as sketched below.
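For intuition, the sketch below shows one way such a conversion could work under a rigid-grasp assumption: once the object is grasped, the gripper-to-object transform is constant, so generated object poses map directly to end-effector waypoints. This is an illustrative assumption of ours, not necessarily the exact conversion used in the paper.

```python
# Illustrative sketch: map a generated object trajectory (expressed in the
# target frame) to world-frame gripper waypoints, assuming a rigid grasp so
# the object-to-gripper transform T_grasp is constant.
import numpy as np

def object_traj_to_gripper_targets(obj_traj_in_target: np.ndarray,
                                   T_target_world: np.ndarray,
                                   T_grasp: np.ndarray) -> np.ndarray:
    """obj_traj_in_target: (H, 4, 4) future object poses in the target frame.
    T_target_world: 4x4 current target-object pose in the world frame.
    T_grasp: 4x4 gripper pose expressed in the object frame (fixed grasp).
    Returns (H, 4, 4) world-frame gripper pose targets."""
    return np.stack([T_target_world @ T_obj @ T_grasp
                     for T_obj in obj_traj_in_target])
```

The resulting Cartesian waypoints can then be handed to any robot-specific IK solver or motion planner, which is what keeps the learned trajectory itself independent of the embodiment.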



Real-world Evaluation

We evaluate our method on four real-world manipulation tasks. All models use a single camera view and eight human demonstrations per task.



Generalization Tests

We test our method across three scenario types to evaluate its generalization capabilities: varied object configurations, different lighting conditions, and cluttered scenes.

More Details on Data Collection

Our approach requires a demonstration dataset for training. Dataset collection involves two steps: object mesh reconstruction (see below) and demonstration video collection (see Real-world Evaluation). The mesh is used for object pose tracking during both training and testing, and the demonstration videos are used to train the trajectory diffusion model.

A single iPhone 12 Pro is the only device used for all data collection: we obtain object scans with AR Code and record RGBD human video demonstrations with Record3D.
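For readers processing similar captures, the generic sketch below back-projects a depth frame plus pinhole intrinsics into a camera-frame point cloud, the usual input for mesh-based 6D pose tracking. The intrinsics and array sizes are placeholders, not values from our setup, and this code is not part of the SPOT release.

```python
# Generic RGBD back-projection sketch (placeholder intrinsics; use the values
# exported alongside your own Record3D capture).
import numpy as np

def depth_to_pointcloud(depth: np.ndarray, fx: float, fy: float,
                        cx: float, cy: float) -> np.ndarray:
    """depth: (H, W) metric depth. Returns (M, 3) camera-frame points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]                        # drop zero-depth pixels

# Placeholder example:
depth = np.full((192, 256), 0.8, dtype=np.float32)   # synthetic 0.8 m plane
cloud = depth_to_pointcloud(depth, fx=600.0, fy=600.0, cx=128.0, cy=96.0)
print(cloud.shape)  # (49152, 3)
```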

We'll be releasing the code for dataset processing and model training soon. Feel free to reach out with any questions or to share your experience!

Object scans used for each task:

Task: pour-water (Pitcher, Mug)
Task: mug-on-coaster (Travel Mug, Coaster)
Task: plant-in-vase (Cactus, Vase)
Task: put-plate-into-oven (Plate, Toaster Oven)