SPOT: SE(3) Pose Trajectory Diffusion for Object-Centric Manipulation


NVIDIA, UT Austin, UCSD

* Work done during internship at NVIDIA

Abstract

We introduce SPOT, an object-centric imitation learning framework. The key idea is to capture each task with an object-centric representation, specifically the SE(3) object pose trajectory relative to the target. This approach decouples embodiment actions from sensory inputs, facilitating learning from various demonstration types, including both action-based and action-less human hand demonstrations, as well as cross-embodiment generalization. Additionally, object pose trajectories inherently capture planning constraints from demonstrations without the need for manually crafted rules. To guide the robot in executing the task, the object trajectory is used to condition a diffusion policy. Our method improves over prior work on simulated RLBench tasks, and in real-world evaluation, using only eight demonstrations captured with an iPhone, it completes all tasks while fully complying with task constraints.



6D Object Pose as Intermediate Representation

Given the observation, our framework estimates the object’s pose, predicts its future path in SE(3), and derives an action plan accordingly. Our diffusion model is trained on demonstration trajectories extracted from videos without needing action data from the same embodiment.
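To make the representation concrete, the sketch below shows one common way to flatten an SE(3) pose into a vector that a diffusion model can denoise: translation plus the continuous 6D rotation encoding. This is our own illustrative sketch, not the released SPOT code, and function names such as pose_to_vec are assumptions.

```python
# Illustrative sketch (not the SPOT release): encoding an SE(3) pose as a
# 9-D vector -- translation plus the first two rotation columns (the
# continuous "6D" rotation representation) -- so a length-H trajectory
# becomes an (H, 9) array that a diffusion model can denoise.
import numpy as np

def pose_to_vec(T: np.ndarray) -> np.ndarray:
    """Encode a 4x4 SE(3) pose as [tx, ty, tz, r_col1 (3), r_col2 (3)]."""
    t = T[:3, 3]
    r6 = T[:3, :2].T.reshape(-1)  # first two rotation columns
    return np.concatenate([t, r6])

def vec_to_pose(v: np.ndarray) -> np.ndarray:
    """Decode back to SE(3), re-orthonormalizing via Gram-Schmidt."""
    t, a, b = v[:3], v[3:6], v[6:9]
    a = a / np.linalg.norm(a)
    b = b - np.dot(a, b) * a
    b = b / np.linalg.norm(b)
    c = np.cross(a, b)              # third column from the right-hand rule
    T = np.eye(4)
    T[:3, :3] = np.stack([a, b, c], axis=1)
    T[:3, 3] = t
    return T

# A horizon of H relative object poses becomes an (H, 9) array.
H = 16
trajectory = np.stack([pose_to_vec(np.eye(4)) for _ in range(H)])
print(trajectory.shape)  # (16, 9)
```

The 6D rotation encoding avoids the discontinuities of Euler angles and quaternions, which is why it is a popular choice for regressing or denoising rotations; the paper may use a different parameterization.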



Framework Overview

During training, we extract object pose trajectories from demonstration RGBD videos (e.g., collected with an iPhone), which are independent of the embodiment. Using these extracted trajectories, we train a diffusion model to generate future object trajectories and determine task completion based on current and past poses.
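As a rough sketch of this extraction step (with our own placeholder names, not the released pipeline), the relative trajectory can be computed from per-frame world poses of the manipulated object and the target object, e.g. as produced by a mesh-based pose tracker:

```python
# Hedged sketch of trajectory extraction: express the manipulated object's
# per-frame pose in the target object's frame, giving an embodiment-agnostic
# SE(3) trajectory. Names below are our own placeholders.
import numpy as np

def relative_trajectory(obj_poses_world: np.ndarray,
                        target_poses_world: np.ndarray) -> np.ndarray:
    """Both inputs: (N, 4, 4) world-frame poses, one per video frame.
    Returns (N, 4, 4) poses of the object expressed in the target frame."""
    return np.stack([np.linalg.inv(T_tgt) @ T_obj
                     for T_obj, T_tgt in zip(obj_poses_world, target_poses_world)])
```

One such relative trajectory per demonstration video is the training signal for the trajectory diffusion model.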

During task execution, the task-relevant object is continuously tracked, and its pose is fed to the trajectory diffusion model, which predicts a future SE(3) object trajectory that accomplishes the task. Finally, we convert the generated trajectories from this embodiment-agnostic object space into robot action plans for closed-loop manipulation, as sketched below.
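For intuition, the sketch below shows one way such a conversion could work under a rigid-grasp assumption: once the object is grasped, the gripper-to-object transform is constant, so generated object poses map directly to end-effector waypoints. This is an illustrative assumption of ours, not necessarily the exact conversion used in the paper.

```python
# Illustrative sketch: map a generated object trajectory (expressed in the
# target frame) to world-frame gripper waypoints, assuming a rigid grasp so
# the object-to-gripper transform T_grasp is constant.
import numpy as np

def object_traj_to_gripper_targets(obj_traj_in_target: np.ndarray,
                                   T_target_world: np.ndarray,
                                   T_grasp: np.ndarray) -> np.ndarray:
    """obj_traj_in_target: (H, 4, 4) future object poses in the target frame.
    T_target_world: 4x4 current target-object pose in the world frame.
    T_grasp: 4x4 gripper pose expressed in the object frame (fixed grasp).
    Returns (H, 4, 4) world-frame gripper pose targets."""
    return np.stack([T_target_world @ T_obj @ T_grasp
                     for T_obj in obj_traj_in_target])
```

The resulting Cartesian waypoints can then be handed to any robot-specific IK solver or motion planner, which is what keeps the learned trajectory itself independent of the embodiment.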



Real-world Evaluation

We evaluate our method on four real-world manipulation tasks. All models use a single camera view and eight human demonstrations per task.



Generalization Tests

We test our method across three scenario types to evaluate its generalization capabilities: varied object configurations, different lighting conditions, and cluttered scenes.

More Details on Data Collection

Our approach requires a demonstration dataset for training. Dataset collection involves two steps: object mesh reconstruction (see below) and demonstration video collection (see Real-world Evaluation). The mesh is used for object pose tracking during both training and testing, and the demonstration videos are used to train the trajectory diffusion model.

A single iPhone 12 Pro is the only device used for all data collection: we obtain object scans with AR Code and record RGBD human video demonstrations with Record3D.
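For readers processing similar captures, the generic sketch below back-projects a depth frame plus pinhole intrinsics into a camera-frame point cloud, the usual input for mesh-based 6D pose tracking. The intrinsics and array sizes are placeholders, not values from our setup, and this code is not part of the SPOT release.

```python
# Generic RGBD back-projection sketch (placeholder intrinsics; use the values
# exported alongside your own Record3D capture).
import numpy as np

def depth_to_pointcloud(depth: np.ndarray, fx: float, fy: float,
                        cx: float, cy: float) -> np.ndarray:
    """depth: (H, W) metric depth. Returns (M, 3) camera-frame points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]                        # drop zero-depth pixels

# Placeholder example:
depth = np.full((192, 256), 0.8, dtype=np.float32)   # synthetic 0.8 m plane
cloud = depth_to_pointcloud(depth, fx=600.0, fy=600.0, cx=128.0, cy=96.0)
print(cloud.shape)  # (49152, 3)
```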

We'll be releasing the code for dataset processing and model training soon. Feel free to reach out with any questions or to share your experience!

Object scans used for each task:

Task: pour-water (Pitcher, Mug)
Task: mug-on-coaster (Travel Mug, Coaster)
Task: plant-in-vase (Cactus, Vase)
Task: put-plate-into-oven (Plate, Toaster Oven)