SPOT: SE(3) Pose Trajectory Diffusion for Object-Centric Manipulation


NVIDIA, UT Austin, UCSD

* Work done during internship at NVIDIA

Abstract

We introduce SPOT, an object-centric imitation learning framework. At its core, it leverages the synergy between diffusion policies and an object-centric representation, specifically SE(3) object pose trajectories. This approach decouples embodiment actions from sensory inputs, facilitating learning from various demonstration types, including both action-based and action-less human hand demonstrations. Additionally, object pose trajectories inherently capture planning constraints from demonstrations without the need for manually crafted rules. In real-world evaluation, using only eight demonstrations captured with an iPhone, our approach completed all tasks while fully complying with task constraints.



6D Object Pose as Intermediate Representation

Given an observation, our framework estimates the object's pose, predicts its future trajectory in SE(3), and derives an action plan accordingly. The diffusion model is trained on demonstration trajectories extracted from videos, without requiring action data from the target embodiment.
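As a rough illustration (not the authors' released implementation), the sketch below parameterizes each trajectory step as a translation plus a rotation vector and runs a generic DDPM-style ancestral sampling loop over the whole horizon. The noise predictor `eps_model`, the conditioning on the tracked pose history, and all hyperparameters are placeholders.

```python
import numpy as np
from scipy.spatial.transform import Rotation

# Assumed parameterization: each trajectory step is a 6-D vector
# [tx, ty, tz, rx, ry, rz] (translation + rotation vector), H steps long.
H, D = 16, 6           # horizon and per-step dimension (assumed values)
T_STEPS = 50           # number of diffusion steps (assumed)
betas = np.linspace(1e-4, 0.02, T_STEPS)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_model(x_t, t, obj_pose_history):
    """Placeholder noise predictor; a trained network conditioned on the
    object's current and past poses would go here."""
    return np.zeros_like(x_t)

def sample_pose_trajectory(obj_pose_history, rng=np.random.default_rng(0)):
    """Generic DDPM ancestral sampling over a future object pose trajectory."""
    x = rng.standard_normal((H, D))
    for t in reversed(range(T_STEPS)):
        eps = eps_model(x, t, obj_pose_history)
        a, ab = alphas[t], alpha_bars[t]
        mean = (x - (1 - a) / np.sqrt(1 - ab) * eps) / np.sqrt(a)
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    # Convert each 6-D step back into a 4x4 SE(3) pose matrix.
    poses = []
    for step in x:
        T = np.eye(4)
        T[:3, :3] = Rotation.from_rotvec(step[3:]).as_matrix()
        T[:3, 3] = step[:3]
        poses.append(T)
    return poses

if __name__ == "__main__":
    history = [np.eye(4)]                    # placeholder tracked poses
    future = sample_pose_trajectory(history)
    print(f"sampled {len(future)} future SE(3) poses")
```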



Framework Overview

During training, we extract object pose trajectories from demonstration RGBD videos (e.g., collected with an iPhone); these trajectories are independent of the embodiment. Using them, we train a diffusion model that generates future object trajectories and determines task completion from current and past poses. During execution, the task-relevant object is continuously tracked, and its pose is fed to the trajectory diffusion model, which predicts the object's future SE(3) trajectory toward task completion. Finally, we convert the generated trajectories into embodiment-agnostic action plans for closed-loop manipulation.
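One way to see how a predicted object trajectory can become an embodiment-agnostic action plan is sketched below: assuming a rigid grasp, each predicted object pose is composed with a fixed object-to-gripper transform to obtain end-effector waypoints that any robot's controller can track. The names (`object_traj_to_ee_plan`, `T_grasp`) and the rigid-grasp assumption are illustrative, not taken from the paper.

```python
import numpy as np

def object_traj_to_ee_plan(obj_poses, T_grasp):
    """Map each predicted object pose (4x4, world frame) to a target
    end-effector pose, assuming the gripper holds the object rigidly:
        T_ee(t) = T_obj(t) @ T_grasp
    The resulting waypoints do not reference any particular robot; any
    IK solver or Cartesian controller can track them."""
    return [T_obj @ T_grasp for T_obj in obj_poses]

# Closed-loop usage: each control cycle, take the tracker's latest object
# pose, re-sample a future trajectory, execute only the first waypoint,
# and repeat until the model signals task completion.
if __name__ == "__main__":
    T_grasp = np.eye(4)                         # placeholder grasp transform
    obj_poses = [np.eye(4) for _ in range(16)]  # placeholder predicted trajectory
    ee_waypoints = object_traj_to_ee_plan(obj_poses, T_grasp)
    print(len(ee_waypoints), "end-effector waypoints")
```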



Real-world Evaluation

We evaluated our method on four real-world manipulation tasks. All models use a single camera view and eight human demonstrations per task.

Human Demonstration Video



Generalization Tests

We test our method across various scenarios to evaluate its generalization capabilities.

Object Configurations


Lighting Conditions


Cluttered Scenes