We introduce SPOT, an object-centric imitation learning framework. At its core, it exploits the synergy between a diffusion policy and an object-centric representation, specifically SE(3) object pose trajectories. This representation decouples embodiment actions from sensory inputs, enabling learning from diverse demonstration types, including both action-based demonstrations and action-less human hand demonstrations. Moreover, object pose trajectories inherently capture the planning constraints present in demonstrations, without manually crafted rules. In real-world evaluation, using only eight demonstrations recorded on an iPhone, our approach completed all tasks while fully complying with task constraints.
Given an observation, our framework estimates the object's pose, predicts its future trajectory in SE(3), and derives an action plan accordingly. The diffusion model is trained on demonstration trajectories extracted from videos, without requiring action data from the target embodiment.
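The embodiment decoupling rests on describing a demonstration purely as object motion rather than robot actions. A minimal sketch of that idea, assuming poses are given as 4x4 homogeneous matrices (the function name and representation are illustrative, not the paper's exact implementation):

```python
import numpy as np

def relative_pose_trajectory(poses):
    """Express an object trajectory as per-step relative SE(3) motions.

    poses: (N, 4, 4) object poses in the camera/world frame.
    Returns (N-1, 4, 4) transforms T_rel[t] satisfying
    poses[t+1] = poses[t] @ T_rel[t]. This description contains no
    robot actions, so it applies equally to human-hand videos.
    """
    return np.array([np.linalg.inv(poses[t]) @ poses[t + 1]
                     for t in range(len(poses) - 1)])
```

Because the relative motions live in the object frame, the same trajectory can later be replayed by any embodiment that can move the object.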
During training, we extract object pose trajectories from demonstration RGBD videos (e.g., collected with an iPhone); these trajectories are independent of the embodiment. From them, we train a diffusion model to generate future object trajectories and to determine task completion from current and past poses. During execution, the task-relevant object is continuously tracked, and its pose is fed to the trajectory diffusion model, which predicts the object's future SE(3) trajectory toward task completion. Finally, we convert the generated trajectories into embodiment-agnostic action plans for closed-loop manipulation.
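The conversion from a predicted object trajectory to executable motion can be sketched as follows, under the common assumption that the grasp transform (gripper pose in the object frame) stays fixed once the object is held; the function and variable names here are hypothetical, not the paper's API:

```python
import numpy as np

def object_to_ee_waypoints(obj_traj, T_grasp):
    """Map predicted SE(3) object poses to end-effector targets.

    obj_traj: (N, 4, 4) predicted object poses in the world frame.
    T_grasp:  (4, 4) gripper pose in the object frame, assumed
              constant while the object is grasped.
    Each end-effector target is T_obj @ T_grasp, so the gripper
    carries the object along the predicted trajectory.
    """
    return np.array([T_obj @ T_grasp for T_obj in obj_traj])

# Minimal usage: the object translates 0.1 m along x at each step,
# and the gripper sits 5 cm behind the object along z (illustrative).
obj_traj = np.stack([np.eye(4) for _ in range(3)])
for t, T in enumerate(obj_traj):
    T[0, 3] = 0.1 * t
T_grasp = np.eye(4)
T_grasp[2, 3] = -0.05
ee = object_to_ee_waypoints(obj_traj, T_grasp)
```

In a closed-loop setting, the tracked object pose is re-estimated each cycle and the diffusion model re-queried, so only the first few waypoints of `ee` would be executed before replanning.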
We evaluate our method on four real-world manipulation tasks. All models use a single camera view and eight human demonstrations per task.
Task: mug-on-coaster
Task: plant-in-vase
Task: pour-water
Task: put-plate-into-oven
We test our method across various scenarios to evaluate its generalization capabilities.