SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control

All models shown in the videos will be released.

Trailers

One Day at NVIDIA

Skill Montage

Abstract

Despite the rise of billion-parameter foundation models trained across thousands of GPUs, similar scaling gains have not been shown for humanoid control. Current neural controllers for humanoids remain modest in size, target a limited behavior set, and are trained on a handful of GPUs over several days. We show that scaling up model capacity, data, and compute yields a generalist humanoid controller capable of producing natural and robust whole-body movements. Specifically, we posit motion tracking as a natural and scalable task for humanoid control, leveraging dense supervision from diverse motion-capture data to acquire human motion priors without manual reward engineering. We build a foundation model for motion tracking by scaling along three axes: network size (from 1.2M to 42M parameters), dataset volume (over 100M frames, 700 hours of high-quality motion data), and compute (9k GPU hours). Beyond demonstrating the benefits of scale, we show the practical utility of our model through two mechanisms: (1) a real-time universal kinematic planner that bridges motion tracking to downstream task execution, enabling natural and interactive control, and (2) a unified token space that supports various motion input interfaces, such as VR teleoperation devices, human videos, and vision-language-action (VLA) models, all using the same policy. Scaling motion tracking exhibits favorable properties: performance improves steadily with increased compute and data diversity, and learned representations generalize to unseen motions, establishing motion tracking at scale as a practical foundation for humanoid control.

All results are generated using a single unified control policy.

Teleoperation

Video Teleoperation

Using video as input and GENMO for pose estimation, the humanoid tracks and reproduces complex motions from human demonstrations in real time.
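
As a rough sketch of how such a pipeline could be wired up, the loop below runs pose estimation on each incoming frame and feeds the result to the tracking policy. GENMO's actual interface is not public here, so estimate_pose and policy are hypothetical stand-ins:

import numpy as np

def estimate_pose(frame):
    # Hypothetical stand-in for GENMO: map an RGB frame to a
    # target whole-body pose for the humanoid.
    return np.zeros(29)  # placeholder pose vector

def policy(robot_state, target_pose):
    # Hypothetical tracking policy: map the current robot state and
    # the target pose to joint-level actions (toy proportional term).
    return target_pose - robot_state[: target_pose.shape[0]]

robot_state = np.zeros(64)  # placeholder proprioceptive state
for _ in range(3):  # stand-in for a live camera stream
    frame = np.zeros((480, 640, 3), dtype=np.uint8)
    target = estimate_pose(frame)          # human pose from video
    action = policy(robot_state, target)   # whole-body tracking action
    # send `action` to the robot or simulator here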

Kung Fu

Crawling



VR Teleoperation with Keypoints

A hybrid control mode that uses only three VR tracking points (head and hands) to drive the humanoid's upper body, while a kinematic planner generates the lower-body motion, enabling intuitive manipulation tasks.
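
A minimal sketch of how such a hybrid command could be assembled, assuming the policy accepts upper-body keypoint targets concatenated with a planner-generated lower-body pose (all names and dimensions are illustrative, not the released interface):

import numpy as np

def plan_lower_body(base_velocity):
    # Hypothetical planner output: a target lower-body pose.
    return np.zeros(12)

vr_keypoints = {
    "head":       np.array([0.0,  0.0, 1.6]),
    "left_hand":  np.array([0.3,  0.4, 1.1]),
    "right_hand": np.array([0.3, -0.4, 1.1]),
}
upper_targets = np.concatenate(list(vr_keypoints.values()))  # 9-D keypoint command
lower_targets = plan_lower_body(np.array([0.2, 0.0]))        # planner-generated legs
hybrid_command = np.concatenate([upper_targets, lower_targets])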

Lawn Mowing

Object Manipulation



VR Teleoperation with Whole-Body

Full-body VR tracking captures the operator's complete body motion, enabling precise and natural humanoid control for complex whole-body manipulation tasks.

Whole-Body Tracking

Manipulation with Whole-Body


Multi-Modal Control

Music Control

Leveraging our universal control interface, the humanoid can perform expressive, human-like dance motions synchronized to music. The choreography is generated by GENMO.



Text Control

Natural language commands are translated into human motions by GENMO and directly followed by the humanoid, enabling intuitive text-based control.

Backward Walking

Monkey Movement


Interactive Kinematic Planner

Stylized Locomotion

Our real-time kinematic planner enables interactive gamepad control with diverse locomotion styles, allowing the humanoid to navigate while maintaining distinct movement characteristics.
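
A toy sketch of the gamepad-to-planner mapping, assuming the planner consumes a desired base velocity plus a discrete style label (PlannerCommand and its fields are illustrative assumptions):

from dataclasses import dataclass

@dataclass
class PlannerCommand:
    forward_speed: float   # m/s, from left stick
    turn_rate: float       # rad/s, from right stick
    style: str             # e.g. "happy", "stealth", "injured"

def from_gamepad(stick_y, stick_x, style):
    # Scale normalized stick deflection in [-1, 1] to velocity limits.
    return PlannerCommand(forward_speed=1.5 * stick_y,
                          turn_rate=2.0 * stick_x,
                          style=style)

cmd = from_gamepad(stick_y=0.8, stick_x=-0.1, style="stealth")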

Happy Walking

Running


Stealth Walking

Injured Walking



Squatting, Kneeling, and Crawling

The kinematic planner supports diverse body configurations beyond standing locomotion, enabling low-posture movements essential for navigating constrained environments.

Squatting

Kneeling


Hand Crawling

Elbow Crawling



Boxing

Athletic motions produced by the planner demonstrate the policy's ability to track and execute dynamic, coordinated movements that require precise timing and balance.

Boxing

Boxing with Movement


Connection to VLA Foundation Model

We connect a VLA foundation model (GR00T N1.5) through the same universal control interface, combining high-level reasoning with fast, reactive whole-body control. This integration achieves a 95% success rate on a mobile manipulation task.
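
One way to picture the integration is a two-rate loop: the VLA refreshes a command token at a low rate while the whole-body policy reacts at a high rate through the shared interface. The rates and function names below are assumptions for illustration, not the released implementation:

import numpy as np

VLA_HZ, CONTROL_HZ = 5, 50  # illustrative: VLA reasons slowly, policy reacts fast

def vla_step(observation):
    # Hypothetical GR00T N1.5 call: return a motion command token.
    return np.zeros(32)

def policy_step(state, command_token):
    # Hypothetical whole-body policy conditioned on the command token.
    return np.zeros(29)

state, command = np.zeros(64), np.zeros(32)
for tick in range(CONTROL_HZ):               # one second of control
    if tick % (CONTROL_HZ // VLA_HZ) == 0:   # refresh command at the VLA rate
        command = vla_step(observation=None)
    action = policy_step(state, command)     # reactive control every tick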


Method

SONIC employs a universal control policy that seamlessly handles robot motion, human motion, and hybrid motion through a shared latent representation. Specialized encoders map each motion-command modality into a universal token space, enabling applications such as interactive gamepad control, VR teleoperation, video teleoperation, and multi-modal control from text and music.
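
The sketch below illustrates the universal-token idea under stated assumptions: each modality gets its own encoder projecting into one shared token space, which a single policy head consumes alongside proprioception. Dimensions and module names are illustrative, not the released architecture:

import torch
import torch.nn as nn

TOKEN_DIM = 256

class ModalityEncoder(nn.Module):
    # Projects one command modality (keypoints, full-body pose, ...)
    # into the shared token space.
    def __init__(self, input_dim):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(input_dim, TOKEN_DIM), nn.ReLU(),
                                  nn.Linear(TOKEN_DIM, TOKEN_DIM))

    def forward(self, x):
        return self.proj(x)

encoders = nn.ModuleDict({
    "vr_keypoints": ModalityEncoder(9),    # head + two hands
    "full_body":    ModalityEncoder(69),   # e.g., full joint targets
})
policy_head = nn.Linear(TOKEN_DIM + 64, 29)  # token + proprioception -> actions

token = encoders["vr_keypoints"](torch.zeros(1, 9))               # shared token
action = policy_head(torch.cat([token, torch.zeros(1, 64)], -1))  # same policy for any modality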

Method Overview

BibTeX


@article{luo2025sonic,
    title={SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control},
    author={Luo, Zhengyi and Yuan, Ye and Wang, Tingwu and Li, Chenran and Chen, Sirui and Casta\~neda, Fernando and Cao, Zi-Ang and Li, Jiefeng and Minor, David and Ben, Qingwei and Da, Xingye and Ding, Runyu and Hogg, Cyrus and Song, Lina and Lim, Edy and Jeong, Eugene and He, Tairan and Xue, Haoru and Xiao, Wenli and Wang, Zi and Yuen, Simon and Kautz, Jan and Chang, Yan and Iqbal, Umar and Fan, Linxi and Zhu, Yuke},
    journal={arXiv preprint arXiv:2511.xxxxx},
    year={2025}
}