SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control

Zhengyi Luo   Ye Yuan   Tingwu Wang   Chenran Li   Fernando Castañeda   Sirui Chen*  
Zi-Ang Cao*   Jiefeng Li*   David Minor*   Qingwei Ben*   Jinhyung Park*   David Sami*  
Zi Wang*   Xingye Da*   Runyu Ding   Cyrus Hogg   Lina Song   Edy Lim  
Eugene Jeong   Tairan He   Haoru Xue   Wenli Xiao  
Simon Yuen   Jan Kautz   Yan Chang   Umar Iqbal   Linxi "Jim" Fan   Yuke Zhu  
Co-First Authors     * Core Contributors
All models shown in the videos will be released.

Trailers

One Day at NVIDIA

Skill Montage

Abstract

Despite the rise of billion-parameter foundation models trained across thousands of GPUs, similar scaling gains have not been shown for humanoid control. Current neural controllers for humanoids remain modest in size, target a limited set of behaviors, and are trained on a handful of GPUs. We show that scaling model capacity, data, and compute yields a generalist humanoid controller capable of natural, robust whole-body movements. We position motion tracking as a scalable task for humanoid control, leveraging dense supervision from diverse motion-capture data to acquire human motion priors without manual reward engineering. We build a foundation model for motion tracking by scaling along three axes: network size (1.2M to 42M parameters), dataset volume (100M+ frames from 700 hours of motion capture), and compute (21k GPU hours). Beyond demonstrating the benefits of scale, we further show downstream utility through: (1) a real-time kinematic planner bridging motion tracking to tasks such as navigation, enabling natural and interactive control, and (2) a unified token space supporting VR teleoperation and vision-language-action (VLA) models with a single policy. Through this interface, we demonstrate autonomous VLA-driven whole-body loco-manipulation requiring coordinated hand and foot placement. Scaling motion tracking exhibits favorable properties: performance improves steadily with compute and data diversity, and learned policies generalize to unseen motions, establishing motion tracking at scale as a practical foundation for humanoid control.

All results are generated using a single unified control policy.

Connection to VLA Foundation Model

We connect a VLA foundation model (GR00T N1.5) through the same universal control interface, combining high-level reasoning with fast, reactive whole-body control. All policies are fully autonomous.

Drill

Dropping Soda into Trash Can


Carrot

Sponge


Trash Can

Apple


Teleoperation

Video Teleoperation

Using video as input and GEM for pose estimation, the policy tracks and reproduces complex motions from human demonstrations in real time.

Kung Fu

Crawling



VR Teleoperation with Keypoints

A hybrid control mode that uses only three VR tracking points (head and both hands) to drive upper-body humanoid motion, while a kinematic planner generates the lower-body motion, enabling intuitive manipulation tasks.

Lawn Mowing

Object Manipulation



VR Teleoperation with Whole Body

Full-body VR tracking captures the operator's complete body motion, enabling precise, natural humanoid control for complex whole-body manipulation tasks.

Whole-Body Tracking

Manipulation with Whole Body


Multi-Modal Control

Music Control

Leveraging our universal control interface, the humanoid can perform expressive, human-like dance motions synchronized to music. The choreography is generated by GEM.



Text Control

Natural language commands are translated into human motions by GEM and directly followed by the humanoid, enabling intuitive text-based control.

Backward Walking

Monkey Movement


Interactive Kinematic Planner

Stylized Locomotion

Our real-time kinematic planner enables interactive gamepad control with diverse locomotion styles, allowing the humanoid to navigate while maintaining distinct movement characteristics.

Happy Walking

Running


Stealth Walking

Injured Walking



Squatting, Kneeling, and Crawling

The kinematic planner supports diverse body configurations beyond standing locomotion, enabling low-posture movements essential for navigating constrained environments.

Squatting

Kneeling


Hand Crawling

Elbow Crawling



Boxing

Athletic motions produced by the planner demonstrate the policy's ability to track and execute dynamic, coordinated movements that require precise timing and balance.

Boxing

Boxing with Movement


Tracking Robustness

The policy demonstrates robust motion tracking under challenging conditions, maintaining stable whole-body control despite external perturbations.


Method

GEAR-SONIC employs a universal control policy that seamlessly handles robot motion, human motion, and hybrid motion through a shared latent representation. Specialized encoders project diverse motion commands into a universal token space, enabling applications including interactive gamepad control, VR teleoperation, video teleoperation, and multi-modal control from text and music.
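This interface can be illustrated with a minimal sketch: modality-specific encoders map their commands into a shared token space, and a single policy consumes those tokens together with proprioception. All names, dimensions, and the linear maps below are illustrative assumptions, not the SONIC implementation.

```python
import numpy as np

TOKEN_DIM = 64  # assumed width of the shared token space

class MotionEncoder:
    """Hypothetical encoder: maps one command modality into the token space."""
    def __init__(self, input_dim, rng):
        self.w = rng.standard_normal((input_dim, TOKEN_DIM)) * 0.01

    def encode(self, command):
        # A single tanh layer stands in for whatever network the real system uses.
        return np.tanh(command @ self.w)

class UnifiedPolicy:
    """One policy consumes tokens from any encoder plus robot proprioception."""
    def __init__(self, proprio_dim, action_dim, rng):
        self.w = rng.standard_normal((TOKEN_DIM + proprio_dim, action_dim)) * 0.01

    def act(self, token, proprio):
        return np.tanh(np.concatenate([token, proprio]) @ self.w)

rng = np.random.default_rng(0)
# Illustrative modalities and dimensions (e.g. full joint targets vs. 3 VR keypoints).
encoders = {
    "robot_motion": MotionEncoder(69, rng),
    "vr_keypoints": MotionEncoder(9, rng),   # head + two hands, xyz each
}
policy = UnifiedPolicy(proprio_dim=48, action_dim=23, rng=rng)

# Both modalities drive the same policy through the shared token space.
for name, enc in encoders.items():
    token = enc.encode(rng.standard_normal(enc.w.shape[0]))
    action = policy.act(token, rng.standard_normal(48))
    print(name, action.shape)
```

The key design point this sketch captures is that the policy never sees the raw command modality, only its token, so new input sources (video, text, music) can be attached by training a new encoder without retraining the controller.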

Method Overview

BibTeX


@article{luo2025sonic,
    title={SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control},
    author={Luo, Zhengyi and Yuan, Ye and Wang, Tingwu and Li, Chenran and Chen, Sirui and Casta\~neda, Fernando and Cao, Zi-Ang and Li, Jiefeng and Minor, David and Ben, Qingwei and Da, Xingye and Ding, Runyu and Hogg, Cyrus and Song, Lina and Lim, Edy and Jeong, Eugene and He, Tairan and Xue, Haoru and Xiao, Wenli and Wang, Zi and Yuen, Simon and Kautz, Jan and Chang, Yan and Iqbal, Umar and Fan, Linxi and Zhu, Yuke},
    journal={arXiv preprint arXiv:2511.07820},
    year={2025}
}