Despite the rise of billion-parameter foundation models trained across thousands of GPUs, similar scaling gains have not been shown for humanoid control. Current neural controllers for humanoids remain modest in size, target a limited set of behaviors, and are trained on a handful of GPUs. We show that scaling model capacity, data, and compute yields a generalist humanoid controller capable of natural, robust whole-body movements. We position motion tracking as a scalable task for humanoid control, leveraging dense supervision from diverse motion-capture data to acquire human motion priors without manual reward engineering. We build a foundation model for motion tracking by scaling along three axes: network size (1.2M to 42M parameters), dataset volume (100M+ frames from 700 hours of motion capture), and compute (21k GPU hours). Beyond demonstrating the benefits of scale, we further show downstream utility through: (1) a real-time kinematic planner bridging motion tracking to tasks such as navigation, enabling natural and interactive control, and (2) a unified token space supporting VR teleoperation and vision-language-action (VLA) models with a single policy. Through this interface, we demonstrate autonomous VLA-driven whole-body loco-manipulation requiring coordinated hand and foot placement. Scaling motion tracking exhibits favorable properties: performance improves steadily with compute and data diversity, and learned policies generalize to unseen motions, establishing motion tracking at scale as a practical foundation for humanoid control.
We connect a VLA foundation model (GR00T N1.5) to the humanoid through the same universal control interface, combining high-level reasoning with fast, reactive whole-body control. All policies are fully autonomous.
With video as input and GEM for pose estimation, the humanoid tracks and reproduces complex motions from human demonstrations in real time.
A hybrid control mode uses only three VR tracking points (head and hands) to drive the upper body, while a kinematic planner generates the lower-body motion, enabling intuitive manipulation tasks.
Full-body VR tracking captures the operator's complete body motion, enabling precise and natural humanoid control for complex whole-body manipulation tasks.
Leveraging our universal control interface, the humanoid can perform expressive, human-like dance motions synchronized to music. The choreography is generated by GEM.
Natural language commands are translated into human motions by GEM and directly followed by the humanoid, enabling intuitive text-based control.
Our real-time kinematic planner enables interactive gamepad control with diverse locomotion styles, allowing the humanoid to navigate while maintaining distinct movement characteristics.
The kinematic planner supports diverse body configurations beyond standing locomotion, enabling low-posture movements essential for navigating constrained environments.
Athletic motions produced by the planner demonstrate the policy's ability to track and execute dynamic, coordinated movements that require precise timing and balance.
The policy demonstrates robust motion tracking under challenging conditions, maintaining stable whole-body control despite external perturbations.
GEAR-SONIC employs a universal control policy that seamlessly handles robot motion, human motion, and hybrid motion through a shared latent representation. Specialized encoders map diverse motion commands into a universal token space, enabling applications such as interactive gamepad control, VR teleoperation, video teleoperation, and multi-modal control from text and music.
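As a rough illustration of this design, the sketch below shows how modality-specific encoders could map different command sources into one shared token space consumed by a single policy. The module names, feature dimensions, and network sizes are placeholders for illustration, not the actual SONIC architecture.

```python
import torch
import torch.nn as nn


class CommandEncoder(nn.Module):
    """Maps one command modality (e.g., robot motion, VR tracking points)
    into the shared token space. Sizes here are illustrative only."""

    def __init__(self, input_dim: int, token_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, token_dim),
        )

    def forward(self, command: torch.Tensor) -> torch.Tensor:
        return self.net(command)


class UniversalTokenController(nn.Module):
    """Routes each command through its modality-specific encoder, then feeds
    the resulting token together with proprioception to one shared policy."""

    def __init__(self, modality_dims: dict, proprio_dim: int,
                 token_dim: int = 64, action_dim: int = 29):
        super().__init__()
        self.encoders = nn.ModuleDict({
            name: CommandEncoder(dim, token_dim)
            for name, dim in modality_dims.items()
        })
        self.policy = nn.Sequential(
            nn.Linear(token_dim + proprio_dim, 512),
            nn.ELU(),
            nn.Linear(512, action_dim),
        )

    def forward(self, modality: str, command: torch.Tensor,
                proprio: torch.Tensor) -> torch.Tensor:
        token = self.encoders[modality](command)      # shared token space
        return self.policy(torch.cat([token, proprio], dim=-1))


# Usage: the same policy consumes tokens from any command source
# (dimensions below are hypothetical).
controller = UniversalTokenController(
    modality_dims={"robot_motion": 87, "vr_3point": 27, "human_motion": 69},
    proprio_dim=93,
)
action = controller("vr_3point", torch.randn(1, 27), torch.randn(1, 93))
```

Because every command source collapses into the same token interface, downstream consumers such as the kinematic planner, VR teleoperation, or a VLA model only need to emit tokens; the control policy itself stays unchanged.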
@article{luo2025sonic,
title={SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control},
author={Luo, Zhengyi and Yuan, Ye and Wang, Tingwu and Li, Chenran and Chen, Sirui and Casta\~neda, Fernando and Cao, Zi-Ang and Li, Jiefeng and Minor, David and Ben, Qingwei and Da, Xingye and Ding, Runyu and Hogg, Cyrus and Song, Lina and Lim, Edy and Jeong, Eugene and He, Tairan and Xue, Haoru and Xiao, Wenli and Wang, Zi and Yuen, Simon and Kautz, Jan and Chang, Yan and Iqbal, Umar and Fan, Linxi and Zhu, Yuke},
journal={arXiv preprint arXiv:2511.07820},
year={2025}
}