# Configuration Guide
SONIC uses Hydra for hierarchical configuration. This guide explains the config structure and the most important parameters to tune.
## Config Hierarchy

When you run a training command like:

```bash
python gear_sonic/train_agent_trl.py +exp=manager/universal_token/all_modes/sonic_release
```

Hydra composes the final config from a chain of YAML files:
```text
gear_sonic/config/
├── base.yaml                      # Global defaults (seed, num_envs, paths)
├── base/
│   ├── hydra.yaml                 # Hydra output directory settings
│   └── structure.yaml             # Resolved experiment directory structure
├── algo/
│   └── ppo_im_phc.yaml            # PPO hyperparameters
├── manager_env/
│   ├── base_env.yaml              # Environment defaults (sim_dt, decimation, episode length)
│   ├── actions/tracking/base.yaml
│   ├── commands/tracking/base.yaml
│   │   └── terms/motion.yaml      # Motion library, body names, future frames
│   ├── rewards/tracking/
│   │   ├── base_5point_local_feet_acc.yaml        # Reward composition
│   │   └── terms/*.yaml           # Individual reward terms with weights
│   ├── terminations/tracking/
│   │   ├── base_adaptive_strict_ori_foot_xyz.yaml # Termination composition
│   │   └── terms/*.yaml           # Individual termination conditions
│   ├── events/tracking/
│   │   └── level0_4.yaml          # Domain randomization events
│   └── observations/
│       ├── tokenizer/             # Encoder input observations
│       ├── policy/                # Policy (actor) observations
│       └── critic/                # Critic observations
├── actor_critic/
│   └── universal_token/           # Network architecture (encoders, decoders, quantizer)
├── aux_losses/
│   └── universal_token/           # Auxiliary loss terms
├── trainer/
│   └── trl_ppo_aux.yaml           # Trainer config (PPO with aux losses)
├── callbacks/                     # Training callbacks (save, eval, W&B, resample)
└── exp/manager/universal_token/all_modes/
    └── sonic_release.yaml         # Experiment config (overrides all of the above)
```
The experiment config (`sonic_release.yaml`) sits at the top of this chain and overrides specific values from the base configs. You can further override any value from the command line with `++key=value`.
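To make the composition concrete, here is a hypothetical experiment config in the shape Hydra expects. This is an illustrative sketch only, not the actual contents of `sonic_release.yaml`; the group choices and keys shown are assumptions:

```yaml
# exp/manager/universal_token/all_modes/sonic_release.yaml (illustrative sketch)
# @package _global_
defaults:
  - override /algo: ppo_im_phc       # choose the PPO hyperparameter set
  - override /trainer: trl_ppo_aux   # trainer with auxiliary losses
  - _self_                           # apply this file's values last

num_envs: 4096   # values set here win over the base configs
seed: 0
```

Because `_self_` comes last in the defaults list, anything written directly in the experiment file overrides the composed base configs.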
## Overriding Config Values

Hydra's `++` prefix adds or overrides a value, even a deeply nested one, whether or not the key already exists in the composed config:

```bash
# Override a top-level value
python gear_sonic/train_agent_trl.py +exp=... num_envs=16

# Override a nested value (use dots for nesting)
python gear_sonic/train_agent_trl.py +exp=... \
    ++manager_env.commands.motion.motion_lib_cfg.motion_file=/path/to/data

# Override a reward weight
python gear_sonic/train_agent_trl.py +exp=... \
    ++manager_env.rewards.tracking_anchor_pos.weight=1.0
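The dotted path simply walks the nested config tree. A minimal sketch of the mechanics, with plain dicts standing in for the composed config (Hydra/OmegaConf handle this internally):

```python
def apply_override(cfg: dict, dotted_key: str, value):
    """Set a nested config value from a dotted path, like Hydra's ++key=value."""
    *parents, leaf = dotted_key.split(".")
    node = cfg
    for key in parents:
        node = node.setdefault(key, {})  # ++ also creates missing keys
    node[leaf] = value

cfg = {"manager_env": {"rewards": {"tracking_anchor_pos": {"weight": 0.5}}}}
apply_override(cfg, "manager_env.rewards.tracking_anchor_pos.weight", 1.0)
print(cfg["manager_env"]["rewards"]["tracking_anchor_pos"]["weight"])  # 1.0
```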
## Top Parameters to Tune

### Training scale

| Parameter | Default | Location | Description |
|---|---|---|---|
| `num_envs` | 4096 | `base.yaml` | Number of parallel environments. Reduce for debugging (e.g. `num_envs=16`). |
| `headless` | True | `base.yaml` | Set `headless=False` to render the simulation viewer. |
| `seed` | 0 | `base.yaml` | Random seed for reproducibility. |
### PPO hyperparameters

| Parameter | Default | Location | Description |
|---|---|---|---|
| `actor_learning_rate` | 2e-5 | `algo/ppo_im_phc.yaml` | Actor learning rate. Lower for finetuning, higher for training from scratch. |
| `critic_learning_rate` | 1e-3 | `algo/ppo_im_phc.yaml` | Critic learning rate. Usually 10-100x the actor LR. |
| `num_learning_epochs` | 5 | `algo/ppo_im_phc.yaml` | PPO epochs per batch of experience. |
| `num_mini_batches` | 4 | `algo/ppo_im_phc.yaml` | Mini-batches per PPO epoch. |
| `num_steps_per_env` | 24 | `algo/ppo_im_phc.yaml` | Rollout length (steps per env before a PPO update). |
| `gamma` | 0.99 | `algo/ppo_im_phc.yaml` | Discount factor. |
| `lam` | 0.95 | `algo/ppo_im_phc.yaml` | GAE lambda. |
| `clip_param` | 0.2 | `algo/ppo_im_phc.yaml` | PPO clip parameter. |
| `entropy_coef` | 0.01 | `algo/ppo_im_phc.yaml` | Entropy bonus coefficient. |
| `desired_kl` | 0.01 | `algo/ppo_im_phc.yaml` | Target KL for the adaptive learning-rate schedule. |
| `num_learning_iterations` | 100000 | `algo/ppo_im_phc.yaml` | Total training iterations. |
### Simulation

| Parameter | Default | Location | Description |
|---|---|---|---|
| `sim_dt` | 0.005 | `manager_env/base_env.yaml` | Physics timestep (200 Hz). Smaller = more stable but slower. |
| `decimation` | 4 | `manager_env/base_env.yaml` | Policy acts every `decimation` physics steps (50 Hz control at the defaults). |
| `episode_length_s` | 10.0 | `manager_env/base_env.yaml` | Episode length in seconds before timeout reset. |
| `terrain_type` | trimesh | `manager_env/base_env.yaml` | Terrain type. Set to `plane` for flat ground. |
| — | g1_model_12_dex | `manager_env/base_env.yaml` | Robot type (must match the robot used in the motion data). |
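The control rate follows directly from the defaults above (`sim_dt=0.005`, `decimation=4`). A quick check of the arithmetic:

```python
sim_dt = 0.005   # physics timestep in seconds (200 Hz)
decimation = 4   # policy acts once per 4 physics steps

physics_hz = 1.0 / sim_dt
control_hz = 1.0 / (sim_dt * decimation)
print(physics_hz, control_hz)  # 200.0 50.0
```

Halving `sim_dt` without doubling `decimation` would also double the control frequency, so the two are usually tuned together.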
### Motion data

| Parameter | Default | Location | Description |
|---|---|---|---|
| `motion_file` | — | `commands/tracking/terms/motion.yaml` | Path to retargeted robot motion PKLs. |
| — | — | `commands/tracking/terms/motion.yaml` | Path to SMPL motion PKLs. |
| — | — | `commands/tracking/terms/motion.yaml` | Path to SOMA motion PKLs (4-encoder config only). |
| — | true | `commands/tracking/terms/motion.yaml` | — |
| — | 50 | `commands/tracking/terms/motion.yaml` | Target FPS for motion resampling. |
| — | g1_29dof_rev_1_0.xml | `commands/tracking/terms/motion.yaml` | MJCF file for motion library FK. Change for different robots. |
### Motion command

| Parameter | Default | Location | Description |
|---|---|---|---|
| — | 10 | `commands/tracking/terms/motion.yaml` | Number of future reference frames provided to the policy. |
| — | 0.1 | `commands/tracking/terms/motion.yaml` | Time spacing between future frames (seconds). |
| — | true | `commands/tracking/terms/motion.yaml` | Augment lower-body motions with upper-body motion from different clips. |
| — | 0.5 | `commands/tracking/terms/motion.yaml` | Probability of upper-body augmentation per episode. |
| — | true | `commands/tracking/terms/motion.yaml` | Augment with frozen (static) reference frames. |
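With 10 future frames spaced 0.1 s apart, the policy sees reference poses up to 1 s ahead. A quick sketch of the resulting time offsets, assuming the first frame sits one spacing ahead of the current time (the variable names here are hypothetical, not the actual config keys):

```python
num_future_frames = 10  # hypothetical name for the frame-count parameter
future_dt = 0.1         # hypothetical name for the frame spacing (seconds)

# Lookahead times of the reference frames, relative to "now"
offsets = [round((i + 1) * future_dt, 2) for i in range(num_future_frames)]
print(offsets)  # [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
```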
### Observation history

| Parameter | Default | Location | Description |
|---|---|---|---|
| — | 10 | `manager_env/observations/policy/` | Number of past proprioception frames stacked for the actor. |
| — | 10 | `manager_env/observations/policy/` | Number of past actions stacked for the actor. |
| — | 10 | `manager_env/observations/critic/` | Number of past proprioception frames stacked for the critic. |
| — | 10 | `manager_env/observations/critic/` | Number of past actions stacked for the critic. |
Reward weights#
All reward terms have a weight parameter. Positive weights encourage the behavior,
negative weights penalize it. The default weights for base_5point_local_feet_acc:
Reward term |
Weight |
Description |
|---|---|---|
|
0.5 |
Root position tracking |
|
0.5 |
Root orientation tracking |
|
1.0 |
Body position tracking (anchor-relative) |
|
1.0 |
Body orientation tracking (anchor-relative) |
|
1.0 |
Body linear velocity tracking |
|
1.0 |
Body angular velocity tracking |
|
2.0 |
5-point (wrists + head + feet) local tracking |
|
-0.1 |
Smooth actions (penalize jerk) |
|
-10.0 |
Stay within joint limits |
|
-0.1 |
Penalize non-foot ground contacts |
|
-0.005 |
Penalize wrist/head jitter |
|
-2.5e-6 |
Penalize foot acceleration (smooth stepping) |
Each reward term also has a `std` parameter controlling the sharpness of the Gaussian kernel: smaller `std` means stricter tracking (the reward drops faster with error). Override example:

```bash
++manager_env.rewards.tracking_anchor_pos.weight=2.0
++manager_env.rewards.tracking_anchor_pos.params.std=0.1
```
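To see what `std` does, here is a sketch assuming the common exponential kernel `exp(-(err/std)^2)`; SONIC's exact kernel may differ, but the qualitative effect is the same:

```python
import math

def tracking_reward(err: float, std: float) -> float:
    # Gaussian-style kernel: 1.0 at zero error, decaying as err grows past std
    return math.exp(-((err / std) ** 2))

err = 0.1  # a 10 cm position error
print(tracking_reward(err, std=0.5))  # loose kernel:  ~0.96, error barely matters
print(tracking_reward(err, std=0.1))  # strict kernel: ~0.37, same error costs a lot
```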
### Termination thresholds

Terminations end episodes early when tracking error exceeds a threshold. The adaptive variants use a curriculum that tightens thresholds over training:

| Termination | Threshold | Description |
|---|---|---|
| `anchor_pos` | 0.15 m | Root position deviation |
| `anchor_ori` | 0.2 rad | Root orientation deviation |
| `ee_body_pos` | 0.15 m | End-effector position deviation |
| — | 0.2 m | Foot position deviation |
| — | — | Episode ends when the motion clip finishes |
Looser thresholds (larger values) make training easier initially. The adaptive terminations automatically tighten as the policy improves.
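One way such a curriculum can be realized is by interpolating each threshold from a loose starting value down to its strict target as training progresses. This is an illustrative sketch, not SONIC's actual schedule; the starting value of 0.3 m is borrowed from the "relax thresholds" recipe below:

```python
def adaptive_threshold(step: int, total_steps: int,
                       start: float = 0.3, target: float = 0.15) -> float:
    """Linearly tighten a termination threshold over training."""
    frac = min(step / total_steps, 1.0)
    return start + (target - start) * frac

print(adaptive_threshold(0, 10_000))      # 0.3   (loose early in training)
print(adaptive_threshold(10_000, 10_000)) # ~0.15 (strict once trained)
```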
### Adaptive motion sampling

The motion library supports adaptive sampling, in which motions the policy fails on are sampled more frequently:

| Parameter | Default | Description |
|---|---|---|
| — | true | Enable adaptive sampling. |
| — | 50 | Window size for failure-rate tracking. |
| — | 200 | Cap on the max/mean failure-rate ratio. Prevents one hard motion from dominating. |
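A sketch of failure-weighted sampling with such a ratio cap, assuming sampling weights proportional to recent failure rates (illustrative; SONIC's exact scheme may differ):

```python
def sampling_probs(failure_rates, max_ratio=200.0):
    """Sample hard motions more often, capping any one motion's dominance."""
    eps = 1e-6
    w = [r + eps for r in failure_rates]         # never fully drop a motion
    mean_w = sum(w) / len(w)
    w = [min(x, max_ratio * mean_w) for x in w]  # cap each weight at max_ratio x mean
    total = sum(w)
    return [x / total for x in w]

# The motion failed 90% of the time gets sampled far more than the easy ones
probs = sampling_probs([0.9, 0.1, 0.0])
```

Without the cap, a single impossible motion (failure rate pinned at 1.0 while the rest approach 0) would absorb nearly the whole sampling budget.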
### Saving and logging

| Parameter | Default | Location | Description |
|---|---|---|---|
| — | 500 | `callbacks/` | Save a checkpoint every N iterations. |
| — | 500 | `callbacks/` | Run evaluation every N iterations. |
| — | false | `callbacks/` | Enable Weights & Biases logging. |
| — | logs_rl | `base.yaml` | Root directory for training outputs. |
## Experiment Configs

| Config | Encoders | Use case |
|---|---|---|
| `sonic_release` | G1, teleop, SMPL | Default; matches the released checkpoint |
| — | G1, teleop, SMPL, SOMA | Extended training with SOMA skeleton encoder |
| — | G1, teleop, SMPL | H2 robot (31 DOF) |
## Common Recipes

### Debug a training run visually

```bash
python gear_sonic/train_agent_trl.py +exp=... \
    num_envs=4 headless=False \
    algo.config.num_learning_iterations=10
```

### Finetune with a lower learning rate

```bash
python gear_sonic/train_agent_trl.py +exp=... \
    +checkpoint=sonic_release/last.pt \
    ++algo.config.actor_learning_rate=5e-6 \
    ++algo.config.desired_kl=0.005
```

### Train on flat ground only

```bash
python gear_sonic/train_agent_trl.py +exp=... \
    ++manager_env.config.terrain_type=plane
```

### Relax termination thresholds for hard motions

```bash
python gear_sonic/train_agent_trl.py +exp=... \
    ++manager_env.terminations.anchor_pos.params.threshold=0.3 \
    ++manager_env.terminations.ee_body_pos.params.threshold=0.3
```

### Increase tracking precision

```bash
python gear_sonic/train_agent_trl.py +exp=... \
    ++manager_env.rewards.tracking_relative_body_pos.params.std=0.1 \
    ++manager_env.rewards.tracking_anchor_pos.params.std=0.1
```