Training Code Structure#
This page describes the Python training codebase under gear_sonic/, covering directory layout, the training pipeline, configuration system, key modules, and evaluation scripts.
Directory Layout#
gear_sonic/
├── train_agent_trl.py # Main training entry point
├── eval_agent_trl.py # Single-checkpoint evaluation
├── eval_exp.py # Checkpoint monitor (continuous eval)
├── config/ # Hydra configuration hierarchy
│ ├── base.yaml # Global defaults (seed, num_envs, paths)
│ ├── base_eval.yaml # Eval-specific global defaults
│ ├── eval_exp.yaml # Checkpoint monitor config
│ ├── base/ # Hydra plumbing (output dirs, resolvers)
│ ├── algo/ # PPO hyperparameters
│ ├── actor_critic/ # Actor-critic architecture configs
│ │ ├── encoders/ # Per-encoder MLP configs (g1, smpl, teleop)
│ │ ├── decoders/ # Decoder MLP configs (g1_kin, g1_dyn)
│ │ ├── critics/ # Critic backbone configs
│ │ ├── quantizers/ # FSQ quantizer config
│ │ └── universal_token/ # Assembled encoder+decoder+quantizer presets
│ ├── aux_losses/ # Auxiliary loss definitions
│ ├── callbacks/ # Training callback configs
│ ├── exp/ # Experiment presets (compose all pieces)
│ ├── manager_env/ # Environment MDP component configs
│ ├── opt/ # Logging options (wandb)
│ └── trainer/ # Trainer class selection
├── envs/ # IsaacLab environment wrappers
│ ├── manager_env/
│ │ ├── modular_tracking_env_cfg.py # Scene, sensors, robot articulation
│ │ ├── robots/ # Per-robot configs (g1.py, h2.py)
│ │ └── mdp/ # MDP components (see below)
│ ├── wrapper/
│ │ └── manager_env_wrapper.py # RL-facing env wrapper
│ └── env_utils/ # Joint ordering utilities
├── trl/ # Training modules (PPO, actor-critic, losses)
│ ├── trainer/
│ │ ├── ppo_trainer.py # Base PPO trainer
│ │ └── ppo_trainer_aux_loss.py # PPO + auxiliary losses (SONIC)
│ ├── modules/
│ │ ├── actor_critic_modules.py # Actor, Critic classes
│ │ ├── universal_token_modules.py # UniversalTokenModule (SONIC ATM)
│ │ ├── base_module.py # Shared MLP building blocks
│ │ └── data_utils.py # Batch/data helpers
│ ├── losses/
│ │ └── token_losses.py # Reconstruction & latent auxiliary losses
│ ├── callbacks/ # Runtime callbacks
│ │ ├── im_eval_callback.py # Imitation evaluation metrics
│ │ ├── im_resample_callback.py # Adaptive motion resampling
│ │ ├── model_save_callback.py # Checkpoint saving
│ │ ├── wandb_callback.py # W&B logging
│ │ └── read_eval_callback.py # Read eval results from disk
│ └── utils/ # Math, rotation, scheduling utilities
├── utils/ # Shared utilities
│ ├── motion_lib/ # Motion library loading (PKL format)
│ ├── mujoco_sim/ # MuJoCo sim-to-sim bridge
│ └── teleop/ # VR teleoperation helpers
├── data/ # Robot models, URDF/USD assets
├── data_process/ # Motion data conversion scripts
└── scripts/ # MuJoCo sim loop, misc tools
Training Pipeline#
Running python gear_sonic/train_agent_trl.py +exp=manager/universal_token/all_modes/sonic_release executes the following steps:
1. Configuration Loading#
The entry point uses @hydra.main(config_path="config", config_name="base"). The +exp=... argument selects an experiment preset that composes all sub-configs:
base.yaml # Global defaults
└── +exp=manager/universal_token/all_modes/sonic_release
├── /algo: ppo_im_phc # PPO hyperparameters
├── /actor_critic: universal_token/all_mlp_v1
│ ├── encoders/g1_mf_mlp, smpl_mlp, teleop_mlp
│ ├── decoders/g1_kin_mf_mlp, g1_dyn_mlp
│ ├── quantizers/fsq
│ └── critics/mlp
├── /manager_env: base_env # Environment config
│ ├── observations/{tokenizer, policy, critic}
│ ├── rewards/tracking/base_5point_local_feet_acc
│ ├── terminations/tracking/base_adaptive_strict_ori_foot_xyz
│ └── events/tracking/level0_4
├── /aux_losses: universal_token/g1_recon_and_all_latent
├── /trainer: trl_ppo_aux
└── /callbacks: model_save, wandb, read_eval, im_resample
2. Simulator and Accelerator Init#
After config resolution, the script:
Parses TRL
PPOConfig/ScriptArguments/ModelConfigfrom the config dict.Creates a HuggingFace
Acceleratorfor multi-GPU support (DDP).Launches the IsaacLab
AppLauncherto start the Isaac Sim runtime.Saves
config.yamlandmeta.yamlto the experiment directory.
3. Environment Creation#
create_manager_env() instantiates the IsaacLab ManagerBasedRLEnv from the composed environment config, then wraps it with ManagerEnvWrapper:
ManagerBasedRLEnv (IsaacLab)
└── ManagerEnvWrapper
├── Observation spaces (policy, critic, tokenizer groups)
├── Motion command manager (motion_lib)
├── Action transform module (optional, for pretrained ATM)
└── Keyboard / visualization hooks
4. Policy and Value Model Creation#
The actor and critic are instantiated from the algo config. For SONIC training, the actor backbone is UniversalTokenModule:
# Simplified from train_agent_trl.py
policy = custom_instantiate(config.algo.config.actor, env_config=env.config, ...)
value_model = custom_instantiate(config.algo.config.critic, env_config=env.config, ...)
The Actor wraps UniversalTokenModule as its backbone and adds a diagonal Gaussian distribution for exploration. The Critic wraps a separate MLP backbone.
5. PPO Training Loop#
The TRLAuxLossPPOTrainer.train() method runs the main loop:
for iteration in range(num_learning_iterations):
# 1. Rollout: collect num_steps_per_env transitions
for step in range(num_steps_per_env):
actions = policy.rollout(obs_dict)
obs_dict, rewards, dones, infos = env.step(actions)
store(obs, actions, rewards, values, log_probs)
# 2. GAE: compute advantages and returns
advantages = generalized_advantage_estimation(rewards, values, dones)
# 3. PPO update: num_ppo_epochs over mini-batches
for epoch in range(num_ppo_epochs):
for mini_batch in shuffle_and_split(rollout_data):
policy_loss = clipped_surrogate_objective(...)
value_loss = clipped_value_loss(...)
aux_loss = sum(coef_i * aux_loss_i) # encoder reconstruction, etc.
total_loss = policy_loss + value_loss_coef * value_loss
+ aux_loss_scale * aux_loss
optimizer.step(total_loss)
# 4. Post-update: sync running stats, adaptive sampling, callbacks
update_scheduled_params(...) # learning rate, domain randomization
callbacks.on_step_end(...) # checkpointing, evaluation, logging
Configuration System#
The configuration system uses Hydra with config groups and composition.
Hierarchy#
Level |
Path |
Purpose |
|---|---|---|
Global |
|
Seed, num_envs, paths, wandb toggle |
Algorithm |
|
PPO hyperparameters, learning rates, epochs |
Actor-Critic |
|
Network architecture (encoders, decoders, critic) |
Environment |
|
Observations, rewards, terminations, events |
Auxiliary Losses |
|
Reconstruction and latent alignment losses |
Trainer |
|
Trainer class selection (PPO or PPO+AuxLoss) |
Callbacks |
|
Checkpointing, evaluation, W&B logging |
Experiment |
|
Preset that composes all the above |
Experiment Presets#
Experiment configs live under config/exp/ and use the @package _global_ directive to set values at the root level. They compose all component configs via defaults:
# config/exp/manager/universal_token/all_modes/sonic_release.yaml
defaults:
- /algo: ppo_im_phc
- /manager_env: base_env
- override /actor_critic: universal_token/all_mlp_v1
- override /manager_env/observations/tokenizer: unitoken_all_noz
- override /manager_env/observations/policy: local_dir_hist
- override /manager_env/rewards: tracking/base_5point_local_feet_acc
- override /manager_env/terminations: tracking/base_adaptive_strict_ori_foot_xyz
- override /manager_env/events: tracking/level0_4
# ...
Key Config Parameters#
Parameter |
Default |
Description |
|---|---|---|
|
4096 |
Number of parallel simulation environments |
|
100000 |
Total training iterations |
|
32 |
Rollout horizon per iteration |
|
5 |
PPO epochs per iteration |
|
4 |
Mini-batches per PPO epoch |
|
2e-5 |
Actor learning rate |
|
1e-3 |
Critic learning rate |
|
0.2 |
PPO clipping parameter |
|
0.05 |
Initial exploration noise std |
|
500 |
Checkpoint save frequency (iterations) |
Universal Token Module#
The UniversalTokenModule implements SONIC’s action transform module (ATM) – the core architecture that maps diverse motion inputs into a shared token space.
Architecture#
┌─────────────┐
G1 obs ───► │ G1 Encoder │──┐
└─────────────┘ │
┌─────────────┐ │ ┌─────────┐ ┌─────────────┐
Teleop obs───► │Teleop Encdr │──┼──► │ FSQ │──► │ G1 Dynamic │──► joint actions
└─────────────┘ │ │Quantizer│ │ Decoder │
┌─────────────┐ │ └─────────┘ └─────────────┘
SMPL obs ───► │ SMPL Encoder│──┘ │
└─────────────┘ │ ┌─────────────┐
└───────► │G1 Kinematic │──► (aux loss only)
│ Decoder │
└─────────────┘
Encoders map different observation modalities into a shared latent space. Each encoder is an MLP that takes modality-specific tokenizer observations and outputs a fixed-size latent vector. During training, one encoder is sampled per environment according to encoder_sample_probs.
FSQ Quantizer discretizes the continuous latent into a finite set of tokens using Finite Scalar Quantization. Each latent dimension is independently quantized to one of fsq_level_list discrete levels. This produces a compact, discrete token representation.
Decoders reconstruct outputs from the quantized tokens plus proprioception:
G1 Dynamic Decoder (
g1_dyn): Produces joint-space actions fed to the actuators. This is the only decoder used at deployment time.G1 Kinematic Decoder (
g1_kin): Reconstructs future motion frames from tokens. Used only during training to compute reconstruction auxiliary losses.
Latent Residual Mode#
For downstream tasks (e.g., object manipulation), an external policy can inject corrections into the token space without retraining the base ATM:
Mode |
Behavior |
|---|---|
|
Residual added after FSQ quantization |
|
Residual added before FSQ; the sum gets quantized |
|
Latent is replaced entirely by the residual |
Encoder Sampling#
During training, each environment is randomly assigned an encoder per episode according to encoder_sample_probs. The encoder_index observation tells the module which encoder produced the current token. At deployment, only one encoder is active (selected by the observation configuration).
Environment Structure#
The training environment is built on IsaacLab’s ManagerBasedRLEnv and uses a modular MDP design where each component is configured independently via YAML.
MDP Components#
All MDP components live in gear_sonic/envs/manager_env/mdp/:
Module |
Config path |
Description |
|---|---|---|
|
|
Observation terms for policy, critic, and tokenizer groups |
|
|
Joint position action space |
|
|
Reward terms (tracking, regularization) |
|
|
Episode termination conditions |
|
|
Domain randomization events |
|
|
Motion command generation (motion library) |
|
|
Curriculum schedules |
|
(inline) |
Terrain generation |
|
|
Video recording |
Observation Groups#
Observations are split into groups, each with its own config file:
Group |
Purpose |
Example terms |
|---|---|---|
policy |
Direct input to the policy MLP |
joint_pos, joint_vel, base_ang_vel, gravity_dir, last_actions |
critic |
Privileged observations for the value function |
All policy obs + base_lin_vel, body_pos, body_ori |
tokenizer |
Input to the UniversalTokenModule encoders |
Multi-future joint commands, SMPL joints, VR targets, anchor orientations |
Reward Terms#
Reward configs compose individual terms from config/manager_env/rewards/terms/. Key tracking rewards:
Term |
Description |
|---|---|
|
Track reference body positions (5-point: root, wrists, feet) |
|
Track reference body orientations |
|
Track root anchor position |
|
Track root anchor orientation |
|
Track reference body linear velocities |
|
Track reference body angular velocities |
|
Penalize action jerk |
|
Penalize foot acceleration (smoothness) |
ManagerEnvWrapper#
ManagerEnvWrapper bridges the IsaacLab environment with the RL training loop. It handles:
Flattening observation dicts for the policy
Applying the optional pretrained action transform module
Motion replay mode
Debug visualization and keyboard controls
Evaluation Scripts#
eval_agent_trl.py – Single Checkpoint#
Loads a single checkpoint and runs evaluation in Isaac Sim. Automatically reads the training config.yaml from the checkpoint directory to reconstruct the full configuration.
# Interactive visualization
python gear_sonic/eval_agent_trl.py +checkpoint=path/to/model.pt +headless=False ++num_envs=1
# Headless with video rendering
python gear_sonic/eval_agent_trl.py +checkpoint=path/to/model.pt +headless=True \
++num_envs=16 +run_once=True \
++manager_env.config.save_rendering_dir=path/to/output \
++manager_env.config.render_results=True \
+manager_env/recorders=render
Key features:
Merges training config with eval overrides (
eval_overridesin config)Removes train-only events and terminations automatically
Supports
+run_once=Trueto exit after all environments complete one episodeHandles
+metrics_fileto render worst-performing motions from a prior eval
eval_exp.py – Checkpoint Monitor#
CheckpointEvaluator continuously monitors an experiment directory for new checkpoints and evaluates them sequentially. It runs as a companion process alongside training.
python gear_sonic/eval_exp.py ++experiment_dir=path/to/experiment
For each new checkpoint, it:
Runs metrics evaluation (launches
eval_agent_trl.pyvia subprocess)Runs video rendering for the hardest motions
Logs results and videos to W&B (resuming the training run)
Marks each checkpoint as evaluated to avoid redundant work
Configuration (config/eval_exp.yaml):
Parameter |
Description |
|---|---|
|
Path to the training experiment directory |
|
Seconds between checkpoint scans (default: 60) |
|
Number of environments for metric evaluation |
|
Number of videos to render per checkpoint |
|
Only evaluate every N-th checkpoint (default: all) |
|
Evaluate pending checkpoints once and exit |
Key Classes Reference#
Class |
Module |
Description |
|---|---|---|
|
|
Policy network: backbone + diagonal Gaussian. Maintains observation buffer for temporal models. |
|
|
Value function network: backbone + scalar output. Supports running mean/std normalization. |
|
|
SONIC ATM: multi-encoder, FSQ quantizer, multi-decoder. Computes auxiliary reconstruction losses. |
|
|
Base PPO trainer adapted from HuggingFace TRL. Handles rollout collection, GAE, and gradient updates. |
|
|
Extends |
|
|
Wraps policy + value model into a single |
|
|
Bridges IsaacLab |
|
|
Monitors experiment directory, evaluates new checkpoints, logs to W&B. |