Troubleshooting

Troubleshooting#

Common issues and solutions. If your problem isn’t listed here, check the GitHub issues page.

1. `ModuleNotFoundError: No module named 'isaaclab'`#

Symptom: Training or eval script exits immediately with an import error.

Cause: Isaac Lab is not installed, or you’re running in the wrong Python environment. Isaac Lab is not a pip dependency — it must be installed separately.

Fix:

Install Isaac Lab following the official guide.

Make sure you activate the correct conda/venv environment before running:

conda activate env_isaaclab  # or whatever you named it
python -c "import isaaclab; print(isaaclab.__version__)"

2. Mesh files are tiny text files (Git LFS not installed)#

Symptom: Simulation crashes or renders an invisible/broken robot. Mesh files (.stl, .STL) are ~130 bytes and contain text like version https://git-lfs.github.com/spec/v1.

Cause: The repo was cloned without Git LFS. Large files (meshes, ONNX models) are stored via Git LFS and need to be fetched separately.

Fix:

sudo apt install git-lfs
git lfs install
git lfs pull

Verify: ls -la gear_sonic/data/assets/robot_description/urdf/g1/main.urdf should be ~60KB+, not ~130 bytes.

4. `trl` / `transformers` version conflict during pip install#

Symptom: pip install -e "gear_sonic/[training]" fails with a dependency resolution error about incompatible transformers versions.

Cause: trl==0.28.0 requires transformers>=4.56.2. If you have an older transformers pinned or installed, pip cannot resolve.

Fix:

pip install -e "gear_sonic/[training]" --upgrade

Or install in a fresh environment. If you need a specific transformers version for another project, use a separate venv for SONIC training.

5. TensorRT build fails (`TensorRT_ROOT` not set)#

Symptom: CMake error during C++ deployment build:

Could not find a package configuration file provided by "TensorRT"

Cause: The TensorRT_ROOT environment variable is not set, or TensorRT is not installed.

Fix:

Download the correct TensorRT version (TAR package, not DEB):

Platform

TensorRT Version

x86_64 (Desktop)

10.13 (required)

Jetson / G1 onboard Orin

10.7 (required; JetPack 6)

Extract and set the environment variable:

export TensorRT_ROOT=$HOME/TensorRT
echo 'export TensorRT_ROOT=$HOME/TensorRT' >> ~/.bashrc

6. Motion file path errors (`FileNotFoundError` or empty motion library)#

Symptom: Training crashes with FileNotFoundError on a motion path, or starts but logs 0 motions loaded.

Cause: The experiment config has placeholder paths (e.g., data/motion_lib_bones_seed/robot_filtered) that don’t exist on your machine. Motion data paths must be provided on the command line.

Fix: Always pass motion data paths explicitly:

python gear_sonic/train_agent_trl.py \
    +exp=manager/universal_token/all_modes/sonic_release \
    ++manager_env.commands.motion.motion_lib_cfg.motion_file=<path/to/robot_filtered> \
    ++manager_env.commands.motion.motion_lib_cfg.smpl_motion_file=<path/to/smpl_filtered>

For quick testing, download the sample data from HuggingFace:

hf download nvidia/GEAR-SONIC --include "sample_data/*" --local-dir .

7. Body name errors (`RuntimeError: body 'xxx' not found`)#

Symptom: Isaac Lab crashes with an error about a body/joint name not found in the robot’s articulation.

Cause: A config YAML references a body name that doesn’t exist on your robot. This commonly happens when using G1 configs with a different robot (e.g., H2).

Fix: Check which body name failed and find where it’s referenced:

grep -rn "the_failing_body_name" gear_sonic/config/

Override the body name in your experiment config, or check the Training on New Embodiments guide for the full list of config files that reference body names.

8. Robot explodes or falls immediately on first frame#

Symptom: The robot ragdolls, flies away, or collapses instantly when simulation starts.

Cause: Usually one of:

Init state height is wrong — the robot spawns inside the ground or too high. Check init_state.pos in your robot config (the z-value is spawn height).
KP/KD values are wrong — if stiffness (KP) is too low, joints have no holding torque. If too high, the simulation becomes unstable. See Training on New Embodiments for tuning guidance.
Action scale is too large — the policy outputs move joints too aggressively. Reduce action_scale values.
Default joint angles are wrong — the robot starts in an impossible pose. Check init_state.joint_pos matches a stable standing configuration.

Debug: Run with num_envs=1 headless=False and watch the first few frames.

9. Robot behaves weirdly during deployment (wrong TensorRT version)#

Symptom: The robot stands but moves erratically, drifts, or produces unnatural motions during C++ deployment — even though the same checkpoint works correctly in Isaac Lab or MuJoCo simulation.

Cause: You are using a different TensorRT version than required. TensorRT version mismatches produce silently wrong inference results — the model runs without errors but outputs incorrect actions.

Fix: You must use the exact TensorRT versions:

Platform	Required Version
x86_64 (Desktop)	TensorRT 10.13
Jetson / G1 onboard Orin	TensorRT 10.7 (JetPack 6)

Verify your version:

echo $TensorRT_ROOT
ls $TensorRT_ROOT/lib/libnvinfer.so*

If the version is wrong, download the correct one from NVIDIA Developer and rebuild the C++ deployment binary.

10. `ChannelFactory create domain error` in MuJoCo sim#

Symptom: run_sim_loop.py crashes with:

[ChannelFactory] create domain error. msg: Occurred upon initialisation
of a cyclonedds.domain.Domain

Cause: CycloneDDS domain initialization conflict. The SimulatorFactory reinitializes a channel that was already created.

Fix: This is a known issue (#77). Workaround: comment out the duplicate channel init in the simulator factory, or ensure no other DDS process is using the same domain on your machine.

11. SMPL tracking is unstable or drifts#

Symptom: The robot follows G1 motion tracking well but drifts or becomes unstable when using SMPL encoder inputs.

Cause: SMPL data may have mismatched coordinate conventions (y-up vs z-up), incorrect joint ordering, or the SMPL-to-robot retargeting quality is poor.

Fix:

Verify smpl_y_up: true is set in your config if your SMPL data uses y-up coordinates.
Check that the SMPL PKL files have the correct shape: smpl_joints should be (T, 24, 3).
Try training with smpl_motion_file: dummy first to confirm the robot encoder works before adding SMPL.

12. MuJoCo viewer renders incorrectly in Docker#

Symptom: MuJoCo window is black, garbled, or shows rendering artifacts when running inside Docker on a machine with an Intel display controller.

Cause: GPU passthrough or display driver conflict between the Intel iGPU and NVIDIA dGPU inside Docker.

Fix: Force NVIDIA GPU rendering:

export __NV_PRIME_RENDER_OFFLOAD=1
export __GLX_VENDOR_LIBRARY_NAME=nvidia

Or run with --gpus all -e DISPLAY=$DISPLAY in your Docker run command. See #25 for details.

13. `deploy.sh` fails to bind ZMQ port 5557 on Orin#

Symptom: deploy.sh exits with a ZMQ bind error on port 5557.

Cause: A Unitree system service (iphone_server.service) is already listening on port 5557.

Fix:

sudo systemctl stop iphone_server.service

Then re-run the deployment. The service restarts on the next boot; to keep it stopped across reboots use sudo systemctl disable iphone_server.service.

Still stuck?#

Search existing issues
Open a new issue with your error message, Python version, and OS

Troubleshooting

Contents

Troubleshooting#

1. `ModuleNotFoundError: No module named 'isaaclab'`#

2. Mesh files are tiny text files (Git LFS not installed)#

3. `RuntimeError: size mismatch` when loading a checkpoint#

4. `trl` / `transformers` version conflict during pip install#

5. TensorRT build fails (`TensorRT_ROOT` not set)#

6. Motion file path errors (`FileNotFoundError` or empty motion library)#

7. Body name errors (`RuntimeError: body 'xxx' not found`)#

8. Robot explodes or falls immediately on first frame#

9. Robot behaves weirdly during deployment (wrong TensorRT version)#

10. `ChannelFactory create domain error` in MuJoCo sim#

11. SMPL tracking is unstable or drifts#

12. MuJoCo viewer renders incorrectly in Docker#

13. `deploy.sh` fails to bind ZMQ port 5557 on Orin#

Still stuck?#

Platform	TensorRT Version
x86_64 (Desktop)	10.13 (required)
Jetson / G1 onboard Orin	10.7 (required; JetPack 6)

Troubleshooting

Contents

Troubleshooting#

1. ModuleNotFoundError: No module named 'isaaclab'#

2. Mesh files are tiny text files (Git LFS not installed)#

3. RuntimeError: size mismatch when loading a checkpoint#

4. trl / transformers version conflict during pip install#

5. TensorRT build fails (TensorRT_ROOT not set)#

6. Motion file path errors (FileNotFoundError or empty motion library)#

7. Body name errors (RuntimeError: body 'xxx' not found)#

8. Robot explodes or falls immediately on first frame#

9. Robot behaves weirdly during deployment (wrong TensorRT version)#

10. ChannelFactory create domain error in MuJoCo sim#

11. SMPL tracking is unstable or drifts#

12. MuJoCo viewer renders incorrectly in Docker#

13. deploy.sh fails to bind ZMQ port 5557 on Orin#

Still stuck?#

1. `ModuleNotFoundError: No module named 'isaaclab'`#

3. `RuntimeError: size mismatch` when loading a checkpoint#

4. `trl` / `transformers` version conflict during pip install#

5. TensorRT build fails (`TensorRT_ROOT` not set)#

6. Motion file path errors (`FileNotFoundError` or empty motion library)#

7. Body name errors (`RuntimeError: body 'xxx' not found`)#

10. `ChannelFactory create domain error` in MuJoCo sim#

13. `deploy.sh` fails to bind ZMQ port 5557 on Orin#