VLA Inference#
This guide covers running a trained Isaac-GR00T VLA policy on the Unitree G1 robot using the Sonic whole-body control stack.
Overview#
The inference pipeline consists of:
Isaac-GR00T PolicyServer — loads the VLA model and serves actions over ZMQ
VLA inference client (
run_vla_inference.py) — reads camera + robot state, queries the PolicyServer, and publishes actions to the C++ control loopC++ deploy (
gear_sonic_deploy) — executes whole-body control on the robotCamera server — provides camera images over ZMQ (runs as a systemd service)
Data exporter (optional) — records episodes during inference
┌──────────────────────┐
│ Isaac-GR00T │
│ PolicyServer │
│ (GPU machine) │
└──────┬───────────────┘
│ ZMQ REQ/REP
▼
┌─────────────────────┐ ZMQ TCP ┌──────────────────────┐
│ VLA Inference │ ◄─────────── │ Camera Server │
│ (run_vla_inference)│ │ (on robot) │
└────┬───────────┬────┘ └──────────────────────┘
│ │
│ ZMQ PUB │ ZMQ SUB
│ (actions) │ (state)
▼ ▼
┌─────────────────────┐
│ C++ Deploy │
│ (gear_sonic_deploy)│
└─────────────────────┘
Prerequisites#
1. Isaac-GR00T PolicyServer#
The PolicyServer runs on a machine with a GPU. It loads your finetuned VLA model and serves inference over ZMQ.
Install Isaac-GR00T and start the server:
# On the GPU machine (from the Isaac-GR00T repo)
uv run python gr00t/eval/run_gr00t_server.py \
--model-path /path/to/your/finetuned_model \
--embodiment-tag UNITREE_G1_SONIC \
--device cuda:0 \
--port 5550
2. Inference Environment#
On the inference machine (can be the same as the PolicyServer or a separate PC):
bash install_scripts/install_inference.sh
This creates .venv_inference with the Isaac-GR00T PolicyClient and all
inference dependencies.
3. Camera Server#
The camera server should be running as a systemd service on the robot. See Data Collection for camera server setup.
4. C++ Deploy#
The gear_sonic_deploy binary must be built. See the main README.
Action Space#
The Sonic embodiment (unitree_g1_sonic) uses a 78-dimensional action
space: 64-dim motion token + 7-dim left hand joints + 7-dim right hand joints.
Quick Start — tmux Launcher#
The easiest way to run inference is with the all-in-one tmux launcher:
# Real robot
python gear_sonic/scripts/launch_inference.py \
--prompt "pick up the apple" \
--camera-host 192.168.123.164
# Simulation
python gear_sonic/scripts/launch_inference.py --sim \
--prompt "pick up the apple"
# Without data recording
python gear_sonic/scripts/launch_inference.py \
--no-data-exporter \
--prompt "pick up the apple"
The launcher creates a tmux session with four panes:
Pane |
Component |
Description |
|---|---|---|
0 (top-left) |
C++ Deploy |
Whole-body controller |
1 (bottom-left) |
Keyboard Publisher |
Type keyboard commands here |
2 (top-right) |
VLA Inference |
Policy client + action loop |
3 (bottom-right) |
Data Exporter |
Records episodes (optional) |
Keyboard Controls#
Type these keys in the Keyboard Publisher pane (pane 1):
Key |
Action |
|---|---|
|
Start / stop the C++ control loop |
|
Blend smoothly to initial pose and switch to POSE mode |
|
Pause / resume policy inference |
|
Toggle left hand open/closed (initial pose) |
|
Toggle right hand open/closed (initial pose) |
|
Change the inference prompt (e.g., |
|
Start recording an episode (data exporter) |
|
Stop recording — success (data exporter) |
|
Stop recording — failure / discard (data exporter) |
Typical Workflow#
Wait for all panes to initialize
Click on pane 0 (C++ Deploy) and press Enter to confirm deployment
Switch to pane 1 (Keyboard Publisher)
Press
kto start the C++ control loop (starts in PLANNER mode)Press
ito blend to the initial pose (switches to POSE mode)The robot smoothly interpolates to the initial pose over 1 second. If your task starts from a different pose than the default, see Customizing the Initial Pose below.
Press
pto unpause the inference loopThe robot will begin executing VLA-predicted actions
Press
pto pause,kto stop the control loop when done
Manual Setup (Without tmux)#
If you prefer to run each component in separate terminals:
Terminal 1 — Isaac-GR00T PolicyServer (GPU machine)#
# From the Isaac-GR00T repo
uv run python gr00t/eval/run_gr00t_server.py \
--model-path /path/to/your/finetuned_model \
--embodiment-tag UNITREE_G1_SONIC \
--device cuda:0 \
--port 5550
Terminal 2 — C++ Deploy#
cd gear_sonic_deploy
./deploy.sh --input-type zmq_manager real
Terminal 3 — VLA Inference#
source .venv_inference/bin/activate
python gear_sonic/scripts/run_vla_inference.py \
--host <policy_server_ip> \
--port 5550 \
--embodiment-tag unitree_g1_sonic \
--prompt "pick up the apple" \
--camera-host 192.168.123.164
Terminal 4 — Data Exporter (optional)#
source .venv_data_collection/bin/activate
python gear_sonic/scripts/run_data_exporter.py \
--task-prompt "pick up the apple" \
--camera-host 192.168.123.164
Configuration Reference#
VLA Inference (run_vla_inference.py)#
Flag |
Default |
Description |
|---|---|---|
|
|
PolicyServer host |
|
|
PolicyServer port |
|
|
Embodiment tag |
|
|
Language prompt |
|
|
Action publish rate (Hz) |
|
|
Actions per inference chunk |
|
|
Inference rate (Hz) |
|
|
Camera server host |
|
|
Camera server port |
|
|
Seconds to blend to initial pose (0 = instant snap) |
|
|
Always print loop timing |
tmux Launcher (launch_inference.py)#
The launcher exposes all the above flags plus deploy and data exporter options.
Run python gear_sonic/scripts/launch_inference.py --help for the full list.
Remote PolicyServer#
When running the PolicyServer on a separate GPU machine:
# On the inference machine, point to the remote server
python gear_sonic/scripts/launch_inference.py \
--policy-host <gpu_machine_ip> \
--policy-port 5550 \
--camera-host 192.168.123.164 \
--prompt "pick up the apple"
Make sure port 5550 (or your chosen port) is accessible between the two machines.
Latency Compensation#
The inference loop automatically compensates for network and compute latency.
When a new action chunk arrives, the system calculates how many actions in the
chunk are already “stale” based on the time elapsed since inference started,
and skips to the appropriate action index. This is controlled by --action-publish-rate
and --action-horizon.
Customizing the Initial Pose#
When you press i, the inference client blends the robot smoothly from its
current configuration to a predefined initial pose encoded as a 64-dim
latent motion token. This pose should match the starting configuration your
demonstrations typically begin from.
When to Change the Initial Pose#
You should update the initial motion token if:
Your collected demonstrations start from a pose far from the default (e.g., arms raised, holding an object, or a different standing stance)
You switch to a different SONIC checkpoint (each checkpoint has its own latent space — the same token produces different poses across checkpoints)
The robot is snapping to a dangerous or unstable configuration on
ipress
Where to Change It#
Edit gear_sonic/utils/inference/initial_poses.py:
LATENT_INITIAL_MOTION_TOKEN = np.array(
[
# Replace with your 64-dim token
...
],
dtype=np.float32,
)
How to Find a Good Token#
From data collection: Look at the first action frame of a good demonstration episode. The
action.motion_tokencolumn in the parquet file atframe_index=0gives you the latent token for that pose.From the C++ deploy: Put the robot in the desired starting pose via teleop, then read the most recent latent token published on the ZMQ action channel.
Blend Duration#
The blend duration controls how quickly the robot transitions to the initial pose:
# Default: 1 second smooth blend
python gear_sonic/scripts/run_vla_inference.py --initial-pose-blend-duration 1.0
# Faster blend (0.5 seconds)
python gear_sonic/scripts/run_vla_inference.py --initial-pose-blend-duration 0.5
# Instant snap (no interpolation, legacy behavior)
python gear_sonic/scripts/run_vla_inference.py --initial-pose-blend-duration 0
Warning
Setting --initial-pose-blend-duration too low (or to 0) can cause jerky motion,
especially if the robot’s current pose is far from the initial pose. The default
1-second blend is safe for most configurations.