VLA Inference#

This guide covers running a trained Isaac-GR00T VLA policy on the Unitree G1 robot using the Sonic whole-body control stack.

Overview#

The inference pipeline consists of:

  1. Isaac-GR00T PolicyServer — loads the VLA model and serves actions over ZMQ

  2. VLA inference client (run_vla_inference.py) — reads camera + robot state, queries the PolicyServer, and publishes actions to the C++ control loop

  3. C++ deploy (gear_sonic_deploy) — executes whole-body control on the robot

  4. Camera server — provides camera images over ZMQ (runs as a systemd service)

  5. Data exporter (optional) — records episodes during inference

┌──────────────────────┐
│  Isaac-GR00T         │
│  PolicyServer        │
│  (GPU machine)       │
└──────┬───────────────┘
       │ ZMQ REQ/REP
       ▼
┌─────────────────────┐    ZMQ TCP    ┌──────────────────────┐
│  VLA Inference      │ ◄─────────── │  Camera Server       │
│  (run_vla_inference)│              │  (on robot)          │
└────┬───────────┬────┘              └──────────────────────┘
     │           │
     │ ZMQ PUB   │ ZMQ SUB
     │ (actions) │ (state)
     ▼           ▼
┌─────────────────────┐
│  C++ Deploy         │
│  (gear_sonic_deploy)│
└─────────────────────┘

Prerequisites#

1. Isaac-GR00T PolicyServer#

The PolicyServer runs on a machine with a GPU. It loads your finetuned VLA model and serves inference over ZMQ.

Install Isaac-GR00T and start the server:

# On the GPU machine (from the Isaac-GR00T repo)
uv run python gr00t/eval/run_gr00t_server.py \
    --model-path /path/to/your/finetuned_model \
    --embodiment-tag UNITREE_G1_SONIC \
    --device cuda:0 \
    --port 5550

2. Inference Environment#

On the inference machine (can be the same as the PolicyServer or a separate PC):

bash install_scripts/install_inference.sh

This creates .venv_inference with the Isaac-GR00T PolicyClient and all inference dependencies.

3. Camera Server#

The camera server should be running as a systemd service on the robot. See Data Collection for camera server setup.

4. C++ Deploy#

The gear_sonic_deploy binary must be built. See the main README.

Action Space#

The Sonic embodiment (unitree_g1_sonic) uses a 78-dimensional action space: 64-dim motion token + 7-dim left hand joints + 7-dim right hand joints.

Quick Start — tmux Launcher#

The easiest way to run inference is with the all-in-one tmux launcher:

# Real robot
python gear_sonic/scripts/launch_inference.py \
    --prompt "pick up the apple" \
    --camera-host 192.168.123.164

# Simulation
python gear_sonic/scripts/launch_inference.py --sim \
    --prompt "pick up the apple"

# Without data recording
python gear_sonic/scripts/launch_inference.py \
    --no-data-exporter \
    --prompt "pick up the apple"

The launcher creates a tmux session with four panes:

Pane

Component

Description

0 (top-left)

C++ Deploy

Whole-body controller

1 (bottom-left)

Keyboard Publisher

Type keyboard commands here

2 (top-right)

VLA Inference

Policy client + action loop

3 (bottom-right)

Data Exporter

Records episodes (optional)

Keyboard Controls#

Type these keys in the Keyboard Publisher pane (pane 1):

Key

Action

k

Start / stop the C++ control loop

i

Blend smoothly to initial pose and switch to POSE mode

p

Pause / resume policy inference

[

Toggle left hand open/closed (initial pose)

]

Toggle right hand open/closed (initial pose)

t <text>

Change the inference prompt (e.g., t pick up the cup)

c

Start recording an episode (data exporter)

s

Stop recording — success (data exporter)

f

Stop recording — failure / discard (data exporter)

Typical Workflow#

  1. Wait for all panes to initialize

  2. Click on pane 0 (C++ Deploy) and press Enter to confirm deployment

  3. Switch to pane 1 (Keyboard Publisher)

  4. Press k to start the C++ control loop (starts in PLANNER mode)

  5. Press i to blend to the initial pose (switches to POSE mode)

    The robot smoothly interpolates to the initial pose over 1 second. If your task starts from a different pose than the default, see Customizing the Initial Pose below.

  6. Press p to unpause the inference loop

  7. The robot will begin executing VLA-predicted actions

  8. Press p to pause, k to stop the control loop when done

Manual Setup (Without tmux)#

If you prefer to run each component in separate terminals:

Terminal 1 — Isaac-GR00T PolicyServer (GPU machine)#

# From the Isaac-GR00T repo
uv run python gr00t/eval/run_gr00t_server.py \
    --model-path /path/to/your/finetuned_model \
    --embodiment-tag UNITREE_G1_SONIC \
    --device cuda:0 \
    --port 5550

Terminal 2 — C++ Deploy#

cd gear_sonic_deploy
./deploy.sh --input-type zmq_manager real

Terminal 3 — VLA Inference#

source .venv_inference/bin/activate
python gear_sonic/scripts/run_vla_inference.py \
    --host <policy_server_ip> \
    --port 5550 \
    --embodiment-tag unitree_g1_sonic \
    --prompt "pick up the apple" \
    --camera-host 192.168.123.164

Terminal 4 — Data Exporter (optional)#

source .venv_data_collection/bin/activate
python gear_sonic/scripts/run_data_exporter.py \
    --task-prompt "pick up the apple" \
    --camera-host 192.168.123.164

Configuration Reference#

VLA Inference (run_vla_inference.py)#

Flag

Default

Description

--host

localhost

PolicyServer host

--port

5550

PolicyServer port

--embodiment-tag

unitree_g1_sonic

Embodiment tag

--prompt

demo

Language prompt

--action-publish-rate

50

Action publish rate (Hz)

--action-horizon

40

Actions per inference chunk

--rate

2.5

Inference rate (Hz)

--camera-host

localhost

Camera server host

--camera-port

5555

Camera server port

--initial-pose-blend-duration

1.0

Seconds to blend to initial pose (0 = instant snap)

--verbose-timing

false

Always print loop timing

tmux Launcher (launch_inference.py)#

The launcher exposes all the above flags plus deploy and data exporter options. Run python gear_sonic/scripts/launch_inference.py --help for the full list.

Remote PolicyServer#

When running the PolicyServer on a separate GPU machine:

# On the inference machine, point to the remote server
python gear_sonic/scripts/launch_inference.py \
    --policy-host <gpu_machine_ip> \
    --policy-port 5550 \
    --camera-host 192.168.123.164 \
    --prompt "pick up the apple"

Make sure port 5550 (or your chosen port) is accessible between the two machines.

Latency Compensation#

The inference loop automatically compensates for network and compute latency. When a new action chunk arrives, the system calculates how many actions in the chunk are already “stale” based on the time elapsed since inference started, and skips to the appropriate action index. This is controlled by --action-publish-rate and --action-horizon.

Customizing the Initial Pose#

When you press i, the inference client blends the robot smoothly from its current configuration to a predefined initial pose encoded as a 64-dim latent motion token. This pose should match the starting configuration your demonstrations typically begin from.

When to Change the Initial Pose#

You should update the initial motion token if:

  • Your collected demonstrations start from a pose far from the default (e.g., arms raised, holding an object, or a different standing stance)

  • You switch to a different SONIC checkpoint (each checkpoint has its own latent space — the same token produces different poses across checkpoints)

  • The robot is snapping to a dangerous or unstable configuration on i press

Where to Change It#

Edit gear_sonic/utils/inference/initial_poses.py:

LATENT_INITIAL_MOTION_TOKEN = np.array(
    [
        # Replace with your 64-dim token
        ...
    ],
    dtype=np.float32,
)

How to Find a Good Token#

  1. From data collection: Look at the first action frame of a good demonstration episode. The action.motion_token column in the parquet file at frame_index=0 gives you the latent token for that pose.

  2. From the C++ deploy: Put the robot in the desired starting pose via teleop, then read the most recent latent token published on the ZMQ action channel.

Blend Duration#

The blend duration controls how quickly the robot transitions to the initial pose:

# Default: 1 second smooth blend
python gear_sonic/scripts/run_vla_inference.py --initial-pose-blend-duration 1.0

# Faster blend (0.5 seconds)
python gear_sonic/scripts/run_vla_inference.py --initial-pose-blend-duration 0.5

# Instant snap (no interpolation, legacy behavior)
python gear_sonic/scripts/run_vla_inference.py --initial-pose-blend-duration 0

Warning

Setting --initial-pose-blend-duration too low (or to 0) can cause jerky motion, especially if the robot’s current pose is far from the initial pose. The default 1-second blend is safe for most configurations.