Streaming Motion Tracking#

Stream motion data to the robot over ZMQ for reference motion tracking. This interface supports streaming either SMPL-based poses (e.g., from PICO) or G1 whole-body joint positions (qpos) from any external source (--input-type zmq).

Prerequisites

Complete the Quick Start to have the sim2sim loop running.

Emergency Stop

Press O at any time to immediately stop control and exit. Always keep a hand near the keyboard ready to press O.

Launch#

Sim2Sim (MuJoCo):

# Terminal 1 — MuJoCo simulator (from repo root)
source .venv_sim/bin/activate
python gear_sonic/scripts/run_sim_loop.py

# Terminal 2 — C++ deployment (from gear_sonic_deploy/)
bash deploy.sh sim --input-type zmq \
  --zmq-host <publisher-ip> \
  --zmq-port 5556 \
  --zmq-topic pose

Real Robot:

# From gear_sonic_deploy/
bash deploy.sh real --input-type zmq \
  --zmq-host <publisher-ip> \
  --zmq-port 5556 \
  --zmq-topic pose

Step-by-Step#

  1. Press ] to start the control system.

  2. By default you are in reference motion mode — use T to play motions, N / P to switch, R to restart (same as the keyboard interface).

  3. Press ENTER to toggle into ZMQ streaming mode. The terminal will print ZMQ STREAMING MODE: ENABLED.

  4. The policy now tracks motion frames arriving from the ZMQ publisher in real time. Playback starts automatically.

  5. Press ENTER again to switch back to reference motions. The terminal will print ZMQ STREAMING MODE: DISABLED, and the encode mode resets to 0 (joint-based).

  6. Use Q / E to adjust the heading (±0.1 rad per press) in either mode.

  7. Press I to reinitialize the base quaternion and reset the heading to zero.

  8. When done, press O to stop control and exit.

Note

No planner support — this interface uses pre-loaded and ZMQ-streamed reference motions only. For planner + ZMQ control (e.g., PICO VR teleoperation), use --input-type zmq_manager instead. See the VR Whole-Body Teleop tutorial.

Tip

Build your own streaming source. The ZMQ stream protocol documented below is self-contained — any publisher that sends messages in this format can drive the robot. You can write your own motion capture retargeting pipeline, simulator bridge, or any other source that produces the required fields. No PICO hardware is needed.
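
The sketch below shows what such a publisher could look like at the transport level, assuming pyzmq on the sending side. The pickled dict-of-NumPy-arrays payload and the build_frame_message helper are illustrative assumptions, not the documented wire format; fill in the protocol-specific fields described under Stream Protocol Versions below and match the serialization your deployment build expects.

# Hypothetical minimal publisher skeleton -- illustrative only.
# The pickled-dict payload encoding is an assumption, not the documented wire format.
import pickle
import time

import numpy as np
import zmq

ctx = zmq.Context()
pub = ctx.socket(zmq.PUB)
pub.bind("tcp://*:5556")                                  # matches the --zmq-port default

def build_frame_message(frame_index):
    # Common fields required by every protocol version; add the protocol-specific
    # fields documented under "Stream Protocol Versions" below.
    return {
        "frame_index": np.array([frame_index], dtype=np.int64),
        "body_quat": np.array([[1.0, 0.0, 0.0, 0.0]], dtype=np.float32),  # identity (w, x, y, z)
    }

frame = 0
while True:
    payload = pickle.dumps(build_frame_message(frame))    # serialization is an assumption
    pub.send_multipart([b"pose", payload])                # topic prefix matches --zmq-topic
    frame += 1
    time.sleep(0.02)                                      # assumed ~50 Hz publish rate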

Using with PICO VR Teleop#

You can use --input-type zmq with the PICO teleop streamer for a simple, streaming-only whole-body teleoperation setup. In this mode, the PICO streams full-body SMPL poses over ZMQ and the deployment side tracks them directly — no locomotion planner, no PICO-button mode switching. All control is done from the keyboard.

Prerequisites#

  1. Completed the Quick Start — you can run the sim2sim loop.

  2. PICO VR hardware is set up — headset and controllers are connected, body tracking is working, and .venv_teleop is installed. See the VR Teleop Setup for installation and calibration.

Launch (Sim2Sim)#

Run three terminals:

Terminal 1 — MuJoCo simulator (from repo root):

source .venv_sim/bin/activate
python gear_sonic/scripts/run_sim_loop.py

Terminal 2 — C++ deployment (from gear_sonic_deploy/):

bash deploy.sh sim --input-type zmq \
  --zmq-host localhost \
  --zmq-port 5556 \
  --zmq-topic pose

Terminal 3 — PICO teleop streamer (from repo root):

source .venv_teleop/bin/activate

# With visualization (recommended for first run):
python gear_sonic/scripts/pico_manager_thread_server.py \
    --manager --vis_smpl --vis_vr3pt

# Without visualization (headless):
# python gear_sonic/scripts/pico_manager_thread_server.py --manager

Launch (Real Robot)#

Run two terminals (no MuJoCo):

Terminal 1 — C++ deployment (from gear_sonic_deploy/):

bash deploy.sh real --input-type zmq \
  --zmq-host <teleop-machine-ip> \
  --zmq-port 5556 \
  --zmq-topic pose

Replace <teleop-machine-ip> with localhost if the PICO streamer runs on the same machine, or with the IP of the machine running Terminal 2.

Terminal 2 — PICO teleop streamer (from repo root):

source .venv_teleop/bin/activate
python gear_sonic/scripts/pico_manager_thread_server.py --manager

Step-by-Step#

  1. Calibration pose: Stand upright, feet together, upper arms at your sides, forearms bent 90° forward (L-shape at each elbow), palms facing inward.

  2. On the PICO controllers, press A + B + X + Y simultaneously to initialize and calibrate the body tracking.

  3. Press A + X on the PICO controllers to start streaming poses.

  4. In the C++ deployment terminal, press ] to start the control system.

  5. In the MuJoCo window (sim only), press 9 to drop the robot to the ground.

  6. Back in the C++ deployment terminal, press ENTER to enable ZMQ streaming. The terminal prints ZMQ STREAMING MODE: ENABLED. The robot begins tracking your PICO poses in real time.

  7. Move your body — the robot mirrors your motions. Use the Trigger button on each PICO controller to close the corresponding robot hand.

  8. To pause streaming (e.g., to reposition yourself), press ENTER again. The terminal prints ZMQ STREAMING MODE: DISABLED. The robot holds its last pose and stops tracking. You can move freely without affecting the robot.

  9. To resume, press ENTER once more. The robot will snap to your current pose — move back close to the robot’s current pose before resuming to avoid sudden jumps.

  10. When done, press O to stop control and exit.

DANGER — Resuming from Pause

When you press ENTER to resume streaming after a pause, the robot will immediately try to reach your current physical pose. If your body is in a very different position from the robot, the robot may perform sudden, aggressive motions. Always move back close to the robot’s current pose before pressing ENTER to resume.

PICO Buttons in ZMQ Mode#

In --input-type zmq mode, the C++ deployment side does not process PICO controller button combos directly. However, the buttons still affect the Python streamer, which controls what data gets published on the pose ZMQ topic. Since the deployment side tracks whatever arrives (or stops arriving) on that topic, several buttons still have an indirect effect on the robot.

| PICO Button | Effect |
| --- | --- |
| A + B + X + Y | Calibrate body tracking in the streamer. Press once to initialize; press again to stop streaming (emergency stop on the streamer side). |
| A + X | Toggle Pose mode in the streamer — starts or stops publishing pose data. When stopped, the robot holds its last pose. Works as pause/resume. |
| Menu (hold) | Pauses pose streaming in the streamer while held. The robot holds its last pose until you release. Works as pause. Move back close to the robot’s current pose before releasing. |
| Trigger | Hand grasp — processed by the streamer and sent as left_hand_joints / right_hand_joints in the stream. |
| B + Y | Toggle Pose mode in the streamer (same effect as A + X) — starts or stops publishing pose data. Works as pause/resume. |

All mode control on the deployment side is done from the keyboard:

| Key | Action |
| --- | --- |
| ] | Start control system |
| ENTER | Toggle streaming on/off (pause/resume) |
| O | Emergency stop — stop control and exit |
| I | Reinitialize base quaternion and reset heading |
| Q / E | Adjust heading (±0.1 rad) |

Note

For the full PICO VR experience with planner support, locomotion modes, and PICO-controller-based mode switching, use --input-type zmq_manager instead. See the VR Whole-Body Teleop tutorial.

Controls#

| Key | Action |
| --- | --- |
| ] | Start control system |
| O | Stop control and exit (emergency stop) |
| ENTER | Toggle between reference motions and ZMQ streaming |
| I | Reinitialize base quaternion and reset heading |
| Q / E | Adjust delta heading left / right (±0.1 rad) |

Reference motion mode only (streaming off):

| Key | Action |
| --- | --- |
| T | Play current motion to completion |
| R | Restart current motion from beginning (pause at frame 0) |
| P / N | Previous / Next motion sequence |

Stream Protocol Versions#

The encode mode is determined automatically by the ZMQ stream protocol version. SONIC uses Protocols v1 and v3; Protocol v2 is available for custom applications.

Encode Mode Logic#

The encode mode only takes effect when the policy model has an encoder configured and loaded. At startup, each motion’s encode mode is initialized based on encoder availability:

| encode_mode | Meaning |
| --- | --- |
| -2 | No encoder / token state configured in the model — encode mode has no effect |
| -1 | Encoder config exists (token state dimension > 0) but no encoder model file provided |
| 0 | Encoder loaded, joint-based mode (default) |
| 1 | Encoder loaded, teleop / 3-point upper-body mode |
| 2 | Encoder loaded, SMPL-based mode |

When ZMQ streaming is active, the protocol version sets the encode mode on the streamed motion: v1 → 0, v2/v3 → 2. This only affects inference if the model actually has an encoder (encode_mode >= 0). If no encoder is configured (-2), the value is set but has no effect on the inference pipeline.

When switching back to reference motions (pressing ENTER to disable streaming), the encode mode resets to 0 (if the motion has an encoder, i.e. encode_mode >= 0).
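
Summarized as pseudocode, the behavior described above looks roughly like the following (a behavioral sketch, not the actual deployment code):

# Behavioral sketch of the encode-mode rules above (not the actual C++ implementation).
def encode_mode_for_stream(protocol_version):
    # Set on the streamed motion while ZMQ streaming is active.
    # It only influences inference when the model has an encoder loaded.
    return 0 if protocol_version == 1 else 2   # v1 -> joint-based, v2/v3 -> SMPL-based

def encode_mode_on_streaming_disabled(current_mode):
    # ENTER back to reference motions: reset to joint-based if an encoder exists.
    return 0 if current_mode >= 0 else current_mode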

Common Fields (All Versions)#

All versions require two common fields:

| Field | Shape | Dtype | Description |
| --- | --- | --- | --- |
| body_quat | [N, 4] or [N, num_bodies, 4] | f32 / f64 | Body quaternion(s) per frame (w, x, y, z) |
| frame_index | [N] | i32 / i64 | Monotonically increasing frame indices for alignment |

Warning

Changing the protocol version mid-session is not allowed. If the publisher switches protocol versions while streaming, the interface will automatically disable ZMQ mode and return to reference motions for safety.

Error message: Protocol version changed from X to Y during active ZMQ session!

Protocol v1 — Joint-Based (Encode Mode 0)#

Streams raw G1 joint positions and velocities. Use this when your source provides direct qpos/qvel data (e.g., from another simulator or motion capture retargeting pipeline).

Required fields:

| Field | Shape | Dtype | Description |
| --- | --- | --- | --- |
| joint_pos | [N, 29] | f32 / f64 | Joint positions in IsaacLab order (all 29 joints) |
| joint_vel | [N, 29] | f32 / f64 | Joint velocities in IsaacLab order (all 29 joints) |

  • N = number of frames per message (batch size).

  • All 29 joint values must be provided and meaningful.

  • Frame counts of joint_pos and joint_vel must match.

Common errors:

  • Version 1 missing required fields (joint_pos, joint_vel) — one or both fields are absent.

  • Frame count mismatch between joint_pos and joint_vel — the N dimension differs.
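
For illustration, a single v1 message body could be assembled as below. The dict-of-NumPy-arrays representation follows the publisher sketch earlier on this page and is an assumption; the zeros are placeholders, since all 29 joint values must carry real data.

import numpy as np

N = 4  # frames in this message
v1_message = {
    # Common fields
    "frame_index": np.arange(N, dtype=np.int64),                              # monotonically increasing
    "body_quat": np.tile([1.0, 0.0, 0.0, 0.0], (N, 1)).astype(np.float32),    # (w, x, y, z)
    # Protocol v1 fields (IsaacLab joint order) -- replace the zeros with real data
    "joint_pos": np.zeros((N, 29), dtype=np.float32),
    "joint_vel": np.zeros((N, 29), dtype=np.float32),
}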

Protocol v2 — SMPL-Based (Encode Mode 2)#

Streams SMPL body model data. This protocol is not used by SONIC’s built-in pipelines — it is available for your own custom applications that produce SMPL representations, for example a policy that observes only the SMPL data.

Required fields:

| Field | Shape | Dtype | Description |
| --- | --- | --- | --- |
| smpl_joints | [N, 24, 3] | f32 / f64 | SMPL joint positions (24 joints × xyz) |
| smpl_pose | [N, 21, 3] | f32 / f64 | SMPL joint rotations in axis-angle (21 body poses × xyz) |

  • joint_pos and joint_vel are optional in v2.

Common errors:

  • Version 2 missing required field 'smpl_joints' or 'smpl_pose' — required SMPL fields are absent.

Protocol v3 — Joint + SMPL Combined (Encode Mode 2)#

Combines both joint-level and SMPL data. This is what SONIC uses for whole-body teleoperation (e.g., PICO VR).

Required fields:

| Field | Shape | Dtype | Description |
| --- | --- | --- | --- |
| joint_pos | [N, 29] | f32 / f64 | Joint positions in IsaacLab order |
| joint_vel | [N, 29] | f32 / f64 | Joint velocities in IsaacLab order |
| smpl_joints | [N, 24, 3] | f32 / f64 | SMPL joint positions (24 joints × xyz) |
| smpl_pose | [N, 21, 3] | f32 / f64 | SMPL joint rotations in axis-angle (21 body poses × xyz) |

Important

In Protocol v3, only the 6 wrist joints need meaningful values in joint_pos — the remaining 23 joints can be zero. The wrist joint indices (in IsaacLab order) are: [23, 24, 25, 26, 27, 28] (3 joints per wrist × 2 wrists). The joint_vel values for non-wrist joints can also be zero.

The SMPL fields (smpl_joints, smpl_pose) carry the primary motion data in v3; the wrist joints in joint_pos provide fine-grained wrist control that SMPL alone cannot capture.

  • Frame counts across all four fields must be consistent.

Common errors:

  • Version 3 missing required field 'joint_pos' or 'joint_vel' — joint fields are absent (unlike v2, they are required in v3).

  • Version 3 frame count mismatch between smpl_joints (X) and joint_pos (Y) — the N dimension differs across fields.
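
A v3 message body might be assembled as follows (same assumed dict-of-arrays representation as the earlier sketches; only the wrist entries of joint_pos need real values):

import numpy as np

N = 4
WRIST_IDX = [23, 24, 25, 26, 27, 28]    # wrist joints in IsaacLab order

joint_pos = np.zeros((N, 29), dtype=np.float32)
joint_pos[:, WRIST_IDX] = 0.1           # placeholder: real wrist angles go here

v3_message = {
    "frame_index": np.arange(N, dtype=np.int64),
    "body_quat": np.tile([1.0, 0.0, 0.0, 0.0], (N, 1)).astype(np.float32),
    "joint_pos": joint_pos,                                  # non-wrist joints may stay zero
    "joint_vel": np.zeros((N, 29), dtype=np.float32),        # non-wrist velocities may stay zero
    "smpl_joints": np.zeros((N, 24, 3), dtype=np.float32),   # SMPL joint positions (xyz)
    "smpl_pose": np.zeros((N, 21, 3), dtype=np.float32),     # axis-angle body poses
}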

Protocol Summary#

| Protocol | Encode Mode | Used by SONIC | Required Fields |
| --- | --- | --- | --- |
| v1 | 0 (joint-based) | ✅ Yes | joint_pos, joint_vel |
| v2 | 2 (SMPL-based) | ❌ Custom only | smpl_joints, smpl_pose |
| v3 | 2 (SMPL-based) | ✅ Yes | joint_pos, joint_vel, smpl_joints, smpl_pose |
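
As a pre-publish sanity check, the summary above can be turned into a small validator on the sending side (an illustrative helper, not part of the repo):

# Illustrative publisher-side validator based on the summary table above.
REQUIRED_FIELDS = {
    1: ("joint_pos", "joint_vel"),
    2: ("smpl_joints", "smpl_pose"),
    3: ("joint_pos", "joint_vel", "smpl_joints", "smpl_pose"),
}
COMMON_FIELDS = ("body_quat", "frame_index")

def check_message(version, msg):
    missing = [f for f in COMMON_FIELDS + REQUIRED_FIELDS[version] if f not in msg]
    if missing:
        raise ValueError(f"Protocol v{version} message missing fields: {missing}")
    n = len(msg["frame_index"])
    mismatched = [f for f in REQUIRED_FIELDS[version] if len(msg[f]) != n]
    if mismatched:
        raise ValueError(f"Frame count mismatch (expected {n} frames): {mismatched}")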

Optional Stream Fields#

The following optional fields can be included in any protocol version:

| Field | Shape | Dtype | Description |
| --- | --- | --- | --- |
| left_hand_joints | [7] or [1, 7] | f32 / f64 | Left hand 7-DOF Dex3 joint positions |
| right_hand_joints | [7] or [1, 7] | f32 / f64 | Right hand 7-DOF Dex3 joint positions |
| vr_position | [9] or [3, 3] | f32 / f64 | VR 3-point tracking positions: left wrist, right wrist, head (xyz × 3) |
| vr_orientation | [12] or [3, 4] | f32 / f64 | VR 3-point orientations: left, right, head quaternions (wxyz × 3) |
| catch_up | scalar | bool / u8 / i32 | If true (default), resets playback when a large frame gap is detected |
| heading_increment | scalar | f32 / f64 | Incremental heading adjustment applied per message |
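
Optional fields are simply additional keys on the same message dict, for example (continuing the assumed dict-of-arrays representation; the zero values are placeholders):

import numpy as np

optional_fields = {
    "left_hand_joints": np.zeros(7, dtype=np.float32),    # Dex3 left hand
    "right_hand_joints": np.zeros(7, dtype=np.float32),   # Dex3 right hand
    "catch_up": True,                                     # reset playback after a large frame gap
    "heading_increment": 0.0,                             # rad applied per message
}
# v3_message.update(optional_fields)  # merge into a v1/v2/v3 message before publishing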

Configuration#

| Flag | Default | Description |
| --- | --- | --- |
| --zmq-host | localhost | ZMQ publisher host |
| --zmq-port | 5556 | ZMQ publisher port |
| --zmq-topic | pose | ZMQ topic prefix |
| --zmq-conflate | off | Keep only the latest message (drop stale frames) |