# Kinematic Planner ONNX Model Reference
This page provides a detailed specification of the Kinematic Planner ONNX model inputs and outputs. The kinematic planner is the core motion generation component of the GEAR-SONIC system: given the robot’s current state and high-level navigation commands, it produces a sequence of future whole-body poses (MuJoCo qpos frames) that the low-level whole-body controller then tracks.
The ONNX model is part of the C++ inference stack and is called by the deployment runtime during operation. The C++ stack manages input construction, timing, and state management — certain combinations of inputs are invalid and are handled by the C++ layer to ensure safe operation. This page is intended for developers who want to understand the model interface at a deeper level or build custom integrations beyond the standard deployment pipeline.
> **Training Code & Technical Report**
>
> The kinematic planner training code and technical report will be released soon. This page documents the ONNX model interface for deployment integration.
## Overview
The planner takes 11 input tensors and produces 2 output tensors. The 6 primary inputs are listed below; the remaining 5 are advanced inputs managed by the C++ stack and should not need to be modified in most cases.
Primary inputs:
| Tensor Name | Shape | Dtype | Default |
|---|---|---|---|
| `context_mujoco_qpos` | `[1, 4, 36]` | `float32` | Required |
| `target_vel` | `[1]` | `float32` | `-1.0` (use mode default) |
| `mode` | `[1]` | `int64` | Required |
| `movement_direction` | `[1, 3]` | `float32` | Required |
| `facing_direction` | `[1, 3]` | `float32` | Required |
| `height` | `[1]` | `float32` | `-1.0` (disabled) |
Outputs:
| Tensor Name | Shape | Dtype |
|---|---|---|
| `mujoco_qpos` | `[1, N, 36]` | `float32` |
| `num_pred_frames` | scalar | `int64` |
Where:

- `K = max_tokens - min_tokens + 1` (model-dependent; the range of allowed prediction horizons)
- `N` = maximum number of output frames (padded); only the first `num_pred_frames` frames are valid
## Coordinate System

The model operates in MuJoCo's Z-up coordinate convention:

- X — forward
- Y — left
- Z — up
All position and direction vectors in the inputs and outputs follow this convention.
## Input Tensors

### context_mujoco_qpos

| Property | Value |
|---|---|
| Shape | `[1, 4, 36]` |
| Dtype | `float32` |
| Description | The planner's context input consisting of 4 consecutive MuJoCo `qpos` frames |
This is the primary context input. It provides 4 frames of the robot’s recent joint configuration at the simulation framerate.
The 36 dimensions of each frame are the standard MuJoCo qpos vector for the Unitree G1 (29-DOF) model:
| Index | Field | Description |
|---|---|---|
| 0–2 | Root position | `(x, y, z)` in world coordinates (meters) |
| 3–6 | Root quaternion | `(w, x, y, z)`, MuJoCo ordering |
| 7–35 | DOF positions | 29 joint angles in radians, following MuJoCo body tree order |
> **Coordinate Frame**
>
> All inputs — including `context_mujoco_qpos`, `movement_direction`, `facing_direction`, `specific_target_positions`, and `specific_target_headings` — should be provided in the world coordinate frame. The root quaternion uses MuJoCo's `(w, x, y, z)` ordering at indices 3 to 6. The model handles canonicalization internally.
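As an illustration, here is a minimal sketch of assembling a static 4-frame context in this layout. The helper name and the 0.79 m standing height are illustrative, not part of the model API:

```python
import numpy as np

def pack_qpos(root_pos, root_quat_wxyz, joint_angles):
    """Assemble one 36-D qpos frame: [x, y, z, qw, qx, qy, qz, 29 joint angles]."""
    frame = np.concatenate([root_pos, root_quat_wxyz, joint_angles]).astype(np.float32)
    assert frame.shape == (36,)
    return frame

# A standing frame at the origin with identity orientation and zero joint angles
frame = pack_qpos(
    np.array([0.0, 0.0, 0.79]),      # root position; 0.79 m standing height is illustrative
    np.array([1.0, 0.0, 0.0, 0.0]),  # identity quaternion in (w, x, y, z) order
    np.zeros(29),                    # 29 joint angles in radians
)

# Tile 4 identical frames into a static context of shape [1, 4, 36]
context_mujoco_qpos = np.tile(frame, (1, 4, 1))
```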
### target_vel

| Property | Value |
|---|---|
| Shape | `[1]` |
| Dtype | `float32` |
| Description | Desired locomotion speed override |
Controls the target movement speed. When set to zero or below (e.g., -1.0), the model uses the default velocity for the selected mode. When set to a positive value, it overrides the mode’s default speed (in meters per second). Note that the actual achieved speed may differ from the target due to the critically damped spring model and motion dynamics.
| Value | Behavior |
|---|---|
| `≤ 0` (e.g., `-1.0`) | Use the default velocity for the selected `mode` |
| `> 0` | Override with this target velocity (m/s) |
### mode

| Property | Value |
|---|---|
| Shape | `[1]` |
| Dtype | `int64` |
| Description | Index selecting the motion style/behavior |
Selects the motion style from the pre-loaded clip library. The mode index is clamped to the number of available clips at runtime. The default planner ships with the following modes:
Locomotion set:
| Index | Description |
|---|---|
| 0 | Idle — standing still |
| 1 | Slow Walk — slow forward locomotion |
| 2 | Walk — normal walking speed |
| 3 | Run — running |
Squat / ground set:
| Index | Description |
|---|---|
| 4 | Squatting (`squat`) — requires `height` |
| 5 | Kneeling on both knees (`kneelTwoLeg`) — requires `height` |
| 6 | Kneeling on one knee (`kneelOneLeg`) — requires `height` |
| 7 | Lying face down (`lyingFacedown`) — requires `height` |
| 8 | Crawling on hands and knees |
| 14 | Crawling on elbows (more likely to overheat) |
Boxing set:
| Index | Description |
|---|---|
| 9 | Boxing stance (idle) |
| 10 | Walking with boxing guard |
| 11 | Left jab |
| 12 | Right jab |
| 13 | Random punch sequence |
| 15 | Left hook |
| 16 | Right hook |
Style walks:
| Index | Description |
|---|---|
| 17 | Happy walking |
| 18 | Stealthy walking |
| 19 | Limping walk |
| 20 | Cautious walking |
| 21 | Walking with hands reaching out |
| 22 | Crouched walking |
| 23 | Dancing walk (walks forward only) |
| 24 | Zombie walk |
| 25 | Walking with hands pointing |
| 26 | Scared walk |
### movement_direction

| Property | Value |
|---|---|
| Shape | `[1, 3]` |
| Dtype | `float32` |
| Description | Desired direction of movement in the MuJoCo world frame |
A 3D direction vector (x, y, z) in the Z-up world coordinate system indicating where the robot should move. Passing a normalized vector is recommended, though the model normalizes internally. Speed is controlled by `target_vel` and `mode`, not by the magnitude of this vector.
- The planner uses the `(x, y)` components (horizontal plane) to compute the target root trajectory via a critically damped spring model.
- When the magnitude is near zero (`< 1e-5`), the model falls back to using the `facing_direction` with a small scaling factor for in-place turning.
### facing_direction

| Property | Value |
|---|---|
| Shape | `[1, 3]` |
| Dtype | `float32` |
| Description | Desired facing (heading) direction in the MuJoCo world frame |
A 3D direction vector (x, y, z) indicating which direction the robot's torso should face. The target heading angle is computed as `atan2(y, x)` from this vector. Like `movement_direction`, it does not need to be normalized.

This is independent of `movement_direction` — the robot can walk in one direction while facing another (e.g., strafing).
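For example, to strafe, set the two vectors independently: walk toward world +X while facing world +Y.

```python
import numpy as np

# Walk toward world +X while the torso faces world +Y
movement_direction = np.array([[1.0, 0.0, 0.0]], dtype=np.float32)  # travel direction
facing_direction = np.array([[0.0, 1.0, 0.0]], dtype=np.float32)    # torso heading

# The model derives the target heading as atan2(y, x) from facing_direction
target_heading = np.arctan2(facing_direction[0, 1], facing_direction[0, 0])  # pi/2 rad
```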
### height

| Property | Value |
|---|---|
| Shape | `[1]` |
| Dtype | `float32` |
| Description | Desired root height for height-aware behaviors |
Controls the target pelvis height for modes that support variable height (e.g., `squat`, `kneelTwoLeg`, `kneelOneLeg`, `lyingFacedown`). When a positive value is provided, the model searches the reference clip's keyframes and selects the one whose root height is closest to the requested value, using it as the target pose for motion generation.
| Value | Behavior |
|---|---|
| `≤ 0` (e.g., `-1.0`) | Height control disabled; use the randomly selected keyframe from the reference clip |
| `> 0` | Find the closest-height keyframe in the reference clip and use it as the target pose (meters) |
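The keyframe search reduces to an argmin over the clip's root heights; a minimal sketch, where the height values are hypothetical:

```python
import numpy as np

# Hypothetical root heights (meters) of the keyframes in a squat reference clip
keyframe_root_heights = np.array([0.79, 0.62, 0.48, 0.35])

requested = 0.5  # positive `height` input
closest = int(np.argmin(np.abs(keyframe_root_heights - requested)))
# closest == 2, so the 0.48 m keyframe becomes the target pose
```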
## Advanced Inputs
These inputs are managed internally by the C++ deployment stack and should not be modified under normal operation. They are documented here for completeness and for advanced users who need to build custom integrations.
### random_seed

| Property | Value |
|---|---|
| Shape | `[1]` |
| Dtype | `int64` |
| Description | Seed for controlling network randomness |
### has_specific_target

| Property | Value |
|---|---|
| Shape | `[1, 1]` |
| Dtype | `int64` |
| Description | Flag indicating whether specific waypoint targets are provided |
| Value | Behavior |
|---|---|
| `0` | Ignore `specific_target_positions` and `specific_target_headings` |
| `1` | Use the provided specific target positions and headings as waypoint constraints |
When enabled, the spring model's target root position and heading are overridden by the values in `specific_target_positions` and `specific_target_headings`.
### specific_target_positions

| Property | Value |
|---|---|
| Shape | `[1, 4, 3]` |
| Dtype | `float32` |
| Description | 4 waypoint positions in MuJoCo world coordinates |
Each waypoint is a 3D position (x, y, z) in the Z-up world frame. The 4 waypoints correspond to 4 frames (one token's worth) of target root positions. Only used when `has_specific_target = 1`.
### specific_target_headings

| Property | Value |
|---|---|
| Shape | `[1, 4]` |
| Dtype | `float32` |
| Description | 4 waypoint heading angles in radians |
Target heading (yaw) angles for each of the 4 waypoint frames. These are absolute angles in the Z-up world frame, measured as rotation around the Z-axis. Only used when `has_specific_target = 1`. The last waypoint's heading (`[:, -1]`) is used as the primary target heading for the spring model.
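Putting the three waypoint inputs together, here is a sketch of steering the root 1 m forward over one token's worth of frames. The evenly spaced intermediate waypoints are an assumption for illustration, not a prescribed scheme:

```python
import numpy as np

# Interpolate 4 waypoints from the current root position to a goal 1 m ahead
start = np.array([0.0, 0.0, 0.79])  # current root position (illustrative)
goal = np.array([1.0, 0.0, 0.79])
ts = np.linspace(0.25, 1.0, 4)[:, None]
specific_target_positions = ((1 - ts) * start + ts * goal)[None].astype(np.float32)  # [1, 4, 3]

# Keep the heading along +X (0 rad) at every waypoint; the last entry
# drives the spring model's target heading
specific_target_headings = np.zeros([1, 4], dtype=np.float32)

has_specific_target = np.array([[1]], dtype=np.int64)  # enable waypoint constraints
```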
### allowed_pred_num_tokens

| Property | Value |
|---|---|
| Shape | `[1, K]` (`[1, 11]` for the default planner) |
| Dtype | `int64` |
| Description | Binary mask controlling the allowed prediction horizon |
A binary mask where each element corresponds to a possible number of predicted tokens. Index `i` maps to `min_tokens + i` tokens. A value of 1 means that prediction length is allowed; 0 means it is disallowed.

Since each token represents 4 frames, the prediction horizon in frames is `num_tokens * 4`. The default planner uses `min_tokens = 6` and `max_tokens = 16`:
| Index | Tokens | Frames |
|---|---|---|
| 0 | 6 | 24 |
| 1 | 7 | 28 |
| 2 | 8 | 32 |
| 3 | 9 | 36 |
| 4 | 10 | 40 |
| 5 | 11 | 44 |
| 6 | 12 | 48 |
| 7 | 13 | 52 |
| 8 | 14 | 56 |
| 9 | 15 | 60 |
| 10 | 16 | 64 |
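For example, to force the planner to generate at least 10 tokens (40 frames), zero out the shorter horizons. A sketch for the default token range:

```python
import numpy as np

min_tokens, max_tokens = 6, 16
K = max_tokens - min_tokens + 1  # 11 for the default planner

allowed_pred_num_tokens = np.ones([1, K], dtype=np.int64)
allowed_pred_num_tokens[0, : 10 - min_tokens] = 0  # disallow horizons below 10 tokens
# Entry i is now 1 only where min_tokens + i >= 10, i.e. 40-64 predicted frames
```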
## Output Tensors

### mujoco_qpos

| Property | Value |
|---|---|
| Shape | `[1, N, 36]` |
| Dtype | `float32` |
| Description | Predicted motion sequence as MuJoCo `qpos` frames |
The primary output: a sequence of whole-body pose frames in the same 36-dimensional MuJoCo `qpos` format as the input (see `context_mujoco_qpos` for the dimension layout).
> **Important: Use `num_pred_frames` to Truncate**
>
> The output tensor `mujoco_qpos` is not truncated — it contains the full padded buffer. Only the first `num_pred_frames` frames are valid predictions. When consuming this output, always slice:
>
> ```python
> valid_qpos = mujoco_qpos[:, :num_pred_frames, :]
> ```
The poses are in the global MuJoCo world frame (not canonicalized). The model internally handles canonicalization, inference, and coordinate conversion, then transforms the output back to the original world frame. The first 4 predicted frames are blended with the input context for smooth transitions.
The root quaternion in the output uses (w, x, y, z) ordering (MuJoCo convention).
### num_pred_frames

| Property | Value |
|---|---|
| Shape | scalar |
| Dtype | `int64` |
| Description | Number of valid predicted frames in the `mujoco_qpos` output |
This value equals `num_pred_tokens * 4`, where `num_pred_tokens` is the number of motion tokens the model decided to generate (constrained by `allowed_pred_num_tokens`). Use this value to slice the `mujoco_qpos` output.
## Internal Pipeline

1. **Canonicalization** — The input qpos is transformed to a body-relative frame by removing the first frame's heading rotation and horizontal position. This helps the model generalize across different starting orientations and positions.
2. **Spring Model** — A critically damped spring model generates smooth target root trajectories and heading angles from the high-level commands, using mode-dependent average velocities from the training clips.
3. **Target Pose Selection** — Based on the `mode` and `random_seed`, a target pose is fetched from the pre-loaded clip library and aligned (rotated/translated) to match the spring model's predicted target position and heading.
4. **Motion Inference** — The core motion model fills in the motion between the context (current state) and target (desired future state), producing a natural transition.
5. **Post-processing** — The output is converted back to MuJoCo qpos in the original world frame, and the first 4 frames are blended with the input context for smooth transitions.
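For intuition about step 2: a critically damped spring is a second-order tracker with damping ratio 1, so it converges to its target without overshoot. A minimal sketch of one horizontal-position update, where the gain `omega` and timestep are illustrative (the planner's exact formulation is not published here):

```python
import numpy as np

def critically_damped_step(pos, vel, target, omega, dt):
    """One semi-implicit Euler update of a critically damped spring.

    With damping ratio 1 the acceleration is omega^2 * (target - pos) - 2 * omega * vel,
    so the state approaches the target without overshoot.
    """
    acc = omega**2 * (target - pos) - 2.0 * omega * vel
    vel = vel + acc * dt
    pos = pos + vel * dt
    return pos, vel

# Illustrative: track a target 1 m ahead in the horizontal plane at 30 Hz
pos, vel = np.zeros(2), np.zeros(2)
for _ in range(30):
    pos, vel = critically_damped_step(pos, vel, np.array([1.0, 0.0]), omega=4.0, dt=1 / 30)
```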
## Deployment Integration
This section describes how the C++ deployment stack uses the planner at runtime. Understanding this is useful for building custom integrations or modifying the replan behavior.
### Threading Model
The planner runs on a dedicated thread at 10 Hz (`planner_dt = 0.1 s`), separate from the control loop (50 Hz) and input thread (100 Hz). The planner thread:

1. Reads the latest `MovementState` from a thread-safe buffer (written by the input interface).
2. Decides whether a replan is needed.
3. If so, calls `UpdatePlanning()`, which runs TensorRT inference.
4. Stores the result in a shared buffer that the control thread picks up on its next tick.
### Initialization
When the planner is first enabled (e.g., pressing ENTER on the keyboard interface), the planner thread:
1. Reads the robot's current base quaternion and joint positions from the latest `LowState`.
2. Calls `Initialize()`, which:
   - Sets up a 4-frame context at the default standing height with zero-yaw orientation.
   - Runs an initial inference with `IDLE` mode and no movement.
   - Resamples the 30 Hz output to 50 Hz.
3. The control thread detects the new planner motion and switches `current_motion_` to the planner output.
### Context Construction

The planner requires a 4-frame context (`context_mujoco_qpos` of shape `[1, 4, 36]`). During operation, this context is sampled from the current planner motion (not the robot state):
- The context starts at `gen_frame = current_frame + motion_look_ahead_steps` (default look-ahead = 2 frames at 50 Hz).
- 4 frames are sampled at 30 Hz intervals from this starting point.
- Joint positions and body positions are linearly interpolated, and body quaternions slerp-interpolated, between 50 Hz motion frames to produce the 30 Hz context samples.
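A sketch of this 30 Hz sampling from a 50 Hz motion buffer. It uses plain linear interpolation for brevity (a faithful implementation would slerp the quaternion slice), and the names are illustrative:

```python
import numpy as np

def sample_context_30hz(motion_50hz, gen_frame):
    """Sample 4 context frames at 30 Hz spacing from a [T, 36] 50 Hz motion buffer."""
    frames = []
    for k in range(4):
        t = gen_frame + k * (50.0 / 30.0)  # fractional 50 Hz index of the k-th sample
        i0 = int(np.floor(t))
        i1 = min(i0 + 1, len(motion_50hz) - 1)
        alpha = t - i0
        # Linear interpolation; quaternion components should really be slerped
        frames.append((1 - alpha) * motion_50hz[i0] + alpha * motion_50hz[i1])
    return np.stack(frames)[None].astype(np.float32)  # [1, 4, 36]
```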
### Replan Logic
Not every planner tick triggers a replan. The decision follows this priority:
**1. Always replan when** (regardless of static/non-static mode):

- Locomotion mode changed
- Facing direction changed
- Height changed

**2. For non-static modes only, also replan when** any of the following is true:

- Movement speed changed
- Movement direction changed
- Periodic replan timer expired and movement speed is non-zero
Static modes (Idle, Squat, Kneel, Lying, Idle Boxing) never trigger replans from the second category — they only replan on mode/facing/height changes from the first category.
Replan intervals (periodic timer) vary by locomotion type to balance responsiveness and computational cost:
| Locomotion Type | Replan Interval |
|---|---|
| Running | 0.1 s (every planner tick) |
| Crawling | 0.2 s |
| Boxing (punches, hooks) | 1.0 s |
| All others (walk, squat, styled, etc.) | 1.0 s |
The periodic timer only triggers a replan if the current movement speed is non-zero — a stationary robot in a non-static mode (e.g., Walk mode with speed 0) will not replan on the timer.
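The decision logic above condenses to a small predicate. A sketch in Python, where the `MovementState` fields and mode labels are illustrative stand-ins for the C++ types:

```python
from dataclasses import dataclass

@dataclass
class MovementState:  # illustrative stand-in for the C++ MovementState
    mode: str
    speed: float
    direction: tuple
    facing: tuple
    height: float

STATIC_MODES = {"idle", "squat", "kneel_two_leg", "kneel_one_leg", "lying", "idle_boxing"}

def needs_replan(prev: MovementState, cur: MovementState, timer_expired: bool) -> bool:
    # Category 1: mode / facing / height changes always trigger a replan
    if cur.mode != prev.mode or cur.facing != prev.facing or cur.height != prev.height:
        return True
    # Category 2 applies only to non-static modes
    if cur.mode in STATIC_MODES:
        return False
    if cur.speed != prev.speed or cur.direction != prev.direction:
        return True
    # The periodic timer only fires while the robot is commanded to move
    return timer_expired and cur.speed != 0.0
```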
### Output Resampling (30 Hz → 50 Hz)
The planner model outputs frames at 30 Hz. The deployment stack resamples them to 50 Hz (the control loop rate) using linear interpolation:
1. For each 50 Hz frame, compute the corresponding fractional 30 Hz frame index.
2. Linearly interpolate joint positions and body positions between the two nearest 30 Hz frames.
3. Slerp-interpolate body quaternions.
4. Compute joint velocities by finite differencing the resampled positions (`(pos[t+1] - pos[t]) * 50`).
The resampled motion is stored in `planner_motion_50hz_` and has `num_pred_frames * 50 / 30` frames (rounded down).
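A sketch of this resampling for the position channels (quaternion channels additionally need slerp; the names are illustrative):

```python
import numpy as np

def resample_30_to_50hz(pos_30hz):
    """Linearly resample [T30, D] positions to 50 Hz and finite-difference velocities."""
    t30 = len(pos_30hz)
    t50 = t30 * 50 // 30                 # frame count rounded down, as described above
    idx = np.arange(t50) * 30.0 / 50.0   # fractional 30 Hz index of each 50 Hz frame
    i0 = np.floor(idx).astype(int)
    i1 = np.minimum(i0 + 1, t30 - 1)
    alpha = (idx - i0)[:, None]
    pos = (1 - alpha) * pos_30hz[i0] + alpha * pos_30hz[i1]
    vel = np.diff(pos, axis=0, append=pos[-1:]) * 50.0  # (pos[t+1] - pos[t]) * 50
    return pos, vel
```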
### Animation Blending
When a new planner output arrives while the previous one is still playing, the control thread blends the old and new animations over an 8-frame cross-fade:
1. The old animation is rebased so `current_frame` maps to frame 0.
2. The new animation is aligned to start at `gen_frame - current_frame` in the rebased timeline.
3. Over 8 frames starting from the blend point, a linearly increasing weight `w_new` (0 → 1) is applied:
   - Joint positions/velocities: `w_old * old + w_new * new`
   - Body positions: `w_old * old + w_new * new`
   - Body quaternions: `slerp(old, new, w_new)`
4. After the blend region, the new animation takes over completely. `current_frame` is reset to 0 on the blended result.
This ensures smooth transitions between successive planner outputs without visible discontinuities.
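A sketch of the cross-fade for the linear channels, given two animations already aligned on a common timeline (quaternion channels would use slerp instead of the linear mix):

```python
import numpy as np

BLEND_FRAMES = 8

def crossfade(old, new):
    """Blend two time-aligned [T, D] animations over the first BLEND_FRAMES frames."""
    out = new.copy()
    n = min(BLEND_FRAMES, len(old), len(new))
    w_new = np.linspace(0.0, 1.0, n)[:, None]  # linearly increasing weight, 0 -> 1
    out[:n] = (1.0 - w_new) * old[:n] + w_new * new[:n]
    return out  # past the blend region the new animation takes over completely
```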
### TensorRT Acceleration
The planner runs via TensorRT with CUDA graph capture for low-latency inference:
- At startup, the ONNX model is converted to a TensorRT engine (cached on disk).
- A CUDA graph is captured during initialization — this records the entire inference pass (input copy → kernel launches → output copy) as a single replayable graph.
- On each replan, inputs are copied to the GPU via pinned memory (`TPinnedVector`), the CUDA graph is launched, and outputs are copied back.
- FP16 precision is supported via `--planner-precision 16` (default is FP32).
## Planner Model Versions
The deployment stack supports multiple planner model versions, auto-detected from the model filename:
| Version | Inputs | Modes | Description |
|---|---|---|---|
| V0 | 6 | 4 (Idle, Slow Walk, Walk, Run) | Basic locomotion only |
| V1 | 11 | 20 | Adds squat/kneel/boxing/styled walks + height control + waypoint targets |
| V2 | 11 | 27 | All V1 modes + additional styled walking modes |
The version is determined by the presence of `V0`, `V1`, or `V2` in the planner model filename. The version determines:

- The number of input tensors (6 vs. 11)
- The valid range of `mode` values
- Whether the `height`, `has_specific_target`, `specific_target_positions`, `specific_target_headings`, and `allowed_pred_num_tokens` inputs are used
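A sketch of the filename-based detection rule (the tag search mirrors the description above; the exact C++ parsing is not shown here):

```python
def detect_planner_version(model_path: str) -> str:
    """Infer the planner version (V0/V1/V2) from the model filename."""
    name = model_path.rsplit("/", 1)[-1]
    for version in ("V2", "V1", "V0"):
        if version in name:
            return version
    raise ValueError(f"no planner version tag (V0/V1/V2) in {name!r}")

assert detect_planner_version("models/kinematic_planner_V2.onnx") == "V2"
```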
## Model Properties
The exported ONNX model has the following properties:
- ONNX opset version: 17
- Batch size: 1 (fixed)
The model is distributed as a single `.onnx` file, along with a `.pt` file containing reference input/output tensors that can be used for validation.
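A sketch of using those reference tensors to validate an integration. The `.pt` filename and its key layout (`"inputs"`/`"outputs"` dicts keyed by tensor name, ordered to match the session outputs) are assumptions; inspect the shipped file for the actual structure:

```python
import numpy as np
import onnxruntime as ort
import torch

# Hypothetical filename and layout; inspect the shipped .pt file for the real keys
ref = torch.load("kinematic_planner_reference.pt")
inputs = {name: t.numpy() for name, t in ref["inputs"].items()}

session = ort.InferenceSession("kinematic_planner.onnx")
outputs = session.run(None, inputs)

for got, (name, want) in zip(outputs, ref["outputs"].items()):
    np.testing.assert_allclose(got, want.numpy(), rtol=1e-3, atol=1e-4)
    print(f"{name}: OK")
```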
> **Coming Soon**
>
> The training code, export tooling, and a full technical report will be released soon. Stay tuned for updates.
## Usage Example
```python
import onnxruntime as ort
import numpy as np

# Load the ONNX model
session = ort.InferenceSession("kinematic_planner.onnx")

# current_qpos_buffer: the caller's 4-frame context of shape [1, 4, 36]
# (see context_mujoco_qpos above)

# Primary inputs
inputs = {
    "context_mujoco_qpos": current_qpos_buffer.astype(np.float32),         # [1, 4, 36]
    "target_vel": np.array([-1.0], dtype=np.float32),                      # -1.0 = use mode default
    "mode": np.array([2], dtype=np.int64),                                 # 2 = walk
    "movement_direction": np.array([[1.0, 0.0, 0.0]], dtype=np.float32),   # forward
    "facing_direction": np.array([[1.0, 0.0, 0.0]], dtype=np.float32),     # face forward
    "height": np.array([-1.0], dtype=np.float32),                          # -1.0 = disabled
    # Advanced inputs (typically managed by the C++ stack)
    "random_seed": np.array([1234], dtype=np.int64),
    "has_specific_target": np.array([[0]], dtype=np.int64),
    "specific_target_positions": np.zeros([1, 4, 3], dtype=np.float32),
    "specific_target_headings": np.zeros([1, 4], dtype=np.float32),
    "allowed_pred_num_tokens": np.ones([1, 11], dtype=np.int64),           # K = 11 for default model
}

# Run inference
mujoco_qpos, num_pred_frames = session.run(None, inputs)

# Extract valid frames only
num_frames = int(num_pred_frames)
predicted_motion = mujoco_qpos[0, :num_frames, :]  # [num_frames, 36]

# Each row: [x, y, z, qw, qx, qy, qz, 29 joint angles in radians]
for frame in predicted_motion:
    root_pos = frame[:3]
    root_quat = frame[3:7]      # (w, x, y, z)
    joint_angles = frame[7:36]  # 29 DOF positions
```