# Kinematic Planner ONNX Model Reference
This page provides a detailed specification of the Kinematic Planner ONNX model inputs and outputs. The kinematic planner is the core motion generation component of the GEAR-SONIC system: given the robot’s current state and high-level navigation commands, it produces a sequence of future whole-body poses (MuJoCo qpos frames) that the low-level whole-body controller then tracks.
The ONNX model is part of the C++ inference stack and is called by the deployment runtime during operation. The C++ stack manages input construction, timing, and state management — certain combinations of inputs are invalid and are handled by the C++ layer to ensure safe operation. This page is intended for developers who want to understand the model interface at a deeper level or build custom integrations beyond the standard deployment pipeline.
> **Training Code & Technical Report**
>
> The kinematic planner training code and technical report will be released soon. This page documents the ONNX model interface for deployment integration.
## Overview
The planner takes 11 input tensors and produces 2 output tensors. The 6 primary inputs are listed below; the remaining 5 are advanced inputs managed by the C++ stack and should not need to be modified in most cases.
Primary inputs:
| Tensor Name | Shape | Dtype | Default |
|---|---|---|---|
| `context_mujoco_qpos` | `[1, 4, 36]` | `float32` | Required |
| `target_vel` | `[1]` | `float32` | `-1.0` (use mode default) |
| `mode` | `[1]` | `int64` | Required |
| `movement_direction` | `[1, 3]` | `float32` | Required |
| `facing_direction` | `[1, 3]` | `float32` | Required |
| `height` | `[1]` | `float32` | `-1.0` (disabled) |
Outputs:
| Tensor Name | Shape | Dtype |
|---|---|---|
| `mujoco_qpos` | `[1, N, 36]` | `float32` |
| `num_pred_frames` | scalar | `int64` |
Where:

- `K = max_tokens - min_tokens + 1` (model-dependent; the range of allowed prediction horizons)
- `N` = maximum number of output frames (padded); only the first `num_pred_frames` frames are valid
## Coordinate System

The model operates in MuJoCo's Z-up coordinate convention:

- X — forward
- Y — left
- Z — up
All position and direction vectors in the inputs and outputs follow this convention.
## Input Tensors

### context_mujoco_qpos

| Property | Value |
|---|---|
| Shape | `[1, 4, 36]` |
| Dtype | `float32` |
| Description | The planner's context input consisting of 4 consecutive MuJoCo `qpos` frames |
This is the primary context input. It provides 4 frames of the robot’s recent joint configuration at the simulation framerate.
The 36 dimensions of each frame are the standard MuJoCo qpos vector for the Unitree G1 (29-DOF) model:
| Index | Field | Description |
|---|---|---|
| 0–2 | Root position | `(x, y, z)` in world coordinates (meters) |
| 3–6 | Root quaternion | `(w, x, y, z)`, MuJoCo ordering |
| 7–35 | DOF positions | 29 joint angles in radians, following MuJoCo body tree order |
> **Coordinate Frame**
>
> All inputs — including `context_mujoco_qpos`, `movement_direction`, `facing_direction`, `specific_target_positions`, and `specific_target_headings` — should be provided in the world coordinate frame. The root quaternion uses MuJoCo's `(w, x, y, z)` ordering at indices 3 to 6. The model handles canonicalization internally.
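As an illustration, here is a minimal sketch of assembling a static 4-frame context in this layout. The helper name and the 0.79 m standing height are illustrative, not part of the model API:

```python
import numpy as np

def pack_qpos(root_pos, root_quat_wxyz, joint_angles):
    """Assemble one 36-D qpos frame: [x, y, z, qw, qx, qy, qz, 29 joint angles]."""
    frame = np.concatenate([root_pos, root_quat_wxyz, joint_angles]).astype(np.float32)
    assert frame.shape == (36,)
    return frame

# A standing frame at the origin with identity orientation and zero joint angles
frame = pack_qpos(
    np.array([0.0, 0.0, 0.79]),      # root position; 0.79 m standing height is illustrative
    np.array([1.0, 0.0, 0.0, 0.0]),  # identity quaternion in (w, x, y, z) order
    np.zeros(29),                    # 29 joint angles in radians
)

# Tile 4 identical frames into a static context of shape [1, 4, 36]
context_mujoco_qpos = np.tile(frame, (1, 4, 1))
```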
### target_vel

| Property | Value |
|---|---|
| Shape | `[1]` |
| Dtype | `float32` |
| Description | Desired locomotion speed override |
Controls the target movement speed. When set to zero or below (e.g., -1.0), the model uses the default velocity for the selected mode. When set to a positive value, it overrides the mode’s default speed (in meters per second). Note that the actual achieved speed may differ from the target due to the critically damped spring model and motion dynamics.
| Value | Behavior |
|---|---|
| `≤ 0` (e.g., `-1.0`) | Use the default velocity for the selected `mode` |
| `> 0` | Override with this target velocity (m/s) |
### mode

| Property | Value |
|---|---|
| Shape | `[1]` |
| Dtype | `int64` |
| Description | Index selecting the motion style/behavior |
Selects the motion style from the pre-loaded clip library. The mode index is clamped to the number of available clips at runtime. The default planner ships with the following modes:
Locomotion set:
| Index | Description |
|---|---|
| 0 | Idle — standing still |
| 1 | Slow Walk — slow forward locomotion |
| 2 | Walk — normal walking speed |
| 3 | Run — running |
Squat / ground set:
| Index | Description |
|---|---|
| 4 | Squatting (`squat`) — requires `height` |
| 5 | Kneeling on both knees (`kneelTwoLeg`) — requires `height` |
| 6 | Kneeling on one knee (`kneelOneLeg`) — requires `height` |
| 7 | Lying face down (`lyingFacedown`) — requires `height` |
| 8 | Crawling on hands and knees |
| 14 | Crawling on elbows (more likely to overheat) |
Boxing set:
| Index | Description |
|---|---|
| 9 | Boxing stance (idle) |
| 10 | Walking with boxing guard |
| 11 | Left jab |
| 12 | Right jab |
| 13 | Random punch sequence |
| 15 | Left hook |
| 16 | Right hook |
Style walks:
| Index | Description |
|---|---|
| 17 | Happy walking |
| 18 | Stealthy walking |
| 19 | Limping walk |
| 20 | Cautious walking |
| 21 | Walking with hands reaching out |
| 22 | Crouched walking |
| 23 | Dancing walk (walks forward only) |
| 24 | Zombie walk |
| 25 | Walking with hands pointing |
| 26 | Scared walk |
### movement_direction

| Property | Value |
|---|---|
| Shape | `[1, 3]` |
| Dtype | `float32` |
| Description | Desired direction of movement in the MuJoCo world frame |
A 3D direction vector (x, y, z) in the Z-up world coordinate system indicating where the robot should move. Passing a normalized vector is recommended, though the model normalizes internally. Speed is controlled by `target_vel` and `mode`, not by the magnitude of this vector.
- The planner uses the `(x, y)` components (horizontal plane) to compute the target root trajectory via a critically damped spring model.
- When the magnitude is near zero (`< 1e-5`), the model falls back to using the `facing_direction` with a small scaling factor for in-place turning.
### facing_direction

| Property | Value |
|---|---|
| Shape | `[1, 3]` |
| Dtype | `float32` |
| Description | Desired facing (heading) direction in the MuJoCo world frame |
A 3D direction vector (x, y, z) indicating which direction the robot's torso should face. The target heading angle is computed as `atan2(y, x)` from this vector. Like `movement_direction`, it does not need to be normalized.

This is independent of `movement_direction` — the robot can walk in one direction while facing another (e.g., strafing).
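For example, to strafe, set the two vectors independently: walk toward world +X while facing world +Y.

```python
import numpy as np

# Walk toward world +X while the torso faces world +Y
movement_direction = np.array([[1.0, 0.0, 0.0]], dtype=np.float32)  # travel direction
facing_direction = np.array([[0.0, 1.0, 0.0]], dtype=np.float32)    # torso heading

# The model derives the target heading as atan2(y, x) from facing_direction
target_heading = np.arctan2(facing_direction[0, 1], facing_direction[0, 0])  # pi/2 rad
```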
### height

| Property | Value |
|---|---|
| Shape | `[1]` |
| Dtype | `float32` |
| Description | Desired root height for height-aware behaviors |
Controls the target pelvis height for modes that support variable height (e.g., `squat`, `kneelTwoLeg`, `kneelOneLeg`, `lyingFacedown`). When a positive value is provided, the model searches the reference clip's keyframes and selects the one whose root height is closest to the requested value, using it as the target pose for motion generation.
| Value | Behavior |
|---|---|
| `≤ 0` (e.g., `-1.0`) | Height control disabled; use the randomly selected keyframe from the reference clip |
| `> 0` | Find the closest-height keyframe in the reference clip and use it as the target pose (meters) |
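The keyframe search reduces to an argmin over the clip's root heights; a minimal sketch, where the height values are hypothetical:

```python
import numpy as np

# Hypothetical root heights (meters) of the keyframes in a squat reference clip
keyframe_root_heights = np.array([0.79, 0.62, 0.48, 0.35])

requested = 0.5  # positive `height` input
closest = int(np.argmin(np.abs(keyframe_root_heights - requested)))
# closest == 2, so the 0.48 m keyframe becomes the target pose
```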
## Advanced Inputs
These inputs are managed internally by the C++ deployment stack and should not be modified under normal operation. They are documented here for completeness and for advanced users who need to build custom integrations.
### random_seed

| Property | Value |
|---|---|
| Shape | `[1]` |
| Dtype | `int64` |
| Description | Seed for controlling network randomness |
### has_specific_target

| Property | Value |
|---|---|
| Shape | `[1, 1]` |
| Dtype | `int64` |
| Description | Flag indicating whether specific waypoint targets are provided |
| Value | Behavior |
|---|---|
| `0` | Ignore `specific_target_positions` and `specific_target_headings` |
| `1` | Use the provided specific target positions and headings as waypoint constraints |
When enabled, the spring model's target root position and heading are overridden by the values in `specific_target_positions` and `specific_target_headings`.
### specific_target_positions

| Property | Value |
|---|---|
| Shape | `[1, 4, 3]` |
| Dtype | `float32` |
| Description | 4 waypoint positions in MuJoCo world coordinates |
Each waypoint is a 3D position (x, y, z) in the Z-up world frame. The 4 waypoints correspond to 4 frames (one token's worth) of target root positions. Only used when `has_specific_target = 1`.
### specific_target_headings

| Property | Value |
|---|---|
| Shape | `[1, 4]` |
| Dtype | `float32` |
| Description | 4 waypoint heading angles in radians |
Target heading (yaw) angles for each of the 4 waypoint frames. These are absolute angles in the Z-up world frame, measured as rotation around the Z-axis. Only used when `has_specific_target = 1`. The last waypoint's heading (`[:, -1]`) is used as the primary target heading for the spring model.
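Putting the three waypoint inputs together, here is a sketch of steering the root 1 m forward over one token's worth of frames. The evenly spaced intermediate waypoints are an assumption for illustration, not a prescribed scheme:

```python
import numpy as np

# Interpolate 4 waypoints from the current root position to a goal 1 m ahead
start = np.array([0.0, 0.0, 0.79])  # current root position (illustrative)
goal = np.array([1.0, 0.0, 0.79])
ts = np.linspace(0.25, 1.0, 4)[:, None]
specific_target_positions = ((1 - ts) * start + ts * goal)[None].astype(np.float32)  # [1, 4, 3]

# Keep the heading along +X (0 rad) at every waypoint; the last entry
# drives the spring model's target heading
specific_target_headings = np.zeros([1, 4], dtype=np.float32)

has_specific_target = np.array([[1]], dtype=np.int64)  # enable waypoint constraints
```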
### allowed_pred_num_tokens

| Property | Value |
|---|---|
| Shape | `[1, K]` (`[1, 11]` for the default planner) |
| Dtype | `int64` |
| Description | Binary mask controlling the allowed prediction horizon |
A binary mask where each element corresponds to a possible number of predicted tokens. Index `i` maps to `min_tokens + i` tokens. A value of 1 means that prediction length is allowed; 0 means it is disallowed.

Since each token represents 4 frames, the prediction horizon in frames is `num_tokens * 4`. The default planner uses `min_tokens = 6` and `max_tokens = 16`:
| Index | Tokens | Frames |
|---|---|---|
| 0 | 6 | 24 |
| 1 | 7 | 28 |
| 2 | 8 | 32 |
| 3 | 9 | 36 |
| 4 | 10 | 40 |
| 5 | 11 | 44 |
| 6 | 12 | 48 |
| 7 | 13 | 52 |
| 8 | 14 | 56 |
| 9 | 15 | 60 |
| 10 | 16 | 64 |
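For example, to force the planner to generate at least 10 tokens (40 frames), zero out the shorter horizons. A sketch for the default token range:

```python
import numpy as np

min_tokens, max_tokens = 6, 16
K = max_tokens - min_tokens + 1  # 11 for the default planner

allowed_pred_num_tokens = np.ones([1, K], dtype=np.int64)
allowed_pred_num_tokens[0, : 10 - min_tokens] = 0  # disallow horizons below 10 tokens
# Entry i is now 1 only where min_tokens + i >= 10, i.e. 40-64 predicted frames
```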
## Output Tensors

### mujoco_qpos

| Property | Value |
|---|---|
| Shape | `[1, N, 36]` |
| Dtype | `float32` |
| Description | Predicted motion sequence as MuJoCo `qpos` frames |
The primary output: a sequence of whole-body pose frames in the same 36-dimensional MuJoCo `qpos` format as the input (see `context_mujoco_qpos` for the dimension layout).
> **Important: Use `num_pred_frames` to Truncate**
>
> The output tensor `mujoco_qpos` is not truncated — it contains the full padded buffer. Only the first `num_pred_frames` frames are valid predictions. When consuming this output, always slice:
>
> ```python
> valid_qpos = mujoco_qpos[:, :num_pred_frames, :]
> ```
The poses are in the global MuJoCo world frame (not canonicalized). The model internally handles canonicalization, inference, and coordinate conversion, then transforms the output back to the original world frame. The first 4 predicted frames are blended with the input context for smooth transitions.
The root quaternion in the output uses (w, x, y, z) ordering (MuJoCo convention).
### num_pred_frames

| Property | Value |
|---|---|
| Shape | scalar |
| Dtype | `int64` |
| Description | Number of valid predicted frames in the `mujoco_qpos` output |
This value equals `num_pred_tokens * 4`, where `num_pred_tokens` is the number of motion tokens the model decided to generate (constrained by `allowed_pred_num_tokens`). Use this value to slice the `mujoco_qpos` output.
## Internal Pipeline

1. **Canonicalization** — The input qpos is transformed to a body-relative frame by removing the first frame's heading rotation and horizontal position. This helps the model generalize across different starting orientations and positions.
2. **Spring Model** — A critically damped spring model generates smooth target root trajectories and heading angles from the high-level commands, using mode-dependent average velocities from the training clips.
3. **Target Pose Selection** — Based on the `mode` and `random_seed`, a target pose is fetched from the pre-loaded clip library and aligned (rotated/translated) to match the spring model's predicted target position and heading.
4. **Motion Inference** — The core motion model fills in the motion between the context (current state) and target (desired future state), producing a natural transition.
5. **Post-processing** — The output is converted back to MuJoCo qpos in the original world frame, and the first 4 frames are blended with the input context for smooth transitions.
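For intuition about step 2: a critically damped spring is a second-order tracker with damping ratio 1, so it converges to its target without overshoot. A minimal sketch of one horizontal-position update, where the gain `omega` and timestep are illustrative (the planner's exact formulation is not published here):

```python
import numpy as np

def critically_damped_step(pos, vel, target, omega, dt):
    """One semi-implicit Euler update of a critically damped spring.

    With damping ratio 1 the acceleration is omega^2 * (target - pos) - 2 * omega * vel,
    so the state approaches the target without overshoot.
    """
    acc = omega**2 * (target - pos) - 2.0 * omega * vel
    vel = vel + acc * dt
    pos = pos + vel * dt
    return pos, vel

# Illustrative: track a target 1 m ahead in the horizontal plane at 30 Hz
pos, vel = np.zeros(2), np.zeros(2)
for _ in range(30):
    pos, vel = critically_damped_step(pos, vel, np.array([1.0, 0.0]), omega=4.0, dt=1 / 30)
```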
## Deployment Integration
This section describes how the C++ deployment stack uses the planner at runtime. Understanding this is useful for building custom integrations or modifying the replan behavior.
### Threading Model
The planner runs on a dedicated thread at 10 Hz (`planner_dt = 0.1 s`), separate from the control loop (50 Hz) and input thread (100 Hz). The planner thread:

1. Reads the latest `MovementState` from a thread-safe buffer (written by the input interface).
2. Decides whether a replan is needed.
3. If so, calls `UpdatePlanning()`, which runs TensorRT inference.
4. Stores the result in a shared buffer that the control thread picks up on its next tick.
### Initialization
When the planner is first enabled (e.g., pressing ENTER on the keyboard interface), the planner thread:
1. Reads the robot's current base quaternion and joint positions from the latest `LowState`.
2. Calls `Initialize()`, which:
   - Sets up a 4-frame context at the default standing height with zero-yaw orientation.
   - Runs an initial inference with `IDLE` mode and no movement.
   - Resamples the 30 Hz output to 50 Hz.
3. The control thread detects the new planner motion and switches `current_motion_` to the planner output.
### Context Construction

The planner requires a 4-frame context (`context_mujoco_qpos` of shape `[1, 4, 36]`). During operation, this context is sampled from the current planner motion (not the robot state):
- The context starts at `gen_frame = current_frame + motion_look_ahead_steps` (default look-ahead = 2 frames at 50 Hz).
- 4 frames are sampled at 30 Hz intervals from this starting point.
- Joint positions and body positions are linearly interpolated, and body quaternions slerp-interpolated, between 50 Hz motion frames to produce the 30 Hz context samples.
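A sketch of this 30 Hz sampling from a 50 Hz motion buffer. It uses plain linear interpolation for brevity (a faithful implementation would slerp the quaternion slice), and the names are illustrative:

```python
import numpy as np

def sample_context_30hz(motion_50hz, gen_frame):
    """Sample 4 context frames at 30 Hz spacing from a [T, 36] 50 Hz motion buffer."""
    frames = []
    for k in range(4):
        t = gen_frame + k * (50.0 / 30.0)  # fractional 50 Hz index of the k-th sample
        i0 = int(np.floor(t))
        i1 = min(i0 + 1, len(motion_50hz) - 1)
        alpha = t - i0
        # Linear interpolation; quaternion components should really be slerped
        frames.append((1 - alpha) * motion_50hz[i0] + alpha * motion_50hz[i1])
    return np.stack(frames)[None].astype(np.float32)  # [1, 4, 36]
```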
### Replan Logic
Not every planner tick triggers a replan. The decision follows this priority:
**1. Always replan when** (regardless of static/non-static mode):

- Locomotion mode changed
- Facing direction changed
- Height changed

**2. For non-static modes only, also replan when** any of the following is true:

- Movement speed changed
- Movement direction changed
- Periodic replan timer expired and movement speed is non-zero
Static modes (Idle, Squat, Kneel, Lying, Idle Boxing) never trigger replans from the second category — they only replan on mode/facing/height changes from the first category.
Replan intervals (periodic timer) vary by locomotion type to balance responsiveness and computational cost:
| Locomotion Type | Replan Interval |
|---|---|
| Running | 0.1 s (every planner tick) |
| Crawling | 0.2 s |
| Boxing (punches, hooks) | 1.0 s |
| All others (walk, squat, styled, etc.) | 1.0 s |
The periodic timer only triggers a replan if the current movement speed is non-zero — a stationary robot in a non-static mode (e.g., Walk mode with speed 0) will not replan on the timer.
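The decision logic above condenses to a small predicate. A sketch in Python, where the `MovementState` fields and mode labels are illustrative stand-ins for the C++ types:

```python
from dataclasses import dataclass

@dataclass
class MovementState:  # illustrative stand-in for the C++ MovementState
    mode: str
    speed: float
    direction: tuple
    facing: tuple
    height: float

STATIC_MODES = {"idle", "squat", "kneel_two_leg", "kneel_one_leg", "lying", "idle_boxing"}

def needs_replan(prev: MovementState, cur: MovementState, timer_expired: bool) -> bool:
    # Category 1: mode / facing / height changes always trigger a replan
    if cur.mode != prev.mode or cur.facing != prev.facing or cur.height != prev.height:
        return True
    # Category 2 applies only to non-static modes
    if cur.mode in STATIC_MODES:
        return False
    if cur.speed != prev.speed or cur.direction != prev.direction:
        return True
    # The periodic timer only fires while the robot is commanded to move
    return timer_expired and cur.speed != 0.0
```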
### Output Resampling (30 Hz → 50 Hz)
The planner model outputs frames at 30 Hz. The deployment stack resamples them to 50 Hz (the control loop rate) using linear interpolation:
1. For each 50 Hz frame, compute the corresponding fractional 30 Hz frame index.
2. Linearly interpolate joint positions and body positions between the two nearest 30 Hz frames.
3. Slerp-interpolate body quaternions.
4. Compute joint velocities by finite differencing the resampled positions (`(pos[t+1] - pos[t]) * 50`).
The resampled motion is stored in `planner_motion_50hz_` and has `num_pred_frames * 50 / 30` frames (rounded down).
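A sketch of this resampling for the position channels (quaternion channels additionally need slerp; the names are illustrative):

```python
import numpy as np

def resample_30_to_50hz(pos_30hz):
    """Linearly resample [T30, D] positions to 50 Hz and finite-difference velocities."""
    t30 = len(pos_30hz)
    t50 = t30 * 50 // 30                 # frame count rounded down, as described above
    idx = np.arange(t50) * 30.0 / 50.0   # fractional 30 Hz index of each 50 Hz frame
    i0 = np.floor(idx).astype(int)
    i1 = np.minimum(i0 + 1, t30 - 1)
    alpha = (idx - i0)[:, None]
    pos = (1 - alpha) * pos_30hz[i0] + alpha * pos_30hz[i1]
    vel = np.diff(pos, axis=0, append=pos[-1:]) * 50.0  # (pos[t+1] - pos[t]) * 50
    return pos, vel
```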
### Animation Blending
When a new planner output arrives while the previous one is still playing, the control thread blends the old and new animations over an 8-frame cross-fade:
1. The old animation is rebased so `current_frame` maps to frame 0.
2. The new animation is aligned to start at `gen_frame - current_frame` in the rebased timeline.
3. Over 8 frames starting from the blend point, a linearly increasing weight `w_new` (0 → 1) is applied:
   - Joint positions/velocities: `w_old * old + w_new * new`
   - Body positions: `w_old * old + w_new * new`
   - Body quaternions: `slerp(old, new, w_new)`
4. After the blend region, the new animation takes over completely. `current_frame` is reset to 0 on the blended result.
This ensures smooth transitions between successive planner outputs without visible discontinuities.
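A sketch of the cross-fade for the linear channels, given two animations already aligned on a common timeline (quaternion channels would use slerp instead of the linear mix):

```python
import numpy as np

BLEND_FRAMES = 8

def crossfade(old, new):
    """Blend two time-aligned [T, D] animations over the first BLEND_FRAMES frames."""
    out = new.copy()
    n = min(BLEND_FRAMES, len(old), len(new))
    w_new = np.linspace(0.0, 1.0, n)[:, None]  # linearly increasing weight, 0 -> 1
    out[:n] = (1.0 - w_new) * old[:n] + w_new * new[:n]
    return out  # past the blend region the new animation takes over completely
```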
### TensorRT Acceleration
The planner runs via TensorRT with CUDA graph capture for low-latency inference:
- At startup, the ONNX model is converted to a TensorRT engine (cached on disk).
- A CUDA graph is captured during initialization — this records the entire inference pass (input copy → kernel launches → output copy) as a single replayable graph.
- On each replan, inputs are copied to the GPU via pinned memory (`TPinnedVector`), the CUDA graph is launched, and outputs are copied back.
- FP16 precision is supported via `--planner-precision 16` (default is FP32).
## Planner Model Versions
The deployment stack supports multiple planner model versions, auto-detected from the model filename:
| Version | Inputs | Modes | Description |
|---|---|---|---|
| V0 | 6 | 4 (Idle, Slow Walk, Walk, Run) | Basic locomotion only |
| V1 | 11 | 20 | Adds squat/kneel/boxing/styled walks + height control + waypoint targets |
| V2 | 11 | 27 | All V1 modes + additional styled walking modes |
The version is determined by the presence of `V0`, `V1`, or `V2` in the planner model filename. The version determines:

- The number of input tensors (6 vs. 11)
- The valid range of `mode` values
- Whether the `height`, `has_specific_target`, `specific_target_positions`, `specific_target_headings`, and `allowed_pred_num_tokens` inputs are used
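A sketch of the filename-based detection rule (the tag search mirrors the description above; the exact C++ parsing is not shown here):

```python
def detect_planner_version(model_path: str) -> str:
    """Infer the planner version (V0/V1/V2) from the model filename."""
    name = model_path.rsplit("/", 1)[-1]
    for version in ("V2", "V1", "V0"):
        if version in name:
            return version
    raise ValueError(f"no planner version tag (V0/V1/V2) in {name!r}")

assert detect_planner_version("models/kinematic_planner_V2.onnx") == "V2"
```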
## Model Properties
The exported ONNX model has the following properties:
- ONNX opset version: 17
- Batch size: 1 (fixed)
The model is distributed as a single `.onnx` file, along with a `.pt` file containing reference input/output tensors that can be used for validation.
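A sketch of using those reference tensors to validate an integration. The `.pt` filename and its key layout (`"inputs"`/`"outputs"` dicts keyed by tensor name, ordered to match the session outputs) are assumptions; inspect the shipped file for the actual structure:

```python
import numpy as np
import onnxruntime as ort
import torch

# Hypothetical filename and layout; inspect the shipped .pt file for the real keys
ref = torch.load("kinematic_planner_reference.pt")
inputs = {name: t.numpy() for name, t in ref["inputs"].items()}

session = ort.InferenceSession("kinematic_planner.onnx")
outputs = session.run(None, inputs)

for got, (name, want) in zip(outputs, ref["outputs"].items()):
    np.testing.assert_allclose(got, want.numpy(), rtol=1e-3, atol=1e-4)
    print(f"{name}: OK")
```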
> **Coming Soon**
>
> The training code, export tooling, and a full technical report will be released soon. Stay tuned for updates.
## Usage Example
```python
import onnxruntime as ort
import numpy as np

# Load the ONNX model
session = ort.InferenceSession("kinematic_planner.onnx")

# current_qpos_buffer: the caller's 4-frame context of shape [1, 4, 36]
# (see context_mujoco_qpos above)

# Primary inputs
inputs = {
    "context_mujoco_qpos": current_qpos_buffer.astype(np.float32),         # [1, 4, 36]
    "target_vel": np.array([-1.0], dtype=np.float32),                      # -1.0 = use mode default
    "mode": np.array([2], dtype=np.int64),                                 # 2 = walk
    "movement_direction": np.array([[1.0, 0.0, 0.0]], dtype=np.float32),   # forward
    "facing_direction": np.array([[1.0, 0.0, 0.0]], dtype=np.float32),     # face forward
    "height": np.array([-1.0], dtype=np.float32),                          # -1.0 = disabled
    # Advanced inputs (typically managed by the C++ stack)
    "random_seed": np.array([1234], dtype=np.int64),
    "has_specific_target": np.array([[0]], dtype=np.int64),
    "specific_target_positions": np.zeros([1, 4, 3], dtype=np.float32),
    "specific_target_headings": np.zeros([1, 4], dtype=np.float32),
    "allowed_pred_num_tokens": np.ones([1, 11], dtype=np.int64),           # K = 11 for default model
}

# Run inference
mujoco_qpos, num_pred_frames = session.run(None, inputs)

# Extract valid frames only
num_frames = int(num_pred_frames)
predicted_motion = mujoco_qpos[0, :num_frames, :]  # [num_frames, 36]

# Each row: [x, y, z, qw, qx, qy, qz, 29 joint angles in radians]
for frame in predicted_motion:
    root_pos = frame[:3]
    root_quat = frame[3:7]      # (w, x, y, z)
    joint_angles = frame[7:36]  # 29 DOF positions
```