VLA Workflow: Collect, Fine-tune, Deploy#
This tutorial walks through the end-to-end workflow for training and deploying a VLA policy on the Unitree G1 with SONIC whole-body control:
Collect teleop demonstrations using the SONIC stack
Fine-tune the Isaac-GR00T N1.7 model on your collected data
Deploy the fine-tuned policy for autonomous inference
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ 1. Collect │ │ 2. Fine-tune │ │ 3. Deploy │
│ (VR Teleop + │ ──► │ (Isaac-GR00T │ ──► │ (PolicyServer │
│ Data Export) │ │ N1.7) │ │ + SONIC) │
└─────────────────┘ └─────────────────┘ └─────────────────┘
For examples of whole-body manipulation tasks accomplished with this workflow, see the VLA result videos on the GEAR-SONIC project page.
How It Works: SONIC Latent Actions#
Instead of predicting raw joint angles, the VLA predicts SONIC latent motion tokens — a compact 64-dimensional representation learned by the SONIC whole-body controller. SONIC then decodes these latents into full-body joint commands at 50 Hz.
┌─────────────┐ latent tokens ┌─────────────┐ joint commands ┌───────┐
│ VLA Model │ ──────────────────► │ SONIC │ ──────────────────► │ Robot │
│ (2.5 Hz) │ (64-dim × 40) │ Decoder │ (50 Hz) │ │
│ │ │ (C++) │ │ │
└─────────────┘ └─────────────┘ └───────┘
This means the VLA only needs to reason about what to do — SONIC handles the how: balance, locomotion, and smooth whole-body coordination all come for free from the pretrained controller. The result is a system that can walk, reach, grasp, and manipulate simultaneously.
The full action space per inference step is 78-dimensional: 64-dim motion token + 7-dim left hand joints + 7-dim right hand joints.
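As a concrete illustration, a single action step can be sliced into these three segments. This is a minimal sketch; the index layout (motion token first, then left and right hand) follows the description above and is an assumption, not an official API.

```python
# Minimal sketch: slicing one 78-dim action step into its components.
# The index order (motion token, left hand, right hand) follows the text
# above and is an assumption, not an official API.
import numpy as np

action = np.zeros(78, dtype=np.float32)  # one action step produced by the VLA

motion_token = action[:64]    # 64-dim SONIC latent motion token (decoded to joints at 50 Hz)
left_hand = action[64:71]     # 7 left-hand joint targets
right_hand = action[71:78]    # 7 right-hand joint targets
```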
Step 1: Data Collection#
Collect teleop demonstrations using VR whole-body teleoperation. The data exporter records robot state, camera images, and teleop actions as a LeRobot dataset.
See the Data Collection tutorial for full setup instructions (camera server, VR teleop, recording controls).
Quick start:
python gear_sonic/scripts/launch_data_collection.py \
--camera-host 192.168.123.164 \
--task-prompt "pick up the soda can and place it in the bin"
Output: A LeRobot v2.1 dataset directory, e.g.:
outputs/2026-04-03-14-30-00-G1-robot01/
├── data/
│ └── train-00000.parquet
├── videos/
│ └── observation.images.ego_view/
│ └── episode_000000.mp4
└── meta/
├── info.json
├── modality.json
├── episodes.jsonl
└── tasks.jsonl
Tip
Collect at least 50–100 demonstrations of the target task for reliable fine-tuning.
Use --dataset-name to append multiple sessions into the same dataset, or merge
sessions afterwards with process_dataset.py.
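Before moving on to fine-tuning, it can be worth sanity-checking a collected session. Below is a minimal inspection sketch assuming the LeRobot v2.1 layout shown above; the dataset path is the example output directory, and the printed fields depend on what the exporter writes.

```python
# Sanity-check a collected LeRobot v2.1 dataset (sketch; path is illustrative).
import json
from pathlib import Path

import pandas as pd

root = Path("outputs/2026-04-03-14-30-00-G1-robot01")

info = json.loads((root / "meta" / "info.json").read_text())
episodes = (root / "meta" / "episodes.jsonl").read_text().splitlines()
print(f"{len(episodes)} episodes, info keys: {list(info)}")

# Peek at the recorded columns in the first data shard.
df = pd.read_parquet(next((root / "data").rglob("*.parquet")))
print(df.columns.tolist())
print(f"{len(df)} frames total")
```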
Step 2: Fine-tuning with Isaac-GR00T#
Fine-tune the GR00T N1.7 base model on your collected dataset using the Isaac-GR00T training pipeline.
Prerequisites#
Clone and install Isaac-GR00T:
git clone https://github.com/NVIDIA/Isaac-GR00T.git
cd Isaac-GR00T
uv sync --all-extras
Multi-GPU machine (4+ GPUs recommended)
The collected dataset accessible from the training machine
Launch Fine-tuning#
export NUM_GPUS=4
uv run python \
gr00t/experiment/launch_finetune.py \
--base-model-path nvidia/GR00T-N1.7-3B \
--dataset-path /path/to/your/collected_dataset \
--embodiment-tag UNITREE_G1_SONIC \
--modality-config-path gr00t/configs/data/embodiment_configs.py \
--num-gpus $NUM_GPUS \
--output-dir /path/to/output \
--save-total-limit 5 \
--save-steps 5000 \
--max-steps 20000 \
--use-wandb \
--global-batch-size 32 \
--color-jitter-params brightness 0.3 contrast 0.4 saturation 0.5 hue 0.08 \
--dataloader-num-workers 4
Key Parameters#
| Flag | Description |
|---|---|
| `--base-model-path` | HuggingFace model ID or local path to pretrained weights |
| `--dataset-path` | Path to the LeRobot dataset from Step 1 |
| `--embodiment-tag` | Must match the dataset's embodiment (`UNITREE_G1_SONIC`) |
| `--modality-config-path` | Python file defining the modality configuration |
| `--num-gpus` | Number of GPUs for distributed training |
| `--max-steps` | Total training steps (20k is a good starting point) |
| `--global-batch-size` | Total batch size across all GPUs |
| `--save-steps` | Checkpoint save interval |
| `--use-wandb` | Enable Weights & Biases logging |
Monitoring#
With --use-wandb, training metrics (loss, learning rate, etc.) are logged to your
W&B project. Monitor the training loss curve — it should decrease steadily and
plateau before --max-steps.
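If you prefer to check the curve programmatically, the W&B public API can pull the logged history. This is a sketch; the entity/project/run path and the metric key (`train/loss` here) are placeholders that depend on your logging setup.

```python
# Pull the logged training loss via the W&B public API (sketch; the run path
# and metric key are placeholders for your own project).
import wandb

api = wandb.Api()
run = api.run("my-entity/my-project/run-id")     # hypothetical run path
history = run.history(keys=["train/loss"])       # metric key depends on the trainer
print(history.tail())
```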
Output#
Checkpoints are saved to --output-dir:
/path/to/output/
├── checkpoint-5000/
├── checkpoint-10000/
├── checkpoint-15000/
├── checkpoint-20000/
├── config.json
└── processor_config.json
Use the final checkpoint (or the best-performing one based on your evaluation) for deployment in Step 3.
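If you simply want the most recent checkpoint, a small helper like the following can pick it. This is a sketch; the output path mirrors the example tree above.

```python
# Pick the latest checkpoint directory under --output-dir (sketch; path is illustrative).
from pathlib import Path

ckpts = sorted(
    Path("/path/to/output").glob("checkpoint-*"),
    key=lambda p: int(p.name.split("-")[-1]),
)
print(ckpts[-1])  # e.g. /path/to/output/checkpoint-20000
```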
Step 3: Deploy for Inference#
Deploy the fine-tuned model using the Isaac-GR00T PolicyServer and the SONIC inference stack. See the VLA Inference tutorial for full details on the inference pipeline, keyboard controls, and configuration.
Start the PolicyServer#
On the GPU machine, from the Isaac-GR00T repository:
uv run python gr00t/eval/run_gr00t_server.py \
--model-path /path/to/output/checkpoint-20000 \
--embodiment-tag UNITREE_G1_SONIC \
--device cuda:0 \
--port 5550
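Before launching inference, you can verify the server port is reachable from the inference machine. This is a generic TCP probe, not part of the GR00T or SONIC tooling; the host IP is a placeholder for your GPU machine.

```python
# Generic TCP reachability check for the PolicyServer port (not part of the
# GR00T/SONIC tooling; host is a placeholder for your GPU machine's IP).
import socket

host, port = "192.168.1.50", 5550   # match --port used when starting the server
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.settimeout(3.0)
    result = s.connect_ex((host, port))
print("reachable" if result == 0 else f"not reachable (errno {result})")
```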
Run Inference#
On the inference machine (from the GR00T-WholeBodyControl repository):
python gear_sonic/scripts/launch_inference.py \
--policy-host <gpu_machine_ip> \
--policy-port 5550 \
--camera-host 192.168.123.164 \
--prompt "pick up the soda can and place it in the bin"
Summary#
| Step | Where | Key Command |
|---|---|---|
| Collect | GR00T-WholeBodyControl | `launch_data_collection.py` |
| Fine-tune | Isaac-GR00T | `launch_finetune.py` |
| Deploy | Both repos | `run_gr00t_server.py` + `launch_inference.py` |
To iterate on a task, repeat Steps 1–3: collect more data, fine-tune again (or continue from an existing checkpoint), and redeploy.