🌍 SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer

📽️ About SANA-WM

SANA-WM is an efficient, open-source 2.6B-parameter world model trained natively for one-minute video generation. It synthesises minute-scale 720p videos with precise 6-DoF camera control and is paired with an LTX-2 sink-bidirectional Euler refiner for high-fidelity decoding.

Core contributions:

  • Hybrid Linear Attention: frame-wise Gated DeltaNet combined with softmax attention every $N$-th block for memory-efficient long-context modelling.
  • Dual-Branch Camera Control: independent main and camera branches enable precise per-frame trajectory adherence (6 DoF).
  • Two-Stage Generation Pipeline: a long-video refiner stitched on top of Stage-1 latents improves quality and temporal consistency.
  • Robust Annotation Pipeline: metric-scale 6-DoF camera poses extracted from public corpora yield spatiotemporally consistent action supervision.
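
As a rough illustration of the hybrid layout named in the first bullet, the sketch below builds a per-block attention schedule. The function name, the block count, the value of N, and the choice that blocks N, 2N, ... (counting from 1) carry softmax attention are all assumptions made for illustration; the text above only states that softmax attention appears every $N$-th block.

```python
def hybrid_attention_schedule(num_blocks: int, n: int) -> list[str]:
    """Return one attention label per transformer block.

    Assumption (not confirmed by the source): counting blocks from 1,
    every n-th block uses softmax attention and every other block uses
    the linear (Gated DeltaNet) mixer.
    """
    if num_blocks < 0 or n < 1:
        raise ValueError("num_blocks must be >= 0 and n must be >= 1")
    return [
        "softmax" if (i + 1) % n == 0 else "gated_deltanet"
        for i in range(num_blocks)
    ]


# Hypothetical sizes, chosen only to keep the printout short.
print(hybrid_attention_schedule(num_blocks=8, n=4))
```

With a larger N, fewer blocks pay the quadratic softmax cost, which is the memory trade-off the bullet refers to.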

SANA-WM completes pre-training in 15 days on 64 H100s and generates a 60s 720p clip on a single GPU.

Note: Building on the original bidirectional pipeline (full-sequence Stage 1 + sink-bidirectional refiner), this release adds a streaming pipeline: a chunk-causal distilled Stage 1, a chunk-causal refiner, and a causal-VAE decoder, overlapped on three CUDA streams. Output is written progressively to MP4, so you can watch the clip as it generates. Streaming weights are released under SANA-WM_streaming.

⚙️ Environment Setup

bash ./environment_setup.sh sana
conda activate sana

🏃 Inference

All Stage-1 / Stage-2 weights, the VAE, and the LTX-2 Gemma text encoder are fetched on first use from Efficient-Large-Model/SANA-WM_bidirectional; no manual download is required.

Example 1: image + prompt + action string

python inference_video_scripts/wm/inference_sana_wm.py \
  --image      asset/sana_wm/demo_0.png \
  --prompt     asset/sana_wm/demo_0.txt \
  --action     "w-100,dw-60,w-100,aw-60" \
  --num_frames 321 \
  --output_dir results/sana_wm_demo

Action DSL: each segment is <keys>-<frames>, and segments are joined by commas. Keys: w / s move forward / back (translation along the heading); a / d yaw left / right (turn); i / k pitch up / down; j / l strafe left / right. none-N holds the pose for N frames. Held keys ease in and out with light inertia (instant on a fresh press, gentle coast on release). The default speeds are gentle (--translation_speed 0.025, --rotation_speed_deg 0.6).
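
To make the segment grammar concrete, here is a minimal parsing sketch. It is not the repository's action_string_to_c2w; the function names are hypothetical, and the validation rules (lowercase keys only, non-negative integer frame counts) are assumptions inferred from the description above.

```python
import re

# Keys documented for the --action DSL; "none" holds the pose.
VALID_KEYS = set("wsadikjl")
_SEGMENT = re.compile(r"^([a-z]+)-(\d+)$")


def parse_action_string(action: str) -> list[tuple[str, int]]:
    """Split an action string into (keys, frames) pairs."""
    segments = []
    for raw in action.split(","):
        match = _SEGMENT.match(raw.strip())
        if match is None:
            raise ValueError(f"malformed segment: {raw!r}")
        keys, frames = match.group(1), int(match.group(2))
        if keys != "none" and not set(keys) <= VALID_KEYS:
            raise ValueError(f"unknown key in segment: {raw!r}")
        segments.append((keys, frames))
    return segments


def total_frames(action: str) -> int:
    """Sum the frame counts of all segments."""
    return sum(frames for _, frames in parse_action_string(action))


print(parse_action_string("w-100,dw-60,w-100,aw-60"))
print(total_frames("w-100,dw-60,w-100,aw-60"))
```

For the Example 1 string this yields four segments covering 320 frames in total.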

⚠️ Mapping update (breaking change vs the first release). The --action keys were remapped so the demo and CLI share one control scheme: a / d now yaw (previously strafe) and j / l now strafe (previously yaw); w / s (forward/back) and i / k (pitch) are unchanged, and the old implicit a/d→steer coupling is gone. Motion is also smoothed now and the default speeds are gentler. If you have action strings from the earlier release, swap a/d ↔ j/l to reproduce the same motion (the CLI also prints this notice once when --action is used).
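
The key swap described above can be applied mechanically. The helper below is a hypothetical convenience, not part of the release; it only exchanges a ↔ j and d ↔ l in the key part of each segment and makes no attempt to emulate the earlier release's speeds or steering coupling.

```python
# Old-release key -> current key: the strafe and yaw keys trade places.
_SWAP = str.maketrans("adjl", "jlad")


def convert_legacy_action(action: str) -> str:
    """Rewrite an earlier-release action string for the current key mapping."""
    converted = []
    for segment in action.split(","):
        keys, sep, frames = segment.strip().rpartition("-")
        if not sep:
            raise ValueError(f"malformed segment: {segment!r}")
        # "none" contains none of the swapped letters, so it passes through.
        converted.append(f"{keys.translate(_SWAP)}-{frames}")
    return ",".join(converted)


print(convert_legacy_action("w-100,dw-60,none-5,jl-3"))
```

Because the mapping is a pure swap, applying the function twice returns the original string.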

Example 2: image + prompt + camera trajectory (.npy)

python inference_video_scripts/wm/inference_sana_wm.py \
  --image      asset/sana_wm/demo_0.png \
  --prompt     asset/sana_wm/demo_0.txt \
  --camera     asset/sana_wm/demo_0_pose.npy \
  --intrinsics asset/sana_wm/demo_0_intrinsics.npy \
  --num_frames 321 \
  --output_dir results/sana_wm_demo

--camera is a NumPy .npy of shape (F, 4, 4) (camera-to-world matrices); --intrinsics is a .npy of shape (3, 3), (F, 3, 3), or (4,) = (fx, fy, cx, cy) in input-image pixels. If --intrinsics is omitted, we estimate it from --image with Pi3X and abort if the resulting FOV is outside [25°, 120°].
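
For readers checking their own intrinsics against the FOV range, the following sketch uses the standard pinhole relation FOV = 2·atan(size / (2·f)). The function names are hypothetical, and applying the range to the horizontal FOV is an assumption; the text above does not say which axis the script checks.

```python
import math


def fov_degrees(focal_px: float, size_px: float) -> float:
    """Pinhole field of view along one image axis, in degrees."""
    if focal_px <= 0 or size_px <= 0:
        raise ValueError("focal length and image size must be positive")
    return math.degrees(2.0 * math.atan(size_px / (2.0 * focal_px)))


def intrinsics_matrix(fx: float, fy: float, cx: float, cy: float):
    """Expand the (4,) form (fx, fy, cx, cy) into a 3x3 matrix (nested lists)."""
    return [[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]]


def fov_in_range(fx: float, width_px: float,
                 lo: float = 25.0, hi: float = 120.0) -> bool:
    # Assumption: the [25, 120] degree range applies to the horizontal FOV.
    return lo <= fov_degrees(fx, width_px) <= hi


print(fov_degrees(640.0, 1280.0))
print(fov_in_range(640.0, 1280.0))
```

All quantities are in input-image pixels, matching the units stated for --intrinsics.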

The release ships five first-frame + prompt + camera examples under asset/sana_wm/: demo_{0..4}.{png,txt}, each with a rolled-out _pose.npy trajectory and an _intrinsics.npy. Swap demo_0 for any of them in the commands above (works for both the bidirectional and streaming scripts). The actions are gentle by design: slow forward drift with light left/right look-around.

| Example | Scene | --action |
|---|---|---|
| demo_0 | salt-desert / black supercar | w-100,dw-60,w-100,aw-60 |
| demo_1 | bioluminescent cave | w-35,aw-60,dw-100,aw-55,w-25,none-50 |
| demo_2 | mushroom forest / robot | w-25,aw-60,dw-100,aw-55,none-85 (+ --translation_speed 0.015) |
| demo_3 | salt flat / supercar | w-70,none-40,dw-35,w-70,aw-35,none-72 |
| demo_4 | ice plain / portal | w-95,aw-35,w-70,dw-35,none-87 |

The _pose.npy files already bake in these actions (and demo_2's slower speed), so --camera asset/sana_wm/demo_N_pose.npy reproduces the same motion as the matching --action string.
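
To run all five shipped examples from their baked trajectories, a small driver can assemble the same command as Example 2 for each demo. This is a sketch: build_command is a hypothetical helper, and it uses only the flags and asset paths shown above.

```python
import shlex

SCRIPT = "inference_video_scripts/wm/inference_sana_wm.py"


def build_command(demo_index: int,
                  output_dir: str = "results/sana_wm_demo") -> list[str]:
    """Assemble the Example 2 command line for demo_<demo_index>."""
    if not 0 <= demo_index <= 4:
        raise ValueError("the release ships demo_0 through demo_4")
    stem = f"asset/sana_wm/demo_{demo_index}"
    return [
        "python", SCRIPT,
        "--image", f"{stem}.png",
        "--prompt", f"{stem}.txt",
        "--camera", f"{stem}_pose.npy",
        "--intrinsics", f"{stem}_intrinsics.npy",
        "--num_frames", "321",
        "--output_dir", output_dir,
    ]


# Print the commands; nothing is executed here.
for index in range(5):
    print(shlex.join(build_command(index)))
```

To execute instead of print, pass each list to subprocess.run(cmd, check=True) from the repository root.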

Lower memory

For tight VRAM budgets, opt in to lazy-load + CPU offload:

... --offload_vae --offload_refiner

Streaming inference

The streaming pipeline replaces all three full-sequence stages with chunk-causal variants and emits one decoded chunk per AR block straight into a progressive MP4. Stage 1 runs the 4-step distilled student (CFG-baked-in, runs at cfg_scale=1), the refiner runs chunk-causal AR with a sliding KV window, and the causal LTX-2 VAE decodes chunk-by-chunk.

All streaming weights (DiT, causal VAE, refiner, and the Gemma text encoder) are fetched on first use from SANA-WM_streaming; no manual download is required, exactly as on the bidirectional path. The inference YAML ships in-repo under configs/sana_wm/. Just run:

python inference_video_scripts/wm/inference_sana_wm_streaming.py \
  --image       asset/sana_wm/demo_0.png \
  --prompt      asset/sana_wm/demo_0.txt \
  --action      "w-80,dw-40,w-80,aw-40" \
  --num_frames  241 \
  --output_dir  results/sana_wm_streaming

--num_frames defaults to 241 (~15s @ 16fps). It is snapped to 8·refiner_block_size·k + 1 so the VAE and refiner chunking divide evenly (241 = 24·10 + 1 needs no snap). Use a larger value (e.g. 961 for ~60s) for longer clips.
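
The snapping rule can be checked ahead of time with a few lines of arithmetic. In this sketch the function names are hypothetical, and rounding up to the next valid value with k ≥ 1 is an assumption; the script may snap in a different direction.

```python
def snap_num_frames(num_frames: int, refiner_block_size: int = 3) -> int:
    """Snap to a value of the form 8 * refiner_block_size * k + 1.

    Assumption: rounds up and keeps k >= 1.
    """
    stride = 8 * refiner_block_size
    k = max(1, -(-(num_frames - 1) // stride))  # ceiling division
    return stride * k + 1


def clip_seconds(num_frames: int, fps: int = 16) -> float:
    """Clip duration at the given output frame rate."""
    return num_frames / fps


for requested in (241, 250, 961):
    snapped = snap_num_frames(requested)
    print(requested, snapped, clip_seconds(snapped))
```

Both 241 and 961 are already valid for refiner_block_size = 3 (k = 10 and k = 40), so they pass through unchanged.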

Output lands at results/sana_wm_streaming/<name>_streaming.mp4 and grows in place, so you can watch it while inference continues. The pipeline reaches ~0.93× realtime on a single H100 after a one-time torch.compile warmup (~3 min cold, ~30 s with a warm cache; the warmup amortises across runs that reuse the same shapes).

All speed-critical knobs are baked into the script as defaults: torch.compile on the refiner transformer (max-autotune-no-cudagraphs mode), flash-only SDPA, Inductor coordinate_descent_tuning + epilogue_fusion, cuDNN benchmark, and the expandable CUDA allocator. The causal VAE decoder is intentionally not compiled: torch.compile corrupts its cross-chunk causal cache (chunk 0 decodes fine but later chunks come out blank/gray), so it runs eager. There is no slow/fast toggle; the script is the fast config.

Overrides for advanced use:

  • --streaming_root <path>: optional LOCAL bundle dir holding sana_dit/, ltx2_causal_vae/, refiner_diffusers/, gemma3_12b/. Unset by default, in which case each artefact is pulled from hf://Efficient-Large-Model/SANA-WM_streaming.
  • --config / --model_path / --causal_vae_path / --refiner_root / --refiner_gemma_root: point at non-default weight paths (local path or hf:// URI). --config defaults to the in-repo configs/sana_wm/sana_wm_streaming_1600m_720p.yaml.
  • --num_frame_per_block (default 3, must match the checkpoint's chunk_size), --denoising_step_list (default "1000,960,889,727,0"), --refiner_block_size (3), --refiner_kv_max_frames (11): change the canonical recipe at your own quality risk.

🎛️ Argument Reference

| Argument | Format / Default |
|---|---|
| --image | First-frame RGB image. Aspect-preserving resized + center-cropped to 704×1280. |
| --prompt | UTF-8 text file with the conditioning prompt. |
| --camera | (F, 4, 4) .npy camera-to-world matrices. Mutually exclusive with --action. |
| --action | Control DSL (w/s move, a/d yaw, i/k pitch, j/l strafe). Rolled out via action_string_to_c2w (smoothed) to a (F+1, 4, 4) trajectory. |
| --translation_speed | Per-frame translation magnitude (default 0.025). |
| --rotation_speed_deg | Per-frame rotation magnitude in degrees (default 0.6). |
| --intrinsics | Optional .npy of shape (3, 3), (F, 3, 3), or (4,). Pi3X-estimated if omitted. |
| --num_frames | Total frames to generate (default 161; the demos above use 321). |
| --fps | Output mp4 frame rate (default 16). |
| --step | Stage-1 DiT sampling steps (default 60). |
| --cfg_scale | Classifier-free-guidance scale (default 5.0). |
| --flow_shift | Override the scheduler's inference_flow_shift. |
| --no_refiner | Skip the LTX-2 refiner and decode Stage-1 latents with the Sana VAE (faster, lower quality). |
| --refiner_root | LTX-2 refiner root containing transformer/ and connectors/. |
| --no_action_overlay | Skip the WASD + joystick overlay on the output video. |
| --offload_vae | Move the VAE to CPU between encode / decode steps. |
| --offload_refiner | Lazy-load the LTX-2 refiner only when needed; release afterwards. |
| --sampling_algo | flow_euler_ltx (default, bidirectional). For streaming use the dedicated wm/inference_sana_wm_streaming.py. |
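
The --image preprocessing (aspect-preserving resize followed by a center crop) can be previewed with plain arithmetic. The sketch below is not the script's own code: the helper name is hypothetical, reading 704×1280 as height × width is an assumption, and so are the "cover" scaling and the rounding choices.

```python
def resize_and_crop_box(width: int, height: int,
                        target_w: int = 1280, target_h: int = 704):
    """Return (resized_w, resized_h, crop_box) for a cover-style resize.

    crop_box is (left, top, right, bottom) in resized-image pixels.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    # Scale so the image covers the target on both axes, keeping aspect ratio.
    scale = max(target_w / width, target_h / height)
    # max() guards against rounding to just below the target size.
    new_w = max(target_w, round(width * scale))
    new_h = max(target_h, round(height * scale))
    left = (new_w - target_w) // 2
    top = (new_h - target_h) // 2
    return new_w, new_h, (left, top, left + target_w, top + target_h)


print(resize_and_crop_box(1920, 1080))
```

Under these assumptions a 1920×1080 input is resized to 1280×720 and loses 8 rows at the top and bottom, so content near the frame edges of the first image may not reach the model.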

📁 HF Repository Layout

Efficient-Large-Model/SANA-WM_bidirectional:

| Component | Path | Size |
|---|---|---|
| Sana DiT (Stage 1) | dit/sana_wm_1600m_720p.safetensors | 10 GB |
| LTX-2 VAE (diffusers) | vae/ | 2 GB |
| LTX-2 refiner (Stage 2) | refiner/{transformer,connectors}/ | 38 GB |
| Gemma text encoder for the refiner | refiner/text_encoder/ | 46 GB |
| Inference config | config.yaml | - |

Efficient-Large-Model/SANA-WM_streaming (streaming variant):

| Component | Path |
|---|---|
| Chunk-causal Sana DiT (distilled) | sana_dit/model.pt |
| Causal LTX-2 VAE | ltx2_causal_vae/ |
| Chunk-causal LTX-2 refiner | refiner_diffusers/{transformer,connectors}/ |
| Gemma-3-12B text encoder (refiner) | gemma3_12b/ |

The inference config ships in-repo at configs/sana_wm/sana_wm_streaming_1600m_720p.yaml (not in the weights repo).

The Sana text encoder (gemma-2-2b-it) is fetched separately from Efficient-Large-Model/gemma-2-2b-it.

📝 BibTeX

@misc{zhu2026sanawm,
      title={SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer},
      author={Haoyi Zhu and Haozhe Liu and Yuyang Zhao and Tian Ye and Junsong Chen and Jincheng Yu and Tong He and Song Han and Enze Xie},
      year={2026},
      eprint={2605.15178},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.15178},
}