SANA-Streaming: Real-time Streaming Video Editing with Hybrid Diffusion Transformer

📽️ About SANA-Streaming

SANA-Streaming is a real-time video-to-video editing system for minute-level, high-resolution editing. Given a source video and a text instruction, it edits the requested content while preserving source motion and non-edited regions.

Core contributions:

  • Hybrid Diffusion Transformer — interleaves Gated DeltaNet (GDN) blocks with softmax-attention blocks, combining compact long-range memory with local source alignment.
  • Streaming Video Editing — processes long videos with state caching and chunk-wise generation instead of full-sequence attention.
  • Cycle-Reverse Regularization — improves temporal consistency by training the model to reconstruct source frames from edited content through flow matching.
  • Efficient System Co-design — the paper reports fused GDN kernels and Mixed-Precision Quantization (MPQ) for RTX 5090 deployment, reaching 1280×704 real-time editing at 24 end-to-end FPS and 58 DiT FPS.
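The "compact long-range memory" attributed to the GDN blocks comes from a recurrent state whose size does not grow with video length. Below is a minimal sketch of the gated delta rule as published for Gated DeltaNet, written in plain Python for readability. It is not the repository's fused kernel, and the names `gdn_step`, `alpha`, and `beta` are illustrative assumptions.

```python
def gdn_step(state, q, k, v, alpha, beta):
    """One gated delta-rule update on a d_v x d_k state matrix.

    S_t = alpha * S_{t-1} (I - beta * k k^T) + beta * v k^T
    o_t = S_t q
    """
    d_v, d_k = len(v), len(k)
    new_state = []
    for i in range(d_v):
        row = state[i]
        # (S k)_i: what the memory currently returns for key k in row i.
        recalled = sum(row[j] * k[j] for j in range(d_k))
        new_state.append([
            alpha * (row[j] - beta * recalled * k[j]) + beta * v[i] * k[j]
            for j in range(d_k)
        ])
    out = [sum(new_state[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]
    return new_state, out


# Write two values under the same unit-norm key; the second replaces the first.
state = [[0.0, 0.0], [0.0, 0.0]]
key = [1.0, 0.0]
state, _ = gdn_step(state, key, key, [3.0, 4.0], alpha=1.0, beta=1.0)
state, out = gdn_step(state, key, key, [5.0, 6.0], alpha=1.0, beta=1.0)
print(out)  # [5.0, 6.0]
```

The state stays 2 × 2 however many tokens are processed, which is the property that makes this kind of block attractive for long streams. Setting `alpha` below 1 decays older associations.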

This repository release exposes two practical inference paths:

  • bidirectional_short: 5-second short-video editing with the bidirectional 2B SANA-Streaming DiT.
  • long_streaming: 1-min long-video editing with the streaming 2B SANA-Streaming DiT.

The current public script runs the released BF16 checkpoints. The MPQ deployment recipe described in the paper is not required for the commands below.

⚙️ Environment Setup

bash ./environment_setup.sh sana
conda activate sana

The released V2V checkpoints were validated with torch==2.10.0, torchvision==0.25.0, triton==3.6.0, transformers==4.57.1, accelerate==1.0.1, and Hugging Face diffusers commit fbe8a75ad59fe5c0eec7f3691d2eb0ed890a0c90. The fused GDN kernels and the LTX-2 VAE path are sensitive to runtime package versions; use the pinned package versions in pyproject.toml for reproducible bidirectional inference.
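Because the kernels are version-sensitive, it can help to compare the active environment against the validated versions before running inference. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `find_mismatches` are hypothetical and not part of the repository. The diffusers commit cannot be verified this way, since a package version string does not carry the commit hash.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions the README states the V2V checkpoints were validated with.
PINNED = {
    "torch": "2.10.0",
    "torchvision": "0.25.0",
    "triton": "3.6.0",
    "transformers": "4.57.1",
    "accelerate": "1.0.1",
}


def installed_version(name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map package name -> (wanted, found) for every package that differs."""
    mismatches = {}
    for name, wanted in pinned.items():
        found = lookup(name)
        # Local build tags such as "+cu128" are ignored in the comparison.
        if found is None or found.split("+")[0] != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {wanted}, found {found}")
```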

🏃 Inference

All DiT checkpoints and demo source videos are fetched on first use from the Hugging Face repos below. Local paths and hf:// URIs are both supported.
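To make the hf:// form concrete, here is a minimal sketch of how such a URI could be split into a Hugging Face repo id and a file path. It assumes the repo id is always the first two path segments (`<org>/<name>`), which matches the URIs used in this README; `parse_hf_uri` and `HfRef` are hypothetical names, and the script's actual resolver may differ.

```python
from typing import NamedTuple, Optional


class HfRef(NamedTuple):
    repo_id: str
    filename: str


def parse_hf_uri(path: str) -> Optional[HfRef]:
    """Split hf://<org>/<name>/<file path>; return None for local paths."""
    prefix = "hf://"
    if not path.startswith(prefix):
        return None  # Not an hf:// URI, so treat it as a local path.
    parts = path[len(prefix):].split("/", 2)
    if len(parts) < 3 or not all(parts):
        raise ValueError(f"expected hf://<org>/<name>/<path>, got {path!r}")
    org, name, filename = parts
    return HfRef(repo_id=f"{org}/{name}", filename=filename)


ref = parse_hf_uri(
    "hf://Efficient-Large-Model/SANA-Streaming/dit/sana_streaming_ar.pth"
)
print(ref.repo_id)   # Efficient-Large-Model/SANA-Streaming
print(ref.filename)  # dit/sana_streaming_ar.pth
```

The resulting pair has the shape expected by `huggingface_hub.hf_hub_download(repo_id=..., filename=...)`, which caches the file locally after the first download.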

Streaming long-video editing

The streaming model edits 969 frames by default with 4 denoising steps, cfg_scale=1.0, num_cached_blocks=2, and sink-token caching enabled.

python inference_video_scripts/v2v/inference_sana_streaming.py \
  --mode long_streaming \
  --config configs/sana_streaming/sana_streaming_2b_720p.yaml \
  --model_path hf://Efficient-Large-Model/SANA-Streaming/dit/sana_streaming_ar.pth \
  --prompt "Transform the entire scene into a breathtaking Sci-Fi Art digital painting." \
  --video_path hf://Efficient-Large-Model/SANA-Streaming/source/09_style_transfer_source.mp4 \
  --num_frames 969 \
  --step 4 \
  --cfg_scale 1.0 \
  --num_cached_blocks 2 \
  --sink_token true \
  --output_dir results/sana_streaming_long \
  --output_name output.mp4
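The interaction between --num_cached_blocks and --sink_token can be pictured as a sliding window over earlier chunks. The sketch below is one plausible reading, not the script's confirmed behavior: it assumes the window holds the most recent `num_cached_blocks` chunks and that, with the sink enabled, the first chunk takes the slot of the oldest entry once it would otherwise be evicted. The real implementation may instead keep the sink chunk in addition to the window. `cached_chunks` is a hypothetical helper.

```python
def cached_chunks(current, num_cached_blocks=2, sink_token=True):
    """Indices of earlier chunks kept in the cache while generating `current`."""
    if num_cached_blocks < 1:
        raise ValueError("num_cached_blocks must be at least 1")
    # Most recent chunks before the current one.
    recent = list(range(max(0, current - num_cached_blocks), current))
    if sink_token and recent and recent[0] != 0:
        # Chunk 0 has slid out of the window: drop the oldest entry, pin chunk 0.
        recent = [0] + recent[1:]
    return recent


for chunk in range(6):
    print(chunk, cached_chunks(chunk))
# 0 []
# 1 [0]
# 2 [0, 1]
# 3 [0, 2]
# 4 [0, 3]
# 5 [0, 4]
```

Under either reading, the number of cached chunks is bounded by a small constant, so memory use does not grow with the length of the video.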

Bidirectional short-video editing

The bidirectional model edits 81 frames by default with flow-DPM solver sampling, 50 denoising steps, and cfg_scale=6.0. A default negative prompt is applied unless --negative_prompt is provided.

python inference_video_scripts/v2v/inference_sana_streaming.py \
  --mode bidirectional_short \
  --config configs/sana_streaming/sana_streaming_bidirectional_2b_720p.yaml \
  --model_path hf://Efficient-Large-Model/SANA-Streaming_bidirectional/dit/sana_bidirectional_short.pth \
  --prompt "Remove the thick, textured gold hoop earrings from the woman's ears. Carefully reconstruct the exposed earlobes to match her natural skin tone and texture. Ensure the lighting and soft shadows on the newly bare ears blend seamlessly with the rest of her face, leaving no trace or reflection of the metallic jewelry behind." \
  --video_path hf://Efficient-Large-Model/SANA-Streaming/source/00_local_editing_source.mp4 \
  --num_frames 81 \
  --step 50 \
  --cfg_scale 6.0 \
  --output_dir results/sana_streaming_bidirectional \
  --output_name output.mp4
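The role of --cfg_scale and the negative prompt is easiest to see in the standard classifier-free guidance formula. The sketch below assumes the script follows that standard form, with the negative prompt conditioning the `uncond` branch; this is not confirmed by the README, and `apply_cfg` is a hypothetical helper operating on plain lists instead of tensors.

```python
def apply_cfg(cond, uncond, cfg_scale):
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + cfg_scale * (c - u) for c, u in zip(cond, uncond)]


cond = [1.0, 2.0]    # prediction conditioned on the editing prompt
uncond = [0.0, 1.0]  # prediction conditioned on the negative prompt

print(apply_cfg(cond, uncond, 1.0))  # [1.0, 2.0]
print(apply_cfg(cond, uncond, 6.0))  # [6.0, 7.0]
```

With a scale of 1.0 the result equals the conditional prediction, so the unconditional branch has no effect under this formulation. With a scale of 6.0 the difference between the two branches is amplified sixfold, which is where the negative prompt matters.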

The release includes three source videos under Efficient-Large-Model/SANA-Streaming. The same examples can be run in either long_streaming or bidirectional_short mode by changing --mode, along with any explicitly passed mode-specific arguments such as --config and --model_path.

| Example | Source video | Prompt |
| --- | --- | --- |
| Local editing | source/00_local_editing_source.mp4 | Remove the thick, textured gold hoop earrings from the woman's ears. Carefully reconstruct the exposed earlobes to match her natural skin tone and texture. Ensure the lighting and soft shadows on the newly bare ears blend seamlessly with the rest of her face, leaving no trace or reflection of the metallic jewelry behind. |
| Background editing | source/05_background_editing_source.mp4 | Replace the background with a cinematic, rain-streaked windowpane at dusk. Feature softly out-of-focus city lights in moody cool teal and muted amber glowing through the wet glass. Add delicate condensation and trickling raindrops to the window surface, maintaining a shallow depth of field to enhance the deeply emotional, melancholic atmosphere without altering the subject's lighting or appearance. |
| Style transfer | source/09_style_transfer_source.mp4 | Transform the entire scene into a breathtaking Sci-Fi Art digital painting. Re-render the background as an out-of-focus futuristic cityscape with glowing holographic bokeh and sleek technological structures. Re-imagine the subject in a highly detailed, futuristic illustration style, giving her skin a flawless, subtly luminescent quality. Keep her exact features, pose, and emotional expression intact, while rendering her hair, clothing, and phone with advanced, sleek synthetic textures. Bathe the composition in atmospheric neon blues, cool cyans, and deep purples to reflect a highly advanced civilization. |

🎛️ Argument Reference

| Argument | Format / Default |
| --- | --- |
| --mode | long_streaming or bidirectional_short (default long_streaming). |
| --prompt | Text editing instruction. |
| --video_path | Source MP4 path. Supports local files and hf://<repo>/<path> URIs. |
| --output_dir | Output directory. |
| --output_name | Output MP4 filename (default output.mp4). |
| --config | YAML config path. Defaults are mode-specific under configs/sana_streaming/. |
| --model_path | DiT checkpoint path. Defaults to the released Hugging Face checkpoints. |
| --num_frames | Frames decoded from the source video (969 for streaming, 81 for bidirectional). |
| --height / --width | Center-cropped output resolution (704 × 1280). |
| --fps | Output MP4 frame rate (16). |
| --step | Denoising steps (4 for streaming, 50 for bidirectional). |
| --cfg_scale | CFG scale (1.0 for streaming, 6.0 for bidirectional). |
| --flow_shift | Optional scheduler flow-shift override. |
| --seed | Random seed (0). |
| --negative_prompt | Optional negative prompt. Bidirectional mode uses a built-in default if omitted. |
| --num_cached_blocks | Streaming cache window size (2). |
| --sink_token | Keep the first chunk in the streaming cache window (true). |
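The mode-specific defaults in the argument reference can be collected in one place, which also makes the "5-second" and "1-min" figures easy to verify from the frame counts and the default output frame rate. This is a minimal sketch built from the values documented here; `ModeDefaults`, `MODE_DEFAULTS`, and `clip_seconds` are hypothetical names, not objects from the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModeDefaults:
    config: str
    num_frames: int
    step: int
    cfg_scale: float


DEFAULT_FPS = 16  # default for --fps

MODE_DEFAULTS = {
    "long_streaming": ModeDefaults(
        config="configs/sana_streaming/sana_streaming_2b_720p.yaml",
        num_frames=969,
        step=4,
        cfg_scale=1.0,
    ),
    "bidirectional_short": ModeDefaults(
        config="configs/sana_streaming/sana_streaming_bidirectional_2b_720p.yaml",
        num_frames=81,
        step=50,
        cfg_scale=6.0,
    ),
}


def clip_seconds(mode, fps=DEFAULT_FPS):
    """Duration of the output clip at the given frame rate."""
    return MODE_DEFAULTS[mode].num_frames / fps


print(clip_seconds("long_streaming"))       # 60.5625
print(clip_seconds("bidirectional_short"))  # 5.0625
```

At 16 FPS, 969 frames come to roughly one minute and 81 frames to roughly five seconds, matching the two inference paths described above.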

📁 HF Repository Layout

Efficient-Large-Model/SANA-Streaming_bidirectional:

| Component | Path |
| --- | --- |
| Bidirectional SANA-Streaming DiT | dit/sana_bidirectional_short.pth |

Efficient-Large-Model/SANA-Streaming:

| Component | Path |
| --- | --- |
| Streaming SANA-Streaming DiT | dit/sana_streaming_ar.pth |
| Causal LTX-2 VAE release artifact | ltx2_causal_vae_0516/ |
| Demo source videos | source/{00_local_editing_source.mp4,05_background_editing_source.mp4,09_style_transfer_source.mp4} |

The inference configs ship in-repo:

| Mode | Config |
| --- | --- |
| bidirectional_short | configs/sana_streaming/sana_streaming_bidirectional_2b_720p.yaml |
| long_streaming | configs/sana_streaming/sana_streaming_2b_720p.yaml |

The text encoder is fetched separately from Efficient-Large-Model/gemma-2-2b-it. The default VAE path in both configs is Lightricks/LTX-2; long_streaming loads it through the local causal/chunk-tile wrapper for streaming encode/decode.

📝 BibTeX

@article{zhao2026sana,
  title={SANA-Streaming: Real-time Streaming Video Editing with Hybrid Diffusion Transformer},
  author={Zhao, Yuyang and Pan, Yicheng and He, Qiyuan and Yu, Jincheng and Chen, Junsong and Ye, Tian and Liu, Haozhe and Xie, Enze and Han, Song},
  journal={arXiv preprint arXiv:2605.30409},
  year={2026}
}