SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer

ICLR 2026 Oral Presentation

About SANA-Video
We introduce SANA-Video, a small diffusion model that efficiently generates videos at up to 2K resolution and minute-length duration. SANA-Video synthesizes high-resolution, high-quality, long videos with strong text-video alignment at a remarkably fast speed, and is deployable on an RTX 5090 GPU. Two core designs enable efficient, effective, long video generation: (1) Linear DiT: We use linear attention as the core operation, which is more efficient than vanilla attention given the large number of tokens processed in video generation. (2) Constant-Memory KV cache for Block Linear Attention: We design a block-wise autoregressive approach for long video generation that employs a constant-memory state, derived from the cumulative properties of linear attention. This KV cache provides the Linear DiT with global context at a fixed memory cost, eliminating the need for a traditional KV cache and enabling efficient, minute-long video generation. In addition, we explore effective data filters and model training strategies, reducing the training cost to 12 days on 64 H100 GPUs, which is only 1% of the cost of MovieGen. Despite its low cost, SANA-Video achieves competitive performance compared to modern state-of-the-art small diffusion models (e.g., Wan 2.1-1.3B and SkyReel-V2-1.3B) while being 16× faster in measured latency. Moreover, SANA-Video can be deployed on RTX 5090 GPUs with NVFP4 precision, reducing the time to generate a 5-second 720p video from 71s to 29s (2.4× speedup). In summary, SANA-Video enables low-cost, high-quality video generation. Code and model will be publicly released.
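The constant-memory state described above can be illustrated with a small NumPy sketch. This is not the SANA-Video implementation: the ReLU feature map `phi`, the normalization term, the choice that tokens inside a block attend to each other, and the function names are all assumptions made for illustration. The sketch only shows why linear attention lets a block-wise autoregressive model carry global context in a state whose size does not grow with the number of past tokens.

```python
import numpy as np

def phi(x):
    # Positive feature map. ReLU is an assumption for this sketch.
    return np.maximum(x, 0.0)

def block_linear_attention(q, k, v, block_size, eps=1e-6):
    """Block-causal linear attention using a fixed-size running state.

    q, k: (n, d) arrays; v: (n, d_v) array.
    The state (S, z) has shape (d, d_v) and (d,), independent of n.
    """
    n, d = q.shape
    d_v = v.shape[1]
    S = np.zeros((d, d_v))  # cumulative sum of phi(k)^T v
    z = np.zeros(d)         # cumulative sum of phi(k), used for normalization
    out = np.empty((n, d_v))
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        qb, kb, vb = phi(q[start:end]), phi(k[start:end]), v[start:end]
        # Fold the current block into the state first, so tokens in the
        # same block see each other as well as all earlier blocks.
        S += kb.T @ vb
        z += kb.sum(axis=0)
        out[start:end] = (qb @ S) / (qb @ z + eps)[:, None]
    return out

def masked_reference(q, k, v, block_size, eps=1e-6):
    """Same result computed with an explicit (n, n) block-causal mask."""
    n = q.shape[0]
    block_id = np.arange(n) // block_size
    mask = block_id[:, None] >= block_id[None, :]
    scores = (phi(q) @ phi(k).T) * mask
    return (scores @ v) / (scores.sum(axis=1) + eps)[:, None]
```

The reference version materializes an n × n score matrix, while the block-wise version only ever stores `S` and `z`. Under these assumptions the two agree numerically, which is the cumulative property the text refers to.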

Model Performance

Maximum Resolution: 2K
Frame Rate: 16 fps
Maximum Duration: 1 min
Latency generating a 5-second 720P video with LTX2-Refiner on H100: 18 s

Demo Videos

Quality and Efficiency of SANA-Video.

Technical Videos

Demonstration of the mechanisms behind Block Causal Linear Attention and Causal Mix-FFN.

SANA + LTX2-Refiner

Generated 720P video outputs using SANA-Video with LTX2-Refiner. Read our technical blog for details on the Two-Stage Inference Paradigm.

Image-to-Video (I2V)

Generate dynamic videos from static images, bringing still frames to life

Text-to-Video (T2V)

Generate high-quality video content through natural language descriptions, supporting multiple styles and scenes

Image-to-Video (I2V) Model Comparison

Compare different I2V models side by side with the same input prompt and reference image

Model Comparison

Compare different models side by side with the same input prompt

World Simulation and Physical Intelligence

Real-world applications and use cases demonstrating the practical capabilities of SANA-Video

Long Video Generation

Generate 1-minute-long videos

NVFP4 vs BF16

SANA-Video with NVFP4 quantization on an RTX 5090 GPU. The figure below compares latency between NVFP4 and BF16.
Figure: FP4 vs BF16

Citation

If you find our work helpful, please consider citing:

@inproceedings{
    chen2025sanavideo,
    title={{SANA}-Video: Efficient Video Generation with Block Linear Diffusion Transformer},
    author={Junsong Chen and Yuyang Zhao and Jincheng Yu and Ruihang Chu and Junyu Chen and Shuai Yang and Xianbang Wang and Yicheng Pan and Daquan Zhou and Huan Ling and Haozhe Liu and Hongwei Yi and Hao Zhang and Muyang Li and Yukang Chen and Han Cai and Sanja Fidler and Ping Luo and Song Han and Enze Xie},
    booktitle={The Fourteenth International Conference on Learning Representations},
    year={2026},
    url={https://openreview.net/forum?id=mzAchylAtf}
}
