Overview
⚡️ Efficient High-Resolution Image & Video Generation
ICLR 2025 Oral | ICML 2025 | ICCV 2025 Spotlight
Introduction¶
SANA is an efficiency-oriented codebase for high-resolution image and video generation, providing complete training and inference pipelines.
Models¶
| Model | Description |
|---|---|
| Sana | Efficient text-to-image generation with Linear DiT, up to 4K resolution |
| Sana-1.5 | Training-time and inference-time compute scaling |
| Sana-Sprint | Few-step generation via sCM (Consistency Model) distillation |
| Sana-Video | Efficient video generation with Block Linear Attention |
| LongSana | Minute-length real-time video generation (with LongLive) |
Key Techniques¶
- Linear Attention: Replaces quadratic vanilla self-attention with linear attention, so cost scales linearly with the token count at high resolutions (see the sketch after this list)
- DC-AE: Deep Compression Autoencoder with 32× image compression (vs. the traditional 8×), sharply reducing the number of latent tokens
- Block Causal Linear Attention: Efficient attention for video generation
- Causal Mix-FFN: Memory-efficient feedforward for long videos
- Flow-DPM-Solver: A DPM-Solver adapted to rectified-flow models that reduces the number of sampling steps at inference
- sCM Distillation: One/few-step generation with continuous-time consistency distillation
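As an illustration of the first item, here is a minimal sketch of kernel-based linear attention in PyTorch. It is a generic ReLU-kernel formulation for illustration, not the repository's exact Linear DiT module: by aggregating K^T V once, the per-token cost stays linear in the sequence length N instead of quadratic.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Generic ReLU-kernel linear attention (illustrative sketch, not SANA's exact module).

    q, k, v: (batch, heads, tokens, head_dim). Softmax attention forms an
    (N x N) score matrix; here K^T V is aggregated once, so the cost is
    O(N * d^2) rather than O(N^2 * d).
    """
    q, k = F.relu(q), F.relu(k)
    kv = torch.einsum("bhnd,bhne->bhde", k, v)          # aggregate K^T V: (B, H, d, d)
    z = torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2))  # per-token normalizer
    return torch.einsum("bhnd,bhde->bhne", q, kv) / (z.unsqueeze(-1) + eps)
```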
Highlights¶
- 🚀 20× smaller, 100× faster than Flux-12B
- 🖼️ Up to 4K resolution image generation
- ⚡ One-step inference with Sana-Sprint (see the sketch after this list)
- 💻 < 8GB VRAM with 4-bit quantization (sketch after the Quick Start)
- 🎬 Efficient video generation with Sana-Video
- ⏱️ 27 FPS real-time minute-length video with LongSana
- 📦 Full training & inference codebase
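The one-step highlight refers to Sana-Sprint. A minimal sketch, assuming a recent diffusers release that ships SanaSprintPipeline; the checkpoint name mirrors the naming in the Quick Start below and is an assumption:

```python
import torch
from diffusers import SanaSprintPipeline  # requires a diffusers version that includes Sana-Sprint

# Checkpoint name is an assumption mirroring the Quick Start naming below.
pipe = SanaSprintPipeline.from_pretrained(
    "Efficient-Large-Model/Sana_Sprint_1.6B_1024px_diffusers",
    torch_dtype=torch.bfloat16,
).to("cuda")

# sCM-distilled models need only one or two denoising steps.
image = pipe("a cyberpunk cat", num_inference_steps=2).images[0]
image.save("sana_sprint.png")
```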
Quick Start¶
```python
import torch
from diffusers import SanaPipeline

# Load the Sana-1.5 1.6B (1024px) checkpoint in bfloat16 and move it to the GPU.
pipe = SanaPipeline.from_pretrained(
    "Efficient-Large-Model/SANA1.5_1.6B_1024px_diffusers",
    torch_dtype=torch.bfloat16,
).to("cuda")

# Generate an image from a text prompt and save it to disk.
image = pipe("a cyberpunk cat").images[0]
image.save("sana.png")
```
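To approach the sub-8GB figure in the highlights, the diffusion transformer can be loaded in 4-bit precision. This is a hedged sketch using diffusers' bitsandbytes integration (assumes diffusers with quantization support and bitsandbytes installed); the actual memory use depends on which components are quantized or offloaded:

```python
import torch
from diffusers import BitsAndBytesConfig, SanaPipeline, SanaTransformer2DModel

repo = "Efficient-Large-Model/SANA1.5_1.6B_1024px_diffusers"

# 4-bit NF4 weights for the diffusion transformer (requires bitsandbytes).
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
transformer = SanaTransformer2DModel.from_pretrained(
    repo,
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)

pipe = SanaPipeline.from_pretrained(repo, transformer=transformer, torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # keep idle components (e.g. the text encoder) off the GPU

image = pipe("a cyberpunk cat").images[0]
image.save("sana_4bit.png")
```

If memory is still tight, the text encoder can be quantized in the same way through transformers' own BitsAndBytesConfig.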