
Overview


⚡️ Efficient High-Resolution Image & Video Generation

ICLR 2025 Oral | ICML 2025 | ICCV 2025 Spotlight



Introduction

SANA is an efficiency-oriented codebase for high-resolution image and video generation, providing complete training and inference pipelines.

Models

Model        Description
Sana         Efficient text-to-image generation with Linear DiT, up to 4K resolution
Sana-1.5     Training-time and inference-time compute scaling
Sana-Sprint  Few-step generation via sCM (Consistency Model) distillation
Sana-Video   Efficient video generation with Block Linear Attention
LongSana     Minute-length real-time video generation (with LongLive)

Key Techniques

  • Linear Attention: Replace vanilla attention with linear attention for efficiency at high resolutions
  • DC-AE: 32× image compression (vs. traditional 8×) to reduce latent tokens
  • Block Causal Linear Attention: Efficient attention for video generation
  • Causal Mix-FFN: Memory-efficient feedforward for long videos
  • Flow-DPM-Solver: Reduce sampling steps with efficient training and sampling
  • sCM Distillation: One/few-step generation with continuous-time consistency distillation
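
The linear-attention bullet above can be illustrated with a minimal sketch in plain Python. It assumes a ReLU feature map on queries and keys, a single head, and no batching; these choices and the helper names (`matmul`, `transpose`, `quadratic_attention`, `linear_attention`) are illustrative assumptions, not the codebase's implementation. The point is the reordering: computing K^T V once gives a small d × d summary, so the cost grows linearly with the token count N instead of quadratically.

```python
import random


def relu(m):
    return [[max(0.0, x) for x in row] for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    # Naive matrix product for lists of rows.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def quadratic_attention(q, k, v, eps=1e-6):
    """Reference form: builds the full N x N similarity matrix, O(N^2 * d)."""
    q, k = relu(q), relu(k)
    scores = matmul(q, transpose(k))  # N x N
    out = matmul(scores, v)           # N x d
    return [[o / (sum(s) + eps) for o in row] for row, s in zip(out, scores)]


def linear_attention(q, k, v, eps=1e-6):
    """Same result, but never materializes the N x N matrix, O(N * d^2)."""
    q, k = relu(q), relu(k)
    kv = matmul(transpose(k), v)           # d x d summary, computed once
    k_sum = [sum(col) for col in zip(*k)]  # length-d normalizer term
    out = matmul(q, kv)                    # N x d
    return [
        [o / (sum(qi * ks for qi, ks in zip(q_row, k_sum)) + eps) for o in row]
        for row, q_row in zip(out, q)
    ]


# Toy inputs (hypothetical sizes): N = 6 tokens, d = 4 channels.
random.seed(0)
n_tokens, dim = 6, 4
q = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
k = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
v = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]

out_quadratic = quadratic_attention(q, k, v)
out_linear = linear_attention(q, k, v)
```

Both functions return the same values up to floating-point rounding; only the order of the matrix products differs. That is why the saving matters most at high resolutions, where N is much larger than d.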

Highlights

  • 🚀 20× smaller, 100× faster than Flux-12B
  • 🖼️ Up to 4K resolution image generation
  • One-step inference with Sana-Sprint
  • 💻 < 8GB VRAM with 4-bit quantization
  • 🎬 Efficient video generation with Sana-Video
  • ⏱️ 27 FPS real-time minute-length video with LongSana
  • 📦 Full training & inference codebase
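
To make the 32× versus 8× compression figure concrete, the arithmetic below counts latent positions for a square image. It is a rough sketch: the function name `latent_tokens` and the optional `patch_size` argument are hypothetical, and the real token count also depends on how the model patchifies the latent, which this page does not specify.

```python
def latent_tokens(height, width, ae_downsample, patch_size=1):
    """Count latent positions after spatial downsampling by the autoencoder.

    patch_size is a hypothetical extra factor for models that group
    latent pixels into patches before the transformer.
    """
    stride = ae_downsample * patch_size
    if height % stride or width % stride:
        raise ValueError("image size must be divisible by the total stride")
    return (height // stride) * (width // stride)


for res in (1024, 4096):
    tokens_8x = latent_tokens(res, res, ae_downsample=8)
    tokens_32x = latent_tokens(res, res, ae_downsample=32)
    print(f"{res}x{res}: 8x -> {tokens_8x} positions, 32x -> {tokens_32x} positions")
```

With no patching on either side, going from 8× to 32× reduces the number of latent positions by a factor of (32 / 8)² = 16 at any resolution.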

Quick Start

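# Shell: clone the repository and set up the environment
git clone https://github.com/NVlabs/Sana.git
cd Sana
bash ./environment_setup.sh sana

# Python: generate an image with the diffusers pipeline
import torch
from diffusers import SanaPipeline

# Load the SANA-1.5 1.6B 1024px checkpoint in bfloat16 and move it to the GPU
pipe = SanaPipeline.from_pretrained(
    "Efficient-Large-Model/SANA1.5_1.6B_1024px_diffusers",
    torch_dtype=torch.bfloat16,
).to("cuda")

# Run text-to-image generation and save the first result
image = pipe("a cyberpunk cat").images[0]
image.save("sana.png")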
