Exploring the Frontiers of Efficient Generative Foundation Models
Junsong Chen1*, Shuchen Xue5*, Yuyang Zhao1†, Jincheng Yu1†, Sayak Paul4, Junyu Chen3, Han Cai1, Enze Xie1‡, Song Han1,2
{junsongc,enzex,songh}@nvidia.com,
xueshuchen.acad@gmail.com
1NVIDIA,  2MIT,  3Tsinghua University,  4Hugging Face,  5Independent Researcher
*Equal contribution †Core contributor ‡Project Lead
This paper presents SANA-Sprint, an efficient diffusion model for ultra-fast text-to-image (T2I) generation. SANA-Sprint is built on a pre-trained foundation model and augmented with hybrid distillation, dramatically reducing inference steps from 20 to 1-4. We introduce three key innovations: (1) We propose a training-free approach that transforms a pre-trained flow-matching model for continuous-time consistency distillation (sCM), eliminating costly training from scratch and achieving high training efficiency. Our hybrid distillation strategy combines sCM with latent adversarial distillation (LADD): sCM ensures alignment with the teacher model, while LADD enhances single-step generation fidelity. (2) SANA-Sprint is a unified step-adaptive model that achieves high-quality generation in 1-4 steps, eliminating step-specific training and improving efficiency. (3) We integrate ControlNet with SANA-Sprint for real-time interactive image generation, enabling instant visual feedback for user interaction. SANA-Sprint establishes a new Pareto frontier in speed-quality tradeoffs, achieving state-of-the-art performance with 7.59 FID and 0.74 GenEval in only 1 step — outperforming FLUX-schnell (7.94 FID / 0.71 GenEval) while being 10× faster (0.1s vs 1.1s on H100). It also achieves 0.1s (T2I) and 0.25s (ControlNet) latency for 1024×1024 images on H100, and 0.31s (T2I) on an RTX 4090, showcasing its exceptional efficiency and potential for AI-powered consumer applications (AIPC). Code and pre-trained models will be open-sourced.
• Training-Free Transformation to TrigFlow: We propose a training-free approach that transforms a pre-trained flow-matching model into the TrigFlow parameterization required by continuous-time consistency distillation (sCM), eliminating costly training from scratch and achieving high training efficiency (a sketch of the change of variables follows this list).
• Stabilizing Continuous-Time Distillation: Continuous-time consistency distillation suffers from training instabilities and excessively large gradient norms when model size and resolution are scaled up, which can lead to model collapse. We address both issues by refining the dense time embedding and integrating QK normalization into the self- and cross-attention mechanisms (see the attention sketch after this list). These modifications enable efficient, stable training and robust performance at higher resolutions and larger model sizes.
• Improving Continuous-Time CMs with GAN: As analyzed in CTM, consistency models distill teacher information locally: at each iteration the student learns only from a small time interval, so cross-timestep information is acquired through implicit extrapolation, which slows convergence. To address this, our hybrid distillation strategy pairs sCM with latent adversarial distillation (LADD): sCM keeps the student aligned with the teacher, while an additional adversarial loss provides direct global supervision across timesteps, improving both convergence speed and single-step output quality (a sketch of the combined objective follows this list).
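To make the training-free transformation concrete, below is a minimal sketch under two standard parameterizations: flow matching on the linear path x_s = (1 - s)·x_0 + s·z with velocity target z - x_0, and TrigFlow's trigonometric path x_t = cos(t)·x_0 + sin(t)·z. The function name is ours and the released code may differ; the key point is that rescaling x_t lands exactly on the linear path, so the pre-trained model can be reused as-is.

import torch

def trigflow_velocity_from_flow_model(flow_model, x_t, t):
    """Evaluate the TrigFlow velocity dx_t/dt with a pre-trained
    flow-matching model -- no retraining required.

    Assumed (standard) parameterizations, not taken from the released code:
      flow matching:  x_s = (1 - s) * x0 + s * z,  model predicts v = z - x0
      TrigFlow:       x_t = cos(t) * x0 + sin(t) * z,  t in [0, pi/2]

    `t` is a scalar tensor (or broadcastable to x_t).
    """
    c, s_trig = torch.cos(t), torch.sin(t)
    scale = c + s_trig          # rescaling maps the trig path onto the linear path
    s = s_trig / scale          # equivalent flow-matching time
    x_s = x_t / scale           # corresponding point on the linear path
    v = flow_model(x_s, s)      # one call to the pre-trained velocity model

    # Change of variables: dx_t/dt = (cos t - sin t) * x_s + v / (cos t + sin t).
    # In sCM's parameterization this equals sigma_d * F_theta(x_t / sigma_d, t).
    return (c - s_trig) * x_s + v / scale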
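The QK normalization mentioned above applies RMSNorm to queries and keys before the attention product, which bounds the attention logits. A minimal self-attention sketch (module and dimension names are ours; nn.RMSNorm requires PyTorch 2.4+, and a LayerNorm behaves similarly):

import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Self-attention with RMSNorm applied per head to queries and keys;
    bounding the attention logits tames the large gradient norms seen when
    scaling continuous-time consistency distillation."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.q_norm = nn.RMSNorm(self.head_dim)
        self.k_norm = nn.RMSNorm(self.head_dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                              # x: [B, N, C]
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: [B, heads, N, head_dim]
        q, k = self.q_norm(q), self.k_norm(k)          # QK normalization
        out = F.scaled_dot_product_attention(q, k, v)
        return self.proj(out.transpose(1, 2).reshape(B, N, C))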
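The hybrid objective can be summarized as the sCM consistency loss plus a hinge-style adversarial loss computed on latents, in the spirit of LADD. The weighting, function names, and discriminator interface below are illustrative assumptions rather than the exact implementation:

import torch.nn.functional as F

def student_loss(scm_loss, disc, fake_latents, lambda_adv=0.5):
    """Hybrid student objective: consistency alignment + adversarial fidelity.

    scm_loss     : continuous-time consistency (sCM) loss, computed elsewhere
    disc         : LADD-style latent discriminator returning raw logits
    fake_latents : one/few-step student outputs in latent space
    lambda_adv   : illustrative balancing weight (an assumption)
    """
    adv = -disc(fake_latents).mean()   # hinge generator loss
    return scm_loss + lambda_adv * adv

def discriminator_loss(disc, real_latents, fake_latents):
    """Standard hinge loss for the latent discriminator."""
    real = F.relu(1.0 - disc(real_latents)).mean()
    fake = F.relu(1.0 + disc(fake_latents.detach())).mean()
    return real + fake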
Our SANA-Sprint models focus on timestep distillation, achieving high-quality generation in 1-4 inference steps. We compare SANA-Sprint with SoTA text-to-image timestep distillation methods in the figure below.
We compare SANA-Sprint with SoTA text-to-image diffusion and timestep distillation methods in Table 2. With 4 steps, SANA-Sprint 0.6B achieves 5.34 samples/s throughput and 0.32s latency, with an FID of 6.48 and GenEval of 0.76. SANA-Sprint 1.6B has slightly lower throughput (5.20 samples/s) but improves GenEval to 0.77, outperforming far larger models such as FLUX-schnell (12B), which achieves only 0.5 samples/s with 2.10s latency. At 2 steps, both models remain efficient: SANA-Sprint 0.6B reaches 6.46 samples/s with 0.25s latency (FID: 6.54), while SANA-Sprint 1.6B achieves 5.68 samples/s with 0.24s latency (FID: 6.76). In single-step mode, SANA-Sprint 0.6B achieves 7.22 samples/s throughput and 0.21s latency while maintaining an FID of 7.04 and GenEval of 0.72, comparable to FLUX-schnell but with significantly higher efficiency.
We implement a ControlNet-Transformer architecture tailored to Transformer backbones, achieving explicit controllability alongside high-quality image generation (a rough sketch follows).
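As a rough illustration of the idea (the zero-initialized injection and block copies are standard ControlNet ingredients, but the block count, hidden size, and simplified block signatures here are our assumptions):

import copy
import torch.nn as nn

class ControlNetTransformer(nn.Module):
    """Transformer ControlNet sketch: trainable copies of the first K backbone
    blocks process the control signal, and their outputs are injected back
    through zero-initialized projections, so training starts as a no-op."""

    def __init__(self, base_blocks, num_control_blocks=6, dim=1152):
        super().__init__()
        self.control_blocks = nn.ModuleList(
            copy.deepcopy(b) for b in base_blocks[:num_control_blocks])
        self.zero_proj = nn.ModuleList(
            nn.Linear(dim, dim) for _ in self.control_blocks)
        for proj in self.zero_proj:        # zero init => base model unchanged at step 0
            nn.init.zeros_(proj.weight)
            nn.init.zeros_(proj.bias)

    def forward(self, x, control_tokens, base_blocks):
        h = x + control_tokens             # fuse control signal into token stream
        for i, block in enumerate(base_blocks):
            if i < len(self.control_blocks):
                h = self.control_blocks[i](h)
                x = block(x) + self.zero_proj[i](h)   # residual injection
            else:
                x = block(x)
        return x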
Our mission is to develop efficient, lightweight, and accelerated AI technologies that address practical challenges and deliver fast, open-source solutions.
@misc{chen2025sana-sprint,
title={SANA-Sprint: One-Step Diffusion with Continuous-Time Consistency Distillation},
author={Junsong Chen and Shuchen Xue and Yuyang Zhao and Jincheng Yu and Sayak Paul and Junyu Chen and Han Cai and Enze Xie and Song Han},
year={2025},
eprint={2503.09641},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.09641},
}