Exploring the Frontiers of Efficient Generative Foundation Models
Junsong Chen1*, Shuchen Xue5*, Yuyang Zhao1†, Jincheng Yu1†, Sayak Paul4, Junyu Chen3, Han Cai1, Enze Xie1‡, Song Han1,2
{junsongc,enzex,songh}@nvidia.com,
xueshuchen.acad@gmail.com
1NVIDIA,  2MIT,  3Tsinghua University,  4Hugging Face,  5Independent Researcher
*Equal contribution †Core contributor ‡Project Lead
This paper presents SANA-Sprint, an efficient diffusion model for ultra-fast text-to-image (T2I) generation. SANA-Sprint is built on a pre-trained foundation model and augmented with hybrid distillation, dramatically reducing inference steps from 20 to 1-4. We introduce three key innovations: (1) We propose a training-free approach that transforms a pre-trained flow-matching model for continuous-time consistency distillation (sCM), eliminating costly training from scratch and achieving high training efficiency. Our hybrid distillation strategy combines sCM with latent adversarial distillation (LADD): sCM ensures alignment with the teacher model, while LADD enhances single-step generation fidelity. (2) SANA-Sprint is a unified step-adaptive model that achieves high-quality generation in 1-4 steps, eliminating step-specific training and improving efficiency. (3) We integrate ControlNet with SANA-Sprint for real-time interactive image generation, enabling instant visual feedback for user interaction. SANA-Sprint establishes a new Pareto frontier in speed-quality tradeoffs, achieving state-of-the-art performance with 7.59 FID and 0.74 GenEval in only 1 step — outperforming FLUX-schnell (7.94 FID / 0.71 GenEval) while being 10× faster (0.1s vs 1.1s on H100). It also achieves 0.1s (T2I) and 0.25s (ControlNet) latency for 1024×1024 images on H100, and 0.31s (T2I) on an RTX 4090, showcasing its exceptional efficiency and potential for AI-powered consumer applications (AIPC). Code and pre-trained models will be open-sourced.
• Training-Free Transformation to TrigFlow: We propose a training-free approach that transforms a pre-trained flow-matching model into the TrigFlow parameterization required by continuous-time consistency distillation (sCM), eliminating costly training from scratch and achieving high training efficiency (a sketch of the change of variables follows this list).
• Stabilizing Continuous-Time Distillation: Continuous-time consistency distillation suffers from training instabilities and excessively large gradient norms when model size and resolution are scaled up, which can lead to model collapse. We address both issues by refining the dense time embedding and integrating QK normalization into the self- and cross-attention mechanisms (see the attention sketch after this list). These modifications enable efficient, stable training and robust performance at higher resolutions and larger model sizes.
• Improving Continuous-Time CMs with GAN: As analyzed in CTM, consistency models distill teacher information locally: at each iteration the student learns only from a small time interval, so cross-timestep information is acquired through implicit extrapolation, which slows convergence. To address this, our hybrid distillation strategy pairs sCM with latent adversarial distillation (LADD): sCM keeps the student aligned with the teacher, while an additional adversarial loss provides direct global supervision across timesteps, improving both convergence speed and single-step output quality (a sketch of the combined objective follows this list).
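To make the training-free transformation concrete, below is a minimal sketch under two standard parameterizations: flow matching on the linear path x_s = (1 - s)·x_0 + s·z with velocity target z - x_0, and TrigFlow's trigonometric path x_t = cos(t)·x_0 + sin(t)·z. The function name is ours and the released code may differ; the key point is that rescaling x_t lands exactly on the linear path, so the pre-trained model can be reused as-is.

import torch

def trigflow_velocity_from_flow_model(flow_model, x_t, t):
    """Evaluate the TrigFlow velocity dx_t/dt with a pre-trained
    flow-matching model -- no retraining required.

    Assumed (standard) parameterizations, not taken from the released code:
      flow matching:  x_s = (1 - s) * x0 + s * z,  model predicts v = z - x0
      TrigFlow:       x_t = cos(t) * x0 + sin(t) * z,  t in [0, pi/2]

    `t` is a scalar tensor (or broadcastable to x_t).
    """
    c, s_trig = torch.cos(t), torch.sin(t)
    scale = c + s_trig          # rescaling maps the trig path onto the linear path
    s = s_trig / scale          # equivalent flow-matching time
    x_s = x_t / scale           # corresponding point on the linear path
    v = flow_model(x_s, s)      # one call to the pre-trained velocity model

    # Change of variables: dx_t/dt = (cos t - sin t) * x_s + v / (cos t + sin t).
    # In sCM's parameterization this equals sigma_d * F_theta(x_t / sigma_d, t).
    return (c - s_trig) * x_s + v / scale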
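The QK normalization mentioned above applies RMSNorm to queries and keys before the attention product, which bounds the attention logits. A minimal self-attention sketch (module and dimension names are ours; nn.RMSNorm requires PyTorch 2.4+, and a LayerNorm behaves similarly):

import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Self-attention with RMSNorm applied per head to queries and keys;
    bounding the attention logits tames the large gradient norms seen when
    scaling continuous-time consistency distillation."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.q_norm = nn.RMSNorm(self.head_dim)
        self.k_norm = nn.RMSNorm(self.head_dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                              # x: [B, N, C]
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: [B, heads, N, head_dim]
        q, k = self.q_norm(q), self.k_norm(k)          # QK normalization
        out = F.scaled_dot_product_attention(q, k, v)
        return self.proj(out.transpose(1, 2).reshape(B, N, C))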
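The hybrid objective can be summarized as the sCM consistency loss plus a hinge-style adversarial loss computed on latents, in the spirit of LADD. The weighting, function names, and discriminator interface below are illustrative assumptions rather than the exact implementation:

import torch.nn.functional as F

def student_loss(scm_loss, disc, fake_latents, lambda_adv=0.5):
    """Hybrid student objective: consistency alignment + adversarial fidelity.

    scm_loss     : continuous-time consistency (sCM) loss, computed elsewhere
    disc         : LADD-style latent discriminator returning raw logits
    fake_latents : one/few-step student outputs in latent space
    lambda_adv   : illustrative balancing weight (an assumption)
    """
    adv = -disc(fake_latents).mean()   # hinge generator loss
    return scm_loss + lambda_adv * adv

def discriminator_loss(disc, real_latents, fake_latents):
    """Standard hinge loss for the latent discriminator."""
    real = F.relu(1.0 - disc(real_latents)).mean()
    fake = F.relu(1.0 + disc(fake_latents.detach())).mean()
    return real + fake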
Our SANA-Sprint models focus on timestep distillation, achieving high-quality generation in 1-4 inference steps. We compare SANA-Sprint with SoTA text-to-image timestep distillation methods in the figure below.
We compare SANA-Sprint with SoTA text-to-image diffusion and timestep distillation methods in Table 2. With 4 steps, SANA-Sprint 0.6B achieves 5.34 samples/s throughput and 0.32s latency, with an FID of 6.48 and GenEval of 0.76. SANA-Sprint 1.6B has slightly lower throughput (5.20 samples/s) but improves GenEval to 0.77, outperforming far larger models such as FLUX-schnell (12B), which achieves only 0.5 samples/s with 2.10s latency. At 2 steps, both models remain efficient: SANA-Sprint 0.6B reaches 6.46 samples/s with 0.25s latency (FID: 6.54), while SANA-Sprint 1.6B achieves 5.68 samples/s with 0.24s latency (FID: 6.76). In single-step mode, SANA-Sprint 0.6B achieves 7.22 samples/s throughput and 0.21s latency while maintaining an FID of 7.04 and GenEval of 0.72, comparable to FLUX-schnell but with significantly higher efficiency.
We implement a ControlNet-Transformer architecture tailored to Transformer backbones, achieving explicit controllability alongside high-quality image generation (a rough sketch follows).
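As a rough illustration of the idea (the zero-initialized injection and block copies are standard ControlNet ingredients, but the block count, hidden size, and simplified block signatures here are our assumptions):

import copy
import torch.nn as nn

class ControlNetTransformer(nn.Module):
    """Transformer ControlNet sketch: trainable copies of the first K backbone
    blocks process the control signal, and their outputs are injected back
    through zero-initialized projections, so training starts as a no-op."""

    def __init__(self, base_blocks, num_control_blocks=6, dim=1152):
        super().__init__()
        self.control_blocks = nn.ModuleList(
            copy.deepcopy(b) for b in base_blocks[:num_control_blocks])
        self.zero_proj = nn.ModuleList(
            nn.Linear(dim, dim) for _ in self.control_blocks)
        for proj in self.zero_proj:        # zero init => base model unchanged at step 0
            nn.init.zeros_(proj.weight)
            nn.init.zeros_(proj.bias)

    def forward(self, x, control_tokens, base_blocks):
        h = x + control_tokens             # fuse control signal into token stream
        for i, block in enumerate(base_blocks):
            if i < len(self.control_blocks):
                h = self.control_blocks[i](h)
                x = block(x) + self.zero_proj[i](h)   # residual injection
            else:
                x = block(x)
        return x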
Our mission is to develop efficient, lightweight, and accelerated AI technologies that address practical challenges and deliver fast, open-source solutions.
@misc{chen2025sana-sprint,
title={SANA-Sprint: One-Step Diffusion with Continuous-Time Consistency Distillation},
author={Junsong Chen and Shuchen Xue and Yuyang Zhao and Jincheng Yu and Sayak Paul and Junyu Chen and Han Cai and Enze Xie and Song Han},
year={2025},
eprint={2503.09641},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.09641},
}