Exploring the Frontiers of Efficient Generative Foundation Models
Enze Xie1*,
Junsong Chen1*,
Yuyang Zhao1†,
Jincheng Yu1†,
Ligeng Zhu1†,
Yujun Lin2,
Zhekai Zhang2,
Muyang Li2,
Junyu Chen3,
Han Cai1,
Bingchen Liu4,
Daquan Zhou5,
Song Han1,2
1NVIDIA,  2MIT,  3Tsinghua University,  4Playground,  5Peking University
*Equal contribution †Core contributor
This paper presents SANA-1.5, a linear Diffusion Transformer for efficient scaling in text-to-image generation.
Building upon SANA-1.0, we introduce three key innovations:
(1) Efficient Training Scaling: A depth-growth paradigm that enables scaling from 1.6B to 4.8B parameters with significantly reduced computational resources,
combined with a memory-efficient 8-bit optimizer.
(2) Model Depth Pruning: A block importance analysis technique for efficient model compression to arbitrary sizes with minimal quality loss.
(3) Inference-time Scaling: A repeated sampling strategy that trades computation for model capacity,
enabling smaller models to match larger model quality at inference time. Through these strategies,
SANA-1.5 achieves a text-image alignment score of 0.72 on GenEval, which can be further improved to 0.80 through inference scaling,
establishing a new SoTA on the GenEval benchmark. These innovations enable efficient model scaling across different compute budgets while maintaining high quality,
making high-quality image generation more accessible. Our code and pre-trained models will be released.
• Efficient Training Scaling:
We scale the Linear DiT by initializing the first 18 layers of the 4.8B SANA-1.5 model from the pre-trained 1.6B SANA-1.0 model,
using a partial-preservation initialization strategy. This allows the 4.8B model to achieve superior GenEval performance while reducing training time by 60% compared to training from scratch.
Additionally, we introduce the first 8-bit CAME optimizer, which significantly reduces GPU memory usage and enables efficient scaling to larger diffusion models.
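The depth-growth idea above can be sketched in a few lines. This is a minimal stand-in, not the actual SANA training code: layers are plain Python lists of numbers rather than transformer blocks, and the names `make_layer` and `grow_model` are hypothetical. The random initialization of the new blocks is also a simplification of whatever initialization the paper actually uses for the added layers.

```python
import random

def make_layer(dim, seed=None):
    """Stand-in for one transformer block's parameters (hypothetical)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 0.02) for _ in range(dim)]

def grow_model(pretrained_layers, target_depth, dim):
    """Partial-preservation init: copy every pretrained block into the
    leading positions of the deeper model, then initialize only the
    remaining (new) blocks from scratch."""
    grown = [list(layer) for layer in pretrained_layers]  # reuse small model
    for i in range(len(pretrained_layers), target_depth):
        grown.append(make_layer(dim, seed=i))             # fresh new blocks
    return grown

# Toy sizes: an 18-layer "SANA-1.0" grown to a deeper "SANA-1.5".
small = [make_layer(8, seed=i) for i in range(18)]
large = grow_model(small, target_depth=54, dim=8)
```

The point of the strategy is that only the new blocks start from scratch; the reused blocks carry over everything the smaller model already learned, which is where the claimed 60% training-cost saving comes from.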
• Model Depth Pruning:
We employ Block Importance Analysis (Figure (a)) to guide the pruning of model depth.
By removing less important layers, the model retains most of its semantic capabilities while temporarily losing its ability to generate high-frequency details.
This loss, however, can be effectively restored through a brief retraining process (typically around 100 iterations on a single GPU).
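A common way to score block importance, sketched below under the assumption that a block whose output barely differs from its input contributes little: score each block by one minus the input-output cosine similarity, then keep the highest-scoring blocks. The function names and the toy vectors are illustrative, not the paper's implementation.

```python
def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def block_importance(inputs, outputs):
    """Score each block by how much it changes its input:
    1 - cos(input, output); a near-zero score marks a prunable block."""
    return [1.0 - cosine(i, o) for i, o in zip(inputs, outputs)]

def prune_depth(layers, scores, keep):
    """Keep the `keep` highest-importance blocks, preserving their order."""
    ranked = sorted(range(len(layers)), key=lambda i: scores[i], reverse=True)
    return [layers[i] for i in sorted(ranked[:keep])]

# Toy per-block input/output features for a 3-block model.
ins = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
outs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]  # block 0 is an identity map
scores = block_importance(ins, outs)
pruned = prune_depth(["block0", "block1", "block2"], scores, keep=2)
```

Here block 0 leaves its input unchanged, scores lowest, and is the one removed; the short retraining pass mentioned above then recovers the high-frequency detail the pruned blocks contributed.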
• Inference-time Scaling:
NVILA-2B as the judge: using a fine-tuned NVILA-2B model to automatically compare and judge generated images,
we run a tournament-style comparison over several rounds until the top-N candidates are determined, as illustrated in Figure (a) below.
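The tournament described above can be sketched as follows. This is a schematic, not the actual pipeline: `tournament_select` is a hypothetical name, and the lambda judge that prefers a higher score is a stand-in for the fine-tuned NVILA-2B comparing two generated images.

```python
def tournament_select(candidates, judge, top_n=1):
    """Run tournament rounds: pair up candidates and keep each pairwise
    winner (judge(a, b) returns the preferred one) until top_n remain."""
    pool = list(candidates)
    while len(pool) > top_n:
        nxt = []
        for i in range(0, len(pool) - 1, 2):
            nxt.append(judge(pool[i], pool[i + 1]))
        if len(pool) % 2:          # an unpaired candidate advances for free
            nxt.append(pool[-1])
        pool = nxt
    return pool

# Toy candidates: (image id, alignment score); the score stands in for
# whatever preference signal the VLM judge would produce.
images = [("img_%d" % i, s) for i, s in enumerate([0.3, 0.9, 0.5, 0.7])]
best = tournament_select(images, judge=lambda a, b: a if a[1] >= b[1] else b)
```

Each round halves the pool, so selecting from N samples costs only O(N) judge calls while letting extra inference-time samples substitute for model capacity.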
We demonstrate that with inference scaling:
1. SANA-1.5 (4.8B) achieves a SoTA GenEval score of 0.80, as shown in Figure (b).
2. Smaller models can outperform larger ones, with consistent performance improvements observed across all model sizes, as shown in Figure (c).
SANA-1.5 is an efficient model that scales both training-time and inference-time compute. SANA-1.5 delivers: efficient model growth from the 1.6B SANA-1.0 model to 4.8B, matching or exceeding the performance of training from scratch while saving 60% of the training cost; efficient model depth pruning that slims the model to any desired size; powerful VLM-based selection for inference scaling, where a smaller model with inference scaling can outperform a larger one; and top GenEval and DPG-Bench results. Detailed results are shown in the table below.
Our mission is to develop efficient, lightweight, and accelerated AI technologies that address practical challenges and deliver fast, open-source solutions.
@misc{xie2025sana,
title={SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer},
author={Xie, Enze and Chen, Junsong and Zhao, Yuyang and Yu, Jincheng and Zhu, Ligeng and Lin, Yujun and Zhang, Zhekai and Li, Muyang and Chen, Junyu and Cai, Han and others},
year={2025},
eprint={2501.18427},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2501.18427},
}