Exploring the Frontiers of Efficient Generative Foundation Models
Enze Xie1*,
Junsong Chen1*,
Yuyang Zhao1†,
Jincheng Yu1†,
Ligeng Zhu1†,
Yujun Lin2,
Zhekai Zhang2,
Muyang Li2,
Junyu Chen3,
Han Cai1,
Bingchen Liu4,
Daquan Zhou5,
Song Han1,2
1NVIDIA,  2MIT,  3Tsinghua University,  4Playground,  5Peking University
*Equal contribution †Core contributor
This paper presents SANA-1.5, a linear Diffusion Transformer for efficient scaling in text-to-image generation.
Building upon SANA-1.0, we introduce three key innovations:
(1) Efficient Training Scaling: A depth-growth paradigm that enables scaling from 1.6B to 4.8B parameters with significantly reduced computational resources,
combined with a memory-efficient 8-bit optimizer.
(2) Model Depth Pruning: A block importance analysis technique for efficient model compression to arbitrary sizes with minimal quality loss.
(3) Inference-time Scaling: A repeated sampling strategy that trades computation for model capacity,
enabling smaller models to match larger model quality at inference time. Through these strategies,
SANA-1.5 achieves a text-image alignment score of 0.72 on GenEval, which can be further improved to 0.80 through inference scaling,
establishing a new SoTA on the GenEval benchmark. These innovations enable efficient model scaling across different compute budgets while maintaining high quality,
making high-quality image generation more accessible. Our code and pre-trained models will be released.
• Efficient Training Scaling:
We scale the Linear DiT by initializing the first 18 layers of the 4.8B SANA-1.5 model from the pre-trained 1.6B SANA-1.0 model,
using a partial-preservation initialization strategy. This allows the 4.8B model to achieve superior GenEval performance while reducing training time by 60% compared to training from scratch.
Additionally, we introduce the first 8-bit CAME optimizer, which significantly reduces GPU memory usage and enables efficient scaling to larger diffusion models.
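The depth-growth idea above can be sketched in a few lines. This is a minimal stand-in, not the actual SANA training code: layers are plain Python lists of numbers rather than transformer blocks, and the names `make_layer` and `grow_model` are hypothetical. The random initialization of the new blocks is also a simplification of whatever initialization the paper actually uses for the added layers.

```python
import random

def make_layer(dim, seed=None):
    """Stand-in for one transformer block's parameters (hypothetical)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 0.02) for _ in range(dim)]

def grow_model(pretrained_layers, target_depth, dim):
    """Partial-preservation init: copy every pretrained block into the
    leading positions of the deeper model, then initialize only the
    remaining (new) blocks from scratch."""
    grown = [list(layer) for layer in pretrained_layers]  # reuse small model
    for i in range(len(pretrained_layers), target_depth):
        grown.append(make_layer(dim, seed=i))             # fresh new blocks
    return grown

# Toy sizes: an 18-layer "SANA-1.0" grown to a deeper "SANA-1.5".
small = [make_layer(8, seed=i) for i in range(18)]
large = grow_model(small, target_depth=54, dim=8)
```

The point of the strategy is that only the new blocks start from scratch; the reused blocks carry over everything the smaller model already learned, which is where the claimed 60% training-cost saving comes from.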
• Model Depth Pruning:
We employ Block Importance Analysis (Figure (a)) to guide the pruning of model depth.
By removing less important layers, the model retains most of its semantic capabilities while temporarily losing its ability to generate high-frequency details.
This loss, however, can be effectively restored through a brief retraining process (typically around 100 iterations on a single GPU).
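A common way to score block importance, sketched below under the assumption that a block whose output barely differs from its input contributes little: score each block by one minus the input-output cosine similarity, then keep the highest-scoring blocks. The function names and the toy vectors are illustrative, not the paper's implementation.

```python
def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def block_importance(inputs, outputs):
    """Score each block by how much it changes its input:
    1 - cos(input, output); a near-zero score marks a prunable block."""
    return [1.0 - cosine(i, o) for i, o in zip(inputs, outputs)]

def prune_depth(layers, scores, keep):
    """Keep the `keep` highest-importance blocks, preserving their order."""
    ranked = sorted(range(len(layers)), key=lambda i: scores[i], reverse=True)
    return [layers[i] for i in sorted(ranked[:keep])]

# Toy per-block input/output features for a 3-block model.
ins = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
outs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]  # block 0 is an identity map
scores = block_importance(ins, outs)
pruned = prune_depth(["block0", "block1", "block2"], scores, keep=2)
```

Here block 0 leaves its input unchanged, scores lowest, and is the one removed; the short retraining pass mentioned above then recovers the high-frequency detail the pruned blocks contributed.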
• Inference-time Scaling:
NVILA-2B as the judge: using a fine-tuned NVILA-2B model to automatically compare and judge generated images,
we run a tournament-style comparison over several rounds until the top-N candidates are determined, as illustrated in Figure (a) below.
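The tournament described above can be sketched as follows. This is a schematic, not the actual pipeline: `tournament_select` is a hypothetical name, and the lambda judge that prefers a higher score is a stand-in for the fine-tuned NVILA-2B comparing two generated images.

```python
def tournament_select(candidates, judge, top_n=1):
    """Run tournament rounds: pair up candidates and keep each pairwise
    winner (judge(a, b) returns the preferred one) until top_n remain."""
    pool = list(candidates)
    while len(pool) > top_n:
        nxt = []
        for i in range(0, len(pool) - 1, 2):
            nxt.append(judge(pool[i], pool[i + 1]))
        if len(pool) % 2:          # an unpaired candidate advances for free
            nxt.append(pool[-1])
        pool = nxt
    return pool

# Toy candidates: (image id, alignment score); the score stands in for
# whatever preference signal the VLM judge would produce.
images = [("img_%d" % i, s) for i, s in enumerate([0.3, 0.9, 0.5, 0.7])]
best = tournament_select(images, judge=lambda a, b: a if a[1] >= b[1] else b)
```

Each round halves the pool, so selecting from N samples costs only O(N) judge calls while letting extra inference-time samples substitute for model capacity.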
We demonstrate that with inference scaling:
1. SANA-1.5 (4.8B) achieves a SoTA GenEval score of 0.80, as shown in Figure (b).
2. Smaller models can outperform larger ones, with consistent performance improvements observed across all model sizes, as shown in Figure (c).
SANA-1.5 is an efficient model that scales both training-time and inference-time compute. SANA-1.5 delivers: efficient model growth from the 1.6B SANA-1.0 model to 4.8B, matching or exceeding the performance of training from scratch while saving 60% of the training cost; efficient model depth pruning that slims the model to any desired size; powerful VLM-based selection for inference scaling, where a smaller model with inference scaling can outperform a larger one; and top GenEval and DPG-Bench results. Detailed results are shown in the table below.
Our mission is to develop efficient, lightweight, and accelerated AI technologies that address practical challenges and deliver fast, open-source solutions.
@misc{xie2025sana,
title={SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer},
author={Xie, Enze and Chen, Junsong and Zhao, Yuyang and Yu, Jincheng and Zhu, Ligeng and Lin, Yujun and Zhang, Zhekai and Li, Muyang and Chen, Junyu and Cai, Han and others},
year={2025},
eprint={2501.18427},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2501.18427},
}