⚡️ Sana: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer¶
This guide covers training and inference for Sana text-to-image models.
Hardware Requirements¶
| Task | VRAM |
|---|---|
| Inference (0.6B) | 9GB |
| Inference (1.6B) | 12GB |
| Inference (4-bit) | < 8GB |
| Training | 32GB |
Note
All benchmarks were measured on A100 GPUs; memory usage may vary on other hardware.
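The sub-8GB figure assumes the transformer is quantized to 4 bits. A minimal sketch of one way to do this with diffusers' bitsandbytes integration (illustrative only; the 4-bit Sana page linked under Related documents the supported path):
import torch
from diffusers import BitsAndBytesConfig, SanaPipeline, SanaTransformer2DModel
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
# Quantize only the transformer; the VAE and text encoder stay in bfloat16.
transformer = SanaTransformer2DModel.from_pretrained(
    "Efficient-Large-Model/SANA1.5_1.6B_1024px_diffusers",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)
pipe = SanaPipeline.from_pretrained(
    "Efficient-Large-Model/SANA1.5_1.6B_1024px_diffusers",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
# Keep idle sub-models on the CPU to stay within the memory budget.
pipe.enable_model_cpu_offload()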
Inference¶
Using Diffusers (Recommended)¶
import torch
from diffusers import SanaPipeline
pipe = SanaPipeline.from_pretrained(
    "Efficient-Large-Model/SANA1.5_1.6B_1024px_diffusers",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")
# Keep the VAE and text encoder in bfloat16 as well to avoid dtype mismatches.
pipe.vae.to(torch.bfloat16)
pipe.text_encoder.to(torch.bfloat16)
prompt = 'a cyberpunk cat with a neon sign that says "Sana"'
image = pipe(
    prompt=prompt,
    height=1024,
    width=1024,
    guidance_scale=4.5,
    num_inference_steps=20,
    generator=torch.Generator(device="cuda").manual_seed(42),
).images[0]
image.save("sana.png")
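If the full pipeline does not fit in VRAM, replacing pipe.to("cuda") with pipe.enable_model_cpu_offload() keeps idle components on the CPU, trading some speed for memory.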
Using SanaPAGPipeline¶
import torch
from diffusers import SanaPAGPipeline
pipe = SanaPAGPipeline.from_pretrained(
    "Efficient-Large-Model/SANA1.5_1.6B_1024px_diffusers",
    torch_dtype=torch.bfloat16,
    pag_applied_layers="transformer_blocks.8",
)
pipe.to("cuda")
pipe.text_encoder.to(torch.bfloat16)
pipe.vae.to(torch.bfloat16)
image = pipe(
    prompt='a cyberpunk cat with a neon sign that says "Sana"',
    guidance_scale=5.0,
    pag_scale=2.0,
    num_inference_steps=20,
    generator=torch.Generator(device="cuda").manual_seed(42),
).images[0]
image.save("sana.png")
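Here guidance_scale sets the usual classifier-free guidance strength, while pag_scale controls the perturbed-attention guidance applied to the layer named in pag_applied_layers.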
Using Native Pipeline¶
import torch
from app.sana_pipeline import SanaPipeline
from torchvision.utils import save_image
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
generator = torch.Generator(device=device).manual_seed(42)
sana = SanaPipeline("configs/sana1-5_config/1024ms/Sana_1600M_1024px_allqknorm_bf16_lr2e5.yaml")
sana.from_pretrained("hf://Efficient-Large-Model/SANA1.5_1.6B_1024px/checkpoints/SANA1.5_1.6B_1024px.pth")
image = sana(
    prompt='a cyberpunk cat with a neon sign that says "Sana"',
    height=1024,
    width=1024,
    guidance_scale=4.5,
    pag_guidance_scale=1.0,
    num_inference_steps=20,
    generator=generator,
)
save_image(image, "output/sana.png", nrow=1, normalize=True, value_range=(-1, 1))
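Unlike the diffusers pipelines, the native pipeline returns an image tensor rather than PIL images, so save_image rescales it from the model's [-1, 1] range; make sure the output/ directory exists beforehand, since save_image will not create it.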
Gradio Demo¶
DEMO_PORT=15432 \
python app/app_sana.py \
--share \
--config=configs/sana_config/1024ms/Sana_1600M_img1024.yaml \
--model_path=hf://Efficient-Large-Model/Sana_1600M_1024px_BF16/checkpoints/Sana_1600M_1024px_BF16.pth \
--image_size=1024
Batch Inference¶
# Run prompts from a txt file
python scripts/inference.py \
--config=configs/sana_config/1024ms/Sana_1600M_img1024.yaml \
--model_path=hf://Efficient-Large-Model/Sana_1600M_1024px/checkpoints/Sana_1600M_1024px.pth \
--txt_file=asset/samples/samples_mini.txt
# Run prompts from a json file
python scripts/inference.py \
--config=configs/sana_config/1024ms/Sana_1600M_img1024.yaml \
--model_path=hf://Efficient-Large-Model/Sana_1600M_1024px/checkpoints/Sana_1600M_1024px.pth \
--json_file=asset/samples/samples_mini.json
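The txt file holds one prompt per line; illustrative contents (see asset/samples/samples_mini.txt in the repository for the real file):
a cyberpunk cat with a neon sign that says "Sana"
a watercolor painting of a lighthouse at dawn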
Training¶
Data Preparation¶
Prepare image-text pairs in a flat data directory; the training commands below point at the repository's asset/example_data.
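As an illustrative sketch (file names are hypothetical; asset/example_data in the repository is the authoritative example), SanaImgDataset expects each image to sit next to a caption file sharing its basename:
asset/example_data
├── 00001.png
├── 00001.txt
├── 00002.png
└── 00002.txt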
Train from Scratch¶
# Train Sana 0.6B with 512x512 resolution
bash train_scripts/train.sh \
configs/sana_config/512ms/Sana_600M_img512.yaml \
--data.data_dir="[asset/example_data]" \
--data.type=SanaImgDataset \
--model.multi_scale=false \
--train.train_batch_size=32
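Note that --data.data_dir takes a bracketed list, so multiple data directories can be supplied at once, e.g. "[dir_a,dir_b]".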
Fine-tuning¶
# Fine-tune Sana 1.6B with 1024x1024 resolution
bash train_scripts/train.sh \
configs/sana_config/1024ms/Sana_1600M_img1024.yaml \
--data.data_dir="[asset/example_data]" \
--data.type=SanaImgDataset \
--model.load_from=hf://Efficient-Large-Model/Sana_1600M_1024px/checkpoints/Sana_1600M_1024px.pth \
--model.multi_scale=false \
--train.train_batch_size=8
Multi-Scale WebDataset¶
Convert the image-text pairs into WebDataset .tar shards.
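The packing step is simple enough to sketch: a WebDataset shard is just a tar archive in which files sharing a basename form one sample. A minimal, illustrative packer (not the repository's official converter):
import tarfile
from pathlib import Path
src = Path("asset/example_data")
out = Path("asset/example_data_tar")
out.mkdir(parents=True, exist_ok=True)
# Files that share a basename (00001.png + 00001.txt) become one WebDataset sample.
with tarfile.open(out / "shard-00000.tar", "w") as tar:
    for img in sorted(src.glob("*.png")):
        txt = img.with_suffix(".txt")
        if txt.exists():
            tar.add(img, arcname=img.name)
            tar.add(txt, arcname=txt.name)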
Then train:
bash train_scripts/train.sh \
configs/sana_config/512ms/Sana_600M_img512.yaml \
--data.data_dir="[asset/example_data_tar]" \
--data.type=SanaWebDatasetMS \
--model.multi_scale=true \
--train.train_batch_size=32
Training with FSDP¶
# Download toy dataset
huggingface-cli download Efficient-Large-Model/toy_data --repo-type dataset --local-dir ./data/toy_data
# DDP training
bash train_scripts/train.sh \
configs/sana1-5_config/1024ms/Sana_1600M_1024px_allqknorm_bf16_lr2e5.yaml \
--data.data_dir="[data/toy_data]" \
--data.type=SanaWebDatasetMS \
--model.multi_scale=true \
--data.load_vae_feat=true \
--train.train_batch_size=2
# FSDP training
bash train_scripts/train.sh \
configs/sana1-5_config/1024ms/Sana_1600M_1024px_AdamW_fsdp.yaml \
--data.data_dir="[data/toy_data]" \
--data.type=SanaWebDatasetMS \
--model.multi_scale=true \
--data.load_vae_feat=true \
--train.use_fsdp=true \
--train.train_batch_size=2
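FSDP (Fully Sharded Data Parallel) shards parameters, gradients, and optimizer states across GPUs, so it accommodates larger models than DDP, which keeps a full model replica on each GPU; the DDP command is included above for comparison.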
Related¶
- Model Zoo - All available models
- 4-bit Sana - Memory-efficient inference
- LoRA & DreamBooth - Fine-tuning methods
Citation¶
@misc{xie2024sana,
  title={Sana: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer},
  author={Xie, Enze and Chen, Junsong and Chen, Junyu and Cai, Han and Tang, Haotian and Lin, Yujun and Zhang, Zhekai and Li, Muyang and Zhu, Ligeng and Lu, Yao and Han, Song},
  year={2024},
  eprint={2410.10629},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2410.10629},
}
@misc{xie2025sana,
  title={SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer},
  author={Xie, Enze and Chen, Junsong and Zhao, Yuyang and Yu, Jincheng and Zhu, Ligeng and Lin, Yujun and Zhang, Zhekai and Li, Muyang and Chen, Junyu and Cai, Han and others},
  year={2025},
  eprint={2501.18427},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2501.18427},
}