🚀 SGLang: High-Performance Serving for SANA
SGLang is an inference framework for accelerated image and video generation. It supports SANA models natively and provides high-performance serving through an OpenAI-compatible API, a CLI, and a Python SDK.
Supported Models
| Model | HuggingFace ID |
|---|---|
| Sana 0.6B (512px) | Efficient-Large-Model/Sana_600M_512px_diffusers |
| Sana 0.6B (1024px) | Efficient-Large-Model/Sana_600M_1024px_diffusers |
| Sana 1.6B (512px) | Efficient-Large-Model/Sana_1600M_512px_diffusers |
| Sana 1.6B (1024px) | Efficient-Large-Model/Sana_1600M_1024px_diffusers |
| SANA-1.5 1.6B (1024px) | Efficient-Large-Model/SANA1.5_1.6B_1024px_diffusers |
| SANA-1.5 4.8B (1024px) | Efficient-Large-Model/SANA1.5_4.8B_1024px_diffusers |
Installation
For installation methods, including Docker and ROCm/AMD, see the SGLang installation guide.
Quick Start
1. CLI
The simplest way to generate an image:
sglang generate \
  --model-path Efficient-Large-Model/SANA1.5_1.6B_1024px_diffusers \
  --prompt 'a cyberpunk cat with a neon sign that says "Sana"' \
  --save-output
2. Python SDK
from sglang.multimodal_gen import DiffGenerator

# Load the SANA-1.5 1.6B (1024px) model on a single GPU.
generator = DiffGenerator.from_pretrained(
    model_path="Efficient-Large-Model/SANA1.5_1.6B_1024px_diffusers",
    num_gpus=1,
)

# Generate one 1024x1024 image with a fixed seed and save it under outputs/.
image = generator.generate(
    sampling_params_kwargs=dict(
        prompt='a cyberpunk cat with a neon sign that says "Sana"',
        height=1024,
        width=1024,
        num_inference_steps=20,
        guidance_scale=4.5,
        seed=42,
        save_output=True,
        output_path="outputs/",
    )
)
Server Mode (OpenAI-Compatible API)
Launch the Server
sglang serve --model-path Efficient-Large-Model/SANA1.5_1.6B_1024px_diffusers \
  --host 0.0.0.0 --port 30000
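Loading the model can take a while, so a client script may want to wait until the server port accepts connections before sending requests. The following is a minimal sketch using only the Python standard library. The helper name `wait_for_port` is hypothetical and not part of SGLang, and a successful TCP connect only shows that something is listening on the port, not that the model has finished loading.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 60.0,
                  interval_s: float = 0.5) -> bool:
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect only proves that the port is listening.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

With the launch command above, a client could call `wait_for_port("127.0.0.1", 30000)` and send its first request only if the call returns `True`.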
Send Requests
Once the server is running, use the OpenAI-compatible image generation API:
import requests

# POST a generation request to the OpenAI-compatible images endpoint.
response = requests.post(
    "http://127.0.0.1:30000/v1/images/generations",
    json={
        "prompt": 'a cyberpunk cat with a neon sign that says "Sana"',
        "size": "1024x1024",
        "num_inference_steps": 20,
        "guidance_scale": 4.5,
        "seed": 42,
        "response_format": "b64_json",
        "n": 1,
    },
)
response.raise_for_status()  # surface HTTP errors before parsing the body
result = response.json()
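With `"response_format": "b64_json"`, the image arrives as base64 text inside the JSON body rather than as a file. The sketch below decodes it and writes it to disk. It assumes the response follows the OpenAI images layout, `{"data": [{"b64_json": "..."}]}`, which the source does not show explicitly. The helper name `save_b64_images`, the `sana_{i}.png` file names, and the PNG extension are all illustrative assumptions.

```python
import base64
from pathlib import Path


def save_b64_images(result: dict, out_dir: str = "outputs") -> list[Path]:
    """Decode every `b64_json` entry of an OpenAI-style images response to a file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, item in enumerate(result.get("data", [])):
        # Assumption: each item stores base64-encoded image bytes under "b64_json".
        image_bytes = base64.b64decode(item["b64_json"])
        path = out / f"sana_{i}.png"  # hypothetical naming; PNG format is assumed
        path.write_bytes(image_bytes)
        paths.append(path)
    return paths
```

Calling `save_b64_images(result)` on the parsed response would then return the list of written paths, or an empty list if the response has no `data` entries.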
Memory Optimization
For GPUs with limited VRAM, SGLang provides CPU offloading options:
sglang generate \
  --model-path Efficient-Large-Model/SANA1.5_1.6B_1024px_diffusers \
  --text-encoder-cpu-offload \
  --vae-cpu-offload \
  --pin-cpu-memory \
  --prompt "A beautiful landscape" \
  --save-output
| Option | Description |
|---|---|
| --dit-cpu-offload | Offload DiT model to CPU |
| --text-encoder-cpu-offload | Offload text encoder to CPU |
| --vae-cpu-offload | Offload VAE to CPU |
| --pin-cpu-memory | Pin CPU memory for faster transfer |
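When scripting runs across machines with different amounts of VRAM, it can help to assemble the offload flags programmatically. The sketch below builds the `sglang generate` argument list using only the flags listed in the options table. The helper `build_generate_cmd` and the component keys (`dit`, `text_encoder`, `vae`) are hypothetical names introduced for illustration, not part of SGLang.

```python
import shlex

# Flags taken from the options table; nothing else is assumed about the CLI.
OFFLOAD_FLAGS = {
    "dit": "--dit-cpu-offload",
    "text_encoder": "--text-encoder-cpu-offload",
    "vae": "--vae-cpu-offload",
}


def build_generate_cmd(model_path, prompt, offload=(), pin_cpu_memory=False):
    """Return the `sglang generate` command as an argument list."""
    cmd = ["sglang", "generate", "--model-path", model_path]
    for component in offload:
        cmd.append(OFFLOAD_FLAGS[component])  # raises KeyError on an unknown component
    if pin_cpu_memory:
        cmd.append("--pin-cpu-memory")
    cmd += ["--prompt", prompt, "--save-output"]
    return cmd


cmd = build_generate_cmd(
    "Efficient-Large-Model/SANA1.5_1.6B_1024px_diffusers",
    "A beautiful landscape",
    offload=("text_encoder", "vae"),
    pin_cpu_memory=True,
)
print(shlex.join(cmd))  # shell-quoted form, suitable for logging
```

The list form can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems in prompts that contain quotes or spaces.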
LoRA Support
Apply LoRA adapters during inference:
sglang generate \
  --model-path Efficient-Large-Model/SANA1.5_1.6B_1024px_diffusers \
  --lora-path <your-lora-path> \
  --prompt "A beautiful landscape" \
  --save-output
Related
- SGLang Diffusion Documentation
- Model Zoo - All available Sana models
- SANA Inference & Training - Native inference pipeline
Citation
@misc{xie2024sana,
  title={Sana: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer},
  author={Enze Xie and Junsong Chen and Junyu Chen and Han Cai and Haotian Tang and Yujun Lin and Zhekai Zhang and Muyang Li and Ligeng Zhu and Yao Lu and Song Han},
  year={2024},
  eprint={2410.10629},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2410.10629},
}