Observation 1: Memory consumption
![memory](./static/images/memory.png)
Reconstruction losses account for a large share of the memory cost and prohibit large-batch training. Moreover, better reconstruction is not always beneficial for representation learning.
We introduce Quantized Language-Image Pretraining (QLIP), a visual tokenization method that combines state-of-the-art reconstruction quality with state-of-the-art zero-shot image understanding. QLIP trains a binary-spherical-quantization-based autoencoder with both reconstruction and language-image alignment objectives. We are the first to show that the two objectives do not need to be at odds. We balance the two loss terms dynamically during training and show that a two-stage training pipeline effectively reconciles the large-batch requirement of image-language pre-training with the memory bottleneck imposed by the reconstruction objective. We validate the effectiveness of QLIP for multimodal understanding and text-conditioned image generation with a single model. Specifically, QLIP serves as a drop-in replacement for the visual encoder of LLaVA and the image tokenizer of LlamaGen with comparable or even better performance. Finally, we demonstrate that QLIP enables a unified mixed-modality auto-regressive model for understanding and generation.
State-of-the-art visual tokenizers excel at either understanding (high zero-shot accuracy, e.g. SigLIP) or reconstruction (low reconstruction FID, e.g. RQ-VAE), but not both. QLIP performs well on both understanding and reconstruction with only a marginal drop on each, opening up the opportunity for unified multi-modal understanding and generation.
Observation 2: Gradient magnitude
The gradient magnitudes of the contrastive image-text alignment and pixel reconstruction objectives differ by up to two orders of magnitude, leading to very different convergence rates for the two losses.
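To make this observation concrete, the sketch below (in PyTorch) measures the gradient norm that each objective induces on a shared set of encoder parameters. The model, losses, and parameter list are placeholders for illustration, not QLIP's actual implementation.

```python
import torch

def grad_norm(loss, shared_params):
    """L2 norm of d(loss)/d(shared_params), used to compare how strongly
    each objective pulls on the shared visual encoder."""
    grads = torch.autograd.grad(loss, shared_params,
                                retain_graph=True, allow_unused=True)
    flat = [g.reshape(-1) for g in grads if g is not None]
    return torch.cat(flat).norm().item()

# Hypothetical usage with an alignment loss and a reconstruction loss that
# both backpropagate into the same encoder parameters:
#   g_align = grad_norm(loss_align, encoder_params)
#   g_recon = grad_norm(loss_recon, encoder_params)
#   print(g_recon / g_align)   # gaps of up to ~100x are reported above
```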
Stage 1: we optimize a weighted sum of the reconstruction, quantization, and contrastive losses, without the perceptual and adversarial losses. The weight of each loss term is set according to its gradient magnitude.
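A minimal sketch of the stage-1 objective, assuming the three per-term losses have already been computed. The default weights are placeholders, not the values used in QLIP; in practice they would be set from the measured gradient magnitudes (e.g. with `grad_norm()` from the sketch above) so that no single objective dominates the shared encoder's updates.

```python
def stage1_loss(loss_recon, loss_quant, loss_align,
                w_recon=1.0, w_quant=1.0, w_align=1.0):
    """Stage-1 objective: a weighted sum of the reconstruction,
    quantization, and image-text contrastive losses (no perceptual or
    adversarial terms).  The default weights are illustrative only."""
    return w_recon * loss_recon + w_quant * loss_quant + w_align * loss_align
```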
Stage 2: we improve reconstruction quality and restore higher-frequency details by fine-tuning only the quantization bottleneck and the visual decoder. We drop the text encoder and freeze the visual encoder to prevent the text-aligned features from degrading once the batch-size restriction is relaxed.
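The parameter bookkeeping for this stage can look roughly like the following; the attribute names (`visual_encoder`, `text_encoder`, `quantizer`, `decoder`) are assumptions for illustration rather than QLIP's released API.

```python
import torch.nn as nn

def configure_stage2(model: nn.Module):
    """Freeze the visual encoder, discard the text tower, and return the
    parameters that remain trainable (quantization bottleneck + decoder)."""
    model.text_encoder = None                     # text tower no longer needed
    for p in model.visual_encoder.parameters():   # keep text-aligned features intact
        p.requires_grad_(False)
    trainable = list(model.quantizer.parameters()) + list(model.decoder.parameters())
    return trainable   # hand these to the stage-2 optimizer
```

The stage-2 optimizer then updates only the bottleneck and decoder, so the perceptual and adversarial losses excluded in stage 1 can be brought in without disturbing the text-aligned encoder.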
We conduct a linear-probing evaluation to compare visual encoders from three families of methods:
(1) reconstruction-only tokenizers such as VQ-VAE and BSQ-ViT,
(2) language-quantized tokenizers, such as LQAE, and
(3) CLIP-style vision encoders (without a decoder), such as EVA-CLIP.
We see significant improvements in linear-probing classification accuracy over reconstruction-only tokenizers and language-quantized tokenizers.
In addition, QLIP's accuracy is very close to that of EVA-CLIP.
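For reference, a minimal linear-probing loop looks like the sketch below: the encoder is frozen and only a linear classifier on top of its pooled features is trained. The `encoder`, feature dimension, and data loader are assumed to exist and are not part of any released evaluation code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def linear_probe(encoder, loader, feat_dim, num_classes,
                 epochs=10, lr=1e-3, device="cuda"):
    """Train a linear classifier on frozen encoder features.
    `encoder(images)` is assumed to return pooled features of size feat_dim."""
    encoder.eval()
    head = nn.Linear(feat_dim, num_classes).to(device)
    opt = torch.optim.AdamW(head.parameters(), lr=lr)
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():          # the encoder is never updated
                feats = encoder(images)
            loss = F.cross_entropy(head(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```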
We show images generated by LlamaGen with its original VQGAN tokenizer (left) and with our QLIP (right) side by side, with the shared caption at the bottom. The images generated with QLIP follow the captions more faithfully, depicting aspects that are missing from the VQGAN baseline.
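What "drop-in replacement" means in practice is that the generator only ever sees discrete code indices, so any tokenizer exposing an encode/decode pair of the shape sketched below can be swapped in. The class and method names are assumptions, not the actual LlamaGen or QLIP interfaces.

```python
import torch
import torch.nn as nn

class ImageTokenizer(nn.Module):
    """Sketch of the interface an AR image generator needs from a tokenizer."""

    def encode(self, images: torch.Tensor) -> torch.Tensor:
        """Map images (B, 3, H, W) to discrete code indices (B, L)."""
        raise NotImplementedError

    def decode(self, codes: torch.Tensor) -> torch.Tensor:
        """Map code indices (B, L) back to images (B, 3, H, W)."""
        raise NotImplementedError

# Training:   targets = tokenizer.encode(images)   # next-token prediction targets
# Generation: images  = tokenizer.decode(sampled_codes)
```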
With QLIP as the underlying visual tokenizer, we report the performance of a unified multimodal model that handles text-only, image-to-text, and text-to-image tasks within a single auto-regressive model.
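One common way to realize such a unified model is to place text tokens and QLIP's visual tokens in a single vocabulary and mark image spans with special boundary tokens. The sketch below illustrates sequence construction only and is not necessarily QLIP's exact recipe; the token ids and helper are hypothetical.

```python
# Hypothetical special tokens marking the image span inside a sequence.
BOI, EOI = 50000, 50001   # "begin of image" / "end of image" ids

def build_sequence(text_ids, image_ids=None):
    """Concatenate text tokens and optional image tokens into one stream
    that a single decoder-only transformer models auto-regressively."""
    seq = list(text_ids)
    if image_ids is not None:
        seq += [BOI] + list(image_ids) + [EOI]
    return seq

# Text-only sample:     build_sequence(text_ids)
# Text-to-image sample: build_sequence(caption_ids, qlip_code_ids)
# Image-to-text samples would place the image span before the text (not shown).
```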
@article{zhao2025qlip,
author = {Zhao, Yue and Xue, Fuzhao and Reed, Scott and Fan, Linxi and Zhu, Yuke and Kautz, Jan and Yu, Zhiding and Krähenbühl, Philipp and Huang, De-An},
title = {QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation},
journal = {arXiv preprint arXiv:2502.05178},
year = {2025},
}