Baifeng Shi¹,² Boyi Li¹,² Han Cai² Yao Lu² Sifei Liu² Marco Pavone² Jan Kautz² Song Han² Trevor Darrell¹ Pavlo Molchanov² Hongxu Yin²
PS3 scales CLIP-style vision pre-training from 384×384 to 4K resolution at near-constant cost via prompt-aware selective encoding.
VILA-HD is a frontier high-res MLLM built on top of PS3, achieving better performance and efficiency than Qwen2-VL on images at up to 4K resolution.
4KPro is a benchmark that not only contains 4K-resolution images, but also strictly requires 4K-resolution perception.
Previous vision models (e.g., CLIP, SigLIP) are all pre-trained at low resolutions such as 384×384. However, real-world applications often require processing high-resolution images, for example at 4K resolution. Below is an example where 4K resolution is required to recognize the stop sign while driving.
Although previous methods such as S2 and AnyRes can process high-res images without high-res pre-training, we find that pre-training on high-res images improves performance because it leverages large-scale pre-training data to learn high-quality high-res features. Below we can see that PS3, pre-trained at 4K resolution, clearly improves over baselines such as S2 and AnyRes.
Previous vision pre-training such as CLIP and SigLIP can't scale to high resolution because it is too expensive: the vision model encodes the whole image, which takes compute at least quadratic in resolution. However, for high-res images you usually don't need to look at the whole image. In the figure above, for example, you only need to look at the stop sign to answer the question. This means that, instead of doing contrastive learning on the whole image, it is enough to contrast local regions with local captions. The model can still learn detailed representations of high-res images, but with nearly no extra cost.
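To make the objective concrete, here is a minimal sketch of a CLIP-style contrastive (InfoNCE) loss applied to local region/caption pairs instead of whole images. The function name and tensor layout are illustrative assumptions, not PS3's actual implementation:

```python
import torch
import torch.nn.functional as F

def local_contrastive_loss(region_features, caption_features, temperature=0.07):
    """CLIP-style InfoNCE loss between local image regions and local captions.

    region_features:  (N, D) embeddings of N cropped high-res regions.
    caption_features: (N, D) embeddings of the matching local captions.
    """
    region_features = F.normalize(region_features, dim=-1)
    caption_features = F.normalize(caption_features, dim=-1)
    logits = region_features @ caption_features.t() / temperature  # (N, N)
    targets = torch.arange(len(logits), device=logits.device)
    # Symmetric loss: region -> caption and caption -> region.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Because each region is a small crop, the encoding cost per pair stays roughly constant regardless of how large the full image is.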
The key to PS3's success is its ability to selectively process high-res regions based on any text prompt. This is achieved by a top-down (i.e., prompt-aware) selection mechanism that lets the model focus on the regions most relevant to a given text prompt, encoding the low-res global image together with the selected high-res local regions.
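As a rough illustration of such top-down selection, a learned selector might score each candidate patch against the prompt and keep only the top-k. This is a hypothetical sketch of the idea; PS3's actual selection module may differ:

```python
import torch
import torch.nn as nn

class PromptAwareSelector(nn.Module):
    """Minimal stand-in for top-down patch selection: project patch features
    and the prompt into a shared space, score each patch by similarity to
    the prompt, and keep the top-k most relevant patches."""

    def __init__(self, patch_dim, prompt_dim, shared_dim=256):
        super().__init__()
        self.patch_proj = nn.Linear(patch_dim, shared_dim)
        self.prompt_proj = nn.Linear(prompt_dim, shared_dim)

    def forward(self, patch_feats, prompt_emb, k):
        p = self.patch_proj(patch_feats)   # (N, shared_dim)
        q = self.prompt_proj(prompt_emb)   # (shared_dim,)
        scores = p @ q                     # relevance of each patch
        keep = torch.topk(scores, k=k).indices
        return patch_feats[keep], keep

# Example: keep the 256 most prompt-relevant of 4096 high-res patches.
selector = PromptAwareSelector(patch_dim=768, prompt_dim=512)
patches, idx = selector(torch.randn(4096, 768), torch.randn(512), k=256)
```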
We build VILA-HD, a high-res MLLM that uses PS3 as its vision encoder and can efficiently process up to 4K × 4K resolution. VILA-HD first takes the low-res features from PS3 together with the text tokens, and then uses PS3 to selectively process the high-res regions relevant to the text prompt. One can flexibly decide how many high-res patches VILA-HD processes based on the compute budget.
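A hypothetical sketch of how this patch budget acts as a compute knob when assembling the visual tokens fed to the LLM (all shapes and names here are illustrative assumptions, not VILA-HD's actual interface):

```python
import torch

def assemble_visual_tokens(low_res_tokens, high_res_patches, prompt_emb, budget):
    """Build the LLM's visual token sequence: the full low-res overview plus
    only `budget` prompt-relevant high-res patches (hypothetical sketch)."""
    scores = high_res_patches @ prompt_emb            # prompt relevance per patch
    keep = torch.topk(scores, k=budget).indices
    return torch.cat([low_res_tokens, high_res_patches[keep]], dim=0)

# The budget trades compute for detail:
low = torch.randn(196, 768)     # low-res global tokens
high = torch.randn(4096, 768)   # candidate high-res patch tokens
prompt = torch.randn(768)
cheap = assemble_visual_tokens(low, high, prompt, budget=256)      # (452, 768)
detailed = assemble_visual_tokens(low, high, prompt, budget=2048)  # (2244, 768)
```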
VILA-HD with PS3 shows intriguing scaling properties. (a) When scaling up the resolution and selecting all patches at each resolution, VILA-HD with PS3 shows a better scaling curve than baselines without high-res pre-training. (b) VILA-HD with PS3 can scale up resolution and improve performance without extra training or inference cost by selecting a constant number of patches. (c, d) VILA-HD with PS3 can also scale up training-time or test-time compute by selecting more patches, trading compute for better performance.
Compared to state-of-the-art MLLMs such as NVILA and Qwen2-VL, VILA-HD achieves competitive performance across all benchmarks, including chart, document, OCR, and natural-image understanding, and sets new SOTA results on benchmarks that require high-res perception, such as V*Bench.
VILA-HD also achieves the best efficiency compared to previous token pruning approaches, thanks to PS3's top-down patch selection mechanism. Specifically, when selecting the same number of tokens, PS3 significantly improves ViT efficiency while achieving better performance. PS3 is also the only approach that can process 4K resolution.
Although previous image QA benchmarks contain images of up to 4K resolution, the questions in those benchmarks do not actually require 4K-resolution perception to answer. Specifically, we manually check the minimum recognizable resolution (MRR), i.e., the minimum resolution required to answer the question, for each question in these benchmarks, and find that most questions can be answered with no more than 1K resolution.
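In the paper this check is done manually; purely as an illustration, an automated variant of the MRR measurement could look like the following, where `can_answer` is a hypothetical judge of whether the question is still answerable from a downsampled image:

```python
from PIL import Image

def minimum_recognizable_resolution(image: Image.Image, question, answer,
                                    can_answer,
                                    resolutions=(512, 1024, 2048, 4096)):
    """Smallest resolution (longer side) at which `question` is still
    answerable from `image`. `can_answer(img, question, answer)` is a
    hypothetical judge; in the paper this check is done by hand."""
    for res in sorted(resolutions):
        scale = min(1.0, res / max(image.size))  # downsample only
        small = image.resize((max(1, round(image.width * scale)),
                              max(1, round(image.height * scale))))
        if can_answer(small, question, answer):
            return res
    return None  # not answerable even at the highest resolution
```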
To this end, we propose 4KPro, a new benchmark that strictly requires 4K-resolution perception. 4KPro consists of 4K-resolution QA tasks in four professional domains: autonomous driving, household, gaming, and UI understanding.
VILA-HD with PS3 shows better scaling curves than baselines without high-res pre-training, and achieves SOTA performance with better efficiency than previous MLLMs, including Qwen2-VL.