Scaling Vision Pre-Training to 4K Resolution

Baifeng Shi1,2    Boyi Li1,2    Han Cai2    Yao Lu2    Sifei Liu2    Marco Pavone2    Jan Kautz2    Song Han2    Trevor Darrell1    Pavlo Molchanov2    Hongxu Yin2

PS3 scales CLIP-style vision pre-training from 384 to 4K resolution at near-constant cost via prompt-aware selective encoding.

VILA-HD is a frontier high-res MLLM built on top of PS3, achieving better performance and efficiency than Qwen2-VL on up to 4K-resolution images.

4KPro is a benchmark that not only contains 4K-resolution images, but also strictly requires 4K-resolution perception.

Paper | PS3 Code | VILA-HD Code (Coming Soon) | PS3 Weights (Coming Soon) | VILA-HD Weights (Coming Soon) | 4KPro Benchmark (Coming Soon) | Citation

PS3: Vision Pre-Training at 4K Resolution

Why 4K Resolution?

Previous vision models (e.g., CLIP, SigLIP) are all pre-trained at low resolution such as 384×384. However, real-world applications often require processing high-resolution images, up to 4K. Below is an example where 4K resolution is required to recognize the stop sign while driving.

Why Pre-Train at 4K Resolution?

Although previous methods such as S2 and AnyRes can process high-res images without high-res pre-training, we find that pre-training on high-res images improves performance: it leverages large-scale pre-training data to learn high-quality high-res features. Below we can see that PS3, pre-trained at 4K resolution, clearly improves over baselines such as S2 and AnyRes.


Why Can PS3 Do This When Previous Methods Can't?

Previous vision pre-training pipelines like CLIP and SigLIP can't scale to high resolution because it's too expensive: the vision model must encode the whole image, with compute that grows at least quadratically in resolution. For high-res images, however, you usually don't need to look at the whole image. In the figure above, for example, you only need to look at the stop sign to answer the question. This means that instead of doing contrastive learning on the whole image, it is enough to contrast local regions with local captions. The model can still learn detailed representations of high-res images, but with nearly no extra cost.
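The idea can be sketched as a CLIP-style symmetric InfoNCE loss applied to matched (local region, local caption) pairs instead of whole images. The function below is an illustrative sketch, not PS3's actual training objective; the embedding shapes and temperature value are assumptions:

```python
import numpy as np

def local_contrastive_loss(region_emb, caption_emb, temperature=0.07):
    """Symmetric InfoNCE over matched (local region, local caption) pairs.
    region_emb, caption_emb: (N, D) arrays; row i of each is a positive pair."""
    # L2-normalize so dot products are cosine similarities
    r = region_emb / np.linalg.norm(region_emb, axis=1, keepdims=True)
    c = caption_emb / np.linalg.norm(caption_emb, axis=1, keepdims=True)
    logits = r @ c.T / temperature          # (N, N) similarity matrix
    n = logits.shape[0]

    def cross_entropy(lg):
        # log-softmax over each row, then pick the diagonal (matched pair)
        lg = lg - lg.max(axis=1, keepdims=True)
        log_prob = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_prob[np.arange(n), np.arange(n)].mean()

    # average the region->caption and caption->region directions
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

Because each loss term only touches a local crop and its caption, the cost per pair is independent of the full image resolution.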


Key Design: Localized High-Res Encoding via Top-Down Selection

The key to PS3's success is the ability to selectively process high-res regions based on any text prompt. This is achieved by a top-down (i.e., prompt-aware) selection mechanism that lets the model focus on the regions most relevant to a given text prompt, encoding the low-res global image together with the selected high-res local regions.
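A minimal sketch of such top-down selection: score each high-res patch feature against the prompt embedding and keep only the top-k. The function name and the cosine-similarity scoring here are assumptions for illustration; in PS3 the selection module is learned:

```python
import numpy as np

def select_topk_patches(patch_feats, prompt_emb, k):
    """Keep the k high-res patches most relevant to the text prompt.
    patch_feats: (P, D) features of high-res patches; prompt_emb: (D,)."""
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    q = prompt_emb / np.linalg.norm(prompt_emb)
    scores = p @ q                       # cosine relevance to the prompt
    idx = np.argsort(-scores)[:k]        # indices of the top-k patches
    return idx, patch_feats[idx]
```

Only the selected patches are encoded at high resolution, which is what keeps the cost near-constant as resolution grows.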


VILA-HD: Enabling Efficient and Performant 4K-Resolution MLLM with PS3

Building VILA-HD with PS3

We build VILA-HD, a high-res MLLM that uses PS3 as the vision encoder and can efficiently process up to 4K × 4K resolution. VILA-HD first takes the low-res features from PS3 together with the text tokens, then uses PS3 to selectively process the high-res regions relevant to the text prompt. One can flexibly decide how many high-res patches to process based on the compute budget.
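Under a compute budget, the flow can be sketched as below. Function names and shapes are hypothetical, but the key property holds: the LLM sees a token count determined by the budget, not by the full image resolution:

```python
import numpy as np

def vila_hd_vision_tokens(lowres_feats, text_emb, highres_patch_feats, budget):
    """Two-stage encoding sketch: global low-res tokens plus up to `budget`
    prompt-relevant high-res patch tokens, selected PS3-style."""
    q = text_emb / np.linalg.norm(text_emb)
    p = highres_patch_feats / np.linalg.norm(
        highres_patch_feats, axis=1, keepdims=True)
    idx = np.argsort(-(p @ q))[:budget]   # top-down, prompt-aware selection
    # total tokens = num low-res tokens + budget, independent of resolution
    return np.concatenate([lowres_feats, highres_patch_feats[idx]], axis=0)
```

Raising `budget` trades extra compute for finer high-res coverage, which is the knob behind the scaling curves shown below.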


Superior Scaling Properties

VILA-HD with PS3 shows intriguing scaling properties. (a) When scaling up resolution and selecting all patches at each resolution, VILA-HD with PS3 shows a better scaling curve than baselines without high-res pre-training. (b) VILA-HD with PS3 can scale up resolution and improve performance with no extra training or inference cost by selecting a constant number of patches. (c, d) VILA-HD with PS3 can scale up training-time or test-time compute by selecting more patches, trading compute for better performance.


SOTA Performance and Efficiency

Compared to state-of-the-art MLLMs such as NVILA and Qwen2-VL, VILA-HD achieves competitive performance across all benchmarks, including chart, document, OCR, and natural-image understanding, and sets new SOTA results on benchmarks that require high-res perception such as V*Bench.


VILA-HD also achieves the best efficiency compared to previous token pruning approaches, thanks to PS3's top-down patch selection mechanism. Specifically, when selecting the same number of tokens, PS3 significantly improves ViT efficiency while achieving better performance. PS3 is also the only approach that can process 4K resolution.


4KPro: Benchmarking 4K-Resolution Perception

Previous Benchmarks Do Not Need 4K-Resolution Perception

Although previous image QA benchmarks contain images of up to 4K resolution, their questions do not actually require 4K-resolution perception to answer. Specifically, we manually check the minimum recognizable resolution (MRR), i.e., the minimum resolution required to answer each question in these benchmarks. We find that most questions can be answered at no more than 1K resolution.


4KPro Strictly Requires 4K-Resolution Perception

To this end, we propose 4KPro, a new benchmark that strictly requires 4K-resolution perception. 4KPro consists of 4K-resolution QA tasks in four professional domains: autonomous driving, household, gaming, and UI understanding.


VILA-HD achieves SOTA Performance and Efficiency on 4KPro

VILA-HD with PS3 shows better scaling curves than baselines without high-res pre-training, and achieves SOTA performance with better efficiency than previous MLLMs, including Qwen2-VL.
