Train Cheaper, Run Faster, Perform Better!
As of January 6, 2025, VILA is now part of the new Cosmos Nemotron vision language models.
Zhijian Liu1,†, Ligeng Zhu1,†, Baifeng Shi1,3, Zhuoyang Zhang1,2, Yuming Lou1,6, Shang Yang1,2, Haocheng Xi1,3, Shiyi Cao1,3, Yuxian Gu2,6, Dacheng Li1,3, Xiuyu Li1,3, Yunhao Fang1,4, Yukang Chen1, Cheng-Yu Hsieh5, De-An Huang1, An-Chieh Cheng4, Vishwesh Nath1, Jinyi Hu2,6, Sifei Liu1, Ranjay Krishna5, Daguang Xu1, Xiaolong Wang1,4, Pavlo Molchanov1, Jan Kautz1, Hongxu Yin1,‡, Song Han1,2,‡, Yao Lu1,‡
1NVIDIA,
2MIT,
3UC Berkeley,
4UC San Diego,
5University of Washington,
6Tsinghua University
†Equal contribution,
‡Equal advisory
Visual language models (VLMs) have made significant advances in accuracy in recent years. However, their efficiency has received much less attention. This paper introduces NVILA, a family of open VLMs designed to optimize both efficiency and accuracy. Building on top of VILA, we improve its model architecture by first scaling up the spatial and temporal resolutions, and then compressing visual tokens. This "scale-then-compress" approach enables NVILA to efficiently process high-resolution images and long videos. We also conduct a systematic investigation to enhance the efficiency of NVILA throughout its entire lifecycle, from training and fine-tuning to deployment. NVILA matches or surpasses the accuracy of many leading open and proprietary VLMs across a wide range of image and video benchmarks. At the same time, it reduces training costs by 4.5×, fine-tuning memory usage by 3.4×, pre-filling latency by 1.6-2.2×, and decoding latency by 1.2-2.8×. We make our code and models available to facilitate reproducibility.
In this paper, we introduce NVILA, a family of open VLMs designed to optimize both efficiency and accuracy. Building on VILA, we improve its model architecture by first scaling up the spatial and temporal resolution, followed by compressing visual tokens. "Scaling" preserves more details from visual inputs, raising the accuracy upper bound, while "compression" squeezes visual information to fewer tokens, improving computational efficiency. This "scale-then-compress" strategy allows NVILA to process high-resolution images and long videos both effectively and efficiently. In addition, we conduct a systematic study to optimize the efficiency of NVILA throughout its entire lifecycle, including training, fine-tuning, and deployment.
For spatial scaling, we increase the image resolution of the vision encoder to 896×896. However, uniformly applying high resolution is inefficient for smaller images. To address this, we use S2 to extract multi-scale high-resolution features with image tiling. S2 resizes the image to multiple scales (e.g., 448², 896², 1344²), splits each scale into 448² tiles, processes each tile individually, and stitches the tiles back together. Feature maps from the different scales are then interpolated and concatenated. Because S2 resizes images into squares, it distorts images with non-square aspect ratios. To address this, we propose DynS2, which maintains the original aspect ratio at the largest scale. DynS2 adjusts the image dimensions to the closest size that can be divided into 448² tiles, processes the tiles, and concatenates the feature maps from all scales. With DynS2, the model achieves up to 30% accuracy improvements on text-heavy benchmarks.
To compress spatial tokens, we use a 2×2 spatial-to-channel (STC) reshape, which reduces the token count by a factor of 4 without sacrificing accuracy. More aggressive reductions cause performance drops, so we introduce an additional visual encoder pre-training stage to recover the accuracy loss, achieving a 2.4× speedup in training and inference. Alternative designs such as TokenLearner and Perceiver Resampler do not outperform the simple STC design at the same token reduction ratio.
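The two spatial steps can be illustrated with a minimal sketch in plain Python. It is not NVILA's implementation: `dyn_tile_grid` uses a hypothetical rounding rule (scale the longer side to the largest scale, then round each side to the nearest multiple of the tile size), and `stc_reshape` represents a feature map as nested lists purely for readability. Both function names and the `max_scale` parameter are assumptions made for this example.

```python
TILE = 448  # tile side length in pixels, as stated in the text


def dyn_tile_grid(width, height, max_scale=1344, tile=TILE):
    """Pick a (cols, rows) tile grid that roughly keeps the aspect ratio.

    Hypothetical rule: scale the longer side to `max_scale`, then round
    each side to the nearest positive multiple of `tile`.
    """
    ratio = max_scale / max(width, height)
    cols = max(1, round(width * ratio / tile))
    rows = max(1, round(height * ratio / tile))
    return cols, rows


def stc_reshape(grid):
    """2x2 spatial-to-channel reshape on an H x W grid of feature vectors.

    Each 2x2 block of neighbouring tokens is concatenated along the
    channel axis: 4x fewer tokens, each with 4x more channels.
    """
    h, w = len(grid), len(grid[0])
    if h % 2 or w % 2:
        raise ValueError("grid height and width must be even")
    out = []
    for i in range(0, h, 2):
        row = []
        for j in range(0, w, 2):
            row.append(
                grid[i][j] + grid[i][j + 1] + grid[i + 1][j] + grid[i + 1][j + 1]
            )
        out.append(row)
    return out


# A 4x4 grid of 1-channel tokens becomes a 2x2 grid of 4-channel tokens.
features = [[[r * 4 + c] for c in range(4)] for r in range(4)]
compressed = stc_reshape(features)
```

Under these assumptions, a 1920×1080 image maps to a 3×2 grid of 448² tiles at the largest scale, instead of being squashed into a 3×3 square.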
For temporal scaling, we increase the number of frames uniformly sampled from the input video. Following previous methods, we train the model with additional video supervised fine-tuning (SFT) to extend its capability to process more frames. Extending the number of frames from 8 to 32 can increase the model's accuracy on Video-MME by more than 5%. However, this also increases the number of visual tokens by 4×. Similar to spatial token compression, we reduce these visual tokens using temporal averaging, which partitions the frames into groups and then temporally pools the visual tokens within each group. This reduces temporal redundancy while retaining important spatiotemporal information. Compressing the visual tokens by 4× leads to an acceptable accuracy drop. Compared to the original baseline with the same number of tokens, the scaled-then-compressed result costs almost the same but has much higher accuracy. We use this approach to further scale the number of frames and the compression ratio, leading to a state-of-the-art 7B model on this benchmark.
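Temporal averaging as described above can be sketched in a few lines of plain Python. This is an illustrative version only: the function name `temporal_pool`, the nested-list layout (frames × tokens × channels), and the requirement that the frame count be divisible by the group size are assumptions of this example, not details from NVILA.

```python
def temporal_pool(frames, group_size):
    """Average visual tokens over groups of consecutive frames.

    frames: list of T frames; each frame is a list of N tokens;
            each token is a list of C floats.
    Returns T // group_size pooled frames with the same N x C layout.
    """
    if len(frames) % group_size:
        raise ValueError("number of frames must be divisible by group_size")
    pooled = []
    for start in range(0, len(frames), group_size):
        group = frames[start:start + group_size]
        n_tokens = len(group[0])
        n_chan = len(group[0][0])
        pooled.append([
            [sum(f[t][c] for f in group) / group_size for c in range(n_chan)]
            for t in range(n_tokens)
        ])
    return pooled


# 32 frames with one 1-channel token each, pooled in groups of 4 -> 8 frames.
video = [[[float(i)]] for i in range(32)]
pooled = temporal_pool(video, group_size=4)
```

With a group size of 4, 32 sampled frames cost the same number of visual tokens as 8 unpooled frames, which is the comparison made in the paragraph above.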
To improve model accuracy, previous work has kept collecting high-quality SFT datasets from various sources and has shown improvements on benchmark scores. However, not all data contributes equally to the model, and the continuous growth of datasets leads to much redundancy. In NVILA, we follow the "scale-then-compress" concept: we first enlarge our SFT dataset mixture and then compress it. NVILA's training involves more than 100M data samples, making it necessary to prune the training set while maintaining accuracy.
Inspired by recent works in knowledge distillation, we leverage DeltaLoss to score the training set. Our experiments report the average performance across 10 benchmarks, with a focus on key tasks to demonstrate the method's effectiveness. We examine three pruning thresholds (10%, 30%, and 50%) and find that DeltaLoss consistently outperforms the random baseline. On GQA and DocVQA in particular, random pruning shows significant performance degradation while DeltaLoss stays accurate. We find that 50% is a relatively safe threshold: the average score remains competitive while training is sped up by 2×. Thus, we set the threshold to 50% for later experiments.
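Score-based pruning of this kind can be sketched as follows. The text does not define how DeltaLoss is computed, so the scoring rule here is an assumption: each example is scored by the loss of a small model minus the loss of a large model, and the lowest-scored fraction is dropped. The names `delta_scores` and `prune` and the sample loss values are hypothetical.

```python
def delta_scores(small_losses, large_losses):
    """Assumed score: small-model loss minus large-model loss, per example."""
    return [s - l for s, l in zip(small_losses, large_losses)]


def prune(scores, prune_fraction):
    """Return sorted indices of the examples kept after pruning.

    Drops the `prune_fraction` of examples with the lowest scores.
    """
    n_keep = len(scores) - int(len(scores) * prune_fraction)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


# Made-up per-example losses for four training examples.
small = [2.0, 0.1, 3.0, 1.0]
large = [0.5, 0.1, 2.9, 0.2]
kept = prune(delta_scores(small, large), prune_fraction=0.5)
```

In this toy run, examples 1 and 2 are dropped: both models find them about equally easy or equally hard, so under the assumed score they carry little training signal.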
We use FP8 from COAT to speed up NVILA training. Unlike LLM training, VLM training deals with varying sequence lengths: videos need many tokens, images need fewer, and text needs the least. This variability means smaller workloads can benefit from larger batch sizes. Using FP8 for weights and activations lets NVILA increase batch size from 4 to 16, doubling the speed. With gradient checkpointing, quantizing activations is less crucial. We use Liger's cross-entropy kernel to manage memory with Qwen's large vocabulary, still achieving a 1.2× speedup over BF16 training.
When fine-tuning the vision encoder (ViT) and language model (LLM) together using PEFT methods, we found that the learning rate for the ViT should be 5-50× lower than for the LLM. Additionally, fine-tuning the vision encoder with LayerNorm achieves similar performance to LoRA but is more efficient, reducing training time by 25%. With this setup, NVILA can be fine-tuned for various tasks using 24 GB of memory while maintaining performance.
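One way to apply different learning rates to the two components is through optimizer parameter groups. The sketch below builds a list of dictionaries with `params` and `lr` keys, the per-group format that PyTorch optimizers accept. The `vision_tower.` name prefix, the function name, and the default learning rate are hypothetical; the ratio of 10 is simply one value inside the 5-50× range given above.

```python
def build_param_groups(named_params, llm_lr=2e-5, vit_lr_ratio=10.0,
                       vit_prefix="vision_tower."):
    """Split parameters into LLM and ViT groups with separate learning rates.

    named_params: iterable of (name, parameter) pairs.
    vit_prefix:   hypothetical name prefix identifying vision-encoder weights.
    """
    vit, llm = [], []
    for name, param in named_params:
        (vit if name.startswith(vit_prefix) else llm).append(param)
    return [
        {"params": llm, "lr": llm_lr},
        {"params": vit, "lr": llm_lr / vit_lr_ratio},  # ViT gets the lower rate
    ]


# Strings stand in for real tensors so the example runs without dependencies.
named = [
    ("vision_tower.blocks.0.norm1.weight", "vit_norm_weight"),
    ("llm.layers.0.q_proj.lora_A", "llm_lora_weight"),
]
groups = build_param_groups(named, llm_lr=1e-4, vit_lr_ratio=10.0)
```

With a real model, the same list could be passed as the first argument to an optimizer such as `torch.optim.AdamW`, with `named_params` taken from `model.named_parameters()` after filtering to trainable parameters.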
We have developed a specialized inference engine using quantization techniques to efficiently deploy NVILA. The inference process is divided into two phases: prefilling and decoding. In the compute-intensive prefilling stage, we use token compression techniques to reduce the workload for the LLM backbone. The vision tower then becomes the main bottleneck, accounting for over 90% of the prefilling latency. To address this, we implement W8A8 quantization for the vision tower, reducing NVILA's Time-To-First-Token (TTFT) in this stage. In the memory-intensive decoding stage, we use AWQ for W4A16 quantization of the LLM backbone to speed up the process. We further optimize the original AWQ implementation by introducing FP16 accumulation to the W4A16 GEMM kernels, achieving a 1.7× kernel speedup without losing accuracy. A detailed comparison is shown in the figure below.
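To make the W8A8 notation concrete, the following sketch quantizes both weights and activations to 8-bit integers, accumulates the dot product in integers, and rescales the result. It uses plain symmetric per-tensor quantization as a stand-in; it is not the engine's kernel and not AWQ, and the function names and sample values are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization: returns (integer codes, scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def int8_dot(weights, activations):
    """Dot product with 8-bit weights and 8-bit activations (W8A8)."""
    qw, sw = quantize_int8(weights)
    qx, sx = quantize_int8(activations)
    acc = sum(a * b for a, b in zip(qw, qx))  # integer accumulation
    return acc * sw * sx  # rescale back to floating point


w = [0.5, -1.0, 0.25]
x = [2.0, 1.0, -4.0]
approx = int8_dot(w, x)                      # quantized result
exact = sum(a * b for a, b in zip(w, x))     # full-precision reference
```

The quantized result stays close to the full-precision dot product while the inner loop runs entirely on small integers, which is what makes this scheme attractive for a compute-bound stage like the vision tower.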
We evaluated NVILA across various image benchmarks: AI2D, ChartQA, DocVQA, InfographicVQA, MathVista, MMMU (zero-shot CoT), RealworldQA, SEED-Bench, TextVQA, and VQAv2. NVILA performs on par with leading open-source models like Qwen2-VL, InternVL, and Pixtral in each size category. For visual question answering tasks (ChartQA, DocVQA, InfographicVQA, TextVQA, VQAv2, SEED-Bench), NVILA-8B and NVILA-15B match or exceed the performance of proprietary models (GPT-4o, Gemini). On science benchmarks (AI2D), NVILA-8B achieves state-of-the-art results among open-source models, and NVILA-15B competes well with proprietary models. For reasoning and knowledge benchmarks (MMMU, RealworldQA, MathVista), performance improves significantly with larger model sizes. NVILA-8B also excels in OCR tasks (TextVQA, AI2D, ChartQA, DocVQA, InfographicVQA). Below, we provide qualitative examples showcasing NVILA's OCR, reasoning, and multi-image capabilities.
We evaluate our models on a range of video understanding benchmarks, spanning short videos of a few seconds to longer videos up to an hour in duration. The table below presents the performance of NVILA compared to baseline models. NVILA features long-context capability and can process up to 256 frames. With the scale-then-compress design, NVILA-8B achieves impressive results, setting new state-of-the-art performance across all benchmarks. Notably, NVILA reaches performance levels comparable to GPT-4o mini with only 8B parameters and outperforms many larger models.
@misc{liu2024nvila,
title={NVILA: Efficient Frontier Visual Language Models},
author={Zhijian Liu and Ligeng Zhu and Baifeng Shi and Zhuoyang Zhang and Yuming Lou and Shang Yang and Haocheng Xi and Shiyi Cao and Yuxian Gu and Dacheng Li and Xiuyu Li and Yunhao Fang and Yukang Chen and Cheng-Yu Hsieh and De-An Huang and An-Chieh Cheng and Vishwesh Nath and Jinyi Hu and Sifei Liu and Ranjay Krishna and Daguang Xu and Xiaolong Wang and Pavlo Molchanov and Jan Kautz and Hongxu Yin and Song Han and Yao Lu},
year={2024},
eprint={2412.04468},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2412.04468},
}