Scaling Test-time Inference for Visual Grounding

Guanqi Zhan1,3*, Changye Li4*, Zhijian Liu1, Yao Lu1, Yi Wu4, Song Han1,2, Ligeng Zhu1
1NVIDIA Research, 2MIT, 3University of Oxford, 4Tsinghua University

Previous state-of-the-art grounding VLMs usually have large model sizes, making them heavy to deploy and slow at inference. This project proposes a new paradigm, EGM (Efficient visual Grounding language Models), demonstrating that by increasing test-time computation, small models can outperform large ones on visual grounding tasks while being significantly faster.


Figure 1: Performance (IoU) vs. Efficiency (Latency) Comparison on RefCOCO. Bubble size represents model parameters. EGM models demonstrate superior accuracy and efficiency compared to conventional large VLMs.

Speed & Efficiency

EGM-Qwen3-VL-8B runs inference 5.9x faster than Qwen3-VL-235B, with an average latency of only 737 ms.

Outperforming Giants

On the RefCOCO benchmark, EGM-Qwen3-VL-8B achieves 91.4 IoU, surpassing the massive 235B parameter model (90.5 IoU).

Core Mechanism

Introduces Chain-of-Thought (CoT) reasoning via SFT + Reinforcement Learning (RL), teaching the model "how to think" about object location.
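To make the CoT mechanism concrete, here is a minimal sketch of how a reasoning-style grounding response might be parsed into its two parts. The `<box>[x1, y1, x2, y2]</box>` tag format and the function name are illustrative assumptions, not the paper's actual output specification.

```python
import re

def parse_grounding_output(text: str):
    """Split a CoT grounding response into (reasoning, box) parts.

    Assumes the model emits free-form reasoning followed by a box tagged
    as "<box>[x1, y1, x2, y2]</box>" -- an illustrative format, not the
    paper's actual spec.
    """
    match = re.search(r"<box>\s*\[([^\]]+)\]\s*</box>", text)
    if match is None:
        return text.strip(), None  # no parseable box in the response
    coords = [float(v) for v in match.group(1).split(",")]
    reasoning = text[: match.start()].strip()
    return reasoning, coords

reasoning, box = parse_grounding_output(
    "The leftmost red mug is partially behind the laptop. "
    "<box>[12, 40, 88, 120]</box>"
)
```

The point of the extra inference tokens is exactly the `reasoning` string: the small model spends test-time compute disambiguating the prompt before committing to coordinates.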


Figure 2: The EGM Paradigm. Left: Existing state-of-the-art grounding VLMs usually have large model sizes. Right: Our EGM extends scaling laws by scaling up inference tokens to enhance text understanding capabilities of small VLMs, achieving better efficiency.


Why Do Small Models Fail?

Key Finding: The Bottleneck is Language, Not Vision

We found that VLMs of different sizes (e.g., Qwen series) often use the same visual encoder. Small models lag behind large models in Visual Grounding tasks primarily due to a gap in text understanding capabilities.

Failure Pattern Analysis: 62.8% of errors fall into the "COMPLEX PROMPT" category. Small models get confused when prompts are semantically complex (containing multiple relational descriptions) and the image contains multiple similar candidates.

As model size increases, errors caused by complex semantics gradually decrease.


Figure 3: Visual Grounding Failure Modes Analysis. Detailed statistical breakdown showing that the majority of small VLM errors are due to complex semantic understanding, not visual encoding.

EGM: Scaling Test-time Inference

Conventional vs. EGM

Conventional

State-of-the-art grounding VLMs usually have large model sizes.

Hard to deploy, high latency.

EGM Strategy

Equip small VLMs with multi-modal reasoning capability to compete with large models.

Deployment friendly, low inference cost, lower total latency.

Training Pipeline (SFT + RL)

  1. SFT (Cold Start): Use GPT-4 to generate detailed reasoning steps, constructing the SFT dataset.

  2. RL (Reinforcement Learning): Train using GRPO (Group Relative Policy Optimization). The reward function combines IoU and task success rate.
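A minimal sketch of the reward design described above: IoU between the predicted and ground-truth boxes, combined with a binary success term, followed by GRPO's group-relative normalization. The 0.5 IoU threshold, the equal weighting of the two terms, and the function names are assumptions for illustration, not the paper's exact hyperparameters.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounding_reward(pred, gt, iou_threshold=0.5):
    """Combine continuous IoU with a binary success bonus (assumed 50/50 mix)."""
    score = iou(pred, gt)
    success = 1.0 if score >= iou_threshold else 0.0
    return 0.5 * score + 0.5 * success

def group_relative_advantages(rewards):
    """GRPO-style advantage: each sampled rollout's reward is normalized
    against the mean and std of its own group of rollouts."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + 1e-6) for r in rewards]
```

Because GRPO compares rollouts within a group rather than against a learned value baseline, a reward like this only needs to rank candidate boxes sensibly, which is why a simple IoU-plus-success mix suffices.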


Figure 4: Overview of our method. Top (a): Data curation pipeline for training data with reasoning. We feed the image, text prompt, and ground-truth bounding box of the target object into a proprietary VLM to generate a detailed reasoning process for how to locate the object correctly given the image and text prompt. The generated reasoning process is incorporated as part of the training data. Bottom (b): Examples of generated reasoning training data for vanilla grounding and amodal grounding. The reasoning process for vanilla grounding analyzes the features that distinguish the target object from others, while the reasoning process for amodal grounding further considers what object causes the occlusion and in which directions the visible parts should be extended to recover the complete object.

New Task: Amodal Grounding

Existing visual grounding focuses only on the visible parts of objects. This paper defines a new task, Amodal Grounding, which requires the model to predict the full bounding box of an object, including its occluded parts.

Challenges

The model must identify the object AND reason about "what is occluding it," "the object's full shape," and "the extension direction of hidden parts."
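The geometric half of that reasoning can be sketched as a simple box-extension step: once the model has decided which sides of the visible region are occluded, the visible box is grown outward along those sides. The direction names, the fixed extension amount, and the function name are hypothetical; the actual model regresses the full box directly rather than calling a helper like this.

```python
def extend_box(visible_box, occluded_sides, amount):
    """Grow a visible (x1, y1, x2, y2) box outward along each occluded side.

    Illustrative sketch only: `occluded_sides` is a set drawn from
    {"left", "right", "up", "down"}, and `amount` is a fixed extension
    in pixels -- both assumptions, not the paper's formulation.
    """
    x1, y1, x2, y2 = visible_box
    if "left" in occluded_sides:
        x1 -= amount
    if "right" in occluded_sides:
        x2 += amount
    if "up" in occluded_sides:
        y1 -= amount
    if "down" in occluded_sides:
        y2 += amount
    return (x1, y1, x2, y2)

# e.g. a chair whose right side and legs are hidden behind a table:
full_box = extend_box((10, 10, 50, 50), {"right", "down"}, 5)
```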

EGM Performance

EGM significantly improves small model performance on this task. EGM-InternVL-3-8B improves accuracy by +11.5%.

Key Experimental Results

Performance Improvement Details (RefCOCO Avg. Acc)

| Model Family | Size (Parameters) | Base Acc (IoU) | EGM Acc (Ours) | Gain |
|--------------|-------------------|----------------|----------------|-------|
| Qwen3-VL     | 2B                | 83.6%          | 89.6%          | +6.0% |
| Qwen3-VL     | 4B                | 87.2%          | 91.0%          | +3.8% |
| Qwen3-VL     | 8B                | 87.8%          | 91.4%          | +3.6% |
| InternVL-3   | 1B                | 81.6%          | 86.8%          | +5.2% |
| InternVL-3   | 2B                | 86.7%          | 88.4%          | +1.7% |
| InternVL-3   | 8B                | 89.6%          | 90.7%          | +1.1% |