Scaling Test-time Inference for Visual Grounding
Previous state-of-the-art grounding VLMs typically have large model sizes, making them heavy to deploy and slow at inference. This project proposes a new paradigm, EGM (Efficient visual Grounding language Models), demonstrating that by increasing test-time computation, small models can outperform large models on visual grounding tasks while being significantly faster.
Figure 1: Performance (IoU) vs. Efficiency (Latency) Comparison on RefCOCO. Bubble size represents model parameters. EGM models demonstrate superior accuracy and efficiency compared to conventional large VLMs.
The EGM-Qwen3-VL-8B model inference speed is 5.9x faster than Qwen3-VL-235B, with an average latency of only 737ms.
On the RefCOCO benchmark, EGM-Qwen3-VL-8B achieves 91.4 IoU, surpassing the massive 235B parameter model (90.5 IoU).
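For reference, the IoU (Intersection-over-Union) metric used throughout these results can be computed as in the minimal sketch below; the `[x1, y1, x2, y2]` box format is an assumption for illustration.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes in [x1, y1, x2, y2] format."""
    # Intersection rectangle (empty if the boxes do not overlap).
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    # Union = sum of areas minus intersection.
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```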
Introduces Chain-of-Thought (CoT) reasoning via SFT + Reinforcement Learning (RL), teaching the model "how to think" about location.
Figure 2: The EGM Paradigm. Left: Existing state-of-the-art grounding VLMs usually have large model sizes. Right: Our EGM extends scaling laws by scaling up inference tokens to enhance text understanding capabilities of small VLMs, achieving better efficiency.
Why Do Small Models Fail?
Key Finding: The Bottleneck is Language, Not Vision
We found that VLMs of different sizes (e.g., Qwen series) often use the same visual encoder. Small models lag behind large models in Visual Grounding tasks primarily due to a gap in text understanding capabilities.
As model size increases, errors caused by complex semantics gradually decrease.
Figure 3: Visual Grounding Failure Modes Analysis. Detailed statistical breakdown showing that the majority of small VLM errors are due to complex semantic understanding, not visual encoding.
EGM: Scaling Test-time Inference
Conventional vs. EGM
- Conventional: state-of-the-art grounding VLMs usually have large model sizes; hard to deploy, high latency.
- EGM: equips small VLMs with multi-modal reasoning capability to compete with large models; deployment friendly, low inference cost, lower total latency.
Training Pipeline (SFT + RL)
1. SFT (Cold Start): Use GPT-4 to generate detailed reasoning steps, constructing the SFT dataset.
2. RL (Reinforcement Learning): Train using GRPO (Group Relative Policy Optimization). The reward function combines IoU and task success rate.
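The RL reward and GRPO's group-relative normalization might be sketched as follows. This is illustrative only: the weighting between IoU and success, and the IoU threshold defining "success," are assumptions not specified in the text.

```python
def grounding_reward(pred_box, gt_box, iou_weight=0.5, success_iou=0.5):
    """Reward combining continuous IoU with a binary task-success bonus.

    Boxes are [x1, y1, x2, y2]. The weight and success threshold
    are illustrative choices, not the paper's exact values.
    """
    x1 = max(pred_box[0], gt_box[0])
    y1 = max(pred_box[1], gt_box[1])
    x2 = min(pred_box[2], gt_box[2])
    y2 = min(pred_box[3], gt_box[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_p = (pred_box[2] - pred_box[0]) * (pred_box[3] - pred_box[1])
    area_g = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
    union = area_p + area_g - inter
    iou = inter / union if union > 0 else 0.0
    success = 1.0 if iou >= success_iou else 0.0
    return iou_weight * iou + (1.0 - iou_weight) * success

def group_relative_advantages(rewards):
    """The 'group relative' part of GRPO: normalize each sampled
    response's reward by the mean and std of its group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]
```

GRPO then weights each sampled response's policy-gradient update by its normalized advantage, so responses with above-group-average reward are reinforced without needing a learned value model.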
Figure 4: Overview of our method. Top (a): Data curation pipeline of training data with reasoning. We feed the image, text prompt and ground truth bounding box of the target object into a proprietary VLM to generate the detailed reasoning process of how to locate the object correctly given the image and text prompt. The generated reasoning process is incorporated as part of the training data. Bottom (b): Examples of generated reasoning training data for vanilla grounding and amodal grounding. The reasoning process of vanilla grounding analyzes the feature that distinguishes the target object from others, and the reasoning process of amodal grounding further involves what object causes the occlusion and in which directions the visible parts should be extended to recover the complete object.
New Task: Amodal Grounding
Existing visual grounding focuses only on the visible parts of objects. This paper defines a new task, Amodal Grounding, which requires the model to predict the full bounding box of an object, including its occluded parts.
Challenges
The model must identify the object and also reason about "what is occluding it," "the object's full shape," and "the directions in which the hidden parts extend."
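To make the task concrete, one step of that reasoning amounts to extending the visible box toward the inferred occluded side. The helper below is hypothetical, written only to illustrate the idea; it is not the paper's method.

```python
def extend_box(visible_box, direction, amount):
    """Extend a visible [x1, y1, x2, y2] box toward an occluded side.

    direction: one of 'left', 'right', 'up', 'down' (y grows downward,
    image convention); amount: extension in pixels.
    Hypothetical illustration of amodal grounding, not the paper's method.
    """
    x1, y1, x2, y2 = visible_box
    if direction == "left":
        x1 -= amount
    elif direction == "right":
        x2 += amount
    elif direction == "up":
        y1 -= amount
    elif direction == "down":
        y2 += amount
    else:
        raise ValueError(f"unknown direction: {direction}")
    return [x1, y1, x2, y2]
```

For example, if a car's right side is hidden behind a tree, the model's reasoning would pick `direction="right"` and an extension amount consistent with the car's full shape.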
EGM Performance
EGM significantly improves small model performance on this task. EGM-InternVL-3-8B improves accuracy by +11.5%.
Key Experimental Results
Performance Improvement Details (RefCOCO Avg. Acc)
| Model Family | Size (Parameters) | Base Acc | EGM Acc (Ours) | Gain (pts) |
|---|---|---|---|---|
| Qwen3-VL | 2B | 83.6% | 89.6% | +6.0% |
| Qwen3-VL | 4B | 87.2% | 91.0% | +3.8% |
| Qwen3-VL | 8B | 87.8% | 91.4% | +3.6% |
| InternVL-3 | 1B | 81.6% | 86.8% | +5.2% |
| InternVL-3 | 2B | 86.7% | 88.4% | +1.7% |
| InternVL-3 | 8B | 89.6% | 90.7% | +1.1% |