Scaling Test-time Inference for Visual Grounding

Guanqi Zhan1,3*, Changye Li4*, Zhijian Liu1, Yao Lu1, Yi Wu4, Song Han1,2, Ligeng Zhu1
1NVIDIA Research, 2MIT, 3University of Oxford, 4Tsinghua University

Previous state-of-the-art grounding VLMs usually have large model sizes, making them heavy to deploy and slow at inference. This project proposes a new paradigm, EGM (Efficient visual Grounding language Models), demonstrating that by increasing test-time computation, small models can outperform large ones on visual grounding tasks while being significantly faster.


Figure 1: Performance (IoU) vs. Efficiency (Latency) Comparison on RefCOCO. Bubble size represents model parameters. EGM models demonstrate superior accuracy and efficiency compared to conventional large VLMs.

Speed & Efficiency

EGM-Qwen3-VL-8B runs inference 5.9x faster than Qwen3-VL-235B, with an average latency of only 737 ms.

Outperforming Giants

On the RefCOCO benchmark, EGM-Qwen3-VL-8B achieves 91.4 IoU, surpassing the massive 235B parameter model (90.5 IoU).

Core Mechanism

Introduces Chain-of-Thought (CoT) reasoning via SFT + Reinforcement Learning (RL), teaching the model "how to think" about object location.
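To make the CoT mechanism concrete, here is a minimal sketch of how a reasoning-style grounding response might be parsed into its two parts. The `<box>[x1, y1, x2, y2]</box>` tag format and the function name are illustrative assumptions, not the paper's actual output specification.

```python
import re

def parse_grounding_output(text: str):
    """Split a CoT grounding response into (reasoning, box) parts.

    Assumes the model emits free-form reasoning followed by a box tagged
    as "<box>[x1, y1, x2, y2]</box>" -- an illustrative format, not the
    paper's actual spec.
    """
    match = re.search(r"<box>\s*\[([^\]]+)\]\s*</box>", text)
    if match is None:
        return text.strip(), None  # no parseable box in the response
    coords = [float(v) for v in match.group(1).split(",")]
    reasoning = text[: match.start()].strip()
    return reasoning, coords

reasoning, box = parse_grounding_output(
    "The leftmost red mug is partially behind the laptop. "
    "<box>[12, 40, 88, 120]</box>"
)
```

The point of the extra inference tokens is exactly the `reasoning` string: the small model spends test-time compute disambiguating the prompt before committing to coordinates.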


Figure 2: The EGM Paradigm. Left: Existing state-of-the-art grounding VLMs usually have large model sizes. Right: Our EGM extends scaling laws by scaling up inference tokens to enhance text understanding capabilities of small VLMs, achieving better efficiency.


Why Do Small Models Fail?

Key Finding: The Bottleneck is Language, Not Vision

We found that VLMs of different sizes (e.g., Qwen series) often use the same visual encoder. Small models lag behind large models in Visual Grounding tasks primarily due to a gap in text understanding capabilities.

Failure Pattern Analysis: 62.8% of errors fall into the "COMPLEX PROMPT" category. Small models get confused when prompts are semantically complex (containing multiple relational descriptions) and the image contains multiple similar candidates.

As model size increases, errors caused by complex semantics gradually decrease.


Figure 3: Visual Grounding Failure Modes Analysis. Detailed statistical breakdown showing that the majority of small VLM errors are due to complex semantic understanding, not visual encoding.

EGM: Scaling Test-time Inference

Conventional vs. EGM

Conventional

State-of-the-art grounding VLMs usually have large model sizes.

Hard to deploy, high latency.

EGM Strategy

Equip small VLMs with multi-modal reasoning capability to compete with large models.

Deployment friendly, low inference cost, lower total latency.

Training Pipeline (SFT + RL)

  1. SFT (Cold Start): Use GPT-4 to generate detailed reasoning steps, constructing the SFT dataset.

  2. RL (Reinforcement Learning): Train using GRPO (Group Relative Policy Optimization). The reward function combines IoU and task success rate.
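A minimal sketch of the reward design described above: IoU between the predicted and ground-truth boxes, combined with a binary success term, followed by GRPO's group-relative normalization. The 0.5 IoU threshold, the equal weighting of the two terms, and the function names are assumptions for illustration, not the paper's exact hyperparameters.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounding_reward(pred, gt, iou_threshold=0.5):
    """Combine continuous IoU with a binary success bonus (assumed 50/50 mix)."""
    score = iou(pred, gt)
    success = 1.0 if score >= iou_threshold else 0.0
    return 0.5 * score + 0.5 * success

def group_relative_advantages(rewards):
    """GRPO-style advantage: each sampled rollout's reward is normalized
    against the mean and std of its own group of rollouts."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + 1e-6) for r in rewards]
```

Because GRPO compares rollouts within a group rather than against a learned value baseline, a reward like this only needs to rank candidate boxes sensibly, which is why a simple IoU-plus-success mix suffices.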


Figure 4: Overview of our method. Top (a): Data curation pipeline for training data with reasoning. We feed the image, text prompt, and ground-truth bounding box of the target object into a proprietary VLM to generate a detailed reasoning process for how to locate the object correctly given the image and text prompt. The generated reasoning process is incorporated as part of the training data. Bottom (b): Examples of generated reasoning training data for vanilla grounding and amodal grounding. The reasoning process for vanilla grounding analyzes the features that distinguish the target object from others, while the reasoning process for amodal grounding further considers what object causes the occlusion and in which directions the visible parts should be extended to recover the complete object.

New Task: Amodal Grounding

Existing visual grounding focuses only on the visible parts of objects. This paper defines a new task, Amodal Grounding, which requires the model to predict the full bounding box of an object, including its occluded parts.

Challenges

The model must identify the object AND reason about "what is occluding it," "the object's full shape," and "the extension direction of hidden parts."
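The geometric half of that reasoning can be sketched as a simple box-extension step: once the model has decided which sides of the visible region are occluded, the visible box is grown outward along those sides. The direction names, the fixed extension amount, and the function name are hypothetical; the actual model regresses the full box directly rather than calling a helper like this.

```python
def extend_box(visible_box, occluded_sides, amount):
    """Grow a visible (x1, y1, x2, y2) box outward along each occluded side.

    Illustrative sketch only: `occluded_sides` is a set drawn from
    {"left", "right", "up", "down"}, and `amount` is a fixed extension
    in pixels -- both assumptions, not the paper's formulation.
    """
    x1, y1, x2, y2 = visible_box
    if "left" in occluded_sides:
        x1 -= amount
    if "right" in occluded_sides:
        x2 += amount
    if "up" in occluded_sides:
        y1 -= amount
    if "down" in occluded_sides:
        y2 += amount
    return (x1, y1, x2, y2)

# e.g. a chair whose right side and legs are hidden behind a table:
full_box = extend_box((10, 10, 50, 50), {"right", "down"}, 5)
```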

EGM Performance

EGM significantly improves small model performance on this task. EGM-InternVL-3-8B improves accuracy by +11.5%.

Key Experimental Results

Performance Improvement Details (RefCOCO Avg. Acc)

| Model Family | Size (Parameters) | Base Acc (IoU) | EGM Acc (Ours) | Gain |
|--------------|-------------------|----------------|----------------|-------|
| Qwen3-VL     | 2B                | 83.6%          | 89.6%          | +6.0% |
| Qwen3-VL     | 4B                | 87.2%          | 91.0%          | +3.8% |
| Qwen3-VL     | 8B                | 87.8%          | 91.4%          | +3.6% |
| InternVL-3   | 1B                | 81.6%          | 86.8%          | +5.2% |
| InternVL-3   | 2B                | 86.7%          | 88.4%          | +1.7% |
| InternVL-3   | 8B                | 89.6%          | 90.7%          | +1.1% |