VideoITG: Instructed Temporal Grounding for Videos

Abstract
While Video Large Language Models (Video-LLMs) have shown significant potential in multimodal understanding and reasoning tasks, efficiently selecting the most informative frames from videos remains a critical challenge. Existing methods attempt to optimize frame sampling by reducing inter-frame redundancy or employing unsupervised event localization. However, these approaches often fall short on complex instruction-following tasks and scenarios that demand precise temporal modeling, resulting in limited performance in both semantic alignment and temporal reasoning. To address these challenges, we introduce Instructed Temporal Grounding for Videos (VideoITG), a framework that adaptively customizes frame sampling strategies based on user instructions. Specifically, we design the VidThinker pipeline, which automates annotation by generating instruction-conditioned captions, retrieving relevant video segments, and selecting key frames to enable efficient supervision. Using VidThinker, we build the VideoITG-40K dataset with 40K videos and 500K temporal grounding annotations. Our plug-and-play VideoITG model leverages the visual-language alignment and reasoning abilities of Video-LLMs for discriminative frame selection. VideoITG consistently boosts performance across multiple multimodal video understanding benchmarks, demonstrating its effectiveness and potential.
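The three VidThinker stages described above (instruction-conditioned captioning, segment retrieval, key-frame selection) can be sketched as follows. This is a minimal illustrative stand-in, not the paper's actual pipeline: the captioner is a placeholder string formatter and the retriever is a toy word-overlap score rather than a learned model.

```python
# Hypothetical sketch of a VidThinker-style annotation pipeline.
# All three stage functions are illustrative stand-ins.

def caption_clips(clips, instruction):
    # Stage 1: generate an instruction-conditioned caption per short clip.
    # (Placeholder: a real pipeline would call a captioning model here.)
    return [f"caption({c}, {instruction})" for c in clips]

def retrieve_segments(captions, instruction, top_k=2):
    # Stage 2: keep the clips whose captions best match the instruction
    # (toy overlap score in place of a learned retriever).
    scores = [sum(w in cap for w in instruction.split()) for cap in captions]
    ranked = sorted(range(len(captions)), key=lambda i: -scores[i])
    return sorted(ranked[:top_k])

def select_frames(segment_ids, frames_per_clip=4, stride=2):
    # Stage 3: pick key frames inside the retrieved segments only,
    # yielding (segment, frame) pairs as grounding supervision.
    return [(s, f) for s in segment_ids for f in range(0, frames_per_clip, stride)]
```

The point of the sketch is the control flow: frame selection is conditioned on the instruction end to end, rather than sampling uniformly over the whole video.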
Dataset
Dataset Construction
Figure 2. Illustration of four instruction types and their corresponding frame selection strategies in VidThinker.
Model
Model Design
Figure 3. VideoITG model design: (A) Text generation aligns video and language tokens for sequential predictions. (B) Classification with causal attention utilizes anchor tokens for temporal cue management. (C) Classification with full attention facilitates interaction across visual and text tokens without anchors.
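The difference between variants (B) and (C) in Figure 3 comes down to the attention mask over the concatenated visual and text tokens. A minimal sketch, assuming a boolean mask where True means "may attend" (the token layout and anchor handling are simplified away):

```python
import numpy as np

# Illustrative masks only; the paper's exact token layout (anchor tokens,
# visual/text ordering) is not reproduced here.

def causal_mask(n):
    # (B) causal attention: each token attends only to itself and
    # earlier tokens (lower-triangular mask).
    return np.tril(np.ones((n, n), dtype=bool))

def full_mask(n):
    # (C) full attention: every visual/text token attends to every
    # other token, enabling bidirectional interaction without anchors.
    return np.ones((n, n), dtype=bool)
```

Under a causal mask, later text tokens can see earlier visual tokens but not vice versa, which is why variant (B) needs anchor tokens to carry temporal cues; the full mask removes that restriction.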
Experiments
Results
Figure 4. Frame selection comparison on VideoMME with LLaVA-Video-7B.
Table 2. Results with different selection methods.
Table 3. Extension to more Video-LLMs with different model sizes and sampled frames.
Table 4. Ablation studies on VideoITG-40K and model design.
Qualitative
Qualitative Results
Our VideoITG model effectively searches for temporal cues in long videos, enabling the Video-LLM to answer questions accurately.
Example 1. Qualitative comparison: uniform sampling vs. ours.
Example 2. Qualitative comparison: uniform sampling vs. ours.
Citation
BibTeX
@misc{wang2025videoitgmultimodalvideounderstanding,
title={VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding},
author={Shihao Wang and Guo Chen and De-An Huang and Zhiqi Li and Minghan Li and Guilin Liu and Jose M. Alvarez and Lei Zhang and Zhiding Yu},
year={2025},
eprint={2507.13353},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2507.13353},
}