Figure 1: Overview of the VidThinker annotation pipeline for VideoITG. The pipeline consists of three stages that fully leverage the provided instructions: (1) segment-level clip captioning; (2) instruction-guided relevant clip retrieval; (3) fine-grained frame-level localization.
Recent studies have revealed that selecting informative and relevant video frames can significantly improve the performance of Video Large Language Models (Video-LLMs). Current methods, such as reducing inter-frame redundancy, employing separate models for image-text relevance assessment, or utilizing temporal video grounding for event localization, are mostly training-free and struggle to handle the complex scenarios that arise in long video understanding. We propose Instructed Temporal Grounding for Videos (VideoITG), featuring customized frame sampling aligned with user instructions. The core of VideoITG is the VidThinker pipeline, an automated annotation framework that explicitly mimics the human annotation process. First, it generates detailed clip-level captions conditioned on the instruction; then, it retrieves relevant video segments through instruction-guided reasoning; finally, it performs fine-grained frame selection to pinpoint the most informative visual evidence. Leveraging VidThinker, we construct the VideoITG-40K dataset, containing 40K videos and 500K instructed temporal grounding annotations. We then design a plug-and-play VideoITG model, which takes advantage of the visual-language alignment and reasoning capabilities of Video-LLMs, to perform frame selection in a discriminative manner. Coupled with Video-LLMs, VideoITG achieves consistent performance improvements across multiple multimodal video understanding benchmarks, demonstrating its superiority and strong potential for video understanding.
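To make the three-stage annotation flow concrete, here is a minimal sketch of how the VidThinker stages compose. The function names (caption_clip, score_relevance, select_frames) and the relevance threshold are illustrative assumptions, not the authors' actual interface; the point is the stage ordering: instruction-conditioned captioning, instruction-guided retrieval, then frame-level localization.

```python
# Minimal sketch of the VidThinker annotation flow (assumed interface).
from dataclasses import dataclass

@dataclass
class Clip:
    start: float          # clip start time in seconds
    end: float            # clip end time in seconds
    caption: str = ""     # instruction-conditioned caption (stage 1)

def vidthinker_annotate(video_clips, instruction,
                        caption_clip, score_relevance, select_frames,
                        relevance_threshold=0.5):
    """Return frame-level annotations for one (video, instruction) pair."""
    # Stage 1: segment-level clip captioning, conditioned on the instruction.
    for clip in video_clips:
        clip.caption = caption_clip(clip, instruction)

    # Stage 2: instruction-guided retrieval of relevant clips.
    relevant = [c for c in video_clips
                if score_relevance(c.caption, instruction) >= relevance_threshold]

    # Stage 3: fine-grained frame-level localization inside relevant clips.
    frames = []
    for clip in relevant:
        frames.extend(select_frames(clip, instruction))
    return frames
```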
Figure 2: Illustration of four instruction types and their corresponding frame selection strategies in VidThinker. For semantic-focused instructions, the system selects diverse frames capturing key visual clues. For motion-focused instructions, frames are uniformly sampled to capture dynamic changes. When both semantic and motion cues are required, a hybrid sampling strategy is applied. For vague or open-ended instructions, the system samples a minimal yet diverse set of frames across the video for holistic coverage.
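The four strategies in Figure 2 can be summarized as a simple dispatch over the instruction type. The sketch below is an assumption-laden illustration: the instruction classifier, per-frame relevance scores, and the frame budgets are placeholders, not values from the paper.

```python
# Sketch of the four instruction-dependent sampling strategies (illustrative).
import numpy as np

def sample_frames(num_frames, instruction_type, frame_scores=None, budget=32):
    """Pick frame indices for one relevant clip, given the instruction type."""
    if instruction_type == "semantic":
        # Semantic-focused: diverse, clue-bearing frames via top relevance scores.
        order = np.argsort(frame_scores)[::-1]
        return sorted(order[:budget].tolist())
    if instruction_type == "motion":
        # Motion-focused: uniform sampling to preserve dynamic changes.
        return np.linspace(0, num_frames - 1, budget, dtype=int).tolist()
    if instruction_type == "hybrid":
        # Both cues needed: half top-scoring frames, half uniformly spaced.
        top = np.argsort(frame_scores)[::-1][:budget // 2]
        uniform = np.linspace(0, num_frames - 1, budget // 2, dtype=int)
        return sorted(set(top.tolist()) | set(uniform.tolist()))
    # Vague / open-ended: a minimal, evenly spread set for holistic coverage.
    return np.linspace(0, num_frames - 1, min(budget, 8), dtype=int).tolist()
```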
Figure 3: VideoITG model design: (a) Text generation aligns video and language tokens for sequential predictions. (b) Classification with causal attention utilizes anchor tokens for temporal cue management. (c) Classification with full attention facilitates interaction across visual and text tokens without anchors.
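As a rough sketch of variant (c), classification with full attention lets frame tokens and instruction tokens attend to each other bidirectionally, and a binary head scores each frame for selection. The dimensions, layer counts, and training loss below are assumptions for illustration, not the paper's configuration.

```python
# Sketch of discriminative frame selection with full attention (assumed config).
import torch
import torch.nn as nn

class FrameSelectionHead(nn.Module):
    def __init__(self, dim=1024, num_layers=2, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(dim, 1)   # per-frame relevance logit

    def forward(self, frame_tokens, text_tokens):
        # Full attention: concatenate visual and text tokens, no causal mask.
        x = torch.cat([frame_tokens, text_tokens], dim=1)
        x = self.encoder(x)
        # Score only the visual positions.
        return self.classifier(x[:, : frame_tokens.size(1)]).squeeze(-1)

# Usage: pick the top-k frames by predicted relevance.
head = FrameSelectionHead()
frames = torch.randn(1, 64, 1024)   # 64 candidate frame tokens
text = torch.randn(1, 16, 1024)     # 16 instruction tokens
scores = head(frames, text)         # train with binary cross-entropy vs. ITG labels
topk = scores.topk(8, dim=-1).indices
```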
Table 1: Performance comparison of VideoITG when integrated with Video-LLMs that differ in the size of the answering LLM and the number of sampled frames.
Table 2: Empirical study on the VideoITG-40K dataset and the VideoITG model design.
@misc{wang2025videoitgmultimodalvideounderstanding,
title={VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding},
author={Shihao Wang and Guo Chen and De-an Huang and Zhiqi Li and Minghan Li and Guilin Li and Jose M. Alvarez and Lei Zhang and Zhiding Yu},
year={2025},
eprint={2507.13353},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2507.13353},
}