VideoITG: Improving Multimodal Video Understanding with Instructed Temporal Grounding

1The Hong Kong Polytechnic University, 2NVIDIA, 3Nanjing University, 4Harvard University
* Corresponding Author

Figure 1: Overview of the VidThinker annotation pipeline for VideoITG. The pipeline consists of three stages that fully leverage the provided instructions: (1) segment-level clip captioning; (2) instruction-guided relevant clip retrieval; (3) fine-grained frame-level localization.

Abstract

Recent studies have revealed that selecting informative and relevant video frames can significantly improve the performance of Video Large Language Models (Video-LLMs). Current methods, such as reducing inter-frame redundancy, employing separate models for image-text relevance assessment, or utilizing temporal video grounding for event localization, are mostly training-free approaches, yet they struggle to address the complex scenarios of long video understanding. We propose Instructed Temporal Grounding for Videos (VideoITG), featuring customized frame sampling aligned with user instructions. The core of VideoITG is the VidThinker pipeline, an automated annotation framework that explicitly mimics the human annotation process. First, it generates detailed clip-level captions conditioned on the instruction; then, it retrieves relevant video segments through instruction-guided reasoning; finally, it performs fine-grained frame selection to pinpoint the most informative visual evidence. Leveraging VidThinker, we construct the VideoITG-40K dataset, containing 40K videos and 500K instructed temporal grounding annotations. We then design a plug-and-play VideoITG model, which takes advantage of the visual-language alignment and reasoning capabilities of Video-LLMs, to perform effective frame selection in a discriminative manner. Coupled with Video-LLMs, VideoITG achieves consistent performance improvements across multiple multimodal video understanding benchmarks, demonstrating its superiority and great potential for video understanding.
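
For readers who prefer pseudocode, the sketch below outlines the three VidThinker stages described above. It is a minimal illustration only: the function names, prompt wording, and the vlm/llm callables are assumptions for exposition, not the released implementation.

# Minimal sketch of the three-stage VidThinker annotation pipeline (illustrative only).
# `vlm` and `llm` stand in for any vision-language / language model callables; their
# names, prompts, and signatures are assumptions, not the authors' released code.
from typing import Callable, List, Sequence

def vidthinker(
    clips: Sequence,                      # pre-segmented video clips
    instruction: str,                     # user instruction / question
    vlm: Callable[[object, str], str],    # (clip, prompt) -> caption
    llm: Callable[[str], str],            # prompt -> response
) -> List[int]:
    # Stage 1: instruction-conditioned clip-level captioning.
    captions = [
        vlm(clip, f"Describe this clip with respect to: {instruction}")
        for clip in clips
    ]

    # Stage 2: instruction-guided retrieval of relevant clips via LLM reasoning.
    relevant_ids = []
    for i, cap in enumerate(captions):
        verdict = llm(
            f"Instruction: {instruction}\nClip caption: {cap}\n"
            "Is this clip relevant? Answer yes or no."
        )
        if verdict.strip().lower().startswith("yes"):
            relevant_ids.append(i)

    # Stage 3: fine-grained frame-level localization inside the relevant clips.
    selected_frames: List[int] = []
    for i in relevant_ids:
        answer = llm(
            f"Instruction: {instruction}\nCaption: {captions[i]}\n"
            "Which frame indices in this clip best support the answer?"
        )
        selected_frames.extend(
            int(tok) for tok in answer.replace(",", " ").split() if tok.isdigit()
        )
    return selected_frames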

Dataset Construction


Figure 2: Illustration of four instruction types and their corresponding frame selection strategies in VidThinker. For semantic-focused instructions, the system selects diverse frames capturing key visual clues. For motion-focused instructions, frames are uniformly sampled to capture dynamic changes. When both semantic and motion cues are required, a hybrid sampling strategy is applied. For vague or open-ended instructions, the system samples a minimal yet diverse set of frames across the video for holistic coverage.
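
The following sketch illustrates how such a dispatch over the four instruction types could look. The type labels, the budget parameter, and the sampling rules are simplified assumptions for exposition, not the exact strategies used in VidThinker.

# Illustrative dispatch over the four instruction types in Figure 2.
# Type labels, `budget`, and the sampling rules are assumptions for exposition.
import numpy as np

def select_frames(num_frames: int, instruction_type: str, budget: int = 32) -> np.ndarray:
    if instruction_type == "semantic":
        # Semantic-focused: a diverse set of representative frames covering key visual clues.
        return np.unique(np.linspace(0, num_frames - 1, budget).astype(int))
    if instruction_type == "motion":
        # Motion-focused: uniform, dense sampling to capture dynamic changes.
        return np.arange(0, num_frames, max(1, num_frames // budget))[:budget]
    if instruction_type == "hybrid":
        # Both cues required: split the budget between diversity and dense coverage.
        half = budget // 2
        semantic = np.linspace(0, num_frames - 1, half).astype(int)
        motion = np.arange(0, num_frames, max(1, num_frames // half))[:half]
        return np.unique(np.concatenate([semantic, motion]))
    # Vague / open-ended: a minimal yet diverse set across the whole video.
    return np.linspace(0, num_frames - 1, budget // 2).astype(int)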

Model Design


Figure 3: VideoITG model design: (a) Text generation aligns video and language tokens for sequential predictions. (b) Classification with causal attention utilizes anchor tokens for temporal cue management. (c) Classification with full attention facilitates interaction across visual and text tokens without anchors.
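
As a rough illustration of variant (c), the sketch below shows a per-frame relevance classifier with full attention over concatenated visual and text tokens. All dimensions, module names, and the top-k selection step are placeholders assumed for this example, not the released architecture.

# Minimal sketch of variant (c): per-frame relevance classification with full
# attention across visual and text tokens. Dimensions and module names are
# placeholders, not the released architecture.
import torch
import torch.nn as nn

class FrameRelevanceHead(nn.Module):
    def __init__(self, dim: int = 1024, num_layers: int = 2, num_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )
        # Full (bidirectional) attention: no causal mask and no anchor tokens.
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(dim, 1)  # per-frame relevance logit

    def forward(self, frame_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (B, T, D), one token per frame; text_tokens: (B, L, D).
        x = torch.cat([frame_tokens, text_tokens], dim=1)
        x = self.encoder(x)
        # Score only the frame positions; the top-scoring frames are then fed to the Video-LLM.
        return self.classifier(x[:, : frame_tokens.size(1)]).squeeze(-1)

# Usage: logits = FrameRelevanceHead()(frames, text); keep the frames with the highest logits.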

Quantitative Results

Main Results


Table 1: Performance comparison of VideoITG when integrated with different Video-LLMs, varying the size of the answering LLM and the number of sampled frames.


Empirical Study on VideoITG


Table 2: Empirical study on the VideoITG-40K dataset and the VideoITG model design.

Qualitative Results


Our VideoITG model effectively locates temporal cues in long videos, enabling Video-LLMs to answer questions accurately.

Qualitative result 1.

Qualitative result 2.

BibTeX

@misc{wang2025videoitgmultimodalvideounderstanding,
      title={VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding}, 
      author={Shihao Wang and Guo Chen and De-an Huang and Zhiqi Li and Minghan Li and Guilin Li and Jose M. Alvarez and Lei Zhang and Zhiding Yu},
      year={2025},
      eprint={2507.13353},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.13353}, 
}