Accepted by CVPR 2026

VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding

The Hong Kong Polytechnic University1, Nanjing University2, NVIDIA3, Harvard University4

† Equal Corresponding and Advising Authors  ·  ‡ Work done during an internship at NVIDIA

✉ cslzhang@comp.polyu.edu.hk; scutchrisding@gmail.com

VideoITG is an instructed temporal grounding framework that adaptively customizes frame sampling strategies based on user instructions, enabling Video-LLMs to reason with the most informative frames under efficient supervision.

VideoITG overview figure.
VideoITG dataset construction overview.

VideoITG-40K is built via VidThinker: instruction-conditioned captioning, relevant clip retrieval, and fine-grained frame localization.
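
For intuition, the sketch below shows how such a three-stage pass could be organized in code. It is a minimal illustration only, not the released VidThinker pipeline: the callables caption_clip, score_relevance, and score_frame are hypothetical stand-ins for the instruction-conditioned captioner and relevance scorers, and the clip length and top-k values are arbitrary.

from typing import Callable, List, Sequence

def annotate_video(
    frames: Sequence,                 # decoded video frames
    instruction: str,                 # user instruction / question
    caption_clip: Callable,           # assumed: instruction-conditioned clip captioner
    score_relevance: Callable,        # assumed: caption-vs-instruction relevance scorer
    score_frame: Callable,            # assumed: per-frame relevance scorer
    clip_len: int = 16,
    top_clips: int = 4,
    frames_per_clip: int = 2,
) -> List[int]:
    """Three-stage VidThinker-style pass: caption clips, retrieve relevant
    clips, then localize informative frames inside them (illustrative only)."""
    # Stage 1: split the video into short clips and caption each clip,
    # conditioned on the instruction.
    clips = [frames[i:i + clip_len] for i in range(0, len(frames), clip_len)]
    captions = [caption_clip(clip, instruction) for clip in clips]

    # Stage 2: keep only the clips whose captions are most relevant
    # to the instruction.
    relevance = [score_relevance(cap, instruction) for cap in captions]
    kept = sorted(range(len(clips)), key=lambda i: relevance[i], reverse=True)[:top_clips]

    # Stage 3: within each kept clip, localize the most informative frames.
    selected = []
    for i in sorted(kept):
        frame_scores = [score_frame(frame, instruction) for frame in clips[i]]
        best = sorted(range(len(clips[i])), key=lambda j: frame_scores[j],
                      reverse=True)[:frames_per_clip]
        selected.extend(i * clip_len + j for j in sorted(best))
    return selected  # frame indices, usable as temporal grounding annotations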

Abstract

Instructed Temporal Grounding for Videos

While Video Large Language Models (Video-LLMs) have shown significant potential in multimodal understanding and reasoning tasks, efficiently selecting the most informative frames from videos remains a critical challenge. Existing methods attempt to optimize frame sampling by reducing inter-frame redundancy or employing unsupervised event localization. However, these approaches often fall short in complex instruction-following tasks and scenarios that demand precise temporal modeling, resulting in limited performance in both semantic alignment and temporal reasoning. To address these challenges, we introduce Instructed Temporal Grounding for Videos (VideoITG), a framework that adaptively customizes frame sampling strategies based on user instructions. Specifically, we design the VidThinker pipeline, which automates annotation by generating instruction-conditioned captions, retrieving relevant video segments, and selecting key frames to enable efficient supervision. Using VidThinker, we build the VideoITG-40K dataset with 40K videos and 500K temporal grounding annotations. Our plug-and-play VideoITG model leverages the visual-language alignment and reasoning capabilities of Video-LLMs for discriminative frame selection. VideoITG consistently boosts performance on multiple multimodal video understanding benchmarks, demonstrating its effectiveness and potential.
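
Since VideoITG is described above as plug-and-play, the short sketch below illustrates how an instructed frame selector could sit in front of an existing Video-LLM at inference time. The selector.select and video_llm.generate interfaces are assumptions for illustration, not the released API.

def answer_with_grounding(frames, question, selector, video_llm, k=32):
    # Hypothetical plug-and-play flow: an instructed frame selector ranks
    # frames by relevance to the question, and only the selected frames are
    # passed to a frozen Video-LLM. `selector` and `video_llm` are assumed
    # stand-ins for the VideoITG model and any Video-LLM backend.
    indices = selector.select(frames, question, k=k)    # assumed interface
    chosen = [frames[i] for i in sorted(indices)]
    return video_llm.generate(chosen, question)         # assumed interface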

Dataset

Dataset Construction

Figure 2. Illustration of four instruction types and their corresponding frame selection strategies in VidThinker.

Model

Model Design

Figure 3. VideoITG model design: (A) Text generation aligns video and language tokens for sequential predictions. (B) Classification with causal attention utilizes anchor tokens for temporal cue management. (C) Classification with full attention facilitates interaction across visual and text tokens without anchors.
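
To make design (C) concrete, here is a minimal PyTorch sketch of a per-frame relevance classifier with full attention over concatenated visual and text tokens. The module structure and dimensions are illustrative assumptions, not the exact VideoITG architecture.

import torch
import torch.nn as nn

class FrameRelevanceHead(nn.Module):
    """Illustrative sketch of variant (C): full (bidirectional) attention over
    concatenated frame and instruction tokens, followed by a per-frame logit."""

    def __init__(self, dim: int = 1024, num_layers: int = 2, num_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        # TransformerEncoder applies full self-attention (no causal mask),
        # so visual and text tokens can attend to each other freely.
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(dim, 1)  # one relevance logit per token

    def forward(self, frame_tokens: torch.Tensor, text_tokens: torch.Tensor):
        # frame_tokens: (B, T, D), one pooled token per frame
        # text_tokens:  (B, L, D), embedded instruction tokens
        x = torch.cat([frame_tokens, text_tokens], dim=1)
        x = self.encoder(x)
        # Read the logits back off the frame positions only.
        logits = self.classifier(x[:, : frame_tokens.size(1)]).squeeze(-1)
        return logits  # (B, T): higher means more relevant to the instruction

Because the encoder applies no causal mask, each frame's logit can aggregate evidence from every instruction token and every other frame, which is the distinction from the causal variants (A) and (B).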

Experiments

Results

Figure 4. Frame selection comparison on VideoMME with LLaVA-Video-7B.
Table 2. Results with different selection methods.
Table 3. Extension to more Video-LLMs with different model sizes and sampled frames.
Table 4. Ablation studies on VideoITG-40K and model design.

Qualitative

Qualitative Results

Our VideoITG model effectively searches for temporal cues in long videos, enabling the Video-LLM to answer questions accurately.

Example 1. Qualitative comparison between uniform sampling (Uni) and VideoITG (Ours).
Example 2. Qualitative comparison between uniform sampling (Uni) and VideoITG (Ours).

Citation

BibTeX

@misc{wang2025videoitgmultimodalvideounderstanding,
  title={VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding},
  author={Shihao Wang and Guo Chen and De-An Huang and Zhiqi Li and Minghan Li and Guilin Liu and Jose M. Alvarez and Lei Zhang and Zhiding Yu},
  year={2025},
  eprint={2507.13353},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2507.13353},
}