VideoZoomer: Reinforcement-Learned Temporal Focusing for Long Video Reasoning

Reading time: 5 minutes
...

📝 Original Info

  • Title: VideoZoomer: Reinforcement-Learned Temporal Focusing for Long Video Reasoning
  • ArXiv ID: 2512.22315
  • Date: 2025-12-26
  • Authors: Yang Ding, Yizhen Zhang, Xin Lai, Ruihang Chu, Yujiu Yang

📝 Abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable progress in vision-language tasks yet remain limited in long video understanding due to their limited context windows. Consequently, prevailing approaches tend to rely on uniform frame sampling or static pre-selection, which may overlook critical evidence and cannot correct initial selection errors during the reasoning process. To overcome these limitations, we propose VideoZoomer, a novel agentic framework that enables MLLMs to dynamically control their visual focus during reasoning. Starting from a coarse low-frame-rate overview, VideoZoomer invokes a temporal zoom tool to obtain high-frame-rate clips at autonomously chosen moments, thereby progressively gathering fine-grained evidence in a multi-turn interactive manner. Accordingly, we adopt a two-stage training strategy: a cold-start supervised fine-tuning phase on a curated dataset of distilled exemplar and reflection trajectories, followed by reinforcement learning to further refine the agentic policy. Extensive experiments demonstrate that our 7B model delivers diverse and complex reasoning patterns, yielding strong performance across a broad set of long video understanding and reasoning benchmarks. These emergent capabilities allow it to consistently surpass existing open-source models and even rival proprietary systems on challenging tasks, while achieving superior efficiency under reduced frame budgets.
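To make the "temporal zoom tool" concrete, the sketch below shows one way such a tool could look: given a time window chosen by the model and a target frame rate, it decodes the dense frames for just that window. This is a minimal illustration only; the function name, signature, and the use of OpenCV are assumptions, not the authors' released implementation.

```python
from typing import List

import cv2
import numpy as np


def temporal_zoom(video_path: str, start_s: float, end_s: float,
                  fps: float = 4.0) -> List[np.ndarray]:
    """Decode frames from the window [start_s, end_s] at `fps` frames per second."""
    cap = cv2.VideoCapture(video_path)
    frames: List[np.ndarray] = []
    t = start_s
    while t <= end_s:
        cap.set(cv2.CAP_PROP_POS_MSEC, t * 1000.0)  # seek to timestamp (milliseconds)
        ok, frame = cap.read()
        if not ok:  # past the end of the video
            break
        frames.append(frame)
        t += 1.0 / fps
    cap.release()
    return frames
```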

💡 Deep Analysis

📄 Full Content

VIDEOZOOMER: REINFORCEMENT-LEARNED TEMPORAL FOCUSING FOR LONG VIDEO REASONING

Yang Ding1∗, Yizhen Zhang1∗, Xin Lai2†, Ruihang Chu1, Yujiu Yang1‡
1Tsinghua University  2The Chinese University of Hong Kong
∗Equal contribution. †Project leader. ‡Corresponding authors.

ABSTRACT

Multimodal Large Language Models (MLLMs) have achieved remarkable progress in vision-language tasks yet remain limited in long video understanding due to their limited context windows. Consequently, prevailing approaches tend to rely on uniform frame sampling or static pre-selection, which may overlook critical evidence and cannot correct initial selection errors during the reasoning process. To overcome these limitations, we propose VideoZoomer, a novel agentic framework that enables MLLMs to dynamically control their visual focus during reasoning. Starting from a coarse low-frame-rate overview, VideoZoomer invokes a temporal zoom tool to obtain high-frame-rate clips at autonomously chosen moments, thereby progressively gathering fine-grained evidence in a multi-turn interactive manner. Accordingly, we adopt a two-stage training strategy: a cold-start supervised fine-tuning phase on a curated dataset of distilled exemplar and reflection trajectories, followed by reinforcement learning to further refine the agentic policy. Extensive experiments demonstrate that our 7B model delivers diverse and complex reasoning patterns, yielding strong performance across a broad set of long video understanding and reasoning benchmarks. These emergent capabilities allow it to consistently surpass existing open-source models and even rival proprietary systems on challenging tasks, while achieving superior efficiency under reduced frame budgets. The code is available at https://github.com/zsgvivo/VideoZoomer.

1 INTRODUCTION

With a clear task in mind, humans can efficiently navigate long and complex visual streams by dynamically allocating attention, selectively identifying salient events such as decisive actions in a sports match or key explanations in a lengthy lecture, while filtering out redundancy. This goal-directed ability underlies effective and efficient visual reasoning, as widely documented in cognitive science (Kietzmann et al., 2018), yet remains difficult to achieve in artificial intelligence. Although MLLMs perform strongly on image (Bai et al., 2025; Chen et al., 2024) and short-video tasks (Zhang et al., 2023), they remain constrained in long-video comprehension tasks, mainly due to their limited context window (OpenAI, 2024; Reid et al., 2024).

The most common strategy to address this challenge is uniform frame sampling (Zhang et al., 2024b;c), which selects frames at fixed intervals (e.g., two frames per second) to construct a subset that fits within the context window. Nevertheless, this strategy is inherently limited, as it assumes all moments are equally important and risks overlooking short but critical events while allocating context budget to redundant segments. To address these limitations, prior work has investigated adaptive frame selection (Yu et al., 2024; Hu et al., 2025a; Tang et al., 2025), where a lightweight selector module, conditioned on the text query, identifies salient frames before reasoning. While improving over uniform sampling, these methods have two drawbacks. First, they are still inefficient, because they are designed to select a fixed number of frames regardless of the problem's complexity. Second, the design remains static and non-interactive: if the initial choice is suboptimal or misses key details, the model has no mechanism to correct the error or revisit the video. This fundamentally limits performance on complex tasks that require iterative evidence gathering.

[Figure 1: Left: Conceptual comparison of three long video reasoning frameworks: (a) uniform sampling, (b) with a frame selector, and (c) VideoZoomer (Ours). Right: Performance comparison of VideoZoomer against various baseline models under different frame budgets on LSDBench.]

To overcome the rigidity and inefficiency of prior methods, we propose VideoZoomer, a novel framework that empowers an MLLM to autonomously and dynamically control its visual focus during its reasoning process. As illustrated in Figure 1 (Left), instead of being a passive recipient of pre-selected frames, our model acts as an active agent. This yields two primary advantages: (i) It is highly efficient: the agent begins with a coarse overview at low frame rates, only consuming a significant context budget when it decides to invoke a
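The agent behavior described above (start from a sparse overview, invoke a zoom where needed, repeat over multiple turns until answering) can be pictured as a simple control loop like the sketch below. The `<zoom>start,end</zoom>` and `<answer>...</answer>` tag protocol, the `mllm_generate` callable, and all parameter names are illustrative assumptions; the excerpt does not specify the exact tool-call format.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical tag formats for the model's tool call and final answer.
ZOOM_RE = re.compile(r"<zoom>\s*([\d.]+)\s*,\s*([\d.]+)\s*</zoom>")
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.S)


def run_episode(question: str, video_path: str, duration_s: float,
                mllm_generate: Callable, temporal_zoom: Callable,
                overview_fps: float = 0.1, zoom_fps: float = 4.0,
                max_turns: int = 6) -> str:
    """Multi-turn loop: coarse overview first, then model-chosen zooms, until an answer."""
    # Turn 0: a sparse, low-frame-rate overview of the whole video.
    clips: List[Tuple] = [((0.0, duration_s),
                           temporal_zoom(video_path, 0.0, duration_s, fps=overview_fps))]
    transcript = question
    for _ in range(max_turns):
        response = mllm_generate(transcript, clips)  # model reasons over text + frames
        answer = ANSWER_RE.search(response)
        if answer:
            return answer.group(1).strip()
        zoom = ZOOM_RE.search(response)
        if zoom is None:
            break  # neither a tool call nor an answer: give up
        start_s, end_s = float(zoom.group(1)), float(zoom.group(2))
        # Tool call: fetch a high-frame-rate clip for the chosen window only,
        # so dense frames are paid for only where the model asks for them.
        clips.append(((start_s, end_s),
                      temporal_zoom(video_path, start_s, end_s, fps=zoom_fps)))
        transcript += "\n" + response
    return "unanswered"
```

Stopping either at an `<answer>` tag or after `max_turns` keeps the frame budget bounded, which matches the efficiency argument made in the introduction.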

Reference

This content is AI-processed based on open access ArXiv data.
