📝 Original Info
- Title: VideoZoomer: Reinforcement-Learned Temporal Focusing for Long Video Reasoning
- ArXiv ID: 2512.22315
- Date: 2025-12-26
- Authors: Yang Ding, Yizhen Zhang, Xin Lai, Ruihang Chu, Yujiu Yang
📝 Abstract
Multimodal Large Language Models (MLLMs) have achieved remarkable progress in vision-language tasks, yet they remain limited in long video understanding due to their restricted context window. Consequently, prevailing approaches tend to rely on uniform frame sampling or static pre-selection, which may overlook critical evidence and cannot correct initial selection errors during the reasoning process. To overcome these limitations, we propose VideoZoomer, a novel agentic framework that enables MLLMs to dynamically control their visual focus during reasoning. Starting from a coarse low-frame-rate overview, VideoZoomer invokes a temporal zoom tool to obtain high-frame-rate clips at autonomously chosen moments, thereby progressively gathering fine-grained evidence in a multi-turn interactive manner. Accordingly, we adopt a two-stage training strategy: a cold-start supervised fine-tuning phase on a curated dataset of distilled exemplar and reflection trajectories, followed by reinforcement learning to further refine the agentic policy. Extensive experiments demonstrate that our 7B model delivers diverse and complex reasoning patterns, yielding strong performance across a broad set of long video understanding and reasoning benchmarks. These emergent capabilities allow it to consistently surpass existing open-source models and even rival proprietary systems on challenging tasks, while achieving superior efficiency under reduced frame budgets.
💡 Deep Analysis
📄 Full Content
VIDEOZOOMER: REINFORCEMENT-LEARNED
TEMPORAL FOCUSING FOR LONG VIDEO REASONING
Yang Ding1∗ Yizhen Zhang1∗
Xin Lai2† Ruihang Chu1 Yujiu Yang1‡
1Tsinghua University
2The Chinese University of Hong Kong
ABSTRACT
Multimodal Large Language Models (MLLMs) have achieved remarkable progress in vision-language tasks, yet they remain limited in long video understanding due to their restricted context window. Consequently, prevailing approaches tend to rely on uniform frame sampling or static pre-selection, which may overlook critical evidence and cannot correct initial selection errors during the reasoning process. To overcome these limitations, we propose VideoZoomer, a novel agentic framework that enables MLLMs to dynamically control their visual focus during reasoning. Starting from a coarse low-frame-rate overview, VideoZoomer invokes a temporal zoom tool to obtain high-frame-rate clips at autonomously chosen moments, thereby progressively gathering fine-grained evidence in a multi-turn interactive manner. Accordingly, we adopt a two-stage training strategy: a cold-start supervised fine-tuning phase on a curated dataset of distilled exemplar and reflection trajectories, followed by reinforcement learning to further refine the agentic policy. Extensive experiments demonstrate that our 7B model delivers diverse and complex reasoning patterns, yielding strong performance across a broad set of long video understanding and reasoning benchmarks. These emergent capabilities allow it to consistently surpass existing open-source models and even rival proprietary systems on challenging tasks, while achieving superior efficiency under reduced frame budgets. The code is available at https://github.com/zsgvivo/VideoZoomer.
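The abstract's central mechanism is the temporal zoom tool: from a coarse low-frame-rate overview, the model requests high-frame-rate clips at moments it chooses itself. As a rough illustration of what such a tool interface could look like, the sketch below decodes a chosen segment at a higher sampling rate. The function name, argument schema, fps default, and choice of decoder are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a "temporal zoom" style tool interface (assumed, not the
# paper's API): decode a short, agent-chosen segment at a higher frame rate.
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class Clip:
    start_s: float            # start time of the zoomed clip (seconds)
    end_s: float              # end time of the zoomed clip (seconds)
    frames: List[np.ndarray]  # decoded RGB frames at the higher frame rate


def temporal_zoom(video_path: str, start_s: float, end_s: float,
                  fps: float = 4.0) -> Clip:
    """Decode a segment of the video at a higher sampling rate.

    A VideoZoomer-style agent would call something like this after reading a
    coarse, low-fps overview, passing timestamps it selected itself.
    """
    import decord  # lazy import; any decoder (PyAV, OpenCV) would also work

    vr = decord.VideoReader(video_path)
    native_fps = vr.get_avg_fps()
    # Map the requested time window to frame indices at the requested rate.
    times = np.arange(start_s, end_s, 1.0 / fps)
    idx = np.clip((times * native_fps).astype(int), 0, len(vr) - 1)
    frames = [vr[i].asnumpy() for i in idx]
    return Clip(start_s=start_s, end_s=end_s, frames=frames)
```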
1 INTRODUCTION
With a clear task in mind, humans can efficiently navigate long and complex visual streams by dynamically allocating attention, selectively identifying salient events such as decisive actions in a sports match or key explanations in a lengthy lecture, while filtering out redundancy. This goal-directed ability, which underlies effective and efficient visual reasoning and is widely documented in cognitive science (Kietzmann et al., 2018), remains difficult to achieve in artificial intelligence. Although MLLMs perform strongly on image (Bai et al., 2025; Chen et al., 2024) and short-video tasks (Zhang et al., 2023), they remain constrained in long-video comprehension, mainly due to their limited context window (OpenAI, 2024; Reid et al., 2024).
The most common strategy to address this challenge is uniform frame sampling (Zhang et al., 2024b;c), which selects frames at fixed intervals (e.g., two frames per second) to construct a subset that fits within the context window. Nevertheless, this strategy is inherently limited: it assumes all moments are equally important, and it risks overlooking short but critical events while allocating context budget to redundant segments. To address these limitations, prior work has investigated adaptive frame selection (Yu et al., 2024; Hu et al., 2025a; Tang et al., 2025).
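For concreteness, here is a minimal sketch of the uniform sampling baseline described above: sample at a fixed rate, then thin uniformly to fit a frame budget. The fps and budget values are illustrative defaults, not settings reported in the paper.

```python
# Minimal sketch of uniform frame sampling under a context budget.
# The fps and budget values are illustrative, not the paper's settings.
import numpy as np


def uniform_sample_indices(duration_s: float, native_fps: float,
                           sample_fps: float = 2.0,
                           max_frames: int = 256) -> np.ndarray:
    """Pick frame indices at a fixed rate, then thin them to fit the budget.

    Every moment is treated as equally important, which is exactly the
    assumption the paper argues against: short but decisive events get the
    same coverage as long redundant segments.
    """
    total_frames = int(duration_s * native_fps)
    # Fixed-interval sampling (e.g., two frames per second).
    step = max(int(native_fps / sample_fps), 1)
    idx = np.arange(0, total_frames, step)
    # If the fixed-rate subset still exceeds the budget, thin it uniformly.
    if len(idx) > max_frames:
        idx = idx[np.linspace(0, len(idx) - 1, max_frames).astype(int)]
    return idx


# Example: a 1-hour video at 30 fps yields 7,200 candidates at 2 fps,
# which uniform thinning reduces to the 256-frame budget.
print(len(uniform_sample_indices(3600, 30.0)))  # -> 256
```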
∗Equal contribution.
†Project leader.
‡Corresponding authors.
[Figure 1 graphic: (a) uniform sampling, (b) frame selector, (c) ours; right panel plots accuracy (%) vs. number of input frames on LSDBench for Gemini-2.0-Flash, InternVideo2.5, LongVA, LongVila, Qwen2-VL, Qwen2.5-VL, Qwen2.5-VL (RHS), and Ours.]
Figure 1: Left: Conceptual comparison of three long video reasoning frameworks: (a) uniform
sampling, (b) with frame selector, and (c) VideoZoomer (Ours). Right: Performance comparison of
VideoZoomer against various baseline models under different frame budgets on LSDBench.
In these approaches, a lightweight selector module, conditioned on the text query, identifies salient frames before reasoning. While improving over uniform sampling, such methods remain limited in two ways. First, they are inefficient, because they are designed to select a fixed number of frames regardless of the problem's complexity. Second, the design remains static and non-interactive: if the initial choice is suboptimal or misses key details, the model has no mechanism to correct the error or revisit the video. This fundamentally limits performance on complex tasks that require iterative evidence gathering.
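The static, one-shot selection pattern being criticized can be made concrete with a generic stand-in: score every candidate frame against the query and keep a fixed top-k. The cited selectors differ in architecture and training; this sketch only illustrates the selection pattern, and the names and the value of k are assumptions.

```python
# Generic stand-in for the "frame selector" family the paper critiques:
# score each candidate frame against the text query and keep a fixed top-k.
import numpy as np


def select_top_k_frames(frame_embeds: np.ndarray,   # (N, D) frame features
                        query_embed: np.ndarray,    # (D,) text query feature
                        k: int = 32) -> np.ndarray:
    """Return indices of the k frames most similar to the query.

    The budget k is fixed up front regardless of question difficulty, and
    the choice is never revisited once reasoning starts -- the two
    limitations highlighted in the paragraph above.
    """
    # Cosine similarity between the query and every frame.
    f = frame_embeds / np.linalg.norm(frame_embeds, axis=1, keepdims=True)
    q = query_embed / np.linalg.norm(query_embed)
    scores = f @ q
    top_k = np.argsort(-scores)[:k]
    return np.sort(top_k)  # keep temporal order for the downstream MLLM
```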
To overcome the rigidity and inefficiency of prior methods, we propose VideoZoomer, a novel framework that empowers an MLLM to autonomously and dynamically control its visual focus during its reasoning process. As illustrated in Figure 1 (Left), instead of being a passive recipient of pre-selected frames, our model acts as an active agent.
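A minimal sketch of this active-agent interaction pattern is shown below: the model sees a coarse overview, may emit a zoom request, receives the high-frame-rate clip as a new observation, and iterates until it answers. The tool-call format, function names, and turn limit are illustrative assumptions; the paper's actual interface and training protocol are not specified in this excerpt.

```python
# Hedged sketch of the multi-turn "active agent" loop described here: start
# from a coarse low-fps overview, let the model request temporal zooms, and
# stop when it commits to an answer. All names and the tool-call syntax are
# illustrative assumptions, not the paper's API.
import re
from typing import List


def zoom_tool(video_path: str, start_s: float, end_s: float) -> List:
    """Placeholder for a high-fps clip decoder (see the earlier sketch)."""
    raise NotImplementedError


def call_mllm(messages: List[dict]) -> str:
    """Placeholder for one MLLM generation step on the dialogue so far."""
    raise NotImplementedError


def answer_with_zooming(video_path: str, overview_frames: List,
                        question: str, max_turns: int = 6) -> str:
    messages = [{"role": "user", "content": [overview_frames, question]}]
    for _ in range(max_turns):
        reply = call_mllm(messages)
        messages.append({"role": "assistant", "content": reply})
        # Assume the model emits a tool call like: <zoom start=12.0 end=18.5>
        m = re.search(r"<zoom start=([\d.]+) end=([\d.]+)>", reply)
        if m is None:
            return reply  # no further zoom requested: treat as final answer
        clip = zoom_tool(video_path, float(m.group(1)), float(m.group(2)))
        # Feed the high-fps clip back as a new observation for the next turn.
        messages.append({"role": "user", "content": [clip]})
    return reply
```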
This yields two primary advantages: (i) It is highly efficient: the agent begins with a coarse, low-frame-rate overview and only consumes a significant context budget when it decides to invoke a