Factorized Learning for Temporally Grounded Video-Language Models

Reading time: 5 minutes
...

📝 Original Info

  • Title: Factorized Learning for Temporally Grounded Video-Language Models
  • ArXiv ID: 2512.24097
  • Date: 2025-12-30
  • Authors: Wenzheng Zeng, Difei Gao, Mike Zheng Shou, Hwee Tou Ng

📝 Abstract

[Figure 1 caption] (a) Performance: Our method outperforms SOTA methods across various tasks (here we draw the maximum performance across methods, detailed in Sec. 6). (b) Model: We propose a new framework D²VLM, where we decompose the generation objective into a "grounding then answering with evidence referencing" paradigm and introduce evidence tokens to emphasize explicit event-level visual semantic capture. (c) Training Algorithm: We introduce Factorized Preference Optimization (FPO) that explicitly addresses both temporal grounding and textual response. A factorized data synthesis approach is also designed to support FPO.

💡 Deep Analysis

📄 Full Content

Recent advances in video-language models, especially those built upon large language models (video LLMs), have enabled remarkable progress in video understanding [10,21,23,25,31,44]. Through their flexible video-and-text-in, text-out nature, video LLMs demonstrate great potential as general-purpose solvers, unifying various tasks (e.g., temporal grounding [9], dense captioning [47], and question answering [7]) into one generic "video question answering" framework. Despite their effectiveness, existing video LLMs still struggle with accurate temporal grounding for event perception and localization [10,11,23,31], a crucial capability not only for grounding tasks themselves but also for related tasks that require textual answering.

We notice that for video understanding, temporal event-level grounding and textual response are two primary tasks that exhibit distinct characteristics yet maintain strong logical dependencies. Specifically, temporal grounding focuses on precisely locating the temporal events (evidence) that support answering, while textual response emphasizes accurate interpretation of the grounded evidence and generation of coherent textual answers. However, existing methods [4,13,14,23,31] typically handle these two tasks in a coupled manner, with two major limitations: (1) Various special tokens are designed for temporal grounding [4,14,23], but their generation is mixed with text token generation without a clear logical structure, leading to coupled learning objectives. (2) More importantly, these special tokens mainly focus on timestamp representation for precise timestamp output, lacking explicit capture of the visual semantics of grounded events. In contrast, we argue that such event-level visual semantics should not be overlooked, as they can inherently serve as crucial context for subsequent textual answer generation, especially under the next-token prediction paradigm.

Based on the aforementioned observations, we propose to address this from a factorized learning perspective. We first introduce a new framework D²VLM that decouples the learning of temporal evidence grounding and textual answering, while preserving and even strengthening their inherent dependency. Specifically, as shown in Fig. 1 (b), we decompose the model response into two sequential stages: (1) pure temporal grounding that aims to localize and capture essential visual evidence for the response, followed by (2) interleaved text-evidence answer generation, where both the textual answer and temporal information are produced in an evidence-referencing manner to establish consistency with the previously grounded evidence.
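To make the two-stage decomposition concrete, the sketch below shows one way such a target sequence could be laid out. The `<evd_k>` token names, the bracketed timestamp format, and the surrounding phrasing are hypothetical placeholders for illustration, not the paper's actual output schema.

```python
# Illustrative sketch (not the authors' exact schema): laying out a target
# response as "grounding then answering with evidence referencing".

def build_target_sequence(events, answer_text):
    """Compose the two-stage target: (1) pure temporal grounding, then
    (2) an interleaved text-evidence answer referencing the same evidence."""
    # Stage 1: grounding only -- one evidence token per event, each tied to
    # a temporal span (start_s, end_s).
    grounding = " ".join(
        f"<evd_{k}> [{start:.1f}s-{end:.1f}s]"
        for k, (start, end) in enumerate(events)
    )
    # Stage 2: textual answer that re-emits (references) the same evidence
    # tokens, keeping the final response consistent with stage 1.
    referencing = " ".join(f"<evd_{k}>" for k in range(len(events)))
    answering = f"{answer_text} The relevant event happens in {referencing}."
    return f"{grounding} {answering}"

# Example: one grounded event supporting the answer "Small bag."
print(build_target_sequence([(12.3, 15.6)], "Small bag."))
# -> "<evd_0> [12.3s-15.6s] Small bag. The relevant event happens in <evd_0>."
```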

Technically, we introduce evidence tokens, a special token type dedicated to temporal evidence grounding. Different from existing designs of grounding tokens that focus on their special category and timestamp representation, our evidence tokens not only aim to determine the temporal location of the grounded event, but also emphasize the capture of event-level visual semantics, which serves as crucial context for subsequent answer generation. During the subsequent interleaved text-evidence generation, the evidence-referencing process is achieved by generating evidence tokens that align with those from the previous grounding stage. This ensures that the output information in the final response remains consistent with the initially grounded evidence while reinforcing logical coherence across stages.
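As one plausible reading of how an evidence token could carry event-level visual semantics beyond a timestamp, the minimal sketch below pools frame features inside the grounded span and fuses them into the token's hidden state. The module name `EvidenceTokenFusion`, the mean-pooling choice, and the linear fusion are assumptions made for exposition; the excerpt does not specify the actual mechanism.

```python
import torch
import torch.nn as nn

class EvidenceTokenFusion(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        # Fuse the evidence token's hidden state with pooled event semantics.
        self.proj = nn.Linear(2 * d_model, d_model)

    def forward(self, evd_hidden, frame_feats, frame_times, span):
        """
        evd_hidden:  (d_model,)   hidden state of the evidence token
        frame_feats: (T, d_model) per-frame visual features
        frame_times: (T,)         timestamp of each frame in seconds
        span:        (start_s, end_s) grounded temporal span for the event
        """
        start_s, end_s = span
        mask = (frame_times >= start_s) & (frame_times <= end_s)
        if mask.any():
            event_sem = frame_feats[mask].mean(dim=0)   # event-level pooling
        else:
            event_sem = frame_feats.mean(dim=0)         # fallback: whole video
        return self.proj(torch.cat([evd_hidden, event_sem], dim=-1))

# Toy usage: 8 frames sampled every 5 s, event grounded at 12.3s-15.6s.
d = 16
fusion = EvidenceTokenFusion(d)
out = fusion(torch.randn(d), torch.randn(8, d),
             torch.arange(8, dtype=torch.float32) * 5.0, (12.3, 15.6))
print(out.shape)  # torch.Size([16])
```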

Besides providing decoupled and clearer task objectives, our design naturally fits well with the teacher-forcing autoregressive training paradigm, as subsequent textual response generation is conditioned on the correctly grounded evidence, enabling a learning shortcut for more stable training. Our experiments demonstrate that both the designed sequence generation objective and event-level visual semantic capture are essential for performance improvement, offering valuable insights for future model design.

To further facilitate the learning of these two tasks, we propose a novel Factorized Preference Optimization (FPO) algorithm. Unlike standard preference optimization, FPO incorporates probabilistic temporal grounding modeling into the optimization objective, enabling explicit preference learning for temporal grounding in addition to standard textual preference. Meanwhile, another obstacle is that existing video preference datasets do not account for such temporal grounding aspects, making it infeasible to directly apply our proposed FPO algorithm. To overcome this limitation, we construct a synthetic dataset by introducing factorized perturbations into the original preferred response sequence. These perturbations are applied at the sub-video event level, considering two main factors: temporal grounding and textual response, along with multiple possible sub-factors. This approach ensures a structured and controllable noise generation process, where the cause and type of noise are precisely known without any manual annotation.
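The excerpt states that FPO adds explicit preference learning for temporal grounding on top of standard textual preference, but does not give the objective. The sketch below is therefore only an illustration assuming a DPO-style formulation: the separation into text and grounding log-ratio terms, the grounding likelihood, and the weight `alpha` are assumptions, not the paper's actual loss.

```python
import torch
import torch.nn.functional as F

def factorized_preference_loss(
    text_logratio_w, text_logratio_l,      # policy/ref log-prob ratios over text tokens
    ground_logratio_w, ground_logratio_l,  # same ratios over grounding (evidence/timestamp) terms
    beta: float = 0.1, alpha: float = 1.0,
):
    """Preferred (w) vs. dispreferred (l) responses; each log-ratio is
    log pi_theta(y|x) - log pi_ref(y|x) restricted to one factor."""
    text_term = -F.logsigmoid(beta * (text_logratio_w - text_logratio_l))
    ground_term = -F.logsigmoid(beta * (ground_logratio_w - ground_logratio_l))
    # Factorized objective: textual preference plus a weighted grounding preference.
    return (text_term + alpha * ground_term).mean()

# Toy call with batch size 2:
loss = factorized_preference_loss(
    torch.tensor([0.8, 0.5]), torch.tensor([0.1, -0.2]),
    torch.tensor([0.6, 0.3]), torch.tensor([-0.4, 0.0]),
)
print(float(loss))
```

In this reading, the factorized synthetic data (perturbing either the temporal span or the textual answer of a preferred response) would supply the dispreferred sequences whose log-ratios feed the corresponding term of the loss.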

Reference

This content is AI-processed based on open access ArXiv data.
