Understanding Temporal Logic Consistency in Video-Language Models through Cross-Modal Attention Discriminability

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Large language models (LLMs) often generate self-contradictory outputs, which severely impacts their reliability and hinders their adoption in practical applications. In video-language models (Video-LLMs), this phenomenon has recently drawn researchers' attention. Specifically, these models fail to provide logically consistent responses to rephrased questions based on their grounding outputs. However, the underlying causes of this phenomenon remain underexplored. In this work, we adopt an interpretability-driven approach to analyze, statistically summarize, and intervene on the potential factors behind the phenomenon. We find that one of the primary reasons for the inconsistency in responses lies in the inability of cross-modal attention heads to effectively distinguish video tokens across different timestamps. To address this, we propose an attention enhancement method called Temporally Conditioned Attention Sharpening (TCAS), which constructs an enhancement objective based on attention distinctions to improve the model’s temporal resolution capability, thereby improving its logical consistency in temporal understanding. Experimental results demonstrate that our method significantly enhances the temporal logic consistency of Video-LLMs. Further analyses reveal that our method indeed improves the temporal discriminability of attention heads, validating our conclusions. Additionally, our method even achieves performance improvements in general video temporal grounding tasks, suggesting that temporal logic consistency is an important factor in temporal understanding.


💡 Research Summary

This paper investigates why video‑language models (Video‑LLMs) often produce contradictory answers when the same question is rephrased or temporally shifted — a failure of what the authors term “temporal logic consistency.” By conducting an interpretability‑driven analysis on the TimeChat‑7B model, they first identify a small subset of cross‑modal attention heads that receive disproportionately high attention from textual tokens. These heads are defined using a “Cross‑Modal Score,” and their ability to focus on the correct temporal segment is quantified with an “Attention Discriminability Score.” Statistical analysis reveals a strong positive Pearson correlation (≈0.48, p < 10⁻⁴) between discriminability and both re‑phrased (c_rg) and shifted (c_sg) grounding consistency scores, suggesting that temporal discriminability directly influences logical consistency.
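The paper's exact formulas for the two scores are not reproduced in this summary, so the following is only an illustrative sketch under stated assumptions: `cross_modal_score` measures what fraction of a head's text-query attention mass lands on video tokens, and `discriminability_score` measures how peaked a head's timestamp-aggregated attention is (gap between the top timestamp and the average of the rest). Both function names and definitions are assumptions, not the authors' implementation.

```python
import numpy as np

def cross_modal_score(attn, video_idx):
    # Fraction of text-token attention mass on video tokens (assumed metric).
    # attn: (num_text_queries, num_keys) attention rows; video_idx: key indices
    # belonging to video tokens.
    attn = np.asarray(attn, dtype=float)
    return attn[:, video_idx].sum() / attn.sum()

def discriminability_score(attn_per_timestamp):
    # Peakedness of timestamp-aggregated attention: top timestamp's mass minus
    # the mean mass of all other timestamps (0 for a uniform distribution).
    a = np.asarray(attn_per_timestamp, dtype=float)
    a = a / a.sum()
    top = a.max()
    rest = (a.sum() - top) / (len(a) - 1)
    return top - rest
```

With per-head discriminability scores and per-head consistency scores in hand, a correlation like the reported ≈0.48 could be computed with `np.corrcoef(disc_scores, consistency_scores)[0, 1]`.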

To move beyond correlation, the authors perform a causal intervention: they blend the original attention distribution with a uniform distribution over ground‑truth timestamps, controlled by a mixing factor α. Moderate α values (0.2–0.6) improve consistency, while extreme α (≈1) degrades overall performance, indicating that a balanced enhancement is required.
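The intervention described above is a convex blend of two distributions, which can be sketched in a few lines. The function name and the mask representation are assumptions; the summary only specifies the blend itself: `(1 − α) · attn + α · uniform_over_gt`.

```python
import numpy as np

def intervene_attention(attn_row, gt_mask, alpha):
    # Blend one attention row with a uniform distribution over ground-truth
    # timestamps, controlled by mixing factor alpha in [0, 1].
    attn_row = np.asarray(attn_row, dtype=float)
    gt_mask = np.asarray(gt_mask, dtype=bool)
    uniform_gt = gt_mask / gt_mask.sum()  # uniform mass inside the GT window
    return (1 - alpha) * attn_row + alpha * uniform_gt
```

Because both inputs sum to one, the blended row is still a valid distribution for any α in [0, 1]; α = 0 recovers the original attention and α = 1 forces all mass into the ground-truth window, matching the degradation the authors observe at extreme α.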

Motivated by these findings, they propose Temporally Conditioned Attention Sharpening (TCAS), a loss that explicitly encourages the top‑t cross‑modal heads to allocate higher attention to tokens within the ground‑truth time window. TCAS computes the original attention, aggregates it by timestamps, selects tokens whose maximum temporal attention exceeds a threshold, and then minimizes a margin‑based loss between the original and a target uniform‑over‑ground‑truth attention. This loss is added to the standard training objective, guiding the model to sharpen its temporal focus without over‑fitting.
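The pipeline above (aggregate by timestamp, threshold, margin loss) can be sketched as follows. This is a hedged NumPy toy, not the authors' implementation: the hinge form `max(0, 1 − in_window_mass)`, the threshold `tau`, and all names are assumptions standing in for the paper's exact margin objective.

```python
import numpy as np

def aggregate_by_timestamp(attn_row, timestamp_ids):
    # Sum attention over video tokens that share the same timestamp id.
    ids = np.asarray(timestamp_ids)
    out = np.zeros(ids.max() + 1)
    np.add.at(out, ids, attn_row)  # unbuffered scatter-add per timestamp
    return out

def tcas_loss(attn_rows, timestamp_ids, gt_mask, tau=0.2):
    # Sketch of a TCAS-style objective: only rows whose peak temporal
    # attention exceeds tau are selected; each pays a hinge penalty for
    # attention mass falling outside the ground-truth timestamp window.
    gt = np.asarray(gt_mask, dtype=bool)
    losses = []
    for row in attn_rows:
        t = aggregate_by_timestamp(row, timestamp_ids)
        if t.max() < tau:          # token not temporally focused: skip
            continue
        in_window = t[gt].sum()
        losses.append(max(0.0, 1.0 - in_window))
    return float(np.mean(losses)) if losses else 0.0
```

In this toy form the loss is zero exactly when every selected row puts all of its temporal attention inside the ground-truth window, which is the "sharpening" direction the training objective pushes toward.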

Experiments on Charades‑CON (original, re‑phrased, and shifted grounding subsets) show that TCAS consistently raises mIoU by 1–3 percentage points and improves Recall@0.5/0.7 by 1–2 points across both TimeChat‑7B and Qwen2.5‑VL. The improvement is more pronounced for re‑phrased consistency, while shifted consistency requires stronger intervention, reflecting task difficulty. Moreover, TCAS also yields modest gains on standard video temporal grounding benchmarks, confirming that enhancing temporal logic consistency benefits broader temporal understanding.

The paper’s contributions are threefold: (1) revealing that a minority of cross‑modal attention heads are critical for temporal logic consistency, (2) introducing a quantitative discriminability metric and a causal intervention framework, and (3) proposing TCAS, a lightweight, architecture‑agnostic loss that effectively sharpens temporal attention. Limitations include the focus on attention heads without examining decoder or adapter components, and a lack of qualitative analysis of how altered attention aligns with human perception of video events. Future work should test TCAS across diverse multimodal architectures and explore interactions with other temporal encoding mechanisms such as positional embeddings.

