How does longer temporal context enhance multimodal narrative video processing in the brain?
Understanding how humans and artificial intelligence systems process complex narrative videos is a fundamental challenge at the intersection of neuroscience and machine learning. This study investigates how the temporal context length of video clips (3–12 s) and narrative-task prompting shape brain-model alignment during naturalistic movie watching. Using fMRI recordings from participants viewing full-length movies, we examine how brain regions sensitive to narrative context dynamically represent information over varying timescales and how these neural patterns align with model-derived features. We find that increasing clip duration substantially improves brain alignment for multimodal large language models (MLLMs), whereas unimodal video models show little to no gain. Further, shorter temporal windows align with perceptual and early language regions, while longer windows preferentially align with higher-order integrative regions, mirrored by a layer-to-cortex hierarchy in MLLMs. Finally, narrative-task prompts (multi-scene summary, narrative summary, character motivation, and event boundary detection) elicit task-specific, region-dependent brain alignment patterns and context-dependent shifts in clip-level tuning in higher-order regions. Together, our results position long-form narrative movies as a principled testbed for probing biologically relevant temporal integration and interpretable representations in long-context MLLMs.
💡 Research Summary
This paper investigates how the length of temporal context and narrative‑task prompting shape the alignment between brain activity and multimodal large language models (MLLMs) during naturalistic movie watching. Using fMRI recordings from four participants who viewed four full‑length films (Movie10 dataset), the authors extracted voxel‑wise responses mapped onto the Glasser 180‑region cortical parcellation and language‑specific parcels from the Fedorenko lab. They systematically varied the video clip duration (3 s, 6 s, 9 s, 12 s) using a sliding‑window approach with a 1.49 s stride, sampling 16 frames and synchronized audio per window. Each window was paired with one of four narrative‑task prompts: multi‑scene summary, narrative summary, character motivation, and event‑boundary detection.
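As a rough illustration of the sliding-window sampling described above, the sketch below generates window boundaries at a 1.49 s stride and 16 uniformly spaced frame timestamps per window; the helper names, the example movie duration, and the uniform frame spacing are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch of the sliding-window clip sampling (not the authors' code).
# Each window yields 16 frame timestamps and an audio segment [start, end]
# that are passed to the model together.
import numpy as np

STRIDE_S = 1.49    # stride between windows, in seconds (matches the fMRI TR)
N_FRAMES = 16      # frames sampled per window

def make_windows(movie_duration_s, window_s):
    """Yield (start, end) times for sliding windows with a 1.49 s stride."""
    starts = np.arange(0.0, movie_duration_s - window_s + 1e-6, STRIDE_S)
    return [(s, s + window_s) for s in starts]

def sample_frame_times(start, end, n_frames=N_FRAMES):
    """Uniformly spaced frame timestamps inside one window (an assumption)."""
    return np.linspace(start, end, n_frames, endpoint=False)

# Example: 12 s windows over a hypothetical 2-hour movie
windows = make_windows(movie_duration_s=7200.0, window_s=12.0)
frame_times = sample_frame_times(*windows[0])
```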
Two state‑of‑the‑art video‑audio MLLMs—Qwen‑2.5‑Omni and DATE—were used to generate contextual embeddings. For each window, hidden states from all 36 transformer layers were averaged across tokens to produce a compact layer‑wise representation. As baselines, two unimodal video models (TimeSformer and VideoMAE, each with 12 layers) were also evaluated under the same temporal windows.
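A minimal sketch of the token-averaging step, assuming the model exposes per-layer hidden states in the standard Hugging Face format of [batch, tokens, hidden_dim] tensors; the tensor sizes used below are placeholders, not the actual dimensions of the models in the paper.

```python
# Sketch of layer-wise feature extraction (illustrative, not the authors' code).
# Assumes `hidden_states` is the usual Hugging Face tuple of per-layer activations
# obtained by running the MLLM on one (frames, audio, prompt) window with
# output_hidden_states=True.
import torch

def layerwise_embeddings(hidden_states):
    """Average each layer's hidden states over tokens -> [n_layers, hidden_dim]."""
    # hidden_states[0] is the input embedding layer; keep the transformer layers.
    layers = hidden_states[1:]
    return torch.stack([h.mean(dim=1).squeeze(0) for h in layers], dim=0)

# Shape check with dummy activations (37 entries: embeddings + 36 layers;
# 128 tokens and hidden size 2048 are placeholder values).
dummy = tuple(torch.randn(1, 128, 2048) for _ in range(37))
feats = layerwise_embeddings(dummy)
assert feats.shape == (36, 2048)
```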
Brain‑model alignment was quantified by training linear encoding models that predict voxel activity from model embeddings, and cross‑subject prediction accuracy was estimated following Schrimpf et al. (2021) to account for intrinsic neural noise. The authors addressed four research questions: (RQ1) the effect of temporal context length on brain predictivity for MLLMs versus unimodal models; (RQ2) which cortical regions show the greatest gains with longer context and how these relate to model layers; (RQ3) how different narrative‑task prompts modulate region‑specific alignment; and (RQ4) which video clips are most predictive of voxel responses across contexts and tasks.
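The encoding analysis could look roughly like the sketch below, which fits a cross-validated ridge regression from window-level embeddings to voxel responses, scores it with a per-voxel Pearson correlation, and divides by a precomputed cross-subject ceiling; the specific estimator, alpha grid, and fold scheme are assumptions, not the paper's exact pipeline.

```python
# Illustrative linear encoding-model sketch (assumes ridge regression, Pearson r,
# and a precomputed per-voxel cross-subject noise ceiling).
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

def encoding_scores(X, Y, n_splits=5):
    """Predict voxel responses Y [n_TRs, n_voxels] from features X [n_TRs, n_dims]."""
    scores = np.zeros((n_splits, Y.shape[1]))
    for i, (tr, te) in enumerate(KFold(n_splits).split(X)):
        model = RidgeCV(alphas=np.logspace(-2, 4, 7)).fit(X[tr], Y[tr])
        pred = model.predict(X[te])
        # Pearson correlation per voxel between predicted and observed responses
        p = (pred - pred.mean(0)) / (pred.std(0) + 1e-8)
        y = (Y[te] - Y[te].mean(0)) / (Y[te].std(0) + 1e-8)
        scores[i] = (p * y).mean(0)
    return scores.mean(0)

def normalized_alignment(raw_r, noise_ceiling):
    """Divide raw predictivity by the per-voxel cross-subject ceiling."""
    return raw_r / np.clip(noise_ceiling, 1e-3, None)
```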
Key findings:
1. Increasing clip duration from 3 s to 12 s systematically improves brain predictivity for both MLLMs, with the most pronounced gains in higher‑order semantic regions such as posterior cingulate cortex (PCC) and medial prefrontal cortex (mPFC). Unimodal video models exhibit little or no improvement, indicating that multimodal integration is crucial for leveraging long temporal context.
2. A clear layer‑to‑cortex hierarchy emerges: early transformer layers (1‑4) align best with low‑level visual and auditory cortices (e.g., V1, posterior temporal lobe), middle layers (5‑12) correspond to early language areas, and higher layers (13‑36) align with high‑level integrative regions (default‑mode network). This mirrors known cortical processing hierarchies that operate on increasing timescales.
3. Narrative‑task prompts produce distinct region‑specific patterns. Narrative and multi‑scene summaries drive alignment in higher‑order language ROIs, character‑motivation prompts preferentially engage localized temporal‑language regions, and event‑boundary detection yields a more distributed pattern across perceptual transition zones.
4. Analysis of the most predictive clips reveals that visual ROIs are driven by low‑level visual features that are stable across context lengths, whereas higher‑order language ROIs become sensitive to the narrative coherence present in longer clips.
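As an illustration of how the layer-to-cortex hierarchy in finding 2 can be read off, the sketch below assigns each ROI the layer whose embeddings achieve the highest encoding score; the score matrix here is random placeholder data and the argmax rule is a simplification, not the authors' exact analysis.

```python
# Sketch: assign each ROI its best-aligned model layer (assumes a precomputed
# score matrix of shape [n_layers, n_rois]; illustrative only).
import numpy as np

def best_layer_per_roi(scores):
    """scores[l, r] = mean brain predictivity of layer l in ROI r."""
    return scores.argmax(axis=0) + 1  # 1-indexed layer numbers

# Example with random scores for 36 layers x 180 Glasser ROIs (placeholder data)
rng = np.random.default_rng(0)
hierarchy = best_layer_per_roi(rng.random((36, 180)))
# Projecting `hierarchy` onto the cortical surface shows whether early layers map
# to sensory regions and deeper layers to integrative, default-mode regions.
```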
Overall, the study demonstrates that (i) long temporal context substantially enhances brain‑MLLM alignment, (ii) the improvement follows a hierarchical mapping between model depth and cortical processing timescales, and (iii) task‑specific prompts can be used as functional probes to isolate distinct neural representations. These insights provide actionable guidance for designing future AI systems capable of long‑form video understanding and for using naturalistic movies as a principled testbed to study biologically relevant temporal integration in the human brain.