NarraScore: Bridging Visual Narrative and Musical Dynamics via Hierarchical Affective Control


Synthesizing coherent soundtracks for long-form videos remains a formidable challenge, currently stalled by three critical impediments: computational scalability, temporal coherence, and, most critically, a pervasive semantic blindness to evolving narrative logic. To bridge these gaps, we propose NarraScore, a hierarchical framework predicated on the core insight that emotion serves as a high-density compression of narrative logic. Uniquely, we repurpose frozen Vision-Language Models (VLMs) as continuous affective sensors, distilling high-dimensional visual streams into dense, narrative-aware Valence-Arousal trajectories. Mechanistically, NarraScore employs a Dual-Branch Injection strategy to reconcile global structure with local dynamism: a *Global Semantic Anchor* ensures stylistic stability, while a surgical *Token-Level Affective Adapter* modulates local tension via direct element-wise residual injection. This minimalist design bypasses the bottlenecks of dense attention and architectural cloning, effectively mitigating the overfitting risks associated with data scarcity. Experiments demonstrate that NarraScore achieves state-of-the-art consistency and narrative alignment with negligible computational overhead, establishing a fully autonomous paradigm for long-video soundtrack generation.


💡 Research Summary

NarraScore tackles the long‑standing challenge of generating coherent soundtracks for minute‑scale videos, where existing methods either collapse under quadratic memory costs or suffer from “semantic blindness” – an inability to capture the evolving narrative logic of a story. The authors’ key insight is to treat emotion, represented as a continuous Valence‑Arousal (VA) curve, as a high‑density compression of narrative tension and resolution. By repurposing frozen Vision‑Language Models (VLMs) such as CLIP as affective sensors, NarraScore extracts a dense, temporally aligned VA trajectory directly from raw pixel streams without any external classifiers or costly annotations.

The system first discretizes the video at 1 Hz and interleaves each frame with a textual time‑stamp (“Time: t s”). An “Instruction‑Driven Semantic Primer” – a short natural‑language prompt placed at the beginning of the sequence – nudges the frozen VLM to activate its high‑level reasoning pathways rather than low‑level object detection. A lightweight probing head (spatial average pooling followed by a two‑layer MLP) then regresses each frame’s hidden representation onto the VA plane, producing a smooth affective curve while keeping the VLM parameters frozen.
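The probing head described above can be sketched as follows. This is a minimal illustration under assumed shapes and names (the paper's actual dimensions, activations, and code are not given): per-frame patch features from the frozen VLM are spatially average-pooled, then a two-layer MLP maps each frame to a (valence, arousal) point, yielding one VA sample per 1 Hz frame.

```python
import numpy as np

# Hypothetical sketch of the affective probing head. All names, shapes,
# and the tanh activations are illustrative assumptions; only the VLM's
# frozen features are used, and only this small head would be trained.

def probe_va(frame_hidden, w1, b1, w2, b2):
    """frame_hidden: (num_patches, hidden_dim) VLM features for one frame."""
    pooled = frame_hidden.mean(axis=0)      # spatial average pooling
    h = np.tanh(pooled @ w1 + b1)           # first MLP layer
    return np.tanh(h @ w2 + b2)             # (valence, arousal) in (-1, 1)

rng = np.random.default_rng(0)
hidden_dim, mlp_dim = 768, 128              # assumed dimensions
w1 = rng.normal(0, 0.02, (hidden_dim, mlp_dim)); b1 = np.zeros(mlp_dim)
w2 = rng.normal(0, 0.02, (mlp_dim, 2));          b2 = np.zeros(2)

# One VA point per 1 Hz frame -> a smooth trajectory over the video
frames = [rng.normal(size=(196, hidden_dim)) for _ in range(5)]
trajectory = np.stack([probe_va(f, w1, b1, w2, b2) for f in frames])
print(trajectory.shape)  # (5, 2): five seconds of (valence, arousal)
```

Because the VLM stays frozen, only `w1`, `b1`, `w2`, `b2` would receive gradients during training.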

For music generation, NarraScore adopts a dual‑branch conditioning strategy. The Global Semantic Anchor (GSA) encodes the overall genre, mood, and atmospheric context of the entire video into a fixed‑length vector that seeds the music decoder (a pre‑trained MusicGen‑style transformer) and stabilizes the global style. Concurrently, the Token‑Level Affective Adapter (TLAA) injects the frame‑wise VA values as element‑wise additive biases directly into the decoder’s hidden states at each token step. This design sidesteps the need for dense frame‑level attention, dramatically reducing memory consumption and training time while still allowing fine‑grained control of musical tension.
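The TLAA's element-wise residual injection can be illustrated with a small sketch. The function name, shapes, and the linear VA projection are assumptions for illustration, not the paper's actual API: the frame-wise VA values, aligned to decoder tokens, are projected to the hidden width and simply added to the decoder's hidden states.

```python
import numpy as np

# Illustrative sketch of the Token-Level Affective Adapter (assumed
# shapes and names). The VA signal is projected to the decoder's hidden
# width and injected as an element-wise additive residual bias, with no
# cross-attention and no extra tokens.

def inject_affect(hidden, va_tokens, w_va):
    """hidden: (T, d) decoder states; va_tokens: (T, 2) VA per token."""
    bias = va_tokens @ w_va     # project (valence, arousal) to width d
    return hidden + bias        # element-wise additive residual

rng = np.random.default_rng(1)
T, d = 8, 64
hidden = rng.normal(size=(T, d))
va_tokens = rng.uniform(-1, 1, size=(T, 2))   # VA curve, token-aligned
w_va = rng.normal(0, 0.02, size=(2, d))       # the only learned part here

modulated = inject_affect(hidden, va_tokens, w_va)
print(modulated.shape)  # (8, 64): same shape as the input states
```

The memory savings claimed in the text follow from this design: the injection is O(T·d) in both compute and memory, versus the quadratic cost of attending over dense frame-level features.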

Training is highly parameter-efficient: only the probing head and the TLAA are learned, using a hybrid L1/L2 loss on the VA predictions. The music decoder remains untouched, preserving its rich generative priors. Experiments on multi-minute video clips demonstrate that NarraScore outperforms recent video-to-music baselines in both temporal consistency and narrative alignment, as judged by human raters and automated CLIP-Audio similarity metrics. Memory usage drops by more than 30 %, and inference runs at near-real-time speed. Ablation studies confirm that removing either the GSA or the TLAA degrades style stability or tension responsiveness, respectively, and that the VLM-based affective probing is essential for accurate VA estimation.
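One plausible reading of the "hybrid L1/L2 loss" is a weighted sum of mean absolute error (robust to occasional badly-annotated frames) and mean squared error (smooth gradients near the target); the summary does not specify the exact combination, so the weighting below is an assumption for illustration.

```python
import numpy as np

# Hypothetical hybrid L1/L2 loss on VA predictions: a weighted sum of
# MAE and MSE. The 0.5/0.5 weighting is an illustrative assumption.

def hybrid_va_loss(pred, target, alpha=0.5):
    l1 = np.abs(pred - target).mean()        # mean absolute error
    l2 = ((pred - target) ** 2).mean()       # mean squared error
    return alpha * l1 + (1 - alpha) * l2

pred = np.array([[0.2, -0.1], [0.5, 0.4]])   # predicted (valence, arousal)
target = np.array([[0.0, 0.0], [0.5, 0.5]])  # reference VA annotations
print(round(hybrid_va_loss(pred, target), 4))  # 0.0575
```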

Limitations include reliance on the VLM’s zero‑shot affective reasoning, which may falter on culturally specific or visually unconventional content, and the restriction to a two‑dimensional VA space that cannot fully encode complex plot structures such as twists or multi‑threaded narratives. Future work could integrate additional modalities (script text, subtitles, audio cues) into the semantic primer, expand the affective representation to include higher‑level narrative markers (e.g., “climax”, “resolution”), and explore user‑controllable prompts for interactive soundtrack editing.

In summary, NarraScore introduces a novel affective‑semantic bridge that leverages frozen multimodal models to derive continuous emotional arcs, and couples this with a minimalist dual‑branch injection mechanism to achieve scalable, narratively aligned music generation for long‑form video. The approach sets a new baseline for autonomous cinematic soundtrack synthesis and opens promising avenues for multimodal, controllable content creation.

