Video Soundtrack Generation by Aligning Emotions and Temporal Boundaries


Providing soundtracks for videos remains a costly and time-consuming challenge for multimedia content creators. We introduce EMSYNC, an automatic video-based symbolic music generator that creates music aligned with a video’s emotional content and temporal boundaries. It follows a two-stage framework, where a pretrained video emotion classifier extracts emotional features, and a conditional music generator produces MIDI sequences guided by both emotional and temporal cues. We introduce boundary offsets, a novel temporal conditioning mechanism that enables the model to anticipate upcoming video scene cuts and align generated musical chords with them. We also propose a mapping scheme that bridges the discrete categorical outputs of the video emotion classifier with the continuous valence-arousal inputs required by the emotion-conditioned MIDI generator, enabling seamless integration of emotion information across different representations. Our method outperforms state-of-the-art models in objective and subjective evaluations across different video datasets, demonstrating its effectiveness in generating music aligned to video both emotionally and temporally. Our demo and output samples are available at https://serkansulun.com/emsync.


💡 Research Summary

The paper introduces EMSYNC, an automatic video‑to‑symbolic‑music system that aligns generated MIDI with both the emotional content and temporal boundaries of an input video. The authors adopt a two‑stage pipeline. In the first stage, a pretrained multimodal video emotion classifier processes audio, visual, speech‑to‑text, and OCR streams and outputs a probability distribution over the six Ekman basic emotions (anger, disgust, fear, joy, sadness, surprise). Because the downstream music generator requires continuous valence‑arousal (V‑A) inputs, the authors define a mapping from the categorical distribution to a V‑A point. This mapping is built from prior user‑study data: each Ekman category is associated with a mean V‑A coordinate and a covariance; the classifier’s softmax probabilities are used as mixture weights to compute a weighted average V‑A value for the video. This bridge enables the use of large, independently labeled datasets – the Lakh MIDI dataset (with V‑A annotations) for music and the Ekman‑6 video dataset for emotion – without requiring paired video‑MIDI data.
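The categorical-to-continuous mapping described above can be sketched as a probability-weighted average of per-category valence-arousal means. The coordinates below are illustrative placeholders (the paper derives its means and covariances from prior user-study data), so treat the numbers as assumptions, not the authors' values.

```python
import numpy as np

# Hypothetical mean valence-arousal coordinates per Ekman emotion.
# The actual values in the paper come from user-study data.
EKMAN_VA_MEANS = {
    "anger":    (-0.6,  0.7),
    "disgust":  (-0.7,  0.3),
    "fear":     (-0.6,  0.6),
    "joy":      ( 0.8,  0.5),
    "sadness":  (-0.7, -0.4),
    "surprise": ( 0.4,  0.8),
}

def probs_to_va(probs):
    """Map a softmax distribution over the six Ekman emotions to a
    single valence-arousal point, using the probabilities as mixture
    weights over the per-category mean coordinates."""
    va = np.zeros(2)
    for emotion, p in probs.items():
        va += p * np.asarray(EKMAN_VA_MEANS[emotion])
    return va

# Example: a mostly-joyful clip with some surprise.
probs = {"anger": 0.02, "disgust": 0.01, "fear": 0.02,
         "joy": 0.70, "sadness": 0.05, "surprise": 0.20}
va = probs_to_va(probs)  # a 2-vector: (valence, arousal)
```

The weighted average keeps the mapping differentiable and smooth: a clip split between joy and sadness lands between those categories in V-A space rather than snapping to the argmax emotion.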

The second stage is a conditional transformer that generates event‑based MIDI tokens (ON, OFF, TIMESHIFT, CHORD, BAR, START, PAD). Tokens are embedded, summed with learned positional encodings, and concatenated with the V‑A vector (projected to the same dimensionality). The novel temporal conditioning mechanism, called “boundary offsets,” supplies, for every token, a scalar indicating the remaining time (in normalized units) until the next scene cut detected in the video. By feeding this scalar into the transformer (via feature concatenation), the model can anticipate upcoming scene boundaries and deliberately place long‑duration chords or other structural events at those moments. This approach contrasts with prior work that uses dense, frame‑wise motion or saliency features to modulate note density, which often leads to rhythmic instability. EMSYNC retains an event‑based representation with an 8 ms time‑shift resolution, allowing expressive timing while still achieving precise synchronization.
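A minimal sketch of the boundary-offset signal: given token timestamps and detected scene-cut times (both in seconds), each token receives the time remaining until the next cut, capped and normalized. The cap value and normalization are assumptions for illustration; the paper's exact encoding may differ.

```python
import bisect

def boundary_offsets(token_times, cut_times, max_offset=10.0):
    """For each token timestamp, return the normalized time remaining
    until the next scene cut. Tokens after the last cut receive the
    cap. `max_offset` (seconds) is an illustrative constant."""
    offsets = []
    for t in token_times:
        i = bisect.bisect_right(cut_times, t)  # index of next cut after t
        remaining = cut_times[i] - t if i < len(cut_times) else max_offset
        offsets.append(min(remaining, max_offset) / max_offset)
    return offsets

# Tokens at 0.0, 1.5, and 2.9 s; scene cuts at 3.0 and 7.0 s.
offs = boundary_offsets([0.0, 1.5, 2.9], [3.0, 7.0])
```

Because the offset shrinks monotonically toward zero as a cut approaches, the model gets an anticipatory countdown rather than a dense frame-wise signal, which is what lets it place a chord exactly at the boundary.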

Training is performed on the Lakh Pianoroll Dataset (174k pieces). The authors tokenize each piece at an 8 ms TIMESHIFT resolution, merge instruments into five categories to keep the vocabulary manageable, and add special tokens (FEWER_INSTRUMENTS, MORE_INSTRUMENTS) to signal instrument density. The transformer employs relative global attention, and the loss is standard cross‑entropy over the token vocabulary.
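The 8 ms time quantization can be sketched as follows. The chunking of long gaps into multiple TIMESHIFT tokens (and the per-token step cap) is a common event-based-MIDI convention and an assumption here, not the paper's exact vocabulary layout.

```python
def timeshift_tokens(gap_seconds, resolution=0.008, max_steps=125):
    """Quantize a time gap into one or more TIMESHIFT tokens at 8 ms
    resolution. Each token covers at most max_steps * 8 ms = 1 s;
    these limits are illustrative, not the paper's exact scheme."""
    steps = round(gap_seconds / resolution)
    tokens = []
    while steps > 0:
        chunk = min(steps, max_steps)
        tokens.append(f"TIMESHIFT_{chunk}")
        steps -= chunk
    return tokens

# A 1.2 s gap becomes a full 1 s shift plus a 0.2 s remainder.
toks = timeshift_tokens(1.2)
```

The fine 8 ms grid is what lets the event-based representation keep expressive micro-timing while still hitting scene cuts precisely.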

Evaluation comprises objective metrics (chord alignment accuracy, timing error, tonal consistency) and subjective listening tests. The boundary‑offset‑conditioned model outperforms dense‑conditioning baselines by a large margin: chord alignment at scene cuts improves by roughly 18%, and listeners rate EMSYNC higher on emotional congruence and rhythmic stability across multiple video datasets (YouTube clips, movie trailers, etc.). The system also surpasses state‑of‑the‑art video‑to‑MIDI models on most objective scores.
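A chord-alignment-style metric could be computed roughly as below: the fraction of scene cuts with a chord onset within some tolerance window. The tolerance value and the exact formula are hypothetical reconstructions, not the paper's definition.

```python
def chord_alignment_rate(chord_times, cut_times, tolerance=0.25):
    """Fraction of scene cuts that have at least one chord onset
    within `tolerance` seconds. Both inputs are lists of times in
    seconds; the 0.25 s window is an illustrative assumption."""
    if not cut_times:
        return 0.0
    hits = sum(
        any(abs(c - cut) <= tolerance for c in chord_times)
        for cut in cut_times
    )
    return hits / len(cut_times)

# Chords at 1.0, 3.1, 7.0 s; cuts at 3.0, 7.2, 10.0 s -> 2 of 3 aligned.
rate = chord_alignment_rate([1.0, 3.1, 7.0], [3.0, 7.2, 10.0])
```

Scoring per cut (rather than per chord) matches the goal stated above: every boundary should receive a musical event, while extra chords between cuts are not penalized.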

Key contributions are: (1) a probabilistic mapping that unifies categorical video emotions with continuous V‑A music conditioning, enabling large‑scale multimodal training; (2) the introduction of boundary offsets as a sparse, anticipatory temporal conditioning signal for event‑based transformers; (3) a comprehensive two‑stage architecture that, despite lacking paired video‑MIDI data, achieves superior emotional and temporal alignment compared to existing methods. The authors release demo videos and generated samples, and suggest future work on finer‑grained boundary detection, multi‑emotion blending, and integration into real‑world production pipelines.

