Integrating Fine-Grained Audio-Visual Evidence for Robust Multimodal Emotion Reasoning
Multimodal emotion analysis is shifting from static classification to generative reasoning. Beyond simple label prediction, robust affective reasoning must synthesize fine-grained signals such as facial micro-expressions and prosodic shifts to decode the latent causality within complex social contexts. However, current Multimodal Large Language Models (MLLMs) face significant limitations in fine-grained perception, primarily due to data scarcity and insufficient cross-modal fusion. As a result, these models often exhibit unimodal dominance, which leads to hallucinations in complex multimodal interactions, particularly when visual and acoustic cues are subtle, ambiguous, or even contradictory (e.g., in sarcastic scenarios). To address this, we introduce SABER-LLM, a framework designed for robust multimodal reasoning. First, we construct SABER, a large-scale emotion reasoning dataset comprising 600K video clips, annotated with a novel six-dimensional schema that jointly captures audiovisual cues and causal logic. Second, we propose the structured evidence decomposition paradigm, which enforces a “perceive-then-reason” separation between evidence extraction and reasoning to alleviate unimodal dominance. The ability to perceive complex scenes is further reinforced by consistency-aware direct preference optimization, which explicitly encourages alignment among modalities under ambiguous or conflicting perceptual conditions. Experiments on EMER, EmoBench-M, and SABER-Test demonstrate that SABER-LLM significantly outperforms open-source baselines and achieves robustness competitive with closed-source models in decoding complex emotional dynamics. The dataset and model are available at https://github.com/zxzhao0/SABER-LLM.
💡 Research Summary
The paper tackles a fundamental shift in multimodal emotion analysis: moving from static label classification toward generative reasoning that explains why an emotion arises. Existing multimodal large language models (MLLMs) excel at aligning visual, auditory, and textual streams with a language model’s semantic space, yet they suffer from two critical shortcomings when applied to fine‑grained affective reasoning. First, they lack sufficient training data that captures subtle cues such as micro‑expressions, prosodic nuances, and body language, leading to a “unimodal dominance” problem where the model over‑relies on the most salient modality and ignores weaker but essential signals. Second, when visual and acoustic cues are contradictory—e.g., a smiling face paired with a sarcastic tone—current models either hallucinate missing evidence or produce inconsistent predictions, because they typically fuse modalities in a single end‑to‑end pass without explicit grounding.
To address these issues, the authors introduce SABER‑LLM, a two‑stage framework built on the Qwen2.5‑Omni architecture, together with a novel dataset, SABER. The dataset comprises roughly 600,000 video clips drawn from diverse sources (movies, TV shows, scripted dialogues, and user‑generated content) and is annotated with a six‑dimensional schema: (1) Scene description, (2) Speech content (transcript and semantics), (3) Acoustic features (prosody, pitch, intensity), (4) Facial expression (including micro‑expressions and gaze), (5) Body language (posture, gestures), and (6) Comprehensive reasoning that synthesizes the previous five dimensions into a causal explanation of the observed emotion. Annotation is performed automatically using Gemini‑2.5‑Pro with a hierarchical prompting strategy, followed by rigorous quality control: (a) an ASR‑based word‑error‑rate filter to catch audio hallucinations, and (b) a Qwen2.5‑VL‑based visual description consistency check. The resulting high‑quality corpus also includes a dedicated test split, SABER‑Test, deliberately balanced between “consistent” (audio and visual cues align) and “inconsistent” (cues conflict) samples, each scored for conflict intensity (0‑10) using GPT‑4o.
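The ASR-based word-error-rate filter described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `keep_clip` helper, its threshold value, and the function names are assumptions; the standard WER definition (edit distance between word sequences divided by reference length) is the only grounded part.

```python
# Hypothetical sketch of an ASR-based WER filter for catching audio
# hallucinations in auto-generated annotations: a clip is kept only if the
# annotated speech content stays close to an independent ASR transcript.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / max(len(ref), 1)

def keep_clip(asr_transcript: str, annotated_speech: str,
              threshold: float = 0.3) -> bool:
    """Reject annotations whose speech content diverges from the ASR output.
    The 0.3 threshold is an illustrative assumption, not from the paper."""
    return word_error_rate(asr_transcript, annotated_speech) <= threshold
```

In a real pipeline the reference would come from a dedicated ASR model, so the filter catches annotations where the LLM "hears" words that were never spoken.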
Stage 1 – Structured Evidence Decomposition (SED).
Instead of feeding the raw multimodal input directly to the language model and asking it to output a final emotion label, SED forces the model to first generate structured, modality-specific evidence before reasoning toward the emotion.