AmbER$^2$: Dual Ambiguity-Aware Emotion Recognition Applied to Speech and Text
Emotion recognition is inherently ambiguous, with uncertainty arising both from rater disagreement and from discrepancies across modalities such as speech and text. There is growing interest in modeling rater ambiguity using label distributions. However, modality ambiguity remains underexplored, and multimodal approaches often rely on simple feature fusion without explicitly addressing conflicts between modalities. In this work, we propose AmbER$^2$, a dual ambiguity-aware framework that simultaneously models rater-level and modality-level ambiguity through a teacher-student architecture with a distribution-wise training objective. Evaluations on IEMOCAP and MSP-Podcast show that AmbER$^2$ consistently improves distributional fidelity over conventional cross-entropy baselines and achieves performance competitive with, or superior to, recent state-of-the-art systems. For example, on IEMOCAP, AmbER$^2$ achieves relative improvements of 20.3% on Bhattacharyya coefficient (0.83 vs. 0.69), 13.6% on R$^2$ (0.67 vs. 0.59), 3.8% on accuracy (0.683 vs. 0.658), and 4.5% on F1 (0.675 vs. 0.646). Further analysis across ambiguity levels shows that explicitly modeling ambiguity is particularly beneficial for highly uncertain samples. These findings highlight the importance of jointly addressing rater and modality ambiguity when building robust emotion recognition systems.
💡 Research Summary
The paper tackles a fundamental challenge in affective computing: emotion recognition is intrinsically ambiguous. Two distinct sources of ambiguity are identified. First, “rater ambiguity” arises because multiple human annotators often disagree on the emotional label of a given utterance; this disagreement can be captured as a probability distribution over emotion classes. Second, “modality ambiguity” occurs when different input modalities (speech acoustic cues versus textual lexical cues) convey conflicting emotional signals for the same sample. While prior work has extensively modeled rater ambiguity using soft‑label distributions, modality ambiguity has largely been ignored or handled only by naïve feature‑level fusion, leaving a gap in robust multimodal emotion modeling.
To address both ambiguities simultaneously, the authors propose AmbER² (Dual Ambiguity‑Aware Emotion Recognition). The architecture follows a teacher‑student paradigm. Three heads are defined: an audio‑only head (A), a text‑only head (T), and a multimodal fusion head (AT). The two unimodal heads act as “experts” (teachers) that provide modality‑specific predictions, while the fusion head serves as the “student” that integrates information across modalities. All heads share the same underlying encoders (wav2vec2 for audio, BERT for text) and are trained jointly.
Training is driven by a composite loss that explicitly encodes both types of ambiguity.
- Rater Ambiguity Integrated (RAI) loss – The student's output distribution s is matched to the ground‑truth label distribution y (obtained by normalizing annotator votes) using Jensen‑Shannon (JS) divergence: L_RAI = JS(y‖s). This term forces the model to reproduce the full annotator distribution rather than a single majority label, preserving rater disagreement.
- Modality Ambiguity Integrated (MAI) loss – For each expert prediction p_m (m ∈ {A, T}), a consistency loss JS(s‖p_m) is weighted by how well that expert aligns with the rater distribution. The weight is defined as u_m = exp(−κ·D_m) / Σ_k exp(−κ·D_k), where D_m = JS(p_m‖y) and κ controls the sharpness of the weighting. The overall MAI loss is L_MAI = Σ_m u_m · JS(s‖p_m), summed over the two unimodal experts.
Consequently, experts whose predictions are close to the human distribution receive higher influence, while those that diverge are down‑weighted, directly addressing modality conflict.
The final objective combines the two terms:
L_AmbER = λ_RAI · L_RAI + λ_MAI · L_MAI, with λ hyper‑parameters tuned on validation data.
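The composite objective above can be sketched directly from these definitions. The following NumPy sketch is ours, not the authors' code: function names, the ε-smoothing inside the JS divergence, and default hyper-parameter values are assumptions; only the structure of the loss follows the paper.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def amber2_loss(y, s, experts, kappa=4.0, lam_rai=1.0, lam_mai=0.5):
    """Dual ambiguity-aware loss.

    y: rater label distribution (normalized annotator votes)
    s: student (fusion head) prediction
    experts: dict mapping modality name -> expert prediction, e.g. {"A": ..., "T": ...}
    """
    # RAI term: match the student to the full rater distribution.
    l_rai = js_divergence(y, s)
    # Expert weights u_m: softmax over negative divergence from the raters,
    # so experts that agree with humans get more influence.
    d = np.array([js_divergence(p, y) for p in experts.values()])
    u = np.exp(-kappa * d)
    u = u / u.sum()
    # MAI term: weighted consistency between student and each expert.
    l_mai = sum(u_m * js_divergence(s, p)
                for u_m, p in zip(u, experts.values()))
    return lam_rai * l_rai + lam_mai * l_mai
```

When student, experts, and raters all agree, both terms vanish; disagreement between the student and a human-aligned expert is penalized more than disagreement with a divergent one.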
Experimental Setup
- Datasets: IEMOCAP (4‑class, scripted dyadic interactions) and MSP‑Podcast (8‑class, natural podcast speech). Both provide multiple annotator votes per utterance, enabling distributional supervision.
- Encoders: Pre‑trained wav2vec2 (audio) and BERT (text). Utterance‑level representations are obtained by mean‑pooling frame‑level embeddings (audio) and token embeddings (text).
- Fusion: Concatenated audio and text embeddings passed through a gated fusion module before the fusion head.
- Baseline: Same architecture trained with Class‑Balanced Cross‑Entropy (CB‑CE) loss, i.e., without any ambiguity modeling.
- Training: AdamW optimizer (lr = 3e‑4, weight decay = 1e‑2), batch size = 128, up to 30 epochs, 5‑fold cross‑validation (session‑wise for IEMOCAP, equal‑size folds for MSP‑Podcast). λ_RAI fixed to 1.0, λ_MAI explored {0.3, 0.5, 0.7}, κ ∈ {2, 4, 8}. Results are averaged over five random seeds.
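The gated fusion step in the setup above admits many formulations; the paper does not spell one out, so the following is a minimal NumPy sketch under our own assumptions (linear projections to a shared space plus a per-dimension sigmoid gate computed from the concatenated embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GatedFusion:
    """Blend projected audio and text embeddings with a learned gate
    before passing the result to the fusion (student) head."""

    def __init__(self, d_audio, d_text, d_model):
        # Toy random initialization; in practice these are trained weights.
        self.Wa = rng.standard_normal((d_audio, d_model)) * 0.02
        self.Wt = rng.standard_normal((d_text, d_model)) * 0.02
        self.Wg = rng.standard_normal((d_audio + d_text, d_model)) * 0.02

    def __call__(self, a, t):
        ha, ht = a @ self.Wa, t @ self.Wt                   # shared space
        g = sigmoid(np.concatenate([a, t], axis=-1) @ self.Wg)  # per-dim gate
        return g * ha + (1.0 - g) * ht                      # gated blend
```

A gate of this form lets the model lean on the more informative modality per dimension, which is a natural fit for the modality-conflict setting the MAI loss targets.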
Evaluation Metrics
Distributional fidelity: JS divergence (lower is better), Bhattacharyya Coefficient (BC; higher is better), and R² (higher is better).
Classification performance: macro‑F1, weighted‑F1, and accuracy (ACC).
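For concreteness, the two less common distributional metrics can be computed as below. This is a standard formulation, not the paper's code; in particular, how R² is aggregated across samples and classes is our assumption (here: over all flattened distribution entries).

```python
import numpy as np

def bhattacharyya(p, q):
    """Bhattacharyya coefficient between two discrete distributions:
    1.0 for identical distributions, 0.0 for disjoint support."""
    return float(np.sum(np.sqrt(np.asarray(p, float) * np.asarray(q, float))))

def r_squared(targets, preds):
    """Coefficient of determination over all distribution entries,
    flattened across samples and emotion classes."""
    t = np.asarray(targets, float).ravel()
    p = np.asarray(preds, float).ravel()
    ss_res = np.sum((t - p) ** 2)          # residual sum of squares
    ss_tot = np.sum((t - t.mean()) ** 2)   # total sum of squares
    return float(1.0 - ss_res / ss_tot)
```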
Results – Baseline vs. AmbER²
On IEMOCAP, AmbER² reduces JS from 0.216 ± 0.001 to 0.193 ± 0.002 (≈10 % relative improvement), raises BC from 0.803 ± 0.001 to 0.825 ± 0.001, and lifts R² from 0.628 ± 0.001 to 0.665 ± 0.002. Classification metrics also improve: accuracy from 0.654 ± 0.003 to 0.683 ± 0.003 (+4.4 %), weighted‑F1 from 0.655 ± 0.003 to 0.675 ± 0.004 (+3.1 %).
On MSP‑Podcast, JS drops from 0.368 ± 0.003 to 0.328 ± 0.001 (≈11 % reduction), BC climbs from 0.664 ± 0.000 to 0.707 ± 0.000, and R² rises from 0.378 ± 0.002 to 0.425 ± 0.001. Accuracy improves markedly from 0.473 ± 0.003 to 0.520 ± 0.003 (+9.9 %).
A detailed analysis shows that samples with high ambiguity (i.e., flat label distributions) benefit the most, confirming that the MAI component successfully mitigates conflicting modality cues.
Comparison with State‑of‑the‑Art
The authors compare AmbER² against three recent systems:
- AER‑LLM (Gemini‑1.5‑Flash, 2025) – a large‑language‑model based multimodal approach, evaluated in zero‑shot and few‑shot modes.
- EmoEnt (2025) – an entropy‑aware multimodal model that incorporates a confidence‑based loss.
- EMO‑Super (2024) – a benchmark using a variety of self‑supervised speech representations.
Using the same wav2vec2‑BERT backbone, AmbER² matches or exceeds these baselines on distributional metrics (lower JS, higher BC) while delivering competitive classification scores. Notably, AmbER² outperforms AER‑LLM few‑shot on JS (0.193 vs. 0.210) and BC (0.825 vs. 0.812), demonstrating that explicit dual‑ambiguity modeling can rival large‑scale LLM approaches without massive parameter counts.
Key Contributions
- Dual‑ambiguity formulation – Simultaneously models rater and modality uncertainty within a unified loss.
- Adaptive expert weighting – Dynamically scales the influence of each modality based on its agreement with human annotations, directly addressing cross‑modal conflicts.
- Teacher‑student architecture for multimodal fusion – Enables the fusion head to learn from both unimodal experts and the ground‑truth distribution, improving both distributional fidelity and classification accuracy.
Implications and Future Work
The study demonstrates that handling both sources of ambiguity yields more reliable emotion predictions, especially for ambiguous utterances that are common in real‑world applications (e.g., spontaneous speech, podcasts). The framework is modular and can be extended to additional modalities such as facial video, physiological signals, or contextual metadata. Moreover, the adaptive weighting mechanism could be integrated with attention‑based fusion strategies or transformer‑style cross‑modal encoders. Finally, deploying AmbER² in interactive systems (e.g., empathetic dialogue agents) could improve user experience by providing calibrated confidence estimates that reflect both annotator disagreement and modality conflict.