CAF-Mamba: Mamba-Based Cross-Modal Adaptive Attention Fusion for Multimodal Depression Detection
Depression is a prevalent mental health disorder that severely impairs daily functioning and quality of life. While recent deep learning approaches for depression detection have shown promise, most rely on limited feature types, overlook explicit cross-modal interactions, and employ simple concatenation or static weighting for fusion. To overcome these limitations, we propose CAF-Mamba, a novel Mamba-based cross-modal adaptive attention fusion framework. CAF-Mamba not only captures cross-modal interactions explicitly and implicitly, but also dynamically adjusts modality contributions through a modality-wise attention mechanism, enabling more effective multimodal fusion. Experiments on two in-the-wild benchmark datasets, LMVD and D-Vlog, demonstrate that CAF-Mamba consistently outperforms existing methods and achieves state-of-the-art performance. Our code is available at https://github.com/zbw-zhou/CAF-Mamba.
💡 Research Summary
The paper introduces CAF‑Mamba, a novel multimodal depression detection framework that leverages the recent Mamba state‑space model to capture long‑range temporal dependencies while explicitly modeling cross‑modal interactions and dynamically weighting each modality. The architecture consists of three main components:

1. The Unimodal Extraction Module (UEM) processes each modality—acoustic features, facial landmarks with Action Units (AUs), and eye‑gaze‑head (EGH) cues—through a 1‑D convolutional projection followed by a Residual Mamba (ResMamba) block, preserving both low‑level signals and high‑level abstractions.
2. The Cross‑modal Interaction Mamba Encoder (CIME) aggregates the three unimodal embeddings via element‑wise addition and passes the sum through another ResMamba block, thereby learning explicit inter‑modal dependencies (e.g., low‑pitched voice ↔ down‑turned mouth ↔ downward gaze).
3. The Adaptive Attention Mamba Fusion Module (AAMFM) fuses the unimodal and inter‑modal representations. Its Modal‑wise Attention Block (MAB) first applies temporal average pooling to each feature stream, concatenates the pooled vectors, and generates attention weights through a linear projection and Softmax. These weights modulate the four streams (acoustic, landmark + AU, EGH, and CIME output) before a 1‑D convolution produces an attention‑weighted representation. This representation is then fed into a Multimodal Mamba Encoder (MME), again built on ResMamba, to capture higher‑order correlations and produce the final multimodal embedding, which is classified by a simple linear layer.
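The MAB's weighting scheme described above (pool, concatenate, project, Softmax, modulate) can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the dimensions and the projection matrix `W` are placeholders for the learned linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Four feature streams (acoustic, landmark + AU, EGH, CIME output),
# each of shape (T, d): T time steps, d hidden channels.
T, d = 50, 256
streams = [rng.standard_normal((T, d)) for _ in range(4)]

# Placeholder for the MAB's learned linear projection (random here).
W = rng.standard_normal((4 * d, 4)) * 0.01

def modal_wise_attention(streams, W):
    # 1) Temporal average pooling of each stream -> one (d,) vector each.
    pooled = [s.mean(axis=0) for s in streams]
    # 2) Concatenate the pooled vectors -> (4 * d,).
    z = np.concatenate(pooled)
    # 3) Linear projection + Softmax -> one scalar weight per stream.
    w = softmax(z @ W)
    # 4) Modulate each stream by its attention weight.
    return [wi * s for wi, s in zip(w, streams)], w

weighted, w = modal_wise_attention(streams, W)
```

In the full model, the weighted streams would then pass through the 1‑D convolution and the MME; here the sketch stops at the modulation step.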
Experiments were conducted on two in‑the‑wild datasets: LMVD (1,823 vlogs, 5 visual/audio modalities) and D‑Vlog (961 vlogs, acoustic + facial landmarks). The model uses a single ResMamba block per stage with a hidden dimension of 256, trained with Adam (lr = 1e‑4) and ReduceLROnPlateau for 80 epochs, batch size 16, and binary cross‑entropy loss. Evaluation metrics include accuracy, precision, recall, and F1.
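The four reported metrics follow the standard binary-classification definitions; a minimal NumPy implementation (function name and example labels are illustrative, with label 1 denoting the depressed class):

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
    fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
    acc = float(np.mean(y_pred == y_true))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, prec, rec, f1

# Toy example: 5 samples, one false negative and one false positive.
acc, prec, rec, f1 = binary_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
```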
In multimodal settings on LMVD, CAF‑Mamba achieved 78.69 % accuracy, 78.26 % precision, 79.12 % recall, and 78.69 % F1, surpassing the previous state‑of‑the‑art MDDformer (76.88 %/77.02 %/76.88 %/76.85 %). In bimodal experiments (acoustic + visual), CAF‑Mamba consistently outperformed baselines such as DepMamba, TFN, and STST on both datasets, with the acoustic + EGH combination yielding the best results, underscoring the importance of audio cues.
Ablation studies demonstrated the contribution of each component: removing CIME caused a 6.83 % drop in precision and a 2.85 % drop in recall; removing AAMFM (replacing it with simple concatenation) led to a 5.81 % precision decline and a 1.10 % recall decline. Modality‑combination experiments showed that excluding acoustic features dramatically reduces performance, confirming audio’s pivotal role.
Efficiency analysis compared CAF‑Mamba with a Transformer‑based DepDetector. CAF‑Mamba required only 0.57 M parameters (roughly half of DepDetector's 1.06 M) and exhibited near‑linear inference‑time growth with sequence length (e.g., 3.99 ms at a sequence length of 10,000, versus 12.67 ms for the Transformer), demonstrating superior scalability for long‑form video data.
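The near‑linear scaling is the expected consequence of the complexity gap between self‑attention and a Mamba‑style selective scan. A back‑of‑envelope operation‑count model makes the gap concrete; the constants and the state size `n` below are illustrative assumptions, not the paper's measured timings:

```python
# Self-attention mixes every pair of positions: O(L^2 * d).
def attention_ops(L, d):
    return L * L * d

# A selective-scan SSM updates an n-dimensional state per channel
# per step: O(L * d * n), i.e. linear in sequence length L.
def ssm_scan_ops(L, d, n=16):
    return L * d * n

d = 256
ratio_1k = attention_ops(1_000, d) / ssm_scan_ops(1_000, d)
ratio_10k = attention_ops(10_000, d) / ssm_scan_ops(10_000, d)
# The attention/scan cost ratio itself grows linearly with L,
# which is why the gap widens on long-form video sequences.
```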
Overall, CAF‑Mamba integrates three strengths: (1) Mamba’s ability to model long‑range dependencies, (2) explicit cross‑modal interaction via CIME, and (3) dynamic modality‑wise attention through AAMFM. The resulting system achieves state‑of‑the‑art depression detection performance while being computationally efficient. Future work will explore even lighter architectures, more sophisticated fusion strategies, and broader evaluations across laboratory and in‑the‑wild datasets to improve generalization.