Large Language Models for Mental Health: A Multilingual Evaluation

Notice: This research summary and analysis were generated automatically using AI. For authoritative details, please refer to the original arXiv source.

Large Language Models (LLMs) have demonstrated remarkable capabilities across NLP tasks. However, their performance in multilingual contexts, especially within the mental health domain, has not been thoroughly explored. In this paper, we evaluate proprietary and open-source LLMs on eight mental health datasets in various languages, as well as their machine-translated (MT) counterparts. We compare LLM performance in zero-shot, few-shot, and fine-tuned settings against conventional NLP baselines that do not employ LLMs. In addition, we assess translation quality across language families and typologies to understand its influence on LLM performance. Proprietary LLMs and fine-tuned open-source LLMs achieve competitive F1 scores on several datasets, often surpassing state-of-the-art results. However, performance on MT data is generally lower, and the extent of this decline varies by language and typology. This variation highlights both the strengths of LLMs in handling mental health tasks in languages other than English and their limitations when translation quality introduces structural or lexical mismatches.


💡 Research Summary

This paper presents the first comprehensive multilingual evaluation of large language models (LLMs) on mental‑health classification tasks. The authors assemble eight publicly available datasets covering depression and suicidal‑ideation detection in six non‑English languages—Arabic, Bengali, Spanish, Portuguese, Russian, and Thai—along with the corresponding English‑translated (machine‑translated, MT) versions. They benchmark three proprietary LLMs (GPT‑4 Omni, Claude 3.5 Sonnet, Gemini 2 Flash) and four open‑source models (LLaMA 3.2, Gemma 2, Mistral AI Ministral, R1) across three prompting strategies: zero‑shot, five‑shot (few‑shot), and a chain‑of‑thought variant that incorporates emotion‑infused cues (CoT Emo). In addition, the open‑source models are fine‑tuned on the original (non‑translated) data to assess the benefit of task‑specific adaptation.
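The three prompting strategies compared in the paper can be sketched as templates. The wording below is illustrative only, assuming a binary depression/control labeling task; the paper's actual prompts are not reproduced here.

```python
# Illustrative templates for the three prompting strategies the paper compares:
# zero-shot, few-shot (five labelled examples in the paper), and an
# emotion-infused chain-of-thought variant (CoT Emo). Wording is hypothetical.

ZERO_SHOT = (
    "Classify the following post as 'depression' or 'control'.\n"
    "Post: {post}\nLabel:"
)

FEW_SHOT = (
    "Classify each post as 'depression' or 'control'.\n"
    "{examples}\n"
    "Post: {post}\nLabel:"
)

COT_EMO = (
    "First list the emotions expressed in the post, then reason step by step "
    "about whether they indicate depression, and finally answer "
    "'depression' or 'control'.\n"
    "Post: {post}\nEmotions and reasoning:"
)

def build_few_shot(examples):
    """Render labelled (post, label) pairs into the few-shot block."""
    return "\n".join(f"Post: {p}\nLabel: {l}" for p, l in examples)

demo = FEW_SHOT.format(
    examples=build_few_shot([("I can't get out of bed anymore", "depression")]),
    post="Had a great day hiking!",
)
```

The same templates would be sent unchanged to both proprietary and open-source models, so that differences in F1 reflect the model rather than the prompt.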

Key findings include:

1. Proprietary models, especially GPT‑4 and Claude 3.5, consistently achieve the highest F1 scores on the original datasets, often surpassing traditional baselines such as Random Forests, Logistic Regression, and earlier BERT‑based systems.
2. CoT Emo prompting yields the most robust gains across languages, improving F1 by 0.07–0.12 points relative to zero‑shot, with the largest jumps observed for the Russian and Spanish datasets.
3. Open‑source models, when fine‑tuned, close much of the gap to proprietary systems; for example, fine‑tuned LLaMA 3.2 reaches F1 ≈ 0.79 on Russian depression, outperforming several baselines.
4. Machine‑translated data generally lead to modest performance drops (0.03–0.18 F1 points), with the magnitude strongly correlated with translation quality metrics (BLEU, BERTScore, LaBSE). Analytic languages such as Arabic and Spanish suffer larger declines due to structural divergence from English, whereas typologically similar languages (Portuguese, Russian, Bengali) retain most of their performance.
5. In a few cases, MT data slightly outperform original data, suggesting that translation can sometimes smooth linguistic irregularities, but this effect is not systematic.
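Finding (4) ties the size of the MT performance drop to translation-quality scores. As a rough illustration of the simplest such metric, here is a minimal sentence-level BLEU sketch (clipped n-gram precisions combined by a geometric mean, times a brevity penalty); the paper would rely on standard implementations, and BERTScore and LaBSE additionally require pretrained encoders, so they are not sketched here.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(hypothesis, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of clipped n-gram
    precisions (n = 1..max_n) multiplied by a brevity penalty. Whitespace
    tokenization only; a real implementation smooths zero counts."""
    hyp, ref = hypothesis.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped matches
        total = max(sum(hyp_ngrams.values()), 1)
        if overlap == 0:
            return 0.0  # any zero precision zeroes the geometric mean
        log_prec += math.log(overlap / total)
    # Brevity penalty discourages short hypotheses.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(log_prec / max_n)
```

Scoring each MT dataset against reference translations this way, and plotting the score against the per-dataset F1 drop, is one simple way to reproduce the correlation the authors report.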

The authors argue that effective multilingual mental‑health NLP requires careful consideration of three intertwined factors: (a) prompt engineering—emotion‑aware CoT prompts are especially beneficial for affect‑rich mental‑health text; (b) language‑specific characteristics—structural differences and translation fidelity directly impact model accuracy; and (c) model adaptability—open‑source models can be fine‑tuned to achieve competitive results while offering transparency and lower cost.

Limitations noted include reliance on a single MT system (Facebook’s nllb‑200‑3.3B) and the absence of back‑translation quality control beyond automatic metrics. The paper calls for future work on expanding high‑quality multilingual corpora, developing translation‑aware fine‑tuning strategies, and integrating ethical safeguards for deploying LLMs in sensitive mental‑health applications.

