Building Multilingual Datasets for Predicting Mental Health Severity through LLMs: Prospects and Challenges
Large Language Models (LLMs) are increasingly being integrated into various medical fields, including mental health support systems. However, there is a gap in research regarding the effectiveness of LLMs in non-English mental health support applications. To address this gap, we present a novel multilingual adaptation of widely used mental health datasets, translated from English into six languages: Greek, Turkish, French, Portuguese, German, and Finnish. This dataset enables a comprehensive evaluation of LLM performance in detecting mental health conditions and assessing their severity across multiple languages. Experimenting with GPT and Llama, we observe considerable variability in performance across languages, even though the models are evaluated on the same translated dataset. This inconsistency underscores the complexities inherent in multilingual mental health support, where language-specific nuances and mental health data coverage can affect model accuracy. Through a comprehensive error analysis, we highlight the risks of relying exclusively on LLMs in medical settings, including their potential to contribute to misdiagnoses. Finally, our proposed approach offers significant cost savings for multilingual tasks, a major advantage for broad-scale implementation.
💡 Research Summary
The paper addresses the pressing need for multilingual mental‑health AI tools by creating and evaluating a multilingual adaptation of two widely used English‑language datasets: DEP‑SEVERITY (four‑level depression severity) and C‑SSRS (five‑level suicide risk). The authors automatically translate the English posts into six target languages (Turkish, French, Portuguese, German, Greek, and Finnish) using LLMs (GPT‑3.5‑turbo and GPT‑4o‑mini) with temperature 0 to encourage deterministic outputs. The translated texts are then fed back into LLMs (including Llama 3.1) via zero‑shot and one‑shot prompts that ask the model to assign the appropriate severity label.
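The zero‑shot vs. one‑shot setup can be sketched as plain prompt assembly. This is a minimal, hypothetical sketch: the label names follow the four DEP‑SEVERITY levels, but the exact prompt wording used by the authors is not reproduced here.

```python
# Hypothetical prompt construction for severity classification.
# Passing an (example_post, example_label) pair turns the zero-shot
# prompt into a one-shot prompt; the wording is illustrative only.

DEP_LABELS = ["minimal", "mild", "moderate", "severe"]

def build_prompt(post, example=None):
    """Assemble a classification prompt for one post.

    post:    the (translated) user post to classify.
    example: optional (post, label) pair used as a one-shot demonstration.
    """
    lines = [
        "Classify the depression severity of the following post as one of: "
        + ", ".join(DEP_LABELS) + "."
    ]
    if example is not None:
        ex_post, ex_label = example
        lines.append(f"Example post: {ex_post}")
        lines.append(f"Example label: {ex_label}")
    lines.append(f"Post: {post}")
    lines.append("Label:")
    return "\n".join(lines)
```

Because the same prompt template is reused for every language, only the post text changes between runs, which keeps the cross‑language comparison controlled.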
Methodologically, the workflow consists of three stages: (1) translation of source posts, (2) prompting for classification, and (3) comparison of predicted labels with the original ground‑truth labels using precision, recall, and F1‑score per class, with macro‑averaged scores summarizing overall performance. The same prompts are used across all languages to guarantee a fair comparison.
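The per‑class and macro‑averaged evaluation in stage (3) can be computed from scratch. A minimal sketch, with illustrative label names (any set of severity labels works):

```python
# Per-class precision/recall/F1 and their unweighted (macro) average,
# as used to compare predicted labels against the gold labels.

from collections import Counter

def per_class_prf(gold, pred, labels):
    """Return {label: (precision, recall, f1)} over parallel label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted p, but gold was something else
            fn[g] += 1  # gold g was missed
    scores = {}
    for lab in labels:
        prec = tp[lab] / (tp[lab] + fp[lab]) if tp[lab] + fp[lab] else 0.0
        rec = tp[lab] / (tp[lab] + fn[lab]) if tp[lab] + fn[lab] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[lab] = (prec, rec, f1)
    return scores

def macro_f1(gold, pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = per_class_prf(gold, pred, labels)
    return sum(f1 for _, _, f1 in scores.values()) / len(labels)
```

Macro averaging weights every class equally, which is why rare but critical classes dragging a per‑class F1 to zero (as reported below for some suicide‑risk labels) visibly depresses the overall score.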
Experimental results reveal substantial variability across languages. In English, the macro F1 reaches about 0.34, whereas the other languages range from 0.10 to 0.33. Rare classes such as “Behavior” and “Attempt” in the suicide‑risk dataset often receive an F1 of zero, indicating that the models struggle to detect these high‑risk signals after translation. The authors attribute these gaps to (a) translation‑induced loss of affective nuance, (b) cultural differences in how mental‑health symptoms are expressed, (c) severe class imbalance in the original datasets, and (d) unequal pre‑training exposure of the LLMs to the target languages.
A detailed error analysis shows that sentiment‑laden words are frequently weakened or omitted in translation, leading to under‑estimation of depression severity. Moreover, the models frequently confuse “Supportive” and “Indicator” categories, especially in languages where the translated phrasing is ambiguous. For suicide‑risk detection, direct expressions of suicidal ideation are sometimes softened, causing missed detections of critical risk levels. These findings underscore the danger of deploying LLM‑only pipelines in clinical settings, where false negatives could have severe consequences.
From a cost perspective, the authors argue that using a single LLM for both translation and classification dramatically reduces the need for human annotators and professional translators, making large‑scale multilingual deployment financially feasible. However, given the observed performance disparities, they recommend a hybrid human‑AI workflow, where LLM predictions are reviewed by clinicians or language experts before any clinical decision is made.
The related‑work section situates the study within the broader NLP‑for‑mental‑health literature, noting that earlier approaches relied on hand‑crafted features and monolingual models, while recent multilingual models (e.g., XLM‑R, multilingual BERT) still suffer from low‑resource language challenges. The paper contributes a novel dataset, a reproducible multilingual evaluation pipeline, and an empirical demonstration of the limits of current LLMs for mental‑health severity prediction across languages.
In conclusion, the study shows that multilingual mental‑health datasets can be generated automatically, but performance is highly language‑dependent. Future work should explore (1) hybrid translation strategies that combine human and LLM output, (2) language‑specific prompt engineering and fine‑tuning, (3) techniques to mitigate class imbalance (e.g., oversampling, cost‑sensitive learning), and (4) clinical validation studies to assess real‑world safety and efficacy. Only through such targeted improvements can LLM‑based mental‑health tools become reliable and equitable across linguistic and cultural boundaries.
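The random‑oversampling idea mentioned in point (3) can be illustrated in a few lines: minority classes are resampled with replacement until every class matches the majority‑class count. This is a generic sketch, not the authors' implementation; a fixed seed keeps it reproducible.

```python
# Random oversampling of (text, label) pairs to equalize class counts.
# Duplicating minority-class examples is the simplest mitigation for the
# severe class imbalance noted in the original datasets.

import random
from collections import defaultdict

def oversample(examples, seed=0):
    """examples: list of (text, label) pairs; returns a rebalanced list."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex[1]].append(ex)
    target = max(len(items) for items in by_label.values())
    balanced = []
    for items in by_label.values():
        balanced.extend(items)
        # Draw with replacement to bring this class up to the target count.
        balanced.extend(rng.choice(items) for _ in range(target - len(items)))
    return balanced
```

Cost‑sensitive learning is the complementary approach: instead of duplicating examples, the loss for minority‑class errors is weighted up, which avoids inflating the dataset.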