A Large-Scale Benchmark for Evaluating Large Language Models on Medical Question Answering in Romanian


We introduce MedQARo, the first large-scale medical QA benchmark in Romanian, alongside a comprehensive evaluation of state-of-the-art large language models (LLMs). We construct a high-quality, large-scale dataset comprising 105,880 QA pairs about cancer patients from two medical centers. The questions are based on the medical case summaries of 1,242 patients and require both keyword extraction and reasoning. Our benchmark contains both in-domain and cross-domain (cross-center and cross-cancer) test collections, enabling a precise assessment of generalization capabilities. We experiment with four open-source LLMs from distinct model families on MedQARo. Each model is employed in two scenarios: zero-shot prompting and supervised fine-tuning. We also evaluate two state-of-the-art LLMs available only through APIs, namely GPT-5.2 and Gemini 3 Flash. Our results show that fine-tuned models significantly outperform zero-shot models, indicating that pretrained models fail to generalize on MedQARo without adaptation. Our findings demonstrate the importance of both domain-specific and language-specific fine-tuning for reliable clinical QA in Romanian.


💡 Research Summary

The paper introduces MedQARo, the first large‑scale Romanian medical question‑answering (QA) benchmark, and provides a thorough evaluation of several state‑of‑the‑art large language models (LLMs) on this resource. MedQARo comprises 105,880 high‑quality QA pairs derived from the electronic medical records (epicrises) of 1,242 oncology patients collected at two hospitals in Bucharest. Professional oncologists and radiotherapists spent roughly 3,000 hours manually crafting questions that require keyword extraction or multi‑step clinical reasoning, and providing accurate answers. The dataset is split at the patient level into training, validation, and test sets, with an in‑domain test set (same hospital and cancer types as training) and a cross‑domain test set (different hospital and five cancer types not seen during training). This design enables precise measurement of both within‑distribution performance and generalization to unseen clinical settings.
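The patient-level split described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the `patient_id` field and split fractions are assumptions, but the key property it demonstrates is the one the paper relies on — every QA pair from a given patient lands in exactly one split, so no epicrisis leaks between training and test.

```python
import random

def patient_level_split(qa_pairs, val_frac=0.1, test_frac=0.1, seed=0):
    # qa_pairs: list of dicts, each with a "patient_id" key (hypothetical schema).
    # Splitting is done over unique patients, not individual QA pairs,
    # so all questions about one patient end up in the same split.
    patients = sorted({qa["patient_id"] for qa in qa_pairs})
    random.Random(seed).shuffle(patients)
    n = len(patients)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test_ids = set(patients[:n_test])
    val_ids = set(patients[n_test:n_test + n_val])
    splits = {"train": [], "val": [], "test": []}
    for qa in qa_pairs:
        if qa["patient_id"] in test_ids:
            splits["test"].append(qa)
        elif qa["patient_id"] in val_ids:
            splits["val"].append(qa)
        else:
            splits["train"].append(qa)
    return splits
```

The cross-domain test set additionally holds out an entire hospital and five cancer types, which would amount to filtering on those attributes rather than sampling patient IDs at random.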

Four open‑source LLMs are evaluated: RoLLaMA2‑7B‑Instruct and RoMistral‑7B‑Instruct (both Romanian‑adapted), Phi‑4‑mini‑instruct (a long‑context model supporting up to 128K tokens), and LLaMA3‑OpenBioLLM‑8B (pre‑trained on biomedical data). Each model is tested in two configurations: zero‑shot prompting and supervised fine‑tuning with LoRA (Low‑Rank Adaptation). LoRA adapters are inserted only into the attention projection layers, freezing the rest of the model, so only 0.04–0.10% of the parameters are updated. Hyperparameters (learning rate, dropout, LoRA rank, scaling factor) are tuned on the validation set; the final configuration is LR = 2e‑5, dropout = 0.05, rank = 8, α = 16. Training runs for at most two epochs with BF16 precision and an effective batch size of 8 (via gradient accumulation).
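The 0.04–0.10% trainable-parameter figure follows from simple arithmetic: a rank-r adapter on a d×d projection adds two small matrices, A (r×d) and B (d×r). A rough sanity check, assuming LLaMA2‑7B‑like dimensions (hidden size 4096, 32 layers, ~6.74B parameters) and adapters on two projections per layer (e.g., query and value — the summary does not specify which projections are targeted):

```python
def lora_trainable_fraction(hidden=4096, num_layers=32, rank=8,
                            mats_per_layer=2, total_params=6_738_000_000):
    # Each adapted hidden x hidden projection gains A (rank x hidden)
    # and B (hidden x rank), i.e. rank * 2 * hidden extra parameters.
    lora_params = num_layers * mats_per_layer * rank * 2 * hidden
    return lora_params, lora_params / total_params
```

With rank 8 this yields about 4.2M trainable parameters, roughly 0.06% of the model — consistent with the reported range. Adapting all four attention projections instead would land near the upper end.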

Prompt engineering experiments compare two input orders: “epicrisis + question + answer” (E+Q+A) versus “question + epicrisis + answer” (Q+E+A). Across all models and token limits, Q+E+A consistently yields higher scores, confirming prior findings that early tokens receive more attention.
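The two orderings differ only in where the (long) epicrisis sits relative to the (short) question. A minimal sketch of the comparison, with hypothetical Romanian field labels standing in for the paper's actual template:

```python
def build_prompt(question, epicrisis, order="QEA"):
    # Hypothetical template; the field order is the variable under test.
    # "QEA" puts the question first, so it occupies the earliest tokens.
    if order == "QEA":
        parts = [f"Întrebare: {question}", f"Epicriză: {epicrisis}", "Răspuns:"]
    else:  # "EQA"
        parts = [f"Epicriză: {epicrisis}", f"Întrebare: {question}", "Răspuns:"]
    return "\n".join(parts)
```

Placing the question first means it cannot be pushed out when a long epicrisis is truncated to the model's token limit, which is one plausible mechanism behind the Q+E+A advantage alongside the attention effect noted above.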

Performance is measured with four metrics: token‑level F1 (balancing precision and recall), Exact Match (EM), BLEU (n‑gram overlap), and METEOR (which accounts for synonyms, stemming, and word order—particularly relevant for Romanian’s rich morphology). Baselines include a random token selector and a majority‑answer predictor that exploits answer distribution in the training data.
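Token-level F1 and EM can be computed as in standard extractive-QA evaluation (SQuAD-style); the sketch below assumes simple lowercasing and whitespace tokenization, whereas the paper's exact normalization (punctuation, diacritics) is not detailed in this summary.

```python
from collections import Counter

def normalize(text):
    # Assumed normalization: lowercase + whitespace tokenization.
    return text.lower().split()

def exact_match(pred, gold):
    # 1.0 iff the normalized token sequences are identical.
    return float(normalize(pred) == normalize(gold))

def token_f1(pred, gold):
    # Harmonic mean of token precision and recall, using multiset overlap.
    p, g = normalize(pred), normalize(gold)
    common = Counter(p) & Counter(g)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```

For example, a prediction of "stadiul II" against a gold answer of "stadiul III" shares one of two tokens on each side, giving F1 = 0.5 and EM = 0 — illustrating why F1 is the more forgiving headline metric for near-miss answers.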

Key findings:

1. Zero‑shot LLMs achieve only marginal improvements over the random baseline and are substantially outperformed by the majority‑answer baseline, indicating that pretrained models do not generalize well to Romanian medical QA without adaptation.
2. Fine‑tuned models dramatically surpass all baselines, with gains over zero‑shot often exceeding an order of magnitude.
3. Among fine‑tuned models, RoMistral‑7B‑Instruct attains the best results, reaching F1 0.671, EM 0.571, BLEU 0.521, and METEOR 0.369 on the in‑domain test set.
4. Cross‑domain performance drops considerably (e.g., F1 ≈ 0.452), highlighting limited generalization across hospitals and cancer types.
5. The two proprietary API models, GPT‑5.2 and Gemini 3 Flash, evaluated only in zero‑shot mode, perform worse than the fine‑tuned open‑source models, reinforcing the importance of domain‑ and language‑specific fine‑tuning.

The authors conclude that:

- large‑scale, native‑language medical QA resources are essential for advancing NLP in low‑resource languages;
- language‑specific and domain‑specific fine‑tuning are critical for achieving clinically reliable performance;
- prompt formulation significantly influences outcomes;
- future work should explore multi‑domain training, larger context windows, and safety/interpretability measures to move toward real‑world clinical decision support.

All data and code are released publicly (with full anonymization) to foster reproducibility and further research.

