MedAraBench: Large-Scale Arabic Medical Question Answering Dataset and Benchmark

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the [Original Paper Viewer] below or the original arXiv source.

Arabic remains one of the most underrepresented languages in natural language processing research, particularly in medical applications, due to the limited availability of open-source data and benchmarks. The lack of resources hinders efforts to evaluate and advance the multilingual capabilities of Large Language Models (LLMs). In this paper, we introduce MedAraBench, a large-scale dataset consisting of Arabic multiple-choice question-answer pairs across various medical specialties. We constructed the dataset by manually digitizing a large repository of academic materials created by medical professionals in the Arabic-speaking region. We then conducted extensive preprocessing and split the dataset into training and test sets to support future research efforts in the area. To assess the quality of the data, we adopted two frameworks, namely expert human evaluation and LLM-as-a-judge. Our dataset is diverse and of high quality, spanning 19 specialties and five difficulty levels. For benchmarking purposes, we assessed the performance of eight state-of-the-art open-source and proprietary models, such as GPT-5, Gemini 2.0 Flash, and Claude 4-Sonnet. Our findings highlight the need for further domain-specific enhancements. We release the dataset and evaluation scripts to broaden the diversity of medical data benchmarks, expand the scope of evaluation suites for LLMs, and enhance the multilingual capabilities of models for deployment in clinical settings.


💡 Research Summary

MedAraBench addresses a critical gap in Arabic‑language medical NLP by providing a large‑scale, high‑quality multiple‑choice question (MCQ) benchmark. The authors collected scanned exam papers from regional medical schools, digitized them via professional typists, and performed extensive manual cleaning to remove malformed entries, duplicate options, and ambiguous answer keys. After filtering, the final corpus contains 24,883 MCQs covering 19 medical specialties (e.g., Anatomy, Anesthesia, Biochemistry, Internal Medicine, Surgery) and five difficulty levels aligned with years of study (Y1‑Y5). The dataset is split 80/20 into training (19,894) and test (4,989) sets using stratified random sampling to preserve specialty distribution.
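The stratified 80/20 split described above can be sketched as follows. This is a minimal illustration, not the authors' released preprocessing code: the field name `specialty`, the toy specialty counts, and the per-stratum rounding rule are all assumptions.

```python
import random
from collections import defaultdict

def stratified_split(items, key, test_frac=0.20, seed=0):
    """Shuffle within each stratum, then carve off test_frac of each stratum,
    so every specialty keeps the same share in both splits."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    train, test = [], []
    for group in strata.values():
        rng.shuffle(group)
        n_test = round(len(group) * test_frac)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# Toy pool of 100 questions across three of the paper's 19 specialties
pool = [{"id": i, "specialty": s}
        for i, s in enumerate(["Anatomy"] * 50 + ["Surgery"] * 30
                              + ["Biochemistry"] * 20)]
train, test = stratified_split(pool, key=lambda q: q["specialty"])
print(len(train), len(test))  # 80 20
```

Because the split is taken within each specialty rather than over the pooled questions, rare specialties cannot vanish from the test set by chance.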

Each question is annotated with the number of answer choices (4 or 5), difficulty level, and specialty label. No patient‑identifiable information is present, so privacy concerns are minimal. The authors deliberately avoided external terminology standardization, arguing that real‑world medical QA often contains heterogeneous vocabularies; instead, they rely on expert validation to ensure clinical relevance.

Quality assessment follows a two‑pronged approach. First, two board‑certified clinicians (Anesthesiology and Internal Medicine) performed double‑blinded reviews of a statistically representative sample of 378 questions (derived via Cochran’s formula for 95 % confidence and ±5 % margin). Review criteria include Medical Accuracy, Clinical Relevance, Question Difficulty, and Question Quality (clarity, option homogeneity, single best answer, and absence of clueing). Inter‑annotator agreement ranged from fair to moderate (Cohen’s κ ≈ 0.42‑0.58). Second, an “LLM‑as‑a‑judge” protocol uses four state‑of‑the‑art models (GPT‑3, Gemini‑2.0‑Flash, Claude‑4‑Sonnet, GPT‑4.1) prompted to act as medical educators, scoring the same four dimensions on a binary scale for the entire test set. Pearson correlations between LLM scores and human annotations lie between 0.58 and 0.71, indicating that LLMs can approximate expert judgments but still diverge on a substantial fraction of items.
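The reported sample of 378 questions is reproducible from Cochran's formula with a finite-population correction, assuming maximum variance (p = 0.5) and the full corpus of 24,883 items as the population:

```python
def cochran_sample_size(N, z=1.96, margin=0.05, p=0.5):
    """Cochran's formula: infinite-population sample size n0,
    then the finite-population correction for a population of N."""
    n0 = (z ** 2) * p * (1 - p) / margin ** 2  # ≈ 384.16 at 95% / ±5%
    return n0 / (1 + (n0 - 1) / N)

n = cochran_sample_size(24_883)
print(round(n))  # 378
```

This matches the paper's figure of 378, which supports the reading that the correction was applied over the whole corpus rather than over the test split alone.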

Benchmarking evaluates 16 models—10 open‑source (e.g., Llama‑3.3‑70B‑instruct, DeepSeek‑chat‑v3, Medgemma‑4B‑it) and 6 proprietary (Claude‑sonnet‑4, Gemini‑2.0‑Flash, GPT‑5, GPT‑4.1, GPT‑3, Qwen‑plus). All models are tested in a zero‑shot setting with temperature 0, forced to output a single answer letter (A‑D). Results show that the best proprietary model (GPT‑5) achieves roughly 52 % accuracy, while the strongest open‑source model reaches about 38 %. Even the top performer falls well short of human‑level competence, underscoring the difficulty of Arabic medical reasoning for current LLMs.
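A minimal scoring loop for this zero-shot protocol might look like the sketch below. The letter-extraction heuristic and the example replies are assumptions; the paper's exact harness lives in its released evaluation scripts.

```python
import re

def extract_choice(model_output, valid="ABCDE"):
    """Pull the first standalone choice letter from a model's reply.
    Handles bare letters ('B'), sentences ('The answer is C.'), and
    lettered options ('d) because ...'); returns None if nothing matches."""
    m = re.search(rf"\b([{valid}])\b", model_output.strip().upper())
    return m.group(1) if m else None

def accuracy(predictions, gold):
    """Fraction of items whose extracted letter matches the answer key;
    unparseable replies count as wrong."""
    correct = sum(extract_choice(p) == g for p, g in zip(predictions, gold))
    return correct / len(gold)

# Hypothetical model replies scored against a hypothetical answer key
preds = ["B", "The answer is C.", "d) because ...", "unsure"]
gold = ["B", "C", "D", "A"]
print(accuracy(preds, gold))  # 0.75
```

Counting unparseable replies as errors (rather than re-prompting) is one common design choice; whichever convention the authors used, it should be applied identically to every model being compared.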

To explore data efficiency, the authors conduct few‑shot experiments with LLaMA‑3.1‑8B‑instruct, providing three high‑quality exemplars (selected from the training split and excluded from testing). This yields a modest 3‑5 % absolute gain over zero‑shot. Further, they apply QLoRA (quantized low‑rank adaptation) to fine‑tune LLaMA‑3.1‑8B‑instruct on the full training set using 4‑bit quantization and 800 training steps with gradient accumulation. The LoRA‑adapted model improves test accuracy by ~7 % absolute, demonstrating that domain‑specific adaptation can meaningfully boost performance even with limited data.
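A QLoRA setup of the kind described could be configured roughly as follows, using Hugging Face `transformers` and `peft`. The paper reports only 4-bit quantization, 800 training steps, and gradient accumulation; the LoRA rank, alpha, target modules, batch size, and learning rate below are illustrative assumptions, not the authors' settings.

```python
# Sketch only: 4-bit base weights (frozen) plus trainable low-rank adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit quantization, per the paper
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", quantization_config=bnb)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,  # hypothetical values
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)  # only the adapter weights train

args = TrainingArguments(
    output_dir="medarabench-qlora",
    max_steps=800,                   # as reported in the paper
    gradient_accumulation_steps=8,   # hypothetical
    per_device_train_batch_size=2,   # hypothetical
    learning_rate=2e-4,              # hypothetical
)
```

The appeal of this recipe is that only the small adapter matrices are updated, so an 8B model can be adapted on a single consumer GPU while the quantized base weights stay frozen.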

The paper’s contributions are fourfold: (1) the creation and public release of the first large‑scale Arabic medical MCQ benchmark with rich metadata; (2) a dual evaluation framework combining expert human review and LLM‑as‑judge analysis; (3) comprehensive baseline results across a spectrum of models and training regimes (zero‑shot, few‑shot, LoRA fine‑tuning); and (4) open‑source scripts for data preprocessing, prompting, and metric calculation to facilitate reproducibility. The authors argue that MedAraBench will catalyze research on multilingual, domain‑specific LLMs and encourage the development of more robust evaluation protocols for clinical AI.

In conclusion, MedAraBench reveals that despite rapid advances in LLM capabilities, Arabic medical question answering remains a challenging frontier. The modest gains from few‑shot prompting and low‑rank adaptation suggest that larger, more diverse Arabic medical corpora and specialized training objectives are needed. Future work should expand the dataset (e.g., include longer‑form clinical vignettes, multi‑turn dialogues), refine difficulty calibration, and develop safety‑oriented metrics (ethical reasoning, hallucination detection). By providing both data and evaluation infrastructure, MedAraBench lays a solid foundation for advancing Arabic‑language medical AI toward safe, reliable deployment in real‑world clinical settings.

