Title: The Illusion of Clinical Reasoning: A Benchmark Reveals the Pervasive Gap in Vision-Language Models for Clinical Competency
ArXiv ID: 2512.22275
Date: 2025-12-25
Authors: Dingyu Wang, Zimu Yuan, Jiajun Liu, Shanggui Liu, Nan Zhou, Tianxing Xu, Di Huang, Dong Jiang
📄 Full Content
The Illusion of Clinical Reasoning: A Benchmark Reveals the Pervasive Gap in Vision-Language Models for Clinical Competency
Dingyu Wang1,2,3#, Zimu Yuan1,2,3#, Jiajun Liu1,2,3, Shanggui Liu1,2,3, Nan Zhou4,5,
Tianxing Xu4,5, Di Huang4,5*, Dong Jiang1,2,3*
1. Department of Sports Medicine, Peking University Third Hospital, Institute of
Sports Medicine of Peking University, Beijing, China.
2. Beijing Key Laboratory of Sports Injuries, Beijing, China.
3. Engineering Research Center of Sports Trauma Treatment Technology and Devices,
Ministry of Education, Beijing, China.
4. State Key Laboratory of Complex and Critical Software Environment, Beihang
University, Beijing, China.
5. School of Computer Science and Engineering, Beihang University, Beijing, China.
# These authors contributed equally to this study.
*Corresponding authors.
Abstract
Background: The rapid integration of foundation models into clinical practice and
public health necessitates a rigorous evaluation of their true clinical reasoning
capabilities beyond narrow examination success. Current benchmarks, typically based
on medical licensing exams or curated vignettes, fail to capture the integrated,
multimodal reasoning essential for real-world patient care.
Methods: We developed the Bones and Joints (B&J) Benchmark, a comprehensive
evaluation framework comprising 1,245 questions derived from real-world patient
cases in orthopedics and sports medicine. This benchmark assesses models across seven tasks that mirror the clinical reasoning pathway, including knowledge recall, text and
image interpretation, diagnosis generation, treatment planning, and rationale provision.
We evaluated eleven vision-language models (VLMs) and six large language models
(LLMs), comparing their performance against expert-derived ground truth.
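To make the evaluation setup concrete, below is a minimal sketch of how a benchmark of this kind could be represented and scored per task. The field names, task labels, and the `query_model` and `grade` callables are hypothetical illustrations, not the paper's implementation; in practice, grading open-ended answers against expert-derived ground truth requires expert or LLM-based judgment rather than exact matching.

```python
# Hypothetical sketch of a B&J-style benchmark item and a per-task scorer.
# Field names and task labels are illustrative assumptions, not the paper's code.
from dataclasses import dataclass, field

@dataclass
class BenchmarkItem:
    task: str                 # e.g., "knowledge_recall", "diagnosis", "treatment"
    question: str             # question text, possibly with case history
    images: list[str] = field(default_factory=list)  # case images; empty for text-only items
    ground_truth: str = ""    # expert-derived reference answer

def per_task_accuracy(items, query_model, grade):
    """Score a model per task.

    query_model(item) -> the model's answer string.
    grade(answer, item) -> True/False against the expert ground truth
    (exact match suffices for multiple choice; open-ended tasks need
    expert or LLM-based grading).
    """
    correct, total = {}, {}
    for item in items:
        total[item.task] = total.get(item.task, 0) + 1
        if grade(query_model(item), item):
            correct[item.task] = correct.get(item.task, 0) + 1
    return {t: correct.get(t, 0) / n for t, n in total.items()}
```

Keeping `query_model` separate from `grade` lets the same harness compare VLMs (which receive `images`) and LLMs (which receive text only) against a single ground truth.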
Results: Our results demonstrate a pronounced performance gap between task types.
While state-of-the-art models achieved high accuracy, exceeding 90%, on structured
multiple-choice questions, their performance markedly declined on open-ended tasks
requiring multimodal integration, with accuracy scarcely reaching 60%. VLMs
demonstrated substantial limitations in interpreting medical images and frequently
exhibited severe text-driven hallucinations, often ignoring contradictory visual
evidence. Notably, models specifically fine-tuned for medical applications showed no
consistent advantage over general-purpose counterparts.
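The text-driven hallucination finding suggests a simple probe one could run on such a benchmark (a hedged sketch under assumed interfaces, not the paper's stated protocol): pose each image-dependent question twice, with and without its images, and flag answers that do not change, since a model that genuinely uses the image should answer differently when it is withheld.

```python
# Hypothetical image-ablation probe for text-driven hallucination.
# `ask(question, images)` is an assumed wrapper around a VLM API, and
# items reuse the BenchmarkItem fields from the sketch above.
def text_driven_rate(items, ask):
    """Fraction of image-dependent items answered identically without the image."""
    multimodal = [it for it in items if it.images]
    flagged = 0
    for it in multimodal:
        with_image = ask(it.question, it.images)
        without_image = ask(it.question, [])          # image withheld
        if with_image.strip() == without_image.strip():
            flagged += 1                              # answer ignored the visual evidence
    return flagged / max(len(multimodal), 1)
```

Exact string equality is a crude criterion; a graded semantic comparison would be more robust, but the ablation logic is the same.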
Conclusions: Current artificial intelligence models are not yet clinically competent for
complex, multimodal reasoning. Their safe deployment should currently be limited to
supportive, text-based roles. Future advancement in core clinical tasks awaits
fundamental breakthroughs in multimodal integration and visual understanding.
Introduction
The emergence of next-generation foundation models is reshaping the field of
medical artificial intelligence (AI). Two recent technological advances, the
development of large reasoning models and the maturation of vision-language models
(VLMs), have enabled AI systems to address complex medical tasks that require
sophisticated planning, integration of multimodal information, and other high-level
cognitive skills. State-of-the-art (SOTA) models have demonstrated performance comparable to, or surpassing, that of human experts on multiple medical benchmarks [1, 2].
Consequently, these foundation models are being rapidly integrated into clinical
workflows, where they are tasked with summarizing records, generating diagnostic
reports, and providing decision support [3, 4]. Moreover, these tools are now widely
available to the public, who are turning to them for everyday health-related questions
and initial symptom assessments. Patients can readily input their symptoms or medical
history into AI-powered chatbots, which offer 24/7 access to immediate medical
guidance [5, 6].
While developers and social media often highlight the remarkable achievements of AI models and their perceived superiority over clinicians, a crucial limitation tends to be overlooked: these impressive results are largely derived from medical licensing examinations, narrow question-answering datasets, and curated, constrained clinical vignettes [7-9], which do not reflect the integrated and nuanced nature of real-world clinical reasoning [10, 11]. It must be acknowledged that passing such examinations is merely the first step toward becoming a clinician. Beyond the acquisition of
knowledge, a clinician must synthesize information from diverse sources, including
clinical notes, physical examinations, laboratory results, and medical images, and apply
evidence-based reasoning within diagnostic pathways. This gap raises a critical
question: can contemporary AI models truly achieve clinical competence, especially
when faced with multimodal data and conflicting information in real healthcare
environments? In the absence of robust evaluation methods for such capabilities,