Title: The Illusion of Clinical Reasoning: A Benchmark Reveals the Pervasive Gap in Vision-Language Models for Clinical Competency
ArXiv ID: 2512.22275
Date: 2025-12-25
Authors: Dingyu Wang, Zimu Yuan, Jiajun Liu, Shanggui Liu, Nan Zhou, Tianxing Xu, Di Huang, Dong Jiang
📄 Full Content
The Illusion of Clinical Reasoning: A Benchmark Reveals the Pervasive Gap in Vision-Language Models for Clinical Competency
Dingyu Wang1,2,3#, Zimu Yuan1,2,3#, Jiajun Liu1,2,3, Shanggui Liu1,2,3, Nan Zhou4,5,
Tianxing Xu4,5, Di Huang4,5*, Dong Jiang1,2,3*
1. Department of Sports Medicine, Peking University Third Hospital, Institute of
Sports Medicine of Peking University, Beijing, China.
2. Beijing Key Laboratory of Sports Injuries, Beijing, China.
3. Engineering Research Center of Sports Trauma Treatment Technology and Devices,
Ministry of Education, Beijing, China.
4. State Key Laboratory of Complex and Critical Software Environment, Beihang
University, Beijing, China.
5. School of Computer Science and Engineering, Beihang University, Beijing, China.
# These authors contributed equally to this study.
*Corresponding authors.
Abstract
Background: The rapid integration of foundation models into clinical practice and
public health necessitates a rigorous evaluation of their true clinical reasoning
capabilities beyond narrow examination success. Current benchmarks, typically based
on medical licensing exams or curated vignettes, fail to capture the integrated,
multimodal reasoning essential for real-world patient care.
Methods: We developed the Bones and Joints (B&J) Benchmark, a comprehensive
evaluation framework comprising 1,245 questions derived from real-world patient
cases in orthopedics and sports medicine. This benchmark assesses models across seven tasks that mirror the clinical reasoning pathway, including knowledge recall, text and
image interpretation, diagnosis generation, treatment planning, and rationale provision.
We evaluated eleven vision-language models (VLMs) and six large language models
(LLMs), comparing their performance against expert-derived ground truth.
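To make the evaluation setup concrete, below is a minimal sketch of how a benchmark of this kind could be represented and scored per task. The field names, task labels, and the `query_model` and `grade` callables are hypothetical illustrations, not the paper's implementation; in practice, grading open-ended answers against expert-derived ground truth requires expert or LLM-based judgment rather than exact matching.

```python
# Hypothetical sketch of a B&J-style benchmark item and a per-task scorer.
# Field names and task labels are illustrative assumptions, not the paper's code.
from dataclasses import dataclass, field

@dataclass
class BenchmarkItem:
    task: str                 # e.g., "knowledge_recall", "diagnosis", "treatment"
    question: str             # question text, possibly with case history
    images: list[str] = field(default_factory=list)  # case images; empty for text-only items
    ground_truth: str = ""    # expert-derived reference answer

def per_task_accuracy(items, query_model, grade):
    """Score a model per task.

    query_model(item) -> the model's answer string.
    grade(answer, item) -> True/False against the expert ground truth
    (exact match suffices for multiple choice; open-ended tasks need
    expert or LLM-based grading).
    """
    correct, total = {}, {}
    for item in items:
        total[item.task] = total.get(item.task, 0) + 1
        if grade(query_model(item), item):
            correct[item.task] = correct.get(item.task, 0) + 1
    return {t: correct.get(t, 0) / n for t, n in total.items()}
```

Keeping `query_model` separate from `grade` lets the same harness compare VLMs (which receive `images`) and LLMs (which receive text only) against a single ground truth.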
Results: Our results demonstrate a pronounced performance gap between task types.
While state-of-the-art models achieved high accuracy, exceeding 90%, on structured
multiple-choice questions, their performance markedly declined on open-ended tasks
requiring multimodal integration, with accuracy scarcely reaching 60%. VLMs
demonstrated substantial limitations in interpreting medical images and frequently
exhibited severe text-driven hallucinations, often ignoring contradictory visual
evidence. Notably, models specifically fine-tuned for medical applications showed no
consistent advantage over general-purpose counterparts.
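The text-driven hallucination finding suggests a simple probe one could run on such a benchmark (a hedged sketch under assumed interfaces, not the paper's stated protocol): pose each image-dependent question twice, with and without its images, and flag answers that do not change, since a model that genuinely uses the image should answer differently when it is withheld.

```python
# Hypothetical image-ablation probe for text-driven hallucination.
# `ask(question, images)` is an assumed wrapper around a VLM API, and
# items reuse the BenchmarkItem fields from the sketch above.
def text_driven_rate(items, ask):
    """Fraction of image-dependent items answered identically without the image."""
    multimodal = [it for it in items if it.images]
    flagged = 0
    for it in multimodal:
        with_image = ask(it.question, it.images)
        without_image = ask(it.question, [])          # image withheld
        if with_image.strip() == without_image.strip():
            flagged += 1                              # answer ignored the visual evidence
    return flagged / max(len(multimodal), 1)
```

Exact string equality is a crude criterion; a graded semantic comparison would be more robust, but the ablation logic is the same.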
Conclusions: Current artificial intelligence models are not yet clinically competent for
complex, multimodal reasoning. Their safe deployment should currently be limited to
supportive, text-based roles. Future advancement in core clinical tasks awaits
fundamental breakthroughs in multimodal integration and visual understanding.
Introduction
The emergence of next-generation foundation models is reshaping the field of
medical artificial intelligence (AI). Two recent technological advances, the
development of large reasoning models and the maturation of vision-language models
(VLMs), have enabled AI systems to address complex medical tasks that require
sophisticated planning, integration of multimodal information, and other high-level
cognitive skills. State-of-the-art (SOTA) models have demonstrated performance comparable to, or surpassing, that of human experts on multiple medical benchmarks [1, 2].
Consequently, these foundation models are being rapidly integrated into clinical
workflows, where they are tasked with summarizing records, generating diagnostic
reports, and providing decision support [3, 4]. Moreover, these tools are now widely
available to the public, who are turning to them for everyday health-related questions
and initial symptom assessments. Patients can readily input their symptoms or medical
history into AI-powered chatbots, which offer 24/7 access to immediate medical
guidance [5, 6].
While developers and social media often highlight the remarkable achievements of AI models and their perceived superiority over clinicians, a crucial limitation tends to be overlooked: these impressive results are largely derived from medical licensing examinations, narrow question-answering datasets, and curated, constrained clinical vignettes [7-9], which do not reflect the integrated and nuanced nature of real-world clinical reasoning [10, 11]. It must be acknowledged that passing such examinations is merely the first step toward becoming a clinician. Beyond the acquisition of
knowledge, a clinician must synthesize information from diverse sources, including
clinical notes, physical examinations, laboratory results, and medical images, and apply
evidence-based reasoning within diagnostic pathways. This gap raises a critical
question: can contemporary AI models truly achieve clinical competence, especially
when faced with multimodal data and conflicting information in real healthcare
environments? In the absence of robust evaluation methods for such capabilities,