Artificial intelligence (AI) is increasingly permeating healthcare, from physician-facing assistants to consumer applications. Because the opacity of AI algorithms challenges human interaction, explainable AI (XAI) addresses this by providing insight into AI decision-making, yet evidence suggests XAI can paradoxically induce over-reliance or bias. We present results from two large-scale experiments (623 lay people; 153 primary care physicians, PCPs) that combined a fairness-oriented diagnostic AI model with different XAI explanations to examine how XAI assistance, particularly from multimodal large language models (LLMs), influences diagnostic performance. AI assistance balanced across skin tones improved accuracy and reduced diagnostic disparities. However, LLM explanations yielded divergent effects: lay users showed greater automation bias (accuracy improved when the AI was correct but declined when it erred), while experienced PCPs remained resilient, benefiting irrespective of AI accuracy. For both groups, presenting AI suggestions first led to worse outcomes when the AI was incorrect. These findings highlight how XAI's impact varies with expertise and timing, underscoring LLMs as a "double-edged sword" in medical AI and informing the design of future human-AI collaborative systems.
Artificial intelligence (AI) has been intensively investigated as a tool to enhance clinical decision-making. Dermatology, where skin conditions are primarily diagnosed through image assessment, is one area with several developed and FDA-approved tools, such as Nevisense 1 and DermaSensor 2 . In light of the national shortage of dermatologists 3,4 , effective AI assistance could help improve early detection rates and reduce unnecessary clinical visits. AI-powered interfaces have also been proposed to assist the general public in making informed healthcare decisions (e.g., self-diagnosis of skin diseases with Google Lens 5 ). Explainable AI (XAI) has been used in healthcare broadly to improve AI usability and adoption [6][7][8][9] . In dermatology, physicians have clear features to look for, such as the ABCDs of melanoma (asymmetry, irregular borders, multiple colors, and diameter greater than a pencil eraser). Analogously, in AI, gradient-weighted class activation mapping (GradCAM) 10 highlights the image regions most relevant to a prediction, and content-based image retrieval (CBIR) 11,12 retrieves visually similar cases. A recent survey found that GradCAM and CBIR are the two most commonly used XAI techniques in dermatology 13 . In addition, with the recent surge of generative AI 14 , multimodal large language models (multimodal LLMs) operating over textual and visual modalities [15][16][17][18] have been used to analyze dermatological images and explain AI decisions [19][20][21] .
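To make these two traditional explanation styles concrete, the sketch below shows the core of GradCAM in PyTorch: gradients of the target-class score with respect to a convolutional layer's activations weight those activations into a saliency heatmap. This is a minimal illustration; the ResNet-50 backbone, layer choice, and random input are assumptions for the sketch, not the model used in our experiments.

```python
# Minimal GradCAM sketch: gradient-weighted activations of the last
# convolutional block form a class-specific saliency heatmap.
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet50(weights="IMAGENET1K_V2").eval()
target_layer = model.layer4  # last convolutional block (illustrative choice)

activations, gradients = {}, {}
target_layer.register_forward_hook(
    lambda m, i, o: activations.__setitem__("a", o))
target_layer.register_full_backward_hook(
    lambda m, gi, go: gradients.__setitem__("g", go[0]))

def gradcam(image: torch.Tensor, class_idx: int) -> torch.Tensor:
    """Return an [H, W] heatmap in [0, 1] for one image of shape [1, 3, H, W]."""
    logits = model(image)
    model.zero_grad()
    logits[0, class_idx].backward()
    # Weight each channel by its spatially averaged gradient, ReLU the sum.
    weights = gradients["g"].mean(dim=(2, 3), keepdim=True)
    cam = F.relu((weights * activations["a"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear",
                        align_corners=False)
    return (cam / cam.max().clamp(min=1e-8)).squeeze()

heatmap = gradcam(torch.randn(1, 3, 224, 224), class_idx=0)
```

CBIR, by contrast, typically reduces to nearest-neighbour retrieval over learned image embeddings; a minimal version, assuming features are already extracted:

```python
# CBIR sketch: return the k reference images closest to a query embedding.
import numpy as np

def retrieve_similar(query_vec, reference_vecs, k=3):
    dists = np.linalg.norm(reference_vecs - query_vec, axis=1)
    return np.argsort(dists)[:k]
```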
Unfortunately, prior work has shown that XAI can increase users' over-reliance on AI in decision-making [22][23][24] , and human-AI collaborative medical decisions do not always surpass those made by humans or AI alone 25,26 . To date, there is no consensus on whether medical experience improves human-AI collaborative diagnosis. Some research indicates that medical knowledge supports resilience to erroneous AI and better clinical decision-making 27 , other research finds it makes minimal difference 12 , and in dermatology, AI may even mislead humans and worsen diagnostic outcomes [28][29][30] . As dermatological AI tools expand, it is crucial to understand how different XAI methods, especially given the rapid spread of LLMs, affect skin disease diagnostic accuracy for both the general public and medical experts [31][32][33][34][35] .
In this paper, we designed two large-scale experiments to systematically investigate how different XAI methods and human-AI decision paradigms impact diagnostic performance across expertise levels. We chose clinical-image-based skin condition diagnosis as a plausible real-world scenario (i.e., one with ecological validity), given the influx of patient-facing and physician-facing diagnostic models in this space 30,36 . We investigated the overall effectiveness of XAI assistance in improving dermatological diagnostic accuracy 37,38 , reducing disparities across skin tones 39 , and influencing accuracy-confidence calibration (i.e., the degree to which stated confidence tracks actual accuracy; illustrated below). We compared both correct and incorrect multimodal LLM-based explanations to traditional XAI approaches in both scenarios. We also examined how individual differences in AI deference (i.e., the propensity to follow AI regardless of its accuracy, sometimes referred to as AI susceptibility 40 ) affect diagnostic performance. Finally, we explored the impact of the human-AI decision paradigm (i.e., the order in which the human and the AI render decisions) on diagnostic outcomes 41,42 , providing insights into optimal implementation strategies for clinical settings.
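As a concrete illustration of accuracy-confidence calibration, one common summary is the expected calibration error (ECE), which bins decisions by stated confidence and averages the confidence-accuracy gap across bins. The sketch below, with made-up values, is illustrative only and does not reproduce our exact analysis.

```python
# Illustrative expected calibration error (ECE): a perfectly calibrated
# decision-maker (confidence matches accuracy) scores near 0.
import numpy as np

def expected_calibration_error(confidence, correct, n_bins=10):
    confidence, correct = np.asarray(confidence, float), np.asarray(correct, float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidence > lo) & (confidence <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidence[mask].mean())
            ece += mask.mean() * gap  # weight gap by bin occupancy
    return ece

# Hypothetical diagnoses: stated confidence vs. whether each was correct.
print(expected_calibration_error([0.9, 0.8, 0.6, 0.95], [1, 0, 1, 1]))
```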
We explore these questions in two populations: first, the general public (N=623), with a binary classification task distinguishing melanoma from nevus; and second, medical providers (N=153 primary care physicians [PCPs]), with a more challenging open-ended differential diagnosis of skin diseases. Our findings have important implications for designing and deploying image-based medical AI systems for skin diseases. Our work comprehensively studies collaboration between explainable AI and both the general public and medical experts. By elucidating how different XAI methods, both traditional (GradCAM, CBIR) and advanced (multimodal LLMs), levels of medical expertise, and human-AI decision paradigms influence outcomes, we make unique contributions to the understanding of human-AI collaborative decision-making and to the development of more effective and appropriate human-AI collaborative medical systems.
We designed two complementary large-scale digital studies to evaluate human-AI collaborative diagnostic performance (Fig. 1a). Both studies were based on clinical images and designed to resemble real-world practice: a lay person may take a photo of their skin and resort to search engines or AI tools 43,44 , and an expert often needs to make a differential diagnosis based on a clinical image (e.g., patient communication through electronic health record messaging systems).