"Do I Trust the AI?" Towards Trustworthy AI-Assisted Diagnosis: Understanding User Perception in LLM-Supported Reasoning

"Do I Trust the AI?" Towards Trustworthy AI-Assisted Diagnosis: Understanding User Perception in LLM-Supported Reasoning
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Large language models (LLMs) have shown considerable potential in supporting medical diagnosis. However, their effective integration into clinical workflows is hindered by physicians’ difficulties in perceiving and trusting LLM capabilities, which often results in miscalibrated trust. Existing model evaluations primarily emphasize standardized benchmarks and predefined tasks, offering limited insights into clinical reasoning practices. Moreover, research on human-AI collaboration has rarely examined physicians’ perceptions of LLMs’ clinical reasoning capability. In this work, we investigate how physicians perceive LLMs’ capabilities in the clinical reasoning process. We designed clinical cases, collected the corresponding analyses, and obtained evaluations from physicians (N=37) to quantitatively represent their perceived LLM diagnostic capabilities. By comparing the perceived evaluations with benchmark performance, our study highlights the aspects of clinical reasoning that physicians value and underscores the limitations of benchmark-based evaluation. We further discuss implications and opportunities for enhancing trustworthy collaboration between physicians and LLMs in LLM-supported clinical reasoning.


💡 Research Summary

The paper “Do I Trust the AI? Towards Trustworthy AI‑Assisted Diagnosis: Understanding User Perception in LLM‑Supported Clinical Reasoning” investigates how physicians perceive and trust large language models (LLMs) when they are used to support medical diagnosis. While recent advances have shown that LLMs can retrieve patient data, identify complex symptom patterns, and generate personalized treatment recommendations, their integration into real‑world clinical workflows remains limited because physicians often misjudge the models’ capabilities, leading to either over‑reliance or under‑use. Existing evaluations of LLMs rely heavily on standardized benchmarks (e.g., USMLE, MedQA) and predefined question‑answer tasks, which do not capture the dynamic, iterative nature of clinical reasoning that involves hypothesis generation, evidence gathering, and decision updating.

To address this gap, the authors formulate two research questions: (RQ1) How do physicians evaluate the value and capability of LLM responses during clinical reasoning? (RQ2) What is the relationship between physicians’ perceived LLM capability and the models’ performance on standard medical benchmarks?

The study proceeds in two steps. In Step One, nine clinical cases spanning multiple specialties are designed in collaboration with a team of physicians. For each case, analyses are collected from several LLMs (including GPT‑4, Claude 3, and Gemini 1.5) and from human experts, covering diagnostic inquiry, final diagnosis, and treatment principles. In Step Two, an initial group of 11 physicians helps derive a set of evaluation dimensions. A larger cohort of 37 physicians then rates each LLM analysis on five dimensions—diagnostic accuracy, clinical relevance, reasoning coherence, explainability, and risk mitigation—and also provides an overall ranking of the analyses.

Using the dimension scores and overall rankings, the authors fit a multivariate regression model to compute a “Perceived Capability Score” (PCS) that quantifies how capable physicians believe an LLM to be. They then compare PCS with the LLMs’ benchmark scores. The key findings are:
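The paper does not include its fitting code, but a minimal sketch of the general approach is shown below: a multivariate linear regression maps the five dimension scores onto physicians’ overall rankings, and the fitted values serve as a PCS. The column names, the CSV file, and the use of scikit‑learn’s LinearRegression are assumptions for illustration, not details taken from the paper.

```python
# Illustrative sketch only, not the authors' code. Assumes a ratings table with
# one row per (physician, LLM analysis) pair and hypothetical column names.
import pandas as pd
from sklearn.linear_model import LinearRegression

DIMENSIONS = [
    "diagnostic_accuracy", "clinical_relevance", "reasoning_coherence",
    "explainability", "risk_mitigation",
]

ratings = pd.read_csv("physician_ratings.csv")  # hypothetical data file

# Regress the physicians' overall ranking of each analysis on its dimension scores.
X = ratings[DIMENSIONS]
y = ratings["overall_ranking"]
model = LinearRegression().fit(X, y)

# Relative contribution of each dimension in the fitted model.
weights = pd.Series(model.coef_, index=DIMENSIONS).abs()
weights /= weights.sum()
print(weights.sort_values(ascending=False))

# Perceived Capability Score per LLM: mean fitted value over its rated analyses.
ratings["pcs"] = model.predict(X)
print(ratings.groupby("llm")["pcs"].mean())
```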

  1. Non‑linear alignment – PCS increases with benchmark performance but at a diminishing rate, indicating a saturation effect where additional gains in objective accuracy yield only modest improvements in perceived trust (see the sketch after this list).
  2. Dimension weighting – Clinical relevance (≈30 % contribution) and risk mitigation (≈25 %) dominate the trust assessment, while explainability, reasoning coherence, and raw diagnostic accuracy contribute less (≈20 %, 15 %, and 10 % respectively). This suggests physicians prioritize how well the model’s output fits the clinical context and its safety implications over pure correctness.
  3. Benchmark limitations – Standardized tests focus on answer correctness and ignore process‑oriented factors such as transparency, contextual appropriateness, and safety, leading to a mismatch between high benchmark scores and physician trust.
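The paper does not state the functional form behind finding 1. As a rough way to probe for such a saturation effect, one could compare a logarithmic fit of PCS against benchmark score with a plain linear fit, as in the toy sketch below; all numbers are placeholders, not data from the study.

```python
# Toy check for the diminishing-returns pattern in finding 1; data are placeholders.
import numpy as np

benchmark = np.array([60.0, 68.0, 75.0, 82.0, 88.0])  # hypothetical benchmark scores
pcs = np.array([2.1, 2.9, 3.4, 3.6, 3.7])             # hypothetical perceived scores

# Fit PCS ~ a*log(benchmark) + b and PCS ~ a*benchmark + b, then compare fit quality.
log_fit = np.polyfit(np.log(benchmark), pcs, 1)
lin_fit = np.polyfit(benchmark, pcs, 1)

def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

print("log fit R^2:   ", r_squared(pcs, np.polyval(log_fit, np.log(benchmark))))
print("linear fit R^2:", r_squared(pcs, np.polyval(lin_fit, benchmark)))
# A higher R^2 for the log fit would be consistent with a saturation effect.
```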

The authors argue that effective human‑AI collaboration in high‑risk medical settings requires “trust calibration” mechanisms that align subjective trust with objective performance. Potential solutions include providing uncertainty estimates, step‑by‑step explanations, and real‑time performance feedback within the user interface. Visualizing the evaluation dimensions and integrating LLM assistance seamlessly into electronic health record workflows are recommended design strategies.
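The paper frames these as design goals rather than a concrete implementation. The sketch below shows one hypothetical way an interface could surface an uncertainty estimate and step‑by‑step reasoning next to a suggestion; all names and fields are assumed for illustration.

```python
# Hypothetical presentation of calibration cues alongside an LLM suggestion.
from dataclasses import dataclass

@dataclass
class DiagnosticSuggestion:
    diagnosis: str
    confidence: float            # model-reported uncertainty estimate, 0..1
    reasoning_steps: list[str]   # step-by-step explanation shown to the physician

def render(s: DiagnosticSuggestion) -> str:
    lines = [
        f"Suggested diagnosis: {s.diagnosis}",
        f"Model confidence:    {s.confidence:.0%}",
        "Reasoning:",
    ]
    lines += [f"  {i}. {step}" for i, step in enumerate(s.reasoning_steps, 1)]
    return "\n".join(lines)

print(render(DiagnosticSuggestion(
    diagnosis="Community-acquired pneumonia",
    confidence=0.72,
    reasoning_steps=[
        "Fever and productive cough for three days",
        "Focal crackles on auscultation",
        "Right lower lobe consolidation on chest X-ray",
    ],
)))
```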

Finally, the study contributes a novel evaluation framework that combines objective benchmark metrics with a physician‑derived PCS, offering a more clinically grounded method for assessing LLMs. This framework can guide future development, validation, and deployment of trustworthy AI‑assisted diagnostic tools, ensuring that advances in LLM capability translate into genuine clinical utility and safe, reliable decision support.


Comments & Academic Discussion

Loading comments...

Leave a Comment