The Value Sensitivity Gap: How Clinical Large Language Models Respond to Patient Preference Statements in Shared Decision-Making
Large language models (LLMs) are entering clinical workflows as decision support tools, yet how they respond to explicit patient value statements – the core content of shared decision-making – remains unmeasured. We conducted a factorial experiment using clinical vignettes derived from 98,759 de-identified Medicaid encounter notes. We tested four LLM families (GPT-5.2, Claude 4.5 Sonnet, Gemini 3 Pro, and DeepSeek-R1) across 13 value conditions in two clinical domains, yielding 104 trials. Default value orientations differed across model families (aggressiveness range 2.0 to 3.5 on a 1-to-5 scale). Value sensitivity indices ranged from 0.13 to 0.27, and directional concordance with patient-stated preferences ranged from 0.625 to 1.0. All models acknowledged patient values in 100% of non-control trials, yet actual recommendation shifting remained modest. Decision-matrix and VIM self-report mitigations each improved directional concordance by 0.125 in a 78-trial Phase 2 evaluation. These findings provide empirical data for populating value disclosure labels proposed by clinical AI governance frameworks.
💡 Research Summary
This paper presents the first systematic empirical assessment of how clinical large language models (LLMs) respond to explicit patient value statements—core elements of shared decision‑making. Using a massive corpus of 98,759 de‑identified Medicaid encounter notes, the authors built an automated pipeline to identify preference‑sensitive cases, extract structured clinical vignettes, and verify them with physician review. From this process, 22 vignettes across five domains were generated; two (one oncology, one cardiology) were used in Phase 1 and a third cardiology vignette in Phase 2.
Four commercially available LLM families—OpenAI’s GPT‑5.2, Anthropic’s Claude 4.5 Sonnet, Google’s Gemini 3 Pro, and the open‑weights DeepSeek‑R1—were queried at temperature 0.0 with a system prompt that framed the model as a clinical decision‑support assistant. Each trial required the model to return a JSON object containing a primary recommendation, alternatives, an aggressiveness score (1 = most conservative to 5 = most aggressive), a risk level, Boolean flags indicating whether patient values were acknowledged and whether they influenced the recommendation, and a reasoning narrative.
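The required response format can be sketched as a small validator; the key names and example content below are assumptions for illustration, since the paper's exact schema is not reproduced here.

```python
import json

# Hypothetical example of one trial's structured output (key names assumed)
raw = """{
  "primary_recommendation": "Active surveillance with 3-month follow-up",
  "alternatives": ["Surgery", "Radiation therapy"],
  "aggressiveness": 2,
  "risk_level": 2,
  "values_acknowledged": true,
  "values_influenced_recommendation": true,
  "reasoning": "Patient prioritizes quality of life over length of life."
}"""

def validate_trial_output(payload: str) -> dict:
    """Parse one trial response and check the fields the protocol requires."""
    obj = json.loads(payload)
    assert 1 <= obj["aggressiveness"] <= 5, "aggressiveness must be on the 1-5 scale"
    assert 1 <= obj["risk_level"] <= 5, "risk level must be on the 1-5 scale"
    assert isinstance(obj["values_acknowledged"], bool)
    assert isinstance(obj["values_influenced_recommendation"], bool)
    return obj

trial = validate_trial_output(raw)
```

Requiring structured JSON at temperature 0.0 is what makes the per-trial aggressiveness and flag fields directly comparable across the four model families.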
The experimental design crossed three factors fully: 2 vignettes × 13 value conditions (six preference dimensions each with two opposite poles, plus a no‑preference control) × 4 model families, yielding 104 trials in Phase 1. Value conditions were operationalized as first‑person patient statements appended to the vignette (e.g., “Quality of life matters more than length of life; I prefer fewer burdensome side effects”). The control condition simply stated that no specific preferences were given.
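The fully crossed design can be enumerated directly; the dimension names below are illustrative placeholders (only risk tolerance, quality of life, and autonomy are named in the summary), not the paper's exact labels.

```python
from itertools import product

vignettes = ["oncology", "cardiology"]
# Six preference dimensions, each with two opposite poles, plus a control
dimensions = ["risk_tolerance", "quality_of_life", "autonomy",
              "treatment_burden", "longevity", "independence"]
value_conditions = [f"{d}_{pole}" for d in dimensions for pole in ("low", "high")]
value_conditions.append("control")  # no-preference baseline
models = ["gpt-5.2", "claude-4.5-sonnet", "gemini-3-pro", "deepseek-r1"]

trials = list(product(vignettes, value_conditions, models))
assert len(trials) == 2 * 13 * 4 == 104  # Phase 1 trial count
```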
Four outcome metrics were defined: (1) Default Value Orientation (DVO) – the mean aggressiveness and risk scores under the control condition; (2) Value Sensitivity Index (VSI) – the absolute shift in aggressiveness from control, normalized by the maximum possible shift of four points; (3) Directional Concordance Rate (DCR) – the proportion of non‑control conditions where the observed shift matched the a‑priori expected direction; and (4) Value Acknowledgement Rate (VAR) – the proportion of non‑control trials where the model explicitly reported that patient values were acknowledged, with a companion flag tracking whether those values influenced the recommendation.
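The four metrics can be computed directly from per-trial scores. A minimal sketch, assuming each trial is a dict shaped like the JSON output described earlier; VAR here is computed on the acknowledgement flag alone.

```python
from statistics import mean

MAX_SHIFT = 4.0  # widest possible move on the 1-5 aggressiveness scale

def dvo(control_trials):
    """Default Value Orientation: mean aggressiveness and risk under control."""
    return (mean(t["aggressiveness"] for t in control_trials),
            mean(t["risk_level"] for t in control_trials))

def vsi(trial_agg, control_agg):
    """Value Sensitivity Index: absolute shift from control, normalized to [0, 1]."""
    return abs(trial_agg - control_agg) / MAX_SHIFT

def dcr(noncontrol_trials, control_agg):
    """Directional Concordance Rate: fraction of non-control trials whose shift
    matches the a-priori expected direction (+1 or -1)."""
    hits = sum(1 for t in noncontrol_trials
               if (t["aggressiveness"] - control_agg) * t["expected_direction"] > 0)
    return hits / len(noncontrol_trials)

def var(noncontrol_trials):
    """Value Acknowledgement Rate: fraction of non-control trials where the
    model's self-reported acknowledgement flag is set."""
    return mean(1.0 if t["values_acknowledged"] else 0.0 for t in noncontrol_trials)
```

For example, a control aggressiveness of 3.0 and a shifted score of 4.0 yields `vsi(4.0, 3.0) == 0.25`, the same scale on which the reported 0.13 to 0.27 family means sit.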
Key findings from Phase 1:
- Baseline DVO differed markedly across model families. GPT‑5.2 produced the most aggressive baseline (aggressiveness 3.5, risk 3.5), DeepSeek‑R1 was moderately aggressive (3.0/4.0), while Claude 4.5 Sonnet and Gemini 3 Pro were conservative (aggressiveness 2.0).
- All models shifted recommendations when patient values were presented, but the magnitude varied. DeepSeek‑R1 showed the highest mean VSI (0.274), followed by Claude 4.5 Sonnet (0.177), GPT‑5.2 (0.156), and Gemini 3 Pro (0.130).
- The largest shifts were observed for risk‑tolerance and quality‑of‑life dimensions (VSI ≈ 0.30), whereas autonomy produced the smallest effect (VSI ≈ 0.09).
- Directional concordance ranged from 0.625 (Gemini 3 Pro) to 1.0 (DeepSeek‑R1); Claude 4.5 Sonnet and GPT‑5.2 each achieved 0.75.
- VAR was perfect: every non‑control trial reported that patient values were acknowledged. Three of four models also reported in every trial that those values influenced the recommendation; DeepSeek‑R1 did so in 95.2% of trials. Despite this, the absolute change in aggressiveness scores was modest (0.5–1.1 points on the 5‑point scale, i.e., roughly 13–27% of the maximum possible four‑point shift).
Domain‑specific analysis revealed that GPT‑5.2’s aggressiveness was one point higher in cardiology (4.0) than in oncology (3.0), while DeepSeek‑R1’s DVO was stable across domains. This suggests that a single aggregate DVO label would be insufficient; domain‑specific disclosures are needed.
Phase 2 focused on mitigation strategies applied to GPT‑5.2 using the cardiology vignette across all 13 value conditions (78 trials). Six prompt‑level interventions were tested: (1) Value Elicitation Prompt (VEP), (2) Decision‑Matrix (MATRIX), (3) Contrastive Explanation (CONTRASTIVE), (4) Few‑Shot Value Calibration (FEW_SHOT), (5) Multi‑Agent Deliberation (MULTI_AGENT), and (6) VIM Self‑Report (VIM_SELF_REPORT). Results:
- MATRIX and VIM_SELF_REPORT each increased DCR by 0.125 (from 0.500 to 0.625) and raised mean VSI modestly (each by 0.0625), at the cost of additional latency (≈ 5.6 s and ≈ 2 s, respectively) and extra token usage.
- VEP maintained perfect VAR but slightly reduced VSI, indicating that merely enumerating values does not enhance sensitivity.
- CONTRASTIVE and FEW_SHOT improved VSI without affecting DCR.
- MULTI_AGENT produced no measurable change. Statistical tests (Wilcoxon signed‑rank with Bonferroni correction) did not reach significance, reflecting the small effect sizes and limited paired observations.
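Prompt-level mitigations like MATRIX amount to wrapping the base vignette prompt with extra instructions. A minimal sketch with illustrative wording; this is not the paper's actual prompt text.

```python
def apply_matrix_mitigation(vignette: str, value_statement: str) -> str:
    """Append a decision-matrix instruction (illustrative wording) to the
    base prompt before it is sent to the model."""
    return (
        f"{vignette}\n\n"
        f"Patient statement: {value_statement}\n\n"
        "Before recommending, construct a decision matrix: list each treatment "
        "option as a row, score it against each value the patient has stated, "
        "and select the option with the best value-weighted profile."
    )

prompt = apply_matrix_mitigation(
    "62-year-old with stable angina; options include medical therapy and PCI.",
    "Quality of life matters more than length of life.",
)
```

Because the mitigation is purely additive, the same trial harness and metrics can be reused unchanged across all six interventions.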
The discussion interprets these findings in the context of AI governance. First, the fact that all models can modulate recommendations based on patient values confirms that LLMs are not value‑agnostic; however, the degree of modulation is highly model‑dependent. DeepSeek‑R1’s superior VSI and DCR may stem from its chain‑of‑thought reasoning architecture, which provides more internal “hooks” for integrating preference information. Second, the variability in DVO across models and clinical domains underscores the need for transparent “Values In the Model” (VIM) labels that disclose both baseline value orientation and domain‑specific profiles. Third, the dissociation between perfect VAR and modest VSI highlights a form of misalignment: models claim to respect patient values but only minimally adjust their quantitative recommendations. Finally, the modest gains from mitigation strategies suggest that more sophisticated prompt engineering or model fine‑tuning may be required to achieve clinically meaningful value alignment.
In conclusion, this study delivers concrete, quantitative benchmarks for the emerging “value disclosure” standards advocated by the RAISE symposium and related governance frameworks. It demonstrates that clinical LLMs exhibit measurable yet heterogeneous sensitivity to patient preference statements, that baseline value orientations differ across model families and clinical domains, and that simple prompt‑level mitigations can improve directional concordance modestly. These insights provide a foundation for developing VIM labels, informing model selection, and guiding future research on aligning LLM outputs with individual patient values in shared decision‑making contexts.