Ethical Risks of Large Language Models in Medical Consultation: An Assessment Based on Reproductive Ethics

Ethical Risks of Large Language Models in Medical Consultation: An Assessment Based on Reproductive Ethics
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Background: As large language models (LLMs) are increasingly used in healthcare and medical consultation settings, a growing concern is whether these models can respond to medical inquiries in a manner that is ethically compliant–particularly in accordance with local ethical standards. To address the pressing need for comprehensive research on reliability and safety, this study systematically evaluates LLM performance in answering questions related to reproductive ethics, specifically assessing their alignment with Chinese ethical regulations. Methods: We evaluated eight prominent LLMs (e.g., GPT-4, Claude-3.7) on a custom test set of 986 questions (906 subjective, 80 objective) derived from 168 articles within Chinese reproductive ethics regulations. Subjective responses were evaluated using a novel six-dimensional scoring rubric assessing Safety (Normative Compliance, Guidance Safety) and Quality of the Answer (Problem Identification, Citation, Suggestion, Empathy). Results: Significant safety issues were prevalent, with risk rates for unsafe or misleading advice reaching 29.91%. A systemic weakness was observed across all models: universally poor performance in citing normative sources and expressing empathy. We also identified instances of anomalous moral reasoning, including logical self-contradictions and responses violating fundamental moral intuitions. Conclusions: Current LLMs are unreliable and unsafe for autonomous reproductive ethics counseling. Despite knowledge recall, they exhibit critical deficiencies in safety, logical consistency, and essential humanistic skills. These findings serve as a critical cautionary note against premature deployment, urging future development to prioritize robust reasoning, regulatory justification, and empathy.


💡 Research Summary

This paper presents a systematic evaluation of eight prominent large language models (LLMs) – including GPT‑4 Turbo, Claude‑3.7‑Sonnet‑Thinking, DeepSeek‑r1‑671b, DeepSeek‑r1‑7b, Doubao, Qwen2.5‑72b, Qwen2.5‑7b, and JingyiQianxun – in the context of medical consultation on reproductive ethics. The authors first compiled a corpus of six Chinese regulatory documents governing assisted reproductive technologies, extracting 194 articles and, after expert filtering, retaining 168 distinct clauses that form the legal backbone of reproductive ethics in China. Using a “clause‑based” question generation strategy, they transformed each clause into realistic clinical scenarios, ensuring relevance, subtlety, multi‑perspectivity, and authenticity. After rigorous review, the final test set comprised 986 questions: 906 open‑ended (subjective) and 80 multiple‑choice (objective) items.

For the objective questions, a strict scoring rule required models to select all correct options without any false ones. Accuracy varied dramatically across models, ranging from 71.25 % (Claude‑3.7‑Thinking) down to 22.5 % (DeepSeek‑r1‑7b), with a clear positive correlation between model size and performance.

The subjective items were evaluated with a novel six‑dimensional rubric. The first stage, Risk Assessment, flagged any response that violated either (1) Normative Compliance (i.e., factual or regulatory inaccuracy) or (2) Guidance Safety (i.e., advice that could be illegal, unsafe, or lacking necessary warnings). A –1 score on either dimension marked the response as high‑risk and excluded it from further analysis. The second stage, Quality Assessment, scored four binary dimensions: Problem Identification, Citation of Ethical Guidelines, Actionable Suggestions, and Empathetic Engagement. Overall Score combined a –1 penalty for high‑risk answers with the sum of the four quality scores (0–4) for safe answers.

Results showed a stark safety divide. DeepSeek‑r1‑7b produced unsafe or misleading advice in 29.91 % of cases, while GPT‑4 Turbo’s risk rate was 16.00 %. The safest models were DeepSeek‑r1‑671b (3.75 %) and Claude‑3.7‑Sonnet‑Thinking (4.75 %). Across all models, performance on Citation and Empathy was uniformly poor; average scores for these dimensions were below 0.3, indicating that models rarely referenced the specific legal provisions nor demonstrated compassionate language. Conversely, they performed relatively better on Problem Identification and Actionable Suggestions, suggesting that they can recognize dilemmas and propose steps, but without grounding those steps in law or showing human‑centered care.

Qualitative analysis uncovered logical self‑contradictions and violations of basic moral intuitions. Some models simultaneously cited a regulation prohibiting a procedure while recommending that procedure, reflecting a failure of coherent ethical reasoning. The authors also noted that the automated scorer, fine‑tuned on the rubric, achieved an overall 88.5 % accuracy compared with human raters, but its performance varied across dimensions, especially for empathy detection.

The study acknowledges several limitations: (1) the dataset is China‑specific, limiting generalizability to other jurisdictions; (2) human evaluation was performed by only two ethicists, introducing potential subjectivity; (3) the automated scoring system, while efficient, still misclassifies a non‑trivial portion of responses; (4) differences in training data and architecture among the models were not fully controlled.

In conclusion, while current LLMs can recall regulatory facts, they are unreliable for autonomous reproductive‑ethics counseling due to significant safety risks, poor citation practices, and lack of empathetic engagement. The authors caution against premature deployment in high‑stakes medical contexts and call for future work that (i) enhances models’ ability to cite and justify recommendations with legal sources, (ii) integrates real‑time safety verification mechanisms, (iii) expands evaluation to multi‑jurisdictional ethical frameworks, and (iv) explores hybrid human‑AI counseling workflows to combine the breadth of LLM knowledge with the nuanced judgment of trained professionals.


Comments & Academic Discussion

Loading comments...

Leave a Comment