Can Reasoning LLMs Enhance Clinical Document Classification?


Clinical document classification is essential for converting unstructured medical texts into standardised ICD-10 diagnoses, yet it faces challenges from complex medical language, privacy constraints, and limited annotated datasets. Large Language Models (LLMs) offer promising improvements in accuracy and efficiency for this task. This study evaluates the performance and consistency of eight LLMs in classifying clinical discharge summaries from the MIMIC-IV dataset: four reasoning models (Qwen QWQ, Deepseek Reasoner, GPT o3 Mini, Gemini 2.0 Flash Thinking) and four non-reasoning models (Llama 3.3, GPT 4o Mini, Gemini 2.0 Flash, Deepseek Chat). Using cTAKES to structure clinical narratives, each model was assessed across three experimental runs, with majority voting determining final predictions. Reasoning models outperformed non-reasoning models in accuracy (71% vs 68%) and F1 score (67% vs 60%), with Gemini 2.0 Flash Thinking achieving the highest accuracy (75%) and F1 score (76%). However, non-reasoning models demonstrated greater stability (91% vs 84% consistency). Performance varied across ICD-10 codes: reasoning models excelled in complex cases but struggled with abstract categories. These findings indicate a trade-off between accuracy and consistency, suggesting that a hybrid approach could optimise clinical coding. Future research should explore multi-label classification, domain-specific fine-tuning, and ensemble methods to enhance model reliability in real-world applications.


💡 Research Summary

The paper investigates whether reasoning‑enhanced large language models (LLMs) can improve the classification of clinical discharge summaries into ICD‑10 codes, a task central to automated clinical coding. Using a balanced subset of the MIMIC‑IV database, the authors selected the top ten ICD‑10 diagnoses and extracted 150 positive and 150 negative discharge summaries for each code, yielding a total of 3,000 records. To mitigate the challenges of raw clinical text, they processed the summaries with cTAKES, converting narratives into structured SNOMED entities annotated for affirmation or negation, thereby reducing input length and standardizing terminology.
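The preprocessing step can be pictured as follows. This is a minimal, hypothetical sketch of turning negation-annotated entities (the kind of output cTAKES produces) into a compact, standardized model input; the entity list, field layout, and function name are illustrative assumptions, not the paper's exact schema or the cTAKES API.

```python
def format_entities(entities):
    """Render (term, negated) pairs as compact affirmed/negated lines,
    replacing the full narrative with standardized terminology."""
    affirmed = [term for term, negated in entities if not negated]
    negated = [term for term, negated in entities if negated]
    lines = []
    if affirmed:
        lines.append("Affirmed findings: " + "; ".join(affirmed))
    if negated:
        lines.append("Negated findings: " + "; ".join(negated))
    return "\n".join(lines)

# Illustrative example: entities extracted from a discharge summary,
# with negation flags (True = the condition was ruled out in the text).
example = [
    ("sepsis", False),
    ("myocardial infarction", True),
    ("acute kidney injury", False),
]
print(format_entities(example))
```

The point of this condensation is the one the authors give: shorter inputs and standardized terminology, so the classifier sees clinical concepts rather than free-form prose.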

Eight LLMs were evaluated: four reasoning models (Qwen QWQ, Deepseek Reasoner, GPT‑o3 Mini, Gemini Flash Thinking) and four non‑reasoning models (Llama 3.3, GPT‑4o Mini, Gemini Flash, Deepseek Chat). Each model received the same prompt—“Discharge Summary: … Does this summary contain the diagnosis associated with ICD‑10 code …? Answer Yes or No only?”—and was run three independent times on the entire dataset. Final predictions were derived by majority voting across the three runs, allowing the authors to assess not only accuracy and F1‑score but also consistency (the proportion of runs that produced the same label).

Results show that reasoning models achieved a higher average accuracy (71%) and F1 (67%) than non‑reasoning models (68% accuracy, 60% F1). The best performer, Gemini Flash Thinking, reached 75% accuracy and 76% F1, excelling especially on complex, nuanced cases. Conversely, GPT‑4o Mini recorded the lowest scores (64% accuracy, 47% F1), highlighting that model size, prompt engineering, and domain alignment critically affect performance. Consistency, however, favored non‑reasoning models, which averaged 91% versus 84% for reasoning models, indicating that the additional reasoning steps introduce variability across runs.
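For reference, the two metrics reported above follow the standard definitions for binary classification. The sketch below is a plain restatement of those formulas, not code from the paper; the label convention ("Yes" as the positive class) mirrors the prompt's Yes/No answers.

```python
def accuracy_f1(y_true, y_pred, positive="Yes"):
    """Standard accuracy and F1 for binary Yes/No labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)

    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return acc, f1

# Toy example: 4 summaries, 2 correct predictions.
acc, f1 = accuracy_f1(["Yes", "Yes", "No", "No"], ["Yes", "No", "No", "Yes"])
print(acc, f1)  # 0.5 0.5
```

Because the evaluation set is balanced (150 positive and 150 negative summaries per code), a large gap between accuracy and F1, as in GPT‑4o Mini's 64% vs 47%, points to an imbalance between precision and recall on the positive class.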

A per‑code analysis revealed that both model families performed well on well‑defined conditions (e.g., sepsis, myocardial infarction) but struggled with abstract or less concrete categories, underscoring the limits of current LLMs in capturing subtle clinical semantics. The authors interpret these findings as a trade‑off: reasoning models boost diagnostic accuracy, particularly for intricate narratives, but at the cost of stability; non‑reasoning models offer steadier outputs but with modest accuracy.

Consequently, the paper proposes a hybrid or ensemble approach that leverages the strengths of both model types to achieve optimal clinical coding—high accuracy where needed and reliable consistency overall. Future research directions include extending the task to multi‑label classification, fine‑tuning LLMs on domain‑specific corpora, and developing sophisticated ensemble methods to improve both performance and generalizability in real‑world healthcare settings. The study contributes valuable empirical evidence to the ongoing debate about the practical utility of reasoning capabilities in LLMs for medical NLP applications.

