From Scores to Steps: Diagnosing and Improving LLM Performance in Evidence-Based Medical Calculations
Large language models (LLMs) have demonstrated promising performance on medical benchmarks; however, their ability to perform medical calculations, a crucial aspect of clinical decision-making, remains underexplored and poorly evaluated. Existing benchmarks often assess only the final answer with a wide numerical tolerance, overlooking systematic reasoning failures and potentially causing serious clinical misjudgments. In this work, we revisit medical calculation evaluation with a stronger focus on clinical trustworthiness. First, we clean and restructure the MedCalc-Bench dataset and propose a new step-by-step evaluation pipeline that independently assesses formula selection, entity extraction, and arithmetic computation. Under this granular framework, the accuracy of GPT-4o drops from 62.7% to 43.6%, revealing errors masked by prior evaluations. Second, we introduce an automatic error analysis framework that generates structured attribution for each failure mode. Human evaluation confirms its alignment with expert judgment, enabling scalable and explainable diagnostics. Finally, we propose a modular agentic pipeline, MedRaC, that combines retrieval-augmented generation and Python-based code execution. Without any fine-tuning, MedRaC improves the accuracy of different LLMs by margins ranging from 16.35% to 53.19%. Our work highlights the limitations of current benchmark practices and proposes a more clinically faithful methodology. By enabling transparent and transferable reasoning evaluation, we move closer to making LLM-based systems trustworthy for real-world medical applications.
💡 Research Summary
This paper critically re‑examines the ability of large language models (LLMs) to perform medical calculations, a task that underpins many clinical decision‑support applications. Existing benchmarks, most notably MedCalc‑Bench, evaluate only the final numeric answer within a generous ±5 % tolerance, thereby masking systematic failures in intermediate reasoning steps such as selecting the wrong formula, mis‑extracting patient variables, or making arithmetic mistakes. Moreover, the authors discovered numerous data inconsistencies and obsolete entries in the original dataset.
To address these issues, the authors first clean the benchmark, removing 108 faulty cases and retaining 940 high‑quality clinical calculation scenarios drawn from 55 widely used MDCalc calculators. They then introduce a step‑wise evaluation pipeline that decomposes each task into four sequential components: (1) formula selection, (2) variable extraction, (3) mathematical computation, and (4) final answer verification. Each component is judged by an independent LLM‑as‑judge using binary correctness criteria, and a logical dependency V(S_i) ⇒ V(S_{i‑1}) is enforced, so a step counts as correct only if all preceding steps are also correct. The computation step adopts the strict tolerance defined by the original calculators (e.g., ±0.005 for two‑decimal answers) rather than the blanket ±5 % used previously.
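The dependency rule and the calculator‑level tolerance can be sketched in a few lines of Python (a minimal illustration; the function and variable names are ours, not the paper's):

```python
def propagate_validity(step_judgments):
    """Given per-step binary judgments in order (formula selection, variable
    extraction, computation, final answer), mark a step as valid only if it
    and every preceding step were judged correct."""
    valid, ok = [], True
    for judged_correct in step_judgments:
        ok = ok and judged_correct
        valid.append(ok)
    return valid

def within_calculator_tolerance(predicted, reference, decimals=2):
    """Strict tolerance tied to the calculator's reported precision: half a
    unit in the last decimal place (e.g. +/-0.005 for two-decimal answers),
    instead of a blanket +/-5% band."""
    return abs(predicted - reference) <= 0.5 * 10 ** (-decimals)
```

For instance, if the judge marks the computation step wrong, the final‑answer step is also counted wrong even when the number happens to match, which is exactly what closes the loophole that the blanket tolerance opened.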
Building on this granular rubric, the paper proposes an automatic error‑analysis framework. A high‑performance LLM is prompted to compare a model’s output with the gold‑standard reference, assign binary correctness for each step, and categorize any failure into one of eight clinically meaningful error types: formula misselection/hallucination, incorrect variable extraction, clinical misinterpretation, missing variables, demographic adjustment failure, unit‑conversion error, arithmetic error, and rounding/precision error. Human expert validation shows >92 % agreement with the automated judgments, demonstrating that the framework can scale diagnostic feedback without costly manual review.
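One way to picture the structured attribution is as a fixed eight‑way enumeration that the judge's free‑text label is mapped onto, so no failure can fall outside the taxonomy (the identifier names below are our shorthand for the paper's categories, and the record fields are illustrative):

```python
from dataclasses import dataclass
from enum import Enum

class ErrorType(Enum):
    FORMULA_MISSELECTION = "formula misselection/hallucination"
    VARIABLE_EXTRACTION = "incorrect variable extraction"
    CLINICAL_MISINTERPRETATION = "clinical misinterpretation"
    MISSING_VARIABLES = "missing variables"
    DEMOGRAPHIC_ADJUSTMENT = "demographic adjustment failure"
    UNIT_CONVERSION = "unit-conversion error"
    ARITHMETIC = "arithmetic error"
    ROUNDING_PRECISION = "rounding/precision error"

@dataclass
class ErrorAttribution:
    case_id: str
    failed_step: str        # "formula", "extraction", "computation", "answer"
    error_type: ErrorType
    judge_rationale: str    # free-text explanation from the LLM judge

def parse_judge_label(label: str) -> ErrorType:
    """Map the judge's textual label onto the fixed taxonomy; raise on
    anything outside the eight categories rather than silently lumping it."""
    normalized = label.strip().lower()
    for et in ErrorType:
        if et.value == normalized:
            return et
    raise ValueError(f"unknown error label: {label!r}")
```

Constraining the judge's output to a closed set like this is what makes the per‑category error counts comparable across models and prompting strategies.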
The most consequential contribution is MedRaC, a modular, training‑free agentic pipeline designed to remediate the identified error classes. MedRaC consists of (1) Formula‑RAG, which indexes the full set of MDCalc formulas and their textual descriptions, retrieves the most relevant formula for a given vignette, and injects it into the prompt to eliminate formula‑selection mistakes; and (2) Python Code Execution, which instructs the LLM to emit executable Python code representing the retrieved formula and the extracted variables, then runs the code to obtain a precise numeric result, thereby eradicating arithmetic, rounding, and precision errors. Both components are plug‑and‑play and require no fine‑tuning, making the approach applicable to any LLM accessible via an API.
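A toy sketch of the two MedRaC stages, with a trivial token‑overlap retriever standing in for the paper's embedding‑based Formula‑RAG and a two‑entry formula index (the calculator entries, formulas, and helper names here are illustrative, not the paper's implementation):

```python
# Illustrative index: calculator name -> (retrieval description, Python formula).
FORMULA_INDEX = {
    "BMI": ("body mass index weight height",
            "bmi = weight_kg / height_m ** 2"),
    "CrCl (Cockcroft-Gault, male)": (
            "creatinine clearance age weight serum creatinine",
            "crcl = ((140 - age) * weight_kg) / (72 * serum_cr)"),
}

def retrieve_formula(vignette: str):
    """Stage 1 (Formula-RAG): pick the calculator whose description best
    overlaps the vignette; a real system would use dense retrieval."""
    tokens = set(vignette.lower().split())
    name, (_, code) = max(FORMULA_INDEX.items(),
                          key=lambda kv: len(tokens & set(kv[1][0].split())))
    return name, code

def execute_formula(code: str, variables: dict) -> float:
    """Stage 2: run the (model-emitted) Python formula on the extracted
    variables instead of trusting in-context arithmetic. Sandbox exec()
    appropriately in any real deployment."""
    scope = dict(variables)
    exec(code, {}, scope)
    # The formula assigns its result to the one name not already in scope.
    return next(v for k, v in scope.items() if k not in variables)

name, code = retrieve_formula(
    "Compute body mass index for weight 70 kg and height 1.75 m")
result = execute_formula(code, {"weight_kg": 70, "height_m": 1.75})
```

Because the formula text is injected into the prompt and the arithmetic is delegated to the interpreter, the two stages attack formula‑selection errors and computation/rounding errors respectively, which matches the division of labor the paper describes.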
Extensive experiments compare the closed‑source GPT‑4o with open‑weight models (Qwen‑3‑8B, Phi‑4‑mini, LLaMA‑3.2‑3B) across multiple prompting strategies: direct answer, chain‑of‑thought (CoT), one‑shot exemplars, MedPrompt (k‑nearest‑neighbor retrieval), and Self‑Refine (iterative self‑critique). When evaluated with the original final‑answer metric, GPT‑4o appears to achieve 62.7 % accuracy; however, under the step‑wise rubric its true accuracy drops to 43.6 %, revealing a substantial over‑estimation. Applying MedRaC lifts performance across the board, with absolute accuracy gains ranging from 16.35 % to 53.19 %, and particularly strong improvements for smaller models (e.g., Phi‑4‑mini gains over 30 %). The results confirm that formula retrieval and executable code generation directly target the dominant failure modes identified by the error‑analysis taxonomy.
In conclusion, the paper demonstrates that evaluating LLMs solely on final numeric answers is insufficient for high‑stakes medical contexts. By introducing a transparent, step‑wise evaluation, an automated error‑attribution system, and a modular, training‑free enhancement pipeline, the authors provide a more clinically faithful methodology. This work paves the way for standardized, trustworthy assessment of LLMs in medical calculations and suggests future extensions to broader clinical tasks, multimodal data, and integration with electronic health records, ultimately moving LLM‑based decision support closer to safe real‑world deployment.