EvoClinician: A Self-Evolving Agent for Multi-Turn Medical Diagnosis via Test-Time Evolutionary Learning


Prevailing medical AI operates on an unrealistic “one-shot” model, diagnosing from a complete patient file. However, real-world diagnosis is an iterative inquiry in which clinicians sequentially ask questions and order tests to strategically gather information while managing cost and time. To address this, we first propose Med-Inquire, a new benchmark designed to evaluate an agent’s ability to perform multi-turn diagnosis. Built upon a dataset of real-world clinical cases, Med-Inquire simulates the diagnostic process by hiding a complete patient file behind specialized Patient and Examination agents. These agents force the diagnostic agent to proactively ask questions and order tests to gather information piece by piece. To tackle the challenges posed by Med-Inquire, we then introduce EvoClinician, a self-evolving agent that learns efficient diagnostic strategies at test time. Its core is a “Diagnose-Grade-Evolve” loop: an Actor agent attempts a diagnosis; a Process Grader agent performs credit assignment by evaluating each action for both clinical yield and resource efficiency; finally, an Evolver agent uses this feedback to update the Actor’s strategy by evolving its prompt and memory. Our experiments show that EvoClinician outperforms continual learning baselines and other self-evolving agents such as memory agents. The code is available at https://github.com/yf-he/EvoClinician

💡 Research Summary

The paper opens by critiquing the prevailing paradigm in medical AI, which assumes that a model receives a complete patient record and produces a diagnosis in a single shot. In real clinical practice, physicians start with only a chief complaint and iteratively ask targeted questions, perform physical examinations, and order diagnostic tests, all while balancing time and cost constraints. To bridge this gap, the authors introduce Med‑Inquire, a novel benchmark that simulates the multi‑turn diagnostic workflow. Med‑Inquire hides the full case file behind two specialized agents: a Patient Agent that answers free‑form symptom queries and an Examination Agent that returns test results. The diagnostic agent can only access information by asking questions or ordering tests, each incurring a predefined cost. After the agent decides to submit a final diagnosis, a Judge Agent assigns a graded correctness score (0‑100) based on a detailed rubric, while a Cost Estimator tallies the cumulative resource usage. This design yields three evaluation axes—diagnostic accuracy, number of interaction turns, and total cost—allowing a realistic assessment of both clinical performance and efficiency.
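
The interaction protocol described above can be illustrated with a minimal Python sketch. It is not the benchmark's actual code: the class names `HiddenCase` and `Encounter`, the price list, the `"default"` cost key, and the exact-match scoring (a placeholder for the rubric-based Judge Agent) are all assumptions made for illustration, and the case data is made up.

```python
from dataclasses import dataclass, field


@dataclass
class HiddenCase:
    """A case file the diagnostic agent never sees directly (toy data)."""
    history: dict[str, str]   # question topic -> patient answer
    tests: dict[str, str]     # test name -> result
    diagnosis: str


@dataclass
class Encounter:
    case: HiddenCase
    costs: dict[str, float]   # hypothetical price list; must contain "default"
    turns: int = 0
    total_cost: float = 0.0
    transcript: list[tuple[str, str, str]] = field(default_factory=list)

    def ask(self, topic: str) -> str:
        """Patient Agent stand-in: answer one history question."""
        return self._act("ask", topic, self.case.history)

    def order(self, test: str) -> str:
        """Examination Agent stand-in: return one test result."""
        return self._act("order", test, self.case.tests)

    def _act(self, kind: str, key: str, source: dict[str, str]) -> str:
        # Every action counts as one turn and adds its price to the running cost.
        self.turns += 1
        self.total_cost += self.costs.get(f"{kind}:{key}", self.costs["default"])
        reply = source.get(key, "no information available")
        self.transcript.append((kind, key, reply))
        return reply

    def submit(self, diagnosis: str) -> dict:
        # Exact match is a placeholder for the graded 0-100 rubric score.
        score = 100 if diagnosis.lower() == self.case.diagnosis.lower() else 0
        return {"score": score, "turns": self.turns, "cost": self.total_cost}


case = HiddenCase(
    history={"onset": "present since birth"},
    tests={"ultrasound": "toy result"},
    diagnosis="toy diagnosis",
)
enc = Encounter(case, costs={"default": 10.0, "order:ultrasound": 150.0})
enc.ask("onset")
enc.order("ultrasound")
result = enc.submit("Toy Diagnosis")
```

The returned dictionary carries the three evaluation axes named in the summary (score, turns, cost), so an agent that reaches the same score with fewer or cheaper actions is distinguishable from one that orders everything.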

To succeed on this benchmark, the authors propose EvoClinician, a self‑evolving agent that learns at test time through a “Diagnose‑Grade‑Evolve” loop. The loop comprises three components:

- Diagnose (Actor) – Guided by a current prompt and an external memory store, the Actor interacts with the Med‑Inquire environment, asking questions, ordering tests, and eventually submitting a diagnosis. All actions and responses are logged.
- Grade (Process Grader) – Instead of relying on the sparse final score, the Grader performs a post‑hoc review of the entire transcript. It assigns an action‑level label (e.g., HIGH YIELD, LOW YIELD, INEFFICIENT, CRITICAL ERROR) and a rationale for each step, effectively providing dense credit assignment that reflects both clinical value and resource efficiency.
- Evolve (Evolver) – Using the granular feedback, the Evolver updates the Actor’s strategy in two ways:
  - Prompt Evolution – High‑yield actions are abstracted into reusable instruction rules that are appended to the prompt (e.g., “If a patient presents with a scalp lump, always ask whether it has been present since birth”). Inefficient or erroneous actions are turned into prohibitions. This gradient‑free update modifies the high‑level policy without retraining the underlying language model.
  - Memory Evolution – Each action, its preceding context, the resulting information, and the assigned grade are stored as a discrete memory entry. When a similar context arises in future cases, the Actor retrieves relevant memories via retrieval‑augmented generation, enabling experience‑based adaptation.
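
The three components above can be sketched in Python to make the data flow concrete. This is a rough illustration under stated assumptions, not the authors' implementation: in the paper the Actor, Grader, and Evolver are LLM agents, whereas here the actor and grader are stub callables, rules are produced by a fixed string template, and retrieval uses simple word overlap instead of a real retrieval-augmented pipeline. All names (`GradedAction`, `ActorState`, `evolve`, `retrieve`, `diagnose_grade_evolve`) and the example actions are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class GradedAction:
    context: str     # what was known before the action
    action: str      # question asked or test ordered
    result: str      # information returned
    label: str       # "HIGH YIELD", "LOW YIELD", "INEFFICIENT", "CRITICAL ERROR"
    rationale: str


@dataclass
class ActorState:
    prompt_rules: list[str] = field(default_factory=list)
    memory: list[GradedAction] = field(default_factory=list)


def evolve(state: ActorState, graded: list[GradedAction]) -> None:
    """Evolver stand-in: template-based rules instead of LLM abstraction."""
    for g in graded:
        rule = None
        if g.label == "HIGH YIELD":
            rule = f"In context '{g.context}', prefer: {g.action}"
        elif g.label in ("INEFFICIENT", "CRITICAL ERROR"):
            rule = f"In context '{g.context}', avoid: {g.action}"
        if rule and rule not in state.prompt_rules:
            state.prompt_rules.append(rule)   # prompt evolution
        state.memory.append(g)                # memory evolution


def retrieve(state: ActorState, context: str, k: int = 2) -> list[GradedAction]:
    """Toy retrieval by word overlap with the stored context."""
    words = set(context.lower().split())

    def overlap(m: GradedAction) -> int:
        return len(words & set(m.context.lower().split()))

    ranked = sorted(state.memory, key=overlap, reverse=True)
    return [m for m in ranked[:k] if overlap(m) > 0]


def diagnose_grade_evolve(cases, actor, grader, state: ActorState) -> ActorState:
    for case in cases:
        transcript = actor(case, state)   # Diagnose
        graded = grader(transcript)       # Grade
        evolve(state, graded)             # Evolve
    return state


# Made-up graded actions standing in for a Process Grader's output.
graded = [
    GradedAction("scalp lump", "ask whether present since birth", "yes", "HIGH YIELD", "toy rationale"),
    GradedAction("scalp lump", "order an expensive scan first", "unremarkable", "INEFFICIENT", "toy rationale"),
    GradedAction("scalp lump", "ask about pain", "none", "LOW YIELD", "toy rationale"),
]
state = diagnose_grade_evolve(
    cases=["case-1"],
    actor=lambda case, st: case,      # stub: returns the case as its transcript
    grader=lambda transcript: graded, # stub: returns the fixed grades above
    state=ActorState(),
)
hits = retrieve(state, "infant with scalp lump")
```

Keeping rules and memories separate mirrors the split the summary describes: rules change the Actor's general policy for every future case, while memory entries only surface when a new context resembles a stored one.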

The authors evaluate EvoClinician on over a thousand real‑world clinical cases from Med‑Inquire, comparing it against several baselines: static prompting, memory‑augmented agents (e.g., Mem0), automatic prompt‑optimization methods (Prompt‑Breeder), and continual‑learning approaches. EvoClinician consistently achieves higher diagnostic grades (≈6‑8% absolute improvement) while reducing total encounter cost by 12‑15% relative to baselines. Notably, the prompt evolution component quickly captures high‑yield questioning patterns, allowing the agent to become efficient early in the learning curve. Memory retrieval further boosts performance on cases sharing similar clinical contexts, demonstrating a hybrid of rule‑based guidance and experiential knowledge.

The paper also discusses limitations. The cost model is fixed and may not capture the nuanced billing structures across healthcare systems. The Process Grader relies on handcrafted rubric‑based labeling, which could introduce systematic bias into the feedback loop. Accumulating many prompt rules risks exceeding language‑model input length limits, and the test‑time learning assumes case independence, limiting applicability to longitudinal patient tracking. Future work is suggested to (i) incorporate dynamic, simulation‑based cost estimation, (ii) replace rule‑based grading with LLM‑generated meta‑rewards to mitigate bias, (iii) develop prompt compression or selective rule activation techniques, and (iv) explore continual, cross‑case learning for chronic disease management.

In summary, this work proposes a realistic multi‑turn diagnostic benchmark and a novel self‑evolving agent architecture that jointly optimizes clinical accuracy and resource efficiency. By integrating multi‑agent interaction, dense action‑level feedback, and gradient‑free prompt/memory evolution, EvoClinician moves medical AI toward more human‑like diagnostic reasoning. The framework may also generalize to other domains that require sequential decision‑making under constraints, such as customer support or financial risk assessment.
