RadOnc-GPT: An Autonomous LLM Agent for Real-Time Patient Outcomes Labeling at Scale


Manual labeling limits the scale, accuracy, and timeliness of patient outcomes research in radiation oncology. We present RadOnc-GPT, an autonomous large language model (LLM)-based agent capable of independently retrieving patient-specific information, iteratively assessing evidence, and returning structured outcomes. Our evaluation explicitly validates RadOnc-GPT across two clearly defined tiers of increasing complexity: (1) a structured quality assurance (QA) tier, assessing the accurate retrieval of demographic and radiotherapy treatment plan details, followed by (2) a complex clinical outcomes labeling tier involving determination of mandibular osteoradionecrosis (ORN) in head-and-neck cancer patients and detection of cancer recurrence in independent prostate and head-and-neck cancer cohorts requiring combined interpretation of structured and unstructured patient data. The QA tier establishes foundational trust in structured-data retrieval, a critical prerequisite for successful complex clinical outcome labeling.


💡 Research Summary

The paper introduces RadOnc-GPT, an autonomous large‑language‑model (LLM) agent built on OpenAI’s GPT‑4o, designed to automate real‑time patient‑outcome labeling in radiation oncology. The authors address the bottleneck of manual registry curation by creating a two‑tier evaluation framework. In Tier 1 (structured quality assurance), the agent retrieves demographic fields (sex, race, ethnicity) and treatment‑plan details (course IDs, ICD codes, plan IDs, radiation type) for 500 patients using a set of whitelisted API functions that query internal Mayo Clinic databases (Aria, Epic) and external resources (PubMed, ClinicalTrials.gov). The agent achieved 100% exact match on all six demographic fields and 99.4% accuracy on radiation‑course counts, demonstrating that function‑driven data retrieval can reproduce structured records without human oversight.
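The whitelisting pattern described above can be sketched as a small dispatch layer that only executes pre-approved retrieval routines. The function names and return fields below are hypothetical stand-ins; the paper's actual Aria/Epic API routines are not public.

```python
# Minimal sketch of whitelisted tool dispatch for an LLM agent.
# Only functions registered here can be invoked, regardless of what
# the model requests. Names and payloads are illustrative assumptions.

WHITELIST = {
    "get_demographics": lambda pid: {"patient_id": pid, "sex": "F", "race": "White"},
    "get_treatment_plans": lambda pid: [{"course_id": "C1", "radiation_type": "photon"}],
}

def call_tool(name: str, patient_id: str):
    """Dispatch an LLM-requested tool call, rejecting anything off-whitelist."""
    if name not in WHITELIST:
        raise ValueError(f"Function {name!r} is not whitelisted")
    return WHITELIST[name](patient_id)
```

Keeping the registry explicit means the agent can decide *which* routine to invoke, but never *what* routines exist, which is what makes unattended operation defensible.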

Tier 2 (complex clinical outcomes) tests the agent’s ability to synthesize structured data with unstructured clinical notes, radiology reports, and pathology reports to label three outcomes: mandibular osteoradionecrosis (ORN) in head‑and‑neck cancer (HNC) patients, prostate cancer recurrence, and HNC recurrence after surgery. A single recurrence‑detection prompt was applied to both prostate and HNC cohorts to assess cross‑disease generalization. Initial automated labeling yielded accuracies of 84.5% (ORN), 92.5% (prostate recurrence), and 92.7% (HNC recurrence). Discrepancies (48 cases) were adjudicated by independent radiation oncologists and classified as model error (13), ground‑truth error (30), or indeterminate (5). After adjudication, final accuracies rose to 95.2% (ORN), 95.0% (prostate), and 96.3% (HNC), revealing that the majority of initial mismatches were hidden errors in the existing registry rather than failures of the LLM.

Technical innovations include: (1) a modular function library that allows the LLM to decide autonomously which data‑access routine to invoke, reducing token consumption compared with traditional retrieval‑augmented generation (RAG); (2) a controlled pruning strategy that trims older conversation turns when token limits are approached, preserving the system prompt while maintaining context relevance; (3) an external orchestration layer called “LLM Task Streaming” that feeds patient IDs and task‑specific prompts to the agent, collects JSON outputs, and aggregates them into CSV files for downstream analysis. Processing time per patient ranged from 10 to 30 seconds, limited only by API rate limits.
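The controlled pruning strategy can be illustrated with a short sketch: drop the oldest non-system turns until the context fits, always preserving the system prompt. The word-count token estimate is a crude stand-in assumption; a real deployment would use the model's tokenizer.

```python
def estimate_tokens(msgs):
    # Rough word-count proxy for tokens; real systems use a tokenizer.
    return sum(len(m["content"].split()) for m in msgs)

def prune_history(messages, max_tokens):
    """Trim the oldest user/assistant turns so the context fits under
    max_tokens. The system prompt (messages[0]) is always preserved,
    matching the controlled-pruning behavior described above."""
    system, turns = messages[0], list(messages[1:])
    while turns and estimate_tokens([system] + turns) > max_tokens:
        turns.pop(0)  # discard the oldest turn first
    return [system] + turns
```

Pruning from the oldest end keeps the most recent evidence in context, which matters when the agent is iteratively assessing notes for a single patient.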

Cost analysis notes that GPT‑4o’s pricing ($2.50 per 1M input tokens) makes large‑scale deployment financially feasible, especially when the agent’s function‑based retrieval dramatically reduces the number of tokens needed versus raw text retrieval.
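The cost argument is simple arithmetic on the cited input price; the per-patient token count below is a hypothetical assumption for illustration, not a figure from the paper.

```python
PRICE_PER_MILLION_INPUT = 2.50  # USD per 1M input tokens (GPT-4o price cited above)

def input_cost(num_patients: int, tokens_per_patient: int) -> float:
    """Estimated input-token cost in USD for labeling a cohort."""
    return num_patients * tokens_per_patient * PRICE_PER_MILLION_INPUT / 1_000_000

# Example: a 500-patient cohort at an assumed 20,000 input tokens each
# consumes 10M tokens, i.e. $25 of input-side spend.
```

Because function-based retrieval fetches only the fields needed rather than whole documents, the effective tokens-per-patient stays far below what raw-text RAG would consume.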

The authors conclude that RadOnc-GPT reliably reproduces structured data, generalizes to complex outcome labeling across disease sites, and serves simultaneously as a labeler and an auditor, uncovering latent registry inaccuracies. This dual capability promises scalable, trustworthy, near‑real‑time curation of radiation‑oncology research datasets and could be extended to other cancer types and treatment modalities.

