TimeTox: An LLM-Based Pipeline for Automated Extraction of Time Toxicity from Clinical Trial Protocols
Time toxicity, the cumulative healthcare contact days from clinical trial participation, is an important but labor-intensive metric to extract from protocol documents. We developed TimeTox, an LLM-based pipeline for automated extraction of time toxicity from Schedule of Assessments tables. TimeTox uses Google’s Gemini models in three stages: summary extraction from full-length protocol PDFs, time toxicity quantification at six cumulative timepoints for each treatment arm, and multi-run consensus via position-based arm matching. We validated against 20 synthetic schedules (240 comparisons) and assessed reproducibility on 644 real-world oncology protocols. Two architectures were compared: single-pass (vanilla) and two-stage (structure-then-count). The two-stage pipeline achieved 100% clinically acceptable accuracy ($\pm$3 days) on synthetic data (MAE 0.81 days) versus 41.5% for vanilla (MAE 9.0 days). However, on real-world protocols, the vanilla pipeline showed superior reproducibility: 95.3% clinically acceptable accuracy (IQR $\leq$ 3 days) across 3 runs on 644 protocols, with 82.0% perfect stability (IQR = 0). The production pipeline extracted time toxicity for 1,288 treatment arms across multiple disease sites. Extraction stability on real-world data, rather than accuracy on synthetic benchmarks, is the decisive factor for production LLM deployment.
💡 Research Summary
TimeTox is an end‑to‑end pipeline that leverages Google’s Gemini large language models (LLMs) to automatically extract “time toxicity” – the cumulative number of healthcare contact days a patient experiences while participating in a clinical trial – from Schedule of Assessments (SoA) tables embedded in trial protocols. The authors recognize that SoA tables are dense, multi‑page grids with heterogeneous formatting, making manual extraction labor‑intensive and error‑prone.
The pipeline consists of three stages. First, a “summary extraction” step uses Gemini 2.5 Flash (temperature 0.0, top‑p 0.95) to ingest the full PDF, identify pages containing SoA information, and produce a condensed PDF that includes the SoA pages plus one surrounding page for context. This reduces the input size for downstream processing; the step succeeded on 644 of 649 collected oncology protocols (a 99.5% success rate).
Second, the core extraction stage quantifies time toxicity at six cumulative time points (screening, 1, 3, 6, 9, and 12 months) for each treatment arm. Two architectural variants are evaluated:
- Vanilla (single‑pass) – a single Gemini 3.0 Flash call receives the summarized SoA PDF and a detailed prompt (≈ 76 lines) that defines healthcare‑contact‑day, the six windows, extraction rules (arm identification, cycle‑to‑calendar mapping, grouped column expansion, unique‑day counting, arm‑specific superscripts), category definitions, and a strict JSON schema for output. The model directly returns the contact‑day counts.
- Two‑stage (structure‑then‑count) – Stage 1 extracts the structural blueprint of the schedule (cycle length, visits per cycle, treatment duration, special visits, disease type) as JSON. Stage 2 receives this blueprint together with the PDF and performs the arithmetic (mapping cycles to calendar days, applying visit patterns, aggregating unique days). The rationale is to separate perception (reading the table) from computation (counting), hypothesizing reduced error propagation.
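To make the two-stage idea concrete, a Stage 1 blueprint might resemble the dictionary below, after which Stage 2 reduces to deterministic arithmetic. The field names and values are illustrative assumptions, not the paper's actual schema.

```python
# Hypothetical Stage 1 output for one arm (field names are illustrative).
blueprint = {
    "arm": "Arm A",
    "cycle_length_days": 21,
    "visits_per_cycle": [1, 8, 15],  # days of each cycle with clinic contact
    "treatment_cycles": 8,
    "special_visits": [170],         # e.g. an end-of-treatment visit day
}

def contact_days_through(bp, horizon_days):
    """Stage 2 arithmetic: map cycles to absolute calendar days and count
    unique contact days up to a cumulative horizon (e.g. 90 for 3 months)."""
    days = set()
    for c in range(bp["treatment_cycles"]):
        for d in bp["visits_per_cycle"]:
            day = c * bp["cycle_length_days"] + d
            if day <= horizon_days:
                days.add(day)
    # Special visits count only if they fall inside the window.
    days.update(d for d in bp["special_visits"] if d <= horizon_days)
    return len(days)
```

The appeal of this split is that only Stage 1 depends on the LLM's perception of the table; Stage 2 is exact arithmetic, which is why it dominates on clean synthetic inputs.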
Third, a multi‑run consensus mechanism aligns treatment‑arm names across runs using position‑based matching, then aggregates the results to mitigate run‑to‑run variability.
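A minimal sketch of position-based consensus, assuming each run returns an ordered list of arms with per-timepoint day counts (the `"arm"`/`"days"` field names are invented for illustration): arm i in one run is matched to arm i in the others regardless of naming differences, and values are aggregated by median.

```python
from statistics import median

def consensus(runs):
    """Position-based arm matching across runs with per-timepoint median.

    `runs` is a list of runs; each run is an ordered list of dicts like
    {"arm": "Arm A", "days": {"screening": 2, "m3": 8, ...}}.
    """
    n_arms = min(len(r) for r in runs)  # align by position, not by name
    out = []
    for i in range(n_arms):
        timepoints = runs[0][i]["days"].keys()
        out.append({
            "arm": runs[0][i]["arm"],  # keep the first run's arm label
            "days": {tp: median(r[i]["days"][tp] for r in runs)
                     for tp in timepoints},
        })
    return out
```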
For validation, the authors generated 20 synthetic SoA schedules (HTML‑rendered, realistic, spanning eight oncology disease categories, three complexity levels, five visual styles, and multiple treatment modalities). Each schedule contains two arms, yielding 40 arms and 240 ground‑truth comparisons (six time points per arm). Ground truth was deterministically computed by expanding cycles, adding imaging intervals, end‑of‑treatment, and follow‑up visits, then counting unique calendar days.
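The deterministic ground-truth computation described above amounts to expanding per-cycle visit patterns into absolute calendar days, adding any extra visits (imaging, end-of-treatment, follow-up), and counting the unique set. A minimal sketch under those assumptions:

```python
def ground_truth_contact_days(cycle_len, visit_days_in_cycle, n_cycles,
                              extra_days=()):
    """Count unique contact days for a synthetic schedule.

    Visits landing on the same calendar day (e.g. imaging on a treatment
    day) are counted once, since a set deduplicates them.
    """
    days = set(extra_days)  # imaging, end-of-treatment, follow-up visits
    for c in range(n_cycles):
        for d in visit_days_in_cycle:
            days.add(c * cycle_len + d)
    return len(days)
```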
On synthetic data, the two‑stage pipeline achieved perfect clinically acceptable accuracy (100 % within ±3 days) and a mean absolute error (MAE) of 0.81 days, whereas the vanilla pipeline lagged with 41.5 % clinically acceptable accuracy and MAE 9.0 days. This demonstrates that separating structure extraction from arithmetic dramatically improves performance when the input format is controlled.
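The two reported metrics are straightforward to compute from paired predictions and ground truths; a sketch using the ±3-day clinical-acceptability threshold from the paper (function and variable names are ours):

```python
def evaluate(preds, truths, tol=3):
    """Return (MAE, fraction of comparisons within `tol` days)."""
    errors = [abs(p - t) for p, t in zip(preds, truths)]
    mae = sum(errors) / len(errors)
    acceptable = sum(e <= tol for e in errors) / len(errors)
    return mae, acceptable
```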
However, real‑world evaluation on 644 oncology protocols revealed the opposite trend. Across three independent runs, the vanilla pipeline delivered 95.3 % clinically acceptable accuracy (IQR ≤ 3 days) and 82.0 % perfect stability (IQR = 0). The two‑stage approach suffered from higher variability, likely because errors in the first structural extraction stage amplified during the second counting stage when faced with noisy OCR, irregular legends, and diverse table layouts.
Consequently, the authors argue that for production deployment, reproducibility on real‑world documents outweighs synthetic benchmark accuracy. They therefore selected the vanilla pipeline for production, extracting time‑toxicity metrics for 1,288 treatment arms across the 644 protocols.
Key insights include:
- Stability over raw accuracy – In clinical settings, consistent outputs across runs are more valuable than marginal gains on synthetic benchmarks.
- Prompt engineering matters – Detailed, rule‑based prompts with forced JSON output and low temperature improve determinism.
- Synthetic data are useful but not sufficient – They enable deterministic ground truth but cannot capture the full spectrum of real‑world noise.
- Multi‑run consensus mitigates LLM variability – Position‑based arm matching effectively aligns outputs despite naming inconsistencies.
Limitations noted are the cost and latency of Gemini API calls (≈ 2–3 min per protocol), sensitivity to visual layout variations, and reliance on LLMs for arithmetic, which may still produce occasional outliers. Future work could explore multimodal vision‑language models, automated prompt optimization, integration with external trial metadata (e.g., ClinicalTrials.gov), and automated validation layers to further enhance robustness.
In summary, TimeTox demonstrates a practical application of LLMs for automating a clinically relevant metric extraction from complex trial documents, and it highlights the paramount importance of reproducibility when moving from controlled synthetic experiments to real‑world deployment.