One Loss to Rule Them All: Marked Time-to-Event for Structured EHR Foundation Models

Clinical events captured in Electronic Health Records (EHR) are irregularly sampled and may consist of a mixture of discrete events and numerical measurements, such as laboratory values or treatment dosages. The sequential nature of EHR, analogous to natural language, has motivated the use of next-token prediction to train prior EHR Foundation Models (FMs) over events. However, this training fails to capture the full structure of EHR. We propose ORA, a marked time-to-event pretraining objective that jointly models event timing and associated measurements. Across multiple datasets, downstream tasks, and model architectures, this objective consistently yields more generalizable representations than next-token prediction and pretraining losses that ignore continuous measurements. Importantly, the proposed objective yields improvements beyond traditional classification evaluation, including better regression and time-to-event prediction. Beyond introducing a new family of FMs, our results suggest a broader takeaway: pretraining objectives that account for EHR structure are critical for expanding downstream capabilities and generalizability.


💡 Research Summary

The paper addresses a fundamental limitation of current electronic health record (EHR) foundation models (FMs): most of them are pretrained using a next‑token prediction objective that treats the record as a simple sequence of discrete tokens. This approach ignores two critical aspects of EHR data—irregular timing of events and the presence of continuous measurements (e.g., lab values, drug dosages). To overcome these shortcomings, the authors propose ORA (Observed‑Risk‑Association), a marked time‑to‑event pretraining loss that jointly models the distribution of event times, the associated clinical codes (marks), and any optional numeric values.

Conceptually, each patient’s record is represented as a series of triples (t, m, v) where t is the timestamp, m the clinical code, and v an optional numeric measurement. The authors cast the whole dataset as a marked point process and define a composite likelihood that, at each observation point j, predicts for every possible code m the time until its next occurrence (Δtₘⱼ), the value v, and an indicator δ denoting whether the event is observed or censored. This formulation solves three issues inherent to next‑token training: (1) it provides a dense learning signal by considering all potential next events rather than only the single observed one; (2) it removes the unrealistic mutual‑exclusivity assumption, allowing multiple events to co‑occur; and (3) it naturally incorporates censoring, avoiding bias in likelihood estimation.
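The per-code supervision targets described above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the function name `next_event_targets`, the fixed censoring `horizon`, and the toy codes are all assumptions introduced here for clarity.

```python
def next_event_targets(record, j, vocab, horizon):
    """For observation point j in a record of (t, m, v) triples, return for
    every code m in `vocab` the time until its next occurrence (dt), the
    associated value v, and the observed/censored indicator delta."""
    t_j = record[j][0]
    targets = {}
    for m in vocab:
        # Find the next occurrence of code m strictly after time t_j.
        future = [(t, v) for (t, code, v) in record if code == m and t > t_j]
        if future and future[0][0] - t_j <= horizon:
            t_next, v_next = future[0]
            targets[m] = (t_next - t_j, v_next, 1)  # observed event (delta = 1)
        else:
            targets[m] = (horizon, None, 0)         # censored at the horizon (delta = 0)
    return targets

# Toy patient record: (timestamp in days, clinical code, optional numeric value).
record = [
    (0.0, "ADMIT", None),
    (0.5, "LAB:creatinine", 1.1),
    (2.0, "LAB:creatinine", 1.4),
    (3.0, "RX:heparin", 5000.0),
]
targets = next_event_targets(
    record, j=1,
    vocab=["LAB:creatinine", "RX:heparin", "DISCHARGE"],
    horizon=10.0,
)
```

Note how every code in the vocabulary contributes a target at each observation point (the dense signal), several codes can have observed events simultaneously (no mutual exclusivity), and codes that never recur within the horizon are marked censored rather than silently ignored.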

To make the loss tractable, the continuous time‑value space is discretized into T time bins and V value bins. For each code m, the model outputs a T × V probability matrix Pₘ(x) that approximates the joint intensity function. The log‑likelihood is then the sum of log probabilities for observed events and the log of the complement for censored events. This discretized approach is similar to DeepHit but extended to handle marks and values jointly.
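The discretized likelihood can be written out as a short sketch. Again this is illustrative, assuming the summary's description: for each code m the model emits a T × V matrix of bin probabilities, observed events contribute the log probability of their (time bin, value bin) cell, and censored events contribute the log of the remaining probability mass beyond the censoring bin. The function name `mte_log_likelihood` and the input shapes are assumptions, not the paper's API.

```python
import math

def mte_log_likelihood(P, events):
    """Discretized marked time-to-event log-likelihood (sketch).
    P:      dict mapping code -> T x V matrix of bin probabilities (rows = time
            bins, columns = value bins, total mass <= 1 per code).
    events: dict mapping code -> (time_bin, value_bin, delta), where delta = 1
            marks an observed event and delta = 0 a censoring time.
    """
    ll = 0.0
    for m, (tb, vb, delta) in events.items():
        probs = P[m]
        if delta == 1:
            # Observed: log probability of the (time, value) cell that occurred.
            ll += math.log(probs[tb][vb])
        else:
            # Censored at bin tb: log of the complement, i.e. the probability
            # that the event falls after the censoring bin.
            mass_before = sum(probs[t][v]
                              for t in range(tb + 1)
                              for v in range(len(probs[0])))
            ll += math.log(max(1.0 - mass_before, 1e-12))
    return ll

# Toy case with T = 2 time bins and V = 2 value bins.
ll_obs = mte_log_likelihood({"A": [[0.1, 0.2], [0.3, 0.4]]}, {"A": (1, 0, 1)})
ll_cens = mte_log_likelihood({"B": [[0.25, 0.25], [0.25, 0.25]]}, {"B": (0, 0, 0)})
```

As in DeepHit-style discretization, making the bins finer trades a closer approximation of the continuous intensity against a larger output head per code.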

Importantly, ORA is architecture‑agnostic. The authors evaluate it on two popular FM backbones: a standard Transformer and the recent state‑space model Mamba. They fix the tokenization scheme (using an entropy‑based filter from Steinberg et al., 2024) to isolate the effect of the loss function. Experiments are conducted on two large, multi‑institutional EHR datasets representing tertiary and quaternary care settings. A total of 14 downstream tasks are used for evaluation: 7 binary classification, 3 regression, and 4 time‑to‑event prediction tasks.

Results show consistent improvements across all settings. ORA‑pretrained Transformers achieve an average 10.7 % gain, while Mamba models gain 11.4 % over their next‑token counterparts. Gains are especially pronounced for tasks involving rare codes and continuous values, confirming that the loss effectively leverages numeric information. Cross‑site validation demonstrates that ORA models generalize well to unseen institutional data distributions, highlighting robustness to dataset shift.

The paper’s contributions are threefold: (1) a novel, mathematically grounded pretraining objective that captures the full stochastic structure of EHRs; (2) empirical evidence that the objective works equally well with attention‑based and state‑space architectures, underscoring its flexibility; and (3) a comprehensive downstream benchmark that goes beyond classification to include regression and survival‑type predictions, thereby showcasing the broader clinical utility of the learned representations.

In summary, ORA demonstrates that aligning the pretraining loss with the intrinsic irregular, marked, and valued nature of EHR data yields substantially more expressive and generalizable foundation models. This work paves the way for future medical AI systems that can reliably exploit the rich temporal and quantitative information embedded in real‑world clinical records.

