A Bayesian Prevalence Incidence Cure model for estimating survival using Electronic Health Records with incomplete baseline diagnoses

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv paper.

Retrospective cohorts can be extracted from Electronic Health Records (EHR) to study prevalence, time until disease or event occurrence, and cure proportion in real-world scenarios. However, EHR are collected for patient care rather than research, so they typically have complexities, such as patients with missing baseline disease status. Prevalence-Incidence (PI) models, which use a two-component mixture model to account for this missing data, have been proposed. However, PI models are biased in settings in which some individuals will never experience the endpoint (they are ‘cured’). To address this, we propose a Prevalence Incidence Cure (PIC) model, a three-component mixture model that combines the PI model framework with a cure model. Our PIC model enables estimation of the prevalence, time-to-incidence, and the cure proportion, and allows for covariates to affect these. We adopt a Bayesian inference approach, and focus on the interpretability of the prior. We show in a simulation study that the PIC model has smaller bias than a PI model for the survival probability, and compare inference under vague, informative and misspecified priors. We illustrate our model using a dataset of 1964 patients undergoing treatment for Diabetic Macular Oedema, demonstrating improved fit under the PIC model.


💡 Research Summary

This paper addresses a common challenge in using electronic health records (EHR) for epidemiological research: the simultaneous presence of missing baseline disease status and a subset of patients who will never experience the event of interest (often interpreted as “cured” or “non‑responders”). Existing prevalence‑incidence (PI) models handle missing baseline status by treating the data as a two‑component mixture of prevalent and incident cases, but they assume that every individual will eventually experience the event. Consequently, PI models produce biased prevalence estimates and distorted survival curves when a cure fraction exists.

To overcome these limitations, the authors propose a three‑component mixture model called the Prevalence‑Incidence‑Cure (PIC) model. The PIC model partitions the population into (1) prevalent cases (already diseased at baseline), (2) incident cases (susceptible individuals who will eventually develop the disease), and (3) cured cases (individuals who will never develop the disease). The model simultaneously estimates the prevalence probability π_i, the cure probability δ_i, and the time‑to‑event distribution for incident cases. A multinomial logistic regression links covariates x_i to π_i and δ_i (Equations 4‑5). For the incident time distribution, a Weibull proportional‑hazards specification is adopted, providing interpretable hazard ratios γ for covariate effects (Equations 6‑7).
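The three class probabilities and the incident-time survival function can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the incident class is taken as the reference category of the multinomial logit, the Weibull survival is written as S(t | x) = exp(-λ t^α exp(x'γ)), and the function names and coefficient values are hypothetical rather than taken from the paper.

```python
import math

def class_probabilities(x, beta_pi, beta_delta):
    """Return (prevalent, cured, incident) probabilities for one patient.

    Assumption: incident cases are the reference category, so their
    linear predictor is fixed at zero. x includes a leading 1 for the intercept.
    """
    eta_pi = sum(b * v for b, v in zip(beta_pi, x))
    eta_delta = sum(b * v for b, v in zip(beta_delta, x))
    denom = 1.0 + math.exp(eta_pi) + math.exp(eta_delta)
    return math.exp(eta_pi) / denom, math.exp(eta_delta) / denom, 1.0 / denom

def weibull_ph_survival(t, z, alpha, lam, gamma):
    """Weibull proportional-hazards survival for incident cases.

    Assumed parameterization: S(t | z) = exp(-lam * t**alpha * exp(z'gamma)),
    so exp(gamma_j) is the hazard ratio for covariate j.
    """
    lin = sum(g * v for g, v in zip(gamma, z))
    return math.exp(-lam * t ** alpha * math.exp(lin))

# Hypothetical patient with one standardized covariate equal to 0.5.
pi_i, delta_i, inc_i = class_probabilities([1.0, 0.5], [-1.0, 0.2], [-1.5, -0.3])
surv_2y = weibull_ph_survival(2.0, [0.5], alpha=1.2, lam=0.1, gamma=[0.3])
```

With this reference-category choice, exp(β_1π) is the prevalence-to-incidence ratio and exp(β_1δ) the cure-to-incidence ratio for a patient whose covariates are all zero.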

EHR data are interval‑censored because disease status is observed only at discrete clinical visits. The authors define l_i as the last negative test time and r_i as the first positive test time, and introduce four observation categories (c_i = 1…4) that capture (i) baseline positives, (ii) interval‑censored incident cases, (iii) censored non‑responders, and (iv) missing baseline with later positive test. The observed likelihood (Equation 3) combines the contributions of each category, correctly accounting for the uncertainty about prevalence and cure status.
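One plausible form of the per-patient likelihood contributions is sketched below. It is not a transcription of the paper's Equation 3: the mapping from categories to formulas is an assumption based on the standard prevalence-incidence likelihood extended with a cure class, covariates are omitted, and the Weibull survival S(t) = exp(-λ t^α) and all names are illustrative.

```python
import math

def weibull_survival(t, alpha, lam):
    """Assumed Weibull survival S(t) = exp(-lam * t**alpha) for incident cases."""
    return math.exp(-lam * t ** alpha)

def log_lik_contribution(c, l, r, pi, delta, alpha, lam):
    """Log-likelihood contribution of one patient in category c (1 to 4).

    l: last negative test time, r: first positive test time
    (either may be None when it does not apply to the category).
    """
    incident = 1.0 - pi - delta  # probability of the incident class
    if c == 1:
        # Tested positive at baseline: prevalent case.
        lik = pi
    elif c == 2:
        # Negative at l, positive at r: incident event in (l, r].
        lik = incident * (weibull_survival(l, alpha, lam)
                          - weibull_survival(r, alpha, lam))
    elif c == 3:
        # Never observed positive, last negative at l:
        # either cured, or incident with the event still to come.
        lik = delta + incident * weibull_survival(l, alpha, lam)
    elif c == 4:
        # Baseline status missing, first positive at r:
        # either prevalent, or incident with the event before r.
        lik = pi + incident * (1.0 - weibull_survival(r, alpha, lam))
    else:
        raise ValueError("category must be 1, 2, 3 or 4")
    return math.log(lik)

# Hypothetical patient: negative at 1 year, positive at 2 years.
ll = log_lik_contribution(2, 1.0, 2.0, pi=0.3, delta=0.15, alpha=1.2, lam=0.1)
```

Under these assumptions the contributions are coherent: for a patient tested at baseline and again at time 2, the probabilities of "positive at baseline", "event in (0, 2]" and "still negative at 2" sum to one.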

A Bayesian framework is employed to estimate all parameters. The authors emphasize prior elicitation: they ask experts to provide plausible ratios of prevalence to incidence (r_π) and cure to incidence (r_δ). These ratios are modeled with log‑normal priors (Equations 10‑13), which translate directly into normal priors for the intercepts β_1π and β_1δ. The odds ratios associated with the covariate coefficients for prevalence (β_jπ) and cure (β_jδ) are likewise given log‑normal priors based on elicited 95% quantiles, which again implies normal priors on the coefficients themselves. For the Weibull shape α and scale λ, log‑normal priors are specified, with α reflecting whether the hazard increases (>1) or decreases (<1) over time, and λ derived from a prior on the median event time.
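To make the elicitation step concrete, the sketch below turns an expert's range for a ratio into the parameters of a log-normal prior, and converts a median event time into a Weibull scale. Both helpers are illustrative assumptions: the range is treated as a central 95% interval, the Weibull survival is taken to be S(t) = exp(-λ t^α), and the paper's exact elicitation scheme may differ.

```python
import math

Z975 = 1.959963984540054  # 97.5% quantile of the standard normal

def lognormal_from_interval(lower, upper):
    """Return (mu, sigma) on the log scale so that (lower, upper)
    is the central 95% interval of a log-normal prior.

    The same (mu, sigma) define the normal prior on the log-ratio,
    i.e. on the corresponding intercept or coefficient.
    """
    mu = 0.5 * (math.log(lower) + math.log(upper))
    sigma = (math.log(upper) - math.log(lower)) / (2.0 * Z975)
    return mu, sigma

def weibull_scale_from_median(median, alpha):
    """Scale lam such that S(median) = 0.5 under S(t) = exp(-lam * t**alpha)."""
    return math.log(2.0) / median ** alpha

# Hypothetical expert statement: the prevalence-to-incidence ratio
# lies between 0.25 and 4 with 95% probability.
mu, sigma = lognormal_from_interval(0.25, 4.0)

# Hypothetical median time-to-event of 2 years with shape 1.5.
lam = weibull_scale_from_median(2.0, 1.5)
```

Working on the ratio scale keeps the prior interpretable: the expert reasons about how many prevalent (or cured) patients to expect per incident patient, and the log transform supplies the normal prior on the intercept.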

Simulation studies compare the PIC model with the traditional PI model under varying cure fractions (0 %–30 %). Results demonstrate that the PIC model yields substantially lower bias in the estimated survival function, especially when a non‑negligible cure proportion exists. The PI model’s survival curve erroneously converges to zero, whereas the PIC model correctly asymptotes to the estimated cure proportion.
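The difference in long-run behaviour can be illustrated with a minimal sketch of the two disease-free probability curves. It assumes the population-level form δ + (1 − π − δ) S(t) for the PIC model and (1 − π) S(t) for the PI model, with a Weibull S(t) = exp(-λ t^α); the parameter values are illustrative and the paper may define its survival function differently (for example, conditional on not being prevalent).

```python
import math

def disease_free_pic(t, pi, delta, alpha, lam):
    """PIC model: probability of being disease-free at time t.

    Cured patients (probability delta) never experience the event,
    so the curve plateaus at delta as t grows.
    """
    return delta + (1.0 - pi - delta) * math.exp(-lam * t ** alpha)

def disease_free_pi(t, pi, alpha, lam):
    """PI model: every non-prevalent patient eventually has the event,
    so the curve decays to zero."""
    return (1.0 - pi) * math.exp(-lam * t ** alpha)

# Illustrative values: 30% prevalent, 15% cured.
for t in (0.0, 1.0, 5.0, 20.0, 1000.0):
    pic = disease_free_pic(t, 0.30, 0.15, 1.2, 0.1)
    pi_only = disease_free_pi(t, 0.30, 1.2, 0.1)
    print(f"t={t:7.1f}  PIC={pic:.4f}  PI={pi_only:.4f}")
```

Both curves start at 1 − π, but only the PIC curve levels off above zero, which is why fitting a PI model to data with a cure fraction distorts the tail of the estimated curve.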

The methodology is applied to a real‑world dataset of 1,964 patients treated for diabetic macular oedema (DMO). Visual acuity (VA) ≥ 70 is defined as “healthy,” and achieving VA ≥ 70 after treatment is considered a cure (non‑response). The PIC model estimates a baseline prevalence of approximately 30 % and a cure proportion of about 15 %. The Weibull shape parameter suggests an increasing hazard early in follow‑up. Model fit is assessed using Deviance Information Criterion (DIC) and posterior predictive checks; the PIC model outperforms the PI model on both metrics, indicating better calibration to the observed interval‑censored data.
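For readers unfamiliar with DIC, the sketch below computes it from posterior output using the standard definition DIC = D̄ + p_D, where D̄ is the posterior mean deviance and p_D = D̄ − D(θ̄) is the effective number of parameters. The function name and the numbers are hypothetical; they are not results from the paper.

```python
def dic(deviance_draws, deviance_at_posterior_mean):
    """Return (DIC, p_D) from posterior deviance draws.

    deviance_draws: -2 * log-likelihood evaluated at each posterior draw.
    deviance_at_posterior_mean: -2 * log-likelihood at the posterior mean
    of the parameters.
    """
    mean_deviance = sum(deviance_draws) / len(deviance_draws)
    p_d = mean_deviance - deviance_at_posterior_mean
    return mean_deviance + p_d, p_d

# Toy numbers, purely for illustration.
dic_value, p_d = dic([10.0, 12.0, 14.0], 11.0)
```

When two models are fitted to the same data, the one with the lower DIC is preferred.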

Key contributions of the paper are:

  1. Introduction of a three‑component mixture model that simultaneously handles missing baseline disease status and a cure fraction in EHR data.
  2. Development of a Bayesian inference scheme with intuitive prior elicitation based on prevalence‑to‑incidence and cure‑to‑incidence ratios, facilitating incorporation of expert knowledge.
  3. Derivation of an explicit observed‑data likelihood that respects interval censoring and the four observation categories.
  4. Empirical validation through simulations and a substantive clinical example, showing reduced bias and improved model fit relative to existing PI models.

Limitations include reliance on a Weibull parametric form for incident times, which may not capture more complex hazard shapes, and a relatively modest sensitivity analysis of prior specifications. Future work could explore flexible baseline hazard models (e.g., splines or piecewise‑constant hazards), time‑varying covariate effects, and broader applications across disease areas where cure or long‑term non‑susceptibility is plausible.

