Multi-state Models For Disease Histories Based On Longitudinal Data


Multi-stage disease histories derived from longitudinal data are becoming increasingly available as registry data and biobanks expand. Multi-state models are suitable to investigate transitions between different disease stages in presence of competing risks. In this context, however, their estimation is complicated by dependent left-truncation, multiple time scales, index event bias, and interval-censoring. In this work, we investigate the extension of piecewise exponential additive models (PAMs) to this setting and their applicability given the above challenges. In simulation studies we show that PAMs can handle dependent left-truncation and accommodate multiple time scales. Compared to a stratified single time scale model, a multiple time scales model is found to be less robust to the data generating process. We also quantify the extent of index event bias in multiple settings, demonstrating its dependence on the completeness of covariate adjustment. In general, PAMs recover baseline and fixed effects well in most settings, except for baseline hazards in interval-censored data. Finally, we apply our framework to estimate multi-state transition hazards and probabilities of chronic kidney disease (CKD) onset and progression in a UK Biobank dataset (n=142,667). We observe CKD progression risk to be highest for individuals with early CKD onset and to further increase over age. In addition, the well-known genetic variant rs77924615 in the UMOD locus is found to be associated with CKD onset hazards, but not with risk of further CKD progression.


💡 Research Summary

This paper addresses the growing availability of multi-stage disease histories derived from large longitudinal resources such as registries and biobanks, focusing on the statistical challenges that arise when fitting multi-state models to such data. The authors identify four major obstacles: (1) dependent left-truncation, where entry into the risk set for a given transition depends on the time of a prior event such as disease onset; (2) the presence of multiple, potentially interacting time scales (e.g., chronological age, time since disease onset, time since progression); (3) index-event bias, a selection bias that emerges when analyses are restricted to individuals who have already experienced an index event, which induces spurious correlations among risk factors of that event; and (4) interval-censoring, which is typical for longitudinal measurements taken at irregular follow-up times, so that a transition is only known to have occurred between two visits. Standard tools such as the non-parametric Aalen-Johansen estimator or the semi-parametric Cox proportional-hazards model can be difficult to apply under these complexities, especially when non-Markovian dynamics or non-linear covariate effects are present.

To overcome these limitations, the authors extend piecewise exponential additive models (PAMs) to the multi-state setting. PAMs transform survival data into a piecewise exponential data (PED) format, in which each subject's follow-up is split into intervals, so that the problem can be treated as a penalized Poisson regression. Baseline hazards are estimated flexibly using penalized splines, which yields smooth hazard estimates and reduces sensitivity to the particular choice of interval cut-points. Estimation proceeds via restricted maximum likelihood (REML) or fast REML (fREML) using the R package mgcv, which scales to the large UK Biobank cohort (n = 142,667).
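The PED transformation can be illustrated with a minimal Python sketch. It is not the authors' implementation (the paper works with mgcv in R); the function names `to_ped` and `piecewise_hazards`, the toy subjects, and the cut-points are all hypothetical. Each subject's at-risk window `(entry, exit]` is split at the cut-points, with `entry > 0` encoding left-truncation. The events-over-exposure ratio per interval is the estimate an unpenalized Poisson regression with an interval factor and a log-exposure offset would return; a PAM replaces this step function with a penalized spline.

```python
def to_ped(subjects, cuts):
    """Split each at-risk window (entry, exit] at the cut points.

    subjects: list of (entry, exit, event) tuples; entry > 0 means the
              subject joins the risk set late (left-truncation).
    cuts:     sorted interval boundaries covering all exit times.
    Returns rows of (interval_index, exposure, event_indicator).
    """
    rows = []
    for entry, exit_, event in subjects:
        for j in range(len(cuts) - 1):
            lo, hi = cuts[j], cuts[j + 1]
            start, stop = max(lo, entry), min(hi, exit_)
            if stop <= start:
                continue  # subject is not at risk in this interval
            # The event is attributed to the interval containing the exit time.
            occurred_here = int(event == 1 and lo < exit_ <= hi)
            rows.append((j, stop - start, occurred_here))
    return rows


def piecewise_hazards(rows, n_intervals):
    """Events divided by exposure per interval (piecewise-constant hazard MLE)."""
    events = [0] * n_intervals
    exposure = [0.0] * n_intervals
    for j, time_at_risk, d in rows:
        events[j] += d
        exposure[j] += time_at_risk
    return [d / e if e > 0 else float("nan") for d, e in zip(events, exposure)]


# Hypothetical toy data: (entry, exit, event); the second and third subjects
# enter the risk set late and contribute no exposure before their entry time.
subjects = [(0.0, 2.5, 1), (1.0, 3.0, 0), (0.5, 1.5, 1)]
cuts = [0.0, 1.0, 2.0, 3.0]
rows = to_ped(subjects, cuts)
hazards = piecewise_hazards(rows, len(cuts) - 1)
```

Handling left-truncation here amounts to dropping the exposure before `entry`, which is why delayed entry fits naturally into the PED representation.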

Two baseline formulations are introduced: (i) a stratified single‑time‑scale (SSTS) PAM, where each transition’s baseline hazard is a function of a single time variable (typically age); and (ii) a multiple‑time‑scale (MTS) PAM, which incorporates several time axes simultaneously (e.g., age, time since CKD onset, time since progression). The MTS approach can capture richer temporal dynamics but, as simulation results show, may be less robust when the data‑generating process does not align with the chosen scale combination.
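As a rough sketch of the difference between the two formulations, the log-hazard for transition $k$ could be written as follows. The notation is assumed for illustration and may differ from the paper's: $t$ is the primary time scale (e.g., age), $t_{\text{onset}}$ is the time of disease onset, $f$ denotes penalized spline terms, and $\beta_k$ are transition-specific fixed effects. The actual models may include further terms such as interactions between time scales.

```latex
% SSTS: one smooth baseline per transition k, on a single time scale t
\log h_k(t \mid x) = f_{0,k}(t) + x^\top \beta_k

% MTS: additional smooth term on a second time scale (time since onset)
\log h_k(t, t - t_{\text{onset}} \mid x)
  = \beta_{0,k} + f_{1,k}(t) + f_{2,k}(t - t_{\text{onset}}) + x^\top \beta_k
```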

Simulation studies systematically evaluate the performance of these models under each of the four challenges. The results indicate that PAMs can handle dependent left-truncation and recover both fixed-effect coefficients and baseline hazards well in most scenarios. The exception is interval-censored data, where baseline hazard estimation deteriorates, although covariate effects remain well estimated. Index-event bias is quantified by varying the completeness of covariate adjustment; more complete adjustment markedly reduces the bias, which is consistent with the bias arising from risk factors of the index event that are left unadjusted in the conditioned sub-cohort.
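The mechanism behind index-event bias can be reproduced with a small simulation. This is a hedged sketch, not the paper's simulation design: the function name, the logistic onset model, and all parameter values are assumptions chosen for illustration. A binary variant `g` and an unmeasured risk factor `u` are generated independently, both raise the probability of the index event, and the analysis is then restricted to subjects who experienced it.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_index_event_bias(n=20000, seed=1):
    """Return corr(g, u) in the full cohort and among index-event cases."""
    rng = random.Random(seed)
    g_all, u_all, g_sel, u_sel = [], [], [], []
    for _ in range(n):
        g = 1.0 if rng.random() < 0.5 else 0.0   # variant carrier status
        u = rng.gauss(0.0, 1.0)                  # unmeasured risk factor
        # Assumed logistic model for experiencing the index event.
        p_event = 1.0 / (1.0 + math.exp(-(-2.0 + 1.5 * g + 1.5 * u)))
        g_all.append(g)
        u_all.append(u)
        if rng.random() < p_event:               # keep only index-event cases
            g_sel.append(g)
            u_sel.append(u)
    return pearson(g_all, u_all), pearson(g_sel, u_sel)


corr_full, corr_cases = simulate_index_event_bias()
print(f"full cohort: {corr_full:+.3f}, index-event cases: {corr_cases:+.3f}")
```

In the full cohort the correlation is close to zero by construction, while among cases it becomes negative: carriers need less of `u` to reach the index event. If `u` also drives progression and is not adjusted for, the estimated effect of `g` on progression is distorted, which is why the bias depends on how complete the covariate adjustment is.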

The methodological framework is applied to a UK Biobank analysis of chronic kidney disease (CKD). The multi-state diagram describes a progression from Healthy through Mild CKD and Severe CKD to End-Stage Kidney Disease (ESKD), with Death as an absorbing state. The UMOD locus variant rs77924615 is found to be associated with CKD onset hazards, but not with the risk of further CKD progression. Moreover, CKD progression risk is highest for individuals who develop CKD early (e.g., before age 50) and increases further with age, especially after age 60, pointing to an interplay between age and disease duration.
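Transition probabilities follow from estimated transition hazards through a product of small-step transition matrices. The sketch below uses a hypothetical three-state progressive model (0 = healthy, 1 = CKD, 2 = progressed) with made-up piecewise-constant hazards; neither the state structure nor the numbers are taken from the paper. It uses the first-order approximation $P \approx \prod_j (I + Q_j \,\Delta t)$, which is only reasonable when each hazard times the step length is small.

```python
def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_matrix(hazard_steps, n_states, dt):
    """Multiply one-step matrices (I + Q * dt) over consecutive time steps.

    hazard_steps: one dict per time step mapping (from_state, to_state)
                  to the transition hazard during that step.
    """
    p = identity(n_states)
    for step in hazard_steps:
        inc = identity(n_states)
        for (i, j), h in step.items():
            inc[i][j] += h * dt   # probability mass moving from i to j
            inc[i][i] -= h * dt   # the same mass leaves state i
        p = matmul(p, inc)
    return p


# Hypothetical hazards per year: low for ten years, then doubled for ten years.
steps = [{(0, 1): 0.02, (1, 2): 0.05}] * 10 + [{(0, 1): 0.04, (1, 2): 0.10}] * 10
P = transition_matrix(steps, n_states=3, dt=1.0)
print("P(healthy -> progressed within 20 years):", round(P[0][2], 4))
```

Each row of `P` sums to one, and `P[i][j]` approximates the probability of being in state `j` at the end of the window given state `i` at the start.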

In conclusion, the study provides a versatile, computationally efficient framework for multi-state modeling of complex longitudinal disease data. With PAMs, researchers can handle dependent left-truncation and multiple time scales within a single model and can assess how strongly index-event bias affects their estimates. Limitations include poorer baseline hazard estimation under interval-censoring and the lower robustness of MTS models to the data-generating process, suggesting avenues for future work such as Bayesian regularization or adaptive scale selection.

