A Large-Scale Neutral Comparison Study of Survival Models on Low-Dimensional Data

Notice: This research summary and analysis were generated automatically using AI. For full accuracy, please refer to the original arXiv source.

This work presents the first large-scale neutral benchmark experiment focused on single-event, right-censored, low-dimensional survival data. Benchmark experiments are essential in methodological research for scientifically comparing new and existing model classes through proper empirical evaluation. Existing benchmarks in the survival literature are smaller in scale, both in the number of datasets used and in the extent of empirical evaluation, and often lack appropriate tuning or evaluation procedures; other comparison studies focus on qualitative reviews rather than quantitative comparisons. This study aims to fill that gap by neutrally evaluating a broad range of methods and providing generalizable guidelines for practitioners. We benchmark 19 models, ranging from classical statistical approaches to many common machine learning methods, on 34 publicly available datasets. Models are tuned using both a discrimination measure (Harrell’s C-index) and a scoring rule (the Integrated Survival Brier Score) and evaluated across six metrics covering discrimination, calibration, and overall predictive performance. Although individual learners such as oblique random survival forests and likelihood-based boosting achieve superior average ranks in overall predictive performance, and several boosting- and tree-based methods as well as parametric survival models rank better on discrimination, no method significantly outperforms the commonly used Cox proportional hazards model under either tuning measure. We conclude that for predictive purposes in the standard survival analysis setting of low-dimensional, right-censored data, the Cox proportional hazards model remains a simple and robust method that is sufficient for most practitioners. All code, data, and results are publicly available on GitHub: https://github.com/slds-lmu/paper_2023_survival_benchmark


💡 Research Summary

This paper presents the first large‑scale, neutral benchmark of survival‑analysis methods for single‑event, right‑censored, low‑dimensional data. The authors assembled 34 publicly available datasets, each with at least 100 observed events, fewer predictors than observations, and a standard censoring indicator. They evaluated 19 models ranging from classical statistical approaches (Cox proportional hazards, accelerated failure time, penalized Cox, spline‑based Cox) to a broad set of machine‑learning techniques (Random Survival Forests, Conditional Inference Forests, Oblique RSF, Relative Risk Trees, XGBoost with Cox/AFT loss, CoxBoost, Survival‑SVM, etc.).

Two distinct tuning objectives were used: Harrell’s concordance index (C‑index) for discrimination and the Integrated Brier Score (IBS) for overall predictive accuracy. Hyper‑parameter optimization used Bayesian optimization with a budget of 50 iterations per search‑space dimension, giving each method comparable tuning effort. Nested repeated cross‑validation provided unbiased performance estimates: three outer folds repeated five or ten times (depending on dataset size) and two inner folds repeated twice, yielding up to 30 outer evaluation rounds per dataset.
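The nested resampling scheme described above can be sketched in plain Python. This is an illustrative skeleton only, not the benchmark's actual pipeline: the function `repeated_kfold`, the dataset size `n = 300`, and the choice of 5 outer repeats are all assumptions for the example.

```python
import numpy as np

def repeated_kfold(n, k, repeats, seed=0):
    """Yield (train_idx, test_idx) pairs for `repeats` shuffled k-fold splits."""
    rng = np.random.default_rng(seed)
    for _ in range(repeats):
        perm = rng.permutation(n)
        folds = np.array_split(perm, k)
        for i in range(k):
            test = folds[i]
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            yield train, test

# Outer loop: performance estimation (3 folds x 5 repeats = 15 rounds here;
# the study uses 5 or 10 repeats depending on dataset size, up to 30 rounds).
n = 300  # hypothetical dataset size
outer_rounds = 0
for outer_train, outer_test in repeated_kfold(n, k=3, repeats=5):
    # Inner loop on the outer training set only: hyperparameter tuning
    # (2 folds x 2 repeats), e.g. driving a Bayesian optimizer with a
    # budget of 50 evaluations per search-space dimension.
    for inner_train, inner_test in repeated_kfold(len(outer_train), k=2, repeats=2):
        pass  # evaluate one candidate configuration here
    outer_rounds += 1
print(outer_rounds)  # 15
```

The key design point mirrored here is that tuning happens strictly inside each outer training set, so the outer test folds give estimates untouched by hyperparameter selection.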

Performance was assessed with six metrics covering discrimination (C‑index, time‑dependent AUC), calibration (Calibration Slope, Integrated Calibration Index), and overall error (IBS, Brier‑based calibration score). This multi‑metric approach allowed the authors to examine not only ranking but also practical aspects such as calibration, which is crucial for clinical decision‑making.
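To make the discrimination metric concrete, here is a minimal numpy implementation of Harrell's C-index: the fraction of comparable pairs in which the subject with higher predicted risk fails earlier. This is a didactic sketch, independent of whatever survival library the benchmark actually uses; tied predicted risks count as 0.5, and tied event times are not treated as comparable here.

```python
import numpy as np

def harrell_c_index(time, event, risk):
    """Harrell's concordance index for right-censored data.

    A pair (i, j) is comparable if subject i has an observed event and
    either fails strictly before j, or fails at the same time at which
    j is censored.  Higher `risk` should mean earlier failure.
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    risk = np.asarray(risk, dtype=float)
    concordant, comparable = 0.0, 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue  # the earlier member of a comparable pair must be an event
        for j in range(n):
            if time[i] < time[j] or (time[i] == time[j] and not event[j]):
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # ties in predicted risk count half
    return concordant / comparable if comparable else float("nan")
```

A perfectly concordant risk ordering yields C = 1, a perfectly anti-concordant one yields C = 0, and random predictions hover around 0.5. As the summary notes, this measure is sensitive to the censoring distribution, which is one reason the study also tunes and evaluates with the IBS.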

Results showed that while some modern learners—particularly Oblique Random Survival Forests and likelihood‑based boosting methods (CoxBoost, XGBoost‑Cox)—achieved the best average ranks in overall predictive performance, none demonstrated a statistically significant advantage over the standard Cox proportional hazards model under either tuning criterion. In the discrimination‑focused C‑index tuning, several tree‑ and boosting‑based models performed slightly better, yet the differences vanished when evaluated with the IBS‑based tuning, which emphasizes calibration and integrated error.

The authors deliberately excluded deep‑learning models such as DeepSurv and DeepHit due to unstable implementations, high computational cost, and the need for extensive tuning that could not be reliably accommodated within the benchmark framework. They note that future work should incorporate these methods once robust, reproducible pipelines become available.

All code, data, and hyper‑parameter search spaces are openly released on GitHub and integrated into an OpenML benchmark suite, adhering to the reproducibility standards advocated by Boulesteix et al. (2013). The study’s design follows the “neutral comparison” guidelines: no novel method is promoted, model developers were consulted to ensure fair configuration, and a comprehensive set of evaluation metrics was employed despite known biases (e.g., C‑index’s sensitivity to censoring).

In conclusion, the benchmark provides strong empirical evidence that, for low‑dimensional right‑censored survival problems, the Cox proportional hazards model remains a simple, robust, and often sufficient choice for practitioners. While newer machine‑learning algorithms can offer marginal gains in specific scenarios, their added complexity does not translate into statistically significant improvements in typical biomedical datasets. This work therefore offers a valuable reference point for researchers deciding whether to adopt advanced survival‑learning techniques or to rely on the well‑understood Cox model in routine predictive tasks.

