Bootstrapping-based Regularisation for Reducing Individual Prediction Instability in Clinical Risk Prediction Models


Clinical prediction models are increasingly used to support patient care, yet many deep learning-based approaches remain unstable, as their predictions can vary substantially when trained on different samples from the same population. Such instability undermines reliability and limits clinical adoption. In this study, we propose a novel bootstrapping-based regularisation framework that embeds the bootstrapping process directly into the training of deep neural networks. This approach constrains prediction variability across resampled datasets, producing a single model with inherent stability properties. We evaluated models constructed using the proposed regularisation approach against conventional and ensemble models using simulated data and three clinical datasets: GUSTO-I, Framingham, and SUPPORT. Across all datasets, our model exhibited improved prediction stability, with lower mean absolute differences (e.g., 0.019 vs. 0.059 in GUSTO-I; 0.057 vs. 0.088 in Framingham) and markedly fewer significantly deviating predictions. Importantly, discriminative performance and feature importance consistency were maintained, with high SHAP correlations between models (e.g., 0.894 for GUSTO-I; 0.965 for Framingham). While ensemble models achieved greater stability, we show that this came at the expense of interpretability, as each constituent model used predictors in different ways. By regularising predictions to align with bootstrapped distributions, our approach produces prediction models that achieve greater robustness and reproducibility without sacrificing interpretability. This method provides a practical route toward more reliable and clinically trustworthy deep learning models, particularly valuable in data-limited healthcare settings.


💡 Research Summary

The paper tackles a pervasive problem in clinical risk prediction models: individual‑level prediction instability (ILP), where models trained on different samples from the same population produce markedly different risk estimates for the same patient. Traditional remedies either report this variability after the fact by bootstrapping the development data (Riley et al., 2023) or employ bagging ensembles that average predictions from many bootstrapped models. While ensembles improve stability, they sacrifice interpretability because each constituent model may rely on different features, making post‑hoc explanations (e.g., SHAP) inconsistent.
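The interpretability cost of bagging can be seen in a toy sketch (not the paper's code): the ensemble's averaged prediction is stable, yet the constituent models below are deliberately constructed to lean on different predictors, so any per-model explanation (e.g., SHAP attributions) would disagree across members. The two logistic models `m1` and `m2` are hypothetical stand-ins for bootstrapped fits.

```python
import numpy as np

rng = np.random.default_rng(1)

def bagging_predict(models, X):
    """Bagging: average the predictions of models trained on bootstrap resamples."""
    return np.mean([m(X) for m in models], axis=0)

# Hypothetical constituent models that fit the same outcome but weight
# the two predictors very differently: the ensemble average is stable,
# while per-model feature attributions would be inconsistent.
def m1(X):
    return 1.0 / (1.0 + np.exp(-(2.0 * X[:, 0] + 0.1 * X[:, 1])))

def m2(X):
    return 1.0 / (1.0 + np.exp(-(0.1 * X[:, 0] + 2.0 * X[:, 1])))

X = rng.normal(size=(5, 2))
p_ensemble = bagging_predict([m1, m2], X)  # one averaged risk per patient
```

A post-hoc explainer applied to `m1` would attribute risk mostly to the first predictor, and to `m2` mostly to the second, even though their average behaves consistently.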

To address both stability and interpretability, the authors embed the bootstrap process directly into the training objective of a deep neural network (DNN). Let f_θ be the model trained on the full dataset D, and let θ̂_b be the parameters that would be obtained from a bootstrap sample D_b. They define a distance d between the log‑probabilities of f_θ and f_{θ̂_b} for each observation, and add the expected distance over the bootstrap distribution as a regularisation term to the usual negative log‑likelihood loss:

R(θ) = L_θ(D) + λ 𝔼_{b∼B}[ d( log f_θ(x), log f_{θ̂_b}(x) ) ]
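A minimal sketch of this objective, assuming binary outcomes, a squared distance for d, and precomputed bootstrap-model predictions (the paper specifies only "a distance d", so these concrete choices are illustrative, not the authors' implementation):

```python
import numpy as np

EPS = 1e-12  # guard against log(0)

def nll(p, y):
    """Negative log-likelihood for binary outcomes y given predicted risks p."""
    p = np.clip(p, EPS, 1 - EPS)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def bootstrap_regularised_loss(p_full, p_boot, y, lam=1.0):
    """R(theta) = NLL on the full data + lam * expected distance between the
    log-probabilities of the full model and each bootstrap model.
    The squared distance here is a hypothetical choice of d."""
    p_full = np.clip(p_full, EPS, 1 - EPS)
    reg = np.mean([
        np.mean((np.log(p_full) - np.log(np.clip(pb, EPS, 1 - EPS))) ** 2)
        for pb in p_boot  # Monte Carlo estimate of the expectation over B
    ])
    return nll(p_full, y) + lam * reg

# Toy example: risks from the full model and B = 3 bootstrap models.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=50)
p_full = rng.uniform(0.05, 0.95, size=50)
p_boot = [np.clip(p_full + rng.normal(0, 0.02, size=50), 0.01, 0.99)
          for _ in range(3)]
loss = bootstrap_regularised_loss(p_full, p_boot, y, lam=0.5)
```

In practice the bootstrap predictions would come from models fit to resamples D_b, and the regulariser would be differentiated through f_θ during training; setting `lam=0` recovers the plain negative log-likelihood.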

