Imputation Uncertainty in Interpretable Machine Learning Methods

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

In real data, missing values occur frequently, which affects interpretations produced by interpretable machine learning (IML) methods. Recent work considers bias and shows that model explanations may differ between imputation methods, while ignoring additional imputation uncertainty and its influence on variance and confidence intervals. We therefore compare the effects of different imputation methods on the confidence-interval coverage probabilities of three IML methods: permutation feature importance, partial dependence plots, and Shapley values. We show that single imputation leads to underestimation of variance and that, in most cases, only multiple imputation comes close to nominal coverage.


💡 Research Summary

This paper investigates how different imputation strategies affect the uncertainty of interpretable machine learning (IML) explanations when missing values are present in the data. While prior work has focused on bias introduced by the choice of imputation, the authors highlight that the additional variability—imputation uncertainty—has been largely ignored, especially regarding confidence‑interval (CI) coverage for IML methods.

The authors consider three classic missing‑data mechanisms defined by Rubin: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). Missingness rates of 10 %, 20 %, and 40 % are imposed on synthetic datasets generated from a linear data‑generating process (DGP) $Y = X_1 - X_2 + \varepsilon$ and a non‑linear DGP $Y = X_1 - \sqrt{1 - X_2} + X_3 X_4 + (X_4/10)^2 + \varepsilon$, where $\varepsilon \sim N(0,1)$ and the covariates follow a Toeplitz‑structured multivariate normal distribution.
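The linear DGP and MCAR mechanism described above can be sketched as follows. This is a minimal illustration, not the paper's code: the correlation strength `rho = 0.5`, sample size `n = 500`, and number of covariates `p = 4` are assumptions, since the summary does not state the exact simulation parameters.

```python
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(0)

def simulate_linear_dgp(n=500, p=4, rho=0.5):
    """Draw covariates from a Toeplitz-correlated multivariate normal
    (corr(X_i, X_j) = rho^|i-j|) and generate Y = X1 - X2 + eps."""
    cov = toeplitz(rho ** np.arange(p))
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    y = X[:, 0] - X[:, 1] + rng.normal(size=n)
    return X, y

def impose_mcar(X, rate=0.2):
    """MCAR: each entry is set to missing independently with prob `rate`,
    regardless of any observed or unobserved values."""
    X_miss = X.copy()
    X_miss[rng.random(X.shape) < rate] = np.nan
    return X_miss

X, y = simulate_linear_dgp()
X_miss = impose_mcar(X, rate=0.2)
```

MAR and MNAR patterns would instead make the missingness probability depend on observed covariates or on the missing value itself, respectively.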

Four imputation methods are evaluated: (1) simple mean imputation, (2) MissForest (a single‑imputation random‑forest approach), and two multiple‑imputation schemes based on multivariate imputation by chained equations (MICE) – predictive mean matching (MICE‑PMM) and random‑forest based imputation (MICE‑RF). For each imputed dataset, the authors generate 20 bootstrap or subsampling replicates, fit either an XGBoost model or a linear regression, and compute three global IML explanations: permutation feature importance (PFI), partial dependence (PD) plots, and global SHAP feature importance (mean absolute Shapley values).
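The multiple-imputation part of this pipeline can be approximated with scikit-learn. Note the substitution: `IterativeImputer` with `sample_posterior=True` is a MICE-style chained-equations imputer, but it is not the paper's MICE‑PMM or MICE‑RF, and `permutation_importance` stands in for the PFI computation; `m = 5` imputations is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression
from sklearn.inspection import permutation_importance

def pfi_over_imputations(X_miss, y, m=5, seed=0):
    """Fit one model per imputed dataset and collect permutation feature
    importances; the spread across the m rows reflects the extra
    variability introduced by imputation."""
    pfis = []
    for i in range(m):
        # sample_posterior=True draws imputations stochastically, so the
        # m completed datasets differ (as required for multiple imputation)
        imp = IterativeImputer(sample_posterior=True, random_state=seed + i)
        X_imp = imp.fit_transform(X_miss)
        model = LinearRegression().fit(X_imp, y)
        r = permutation_importance(model, X_imp, y, n_repeats=10,
                                   random_state=seed + i)
        pfis.append(r.importances_mean)
    return np.vstack(pfis)  # shape (m, n_features)
```

Pooling the rows (e.g. via Rubin's rules or the resampling-based estimator discussed next) then yields a point estimate and a variance that includes imputation uncertainty.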

The theoretical contribution extends Molnar et al.'s "learner‑$\Psi$" framework, which decomposes IML error into learner bias and variance, by adding an imputation‑uncertainty component. The authors estimate the expected explanation $\bar{\Psi}$ across $k$ model fits and its variance using a corrected estimator (Equation 3) that incorporates the Nadeau‑Bengio adjustment $c = n_{\text{test}}/n_{\text{train}}$ to account for dependence among resampled datasets. Confidence intervals are then constructed with a t‑distribution (Equation 4).
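A sketch of this corrected interval, reconstructed from the description above (the paper's Equations 3 and 4 may differ in detail): the sample variance of the $k$ explanation values is inflated by the factor $(1/k + c)$ with $c = n_{\text{test}}/n_{\text{train}}$, and the interval uses a $t$-quantile with $k-1$ degrees of freedom.

```python
import numpy as np
from scipy import stats

def nadeau_bengio_ci(psi, n_train, n_test, alpha=0.05):
    """CI for the mean explanation over k resampled model fits.
    The naive variance of the mean (var/k) is replaced by
    (1/k + n_test/n_train) * var to compensate for the dependence
    between overlapping resampled training sets (Nadeau & Bengio)."""
    psi = np.asarray(psi, dtype=float)
    k = len(psi)
    psi_bar = psi.mean()
    c = n_test / n_train
    corrected_var = (1.0 / k + c) * psi.var(ddof=1)
    t = stats.t.ppf(1 - alpha / 2, df=k - 1)
    half = t * np.sqrt(corrected_var)
    return psi_bar - half, psi_bar + half
```

Because $c > 0$, the corrected interval is always wider than the naive one, which is exactly the direction needed to counteract the undercoverage reported below.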

Across 1 000 simulation repetitions, coverage probabilities for nominal 95 % CIs are computed by comparing the estimated intervals to a ground‑truth reference obtained from 10 000 repetitions on the complete (no‑missing) data. The main findings are:

  1. Single imputation (mean, MissForest) systematically underestimates variance, leading to CI coverage far below 0.95 for most IML methods. Mean imputation performs especially poorly for SHAP and PFI, while MissForest does slightly better for PD but still fails to achieve nominal coverage.

  2. Multiple imputation (MICE‑PMM, MICE‑RF) yields coverage close to the nominal level, particularly for the linear DGP. Coverage remains relatively stable as missingness increases, although a slight degradation is observed for the non‑linear DGP and for the MNAR pattern.

  3. Confidence‑interval width grows with missingness, reflecting the genuine increase in uncertainty. Multiple imputation produces the widest intervals (consistent with its more accurate variance estimate), whereas single imputation yields narrower but over‑confident intervals.

  4. Bootstrap and subsampling produce similar patterns, with subsampling sometimes giving marginally lower coverage. Adjusted variance estimation (using the Nadeau‑Bengio correction) improves coverage compared to the naïve estimator, but the improvement is modest relative to the gains from multiple imputation.
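The coverage probabilities underlying these findings amount to checking, over all simulation repetitions, how often the estimated interval contains the ground-truth explanation value. A minimal sketch (illustrative, not the paper's implementation):

```python
import numpy as np

def coverage_probability(cis, truth):
    """Fraction of repetitions whose CI covers the ground-truth value.
    `cis` has shape (n_reps, 2) with columns (lower, upper); for a
    well-calibrated 95 % interval this fraction should be near 0.95."""
    cis = np.asarray(cis, dtype=float)
    hits = (cis[:, 0] <= truth) & (truth <= cis[:, 1])
    return float(hits.mean())
```

In the study, `truth` is the reference explanation from 10 000 repetitions on the complete data, and a separate coverage value is computed per IML method, imputation method, missingness mechanism, and rate.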

The authors conclude that ignoring imputation uncertainty can lead to over‑confident, potentially misleading explanations, especially in high‑stakes domains such as healthcare or finance. They recommend employing multiple imputation together with appropriate variance corrections when reporting IML explanations, and they provide reproducible code on GitHub. Future work is suggested on more complex DGPs, additional IML techniques (e.g., LIME, ICE), and real‑world missing‑data applications.
