Multiple Imputation Methods under Extreme Values
Missing data are ubiquitous in empirical databases, yet statistical analyses typically require complete data matrices. Multiple imputation offers a principled solution for filling these gaps. This study evaluates the performance of several multiple imputation methods, both in the presence and absence of extreme values, using the MICE package in R. Through Monte Carlo simulations, we generated incomplete data sets with three variables and assessed each imputation method within regression models. The results indicate that linear regression-based imputation showed the best overall predictive performance (CV-MSE), whereas the sparse-model approach was generally less efficient. Our findings underscore the relevance of extreme values when selecting an imputation strategy and highlight sample size, proportion of missingness, presence of extremes, and the type of fitted model as key determinants of performance. Despite its limitations, the study offers practical recommendations for researchers, stressing the need to examine the missingness mechanism and the occurrence of extreme values before choosing an imputation method.
💡 Research Summary
This paper investigates how multiple imputation (MI) methods perform when data contain both missing values and extreme observations. Using the R MICE package, the authors conduct extensive Monte‑Carlo simulations in which three continuous variables (y, x₁, x₂) are generated from a linear model y = 1 + 0.5 x₁ + 1.5 x₂ + ε (ε ∼ N(0, 1.5²)). The predictor x₁ follows N(10, 2²) and x₂ is generated conditionally on x₁ to achieve a controllable correlation ρ. Missingness is introduced completely at random (MCAR) only in x₂, with missing‑rate levels ranging from 10 % to 30 %. To create extreme values, a symmetric “three‑sigma” contamination replaces a proportion P_ext (5 %–20 %) of observations with the sample mean ± 3 standard deviations, thereby generating vertical outliers and high‑leverage points.
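The data-generating and contamination steps described above can be sketched in Python with NumPy; the paper works in R, so the function and parameter names below are illustrative, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n=500, rho=0.5, miss_rate=0.2, p_ext=0.1):
    """Sketch of the simulation design (names are mine, not the paper's).

    y = 1 + 0.5*x1 + 1.5*x2 + eps,  eps ~ N(0, 1.5^2)
    x1 ~ N(10, 2^2); x2 is generated conditionally on x1 so that
    corr(x1, x2) = rho, with unit marginal variance for x2.
    """
    x1 = rng.normal(10, 2, n)
    x2 = rho * (x1 - 10) / 2 + np.sqrt(1 - rho**2) * rng.normal(0, 1, n)
    y = 1 + 0.5 * x1 + 1.5 * x2 + rng.normal(0, 1.5, n)

    # MCAR missingness in x2 only
    x2_obs = x2.copy()
    miss = rng.random(n) < miss_rate
    x2_obs[miss] = np.nan

    # symmetric "three-sigma" contamination: replace a share p_ext of the
    # observed x2 values with mean +/- 3 SD (sign drawn at random)
    obs_idx = np.flatnonzero(~miss)
    k = int(p_ext * obs_idx.size)
    pick = rng.choice(obs_idx, size=k, replace=False)
    sign = rng.choice([-1.0, 1.0], size=k)
    x2_obs[pick] = np.nanmean(x2_obs) + sign * 3 * np.nanstd(x2_obs)
    return y, x1, x2_obs
```

One design point worth noting: because the right-hand side is evaluated before assignment, the contamination uses the mean and SD of the pre-contamination observed values.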
Two downstream analysis models are employed: ordinary least squares (OLS) for clean data and elastic‑net (α = 0.5) for contaminated data, both tuned by K‑fold cross‑validation with identical folds reused across imputations. The simulation design varies sample size (n = 200, 500, 1,000), correlation ρ, missing‑rate, and extreme‑value proportion, yielding a fully crossed factorial experiment.
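The fold-reuse idea (identical cross-validation splits across all imputed data sets) can be sketched with scikit-learn; `cv_mse` and every name here are my own, and `ElasticNet(l1_ratio=0.5)` stands in for the paper's elastic-net with α = 0.5:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression, ElasticNet

def cv_mse(X, y, model, folds):
    """Out-of-sample CV-MSE over a pre-built list of folds, so the exact
    same train/test splits can be reused across every imputed data set."""
    errs = []
    for train, test in folds:
        fit = model.fit(X[train], y[train])
        errs.append(np.mean((y[test] - fit.predict(X[test])) ** 2))
    return float(np.mean(errs))

# Build the folds once and reuse them for all M imputations.
n = 200
folds = list(KFold(n_splits=5, shuffle=True, random_state=1).split(np.arange(n)))
# Clean data:        cv_mse(X, y, LinearRegression(), folds)
# Contaminated data: cv_mse(X, y, ElasticNet(alpha=0.1, l1_ratio=0.5), folds)
# (l1_ratio=0.5 plays the role of glmnet's alpha = 0.5; tuning of the
#  penalty weight itself is omitted in this sketch.)
```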
The MI methods compared fall into two broad families. The parametric family includes linear regression (LM), Bayesian linear regression (Bayes‑LM), and predictive mean matching (PMM). The non‑parametric/machine‑learning family comprises random forest, CART, and other modern learners available through MICE. For each method, M = 5–10 completed data sets are generated, analyzed with the same downstream model, and pooled using Rubin’s rules. Primary performance is measured by out‑of‑sample cross‑validated mean‑squared error (CV‑MSE). Secondary metrics assess bias, root‑mean‑square error (RMSE), and 95 % coverage of β₀, β₁, β₂.
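Rubin's rules combine the M completed-data analyses: the pooled estimate is the mean of the per-imputation estimates, and the total variance is T = W + (1 + 1/M)B, where W is the average within-imputation variance and B the between-imputation variance. A minimal sketch (a hypothetical helper, not the MICE implementation):

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Rubin's rules for M completed-data analyses.

    estimates : (M, p) array of coefficient estimates
    variances : (M, p) array of squared standard errors
    Returns the pooled estimates and total variances T = W + (1 + 1/M) B.
    """
    est = np.asarray(estimates, float)
    var = np.asarray(variances, float)
    m = est.shape[0]
    qbar = est.mean(axis=0)       # pooled point estimate
    w = var.mean(axis=0)          # within-imputation variance
    b = est.var(axis=0, ddof=1)   # between-imputation variance
    t = w + (1 + 1 / m) * b       # total variance
    return qbar, t
```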
Key findings:
- Linear regression-based imputation consistently yields the lowest CV-MSE across all scenarios, especially when the sample is modest and the missing rate is high. Its parametric nature provides stable predictive distributions that are not overly distorted by the three-sigma contamination.
- Bayesian linear regression produces tighter predictive intervals but can underestimate extreme values, leading to higher MSE when the contamination proportion exceeds ~15 %.
- Predictive mean matching preserves the marginal distribution of the incomplete variable, yet its reliance on nearest-neighbor searches becomes less efficient as the correlation between x₁ and x₂ strengthens, inflating MSE.
- Random forest and CART exhibit robustness to outliers because tree-based splits are less sensitive to leverage points. However, when the true data-generating process is linear, these learners tend to overfit the noise introduced by missingness, resulting in larger CV-MSE than parametric methods.
- Elastic-net (sparse) imputation, despite its variable-selection benefits, is generally less efficient than the simple linear approach. Its penalty shrinks coefficients, which can be advantageous under multicollinearity but adds bias in the presence of extreme values.
Sample size amplifies these patterns: increasing n to 1,000 reduces variance for all methods, yet the superiority of linear regression imputation remains pronounced when P_ext ≥ 15 %. At the highest missing‑rate (30 %), non‑parametric methods become unstable, reinforcing the recommendation to favor parametric MI under severe missingness.
Practical recommendations derived from the study:
- Screen for extreme observations before imputation; consider Winsorizing or robust scaling if outliers are suspected.
- Confirm the missingness mechanism (MCAR/MAR). Under MAR, parametric regression‑based MI is the safest default.
- Match the imputation model to the downstream analysis (congeniality). The authors use a consistent predictor matrix across all imputations to ensure valid pooling.
- When the data are large and exhibit strong non‑linearity, supplement parametric MI with machine‑learning based imputation, but validate performance via cross‑validation.
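As an illustration of the first recommendation, a minimal quantile-based winsorizing screen; the 5 %/95 % cut-offs are illustrative defaults, not values from the paper:

```python
import numpy as np

def winsorize(x, lower=0.05, upper=0.95):
    """Two-sided winsorization at the given quantiles, as a simple
    pre-imputation screen for extremes. NaNs (missing values) are
    ignored when computing the cut-offs and pass through unchanged."""
    lo, hi = np.nanquantile(x, [lower, upper])
    return np.clip(x, lo, hi)
```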
In conclusion, the study demonstrates that, even in the presence of deliberately injected extreme values, traditional linear regression imputation outperforms more sophisticated sparse or machine‑learning approaches in terms of predictive accuracy and inferential validity. Sparse models and non‑parametric learners may be useful in niche settings (high dimensionality, pronounced non‑linearity), but for most applied research involving moderate sample sizes and linear relationships, parametric MI remains the method of choice.