When Pattern-by-Pattern Works: Theoretical and Empirical Insights for Logistic Models with Missing Values

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Predicting with missing inputs challenges even parametric models, as parameter estimation alone is insufficient for prediction on incomplete data. While several works study prediction in linear models, we focus on logistic models, where optimal predictors lack closed-form expressions. We prove that a Pattern-by-Pattern strategy (PbP), which learns one logistic model per missingness pattern, accurately approximates Bayes probabilities under a Gaussian Pattern Mixture Model (GPMM). Crucially, this result holds across standard missing data scenarios (MCAR and MAR) and, notably, in Missing Not at Random (MNAR) settings where standard methods often fail. Empirically, we compare PbP against imputation and EM methods across classification, probability estimation, calibration, and inference. Our analysis provides a comprehensive view of logistic regression with missing values. It reveals that mean imputation can serve as a baseline for low sample sizes and PbP for large sample sizes, as both methods are fast to train and can perform well in some settings. The best performance is achieved by non-linear multiple iterative imputation techniques that include the response label (Random Forest MICE with response), which are more computationally expensive.


💡 Research Summary

This paper tackles the challenging problem of binary classification when input features contain missing values. While much of the existing literature focuses on parameter estimation for linear models, the authors turn their attention to logistic regression, where the optimal predictor does not admit a closed‑form expression. Their central proposal is the Pattern‑by‑Pattern (PbP) strategy: fit a separate logistic model for each observed missingness pattern.

The theoretical contribution rests on the assumption that the data follow a Gaussian Pattern Mixture Model (GPMM). Under GPMM, each missingness pattern m∈{0,1}^d induces a distinct multivariate normal distribution for the covariates, allowing the framework to encompass MCAR, MAR, and many MNAR mechanisms. First, the authors show (Theorem 3.3) that for a probit link, the Bayes probability on each pattern remains a probit model with transformed parameters that depend on the pattern‑specific mean μ_m and covariance Σ_m. This result mirrors earlier findings for linear regression and demonstrates that the probit model is well‑specified under GPMM.
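Results of this kind rest on the classical Gaussian–probit identity E[Φ(a + bᵀZ)] = Φ((a + bᵀμ)/√(1 + bᵀΣb)) for Z ∼ N(μ, Σ): marginalizing a probit model over Gaussian missing covariates yields another probit model with transformed parameters. A minimal Monte Carlo check of the identity (not code from the paper; all parameter values are arbitrary):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Identity: E[Phi(a + b @ Z)] = Phi((a + b @ mu) / sqrt(1 + b @ Sigma @ b))
# for Z ~ N(mu, Sigma). This is what keeps the pattern-specific Bayes
# probability inside the probit family under GPMM.
a = -0.3                         # arbitrary intercept
b = np.array([0.8, -0.5])        # arbitrary coefficients of the missing block
mu = np.array([1.0, 2.0])        # pattern-specific conditional mean
Sigma = np.array([[1.0, 0.3],
                  [0.3, 0.5]])   # pattern-specific conditional covariance

# Monte Carlo estimate of E[Phi(a + b @ Z)]
Z = rng.multivariate_normal(mu, Sigma, size=500_000)
mc = norm.cdf(a + Z @ b).mean()

# Closed form from the identity
closed = norm.cdf((a + b @ mu) / np.sqrt(1.0 + b @ Sigma @ b))

print(mc, closed)  # the two values agree to Monte Carlo accuracy
```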

For the logistic link, prior work (Lobo et al., 2025) proved that the pattern‑specific Bayes predictor is not exactly logistic. The authors refine this negative result by establishing (Theorem 3.5) that the deviation between the true Bayes probability η*_m(x) and the scaled logistic function σ((α₀,m + α_mᵀx)·(1 + π σ̃²_m/8)^(−1/2)) is bounded by the supremum norm of ε(t) = Φ(t·√(π/8)) − σ(t), which is numerically about 0.018. Consequently, each pattern‑specific Bayes predictor is extremely close to a logistic function, with the correction term σ̃²_m reflecting the variance contributed by the missing covariates. This provides a solid theoretical justification for using separate logistic regressions per pattern, even though they are not strictly Bayes‑optimal.
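The ≈0.018 constant is sup_t |Φ(t·√(π/8)) − σ(t)|, the gap between the probit and logistic links after matching their slopes at zero, and it can be verified numerically in a few lines (not code from the paper):

```python
import numpy as np
from scipy.stats import norm

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# eps(t) = Phi(t * sqrt(pi/8)) - sigmoid(t): the probit-logistic gap after
# matching the two CDFs' slopes at zero.
t = np.linspace(-20.0, 20.0, 400_001)
eps = norm.cdf(t * np.sqrt(np.pi / 8.0)) - sigmoid(t)
sup_norm = np.abs(eps).max()
print(sup_norm)  # close to the 0.018 bound quoted in the theorem
```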

The empirical study evaluates PbP against several baselines: mean (constant) imputation, Multivariate Imputation by Chained Equations (MICE), Random‑Forest‑based MICE that incorporates the response variable, and EM‑type methods such as SAEM. Four complementary metrics are employed: classification error, probability estimation error, calibration error, and parameter inference error. Experiments span synthetic data with varying dimensions (d=5–50), sample sizes (N=100–10⁴), and missingness mechanisms (MCAR, MAR, MNAR). Key findings include:

  1. Small‑sample regime – Mean imputation is a fast, surprisingly competitive baseline because the limited data per pattern makes more sophisticated methods unstable.
  2. Large‑sample regime – PbP consistently yields the best trade‑off between computational cost and predictive performance under the GPMM assumption. Its predictions are well‑calibrated and its estimated coefficients closely match the true generating parameters.
  3. Non‑linear relationships – When the true data‑generating process is non‑linear or when covariates deviate from Gaussianity, response‑aware non‑linear multiple imputation (Random‑Forest MICE) outperforms all other methods, albeit at a substantially higher computational expense.
  4. Pattern concentration – Real‑world datasets exhibit a “pattern concentration” phenomenon: out of the 2^d possible missingness patterns, only a handful (typically ≤10) occur with non‑negligible frequency. This mitigates the curse of dimensionality for PbP, making it practically feasible even in moderately high dimensions.
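The PbP strategy evaluated above can be sketched in a few dozen lines. This is an illustrative simplification assuming scikit-learn, not the authors' implementation: the class name, the `min_samples` threshold, and the mean-imputation fallback for rare or unseen patterns are all choices made here for the sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class PatternByPattern:
    """Illustrative PbP classifier: one logistic model per missingness pattern.

    Rare patterns (fewer than `min_samples` rows, a single class, or all
    features missing) fall back to a logistic model on mean-imputed data.
    """

    def __init__(self, min_samples=30):
        self.min_samples = min_samples

    def fit(self, X, y):
        self.means_ = np.nanmean(X, axis=0)   # fallback imputation values
        self.models_ = {}
        patterns = np.isnan(X)
        for pat in np.unique(patterns, axis=0):
            rows = (patterns == pat).all(axis=1)
            if (rows.sum() >= self.min_samples and (~pat).any()
                    and len(np.unique(y[rows])) == 2):
                # Fit on the observed coordinates of this pattern only.
                clf = LogisticRegression(max_iter=1000)
                clf.fit(X[rows][:, ~pat], y[rows])
                self.models_[tuple(pat)] = clf
        X_imp = np.where(patterns, self.means_, X)
        self.fallback_ = LogisticRegression(max_iter=1000).fit(X_imp, y)
        return self

    def predict_proba(self, X):
        out = np.empty(len(X))
        for i, x in enumerate(X):
            clf = self.models_.get(tuple(np.isnan(x)))
            if clf is None:                   # unseen or rare pattern
                x_imp = np.where(np.isnan(x), self.means_, x)
                out[i] = self.fallback_.predict_proba(x_imp[None, :])[0, 1]
            else:
                obs = ~np.isnan(x)
                out[i] = clf.predict_proba(x[obs][None, :])[0, 1]
        return out

# Tiny demo on synthetic MCAR data.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
y = (rng.random(2000) < 1.0 / (1.0 + np.exp(-(X[:, 0] - X[:, 1])))).astype(int)
X[rng.random(X.shape) < 0.2] = np.nan         # 20% of values MCAR
proba = PatternByPattern().fit(X, y).predict_proba(X)
acc = ((proba > 0.5) == y).mean()
print(acc)
```

With d = 3 and 20% MCAR values, only 2³ = 8 patterns can occur, which mirrors the pattern-concentration point above: the per-pattern sample sizes stay large enough for separate fits.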

The authors also provide a detailed illustration in a two‑dimensional setting where one covariate is always observed and the other is always missing. They show that naïvely substituting the missing covariate by its expectation leads to poor probability estimates, whereas the PbP logistic model (or its best logistic approximation) aligns closely with the true Bayes probabilities, confirming the theoretical predictions.
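That two-dimensional illustration can be reproduced in miniature: with one covariate never observed, plugging in its expectation overstates the confidence of the prediction, while marginalizing over the missing covariate (or shrinking the linear predictor by (1 + π σ̃²/8)^(−1/2), as in the logistic approximation) stays close to the Bayes probability. All parameter values below are arbitrary illustrations, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Logistic model on (X1, X2); X1 = x1 is observed, X2 is always missing with
# X2 ~ N(mu2, s2**2). Arbitrary illustrative parameters.
b0, b1, b2 = 0.0, 1.0, 3.0
mu2, s2 = 0.0, 1.5
x1 = 1.0

# Naive plug-in: replace X2 by its expectation.
p_naive = sigmoid(b0 + b1 * x1 + b2 * mu2)

# Bayes probability: E[sigmoid(b0 + b1*x1 + b2*X2)], estimated by Monte Carlo.
x2 = rng.normal(mu2, s2, size=1_000_000)
p_bayes = sigmoid(b0 + b1 * x1 + b2 * x2).mean()

# Logistic approximation: shrink the linear predictor by sqrt(1 + pi*s~^2/8),
# where s~^2 = (b2*s2)**2 is the variance contributed by the missing covariate.
s_tilde2 = (b2 * s2) ** 2
p_shrunk = sigmoid((b0 + b1 * x1 + b2 * mu2) / np.sqrt(1.0 + np.pi * s_tilde2 / 8.0))

print(p_naive, p_bayes, p_shrunk)  # plug-in is far off; the shrunk logistic is close
```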

In conclusion, the paper delivers a comprehensive theoretical and empirical justification for the Pattern‑by‑Pattern logistic regression approach under Gaussian mixture missingness. While exact Bayes optimality is unattainable, the approximation error is negligible, and the method scales well when missingness patterns are limited. For practitioners, the recommendation is: use mean imputation for very small datasets, adopt PbP for larger datasets where the GPMM assumption is plausible, and resort to response‑aware non‑linear multiple imputation when the underlying relationships are complex or the Gaussian assumption is violated. The work bridges a gap between theory and practice in missing‑data logistic regression and opens avenues for further research on pattern‑wise modeling in other generalized linear models.

