Stochastic EM Estimation and Inference for Zero-Inflated Beta-Binomial Mixed Models for Longitudinal Count Data
Analyzing overdispersed, zero-inflated, longitudinal count data poses significant modeling and computational challenges, which standard count models (e.g., Poisson or negative binomial mixed effects models) fail to adequately address. We propose a Zero-Inflated Beta-Binomial Mixed Effects Regression (ZIBBMR) model that augments a beta-binomial count model with a zero-inflation component, fixed effects for covariates, and subject-specific random effects, accommodating excessive zeros, overdispersion, and within-subject correlation. Maximum likelihood estimation is performed via a Stochastic Approximation EM (SAEM) algorithm with latent variable augmentation, which circumvents the model’s intractable likelihood and enables efficient computation. Simulation studies show that ZIBBMR achieves accuracy comparable to leading mixed-model approaches in the literature and surpasses simpler zero-inflated count formulations, particularly in small-sample scenarios. As a case study, we analyze longitudinal microbiome data, comparing ZIBBMR with an external Zero-Inflated Beta Regression (ZIBR) benchmark; the results indicate that applying both count- and proportion-based models in parallel can enhance inference robustness when both data types are available.
💡 Research Summary
The paper tackles a pervasive problem in modern biomedical and environmental research: longitudinal count data that simultaneously exhibit excess zeros, over‑dispersion, and within‑subject correlation. Conventional approaches—Poisson or negative‑binomial mixed models—address at most two of these issues, leading to biased estimates and poor predictive performance. To fill this gap, the authors introduce the Zero‑Inflated Beta‑Binomial Mixed Effects Regression (ZIBBMR) model. ZIBBMR augments a beta‑binomial count distribution (which captures extra‑binomial variation through a latent beta‑distributed success probability) with a zero‑inflation component and subject‑specific random intercepts for both the zero‑inflation and the beta‑binomial mean. Formally, for subject i at time t the observed count Y_it follows a mixture: with probability (1‑p_it) it is exactly zero, and with probability p_it it follows a Beta‑Binomial(S_it, u_it ϕ, (1‑u_it) ϕ). The mixing probabilities are linked to covariates via logistic regressions: logit(p_it) = a_i + X_itᵀα and logit(u_it) = b_i + Z_itᵀβ, where (a_i, b_i) are independent normal random effects with variances σ₁² and σ₂².
Because the likelihood involves integrating over the random effects and the zero‑inflation mixture, a closed‑form expression is unavailable. The authors therefore adopt a Stochastic Approximation Expectation–Maximization (SAEM) algorithm. In each iteration, the random effects are sampled from their conditional distribution using a Metropolis‑Hastings step (with a blend of global prior proposals, multivariate random‑walk moves, and single‑component fine‑tuning). The sampled draws are used to update sufficient statistics via a decreasing step‑size sequence γ_q, after which the population‑level parameters (μ, G) are updated analytically. The beta‑binomial dispersion ϕ and regression coefficients β, as well as the zero‑inflation coefficients α, are updated by maximizing their respective conditional log‑likelihoods, also using the SAEM step‑size. This hybrid approach preserves the convergence guarantees of EM while handling the intractable E‑step through Monte‑Carlo approximation.
Standard errors are obtained via Louis’ missing‑information principle, which expresses the observed Fisher information as the sum of the expected complete‑data information and the covariance of the complete‑data score. Both terms are approximated using the same Monte‑Carlo draws, yielding a stochastic estimate of the information matrix; its inverse provides asymptotic covariance estimates for all parameters. To enable likelihood‑based model comparison (e.g., likelihood‑ratio tests, AIC/BIC), the authors approximate the observed log‑likelihood with importance sampling. The proposal distribution is a non‑central t‑distribution centered at the empirical conditional means of the random effects (obtained from SAEM) with a scale derived from their empirical conditional variances. Using ν = 5 degrees of freedom and K = 500 draws, the Monte‑Carlo estimator delivers a low‑variance approximation suitable for hypothesis testing.
Simulation studies evaluate ZIBBMR against two popular alternatives: glmmTMB (which uses Laplace approximation via Template Model Builder) and gamlss (which fits penalized likelihood additive models). Four data‑generating scenarios are explored, varying the sign and magnitude of fixed effects, the variance of random intercepts, the zero‑inflation probability, and the beta‑binomial dispersion ϕ. Sample sizes range from N = 30 to 100 subjects, each with T = 5, 10, or 15 repeated measures. Across all settings, ZIBBMR exhibits negligible bias, the smallest mean‑squared error, and 95 % Wald confidence‑interval coverage close to the nominal 0.95, even when N = 30. In contrast, glmmTMB and gamlss show larger bias in the zero‑inflation and dispersion parameters, especially under high over‑dispersion or strong zero‑inflation. Type‑I error rates for testing zero‑inflation effects remain at the nominal level for ZIBBMR, whereas the competitors tend to be slightly conservative or overly liberal. Computationally, glmmTMB is fastest, but ZIBBMR’s runtime is only about 1.5–2 times longer and scales well with the number of random effects.
The methodology is applied to a longitudinal vaginal microbiome dataset collected over 12 months, with sequencing depths (S_it) ranging from 200 to 800 reads per visit. Covariates include treatment group, time, and pregnancy status. ZIBBMR identifies a significant reduction in the zero‑inflation probability for the treatment group, indicating a lower chance of structural zeros for certain taxa. The beta‑binomial mean probability u_it rises over time, especially for taxa whose abundance fluctuates with pregnancy, reflecting genuine biological changes rather than sampling artifacts. Random‑effect variances σ₁² and σ₂² are sizable, confirming substantial inter‑individual heterogeneity. A benchmark Zero‑Inflated Beta Regression (ZIBR), which models proportions rather than counts, yields comparable fixed‑effect estimates but fails to capture the discrete nature of the data and shows instability when sequencing depth is low. The parallel use of count‑based ZIBBMR and proportion‑based ZIBR is recommended for robustness when both data representations are available.
In conclusion, the ZIBBMR framework provides a principled, likelihood‑based solution for longitudinal count data plagued by excess zeros, over‑dispersion, and subject correlation. The SAEM algorithm delivers reliable parameter estimates and standard errors, while importance‑sampling approximations enable formal model selection. Future extensions could incorporate multivariate dependence among taxa, Bayesian priors for regularization in high‑dimensional settings, and sparse penalization to handle ultra‑high‑throughput sequencing data. The authors supply an R package and reproducible scripts, facilitating immediate adoption by researchers confronting similar analytical challenges.
Comments & Academic Discussion
Loading comments...
Leave a Comment