Bayesian analysis of flexible Heckman selection models using Hamiltonian Monte Carlo
The Heckman selection model is widely used in econometric analysis and other social sciences to address sample selection bias in data modeling. A common assumption in Heckman selection models is that the error terms follow an independent bivariate normal distribution. However, real-world data often deviates from this assumption, exhibiting heavy-tailed behavior, which can lead to inconsistent estimates if not properly addressed. In this paper, we propose a Bayesian analysis of Heckman selection models that replace the Gaussian assumption with well-known members of the class of scale mixture of normal distributions, such as the Student’s-t and contaminated normal distributions. For these complex structures, Stan’s default No-U-Turn sampler is utilized to obtain posterior simulations. Through extensive simulation studies, we compare the performance of the Heckman selection models with normal, Student’s-t and contaminated normal distributions. We also demonstrate the broad applicability of this methodology by applying it to medical care and labor supply data. The proposed algorithms are implemented in the R package HeckmanStan.
💡 Research Summary
The paper presents a comprehensive Bayesian framework for Heckman selection models that replaces the traditional assumption of bivariate normal errors with members of the scale‑mixture‑of‑normals (SMN) family, specifically the multivariate Student’s‑t and contaminated normal (CN) distributions. The authors begin by outlining the problem of sample‑selection bias, which arises when the outcome variable is observed only for a non‑random subset of the population. The classic Heckman model addresses this bias using two linked equations—a linear outcome equation and a probit selection equation—under the assumption that the error terms follow a bivariate normal distribution. While this assumption simplifies inference, it is often violated in practice because real data frequently exhibit heavy tails or outliers, leading to inconsistent estimates.
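In generic notation (the symbols below follow the standard Heckman setup and may differ from the paper's exact conventions), the two linked equations described above can be written as:

```latex
\begin{align*}
\text{Outcome:}\quad   & y_i^{*} = \mathbf{x}_i^{\top}\boldsymbol{\beta} + \varepsilon_{1i}, \\
\text{Selection:}\quad & s_i^{*} = \mathbf{w}_i^{\top}\boldsymbol{\gamma} + \varepsilon_{2i},
                         \qquad s_i = \mathbb{1}\{s_i^{*} > 0\}, \\
                       & y_i \ \text{observed only when } s_i = 1, \\
                       & (\varepsilon_{1i}, \varepsilon_{2i})^{\top} \sim
                         \mathrm{N}_2\!\left(\mathbf{0},
                         \begin{pmatrix} \sigma^2 & \rho\sigma \\ \rho\sigma & 1 \end{pmatrix}\right),
\end{align*}
```

where the selection-equation error variance is fixed at 1 for identifiability of the probit part, and a nonzero correlation ρ is what induces the selection bias that the model corrects.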
To overcome this limitation, the authors introduce two robust extensions: the Heckman‑t (SLt) model, where the error vector follows a bivariate Student’s‑t distribution with unknown degrees of freedom ν, and the Heckman‑contaminated‑normal (SLcn) model, where the error vector follows a bivariate contaminated normal distribution characterized by a contamination proportion ν₁ and a scale‑inflation factor ν₂. Both distributions belong to the SMN class, which can be expressed as a normal distribution whose covariance matrix is multiplied by a latent scaling variable U. By choosing different mixing distributions for U (Gamma for Student’s‑t, a two‑point mixture for CN), the model can adapt to a wide range of tail behaviors and outlier structures.
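The SMN construction above is easy to illustrate: a heavy-tailed draw is just a normal draw divided by the square root of a latent scaling variable U. The following sketch (a minimal univariate illustration in Python, not the paper's bivariate implementation) generates Student's-t and contaminated-normal errors via their mixing distributions:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Student's-t via SMN: U ~ Gamma(nu/2, rate = nu/2), eps = Z / sqrt(U)
nu = 4.0
u_t = rng.gamma(shape=nu / 2, scale=2 / nu, size=n)  # NumPy uses scale = 1/rate
eps_t = rng.standard_normal(n) / np.sqrt(u_t)

# Contaminated normal via SMN: U = nu2 with prob nu1, else U = 1
nu1, nu2 = 0.1, 0.2  # contamination proportion, scale-inflation factor (0 < nu2 < 1)
u_cn = np.where(rng.random(n) < nu1, nu2, 1.0)
eps_cn = rng.standard_normal(n) / np.sqrt(u_cn)

# Both samples have markedly heavier tails than N(0, 1),
# for which P(|Z| > 3) is roughly 0.0027
print((np.abs(eps_t) > 3).mean(), (np.abs(eps_cn) > 3).mean())
```

Varying ν (or ν₁, ν₂) interpolates between near-normal and strongly outlier-prone behavior, which is exactly the flexibility the robust models exploit.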
The Bayesian specification assigns weakly informative priors: multivariate normal priors for the regression coefficients β (outcome) and γ (selection), an inverse‑gamma prior for the variance σ², a transformed beta prior for the correlation ρ, a gamma prior for ν, and beta priors for ν₁ and ν₂. The resulting posterior distribution is analytically intractable because the likelihood involves the cumulative distribution functions of the Student’s‑t and CN distributions and because the latent scaling variables introduce non‑conjugacy.
To sample from this complex posterior, the authors employ Hamiltonian Monte Carlo (HMC) as implemented in Stan, leveraging automatic differentiation and the No‑U‑Turn Sampler (NUTS) for adaptive step‑size and trajectory length selection. HMC’s use of gradient information yields substantially higher effective sample sizes per unit of computation compared with traditional Metropolis‑Hastings or Gibbs samplers, especially in high‑dimensional settings with hierarchical mixtures.
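HMC itself is a generic gradient-based sampler. As a rough sketch of the mechanics Stan automates (leapfrog integration plus a Metropolis correction, here on a toy bivariate normal target rather than the selection-model posterior, and without NUTS's adaptive trajectory length):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy target: standard bivariate normal, log p(q) = -0.5 * q.q (up to a constant)
log_prob = lambda q: -0.5 * q @ q
grad = lambda q: -q  # gradient of log p; Stan obtains this by autodiff

def hmc_step(q, step=0.2, n_leap=15):
    """One HMC transition: leapfrog dynamics guided by the gradient, then MH accept/reject."""
    p0 = rng.standard_normal(q.shape)      # resample momentum
    qn, pn = q.copy(), p0.copy()
    pn += 0.5 * step * grad(qn)            # initial half step in momentum
    for _ in range(n_leap):
        qn += step * pn                    # full step in position
        pn += step * grad(qn)              # full step in momentum
    pn -= 0.5 * step * grad(qn)            # trim back to a final half step
    h0 = -log_prob(q) + 0.5 * p0 @ p0      # initial Hamiltonian
    h1 = -log_prob(qn) + 0.5 * pn @ pn     # proposed Hamiltonian
    return qn if np.log(rng.random()) < h0 - h1 else q

q, draws = np.zeros(2), []
for i in range(3000):
    q = hmc_step(q)
    if i >= 500:                           # discard warm-up
        draws.append(q)
draws = np.asarray(draws)
```

Because proposals follow the gradient, successive draws are far less correlated than random-walk proposals of comparable cost, which is the source of the effective-sample-size advantage noted above.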
A thorough simulation study evaluates the three models (normal, t, CN) across a range of data-generating scenarios: varying degrees of tail heaviness, different contamination proportions, and sample sizes. Performance metrics include bias, root-mean-square error, coverage of 95% credible intervals, and Bayesian model-selection criteria (WAIC, LOO-CV). Results consistently show that when the true error distribution is heavy-tailed or contaminated, the SLt and SLcn models produce markedly lower bias and more accurate uncertainty quantification than the standard normal model. Moreover, the model-selection criteria (WAIC and LOO-CV) correctly identify the data-generating model as the best-fitting one.
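The evaluation metrics named above are standard and simple to compute from replicated fits. A sketch, using hypothetical posterior summaries in place of the paper's actual simulation output (all numbers below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
true_beta = 1.5  # hypothetical true parameter value
R = 500          # number of simulation replications

# Stand-ins for the posterior mean and 95% credible interval from each replication
post_mean = true_beta + rng.normal(0.0, 0.1, size=R)
half_width = 0.22
lower, upper = post_mean - half_width, post_mean + half_width

# The three frequentist summaries used to compare the fitted models
bias = np.mean(post_mean - true_beta)
rmse = np.sqrt(np.mean((post_mean - true_beta) ** 2))
coverage = np.mean((lower <= true_beta) & (true_beta <= upper))
print(f"bias={bias:.3f}  rmse={rmse:.3f}  coverage={coverage:.3f}")
```

Under a correctly specified model, coverage should sit near the nominal 95%; the misspecified normal model in the heavy-tailed scenarios shows inflated RMSE and under-coverage.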
The methodology is illustrated with two real‑world applications. The first uses medical‑care utilization data where high‑cost patients are selectively observed; the contaminated‑normal model captures a small proportion of extreme cost observations without allowing them to dominate the parameter estimates. The second application analyzes labor‑supply data with non‑random participation; the Student’s‑t model accommodates the pronounced skewness and heavy tails in wages of part‑time and informal workers, yielding more realistic elasticity estimates. Posterior predictive checks and residual diagnostics confirm the improved fit of the robust models.
All computational tools are packaged in the R library HeckmanStan, which automates Stan code generation for the three error specifications, provides functions for prior specification, runs NUTS sampling, and includes diagnostics, model‑comparison utilities, and visualization of posterior predictive distributions. The package is publicly available on CRAN, facilitating adoption by applied researchers.
In conclusion, the paper makes four key contributions: (1) it generalizes the Heckman selection framework to a flexible SMN error structure, thereby accommodating heavy tails and contamination; (2) it demonstrates that HMC/NUTS delivers efficient Bayesian inference for these models, overcoming the computational challenges of traditional MCMC; (3) it validates the approach through extensive simulations and two substantive empirical examples, showing superior estimation and predictive performance; and (4) it provides an open‑source implementation that lowers the barrier to applying robust selection models in economics, biostatistics, and related fields. Future work is suggested on extending the framework to multi‑stage selection, non‑linear outcome specifications, and sparsity‑inducing priors for variable selection.