Data-adaptive covariate balancing for causal effect estimation in high-dimensional data
A key challenge in estimating causal effects from observational data is handling confounding, which is commonly addressed through weighting methods that balance the distribution of covariates between treatment and control groups. Weighting approaches can be classified by whether the weights are estimated using parametric or nonparametric methods, and by whether they rely on modeling and inverting the propensity score or directly estimate weights that achieve distributional balance by minimizing a measure of dissimilarity between groups. Parametric methods, both for propensity score modeling and for direct balancing, are prone to model misspecification. In addition, balancing approaches often suffer from the curse of dimensionality, as they assign equal importance to all covariates and can thus de-emphasize the true confounders. Several methods, such as the outcome-adaptive lasso, attempt to mitigate this issue through variable selection, but they are parametric and focus on propensity score estimation rather than direct balancing. In this paper, we propose a nonparametric direct balancing approach that uses random forests to adaptively emphasize confounders. Our method jointly models treatment and outcome using random forests, allowing the data to identify covariates that influence both processes. We construct a similarity measure, defined as the proportion of trees in which two observations fall into the same leaf node, yielding a distance between the treatment and control distributions that is sensitive to relevant covariates and captures the structure of confounding. Under suitable assumptions, we show that the resulting weights converge in the L2 norm to the normalized inverse propensity scores and yield consistent treatment effect estimates. We demonstrate the effectiveness of our approach through extensive simulations and an application to a real dataset.
💡 Research Summary
The paper tackles the fundamental problem of confounding adjustment in causal inference when only observational data are available, focusing on high‑dimensional covariate settings where traditional weighting methods struggle. Existing approaches fall into two broad categories: (i) propensity‑score based methods that model the treatment assignment and then use the inverse propensity score as weights, and (ii) direct covariate‑balancing methods that construct weights by minimizing a discrepancy measure between treated and control covariate distributions. The former are vulnerable to model misspecification, while the latter typically treat every covariate equally, which leads to poor performance in high dimensions because true confounders receive no special emphasis.
To overcome these limitations, the authors propose a non‑parametric, data‑adaptive weighting scheme that leverages multivariate random forests. The key idea is to fit a random forest where both the treatment indicator and the outcome are treated as multivariate responses. Because tree splits are chosen to reduce the joint loss for predicting both response variables, the forest naturally focuses on covariates that influence both treatment and outcome—i.e., the true confounders. From the fitted forest they define a similarity kernel: the probability that two observations fall into the same leaf node in a randomly selected tree. This kernel is highly sensitive to the structure of confounding and serves as the basis for a Maximum Mean Discrepancy (MMD) distance between the treated and control groups.
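The similarity kernel described above is simple to compute once a forest is fitted: each observation maps to one leaf per tree, and the kernel entry for a pair of observations is the fraction of trees in which they land in the same leaf. The sketch below uses synthetic leaf assignments purely for illustration; in practice they would come from a fitted forest (e.g. scikit-learn's `forest.apply(X)` returns exactly such a matrix of leaf indices).

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_trees = 6, 100

# leaves[i, t] = index of the leaf that observation i falls into in tree t.
# Synthetic stand-in for what a fitted forest would produce.
leaves = rng.integers(0, 4, size=(n, n_trees))

# K[i, j] = proportion of trees in which observations i and j share a leaf.
K = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)

assert np.allclose(np.diag(K), 1.0)  # every point shares its own leaf in all trees
assert np.allclose(K, K.T)           # the similarity is symmetric
```

Because splits in the jointly fitted forest concentrate on covariates predictive of both treatment and outcome, two observations that agree on the confounders tend to co-occur in leaves, which is what makes this kernel confounding-sensitive.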
Weights are then obtained by solving an optimization problem that minimizes the MMD distance with respect to the forest‑based kernel, subject to standard normalization constraints. The authors provide a theoretical analysis under a simplified “random‑split” forest model. They prove that the kernel is universal (i.e., its associated reproducing‑kernel Hilbert space is dense in the space of continuous functions) and that the MMD‑minimizing weights converge in L2 norm to the normalized inverse propensity scores. Consequently, the resulting weighted estimator of the average treatment effect (ATE) is consistent and asymptotically efficient under the usual causal assumptions (SUTVA, no unmeasured confounding, positivity).
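The weight-finding step above reduces to minimizing a quadratic form in the kernel over the probability simplex. The following numpy-only sketch illustrates this for weights on the treated group against uniform control weights, using a Gaussian kernel as a stand-in for the forest-based kernel and a simple exponentiated-gradient update; the paper's actual optimizer and constraints may differ.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy 1-d covariate: treated group is shifted relative to controls.
x = np.concatenate([rng.normal(1.0, 1.0, 40), rng.normal(0.0, 1.0, 60)])
treated = np.arange(40)
control = np.arange(40, 100)

# Any positive-definite kernel works here; a Gaussian kernel is a stand-in
# for the forest-based similarity kernel described in the paper.
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)

def mmd2(K, w, treated, control):
    """Squared MMD between w-weighted treated and uniformly weighted controls."""
    m = np.full(len(control), 1.0 / len(control))
    return (w @ K[np.ix_(treated, treated)] @ w
            - 2.0 * w @ K[np.ix_(treated, control)] @ m
            + m @ K[np.ix_(control, control)] @ m)

def mmd_weights(K, treated, control, iters=2000, lr=0.1):
    """Exponentiated-gradient descent on the simplex (an illustrative choice
    of solver, not necessarily the paper's)."""
    Ktt = K[np.ix_(treated, treated)]
    ktc = K[np.ix_(treated, control)].mean(axis=1)
    w = np.full(len(treated), 1.0 / len(treated))
    for _ in range(iters):
        grad = 2.0 * (Ktt @ w - ktc)   # gradient of the quadratic MMD^2 in w
        w = w * np.exp(-lr * grad)     # multiplicative update keeps w >= 0
        w /= w.sum()                   # renormalize onto the simplex
    return w

w = mmd_weights(K, treated, control)
uniform = np.full(len(treated), 1.0 / len(treated))
assert np.isclose(w.sum(), 1.0)
assert mmd2(K, w, treated, control) < mmd2(K, uniform, treated, control)
```

The theoretical result quoted above says that, with the universal forest kernel, these MMD-minimizing weights approach the normalized inverse propensity scores as the sample grows, so the weighted difference in outcome means is a consistent ATE estimator.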
Extensive simulation studies evaluate the method across a range of scenarios: varying proportions of confounders, precision variables, instrumental variables, and pure noise variables; linear versus highly non‑linear outcome models; and different sample sizes and dimensionalities. The proposed approach consistently yields lower bias, smaller mean‑squared error, and tighter confidence intervals than competing methods, including outcome‑adaptive lasso (a parametric variable‑selection technique), energy‑balancing, and other non‑parametric MMD‑based weighting schemes that rely on Euclidean or Gaussian kernels.
A real‑world application to a medical dataset (e.g., evaluating a drug’s effect using electronic health records) demonstrates practical advantages. The forest‑based kernel produces a weighted covariate distribution where standardized mean differences for most variables fall below 0.1, indicating successful balance. The estimated ATE differs from that obtained by conventional methods and exhibits narrower bootstrap confidence intervals, suggesting improved precision.
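The balance diagnostic quoted above (standardized mean differences below 0.1) is straightforward to compute. A minimal numpy sketch, where the function name and toy data are our own illustration rather than the paper's code:

```python
import numpy as np

def weighted_smd(x_t, x_c, w):
    """Standardized mean difference between a w-weighted treated sample and an
    unweighted control sample; |SMD| < 0.1 is the usual rule of thumb."""
    mu_t = np.average(x_t, weights=w)
    pooled_sd = np.sqrt((x_t.var(ddof=1) + x_c.var(ddof=1)) / 2.0)
    return (mu_t - x_c.mean()) / pooled_sd

rng = np.random.default_rng(2)
x_c = rng.normal(0.0, 1.0, 50)
uniform = np.full(50, 1.0 / 50)

# Identical samples under uniform weights are perfectly balanced (SMD = 0).
assert np.isclose(weighted_smd(x_c, x_c, uniform), 0.0)
```

In practice this would be evaluated per covariate, before and after weighting, to produce the balance table summarized in the application.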
The paper acknowledges several limitations. The theoretical results rely on a simplified random‑split forest model; real random forests involve data‑dependent splits, making rigorous asymptotic analysis more challenging. The MMD optimization is non‑convex and computationally intensive, especially with large numbers of trees or observations. Hyper‑parameters such as the number of trees, tree depth, and leaf size affect the kernel and thus the weights, requiring careful tuning. Moreover, the method currently handles a binary treatment; extensions to multi‑level or continuous treatments are not addressed.
Future research directions include: (1) extending the kernel construction to other non‑parametric learners (gradient boosting, neural networks); (2) integrating the weighting step with outcome modeling in a double‑machine‑learning framework; (3) developing scalable stochastic optimization algorithms for the MMD problem; and (4) adapting the approach to longitudinal or time‑varying treatment settings. Overall, the paper offers a novel, theoretically grounded, and empirically effective solution for high‑dimensional causal inference, bridging the gap between flexible machine‑learning models and rigorous covariate‑balancing methodology.