Assumption-Lean Post-Integrated Inference with Surrogate Control Outcomes
Data integration methods aim to extract low-dimensional embeddings from high-dimensional outcomes to remove unwanted variations, such as batch effects and unmeasured covariates, across heterogeneous datasets. However, multiple hypothesis testing after integration can be biased due to data-dependent processes. We introduce a robust post-integrated inference method that accounts for latent heterogeneity by utilizing control outcomes. Leveraging causal interpretations, we derive nonparametric identifiability of the direct effects using negative control outcomes. By utilizing surrogate control outcomes as an extension of negative control outcomes, we develop semiparametric inference on projected direct effect estimands, accounting for hidden mediators, confounders, and moderators. These estimands remain statistically meaningful under model misspecifications and with error-prone embeddings. We provide bias quantifications and finite-sample linear expansions with uniform concentration bounds. The proposed doubly robust estimators are consistent and efficient under minimal assumptions and potential misspecification, facilitating data-adaptive estimation with machine learning algorithms. Our proposal is evaluated using random forests through simulations and analysis of single-cell CRISPR perturbed datasets, which may contain potential unmeasured confounders.
💡 Research Summary
This paper addresses a critical yet often overlooked problem in data science: the statistical validity of inference performed on “integrated” data. Data integration methods, such as batch correction in genomics, aim to remove unwanted variations (e.g., batch effects, unmeasured covariates) by extracting low-dimensional embeddings from high-dimensional outcomes across heterogeneous datasets. A common practice is to first estimate these latent embeddings (e.g., via PCA, RUV, SVA) and then use them as additional covariates in downstream regression or hypothesis testing. However, this two-step procedure ignores the estimation uncertainty from the first step, leading to biased standard errors, inflated type I errors, and unreliable p-values.
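The two-step practice described above can be sketched in a few lines. This is a minimal illustration on simulated data, not the authors' code: the function names `pca_embeddings` and `naive_two_step`, the simulation settings, and the choice of plain PCA as the integration step are all assumptions made for the example.

```python
import numpy as np

def pca_embeddings(Y, r):
    """Top-r principal component scores of the column-centered matrix Y."""
    Yc = Y - Y.mean(axis=0)
    left, s, _ = np.linalg.svd(Yc, full_matrices=False)
    return left[:, :r] * s[:r]

def naive_two_step(Y, x, r):
    """Regress every column of Y on [1, x, U_hat], treating U_hat as fixed.

    Returns the coefficient on x and its naive t-statistic per outcome.
    The standard errors ignore the fact that U_hat was estimated from Y.
    """
    n, _ = Y.shape
    U_hat = pca_embeddings(Y, r)                 # step 1: estimate embeddings
    D = np.column_stack([np.ones(n), x, U_hat])  # step 2: plug them in as covariates
    coef, *_ = np.linalg.lstsq(D, Y, rcond=None)
    resid = Y - D @ coef
    dof = n - D.shape[1]
    sigma2 = (resid ** 2).sum(axis=0) / dof
    xtx_inv = np.linalg.inv(D.T @ D)
    se = np.sqrt(sigma2 * xtx_inv[1, 1])
    return coef[1], coef[1] / se

# Hypothetical simulation: one latent factor u that also drives the treatment x.
rng = np.random.default_rng(0)
n, p, r = 300, 40, 1
u = rng.normal(size=n)
x = 0.7 * u + rng.normal(size=n)
beta = np.zeros(p)                 # global null: x has no direct effect
gamma = rng.normal(size=p)         # loadings of the outcomes on u
Y = np.outer(x, beta) + np.outer(u, gamma) + rng.normal(size=(n, p))

beta_hat, t_stat = naive_two_step(Y, x, r)
```

Because `U_hat` is computed from the same `Y` that is then tested, the naive t-statistics need not follow their nominal null distribution, which is the problem the paper sets out to address.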
The authors propose a robust “assumption-lean” framework for post-integrated inference. They start from the causal inference concept of “negative control outcomes” (NCOs): outcomes that are driven by the latent variable U but, conditional on U, are independent of the treatment X and the primary outcomes. Using NCOs, they establish nonparametric identifiability conditions for direct effects. Because perfect NCOs are hard to find in practice, they introduce a more flexible concept, “surrogate control outcomes” (SCOs). SCOs only need to be strongly associated with U and impose no strict causal ordering between U and X, so the framework applies to a wider range of scenarios in which U can be a confounder, mediator, or moderator.
The core methodological contribution is the development of semiparametric inference for “projected direct effect” estimands. These estimands represent a form of adjusted association between X and Y after accounting for U, and they remain statistically meaningful even under model misspecification or when using error-prone estimated embeddings (Û). The paper rigorously quantifies the bias introduced by using Û instead of the true U, providing finite-sample linear expansions and uniform concentration bounds.
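As a rough illustration of what a projected estimand of this kind can look like, one standard construction is the partially linear projection of a single outcome $Y_j$ on $X$ after removing what $U$ explains. The notation below ($\theta_j$, $Y_j$) is assumed for this sketch and the paper's exact definition may differ.

```latex
\theta_j
  \;=\; \frac{\mathbb{E}\big[\operatorname{Cov}(X,\, Y_j \mid U)\big]}
             {\mathbb{E}\big[\operatorname{Var}(X \mid U)\big]}
  \;=\; \frac{\mathbb{E}\big[(X - \mathbb{E}[X \mid U])\,(Y_j - \mathbb{E}[Y_j \mid U])\big]}
             {\mathbb{E}\big[(X - \mathbb{E}[X \mid U])^2\big]}
```

A quantity of this form is well defined whether or not a linear model for $Y_j$ holds, which is the sense in which such an estimand stays meaningful under misspecification.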
The proposed estimators are “doubly robust” and efficient. They are consistent if either the model for the outcome Y given X and U, or the model for the treatment X given U, is correctly specified, and they achieve semiparametric efficiency when both are correct. This property allows for the use of flexible, data-adaptive machine learning algorithms (like random forests) in estimating these nuisance functions, enhancing the method’s robustness and applicability.
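The idea of combining an outcome model and a treatment model with data-adaptive nuisance estimates can be illustrated with a generic cross-fitted residual-on-residual estimator. This is a sketch under stated assumptions, not the paper's estimator: the embedding `u` is treated as known and one-dimensional, the function names are hypothetical, and a cubic polynomial least-squares fit stands in for the random forests used in the paper so that the example needs only NumPy.

```python
import numpy as np

def poly_features(u, degree=3):
    """Polynomial basis [1, u, u^2, ...] for a one-dimensional embedding."""
    return np.vander(u, N=degree + 1, increasing=True)

def fit_predict(u_train, target_train, u_test, degree=3):
    """Least-squares stand-in for a flexible learner such as a random forest."""
    coef, *_ = np.linalg.lstsq(poly_features(u_train, degree), target_train, rcond=None)
    return poly_features(u_test, degree) @ coef

def cross_fitted_effect(y, x, u, n_folds=5, seed=0):
    """Residual-on-residual estimate of the effect of x on y, adjusting for u.

    Nuisance functions E[x | u] and E[y | u] are fitted on the training folds
    and evaluated on the held-out fold (cross-fitting).
    """
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    rx = np.empty(n)
    ry = np.empty(n)
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        rx[test_idx] = x[test_idx] - fit_predict(u[train], x[train], u[test_idx])
        ry[test_idx] = y[test_idx] - fit_predict(u[train], y[train], u[test_idx])
    theta = (rx @ ry) / (rx @ rx)
    # Plug-in standard error based on the estimated influence function.
    psi = rx * (ry - theta * rx)
    se = np.sqrt(np.mean(psi ** 2) / n) / np.mean(rx ** 2)
    return theta, se

# Hypothetical simulation with a true effect of 0.5 and a nonlinear role for u.
rng = np.random.default_rng(1)
n = 2000
u = rng.normal(size=n)
x = np.sin(u) + rng.normal(size=n)
y = 0.5 * x + u ** 2 + rng.normal(size=n)

theta_hat, se_hat = cross_fitted_effect(y, x, u)
```

Swapping `fit_predict` for a random forest or another learner changes only the nuisance step; the final ratio and the standard error formula stay the same.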
The framework is evaluated through simulations and a real-data analysis of single-cell CRISPR perturbation data from a study of autism spectrum disorder. The simulations show that, under various misspecification scenarios, the proposed method maintains nominal type I error rates and achieves higher power than naive regression and existing post-integration methods. In the real-data analysis, which focuses on the effect of PTEN perturbation, the method produces well-calibrated test statistics (close to a standard normal distribution under the null). By contrast, several popular batch correction and confounder adjustment methods yield either overly conservative or anti-conservative distributions and agree poorly with one another on which genes are significant.
In summary, this work provides a principled and practical statistical framework that bridges data integration and valid inference. It offers theoretical guarantees for a common but flawed practice, enabling researchers to leverage advanced data integration techniques while drawing reliable statistical conclusions, even in the presence of unmeasured confounding and model uncertainty.