Planning for gold: Hypothesis screening with split samples for valid powerful testing in matched observational studies
Observational studies are valuable tools for inferring causal effects in the absence of controlled experiments. However, these studies may be biased due to the presence of some relevant, unmeasured set of covariates. One approach to mitigate this concern is to identify hypotheses likely to be more resilient to hidden biases by splitting the data into a planning sample for designing the study and an analysis sample for making inferences. We devise a powerful and flexible method for selecting hypotheses in the planning sample when an unknown number of outcomes are affected by the treatment, allowing researchers to gain the benefits of exploratory analysis and still conduct powerful inference under concerns of unmeasured confounding. We investigate the theoretical properties of our method and conduct extensive simulations that demonstrate pronounced benefits, especially at higher levels of allowance for unmeasured confounding. Finally, we demonstrate our method in an observational study of the multi-dimensional impacts of a devastating flood in Bangladesh.
💡 Research Summary
Observational studies aim to infer causal effects without the benefit of randomization, yet they remain vulnerable to hidden bias arising from unmeasured covariates. This paper introduces a systematic, data‑splitting‑based hypothesis‑screening framework that leverages the “sensitivity value” – the smallest Γ at which a treatment effect loses statistical significance – to select outcomes that are intrinsically more robust to hidden bias. The full dataset is randomly partitioned into a planning sample and an analysis sample. In the planning sample, each of the L candidate outcomes is evaluated using a matched‑set signed‑score statistic (e.g., Wilcoxon signed‑rank). For each outcome l, the sensitivity value Γₗ(α) is computed and transformed to κₗ = Γₗ/(1 + Γₗ), which maps the quantity onto the unit interval. An estimate of the variability of κₗ (via bootstrap, asymptotic variance, or other methods) yields a (1‑α) predictive interval for the analysis‑sample κₗ. If the lower bound of this interval exceeds the transformed control level κ_con = Γ_con/(1 + Γ_con), where Γ_con is the researcher‑chosen bound on allowable hidden bias, outcome l is retained in the candidate set S. Only the outcomes in S are subsequently tested in the analysis sample at the pre‑specified significance level α, guaranteeing family‑wise error rate (FWER) control at α (Proposition 1).
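The screening step can be summarized in a few lines of code. The sketch below is illustrative only: it assumes the planning-sample sensitivity values Γₗ and bootstrap standard errors of κₗ have already been computed, and it uses a one-sided normal lower bound as a stand-in for the predictive interval. The function name and interface are hypothetical, not the authors' implementation.

```python
# Minimal sketch of the planning-sample screening rule (illustrative, not the
# authors' code). Inputs: planning-sample sensitivity values Gamma_l and
# standard errors of the transformed kappa_l (e.g., from a bootstrap).
import numpy as np
from scipy.stats import norm

def screen_outcomes(gamma_plan, kappa_se, gamma_con, alpha=0.05):
    """Return indices of outcomes whose lower predictive bound for
    kappa = Gamma / (1 + Gamma) exceeds the transformed control level."""
    gamma_plan = np.asarray(gamma_plan, dtype=float)
    kappa_se = np.asarray(kappa_se, dtype=float)

    kappa_plan = gamma_plan / (1.0 + gamma_plan)   # map Gamma >= 1 onto [1/2, 1)
    kappa_con = gamma_con / (1.0 + gamma_con)      # transformed control level

    # One-sided (1 - alpha) lower bound standing in for the predictive interval
    lower = kappa_plan - norm.ppf(1.0 - alpha) * kappa_se

    return np.where(lower > kappa_con)[0]          # candidate set S

# Example: three candidate outcomes screened against Gamma_con = 2
keep = screen_outcomes(gamma_plan=[3.1, 1.4, 2.6],
                       kappa_se=[0.03, 0.05, 0.08],
                       gamma_con=2.0)
print(keep)  # only the outcomes in this set are tested in the analysis sample
```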
The authors provide a rigorous theoretical justification: (i) the sensitivity value is a consistent estimator of design sensitivity, (ii) κₗ satisfies a central‑limit theorem under mild regularity conditions, and (iii) the predictive‑interval rule is conservative yet powerful, especially in finite samples where design sensitivity alone can be misleading. They contrast this “adaptive” method with two baselines: (a) a naïve split‑sample rule that selects outcomes with planning‑sample p‑values ≤ α, and (b) a full‑sample Bonferroni correction without splitting.
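For concreteness, the two baseline selection rules can be written down directly. The sketch below is a hedged rendering of the comparison methods as described in the paper; the function names and interfaces are hypothetical.

```python
# Baseline selection rules used for comparison (illustrative sketch).
import numpy as np

def naive_split_selection(p_planning, alpha=0.05):
    """Baseline (a): keep outcomes whose planning-sample p-value is <= alpha,
    then test only those outcomes in the analysis sample."""
    return np.where(np.asarray(p_planning) <= alpha)[0]

def full_sample_bonferroni(p_full, alpha=0.05):
    """Baseline (b): no splitting; reject only outcomes whose full-sample
    p-value clears the Bonferroni threshold alpha / L."""
    p_full = np.asarray(p_full)
    return np.where(p_full <= alpha / len(p_full))[0]
```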
Extensive simulations explore a range of scenarios: total sample sizes N = 500–2000, numbers of outcomes L = 100–1000, proportions of truly affected outcomes of 5–20%, and control levels Γ_con ∈ {1, 2, 3, 4}. Results show that the proposed method consistently attains higher power than both baselines while maintaining FWER ≤ α. The advantage grows as Γ_con increases, reflecting the method’s ability to prioritize outcomes that remain significant under stronger hidden bias. In small‑sample or low‑Γ settings the gain is modest, in line with theoretical expectations.
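A hedged sketch of the simulation grid implied by these ranges is shown below; the specific grid points and loop body are illustrative stand-ins, not the authors' exact design.

```python
# Illustrative parameter grid for the simulation study described above.
from itertools import product

sample_sizes   = [500, 1000, 2000]     # total sample size N
num_outcomes   = [100, 500, 1000]      # number of candidate outcomes L
effect_props   = [0.05, 0.10, 0.20]    # proportion of truly affected outcomes
gamma_controls = [1, 2, 3, 4]          # researcher-chosen bias allowance Gamma_con

for N, L, prop, gamma_con in product(sample_sizes, num_outcomes,
                                     effect_props, gamma_controls):
    # Generate matched data, apply the screening rule and both baselines,
    # and record family-wise error rate and power at level alpha.
    ...
```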
The methodology is applied to a real‑world case study of the 1998 Bangladesh floods. The authors consider 30 multidimensional outcomes (food availability, water sanitation, disease incidence, etc.) measured on roughly 5,000 households. Using a 20% planning split, they compute sensitivity values for each outcome, set Γ_con = 3, and retain seven outcomes whose predictive‑interval lower bounds exceed the transformed threshold. In the 80% analysis sample, these seven outcomes show statistically significant treatment effects, confirming that the flood reduced food availability and access to sanitary water and increased the incidence of certain illnesses, while the remaining outcomes appear too sensitive to hidden bias to support reliable conclusions.
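As a quick arithmetic check using the transform defined above, the choice Γ_con = 3 corresponds to a transformed threshold of

```latex
\kappa_{\mathrm{con}} \;=\; \frac{\Gamma_{\mathrm{con}}}{1+\Gamma_{\mathrm{con}}}
\;=\; \frac{3}{1+3} \;=\; 0.75,
```

so an outcome is carried into the 80% analysis sample only if the lower bound of its planning-sample predictive interval for κₗ exceeds 0.75.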
The paper also extends the framework beyond matched pairs to matched sets of arbitrary size, provides supplemental material on variance estimation, and discusses potential extensions such as Bayesian integration, handling multiple treatment levels, and non‑parametric matching.
In summary, this work offers a practical, theoretically sound solution for researchers who wish to exploit exploratory flexibility while preserving rigorous inference in matched observational studies. By anchoring outcome selection to the sensitivity value and its predictive uncertainty, the approach balances error control, power, and robustness to hidden confounding, representing a meaningful advance over traditional full‑sample multiple‑testing corrections.