Assessing Utility of Differential Privacy for RCTs
Randomized controlled trials (RCTs) have become powerful tools for assessing the impact of interventions and policies in many contexts. They are considered the gold standard for causal inference in the biomedical fields and many social sciences, and researchers have published an increasing number of studies that rely on RCTs for at least part of their inference. These studies typically include the response data that has been collected, de-identified, and sometimes protected through traditional disclosure limitation methods. In this paper, we empirically assess the impact of privacy-preserving synthetic data generation methodologies on published RCT analyses by leveraging available replication packages (research compendia) in economics and policy analysis. We implement three privacy-preserving algorithms that build on one of the basic differentially private (DP) algorithms, the perturbed histogram, to preserve the quality of statistical inference. We highlight challenges with the direct use of this algorithm and of the stability-based histogram in our setting, and describe the adjustments needed. We provide simulation studies and demonstrate that we can replicate the analysis in a published economics article on privacy-protected data under various parameterizations. We find that relatively straightforward (at a high level) privacy-preserving methods influenced by DP techniques allow for inference-valid protection of published data. The results are applicable to researchers wishing to share RCT data with strong privacy protection, especially in the context of low- and middle-income countries.
💡 Research Summary
The paper investigates whether differential privacy (DP) techniques can be used to protect data from randomized controlled trials (RCTs) while preserving the statistical validity of the original analyses. The authors note that RCTs have become the gold standard for causal inference across biomedical, social science, and economic fields, and that the push for reproducibility has led many journals to require the publication of replication packages containing the raw data and code. However, most of these data sets are only de‑identified, leaving them vulnerable to re‑identification attacks, especially in low‑ and middle‑income countries where legal protections may be weaker.
To address this gap, the authors develop three DP‑based synthetic data generation mechanisms that all rely on a perturbed histogram as a building block. The first mechanism adds Laplace noise directly to histogram counts, providing a straightforward (ε,δ)‑DP guarantee. The second, a stability‑based histogram, repeatedly resamples the data to estimate the variability of each bin and scales the noise accordingly, thereby reducing unnecessary distortion in high‑frequency bins. The third mechanism combines the first two: it first adds Laplace noise and then applies a stability check to add further correction where needed. The authors acknowledge that none of the three methods achieve pure DP in the strict sense because of practical constraints (e.g., small sample sizes, mixed continuous and categorical variables), but they argue that the resulting privacy loss is modest and can be quantified.
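The perturbed-histogram building block is simple to sketch. The snippet below is a minimal, illustrative Python version of the first mechanism only (Laplace noise on histogram counts, followed by sampling synthetic values from the noisy bins); the function names, the uniform within-bin sampling, and the treatment of negative counts are assumptions for illustration, not the paper's or the DPrct package's exact implementation.

```python
import numpy as np

def perturbed_histogram(data, bins, epsilon, rng=None):
    """Release a DP histogram by adding Laplace noise to bin counts.

    Adding or removing one record changes the count vector by at most 1
    in L1 norm, so Laplace noise with scale 1/epsilon calibrates the
    mechanism to that sensitivity.
    """
    rng = np.random.default_rng(rng)
    counts, edges = np.histogram(data, bins=bins)
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)
    # Post-processing is free under DP: clamp negatives, renormalize.
    noisy = np.clip(noisy, 0.0, None)
    total = noisy.sum()
    probs = noisy / total if total > 0 else np.full(len(counts), 1.0 / len(counts))
    return probs, edges

def sample_synthetic(probs, edges, n, rng=None):
    """Draw synthetic observations: pick a bin, then sample uniformly inside it."""
    rng = np.random.default_rng(rng)
    idx = rng.choice(len(probs), size=n, p=probs)
    return rng.uniform(edges[idx], edges[idx + 1])
```

A stability-based variant would additionally suppress bins whose noisy counts fall below a threshold, which is where the (ε,δ) relaxation enters.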
A comprehensive simulation study evaluates the three mechanisms across a range of privacy budgets (ε = 0.1, 0.5, 1.0; δ = 10⁻⁵) and sample sizes (N = 500, 1,000, 5,000). The metrics include mean‑squared error of treatment‑effect estimates, coverage of 95 % confidence intervals, and the inflation of standard errors. Results show that for ε ≤ 0.5 the loss of statistical efficiency is limited: confidence‑interval coverage remains above 93 % and MSE increases by no more than 20 % relative to the non‑private benchmark. When ε = 0.1, small samples suffer noticeable power loss, indicating that the privacy budget must be chosen with the study’s size and inferential goals in mind.
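The coverage metric above can be checked with a toy Monte Carlo. The sketch below is not the paper's simulation design: it uses a deliberately naive DP release (Laplace noise added directly to the two group means, with an assumed noise scale of 1/(nε)) purely to show how confidence-interval coverage is tallied across replications.

```python
import numpy as np

def coverage_sim(n=500, epsilon=0.5, tau=1.0, reps=200, seed=0):
    """Fraction of replications whose 95% CI covers the true effect tau
    after Laplace noise is added to the released group means."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(reps):
        treat = rng.normal(tau, 1.0, n)
        ctrl = rng.normal(0.0, 1.0, n)
        # Assumed noise scale 1/(n*epsilon), a stand-in for a properly
        # sensitivity-calibrated release of a bounded mean.
        noise = rng.laplace(scale=1.0 / (n * epsilon), size=2)
        est = (treat.mean() + noise[0]) - (ctrl.mean() + noise[1])
        se = np.sqrt(treat.var(ddof=1) / n + ctrl.var(ddof=1) / n)
        hits += abs(est - tau) <= 1.96 * se
    return hits / reps
```

With noise this small relative to sampling error, coverage stays near the nominal 95%; shrinking n or ε lets the DP noise dominate, reproducing the power loss the summary describes for ε = 0.1 in small samples.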
The empirical component replicates the analysis of Blattman, Jamison, and Sheridan (2017), a well‑known cash‑transfer RCT conducted in a low‑income setting. Using the publicly available replication package, the authors treat the original data as “confidential” and apply their DP mechanisms to generate a privacy‑protected version that can be released as a drop‑in replacement. They then run the same intent‑to‑treat regressions on the synthetic data. The estimated treatment effects differ by less than 0.03 from the original estimates, and the 95 % confidence intervals still overlap, demonstrating that the key substantive conclusions (cash transfers significantly raise income) are preserved. Standard errors are modestly larger (≈ 8 % inflation), which the authors argue is an acceptable trade‑off for the privacy gain.
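The "drop-in replacement" idea is that the identical estimation code runs on both datasets. The snippet below sketches the comparison for a simple intent-to-treat OLS; the regression form, variable names, and tolerances are illustrative assumptions, not taken from the Blattman, Jamison, and Sheridan replication package.

```python
import numpy as np

def itt_ols(y, d):
    """Intent-to-treat OLS of outcome y on treatment assignment d
    (with intercept). Returns (effect estimate, classical standard error)."""
    X = np.column_stack([np.ones_like(d, dtype=float), d])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n, k = X.shape
    sigma2 = resid @ resid / (n - k)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[1], float(np.sqrt(cov[1, 1]))
```

Running `itt_ols` once on the confidential data and once on the synthetic release, then comparing point estimates and confidence-interval overlap, mirrors the paper's validity check.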
To facilitate adoption, the authors release an R package called DPrct (Web, 2025). The package implements the three histogram‑based mechanisms and provides utilities for choosing the number of bins, calibrating noise, and performing Bayesian post‑processing to adjust for the added DP noise. Computational complexity is O(N·B) (N = sample size, B = number of bins), allowing the entire pipeline to run on a standard laptop in a few minutes for datasets of several thousand observations. This low computational footprint is especially relevant for researchers in low‑resource environments.
The paper contributes on three fronts: (1) it identifies practical challenges of applying off‑the‑shelf DP algorithms to RCT data (e.g., sparsity, mixed data types) and proposes concrete algorithmic adjustments; (2) it provides both simulation evidence and a real‑world replication that demonstrate “inference‑valid” privacy protection is achievable; and (3) it supplies open‑source software that lowers the barrier for non‑specialists to adopt DP in their data‑sharing workflows.
In the discussion, the authors outline future research directions, including extending DP methods to clustered or multi‑level RCT designs, adapting the approach for non‑linear or machine‑learning models, and developing systematic ε‑budget accounting for multiple analyses on the same data set. They argue that such advances will enable a principled balance between transparency, reproducibility, and participant confidentiality, particularly in contexts where data protection regulations are limited. Overall, the study provides a compelling proof‑of‑concept that differential privacy can be integrated into the standard workflow of RCT researchers without sacrificing the credibility of causal inference.