Privacy Amplification for Synthetic Data using Range Restriction

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the original arXiv source.

We introduce a new class of range-restricted formal data privacy standards that condition on owner beliefs about sensitive data ranges. By incorporating this additional information, we can provide a stronger privacy guarantee (i.e., an amplification). The range-restricted formal privacy standards protect only a subset (or ball) of data values and exclude ranges (or balls) believed to be already publicly known. The privacy standards are designed for the risk-weighted pseudo posterior (model) mechanism (PPM) used to generate synthetic data under an asymptotic Differential Privacy (aDP) guarantee. The PPM downweights the likelihood contribution of each record proportionally to its disclosure risk. The PPM is adapted under inclusion of owner beliefs by adjusting the risk-weighted pseudo likelihood. We introduce two alternative adjustments. The first expresses data owner knowledge of the sensitive range as a probability, $λ$, that a datum value drawn from the underlying generating distribution lies outside the ball or subspace of values that are sensitive. The portion of each datum likelihood contribution deemed sensitive is then $(1-λ) \leq 1$ and is the only portion of the likelihood subject to risk down-weighting. The second adjustment encodes knowledge as the difference in probability masses, $P(R) \leq 1$, between the edges of the sensitive range, $R$. We use the resulting conditional (pseudo) likelihood for a sensitive record, which boosts its worst-case tail values away from 0. We compare privacy and utility properties of the PPM under the aDP and range-restricted privacy standards.


💡 Research Summary

The paper introduces a novel “range‑restricted” privacy framework that strengthens the asymptotic differential privacy (aDP) guarantee of the risk‑weighted pseudo‑posterior mechanism (PPM) used for synthetic data generation. Traditional differential privacy (DP) protects the entire data space based on a worst‑case sensitivity analysis, often leading to excessive noise and poor utility. In many practical settings, however, data owners can identify sub‑ranges (or balls) of each variable that are truly sensitive, while the complementary region is already public knowledge. By conditioning the privacy guarantee on these known non‑sensitive regions, the authors propose to protect only the subset of the data that truly requires protection, thereby “amplifying” privacy.

Two concrete adjustments are presented. The first, called the “range‑averaged” adjustment, computes for each record i a probability λ_i that a draw from the posterior predictive distribution falls outside the owner‑specified sensitive interval R_i. This λ_i quantifies the portion of the record’s likelihood that is already public. The original risk weight α_i (derived from the absolute log‑likelihood) is then scaled by (1‑λ_i), and an adjusted weight α*_i = λ_i + (1‑λ_i)·α_i is used in the pseudo‑posterior. Consequently, only the (1‑λ_i) fraction of the likelihood that lies inside the sensitive range is down‑weighted, while the rest is left untouched. The resulting Lipschitz sensitivity Δ_{α,λ,x} = max_{θ,m,i} |(1‑λ_i)·α_i·f_θ^m(x_i)| is provably smaller than the original Δ_{α,x}, leading to a reduced ε‑budget (stronger privacy) without additional distortion.
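The weight adjustment and the reduced sensitivity above can be sketched in a few lines. This is an illustrative sketch only, not the paper's implementation; the function names and the example values of α_i, λ_i, and the log‑likelihood magnitudes are hypothetical.

```python
import numpy as np

def range_averaged_weights(alpha, lam):
    """Range-averaged adjustment (illustrative): alpha*_i = lam_i + (1 - lam_i) * alpha_i.

    alpha : original risk weights in [0, 1] (1 = no down-weighting)
    lam   : probability that record i's value lies outside its sensitive range R_i
    Only the (1 - lam_i) sensitive fraction of the likelihood remains down-weighted.
    """
    alpha = np.asarray(alpha, dtype=float)
    lam = np.asarray(lam, dtype=float)
    return lam + (1.0 - lam) * alpha

def lipschitz_sensitivity(alpha, lam, abs_log_lik):
    """Delta_{alpha,lam,x} = max_i |(1 - lam_i) * alpha_i * f_i|, maximized over the
    supplied draws, where f_i stands in for the record's log-likelihood magnitude."""
    return float(np.max((1.0 - np.asarray(lam)) * np.asarray(alpha) * np.asarray(abs_log_lik)))

alpha = np.array([0.9, 0.5, 0.2])   # hypothetical risk weights
lam = np.array([0.8, 0.5, 0.0])     # hypothetical non-sensitive masses
print(range_averaged_weights(alpha, lam))
```

Because (1 − λ_i) ≤ 1, the adjusted sensitivity returned by `lipschitz_sensitivity` can never exceed the baseline max_i |α_i·f_i|, which is the source of the amplification.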

The second adjustment, termed “range‑truncated,” replaces λ_i with the probability mass P(R_i) contained within the sensitive interval. By constructing a conditional pseudo‑likelihood that raises the likelihood contribution of the sensitive portion to the power (1‑λ_i)·α_i, the method bounds the worst‑case tail of the likelihood away from zero, further diminishing the contribution of sensitive records to the overall sensitivity.
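The conditioning step can be illustrated with a toy Gaussian model: for a record in the sensitive range R = [lo, hi], the conditional log-likelihood is log f(x) − log P(R), and since log P(R) < 0 this lifts the tail values of the likelihood away from zero. This is a hedged sketch under an assumed normal data model, not the paper's synthesizer; all function names are illustrative.

```python
import math

def norm_logpdf(x, mu=0.0, sigma=1.0):
    """Log density of N(mu, sigma^2) at x (toy stand-in for the synthesis model)."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))

def norm_cdf(x, mu=0.0, sigma=1.0):
    """CDF of N(mu, sigma^2), used to compute the sensitive-range mass P(R)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def conditional_log_lik(x, lo, hi, mu=0.0, sigma=1.0):
    """log f(x | x in R) = log f(x) - log P(R) for the sensitive range R = [lo, hi]."""
    mass = norm_cdf(hi, mu, sigma) - norm_cdf(lo, mu, sigma)  # P(R) <= 1
    return norm_logpdf(x, mu, sigma) - math.log(mass)
```

Because P(R) ≤ 1, subtracting log P(R) always increases the log-likelihood of a sensitive record, shrinking the |f| term that enters the Lipschitz sensitivity.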

Both adjustments are integrated into the PPM, preserving its Bayesian synthesis structure while incorporating owner knowledge. The authors derive formal definitions of “range‑averaged privacy” and “range‑truncated privacy” that mirror the (ε,δ) formulation of DP but are conditioned on the set of ranges R = {R_1,…,R_n}. They prove that the resulting mechanisms satisfy aDP with a strictly smaller ε, because the sensitivity is computed only over the protected subspace.
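The chain of inequalities behind the amplification can be written compactly. As a sketch (assuming, as in the aDP framework summarized above, that the privacy budget ε scales with the Lipschitz sensitivity):

$$
(1-\lambda_i)\,\alpha_i\,\bigl|f_{\theta}^{m}(x_i)\bigr| \;\le\; \alpha_i\,\bigl|f_{\theta}^{m}(x_i)\bigr|
\quad\Longrightarrow\quad
\Delta_{\alpha,\lambda,x} \;\le\; \Delta_{\alpha,x}
\quad\Longrightarrow\quad
\epsilon_{\lambda} \;\le\; \epsilon,
$$

with equality only when λ_i = 0 for the sensitivity-maximizing record, i.e., when no part of its support is publicly known.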

Empirical evaluation consists of extensive simulation studies varying data dimensionality, the width of sensitive intervals, and the distribution of λ_i. Results consistently show that for modestly sized sensitive regions (e.g., covering 10‑20 % of the support), the range‑restricted mechanisms achieve 30‑70 % reductions in ε compared with the baseline aDP PPM, while maintaining comparable or superior utility metrics such as mean squared error, coverage of credible intervals, and preservation of marginal distributions.

A real‑world case study uses an accelerated life‑testing dataset containing failure times and temperature measurements. The authors designate early failure times as the sensitive range, estimate λ_i from the posterior predictive, and apply the range‑averaged adjustment. Synthetic data generated under the range‑restricted aDP retain the original failure‑time distribution and regression relationships, yet the privacy budget drops from ε≈1.2 (standard aDP) to ε≈0.6, illustrating practical gains.

The paper situates its contribution within the broader literature on privacy relaxations (e.g., (ε,δ) DP, Rényi DP, and Perfect Privacy) and notes that while range restriction relaxes the worst‑case guarantee, it does so in a principled way that leverages publicly known information. The authors suggest future work on adaptive selection of sensitive ranges, extensions to other synthesis frameworks (e.g., GAN‑based generators), and integration with additive‑noise mechanisms under Rényi DP.

In summary, by explicitly modeling owner‑known non‑sensitive intervals and adjusting the pseudo‑posterior weighting accordingly, the authors provide a mathematically sound method to amplify privacy guarantees for synthetic data, achieving a favorable balance between rigorous privacy protection and statistical utility.

