Generating High-quality Privacy-preserving Synthetic Data
Synthetic tabular data enables sharing and analysis of sensitive records, but its practical deployment requires balancing distributional fidelity, downstream utility, and privacy protection. We study a simple, model-agnostic post-processing framework that can be applied on top of any synthetic data generator to improve this trade-off. First, a mode-patching step repairs categories that are missing or severely underrepresented in the synthetic data, while largely preserving learned dependencies. Second, a k-nearest-neighbor filter replaces synthetic records that lie too close to real data points, enforcing a minimum distance between real and synthetic samples. We instantiate this framework for two neural generative models for tabular data, a feed-forward generator and a variational autoencoder, and evaluate it on three public datasets covering credit card transactions, cardiovascular health, and census-based income. We assess marginal and joint distributional similarity, the performance of models trained on synthetic data and evaluated on real data, and several empirical privacy indicators, including nearest-neighbor distances and attribute inference attacks. With moderate thresholds between 0.2 and 0.35, the post-processing reduces divergence between real and synthetic categorical distributions by up to 36 percent and improves a combined measure of pairwise dependence preservation by 10 to 14 percent, while keeping downstream predictive performance within about 1 percent of the unprocessed baseline. At the same time, distance-based privacy indicators improve and the success rate of attribute inference attacks remains largely unchanged. These results provide practical guidance for selecting thresholds and applying post-hoc repairs to improve the quality and empirical privacy of synthetic tabular data, while complementing approaches that provide formal differential privacy guarantees.
💡 Research Summary
The paper tackles the practical challenges of deploying synthetic tabular data, namely the need to balance distributional fidelity, downstream utility, and privacy protection. While modern deep generative models such as CTGAN and TVAE have become de facto standards for mixed‑type tabular data, they still suffer from two persistent problems: categorical mode collapse (rare categories disappear from the synthetic output) and proximity‑based privacy risks (synthetic records that are too close to real individuals). To address these issues without redesigning the generators themselves, the authors propose a model‑agnostic post‑processing pipeline that can be attached to any trained synthetic data generator.
The pipeline consists of two orthogonal components. First, a “mode‑patching” step identifies missing categorical modes by cross‑tabulating real versus synthetic frequency tables. For each absent category, the lower (feature‑extracting) layers of the pre‑trained generator are frozen, and only the upper layers are fine‑tuned on a small subset of real rows that contain the missing category. This targeted fine‑tuning restores categorical support while preserving the rest of the learned distribution; empirically, freezing 60‑80 % of early layers stabilizes adaptation and avoids catastrophic forgetting.
Second, the authors introduce a HEOM‑kNN ε‑ANY privacy filter. Using the Heterogeneous Euclidean Overlap Metric (HEOM) to handle mixed numeric and categorical attributes, they compute for each real record a radius r_i equal to the distance to its second‑nearest real neighbor. Any synthetic record that falls inside any real record's radius is deemed unsafe. The filter iteratively resamples unsafe synthetic rows until the empirical risk ε_ANY (the fraction of unsafe rows) drops below a user‑specified threshold τ_ANY. Although this mechanism does not provide formal (ε,δ)‑DP guarantees, it enforces a minimum distance between synthetic and real data, thereby improving empirical privacy metrics.
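The risk computation behind this filter can be sketched as follows, assuming the standard HEOM definition (0/1 overlap distance for categorical attributes, range-normalized absolute difference for numeric ones). The function names, the list-of-tuples row layout, and the toy data are illustrative choices, not taken from the paper:

```python
def heom(x, y, ranges, categorical):
    """Heterogeneous Euclidean-Overlap Metric for mixed-type rows.

    categorical: set of column indices treated as categorical
    (overlap distance 0/1); numeric columns are range-normalized.
    """
    total = 0.0
    for j, (a, b) in enumerate(zip(x, y)):
        if j in categorical:
            d = 0.0 if a == b else 1.0
        else:
            d = abs(a - b) / ranges[j] if ranges[j] > 0 else 0.0
        total += d * d
    return total ** 0.5

def knn_radii(real, ranges, categorical, k=2):
    """Radius of each real record = distance to its k-th nearest real neighbor."""
    radii = []
    for i, r in enumerate(real):
        dists = sorted(heom(r, s, ranges, categorical)
                       for j, s in enumerate(real) if j != i)
        radii.append(dists[k - 1])
    return radii

def epsilon_any(real, synth, ranges, categorical, k=2):
    """Empirical risk: fraction of synthetic rows inside ANY real record's
    k-NN radius.  The paper's filter would resample such unsafe rows from
    the generator until this fraction drops below the threshold tau_ANY."""
    radii = knn_radii(real, ranges, categorical, k)
    unsafe = sum(
        1 for s in synth
        if any(heom(s, r, ranges, categorical) < rad
               for r, rad in zip(real, radii))
    )
    return unsafe / len(synth)

# Toy example: one numeric column (observed range 0.4), one categorical column.
real = [(0.0, "x"), (0.1, "x"), (0.2, "x"), (0.3, "x"), (0.4, "x")]
synth = [(0.15, "x"), (0.9, "x"), (0.2, "y")]  # only the first lies "too close"
eps = epsilon_any(real, synth, ranges=[0.4, 0.0], categorical={1})
print(eps)  # → 0.333... (1 of 3 synthetic rows is unsafe)
```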
The authors evaluate the approach on three public datasets—Credit Default, Adult (census income), and Cardiovascular health—using both CTGAN and TVAE as base generators. Their evaluation framework spans three dimensions: (1) fidelity, measured by Jensen‑Shannon divergence for categorical marginals, quantile‑based discrepancies for continuous variables, and multivariate dependence matrices (Pearson, Cramér’s V, η²) summarized via Frobenius norms; (2) utility, assessed through the Train‑on‑Synthetic Test‑on‑Real (TSTR) protocol with eight diverse classifiers; and (3) privacy, captured by distance‑to‑closest‑record (DCR) distributions, Relative Proximity Ratio (RPR), Correct Attribution Probability (CAP), and attribute‑inference attacks (AIA).
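As a small illustration of the fidelity dimension, the categorical Jensen-Shannon divergence can be computed directly from raw frequency tables. This is a minimal sketch assuming base-2 logarithms (so the score is bounded in [0, 1]); the function name and count-dict interface are hypothetical, not the authors' evaluation code:

```python
import math

def js_divergence(p_counts, q_counts, base=2):
    """Jensen-Shannon divergence between two categorical marginals,
    given raw count dicts (category -> count)."""
    cats = set(p_counts) | set(q_counts)
    n_p = sum(p_counts.values())
    n_q = sum(q_counts.values())
    p = {c: p_counts.get(c, 0) / n_p for c in cats}
    q = {c: q_counts.get(c, 0) / n_q for c in cats}
    m = {c: 0.5 * (p[c] + q[c]) for c in cats}

    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability terms.
        return sum(a[c] * math.log(a[c] / b[c], base)
                   for c in cats if a[c] > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Identical marginals score 0; disjoint supports score 1 (base 2).
print(js_divergence({"A": 3, "B": 1}, {"A": 3, "B": 1}))  # → 0.0
print(js_divergence({"A": 1}, {"B": 1}))                  # → 1.0
```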
Results show that moderate filtering thresholds (τ_ANY≈0.2–0.35) achieve the best trade‑off. At these settings, categorical Jensen‑Shannon divergence drops by up to 36 %, multivariate dependence preservation improves by 10–14 %, and TSTR predictive performance remains within ±1 % of the unfiltered baseline, sometimes even slightly better due to a regularizing effect. Tight thresholds (τ_ANY≪0.1) over‑filter the data, causing loss of support and degraded utility, while very loose thresholds leave proximity risks largely unchanged. Privacy indicators improve: DCR values increase and RPR moves toward the ideal 50 % balance, indicating reduced memorization of training records. However, attribute‑inference attack success rates remain largely unchanged, suggesting that the filter mainly mitigates nearest‑neighbor leakage but does not fully block attacks that exploit higher‑order statistical cues.
The key contribution is a lightweight, post‑hoc augmentation that can be layered on top of any synthetic data generator, including those that already provide formal differential privacy (e.g., DP‑CTGAN, PATE‑CTGAN). By decoupling fidelity restoration (mode patching) from proximity control (ε‑ANY filter), the pipeline complements existing methods without requiring retraining from scratch. The authors also release an open‑source evaluation suite and code, promoting reproducibility and facilitating future research on combined formal‑and‑empirical privacy safeguards. Overall, the work demonstrates that careful post‑processing can substantially improve the quality‑privacy balance of synthetic tabular data while preserving downstream utility.