Measuring Privacy Risks and Tradeoffs in Financial Synthetic Data Generation
We explore the privacy-utility tradeoff of synthetic data generation schemes on tabular financial datasets, a domain characterized by high regulatory risk and severe class imbalance. We consider representative tabular data generators, including autoencoders, generative adversarial networks, diffusion models, and copula synthesizers. To address the challenges of the financial domain, we provide novel privacy-preserving implementations of GAN and autoencoder synthesizers. We evaluate how well each generator simultaneously achieves data quality, downstream utility, and privacy, comparing performance on balanced and imbalanced input datasets. Our results offer insight into the distinct challenges of generating synthetic data from datasets that exhibit severe class imbalance and mixed-type attributes.
💡 Research Summary
The paper investigates the privacy‑utility trade‑off of synthetic data generation for tabular financial datasets, which are characterized by strict regulatory constraints and severe class imbalance. Four representative generators are examined: a classical Gaussian Copula, the diffusion‑based TabDiffusion, the GAN‑based CTGAN, and the variational auto‑encoder variant TVAE. The authors focus on CTGAN and TVAE because they achieve high data quality while being amenable to differential privacy (DP) integration. They design novel DP‑CTGAN and DP‑TVAE implementations that provide rigorous (ε, δ)‑DP guarantees throughout the entire synthesis pipeline.
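For reference, the (ε, δ)-DP guarantee that the pipelines target is the standard one: a randomized mechanism must behave almost identically on any two datasets differing in a single record.

```latex
\text{A randomized mechanism } M \text{ satisfies } (\varepsilon, \delta)\text{-DP if, for all
adjacent datasets } D, D' \text{ (differing in one record) and all measurable sets } S:
\quad \Pr[M(D) \in S] \le e^{\varepsilon} \, \Pr[M(D') \in S] + \delta.
```

Smaller ε and δ mean a stronger guarantee; ε ≤ 1.0 (as in the experiments below) is generally considered a meaningful level of protection.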
Key technical contributions include: (1) Replacing the original Gaussian‑mixture based continuous‑feature binning with a uniform binning that only uses non‑private metadata (min/max ranges), thereby eliminating a privacy leak in preprocessing. (2) Generating conditional vectors by randomly selecting a column and copying a value from a Poisson‑sampled real record, which avoids exposing aggregate statistics. (3) Randomizing the selection of real records for each conditional vector, ensuring each training example is used with equal probability and simplifying privacy accounting. (4) Computing discriminator loss per‑sample, clipping each loss contribution, and adding calibrated Gaussian noise via DP‑Adam, which protects the discriminator’s gradients. (5) Updating the generator with ordinary (non‑DP) Adam, relying on DP’s post‑processing property: the generator incurs no additional privacy cost because it only sees the already‑privatized discriminator outputs. For TVAE, the entire encoder‑decoder pair is trained with DP‑Adam, extending DP protection to all model parameters.
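The per‑sample clip‑and‑noise step in contribution (4) follows the standard DP‑SGD/DP‑Adam recipe. A minimal NumPy sketch of the privatization step, with hypothetical `clip_norm` and `noise_multiplier` values not taken from the paper:

```python
import numpy as np

def dp_noisy_gradient(per_sample_grads, clip_norm=1.0,
                      noise_multiplier=1.1, rng=None):
    """Clip each sample's gradient to L2 norm <= clip_norm, sum them,
    add calibrated Gaussian noise, and average. The result is the
    privatized gradient an optimizer like DP-Adam would consume."""
    rng = rng or np.random.default_rng(0)
    clipped = []
    for g in per_sample_grads:
        norm = np.linalg.norm(g)
        # Scale down only if the per-sample norm exceeds the clip bound.
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    total = np.sum(clipped, axis=0)
    # Noise scale is proportional to the sensitivity (clip_norm).
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=total.shape)
    return (total + noise) / len(per_sample_grads)
```

Because each sample's contribution is bounded by `clip_norm`, the Gaussian noise can be calibrated to yield an (ε, δ)-DP guarantee per step, with a moments accountant tracking the cumulative budget over training.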
The empirical evaluation uses five financial tabular datasets from the Tabular Arena benchmark: Adult (census income), Bank Customer Churn, Bank Marketing, Credit‑Card Default, and German Credit. These datasets contain a mix of categorical and continuous attributes and exhibit minority class ratios ranging from 10 % to 85 %. The authors assess three dimensions: (i) data quality (marginal and joint distribution similarity, SDMetrics scores), (ii) downstream utility (classification accuracy and AUC of models trained on synthetic data), and (iii) privacy risk (membership inference attack success rate and the effective ε value). They also explore the impact of class imbalance by comparing results on the original imbalanced splits and on balanced versions obtained via down‑sampling or over‑sampling.
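Membership inference attack (MIA) success is a standard way to measure empirical privacy risk. A common baseline, sketched below, predicts "member" when a record's model loss falls below a threshold; this is an illustrative attack, not necessarily the one used in the paper:

```python
import numpy as np

def loss_threshold_mia_accuracy(member_losses, nonmember_losses, threshold):
    """Loss-threshold membership inference: records the model fits well
    (low loss) are guessed to be training members. Returns attack
    accuracy; 0.5 means the attacker does no better than chance."""
    member_losses = np.asarray(member_losses)
    nonmember_losses = np.asarray(nonmember_losses)
    correct = (np.sum(member_losses < threshold)
               + np.sum(nonmember_losses >= threshold))
    return correct / (len(member_losses) + len(nonmember_losses))
```

A generator that memorizes training records yields systematically lower losses on members, pushing this accuracy well above 0.5; DP training is designed to pull it back toward chance.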
Results show that non‑DP generators achieve the best quality and utility but suffer from mode collapse on highly imbalanced datasets, especially CTGAN, whose synthetic minority class coverage drops dramatically. DP‑CTGAN and DP‑TVAE substantially reduce MIA success (by 30‑50 % at ε ≤ 1.0) while incurring modest quality and utility losses (5‑12 %). DP‑CTGAN’s redesign of conditional vector sampling and per‑sample loss clipping mitigates mode collapse, and when combined with class‑balanced batch construction the utility loss shrinks further (≈3 %). DP‑TVAE experiences slightly larger degradation due to the VAE’s reconstruction nature but still offers strong privacy protection. Diffusion‑based TabDiffusion remains stable in the non‑DP setting but becomes impractical under DP because the added noise destabilizes the diffusion process and dramatically increases computational cost.
An important observation is that severe class imbalance amplifies privacy risk: minority records that are selected more frequently leak more information, leading to higher MIA success. The authors’ mitigation—randomized conditional vector generation and balanced batch sampling—effectively dampens this effect. The paper also highlights that privacy‑preserving implementations must address not only the training algorithm but also preprocessing steps that can inadvertently expose data statistics.
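The balanced batch construction described above can be sketched as follows; this is a hypothetical illustration of class-balanced sampling, not the authors' exact implementation:

```python
import numpy as np

def balanced_batch_indices(labels, batch_size, rng=None):
    """Draw a batch with equal counts per class, so no class's records
    dominate the batches a DP accountant must cover."""
    rng = rng or np.random.default_rng(0)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    per_class = batch_size // len(classes)
    idx = []
    for c in classes:
        pool = np.flatnonzero(labels == c)
        # Sample with replacement so small minority classes can fill
        # their quota.
        idx.extend(rng.choice(pool, size=per_class, replace=True))
    return np.array(idx)
```

Combined with randomized conditional vector generation, this keeps each record's expected selection frequency controlled, which both simplifies privacy accounting and reduces the extra exposure of minority records.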
In conclusion, the study provides a thorough, reproducible benchmark of synthetic data generators for financial tabular data, introduces robust DP‑CTGAN and DP‑TVAE pipelines, and demonstrates how class imbalance interacts with both privacy leakage and model stability. The authors suggest future directions such as adaptive ε allocation, imbalance‑aware sampling strategies, and meta‑learning frameworks for jointly optimizing privacy and utility, which could further bridge the gap between regulatory compliance and data‑driven innovation in finance.