HeteroFedSyn: Differentially Private Tabular Data Synthesis for Heterogeneous Federated Settings

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv paper.

Traditional Differential Privacy (DP) mechanisms are typically tailored to specific analysis tasks, which limits the reusability of protected data. DP tabular data synthesis overcomes this by generating synthetic datasets that can be shared for arbitrary downstream tasks. However, existing synthesis methods predominantly assume centralized or local settings and overlook the more practical horizontal federated scenario. Naively synthesizing data locally or perturbing individual records either produces biased mixtures or introduces excessive noise, especially under heterogeneous data distributions across participants. We propose HeteroFedSyn, the first DP tabular data synthesis framework designed specifically for the horizontal federated setting. Built upon the PrivSyn paradigm of 2-way marginal-based synthesis, HeteroFedSyn introduces three key innovations for distributed marginal selection: (i) an L2-based dependency metric with random projection for noise-efficient correlation measurement, (ii) an unbiased estimator to correct multiplicative noise, and (iii) an adaptive selection strategy that dynamically updates dependency scores to avoid redundancy. Extensive experiments on range queries, Wasserstein fidelity, and machine learning tasks show that, despite the increased noise inherent to federated execution, HeteroFedSyn achieves utility comparable to centralized synthesis. Our code is open-sourced via the link.


💡 Research Summary

HeteroFedSyn addresses the problem of differentially private (DP) tabular data synthesis in a realistic horizontal federated learning setting where multiple organizations hold disjoint subsets of records that share the same schema but exhibit heterogeneous distributions. Traditional DP synthesis methods such as PrivSyn assume a centralized data repository, while local‑DP approaches either produce biased mixtures or inject prohibitive noise when applied naïvely across federated parties. The authors propose the first framework that enables collaborative synthesis of a global synthetic dataset while preserving (ε,δ)‑DP guarantees and keeping communication overhead low.

The core of HeteroFedSyn builds on the PrivSyn paradigm of 2‑way marginal‑based synthesis but introduces three technical innovations to make marginal selection feasible under federation. First, the authors define an L2‑based dependency metric InDif2(a, b) that measures the correlation between any pair of attributes a and b. To reduce the dimensionality of each 2‑way marginal (size |A|·|B|, where A and B are the domains of a and b), random projection compresses the marginal to a short vector of length k (k ≪ |A|·|B|) while preserving the L2 distance in expectation. Second, because the compressed marginals are perturbed with Gaussian noise, the product‑type computation required for InDif2 becomes biased. The authors derive an unbiased estimator that corrects for the noise’s second‑order moments, guaranteeing that the expected value of the estimated InDif2 equals the true L2 distance. Third, they observe that greedy selection based solely on raw dependency scores can be redundant: once marginals (a,b) and (a,c) are selected, the correlation between b and c is already partially constrained. An adaptive marginal selection mechanism therefore updates the InDif2 scores after each selection, lowering the scores of attribute pairs that are already covered and encouraging coverage of new relationships within a fixed privacy budget.
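The interplay of random projection, Gaussian noise, and bias correction can be illustrated with a small simulation. This is only a minimal sketch under stated assumptions, not the paper's estimator: it assumes the dependency score is the squared L2 norm of a difference vector (2‑way marginal minus the product of 1‑way marginals), uses a Gaussian projection matrix with N(0, 1/k) entries, and treats a single noisy vector rather than the cross‑client products handled in the paper. All function names and the example vector are hypothetical.

```python
import random

def make_projection(k, d, rng):
    """k x d matrix with i.i.d. N(0, 1/k) entries, so E[||P v||^2] = ||v||^2."""
    scale = k ** -0.5
    return [[rng.gauss(0.0, scale) for _ in range(d)] for _ in range(k)]

def project(P, v):
    """Compress a length-d vector to length k."""
    return [sum(p * x for p, x in zip(row, v)) for row in P]

def sq_norm(v):
    return sum(x * x for x in v)

def debiased_sq_norm(noisy, sigma):
    """Each noisy coordinate adds sigma^2 to the squared norm in expectation,
    so subtracting k * sigma^2 removes the bias (assumed correction)."""
    return sq_norm(noisy) - len(noisy) * sigma ** 2

def simulate(v, k=8, sigma=5.0, trials=2000, seed=0):
    """Average naive and debiased estimates over fresh projections and noise."""
    rng = random.Random(seed)
    naive_total = debiased_total = 0.0
    for _ in range(trials):
        P = make_projection(k, len(v), rng)
        noisy = [x + rng.gauss(0.0, sigma) for x in project(P, v)]
        naive_total += sq_norm(noisy)
        debiased_total += debiased_sq_norm(noisy, sigma)
    return sq_norm(v), naive_total / trials, debiased_total / trials

# Hypothetical difference vector (flattened 3 x 4 table of count differences).
diff = [12.0, -7.0, -5.0, 9.0, -14.0, 5.0, -3.0, 6.0, -3.0, 10.0, -8.0, -2.0]
true_value, naive_mean, debiased_mean = simulate(diff)
print(true_value, naive_mean, debiased_mean)
```

In this setup the naive squared norm overshoots the true value by roughly k·σ² on average, while the corrected estimate averages close to the true value. Individual estimates remain noisy; only the expectation is corrected.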

The system architecture consists of client‑side and server‑side components. Each client locally computes 1‑way and 2‑way marginals, applies random projection, adds Gaussian noise calibrated to the L2‑sensitivity, and then uses the unbiased estimator to produce noisy dependency scores. These compressed noisy marginals and scores are sent to an untrusted aggregator server. The server runs the adaptive selection algorithm to pick a subset of informative 2‑way marginals, then releases the selected noisy marginals (still under Gaussian noise) to the synthesis module. The synthesis module implements Fed‑PrivSyn, a federated version of PrivSyn’s GUM (Gradually Update Method), which iteratively adjusts a randomly initialized synthetic table to match the noisy 1‑way and selected 2‑way marginals.
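The server‑side adaptive selection step might look roughly like the following. This is an illustrative sketch, not the paper's algorithm: the multiplicative `discount` rule, the `adaptive_select` function, and the example scores are all assumptions chosen to show the idea that once (a,b) and (a,c) are selected, the pair (b,c) becomes less valuable.

```python
def adaptive_select(scores, budget, discount=0.5):
    """Greedily pick `budget` attribute pairs, down-weighting implied pairs.

    scores:   dict mapping frozenset({a, b}) -> noisy dependency score.
    discount: hypothetical factor applied to (b, c) once (a, b) and (a, c)
              have both been selected.
    """
    remaining = dict(scores)
    selected = []
    while remaining and len(selected) < budget:
        best = max(remaining, key=remaining.get)
        del remaining[best]
        for prev in selected:
            shared = best & prev
            if len(shared) == 1:
                # The two non-shared attributes form the partially constrained pair.
                implied = (best | prev) - shared
                if implied in remaining:
                    remaining[implied] *= discount
        selected.append(best)
    return selected

def pair(a, b):
    return frozenset((a, b))

# Hypothetical noisy dependency scores for four attributes.
scores = {
    pair("age", "edu"): 0.9,
    pair("age", "income"): 0.8,
    pair("edu", "income"): 0.7,
    pair("hours", "income"): 0.5,
    pair("age", "hours"): 0.2,
    pair("edu", "hours"): 0.1,
}
chosen = adaptive_select(scores, budget=3)
print([sorted(p) for p in chosen])
# [['age', 'edu'], ['age', 'income'], ['hours', 'income']]
```

With `discount=1.0` the function reduces to plain greedy selection, which would pick (edu, income) third even though it is already partly determined by the first two picks. The adaptive variant instead spends the third slot on `hours`, an attribute not yet covered.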

Experiments were conducted on four public tabular datasets (Adult, Census, Hospital, Loan) with simulated heterogeneity across parties. Evaluation metrics include (i) average absolute error on range queries, (ii) Wasserstein distance between original and synthetic joint distributions, and (iii) downstream machine‑learning performance (Random Forest, MLP, XGBoost) measured by accuracy and ROC‑AUC when synthetic data replace the original training set. Privacy budgets ranging from ε = 0.5 to 3 were examined, and the projection dimension k was varied between 50 and 200. Results show that HeteroFedSyn’s utility is within 5–10 % of centralized PrivSyn across all metrics for ε ≥ 1, and the adaptive selection yields noticeable gains (up to 12 % improvement) when data heterogeneity is high. Communication cost is dramatically reduced: with k = 100 the total transmitted data is less than 5 % of the raw 2‑way marginal size.
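As a rough illustration of the first metric, the average absolute error on range queries can be computed as below. The query format (a dict from column index to an inclusive interval), the function names, and the toy rows are assumptions made for this sketch; the paper's exact query workload is not described in this summary.

```python
def range_query(rows, query):
    """Fraction of rows whose value lies in [lo, hi] for every queried column."""
    hits = sum(
        all(lo <= row[col] <= hi for col, (lo, hi) in query.items())
        for row in rows
    )
    return hits / len(rows)

def avg_abs_error(real, synth, queries):
    """Mean absolute difference between query answers on real and synthetic data."""
    errors = [abs(range_query(real, q) - range_query(synth, q)) for q in queries]
    return sum(errors) / len(errors)

# Toy two-column tables (hypothetical values).
real = [(1, 1), (2, 3), (3, 2), (4, 4)]
synth = [(1, 2), (2, 3), (3, 3), (4, 1)]
queries = [
    {0: (1, 2)},             # 1-dimensional range on column 0
    {0: (1, 3), 1: (2, 4)},  # 2-dimensional range on columns 0 and 1
]
print(avg_abs_error(real, synth, queries))  # 0.125
```

The first query is answered identically on both tables (0.5 each), while the second differs by 0.25 (0.5 versus 0.75), giving the average of 0.125.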

A formal privacy analysis uses zero‑Concentrated DP (zCDP) to compose the noise added at each stage (marginal computation, projection, selection, synthesis). The authors prove that the entire protocol satisfies (ε,δ)‑DP for the chosen ε,δ, and that the unbiased estimator does not consume additional privacy budget beyond the Gaussian noise already accounted for.
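The zCDP accounting can be sketched with the standard facts that a Gaussian mechanism with L2‑sensitivity Δ and noise scale σ satisfies ρ‑zCDP with ρ = Δ²/(2σ²), that ρ values add under composition, and that ρ‑zCDP implies (ρ + 2√(ρ·ln(1/δ)), δ)‑DP. The stage names and the 0.1/0.1/0.8 split below are hypothetical placeholders, not the allocation used in the paper, and the sensitivity of 1 is likewise an assumption.

```python
import math

def eps_to_rho(eps, delta):
    """Largest rho such that rho-zCDP implies (eps, delta)-DP via the conversion below."""
    log_term = math.log(1.0 / delta)
    return (math.sqrt(log_term + eps) - math.sqrt(log_term)) ** 2

def rho_to_eps(rho, delta):
    """rho-zCDP implies (rho + 2*sqrt(rho * ln(1/delta)), delta)-DP."""
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))

def sigma_for(rho, sensitivity=1.0):
    """Gaussian noise scale that yields rho-zCDP: rho = sensitivity^2 / (2 sigma^2)."""
    return sensitivity / math.sqrt(2.0 * rho)

# Hypothetical budget split across stages (assumption, not from the paper).
SPLIT = {"one_way_marginals": 0.1, "dependency_scores": 0.1, "selected_two_way": 0.8}

def allocate(eps, delta, split=SPLIT):
    """Convert the (eps, delta) target to a total rho and divide it among stages."""
    total_rho = eps_to_rho(eps, delta)
    return {stage: frac * total_rho for stage, frac in split.items()}

budgets = allocate(eps=1.0, delta=1e-5)
for stage, rho in budgets.items():
    print(f"{stage}: rho={rho:.5f}, sigma={sigma_for(rho):.2f}")
# Additive composition of the stage budgets recovers the target epsilon.
print("composed epsilon:", round(rho_to_eps(sum(budgets.values()), 1e-5), 6))
```

In practice each stage releases several noisy vectors, so a stage's ρ would be divided further among them. Post‑processing steps such as bias correction use only the already‑noised values, which is consistent with the claim that the estimator consumes no extra budget.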

In summary, HeteroFedSyn delivers a practical solution for DP tabular data synthesis in federated environments by (1) employing an L2‑based dependency metric with random projection to achieve noise‑efficient correlation measurement, (2) providing an unbiased estimator that restores the true dependency values from noisy compressed marginals, and (3) introducing an adaptive marginal selection strategy that maximizes information gain under a fixed privacy budget while avoiding redundancy. The framework achieves utility comparable to centralized methods, substantially lowers communication overhead, and maintains rigorous DP guarantees, marking a significant step toward privacy‑preserving data sharing among heterogeneous organizations. Future work may explore higher‑order marginals, asynchronous federated protocols, and integration with domain‑specific generative models.

