FHAIM: Fully Homomorphic AIM For Private Synthetic Data Generation


Data is the lifeblood of AI, yet much of the most valuable data remains locked in silos due to privacy and regulations. As a result, AI remains heavily underutilized in many of the most important domains, including healthcare, education, and finance. Synthetic data generation (SDG), i.e. the generation of artificial data with a synthesizer trained on real data, offers an appealing solution to make data available while mitigating privacy concerns. However, existing SDG-as-a-service workflows require data holders to trust providers with access to private data. We propose FHAIM, the first fully homomorphic encryption (FHE) framework for training a marginal-based synthetic data generator on encrypted tabular data. FHAIM adapts the widely used AIM algorithm to the FHE setting using novel FHE protocols, ensuring that the private data remains encrypted throughout and is released only with differential privacy guarantees. Our empirical analysis shows that FHAIM preserves the performance of AIM while maintaining feasible runtimes.


💡 Research Summary

FHAIM introduces the first fully homomorphic encryption (FHE) framework for training a marginal‑based synthetic data generator directly on encrypted tabular data, thereby guaranteeing both input privacy and differential privacy (DP). The authors adapt the widely used AIM algorithm—an iterative method that selects the most mismatched attribute subsets, measures noisy marginals, and generates synthetic data—to the encrypted domain. To achieve this, they design three novel FHE protocols: πCOMP for encrypted marginal computation, πSELECT for DP‑protected query selection using a Gumbel‑Max approximation of the exponential mechanism with an L₂‑norm quality score, and πMEASURE for adding Gaussian noise to marginals inside the ciphertext space. These protocols implement a “DP‑in‑FHE” paradigm where the service provider never sees raw data, the noise samples, or the exact noisy statistics; only encrypted values are processed. After the select and measure steps are completed homomorphically, the noisy marginals are decrypted and the generate step proceeds in the clear, as it no longer depends on private data.
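To make the select and measure steps concrete, here is a minimal plaintext sketch in Python. It is not the paper's encrypted πSELECT or πMEASURE protocol: the function names `gumbel_max_select` and `gaussian_measure`, the `sensitivity` parameter, and the example scores are assumptions introduced for illustration only. The sketch shows the standard Gumbel-Max trick (add independent Gumbel noise to scaled quality scores and take the argmax) and the Gaussian mechanism (add independent normal noise to each marginal cell).

```python
import math
import random


def gumbel_max_select(scores, epsilon, sensitivity, rng):
    """Pick an index via the Gumbel-Max trick (plaintext illustration).

    Each score is scaled by epsilon / (2 * sensitivity), perturbed with
    standard Gumbel noise, and the index of the largest value is returned.
    """
    best_index, best_value = -1, -math.inf
    for i, score in enumerate(scores):
        u = rng.random()
        while u == 0.0:  # rng.random() is in [0, 1); avoid log(0)
            u = rng.random()
        gumbel = -math.log(-math.log(u))  # inverse-CDF Gumbel(0, 1) sample
        value = epsilon * score / (2.0 * sensitivity) + gumbel
        if value > best_value:
            best_index, best_value = i, value
    return best_index


def gaussian_measure(marginal, sigma, rng):
    """Add independent N(0, sigma^2) noise to every cell of a marginal."""
    return [count + rng.gauss(0.0, sigma) for count in marginal]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical quality scores for three candidate marginal queries.
    scores = [4.0, 9.5, 1.2]
    chosen = gumbel_max_select(scores, epsilon=1.0, sensitivity=1.0, rng=rng)
    noisy = gaussian_measure([12.0, 7.0, 3.0, 0.0], sigma=2.0, rng=rng)
    print("selected query:", chosen)
    print("noisy marginal:", noisy)
```

In FHAIM, according to the summary above, the analogous computations run on ciphertexts, so the provider sees neither the scores nor the noise samples; the sketch only shows the arithmetic that would have to be reproduced homomorphically.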

The system is built on the CKKS scheme, chosen for its efficient handling of real‑valued arithmetic required by DP noise addition. An encrypted memory layout ensures that the multiplicative depth of marginal computation depends only on the marginal degree k, keeping the scheme scalable. The authors replace the traditional L₁‑norm quality score with a squared L₂‑norm to avoid unstable polynomial approximations of absolute values in the encrypted domain.
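The following plaintext Python sketch illustrates the two ideas in this paragraph under stated assumptions. It does not reproduce the paper's encrypted memory layout or πCOMP protocol; `one_hot`, `marginal_from_one_hot`, `squared_l2_score`, and the toy data are hypothetical names and values. With one-hot encoded attributes, each cell of a k-way marginal is a sum over rows of a product of k indicator values, so the number of chained multiplications per cell depends on k and not on the number of rows. The squared L₂ score likewise uses only subtraction, multiplication, and addition, whereas an L₁ score needs an absolute value.

```python
from itertools import product


def one_hot(value, domain_size):
    """Encode a categorical value as a 0/1 indicator vector."""
    return [1 if value == j else 0 for j in range(domain_size)]


def marginal_from_one_hot(rows, attrs, domain_sizes):
    """Count table over `attrs` using only additions and multiplications."""
    cells = list(product(*(range(domain_sizes[a]) for a in attrs)))
    counts = {cell: 0 for cell in cells}
    for row in rows:
        encoded = {a: one_hot(row[a], domain_sizes[a]) for a in attrs}
        for cell in cells:
            term = 1
            for attr, value in zip(attrs, cell):
                term *= encoded[attr][value]  # k indicator factors per cell
            counts[cell] += term  # additions do not chain multiplications
    return counts


def squared_l2_score(true_marginal, estimated_marginal):
    """Sum of squared cell differences: polynomial, no absolute value."""
    return sum(
        (true_marginal[cell] - estimated_marginal[cell]) ** 2
        for cell in true_marginal
    )


if __name__ == "__main__":
    # Toy dataset with two binary attributes (illustrative values only).
    rows = [(0, 1), (0, 1), (1, 0)]
    true_counts = marginal_from_one_hot(rows, attrs=(0, 1), domain_sizes=(2, 2))
    estimate = {cell: 0.75 for cell in true_counts}  # uniform guess, 3 rows / 4 cells
    print(true_counts)
    print(squared_l2_score(true_counts, estimate))
```

Because both functions are built from additions and multiplications only, they map naturally onto the operations an arithmetic FHE scheme such as CKKS supports, which is the motivation the summary gives for preferring the squared L₂ score.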

Empirical evaluation on three real‑world datasets from healthcare, finance, and education demonstrates that FHAIM can train the synthetic generator in roughly 11–30 minutes while preserving statistical utility. Compared to the plaintext AIM baseline, the synthetic data produced by FHAIM exhibits nearly identical marginal distributions (low KL‑divergence and MAE) and yields comparable downstream machine‑learning performance (≤2 % accuracy loss). The experiments also confirm that ciphertext depth and memory usage grow linearly with marginal degree, indicating practical scalability to higher‑dimensional data.

Key contributions of the paper are: (1) the first FHE‑based synthetic data generation framework that provides strong input privacy without requiring multiple non‑colluding parties; (2) novel DP‑in‑FHE protocols for marginal computation, differentially private query selection, and noisy measurement; (3) an efficient encrypted memory layout that limits depth growth; and (4) a thorough experimental validation showing feasible runtimes and negligible utility degradation.

The work opens several avenues for future research, including extending the approach to non‑tabular modalities (images, time series), handling multiple data holders in a collaborative FHE setting, exploring integer‑based FHE schemes (BFV/BGV) for potentially lower overhead, and developing adaptive privacy‑budget allocation strategies. Overall, FHAIM bridges a critical gap between privacy‑preserving data sharing and the demand for high‑quality synthetic data, offering a practical solution for organizations constrained by strict data‑protection regulations.

