PTOPOFL: Privacy-Preserving Personalised Federated Learning via Persistent Homology
Federated learning (FL) faces two structural tensions: gradient sharing enables data-reconstruction attacks, while non-IID client distributions degrade aggregation quality. We introduce PTOPOFL, a framework that addresses both challenges simultaneously by replacing gradient communication with topological descriptors derived from persistent homology (PH). Clients transmit only 48-dimensional PH feature vectors, compact shape summaries whose many-to-one structure makes inversion provably ill-posed, rather than model gradients. The server performs topology-guided personalised aggregation: clients are clustered by Wasserstein similarity between their PH diagrams, intra-cluster models are topology-weighted, and clusters are blended with a global consensus. We prove an information-contraction theorem showing that PH descriptors leak strictly less mutual information per sample than gradients under strongly convex loss functions, and we establish linear convergence of the Wasserstein-weighted aggregation scheme with an error floor strictly smaller than FedAvg's. Evaluated against FedAvg, FedProx, SCAFFOLD, and pFedMe on a non-IID healthcare scenario (8 hospitals, 2 adversarial) and a pathological benchmark (10 clients), PTOPOFL achieves AUC 0.841 and 0.910 respectively, the highest in both settings, while reducing reconstruction risk by a factor of 4.5 relative to gradient sharing. Code is publicly available at https://github.com/MorillaLab/TopoFederatedL and data at https://doi.org/10.5281/zenodo.18827595.
💡 Research Summary
Federated learning (FL) enables collaborative model training without sharing raw data, but two fundamental challenges remain unresolved. First, transmitting high‑dimensional gradients leaks substantial information about local datasets, making clients vulnerable to reconstruction attacks. Second, non‑IID client data distributions cause client drift, degrading the quality of the aggregated model and slowing convergence. Existing mitigations—differential privacy, secure aggregation, proximal penalties, control variates, or Moreau‑envelope‑based personalization—address either privacy or heterogeneity but not both, and often at the cost of model accuracy or added communication overhead.
The paper introduces PTOPOFL, a novel FL framework that simultaneously tackles privacy and heterogeneity by replacing gradient communication with compact topological descriptors derived from persistent homology (PH). Each client computes a 48‑dimensional feature vector φ_k from its local dataset D_k. This vector aggregates Betti numbers, persistence entropy, ℓ₂ diagram amplitude, and Betti curves across H₀ and H₁, providing a multi‑scale summary of the data’s shape. Crucially, the mapping Φ: D_k → φ_k is many‑to‑one: infinitely many distinct datasets produce the same φ_k, rendering inversion ill‑posed. Moreover, PH enjoys bottleneck stability, guaranteeing that small data perturbations lead to bounded changes in φ_k.
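The descriptor assembly described above can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: it assumes the persistence diagrams for H₀ and H₁ have already been computed (e.g., by a library such as Ripser), and the exact 48-dimensional layout shown here (Betti number, persistence entropy, and ℓ₂ amplitude plus a 21-point Betti curve per homology dimension) is one plausible arrangement of the components the paper lists.

```python
import numpy as np

def persistence_entropy(diagram):
    """Shannon entropy of the normalized bar lengths of a diagram
    given as an array of (birth, death) rows."""
    lifetimes = diagram[:, 1] - diagram[:, 0]
    lifetimes = lifetimes[lifetimes > 0]
    if lifetimes.size == 0:
        return 0.0
    p = lifetimes / lifetimes.sum()
    return float(-(p * np.log(p)).sum())

def betti_curve(diagram, grid):
    """Number of intervals alive at each filtration value in `grid`."""
    return np.array([((diagram[:, 0] <= t) & (diagram[:, 1] > t)).sum()
                     for t in grid], dtype=float)

def ph_features(diagrams, grid):
    """Concatenate, per homology dimension, the Betti number,
    persistence entropy, l2 diagram amplitude, and sampled Betti curve."""
    feats = []
    for dgm in diagrams:
        lifetimes = dgm[:, 1] - dgm[:, 0]
        feats.append(float(len(dgm)))                  # Betti number
        feats.append(persistence_entropy(dgm))         # persistence entropy
        feats.append(float(np.linalg.norm(lifetimes))) # l2 amplitude
        feats.extend(betti_curve(dgm, grid))           # Betti curve samples
    return np.array(feats)

# Toy H0/H1 diagrams; 3 scalars + 21 curve samples per dimension = 48 dims.
h0 = np.array([[0.0, 0.3], [0.0, 0.9], [0.0, 1.5]])
h1 = np.array([[0.4, 1.1], [0.6, 0.8]])
grid = np.linspace(0.0, 2.0, 21)
phi = ph_features([h0, h1], grid)
print(phi.shape)  # (48,)
```

Note that the many-to-one property is visible even here: any rigid motion of the underlying point cloud leaves the diagrams, and hence φ, unchanged.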
PTOPOFL’s server‑side pipeline consists of three stages. (1) Topology‑guided clustering: Using the p‑Wasserstein distance between persistence diagrams, clients are grouped via hierarchical average‑linkage clustering. Theoretical analysis (Theorem 3.3) shows that if inter‑cluster separation exceeds a margin γ, clustering is robust to perturbations up to γ/(2c), where c is the PH stability constant. (2) Intra‑cluster aggregation: Within each cluster C_j, local models are combined using topology‑weighted averaging. The weight for client k is proportional to n_k·exp(−‖φ̂_k−φ̂_{C_j}‖)·t_k, where φ̂ denotes the normalized descriptor, n_k the local sample size, and t_k a trust factor from anomaly detection. This emphasizes clients whose topological signature aligns closely with the cluster centroid. (3) Inter‑cluster blending: To avoid over‑personalization, each cluster model is blended with the global consensus model ¯θ via a mixing parameter β_blend (β_blend = 0 in the best‑performing configuration). The final personalized model for a client is thus a convex combination of its cluster’s weighted average and the global average.
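Stages (2) and (3) can be sketched directly from the stated weight formula. The sketch below assumes normalized descriptors and per-client models are already available; the descriptor dimension, client counts, and trust values are hypothetical.

```python
import numpy as np

def topology_weights(phi_hats, centroid, n_samples, trust):
    """w_k proportional to n_k * exp(-||phi_hat_k - phi_hat_C||) * t_k,
    normalized to sum to one within the cluster."""
    dists = np.linalg.norm(phi_hats - centroid, axis=1)
    w = n_samples * np.exp(-dists) * trust
    return w / w.sum()

def personalized_model(cluster_models, w, global_model, beta_blend):
    """Convex combination of the topology-weighted cluster average
    and the global consensus model."""
    cluster_avg = (w[:, None] * cluster_models).sum(axis=0)
    return (1.0 - beta_blend) * cluster_avg + beta_blend * global_model

rng = np.random.default_rng(0)
phi_hats = rng.normal(size=(4, 48))    # normalized PH descriptors
centroid = phi_hats.mean(axis=0)       # cluster centroid descriptor
n = np.array([100., 80., 120., 50.])   # local sample sizes n_k
t = np.array([1.0, 1.0, 1.0, 0.2])     # trust factors t_k (client 3 flagged)
w = topology_weights(phi_hats, centroid, n, t)

models = rng.normal(size=(4, 10))      # per-client model parameters
theta_bar = models.mean(axis=0)        # global consensus model
theta_p = personalized_model(models, w, theta_bar, beta_blend=0.0)
print(w.round(3), theta_p.shape)
```

With β_blend = 0, as in the best-performing configuration, the personalized model reduces to the pure topology-weighted cluster average.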
Privacy is quantified through an information‑contraction bound (Theorem 3.7): I(x_i;Φ(D_k)) ≤ m_p·c²·L²·I(x_i;∇F_k), where m_p ≪ 1, c is the PH stability constant, and L is the Lipschitz constant of the loss. This shows that per‑sample mutual information leaked by φ_k is dramatically smaller than that leaked by gradients. Additionally, Theorem 3.5 proves that the influence of adversarial clients decays exponentially with their Wasserstein distance from the honest majority, whereas in standard FedAvg the influence grows linearly.
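The contrast in Theorem 3.5 can be illustrated numerically. The snippet below is a synthetic demonstration, not the paper's proof: client counts are hypothetical, and the exp(−d) form simply mirrors the exponential-weighting mechanism described above.

```python
import numpy as np

# Adversary's pull on the aggregate as its Wasserstein distance d
# from the honest majority grows.
d = np.linspace(0.0, 5.0, 6)
n_honest, n_adv, k = 7, 1, 8   # hypothetical client counts

# FedAvg: the adversary keeps weight n_adv/k regardless of d, so the
# shift it induces in the average grows linearly with d.
fedavg_influence = (n_adv / k) * d

# Topology-weighted: the adversary's raw weight decays like exp(-d),
# so its induced shift vanishes for distant (poisoned) updates.
topo_weight = (n_adv * np.exp(-d)) / (n_adv * np.exp(-d) + n_honest)
topo_influence = topo_weight * d

for di, fa, ti in zip(d, fedavg_influence, topo_influence):
    print(f"d={di:.1f}  fedavg={fa:.3f}  topo={ti:.4f}")
```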
Convergence analysis assumes each local objective F_k is L‑smooth and µ‑strongly convex. Under these conditions, Theorem 3.9 and Proposition 3.11 establish linear convergence of the Wasserstein‑weighted aggregation scheme, with an error floor strictly lower than that of FedAvg. The existence of a Wasserstein barycenter (Theorem 3.2) guarantees that the cluster‑level optimization problem is well‑posed.
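The paper's exact constants are not reproduced in this summary, but under L‑smoothness and µ‑strong convexity a linear-convergence guarantee with an error floor typically takes the following generic shape (η is a hypothetical step size and ε the floor induced by heterogeneity and the weighting scheme):

```latex
\|\theta^{t+1} - \theta^\star\|^2 \;\le\; (1 - \eta\mu)\,\|\theta^{t} - \theta^\star\|^2 + \varepsilon,
\qquad
\limsup_{t \to \infty} \|\theta^{t} - \theta^\star\|^2 \;\le\; \frac{\varepsilon}{\eta\mu}.
```

The claim specific to PTOPOFL is that its ε is strictly smaller than the corresponding FedAvg floor.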
Empirical evaluation comprises two realistic non‑IID settings. (i) Healthcare scenario: Eight simulated hospitals, each with a distinct patient distribution, of which two are adversarially poisoned. (ii) Pathology benchmark: Ten clients each holding different tissue‑image distributions. All experiments use logistic regression as the base model for theoretical compatibility. PTOPOFL is compared against FedAvg, FedProx, SCAFFOLD, and pFedMe. Results show that PTOPOFL achieves AUC = 0.841 in the healthcare setting and AUC = 0.910 in the pathology benchmark—the highest among all baselines. Moreover, reconstruction risk is reduced by a factor of 4.5 relative to gradient sharing. Ablation studies confirm that (a) the topology‑guided clustering remains stable under data noise, (b) β_blend = 0 yields the best trade‑off between personalization and global generalization, and (c) anomaly detection effectively down‑weights malicious clients.
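The pathological benchmark can be approximated with a standard non‑IID partitioning scheme. The sketch below is a simplified, generic variant (clients receive all samples of a small random subset of classes, and may overlap); the paper's actual partitioning is not specified in this summary.

```python
import numpy as np

def pathological_split(labels, n_clients, classes_per_client, seed=0):
    """Give each client samples drawn from only a few classes, a common
    way to simulate pathological non-IID federated data."""
    rng = np.random.default_rng(seed)
    classes = np.unique(labels)
    client_idx = []
    for _ in range(n_clients):
        own = rng.choice(classes, size=classes_per_client, replace=False)
        idx = np.flatnonzero(np.isin(labels, own))
        client_idx.append(rng.permutation(idx))
    return client_idx

labels = np.repeat(np.arange(5), 40)   # 5 classes, 200 samples total
splits = pathological_split(labels, n_clients=10, classes_per_client=2)
print([len(s) for s in splits])        # per-client sample counts
```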
The authors release the full implementation as an open‑source Python package (https://github.com/MorillaLab/TopoFederatedL) and provide the processed datasets via Zenodo (https://doi.org/10.5281/zenodo.18827595), ensuring reproducibility. Limitations are acknowledged: the convergence proofs rely on strong convexity and smoothness, which do not hold for deep neural networks; extending the framework to such models will require additional techniques (e.g., local linearization or surrogate convex objectives). Future work includes integrating formal differential privacy guarantees, automating the choice of the number of clusters, and exploring richer PH‑based descriptors for higher‑dimensional homology.
In summary, PTOPOFL offers a principled, privacy‑preserving, and heterogeneity‑aware alternative to gradient‑based FL. By leveraging the many‑to‑one, stable nature of persistent homology, it dramatically reduces information leakage while enabling topology‑driven personalized aggregation that converges faster and attains higher predictive performance than state‑of‑the‑art federated learning methods.