GC-Fed: Gradient Centralized Federated Learning with Partial Client Participation
Federated Learning (FL) enables privacy-preserving multi-source information fusion (MSIF) but is challenged by client drift in highly heterogeneous data settings. Many existing drift-mitigation strategies rely on reference-based techniques, such as gradient adjustments or proximal losses, that use historical snapshots (e.g., past gradients or previous global models) as reference points. When only a subset of clients participates in each training round, these historical references may not accurately capture the overall data distribution, leading to unstable training. In contrast, our proposed Gradient Centralized Federated Learning (GC-Fed) employs a hyperplane as a historically independent reference point to guide local training and enhance inter-client alignment. GC-Fed comprises two complementary components: Local GC, which centralizes gradients during local training, and Global GC, which centralizes updates during server aggregation. In our hybrid design, Local GC is applied to feature-extraction layers to harmonize client contributions, while Global GC refines classifier layers to stabilize round-wise performance. Theoretical analysis and extensive experiments on benchmark FL tasks demonstrate that GC-Fed effectively mitigates client drift and achieves up to a 20% improvement in accuracy under heterogeneous and partial participation conditions.
💡 Research Summary
Federated learning (FL) suffers from client drift when data across devices are heterogeneous, a problem that is exacerbated under partial client participation, a common scenario in cross-device settings where only a small fraction of the total client pool is available each round. Existing drift-mitigation techniques (e.g., the proximal loss of FedProx or the control variates of SCAFFOLD) rely on historical references such as the previous global model, past client updates, or per-client control variables. When the sampled subset does not faithfully represent the whole population, these references become biased, leading to unstable training and degraded performance.
The authors propose a fundamentally different approach: leveraging Gradient Centralization (GC) as a reference‑free alignment mechanism. GC, originally introduced as an optimizer‑level technique, subtracts the mean of each output‑channel gradient, which can be expressed as a projection onto a fixed hyperplane orthogonal to the all‑ones vector. This hyperplane is shared by all clients and all rounds, providing a stable, historically independent reference point. By projecting gradients onto this hyperplane, the method reduces inter‑client variance without storing or communicating any extra state.
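The per-channel mean subtraction and its hyperplane-projection reading can be sketched in a few lines of NumPy. This is a minimal illustration of the mechanism described above, not code from the paper; `centralize_gradient` is a hypothetical helper name:

```python
import numpy as np

def centralize_gradient(grad: np.ndarray) -> np.ndarray:
    """Subtract the per-output-channel mean of a gradient tensor.

    For a gradient of shape (out_channels, ...), removing the mean over the
    remaining axes equals projecting each channel's flattened gradient onto
    the fixed hyperplane orthogonal to the all-ones vector.
    """
    axes = tuple(range(1, grad.ndim))  # every axis except the output channel
    return grad - grad.mean(axis=axes, keepdims=True)

# Equivalence with the explicit projector P = I - (1/n) * ones @ ones.T
g = np.random.default_rng(0).normal(size=(4, 6))   # 4 output channels, 6 weights each
n = g.shape[1]
P = np.eye(n) - np.ones((n, n)) / n                # shared, history-independent hyperplane
assert np.allclose(centralize_gradient(g), g @ P)  # same result, matrix-free
```

Because the projector is fixed, every client applies the identical operation in every round, which is what makes the reference point "historically independent."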
GC‑Fed consists of two complementary components:
- Local GC – applied during local SGD on the feature‑extraction layers (e.g., convolutional or embedding layers). This aligns the direction of client‑side updates before they are sent to the server, mitigating drift caused by divergent data distributions.
- Global GC – applied on the server during model aggregation, but only to the classifier (final fully‑connected) layer. Here the accumulated client updates are treated as a global gradient and projected onto the same hyperplane, reducing classifier‑specific variance that is especially pronounced under class‑imbalance.
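Assuming a plain FedAvg-style round, the layer-wise split described above might be sketched as follows. The helper names (`gc`, `local_step`, `server_aggregate`) and the layer-name sets are hypothetical illustrations, not the authors' implementation:

```python
import numpy as np

def gc(update: np.ndarray) -> np.ndarray:
    """Project a per-layer update onto the zero-mean hyperplane, per output channel."""
    axes = tuple(range(1, update.ndim))
    return update - update.mean(axis=axes, keepdims=True)

def local_step(params, grads, lr, feature_layers):
    """One local SGD step: Local GC applied only to feature-extraction layers."""
    return {
        name: p - lr * (gc(grads[name]) if name in feature_layers else grads[name])
        for name, p in params.items()
    }

def server_aggregate(client_deltas, classifier_layers):
    """FedAvg-style aggregation: Global GC applied only to the classifier delta."""
    avg = {
        name: np.mean([d[name] for d in client_deltas], axis=0)
        for name in client_deltas[0]
    }
    return {
        name: gc(delta) if name in classifier_layers else delta
        for name, delta in avg.items()
    }
```

Note that neither function stores or transmits any extra state: the only change relative to FedAvg is where the mean subtraction is applied.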
Individually, Local GC yields higher peak accuracy but exhibits larger round‑to‑round fluctuations, while Global GC offers smoother convergence at the cost of a slightly lower ceiling. To combine the strengths of both, the authors introduce a hybrid scheme (GC‑Fed) that applies Local GC to early layers and Global GC to the final layer. This design preserves the high performance of Local GC while inheriting the stability of Global GC.
Theoretical analysis shows that GC reduces the variance of gradient estimates by zero‑centering them, and that the projection operation effectively bounds the distance between the “true” update (aggregated over all clients) and the “partial” update (aggregated over a sampled subset). The authors prove that, under partial participation, the L2 norm of this discrepancy shrinks proportionally to the variance reduction induced by GC. Importantly, the method incurs no additional communication overhead and requires only a simple matrix‑free projection (subtracting the per‑channel mean), making it compatible with existing FL pipelines.
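The bound above rests on the fact that an orthogonal projection cannot enlarge the gap between the full-participation and partial-participation aggregates. A small numeric check of this non-expansiveness, using synthetic gradients (illustrative only, not the paper's proof):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 64                               # flattened per-channel gradient dimension
P = np.eye(n) - np.ones((n, n)) / n  # orthogonal projector onto the GC hyperplane

# Simulated aggregates: "full" over all clients vs a biased sampled-subset estimate
full = rng.normal(size=n)
partial = full + rng.normal(scale=0.5, size=n)

gap_before = np.linalg.norm(full - partial)
gap_after = np.linalg.norm(P @ full - P @ partial)

# An orthogonal projection is non-expansive, so the discrepancy cannot grow
assert gap_after <= gap_before + 1e-12
```

The projection strictly shrinks the gap whenever the discrepancy has a nonzero mean component, which is exactly the component GC removes.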
Extensive experiments were conducted on heterogeneous benchmarks: CIFAR‑10/100, FEMNIST, and Shakespeare, with participation ratios ranging from 10% to 30%. GC‑Fed was compared against FedAvg, FedProx, SCAFFOLD, FedOpt, FedVARP, FedSAM, and other recent baselines. Results consistently demonstrate that GC‑Fed achieves higher final test accuracy (up to 20% improvement in low‑participation regimes), faster convergence, and markedly lower variance across communication rounds. Ablation studies confirm that the layer‑wise split (feature vs. classifier) is crucial: applying GC only to the classifier yields modest gains, while applying it solely to early layers improves peak performance but harms stability. The hybrid configuration delivers the best trade‑off. Sensitivity analyses on the hyper‑parameter λ (the layer threshold separating local and global GC) and learning‑rate schedules show that GC‑Fed is robust across a wide range of settings.
In summary, GC‑Fed introduces a simple yet powerful mechanism—gradient projection onto a shared hyperplane—to align client updates without relying on stale or biased historical references. By integrating GC both locally and globally, the method effectively mitigates client drift under realistic partial‑participation scenarios, offering a practical, low‑overhead enhancement to federated learning systems.