FedAdaVR: Adaptive Variance Reduction for Robust Federated Learning under Limited Client Participation
Federated learning (FL) faces substantial challenges arising from heterogeneity, which induces gradient noise, client drift, and partial client participation error; the last is the most pervasive yet remains insufficiently addressed in the current literature. In this paper, we propose FedAdaVR, a novel FL algorithm that tackles the heterogeneity caused by sporadic client participation by coupling an adaptive optimiser with a variance-reduction technique. The method reuses the most recently stored update from each client, even when that client is absent from the current training round, thereby emulating its presence. We further propose FedAdaVR-Quant, which stores client updates in quantised form, reducing FedAdaVR's memory requirements by 50%, 75%, or 87.5% (depending on the chosen precision) while maintaining equivalent model performance. We analyse the convergence behaviour of FedAdaVR under general nonconvex conditions and prove that it eliminates the partial client participation error. Extensive experiments on multiple datasets, under both independent and identically distributed (IID) and non-IID settings, demonstrate that FedAdaVR consistently outperforms state-of-the-art baseline methods.
💡 Research Summary
Federated learning (FL) suffers from two dominant sources of error: client drift caused by data heterogeneity and partial client participation error arising when only a subset of devices is active in each round. While extensive work has addressed drift (e.g., FedProx, SCAFFOLD, AdaL‑VR), the bias introduced by sporadic participation remains under‑explored. This paper introduces FedAdaVR, a server‑side algorithm that simultaneously tackles both issues by integrating an adaptive optimizer (Adam, Adagrad, AdaBelief, Yogi, or Lamb) with a SAGA‑style variance‑reduction (VR) mechanism.
The core idea is to keep, on the server, the most recent update y_j(t) from every client j. When a client is absent in round t, its stored update is reused, effectively simulating full participation. The variance‑reduced global direction r(t) is computed as
r(t) = ∑_{i∈S(t)} p_i (g_i(t) − y_i(t)) + ∑_{j=1}^{N} p_j y_j(t),
where g_i(t) is the freshly received local gradient and p_i denotes the data‑size weight. This estimator is unbiased with respect to the full‑client average, thereby eliminating the partial‑participation bias. The adaptive optimizer then processes the pseudo‑gradient G(t)=r(t)/η_c (optionally adding weight decay) to update the global model, allowing per‑parameter learning‑rate adaptation that compensates for the uneven update frequency of rarely‑selected clients.
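The estimator above can be sketched in a few lines of NumPy. This is an illustrative implementation of the SAGA-style aggregation described in the summary, not the paper's actual code; the function and argument names are assumptions for the sketch.

```python
import numpy as np

def saga_aggregate(fresh_grads, stored_updates, weights, participants):
    """Variance-reduced global direction r(t), a sketch of the paper's estimator.

    fresh_grads:    dict client_id -> local gradient g_i(t), participants only
    stored_updates: dict client_id -> last stored update y_j(t), all N clients
    weights:        dict client_id -> data-size weight p_j (assumed to sum to 1)
    participants:   iterable of client ids in S(t)
    """
    shape = next(iter(stored_updates.values())).shape
    r = np.zeros(shape)
    # Correction term over the participating clients: p_i * (g_i(t) - y_i(t))
    for i in participants:
        r += weights[i] * (fresh_grads[i] - stored_updates[i])
    # Full-memory term over every client, simulating full participation
    for j, y_j in stored_updates.items():
        r += weights[j] * y_j
    # SAGA-style state refresh for the clients that were actually present
    for i in participants:
        stored_updates[i] = fresh_grads[i]
    return r
```

A quick sanity check of the design: when S(t) contains all clients, the correction and memory terms cancel and r(t) reduces to the plain weighted average of fresh gradients, which is exactly the full-participation target the estimator is unbiased for.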
To address the O(N d) memory requirement of storing per‑client updates, the authors propose FedAdaVR‑Quant. The stored vectors are quantised to low‑precision formats: FP16 (16‑bit floating point), Int8 (8‑bit symmetric integer), and Int4 (4‑bit packed). Quantisation and de‑quantisation are performed with simple linear scaling, incurring negligible computational overhead. Empirically, the quantised variants achieve near‑identical test accuracy (≤0.2 % drop) while reducing memory consumption by 50 % (FP16), 75 % (Int8), and 87.5 % (Int4).
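The symmetric Int8 scheme with linear scaling can be sketched as below. This is a generic implementation of the standard technique the summary names, assuming one FP32 scale per stored vector; the paper's exact format may differ.

```python
import numpy as np

def quantise_int8(x):
    """Symmetric 8-bit linear quantisation of a stored client update.

    Each element shrinks from 4 bytes (FP32) to 1 byte, the 75% saving
    quoted for Int8; only one FP32 scale per vector is kept alongside.
    """
    max_abs = np.abs(x).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantise_int8(q, scale):
    """Recover an approximate FP32 vector from its Int8 representation."""
    return q.astype(np.float32) * scale
```

With this scaling, the per-element round-trip error is bounded by half the scale, which is why accuracy loss stays small when the stored updates have moderate dynamic range.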
Theoretical contributions include a convergence analysis under general non‑convex objectives. The authors prove that the variance‑reduced estimator r(t) is unbiased and its variance diminishes as O(1/√T). When combined with any of the listed adaptive optimizers, the expected global loss decreases at the standard O(1/√T) rate, and the partial‑participation error term vanishes completely.
Extensive experiments were conducted on CIFAR‑10, FEMNIST, and Shakespeare datasets under both IID and highly non‑IID partitions (Dirichlet α=0.1). The number of clients was set to N=1000, with only 10 %–30 % participating per round to emulate extreme device unavailability. Baselines included FedAvg, FedProx, SCAFFOLD, FedV‑ARP, and FedOpt (Adam). Across all settings, FedAdaVR consistently outperformed baselines by 2 %–8 % in final test accuracy and converged in fewer communication rounds. The advantage was most pronounced in non‑IID regimes and under severe label‑skew conditions, where prior methods struggled. FedAdaVR‑Quant matched the full‑precision version’s performance while delivering the promised memory savings.
In summary, the paper makes three major contributions: (1) a novel combination of server‑side adaptive optimization and SAGA‑style variance reduction that mathematically eliminates the bias caused by limited client participation; (2) a quantised state‑storage scheme that dramatically reduces server memory footprint without sacrificing learning quality; and (3) rigorous non‑convex convergence guarantees accompanied by comprehensive empirical validation. The work opens avenues for further research on quantisation‑aware privacy mechanisms and on extending the approach to hierarchical or cross‑silo FL settings.