DecHW: Heterogeneous Decentralized Federated Learning Exploiting Second-Order Information


Decentralized Federated Learning (DFL) is a serverless collaborative machine learning paradigm in which devices exchange model information directly with neighbouring devices to learn a generalized model. However, variations in individual experiences and differing levels of device interaction introduce data and model-initialization heterogeneity across devices. These heterogeneities cause local model parameters to diverge across devices, which slows convergence. This paper tackles data and model heterogeneity by explicitly addressing the varying, parameter-level evidential credence of local models. A novel aggregation approach is introduced that captures these parameter variations and performs robust aggregation of neighbourhood local updates. Specifically, consensus weights are generated by approximating second-order information of the local models on their local datasets; these weights scale neighbourhood updates before they are aggregated into a global neighbourhood representation. In extensive experiments on computer vision tasks, the proposed approach shows strong generalizability of local models at reduced communication cost.


💡 Research Summary

The paper addresses a fundamental challenge in decentralized federated learning (DFL): the simultaneous presence of data heterogeneity (different local data distributions) and model initialization heterogeneity (different random seeds or architectures across devices). In a server‑less setting, each device can only exchange model parameters with its immediate neighbors in a static communication graph, and traditional aggregation methods such as weighted averaging (FedAvg) treat all parameters equally, weighting only by the size of each client’s dataset. This uniform treatment leads to misaligned parameters, slower convergence, and higher communication overhead when the local models are initialized differently or trained on highly non‑i.i.d. data.

To overcome these limitations, the authors propose Decentralized Hessian‑Weighted aggregation (DecHW). The core idea is to use second‑order information—specifically the diagonal of the Hessian matrix of the local loss with respect to each model parameter—to quantify the sensitivity (curvature) of each parameter on a given client’s data. A large diagonal entry indicates that a small change in that parameter would cause a large change in loss, i.e., the parameter is still “important” for the current local task. Conversely, a small entry suggests the parameter lies in a flat region, possibly already over‑fitted, and should have less influence on the global model.

Because computing the full Hessian is prohibitive for modern deep networks, the authors adopt a Gauss‑Newton approximation, which replaces the Hessian with a product of first‑order Jacobians and retains only its diagonal. This yields an efficient, per‑parameter curvature estimate that can be computed alongside the usual stochastic gradient descent (SGD) steps.
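As a sketch of this idea, the diagonal can be approximated by averaging squared per‑parameter gradients over mini‑batches (an empirical‑Fisher flavour of the Gauss‑Newton approximation). The function name and PyTorch framing below are illustrative assumptions, not the authors' actual code:

```python
import torch

def hessian_diag_estimate(model, loss_fn, data_loader, device="cpu"):
    """Approximate the per-parameter Hessian diagonal with squared gradients
    (a Gauss-Newton / empirical-Fisher style estimate), averaged over batches."""
    model.to(device)
    diag = [torch.zeros_like(p) for p in model.parameters()]
    n_batches = 0
    for x, y in data_loader:
        model.zero_grad()
        loss = loss_fn(model(x.to(device)), y.to(device))
        loss.backward()
        for d, p in zip(diag, model.parameters()):
            d += p.grad.detach() ** 2  # squared gradient as a cheap curvature proxy
        n_batches += 1
    return [d / max(n_batches, 1) for d in diag]
```

Since this estimate reuses the gradients already produced during local training, it adds essentially no extra backward passes, which is what makes it viable in a decentralized setting.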

The aggregation procedure works as follows: in each communication round, every node i gathers the current model parameters w_j from its neighbour set N_i (including itself). For each parameter p, a consensus weight α_{i←j}(p) is computed as

 α_{i←j}(p) = H_{j,diag}(p) / Σ_{k∈N_i∪{i}} H_{k,diag}(p)

where H_{j,diag}(p) is the approximated Hessian diagonal entry for node j at parameter p. The local update is then a weighted element‑wise sum:

 w_i^{(t)} = Σ_{j∈N_i∪{i}} α_{i←j} ⊙ w_j^{(t‑1)}

(⊙ denotes element‑wise multiplication). To avoid instability when Hessian values become very small or noisy, the authors apply ε‑smoothing, clipping, and a moving‑average filter across rounds.
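The two steps above (per‑parameter consensus weights, then an element‑wise weighted sum with ε‑smoothing and clipping) can be sketched as follows; the function name, EPS value, and clipping threshold are illustrative assumptions, and the moving‑average filter across rounds is omitted for brevity:

```python
import torch

EPS = 1e-8  # epsilon-smoothing to keep weights finite when curvature is near zero

def dechw_aggregate(params, hessian_diags, clip_max=1e4):
    """Curvature-weighted element-wise aggregation over a neighbourhood.

    params, hessian_diags: lists (one entry per neighbour, including self) of
    same-shaped parameter tensors and their Hessian-diagonal estimates.
    """
    # Clip noisy or negative curvature estimates before normalizing.
    h = torch.stack([hd.clamp(min=0.0, max=clip_max) for hd in hessian_diags])
    # Per-parameter consensus weights: each neighbour's curvature share.
    w = (h + EPS) / (h + EPS).sum(dim=0, keepdim=True)
    # Element-wise weighted sum across the neighbourhood.
    return (w * torch.stack(params)).sum(dim=0)
```

When all neighbours report equal curvature for a parameter, this reduces to a plain average; a neighbour with near‑zero curvature (a "flat", possibly over‑fitted parameter) contributes almost nothing to that entry.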

The method is evaluated on standard image classification benchmarks (CIFAR‑10, MNIST) under two initialization regimes: homogeneous (all clients start from the same random seed) and heterogeneous (each client uses a different seed). Results show that DecHW reaches >80 % accuracy within 30–40 communication rounds, whereas conventional parameter averaging requires 60–80 rounds to achieve comparable performance. Because fewer rounds are needed, the total transmitted data is reduced by roughly 30 %. Moreover, DecHW remains robust when the graph topology is sparse, the number of neighbors varies across devices, and the data distributions are highly skewed.

Key contributions include:

  1. A novel aggregation rule that operates at the parameter level rather than the model level, leveraging curvature information to assign importance.
  2. An efficient Hessian diagonal approximation suitable for decentralized settings, avoiding the need for a central server or extra synthetic data.
  3. Extensive empirical validation demonstrating faster convergence, lower communication cost, and resilience to both data and initialization heterogeneity.

The paper also discusses limitations. The diagonal approximation ignores off‑diagonal curvature (parameter interactions), which may be significant for very deep or highly coupled architectures. Computing even the diagonal for extremely large models (e.g., Transformers with billions of parameters) could still be a bottleneck, suggesting future work on low‑dimensional subspace curvature estimation or compressed Hessian summaries. Additionally, the current experiments assume a static communication graph; extending DecHW to dynamic or asynchronous networks is an open direction.

In summary, DecHW introduces a principled, second‑order‑aware aggregation mechanism for DFL that directly tackles the root cause of slow convergence—parameter misalignment caused by heterogeneous data and model initializations—while preserving the fully decentralized nature of the system and keeping communication overhead modest. This work opens a promising line of research on curvature‑driven model fusion in server‑less federated environments.

