FedMerge: Federated Personalization via Model Merging
One global model in federated learning (FL) may not be sufficient to serve many clients with non-IID tasks and distributions. While there have been advances in FL toward training multiple global models for better personalization, they offer clients only a limited set of choices, so local finetuning remains indispensable. In this paper, we propose a novel "FedMerge" approach that creates a personalized model per client by simply merging multiple global models with automatically optimized, customized weights. In FedMerge, a few global models can serve many non-IID clients, even without further local finetuning. We formulate this problem as a joint optimization of the global models and the per-client merging weights. Unlike existing FL approaches in which the server broadcasts one or multiple global models to all clients, the server only needs to send a customized, merged model to each client. Moreover, instead of periodically interrupting local training and re-initializing it to a global model, the merged model aligns better with each client's task and data distribution, smoothing the local-global gap between consecutive rounds caused by client drift. We evaluate FedMerge in three different non-IID settings applied to different domains with diverse tasks and data types, in which FedMerge consistently outperforms existing FL approaches, including clustering-based and mixture-of-experts (MoE) based methods.
💡 Research Summary
The paper introduces FedMerge, a federated learning framework that tackles the non‑IID challenge by merging multiple global models on the server side to produce a single personalized model for each client. Traditional FL (e.g., FedAvg) relies on one global model, which often fails when client data distributions differ substantially. Recent multi‑model approaches, especially Mixture‑of‑Experts (MoE) based methods, let clients combine several expert models locally, but this incurs communication and computation costs that grow linearly with the number of experts, making them impractical for resource‑constrained devices.
FedMerge’s core idea is to keep a “model soup” of d global models {Θ₁,…,Θ_d} on the server and learn a client‑specific weight vector w(i,·)∈ℝ^d. The personalized model for client i is constructed as a weighted sum:
θ_i = Σ_{j=1}^d w(i,j)·Θ_j,
with w normalized by a softmax to ensure a proper probability distribution. The server sends only θ_i to each client; the client performs standard local SGD on its own data using this merged model, exactly as in FedAvg. After t local steps, the client returns the model delta Δθ_i = θ_i^{(t)} – θ_i^{(0)} to the server.
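A minimal NumPy sketch of this merging step for a single client (toy sizes; the local SGD phase is simulated with a random perturbation, and all variable names are hypothetical, not taken from the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

d, p = 3, 5                          # d global models, p parameters each (toy sizes)
Theta = rng.normal(size=(d, p))      # the server's "model soup" {Θ_1, ..., Θ_d}
logits = rng.normal(size=d)          # raw, learnable merging weights for client i

# Softmax-normalize so w(i,·) forms a proper probability distribution.
w = np.exp(logits - logits.max())
w /= w.sum()

# Merged personalized model: θ_i = Σ_j w(i,j) · Θ_j
theta_i = w @ Theta

# The client would run t local SGD steps from θ_i (simulated here by a
# random perturbation) and return only the delta Δθ_i to the server.
theta_i_after = theta_i - 0.1 * rng.normal(size=p)
delta_i = theta_i_after - theta_i
```

Note that the client only ever sees the single merged vector `theta_i`, so its communication and computation cost is independent of `d`.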
Although the server holds both the global models and the merging weights, it cannot directly compute their gradients, because the loss is evaluated on each client's private data. The authors therefore treat the whole system as a single-layer fully-connected network in which the merged models are the high-level nodes and the global models are the low-level nodes. Using the chain rule, they derive closed-form updates:
∂L/∂Θ_j = Σ_i (n_i/n)·w(i,j)·∂ℓ/∂θ_i,
∂L/∂w(i,j) = (n_i/n)·⟨Θ_j , ∂ℓ/∂θ_i⟩.
In practice, the server computes ΔΘ_j = Σ_i (n_i/n)·w(i,j)·Δθ_i and Δw(i,j) = (n_i/n)·⟨Θ_j , Δθ_i⟩, then applies simple additive updates. This back‑propagation‑like scheme enables end‑to‑end joint optimization of both the global models and the client‑specific merging weights while keeping client‑side computation identical to vanilla FL.
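The server-side updates above can be sketched in NumPy as follows (toy dimensions; the server-side learning rates and all variable names are assumptions for illustration, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
d, p, num_clients = 3, 5, 4
Theta = rng.normal(size=(d, p))            # global models Θ_j
W = rng.normal(size=(num_clients, d))      # per-client merging logits (pre-softmax)
n = np.array([10.0, 20.0, 30.0, 40.0])     # client dataset sizes n_i
n_frac = n / n.sum()

# Deltas Δθ_i returned by the clients after local training (random stand-ins).
deltas = rng.normal(size=(num_clients, p))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

w = np.stack([softmax(row) for row in W])  # w(i,·) for every client

# ΔΘ_j = Σ_i (n_i/n) · w(i,j) · Δθ_i — each client's delta is routed back
# to the global models in proportion to its merging weights.
delta_Theta = (n_frac[:, None] * w).T @ deltas      # shape (d, p)

# Δw(i,j) = (n_i/n) · ⟨Θ_j, Δθ_i⟩ — inner products between the global
# models and each client's delta.
delta_w = n_frac[:, None] * (deltas @ Theta.T)      # shape (num_clients, d)

# Simple additive updates; the step sizes here are arbitrary assumptions.
Theta = Theta + 1.0 * delta_Theta
W = W + 0.1 * delta_w
```

The vectorized expressions are exactly the summations in the update rules, so client-side computation stays identical to vanilla FL while the server does all the joint optimization.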
To handle very large models, the authors restrict the inner product ⟨Θ_j , Δθ_i⟩ to the parameters of the last few layers (e.g., the classification head), because higher-level layers capture the semantic information most relevant to task heterogeneity, and the restriction reduces noise from the massive parameter space. They also enforce a softmax on w to avoid negative contributions and to regularize the merging process.
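Restricting the inner product to the head parameters could look like the following sketch, where the flat parameter layout and the `head_size` boundary are hypothetical assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
p = 10
theta_global = rng.normal(size=p)   # flattened Θ_j
delta_client = rng.normal(size=p)   # flattened Δθ_i

# Hypothetical layout: the last `head_size` entries hold the classification head.
head_size = 3
head = slice(p - head_size, p)

# Full inner product over all p parameters vs. the restricted one over the head.
full_ip = float(theta_global @ delta_client)
restricted_ip = float(theta_global[head] @ delta_client[head])
```

Only `restricted_ip` would feed into the merging-weight update, discarding the (noisier) contribution of the lower layers.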
Experiments cover three non‑IID settings (Dirichlet partition, label skew, and clustering‑based heterogeneity) across two model families: classic CNNs (ResNet variants) and foundation models fine‑tuned with LoRA adapters. Baselines include FedAvg, clustering‑based personalized FL, and several MoE‑style methods such as pFed‑MoE. Across all scenarios, FedMerge consistently outperforms the baselines, achieving 2–5 percentage‑point gains in accuracy while keeping the communication cost per client constant regardless of the number of global models. The method also shows comparable or faster convergence than FedAvg, supported by a theoretical convergence analysis (provided in the appendix) under standard Lipschitz and bounded‑variance assumptions.
In summary, FedMerge resolves the scalability‑vs‑personalization trade‑off in federated learning by moving model merging to the server, learning client‑specific blending weights, and preserving the simplicity of client‑side training. It offers a practical, low‑overhead path to leverage a large pool of expert models without burdening devices, and opens avenues for future work on non‑linear merging strategies, dynamic expert addition/removal, and stronger privacy guarantees.