Federated Sinkhorn
We study distributed Sinkhorn iterations for entropy-regularized optimal transport when the Gibbs kernel operator is row-partitioned across c workers and cannot be centralized. We present Federated Sinkhorn, a pair of exact synchronous protocols that exchange only scaling-vector slices: (i) an All-to-All scheme implemented with Allgather, and (ii) a Star (parameter-server) scheme implemented with client-to-server sends and server-to-client broadcasts. For both, we derive closed-form per-iteration compute, communication, and memory costs under an alpha-beta latency–bandwidth model, and show that the distributed iterates match centralized Sinkhorn under standard positivity assumptions. Multi-node CPU/GPU experiments validate the model and show that repeated global scaling exchange quickly becomes the dominant bottleneck as c increases. We also report an optional bounded-delay asynchronous schedule and an optional privacy measurement layer for communicated log-scalings.
💡 Research Summary
The paper addresses the problem of solving entropy‑regularized optimal transport (OT) at scale when the Gibbs kernel matrix K = exp(−C/ε) cannot be centralized due to size, policy, or security constraints. In such settings the kernel is row‑partitioned across c workers, each holding a local block K_{I_j,:} and the corresponding marginal slices a_j, b_j. The authors propose Federated Sinkhorn, a family of exact synchronous algorithms that require only the exchange of scaling‑vector slices (u and v) while keeping the raw cost data local. Two communication topologies are studied:
- All‑to‑All (decentralized) – every worker participates in two Allgather collective operations per iteration (first to obtain the global v, then to obtain the global u). After each Allgather, workers perform a local matrix‑vector product (K_{I_j,:} v or K_{:,I_j}ᵀ u) and update their local slice of u or v by element‑wise division with the corresponding marginal. The global iterates (u, v) are provably identical to those produced by a centralized Sinkhorn algorithm under the standard positivity assumption on K.
- Star‑Network (parameter‑server) – a distinguished server stores the full kernel (or can evaluate K·v and Kᵀ·u on demand) and maintains the global scaling vectors. In each iteration workers send their local slice of v to the server; the server computes q = K v, broadcasts the relevant slice q_{I_j} back, and workers update u_{I_j}. The second half of the iteration mirrors this pattern with u → r = Kᵀ u → v. Again, the resulting iterates match centralized Sinkhorn exactly.
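The exactness claim for the All‑to‑All topology can be checked in a single process by replacing the Allgather collectives with concatenation over worker blocks. The sketch below is illustrative, not the authors' implementation: function names and the block partitioning via `np.array_split` are assumptions, and real deployments would use an MPI library rather than in-memory concatenation.

```python
import numpy as np

def centralized_sinkhorn(K, a, b, iters):
    # reference: standard alternating Sinkhorn scaling updates
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u, v

def federated_sinkhorn_a2a(K, a, b, c, iters):
    # single-process simulation of the All-to-All protocol:
    # each "worker" j holds a row block K[I_j, :] and a column block K[:, J_j];
    # np.concatenate stands in for the Allgather of scaling-vector slices
    m, n = K.shape
    rows = np.array_split(np.arange(m), c)
    cols = np.array_split(np.arange(n), c)
    u, v = np.ones(m), np.ones(n)
    for _ in range(iters):
        # local matvec K[I_j,:] @ v, then elementwise update of the u slice
        u = np.concatenate([a[I] / (K[I, :] @ v) for I in rows])
        # local matvec K[:,J_j].T @ u, then elementwise update of the v slice
        v = np.concatenate([b[J] / (K[:, J].T @ u) for J in cols])
    return u, v
```

Because each worker's local matvec computes exactly the rows (or columns) of the global matvec it owns, the concatenated iterates coincide with the centralized ones, which is the identity the paper proves under positivity of K.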
The authors develop a performance model based on the classic α‑β latency‑bandwidth formulation. Local computation cost is modeled as
T_comp ≈ 2 · t_mv(m,n) + t_ew(m),
where t_mv is the measured time for a matrix‑vector product on a block of size m × n, and t_ew covers the cheap element‑wise scaling. Communication cost for a payload B bytes follows T_p2p = α + β · B, with separate (α, β) parameters calibrated for Allgather (AG), Broadcast (BC), and Send/Recv (SR) primitives on the target hardware. Using these, per‑iteration wall‑times are derived:
- All‑to‑All: T_A2A_sync ≈ 2 · t_mv + 2 · T_AG.
- Star: T_Star_sync ≈ 2 · t_srv_mv + 2 · T_BC + 2 · T_SR, where the uplink cost scales with (c − 1) · α_SR + β_SR · B/c (each worker sends a slice of size B/c).
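The two per‑iteration cost formulas above translate directly into a small calculator for comparing topologies on a given platform. This is a sketch of the stated model only; all parameter values in the test are placeholders, not measured constants from the paper.

```python
def per_iter_a2a(t_mv, alpha_ag, beta_ag, B, c):
    # T_A2A_sync ≈ 2 · t_mv + 2 · T_AG, with T_AG = alpha_AG + beta_AG · B
    return 2 * t_mv + 2 * (alpha_ag + beta_ag * B)

def per_iter_star(t_srv_mv, alpha_bc, beta_bc, alpha_sr, beta_sr, B, c):
    # T_Star_sync ≈ 2 · t_srv_mv + 2 · T_BC + 2 · T_SR, where the uplink
    # collects (c - 1) slices of size B/c at the server
    uplink = (c - 1) * alpha_sr + beta_sr * B / c
    return 2 * t_srv_mv + 2 * (alpha_bc + beta_bc * B) + 2 * uplink
```

Calibrating (α, β) per primitive and plugging in the payload size B lets a practitioner predict, before deployment, at which worker count c one topology overtakes the other.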
Memory analysis shows that All‑to‑All requires each worker to store the full vectors u and v (Θ(n) memory), whereas Star allows workers to keep only their local slices (Θ(m)), shifting the bulk of memory to the server.
Beyond the synchronous variants, the paper introduces a bounded‑delay asynchronous schedule (stale‑synchronous model) where each worker may use a slightly outdated copy of the global scaling vectors, bounded by a delay w. An under‑relaxation factor η ∈ (0,1] can be applied to dampen oscillations caused by staleness; η = 1 recovers the exact Sinkhorn updates, while η < 1 yields a convex combination of old and new scalings.
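The under‑relaxed update can be sketched as a convex combination of the old and newly computed scalings. The summary does not state whether the combination is taken in the linear or the log domain; the version below assumes the log domain (a geometric mean), which keeps the scalings strictly positive.

```python
import numpy as np

def relaxed_update(u_old, u_exact, eta):
    # eta = 1 recovers the exact Sinkhorn step; eta < 1 damps oscillations
    # caused by stale scaling vectors (bounded-delay / stale-synchronous model).
    # Convex combination in log space: exp((1-eta)·log u_old + eta·log u_exact)
    return np.exp((1 - eta) * np.log(u_old) + eta * np.log(u_exact))
```

With η = 0.5 this reduces to the elementwise geometric mean of the old and exact scalings.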
A privacy layer is also presented. Since only log‑scaled vectors are communicated, the authors add calibrated Gaussian noise to log u and log v to achieve (ε, δ)‑differential privacy. The sensitivity analysis is straightforward because each entry of log u or log v depends linearly on a single marginal entry. Experiments demonstrate that for typical privacy budgets (e.g., ε = 1, δ = 10⁻⁵) the added noise has negligible impact on the final OT cost.
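A minimal sketch of the privacy layer, assuming the classic Gaussian mechanism is used to calibrate the noise: σ ≥ Δ · √(2 ln(1.25/δ)) / ε for ℓ₂‑sensitivity Δ (this calibration is my assumption; the paper's exact sensitivity bound for log u and log v is not reproduced here).

```python
import math
import numpy as np

def gaussian_sigma(sensitivity, eps, delta):
    # standard Gaussian-mechanism noise scale for (eps, delta)-DP
    # (the usual derivation assumes eps <= 1)
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / eps

def privatize_log_scaling(log_u, sensitivity, eps=1.0, delta=1e-5, rng=None):
    # add calibrated i.i.d. Gaussian noise to the communicated log-scalings
    rng = np.random.default_rng() if rng is None else rng
    sigma = gaussian_sigma(sensitivity, eps, delta)
    return log_u + rng.normal(0.0, sigma, size=log_u.shape)
```

For the budgets quoted in the summary (ε = 1, δ = 10⁻⁵) the noise scale is a small constant multiple of the sensitivity, consistent with the reported negligible impact on the final OT cost.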
Experimental validation is performed on multi‑node CPU and GPU clusters with problem sizes ranging from n = 10⁶ to 10⁷ and worker counts c = 2–64. The authors calibrate t_mv, α, and β for each platform, then compare measured runtimes against the analytical model. Results confirm that:
- Communication dominates as c grows; All‑to‑All’s O(c²) Allgather cost quickly becomes the bottleneck, limiting scalability beyond ~16 workers.
- Star‑Network scales almost linearly with c because the server’s uplink cost grows only with the number of slices, and the downlink broadcast is a single operation.
- Memory usage on workers remains modest for Star, enabling very large n even on modest GPUs.
- The bounded‑delay asynchronous variant with w ≤ 5 and η ≈ 0.8 reduces idle time without sacrificing convergence speed appreciably.
- The differential‑privacy augmentation incurs less than 0.2 % deviation in the final OT objective, confirming practical feasibility.
In summary, Federated Sinkhorn provides a rigorous, communication‑aware framework for distributed entropy‑regularized OT. By exposing closed‑form cost models, the paper equips practitioners with the tools to decide between decentralized All‑to‑All and centralized Star topologies based on network latency, bandwidth, and memory constraints. The optional asynchronous and privacy‑preserving extensions broaden the applicability to federated learning, cross‑silo analytics, and any scenario where raw cost data must remain on‑site while global optimal transport distances are still required.