Communication-Efficient Stochastic Distributed Learning


Authors: Xiaoxing Ren, Nicola Bastianello, Karl H. Johansson, Thomas Parisini

Xiaoxing Ren¹, Nicola Bastianello²⋆, Karl H. Johansson², Thomas Parisini³,⁴,⁵

The work of X.R. and T.P. was supported by the European Union's Horizon 2020 Research and Innovation programme under grant agreement no. 739551 (KIOS CoE). The work of N.B. and K.H.J. was partially supported by the European Union's Horizon Research and Innovation Actions programme under grant agreement No. 101070162, and partially by the Swedish Research Council Distinguished Professor Grant 2017-01078 and the Knut and Alice Wallenberg Foundation Wallenberg Scholar Grant.
¹ School of Civil and Environmental Engineering, Systems Engineering Field, Cornell University, Ithaca, NY, USA. ² School of Electrical Engineering and Computer Science, and Digital Futures, KTH Royal Institute of Technology, Stockholm, Sweden. ³ Department of Electrical and Electronic Engineering, Imperial College London, London, United Kingdom. ⁴ Department of Electronic Systems, Aalborg University, Denmark. ⁵ Department of Engineering and Architecture, University of Trieste, Trieste, Italy. ⋆ Corresponding author. Email: nicolba@kth.se.

Abstract — We address distributed learning problems, both nonconvex and convex, over undirected networks. In particular, we design a novel algorithm based on the distributed Alternating Direction Method of Multipliers (ADMM) to address the challenges of high communication costs and large datasets. Our design tackles these challenges i) by enabling the agents to perform multiple local training steps between each round of communications; and ii) by allowing the agents to employ stochastic gradients while carrying out local computations. We show that the proposed algorithm converges to a neighborhood of a stationary point, for nonconvex problems, and of an optimal point, for convex problems. We also propose a variant of the algorithm to incorporate variance reduction, thus achieving exact convergence. We show that the resulting algorithm indeed converges to a stationary (or optimal) point, and moreover that local training accelerates convergence. We thoroughly compare the proposed algorithms with the state of the art, both theoretically and through numerical results.

Index Terms — Distributed learning; Stochastic optimization; Variance reduction; Local training.

I. INTRODUCTION

Recent technological advances have enabled the widespread adoption of devices with computational and communication capabilities in many fields, for instance, power grids [1], robotics [2], [3], transportation networks [4], and sensor networks [5]. These devices connect with each other, forming multi-agent systems that cooperate to collect and process data [6]. As a result, there is a growing need for algorithms that enable efficient and accurate cooperative learning. In specific terms, the objective in distributed learning is to train a model (e.g., a neural network) with parameters x ∈ R^n cooperatively across a network of N agents. Each agent i has access to a local dataset which defines the local cost as

$$ f_i(x) = \frac{1}{m_i} \sum_{h=1}^{m_i} f_{i,h}(x), \qquad (1) $$

with f_{i,h} : R^n → R being the loss function associated to data point h ∈ {1, …, m_i}. Thus, the goal is for the agents to solve the following constrained problem [7], [8]:

$$ \min_{x_i \in \mathbb{R}^n,\, i=1,\dots,N} \ \frac{1}{N} \sum_{i=1}^{N} f_i(x_i) \quad \text{s.t.} \quad x_1 = x_2 = \dots = x_N, \qquad (2) $$

where the objective is the sum of the local costs (1), so as to pool together the agents' data. Moreover, each agent is assigned a set x_i of local model parameters, and the consensus constraints x_1 = x_2 = … = x_N ensure that the agents asymptotically agree on a shared trained model.

To effectively tackle this problem, especially when dealing with large datasets that involve sensitive information, distributed methods have become increasingly important.
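To fix ideas, the local empirical risk (1) and the consensus objective of (2) can be sketched numerically as follows. The quadratic per-sample losses, the dimensions, and the synthetic data are illustrative stand-ins, not part of the setup above.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, m = 4, 3, 10  # number of agents, parameter dimension, data points per agent

# Synthetic local data: agent i holds pairs (a_{i,h}, b_{i,h}) and the illustrative
# per-sample loss f_{i,h}(x) = 0.5 * (a_{i,h}^T x - b_{i,h})^2.
A = rng.standard_normal((N, m, n))
b = rng.standard_normal((N, m))

def local_cost(i, x):
    """Empirical risk f_i(x) = (1/m_i) * sum_h f_{i,h}(x), cf. (1)."""
    residuals = A[i] @ x - b[i]
    return 0.5 * np.mean(residuals ** 2)

def global_cost(xs):
    """Objective of (2): the average of the local costs, one copy x_i per agent."""
    return float(np.mean([local_cost(i, xs[i]) for i in range(N)]))

# With the consensus constraint x_1 = ... = x_N enforced, the objective reduces
# to the average empirical risk over the agents' pooled data.
x = rng.standard_normal(n)
print(global_cost([x] * N))
```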
These techniques offer significant robustness advantages over federated learning algorithms [6], as they do not rely on a central coordinator and thus, for example, do not have a single point of failure. In particular, both distributed gradient-based algorithms [17], [18], [19], [20], and the distributed Alternating Direction Method of Multipliers (ADMM) [21], [22], [23], [24] have proven to be effective strategies for solving such problems. ADMM-based algorithms have demonstrated strong robustness to practical constraints (see e.g. [22] and references therein), although this often comes at the cost of higher computational complexity compared to gradient-based methods. In this work, we propose novel ADMM-based algorithms that retain the computational efficiency characteristic of gradient methods.

However, many learning applications face two challenges: high communication costs, especially when training large models, and large datasets. In this paper, we jointly address these challenges with the following approach. First, we guarantee the communication efficiency of our algorithm by adopting the paradigm of local training, which reduces the frequency of communications. In other terms, the agents perform multiple training steps between communication rounds. We tackle the second challenge by locally incorporating stochastic gradients. The idea is to allow the agents to estimate local gradients by employing only a (random) subset of the available data, thus avoiding the computational burden of full gradient evaluations on large datasets.

TABLE I
COMPARISON WITH THE STATE OF THE ART IN STOCHASTIC DISTRIBUTED OPTIMIZATION.

Each row lists: algorithm [Ref.] | variance reduction | grad. steps ÷ comm. | # stored variables† | comm. size‡ | # ∇f_{i,h} evaluations per iteration | assumptions⋆ | convergence.

K-GT [9]                         | ✗ | τ ÷ 1             | 2         | 2|N_i| | 1                | n.c.        | sub-linear, ∝ σ²
LED [10]                         | ✗ | τ ÷ 1             | 2         | |N_i|  | 1                | n.c. / s.c. | sub-linear, ∝ σ² / linear, ∝ σ²
RandCom [11]                     | ✗ | 1/p ÷ 1 (in mean) | 2         | |N_i|  | 1                | n.c. / s.c. | sub-linear, ∝ σ² / linear, ∝ σ²
VR-EXTRA/DIGing [12], GT-VR [13] | ✓ | 1 ÷ 1             | 3         | 2|N_i| | |B|, m_i every 1/p | n.c. / s.c. | sub-linear, → 0 / linear, → 0
GT-SAGA [14], [15]               | ✓ | 1 ÷ 1             | 3         | 2|N_i| | 1                | n.c. / s.c. | sub-linear, → 0 / linear, → 0
GT-SARAH [16]                    | ✓ | 1 ÷ 1             | 3         | 2|N_i| | |B|, m_i every τ | n.c.        | sub-linear, → 0
GT-SVRG [14]                     | ✓ | 1 ÷ 1             | 3         | 2|N_i| | 1, m_i every τ   | s.c.        | linear, → 0
LT-ADMM [this work]              | ✗ | τ ÷ 1             | |N_i| + 1 | |N_i|  | |B|              | n.c.        | sub-linear, ∝ σ²
LT-ADMM-VR [this work]           | ✓ | τ ÷ 1             | |N_i| + 1 | |N_i|  | |B|, m_i every τ | n.c.        | sub-linear, → 0

† number of vectors in R^n stored by each agent between iterations (disregarding temporary variables)
‡ number of messages sent by each agent during a communication round
⋆ n.c. and s.c. stand for nonconvex and strongly convex, respectively

Our main contributions are as follows:

• We propose two algorithms based on distributed ADMM, with one round of communication between multiple local update steps. The first algorithm, Local Training ADMM (LT-ADMM), uses stochastic gradient descent (SGD) for the local updates, while the second algorithm, LT-ADMM with Variance Reduction (LT-ADMM-VR), uses a variance-reduced SGD method [25].

• We establish the convergence properties of LT-ADMM for both nonconvex and convex (not strongly convex) learning problems. In particular, we show almost-sure and mean-squared convergence of LT-ADMM to a neighborhood of a stationary point in the nonconvex case, and to a neighborhood of an optimum in the convex case. The radius of the neighborhood depends on specific properties of the problem and on tunable parameters. We prove that the algorithm achieves a convergence rate of O(1/(Kτ)), where K is the number of iterations and τ the number of local training steps.

• For LT-ADMM-VR, we prove exact convergence to a stationary point in the nonconvex case, and to an optimum in the convex case.
The algorithm has an O(1/(Kτ)) rate of convergence, which is faster than the O(1/K) obtained by related algorithms [16], [15], [13].

• We provide extensive numerical evaluations comparing the proposed algorithms with the state of the art. The results validate the communication efficiency of the algorithms. Indeed, LT-ADMM and LT-ADMM-VR outperform alternative methods when communications are expensive.

A. Comparison with the state of the art

We compare our proposed algorithms – LT-ADMM and LT-ADMM-VR – with the state of the art. The comparison is summarized in Table I. Decentralized learning algorithms, as first highlighted in the seminal paper [26] on federated learning, face the fundamental challenge of high communication costs. The authors of [26] address this challenge by designing a communication-efficient algorithm which allows the agents to perform multiple local training steps before each round of communication with the coordinator. However, the accuracy of the algorithm in [26] degrades significantly when the agents have heterogeneous data. Since then, alternative federated learning algorithms, e.g., [27], [28], [29], [30], have been designed to employ local training without compromising accuracy. The interest in communication-efficient algorithms has more recently extended to the distributed set-up, where agents rely on peer-to-peer communications rather than on a coordinator as in federated learning. Distributed algorithms with local training have been proposed in [31], [9], [10], [11]. In particular, [31], [9], [10] present gradient tracking methods which allow each agent to perform a fixed number of local updates between each communication round. The algorithm in [11], which builds on [28], instead triggers communication rounds according to a given probability distribution, resulting in a time-varying number of local training steps.
Another related algorithm is that of [32], which allows for both multiple consensus and gradient steps in each iteration. However, this algorithm requires a monotonically increasing number of communication rounds in order to guarantee exact convergence. A stochastic version of [32] was then studied in [33]. The algorithm has inexact gradient evaluations, but only allows for multiple consensus steps. An alternative approach to reducing the frequency of communications is to employ event-triggering, see e.g. [34], where messages are exchanged only when a certain condition is met.

When the agents employ stochastic gradients in the algorithms of [9], [10], [11], they only converge to a neighborhood of a stationary point, whose radius is proportional to the stochastic gradient variance. Different variance reduction techniques are available to improve the accuracy of (centralized) algorithms relying on stochastic gradients, e.g., [35], [25], [36]. These methods have then been applied to distributed optimization by combining them with widely used gradient tracking algorithms [12], [13], [14], [15], [16]. The resulting algorithms succeed in guaranteeing exact convergence to a stationary point despite the presence of gradient noise. However, they are not communication-efficient, as they only allow one gradient update per communication round.

We conclude by providing in Table I a summary of the key features of the algorithms discussed above. This table focuses on methods that employ the mechanisms of primary interest in this work – local training and variance reduction. First, we classify them based on whether or not they use variance reduction and local training. For the latter, we report the ratio of gradient steps to communication rounds that characterizes each algorithm, with a ratio of 1 ÷ 1 signifying that no local training is used.
Notice that LT-ADMM-VR is the only algorithm to use both variance reduction and local training, while the others use only one of the two techniques. We then compare the number of variables stored by the agents when they deploy each algorithm (disregarding temporary variables). We notice that the variable storage of LT-ADMM and LT-ADMM-VR, differently from the alternatives, scales with the size of an agent's neighborhood; this is due to the use of distributed ADMM as the foundation of our proposed algorithms [21]. We see that [10], [11], LT-ADMM, and LT-ADMM-VR require one communication per neighbor, while the other methods require two communications per neighbor. We also compare the algorithms by the computational complexity of the gradient estimators they employ, namely, the number of component gradient evaluations needed per local training iteration. The algorithms of [9], [10], [11], [14] use a single data point to estimate the gradient, while [12], [16], LT-ADMM, and LT-ADMM-VR can apply mini-batch estimators that use a subset B of the local data points. The use of mini-batches yields more precise gradient estimates and increased flexibility. However, we remark that the gradient estimators used in [12], [16], [14], and LT-ADMM-VR require a registry of component gradient evaluations, which needs to be refreshed entirely at fixed intervals. This coincides with the evaluation of a full gradient, and thus requires m_i component gradient evaluations. Finally, we compare the algorithms' convergence. We notice that all algorithms, except for [14], provide (sub-linear) convergence guarantees for convex and nonconvex problems. Additionally, some works show linear convergence for strongly convex problems. We further distinguish between algorithms which achieve exact convergence, due to the use of variance reduction, and those with inexact convergence, with an error proportional to the stochastic gradient variance (∝ σ²).
Outline: The outline of the paper is as follows. Section II formulates the problem at hand and presents the design of the proposed algorithms. Section III analyzes their convergence and discusses the results. Section IV reports and discusses numerical results comparing the proposed algorithms with the state of the art. Section V presents some concluding remarks.

Notation: ∇f denotes the gradient of a differentiable function f. Given a matrix A ∈ R^{n×n}, λ_min(A) and λ_max(A) denote the smallest and largest eigenvalue of A, respectively. A > 0 denotes that the matrix A is positive definite. With n ∈ N, we let 1_n ∈ R^n be the vector with all elements equal to 1, I ∈ R^{n×n} the identity matrix, and 0 ∈ R^{n×n} the zero matrix. ⟨x, y⟩ = Σ_{h=1}^n x_h y_h denotes the standard inner product of two vectors x, y ∈ R^n. ∥·∥ denotes the Euclidean norm of a vector and the matrix-induced 2-norm of a matrix. The proximal operator of a cost f, with penalty ρ > 0, is defined as

$$ \operatorname{prox}_{\rho f}(x) = \operatorname*{arg\,min}_{y \in \mathbb{R}^n} \left\{ f(y) + \frac{1}{2\rho} \|y - x\|^2 \right\}. $$

II. PROBLEM FORMULATION AND ALGORITHM DESIGN

In this section, we formulate the problem at hand and present our proposed algorithms.

A. Problem formulation

We target the solution of (2) over an undirected graph G = (V, E), where V = {1, …, N} is the set of N agents, and E ⊂ V × V is the set of edges (i, j), i, j ∈ V. In particular, we assume that the local costs f_i : R^n → R are in the empirical risk minimization form (1). We make the following assumptions for (2), which are commonly used to support the convergence analysis of distributed learning algorithms (see e.g. [9], [10], [11], [12], [15], [16]).

Assumption 1: G = (V, E) is a connected, undirected graph.

Assumption 2: The cost function f_i of each agent i ∈ V is L-smooth. That is, there exists L > 0 such that ∥∇f_i(x) − ∇f_i(y)∥ ≤ L∥x − y∥, ∀x, y ∈ R^n.
Moreover, f_i is proper: f_i(x) > −∞, ∀x ∈ R^n.

When, in the following, we specialize our results to convex scenarios, we resort to the additional assumption below.

Assumption 3: Each function f_i, i ∈ V, is convex.

B. Algorithm design

We start our design from the distributed ADMM, characterized by the updates¹ [21]:

$$ x_{i,k+1} = \operatorname{prox}_{\frac{1}{\rho|\mathcal{N}_i|} f_i}\left( \frac{1}{\rho|\mathcal{N}_i|} \sum_{j \in \mathcal{N}_i} z_{ij,k} \right), \qquad (3a) $$
$$ z_{ij,k+1} = \frac{1}{2}\left( z_{ij,k} - z_{ji,k} + 2\rho x_{j,k+1} \right), \qquad (3b) $$

where N_i = {j ∈ V | (i, j) ∈ E} denotes the neighbors of agent i, ρ > 0 is a penalty parameter, and z_{ij} ∈ R^n are auxiliary variables, one for each neighbor of agent i. This algorithm converges in a wide range of scenarios and, differently from most gradient tracking approaches, shows robustness to many challenges (asynchrony, limited communications, etc.) [21], [22]. However, the drawback of (3) is that the agents need to solve an optimization problem to update x_i, which in general does not have a closed-form solution. Therefore, in practice, the agents need to compute an approximate update of (3a), which can lead to inexact convergence [22].

In this paper, we modify (3) to use approximate local updates, while ensuring that this choice does not compromise exact convergence. In particular, we allow the agents to use τ ∈ N iterations of a gradient-based solver to approximate (3a), which yields the update:

$$ \phi^0_{i,k} = x_{i,k}, $$
$$ \phi^{t+1}_{i,k} = \phi^t_{i,k} - \gamma \left( g_i(\phi^t_{i,k}) + \beta \left( \rho |\mathcal{N}_i| x_{i,k} - \sum_{j \in \mathcal{N}_i} z_{ij,k} \right) \right), \quad t = 0, \dots, \tau - 1, $$
$$ x_{i,k+1} = \phi^\tau_{i,k}, \qquad (4) $$

where γ, β are positive step-sizes, and g_i(φ^t_{i,k}) is an estimate of the gradient ∇f_i. Notice that for efficiency's sake we "freeze" the penalty term ρ|N_i| x_{i,k} − Σ_{j∈N_i} z_{ij,k}, and for flexibility we multiply the gradient estimate and the penalty term by two different step-sizes.

¹ We remark that, more precisely, (3) corresponds to the algorithm in [21] with relaxation parameter α = 1/2.
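A minimal single-agent sketch of the inner loop (4) follows, assuming g_i is the exact local gradient of an illustrative quadratic cost (the σ = 0 case), and with placeholder values for the neighborhood size and the aggregated auxiliary variables:

```python
import numpy as np

rng = np.random.default_rng(2)
n, tau = 4, 5
gamma, beta, rho = 0.05, 0.1, 1.0
num_neighbors = 3                    # |N_i|, a placeholder

# Illustrative quadratic local cost f_i; g_i is its exact gradient.
A = rng.standard_normal((12, n))
b = rng.standard_normal(12)
def g_i(phi):
    return A.T @ (A @ phi - b) / 12

x_ik = rng.standard_normal(n)        # x_{i,k}
z_sum = rng.standard_normal(n)       # placeholder for sum_{j in N_i} z_{ij,k}

# Penalty gradient, "frozen" at x_{i,k}: rho * |N_i| * x_{i,k} - sum_j z_{ij,k}.
penalty = rho * num_neighbors * x_ik - z_sum

phi = x_ik.copy()                    # phi^0_{i,k} = x_{i,k}
for t in range(tau):                 # t = 0, ..., tau - 1
    phi = phi - gamma * (g_i(phi) + beta * penalty)
x_next = phi                         # x_{i,k+1} = phi^tau_{i,k}
print(x_next)
```

Note how the penalty term is computed once before the loop, so each of the τ inner steps only requires a gradient (estimate) of f_i.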
The resulting algorithm is a distributed gradient method, with the difference that each communication round (3b) is preceded by τ > 1 local gradient evaluations. This is an application of the local training paradigm [10]. We remark that the convergence of the proposed algorithm rests on the initialization φ^0_{i,k} = x_{i,k}, which enacts a feedback loop on the local training. In general, without this initialization, exact convergence cannot be achieved [22].

The local training (4) requires a local gradient evaluation, or at least an estimate thereof. In the following, we introduce two different estimator options. Notice that the gradient of the penalty term, ρ|N_i| x_{i,k} − Σ_{j∈N_i} z_{ij,k}, is exactly known (and frozen) and does not need an estimator.

The most straightforward idea is to simply employ the local gradient g_i(φ) = ∇f_i(φ). However, in learning applications, the agents may store large datasets (m_i ≫ 1). Therefore, computing ∇f_i(φ) becomes computationally expensive. To remedy this, the agents can instead use stochastic gradients, choosing

$$ g_i(\phi) = \frac{1}{|\mathcal{B}_i|} \sum_{h \in \mathcal{B}_i} \nabla f_{i,h}(\phi), \qquad (5) $$

where B_i is a set of indices drawn at random from {1, …, m_i}, with |B_i| < m_i. While reducing the computational complexity of the local training iterations, the use of stochastic gradients results in inexact convergence. The idea, therefore, is to employ a gradient estimator based on a variance reduction scheme. In particular, we adopt the scheme proposed in [25], characterized by the following procedure. Each agent maintains a table of component gradients {∇f_{i,h}(r^t_{i,h,k})}, h = 1, …, m_i, where r^t_{i,h,k} is the most recent iterate at which the component gradient was evaluated. This table is reset at the beginning of every new local training round (that is, for any k ∈ N when t = 0).
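The mini-batch estimator (5) can be sketched as follows, again with illustrative quadratic component losses standing in for the ∇f_{i,h}; averaging many independent draws shows empirically that the estimator is unbiased.

```python
import numpy as np

rng = np.random.default_rng(3)
m_i, n, batch = 50, 4, 8  # illustrative dataset size, dimension, batch size |B_i|
A = rng.standard_normal((m_i, n))
b = rng.standard_normal(m_i)

def component_grad(h, x):
    # Gradient of an illustrative quadratic per-sample loss f_{i,h}.
    return A[h] * (A[h] @ x - b[h])

def stochastic_grad(x):
    """Mini-batch estimator (5): average component gradients over a random batch B_i."""
    B = rng.choice(m_i, size=batch, replace=False)
    return np.mean([component_grad(h, x) for h in B], axis=0)

x = rng.standard_normal(n)
full_grad = np.mean([component_grad(h, x) for h in range(m_i)], axis=0)
# Unbiased but noisy: the average over many draws approaches the full gradient,
# while each single draw deviates from it (the variance sigma^2 in Assumption 4).
avg_est = np.mean([stochastic_grad(x) for _ in range(2000)], axis=0)
print(np.linalg.norm(avg_est - full_grad))
```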
Using the table, the agents then estimate their local gradients as

$$ g_i(\phi^t_{i,k}) = \frac{1}{|\mathcal{B}_i|} \sum_{h \in \mathcal{B}_i} \left( \nabla f_{i,h}(\phi^t_{i,k}) - \nabla f_{i,h}(r^t_{i,h,k}) \right) + \frac{1}{m_i} \sum_{h=1}^{m_i} \nabla f_{i,h}(r^t_{i,h,k}). \qquad (6) $$

The gradient estimate is then used to update φ^{t+1}_{i,k} according to (4); afterwards, the agents update their local memory by setting r^{t+1}_{i,h,k} = φ^{t+1}_{i,k} if h ∈ B_i, and r^{t+1}_{i,h,k} = r^t_{i,h,k} otherwise. Notice that this update requires a full gradient evaluation at the beginning of each local training round, to populate the memory with {∇f_{i,h}(r^0_{i,h,k}) = ∇f_{i,h}(φ^0_{i,k})}, h = 1, …, m_i. In the following steps (t > 0), each agent only computes |B_i| component gradients.

Selecting the stochastic gradient estimator (5) yields the proposed algorithm LT-ADMM, while selecting the variance reduction scheme (6) yields the proposed algorithm LT-ADMM-VR. The two methods are reported in Algorithm 1.

Algorithm 1 LT-ADMM and LT-ADMM-VR
Input: For each node i, initialize x_{i,0} = z_{ij,0}, j ∈ N_i. Set the penalty ρ, the number of local training steps τ, the number of iterations K, and the local step-sizes γ, β.
1: for k = 0, 1, …, K − 1, every agent i do
     // local training
2:   φ^0_{i,k} = x_{i,k}; r^0_{i,h,k} = x_{i,k} for all h ∈ {1, …, m_i}
3:   for t = 0, 1, …, τ − 1 do
4:     draw the batch B_i uniformly at random
5:     [LT-ADMM] update the gradient estimator according to (5)
6:     [LT-ADMM-VR] update the gradient estimator according to (6)
7:     update φ^{t+1}_{i,k} according to (4)
8:     [LT-ADMM-VR] if h ∈ B_i, update r^{t+1}_{i,h,k} = φ^{t+1}_{i,k}; else r^{t+1}_{i,h,k} = r^t_{i,h,k}
9:   end for
10:  x_{i,k+1} = φ^τ_{i,k}
     // communication
11:  transmit z_{ij,k} − 2ρ x_{i,k+1} to each neighbor j ∈ N_i, and receive the corresponding transmissions
     // auxiliary update
12:  update z_{ij,k+1} according to (3b)
13: end for

III. CONVERGENCE ANALYSIS AND DISCUSSION

In this section, we analyze the convergence rate of Algorithm 1 in both nonconvex and convex scenarios.
Throughout, we will employ the following metric of convergence:

$$ \mathcal{D}_k = \mathbb{E}\left[ \|\nabla F(\bar{x}_k)\|^2 + \frac{1}{\tau} \sum_{t=0}^{\tau-1} \left\| \frac{1}{N} \sum_{i=1}^{N} \nabla f_i(\phi^t_{i,k}) \right\|^2 \right], \qquad (7) $$

where F(x) = (1/N) Σ_{i=1}^N f_i(x) and x̄_k = (1/N) Σ_{i=1}^N x_{i,k}. If the agents converge to a stationary point of (2), then D_k → 0. We note that this performance measure is standard in the literature on stochastic gradient methods and distributed optimization [10], [11]. Although it does not, in general, imply almost sure convergence of the sequence {D_k}_{k=1}^K, it provides meaningful performance guarantees in expectation. Specifically, if the index K′ is selected uniformly at random from 1, …, K, then E[D_{K′}] = (1/K) Σ_{k=0}^{K−1} D_k.

A. Convergence with SGD

We start by characterizing the convergence of Algorithm 1 when the agents use SGD during local training (LT-ADMM). To this end, we make the following standard assumption on the variance of the gradient estimators, see e.g. [9], [10].

Assumption 4: For all φ ∈ R^n, the gradient estimators g_i(φ), i ∈ V, in (5) are unbiased and their variance is bounded by some σ² > 0:

$$ \mathbb{E}[g_i(\phi) - \nabla f_i(\phi)] = 0, \qquad \mathbb{E}\left[ \|g_i(\phi) - \nabla f_i(\phi)\|^2 \right] \leq \sigma^2. $$

We are now ready to state our convergence results. All the proofs are deferred to the Appendix, where Appendix I provides a sketch of the proofs, followed by the full proofs. We remark that prior analyses of distributed ADMM based on operator-theoretic approaches [21], [22] are not directly applicable to LT-ADMM, and the convergence proofs must therefore be specifically tailored to this algorithm.

Theorem 1 (Nonconvex case): Let Assumptions 1, 2, and 4 hold.
If the local step-sizes satisfy γ ≤ O(λ_l/(Lτ²)) < γ_sgd := min_{i=1,2,…,6} γ̄_i (see (10) in Appendix II-A for the precise bound), and 1/(τλ_u ρ) ≤ β < 2/(τλ_u ρ), then the output of LT-ADMM satisfies:

$$ \frac{1}{K} \sum_{k=0}^{K-1} \mathcal{D}_k \leq O\!\left( \frac{F(\bar{x}_0) - F(x^*)}{K \gamma \tau} \right) + O\!\left( \gamma \tau \sigma^2 \right) + O\!\left( \frac{\|\hat{d}_0\|^2}{\rho^2 K N} \right), \qquad (8) $$

where x^* is a stationary point of (2), λ_u is the largest eigenvalue of the graph G's Laplacian matrix, λ_l is the smallest nonzero eigenvalue of the graph G's Laplacian matrix, and ∥d̂_0∥ is related to the initial conditions (see (26)).

Theorem 1 shows that LT-ADMM converges to a neighborhood of a stationary point x^* as K → ∞. The radius of this neighborhood is proportional to the step-size γ, to the number of local training epochs τ, and to the stochastic gradient variance σ². The result can then be particularized to the convex case as follows.

Corollary 1 (Convex case): In the setting of Theorem 1, under the additional Assumption 3, the output of LT-ADMM converges to a neighborhood of an optimal solution x^*, characterized by (8).

Remark 1 (Exact convergence): Clearly, if we employ full gradients (and thus σ = 0), then these results prove exact convergence to a stationary/optimal point. This verifies that our algorithm design achieves convergence despite the use of approximate local updates.

B. Convergence with variance reduction

The results of the previous section show that only inexact convergence can be achieved when employing SGD. The following results show how Algorithm 1 achieves exact convergence when using variance reduction (LT-ADMM-VR).

Theorem 2 (Nonconvex case): Let Assumptions 1, 2 hold.
If the local step-sizes satisfy γ ≤ O(λ_l/(Lτ³)) < γ_vr := min_{i=1,7,8,…,15} γ̄_i (see (11) in Appendix II-A for the precise bound), and 1/(τλ_u ρ) ≤ β < 2/(τλ_u ρ), then the output of LT-ADMM-VR converges to a stationary point x^* of (2), and in particular it holds that:

$$ \frac{1}{K} \sum_{k=0}^{K-1} \mathcal{D}_k \leq O\!\left( \frac{F(\bar{x}_0) - F(x^*)}{K \gamma \tau} \right) + O\!\left( \frac{\|\hat{d}_0\|^2}{\rho^2 K} \right), \qquad (9) $$

where x^* is a stationary point of (2), λ_u is the largest eigenvalue of the graph G's Laplacian matrix, λ_l is the smallest nonzero eigenvalue of the graph G's Laplacian matrix, and ∥d̂_0∥ is related to the initial conditions (see (26)).

Corollary 2 (Convex case): In the setting of Theorem 2, under the additional Assumption 3, the output of LT-ADMM-VR converges to an optimal solution x^*, with rate characterized by (9).

C. Discussion

1) Choice of step-size: The upper bounds on the step-sizes of LT-ADMM and LT-ADMM-VR ((10) and (11) in Appendix II-A) highlight a dependence on several features of the problem. In particular, the step-size bounds decrease as the smoothness constant L increases, as is usually the case for gradient-based algorithms. Moreover, the bounds are proportional to the network connectivity, represented by the smallest nonzero eigenvalue of G's Laplacian (the algebraic connectivity λ_l). Thus, less connected graphs (smaller λ_l) result in smaller bounds. Finally, we remark that the step-size bound for LT-ADMM-VR is proportional to m_l/m_u = min_{i∈V} m_i / max_{i∈V} m_i, where m_i is the number of data points available to agent i (see (1)). This ratio can be viewed as a measure of heterogeneity between the agents: smaller values of m_l/m_u indicate a higher imbalance in the amount of data available to the agents. The step-size bound is thus smaller for less balanced scenarios.

The step-size bounds also depend on the tunable parameters τ, the number of local updates, and ρ, the penalty parameter.
Therefore, these two parameters can be tuned in order to increase the step-size bounds, which translates into faster convergence.

2) Convergence rates: As discussed in section I-A, various distributed algorithms with variance reduction have recently been proposed, for example, [12], [14] for strongly convex problems, and [16], [15], [13] for nonconvex problems. Focusing on [16], [15], [13], we notice that their convergence rate is O(1/K), while Theorem 2 shows that LT-ADMM-VR has a rate of O(1/(τK)). This shows that employing local training accelerates convergence. Similarly to LT-ADMM-VR, [16], [12] also use batch gradient computations, i.e., they update a subset of components to estimate the gradient (see (6)). Interestingly, the step-size upper bound and, hence, the convergence rate in [16], [12] depend on the batch size. On the other hand, our theoretical results are not affected by the batch size, since we use a different variance reduction technique.

We also remark that, as shown in (48) and (70) in the Appendix, better network connectivity (corresponding to larger λ_l) leads to smaller upper bounds on the right-hand side of the convergence results. This indicates that stronger network connectivity accelerates convergence.

Finally, the bound in Theorem 1 also highlights a trade-off: a larger γ accelerates convergence through the term O(1/(Kγτ)), but it also enlarges the steady-state neighborhood via the term O(γτσ²). Thus, γ must be tuned to balance convergence speed and steady-state precision – and a similar discussion holds for τ. This trade-off is also explored in the numerical results of section IV-B.

3) Choice of variance reduction mechanism: In variance reduction, we distinguish two main classes of algorithms: those that need to periodically (or randomly) perform a full gradient evaluation (SARAH-type [36]), and those that do not (SAGA-type [25]).
In distributed learning, SARAH-type algorithms were proposed in, e.g., [16], [14], while SAGA-type algorithms in, e.g., [14]. The proposed LT-ADMM-VR also requires a periodic full gradient evaluation, as the agents re-initialize their local gradient memory at the start of each local training round (since they set r^0_{i,h,k} = x_{i,k}). Clearly, periodically computing a full gradient significantly increases the computational complexity of the algorithm. Thus, one can design a SAGA-type variant of LT-ADMM-VR by removing the gradient memory re-initialization at the start of local training (choosing now r^0_{i,h,k} = r^τ_{i,h,k−1}). This variant is computationally cheaper and shows promising empirical performance, see the results for LT-ADMM-VR v2 in section IV. However, using the outdated gradient memory leads to a more complex theoretical analysis, which we leave for future work.

4) Uncoordinated parameters: In principle, the agents could employ uncoordinated parameters, depending on their available resources (e.g., heterogeneous computational capabilities). For instance, different agents could adopt distinct local solvers, numbers of updates (see results in section IV-B), and batch sizes. Alternatively, they could use the same solver but with step-sizes tailored to the smoothness of their local cost functions.

IV. NUMERICAL RESULTS

In this section we compare the proposed algorithms with the state of the art, applying them to a classification problem with nonconvex regularization, characterized by [10]:

$$ f_i(x) = \frac{1}{m_i} \sum_{h=1}^{m_i} \log\left( 1 + \exp\left( -b_{i,h} a_{i,h}^\top x \right) \right) + \epsilon \sum_{\ell=1}^{n} \frac{[x]_\ell^2}{1 + [x]_\ell^2}, $$

where [x]_ℓ is the ℓ-th component of x ∈ R^n, and a_{i,h} ∈ R^n and b_{i,h} ∈ {−1, 1} are the pairs of feature vector and label. As data we use 8 × 8 gray-scale images of handwritten digits², with pixels normalized to [0, 1]; we divide the images into the two classes 'even' and 'odd'.
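For concreteness, the cost above can be implemented as follows; the feature vectors here are random stand-ins for the normalized digit images, so only the functional form (logistic loss plus nonconvex regularizer) matches the experimental setup.

```python
import numpy as np

rng = np.random.default_rng(5)
n, m_i, eps = 64, 180, 0.01            # dimensions and epsilon from the setup
a = rng.uniform(0.0, 1.0, size=(m_i, n))    # stand-ins for the normalized images
labels = rng.choice([-1.0, 1.0], size=m_i)  # 'even' vs 'odd' labels

def f_i(x):
    """Logistic loss plus the nonconvex regularizer eps * sum_l x_l^2 / (1 + x_l^2)."""
    margins = -labels * (a @ x)
    # log(1 + exp(.)) computed stably as logaddexp(0, .)
    logistic = np.mean(np.logaddexp(0.0, margins))
    reg = eps * np.sum(x ** 2 / (1.0 + x ** 2))
    return logistic + reg

x = rng.standard_normal(n)
print(f_i(x))
```

Note that the regularizer is bounded by εn, so it penalizes large components without destroying the smoothness required by Assumption 2; at x = 0 the cost reduces to log 2.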
W e choose a ring graph with N = 10 , and ha ve n = 64 , m i = 180 , ϵ = 0 . 01 ; the initial conditions are randomly chosen as x i, 0 ∼ N (0 , 100 I n ) . W e use stochastic gradients with a batch of |B | = 1 . All results are av eraged ov er 10 Monte Carlo iterations. F or the algorithms with local training we select τ = 2 . W e also tune the step- sizes of all algorithms to ensure best performance. Finally , as performance metric we employ ∥∇ F ( ¯ x k ) ∥ 2 , which is zero if the agents have reached a solution of (2) 3 . The simulations are implemented in Python and run on a W indows laptop with Intel i7-1265U and 16GB of RAM. A. Comparison with the state of the ar t W e start by comparing L T -ADMM and L T -ADMM-VR with local training algorithms LED [10] and K-GT [9], as well 2 From https://doi.org/10.24432/C50P49 . 3 W e choose this metric as it can be defined for all algorithms considered in the comparison, whereas D k is defined specifically for Algorithm 1. as variance reduction algorithms GT -SARAH [16], and GT - SA GA [14]. W e also compare with the alternativ e version L T - ADMM-VR v2 discussed in section III-C.3. When ev aluating the performance, we account for the computation time of each algorithm, rather than the more commonly used iteration count. In particular, letting t G be the time for a component gradient ev aluation ( ∇ f i,h ), and t C the time for a round of communications, T able II reports the computation time incurred by each algorithm over the course of τ iterations. T ABLE II C O M P U TA T I O N T I M E O F T H E A L G O R I T HM S OV ER τ I TE R A T I O N S . Algorithm [Ref.] 
Algorithm [Ref.]             Time
LED [10] & K-GT [9]          $\tau t_G + 2 t_C$
GT-SARAH [16]                $(m_i + \tau - 1) t_G + 2 \tau t_C$
GT-SAGA [14]                 $\tau (t_G + 2 t_C)$
LT-ADMM & LT-ADMM-VR v2      $\tau t_G + t_C$
LT-ADMM-VR                   $(m_i + \tau - 1) t_G + t_C$

We start by comparing in Table III the algorithms with variance reduction, in terms of the computation time they require to reach $\|\nabla F(\bar{x}_k)\|^2 < 10^{-7}$, that is, to reach a stationary point up to numerical precision.

TABLE III: COMPARISON OF COMPUTATION TIME FOR VARIANCE-REDUCED ALGORITHMS TO REACH $\|\nabla F(\bar{x}_k)\|^2 < 10^{-7}$.

Algorithm [Ref.]   $t_G/t_C = 0.1$       $t_G/t_C = 1$         $t_G/t_C = 10$
GT-SARAH [16]      $7.57 \times 10^5$    $6.33 \times 10^6$    $6.20 \times 10^7$
GT-SAGA [14]       $1.55 \times 10^5$    $2.21 \times 10^5$    $8.85 \times 10^5$
LT-ADMM-VR         $6.04 \times 10^5$    $5.76 \times 10^6$    $5.73 \times 10^7$
LT-ADMM-VR v2      $3.81 \times 10^4$    $9.52 \times 10^4$    $6.66 \times 10^5$

We see that, depending on the ratio $t_G/t_C$, their relative speed of convergence changes. When gradient computations are cheaper than communications ($t_G/t_C = 0.1$), the proposed LT-ADMM-VR (and LT-ADMM-VR v2) outperform both GT-SARAH and GT-SAGA, since the latter do not employ local training. This testifies to the benefit of employing local training in scenarios where communications are expensive. As the ratio $t_G/t_C$ increases to 1 and then 10, LT-ADMM-VR and GT-SARAH, on the one hand, and LT-ADMM-VR v2 and GT-SAGA, on the other, tend to align in terms of performance, as the bulk of the computation time is now due to gradient evaluations, of which the two pairs of algorithms perform a similar number (see Table II). Nonetheless, local training still gives an edge to the proposed algorithms.

The remaining algorithms (LT-ADMM, LED, K-GT) do not guarantee exact convergence, as they do not employ variance reduction. Thus, in Table IV we report the asymptotic value of $\|\nabla F(\bar{x}_k)\|^2$ achieved by the different methods.
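To make the comparison concrete, the per-$\tau$-iterations cost model of Table II can be encoded as a small helper (a hypothetical function of ours, not from the paper's code):

```python
def comp_time(alg, tau, m_i, t_G, t_C):
    """Computation time over tau iterations, following Table II.
    t_G: time of one component gradient evaluation; t_C: one communication round."""
    model = {
        "LED/K-GT":        tau * t_G + 2 * t_C,
        "GT-SARAH":        (m_i + tau - 1) * t_G + 2 * tau * t_C,
        "GT-SAGA":         tau * (t_G + 2 * t_C),
        "LT-ADMM(+VR v2)": tau * t_G + t_C,
        "LT-ADMM-VR":      (m_i + tau - 1) * t_G + t_C,
    }
    return model[alg]
```

For instance, with $\tau = 2$, $m_i = 180$, and cheap gradients ($t_G/t_C = 0.1$), the local-training methods pay a single communication round per $\tau$ iterations, which is the regime where Table III shows them ahead.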
The algorithms have close performance, with the proposed LT-ADMM slightly outperforming the state of the art, that is, converging closer to a stationary point.

TABLE IV: COMPARISON OF ALGORITHMS WITHOUT VARIANCE REDUCTION.

Algorithm [Ref.]   $\|\nabla F(\bar{x}_K)\|^2$
LED [10]           $1.29 \times 10^{-3}$
K-GT [9]           $2.01 \times 10^{-3}$
LT-ADMM            $1.07 \times 10^{-3}$

B. Tuning the parameters

In this section we focus on evaluating the impact of the proposed algorithms' tunable parameters. As discussed in Section III-C.2, the step-size of LT-ADMM regulates both the speed of convergence and how close it converges to a stationary point. In Table V we therefore apply different step-sizes and evaluate both the asymptotic value of $\|\nabla F(\bar{x}_k)\|^2$ and the computation time needed for LT-ADMM to reach that value. As expected, a smaller step-size leads to a smaller asymptotic distance from the stationary point, while a larger step-size improves the speed of convergence.

TABLE V: PERFORMANCE OF LT-ADMM FOR DIFFERENT $\gamma$.

$\gamma$   $\|\nabla F(\bar{x}_K)\|^2$   Computation time
0.1        $6.01 \times 10^{-5}$         $4.80 \times 10^4$
1          $5.79 \times 10^{-4}$         $3.22 \times 10^4$
2          $1.07 \times 10^{-3}$         $2.27 \times 10^2$
3          $1.72 \times 10^{-3}$         $2.38 \times 10^2$
4          $2.68 \times 10^{-3}$         $7.9 \times 10^1$
5          $4.43 \times 10^{-3}$         $4.10 \times 10^1$

We turn now to LT-ADMM-VR and evaluate its speed of convergence for different numbers of local training epochs $\tau$. Figure 1 reports the computation time to reach $\|\nabla F(\bar{x}_k)\|^2 < 10^{-7}$. Interestingly, it appears that there is a finite optimal value ($\tau = 8$), while smaller and larger values lead to slower convergence.

[Fig. 1: Computation time for LT-ADMM-VR to reach $\|\nabla F(\bar{x}_k)\|^2 < 10^{-7}$ for different numbers of local training epochs $\tau$.]

Finally, as discussed in Section III-C.4, we can choose uncoordinated parameters in LT-ADMM-VR. In Table VI we therefore test the use of different $\tau_i$, $i \in \{1, \ldots, N\}$, for different agents. In particular, we compare the computation time required to reach $\|\nabla F(\bar{x}_k)\|^2 < 10^{-7}$ in two coordinated scenarios and an uncoordinated scenario where half of the agents are "slow" ($\tau_i = 2$) and half are "fast" ($\tau_i = 5$).

TABLE VI: COMPUTATION TIME OF LT-ADMM-VR WITH UNCOORDINATED NUMBERS OF LOCAL TRAINING EPOCHS.

$\tau_i$                                     Computation time
$2\ \forall i$                               $6.04 \times 10^5$
$2$ if $i < N/2$, $5$ if $i \geq N/2$        $5.80 \times 10^5$
$5\ \forall i$                               $3.75 \times 10^5$

Interestingly, the algorithm still converges even with uncoordinated parameters, and the presence of "fast" agents improves the performance.

V. CONCLUDING REMARKS

In this paper, we considered (non)convex distributed learning problems. In particular, to address the challenge of expensive communication, we proposed two communication-efficient algorithms, LT-ADMM and LT-ADMM-VR, that use local training. The algorithms employ SGD and SGD with variance reduction, respectively. We have shown that LT-ADMM converges to a neighborhood of a stationary point, while LT-ADMM-VR converges exactly. We have thoroughly compared our algorithms with the state of the art, both theoretically and in simulations. Future research will focus on analyzing convergence for strongly convex problems and on extending our algorithmic framework to asynchronous scenarios and to the broader class of composite problems, as in [37].

APPENDIX I
PROOF SKETCH OF THE MAIN THEOREMS

A. Proof sketch of Theorem 1

Step 1 (Lemma 1): Reformulate the algorithm into a compact linear dynamical system, in which $h_k$ contains all nonlinearities. Decompose the system into average and deviation components via a projection matrix $\hat{Q}$. Use graph connectivity to show that the linear part of the deviation system $\hat{d}_{k+1} = \Delta \hat{d}_k - \hat{h}_k$ is stable ($\|\Delta\| = 1 - \lambda_l \rho \tau \beta / 2 < 1$) when $\beta < \frac{2}{\tau \lambda_u \rho}$ is satisfied.
Step 2 (Lemmas 2, 3): Bound the error from the local training steps, i.e., the deviation of the local states $\Phi_k^t$ from the global average $\bar{X}_k$, as in (29):
$$\mathbb{E}[\|\hat{\Phi}_k\|^2] \leq \left( \frac{72 \beta \tau^2}{\lambda_l \rho} + 144 \tau^3 \beta^2 \right) \mathbb{E}[\|\hat{d}_k\|^2] + 4 N \tau^2 \gamma^2 \sigma^2 + 16 \tau^3 N \gamma^2 \mathbb{E}[\|\nabla F(\bar{x}_k)\|^2].$$
Incorporate this bound into the perturbation term $\hat{h}_k$ to derive a recursive inequality for the deviation system, as in (37):
$$\mathbb{E}[\|\hat{d}_{k+1}\|^2] \leq \left( \delta + \frac{c_0}{1-\delta} \right) \mathbb{E}[\|\hat{d}_k\|^2] + \frac{c_1}{1-\delta} \mathbb{E}\Big[\big\|\textstyle\sum_t \nabla F(\Phi_k^t)\big\|^2\Big] + \frac{c_2}{1-\delta} \mathbb{E}[\|\nabla F(\bar{x}_k)\|^2] + \frac{c_3}{1-\delta} \sigma^2,$$
where a sufficiently small $\gamma$ ensures stability.

Step 3 (Theorem 1): Apply the smoothness inequality to the averaged iterate
$$\bar{x}_{k+1} - x^* = \bar{x}_k - x^* - \frac{\gamma}{N} \sum_{t=0}^{\tau-1} \sum_{i=1}^{N} g_i(\phi_{i,k}^t)$$
to show a descent in the global objective $F(\bar{x}_k)$:
$$\mathbb{E}[F(\bar{x}_{k+1})] \leq \mathbb{E}[F(\bar{x}_k)] - \frac{\gamma\tau}{2} \mathbb{E}[\|\nabla F(\bar{x}_k)\|^2] - \frac{\gamma}{2}(1 - 2\gamma\tau L) \sum_t \mathbb{E}\Big[\big\|\tfrac{1}{N}\textstyle\sum_i \nabla f_i(\phi_{i,k}^t)\big\|^2\Big] + \frac{\gamma L^2}{2N} \mathbb{E}[\|\hat{\Phi}_k\|^2] + \gamma^2 \tau^2 L \sigma^2.$$
Sum over iterations, pick an appropriate step-size, and use the bounds obtained in Lemmas 2 and 3, leading to the convergence result.

B. Proof sketch of Theorem 2

The proof of Theorem 2 follows a structure similar to that of Theorem 1. The key distinction lies in the treatment of the gradient variance. We bound the gradient variance with $t_k$:
$$\mathbb{E}\Big[\sum_i \sum_t \|g_i(\phi_{i,k}^t) - \nabla f_i(\phi_{i,k}^t)\|^2\Big] \leq 2L^2 \|\hat{\Phi}_k\|^2 + 2L^2 \mathbb{E}[t_k],$$
using $\mathbb{E}[\|a - \mathbb{E}[a]\|^2] \leq \mathbb{E}[\|a\|^2]$ with $a = \nabla f_{i,h}(\phi_{i,k}^t) - \nabla f_{i,h}(r_{i,h,k}^t)$. We then show that $t_k$ can be bounded by the deviation $\|\hat{d}_k\|$ and the global gradient at the average state, $\nabla F(\bar{x}_k)$, as in Lemma 5; and we further bound the deviation $\|\hat{d}_k\|$ by $t_k$, as in Lemma 6. As a result, the final convergence expression no longer depends on the constant $\sigma$, and we obtain exact convergence.
APPENDIX II
PRELIMINARY ANALYSIS

In this section we summarize the step-size bounds and present preliminary results underpinning Theorems 1 and 2.

A. Step-size bounds

The step-size upper bounds for LT-ADMM and LT-ADMM-VR are, respectively:
$$\bar{\gamma}_{\mathrm{sgd}} := \min_{i=1,2,\ldots,6} \bar{\gamma}_i, \quad (10) \qquad \bar{\gamma}_{\mathrm{vr}} := \min_{i=1,7,8,\ldots,15} \bar{\gamma}_i, \quad (11)$$
where:
$$\bar{\gamma}_1 := \min\left\{1, \frac{1}{2\sqrt{2}\,L\tau}\right\}, \quad \bar{\gamma}_2 := \frac{\sqrt{3}}{8L\tau}, \quad \bar{\gamma}_3 := \frac{3}{8L\tau}, \quad \bar{\gamma}_4 := \frac{\lambda_l}{\lambda_u L \sqrt{16(1+2\rho^2\|\tilde{L}\|^2)\,\tau\,\|\hat{V}^{-1}\|^2\,\beta_0}},$$
$$\bar{\gamma}_5 := \frac{\sqrt{\lambda_l}}{4\tau\sqrt{\lambda_u}\,L\,\sqrt[4]{2\,c_4\,(1+2\rho^2\|\tilde{L}\|^2)\,\|\hat{V}^{-1}\|^2}}, \quad \bar{\gamma}_6 := \frac{\sqrt{\lambda_l}\,\beta}{2\sqrt{\tau\lambda_u}\,L\,\sqrt[4]{6\,c_4\,N\,\|\hat{V}^{-1}\|^2}}, \quad \bar{\gamma}_7 := \frac{1}{4\tau L\sqrt{3}},$$
$$\bar{\gamma}_8 := \sqrt{\frac{m_l}{32 L^2 m_u}}, \quad \bar{\gamma}_9 := \sqrt{\frac{1}{12 L^2}}, \quad \bar{\gamma}_{10} := \sqrt{\frac{m_l}{512\, m_u \tau^3 L^2}}, \quad \bar{\gamma}_{11} := \sqrt{\frac{3\tau}{8\kappa_3}}, \quad \bar{\gamma}_{12} := \frac{3}{8\tau L},$$
$$\bar{\gamma}_{13} := \frac{\lambda_l}{\lambda_u \sqrt{8\big(\kappa_0\tilde{\beta}_0 + 2(\tilde{s}_0+\tilde{s}_1)(\kappa_1 + 32\tau^2 L^2\kappa_0)\big)}}, \quad \bar{\gamma}_{14} := \sqrt[3]{\frac{\lambda_l^2\beta^2}{768\, L^2 N \lambda_u^2 \kappa_4}}, \quad \bar{\gamma}_{15} := \sqrt[3]{\frac{\lambda_l^2\tau}{128\, \lambda_u^2 \kappa_4 \kappa_2}}.$$
These bounds depend on the following quantities: $d_u = \max_{i\in\mathcal{V}} |\mathcal{N}_i|$ denotes the maximum degree of the agents; $\hat{V}$ is defined in (25); $\lambda_u$ is the largest eigenvalue of the Laplacian matrix of the graph $\mathcal{G}$, and $\lambda_l$ its smallest nonzero eigenvalue. We denote $m_u = \max_{i=1,\ldots,N} m_i$ and $m_l = \min_{i=1,\ldots,N} m_i$, where $m_i$ is the number of local data points of agent $i$.
Additionally, we have the following definitions, used both in the upper bounds above and throughout the convergence analysis:
$$\beta_0 := \frac{72\beta\tau^2}{\lambda_l\rho} + 144\tau^3\beta^2, \qquad \tilde{\beta}_0 := \frac{72\beta\tau^2}{\lambda_l\rho} + 144\tau^3\beta^2, \qquad c_4 := \frac{4L^2}{N}\left(\frac{72\beta\tau}{\lambda_l\rho} + 144\tau^2\beta^2\right),$$
$$\kappa_0 := (1+2\rho^2\|\tilde{L}\|^2)\,6\tau L^2\|\hat{V}^{-1}\|^2 + \frac{6L^2}{\beta^2}\,2\tau L^2\|\hat{V}^{-1}\|^2, \qquad \kappa_1 := (1+2\rho^2\|\tilde{L}\|^2)\,4\tau L^2\|\hat{V}^{-1}\|^2 + \frac{6L^2}{\beta^2}\,2\tau L^2\|\hat{V}^{-1}\|^2,$$
$$\kappa_2 := 16\tau^3 N\kappa_0 + 2\tilde{s}_2(\kappa_1 + 32\tau^2 L^2\kappa_0), \qquad \tilde{s}_0 := \frac{36\beta\tau^2 m_u}{\lambda_l\rho} + \frac{144\tau^2 m_u}{m_l}\beta^2, \qquad \tilde{s}_1 := \left(\frac{72\beta\tau^2}{\lambda_l\rho} + 144\tau^3\beta^2\right)\frac{8m_u\tau}{m_l},$$
$$\tilde{s}_2 := \frac{16 N m_u\tau^2}{m_l} + \frac{8m_u\tau}{m_l}\,16\tau^3 N, \qquad \kappa_3 := 16\tau^3 N\left(\frac{L^2}{2N} + 2\tau L^3\right) + 2\tilde{s}_2\left(2\tau L^3 + 32\tau^2 L^2\Big(\frac{L^2}{2N} + 2\tau L^3\Big)\right),$$
$$\kappa_4 := \left(\frac{L^2}{2N} + 2\tau L^3\right)\tilde{\beta}_0 + 2(\tilde{s}_0+\tilde{s}_1)\left(2\tau L^3 + 32\tau^2 L^2\Big(\frac{L^2}{2N} + 2\tau L^3\Big)\right),$$
where $\frac{1}{\tau\lambda_u\rho} \leq \beta < \frac{2}{\tau\lambda_u\rho}$.

B. Preliminary transformation

We start by rewriting the algorithm in a compact form. To this end, we introduce the following auxiliary variables: $Z = \mathrm{col}\{z_{ij}\}_{(i,j)\in\mathcal{E}}$, $\Phi_k^t = \mathrm{col}\{\phi_{1,k}^t, \phi_{2,k}^t, \ldots, \phi_{N,k}^t\}$, $G(\Phi_k^t) = \mathrm{col}\{g_1(\phi_{1,k}^t), g_2(\phi_{2,k}^t), \ldots, g_N(\phi_{N,k}^t)\}$, $\mathrm{F}(X) = \mathrm{col}\{f_1(x_1), f_2(x_2), \ldots, f_N(x_N)\}$, $F(x) = \frac{1}{N}\sum_{i=1}^N f_i(x)$. Define $A = \mathrm{blkdiag}\{\mathbf{1}_{d_i}\}_{i\in\mathcal{V}} \otimes I_n \in \mathbb{R}^{Mn\times Nn}$, where $d_i = |\mathcal{N}_i|$ is the degree of node $i$ and $M = \sum_i |\mathcal{N}_i|$. $P \in \mathbb{R}^{Mn\times Mn}$ is the permutation matrix that swaps $e_{ij}$ with $e_{ji}$. If there is an edge between nodes $i$ and $j$, then $A^\top[i,:]\,P\,A[:,j] = 1$; otherwise $A^\top[i,:]\,P\,A[:,j] = 0$. Therefore $A^\top P A = \tilde{A}$ is the adjacency matrix. The compact form of LT-ADMM and LT-ADMM-VR then is:
$$X_{k+1} = X_k - \sum_{t=0}^{\tau-1}\left(\gamma G(\Phi_k^t) + \beta(\rho A^\top A X_k - A^\top Z_k)\right), \quad (12a)$$
$$Z_{k+1} = \tfrac{1}{2} Z_k - \tfrac{1}{2} P Z_k + \rho P A X_{k+1}. \quad (12b)$$
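The structural identities $A^\top P A = \tilde{A}$ and $A^\top A = D$ used in this reformulation can be checked numerically on a toy graph. The sketch below is our own (with $n = 1$ and a 3-node path graph; the directed-edge ordering, grouped by origin node, is one admissible choice):

```python
import numpy as np

# Toy path graph 0-1-2; directed edges grouped by origin node,
# matching the block structure A = blkdiag{1_{d_i}} (here n = 1).
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
N, M = 3, len(edges)

A = np.zeros((M, N))
for e, (i, _) in enumerate(edges):
    A[e, i] = 1.0                       # row e belongs to origin node i

# P swaps each directed edge (i, j) with its mirror (j, i)
P = np.zeros((M, M))
for e, (i, j) in enumerate(edges):
    P[e, edges.index((j, i))] = 1.0

adjacency = A.T @ P @ A                 # claimed identity: A^T P A = adjacency
degree = A.T @ A                        # D = diag{d_i}: degree matrix
```

Running this confirms that `adjacency` is the 0/1 adjacency matrix of the path graph and `degree` is $\mathrm{diag}\{1, 2, 1\}$, as the derivation requires.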
Moreover, we introduce the following useful variables:
$$Y_k = A^\top Z_k - \frac{\gamma}{\beta}\nabla\mathrm{F}(\bar{X}_k) - \rho D X_k, \qquad \tilde{Y}_k = A^\top P Z_k + \frac{\gamma}{\beta}\nabla\mathrm{F}(\bar{X}_k) - \rho D X_k, \quad (13)$$
where $\bar{X}_k = \mathbf{1}_N \otimes \bar{x}_k$, with $\bar{x}_k = \frac{1}{N}\mathbf{1}^\top X_k$, and $D = A^\top A = \mathrm{diag}\{d_i I_n\}_{i\in\mathcal{V}}$ is the degree matrix. Multiplying both sides of (12b) by $\mathbf{1}^\top$ and using the initial condition, we obtain $\mathbf{1}^\top A^\top Z_{k+1} = \rho\mathbf{1}^\top D X_{k+1}$ for all $k \in \mathbb{N}$. As a consequence, $\bar{Y}_k = \frac{\gamma}{\beta}\mathbf{1}\otimes\frac{1}{N}\mathbf{1}^\top\nabla\mathrm{F}(\bar{X}_k) = \frac{\gamma}{\beta}\mathbf{1}\otimes\frac{1}{N}\sum_i\nabla f_i(\bar{x}_k)$, and (12) can be further rewritten as
$$\begin{bmatrix} X_{k+1} \\ Y_{k+1} \\ \tilde{Y}_{k+1} \end{bmatrix} = \left(\begin{bmatrix} I & \beta\tau I & 0 \\ \rho\tilde{L} & \rho\tilde{L}\beta\tau + \frac{1}{2}I & -\frac{1}{2}I \\ 0 & -\frac{1}{2}I & \frac{1}{2}I \end{bmatrix} \otimes I_n\right)\begin{bmatrix} X_k \\ Y_k \\ \tilde{Y}_k \end{bmatrix} - h_k, \quad (14)$$
where
$$\tilde{L} = \tilde{A} - D \quad (15)$$
and
$$h_k = \Big[\gamma\sum_{t=0}^{\tau-1}\big(G(\Phi_k^t) - \nabla\mathrm{F}(\bar{X}_k)\big);\ \gamma\rho\tilde{L}\sum_{t=0}^{\tau-1}\big(G(\Phi_k^t) - \nabla\mathrm{F}(\bar{X}_k)\big) + \frac{\gamma}{\beta}\big(\nabla\mathrm{F}(\bar{X}_{k+1}) - \nabla\mathrm{F}(\bar{X}_k)\big);\ \frac{\gamma}{\beta}\big(-\nabla\mathrm{F}(\bar{X}_{k+1}) + \nabla\mathrm{F}(\bar{X}_k)\big)\Big].$$
We remark that (14) can be interpreted as a linear dynamical system, with the nonlinearity of the gradients entering as the input $h_k$.

C. Deviation from the average

The following lemma quantifies how far the states deviate from the average, and will be used later in the proofs of Lemmas 2 and 6.

Lemma 1: Let Assumption 1 hold. When $\beta < \frac{2}{\tau\lambda_u\rho}$,
$$\|\bar{X}_k - X_k\|^2 \leq \frac{18\beta\tau}{\lambda_l\rho}\|\hat{d}_k\|^2, \qquad \|\bar{Y}_k - Y_k\|^2 \leq 9\|\hat{d}_k\|^2, \quad (16)$$
and
$$\|\hat{d}_{k+1}\|^2 \leq \delta\|\hat{d}_k\|^2 + \frac{1}{1-\delta}\|\hat{h}_k\|^2, \quad (17)$$
where $\delta = 1 - \lambda_l\rho\tau\beta/2 < 1$, $\hat{d}_k = \hat{V}^{-1}[\hat{Q}^\top X_k;\ \hat{Q}^\top Y_k;\ \hat{Q}^\top\tilde{Y}_k]$, and $\hat{Q}$, $\hat{V}^{-1}$ are the matrices used to define the deviation term $\hat{d}_k$.

Proof: By Assumption 1, the graph $\mathcal{G}$ is undirected and connected; hence its Laplacian $-\tilde{L}$ is symmetric, and it has a single zero eigenvalue with eigenvector $\mathbf{1}$, all other eigenvalues being positive. Denote by $\hat{Q}\in\mathbb{R}^{N\times(N-1)}$ the matrix satisfying $\hat{Q}\hat{Q}^\top = I_N - \frac{1}{N}\mathbf{1}\mathbf{1}^\top$, $\hat{Q}^\top\hat{Q} = I_{N-1}$, and $\mathbf{1}^\top\hat{Q} = 0$, $\hat{Q}^\top\mathbf{1} = 0$. We have that
$$\hat{Q}^\top\tilde{L} = \hat{Q}^\top\tilde{L}\left(I_N - \tfrac{1}{N}\mathbf{1}\mathbf{1}^\top\right) = \hat{Q}^\top\tilde{L}\hat{Q}\hat{Q}^\top. \quad (18)$$
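The defining properties of the projection matrix $\hat{Q}$ can be verified numerically. The construction below (an orthonormal basis of $\mathbf{1}^\perp$ obtained from the SVD of the centering projector) is one possible choice of ours, not necessarily the paper's:

```python
import numpy as np

N = 5
ones = np.ones((N, 1))
proj = np.eye(N) - ones @ ones.T / N       # I - (1/N) 1 1^T

# Columns of Qhat: an orthonormal basis of the range of proj, i.e. of 1-perp.
# The SVD of the symmetric projector has N-1 singular values equal to 1.
U, s, _ = np.linalg.svd(proj)
Qhat = U[:, :N - 1]                        # drop the nullspace direction (1)

assert np.allclose(Qhat @ Qhat.T, proj)            # Qhat Qhat^T = I - 11^T/N
assert np.allclose(Qhat.T @ Qhat, np.eye(N - 1))   # Qhat^T Qhat = I_{N-1}
assert np.allclose(ones.T @ Qhat, 0.0)             # 1^T Qhat = 0
```

Any orthonormal basis of $\mathbf{1}^\perp$ works equally well here, since the proof only uses the three properties asserted above.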
Additionally, it holds that $\|\hat{Q}^\top X_k\|^2 = X_k^\top\hat{Q}\hat{Q}^\top\hat{Q}\hat{Q}^\top X_k = \|\hat{Q}\hat{Q}^\top X_k\|^2 = \|X_k - \bar{X}_k\|^2$, and $\|\hat{Q}\| = 1$. Multiplying both sides of (14) by $\hat{Q}^\top$ and using (18) yields
$$\begin{bmatrix} \hat{Q}^\top X_{k+1} \\ \hat{Q}^\top Y_{k+1} \\ \hat{Q}^\top\tilde{Y}_{k+1} \end{bmatrix} = (\Theta\otimes I_n)\begin{bmatrix} \hat{Q}^\top X_k \\ \hat{Q}^\top Y_k \\ \hat{Q}^\top\tilde{Y}_k \end{bmatrix} - \hat{Q}^\top h_k, \quad (19)$$
where
$$\Theta = \begin{bmatrix} I & \beta\tau I & 0 \\ \rho\hat{Q}^\top\tilde{L}\hat{Q} & \rho\hat{Q}^\top\tilde{L}\hat{Q}\beta\tau + \frac{1}{2}I & -\frac{1}{2}I \\ 0 & -\frac{1}{2}I & \frac{1}{2}I \end{bmatrix}.$$
The next step is to show that $\hat{Q}^\top\tilde{L}\hat{Q}$ is negative definite. Let $x\in\mathbb{R}^{N-1}$ be an arbitrary vector; since $-\tilde{L}$ is the positive semi-definite Laplacian matrix, the quadratic form satisfies $x^\top\hat{Q}^\top\tilde{L}\hat{Q}x = (\hat{Q}x)^\top\tilde{L}\hat{Q}x \leq 0$. Moreover, if $(\hat{Q}x)^\top(\tilde{A}-D)\hat{Q}x = 0$, then $\hat{Q}x$ is parallel to $\mathbf{1}$; the properties of $\hat{Q}$ then imply that $x = \hat{Q}^\top\hat{Q}x = 0$. Therefore, for all nonzero vectors $x$ the quadratic form satisfies $x^\top\hat{Q}^\top\tilde{L}\hat{Q}x < 0$, and thus $\hat{Q}^\top\tilde{L}\hat{Q}$ is a symmetric negative-definite matrix.

We now diagonalize each block of $\Theta$ with $\phi\in\mathbb{R}^{(N-1)\times(N-1)}$:
$$\tilde{\Theta} = \mathrm{blkdiag}\{\phi,\phi,\phi\}\,\Theta\,\mathrm{blkdiag}\{\phi^\top,\phi^\top,\phi^\top\} = \begin{bmatrix} I & \beta\tau I & 0 \\ \rho\phi\hat{Q}^\top\tilde{L}\hat{Q}\phi^\top & \rho\phi\hat{Q}^\top\tilde{L}\hat{Q}\phi^\top\beta\tau + \frac{1}{2}I & -\frac{1}{2}I \\ 0 & -\frac{1}{2}I & \frac{1}{2}I \end{bmatrix}.$$
We denote $\phi\hat{Q}^\top\tilde{L}\hat{Q}\phi^\top = \mathrm{diag}\{\tilde{\lambda}_i\}_{i=2,\ldots,N}$, where $\tilde{\lambda}_i < 0$ are the eigenvalues of $\hat{Q}^\top\tilde{L}\hat{Q}$, $\tilde{\lambda}_{\min} = \lambda_{\min}(\hat{Q}^\top\tilde{L}\hat{Q})$, and $\tilde{\lambda}_{\max} = \lambda_{\max}(\hat{Q}^\top\tilde{L}\hat{Q})$. Note that $|\tilde{\lambda}_{\max}|$ and $|\tilde{\lambda}_{\min}|$ are, respectively, the smallest nonzero eigenvalue and the largest eigenvalue of the Laplacian matrix of the graph $\mathcal{G}$. In the following, we denote $\lambda_l = |\tilde{\lambda}_{\max}|$ and $\lambda_u = |\tilde{\lambda}_{\min}|$. Since each block of $\tilde{\Theta}$ is a diagonal matrix, there exists a permutation matrix $P_0$ such that $P_0\tilde{\Theta}P_0^\top = P_0\phi\Theta\phi^\top P_0^\top = \mathrm{blkdiag}\{D_i\}_{i=2}^N$, where
$$D_i = \begin{bmatrix} 1 & \beta\tau & 0 \\ \rho\tilde{\lambda}_i & \rho\tilde{\lambda}_i\beta\tau + 0.5 & -0.5 \\ 0 & -0.5 & 0.5 \end{bmatrix}. \quad (20)$$
We diagonalize $D_i = V_i\Delta_i V_i^{-1}$, where $\Delta_i$ is the diagonal matrix of $D_i$'s eigenvalues, and
$$V_i = \begin{bmatrix} -\beta\tau & d_{12} & d_{13} \\ 1 & d_{22} & d_{23} \\ 1 & 1 & 1 \end{bmatrix}, \quad (21)$$
with $d_{12} = -\beta\tau + \sqrt{\beta\tilde{\lambda}_i\rho\tau(\beta\tilde{\lambda}_i\rho\tau+2)}/(\tilde{\lambda}_i\rho)$, $d_{13} = -\beta\tau - \sqrt{\beta\tilde{\lambda}_i\rho\tau(\beta\tilde{\lambda}_i\rho\tau+2)}/(\tilde{\lambda}_i\rho)$, $d_{22} = \tilde{\lambda}_i\rho d_{12} - 1$, $d_{23} = \tilde{\lambda}_i\rho d_{13} - 1$. The nonzero eigenvalues $\lambda$ of $D_i$, $i = 2,\ldots,N$, satisfy $2\lambda^2 + (-2\tilde{\lambda}_i\rho\tau\beta - 4)\lambda + \tilde{\lambda}_i\rho\tau\beta + 2 = 0$, which can be written in the form
$$2\lambda^2 - 2t\lambda + t = 0, \quad (22)$$
where $t = \tilde{\lambda}_i\rho\tau\beta + 2$. The modulus of the roots of (22) is $1 - |\tilde{\lambda}|\rho\tau\beta/2$ when $-2 < \tilde{\lambda}\rho\tau\beta < 0$. We conclude that we can write
$$\Theta = (P_0\phi)^\top V\Delta V^{-1}(P_0\phi), \quad \text{where } V = \mathrm{blkdiag}\{V_i\}_{i=2}^N \text{ and } \Delta = \mathrm{blkdiag}\{\Delta_i\}_{i=2}^N. \quad (23)$$
Moreover,
$$\|\Delta\| = 1 - \lambda_l\rho\tau\beta/2 \quad \text{when } \lambda_u\rho\tau\beta < 2. \quad (24)$$
Then, left-multiplying both sides of (19) by the inverse of
$$\hat{V} = (P_0\phi)^\top V, \quad (25)$$
which is given by $\hat{V}^{-1} = V^{-1}(P_0\phi)$, yields
$$\hat{d}_{k+1} = \Delta\hat{d}_k - \hat{h}_k, \quad (26)$$
where $\hat{d}_k = \hat{V}^{-1}[\hat{Q}^\top X_k;\ \hat{Q}^\top Y_k;\ \hat{Q}^\top\tilde{Y}_k]$, $\hat{h}_k = \hat{V}^{-1}\hat{Q}^\top h_k$, and $[\hat{Q}^\top X_k;\ \hat{Q}^\top Y_k;\ \hat{Q}^\top\tilde{Y}_k] = \hat{V}\hat{d}_k = \phi^\top P_0^\top V\hat{d}_k = \phi^\top P_0^\top V P_0 P_0^\top\hat{d}_k$. As a consequence, from (19) it holds that
$$\begin{bmatrix} \hat{Q}^\top X_k \\ \hat{Q}^\top Y_k \\ \hat{Q}^\top\tilde{Y}_k \end{bmatrix} = \phi^\top\begin{bmatrix} -\beta\tau I & d_{12}I & d_{13}I \\ I & d_{22}I & d_{23}I \\ I & I & I \end{bmatrix} P_0^\top\hat{d}_k = \phi^\top\begin{bmatrix} -\beta\tau I\,P_0^\top[1] + d_{12}P_0^\top[2] + d_{13}P_0^\top[3] \\ P_0^\top[1] + d_{22}P_0^\top[2] + d_{23}P_0^\top[3] \\ P_0^\top[1] + P_0^\top[2] + P_0^\top[3] \end{bmatrix}\hat{d}_k,$$
where $P_0^\top[1]$, $P_0^\top[2]$, $P_0^\top[3]$ are the top, middle, and bottom blocks of $P_0^\top$, respectively. Moreover, we have $|d_{12}|^2 = |d_{13}|^2 \leq \frac{2\beta\tau}{\lambda_l\rho}$ and $|d_{22}| = |d_{23}| = 1$.
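The eigenvalue claim for a single block $D_i$ can be checked numerically. The sketch below uses hypothetical parameter values of ours, chosen to satisfy $\beta < 2/(\tau|\tilde{\lambda}_i|\rho)$:

```python
import numpy as np

# Hypothetical parameters with beta < 2 / (tau * |lam| * rho)
lam, rho, tau, beta = -1.5, 1.0, 2.0, 0.3
D_i = np.array([
    [1.0,       beta * tau,                   0.0],
    [lam * rho, lam * rho * beta * tau + 0.5, -0.5],
    [0.0,       -0.5,                          0.5],
])

eigs = np.linalg.eigvals(D_i)
# D_i is singular (one zero eigenvalue). Its two nonzero eigenvalues solve
# 2*l^2 - 2*t*l + t = 0 with t = lam*rho*tau*beta + 2; for 0 < t < 2 they
# form a complex pair with squared modulus t/2 < 1, so the block is stable.
t = lam * rho * tau * beta + 2.0
nonzero_moduli = sorted(abs(e) for e in eigs)[1:]
```

With these values $t = 1.1$, so both nonzero eigenvalues have modulus $\sqrt{t/2} \approx 0.742 < 1$, consistent with the stability requirement on $\Delta$.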
Now, if we let $\beta\tau \leq \frac{2}{\lambda_l\rho}$, and using $\|\phi\| = 1$ and $\|P_0^\top[i]\| = 1$, $i = 1,2,3$, we derive that
$$\|\bar{X}_k - X_k\|^2 = \|\hat{Q}^\top X_k\|^2 = \big\|\phi^\top(-\beta\tau I\,P_0^\top[1] + d_{12}P_0^\top[2] + d_{13}P_0^\top[3])\hat{d}_k\big\|^2 \leq 3\left(\beta^2\tau^2 + \frac{4\beta\tau}{\lambda_l\rho}\right)\|\hat{d}_k\|^2 \leq \frac{18\beta\tau}{\lambda_l\rho}\|\hat{d}_k\|^2.$$
Applying the same manipulations to $\|\bar{Y}_k - Y_k\|^2$, we obtain that (16) holds.

Denote now $\|\hat{\Phi}_k\|^2 = \sum_{i=1}^N\sum_{t=0}^{\tau-1}\|\phi_{i,k}^t - \bar{x}_k\|^2 = \sum_{t=0}^{\tau-1}\|\Phi_k^t - \bar{X}_k\|^2$. Using Assumption 2 we derive that
$$\Big\|\sum_{t=0}^{\tau-1}\big(G(\Phi_k^t) - \nabla\mathrm{F}(\bar{X}_k)\big)\Big\|^2 \leq 2\tau L^2\|\hat{\Phi}_k\|^2 + 2\tau\sum_{t=0}^{\tau-1}\|G(\Phi_k^t) - \nabla\mathrm{F}(\Phi_k^t)\|^2.$$
Denoting $\overline{G}(\Phi_k^t) = \frac{1}{N}\sum_{i=1}^N g_i(\phi_{i,k}^t)$ and $\overline{\nabla\mathrm{F}}(\Phi_k^t) = \frac{1}{N}\sum_{i=1}^N\nabla f_i(\phi_{i,k}^t)$, we have
$$\Big\|\sum_{t=0}^{\tau-1}\overline{G}(\Phi_k^t)\Big\|^2 = \Big\|\frac{1}{N}\sum_i\sum_t\big(\nabla f_i(\phi_{i,k}^t) + g_i(\phi_{i,k}^t) - \nabla f_i(\phi_{i,k}^t)\big)\Big\|^2 \leq 2\Big\|\sum_{t=0}^{\tau-1}\overline{\nabla\mathrm{F}}(\Phi_k^t)\Big\|^2 + \frac{2\tau}{N}\sum_i\sum_t\|g_i(\phi_{i,k}^t) - \nabla f_i(\phi_{i,k}^t)\|^2. \quad (27)$$
We also have $\|\nabla\mathrm{F}(\bar{X}_{k+1}) - \nabla\mathrm{F}(\bar{X}_k)\|^2 \leq NL^2\|\bar{x}_{k+1} - \bar{x}_k\|^2 = NL^2\gamma^2\|\sum_t\overline{G}(\Phi_k^t)\|^2$; it further holds that
$$\|h_k\|^2 \leq \gamma^2(1+2\rho^2\|\tilde{L}\|^2)\Big(2\tau L^2\|\hat{\Phi}_k\|^2 + 2\tau\sum_{t=0}^{\tau-1}\|G(\Phi_k^t) - \nabla\mathrm{F}(\Phi_k^t)\|^2\Big) + \frac{6L^2\gamma^4}{\beta^2}\Big(\tau\sum_i\sum_t\|g_i(\phi_{i,k}^t) - \nabla f_i(\phi_{i,k}^t)\|^2 + N\Big\|\sum_{t=0}^{\tau-1}\overline{\nabla\mathrm{F}}(\Phi_k^t)\Big\|^2\Big). \quad (28)$$
Recalling (26) and using Jensen's inequality, $\|\hat{d}_{k+1}\|^2 \leq \frac{1}{\|\Delta\|}\|\Delta\|^2\|\hat{d}_k\|^2 + \frac{1}{1-\|\Delta\|}\|\hat{h}_k\|^2$, which yields (17).

APPENDIX III
CONVERGENCE ANALYSIS FOR LT-ADMM

A. Key bounds

Lemma 2: Let Assumptions 1, 2, and 4 hold. When $\beta < \frac{2}{\tau\lambda_u\rho}$ and $\gamma \leq \bar{\gamma}_1$, we have
$$\mathbb{E}[\|\hat{\Phi}_k\|^2] \leq \left(\frac{72\beta\tau^2}{\lambda_l\rho} + 144\tau^3\beta^2\right)\mathbb{E}[\|\hat{d}_k\|^2] + 4N\tau^2\gamma^2\sigma^2 + 16\tau^3 N\gamma^2\mathbb{E}[\|\nabla F(\bar{x}_k)\|^2]. \quad (29)$$
Proof: From (12) we can derive that
$$\bar{x}_{k+1} - x^* = \bar{x}_k - x^* - \frac{\gamma}{N}\sum_{t=0}^{\tau-1}\sum_{i=1}^N g_i(\phi_{i,k}^t) \quad (30)$$
and
$$\Phi_k^{t+1} = \Phi_k^t + \beta Y_k - \gamma\big(G(\Phi_k^t) - \nabla\mathrm{F}(\bar{X}_k)\big). \quad (31)$$
Recall that, by Assumption 4, $\|G(\Phi_k^t) - \nabla\mathrm{F}(\Phi_k^t)\|^2 \leq N\sigma^2$. Now, suppose that $\tau \geq 2$; using Jensen's inequality we obtain
$$\mathbb{E}[\|\Phi_k^{t+1} - \bar{X}_k\|^2] = \mathbb{E}\big[\|\Phi_k^t - \bar{X}_k + \beta Y_k - \gamma(\nabla\mathrm{F}(\Phi_k^t) - \nabla\mathrm{F}(\bar{X}_k))\|^2\big] + N\gamma^2\sigma^2$$
$$\leq \left(1 + \frac{1}{\tau-1}\right)\mathbb{E}[\|\Phi_k^t - \bar{X}_k\|^2] + N\gamma^2\sigma^2 + \tau\mathbb{E}\big[\|\beta Y_k - \gamma(\nabla\mathrm{F}(\Phi_k^t) - \nabla\mathrm{F}(\bar{X}_k))\|^2\big] \quad (32)$$
$$\leq \left(1 + \frac{1}{\tau-1} + 2\gamma^2\tau L^2\right)\mathbb{E}[\|\Phi_k^t - \bar{X}_k\|^2] + 2\tau\beta^2\mathbb{E}[\|Y_k\|^2] + \gamma^2 N\sigma^2 \quad (33)$$
$$\leq \left(1 + \frac{5/4}{\tau-1}\right)\mathbb{E}[\|\Phi_k^t - \bar{X}_k\|^2] + \gamma^2 N\sigma^2 + 2\tau\beta^2\mathbb{E}[\|Y_k\|^2], \quad (34)$$
where the last inequality holds when
$$2\gamma^2\tau L^2 \leq \frac{1/4}{\tau-1}, \quad (35)$$
which is satisfied by $\gamma \leq \bar{\gamma}_1$. Iterating the above inequality for $t = 0,\ldots,\tau-1$,
$$\mathbb{E}[\|\Phi_k^{t+1} - \bar{X}_k\|^2] \leq \left(1 + \frac{5/4}{\tau-1}\right)^t\mathbb{E}[\|X_k - \bar{X}_k\|^2] + 2\tau\beta^2\sum_{l=0}^t\left(1 + \frac{5/4}{\tau-1}\right)^l\mathbb{E}[\|Y_k - \bar{Y}_k + \bar{Y}_k\|^2] + N\gamma^2\sigma^2\sum_{l=0}^t\left(1 + \frac{5/4}{\tau-1}\right)^l$$
$$\leq 4\mathbb{E}[\|X_k - \bar{X}_k\|^2] + 4\tau N\gamma^2\sigma^2 + 8\tau^2\beta^2\mathbb{E}[\|Y_k - \bar{Y}_k + \bar{Y}_k\|^2],$$
where the last inequality holds by $(1 + \frac{a}{\tau-1})^t \leq \exp(\frac{at}{\tau-1}) \leq \exp(a)$ for $t \leq \tau-1$ and $a = 5/4$. Summing over $t$, it follows that
$$\mathbb{E}[\|\hat{\Phi}_k\|^2] \leq 4\tau\mathbb{E}[\|X_k - \bar{X}_k\|^2] + 4N\tau^2\gamma^2\sigma^2 + 16\tau^3\beta^2\mathbb{E}[\|Y_k - \bar{Y}_k\|^2] + 16\tau^3 N\gamma^2\mathbb{E}[\|\nabla F(\bar{x}_k)\|^2]; \quad (36)$$
moreover, it is easy to verify that (36) also holds for $\tau = 1$. Using (16) concludes the proof.

Lemma 3: Let Assumptions 1, 2, and 4 hold.
When $\beta < \frac{2}{\tau\lambda_u\rho}$ and $\gamma \leq \bar{\gamma}_1$,
$$\mathbb{E}[\|\hat{d}_{k+1}\|^2] \leq \left(\delta + \frac{c_0}{1-\delta}\right)\mathbb{E}[\|\hat{d}_k\|^2] + \frac{c_1}{1-\delta}\mathbb{E}\Big[\big\|\textstyle\sum_t\nabla F(\Phi_k^t)\big\|^2\Big] + \frac{c_2}{1-\delta}\mathbb{E}[\|\nabla F(\bar{x}_k)\|^2] + \frac{c_3}{1-\delta}\sigma^2, \quad (37)$$
where
$$\delta := 1 - \frac{\lambda_l\rho\tau\beta}{2}, \qquad \beta_0 := \frac{72\beta\tau^2}{\lambda_l\rho} + 144\tau^3\beta^2, \qquad c_0 := \gamma^2(1+2\rho^2\|\tilde{L}\|^2)\,2\tau L^2\|\hat{V}^{-1}\|^2\beta_0, \qquad c_1 := \frac{6L^2\gamma^4}{\beta^2}N\|\hat{V}^{-1}\|^2,$$
$$c_2 := \gamma^4(1+2\rho^2\|\tilde{L}\|^2)\,32\tau^4 L^2\|\hat{V}^{-1}\|^2, \qquad c_3 := \gamma^4(1+2\rho^2\|\tilde{L}\|^2)\,8L^2 N\tau^3\|\hat{V}^{-1}\|^2 + \gamma^2(1+2\rho^2\|\tilde{L}\|^2)\,2\tau^2 N\|\hat{V}^{-1}\|^2 + \frac{6L^2\gamma^4}{\beta^2}N\tau^2\|\hat{V}^{-1}\|^2.$$

Proof: When $\beta < \frac{2}{\tau\lambda_u\rho}$ and $\gamma \leq \bar{\gamma}_1$, using (28), (29), and Assumption 4, we have
$$\|\hat{h}_k\|^2 \leq c_0\|\hat{d}_k\|^2 + c_1\Big\|\sum_t\nabla F(\Phi_k^t)\Big\|^2 + c_2\mathbb{E}[\|\nabla F(\bar{x}_k)\|^2] + c_3\sigma^2;$$
together with (17), we can then derive that (37) holds.

B. Theorem 1

We start our proof by recalling that the following inequality holds for every $L$-smooth function $f$ and all $y, z \in \mathbb{R}^n$ [38]:
$$f(y) \leq f(z) + \langle\nabla f(z), y - z\rangle + \frac{L}{2}\|y - z\|^2. \quad (38)$$
Based on (30), substituting $y = \bar{x}_{k+1}$ and $z = \bar{x}_k$ into (38) and using Assumption 4, we get
$$\mathbb{E}[F(\bar{x}_{k+1})] \leq \mathbb{E}[F(\bar{x}_k)] - \gamma\mathbb{E}\Big[\Big\langle\nabla F(\bar{x}_k), \frac{1}{N}\sum_t\sum_i\nabla f_i(\phi_{i,k}^t)\Big\rangle\Big] + \frac{\gamma^2 L}{2}\mathbb{E}\Big[\Big\|\frac{1}{N}\sum_t\sum_i g_i(\phi_{i,k}^t)\Big\|^2\Big]$$
$$\leq \mathbb{E}[F(\bar{x}_k)] - \gamma\mathbb{E}\Big[\Big\langle\nabla F(\bar{x}_k), \frac{1}{N}\sum_t\sum_i\nabla f_i(\phi_{i,k}^t)\Big\rangle\Big] + \gamma^2\tau L\,\mathbb{E}\Big[\sum_t\Big\|\frac{1}{N}\sum_i\nabla f_i(\phi_{i,k}^t)\Big\|^2\Big] + \gamma^2\tau^2 L\sigma^2.$$
Using now $2\langle a,b\rangle = \|a\|^2 + \|b\|^2 - \|a-b\|^2$, we have
$$-\Big\langle\nabla F(\bar{x}_k), \frac{1}{N}\sum_t\sum_i\nabla f_i(\phi_{i,k}^t)\Big\rangle = -\frac{\tau}{2}\|\nabla F(\bar{x}_k)\|^2 - \frac{1}{2}\sum_t\Big\|\frac{1}{N}\sum_i\nabla f_i(\phi_{i,k}^t)\Big\|^2 + \frac{1}{2}\sum_t\Big\|\frac{1}{N}\sum_i\nabla f_i(\phi_{i,k}^t) - \nabla F(\bar{x}_k)\Big\|^2$$
$$\leq -\frac{\tau}{2}\|\nabla F(\bar{x}_k)\|^2 - \frac{1}{2}\sum_t\Big\|\frac{1}{N}\sum_i\nabla f_i(\phi_{i,k}^t)\Big\|^2 + \frac{L^2}{2N}\|\hat{\Phi}_k\|^2.$$
Combining the two bounds above and using (16) yields
$$\mathbb{E}[F(\bar{x}_{k+1})] \leq \mathbb{E}[F(\bar{x}_k)] - \frac{\gamma\tau}{2}\mathbb{E}[\|\nabla F(\bar{x}_k)\|^2] - \frac{\gamma}{2}(1 - 2\gamma\tau L)\sum_t\mathbb{E}\Big[\Big\|\frac{1}{N}\sum_i\nabla f_i(\phi_{i,k}^t)\Big\|^2\Big] + \frac{\gamma L^2}{2N}\mathbb{E}[\|\hat{\Phi}_k\|^2] + \gamma^2\tau^2 L\sigma^2.$$
Substituting (29) into the above inequality yields
$$\mathbb{E}[F(\bar{x}_{k+1})] \leq \mathbb{E}[F(\bar{x}_k)] - \frac{\gamma\tau}{2}\big(1 - 16L^2\tau^2\gamma^2\big)\mathbb{E}[\|\nabla F(\bar{x}_k)\|^2] - \frac{\gamma}{2}(1 - 2\gamma L\tau)\sum_t\mathbb{E}\Big[\Big\|\frac{1}{N}\sum_i\nabla f_i(\phi_{i,k}^t)\Big\|^2\Big] + \frac{\gamma L^2}{2N}\left(\frac{72\beta\tau^2}{\lambda_l\rho} + 144\tau^3\beta^2\right)\mathbb{E}[\|\hat{d}_k\|^2] + \gamma^2\tau^2 L\sigma^2 + 2\tau^2\gamma^3\sigma^2 L^2.$$
When $\gamma \leq \min\{\bar{\gamma}_2, \bar{\gamma}_3\}$, then
$$16L^2\tau^2\gamma^2 \leq \frac{3}{4}, \qquad 2\gamma L\tau \leq \frac{3}{4}, \quad (39)$$
and we can upper bound the previous inequality by
$$\mathbb{E}[F(\bar{x}_{k+1})] \leq \mathbb{E}[F(\bar{x}_k)] - \frac{\gamma\tau}{8}\mathbb{E}[\|\nabla F(\bar{x}_k)\|^2] - \frac{\gamma}{8}\sum_t\mathbb{E}\Big[\Big\|\frac{1}{N}\sum_i\nabla f_i(\phi_{i,k}^t)\Big\|^2\Big] + \frac{\gamma L^2}{2N}\left(\frac{72\beta\tau^2}{\lambda_l\rho} + 144\tau^3\beta^2\right)\mathbb{E}[\|\hat{d}_k\|^2] + \gamma^2\tau^2 L\sigma^2 + 2\tau^2\gamma^3\sigma^2 L^2.$$
Rearranging the above relation, we get
$$D_k \leq \frac{8}{\gamma\tau}\mathbb{E}\big[\tilde{F}(\bar{x}_k) - \tilde{F}(\bar{x}_{k+1})\big] + \frac{8}{\gamma\tau}\frac{\gamma L^2}{2N}\left(\frac{72\beta\tau^2}{\lambda_l\rho} + 144\tau^3\beta^2\right)\mathbb{E}[\|\hat{d}_k\|^2] + 8\gamma\tau L\sigma^2 + 16\tau\gamma^2\sigma^2 L^2,$$
where $D_k$ is defined in (7) and $\tilde{F}(\bar{x}_k) = F(\bar{x}_k) - F(x^*)$. Summing over $k = 0, 1, \ldots, K-1$ and using $-\tilde{F}(\bar{x}_k) \leq 0$, it holds that
$$\sum_{k=0}^{K-1}D_k \leq \frac{8\tilde{F}(\bar{x}_0)}{\gamma\tau} + c_4\sum_{k=0}^{K-1}\mathbb{E}[\|\hat{d}_k\|^2] + K c_5\sigma^2, \quad (40)$$
where
$$c_4 := \frac{4L^2}{N}\left(\frac{72\beta\tau}{\lambda_l\rho} + 144\tau^2\beta^2\right), \qquad c_5 := 8\gamma\tau L + 16\tau\gamma^2 L^2. \quad (41)$$
We now bound the term $\sum_{k=0}^{K-1}\|\hat{d}_k\|^2$. From (37), we have
$$\mathbb{E}[\|\hat{d}_{k+1}\|^2] \leq \left(\delta + \frac{c_0}{1-\delta}\right)\mathbb{E}[\|\hat{d}_k\|^2] + \frac{c_1\tau}{1-\delta}\sum_t\mathbb{E}\Big[\Big\|\frac{1}{N}\sum_i\nabla f_i(\phi_{i,k}^t)\Big\|^2\Big] + \frac{c_2}{1-\delta}\mathbb{E}[\|\nabla F(\bar{x}_k)\|^2] + \frac{c_3}{1-\delta}\sigma^2 \leq \bar{\delta}\mathbb{E}[\|\hat{d}_k\|^2] + \frac{c_3}{1-\delta}\sigma^2 + R D_k, \quad (42)$$
where
$$R := \max\left\{\frac{c_2}{1-\delta}, \frac{c_1\tau^2}{1-\delta}\right\}. \quad (43)$$
Moreover, letting $\gamma \leq \bar{\gamma}_4$ and $\frac{1}{\tau\lambda_u\rho} \leq \beta < \frac{2}{\tau\lambda_u\rho}$, we have
$$\bar{\delta} = \delta + \frac{c_0}{1-\delta} < 1 - \frac{\lambda_l}{4\lambda_u}. \quad (44)$$
Iterating (42) now gives
$$\mathbb{E}[\|\hat{d}_k\|^2] \leq \bar{\delta}^k\mathbb{E}[\|\hat{d}_0\|^2] + R\sum_{\ell=0}^{k-1}\bar{\delta}^{k-1-\ell}D_\ell + \frac{c_3\sigma^2}{1-\bar{\delta}},$$
and summing this inequality over $k = 0, \ldots, K-1$, it follows that
$$\sum_{k=0}^{K-1}\mathbb{E}[\|\hat{d}_k\|^2] \leq \frac{\|\hat{d}_0\|^2}{1-\bar{\delta}} + \frac{R}{1-\bar{\delta}}\sum_{k=0}^{K-1}D_k + \frac{c_3\sigma^2 K}{1-\bar{\delta}}. \quad (45)$$
Substituting (45) into (40) and rearranging, we obtain
$$(1 - q_0)\sum_{k=0}^{K-1}D_k \leq \frac{8\tilde{F}(\bar{x}_0)}{\gamma\tau} + q_1\|\hat{d}_0\|^2 + K q_2\sigma^2,$$
where
$$q_0 := \frac{c_4 R}{1-\bar{\delta}}, \qquad q_1 := \frac{c_4}{1-\bar{\delta}}, \qquad q_2 := \frac{c_4 c_3}{1-\bar{\delta}} + c_5. \quad (46)$$
Since $1-\bar{\delta} \geq \frac{\lambda_l}{4\lambda_u}$ and $1-\delta \geq \frac{\lambda_l}{2\lambda_u}$, when $\gamma \leq \min\{1, \bar{\gamma}_5, \bar{\gamma}_6\}$ we have
$$q_0 \leq \frac{1}{2}, \quad (47)$$
and it follows that
$$\frac{1}{K}\sum_{k=0}^{K-1}D_k \leq \frac{16\tilde{F}(\bar{x}_0)}{\gamma\tau K} + \frac{2q_1}{K}\|\hat{d}_0\|^2 + 2q_2\sigma^2. \quad (48)$$
Collecting all step-size conditions: if the step-size satisfies $\gamma < \bar{\gamma}_{\mathrm{sgd}} := \min_{i=1,2,\ldots,6}\bar{\gamma}_i$, then (48) holds and the states $\{X_k\}$ generated by LT-ADMM converge to a neighborhood of a stationary point, concluding the proof.

APPENDIX IV
CONVERGENCE ANALYSIS FOR LT-ADMM-VR

A. Key bounds

We start by deriving an upper bound for the variance of the gradient estimator, $\mathbb{E}[\|g_i(\phi_{i,k}^t) - \nabla f_i(\phi_{i,k}^t)\|^2]$. Define $t_{i,k}^t$ as the averaged consensus gap of the auxiliary variables $\{r_{i,h,k}^t\}_{h=1}^{m_i}$ at node $i$:
$$t_{i,k}^t = \frac{1}{m_i}\sum_{h=1}^{m_i}\|r_{i,h,k}^t - \bar{x}_k\|^2, \qquad t_k^t = \sum_{i=1}^N t_{i,k}^t, \qquad t_k = \sum_{t=0}^{\tau-1}t_k^t = \sum_{t=0}^{\tau-1}\sum_{i=1}^N t_{i,k}^t.$$
By the update of $g_i(\phi_{i,k}^t)$ in LT-ADMM-VR,
$$\mathbb{E}[\|g_i(\phi_{i,k}^t) - \nabla f_i(\phi_{i,k}^t)\|^2] = \mathbb{E}\Big[\Big\|\frac{1}{|\mathcal{B}_i|}\sum_{h\in\mathcal{B}_i}\nabla f_{i,h}(\phi_{i,k}^t) - \frac{1}{|\mathcal{B}_i|}\sum_{h\in\mathcal{B}_i}\nabla f_{i,h}(r_{i,h,k}^t) - \Big(\nabla f_i(\phi_{i,k}^t) - \frac{1}{m_i}\sum_{h=1}^{m_i}\nabla f_{i,h}(r_{i,h,k}^t)\Big)\Big\|^2\Big]$$
$$\leq \mathbb{E}\Big[\Big\|\frac{1}{|\mathcal{B}_i|}\sum_{h\in\mathcal{B}_i}\nabla f_{i,h}(\phi_{i,k}^t) - \frac{1}{|\mathcal{B}_i|}\sum_{h\in\mathcal{B}_i}\nabla f_{i,h}(r_{i,h,k}^t)\Big\|^2\Big] \leq \frac{1}{|\mathcal{B}_i|}\sum_{h\in\mathcal{B}_i}\mathbb{E}\big[\|\nabla f_{i,h}(\phi_{i,k}^t) - \nabla f_{i,h}(r_{i,h,k}^t)\|^2\big] \leq 2L^2\|\phi_{i,k}^t - \bar{x}_k\|^2 + 2L^2\mathbb{E}[t_{i,k}^t],$$
where in the first inequality we use $\mathbb{E}[\|a - \mathbb{E}[a]\|^2] \leq \mathbb{E}[\|a\|^2]$ with $a = \nabla f_{i,h}(\phi_{i,k}^t) - \nabla f_{i,h}(r_{i,h,k}^t)$, in the second inequality we use Jensen's inequality, and in the last we use the smoothness of the costs.
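The gradient estimator whose variance is bounded above combines a minibatch difference with the stored gradient memory. A minimal sketch (with our own naming; `grad_fn(h, x)` plays the role of $\nabla f_{i,h}(x)$ and `memory[h]` stores the point $r_{i,h,k}^t$):

```python
def vr_gradient(phi, memory, grad_fn, batch):
    """SAGA-style estimator of the form analyzed above:
      g_i(phi) = (1/|B|) sum_{h in B} [grad f_{i,h}(phi) - grad f_{i,h}(r_h)]
                 + (1/m_i) sum_{h=1}^{m_i} grad f_{i,h}(r_h)."""
    m = len(memory)
    mem_avg = sum(grad_fn(h, memory[h]) for h in range(m)) / m
    corr = sum(grad_fn(h, phi) - grad_fn(h, memory[h]) for h in batch) / len(batch)
    return corr + mem_avg
```

A quick sanity check of the construction: when `batch` contains all indices, the memory terms cancel and the estimator reduces to the exact local gradient, which is the unbiasedness property the variance bound relies on.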
As a consequence, we have
$$\mathbb{E}\Big[\sum_i\sum_t\|g_i(\phi_{i,k}^t) - \nabla f_i(\phi_{i,k}^t)\|^2\Big] \leq 2L^2\|\hat{\Phi}_k\|^2 + 2L^2\mathbb{E}[t_k]. \quad (49)$$

Lemma 4: Let Assumptions 1 and 2 hold. When $\beta\tau \leq \frac{2}{\lambda_u\rho}$ and $\gamma \leq \bar{\gamma}_7$, we have
$$\mathbb{E}[\|\hat{\Phi}_k\|^2] \leq \left(\frac{72\beta\tau^2}{\lambda_l\rho} + 144\tau^3\beta^2\right)\mathbb{E}[\|\hat{d}_k\|^2] + 16\tau^3\gamma^2 N\mathbb{E}[\|\nabla F(\bar{x}_k)\|^2] + 32\tau^2\gamma^2 L^2\mathbb{E}[t_k]. \quad (50)$$

Proof: Suppose that $\tau \geq 2$. Using (31) and (49), we have
$$\mathbb{E}[\|\Phi_k^{t+1} - \bar{X}_k\|^2] = \mathbb{E}\big[\|\Phi_k^t - \bar{X}_k + \beta Y_k - \gamma(G(\Phi_k^t) - \nabla\mathrm{F}(\bar{X}_k))\|^2\big]$$
$$\leq \left(1 + \frac{1}{\tau-1}\right)\mathbb{E}[\|\Phi_k^t - \bar{X}_k\|^2] + \tau\mathbb{E}\big[\|\beta Y_k - \gamma(G(\Phi_k^t) - \nabla\mathrm{F}(\bar{X}_k))\|^2\big]$$
$$\leq \left(1 + \frac{1}{\tau-1} + 4\gamma^2\tau(2L^2 + L^2)\right)\mathbb{E}[\|\Phi_k^t - \bar{X}_k\|^2] + 2\tau\beta^2\mathbb{E}[\|Y_k\|^2] + 8\tau\gamma^2 L^2\mathbb{E}[t_k^t]$$
$$\leq \left(1 + \frac{5/4}{\tau-1}\right)\mathbb{E}[\|\Phi_k^t - \bar{X}_k\|^2] + 8\tau\gamma^2 L^2\mathbb{E}[t_k^t] + 2\tau\beta^2\mathbb{E}[\|Y_k\|^2], \quad (51)$$
where the last inequality holds when
$$4\gamma^2\tau(2L^2 + L^2) \leq \frac{1/4}{\tau-1}, \quad (52)$$
which is satisfied when $\gamma \leq \bar{\gamma}_7$. Proceeding as in Lemma 2, we can then derive that (50) holds, which concludes the proof.

The following lemma provides the bound on $t_k$.

Lemma 5: Let $\{t_k\}$ be generated by LT-ADMM-VR. If $\beta\tau \leq \frac{2}{\lambda_l\rho}$ and $\gamma \leq \min\{\bar{\gamma}_8, \bar{\gamma}_9, \bar{\gamma}_{10}\}$, then for all $k \in \mathbb{N}$:
$$\mathbb{E}[t_k] \leq 2(s_0 + s_1)\mathbb{E}[\|\hat{d}_k\|^2] + 2s_2\mathbb{E}[\|\nabla F(\bar{x}_k)\|^2], \quad (53)$$
where
$$s_0 = \frac{36\beta\tau^2 m_u}{\lambda_l\rho} + \frac{144\tau^2 m_u}{m_l}\beta^2, \qquad s_1 = \left(\frac{72\beta\tau^2}{\lambda_l\rho} + 144\tau^3\beta^2\right)\frac{8m_u\tau}{m_l}, \qquad s_2 = \frac{16N\gamma^2 m_u\tau^2}{m_l} + \frac{8m_u\tau}{m_l}\,16\tau^3\gamma^2 N. \quad (54)$$

Proof: From Algorithm 1, for all $k$, $r_{i,h,k}^{t+1} = r_{i,h,k}^t$ with probability $1 - \frac{1}{m_i}$ and $r_{i,h,k}^{t+1} = \phi_{i,k}^{t+1}$ with probability $\frac{1}{m_i}$. Therefore,
$$\mathbb{E}[t_{i,k}^{t+1}] = \frac{1}{m_i}\sum_{h=1}^{m_i}\mathbb{E}[\|r_{i,h,k}^{t+1} - \bar{x}_k\|^2] = \left(1 - \frac{1}{m_i}\right)\frac{1}{m_i}\sum_{h=1}^{m_i}\mathbb{E}\big[\|r_{i,h,k}^t - \bar{x}_k\|^2\big] + \frac{1}{m_i}\mathbb{E}[\|\phi_{i,k}^{t+1} - \bar{x}_k\|^2].$$
Denote $q_k^t = \beta Y_k - \gamma(G(\Phi_k^t) - \nabla\mathrm{F}(\bar{X}_k))$. We have
$$\|\Phi_k^{t+1} - \bar{X}_k\|^2 = \|\Phi_k^{t+1} - \Phi_k^t + \Phi_k^t - \bar{X}_k\|^2 \leq 2\|\Phi_k^t - \bar{X}_k\|^2 + 2\|q_k^t\|^2,$$
and
$$\mathbb{E}[\|q_k^t\|^2] \leq 2\gamma^2\mathbb{E}[\|G(\Phi_k^t) - \nabla\mathrm{F}(\bar{X}_k)\|^2] + 2\beta^2\mathbb{E}[\|Y_k\|^2] \leq 4\gamma^2(2L^2 + L^2)\mathbb{E}[\|\Phi_k^t - \bar{X}_k\|^2] + 2\beta^2\mathbb{E}[\|Y_k\|^2] + 8\gamma^2 L^2\mathbb{E}[t_k^t]$$
$$\leq 12\gamma^2 L^2\mathbb{E}[\|\Phi_k^t - \bar{X}_k\|^2] + 4\gamma^2 N\|\nabla F(\bar{x}_k)\|^2 + 4\beta^2\|Y_k - \bar{Y}_k\|^2 + 8\gamma^2 L^2\mathbb{E}[t_k^t].$$
It follows that
$$\mathbb{E}[t_k^{t+1}] \leq \left(1 - \frac{1}{m_i}\right)t_k^t + \frac{1}{m_i}\big(2\|\Phi_k^t - \bar{X}_k\|^2 + 2\|q_k^t\|^2\big) \leq \left(1 - \frac{1}{m_u} + \frac{16\gamma^2 L^2}{m_l}\right)\mathbb{E}[t_k^t] + \left(\frac{2}{m_l} + \frac{24\gamma^2 L^2}{m_l}\right)\mathbb{E}[\|\Phi_k^t - \bar{X}_k\|^2] + \frac{72}{m_l}\beta^2\|\hat{d}_k\|^2 + \frac{8N}{m_l}\gamma^2\|\nabla F(\bar{x}_k)\|^2$$
$$\leq \left(1 - \frac{1}{2m_u}\right)\mathbb{E}[t_k^t] + \frac{4}{m_l}\mathbb{E}[\|\Phi_k^t - \bar{X}_k\|^2] + \frac{72}{m_l}\beta^2\|\hat{d}_k\|^2 + \frac{8N}{m_l}\gamma^2\|\nabla F(\bar{x}_k)\|^2, \quad (58)$$
where the last inequality holds when
$$\frac{16\gamma^2 L^2}{m_l} < \frac{1}{2m_u}, \qquad 24\gamma^2 L^2 < 2. \quad (59)$$
Iterating (58) for $t = 0, \ldots, \tau-1$ then yields
$$\mathbb{E}[t_k^t] \leq \left(1 - \frac{1}{2m_u}\right)^t\mathbb{E}[\|X_k - \bar{X}_k\|^2] + \frac{72}{m_l}\beta^2\sum_{l=0}^{t-1}\left(1 - \frac{1}{2m_u}\right)^{t-1-l}\mathbb{E}[\|\hat{d}_k\|^2] + \frac{8N\gamma^2}{m_l}\sum_{l=0}^{t-1}\left(1 - \frac{1}{2m_u}\right)^l\|\nabla F(\bar{x}_k)\|^2 + \frac{4}{m_l}\sum_{l=0}^{t-1}\left(1 - \frac{1}{2m_u}\right)^{t-1-l}\mathbb{E}[\|\Phi_k^l - \bar{X}_k\|^2]$$
$$\leq \frac{36\beta\tau m_u}{\lambda_l\rho}\mathbb{E}[\|\hat{d}_k\|^2] + \frac{16N\gamma^2 m_u\tau}{m_l}\|\nabla F(\bar{x}_k)\|^2 + \frac{8m_u\tau}{m_l}\sum_{l=0}^{t-1}\mathbb{E}[\|\Phi_k^l - \bar{X}_k\|^2] + \frac{144 m_u\tau}{m_l}\beta^2\mathbb{E}[\|\hat{d}_k\|^2].$$
Summing the above relation over $t = 0, 1, \ldots, \tau-1$ we get
$$\mathbb{E}[t_k] \leq \left(\frac{36\beta\tau^2 m_u}{\lambda_l\rho} + \frac{144\tau^2 m_u}{m_l}\beta^2\right)\mathbb{E}[\|\hat{d}_k\|^2] + \frac{8m_u\tau}{m_l}\|\hat{\Phi}_k\|^2 + \frac{16N\gamma^2 m_u\tau^2}{m_l}\|\nabla F(\bar{x}_k)\|^2,$$
and using (50) then yields
$$\mathbb{E}[t_k] \leq (s_0 + s_1)\mathbb{E}[\|\hat{d}_k\|^2] + s_2\mathbb{E}[\|\nabla F(\bar{x}_k)\|^2] + \frac{8m_u\tau}{m_l}\,32\tau^2\gamma^2 L^2\,\mathbb{E}[t_k],$$
where $s_0$, $s_1$, and $s_2$ are defined in (54).
Letting
$$\frac{8m_u\tau}{m_l}\,32\tau^2\gamma^2 L^2 < \frac{1}{2}, \quad (60)$$
(53) holds. The conditions (59) and (60) are satisfied when $\gamma \leq \min\{\bar{\gamma}_8, \bar{\gamma}_9, \bar{\gamma}_{10}\}$.

Lemma 6: Let Assumptions 1 and 2 hold. When $\beta\tau \leq \frac{2}{\lambda_u\rho}$ and $\gamma < \min\{\bar{\gamma}_1, \bar{\gamma}_7\}$, it holds for all $k \geq 0$ that
$$\mathbb{E}[\|\hat{d}_{k+1}\|^2] \leq \left(\delta + \frac{\tilde{q}_0}{1-\delta}\right)\mathbb{E}[\|\hat{d}_k\|^2] + \frac{\tilde{q}_1}{1-\delta}\mathbb{E}\Big[\big\|\textstyle\sum_t\nabla F(\Phi_k^t)\big\|^2\Big] + \frac{\tilde{q}_2}{1-\delta}\mathbb{E}[\|\nabla F(\bar{x}_k)\|^2], \quad (61)$$
where
$$\tilde{\beta}_0 := \frac{72\beta\tau^2}{\lambda_l\rho} + 144\tau^3\beta^2, \qquad \tilde{c}_0 := \gamma^2(1+2\rho^2\|\tilde{L}\|^2)\,6\tau L^2\|\hat{V}^{-1}\|^2 + \frac{6L^2\gamma^4}{\beta^2}\,2\tau L^2\|\hat{V}^{-1}\|^2,$$
$$\tilde{c}_1 := \gamma^2(1+2\rho^2\|\tilde{L}\|^2)\,4\tau L^2\|\hat{V}^{-1}\|^2 + \frac{6L^2\gamma^4}{\beta^2}\,2\tau L^2\|\hat{V}^{-1}\|^2, \qquad \tilde{c}_2 := \frac{6L^2\gamma^4}{\beta^2}N\|\hat{V}^{-1}\|^2,$$
$$\tilde{q}_0 := \tilde{c}_0\tilde{\beta}_0 + 2(s_0+s_1)(\tilde{c}_1 + 32\tau^2\gamma^2 L^2\tilde{c}_0), \qquad \tilde{q}_1 := \frac{6L^2\gamma^4}{\beta^2}N, \qquad \tilde{q}_2 := 16\tau^3\gamma^2 N\tilde{c}_0 + 2s_2(\tilde{c}_1 + 32\tau^2\gamma^2 L^2\tilde{c}_0).$$

Proof: When $\beta\tau \leq \frac{2}{\lambda_u\rho}$ and $\gamma < \min\{\bar{\gamma}_1, \bar{\gamma}_7\}$, substituting (49) and (50) into (28) yields
$$\|h_k\|^2 \leq \gamma^2(1+2\rho^2\|\tilde{L}\|^2)\big(6\tau L^2\|\hat{\Phi}_k\|^2 + 4\tau L^2\mathbb{E}[t_k]\big) + \frac{6L^2\gamma^4}{\beta^2}\Big(4\tau L^2\|\hat{\Phi}_k\|^2 + 4\tau L^2\mathbb{E}[t_k] + N\Big\|\sum_{t=0}^{\tau-1}\nabla F(\Phi_k^t)\Big\|^2\Big)$$
and
$$\|\hat{h}_k\|^2 \leq \tilde{c}_0\|\hat{\Phi}_k\|^2 + \tilde{c}_1 t_k + \tilde{c}_2\Big\|\sum_t\nabla F(\Phi_k^t)\Big\|^2 \leq \tilde{q}_0\mathbb{E}[\|\hat{d}_k\|^2] + \tilde{q}_1\mathbb{E}\Big[\big\|\textstyle\sum_t\nabla F(\Phi_k^t)\big\|^2\Big] + \tilde{q}_2\mathbb{E}[\|\nabla F(\bar{x}_k)\|^2];$$
together with (17), this proves that (61) holds.

B. Theorem 2

Based on (30), substituting $y = \bar{x}_{k+1}$ and $z = \bar{x}_k$ into (38) and using (49), we get
$$\mathbb{E}[F(\bar{x}_{k+1})] \leq \mathbb{E}[F(\bar{x}_k)] - \gamma\mathbb{E}\Big[\Big\langle\nabla F(\bar{x}_k), \frac{1}{N}\sum_t\sum_i\nabla f_i(\phi_{i,k}^t)\Big\rangle\Big] + \frac{\gamma^2 L}{2}\mathbb{E}\Big[\Big\|\frac{1}{N}\sum_t\sum_i g_i(\phi_{i,k}^t)\Big\|^2\Big]$$
$$\leq \mathbb{E}[F(\bar{x}_k)] - \gamma\mathbb{E}\Big[\Big\langle\nabla F(\bar{x}_k), \frac{1}{N}\sum_t\sum_i\nabla f_i(\phi_{i,k}^t)\Big\rangle\Big] + \gamma^2\tau L\,\mathbb{E}\Big[\sum_t\Big\|\frac{1}{N}\sum_i\nabla f_i(\phi_{i,k}^t)\Big\|^2\Big] + 2\gamma^2\tau L^3\big(\|\hat{\Phi}_k\|^2 + \mathbb{E}[t_k]\big).$$
Using now $2\langle a,b\rangle = \|a\|^2 + \|b\|^2 - \|a-b\|^2$, we have
$$-\Big\langle\nabla F(\bar{x}_k), \frac{1}{N}\sum_t\sum_i\nabla f_i(\phi_{i,k}^t)\Big\rangle = -\frac{\tau}{2}\|\nabla F(\bar{x}_k)\|^2 - \frac{1}{2}\sum_t\Big\|\frac{1}{N}\sum_i\nabla f_i(\phi_{i,k}^t)\Big\|^2 + \frac{1}{2}\sum_t\Big\|\frac{1}{N}\sum_i\nabla f_i(\phi_{i,k}^t) - \nabla F(\bar{x}_k)\Big\|^2$$
$$\leq -\frac{\tau}{2}\|\nabla F(\bar{x}_k)\|^2 - \frac{1}{2}\sum_t\Big\|\frac{1}{N}\sum_i\nabla f_i(\phi_{i,k}^t)\Big\|^2 + \frac{L^2}{2N}\|\hat{\Phi}_k\|^2.$$
Combining the two bounds above and using (16) yields
$$\mathbb{E}[F(\bar{x}_{k+1})] \leq \mathbb{E}[F(\bar{x}_k)] - \frac{\gamma\tau}{2}\mathbb{E}[\|\nabla F(\bar{x}_k)\|^2] - \frac{\gamma}{2}(1 - 2\gamma\tau L)\sum_t\mathbb{E}\Big[\Big\|\frac{1}{N}\sum_i\nabla f_i(\phi_{i,k}^t)\Big\|^2\Big] + \frac{\gamma L^2}{2N}\mathbb{E}[\|\hat{\Phi}_k\|^2] + 2\gamma^2\tau L^3\big(\|\hat{\Phi}_k\|^2 + \mathbb{E}[t_k]\big).$$
Using (50) and (53), we have
$$\mathbb{E}[F(\bar{x}_{k+1})] \leq \mathbb{E}[F(\bar{x}_k)] - \frac{\gamma\tau}{2}\mathbb{E}[\|\nabla F(\bar{x}_k)\|^2] - \frac{\gamma}{2}(1 - 2\gamma\tau L)\mathbb{E}\Big[\sum_t\Big\|\frac{1}{N}\sum_i\nabla f_i(\phi_{i,k}^t)\Big\|^2\Big] + \tilde{q}_3\mathbb{E}[\|\nabla F(\bar{x}_k)\|^2] + \tilde{q}_4\mathbb{E}[\|\hat{d}_k\|^2],$$
where
$$\tilde{q}_3 := 16\tau^3\gamma^2 N\Big(\frac{\gamma L^2}{2N} + 2\gamma^2\tau L^3\Big) + 2s_2\Big(2\gamma^2\tau L^3 + 32\tau^2\gamma^2 L^2\Big(\frac{\gamma L^2}{2N} + 2\gamma^2\tau L^3\Big)\Big),$$
$$\tilde{q}_4 := \Big(\frac{\gamma L^2}{2N} + 2\gamma^2\tau L^3\Big)\tilde{\beta}_0 + 2(s_0+s_1)\Big(2\gamma^2\tau L^3 + 32\tau^2\gamma^2 L^2\Big(\frac{\gamma L^2}{2N} + 2\gamma^2\tau L^3\Big)\Big).$$
Letting $\gamma \leq \min\{1, \bar{\gamma}_{11}, \bar{\gamma}_{12}\}$, then
$$\tilde{q}_3 \leq \frac{3\gamma\tau}{8}, \qquad 2\gamma\tau L \leq \frac{3}{4}, \quad (62)$$
and we can upper bound the previous inequality by
$$\mathbb{E}[F(\bar{x}_{k+1})] \leq \mathbb{E}[F(\bar{x}_k)] - \frac{\gamma\tau}{8}\mathbb{E}[\|\nabla F(\bar{x}_k)\|^2] - \frac{\gamma}{8}\mathbb{E}\Big[\sum_t\Big\|\frac{1}{N}\sum_i\nabla f_i(\phi_{i,k}^t)\Big\|^2\Big] + \tilde{q}_4\mathbb{E}[\|\hat{d}_k\|^2].$$
Rearranging, we get
$$D_k \leq \frac{8}{\gamma\tau}\big(\mathbb{E}[\tilde{F}(\bar{x}_k)] - \mathbb{E}[\tilde{F}(\bar{x}_{k+1})]\big) + \frac{8\tilde{q}_4}{\gamma\tau}\mathbb{E}[\|\hat{d}_k\|^2], \quad (63)$$
where $D_k$ is defined in (7) and $\tilde{F}(\bar{x}_k) = F(\bar{x}_k) - F(x^*)$. Summing (63) over $k = 0, 1, \ldots, K-1$ and using $-\tilde{F}(\bar{x}_k) \leq 0$, it holds that
$$\sum_{k=0}^{K-1}D_k \leq \frac{8\tilde{F}(\bar{x}_0)}{\gamma\tau} + \frac{8\tilde{q}_4}{\gamma\tau}\sum_{k=0}^{K-1}\mathbb{E}[\|\hat{d}_k\|^2]. \quad (64)$$
which is (64). According to (61), we derive that for all $k \geq 0$,
\[
\mathbb{E}[\|\hat d_{k+1}\|^2] \leq \Big(\delta + \frac{\tilde q_0}{1-\delta}\Big)\mathbb{E}[\|\hat d_k\|^2] + \frac{\tilde q_1}{1-\delta}\,\mathbb{E}\Big[\Big\|\sum_t \nabla F(\Phi_k^t)\Big\|^2\Big] + \frac{\tilde q_2}{1-\delta}\,\mathbb{E}[\|\nabla F(\bar x_k)\|^2] \leq \tilde\delta\,\mathbb{E}[\|\hat d_k\|^2] + \tilde R\, D_k, \tag{65}
\]
where $\tilde R = \max\Big\{\frac{\tilde q_1\tau^2}{1-\delta}, \frac{\tilde q_2}{1-\delta}\Big\}$. Letting $\gamma \leq \min\{\bar\gamma_1, \bar\gamma_{13}\}$ and $\frac{1}{\tau\lambda_u\rho} \leq \beta < \frac{2}{\tau\lambda_u\rho}$, then
\[
\tilde\delta = \delta + \frac{\tilde q_0}{1-\delta} \leq 1 - \frac{\lambda_l}{4\lambda_u}. \tag{66}
\]
Iterating (65) yields, for all $k \geq 1$, $\mathbb{E}[\|\hat d_k\|^2] \leq \tilde\delta^k\,\mathbb{E}[\|\hat d_0\|^2] + \tilde R\sum_{\ell=0}^{k-1}\tilde\delta^{k-1-\ell}D_\ell$, and summing over $k = 0, \ldots, K-1$ it holds that
\[
\sum_{k=0}^{K-1}\mathbb{E}[\|\hat d_k\|^2] \leq \frac{1}{1-\tilde\delta}\,\|\hat d_0\|^2 + \frac{\tilde R}{1-\tilde\delta}\sum_{k=0}^{K-1} D_k. \tag{67}
\]
Substituting (67) into (64) and rearranging, we obtain
\[
\Big(1 - \frac{\tilde R}{1-\tilde\delta}\,\frac{8\tilde q_4}{\gamma\tau}\Big)\sum_{k=0}^{K-1} D_k \leq \frac{8\tilde F(\bar x_0)}{\gamma\tau} + \frac{8\tilde q_4}{\gamma\tau(1-\tilde\delta)}\,\|\hat d_0\|^2. \tag{68}
\]
Since $1-\tilde\delta \geq \frac{\lambda_l}{4\lambda_u}$, letting $\gamma \leq \min\{\bar\gamma_{14}, \bar\gamma_{15}\}$ gives
\[
\frac{\tilde R}{1-\tilde\delta}\,\frac{8\tilde q_4}{\gamma\tau} \leq \frac{1}{2}, \tag{69}
\]
and therefore it follows that
\[
\frac{1}{K}\sum_{k=0}^{K-1} D_k \leq \frac{16\tilde F(\bar x_0)}{K\gamma\tau} + \frac{16\tilde q_4}{K\gamma\tau(1-\tilde\delta)}\,\|\hat d_0\|^2. \tag{70}
\]
Collecting all step-size conditions, if the step-size satisfies $\gamma \leq \min_{i \in \{1, 7, 8, \ldots, 15\}} \bar\gamma_i$, then the states $\{X_k\}$ generated by LT-ADMM-VR converge to a stationary point, concluding the proof. ∎

REFERENCES

[1] D. K. Molzahn, F. Dorfler, H. Sandberg, S. H. Low, S. Chakrabarti, R. Baldick, and J. Lavaei, "A Survey of Distributed Optimization and Control Algorithms for Electric Power Systems," IEEE Transactions on Smart Grid, vol. 8, no. 6, pp. 2941–2962, Nov. 2017.
[2] O. Shorinwa, T. Halsted, J. Yu, and M. Schwager, "Distributed optimization methods for multi-robot systems: Part 1—a tutorial," IEEE Robotics & Automation Magazine, 2024.
[3] ——, "Distributed optimization methods for multi-robot systems: Part 2—a survey," IEEE Robotics & Automation Magazine, 2024.
[4] R. Mohebifard and A.
Hajbabaie, "Distributed optimization and coordination algorithms for dynamic traffic metering in urban street networks," IEEE Transactions on Intelligent Transportation Systems, vol. 20, no. 5, pp. 1930–1941, 2018.
[5] A. Nedić and J. Liu, "Distributed Optimization for Control," Annual Review of Control, Robotics, and Autonomous Systems, vol. 1, no. 1, pp. 77–103, May 2018.
[6] T. Gafni, N. Shlezinger, K. Cohen, Y. C. Eldar, and H. V. Poor, "Federated Learning: A signal processing perspective," IEEE Signal Processing Magazine, vol. 39, no. 3, pp. 14–41, May 2022.
[7] L. Bottou, F. E. Curtis, and J. Nocedal, "Optimization methods for large-scale machine learning," SIAM Review, vol. 60, no. 2, pp. 223–311, 2018.
[8] S. A. Alghunaim and K. Yuan, "A unified and refined convergence analysis for non-convex decentralized learning," IEEE Transactions on Signal Processing, vol. 70, pp. 3264–3279, 2022.
[9] Y. Liu, T. Lin, A. Koloskova, and S. U. Stich, "Decentralized gradient tracking with local steps," Optimization Methods and Software, pp. 1–28, 2024.
[10] S. A. Alghunaim, "Local exact-diffusion for decentralized optimization and learning," IEEE Transactions on Automatic Control, vol. 69, no. 11, pp. 7371–7386, 2024.
[11] L. Guo, S. A. Alghunaim, K. Yuan, L. Condat, and J. Cao, "RandCom: Random communication skipping method for decentralized stochastic optimization," CoRR, 2023.
[12] H. Li, Z. Lin, and Y. Fang, "Variance Reduced EXTRA and DIGing and Their Optimal Acceleration for Strongly Convex Decentralized Optimization," Journal of Machine Learning Research, vol. 23, no. 222, pp. 1–41, 2022.
[13] X. Jiang, X. Zeng, J. Sun, and J. Chen, "Distributed Stochastic Gradient Tracking Algorithm With Variance Reduction for Non-Convex Optimization," IEEE Transactions on Neural Networks and Learning Systems, vol. 34, no. 9, pp. 5310–5321, Sep. 2023.
[14] R. Xin, U. A. Khan, and S.
Kar, "Variance-Reduced Decentralized Stochastic Optimization With Accelerated Convergence," IEEE Transactions on Signal Processing, vol. 68, pp. 6255–6271, 2020.
[15] ——, "A Fast Randomized Incremental Gradient Method for Decentralized Nonconvex Optimization," IEEE Transactions on Automatic Control, vol. 67, no. 10, pp. 5150–5165, 2022.
[16] ——, "Fast Decentralized Nonconvex Finite-Sum Optimization with Recursive Variance Reduction," SIAM Journal on Optimization, vol. 32, no. 1, pp. 1–28, Mar. 2022.
[17] W. Shi, Q. Ling, G. Wu, and W. Yin, "EXTRA: An exact first-order algorithm for decentralized consensus optimization," SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015.
[18] A. Nedic, A. Olshevsky, and W. Shi, "Achieving geometric convergence for distributed optimization over time-varying graphs," SIAM Journal on Optimization, vol. 27, no. 4, pp. 2597–2633, 2017.
[19] F. Saadatniaki, R. Xin, and U. A. Khan, "Decentralized optimization over time-varying directed graphs with row and column-stochastic matrices," IEEE Transactions on Automatic Control, vol. 65, no. 11, pp. 4769–4780, 2020.
[20] X. Ren, D. Li, Y. Xi, and H. Shao, "An accelerated distributed gradient method with local memory," Automatica, vol. 146, p. 110260, 2022.
[21] N. Bastianello, R. Carli, L. Schenato, and M. Todescato, "Asynchronous Distributed Optimization Over Lossy Networks via Relaxed ADMM: Stability and Linear Convergence," IEEE Transactions on Automatic Control, vol. 66, no. 6, pp. 2620–2635, Jun. 2021.
[22] N. Bastianello, D. Deplano, M. Franceschelli, and K. H. Johansson, "Robust online learning over networks," IEEE Transactions on Automatic Control, vol. 70, no. 2, pp. 933–946, 2025.
[23] A. Makhdoumi and A. Ozdaglar, "Convergence rate of distributed ADMM over networks," IEEE Transactions on Automatic Control, vol. 62, no. 10, pp. 5082–5095, 2017.
[24] V. Khatana and M. V.
Salapaka, "DC-DistADMM: ADMM algorithm for constrained optimization over directed graphs," IEEE Transactions on Automatic Control, vol. 68, no. 9, pp. 5365–5380, 2022.
[25] A. Defazio, F. Bach, and S. Lacoste-Julien, "SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives," Advances in Neural Information Processing Systems, vol. 27, 2014.
[26] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-Efficient Learning of Deep Networks from Decentralized Data," in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, ser. Proceedings of Machine Learning Research, A. Singh and J. Zhu, Eds., vol. 54. Fort Lauderdale, FL, USA: PMLR, Apr. 2017, pp. 1273–1282.
[27] X. Zhang, M. Hong, S. Dhople, W. Yin, and Y. Liu, "FedPD: A Federated Learning Framework With Adaptivity to Non-IID Data," IEEE Transactions on Signal Processing, vol. 69, pp. 6055–6070, 2021.
[28] K. Mishchenko, G. Malinovsky, S. Stich, and P. Richtarik, "ProxSkip: Yes! Local Gradient Steps Provably Lead to Communication Acceleration! Finally!" in Proceedings of the 39th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, Eds., vol. 162. PMLR, Jul. 2022, pp. 15750–15769.
[29] A. Mitra, R. Jaafar, G. J. Pappas, and H. Hassani, "Linear Convergence in Federated Learning: Tackling Client Heterogeneity and Sparse Gradients," in Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, and J. W. Vaughan, Eds., vol. 34. Curran Associates, Inc., 2021, pp. 14606–14619.
[30] L. Condat, I. Agarský, G. Malinovsky, and P. Richtárik, "TAMUNA: Doubly Accelerated Federated Learning with Local Training, Compression, and Partial Participation," May 2023.
[31] E. D. Hien Nguyen, S. A. Alghunaim, K.
Yuan, and C. A. Uribe, "On the Performance of Gradient Tracking with Local Updates," in 2023 62nd IEEE Conference on Decision and Control (CDC). Singapore: IEEE, Dec. 2023, pp. 4309–4313.
[32] A. S. Berahas, R. Bollapragada, and E. Wei, "On the convergence of nested decentralized gradient methods with multiple consensus and gradient steps," IEEE Transactions on Signal Processing, vol. 69, pp. 4192–4203, 2021.
[33] C. Iakovidou and E. Wei, "S-NEAR-DGD: A Flexible Distributed Stochastic Gradient Method for Inexact Communication," IEEE Transactions on Automatic Control, vol. 68, no. 2, pp. 1281–1287, Feb. 2023.
[34] Y. Hou, W. Hu, J. Li, and T. Huang, "Prescribed performance control for double-integrator multi-agent systems: A unified event-triggered consensus framework," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 71, no. 9, pp. 4222–4232, 2024.
[35] R. Johnson and T. Zhang, "Accelerating stochastic gradient descent using predictive variance reduction," Advances in Neural Information Processing Systems, vol. 26, 2013.
[36] L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takáč, "SARAH: A novel method for machine learning problems using stochastic recursive gradient," in International Conference on Machine Learning. PMLR, 2017, pp. 2613–2621.
[37] M. Hong, Z.-Q. Luo, and M. Razaviyayn, "Convergence analysis of alternating direction method of multipliers for a family of nonconvex problems," SIAM Journal on Optimization, vol. 26, no. 1, pp. 337–364, 2016.
[38] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course. Springer Science & Business Media, 2013, vol. 87.
