Parallel Restarted SPIDER -- Communication Efficient Distributed Nonconvex Optimization with Optimal Computation Complexity



Pranay Sharma*, Swatantra Kafle*, Prashant Khanduri*, Saikiran Bulusu*, Ketan Rajawat† and Pramod K. Varshney*
*Department of Electrical Engineering and Computer Science, Syracuse University, Syracuse, New York-13244
†Department of Electrical Engineering, IIT Kanpur, Kanpur, India-208016.
{psharm04, skafle, pkhandur, sabulusu, varshney}@syr.edu, and ketan@iitk.ac.in.

November 9, 2020

Abstract

In this paper, we propose a distributed algorithm for stochastic smooth, non-convex optimization. We assume a worker-server architecture where $N$ nodes, each having $n$ (potentially infinite) samples, collaborate with the help of a central server to perform the optimization task. The global objective is to minimize the average of local cost functions available at individual nodes. The proposed approach is a non-trivial extension of the popular parallel-restarted SGD algorithm, incorporating the optimal variance-reduction-based SPIDER gradient estimator into it. We prove convergence of our algorithm to a first-order stationary solution. The proposed approach achieves the best known communication complexity $O(\epsilon^{-1})$, along with the optimal computation complexity. For finite-sum problems (finite $n$), we achieve the optimal computation (IFO) complexity $O(\sqrt{Nn}\,\epsilon^{-1})$. For online problems ($n$ unknown or infinite), we achieve the optimal IFO complexity $O(\epsilon^{-3/2})$. In both cases, we maintain the linear speedup achieved by existing methods. This is a massive improvement over the $O(\epsilon^{-2})$ IFO complexity of the existing approaches. Additionally, our algorithm is general enough to allow non-identical distributions of data across workers, as in the recently proposed federated learning paradigm.
1 Introduction

The current age of Big Data is built on the foundation of distributed systems, and efficient distributed algorithms to run on these systems [1]. With the rapid increase in the volume of the data being fed into these systems, storing and processing all this data at a central location becomes infeasible. Such a central server requires a gigantic amount of computational and storage resources [2]. Also, the input data is usually collected from a myriad of sources, which might inherently be distributed [3]. Transferring all this data to a single location might be expensive. Depending on the sensitivity of the underlying data, this might also lead to concerns about maintaining the privacy and anonymity of the data [4]. Therefore, even in situations where it is possible to have central servers, it is not always desirable.

One of the factors which has made this Big Data explosion possible is the meteoric rise in the capabilities, and the consequent proliferation, of end-user devices [3]. These worker nodes or machines have significant storage and computational resources. One simple solution to the central-server problem faced by distributed systems is for the server to offload some of its conventional tasks to these workers [2]. More precisely, instead of sharing its entire data with the server, which is expensive communication-wise, each worker node carries out computations on its data itself. The results of these "local" computations are then shared with the server. The server aggregates them to arrive at a "global" estimate, which is then shared with all the workers. The following distributed optimization problem, which we solve in this paper, obeys this governing principle.

1.1 Problem

$$\min_{x \in \mathbb{R}^d} f(x) \triangleq \frac{1}{N} \sum_{i=1}^{N} f_i(x), \qquad (1)$$

where $N$ is the number of worker nodes.
The local function corresponding to node $i$, $f_i(x)$, is a smooth, potentially non-convex function. We consider two variants of this problem.

• Finite-Sum Case: Each individual node function $f_i$ in (1) is an empirical mean of function values corresponding to $n$ samples. Therefore, the problem is of the form
$$\min_{x \in \mathbb{R}^d} f(x) \triangleq \frac{1}{N} \sum_{i=1}^{N} \frac{1}{n} \sum_{j=1}^{n} f_i(x, \xi_j), \qquad (2)$$
where $\xi_j$ denotes the $j$-th sample and $f_i(x, \xi_j)$ is the cost corresponding to the $j$-th sample.

• Online Case: Each individual node function $f_i$ in (1) is an expected value. Hence, the problem has the form
$$\min_{x \in \mathbb{R}^d} f(x) \triangleq \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{\xi \sim \mathcal{D}_i} f_i(x, \xi), \qquad (3)$$
where $\mathcal{D}_i$ denotes the distribution of samples at the $i$-th node. Note that $\mathcal{D}_i$ can be different across different workers. This scenario is the popular federated learning [3] model.

Throughout the paper, we assume that, given a point $y$, each node can choose samples $\xi$ independently, and that the stochastic gradients of these sample functions are unbiased estimators of the actual gradient. Generally, the optimality of a nonconvex problem is measured in terms of an $\epsilon$-stationary point, defined as a point $x$ such that
$$\|\nabla f(x)\|^2 \leq \epsilon,$$
which is updated to
$$\mathbb{E}\|\nabla f(x)\|^2 \leq \epsilon$$
for stochastic algorithms, where the expectation accounts for the randomness introduced by the stochastic nature of the algorithms. A point $x$ which satisfies the above is referred to as a first-order stationary (FoS) point of (1). Note that the above quantity encodes only the gradient error and ignores the consensus error across different nodes. Since the algorithm proposed in this work is a distributed algorithm, it is appropriate to include a consensus error term in the definition of the FoS point [5].
For this purpose, we modify the definition of the FoS point to include a consensus error term:

Definition 1.1. ($\epsilon$-First-order stationary ($\epsilon$-FoS) point [5]) Let $\{x_i\}_{i=1}^N$ with $x_i \in \mathbb{R}^d$ be the local iterates at the $N$ nodes, and let $\bar{x} = \frac{1}{N}\sum_{i=1}^N x_i$. Then $\bar{x}$ is an $\epsilon$-FoS point if
$$\mathbb{E}\|\nabla f(\bar{x})\|^2 + \frac{1}{N}\sum_{i=1}^N \mathbb{E}\|x_i - \bar{x}\|^2 \leq \epsilon.$$
Note that the expectation is over the stochasticity of the algorithm. All our results are in terms of convergence to an $\epsilon$-FoS point.

Note that Definition 1.1 implies that an $\epsilon$-FoS point also satisfies $\mathbb{E}\|\nabla f(\bar{x})\|^2 \leq \epsilon$. The complexity of the algorithm is measured in terms of communication and computation complexity, which are defined next.

Definition 1.2. (Computation Complexity) In the Incremental First-order Oracle (IFO) framework [6], given a sample $\xi$ at node $i$ and a point $x$, the oracle returns $(f_i(x;\xi), \nabla f_i(x;\xi))$. Each access to the oracle is counted as a single IFO operation. The sample or computation complexity of the algorithm is, hence, the total (aggregated across nodes) number of IFO operations required to achieve an $\epsilon$-FoS solution.

Definition 1.3. (Communication Complexity) In one round of communication between the workers and the server, each worker sends its local vector $\in \mathbb{R}^d$ to the server, and receives an aggregated vector of the same dimension in return. The communication complexity of the algorithm is the number of such communication rounds required to achieve an $\epsilon$-FoS solution.

1.2 Related Work

1.2.1 Stochastic Gradient Descent (SGD)

SGD is the workhorse of the modern Big-Data machinery. Before moving to the discussion of the literature most relevant to our work, we provide a quick review of the most basic results on SGD. For strongly convex problems, to obtain an $\epsilon$-accurate solution, $O(1/\epsilon)$ IFO calls are required [7].
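As a concrete reading of Definition 1.1, the quantity that must be driven below $\epsilon$ can be computed for a single realization as follows. This is a minimal numpy sketch; the function name and interface are ours, not from the paper, and the definition itself additionally takes an expectation over the algorithm's randomness.

```python
import numpy as np

def fos_gap(grad_f, local_iterates):
    """Single-realization version of the epsilon-FoS quantity in Definition 1.1:
    ||grad f(x_bar)||^2 plus the average squared consensus error over the N nodes."""
    x = np.asarray(local_iterates, dtype=float)   # shape (N, d): one iterate per node
    x_bar = x.mean(axis=0)                        # server-side model average
    consensus = np.mean(np.sum((x - x_bar) ** 2, axis=1))
    return float(np.sum(grad_f(x_bar) ** 2) + consensus)
```

For instance, with $f(x) = \frac{1}{2}\|x\|^2$ (so $\nabla f(x) = x$) and both nodes at $x_i = (1,1)$, the consensus term vanishes and the gap is $\|(1,1)\|^2 = 2$.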
For general convex problems [8], to achieve an $\epsilon$-accurate solution, $O(1/\epsilon^2)$ IFO calls are required. For nonconvex problems [9], to achieve an $\epsilon$-stationary solution, $O(1/\epsilon^2)$ IFO calls are required. In the absence of additional assumptions on the smoothness of the stochastic functions, this bound is tight [10]. With additional assumptions, this bound can be improved using variance-reduction methods, which we discuss in Section 1.2.3.

1.2.2 Distributed Stochastic Gradient Descent (SGD)

There is a vast and ever-growing body of literature on distributed SGD. However, here we limit our discussion almost exclusively to work in the domain of stochastic nonconvex optimization. A classical method to solve (1) is Parallel Mini-batch SGD [11], [12]. Each worker, in parallel, samples a local stochastic gradient (or a mini-batch of them). It uses this gradient estimator to update its local iterate and sends the latter to the server. The server node aggregates the local iterates, computes their average, and broadcasts this average to all the workers. The workers and the server repeat the same process iteratively.¹ The approach achieves an $\epsilon$-FoS point, with a linear speedup in the number of workers. This is because the total IFO complexity $O(\epsilon^{-2})$ is independent of the number of workers $N$; hence, each node only needs to compute $O(\frac{1}{N}\epsilon^{-2})$ gradients. However, the exchange of gradients and iterates between the workers and the server at each iteration results in a significant communication requirement, leading to a communication complexity of $O(\frac{1}{N}\epsilon^{-2})$. To alleviate the communication cost, several approaches have recently been proposed. These include communicating compressed gradients to the server, as in quantized SGD [13, 14] or sparsified SGD [15, 16].
The motivation behind using quantized gradient vectors (or sparse approximations of actual gradients) is to reduce the communication cost, while not significantly affecting the convergence rate. The number of communication rounds, however, still remains the same.

Model Averaging: To cut back on the communication costs further, the nodes might decide to make the communication and the subsequent averaging infrequent. At one extreme of this idea is one-shot averaging, in which averaging happens only once. All the nodes run SGD locally, in parallel, throughout the procedure. At the end of the final iteration, the local iterates are averaged and the average is returned as output. However, this has been shown, for some non-convex problems, to have poor convergence [17]. More frequent averaging emerges as the obvious next option to explore. For strongly convex problems, it was shown in [18] that model averaging can also achieve linear speedup with $N$, as long as averaging happens at least every $I = \frac{1}{N\epsilon}$ iterations. Following this positive result, a number of works have achieved similar results for non-convex optimization. The IFO complexity is still $O(\epsilon^{-2})$; however, savings in communication costs are demonstrated. The resulting class of algorithms is referred to as Parallel Restarted SGD. The entire algorithm is divided into epochs, each of length $I$. Within each epoch, all the $N$ nodes run SGD locally, in parallel. All the nodes begin each epoch at the same point, which is the average of the local iterates at the end of the previous epoch. Essentially, each node runs SGD in parallel for $I$ iterations.

¹Alternatively, the worker nodes might send their local gradient estimators to the server, which then averages these, updates its iterate using the average, and broadcasts the new iterate to the workers.
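The epoch structure described above ($I$ local SGD steps, then one averaging round) can be sketched on a toy scalar least-squares objective. Everything below (the problem, step size, and epoch counts) is illustrative, not taken from any of the cited papers.

```python
import numpy as np

def parallel_restarted_sgd(data, gamma=0.1, epochs=20, I=5, seed=0):
    """Local SGD with model averaging every I iterations, on the toy objective
    f_i(x) = E_xi [0.5 * (x - xi)^2], whose stochastic gradient is (x - xi).
    data[i] holds the samples available at worker i."""
    rng = np.random.default_rng(seed)
    x = np.zeros(len(data))                    # one scalar iterate per worker
    for _ in range(epochs):
        for _ in range(I):                     # I local steps, run "in parallel"
            xi = np.array([rng.choice(d) for d in data])
            x -= gamma * (x - xi)              # local stochastic gradient step
        x[:] = x.mean()                        # server averages; workers restart
    return float(x.mean())
```

With deterministic per-worker samples $\{1.0\}$ and $\{3.0\}$, the local minimizers are 1 and 3, and the periodically averaged iterate approaches the global minimizer 2.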
At the end of these local iterations, each worker sends its solution to the server. The server returns the average of these local iterates to the workers, each of which restarts the local iterations from this new point. Hence the name Parallel Restarted SGD. In [19], the proposed approach achieves communication savings using model averaging. However, it works under the additional assumption that all the stochastic gradients have bounded second moments. This assumption is relaxed in [20], which achieves a further reduction in the communication requirement by adding momentum to the vanilla SGD run locally at the nodes. The communication cost is again improved in [21], where the authors used dynamic batch sizes. This is achieved using a two-step approach. In the first step, for the special class of non-convex problems satisfying the Polyak-Łojasiewicz (PL) condition, exponentially increasing batch sizes are used locally. The fastest known convergence, with linear speedup, is thus achieved using only $O(\log(\epsilon^{-1}))$ communication rounds. Next, for general nonconvex problems, the algorithm proposed in the first step is used as a subroutine, to achieve linear speedup, while using only $O(\epsilon^{-1}\log(N^{-1}\epsilon^{-1})) = \tilde{O}(\epsilon^{-1})$ communication rounds, where $\tilde{O}(\cdot)$ subsumes logarithmic factors.

Reference    | Communication Complexity            | IFO Complexity (identical $f_i(\cdot)$)        | IFO Complexity (non-identical $f_i(\cdot)$)
[11]         | $O(1/(N\epsilon^2))$                | NA                                             | $O(1/\epsilon^2)$
[19]*        | $O(1/\epsilon^{3/2})$               | $O(1/\epsilon^{3/2})$                          | NA
[22]         | $O(N^2/\epsilon)$                   | $O(\sqrt{N}/\epsilon^{3/2})$                   | NA
[20]         | $O(N/\epsilon)$                     | $O(1/\epsilon^{3/2})$                          | NA
[21]         | $O((1/\epsilon)\log(1/(N\epsilon)))$ | NA                                            | NA
This Paper   | $O(1/\epsilon)$                     | $O(\min\{\sqrt{Nn}/\epsilon, 1/\epsilon^{3/2}\})$ | $O(\min\{\sqrt{Nn}/\epsilon, 1/\epsilon^{3/2}\})$

Table 1: The communication and IFO complexity of different distributed SGD algorithms to reach an $\epsilon$-FoS point, for the stochastic smooth non-convex optimization problem.
Note that not all approaches are applicable to the more general setting where the distributions at different nodes are non-identical. This setting captures the recently proposed federated learning paradigm [3]. *The approach in [19] requires the additional assumption that the gradients have bounded second moments.

Other approaches based on infrequent averaging include LAG [23]. In LAG, during every iteration, the server requests fresh gradients only from a subset of the workers, while for the remaining workers, it reuses the gradients received in the previous iteration. However, the approach has only been explored in the deterministic gradient descent setting. In [24], redundancy is introduced in the training data to achieve communication savings. The authors of [25] provide a comprehensive empirical study of distributed SGD, with focus on communication efficiency and generalization performance. To the best of our knowledge, no improvement over the $O(\epsilon^{-2})$ IFO complexity benchmark for distributed stochastic non-convex optimization has been reported thus far.

1.2.3 Variance Reduction

For the sake of simplicity of the discussion in this section, we assume that all the samples in the finite-sum problem (2) or the online problem (3) are available at a single node. To solve the finite-sum problem (2), where each $f_i(\cdot;\xi_j)$ is $L$-smooth and potentially non-convex, gradient descent (GD) and SGD are the two classical approaches. In terms of per-iteration complexity, these form the two extremes: GD entails computing the full gradient at each node in each iteration, i.e., $O(N \times n)$ operations, while SGD requires only $O(1)$ computations per iteration. Overall, to reach an $\epsilon$-FoS point $\bar{x}$, GD requires $O(Nn\epsilon^{-1})$ [26], while SGD requires $O(\epsilon^{-2})$ gradient evaluations.
For large values of $N \times n$ (or in situations where $n$ is infinite, as in the online setting (3)), SGD is the only viable option. Empirically, SGD has been observed to have good initial performance, but its progress slows down near the solution. One of the reasons behind this is the variance inherent in the stochastic gradient estimator used. A number of variance-reduction estimators, for example SAGA [27] and SVRG [28], have been proposed in the literature to ameliorate this problem. The algorithms based on these estimators compute full (or batch) gradients at regular intervals, to "guide" the progress made by the intermediate SGD steps. This interleaving of GD and SGD steps has resulted in significantly improved convergence rates, for problems with the mean-squared smoothness property.² The IFO complexity required for finite-sum problems was first improved to $O((Nn)^{2/3}\epsilon^{-1})$ in [29, 30], and then further improved to $O((Nn)^{1/2}\epsilon^{-1})$ in [31, SPIDER], [32, SPIDERBoost], [33, SNVRG], [34, SARAH]. Moreover, it is proved in [31] that $O((Nn)^{1/2}\epsilon^{-1})$ is the optimal complexity for problems where $N \times n \leq O(\epsilon^{-2})$. Similarly, for online problems, variance-reduction methods first improved upon SGD to achieve the IFO complexity of $O(\epsilon^{-5/3})$ [35, SCSG], which was again improved to $O(\epsilon^{-3/2})$ [31, SPIDER], [32]. Quite recently, in [10], it was shown that the IFO complexity of $O(\epsilon^{-3/2})$ is optimal for online problems.

1.2.4 Distributed Stochastic Variance Reduction Methods

The existing literature on distributed variance-reduction methods is almost exclusively focused on convex and strongly convex problems. Empirically, these methods have been shown to be promising [36, AIDE]. Single-node SVRG requires the gradient estimators at each iteration to be unbiased. This is a major challenge for distributed variants of SVRG.
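To make the interleaving of periodic full gradients and recursive corrections concrete, here is a minimal single-node SPIDER-style loop on the toy finite-sum $f(x) = \frac{1}{n}\sum_j \frac{1}{2}(x - \xi_j)^2$. The parameter choices here are illustrative, not the theoretically prescribed ones.

```python
import numpy as np

def spider_toy(samples, x0=0.0, gamma=0.1, q=10, B=2, T=50, seed=0):
    """Single-node SPIDER sketch.  Every q iterations the full gradient is
    recomputed; in between, the recursive estimator
        v_t = v_{t-1} + (1/B) * sum_{xi in batch} [g(x_t; xi) - g(x_{t-1}; xi)]
    is used, where g(x; xi) = x - xi for this quadratic toy problem."""
    rng = np.random.default_rng(seed)
    samples = np.asarray(samples, dtype=float)
    x_prev, x = x0, x0
    v = x - samples.mean()                      # full gradient at the start
    for t in range(T):
        x_prev, x = x, x - gamma * v
        if (t + 1) % q == 0:
            v = x - samples.mean()              # periodic full-gradient refresh
        else:
            batch = rng.choice(samples, size=B)
            # per-sample difference (x - xi) - (x_prev - xi), averaged over the batch
            v = v + float(np.mean((x - batch) - (x_prev - batch)))
    return x
```

For this quadratic, the per-sample correction is exactly $x_t - x_{t-1}$, so the estimator stays exact and the iterate converges to the sample mean; on general nonconvex objectives the estimator is only exact in expectation, which is where the variance analysis enters.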
The existing approaches try to bypass this by simulating the sampling of extra data [37], [38, DANE], [39]. To the best of our knowledge, [40] is the only work that avoids this additional sampling.

1.3 Contributions

In this paper, we propose a distributed variant of the SPIDER algorithm, Parallel Restarted SPIDER (PR-SPIDER), to solve the non-convex optimization problem (1). Note that PR-SPIDER is a non-trivial extension of both SPIDER and parallel-restarted SGD. This is because we need to optimize both communication and computation complexities. Averaging at every step, or too often, negatively impacts the communication savings. Too infrequent averaging leads to error terms building up, as we see in the analysis, which has an adverse impact on convergence.

• For the online setting (3), our proposed approach achieves the optimal overall (aggregated across nodes) IFO complexity of $O\left(\frac{\sigma}{\epsilon^{3/2}} + \frac{\sigma^2}{\epsilon}\right)$. This result improves the long-standing best known result of $O(\sigma^2\epsilon^{-2})$, while also achieving the linear speedup attained in the existing literature. The communication complexity achieved, $O(\epsilon^{-1})$, is also the best known in the literature. To the best of our knowledge, the only other work to achieve the same communication complexity is [21].

• For the finite-sum problem (2), our proposed approach achieves the overall IFO complexity of $O(\sqrt{Nn}\,\epsilon^{-1})$. For problems where $N \times n \leq O(\epsilon^{-2})$, this is the optimal result one can achieve, even if all the $N \times n$ functions are available at a single location [31]. At the same time, we also achieve the best known communication complexity $O(\epsilon^{-1})$.

• Compared to several existing approaches which require the samples across nodes to follow the same distribution, our approach is more general in the sense that the data distributions across nodes may be different (the federated learning problem [3]).
1.4 Paper Organization

The rest of the paper is organized as follows: in Section 2, we discuss our approach to solve the finite-sum problem (2). We propose PR-SPIDER, a distributed, parallel variant of the SPIDER algorithm [31, 32], followed by its convergence analysis. In Section 3, we solve the online problem (3). The algorithm proposed, and the accompanying convergence analysis, build upon the finite-sum approach and the analysis from the previous section. Finally, we conclude the paper in Section 4. All the proofs are in the Appendices at the end.

²See Assumption (A1) in Section 1.5.

1.5 Notations and Assumptions

Given a positive integer $N$, the set $\{1, \ldots, N\}$ is denoted by the shorthand $[1:N]$. $\|\cdot\|$ denotes the vector $\ell_2$ norm. Boldface letters are used to denote vectors. We use $x \wedge y$ to denote the minimum of two numbers $x, y \in \mathbb{R}$. The following assumptions hold for the rest of the paper.

(A1) Lipschitz-ness: All the functions are mean-squared $L$-smooth:³
$$\mathbb{E}_\xi \|\nabla f(x;\xi) - \nabla f(y;\xi)\|^2 \leq L^2 \|x - y\|^2, \quad \forall\, x, y \in \mathbb{R}^d. \qquad (4)$$

(A2) All the nodes begin the algorithm from the same starting point $x_0$.

2 Parallel Restarted SPIDER - Finite Sum Case

We consider a network of $N$ worker nodes connected to a server node. The objective is to solve (2). Note that the number of sample functions at different nodes $i \neq j$, for $i, j \in [1:N]$, can be non-uniform, i.e., $n_i \neq n_j$. However, to ease the notational burden slightly, we assume $n_j = n$, $\forall j \in [1:N]$.

2.1 Proposed Algorithm

The proposed algorithm is inspired by the recently proposed SPIDER algorithm [31, 32] for single-node stochastic nonconvex optimization. Like numerous variance-reduced approaches proposed in the literature, our algorithm also proceeds in epochs.
At the beginning of each epoch, each worker node has access to the same iterate, and to the full gradient $\nabla f(\cdot)$ computed at this value. These are used to compute the first iterate of the epoch, $x^{s+1}_{i,1}$, $\forall i \in [1:N]$. This is followed by the inner loop (steps 6-13). At iteration $t$ of the $s$-th epoch, the worker nodes first compute an estimator of the gradient, $v^{s+1}_{i,t}$ (step 7). This estimator is computed iteratively, using the previous estimate $v^{s+1}_{i,t-1}$, the current iterate $x^{s+1}_{i,t}$, and the previous iterate $x^{s+1}_{i,t-1}$. The sample set $\mathcal{B}^{s+1}_{i,t}$ of size $B$ is picked uniformly at random at each node, independently of the other nodes. Such an estimator has been proposed in the literature for single-node stochastic nonconvex optimization [31, 32, 41, 42], and has even been proved to be optimal in certain regimes. Using the gradient estimator, the worker node computes the next iterate $x^{s+1}_{i,t+1}$. This process is repeated $m-1$ times. Once every $I$ iterations of the inner loop (whenever $t \bmod I = 0$), the nodes send their local iterates and gradient estimators $\{x^{s+1}_{i,t}, v^{s+1}_{i,t}\}_{i=1}^N$ to the server node. The server, in turn, computes their averages and returns the averages $\{\bar{x}^{s+1}_t, \bar{v}^{s+1}_t\}$ to all the nodes (steps 8-11). The next iteration at each node proceeds using this iterate and direction. At the end of the inner loop ($t = m$), all the worker nodes send their local iterates $\{x^{s+1}_{i,m}\}_{i=1}^N$ to the server. The server computes the model average $\bar{x}^{s+1}_m$ and returns it to all the workers (steps 15-16). The workers compute the full gradients of their respective functions, $\{\nabla f_i(\bar{x}^{s+1}_m)\}_{i=1}^N$, at this point, and send them to the server. The server averages these, and returns the average (which is essentially $\nabla f(\bar{x}^{s+1}_m)$) to the worker nodes (steps 17-18).
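The epoch structure just described can be transcribed almost line-by-line for a toy problem. The sketch below uses $f_i(x;\xi) = \frac{1}{2}(x-\xi)^2$, for which the per-sample gradient difference in the recursive update collapses to $x_t - x_{t-1}$, so the mini-batch sampling is omitted; all parameter values are illustrative and the step numbers refer to Algorithm 1.

```python
import numpy as np

def pr_spider_toy(data, gamma=0.05, S=10, m=20, I=5):
    """Sketch of PR-SPIDER on f_i(x; xi) = 0.5 * (x - xi)^2, where the exact
    local gradient is x - mean(data[i]).  For this quadratic, the step-7 batch
    average of [grad f_i(x_t; xi) - grad f_i(x_{t-1}; xi)] is exactly
    x_t - x_{t-1}, independent of the sampled batch."""
    means = np.array([np.mean(d) for d in data])
    N = len(data)
    x = np.zeros(N)                          # local iterates, common start (A2)
    v = np.full(N, -means.mean())            # v0 = grad f(x0) at x0 = 0 (step 1)
    for s in range(S):
        x_prev, x = x, x - gamma * v         # step 5: first iterate of the epoch
        for t in range(1, m):
            v = v + (x - x_prev)             # step 7: recursive SPIDER update
            if t % I == 0:                   # steps 8-11: within-epoch averaging
                x = np.full(N, x.mean())
                v = np.full(N, v.mean())
            x_prev, x = x, x - gamma * v     # step 12: local descent step
        x = np.full(N, x.mean())             # steps 15-16: model averaging
        v = np.full(N, (x - means).mean())   # steps 17-18: full-gradient restart
    return float(x.mean())
```

With worker data $\{1.0\}$ and $\{3.0\}$, the averaged iterate converges to the global minimizer 2; on this toy problem the estimator is exact, so the sketch only illustrates the communication pattern, not the variance behavior.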
Consequently, all the worker nodes start the next epoch at the same point, and along the same descent direction. This "restart" of the local computation is along the lines of Parallel Restarted SGD [18, 19]. Before we proceed with the proof of convergence of PR-SPIDER for the finite-sum problem (2), we outline the organization of the proof.

³This assumption might seem stringent. However, we can always choose $L$ as the maximum Lipschitz constant corresponding to all the functions across all nodes.

Organization of the Proof: In Section 2.2, we first present some preliminary results required to prove the main result of Section 2.3, given in Theorem 2.4. We first present Lemma 2.1, which bounds the variance of the average gradient estimates $\bar{v}^{s+1}_t$ across all agents. Proving Lemma 2.1 requires an intermediate result, given in Lemma 2.2, which bounds the network disagreements across gradient estimates and local iterates. In Section 2.3, we first prove the descent lemma, given in Lemma 2.3, using the results presented in Section 2.2. Finally, the main result of the section is presented in Theorem 2.4, which uses the preliminary results provided in Section 2.2 and Lemma 2.3.
Algorithm 1 PR-SPIDER - Finite Sum Case

1: Input: Initial iterate $x^0_{i,m} = x_0$, $v^0_{i,m} = \nabla f(x_0)$, $\forall i \in [1:N]$
2: for $s = 0$ to $S-1$ do
3:   $x^{s+1}_{i,0} = x^s_{i,m}$, $\forall i \in [1:N]$
4:   $v^{s+1}_{i,0} = v^s_{i,m}$, $\forall i \in [1:N]$
5:   $x^{s+1}_{i,1} = x^{s+1}_{i,0} - \gamma v^{s+1}_{i,0}$
6:   for $t = 1$ to $m-1$ do
7:     $v^{s+1}_{i,t} = \frac{1}{B} \sum_{\xi^{s+1}_{i,t} \in \mathcal{B}^{s+1}_{i,t}} \left[\nabla f_i\left(x^{s+1}_{i,t}; \xi^{s+1}_{i,t}\right) - \nabla f_i\left(x^{s+1}_{i,t-1}; \xi^{s+1}_{i,t}\right)\right] + v^{s+1}_{i,t-1}$, $\forall i \in [1:N]$ $\left(|\mathcal{B}^{s+1}_{i,t}| = B\right)$
8:     if $t \bmod I = 0$ then
9:       $x^{s+1}_{i,t} = \bar{x}^{s+1}_t \triangleq \frac{1}{N}\sum_{j=1}^N x^{s+1}_{j,t}$, $\forall i \in [1:N]$
10:      $v^{s+1}_{i,t} = \bar{v}^{s+1}_t \triangleq \frac{1}{N}\sum_{j=1}^N v^{s+1}_{j,t}$, $\forall i \in [1:N]$
11:     end if
12:     $x^{s+1}_{i,t+1} = x^{s+1}_{i,t} - \gamma v^{s+1}_{i,t}$, $\forall i \in [1:N]$
13:   end for
14:   if $s < S-1$ then
15:     $\bar{x}^{s+1}_m = \frac{1}{N}\sum_{j=1}^N x^{s+1}_{j,m}$
16:     $x^{s+1}_{i,m} = \bar{x}^{s+1}_m$, $\forall i \in [1:N]$
17:     $\bar{v}^{s+1}_m = \frac{1}{N}\sum_{j=1}^N \nabla f_j\left(x^{s+1}_{j,m}\right) = \nabla f\left(\bar{x}^{s+1}_m\right)$
18:     $v^{s+1}_{i,m} = \bar{v}^{s+1}_m$, $\forall i \in [1:N]$
19:   end if
20: end for
21: Return

2.2 Preliminaries

We begin by stating a few preliminary results which shall be useful in the convergence proof. First, we bound the error in the average (across nodes) gradient estimator.

Lemma 2.1. (Gradient Estimate Error) For $0 < t < m$, $0 \leq s \leq S-1$, the sequences of iterates $\{x^{s+1}_{i,t}\}_{i,t}$ and $\{v^{s+1}_{i,t}\}_{i,t}$ generated by Algorithm 1 satisfy
$$\mathbb{E}\left\|\bar{v}^{s+1}_t - \frac{1}{N}\sum_{i=1}^N \nabla f_i\left(x^{s+1}_{i,t}\right)\right\|^2 \leq \underbrace{\mathbb{E}\left\|\bar{v}^{s+1}_0 - \frac{1}{N}\sum_{i=1}^N \nabla f_i\left(x^{s+1}_{i,0}\right)\right\|^2}_{E^{s+1}_0} + \frac{L^2}{N^2 B}\sum_{i=1}^N \sum_{\ell=0}^{t-1} \mathbb{E}\left\|x^{s+1}_{i,\ell+1} - x^{s+1}_{i,\ell}\right\|^2 \qquad (5)$$

Proof. See Appendix A.1.

The error in the average gradient estimate at time $t$ is bounded in terms of the corresponding error at the beginning of the epoch, and the average cumulative difference of consecutive local iterates. In (5), we explicitly define $E^{s+1}_0 \triangleq \mathbb{E}\left\|\bar{v}^{s+1}_0 - \frac{1}{N}\sum_{i=1}^N \nabla f_i\left(x^{s+1}_{i,0}\right)\right\|^2$.
Note that by Algorithm 1 (for finite-sum problems),
$$\bar{v}^{s+1}_0 = \frac{1}{N}\sum_{i=1}^N v^{s+1}_{i,0} = \frac{1}{N}\sum_{i=1}^N v^s_{i,m} = \bar{v}^s_m \quad \text{(steps 4, 18 in Algorithm 1)}$$
$$= \frac{1}{N}\sum_{j=1}^N \nabla f_j\left(\bar{x}^s_m\right) \quad \text{(steps 15-17 in Algorithm 1)}$$
$$= \frac{1}{N}\sum_{j=1}^N \nabla f_j\left(x^{s+1}_{j,0}\right) \quad \text{(step 3 in Algorithm 1)}$$
Therefore, $E^{s+1}_0 = 0$, $\forall s$. However, we retain it (as in [5]), as it shall be needed in the analysis of online problems.

Next, we bound the network disagreements of the local estimates relative to the global averages. There are two such error terms, corresponding to: 1) the local gradient estimators, and 2) the local iterates. For this purpose, we first define
$$\tau(\ell) = \begin{cases} \arg\max_j \{j \mid j < \ell,\ j \bmod I = 0\} & \text{if } \ell \bmod I \neq 0 \\ \ell & \text{otherwise.} \end{cases} \qquad (6)$$
Note that $\tau(\ell)$ is the largest iteration index smaller than (or equal to) $\ell$ which is a multiple of $I$. Basically, looking back from $\ell$, $\tau(\ell)$ is the latest time index at which averaging happened in the current epoch (steps 9-10).

Lemma 2.2. (Network Disagreements) Given $0 \leq \ell \leq m$, $\alpha > 0$, $\delta > 0$, $\theta > 0$ such that $\theta = \delta + 8\gamma^2 L^2 \left(1 + \frac{1}{\delta}\right)$. For $\ell \bmod I \neq 0$,
$$\sum_{i=1}^N \mathbb{E}\left\|v^{s+1}_{i,\ell} - \bar{v}^{s+1}_\ell\right\|^2 \leq 8\gamma^2 N L^2 \left(1 + \frac{1}{\delta}\right) \sum_{j=\tau(\ell)}^{\ell-1} (1+\theta)^{\ell-1-j}\, \mathbb{E}\left\|\bar{v}^{s+1}_j\right\|^2 \qquad (7)$$
$$\sum_{i=1}^N \mathbb{E}\left\|x^{s+1}_{i,\ell} - \bar{x}^{s+1}_\ell\right\|^2 \leq \left(1 + \frac{1}{\alpha}\right) \gamma^2 \sum_{i=1}^N \sum_{j=\tau(\ell)+1}^{\ell-1} (1+\alpha)^{\ell-1-j}\, \mathbb{E}\left\|v^{s+1}_{i,j} - \bar{v}^{s+1}_j\right\|^2 \qquad (8)$$
If $\ell \bmod I = 0$, then $\sum_{i=1}^N \mathbb{E}\|v^{s+1}_{i,\ell} - \bar{v}^{s+1}_\ell\|^2 = \sum_{i=1}^N \mathbb{E}\|x^{s+1}_{i,\ell} - \bar{x}^{s+1}_\ell\|^2 = 0$.

Proof. See Appendix A.2.

Lemma 2.2 quantifies the network error at any time instant $\ell$. Owing to the frequent averaging (every $I$ iterations), this error build-up is limited. One can easily see that in the absence of within-epoch averaging (steps 9-10), the sum over time in (7) would start at $j = 0$, rather than $j = \tau(\ell)$, leading to a greater error build-up. The same reasoning holds for (8).
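The index map $\tau(\ell)$ in (6) is simple to state in code (a one-line helper of ours, with the averaging period $I$ passed explicitly):

```python
def tau(l, I):
    """Largest index j <= l that is a multiple of the averaging period I
    (eq. (6)): the latest within-epoch averaging time not after l."""
    return l if l % I == 0 else (l // I) * I
```

For example, with $I = 5$: $\tau(4) = 0$, $\tau(7) = 5$, and $\tau(10) = 10$.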
Note that we can further bound (8) by substituting the bound on $\sum_{i=1}^N \mathbb{E}\|v^{s+1}_{i,j} - \bar{v}^{s+1}_j\|^2$ from (7). Next, we present a descent lemma which, along with the preliminary results of this section, is then used to prove the convergence of PR-SPIDER for the finite-sum problem (2).

2.3 Convergence Analysis

Given $\bar{x}^{s+1}_{t+1} = \bar{x}^{s+1}_t - \gamma \bar{v}^{s+1}_t$, using the $L$-Lipschitz gradient property of $f$,
$$\mathbb{E} f\left(\bar{x}^{s+1}_{t+1}\right) \leq \mathbb{E} f\left(\bar{x}^{s+1}_t\right) - \gamma\, \mathbb{E}\left\langle \nabla f\left(\bar{x}^{s+1}_t\right), \bar{v}^{s+1}_t \right\rangle + \frac{\gamma^2 L}{2}\, \mathbb{E}\left\|\bar{v}^{s+1}_t\right\|^2$$
$$= \mathbb{E} f\left(\bar{x}^{s+1}_t\right) - \frac{\gamma}{2}\, \mathbb{E}\left\|\nabla f\left(\bar{x}^{s+1}_t\right)\right\|^2 - \frac{\gamma}{2}(1 - L\gamma)\, \mathbb{E}\left\|\bar{v}^{s+1}_t\right\|^2 + \frac{\gamma}{2}\, \mathbb{E}\left\|\nabla f\left(\bar{x}^{s+1}_t\right) - \bar{v}^{s+1}_t\right\|^2 \qquad (9)$$
where in (9) we use $\langle a, b \rangle = \frac{\|a\|^2 + \|b\|^2 - \|a-b\|^2}{2}$. Next, we upper bound the last term in (9). We assume $t > 0$ (for $t = 0$, $\mathbb{E}\|\nabla f(\bar{x}^{s+1}_0) - \bar{v}^{s+1}_0\|^2 = 0$; see steps 4, 17 in Algorithm 1).
$$\mathbb{E}\left\|\nabla f\left(\bar{x}^{s+1}_t\right) - \bar{v}^{s+1}_t\right\|^2 = \mathbb{E}\left\|\frac{1}{N}\sum_{i=1}^N \left[\nabla f_i\left(\bar{x}^{s+1}_t\right) - \nabla f_i\left(x^{s+1}_{i,t}\right) + \nabla f_i\left(x^{s+1}_{i,t}\right) - v^{s+1}_{i,t}\right]\right\|^2$$
$$\overset{(b)}{\leq} 2\,\mathbb{E}\left\|\frac{1}{N}\sum_{i=1}^N \left[\nabla f_i\left(\bar{x}^{s+1}_t\right) - \nabla f_i\left(x^{s+1}_{i,t}\right)\right]\right\|^2 + 2\,\mathbb{E}\left\|\frac{1}{N}\sum_{i=1}^N \left[v^{s+1}_{i,t} - \nabla f_i\left(x^{s+1}_{i,t}\right)\right]\right\|^2 \qquad (10)$$
$$\overset{(c)}{\leq} \frac{2L^2}{N}\sum_{i=1}^N \mathbb{E}\left\|x^{s+1}_{i,t} - \bar{x}^{s+1}_t\right\|^2 + 2\,\mathbb{E}\left\|\bar{v}^{s+1}_t - \frac{1}{N}\sum_{i=1}^N \nabla f_i\left(x^{s+1}_{i,t}\right)\right\|^2 \qquad (11)$$
where (10), (11) follow from $\mathbb{E}\|\sum_{i=1}^n X_i\|^2 \leq n\sum_{i=1}^n \mathbb{E}\|X_i\|^2$ and the mean-squared $L$-smoothness assumption (A1). More precisely,
$$\mathbb{E}\left\|\frac{1}{N}\sum_{i=1}^N \left[\nabla f_i\left(\bar{x}^{s+1}_t\right) - \nabla f_i\left(x^{s+1}_{i,t}\right)\right]\right\|^2 \leq \frac{1}{N}\sum_{i=1}^N \mathbb{E}\left\|\nabla f_i\left(\bar{x}^{s+1}_t\right) - \nabla f_i\left(x^{s+1}_{i,t}\right)\right\|^2 \quad \text{(using Jensen's inequality)}$$
$$= \frac{1}{N}\sum_{i=1}^N \mathbb{E}\left\|\nabla \mathbb{E}_\xi f_i\left(\bar{x}^{s+1}_t; \xi\right) - \nabla \mathbb{E}_\xi f_i\left(x^{s+1}_{i,t}; \xi\right)\right\|^2$$
$$\leq \frac{1}{N}\sum_{i=1}^N \mathbb{E}\,\mathbb{E}_\xi\left\|\nabla f_i\left(\bar{x}^{s+1}_t; \xi\right) - \nabla f_i\left(x^{s+1}_{i,t}; \xi\right)\right\|^2 \quad \text{(using Jensen's inequality)}$$
$$\leq \frac{L^2}{N}\sum_{i=1}^N \mathbb{E}\left\|\bar{x}^{s+1}_t - x^{s+1}_{i,t}\right\|^2, \quad \text{using (A1).}$$
using (A1) The first term in (1 1 ) is the net work err or of lo ca l itera tes, which is upp er b ounded in Lemma 2.2. Using Lemma 2.1 to bo und the second term o f (11), we ge t E   ∇ f  ¯ x s +1 t  − ¯ v s +1 t   2 ≤ 2 L 2 N N X i =1 E   x s +1 i,t − ¯ x s +1 t   2 + 2 E s +1 0 + 2 L 2 N 2 B N X i =1 t − 1 X ℓ =0 E   x s +1 i,ℓ +1 − x s +1 i,ℓ   2 ≤ 2 L 2 N N X i =1 E   x s +1 i,t − ¯ x s +1 t   2 + 2 γ 2 L 2 N 2 B N X i =1 t − 1 X ℓ =0 E   v s +1 i,ℓ − ¯ v s +1 ℓ + ¯ v s +1 ℓ   2 + 2 E s +1 0 (12) ≤ 2 L 2 N N X i =1 E   x s +1 i,t − ¯ x s +1 t   2 + 4 γ 2 L 2 N 2 B N X i =1 t − 1 X ℓ =0 h E   v s +1 i,ℓ − ¯ v s +1 ℓ   2 + E   ¯ v s +1 ℓ   2 i + 2 E s +1 0 (13) where (12) follows since x s +1 i,ℓ +1 = x s +1 i,ℓ − γ v s +1 i,ℓ . The second term in (13) implies that during the ep o ch, as t incr eases, in the absence o f any av er aging/co mm unica tion within an e p o ch, the e rror P t − 1 ℓ =0 E k v s +1 i,ℓ − ¯ v s +1 ℓ k 2 keeps building up. T o c heck this ra pid growth, we in tr o duce within-ep o ch av erag ing every I iterations (steps 9-10). Using the upp er bo unds o n the tw o netw ork err or terms in (13), from Lemma 2.2, we get 9 E   ∇ f  ¯ x s +1 t  − ¯ v s +1 t   2 ≤ 2 L 2 N  1 + 1 α  γ 2 t − 1 X ℓ = τ ( t )+1 (1 + α ) t − 1 − ℓ 8 γ 2 N L 2  1 + 1 δ  ℓ − 1 X j = τ ( ℓ ) (1 + θ ) ℓ − 1 − j E    ¯ v s +1 j    2 + 4 γ 2 L 2 N 2 B t − 1 X ℓ =0 8 γ 2 N L 2  1 + 1 δ  ℓ − 1 X j = τ ( ℓ ) (1 + θ ) ℓ − 1 − j E    ¯ v s +1 j    2 + 4 γ 2 L 2 N 2 B N t − 1 X ℓ =0 E    ¯ v s +1 ℓ    2 + 2 E s +1 0 . (14) W e substitute this upp er b ound in (9) to derive the following descent lemma on function v alues. Lemma 2.3 . 
(Descent Lemma) In each epoch $s \in [1, S]$,
$$\mathbb{E} f\big(\bar{x}_m^{s+1}\big) \le \mathbb{E} f\big(\bar{x}_0^{s+1}\big) - \frac{\gamma}{2}\sum_{t=0}^{m-1} \mathbb{E}\big\|\nabla f\big(\bar{x}_t^{s+1}\big)\big\|^2 - \frac{\gamma}{2}(1-L\gamma)\sum_{t=0}^{m-1} \mathbb{E}\big\|\bar{v}_t^{s+1}\big\|^2 + \sum_{t=0}^{m-1} \mathbb{E}\big\|\bar{v}_t^{s+1}\big\|^2 \bigg(\frac{64\gamma^5 L^4 m}{N B \delta^2} + \frac{2\gamma^3 L^2 m}{N B} + \frac{256\gamma^5 L^4}{\delta^4}\bigg). \qquad (15)$$

Proof. See Appendix A.3.

Lemma 2.3 quantifies the decay in function value across one epoch. Clearly, the extent of the decay depends on the precise values of the algorithm parameters $\gamma, \delta, B, m$. Rearranging the terms in (15) and summing over the epoch index $s$, we get
$$\frac{\gamma}{2}\sum_{t=0}^{m-1} \mathbb{E}\big\|\nabla f\big(\bar{x}_t^{s+1}\big)\big\|^2 + \frac{\gamma}{2}\bigg[1 - L\gamma - \bigg(\frac{128\gamma^4 L^4 m}{N B \delta^2} + \frac{4\gamma^2 L^2 m}{N B} + \frac{512\gamma^4 L^4}{\delta^4}\bigg)\bigg] \sum_{t=0}^{m-1} \mathbb{E}\big\|\bar{v}_t^{s+1}\big\|^2 \le \mathbb{E} f\big(\bar{x}_0^{s+1}\big) - \mathbb{E} f\big(\bar{x}_m^{s+1}\big) \qquad (16)$$
$$\Rightarrow \frac{1}{T}\sum_{s=0}^{S-1}\sum_{t=0}^{m-1} \mathbb{E}\big\|\nabla f\big(\bar{x}_t^{s+1}\big)\big\|^2 + \bigg[1 - L\gamma - \bigg(\frac{128\gamma^4 L^4 m}{N B \delta^2} + \frac{4\gamma^2 L^2 m}{N B} + \frac{512\gamma^4 L^4}{\delta^4}\bigg)\bigg] \frac{1}{T}\sum_{s=0}^{S-1}\sum_{t=0}^{m-1} \mathbb{E}\big\|\bar{v}_t^{s+1}\big\|^2$$
$$\le \frac{2}{T\gamma}\sum_{s=0}^{S-1} \Big[\mathbb{E} f\big(\bar{x}_0^{s+1}\big) - \mathbb{E} f\big(\bar{x}_m^{s+1}\big)\Big] = \frac{2}{T\gamma}\Big[\mathbb{E} f\big(\bar{x}^0\big) - \mathbb{E} f\big(\bar{x}_m^{S}\big)\Big] \le \frac{2\big(f(x^0) - f^*\big)}{T\gamma}, \qquad (17)$$
where in (16) we have collected all the terms in (15) containing $\sum_{t=0}^{m-1} \mathbb{E}\|\bar{v}_t^{s+1}\|^2$ on the left-hand side. (16) is then summed across all epochs $s = 0, \ldots, S-1$, divided by $T$, and divided by $\gamma/2$ to get (17). Here $f^* = \min_x f(x)$. Note that in (17), for a small enough but constant step size $\gamma$, we can ensure that
$$1 - L\gamma - \bigg(\frac{128\gamma^4 L^4 m}{N B \delta^2} + \frac{4\gamma^2 L^2 m}{N B} + \frac{512\gamma^4 L^4}{\delta^4}\bigg) \ge \frac{1}{2}. \qquad (18)$$
See Appendix A.4 for a choice of $\gamma$ which satisfies (18). Next, using (17) and (18), we prove the convergence of PR-SPIDER for the finite-sum case (2).

Theorem 2.4. (Convergence) For the finite-sum problem under Assumptions (A1)-(A2), for a small enough step size $0 < \gamma < \frac{1}{8 I L}$,
$$\min_{s,t}\bigg[\mathbb{E}\big\|\nabla f\big(\bar{x}_t^{s+1}\big)\big\|^2 + \frac{1}{N}\sum_{i=1}^N \mathbb{E}\big\|x_{i,t}^{s+1} - \bar{x}_t^{s+1}\big\|^2\bigg] \le \frac{2\big(f(x^0) - f^*\big)}{T\gamma}.$$
(19)

Proof. See Appendix A.5.

The above result can now be directly used to compute bounds on the sample complexity (see Definition 1.2) and the communication complexity (see Definition 1.3) of the proposed algorithm PR-SPIDER for the finite-sum problem (2).

2.4 Sample Complexity

The number of iterations $T$ satisfies
$$\frac{2\big(f(x^0) - f^*\big)}{T\gamma} = \epsilon \;\Rightarrow\; T = \frac{2\big(f(x^0) - f^*\big)}{\gamma \epsilon} = \frac{C I}{\epsilon}, \qquad (20)$$
for constants $C, I$. Then, for $m = I\sqrt{Nn}$ and $B = \frac{1}{I}\sqrt{\frac{n}{N}}$, the total number of IFO calls is
$$N \bigg(\Big\lceil \frac{T}{m} \Big\rceil \cdot n + T \cdot B\bigg) \le N \bigg(\Big(\frac{C I}{m \epsilon} + 1\Big) \cdot n + \frac{C I}{\epsilon} \cdot B\bigg) \le N \bigg(\frac{C I}{\epsilon}\Big(\frac{n}{m} + B\Big) + n\bigg) = O\bigg(\frac{\sqrt{Nn}}{\epsilon} + Nn\bigg) = O\bigg(\frac{\sqrt{Nn}}{\epsilon}\bigg)$$
for $Nn \le O(\epsilon^{-2})$. Note that this is the optimal sample complexity achieved for $Nn \le O(\epsilon^{-2})$ in the single-node case [31], [32]. Each of the $N$ nodes needs to compute $O\big(\frac{1}{\epsilon}\sqrt{\frac{n}{N}}\big)$ stochastic gradients.

2.5 Communication Complexity

Since communication happens once every $I$ iterations, the communication complexity (the number of communication rounds) is
$$\Big\lceil \frac{T}{I} \Big\rceil \le 1 + \frac{C}{\epsilon} = O\Big(\frac{1}{\epsilon}\Big).$$

3 Parallel Restarted SPIDER - Online Case

We consider a network of $N$ worker nodes connected to a server node. The objective is to solve (3). Note that the distributions of the samples at different nodes can potentially be different, i.e., $\mathcal{D}_i \ne \mathcal{D}_j$ for $i \ne j$.

3.1 Proposed Algorithm

The pseudo-code of the approach is given in Algorithm 2. In the following, we only highlight the steps that differ from Algorithm 1. The proposed algorithm is largely the same as Algorithm 1, except for a few changes to account for the fact that for problem (3), exact gradients can never be computed. Hence, full gradient computations are replaced by batch stochastic gradient computations, with batches of size $n_b$. Again, batch sizes across nodes can be non-uniform; however, we use a common batch size for notational simplicity.
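The local SPIDER updates with periodic in-epoch averaging can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the quadratic local objectives, the multiplicative-noise gradient oracle, and all constants ($N$, $m$, $I$, $B$, $\gamma$, the minimizers $c_i$) are assumptions made here purely for demonstration.

```python
import random

# Minimal sketch of one epoch of the PR-SPIDER inner loop (steps 3-13):
# each node i holds an assumed toy objective f_i(x) = 0.5*(x - c[i])^2 with a
# multiplicative-noise stochastic gradient; N, m, I, B, gamma are illustrative.
random.seed(0)
N, m, I, B, gamma = 4, 12, 3, 2, 0.1
c = [0.0, 1.0, 2.0, 3.0]  # assumed local minimizers

def stoch_grad(i, x, eps):
    """One stochastic gradient of f_i at x; eps plays the role of the sample xi."""
    return (x - c[i]) * (1.0 + eps)

# Common starting point and common initial direction (input / steps 3-5).
x_prev = [0.0] * N
v = [sum(stoch_grad(i, 0.0, 0.0) for i in range(N)) / N] * N
x = [xp - gamma * vi for xp, vi in zip(x_prev, v)]

for t in range(1, m):
    for i in range(N):
        # Step 7: SPIDER estimator; the SAME samples are used at x and x_prev.
        batch = [0.1 * random.gauss(0.0, 1.0) for _ in range(B)]
        v[i] += sum(stoch_grad(i, x[i], e) - stoch_grad(i, x_prev[i], e)
                    for e in batch) / B
    if t % I == 0:
        # Steps 9-10: within-epoch averaging of iterates and estimators.
        x = [sum(x) / N] * N
        v = [sum(v) / N] * N
    # Step 12: local descent step.
    x_prev, x = x, [xi - gamma * vi for xi, vi in zip(x, v)]

spread = max(x) - min(x)  # stays small: averaging keeps the nodes nearly in sync
```

Without the `t % I == 0` averaging block, the per-node estimators drift apart freely, which is exactly the error accumulation that the analysis in Section 2.3 has to control.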
At the end of the inner loop ($t = m$), first the local iterates are averaged and returned to the workers (steps 15-16). Next, the workers compute batch stochastic gradients of their respective functions, $\{\frac{1}{n_b}\sum_{\xi_i} \nabla f_i(\cdot\,; \xi_i)\}_{i=1}^N$, at the common point $\bar{x}_m^{s+1}$, and send these to the server. The server averages them and returns the average $\bar{v}_m^{s+1}$ to the worker nodes (steps 17-18). As in Algorithm 1, all the worker nodes start the next epoch at the same point and along the same descent direction. However, unlike in Algorithm 1, this direction is not $\nabla f(\bar{x}_m^{s+1})$.

Algorithm 2 PR-SPIDER - Online Case
1: Input: initial iterate $x_{i,m}^0 = x^0$, $v_{i,m}^0 = \frac{1}{N}\sum_{j=1}^N \frac{1}{n_b}\sum_{\xi_j} \nabla f_j(x^0; \xi_j)$, $\forall i \in [1:N]$
2: for $s = 0$ to $S-1$ do
3:   $x_{i,0}^{s+1} = x_{i,m}^s$, $\forall i \in [1:N]$
4:   $v_{i,0}^{s+1} = v_{i,m}^s$, $\forall i \in [1:N]$
5:   $x_{i,1}^{s+1} = x_{i,0}^{s+1} - \gamma v_{i,0}^{s+1}$
6:   for $t = 1$ to $m-1$ do
7:     $v_{i,t}^{s+1} = \frac{1}{B}\sum_{\xi_{i,t}^{s+1} \in \mathcal{B}_{i,t}^{s+1}} \big[\nabla f_i(x_{i,t}^{s+1}; \xi_{i,t}^{s+1}) - \nabla f_i(x_{i,t-1}^{s+1}; \xi_{i,t}^{s+1})\big] + v_{i,t-1}^{s+1}$, $\forall i \in [1:N]$  ($|\mathcal{B}_{i,t}^{s+1}| = B$)
8:     if $t \bmod I = 0$ then
9:       $x_{i,t}^{s+1} = \bar{x}_t^{s+1} \triangleq \frac{1}{N}\sum_{j=1}^N x_{j,t}^{s+1}$, $\forall i \in [1:N]$
10:      $v_{i,t}^{s+1} = \bar{v}_t^{s+1} \triangleq \frac{1}{N}\sum_{j=1}^N v_{j,t}^{s+1}$, $\forall i \in [1:N]$
11:    end if
12:    $x_{i,t+1}^{s+1} = x_{i,t}^{s+1} - \gamma v_{i,t}^{s+1}$, $\forall i \in [1:N]$
13:  end for
14:  if $s < S-1$ then
15:    $\bar{x}_m^{s+1} = \frac{1}{N}\sum_{j=1}^N x_{j,m}^{s+1}$
16:    $x_{i,m}^{s+1} = \bar{x}_m^{s+1}$, $\forall i \in [1:N]$
17:    $\bar{v}_m^{s+1} = \frac{1}{N}\sum_{i=1}^N \frac{1}{n_b}\sum_{\xi_i} \nabla f_i(x_{i,m}^{s+1}; \xi_i)$
18:    $v_{i,m}^{s+1} = \bar{v}_m^{s+1}$, $\forall i \in [1:N]$
19:  end if
20: end for
21: Return

3.2 Additional Assumptions

(A3) Bounded Variance: There exists a constant $\sigma$ such that
$$\mathbb{E}_\xi \big\|\nabla f_i(x; \xi) - \nabla f_i(x)\big\|^2 \le \sigma^2, \quad \forall i \in [1:N]. \qquad (21)$$

3.3 Convergence Analysis

Lemma 3.1.
(Bounded Variance) For $0 \le s \le S-1$, the sequences of iterates $\{x_{i,t}^{s+1}\}_{i,t}$ and $\{v_{i,t}^{s+1}\}_{i,t}$ generated by Algorithm 2 satisfy
$$\mathcal{E}_0^{s+1} = \mathbb{E}\bigg\|\bar{v}_0^{s+1} - \frac{1}{N}\sum_{i=1}^N \nabla f_i\big(x_{i,0}^{s+1}\big)\bigg\|^2 \le \frac{\sigma^2}{N n_b}. \qquad (22)$$

Proof. See Appendix B.1.

Compared to the finite-sum setting, the only difference now is that $\mathcal{E}_0^{s+1} \ne 0$. Both Lemma 2.1 and Lemma 2.2 continue to hold. Next, we state the descent lemma for the online case.

Lemma 3.2. (Descent Lemma) In each epoch $s \in [1, S]$,
$$\mathbb{E} f\big(\bar{x}_m^{s+1}\big) \le \mathbb{E} f\big(\bar{x}_0^{s+1}\big) - \frac{\gamma}{2}\sum_{t=0}^{m-1} \mathbb{E}\big\|\nabla f\big(\bar{x}_t^{s+1}\big)\big\|^2 - \frac{\gamma}{2}(1-L\gamma)\sum_{t=0}^{m-1} \mathbb{E}\big\|\bar{v}_t^{s+1}\big\|^2 + \sum_{t=0}^{m-1} \mathbb{E}\big\|\bar{v}_t^{s+1}\big\|^2 \bigg(\frac{64\gamma^5 L^4 m}{N B \delta^2} + \frac{2\gamma^3 L^2 m}{N B} + \frac{256\gamma^5 L^4}{\delta^4}\bigg) + \frac{\gamma \sigma^2 m}{N n_b}. \qquad (23)$$

Proof. See Appendix B.2.

Note that $\frac{\gamma \sigma^2 m}{N n_b}$ is the only extra term compared to (15). Rearranging the terms in (23) and summing over the epoch index $s$, analogous to (16) and (17), we get
$$\frac{\gamma}{2}\sum_{t=0}^{m-1} \mathbb{E}\big\|\nabla f\big(\bar{x}_t^{s+1}\big)\big\|^2 + \frac{\gamma}{2}\bigg[1 - L\gamma - \bigg(\frac{128\gamma^4 L^4 m}{N B \delta^2} + \frac{4\gamma^2 L^2 m}{N B} + \frac{512\gamma^4 L^4}{\delta^4}\bigg)\bigg]\sum_{t=0}^{m-1} \mathbb{E}\big\|\bar{v}_t^{s+1}\big\|^2 \le \mathbb{E} f\big(\bar{x}_0^{s+1}\big) - \mathbb{E} f\big(\bar{x}_m^{s+1}\big) + \frac{\gamma \sigma^2 m}{N n_b} \qquad (24)$$
$$\Rightarrow \frac{1}{T}\sum_{s=0}^{S-1}\sum_{t=0}^{m-1} \mathbb{E}\big\|\nabla f\big(\bar{x}_t^{s+1}\big)\big\|^2 + \bigg[1 - L\gamma - \bigg(\frac{128\gamma^4 L^4 m}{N B \delta^2} + \frac{4\gamma^2 L^2 m}{N B} + \frac{512\gamma^4 L^4}{\delta^4}\bigg)\bigg]\frac{1}{T}\sum_{s=0}^{S-1}\sum_{t=0}^{m-1} \mathbb{E}\big\|\bar{v}_t^{s+1}\big\|^2$$
$$\le \frac{2}{T\gamma}\sum_{s=0}^{S-1}\Big[\mathbb{E} f\big(\bar{x}_0^{s+1}\big) - \mathbb{E} f\big(\bar{x}_m^{s+1}\big)\Big] + \frac{2}{T\gamma}\sum_{s=0}^{S-1} \frac{\gamma \sigma^2 m}{N n_b} = \frac{2}{T\gamma}\Big[\mathbb{E} f\big(\bar{x}^0\big) - \mathbb{E} f\big(\bar{x}_m^{S}\big)\Big] + \frac{2\sigma^2}{N n_b} \le \frac{2\big(f(x^0) - f^*\big)}{T\gamma} + \frac{2\sigma^2}{N n_b}. \qquad (25)$$
Similar to (17), for a small enough step size $\gamma$, (18) holds. Next, we present the convergence of PR-SPIDER for the online case (3).

Theorem 3.3.
(Convergence) For the online problem (3) under Assumptions (A1)-(A3), for a small enough step size $0 < \gamma < \frac{1}{8 I L}$,
$$\min_{s,t}\bigg[\mathbb{E}\big\|\nabla f\big(\bar{x}_t^{s+1}\big)\big\|^2 + \frac{1}{N}\sum_{i=1}^N \mathbb{E}\big\|x_{i,t}^{s+1} - \bar{x}_t^{s+1}\big\|^2\bigg] \le \frac{2\big(f(x^0) - f^*\big)}{T\gamma} + \frac{2\sigma^2}{N n_b}. \qquad (26)$$

Proof. See Appendix B.3.

Note that, in comparison to Theorem 2.4, Theorem 3.3 has an additional term $2\sigma^2/(N n_b)$, which accounts for the variance of the gradients computed at the start of each epoch. As in the finite-sum case (2), we use Theorem 3.3 to compute bounds on the sample complexity (see Definition 1.2) and the communication complexity (see Definition 1.3) of the proposed algorithm for the online problem (3). In the following, we choose $T$ and the parameters $n_b$, $m$, and $B$ such that the algorithm reaches an $\epsilon$-FoS point (see Definition 1.1) while the sample and communication complexities are minimized.

3.4 Sample Complexity

The number of iterations $T$ satisfies
$$\frac{2\big(f(x^0) - f^*\big)}{T\gamma} = \epsilon \;\Rightarrow\; T = \frac{2\big(f(x^0) - f^*\big)}{\gamma \epsilon} = \frac{C I}{\epsilon}, \qquad (27)$$
for constants $C, I$, and the batch size $n_b$ used to compute the gradient estimators at the start of each epoch satisfies
$$\frac{2\sigma^2}{N n_b} = \frac{\epsilon}{2} \;\Rightarrow\; n_b = \frac{4\sigma^2}{N\epsilon}. \qquad (28)$$
Then, for $m = I\sqrt{N n_b}$ and $B = \frac{1}{I}\sqrt{\frac{n_b}{N}}$, the total number of IFO calls is
$$N\bigg(\Big\lceil \frac{T}{m} \Big\rceil \cdot n_b + T \cdot B\bigg) \le N\bigg(\Big(\frac{C I}{m\epsilon} + 1\Big) \cdot n_b + \frac{C I}{\epsilon} \cdot B\bigg) \le N\bigg(\frac{C I}{\epsilon}\Big(\frac{n_b}{m} + B\Big) + n_b\bigg) = O\bigg(\frac{\sqrt{N n_b}}{\epsilon} + N n_b\bigg) = O\bigg(\frac{\sigma}{\epsilon^{3/2}} + \frac{\sigma^2}{\epsilon}\bigg).$$

3.5 Communication Complexity

Since communication happens once every $I$ iterations, the communication complexity is
$$\Big\lceil \frac{T}{I} \Big\rceil \le 1 + \frac{C}{\epsilon} = O\Big(\frac{1}{\epsilon}\Big).$$

4 Conclusion

In this paper, we proposed a distributed variance-reduced algorithm, PR-SPIDER, for stochastic non-convex optimization. Our algorithm is a non-trivial extension of SPIDER [31], the single-node stochastic optimization algorithm, and of parallel-restarted SGD [19, 21].
We proved convergence of our algorithm to a first-order stationary solution. The proposed approach achieves the best known communication complexity $O(\epsilon^{-1})$. In terms of IFO complexity, we achieved the optimal rate, significantly improving upon the state of the art, while maintaining the linear speedup achieved by existing methods. For finite-sum problems, we achieved the optimal $O\big(\frac{\sqrt{Nn}}{\epsilon}\big)$ overall IFO complexity. For online problems, we achieved the optimal $O\big(\frac{\sigma}{\epsilon^{3/2}} + \frac{\sigma^2}{\epsilon}\big)$, a massive improvement over the existing $O\big(\frac{\sigma^2}{\epsilon^2}\big)$. In addition, unlike many existing approaches, our algorithm is general enough to allow non-identical distributions of data across workers.

References

[1] E. P. Xing, Q. Ho, P. Xie, and D. Wei, "Strategies and principles of distributed machine learning on big data," Engineering, vol. 2, no. 2, pp. 179–195, 2016.
[2] E. P. Xing, Q. Ho, W. Dai, J. K. Kim, J. Wei, S. Lee, X. Zheng, P. Xie, A. Kumar, and Y. Yu, "Petuum: A new platform for distributed machine learning on big data," IEEE Transactions on Big Data, vol. 1, no. 2, pp. 49–67, June 2015.
[3] J. Konečný, H. B. McMahan, D. Ramage, and P. Richtárik, "Federated optimization: Distributed machine learning for on-device intelligence," arXiv preprint arXiv:1610.02527, 2016.
[4] T. Léauté and B. Faltings, "Protecting privacy through distributed computation in multi-agent decision making," Journal of Artificial Intelligence Research, vol. 47, pp. 649–695, 2013.
[5] H. Sun, S. Lu, and M. Hong, "Improving the sample and communication complexity for decentralized non-convex optimization: A joint gradient estimation and tracking approach," arXiv preprint arXiv:1910.05857, 2019.
[6] A. Agarwal and L. Bottou, "A lower bound for the optimization of finite sums," arXiv preprint arXiv:1410.0723, 2014.
[7] A. Nemirovski, A.
Juditsky, G. Lan, and A. Shapiro, "Robust stochastic approximation approach to stochastic programming," SIAM Journal on Optimization, vol. 19, no. 4, pp. 1574–1609, 2009.
[8] E. Moulines and F. R. Bach, "Non-asymptotic analysis of stochastic approximation algorithms for machine learning," in Advances in Neural Information Processing Systems, 2011, pp. 451–459.
[9] S. Ghadimi and G. Lan, "Stochastic first- and zeroth-order methods for nonconvex stochastic programming," SIAM Journal on Optimization, vol. 23, no. 4, pp. 2341–2368, 2013.
[10] Y. Arjevani, Y. Carmon, J. C. Duchi, D. J. Foster, N. Srebro, and B. Woodworth, "Lower bounds for non-convex stochastic optimization," arXiv preprint arXiv:1912.02365, 2019.
[11] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao, "Optimal distributed online prediction using mini-batches," Journal of Machine Learning Research, vol. 13, no. Jan, pp. 165–202, 2012.
[12] M. Li, D. G. Andersen, A. J. Smola, and K. Yu, "Communication efficient distributed machine learning with the parameter server," in Advances in Neural Information Processing Systems, 2014, pp. 19–27.
[13] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li, "TernGrad: Ternary gradients to reduce communication in distributed deep learning," in Advances in Neural Information Processing Systems, 2017, pp. 1509–1519.
[14] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, "QSGD: Communication-efficient SGD via gradient quantization and encoding," in Advances in Neural Information Processing Systems, 2017, pp. 1709–1720.
[15] N. Dryden, T. Moon, S. A. Jacobs, and B. Van Essen, "Communication quantization for data-parallel training of deep neural networks," in 2016 2nd Workshop on Machine Learning in HPC Environments (MLHPC). IEEE, 2016, pp. 1–8.
[16] A. F. Aji and K.
Heafield, "Sparse communication for distributed gradient descent," arXiv preprint arXiv:1704.05021, 2017.
[17] J. Zhang, C. De Sa, I. Mitliagkas, and C. Ré, "Parallel SGD: When does averaging help?" arXiv preprint arXiv:1606.07365, 2016.
[18] S. U. Stich, "Local SGD converges fast and communicates little," arXiv preprint arXiv:1805.09767, 2018.
[19] H. Yu, S. Yang, and S. Zhu, "Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 5693–5700.
[20] H. Yu, R. Jin, and S. Yang, "On the linear speedup analysis of communication efficient momentum SGD for distributed non-convex optimization," arXiv preprint arXiv:1905.03817, 2019.
[21] H. Yu and R. Jin, "On the computation and communication complexity of parallel SGD with dynamic batch sizes for stochastic non-convex optimization," in International Conference on Machine Learning, 2019, pp. 7174–7183.
[22] P. Jiang and G. Agrawal, "A linear speedup analysis of distributed deep learning with sparse and quantized communication," in Advances in Neural Information Processing Systems, 2018, pp. 2525–2536.
[23] T. Chen, G. Giannakis, T. Sun, and W. Yin, "LAG: Lazily aggregated gradient for communication-efficient distributed learning," in Advances in Neural Information Processing Systems, 2018, pp. 5050–5060.
[24] F. Haddadpour, M. M. Kamani, M. Mahdavi, and V. Cadambe, "Trading redundancy for communication: Speeding up distributed SGD for non-convex optimization," in International Conference on Machine Learning, 2019, pp. 2545–2554.
[25] T. Lin, S. U. Stich, K. K. Patel, and M. Jaggi, "Don't use large mini-batches, use local SGD," arXiv preprint arXiv:1808.07217, 2018.
[26] Y.
Nesterov, "Introductory lectures on convex programming volume I: Basic course," 1998.
[27] A. Defazio, F. Bach, and S. Lacoste-Julien, "SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives," in Advances in Neural Information Processing Systems, 2014, pp. 1646–1654.
[28] R. Johnson and T. Zhang, "Accelerating stochastic gradient descent using predictive variance reduction," in Advances in Neural Information Processing Systems, 2013, pp. 315–323.
[29] S. J. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola, "Stochastic variance reduction for nonconvex optimization," in International Conference on Machine Learning, 2016, pp. 314–323.
[30] Z. Allen-Zhu and E. Hazan, "Variance reduction for faster non-convex optimization," in International Conference on Machine Learning, 2016, pp. 699–707.
[31] C. Fang, C. J. Li, Z. Lin, and T. Zhang, "SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator," in Advances in Neural Information Processing Systems, 2018, pp. 689–699.
[32] Z. Wang, K. Ji, Y. Zhou, Y. Liang, and V. Tarokh, "SpiderBoost: A class of faster variance-reduced algorithms for nonconvex optimization," arXiv preprint arXiv:1810.10690, 2018.
[33] D. Zhou, P. Xu, and Q. Gu, "Stochastic nested variance reduced gradient descent for nonconvex optimization," in Advances in Neural Information Processing Systems, 2018, pp. 3921–3932.
[34] L. M. Nguyen, M. van Dijk, D. T. Phan, P. H. Nguyen, T.-W. Weng, and J. R. Kalagnanam, "Optimal finite-sum smooth non-convex optimization with SARAH," arXiv preprint arXiv:1901.07648, 2019.
[35] L. Lei and M. I. Jordan, "Less than a single pass: Stochastically controlled stochastic gradient method," arXiv preprint arXiv:1609.03261, 2016.
[36] S. J. Reddi, J. Konečný, P. Richtárik, B. Póczós, and A.
Smola, "AIDE: Fast and communication efficient distributed optimization," arXiv preprint arXiv:1608.06879, 2016.
[37] J. D. Lee, Q. Lin, T. Ma, and T. Yang, "Distributed stochastic variance reduced gradient methods by sampling extra data with replacement," The Journal of Machine Learning Research, vol. 18, no. 1, pp. 4404–4446, 2017.
[38] O. Shamir, N. Srebro, and T. Zhang, "Communication-efficient distributed optimization using an approximate Newton-type method," in International Conference on Machine Learning, 2014, pp. 1000–1008.
[39] J. Wang, W. Wang, and N. Srebro, "Memory and communication efficient distributed stochastic optimization with minibatch-prox," arXiv preprint arXiv:1702.06269, 2017.
[40] S. Cen, H. Zhang, Y. Chi, W. Chen, and T.-Y. Liu, "Convergence of distributed stochastic variance reduced methods without sampling extra data," arXiv preprint arXiv:1905.12648, 2019.
[41] L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takáč, "SARAH: A novel method for machine learning problems using stochastic recursive gradient," in Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 2017, pp. 2613–2621.
[42] ——, "Stochastic recursive gradient algorithm for nonconvex optimization," arXiv preprint arXiv:1705.07261, 2017.

A Finite-sum case

A.1 Proof of Lemma 2.1

Proof.
We have
$$\mathbb{E}\bigg\|\bar{v}_t^{s+1} - \frac{1}{N}\sum_{i=1}^N \nabla f_i\big(x_{i,t}^{s+1}\big)\bigg\|^2$$
$$= \mathbb{E}\bigg\|\bar{v}_{t-1}^{s+1} - \frac{1}{N}\sum_{i=1}^N \nabla f_i\big(x_{i,t-1}^{s+1}\big) + \frac{1}{N}\sum_{j=1}^N \frac{1}{B}\sum_{\xi_{j,t}^{s+1}} \Big[\nabla f_j\big(x_{j,t}^{s+1}; \xi_{j,t}^{s+1}\big) - \nabla f_j\big(x_{j,t-1}^{s+1}; \xi_{j,t}^{s+1}\big)\Big] + \frac{1}{N}\sum_{i=1}^N \Big[\nabla f_i\big(x_{i,t-1}^{s+1}\big) - \nabla f_i\big(x_{i,t}^{s+1}\big)\Big]\bigg\|^2 \qquad (29)$$
$$= \mathbb{E}\bigg\|\bar{v}_{t-1}^{s+1} - \frac{1}{N}\sum_{i=1}^N \nabla f_i\big(x_{i,t-1}^{s+1}\big)\bigg\|^2 + \mathbb{E}\bigg\|\frac{1}{N}\sum_{i=1}^N \bigg(\frac{1}{B}\sum_{\xi_{i,t}^{s+1}} \Big[\nabla f_i\big(x_{i,t}^{s+1}; \xi_{i,t}^{s+1}\big) - \nabla f_i\big(x_{i,t-1}^{s+1}; \xi_{i,t}^{s+1}\big)\Big] + \nabla f_i\big(x_{i,t-1}^{s+1}\big) - \nabla f_i\big(x_{i,t}^{s+1}\big)\bigg)\bigg\|^2 + \underbrace{\mathbb{E}\big\langle \cdot, \cdot \big\rangle}_{=\,0} \qquad (30)$$
$$= \mathbb{E}\bigg\|\bar{v}_{t-1}^{s+1} - \frac{1}{N}\sum_{i=1}^N \nabla f_i\big(x_{i,t-1}^{s+1}\big)\bigg\|^2 + \frac{1}{N^2}\sum_{i=1}^N \mathbb{E}\bigg\|\frac{1}{B}\sum_{\xi_{i,t}^{s+1}} \Big[\nabla f_i\big(x_{i,t}^{s+1}; \xi_{i,t}^{s+1}\big) - \nabla f_i\big(x_{i,t-1}^{s+1}; \xi_{i,t}^{s+1}\big)\Big] + \nabla f_i\big(x_{i,t-1}^{s+1}\big) - \nabla f_i\big(x_{i,t}^{s+1}\big)\bigg\|^2 + \frac{1}{N^2}\sum_{i \ne j} \underbrace{\mathbb{E}\big\langle \cdot_i, \cdot_j \big\rangle}_{=\,0} \qquad (31)$$
$$= \mathbb{E}\bigg\|\bar{v}_{t-1}^{s+1} - \frac{1}{N}\sum_{i=1}^N \nabla f_i\big(x_{i,t-1}^{s+1}\big)\bigg\|^2 + \frac{1}{N^2}\sum_{i=1}^N \frac{1}{B^2}\, \mathbb{E}\sum_{\xi_{i,t}^{s+1}} \Big\|\nabla f_i\big(x_{i,t}^{s+1}; \xi_{i,t}^{s+1}\big) - \nabla f_i\big(x_{i,t-1}^{s+1}; \xi_{i,t}^{s+1}\big) + \nabla f_i\big(x_{i,t-1}^{s+1}\big) - \nabla f_i\big(x_{i,t}^{s+1}\big)\Big\|^2 + \frac{1}{N^2}\sum_{i=1}^N \frac{1}{B^2}\, \mathbb{E}\sum_{\xi_{i,t}^{s+1} \ne \zeta_{i,t}^{s+1}} \underbrace{\big\langle \cdot_{\xi}, \cdot_{\zeta} \big\rangle}_{=\,0} \qquad (32)$$
$$\le \mathbb{E}\bigg\|\bar{v}_{t-1}^{s+1} - \frac{1}{N}\sum_{i=1}^N \nabla f_i\big(x_{i,t-1}^{s+1}\big)\bigg\|^2 + \frac{1}{N^2}\sum_{i=1}^N \frac{1}{B^2}\, \mathbb{E}\sum_{\xi_{i,t}^{s+1}} \Big\|\nabla f_i\big(x_{i,t}^{s+1}; \xi_{i,t}^{s+1}\big) - \nabla f_i\big(x_{i,t-1}^{s+1}; \xi_{i,t}^{s+1}\big)\Big\|^2 \qquad (33)$$
$$\le \mathbb{E}\bigg\|\bar{v}_{t-1}^{s+1} - \frac{1}{N}\sum_{i=1}^N \nabla f_i\big(x_{i,t-1}^{s+1}\big)\bigg\|^2 + \frac{1}{N^2}\sum_{i=1}^N \frac{L^2}{B}\, \mathbb{E}\big\|x_{i,t}^{s+1} - x_{i,t-1}^{s+1}\big\|^2 \qquad (34)$$
$$\le \underbrace{\mathbb{E}\bigg\|\bar{v}_0^{s+1} - \frac{1}{N}\sum_{i=1}^N \nabla f_i\big(x_{i,0}^{s+1}\big)\bigg\|^2}_{\mathcal{E}_0^{s+1}} + \frac{L^2}{N^2 B}\sum_{i=1}^N \sum_{\ell=0}^{t-1} \mathbb{E}\big\|x_{i,\ell+1}^{s+1} - x_{i,\ell}^{s+1}\big\|^2. \qquad (35)$$
Here, (29) follows from step 7 in Algorithm 1. The cross term in (30) is zero since
$$\mathbb{E}_{\xi_{i,t}^{s+1}} \Big[\nabla f_i\big(x_{i,t}^{s+1}; \xi_{i,t}^{s+1}\big) - \nabla f_i\big(x_{i,t-1}^{s+1}; \xi_{i,t}^{s+1}\big)\Big] = \nabla f_i\big(x_{i,t}^{s+1}\big) - \nabla f_i\big(x_{i,t-1}^{s+1}\big), \qquad (36)$$
for all $\xi_{i,t}^{s+1} \in \mathcal{B}_{i,t}^{s+1}$, $\forall t, \forall s, \forall i \in [1:N]$. The cross term in (31) is zero since the mini-batches $\mathcal{B}_{i,t}^{s+1}$ are sampled uniformly at random, and independently, at all the nodes $i \in [1:N]$. The cross term in (32) is zero since at a single node $i$, the samples in the mini-batch $\mathcal{B}_{i,t}^{s+1}$ are sampled independently of each other, so each of the two inner-product factors has zero mean by (36). (33) follows from (36) and $\mathbb{E}\|x - \mathbb{E}(x)\|^2 \le \mathbb{E}\|x\|^2$. (34) follows from the mean-squared $L$-smoothness of each stochastic function $f_i(\cdot\,; \xi)$. (35) follows by applying (34) recursively, back to the beginning of the epoch.

A.2 Proof of Lemma 2.2

Proof. First we prove (7).
For $\ell$ such that $\ell \bmod I \ne 0$,
$$\sum_{i=1}^N \mathbb{E}\big\|v_{i,\ell}^{s+1} - \bar{v}_\ell^{s+1}\big\|^2 \le \sum_{i=1}^N \bigg[(1+\delta)\, \mathbb{E}\big\|v_{i,\ell-1}^{s+1} - \bar{v}_{\ell-1}^{s+1}\big\|^2 + \Big(1+\frac{1}{\delta}\Big)\, \mathbb{E}\bigg\|\frac{1}{B}\sum_{\xi_{i,\ell}^{s+1}} \Big[\nabla f_i\big(x_{i,\ell}^{s+1}; \xi_{i,\ell}^{s+1}\big) - \nabla f_i\big(x_{i,\ell-1}^{s+1}; \xi_{i,\ell}^{s+1}\big)\Big] - \frac{1}{N}\sum_{j=1}^N \frac{1}{B}\sum_{\xi_{j,\ell}^{s+1}} \Big[\nabla f_j\big(x_{j,\ell}^{s+1}; \xi_{j,\ell}^{s+1}\big) - \nabla f_j\big(x_{j,\ell-1}^{s+1}; \xi_{j,\ell}^{s+1}\big)\Big]\bigg\|^2\bigg] \qquad (37)$$
$$\le \sum_{i=1}^N \bigg[(1+\delta)\, \mathbb{E}\big\|v_{i,\ell-1}^{s+1} - \bar{v}_{\ell-1}^{s+1}\big\|^2 + \Big(1+\frac{1}{\delta}\Big)\, 2\, \mathbb{E}\bigg\|\frac{1}{B}\sum_{\xi_{i,\ell}^{s+1}} \Big[\nabla f_i\big(x_{i,\ell}^{s+1}; \xi_{i,\ell}^{s+1}\big) - \nabla f_i\big(x_{i,\ell-1}^{s+1}; \xi_{i,\ell}^{s+1}\big)\Big]\bigg\|^2 + \Big(1+\frac{1}{\delta}\Big)\, 2\, \mathbb{E}\bigg\|\frac{1}{N}\sum_{j=1}^N \frac{1}{B}\sum_{\xi_{j,\ell}^{s+1}} \Big[\nabla f_j\big(x_{j,\ell}^{s+1}; \xi_{j,\ell}^{s+1}\big) - \nabla f_j\big(x_{j,\ell-1}^{s+1}; \xi_{j,\ell}^{s+1}\big)\Big]\bigg\|^2\bigg]$$
$$\le \sum_{i=1}^N \bigg[(1+\delta)\, \mathbb{E}\big\|v_{i,\ell-1}^{s+1} - \bar{v}_{\ell-1}^{s+1}\big\|^2 + \Big(1+\frac{1}{\delta}\Big)\frac{2}{B}\, \mathbb{E}\sum_{\xi_{i,\ell}^{s+1}} \big\|\nabla f_i\big(x_{i,\ell}^{s+1}; \xi_{i,\ell}^{s+1}\big) - \nabla f_i\big(x_{i,\ell-1}^{s+1}; \xi_{i,\ell}^{s+1}\big)\big\|^2 + \Big(1+\frac{1}{\delta}\Big)\frac{2}{N}\sum_{j=1}^N \frac{1}{B}\, \mathbb{E}\sum_{\xi_{j,\ell}^{s+1}} \big\|\nabla f_j\big(x_{j,\ell}^{s+1}; \xi_{j,\ell}^{s+1}\big) - \nabla f_j\big(x_{j,\ell-1}^{s+1}; \xi_{j,\ell}^{s+1}\big)\big\|^2\bigg] \qquad (38)$$
$$\le \sum_{i=1}^N \bigg[(1+\delta)\, \mathbb{E}\big\|v_{i,\ell-1}^{s+1} - \bar{v}_{\ell-1}^{s+1}\big\|^2 + \Big(1+\frac{1}{\delta}\Big)\, 2L^2\, \mathbb{E}\big\|x_{i,\ell}^{s+1} - x_{i,\ell-1}^{s+1}\big\|^2 + \Big(1+\frac{1}{\delta}\Big)\frac{2L^2}{N}\sum_{j=1}^N \mathbb{E}\big\|x_{j,\ell}^{s+1} - x_{j,\ell-1}^{s+1}\big\|^2\bigg] \qquad (39)$$
$$= \sum_{i=1}^N \bigg[(1+\delta)\, \mathbb{E}\big\|v_{i,\ell-1}^{s+1} - \bar{v}_{\ell-1}^{s+1}\big\|^2 + 4L^2\Big(1+\frac{1}{\delta}\Big)\, \mathbb{E}\big\|x_{i,\ell}^{s+1} - x_{i,\ell-1}^{s+1}\big\|^2\bigg]$$
$$\le \sum_{i=1}^N \bigg[(1+\delta)\, \mathbb{E}\big\|v_{i,\ell-1}^{s+1} - \bar{v}_{\ell-1}^{s+1}\big\|^2 + 8\gamma^2 L^2\Big(1+\frac{1}{\delta}\Big)\, \mathbb{E}\big\|v_{i,\ell-1}^{s+1} - \bar{v}_{\ell-1}^{s+1}\big\|^2 + 8\gamma^2 L^2\Big(1+\frac{1}{\delta}\Big)\, \mathbb{E}\big\|\bar{v}_{\ell-1}^{s+1}\big\|^2\bigg]$$
$$\le \sum_{i=1}^N \underbrace{\bigg[1 + \delta + 8\gamma^2 L^2\Big(1+\frac{1}{\delta}\Big)\bigg]}_{1+\theta}\, \mathbb{E}\big\|v_{i,\ell-1}^{s+1} - \bar{v}_{\ell-1}^{s+1}\big\|^2 + 8\gamma^2 N L^2\Big(1+\frac{1}{\delta}\Big)\, \mathbb{E}\big\|\bar{v}_{\ell-1}^{s+1}\big\|^2, \qquad (40)$$
where (37) follows from step 7 in Algorithm 1 and Young's inequality, for $\delta > 0$; (38) follows from Jensen's inequality; and (39) follows from the mean-squared $L$-smoothness assumption (A1).
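The contraction step (37) rests on Young's inequality, $\|a+b\|^2 \le (1+d)\|a\|^2 + (1+\frac{1}{d})\|b\|^2$ for any $d > 0$. A quick numeric sanity check (the random vectors and the values of $d$ are arbitrary test data, not from the paper):

```python
import random

# Numeric check of Young's inequality ||a + b||^2 <= (1+d)||a||^2 + (1+1/d)||b||^2,
# used in (37) and (42). The vectors and the values of d are arbitrary test data.
random.seed(1)

def sq_norm(v):
    return sum(x * x for x in v)

for _ in range(1000):
    a = [random.uniform(-1.0, 1.0) for _ in range(5)]
    b = [random.uniform(-1.0, 1.0) for _ in range(5)]
    lhs = sq_norm([ai + bi for ai, bi in zip(a, b)])
    for d in (0.1, 0.5, 1.0, 4.0):
        assert lhs <= (1.0 + d) * sq_norm(a) + (1.0 + 1.0 / d) * sq_norm(b) + 1e-12
```

The free parameter $d$ trades the weight on the two terms; the analysis picks $d = \delta$ (and later $d = \alpha$) small, so the first term contracts slowly while the second carries a $\frac{1}{\delta}$ penalty.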
Applying (40) recursively, we obtain
$$\sum_{i=1}^N \mathbb{E}\big\|v_{i,\ell}^{s+1} - \bar{v}_\ell^{s+1}\big\|^2 \le \sum_{i=1}^N (1+\theta)^2\, \mathbb{E}\big\|v_{i,\ell-2}^{s+1} - \bar{v}_{\ell-2}^{s+1}\big\|^2 + 8\gamma^2 N L^2\Big(1+\frac{1}{\delta}\Big)\Big[\mathbb{E}\big\|\bar{v}_{\ell-1}^{s+1}\big\|^2 + (1+\theta)\, \mathbb{E}\big\|\bar{v}_{\ell-2}^{s+1}\big\|^2\Big]$$
$$\le \sum_{i=1}^N (1+\theta)^{\ell-\tau(\ell)}\, \mathbb{E}\big\|v_{i,\tau(\ell)}^{s+1} - \bar{v}_{\tau(\ell)}^{s+1}\big\|^2 + 8\gamma^2 N L^2\Big(1+\frac{1}{\delta}\Big)\sum_{j=\tau(\ell)}^{\ell-1} (1+\theta)^{\ell-1-j}\, \mathbb{E}\big\|\bar{v}_j^{s+1}\big\|^2$$
$$= 8\gamma^2 N L^2\Big(1+\frac{1}{\delta}\Big)\sum_{j=\tau(\ell)}^{\ell-1} (1+\theta)^{\ell-1-j}\, \mathbb{E}\big\|\bar{v}_j^{s+1}\big\|^2, \qquad (41)$$
where (41) follows since averaging happens at time index $\tau(\ell)$ (step 10, Algorithm 1), i.e., $v_{i,\tau(\ell)}^{s+1} = \bar{v}_{\tau(\ell)}^{s+1}$, $\forall i \in [1:N]$.

Next, we prove (8). For $\ell$ such that $\ell \bmod I \ne 0$,
$$\sum_{i=1}^N \mathbb{E}\big\|x_{i,\ell}^{s+1} - \bar{x}_\ell^{s+1}\big\|^2 = \sum_{i=1}^N \mathbb{E}\Big\|\big[x_{i,\ell-1}^{s+1} - \gamma v_{i,\ell-1}^{s+1}\big] - \big[\bar{x}_{\ell-1}^{s+1} - \gamma \bar{v}_{\ell-1}^{s+1}\big]\Big\|^2$$
$$\le \sum_{i=1}^N \bigg[(1+\alpha)\, \mathbb{E}\big\|x_{i,\ell-1}^{s+1} - \bar{x}_{\ell-1}^{s+1}\big\|^2 + \Big(1+\frac{1}{\alpha}\Big)\gamma^2\, \mathbb{E}\big\|v_{i,\ell-1}^{s+1} - \bar{v}_{\ell-1}^{s+1}\big\|^2\bigg] \qquad (42)$$
$$\le \sum_{i=1}^N \bigg[(1+\alpha)^2\, \mathbb{E}\big\|x_{i,\ell-2}^{s+1} - \bar{x}_{\ell-2}^{s+1}\big\|^2 + \Big(1+\frac{1}{\alpha}\Big)\gamma^2\Big[\mathbb{E}\big\|v_{i,\ell-1}^{s+1} - \bar{v}_{\ell-1}^{s+1}\big\|^2 + (1+\alpha)\, \mathbb{E}\big\|v_{i,\ell-2}^{s+1} - \bar{v}_{\ell-2}^{s+1}\big\|^2\Big]\bigg]$$
$$\le \sum_{i=1}^N \bigg[(1+\alpha)^{\ell-\tau(\ell)}\, \mathbb{E}\big\|x_{i,\tau(\ell)}^{s+1} - \bar{x}_{\tau(\ell)}^{s+1}\big\|^2 + \Big(1+\frac{1}{\alpha}\Big)\gamma^2 \sum_{j=\tau(\ell)}^{\ell-1} (1+\alpha)^{\ell-1-j}\, \mathbb{E}\big\|v_{i,j}^{s+1} - \bar{v}_j^{s+1}\big\|^2\bigg]$$
$$= \Big(1+\frac{1}{\alpha}\Big)\gamma^2 \sum_{i=1}^N \sum_{j=\tau(\ell)+1}^{\ell-1} (1+\alpha)^{\ell-1-j}\, \mathbb{E}\big\|v_{i,j}^{s+1} - \bar{v}_j^{s+1}\big\|^2, \qquad (43)$$
where (42) follows from Young's inequality, with $\alpha > 0$; (43) follows since averaging happens at time index $\tau(\ell)$ (steps 9-10, Algorithm 1), so that $x_{i,\tau(\ell)}^{s+1} = \bar{x}_{\tau(\ell)}^{s+1}$ and $v_{i,\tau(\ell)}^{s+1} = \bar{v}_{\tau(\ell)}^{s+1}$, $\forall i \in [1:N]$. We can further upper bound (43) using the bound on $\sum_{i=1}^N \mathbb{E}\|v_{i,j}^{s+1} - \bar{v}_j^{s+1}\|^2$ in (41).
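The unrolling used to pass from (40) to (41) (and likewise from (42) to (43)) is just the closed form of a linear recursion: if $u_\ell = \rho\, u_{\ell-1} + c_{\ell-1}$ with $u_\tau = 0$, then $u_\ell = \sum_{j=\tau}^{\ell-1} \rho^{\ell-1-j} c_j$. A small numeric confirmation, where $\rho$ and the $c_j$ are arbitrary stand-ins for $(1+\theta)$ and the driving terms:

```python
# Check of the unrolling behind (41): the recursion u_l = rho*u_{l-1} + c_{l-1}
# with u_tau = 0 has closed form u_l = sum_{j=tau}^{l-1} rho^(l-1-j) * c[j].
# rho stands in for (1+theta); the c[j] are arbitrary test data.
rho = 1.3
c = [0.7, 0.2, 1.1, 0.5, 0.9]
tau = 0

u = 0.0
for j in range(tau, len(c)):      # apply the recursion up to l = len(c)
    u = rho * u + c[j]

ell = len(c)
closed = sum(rho ** (ell - 1 - j) * c[j] for j in range(tau, ell))
assert abs(u - closed) < 1e-12    # recursion and closed form agree
```

Because the anchor value $u_{\tau(\ell)}$ vanishes at every averaging index, only the geometric sum over the driving terms survives, which is exactly the shape of (41) and (43).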
A.3 Proof of Lemma 2.3

To prove Lemma 2.3, we need some preliminary results, which are given before the proof of Lemma 2.3 itself.

A.3.1 Preliminary results required for Lemma 2.3

In the following analysis, the facts below will be used repeatedly.
(F1) We choose $\delta, \gamma$ such that $\delta < \theta = \delta + 8\gamma^2 L^2\big(1+\frac{1}{\delta}\big) < 2\delta < 1$.
(F2) We choose $\alpha = \frac{\delta}{2}$. Therefore, $\theta - \alpha > \frac{\delta}{2}$.
(F3) $\big(1+\frac{1}{\delta}\big) < \frac{2}{\delta}$ and $\big(1+\frac{1}{\alpha}\big) < \frac{2}{\alpha}$.
(F4) $I = \big\lfloor \frac{1}{\theta} \big\rfloor \le \frac{1}{\theta}$. Also, for $\theta < 1$, $\big\lfloor \frac{1}{\theta} \big\rfloor \ge \frac{1}{2\theta} > \frac{1}{4\delta}$.

Now, we state the first of the three lemmas we require to prove Lemma 2.3.

Lemma A.1. We have
$$\frac{16\gamma^5 L^4}{N B}\Big(1+\frac{1}{\delta}\Big)\sum_{t=0}^{m-1}\sum_{\ell=0}^{t-1}\sum_{j=\tau(\ell)}^{\ell-1} (1+\theta)^{\ell-1-j}\, \mathbb{E}\big\|\bar{v}_j^{s+1}\big\|^2 \le \frac{64\gamma^5 L^4 m}{N B \delta^2}\sum_{t=0}^{m-1} \mathbb{E}\big\|\bar{v}_t^{s+1}\big\|^2.$$

Proof. Starting from the left-hand side,
$$\frac{16\gamma^5 L^4}{N B}\Big(1+\frac{1}{\delta}\Big)\sum_{t=0}^{m-1}\sum_{\ell=0}^{t-1}\sum_{j=\tau(\ell)}^{\ell-1} (1+\theta)^{\ell-1-j}\, \mathbb{E}\big\|\bar{v}_j^{s+1}\big\|^2$$
$$= \frac{16\gamma^5 L^4}{N B}\Big(1+\frac{1}{\delta}\Big)\sum_{t=0}^{m-1}\bigg[\sum_{\ell=0}^{I-1}\sum_{j=0}^{\ell-1} (1+\theta)^{\ell-1-j}\, \mathbb{E}\big\|\bar{v}_j^{s+1}\big\|^2 + \sum_{\ell=I}^{2I-1}\sum_{j=I}^{\ell-1} (1+\theta)^{\ell-1-j}\, \mathbb{E}\big\|\bar{v}_j^{s+1}\big\|^2 + \cdots + \sum_{\ell=\tau(t)}^{t-1}\sum_{j=\tau(t)}^{\ell-1} (1+\theta)^{\ell-1-j}\, \mathbb{E}\big\|\bar{v}_j^{s+1}\big\|^2\bigg] \qquad (44)$$
$$= \frac{16\gamma^5 L^4}{N B}\Big(1+\frac{1}{\delta}\Big)\sum_{t=0}^{m-1}\bigg[\sum_{\ell=0}^{I-1}\sum_{j=0}^{\ell-1} (1+\theta)^{\ell-1-j}\, \mathbb{E}\big\|\bar{v}_j^{s+1}\big\|^2 + \sum_{\ell=0}^{I-1}\sum_{j=0}^{\ell-1} (1+\theta)^{\ell-1-j}\, \mathbb{E}\big\|\bar{v}_{j+I}^{s+1}\big\|^2 + \cdots + \sum_{\ell=0}^{t-1-\tau(t)}\sum_{j=0}^{\ell-1} (1+\theta)^{\ell-1-j}\, \mathbb{E}\big\|\bar{v}_{j+\tau(t)}^{s+1}\big\|^2\bigg] \qquad (45)$$
$$\le \frac{16\gamma^5 L^4}{N B}\Big(1+\frac{1}{\delta}\Big)\bigg[\sum_{t=0}^{m-1}\sum_{\ell=0}^{I-1} (1+\theta)^{\ell-1}\bigg]\sum_{j=0}^{m-1} \mathbb{E}\big\|\bar{v}_j^{s+1}\big\|^2 \qquad (46)$$
$$\le \frac{16\gamma^5 L^4}{N B}\Big(1+\frac{1}{\delta}\Big)\, m\, \frac{(1+\theta)^I - 1}{\theta}\sum_{t=0}^{m-1} \mathbb{E}\big\|\bar{v}_t^{s+1}\big\|^2 \le \frac{16\gamma^5 L^4 m}{N B}\, \frac{2}{\delta}\, \frac{e-1}{\theta}\sum_{t=0}^{m-1} \mathbb{E}\big\|\bar{v}_t^{s+1}\big\|^2 \le \frac{64\gamma^5 L^4 m}{N B \delta^2}\sum_{t=0}^{m-1} \mathbb{E}\big\|\bar{v}_t^{s+1}\big\|^2, \qquad (47)$$
where (45) follows from (44) by simple re-indexing.
(46) follows from (45) by upper bounding the coefficients of all the terms by the one corresponding to $j = 0$. Note that $\big(1+\frac{1}{x}\big)^x$ is an increasing function for $x > 0$ and $\lim_{x\to\infty}\big(1+\frac{1}{x}\big)^x = e$ (Euler's number); hence $(1+\theta)^I \le (1+\theta)^{1/\theta} \le e$. (47) follows from (F1)-(F3).

Next, we present the second lemma we will use to prove Lemma 2.3.

Lemma A.2. We have
$$\frac{2\gamma^3 L^2}{N B}\sum_{t=0}^{m-1}\sum_{\ell=0}^{t-1} \mathbb{E}\big\|\bar{v}_\ell^{s+1}\big\|^2 \le \frac{2\gamma^3 L^2 m}{N B}\sum_{t=0}^{m-1} \mathbb{E}\big\|\bar{v}_t^{s+1}\big\|^2.$$

Proof. Starting from the left-hand side,
$$\frac{2\gamma^3 L^2}{N B}\sum_{t=0}^{m-1}\sum_{\ell=0}^{t-1} \mathbb{E}\big\|\bar{v}_\ell^{s+1}\big\|^2 = \frac{2\gamma^3 L^2}{N B}\Big[(m-1)\, \mathbb{E}\big\|\bar{v}_0^{s+1}\big\|^2 + (m-2)\, \mathbb{E}\big\|\bar{v}_1^{s+1}\big\|^2 + \cdots + (1)\, \mathbb{E}\big\|\bar{v}_{m-2}^{s+1}\big\|^2\Big] \le \frac{2\gamma^3 L^2 m}{N B}\sum_{t=0}^{m-1} \mathbb{E}\big\|\bar{v}_t^{s+1}\big\|^2.$$

Finally, we present the third intermediate lemma before the proof of Lemma 2.3.

Lemma A.3. We have
$$8\gamma^5 L^4\Big(1+\frac{1}{\alpha}\Big)\Big(1+\frac{1}{\delta}\Big)\sum_{t=0}^{m-1}\sum_{\ell=\tau(t)+1}^{t-1}\sum_{j=\tau(\ell)}^{\ell-1} (1+\alpha)^{t-1-\ell}(1+\theta)^{\ell-1-j}\, \mathbb{E}\big\|\bar{v}_j^{s+1}\big\|^2 \le \frac{256\gamma^5 L^4}{\delta^4}\sum_{t=0}^{m-1} \mathbb{E}\big\|\bar{v}_t^{s+1}\big\|^2.$$

Proof. Suppose $\tau(m-1) = (p-1)I$ (see (6)). We have
$$8\gamma^5 L^4\Big(1+\frac{1}{\alpha}\Big)\Big(1+\frac{1}{\delta}\Big)\sum_{t=0}^{m-1}\sum_{\ell=\tau(t)+1}^{t-1}\sum_{j=\tau(\ell)}^{\ell-1} (1+\alpha)^{t-1-\ell}(1+\theta)^{\ell-1-j}\, \mathbb{E}\big\|\bar{v}_j^{s+1}\big\|^2$$
$$= 8\gamma^5 L^4\Big(1+\frac{1}{\alpha}\Big)\Big(1+\frac{1}{\delta}\Big)\bigg[\sum_{t=0}^{I-1}\sum_{\ell=1}^{t-1}\sum_{j=0}^{\ell-1} (1+\alpha)^{t-1-\ell}(1+\theta)^{\ell-1-j}\, \mathbb{E}\big\|\bar{v}_j^{s+1}\big\|^2 + \sum_{t=I}^{2I-1}\sum_{\ell=I+1}^{t-1}\sum_{j=I}^{\ell-1} (1+\alpha)^{t-1-\ell}(1+\theta)^{\ell-1-j}\, \mathbb{E}\big\|\bar{v}_j^{s+1}\big\|^2 + \cdots + \sum_{t=(p-1)I}^{m-1}\sum_{\ell=(p-1)I+1}^{t-1}\sum_{j=(p-1)I}^{\ell-1} (1+\alpha)^{t-1-\ell}(1+\theta)^{\ell-1-j}\, \mathbb{E}\big\|\bar{v}_j^{s+1}\big\|^2\bigg] \qquad (48)$$
$$= 8\gamma^5 L^4\Big(1+\frac{1}{\alpha}\Big)\Big(1+\frac{1}{\delta}\Big)\bigg[\sum_{t=0}^{I-1}\sum_{\ell=1}^{t-1}\sum_{j=0}^{\ell-1} (1+\alpha)^{t-1-\ell}(1+\theta)^{\ell-1-j}\, \mathbb{E}\big\|\bar{v}_j^{s+1}\big\|^2 + \sum_{t=0}^{I-1}\sum_{\ell=1}^{t-1}\sum_{j=0}^{\ell-1} (1+\alpha)^{t-1-\ell}(1+\theta)^{\ell-1-j}\, \mathbb{E}\big\|\bar{v}_{j+I}^{s+1}\big\|^2 + \cdots + \sum_{t=0}^{m-1-(p-1)I}\sum_{\ell=1}^{t-1}\sum_{j=0}^{\ell-1} (1+\alpha)^{t-1-\ell}(1+\theta)^{\ell-1-j}\, \mathbb{E}\big\|\bar{v}_{j+(p-1)I}^{s+1}\big\|^2\bigg]$$
$$\le 8\gamma^5 L^4\, \frac{2}{\alpha}\, \frac{2}{\delta}\sum_{j=0}^{m-1} \mathbb{E}\big\|\bar{v}_j^{s+1}\big\|^2\bigg[\sum_{t=0}^{I-1}\sum_{\ell=0}^{t-1} (1+\alpha)^{t-1-\ell}(1+\theta)^{\ell-1}\bigg] \qquad (49)$$
$$= \frac{32\gamma^5 L^4}{\alpha\delta}\sum_{j=0}^{m-1} \mathbb{E}\big\|\bar{v}_j^{s+1}\big\|^2\Bigg[\sum_{t=0}^{I-1} (1+\alpha)^{t-1}(1+\theta)\, \frac{\big(\frac{1+\theta}{1+\alpha}\big)^{t-1} - 1}{\big(\frac{1+\theta}{1+\alpha}\big) - 1}\Bigg] \le \frac{32\gamma^5 L^4}{\alpha\delta}\sum_{j=0}^{m-1} \mathbb{E}\big\|\bar{v}_j^{s+1}\big\|^2\bigg[\sum_{t=0}^{I-1} \frac{(1+\theta)^t - (1+\alpha)^t}{\theta - \alpha}\bigg]$$
$$\le \frac{32\gamma^5 L^4}{\alpha\delta}\sum_{t=0}^{m-1} \mathbb{E}\big\|\bar{v}_t^{s+1}\big\|^2\, \frac{(1+\theta)^I - 1}{\theta(\theta-\alpha)} \le \frac{32\gamma^5 L^4}{(\delta/2)\,\delta}\sum_{t=0}^{m-1} \mathbb{E}\big\|\bar{v}_t^{s+1}\big\|^2\, \frac{e-1}{\delta\,(\delta/2)} \qquad (50)$$
$$\le \frac{256\gamma^5 L^4}{\delta^4}\sum_{t=0}^{m-1} \mathbb{E}\big\|\bar{v}_t^{s+1}\big\|^2. \qquad (51)$$
In (48), we split the summation over $t$ into blocks of length $I$. For any block $\sum_{t=cI}^{(c+1)I-1}$, $\tau(t) = cI$; therefore, $\ell$ varies over $[cI+1 : t-1]$, and over this range of $\ell$, $\tau(\ell) = cI$. Note that within each block, the largest coefficient corresponds to the smallest $j$, i.e., to $\mathbb{E}\|\bar{v}_{cI}^{s+1}\|^2$. We use these upper bounds on the coefficients in (49), together with (F3). (50) follows from (F1), (F2), and (F4), and the final step (51) uses $e - 1 \le 2$.

Now, we are ready to prove Lemma 2.3, utilizing the three lemmas derived above.

A.3.2 Proof of Lemma 2.3

Proof.
[Lemma 2.3] Substituting the upper bound (14) in (9), we get
$$\mathbb{E} f\big(\bar{x}_{t+1}^{s+1}\big) \le \mathbb{E} f\big(\bar{x}_t^{s+1}\big) - \frac{\gamma}{2}\, \mathbb{E}\big\|\nabla f\big(\bar{x}_t^{s+1}\big)\big\|^2 - \frac{\gamma}{2}(1-L\gamma)\, \mathbb{E}\big\|\bar{v}_t^{s+1}\big\|^2 + \frac{\gamma}{2}\cdot\frac{4\gamma^2 L^2}{N^2 B}\sum_{\ell=0}^{t-1} 8\gamma^2 N L^2\Big(1+\frac{1}{\delta}\Big)\sum_{j=\tau(\ell)}^{\ell-1} (1+\theta)^{\ell-1-j}\, \mathbb{E}\big\|\bar{v}_j^{s+1}\big\|^2 + \frac{\gamma}{2}\cdot\frac{4\gamma^2 L^2}{N^2 B}\, N\sum_{\ell=0}^{t-1} \mathbb{E}\big\|\bar{v}_\ell^{s+1}\big\|^2 + \frac{\gamma}{2}\cdot 2\mathcal{E}_0^{s+1} + \frac{\gamma}{2}\cdot\frac{2L^2}{N}\Big(1+\frac{1}{\alpha}\Big)\gamma^2\sum_{\ell=\tau(t)+1}^{t-1} (1+\alpha)^{t-1-\ell}\, 8\gamma^2 N L^2\Big(1+\frac{1}{\delta}\Big)\sum_{j=\tau(\ell)}^{\ell-1} (1+\theta)^{\ell-1-j}\, \mathbb{E}\big\|\bar{v}_j^{s+1}\big\|^2$$
$$= \mathbb{E} f\big(\bar{x}_t^{s+1}\big) - \frac{\gamma}{2}\, \mathbb{E}\big\|\nabla f\big(\bar{x}_t^{s+1}\big)\big\|^2 - \frac{\gamma}{2}(1-L\gamma)\, \mathbb{E}\big\|\bar{v}_t^{s+1}\big\|^2 + \frac{16\gamma^5 L^4}{N B}\Big(1+\frac{1}{\delta}\Big)\sum_{\ell=0}^{t-1}\sum_{j=\tau(\ell)}^{\ell-1} (1+\theta)^{\ell-1-j}\, \mathbb{E}\big\|\bar{v}_j^{s+1}\big\|^2 + \frac{2\gamma^3 L^2}{N B}\sum_{\ell=0}^{t-1} \mathbb{E}\big\|\bar{v}_\ell^{s+1}\big\|^2 + \gamma \mathcal{E}_0^{s+1} + 8\gamma^5 L^4\Big(1+\frac{1}{\alpha}\Big)\Big(1+\frac{1}{\delta}\Big)\sum_{\ell=\tau(t)+1}^{t-1} (1+\alpha)^{t-1-\ell}\sum_{j=\tau(\ell)}^{\ell-1} (1+\theta)^{\ell-1-j}\, \mathbb{E}\big\|\bar{v}_j^{s+1}\big\|^2. \qquad (52)$$
Note that, as discussed earlier, for finite-sum problems $\mathcal{E}_0^{s+1} \triangleq \mathbb{E}\big\|\bar{v}_0^{s+1} - \frac{1}{N}\sum_{i=1}^N \nabla f_i\big(x_{i,0}^{s+1}\big)\big\|^2 = 0$. Further, summing (52) over $t = 0, \ldots, m-1$, we get
$$\mathbb{E} f\big(\bar{x}_m^{s+1}\big) \le \mathbb{E} f\big(\bar{x}_0^{s+1}\big) - \frac{\gamma}{2}\sum_{t=0}^{m-1} \mathbb{E}\big\|\nabla f\big(\bar{x}_t^{s+1}\big)\big\|^2 - \frac{\gamma}{2}(1-L\gamma)\sum_{t=0}^{m-1} \mathbb{E}\big\|\bar{v}_t^{s+1}\big\|^2 + \frac{16\gamma^5 L^4}{N B}\Big(1+\frac{1}{\delta}\Big)\sum_{t=0}^{m-1}\sum_{\ell=0}^{t-1}\sum_{j=\tau(\ell)}^{\ell-1} (1+\theta)^{\ell-1-j}\, \mathbb{E}\big\|\bar{v}_j^{s+1}\big\|^2 + \frac{2\gamma^3 L^2}{N B}\sum_{t=0}^{m-1}\sum_{\ell=0}^{t-1} \mathbb{E}\big\|\bar{v}_\ell^{s+1}\big\|^2 + 8\gamma^5 L^4\Big(1+\frac{1}{\alpha}\Big)\Big(1+\frac{1}{\delta}\Big)\sum_{t=0}^{m-1}\sum_{\ell=\tau(t)+1}^{t-1}\sum_{j=\tau(\ell)}^{\ell-1} (1+\alpha)^{t-1-\ell}(1+\theta)^{\ell-1-j}\, \mathbb{E}\big\|\bar{v}_j^{s+1}\big\|^2. \qquad (53)$$
Substituting the upper bounds derived in Lemmas A.1, A.2, and A.3, we get the result of Lemma 2.3.

A.4 Choice of γ

Suppose $\gamma$ is selected such that
$$L\gamma \le \frac{1}{8}, \quad \frac{128\gamma^4 L^4 m}{N B \delta^2} \le \frac{1}{8}, \quad \frac{4\gamma^2 L^2 m}{N B} \le \frac{1}{8}, \quad \frac{512\gamma^4 L^4}{\delta^4} \le \frac{1}{8}. \qquad (54)$$
As we have seen in Section 2.4, the optimal choices are $m = I\sqrt{Nn}$ and $B = \frac{1}{I}\sqrt{\frac{n}{N}}$.
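The step-size condition (18) can be spot-checked numerically. Under the parameter choices $m = I\sqrt{Nn}$ and $B = \frac{1}{I}\sqrt{n/N}$, the ratio $\frac{m}{NB}$ equals $I^2$ exactly; the values of $L$, $I$, and $\delta$ below are illustrative assumptions (with $\delta$ taken in the range $(\frac{1}{4I}, \frac{1}{I})$ suggested by (F4)):

```python
# Spot-check that gamma = 1/(8*L*I) satisfies condition (18), i.e.
# 1 - L*gamma - (128 g^4 L^4 m/(N B d^2) + 4 g^2 L^2 m/(N B) + 512 g^4 L^4/d^4) >= 1/2.
# With m = I*sqrt(N*n) and B = sqrt(n/N)/I, the ratio m/(N*B) = I^2 exactly.
# L, I, and delta below are illustrative assumptions, not values from the paper.
L, I = 1.0, 4
delta = 0.2                 # chosen in (1/(4I), 1/I) = (0.0625, 0.25)
gamma = 1.0 / (8 * L * I)
m_over_NB = I ** 2          # = m/(N*B) under the stated parameter choices

bracket = (128 * gamma**4 * L**4 * m_over_NB / delta**2
           + 4 * gamma**2 * L**2 * m_over_NB
           + 512 * gamma**4 * L**4 / delta**4)
slack = 1.0 - L * gamma - bracket
assert slack >= 0.5         # condition (18) holds for this parameter choice
```

Note that the slack is independent of $N$ and $n$ once $\frac{m}{NB} = I^2$ is fixed, which is exactly why a constant step size suffices in Appendix A.4.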
Substituting these values in one of the inequalities above, we get
\[
128\gamma^4 L^4\, \frac{I\sqrt{Nn}}{N \cdot \frac{1}{I}\sqrt{\frac{n}{N}}} \le \frac{\delta^2}{8}
\;\Rightarrow\; 128\gamma^4 L^4 I^2 \le \frac{1}{8 I^2} \quad \left(\because\; \delta < \frac{1}{I} \text{ by (F4)}\right)
\;\Rightarrow\; \gamma \le \frac{1}{4\sqrt{2}\, L I}. \quad (55)
\]
Repeating similar reasoning for all the inequalities in (54), we get
\[
\gamma \le \min\left\{ \frac{1}{8L},\; \frac{1}{4\sqrt{2}\, L I},\; \frac{1}{8 L I} \right\}
\;\Rightarrow\; \gamma \le \frac{1}{8 L I}. \quad (56)
\]
Therefore, we can choose a constant step size $\gamma$, small enough (and independent of $N$, $n$), such that (18) holds.

A.5 Proof of Theorem 2.4

Proof. We have
\[
\min_{s,t} \left[ \mathbb{E}\left\|\nabla f\left(\bar{x}^{s+1}_{t}\right)\right\|^2 + \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}\left\| x^{s+1}_{i,t} - \bar{x}^{s+1}_{t} \right\|^2 \right]
\le \frac{1}{T} \sum_{s=0}^{S-1} \sum_{t=0}^{m-1} \left[ \mathbb{E}\left\|\nabla f\left(\bar{x}^{s+1}_{t}\right)\right\|^2 + \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}\left\| x^{s+1}_{i,t} - \bar{x}^{s+1}_{t} \right\|^2 \right] \quad (57)
\]
Upper bounding the second term on the right-hand side, we have
\[
\begin{aligned}
\frac{1}{T} \sum_{s=0}^{S-1} \sum_{t=0}^{m-1} \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}\left\| x^{s+1}_{i,t} - \bar{x}^{s+1}_{t} \right\|^2
&\le \frac{1}{T} \sum_{s=0}^{S-1} \sum_{t=0}^{m-1} \frac{1}{N}\, 8\gamma^4 N L^2 \left(1 + \frac{1}{\delta}\right)\left(1 + \frac{1}{\alpha}\right) \sum_{\ell=\tau(t)+1}^{t-1} (1+\alpha)^{t-1-\ell} \sum_{j=\tau(\ell)}^{\ell-1} (1+\theta)^{\ell-1-j}\, \mathbb{E}\left\|\bar{v}^{s+1}_{j}\right\|^2 \quad (58)\\
&\le \frac{256\gamma^4 L^4}{\delta^4}\, \frac{1}{T} \sum_{s=0}^{S-1} \sum_{t=0}^{m-1} \mathbb{E}\left\|\bar{v}^{s+1}_{t}\right\|^2 \quad (59)\\
&\le \frac{1}{2}\, \frac{1}{T} \sum_{s=0}^{S-1} \sum_{t=0}^{m-1} \mathbb{E}\left\|\bar{v}^{s+1}_{t}\right\|^2 \quad (60)
\end{aligned}
\]
where (58) follows from Lemma 2.2; (59) follows from the bound in (51); (60) follows from (18). Replacing (60) in (57), we get
\[
\begin{aligned}
\min_{s,t} \left[ \mathbb{E}\left\|\nabla f\left(\bar{x}^{s+1}_{t}\right)\right\|^2 + \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}\left\| x^{s+1}_{i,t} - \bar{x}^{s+1}_{t} \right\|^2 \right]
&\le \frac{1}{T} \sum_{s=0}^{S-1} \sum_{t=0}^{m-1} \left[ \mathbb{E}\left\|\nabla f\left(\bar{x}^{s+1}_{t}\right)\right\|^2 + \frac{1}{2}\, \mathbb{E}\left\|\bar{v}^{s+1}_{t}\right\|^2 \right] \\
&\le \frac{1}{T} \sum_{s=0}^{S-1} \sum_{t=0}^{m-1} \left[ \mathbb{E}\left\|\nabla f\left(\bar{x}^{s+1}_{t}\right)\right\|^2 + \left(1 - L\gamma - \left( \frac{128\gamma^4 L^4 m}{N B \delta^2} + \frac{4\gamma^2 L^2 m}{N B} + \frac{512\gamma^4 L^4}{\delta^4} \right)\right) \mathbb{E}\left\|\bar{v}^{s+1}_{t}\right\|^2 \right] \quad (61)\\
&\le \frac{2\left( f(x^0) - f^* \right)}{T\gamma}. \quad (62)
\end{aligned}
\]
Here, (61) follows from (18); (62) follows from (17).

B Online case

B.1 Proof of Lemma 3.1

Proof.
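The feasibility of the constant step size $\gamma = \frac{1}{8LI}$ can be spot-checked numerically. A minimal sketch, assuming illustrative values of $L$, $I$, $N$, $n$ (not from the paper) and taking $\delta$ at its boundary value $1/I$ from (F4):

```python
# Check that gamma = 1/(8*L*I) satisfies the four conditions in (54), with the
# choices m = I*sqrt(N*n) and B = (1/I)*sqrt(n/N) from Section 3.4.
# L, I, N, n are illustrative assumptions; delta is set to the boundary
# value 1/I from (F4).
import math

L, I, N, n = 2.0, 4, 8, 32
m = I * math.sqrt(N * n)          # m = I*sqrt(N*n)
B = (1 / I) * math.sqrt(n / N)    # B = (1/I)*sqrt(n/N)
delta = 1 / I
gamma = 1 / (8 * L * I)

# m/(N*B) collapses to I^2, which is why the conditions depend only on I,
# not on N or n.
assert math.isclose(m / (N * B), I ** 2)

conds = [
    L * gamma,                                       # first condition in (54)
    128 * gamma**4 * L**4 * m / (N * B * delta**2),  # second condition
    4 * gamma**2 * L**2 * m / (N * B),               # third condition
    512 * gamma**4 * L**4 / delta**4,                # fourth condition
]
assert all(c <= 1 / 8 + 1e-12 for c in conds)
print(conds)
```

The cancellation $m/(NB) = I^2$ is the reason a single constant step size, independent of $N$ and $n$, works for all four conditions, as claimed after (56).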
We have, from the definition of $E^{s+1}_0$,
\[
\begin{aligned}
E^{s+1}_0 &= \mathbb{E}\left\| \bar{v}^{s+1}_0 - \frac{1}{N}\sum_{i=1}^{N} \nabla f_i\left(x^{s+1}_{i,0}\right) \right\|^2 \\
&= \mathbb{E}\left\| \frac{1}{N}\sum_{i=1}^{N} \frac{1}{n_b} \sum_{\xi_i} \nabla f_i\left(x^{s+1}_{i,0}; \xi_i\right) - \frac{1}{N}\sum_{i=1}^{N} \nabla f_i\left(x^{s+1}_{i,0}\right) \right\|^2 \quad \text{(steps 4, 17-18 in Algorithm 2)} \\
&= \frac{1}{N^2} \sum_{i=1}^{N} \mathbb{E}\left\| \frac{1}{n_b} \sum_{\xi_i} \left( \nabla f_i\left(x^{s+1}_{i,0}; \xi_i\right) - \nabla f_i\left(x^{s+1}_{i,0}\right) \right) \right\|^2 \quad (63)\\
&= \frac{1}{N^2} \sum_{i=1}^{N} \frac{1}{n_b^2} \sum_{\xi_i} \mathbb{E}\left\| \nabla f_i\left(x^{s+1}_{i,0}; \xi_i\right) - \nabla f_i\left(x^{s+1}_{i,0}\right) \right\|^2 \quad (64)\\
&\le \frac{\sigma^2}{N n_b}. \quad (65)
\end{aligned}
\]
(63) follows since, for $i \ne j$,
\[
\begin{aligned}
&\mathbb{E}\left\langle \sum_{\xi_i} \left( \nabla f_i\left(x^{s+1}_{i,0}; \xi_i\right) - \nabla f_i\left(x^{s+1}_{i,0}\right) \right),\; \sum_{\xi_j} \left( \nabla f_j\left(x^{s+1}_{j,0}; \xi_j\right) - \nabla f_j\left(x^{s+1}_{j,0}\right) \right) \right\rangle \\
&= \mathbb{E}\left\langle \mathbb{E}\sum_{\xi_i} \left( \nabla f_i\left(x^{s+1}_{i,0}; \xi_i\right) - \nabla f_i\left(x^{s+1}_{i,0}\right) \right),\; \mathbb{E}\sum_{\xi_j} \left( \nabla f_j\left(x^{s+1}_{j,0}; \xi_j\right) - \nabla f_j\left(x^{s+1}_{j,0}\right) \right) \right\rangle = 0.
\end{aligned}
\]
This is because, given $x^{s+1}_{i,0} = \bar{x}^{s}_{m}$, the samples at each node are picked uniformly at random, independent of the other nodes. (64) follows since the samples at any node are also picked independent of each other. Therefore, for any two distinct samples $\xi_i \ne \zeta_i$,
\[
\begin{aligned}
&\mathbb{E}\left\langle \nabla f_i\left(x^{s+1}_{i,0}; \xi_i\right) - \nabla f_i\left(x^{s+1}_{i,0}\right),\; \nabla f_i\left(x^{s+1}_{i,0}; \zeta_i\right) - \nabla f_i\left(x^{s+1}_{i,0}\right) \right\rangle \\
&= \mathbb{E}\left\langle \mathbb{E}_{\xi_i} \nabla f_i\left(x^{s+1}_{i,0}; \xi_i\right) - \nabla f_i\left(x^{s+1}_{i,0}\right),\; \mathbb{E}_{\zeta_i} \nabla f_i\left(x^{s+1}_{i,0}; \zeta_i\right) - \nabla f_i\left(x^{s+1}_{i,0}\right) \right\rangle = 0.
\end{aligned}
\]
Finally, (65) follows from Assumption (A3).

B.2 Proof of Lemma 3.2

Proof. Substituting (65) in (52) and summing (52) over $t = 0, \ldots,$
$m - 1$, we get
\[
\begin{aligned}
\mathbb{E} f\left(\bar{x}^{s+1}_{m}\right) &\le \mathbb{E} f\left(\bar{x}^{s+1}_{0}\right) - \frac{\gamma}{2} \sum_{t=0}^{m-1} \mathbb{E}\left\|\nabla f\left(\bar{x}^{s+1}_{t}\right)\right\|^2 - \frac{\gamma}{2}(1 - L\gamma) \sum_{t=0}^{m-1} \mathbb{E}\left\|\bar{v}^{s+1}_{t}\right\|^2 \\
&\quad + \frac{16\gamma^5 L^4}{N B} \left(1 + \frac{1}{\delta}\right) \sum_{t=0}^{m-1} \sum_{\ell=0}^{t-1} \sum_{j=\tau(\ell)}^{\ell-1} (1+\theta)^{\ell-1-j}\, \mathbb{E}\left\|\bar{v}^{s+1}_{j}\right\|^2 + \frac{2\gamma^3 L^2}{N B} \sum_{t=0}^{m-1} \sum_{\ell=0}^{t-1} \mathbb{E}\left\|\bar{v}^{s+1}_{\ell}\right\|^2 \\
&\quad + 8\gamma^5 L^4 \left(1 + \frac{1}{\alpha}\right)\left(1 + \frac{1}{\delta}\right) \sum_{t=0}^{m-1} \sum_{\ell=\tau(t)+1}^{t-1} \sum_{j=\tau(\ell)}^{\ell-1} (1+\alpha)^{t-1-\ell} (1+\theta)^{\ell-1-j}\, \mathbb{E}\left\|\bar{v}^{s+1}_{j}\right\|^2 + \frac{\gamma \sigma^2 m}{N n_b} \quad (66)
\end{aligned}
\]
(66) is the same as (53), except for the additional last term in (66). Therefore, again using the upper bounds derived in Lemmas A.1, A.2 and A.3 in (66), we get (23).

B.3 Proof of Theorem 3.3

Proof. We have
\[
\begin{aligned}
\min_{s,t} \left[ \mathbb{E}\left\|\nabla f\left(\bar{x}^{s+1}_{t}\right)\right\|^2 + \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}\left\| x^{s+1}_{i,t} - \bar{x}^{s+1}_{t} \right\|^2 \right]
&\le \frac{1}{T} \sum_{s=0}^{S-1} \sum_{t=0}^{m-1} \left[ \mathbb{E}\left\|\nabla f\left(\bar{x}^{s+1}_{t}\right)\right\|^2 + \left(1 - L\gamma - \left( \frac{128\gamma^4 L^4 m}{N B \delta^2} + \frac{4\gamma^2 L^2 m}{N B} + \frac{512\gamma^4 L^4}{\delta^4} \right)\right) \mathbb{E}\left\|\bar{v}^{s+1}_{t}\right\|^2 \right] \quad (67)\\
&\le \frac{2\left( f(x^0) - f^* \right)}{T\gamma} + \frac{2\sigma^2}{N n_b}. \quad (68)
\end{aligned}
\]
Here, (67) follows from (60) and (18); (68) follows from (25).
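The online-case guarantee follows from (68) by balancing its two terms against a target accuracy $\epsilon$. A minimal sketch of this balancing, writing $\Delta = f(x^0) - f^*$ and using purely illustrative numeric constants (the specific choices $n_b \ge 4\sigma^2/(N\epsilon)$ and $T \ge 4\Delta/(\gamma\epsilon)$ are one sufficient way to drive each term below $\epsilon/2$, stated here as an assumption rather than the paper's exact parameterization):

```python
# Balancing the two terms in (68): with n_b >= 4*sigma^2/(N*eps) and
# T >= 4*Delta/(gamma*eps), each term is at most eps/2, so the right-hand
# side of (68) is at most eps. All numeric values are illustrative.
import math

Delta, sigma2, gamma, N, eps = 5.0, 1.0, 0.01, 8, 1e-2

n_b = math.ceil(4 * sigma2 / (N * eps))   # per-node batch size for the v_0 estimate
T = math.ceil(4 * Delta / (gamma * eps))  # total number of iterations

term1 = 2 * Delta / (T * gamma)   # optimization term of (68)
term2 = 2 * sigma2 / (N * n_b)    # variance term of (68)

assert term1 <= eps / 2 + 1e-12
assert term2 <= eps / 2 + 1e-12
assert term1 + term2 <= eps + 1e-12
print(term1, term2)
```

Note the $N$ in the denominator of the variance term: both the required batch size and the iteration count shrink with the number of workers, which is the linear speedup discussed in the abstract.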
