Non-Asymptotic Analysis of an Optimal Algorithm for Network-Constrained Averaging with Noisy Links


Authors: Nima Noorshams, Martin Wainwright

Nima Noorshams¹ and Martin J. Wainwright¹,²
Departments of Statistics² and Electrical Engineering & Computer Science¹, University of California, Berkeley
{nshams, wainwrig}@eecs.berkeley.edu

Abstract: The problem of network-constrained averaging is to compute the average of a set of values distributed throughout a graph G using an algorithm that can pass messages only along graph edges. We study this problem in the noisy setting, in which the communication along each link is modeled by an additive white Gaussian noise channel. We propose a two-phase decentralized algorithm, and we use stochastic approximation methods in conjunction with spectral graph theory to provide concrete (non-asymptotic) bounds on the mean-squared error. Having found such bounds, we analyze how the number of iterations T_G(n; δ) required to achieve mean-squared error δ scales as a function of the graph topology and the number of nodes n. Previous work provided guarantees with the number of iterations scaling inversely with the second smallest eigenvalue of the Laplacian. This paper gives an algorithm that reduces this graph dependence to the graph diameter, which is the best scaling possible.

I. INTRODUCTION

The problem of network-constrained averaging is to compute the average of a set of numbers distributed throughout a network, using an algorithm that is allowed to pass messages only along edges of the graph. Motivating applications include sensor networks, in which individual motes have limited memory and communication ability, and massive databases and server farms, in which memory constraints preclude storing all data at a central location.
In typical applications, the average might represent a statistical estimate of some physical quantity (e.g., temperature, pressure, etc.), or an intermediate quantity in a more complex algorithm (e.g., for distributed optimization). There is now an extensive literature on network averaging, consensus problems, as well as distributed optimization and estimation (e.g., see the papers [7], [12], [10], [30], [20], [3], [4], [8], [23], [22]). The bulk of the earlier work has focused on the noiseless variant, in which communication between nodes in the graph is assumed to be noiseless. A more recent line of work has studied versions of the problem with noisy communication links (e.g., see the papers [18], [15], [27], [2], [29], [19], [24] and references therein).

The focus of this paper is a noisy version of network-constrained averaging in which inter-node communication is modeled by an additive white Gaussian noise (AWGN) channel. Given this randomness, any algorithm is necessarily stochastic, and the corresponding sequence of random variables can be analyzed in various ways. The simplest question to ask is whether the algorithm is consistent: that is, does it compute an approximate average or achieve consensus in an asymptotic sense for a given fixed graph? A more refined analysis seeks to provide information about this convergence rate. In this paper, we do so by posing the following question: for a given algorithm, how does the number of iterations required to compute the average to within δ-accuracy scale as a function of the graph topology and the number of nodes n? For obvious reasons, we refer to this as the network scaling of an algorithm, and we are interested in finding an algorithm that has a near-optimal scaling law. The issue of network scaling has been studied by a number of authors in the noiseless setting, in which the communication between nodes is perfect.
Of particular relevance here is the work of Benezit et al. [5], who, in the case of perfect communication, provided a scheme with an essentially optimal message scaling law for random geometric graphs. A portion of the method proposed in this paper is inspired by their scheme, albeit with suitable extensions to multiple paths that are essential in the noisy setting. The issue of network scaling has also been studied in the noisy setting; in particular, past work by Rajagopal and Wainwright [27] analyzed a damped version of the usual consensus updates, and provided scalings of the iteration number as a function of the graph topology and size. However, our new algorithm has much better scaling than the method of [27].

The main contributions of this paper are the development of a novel two-phase algorithm for network-constrained averaging with noise, and the establishment of the near-optimality of its network scaling. At a high level, the outer phase of our algorithm produces a sequence of iterates {θ(τ)}, τ = 0, 1, 2, …, based on a recursive linear update with decaying step size, as in stochastic approximation methods. The system matrix in this update is a time-varying and random quantity, whose structure is determined by the updates within the inner phase. These inner rounds are based on establishing multiple paths between pairs of nodes, and averaging along them simultaneously. By combining a careful analysis of the spectral properties of this random matrix with stochastic approximation theory, we prove that this two-phase algorithm computes a δ-accurate version of the average using a number of iterations that grows with the graph diameter (up to logarithmic factors).¹

¹The graph diameter is the minimal number of edges needed to connect any pair of nodes in the graph.
As we discuss in more detail following the statement of our main result, this result is optimal up to logarithmic factors, meaning that no algorithm can be substantially better in terms of network scaling.

The remainder of this paper is organized as follows. We begin in Section II with background and a formulation of the problem. In Section III, we describe our algorithm and state various theoretical guarantees on its performance. We then provide the proof of our main result in Section IV. Section V is devoted to some simulation results that confirm the sharpness of our theoretical predictions. We conclude the paper in Section VI.

Notation: For the reader's convenience, we collect here some notation used throughout the paper. The notation f(n) = O(g(n)) means that there exist a constant c ∈ (0, ∞) and n₀ ∈ ℕ such that f(n) ≤ c g(n) for all n ≥ n₀, whereas f(n) = Ω(g(n)) means that f(n) ≥ c′ g(n) for all n ≥ n₀. The notation f(n) = Θ(g(n)) means that f(n) = O(g(n)) and f(n) = Ω(g(n)). Given a symmetric matrix A ∈ ℝ^(n×n), we denote its ordered sequence of eigenvalues by λ_1(A) ≤ λ_2(A) ≤ … ≤ λ_n(A), and its ℓ₂-operator norm by |||A|||_2 = sup_{‖v‖_2 = 1} ‖Av‖_2. Finally, we use ⟨·, ·⟩ to denote the Euclidean inner product.

II. BACKGROUND AND PROBLEM SET-UP

We begin this section by introducing the necessary background and setting up the problem more precisely.

A. Network-constrained averaging

Consider a collection {θ_i(0), i = 1, …, n} of n numbers. In statistical settings, these numbers would be modeled as independent and identically distributed (i.i.d.) draws from an unknown distribution Q with mean μ. In a centralized setting, a standard estimator for the mean is the sample average θ̄ := (1/n) Σ_{i=1}^n θ_i(0).
When all of the data can be aggregated at a central location, computation of θ̄ is straightforward. In this paper, we consider the network-constrained version of this estimation problem, modeled by an undirected graph G = (V, E) that consists of a vertex set V = {1, …, n} and a collection of edges E joining pairs of vertices. For i ∈ V, we view each measurement θ_i(0) as associated with vertex i. (For instance, in the context of sensor networks, each vertex would contain a mote and collect observations of the environment.) The edge structure of the graph enforces communication constraints on the processing: in particular, the presence of edge (i, j) indicates that it is possible for sensors i and j to exchange information via a noisy communication channel. Conversely, sensor pairs that are not joined by an edge are not permitted to communicate directly.² Every node has a synchronized internal clock, and acts at discrete times t = 1, 2, …. For any given pair of sensors (i, j) ∈ E, we assume that the message sent from i to j is perturbed by an independent identically distributed N(0, σ²) variate. Although this additive white Gaussian noise (AWGN) model is more realistic than a noiseless model, it is conceivable (as pointed out by one of the reviewers) that other stochastic channel models might be more suitable for certain types of sensor networks, and we leave this exploration for future research.

²Since the edges are undirected, there is no difference between edge (i, j) and (j, i); moreover, we exclude self-edges, meaning that (i, i) ∉ E for all i ∈ V.
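To make the channel model concrete, the following minimal sketch simulates repeated transmissions of a scalar over one AWGN link and checks that the received values are unbiased with variance σ². The use of numpy, the seed, the value of σ, and the sample count are assumptions of this illustration, not part of the paper's protocol.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5  # channel noise standard deviation (assumed value for the sketch)

def send(value, rng, sigma):
    """Transmit a scalar over an AWGN link: the receiver sees value + N(0, sigma^2)."""
    return value + sigma * rng.normal()

# Repeated transmissions of the same value are unbiased with variance sigma^2.
samples = np.array([send(1.0, rng, sigma) for _ in range(200_000)])
empirical_mean = samples.mean()
empirical_var = samples.var()
```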
Given this set-up, of interest to us are stochastic algorithms that generate sequences {θ(t)}, t = 0, 1, 2, …, of iterates contained within ℝⁿ, and we require that the algorithm be graph-respecting, meaning that in each iteration it is allowed to send at most one message for each direction of every edge (i, j) ∈ E. At time t, we measure the distance between θ(t) and the desired average θ̄ via the average (per node) mean-squared error, given by

MSE(θ(t)) := (1/n) Σ_{i=1}^n E[(θ_i(t) − θ̄)²].   (1)

In this paper, our goal is for every node to compute the average θ̄ up to an error tolerance δ. In addition, we require almost sure consensus among nodes, meaning that P[θ_i(t) = θ_j(t) for all i, j = 1, 2, …, n] → 1 as t → ∞. Our primary goal is to characterize the rate of convergence as a function of the graph topology and the number of nodes, which we refer to as the network-scaling function of the algorithm. More precisely, in order to study this network scaling, we consider sequences of graphs {G_n} indexed by the number of nodes n. For any given algorithm (defined for each graph G_n) and a fixed tolerance parameter δ > 0, our goal is to determine bounds on the quantity

T_G(n; δ) := inf { t = 1, 2, … | MSE(θ(t)) ≤ δ }.   (2)

Note that T_G(n; δ) is a stopping time, given by the smallest number of iterations required to obtain mean-squared error less than δ on a graph of type G with n nodes.

B. Graph topologies

Of course, the question that we have posed will depend on the graph type, and this paper analyzes three types of graphs, as shown in Figure 1. The first two graphs have regular topologies: the single cycle graph in panel (a) is degree two-regular, and the two-dimensional grid graph in panel (b) is degree four-regular. In addition, we also analyze an important class of random graphs with irregular topology, namely the class of random geometric graphs.
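As an illustration of definitions (1) and (2), the sketch below computes the empirical per-node error and the resulting stopping time for a toy trajectory. The geometric contraction rate, the specific initial values, and the numpy dependency are assumptions of the example; for simplicity the expectation in (1) is replaced by a single deterministic trajectory, and time is indexed from 0.

```python
import numpy as np

def mse(theta_t, theta_bar):
    """Empirical per-node squared error, in the spirit of equation (1)."""
    return np.mean((theta_t - theta_bar) ** 2)

def stopping_time(trajectory, theta_bar, delta):
    """Smallest t with MSE <= delta, as in equation (2); trajectory[t] is theta(t)."""
    for t, theta_t in enumerate(trajectory):
        if mse(theta_t, theta_bar) <= delta:
            return t
    return None

# Toy trajectory: estimates contract geometrically toward the sample average.
theta0 = np.array([0.0, 1.0, 2.0, 3.0])
theta_bar = theta0.mean()
trajectory = [theta_bar + 0.5 ** t * (theta0 - theta_bar) for t in range(20)]
T = stopping_time(trajectory, theta_bar, delta=1e-3)
```

With these numbers the initial error is 1.25, and the error shrinks by a factor of 4 per step until it crosses the tolerance.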
[Fig. 1. Illustration of graph topologies. (a) A single cycle graph. (b) Two-dimensional grid with four-nearest-neighbor connectivity. (c) A random geometric graph (RGG): two nodes are connected if their distance is less than r(n); the solid circles represent the centers of the squares.]

As illustrated in Figure 1(c), a random geometric graph (RGG) in the plane is formed by placing n nodes uniformly at random in the unit square [0, 1] × [0, 1], and then connecting two nodes if their Euclidean distance is less than some radius r(n). It is known that an RGG will be connected with high probability as long as r(n) = Ω(√(log n / n)); see Penrose [26] for discussion of this and other properties of random geometric graphs.

A key graph-theoretic parameter relevant to our analysis is the graph diameter, denoted by D_n = diam(G_n). The path distance between any pair of nodes is the length of the shortest path joining them in the graph, and by definition, the graph diameter is the maximum path distance taken over all node pairs in the graph. It is straightforward to see that D_n = Θ(n) for the single cycle graph, and that D_n = Θ(√n) for the two-dimensional grid. For a random geometric graph with radius chosen to ensure connectivity, it is known that D_n = Θ(√(n / log n)).

Finally, in order to simplify the routing problem explained later, we divide the unit square into subregions (squares) of side length √(1/n) in the case of the grid, and, for some constant c > 0, of side length √(c log n / n) in the case of the RGG.
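The RGG construction and the diameter computation can be sketched as follows; the node count, the seed, and the factor 2.0 multiplying the connectivity threshold √(log n / n) are assumptions chosen so that the sampled graph is connected with high probability.

```python
import numpy as np
from collections import deque

def random_geometric_graph(n, radius, rng):
    """Place n nodes uniformly in [0,1]^2; connect pairs within `radius`."""
    pts = rng.random((n, 2))
    adj = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(pts[i] - pts[j]) < radius:
                adj[i].append(j)
                adj[j].append(i)
    return adj

def bfs_eccentricity(adj, source):
    """Max path distance from `source` to any node; -1 if the graph is disconnected."""
    dist = [-1] * len(adj)
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if dist[v] < 0:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist) if min(dist) >= 0 else -1

rng = np.random.default_rng(1)
n = 300
radius = 2.0 * np.sqrt(np.log(n) / n)  # safely above the connectivity threshold
adj = random_geometric_graph(n, radius, rng)
diameter = max(bfs_eccentricity(adj, s) for s in range(n))
```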
We assume that each node knows its location and is aware of the centers of these m² subregions, namely (x_i, y_j), i, j = 1, 2, …, m, where m = √n for the regular grid and m = √(n / (c log n)) for the RGG. As a convention, we assume that (x₁, y₁) is the bottom-left square, to which we refer as the first square. By construction, in a regular grid, each square contains one and only one node, located at the center of the square. From known properties of RGGs [26], [17], each of the given subregions will contain at least one node with high probability (w.h.p.). Moreover, an RGG is regular w.h.p., meaning that each square contains Θ(log n) nodes (see Lemma 1 in the paper [12]). Accordingly, in the remainder of the paper, we assume without loss of generality that any given RGG is regular. Note that by construction, the transmission radius r(n) is selected so that each node in each square is connected to every other node in the four adjacent squares.

III. ALGORITHM AND ITS PROPERTIES

In this section we state our main result, followed by a detailed description of the proposed algorithm.

A. Theoretical guarantees

Our main result guarantees the existence of a graph-respecting algorithm with desirable properties. Recall the definition of a graph-respecting scheme, as well as the definition of our AWGN channel model, given in Section II. In the following statement, the quantity c₀ denotes a universal constant, independent of n, δ, and σ².

Theorem 1. For the communication model in which each link is an AWGN channel with variance σ², there is a graph-respecting algorithm such that:

a) Nodes almost surely reach a consensus. More precisely, we have

θ(t) → θ̃·1 almost surely as t → ∞,   (3)

for some θ̃ ∈ ℝ, where 1 denotes the all-ones vector.
b) After T = T_G(n; δ) iterations, the algorithm satisfies the following bounds on MSE(θ(T)):

i) For fixed tolerance δ > 0 sufficiently small, we have MSE(θ(T)) ≤ 3σ²δ after

T_cyc(n; δ) ≤ c₀ n max { (1/δ) log(1/δ), MSE(θ(0)) / (σ²δ²) }

iterations for a single cycle graph.

ii) For fixed tolerance δ > 0 sufficiently small, we have MSE(θ(T)) = O(σ²δ) after

T_grid(n; δ) ≤ c₀ √n max { (1/δ) log(1/δ), MSE(θ(0)) / (σ²δ²) }

iterations for the regular grid in two dimensions.

iii) Assume that δ = δ̃ (log n)², for some fixed δ̃ sufficiently small. Then we have MSE(θ(T)) = O(σ²δ̃) after

T_RGG(n; δ) ≤ c₀ √(n (log n)³) max { (1/δ̃) log((log n)²/δ̃), MSE(θ(0)) / (σ²δ̃²) }

iterations for a regular random geometric graph.

Here c₀ is some constant independent of n, δ, and σ², whose value may change from line to line.

Remarks: A few comments are in order regarding the interpretation of this result. First, it is worth mentioning that the quality of the different links does not have to be the same; similar arguments apply to the case where the noises have different variances. Second, although nodes almost surely reach a consensus, as guaranteed in part (a), this consensus value is not necessarily the same as the sample mean θ̄. The choice of the symbol θ̃ is intentional, to emphasize this point. However, as guaranteed by part (b), this consensus value is within σ²δ distance of the actual sample mean. Since the sample mean itself represents a noisy estimate of some underlying population quantity, there is little point in computing it to arbitrary accuracy. Third, it is worthwhile comparing part (b) with previous results on network scaling in the noisy setting. Rajagopal and Wainwright [27] analyzed a simple set of damped updates, and showed that T_cyc(n; δ) = O(n²) for the single cycle, and that T_grid(n; δ) = O(n) for the two-dimensional grid.
By comparison, the algorithm proposed here and our analysis thereof remove factors of n and √n, respectively, from these scalings.

B. Optimality of the results

As we now discuss, the scalings in Theorem 1 are optimal for the cases of the cycle and grid, and near-optimal (up to logarithmic factors) for the case of the RGG. In an adversarial setting, any algorithm needs at least Ω(D_n) iterations, where D_n denotes the graph diameter, in order to approximate the average; otherwise, some node will fail to have any information from some subset of other nodes (and their values can be set in a worst-case manner). Theorem 1 provides upper bounds on the number of iterations that are within logarithmic factors of the diameter, and hence are also within logarithmic factors of the optimal latency scaling law.

For the graphs given here, the scalings are also optimal in a non-adversarial setting, in which {θ_i(0)}, i = 1, …, n, are modeled as chosen i.i.d. from some distribution. Indeed, for a given node j ∈ V and positive integer t, we let N(j; t) denote the depth-t neighborhood of j, meaning the set of nodes that are connected to j by a path of length at most t. We then define the graph spreading function ψ_G(t) = min_{j∈V} |N(j; t)|. Note that the function ψ_G is non-decreasing, so that we may define its inverse function ψ_G⁻¹(s) = inf { t | ψ_G(t) ≥ s }. As some examples:

• for a cycle on n nodes, we have ψ_G(t) = 2t, and hence ψ_G⁻¹(s) = s/2;
• for an n-grid in two dimensions, we have the upper bound ψ_G(t) ≤ 2t², and hence the lower bound ψ_G⁻¹(s) ≥ √(s/2);
• for a random geometric graph (RGG), we have the upper bound ψ_G(t) = Θ(t² log n), which implies the lower bound ψ_G⁻¹(s) = Θ(√(s / log n)).

After t steps, a given node can gather the information of at most ψ_G(t) nodes.
For the average based on ψ_G(t) nodes to be comparable to θ̄, we require that ψ_G(t) = Ω(n), and hence the iteration number t should be at least Ω(ψ_G⁻¹(n)). For the three graphs considered here, this leads to the same conclusion, namely that Ω(D_n) iterations are required. We note also that, using information-theoretic techniques, Ayaso et al. [1] proved a lower bound on the number of iterations for a general graph in terms of the Cheeger constant [9]. For the graphs considered here, the Cheeger constant is of the order of the diameter.

C. Description of algorithm

We now describe the algorithm that achieves the bounds stated in Theorem 1. At the highest level, the algorithm can be divided into two types of phases: an inner phase and an outer phase. The outer phase produces a sequence of iterates {θ(τ)}, where τ = 0, 1, 2, … is the outer time-scale parameter. By design of the algorithm, each update of the outer parameters requires a total of M message-passing rounds (these rounds corresponding to the inner phase), where in each round the algorithm can pass at most two messages per edge (one for each direction). In a nutshell, the algorithm is based on establishing multiple routes, averaging along them in an inner phase, and updating the estimates based on the noisy versions of the averages along routes in an outer phase. Consequently, if we use the estimate θ(τ), then in the language of Theorem 1, it corresponds to T = Mτ rounds of message-passing. Our goal is to establish upper bounds on T that guarantee the MSE is O(σ²δ). Figure 2 illustrates the basic operations of the algorithm.
Two-phase algorithm for distributed consensus:
• Inner phase:
  – Deciding the averaging direction
  – Choosing the head nodes
  – Establishing the routes
  – Averaging along the routes
• Outer phase:
  – Based on the averages along the routes, update the estimates according to
    θ(τ + 1) = θ(τ) − ε(τ) [ L(τ) θ(τ) + v(τ) ]

Fig. 2: Basic operations of a two-phase algorithm for distributed consensus.

1) Outer phase: In the outer phase, we produce a sequence of iterates {θ(τ)}, τ = 1, 2, …, according to the recursive update

θ(τ + 1) = θ(τ) − ε(τ) [ L(τ) θ(τ) + v(τ) ].   (4)

Here {ε(τ)} is a sequence of positive decreasing step sizes. For a given precision δ, we set ε(τ) = 1 / (1/δ + τ). For each τ, the quantity L(τ) ∈ ℝ^(n×n) is a random matrix whose structure is determined by the inner phase, and v(τ) ∈ ℝⁿ is an additive Gaussian term whose structure is also determined in the inner phase. As will become clear in the sequel, even though L and v are dependent, they are both independent of θ. Moreover, given L, the random vector v is Gaussian with bounded variance.

2) Inner phase: The inner phase is the core of the algorithm, and it involves a number of steps, as we describe here. We use s = 1, 2, …, M to index the iterations within any inner phase, and use {γ(s)}, s = 1, …, M, to denote the sequence of inner iterates within ℝⁿ. For the inner phase corresponding to the outer update θ(τ) → θ(τ + 1), the inner phase takes the initialization γ(1) ← θ(τ), and then returns as output γ(M) → θ(τ + 1) to the outer iteration. In more detail, the inner phase can be broken down into three steps, which we now describe in detail.

a) Step 1, deciding the averaging direction: The first step is to choose a direction in which to perform averaging.
In a single cycle graph, since left and right are viewed as the same, there is only one choice, and hence nothing to be decided. In contrast, the grid or RGG graphs require a decision-making phase, which proceeds as follows. One node in the first (bottom-left) square wakes up and chooses uniformly at random to send in the horizontal or vertical direction. We code this decision using the random variable ζ ∈ {−1, 1}, where ζ = −1 (respectively ζ = +1) represents the horizontal (respectively vertical) direction. To simplify matters, we assume in the remainder of this description that the averaging direction is horizontal, with the modifications required for vertical averaging being standard.

b) Step 2, choosing the head nodes: This step applies only to the grid and RGG graphs. Given our assumption that the node in the first square has chosen the horizontal direction, it then passes a token message to a randomly selected node in the adjacent square above. The purpose of this token is to determine which node (referred to as the head node) should be involved in establishing the route passing through the given square. After receiving the token, the receiving node passes it to another randomly selected node in the adjacent square above, and so on. Note that in the special case of the grid, there is only one node in each square, and so no choices are required within squares. After m rounds, one node in each square (x₁, y_j), j = 1, 2, …, m (respectively (x_i, y₁), i = 1, 2, …, m, for vertical averaging) receives the token, as illustrated in Figure 3. Note that again, in a single cycle graph, there is nothing to be decided, since the direction and head nodes are all determined.

c) Step 3, establishing routes and averaging: In this phase, each of the head nodes establishes a horizontal path, and then performs averaging along the path, as illustrated in Figure 3(b).
This part of the algorithm involves three substeps, which we now describe in detail.

[Fig. 3. (a) The node labeled s₁₁ in the first square chooses the horizontal direction for averaging (ζ = −1); it passes the token vertically to inform other nodes to average horizontally. Nodes who receive the token pass it to another node in the adjacent square above. (b) The head nodes s_1j, j = 1, 2, …, m, as determined in the first step, establish routes horizontally (P_j, j = 1, 2, …, m) and then average along these paths.]

• For j = 1, 2, …, m, each head node s_1j selects a node s_2j uniformly at random (u.a.r.) from within the adjacent square to its right, and passes to it the quantity γ_1j(1). Given the Gaussian noise model, node s_2j then receives the quantity γ̃_1j(1) = γ_1j(1) + v_1j, where v_1j ∼ N(0, σ²), and then updates its own local variable as γ_2j(2) = γ_2j(1) + γ̃_1j(1). We then iterate this same procedure: node s_2j selects another node s_3j u.a.r. from its right adjacent square, and passes the message γ_2j(2). Overall, at round i of this update procedure, we have

γ_(i+1)j(i + 1) = γ_(i+1)j(i) + γ̃_ij(i), where γ̃_ij(i) = γ_ij(i) + v_ij, and v_ij ∼ N(0, σ²).

At the end of round m, node s_mj can compute a noisy version of the average along the path P_j: s_1j → s_2j → ⋯ → s_mj, in particular via the rescaled quantity

η_j := γ_mj(m) / m = (1/m) Σ_{i=1}^m θ_(s_ij)(t) + v_j,   j = 1, 2, …, m.

Here the variable v_j ∼ N(0, σ²/m), since the noise variables associated with different edges are independent.
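A quick numerical check of this forward substep: accumulating along a path of m nodes with per-hop N(0, σ²) noise and rescaling by 1/m leaves an error whose variance is (m − 1)σ²/m² ≤ σ²/m, consistent with the stated bound on v_j. The path values, m, σ, the seed, and the trial count below are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, m = 1.0, 10
theta = rng.random(m)        # values held by the m nodes on a path P_j
path_avg = theta.mean()

def noisy_path_average(theta, rng, sigma):
    """Forward substep: each node adds its own value to the noisy running sum
    received from the previous node; the last node rescales by 1/m."""
    acc = theta[0]                              # gamma_{1j}(1) at the head node
    for i in range(1, len(theta)):
        received = acc + sigma * rng.normal()   # AWGN on the link
        acc = theta[i] + received               # gamma_{(i+1)j}(i+1)
    return acc / len(theta)                     # eta_j

errors = np.array([noisy_path_average(theta, rng, sigma) - path_avg
                   for _ in range(50_000)])
empirical_var = errors.var()
theoretical_var = (m - 1) * sigma**2 / m**2     # <= sigma^2 / m
```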
• At this point, for each j = 1, 2, …, m, node s_mj, which holds the noisy version η_j of the path average along route P_j, can share this information with the other nodes in the path by sending η_j back to the head node. A naive way to do this is as follows: node s_mj makes m copies of η_j, namely η_j^(l) = η_j, l = 1, 2, …, m, and starts transmitting one copy at a time back to the head node. Nodes along the path simply forward what they receive, so that after (m − i) + (m − 1) time steps, node s_ij has received m noisy copies of the average, η̃_ij^(l) = η_j^(l) + v_ij^(l), where v_ij^(l) ∼ N(0, (m − i)σ²). Averaging the m copies, node s_ij can compute the quantity

γ_ij(3m − i − 1) := (1/m) Σ_{l=1}^m η̃_ij^(l) = (1/m) Σ_{l=1}^m θ_(s_lj)(τ) + w_ij,

where w_ij = v_j + (1/m) Σ_{l=1}^m v_ij^(l). Since the noises on different links and at different time steps are independent Gaussian random variables, we have w_ij ∼ N(0, σ_i²), with

σ_i² = (1/m)σ² + (1 − i/m)σ² = (1 − (i − 1)/m)σ² ≤ σ².

Therefore, at the end of M = Θ(m) rounds, for each j = 1, 2, …, m, all nodes on the path P_j have the average of the estimates along the path, perturbed by Gaussian noise with variance at most σ². Since m = Θ(D_n), we have M = Θ(D_n).

• At the end of inner phase τ, nodes that were involved in a path use their estimate of the average along the path to update θ(τ), while the estimates of the nodes that were not involved in any route remain the same. A given node s_ij on a path updates its estimate via

θ_(s_ij)(τ + 1) = (1 − ε′(τ)) θ_(s_ij)(τ) + ε′(τ) γ_ij(M),   (5)

where ε′(τ) = O(1/(τ + 1/δ)). On the other hand, using ⟨·, ·⟩ to denote the Euclidean inner product, we have γ_ij(M) = ⟨w, θ(τ)⟩ + v_(s_ij), where w is the averaging vector of the route P_j, with entries w(s_ℓj) = 1/m for ℓ = 1, 2, …, m, and zero otherwise.
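The per-node update (5) with a Robbins-Monro step size ε′(τ) ∝ 1/(τ + 1/δ) can be sketched for a single node. Here the noisy route average γ_ij(M) is modeled as the true average plus i.i.d. Gaussian noise, and the proportionality constant in ε′ is taken to be 1; both are simplifying assumptions of this illustration, as are the chosen δ, noise level, target value, and iteration count.

```python
import numpy as np

rng = np.random.default_rng(3)
delta, sigma_path = 0.05, 0.1   # tolerance and per-round noise std (assumed)
target = 2.0                    # the route average this node repeatedly observes

def eps_prime(tau, delta):
    """Decaying step size 1/(tau + 1/delta), matching the O(1/(tau + 1/delta)) rate."""
    return 1.0 / (tau + 1.0 / delta)

theta = 0.0                     # this node's running estimate theta_{s_ij}(tau)
for tau in range(5000):
    gamma = target + sigma_path * rng.normal()  # noisy average along the route
    theta = (1 - eps_prime(tau, delta)) * theta + eps_prime(tau, delta) * gamma
```

The decaying step size both averages out the fresh noise in each round's γ and forgets the initial condition, so the estimate settles near the target.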
Combining the scalar updates (5) yields the matrix-form update

θ(τ + 1) = θ(τ) − ε′(τ) [ (I − W(τ)) θ(τ) + v′(τ) ],   (6)

where the matrix W(τ) = W(τ; P₁, P₂, …, P_m, ζ) is a random averaging matrix induced by the choice of routes P₁, P₂, …, P_m and the random direction ζ. The noise vector v′(τ) ∼ N(0, C′) is additive noise. Note that at any given time, the noise at different nodes is correlated via the matrix C′, but for different time instants τ ≠ τ′, the noise vectors v′(τ) and v′(τ′) are independent. Moreover, from our earlier arguments, we have the upper bound max_{i=1,…,n} C′_ii ≤ σ².

IV. PROOF OF THEOREM 1

We now turn to the proof of Theorem 1. At a high level, the structure of the argument consists of decomposing the vector θ(τ) ∈ ℝⁿ into a sum of two terms: a component within the consensus subspace (meaning all values of the vector are identical), and a component in the orthogonal complement. Using this decomposition, the mean-squared error splits into a sum of two terms, and we use standard techniques to bound them. As will be shown, these bounds depend on the parameter δ, the noise variance, the initial MSE, and finally the (inverse) spectral gap of the update matrix. The final step is to lower bound the spectral gap of our update matrix.

A. Setting up the proof

Recalling the averaging matrix W(τ) from the update (6), we define the Laplacian matrix S(τ) := I − W(τ). We then define the averaged matrix W̄ := E[W(τ)], where the expectation is taken over the randomness due to the choice of routes;³ in a similar way, we define the associated (averaged) Laplacian S̄ := I − W̄.

³For the single cycle graph, there is only one route, which involves all the nodes at each round, so W(τ) is deterministic in this case.
Finally, we define the rescaled quantities

ε(τ) := λ_2(S̄) ε′(τ),   L(τ) := (1/λ_2(S̄)) S(τ),   and   v(τ) := (1/λ_2(S̄)) v′(τ),   (7)

where we recall that λ_2(·) denotes the second smallest eigenvalue of a symmetric matrix. In terms of these rescaled quantities, our algorithm has the form

θ(τ + 1) = θ(τ) − ε(τ) [ L(τ) θ(τ) + v(τ) ],   (8)

as stated previously in the update equation (4). Moreover, by construction, we have v(τ) ∼ N(0, C), where C = (1/[λ_2(S̄)]²) C′. For theoretical convenience, we also set

ε′(τ) = 1 / (λ_2(S̄) (τ + 1/δ)),   (9)

or equivalently ε(τ) = 1 / (τ + 1/δ) for τ = 1, 2, ….

We first claim that the matrix W̄ is symmetric and (doubly) stochastic. The symmetry follows from the fact that different routes do not collide, whereas the matrix is stochastic because every row of W̄ (depending on whether the node corresponding to that row participates in a route or not) either represents an averaging along a route or is the corresponding row of the identity matrix. Consequently, we can interpret W̄ as the transition matrix of a reversible Markov chain. It is an irreducible Markov chain, because within any updating round there is a positive chance of averaging nodes that are in the same column or row, which implies that the associated Markov chain can transition from any state to any other in at most two steps. Moreover, the stationary distribution of the chain is uniform (i.e., π = 1/n).

We now use these properties to simplify our study of the sequence {θ(τ)} generated by the update equation (8). Since S̄ is real and symmetric, it has the eigenvalue decomposition S̄ = U Λ Uᵀ, where U = [u₁ u₂ ⋯ u_n] is a unitary matrix (that is, Uᵀ U = I_n). Moreover, we have Λ =
$\text{diag}\{\lambda_1(\overline{S}), \lambda_2(\overline{S}), \ldots, \lambda_n(\overline{S})\}$, where $\lambda_i(\overline{S})$ is the eigenvalue corresponding to the eigenvector $u_i$, for $i = 1, \ldots, n$. Since $\overline{L} = \frac{1}{\lambda_2(\overline{S})}(I - \overline{W})$, the eigenvalues of $\overline{L}$ and $\overline{W}$ are related via
$$\lambda_i(\overline{L}) = \frac{1}{\lambda_2(\overline{S})}\big(1 - \lambda_{n+1-i}(\overline{W})\big) = \frac{1 - \lambda_{n+1-i}(\overline{W})}{1 - \lambda_{n-1}(\overline{W})}.$$
Since the largest eigenvalue of an irreducible Markov chain is one, with multiplicity one [16], we have
$$1 = \lambda_n(\overline{W}) > \lambda_{n-1}(\overline{W}) \geq \cdots \geq \lambda_1(\overline{W}), \quad \text{or equivalently} \quad 0 = \lambda_1(\overline{L}) < \lambda_2(\overline{L}) \leq \cdots \leq \lambda_n(\overline{L}),$$
with $\lambda_2(\overline{L}) = 1$. Moreover, we have $\overline{S}\vec{1} = \overline{L}\vec{1} = \vec{0}$, so that the first eigenvector $u_1 = \vec{1}/\sqrt{n}$ corresponds to the eigenvalue $\lambda_1(\overline{L}) = 0$.

Let $\widetilde{U}$ denote the matrix obtained from $U$ by deleting its first column $u_1$. Since the smallest eigenvalue of $\overline{L}$ is zero, we may write $\overline{L} = \widetilde{U}\widetilde{\Lambda}\widetilde{U}^T$, where $\widetilde{\Lambda} = \text{diag}\{\lambda_2(\overline{L}), \ldots, \lambda_n(\overline{L})\}$, $\widetilde{U}^T\widetilde{U} = I_{n-1}$, and $\widetilde{U}\widetilde{U}^T = I_n - \frac{\vec{1}\vec{1}^T}{n}$. With this notation, our analysis is based on the decomposition
$$\theta(\tau) = \alpha(\tau)\frac{\vec{1}}{\sqrt{n}} + \widetilde{U}\beta(\tau), \qquad (10)$$
where we have defined $\alpha(\tau) := \langle \vec{1}/\sqrt{n},\, \theta(\tau)\rangle \in \mathbb{R}$ and $\beta(\tau) := \widetilde{U}^T\theta(\tau) \in \mathbb{R}^{n-1}$. Since $\vec{1}^T L(\tau) = \vec{0}^T$ for all $\tau = 1, 2, \ldots$, the decomposition (10) and the form of the updates (8) yield the recursions
$$\alpha(\tau+1) = \alpha(\tau) - \epsilon(\tau)\frac{\vec{1}^T}{\sqrt{n}}\, v(\tau), \quad \text{and} \qquad (11)$$
$$\beta(\tau+1) = \beta(\tau) - \epsilon(\tau)\big[L(\tau)\beta(\tau) + \widetilde{U}^T v(\tau)\big]. \qquad (12)$$
Here, with a slight abuse of notation, $L(\tau)$ in equation (12) denotes the $(n-1)\times(n-1)$ matrix defined by the relation
$$U^T L(\tau)\, U = \begin{bmatrix} 0 & \vec{0}^T \\ \vec{0} & L(\tau) \end{bmatrix}_{n \times n}.$$

B. Main steps

As we show, part (a) of the theorem requires some intermediate results from the proof of part (b); accordingly, we defer it to the end of the section. With this set-up in place, we now state the two main technical lemmas that form the core of Theorem 1.
Our first lemma concerns the behavior of the component sequences $\{\alpha(\tau)\}_{\tau=0}^{\infty}$ and $\{\beta(\tau)\}_{\tau=0}^{\infty}$, which evolve according to equations (11) and (12), respectively.

Lemma 2. Given the random sequence $\{\theta(\tau)\}$ generated by the update equation (4), we have
$$\text{MSE}(\theta(\tau)) = \underbrace{\tfrac{1}{n}\,\text{var}(\alpha(\tau))}_{e_1(\tau)} + \underbrace{\tfrac{1}{n}\,\mathbb{E}[\|\beta(\tau)\|_2^2]}_{e_2(\tau)}. \qquad (13)$$
Furthermore, $e_1(\tau)$ and $e_2(\tau)$ satisfy the following bounds:
(a) For each iteration $\tau = 1, 2, \ldots$, we have
$$e_1(\tau) \leq \frac{\sigma^2\delta}{[\lambda_2(\overline{S})]^2}. \qquad (14)$$
(b) Moreover, for each iteration $\tau = 1, 2, \ldots$, we have
$$e_2(\tau) \leq \frac{\sigma^2}{[\lambda_2(\overline{S})]^2}\,\frac{\log(\tau + \frac{1}{\delta} - 1)}{\tau + \frac{1}{\delta} - 1} + e_2(0)\,\frac{\frac{1}{\delta} - 1}{\tau + \frac{1}{\delta} - 1}. \qquad (15)$$

From Lemma 2, we conclude that in order to guarantee an $O\big(\frac{\sigma^2\delta}{[\lambda_2(\overline{S})]^2}\big)$ bound on the MSE, it suffices to take $\tau$ such that
$$\frac{\frac{1}{\delta} - 1}{\tau + \frac{1}{\delta} - 1} \leq \frac{\sigma^2\delta}{e_2(0)\,[\lambda_2(\overline{S})]^2}, \quad \text{and} \quad \frac{\log(\tau + \frac{1}{\delta} - 1)}{\tau + \frac{1}{\delta} - 1} \leq \delta.$$
Note that the first inequality is satisfied when $\tau \geq \frac{e_2(0)\,[\lambda_2(\overline{S})]^2}{\sigma^2\delta^2}$. Moreover, a little algebra shows that $\tau = \frac{2}{\delta}\log\frac{1}{\delta} - (\frac{1}{\delta} - 1)$ suffices to satisfy the second inequality. Accordingly, we take
$$\tau = \max\Big\{\frac{2}{\delta}\log\frac{1}{\delta},\; \frac{e_2(0)\,[\lambda_2(\overline{S})]^2}{\sigma^2\delta^2}\Big\}$$
outer iterations.

The last part of the proof is to bound the second smallest eigenvalue of the Laplacian matrix $\overline{S}$. The following lemma, which we prove in Section IV-D to follow, addresses this issue. Recall that $\lambda_2(\cdot)$ denotes the second smallest eigenvalue of a matrix.

Lemma 3. The averaged matrix $\overline{S}$ that arises from our protocol has the following properties: (a) for the cycle and the regular grid, we have $\lambda_2(\overline{S}) = \Omega(1)$; and (b) for a random geometric graph, we have $\lambda_2(\overline{S}) = \Omega(\frac{1}{\log n})$ with high probability.

It is important to note that the averaged matrix $\overline{S}$ is not the same as the graph Laplacian that would arise from standard averaging on these graphs.
Rather, as a consequence of establishing many paths and averaging along them in each inner phase, our protocol ensures that the matrix behaves essentially like the graph Laplacian of the fully connected graph. As established previously, each outer step requires $M = O(D_n)$ iterations, where $D_n$ denotes the graph diameter. Therefore, we have shown that it is sufficient to take a total of
$$T = O\Big(D_n \max\Big\{\frac{2}{\delta}\log\frac{1}{\delta},\; \frac{e_2(0)\,[\lambda_2(\overline{S})]^2}{\sigma^2\delta^2}\Big\}\Big)$$
transmissions per edge in order to guarantee a $\frac{3\sigma^2\delta}{[\lambda_2(\overline{S})]^2}$ bound on the MSE. As we will see in the next section, assuming that the initial values are fixed, we have $e_1(0) = 0$, and hence $\text{MSE}(\theta(0)) = e_2(0)$. The claims in Theorem 1 then follow from standard calculations of the diameters of the various graphs and the result of Lemma 3. It remains to prove the two technical results, Lemmas 2 and 3, and we do so in the following sections.

C. Proof of Lemma 2

We begin by observing that
$$\mathbb{E}\big[(\theta(\tau) - \bar{\theta}\vec{1})(\theta(\tau) - \bar{\theta}\vec{1})^T\big] = F_1 + F_2 + F_3,$$
where $F_1 := \mathbb{E}\big[(\alpha(\tau) - \sqrt{n}\,\bar{\theta})^2\big]\frac{\vec{1}\vec{1}^T}{n}$, the second term is given by $F_2 := \mathbb{E}\big[\widetilde{U}\beta(\tau)\beta(\tau)^T\widetilde{U}^T\big]$, and
$$F_3 := \mathbb{E}\Big[(\alpha(\tau) - \sqrt{n}\,\bar{\theta})\,\frac{\vec{1}}{\sqrt{n}}\,\beta(\tau)^T\widetilde{U}^T\Big] + \mathbb{E}\Big[(\alpha(\tau) - \sqrt{n}\,\bar{\theta})\,\widetilde{U}\beta(\tau)\,\frac{\vec{1}^T}{\sqrt{n}}\Big].$$
Since $\widetilde{U}$ has orthonormal columns, all orthogonal to the all-ones vector ($\vec{1}^T\widetilde{U} = \vec{0}^T$), it follows that $\text{trace}(F_2) = \mathbb{E}\big[\|\beta(\tau)\|_2^2\big]$ and $\text{trace}(F_3) = 0$. It remains to compute $\text{trace}(F_1)$. Unwrapping the recursion (11), and using the fact that the initialization $\theta(0)$ implies $\alpha(0) = \sqrt{n}\,\bar{\theta}$, yields
$$\alpha(\tau) = \sqrt{n}\,\bar{\theta} - \sum_{l=0}^{\tau-1}\epsilon(l)\,\Big\langle \frac{\vec{1}}{\sqrt{n}},\, v(l)\Big\rangle \qquad (16)$$
for all $\tau = 1, 2, \ldots$. Since $v(l)$, $l = 0, 1, \ldots, \tau-1$, are zero-mean random vectors, from equation (16) we conclude that $\mathbb{E}[\alpha(\tau)] = \sqrt{n}\,\bar{\theta}$, and accordingly $\text{trace}(F_1) = \text{var}(\alpha(\tau))$. Recalling the definition (1) of the MSE and combining the pieces yields the claim (13).
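The orthogonality facts used in this trace computation can be checked numerically. The sketch below is illustrative: it builds $\widetilde{U}$ from an arbitrary symmetric doubly stochastic matrix (lazy averaging on a path, a stand-in for the protocol's $\overline{W}$) and verifies the decomposition (10) together with $\text{trace}(\widetilde{U}\beta\beta^T\widetilde{U}^T) = \|\beta\|_2^2$.

```python
import numpy as np

# Illustrative check of decomposition (10): U-tilde is the eigenbasis of
# S_bar = I - W_bar with the first column (the vector 1/sqrt(n)) removed.
# W here is a stand-in doubly stochastic matrix, not the protocol's W_bar.
rng = np.random.default_rng(2)
n = 6
A = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)   # path graph
W = np.eye(n) - 0.25 * (np.diag(A.sum(axis=1)) - A)            # lazy averaging
lam, U = np.linalg.eigh(np.eye(n) - W)    # eigenvalues ascending; lam[0] = 0
Utilde = U[:, 1:]                         # drop the eigenvector 1/sqrt(n)

theta = rng.normal(size=n)
alpha = np.ones(n) @ theta / np.sqrt(n)
beta = Utilde.T @ theta

# Decomposition (10), orthogonality to the all-ones vector, and the trace fact.
assert np.allclose(alpha * np.ones(n) / np.sqrt(n) + Utilde @ beta, theta)
assert np.allclose(Utilde.T @ np.ones(n), 0.0)
assert np.isclose(np.trace(Utilde @ np.outer(beta, beta) @ Utilde.T), beta @ beta)
print("orthogonality identities verified")
```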
(a) From equation (16), it is clear that each $\alpha(\tau)$ is Gaussian with mean $\sqrt{n}\,\bar{\theta}$. It remains to bound the variance. Using the i.i.d. nature of the sequence $v(l) \sim N(0, C)$, we have
$$\text{var}(\alpha(\tau)) = \mathbb{E}\Big[\Big(\sum_{l=0}^{\tau-1}\epsilon(l)\Big\langle\frac{\vec{1}}{\sqrt{n}},\, v(l)\Big\rangle\Big)^2\Big] = \sum_{l=0}^{\tau-1}\frac{\epsilon(l)^2}{n}\,\langle\vec{1},\, C\vec{1}\rangle = \sum_{l=0}^{\tau-1}\epsilon'(l)^2\,\frac{\langle\vec{1},\, C'\vec{1}\rangle}{n},$$
where we have recalled the rescaled quantities (7). Recalling the fact that $C'_{ii} \leq \sigma^2$ and using the Cauchy–Schwarz inequality, we have $C'_{ij} \leq \sqrt{C'_{ii}C'_{jj}} \leq \sigma^2$. Hence, we obtain
$$\text{var}(\alpha(\tau)) \leq n\sigma^2\sum_{l=0}^{\tau-1}\epsilon'(l)^2 = \frac{n\sigma^2}{[\lambda_2(\overline{S})]^2}\sum_{l=0}^{\tau-1}\frac{1}{(\frac{1}{\delta}+l)^2} \leq \frac{n\sigma^2}{[\lambda_2(\overline{S})]^2}\int_{1/\delta}^{\infty}\frac{1}{x^2}\,dx = \frac{n\sigma^2\delta}{[\lambda_2(\overline{S})]^2},$$
from which rescaling by $1/n$ establishes the bound (14).

^4 Here we have assumed that the initial values $\theta_i(0)$, $i = 1, 2, \ldots, n$, are given (fixed).

(b) Defining $H(\beta(\tau), v(\tau)) = L(\tau)\beta(\tau) + \widetilde{U}^T v(\tau)$, the update equation (12) can be written as
$$\beta(\tau+1) = \beta(\tau) - \epsilon(\tau)\,H(\beta(\tau), v(\tau)), \qquad \tau = 1, 2, \ldots.$$
In order to upper bound $e_2(\tau+1)$, defined in (13), we need to control the difference $e_2(\tau+1) - e_2(\tau)$. Some algebra yields
$$e_2(\tau+1) - e_2(\tau) = \frac{1}{n}\,\mathbb{E}\big[\langle\beta(\tau+1) - \beta(\tau),\; \beta(\tau+1) + \beta(\tau)\rangle\big] = \frac{1}{n}\,\mathbb{E}\big[\langle -\epsilon(\tau)H(\beta(\tau), v(\tau)),\; -\epsilon(\tau)H(\beta(\tau), v(\tau)) + 2\beta(\tau)\rangle\big],$$
and hence
$$e_2(\tau+1) - e_2(\tau) = \frac{\epsilon(\tau)^2}{n}\,\mathbb{E}\big[\|H(\beta(\tau), v(\tau))\|_2^2\big] - \frac{2\epsilon(\tau)}{n}\,\mathbb{E}\big[\langle H(\beta(\tau), v(\tau)),\, \beta(\tau)\rangle\big].$$
Since $\beta(\tau)$ is independent of both $L(\tau)$ and $v(\tau)$, conditioning on $\beta(\tau)$ and using the tower property of expectation yields
$$\mathbb{E}\big[\langle H(\beta(\tau), v(\tau)),\, \beta(\tau)\rangle\big] = \mathbb{E}\big[\langle\mathbb{E}[L]\beta(\tau),\, \beta(\tau)\rangle\big].$$
By construction, all the eigenvalues of $\mathbb{E}[L]$ are at least one, and hence $\langle\mathbb{E}[L]\beta(\tau),\, \beta(\tau)\rangle \geq \|\beta(\tau)\|_2^2$.
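The quadratic-form inequality just used can be sanity-checked numerically: for any symmetric matrix whose eigenvalues are all at least one, $\langle M b, b\rangle \geq \|b\|_2^2$. The matrix below is an arbitrary illustrative construction, not the protocol's $\mathbb{E}[L]$.

```python
import numpy as np

# Illustrative check: a random symmetric M with eigenvalues >= 1 satisfies
# <M b, b> >= ||b||^2 for every vector b.
rng = np.random.default_rng(4)
n = 5
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))        # random orthonormal basis
M = Q @ np.diag(1.0 + rng.uniform(0.0, 2.0, size=n)) @ Q.T

for _ in range(100):
    b = rng.normal(size=n)
    assert b @ M @ b >= b @ b - 1e-9                # <M b, b> >= ||b||_2^2
print("quadratic-form bound holds")
```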
Putting the pieces together, we obtain
$$e_2(\tau+1) \leq \frac{\epsilon(\tau)^2}{n}\Big(\underbrace{\mathbb{E}\big[\|L(\tau)\beta(\tau)\|_2^2\big]}_{F_1} + \underbrace{\mathbb{E}\big[\|\widetilde{U}^T v(\tau)\|_2^2\big]}_{F_2}\Big) + (1 - 2\epsilon(\tau))\,e_2(\tau), \qquad (17)$$
where we have used the fact that $\mathbb{E}\big[\langle L(\tau)\beta(\tau),\, \widetilde{U}^T v(\tau)\rangle\big] = 0$. We continue by upper bounding the terms $F_1 = \mathbb{E}\big[\|L(\tau)\beta(\tau)\|_2^2\big]$ and $F_2 = \mathbb{E}\big[\|\widetilde{U}^T v(\tau)\|_2^2\big]$.

First, we bound the former. By definition of the $\ell_2$-operator norm, we have
$$\mathbb{E}\big[\|L(\tau)\beta(\tau)\|_2^2\big] \leq \mathbb{E}\big[|||L(\tau)|||_2^2\,\|\beta(\tau)\|_2^2\big].$$
On the other hand, using the fact that $L(\tau) = \frac{1}{\lambda_2(\overline{S})}\,\widetilde{U}^T(I - W(\tau))\widetilde{U}$ (recall the identities of Section IV-A) yields^5
$$|||L(\tau)|||_2 \leq \frac{1}{\lambda_2(\overline{S})}\big(1 + |||W(\tau)|||_2\big) = \frac{2}{\lambda_2(\overline{S})}.$$

^5 Let $v$ be an eigenvector of the matrix $W(\tau)$ corresponding to an eigenvalue $\lambda \neq 1$. Since $\vec{1}^T v = 0$, there exists an $(n-1)$-dimensional vector $u$ such that $v = \widetilde{U}u$. Therefore, we have $\widetilde{U}^T(I - W(\tau))\widetilde{U}u = \widetilde{U}^T(I - W(\tau))v = (1-\lambda)\widetilde{U}^T v = (1-\lambda)u$. So, by subtracting the eigenvalues of $\widetilde{U}^T(I - W(\tau))\widetilde{U}$ from one, we recover the non-unit eigenvalues of $W(\tau)$.

Therefore, we have the following bound on $F_1$:
$$F_1 \leq \frac{4}{[\lambda_2(\overline{S})]^2}\,\mathbb{E}\big[\|\beta(\tau)\|_2^2\big]. \qquad (18)$$
Turning to the term $F_2$, we have
$$F_2 = \mathbb{E}\Big[v(\tau)^T\Big(I - \frac{\vec{1}\vec{1}^T}{n}\Big)v(\tau)\Big] \leq \text{trace}\big(\text{cov}(v(\tau))\big) \leq \frac{n\sigma^2}{[\lambda_2(\overline{S})]^2}. \qquad (19)$$
Substituting the inequalities (18) and (19) into (17), we obtain the recursive bound
$$e_2(\tau+1) \leq \frac{\sigma^2}{[\lambda_2(\overline{S})]^2}\,\epsilon(\tau)^2 + \Big(1 - 2\epsilon(\tau) + \frac{4\epsilon(\tau)^2}{[\lambda_2(\overline{S})]^2}\Big)\,e_2(\tau).$$
Recall the definitions (7) and (9).
If $\delta \leq \frac{[\lambda_2(\overline{S})]^2}{4}$, then $1 - 2\epsilon(\tau) + \frac{4\epsilon(\tau)^2}{[\lambda_2(\overline{S})]^2} \leq 1 - \epsilon(\tau)$, and hence we have
$$e_2(\tau+1) \leq \frac{\sigma^2}{[\lambda_2(\overline{S})]^2}\,\epsilon(\tau)^2 + (1 - \epsilon(\tau))\,e_2(\tau) \qquad (20)$$
for all $\tau = 1, 2, \ldots$. Unwrapping the inequality (20) yields
$$e_2(\tau+1) \leq \frac{\sigma^2}{[\lambda_2(\overline{S})]^2}\sum_{k=0}^{\tau}\epsilon(k)^2\prod_{l=k+1}^{\tau}(1 - \epsilon(l)) + \prod_{l=0}^{\tau}(1 - \epsilon(l))\,e_2(0). \qquad (21)$$
On the other hand, the product $\prod_{l=k+1}^{\tau}(1 - \epsilon(l))$ telescopes and is equal to $\frac{k + 1/\delta}{\tau + 1/\delta}$. Substituting this fact into equation (21) yields
$$e_2(\tau+1) \leq \frac{\sigma^2}{[\lambda_2(\overline{S})]^2}\sum_{k=0}^{\tau}\frac{1}{(k + \frac{1}{\delta})(\tau + \frac{1}{\delta})} + e_2(0)\,\frac{\frac{1}{\delta} - 1}{\tau + \frac{1}{\delta}} \;\overset{(a)}{\leq}\; \frac{\sigma^2}{[\lambda_2(\overline{S})]^2}\,\frac{\log(\tau + \frac{1}{\delta})}{\tau + \frac{1}{\delta}} + e_2(0)\,\frac{\frac{1}{\delta} - 1}{\tau + \frac{1}{\delta}},$$
where step (a) uses the inequality
$$\sum_{k=0}^{\tau}\frac{1}{k + \frac{1}{\delta}} \leq \int_{\frac{1}{\delta}-1}^{\tau + \frac{1}{\delta}}\frac{1}{x}\,dx \leq \log\Big(\tau + \frac{1}{\delta}\Big),$$
valid for $\delta \in (0, \frac{1}{2})$.

D. Proof of Lemma 3

In the case of the cycle, there is only one averaging path, and all the nodes are involved in it at each round, so the averaging matrix $W$ is fixed. More precisely, we have $W = \overline{W} = \frac{1}{n}\vec{1}\vec{1}^T$. Therefore, $\overline{W}$ is a rank-one matrix with $\lambda_{n-1}(\overline{W}) = 0$, and accordingly we have $\lambda_2(\overline{S}) = 1 - \lambda_{n-1}(\overline{W}) = 1$.

For the case of the grid or random geometric graphs, we use the Poincaré inequality [11]. A version of this theorem can be stated as follows. Let $A = [a_{ij}]$ denote the transition matrix of an irreducible, aperiodic, time-reversible Markov chain with stationary distribution $\pi$. For each ordered pair of nodes $(s, u)$ in the transition diagram, choose one and only one path $\eta_{su} = (s, s_1, s_2, \ldots, s_l, u)$ between $s$ and $u$, and define
$$|\eta_{su}| := \frac{1}{\pi(s)\,a_{ss_1}} + \frac{1}{\pi(s_1)\,a_{s_1s_2}} + \cdots + \frac{1}{\pi(s_l)\,a_{s_lu}}. \qquad (22)$$
Then the Poincaré coefficient is
$$\kappa := \max_{e \in E'}\;\sum_{\eta_{su} \ni e}|\eta_{su}|\,\pi(s)\pi(u), \qquad (23)$$
where $E'$ is the set of directed edges formed in the previous step.
Given this quantity, the theorem states that
$$\lambda_{n-1}(A) \leq 1 - \frac{1}{\kappa}, \quad \text{or equivalently} \quad 1 - \lambda_{n-1}(A) \geq \frac{1}{\kappa}. \qquad (24)$$
We apply this theorem to the Markov chain formed by $\overline{W}$; the idea is to upper bound its Poincaré coefficient.

1) Grid: We first define a path $\eta_{su}$ for every pair of nodes $\{s, u\}$. Two different cases can be distinguished; see Figure 4 for an illustration of the path $\eta_{su}$.

a) Case 1: Nodes $s$ and $u$ do not belong to the same column or row. In this case, we consider the two-hop path $\eta_{su} = (s \to w \to u)$, where $w = (x_u, y_s)$ is the vertex of the rectangle constructed by $s$ and $u$; here $x_u$ is the $x$-coordinate of $u$, and $y_s$ is the $y$-coordinate of $s$. Since the node pairs $\{s, w\}$ and $\{w, u\}$ are each averaged half of the time, we have $\overline{W}_{sw} = \overline{W}_{wu} = \frac{1}{2m}$. Substituting this into (22) and using the fact that $\pi = \frac{1}{n}\vec{1}$ yields
$$|\eta_{su}| = \frac{1}{\overline{W}_{sw}\,\pi(s)} + \frac{1}{\overline{W}_{wu}\,\pi(w)} = 4mn.$$

b) Case 2: Nodes $s$ and $u$ belong to the same row or column. In this case, we set $\eta_{su} = (s \to u)$, which leads to
$$|\eta_{su}| = \frac{1}{\overline{W}_{su}\,\pi(s)} = 2mn.$$

Moreover, a given edge $e = (s \to w)$ is involved in at most $m$ paths: as the node $u$ varies in the corresponding column or row, we obtain $m - 1$ paths in Case 1, and one path in Case 2. Combining the pieces, we can bound the Poincaré coefficient:
$$\kappa = \max_{e \in E'}\sum_{\eta_{su} \ni e}|\eta_{su}|\,\pi(s)\pi(u) \leq m\,\frac{4mn}{n^2} = 4,$$
since $m = \sqrt{n}$ for the $m \times m$ grid. Finally, from equation (24), we have
$$\lambda_2(\overline{S}) = 1 - \lambda_{n-1}(\overline{W}) \geq \frac{1}{\kappa} \geq \frac{1}{4},$$
which concludes the proof for the case of a grid-structured graph.

2) Random geometric graph: For the RGG, we follow the same proof structure: namely, we first define a path for each pair of nodes $\{s, u\}$, and then upper bound the Poincaré coefficient of the Markov chain $\overline{W}$. We first introduce some useful notation. Let $C: V \to \{1, 2, \ldots, m\}^2$ be the mapping that takes a node as its input and returns the sub-square of that node.
More precisely, for each $s \in V$, we have $C(s) = (i, j)$ if $s$ belongs to the $(i, j)$-th square, for $i, j = 1, 2, \ldots, m$.

[Figure 4: Illustration of the path $\eta_{su}$ for a grid-structured graph. (a) Case 1, where nodes $s$ and $u$ do not belong to the same column or row. (b) Case 2, where nodes $s$ and $u$ belong to the same column or row. This choice of $\eta_{su}$ yields a tight upper bound on the Poincaré coefficient.]

Furthermore, we enumerate the nodes in the square $C(s) = (i, j)$ from $1$ to $n_{ij}$, where $n_{ij}$ denotes the total number of nodes in $C(s)$. We refer to the label of node $s$ as $N_{C(s)}(s)$, where $N_{C(s)}(\cdot)$ is the enumeration operator for the square $C(s)$. Also, let $n^* = \min_{i,j} n_{ij}$ denote the minimum number of nodes in a sub-square, which by assumption is greater than $a\log n$ for some constant $a$. We split the problem into three different cases, illustrated in Figure 5.

a) Case 1: Nodes $s$ and $u$ do not belong to the same column or row. In this case, a two-hop path $\eta_{su} = (s \to w \to u)$ is considered. First, we pick $C(w)$, the vertex of the rectangle constructed by $C(s)$ and $C(u)$ with the same $x$-coordinate as $C(u)$ and the same $y$-coordinate as $C(s)$. We then choose a node $w$ inside $C(w)$ such that
$$N_{C(w)}(w) = N_{C(s)}(s) + N_{C(u)}(u) \mod n^*. \qquad (25)$$
Since each square has at least $n^*$ nodes, such a choice can always be made. On the other hand, since the node in each square is picked uniformly at random in the averaging phase, and there are at most $b\log n$ nodes in each square (for some constant $b$), we have
$$\overline{W}_{sw},\; \overline{W}_{wu} \geq \frac{1}{2m\,(b\log n)^2},$$
where the factor of $2$ is due to the choice of $\zeta$, the averaging direction. Substituting this inequality into (22), we obtain
$$|\eta_{su}| = \frac{1}{\overline{W}_{sw}\,\pi(s)} + \frac{1}{\overline{W}_{wu}\,\pi(w)} \leq 4b^2 mn(\log n)^2.$$
Furthermore, from equation (25), we see that for a fixed $s$, there are at most $b/a$ nodes in the square $C(u)$ that result in choosing $w$. Therefore, the edge $e = (s \to w)$ is involved in at most $\frac{b}{a}(m-1)$ such paths.

b) Case 2: Nodes $s$ and $u$ belong to the same row or column. In this case, by setting $\eta_{su} = (s \to u)$, we obtain
$$|\eta_{su}| = \frac{1}{\overline{W}_{su}\,\pi(s)} \leq 2b^2 mn(\log n)^2.$$

[Figure 5: Illustration of the path $\eta_{su}$ for the case of the RGG. (a) Case 1, where nodes $s$ and $u$ belong to sub-squares in different rows and columns. (b) Case 2, where nodes $s$ and $u$ belong to sub-squares in the same row or column. (c) Case 3, where nodes $s$ and $u$ belong to the same square. Note that there is only one path of this type containing a given edge $e$.]

c) Case 3: Nodes $s$ and $u$ belong to the same square, meaning $C(s) = C(u)$. In this case, a node $w$ is chosen, according to (25), in a square adjacent to $C(s)$ such that $C(w)$ is to the right of $C(s)$, unless $C(s)$ is in the last column, in which case $C(w)$ is to the left of $C(s)$. The same argument as in Case 1 gives us a bound on $|\eta_{su}|$. As for the upper bound on the number of paths: the edge $e = (s \to w)$ is involved in at most $b/a$ such paths.

Combining all the pieces, we obtain
$$|\eta_{su}| \leq 4b^2 mn(\log n)^2 \quad \forall\, s, u \in V, \qquad \text{and} \qquad \max_{e \in E'}\sum_{s,u}\mathbb{I}\{\eta_{su} \ni e\} \leq m\frac{b}{a} + 1.$$
Substituting these two inequalities into (23) yields
$$\kappa \leq \Big(m\frac{b}{a} + 1\Big)\,\frac{4b^2 mn(\log n)^2}{n^2} \leq \frac{2mb}{a}\cdot\frac{4b^2 mn(\log n)^2}{n^2} = c_1\log n$$
for some constant $c_1$, where the final step uses the fact that the number of sub-squares satisfies $m^2 = O(n/\log n)$. Therefore, from the Poincaré theorem, we have
$$\lambda_2(\overline{S}) = 1 - \lambda_{n-1}(\overline{W}) \geq \frac{1}{\kappa} \geq \frac{1}{c_1\log n},$$
which concludes the second part of Lemma 3.

E. Proof of part (a) of Theorem 1

We now return to the proof of part (a) of Theorem 1.
Combining equations (10) and (16) yields
$$\theta(\tau) = (\bar{\theta} - w(\tau))\,\vec{1} + \widetilde{U}\beta(\tau), \qquad (26)$$
where $w(\tau) = \frac{1}{\sqrt{n}}\sum_{l=0}^{\tau-1}\epsilon(l)\,\langle\frac{\vec{1}}{\sqrt{n}},\, v(l)\rangle$. As previously established, we know that $\mathbb{E}[w(\tau)] = 0$ and $\text{var}(w(\tau)) \leq \frac{\sigma^2\delta}{[\lambda_2(\overline{S})]^2}$ for all $\tau = 1, 2, \ldots$. Therefore, invoking a result on convergence of series with bounded variance (Theorem 8.3 from Chapter 1 of [14]), we have
$$w(\tau) \xrightarrow{a.s.} w \quad \text{as } \tau \to \infty \qquad (27)$$
for some random variable $w$. Since $w(\tau)$ is a sum of independent Gaussian random variables (and hence Gaussian), it is absolutely integrable [14]. Therefore, we have $\mathbb{E}[w] = \lim_{\tau\to\infty}\mathbb{E}[w(\tau)] = 0$, and also $\text{var}(w) = \lim_{\tau\to\infty}\text{var}(w(\tau)) \leq \frac{\sigma^2\delta}{[\lambda_2(\overline{S})]^2}$.

We now move on to the next part of the proof: analyzing the sequence $\{\beta(\tau)\}_{\tau=1}^{\infty}$ using techniques from stochastic approximation theory (e.g., see the books [21], [6]). These techniques apply to recursions that generate a state sequence $\{\theta(t)\}_{t=1}^{\infty}$ according to
$$\theta(t+1) = \theta(t) - \epsilon(t)\,H(\theta(t), v(t)), \qquad t = 1, 2, \ldots,$$
where $v(t)$ is a noise vector that models the randomness entering the algorithm. The parameter $\epsilon(t)$ is a positive step size, and the sequence $\{\epsilon(t)\}_{t=1}^{\infty}$ is required to satisfy the conditions $\sum_{t=1}^{\infty}\epsilon(t) = \infty$ and $\sum_{t=1}^{\infty}\epsilon(t)^{\alpha} < \infty$ for some $\alpha > 1$. The asymptotic behavior of these stochastic updates can be analyzed in terms of the ordinary differential equation (ODE)
$$\frac{d\gamma(\zeta)}{d\zeta} = -h(\gamma), \qquad (28)$$
where $h(\theta) := \mathbb{E}[H(\theta, v)]$. Under mild regularity conditions, it is known that $\theta(t) \xrightarrow{a.s.} \gamma^*$, where $\gamma^*$ is the attractor of the ODE (28). Recalling the update equation (12), our problem can be cast within this framework. In particular, the state sequence is $\{\beta(\tau)\}_{\tau=1}^{\infty}$, the noise sequence is formed by zero-mean i.i.d.
random vectors, the decreasing step-size sequence is $\epsilon(\tau) = 1/(\frac{1}{\delta} + \tau)$, and finally $H(\beta, v) = L\beta + \widetilde{U}^T v$ is a linear function with $h(\beta) = \mathbb{E}[L]\beta$. Because we have removed the zero eigenvalue from the averaged Laplacian matrix, the matrix $\mathbb{E}[L]$ has all positive eigenvalues, and so $\gamma^* = 0$ is the unique stable point of the linear differential equation $\frac{d\gamma(\zeta)}{d\zeta} = -\mathbb{E}[L]\gamma$. Therefore, an application of the ODE method [21], [6] guarantees that
$$\beta(\tau) \xrightarrow{a.s.} 0 \quad \text{as } \tau \to \infty. \qquad (29)$$
Substituting the results (27) and (29) into equation (26), we obtain
$$\theta(\tau) \xrightarrow{a.s.} (\bar{\theta} - w)\,\vec{1} \quad \text{as } \tau \to \infty.$$
In other words, the nodes almost surely reach a consensus; moreover, the consensus value $\tilde{\theta} = \bar{\theta} - w$ is within $\frac{\sigma^2\delta}{[\lambda_2(\overline{S})]^2}$ of the true sample mean in the mean-squared sense.

V. SIMULATION RESULTS

In order to demonstrate the effectiveness of the proposed algorithm, we conducted a set of simulations. More specifically, we apply the proposed algorithm to four-nearest-neighbor square grids of different sizes. We initially generate the data $\theta_i(0)$, $i = 1, 2, \ldots, n$, as random $N(1, 1)$ variables and fix them throughout the simulation, so that for each run of the algorithm the initial data is fixed. In implementing the algorithm, we adopt $\sigma^2 = 1$ as the channel noise variance, and we set the tolerance parameter $\delta = 0.1$, leading to the step size $\epsilon(\tau) = \frac{1}{10 + \tau}$. We estimated the mean-squared error, defined in equation (1), by averaging over $50$ sample paths. As discussed in Section III, every outer-phase update requires $M = O(\sqrt{n})$ time steps.

Figure 6 shows the mean-squared error versus the number of outer-loop iterations; the panel contains two different curves, one for a graph with $n = 30^2$ nodes, and the other for $n = 50^2$ nodes. As expected, the MSE monotonically decreases as the number of iterations increases, showing convergence of the algorithm.
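A minimal simulation in the spirit of this section can be sketched for the cycle, where (as noted in Section IV-D) the averaging matrix is the deterministic $W = \frac{1}{n}\vec{1}\vec{1}^T$ and $\lambda_2(\overline{S}) = 1$, so that $(I - W)\theta = \theta - \text{mean}(\theta)$. The parameters mirror the text ($\sigma^2 = 1$, $\delta = 0.1$, $\theta_i(0) \sim N(1,1)$); using i.i.d. per-node noise in place of the route-induced covariance $C'$ is a simplifying assumption of this sketch.

```python
import numpy as np

# Illustrative single-sample-path simulation of the outer recursion for the
# cycle case: W = (1/n) 1 1^T, lambda_2(S_bar) = 1, so (I - W) theta is just
# theta minus its mean. I.i.d. N(0, sigma^2) link noise is a simplification.
rng = np.random.default_rng(0)
n, sigma, delta, T = 900, 1.0, 0.1, 200
theta = rng.normal(1.0, 1.0, size=n)        # theta_i(0) ~ N(1, 1), then fixed
target = theta.mean()                       # the sample mean we want to track

mse = []
for tau in range(T):
    eps = 1.0 / (tau + 1.0 / delta)         # step size (9) with lambda_2 = 1
    v = rng.normal(0.0, sigma, size=n)      # AWGN entering the update
    theta = theta - eps * ((theta - theta.mean()) + v)
    mse.append(np.mean((theta - target) ** 2))

print(round(mse[0], 3), round(mse[-1], 3))  # MSE decays toward an O(sigma^2 delta) floor
```

Even this stripped-down version reproduces the qualitative behavior in Figure 6: the MSE decreases monotonically (up to noise) and settles near a noise floor governed by $\sigma^2\delta$.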
More importantly, the gap between the two plots is negligible. This phenomenon, which is predicted by our theory, is explored further in our next set of experiments.

In order to study the network scaling of the grid more precisely, for a given set of graph sizes we compute the number of outer iterations $\tau = \tau(n, \delta)$ such that $\text{MSE}(\theta(\tau M)) \leq \sigma^2\delta$. Recall that this stopping time is the focus of Theorem 1(b). Figure 7 provides a box plot of this stopping time $\tau$ versus the graph size $n$. Theorem 1(b) predicts that the stopping time should be inversely proportional to the spectral gap of the Laplacian matrix $\overline{S}$, which for the grid scales as $\Omega(1)$ (in particular, see Lemma 3). As shown in Figure 7, over a range of graphs of size varying from $n = 1000$ to $n = 10000$, the stopping time is roughly constant ($\tau \approx 25$), which is consistent with the theory.

[Figure 6: Mean-squared error versus the number of outer-loop iterations for grids with $n \in \{30^2, 50^2\}$ nodes. As expected, the MSE monotonically decreases, which supports the convergence claim.]

[Figure 7: Stopping time $\tau = \tau(n, \delta)$ versus the graph size $n$. For different graph sizes, we compute the first outer-phase time instant $\tau(n, \delta)$ such that $\text{MSE}(\theta(\tau M)) \leq \sigma^2\delta$. Here we have fixed the parameters to $\sigma^2 = 1$ and $\delta = 0.1$.]

VI. DISCUSSION

In this paper, we proposed and analyzed a two-phase graph-respecting algorithm for computing averages in a network, where communication is modeled as an additive white Gaussian noise channel.
We showed that the algorithm achieves consensus, and we characterized its rate of convergence as a function of the graph topology and graph size. For our algorithm, this network scaling is within logarithmic factors of the graph diameter, showing that it is near-optimal, since the graph diameter provides a lower bound for any algorithm.

Several issues are left open in this work. First, while the AWGN model is more realistic than noiseless communication, many channels in wireless networks may be more complicated, for instance involving fading, interference, and other types of memory. In principle, our algorithm could be applied to such channels and networks, but its behavior and associated convergence rates remain to be analyzed. In a separate direction, it is also worth noting that gossip-type algorithms can be used to solve more complicated types of problems, such as distributed optimization problems (e.g., [25], [28], [13]). Studying the issue of near-optimal network scaling for such problems is also of interest.

Acknowledgements

NN and MJW were partially supported by grant CCF-0545862 from the National Science Foundation and grant AFOSR-09NL184 from the Air Force Office of Scientific Research.

REFERENCES

[1] O. Ayaso, D. Shah, and M. Dahleh. Information theoretic bounds for distributed computation over networks of point-to-point channels. In International Symposium on Information Theory, 2008.
[2] T. C. Aysal, M. J. Coates, and M. G. Rabbat. Distributed average consensus with dithered quantization. IEEE Transactions on Signal Processing, 56:4905–4918, 2008.
[3] T. C. Aysal, M. E. Yildiz, A. D. Sarwate, and A. Scaglione. Broadcast gossip algorithms for consensus.
IEEE Transactions on Signal Processing, 57:2748–2761, 2009.
[4] F. Benezit, V. Blondel, P. Thiran, J. Tsitsiklis, and M. Vetterli. Weighted gossip: Distributed averaging using non-doubly stochastic matrices. In Proc. IEEE International Symposium on Information Theory, 2010.
[5] F. Benezit, A. G. Dimakis, P. Thiran, and M. Vetterli. Gossip along the way: order-optimal consensus through randomized path averaging. In Forty-Fifth Annual Allerton Conference on Communication, Control, and Computing, September 2007.
[6] A. Benveniste, M. Metivier, and P. Priouret. Stochastic Approximations and Adaptive Algorithms. Springer-Verlag, New York, 1990.
[7] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52:2508–2530, 2006.
[8] F. Cattivelli and A. H. Sayed. Diffusion LMS strategies for distributed estimation. IEEE Transactions on Signal Processing, 58(3):1035–1048, March 2010.
[9] F. R. K. Chung. Spectral Graph Theory. American Mathematical Society, 1997.
[10] M. H. DeGroot. Reaching a consensus. J. Amer. Stat. Assoc., 69:118–121, 1974.
[11] P. Diaconis and D. Stroock. Geometric bounds for eigenvalues of Markov chains. Ann. Applied Probability, 1:36–61, 1991.
[12] A. G. Dimakis, A. Sarwate, and M. J. Wainwright. Geographic gossip: Efficient averaging for sensor networks. IEEE Trans. Signal Processing, 53:1205–1216, March 2008.
[13] J. Duchi, A. Agarwal, and M. J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. Technical Report arXiv:1005.2012, UC Berkeley, May 2010.
[14] R. Durrett. Probability: Theory and Examples. Thomson Learning, 2005.
[15] F. Fagnani and S. Zampieri. Average consensus with packet drop communication. SIAM J. on Control and Optimization, 2007. To appear.
[16] G. R. Grimmett and D. R. Stirzaker. Probability and Random Processes.
Oxford Science Publications, Clarendon Press, Oxford, 1992.
[17] P. Gupta and P. Kumar. The capacity of wireless networks. IEEE Trans. on Inf. Theory, 46(2):388–404, March 2000.
[18] Y. Hatano, A. K. Das, and M. Mesbahi. Agreement in presence of noise: pseudogradients on random geometric networks. In Proceedings of the 44th IEEE Conference on Decision and Control, December 2005.
[19] S. Kar and J. M. F. Moura. Distributed consensus algorithm in sensor networks with imperfect communication: link failures and channel noise. IEEE Transactions on Signal Processing, 57(5):355–369, January 2009.
[20] D. Kempe, A. Dobra, and J. Gehrke. Gossip-based computation of aggregate information. In Proc. IEEE Conf. Foundations of Computer Science (FOCS), 2003.
[21] H. J. Kushner and G. G. Yin. Stochastic Approximation and Recursive Algorithms and Applications. Springer-Verlag, New York, 2003.
[22] C. G. Lopes and A. H. Sayed. Incremental adaptive strategies over distributed networks. IEEE Transactions on Signal Processing, 55(8):4064–4077, August 2007.
[23] C. G. Lopes and A. H. Sayed. Diffusion least-mean squares over adaptive networks: Formulation and performance analysis. IEEE Transactions on Signal Processing, 56(7):3122–3136, July 2008.
[24] B. Nazer, A. G. Dimakis, and M. Gastpar. Neighborhood gossip: Concurrent averaging through local interference. In Proc. IEEE ICASSP, 2009.
[25] A. Nedic and A. Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54:48–61, 2009.
[26] M. Penrose. Random Geometric Graphs. Oxford Studies in Probability. Oxford Univ. Press, Oxford, U.K., 2003.
[27] R. Rajagopal and M. J. Wainwright. Network-based consensus averaging with general noisy channels. IEEE Transactions on Signal Processing, January 2011.
[28] S. Sundhar Ram, A. Nedic, and V. V. Veeravalli.
Distributed subgradient projection algorithm for convex optimization. In IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 3653–3656, 2009.
[29] H. I. Su and A. El Gamal. Distributed lossy averaging. In Proc. IEEE International Symposium on Information Theory, 2009.
[30] J. Tsitsiklis. Problems in Decentralized Decision-Making and Computation. PhD thesis, Department of EECS, MIT, 1984.

Nima Noorshams received his B.Sc. from Sharif University of Technology, Tehran, Iran, in 2007. He is currently pursuing an M.Sc. degree in the Department of Statistics and a Ph.D. degree in the Department of Electrical Engineering & Computer Science at the University of California, Berkeley. His current research interests include stochastic approximation methods, graphical models, statistical signal processing, and modern coding theory.

Martin Wainwright is currently an associate professor at the University of California, Berkeley, with a joint appointment between the Department of Statistics and the Department of Electrical Engineering and Computer Sciences. He received a Bachelor's degree in Mathematics from the University of Waterloo, Canada, and a Ph.D. degree in Electrical Engineering and Computer Science (EECS) from the Massachusetts Institute of Technology (MIT). His research interests include coding and information theory, machine learning, mathematical statistics, and statistical signal processing. He has been awarded an Alfred P. Sloan Foundation Fellowship, an NSF CAREER Award, the George M. Sprowls Prize for his dissertation research (EECS department, MIT), a Natural Sciences and Engineering Research Council of Canada 1967 Fellowship, an IEEE Signal Processing Society Best Paper Award in 2008, and several outstanding conference paper awards.
