Nonparametric Link Prediction in Large Scale Dynamic Networks
Purnamrita Sarkar (psarkar@eecs.berkeley.edu), U.C. Berkeley
Deepayan Chakrabarti (deepay@fb.com), Facebook*
Michael I. Jordan (jordan@cs.berkeley.edu), University of California, Berkeley

* This work was partly done when the author was at Yahoo! Research.

Abstract

We propose a nonparametric approach to link prediction in large-scale dynamic networks. Our model uses graph-based features of pairs of nodes as well as those of their local neighborhoods to predict whether those nodes will be linked at each time step. The model allows for different types of evolution in different parts of the graph (e.g., growing or shrinking communities). We focus on large-scale graphs and present an implementation of our model that makes use of locality-sensitive hashing to allow it to be scaled to large problems. Experiments with simulated data as well as five real-world dynamic graphs show that we outperform the state of the art, especially when sharp fluctuations or nonlinearities are present. We also establish theoretical properties of our estimator, in particular consistency and weak convergence, the latter making use of an elaboration of Stein's method for dependency graphs.

1 Introduction

Many real-world problem domains generate data in the form of graphs or networks. Examples include social networks (e.g., Facebook), recommendation services (e.g., Netflix or Last.fm), biochemical networks, citation graphs, and market analysis. The inferential problem in these settings is often one of link prediction. This problem can be formulated in a static setting, where one assumes that a fixed but unknown graph is partially observed, and one wishes to assess whether a pair of nodes that are not known to be linked are in fact linked, given the observed linkage pattern among other nodes. Many real-world graphs are often best modeled, however, as dynamic entities, where links can arise and disappear over time. In the dynamic setting, the link prediction problem involves assessing whether two nodes will be linked at time t given the linkage patterns at all previous times.

Real-world graphs of current interest are often very large, involving many hundreds of thousands or millions of nodes. The dynamic setting involves sequences of such graphs. Given the large-scale nature of these data structures, inferential methodologies that may be feasible on smaller graphs of hundreds of nodes, such as Markov random fields and other graphical models, are generally infeasible for real-world link prediction problems. Practical approaches to such problems generally involve simple heuristics, such as estimating the probability of a link being present as a simple function of the last time a pair of nodes formed a link, or of the number of common neighbors between a pair of nodes [12, 17, 27, 30]. While these heuristics do respect the computational imperative, and are often useful in practice, there has been little in the way of statistical analysis to provide a sound foundation for their use and to assess the quality of the inferences that they provide. This is particularly true in the dynamic setting, where link prediction is often approached by specifying various measures of connectivity in a static graph and extending these measures in an ad hoc manner to sequences of graphs.
In this paper, we develop a nonparametric methodology for link prediction in large-scale dynamic networks. Our methodology is a relatively simple kernel-based approach, one that aims to retain the virtues of the simple heuristic methods, both in their favorable computational scaling and in the relatively weak assumptions that they appear to make on the graph generation process. As compared to existing heuristic approaches, however, our kernel-based approach allows us to provide a formal inferential treatment of link prediction: we establish consistency and weak convergence of our estimator. On the computational front, while a naive implementation of a kernel method would have poor scaling (due to the need to compare query points to every point in a training set), we show that our kernel-based approach is amenable to locality-sensitive hashing (LSH) [13], which provides a fast and scalable implementation of the estimator.

Our approach is in the spirit of nonparametric autoregressive time series models [18]. In these models the evolution of a sequence $x_t$ of continuous univariate random variables is modeled by taking the conditional expectation of $x_t$ to be a function of a moving window $(x_{t-1}, \ldots, x_{t-p})$, and estimating this function via kernel regression. It is also possible to consider multivariate extensions of such models. While it would be possible in principle to apply such models to our problem by encoding graphs as vectors, in practice the large-scale graphs that are our focus would generate high-dimensional vector representations that would be fatal to naive kernel regression. Instead, we think of the graphs as providing a "spatial" dimension that is orthogonal to the time axis. In addition to imposing the conditional independence assumption implicit in the use of a moving window, we make the additional assumption that the linkage behavior of any node i is independent of the rest of the graph given its "local neighborhood"; in effect, local neighborhoods are to the spatial dimension what moving windows are to the time dimension. Thus we model the out-edges of i at time t as a function of the local neighborhood of i over a moving window of time, resulting in a much more tractable problem. As a byproduct, this also allows different evolutions for different regions to exist in the same graph; e.g., regions of slow versus fast change in links, assortative versus disassortative regions (where high-degree nodes are more/less likely to connect to other high-degree nodes), densifying versus sparsifying regions, and so on.

As a brief summary, our contributions are as follows: (1) Nonparametric problem formulation: We offer, to our knowledge, the first nonparametric model for link prediction in dynamic networks. The model is powerful enough to accommodate different regions with different dynamics, which is not accommodated in existing heuristic approaches. It also allows covariates to be incorporated (such as demographic data about a node). (2) Consistency and weak convergence of the estimator: We prove consistency of our estimator using notions of strong mixing in Markov chains.
To establish weak convergence we show how to adapt Stein's method to our setting, going beyond the dependency graph formulation of Stein's method [25] to allow long-range weak dependence instead of marginal independence. (3) Fast implementation via LSH: Nonparametric methods such as kernel regression require computing kernel similarities between a query and all members of the training set. A naive implementation would lead to computation linear in the training set size, which is generally infeasible for large-scale networks. In order to mitigate this issue, we adapt the locality-sensitive hashing algorithm of Indyk and Motwani [13] to our particular kernel function. (4) Empirical improvements over previous methods: We demonstrate the empirical effectiveness of our method on link prediction tasks on both simulated and real networks. On graphs with nonlinear linkage patterns (e.g., seasonal trends), we outperform all of the state-of-the-art heuristic measures for static and dynamic graphs. This result is obtained in particular on a real-world sensor network graph. On other real-world datasets with smoother and simpler evolution, we perform as well as the best competitor. Finally, we compare our LSH-based kernel regression to exact kernel regression, and show that the LSH-based approach yields almost identical accuracy at a fraction of the computational cost.

The rest of the paper is organized as follows. We present the model and the estimator in Section 2. Our LSH implementation is described in Section 3. Section 4 provides an experimental evaluation of our method. We provide an analysis of consistency in Section 5. In Section 6 we discuss our adaptation of Stein's method, which we use to establish weak convergence of our estimator in Section 7. We provide a discussion of related work in Section 8 and we present our conclusions in Section 9.

2 The Model and the Estimator

We begin by introducing some notation. Consider a sequence of directed graphs, $\mathcal G = \{G_1, G_2, \ldots, G_t\}$. Define the indicator $Y_t(i,j)$, which equals 1 if the edge $i \to j$ exists at time t, and 0 otherwise. Let $N_t(i)$ denote the local neighborhood of node i in $G_t$; in our experiments, we define it to be the set of nodes within two hops of i together with all edges between the nodes in that set. Note that the neighborhoods of nearby nodes can overlap. Let $\vec N_{t,p}(i) = \{N_t(i), \ldots, N_{t-p+1}(i)\}$; this represents the local neighborhood of i along both the spatial and temporal dimensions.

2.1 The Model

Our model is as follows:

\[
Y_{t+1}(i,j) \mid \mathcal G \sim \mathrm{Bernoulli}\big(g(\psi_t(i,j))\big), \qquad \psi_t(i,j) = \{s_t(i,j),\, d_t(i)\},
\]

where $0 \le g(\cdot) \le 1$ is a function of two sets of features: those specific to the pair of nodes $(i,j)$ under consideration, $\{s_t(i,j)\}$, and those for the local neighborhood of the endpoint i, $\{d_t(i)\}$. We require that both of these feature sets be functions of $\vec N_{t,p}(i)$. Thus, $Y_{t+1}(i,j)$ is assumed to be independent of $\mathcal G$ given $\vec N_{t,p}(i)$, limiting the dimensionality of the problem. Note that two pairs of nodes $(i,j)$ and $(i',j')$ that are close to each other in terms of graph distance are likely to have overlapping neighborhoods, and hence a higher probability of sharing neighborhood-specific features. Thus, link prediction probabilities for pairs of nodes from the same region are likely to be similar, as desired.
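As a concrete illustration of the local neighborhoods just defined, the following is a minimal sketch (ours, not the paper's implementation), assuming a graph stored as a dict mapping each node to the set of its out-neighbors:

```python
def two_hop_neighborhood(adj, i):
    """N_t(i): the nodes within two hops of i in G_t, together with
    all edges of G_t between those nodes.

    adj: dict mapping each node to the set of its out-neighbors.
    Returns (nodes, edges), with edges restricted to the node set.
    """
    one_hop = adj.get(i, set())
    nodes = {i} | set(one_hop)
    for j in one_hop:
        nodes |= adj.get(j, set())
    edges = {(u, v) for u in nodes for v in adj.get(u, set()) if v in nodes}
    return nodes, edges
```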
To make this statement precise, we will need to impose smoothness properties on g. We will show that given appropriate assumptions of smoothness (Assumption 1 in Section 5), nonparametric kernel estimators have desirable consistency properties.

Assume that the pair-specific features $s_t(i,j)$ come from a finite set S; if not, they are discretized into such a set. For example, one may let $s_t(i,j)$ record the number of common neighbors between i and j and the last time a link appeared between these nodes (lastlink); note that both are functions of $\vec N_{t,p}(i)$. Let $d_t(i) = \{\eta_{i,t}(s), \eta^+_{i,t}(s);\ \forall s \in S\}$, where $\eta_{i,t}(s)$ is the number of node pairs in $N_{t-1}(i)$ with feature vector s, and $\eta^+_{i,t}(s)$ is the number of such pairs which were also linked by an edge in the next timestep t. In a nutshell, $d_t(i)$ tells us the chances of an edge being created at time t given its features at time $t-1$, averaged over the whole neighborhood $N_{t-1}(i)$; in other words, it captures the change of the neighborhood around i over one timestep. One can think of $d_t(i)$ as a contingency table indexed by the features s. Contingency tables are widely referred to as "datacubes" in the database community, and we will adopt this terminology, referring to $d_t(i)$ as a datacube, and to a feature vector s as the "cell" s in the datacube with contents $(\eta_{i,t}(s), \eta^+_{i,t}(s))$. Finiteness of S is necessary to ensure that datacubes are finite-dimensional, which allows us to index them and quickly find nearest-neighbor datacubes.

2.2 The Estimator

Our estimator of the function $g(\cdot)$ at time T is:

\[
\tilde g_T(\psi_T(i,j)) = \frac{\sum_{i',j',t'} \Gamma\big(\psi_T(i,j), \psi_{t'}(i',j')\big)\, Y_{t'+1}(i',j')}{\sum_{i',j',t'} \Gamma\big(\psi_T(i,j), \psi_{t'}(i',j')\big)}, \tag{1}
\]

where we factor the kernel function $\Gamma(\psi_T(i,j), \psi_{t'}(i',j'))$ into neighborhood-specific and pair-specific parts: $K(d_t(i), d_{t'}(i')) \cdot \xi(s_t(i,j), s_{t'}(i',j'))$. Let $\mathrm{dist}(s, s')$ denote the $L_1$ distance between features s and s', and let $n(s)$ denote the set of features at $L_1$ distance 1 from feature s. We define $\xi(s_t(i,j), s_{t'}(i',j'))$ as

\[
\xi\big(s_t(i,j), s_{t'}(i',j')\big) := \frac{\mathbb I\{s_{t'}(i',j') = s_t(i,j)\} + \zeta_T\, \mathbb I\{\mathrm{dist}(s_t(i,j), s_{t'}(i',j')) = 1\}}{1 + \zeta_T\, |n(s_t(i,j))|}, \tag{2}
\]

where $\zeta_T$ is a bandwidth parameter which we will require to be $O(T^{-(1/2+\epsilon)})$ for some $\epsilon > 0$ in order to obtain consistency and distributional convergence. $K(d_t(i), d_{t'}(i'))$ is a discrete analog of a continuous kernel function (similar functions can be found in Aitchison and Aitken [2] and Wang and van Ryzin [32]). As is the case with continuous kernel functions, it has the property that, as the bandwidth parameter $b_T \to 0$, it equals one if and only if $d_t(i) = d_{t'}(i')$, and zero otherwise. Similarly, $\xi(s_t(i,j), s_{t'}(i',j'))$ has the property that, as $\zeta_T \to 0$, it approaches $\mathbb I\{s_{t'}(i',j') = s_t(i,j)\}$. This inner kernel function can also be extended to features at $L_1$ distance two and so forth, weighting those terms by powers of $\zeta_T$.
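The datacube construction and the inner kernel of Equation (2) might be sketched as follows. The representation (dicts keyed by discretized feature tuples) and the helper `feat` are our assumptions, not part of the paper:

```python
from collections import defaultdict

def datacube(nodes, adj_next, feat):
    """d_t(i): for each feature cell s, eta(s) counts node pairs in the
    neighborhood (at t-1) with feature vector s, and eta_plus(s) counts
    those pairs that are linked at the next timestep. Since graphs are
    directed, we count ordered pairs of distinct nodes.

    feat(u, v) -> discretized feature tuple s_{t-1}(u, v).
    """
    eta = defaultdict(int)
    eta_plus = defaultdict(int)
    for u in nodes:
        for v in nodes:
            if u == v:
                continue
            s = feat(u, v)
            eta[s] += 1
            if v in adj_next.get(u, set()):
                eta_plus[s] += 1
    return {s: (eta[s], eta_plus[s]) for s in eta}

def xi(s_query, s_hist, zeta, n_neighbors):
    """Pair-feature kernel of Eq. (2); n_neighbors is |n(s_query)|, the
    number of feature cells at L1 distance 1 from the query cell."""
    l1 = sum(abs(a - b) for a, b in zip(s_query, s_hist))
    match = 1.0 if l1 == 0 else 0.0
    near = 1.0 if l1 == 1 else 0.0
    return (match + zeta * near) / (1.0 + zeta * n_neighbors)
```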
Plugging the definition of the kernel into Equation (1), we obtain the following interpretation of the estimator:

\[
\tilde g_T(\psi_T(i,j)) =
\frac{\sum_{i',t'} K\big(d_t(i), d_{t'}(i')\big)\Big(\eta^+_{i',t'+1}(s_t(i,j)) + \zeta_T \sum_{s \in n(s_t(i,j))} \eta^+_{i',t'+1}(s)\Big)}
     {\sum_{i',t'} K\big(d_t(i), d_{t'}(i')\big)\Big(\eta_{i',t'+1}(s_t(i,j)) + \zeta_T \sum_{s \in n(s_t(i,j))} \eta_{i',t'+1}(s)\Big)}. \tag{3}
\]

Useful intuition can be obtained by considering the case $\zeta_T = 0$. Here, given the query pair $(i,j)$ at time t, we look inside the cells for the query feature $s = s_t(i,j)$ in all neighborhood datacubes, compute the average $\eta^+_{i',t'}(s)$ and $\eta_{i',t'}(s)$ in these cells after accounting for the similarities of the datacubes to the query neighborhood datacube, and use their quotient as the estimate of linkage probability. Letting $\zeta_T > 0$ provides an estimator that deals more effectively with sparsity by computing weighted averages of $\eta^+_{i',t'}(s)$ and $\eta_{i',t'}(s)$ over features s that are "close" to $s_t(i,j)$. Thus, the probability estimates are derived from historical instances where (a) the feature vector of the historical node pair matches the query, and (b) the local neighborhood is similar as well.

Now, we need a measure of the similarity between neighborhoods, with the goal of treating two neighborhoods as similar if they have similar probabilities of generating links between node pairs with feature vector s, for any $s \in S$. To this end we could simply compare the point estimates $\eta^+_\cdot(s)/\eta_\cdot(s)$, but we also wish to account for the variance in these estimates. We achieve this by defining a similarity measure that has a Bayesian flavor:

\[
K\big(d_t(i), d_{t'}(i')\big) = e^{-D(d_t(i), d_{t'}(i'))/b_T} \quad (0 < b_T < 1), \tag{4}
\]
\[
D\big(d_t(i), d_{t'}(i')\big) = \sum_{s \in S} \mathrm{TV}(X, Y), \qquad
X \sim \mathrm{Beta}\big(\eta^+_{i,t}(s),\ \eta_{i,t}(s) - \eta^+_{i,t}(s)\big), \qquad
Y \sim \mathrm{Beta}\big(\eta^+_{i',t'}(s),\ \eta_{i',t'}(s) - \eta^+_{i',t'}(s)\big),
\]

where $\mathrm{TV}(X, Y)$ denotes the total variation distance between the distributions of X and Y, Beta is the beta distribution, and $b_T \in (0,1)$ is a bandwidth parameter. We will require $b_T = O(T^{-(1/2+\theta)})$ for some $\theta > 0$ to obtain appropriate rates when we study the consistency and distributional convergence of our estimator.
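A sketch of the similarity measure of Equation (4). The total variation distance is approximated by numerical integration of the Beta densities on a grid, and we add 1 to each Beta parameter so that the parameters stay positive when a cell has few counts; both choices are ours, not prescriptions of the paper:

```python
import numpy as np
from scipy.stats import beta

GRID = np.linspace(1e-6, 1 - 1e-6, 2000)

def tv_beta(a1, b1, a2, b2):
    """Total variation distance between Beta(a1,b1) and Beta(a2,b2):
    half the L1 distance between the densities, integrated numerically
    on a uniform grid."""
    p = beta.pdf(GRID, a1, b1)
    q = beta.pdf(GRID, a2, b2)
    return 0.5 * float(np.sum(np.abs(p - q)) * (GRID[1] - GRID[0]))

def K(cube_q, cube_h, b_T):
    """Datacube similarity of Eq. (4): exp(-D/b_T), with D the sum of
    cell-wise TV distances. Cells absent from a cube count as (0, 0);
    the +1 smoothing of the Beta parameters is our assumption."""
    D = 0.0
    for s in set(cube_q) | set(cube_h):
        n_q, p_q = cube_q.get(s, (0, 0))
        n_h, p_h = cube_h.get(s, (0, 0))
        D += tv_beta(p_q + 1, n_q - p_q + 1, p_h + 1, n_h - p_h + 1)
    return float(np.exp(-D / b_T))
```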
Remarks. To better understand our choice of estimator, consider by way of contrast a simple estimator that computes the fraction of pairs for which the feature lastlink was equal to k at time t' and which formed an edge at time t'+1 (for k = 1, 2, ...). This approach suffers from two key problems that make it perform poorly on real-world graphs. First, it does not allow for local variations in the link-formation fractions, as would be expected for communities evolving differently within the same graph. We address this problem by maintaining a separate datacube for each local neighborhood. The second, more subtle, problem is the implicit assumption of stationarity: a node's link-formation probabilities are assumed to be time-invariant functions of the datacube features. This assumption does not allow for seasonal changes in linkage patterns, or for a transition from slow to fast growth, etc. Our model addresses this issue by finding historical neighborhoods from some previous time t' with datacubes similar to the query datacube, and using their evolution from t' to t'+1 to predict link formation in the next time step for the current neighborhood. This helps us learn nonlinear trends. Our estimator also has the virtue that it combats sparsity by aggregating data across similarly-evolving communities even if they are separated by graph distance and time. That said, sparsity remains a serious issue, and we provide a further discussion of sparsity in the following section.

Finally, note that we build the datacubes so as to encode the recent change of a neighborhood, and not just the distribution of features in the neighborhood. Thus, for example, two neighborhoods may have the same datacube if the fraction of lastlink = 1 node pairs that formed an edge in the next timestep is the same in both neighborhoods, but not if they both merely had the same number of lastlink = 1 pairs. Thus, it is the change in link structure that drives the estimation of linkage probabilities. Moreover, two neighboring nodes may end up having very similar datacubes, and will end up forming links in a similar way, whereas very different datacubes will reflect the variations in link formation patterns among different communities.

2.3 Sparsity

For sparse graphs, or short time series, two practical problems can arise. First, a node i can have zero degree and hence an empty neighborhood. To cope with this issue, we define the neighborhood of node i as the union of its two-hop neighborhoods over the last p timesteps. Second, and more problematically, the $\eta_\cdot(s)$ and $\eta^+_\cdot(s)$ values obtained from kernel regression can be small, yielding an estimated linkage probability $\eta^+_\cdot(s)/\eta_\cdot(s)$ that is unreliable numerically. We offer a threefold solution to this problem, the first element of which is already present in our estimator. (a) The inner kernel ξ (Equation 2) combines $\eta_\cdot(s)$ and $\eta^+_\cdot(s)$ with a weighted average of the corresponding values for any s' that are "close" to s, the weights encoding the similarity between s' and s. (b) In determining a final ranking, instead of using $\eta^+_\cdot(s)/\eta_\cdot(s)$ directly, we use the lower end of the 95% Wilson score interval [33] (sketched below). The node pairs that are ranked highest according to this "Wilson score" are those for which the estimated linkage probability $\eta^+_\cdot(s)/\eta_\cdot(s)$ is high and $\eta_\cdot(s)$ is also high (implying a reliable estimate). (c) We use a "back-off" smoothing procedure for the Wilson scores, in which the raw scores are smoothed against the scores obtained from a "prior" datacube, which is the average of all historical datacubes. The degree of smoothing depends on $\eta_\cdot(s)$. This can be thought of as a simple hierarchical model, where the lower level (the set of individual datacubes) smooths its estimates using the higher level (the prior datacube).
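Item (b) follows directly from the standard Wilson score interval; a minimal sketch:

```python
import math

def wilson_lower(eta_plus, eta, z=1.96):
    """Lower end of the 95% Wilson score interval (z = 1.96) for the
    proportion eta_plus/eta; it is high only when the estimate is both
    large and supported by enough observations."""
    if eta == 0:
        return 0.0
    p = eta_plus / eta
    center = p + z * z / (2 * eta)
    margin = z * math.sqrt(p * (1 - p) / eta + z * z / (4 * eta * eta))
    return (center - margin) / (1 + z * z / eta)
```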
3 Fast search using LSH

A naive implementation of the nonparametric estimator in Equation (3) computes the kernel similarity between the query datacube and all n datacubes at each of the T timesteps for each prediction, which can be infeasibly slow for large graphs. To obtain a more computationally tractable estimator, we consider only the top-r closest neighborhoods (in terms of the largest kernel similarities). The value of r is a parameter of the algorithm; for our experiments we use r = 20. What is needed to make this practical is a fast method (one that runs in sublinear time) to find the top-r closest neighborhoods. We achieve this by using locality-sensitive hashing (LSH) [13].

Hashing is often used in databases for fast table lookups, i.e., retrieving matching items from a large database. The key component is a hash function that maps a given "key" or object to a hash value. In order to search for a particular key, we compute the hash value and do a table lookup with this value. "Locality-sensitive" hashing refers to hash functions having the property that, with high probability, two "similar" data items are hashed to the same value. This facilitates approximate nearest-neighbor search, and is suitable for high-dimensional spaces, where traditional nearest-neighbor search techniques are often infeasible. The standard LSH method operates on bit sequences, and maps sequences with small Hamming distance to the same hash bucket. In our setting, we must hash datacubes, and we use the total variation distance metric. We make use of the fact that the total variation distance between discrete distributions is half the $L_1$ distance between the corresponding probability mass functions. If we could approximate the probability distributions in each datacube cell with bit sequences, then the $L_1$ distance would just be the Hamming distance between these sequences, making our setting amenable to the use of standard LSH. We achieve this with three steps (sketched in code below):

Conversion to bit sequences. The key idea is to approximate the linkage probability distribution by discretization. We first discretize the range [0,1] (since we deal with probabilities) into $B_1$ buckets. For each bucket we compute the probability mass p falling inside it. This p is encoded using $B_2$ bits by setting the first $\lfloor p B_2 \rfloor$ bits to 1, and the others to 0. In this way the entire distribution (i.e., one cell) is represented by $B_1 B_2$ bits. As a result, the entire datacube can now be stored in $|S| B_1 B_2$ bits. However, in all our experiments, datacubes were very sparse, with only $M \ll |S|$ cells ever being non-empty (usually 10-50); thus, we use only $M B_1 B_2$ bits in practice. The Hamming distance between two such $M B_1 B_2$-bit vectors yields the total variation distance between datacubes (modulo a constant factor).

Distances via LSH. We create a hash function by picking a uniformly random sample of k bits out of the $M B_1 B_2$. For each hash function, a hash table is created to store all datacubes whose hashes are identical in these k bits. We use ℓ such hash functions. A query datacube is first hashed using each of these ℓ functions. We then create a candidate set containing $O(\max(\ell, r))$ distinct datacubes sharing any of these ℓ hashes. The total variation distance of these candidates to the query datacube is computed explicitly, yielding the closest matching historical datacubes.

Picking k. The number of bits k is crucial in balancing accuracy against query time: while a large k hashes all datacubes to their own hash buckets, returning few or no matches to the query, a small k bunches many datacubes into the same bucket, decreasing the probability of finding the "true" near neighbors.
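The three steps might be sketched as follows (with $B_1 = B_2 = 8$ as illustrative defaults, and with `dist_fn` standing in for the total variation distance $D(\cdot,\cdot)$, e.g., as sketched in Section 2; the Beta smoothing is again our assumption):

```python
import random
from scipy.stats import beta

B1, B2 = 8, 8  # buckets over [0,1], and bits per bucket

def cell_bits(eta, eta_plus):
    """Unary-encode one cell's (smoothed) Beta posterior: split [0,1]
    into B1 buckets; for a bucket holding mass p, set the first
    floor(p * B2) of its B2 bits to 1."""
    a, b = eta_plus + 1, eta - eta_plus + 1   # +1 smoothing, as above
    bits = []
    for k in range(B1):
        p = beta.cdf((k + 1) / B1, a, b) - beta.cdf(k / B1, a, b)
        ones = int(p * B2)
        bits.extend([1] * ones + [0] * (B2 - ones))
    return bits

def cube_bits(cube, cells):
    """Concatenate cell encodings over a fixed list of feature cells."""
    out = []
    for s in cells:
        eta, eta_plus = cube.get(s, (0, 0))
        out.extend(cell_bits(eta, eta_plus))
    return out

def build_tables(cubes, cells, k, ell=10, seed=0):
    """ell hash tables; each hashes a random sample of k bit positions
    out of the len(cells) * B1 * B2-bit representation."""
    rng = random.Random(seed)
    nbits = len(cells) * B1 * B2
    samples = [rng.sample(range(nbits), k) for _ in range(ell)]
    tables = [dict() for _ in range(ell)]
    for cid, cube in cubes.items():
        bits = cube_bits(cube, cells)
        for tab, pos in zip(tables, samples):
            tab.setdefault(tuple(bits[p] for p in pos), []).append(cid)
    return samples, tables

def query(qcube, cubes, cells, samples, tables, dist_fn, r=20):
    """Candidates are the datacubes sharing any of the ell hashes with
    the query; the exact distance is computed only on the candidates."""
    qbits = cube_bits(qcube, cells)
    cand = set()
    for tab, pos in zip(tables, samples):
        cand.update(tab.get(tuple(qbits[p] for p in pos), []))
    return sorted(cand, key=lambda c: dist_fn(qcube, cubes[c]))[:r]
```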
In the spirit of Indyk and Motwani [13], we do a binary search to find the k for which the average hash-bucket size over a query workload is just large enough to provide the desired top-20 matches. We evaluate the accuracy of this approach in Section 4.

We conclude this section with two additional points. First, we never create the entire bit representation of $M B_1 B_2$ bits explicitly; only the hashes need to be computed, taking $O(k\ell)$ time. Second, the main cost in the algorithm is in creating the hash tables, which needs to be done only once as a preprocessing step. Query processing is extremely fast and sublinear, since the candidate set is much smaller than the size of the training set.

4 Experiments

We start by introducing several baseline algorithms and our evaluation metric. These baselines were picked carefully from previous work as being those that have yielded state-of-the-art performance in a range of link prediction tasks. In our first set of experiments we use simulated data to compare the performance of our algorithm to these baselines, focusing on situations involving seasonality in link formation. Second, we study the performance of our algorithm and the baselines on several real-world graphs: a sensor network, two co-authorship graphs, and a graph of Facebook employees. Finally, we investigate the computational scaling of our approach, comparing the improvement in runtime of the LSH-based algorithm over an exact algorithm, and investigating the effect of the LSH bit-size k on accuracy.

4.1 Baselines and metrics

We compare our nonparametric network inference algorithm (NNI) to the following baselines which, although quite naive, have proved difficult to beat in practice [17, 30]; minimal implementations of the first three are sketched below.

LL: ranks pairs in ascending order of the last time of linkage [30].
CN (last timestep): ranks pairs in descending order of the number of common neighbors [17].
AA (last timestep): ranks pairs in descending order of the Adamic-Adar score [1], a weighted variant of common neighbors which it has been shown to outperform [17].
Katz (last timestep): extends CN to paths of length greater than two, with longer paths receiving exponentially smaller weights [14].
CN-all, AA-all, Katz-all: CN, AA, and Katz computed on the union of all graphs up to the last timestep.

For NNI, we only predict on pairs which are in the neighborhood (generated by the union of two-hop neighborhoods over the last p timesteps) of each other. We deliberately used a simple feature set for NNI, setting $s_t(i,j) = \{cn_t(i,j), \ell\ell_t(i,j)\}$ (i.e., common neighbors and last-link), and did not use any external "meta-data" (e.g., stock sectors, university affiliations, etc.). All feature values were binned logarithmically in order to combat sparsity in the tails of the feature distributions. Strictly speaking, our feature $\ell\ell_t(i,j)$ should be capped at p. However, since the heuristic LL uses no such capping, for fairness we used the uncapped "last time a link appeared" as the feature $\ell\ell_t(i,j)$ for the pairs we predict on. The bandwidth $b_T$ was picked by cross-validation.
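For reference, minimal versions of LL, CN, and AA, on the dict-of-sets representation used in the earlier sketches:

```python
import math

def common_neighbors(adj, i, j):
    """CN: number of common neighbors at the last timestep."""
    return len(adj.get(i, set()) & adj.get(j, set()))

def adamic_adar(adj, i, j):
    """AA: common neighbors weighted by inverse log-degree."""
    score = 0.0
    for z in adj.get(i, set()) & adj.get(j, set()):
        deg = len(adj.get(z, set()))
        if deg > 1:
            score += 1.0 / math.log(deg)
    return score

def last_link(adj_seq, i, j):
    """LL: most recent timestep at which the pair was linked (-1 if
    never); pairs are then ranked by recency."""
    for t in range(len(adj_seq) - 1, -1, -1):
        if j in adj_seq[t].get(i, set()):
            return t
    return -1
```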
For any graph sequence $(G_1, \ldots, G_T)$, we test link prediction accuracy on $G_T$ for a subset $S_{>0}$ of nodes with non-zero degree in $G_T$. Each algorithm is provided training data up to and including timestep $T-1$, and must output, for each node $i \in S_{>0}$, a ranked list of nodes in descending order of probability of linking with i in $G_T$. For purposes of efficiency, we only require a ranking over the nodes that have ever been within two hops of i (call these the candidate pairs); all algorithms under consideration predict the absence of a link for nodes outside this subset. We compute the AUC score of the predicted scores for all candidate pairs against the actual edges formed in $G_T$.

4.2 Simulations

In this section we compare NNI to the baseline algorithms using simulated data, focusing on seasonal patterns as an example of the kind of nonlinear behavior that may be difficult to capture with heuristic methods. We simulated a model of Hoff et al. [10] that posits an independently drawn "feature vector" for each node. Time moves over a repeating sequence of seasons, with a different set of features being "active" in each. Nodes with these features are more likely to be linked in that season, though noisy links also exist. The user features also change smoothly over time, to reflect changing user preferences.

Model specifications. We generate features $u_{i,t} \in \mathbb R^6$ for node i at time t. Node pair $\{i,j\}$ has a link if $u_{i,t}^T L_t u_{j,t}$ exceeds one, where $L_t$ is a matrix governing feature interactions. We now formally define $u_{i,t}$ and $L_t$. For every node we generate two features $a_i, b_i \sim N(0_6, I_{6\times 6})$. The six features are divided into three blocks, each of size two. Now, for the t-th timestep, the features of node i are given by $u_{i,t} = \big(c_t a_i + (1 - c_t) b_i\big)/\sqrt{c_t^2 + (1 - c_t)^2}$, where $c_t = \frac{T - t}{T - 1}$. The normalization ensures identical variance of the features at any timestep. For $t = 3i + j$ (so that $j \in \{0, 1, 2\}$ indexes the season), the feature interaction matrix $L_t$ is generated as follows:

\[
B_{k,\ell} = \mu \quad \text{for } k, \ell \in \{2j+1,\, 2j+2\}, \qquad
L_t = B + \sigma\, \frac{R + R^T}{2},
\]

where R is a $6 \times 6$ matrix with i.i.d. N(0,1) entries, μ represents the signal, and σ represents the noise.
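The generator might be sketched as follows, under our reading of the specification (0-indexed timesteps, season $j = t \bmod 3$ activating feature block $\{2j+1, 2j+2\}$):

```python
import numpy as np

def simulate_seasonal(n=100, T=20, seasons=3, mu=1.0, sigma=0.1, seed=0):
    """Seasonal graph generator sketch: node features drift from a_i to
    b_i over time; each season activates one 2x2 block of the (symmetric)
    interaction matrix; edge {i,j} appears when u_i' L_t u_j > 1."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n, 6))
    b = rng.standard_normal((n, 6))
    graphs = []
    for t in range(T):
        c = (T - 1.0 - t) / (T - 1.0)      # c_t = (T - t)/(T - 1), 1-indexed
        u = (c * a + (1 - c) * b) / np.sqrt(c**2 + (1 - c)**2)
        j = t % seasons                     # active feature block this season
        B = np.zeros((6, 6))
        B[2 * j:2 * j + 2, 2 * j:2 * j + 2] = mu
        R = rng.standard_normal((6, 6))
        L = B + sigma * (R + R.T) / 2
        S = u @ L @ u.T                     # pairwise interaction scores
        graphs.append((S > 1) & ~np.eye(n, dtype=bool))
    return graphs
```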
We generated 100-node graphs over 20 timesteps using 3 seasons, and plotted AUC averaged over 10 random runs for several noise-to-signal ratios (Figure 1). NNI consistently outperformed all other baselines by a large margin. Clearly, seasonal graphs have nonlinear linkage patterns: the best predictors of links at time T are the links at times T-3, T-6, etc., and NNI is able to learn this pattern. By contrast, CN, AA, and Katz are biased towards predicting links between pairs which are linked (or have short paths connecting them) at the previous timestep T-1; this implicit smoothness assumption makes them perform poorly; indeed, they behaved essentially as poorly as a random predictor (an AUC of 0.5). The baselines LL, CN-all, AA-all, and Katz-all use information from the union of all graphs up to time T-1. Since the off-seasonal noise edges are not sufficiently numerous to form communities, most of the new edges come from communities of nodes created in season. This is why CN-all, AA-all, and Katz-all outperform their "last-timestep" counterparts. As for LL, since links are more likely to come from the last seasons, it performed well, although poorly compared to NNI. Also note that the changing user features force the community structures to change slowly over time; in our experiments, CN-all performed worse than it would have had the user features not changed, since in that case the communities would have stayed the same.

Figure 2 summarizes the average AUC scores for graphs with seasonality, and also presents results for stationary data. In both cases, the noise was set to the smallest value in Figure 1. For the stationary data, links formed in the last few timesteps of the training data are good predictors of future links, and so LL, CN, AA, and Katz all performed very well. Interestingly, CN-all, AA-all, and Katz-all were worse than their "last-timestep" variants, presumably owing to the slow movement of the user features. As for NNI, it performed slightly better than all other methods on the stationary data, in addition to showing substantial improvements over the other methods on the seasonal networks.

[Figure 1: Simulated graphs: effect of noise. AUC score versus noise-to-signal ratio for NNI, CN, CN-all, LL, AA, AA-all, Katz, and Katz-all.]

Figure 2: Average AUC for T = 20 timesteps.

              Seasonal       Stationary
  NNI         .91 ± .01      .99 ± .005
  LL          .77 ± .03      .97 ± .006
  CN          .51 ± .02      .97 ± .01
  AA          .51 ± .02      .95 ± .02
  Katz        .50 ± .02      .97 ± .01
  CN-all      .71 ± .03      .86 ± .03
  AA-all      .65 ± .04      .71 ± .04
  Katz-all    .71 ± .03      .87 ± .03

4.3 Real-world graphs

We begin by presenting results on a 24-node sensor network where each edge represents the successful transmission of a message (data from http://www.select.cs.cmu.edu/data). We considered up to 82 consecutive measurements. These networks exhibit clear periodicity; in particular, a different set of sensors turns on and communicates during each of four different periods. Figure 3 shows our results for these four periods averaged over several cycles. The maximum standard deviation, averaged over the periods, was .07. We do not show results for CN, AA, and Katz, as they all performed no better than a random predictor. NNI significantly outperformed the baselines, confirming the results from the simulation experiments on seasonal graphs.

We also present results on three dynamic co-authorship graphs: the Physics "HepTh" community (n = 14,737 nodes, e = 31,189 total edges, and T = 8 timesteps), NIPS (n = 2,865, e = 5,247, T = 9), and authors of papers on Citeseer (n = 20,912, e = 45,672, T = 11) with "machine learning" in their abstracts. Each timestep covers 1-2 years of papers (so that the median degree at any timestep is at least 1). Finally, we also considered a dynamic undirected network of Facebook employees over several weeks, where the nodes represent employees and edges are formed if one employee mentions another in a post. The network contains over five thousand nodes and over 100,000 edges in total.

[Figure 3: AUC scores for a periodic sensor network.]

Figure 4: Average AUC for the co-authorship and Facebook graphs.

              NIPS    HepTh   Citeseer   Facebook
  NNI         .87     .89     .89        .82
  LL          .84     .87     .90        .81
  CN          .74     .76     .69        .70
  AA          .84     .87     .90        .71
  Katz        .75     .83     .83        .78
  CN-all      .56     .62     .70        .87
  AA-all      .77     .83     .83        .89
  Katz-all    .67     .71     .81        .89
Figure 4 shows the average AUC for all algorithms on the co-authorship graphs and the Facebook graph. For the co-authorship graphs, we do not expect to see seasonal variation, and we expect a relatively simple model to be effective: authors tend to keep working with a similar set of co-authors over time. For such graphs, Tylenda et al. [30] have shown that LL is the best heuristic, and we replicate that result here. Our kernel-based approach, NNI, also performs well on these graphs, slightly outperforming LL. For the Facebook graph, employees in the same research group tend to post more messages mentioning each other, and hence algorithms working on all edges seen so far should intuitively pick up this community structure. This is indeed reflected in the AUC scores: CN-all, AA-all, and Katz-all perform the best. These algorithms outperform NNI primarily because they count paths through edges that exist in different timesteps, which is not allowed in our model.

In summary, for graphs having a seasonal trend, NNI is the best method by a large margin. For the co-authorship graphs, NNI remains the best algorithm, although LL is also effective. For the Facebook graph, Katz-all (tied with AA-all) is the best algorithm, but its performance is quite poor on the co-authorship graphs and the seasonal graphs. Overall, the performance of NNI dominates that of the other algorithms.

4.4 Evaluation of LSH

We have found the use of LSH to be essential in our experimental work. In this section we provide quantitative support for this assertion.

Exact search vs. LSH. In Figure 5(a) we plot the time taken to perform a top-20 nearest-neighbor search for a query datacube using simulated data. We fixed the number of nodes at 100 and increased the number of timesteps. As expected, the exact search time increases linearly with the total number of datacubes, whereas LSH searches in nearly constant time. Also, the AUC score of NNI with LSH is within 0.4% of that of the exact algorithm on average, implying minimal loss of accuracy from LSH. In our experiments with real-world graphs, the query time per datacube using LSH was quite small: 0.3s for Citeseer, 0.4s for NIPS, 0.6s for HepTh, and 1.9s for Facebook. Exact search was infeasible for these large-scale graphs.

Number of bits in hashing. Figure 5(b) shows the effectiveness of our adaptive scheme for selecting the number of hash bits (Section 3). For these experiments, we turned off the smoothing based on the prior datacube. As k increases, the accuracy drops towards 50%, because NNI fails to find any matches for the query datacube. Our adaptive scheme finds k ≈ 170, which yields the highest accuracy. Note also that larger k translates to fewer entries per hash bucket and hence faster searches, so our adaptive choice of k yields the fastest runtime performance as well.

[Figure 5: Time and accuracy using LSH. (a) Search time per datacube vs. number of datacubes, for NNI with exact search and NNI with hashing. (b) Average AUC vs. number of hash bits k, with our adaptive choice of k marked.]
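For completeness, the adaptive selection of k from Section 3 can be sketched as a binary search over the hash bit-count, reusing `build_tables` and `cube_bits` from the earlier LSH sketch; the exact stopping rule (average candidate-set size at least r over a query workload) is our simplification:

```python
def pick_k(cubes, cells, queries, r=20, ell=10):
    """Binary search for the number of hash bits k: the largest k whose
    average candidate-set size over the query workload is still at least
    r, making buckets as selective as possible while returning enough
    matches (cf. Fig. 5(b))."""
    lo, hi = 1, len(cells) * B1 * B2
    while lo < hi:
        k = (lo + hi + 1) // 2
        samples, tables = build_tables(cubes, cells, k, ell)
        total = 0
        for q in queries:
            qbits = cube_bits(q, cells)
            cand = set()
            for tab, pos in zip(tables, samples):
                cand.update(tab.get(tuple(qbits[p] for p in pos), []))
            total += len(cand)
        if total / len(queries) >= r:
            lo = k      # still enough candidates; try a more selective k
        else:
            hi = k - 1  # buckets too small; back off
    return lo
```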
5 Consistency of the Kernel Estimator

In this section we study the consistency of the estimator $\tilde g$ defined in Eq. (3). Recall that our model is:

\[
Y_{t+1}(i,j) \mid \mathcal G \sim \mathrm{Bernoulli}\big(g(\psi_t(i,j))\big), \tag{5}
\]

where $\psi_t(i,j)$ equals $\{s_t(i,j), d_t(i)\}$. Assume that all graphs have n nodes (n finite). For a fixed node $q \in \{1, \ldots, n\}$, let Q represent the query datacube $d_T(q)$. We want to study the consistency of predictions for timestep T+1.

Rather than studying $\tilde g$ directly, it proves simpler to study a slightly different estimator which we show (in Lemma 5.1) to be asymptotically equivalent to $\tilde g$. Define $\hat g_T(s, Q)$, $\hat h_T(s, Q)$, and $\hat f_T(s, Q)$ as follows:

\[
\hat g_T(s, Q) = \frac{\hat h_T(s, Q)}{\hat f_T(s, Q)} \quad (\text{where } s = s_T(q, q')), \tag{6}
\]
\[
\hat h_T(s, Q) = \frac{1}{n(T-p)} \sum_{t=p}^{T-1} \sum_{i=1}^{n} K_{b_T}(d_t(i), Q)\, \eta^+_{i,t+1}(s), \qquad
\hat f_T(s, Q) = \frac{1}{n(T-p)} \sum_{t=p}^{T-1} \sum_{i=1}^{n} K_{b_T}(d_t(i), Q)\, \eta_{i,t+1}(s).
\]

Lemma 5.1. Define $\tilde g_T(\cdot)$ as in Equation (1) and $\hat g_T(\cdot)$ as in Equation (6). We have $|\tilde g_T(s, Q) - \hat g_T(s, Q)| = O(\zeta_T)$.

Proof. Recall that $n(s)$ denotes the set of features at $L_1$ distance 1 from s. Let $k := |n(s)|$. We have:

\[
\tilde g_T(s, Q) = \frac{\hat h_T(s, Q) + C_T}{\hat f_T(s, Q) + D_T},
\]

where, by virtue of the finiteness of the number of features and of η and $\eta^+$, we have

\[
C_T := \zeta_T \sum_{i,t} K_{b_T}(d_t(i), Q) \sum_{s' \in n(s)} \eta^+_{i,t+1}(s') = O(\zeta_T).
\]

Similarly, $D_T = O(\zeta_T)$. Also, note that both $C_T$ and $D_T$ are non-negative. Thus we have:

\[
|\tilde g_T(s, Q) - \hat g_T(s, Q)| = \left| \frac{C_T \hat f_T(s, Q) - D_T \hat h_T(s, Q)}{(\hat f_T(s, Q) + D_T)\, \hat f_T(s, Q)} \right| = O(\zeta_T),
\]

where the last step follows because both $\hat h_T$ and $\hat f_T$ are bounded, and $\hat f_T$ tends to a positive constant with probability tending to one as $T \to \infty$ (as shown in Theorem 5.2).

The estimator $\hat g_T$ is defined only when $\hat f_T > 0$, which holds with probability tending to one, as will be shown in the next theorem. The kernel was defined earlier as $K_{b_T}(d_t(i), Q) = e^{-D(d_t(i), Q)/b_T}$, where the bandwidth $b_T$ tends to 0 as $T \to \infty$, and $D(\cdot)$ is the distance function defined in Eq. (4). It has the following property:

\[
\lim_{b_T \to 0} K_{b_T}(d_t(i), Q) = \begin{cases} 1 & \text{if } d_t(i) = Q, \\ 0 & \text{otherwise.} \end{cases} \tag{7}
\]

From now on, we drop the arguments s and Q and simply write g, $\hat g_T$, $\hat f_T$, and $\hat h_T$.

Our graph evolution model is Markovian; taking each "state" to comprise p+1 consecutive graphs, the next graph (and hence the next state) is a function only of the current state. The state space is also finite, since each graph has bounded size. Thus, the state space $\mathcal S$ may be partitioned into a set of transient states and $\bigcup_i C_i$, where each $C_i$ is an irreducible closed communication class, and there exists at least one $C_i$ [7]. The Markov chain must eventually enter one of the (finitely many) communication classes. We will denote the time of entering some communication class by $T_1$, and the corresponding event by $E_{T_1}$. We remind the reader that, using simple arguments for finite state space Markov chains, it can be shown that the tail probability of $T_1$ decays geometrically (see [7]), implying finiteness of its first and second moments. Also let $S_C$ denote the event $S_T \in C$, where $S_t$ denotes the state of the Markov chain at time t.
Thus $E_{T_1} \cap S_C$ is the event that the chain enters class C at time $T_1$ and remains there henceforth.

Theorem 5.2 (Consistency). Let $b_T = o(1)$ as $T \to \infty$. For two fixed nodes $q, q' \in \{1, \ldots, n\}$, $\hat g_T(s(q,q'), d_T(q))$ is well-defined with probability tending to one as $T \to \infty$. Also, $\hat g_T(s(q,q'), d_T(q))$ is a consistent estimator of $g(s(q,q'), d_T(q))$, i.e., $\hat g_T(s(q,q'), d_T(q)) \overset{P}{\to} g(s(q,q'), d_T(q))$ as $T \to \infty$.

Proof. First, note that our query datacube is obtained at time T, and we are interested in the asymptotic behavior of the chain as $T \to \infty$. Since our Markov chain has a finite state space, the query datacube belongs to some closed communication class C with probability tending to one. Thus, as $T \to \infty$, the estimator's distribution is governed by that communication class. We prove our result in two parts. First, we show that the convergence statement holds conditioned on $S_C$ for any communication class C, i.e., $P(|\hat g_T - g| \ge \epsilon \mid S_C) \to 0$ as $T \to \infty$. Next, we have

\[
P(|\hat g_T - g| \ge \epsilon) \le \sum_C P(|\hat g_T - g| \ge \epsilon \mid S_C)\, P(S_C) + P(T_1 > T),
\]

which implies $\limsup_{T \to \infty} P(|\hat g_T - g| \ge \epsilon) = 0$, given the tail bound on $T_1$ and the fact that the first term is a sum over a finite number of terms, each converging to zero as $T \to \infty$. In what follows, we give a proof of statistical consistency conditioned on $S_C$, for an arbitrary communication class C.

Define $B_T(s, Q, C) = E[\hat h_T \mid S_C]/E[\hat f_T \mid S_C] - g$. We have:

\[
\hat g_T - g = \frac{[\hat h_T - g \hat f_T] - E[\hat h_T - g \hat f_T \mid S_C]}{\hat f_T} + B_T\, \frac{E[\hat f_T \mid S_C]}{\hat f_T}. \tag{8}
\]

Lemma 5.3 shows that $E[\hat f_T \mid S_C] \to R_c$, where $R_c$ is a positive deterministic function of the class C. Thus, $B_T$ is asymptotically well-defined. Also, Lemma 5.9 shows that $\mathrm{var}(\hat f_T \mid S_C)$ tends to 0 as $T \to \infty$. This, together with Lemma 5.3, shows that, conditioned on $S_C$, $\hat f_T \overset{P}{\to} R_c$, thus also proving that $\hat g_T$ is asymptotically well-defined for C. Next, define

\[
\hat h_T(t) := \frac{1}{n} \sum_{i=1}^{n} K_{b_T}(d_t(i), Q)\, \eta^+_{i,t+1}(s), \qquad
\hat f_T(t) := \frac{1}{n} \sum_{i=1}^{n} K_{b_T}(d_t(i), Q)\, \eta_{i,t+1}(s). \tag{9}
\]

Note that $\hat h_T$ and $\hat f_T$ (Equation 6) equal $\sum_t \hat h_T(t)/(T-p)$ and $\sum_t \hat f_T(t)/(T-p)$ respectively. Also let

\[
q_t := \hat h_T(t) - E[\hat h_T(t) \mid S_C] - g\big(\hat f_T(t) - E[\hat f_T(t) \mid S_C]\big). \tag{10}
\]

Thus $q_t$ is a bounded deterministic function of the state at time t. In Lemma 5.6 we prove that $\mathrm{var}(\sum_t q_t/\sqrt T \mid S_C) \to \sigma_c$ for some non-negative constant $\sigma_c$ as $T \to \infty$. Thus we have $\mathrm{var}(\sum_t q_t/T \mid S_C) \to 0$ as $T \to \infty$. Since $E[q_t \mid S_C] = 0$, we have $\sum_t q_t/T \sim \big([\hat h_T - g \hat f_T] - E[\hat h_T - g \hat f_T \mid S_C]\big) \overset{qm}{\to} 0$ conditioned on $S_C$. Since convergence in quadratic mean implies convergence in probability, we have

\[
\Big(\hat f_T,\ [\hat h_T - g \hat f_T] - E[\hat h_T - g \hat f_T \mid S_C]\Big) \overset{P}{\to} (R_c, 0)
\]

conditioned on $S_C$. Using the continuous mapping theorem on $f(X,Y) = Y/X$ and the fact that $B_T = o(1)$ (Lemma 5.4), we have that, for any C such that $S_T \in C$, $\hat g_T \overset{P}{\to} g$.

The proof of the following lemma is deferred to the Appendix.

Lemma 5.3. As $T \to \infty$, for some $R_c > 0$ (a deterministic function of the class C), $E[\hat f_T(s, Q) \mid E_{T_1}, S_C] \to R_c$ and $E[\hat f_T(s, Q) \mid S_C] \to R_c$.
The following smoothness condition on g is introduced to ensure appropriate rates of convergence of the bias term $B_T$.

Assumption 1. The function g satisfies the following smoothness condition with respect to the distance metric D: $|g(s, d_t(i)) - g(s, d_{t'}(j))| = O\big(D(d_t(i), d_{t'}(j))\big)$.

Lemma 5.4. Define $B_T(s, Q, C) = \big(E[\hat h_T(s, Q) \mid S_C] - g\, E[\hat f_T(s, Q) \mid S_C]\big)/E[\hat f_T(s, Q) \mid S_C]$. If Assumption 1 holds, then $B_T = O(b_T)$. Since $b_T \to 0$ as $T \to \infty$, this implies $B_T = o(1)$.

Proof sketch. For $t \in [p, T-2]$, $i \in [1, n]$, and $s = s_T(q, q')$, the numerator of $B_T$ is an average of the terms

\[
A_t := E\big[K_{b_T}(d_t(i), Q)\, \eta^+_{i,t+1}(s) \mid S_C\big] - E\big[K_{b_T}(d_t(i), Q)\, \eta_{i,t+1}(s) \mid S_C\big]\, g(s, Q).
\]

Using a further conditioning step on $E_{T_1}$, we can show that the numerator of $B_T$ can be upper bounded as

\[
\Big|\sum_t A_t / T\Big| \le \sum_t \Big| E\big[K_{b_T}(d_t(i), Q)\, \eta_{i,t+1}(s)\, \big(g(s, d_t(i)) - g(s, Q)\big) \mid S_C\big] \Big| / T + o(1).
\]

We now analyze each term in the average, i.e., terms of the form $E[K_{b_T}(d_t(i), Q)\, \eta_{i,t+1}(s) \cdot (g(s, d_t(i)) - g(s, Q)) \mid S_C]$. This expectation is computed over all possible configurations of the neighborhoods $N_t(i)$ and $N_{t+1}(i)$. Since our neighborhood sizes are bounded (because n is bounded), the expectation is a sum over a finite number of terms. We now use the smoothness assumption on g. Using $|g(s, d_t(i)) - g(s, Q)| = O(D(d_t(i), Q))$, the fact that $\eta_{i,t+1}(s)$ is finite for all T, and Lemma 5.3, we have

\[
B_T = O\Big(E\big[D(d_t(i), Q)\, e^{-D(d_t(i), Q)/b_T} \mid S_C\big]\Big) = O(b_T),
\]

which holds because for non-negative x we have $x e^{-x/b_T} \le b_T/e$.

We now show that the variances of $\hat f_T$ and $\hat h_T$ converge to zero. In order to upper bound the growth of the variance terms, we make use of strong mixing. For a Markov chain $S_t$, define the strong mixing coefficients

\[
\alpha(k) := \sup_{|t - t'| \ge k} \big\{ |P(A \cap B) - P(A)P(B)| : A \in \mathcal F_{\le t},\ B \in \mathcal F_{\ge t'} \big\},
\]

where $\mathcal F_{\le t}$ and $\mathcal F_{\ge t'}$ are the sigma algebras generated by the events in $\bigcup_{i \le t} S_i$ and $\bigcup_{i \ge t'} S_i$ respectively. Intuitively, small values of α(k) imply that states that are k apart in the Markov chain are almost independent. For bounded A and B, this also limits their covariance: $|\mathrm{cov}(A, B)| \le c\, \alpha(k)$ for some constant c [5].

Instead of proving that the variance of $\hat h_T$ or $\hat f_T$ converges to zero, we will prove that the variance of the sum, divided by T, converges to a non-negative constant. This is a stronger result that we will find useful in proving weak convergence in Section 7. We introduce some notation that will be used in stating the next few results. Let $q_t$ denote a bounded deterministic function of the state of a finite state space Markov chain at time t. Also define $U_T := \sum_t q_t/\sqrt T$. Recall that our Markov chain will eventually hit one of the finitely many closed communication classes. Earlier we used $S_C$ to denote the event $\{S_T \in C\}$, $T_1$ the time of entering some communication class, and $E_{T_1}$ the corresponding event. We will denote the event of entering class C at time $T_1$ by $E_{T_1} \cap S_C$.
If C is aperiodic, then once inside C, the Markov chain gets arbitrarily close to the stationary distribution of C after some constant time M; we state this more formally in the following lemma, whose proof is deferred to the Appendix.

Lemma 5.5. Consider an irreducible and aperiodic finite state Markov chain with probability transition matrix P, initial distribution $\pi_0$, and stationary distribution π. Let $X_t$ be a random variable (with finite support) that is conditionally independent of all other states, given the state at time t. Denote by $E[X_t \mid \pi_0]$ the expectation of $X_t$ under the distribution at time t, and let μ denote the expectation of $X_\infty$ (i.e., the expectation with respect to π). There exist constants $\lambda \in (0,1)$ and M such that, for all $t > M$,

\[
\max_{x \in \mathcal S} \sum_{y \in \mathcal S} |P^t(x, y) - \pi(y)| = O(\lambda^t), \qquad |E[X_t \mid \pi_0] - \mu| = O(\lambda^t).
\]

Our estimators are weighted sums of variables indexed by $1, \ldots, T$; for $T_1 \le T$, we break this sum up into three parts, indexed by $1, \ldots, T_1 - 1$, followed by $T_1, \ldots, T_1 + M - 1$, and finally $T_1 + M, \ldots, T$, where M is a constant. For $T_1 > T$, we use the fact that $T_1$ has bounded first and second moments. Since we are interested in the behavior of the sum unconditionally, our analysis consists of two steps of nested conditioning, the outer one obtained by conditioning on $S_C$, which in turn is obtained by analyzing the sum conditioned on $E_{T_1} \cap S_C$. For ease of exposition we will assume C to be aperiodic. The more general case of cyclo-stationarity, which is similar in principle, is discussed in Remark 5.10.

Lemma 5.6. $\mathrm{var}(U_T \mid S_C) \to \sigma_c$ as $T \to \infty$, for some constant $\sigma_c \ge 0$.

Proof. We have $\mathrm{var}(U_T \mid S_C) = E[\mathrm{var}(U_T \mid E_{T_1}, S_C) \mid S_C] + \mathrm{var}(E[U_T \mid E_{T_1}, S_C] \mid S_C)$. We prove that the first part converges to a non-negative constant $\sigma_c$ (a deterministic function of C) (Lemma 5.7), and that the second is asymptotically o(1) (Lemma 5.8).

Lemma 5.7. For any finite integer k, we have

\[
\mathrm{var}\Big(\sum_{t \ge T_1 + M} q_t \,\Big|\, E_{T_1}, T_1 = k, S_C\Big)/T \to \sigma_c \quad \text{for some } \sigma_c \ge 0, \tag{11}
\]
\[
\mathrm{var}\Big(\sum_t q_t \,\Big|\, E_{T_1}, T_1 = k, S_C\Big)/T \to \sigma_c \quad \text{for some } \sigma_c \ge 0. \tag{12}
\]

For a Markov chain with a finite state space, we also have $E[\mathrm{var}(U_T \mid E_{T_1}, S_C) \mid S_C] \to \sigma_c$ for some $\sigma_c \ge 0$.

Proof sketch. For ease of exposition, we assume in this sketch that there is only one communication class, which is aperiodic. Recall that $T_1$ is the time to hit the communication class. Once inside the communication class, irreducibility and aperiodicity imply geometric ergodicity (Lemma 5.5), which implies absolute regularity, which in turn implies strong mixing with exponential decay [3]: $\alpha(k) \sim e^{-\beta k}$ for some β > 0. We can prove that, for finite $T_1$, $\mathrm{var}(\sum_t q_t \mid E_{T_1}, T_1 = k, S_C)/T = \mathrm{var}(\sum_{t \ge T_1 + M} q_t \mid E_{T_1}, T_1 = k, S_C)/T + o(1)$, so we focus on proving Equation (11). Denote $\sum_{t \ge T_1 + M} q_t$ by P. Recall that for our Markov chain, the state $S_t$ involves p+1 graphs $(G_{t-p+1}, \ldots, G_{t+1})$. Since $q_t$ is a function of $S_t$, it also depends on p+1 graphs. Hence the distance $\mathrm{dist}(t, t')$ between two sigma-algebras $\mathcal F_{\le t}$ and $\mathcal F_{> t'}$ is defined as $\max(t' - t - (p+1), 0)$. Now we can write

\[
\mathrm{var}(P \mid E_{T_1}, S_C) = 2 \sum_{t \ge T_1 + M}\ \sum_{\mathrm{dist}(t, t') = 0}^{T - t} \mathrm{cov}(q_t, q_{t'} \mid E_{T_1}, S_C).
\]
Since the number of states at distance 0 is $O(p+1)$, and at distance $\ge 1$ it is O(1), for constants $\{c_k, k \ge 0\}$ we have

\[
\sum_{\mathrm{dist}(t, t') = 0}^{T - t} |\mathrm{cov}(q_t, q_{t'} \mid E_{T_1}, S_C)| \le \sum_{k=0}^{\infty} c_k\, \alpha(k) = O\Big(\sum_k e^{-\beta k}\Big) = O(1).
\]

This shows that the above sum converges to some constant $a_t$. Since $t \ge T_1 + M$, the chain gets arbitrarily close to stationarity, and $a_t \to \sigma_c$ for some constant $\sigma_c$. Thus $\mathrm{var}(P \mid E_{T_1}, S_C)/T$ is asymptotically equivalent to $\sum_t a_t/T$, which also converges to $\sigma_c$ as $T \to \infty$. Since $\mathrm{var}(P \mid E_{T_1}, S_C)/T$ is non-negative for all T, $\sigma_c$ is also non-negative. This proves Equation (11). Thus Equation (12) is proved, and also, since $T_1$ has finite first and second moments for a finite state space Markov chain, $E[\mathrm{var}(\sum_t q_t \mid E_{T_1}, S_C) \mid S_C]/T$ converges to $\sigma_c$ as $T \to \infty$.

It remains to analyze $\mathrm{var}(E[U_T \mid E_{T_1}, S_C] \mid S_C)$ in the variance decomposition. Using Lemma 5.5 we can prove that $|E[U_T - \mu_c \mid S_C, E_{T_1}]|$ approaches zero at a geometric rate as $T \to \infty$, where $\mu_c$ denotes the expectation of $q_t$ under the stationary distribution of communication class C. This implies the following lemma, which is proved in the Appendix.

Lemma 5.8. $\mathrm{var}(E[U_T \mid E_{T_1}, S_C] \mid S_C) = o(1)$.

Lemma 5.9. $\mathrm{var}(\hat h_T \mid S_C)$ and $\mathrm{var}(\hat f_T \mid S_C)$ tend to 0 as $T \to \infty$.

Proof. The result follows by applying Lemma 5.6 with $q_t(\cdot)$ equal to $\sum_i K_{b_T}(d_t(i), Q)\, \eta^+_{i,t+1}(s)/n$ and $\sum_i K_{b_T}(d_t(i), Q)\, \eta_{i,t+1}(s)/n$ respectively.

Remark 5.10. Recall that Lemma 5.7 was obtained under the assumption that C is aperiodic. Periodicity of C implies cyclo-stationarity; i.e., the chain $S_{t+kd}$ approaches stationarity as $k \to \infty$. Hence, for periodic C (with period d), we consider M', a Markov chain in which each transition corresponds to d transitions of the original chain. Now M' is irreducible and aperiodic (since C was irreducible with period d). A state $S'_t$ of M' started at $S_1$ simply corresponds to the old state $S_{td+1}$ of M. Now $\frac{1}{\sqrt T}\sum_{t=1}^{T} q_t$ can be written as $\frac{1}{\sqrt T}\sum_{t=1}^{\lfloor T/d \rfloor} q'_t + o_P(1)$, where $q'_i := \sum_{j=id+1}^{(i+1)d} q_j$ is the sum of d consecutive random variables. Since $q'_t$ is independent of all other $q'$s conditioned on $S'_t, S'_{t+1}$, we have:

\[
\mathrm{cov}(q'_t, q'_{t+k}) = E\big[E[q'_t q'_{t+k} \mid S'_{t+1}, S'_{t+k}]\big] - E[q'_t]\, E[q'_{t+k}]
= E\big[E[q'_t \mid S'_{t+1}]\, E[q'_{t+k} \mid S'_{t+k}]\big] - E\big[E[q'_t \mid S'_{t+1}]\big]\, E\big[E[q'_{t+k} \mid S'_{t+k}]\big] \tag{13}
\]
\[
= \mathrm{cov}\big(E[q'_t \mid S'_{t+1}],\ E[q'_{t+k} \mid S'_{t+k}]\big) = O(\alpha(k-1)).
\]

The last step uses the fact that the $q'_t$ are bounded. Now, $E[\mathrm{var}(\frac{1}{\sqrt T}\sum_{t=1}^{\lfloor T/d \rfloor} q'_t \mid E_{T_1}, S_C)]$ can again be shown to converge to some non-negative constant using a slight modification of the argument in Lemma 5.7. The $o_P(1)$ remainder of $\frac{1}{\sqrt T}\sum_{t=1}^{T} q_t$ can be shown to be negligible via a simple application of the Cauchy-Schwarz inequality. A detailed proof of Lemma 5.7 using this idea can be found in the Appendix. As for $E[U_T \mid E_{T_1}, S_C]$ in the cyclic case, we simply apply Lemma 5.5 to each of the d cyclic classes. For the i-th cyclic class, $q_{T_1 + kd + i}$ is independent of all states (in that cyclic class) given $S_{T_1 + kd + i}$.
Hence there exist $M_i$ and $\lambda_i \in (0,1)$ such that, for all k with $kd + i > M_i$, $|E[q_{T_1 + kd + i} \mid E_{T_1}, S_C] - \mu_i| = O(\lambda_i^k)$, thus proving Lemma 5.5 for periodic C. This in turn proves Lemma 5.8 for the case where C is periodic.

6 Stein's Method for Graphical Data

Our estimators, and indeed many kernel estimators, involve weighted sums of dependent variables. While their distributional convergence can be studied using existing results on ergodic Markov chains, we take a different approach, based on an adaptation of Stein's method to the setting of graphs. We begin with a brief introduction to Stein's method. The method reposes on the following key lemma [4], which provides a characterization of the normal distribution:

Lemma 6.1 (Stein's Lemma). If W has a standard normal distribution, then

\[
E f'(W) = E[W f(W)] \tag{14}
\]

for all absolutely continuous functions $f : \mathbb R \to \mathbb R$ with $E|f'(Z)| < \infty$. Conversely, if Equation (14) holds for all bounded, continuous, and piecewise continuously differentiable functions f with $E|f'(Z)| < \infty$, then W has a standard normal distribution.

Recall that the Wasserstein distance between a mean-zero, unit-variance random variable W and a standard normal variate Z is defined as $\sup_{h \in \mathcal H} |E h(W) - E h(Z)|$, where $\mathcal H := \{h : |h(x) - h(y)| \le |x - y|\}$. Weak convergence of W to Z can be established by showing that the Wasserstein distance converges to zero. Now, Stein's Lemma (6.1) shows that $W \overset{d}{=} Z$ if $|E f'(W) - E[W f(W)]|$ equals zero for appropriate choices of f. This key observation leads to the Stein equation:

\[
f'(W) - W f(W) = h(W) - E[h(Z)]. \tag{15}
\]

It can be shown that the solution to the Stein equation, for $h \in \mathcal H$, satisfies $\|f\| \le 2$, $\|f'\| \le 2$, and $\|f''\| \le \sqrt{2/\pi}$ [4]. Thus, instead of dealing with $E[h(W)] - E[h(Z)]$, we need only show that $|E[f'(W) - W f(W)]|$ is small (where f satisfies the aforementioned conditions); this is an easier quantity to analyze.

The existing application of Stein's method to sums of weakly dependent random variables has focused on marginal-independence structures that can be captured by a bounded-degree dependency graph [25]. In this section, we relax the requirement of marginal independence by allowing arbitrary dependency structures among the summed variables, as long as certain conditions on the strong mixing coefficients α(k) hold. (See also Sunklodas [28] for an approach similar to ours for chain-structured dependencies; he obtains a slightly tighter bound than ours at the expense of a more complex proof.) Our approach proceeds by bounding the Wasserstein distance between the (appropriately scaled and centered) sum W of the dependent variables and a standard normal variate Z in terms of α(k) and the degree of dependence of the random variables. We then show that this bound tends to zero for our estimators, demonstrating convergence to a normal distribution and yielding a rate of convergence as a by-product. We note that although we use this result to prove normal convergence for a cyclo-stationary Markov chain, it can potentially be used for more general dependence structures, as long as suitable strong mixing properties are available.
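As an aside, the bounds on $\|f\|$, $\|f'\|$, and $\|f''\|$ quoted after Equation (15) follow from the standard explicit solution of the Stein equation (see, e.g., [4]); for $h \in \mathcal H$ it can be written as

\[
f_h(w) = e^{w^2/2} \int_{-\infty}^{w} \big(h(t) - E[h(Z)]\big)\, e^{-t^2/2}\, dt
       = -\, e^{w^2/2} \int_{w}^{\infty} \big(h(t) - E[h(Z)]\big)\, e^{-t^2/2}\, dt,
\]

where the two expressions agree because the integrand integrates to zero over the whole real line.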
The existing application of Stein's method to sums of weakly dependent random variables has focused on marginal-independence structures that can be captured by a bounded-degree dependency graph [25]. In this section, we relax the requirement of marginal independence by allowing arbitrary dependency structures among the summed variables, as long as certain conditions on the strong mixing coefficients $\alpha(k)$ hold. (See also Sunklodas [28] for an approach similar to ours for chain-structured dependencies; he obtains a slightly tighter bound than ours at the expense of a more complex proof.) Our approach proceeds by bounding the Wasserstein distance between the (appropriately scaled and centered) sum $W$ of the dependent variables and a standard normal variate $Z$ in terms of $\alpha(k)$ and the degree of dependence of the random variables. We then show that this bound tends to zero for our estimators, demonstrating convergence to a normal distribution and yielding a rate of convergence as a by-product. We note that although we use this to prove normal convergence for a cyclo-stationary Markov chain, it can potentially be used for more general dependence structures, as long as suitable strong mixing properties are available.

We let $T$ denote the total number of variables in our model. Let $Y_i$, $i = 1, \ldots, T$, be bounded ($|Y_i| \leq B$), mean-zero random variables. Let $\sigma_T^2$ denote the variance of $\sum_{i=1}^{T} Y_i$; assume $0 < \sigma_T < \infty$ for all $T$. Define $X_i = Y_i/\sigma_T$, so that $|X_i| \leq B/\sigma_T$. Let $W := \sum_{i=1}^{T} X_i$, and $\gamma_T = T/\sigma_T$. We will assume that the index set underlying the random variables $\{X_i\}$ is endowed with a distance metric $\operatorname{dist}(i,j)$. This can be the geodesic distance if the variables are connected via a graph structure, or the absolute difference of time indices in a time series model, etc. Let $N_m(i)$ denote the set of nodes at distance $m$ from node $i$; similarly, let $N_{\leq k}(i)$ and $N_{>k}(i)$ respectively denote the sets of nodes within distance $k$ and at distance greater than $k$ from node $i$. Let $|N_{\leq k}|$ denote $\max_i |N_{\leq k}(i)|$.

We need a notion of strong mixing in a network setting. Define the strong mixing coefficients
$$\alpha(k) \doteq \sup_{X_i, X_j} \big\{ |P(A \cap B) - P(A)P(B)| : A \in \mathcal{F}(X_i),\ B \in \mathcal{F}(X_j),\ \operatorname{dist}(i,j) \geq k \big\},$$
where $\mathcal{F}(X)$ is the sigma algebra generated by the random variable $X$. A similar proposal for strong mixing in random fields can be found in [21]. Let $\tau_k$ denote the tail sum $\sum_{m>k} |N_m|\, \alpha(m)$. We are now ready to state the main result.

Lemma 6.2. The Wasserstein distance $d_W(W, Z)$ between $W$ and the standard normal random variable $Z$ is upper bounded as follows:
$$d_W(W, Z) \leq \min_{k \leq T} \left\{ \frac{c_1 B^3 \gamma_T |N_{\leq k}|^2}{\sigma_T^2} + c_2 B \gamma_T \alpha(k) + B^2 \sqrt{ c_3 \Big(\frac{\gamma_T \tau_k}{\sigma_T}\Big)^2 + \frac{c_4 \gamma_T |N_{\leq k}|^3}{\sigma_T^3} + \frac{c_5 \gamma_T \tau_k |N_{\leq k}|^2}{\sigma_T^3} } \right\}, \qquad (16)$$
where $c_1, c_2, c_3, c_4, c_5$ are constants.
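To give a feel for the bound, the sketch below evaluates the right-hand side of (16) under an assumed exponential mixing profile $\alpha(m) = e^{-\beta m}$ with $|N_m| \leq 2$ (the chain-distance setting used later), with the constants $c_1, \ldots, c_5$ and $B$ set to one and $\sigma_T^2 = T$; all of these values are assumptions of the sketch. It illustrates the trade-off that the minimization over $k$ resolves: the bound falls while the $\alpha(k)$ term dominates, then grows through the $|N_{\leq k}|$ terms.

```python
# Behavior of the right-hand side of (16) under an assumed exponential
# mixing profile alpha(m) = exp(-beta*m) with |N_m| <= 2. Constants
# c_1,...,c_5 and B are set to one and sigma_T^2 = T (illustrative only).
import numpy as np

beta, B, T = 0.5, 1.0, 10_000
sigma_T = np.sqrt(T)
gamma_T = T / sigma_T

alpha = lambda m: np.exp(-beta * m)

def tau(k, m_max=10_000):
    m = np.arange(k + 1, m_max)          # tail sum: sum_{m>k} |N_m| alpha(m)
    return np.sum(2 * alpha(m))

for k in [1, 5, 10, 20, 40]:
    N_le_k = 2 * k                       # |N_{<=k}| = O(k) along a chain
    bound = (B**3 * gamma_T * N_le_k**2 / sigma_T**2
             + B * gamma_T * alpha(k)
             + B**2 * np.sqrt((gamma_T * tau(k) / sigma_T)**2
                              + gamma_T * N_le_k**3 / sigma_T**3
                              + gamma_T * tau(k) * N_le_k**2 / sigma_T**3))
    print(k, bound)   # falls while alpha(k) dominates, then grows with |N_<=k|
```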
Proof sketch. We give a brief proof sketch here; the full proof is in the Appendix. We want to bound $|E[f'(W) - W f(W)]|$. We shall repeatedly break up $W$ into two parts: $W_i = \sum_{j \in N_{>k}(i)} X_j$, the contribution from all nodes at distance more than $k$ from node $i$, and the remainder from nodes "close to" $i$. In the classical analysis of dependency graphs, $X_i$ and $W_i$ are independent; in contrast, in our case we only have $\operatorname{cov}(X_i, W_i) = O(\alpha(k))$. Here, $k$ is a parameter that shall be picked later to optimize the bound. Since $W = \sum_{i=1}^{T} X_i$,
$$|E[f'(W) - W f(W)]| \leq \underbrace{\Big| E\Big[ f'(W)\Big(1 + \sum_i X_i(W_i - W)\Big) \Big] \Big|}_{(A1)} + \underbrace{\Big| E\Big[ \sum_i X_i(W_i - W) f'(W) + \sum_i X_i f(W) \Big] \Big|}_{(A2)}.$$
Using a Taylor expansion, the term $(A2)$ can be further bounded by
$$(A2) \leq \frac{\|f''\|}{2}\, E\Big[ \sum_i |X_i| (W_i - W)^2 \Big] + \Big| E\Big[ \sum_i X_i f(W_i) \Big] \Big|.$$
The first term can be bounded, using the AM-GM inequality, by $c_1 B^3 T |N_{\leq k}|^2/\sigma_T^3$, where $c_1$ is a constant; recall that $|N_{\leq k}|$ upper bounds the size of a $k$-hop neighborhood. The second part of $(A2)$ is bounded by $c_2 B T \alpha(k)/\sigma_T$, using the usual relationship between covariances and strong mixing coefficients. Thus the overall bound on $(A2)$ is:
$$(A2) \leq \frac{c_1 B^3 T |N_{\leq k}|^2}{\sigma_T^3} + \frac{c_2 B T \alpha(k)}{\sigma_T}.$$
Now we need to bound $(A1)$. Denote $P_T = \sum_i X_i(W_i - W)$. Note that if $X_i$ and $W_i$ were independent, we would have $E[P_T] = -E[W^2] = -1$, since $W$ is centered and scaled appropriately. For us, however, $E[P_T]$ does not equal $-1$; instead, the discrepancy $|1 + E[P_T]|$ becomes smaller as we increase $k$. We thus bound $(A1)$ as follows:
$$(A1) \leq \|f'\| \sqrt{E[(1 + P_T)^2]} \leq \|f'\| \sqrt{(1 + E[P_T])^2 + \operatorname{var}(P_T)}.$$
Now note that $|1 + E[P_T]| = |E[\sum_i X_i W_i]|$. Since $E[X_i] = 0$,
$$\Big| E\Big[\sum_i X_i W_i\Big] \Big| = \Big| \sum_i \sum_{j \in N_{>k}(i)} \operatorname{cov}(X_i, X_j) \Big| \leq \frac{c'' B^2}{\sigma_T^2} \sum_i \sum_{m>k} \alpha(m)\, |N_m|,$$
using the fact that $N_{>k}(i) = \bigcup_{m>k} N_m(i)$, and that for all $j \in N_m(i)$, $\operatorname{cov}(X_i, X_j) = O(\alpha(m)/\sigma_T^2)$. We upper bound $|1 + E[P_T]|$ by $c'' B^2 T \tau_k/\sigma_T^2$. Using similar arguments (see Appendix), we upper bound $\operatorname{var}(P_T)$ by $8 B^4 T |N_{\leq k}|^3/\sigma_T^4 + 16 B^4 T |N_{\leq k}|^2 \tau_k/\sigma_T^4$. Putting the pieces together and using $\gamma_T = T/\sigma_T$, we see that
$$d_W(W, Z) \leq (A2) + \|f'\| \sqrt{(1 + E[P_T])^2 + \operatorname{var}(P_T)} \leq \frac{c_1 B^3 \gamma_T |N_{\leq k}|^2}{\sigma_T^2} + c_2 B \gamma_T \alpha(k) + B^2 \sqrt{ c_3 \Big(\frac{\gamma_T \tau_k}{\sigma_T}\Big)^2 + \frac{c_4 \gamma_T |N_{\leq k}|^3}{\sigma_T^3} + \frac{c_5 \gamma_T \tau_k |N_{\leq k}|^2}{\sigma_T^3} }.$$
The result is obtained by optimizing the upper bound over $k \leq T$.

Next, we present a sufficient condition for the Wasserstein distance to vanish asymptotically, implying convergence of $W$ to a standard normal.

Lemma 6.3. $W \to N(0,1)$ as $T \to \infty$ if the following conditions hold:

1. $\gamma_T \to \infty$.
2. There exists a sequence $k(T) \to \infty$ such that the following are satisfied:
   (a) $\gamma_T\, \alpha(k(T)) \to 0$;
   (b) $\gamma_T\, \tau_{k(T)}/\sigma_T \to 0$;
   (c) $\gamma_T\, |N_{\leq k(T)}|^2/\sigma_T^2 \to 0$.

(A check of these conditions under the exponential mixing profile supplied by our Markov chain is sketched after the proof.)

Proof. The above conditions imply that $\alpha(k(T)) \to 0$, $\tau_{k(T)}/\sigma_T \to 0$, and $|N_{\leq k(T)}|^2/\sigma_T^2 \to 0$ (and thus $|N_{\leq k(T)}|/\sigma_T \to 0$ as well) as $T \to \infty$. Hence the product of two vanishing sequences, $\frac{\gamma_T |N_{\leq k(T)}|^2}{\sigma_T^2} \times \frac{|N_{\leq k(T)}|}{\sigma_T}$, also vanishes. Similarly, $\frac{\gamma_T \tau_{k(T)}}{\sigma_T} \times \frac{|N_{\leq k(T)}|^2}{\sigma_T^2}$ also vanishes as $T \to \infty$. Thus, all terms on the right-hand side of Eq. (16) vanish, proving $W \stackrel{d}{\to} N(0,1)$.
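The following is a symbolic check of these conditions for the exponential-mixing profile supplied by our Markov chain ($\alpha(k) = e^{-\beta k}$, $|N_m| = O(1)$, $\sigma_T^2 \asymp T$), with the choice $k(T) = \log T/\beta$ used later in Lemma 7.3; the value $\beta = 1/2$ and the unit constants are illustrative assumptions of the sketch.

```python
# Symbolic check of the conditions of Lemma 6.3 under exponential mixing:
# alpha(k) = e^{-beta k}, |N_m| = O(1), sigma_T^2 ~ T, k(T) = log(T)/beta.
# beta = 1/2 and the unit constants are illustrative assumptions.
import sympy as sp

T = sp.symbols('T', positive=True)
beta = sp.Rational(1, 2)

sigma_T = sp.sqrt(T)                     # sigma_T ~ sqrt(T), up to a constant
gamma_T = T / sigma_T
k_T = sp.log(T) / beta
alpha_kT = sp.exp(-beta * k_T)           # alpha(k(T)) = 1/T
tau_kT = alpha_kT / (1 - sp.exp(-beta))  # geometric tail sum, up to a constant
N_le_kT = 2 * k_T                        # |N_{<=k(T)}| = O(k(T))

print(sp.limit(gamma_T, T, sp.oo))                            # oo  (Condition 1)
print(sp.limit(gamma_T * alpha_kT, T, sp.oo))                 # 0   (2a)
print(sp.limit(gamma_T * tau_kT / sigma_T, T, sp.oo))         # 0   (2b)
print(sp.limit(gamma_T * N_le_kT**2 / sigma_T**2, T, sp.oo))  # 0   (2c)
```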
7 Weak Convergence of our Estimator

In this section we bring together the results from the previous two sections to establish weak convergence of our estimator. Recall that our estimator $\tilde{g}_T$ is defined in Equation 3. Recall also the definitions of $\hat{h}_T(t)$, $\hat{f}_T(t)$ and $q_t$ from Equations 9 and 10. From Lemma 5.1 we have $|\sqrt{T}(\tilde{g}_T - \hat{g}_T)| = O(\sqrt{T}\, \zeta_T)$, where $\zeta_T$ denotes the bandwidth for the pair-specific kernel function (see Equation 2). Hence, with $\zeta_T = T^{-(1/2+\epsilon)}$ for some $\epsilon > 0$, we see that $\sqrt{T}(\tilde{g}_T - \hat{g}_T) \stackrel{a.s.}{\to} 0$. We will show (in Proposition 7.1) that under suitable conditions $\sqrt{T}(\hat{g}_T - g)$ converges to a mean-zero normal distribution. Hence, we also have the same normal distribution as the limit of $\sqrt{T}(\tilde{g}_T - g)$ under the same conditions.

Proposition 7.1. Let Assumption 1 hold. If $\sigma_c > 0$, and $b_T = T^{-(1/2+\theta)}$ for some $\theta > 0$, then, conditioned on $S_C$,
$$\sqrt{T}(\hat{g}_T - g) \stackrel{d}{\to} N(0, \sigma_c^2/R_c^2) \quad \text{as } T \to \infty,$$
where $S_T$ is the state of the Markov chain at time $T$.

Proof. From Equation 8 we see that $\sqrt{T}(\hat{g}_T - g)$ equals $\frac{\sum_t q_t/\sqrt{T}}{\hat{f}_T} + \frac{E[\hat{f}_T \mid S_C]}{\hat{f}_T} \cdot \sqrt{T}\, B_T$. Using the following lemma (Lemma 7.2), we know that the numerator of the first term converges to a $N(0, \sigma_c^2)$ distribution. Using Lemmas 5.9 and 5.3, we have $\hat{f}_T \stackrel{P}{\to} R_c$ for a positive constant $R_c$, conditioned on $S_C$. Hence, using Slutsky's lemma, the first part converges conditionally to $N(0, \sigma_c^2/R_c^2)$. Also, $E[\hat{f}_T \mid S_C]/\hat{f}_T$ converges to one in probability conditioned on $S_C$. Finally, invoking Lemma 5.3 and Lemma 5.4, we see that since $B_T = O(b_T)$, for $b_T \sim T^{-(1/2+\theta)}$ the second part is $o_P(1)$. Now, Slutsky's lemma and the continuous mapping theorem yield the statement of the proposition.

Lemma 7.2. Under Assumption 1 and assuming $\sigma_c > 0$, conditioned on $S_C$,
$$\sum_t q_t/\sqrt{T} \stackrel{d}{\to} N(0, \sigma_c^2) \quad \text{as } T \to \infty.$$
The proof of this result uses Lemma 5.8 and is deferred to the Appendix.

Lemma 7.3. Define $p_t := [\hat{h}_T(t) - g \hat{f}_T(t)] - E[\hat{h}_T(t) - g \hat{f}_T(t) \mid \mathcal{E}_{T_1}, S_C]$. Under Assumption 1 and assuming $\sigma_c > 0$, for any finite $T_1$, we have:
$$\sum_{t \geq T_1 + M} p_t/\sqrt{T} \stackrel{d}{\to} N(0, \sigma_c^2) \quad \text{conditioned on } \mathcal{E}_{T_1} \cap S_C \text{ as } T \to \infty.$$

Proof Sketch. First, we prove that for a sequence $k(T) = c \log T$, with a properly chosen $c$, the conditions in Lemma 6.3 are satisfied for
$$W_T := \sum_{t \geq T_1 + M} p_t \Big/ \sqrt{\operatorname{var}\Big(\sum_{t \geq T_1 + M} p_t \mid \mathcal{E}_{T_1}, S_C\Big)}.$$
We also show that for this value of $k$, the upper bound on the Wasserstein distance in Lemma 6.2 is $O(\log^2(T)/\sqrt{T})$. The details are deferred to the Appendix. Now Lemma 6.3 gives $W_T \stackrel{d}{\to} N(0,1)$ conditioned on $\mathcal{E}_{T_1} \cap S_C$. However, for finite values of $T_1$, $\operatorname{var}(\sum_{t \geq T_1 + M} p_t \mid \mathcal{E}_{T_1}, S_C)/T \to \sigma_c^2$ (Lemma 5.7 and Equation 11). Thus, the additional assumption of $\sigma_c > 0$ proves the result.

Remark 7.4. Proposition 7.1 shows that, under some weak assumptions, $W_T$ converges to a standard normal distribution conditioned on $S_C$. Since there are a finite number of closed communication classes, unconditionally $W_T$ converges to a mixture of zero-mean Gaussians, the mixture proportions being determined by the probability of reaching the communication classes from the start state; the display following Remark 7.5 makes this explicit.

Remark 7.5. We have established weak convergence for the case where $C$ is aperiodic. However, as in Remark 5.10, we can consider $M'$, a Markov chain where each transition corresponds to $d$ transitions of the original chain. Again, any sum of the form $\sum_{t=1}^{T} q_t/\sqrt{T}$ can be written as
$$\frac{1}{\sqrt{d}} \left( \sum_{t=1}^{\lfloor T/d \rfloor} q'_t \Big/ \sqrt{T/d} \right) + o_P(1),$$
where $q'_t$ now denotes the sum of $d$ consecutive $q_t$'s. For $q'_i := \sum_{j=id+1}^{(i+1)d} q_j$, we have $\operatorname{cov}(q'_t, q'_{t+k}) = O(\alpha(k-1))$ using Equation 13. Thus the first sum again brings us to the irreducible aperiodic setting (with a slightly modified distance function), and hence normal convergence can be established.
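To state Remark 7.4 concretely: writing $\pi_C$ for the probability that the chain is absorbed in the closed class $C$ (a piece of notation we introduce here only for illustration), the unconditional limit law is the mixture
$$P\big(\sqrt{T}(\hat{g}_T - g) \leq x\big) \;\longrightarrow\; \sum_{C} \pi_C\, \Phi\Big(\frac{x\, R_C}{\sigma_C}\Big) \qquad \text{for all } x \in \mathbb{R},$$
where $\Phi$ is the standard normal distribution function and the sum ranges over the finitely many closed communication classes.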
8 Related Work

Existing work on link prediction in dynamic networks can be broadly divided into two categories: link prediction based on generative models and link prediction based on structural features.

A substantial amount of work has gone into the development of generative models of graph structure based on the formalism of Markov random fields, loglinear models or other graphical models [6, 8, 15, 26, 11, 29, 31]. For example, Hanneke and Xing [8] present a dynamic loglinear model based on evolution statistics such as "edge stability," "reciprocity" and "transitivity." Fu et al. [6] propose an extension of the mixed membership block model to allow a linear Gaussian trend in the model parameters. Zhou et al. [34] present a nonparametric approach to estimating a time-varying Gaussian graphical model where the covariance matrix changes smoothly over time. The discrete analog of this is considered in [15], where the goal is to learn the latent structures of evolving graphs from a time series of node attributes. The static model of Raftery et al. [23] is extended by Sarkar and Moore [26] by allowing smooth transitions in latent space. All of these models have the virtue of a clean probabilistic formulation, so that link prediction can be cast in terms of Bayesian posterior inference. Obtaining this posterior is, however, often infeasible in large-scale graphs. Moreover, these models often make strong model assumptions, not only on the graph structure but also on the network dynamics, which is often modeled as linear.

Alternatives to generative models generally revolve around the definition of various static features that aim to capture structural properties of graphs. These are extended to the dynamic setting via heuristics or via autoregressive modeling. For example, Huang and Lin [12] propose a linear autoregressive model for link prediction and investigate simple combinations of static graph-based similarity measures (e.g., Katz, common neighbors) with their autoregressive model to capture transitive similarities in networks. A similar parametric approach can be found in Richard et al. [24], where a vector autoregressive model was used for link prediction in dynamic graphs. The authors assume a low-rank structure of the graph adjacency matrices and propose proximal methods for inference. Tylenda et al. [30] examine simple temporal extensions of existing static measures. As we have noted earlier, these methods have the virtue of being applicable to large-scale graphs. They also tend to yield surprisingly good performance. Our work falls into this general category, while going beyond existing work by providing a formal statistical treatment of link prediction as a nonparametric estimation problem.

We conclude this section with a brief discussion of relevant research on nonparametric bootstrap estimators in strong mixing random fields and Markov processes. While these works are not relevant to the link prediction aspect of our work, they are similar because the estimation uses local resampling methods, thereby retaining the dependency structure of the data. In the context of strong mixing random fields, Politis and Romano [22] consider a blocks-of-blocks resampling method for estimating asymptotically accurate confidence intervals for parameters of the joint distribution of the random field. Nonparametric bootstrap algorithms have also been applied successfully to the area of computer vision. Levina and Bickel [16] show that one such heuristic algorithm for texture synthesis can be formally framed as a resampling technique for stationary random fields, and prove consistency properties of it under broad conditions. In the context of stochastic processes with an autoregressive structure, Paparoditis and Politis [19] present the "local bootstrap" algorithm, which implicitly estimates the distribution of the one-step transition in the underlying Markov process and generates the bootstrap replicates using this estimated distribution.

9 Conclusions

In this paper we proposed a nonparametric model (NNI) for link prediction in dynamic networks, and showed that it performs as well as the state of the art on several real-world graphs, and exhibits important advantages over them in the presence of nonlinearities such as seasonality patterns.
NNI also allows us to incorporate features external to graph topology into the link prediction algorithm, and its asymptotic convergence to the true link probability is guaranteed under our fairly general model assumptions. In addition, we showed how to make NNI computationally tractable via the use of locality sensitive hashing. Together, these make NNI a useful tool for link prediction in dynamic networks.

10 Appendix

10.1 Statement and proofs of results from Section 5

Lemma 5.3. As $T \to \infty$, for some $R_c > 0$ (a deterministic function of class $C$),
$$E[\hat{f}_T(s, Q) \mid \mathcal{E}_{T_1}, S_C] \to R_c, \qquad E[\hat{f}_T(s, Q) \mid S_C] \to R_c.$$

Proof. Let $\epsilon$ denote the minimum distance between two datacubes that are not identical; since the set of all possible datacubes is finite, $\epsilon > 0$. $E[\hat{f}_T(s, Q) \mid \mathcal{E}_{T_1}, S_C]$ is an average of terms $E[K_{b_T}(d_t(i), Q)\, \eta_{i,t+1}(s) \mid \mathcal{E}_{T_1}, S_C]$, over $i \in \{1, \ldots, n\}$ and $t \in \{p, \ldots, T-1\}$. Now,
$$E[K_{b_T}(d_t(i), Q)\, \eta_{i,t+1}(s) \mid \mathcal{E}_{T_1}, S_C] = E\big[ e^{-D(d_t(i), Q)/b_T}\, \eta_{i,t+1}(s) \mid \mathcal{E}_{T_1}, S_C \big].$$
Writing the expectation in terms of a sum over all possible datacubes, and noting that everything is bounded, gives the following:
$$E\big[ e^{-D(d_t(i), Q)/b_T}\, \eta_{i,t+1}(s) \mid \mathcal{E}_{T_1}, S_C \big] = E[\eta_{i,t+1}(s) \mid d_t(i) = Q, \mathcal{E}_{T_1}, S_C]\, P(d_t(i) = Q \mid \mathcal{E}_{T_1}, S_C) + O(e^{-\epsilon/b_T}).$$
Recalling that $E[\hat{f}_T(s, Q) \mid \mathcal{E}_{T_1}, S_C]$ was an average of the above terms, we see that it equals
$$\frac{1}{n(T-p)} \sum_{t,i} E[\eta_{i,t+1}(s) \mid d_t(i) = Q, \mathcal{E}_{T_1}, S_C] \cdot P(d_t(i) = Q \mid \mathcal{E}_{T_1}, S_C) + O(e^{-\epsilon/b_T}). \qquad (17)$$
We will now show that the above average converges to some $R > 0$. The second term on the RHS of Eq. (17) converges to zero, since $b_T \to 0$ as $T \to \infty$. For the first term, we have
$$E[\eta_{i,t+1}(s) \mid d_t(i) = Q, \mathcal{E}_{T_1}, S_C] \cdot P(d_t(i) = Q \mid \mathcal{E}_{T_1}, S_C) = \sum_{\eta} \eta\, P(\eta_{i,t+1}(s) = \eta,\ d_t(i) = Q \mid \mathcal{E}_{T_1}, S_C).$$
Both $d_t(i)$ and $\eta_{i,t+1}(s)$ are fully determined given the current state $S_t$ of the Markov chain. Using $I_S(X)$ to denote an indicator of $X$ in state $S$, we have $P(\eta_{i,t+1}(s) = \eta,\ d_t(i) = Q \mid \mathcal{E}_{T_1}, S_C) = \sum_S I_S(\eta_{i,t+1}(s) = \eta,\ d_t(i) = Q)\, P(S_t = S \mid \mathcal{E}_{T_1}, S_C)$. As a result, the first term on the RHS of Eq. (17) becomes an average of the form $\frac{1}{T} \sum_t \sum_S \xi(S)\, P(S_t = S \mid \mathcal{E}_{T_1}, S_C)$, where $\xi(S) = \frac{1}{n} \sum_{i,\eta} \eta\, I_S(\eta_{i,t+1}(s) = \eta,\ d_t(i) = Q)$. Since we have a finite state space and $\xi(S)$ is bounded, we can rewrite the above expression as $\sum_S \xi(S)\, \frac{\sum_t P(S_t = S \mid \mathcal{E}_{T_1}, S_C)}{T}$.

Now, recall that the query datacube at $T$ is a function of the state $S_T$, which belongs to a closed irreducible set $C$ with probability 1. Due to stationarity (or cyclic stationarity with a finite cycle length), the average $\sum_t P(S_t = S \mid \mathcal{E}_{T_1}, S_C)/T$ converges to some constant $R(S)$ (constant because it is a function of the finite state space). For the special case of $S = S_T$, we have the following: (a) $S_T \in C$, so $R(S_T) > 0$, and (b) $S_T$ contains at least one pair of nodes with the feature vector $s$ (since we are attempting link prediction for such a pair), so there exists some $\eta > 0$ for which $I_{S_T}(\eta, Q) = 1$. Together, these imply that $\sum_S \xi(S) \sum_t P(S_t = S \mid \mathcal{E}_{T_1}, S_C)/T$ converges to some $R_c > 0$, where $R_c$ is a deterministic function of communication class $C$. Noting that $E[\hat{f}_T(s, Q) \mid S_C] = E\big[E[\hat{f}_T(s, Q) \mid \mathcal{E}_{T_1}, S_C] \mid S_C\big]$, and using the fact that $\hat{f}_T$ is bounded, we invoke the Dominated Convergence Theorem and see that $E[\hat{f}_T(s, Q) \mid S_C] \to R_c$ as well, thus completing the proof.
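For concreteness, the kernel averages analyzed above (and the companion quantity $\hat{h}_T$) can be computed directly once the datacube distances and the $\eta$ counts are available. Below is a minimal sketch on synthetic inputs; the array names and the simulated data are ours, introduced only for illustration, and the averaging is simplified relative to the $t \in \{p, \ldots, T-1\}$ indexing of the paper.

```python
# Minimal sketch of the kernel averages in Lemma 5.3 on synthetic inputs.
# D[t, i] stands for the datacube distance D(d_t(i), Q); eta[t, i] and
# eta_plus[t, i] stand for eta_{i,t+1}(s) and eta^+_{i,t+1}(s). Illustrative
# only; not the paper's code.
import numpy as np

rng = np.random.default_rng(1)
T, n, b_T = 500, 50, 0.1
D = rng.exponential(1.0, size=(T, n))     # distances to the query datacube Q
eta = rng.integers(1, 5, size=(T, n))     # pairs with feature vector s
eta_plus = rng.binomial(eta, 0.3)         # among them, pairs that form an edge

K = np.exp(-D / b_T)                      # kernel K_{b_T}(d_t(i), Q)
f_hat = np.mean(K * eta)                  # \hat f_T(s, Q)
h_hat = np.mean(K * eta_plus)             # \hat h_T(s, Q)
print(h_hat / f_hat)                      # \hat g_T, estimated link probability
```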
Lemma 5.4. Define $B_T(s, Q, C) = \big( E[\hat{h}_T(s, Q) \mid S_C] - g\, E[\hat{f}_T(s, Q) \mid S_C] \big)/E[\hat{f}_T(s, Q) \mid S_C]$. If Assumption 1 holds, then $B_T = O(b_T)$. Since $b_T \to 0$ as $T \to \infty$, this implies $B_T = o(1)$.

Proof. For $t \in [p, T-2]$, $i \in [1, N]$, and $s = s_T(q, q')$, the numerator of $B_T$ is an average of the terms
$$A_t := E\big[ K_{b_T}(d_t(i), Q)\, \eta^+_{i,t+1}(s) \mid S_C \big] - E\big[ K_{b_T}(d_t(i), Q)\, \eta_{i,t+1}(s) \mid S_C \big]\, g(s, Q).$$
Taking expectations with respect to $d_t(i)$, and denoting $K_{b_T}(d_t(i), Q)$ by $\gamma$, the first term becomes
$$E\big[ \gamma\, \eta^+_{i,t+1}(s) \mid S_C \big] = E\big[ \gamma\, E[\eta^+_{i,t+1}(s) \mid d_t(i), S_C] \mid S_C \big].$$
Now note that $E[\eta^+_{i,t+1}(s) \mid d_t(i), S_C] = E\big[ E[\eta^+_{i,t+1}(s) \mid d_t(i), \mathcal{E}_{T_1}, S_C] \mid S_C \big]$. Conditioning on $\mathcal{E}_{T_1}$ makes $\eta^+_{i,t+1}(s)$ conditionally independent of $S_C$ given $d_t(i)$ if $t > T_1$. Also, for $t \geq T_1$,
$$E\big[ \eta^+_{i,t+1}(s) \mid d_t(i), \mathcal{E}_{T_1}, S_C \big] = \eta_{i,t+1}(s) \cdot g(s, d_t(i)),$$
as can be seen by summing Eq. 2.3 over all pairs $(i, j)$ in a neighborhood with identical $s_t(i, j)$, and then taking expectations (the conditioning on $\mathcal{E}_{T_1}$ is crucial here). This, along with the fact that $\gamma\, \eta^+_{i,t+1}(s)$ is bounded, leads to:
$$E[\eta^+_{i,t+1}(s) \mid d_t(i), \mathcal{E}_{T_1}, S_C] \leq \eta_{i,t+1}(s)\, g(s, d_t(i))\, 1[T_1 \leq t] + c\, 1[T_1 > t] \leq \eta_{i,t+1}(s)\, g(s, d_t(i)) + c\, 1[T_1 > t].$$
Thus the numerator of $B_T$ can be upper bounded as
$$\Big| \sum_t A_t/T \Big| \leq \sum_t \big| E\big[ \gamma\, \eta_{i,t+1}(s)\, \big( g(s, d_t(i)) - g(s, Q) \big) \mid S_C \big] \big|/T + c' \sum_t P[T_1 > t]/T.$$
The second part is simply $O(E[T_1]/T)$ and hence $o(1)$. Thus, the numerator of $B_T$ becomes an average of terms of the following form:
$$E\big[ K_{b_T}(d_t(i), Q)\, \eta_{i,t+1}(s) \cdot \big( g(s, d_t(i)) - g(s, Q) \big) \mid S_C \big].$$
This expectation is over all possible configurations of the neighborhoods $N_t(i)$ and $N_{t+1}(i)$. Since our neighborhood sizes are bounded (because $n$ is bounded), the expectation is a sum over a finite number of terms. We now use the smoothness assumption on $g$. Using $|g(s, d_t(i)) - g(s, Q)| = O(D(d_t(i), Q))$, the fact that $\eta_{i,t+1}(s)$ is finite for all $T$, and Lemma 5.3, we have:
$$B_T = O\big( E\big[ D(d_t(i), Q)\, e^{-D(d_t(i), Q)/b_T} \mid S_C \big] \big) = O(b_T).$$
The last equation holds since, for non-negative $x$, $x e^{-x/b_T} \leq b_T/e$.
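The elementary inequality closing the proof follows because the maximizer of $x e^{-x/b}$ over $x \geq 0$ is $x = b$; a one-line symbolic check:

```python
# Symbolic check: for x >= 0, x*exp(-x/b) is maximized at x = b, value b/e.
import sympy as sp

x, b = sp.symbols('x b', positive=True)
expr = x * sp.exp(-x / b)
x_star = sp.solve(sp.diff(expr, x), x)[0]         # critical point: x = b
print(x_star, sp.simplify(expr.subs(x, x_star)))  # b, b*exp(-1)
```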
Lemma 5.5. Consider an irreducible and aperiodic finite state Markov chain with probability transition matrix $P$, starting distribution $\pi_0$ and stationary distribution $\pi$. Let $X_t$ be a deterministic function (with finite support) of the state at time $t$. The expectation of $X$ under the distribution at time $t$ is denoted by $E[X_t \mid \pi_0]$. Let $\mu$ denote the expectation of $X_\infty$ (i.e., under distribution $\pi$). There exist a constant $\lambda \in (0,1)$ and a constant $M$ such that, for all $t > M$,
$$\max_{x \in \mathcal{S}} \sum_{y \in \mathcal{S}} |P^t(x, y) - \pi(y)| = O(\lambda^t), \quad \text{and} \quad |E[X_t \mid \pi_0] - \mu| = O(\lambda^t).$$

Proof. Using the same line of reasoning as [9], we first prove the above for $\max_x \sum_{y \in \mathcal{S}} |P^t(x, y) - \pi(y)|$. Here $\mathcal{S}$ denotes the state space and $P$ the $|\mathcal{S}| \times |\mathcal{S}|$ probability transition matrix associated with the Markov chain. Denote by $\Pi$ the matrix $\mathbf{1}\pi^T$, where $\mathbf{1}$ denotes the column vector of all ones. Note that since $P\Pi = \Pi$ and $\Pi P = \Pi$, we have $P^t - \Pi = (P - \Pi)^t$. For a finite state space irreducible and aperiodic Markov chain, $|P^t(x, y) - \Pi(x, y)| \to 0$ as $t \to \infty$. Hence, for some positive $\delta < 1$, we can find an $M$ s.t. for all $t > M$, $\sum_y |P^t(x, y) - \Pi(x, y)| \leq \delta$ for all $x \in \mathcal{S}$. Since $\max_x \sum_y |P^t(x, y) - \Pi(x, y)| = \|P^t - \Pi\|_\infty$, using matrix norm inequalities we have, for $t = kM + \ell$ with $\ell < M$ and $t > M$,
$$\|P^t - \Pi\|_\infty \leq \|P^M - \Pi\|_\infty^k\, \|P^\ell - \Pi\|_\infty = O(\delta^k),$$
since $\max_{\ell \leq M} \|P^\ell - \Pi\|_\infty$ is a constant. However, $\delta^k = \delta^{k+1}/\delta = O(\lambda^t)$, where $\lambda = \delta^{1/M} < 1$. Now, for $t > M$ and $\lambda < 1$, we have:
$$\max_x \sum_y |P^t(x, y) - \pi(y)| = O(\lambda^t).$$
First consider $\pi_0$ to be an atom at a state $x_0 \in \mathcal{S}$. Since $|E(X_t \mid X_0) - \mu| \leq \sum_{x \in \mathcal{S}} |x|\, |P^t(x_0, x) - \pi(x)|$, using the fact that $X_t$ is bounded, we have the main result. The result can easily be extended to the more general case where $\pi_0$ is a convex combination of atoms at $x \in \mathcal{S}$.
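Lemma 5.5 is easy to visualize: for any small irreducible aperiodic transition matrix, $\max_x \sum_y |P^t(x, y) - \pi(y)|$ decays geometrically in $t$. A minimal sketch with an assumed toy chain:

```python
# Numerical illustration of Lemma 5.5 on an assumed toy chain: for an
# irreducible aperiodic P, max_x sum_y |P^t(x,y) - pi(y)| decays like lambda^t.
import numpy as np

P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])

w, V = np.linalg.eig(P.T)                 # stationary dist: left eigenvector
pi = np.real(V[:, np.argmax(np.real(w))])
pi /= pi.sum()

for t in [1, 2, 4, 8, 16]:
    Pt = np.linalg.matrix_power(P, t)
    print(t, np.max(np.abs(Pt - pi).sum(axis=1)))  # geometric decay in t
```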
Lemma 5.7. For any finite integer $k$, we have
$$\operatorname{var}\Big( \sum_{t \geq T_1 + M} q_t \mid \mathcal{E}_{T_1}, T_1 = k, S_C \Big)\Big/T \to \sigma_c^2 \quad \text{for some } \sigma_c^2 \geq 0, \qquad (7)$$
$$\operatorname{var}\Big( \sum_t q_t \mid \mathcal{E}_{T_1}, T_1 = k, S_C \Big)\Big/T \to \sigma_c^2 \quad \text{for some } \sigma_c^2 \geq 0. \qquad (8)$$
For a finite state space Markov chain, we also have $E[\operatorname{var}(U_T \mid \mathcal{E}_{T_1}, S_C) \mid S_C] \to \sigma_c^2$ for some $\sigma_c^2 \geq 0$.

Proof. Let $C$ have (finite) period $d$; the period is finite by the finiteness of the Markov chain, and is typically very small (e.g., $d = 1$ if $0 < g(\cdot) < 1$ everywhere). Let $M'$ be a Markov chain where each transition corresponds to $d$ transitions of the original chain. Now, $M'$ is irreducible and aperiodic (since $C$ was irreducible and had period $d$). Thus there exist $M$ and $\lambda \in (0,1)$ s.t. for all $t \geq M$ it is geometrically ergodic with rate $\lambda$ (Lemma 5.5), which in turn implies that for $t \geq M$, $M'$ is strongly mixing with exponential drop-off [20] for large $k$: $\alpha(k) \sim e^{-\beta k}$ for some $\beta > 0$. Thus, distant states are almost independent, and we use this to bound the covariances of the $q_{it}$, as follows. Also define $q_t = \sum_i q_{it}/n$. For the first term, we have:
$$\frac{1}{T} \operatorname{var}\Big( \sum_{t=1}^T q_t \mid \mathcal{E}_{T_1}, S_C \Big) = \frac{1}{T} \sum_{t \leq t'} \operatorname{cov}(q_t, q_{t'} \mid \mathcal{E}_{T_1}, S_C),$$
where the distance between time points $t \leq t'$ is defined as $\max(\lceil (t' - t - (p+1))/d \rceil, 0)$. Thus, the total number of states at distance $k$ is $O(1)$. Let $R_t = \lfloor (T - t)/d \rfloor$. Rather importantly, note that we will use basic conditional independence results from Markov chains; for example, $E[X_t X_{t+2d} \mid X_{t+d}] = E[X_t \mid X_{t+d}]\, E[X_{t+2d} \mid X_{t+d}]$. Unfortunately, conditioned on $\mathcal{E}_{T_1} \cap S_C$ this may not be true. However, if $t \geq T_1$, we can safely use the conditional independence, which is definitely true for $A_t$. For notational convenience, we will denote by $\operatorname{cov}_c$ and $E_c$ covariance and expectation conditioned on $\mathcal{E}_{T_1} \cap S_C$. Then, $A_t = \sum_{t' \geq t} \operatorname{cov}_c(q_t, q_{t'})$, and we split the overall sum into the contributions from $t \leq T_1$ and from $t > T_1$, i.e., $\sum_{t \leq T_1} A_t/T$ and $\sum_{t > T_1} A_t/T$.

Let us first consider $\frac{1}{T} \sum_t \operatorname{cov}(q_t, p(S'_{t + R_t d}) \mid \mathcal{E}_{T_1}, S_C)$. By virtue of geometric ergodicity,
$$\operatorname{cov}(q_t, p(S'_{t + R_t d}) \mid \mathcal{E}_{T_1}, S_C) = O\big( e^{-\beta R_t} \big), \quad \text{where } R_t = \lfloor (T-t)/d \rfloor.$$
Thus we have:
$$\sum_t \big| \operatorname{cov}(q_t, p(S'_{t + R_t d}) \mid \mathcal{E}_{T_1}, S_C) \big| = O\Big( \sum_t e^{-\beta \lfloor (T-t)/d \rfloor} \Big) = O\Big( \frac{e^{\beta}}{1 - e^{-\beta/d}} \Big).$$
Using elementary arguments from real analysis, we see that $\sum_t \operatorname{cov}(q_t, p(S'_{t + R_t d}) \mid \mathcal{E}_{T_1}, S_C)$ converges to some finite number. Hence, after dividing by $T$, it contributes an $o(1)$ term to the expression $\sum_t A_t/T$. For this reason we now concentrate on the $\sum_{t > T_1} B_t/T$ term. First note that the sequence $B_t$ is upper bounded as follows:
$$B_t \leq \sum_r \big| \operatorname{cov}(q_t, p(S'_{t + rd}) \mid \mathcal{E}_{T_1}, S_C) \big| = O\Big( \sum_r e^{-\beta r} \Big) = O(1),$$
since $t > T_1$, we have conditioned on $\mathcal{E}_{T_1}, S_C$, and all $q_t$ are bounded. We again see that $B_t$ converges to some constant $c_t$, thus making $P_2$ asymptotically equivalent to
$$\frac{1}{T} \sum_{\ell=0}^{d-1} \sum_{r=0}^{R_t - 1} c_{T_1 + rd + \ell}.$$
However, for all $T_1 \leq T$, if the chain is cyclo-stationary, then after a finite time, for any $\ell \in \{0, \ldots, d-1\}$, $c_{T_1 + rd + \ell}$ approaches the same constant $c_\ell$ for all $r$. Therefore, for all $T_1 \leq T$ we have $\lim_{R \to \infty} \frac{1}{R} \sum_{r=0}^{R-1} c_{T_1 + rd + \ell} = c_\ell$, where $c_\ell$ is a constant with respect to $T$. This leads to:
$$\operatorname{var}(V \mid \mathcal{E}_{T_1}, S_C) \to \frac{1}{d} \sum_{\ell=0}^{d-1} c_\ell \quad \text{as } T \to \infty.$$
Since $P_1$ is a variance term, it is non-negative for all $T$, and hence $\sigma_c^2 = \frac{1}{d} \sum_{\ell=0}^{d-1} c_\ell$ must be non-negative as well, thus proving Equation 7. Using the Cauchy-Schwarz inequality,
$$\operatorname{cov}(U, V \mid \mathcal{E}_{T_1}, S_C)/T = O\Big( \sqrt{ \big( \operatorname{var}(U \mid \mathcal{E}_{T_1}, S_C)/T \big)\, \big( \operatorname{var}(V \mid \mathcal{E}_{T_1}, S_C)/T \big) } \Big) = o(1).$$
Thus $P_1 \to \sigma_c^2$ as $T \to \infty$ for some non-negative constant $\sigma_c^2$. Another use of the Cauchy-Schwarz argument from before, along with the convergence result on $P_1$, lets us upper bound $P_2$ by $O(T_1/\sqrt{T})$. Thus, for finite $k$, putting all the bounds (i.e., on $P_0$, $P_1$, and $P_2$) together, we have $\operatorname{var}(\sum_t q_t \mid \mathcal{E}_{T_1}, S_C, T_1 = k)/T \to \sigma_c^2$ for some $\sigma_c^2 \geq 0$, proving Equation 8. Also, since $T_1$ has finite first and second moments for a finite state space Markov chain, we have $E[\operatorname{var}(\sum_t q_t \mid \mathcal{E}_{T_1}, S_C) \mid S_C]/T \to \sigma_c^2$. We remind the reader that, using simple arguments for finite state space Markov chains, it can be shown that $T_1$'s tail probability is geometrically decaying, leading to the finiteness of the first and second moments.

Lemma 5.8. $\operatorname{var}(E[U_T \mid \mathcal{E}_{T_1}, S_C] \mid S_C) = o(1)$.

Proof. Recall that $U_T := \sum_t q_t/\sqrt{T}$. Let $\mu_c$ denote the expectation of $q_t$ under the stationary distribution in communication class $C$ (it is a deterministic function of class $C$). Since $\operatorname{var}(E[q_t \mid \mathcal{E}_{T_1}, S_C] \mid S_C) = \operatorname{var}(E[q_t \mid \mathcal{E}_{T_1}, S_C] - \mu_c \mid S_C)$, we will simply upper bound $E[U_T - \mu_c \mid \mathcal{E}_{T_1}, S_C]$. Lemma 5.5 shows that there exist $M$ and $\lambda \in (0,1)$ such that, for all $t > T_1 + M$, $|E[q_t \mid \mathcal{E}_{T_1}, S_C] - \mu_c| = O(\lambda^{t - T_1})$. Thus,
$$\big| E[U_T - \mu_c \mid S_C, \mathcal{E}_{T_1}] \big| \leq \frac{c(T_1 + M)}{\sqrt{T}} + \frac{\sum_{t > T_1 + M} \lambda^{t - T_1}}{\sqrt{T}} = O\Big( \frac{T_1 + M}{\sqrt{T}} \Big). \qquad (9)$$
Thus, $\operatorname{var}(E[U_T \mid \mathcal{E}_{T_1}, S_C]) = O\big( E[(T_1 + M)^2]/T \big) = o(1)$, since $T_1$ has finite second moment.
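For reference, the decomposition that ties the two preceding lemmas together is the law of total variance applied to $U_T = \sum_t q_t/\sqrt{T}$: Lemma 5.7 controls the first term and Lemma 5.8 shows the second is negligible, so that
$$\operatorname{var}(U_T \mid S_C) = E\big[ \operatorname{var}(U_T \mid \mathcal{E}_{T_1}, S_C) \mid S_C \big] + \operatorname{var}\big( E[U_T \mid \mathcal{E}_{T_1}, S_C] \mid S_C \big) \;\longrightarrow\; \sigma_c^2.$$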
10.2 Statement and proofs of results from Section 6

Lemma 6.2. The Wasserstein distance $d_W(W, Z)$ between $W$ and the standard normal random variable $Z$ is upper bounded as follows:
$$d_W(W, Z) \leq \min_{k \leq T} \left\{ \frac{c_1 B^3 \gamma_T |N_{\leq k}|^2}{\sigma_T^2} + c_2 B \gamma_T \alpha(k) + B^2 \sqrt{ c_3 \Big(\frac{\gamma_T \tau_k}{\sigma_T}\Big)^2 + \frac{c_4 \gamma_T |N_{\leq k}|^3}{\sigma_T^3} + \frac{c_5 \gamma_T \tau_k |N_{\leq k}|^2}{\sigma_T^3} } \right\},$$
where $c_1, c_2, c_3, c_4, c_5$ are constants.

Proof. We define the following sets:
$$N_m(i) := \{ j : \operatorname{dist}(i,j) = m \}, \qquad N_{\leq k}(i) := \bigcup_{m \leq k} N_m(i), \qquad N_{>k}(i) := \bigcup_{m > k} N_m(i).$$
We also define the following upper bounds on the sizes of these sets:
$$|N_m| := \max_i |N_m(i)|, \qquad |N_{\leq k}| := \max_i |N_{\leq k}(i)|, \qquad |N_{>k}| := \max_i |N_{>k}(i)|.$$
Before beginning, we recall two facts.

(1) Bounded covariance via strong mixing: for two random variables $X$ and $Y$ that are more than distance $k$ away from each other,
$$|E[XY] - E[X]E[Y]| \leq 4 \|X\|_\infty \|Y\|_\infty\, \alpha(k).$$
(2) Bounds on the Wasserstein distance: for the set of functions $\mathcal{F} = \{ f \mid \|f\|, \|f''\| \leq 2,\ \|f'\| \leq \sqrt{2/\pi} \}$,
$$d_W(W, Z) \leq \sup_{f \in \mathcal{F}} |E[f'(W) - W f(W)]|,$$
where $d_W(\cdot)$ is the Wasserstein distance and $Z$ has the standard normal distribution.

In the following, we shall bound $|E[f'(W) - W f(W)]|$. We shall repeatedly break up $W$ into two parts: $W_i = \sum_{j \in N_{>k}(i)} X_j$, the contribution from all nodes at distance greater than $k$ from node $i$, and the remainder, which comes from nodes "close to" $i$. Here, $k$ is a parameter that shall be picked later. We can bound $|E[f'(W) - W f(W)]|$ as follows:
$$|E[f'(W) - W f(W)]| = \Big| E\Big[ f'(W) - \sum_i X_i f(W) \Big] \Big| \leq \Big| E\Big[ f'(W)\Big(1 + \sum_i X_i(W_i - W)\Big) \Big] \Big| + \Big| E\Big[ \sum_i X_i(W_i - W) f'(W) + \sum_i X_i f(W) \Big] \Big|. \qquad (10)$$
The second part in Eq. 10 can be further bounded above as follows:
$$\Big| E\Big[ \sum_i X_i(W_i - W) f'(W) + \sum_i X_i f(W) \Big] \Big| \leq \Big| E\Big[ \sum_i X_i(W_i - W) f'(W) - \sum_i X_i \big( f(W_i) - f(W) \big) \Big] \Big| + \Big| E\Big[ \sum_i X_i f(W_i) \Big] \Big| \qquad (11)$$
$$\leq \frac{1}{2} \Big| E\Big[ \sum_i X_i (W - W_i)^2 f''(W_i^*) \Big] \Big| + \Big| E\Big[ \sum_i X_i f(W_i) \Big] \Big| \leq \frac{\|f''\|}{2}\, E\Big[ \sum_i |X_i| (W_i - W)^2 \Big] + \Big| E\Big[ \sum_i X_i f(W_i) \Big] \Big|,$$
where the second inequality follows from a Taylor expansion, with $W_i^*$ being some value between $W$ and $W_i$. First, note that:
$$\|f''\|\, E\Big[ \sum_i |X_i| (W_i - W)^2 \Big] = \|f''\|\, E\Big[ \sum_i \sum_{j_1, j_2 \in N_{\leq k}(i)} |X_i X_{j_1} X_{j_2}| \Big] \leq \|f''\| \sum_i \sum_{j_1, j_2 \in N_{\leq k}(i)} \frac{E|X_i^3| + E|X_{j_1}^3| + E|X_{j_2}^3|}{3} \leq \frac{2 c_1 B^3 T |N_{\leq k}|^2}{\sigma_T^3}$$
(the factor 2 is added for later ease of notation). As for the second term in Eq. 11, we have:
$$\Big| E\Big[ \sum_i X_i f(W_i) \Big] \Big| \leq \sum_i \big| E[X_i f(W_i)] - E[X_i]\, E[f(W_i)] \big| \quad (\text{because } E[X_i] = 0) = \sum_i |\operatorname{cov}(X_i, f(W_i))| \leq \frac{4 \|f\| B T \alpha(k)}{\sigma_T} = \frac{c_2 B T \alpha(k)}{\sigma_T}.$$
Thus, we obtain a bound for both terms in Eq. 11, and hence a bound for the second term of Eq. 10. We will now bound the first term in Eq. 10. Let $P_T = \sum_i X_i(W_i - W)$. Denote by $\tau_k$ the tail sum $\sum_{m>k} |N_m|\, \alpha(m)$. Recall that $E[X_i] = 0$ and $E[W^2] = 1$. Thus,
$$\Big| E\Big[ f'(W)\Big(1 + \sum_i X_i(W_i - W)\Big) \Big] \Big| \leq \big| E[f'(W)(1 + P_T)] \big| \leq \|f'\| \sqrt{E[(1 + P_T)^2]} = \|f'\| \sqrt{E\big[ (1 + E[P_T]) + (P_T - E[P_T]) \big]^2} \leq \sqrt{2/\pi}\, \sqrt{(1 + E[P_T])^2 + \operatorname{var}(P_T)}.$$
Now,
$$|E[P_T] + 1| = \Big| E\Big[ \sum_i X_i W_i \Big] \Big| = \Big| \sum_i E\Big[ X_i \sum_{j \in N_{>k}(i)} X_j \Big] \Big| = \Big| \sum_i \sum_{m>k} \sum_{j \in N_m(i)} \big( E[X_i X_j] - E[X_i] E[X_j] \big) \Big| \leq \sum_i \sum_{m>k} \frac{c'' B^2}{\sigma_T^2}\, \alpha(m)\, |N_m| \leq \frac{c'' B^2 T \tau_k}{\sigma_T^2}.$$
Next, we look at the $\operatorname{var}(P_T)$ term:
$$\operatorname{var}(P_T) = E[P_T^2] - E[P_T]^2 = E\Big[ \Big( \sum_i \sum_{j \in N_{\leq k}(i)} X_i X_j \Big)^2 \Big] - E[P_T]^2 = \underbrace{E\Big[ \sum_{i,j} \sum_{s \in N_{\leq k}(i)} \sum_{t \in N_{\leq k}(j)} X_i X_j X_s X_t \Big]}_{(A)} - E[P_T]^2. \qquad (12)$$
The first term (i.e., term $(A)$) in Eq. 12 can be broken into two parts: one where the minimum distance between any node in $\{i, s\}$ and any node in the pair $\{j, t\}$ is $\leq k$ (denote this by the set $F_{\leq k}$), and one where it is greater than $k$ (denote this by the set $F_{>k}$).
Formally, we define the following terms:
$$F_m = \Big\{ (i, j, s, t) : s \in N_{\leq k}(i),\ t \in N_{\leq k}(j),\ \min_{a \in \{i,s\},\, b \in \{j,t\}} \operatorname{dist}(a, b) = m \Big\}, \qquad F_{\leq k} = \bigcup_{m \leq k} F_m, \qquad F_{>k} = \bigcup_{m > k} F_m,$$
and write $|F_m|$ and $|F_{\leq k}|$ for the cardinalities of these sets. Consider the term $|F_{\leq k}|$. Given $i$, $s$ can be picked in at most $|N_{\leq k}|$ ways. Now, either $j$ or $t$ or both must be within distance $k$ of $i$ or $s$. Thus, given $i$ and $s$, $j$ (or $t$) can be picked in at most $2|N_{\leq k}|$ ways, and then $t$ (or $j$) can be picked in another $|N_{\leq k}|$ ways. Hence, $|F_{\leq k}| \leq 4 T |N_{\leq k}|^3$. By a similar argument, $|F_m| \leq 4 T |N_{\leq k}|^2 |N_m|$. Now we have:
$$(A) = \sum_{F_{\leq k}} E[X_i X_j X_s X_t] + \sum_{F_{>k}} E[X_i X_s]\, E[X_j X_t] + \sum_{F_{>k}} \big( E[X_i X_j X_s X_t] - E[X_i X_s] E[X_j X_t] \big) \leq \underbrace{\sum_{F_{\leq k}} E[X_i X_j X_s X_t]}_{(B0)} + \underbrace{\sum_{F_{>k}} E[X_i X_s]\, E[X_j X_t]}_{(B1)} + \underbrace{\frac{4 B^4}{\sigma_T^4} \sum_{m>k} \sum_{F_m} \alpha(m)}_{(B2)}.$$
$$(B0) = \sum_{F_{\leq k}} E[X_i X_j X_s X_t] \leq \sum_{F_{\leq k}} \frac{E[X_i^4] + E[X_j^4] + E[X_s^4] + E[X_t^4]}{4} \leq \frac{B^4}{\sigma_T^4} \sum_{F_{\leq k}} 1 \leq \frac{4 B^4 T |N_{\leq k}|^3}{\sigma_T^4}.$$
$$(B1) = \sum_{F_{>k}} E[X_i X_s]\, E[X_j X_t] = \sum_{F_{>k} \cup F_{\leq k}} E[X_i X_s]\, E[X_j X_t] - \sum_{F_{\leq k}} E[X_i X_s]\, E[X_j X_t] \leq \Big( \sum_i E[X_i (W - W_i)] \Big)^2 + \sum_{F_{\leq k}} \frac{E[X_i^4] + E[X_s^4] + E[X_j^4] + E[X_t^4]}{4} \leq (E[P_T])^2 + \frac{4 B^4 T |N_{\leq k}|^3}{\sigma_T^4}.$$
$$(B2) \leq \frac{4 B^4}{\sigma_T^4} \sum_{m>k} |F_m|\, \alpha(m) \leq \frac{16 B^4 T |N_{\leq k}|^2}{\sigma_T^4} \sum_{m>k} |N_m|\, \alpha(m) = \frac{16 B^4 T |N_{\leq k}|^2 \tau_k}{\sigma_T^4}.$$
The last equations use a number of applications of the fact that the geometric mean is less than the arithmetic mean, together with Jensen's inequality. Plugging these into Equation 12 (the $E[P_T]^2$ terms cancel), we have:
$$\operatorname{var}(P_T) \leq (B0) + (B1) + (B2) - E[P_T]^2 \leq \frac{4 B^4 T |N_{\leq k}|^3}{\sigma_T^4} + \frac{4 B^4 T |N_{\leq k}|^3}{\sigma_T^4} + \frac{16 B^4 T |N_{\leq k}|^2 \tau_k}{\sigma_T^4} = \frac{8 B^4 T |N_{\leq k}|^3}{\sigma_T^4} + \frac{16 B^4 T |N_{\leq k}|^2 \tau_k}{\sigma_T^4}.$$
Combining these steps, and recalling that $\gamma_T = T/\sigma_T$, we finally obtain the following form for Eq. 10:
$$d_W(W, Z) \leq \frac{c_1 B^3 T |N_{\leq k}|^2}{\sigma_T^3} + \frac{c_2 B T \alpha(k)}{\sigma_T} + \|f'\| \sqrt{ \frac{c''^2 B^4 T^2 \tau_k^2}{\sigma_T^4} + \frac{8 B^4 T |N_{\leq k}|^3}{\sigma_T^4} + \frac{16 B^4 T |N_{\leq k}|^2 \tau_k}{\sigma_T^4} } \leq \frac{c_1 B^3 \gamma_T |N_{\leq k}|^2}{\sigma_T^2} + c_2 B \gamma_T \alpha(k) + B^2 \sqrt{ c_3 \Big(\frac{\gamma_T \tau_k}{\sigma_T}\Big)^2 + \frac{c_4 \gamma_T |N_{\leq k}|^3}{\sigma_T^3} + \frac{c_5 \gamma_T \tau_k |N_{\leq k}|^2}{\sigma_T^3} }.$$

10.3 Statement and proofs of results from Section 7

We start by reminding the reader of some of the definitions:
$$\hat{h}_T(t) := \frac{1}{n} \sum_{i=1}^{n} K_{b_T}(d_t(i), Q)\, \eta^+_{i,t+1}(s), \qquad \hat{f}_T(t) := \frac{1}{n} \sum_{i=1}^{n} K_{b_T}(d_t(i), Q)\, \eta_{i,t+1}(s),$$
$$q_t := \hat{h}_T(t) - E[\hat{h}_T(t) \mid S_C] - g\big( \hat{f}_T(t) - E[\hat{f}_T(t) \mid S_C] \big), \qquad p_t := [\hat{h}_T(t) - g \hat{f}_T(t)] - E[\hat{h}_T(t) - g \hat{f}_T(t) \mid \mathcal{E}_{T_1}, S_C].$$
We define $\sigma_T^2(T_1, C) := \operatorname{var}(\sum_t q_t \mid \mathcal{E}_{T_1}, S_C)$ and $\sigma_T^2(C) := \operatorname{var}(\sum_t q_t \mid S_C)$.

Lemma 7.2. Under Assumption 1 and assuming $\sigma_c > 0$, conditioned on $S_C$,
$$\sum_t q_t/\sqrt{T} \stackrel{d}{\to} N(0, \sigma_c^2) \quad \text{as } T \to \infty.$$
Proof. Using our distributional convergence results conditioned on $\mathcal{E}_{T_1} \cap S_C$, we have shown that
$$\sum_{t \geq T_1 + M} p_t/\sqrt{T} \stackrel{d}{\to} N(0, \sigma_c^2) \quad \text{conditioned on } \mathcal{E}_{T_1} \cap S_C, \text{ when } T_1 \text{ has a finite value}.$$
Denote $V_t := \hat{h}_T(t) - g \hat{f}_T(t)$.
We have $\big| \sum_t q_t/\sqrt{T} - \sum_{t \geq T_1 + M} p_t/\sqrt{T} \big| \to 0$ conditioned on $\mathcal{E}_{T_1} \cap S_C$ for any finite $T_1$ (the difference involves only the finitely many terms $V_t$ with $t < T_1 + M$, together with the gap between the two centerings, which vanishes by Lemma 5.5). Hence, for any finite $K$,
$$\limsup_{T \to \infty} P\Big( \sum_t q_t/\sqrt{T} \leq x \,\Big|\, S_C \Big) \leq \Phi_{0, \sigma_c^2}(x)\, P(T_1 \leq K) + P(T_1 > K).$$
In the last step, the exchange of limit and expectation is valid by virtue of the Dominated Convergence Theorem. Now taking $K \to \infty$ (which minimizes the upper bound on the lim sup) and using the geometric bound on the tail probability of $T_1$ in finite state space Markov chains, we have:
$$\limsup_{T \to \infty} P\Big( \sum_t q_t/\sqrt{T} \leq x \,\Big|\, S_C \Big) \leq \Phi_{0, \sigma_c^2}(x).$$
An identical argument on $P(\sum_t q_t/\sqrt{T} > x \mid S_C)$ gives the following equation:
$$\liminf_{T \to \infty} P\Big( \sum_t q_t/\sqrt{T} \leq x \,\Big|\, S_C \Big) \geq \Phi_{0, \sigma_c^2}(x).$$
Thus we show that, for all $x \in \mathbb{R}$, $P(\sum_t q_t/\sqrt{T} \leq x \mid S_C) \to \Phi_{0, \sigma_c^2}(x)$ as $T \to \infty$, which in turn proves our result.

Lemma 7.3. Define $p_t := [\hat{h}_T(t) - g \hat{f}_T(t)] - E[\hat{h}_T(t) - g \hat{f}_T(t) \mid \mathcal{E}_{T_1}, S_C]$. Under Assumption 1 and assuming $\sigma_c > 0$, for any finite $T_1$, we have:
$$\sum_{t \geq T_1 + M} p_t/\sqrt{T} \stackrel{d}{\to} N(0, \sigma_c^2) \quad \text{conditioned on } \mathcal{E}_{T_1} \cap S_C \text{ as } T \to \infty.$$
Proof. We will prove the above result in two parts. If we can show that the conditions in Lemma 6.3 are satisfied for
$$W_T := \sum_{t \geq T_1 + M} p_t \Big/ \sqrt{ \operatorname{var}\Big( \sum_{t \geq T_1 + M} p_t \mid \mathcal{E}_{T_1}, S_C \Big) },$$
then, using Lemma 6.3, we will have $W_T \stackrel{d}{\to} N(0,1)$ conditioned on $\mathcal{E}_{T_1} \cap S_C$. However, for any finite value of $T_1$, $\operatorname{var}(\sum_{t \geq T_1 + M} p_t \mid \mathcal{E}_{T_1}, S_C)/T \to \sigma_c^2$ (see Lemma 5.7, Eq. 7). Thus, with the additional assumption of $\sigma_c > 0$, the result is proved.

Now we will show that, conditioned on $\mathcal{E}_{T_1} \cap S_C$, the conditions in Lemma 6.3 are satisfied for $W_T$, and thus the Wasserstein distance in Lemma 6.2 can be upper bounded by $O(T^{-1/2} \log^2(T))$. First note that $p_t$ is bounded and $E[p_t \mid \mathcal{E}_{T_1}, S_C] = 0$; thus $p_t$ corresponds to $Y_t$ in Lemma 6.2. Since $p_t$ is a function of $S_t$, it involves $p+1$ graphs $(G_{t-p+1}, \ldots, G_{t+1})$. The distance $\operatorname{dist}(i, j)$ is defined as $\max(|i - j| - (p+1), 0)$. Thus $|N_m|$ equals 2 for $m > 0$, and $2(p+1)$ otherwise; hence $|N_m| = O(1)$. Also, $|N_{\leq k}| = O(k)$. Denote by $\sigma_T(T_1, C)$ the standard deviation of $\sum_{t \geq T_1} p_t$ conditioned on $S_C \cap \mathcal{E}_{T_1}$. Let us now examine the conditions in Lemma 6.3.

Condition 1: $\gamma_T \to \infty$. We have $\gamma_T = T/\sigma_T(T_1, C) \to \sqrt{T}/\sigma_c \to \infty$, where the limits follow from Lemma 5.7 and the $\sigma_c > 0$ assumption.

Condition 2: Let $k(T) = \log T/\beta$. We will show that this satisfies conditions 2a, 2b, and 2c.

2a: $\gamma_T\, \alpha(k(T)) \to 0$. Plugging in the value of $k(T)$, and using Lemma 5.7, we see that
$$\gamma_T\, \alpha(k(T)) = O\Big( \frac{T e^{-\beta k(T)}}{\sigma_T(T_1, C)} \Big) = O(T^{-1/2}).$$

2b: $\gamma_T\, \tau_{k(T)}/\sigma_T(T_1, C) \to 0$. Using Lemma 5.7, we see that
$$\frac{\gamma_T\, \tau_{k(T)}}{\sigma_T(T_1, C)} = \frac{T}{\sigma_T(T_1, C)^2}\, \tau_{k(T)} = O(\tau_{k(T)}) = O\Big( \sum_{m > k(T)} |N_m|\, \alpha(m) \Big) = O\Big( e^{-\beta k(T)} \sum_{t > 0} e^{-\beta t} \Big),$$
using $|N_m| = O(1)$ and $\alpha(k) = O(e^{-\beta k})$; this is $O(e^{-\beta k(T)}) = O(T^{-1})$, using $k(T) = \log T/\beta$.

2c: $\gamma_T\, |N_{\leq k(T)}|^2/\sigma_T(T_1, C)^2 \to 0$. Again, using Lemma 5.7:
$$\frac{\gamma_T\, |N_{\leq k(T)}|^2}{\sigma_T(T_1, C)^2} = \frac{T}{\sigma_T(T_1, C)^3}\, |N_{\leq k(T)}|^2 = O\Big( \frac{|N_{\leq k(T)}|^2}{\sqrt{T}} \Big) = O\Big( \frac{k(T)^2}{\sqrt{T}} \Big) = O\big( (\log T)^2/T^{1/2} \big),$$
using $k(T) = \log T/\beta$.

Now the upper bound on the Wasserstein distance (Lemma 6.2) becomes $O(\log^2(T)/\sqrt{T})$ by using $k(T) = \log(T)/\beta$ and the expressions derived above as part of the second condition.
Acknowledgements

We are grateful to Peter Bickel for helpful discussions on this topic.

References

[1] L. Adamic and E. Adar. Friends and neighbors on the web. Social Networks, 25:211–230, 2003.
[2] J. Aitchison and C. G. G. Aitken. Multivariate binary discrimination by the kernel method. Biometrika, 63:413–420, 1976.
[3] R. C. Bradley. Basic properties of strong mixing conditions. A survey and some open questions. Probability Surveys, 2:107–144, 2005.
[4] L. H. Y. Chen, L. Goldstein, and Q. M. Shao. Normal Approximation by Stein's Method. Springer Verlag, 2010.
[5] R. Durrett. Probability: Theory and Examples. Duxbury Press, 1995.
[6] W. Fu, E. P. Xing, and L. Song. A state-space mixed membership blockmodel for dynamic network tomography. Annals of Applied Statistics, 4:535–566, 2010.
[7] G. Grimmett and D. Stirzaker. Probability and Random Processes. Oxford University Press, 2001.
[8] S. Hanneke and E. P. Xing. Discrete temporal models of social networks. Electronic Journal of Statistics, 4:585–605, 2006.
[9] B. Heidergott, A. Hordijk, and M. van Uitert. Series expansions for finite-state Markov chains. Tinbergen Institute Discussion Papers 05-086/4, 2005.
[10] P. D. Hoff. Latent factor models for relational data. URL http://www.stat.washington.edu/hoff/public/acms.pdf.
[11] P. W. Holland and S. Leinhardt. A dynamic model for social networks. Journal of Mathematical Sociology, 5:5–20, 1977.
[12] Z. Huang and D. K. J. Lin. The time-series link prediction problem with applications in communication surveillance. INFORMS Journal on Computing, 2009.
[13] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In ACM Symposium on Theory of Computing. MIT Press, 1998.
[14] L. Katz. A new status index derived from sociometric analysis. Psychometrika, 18:39–43, 1953.
[15] M. Kolar, L. Song, A. Ahmed, and E. Xing. Estimating time-varying networks. Annals of Applied Statistics, 2010.
[16] E. Levina and P. J. Bickel. Texture synthesis and nonparametric resampling of random fields. Annals of Statistics, 34(4):1751–1773, 2006.
[17] D. Liben-Nowell and J. Kleinberg. The link prediction problem for social networks. In Conference on Information and Knowledge Management. ACM, 2003.
[18] E. Masry and D. Tjøstheim. Nonparametric estimation and identification of nonlinear ARCH time series. Econometric Theory, 11:258–289, 1995.
[19] E. Paparoditis and D. N. Politis. The local bootstrap for Markov processes. Journal of Statistical Planning and Inference, 108:301–328, 2002.
[20] D. Pham. The mixing property of bilinear and generalised random coefficient autoregressive models. Stochastic Processes and their Applications, 23:291–300, 1986.
[21] D. Politis, J. Romano, and M. Wolf. Subsampling. Springer, 1999.
[22] D. N. Politis and J. P. Romano. Nonparametric resampling for homogeneous strong mixing random fields. Journal of Multivariate Analysis, 47(2):301–328, 1993.
[23] A. E. Raftery, M. S. Handcock, and P. D. Hoff. Latent space approaches to social network analysis. Journal of the American Statistical Association, 15:460, 2002.
[24] E. Richard, S. Gaiffas, and N. Vayatis. Link prediction in graphs with autoregressive features. In P. Bartlett, F. C. N. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 2843–2851, 2012.
[25] Y. Rinott and V. Rotar. A multivariate CLT for local dependence with $n^{-1/2}\log n$ rate and applications to multivariate graph related statistics. Journal of Multivariate Analysis, 56(2):333–350, 1996.
[26] P. Sarkar and A. Moore. Dynamic social network analysis using latent space models. In Advances in Neural Information Processing Systems, 2005.
[27] P. Sarkar, L. Chen, and A. Dubrawski. Dynamic network model for predicting occurrences of salmonella at food facilities. In Biosurveillance and Biosecurity: International Workshop, BioSecure. Springer, 2008.
[28] J. Sunklodas. On normal approximation for strongly mixing random variables. Acta Applicandae Mathematicae, 97:251–260, 2007.
[29] T. Snijders and K. Nowicki. Estimation and prediction for stochastic blockmodels for graphs with latent block structure. Journal of Classification, 1997.
[30] T. Tylenda, R. Angelova, and S. Bedathur. Towards time-aware link prediction in evolving social networks. In ACM Workshop on Social Network Mining and Analysis. ACM, 2009.
[31] D. Vu, A. Asuncion, D. Hunter, and P. Smyth. Continuous-time regression models for longitudinal networks. In Advances in Neural Information Processing Systems. MIT Press, 2011.
[32] M.-C. Wang and J. van Ryzin. A class of smooth estimators for discrete distributions. Biometrika, 1981.
[33] E. Wilson. Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association, 22:209–212, 1927.
[34] S. Zhou, J. Lafferty, and L. Wasserman. Time varying undirected graphs. In Conference on Learning Theory, 2008.