Overlapping Communities Detection via Measure Space Embedding
We present a new algorithm for community detection. The algorithm uses random walks to embed the graph in a space of measures, after which a modification of $k$-means in that space is applied. The algorithm is therefore fast and easily parallelizable…
Authors: Mark Kozdoba, Shie Mannor
Abstract

We present a new algorithm for community detection. The algorithm uses random walks to embed the graph in a space of measures, after which a modification of k-means in that space is applied. The algorithm is therefore fast and easily parallelizable. We evaluate the algorithm on standard random graph benchmarks, including some overlapping community benchmarks, and find its performance to be better or at least as good as previously known algorithms. We also prove a linear time (in number of edges) guarantee for the algorithm on a p,q-stochastic block model with p ≥ c·N^{-1/2+ε} and p − q ≥ c'·√(p·N^{-1/2+ε}·log N).

1 Introduction

Community detection in graphs, also known as graph clustering, is a problem where one wishes to identify subsets of the vertices of a graph such that the connectivity inside the subset is in some way denser than the connectivity of the subset with the rest of the graph. Such subsets are referred to as communities, and it often happens in applications that if two vertices belong to the same community, they have similar application-related qualities. This in turn may allow for a higher-level analysis of the graph, in terms of communities instead of individual nodes. Community detection finds applications in a diversity of fields, such as social network analysis, communication and traffic design, biological networks, and, generally, in most fields where meaningful graphs can arise (see, for instance, [1] for a survey). In addition to direct applications to graphs, community detection can, for instance, also be applied to general Euclidean space clustering problems, by transforming the metric to a weighted graph structure (see [2] for a survey).
Community detection problems come in different flavours, depending on whether the graph in question is simple, weighted, or/and directed. Another important distinction is whether the communities are allowed to overlap or not. In the overlapping communities case, each vertex can belong to several subsets.

A difficulty with community detection is that the notion of community is not well defined. Different algorithms may employ different formal notions of a community, and can sometimes produce different results. Nevertheless, there exist several widely adopted benchmarks (synthetic models and real-life graphs) where the ground truth communities are known, and algorithms are evaluated based on the similarity of the produced output to the ground truth, and based on the amount of required computations. On the theoretical side, most of the effort is concentrated on developing algorithms with guaranteed recovery of clusters for graphs generated from variants of the Stochastic Block Model (referred to as SBM in what follows, [1]).

In this paper we present a new algorithm, DER (Diffusion Entropy Reducer, for reasons to be clarified later), for non-overlapping community detection. The algorithm is an adaptation of the k-means algorithm to a space of measures which are generated by short random walks from the nodes of the graph. The adaptation is done by introducing a certain natural cost on the space of the measures. As detailed below, we evaluate DER on several benchmarks and find its performance to be as good as or better than the best alternative methods. In addition, we establish some theoretical guarantees on its performance. While the main purpose of the theoretical analysis in this paper is to provide some insight into why DER works, our result is also one of the very few results in the literature that show reconstruction in linear time.
On the empirical side, we first evaluate our algorithm on a set of random graph benchmarks known as the LFR models, [3]. In [4], 12 other algorithms were evaluated on these benchmarks, and three algorithms, described in [5], [6] and [7], were identified that exhibited significantly better performance than the others, and similar performance among themselves. We evaluate our algorithm on random graphs with the same parameters as those used in [4] and find its performance to be as good as these three best methods. Several well known methods, including spectral clustering [8], exhaustive modularity optimization (see [4] for details), and clique percolation [9], have worse performance on the above benchmarks.

Next, while our algorithm is designed for non-overlapping communities, we introduce a simple modification that enables it to detect overlapping communities in some cases. Using this modification, we compare the performance of our algorithm to the performance of 4 overlapping community algorithms on a set of benchmarks that were considered in [10]. We find that in all cases DER performs better than all 4 algorithms.

None of the algorithms evaluated in [4] and [3] has theoretical guarantees. On the theoretical side, we show that DER reconstructs with high probability the partition of the p,q-stochastic block model such that, roughly, p ≥ N^{-1/2}, where N is the number of vertices, and p − q ≥ c·√(p·N^{-1/2+ε}·log N) (this holds in particular when p/q ≥ c' > 1) for some constant c > 0. We show that for this reconstruction only one iteration of the k-means is sufficient. In fact, three passages over the set of edges suffice.
While the cost function we introduce for DER will appear at first to have a purely probabilistic motivation, for the purposes of the proof we provide an alternative interpretation of this cost in terms of the graph, and the arguments show which properties of the graph are useful for the convergence of the algorithm.

Finally, although this is not the emphasis of the present paper, it is worth noting here that, as will be evident later, our algorithm can be trivially parallelized. This seems to be a particularly nice feature since most other algorithms, including spectral clustering, are not easy to parallelize and do not seem to have parallel implementations at present.

The rest of the paper is organized as follows: Section 2 overviews related work and discusses relations to our results. In Section 3 we provide the motivation for the definition of the algorithm, derive the cost function and establish some basic properties. In Section 4 we present the results of the empirical evaluation of the algorithm, and Section 5 describes the theoretical guarantees and the general proof scheme. Some proofs and additional material are provided in the supplementary material.

2 Literature review

Community detection in graphs has been an active research topic for the last two decades and has generated a huge literature. We refer to [1] for an extensive survey. Throughout the paper, let G = (V, E) be a graph, and let P = P_1, ..., P_k be a partition of V. Loosely speaking, a partition P is a good community structure on G if for each P_i ∈ P, more edges stay within P_i than leave P_i. This is usually quantified via some cost function that assigns larger scalars to partitions P that are in some sense better separated.
Perhaps the most well known cost function is the modularity, which was introduced in [11] and served as a basis of a large number of community detection algorithms ([1]). The popular spectral clustering methods, [8]; [2], can also be viewed as a (relaxed) optimization of a certain cost (see [2]). Yet another group of algorithms is based on fitting a generative model of a graph with communities to a given graph. References [12]; [10] are two among the many examples. Perhaps the simplest generative model for non-overlapping communities is the stochastic block model, see [13]; [1], which we now define: Let P = P_1, ..., P_k be a partition of V into k subsets. The p,q-SBM is a distribution over the graphs on vertex set V, such that all edges are independent, and for i, j ∈ V, the edge (i, j) exists with probability p if i, j belong to the same P_s, and it exists with probability q otherwise. If q << p, the components P_i will be well separated in this model. We denote the number of nodes by N = |V| throughout the paper.

Graphs generated from SBMs can serve as a benchmark for community detection algorithms. However, such graphs lack certain desirable properties, such as power-law degree and community size distributions. Some of these issues were fixed in the benchmark models in [3]; [14], and these models are referred to as LFR models in the literature. More details on these models are given in Section 4.

We now turn to the discussion of the theoretical guarantees. Typically, results in this direction provide algorithms that can reconstruct, with high probability, the ground partition of a graph drawn from a variant of a p,q-SBM model, with some, possibly large, number of components k. Recent results include the works [15] and [16]. In this paper, however, we shall only analytically analyse the k = 2 case, and such that, in addition, |P_1| = |P_2|.
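The p,q-SBM just defined is straightforward to sample from. The following is a minimal sketch (our own illustration, not code from the paper; the function name and the dense numpy representation are our choices) that draws an adjacency matrix with k equal planted blocks:

```python
import numpy as np

def sample_sbm(N, k, p, q, seed=None):
    """Draw the adjacency matrix of a p,q-SBM with k equal planted blocks.

    Each edge (i, j), i < j, is present independently with probability p
    when i and j share a block, and q otherwise (no self-loops).
    """
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(k), N // k)          # planted partition P_1, ..., P_k
    same = labels[:, None] == labels[None, :]
    prob = np.where(same, p, q)
    upper = np.triu(rng.random((N, N)) < prob, k=1)   # independent upper-triangular coin flips
    A = (upper | upper.T).astype(int)                 # symmetrize; diagonal stays zero
    return A, labels
```

For instance, sample_sbm(200, 2, 0.3, 0.05) produces two well separated blocks of 100 nodes each, a convenient test case for the algorithms discussed below.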
For this case, the best known reconstruction result was obtained already in [17] and was only improved in terms of runtime since then. Namely, Boppana's result states that if p ≥ c_1·log N / N and p − q ≥ c_2·√(p·log N / N), then with high probability the partition is reconstructible. Similar bounds can be obtained, for instance, from the approaches in [15]; [16], to name a few. The methods in this group are generally based on the analysis of the spectrum of the adjacency matrix. The run time of these algorithms is non-linear in the size of the graph, and it is not known how these algorithms behave on graphs not generated by the probabilistic models that they assume.

It is generally known that when the graphs are dense (p of the order of a constant), simple linear time reconstruction algorithms exist (see [18]). The first, and to the best of our knowledge the only, previous linear time algorithm for non-dense graphs was proposed in [18]. This algorithm works for p ≥ c_3(ε)·N^{-1/2+ε}, for any fixed ε > 0. The approach of [18] was further extended in [19] to handle more general cluster sizes. These approaches differ significantly from the spectrum based methods, and provide equally important theoretical insight. However, their empirical behaviour was never studied, and it is likely that even for graphs generated from the SBM, extremely high values of N would be required for the algorithms to work, due to large constants in the concentration inequalities (see the concluding remarks in [19]).

3 Algorithm

Let G be a finite undirected graph with a vertex set V = {1, ..., n}. Denote by A = {a_ij} the symmetric adjacency matrix of G, where a_ij ≥ 0 are edge weights, and for a vertex i ∈ V, set d_i = Σ_j a_ij to be the degree of i. Let D be an n × n diagonal matrix such that D_ii = d_i, and set T = D^{-1}A to be the transition matrix of the random walk on G.
Set also p_ij = T_ij. Finally, denote by π, π(i) = d_i / Σ_j d_j, the stationary measure of the random walk.

A number of community detection algorithms are based on the intuition that distinct communities should be relatively closed under the random walk (see [1]), and employ different notions of closedness. Our approach also takes this point of view. For a fixed L ∈ N, consider the following sampling process on the graph: Choose a vertex v_0 randomly from π, and perform L steps of a random walk on G, starting from v_0. This results in a length L + 1 sequence of vertices, x_1. Repeat the process N times independently, to obtain x_1, ..., x_N.

Suppose now that we would like to model the sequences x_s as a multinomial mixture model with a single component. Since each coordinate x_t^s is distributed according to π, the single component of the mixture should be π itself, when N grows. Now suppose that we would like to model the same sequences with a mixture of two components. Because the sequences are sampled from a random walk rather than independently from each other, the components need no longer be π itself, as in any mixture where some elements appear more often together than others. The mixture as above can be found using the EM algorithm, and this in principle summarizes our approach. The only additional step, as discussed above, is to replace the sampled random walks with their true distributions, which simplifies the analysis and also leads to somewhat improved empirical performance.

We now present the DER algorithm for detecting the non-overlapping communities. Its input is the number of components to detect, k, the length of the walks L, and an initialization partition P = {P_1, ..., P_k} of V into disjoint subsets. P would usually be taken to be a random partition of V into equally sized subsets. For t = 0, 1, ...
and a vertex i ∈ V, denote by w_i^t the i-th row of the matrix T^t. Then w_i^t is the distribution of the random walk on G, started at i, after t steps. Set w_i = (1/L)(w_i^1 + ... + w_i^L), which is the distribution corresponding to the average of the empirical measures of sequences x that start at i. For two probability measures ν, μ on V, set

    D(ν, μ) = Σ_{i ∈ V} ν(i) log μ(i).

Although D is not a metric, it will act as a distance function in our algorithm. Note that if ν were an empirical measure, then, up to a constant, D would be just the log-likelihood of observing ν from independent samples of μ. For a subset S ⊂ V, set π_S to be the restriction of the measure π to S, and also set d_S = Σ_{i ∈ S} d_i to be the full degree of S. Let

    μ_S = (1/d_S) Σ_{i ∈ S} d_i w_i    (1)

denote the distribution of the random walk started from π_S.

The complete DER algorithm is described in Algorithm 1. The algorithm is essentially a k-means algorithm in a non-Euclidean space, where the points are the measures w_i, each occurring with multiplicity d_i. Step (1) is the "means" step, and (2) is the maximization step. Let

    C = Σ_{l=1}^{k} Σ_{i ∈ P_l} d_i · D(w_i, μ_l)    (2)

be the associated cost. As with the usual k-means, we have the following

Lemma 3.1. Either P is unchanged by steps (1) and (2), or both steps (1) and (2) strictly increase the value of C.

Algorithm 1 DER
1: Input: Graph G, walk length L, number of components k.
2: Compute the measures w_i.
3: Initialize P_1, ..., P_k to be a random partition such that |P_i| = |V|/k for all i.
4: repeat
5:   (1) For all s ≤ k, construct μ_s = μ_{P_s}.
6:   (2) For all s ≤ k, set P_s = {i ∈ V | s = argmax_l D(w_i, μ_l)}.
7: until the sets P_s do not change

The proof is by direct computation and is deferred to the supplementary material.
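For concreteness, Algorithm 1 can be sketched in a few lines of numpy on a dense adjacency matrix. This is an illustrative sketch, not the authors' implementation: a practical version would use sparse matrices, and we add a small floor inside the logarithm and a guard for empty clusters, details the pseudocode does not need to spell out.

```python
import numpy as np

def der(A, k, L=1, max_iter=100, seed=None, eps=1e-300):
    """Sketch of Algorithm 1 (DER) on a dense weighted adjacency matrix A."""
    rng = np.random.default_rng(seed)
    d = A.sum(axis=1).astype(float)                 # degrees d_i
    T = A / d[:, None]                              # transition matrix T = D^{-1} A
    # w_i = (1/L) (w_i^1 + ... + w_i^L): rows of the averaged powers of T
    W, Tt = np.zeros_like(T), np.eye(len(d))
    for _ in range(L):
        Tt = Tt @ T
        W += Tt
    W /= L
    # Step 3: random partition into (nearly) equal sized subsets
    labels = rng.permutation(np.arange(len(d)) % k)
    for _ in range(max_iter):
        # (1) "means" step: mu_s = (1/d_S) sum_{i in P_s} d_i w_i, eq. (1)
        mu = np.zeros((k, len(d)))
        for s in range(k):
            mask = labels == s
            if mask.any():
                mu[s] = (d[mask] @ W[mask]) / d[mask].sum()
        # (2) assignment step: s = argmax_l D(w_i, mu_l), with
        # D(nu, mu) = sum_j nu(j) log mu(j); eps guards log(0)
        new_labels = np.argmax(W @ np.log(mu + eps).T, axis=1)
        if np.array_equal(new_labels, labels):
            break                                   # the sets P_s did not change
        labels = new_labels
    return labels
```

On a graph with a few dense blocks and the matching k, the loop typically stops after a handful of passes; as discussed below, in practice one would restart from several random initializations.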
Since the number of configurations P is finite, it follows that DER always terminates and provides a "local maximum" of the cost C.

The cost C can be rewritten in a somewhat more informative form. To do so, we introduce some notation first. Let X be a random variable on V, distributed according to the measure π. Let Y be a step of a random walk started at X, so that the distribution of Y given X = i is w_i. Finally, for a partition P, let Z be the indicator variable of the partition, Z = s iff X ∈ P_s. With this notation, one can write

    C = −d_V · H(Y | Z) = d_V (−H(Y) + H(Z) − H(Z | Y)),    (3)

where the H's are the full and conditional Shannon entropies. Therefore, the DER algorithm can be interpreted as seeking a partition that maximizes the information between the current known state (Z) and the next step from it (Y). This interpretation gives rise to the name of the algorithm, DER, since every iteration reduces the entropy H(Y | Z) of the random walk, or diffusion, with respect to the partition.

The second equality in (3) has another interesting interpretation. Suppose, for simplicity, that k = 2, with partition P_1, P_2. In general, a clustering algorithm aims to minimize the cut, the number of edges between P_1 and P_2. However, minimizing the number of edges directly will lead to situations where P_1 is a single node, connected with one edge to the rest of the graph in P_2. To avoid such situations, a relative, normalized version of the cut needs to be introduced, which takes into account the sizes of P_1, P_2. Every clustering algorithm has a way to resolve this issue, implicitly or explicitly. For DER, this is shown in the second equality of (3): H(Z) is maximized when the components are of equal sizes (with respect to π), while H(Z | Y) is minimized when the measures μ_{P_s} are as disjointly supported as possible.
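Identity (3) can be verified numerically. In the sketch below (the toy graph and partition are our own illustration), we take L = 1, so P(Y = j | Z = l) = μ_l(j) and P(Z = l) = π(P_l), and check both equalities of (3) against the direct cost (2):

```python
import numpy as np

# A small weighted graph and a partition of its vertices (illustrative choice).
A = np.array([[0, 2, 1, 0],
              [2, 0, 0, 1],
              [1, 0, 0, 3],
              [0, 1, 3, 0]], dtype=float)
d = A.sum(axis=1); dV = d.sum(); pi = d / dV
W = A / d[:, None]                            # L = 1, so w_i is the i-th row of T
parts = [np.array([0, 1]), np.array([2, 3])]  # partition P_1, P_2

# Direct cost (2): C = sum_l sum_{i in P_l} d_i * D(w_i, mu_l)
mu = np.stack([(d[S] @ W[S]) / d[S].sum() for S in parts])
C = sum(d[i] * (W[i] @ np.log(mu[l])) for l, S in enumerate(parts) for i in S)

# First equality in (3): C = -d_V * H(Y|Z)
pZ = np.array([pi[S].sum() for S in parts])
HYgZ = -sum(pZ[l] * (mu[l] @ np.log(mu[l])) for l in range(2))

# Second equality in (3): -H(Y|Z) = -H(Y) + H(Z) - H(Z|Y),
# via the joint law J(s, j) = sum_{i in P_s} pi(i) w_i(j)
J = np.stack([(pi[S][:, None] * W[S]).sum(axis=0) for S in parts])
pY = J.sum(axis=0)                            # equals pi, by stationarity
HY = -pY @ np.log(pY)
HZ = -pZ @ np.log(pZ)
HZgY = -np.sum(J * np.log(J / pY))
```

Both |C − (−d_V·H(Y|Z))| and |−H(Y|Z) − (−H(Y) + H(Z) − H(Z|Y))| come out as zero up to floating point error.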
As with any k-means algorithm, DER's results depend somewhat on its random initialization. All k-means-like schemes are usually restarted several times, and the solution with the best cost is chosen. In all cases which we evaluated, we observed empirically that the dependence of DER on the initial parameters is rather weak. After two or three restarts it usually found a partition nearly as good as after 100 restarts. For clustering problems, however, there is another simple way to aggregate the results of multiple runs into a single partition, which slightly improves the quality of the final results. We use this technique in all our experiments, and we provide the details in the Supplementary Material, Section A.

[Figure 1: (a) Karate Club; (b) Political Blogs]

We conclude by mentioning two algorithms that use some of the concepts that we use. The Walktrap, [20], similarly to DER, constructs the random walks (the measures w_i, possibly for L > 1) as part of its computation. However, Walktrap uses the w_i's in a completely different way. Both the optimization procedure and the cost function are different from ours. The Infomap, [5]; [21], has a cost that is related to the notion of information. It aims to minimize the information required to transmit a random walk on G through a channel; the source coding is constructed using the clusters, and the best clusters are those that yield the best compression. This does not seem to be directly connected to the maximum likelihood motivated approach that we use. As with Walktrap, the optimization procedure of Infomap also completely differs from ours.

4 Evaluation

In this section the results of the evaluation of the DER algorithm are presented. In Section 4.1 we illustrate DER on two classical graphs.
Sections 4.2 and 4.3 contain the evaluation on the LFR benchmarks.

4.1 Basic examples

When a new clustering algorithm is introduced, it is useful to get a general feel of it with some simple examples. Figure 1a shows the classical Zachary's Karate Club, [22]. This graph has a ground partition into two subsets. The partition shown in Figure 1a is a partition obtained from a typical run of the DER algorithm, with k = 2, and a wide range of L's (L ∈ [1, 10] were tested). As is the case with many other clustering algorithms, the shown partition differs from the ground partition in one element, node 8 (see [1]).

Figure 1b shows the political blogs graph, [23]. The nodes are political blogs, and the graph has an (undirected) edge if one of the blogs had a link to the other. There are 1222 nodes in the graph. The ground truth partition of this graph has two components, the right wing and left wing blogs. The labeling of the ground truth was partially automatic and partially manual, and both processes could introduce some errors. The run of DER reconstructs the ground truth partition with only 57 nodes misclassified. The NMI (see the next section, Eq. (4)) to the ground truth partition is 0.74.

The political blogs graph is particularly interesting since it is an example of a graph for which fitting an SBM model to reconstruct the clusters produces results very different from the ground truth. It can also be easily checked that spectral clustering, in the form given in [8], is not close to the ground truth when k = 2. It is close to the ground truth when k = 3, however. To overcome the problem with SBM fitting on this graph, a degree sensitive version of the SBM was introduced in [24]. That algorithm produces a partition with NMI 0.75.
4.2 LFR benchmarks

The LFR benchmark model, [14], is a widely used extension of the stochastic block model, where node degrees and community sizes have power law distributions, as often observed in real graphs. An important parameter of this model is the mixing parameter µ ∈ [0, 1] that controls the fraction of the edges of a node that go outside the node's community (or outside all of the node's communities, in the overlapping case). For small µ, there will be a small number of edges going outside the communities, leading to disjoint, easily separable graphs, and the boundaries between communities will become less pronounced as µ grows.

Given a set of communities P on a graph, and the ground truth set of communities Q, there are several ways to measure how close P is to Q. One standard measure is the normalized mutual information (NMI), given by:

    NMI(P, Q) = 2·I(P, Q) / (H(P) + H(Q)),    (4)

where H is the Shannon entropy of a partition and I is the mutual information (see [1] for details). NMI is equal to 1 if and only if the partitions P and Q coincide, and it takes values between 0 and 1 otherwise. When computed with NMI, the sets inside P, Q cannot overlap. To deal with overlapping communities, an extension of NMI was proposed in [25]. We refer to the original paper for the definition, as the definition is somewhat lengthy. This extension, which we denote here as ENMI, was subsequently used in the literature as a measure of closeness of two sets of communities, even in the cases of disjoint communities. Note that most papers use the notation NMI while the metric that they really use is ENMI.
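For disjoint partitions, (4) is simple to compute directly from per-node label sequences. The helper below is our own minimal sketch (natural logarithms are used; the base cancels in the ratio):

```python
import numpy as np
from collections import Counter

def nmi(P, Q):
    """Normalized mutual information (4) between two disjoint partitions,
    given as per-node label sequences: NMI = 2 I(P,Q) / (H(P) + H(Q))."""
    n = len(P)
    def H(labels):
        p = np.array(list(Counter(labels).values())) / n
        return -np.sum(p * np.log(p))
    cP, cQ = Counter(P), Counter(Q)
    # I(P,Q) = sum_{a,b} p(a,b) log( p(a,b) / (p(a) p(b)) )
    I = sum((c / n) * np.log((c / n) * n * n / (cP[a] * cQ[b]))
            for (a, b), c in Counter(zip(P, Q)).items())
    return 2 * I / (H(P) + H(Q))
```

For example, nmi([0, 0, 1, 1], [1, 1, 0, 0]) is 1, since NMI is invariant to relabeling of the clusters, while nmi([0, 0, 1, 1], [0, 1, 0, 1]) is 0.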
Figure 1c shows the results of the evaluation of DER for four cases: the size of a graph was either N = 1000 or N = 5000 nodes, and the size of the communities was restricted to be either between 10 and 50 (denoted S in the figures) or between 20 and 100 (denoted B). For each combination of these parameters, µ varied between 0.1 and 0.8. For each combination of graph size, community size restrictions as above and µ value, we generated 20 graphs from that model and ran DER. To provide some basic intuition about these graphs, we note that the number of communities in the 1000S graphs is strongly concentrated around 40, and in the 1000B, 5000S, and 5000B graphs it is around 25, 200 and 100, respectively. Each point in Figure 1c is the average ENMI on the 20 corresponding graphs, with the standard deviation as the error bar.

[Figure 1: (c) DER, LFR benchmarks; (d) Spectral Alg., LFR benchmarks. ENMI as a function of µ for the n1000S, n1000B, n5000S, n5000B models.]

These experiments correspond precisely to the ones performed in [4] (see Supplementary Material, Section C for more details). In all runs of DER we set L = 5 and set k to be the true number of communities for each graph, as was done in [4] for the methods that required it. Therefore our Figure 1c can be compared directly with Figure 2 in [4]. From this comparison we see that DER and two of the best algorithms identified in [4], Infomap [5] and RN [6], reconstruct the partition perfectly for µ ≤ 0.5; for µ = 0.6, DER's reconstruction scores are between Infomap's and RN's, with values for all of the algorithms above 0.95; and for µ = 0.7, DER has the best performance in two of the four cases. For µ = 0.8 all algorithms have score 0.
We have also performed the same experiments with the standard version of spectral clustering, [8], because this version was not evaluated in [4]. The results are shown in Fig. 1d. Although the performance is generally good, the scores are mostly lower than those of DER, Infomap and RN.

4.3 Overlapping LFR benchmarks

We now describe how DER can be applied to overlapping community detection. Observe that DER internally operates on measures μ_{P_s} rather than subsets of the vertex set. Recall that μ_{P_s}(i) is the probability that a random walk started from P_s will hit node i. We can therefore consider each i to be a member of those communities from which the probability to hit it is "high enough". To define this formally, we first note that for any partition P, the following decomposition holds:

    π = Σ_{s=1}^{k} π(P_s) μ_{P_s}.    (5)

This follows from the invariance of π under the random walk. Now, given the output of DER, the sets P_s and measures μ_{P_s}, set

    m_i(s) = μ_{P_s}(i) π(P_s) / Σ_{t=1}^{k} μ_{P_t}(i) π(P_t) = μ_{P_s}(i) π(P_s) / π(i),    (6)

where we used (5) in the second equality. Then m_i(s) is the probability that the walk started at P_s, given that it finished in i. For each i ∈ V, set s_i = argmax_l m_i(l) to be the most likely community given i. Then define the overlapping communities C_1, ..., C_k via

    C_t = { i ∈ V | m_i(t) ≥ (1/2) · m_i(s_i) }.    (7)

The paper [10] introduces a new algorithm for overlapping communities detection and contains also an evaluation of that algorithm, as well as of several other algorithms, on a set of overlapping LFR benchmarks. The overlapping communities LFR model was defined in [3].

Table 1: Evaluation for Overlapping LFR. All values except DER are from [10].

    Alg.         µ = 0    µ = 0.2    µ = 0.4
    DER          0.94     0.9        0.83
    SVI ([10])   0.89     0.73       0.6
    POI ([26])   0.86     0.68       0.55
    INF ([21])   0.42     0.38       0.4
    COP ([27])   0.65     0.43       0.0
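The rule (6)-(7) is cheap post-processing of DER's output. The sketch below is our own illustration (it recomputes the measures w_i from the adjacency matrix; note that by (5) the memberships m_i(·) of each node sum to 1):

```python
import numpy as np

def overlapping_from_der(A, labels, k, L=1):
    """Compute memberships m_i(s), eq. (6), and overlapping communities C_t,
    eq. (7), from a DER partition given as a per-node label vector."""
    d = A.sum(axis=1).astype(float); dV = d.sum(); pi = d / dV
    T = A / d[:, None]
    W, Tt = np.zeros_like(T), np.eye(len(d))    # w_i = (1/L)(w_i^1 + ... + w_i^L)
    for _ in range(L):
        Tt = Tt @ T
        W += Tt
    W /= L
    m = np.zeros((len(d), k))
    for s in range(k):
        mask = labels == s
        mu_s = (d[mask] @ W[mask]) / d[mask].sum()   # mu_{P_s}, eq. (1)
        m[:, s] = mu_s * pi[mask].sum() / pi         # eq. (6): mu_{P_s}(i) pi(P_s) / pi(i)
    best = m.max(axis=1)                             # m_i(s_i)
    return m, [np.flatnonzero(m[:, t] >= 0.5 * best) for t in range(k)]
```

Every node belongs at least to its argmax community C_{s_i}; nodes whose membership in a second community reaches half of the maximum are reported as overlaps.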
In Table 1 we present the ENMI results of DER runs on the N = 10000 graphs with the same parameters as in [10], and also show the values obtained on these benchmarks in [10] (Figure S4 in [10]) for four other algorithms. The DER algorithm was run with L = 2, and k was set to the true number of communities. Each number is an average over ENMIs on 10 instances of graphs with a given set of parameters (as in [10]). The standard deviation around this average for DER was less than 0.02 in all cases. Variances for the other algorithms are provided in [10]. For µ ≥ 0.6 all algorithms yield ENMI of less than 0.3.

As we see in Table 1, DER performs better than all other algorithms in all the cases. We believe this indicates that DER together with equation (7) is a good choice for overlapping community detection in situations where the community overlap between each two communities is sparse, as is the case in the LFR models considered above. Further discussion is provided in the Supplementary Material, Section D.

We conclude this section by noting that while in the non-overlapping case the models generated with µ = 0 result in trivial community detection problems, because in these cases the communities are simply the connected components of the graph, this is no longer true in the overlapping case. As a point of reference, the well known Clique Percolation method was also evaluated in [10] in the µ = 0 case. The average ENMI for this algorithm was 0.2 (Table S3 in [10]).

5 Analytic bounds

In this section we restrict our attention to the case L = 1 of the DER algorithm. Recall that the p,q-SBM model was defined in Section 2. We shall consider the model with k = 2 and such that |P_1| = |P_2|.
We assume that the initial partition for the DER, denoted C_1, C_2 in what follows, is chosen as in step 3 of DER (Algorithm 1), i.e., a random partition of V into two equal sized subsets. In this setting we have the following:

Theorem 5.1. For every ε > 0 there exist C > 0 and c > 0 such that if

    p ≥ C · N^{-1/2+ε}    (8)

and

    p − q ≥ c · √(p·N^{-1/2+ε}·log N)    (9)

then DER recovers the partition P_1, P_2 after one iteration, with probability φ(N) such that φ(N) → 1 when N → ∞.

Note that the probability in the conclusion of the theorem refers to the joint probability of a draw from the SBM and of an independent draw from the random initialization.

The proof of the theorem has essentially three steps. First, we observe that the random initialization C_1, C_2 is necessarily somewhat biased, in the sense that C_1 and C_2 never divide P_1 exactly into two halves. Specifically, ||C_1 ∩ P_1| − |C_2 ∩ P_1|| ≥ N^{1/2−ε} with high probability. Assume that C_1 has the bigger half, |C_1 ∩ P_1| > |C_2 ∩ P_1|. In the second step, by an appropriate linearization argument, we show that for a node i ∈ P_1, deciding whether D(w_i, μ_{C_1}) > D(w_i, μ_{C_2}) or vice versa amounts to counting paths of length two between i and C_1 ∩ P_1. In the third step we estimate the number of these length two paths in the model. The fact that |C_1 ∩ P_1| > |C_2 ∩ P_1| + N^{1/2−ε} will imply more paths to C_1 ∩ P_1 from i ∈ P_1, and we will conclude that D(w_i, μ_{C_1}) > D(w_i, μ_{C_2}) for all i ∈ P_1 and D(w_i, μ_{C_2}) > D(w_i, μ_{C_1}) for all i ∈ P_2. The full proof is provided in the supplementary material.

We note that the use of paths of length two is essential for the argument to work. A similar argument with paths of length one (edges) will not work (unless p is of the order of a constant).
However, we also note that paths of length two are never explicitly computed, as this would require squaring the adjacency matrix. Instead, this is achieved by considering paths of length one from the target set C_1 (via μ_{C_1}) and paths of length one from the nodes (via w_i).

References

[1] Santo Fortunato. Community detection in graphs. Physics Reports, 486(3-5):75-174, 2010.

[2] Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395-416, 2007.

[3] Andrea Lancichinetti and Santo Fortunato. Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities. Phys. Rev. E, 80(1):016118, 2009.

[4] Santo Fortunato and Andrea Lancichinetti. Community detection algorithms: A comparative analysis. In Fourth International ICST Conference, 2009.

[5] M. Rosvall and C. T. Bergstrom. Maps of random walks on complex networks reveal community structure. Proc. Natl. Acad. Sci. USA, page 1118, 2008.

[6] Peter Ronhovde and Zohar Nussinov. Multiresolution community detection for megascale networks by information-based replica correlations. Phys. Rev. E, 80, 2009.

[7] Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10), 2008.

[8] Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems 14, 2001.

[9] Gergely Palla, Imre Derényi, Illés Farkas, and Tamás Vicsek. Uncovering the overlapping community structure of complex networks in nature and society. Nature, 435, 2005.

[10] Prem K Gopalan and David M Blei. Efficient discovery of overlapping communities in massive networks.
Proceedings of the National Academy of Sciences, 110(36):14534–14539, 2013.

[11] M. Girvan and M. E. J. Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99(12):7821–7826, 2002.

[12] MEJ Newman and EA Leicht. Mixture models and exploratory analysis in networks. Proceedings of the National Academy of Sciences, 104(23):9564, 2007.

[13] Paul W. Holland, Kathryn B. Laskey, and Samuel Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109–137, 1983.

[14] Andrea Lancichinetti, Santo Fortunato, and Filippo Radicchi. Benchmark graphs for testing community detection algorithms. Phys. Rev. E, 78(4), 2008.

[15] Animashree Anandkumar, Rong Ge, Daniel Hsu, and Sham Kakade. A tensor spectral approach to learning mixed membership community models. In COLT, volume 30 of JMLR Proceedings, 2013.

[16] Yudong Chen, S. Sanghavi, and Huan Xu. Improved graph clustering. Information Theory, IEEE Transactions on, 60(10):6440–6455, Oct 2014.

[17] Ravi B. Boppana. Eigenvalues and graph bisection: An average-case analysis. In Foundations of Computer Science, 1987., 28th Annual Symposium on, pages 280–285, Oct 1987.

[18] Anne Condon and Richard M. Karp. Algorithms for graph partitioning on the planted partition model. Random Struct. Algorithms, 18(2):116–140, 2001.

[19] Ron Shamir and Dekel Tsur. Improved algorithms for the random cluster graph model. Random Struct. Algorithms, 31(4):418–449, 2007.

[20] Pascal Pons and Matthieu Latapy. Computing communities in large networks using random walks. J. of Graph Alg. and App., 10:284–293, 2004.

[21] Alcides Viamontes Esquivel and Martin Rosvall. Compression of flow can reveal overlapping-module organization in networks. Phys. Rev. X, 1:021025, Dec 2011.

[22] W. W. Zachary.
An information flow model for conflict and fission in small groups. Journal of Anthropological Research, 33:452–473, 1977.

[23] Lada A. Adamic and Natalie Glance. The political blogosphere and the 2004 U.S. election: Divided they blog. LinkKDD '05, 2005.

[24] Brian Karrer and M. E. J. Newman. Stochastic blockmodels and community structure in networks. Phys. Rev. E, 83, 2011.

[25] Andrea Lancichinetti, Santo Fortunato, and János Kertész. Detecting the overlapping and hierarchical community structure in complex networks. New Journal of Physics, 11(3):033015, 2009.

[26] Brian Ball, Brian Karrer, and M. E. J. Newman. Efficient and principled method for detecting communities in networks. Phys. Rev. E, 84:036103, Sep 2011.

[27] Steve Gregory. Finding overlapping communities in networks by label propagation. New Journal of Physics, 12(10):103018, 2010.

[28] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, 2006.

[29] S. Janson, T. Luczak, and A. Rucinski. Random Graphs. Wiley Series in Discrete Mathematics and Optimization. Wiley, 2011.

[30] W. Feller. An introduction to probability theory and its applications. Wiley Series in Probability and Mathematical Statistics. Wiley, 1971.

[31] W. L. Nicholson. On the normal approximation to the hypergeometric distribution. Ann. Math. Statist., 27(2):471–483, 1956.

[32] https://sites.google.com/site/santofortunato/inthepress2.

[33] Jierui Xie, Stephen Kelley, and Boleslaw K. Szymanski. Overlapping community detection in networks: The state-of-the-art and comparative study. ACM Comput. Surv., 45(4):43:1–43:35, August 2013.
A Restarts and repeats

As with any $k$-means-type algorithm, DER's results depend somewhat on its random initializations, and can be improved by multiple runs on the same instance with different initializations. We refer to these as restarts of the algorithm.

We have observed empirically the following behaviour of DER: suppose a graph $G$ has a ground truth partition $P_1, \ldots, P_k$. Then the output of a typical restart of DER will be a partition $C_1, \ldots, C_k$ with the property that for each $C_i$, $i \leq k$, either there is $j \leq k$ such that $C_i = P_j$, or there are $j_1, j_2$ such that $C_i = P_{j_1} \cup P_{j_2}$, or there are $j$ and $l$ such that $C_i \cup C_l = P_j$. In other words, DER tends either to find a cluster precisely, or to glue together two original clusters, or to split an original cluster into two parts. Usually most of the clusters will be found precisely, and there will be a small number of (usually small) clusters that are glued or split. Which clusters are glued or split depends on the random initialization.

A simple way to deal with this is to use the following "repeats" strategy: choose a number of repeats, $R$ (say, $R = 5$), and run DER $R$ times. Construct the node co-occurrence matrix

$\hat{R}_{ij} = $ number of runs in which $i$ and $j$ appear in the same cluster,   (10)

for all $i, j \in V$. The matrix $\hat{R}$ can now be regarded as the adjacency matrix of a weighted graph and can be clustered itself. However, $\hat{R}$ will often have very clear clusters, which can be found using the following trivial threshold algorithm: Define $T = \lceil R/2 \rceil$. Initialize a set $U = V$. Choose an arbitrary $i \in U$ and define a cluster $C$ by $C = \{ j \in U \mid \hat{R}_{ij} \geq T \}$. Output the cluster $C$, set $U = U \setminus C$, choose a new $i \in U$, and repeat until $U$ is empty.
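The threshold step admits a very short implementation. The following sketch (function and variable names are ours) illustrates it on a toy co-occurrence matrix:

```python
from math import ceil

def threshold_cluster(R_hat, num_runs):
    """Greedily peel clusters off the co-occurrence matrix R_hat, where
    R_hat[i][j] counts the runs in which i and j fell in the same cluster.
    Nodes joining i are those co-occurring with it in a majority of runs."""
    T = ceil(num_runs / 2)
    unassigned = set(range(len(R_hat)))
    clusters = []
    while unassigned:
        i = min(unassigned)          # any choice works; min keeps it deterministic
        C = {j for j in unassigned if R_hat[i][j] >= T}
        clusters.append(C)
        unassigned -= C
    return clusters

# 4 nodes, R = 5 runs: nodes {0, 1} and {2, 3} almost always co-occur.
R_hat = [[5, 5, 0, 1],
         [5, 5, 1, 0],
         [0, 1, 5, 5],
         [1, 0, 5, 5]]
print(threshold_cluster(R_hat, 5))  # -> [{0, 1}, {2, 3}]
```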
While on the benchmarks a single run of DER with a single restart usually has quite high precision, repeats are a more effective way to deal with glueing and splitting than restarts. It is of course also possible to use more sophisticated but slower algorithms instead of the threshold one to cluster the co-occurrence matrix $\hat{R}$.

B Proofs

B.1 Lemma 3.1

Proof of Lemma 3.1: The claim is obvious for step (2) of the algorithm. For step (1) the claim is implied by the following standard fact: let $\nu_1, \nu_2, \ldots, \nu_z$ be any finite collection of measures, and set $\tilde{\nu} = \frac{1}{z} \sum_i \nu_i$. Then for any measure $\kappa$,

$\sum_{i=1}^{z} D(\nu_i, \kappa) \leq \sum_{i=1}^{z} D(\nu_i, \tilde{\nu}).$   (11)

Indeed, by rearranging terms in (11), we get

$\sum_{j \in V} \left( \sum_{i=1}^{z} \nu_i(j) \right) \left( \log \tilde{\nu}(j) - \log \kappa(j) \right) = z \cdot \sum_{j \in V} \tilde{\nu}(j) \log \frac{\tilde{\nu}(j)}{\kappa(j)} \geq 0,$

which is the non-negativity of the Kullback-Leibler divergence [28], with equality iff $\kappa = \tilde{\nu}$.

B.2 Main result

We now prove Theorem 5.1, which we restate here for convenience.

Theorem B.1. For every $\epsilon > 0$ there exist $C > 0$ and $c > 0$ such that if

$p \geq C \cdot N^{-\frac{1}{2}+\epsilon}$   (12)

and

$p - q \geq c \sqrt{p N^{-\frac{1}{2}+\epsilon} \log N}$   (13)

then DER recovers the partition $P_1, P_2$ after one iteration, with probability $\phi(N)$ such that $\phi(N) \to 1$ as $N \to \infty$.

Recall that the general plan of the proof was discussed in Section 5; we proceed to implement that plan, starting with some preliminaries. First, we state a version of Chernoff's bound for binomial variables.

Theorem B.2 (Theorem 2.1 in [29]). Let $X \sim Bin(n, p)$ be a binomial variable and set $\lambda = np$. Then for all $t \geq 0$,

$P(X \geq E X + t) \leq \exp\left( -\frac{t^2}{2(\lambda + t/3)} \right)$   (14)

$P(X \leq E X - t) \leq \exp\left( -\frac{t^2}{2\lambda} \right)$   (15)

In general, given a binomial $X \sim Bin(n, p)$, we will often refer to $\lambda = np$ as $X$'s lambda. The following corollary will be useful.

Corollary B.3 (Corollary 2.3 in [29]). Let $X \sim Bin(n, p)$ be a binomial variable.
Then for all $\epsilon \leq \frac{3}{2}$,

$P(|X - E X| \geq \epsilon \cdot E X) \leq 2 \exp\left( -\frac{\epsilon^2}{3} E X \right)$   (16)

We will also often use the following corollary of Theorem B.2.

Corollary B.4. There is a constant $c > 0$ such that the following holds: let $X \sim Bin(n, p)$ be a binomial variable such that $\lambda = np > 1$. Then for any $N > 0$,

$P\left( |X - E X| \geq 20 \cdot \sqrt{\lambda} \cdot \log N \right) \leq c/N^2.$   (17)

We now present a series of lemmas about random graphs in the $p,q$-SBM model and about random initializations. Throughout, $G = (V, E)$ will be assumed to be a random graph from the $p,q$-SBM, and we denote this $G \sim G_{p,q}$. Recall that $N = |V|$ is the size of the node set, and for a node $i \in V$ in a fixed graph $G$, $n_i$ is the set of neighbours of $i$ and $d_i = |n_i|$ is the degree of $i$. Also, for a set $S \subset V$, its full degree is $d_S = \sum_{i \in S} d_i$. Next, for a set $S \subset V$, we denote by $d(i, S) = |n_i \cap S|$ the number of edges between $i$ and $S$, and for two sets $S, T \subset V$ we define $d(S, T) = \sum_{i \in S} d(i, T)$ to be the number of edges between $S$ and $T$. Finally, set $d_2(i, T) = d(n_i, T)$ to be the number of paths of length two that start at $i$ and end in $T$.

In addition, let $C_1, C_2$, with $|C_1| = |C_2| = N/2$, be a random partition of $V$ into two sets, the initialization of DER. Denote $N_1 = |C_1 \cap P_1|$ and $N_2 = N/2 - N_1 = |C_1 \cap P_2| = |C_2 \cap P_1|$. We assume without loss of generality that $N_1 \geq N_2$, and set $\Delta N = N_1 - N_2$. The partition $C_1, C_2$ will be considered fixed in all the lemmas that concern the random graphs.

We proceed to give bounds on the expectations and concentration intervals of several quantities related to our problem. For a fixed node $i \in V$, the degree $d_i$ is distributed as a sum of two independent binomials,

$d_i \sim Bin(N/2 - 1, p) + Bin(N/2, q),$   (18)

where the first term counts the edges to the component to which $i$ belongs, and the second to the other component.
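As a quick numerical sanity check of this decomposition (with illustrative parameters, not those of the theorem):

```python
import random

random.seed(0)
N, p, q = 1000, 0.10, 0.02

def binom(n, pr):
    """Draw one Bin(n, pr) sample."""
    return sum(random.random() < pr for _ in range(n))

# Sample degrees d_i ~ Bin(N/2 - 1, p) + Bin(N/2, q) and compare the
# empirical mean with the expected degree (N/2 - 1) p + (N/2) q, cf. eq. (19).
trials = 2000
degrees = [binom(N // 2 - 1, p) + binom(N // 2, q) for _ in range(trials)]
print(sum(degrees) / trials, (N / 2 - 1) * p + (N / 2) * q)
```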
In particular, the expected degree is

$E d_i = (N/2 - 1) p + (N/2) q.$   (19)

Lemma B.5 (Degree bounds). Let $G \sim G_{p,q}$. There exists a constant $\hat{c}_1$ such that the following holds: assume that

$N p \geq 100 \log N.$   (20)

Then with probability at least $1 - \hat{c}_1/N$, for all $i \in V$,

$\frac{1}{4} \cdot \frac{N}{2} p \leq d_i \leq 2 \cdot N p.$   (21)

Proof. Fix a node $i \in V$, and let $X \sim Bin(N/2 - 1, p)$ and $Y \sim Bin(N/2, q)$ be two independent binomials such that $d_i \sim X + Y$. By applying (16) to $X$ with $\epsilon = \frac{1}{2}$, we obtain that

$\frac{1}{4} \frac{N}{2} p \leq E X - \frac{1}{2} E X \leq X \leq d_i$   (22)

with probability at least $1 - 2 \exp\left( -\frac{1}{12} \left( \frac{N}{2} p - 1 \right) \right)$. Using assumption (20), it follows that there is $c > 0$ such that $2 \exp\left( -\frac{1}{12} \left( \frac{N}{2} p - 1 \right) \right) \leq c/N^2$. Using the union bound we therefore conclude that

$\frac{1}{4} \frac{N}{2} p \leq d_i$   (23)

holds for all nodes $i \in V$ with probability at least $1 - c/N$.

Similarly, we use (16) to obtain that $X \leq N p$ with probability at least $1 - c/N^2$, perhaps with a different $c$, and that $Y \leq N p$ with probability at least $1 - c'/N^2$, because $q < p$. It follows that $d_i = X + Y \leq 2 N p$ with probability at least $1 - (c + c')/N^2$, and by the union bound again, we obtain $d_i \leq 2 N p$ for all $i \in V$ with probability at least $1 - c''/N$.

In what follows we will often encounter situations where we need to bound fluctuations of sums of a fixed number of not necessarily independent random variables, and considerations similar to those in Lemma B.5 will often be omitted.

We now consider the degree of $C_1$, $d_{C_1}$. Note that by symmetry $E d_{C_1} = E d_{C_2}$, and that the total degree of the graph satisfies $d_G = d_{C_1} + d_{C_2}$. Therefore

$E d_{C_1} = \frac{1}{2} E d_G = \frac{N}{2} E d_i = \frac{N}{2} \left( (N/2 - 1) p + (N/2) q \right).$   (24)

The next lemma concerns the concentration of the degree of $C_1$.

Lemma B.6. Set $\lambda = N^2 p$.
There exist constants $\hat{c}_3, \hat{c}_4$ such that with probability at least $1 - \hat{c}_3/N$,

$|d_{C_1} - E d_{C_1}| \leq \hat{c}_4 \log N \cdot \sqrt{\lambda}.$   (25)

Proof. For $l, s \in \{1, 2\}$, set $S_{ls} = C_l \cap P_s$. Observe that $d_{C_1}$ can be written as

$d_{C_1} = 2 d(S_{11}, S_{11}) + 2 d(S_{12}, S_{12}) + 2 d(S_{11}, S_{12}) + d(S_{11}, S_{21}) + d(S_{11}, S_{22}) + d(S_{12}, S_{21}) + d(S_{12}, S_{22}).$

Note that each of the terms in the sum above is a binomial variable whose lambda is smaller than or equal to $c N^2 p$ for some constant $c > 0$. Therefore, applying Corollary B.4 to each term and using the union bound, we obtain the result.

The next lemma provides an upper bound on $\Delta N$.

Lemma B.7. There are constants $c_1, c_2 > 0$ such that

$\Delta N \leq c_1 \sqrt{N} \log N$   (26)

with probability at least $1 - c_2/N$.

Proof. For the purposes of this lemma we do not assume that $N_1 > N_2$. Recall that $N_1$ is the size of the intersection of $P_1$ with a random subset of $V$ of size $N/2$, denoted $C_1$. Hence $N_1$ has the hypergeometric distribution. Set

$\lambda = E N_1 = \frac{|P_1| \, |C_1|}{|V|} = \frac{1}{4} N.$   (27)

The hypergeometric distribution satisfies concentration inequalities similar to those satisfied by the binomials. Specifically, by Theorem 2.10 in [29], the conclusion of Corollary B.4, inequality (17), holds for hypergeometric variables, with $\lambda$ defined as in (27). The result follows by an application of that inequality.

We now examine the quantity $d(j, C_2)$ for a node $j \in V$. The expectations satisfy

$E d(j, C_2) = N_2 p + N_1 q \quad$ if $j \in P_1$   (28)

$E d(j, C_2) = N_1 p + N_2 q \quad$ if $j \in P_2$.   (29)

This follows from the decomposition of $d(j, C_2)$ as a sum of two binomials. Similar expressions hold for $d(j, C_1)$. Note that when, for instance, $j \in P_1$, in fact $E d(j, C_2) = N_2 p + N_1 q$ if $j \in C_1 \cap P_1$, and $E d(j, C_2) = (N_2 - 1) p + N_1 q$ if $j \in C_2 \cap P_1$.
Since we will be interested only in orders of magnitude, we will disregard the difference between the two cases in what follows. Throughout the proof we denote

$L = N_2 p + N_1 q$   (30)

as a convenient shorthand for $E d(j, C_2)$ (when $j \in P_1$).

The quantities in the following lemma will be relevant in what follows:

Lemma B.8. Assume that the partition $C_1, C_2$ is such that

$\Delta N \leq c \sqrt{N} \log N.$   (31)

Then there exist constants $c_1, c_2, c_3, c_4 > 0$ and $\kappa_1 > 0$ such that if $N p > \kappa_1$, then with probability at least $1 - \frac{c_1}{N}$ the following holds for all $j \in V$:

$d(j, C_2) \geq c_2 N p$   (32)

$|d(j, C_1) - d(j, C_2)| \leq c_3 \sqrt{N p} \log N$   (33)

$d(j, C_1)/d(j, C_2) \geq \frac{1}{2}$   (34)

$|d(j, C_2) - L| \leq c_4 \sqrt{N p} \log N.$   (35)

Proof. We show that the statements hold for every $j \in V$ individually with probability at least $1 - c_4/N^2$, from which the claim of the lemma follows by the union bound. Using inequality (17), we obtain that with probability at least $1 - c_5/N^2$,

$|d(j, C_2) - E d(j, C_2)| \leq c_6 \sqrt{N p} \log N,$   (36)

and similarly

$|d(j, C_1) - E d(j, C_1)| \leq c_6 \sqrt{N p} \log N,$   (37)

where, as in the proof of Lemma B.5, we have used the decomposition of $d(j, C_l)$ into two binomials and the fact that $q < p$. Assume that $N p$ is large enough so that

$c_6 \sqrt{N p} \log N \leq \frac{1}{10} N p$   (38)

holds. Using assumption (31) together with (28) or (29), we obtain that $E d(j, C_2) \geq \frac{1}{4} N p$ for all $N \geq \kappa_2$, for some constant $\kappa_2 > 0$. Combining this with (36) and (38), we obtain

$d(j, C_2) \geq E d(j, C_2) - c_6 \sqrt{N p} \log N \geq \left( \frac{1}{4} - \frac{1}{10} \right) N p,$   (39)

thereby proving (32).

Next, using (28), (29) and the similar expressions for $d(j, C_1)$, we obtain that

$|E d(j, C_1) - E d(j, C_2)| = \Delta N (p - q).$   (40)

Using (40) with (36) and (37), it follows that

$|d(j, C_1) - d(j, C_2)| \leq c \, \Delta N \, p + c' \sqrt{N p} \log N \leq c_8 \sqrt{N p} \log N,$   (41)

for appropriate constants $c, c' > 0$.
This proves (33). Similarly, the claim (35) holds for all $j \in P_1$, and for $j \in P_2$ we have

$|d(j, C_2) - L| \leq |L - E d(j, C_2)| + c'' \sqrt{N p} \log N \leq c \, \Delta N \, p + c'' \sqrt{N p} \log N \leq c_9 \sqrt{N p} \log N.$

Thus (35) holds for all $j \in V$. Finally, to show (34), write

$\frac{d(j, C_1)}{d(j, C_2)} = 1 - \frac{d(j, C_2) - d(j, C_1)}{d(j, C_2)}.$   (42)

Then (34) holds if $\left| \frac{d(j, C_1) - d(j, C_2)}{d(j, C_2)} \right| \leq \frac{1}{2}$ holds, which in turn holds by (32) and (33), for $N$ and $N p$ larger than some fixed constant.

We now provide some estimates on the number of length-two paths (which we also refer to as 2-paths in what follows).

Lemma B.9. For a node $j \in P_1$,

$E d_2(j, C_1) = \frac{1}{2} N \left( N_1 p^2 + 2 p q N_2 + N_1 q^2 \right)$   (43)

$E d_2(j, C_2) = \frac{1}{2} N \left( N_2 p^2 + 2 p q N_1 + N_2 q^2 \right)$   (44)

Proof. For $l, s \in \{1, 2\}$, set $S_{ls} = C_l \cap P_s$. There are four types of 2-paths from $j$ to $C_1$: those that land in $P_1$ at the first step and then land in $S_{11}$ (we denote paths of this type by $P_1 S_{11}$), and similarly $P_1 S_{12}$, $P_2 S_{11}$ and $P_2 S_{12}$. There are $\frac{1}{2} N \cdot N_1$ possible paths of type $P_1 S_{11}$, and each one exists in the $G_{p,q}$ model with probability $p^2$. For a concrete path $\rho = (j, u, v)$ of type $P_1 S_{11}$, with $u \in P_1$ and $v \in S_{11}$, let $E_\rho$ be the event that this path exists in the graph. The number of such paths is then $\sum_{\rho \in P_1 S_{11}} 1_{E_\rho}$, and the expected number of such paths is therefore $\frac{1}{2} N N_1 p^2$. The expected numbers of paths of the types $P_1 S_{12}$, $P_2 S_{11}$ and $P_2 S_{12}$ are $\frac{1}{2} N N_2 p q$, $\frac{1}{2} N N_1 q^2$ and $\frac{1}{2} N N_2 p q$ respectively. Hence (43) holds. Similar considerations yield (44).

Next we obtain concentration bounds on $d_2$.

Lemma B.10. There are constants $c_1, c_2 > 0$ such that with probability at least $1 - c_1/N$ the following holds: for all $i \in P_1$,

$|d_2(i, C_1) - E d_2(i, C_1)| \leq c_1 N p \log N$   (45)

$|d_2(i, C_2) - E d_2(i, C_2)| \leq c_1 N p \log N$   (46)

Proof. Let $n_i$ be the neighbourhood of $i$ in $G$.
Set, as before, $S_{ls} = C_l \cap P_s$ for $l, s \in \{1, 2\}$, and set also $A_{ls} = S_{ls} \cap n_i$. Similarly to the arguments in the previous lemmas, to obtain concentration bounds on $d_2(i, C_1)$ we represent it as a sum of binomials,

$d_2(i, C_1) = \sum_{l,s \in \{1,2\}} \sum_{r \in \{1,2\}} d(A_{ls}, S_{1r}).$

One then observes that the lambda of each such binomial is of the order $N p \cdot N \cdot p$, because the size of $A_{ls}$ is of the order $N p$ and the size of $S_{1r}$ is of the order $N$. The conclusion then follows by inequality (17). Since the sets $A_{ls}$ are random, to carry out the above argument precisely we first condition on the neighbourhood $n_i$ and ensure (using (16)) that the sets $A_{ls}$ are indeed no larger than $c N p$ for an appropriate $c > 0$. The full details are straightforward but somewhat lengthy and are omitted.

We will also make use of the following inequalities:

$\log(1 + t) \leq t$ for all $t \geq -1$   (47)

$t - t^2 \leq \log(1 + t)$ for all $t \geq -\frac{1}{2}$   (48)

$\left| \log \frac{t}{s} \right| \leq \frac{|t - s|}{\min\{t, s\}}$ for all $t, s > 0$   (49)

$\left| \frac{s}{t + \theta} - \frac{s}{t} \right| = \left| \frac{\theta}{t + \theta} \right| \cdot \left| \frac{s}{t} \right|$ for all $t, s, \theta$   (50)

Proof of Theorem B.1: For $x \in V$, denote by $n_x$ the set of neighbours of $x$ in $G$. As indicated earlier, we shall use the fact that $C_1$ is slightly biased towards either $P_1$ or $P_2$. Specifically, set $\delta = \frac{1}{2} \epsilon$ and assume throughout the proof, without loss of generality, that $N_1 > N_2$. Then the following holds with high probability:

$\Delta N = N_1 - N_2 \geq N^{\frac{1}{2} - \delta}.$   (51)

Indeed, note that $N_1$, as a function of the random partition, is hypergeometrically distributed with mean $N/4$ and standard deviation of order $N^{\frac{1}{2}}$. Hence, by the central limit theorem for the hypergeometric distribution (see [30, 31]),

$P\left( \left| N_1 - \frac{1}{4} N \right| \geq N^{\frac{1}{2} - \delta} \right) \to 1$   (52)

as $N \to \infty$. Statement (52) guarantees a deviation from the mean, and in particular implies that (51) holds with high probability.

To prove the theorem we now establish the following claim:

Claim B.11.
Fix a partition $C_1, C_2$ of $V$ satisfying (51) and (31). Under assumptions (12) and (13), with probability at least $1 - \frac{1}{N}$ the graph $G$ satisfies: for all $i \in P_1$,

$D(w_i, \mu_{C_1}) > D(w_i, \mu_{C_2}).$   (53)

Note that the assumptions of the claim depend only on the randomness of the partition and are satisfied with high probability. Indeed, (51) holds as discussed above, and (31) follows from Lemma B.7. Once we prove the claim, by symmetry we will also have the reverse inequality in (53) for all $i \in P_2$, and together with (51) this will prove the theorem.

We proceed to prove the claim. Observe that by definition we have $\mu_{C_l}(i) = \frac{d(i, C_l)}{d_{C_l}}$ for every $i \in V$. Therefore we can rewrite (53) as:

$\sum_{j \in n_i} \log \frac{d(j, C_1)}{d(j, C_2)}$   (54)

$+ \; d_i \log \frac{d_{C_2}}{d_{C_1}}$   (55)

$> 0.$   (56)

We now bound the term (55). Using (49) we obtain

$\log \frac{d_{C_2}}{d_{C_1}} \leq \frac{|d_{C_2} - d_{C_1}|}{\min\{d_{C_2}, d_{C_1}\}}.$   (57)

Using (24) and (25) we obtain that

$\min\{d_{C_2}, d_{C_1}\} \geq c N^2 p,$   (58)

and that

$|d_{C_2} - d_{C_1}| \leq c N \log N \sqrt{p}.$   (59)

In addition, recall that by Lemma B.5, $d_i \leq c N p$. Therefore we obtain that

$d_i \log \frac{d_{C_2}}{d_{C_1}} \leq c N p \cdot \frac{N \log N \sqrt{p}}{c'' N^2 p} \leq c''' \log N \sqrt{p} \leq c''' \log N$   (60)

for some constant $c''' > 0$.

We now examine the term (54). Using (48), write

$\log \frac{d(j, C_1)}{d(j, C_2)} \geq \frac{d(j, C_1) - d(j, C_2)}{d(j, C_2)} - \left( \frac{d(j, C_1) - d(j, C_2)}{d(j, C_2)} \right)^2.$   (61)

Note that by (34), $\frac{d(j, C_1)}{d(j, C_2)} \geq \frac{1}{2}$, and therefore (48) applies. We now replace the denominator in the first term on the right hand side of (61) by a quantity independent of $j$, namely by $L$ as defined in (30). Using (50) with $s = d(j, C_1) - d(j, C_2)$, $t = L$ and $\theta = d(j, C_2) - L$, write

$\frac{d(j, C_1) - d(j, C_2)}{d(j, C_2)} \geq \frac{d(j, C_1) - d(j, C_2)}{L} - \frac{|d(j, C_2) - L|}{d(j, C_2)} \cdot \frac{|d(j, C_1) - d(j, C_2)|}{L}.$
(62)

To summarize, we have obtained that

$\sum_{j \in n_i} \log \frac{d(j, C_1)}{d(j, C_2)} \geq$   (63)

$\sum_{j \in n_i} \frac{d(j, C_1) - d(j, C_2)}{L}$   (64)

$- \sum_{j \in n_i} \frac{|d(j, C_2) - L|}{d(j, C_2)} \cdot \frac{|d(j, C_1) - d(j, C_2)|}{L}$   (65)

$- \sum_{j \in n_i} \left( \frac{d(j, C_1) - d(j, C_2)}{d(j, C_2)} \right)^2.$   (66)

Note that the term (64) satisfies

$\sum_{j \in n_i} \frac{d(j, C_1) - d(j, C_2)}{L} = \frac{d_2(i, C_1) - d_2(i, C_2)}{L}.$   (67)

This term counts the number of 2-paths and is the heart of the proof. Before analysing it, we bound the other two terms in the inequality (63). Plugging in the estimates from Lemma B.8, we obtain for (65) that

$\sum_{j \in n_i} \frac{|d(j, C_2) - L|}{d(j, C_2)} \cdot \frac{|d(j, C_1) - d(j, C_2)|}{L} \leq c \cdot d_i \cdot \frac{\sqrt{N p} \log N}{N p} \cdot \frac{\sqrt{N p} \log N}{N p}.$   (68)

Using the degree estimate from Lemma B.5, $d_i \leq c N p$, we thus get

$\sum_{j \in n_i} \frac{|d(j, C_2) - L|}{d(j, C_2)} \cdot \frac{|d(j, C_1) - d(j, C_2)|}{L} \leq c (\log N)^2$   (69)

for an appropriate $c > 0$. Similarly, for the term (66) we have

$\sum_{j \in n_i} \left( \frac{d(j, C_1) - d(j, C_2)}{d(j, C_2)} \right)^2 \leq c \cdot d_i \cdot \frac{N p \log^2 N}{N^2 p^2} \leq c \cdot \log^2 N,$   (70)

with some (perhaps different) $c > 0$.

We now proceed to obtain a lower bound on (67). The crucial property of the length-two path counts $d_2(i, C_1)$ and $d_2(i, C_2)$ that enables such a bound is that the difference between the expectations of these quantities is of larger order of magnitude than their fluctuations. Indeed, by Lemma B.10, with probability at least $1 - c/N$ we have that

$d_2(i, C_1) - d_2(i, C_2) \geq E d_2(i, C_1) - E d_2(i, C_2) - 2 c N p \log N$   (71)

for all $i \in P_1$. In addition, by Lemma B.9,

$E d_2(i, C_1) - E d_2(i, C_2) = \frac{1}{2} N \Delta N (p - q)^2 \geq N^{3/2 - \delta} (p - q)^2,$   (72)

where we have used (51) in the last inequality. Incorporating the inequalities (60), (69) and (70), we obtain that $D(w_i, \mu_{C_1}) > D(w_i, \mu_{C_2})$ holds if the following inequality holds:

$\frac{N^{3/2 - \delta} (p - q)^2 - 2 c N p \log N}{L} - c \log N > 0.$
(73)

To prove the theorem, it remains to choose $p$ and $q$ such that (73) is satisfied. Such $p, q$ are given by assumptions (12) and (13). Indeed, recall that $L$ satisfies $L \leq c N p$ for an appropriate $c > 0$, and hence under assumptions (12) and (13) we have

$\frac{N^{3/2 - \delta} (p - q)^2 - 2 c N p \log N}{L \log N} \to \infty$   (74)

as $N \to \infty$, hence yielding (73).

C LFR benchmarks

In this section we specify the full parameters used for the experiments in the paper. The LFR model is generated from the following parameters: the graph size $N$, the mixing parameter $\mu$, community size lower and upper bounds $c_{min}$, $c_{max}$, average degree $d$, maximal degree $d_{max}$, and the power law exponents for the degree and community size distributions, which are in all cases set to their default values of $-2$ and $-1$ respectively. In addition, in the overlapping case, the parameter $n$ specifies the number of nodes that participate in multiple communities, and the parameter $m$ specifies the number of communities in which each such node participates. The LFR models were generated using the software available at [32].

For the non-overlapping LFR benchmarks we have used $d = 20$ and $d_{max} = 50$, with the rest of the parameters as specified in Section 4.2. This corresponds precisely to the experiments in [4]. The repeats strategy is described in Section A. For each given graph instance, DER was run with 15 repeats, using 3 restarts in each run. The results of the repeats were clustered using the threshold algorithm described in Section A, except in the $\mu = 0.7$ case, in which we have used spectral clustering to cluster the co-occurrence matrix.

The LFR experiments with the spectral clustering algorithm that are shown in Figure 2.b were performed using the spectral clustering version in the Python sklearn v0.14.1 package, which is an implementation of the algorithm in [8].
The spectral clustering was run with 150 restarts of its final-stage Euclidean $k$-means step. We note that while the repeats strategy could be applied to spectral clustering too, it did not improve the performance in this case (despite the fact that different runs of spectral clustering returned somewhat different results). The results shown in Figure 2.b are without repeats.

For the overlapping community benchmarks we have used the following settings: $N = 10000$, $d = 60$, $d_{max} = 100$, $c_{min} = 200$, $c_{max} = 500$. The value of $n$ was 5000 and $m$ was 4. These are the settings that were used in [10]. As discussed in the next section, in one sense these settings can be considered a heavy overlap, while there is a different sense in which they can be considered sparse. In all cases we have run DER with 15 repeats and 3 restarts per run, and we have used spectral clustering to cluster the co-occurrence matrix.

Recall that our approach to overlapping communities is to first obtain a non-overlapping clustering and then to post-process it to obtain overlapping communities. One can therefore ask what happens if, in the non-overlapping step, DER is replaced by another non-overlapping clustering algorithm. We have tried using spectral clustering instead of DER, with the same post-processing applied. In all cases this resulted in ENMI values close to 0.

D Overlapping LFR benchmarks

We refer to [3] and [14] for the definitions of the LFR models. In this section we make a few brief comments regarding the structure of the overlapping LFR communities. To simplify the discussion, we restrict our attention to the particular settings that were used in the evaluation in Section 4.3.
The settings $n = 5000$ and $m = 4$ (see Section C) imply that there are 5000 nodes such that each belongs to a single community, and 5000 nodes such that each belongs to 4 communities. These settings may be considered a heavy overlap (see [33]). Indeed, it follows theoretically from the way LFR communities are generated, and is also observed in actual graphs, that under these settings each community $C$ contains about 20% of nodes that belong only to $C$, while each of the remaining 80% of the nodes belongs to $C$ and to 3 other communities.

On the other hand, for a node $i \in C$ that belongs to 3 other communities, those 3 other communities are chosen at random among the roughly 75 remaining communities of the graph. This implies that for each pair of communities $C, J$, the intersection between them is small, and that if a node $i \in V$ is chosen at random, the event $i \in C$ is almost independent of the event $i \in J$.

The small intersections and lack of correlations between communities imply that a random walk started from community $C$ has, after two steps, a chance of about $1/16$ of returning to $C$, while the rest of the probability is distributed more or less uniformly between the other communities (and is much less than $1/16$ for each individual community that is not $C$). In other words, the measures $w_i$ and $w_j$ have a much higher chance of being correlated if $i$ and $j$ belong to some common community $C$ than otherwise. This explains why DER works well on these graphs.
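The $1/16$ figure can be recovered from a back-of-the-envelope calculation under the simplifying assumptions above (every overlapping node belongs to $m = 4$ communities, its edges split roughly evenly among them, and distinct communities share almost no members), so that a walk currently inside $C$ stays inside $C$ with probability about $1/m$ at each step:

```python
# Simplified (assumed) model: from a node of C, each step of the walk
# remains inside C with probability ~ 1/m, where m = 4 is the number of
# communities the node belongs to.  Since community intersections are
# small, returning to C after two steps essentially requires staying
# inside C on both steps.
m = 4
p_return = (1 / m) ** 2
print(p_return)  # -> 0.0625, i.e. about 1/16
```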