Component models for large networks

Janne Sinkkonen (janne.sinkkonen@tkk.fi)
Xtract Ltd. and Helsinki University of Technology

Janne Aukia (janne.aukia@xtract.com)
Xtract Ltd., Hitsaajankatu 22, 00810 Helsinki, Finland

Samuel Kaski (samuel.kaski@tkk.fi)
Department of Information and Computer Science, Helsinki University of Technology, P.O. Box 5400, FI-02015 TKK, Finland

Abstract

Being among the easiest ways to find meaningful structure in discrete data, Latent Dirichlet Allocation (LDA) and related component models have been applied widely. They are simple, computationally fast and scalable, interpretable, and admit nonparametric priors. In the currently popular field of network modeling, relatively little work has taken uncertainty of data seriously in the Bayesian sense, and component models have been introduced to the field only recently, by treating each node as a bag of outgoing links. We introduce an alternative, the interaction component model for communities (ICMc), where the whole network is a bag of links stemming from different components. The former finds both disassortative and assortative structure, while the alternative assumes assortativity and finds community-like structures like the earlier methods motivated by physics. With Dirichlet process priors and an efficient implementation the models are highly scalable, as demonstrated with a social network from the Last.fm web site, with 670,000 nodes and 1.89 million links.

Keywords: Latent-Component Mixture Model, Social Network, Probabilistic Community Finding, Nonparametric Bayesian

1. Introduction

Data collections representable as networks, or sets of binary relations between vertices, now appear frequently in many fields, including social networks and interaction networks in biology (Fig. 1). Consequently, inferring properties of the network vertices (footnote 1) from the edges has become a common data mining problem.
Most of the work has been about dividing the vertices into relatively well-connected subsets, or communities (Fortunato and Castellano, 2007). Most papers on communities have been inspired by graph theory and physics, as is a large field of fundamental network-related work not directly relevant here. In particular, optimizing a measure of good division called modularity (Newman, 2006) has been successful but is not without its problems (Fortunato and Barthelemy, 2007; Kumpula et al., 2007).

1. We will use the terms vertex and node interchangeably, and likewise edge and link.

Figure 1: One of the models (ICMc) on a classic small network, the Karate club. Zachary (1977) observed the social interactions (edges) between 34 members of a karate club over two years. During this period, there was disagreement among the club members which eventually led to the splitting of the club (dashed line). Black and white depict the degree of component memberships obtained by the model, without knowing the correct split.

A feature and potential problem of modularity is that it takes the observed edges for granted, while network data are typically not a complete description of reality but come with errors, omissions and uncertainties. Some links may be spurious, for instance due to measurement noise in biological networks, and some potential links may be missing, for instance friendship links of newcomers in social networks. Probabilistic generative models are a tool for modeling and inference under such uncertainty. They treat the links as random events, and give an explicit structure for the observed data and its uncertainty.
Compared to non-stochastic methods, they are therefore likely to perform well as long as their assumptions are valid: they may reveal properties of networks that are difficult to observe with non-statistical techniques from noisy and incomplete data, and they also offer a groundwork for new conceptual developments. For example, it may be argued that network communities should be defined in terms of stochastic models that do not take links at face value but instead give them an underlying stochastic structure that is realistic for the application at hand. On the downside, probabilistic methods are not always scalable, and they may be difficult to understand, apply and trust for people from other fields, especially if the estimation process is complex.

Probabilistic models of network connectivity have been introduced recently. Mixtures of latent components (Newman and Leicht, 2006), analogous to finite mixture models for vectorial data, are attractive because of ease of interpretation, but the extensive number of parameters encumbers straightforward fitting attempts. A very promising development called stochastic block models (Airoldi et al., 2008; see also Daudin et al., 2007; Hofman and Wiggins, 2007) groups the nodes into blocks and explains the links in terms of homogeneous connections between pairs of groups. Finally, links can be explained by the proximity of nodes in a latent space created by a logistic link (Handcock et al., 2007). These models have been successfully applied to various networks from sociology and biology, up to the size of thousands or tens of thousands of nodes. With heuristic improvements, stochastic block models are expected to scale up to over one million nodes (E. Airoldi, personal communication), but in general scalability remains the computational bottleneck.
The models discussed in this paper are generative probabilistic models that decompose the links into components, but their structure makes them scalable to networks with at least $10^3 \ldots 10^6$ nodes, and up to thousands of latent components, as long as the networks are sparse enough.

The Simple Social Network LDA (SSN-LDA) model presented by Zhang et al. (2007) is identical to the Latent Dirichlet Allocation (LDA; Buntine, 2002; Blei et al., 2003) model, originally applied to text collections. It is also a conceptual, although not a genealogical, successor of the mixture model by Newman and Leicht (2006). The SSN-LDA model assumes that each node is a bag of outgoing links, and models each outgoing set of links as a mixture over latent components. The components are the same for each node, but their proportions differ.

As an alternative we introduce a component model for relational data, where each link is directly assumed to come from a latent component, and the whole network is a bag of links (Sinkkonen et al., 2007). This model is particularly well suited for modeling community-type structure in networks. For conciseness, we call it ICMc (interaction component model for communities), the final 'c' serving as a reminder that it is easy to generate new models from the family of ICMc and SSN-LDA, with slightly different generative assumptions and requirements for data.

Both ICMc and SSN-LDA represent a set of links as a probabilistic mixture over latent components. Depending on the prior, the models can either find a given number of latent components, or nonparametrically adjust the number of components to the data, guided by a diversity parameter. Moreover, depending on parameters, they are capable of finding either subnetworks or more graded, latent-space-like structures.
Both models can be easily and efficiently fitted to data by collapsed Gibbs sampling (Neal, 2000), an MCMC technique for sampling from the posterior where parameters have been integrated out and only latent variables are sampled. In the component models the latent variables give the assignments of the links to the components. Critical for successful scaling to large networks is sparseness of representations; here the component assignments of the links, the variables that are sampled in collapsed Gibbs, can be efficiently represented as sparse arrays, trees, and hash maps.

We compare the two models on two citation networks with a few thousand nodes, CiteSeer and Cora (Sen and Getoor, 2007), and demonstrate their properties on smaller networks. As a demonstration of a larger-scale problem, musical tastes of people are derived from the friendship network of the online music service Last.fm (www.last.fm), with over 650,000 vertices and almost two million edges.

2. Two scalable network models

SSN-LDA models directed links. A unique mixing pattern over latent link target profiles is associated with each node. (Technical details are presented later, e.g., in Fig. 4, right.) The latent profiles correspond to topics of text document models, the original application of LDA. If the node memberships in latent profiles are sharp enough, that is, if the nodes are mainly associated with one profile only, the profiles can be interpreted as subgraphs. The grouping criterion is a probabilistic version of the structural equivalence principle of sociology (Michaelson and Contractor, 1992): two nodes belong to the same group if their role in the network topology is similar, that is, they link to the same (other) nodes.

Figure 2: Network structure and component models. Left: A toy network split into two components (black and white). SSN-LDA produces both assortative and disassortative components, but here favors the disassortative interpretation, grouping nodes by common external connectivity. The ICMc solution is assortative and more community-like. Right: The Adj-Noun network of adjective and noun co-occurrences is bipartite, with a negative modularity of -0.241. SSN-LDA, not limited to assortative solutions, finds the underlying structure (node size represents certainty). Modularity of the SSN-LDA solution (black vs. white) is -0.262 and that of ICMc 0.188. See Table 1 for details of Adj-Noun.

In ICMc, a unique mixture over latent components is associated with each node, and linking is unstructured inside a component. Instead of structural equivalence, the criterion for subgroups is homogeneous, symmetric internal connectivity. Link directions are therefore not modelled. A related social concept is subgroup cohesion (Wasserman and Faust, 1994), where latent similarity results in connections inside the group, instead of linking to some common third party. As a result, the network looks homophilic (Lazarsfeld and Merton, 1954); the connected nodes tend to be relatively similar by their non-network properties.

For technical reasons, the parameterization of linking within a component in ICMc is in terms of linking probabilities over the components; memberships of nodes in components can be obtained from these parameters by the Bayes rule. Equivalently, the model can be described as modeling the whole graph as a bag of links. Each link comes from a component specified by a latent variable $z$ (Fig. 4, left). Each component chooses the endpoints of a link from a component-specific (multinomial) distribution over the nodes, parameterized by $m_z$.

A further helpful distinction is that of assortative and disassortative network properties.
A network is assortative with respect to a property if the property tends to co-occur in connected nodes more often than expected by chance (Newman, 2003). The opposite, negative correlation in adjacent nodes, is called disassortativity. SSN-LDA can in principle find either kind of structure, while ICMc tends to find only assortative structure (footnote 2). Modularity, a quality measure for community detection (Newman, 2006), can at least to a degree be used as a measure of assortativity; if it is negative for a partitioning of the network, the partitioning is disassortative (Fortunato and Castellano, 2007). Unfortunately, comparing modularities of partitionings over different networks is in general not justified (Fortunato and Castellano, 2007), and hence we cannot use it to compare the modeling problems.

One would expect ICMc to find communities in social and other networks better than the less specialized SSN-LDA, as long as linking results from homophily and the communities can be assumed assortative. The reason is that a model with fewer degrees of freedom in its parameterization can estimate its parameters more accurately from the relatively small observed data sets. On the other hand, ICMc should be unable to find disassortative structures. This indeed seems to hold in some extreme cases (Fig. 2), but in practice differences are often graded and harder to attribute to properties of the network (Fig. 3).

The behavior of both SSN-LDA and ICMc is determined by their hyperparameters. Both models can be made either to prefer latent components of equal size or to allow heavy size variation. Even more importantly, either graded or non-overlapping components can be preferred. In graded components the community membership probabilities are akin to coordinates in a latent space, while non-overlapping components divide the nodes sharply into clusters.
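The use of the sign of modularity as an indicator of assortativity can be illustrated with a small sketch. It assumes the commonly used definition $Q = \sum_c [e_c / L - (d_c / 2L)^2]$, where $e_c$ is the number of edges inside group $c$ and $d_c$ the total degree of its nodes; this formula is not spelled out in the text above, and the function name and toy graphs are purely illustrative.

```python
from collections import defaultdict


def modularity(edges, group):
    """Modularity of a partition of an undirected graph (assumed definition).

    edges: list of (i, j) pairs; group: dict mapping node -> group label.
    Q = sum over groups c of [ e_c / L - (d_c / (2 L))**2 ].
    """
    n_edges = len(edges)
    inside = defaultdict(int)  # e_c: edges with both ends in group c
    degree = defaultdict(int)  # d_c: total degree of the nodes in group c
    for i, j in edges:
        degree[group[i]] += 1
        degree[group[j]] += 1
        if group[i] == group[j]:
            inside[group[i]] += 1
    return sum(inside[c] / n_edges - (degree[c] / (2 * n_edges)) ** 2
               for c in degree)


# Two triangles joined by one edge, partitioned into the triangles.
two_triangles = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
q_assortative = modularity(
    two_triangles, {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"})

# Complete bipartite graph on 2 + 2 nodes, partitioned into its two sides.
bipartite = [(0, 2), (0, 3), (1, 2), (1, 3)]
q_disassortative = modularity(bipartite, {0: "a", 1: "a", 2: "b", 3: "b"})
```

In this toy setting the community-like partition gets a positive value and the bipartite partition a negative one, in line with the negative modularity reported for the Adj-Noun ground clusters.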
Both models can accommodate integer link weights in the sense of generating multiple links between two nodes. On the other hand, the models work particularly efficiently for sparse binary links: if data are sparse, link probabilities are small overall, multiple links even more improbable, and the model effectively generates binary data.

The SSN-LDA model is originally based on a finite mixture (Zhang et al., 2007), but it is easily extended to a Dirichlet process prior (DP prior; Blackwell and MacQueen, 1973; Neal, 2000), while ICMc was originally formulated with a DP prior (Sinkkonen et al., 2007), but is here applied also in its finite form.

The models are demonstrated on three small networks in Figures 1 and 3. The first is the Karate network originating from a study by Zachary (1977). In the study Zachary observed the social interactions between 34 members of a karate club over two years. During this period, there was disagreement among the club members which led to the splitting of the club. Figure 1 demonstrates that ICMc finds the split.

The second demonstration network is the Football network (Girvan and Newman, 2002), which depicts American football games between Division IA colleges during the fall season of 2000. The nodes of the network represent football teams and the edges the games between the teams. There is a known community structure for the network in the form of conferences. In general, games between teams that belong to the same conference are more frequent than games between teams that belong to different conferences, but sometimes teams prefer to play mostly against teams in other conferences. Both models find the structure, as seen in Figure 3, SSN-LDA slightly more accurately. ICMc is somewhat more accurate on another network derived from political blogs.

2.
The discussion of the distinction by Newman and Leicht (2006) is indeed applicable to SSN-LDA, for SSN-LDA can be seen as a Bayesian extension of the earlier model.

[Figure 3 panels. Above: perplexity for Polblogs and perplexity for Football, by algorithm (ICMc, SSN-LDA); error bars denote 2*SE. Below: layouts of the Football network for ICMc and SSN-LDA.]

Figure 3: Relative performance of ICMc and SSN-LDA varies for different networks. Above: ICMc performs better for a network of political blogs, as measured by the perplexity of the components, while SSN-LDA is better for a network of US football games. Reasons for the differences are unclear, although they might be related to the assortativity of the networks with respect to the ground clusters (political orientation and football conferences). Below: Despite perplexity differences, the solutions are qualitatively very similar.
For the Football network, the main differences are in the certainty of cluster assignments. Shaded areas show the borders of the conferences, while community assignments by the model and their certainty are depicted by node color and size, respectively. See Table 1 for details of the networks and model parameters.

Table 1: Network characteristics and modeling parameters for the small and medium-size networks. In the table, I is the number of nodes in the network, L is the number of edges, Q the ground cluster modularity, and α_Dir and β are the hyperparameters of the models.

                                        ICMc              SSN-LDA
  Network      I        L       Q       α_Dir    β        α_Dir    β
  Adj-Noun     112      423    -0.241   0.5      0.2      0.5      0.2
  Football     115      613     0.554   0.083    0.03     0.083    0.7
  Polblogs     1 222    16 714  0.410   0.5      0.003    0.5      0.4
  Citeseer     2 120    3 678   0.517   0.166    0.04     0.166    0.006
  Cora         2 485    5 067   0.630   0.143    0.02     0.143    0.025

[Figure 4: plate diagrams of the ICMc model (left) and SSN-LDA (right).]

Figure 4: The ICMc model (left) and SSN-LDA (right). SSN-LDA is effectively the Latent Dirichlet Allocation (LDA; Buntine, 2002; Blei et al., 2003) applied to network data, with nodes playing both the role of 'words', at the receiving end of links, and 'documents', at the sending end. ICMc has no hierarchy level for nodes. Instead, it generates two nodes for each link; the links are undirected. See Section 2 for the notation and further discussion.

2.1 Interaction Component Model for Communities (ICMc)

The generative process out of which the network is supposed to arise is the following (see Fig. 4 for a diagram); it is parameterized by the hyperparameters $(\alpha, \beta)$.

(1.1) Generate a multinomial distribution $\theta$ over latent components $z$. For $K$ components, the multinomial is generated from a $K$-dimensional Dirichlet distribution with all parameters set to $\alpha_{\mathrm{Dir}}$, $\mathrm{Dir}^K_{\mathrm{sym}}(\alpha_{\mathrm{Dir}})$, and for an infinite number of components from the Dirichlet process $\mathrm{DP}(\alpha_{\mathrm{DP}})$.
(1.2) To each $z$, associate a multinomial distribution over the $M$ vertices $i$ by sampling the multinomial parameters $m_z$ from the Dirichlet distribution $\mathrm{Dir}^M_{\mathrm{sym}}(\beta)$. (To clarify, we have $\sum_i m_{zi} = 1$ for each $z$, and $\sum_z \theta_z = 1$.)

(2) Then repeat for each link $l = 1, \ldots, L$:

(2.1) Draw a latent component $z$ from the multinomial $\theta$.

(2.2) Choose two nodes, $i$ and $j$, independently of each other, with probabilities $m_z$; set up a nondirectional link between $i$ and $j$.

Within components, edges are generated independently of each other; the non-random structure of the network emerges from the tendency of components to prefer certain vertices (that is, from $m$). In contrast to many other network models, the latent variables operate on the edge level, not on the vertex level. There is no explicit hierarchy level for vertices, but because vertices typically have several edges, they are implicitly treated as mixtures over the latent components. Finally, the model is parameterized to generate self-links and multi-edges because this choice allows sparse implementations, which would not be directly possible with a potential alternative model that generates binary links from the Bernoulli distribution.

Although in the case of a Dirichlet process prior the number of potentially generated components is infinite, the prior gives an uneven distribution over the components. Therefore, with a suitably small value of $\alpha_{\mathrm{DP}}$, we observe far fewer components than there are links, and the model is useful. On the other hand, $\beta$ describes the unevenness of the degree distribution of the nodes within components: a high $\beta$ tends to give components spanning all nodes, while a small $\beta$ prefers mutually exclusive, community-like components.
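The generative steps (1.1) to (2.2) for the finite (Dirichlet) case can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the toy parameter values, and the choice to draw Dirichlet samples by normalizing Gamma variates are assumptions made here.

```python
import random


def sample_sym_dirichlet(rng, dim, conc):
    """Draw from a symmetric Dirichlet by normalizing Gamma(conc, 1) variates."""
    gammas = [rng.gammavariate(conc, 1.0) for _ in range(dim)]
    total = sum(gammas)
    return [g / total for g in gammas]


def generate_icmc(n_nodes, n_links, n_comp, alpha_dir, beta, seed=0):
    """Generate undirected links from the finite-mixture ICMc process."""
    rng = random.Random(seed)
    # Step (1.1): mixing proportions theta over the K components.
    theta = sample_sym_dirichlet(rng, n_comp, alpha_dir)
    # Step (1.2): one distribution m_z over the M nodes per component.
    m = [sample_sym_dirichlet(rng, n_nodes, beta) for _ in range(n_comp)]
    links = []
    for _ in range(n_links):  # step (2)
        # Step (2.1): draw the latent component of the link.
        z = rng.choices(range(n_comp), weights=theta)[0]
        # Step (2.2): draw both endpoints independently from m_z.
        # Self-links and repeated links are allowed, as in the model.
        i, j = rng.choices(range(n_nodes), weights=m[z], k=2)
        links.append((z, i, j))
    return theta, m, links


theta, m, links = generate_icmc(
    n_nodes=30, n_links=200, n_comp=3, alpha_dir=0.5, beta=0.1)
```

With a small `beta`, each `m[z]` concentrates on a few nodes, so links of one component tend to stay within a small node set, which is the community-like regime described above.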
We have estimated the model with Gibbs sampling (Geman and Geman, 1984), a variant of MCMC methods that produces samples from the posterior distribution of the model parameters and the latent component memberships. As a side note, maximum likelihood or MAP estimation of the model is not sensible since the number of parameters and latent variables is large compared with the available data. It is easy to derive an EM algorithm for the finite-mixture ICMc, but it gets stuck in suboptimal local posterior maxima at the borders of the parameter space.

We use Gibbs sampling with some of the model parameters integrated out, called Rao-Blackwellized or collapsed (Neal, 2000). (For the joint distribution of the model and the derivation of the estimation algorithm see the Appendix.) In the collapsed Gibbs estimation algorithm the unknown model parameters $m_{zi}$ and $\theta_z$ are marginalized away and only the latent classes of the edges, $z_l$, are sampled, one edge at a time. In general, we denote edge counts per component by $n_z$, component-wise vertex degrees by $k_{zi}$, the total edge count by $N$, and the endpoints of the left-out edge by $(i, j)$. Then delete one edge, resulting in counts $(n', k', N')$ that are equal to $(n, k, N)$ but without the one edge. The component probabilities of the left-out edge are

$$p(z \mid i, j) \propto \frac{k'_{zi} + \beta}{2 n'_z + 1 + M\beta} \times \frac{k'_{zj} + \beta}{2 n'_z + M\beta} \times \frac{n'_z + \alpha_{\mathrm{Dir}}}{N' + K \alpha_{\mathrm{Dir}}} \tag{1}$$

for the Dirichlet prior, and

$$p(z \mid i, j) \propto \frac{k'_{zi} + \beta}{2 n'_z + 1 + M\beta} \times \frac{k'_{zj} + \beta}{2 n'_z + M\beta} \times \frac{C(n'_z, \alpha_{\mathrm{DP}})}{N' + \alpha_{\mathrm{DP}}} \tag{2}$$

for the Dirichlet process prior. The 'chooser' function is $C(n_z, \alpha_{\mathrm{DP}}) \equiv n_z$ if $n_z \neq 0$, and $C(0, \alpha_{\mathrm{DP}}) = \alpha_{\mathrm{DP}}$. The case with $\alpha_{\mathrm{DP}}$, as opposed to $n_z$, corresponds to a new component with no other links so far.

This sampling step is simply repeated iteratively for all links, until convergence to the posterior distribution, or until the results are satisfactory by some other measure.

A particularly elegant, although not necessarily the most efficient, initialization of the sampler starts from empty urns, with $k_{zi} = n_z = N = 0$, then runs through the edges once in a random order and populates the urns according to (1) or (2) while counting only the edges seen so far.

Algorithm 1 (ICMc-Gibbs). A simple implementation of the ICMc algorithm with a Dirichlet prior.

    Simple-Gibbs-Sampling(α_Dir, β, L)
      t_nodes ← node count
      for c ← 1 to components do              ▷ initialize data structures
        A[c] ← 0
        for n ← 1 to t_nodes do
          K[n, c] ← 0
      for i ← 1 to iterations do              ▷ main iteration loop
        foreach l in L do
          v_i, v_j ← first and second node of l
          if i ≠ 1 then                       ▷ no assignment exists on the first sweep
            z_old ← Z[l]
            decrement K[v_i, z_old], K[v_j, z_old], A[z_old]
          for c ← 1 to components do
            P[c] ← Calc-Probability(A[c], K[v_i, c], K[v_j, c], t_nodes, α_Dir, β)
          z_new ← sample index from P
          Z[l] ← z_new
          increment K[v_i, z_new], K[v_j, z_new], A[z_new]
      return K, Z

    Calc-Probability(n_c, k_a, k_b, t_nodes, α_Dir, β)
      return (k_a + β)(k_b + β)(n_c + α_Dir) / ((2 n_c + 1 + β t_nodes)(2 n_c + β t_nodes))

The goal of model fitting is usually to infer community memberships of the nodes. From the Bayes rule we obtain

$$p(z \mid i) = \frac{\theta_z m_{zi}}{\sum_{z'} \theta_{z'} m_{z'i}}.$$
(3)

A sample of the marginalized parameters $\theta$ and $m$ can be reconstructed from each realization of the counts $(k, n)$ by sampling from the conditional Dirichlet distributions given the priors and the counts:

$$\theta \sim \mathrm{Dir}_z(n_z + \alpha_{\mathrm{Dir}}) \quad \text{or} \quad \theta \sim \mathrm{Dir}_z(\{n_z, \alpha_{\mathrm{DP}}\}_z), \quad \text{and} \quad m_z \sim \mathrm{Dir}_i(k_{zi} + \beta). \tag{4}$$

Note that $n_z = \sum_i k_{zi} / 2$. In the case of the Dirichlet process prior, the parameter $\theta$ holds the probabilities of the components with at least one assigned link, followed by the probability of all empty components summed up into the last bin. These correspond to the Dirichlet parameters $n$ and $\alpha_{\mathrm{DP}}$, respectively.

Even if one wants to reconstruct $\theta$ and $m$, collapsed Gibbs is likely to be faster than full Gibbs. The reasons are twofold. Firstly, Gibbs converges faster when the parameter updates are not in the main loop. Secondly, one usually uses decimation in sampling from the converged chain, and $(\theta, m)$ need to be constructed only for the decimated samples.

It is often sufficient to estimate the community memberships from the expected values of the marginalized parameters,

$$\hat{\theta}_z = \frac{n_z + \alpha_{\mathrm{Dir}}}{\sum_{z'} n_{z'} + K \alpha_{\mathrm{Dir}}} \quad \text{or} \quad \hat{\theta}_z = \frac{n_z}{\sum_{z'} n_{z'} + \alpha_{\mathrm{DP}}}, \tag{5}$$

and

$$\hat{m}_{zi} = \frac{k_{zi} + \beta}{\sum_{i'} k_{zi'} + M\beta}. \tag{6}$$

Substituting the expectations into (3), we find that for small $\alpha$ and $\beta$,

$$p(z \mid i) \approx \frac{k_{zi}}{\sum_{z'} k_{z'i}} \tag{7}$$

is a good approximation.

Prediction for new data is straightforward; the component memberships of the links associated with a new node can be sampled from (1), given old links. If the new links are not conditionally independent given the old data, one can run a short Gibbs iteration on the new links.

2.2 SSN-LDA

SSN-LDA (Zhang et al., 2007) also has two hyperparameters, denoted by $\alpha$ and $\beta$, but they are in a slightly different role than in ICMc (see Fig. 4). The generative process is as follows:

(1.1) Generate $M$ multinomial distributions $\theta_i$, $i = 1, \ldots, M$, over latent components $z$, $z = 1, \ldots, K$, either from a $K$-dimensional Dirichlet distribution $\mathrm{Dir}^K_{\mathrm{sym}}(\alpha_{\mathrm{Dir}})$, or from the Dirichlet process $\mathrm{DP}(\alpha_{\mathrm{DP}})$.

(1.2) Assign a multinomial distribution $m_z$ over the vertices $i$ to each component $z$ by sampling from the Dirichlet distribution $\mathrm{Dir}^M_{\mathrm{sym}}(\beta)$.

(2) Then repeat for each link $l = 1, \ldots, L$, with sending nodes $i = 1, \ldots, M$:

(2.1) Draw a latent component $z$ from the multinomial $\theta_i$.

(2.2) Choose the link endpoint $j$ with probabilities $m_z$; set up a directional link from $i$ to $j$.

We have presented the generative process of links in a flat form to make comparison to ICMc easier; in step 2, the loop over nodes is avoided by referring to the node indices $i$ associated with links.

In contrast to ICMc, SSN-LDA has the node as an explicit hierarchy level: in the generative model, there are the parameters $\theta$ for each node separately, and $\alpha$ is the common hyperparameter of these node-wise distributions. As in ICMc, the parameters $m$ are associated with latent components over nodes, and $\beta$ determines their prior. But now $m$ determines only the probabilities of the receiving nodes. Sending probabilities, associated with the starting points $i$ of the links, are modeled by $\theta$.

Collapsed Gibbs sampling operates on two sets of counts: $n_{iz}$, which counts the sender-component combinations $(i, z)$ for links, originating from step (2.1) of the generative process, and $k_{zj}$, which counts the receiver-component combinations $(j, z)$ from step (2.2) of the process. Following Griffiths and Steyvers (2004) and Zhang et al. (2007), the conditional probability for sampling a left-out link in a collapsed Gibbs iteration, given the hyperparameters and all other links, is

$$p(z \mid i, j) \propto \frac{k'_{zj} + \beta}{k'_{z\cdot} + M\beta} \times \frac{n'_{iz} + \alpha_{\mathrm{Dir}}}{n'_{i\cdot} + K \alpha_{\mathrm{Dir}}}, \tag{8}$$

where sums over counts $k'$ and $n'$ have been denoted by the dot notation.
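A collapsed Gibbs sweep based on the conditional (8) can be sketched as below. This is a dense, unoptimized illustration under assumptions of our own: the function and variable names are hypothetical, and the first sweep doubles as an initialization that counts only the links seen so far, in the spirit of the empty-urn initialization described for ICMc.

```python
import random


def ssn_lda_gibbs(links, n_nodes, n_comp, alpha_dir, beta, n_iter=50, seed=0):
    """Collapsed Gibbs sampling for directed links (i, j), finite SSN-LDA."""
    rng = random.Random(seed)
    n = [[0] * n_comp for _ in range(n_nodes)]  # n[i][z]: sender-component counts
    k = [[0] * n_nodes for _ in range(n_comp)]  # k[z][j]: component-receiver counts
    k_tot = [0] * n_comp                        # k[z][.]: receiver counts summed over j
    z_of = [None] * len(links)                  # current component of each link
    for _ in range(n_iter):
        for l, (i, j) in enumerate(links):
            old = z_of[l]
            if old is not None:                 # leave the current link out
                n[i][old] -= 1
                k[old][j] -= 1
                k_tot[old] -= 1
            # Conditional (8); the denominator n'_i. + K*alpha is the same
            # for every z and is therefore dropped.
            weights = [
                (k[z][j] + beta) / (k_tot[z] + n_nodes * beta)
                * (n[i][z] + alpha_dir)
                for z in range(n_comp)
            ]
            new = rng.choices(range(n_comp), weights=weights)[0]
            z_of[l] = new
            n[i][new] += 1
            k[new][j] += 1
            k_tot[new] += 1
    return n, k, z_of


# Toy data: two directed triangles with links in both directions.
toy_links = [(0, 1), (1, 0), (0, 2), (2, 0), (1, 2), (2, 1),
             (3, 4), (4, 3), (3, 5), (5, 3), (4, 5), (5, 4)]
n_counts, k_counts, z_of = ssn_lda_gibbs(
    toy_links, n_nodes=6, n_comp=2, alpha_dir=0.5, beta=0.1, n_iter=20)
```

Each inner step costs O(K) here because the weights are recomputed for every component; sparse count tables would reduce this cost on large networks.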
We have omitted the derivation of the Dirichlet process variant, because it is very similar to the derivation of the DP ICMc (see the Appendix), leading to

$$p(z \mid i, j) \propto \frac{k'_{zj} + \beta}{k'_{z\cdot} + M\beta} \times \frac{C(n'_{iz}, \alpha_{\mathrm{DP}})}{n'_{i\cdot}}. \tag{9}$$

Again, parameter reconstruction for $\theta$ and $m$ can be done either by sampling from the corresponding Dirichlet distributions, or by computing the conditional MAP estimates, either roughly or exactly including priors. As with ICMc, we have used the rough alternative suitable for small values of $\alpha$ and $\beta$:

$$p(z \mid i) \approx \frac{n_{iz}}{n_{i\cdot}}. \tag{10}$$

This is for the community memberships of nodes as senders of links. Because in SSN-LDA links are directed, it is possible to define the memberships also in terms of received links,

$$p(z \mid j) \approx \frac{k_{zj}}{k_{\cdot j}}. \tag{11}$$

2.3 Efficient implementation of the collapsed Gibbs samplers

Large real-life networks are sparse almost by definition, and for efficiency it is important to preserve the sparseness in model structures. ICMc and SSN-LDA facilitate sparse structures, since likelihoods decompose into sums over existing links, and terms related to non-links do not appear.

Collapsed Gibbs sampling of ICMc and SSN-LDA needs tables for $n$ and $k$ which together, as a first approximation, have size complexity $O(MK)$. In addition one needs to keep track of the component identities of the links, an array of size $O(L)$. But in both models the degree of a node poses an upper limit on its component heterogeneity, so that only a few of the counts $k$, or $k$ and $n$ in LDA, are simultaneously non-zero, allowing a sparse representation of the count tables. Therefore with hash tables memory consumption can be reduced to $O(M\bar{d} + L + K)$, where $\bar{d}$ is the average degree of a node. Because $\bar{d}$ is of the order of $L/M$, memory consumption scales as $O(L + K)$.
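The sparse representation of a count table can be sketched with nested hash maps. The class below is a hypothetical illustration, not the data structure of the actual implementation: it stores only non-zero counts, so a node never occupies more entries than it has link endpoints.

```python
class SparseCounts:
    """Sparse table of counts k[node][component]; zeros are not stored."""

    def __init__(self):
        self.rows = {}  # node -> {component: positive count}

    def get(self, node, comp):
        """Return the count, or 0 if the entry is not stored."""
        return self.rows.get(node, {}).get(comp, 0)

    def add(self, node, comp, delta):
        """Add delta to one entry and drop the entry if it reaches zero."""
        value = self.get(node, comp) + delta
        if value < 0:
            raise ValueError("count would become negative")
        if value > 0:
            self.rows.setdefault(node, {})[comp] = value
        elif node in self.rows:
            self.rows[node].pop(comp, None)
            if not self.rows[node]:
                del self.rows[node]

    def stored_entries(self):
        """Number of non-zero counts currently held in memory."""
        return sum(len(row) for row in self.rows.values())


# Assign three links (with a component label each) and then move one of them.
k = SparseCounts()
for z, i, j in [(0, 1, 2), (0, 2, 3), (5, 3, 1)]:
    k.add(i, z, 1)
    k.add(j, z, 1)
k.add(3, 5, -1)   # take link (3, 1) out of component 5 ...
k.add(1, 5, -1)
k.add(3, 0, 1)    # ... and reassign it to component 0
k.add(1, 0, 1)
```

A Gibbs step then only needs to visit the components stored for the two endpoints, plus whatever handling the prior term requires for the remaining components.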
Marginal sums of the count tables, notably $n$ in ICMc, can be represented in a sparse form and updated efficiently during sampling with the aid of a self-balancing binary tree. The idea of using a tree in sampling of discrete distributions was originally proposed by Wong and Easton (1980), and another method for using binary trees in simulations is provided by Blue et al. (1995). In our implementation the Arne Andersson tree (AA tree) is used (see, e.g., Weiss, 1998), but other self-balancing binary trees would be equivalent in performance. A partial sum tree is formed, where each node stores its own total probability together with the sums of probabilities of its left and right subtrees. When the probability of a node is changed, the modifications are propagated up through the ancestors of the node. Sampling proceeds recursively down the tree as a sequence of weighted Bernoulli samples.

These sparse representations, and the binary tree for the marginal sums, make it possible to run models with at least tens of thousands of components on an ordinary PC or server. These structures also fit well with the dynamic component numbers due to the Dirichlet process prior.

With the data structures described above, running time per one Gibbs iteration over all the nodes becomes $O(L \bar{d} \log K)$. That is, the time needed for an iteration scales linearly in the number of edges and logarithmically in the number of components.

It is hard to give any general rule on the number of Gibbs iterations needed for convergence. Because the variables in the collapsed Gibbs algorithm correspond to links, the dependency graph of the variables is like the original network, but with the nodes being in the role of links, and vice versa. The path lengths of the dependency network are therefore proportional to the path lengths of the original network.
Let us assume that the average path length scales as $l \propto \log M$, as is the case with many small-world networks (Albert and Barabasi, 2002). In Gibbs sampling, information diffusion over the network can be expected to take $l^2$ iterations, analogously to ordinary diffusion. This leads to the conjecture that the number of Gibbs iterations should be proportional to $\log^2 M$.

3. Tests

We compared SSN-LDA and ICMc on two medium-scale social network datasets, Cora and CiteSeer (Sen and Getoor, 2007), in the task of finding a predefined set of known clusters. Performance on large networks of $10^5 \ldots 10^6$ nodes is then demonstrated for one of the models (ICMc) with two friendship networks from the music site Last.fm.

3.1 ICMc vs. SSN-LDA

[Figure 5 panels. Left: "Convergence of the ICMc sampler", likelihood against iteration (0 to 50,000). Right: "Perplexity for Cora", perplexity against β (0.001 to 1, logarithmic axis) for ICMc and SSN-LDA.]

Figure 5: Gibbs samplers on the Cora citation set: convergence and sensitivity to the hyperparameter $\beta$. Left: Leave-one-out logarithmic posterior probability of the data for a single ICMc chain. This can be recorded easily during sampling as log probabilities of the drawn link assignments. We ran 50,000 iterations over the data, but about 15,000 would have been enough for convergence, and about 3,000 for getting useful results. SSN-LDA convergence was very similar. Right: Perplexity for the Cora dataset with a range of hyperparameter values $\beta$. Each reported value is an average of four chains. Both models are quite robust with respect to $\beta$.

The Cora and the CiteSeer datasets consist of content descriptions of scientific publications and citations between them. The Cora dataset has 2,708 papers in seven predefined classes, while the CiteSeer dataset contains 3,312 publications in six classes.
We used only the citation information, and the predefined classes as a ground truth for clustering. Nodes (publications) not belonging to the main components of the network were removed, and directional links were symmetrized. The resulting network has 2,120 nodes for Cora and 2,485 nodes for CiteSeer (Table 1).

Following Zhang et al. (2007) and our own experiences (Aukia, 2007), we fixed $\alpha_{\mathrm{Dir}} = 1/K$ for both models and datasets. Values for the parameter $\beta$ were chosen with pretests (Table 1 and Fig. 5). In general, the models with a Dirichlet prior and a small number of components are quite insensitive to values of $\alpha_{\mathrm{Dir}}$ and $\beta$ within the range $0.001 \ldots 0.1$.

The Gibbs sampler was initialized as suggested in Section 2.3 and run for 50,000 iterations (see Fig. 5). We then took 100 samples at intervals of 100. Each sample consists of the latent cluster memberships z for all links. Node memberships were constructed by (7) and (10) for each sample separately, and these were summed up to get confusion matrices.

Over the 50 computed chains, there is a good average correspondence between the found clusters and the original manual clustering of the data sets (Fig. 6). In terms of perplexity, ICMc is able to recover the original clusters better than SSN-LDA, although the average confusion matrices are relatively similar. Results vary from chain to chain more than with small networks, indicating multiple local minima in which the Gibbs sampler can get trapped. (See Section 4 below for discussion on this behaviour.)
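To make the evaluation above concrete, the following sketch computes a perplexity from a confusion matrix. The exact definition of perplexity used in the tests is not given in this section, so the formula here is an assumption: the exponential of the conditional entropy of the true class given the found cluster. The function name `class_perplexity` is hypothetical.

```python
import numpy as np


def class_perplexity(confusion):
    """Perplexity of predicting true classes from found clusters.

    confusion[c, k] is the (possibly fractional) mass of nodes with true
    class c assigned to cluster k. Assumed definition: exp(H(class | cluster)).
    """
    joint = np.asarray(confusion, dtype=float)
    joint = joint / joint.sum()
    cluster_marginal = joint.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        # p(class | cluster); empty cells contribute nothing to the entropy.
        conditional = np.where(joint > 0, joint / cluster_marginal, 1.0)
    return float(np.exp(-(joint * np.log(conditional)).sum()))


perfect = class_perplexity(np.eye(3))             # clusters identify classes exactly
uninformative = class_perplexity(np.ones((2, 2)))  # clusters carry no class information
```

Under this definition a value of 1 means the clusters determine the classes completely, and a value equal to the number of equally likely classes means the clusters carry no information about them.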
[Figure 6, four panels. Above: "Perplexity for Cora" and "Perplexity for Citeseer", perplexity by algorithm (ICMc, SSN-LDA); error bars denote 2*SE. Below: confusion matrices for Cora (true classes Case_Based, Genetic_Algorithms, Neural_Networks, Probabilistic_Methods, Reinforcement_Learning, Rule_Learning, Theory against clusters A to G) and for CiteSeer (true classes Agents, AI, DB, HCI, IR, ML against clusters A to F), for each of the two models.]

Figure 6: The models ICMc and SSN-LDA on two citation networks, Cora and CiteSeer. Above: Performance in finding true clusters, as measured by the perplexity of predicting ground-truth groups with the clusters. The average and 95% confidence intervals for the mean are over 50 chains. Below: Average confusion matrices between the found clusters (columns) and the true clusters (rows).

Table 2: Last.fm networks and modeling parameters: I is the number of nodes in the network, L is the number of edges, and $\alpha_{\mathrm{DP}}$ and $\beta$ are the hyperparameters of ICMc with the DP prior.

Network         I          L            α_DP   β
Full Last.fm    675,682    1,898,960    0.3    0.3
Last.fm USA     147,610    352,987      0.2    0.2

3.2 ICMc on Last.fm friendship network

Last.fm is an Internet site that learns the musical taste of its members on the basis of examples, and then constructs a personalized, radio-like music feed. The web site also has a richer array of services, including a possibility to announce friendships with other users. The friendships are initiated by a single party but are later mutual, forming a network with undirected links. Because friends tend to be similar, communities in the network would be relatively homogeneous by their musical taste and other characteristics. We use this similarity within communities to demonstrate ICMc components.
[Figure 7: heat map with ICMc components A to E as rows and countries as columns, with a dendrogram over the columns.]

Figure 7: ICMc components (rows) of the Last.fm friendship network correlated with nationalities of the participants (columns). The full Last.fm network, about 675,000 nodes, was analyzed with the DP version of ICMc. After a burn-in period of 19,000 iterations, 20 samples were taken at intervals of 50 iterations to get component memberships for the nodes. The running time was 16.4 hours when run in a single thread on an eight-core 2 GHz Intel Xeon. Dark blue and dark red denote the extremes of high and low co-occurrence counts, respectively. The columns are ordered, and the tree produced, by heatmap of the statistical environment R.

The global Last.fm network had about 675,000 nodes and 1.9 million links, while the subset of US members had about 147,000 nodes and 353,000 links (Table 2). In addition to the friendships, we also crawled the nationalities of the site members in the network, as well as the tags they had associated to the music they like. The most common tags represent musical genres or subgenres, allowing interpretation of the components found from the network.

We modeled the networks with the ICMc, with its Dirichlet process prior adjusted to favor few components.
(With different hyperparameters, it would have been possible to obtain thousands of local communities, but interpreting such a solution to assess its quality would be difficult.) See Table 2 and Figs. 7 and 8 for details and results.

The component structure of the full Last.fm network is primarily about geography or nationalities (Fig. 7). This was unexpected at first sight, but in hindsight it is not at all surprising, for people tend to bond mostly within their country or city, and the friendships in Last.fm are likely to reflect the relationships of the real world. Even if they did not, nationality would affect bonding.

[Figure 8: heat map with ICMc components A to H as rows and the most common music tags as columns, with a dendrogram over the columns.]

Figure 8: ICMc components (rows) of the Last.fm USA friendship network correlated with musical taste of the participants (columns). The model had the DP prior and was computed with a burn-in of 49,000 iterations, after which 20 samples were recorded at intervals of 50 iterations. The running time was 8.4 hours when run in a single thread on an eight-core 2 GHz Intel Xeon. Columns correspond to the most common tags given to the songs by the users themselves. Other details are as in Fig. 7.

We also correlated the global component structure to
musical taste, and while there are meaningful groups of genres (not shown, but see Fig. 8), it is hard to say which part of them arises due to the geographical division. Although a more complex model would be needed to find both musical and geographical structure, the results show that ICMc is able to find homophilic structures from large networks.

To get a better grasp of the musical homophily of the network, we also ran ICMc on a geographically more homogeneous subset of members who have reported being from the US. This revealed a clear structure in terms of music preferences, as shown in Figure 8 and in Table 3. The model was able to separate light pop, more experimental music, "alternative," metal, Christian, and a punk–hip-hop continuum. In addition, there were two components that are harder to interpret.

4. Discussion

We have presented two generative models for networks, of which ICMc is novel, and demonstrated and tested them on data sets of various sizes. Performance differences between the

(a) Cluster A: juggalo 1.36, pop 1.34, musicals 1.32, Sludge −1.89, black metal −1.98
(b) Cluster B: shoegaze 1.35, Alt-country 1.24, post-punk 1.22, screamo −1.79, pop punk −3.16
(c) Cluster C: indie 0.46, post-rock 0.30, folk 0.22, visual kei −1.86, j-pop −2.08
(d) Cluster D: j-pop 1.69, visual kei 1.68, black metal 1.56, post-punk −1.21, psychedelic −1.41
(e) Cluster E: christian 1.53, podcast 1.01, trance 0.87, shoegaze −1.54, Sludge −1.68
(f) Cluster F: rnb 1.33, screamo 1.21, pop punk 1.15, Korean −2.25, psytrance −2.48
(g) Cluster G: Jam 1.35, ska 0.89, hardcore 0.47, visual kei −1.43, j-pop −2.28
(h) Cluster H: latin 1.13, chinese 1.05, psytrance 0.70, synthpop −1.54, juggalo −1.62

Table 3: The most likely and unlikely tags for each of the ICMc components in the Last.fm US network.
The tables have been obtained by comparing the frequency of the tag to that expected in terms of its marginal probabilities. The table includes only tags for which the deviation from the expectation was reliable in terms of a binomial test (p=0.05). The numerical values are log-odds.

models were small; ICMc performed slightly better on networks with strong subgroup cohesion, while SSN-LDA had an edge in finding more disassortative network structures. If one is after communities in a social network, there are both theoretical and empirical reasons to prefer ICMc. The models do not have significant differences in implementation complexity or ease of use.

SSN-LDA can be seen as a further development of similar kinds of models earlier applied to text documents. On the other hand, it is also a generalization of the model by Newman and Leicht (2006), which interestingly shares notable similarity with earlier text document models (Hofmann, 2001).

ICMc belongs to the same model family with LDA, but introduces a generative process that is more faithful to the idea of subgroup cohesion. An earlier formulation of subgroup cohesion is modularity (Newman and Girvan, 2004; Newman, 2006), for which ICMc or its likelihood could be seen as an alternative. It would be interesting to explore the relationship between these two, especially as our simulations show that in general modularity increases monotonically during a Gibbs run or saturates and only slightly decreases before convergence (Aukia, 2007).

Most of our tests were on networks with a known community structure, which allowed us to set the number of components in advance and use the Dirichlet prior. In preliminary tests we also tried Dirichlet process priors with these networks, but performance was naturally worse since they did not have the prior knowledge about the expected ("known") component number.
Another reason for the worse performance of DPs probably is that the size distribution of the communities is artificially even, for two related reasons. First, the networks have survived the selection process of becoming de facto standards for model testing. Second, in many cases the communities have been manually set up to make them maximally informative or otherwise handy. A small number of even-sized communities does not fit well with the Dirichlet process prior, which assumes either a small number of communities with rather unequal size, or a very large number with more equal size. It is likely that in real applications to social and biological networks the Dirichlet process performs relatively much better, because real-life communities tend to be of heterogeneous sizes.

The generative processes of simple models as discussed here are not meant to be realistic, at least not on higher hierarchical levels beyond the distributions generating the observed data. Instead, the ultimate criterion for generative processes should be empirical. Some abstract information about the networks can be coded to the generative processes, however. One obvious example is the assortative vs. disassortative nature of the network structure. It seems that getting this wrong is not catastrophic, but certainly using the right model improves performance. Another interesting detail is the Dirichlet priors. From their urn representations, it is obvious that they mimic the preferential attachment model of network generation (Albert and Barabasi, 2002), which produces relatively realistic degree distributions for social networks.

Even with the Dirichlet process prior one needs to choose the hyperparameters. Fortunately, the models seem to be quite robust in terms of the parameter $\beta$, and also in terms of $\alpha_{\mathrm{Dir}}$ with the Dirichlet prior.
It is possible to take $\beta$ into the sampling process as an MCMC step, because the marginal likelihood for $\beta$ is easy to compute. With the Dirichlet process prior, the parameter $\alpha_{\mathrm{DP}}$ fundamentally affects the latent component diversity and therefore model complexity. For $\alpha$ one can use the proposed approximations of evidence, such as the harmonic mean estimator (Griffiths and Steyvers, 2004; Buntine and Jakulin, 2004), which is known to be unstable but at least sometimes repairable (Raftery et al., 2007). Cross-validation on the link level is still another possibility.

Although Gibbs sampling has a reputation of being slow compared to variational methods, a lot depends on how the slowness is measured. With topic models for texts, Gibbs is known to produce better results than variational LDA, at the cost of maybe 4–8 times the running time to convergence (Wray Buntine, p.c.). But according to Griffiths and Steyvers (2004), Fig. 1, collapsed Gibbs is actually faster, measured in floating point operations needed to attain a certain level of perplexity. The difference may partly be explained by implementational details, but one should also note that performance measurements should be relative to the goal: While in statistical inference convergence is essential, in predictive tasks the predictive performance counts, and often in practice a model is better if it gives better performance in a shorter running time, regardless of whether it has converged or not.

In fact, the whole notion of posterior convergence is problematic in models like LDA and ICMc with a high number of data, parameters and components. We do know that permutation modes exist and that the current Gibbs samplers fortunately find only one of them; if they found more, we would have a label switching problem.
Even within a permutation mode there are probably many local modes of which the Gibbs sampler explores only part. This is suggested by the variation between the chains, and by the NP-hardness of related formulations of the community finding problem (Brandes et al., 2006). If needed, different types of compromises between running time and performance are available by applying better MCMC techniques, such as annealing, population methods, or split-merge moves. Variational methods are available for the DP prior (Blei and Jordan, 2004) but they are likely to need help with mode finding.

ICMc and SSN-LDA can be considered as examples of a larger family of component models, which suggests generalizations. Links or higher-order co-occurrences of potentially several types are generated from latent components, together with other nominal data associated to nodes. Optimization of such models with collapsed Gibbs is relatively straightforward and easy to implement, as long as the priors are conjugate, non-parametric or not. An interesting extension of ICMc, evidently needed for the Last.fm network, would be to allow factorial (nominal) components, whose interactions describe the observed communities. In the Last.fm network, the obvious factors could be geography and musical taste. More generic formal extensibility of the model family, along the lines of relational models (e.g. Xu et al., 2007), should also be investigated.

Acknowledgments

We thank Last.fm for the data, and E. Airoldi for information on the block models, particularly on their scalability. This work was supported by Academy of Finland, grant number 119342. SK belongs to the Finnish CoE on Adaptive Informatics Research Centre of the Academy of Finland and Helsinki Institute for Information Technology, and was partially supported by EU Network of Excellence PASCAL.

Appendix A. The joint distribution and collapsed Gibbs sampler.
In ICMc the joint likelihood of observed links $\mathcal{L}$ and latent variables $\mathcal{Z}$, given mid-level model parameters $\theta$ and $m$, is

$$p(\mathcal{L}, \mathcal{Z} \mid m, \theta) = \prod_l \theta_{Z_l}\, m_{Z_l I_l}\, m_{Z_l J_l} = \prod_z \theta_z^{n_z} \times \prod_{iz} m_{zi}^{k_{zi}},$$

where the notation $Z_l$ refers to the index of the component generating link $l$, and $I_l$ and $J_l$ refer to link endpoint node indices. In the last expression we have the counts $n_z$ of links over components, and the counts $k_{zi}$ of link endpoints over component–node co-occurrences. With symmetric Dirichlet priors $\mathrm{Dir}^{I}_{\mathrm{sym}}(\beta)$ for each $m_z$ and $\mathrm{Dir}^{K}_{\mathrm{sym}}(\alpha_{\mathrm{Dir}})$ for $\theta$, this becomes

$$p(\mathcal{L}, \mathcal{Z}, m, \theta \mid \alpha_{\mathrm{Dir}}, \beta) = Z^{-1}(\alpha_{\mathrm{Dir}}, \beta) \prod_z \theta_z^{n_z + \alpha_{\mathrm{Dir}} - 1} \prod_{iz} m_{zi}^{k_{zi} + \beta - 1},$$

with the normalizer $Z$ arising from the Dirichlet priors. Following Griffiths and Steyvers (2004) on Rao-Blackwellisation of LDA, marginalize over $\theta$ and all $m_z$:

$$p(\mathcal{L}, \mathcal{Z} \mid \alpha_{\mathrm{Dir}}, \beta) = \iint p(\mathcal{L}, \mathcal{Z}, m, \theta \mid \alpha_{\mathrm{Dir}}, \beta)\, d\theta\, dm = Z^{-1}(\alpha_{\mathrm{Dir}}, \beta) \prod_z \frac{\prod_i \Gamma(k_{zi} + \beta)}{\Gamma(2 n_z + M\beta)} \times \frac{\prod_z \Gamma(n_z + \alpha_{\mathrm{Dir}})}{\Gamma(N + K\alpha_{\mathrm{Dir}})}, \tag{12}$$

where $M$ is the number of nodes, $K$ is the number of components, and the $2 n_z$ comes from the number of component-wise links and the fact that each link has two endpoints. (For evaluating the integral, look for a correspondence with the general Dirichlet distribution and its normalizing factor.)

Because links are generated independently, they can in principle be separated from $p(\mathcal{L}, \mathcal{Z} \mid \alpha, \beta)$ into link-wise factors. Separate one arbitrary link, say $l'$, associated to the latent variable $z'$ and to nodes $i'$ and $j'$ ($i' \neq j'$), from the product, and denote by $(\mathcal{L}', \mathcal{Z}')$ the other links and their associated latent components, and by $(k', n', N')$ the counts as they would be if the link were nonexistent. For most indices, we will have $k' = k$ and $n' = n$, and always $N' = N - 1$, but for some indices $k' = k - 1$ and $n' = n - 1$.
Because $\Gamma(x) = (x-1)\,\Gamma(x-1)$ and $\Gamma(x) = (x-1)(x-2)\,\Gamma(x-2)$, all this translates into

$$p(\mathcal{L}', \mathcal{Z}', l', z' \mid \alpha_{\mathrm{Dir}}, \beta) = Z^{-1}(\alpha_{\mathrm{Dir}}, \beta) \prod_z \frac{\prod_i \Gamma(k'_{zi} + \beta)}{\Gamma(2 n'_z + M\beta)} \times \frac{\prod_z \Gamma(n'_z + \alpha_{\mathrm{Dir}})}{\Gamma(N' + K\alpha_{\mathrm{Dir}})} \times u_{z'} = p(\mathcal{L}', \mathcal{Z}' \mid \alpha_{\mathrm{Dir}}, \beta) \times u_{z'},$$

where, writing $z$ for the value of $z'$,

$$u_z \equiv p(l', z' \mid \mathcal{L}', \mathcal{Z}', \alpha_{\mathrm{Dir}}, \beta) = \frac{(k'_{z'i'} + \beta)(k'_{z'j'} + \beta)}{(2 n'_{z'} + 1 + M\beta)(2 n'_{z'} + M\beta)} \times \frac{n'_{z'} + \alpha_{\mathrm{Dir}}}{N' + K\alpha_{\mathrm{Dir}}}. \tag{13}$$

One can use the result to sample a new component $z$ for the left-out link, with the probabilities $p(z \mid l', \mathcal{L}', \mathcal{Z}', \alpha_{\mathrm{Dir}}, \beta) = u_z / u_\cdot$, the denominator using the dot notation for the sum. A Gibbs iteration follows by leaving one link out at a time, and sampling a new latent component for it as above.

Dirichlet process prior for components. The ICMc model can be derived for a Dirichlet process component prior in several ways. Informally, after seeing the link removal decomposition with $u_z$, one notes the structure of $p(\mathcal{L}, \mathcal{Z} \mid \alpha_{\mathrm{Dir}}, \beta)$ as nested Polya urns (Johnson, 1977). One can then substitute the component urn, the last factor in (13), with the Blackwell-MacQueen urn (Blackwell and MacQueen, 1973; Tavare and Ewens, 1997) parameterized by $\alpha_{\mathrm{DP}}$:

$$p(l', z' \mid \mathcal{L}', \mathcal{Z}', \alpha_{\mathrm{DP}}, \beta) = \frac{(k'_{z'i'} + \beta)(k'_{z'j'} + \beta)}{(2 n'_{z'} + 1 + M\beta)(2 n'_{z'} + M\beta)} \times \frac{C(n'_{z'}, \alpha_{\mathrm{DP}})}{N' + \alpha_{\mathrm{DP}}} \tag{14}$$

with $C(n, \alpha) = n$ if $n \neq 0$ and $C(0, \alpha) = \alpha$. Another way to end up with the same result is to substitute $\alpha_{\mathrm{Dir}} = \alpha_{\mathrm{DP}}/K$ into (12) or (13), then collect all empty components into one bin, and take the limit $K \to \infty$ (Neal, 2000).
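As an illustration of the sampling rule (14), the sketch below computes the unnormalized component weights for one left-out link. It is a dense NumPy version written for readability, not the sparse implementation described in the paper; the function name `dp_link_weights` and the array layout are assumptions, and the count arrays are expected to contain exactly one empty row standing for a new component.

```python
import numpy as np


def dp_link_weights(k_rest, n_rest, i, j, beta, alpha_dp):
    """Unnormalized weights u_z for a left-out link (i, j), following (14).

    k_rest[z, node]: endpoint counts per component and node, link removed.
    n_rest[z]:       link counts per component, link removed.
    Assumption: exactly one row with n_rest[z] == 0, representing a new
    component; creating and deleting rows is left to the caller.
    """
    num_nodes = k_rest.shape[1]   # M in the text
    n_total = n_rest.sum()        # N' in the text
    endpoint_urn = ((k_rest[:, i] + beta) * (k_rest[:, j] + beta)
                    / ((2 * n_rest + 1 + num_nodes * beta)
                       * (2 * n_rest + num_nodes * beta)))
    # C(n, alpha) = n for occupied components and alpha for the empty one.
    component_urn = np.where(n_rest > 0, n_rest, alpha_dp) / (n_total + alpha_dp)
    return endpoint_urn * component_urn


# Toy state: one occupied component holding a single link between nodes
# 0 and 1, plus one empty component; three nodes in total.
k_rest = np.array([[1, 1, 0], [0, 0, 0]])
n_rest = np.array([1, 0])
u = dp_link_weights(k_rest, n_rest, i=0, j=1, beta=0.5, alpha_dp=1.0)
rng = np.random.default_rng(0)
new_component = rng.choice(len(u), p=u / u.sum())
```

In a full sampler the normalized draw would be replaced by the tree-based sampling described in the main text, and the counts would be decremented before and incremented after each draw.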
More formally, one can first write the joint distribution of the ICMc model with an unspecified component prior $p(\mathcal{Z} \mid \alpha)$,

$$p(\mathcal{L}, \mathcal{Z}, m \mid \alpha, \beta) = p(\mathcal{L}, m \mid \mathcal{Z}, \beta)\, p(\mathcal{Z} \mid \alpha) = Z^{-1}(\beta) \prod_{iz} m_{zi}^{k_{zi} + \beta - 1} \times p(\mathcal{Z} \mid \alpha),$$

integrate $m$ out, and then substitute the Dirichlet process prior (e.g., Dahl 2003), obtainable from the Blackwell-MacQueen urn model by induction, to end up with

$$p(\mathcal{L}, \mathcal{Z} \mid \alpha_{\mathrm{DP}}, \beta) = Z^{-1}(\beta) \prod_z \frac{\prod_i \Gamma(k_{zi} + \beta)}{\Gamma(2 n_z + M\beta)} \times \frac{\alpha_{\mathrm{DP}}^{K}\, \Gamma(\alpha_{\mathrm{DP}})}{\Gamma(\alpha_{\mathrm{DP}} + N)} \prod_z \Gamma(n_z). \tag{15}$$

The sampling rule (14) can then be obtained by computing the probability of one (removed) link given all others, just as in the case of a finite Dirichlet prior.

Collapsed Gibbs sampling for the SSN-LDA model. The collapsed sampler is identical to that in Griffiths and Steyvers (2004), also presented by Zhang et al. (2007). The collapsed sampling formula for SSN-LDA with the DP prior is obtained analogously to ICMc, by modifying the factor corresponding to the latent-component urn.

References

E. M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing. Mixed membership stochastic blockmodels. Journal of Machine Learning Research, 2008. In press.

Reka Albert and Albert-Laszlo Barabasi. Statistical mechanics of complex networks. Reviews of Modern Physics, 74(1):47–97, 2002.

Janne Aukia. Bayesian clustering of huge friendship networks. Master's thesis, Laboratory of Computer and Information Science, Helsinki University of Technology, Espoo, Finland, 2007.

D. Blackwell and J. B. MacQueen. Ferguson distributions via Polya urn schemes. Annals of Statistics, 1:353–355, 1973.

D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

D. M. Blei and M. I. Jordan. Variational methods for the Dirichlet process. In Proceedings of the 21st International Conference on Machine Learning (ICML), 2004.

James L.
Blue, Isabel Beichl, and Francis Sullivan. Faster Monte Carlo simulations. Physical Review E, 51(2):867–868, 1995.

Ulrik Brandes, Daniel Delling, Marco Gaertler, Robert Görke, Martin Hoefler, Zoran Nikoloski, and Dorothea Wagner. Maximizing modularity is hard. ArXiv e-prints, 2006. arXiv:physics/0608255v2.

W. Buntine and A. Jakulin. Applying discrete PCA in data analysis. In M. Chickering and J. Halpern, editors, Proc. UAI'04, 20th Conference on Uncertainty in Artificial Intelligence, pages 59–66. AUAI Press, 2004.

Wray L. Buntine. Variational extensions to EM and multinomial PCA. In Tapio Elomaa, Heikki Mannila, and Hannu Toivonen, editors, Proceedings of the 13th European Conference on Machine Learning (ECML'02), volume 2430 of Lecture Notes in Computer Science, pages 23–34, London, UK, 2002. Springer.

J. J. Daudin, F. Picard, and S. Robin. A mixture model for random graphs: A variational approach. Technical Report 4, Statistics for systems biology group, INRA, Jouy-en-Josas, France, 2007.

Santo Fortunato and Marc Barthelemy. Resolution limit in community detection. Proceedings of the National Academy of Sciences USA, 104(1):36–41, 2007.

Santo Fortunato and Claudio Castellano. Community structure in graphs. ArXiv e-prints, 2007.

S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721–741, 1984.

M. Girvan and M. E. J. Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences USA, 99(12):7821–7826, 2002.

Thomas L. Griffiths and Mark Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences USA, 101 Suppl 1:5228–5235, 2004.

Mark S. Handcock, Adrian E. Raftery, and Jeremy M. Tantrum. Model-based clustering for social networks.
Journal of the Royal Statistical Society Series A, 170(2):301–354, 2007.

Jake M. Hofman and Chris H. Wiggins. A Bayesian approach to network modularity. ArXiv e-prints, 2007.

Thomas Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42:177–196, 2001.

N. L. Johnson. Urn models and their applications. John Wiley and Sons, 1977.

Jussi M. Kumpula, Jari Saramaki, Kimmo Kaski, and Janos Kertesz. Limited resolution in complex network community detection with Potts model approach. The European Physical Journal B, 56(1):41–45, 2007.

Paul Lazarsfeld and Robert K. Merton. Friendship as a social process: A substantive and methodological analysis. In Freedom and Control in Modern Society, pages 18–66. Van Nostrand, New York, USA, 1954.

Alaina Michaelson and Noshir S. Contractor. Structural position and perceived similarity. Social Psychology Quarterly, 55(3):300–310, 1992.

Radford M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265, 2000.

M. E. J. Newman. The structure and function of complex networks. SIAM Review, 45(2):167–256, 2003.

M. E. J. Newman. Modularity and community structure in networks. Proceedings of the National Academy of Sciences USA, 103:8577–8582, 2006.

M. E. J. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical Review E, 69(2):026113, 2004.

M. E. J. Newman and E. A. Leicht. Mixture models and exploratory data analysis in networks. ArXiv e-prints, 2006. arXiv:physics/0611158.

A. E. Raftery, M. A. Newton, J. M. Satagopan, and P. Krivitsky. Estimating the integrated likelihood via posterior simulation using the harmonic mean identity (with discussion). In J. M. Bernardo et al., editor, Bayesian Statistics, volume 8, pages 1–45. Oxford University Press, 2007.

Prithviraj Sen and Lise Getoor.
Link-based classification. Technical Report CS-TR-4858, University of Maryland, College Park, USA, 2007.

Janne Sinkkonen, Janne Aukia, and Samuel Kaski. Inferring vertex properties from topology in large networks. In Working Notes of the 5th International Workshop on Mining and Learning with Graphs (MLG'07), Florence, Italy, 2007. Universita degli Studi di Firenze. Extended Abstract.

S. Tavare and W. J. Ewens. The Ewens sampling formula. In Multivariate discrete distributions. John Wiley & Sons, New York, USA, 1997.

S. Wasserman and K. Faust. Social Network Analysis: Methods and Applications. Cambridge University Press, Cambridge, UK, 1994.

Mark Allen Weiss. Data Structures and Algorithm Analysis in Java. Addison-Wesley, Reading, USA, 1998.

C. K. Wong and M. C. Easton. An efficient method for weighted sampling without replacement. SIAM Journal on Computing, 9(1):111–113, 1980.

Zhao Xu, Volker Tresp, Shipeng Yu, Kai Yu, and Hans-Peter Kriegel. Fast inference in infinite hidden relational models. In Working Notes of the 5th International Workshop on Mining and Learning with Graphs (MLG'07), Florence, Italy, 2007. Universita degli Studi di Firenze. Extended Abstract.

Wayne W. Zachary. An information flow model for conflict and fission in small groups. Journal of Anthropological Research, 33(4):452–473, 1977.

Haizheng Zhang, Baojun Qiu, C. Lee Giles, Henry C. Foley, and John Yen. An LDA-based community structure discovery approach for large-scale social networks. In Intelligence and Security Informatics (ISI) 2007, pages 200–207. IEEE, 2007.
