A maximum entropy approach to separating noise from signal in bimodal affiliation networks

A maxim um en trop y approac h to separating noise from signal in bimo dal aﬃliation net w orks Na vid Dianati ∗ The L azer L ab, Northe astern University, Boston Massachusetts. and Institute for Quantitative So cial Scienc es, Harvar d University, Cambridge Massachusetts. In practice, many empirical net w orks, including co-authorship and collo cation net works are uni- mo dal pro jections of a bipartite data structure where one lay er represents entities, the second la yer consists of a num ber of sets represen ting aﬃliations, attributes, groups, etc., and an interla y er link indicates membership of an entit y in a set. The edge weigh t in the unimo dal pro jection, whic h w e refer to as a c o-o c curr enc e network , counts the n um b er of sets to whic h b oth end-no des are link ed. In terpreting such dense netw orks requires statistical analysis that takes in to account the bipartite structure of the underlying data. Here we develop a statistical signiﬁcance metric for such net works based on a maximum entrop y n ull model which preserves b oth the frequency sequence of the individuals/entities and the size sequence of the sets. Solving the maximum entrop y problem is reduced to solving a system of nonlinear equations for whic h fast algorithms exist, thus eliminating the need for exp ensive Monte-Carlo sampling techniques. W e use this metric to prune and visualize a n um b er of empirical net w orks. I. INTR ODUCTION Man y in teger w eighted graphs derived from empirical data are so-called c o-o c curr enc e graphs: an edge weigh t coun ts the num b er of times the tw o end no des where ob- serv ed to share a prop erty . Most abstractly , this shared prop ert y can b e modeled as mem b ership in some un- ordered set. F or instance, mem b ership in the same team or group, aﬃliation with an institution, shared physical attributes, or words app earing in the same do cument. Suc h net works ha ve b een studied in v arious contexts in- cluding co-attendance in so cial ev ents [1], net works of co- starring actors [2] congressional bill co-sp onsorship net- w orks [3, 4]. F ormally , giv en a set S = { s 1 , s 2 , · · · , s m } of symbols or en tities, the data consists of an arbitrary num b er of subsets of S : D = { u j } n j =1 , u j ⊂ S j = 1 , · · · n. (1) In its simplest form, a giv en entry u j is simply an un- ordered set deﬁning a symmetric relationship b etw een ev ery pair of its elements. A weigh ted graph may then b e deﬁned with v ertex set V ≡ S where a w eighted edge b et w een t wo no des counts the num b er of subsets u j con- taining b oth no des. In the context of natural language pro cessing, the subsets u j are commonly referred to as do cuments and their elements as wor ds or symb ols. The set S is sometimes referred to as the L exic on. W e will use this terminology in the rest of this pap er. Dep ending on the nature of the data, more sp ecialized w ays of constructing a graph ma y b e desirable. F or in- stance, a do cumen t may con tain an internal order in whic h case diﬀeren t pairs of symbols within the do cu- men t may need to b e assigned diﬀerent w eights. In this pap er we will consider the most generic case of unordered sets and homogeneous co-occurrence w eights. The data can be abstracted as a bipartite net work where the vertex set for one lay er consists of the set S of all sym b ols and the vertex set of the second la yer is the set of all sets u α , α = 1 , · · · n. An edge b et w een a sym- b ol s i and a set u α denotes the relation s i ∈ u α . Let g α = | u α | , α = 1 , 2 , · · · n denote the size of the set u α , and f i , i = 1 , 2 , · · · m the frequency of the symbol s i in the entire dataset. In the bipartite graph, these tw o se- quences are then simply the degrees of the corresp onding no des in the ﬁrst and second lay ers resp ectively and we trivially hav e P i f i = P α g α = N . The co-o ccurrence net work is then a weigh ted pr oje ction of this bipartite graph onto the la yer consisting of the symbols or enti- ties. See ﬁgure 2 for an example. The question of most practical interest is how one can extract statistically meaningful substructures in the co- o ccurrence net w ork. These structures—which are b e- liev ed to b e obscured by an abundance of noisy edges— are sometimes called the b ackb one of the net work and the remo v al of insigniﬁcant edges in the hop e of uncov ering them is referred to as pruning. Most commonly , prun- ing is p erformed by weight thr esholding, i.e., remo ving the edges with weigh ts below a desired threshold from the graph. This is a naive approach as it results in the loss of the m ultiscale structure of the graph. F or nativ ely unimo dal netw orks, other statistically inspired metho ds hav e been prop osed including the disp arity ﬁlter [5], the GLOSS ﬁlter [6], and the mar ginal likeliho o d ﬁl- ter (MLF) [7]. These metho ds are statistically informed since they form ulate generative null models and then iden tify features in the observed net work least exp ected to hav e o ccurred due to pure chance according to the n ull model. Similar metho dologies hav e also b een pro- p osed for bimo dal netw orks of the kind w e are concerned with in this pap er, including the ﬁxe d de gr e e se quenc e mo del (FDSM)[8] and sto chastic de gr e e se quenc e mo del (SDSM) [9]. These latter metho ds employ random null 2 mo dels that preserve the degree sequences of the no des in b oth lay ers (corresp onding to the frequency sequence of the symbols and the size sequence of the sets) with FDSM doing so exactly and SDSM on a verage. In this pap er we prop ose a random graph ensemble also based on the same in tuition as the SDSM—namely pre- serving the exp ectation v alue of the full degree sequence of the graph—and its resulting signiﬁcance test. Our metho dology diﬀers from the SDSM in imp ortan t wa ys. Firstly , the SDSM generates realizations of the random graph ensemble b y sampling each p ossible edge in the bipartite graph according to a Bernoulli process whose probabilit y is determined by solving a regression mo del suc h that the exp ectation v alue of each no de’s degree matc hes the corresp onding degree in the observed graph with reasonable precision. While this randomization pro- cess generates an ensem ble appro ximately consisten t with the desired constraint, it is not guaranteed to yield the “most random” such ensemble. By contrast, in this paper w e compute an ensem ble that is in fact guaran teed to b e the most random (i.e., the m ost unbiased) one satisfying the constrain t, by solving a maximum en tropy problem. Secondly , the test statistics of the SDSM are computed b y sampling the graph ensemble and deriving empirical n ull distributions for the co-o ccurrence edges based on the obtained sample. The accuracy of the test statistics is thus critically dep endent on the sample size, making it computationally exp ensive to pro duce reliable results. W e, on the other hand, derive test statistics that can b e computed exactly , or otherwise with high precision with- out the need to sample the ensemble. I I. UNWEIGHTED CO-OCCURRENCE NETW ORKS Let us fo cus on the case where the link b etw een a sym- b ol and a set is unw eigh ted, that is, a sym b ol either app ears in a set or it doesn’t. W e m ust form ulate a randomization pro cess whereby some set of meaningful and presumably robust features of the observed graph are preserved on av erage but the graph is randomized otherwise. W e c hoose to preserv e the degree sequences of both lay ers, one corresp onding to the frequencies of the symbols throughout the data set, and the other cor- resp onding to the size sequence of the sets to which the sym b ols can b e related by membership. At ﬁrst glance, this problem app ears to b e simply a bipartite analogue of the Mar ginal Likeliho o d Filter [7] where for a given uni- mo dal, integer-w eigh ted ev en t-counting net work, a set of indep enden t assignment even ts are distributed randomly b et w een all p ossible no de pairs such that the degree se- quence is preserv ed on a v erage. But the present problem is diﬀerent for t wo reasons: 1) the inter-la yer edges can- not b e mo diﬁed indep endently of one another since the Figure 1 Figure 2: Example of a co-occurrence netw ork compiled from a bimo dal entit y-aﬃliation graph. co-mem b ership relation is transitive, and 2) randomly distributing assignmen t even ts w ould allow for multi- edges. W e ma y then simply demand that a giv en pair s i , u α b e connected with probability f i g α / N which do es indeed lead to the correct expectation v alue for b oth de- grees. How ev er, there is no guarantee that this quan tity is even a probabilit y . Instead, we deriv e a maximum en- tr opy ensemble with the desired constrain ts, hoping to b e able to compute the marginal probabilit y distribu- tions for all ( i, α ) edges, leading to a simple marginal signiﬁcance test similar to [7]. Let us deriv e a maximum en tropy graph ensemble that preserv es b oth the set sizes g α and sym b ol frequencies f i on av erage. The probability distribution for this en- sem ble is given by an exp onen tial where our m + n linear constrain ts h P α σ iα i = f i , i = 1 , 2 , · · · m and h P i σ iα i = g α , α = 1 , 2 , · · · n are enforced by Lagrange m ultipliers λ i , i = 1 , 2 , · · · m and γ α , α = 1 , 2 , · · · n. P ( G ) ∼ exp " X i λ i X α σ iα + X α γ α X i σ iα # = exp   X i,α ( λ i + γ α ) σ iα   (2) where σ iα is either zero or one, indicating whether nodes i and α from the ﬁrst and second lay ers resp ectiv ely are 3 (a) (b) (c) Figure 3: (a) The numerically computed connection probabilities betw een the tw o lay ers for the senate co-sp onsorship net w ork of the 110th US congress. Plotted against the “naive” probabilit y . Note the highly nonlinear dep endence. (b) Numerically solved x i y α as a function of the ﬁrst order guess. In b oth plots, a cluster of p oin ts stand out, far abov e the bulk. These corresp ond to pairs with g α ' n, i.e., sets that are connected to almost every en tity , leading to near certain exp ected connectivity according to the null model. (c) p- v alue vs weigh t for edges in the co-o ccurrence graph. connected. Therefore, the partition function is given b y Z = X { σ iα } e P i,α ( λ i + γ α ) σ iα (3) = X { σ iα } Y i,α e ( λ i + γ α ) σ iα (4) = Y i,α h 1 + e ( λ i + γ α ) i . (5) No w we enforce the constrain ts and compute the La- grange multipliers: f j = ∂ log Z ∂ λ j and g β = ∂ log Z ∂ γ β . (6) Th us, f j = X α e λ j + γ α 1 + e λ i + γ α , (7) g β = X i e λ i + γ β 1 + e λ i + γ β . (8) Deﬁning x i = e λ i and y α = e γ α , our problem is reduced to the solution of the follo wing system of nonlinear equa- tions: y α = g α / X i x i 1 + x i y α , α = 1 , · · · , n (9) x i = f i / X α y α 1 + x i y α , i = 1 , · · · , m. (10) Note that these equations are basically telling us that according to the maxim um en tropy scheme, the “o ccu- pation” probability of each of the p ossible edges betw een the ﬁrst and second lay ers should hav e a logistic form: p iα = e λ i + γ α 1 + e λ i + γ α = x i y α 1 + x i y a . (11) I II. SOL VING THE SADDLEPOINT EQUA TIONS It is not clear whether one can ﬁnd a closed-form solution for equations (9) and (10). How ever, one can solve them n umerically using iterativ e metho ds. In (9) and (10) we ha ve already written the system of the equations in the form: x i = φ i ( { x j } , { y β } ) , y α = ψ α ( { x j } , { y β } ) . (12) The solution is therefore the ﬁxed point of the system of transformations deﬁned by φ i , i = 1 , · · · , m and ψ α , α = 1 , · · · n. In order to compute the ﬁxed p oint, we start with initial guesses for each of the x i and y α and iterate the follo wing equations un til conv ergence: x [ k +1] i = φ i n x [ k ] j o , n y [ k ] β o , (13) y [ k +1] α = ψ α n x [ k ] j o , n y [ k ] β o (14) where the sup erscript indexes the current step in the it- eration. As the initial v alues, w e use x i = f i / √ N and y α = g α / √ N which corresp ond to the ﬁrst order solu- tion in terms of x i y α . Figure 3 shows the results from the numerical solution of these equations for the US sen- ate cosp onsorship data with m = 3613 , n = 102 , such that the co-o ccurrence graph has 5151 edges. F or details of this data and further discussion, see section VI. IV. CO-OCCURRENCE NETW ORK Ha ving computed the null model’s inter-la yer connec- tion probabilities, w e now proceed to derive the prob- abilit y distribution for the co-o ccurrence w eight of pairs 4 (a) (b) Figure 4: The tw o largest connected components of the US senate bill co-sponsorship net work for the 110th congress, pruned do wn to netw ork densit y 2 using a) weigh t thresholding, b) using our bimo dal signiﬁcance metric. Here, node colors indicate part y mem b ership. of symbols (ﬁrst lay er no des). The quantit y of interest is the probability distribution for the random v ariables M ( s i , s j ) deﬁned as follo ws: M ( s i , s j ) : = n umber of nodes in the second (15) la yer link ed b oth to s i and s j . The exp ectation v alue of this random v ariable is given b y E [ M ( s i , s j )] = n X α =1 p iα p j α (16) If x i 6 = x j , the summand simpliﬁes to p iα p j α = x i x j y α x i − x j  1 1 + x j y α − 1 1 + x i y α  (17) and thus, the sum ov er α b ecomes E [ M ( s i , s j )] = X α p iα p j α = 1 x i − x j [ x i f j − x j f i ] for x i 6 = x j . (18) If x i = x j , how ever, this simpliﬁcation is not v alid and w e must compute the full sum E [ M ( s i , s j )] = X α p iα p j α = X α x 2 i y 2 α (1 + x i y α ) 2 for x i = x j . (19) Similarly , we may compute the v ariance: V ar [ M ( s i , s j )] = X α p iα p j α (1 − p iα p j α ) . (20) Using these expressions we compute and store E [ M ( s i , s j )] and V ar [ M ( s i , s j )] once for every edge in the observed graph. This op eration has time complexity O ( n | E | ) where | E | is the size of the edge set of the co-o ccurence net work. V. DISTRIBUTION OF THE CO-OCCURRENCE WEIGHTS The ﬁnal step is to estimate the probability distribution of M ( s i , s j ) so that a p -v alue ma y b e computed for each edge. Note that M ( s i , s j ) is the sum of n binary indi- cator v ariables each indicating whether s i and s j “co- o ccurred” in a set u α . Therefore, for large n, by the cen- tral limit theorem, we exp ect the distribution to approac h a normal distribution. Ho wev er, in general this appro x- imation do es not yield accurate results. T o b e precise, the sum of independent and diﬀeren t Bernoulli v ariables is kno wn as the Poisson binomial distribution. Simple closed form expressions of the p df and cdf for this dis- tribution aren’t kno wn, but v arious approximations as w ell as exact, alb eit computationally exp ensiv e numeri- cal estimation metho ds are known [10, 11]. Here we use the so-called r eﬁne d normal appr oximation (RNA) due to V olko v a [12] whic h is a modiﬁcation of the normal ap- pro ximation. See App endix A for details. Giv en the cdf F ij ( k ) for the n ull distribution of the w eight b et w een no des i, j in the co-o ccurrence graph and an ob- serv ed weigh t w ij , we compute the pv alue π ij π ij ( w ij ) = 1 − F ij ( w ij ) (21) and deﬁne the signiﬁcance metric as − log ( π ij ( w ij )) . 5 VI. APPLICA TION TO DA T A In this section we presen t the results of the application of the ﬁlter to the senate bill cosponsorship in the 110th US congress (2007-2008). The data is from [13, 14] and con tains a list of all bills introduced in the senate and for eac h one, the list of senators who cosp onsored the bill. Aside from its original sp onsor, a bill can also b e cosp on- sored by an arbitrary n umber of other senators. Sen- ators cosp onsor bills for a v ariety of reasons, including partisan allegiance, lending supp ort and forming strate- gic alliances, and simply increasing their own visibilit y and p erceiv ed p olitical clout. Regardless, b eing cosp on- sors of a given bill is a signal of aﬃnity as regards the legislativ e pro cess. The data then consists of a bipartite graph where the no des represent the senators in the ﬁrst la yer and the bills in the second lay er and an inter-la yer link indicates cosp onsorship of a bill by a senator. The co-sp onsorhip net work is then the pro jection of this bi- partite graph onto the ﬁrst lay er. The full co-o ccurence net work consists of 102 no des and more than 5000 edges, a rather dense graph with no visible structure. Figure 4 sho ws this netw ork pruned using naiv e w eigh t threshold- ing as well as our signiﬁcance measure. Each graph sho ws the giant comp onent as w ell as the next largest connected comp onen t of the graph after it is pruned down to net- w ork density 2 using eac h pruning sc heme. The graph on the left sho ws a cluster of mostly Demo crats with the rest of the graph more or less disintegrated. The one on the right on the other hand, sho ws most of the no des connected through the giant comp onent, which demon- strates a highly mo dular communit y structure reﬂecting the main partisan division with the senate. Both ﬁgures are rendered using the Kamada-Kaw ai graph la yout, a p opular force directed lay out algorithm. Figure 5 com- pares weigh t thresholding and the bimo dal ﬁlter. On the left, the size of the gian t components truncated at v arious net work densities are compared. With our signiﬁcance measure, the gian t component already contains ab out 80% of all the nodes at densit y 2 and nearly all at density 4, whereas w eigh t thresholding leav es the graph rather disin tegrated up to high densities: a rather small giant comp onen t, with the rest of the nodes scattered ac ross singletons and otherwise v ery small comp onen ts. The ﬁgure on the right compares the t wo methods in terms of their abilit y to reveal the partisan divide within the sen- ate. Given the kno wn part y mem b erships of US senators, w e computed the mo dularity scores of the pruned graphs at diﬀeren t truncation lev els, b oth for w eight threshold- ing as w ell as our bimodal ﬁltering tec hnique. The mo d- ularit y of graphs resulting from our ﬁltering technique is consisten tly and signiﬁcan tly higher than those produced b y w eigh t thresholding, showing that the partisan divide is manifest m uc h more clearly with the application of our ﬁlter. 1.0 1.5 2.0 2.5 3.0 3.5 4.0 Network density 20 30 40 50 60 70 80 90 100 Number of nodes significance weight (a) 0 1000 2000 3000 4000 5000 Number of edges 0.0 0.1 0.2 0.3 0.4 0.5 Modularity significance weight (b) Figure 5: a) The size of the giant comp onen t for the US senate co-sponsorship net work pruned do wn to v arious net work densities using the bimo dal signiﬁcance measure as w ell as weigh t thresholding. b) The modularity of the same net works according to party membership. App endix A: Reﬁned normal approximation In this app endix we describe the reﬁned normal approx- imation for the cdf of the Poisson binomial distribution due to V olko v a [12]. F or the sum of n indep enden t Bernoulli random v ariables with means p i, i = 1 , 2 , · · · n, The cdf, F ( k ) is appro ximately given b y F ( k ) ≈ G  k + 0 . 5 − µ σ  , k = 0 , 1 , · · · , n (A1) where G ( x ) = Φ( x ) + γ (1 − x 2 ) φ ( x ) / 6 , (A2) φ ( x ) ,Φ( x ) are the p df and cdf of the standard normal distribution resp ectiv ely and γ = σ − 3 η where η = n X j =1 p j (1 − p j )(1 − 2 p j ) (A3) So, the ingredien ts necessary for this computation are the follo wing: µ = n X i =1 p i (A4) σ 2 = n X i =1 p i (1 − p i ) (A5) η = n X i =1 p j (1 − p j )(1 − 2 p j ) (A6) φ ( x ) = 1 √ 2 π e − x 2 / 2 (A7) Φ( x ) = 1 2  1 + erf  x √ 2  (A8) ∗ n.dianatimaleki@neu.edu 6 [1] A. Da vis, B. B. Gardner, and M. R. Gardner, De ep South: A social anthr op olo gic al study of c aste and class . Univ of South Carolina Press, 2009. [2] D. J. W atts and S. H. Strogatz, “Collective dynamics of ’small-world’net works,” natur e , vol. 393, no. 6684, pp. 440–442, 1998. [3] J. H. F owler, “Connecting the congress: A study of cosp onsorship netw orks,” Politic al Analysis , v ol. 14, no. 4, pp. 456–487, 2006. [4] J. H. F owler, “Legislative cosp onsorship net works in the us house and senate,” So cial Networks , v ol. 28, no. 4, pp. 454–465, 2006. [5] M. ´ A. Serrano, M. Bogu˜ n´ a, and A. V espignani, “Extract- ing the multiscale backbone of complex weigh ted net- w orks,” Pr o c e e dings of the National A c ademy of Scienc es , v ol. 106, pp. 6483–6488, Apr. 2009. [6] F. Radicchi, J. Ramasco, and S. F ortunato, “Information ﬁltering in complex weigh ted net works,” Physic al Review E , v ol. 83, p. 046101, Apr. 2011. [7] N. Dianati, “Unwinding the hairball graph: Pruning al- gorithms for weigh ted complex net works,” Physic al R e- view E , v ol. 93, p. 012304, Jan. 2016. [8] M. Latapy , C. Magnien, and N. D. V ecchio, “Basic no- tions for the analysis of large tw o-mo de net works,” So cial Networks , v ol. 30, no. 1, pp. 31 – 48, 2008. [9] Z. Neal, “The backbone of bipartite pro jections: In- ferring relationships from co-authorship, co-sponsorship, co-attendance and other co-b ehaviors,” So cial Networks , v ol. 39, pp. 84–97, Oct. 2014. [10] M. F ernandez and S. Williams, “Closed-form expression for the p oisson-binomial probabilit y density function,” IEEE T r ansactions on A er osp ac e and Ele ctr onic Systems , v ol. 46, pp. 803–817, April 2010. [11] Y. Hong, “On computing the distribution function for the p oisson binomial distribution,” Computational Statistics & Data Analysis , vol. 59, pp. 41 – 51, 2013. [12] A. Y. V olko v a, “A Reﬁnement of the Central Limit The- orem for Sums of Independent Random Indicators,” The- ory of Pr ob ability & Its Applications , vol. 40, no. 4, pp. 791–794, 1996. [13] J. H. F owler, “Connecting the Congress: A Study of Cosp onsorship Netw orks,” Politic al Analysis , vol. 14, pp. 456–487, Sept. 2006. [14] J. H. F owler, “Legislative cosp onsorship net works in the US House and Senate,” So cial Networks , vol. 28, pp. 454– 465, Oct. 2006.

A maximum entropy approach to separating noise from signal in bimodal affiliation networks

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment