Network as a computer: ranking paths to find flows

We explore a simple mathematical model of network computation, based on Markov chains. Similar models apply to a broad range of computational phenomena, arising in networks of computers, as well as in genetic, and neural nets, in social networks, and…

Authors: Dusko Pavlovic

Dusko Pavlovic ⋆
Oxford University and Kestrel Institute

Abstract. We explore a simple mathematical model of network computation, based on Markov chains. Similar models apply to a broad range of computational phenomena, arising in networks of computers, as well as in genetic and neural nets, in social networks, and so on. The main problem of interaction with such spontaneously evolving computational systems is that the data are not uniformly structured. An interesting approach is to try to extract the semantical content of the data from their distribution among the nodes. A concept is then identified by finding the community of nodes that share it. The task of data structuring is thus reduced to the task of finding the network communities, as groups of nodes that together perform some non-local data processing. Towards this goal, we extend the ranking methods from nodes to paths, which allows us to extract information about the likely flow biases from the available static information about the network.

1 Introduction

Initially, Web search was developed as an instance of information retrieval, optimized for a particularly large distributed database. With the advent of online advertising, Web search got enhanced by a broad range of information supply techniques, where the search results are expanded by additional data, extrapolated from the user's interests and from the search engine's stock of information. From the simple idea to match and coordinate the push and the pull of information on the Web as a new computational platform [18] sprang up a new generation of web businesses and social networks.
Similar patterns of information processing are found in many other evolutionary systems, from gene regulation, protein interaction and neural nets, through the various networks of computers and devices, to the complex social and market structures [15].

This paper explores some simple mathematical consequences of the observation that the Web, and similar networks, are much more than mere information repositories: besides storing, retrieving, and supplying information, they also generate and process information. We pursue the idea that the Web can be modeled as a computer, rather than a database; or more precisely, as a vast multi-party computation [6], akin to a market place, where masses of selfish agents jointly evaluate and generate public information, driven by their private utilities. While this view raises interesting new problems across the whole gamut of Computer Science, the most effective solutions, so far, of the problem of semantical interactions with the Web computations were obtained by rediscovering and adopting the ranking methods, deeply rooted in the sociometric tradition [11,10], and adapting them for use on very large indices, leading to the whole new paradigm of search [19,12,13]. The idea of the Web as a computer is tacitly present already in this paradigm, in the sense that the search rankings are extracted from the link structure, and other intrinsic information, generated on the Web itself, rather than stored in it.

Outline of the paper. In section 2 we introduce the basic network model, and describe a first attempt to extract information about the flows through a network from the available static data about it. In sections 3 and 4, we describe the structure which allows us to lift the notion of rank, described in section 5, to path networks in section 6.
Ranking paths allows us to extract a random variable, called attraction bias, which allows measuring the mutual information of the distributions of the inputs and the outputs of the network computation. This mutual information can be viewed as an indicator of non-local information processing that takes place in the given network. In the final section, we describe how the obtained data can be used to detect semantical structures in a network. The experimental work necessary to test the practical effectiveness of the approach is left for future work.

⋆ Email: dusko@{kestrel.edu,comlab.ox.ac.uk}. Supported by EPSRC and ONR.

2 Networks

Basic model. We view a network as an edge-labelled directed graph

$$A = \left( R \xleftarrow{\ \gamma\ } E \overset{\delta}{\underset{\varrho}{\rightrightarrows}} N \right),$$

where $N$ and $E$ are, respectively, the finite sets of nodes, and links, or edges, whereas $R$ is an ordered field of values (in some applications an ordered ring of functions). A link $i \xrightarrow{e} j$ is thus an element $e \in E$ with $\delta(e) = i$ and $\varrho(e) = j$. The value $\gamma(e)$ is the cost (when positive), or payoff (when negative) of the traffic over $e$. These data induce the adjacency matrix $E = (E_{ij})_{N \times N}$ and the capacity matrix $A = (A_{ij})_{N \times N}$, with the entries

$$E_{ij} = \{ e \in E \mid i \xrightarrow{e} j \} \qquad\qquad A_{ij} = \sum_{e \in E_{ij}} A_e,$$

where $A_e = 2^{-\gamma(e)}$ is the capacity of the link $e$.

Remark. The term "capacity" is used here as in network flow theory.¹ The cost or the payoff of a link may represent its value in a pay-per-click model of a fragment of the Web; or it may denote the proximity of the web pages within the same site, or within a group of interconnected sites. In a protein network, the energy cost or payoff may be derived from the chemical affinities between the nodes. While this parameter can be abstracted away, simply by taking $\gamma(e) = 0$ for all links $e$, its role will become clear in sections 3 and 6, where it allows discounting and eliminating some paths.
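As a minimal sketch of the basic model, assuming that the links are given as (source, target, cost) triples, the capacity matrix can be assembled by adding up the capacities $2^{-\gamma(e)}$ of parallel links. The function name `capacity_matrix` and the toy data are hypothetical and serve only as an illustration.

```python
def capacity_matrix(nodes, links):
    """Build the capacity matrix A from (source, target, cost) triples.

    Parallel links between the same pair of nodes are allowed; their
    capacities 2 ** (-cost) are added up, as in A_ij = sum of A_e.
    """
    index = {node: k for k, node in enumerate(nodes)}
    A = [[0.0] * len(nodes) for _ in nodes]
    for source, target, cost in links:
        A[index[source]][index[target]] += 2.0 ** (-cost)
    return A


# Hypothetical toy network: two parallel links a -> b (costs 0 and 1),
# one costly link b -> c, and one link c -> a with a payoff (negative cost).
nodes = ["a", "b", "c"]
links = [("a", "b", 0), ("a", "b", 1), ("b", "c", 2), ("c", "a", -1)]
A = capacity_matrix(nodes, links)
```

In this toy network a zero cost gives capacity 1, a positive cost gives a capacity below 1, and a payoff gives a capacity above 1.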
Basic dynamics. The simplest model of network dynamics is based on the assumption that the traffic flows are distributed proportionally to the link capacities. Randomly sampling the Web traffic, we shall thus find a surfer on a link $e$ with the probability

$$\alpha_e = \frac{A_e}{A_\bullet}, \quad\text{where}\quad A_\bullet = \sum_{f \in E} A_f.$$

In order to find the communities in a network, we need to detect the traffic biases between its nodes. We assume that the traffic between the nodes within the same community will be higher than the capacity of the links between them would lead us to expect; and that the traffic between the different communities will be lower than expected. To measure such traffic biases, we normalize the capacity matrix $A$ to get the capacity distribution $\alpha = (\alpha_{ij})_{N \times N}$ as

$$\alpha_{ij} = \frac{A_{ij}}{A_{\bullet\bullet}}, \quad\text{where}\quad A_{\bullet\bullet} = \sum_{k,\ell \in N} A_{k\ell}.$$

The entry $\alpha_{ij}$ is thus the probability that a random sample of traffic on $A$, following the simple dynamics proportional to capacity, will be found on a link from $i$ to $j$. On the other hand, the marginals of the probability distribution $\alpha$,

$$\alpha_{i\bullet} = \sum_{j \in N} \alpha_{ij} \qquad\qquad \alpha_{\bullet j} = \sum_{i \in N} \alpha_{ij}$$

correspond, respectively, to the probabilities that a random sample of traffic will have $i$ as its source, and $j$ as its destination. Let us call $\alpha_{i\bullet}$ the out-rank of $i$, and $\alpha_{\bullet j}$ the in-rank of $j$, because they can be viewed as the simplest, albeit degenerate cases of the notion of rank. If the in-rank and the out-rank are statistically independent, then (by the definition of independence) the probability that a random traffic sample goes from $i$ to $j$ will be $\alpha_{i\bullet}\alpha_{\bullet j}$. Their dependency is thus measured by the traffic bias matrix $\upsilon = (\upsilon_{ij})_{N \times N}$ with the entries

$$\upsilon_{ij} = \alpha_{ij} - \alpha_{i\bullet}\alpha_{\bullet j}$$

falling in the interval $[-1, 1]$. The higher the bias, the more unexpected traffic there is.
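The computation of the traffic bias can be sketched as follows, assuming that the capacity matrix is given as a list of rows. The function name `traffic_bias` and the sample matrix are hypothetical choices made for this illustration.

```python
def traffic_bias(A):
    """Return the out-ranks, the in-ranks and the traffic bias matrix of A."""
    n = len(A)
    total = sum(sum(row) for row in A)  # total capacity of the network
    alpha = [[A[i][j] / total for j in range(n)] for i in range(n)]
    out_rank = [sum(alpha[i]) for i in range(n)]  # row marginals
    in_rank = [sum(alpha[i][j] for i in range(n)) for j in range(n)]  # column marginals
    bias = [[alpha[i][j] - out_rank[i] * in_rank[j] for j in range(n)]
            for i in range(n)]
    return out_rank, in_rank, bias


# Hypothetical capacities: node 0 exchanges more traffic with node 1 than with node 2.
A = [[0.0, 3.0, 1.0],
     [3.0, 0.0, 0.0],
     [1.0, 0.0, 0.0]]
out_rank, in_rank, bias = traffic_bias(A)
```

Since the capacity distribution and the product of its marginals have the same marginals, every row and every column of the bias matrix sums to zero: a positive bias on some links is always compensated by a negative bias elsewhere.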
For a set of nodes $U \subseteq N$ the values

$$\mathit{Coh}(U) = \sum_{i,j \in U} \upsilon_{ij} \qquad\qquad \mathit{Adh}(U) = \sum_{i \in U,\, j \notin U} \left( \upsilon_{ij} + \upsilon_{ji} \right)$$

can thus be construed as the cohesion and the adhesion forces: the total traffic bias within the group, and with its exterior, respectively. A network community $U$ can thus be recognized as a set of nodes with a high cohesion and a low adhesion [16]. The idea that semantically related nodes can be captured as members of the same network communities, derived from the graph structure, is a natural extension of the ranking approach, which has been formalized in [9,17].

The only problem with applying that idea to the above model is that our initial assumption, that the traffic distribution on $A$ is proportional to its link capacities, is not very realistic. It abstracts away all traffic dynamics. On the other hand, the static network model, as given above, does not provide any data about the actual traffic. We explore the ways to solve this problem, and extract increasingly more realistic views of traffic dynamics from a static network model.

¹ The information theoretic homonym has a different, albeit related meaning, which motivates the choice of $\gamma(e) = -\log_2 A_e$.

3 Adding paths

A path $i \xrightarrow{a} j$ in a network $A$ is a sequence of links $i \xrightarrow{a_1} k_1 \xrightarrow{a_2} k_2 \to \cdots \xrightarrow{a_n} j$. In many cases of interest, traffic dynamics on a network depends on the path selections, rather than just on single links. One idea is to add the paths to the structure of a network, and to annotate how the links compose into paths, and how the paths compose into longer paths. This amounts to generating the free category [14] over the network graph. Unfortunately, adding all paths to a network usually destroys some essential information, just like the transitive closure of a relation does. E.g., in a social network, a friend of a friend is often not even an acquaintance.
Taking the transitive closure of the friendship relation obliterates that fact. Moreover, the popular "small world" phenomenon suggests that almost every two people can be related through no more than six friends of friends of friends... So already adding all paths of length six to a social network, with a symmetric friendship relation, is likely to generate a complete graph. In fact, the average probability that two of a node's neighbors in an undirected graph are also linked with each other is an important factor, called clustering coefficient [20]. On the other hand, in some networks, e.g. of protein interactions, a link $i \to k$ which shortcuts the links $i \to j \to k$ often denotes a direct feed-forward connection, rather than a composition of the two links, and leads to essentially different dynamics. So only "short" paths must be added to a network: composition must be penalized.

Definition 1. For a given network $A = \left( R \xleftarrow{\gamma} E \overset{\delta}{\underset{\varrho}{\rightrightarrows}} N \right)$, a cutoff value $v \in R$, and a composition penalty $d \in R$, we define the $v$-completion to be the network

$$A^{*v} = \left( R \xleftarrow{\ \gamma\ } E^{*v} \overset{\delta}{\underset{\varrho}{\rightrightarrows}} N \right),$$

where

$$E^{*v} = \{ a \in E^* \mid \gamma(a) \le v \}$$

$$\gamma\left( i_0 \xrightarrow{a_1} i_1 \xrightarrow{a_2} i_2 \to \cdots \xrightarrow{a_n} i_n \right) = (n-1)d + \gamma(a_1) + \cdots + \gamma(a_n)$$

and $E^*$ is the set of all nonempty paths in $A$.

Remarks. $E^*$ can be obtained as the matrix of sets $E^* = \sum_{n=1}^{\infty} E^n$, where each $E^n$ is a power of the adjacency matrix $E$. If the entry $E_{ij}$ is viewed as the set of links $\{ i \xrightarrow{e} j \}$, then the entry $E^2_{ij} = \sum_{k=1}^{N} E_{ik} \cdot E_{kj}$ corresponds to the set of 2-hop paths $\{ i \xrightarrow{e_1} k \xrightarrow{e_2} j \}$ through the various nodes $k$; the matrix $E^3$ similarly corresponds to the matrix of 3-hop paths, and so on. The $v$-closed network $A^{*v}$ is closed under the composition of low cost paths, but not if the cost is greater than $v$. It is not hard to see that the $v$-completion is an idempotent operation, i.e.
$A^{*v*v} = A^{*v}$, but it may fail to be a proper closure operation, because a link $e$ in $A$, with $\gamma(e) > v$, may lead to $A \not\subseteq A^{*v}$. In the rest of the paper, we assume that the networks are $v$-complete for some $v$, i.e. $A = A^{*v}$. This means that the relevant pathways are already represented as links, with the composition penalty absorbed in the cost.

4 Network dynamics

In order to derive network dynamics from a static network model, one first specifies the way in which the behavior of a computational agent, processing data on the network, is influenced by the network structure, and then usually derives a Markov chain that drives the traffic. The network features that influence its dynamics can then be incrementally refined, yielding more and more information.

4.1 Forward and backward

Random walks on networks are often represented in terms of the behavior of surfers on the Web, following the hyperlinks.² The simplest surfer behavior chooses an out-link uniformly at random at each node. A visitor of a node $i$ will thus proceed to a node $j$ with probability

$$A^{\triangleright}_{ij} = \frac{A_{ij}}{A_{i\bullet}}, \quad\text{where}\quad A_{i\bullet} = \sum_{k=1}^{N} A_{ik}$$

is the out-degree of $i$. The row-stochastic matrix $A^{\triangleright} = (A^{\triangleright}_{ij})_{N \times N}$ represents forward dynamics of a network $A$. The entries $A^{\triangleright}_{ij}$ are called the pull coefficients of $i$ by $j$. Dually, backward dynamics of a network $A$ is represented by a column-stochastic matrix $A^{\triangleleft} = (A^{\triangleleft}_{ij})_{N \times N}$, where the entry

$$A^{\triangleleft}_{ij} = \frac{A_{ij}}{A_{\bullet j}}, \quad\text{with}\quad A_{\bullet j} = \sum_{k=1}^{N} A_{kj}$$

denoting the in-degree of $j$, describes the probability that a surfer who is on the node $j$ came there from the node $i$. The entries $A^{\triangleleft}_{ij}$ are called the push coefficients.

Remark. Note that the capacity matrix can be normalized to get $A^{\triangleright}$ and $A^{\triangleleft}$ as above only if no rows, resp. columns, consist of 0s alone. This means that every node of the network must have at least one out-link, resp. at least one in-link.
Networks that do not satisfy this requirement need to be modified, in one way or another, in order to enable analysis. Adding a high-cost link between every two nodes is clearly the minimal perturbation (with maximal entropy) that achieves this. Alternatively, the problem can also be resolved by adjoining a fresh node, and the high-cost links in and out of it [2]. Either way, the quantitative effect of such modifications can be made arbitrarily small by increasing the cost of the added links.

4.2 Forward-out and backward-in dynamics

The next example can be interpreted in two ways, either to show how forward and backward dynamics can be refined to take into account various navigation capabilities, or how to abstract away irrelevant cycles. Suppose that a surfer searches for the hubs on the network: he prefers to follow the hyperlinks that lead to the nodes with a higher out-degree. This preference may be realized by annotating the hyperlinks according to the out-rank of their target nodes. Alternatively, the surfer may explore the hyperlinks ahead, and select those with the highest out-degree; but we want to ignore the exploration part, and simply assume that he proceeds according to the out-rank of the nodes ahead. The probability that this surfer will move from $i$ to $j$ is thus

$$A^{\blacktriangleright}_{ij} = A^{\triangleright}_{ij} \cdot \alpha_{j\bullet} = \frac{A_{ij} A_{j\bullet}}{A_{i\bullet} A_{\bullet\bullet}}$$

We call this the forward-out dynamics. In the dual, backward-in dynamics, the surfers are more likely to arrive at $j$ from $i$ if $i$ is a frequently visited node, i.e. if its in-rank is higher:

$$A^{\blacktriangleleft}_{ij} = \alpha_{\bullet i} \cdot A^{\triangleleft}_{ij} = \frac{A_{\bullet i} A_{ij}}{A_{\bullet\bullet} A_{\bullet j}}$$

These dynamics will be particularly convenient for demonstrating an example of bias analysis in section 6, because they clearly display how the simple traffic bias $\upsilon$ from section 2 can be refined by the various dynamics factors.
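The four kinds of dynamics of sections 4.1 and 4.2 can be sketched side by side, assuming a capacity matrix with no zero rows or columns. The function names and the sample matrix are hypothetical; the formulas are transcribed literally, without any additional normalization.

```python
def row_sums(A):
    return [sum(row) for row in A]


def col_sums(A):
    return [sum(row[j] for row in A) for j in range(len(A))]


def forward(A):
    """Pull coefficients: each row of A divided by its out-degree."""
    out = row_sums(A)
    return [[A[i][j] / out[i] for j in range(len(A))] for i in range(len(A))]


def backward(A):
    """Push coefficients: each column of A divided by its in-degree."""
    inn = col_sums(A)
    return [[A[i][j] / inn[j] for j in range(len(A))] for i in range(len(A))]


def forward_out(A):
    """Forward step from i to j, weighted by the out-rank of the target j."""
    out, total = row_sums(A), sum(row_sums(A))
    return [[A[i][j] * out[j] / (out[i] * total) for j in range(len(A))]
            for i in range(len(A))]


def backward_in(A):
    """Backward step from j to i, weighted by the in-rank of the source i."""
    inn, total = col_sums(A), sum(row_sums(A))
    return [[inn[i] * A[i][j] / (total * inn[j]) for j in range(len(A))]
            for i in range(len(A))]


# Hypothetical capacities; every node has an out-link and an in-link.
A = [[0.0, 2.0, 2.0],
     [1.0, 0.0, 1.0],
     [4.0, 0.0, 0.0]]
F, B = forward(A), backward(A)
FO, BI = forward_out(A), backward_in(A)
# Computed literally, FO and BI are not renormalized: their rows
# (resp. columns) need not sum to 1, unlike those of F and B.
```

The sketch refuses nothing and repairs nothing: a node without out-links or in-links would cause a division by zero, which is exactly the situation that the Remark in section 4.1 addresses by adding high-cost links.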
² The surfers deserve their name by following the "waves", i.e. obeying the same dynamics.

4.3 Teleportation and preference

The main point of formulating network dynamics, especially in the Markov chain form, is to be able to compute the node ranking as its invariant distribution. However, since the network graphs are usually not strongly connected, the Markov chains, derived from their structure, are often reducible to classes of nodes with no way out. The simplest remedy is the idea of teleportation, going back to [19]. A general interpretation is that, whichever dynamics a surfer might follow, at each node he tosses a biased coin, and with a probability $d \in (0, 1)$ follows that dynamics. Otherwise, with a probability $1 - d$, he "teleports" to a randomly chosen node, ignoring all hyperlinks and other structure. Following, say, forward dynamics, the probability that he will go from $i$ to $j$ is thus

$$A^{P}_{ij} = d A^{\triangleright}_{ij} + \frac{1 - d}{N}.$$

This is roughly the PageRank dynamics, from which the Google search engine had started [19].³ The induced dynamics is thus $A^{P} = d A^{\triangleright} + (1 - d) P$, where $P = (P_{ij})_{N \times N}$ has all entries $P_{ij} = \frac{1}{N}$. In the networks without a cost function, this is interpreted as adding a link between every two nodes. The influence of such links can be controlled using the cost functions. In any case, the resulting Markov chains become irreducible, and their stationary distributions do not get captured in any closed components. Furthermore, the model can be personalized by capturing the surfer's preferences in terms of the biases in $P$: the entries $P_{ij}$ can be interpreted as $i$'s trust in $j$ [8]. The extension of the backward dynamics by teleportation leads to different interpretations, which the reader may wish to consider on her own.
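The effect of teleportation can be illustrated by a small sketch, assuming uniform teleportation and plain power iteration. The function names, the iteration count and the value 0.85 for the coin bias are assumptions made for the example; the sample chain is hypothetical and deliberately reducible.

```python
def teleport(F, d=0.85):
    """Mix a row-stochastic matrix F with uniform teleportation.

    The default d = 0.85 is only an illustrative assumption.
    """
    n = len(F)
    return [[d * F[i][j] + (1.0 - d) / n for j in range(n)] for i in range(n)]


def stationary(P, iterations=200):
    """Approximate the distribution r with r = r P by power iteration."""
    n = len(P)
    r = [1.0 / n] * n
    for _ in range(iterations):
        r = [sum(r[i] * P[i][j] for i in range(n)) for j in range(n)]
    return r


# Hypothetical reducible forward dynamics: nodes 1 and 2 are both traps,
# so without teleportation there is no unique stationary distribution.
F = [[0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
rank = stationary(teleport(F))
```

With teleportation every entry of the transition matrix is positive, so the iteration settles on a single distribution that assigns a nonzero rank to every node, including those that the original chain could never reach.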
5 Ranking

Intuitively, the rank of a node is the probability that randomly sampled traffic will be found to visit that node. In search, this is taken as a generic relevance measure. The technical implication is that the rank can be obtained as a stationary distribution of the Markov chain capturing dynamics. Each notion of dynamics thus induces a corresponding notion of rank. Since a Markov chain can be viewed as a linear, and hence continuous transformation of the simplex of distributions, which is closed and compact, already Brouwer's fixed point theorem guarantees that the rank always exists.

Finding a meaningful, useful notion of rank is another matter. First of all, as already mentioned, networks often decompose into loosely connected subnets. In the long run, all traffic is likely to get captured in some such subnet. This results in multiple stationary distributions, each concentrated in a closed subnet, zero otherwise. Dynamics derived directly from the network graph therefore result in uninformative ranking data. In order to assure that the relevant Markov chains are irreducible and aperiodic, and thus induce unique and nondegenerate stationary distributions, network dynamics usually need to be perturbed, using a damping and stabilizing factor such as teleportation.

Another sort of problem arises when the unique stationary distribution is not an attractor, or when the rate of convergence is unfeasibly slow [7,3]. While very important in concrete applications, these problems, and their solutions, have less impact on the conceptual analyses pursued in this paper. We shall henceforth assume that all processes have been adjusted to induce unique and effectively computable ranking.⁴

5.1 Promotion and reputation

We now explain the intuition behind the simplest notions of rank.
In social terms, the push coefficient $A^{\triangleleft}_{ij} = \frac{A_{ij}}{A_{\bullet j}}$ can be interpreted as measuring how much $i$ supports (or advocates) $j$. The concept of promotion can then be formalized as a probability distribution $r^{\triangleleft}$, such that

$$r^{\triangleleft}_i = \sum_{k=1}^{N} A^{\triangleleft}_{ik}\, r^{\triangleleft}_k.$$

In words, the promotion rank (or push rank) $r^{\triangleleft}_i$ of a node $i$ is the sum of the promotion ranks $r^{\triangleleft}_k$ of its children nodes, each allocated to $i$ according to the push coefficient $A^{\triangleleft}_{ik}$, measuring $i$'s support for $k$. Dually, the pull coefficient $A^{\triangleright}_{ij}$ can be interpreted as measuring how much $i$ trusts $j$. The concept of reputation can then be formalized as a probability distribution $r^{\triangleright}$, such that

$$r^{\triangleright}_j = \sum_{k=1}^{N} r^{\triangleright}_k\, A^{\triangleright}_{kj}.$$

The reputation rank (or pull rank) $r^{\triangleright}_j$ of a node $j$ is thus the sum of the reputation ranks $r^{\triangleright}_k$ of its parent nodes, each allocated according to the pull coefficient $A^{\triangleright}_{kj}$, measuring $k$'s trust in $j$.

Gathering the promotion values in a column vector $r^{\triangleleft}$ and the reputation values in a row vector $r^{\triangleright}$, we can rewrite the definitions of $r^{\triangleleft}$ and $r^{\triangleright}$ in the matrix form

$$r^{\triangleleft} = A^{\triangleleft}\, r^{\triangleleft} \qquad\qquad r^{\triangleright} = r^{\triangleright} A^{\triangleright}$$

The refined notions of promotion $r^{\blacktriangleleft}$ and reputation $r^{\blacktriangleright}$ are defined and interpreted along the same lines, as the stationary distributions of the processes $A^{\blacktriangleleft}$ and $A^{\blacktriangleright}$ respectively.

³ The original version allowed $A^{\triangleright}_{ij}$ to be 0, if $A_{i\bullet}$ is 0, i.e. if $i$ is a "sink-hole", and the teleportation factor was added to save dynamics from such sinkholes. Other modifications were introduced later.

⁴ This implies that all notions of dynamics that we consider have a tacit damping factor. We do not display it only because it needlessly complicates formulas.

5.2 Expected flow

While dynamics of reputation has been studied for a long time [11,10], and with increased attention recently, since it became a crucial tool of Web search [19,13], the dual dynamics of promotion does not seem to have attracted much attention.
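To make the duality concrete, the reputation and the promotion ranks of section 5.1 can be approximated by two mirror-image power iterations. This is only a sketch under the assumption that the capacity matrix is strictly positive, so that both iterations converge; the function names, the iteration count and the sample capacities are hypothetical.

```python
def reputation_rank(A, iterations=500):
    """Pull rank: distribution r with r = r F, where F is A with normalized rows."""
    n = len(A)
    F = [[A[i][j] / sum(A[i]) for j in range(n)] for i in range(n)]
    r = [1.0 / n] * n
    for _ in range(iterations):
        r = [sum(r[k] * F[k][j] for k in range(n)) for j in range(n)]
    return r


def promotion_rank(A, iterations=500):
    """Push rank: distribution r with r = B r, where B is A with normalized columns."""
    n = len(A)
    col = [sum(A[k][j] for k in range(n)) for j in range(n)]
    B = [[A[i][j] / col[j] for j in range(n)] for i in range(n)]
    r = [1.0 / n] * n
    for _ in range(iterations):
        r = [sum(B[i][k] * r[k] for k in range(n)) for i in range(n)]
    return r


# Hypothetical strictly positive capacities.
A = [[1.0, 2.0, 1.0],
     [3.0, 1.0, 1.0],
     [1.0, 1.0, 4.0]]
reputation = reputation_rank(A)
promotion = promotion_rank(A)
```

The two functions differ only in which side of the matrix the rank vector multiplies and in whether rows or columns are normalized, which mirrors the row vector and column vector conventions used above.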
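We need both notions to define the expected traffic flow. The expected flow from $j$ to $k$, under the assumption that they are independent, is caused only by a "traffic pressure", resulting from the pull to $j$ and the push from $k$. Following this idea, we define

$$r^{\blacktriangleright\blacktriangleleft}_{jk} = r^{\blacktriangleright}_j\, r^{\blacktriangleleft}_k \tag{1}$$

The expected flow $r^{\blacktriangleright\blacktriangleleft}$ is thus a probability distribution over $N \times N$, which can be represented as the matrix with the entries (1), i.e. as the outer product of the vector $r^{\blacktriangleright}$, written as a column, and the vector $r^{\blacktriangleleft}$, written as a row. Since $r^{\blacktriangleright}$ and $r^{\blacktriangleleft}$ are the principal eigenvectors of $A^{\blacktriangleright}$ and $A^{\blacktriangleleft}$, $r^{\blacktriangleright\blacktriangleleft}$ is the unique distribution satisfying $r^{\blacktriangleright\blacktriangleleft} = (A^{\blacktriangleright})^{\top} \cdot r^{\blacktriangleright\blacktriangleleft} \cdot (A^{\blacktriangleleft})^{\top}$, i.e.

$$r^{\blacktriangleright\blacktriangleleft}_{jk} = \sum_{i=1}^{N} \sum_{\ell=1}^{N} A^{\blacktriangleright}_{ij}\, r^{\blacktriangleright\blacktriangleleft}_{i\ell}\, A^{\blacktriangleleft}_{k\ell}.$$

Intuitively, this means that the flow pressure from $i$ to $\ell$ propagates to cause a flow pressure from $j$ to $k$ proportionally to the force of the traffic from $i$ to $j$ and to the force of the traffic flows from $k$ to $\ell$, provided that $j$ and $k$ are independent. In order to measure their dependency, we attempt to capture how the actual flows from $i$ to $\ell$ (rather than mere flow pressure) may get diverted, say by the high costs and the low capacities, to cause actual flows from $j$ to $k$.

6 Path networks

Definition 2. Given a $v$-closed network $A = \left( R \xleftarrow{\gamma} E \overset{\delta}{\underset{\varrho}{\rightrightarrows}} N \right)$, we define the path network $\hat{A} = \left( R \xleftarrow{\gamma} \hat{E} \overset{\delta}{\underset{\varrho}{\rightrightarrows}} \hat{N} \right)$, where $\hat{N} = E$, and $\hat{E} = \sum_{a,b \in E} \hat{E}_{ab}$, with

$$\hat{E}_{ab} = \left\{ f = \langle f_0, f_1 \rangle \in E_{ij} \times E_{k\ell} \mid \gamma(f_0) + \gamma(b) + \gamma(f_1) - \gamma(a) \le v - 2d \right\} \tag{2}$$

$$\gamma(f) = 2d + \gamma(f_0) + \gamma(b) + \gamma(f_1) - \gamma(a) \tag{3}$$

[Diagram: the link $a : i \to \ell$ alongside the path $i \xrightarrow{f_0} j \xrightarrow{b} k \xrightarrow{f_1} \ell$.]

Dynamics of path selection.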
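Recalling that $\hat{A}_{ab} = \sum_{f \in \hat{E}_{ab}} 2^{-\gamma(f)}$, we define the forward and the backward dynamics, and the pull rank and the push rank just like before:

$$\hat{A}^{\triangleright}_{ab} = \frac{\hat{A}_{ab}}{\hat{A}_{a\bullet}} \qquad\qquad \hat{A}^{\triangleleft}_{ab} = \frac{\hat{A}_{ab}}{\hat{A}_{\bullet b}}$$

$$\hat{r}_b = \sum_{a \in \hat{N}} \hat{r}_a\, \hat{A}^{\triangleright}_{ab} \qquad\qquad \hat{r}^{\triangleleft}_a = \sum_{b \in \hat{N}} \hat{A}^{\triangleleft}_{ab}\, \hat{r}^{\triangleleft}_b$$

Intuitively, $\hat{A}^{\triangleright}_{ab}$ is now the probability that traffic through $a$ is diverted to $b$ (rather than to some other path); while $\hat{A}^{\triangleleft}_{ab}$ is the probability that traffic through $b$ is diverted from $a$ (and not from some other path). The pull rank $\hat{r}_b$, i.e. the probability that $b$ will be traversed, can thus be understood as its attraction; whereas $\hat{r}^{\triangleleft}_a$ is the probability that $a$ will be avoided. Using the pull rank of the paths, we can now define the node attraction between $j$ and $k$ to be the total attraction of all paths between them:

$$\hat{r}_{jk} = \sum_{j \xrightarrow{b} k} \hat{r}_b \tag{4}$$

The idea is that this notion of attraction between the nodes will allow us to refine the estimate of the traffic bias $\upsilon$ described in section 2. In particular, consider the attraction bias

$$\Upsilon_{jk} = \hat{r}_{jk} - r^{\blacktriangleright\blacktriangleleft}_{jk} \tag{5}$$

To motivate this, note that expanding the formula for $r^{\blacktriangleright\blacktriangleleft}_{jk}$ in section 5.2 shows that $r^{\blacktriangleright\blacktriangleleft}$ is the stationary distribution of the Markov chain $A^{\blacktriangleright\blacktriangleleft} = \left( A^{\blacktriangleright\blacktriangleleft}_{(ij)(k\ell)} \right)_{N^2 \times N^2}$, where

$$A^{\blacktriangleright\blacktriangleleft}_{(ij)(k\ell)} = \frac{A_{ij}\, A_{j\bullet}\, A_{\bullet k}\, A_{k\ell}}{A_{i\bullet}\, A_{\bullet\bullet}^{2}\, A_{\bullet\ell}} \quad\text{and}\quad r^{\blacktriangleright\blacktriangleleft}_{jk} = \sum_{i,\ell \in N} A^{\blacktriangleright\blacktriangleleft}_{(ij)(k\ell)}\, r^{\blacktriangleright\blacktriangleleft}_{i\ell}$$

On the other hand, the node attraction $\hat{r}$ turns out to be a stationary distribution of a process that refines $A^{\blacktriangleright\blacktriangleleft}$.

Definition 3. Given a network $A$, its attraction dynamics is a Markov chain $\hat{A} = \left( \hat{A}_{(ij)(k\ell)} \right)_{N^2 \times N^2}$, with the entries

$$\hat{A}_{(ij)(k\ell)} = \frac{A_{ij}\, A_{jk}\, A_{k\ell}}{A_{i\bullet} A_{\bullet\bullet} A_{\bullet\ell}} \tag{6}$$

where $A_{i\bullet} A_{\bullet\bullet} A_{\bullet\ell} = \sum_{m,n \in N} A_{im} A_{mn} A_{n\ell}$.

Proposition 1. Suppose that a given network $A$ is $v$-complete for a sufficiently large $v$. Then the node attraction $\hat{r}$, defined in (4), is the stationary distribution of its attraction dynamics (6).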
In other words, for every $j, k$ holds

$$\hat{r}_{jk} = \sum_{i,\ell \in N} \hat{A}_{(ij)(k\ell)}\, \hat{r}_{i\ell} \tag{7}$$

The proof is in the Appendix. It is based on the following lemma.

Lemma 1. For a network $A$, which is $v$-complete for a sufficiently large cutoff value $v$, the following equations hold for $i \xrightarrow{a} \ell$ and $j \xrightarrow{b} k$

$$\hat{A}_{ab} = \frac{A_b}{4^d A_a}\, A_{ij} A_{k\ell} \tag{8}$$

$$\sum_{j \xrightarrow{c} k} \hat{A}_{ac} = \frac{1}{4^d A_a}\, A_{ij} A_{jk} A_{k\ell} \tag{9}$$

$$\hat{A}_{a\bullet} = \frac{1}{4^d A_a}\, A_{i\bullet} A_{\bullet\bullet} A_{\bullet\ell} \tag{10}$$

On the other hand, proposition 1 implies the following corollary, which establishes that formula (5) can be used to measure the attraction bias, as intended.

Corollary 1. The directed reputation and promotion ranks are the marginals of the node attraction

$$\sum_{k \in N} \hat{r}_{jk} = r^{\blacktriangleright}_j \tag{11}$$

$$\sum_{j \in N} \hat{r}_{jk} = r^{\blacktriangleleft}_k \tag{12}$$

Interpretation. To understand the meaning of attraction bias, consider a $v$-complete network $A$, with the forward-out and backward-in dynamics. The pull rank $r^{\blacktriangleright}_j$ tells how likely it is that a randomly sampled traffic path arrives at $j$; whereas the push rank $r^{\blacktriangleleft}_k$ tells how likely it is that a randomly sampled traffic path departs from $k$. On the other hand, the attraction dynamics in the induced path network $\hat{A}$ gives the node attraction $\hat{r}_{jk}$, which tells how likely it is that a randomly sampled traffic path traverses a path from $j$ to $k$. In summary, we have

$$r^{\blacktriangleright}_j = \mathrm{Prob}\left( \bullet \xrightarrow{\xi} j \mid \xi \in A^{\blacktriangleright} \right) \qquad r^{\blacktriangleleft}_k = \mathrm{Prob}\left( k \xrightarrow{\xi} \bullet \mid \xi \in A^{\blacktriangleleft} \right) \qquad \hat{r}_{jk} = \mathrm{Prob}\left( j \xrightarrow{\xi} k \mid \xi \in \hat{A} \right)$$

Although the notation suggests that $r^{\blacktriangleright}$, $r^{\blacktriangleleft}$, and $\hat{r}$ are sampled from different processes, corollary 1 establishes that $\hat{r}$ is in fact the joint distribution of $r^{\blacktriangleright}$ and $r^{\blacktriangleleft}$. Nevertheless, a diligent reader will surely notice a twist of $j$ and $k$ in the last three equations, and wonder: why is the probability that traffic goes from $j$ to $k$ related to the probabilities that it arrives at $j$, and that it departs from $k$?
The answer to this question makes the forward-out and the backward-in dynamics into a more interesting example than its many dynamical cousins. Briefly, if the surfers are more likely to flow with $\bullet \to j$ if the capacity of the links out of $j$ is higher, and if they are more likely to flow with $k \to \bullet$ if the capacity of the links into $k$ is higher, then the surfers are most likely to follow both these flows, i.e. into $j$ and out of $k$, if there is a high capacity of the links $j \to k$.

Mutual information of the inputs and the outputs. The fact that $\hat{r}$ is the joint distribution of the processes expressed by $r^{\blacktriangleright}$ and $r^{\blacktriangleleft}$ allows us to extract their mutual information [4]

$$I(r^{\blacktriangleright}; r^{\blacktriangleleft}) = D\left( \hat{r} \,\|\, r^{\blacktriangleright\blacktriangleleft} \right) = \sum_{j=1}^{N} \sum_{k=1}^{N} \hat{r}_{jk} \log \frac{\hat{r}_{jk}}{r^{\blacktriangleright}_j\, r^{\blacktriangleleft}_k}$$

Its expression in terms of relative entropy $D(\hat{r} \,\|\, r^{\blacktriangleright\blacktriangleleft})$ [ibidem] shows that it measures how much we lose in the efficiency of encoding of $\hat{r}$ if we assume that $r^{\blacktriangleright}$ and $r^{\blacktriangleleft}$ are mutually independent. Intuitively, the mutual information $I(r^{\blacktriangleright}; r^{\blacktriangleleft})$ can thus be taken as a measure of the locality of information processing in $A$. If this is an entirely local process, then every path must begin and end at the same node, and the random walks $\delta$ and $\varrho$, selecting the sources and the destinations of the paths, must coincide. But if $\delta = \varrho$, then the push rank and the pull rank must obey the same distribution $r^{\blacktriangleright} = r^{\blacktriangleleft} = r$, and their mutual information is $I(r^{\blacktriangleright}; r^{\blacktriangleleft}) = H(r)$, their entropy. In the other extreme case, the random walks $\delta$ and $\varrho$ are independent⁵, and their joint distribution is just the product of their distributions $\hat{r}_{jk} = r^{\blacktriangleright}_j\, r^{\blacktriangleleft}_k$. Their mutual information is then $I(r^{\blacktriangleright}; r^{\blacktriangleleft}) = 0$.

7 Conclusions and future work

When the Web is viewed as a global data store, the problem of its semantics is the problem of determining a uniform meaning for the data published by its various participants.
The search engines are dealing with this problem on the level of the human-Web interaction (e.g., distinguishing the meanings of the word "jaguar", sometimes denoting a car, sometimes an animal [12], or deciding whether "Paris Hilton", in a given context, refers to a person or to a hotel, etc.), whereas the Semantic Web project [1] deals with the computer-Web interactions.

⁵ The theorem in the appendix suggests that they are similarly distributed, up to a scale factor.

When the Web is viewed as a computer, the problem of its semantics is not just a matter of assigning some meanings to some data stored in it, but also to its data processing operations. For programming languages, this is what we usually call operational semantics. However, unlike a programming language, the Web, and other spontaneously evolving networks, do not have a formally defined set of data structures and operations: data are transformed by many random walks, running concurrently. Operational semantics of network computation requires a toolkit for incremental analysis of such processes. In this paper, we described a path ranking method, which may become a useful piece of that toolkit. Now we sketch a way to test it experimentally.

Using the notion of attraction bias, we lift the graph theoretic notion of (maximal) clique into rank analysis, while retaining network dynamics as a graph structure over such generalized cliques. We call these generalized cliques concepts and the links between them associations.

Communities and concepts.
Taking the notion of attraction bias back to the idea of communities as sets of nodes with high cohesion, from which we started in section 2, we now reformulate the notion of cohesion in a different norm ($\ell_\infty$ instead of $\ell_1$), and define cohesion of a set of nodes $U \subseteq N$ to be their minimal symmetric attraction bias

$$\Upsilon(U) = \bigwedge_{i,j \in U} \left( \Upsilon_{ij} \vee \Upsilon_{ji} \right)$$

For each $\varepsilon \in [0, 1]$, we define an $\varepsilon$-community to be a set of nodes $U \subseteq N$ such that $\Upsilon(U) \ge \varepsilon$. Denoting by $\wp_\varepsilon N$ the set of $\varepsilon$-communities, note that $\varepsilon_1 \le \varepsilon_2$ implies that $\wp_{\varepsilon_1} N \supseteq \wp_{\varepsilon_2} N$. The partial ordering of $U, V \in \wp_\varepsilon N$ is given by

$$U \sqsubseteq V \iff U \subseteq V \ \wedge\ \Upsilon(U) \le \Upsilon(V)$$

This gives a directed complete partial order (dcpo). It is not a lattice because some communities cannot be extended by new nodes without decreasing their cohesion; so there are pairs of communities that cannot be joined, and do not have an upper bound. However, directed sets of communities (i.e., where each pair has an upper bound) do have least upper bounds, which are just their set theoretic unions. Directed complete partial orders are often used in denotational semantics of programming languages [5]. According to that interpretation, communities can be thought of as pieces of partial information, their $\sqsubseteq$-ordering as the increase of information, and the existence of an upper bound of two communities as the consistency of the information that they carry.

The maximal elements of $\wp_\varepsilon N$, i.e. the communities that cannot be extended by new nodes without losing cohesion, can be construed as $\varepsilon$-concepts. A set $U \in \wp_\varepsilon N$ is thus an $\varepsilon$-concept if $\Upsilon(\{i, j\}) \ge \varepsilon$ holds for all $i, j \in U$, but for every $k \in N \setminus U$ there is a $j \in U$ such that $\Upsilon(\{k, j\}) < \varepsilon$.
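For a small network, the ε-communities and ε-concepts can be enumerated by brute force. The following sketch assumes that an attraction bias matrix is already available; the numbers below are hypothetical, the function names are illustrative, and the exhaustive search over all subsets is exponential in the number of nodes, so it is meant only to make the definitions concrete. The cohesion formula is applied literally, so the pairs with $i = j$ are included.

```python
from itertools import combinations


def cohesion(bias, nodes):
    """Minimal symmetric attraction bias over all pairs (i, j) of the given nodes."""
    return min(max(bias[i][j], bias[j][i]) for i in nodes for j in nodes)


def communities(bias, eps):
    """All nonempty node sets whose cohesion is at least eps."""
    n = len(bias)
    return [frozenset(c)
            for size in range(1, n + 1)
            for c in combinations(range(n), size)
            if cohesion(bias, c) >= eps]


def concepts(bias, eps):
    """Communities that cannot be extended without dropping below eps."""
    found = communities(bias, eps)
    return [u for u in found if not any(u < v for v in found)]


# Hypothetical attraction bias values for four nodes.
bias = [[0.30, 0.25, 0.05, -0.10],
        [0.10, 0.40, 0.22, -0.05],
        [0.21, 0.02, 0.35, 0.01],
        [-0.10, -0.02, 0.03, 0.50]]
found_concepts = concepts(bias, 0.2)
```

With these sample values, nodes 0, 1 and 2 attract each other in at least one direction above the threshold 0.2 and form one concept, while node 3 remains a concept on its own.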
The community and concept structure of a network $A$ can be analyzed by studying the sequence of hypergraphs $A_\varepsilon$, where the $\varepsilon$-concepts, or the $\varepsilon$-communities approximating them, are viewed as hyperedges. The sequence $(A_\varepsilon)_{\varepsilon \in [0,1]}$ decreases as the cohesion parameter $\varepsilon$ increases, and the highly cohesive communities and concepts can be feasibly analyzed. A level further, concepts and communities can be viewed as the nodes of a network. The most interesting definition of the links between them, intuitively thought of as associations, is based on a variant of a path network, complementing definition 2. A sketch of this definition is in the next, final subsection.

Associations. Let $N_\varepsilon$ denote the set of $\varepsilon$-concepts in a network $A$. The concept network $A_\varepsilon$, induced by a network $A$, has the $\varepsilon$-concepts as its nodes. Its edges are called concept associations. The set of associations between $U, V \in N_\varepsilon$ is

$$E^{\varepsilon}_{UV} = \sum_{U \xrightarrow{a} U \cap V} \;\; \sum_{U \cap V \xrightarrow{b} V} \widetilde{E}_{ab} \tag{13}$$

where $U \xrightarrow{\xi} V$ abbreviates $\delta(\xi) \in U \wedge \varrho(\xi) \in V$, and

$$\widetilde{E}_{ab} = \left\{ f = \langle f_0, f_1 \rangle \in E_{ij} \times E_{k\ell} \;\middle|\; \gamma(f_0) + \gamma(b) \leq v - d \ \text{ and } \ \gamma(a) + \gamma(f_1) \leq v - d \right\}$$

An association $f \in A_{UV}$ is thus a quadruple $f = \langle a, b, f_0, f_1 \rangle$

$$\begin{array}{ccc}
i & \xrightarrow{\;f_0\;} & j \\
{\scriptstyle a}\downarrow & & \downarrow{\scriptstyle b} \\
k & \xrightarrow{\;f_1\;} & \ell
\end{array}$$

such that $i, j, k \in U$ and $j, k, \ell \in V$. Its cost is $\gamma(f) = \gamma(f_0) + \gamma(b) - \gamma(a) - \gamma(f_1)$. The cost of an association from $U$ to $V$ is lower if the traffic from $i \in U$ to $\ell \in V$ gets less costly when it crosses to $V$ earlier.

While the general network analysis tools apply to concept networks, the various notions of dynamics acquire new meanings on this level. At this point, understanding which of the possible interpretations may lead to useful tools for extracting and analyzing the relevant concepts, processed in a network, seems to call for experimentation with real data.

References

1. T. Berners-Lee. Semantic Web road map, October 1998.
2. M. Bianchini, M. Gori, and F. Scarselli. Inside PageRank. ACM Trans. Inter. Tech., 5(1):92–128, February 2005.
3. P. Boldi, M. Santini, and S. Vigna. PageRank as a function of the damping factor. In WWW '05: Proceedings of the 14th international conference on World Wide Web, pages 557–566, New York, NY, USA, 2005. ACM Press.
4. T. M. Cover and J. A. Thomas. Elements of information theory. Wiley-Interscience, New York, NY, USA, 1991.
5. G. Gierz, K. H. Hofmann, K. Keimel, J. Lawson, M. Mislove, and D. Scott. Continuous Lattices and Domains, volume 93 of Encyclopedia of Mathematics and its Applications. Cambridge University Press, 2003.
6. O. Goldreich. Foundations of Cryptography: Volume 2, Basic Applications. Cambridge University Press, New York, NY, USA, 2004.
7. G. H. Golub and C. Greif. An Arnoldi-type algorithm for computing PageRank. BIT Numerical Mathematics, 43(1):1–18, 2003.
8. Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen. Combating Web spam with TrustRank. In VLDB, pages 576–587, 2004.
9. T. H. Haveliwala. Topic-sensitive pagerank: A context-sensitive ranking algorithm for web search. IEEE Trans. Knowl. Data Eng., 15(4):784–796, 2003.
10. C. Hubbell. An input-output approach to clique identification. Sociometry, 28:377–399, 1965.
11. L. Katz. A new status index derived from sociometric analysis. Psychometrika, 18:39–43, 1953.
12. J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999.
13. A. N. Langville and C. D. Meyer. Google's PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press, Princeton, NJ, USA, 2006.
14. S. Mac Lane. Categories for the Working Mathematician. Number 5 in Graduate Texts in Mathematics. Springer-Verlag, 1971.
15. M. Newman, A.-L. Barabasi, and D. J. Watts, editors. The Structure and Dynamics of Networks. Princeton Studies in Complexity. Princeton University Press, Princeton, NJ, USA, 2006.
16. M. E. J. Newman. Modularity and community structure in networks. PNAS, 103(23):8577–8582, June 2006.
17. Y. Ollivier and P. Senellart. Finding related pages using Green measures: An illustration with Wikipedia. In Proceedings of the 22nd AAAI Conference on Artificial Intelligence, pages 1427–1433, Menlo Park, California, July 2007. AAAI, AAAI Press.
18. T. O'Reilly. What is Web 2.0, September 2005.
19. L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the Web. Technical report, Stanford Digital Library Technologies Project, 1998.
20. D. J. Watts and S. H. Strogatz. Collective dynamics of 'small-world' networks. Nature, 393(6684):440–442, June 1998.

Appendix: Proofs

Proof (of lemma 1(8)). The first claim is that there is a sufficiently large $v$ such that $\gamma(f_0) + \gamma(b) + \gamma(f_1) \leq v - 2d$ holds for all $f_0 \in E_{ij}$ and $f_1 \in E_{k\ell}$. Since $b$ and $d$ are fixed, the claim is clear if $E_{ij}$ and $E_{k\ell}$ are finite. Since $A$ is assumed to be truncated complete, an infinite set of paths can only be generated from the links with a cost $\leq 0$. So the costs of the elements of $E_{ij}$ and $E_{k\ell}$ are in any case bounded. But if all $f_0 \in E_{ij}$ and $f_1 \in E_{k\ell}$ satisfy $\gamma(f_0) + \gamma(b) + \gamma(f_1) \leq v - 2d$, then $\widehat{E}_{ab} = E_{ij} \times E_{k\ell}$. Unfolding the definition of $\widehat{A}_{ab}$ and using (3) we get

$$\begin{aligned}
\widehat{A}_{ab} &= \sum_{f \in \widehat{E}_{ab}} 2^{-\gamma(f)} \\
&= \sum_{f_0 \in E_{ij}} \sum_{f_1 \in E_{k\ell}} 2^{-2d - \gamma(f_0) - \gamma(b) - \gamma(f_1) + \gamma(a)} \\
&= 4^{-d}\, \frac{2^{-\gamma(b)}}{2^{-\gamma(a)}} \sum_{f_0 \in E_{ij}} 2^{-\gamma(f_0)} \sum_{f_1 \in E_{k\ell}} 2^{-\gamma(f_1)} \\
&= \frac{A_b}{4^d A_a}\, A_{ij} A_{k\ell}
\end{aligned}$$

1(9) follows directly from 1(8), unpacking $A_e = 2^{-\gamma(e)}$. And 1(10) then follows from 1(9):

$$\widehat{A}_{a\bullet} = \sum_b \widehat{A}_{ab} = \sum_{j,k \in N} \sum_{j \xrightarrow{b} k} \widehat{A}_{ab} = \sum_{j,k \in N} \frac{1}{4^d A_a}\, A_{ij} A_{jk} A_{k\ell} = \frac{1}{4^d A_a}\, A_{i\bullet} A_{\bullet\bullet} A_{\bullet\ell}$$

Proof (of proposition 1(7)).
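We unfold the definition of $\widehat{r}$ and then use (9) and (10):

$$\begin{aligned}
\widehat{r}_{jk} &= \sum_{j \xrightarrow{b} k} \widehat{r}_b
= \sum_{j \xrightarrow{b} k} \sum_a \frac{\widehat{A}_{ab}}{\widehat{A}_{a\bullet}}\, \widehat{r}_a
= \sum_a \frac{\sum_{j \xrightarrow{b} k} \widehat{A}_{ab}}{\widehat{A}_{a\bullet}}\, \widehat{r}_a \\
&= \sum_a \frac{\frac{1}{4^d A_a} A_{ij} A_{jk} A_{k\ell}}{\frac{1}{4^d A_a} A_{i\bullet} A_{\bullet\bullet} A_{\bullet\ell}}\, \widehat{r}_a
= \sum_{i,\ell \in N} \sum_{i \xrightarrow{a} \ell} \frac{A_{ij} A_{jk} A_{k\ell}}{A_{i\bullet} A_{\bullet\bullet} A_{\bullet\ell}}\, \widehat{r}_a \\
&= \sum_{i,\ell \in N} \frac{A_{ij} A_{jk} A_{k\ell}}{A_{i\bullet} A_{\bullet\bullet} A_{\bullet\ell}} \sum_{i \xrightarrow{a} \ell} \widehat{r}_a
= \sum_{i,\ell \in N} \widehat{A}_{(ij)(k\ell)}\, \widehat{r}_{i\ell}
\end{aligned}$$

Proof (of corollary 1(11)). We set $q_j = \sum_{k \in N} \widehat{r}_{jk}$ and expand $\widehat{r}_{jk}$ using proposition 1(7), to show that $q$ is the stationary distribution of the process $A^{\blacktriangleright}$:

$$\begin{aligned}
q_j &= \sum_{k \in N} \widehat{r}_{jk}
= \sum_{k,i,\ell \in N} \widehat{A}_{(ij)(k\ell)}\, \widehat{r}_{i\ell}
= \sum_{k,i,\ell \in N} \frac{A_{ij} A_{jk} A_{k\ell}}{A_{i\bullet} A_{\bullet\bullet} A_{\bullet\ell}}\, \widehat{r}_{i\ell} \\
&= \sum_{i,\ell \in N} \frac{A_{ij} A_{j\bullet}}{A_{i\bullet} A_{\bullet\bullet}}\, \widehat{r}_{i\ell}
= \sum_{i \in N} A^{\blacktriangleright}_{ij}\, \frac{A_{j\bullet}}{A_{\bullet\bullet}} \sum_{\ell \in N} \widehat{r}_{i\ell}
= \sum_{i \in N} A^{\blacktriangleright}_{ij}\, q_i
\end{aligned}$$

Since $r^{\blacktriangleright}$ is by definition the stationary point of $A^{\blacktriangleright}$, the uniqueness implies $q = r^{\blacktriangleright}$.

To prove 1(12), we set $q_k = \sum_{j \in N} \widehat{r}_{jk}$ and proceed similarly:

$$\begin{aligned}
q_k &= \sum_{j \in N} \widehat{r}_{jk}
= \sum_{j,i,\ell \in N} \widehat{A}_{(ij)(k\ell)}\, \widehat{r}_{i\ell}
= \sum_{j,i,\ell \in N} \frac{A_{ij} A_{jk} A_{k\ell}}{A_{i\bullet} A_{\bullet\bullet} A_{\bullet\ell}}\, \widehat{r}_{i\ell} \\
&= \sum_{i,\ell \in N} \frac{A_{\bullet k} A_{k\ell}}{A_{\bullet\bullet} A_{\bullet\ell}}\, \widehat{r}_{i\ell}
= \sum_{\ell \in N} \frac{A_{\bullet k}}{A_{\bullet\bullet}}\, A^{\blacktriangleleft}_{k\ell} \sum_{i \in N} \widehat{r}_{i\ell}
= \sum_{\ell \in N} A^{\blacktriangleleft}_{k\ell}\, q_\ell
\end{aligned}$$

Since $r^{\blacktriangleleft}$ is by definition the stationary point of $A^{\blacktriangleleft}$, the uniqueness implies $q = r^{\blacktriangleleft}$.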
