Established Clustering Procedures for Network Analysis
In light of the burgeoning interest in network analysis in the new millennium, we bring to the attention of contemporary network theorists a two-stage double-standardization and hierarchical clustering (single-linkage-like) procedure devised in 1974. …
Authors: Paul B. Slater
Paul B. Slater*, ISBER, University of California, Santa Barbara, CA 93106 (Dated: November 19, 2018)

Abstract

In light of the burgeoning interest in network analysis in the new millennium, we bring to the attention of contemporary network theorists a two-stage double-standardization and hierarchical clustering (single-linkage-like) procedure devised in 1974. In its many applications over the next decade, primarily to the migration flows between geographic subdivisions within nations, the presence of “hubs” was often revealed. These are, typically, “cosmopolitan/non-provincial” areas, such as the French capital, Paris, which send and receive people relatively broadly across their respective nations. Additionally, this two-stage procedure, which “might very well be the most successful application of cluster analysis” (R. C. Dubes), has detected many (physically or socially) isolated groups (regions) of areas, such as those forming the southern islands, Shikoku and Kyushu, of Japan, the Italian islands of Sardinia and Sicily, and the New England region of the United States. Further, we discuss a (complementary) approach developed in 1976, involving the application of the max-flow/min-cut theorem to raw/non-standardized flows.

PACS numbers: Valid PACS 02.10.Ox, 89.65.-s

* Electronic address: slater@kitp.ucsb.edu

A.-L. Barabási, in his recent popular book, “Linked”, asserts that the emergence of hubs in networks is a surprising phenomenon that is “forbidden by both the Erdős-Rényi and Watts-Strogatz models” [1, p. 63] [2, Chap. 8]. Here, we indicate an analytical framework introduced in 1974 that the distinguished computer scientist R. C. Dubes, in a review of [3], has asserted “might very well be the most successful application of cluster analysis” [4, p. 142]. It has proved insightful in revealing, among other interesting relationships, hub-like structures in networks of (weighted, directed) internodal flows. In the recent resurgence of interest in network analysis, this methodology may have been overlooked, as many of its uses had been reported in the 1970s and 1980s, in journals outside of the strictly mathematical and physical literature [5, 6, 7, 8, 9, 10, 11, 12, 13, 14] (as well as in the research institute monographs [3, 15, 16], widely distributed to academic libraries).

Though the principal procedure under discussion here is applicable in a wide variety of social-science settings [3, 4], it has been largely used, in a demographic context, to study the internal migration tables published at regular periodic intervals by most of the nations of the world. These tables can be thought of as $N \times N$ (square) matrices, the entries ($m_{ij}$) of which are the number of people who lived in geographic subdivision $i$ at time $t$ and $j$ at time $t + 1$. (Some tables, but not all, have diagonal entries, $m_{ii}$, which may represent the number of people who did move within area $i$, or simply those who lived in $i$ at $t$ and $t + 1$.)

In the first step of the analytical procedure employed, the rows and columns of the table of flows are alternately (biproportionally [17]) scaled to sum to a fixed number (say 1). Under broad conditions, to be discussed below, convergence occurs to a “doubly-stochastic” (bistochastic) table, with row and column sums all simultaneously equal to 1 [18, 19, 20, 21]. The purpose of the scaling is to remove overall (marginal) effects of size, and focus on relative, interaction effects. The cross-product ratios (relative odds), $\frac{m_{ij} m_{kl}}{m_{il} m_{kj}}$, measures of association, are left invariant.
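The first step can be illustrated with a minimal sketch in plain Python. It is not the FORTRAN or SAS code of [30, 31]; the function names, the fixed iteration count, and the small strictly positive `flows` table are all hypothetical choices made for illustration. The sketch alternately rescales rows and columns and then checks that a cross-product ratio is unchanged, which follows because each entry is only ever multiplied by a row factor and a column factor, and these cancel in $\frac{m_{ij} m_{kl}}{m_{il} m_{kj}}$.

```python
def double_standardize(table, iterations=500):
    """Alternately rescale rows and columns so that each sums to 1.

    Assumes a square table with strictly positive entries, for which the
    iteration is known to converge; sparse tables need more care.
    """
    m = [row[:] for row in table]
    n = len(m)
    for _ in range(iterations):
        # Row step: divide every row by its current sum.
        for i in range(n):
            row_sum = sum(m[i])
            m[i] = [x / row_sum for x in m[i]]
        # Column step: divide every column by its current sum.
        for j in range(n):
            col_sum = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= col_sum
    return m


def cross_product_ratio(m, i, j, k, l):
    """Relative odds m_ij * m_kl / (m_il * m_kj)."""
    return (m[i][j] * m[k][l]) / (m[i][l] * m[k][j])


# Hypothetical 3 x 3 flow table (rows: origins, columns: destinations).
flows = [
    [10.0, 200.0, 30.0],
    [150.0, 20.0, 60.0],
    [40.0, 80.0, 5.0],
]
scaled = double_standardize(flows)
ratio_before = cross_product_ratio(flows, 0, 1, 2, 2)
ratio_after = cross_product_ratio(scaled, 0, 1, 2, 2)
```

Here `ratio_before` and `ratio_after` agree up to floating-point rounding, while the row and column sums of `scaled` are all close to 1.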
Additionally, the entries of the doubly-stochastic table provide maximum entropy estimates of the original flows, given the row and column constraints [22].

For large sparse flow tables, only the nonzero entries, together with their row and column coordinates, are needed. Row and column (biproportional) multipliers can be iteratively computed by sequentially accessing the nonzero cells [23]. If the table is “critically sparse”, various convergence difficulties may occur. Nonzero entries that are “unsupported” (that is, not part of a set of $N$ nonzero entries, no two in the same row or column) may converge to zero and/or the biproportional multipliers may not converge [3, p. 19] [24] [25, p. 171]. (The scaling was successfully implemented with a 3,140 × 3,140 1965-70 intercounty migration table for the United States [9, 15], as well as for a more aggregate 510 × 510 table for the US [13]. Smoothing procedures can be used to modify the zero-nonzero structure of a flow table, particularly if it is critically sparse [26, 27].) The “first strongly polynomial-time algorithm for matrix scaling” was reported in [28].

In the second step of the procedure, the doubly-stochastic matrix is converted to a series of directed (0,1) graphs (digraphs), by applying thresholds to its entries. As the thresholds are progressively lowered, larger and larger strong components (a directed path existing from any member of a component to any other) of the resulting graphs are found. This process (a simple variant of well-known single-linkage [nearest neighbor] clustering) can be represented by the familiar dendrogram or tree diagram used in hierarchical cluster analysis and cladistics/phylogeny (cf. [29]). A FORTRAN implementation of the two-stage process was given in [30], as well as one in the SAS (Statistical Analysis System) framework [31].
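The thresholding step just described can be sketched as follows. This is only an illustrative sketch: it finds strong components through a transitive closure, in the spirit of the earlier implementations rather than the faster algorithms of Tarjan, and the function names and the 4 × 4 doubly-stochastic table `b` are hypothetical. The example table happens to be symmetric purely to keep it easy to check by hand; the code itself treats the graph as directed.

```python
def strong_components(n, edges):
    """Strong components of a digraph on nodes 0..n-1 via transitive closure."""
    reach = [[i == j for j in range(n)] for i in range(n)]
    for i, j in edges:
        reach[i][j] = True
    # Warshall's algorithm: reach[i][j] becomes True if a directed path exists.
    for k in range(n):
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True
    components, seen = set(), set()
    for i in range(n):
        if i not in seen:
            # Nodes mutually reachable with i form its strong component.
            comp = frozenset(j for j in range(n) if reach[i][j] and reach[j][i])
            seen |= comp
            components.add(comp)
    return components


def threshold_hierarchy(b):
    """Lower the threshold over the off-diagonal entries of b and record
    each threshold at which the set of strong components changes."""
    n = len(b)
    levels = sorted(
        {b[i][j] for i in range(n) for j in range(n) if i != j and b[i][j] > 0},
        reverse=True,
    )
    history = []
    previous = {frozenset([i]) for i in range(n)}
    for t in levels:
        edges = [(i, j) for i in range(n) for j in range(n)
                 if i != j and b[i][j] >= t]
        components = strong_components(n, edges)
        if components != previous:
            history.append((t, components))
            previous = components
    return history


# Hypothetical doubly-stochastic table: every row and column sums to 1.
b = [
    [0.10, 0.60, 0.20, 0.10],
    [0.60, 0.10, 0.10, 0.20],
    [0.20, 0.10, 0.10, 0.60],
    [0.10, 0.20, 0.60, 0.10],
]
history = threshold_hierarchy(b)
```

For this table, nodes 0 and 1 and nodes 2 and 3 pair up at threshold 0.6, and all four nodes merge at threshold 0.2; these merge levels are what a dendrogram would display.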
The noted computer scientist R. E. Tarjan [32] devised an $O(M (\log N)^2)$ algorithm [33] and, then, a further improved $O(M \log N)$ method [34], where $N$ is the number of nodes and $M$ the number of edges of a directed graph. (These substantially improved upon the earlier works [30, 31], which required the computations of transitive closures of graphs, and were $O(MN)$ in nature.) A FORTRAN coding, involving linked lists, of the improved Tarjan algorithm [34] was presented in [35], and applied in the US intercounty study [15].

If the graph-theoretic (0,1) structure of the network under study is not strongly connected [36], independent analyses of the subsystems of the network are appropriate.

The goodness-of-fit of the dendrogram generated to the doubly-stochastic table itself can be evaluated, and possibly employed, it would seem, as an optimization criterion. Distances between nodes in the dendrogram satisfy the (stronger than triangular) ultrametric inequality, $d_{ij} \le \max(d_{ik}, d_{jk})$ [37, p. 245] [38, eq. (2.2)].

Geographic subdivisions (or groups of subdivisions) that enter into the bulk of the dendrogram at the weakest levels are those with the broadest ties. Typically, these have been found to be “cosmopolitan”, hub-like areas, a prototypical example being the French capital, Paris [3, sec. 4.1] [6]. Similarly, in parallel analyses of other internal migration tables, the cosmopolitan/non-provincial natures of London, Barcelona, Milan, West Berlin, Moscow, Manila, Bucharest, Montréal, Zürich, Santiago, Tunis and Istanbul were, among others, highlighted in the respective dendrograms for their nations [3, sec. 8.2] [14, pp. 181-182] [8, p. 55].
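The ultrametric inequality mentioned above can be checked numerically on a small example. The sketch below assumes a hypothetical four-node dendrogram in which the distance between two nodes is the height at which they first share a cluster; the heights 0.4 and 0.8 and both function names are illustrative only, and how heights are derived from thresholds is left open here.

```python
from itertools import permutations


def dendrogram_distances(n, merges):
    """Distance matrix implied by a dendrogram.

    merges is a list of (height, clusters) pairs in order of increasing
    height; d[i][j] is the first height at which i and j share a cluster.
    """
    unset = float("inf")
    d = [[0.0 if i == j else unset for j in range(n)] for i in range(n)]
    for height, clusters in merges:
        for cluster in clusters:
            for i in cluster:
                for j in cluster:
                    if i != j and d[i][j] == unset:
                        d[i][j] = height
    return d


def is_ultrametric(d):
    """True if d_ij <= max(d_ik, d_jk) for every triple of distinct nodes."""
    n = len(d)
    return all(
        d[i][j] <= max(d[i][k], d[j][k])
        for i, j, k in permutations(range(n), 3)
    )


# Hypothetical dendrogram: {0,1} and {2,3} form at height 0.4, all merge at 0.8.
merges = [(0.4, [{0, 1}, {2, 3}]), (0.8, [{0, 1, 2, 3}])]
d = dendrogram_distances(4, merges)
tree_is_ultrametric = is_ultrametric(d)

# For contrast: three equally spaced points on a line obey the triangle
# inequality but not the ultrametric one (2 > max(1, 1)).
line = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
line_is_ultrametric = is_ultrametric(line)
```

The contrast case shows why the inequality is described as stronger than triangular: ordinary metric distances generally fail it, whereas distances read off a dendrogram satisfy it by construction.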
In the intercounty analysis for the US, the most cosmopolitan entities were: (1) the centrally located paired Illinois counties of Cook (Chicago) and neighboring, suburban Du Page; (2) the nation’s capital, Washington, D. C.; and (3) the paired south Florida (retirement) counties of Dade (Miami) and Broward (Ft. Lauderdale) [9, 15]. (In general, counties with large military installations, large college populations, or that were state capitals also interacted relatively broadly with other areas.) It should be emphasized that although the indicated cosmopolitan areas may generally have relatively large populations, this cannot, in and of itself, explain the wide national ties observed, since the double-standardization, in effect, renders all areas of equal overall size.

Additionally, geographically isolated areas, such as the Japanese islands of Kyushu and Shikoku, emerged as well-defined clusters (regions) of their constituent subdivisions (“prefectures” in the Japanese case) in the dendrograms (cf. [39, 40]), and similarly the Italian islands of Sicily and Sardinia [12], and the North and South Islands of New Zealand and Newfoundland (Canada) [15, p. 1]. The eight counties of Connecticut, and other New England groupings, as further examples, were also very prominent in the highly disaggregated US analysis [15]. Relatedly, in a study based solely upon the 1968 movement of college students among the fifty states, the six New England states were strongly clustered [11, Fig. 1].
Though quite successful, evidently, in revealing hub-like and clustering behavior in recorded flows, the indicated series of studies did not address the recently-emerging, theoretically-important issues of scale-free networks, power-law descriptions, network evolution and vulnerability, and small-world properties that have been stressed by Barabási [1] (and his colleagues and many others in the growing field). (For critiques of these matters, see [41, 42].) In this regard, one might, using the indicated two-stage procedure, compare the hierarchical structure of geographic areas using internal migration tables at different levels of geographic aggregation (counties, states, regions...). To again use the example of France, based on a 1962-68 21 × 21 interregional table, Région Parisienne was certainly the most hub-like [3, sec. 4.1] [6], while using a finer 89 × 89 1954-62 interdepartmental table, the dyad composed of Seine (that is, Paris and its immediate suburbs) together with the encircling Seine-et-Oise (administratively eliminated in 1964) was most cosmopolitan [7] [3, sec. 6.1].

It would be of interest to develop a theory, making use of the rich mathematical structure of doubly-stochastic matrices, by which the statistical significance of apparent hubs and clusters in dendrograms could be evaluated [15, pp. 7-8] [43]. In the geographic context of internal migration tables, where nearby areas have a strong distance-aversion predilection for binding, it seems unlikely that most clustering results generated could be considered to be, in any standard sense, “random” in nature. On the other hand, other types of “origin-destination” tables, such as those for occupational mobility [44], journal citations [8] [16, pp. 125-153], interindustry (input-output) flows [10], brand switches [3, sec. 9.6] [45], crime switches [3, sec. 9.7] [46, Table XII], and (Morse code) confusions [3, sec. 9.8] [47], among others, clearly lack such a geographic dimension. An efficient algorithm, considered as a nonlinear dynamical system, to generate random bistochastic matrices has recently been presented [20] (cf. [48, 49]).

The creative, productive network analyst M. E. J. Newman has written: “Edge weights in networks have, with some exceptions . . . received relatively little attention in the physics [emphasis added] literature for the excellent reason that in any field one is well advised to look at the simple cases first (unweighted networks). On the other hand, there are many cases where edge weights are known for networks, and to ignore them is to throw out a lot of data that, in theory at least, could help us to understand these systems better” [50]. Of course, the numerous applications of the two-stage procedure we have discussed above have, in fact, been to such weighted networks.

In [50], Newman applied the famous Ford-Fulkerson max-flow/min-cut theorem to weighted networks (which he mapped onto unweighted multigraphs). Earlier, this theorem had been used to study Spanish [40], Philippine [51], and Brazilian, Mexican and Argentinian [52] internal migration and US interindustry flows [16, pp. 18-28] [53], all the corresponding flows now being left unadjusted, that is, not standardized. In this “multiterminal” approach, the maximum flow and the dual minimum edge cut-sets between all ordered pairs of nodes are found. Those cuts (often few or even null in number) which partition the $N$ nodes nontrivially (that is, into two sets each of cardinality greater than 1) are noted. The set in each such pair with the fewer nodes is regarded as a nodal cluster (region, in the geographic context).
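The multiterminal approach described above can be sketched as follows, under stated assumptions. The sketch uses a breadth-first augmenting-path search (the Edmonds-Karp variant of the Ford-Fulkerson method), which is a choice made here and not necessarily what the cited studies used. The function names and the five-node `flows` table are hypothetical, and only one minimum cut (the one nearest the source) is reported per ordered pair.

```python
from collections import deque


def max_flow_min_cut(capacity, source, sink):
    """Return (max flow value, source side of a minimum cut)."""
    n = len(capacity)
    residual = [row[:] for row in capacity]
    flow = 0
    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = [-1] * n
        parent[source] = source
        queue = deque([source])
        while queue and parent[sink] == -1:
            u = queue.popleft()
            for v in range(n):
                if parent[v] == -1 and residual[u][v] > 0:
                    parent[v] = u
                    queue.append(v)
        if parent[sink] == -1:
            break  # no augmenting path left: the flow is maximal
        # Find the bottleneck along the path, then push that much flow.
        bottleneck = float("inf")
        v = sink
        while v != source:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while v != source:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck
    # Nodes still reachable from the source form one side of a minimum cut.
    source_side = frozenset(v for v in range(n) if parent[v] != -1)
    return flow, source_side


def nodal_clusters(capacity):
    """Smaller sides of all nontrivial minimum cuts over ordered node pairs."""
    n = len(capacity)
    everyone = frozenset(range(n))
    clusters = set()
    for s in range(n):
        for t in range(n):
            if s != t:
                _, side = max_flow_min_cut(capacity, s, t)
                other = everyone - side
                if len(side) > 1 and len(other) > 1:
                    clusters.add(min(side, other, key=len))
    return clusters


# Hypothetical raw flow table: heavy flows inside {0,1} and inside {2,3,4},
# and only light flows between the two groups.
flows = [
    [0, 50, 1, 1, 1],
    [50, 0, 1, 1, 1],
    [1, 1, 0, 50, 50],
    [1, 1, 50, 0, 50],
    [1, 1, 50, 50, 0],
]
clusters = nodal_clusters(flows)
```

In this toy table the only nontrivial minimum cut separates {0, 1} from {2, 3, 4}, with a cut value of 6, so {0, 1} is reported as the nodal cluster. When both sides of a cut have the same size, the `min` call picks one of them arbitrarily.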
Such a nodal cluster has the interesting, defining property that fewer people migrate into (from) it, as a whole, than into (from) its node. In the Spanish context, the (nodal) province of Badajoz was found to have a particularly large out-migration sphere of influence, and the (Basque) province of Vizcaya (site of Bilbao and Guernica), an extensive in-migration field [40].

The networks formed by the World Wide Web and the Internet have been the focus of much recent interest [1]. Their structures are typically represented by $N \times N$ adjacency matrices, the entries of which are simply 0 or 1, rather than natural numbers, as in internal migration and other flow tables. One might investigate whether the two-stage double-standardization and hierarchical clustering, and the (complementary) multiterminal max-flow/min-cut procedures we have sought to bring to the attention of the active body of contemporary network theorists, could yield novel insights into these and other important modern structures.

In closing, it might be of interest to describe the immediate motivation for this particular note. I had done no further work applying the methods described above after 1985, being aware of, but not absorbed in, recent developments in network analysis. In May, 2008, Mathematical Reviews asked me to review the book of Tom Siegfried [2], chapter 8 of which is devoted to the on-going activities in network analysis. This further led me (thanks to D. E. Boyce) to the book of Barabási [1]. I, then, e-mailed Barabási, pointing out the use of the earlier, widely-applied clustering methods. In reply, he wrote, in part: “I guess you were another demo of everything being a question of timing – after a quick look it does appear that many things you did have came back as questions – with much more detailed data – again in the network community today. No, I was not aware of your papers, unfortunately, and it is hard to know how to get them back into the flow of the system”. The present note might be seen as an effort in that direction, alerting present-day investigators to these demonstratedly fruitful research methodologies.

Acknowledgments

I would like to express appreciation to the Kavli Institute for Theoretical Physics (KITP) for technical support.

[1] A.-L. Barabási, Linked: How everything is connected to everything else and what it means for business, science, and everyday life (Plume, New York, 2003).
[2] T. Siegfried, A beautiful math: John Nash, game theory, and the modern quest for a code of nature (Joseph Henry, Washington, 2006).
[3] P. B. Slater, Tree representations of internal migration flows and related topics (Community and Organization Res. Inst., Santa Barbara, 1984).
[4] R. C. Dubes, J. Classif. 2, 141 (1985).
[5] P. B. Slater, Regional Stud. 10, 123 (1976).
[6] P. B. Slater, IEEE Syst. Man. Cyb. 6, 321 (1976).
[7] P. B. Slater and H. L. M. Winchester, IEEE Syst. Man. Cyb. 8, 635 (1978).
[8] P. B. Slater, Scientometrics 5, 55 (1983).
[9] P. B. Slater, Environ. Plann. A 16, 545 (1984).
[10] P. B. Slater, Empirical Econ. 2, 1 (1977).
[11] P. B. Slater, Res. Higher Educ. 4, 305 (1976).
[12] M. L. Gentileschi and P. B. Slater, Riv. Geog. Ital. 87, 133 (1980).
[13] P. B. Slater, Rev. Public Data Use 4, 32 (1976).
[14] P. B. Slater, Quality and Quantity 15, 179 (1981).
[15] P. B. Slater, Migration regions of the United States: two county-level 1965-70 analyses (Community and Organization Res. Inst., Santa Barbara, 1983).
[16] P. B. Slater, Large scale data analytic studies in the social sciences (Community and Organization Res. Inst., Santa Barbara, 1986).
[17] M. A. Bacharach, Biproportional matrices and input-output change (Cambridge Univ., Cambridge, 1970).
[18] F. Mosteller, J. Amer. Statist. Assoc. 63, 1 (1968).
[19] J. D. Louck, Found. Phys. 27, 1085 (1997).
[20] V. Cappellini, H.-J. Sommers, W. Bruzda, and K. Życzkowski, Nonlinear dynamics in constructing random bistochastic matrices, arXiv:0711.3345.
[21] I. Bengtsson, The importance of being unistochastic, quant-ph/0403088.
[22] J. Eriksson, Math. Program. 18, 146 (1980).
[23] B. N. Parlett and T. L. Landis, Lin. Alg. Applics. 48, 53 (1982).
[24] R. Sinkhorn and P. Knopp, Pac. J. Math. 21, 343 (1967).
[25] L. Mirsky, Transversal Theory (Academic, New York, 1971).
[26] J. S. Simonoff, J. Statist. Plann. Infer. 47, 41 (1995).
[27] P. B. Slater, IEEE Syst. Man Cyber. 10, 678 (1980).
[28] N. Linial, A. Samorodnitsky, and A. Wigderson, Combinatorica 20, 545 (2000).
[29] K. Ozawa, Patt. Recog. 16, 201 (1983).
[30] C. Leusmann, Comput. Applics. 769, 769 (1977).
[31] D. Chilko, SAS Supplemental Library User’s Guide pp. 65-70 (1980).
[32] J. Schwartz, Not. Amer. Math. Soc. 29, 502 (1982).
[33] R. E. Tarjan, Info. Proc. Lett. 14, 26 (1982).
[34] R. E. Tarjan, Info. Proc. Lett. 17, 37 (1983).
[35] P. B. Slater, Environ. Plann. A 19, 117 (1987).
[36] D. F. Hartfiel and J. W. Spellman, Proc. Amer. Math. Soc. 36, 389 (1972).
[37] S. C. Johnson, Psychometrika 32, 241 (1967).
[38] R. Rammal, G. Toulouse, and M. A. Virasoro, Rev. Mod. Phys. 58, 765 (1986).
[39] W. E, T. Li, and E. Vanden-Eijnden, Proc. Natl. Acad. Sci. 105, 7907 (2008).
[40] P. B. Slater, Environ. Plann. A 8, 875 (1976).
[41] J. C. Doyle, D. L. Alderson, L. Li, S. Low, M. Roughan, S. Shalunov, R. Tanaka, and W. Willinger, Proc. Natl. Acad. Sci. 102, 14497 (2005).
[42] D. Alderson, J. C. Doyle, L. Li, and W. Willinger, Internet Math. 2, 421 (2005).
[43] H. H. Bock, Comput. Stat. Data Anal. 23, 5 (1996).
[44] O. D. Duncan, Amer. J. Sociol. 84, 793 (1979).
[45] V. R. Rao and D. J. Sabavala, J. Consumer Res. 8, 85 (1981).
[46] A. Blumstein and R. Larson, Operat. Res. 17, 199 (1969).
[47] E. Z. Rothkopf, J. Experiment. Psych. 53, 94 (1957).
[48] R. C. Griffiths, Canad. J. Math. 26, 600 (1974).
[49] K. Życzkowski, M. Kuś, W. Słomczyński, and H.-J. Sommers, J. Phys. A 36, 3425 (2003).
[50] M. E. J. Newman, Phys. Rev. E 70, 056131 (2004).
[51] P. B. Slater, Philippine Geog. J. 20, 79 (1976).
[52] P. B. Slater, Estadística 36, 180 (1977).
[53] P. B. Slater, Empirical Econ. 3, 49 (1977).