A Survey and Taxonomy of Graph Sampling
Graph sampling is a technique to pick a subset of vertices and/or edges from the original graph. It has a wide spectrum of applications, e.g. surveying hidden populations in sociology [54], visualizing social graphs [29], scaling down Internet AS graphs [27], graph sparsification [8], etc. In some scenarios, the whole graph is known and the purpose of sampling is to obtain a smaller graph. In other scenarios, the graph is unknown and sampling is regarded as a way to explore the graph. Commonly used techniques are Vertex Sampling, Edge Sampling and Traversal Based Sampling. We provide a taxonomy of different graph sampling objectives and graph sampling approaches. The relations between these approaches are formally argued and a general framework to bridge theoretical analysis and practical implementation is provided. Although smaller in size, sampled graphs may be similar to original graphs in some way. We are particularly interested in what graph properties are preserved given a sampling procedure. If some properties are preserved, we can estimate them on the sampled graphs, which gives a way to construct efficient estimators. If an algorithm relies on the preserved properties, we can expect that it gives similar output on the original and sampled graphs. This leads to a systematic way to accelerate a class of graph algorithms. In this survey, we discuss both classical text-book type properties and some advanced properties. The landscape is tabularized and we see a lot of missing works in this field. Some theoretical studies are collected in this survey and simple extensions are made.

Authors: Pili Hu, Wing Cheong Lau
Department of Information Engineering, Chinese University of Hong Kong
{hupili, wclau}@ie.cuhk.edu.hk
Most previous numerical evaluation works come in an ad hoc fashion, i.e. they evaluate different types of graphs, different sets of properties, and different sampling algorithms. A systematic and neutral evaluation is needed to shed light on further graph sampling studies.

Table of contents

1 Introduction
1.1 Some Motivating Examples
1.2 Sampling as a Base Approach
1.3 Organization
2 Common Notations and Definitions
3 A Taxonomy of Sampling
3.1 Categorization By Sampling Objective
3.1.1 Common Sampling Objectives
3.1.2 Relations Between Property Preservation and Estimation
3.2 Categorization By Type of Networks
3.3 Categorization By Sampling Approach
3.3.1 Common Sampling Approaches
3.3.2 Relations Between Sampling Approaches
4 Traversal Based Sampling
4.1 Breadth/Depth/Random First Sampling (B-/D-/R-FS)
4.2 Snow-Ball Sampling (SBS)
4.3 Random Walk (RW)
4.4 Metropolis-Hastings Random Walk (MHRW)
4.5 Random Walk with Escaping (RWE)
4.6 Multiple Independent Random Walkers (MIRW)
4.7 Multi-Dimensional Random Walk (MDRW)
4.8 Forest Fire Sampling (FFS)
4.9 Respondent Driven Sampling (RDS) (RWRW)
5 Graph Properties
5.1 Vertex/Edge Label Distribution
5.2 Classical Graph Properties
5.3 Classical Properties Studied in Literatures
5.4 Advanced Graph Properties
5.4.1 Cut and Ratio-/Normalized-/Weighted-Cut
5.4.2 Association and Ratio-/Normalized-/Weighted-Association
5.4.3 Conductance and Expansion
5.4.4 Quadratic Form
5.4.5 Modularity
5.4.6 Cohesion
5.4.7 Properties with Different Vertex Set
5.5 Distance Metrics for Properties
6 Property Preservation/Estimation Results
6.1 Network Size Estimation
6.1.1 Graph Identities on NS, GD and AD
6.1.2 Average Degree
6.1.3 Population Estimation
6.1.4 Density Estimation
6.2 Full Graph Observation
6.2.1 Vertex Sampling
6.2.2 Uniform Vertex Sampling with Neighbourhood
6.2.3 Non-uniform Vertex Sampling with Neighbourhood
6.2.4 Edge Sampling
6.2.5 Traversal Based Sampling
6.3 Degree Distribution
6.3.1 Vertex Sampling
6.3.2 Edge Sampling
6.3.3 Vertex Sampling with Neighbourhood
6.3.4 Edge Sampling with Vertex Label
6.3.5 BFS
6.3.6 Results for Certain Types of Graphs
6.4 Minimum Cut
6.4.1 Edge Sampling with Contraction
  Framework
  Previous Results
  Our Results
6.4.2 Vertex Sampling with Contraction
6.4.3 ESC Approximation
6.5 Cut
6.5.1 Uniform Edge Sampling
6.5.2 Non-uniform Edge Sampling Using Edge Strong Connectivity
6.6 RCut, NCut, Assoc, RAssoc, NAssoc, Volume
6.7 Modularity
6.8 Cohesion
6.9 Quadratic Forms
6.9.1 Non-uniform Edge Sampling Using Effective Resistance
6.10 Shortest Path Length
7 Conclusion and Future Works
Bibliography
Appendix A Useful Probability Results
Appendix B Long Derivations
Appendix C Other Works

1 Introduction

People have defined many properties to characterize a graph, e.g. degree distribution, spectrum, NCut, etc. Those properties are very important for people to understand a network. They may also be further developed into criteria or objectives for some problems. It is interesting to know whether those properties are preserved on a transformed graph. If so, running an algorithm on the transformed graph has approximately the same effect as running it on the original graph. Besides, we can estimate the properties of the original graph using the transformed graph. A simple but effective way of graph transformation is via sampling: select a subset of vertices or edges of the original graph. The biggest advantage of sampling methods is their execution efficiency, so that the graph transformation procedure won't take longer than straightforward computation on the original graph.

1.1 Some Motivating Examples

In this section, we discuss some concrete examples and motivate graph sampling from different aspects:

• Lack of data. For example, you cannot afford to crawl all the people from an Online Social Network (e.g. due to API call limits). Instead, you randomly pick IDs (it's sometimes possible to enumerate IDs for most OSNs) and crawl them. This is essentially sampling of vertices (edges are lost in bulk). It is good to know how well this process preserves certain graph properties. A simpler question to ask is: how many vertices/edges should we sample in order to get good coverage of certain information?
• Survey hidden population. In sociology studies, we may want to reach a hidden population, e.g. drug abusers. It is generally impossible to directly enumerate and sample from the whole population. Researchers usually start from a small set of seed nodes and expand according to their knowledge. Classical methods range from Snowball Sampling [17] to Respondent Driven Sampling [54].

• Graph Sparsification. Many modern complex networks are very large in size, making it hard to manipulate the whole graph. This calls for graph sparsification, which includes edge sparsification [8][32] and vertex sparsification [44]. Graph sparsification is a classical problem. It often imposes very stringent mathematical constraints on the transformation, like preserving all cuts (cut sparsifiers) or preserving all quadratic forms of the Laplacian (spectral sparsifiers). However, some of those techniques are too heavy to apply. We are interested in those as simple as straight sampling. See [21] for a survey.

• Reduce test cost. For example, the Protein Interaction Network [60] is a frequent object of study in biochemical research. Accurately testing the interactions between all possible neighbours may be too costly. In this case, one wants to sample the graph and only test the edges on the sampled graph. Although there are more mature approaches (e.g. SPCA [70] for gene micro-arraying) to prioritize which edges should be tested first, sampling usually gives an easy first solution.

• Visualization. The original graph may be too big to fit on a screen. Displaying all the edges may be too cluttered. Sampling can give a digest of the graph, which makes it easier for visualization [29]. Towards this end, we may hope the sampled graph "looks like" the original graph. This topic is also related to Dimensionality Reduction and Graph Embedding.
All those examples, although different in root motivation and mathematical depth, share the following common characteristics:

• The graph size is reduced during the transformation.
• Certain properties are expected from the output, e.g. "being representative", "looking similar to the original graph", etc.

1.2 Sampling as a Base Approach

In order to reduce the graph size and preserve graph properties at the same time, many complex approaches are possible. For example, one can formulate a mathematical programming problem in order to minimize the distance between the original and sampled graph. This approach, while being very rigorous, can be extremely costly. For example, solving the graph cut sparsifier problem (reduce the graph size such that cuts are preserved) is NP-Hard [21]. Even when polynomial approximation algorithms are used, they are still too complex. One may find that running the sparsifier is already more complex than running an algorithm on the original graph. This deviates from our primary pursuit. Another drawback of those complex methods is that they usually require global knowledge (the whole graph) in order to (approximately) solve the optimization. This makes them useless in some scenarios, e.g. Decentralized Social Networks (DSN), where we can only get part of the data and assume it is a sample (by some distribution) from the original graph. With these observations, we are more interested in simple techniques and what results can be obtained with them. Graph sampling is a simple and intuitive approach. It is usually one of the first thoughts when handling massive data. It has been widely used but its performance has not been systematically studied.

1.3 Organization

In Section 2, we introduce common notations used in this survey.
In Section 3, we provide a taxonomy of graph sampling from three angles: objective, graph type and sampling approach. Relations between different objectives and different sampling approaches are discussed. Among common sampling approaches, Traversal Based Sampling (TBS) is a large class of algorithms and has drawn a lot of research interest over the years. Towards this end, we devote the whole of Section 4 to it. In Section 5, we discuss graph properties ranging from classical text-book type ones to advanced ones which may be more useful to support graph algorithms. Past works are collected and tabularized in this section, showing that a lot of future work can be done to numerically and theoretically study those properties. In Section 6, we collect some ad hoc theoretical results and make extensions where affordable in a short period of time. Lastly, we conclude the survey and discuss future works in Section 7.

2 Common Notations and Definitions

In this section, we provide some common definitions for this paper. Because a wide spectrum of works is surveyed, many other symbols will be used (and may be overridden) locally. We'll give definitions in context. We consider an unweighted and undirected graph, which will be the main object discussed in this paper. Without special mention, we call an unweighted and undirected graph a "graph" for short. Denote a graph as G = <V, E>. V is the set of vertices and E ⊆ {(u, v) | u ∈ V, v ∈ V} is the set of edges, where (u, v) is an unordered pair. For convenience of discussion, we denote n = |V| and m = |E|. The neighbourhood of vertex v is N(v) = {u | (v, u) ∈ E, u ∈ V}. The degree of a vertex is defined as d_G(v) = d(v) = d_v = |N(v)|. These notations will be used interchangeably where appropriate.
The vertices are denoted as v_1, v_2, …, v_n, or simply 1, 2, …, n as a shorthand. To facilitate our discussion, we define the edges incident to a vertex by δ(v) = {(u, v) ∈ E | u ∈ N(v)}. The edges incident to a set of vertices S form vol(S) = ∪_{v ∈ S} δ(v). The edges at the boundary of S are δ(S) = {(u, v) ∈ E | u ∈ S, v ∉ S}. Note that δ(v) = δ({v}) = vol(v).

We briefly mention the definitions for weighted or directed graphs for completeness:

• The weighted graph is G = <V, E, w>, where w_e = w(u, v) ∈ R, ∀ e = (u, v) ∈ E.
• The directed graph is defined by changing the definition of an edge from the unordered pair (u, v) to the ordered pair <u, v>.

Other symbols on those graphs can be defined accordingly.

We denote the sampled graph by G_s = <V_s, E_s>, where some validity conditions are required:

    V_s ⊆ V
    E_s ⊆ E
    E_s ⊆ {(u, v) | u ∈ V_s, v ∈ V_s}    (1)

The first and second conditions ensure that the elements (either vertices or edges) are a sample from the original graph. The third condition ensures the sampled elements compose a valid graph. We denote n_s = |V_s| and m_s = |E_s|, respectively.

The sampling procedure involves a primitive, "probing", e.g. using one API call to get someone's buddy list on an OSN, or asking previous participants in an experiment to refer some of their acquaintances. The budget is denoted as B. There is usually a unit cost b associated with each probing operation. The cost and the information obtained during one probing vary according to the concrete application scenario.

Given a graph G, a property is defined as a function f(G), where f is probably a vector function in many cases. Note that this definition is different from the classical discussion of random graphs (e.g. [10] and [33]), where a property is defined as a subset of graphs from the family, i.e. Q ⊆ G.
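As a minimal sketch, the validity conditions (1) can be checked mechanically. The representation below (vertices as a set, edges as 2-tuples treated as unordered pairs) is an assumption made for illustration:

```python
def is_valid_sample(V, E, Vs, Es):
    """Check the validity conditions (1): Vs ⊆ V, Es ⊆ E (as unordered
    pairs), and every sampled edge joins two sampled vertices."""
    as_pairs = lambda edges: {frozenset(e) for e in edges}  # ignore pair order
    return (set(Vs) <= set(V)
            and as_pairs(Es) <= as_pairs(E)
            and all(u in Vs and v in Vs for (u, v) in Es))

V = {1, 2, 3, 4}
E = {(1, 2), (2, 3), (3, 4)}
print(is_valid_sample(V, E, {2, 3}, {(2, 3)}))  # True
print(is_valid_sample(V, E, {2, 3}, {(3, 4)}))  # False: endpoint 4 is not in Vs
```

The second call fails only the third condition: (3, 4) is an edge of G, but its endpoint 4 was not sampled, so <V_s, E_s> would not be a valid graph.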
If G is an r.v., so is f(G). The classical definition is just an indicator for a certain f(G). For example, define Q as the subset of graphs which are connected, and define f(G) as the number of connected components of G; obviously G ∈ Q ⇔ I[f(G) = 1]. Towards this end, we'll use the former definition, which is more general and implicitly adopted in a wide spectrum of works.

3 A Taxonomy of Sampling

There are many works related to "graph sampling". All these works have a sense of randomly picking vertices or edges (maybe according to current knowledge of the graph). However, they arise from different contexts and have different problem dimensions. In this section, we provide a taxonomy of the surveyed graph sampling works.

3.1 Categorization By Sampling Objective

3.1.1 Common Sampling Objectives

Common objectives of graph sampling are listed below:

1. Get a representative subset of vertices. This is the usual motivation from sociology studies, e.g. polling the opinion of the sampled vertices (people). In many scenarios, the target population can be sampled directly, e.g. by phone number, random street survey, etc. In other scenarios, the target population is hidden, e.g. drug abusers in an urban area. In this latter case, the researchers have to execute certain sampling algorithms on a graph to explore the hidden population.

2. Preserve certain properties of the original graph. A property of a graph can be viewed as a (possibly vector) function f(G). Sometimes we pursue exact property preservation (e.g. Section 6.4). Sometimes we only preserve the property within a certain error margin (e.g. Section 6.5). After performing sampling, two things can be done:

• Estimate graph properties. If we know some property is preserved on G_s, we can calculate f(G_s) as an estimator for f(G).
For example, MHRW (Section 4.4) preserves vertex label distributions (Section 5.1). So we can use f(G_s) as an estimator for f(G) if G_s is obtained by MHRW from G and f(·) is a vertex label distribution property. Note that for mere estimation purposes, one does not have to preserve the properties on G_s. As long as we know how the properties are biased, we may be able to correct them. RDS (Section 4.9) is such an approach.

• Support graph algorithms. Many graph algorithms aim at optimizing certain objectives associated with some graph properties. If we can preserve those properties on G_s, we may expect to obtain similar results by running the algorithm on G_s instead of G. Many sampling approaches are very efficient (e.g. ES, VS, ESC, VSC; Section 3.3). This gives a general method to accelerate a class of graph algorithms.

3. Generate random graphs. Graph generation is a big topic in its own right. However, some literature on graph generation also uses phrases like "graph sampling". There are some relations between the two topics. One can view a graph generation model as a family of graphs G. The generation procedure is to sample one graph G from G. In our terminology, let G = K_n be a complete graph of n vertices. The process of performing edge sampling (Section 3.3) on G is indeed the generation of an Erdos-Renyi Network (ERN).

In this paper, we will focus on the second type, namely, property preservation. Both property estimation and algorithm supporting works are discussed. Note that different algorithms depend on graph properties in different ways, so the sampling procedures are often tailored to those algorithms. Towards this end, we also term the latter type of works as Problem Oriented Property Preservation (POPP).
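The observation that edge sampling on K_n generates an ERN can be sketched as follows. This is a toy illustration; the keep-probability p and the seed are the only parameters:

```python
import itertools
import random

def ern_via_edge_sampling(n, p, seed=None):
    """Generate an Erdos-Renyi Network G(n, p) by edge sampling on the
    complete graph K_n: keep each of the C(n, 2) edges independently
    with probability p."""
    rng = random.Random(seed)
    V = set(range(n))
    E = {e for e in itertools.combinations(range(n), 2) if rng.random() < p}
    return V, E

V, E = ern_via_edge_sampling(100, 0.1, seed=42)
# E[|E_s|] = p * n(n-1)/2 = 0.1 * 4950 = 495, so |E| concentrates near 495.
```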
3.1.2 Relations Between Property Preservation and Estimation

Although the motivations of property estimation and property preservation look different initially, they are closely related and can sometimes be transformed into each other. Consider the original graph G and sampled graph G_s, and use cut weight as an example property:

• If we know cuts are preserved on G_s, we can directly compute the weight of the cut we are interested in, and it is naturally an estimator for the cut weight in G.

• If we know cuts are not preserved on G_s but we have an estimator for the cuts, we can try to modify G_s so that cuts are preserved. For example, if the edges are sampled with probability p, there are fewer edges in the cuts of G_s. However, we can upweight all the edges in G_s by a factor of 1/p and the (weighted) cuts are then preserved (e.g. Section 6.5.1).

We should note that the relationship is not symmetric. Namely, property preservation is actually the stronger and more useful notion. Here are a few remarks:

• All property preservation results can lead to property estimators.
• Not all property estimation results can easily be cast as property preservation results.
• In order to construct better estimators, people usually approach property estimation directly without first deriving a property preservation result. This simplifies the analysis because property preservation results are usually too strong for mere estimation purposes. In this way, property estimation is still of its own research interest.

In Section 6, we will discuss some ad hoc analytical results for both property estimation and property preservation.

3.2 Categorization By Type of Networks

Most works surveyed in this paper do not assume a specific type of network.
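The 1/p upweighting trick above can be sketched numerically. The graph and the cut below are made up purely for illustration; the point is that the weighted cut of the sampled graph is an unbiased estimator of the original cut:

```python
import random

def cut_weight(E, S, w=1.0):
    """Weight of the cut (S, V \\ S): total weight of the edges with
    exactly one endpoint in S."""
    return sum(w for (u, v) in E if (u in S) != (v in S))

# A synthetic graph and a cut, purely for illustration.
V = range(200)
E = {(u, v) for u in V for v in V if u < v and (u + v) % 7 == 0}
S = set(range(100))

# Sample each edge independently with probability p, then upweight by 1/p:
# E[cut_weight(E_s, S, 1/p)] = cut_weight(E, S).
p, rng = 0.5, random.Random(0)
E_s = {e for e in E if rng.random() < p}
true_cut = cut_weight(E, S)
estimate = cut_weight(E_s, S, w=1.0 / p)
```

Each crossing edge survives with probability p and then contributes 1/p, so its expected contribution is exactly 1; summing over crossing edges gives the unbiasedness claim.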
The evaluation is also done on real-life social networks [22], online social networks [15], biological networks [18], Internet AS networks [27], P2P networks [50], etc. The lack of an underlying graph generation model makes theoretical analysis difficult in general. Towards this end, most works accompanied by theoretical analysis assume a certain type of graph generation model:

• Erdos-Renyi Network (ERN), also called "random graph", exponential random graph, Poisson random graph, etc. This is the most classical and well-understood probabilistic network model. See [10] for a comprehensive discussion of the properties of ERN.

• Power-Law Network (PLN), also called scale-free network. It was shown in the last decade that many networks present power-law degree distributions, ranging from social networks to biological networks. Works like [35] and [60] focus on this type of network.

• Small-World Network (SMN) [65] is a combination of ERN and regular graph. Many real social networks also exhibit some SMN properties. One notable property is the existence of an efficient decentralized routing scheme [26].

• Fixed Degree Distribution Random Graph (FDDRG). In order to theoretically characterize the bias of BFS for degree distribution, [30] assumes a random graph model with fixed degree distribution. FDDRG is also termed the "configuration model" in [61].

3.3 Categorization By Sampling Approach

In this section, we first present the most frequently discussed sampling approaches. Vertex Sampling and Edge Sampling are two classical sampling methods. They are also building blocks for more complex methods. Vertex Sampling with Neighbourhood operates like Vertex Sampling but neighbourhood information is acquired in one probing. We also discuss two variations, VS and ES with contraction.
Traversal Based Sampling (TBS) is a large class of methods, so we briefly mention it here and leave the full discussion to a dedicated section. In the second part, we briefly discuss the relations between different sampling approaches and formalize the methodology for theoretical analysis and practical implementation.

3.3.1 Common Sampling Approaches

The commonly studied sampling techniques are:

• Vertex Sampling (VS). We first select V_s ⊆ V directly without topology information (e.g. uniformly or according to some distribution on V). Then we let E_s = {(u, v) ∈ E | u ∈ V_s, v ∈ V_s}, namely only edges between sampled vertices are kept. One variation, stratified sampling, is commonly used in survey studies. Usually, only information associated with vertices is used, e.g. demographic attributes. We also regard it as vertex sampling.

• Edge Sampling (ES). We first select E_s ⊆ E somehow. Then we let V_s^(1) = {u, v | (u, v) ∈ E_s}. This definition only arises in some theoretical discussions of basic sampling methods on graphs. A more realistic definition is to let V_s^(2) = V. Then the setting is the same as graph (edge) sparsification. Graph sparsification has more complex quality metrics and algorithms; edge sampling is just one of them. We do not deliberately distinguish the two definitions and one can find which one is used from context.

• Vertex Sampling with Neighbourhood (VSN). We first select Ṽ_s ⊆ V directly without topology information. Then we let E_s = ∪_{v ∈ Ṽ_s} δ(v), and V_s = {u, v | (u, v) ∈ E_s}. We return G_s = <V_s, E_s> as the sampled graph. Note that this is a more realistic setup than VS in social network crawling, where sampling a vertex means getting its buddy list. Ṽ_s is limited by available resources (e.g. API calls).

• Traversal Based Sampling (TBS) is a class of sampling methods.
The sampler starts with a set of initial vertices (and/or edges) and expands the sample based on current observations. This class of approaches arises naturally in the context of network crawling, hidden population surveys, etc. Examples include Snowball Sampling [17], Respondent Driven Sampling [22], Forest Fire [37], etc. It has a very long history and is also the recent research focus among the sampling methods listed here. The whole of Section 4 is devoted to the discussion of TBS.

The above approaches only assume partial knowledge of the graph. When we have full knowledge of a graph, sampling with contraction may be another approach. It can be used to support visualization or to accelerate follow-up algorithms. The following two types are less discussed in previous literature but we think they are promising in this context:

• Edge Sampling with Contraction (ESC). This is an iterative process. In each round one edge (u, v) ∈ E is sampled, and we let
  ◦ V ← V + {(u, v)} − {v} − {u}
  ◦ δ((u, v)) = {((u, v), w) | (u, w) ∈ δ(u) ∨ (v, w) ∈ δ(v)}
  ◦ E ← E + δ((u, v)) − δ(u) − δ(v)

• Vertex Sampling with Contraction (VSC). Similar to ESC, but one vertex v ∈ V is sampled each round, and the vertices in {v} ∪ N(v) are contracted into one vertex with corresponding edges. This is equivalent to performing edge contraction on all e ∈ δ(v). One will find that VSC is a more constrained version of ESC which introduces dependence between contracted edges.

3.3.2 Relations Between Sampling Approaches

VS and ES are simple and suitable for theoretical analysis. This is because samples taken from VS and ES are uncorrelated. Independence makes it possible for us to apply concentration bounds, e.g. the Chernoff bound (Appendix A).
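The basic VS and ES procedures of Section 3.3.1 can be sketched as follows (a minimal illustration assuming hashable vertices and edges as 2-tuples; the K_10 test graph is made up):

```python
import random

def vertex_sampling(V, E, k, rng):
    """VS: choose k vertices uniformly, then keep only the edges whose
    endpoints are both sampled (the induced subgraph)."""
    Vs = set(rng.sample(sorted(V), k))
    Es = {(u, v) for (u, v) in E if u in Vs and v in Vs}
    return Vs, Es

def edge_sampling(E, k, rng):
    """ES: choose k edges uniformly; V_s^(1) is the set of their endpoints."""
    Es = set(rng.sample(sorted(E), k))
    Vs = {x for e in Es for x in e}
    return Vs, Es

rng = random.Random(0)
V = set(range(10))
E = {(u, v) for u in V for v in V if u < v}   # complete graph K_10
Vs, Es = vertex_sampling(V, E, 4, rng)
# On K_10 the induced subgraph of any 4 vertices has C(4, 2) = 6 edges.
```

Note how VS discards every edge with an unsampled endpoint, which is why it tends to produce much sparser subgraphs than ES for the same budget.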
However, in most real applications, one cannot perform VS and ES directly due to all kinds of constraints, e.g. the inability to enumerate the ID space. In this case, TBS becomes more practical: it only relies on a small starting topology (e.g. a few seed nodes) and expands it during the exploration. Note that VS, ES and TBS are not totally different from each other. Certain TBS techniques can be used to generate VS or ES:

• Random Walk (RW) (Section 4.3) results in a uniform edge distribution on an undirected graph.
• Metropolis-Hastings Random Walk (MHRW) (Section 4.4) results in a uniform vertex distribution. In fact, the Metropolis-Hastings algorithm [42] can tailor the Markov chain to any vertex distribution.

These two observations are the bridges between theoretical analysis and practical implementation. The general methodology is:

• Analyze the properties with simple VS, ES, or their variants.
• Construct equivalent VS or ES samplers using TBS techniques.

Here are a few remarks on this methodology:

• If TBS is used to mimic VS and ES, we usually spend substantially more probes. For example, starting from an arbitrary node, we need to run the random walk for the mixing time in order to reach the stationary distribution. This is also known as the "burn in" period of the Markov chain. After reaching the stationary distribution, researchers often take samples from the random process with a certain gap, which results in a waste of resources.

• Although samples taken from TBS have correlations, this can sometimes be utilized to construct more efficient estimators for certain properties, e.g. [20] for clustering coefficient. Approaching from TBS directly can reduce the waste of mimicking VS and ES. This makes TBS-based property estimators of their own interest.
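The MHRW bridge can be sketched as follows. This is a minimal sketch assuming the graph is an adjacency-list dict; the star graph is a made-up test case where a plain RW would visit the hub half the time, while MHRW's stationary distribution is uniform:

```python
import random
from collections import Counter

def mhrw(adj, v0, steps, rng):
    """Metropolis-Hastings random walk targeting the uniform vertex
    distribution: from the current vertex v, propose a uniform neighbour
    u and move there with probability min(1, d(v)/d(u)); otherwise stay."""
    v, visits = v0, []
    for _ in range(steps):
        u = rng.choice(adj[v])
        if rng.random() < min(1.0, len(adj[v]) / len(adj[u])):
            v = u
        visits.append(v)
    return visits

# Star graph: hub 0 with four leaves. MHRW should spend roughly 20% of
# its time at each of the 5 vertices, despite the degree imbalance.
adj = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
freq = Counter(mhrw(adj, 0, 50000, random.Random(1)))
```

The acceptance ratio min(1, d(v)/d(u)) is exactly the Metropolis-Hastings correction for a uniform target: high-degree proposals are accepted less often, cancelling the degree bias of the underlying walk. As the remarks above note, the early (burn-in) portion of `visits` should be discarded in practice.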
4 Traversal Based Sampling

Traversal Based Sampling has a very long history and is still a research focus in recent years. It is also called "topology based sampling" [3] or "sampling by exploration" [36]. In this section, we present well-studied TBS techniques, from the classical ones to recent ones. Some less popular or too recent methods/improvements are briefly mentioned in Section C.

4.1 Breadth/Depth/Random First Sampling (B-/D-/R-FS)

One recent paper [11] formalized the framework for three intuitive graph traversal methods. Since Breadth First Sampling and Depth First Sampling are deterministic, we also use their classical names, Breadth First Search and Depth First Search, respectively. The framework works as follows:

• Initialize the queue with a starting vertex: Q ← {v_0}, V_s ← {v_0}. Let L = {} be the set of visited vertices.
• Loop until the budget B is exhausted:
  ◦ Dequeue one element: v = Q.dequeue()
  ◦ B ← B − b, L ← L + {v}
  ◦ For u ∈ N(v), u ∉ L, u ∉ Q, invoke Q.enqueue(u)

The sampled vertices are V_s = L, and the edges depend on the context. The difference between BFS, DFS and RFS is in the implementation of the "dequeue()" method. For BFS, the first element is selected; for DFS, the last element is selected; for RFS, a random element is selected.

4.2 Snow-Ball Sampling (SBS)

Snowball Sampling [17] has long been used in sociology studies, where an investigation is performed on a hidden population (e.g. drug abusers).
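The Section 4.1 framework admits a compact sketch (our own, assuming unit probing cost b = 1); the three variants differ only in the dequeue rule:

```python
import random

def traversal_sample(adj, v0, budget, mode="bfs", rng=None):
    """B-/D-/R-FS: dequeue the first (BFS), last (DFS) or a random
    (RFS) element of the queue; enqueue unseen neighbours."""
    rng = rng or random.Random(0)
    queue, seen, visited = [v0], {v0}, []
    while queue and budget > 0:
        if mode == "bfs":
            v = queue.pop(0)                       # first element
        elif mode == "dfs":
            v = queue.pop()                        # last element
        else:                                      # "rfs"
            v = queue.pop(rng.randrange(len(queue)))
        budget -= 1                                # unit cost b = 1
        visited.append(v)
        for u in adj[v]:
            if u not in seen:                      # u not in L and u not in Q
                queue.append(u)
                seen.add(u)
    return visited
```

The returned order makes the difference visible: on a small graph, BFS visits level by level while DFS dives along a branch first.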
It is also called "network sampling" [12][57] or "chain referral sampling" in much of the literature (as a counterpart of "straight sampling", "stratified sampling", etc). A stage-t name-k SBS is defined as:

• Start from an initial set of vertices V^(0). This set may be obtained through a random sample of vertices or side knowledge of the hidden population.
• At stage i, ask every v ∈ V^(i−1) to name k neighbouring nodes. How to name varies according to the specific study. This is equivalent to obtaining a sample of the incident edges of V^(i−1). We denote those edges by E^(i). The vertices observed in this stage are Ṽ^(i) = {u, v | (u, v) ∈ E^(i)}. We put the new vertices into stage i, i.e. V^(i) = Ṽ^(i) − ∪_{j=0}^{i−1} V^(j).
• The process continues for t stages. The sampled graph is G_s = (V_s, E_s), where V_s = ∪_{j=0}^{t} V^(j) and E_s = ∪_{j=1}^{t} E^(j).

Note that we output the whole sampled graph to make our discussion unified. In sociology studies, people only care about V_s most of the time. As long as V_s is a representative set of vertices of a certain population, statistics can be computed on this group of people, e.g. finding the average opinion on some issues. One remark is that SBS is very similar to BFS: BFS exhaustively expands the neighbourhood of the current vertex, while SBS only expands a fixed number of neighbours.

4.3 Random Walk (RW)

Random Walk also arises in different contexts. The general description is:

• Start from an initial vertex v^(0).
• At step i, choose one neighbouring vertex u ∈ N(v^(i−1)) (maybe uniformly at random or according to some weight). Let v^(i) ← u be the next vertex and include this edge: Ẽ_s ← Ẽ_s + {(v^(i−1), v^(i))}.
• Repeat for t = B/b steps. Return the sampled graph G_s = (V_s, E_s). There are two possibilities for the vertex set and edge set depending on the application:
  ◦ V_s = {v^(0), v^(1), ..., v^(t)} and E_s = Ẽ_s.
    This corresponds to the scenario where the neighbourhood of vertices is unknown, but we can somehow probe one neighbour according to some rules.
  ◦ Denote Ṽ_s = {v^(0), v^(1), ..., v^(t)}. Then E_s = ∪_{v∈Ṽ_s} δ(v) and V_s = {u, v | (u, v) ∈ E_s}. This is just the VSN technique corresponding to many network crawling scenarios. We simply treat RW as one way to perform VS and obtain Ṽ_s.

RW is related to SBS in the following aspects:

• One can regard RW as performing SBS with k = 1, where the "naming" criterion for participants is "pick k = 1 of your neighbours at random". However,
• RW is memoryless. In SBS, participants from previous stages are excluded. In RW, one can revisit some vertices.

The memoryless property of RW makes it more appealing for theoretical analysis, e.g. obtaining the stationary distribution from Markov chain analysis. It turns out that many analyses done on SBS are actually done on RW, especially in the discussion of RDS (Section 4.9). We have a few remarks for RW:

• It can be shown [40] that RW on an undirected graph (connected and non-bipartite) results in a uniform distribution on edges. Towards this end, uniform ES can be made practical via RW in many scenarios.
• A vertex has degree-proportional probability of being in V_s, i.e. π(v) = d(v)/m, but the transition probability between vertices can be varied to output a uniform VS. This is the problem that Metropolis-Hastings Random Walk (Section 4.4) wants to address.
• It is shown [52] (for degree distribution) that the error of estimators is inversely proportional to the spectral gap of the RW transition matrix D^{−1}A. This result is intuitive because the spectral gap encodes the conductance of the graph. The larger the conductance, the less likely a random walker gets stuck in a local region.
For graphs with a small spectral gap, MIRW (Section 4.6) and MDRW (Section 4.7) are proposed.

4.4 Metropolis-Hastings Random Walk (MHRW)

The Metropolis-Hastings algorithm [42] is widely used in Markov Chain Monte Carlo to obtain a desired vertex distribution on an arbitrary undirected connected graph. Denote the transition probability from x to y by P_{x,y} and the desired vertex distribution by π_x. The most general form to our knowledge is: ([43]*)

P_{x,y} =
  M_{x,y} · min{1, π_y/π_x}   if x ≠ y and y ∈ N(x)
  0                           if x ≠ y and y ∉ N(x)
  1 − Σ_{y≠x} P_{x,y}         if x = y

where M_{x,y} = M_{y,x} is a normalization constant for the pair (x, y) satisfying Σ_{y≠x} P_{x,y} ≤ 1. When we let M_{x,y} = M ≤ min_{v∈V} 1/d(v), this degrades to the result given by [43]. Note that adding more (higher weight) self-loops will make the mixing time longer, so it is better to choose a larger M_{x,y} when possible. One choice is M_{x,y} = min{1/d(x), 1/d(y)}. If we further let π_x = π_y = 1/n, the above equation degrades to the result quoted in [25][50][15].

To prove correctness, one can first observe that this is a valid conditional probability distribution and then show that π_x P_{x,y} = π_y P_{y,x}. Since π_x sums up to 1, time reversibility is satisfied. That means π_x is indeed the stationary distribution of this Markov chain.

One remark is that the application scenario of MHRW is more limited than that of ordinary RW. In order to calculate the transition probability, one has to know the degrees of neighbouring vertices. This information is in general unavailable. Nevertheless, there are some scenarios where it is available. For example, in a P2P network, the number of peers is usually a fixed parameter. For another example, the API of some OSNs may return the friend list with rich information besides mere IDs (e.g. numbers of followers and followees).
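A sketch (ours) of MHRW targeting the uniform vertex distribution, i.e. π_x = 1/n with M_{x,y} = min{1/d(x), 1/d(y)}, so that P_{x,y} = min{1/d(x), 1/d(y)} for neighbours. As remarked above, each step needs the neighbour's degree:

```python
import random
from collections import Counter

def mhrw(adj, v0, steps, rng=None):
    """Metropolis-Hastings RW with uniform target distribution.
    Propose a uniform neighbour y of x and accept with min(1, d(x)/d(y)),
    which realizes P_{x,y} = min{1/d(x), 1/d(y)}; otherwise self-loop."""
    rng = rng or random.Random(0)
    x, sample = v0, [v0]
    for _ in range(steps):
        y = rng.choice(sorted(adj[x]))
        if rng.random() < min(1.0, len(adj[x]) / len(adj[y])):
            x = y                 # accept; needs d(y), see the remark above
        sample.append(x)
    return sample
```

On a toy graph the visit frequencies approach 1/n for every vertex, unlike plain RW's degree-proportional frequencies.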
4.5 Random Walk with Escaping (RWE)

There is one widely used variation of the above-mentioned RW. Besides walking to neighbours, the random walker can jump to an arbitrary random node u ∈ V. In the original PageRank [9] computation, this serves to 1) make the chain both aperiodic and irreducible, and 2) enlarge the eigen-gap in order to converge faster. This technique is termed "random jump" in [36][25]. However, RWE is not a very meaningful technique in graph sampling. TBS in general solves the problem where the whole graph cannot be reached, or at least direct VS or ES is hard. RWE assumes the existence of efficient VS, which renders it less usable in many scenarios.

As a supplementary note, RWE's stationary distribution is obtained by solving the equation

π = α W π + (1 − α) (1/n) 1

where W = A D^{−1} gives the transition probabilities to neighbours. It gives

π = (1/n)(1 − α)[I − α W]^{−1} 1

The inversion makes π_v correlated with other entries of W, which are unavailable unless we have the full graph. So even if we can run RWE on the graph, it is hard to construct unbiased estimators for graph properties because we do not know π (Section 4.9 gives the way to correct bias if π is unknown). The only simple conclusions we can get lie at the two ends of α's domain: 1) when α = 0, π is uniform; 2) when α = 1, π is degree-proportional.

In [25], the authors proposed the "Albatross Sampling" algorithm. This is essentially the RWE presented above. The only difference is that W is constructed by MHRW instead of simply W = A D^{−1}.

4.6 Multiple Independent Random Walkers (MIRW)

One problem of RW is that it is prone to be trapped in a locally dense region. Thus it is believed to have high bias depending on the initial vertex.
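The fixed-point form of π above can be checked numerically without any matrix inversion (a toy sketch, ours): at α = 1 the iteration recovers plain RW's degree-proportional distribution, at α = 0 the uniform one.

```python
def rwe_stationary(adj, alpha, iters=2000):
    """Iterate pi <- alpha * W pi + (1 - alpha)/n, where W = A D^{-1}
    (vertex v spreads alpha * pi(v)/d(v) to each of its neighbours)."""
    nodes = sorted(adj)
    n = len(nodes)
    pi = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        nxt = {v: (1.0 - alpha) / n for v in nodes}
        for v in nodes:
            share = alpha * pi[v] / len(adj[v])
            for u in adj[v]:
                nxt[u] += share
        pi = nxt
    return pi
```

The α = 1 case only converges on non-bipartite graphs, consistent with the aperiodicity remark above.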
Note that this phenomenon is welcome in some applications like local graph partitioning ([33] L10), Sybil detection [69], etc. However, in many other applications, especially graph property estimation via random walks, we want to alleviate the problem. MIRW [15] was proposed to address this problem. Later, [51] showed that MIRW actually results in higher estimation errors and proposed MDRW (Section 4.7). We document MIRW here for completeness.

First, we perform VS to get l initial vertices. Then we split the budget among l random walkers and let them execute independently. To make it simple, let all random walkers walk t = ⌊B/l⌋ steps. In the stationary state (v_1, v_2, ..., v_l), the joint distribution on vertices is

P(v_1, v_2, ..., v_l) = ∏_{i=1}^{l} d(v_i)/m

and the joint distribution on edges is

P(e_1, e_2, ..., e_l) = ∏_{i=1}^{l} Σ_{v: e_i ∈ δ(v)} Pr{e_i | v} Pr{v} = ∏_{i=1}^{l} 1/m = 1/m^l

The problem with MIRW is that P(v_1, v_2, ..., v_l) can deviate from the uniform distribution P̃(v_1, v_2, ..., v_l) = 1/n^l by a considerably large amount. The difference is significant when the vertex degrees span a large range, which is the case for many real social networks. One cannot use it to estimate vertex labels.

4.7 Multi-Dimensional Random Walk (MDRW)

MDRW [51] is also called Frontier Sampling (FS). An l-dimensional RW works as follows:

• Initialize L = (v_1, v_2, ..., v_l) with l random vertices (e.g. via VS).
• At each step, choose one vertex v_i ∈ L with probability p(v_i) ∝ d(v_i). Choose a random neighbour u ∈ N(v_i). Assign E_s ← E_s + {(v_i, u)} and L ← (v_1, ..., v_{i−1}, u, v_{i+1}, ..., v_l).
• Loop until the budget is used up. Output E_s as the sampled edges.
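The l-dimensional walk above can be sketched as follows (ours; the walker to move is chosen with degree-proportional probability, as in the description):

```python
import random

def frontier_sampling(adj, budget, l=3, rng=None):
    """MDRW / Frontier Sampling: keep l walkers; at each step move the
    walker at v_i chosen with probability proportional to d(v_i)."""
    rng = rng or random.Random(0)
    walkers = rng.sample(sorted(adj), l)      # initial vertices via VS
    edges = []
    for _ in range(budget):
        degrees = [len(adj[v]) for v in walkers]
        r = rng.random() * sum(degrees)
        i = 0
        while r >= degrees[i]:                # degree-proportional pick
            r -= degrees[i]
            i += 1
        u = rng.choice(sorted(adj[walkers[i]]))
        edges.append((walkers[i], u))
        walkers[i] = u
    return edges
```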
We define the l-th Cartesian power of G, G^l = (V^l, E^l), as follows:

• V^l = {(v_1, v_2, ..., v_l) | v_1, v_2, ..., v_l ∈ V}
• E^l = {(u, v) | u = (u_1, ..., u_l) ∈ V^l, v = (v_1, ..., v_l) ∈ V^l, ∃i s.t. (u_i, v_i) ∈ E and u_j = v_j for all j ≠ i}

The l-dimensional RW on G is equivalent to a 1-dimensional RW on G^l. [51] shows that MDRW yields better estimators for some graph properties. [41] says that MDRW yields uniform distributions on both edges and vertices when l → ∞.

4.8 Forest Fire Sampling (FFS)

Forest Fire was originally proposed in [37] as a graph generation model which captures some important observations about real social networks, like the densification law, shrinking diameter, and community-guided attachment. In [36], the author adapted this graph generation model to perform graph sampling and termed it Forest Fire Sampling (FFS). It was later abbreviated to "forest fire" in many followup graph sampling works.

FFS is a probabilistic version of SBS (see Section 4.2). In SBS, k neighbours are selected in each round. In FFS, K ∼ Geometric(p)¹ neighbours are selected. SBS and FFS are linked by setting p = 1/k; then we have E[K] = k.

One remark is that, apart from the above analysis, FFS is still closer to SBS than to the other RW variations. In RW and its variations, repetitions are included in the sample for estimation purposes. When FFS is used to do estimation, the algorithm avoids previously selected vertices (once a vertex is "burned", it will no longer be burned again). This setting is the same as the original SBS, where nodes selected in previous stages are excluded from the current stage.

4.9 Respondent Driven Sampling (RDS) (RWRW)

RDS was first presented in sociology studies to perform estimation on hidden populations, e.g. [22][54].
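The burning process of Section 4.8 can be sketched as follows (ours; Geometric(p) with support {1, 2, ...}, so that p = 1/k gives E[K] = k):

```python
import random

def geometric(p, rng):
    """K ~ Geometric(p) on {1, 2, ...}; E[K] = 1/p, so p = 1/k mimics name-k SBS."""
    k = 1
    while rng.random() > p:
        k += 1
    return k

def forest_fire_sample(adj, seed, p, budget, rng=None):
    """Forest Fire Sampling: each burning vertex ignites K ~ Geometric(p)
    of its not-yet-burned neighbours; burned vertices never burn again."""
    rng = rng or random.Random(0)
    burned, frontier = {seed}, [seed]
    while frontier and len(burned) < budget:
        nxt = []
        for v in frontier:
            candidates = [u for u in sorted(adj[v]) if u not in burned]
            for u in rng.sample(candidates, min(len(candidates), geometric(p, rng))):
                burned.add(u)
                nxt.append(u)
        frontier = nxt
    return burned
```

At p = 1 each vertex ignites exactly one neighbour, so on a path the fire spreads deterministically.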
This technique has become popular recently and sees applications in many other fields. The original core idea is to run SBS and correct the bias according to the sampling probability of each vertex in V_s. Many people now implement SBS as RW because the stationary distribution derived from RW can easily be used to correct the bias. For this reason, [15] also termed this technique Re-Weighted Random Walk (RWRW).

Note that RDS (RWRW) and MHRW are often compared in recent literature, e.g. [15][50], leaving the impression that RDS is a sampling technique. Indeed, RDS itself is not a standalone sampling (or graph exploration) technique. It either uses SBS (original operation) or RW (current trend) for sampling. After that, it proposes to use the Hansen-Hurwitz estimator [19] to correct the bias of an estimator for the mean of the vertex label distribution (Section 5.1). It does not matter how we get the samples (by VS, ES, or TBS); as long as we know the sampling probability, we can invoke the bias correction technique. Since MHRW can generate a uniform sample of V, a direct estimator can be applied on the sampled vertices. Looking at the sampling+estimating procedures as a whole, RWRW and MHRW seem to have the same objective and similar results. [16] made this relation more clear.

Although RDS is not a standalone TBS technique in our terminology, we also note the idea here for easier reference. Suppose we have a sample |V_s| = n_s drawn with the distribution π(v), ∀v ∈ V. Repeated samples of a single vertex are allowed. Now we want to estimate the parameter θ = (1/n) Σ_{v∈V} g(v), where g: V → R is a function that generates vertex labels (e.g. degree, g(v) = d(v)). Note that the naive estimator T_1 = (1/n_s) Σ_{v∈V_s} g(v) is not consistent. As n_s → ∞, T_1 → Σ_{v∈V} π(v) g(v).
Now we want to substitute g(v) with h(v) in the expression of T_1, with the expectation that

n_s → ∞  ⇒  T_2 = (1/n_s) Σ_{v∈V_s} h(v) → θ = (1/n) Σ_{v∈V} g(v)

One can show that one choice of h(v) is given by h(v) = g(v)/(n π(v)). When n_s → ∞,

T_2 = (1/n_s) Σ_{v∈V_s} h(v) = E_π[g(v)/(n π(v))] = Σ_{v∈V} (g(v)/(n π(v))) π(v) = (1/n) Σ_{v∈V} g(v)

This estimator is still too general to use. There are two difficulties:

• n may be unknown in many scenarios. We must estimate n using π. By setting g(v) = 1, ∀v ∈ V, we know that (1/n_s) Σ_{v∈V_s} 1/(n π(v)) is a consistent estimator for 1, namely,

(1/n_s) Σ_{v∈V_s} 1/(n π(v)) → 1  ⇒  (1/n_s) Σ_{v∈V_s} 1/π(v) → n

in probability. So we find the consistent estimator for n:

n̂ = (1/n_s) Σ_{v∈V_s} 1/π(v)

and let our estimator for the parameter θ be

θ̂ = (1/n_s) Σ_{v∈V_s} g(v)/(n̂ π(v))

• The stationary distribution output by our RW is π(v) = d(v)/m. d(v) can sometimes be obtained during the TBS procedure; however, m is unknown in most scenarios. Luckily, with the above estimator n̂, m just cancels out and we get:

θ̂ = [(1/n_s) Σ_{v∈V_s} g(v)·m/d(v)] / [(1/n_s) Σ_{v∈V_s} m/d(v)] = (Σ_{v∈V_s} g(v)/d(v)) / (Σ_{v∈V_s} 1/d(v))   (2)

After addressing the two problems, RDS combined with RW is a practical approach to estimate graph properties without knowing the full graph.

1. In the original paper [37], it was said to be a binomially distributed value with a certain expectation. A binomial distribution is characterized by two parameters (n, p), but only one parameter was given there. In [3], the authors interpreted it as a geometrically distributed r.v., for which the given expectation is enough to characterize it. We adopt the interpretation of [3].
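Eq. (2) is straightforward to exercise. A toy check (our own construction): draw vertices with degree-proportional probability and verify that the reweighted estimator recovers the true mean degree, while the naive sample mean stays biased toward Σ_v d(v)² / Σ_v d(v).

```python
import random

def rds_estimate(sample, degree, g):
    """Bias-corrected RDS estimator of Eq. (2):
    theta_hat = (sum_v g(v)/d(v)) / (sum_v 1/d(v)) over the sample."""
    num = sum(g(v) / degree[v] for v in sample)
    den = sum(1.0 / degree[v] for v in sample)
    return num / den
```

Note that only the degrees of sampled vertices are needed, consistent with the cancellation of m above.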
By substituting g(v) with application-specific functions, we get the formulas used in [16][15][50]. One remark is that the correction method presented by RDS is not tied to RW. As long as we can draw samples from a distribution π and know the relative ratio between π(v) and π(u) for all v, u ∈ V_s, a similar estimator can be constructed.

5 Graph Properties

Given a graph G, a property is defined as a function f(G), where f is possibly a vector function in many cases. Note that this definition is different from the classical discussion of random graphs (e.g. [10][33, L5]), where a property is defined as a subset of graphs from the family, i.e. Q ⊆ G. If G is an r.v., so is f(G). The classical definition is just an indicator for a certain f(G). For example, define Q as the "subset of graphs which are connected"; define f(G): G → number of connected components; obviously G ∈ Q ⇔ I[f(G) = 1]. Towards this end, we will use the former definition, which is more general and implicitly adopted in a wide spectrum of works.

5.1 Vertex/Edge Label Distribution

Before we discuss concrete graph properties, we first present two general types of properties: vertex label and edge label [51]. Suppose there is a collection of labels 𝕃. Every vertex v is associated with a set of labels L(v) ⊆ 𝕃. The label distribution on 𝕃 is defined as

p(l) = (Σ_{v∈V} I[l ∈ L(v)]) / (Σ_{v∈V} |L(v)|)
where I[·] is the indicator function. For example, let 𝕃 be all non-negative integers and let L(v) = {d(v)}. Then p(l) becomes the degree distribution. Actually, some of the properties discussed in Section 5.2 are just vertex label distributions. The edge label distribution can be defined similarly, but it is seldom seen. The reason may be that most meaningful properties derived from mere topology are defined over vertices, e.g. the degree distribution and the clustering coefficient. Nevertheless, there are some application-specific properties which can be viewed as edge label distributions. For example, [50] uses two TBS techniques, RDS and MHRW (see Section 4), to sample a P2P overlay. One of the properties the authors measure is the Round Trip Time (RTT) between peers. This property is defined over edges and can be obtained at every peer query.

Here are a few remarks on vertex/edge label distributions:

• Label distribution gives a general framework to define a class of properties. This definition allows a straightforward construction of estimators. For example, a graph property f(G) can be calculated as f(V) if it is a vertex label distribution. Naturally, f(V_s) gives an estimator of this property if V_s is a uniform sample from V. Edge label distribution is similar. In other words, how well a sampler preserves the property becomes how well the sampler generates a uniform distribution over V_s or E_s.
• The above statement assumes that labels can be efficiently acquired as long as we sample a vertex. This is true in some cases, e.g. age, gender and other attributes crawled from OSN profiles. However, many other properties are not readily available upon each probe. The information acquired also differs across sampling algorithms. What's more, some properties, when associated with the type of the graph, can yield special results. Towards this end, a discussion of concrete properties is still needed even if they are covered by label distribution.

5.2 Classical Graph Properties

In order to characterize a graph, people have proposed dozens of widely used graph properties. Graphs sharing the same properties are believed to be "similar". Indeed, by observing or estimating those properties, people can already say something about the graph; e.g.
[6] performed classification on synthesized networks using 40+ features (variants of some of the following properties) and found that some types of networks are well separable from others. In this section, we collect most of the textbook-like properties.

• Network Size. Two simple values to describe network size are the number of vertices n = |V| and the number of edges m = |E|.
• Degree Distribution (Deg). Randomly pick a node X ∈ V and let p_deg(k) = Pr{d(X) = k}. p_deg(k) is thus the p.d.f. of the degree distribution.
• Average Degree (AD). Average degree is the expectation E[d(X)] = Σ_k k·p_deg(k).
• Power Law Exponent (PLE). After obtaining the degree distribution, one can fit the power law exponent γ, s.t. the fitted distribution p_fit(k) ∝ k^{−γ} is closest to the observed degree distribution p_deg(k). How to fit is beyond the scope of this paper; see [49] for more info. This property only makes sense when the original and sampled graphs are close to power-law graphs.
• Graph Density (GD). It is defined as the ratio of the observed number of edges over the maximum possible number of edges: density = m / C(n, 2) = 2m / (n(n−1)), where C(n, 2) = n(n−1)/2. Since 2m = Σ_k p_deg(k)·n·k = E[d(X)]·n, GD is just a summary metric of Deg.
• Path Matrix (PM). Denote the adjacency matrix by A. The t-step path matrix is [P_t]_{i,j} = 1 if ∃τ ≤ t such that [A^τ]_{i,j} > 0, and 0 otherwise. That is, the path matrix encodes the reachability information between all pairs of vertices.
• Shortest Path Matrix (SPM). The shortest path matrix can be defined on PM as S_{i,j} = arg min_t {[P_t]_{i,j} = 1}; namely, S records all-pairs shortest path distances.
• Average Path Length (APL). APL refers to the average shortest path length: APL = (1/C(n, 2)) Σ_{1≤i<j≤n} S_{i,j} = (2/(n(n−1))) Σ_{1≤i<j≤n} S_{i,j}.
• Closeness Centrality (CloC) [66].
  The farness of a vertex is defined as the sum of distances to all other vertices. The closeness centrality is just its inverse: Closeness(v) = 1 / Σ_{u∈V} S_{v,u}.
• Radius (R) [38]. The radius of a vertex is the maximum shortest path distance to all other vertices. We can define it using the SPM: R(v) = max_{u∈V} S_{v,u}.
• Diameter (Dia) [38]. The diameter is defined as the maximum radius over all vertices, i.e. D(G) = max_{v∈V} R(v). In other words, it measures the longest distance between all pairs of vertices. In practice, the network may be disconnected or have a few large-radius vertices, so the diameter may not be a meaningful property. Instead, the effective diameter is used, i.e. the distance within which a fraction α (e.g. α = 90%) of the pairs can reach each other.
• Betweenness Centrality (BC) [66] is the number of shortest paths a vertex is involved in. Denote the collection of shortest paths between s and t by σ_{s,t}. The paths v is involved in are σ_{s,t}(v) = {p ∈ σ_{s,t} | v ∈ p}. Then the betweenness centrality is defined as: Betweenness(v) = Σ_{s≠v, t≠v} |σ_{s,t}(v)| / |σ_{s,t}|.
• Assortativity (As). Assortativity is generally defined as the Pearson correlation of similarity between neighbouring vertices. One can use degree as the similarity measure [48], which leads to the following definition. The distribution of remaining edges (excluding the one that links the two vertices under consideration) is q_k = (k+1) p_deg(k+1) / Σ_j j p_j. Define the joint distribution of the degrees of two vertices by e_{j,k}. Then the assortativity is defined as: r = (1/σ_q²) Σ_{j,k} j k (e_{j,k} − q_j q_k).
• Clustering Coefficient (CC).
  For a node v, the clustering coefficient is defined as C(v) = |Δ(v)| / C(|N(v)|, 2), where Δ(v) = {(u, w) | u ∈ N(v), w ∈ N(v), (u, w) ∈ E} is the set of observed edges in the neighbourhood and C(|N(v)|, 2) is the number of all possible edges. One can also interpret it as the ratio of observed triangles over all possible triangles in v's ego-network.
• Average Clustering Coefficient (ACC). The average clustering coefficient (or Global Clustering Coefficient in [51]) is a summary property of the CC of all vertices: C(G) = (1/n) Σ_{v∈V} C(v).
• Global Clustering Coefficient (GCC). GCC is defined as [20]: GCC(G) = (|{(u, v, w) | (u, v), (v, w), (w, u) ∈ E}| / 3) / (|{(u, v, w) | (u, v), (v, w) ∈ E}| / 2). It is also called transitivity [55]. Note that there may be constant-factor differences in the definition across literatures.
• Self Similarity (SS). [39] proposed a probably more precise definition to capture the "scale-free" property of some networks. Let s(G) = Σ_{(u,v)∈E} d(u) d(v). This value is higher if vertices are connected to other vertices of similar degree. Denote s_max = max_{G′∈𝒢} s(G′), where 𝒢 is the family of graphs that have the same degree distribution as G. The self-similarity of G is defined as (digested from [67]): S(G) = s(G) / s_max. This value is positively related to the assortativity defined above.
• Spectrum (S). The spectrum of a matrix is its eigenvalue distribution. For a graph, several matrices may be used [23]: the adjacency matrix A; the Laplacian L = D − A; the normalized adjacency matrix 𝒜 = D^{−1/2} A D^{−1/2}; the normalized Laplacian matrix ℒ = I − 𝒜; the left-normalized adjacency matrix A_left = D^{−1} A; the left-normalized Laplacian matrix L_left = I − A_left. They have different physical meanings and different application scenarios. The spectrum encodes certain graph properties, e.g.
the eigen-gap of ℒ encodes the conductance of the graph.
• Largest Eigenvector (LEV). The largest eigenvector of A, 𝒜 and A_left encodes a type of centrality information of vertices. For example, the left eigenvector of A_left is just the PageRank value [9]. It is also the eigenvector of 𝒜 scaled by D^{1/2}. The eigenvector of A is defined as the eigenvector centrality [47].
• Smallest Eigenvector (SEV). The smallest non-zero eigenvector of L, ℒ, and L_left encodes a type of partition information. For example, the 2nd eigenvector of L will be piecewise linear if the graph has several connected components [62] (ideal partition). We stress "non-zero" because the smallest eigenvalue is 0 and the corresponding all-ones vector is not interesting (L·1 = 0).

5.3 Classical Properties Studied in Literatures

It turns out that many of the properties were numerically studied after certain sampling methods. However, only a few properties have been theoretically studied. In the later part of this chapter, we present two theoretical results we obtained during the survey. We summarize the properties studied in the surveyed works in Table 1. It may provide some pointers to potential future works.²

2. We cannot afford a thorough scan of all properties. There are many properties that appear in research works but are not so widely used. Interested readers can consult those literatures, e.g. [36] presented some dynamic properties; [6] presented 10 other summary properties which are not included in this section; see [55] for the generalized clustering coefficient; [64] is a classical textbook on social network analysis, including the discussion of many classical properties.
Property   Numerical Study                       Theoretical Study
NS         [28][20]                              [28][20]
Deg        [60][59][35][51][15][25][27][63][6]   [60][59][35][51][24][30][31][52][34][36][30][3][4][31][52][34][68]
AD         [11][28][4][31]                       [24][28]
PLE        [11]
GD         [11][28]                              [28]
R
Dia        [11][6][36]
PM
SPM        [6][3][4]
CloC       [6]
APL        [35][6][3][4]
BC         [35][6]
As         [35][11][51][15]                      [51]
CC         [15][6][36][3][4]
ACC        [35][51][15][63][6][36][4][20]        [51][20]
GCC        [20]                                  [20]
SS
S          [36]
LEV        [27][6][36]
SEV

Table 1. Studied properties in past literatures.

5.4 Advanced Graph Properties

Classical properties are well established from past years. They are useful for understanding graphs. However, those properties are less useful for further supporting graph algorithms. In this section, we discuss some advanced graph properties. These properties are richer in terms of capturing the structure of graphs, and they have been used as objectives of graph algorithms. By preserving these properties on a sampled graph, we can systematically accelerate those algorithms.
5.4.1 Cut and Ratio-/Normalized-/Weighted-Cut

The cut of a set S is defined as the edges crossing S and V − S:

Cut(S) = |δ(S)| = |{(u, v) ∈ E | u ∈ S, v ∉ S}|

The ratio cut takes the size of S into consideration:

RCut(S) = |δ(S)| / |S|

The normalized cut further considers the importance of vertices in terms of degree:

NCut(S) = |δ(S)| / vol(S)

The most general quantity of this category is the weighted cut:

WCut(S) = |δ(S)| / Σ_{v∈S} w(v)

where w(v) ∈ R, ∀v ∈ V is the weighting function. If w(v) = 1, ∀v ∈ V, this degrades to RCut. If w(v) = d(v), ∀v ∈ V, this degrades to NCut. The cut series can capture the structure of the graph very well. For example, when S is a singleton {v}, the cut degrades to |N(v)| = d(v); the cut distribution thus subsumes the degree distribution. The cut series arises as a building block of the optimization objective for many problems (e.g. NCut [56] for image segmentation). Interested readers can refer to [23] for the application to spectral clustering.

5.4.2 Association and Ratio-/Normalized-/Weighted-Association

Association is the counterpart of cut. It is defined as the edges within one set of vertices. The other variants can also be defined accordingly:

Assoc(S) = vol(S) − |δ(S)|
RAssoc(S) = Assoc(S) / |S|
NAssoc(S) = Assoc(S) / vol(S)
WAssoc(S) = Assoc(S) / Σ_{v∈S} w(v)

See the references in [23] for some applications using these quantities. To facilitate further discussion, we also denote ρ(S) = vol(S) − |δ(S)|.

5.4.3 Conductance and Expansion

Conductance and expansion are quantities similar to NCut and RCut. They are defined as:

Conductance(S) = |δ(S)| / min{vol(S), vol(V − S)}
Expansion(S) = |δ(S)| / min{|S|, |V − S|}

Due to the min operator, they are less used as optimization objectives.
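These definitions translate directly to an adjacency-list computation (a sketch of ours; note that Cut({v}) = d(v), consistent with the subsumption remark above):

```python
def cut_quantities(adj, S):
    """Return (Cut(S), RCut(S), NCut(S)) for an undirected graph
    given as an adjacency list; vol(S) is the sum of degrees in S."""
    S = set(S)
    cut = sum(1 for u in S for v in adj[u] if v not in S)
    vol = sum(len(adj[u]) for u in S)
    return cut, cut / len(S), cut / vol
```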
In some discussions, people restrict the set S to satisfy |S| ≤ |V|/2 or vol(S) ≤ vol(V)/2. In this way, the min operator vanishes and they are the same as RCut and NCut.

5.4.4 Quadratic Form

The quadratic forms of the adjacency matrix A and the Laplacian L = D − A are more general quantities to characterize the graph. Denote the characteristic vector for a set of vertices S by:

χ_S(v) = 1 if v ∈ S; 0 if v ∉ S

One can see that χ_S^T A χ_S encodes the association of set S and χ_S^T L χ_S encodes the cut of set S. Towards this end, the quadratic form is a more general notion than the cut series and association series. By using different matrices and weighted characteristic vectors, we can encode other variants of cut and association. For example, one can expand the quadratic form of the Laplacian as

x^T L x = Σ_{(u,v)∈E} (x(u) − x(v))^2

We can set the characteristic vector to be

χ_S(v) = 1/√|S| if v ∈ S; 0 if v ∉ S

and then

χ_S^T L χ_S = Σ_{(u,v)∈δ(S)} (1/√|S| − 0)^2 = |δ(S)| / |S|

This is just the ratio cut. Other quantities can be obtained similarly.

In graph sparsification problems, a very widely used criterion for ε-approximation is: ([58] ch6.4)

(1 − ε) x^T L_G x ≤ x^T L_H x ≤ (1 + ε) x^T L_G x, ∀x ∈ ℝ^n

According to the analysis in this section, we also propose to use the adjacency matrix as an alternative criterion: (*)

(1 − ε) x^T A_G x ≤ x^T A_H x ≤ (1 + ε) x^T A_G x, ∀x ∈ ℝ^n

The adjacency matrix encodes yet another type of information (e.g. association) which is not readily available from the Laplacian. Thinking along this line^3, the quadratic form of the degree matrix is also useful: (*)

(1 − ε) x^T D_G x ≤ x^T D_H x ≤ (1 + ε) x^T D_G x, ∀x ∈ ℝ^n

It encodes the volume of a set of vertices. If a sampling procedure can preserve those quadratic forms, a larger class of algorithms may apply.
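A small numerical check of these encodings (a sketch with our own toy graph; NumPy is assumed available):

```python
import numpy as np

# Sketch: the quadratic forms of L, A and D encode cut, association
# and volume. The 5-vertex example graph is an illustrative assumption.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)]
n = 5
A = np.zeros((n, n))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0
D = np.diag(A.sum(axis=1))
L = D - A

S = [0, 1, 2]
chi = np.zeros(n)
chi[S] = 1.0                       # 0/1 characteristic vector of S

cut = chi @ L @ chi                # |delta(S)|
assoc = chi @ A @ chi              # vol(S) - |delta(S)| (association)
volume = chi @ D @ chi             # vol(S)

chi_r = np.zeros(n)
chi_r[S] = 1.0 / np.sqrt(len(S))   # scaled characteristic vector
rcut = chi_r @ L @ chi_r           # |delta(S)| / |S|, the ratio cut
print(cut, assoc, volume, rcut)
```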
5.4.5 Modularity

Modularity is a classical quality measure in community detection problems [46]. We use the form presented in [1]. Let 𝒞 = {C_1, C_2, ..., C_n} be a clustering of the graph s.t. C_i ∩ C_j = ∅ and C_1 ∪ C_2 ∪ ... ∪ C_n = V. The modularity of the whole graph can be expressed as:

Q(𝒞) = (1/2m) Σ_{C∈𝒞} Σ_{u∈C, v∈C} (A_{u,v} − d_u d_v / 2m)

The term d_u d_v / 2m computes the expected number of edges between u, v on a Fixed Degree Distribution Random Graph (FDDRG). Thus, for each cluster, it computes the deviation of the observed graph from a random graph. If a set of vertices have closer relationships, their modularity should be higher. This fact makes modularity a widely accepted metric for community detection problems. To make it a property similar to the cut series and association series, we can define the modularity for a set of vertices: (*)

Modularity(S) = Σ_{u∈S, v∈S} (A_{u,v} − d_u d_v / 2m) = ρ(S) − Σ_{u∈S, v∈S} d_u d_v / 2m    (3)

5.4.6 Cohesion

Cohesion [13] is a recently proposed metric for community detection problems. It generalizes the notion of cuts: cuts are edges that cross the boundary of a vertex set, while cohesion measures the triangles that cross the boundary of a vertex set. We first define the set of triangles. Given three vertex sets S_1, S_2 and S_3, the triangles spanned by the three sets are:

∆(S_1, S_2, S_3) = {(u, v, w) | u ∈ S_1, v ∈ S_2, w ∈ S_3, (u, v) ∈ E, (v, w) ∈ E, (w, u) ∈ E}

Note that we use (u, v, w) to denote an unordered triplet. The inner triangles of a set of vertices S are ∆_i(S) = ∆(S, S, S). The boundary (outer) triangles are ∆_o(S) = ∆(S, S, V − S). The original cohesion is defined as:

Cohesion(S) = (|∆_i(S)| / C(|S|, 3)) × (|∆_i(S)| / (|∆_i(S)| + |∆_o(S)|))

where C(|S|, 3) denotes the binomial coefficient "|S| choose 3".

3. This is an afterthought when I finished most of the paper.
By adding the degree matrix D in the quadratic form section, all of these quantities are encoded by quadratic forms. So the quadratic form turns out to be the most general in this series. This is the same observation made by Prof. Lau.

The first term considers the density of the group of vertices in terms of triangles. The second term measures how isolated the group of vertices is. Fig. 1 in [13] is a good illustration of why cut fails but cohesion succeeds in capturing certain community structures.

During the sampling procedure, the number of triangles is deemed to decrease, and we can quantify this. Suppose the edge sampling probability is p. Using the linearity of expectations:

E[#Triangles] = Σ_{(u,v,w)∈∆(S_1,S_2,S_3)} Pr{(u,v)} × Pr{(v,w)} × Pr{(w,u)} = |∆(S_1, S_2, S_3)| p^3

The triangle density will also decrease. Maybe we can fix this by up-weighting edges; we leave it to future discussions. The second term is homogeneous; that is, after sampling, its value stays the same in terms of expectation.

Note that besides ∆_i and ∆_o defined above, there is another type: ∆(S, V − S, V − S). Let ∆(S) be all the triangles related with S:

∆(S) = ∆_i(S) + ∆_o(S) + ∆(S, V − S, V − S)

We can define Triangle Cut (TCut), a quantity similar to NCut:

TCut(S) = |∆_o(S)| / |∆(S)|

and Triangle Association (TAssoc):

TAssoc(S) = |∆_i(S)| / |∆(S)|

5.4.7 Properties with a Different Vertex Set

The above mentioned properties are all defined with the full vertex set, i.e. after sampling, V_s = V. If the vertex sets are different, it is very hard to say, for example, what it means for "a cut to be preserved". Other properties like quadratic forms also suffer from this problem. Towards this end, one will find that the properties discussed in this section are always studied along with edge sampling.
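The triangle classification behind cohesion, TCut and TAssoc can be sketched as follows; the example graph (a K4 sharing vertex 3 with a pendant triangle) is our own illustration:

```python
from itertools import combinations
from math import comb

# Sketch: counting inner/outer triangles and computing cohesion,
# TCut and TAssoc. The example graph is an illustrative assumption.

def classify_triangles(edges, nodes, S):
    E = {frozenset(e) for e in edges}
    inner = outer = far = 0
    for t in combinations(sorted(nodes), 3):
        if all(frozenset(p) in E for p in combinations(t, 2)):
            k = sum(1 for v in t if v in S)
            if k == 3:
                inner += 1    # Delta_i(S) = Delta(S, S, S)
            elif k == 2:
                outer += 1    # Delta_o(S) = Delta(S, S, V - S)
            elif k == 1:
                far += 1      # Delta(S, V - S, V - S)
    return inner, outer, far

# K4 on {0,1,2,3} plus the triangle {3,4,5} attached at vertex 3.
edges = [(0,1),(0,2),(0,3),(1,2),(1,3),(2,3),(3,4),(3,5),(4,5)]
S = {0, 1, 2, 3}
inner, outer, far = classify_triangles(edges, range(6), S)

cohesion = inner / comb(len(S), 3) * inner / (inner + outer)
tcut = outer / (inner + outer + far)
tassoc = inner / (inner + outer + far)
print(inner, outer, far, cohesion, tcut, tassoc)
```

Here S induces a clique with no boundary triangles, so its cohesion is maximal, while TAssoc < 1 because one triangle lies mostly outside S.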
We want to extend the definitions of those properties so that they are meaningful after vertex sampling. [44] defines such an objective. Suppose we have a set of terminal vertices K. After graph transformation, the resulting graph preserves all min-cuts between U and K − U, where U ⊆ K. How to come up with meaningful counterparts of the properties discussed in this section is still an open problem.

5.5 Distance Metrics for Properties

Note that there are basically two types of properties mentioned above. One type is a vector or distribution in nature, e.g. Degree Distribution, Radius. The other type is a digest of a distribution property obtained by taking max, min, or average, e.g. Average Degree, Diameter (max of Radius). Given the original graph G and sampled graph G_s, we want to know how far f(G_s) is from f(G).

The second type is just a scalar and the common measure of the quality of estimation is the Normalized Root Mean Square Error: (e.g. [29][63][51])

NRMSE(θ̂, θ) = sqrt(E[(θ̂ − θ)^2]) / θ

For the first type, some distance metrics are used. Let p be the distribution derived from the original graph and q be the distribution derived from the sampled graph. Suppose they are defined on Ω; one can measure the distance in the following ways:

• Total variation distance (e.g. [51]) measures all the difference between the two distributions:

TV(p, q) = max_{A⊆Ω} |p(A) − q(A)| = (1/2) Σ_{x∈Ω} |p(x) − q(x)|

• Kullback-Leibler (KL) divergence captures the difference of the two distributions accounting for the bulk of the distributions:

KL(p, q) = Σ_{x∈Ω} p(x) ln(p(x)/q(x))

• Kolmogorov-Smirnov (KS) statistic (e.g. [2]) captures the maximum vertical distance between the c.d.f.s of the two distributions.
When Ω = ℝ, it can be defined as:

KS(p, q) = max_x |Σ_{ξ≤x} p(ξ) − Σ_{ξ≤x} q(ξ)|

6 Property Preservation/Estimation Results

In this section we discuss some ad hoc results on property preservation or property estimation. Although the two objectives are initially different, their results can transform to each other as discussed before. This section is still under development. We have markers for sections coming in future versions of this survey. Interested readers can come back to arXiv to retrieve them.

6.1 Network Size Estimation

Network size estimation may be the most direct objective of graph sampling. However, we did not find many works in this direction. There were previous works for population estimation (Section 6.1.3), but they are not tuned for graphs or not designed to utilize graph structures.

6.1.1 Graph Identities on NS, GD and AD

We use n = |V| and m = |E| to describe the size of a network. They are related to the graph density ρ^4 and average degree ⟨d⟩. The following graph identities are useful:

ρ = 2m / (n(n − 1))    (4)
⟨d⟩ = 2m / n    (5)
n = ⟨d⟩/ρ + 1    (6)

One may estimate some of those quantities and use the above identities to estimate the rest. For example, the estimation of average degree is well understood (Section 6.1.2). There are also many classical methods for population estimation, leading to many (maybe not so accurate) estimators for n (Section 6.1.3). Combining the two quantities, we can estimate m and ρ using those identities. [28] takes a different approach: estimate ρ and ⟨d⟩, then use the graph identities to get n and m.

6.1.2 Average Degree

One can apply the analysis for degree distribution to see that average degree is not preserved by VS and ES. However, by VSN (practical for OSN crawling), one can obtain the neighbourhood.
Averaging over directly crawled vertices yields an unbiased estimator for the average degree of G:

⟨d⟩ = (1/|Ṽ_s|) Σ_{v∈Ṽ_s} d(v)    (7)

If we use RW, it is known that the samples are biased towards high degree vertices. We can use RDS to correct it. Simply do RW on the graph and substitute g(v) = d(v) in Eq. 2:

⟨d⟩ = |V_s| / Σ_{v∈V_s} (1/d(v))    (8)

4. ρ is overloaded in the discussion of association.

That is to say, we use the harmonic mean of the degrees of our sampled vertices V_s as the estimator for the mean degree of the graph.

6.1.3 Population Estimation

We digest classical population estimation methods from the pointers in [28]. Interested readers can see the references therein. The population estimation problem assumes the existence of vertex samplers (repetitions are allowed). Both uniform and non-uniform samples are workable, as long as we know the distribution. We discuss some classical methods using uniform samples; the estimators on non-uniform samples can be designed similarly (using the Hansen-Hurwitz estimator [19]):

• Capture-recapture. Suppose we have S_1 and S_2 being two independent samples without replacement (only retaining unique elements). The expected size of the intersection is

E[|S_1 ∩ S_2|] = Σ_{v∈S_1} Pr{v ∈ S_2} = |S_1||S_2| / |V|

So one estimator is

n̂ = |S_1||S_2| / |S_1 ∩ S_2|

• Unique element counting. Let S be the sample with replacement (with repetitions) and S_uniq be the unique elements of S. Counting the unique elements is just a balls-and-bins problem: throw |S| balls into n bins; how many bins are occupied?

E[|S_uniq|] = Σ_{i=1}^{n} Pr{v_i ∈ S} = Σ_{i=1}^{n} (1 − (1 − 1/n)^{|S|}) ≈ n(1 − e^{−|S|/n})    (9)

One can solve for n using observed quantities as an estimator.

• Collision counting. Pick two vertices from a set of samples S = {s_1, s_2, ..., s_{|S|}}.
If they are the same vertex, we call it a collision. The expected number of collisions is:

E[N] = Σ_{i<j} Pr{s_i = s_j} = C(|S|, 2) × (1/n)

One can obtain the collision counting estimator:

n̂ = C(|S|, 2) / N

How to quantify the accuracy of those estimators is of its own research interest in population estimation.

6.1.4 Density Estimation

The method [28] uses to estimate density is very similar to collision counting estimation for population size. The graph density can be interpreted as the probability that a randomly chosen pair of vertices is adjacent. Denote the sampled vertices (with replacement) by S and the number of adjacent pairs in S by N, namely N = Σ_{i<j} I[(s_i, s_j) ∈ E]. We have the following relation:

E[N] = Σ_{i<j} Pr{(s_i, s_j) ∈ E} = C(|S|, 2) ρ

A similar estimator can be designed if the vertex sampling is non-uniform. See [28] for more details.

6.2 Full Graph Observation

In graph crawling, the first question to ask is: how many vertices/edges do we need to sample in order to observe a considerable portion (e.g. a 1 − ε fraction) of the graph?

6.2.1 Vertex Sampling

There are two scenarios:

• If we can only do VS with replacement, this is a coupon collector problem and we know the number of samples we need in order to observe the whole graph is Θ(n ln n + cn). It has a sharp threshold behaviour at this value. In order to observe a 1 − ε fraction of the vertices, the number of samples s must satisfy: (using Eq. 9)

n(1 − e^{−s/n}) ≥ (1 − ε)n

which gives:

s ≥ n ln(1/ε)

• If we can do VS without replacement, it becomes trivial: s ≥ (1 − ε)n.

6.2.2 Uniform Vertex Sampling with Neighbourhood

VSN makes the problem more interesting and is the usual case of OSN crawling. We only consider the scenario without replacement, because one can easily remove duplicates during the crawling.
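The coupon-collector style bound for uniform VS with replacement (Section 6.2.1) can be sanity-checked by simulation; n, ε and the number of trials below are our own illustrative choices:

```python
import math
import random

# Sketch: with s = n ln(1/eps) uniform draws (with replacement),
# the observed fraction of vertices should concentrate near 1 - eps.
# Parameters are illustrative assumptions.
random.seed(0)
n, eps = 2000, 0.1
s = math.ceil(n * math.log(1 / eps))   # the bound from Section 6.2.1

fractions = []
for _ in range(20):
    seen = {random.randrange(n) for _ in range(s)}
    fractions.append(len(seen) / n)
avg = sum(fractions) / len(fractions)
print(s, round(avg, 3))   # coverage close to 1 - eps = 0.9
```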
If the minimum degree of the graph is d_min, we expect VSN to be d_min times more efficient than VS. However, the neighbourhood of vertices sampled later may have a larger probability of having already been observed in previous crawling. Denote the sampled set by S (with repetition). The number of new vertices introduced by one crawl, when there are already N unique vertices (including the neighbouring vertices of S), is:

E[#new | N] = Σ_{v∈V−S} Pr{select v} Σ_{u∈N(v)} Pr{u not yet observed}
= Σ_{v∈V−S} Pr{select v} d(v)(1 − N/n)
≈ Σ_{v∈V} Pr{select v} d(v)(1 − N/n)
= ⟨d⟩(1 − N/n)

The approximation holds because under uniform VS, S, V − S and V should have approximately the same degree distribution (Section 6.3.3). Denote the number of unique vertices at step i by N_i. The expected number of unique vertices after t steps can be calculated as:

E[N_t] = Σ_x (E[#new | x] + x) Pr{N_{t−1} = x}
= Σ_x (⟨d⟩ + (1 − ⟨d⟩/n) x) Pr{N_{t−1} = x}
= ⟨d⟩ + (1 − ⟨d⟩/n) E[N_{t−1}]

The boundary condition is E[N_1] = ⟨d⟩. So

E[N_t] = ⟨d⟩ + (1 − ⟨d⟩/n)⟨d⟩ + ... + (1 − ⟨d⟩/n)^{t−1} E[N_1]
= ⟨d⟩ × (1 − (1 − ⟨d⟩/n)^t) / (1 − (1 − ⟨d⟩/n))
= n(1 − (1 − ⟨d⟩/n)^t)

In order to observe a 1 − ε fraction of the vertices, we need to have

t ≥ (n/⟨d⟩) ln(1/ε)

which is ⟨d⟩ times more efficient than VS.

6.2.3 Non-uniform Vertex Sampling with Neighbourhood

Uniform VSN is ⟨d⟩ times more efficient than VS, because sampling a vertex can give us some neighbours. However, this approach is not optimal. We can try to sample higher degree vertices first, which potentially brings us more new vertices in one crawl. The number of new vertices depends largely on what we have crawled, and the argument is not easy. We can only provide a loose lower bound:

t ≥ (n/d_max) ln(1/ε)

6.2.4 Edge Sampling

In the usual sense, ES only selects an edge and its two endpoints.
If we only look at one vertex, ES degrades to VS with a degree-proportional distribution. The efficiency will not exceed that of ordinary uniform VS. In order to give a loose lower bound, we can treat the two endpoints as independently picked from the graph. That is:

t ≥ (n/2) ln(1/ε)

To give a tighter bound, we can consider the smallest degree vertex. Each sampled edge covers it with probability d(v)/m ≥ 1/m. So we have:

t ≥ m ln(1/ε)

6.2.5 Traversal Based Sampling

Most TBS are variants of RW. So this becomes the question of finding the cover time of a Markov chain. Before knowing detailed information about the graph, one can apply a simple bound, O(nm) [33, L6], if the graph is connected. A tighter version for a general graph is [33, L6]:

m R(G) ≤ C(G) ≤ 2e^3 m R(G) ln n + n

where R(G) = max_{u,v} R_{u,v} and R_{u,v} is the effective resistance of the graph between u and v.

Many social graphs turn out to be small-world graphs, that is, the diameter is O(ln n). Find a pair of vertices u, v that attains this maximum shortest path length. We apply Rayleigh's monotonicity principle of resistance: after removing all edges not on the shortest path, the resistance can only increase. So we know R_{u,v} = O(ln n). We can bound the cover time:

C(G) = O(2e^3 m ln^2 n + n)

This is not too much worse than VS with replacement. When one cannot enumerate the ID space of an OSN, RW may also be effective.

Note that cover time guarantees all vertices are visited. If we only ask for a 1 − ε fraction, a similar analysis can be adapted. First, the maximum hitting time is H = 2m ln n for our small-world graph (2m R(G), [33, L6]), regardless of the starting vertex. After 2H steps, the probability that a given vertex is not visited is at most 1/2 by Markov's inequality. After t × 2H steps, the probability that a given vertex is not visited is at most (1/2)^t.
We use the linearity of expectation to compute the number of visited vertices:

E[N] = Σ_{v∈V} Pr{v visited} ≥ n(1 − (1/2)^t)

To make this at least (1 − ε)n, we need t ≥ log_2(1/ε). So it takes on the order of 2m ln n × log_2(1/ε) steps to observe a 1 − ε fraction of all the vertices.

6.3 Degree Distribution

Denote the probability density function of the degree distribution of the original graph by p(k). To obtain the degree distribution of the sampled graph, we simply calculate every point of p_s(k), i.e. p_s(k) = Pr{d_{G_s}(X) = k}, where X ∈ V_s is a random vertex picked from the sampled graph.

6.3.1 Vertex Sampling

To obtain a degree-k vertex in G_s, we basically need two phases: 1) one v ∈ V with d_G(v) ≥ k is picked; 2) v and k of its neighbours are preserved in the sampling. That is: ([35])

p_s(k) = Pr{d_{G_s}(X) = k}
= Σ_{j≥k} Pr{d_{G_s}(X) = k | d_G(X) = j} Pr{d_G(X) = j}
= Σ_{j≥k} [C(j, k) C(n−1−j, n_s−1−k) / C(n−1, n_s−1)] p(j)    (10)

Suppose instead that n_s is not fixed, but we sample every vertex independently with probability α = ñ_s/n, where ñ_s is the intended number of vertices in the sampled graph. Then the degree distribution becomes: ([59], [35]*)

p_s(k) = Σ_{j≥k} Pr{d_{G_s}(X) = k | d_G(X) = j} Pr{d_G(X) = j} = Σ_{j≥k} C(j, k) α^k (1 − α)^{j−k} p(j)    (11)

When n → ∞ with the ratio α = n_s/n fixed, Eq. 10 is approximately equivalent to Eq. 11:

C(n−1−j, n_s−1−k) / C(n−1, n_s−1)
= [(n−1−j)! / ((n−j−n_s+k)! (n_s−1−k)!)] × [(n−n_s)! (n_s−1)! / (n−1)!]
= [(n−n_s)! / (n−j−n_s+k)!] × [(n−1−j)! / (n−1)!] × [(n_s−1)! / (n_s−1−k)!]
≈ (n−n_s)^{j−k} (n−1)^{−j} (n_s−1)^k
≈ (1−α)^{j−k} α^k

6.3.2 Edge Sampling

ES turns out to give the same result as Eq. 11 ([35]). The first equality holds regardless of the sampling technique that is used.
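Eq. 11 can be checked numerically against a direct simulation of independent vertex sampling; the random graph model, α, and trial counts are our own illustrative choices:

```python
import random
from math import comb

# Sketch: empirical sampled-graph degree distribution vs. Eq. 11.
# Graph model and parameters are illustrative assumptions.
random.seed(1)
n, p_edge, alpha = 60, 0.1, 0.5
adj = [[] for _ in range(n)]
for u in range(n):
    for v in range(u + 1, n):
        if random.random() < p_edge:
            adj[u].append(v)
            adj[v].append(u)
p = [sum(1 for a in adj if len(a) == k) / n
     for k in range(max(len(a) for a in adj) + 1)]

def p_s(k):
    """Eq. 11: sum_{j >= k} C(j,k) alpha^k (1-alpha)^(j-k) p(j)."""
    return sum(comb(j, k) * alpha**k * (1 - alpha)**(j - k) * p[j]
               for j in range(k, len(p)))

# Empirical distribution: keep each vertex independently w.p. alpha,
# then record the induced degree of every kept vertex.
counts, total = {}, 0
for _ in range(2000):
    kept = {v for v in range(n) if random.random() < alpha}
    for v in kept:
        d = sum(1 for u in adj[v] if u in kept)
        counts[d] = counts.get(d, 0) + 1
        total += 1
emp = {k: c / total for k, c in counts.items()}
print(round(emp.get(3, 0.0), 3), round(p_s(3), 3))  # should be close
```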
To argue the conditional probability in the summation, note that "k out of j neighbours are sampled" is equivalent to "k out of j incident edges are sampled", thus giving the same result.

Eq. 11 only confirms one intuitive conclusion: vertices in G_s in general have lower degree than vertices in G. Beyond this, it does not lead to a useful application. As noted in Section 5.4.1, the cut subsumes the neighbourhood. Using this fact, we can apply the ES result for cuts to find an estimator of degree. More concretely, if the sampling probability of edges is p, we set the weight of edges in E_s to be 1/p. We calculate the (weighted) degree distribution of G_s, which is an estimator for the degree distribution of G, with bounded error.

6.3.3 Vertex Sampling with Neighbourhood

VSN will give the precise degree for v ∈ Ṽ_s (not V_s). So a straightforward estimator for the degree distribution is

p_s(k) = (1/|Ṽ_s|) Σ_{v∈Ṽ_s} I[d(v) = k]

If we treat Ṽ_s as a r.v., it can be shown that

p_s(k) = Pr{d_{G_s}(X) = k | X ∈ Ṽ_s}
= Pr{d_{G_s}(X) = k | X ∈ Ṽ_s, d_G(X) = k} Pr{d_G(X) = k | X ∈ Ṽ_s}
= 1 × Pr{d_G(X) = k}
= p(k)

So the two distributions are asymptotically the same. If the degree is obtained from a vertex label instead of the topology, the same result holds.

6.3.4 Edge Sampling with Vertex Label

In normal ES, we only observe a part of the graph, and degrees are underestimated. If we further assume the availability of vertex labels for all vertices in V_s, and the degree is presented as one vertex label, we can get a result similar to VSN.
If we treat Ṽ_s as a r.v., it can be shown that: ([41]*^5)

Pr{d_{G_s}(X) = k | X ∈ Ṽ_s}
= Pr{d_{G_s}(X) = k | X ∈ Ṽ_s, d_G(X) = k} Pr{d_G(X) = k | X ∈ Ṽ_s}
= 1 × Pr{X ∈ Ṽ_s | d_G(X) = k} Pr{d_G(X) = k} / Pr{X ∈ Ṽ_s}
= (k/2m) × p(k) / Pr{X ∈ Ṽ_s}
∝ k p(k)

That is to say, the degree distribution on V_s is biased towards high degree vertices if we use ES. We can correct the bias by the following estimator:

p_s(k) = (1/Z) (1/k) Σ_{v∈Ṽ_s} I[d(v) = k]

where Z is a normalization constant. This p_s(k) will be asymptotically the same as p(k).

6.3.5 BFS

[30] shows that BFS has a bias towards high degree vertices. This section will be provided in a future version. To check out our current progress and previews, please contact the authors.

6.3.6 Results for Certain Types of Graphs

Note that the above discussion does not assume a certain type of graph. In the survey, we found results for certain types of graphs:

• [60] provides a study of scale-free (power-law) graphs, where p(k) ∝ k^{−γ}. There is an exact analytic expression for γ = 2.
• For Erdos-Renyi random graphs, one can see chapter 2 of [10].

6.4 Minimum Cut

If the min-cut of a graph is preserved on G_s, a min-cut algorithm (e.g. one that runs in O(n_s^3)) will of course output it. In this section, we study the sample-and-contract approach. [43] presents an extreme of such a sampling approach, i.e. sample and contract until |V_s| = 2. Now we want to investigate how likely an early stopping will preserve this property, i.e. the min-cut. Stopping earlier and running min-cut in O(n_s^3) turns out to be an improvement over both algorithms, i.e. the O(n^4) random contraction algorithm and the O(n^3) best deterministic min-cut algorithm.
6.4.1 Edge Sampling with Contraction

Although we do not give the best randomized algorithm in this section (see L1 of [33] for some pointers), it suffices to show the flavour of our core objective: use simple algorithms (e.g. ES, VS, ESC, VSC) to reduce the graph with the hope of preserving a certain property of the graph; then run algorithms depending on this property to obtain (approximately) the same result.

Framework. Our framework is simple:

1. Pick a uniformly random edge and contract it. Stop when there are only r vertices. Get G_s.
2. Run any deterministic min-cut algorithm on G_s and record the cut.
3. Repeat the above experiment Θ(n^2/r^2) (calculated later) times. Output the minimum cut over all those experiments.

It can be shown that the min-cut is output w.h.p..

Previous Results. The best deterministic algorithm for min-cut is known to be O(n^3). [43] (Ch. 1) presented a simple randomized algorithm which runs in O(n^4) time. That algorithm is a special case of our framework obtained by substituting r = 2.

5. "Edge sampling" as mentioned in some literatures is actually "edge sampling with vertex label". This is a common point of confusion, because you sometimes see that degree is preserved (or can be estimated) and sometimes not. The assumption of vertex label availability after ES is valid in some scenarios. For example, an activity (e.g. retweet) edge on an OSN often contains such information (available via API): tweet content, tweeter info, retweeter info. One can get some vertex labels from the tweeter or retweeter info, like follower count, followee count, etc. (This holds for Sina Weibo.)
Since only two vertices are left in G_s, the deterministic min-cut algorithm is trivial (there is only one cut). However, this approach has a very high failure probability, so we have to repeat the experiment Θ(n^2) times in order to get the real min-cut w.h.p.. Combining with the contraction cost, this gives an O(n^4) algorithm.

Our results. The analysis is very similar to that of [43] (Ch. 1), except that we stop early and defer the determination of the best r. Denote by E_i the event that the edge contracted at the i-th iteration is not in C (the min-cut). Let F_i = ∩_{j=1}^{i} E_j. It can be shown that P(E_1) = P(F_1) ≥ 1 − 2/n and

P(E_i | F_{i−1}) ≥ 1 − k/(k(n − i + 1)/2) = 1 − 2/(n − i + 1)    (k = |C|)

The probability that the min-cut property is preserved after n − r rounds is:

P(F_{n−r}) = P(E_{n−r} | F_{n−r−1}) ... P(E_2 | F_1) P(F_1)
≥ ∏_{i=1}^{n−r} (n − i − 1)/(n − i + 1)
= (r − 1) r / (n(n − 1))

Considering r = r(n) as a function of n, this probability is Θ(r(n)^2/n^2), and by repeating the experiment Θ̃(n^2/r(n)^2) times, we preserve the min-cut in at least one experiment w.h.p. (failure probability smaller than O(1/poly(n))). Running the min-cut algorithm on the sampled graph takes O(r^3). The contraction procedure can be implemented in O(n^2) time. So the total complexity of this algorithm is:

O((n^2/r^2)(n^2 + r^3))

Let r(n) = n^c; this becomes n^{2−2c}(n^2 + n^{3c}) = n^{4−2c} + n^{2+c} ≥ 2 sqrt(n^{4−2c} × n^{2+c}), where the "=" can be achieved with c = 2/3. Plugging it back, we get a min-cut algorithm of complexity O(n^{8/3}) < O(n^3). This is better than running min-cut on the original graph.

There are some future extensions to make the argument in this section tighter:

• Note that deterministic min-cut can be achieved in O(nm), which is less than O(n^3) on sparse graphs.
One can extend the argument in this section by taking the relationship of m and n into consideration, i.e. m = n^b for a known b.

• One contraction down to r vertices only takes O(n(n − r)) time, which is slightly smaller in implementation. Although it does not change the order in the theoretical analysis, it may benefit implementations for not-so-large n and m.

6.4.2 Vertex Sampling with Contraction

A similar idea can be applied to vertex sampling with contraction. Denote the set of vertices incident to C by V_C. We know that |V_C| ≤ 2|C| = 2k. The goal is to calculate the probability that these 2k vertices survive the VSC process. Through a similar derivation, we can get the lower bound of this probability: (let r(n) = n^c)

(r − 2k)^{2k} / n^{2k} ≈ n^{(c−1)2k}

The optimal choice also comes at c = 2/3. However, the complexity is O(n^{8k/3}), much larger than that of ESC.

6.4.3 ESC Approximation

The above ESC algorithm aims at exact calculation of the min-cut, so repeated experiments are required to increase the success probability. If we only want to approximate the min-cut, maybe we can get more efficient algorithms. That is, after running ESC, with probability 1 − ε, at least one cut of size smaller than (1 + δ)|C| is preserved. This section will be provided in a future version. To check out our current progress and previews, please contact the authors.

6.5 Cut

6.5.1 Uniform Edge Sampling

There is an edge sampling algorithm that preserves the cuts of all subsets of vertices. We present a sketch here; interested readers can see [33, L3] for more details. The algorithm is simple: perform ES with probability p; ∀e ∈ E_s, set the weight w(e) = 1/p. It can be shown that, under some subtle conditions, G_s preserves all the cuts.
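A minimal sketch of the reweighted edge-sampling step just described; the example graph (a complete graph), p, and S are our own illustrative choices:

```python
import random

# Sketch: keep each edge with probability p and give survivors weight
# 1/p, so every cut's weight is preserved in expectation.
# The complete graph K_40 and p = 0.5 are illustrative assumptions.
random.seed(2)
n, p = 40, 0.5
edges = [(u, v) for u in range(n) for v in range(u + 1, n)]

sampled = [e for e in edges if random.random() < p]
w = 1.0 / p   # weight assigned to every surviving edge

def cut_weight(edge_list, S, weight=1.0):
    return sum(weight for u, v in edge_list if (u in S) != (v in S))

S = set(range(n // 2))
orig = cut_weight(edges, S)        # |delta(S)| = 20 * 20 = 400
est = cut_weight(sampled, S, w)    # unbiased estimate of the cut weight
print(orig, est, round(abs(est - orig) / orig, 3))
```

For a single cut the estimate is unbiased; the Chernoff and union-bound argument of this subsection extends the guarantee to all cuts simultaneously when p is large enough.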
Denote the weight of a cut by w_G(δ_G(S)) = Σ_{e∈δ_G(S)} w_G(e), where w_G(e) is the weight of edge e in graph G. In an unweighted graph, w(e) = 1. We denote H = G_s to keep the notation uncluttered. The technical expression for "preserving all cuts" is:

(1 − ε) w_G(δ_G(S)) ≤ w_H(δ_H(S)) ≤ (1 + ε) w_G(δ_G(S))    ∀S ⊆ V

In the discussion, we also refer to δ(S) as "cut S". Since the number of edges of cut S that survive sampling obeys the binomial distribution Binomial(|δ_G(S)|, p), and the weight of the remaining edges is set to 1/p, the expected size and weight of δ_H(S) are:

E[|δ_H(S)|] = |δ_G(S)| × p = p|δ_G(S)|
E[w_H(δ_H(S))] = |δ_G(S)| × p × (1/p) = |δ_G(S)|

which is the same as the weight in the original graph. We use the Chernoff bound (Appendix A) to calculate the deviation probability of the size of δ_H(S):

Pr{ | |δ_H(S)| − p|δ_G(S)| | ≥ ε p|δ_G(S)| } ≤ 2 e^{−p|δ_G(S)|ε^2/3}

Denote by E_S the event that cut S is preserved; the failure probability is Pr{Ē_S} ≤ 2 e^{−p|δ_G(S)|ε^2/3}. We want to apply the union bound over all S ⊆ V. It can be shown (Appendix A) that there are no more than n^{2α} cuts of size at most αc, where c = Ω(ln n) is the minimum cut size. Grouping cuts by size, we get:

Pr{∪_{S⊆V} Ē_S} ≤ Σ_{S⊆V} Pr{Ē_S}
≤ Σ_{α≥1} Σ_{S: |δ_G(S)| = αc} 2 e^{−pαcε^2/3}
≤ Σ_{α≥1} n^{2α} × 2 e^{−pαcε^2/3}

If we let p = 3(d + 2) ln n / (ε^2 c),

Pr{∪_{S⊆V} Ē_S} ≤ Σ_{α≥1} n^{2α} × 2 × n^{−α(d+2)} = 2 Σ_{α≥1} n^{−αd} = O(n^{−d})

So all the cuts are preserved w.h.p..

6.5.2 Non-uniform Edge Sampling Using Edge Strong Connectivity

[8] proposes a non-uniform edge sampling technique in order to relax the constraint that c = Ω(ln n).
The idea is to sample edges in sparse cuts with higher probability and lower weight, and sample edges in dense cuts with lower probability and higher weight. It can effectively reduce the number of edges to O(n ln n). See also [33, L11 of 2011] for a digest.

6.6 RCut, NCut, Assoc, RAssoc, NAssoc, Volume

This section will be provided in a future version. To check out our current progress and previews, please contact the authors.

6.7 Modularity

This section will be provided in a future version. To check out our current progress and previews, please contact the authors.

6.8 Cohesion

This section will be provided in a future version. To check out our current progress and previews, please contact the authors.

6.9 Quadratic Forms

6.9.1 Non-uniform Edge Sampling Using Effective Resistance

[32, L12] presents a non-uniform edge sampling scheme to preserve the quadratic form of the Laplacian. The algorithm is:

• Initialize H as an empty graph.
• For 1 ≤ i ≤ q:
  ◦ Pick a random edge e with probability p_e.
  ◦ Increase the weight of e in H by w_e/(q p_e).

It can be shown that by setting p_e = w_e R_e / Σ_e w_e R_e and q = O(n log n / ε^2), where R_e denotes the effective resistance of e, we can get an ε-approximation, namely:

(1 − ε) x^T L_G x ≤ x^T L_H x ≤ (1 + ε) x^T L_G x, ∀x ∈ ℝ^n

We will omit the detailed discussion, but present some notes here for operational convenience:

• The effective resistance between vertices a and b is defined as v(a) − v(b) when one Ampere of current is injected into a and removed from b, where v(a) denotes the voltage at a. Let w(e) be the (current) conductance (inverse resistance) of edge e, i.e. A_{u,v} = w(e) for e = (u, v). In this way, the effective resistance can be expressed as

(χ_a − χ_b)^T L_G^{−1} (χ_a − χ_b)    (12)

where L_G^{−1} denotes the pseudo-inverse and χ_a is the characteristic vector of a single vertex.
• In order to run this algorithm, one needs to evaluate Eq. 12. This requires either computing the pseudo-inverse or solving a Laplacian system, i.e. Lx = b. The state-of-the-art linear solver for Laplacian systems itself involves this spectral sparsification process. The recursive construction of the nearly-linear-time Laplacian solver and spectral graph sparsifier is too complex to present here; see [32, L11-L13] for details.

6.10 Shortest Path Length

This section will be provided in a future version. To check out our current progress and previews, please contact the authors.

7 Conclusion and Future Works

Graph sampling is a well-motivated topic and has important applications in many research fields. There are not many theoretical analyses of property preservation or estimation. For numerical evaluation, people have used different criteria, algorithms, graphs and properties, and there are many ad hoc results of property preservation. Some future works are:

• A comprehensive numerical evaluation of the property preservation results of different sampling algorithms. One can see that there is a large vacancy in the combination, i.e. properties (Section 5.2) × algorithms (Section 3.3) × criteria (Section 5.5) × graph models (Section 3.2). All previous works covered only part of this space. A comprehensive and neutral numerical evaluation can hint at possible directions for future theoretical studies.

• There is a lack of evaluation on synthesized data sets. While evaluation on real networks gives convincing results on whether an algorithm is applicable or not, evaluation on synthesized data sets (using certain graph generation models) may give some insights into the relation between algorithms and properties.

• There are only a few theoretical studies. One can see that the most well-understood property is degree distribution and its derivatives.
It is non-trivial to analytically quantify some of the properties. However, some easier ones are tractable.

• TBS techniques are quite popular in the recent literature. They are also promising practical techniques for large graph crawling. However, most works do not provide theoretical analysis of the variance and efficiency of those estimators. Theoretical analyses like [45][34] can lay a more solid foundation for future applications.

• Problem Oriented Property Preservation (POPP) is in its infancy. As we mentioned before, in many research fields, a family of algorithms may aim at optimizing a certain property defined on a graph. If G_s exhibits similar properties as G, we may expect those algorithms to also work on G_s. This gives a way to probabilistically accelerate a class of algorithms (maybe with bounded loss of optimality). The simple min-cut improvement made in this survey (Section 6.4) shows a flavour of POPP. Since POPP is tightly bound with the graph algorithms, more further investigations can be done.

Bibliography

[1] G. Agarwal and D. Kempe. Modularity-maximizing graph communities via mathematical programming. The European Physical Journal B - Condensed Matter and Complex Systems, 66(3):409-418, 2008.
[2] N. Ahmed, J. Neville, and R. Kompella. Network sampling designs for relational classification. In Sixth International AAAI Conference on Weblogs and Social Media, 2012.
[3] N. Ahmed, J. Neville, and R. R. Kompella. Network sampling via edge-based node selection with graph induction. Technical Report CSD TR 11-016, Computer Science Department, Purdue University, 2011.
[4] N. K. Ahmed, F. Berchmans, J. Neville, and R.
Kompella. Time-based sampling of social network activity graphs. In Proceedings of the Eighth Workshop on Mining and Learning with Graphs, pages 1-9. ACM, 2010.
[5] N. K. Ahmed, J. Neville, and R. Kompella. Space-efficient sampling from social activity streams. In Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, pages 53-60. ACM, 2012.
[6] E. M. Airoldi and K. M. Carley. Sampling algorithms for pure network topologies: a study on the stability and the separability of metric embeddings. ACM SIGKDD Explorations Newsletter, 7(2):13-22, 2005.
[7] J. Batson, D. A. Spielman, and N. Srivastava. Twice-Ramanujan sparsifiers. SIAM Journal on Computing, 41(6):1704-1721, 2012.
[8] A. Benczur and D. R. Karger. Randomized approximation schemes for cuts and flows in capacitated graphs. arXiv preprint cs/0207078, 2002.
[9] M. Bianchini, M. Gori, and F. Scarselli. Inside PageRank. ACM Transactions on Internet Technology (TOIT), 5(1):92-128, 2005.
[10] B. Bollobás. Random Graphs, volume 73. Cambridge University Press, 2001.
[11] C. Doerr and N. Blenn. Metric convergence in social network sampling. In HotPlanet, 2013.
[12] O. Frank. Network sampling, July 2009.
[13] A. Friggeri, G. Chelius, and E. Fleury. Triangles to capture social cohesion.
In Privacy, Security, Risk and Trust (PASSAT), 2011 IEEE Third International Conference on and 2011 IEEE Third International Conference on Social Computing (SocialCom), pages 258-265. IEEE, 2011.
[14] M. Gjoka, C. T. Butts, M. Kurant, and A. Markopoulou. Multigraph sampling of online social networks. Selected Areas in Communications, IEEE Journal on, 29(9):1893-1905, 2011.
[15] M. Gjoka, M. Kurant, C. T. Butts, and A. Markopoulou. Walking in Facebook: A case study of unbiased sampling of OSNs. In INFOCOM, 2010 Proceedings IEEE, pages 1-9. IEEE, 2010.
[16] S. Goel and M. J. Salganik. Respondent-driven sampling as Markov chain Monte Carlo. Statistics in Medicine, 28(17):2202-2229, 2009.
[17] L. A. Goodman. Snowball sampling. The Annals of Mathematical Statistics, 32(1):148-170, 1961.
[18] J.-D. J. Han, D. Dupuy, N. Bertin, M. E. Cusick, and M. Vidal. Effect of sampling on topology predictions of protein-protein interaction networks. Nature Biotechnology, 23(7):839-844, 2005.
[19] M. H. Hansen and W. N. Hurwitz. On the theory of sampling from finite populations. The Annals of Mathematical Statistics, 14(4):333-362, 1943.
[20] S. J. Hardiman and L. Katzir. Estimating clustering coefficients and size of social networks via random walk. In Proceedings of the 22nd International Conference on World Wide Web, pages 539-550. International World Wide Web Conferences Steering Committee, 2013.
[21] N. Harvey. Graph sparsifiers: A survey. Presentation slides, 2011.
[22] D. D. Heckathorn. Respondent-driven sampling: a new approach to the study of hidden populations. Social Problems, pages 174-199, 1997.
[23] P. Hu. Spectral clustering survey, May 2012.
[24] P. Hu. Graph sampling survey. Non-published draft, May 2013.
[25] L. Jin, Y. Chen, P. Hui, C. Ding, T. Wang, A. V. Vasilakos, B. Deng, and X. Li. Albatross sampling: robust and effective hybrid vertex sampling for social graphs. In Proceedings of the 3rd ACM International Workshop on MobiArch, pages 11-16. ACM, 2011.
[26] J. Kleinberg. The small-world phenomenon: an algorithmic perspective. In Proceedings of the Thirty-Second Annual ACM Symposium on Theory of Computing, pages 163-170. ACM, 2000.
[27] V. Krishnamurthy, M. Faloutsos, M. Chrobak, L. Lao, J.-H. Cui, and A. G. Percus. Reducing large internet topologies for faster simulations. In NETWORKING 2005. Networking Technologies, Services, and Protocols; Performance of Computer and Communication Networks; Mobile and Wireless Communications Systems, pages 328-341. Springer, 2005.
[28] M. Kurant, C. T. Butts, and A. Markopoulou. Graph size estimation. arXiv preprint arXiv:1210.0460, 2012.
[29] M. Kurant, M. Gjoka, Y. Wang, Z. W. Almquist, C. T. Butts, and A. Markopoulou. Coarse-grained topology estimation via graph sampling.
In Proceedings of ACM SIGCOMM Workshop on Online Social Networks (WOSN) '12, Helsinki, Finland, August 2012.
[30] M. Kurant, A. Markopoulou, and P. Thiran. On the bias of BFS (breadth first search). In Teletraffic Congress (ITC), 2010 22nd International, pages 1-8. IEEE, 2010.
[31] M. Kurant, A. Markopoulou, and P. Thiran. Towards unbiased BFS sampling. Selected Areas in Communications, IEEE Journal on, 29(9):1799-1809, 2011.
[32] L. Lau. Spectral algorithm. CUHK CSCI5160, 2012.
[33] L. Lau. Randomness and computation. CUHK CSCI5450, 2013.
[34] C.-H. Lee, X. Xu, and D. Y. Eun. Beyond random walk and Metropolis-Hastings samplers: why you should not backtrack for unbiased graph sampling. ACM SIGMETRICS Performance Evaluation Review, 40(1):319-330, 2012.
[35] S. H. Lee, P.-J. Kim, and H. Jeong. Statistical properties of sampled networks. Physical Review E, 73(1):016102, 2006.
[36] J. Leskovec and C. Faloutsos. Sampling from large graphs. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 631-636. ACM, 2006.
[37] J. Leskovec, J. Kleinberg, and C. Faloutsos. Graphs over time: densification laws, shrinking diameters and possible explanations. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pages 177-187. ACM, 2005.
[38] T. Lewis.
Network Science: Theory and Applications. Wiley, 2011.
[39] L. Li, D. Alderson, J. C. Doyle, and W. Willinger. Towards a theory of scale-free graphs: Definition, properties, and implications. Internet Mathematics, 2(4):431-523, 2005.
[40] L. Lovász. Random walks on graphs: A survey. Technical Report 1, Yale University, 1993.
[41] J. C. Lui. Exploring large graphs: From random walk to cyber-insurance. CCF Presentation Slides, June 2012.
[42] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21:1087, 1953.
[43] M. Mitzenmacher and E. Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, 2005.
[44] A. Moitra. Vertex sparsification and oblivious reductions. FOCS, 2009.
[45] F. Murai, B. Ribeiro, D. Towsley, and P. Wang. On set size distribution estimation and the characterization of large networks via sampling. Technical Report UM-CS-2012-023 v2, 2012.
[46] M. Newman. Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103(23):8577, 2006.
[47] M. Newman. Networks: An Introduction. OUP Oxford, 2010.
[48] M. E. Newman. Assortative mixing in networks. Physical Review Letters, 89(20):208701, 2002.
[49] M. E. Newman.
Power laws, Pareto distributions and Zipf's law. Contemporary Physics, 46(5):323-351, 2005.
[50] A. H. Rasti, M. Torkjazi, R. Rejaie, N. Duffield, W. Willinger, and D. Stutzbach. Respondent-driven sampling for characterizing unstructured overlays. In INFOCOM 2009, IEEE, pages 2701-2705. IEEE, 2009.
[51] B. Ribeiro and D. Towsley. Estimating and sampling graphs with multidimensional random walks. In Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement, pages 390-403. ACM, 2010.
[52] B. Ribeiro and D. Towsley. On the estimation accuracy of degree distributions from graph sampling. In Decision and Control (CDC), 2012 IEEE 51st Annual Conference on, pages 5240-5247. IEEE, 2012.
[53] B. Ribeiro, P. Wang, F. Murai, and D. Towsley. Sampling directed graphs with random walks. In INFOCOM, 2012 Proceedings IEEE, pages 1692-1700. IEEE, 2012.
[54] M. J. Salganik and D. D. Heckathorn. Sampling and estimation in hidden populations using respondent-driven sampling. Sociological Methodology, 34(1):193-240, 2004.
[55] T. Schank and D. Wagner. Approximating clustering-coefficient and transitivity. 2004.
[56] J. Shi and J. Malik. Normalized cuts and image segmentation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 22(8):888-905, 2000.
[57] M. Sirken. A short history of network sampling.
In Proceedings of the American Statistical Association, Survey Research Methods Section, 1998.
[58] D. Spielman. Spectral graph theory lecture notes, 2009. Available at http://www.cs.yale.edu/homes/spielman/561/.
[59] M. P. Stumpf and C. Wiuf. Sampling properties of random graphs: the degree distribution. Physical Review E, 72(3):036118, 2005.
[60] M. P. Stumpf, C. Wiuf, and R. M. May. Subnets of scale-free networks are not scale-free: sampling properties of networks. Proceedings of the National Academy of Sciences of the United States of America, 102(12):4221-4224, 2005.
[61] R. Van Der Hofstad. Random Graphs and Complex Networks. 2009.
[62] U. Von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395-416, 2007.
[63] T. Wang, Y. Chen, Z. Zhang, T. Xu, L. Jin, P. Hui, B. Deng, and X. Li. Understanding graph sampling algorithms for social network analysis. In Distributed Computing Systems Workshops (ICDCSW), 2011 31st International Conference on, pages 123-128. IEEE, 2011.
[64] S. Wasserman and K. Faust. Social Network Analysis: Methods and Applications, volume 8. Cambridge University Press, 1994.
[65] D. Watts and S. Strogatz. Collective dynamics of 'small-world' networks. Nature, 393(6684):440-442, 1998.
[66] Wikipedia. Centrality. http://en.wikipedia.org/wiki/Centrality.
[67] Wikipedia. Scale-free network, 2013. http://en.wikipedia.org/wiki/Scale-free_network.
[68] B. Yan and S.
Gregory. Identifying communities and key vertices by reconstructing networks from samples. PLOS ONE, 8(4):e61006, 2013.
[69] H. Yu. Sybil defenses via social networks: a tutorial and survey. ACM SIGACT News, 42(3):80-101, 2011.
[70] H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265-286, 2006.

Appendix A Useful Probability Results

Theorem 1. (Chernoff Bounds) Consider a heterogeneous coin-flipping setting: X_1, X_2, ..., X_n with head probabilities p_1, p_2, ..., p_n. Let X = Σ_{i=1}^{n} X_i and μ = E[X] = Σ_{i=1}^{n} p_i. We have the following conclusions ([33] L3):

1. E[e^{tX}] ≤ e^{μ(e^t − 1)}. This is the general result.
2. For δ > 0, Pr{X > (1 + δ)μ} < ( e^δ / (1 + δ)^{1+δ} )^μ. This is the strongest among the following specific results.
3. For 0 < δ < 1, Pr{X > (1 + δ)μ} < e^{−δ²μ/3}. This is more practical.
4. For R ≥ 6μ, Pr{X ≥ R} ≤ 2^{−R}.
5. For 0 < δ < 1, Pr{X ≤ (1 − δ)μ} ≤ ( e^{−δ} / (1 − δ)^{1−δ} )^μ.
6. For 0 < δ < 1, Pr{X ≤ (1 − δ)μ} ≤ e^{−μδ²/2}.
7. For 0 < δ < 1, Pr{|X − μ| > δμ} ≤ 2e^{−μδ²/3}. This is a frequently used form; it is the double-sided version derived from (3) and (6).
8. The same results hold if X_i ∼ U[0, 1] with E[X_i] = p_i (Hoeffding's extension).

Lemma 2. There are at most n^{2α} cuts of size at most αc, where c is the min-cut value. ([33] HW1)

Appendix B Long Derivations

Appendix C Other Works

There are many works related to graph sampling but not systematically discussed in this paper. In this section, we briefly mention some of those works found during this survey.
After proper organization, they will be put into a future version of this survey.

[53] studies sampling on directed graphs. [5] addresses streaming sampling of a large number of edges from social activity graphs. [3] proposes a graph induction step after ES, i.e. after obtaining V_s and E_s in our terminology, the edges between vertices of V_s are also added. The use case is when (activity) edges come in time order and we want to sample the topology in one pass over the data. [2] is a Problem Oriented Property Preservation (POPP) work: it makes the sampled graph mimic the original graph so that relational classification algorithms can be evaluated meaningfully. [14] observes that there are multiple graphs on an OSN besides the static friendship graph and proposes a random walk on the union of those graphs, which exhibits more rapid mixing. [34] improves RDS and MHRW by preventing backtracking. [7] presents a spectral sparsifier with only n/ε² edges; however, the construction takes O(n⁴) time, which renders it impractical for many problems. [68] studies network reconstruction from multiple SBS samples and node-level attributes; it also gives a numerical evaluation of community detection algorithm performance on the sampled (reconstructed) and original graphs.
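The graph-induction step of [3] mentioned above is simple enough to sketch: edge sampling determines the retained vertex set V_s, and then every original edge with both endpoints in V_s is added. The function and variable names below are ours, for illustration only.

```python
import random

def es_with_induction(edges, p, seed=0):
    """Edge sampling followed by graph induction (in the spirit of [3]):
    sample edges to obtain V_s, then keep every original edge inside V_s."""
    rng = random.Random(seed)
    sampled = [e for e in edges if rng.random() < p]
    Vs = {v for e in sampled for v in e}                          # induced vertex set
    Es = [(u, v) for (u, v) in edges if u in Vs and v in Vs]      # induction step
    return Vs, Es

# Toy example: a path graph 0-1-2-3-4.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
Vs, Es = es_with_induction(edges, p=0.5, seed=3)
print(Vs, Es)   # Es contains every original edge with both endpoints in Vs
```

Note Es is in general a superset of the directly sampled edges, which is exactly the point: induction recovers structure among the sampled vertices that plain ES would have dropped.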