Real-Time Community Detection in Large Social Networks on a Laptop

For a broad range of research, governmental and commercial applications it is important to understand the allegiances, communities and structure of key players in society. One promising direction towards extracting this information is to exploit the …

Authors: Benjamin Paul Chamberlain, Josh Levy-Kramer, Clive Humby

Ben Chamberlain (b.chamberlain14@imperial.ac.uk)
Department of Computing, Imperial College London, London SW7 2AZ, UK

Josh Levy-Kramer (josh@starcount.com)
Starcount Insights, 2 Riding House Street, London W1W 7FA

Clive Humby (clive@starcount.com)
Starcount Insights, 2 Riding House Street, London W1W 7FA

Marc Peter Deisenroth (m.deisenroth@imperial.ac.uk)
Department of Computing, Imperial College London, London SW7 2AZ, UK

Abstract

For a broad range of research, governmental and commercial applications it is important to understand the allegiances, communities and structure of key players in society. One promising direction towards extracting this information is to exploit the rich relational data in digital social networks (the social graph). As social media data sets are very large, most approaches make use of distributed computing systems for this purpose. Distributing graph processing requires solving many difficult engineering problems, which has led some researchers to look at single-machine solutions that are faster and easier to maintain. In this article, we present a single-machine real-time system for large-scale graph processing that allows analysts to interactively explore graph structures. The key idea is that the aggregate actions of large numbers of users can be compressed into a data structure that encapsulates user similarities while being robust to noise and queryable in real-time. We achieve single-machine real-time performance by compressing the neighbourhood of each vertex using minhash signatures and facilitate rapid queries through Locality Sensitive Hashing. These techniques reduce query times from hours using industrial desktop machines operating on the full graph to milliseconds on standard laptops.
Our method allows exploration of strongly associated regions (i.e. communities) of large graphs in real-time on a laptop. It has been deployed in software that is actively used by social network analysts and offers another channel for media owners to monetise their data, helping them to continue to provide free services that are valued by billions of people globally.

Figure 1: An example of how our system can be used: Diageo want to explore the market (competitors, customers, associations etc.) around their brand. They feed in information about themselves (“seeds”). In this example the seeds are the company itself (Diageo) and some of their major brands (Smirnoff, Baileys and Guinness). Our system finds accounts that are similar/related to the seeds and then structures the similar accounts into communities.

1. Introduction

Algorithms to discover groups of associated entities from relational (networked) data sets are often called community detection methods. They come in two forms: global methods, which partition the entire graph, and local methods, which look for vertices that are related to an input vertex and only work on a small part of the graph. We are concerned with community detection on large graphs that runs on a single commodity computer. To achieve this we combine the two approaches, using local community detection to identify an interesting region of a graph and then applying global community detection to help understand that region.

Our focus is on community detection using social media data. Social media data provides a record of global human interactions at a scale that is hitherto unprecedented.
Discovering communities in the social graph has a large number of governmental and industrial applications, which include: security, where analysts explore a network looking for groups of potential adversaries; social sciences, where queries can establish the important relationships between individuals of interest; e-commerce, where queries reveal related products or users; and marketing, where companies seek to optimise advertising channels or celebrity endorsement portfolios. These applications do not disrupt user experience in the way that sponsored links or feed advertising do, offering an alternative means for social media providers to continue to offer free services.

As an illustration of a commercial application of community detection using Twitter data, take a company that wants to trade in a new geographic region. To do this they need to understand the region’s competitors, customers and marketing channels. Using our system they input the Twitter handles for their existing products, key people, brands and endorsers, and in real-time receive the accounts closely related to their company in that market. The output is automatically structured into groups (communities) such as media titles, sports people and other related companies. Analysts examine the results and explore different regions by changing the input accounts. We show a high level illustration for the drinks brand Diageo in Figure 1.

Throughout this paper we refer to graphs. In this context a graph is a collection of vertices and edges connecting them. A graph is usually written $G(V, E)$ and a graph with weighted edges as $G(V, E, W)$. A network is a richer structure than a graph, comprising a graph and a collection of metadata describing the vertices and/or edges of the graph.
A community is a collection of vertices $C \subset V$ that share many more edges than would be expected from a random subset of vertices. In the context of the Twitter graph, a vertex is a Twitter account and an (undirected) edge between vertices $V_i$ and $V_j$ exists if $V_i$ Follows $V_j$ or $V_j$ Follows $V_i$. A community might be the set of Twitter accounts belonging to machine learning researchers. In addition to the Twitter graph, the Twitter network also includes metadata associated with the accounts (e.g., name, description) and edges (e.g., creation time, direction). Our algorithm focusses exclusively on the properties of the graph.

We are particularly interested in the neighbourhood graph. The neighbourhood graph of a vertex consists of the set of all vertices that are directly connected to it, irrespective of the edge direction. Digital Social Network (DSN) neighbourhood graphs can be very large; in Twitter, the largest have almost 100 million members (as of June 2016).

We propose that robust associations between social network accounts can be reached by considering the similarity of their neighbourhood graphs. This proposition relies on the existence of homophily in social networks. The homophily principle states that people with similar attributes are more likely to form relationships (McPherson et al., 2001). Accordingly, social media accounts with similar neighbourhood graphs are likely to have similar attributes.

We seek to build a system that (1) produces high quality communities from very noisy data, (2) is robust to failure and does not require engineering support, and (3) is parsimonious with the time of its users. The first constraint leads us to use the neighbourhood graph as the unit of comparison between vertices. The neighbourhood graph is generated by the actions of large numbers of independent users, in contrast to features like text content or group memberships, which are usually controlled by a single user.
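
As a minimal sketch of the neighbourhood definition given above, the following Python builds, for each account, the set of directly connected accounts while ignoring edge direction. The function name `neighbourhoods`, the input format (pairs of follower and followed account) and the toy account names are assumptions made purely for illustration; they are not taken from the system described in this article.

```python
from collections import defaultdict


def neighbourhoods(follows):
    """Map each account to the set of accounts directly connected to it.

    `follows` is an iterable of (follower, followed) pairs. Edge direction
    is ignored, so each pair adds both endpoints to each other's set.
    """
    neighbours = defaultdict(set)
    for follower, followed in follows:
        if follower == followed:
            continue  # skip self-loops (an assumption of this sketch)
        neighbours[follower].add(followed)
        neighbours[followed].add(follower)
    return dict(neighbours)


# Toy data with hypothetical account names.
follows = [("ann", "bob"), ("bob", "ann"), ("cat", "bob"), ("ann", "dan")]
N = neighbourhoods(follows)
print(sorted(N["bob"]))
```

Because sets are used, a mutual Follow (ann and bob above) contributes a single undirected connection rather than two.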
The second requirement leads us to search for a single-machine solution and the third prescribes a real-time system (or as close as possible). High performance is vital as analysts wish to interact with the data, combining the results of previous experiments to inform new ones. The difference between real-time and ‘quite quick’ is important. Real-time response is primary amongst the reasons that interactive programming languages like Python and R have replaced compiled languages like C++ as the tools of choice for data analysts. We aim to offer similar improvements in usability.

Currently, no tool exists that provides real-time analysis of large graphs on a single commodity machine. Existing methods to analyse local community structure in large graphs either rely on distributed computing facilities or incur excessive run-times, making them impractical for exploratory and interactive work (Clauset, 2005; Bahmani et al., 2011). In this article, we describe our real-time analysis tool for detecting communities in large graphs using only a laptop. We focus on a 700 million user Twitter network. However, our work is more generally applicable as it does not rely upon Twitter-specific data, only the graph structure, and we provide some results from Facebook to demonstrate this.

There are two core problems to solve: (1) The graph must fit into the memory of a single (commodity) machine. (2) Many neighbourhood graphs containing up to 100 million vertices must be compared in milliseconds. The first step to solving these problems is to compress the neighbourhood graphs into fixed-length minhash signatures. Minhash signatures vastly reduce the size of the graph while at the same time encoding an efficient estimation of the Jaccard similarity between any two neighbourhood graphs [1]. Choosing appropriate-length minhash signatures squeezes the graph into memory and addresses problem (1). To solve problem (2) and achieve real-time querying we use the elements of the minhash signatures as the basis to build a Locality Sensitive Hashing (LSH) data structure. LSH facilitates querying of similar accounts in constant time. This combination of minhashing and LSH allows analysts to enter an account or a set of accounts and in milliseconds receive the set of most related accounts. From this set we use the minhash signatures to rapidly construct a weighted graph and apply the WALKTRAP community detection algorithm before visualising the results (Pons and Latapy, 2005).

[1] The Jaccard similarity is a widely used symmetric measure of the likeness of two sets.

Our system applies well-studied techniques in an innovative way: (1) To the best of our knowledge, minhashing has not been applied to the neighbourhood graph before; minhashing is normally only used for very similar sets. (2) We show that minhashing is effective for community detection when applied to a broad range of neighbourhood graph similarities. (3) We develop an agglomerative clustering algorithm and prove an original update procedure for minhash signatures in this setting. The novel combination of these techniques allows our system to perform real-time community detection on graphs that exceed 100 million vertices.

The contributions of this article are:

1. We establish that robust associations between social media users can be determined by means of the Jaccard similarity of their neighbourhood graphs.
2. We show that the approximations implicit in minhashing and LSH minimally degrade performance and allow querying of very large graphs in real time.
3. System design and evaluation: We have designed and evaluated an end-to-end Python system for extracting data from social media providers and compressing the data into a form where it can be efficiently queried in real time.
4. We demonstrate how queries can be applied to a range of problems in graph analysis, e.g., understanding the structure of industries, allegiances within political parties and the public image of a brand.

There are seven sections in this paper. Section 2 describes how to mine the Twitter graph and can be omitted by readers uninterested in replicating our work. Section 3 describes the related work, which is necessarily broad as our system brings together community detection, graph processing and data structures. Section 4 contains our detailed methodology with the exception of how we prepare and analyse the ground truth data, which is left until Section 5. In Section 6 we describe the results of three experiments, which validate our methodology. Conclusions and future work follow in Section 7.

2. Data and Preliminaries

In this article, we focus on Twitter data because Twitter is the most widely used Digital Social Network (DSN) for academic research. The Twitter Follower graph consists of roughly one billion vertices (Twitter accounts) and 30 billion edges (Follows). To show that our method generalises to other social networks, we also present some results using a Facebook Pages engagement graph containing 450 million vertices (FB accounts) and 700 million edges (Page likes / comments) (see Section 6).

Most DSNs have public Application Programming Interfaces (APIs) so that third-party developers can build applications using their data. Delivering data at massive scale incurs significant cost and, to manage this, DSNs limit the rate at which data can be downloaded. Rate limiting varies between networks. Usually, when a DSN account holder logs into a third party application using their social login, they grant the application owner one access token.
Each access token allows the application owner to download a fixed amount of data in a given time window. This procedure gives more popular apps access to more data. Our work makes use of access tokens generated by several client-facing apps [2].

To collect Twitter data we use the REST API to crawl the network, identifying every account with more than 10,000 Followers [3] and gathering their complete Follower lists. Our data set contains 675,000 such accounts with a total of $1.5 \times 10^{10}$ Followers, of which $7 \times 10^{8}$ were unique. We use accounts with greater than 10,000 Followers (though 700 million Twitter accounts are used to build the signatures) because accounts of this size tend to have public profiles (Wikipedia pages or Google hits), making the results interpretable.

To generate data from Facebook we matched the Twitter accounts with greater than 10,000 Followers to Facebook Page accounts [4] using a combination of automatic account name matching and manual verification. Facebook Page likes are not available retrospectively, but can be collected through a real-time stream. We collected the stream over a period of two years, starting in late 2013.

Downloading large quantities of social media data is an involved subject and we include details of how we did this in Appendix A.1 for reproducibility.

3. Related Work

Existing approaches to large scale, efficient community detection have three flavours: more efficient community detection algorithms, innovative ways to perform processing on large graphs, and data structures for graph compression and search. Table 1 shows related approaches to this problem and which constraints they satisfy.
[2] Starcount Playlist, Starcount Vibe and Chatsnacks.
[3] The number of Followers is contained in the Twitter account metadata, i.e., it is available without collecting and counting all edges.
[4] Facebook Pages are the public equivalent of the private profiles. Many influential users have a Facebook Page.

Table 1: Comparison of related work. SCM stands for “runs on a Single Commodity Machine”.

Method                                      Real-time   Large graphs   SCM
Modularity optimisation (Newman, 2004a)     no          no             yes
WALKTRAP (Pons and Latapy, 2005)            no          no             yes
INFOMAP (Rosvall and Bergstrom, 2008)       no          no             yes
Louvain method (Blondel et al., 2008)       no          yes            yes
BigClam (Yang and Leskovec, 2013)           no          yes            yes
GraphLab (Low et al., 2014)                 no          yes            no
Pregel (Malewicz et al., 2010)              no          yes            no
Surfer (Chen et al., 2010)                  no          yes            no
GraphChi (Kyrola et al., 2012)              no          yes            yes
Twitter WTF (Gupta et al., 2013)            yes         yes            no
LEMON (Li et al., 2015)                     no          yes            yes
Our Method                                  yes         yes            yes

3.1 Community Detection Algorithms

Community detection methods have been developed in areas as diverse as neuronal firing (Bullmore and Sporns, 2009), electron spin alignment (Reichardt and Bornholdt, 2006) and social models (Yang and Leskovec, 2013). Fortunato (2010) and Newman (2003) both provide excellent and detailed overviews of the vast community detection literature. Approaches can be broadly categorised into local and global methods.

Global methods assign every vertex to a community, usually by partitioning the vertices. Many highly innovative schemes have been developed to do this. Modularity optimisation (Newman, 2004a) is one of the best known. Modularity is a metric used to evaluate the quality of a graph partition. Communities are determined by selecting the partition that maximises the modularity.
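
For concreteness, one standard form of the modularity of Newman and Girvan sums, over communities, the fraction of edges that fall inside the community minus the fraction expected if edges were placed at random with the same vertex degrees. The Python sketch below evaluates that quantity for a given partition of a small unweighted graph. It does not perform the optimisation itself, and the function name and toy graph are illustrative assumptions rather than anything used in our system.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once, no self-loops.
    community: dict mapping every vertex to a community label.
    """
    m = len(edges)
    intra = defaultdict(int)   # number of edges with both endpoints in community c
    degree = defaultdict(int)  # sum of vertex degrees over community c
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            intra[community[u]] += 1
    # Observed intra-community edge fraction minus the degree-based expectation.
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Toy graph: two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
merged = {v: "all" for v in range(6)}
q_split = modularity(edges, split)
q_merged = modularity(edges, merged)
print(q_split, q_merged)
```

On this toy graph the split into the two triangles scores 5/14 (about 0.357), whereas placing every vertex in a single community scores 0, so a modularity-maximising method would prefer the split.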
An alternative to modularity was developed by Pons and Latapy (2005), who innovatively applied random walks on the graph to define communities as regions in which walkers become trapped (WALKTRAP). Rosvall and Bergstrom (2008) combined random walks with efficient coding theory to produce INFOMAP, a technique that provides a new perspective on community detection: communities are defined as the structural sub-units that facilitate the most efficient encoding of information flows through a network. All three methods are well optimised for their motivating networks, but were not created with graphs at the scale of modern Digital Social Networks (DSNs) in mind and cannot easily scale to very large data sets.

The availability of data from the Web, DSNs and services like Wikipedia has focussed research attention on methods that scale. An early success was the Louvain method, which allowed modularity optimisation to be applied to large graphs (the authors report 100 million vertices and 1 billion edges). However, the method was not intended to be real-time and the 152 minute run time is too slow to achieve real-time performance, even allowing for 8 years of hardware advances (Blondel et al., 2008). Another noteworthy technique applied to very large graphs is the BigClam method, which, in addition to operating at scale, is able to detect overlapping communities (Yang and Leskovec, 2013). However, in common with the Louvain method, BigClam is not a real-time algorithm that could facilitate interactive exploration of social networks.

In contrast to global community detection methods, local methods do not assign every vertex to a community. Instead they find vertices that are in the same community as a set of input vertices (seeds). For this reason they are normally faster than global methods.
Local community detection methods were originally developed as crawling strategies to cope with the rapidly expanding web-graph (Flake et al., 2000). Following the huge impact of the PageRank algorithm (Page et al., 1998), many local random walk algorithms have been developed. Kloumann and Kleinberg (2014) conducted a comprehensive assessment of local community detection algorithms on large graphs. In their study Personal PageRank (PPR) (Haveliwala, 2002) was the clear winner. PPR is able to measure the similarity to a set of vertices instead of the global importance/influence of each vertex by applying a slight modification to PageRank. PageRank can be regarded as a sequence of two-step processes that are iterated until convergence: a random walk on the graph followed by (with small probability) a random teleport to any vertex. PPR modifies PageRank in two ways: only a small number of steps are run (often 4), and any random walker selected to teleport must return to one of the seed vertices. Recent extensions have shown that seeding PPR with the neighbourhood graph can improve performance (Gleich and Seshadhri, 2012) and that PPR can be used to initiate local spectral methods with good results (Li et al., 2015).

Random walk methods are usually evaluated by power iteration: a series of matrix multiplications requiring the full adjacency matrix to be read into memory. The adjacency matrix of large graphs will not fit in memory and so distributed computing resources are used (e.g., Hadoop). While distributed systems are continually improving, they are not always available to analysts, require skilled operators and typically have an overhead of several minutes per query.

A major challenge when applying both local and global community detection algorithms to real world networks is performance verification.
Testing algorithms on a held out labelled test set is complicated by the lack of any agreed definition of a community. Much early work makes use of small hand-labelled communities and treats the original researchers’ decisions as gold standards (Sampson, 1969; Zachary, 1977; Lusseau, 2003). Irrespective of the validity of this process, a single manual labeller (or a small number of them) cannot produce ground truth for large DSNs. Yang and Leskovec (2012) proposed a solution to the verification problem in community detection. They observe that in practice, community detection algorithms detect communities based on the structure of interconnections, but results are verified by discovering common attributes or functions of vertices within a community. Yang and Leskovec (2012) identified 230 real-world networks in which they define ground-truth communities based on vertex attributes. The specific attributes that they use are varied and some examples include publication venues for academic co-authorship networks, chat group membership within social networks and product categories in co-purchasing networks.

3.2 Graph Processing Systems

A complementary approach to efficient community detection on large graphs is to develop more efficient and robust systems. This is an area of active research within the systems community. General-purpose tools for distributed computation on large scale graphs include GraphLab, Pregel and Surfer (Chen et al., 2010; Malewicz et al., 2010; Low et al., 2014). Purpose-built distributed graph processing systems offer major advances over the widely used MapReduce framework (Pace, 2012). This is particularly true for iterative computations, which are common in graph processing and include random walk algorithms. However, distributed graph processing still presents major design, usability and latency challenges.
Typically the run times of algorithms are dominated by communication between machines over the network. Much of the complexity comes from partitioning the graph to minimise network traffic. The general solution to the graph partitioning problem is NP-hard and remains unsolved.

These concerns have led us and other researchers to buck the overarching trend for increased parallelisation on ever larger computing clusters and search for single-machine graph processing solutions. One such solution is GraphChi, a single-machine system that offers a powerful and efficient alternative for processing large graphs (Kyrola et al., 2012). The key idea is to store the graph on disk and optimise I/O routines for graph analysis operations. GraphChi achieves dramatic speed-ups compared to conventional systems, but the repeated disk I/O makes real-time operation impossible.

Twitter also use a single-machine recommendation system that serves “Who To Follow (WTF)” recommendations across their entire user base (Gupta et al., 2013). WTF provides real-time recommendations using random walk methods similar to PPR. They achieve this by loading the entire Twitter graph into memory. Following their design specification of 5 bytes per edge, $5 \times 30 \times 10^{9}$ bytes $= 150$ GB of RAM would be required to load the current graph, which is an order of magnitude more than is available on our target platforms.

3.3 Graph Compression and Data Structures

The alternative to using large servers, clusters or disk storage for processing large graphs is to compress the whole graph to fit into the memory of a single machine. Graph compression techniques were originally motivated by the desire for single machine processing on the Web Graph. Approaches focus on ways to store the differences between graph structures instead of the raw graph.
Adler and Mitzenmacher (2001) searched for web pages with similar neighbourhood graphs and encoded only the differences between edge lists. The seminal work by Boldi and Vigna (2004) ordered Web pages lexicographically, endowing them with a measure of locality. Similar compression techniques were adapted to social networks by Chierichetti et al. (2009). They replaced the lexical ordering with an ordering based on a single minhash value of the out-edges, but found social networks to be less compressible than the Web (14 versus 3 bits per edge). While the aforementioned techniques achieve remarkable compression levels, the cost is slower access to the data (Gupta et al., 2013).

Minhashing is a technique for representing large sets with fixed length signatures that encode an estimate of the similarity between the original sets. When the sets are sub-graphs, minhashing can be used for lossy graph compression. The pioneering work on minhashing was by Broder (1997), whose implementation dealt with binary vectors. This was extended to counts (integer vectors) by Charikar (2002) and later to continuous variables (Philbin, 2008). Efficient algorithms for generating the hashes are discussed by Manasse and Mcsherry (2008). Minhashing has been applied to clustering the Web by Haveliwala et al. (2000), who considered each web page to be a bag of words and built hashes from the count vectors.

Two important innovations that improve upon minhashing are b-bit minhashing (Li and König, 2009) and Odd Sketches (Mitzenmacher et al., 2014). When designing a minhashing scheme there is a trade-off between the size of the signatures and the variance of the similarity estimator. Li and König (2009) show that it is possible to improve on the size-variance trade-off by using longer signatures, but only keeping the lowest b bits of each element (instead of all 32 or 64).
Their work delivers large improvements for very similar sets (more than half of the total elements are shared) and for sets that are large relative to the number of elements in the sample space. Mitzenmacher et al. (2014) improved upon b-bit minhashing by showing that for approximately identical sets (Jaccard similarities ≈ 1) there was a better estimation scheme.

Locality Sensitive Hashing (LSH) is a technique introduced by Indyk and Motwani (1998) for rapidly finding approximate near neighbours in high dimensional space. In the original paper they define a parameter $\rho$ that governs the quality of LSH algorithms. A lower value of $\rho$ leads to a better algorithm. There has been a great deal of work studying the limits on $\rho$. Of particular interest, Motwani et al. (2005) used a Fourier analytic argument to provide a tighter lower bound on $\rho$, which was later bettered by O’Donnell et al. (2009), who exploited properties of the noise stability of boolean functions. The latest LSH research uses the structure of the data, through data-dependent hash functions (Andoni et al., 2014), to get even tighter bounds. As the hash functions are data dependent, unlike earlier work, only static data structures can be addressed.

Table 2: Typical run times and space requirements for systems performing local community detection on the Twitter Follower network of 700 million vertices and 20 billion edges and producing 100 vertex output communities.

System               Typical run time (s)   Space requirement (GB)
Naive edge list      8,000                  240
Minhash signatures   1                      4
LSH with minhash     0.25                   5

4. Real-Time Community Detection

In this section, we detail our approach to real-time community detection in large social networks.
Our method consists of two main stages. In stage one, we take a set of seed accounts and expand this set to a larger group containing the accounts most related to the seeds. This stage is depicted by the box labelled “Find similar accounts” in Figure 1. Stage one uses a very fast nearest neighbour search. In stage two, we embed the results of stage one into a weighted graph where each edge is weighted by the Jaccard similarity of the two accounts it connects. We apply a global community detection algorithm to the weighted graph and visualise the results. Stage two is depicted by the box labelled “Structure and visualise” in Figure 1.

In the remainder of the paper we use the following notation: the $i$-th user account (or, interchangeably, vertex of the network) is denoted by $A_i$ and $N(A_i)$ gives the set of all accounts directly connected to $A_i$ (the neighbours of $A_i$). The set of accounts that are input by a user into the system are called seeds and denoted by $S = \{A_1, A_2, \ldots, A_m\}$, while $C = \{A_1, A_2, \ldots, A_n\}$ (community) is used for the set of accounts that are returned by stage one of the process.

4.1 Stage 1: Seed Expansion

The first stage of the process takes a set of seed accounts as input, orders all other accounts by similarity to the seeds and returns an expanded set of accounts similar to the seed account(s). For this purpose, we require three ingredients:

1. A similarity metric between accounts
2. An efficient system for finding similar accounts
3. A stopping criterion to determine the number of accounts to return

In the following, we detail these three ingredients of our system, which will allow for real-time community detection in large social networks on a standard laptop.

4.1.1 Similarity Metric

The property of each account that we choose to compare is the neighbourhood graph.
The neighbourhood graph is an attractive feature as it is not controlled by an individual, but by the (approximately) independent actions of large numbers of individuals. The edge generation process in Digital Social Networks (DSNs) is very noisy, producing graphs with many extraneous and missing edges. As an illustrative example, the pop stars Eminem and Rihanna have collaborated on four records and a stadium tour.⁵ Despite this clear association, Eminem is not one of Rihanna's 40 million Twitter followers. However, Rihanna and Eminem have a Jaccard similarity of 18%, making Rihanna Eminem's 6th strongest connection. Using the neighbourhood graph as the unit of comparison between accounts mitigates noise associated with the unpredictable actions of individuals.

5. "Love the Way You Lie" (2010), "The Monster" (2013), "Numb" (2012), and "Love the Way You Lie (Part II)" (2010), the Monster Tour (2014)

The metric that we use to compare two neighbourhood graphs is the Jaccard similarity. The Jaccard similarity has two attractive properties for this task. Firstly, it is a normalised measure providing comparable results for sets that differ in size by orders of magnitude. Secondly, minhashing can be used to provide an unbiased estimator of the Jaccard similarity that is both time and space efficient. The Jaccard similarity is given by

$$J(A_i, A_j) = \frac{|N(A_i) \cap N(A_j)|}{|N(A_i) \cup N(A_j)|}, \tag{1}$$

where $N(A_i)$ is the set of neighbours of the $i$-th account.

4.1.2 Efficient Account Search

To efficiently search for accounts that are similar to a set of seeds we represent every account as a minhash signature and use a Locality Sensitive Hashing (LSH) data structure based on the minhash signatures for approximate nearest neighbour search.

Rapid Jaccard Estimation via Minhash Signatures

Computing the Jaccard similarities in (1) is very expensive as each set can have up to $10^8$ members and calculating intersections is super-linear in the total number of members of the two sets being intersected. Multiple large intersection calculations cannot be processed in real-time. There are two alternatives: either the Jaccard similarities can be pre-computed for all possible pairs of vertices, or they can be estimated. Using pre-computed values for $n = 675{,}000$ would require caching $\frac{1}{2} n(n-1) \approx 2.5 \times 10^{11}$ floating point values, which is approximately 1TB and so not possible using commodity hardware. Therefore an estimation procedure is required.

The minhashing compression technique of Broder et al. (2000) generates unbiased estimates of the Jaccard similarity in $O(K)$, where $K$ is the number of hash functions in the signature. Each hash function approximates a two step process: An independent permutation of the indices associated with each member of a set followed by taking the minimum value of the permuted indices. Broder et al. (2000) showed that the unbiased estimate $\hat{J}(A_i, A_j)$ of the Jaccard similarity $J(A_i, A_j)$ is attained by exploiting that

$$J(A_i, A_j) = p(h_k(A_i) = h_k(A_j)) \quad \forall k = 1, \ldots, K,$$

where $h_k$ are hash functions. This means the probability that any minhash function is equal for both sets is given by the Jaccard coefficient. We create a signature vector $H$, which is made of $K$ independent hashes $h_k$, and calculate the Monte-Carlo Jaccard estimate $\hat{J}$ as

$$\hat{J}(A_i, A_j) = I/K, \tag{2}$$

where we define

$$I = \sum_{k=1}^{K} \delta(h_k(A_i), h_k(A_j)), \tag{3}$$

$$\delta(h_k(A_i), h_k(A_j)) = \begin{cases} 1 & \text{if } h_k(A_i) = h_k(A_j) \\ 0 & \text{if } h_k(A_i) \neq h_k(A_j). \end{cases} \tag{4}$$

As each $h_k$ is independent, $I \sim \mathrm{Bin}(J(A_i, A_j), K)$.
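As a rough illustration of Equations (2) to (4), the following Python sketch builds minhash signatures for two neighbour sets and compares the estimate with the exact Jaccard similarity. It is a sketch under stated assumptions rather than the implementation used in our system: the linear hash family $(ax + b) \bmod p$, the prime `PRIME`, all function names and the toy data are hypothetical choices made for brevity, and neighbours are assumed to be non-negative integer ids smaller than `PRIME`.

```python
import random

# Assumed modulus for the illustrative hash family (Mersenne prime 2^31 - 1).
PRIME = 2_147_483_647


def make_hash_params(k, seed=0):
    """Draw k (a, b) pairs defining hashes h(x) = (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]


def minhash_signature(neighbours, params):
    """Signature of a non-empty set of integer ids: one minimum per hash function."""
    return [min((a * x + b) % PRIME for x in neighbours) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    """Equation (2): fraction of signature positions on which the two sets agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def exact_jaccard(set_a, set_b):
    """Equation (1) evaluated directly on the full sets."""
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    rng = random.Random(1)
    ids = rng.sample(range(10**6), 1500)
    followers_a, followers_b = set(ids[:1000]), set(ids[500:])
    params = make_hash_params(500)
    sig_a = minhash_signature(followers_a, params)
    sig_b = minhash_signature(followers_b, params)
    print("exact:   ", exact_jaccard(followers_a, followers_b))
    print("estimate:", estimate_jaccard(sig_a, sig_b))
```

Comparing two accounts touches only the two length-$K$ signatures, so the cost of the estimate does not depend on the size of the underlying neighbour sets.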
The estimator is fully efficient, i.e., the variance is given by the Cramér-Rao lower bound

$$\mathrm{var}(\hat{J}) = \frac{J(1 - J)}{K}, \tag{5}$$

where we have dropped the Jaccard arguments for brevity. Equation 5 shows that Jaccard coefficients can be approximated to arbitrary precision using minhash signatures with an estimation error that scales as $O(1/\sqrt{K})$. The memory requirement of minhash signatures is $Kn$ integers, and so can be configured to fit into memory; for $K = 1000$ and $n = 675{,}000$ it is only ≈ 4 GB. In comparison to calculating Jaccard similarities of the largest 675,000 Twitter accounts with ≈ $4 \times 10^{10}$ neighbours, minhashing reduces expected processing times by a factor of 10,000 and storage space by a factor of 1000.⁶

6. Our method allows new accounts to be added quickly by simply calculating one additional minhash signature, without needing to add the pairwise similarity to all other accounts.

Efficient Generation of Minhash Signatures

Minhash signatures allow for rapid estimation of the Jaccard similarities. However, care must be taken when implementing minhash generation.

Algorithm 1 Minhash signature generation
Require: M ← number of Accounts
Require: K ← size of signature
Require: N(Account) ← All Neighbours
  T ∈ ℕ^{M×K} ← ∞    ▷ Initialise signature matrix to ∞
  index ← 1
  for all Accounts do
    P ← permute(index) ∈ ℕ^{1×K}    ▷ Permute the Account index K times
    for all i ∈ N(Account) do
      T[i] ← min(T[i], P)    ▷ Compute the element-wise minimum of the signature
    end for
    index ← index + 1
  end for
  return T    ▷ Return matrix of signatures

Calculation of the signatures is expensive: Algorithm 1 requires $O(NEK)$ computations, where $N$ is the number of neighbours, $E$ is the average out-degree of each neighbour and $K$ is the length of the signature. For our Twitter data these values are $N = 7 \times 10^8$, $E = 10$, $K = 1{,}000$.
A naive implementation can run for several days. We have an efficient implementation that takes one hour, allowing signatures to be regenerated overnight without affecting operational use (see Appendix A).

Locality Sensitive Hashing (LSH)

Calculating Jaccard similarities based on minhash signatures instead of full adjacency lists provides tremendous benefits in both space and time complexity. However, finding near neighbours of the input seeds is an onerous task. For a set of 100 seeds and our Twitter data set, nearly 70 million minhash signature comparisons would need to be performed, which dominates the run time. Locality Sensitive Hashing (LSH) is an efficient system for finding approximate near neighbours (Indyk and Motwani, 1998). LSH works by partitioning the data space. Any two points that fall inside the same partition are regarded as similar. Multiple independent partitions are considered, which are defined by a set of hash functions. LSH has an elegant formulation when combined with minhash signatures for near neighbour queries in Jaccard space. The minhash signatures are divided into bands containing fixed numbers of hash values and LSH exploits that similar minhash signatures are likely to have identical bands. An LSH table can then be constructed that points from each account to all accounts that have at least one identical minhash band. We apply LSH to every input seed independently to find all candidates that are 'near' to at least one seed. In our implementation, we use 500 bands, each containing two hashes. As most accounts share no neighbours, the LSH step dramatically reduces the number of candidate accounts and the algorithm runtime by a factor of roughly 100. LSH is essential for the real-time capability of our system.
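The banding scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not our production data structure: the function names, the dictionary-based index and the toy signatures are assumptions. The helper `candidate_probability` uses the standard banding result that two accounts with Jaccard similarity $J$ share at least one of $b$ bands of $r$ hashes with probability $1 - (1 - J^r)^b$.

```python
from collections import defaultdict


def build_lsh_index(signatures, rows_per_band):
    """Map each (band position, band contents) key to the accounts that share it.

    signatures: dict mapping account id -> minhash signature (equal-length lists).
    """
    index = defaultdict(set)
    for account, sig in signatures.items():
        for start in range(0, len(sig), rows_per_band):
            key = (start, tuple(sig[start:start + rows_per_band]))
            index[key].add(account)
    return index


def lsh_candidates(seeds, signatures, index, rows_per_band):
    """Accounts that share at least one identical band with at least one seed."""
    candidates = set()
    for seed in seeds:
        sig = signatures[seed]
        for start in range(0, len(sig), rows_per_band):
            key = (start, tuple(sig[start:start + rows_per_band]))
            # .get avoids creating empty entries in the defaultdict on lookup
            candidates |= index.get(key, set())
    return candidates - set(seeds)


def candidate_probability(jaccard, rows_per_band, bands):
    """Probability that two accounts with the given similarity share a band."""
    return 1.0 - (1.0 - jaccard ** rows_per_band) ** bands


if __name__ == "__main__":
    toy = {"a": [1, 2, 3, 4], "b": [1, 2, 9, 9], "c": [7, 7, 3, 4], "d": [5, 6, 7, 8]}
    idx = build_lsh_index(toy, rows_per_band=2)
    print(lsh_candidates(["a"], toy, idx, rows_per_band=2))
    print(candidate_probability(0.1, rows_per_band=2, bands=500))
```

With 500 bands of two hashes, a pair with Jaccard similarity 0.1 becomes a candidate with probability of about 0.99 under this formula, whereas accounts that share no neighbours never match, which is consistent with the large reduction in candidates reported above.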
Sorting Similarities

LSH produces a set of candidate accounts that are related to at least one of the input seeds. In general, we do not want every candidate returned by LSH. Therefore, we select the subset of candidates that are most associated with the whole seed set. We experimented with two sequential ranking schemes: Minhash Similarity (MS) and Agglomerative Clustering (AC). The rankings can best be understood through the Jaccard distance $D = 1 - J$, which is used to define the centre $X \in [0, 1]^M$ of any set of vertices. At each step AC and MS augment the results set $C$ with $A^*$, the closest account to $X$ with $A^* \notin C$. However, MS uses a constant value of $X$ based on the input seeds while AC updates $X$ after each step. Formally, the centre of the input vertices used for MS is defined by

$$X_j(A_j, S) = \frac{1}{n} \sum_{i=1}^{n} D(A_j, S_i), \quad j = 0, 1, \ldots, M. \tag{6}$$

Algorithm 2 Agglomerative Clustering Algorithm (AC)
Require: Community $C$ of initial seeds
  Define candidate members $\bar{A} = \mathrm{LSH}(C)$
  repeat
    Compute community centre $X(\bar{A}, S)$, see (6)
    Select next Account: $A^* = \arg\min_{A_i \notin \bar{C}} X(A_i, S)$
    Grow community: $C_{t+1} \leftarrow C_t \cup A^*$, $\bar{A} \leftarrow \bar{A} \setminus A^*$
  until Stopping criterion is met

At each iteration of Algorithm 2, $C$ and $X$ are updated by first setting $C = S$ and then adding the closest account given by

$$A^* = \arg\min_i X(A_i, C) \quad \forall A_i \notin C, \tag{7}$$

leading to $C_{t+1} = C_t \cup A^*$. The new centre $X_{t+1}$ is most efficiently calculated using the recursive online update equation

$$X_{t+1}(A, C_{t+1}) = \frac{n X(A, C_t) + D(A^*, C_t)}{n + 1}, \tag{8}$$

where $n$ is the size of $C_t$.

4.1.3 Stopping Criterion

Both AC and MS are sequential processes and will return every candidate account unless a stopping criterion is applied. Many stopping criteria have been used to terminate seed expansion processes.
The simplest method is to terminate after a fixed number of inclusions. Alternative methods use local maxima in modularity (Lancichinetti et al., 2009) and conductance (Leskovec et al., 2010).

An application of our work is to help define an optimal set of celebrities to endorse a brand. In this context we want to answer questions like: "What is the smallest set of athletes that have influence on over half of the users of Twitter?". We refer to the number of unique neighbours of a set of accounts as the coverage of that set. An exact solution to this problem is combinatorial and requires calculating large numbers of unions over very large sets. However, it can be efficiently approximated using minhash signatures. We exploit two properties of minhash signatures to do this: the unbiased Jaccard estimate through Equation 2, and the fact that the minhash signature of the union of two sets is the elementwise minimum of their respective minhash signatures. Minhash signatures allow coverage to be used as a stopping criterion to rank LSH candidates without losing real-time performance.

Efficient Coverage Computation

The coverage $y$ is given by

$$y = \left| \bigcup_{i=1}^{n} N(A_i) \right|, \tag{9}$$

the number of unique neighbours of the output vertices. Every time a new account $A$ is added we need to calculate $|N(C) \cup N(A)|$ to update the coverage. This is a large union operation and expensive to perform on each addition. Lemma 1 allows us to rephrase this expensive computation equivalently by using the Jaccard coefficient (available cheaply via the minhash signatures), which we subsequently use for a real-time iterative algorithm.

Lemma 1 For a community $C = \bigcup_i A_i$ and a new account $A \notin C$, the number of neighbours of the union $A \cup C$ is given as

$$|N(A \cup C)| = \frac{|N(A)| + |N(C)|}{1 + J(A, C)}. \tag{10}$$

Proof Following (1), the Jaccard coefficient of a new account $A \notin C$ and the community $C$ is

$$J(A, C) = \frac{|A \cap C|}{|A \cup C|}. \tag{11}$$

By considering the Venn diagram and utilising the inclusion-exclusion principle, we obtain

$$|A \cup C| = |A| + |C| - |A \cap C|. \tag{12}$$

Substituting this expression in the denominator of the Jaccard coefficient in (11) yields

$$\frac{|A| + |C|}{1 + J(A, C)} \overset{(11),(12)}{=} \frac{|A| + |C|}{1 + \frac{|A \cap C|}{|A| + |C| - |A \cap C|}} = \frac{|A| + |C|}{\frac{|A| + |C|}{|A| + |C| - |A \cap C|}} = |A| + |C| - |A \cap C| \overset{(12)}{=} |A \cup C|,$$

which proves (10) and the Lemma.

Lemma 2 A community $C = \bigcup_i A_i$ can be represented by a minhash signature $H(C) = \{h_1(C), h_2(C), \ldots, h_k(C)\}$ where

$$h_i(C) = \min_j (h_i(A_j)). \tag{13}$$

Proof A minhash signature is composed of $k$ independent minhash functions, each of which is a compound function made up of a general mapping and a minimum operation: $h_i(A) = \min(g_i(A)) \; \forall i, A$, where $g_i : \mathbb{N}^m \to \mathbb{N}^m$, and so

$$h_i(\cup_j A_j) = \min(g_i(\cup_j A_j)), \tag{14}$$

$$h_i(C) = \min(g_i(A_1), g_i(A_2), \ldots, g_i(A_k)), \tag{15}$$

which proves (13) and the Lemma.

We use Lemma 1 to update the unique neighbour count. Once the next account to add to the community $A^*$ is determined according to (7),

$$|N(C_{t+1})| = \frac{|N(C_t)| + |N(A^*)|}{1 + J(C_t, A^*)}. \tag{16}$$

The right hand side of (16) contains three terms: $|N(C_t)|$ is what we started with, $|N(A^*)|$ is the neighbour count of $A^*$, which is easily obtained from Twitter or Facebook metadata, and $J(C_t, A^*)$ is a Jaccard calculation between a community and an account. The minhash signature of a community is obtained via (13) and so we are able to calculate the coverage with negligible additional computational overhead.
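The coverage update in (16), combined with the signature merge in (13), can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not our production code: the function names are hypothetical, signatures are plain lists of integers, and the Jaccard term is the minhash estimate of Equation 2, so the returned coverage is itself an estimate.

```python
def estimate_jaccard(sig_a, sig_b):
    """Equation (2): fraction of signature positions on which both sets agree."""
    return sum(1 for x, y in zip(sig_a, sig_b) if x == y) / len(sig_a)


def merge_signatures(sig_c, sig_a):
    """Lemma 2: the signature of a union is the element-wise minimum."""
    return [min(x, y) for x, y in zip(sig_c, sig_a)]


def update_coverage(coverage_c, sig_c, degree_a, sig_a):
    """Equation (16): coverage and signature after adding account A to community C.

    coverage_c: current estimate of |N(C)|
    degree_a:   neighbour count |N(A)| of the new account (e.g. from metadata)
    """
    j_hat = estimate_jaccard(sig_c, sig_a)
    new_coverage = (coverage_c + degree_a) / (1.0 + j_hat)
    return new_coverage, merge_signatures(sig_c, sig_a)


if __name__ == "__main__":
    # Toy values: half of the signature positions agree, so the estimate is 0.5.
    coverage, signature = update_coverage(100, [1, 2, 3, 4], 50, [1, 2, 5, 0])
    print(coverage, signature)
```

Each addition costs one pass over two length-$K$ signatures, so tracking coverage as a stopping criterion adds little to the cost of ranking the candidates.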
4.2 Stage 2: Community Detection and Visualisation

Stage one expanded the seed accounts to find the related region. This was done by first finding a large group of candidates using LSH that were related to any one of the seeds and then filtering down to the accounts most associated to the whole seed set. In Stage two, the vertices returned by Stage one are used to construct a weighted Jaccard similarity graph. Figure 2 depicts the process of transforming from the original unweighted graph to the weighted graph. The red vertices are those returned by stage one. Edge weights are calculated for all pairwise associations from the minhash signatures through Equation 2. This process effectively embeds the original graph in a metric Jaccard space (Broder, 1997). Community detection is run on the weighted graph.

The final element of the process is to visualise the community structure and association strengths in the region of the input seeds. We experimented with several global community detection algorithms. These included INFOMAP, Label Propagation, various spectral methods and Modularity Maximisation (Rosvall and Bergstrom, 2008; Raghavan et al., 2007; Newman, 2006, 2004b). The Jaccard similarity graph is weighted and almost fully connected and most community detection algorithms are designed for binary sparse graphs. As a result, all methods with the exception of Label Propagation and WALKTRAP were too slow for our use case. Label Propagation had a tendency to select a single giant cluster, thus adding no useful information. Therefore, we chose WALKTRAP for community visualisation.

Figure 2 panels (in-figure label: "Community of Influencers"): (a) A social network containing three interesting (red) vertices. (b) The overlapping neighbourhood graphs of the interesting vertices in (a). (c) The inferred weighted network.
Vertices connected by high weights are more likely to be in the same community.

Figure 2: Visualising the generation of a weighted subgraph. Interesting vertices are depicted as larger red nodes, and the neighbours as smaller, more numerous black nodes. (a) shows a complete social network. (b) depicts the overlapping neighbourhood graphs of the three interesting vertices in (a). (c) The network is summarised by an inferred network using the Jaccard similarity measure of the set of neighbouring vertices as edge weights.

5. Ground-Truth Communities

To provide a quantitative assessment of our method we require ground-truth labelled communities. No ground-truth exists for the data sets of interest and so in this section we provide a methodology for generating ground-truth. This methodology itself must be verified and we provide an extensive evaluation of the quality of the derived ground-truth based on the axiomatic definitions described in Yang and Leskovec (2012). Most community detection algorithms (including ours) are based on the structure of the graph (Fortunato and Barthelemy, 2007). Axiomatically, good community structures are:

• Compact
• Densely interconnected
• Well separated from the rest of the network
• Internally homogeneous

However, while communities are detected using these properties, verification typically requires associating each vertex with some functional attributes, e.g., fans of Arsenal football club or Python programmers, and showing that the discovered communities group attributes together (Yang and Leskovec, 2012). The practice of relating community membership with personal attributes is justified by the homophily principle of social networks (McPherson et al., 2001), which states that people with similar attributes are more likely to be connected. We reverse the process of verification by generating ground-truth from personal attributes.
Table 3: Properties of ground-truth communities sorted by edge density. CR stands for Conductance Ratio. High values of clustering, density and separability and low values of CR, conductance and cohesiveness indicate good communities.

Community           Size  Clustering  Cohesiveness  Conductance  CR    Density   Separability
Mixed Martial Arts   751  6.49E-02    4.29E-01      5.10E-01     1.19  3.06E-02  4.80E-01
Adult Actors         352  7.20E-02    1.29E-01      7.70E-01     5.98  2.94E-02  1.50E-01
Cycling              371  6.43E-02    4.51E-01      7.04E-01     1.56  2.50E-02  2.11E-01
Baseball             616  3.64E-02    1.49E-01      7.87E-01     5.29  1.63E-02  1.35E-01
Basketball           786  3.84E-02    3.30E-01      7.71E-01     2.34  1.60E-02  1.48E-01
American Football   1295  2.24E-02    3.82E-01      7.40E-01     1.94  9.33E-03  1.75E-01
Athletics            530  3.48E-02    4.13E-01      8.47E-01     2.05  8.21E-03  9.01E-02
Hotel Brand          836  2.20E-02    4.53E-01      8.37E-01     1.85  6.16E-03  9.71E-02
Airline              363  2.30E-02    4.41E-01      9.46E-01     2.15  4.35E-03  2.84E-02
Cosmetics            332  3.34E-02    4.87E-01      9.56E-01     1.96  3.55E-03  2.32E-02
Football            4111  3.69E-02    3.95E-01      7.07E-01     1.79  2.93E-03  2.07E-01
Alcohol              388  1.72E-02    2.34E-01      9.52E-01     4.06  2.66E-03  2.53E-02
Travel              2038  1.27E-02    4.25E-01      8.29E-01     1.95  2.50E-03  1.03E-01
Model               2096  2.62E-02    4.04E-01      9.01E-01     2.23  1.90E-03  5.50E-02
Electronics          689  1.40E-02    4.38E-01      9.75E-01     2.23  8.78E-04  1.30E-03
Food and Drink      2974  1.76E-02    4.57E-01      9.06E-01     1.98  7.69E-04  5.18E-02

To generate attributes we match Twitter accounts with Wikipedia pages and associate Wikipedia tags with each Twitter account. Wikipedia tags give hierarchical functions like ‘football:sportsperson:sport’ and ‘pop:musician:music’. It is not possible to match every Twitter account and our matching process discovered 127 tags that occur more than 100 times in the data. Of these, many were clearly too vague to be useful such as ‘news:media’ or ‘Product Brand:Brands’.
We selected 16 tags that had relatively high frequencies in the data set and evaluated 7 metrics for each that are related to the four axioms. These results are shown in Table 3. Separability and conductance measure how well separated a community is from the rest of the graph. Density and size measure the compactness and density. Cohesiveness, clustering and conductance ratio measure how internally homogeneous a community is. The mathematical formulation of these metrics and details of how they were calculated are provided in Appendix B.

Table 3 is sorted by density and the bold rows are visualised in Figures 4, 5, 6 and 7. The density is the most important factor distinguishing good from bad communities, varying by two orders of magnitude across the data. This is followed by how well separated (separability) the community is from the rest of the network, which is inversely correlated with conductance by design (see Equations 21 and 25). High clustering is also a useful indicator of community goodness for the best communities, but is less useful for separating communities that are made up of many sub-units like team sports from very bad communities like Food and Drink. Cohesiveness is generally not useful as most communities contain at least one well separated sub-unit.

To establish a clearer view of the density and homogeneity of the ground-truth we visualise the communities using network diagrams and dendrograms. Network diagrams are generated in Gephi (Bastian et al., 2009). The layout uses the Force Atlas 2 algorithm. Colours indicate clusters generated using Gephi's modularity optimisation routine. The node (and label) sizes indicate the weighted degree of each node and are scaled to be between 5 and 20 pixels. The network diagrams reveal any substructure present within the ground-truth.
They contain too much information to easily see the individual accounts and so we magnify small subregions and display Twitter profile images for accounts within them. A weakness of the network diagrams is that different edge weights are hard to perceive. To provide a visual representation of the general strength of interaction we generated dendrograms (Figure 3) for each ground-truth community. Dendrograms are agglomerative: All accounts with a Jaccard distance less than the y-value are fused together into a super-node. Any subgroups containing more than 10 nodes with no two nodes separated by a Jaccard distance greater than 0.85 have been coloured to indicate sub-communities.

Figure 3 panels (each plots Jaccard distance on an axis from 0.0 to 1.0):
(a) Travel, (b) Airline, (c) Hotel, (d) Cosmetics: Industrial groups. Small highly connected groups due to sub-brands.
(e) Food and Drink, (f) Electronics, (g) Alcohol, (h) Model: Industrial groups. Limited interaction.
(i) Mixed Martial Arts, (j) Cycling, (k) Athletics, (l) Adult Actor: Strongly connected communities. Sub-communities mostly due to nationality.
(m) American Football, (n) Baseball, (o) Basketball, (p) Football: Team sports. Many highly connected sub-groups.

Figure 3: Dendrograms showing the strength of interconnection within communities. The vertical axes show the Jaccard distances. Blue areas are not strongly connected. In each coloured region no two nodes are separated by a Jaccard distance greater than 0.85. The dendrograms are agglomerative: All accounts with a Jaccard distance less than the y-value are fused together into a super-node. The fusing process is sequential and the x-axis indicates the order of fusing with the first nodes to agglomerate at the right.

Figure 4: The Mixed Martial Arts community is relatively homogeneous and densely interconnected with high clustering and good separability from the rest of the network. The only disconnected region is the yellow region, which has been magnified to show that it is made up of Olympic judo competitors. This community is well detected by all methods.

Figure 4 shows the Mixed Martial Arts (MMA) community. From Table 3 we see that this community is densely connected, strongly clustered and very well separated from the rest of the network. The black region in Dendrogram 3i is a massive cluster where the distance between any two nodes is less than 0.8. It depicts MMA fighters, mostly fighting in the Ultimate Fighting Championship (UFC). There is a single well separated sub-community, which is magnified in Figure 4 showing Olympic judo fighters. MMA is the best community in our study. The Cycling, Adult Actor and Athletics communities are similar in structure to MMA (see Table 3).

Figure 7 shows that the basketball community (largely NBA players) exhibits two large communities (the two NBA conferences). The individual team structure within the divisions is apparent from the fine banding in Figure 3o where many well-connected sub-clusters, each with a distance of less than 0.85 between all pairs of nodes, are visible.
We have magnified a small disconnected region of Figure 7, which shows players of the Women's National Basketball Association (WNBA). Baseball, football and American football exhibit similar structural properties.

Figure 5 shows that the alcohol industry is split into four major groups representing the different classes of alcoholic drink (wine, beer, cider and spirits). We have magnified a region of the network that contains mostly English craft ciders. Dendrogram 3g shows that the alcohol network is mostly poorly connected with only two coloured regions indicating well connected sub-communities. From Table 3 it can be seen that the alcohol network exhibits a low link density and separability, indicating that the community lacks distinction from the rest of the network. This is a consistent pattern for communities drawn from industrial segmentations.

Figure 5: The alcohol community is a low density community with poor clustering. It is divided into broad classes of drink such as beer, spirits and wine. We have magnified an area of the cider sub-community.

Figure 6: The hotels network has low conductance indicating that it is not well separated from the rest of the network. It also has high cohesiveness indicating it contains components that appear to be the true modular units. The two clearly visible subcomponents are the Four Seasons brand in blue to the left and the hotels of Las Vegas, which is magnified.

Figure 7: The basketball community has very similar attributes to the baseball and American football communities, all being densely connected and well separated from the rest of the network. The individual team structure is not apparent in the graph; instead the two large clusters show teams from the eastern and western conferences. The small peripheral clusters are mostly major college teams and we have magnified an area showing players of the Women's National Basketball Association.

Figure 6 shows an example of the final group of ground truth communities: industrial groups with prominent sub-communities. In this case the major sub-communities are the Four Seasons Hotel group and hotels located in Las Vegas (magnified). Dendrogram 3c shows that while the hotel network is generally poorly connected, there are sizeable highly interconnected sub-communities. Table 3 shows that the hotel community has low clustering as most accounts are disconnected, high cohesiveness as there are well connected sub-groups and a low Conductance Ratio. The travel, airlines and cosmetics communities all share these traits.

In summary we identify four groups of ground-truth communities and evaluate their quality based on the four axioms. We find that the groups differ greatly in quality. The group containing mixed martial arts, cycling, athletics and adult actors satisfies the four axioms and forms a good set of ground-truth for algorithm evaluation. The group comprising team sports (American football, baseball, basketball and football) satisfies three of the four axioms (they are not homogeneous). The remaining communities only contain sub-groups that satisfy any of the axioms.

Table 4: The Twitter accounts with the highest Jaccard similarities to @Nike. $J$ and $R$ give the true Jaccard coefficient and Rank, respectively. $\hat{J}$ and $\hat{R}$ give approximations using Equation (2) where the superscript determines the number of hashes used. Signatures of length 1,000 largely recover the true Rank.
Twitter handle   $J$    $R$  $\hat{J}^{100}$  $\hat{R}^{100}$  $\hat{J}^{1000}$  $\hat{R}^{1000}$
adidas           0.261   1   0.22              2               0.265              1
nikestore        0.246   2   0.25              1               0.255              2
adidasoriginals  0.200   3   0.18              3               0.222              3
Jumpman23        0.172   4   0.13              7               0.166              4
nikesportswear   0.147   5   0.18              4               0.137              5
nikebasketball   0.144   6   0.16              5               0.127              7
PUMA             0.132   7   0.13              6               0.132              6
nikefootball     0.127   8   0.08             17               0.110              9
adidasfootball   0.112   9   0.09             16               0.113              8
footlocker       0.096  10   0.08             17               0.096             11

6. Experimental Evaluation

Our approach to real-time community detection relies on two approximations: minhashing for rapid Jaccard estimation and locality sensitive hashing to provide a fast query mechanism on top of minhashing. We assess the effect of these approximations, and demonstrate the quality of our results in three experiments: (1) We measure the sensitivity of the Jaccard similarity estimates with respect to the number of hash functions used to generate the signatures. This will justify the use of the minhash approximation for computing approximate Jaccard similarities. (2) We compare the run time and recall of our process on ground-truth communities against the Personal PageRank (PPR) algorithm (state of the art) on a single laptop. (3) We visualise detected communities and demonstrate that association maps for social networks using minhashing and LSH produce intuitively interpretable maps of the Twitter and Facebook graphs in real-time on a single machine.

6.1 Experiment 1: Assessing the Quality of Jaccard Estimates

We empirically evaluate the minhash estimation error using a sample of 400,000 similarities taken from the 250 billion pairwise relationships between the Twitter accounts in our study. We compare estimates using Equation 2 to exact Jaccards obtained by exhaustive calculations on the full sets using Equation 1. Figure 8 shows the estimation error (L1 norm) as a function of the number of hashes comprising the minhash signature.
Standard error bars are just visible up until 400 hashes. The graph shows an expected error in the Jaccard of just 0.001 at 1,000 hashes. The high degree of accuracy and diminishing improvements at this point led us to select a signature length of $K = 1{,}000$. This value provides an appropriate balance between accuracy and performance (both run time and memory scale linearly with $K$).

Figure 8 (plot; x-axis: hash size, 0 to 1000; y-axis: $|J - \hat{J}|$, 0.000 to 0.010): Expected error from Jaccard estimation using minhash signatures against the number of the hashes used in the signature. The error bars show twice the standard error using 400,000 data points.

A top-ten list of Jaccard similarities is given in Table 4 for the Nike Twitter account (based on the true Jaccard). Possible matches include sports people, musicians, actors, politicians, educational institutions, media platforms and businesses from all sectors of the economy. Of these, our approach identified four of Nike's biggest competitors, five Nike sub-brands and a major retailer of Nike products as the most associated accounts. This is consistent with our assertion that the Jaccard similarity of neighbourhood sets provides a robust similarity measure between accounts. We found similar trends throughout the data and this is consistent with the experience of analysts at Starcount, a London based social media analytics company, who are using the tool.

Table 4 also shows how the size of the minhash signature affects the Jaccard estimate and the corresponding rank of similar accounts. Local community detection algorithms add accounts in similarity order. Therefore, approximating the true ordering is an important property.
We measure the Spearman rank correlation between the true Jaccard similarities (column $R$) and those calculated from signatures of length 100 (column $\hat{R}^{100}$) and 1000 (column $\hat{R}^{1000}$) to be 0.89 and 0.97 respectively. The close correspondence of the rank vector using signatures of length 1,000 and the true rank supports our decision to use signatures containing 1,000 hashes.

6.2 Experiment 2: Comparison of Community Detection with PPR

In experiment 2 we move from assessing a single component (minhashing) to a system-wide evaluation: We evaluate the ability of our algorithm to detect related entities by measuring its performance as a local community detection algorithm seeded with members of the ground-truth communities listed in Table 3. As a baseline for comparison we use the PPR algorithm, which is considered to be the state of the art for this problem (Kloumann and Kleinberg, 2014).

It is impossible to provide a fully like-for-like comparison with PPR: Running PPR on the full graph (700 million vertices and 20 billion edges) that we extract features from requires cluster computing and could return results outside of the accounts we considered. The alternative is to restrict PPR to run on the directly observed network of the 675,000 largest Twitter accounts, which could then be run on a single machine. We adopt this latter approach as it is the only option that meets our requirements (single machine and real-time).

Figure 9: Process diagram for Experiment 2. Circles indicate (intermediate) results and rectangles are processes. The size of circles is illustrative of the number of accounts at each stage. From each community we sample 30 seeds to use as input for an LSH query. The LSH query produces a larger set of candidate accounts. The candidate lists are submitted to the MS and AC sorting procedures, which return results.
In our experimentation, we randomly sampled 30 seeds from each ground-truth community. To produce MS and AC results we followed the process depicted in Figure 9: the seeds are input to an LSH query, which produces a list of candidate near-neighbours. For each candidate the Jaccard similarity is estimated using minhash signatures, and the candidates are sorted by either the MS or AC procedure. We compare MS and AC to PPR operating on the directly observed network of the 675,000 largest accounts. Our PPR implementation uses the 30 seeds as the teleport set and runs for three iterations, returning a ranked list of similar Twitter accounts. In all cases, we sequentially select accounts in similarity order and measure the recall after each selection. The recall is given by

$$\text{recall} = \frac{|C \cap C_{\text{true}}|}{|C_{\text{true}}| - |C_{\text{init}}|} \tag{17}$$

with $C_{\text{init}}$ as the initial seed set, $C_{\text{true}}$ as the ground-truth community and $C$ as the set of accounts added to the output. For a ground-truth community of size $|C_{\text{true}}|$ we do this for the $|C_{\text{true}}| - 30$ most similar accounts so that a perfect system could achieve a recall of one.

The results of this experiment are shown in Figure 10 with the Area Under the Curves (AUC) given in Table 5. Bold entries in Table 5 indicate the best performing method. In all cases MS and AC give superior results to PPR. Figure 10 shows standard errors over five randomly chosen input sets of 30 accounts from $C_{\text{true}}$. The confidence bounds are tight, indicating that the methods are robust to the choice of input seeds. Figure 10 is grouped to correspond to the dendrograms in Figure 3. Performance of all methods is considerably affected by the quality of the communities. Communities with good values of the metrics given in Table 3 in general have superior recall across all methods.
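The recall measure of Equation 17 and the area under the resulting curve can be sketched as below. Two choices here are assumptions rather than details taken from the article: the x-axis is normalised by the number of additions, $|C_{\text{true}}| - |C_{\text{init}}|$, and the area is computed with the trapezoid rule. Under those choices a perfect ranking scores exactly 0.5. The function names and toy communities are hypothetical.

```python
def recall_curve(ranked, seeds, truth):
    """Recall (Equation 17) after each of the first |truth| - |seeds| additions."""
    remaining = set(truth) - set(seeds)  # members that still have to be found
    budget = len(remaining)
    found = 0
    curve = [0.0]
    for account in ranked[:budget]:
        if account in remaining:
            remaining.remove(account)
            found += 1
        curve.append(found / budget)
    # Pad with the final value if fewer than `budget` accounts were returned.
    curve.extend([curve[-1]] * (budget + 1 - len(curve)))
    return curve


def area_under_curve(curve):
    """Trapezoid rule with the x-axis scaled to run from 0 to 1."""
    steps = len(curve) - 1
    return sum((curve[i] + curve[i + 1]) / 2 for i in range(steps)) / steps


# Toy community of 50 members, 30 of which are used as seeds.
truth = {f"m{i}" for i in range(50)}
seeds = [f"m{i}" for i in range(30)]

perfect = [f"m{i}" for i in range(30, 50)]
# A noisier ranking that alternates outsiders with true members.
noisy = [name for i in range(30, 50) for name in (f"x{i}", f"m{i}")]

auc_perfect = area_under_curve(recall_curve(perfect, seeds, truth))
auc_noisy = area_under_curve(recall_curve(noisy, seeds, truth))
print(auc_perfect, auc_noisy)
```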
The third row of Figure 10 contains the best communities as measured by the metrics in Table 3. For this group recalls are as high as 80% (Cycling, AC). The worst group of communities are the transnational industrial communities in the second row. The lowest recall in row three (Athletics, PPR) is still higher than the highest recall in the second row of results (Alcohol, AC). The best performing method for every community in row three of the results is AC. This is because AC is an adaptive method that can incorporate information from early results. The downside of an adaptive method is that pollution from false positives can rapidly degrade performance. This can be seen in the step decrease in gradient of the AC curves for basketball, baseball and adult actors.

Figure 10: Average recall (with standard errors) of Agglomerative Clustering (yellow), Personalised PageRank (red) and Minhash Similarity (blue) against the number of additions to the community expressed as a fraction of the size of the ground-truth communities given in Table 3. The tight error bars indicate that the methods are robust to the choice of seeds.

The fourth row of Figure 10 contains team sports. Team sports also have good metrics in Table 3, but differ markedly in structure from the communities in row three. The communities in row four have well-defined multi-modal sub-structures generated by the different teams. Both AC and MS are unimodal procedures that store the centre of a set of data points. For a multimodal distribution the mean may not be particularly close to the distribution and so false positives will occur. As AC incorporates false positives into the estimation procedure for all future results, MS outperforms AC for all team sport communities. Of the communities in the first and second rows of Figure 10, AC is best performing in four and MS is best performing in four. These communities are all diffuse, but some have a single densely connected region that can be found well by AC.

Table 5: Area under the recall curves (Figure 10). Bold entries indicate the best performing method. Minhash Similarity (MS) is the best method in 8 cases, Agglomerative Clustering (AC) in 8 cases and Personalised PageRank (PPR) in none. A perfect community detector would score 0.5.

Tags                  PPR     MS      AC
travel                0.186   0.240   0.230
airline               0.040   0.151   0.180
hotel brand           0.160   0.294   0.285
cosmetics             0.055   0.086   0.143
food and drink        0.072   0.099   0.082
electronics           0.035   0.069   0.059
alcohol               0.069   0.199   0.229
model                 0.078   0.110   0.109
mixed martial arts    0.317   0.363   0.386
cycling               0.278   0.330   0.445
athletics             0.219   0.285   0.365
adult actor           0.269   0.347   0.397
american football     0.240   0.371   0.240
baseball              0.203   0.379   0.378
basketball            0.252   0.380   0.353
football              0.202   0.233   0.212

Much of the difference in performance of these methods derives from their respective ability to explore the graph: PPR is really a global algorithm that has been modified to find local relationships. After three iterations PPR uses first-, second- and third-order connections. First-order connection methods just use edges that directly connect to the seed nodes (neighbours). Second-order methods also give weight to the connections of the first-order nodes (neighbours of neighbours) and so on for third-order connections. The ability to explore higher-order connections is the principal reason identified by Kloumann and Kleinberg (2014) for the state-of-the-art performance of PPR. They also note that after two iterations most of the benefit is realised and that after three iterations there is no more improvement.
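To illustrate why the number of iterations bounds the order of connections that PPR can use, the sketch below runs a plain power iteration of personalised PageRank with the seeds as the teleport set. It is a minimal sketch, not the implementation evaluated above: the damping factor of 0.85, the handling of dangling vertices and the toy graph are all assumptions.

```python
def personalised_pagerank(out_links, seeds, iterations=3, damping=0.85):
    """Power iteration for PPR; rank mass starts on, and teleports to, the seeds."""
    teleport = {s: 1.0 / len(seeds) for s in seeds}
    rank = dict(teleport)
    for _ in range(iterations):
        # Every step returns a (1 - damping) share of the mass to the teleport set.
        nxt = {s: (1 - damping) * p for s, p in teleport.items()}
        for vertex, mass in rank.items():
            targets = out_links.get(vertex, [])
            if not targets:
                # Dangling vertex: send its mass back to the teleport set.
                for s, p in teleport.items():
                    nxt[s] += damping * mass * p
                continue
            share = damping * mass / len(targets)
            for target in targets:
                nxt[target] = nxt.get(target, 0.0) + share
        rank = nxt
    return rank


# Hypothetical chain: each vertex is one hop further from the seed.
follows = {"seed": ["a"], "a": ["b"], "b": ["c"], "c": ["d"], "d": []}
rank = personalised_pagerank(follows, ["seed"], iterations=3)
print(sorted(rank.items(), key=lambda kv: -kv[1]))
```

After three iterations the third-order vertex `c` has received rank mass while the fourth-order vertex `d` has not, which mirrors the first-, second- and third-order reach described above.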
Our implementations of MS and AC are effectively second-order methods as they operate on a derived graph where the edge weight between two vertices is calculated from the overlap of the respective neighbourhoods. MS and AC outperform PPR because they are based on many more second-order connections, as they run on a compressed version of the full graph instead of a sub-graph. PPR is expected to perform better given more computational resources, but the additional complexity, runtime, latency or financial cost required for any scaled up/out solution would violate our system constraints.

Table 6: Clustering runtimes averaged over communities.

Method   Mean (s)   Std. Dev.
PPR      12.58      8.83
MS       0.23       0.08
AC       18.6       22.0

Figure 11: Process diagram for Experiment 3. A set of seeds is queried using LSH and Minhash Similarity. The weighted adjacency matrix for the top 100 results is estimated using minhash signatures. The WALKTRAP community detection algorithm is run on the weighted adjacency matrix and the results are visualised.

Table 6 gives the mean and standard deviation of the runtimes averaged over the 16 communities. MS is the fastest method by almost two orders of magnitude (on average roughly 55 times faster than PPR and 80 times faster than AC). Average human reaction times are approximately a quarter of a second and so MS delivers a real-time user experience (Hewett et al., 1992). As MS is the only method capable of operating in the real-time domain and this is a system requirement, we choose the MS procedure for experiment 3 and in our operational prototype.

6.3 Experiment 3: Real-Time Graph Analysis and Visualisation

In the following, we provide example applications of our system to graph analysis. Users need only input a set of seeds, wait a quarter of a second and the system discovers the structure of the graph in the region of the seeds.
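The adjacency-matrix step in Figure 11 can be sketched as follows: given the signatures of the returned accounts, every pairwise Jaccard similarity is estimated and stored in a symmetric weight matrix. The function names and the tiny hand-written signatures are hypothetical and serve only to make the arithmetic easy to follow.

```python
def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which two signatures agree."""
    return sum(u == v for u, v in zip(sig_a, sig_b)) / len(sig_a)


def weighted_adjacency(signatures):
    """Symmetric matrix of pairwise Jaccard estimates with a zero diagonal."""
    names = sorted(signatures)
    size = len(names)
    matrix = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(i + 1, size):
            weight = estimate_jaccard(signatures[names[i]], signatures[names[j]])
            matrix[i][j] = matrix[j][i] = weight
    return names, matrix


# Hypothetical four-hash signatures for three accounts.
toy_signatures = {
    "a": (1, 2, 3, 4),
    "b": (1, 2, 3, 9),
    "c": (7, 8, 3, 9),
}
names, matrix = weighted_adjacency(toy_signatures)
for name, row in zip(names, matrix):
    print(name, row)
```

For the 100 accounts used here this requires 4,950 signature comparisons. The resulting matrix is the kind of input a WALKTRAP implementation expects; which library was used for that step is not stated in this section.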
Users can then iterate the input seeds based on what previous outputs reveal about the graph structure. Figure 12 shows results on the Facebook Page Engagements network while Figures 13 and 14 use the Twitter Followers graph. Each diagram is generated by the procedure shown in Figure 11: seeds are passed to the MS process, which returns the 100 most related entities. All pairwise Jaccard estimates are then calculated using the minhash signatures and the resulting weighted adjacency matrix is passed to the WALKTRAP global community detection algorithm. The result is a weighted graph with community affiliations for each vertex. In our visualisations we use the Force Atlas 2 algorithm to lay out the vertices. The thickness of the edges between vertices represents the pairwise Jaccard similarity, which has been thresholded for image clarity. The vertex size represents the weighted degree of the vertex, but is logarithmically scaled to be between 1 and 50 pixels. The vertex colours depict the different communities found by the WALKTRAP community detection algorithm.

We show some results using the Facebook Pages engagement graph to demonstrate that our work is broadly applicable across digital social networks. However, there are some key differences between the Facebook Pages engagement graph and the Twitter Followers graph. As Following is the method used to subscribe to a Twitter feed, Follows tend to represent genuine interest. In contrast, Facebook engagement is often used to grant approval or because a user desires an association. In addition, the Twitter graph corresponds to actions occurring as far back as 2006 (relatively few edges are ever deleted), while the Facebook graph corresponds only to events since 2014, when we began collecting data.
As a result the Twitter data set contains significantly more data, but with less relevance to current events.

Our work uses the vast scale and richness of social media data to provide insights into a broad range of questions. Here are some illustrative examples:

• How would you describe the factions and relationships within the US Republican party? This is a question with a major temporal component, and so we use the Facebook Pages graph. We feed “Donald Trump”, “Marco Rubio”, “Ted Cruz”, “Ben Carson” and “Jeb Bush” as seeds into the system and wait 0.25 s for Figure 12a, which shows a densely connected core group of active politicians with Donald Trump at the periphery surrounded by a largely disconnected set of right-wing interest bodies.

• Which factions exist in global pop music? We feed the seeds “Justin Bieber”, “Lady Gaga” and “Katy Perry” into the system loaded with the Facebook Pages engagement graph and wait 0.25 s for Figure 12b, which shows that the industry forms communities that group genders and races.

• How are the major social networks used? We feed the seeds “Twitter”, “Facebook”, “YouTube” and “Instagram” into the system loaded with the Twitter Followers graph and wait 0.25 s for Figure 13a, which shows that Google is highly associated with other technology brands while Instagram is closely related to celebrity and YouTube and Facebook are linked to sports and politics.

• How is the brand RedBull perceived by Twitter users? We feed the single seed “RedBull” into the Twitter Followers graph and wait 0.25 s for Figure 13b, which shows that RedBull has strong associations with motor racing, sports drinks, extreme sports, gaming and football.

• How does sports brand marketing differ between the USA and Europe? We use the Twitter Followers graph.
“Adidas” and “Puma” are the seeds for the European brands while “Nike”, “Reebok”, “UnderArmour” and “Dicks” are used to represent the US sports brands. Figures 14a and 14b show the enormous importance of football (soccer) to European sports brands, whereas US sports brands are associated with a broad range of sports including hunting, NFL, basketball, baseball and mixed martial arts (MMA).

In all cases, the user selects a group of seeds (or a single seed) and runs the system, which returns a figure and a table of community memberships in 0.25 s. Analysts can then use the results to supplement the seed list with new entities or use the table of community members from a single WALKTRAP sub-community to explore higher resolution.

Figure 12: Visualisations of the Facebook Pages engagement graph around different sets of seed vertices. (a) The US Republican party. Seeds are “Donald Trump”, “Marco Rubio”, “Ted Cruz”, “Ben Carson” and “Jeb Bush”. (b) Global pop music. Seeds are “Justin Bieber”, “Lady Gaga” and “Katy Perry”. The vertex size depicts degree of similarity to the seeds. Edge widths show pairwise similarities. Colours are used to show different communities.

Figure 13: Visualisations of the Twitter Follower graph around different sets of seed vertices. (a) The major social networks. Seeds are Twitter, Facebook, YouTube and Instagram. (b) The many faces of RedBull. Vertex size depicts degree of similarity to the seeds. Edge widths show pairwise similarities. Colours are used to show different communities.

Figure 14: Visualisations of the Twitter Follower graph around different sets of seed vertices. (a) European sports brands. Seeds are Adidas and Puma. (b) US sports brands. Seeds are Nike, Reebok, UnderArmour and Dicks. Vertex size depicts degree of similarity to the seeds. Edge widths show pairwise similarities. Colours are used to show different communities.

Similar tasks are traditionally conducted with expensive and difficult-to-scale techniques, such as telephone polling and focus groups, which often take months to return results. In contrast, we are able to produce an automatic analysis in a fraction of a second and at minimal cost, which allows for interactive community detection in large social networks.

7. Conclusion and Future Work

We have presented a real-time system to automatically detect communities in large social networks. The system is sufficiently computationally and memory efficient that it runs on a standard laptop. This work represents a technical advance leading to performance gains that are useful in practice and contains a rigorous evaluation on large social media data sets. The key contributions of this article are to demonstrate that (1) using the Jaccard similarity of neighbourhood graphs provides a robust association metric between vertices of noisy social networks; (2) working with minhash signatures of the neighbourhood graph dramatically reduces the space and time requirements of the system with acceptable approximation error; (3) applying Locality Sensitive Hashing allows for approximate local community detection on very large graphs in real time with acceptable approximation error. For interactive and real-time community detection, we have demonstrated that our system finds higher quality communities in less time than the state-of-the-art algorithm operating under the constraints of a single machine.
Our work has clear applications for knowledge discovery processes that currently rely upon slow and expensive manual procedures, such as focus groups and telephone polling. In general, our system offers the potential for organisations to rapidly acquire knowledge of new territories and supplies an alternative monetisation scheme for data owners.

In this article, we focussed on digital social networks, but our method is applicable to all large networks including bipartite networks. The user-item bipartite networks that are studied in the field of recommender systems would be particularly amenable to this treatment, where items could be compactly modelled as minhash signatures of the users who have purchased them.

We leave two extensions for future work. Firstly, we treat the input social network as binary. In many settings, information is available to weight the edges. This might include message counts, the time since a connection was made or the type of connection. Efficient methods already exist for working with minhashes of weighted sets (Manasse and McSherry, 2008). Therefore, an interesting progression of this work is to incorporate data with edges that can contain counts, weights and categorisations. The second extension incorporates some of the latest developments in the theory of minhashing. b-bit minhashing and Odd Sketches provide two promising approaches to extend our system to even larger graphs (Li and König, 2009; Mitzenmacher et al., 2014). Both offer the best cost/benefit trade-off when sets are very similar (Jaccard similarity ≈ 1) or when sets contain most of the elements in the sample space. DSN data typically contains sets that are very small relative to the sample space (see footnote 7) and with Jaccard similarities ≪ 0.5. The strong theoretical bounds of these algorithms do not hold in these DSN-typical settings.
Therefore, a cost/benefit analysis similar to Section 6.1 would be required before implementing either in an extension.

Footnote 7. Our Twitter data has a sample space containing $7 \times 10^8$ elements with a typical set containing $10^4$ elements.

Acknowledgements

This work was partly funded by a Royal Commission for the Exhibition of 1851 Industrial Fellowship. The authors would like to thank Donal Simmie for his work on optimising the minhash generation procedure.

Appendix A. Efficient Minhash Generation

A naive Python implementation for generating minhash signatures requires six days to run on a desktop computer with 6 physical (12 logical) Intel i7 5930k @3.5GHz cores and 64GB of RAM. This is prohibitive for nightly updates and so we highly optimised this part of the code base. The code was ported to the Python-to-C bridge project, Cython, which allowed us to add type information and compiler directives to turn off array bounds checking (there is a large amount of array dereferencing). We stored the input matrices in contiguous memory and removed any superfluous code (logging, most inline error checking). The loops were then rewritten to be vectorised by the SIMD processor. The fully optimised Cython version of the minhash implementation runs (in parallel on 6 cores) in approximately one hour.

A.1 Crawling Social Networks

To optimise data throughput while remaining within the DSN rate limits we developed an asynchronous distributed network crawler using Python's Twisted library (Wysocki and Zabierowski, 2011). The crawler consists of a server responsible for token and work management and a distributed set of clients making HTTP requests to DSNs (see Figure 15).

Figure 15: A distributed asynchronous system for data acquisition from digital social networks.
The server contains a credential manager that holds access tokens in a queue and monitors the number of calls to each API. Once a token has been exhausted it is put to the back of the queue and locked until its refresh time. The server communicates over TCP with the clients, responding to requests for work and access tokens with account IDs and fresh access tokens or pause responses respectively. The clients make asynchronous requests to the DSNs, handling response codes, parsing and storing data.

A conventional program will block while waiting for an HTTP response. When the principal function of a program is to download data, blocking time amounts to the vast majority of the runtime. One solution is to run the program using multiple threads. However, for this application threads carry an unnecessary overhead and induce inefficiencies as data is naively moved between caches by the operating system. The asynchronous programming paradigm offers a superior alternative to explicit multi-threading. Asynchronous programming makes use of an event loop that constantly listens for new jobs and does not block while waiting for HTTP responses.

We originally implemented the system using an 80 MB shared fibre optic connection, but our downloads caused network blackouts. Therefore, we designed a distributed system that could be partially deployed in the cloud. The final system is depicted in Figure 15. The access tokens and account IDs to query (work) live on a server on our local network. Clients are deployed to Amazon's elastic cloud from where all interactions with DSN servers occur.

We configured the clients to establish persistent connections to the API endpoints. Every time a connection is opened, a handshake must occur. For secure systems (communicating over HTTPS), the handshake is particularly onerous, requiring the exchange of security certificates.
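The token queue and the non-blocking clients described in this appendix can be sketched in a single process as follows. This is only an illustration under several assumptions: it uses Python's standard `asyncio` library rather than Twisted, the server and clients share memory instead of communicating over TCP, `fetch` merely sleeps in place of a real HTTP request, and the rate limits (3 calls per 0.05 s window) and all names are invented for the example.

```python
import asyncio
import time
from collections import deque


class CredentialManager:
    """Holds access tokens in a queue and locks exhausted tokens until refresh."""

    def __init__(self, tokens, calls_per_window, window_seconds):
        self.calls_per_window = calls_per_window
        self.window_seconds = window_seconds
        self.queue = deque(
            {"token": t, "calls": 0, "locked_until": 0.0} for t in tokens)

    def acquire(self):
        """Return a usable token, or None (a 'pause' response) if all are locked."""
        now = time.monotonic()
        entry = self.queue[0]
        if entry["locked_until"] > now:
            # Exhausted tokens go to the back, so a locked front means all are locked.
            return None
        entry["calls"] += 1
        if entry["calls"] >= self.calls_per_window:
            entry["calls"] = 0
            entry["locked_until"] = now + self.window_seconds
            self.queue.rotate(-1)  # move the exhausted token to the back
        return entry["token"]


async def fetch(account_id, token):
    """Stand-in for a non-blocking HTTP request to a DSN API."""
    await asyncio.sleep(0.01)
    return {"account": account_id, "token": token}


async def client(work, manager, results):
    """Pull account ids until the work queue is empty."""
    while True:
        try:
            account_id = work.get_nowait()
        except asyncio.QueueEmpty:
            return
        token = manager.acquire()
        while token is None:  # pause response: wait for a token to refresh
            await asyncio.sleep(0.02)
            token = manager.acquire()
        results.append(await fetch(account_id, token))


async def crawl(account_ids, tokens, n_clients=4):
    work = asyncio.Queue()
    for account_id in account_ids:
        work.put_nowait(account_id)
    manager = CredentialManager(tokens, calls_per_window=3, window_seconds=0.05)
    results = []
    await asyncio.gather(*(client(work, manager, results) for _ in range(n_clients)))
    return results


results = asyncio.run(crawl(range(20), ["tokA", "tokB"]))
print(len(results), "accounts fetched")
```

While one client is waiting on `fetch` or on a token refresh, the event loop runs the others, so no thread is blocked on I/O.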
Appendix B. Community Axioms

Homophily only applies to attributes that ease information flow between individuals. Some attributes have no effect or are divisive (for instance right-handed people feel no sense of kinship) and so should not be associated with communities. Additionally, attributes may be at the wrong scale to describe structural sub-units (sports person rather than footballer). A community evaluation based on ground-truth groups that were not communities would have no value.

We apply community goodness functions to each prospective ‘tag community’ to identify to what extent these functional traits generate structurally observable communities. For each functional group we generate the fully connected weighted graph by calculating all pairwise Jaccard similarities and evaluate the six metrics in Table 3. They are adapted from Yang and Leskovec (2013) to apply to weighted graphs. As we work with a derived graph where each edge weight is the Jaccard similarity of neighbourhoods, the metrics have slightly different interpretations. Two entities in the derived graph are strongly connected if they have very similar neighbourhoods. Since for the large entities the neighbourhood normally has at least an order of magnitude more incoming than outgoing edges, entities are closely related if they have a similar fan/follower base. We define $S$ to be the set of vertices comprising a community and a weighted graph $G(V, E, W)$ where $W$ is a weight matrix. The internal edge weight of $S$ is

$$m_s = \sum_{\{i, j \in S\}} W_{i,j} \tag{18}$$

and the weight of edges that cross the boundary of $S$ is

$$c_s = \sum_{\{i \in S,\, j \notin S\}} W_{i,j}. \tag{19}$$

The community goodness metrics are then given by:

• Clustering exploits the idea that people in communities are likely to introduce their friends to each other. It measures how cliquey a community is.
In our paradigm clustering is high if follo wers of a communit y recommend things for other follow ers of the communit y to lik e or follow. If a v ertex has k n neigh b ours then 1 2 k n ( k n − 1) p ossible connections can exist b et ween the neighbours. The clustering of a no de gives the fraction of its neigh b ours’ p ossible connections that exist. The clustering of a comm unity is the a v erage clustering of each v ertex. Clustering is sometimes referred to as the prop ortion of triadic closures in the netw ork. The weigh ted clustering of the i th v ertex is giv en b y C l i ( S ) = W 3 s ( W s W max W s ) ii (20) where W max is a matrix where eac h entry is the maxim um w eight found within S (Holme et al., 2007). • Conductance is an electrical analogy for ho w easily information entering the com- m unity can lea ve it. In our context, it is defined as C on ( S ) = c s 2 m s + c s , (21) i.e., it is the ratio of the comm unity’s external to total edge w eigh t. A lo w v alue means that the the communit y is w ell separated from the rest of the netw ork. In our paradigm, conductance is low if the follo w ers of the communit y are not interested in other comm unities. • Cohesiv eness measures ho w easily the comm unit y can b e split into disconnected comp onen ts. A go od comm unit y is not easily broken up. The cohesiveness is given b y the minim um conductance of an y sub-communit y . A lo w v alue indicates a bad comm unity as there is at least one well-separated sub-communit y . In our paradigm, lo w cohesiveness corresp onds to mem b ers of the communit y having distinct, non- o verlapping follow er groups. C oh ( S ) = min { S 0 ⊂ S } C on ( S 0 ) (22) Iterating through all subsets S 0 of S is impractical. Th us, w e sample S 0 b y randomly selecting 10 subsets of starting vertices, running PPR communit y detection for eac h and taking a sw eep through the PageRank vector to find the minim um conductance cut. 
• Conductance Ratio (CR) is the ratio of conductance to cohesiveness and is defined as

$$ \mathrm{CR}(S) = \frac{\mathrm{Con}(S)}{\mathrm{Coh}(S)}. \tag{23} $$

A large number indicates that the community could be broken up into structural sub-units.

• Density is given by the ratio of the community's total internal edge weight to the maximum possible if every edge was present with weight one:

$$ \mathrm{Den} = \frac{2 m_s}{n_s (n_s - 1)} \tag{24} $$

where $n_s$ is the number of vertices in $S$. A high number indicates a highly interconnected community. In our paradigm, this corresponds to a community with a well-defined follower base that is interested in most community members.

• Separability measures how well the community is separated from the rest of the network. It is the ratio of internal to external edge weight and so is closely related to conductance:

$$ \mathrm{Sep}(S) = \frac{m_s}{c_s} \tag{25} $$

In our paradigm, a high value indicates that followers of the community are not interested in much else.

References

M Adler and M Mitzenmacher. Towards compressing web graphs. Proceedings of the Data Compression Conference, pages 203–212, 2001.

A Andoni, P Indyk, HL Nguyen and I Razenshteyn. Beyond locality-sensitive hashing. Proceedings of the 25th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1018–1028, 2014. ISSN 9781611973389. doi: 10.1137/1.9781611973402.76. URL http://arxiv.org/abs/1306.1547.

B Bahmani, K Chakrabarti, and D Xin. Fast personalized pagerank on mapreduce. Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 973–984, 2011. URL http://dl.acm.org/citation.cfm?id=1989425.

M Bastian, S Heymann, and M Jacomy. Gephi: an open source software for exploring and manipulating networks. Proceedings of the 3rd International AAAI Conference on Weblogs and Social Media, pages 361–362, 2009. ISSN 14753898. doi: 10.1136/qshc.2004.010033.

VD Blondel, JL Guillaume, R Lambiotte, and E Lefebvre.
Fast unfolding of community hierarchies in large networks. Journal of Statistical Mechanics: Theory and Experiment, page 10008, 2008. doi: 10.1088/1742-5468/2008/10/P10008.

P Boldi and S Vigna. The webgraph framework I: compression techniques. Proceedings of the 13th International Conference on the World Wide Web, ACM, 2004. URL http://dl.acm.org/citation.cfm?id=988752.

AZ Broder, M Charikar, AM Frieze, and M Mitzenmacher. Min-wise independent permutations. Journal of Computer and System Sciences, 60(3):630–659, 2000.

AZ Broder. On the resemblance and containment of documents. IEEE Proceedings of Compression and Complexity of Sequences, pages 21–29, 1997. ISSN 0818681322. doi: 10.1109/SEQUEN.1997.666900.

E Bullmore and O Sporns. Complex brain networks: graph theoretical analysis of structural and functional systems. Nature Reviews Neuroscience, 10.3, pages 186–198, 2009. URL http://www.nature.com/nrn/journal/v10/n3/abs/nrn2575.html.

MoS Charikar. Similarity estimation techniques from rounding algorithms. In Proceedings of the 34th Annual ACM Symposium on the Theory of Computing (STOC), pages 380–388, 2002. ISBN 1581134959. doi: 10.1145/509907.509965. URL http://portal.acm.org/citation.cfm?doid=509907.509965.

R Chen, X Weng, B He, and M Yang. Large graph processing in the cloud. Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, 2010. URL http://dl.acm.org/citation.cfm?id=1807297.

F Chierichetti, R Kumar, and S Lattanzi. On compressing social networks. Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009. URL http://www.se.cuhk.edu.hk/~hcheng/seg5010/slides/compress_kdd2009.pdf.

A Clauset. Finding local community structure in networks. Physical Review E, 72.2, pages 026132, 2005. URL http://journals.aps.org/pre/abstract/10.1103/PhysRevE.72.026132.

GW Flake, S Lawrence, and CL Giles. Efficient identification of Web communities. Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 150–160, 2000. doi: 10.1145/347090.347121. URL http://portal.acm.org/citation.cfm?doid=347090.347121.

S Fortunato and M Barthelemy. Resolution limit in community detection. Proceedings of the National Academy of Sciences, 104.1, pages 36–41, 2007. URL http://www.pnas.org/content/104/1/36.short.

S Fortunato. Community detection in graphs. Physics Reports, 486(3):75–174, 2010.

DF Gleich and C Seshadhri. Vertex neighborhoods, low conductance cuts, and good seeds for local community methods. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 597–605, 2012.

P Gupta, A Goel, J Lin, A Sharma, D Wang, and R Zadeh. WTF: The who to follow service at Twitter. Proceedings of the 22nd International Conference on the World Wide Web, pages 505–514, 2013. URL http://dl.acm.org/citation.cfm?id=2488388.2488433.

T Haveliwala, A Gionis, and P Indyk. Scalable techniques for clustering the Web. The 3rd International Workshop on the Web and Databases, 2000. URL http://ilpubs.stanford.edu:8090/445/.

T Haveliwala. Topic-sensitive pagerank. Proceedings of the 11th International Conference on the World Wide Web, pages 517–526, 2002. ISSN 08963207. doi: 10.1145/511446.511513. URL http://doi.acm.org/10.1145/511446.511513.

TT Hewett, R Baecker, S Card, and T Carey. ACM SIGCHI curricula for human-computer interaction. ACM, 1992. URL http://dl.acm.org/citation.cfm?id=2594128.

P Holme, SM Park, BJ Kim, and CR Edling. Korean university life in a network perspective: Dynamics of a large affiliation network. Physica A: Statistical Mechanics and its Applications, 373:821–830, 2007. ISSN 03784371. doi: 10.1016/j.physa.2006.04.066.

P Indyk and R Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. Proceedings of the 30th Annual ACM Symposium on the Theory of Computing, 126:604–613, 1998. ISSN 00123692. doi: 10.4086/toc.2012.v008a014. URL http://dl.acm.org/citation.cfm?id=276876.

I Kloumann and J Kleinberg. Community membership identification from small seed sets. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1366–1375, ACM, 2014. URL http://dl.acm.org/citation.cfm?id=2623621.

A Kyrola, G Blelloch, and C Guestrin. Graphchi: Large-scale graph computation on just a PC. Symposium on Operating Systems Design and Implementation (OSDI), 2012. URL https://www.usenix.org/conference/osdi12/technical-sessions/presentation/kyrola.

A Lancichinetti, S Fortunato, and J Kertész. Detecting the overlapping and hierarchical community structure in complex networks. New Journal of Physics, 11.3(033015), 2009. URL http://iopscience.iop.org/1367-2630/11/3/033015.

J Leskovec, KJ Lang, and MW Mahoney. Empirical comparison of algorithms for network community detection. Proceedings of the 19th International Conference on the World Wide Web, pages 631–640, 4 2010.

Y Li, K He, D Bindel and JE Hopcroft. Uncovering the small community structure in large networks: A local spectral approach. In Proceedings of the 24th International Conference on the World Wide Web, pages 658–668, 2015.

P Li and C König. b-Bit minwise hashing. Proceedings of the 19th International Conference on the World Wide Web, ACM, 2009. URL http://dl.acm.org/citation.cfm?id=1772759.

Y Low, JE Gonzalez, and A Kyrola. Graphlab: A new framework for parallel machine learning. arXiv preprint arXiv, 2014.

D Lusseau. The emergent properties of a dolphin social network. Proceedings of the Royal Society of London B: Biological Sciences, 270(Suppl2), pages S186–S188, 2003.

G Malewicz, MH Austern, AJC Bik, JC Dehnert, I Horn, N Leiser, and G Czajkowski. Pregel: a system for large-scale graph processing. In Proceedings of the International Conference on Management of data (SIGMOD), pages 135–146, 2010. ISBN 9781450300322. doi: 10.1145/1807167.1807184. URL http://dl.acm.org/citation.cfm?id=1807167.1807184.

M Manasse and F Mcsherry. Consistent weighted sampling. Technical report, 2010. URL http://research.microsoft.com/pubs/132309/ConsistentWeightedSampling2.pdf.

M McPherson, L Smith-Lovin, and JM Cook. Birds of a feather: homophily in social networks. Annual Review of Sociology, 27(1):415–444, 8 2001. ISSN 0360-0572. doi: 10.1146/annurev.soc.27.1.415. URL http://www.annualreviews.org/doi/abs/10.1146/annurev.soc.27.1.415?journalCode=soc.

M Mitzenmacher, R Pagh, and N Pham. Efficient estimation for high similarities using odd sketches. Proceedings of the 23rd International Conference on the World Wide Web, pages 109–118, 2014. URL http://dl.acm.org/citation.cfm?id=2568017.

R Motwani, A Naor, and R Panigrahy. Lower bounds on locality sensitive hashing. SIAM Journal on Discrete Mathematics, pages 930–935, 2005. ISSN 0895-4801. doi: 10.1137/050646858.

MEJ Newman. Finding community structure in networks using the eigenvectors of matrices. Physical Review E - Statistical, Nonlinear, and Soft Matter Physics, 74(3):1–19, 2006. ISSN 15393755. doi: 10.1103/PhysRevE.74.036104.

MEJ Newman. The structure and function of complex networks. SIAM Review, 45.2, pages 167–256, 2003. URL http://epubs.siam.org/doi/abs/10.1137/S003614450342480.

MEJ Newman. Fast algorithm for detecting community structure in networks. Physical Review E, 69(6)(066133), 2004. URL http://journals.aps.org/pre/abstract/10.1103/PhysRevE.69.066133.

MEJ Newman. Detecting community structure in networks. The European Physical Journal B-Condensed Matter and Complex Systems, 38.2, pages 321–330, 2004.

R O'Donnell, Y Wu, and Y Zhou. Optimal lower bounds for locality sensitive hashing (except when q is tiny). ACM Transactions on Computation Theory (TOCT), 6(1):9, 2009. ISSN 19423462. doi: 10.1145/2578221.

MF Pace. BSP Vs MapReduce. Procedia Computer Science 9, pages 246–255, 2012. doi: 10.1016/j.procs.2012.04.026.

L Page, S Brin, R Motwani, and T Winograd. The citation ranking: bringing order to the Web, 1998. URL http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf.

J Philbin. Near duplicate image detection: min-Hash and tf-idf weighting. Proceedings of the British Machine Vision Conference, 3:4, 2008. ISSN 10959203. doi: 10.5244/C.22.50.

P Pons and M Latapy. Computing communities in large networks using random walks. Computer and Information Sciences-ISCIS, pages 284–293, 2005. URL http://arxiv.org/abs/physics/0512106.

UN Raghavan, R Albert, and S Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3):036106, 9 2007. ISSN 1539-3755. doi: 10.1103/PhysRevE.76.036106.

J Reichardt and S Bornholdt. Statistical mechanics of community detection. Physical Review E - Statistical, Nonlinear, and Soft Matter Physics, 74, 2006. ISSN 15393755. doi: 10.1103/PhysRevE.74.016110.

M Rosvall and CT Bergstrom. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105.4:1118–1123, 2008. URL http://www.pnas.org/content/105/4/1118.short.

SF Sampson. Crisis in a cloister. PhD thesis, Cornell University, Ithaca, 1969.

SE Schaeffer. Graph clustering. In Computer Science Review, 1(1):27–64, 2007.

R Wysocki and W Zabierowski. Twisted framework on game server example. Proceedings of the 11th International Conference on the Experience of Designing and Application of CAD Systems in Microelectronics (CADSM), pages 361–363, 2011.

J Yang and J Leskovec. Defining and evaluating network communities based on ground-truth. Knowledge and Information Systems, pages 181–213, 2015. URL http://dl.acm.org/citation.cfm?id=2350193.

J Yang and J Leskovec. Overlapping community detection at scale: a nonnegative matrix factorization approach. In Proceedings of the 6th ACM International Conference on Web Search and Data Mining, pages 587–596. ACM, 2013. ISBN 9781450318693. doi: 10.1145/2433396.2433471. URL http://dl.acm.org/citation.cfm?id=2433396.2433471.

W Zachary. An information flow model for conflict and fission in small groups. Journal of Anthropological Research, pages 452–473, 1977. URL http://www.jstor.org/stable/3629752.
