Finding Online Extremists in Social Networks

Online extremists in social networks pose a new form of threat to the general public. These extremists range from cyberbullies who harass innocent users to terrorist organizations such as the Islamic State of Iraq and Syria (ISIS) that use social net…

Authors: Jytte Klausen, Christopher Marks, Tauhid Zaman

Jytte Klausen
Brandeis University, 415 South Street, Waltham, MA 02453, klausen@brandeis.edu

Christopher E. Marks
Operations Research Center, Massachusetts Institute of Technology; Charles Stark Draper Laboratory, 555 Technology Square, Cambridge, MA 02139, cemarks@mit.edu

Tauhid Zaman
Sloan School of Management, Massachusetts Institute of Technology, 77 Massachusetts Ave., Cambridge, MA 02139, zlisto@mit.edu

Online extremists in social networks pose a new form of threat to the general public. These extremists range from cyberbullies who harass innocent users to terrorist organizations such as the Islamic State of Iraq and Syria (ISIS) that use social networks to recruit and incite violence. Currently social networks suspend the accounts of such extremists in response to user complaints. The challenge is that these extremist users simply create new accounts and continue their activities. In this work we present a new set of operational capabilities to deal with the threat posed by online extremists in social networks. Using data from several hundred thousand extremist accounts on Twitter, we develop a behavioral model for these users, in particular what their accounts look like and who they connect with. This model is used to identify new extremist accounts by predicting if they will be suspended for extremist activity. We also use this model to track existing extremist users as they create new accounts by identifying if two accounts belong to the same user. Finally, we present a model for searching the social network to efficiently find suspended users’ new accounts based on a variant of the classic Polya’s urn setup. We find a simple characterization of the optimal search policy for this model under fairly general conditions. Our urn model and main theoretical results generalize easily to search problems in other fields.
Key words: social media, social networks, online extremism, Polya’s urn, network search

1. Introduction
In recent years there has been a huge increase in the number and size of online extremist groups using social networks to harass users, recruit new members, and incite violence. These groups include terrorist organizations such as the Islamic State of Iraq and Syria (ISIS) [4], white nationalists and Nazi sympathizers [28], and cyberbullies who target individuals with offensive and harassing messages [8]. Of particular concern is the danger posed to public safety by terrorist groups. The threat from terrorist groups such as ISIS has become so severe that U.S. president Barack Obama recently said “The United States will continue to do our part, by working with partners to counter ISIL’s¹ hateful propaganda, especially online” [15]. It is suspected that the online presence of ISIS may have been responsible for radicalizing individuals and motivating them to commit acts of terror [7].

Social networks have recently begun taking actions to actively combat online extremists. For instance, Twitter, which has become the main venue for ISIS users to spread their propaganda [15], has been very aggressive in its response to ISIS. In August 2016, Twitter reported that it had shut down over 360,000 ISIS accounts and that its daily suspensions of terrorism-linked accounts had jumped 80 percent since 2015 [1]. Twitter identifies extremist accounts primarily based on reports from its users, but it has begun using proprietary spam-fighting tools to supplement these reports. These tools have helped to automatically identify more than one third of the accounts that were ultimately suspended for promoting terrorism on Twitter [29].
The efforts of social networks such as Twitter have been effective at limiting the reach of online extremist groups such as ISIS. However, not all extremist users are shut down and they are constantly returning to the social network after being suspended. In addition, much of the success in mitigating the threats of extremist groups has relied upon the cooperation of the social networks themselves. For instance, Twitter has dedicated teams to review user reports of potential extremist accounts [1]. However, if extremist users migrate to other social networks, there is no guarantee that the companies which operate these networks will be as cooperative or dedicate as many resources to dealing with online extremists. Therefore, what is needed is a set of capabilities that can be used by authorities to combat online extremists, that does not rely upon the cooperation of the social network operators, and that can be applied to any social network.

1.1. Our Contributions
The case of ISIS in Twitter is useful to understand general behavioral patterns of online extremist users in social networks. We use these behaviors to guide the development of capabilities for combating online extremists in general social networks. We provide a detailed analysis of these behaviors and develop the corresponding capabilities in Sections 3, 4, 5, and 6. Here we will provide a concise overview of our major contributions, in particular the different behavioral patterns of online extremists and the corresponding capabilities we develop.

¹ ISIL is another name for ISIS and stands for the Islamic State of Iraq and the Levant.

Suspensions. Online extremist users post content which violates the Terms of Use of social networks, leading to the suspension of their accounts.
These suspensions occur in response to user reports, but many social networks are beginning to use algorithms to automatically detect any violative content. Going one step further, it would be useful to have a capability to flag users as potential extremists before they post any content at all. There are potential features of an account that may predict if it belongs to an extremist user. For instance, the account may not publicly declare its geographical location. Also, the users to which the account connects may indicate whether or not the account belongs to an extremist user. In Section 3 we use these intuitions to develop a method to automatically predict if an account will be suspended without requiring it to post any content.

Creating Multiple Accounts. After being suspended, online extremist users will quickly create new accounts and continue their activities on the social network. This makes it difficult to keep an extremist user off the social network. Typically the new account resembles the suspended account in several aspects. For instance, the names and profile pictures may be very similar. A useful capability would be the ability to identify if multiple accounts belong to the same user. This would allow for more accurate monitoring and tracking of extremist users. We develop such a capability in Section 4.

Refollowing Previous Friends. A user in a social network generally follows a set of users. In Twitter these followed users are referred to as the friends of the user and the user is referred to as their follower. Upon returning to the social network after being suspended, an extremist user will generally refollow some of his previous friends. If we knew which previous friends a suspended user refollows, this information could be used to find the user’s new account in the social network.
There may be features of the friends which make it more likely the suspended user will refollow them. In Section 5 we use these features to develop a method to predict who suspended users refollow.

Suspended User Search. Authorities may wish to find suspended users when they return to a social network. The operator of the social network is notified every time a new user enters the network and can use our account matching capability to see if the new user matches a previously suspended user. However, if one is not the operator of the social network, then one must search the network to see if the suspended user has returned. Because of the size of the social network, this search could require a large amount of time and resources. To overcome this challenge, we develop an efficient network search policy in Section 6 based on a variant of a Polya’s urn model which utilizes our refollowing prediction capability from Section 5.

The remainder of this paper is organized as follows. We review the extant literature relevant to our work in Section 1.2. We provide a detailed overview of the data used for our analysis in Section 2. Section 3 presents our predicting suspensions capability. We present our account matching capability in Section 4. Our method for predicting refollowing is presented in Section 5. Section 6 details our model for network search and an optimal search policy. We conclude in Section 7.

1.2. Previous Work
Analysis of Online Extremist Networks. There are several studies focused on ISIS users in social networks. One of the first studies characterizes the number, behavioral traits, and organization of Twitter ISIS users [4]. A subsequent study by the same authors found that the reach of ISIS had been limited by the beginning of 2016 due to the efforts of Twitter to suspend ISIS accounts [5].
In [16] the authors study the dynamics of ISIS users in the Russian social network VKontakte and suggest that shutting down smaller pro-ISIS groups can prevent the emergence of larger, more influential groups. In [14] the authors develop models to predict which users will be suspended for being in ISIS, who will retweet ISIS content, and who will interact with ISIS users. This work is similar to ours, but the authors do not study many of the capabilities we develop, such as identifying multiple accounts from a single user, refollowing old friends, or searching for suspended users.

There have also been several works looking at identifying extremist content in groups beyond ISIS. In [24] the authors develop methods to automatically classify content that is used for recruiting members to extremist groups. Similar work in [27] used machine learning to detect content that promotes hate and extremism. Machine learning methods have also been used to detect cyberbullies based on the content they post [23, 12]. The work in [13] builds upon this work to develop an approach for mitigating the threat of cyberbullying.

Spam/Bot Detection. Closely related to our capabilities on predicting suspensions is the work done on detecting online bots (non-human users) or malicious users. Several approaches have been developed which use different types of behavioral features. The type of content (URLs, user mentions) was found to be predictive of Twitter bots in [20]. Temporal behavior and aggregate network properties (in-degree, out-degree) were used to identify Twitter bots in [9]. In [11] the authors demonstrate that the sentiment of the posted content can be used to identify bots. All of these approaches are designed to detect automated behavior. However, they may not be as effective for human users who engage in extremist behavior.
Also, many of these approaches require the user to post some content in order to detect whether or not they are bots. An approach related to ours is in [19], which relies purely on network structure to identify malicious users in social networks where edges have a polarity (friend/enemy). In contrast to this extant work, our approach combines behavioral features with refined network features to detect extremist users.

Network Search. Our network search problem is similar to those presented by Alpern and Lidbetter [3] and Dagan and Gal [10], who have done much work in this area. Unlike their work, in which the searcher and the target are assumed to be operating in a physical network, our problem of searching a social network admits a different set of search constraints. In our network search problem, the searcher is not constrained to move along edges. Instead, the searcher can examine the neighbors of any of a set of nodes that are known to him, but each of these queries comes at a cost. This alternative representation of network search follows from one of the original search problems posed by Black [6], in which a searcher looks for the search target among a set of possible locations. Each location has a known probability of containing the target and a known probability of finding the target, if it is there. Our network search application adapts this simple search model to a network setting. Instead of limiting the search target to be in at most one of a set of possible locations, in a network search the target could be connected to more than one of the nodes known to the searcher. Also, the method of querying the neighbors of a node causes the probability of finding the target to change with each observation. Our network search model builds directly on the multi-urn search model presented in [21].
However, the major difference in this work is that we allow for more than one query to be done in each step, which results in slight differences in the optimal policies.

2. Data
The data we study in this work comes from the micro-blogging site Twitter [31]. Twitter serves as a front line public platform used by ISIS for outreach and recruitment. ISIS’s presence on Twitter, and its consistent success at gaining support and recruits through the social media site, has been deeply analyzed and well-documented [4].

Twitter users form a social network by connecting to each other. This network is directed and this directionality dictates the flow of information. A user forms a connection with someone on Twitter by following him or her. Each account a user is following is known as this user’s friend and the user is known as the friend’s follower. These friend/follower edges form the Twitter social network. This network is then used to transmit information. In Twitter, this information comes in the form of short messages that users post, known as tweets. When a user posts a tweet, that tweet appears in the Twitter timeline of all the user’s followers. In this manner, information flows from users to their followers in Twitter.

For this research, we collected Twitter data from approximately 5,000 “seed” users, who were either known ISIS members or who were connected to many known ISIS members as friends or followers. The names of these seed users were obtained through news stories, blogs, and reports released by law enforcement agencies and think tanks [18]. The data was collected at various times throughout the calendar year 2015, using Twitter’s REST API (see [30]).
For each seed user we collected the user account profile information, including the screen name, name, description, location, profile picture, and profile banner at the time of the collection. We also obtained the user account ID number, which is the only unchangeable unique account identifier. In addition to obtaining seed users’ profile information, we collected the same set of profile information for each seed user’s friends and followers. As a result the number of user profiles contained in the data set grew to over 1.3 million.

We downloaded all publicly available tweets from each seed user’s timeline at the time of collection. For each tweet we obtained the unique tweet ID assigned by Twitter, the tweet text, the time of the post, all hashtags, user mentions, URLs, and images contained in the tweet, and whether the tweet was a retweet of or reply to another tweet. The total number of tweets in our data is approximately 4.8 million.

Finally, we tracked many of the accounts for several months in 2015 in order to see if they were ever suspended. We do not know the reason for suspension, but given that these accounts were associated with known ISIS users, we assume the suspension was related to some form of extremist propaganda that violated Twitter’s user agreement. We tracked all of the user accounts collected in June 2015, including the seed accounts and their friends’ and followers’ accounts. This data set includes 646,961 accounts in total, of which 35,080 (or 5.4%) had been suspended as of September 23, 2015.

3. Predicting Account Suspensions
The first capability we develop to combat online extremism is to predict which accounts belong to new extremist users. In this section we develop an approach to this using logistic regression. We label any account in our data set as extremist if it was suspended by Twitter.
Therefore, to detect extremists we predict which accounts are suspended by Twitter. We accomplish this using a logistic regression model based upon features of the user accounts. We provide out-of-sample performance evaluation of the model and provide insights on what factors might be useful in predicting whether a Twitter user is going to be suspended for violative behavior.

To train, validate, and test this prediction model we use a subset of the accounts whose suspension status we tracked. We randomly selected two non-overlapping samples of this data set, each consisting of 5,000 accounts and maintaining the 5.4% suspension rate, which is the overall suspension rate of these accounts. These data sets were used for training and validation.

For our logistic regression model, the response variable is whether the account was still active as of September 23, 2015. The predictors are obtained from the wide array of information associated with the user accounts. Some of these relate to the account itself, while others have to do with the network connections of the account. The variables used as predictors for our model are listed in Table 1.

Table 1: Features for predicting Twitter suspensions.

Feature type | Feature
-------------|--------
Network | Following each of 2,376 active ISIS seed accounts in our data (2,376 binary variables).
Account | Date and time the account was created (numeric).
Account | Number of “friends” and “followers” connected to the account (2 numeric variables).
Account | Number of tweets from the account (numeric).
Account | Geo-location enabled (binary).
Account | “Protected” account (posts are not visible to the public) (binary).
Account | Verified account (identity confirmed by Twitter) (binary).
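As an illustration of how the predictors listed in Table 1 could be assembled into one numeric vector per account, the following Python sketch may help. The `Account` class, its field names, and the `feature_vector` helper are hypothetical names introduced for this example, and encoding the creation date as a Unix timestamp is an assumption; the paper does not specify its encoding.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Account:
    """Hypothetical container for the account-level fields in Table 1."""
    created_at: datetime      # date and time the account was created
    friends_count: int        # number of accounts this account follows
    followers_count: int      # number of accounts following this account
    tweet_count: int          # number of tweets posted
    geo_enabled: bool         # geo-location enabled
    protected: bool           # posts not visible to the public
    verified: bool            # identity confirmed by Twitter
    friend_ids: frozenset     # IDs of the accounts this account follows


def feature_vector(account, seed_ids):
    """Return [one follow indicator per seed account] + [7 account features]."""
    # Network features: 1.0 if the account follows the given seed account.
    follows = [1.0 if seed in account.friend_ids else 0.0 for seed in seed_ids]
    # Account features (assumed encoding: timestamp, counts, 0/1 flags).
    profile = [
        account.created_at.timestamp(),
        float(account.friends_count),
        float(account.followers_count),
        float(account.tweet_count),
        1.0 if account.geo_enabled else 0.0,
        1.0 if account.protected else 0.0,
        1.0 if account.verified else 0.0,
    ]
    return follows + profile
```

With the 2,376 seed accounts described in Table 1, this layout would give 2,376 + 7 = 2,383 predictors per account.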
While we have observed that the number of screen name changes associated with a user account might serve as a good predictor of future suspension, we assume that this information is not necessarily known for an arbitrary account we wish to classify. Similarly, we assume we do not know if the account was following accounts that were suspended in the past. All features we use are what could be measured for a new account that has not been seen before.

3.1. Results
We fit a logistic regression model with $L_1$-norm regularization to the training data. From validation, we find that setting the regularization constant to 10 consistently provided near-optimal performance. The resulting coefficient estimates were nonzero for 89 of the predictor variables, of which 81 corresponded to following certain accounts. The signs and magnitudes of the coefficients give us some idea of the effects of some of the predictor variables. The coefficient estimates indicate that accounts that had enabled geo-tagging and accounts that had Twitter-verified owners were much less likely to be suspended. This is not surprising given that we expect online extremists to want to mask their identity and location.

The effects of friendships were less intuitive and difficult to interpret. In total we found that 38 accounts had a positive sign and 43 had a negative sign. However, there was no clear pattern that we could find among the positive sign accounts or negative sign accounts. More detailed analysis may reveal what made following these accounts increase or decrease the likelihood of suspension. Nonetheless, just knowing the value of the regression coefficient was sufficient to predict suspensions.

Figure 1 shows the receiver operating characteristic (ROC) curve on the validation set and on the test data, which comprised the 636,961 accounts not used for training and validation.
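To make the $L_1$-regularized fit described in this subsection more concrete, here is a minimal pure-Python sketch that minimizes the average logistic loss plus `lam` times the $L_1$ norm of the weights using proximal gradient descent (soft thresholding). The paper does not say which solver or loss scaling was used, so the optimizer, the function names, the step size, and the meaning of `lam` are all assumptions; in particular, `lam` need not correspond to the regularization constant of 10 reported above.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(value, threshold):
    """Proximal operator of the L1 norm: shrink toward zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def predict_proba(weights, bias, x):
    """Predicted probability of the positive class (here: suspension)."""
    return sigmoid(bias + sum(w * xj for w, xj in zip(weights, x)))


def fit_l1_logistic(X, y, lam, step=0.1, iters=2000):
    """Proximal gradient descent; the intercept is not penalized."""
    n, d = len(X), len(X[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(iters):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(weights, bias, xi) - yi
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * xi[j] / n
        bias -= step * grad_b
        # Gradient step on the smooth loss, then shrink each weight.
        weights = [soft_threshold(w - step * g, step * lam)
                   for w, g in zip(weights, grad_w)]
    return weights, bias
```

The soft-thresholding step is what sets uninformative coefficients exactly to zero, which is consistent with only 89 of the predictors having nonzero estimates in the fitted model.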
The area under the curve (AUC) on the test data is approximately 0.83. We can see from the curve that we can detect about 60% of suspended users in the test set with only a 10% “false positive” rate. This operating point is obtained by setting a classification probability threshold of 0.1, i.e., classifying an account as an ISIS account if the regression function assigns a probability of suspension higher than this threshold. It is important to note that because 94.6% of the accounts in our data were not suspended, a 10% false positive rate represents a greater number of false positive classifications than true detections using this logistic regression model. However, it is also important to consider that accounts that have not been suspended could still be suspended in the future, or that some accounts in our data should be suspended but have succeeded in avoiding detection.

Figure 1: ROC curve for the regularized logistic regression classifier for Twitter suspensions (AUC: 0.83).

Table 2: Summary of sampled accounts from those incorrectly classified as suspensions using the regularized logistic regression model.

Screen Name | Summary
------------|--------
@abdulnagi313 | Few tweets, difficult to discern nature of account.
@445468a7e3fc45c | Very few tweets, user apparently follows ISIS activity and members on Twitter; possibly conducting research or surveillance.
@613780 | Tweets Quranic verses in Arabic every few hours in consistent format; likely a Twitter bot.
@aarishmajeed | Account with no tweets following three ISIS-related media accounts.
@men9174 | Arabic-language pornography account followed by one of our seed accounts; following many other pornographic accounts.
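The observation that false positives outnumber true detections at this operating point follows from simple base-rate arithmetic, which the Python sketch below works through. It assumes the 636,961 test accounts have the same 5.4% suspension rate as the full data set; `expected_confusion` is a hypothetical helper name, and the results are expected counts, not figures reported in the paper.

```python
def expected_confusion(n_accounts, base_rate, tpr, fpr):
    """Expected true positives, false positives, and precision.

    base_rate: fraction of accounts that were suspended
    tpr: fraction of suspended accounts that are flagged
    fpr: fraction of non-suspended accounts that are flagged
    """
    positives = n_accounts * base_rate
    negatives = n_accounts - positives
    true_pos = tpr * positives
    false_pos = fpr * negatives
    precision = true_pos / (true_pos + false_pos)
    return true_pos, false_pos, precision


# Operating point quoted in the text: 60% detection at a 10% false positive rate.
tp, fp, precision = expected_confusion(636_961, 0.054, 0.60, 0.10)
print(f"expected true detections: {tp:,.0f}")
print(f"expected false positives: {fp:,.0f}")
print(f"implied precision:        {precision:.1%}")
```

Under these assumptions, roughly three accounts are flagged incorrectly for every correctly flagged one, so only about a quarter of flagged accounts would have been suspended by the end of the tracking period.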
Sampling from the false positives resulting from this classification returned some accounts that were clearly ISIS supporters, supporting this notion that many accounts should be or soon would be suspended. Many of these “false positives,” however, were ISIS researchers, media, or otherwise difficult to discern. Table 2 provides a summary of five randomly selected false positives found in our test data, when applying the classification probability threshold of 0.1.

The inclusion of the pornographic account @men9174 as a false positive is interesting and concerning. Investigation reveals that this account is not following any of our ISIS seed accounts. Our model classified this account with a probability of 0.101, very near our threshold, based primarily on its profile features.

4. Detecting Multiple Accounts
Now that we have a model for predicting suspensions, the next question we address is whether we can automatically determine whether two accounts belong to the same user. This question is relevant because we have observed many cases in which a user simply creates a new account after being suspended. We have even found ISIS accounts dedicated to the purpose of broadcasting suspended users’ new accounts to ISIS members and supporters. By detecting multiple accounts belonging to the same user, one can prevent extremist users from restarting their violative behaviors by creating new accounts and effectively keep them suspended from the social network.

Twitter profiles essentially serve as avatars; the syntax and pictures provide cues about the identity of the account holder. This is true for ISIS users as well and is intrinsic to the tactic behind the ISIS-based networks directed at recruitment.
As a result, when a suspended user opens a new account in Twitter, we have been able to identify it by comparing the names, images, screen names, and descriptions associated with each account. This behavior is intuitive: these newly created accounts include cues to permit the suspended user’s followers to identify and re-follow the recreated accounts. We have found many examples of this predictable reiteration of account profile features in our data.

4.1. Suspended User Behavior
In addition to regular account suspensions, we also observed that known ISIS users in our data set changed their screen names regularly. We hypothesize that frequent screen name changes provide a means of avoiding tracking and detection, while retaining account information, friend and follower connections, and Twitter posts. We also note that accounts that exhibit multiple screen name changes had higher suspension rates, which could mean that users are changing their screen names to avoid suspension.

Table 3 provides a timeline of screen name and name changes for two such accounts, purportedly belonging to British citizen Sally Jones, who had adopted the online alias “Umm Hussain al-Britani” [25]. Sally Jones and her husband, Junaid Hussain, achieved celebrity status in ISIS, primarily due to Junaid Hussain’s role in creating and leading the “CyberCaliphate,” as well as his previous involvement in the “Team Poison” hacking group. Junaid Hussain was killed in a US airstrike in August 2015 [2]. The timeline in Table 3 was reconstructed from observed tweets, but the tweets from both of these accounts are no longer available due to account suspensions. We note that the screen name changes became much more frequent when the user believed her behavior might result in suspension.
We also observe that in almost all cases, the user chooses some variant of the same online handle, e.g., “OumHu55inBrit,” which helps her retain her online identity and signals her status by announcing her attachment to Junaid Hussain, who always used the online alias “Abu Hussain al-Britani” (see follow-on discussion and Table 7).

Table 3: Partial screen name and tweet timelines for two Twitter user accounts purportedly belonging to Sally Jones. These accounts have been suspended by Twitter and are no longer available.

First Account
Tweet Time | Screen Name
-----------|------------
2015-09-30 11:45:37 | OumHu554inBrit
2015-09-30 19:58:15 | OumHu554inBrit
2015-10-02 13:43:59 | Mrsl337
2015-10-02 21:28:54 | OumHu554inBrit
2015-10-03 00:48:01 | UmmHu55ain2
2015-10-03 15:30:08 † | Oum1337
2015-10-03 16:52:39 | OumHu554inBrit
2015-10-03 16:55:45 | OumHu554inBrit
2015-10-03 17:24:06 | OumHu554inBrit
2015-10-03 23:31:29 | UmmHussain9ll
2015-10-04 13:20:47 | OumHu554inBrit

Second Account
Tweet Time | Screen Name
-----------|------------
2015-10-05 16:44:55 ‡ | OumHu554in
2015-10-05 17:44:28 | OumHu554in
2015-10-05 20:36:22 | OumHussain
2015-10-06 18:03:26 | OumHussain
2015-10-07 13:24:47 | OumHussa1n

† In this tweet the user warns she is about to release information that could get her suspended, and encourages her followers to be ready to retweet her.
‡ This is the first tweet in a new user account, as the previous one was suspended.

While there might have been additional screen names and tweets associated with these accounts that we did not capture, we found the type of online behavior exhibited in Table 3 indicative of many of the ISIS-supporting accounts in our data set. Following suspension, the user apparently opens a new account and continues the same tactic, all the while adopting very similar account screen names and names. Prominent ISIS members Sally Jones and Junaid Hussain provide examples of this behavior; accounts associated with them appear frequently in our ISIS data.
Querying our data for user accounts with a name similar to “Umm Hussain Al-Britani” returns 23 distinct entries, all of which have been suspended.

Empirically, we observed screen name changes in approximately 10% of the accounts in our data that were eventually suspended, while in the accounts that remained active the number was close to 1%. Furthermore, anecdotal investigation of active accounts with multiple screen name changes suggests that many of these accounts are also ISIS-related. Figure 2 provides a histogram comparison of the number of screen names associated with active and suspended accounts in our data set. It is clear from the figure that the suspended accounts are much more likely to have more screen names. For example, even though active accounts make up over 94% of our data, only 18% of the accounts with over 20 unique screen names are still active.

Figure 2: Histogram of screen names for active and suspended accounts in our data. The average numbers of screen names for suspended and active accounts are listed in the legend.

These observations motivated our development of a method for locating new accounts belonging to a specific user. The first step in this process was to develop an automated method of identifying whether a pair of accounts belong to the same user. To achieve this pairwise classification, we employ a supervised machine learning approach, which is described next.

4.2. Profile Comparison Metrics
We define a Twitter user profile as a vector of profile features $x$ associated with a Twitter account. A Twitter account can only have a single user profile at any point in time. The features of the profile are not fixed, however.
As we have noted, cases exist in our data in which users changed their screen name or other profile features, resulting in our obtaining multiple user profile feature vectors belonging to the same account. While it is possible for a single Twitter account to belong to different users at different times (e.g., an account gets hacked or one user simply provides the account login information to another), we assume that all of the profiles associated with the same Twitter account belong to a single user.

Our classification goal is therefore to compare two user profiles $(x^{(i)}, x^{(j)})$ from different Twitter accounts and identify whether or not they belong to the same user. In order to train a model to perform this classification, we must construct profile comparison features from profile pairs $(x^{(i)}, x^{(j)})$ that are useful in establishing whether they belong to the same user. Building on our qualitative observations of individual ISIS Twitter users retaining identifying similarities between their multiple user profiles, we propose a set of similarity metrics based on comparisons of the following four profile features: screen name, user name, profile picture, and profile banner image. These similarity metrics are based on user profile characteristics that are publicly available on all accounts, even if the user has “protected” the account using Twitter’s privacy settings.

4.2.1. Screen name and user name similarity metrics
In comparing two screen names or two user names, we use the well-known Levenshtein ratio (see [26]) to provide a measure of distance between two strings. This ratio involves counting the number of character insertions, deletions, or substitutions required to transform one string into the other. This number is normalized by the length of the longer string and then subtracted from one.
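The ratio as described above can be sketched in a few lines of Python. This is an illustrative implementation of the textual definition (edit distance divided by the longer length, subtracted from one), not necessarily the routine from [26]; some library functions named `ratio` use a different normalization. The function names are chosen for this example only.

```python
def levenshtein_distance(a, b):
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def levenshtein_ratio(a, b):
    """One minus the edit distance normalized by the longer string's length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein_distance(a, b) / longest
```

For example, the screen names "OumHu554in" and "OumHu554inBrit" from Table 3 differ by four appended characters, so this definition gives a ratio of 1 - 4/14, or about 0.71.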
If we let $S$ be a set of strings of various lengths, the Levenshtein ratio can be thought of as a function $L : S^2 \to [0, 1]$ where $L(s_1, s_1) = 1$ and $L(s_1, s_2) = L(s_2, s_1)$ for any $s_1, s_2 \in S$. $L(s_1, s_2) = 0$ implies that strings $s_1$ and $s_2$ are not at all similar.

Our first two comparison features, $\phi_1$ and $\phi_2$, are simply the screen name and user name Levenshtein ratios:

$$\phi_1(x^{(i)}, x^{(j)}) = L(x^{(i)}_{SN}, x^{(j)}_{SN}), \qquad \phi_2(x^{(i)}, x^{(j)}) = L(x^{(i)}_{N}, x^{(j)}_{N}),$$

where $x^{(i)}_{SN}$ and $x^{(i)}_{N}$ denote the respective screen name and user name of account profile $x^{(i)}$. Figure 3 provides an illustration of five screen name pairs and their corresponding Levenshtein ratios.

Figure 3 Example screen name comparison Levenshtein ratios.

4.2.2. Profile picture and profile banner similarity metrics

We employ a simple image average hash algorithm (e.g., [22]) to compare two pictures. Essentially, the algorithm partitions the image into $8 \times 8$ equal-sized rectangular sub-images and then identifies whether the average shade of each sub-image is brighter or darker than the overall image average. The algorithm runs efficiently and returns an $8 \times 8$ binary matrix, which can easily be represented as a non-negative integer. We denote the hash algorithm as a function $H : \Psi \to \mathbb{Z}_+$, where for any images $\psi_1, \psi_2 \in \Psi$,

$$\psi_1 = \psi_2 \Rightarrow H(\psi_1) = H(\psi_2),$$
$$H(\psi_1) \neq H(\psi_2) \Rightarrow \psi_1 \neq \psi_2,$$
$$H(\psi_1) = H(\psi_2) \Rightarrow \psi_1 \approx \psi_2.$$

Two images with the same hash value contain very similar patterns of shade. Therefore, we assume that images with the same hash value are the same image. Our image similarity metric $h$ is a simple step function that follows from this assumption:

$$h : \Psi^2 \to \{0, 1\}, \qquad h(\psi_1, \psi_2) = \begin{cases} 1, & H(\psi_1) = H(\psi_2) \\ 0, & H(\psi_1) \neq H(\psi_2). \end{cases}$$
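A minimal sketch of the average hash and the step function $h$ is given below. It is not the implementation referenced in [22]. It assumes the image has already been converted to a 2D list of grayscale values whose height and width are multiples of 8, and the function names are hypothetical.

```python
def average_hash(pixels, grid=8):
    """Pack a grid x grid brightness pattern into a non-negative integer.

    `pixels` is a 2D list of grayscale values. Assumption: its height and
    width are multiples of `grid`, so the blocks tile the image exactly.
    """
    height, width = len(pixels), len(pixels[0])
    block_h, block_w = height // grid, width // grid
    overall_mean = sum(map(sum, pixels)) / (height * width)

    bits = 0
    for r in range(grid):
        for c in range(grid):
            block = [pixels[y][x]
                     for y in range(r * block_h, (r + 1) * block_h)
                     for x in range(c * block_w, (c + 1) * block_w)]
            # Bit is 1 when the block is brighter than the whole image.
            bit = 1 if sum(block) / len(block) > overall_mean else 0
            bits = (bits << 1) | bit
    return bits


def image_similarity(pixels_1, pixels_2):
    """Step function h: 1 if the two hashes agree, 0 otherwise."""
    return 1 if average_hash(pixels_1) == average_hash(pixels_2) else 0


# Synthetic 16 x 16 images: dark left half, bright right half.
original = [[0] * 8 + [255] * 8 for _ in range(16)]
lower_contrast = [[10] * 8 + [245] * 8 for _ in range(16)]
print(image_similarity(original, lower_contrast))
```

Because only the brighter-or-darker pattern is kept, a copy of an image with slightly altered contrast still maps to the same hash, which is the behavior the third property of $H$ describes.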
We use this image similarity metric to construct our third and fourth features:

$$\phi_3(x^{(i)}, x^{(j)}) = h(x^{(i)}_{PP}, x^{(j)}_{PP}), \qquad \phi_4(x^{(i)}, x^{(j)}) = h(x^{(i)}_{BP}, x^{(j)}_{BP}),$$

where $x^{(i)}_{PP}$ and $x^{(i)}_{BP}$ are the respective profile and banner pictures for profile $x^{(i)}$. These features are simply binary indicators for whether or not the images being compared have the same average hash matrix.

4.3. Data Set Construction

Having defined pairwise account profile similarity features, our next step was to clean the data and extract the features for use in a classification model. Initially we examined 4,339 seed user accounts collected before June 4, 2015. However, in order to keep the string similarity metrics consistent, we removed 395 accounts with user name strings that did not use the Latin alphabet. This left us with 3,944 user profiles. Within this set, we knew some user profiles we collected belonged to the same Twitter account and therefore the same user. These accounts were identifiable by the Twitter user ID, which does not change even if a user changes his or her screen name or other profile features. Our set of 3,944 profiles contained 3,855 unique Twitter accounts (i.e., unique user IDs), corresponding to 3,855 seed users. For each pair of user profiles $(i, j)$, we computed a feature vector $\phi^{(i,j)}$ of the four similarity metrics. This results in $\binom{3{,}944}{2} = 7{,}775{,}596$ pairs.

4.4. Data Labeling

We assume that each pair of user profiles either belong to the same user or belong to different users. We denote this classification with binary class variable $y^{(i,j)}$, where

$$y^{(i,j)} = \begin{cases} 1, & \text{profiles } i \text{ and } j \text{ belong to the same user} \\ 0, & \text{profiles } i \text{ and } j \text{ belong to different users.} \end{cases}$$
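The pair enumeration of Section 4.3 can be sketched as follows. The pair count is checked with the standard library, and `pairwise_features` is a hypothetical helper that applies a list of comparison functions to every unordered profile pair. The toy profiles and the single toy metric are illustrative only.

```python
from itertools import combinations
from math import comb

# Number of unordered pairs among the 3,944 retained profiles.
n_pairs = comb(3944, 2)
print(n_pairs)


def pairwise_features(profiles, metrics):
    """Map each unordered index pair (i, j), i < j, to its feature vector.

    `metrics` is a list of functions taking two profiles and returning a
    number (for example the four similarity metrics of Section 4.2).
    """
    return {(i, j): [metric(profiles[i], profiles[j]) for metric in metrics]
            for i, j in combinations(range(len(profiles)), 2)}


# Toy usage: profiles are plain strings and the only metric is exact equality.
toy = pairwise_features(["abu", "abu", "umm"], [lambda p, q: float(p == q)])
print(toy)
```

With several million pairs, a generator or batched computation would likely be preferable to holding every feature vector in one dictionary; the dictionary is used here only to keep the sketch short.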
Of the 7,775,596 pairwise profile comparisons in our data, 95 could be traced to the same user because they actually belonged to the same account, identifiable by the Twitter user ID. Although we do not seek to classify profiles belonging to the same account because we can already assume they belong to the same user, we left these comparison points in the data set as labeled data in order to train the classification model. Updating profile features for an existing account is a different action than creating a new Twitter account, however, causing this labeled set to be biased toward profiles that are very similar. When a user creates a new Twitter account, on the other hand, he or she must deliberately set or leave blank each of the profile settings. As a result, we do not expect the same level of similarity between two user profiles associated with separate accounts, but belonging to the same user, when compared to the similarity between two profiles belonging to the same Twitter account. Using only these 95 pre-labeled data points for training therefore might not be very useful for our purpose. We also do not have any points classified as accounts belonging to different users. To solve this problem, we labeled a subset of comparisons in our data set using the following method.

1. If profile $x^{(i)}$ and profile $x^{(j)}$ share the same user ID, set label $y^{(i,j)} = 1$. These are the 95 profile comparisons that are known to belong to the same user.

2. If profile $x^{(i)}$ and profile $x^{(j)}$ do not share the same user ID, and

$$\left\| \phi^{(i,j)} \right\|_2 < 0.1, \tag{1}$$

we set label $y^{(i,j)} = 0$. These conditions establish that the profiles have very little in common, so we assume they belong to different users. Table 4 provides an example of the features associated with a pair of accounts meeting this criterion.

3. Manually label a randomly selected subset of unlabeled pairs that exhibit relatively high similarity metrics. We chose 168 pairs from the set of 1,257,350 pairs where

$$\left\| \phi^{(i,j)} \right\|_2 > 0.85, \tag{2}$$

for manual labeling. In assigning a label to these pairs, we considered all available data in comparing the two profiles, including Twitter posting habits and account profile features, such as location and description, that are not considered in the model. We found that 82 of these pairs were accounts belonging to the same user, while the remaining 86 of them were from different users. Table 5 provides an example comparison of the features of a pair of accounts meeting this criterion.

Table 4 Accounts exhibiting very low similarity, according to the selection criterion given in equation (1).

Feature (k)       User i            User j        $\phi^{(i,j)}_k$
User ID           2683126250        3108319204    [NA]
Screen Name       khalidbinalwale   profomar0     0.08
Name              Abu Muslim        prof          0.00
Profile Picture   00...c3           09...cc       0.00
Profile Banner    00...00           [None]        0.00

$\|\phi^{(i,j)}\|_2 = 0.08$

Table 5 Accounts exhibiting very high similarity, according to the selection criterion given in equation (2). These accounts were manually labeled as belonging to the same user, i.e., $y^{(i,j)} = 1$.

Feature (k)       User i        User j        $\phi^{(i,j)}_k$
User ID           3307258107    3297609231    [NA]
Screen Name       Ahmes Zirve   Ahmes Zirve   0.88
Name              Ahmes Zirve   Ahmes Zirve   1.00
Profile Picture   ff...ff       ff...ff       1.00
Profile Banner    [None]        [None]        1.00

$\|\phi^{(i,j)}\|_2 = 1.94$

4.5. Classification Model

From our set of labeled data, we set aside 10% for out-of-sample evaluation of model performance. This percentage was enforced for each of the three labeling methods, so that the test set included 10% of the hand-labeled data points, for example.
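The per-method 10% holdout described above can be sketched as a stratified split. This is an illustration only: the function name, the rounding of group sizes, the fixed random seed, and the toy group sizes are assumptions, not details from the paper.

```python
import random


def stratified_holdout(pairs_by_method, test_fraction=0.10, seed=0):
    """Hold out `test_fraction` of the labeled pairs from each labeling method.

    `pairs_by_method` maps a labeling-method name to its list of labeled
    pairs. Returns (train, test) lists covering every pair exactly once.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for method, pairs in pairs_by_method.items():
        shuffled = list(pairs)
        rng.shuffle(shuffled)
        n_test = round(test_fraction * len(shuffled))
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return train, test


# Toy usage with hypothetical group sizes.
groups = {
    "same_user_id": [("same_user_id", k) for k in range(100)],
    "low_similarity": [("low_similarity", k) for k in range(50)],
    "manual": [("manual", k) for k in range(170)],
}
train, test = stratified_holdout(groups)
print(len(train), len(test))
```

Splitting within each labeling method keeps the scarce hand-labeled pairs represented in the test set, rather than leaving their share to chance.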
We then fit an $L_1$-regularized logistic regression model on the training data. In other words, we assume

$$P(y^{(i,j)} = 1) = \left(1 + e^{-(\beta^T \phi^{(i,j)} + \beta_0)}\right)^{-1}.$$

We identified $\lambda = 10$ as the regularization parameter that provided the best performance in cross validation. The intercept and coefficients for the logistic regression model fit on the training data are shown in Table 6. Interestingly, profile banner similarity is not useful in this model in determining the probability of two profiles belonging to the same user.

Table 6 Regression coefficients for matching accounts.

Feature                                         Regression coefficient
Intercept                                       -8.05
Screen name Levenshtein ratio ($\phi_1$)        2.94
User name Levenshtein ratio ($\phi_2$)          7.05
Profile picture hash matrix ($\phi_3$)          1.88
Banner picture hash matrix ($\phi_4$)           0

The receiver operating characteristic (ROC) curve plotted for the manually labeled training and test data combined is given in Figure 4. ROC curves plotted separately for the training and test data were very similar, and classification on the test data points that were not manually labeled (i.e., they were labeled using steps (1) or (2) of the labeling method given in Section 4.4) was nearly perfect. The AUC in Figure 4 is approximately 0.91.

Figure 4 Logistic regression ROC curve on hand-labeled data.

We view the ROC curve on the manually labeled data in Figure 4 as an approximation for the “worst case” performance of the classifier. We selected these pairs for manual labeling because they exhibited some degree of similarity, based on the $L_2$ norm of the comparison feature vector, anticipating that they would be among the most difficult points to classify. As noted previously, plotting the ROC curve on all of the labeled data, or on the entire test set, shows near-perfect classification.
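As a sketch of how the fitted model scores a pair, the coefficients from Table 6 can be plugged into the logistic function. The sign convention is an assumption made here: the linear score $\beta^T \phi + \beta_0$ is taken to increase the probability of a match, which is the reading consistent with positive similarity coefficients and a negative intercept. The function name is hypothetical.

```python
import math

# Intercept and coefficients as listed in Table 6 (phi_1 .. phi_4).
INTERCEPT = -8.05
COEFFICIENTS = (2.94, 7.05, 1.88, 0.0)


def same_user_probability(phi):
    """Logistic probability that a profile pair belongs to one user.

    Assumption: a larger linear score means a more likely match.
    """
    score = INTERCEPT + sum(b * f for b, f in zip(COEFFICIENTS, phi))
    return 1.0 / (1.0 + math.exp(-score))


# Feature vectors of the example pairs in Table 4 and Table 5.
low_similarity_pair = (0.08, 0.00, 0.00, 0.00)
high_similarity_pair = (0.88, 1.00, 1.00, 1.00)
print(same_user_probability(low_similarity_pair))
print(same_user_probability(high_similarity_pair))
```

Under this reading the Table 4 pair receives a probability close to zero and the Table 5 pair a probability close to one, in line with their labels.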
Because we anticipate that most account pairs belong to different users, maintaining a low false positive misclassification rate is important. Even a small false positive rate could equate to a large number of misclassified points. For this reason, we select a false positive threshold of 2% on the hand-labeled ROC curve. This threshold leads us to a classification probability threshold of 0.782, as indicated in Figure 4. In other words, we assign a classification $\hat{y}^{(i,j)}$ to a profile pair $(x^{(i)}, x^{(j)})$ according to the function

$$\hat{y}^{(i,j)} = \begin{cases} 1, & \left(1 + e^{-(\beta^T \phi^{(i,j)} + \beta_0)}\right)^{-1} \geq 0.782 \\ 0, & \left(1 + e^{-(\beta^T \phi^{(i,j)} + \beta_0)}\right)^{-1} < 0.782. \end{cases} \tag{3}$$

Based only on the hand-labeled data ROC, we expect this classifier to correctly identify over 80% of account pairs belonging to the same user while misclassifying less than 2% of the account pairs belonging to different users. Because the manually labeled data consists of account pairs that exhibit some substantial measure of similarity, we expect performance on the entire data set to be much better, similar to the near-perfect classification on the test data.

Figure 5 Graph representation of accounts belonging to the same user using our regression model and equation (3) with a threshold of 0.782.

When we apply the classifier in equation (3) to the entire data set, we obtain 318 account pairs classified as belonging to the same user. Sixty-two of these pairs have the same account ID and are therefore known to be from the same account, while the remaining 256 pairs come from different Twitter accounts. Figure 5 provides a network representation of these account connections. Each node in the plot represents a unique Twitter account. An edge drawn between two accounts indicates our classification equation labels the pair of accounts as belonging to the same user.
Only accounts with at least one edge are depicted in Figure 5.

Most of the components in the graph depicted in Figure 5 are fully connected, which is as we would expect. Component A is an example of a fully connected component, consisting of five Twitter accounts. These account profile features are listed in Table 7. They are all very similar and indeed appear to belong to the same user. Component B, on the other hand, consists of three accounts but is not fully connected. Table 8 provides a list of the profile features associated with these three accounts. While they all appear to belong to the same user, comparison of the first and third profiles given in the table resulted in probability $P(y^{(i,j)} = 1) = 0.774$, which falls below our classification threshold. While in this case these two accounts are connected by way of a third account that meets the classification threshold with both of them, it is clear that setting the threshold this high does indeed miss some pairs of accounts that probably do belong to the same user. We discuss the sensitivity of the results as a function of the classification threshold in Appendix C.

Table 7 Accounts comprising component A. While average hash values for profile pictures are abbreviated, they are the same for all profiles.

Screen Name    Name                   Profile Pic
AlJabarti28    Abu Yusuf Al-Jabarti   20...00
BanuKombe      Abu Yusuf Al-Jabarti   20...00
enkorela       Abu Yusuf Al-Jabarti   20...00
ouaicheu       Abu Yusuf Al-Jabarti   20...00
ouaisheu       Abu Yusuf Al-Jabarti   20...00

Table 8 Accounts comprising component B.

Screen Name      Name             Profile Pic
Aqidahhaqq       Colonel Shaami   [None]
AnsarAlUmmah49   Colonel Shaami   [None]
buruan8          Colonel Shaami   [None]

5. Refollowing Model

In the previous section we used machine learning to produce a method for efficiently finding groups of accounts that are likely to belong to a single user. In this section, we use the account clusters produced from this method in an effort to learn how users tend to reconnect with, or refollow, other user accounts when opening a new account.

Suppose a user $t$ has his account suspended and decides to open a new account. After getting the account open, $t$ decides to follow some other users. We have observed that in many cases, $t$ will refollow at least some of the user accounts he was previously following with his suspended account, and it seems reasonable to assume that any suspended user would want to reconnect with some of the same people he or she was following prior to suspension. In this section we fit a probability model that assigns a value to each of $t$'s former friends, giving the probability $t$ will refollow the former friend upon opening a new Twitter account. We again turn to logistic regression as a means of producing this probability model.

5.1. Data

Using the logistic regression model from Section 4 with a cutoff of 0.782, we grouped the seed accounts into clusters, each of which we assume belongs to a single user. A network representation of the non-singleton clusters is shown in Figure 5. Accounts in each cluster were then sorted by account age. After sorting, we compared the friend lists of each pair of consecutive accounts. For each friend of the former account, we created a row in our data set labeled with an indicator of whether or not the same friend was connected to the latter account. For example, user account #3280844606 (@MusabGharieb18) and user account #3343999888 (@MusabGharieb13) are consecutive accounts belonging to the same user cluster.
Table 9 shows whether each account was following certain friend accounts. Table 10 shows how each of @MusabGharieb18's friends would then generate a row in the data for this logistic regression model.

Table 9 Example of @MusabGharieb18's (@M...18) refollowing behavior upon opening new account @MusabGharieb13 (@M...13).

Friend             @M...18   @M...13
@poorslave 3       YES       YES
@enkorela          YES       NO
@StillUkhtMaryam   YES       YES
@Yaqub London      YES       NO

Table 10 Example data rows resulting from refollowing behavior given in Table 9. Features are omitted but include, for example, characteristics from each friend's profile.

Friend             Features   Refollowed (Response)
@poorslave 3       ...        1
@enkorela          ...        -1
@StillUkhtMaryam   ...        1
@Yaqub London      ...        -1

5.2. Features

In order to obtain a good fit, we included features from the suspended user's earlier account (e.g., @MusabGharieb18) as well as features from the friend account (e.g., @poorslave). For a suspended user account User0 that was following account Friend, we construct a variety of features which can be broken down into different categories. One set of features deals with the features of the individual accounts of User0 and Friend. A related set of features concerns the similarity of the two accounts. There is a category of features that deals with the interactions between the two accounts. Finally, there is a category of features that describe aggregate properties of the neighbors of User0. A complete list of the features used in our model can be found in Appendix D.

5.3. Kernel Logistic Regression

Intuitively, some interactions among our set of features might be more predictive than the features themselves. For example, the average number of User0's friends might not be very useful in estimating the probability User0 refollows a specific Friend account.
However, this value multiplied by Friend's number of Twitter friends could be very useful. For this reason, we use a quadratic kernel in this logistic regression model, which ensures the regression is fit on all linear and quadratic terms, including pairwise interactions:

$$K(x, y) = (1 + x^T y)^2.$$

Given a training data set $\{x_1, x_2, \ldots, x_N\}$, the corresponding logistic regression model is

$$\hat{p}(x) = \left(1 + e^{-\sum_{i=1}^{N} \alpha_i K(x, x_i)}\right)^{-1}.$$

The parameters $\alpha = (\alpha_1, \ldots, \alpha_N)$ are fit on the training data using an $L_2$-regularized log loss:

$$\alpha = \arg\min_{\hat{\alpha}} \sum_{i=1}^{N} \log\left(1 + e^{-y_i \sum_{i'=1}^{N} \hat{\alpha}_{i'} K(x_i, x_{i'})}\right) + \lambda \hat{\alpha}^T \hat{\alpha},$$

where $y_i$ is the response in the $i$th row of the training data. These responses take value -1 if the Friend was not refollowed, or 1 if the Friend was refollowed, as annotated in Table 10. The parameter $\lambda$ serves as the regularization coefficient.

5.4. Performance

In order to fit this model we used gradient-based optimization methods available in Python's scipy package [17]. We first selected training (50%), validation (25%), and test (25%) sets randomly from all of the rows of the data and normalized the entire data set based on the values in the training data. Through validation we found that $\lambda = 10^{-5}$ provided the highest AUC. Performance on out-of-sample test data is depicted in Figure 6.

Figure 6 ROC curve for $L_2$-regularized quadratic kernel logistic regression performance on out-of-sample test data. (left) Test data and training and validation data can contain the same user. (right) Test data and training and validation data do not contain the same users. The AUC values annotated in the figure are 0.663 and 0.798.

From the figure it appears that we can predict with some accuracy which former friends a suspended user is likely to reconnect with.
It is possible, however, that the model is learning refollowing preferences of individual users in the data set. To investigate this possibility, we selected new training, validation, and test sets by randomly selecting different user clusters for each and included all of the rows corresponding to these user clusters in the corresponding set. In other words, each component depicted in Figure 5 was assigned as a whole to either the training, validation, or test data, approximately maintaining the 50%-25%-25% ratios. Unlike the previous data partition, this constraint ensures that all of the rows in Table 10 go to the same set, because they belong to the same user. Using this new data partition, validation and testing were completed on data consisting of entirely different users than those that provided the training data. Through validation we found that $\lambda = 10^{-4}$ provided the highest AUC on this new data partition. Out-of-sample performance suffered, as can be seen in the ROC plot in Figure 6.

Comparing the performance on each partition provides some interesting insights. First, the AUC for the new partition in Figure 6 is 0.66, which indicates that there is some underlying refollowing behavior that transcends the users in our data. However, our ability to predict whether or not a suspended user will refollow an old friend increases substantially when we include that user's past behavior in the training data. The difference in performance gives us an idea of how useful it is to have data on a specific user's past behavior when predicting whom the user will refollow.

Because we used a quadratic kernel logistic regression, the expressions for the fit models are not easy to interpret.
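To make the role of the quadratic kernel concrete, the sketch below evaluates $K(x, y) = (1 + x^T y)^2$, checks that it equals an inner product over an explicit set of constant, linear, squared, and pairwise-interaction terms, and evaluates the kernel logistic probability for given weights. The function names are hypothetical, the weights are placeholders rather than fitted values, and the sign convention (a larger kernel score means a higher refollow probability) is the one implied by the log loss with responses in {-1, 1}.

```python
import math
from itertools import combinations


def quadratic_kernel(x, y):
    """K(x, y) = (1 + x . y)^2."""
    return (1.0 + sum(a * b for a, b in zip(x, y))) ** 2


def explicit_features(x):
    """Feature map whose inner product reproduces the quadratic kernel:
    a constant, scaled linear terms, squares, and scaled pairwise products."""
    root2 = math.sqrt(2.0)
    features = [1.0]
    features += [root2 * v for v in x]
    features += [v * v for v in x]
    features += [root2 * a * b for a, b in combinations(x, 2)]
    return features


def refollow_probability(x, training_points, alphas):
    """Kernel logistic probability for feature vector `x`.

    Assumption: a larger kernel score means a more likely refollow.
    """
    score = sum(a * quadratic_kernel(x, xi)
                for a, xi in zip(alphas, training_points))
    return 1.0 / (1.0 + math.exp(-score))


x, y = [1.0, 2.0], [3.0, 4.0]
via_kernel = quadratic_kernel(x, y)
via_features = sum(p * q for p, q in zip(explicit_features(x), explicit_features(y)))
print(via_kernel, via_features)
```

The agreement of the two values illustrates why the kernel model is equivalent to a logistic regression on all linear and quadratic terms, and also why its individual weights $\alpha_i$, which attach to training rows rather than to features, are hard to read directly.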
The fitted models' performance shows that we can predict with some accuracy the refollowing behavior of a suspended user, even in the absence of previous refollowing behavior, based solely on the refollowing behavior of others. We make use of this capability in the next section, where we develop a method to search for a suspended user's new account. In practice an analyst might be able to produce a much better model for a specific user by carefully incorporating past refollowing behavior, if available.

6. Suspended User Search

We now make use of our findings from the previous sections to address another relevant problem. We have observed multiple instances in our data of suspended users quickly creating a new Twitter account in order to continue their unethical activity, as exemplified in Table 3. In these instances it would be useful for those tasked with monitoring nefarious users, such as social media service providers or intelligence community personnel, to find an efficient way to search for the suspended user's new account.

We assume we are given a target user whose account has been suspended by Twitter. We have stored the target user's account information, including lists of the target's friends and followers. From this information we wish to locate the target user's new Twitter account, if one exists, as efficiently as possible. Our approach to solving this problem is to query the followers of each of the target user's known Twitter “friend” accounts, prior to suspension, and search the results for a new account belonging to the target user.

Our network search model builds directly on the multi-urn search model presented in [21] and is illustrated in Figure 7. We can think of each of the target's former friends $i$ as an urn containing $N_i$ marbles, which represent the neighbors of $i$.
If the target has connected to former friend $i$, then he is among $i$'s neighbors and a single red marble is one of the $N_i$ marbles in urn $i$. Excepting these red marbles, all marbles in all urns are blue. Each follower query can be thought of as choosing a nonempty urn $j$ in the multi-urn model and removing some fixed number of its marbles. The number of marbles removed is determined by the query method used and, unlike the search model in [21], can be more than one. Having a red marble among those removed represents finding the target user's account, and the search terminates.

Figure 7 Network search representation as a multi-urn model.

6.1. Suspended User Search Model

Let $\mathcal{V}$ be the set of known friend accounts. These are the accounts that the target user was following prior to being suspended. For each known friend $i \in \mathcal{V}$, let $N_i$ be the number of Twitter accounts that are following $i$. These quantities are easily obtained through the Twitter API.

Using the Twitter API it is possible to obtain a list of the followers of a specified user, provided the user has not enabled privacy protection on the account. Twitter offers two methods for executing these queries: GET followers/list and GET followers/ids. Both methods are rate limited to 15 queries within any 15-minute time period. GET followers/list returns standard Twitter user profile information for each follower, but only returns up to 200 profiles per query. GET followers/ids has the same rate limit, but returns up to 5,000 user IDs per query [30]. Each method can be cursored so that subsequent queries of the same user continue to produce unique results, until all of the user's followers' profiles or IDs have been obtained.
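The practical difference between the two query methods can be illustrated with a small calculation. The page sizes and the rate limit are the figures quoted above from [30] and may no longer reflect the current API. The function names are hypothetical, and the waiting-time estimate assumes queries are issued in back-to-back batches of 15 with a full 15-minute window between batches.

```python
def queries_needed(follower_count, per_query):
    """Number of cursored queries needed to page through all followers."""
    return -(-follower_count // per_query)  # integer ceiling division


def waiting_minutes(n_queries, limit=15, window_minutes=15):
    """Minutes spent waiting on the rate limit before the last query can run.

    Assumption: `limit` queries are issued at once, then a full window
    elapses before the next batch.
    """
    if n_queries <= 0:
        return 0
    return ((n_queries - 1) // limit) * window_minutes


followers = 12_000
for method, page_size in (("followers/ids", 5_000), ("followers/list", 200)):
    n = queries_needed(followers, page_size)
    print(method, n, waiting_minutes(n))
```

For a friend with 12,000 followers this gives 3 queries with no waiting for the ID-based method and 60 queries with 45 minutes of waiting for the profile-based method, which illustrates why the larger page size is the natural choice for the search.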
For our analysis, we set $N_M$ as the maximum number of unique followers obtained per query, although in practice we assume this number to be 5,000 as established in the GET followers/ids method. Therefore if we have queried user $i$'s followers $n$ times, we expect the next query of user $i$'s followers to return $\min\{N_M, N_i - nN_M\}$ new results, provided $i$ still has unqueried followers ($N_i - nN_M > 0$).

Additionally, we make the following assumptions:

1. After being suspended, the target user creates a new account with probability $\rho_0$, which we refer to as the a priori existence probability. If the target has not created a new account, then he does not have a node in the network and will not be found through follower queries. The value of $\rho_0$ quantifies the searcher's belief that the target exists in the network.

2. If the target user creates a new account, he reconnects with each former friend $i \in \mathcal{V}$ with some probability $\varphi_i$, which can be estimated from previous account data as was done in Section 5. We refer to this as the reconnection probability to former friend $i$.

3. Reconnections to former friends are independent; whether or not the search target reconnects with former friend $i$ does not affect the probability he reconnects with former friend $j \neq i$.

4. If the target user is following user $i \in \mathcal{V}$, then each account returned in each query of $i$'s followers is equally likely to be the target's account.

5. The searcher can quickly and accurately determine whether an account obtained from a follower query is the target user's account. This can be done using the approach developed in Section 4.

The search process is modeled as the execution of follower queries in discrete stages. In each stage $t \in \{0, 1, \ldots, N - 1\}$, the searcher chooses one of the target user's former friend accounts and executes a follower query. Here, $N$ is the total number of queries required to examine all of the followers of all former friends, and is assumed to be finite. If the target user's new account is among the query results, the search terminates. Otherwise, the searcher executes another query unless all $N$ queries have been exhausted or the searcher concludes that the target has not created a new account.

The objective of the search is to minimize the total number of queries. In order to remain consistent with the multi-urn search model in [21], we do not consider the cost of a query that succeeds in returning the target user's new account. Therefore, the objective in our search model is to minimize the number of unsuccessful search queries. The best result possible would be to find the search target in the first query, in which case there are zero unsuccessful queries. Because of the stochastic nature of this process, we say that a search policy is optimal if it minimizes the expected number of unsuccessful queries.

6.2. Initialization

We assume that data collected on the target user provides a list of known former friend accounts. Using the Twitter API, it is relatively easy to determine which of these accounts are still active, whether or not they are “private,” and their follower counts. We initialize set $\mathcal{V}$ as the set of all former friend accounts that are active at the time of search execution and that have followers that can be queried (i.e., have a positive number of followers and are not “private” accounts). We use the follower counts for these accounts to initialize $N_i$, $i \in \mathcal{V}$.

This search model also requires an initial probability that the target user would reconnect with each former friend $i \in \mathcal{V}$, given he has created a new account.
Let $A$ be the event that the target has created a new account, $B_i$ be the event that the target is following former friend $i \in \mathcal{V}$, and $B = \bigcup_{i \in \mathcal{V}} B_i$ be the event that the target has reconnected with at least one former friend. From our definitions above, we can write

$$\varphi_i = P(B_i \mid A).$$

We can obtain the value of this probability using the approach presented in Section 5. Note that event $B$ can also be interpreted as the event that we can find the target user by exhaustively querying the followers of all former friends. Using our independence assumption we have

$$P(B^c \mid A) = 1 - P(B \mid A) = \prod_{i \in \mathcal{V}} (1 - \varphi_i).$$

We also must select a value for the a priori existence probability $\rho_0$, which can be done based on the beliefs of experts in the relevant domain. As the search process progresses, the conditional existence probability will evolve.

The search terminates if the target user is found or all follower queries are exhausted. In addition to these criteria, a searcher might want to terminate the search upon achieving reasonable certainty that the target user has not created a new account. We allow for this termination criterion by including a termination conditional existence probability $\bar{\rho}$. If at any stage the conditional existence probability falls below $\bar{\rho}$, the search terminates and the searcher concludes that the target has not created an identifiable new account. If $\bar{\rho}$ is set to zero, then the search continues until the target is found or until all follower queries are exhausted.

6.3. The Discrete Stochastic Search Process

As we have suggested, the search process can be modeled as a set of urns, each representing a former friend. Urn $i \in \mathcal{V}$ has $N_i$ marbles, which represent former friend $i$'s followers. In each stage the searcher chooses a former friend (or urn) and executes a follower query, receiving up to $N_M$ results (or drawing at most $N_M$ marbles from the urn).
The search continues until one of the following occurs:
• The target's new account is found (a red marble is among those drawn).
• The probability that the target has created a new account falls below the termination probability $\bar{\rho}$.
• The queries of former friends' followers are exhausted (there are no marbles left in any of the urns).

6.3.1. Policy

We consider a valid policy to be any sequence of former friend queries in which each former friend is queried exactly until its followers are exhausted. In other words, if we let $\mathbf{u} = (u_0, u_1, \ldots, u_{N-1})$ be a policy in which former friend $u_t \in \mathcal{V}$ is queried in stage $t$, then $\mathbf{u}$ is valid if and only if

$$|\{t : u_t = i\}| = \lceil N_i / N_M \rceil \quad \forall i \in \mathcal{V}.$$

Notice that any valid policy can be completely specified in advance as an ordering of follower queries that is executed until one of the three termination criteria is met. As long as the target is not found, state transitions are deterministic and can be enumerated a priori. Except for the decision to terminate, there is no benefit to making policy decisions during the search. Unsuccessful search results do not provide any additional insight into which ordering of queries might yield a lower cost.

6.3.2. System State and Transitions

In order to analyze the dynamics of the system we define the system state, $\mathbf{x}(t)$, at stage $t$ as either a $|\mathcal{V}|$-dimensional vector in which the $i$th element $x_i(t)$ is the number of follower queries that have been executed on former friend $i \in \mathcal{V}$ in previous stages, or a terminal state, "Terminate." At stage $t = 0$, no queries have been executed and presumably the search has not terminated, so that $\mathbf{x}(0) = \mathbf{0}$. In any non-terminal state, let the vector $\mathbf{x}(t)$ be specified as a function of the policy being executed:

$$x_i(t) = |\{\ell < t : u_\ell = i\}| \quad \forall i \in \mathcal{V}. \tag{4}$$

State transitions in this system are a function of the current state, the policy, and a stochastic input representing whether the target account is found as a result of the current stage query. Let

$$w(\mathbf{x}(t), i) = \begin{cases} 0, & \text{target is not found querying } i \text{ from state } \mathbf{x}(t), \\ 1, & \text{target is found querying } i \text{ from state } \mathbf{x}(t). \end{cases}$$

We now have all of the definitions needed to write the state transition function that governs this search model:

$$\mathbf{x}(t+1) = f(\mathbf{x}(t), u_t, w(\mathbf{x}(t), u_t)) = \begin{cases} \text{``Terminate,''} & w(\mathbf{x}(t), u_t) = 1 \text{ or another termination criterion is met}, \\ \mathbf{x}(t) + \mathbf{e}_{u_t}, & \text{otherwise.} \end{cases}$$

Here, $\mathbf{e}_i$ represents the $i$th unit vector.

6.4. Search Process Dynamics

We define the function

$$\psi_i(\mathbf{u}, t) = \min\left\{ \frac{x_i(t) N_M}{N_i},\, 1 \right\} = \begin{cases} \dfrac{x_i(t) N_M}{N_i}, & x_i(t) = 0, 1, \ldots, \lceil N_i / N_M \rceil - 1, \\ 1, & x_i(t) = \lceil N_i / N_M \rceil \end{cases}$$

as the fraction of former friend $i$'s followers that have been queried before stage $t$ when executing valid policy $\mathbf{u}$ (or, using the urn analogy, the fraction of marbles that have been removed from urn $i$ at stage $t$), conditioned on not having found the target user prior to stage $t$. This function captures the assumption that, provided former friend $i$ has more than $N_M$ unqueried followers remaining in stage $t$, the query returns $N_M$ followers. If former friend $i$ has fewer than $N_M$ unqueried followers remaining in stage $t$, then the query will return all of the remaining unqueried followers. This function is strictly increasing at a constant rate of $N_M / N_i$ as $x_i(t)$ increases from 0 to $\lceil N_i / N_M \rceil - 1$. It continues to increase, at a possibly slower rate, in the $\lceil N_i / N_M \rceil$th query of former friend $i$. Because $x_i(t)$ is nondecreasing in $t$, we can conclude that $\psi_i(\mathbf{u}, t)$ is also nondecreasing in $t$.
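The query counts and the queried fraction defined above can be sketched in a few lines of Python. This is only an illustration of the definitions, not the authors' implementation: the function names `num_queries`, `query_sizes`, and `psi` are hypothetical, and the sketch assumes every former friend has at least one follower.

```python
def num_queries(n_followers: int, n_max: int) -> int:
    """Queries needed to exhaust one former friend: ceil(N_i / N_M)."""
    return -(-n_followers // n_max)  # integer ceiling division


def query_sizes(n_followers: int, n_max: int) -> list:
    """Number of followers returned by each successive query of one former friend."""
    full, remainder = divmod(n_followers, n_max)
    # Every query returns n_max results except possibly the last one.
    return [n_max] * full + ([remainder] if remainder else [])


def psi(x_i: int, n_followers: int, n_max: int) -> float:
    """Fraction of a former friend's followers examined after x_i queries.

    Assumes n_followers > 0. The min(...) caps the fraction at 1 once the
    final (possibly smaller) query has been executed.
    """
    return min(x_i * n_max / n_followers, 1.0)
```

With the assumed values `n_followers = 12000` and `n_max = 5000`, `query_sizes` gives two full queries followed by a final query of 2,000 followers, and `psi` reaches 1 after the third query.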
For example, suppose a certain former friend has 12,000 followers and that each follower query returns at most $N_M = 5{,}000$ followers. Then the first and second queries of this former friend will return 5,000 followers each, while the final query will return only 2,000 followers. In general, we expect the first $\lceil N_i / N_M \rceil - 1$ follower queries of former friend $i \in \mathcal{V}$ to return $N_M$ results, while the final query returns $N_i - (\lceil N_i / N_M \rceil - 1) N_M$ results. Because of this irregularity, the final query of a former friend affects the system dynamics differently than the preceding queries of the same former friend.

6.4.1. Conditional Existence Probability

We now develop an expression for the conditional existence probability, i.e., the probability that the target user has created a new account conditioned on having reached a certain non-terminal state, $\mathbf{x}(t)$. Let $A$ be the event that the target user has created a new account. For simplicity of notation, we condition directly on the state vector $\mathbf{x}(t)$ to denote the event that this state has been reached without finding the target user's new account, so that $\rho(t) = P(A \mid \mathbf{u}, \mathbf{x}(t))$ is the new account existence probability conditioned on having reached state $\mathbf{x}(t)$ when executing valid search policy $\mathbf{u}$ without having found the target account. Note that $\rho(0) = \rho_0$, the initialization value. Using Bayes' rule, the conditional existence probability is

$$\rho(t) = \rho_0 \left( \frac{\prod_{i \in \mathcal{V}} \left(1 - \psi_i(\mathbf{u}, t)\phi_i\right)}{1 - \rho_0 + \rho_0 \prod_{i \in \mathcal{V}} \left(1 - \psi_i(\mathbf{u}, t)\phi_i\right)} \right).$$

The terms inside the products are the probabilities of not finding the target account among the followers of each former friend $i$, given that a fraction $\psi_i(\mathbf{u}, t)$ of those followers have been queried and examined. Multiplying these probabilities together implicitly relies on our assumption that the target user reconnects to his former friends independently.
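The Bayes update for the conditional existence probability can be illustrated with a short Python sketch. The function name `existence_probability` and its argument layout are assumptions made for this example; the inputs are the per-friend queried fractions $\psi_i$ and reconnection probabilities $\phi_i$ at the current stage.

```python
import math


def existence_probability(rho0: float, psi: list, phi: list) -> float:
    """Conditional existence probability rho(t) after an unsuccessful search so far.

    rho0 : prior probability that a new account exists
    psi  : fraction of each former friend's followers already examined
    phi  : reconnection probability for each former friend
    """
    # Probability of having found nothing so far, given the account exists.
    no_find = math.prod(1.0 - s * p for s, p in zip(psi, phi))
    denominator = 1.0 - rho0 + rho0 * no_find
    if denominator == 0.0:
        # Only possible when rho0 == 1 and the state has probability zero.
        raise ValueError("state is unreachable under the model")
    return rho0 * no_find / denominator
```

As a hand-checkable case under these assumptions, with `rho0 = 0.5`, one former friend fully examined with `phi = 0.5`, and a second friend not yet queried, the probability of no find given existence is 0.5 and the update gives 0.25 / 0.75 = 1/3.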
The expression for $\rho(t)$ is the initial existence probability multiplied by a ratio of two linear functions of the product $\prod_{i \in \mathcal{V}} (1 - \psi_i(\mathbf{u}, t)\phi_i)$. Because $\psi_i(\mathbf{u}, t) \le 1$ for all $i \in \mathcal{V}$ and is nondecreasing in $t$, this product is nonincreasing in $t$. The coefficient of the product in the denominator ($\rho_0$) is no more than its coefficient in the numerator (1), and therefore the conditional existence probability is nonincreasing in $t$ and converges to 0 as the product decreases to 0. This monotonicity property aligns with intuition: the more the social network is searched without finding the target user, the less likely it is that the target user exists in the network.

The value of the initial existence probability $\rho_0$ affects only the conditional existence probability at each stage; it does not otherwise affect the system dynamics. Implicit in the execution of the search is the assumption that the search target has created a new account and reconnected to former friends in a way that can be represented by a probability model. The utility of including an existence probability in the model is that it enables the searcher to set a search termination criterion, based on the value of the conditional existence probability, for when he is sufficiently convinced that the target user has not created a new account.

6.4.2. Conditional Reconnection Probabilities

The conditional probability that the target user has reconnected with former friend $i$, given that he has created a new account and that it has not been found by stage $t$ when applying search policy $\mathbf{u}$, can also be calculated using Bayes' rule. Recall that $A$ is the event that the target user created a new account and $B_i$ is the event that the target user has reconnected with friend $i$.
Then we have

$$P(B_i \mid A, \mathbf{u}, \mathbf{x}(t)) = \frac{P(x_i(t) \mid B_i, \mathbf{u}, A)\, P(B_i \mid \mathbf{u}, A)}{P(x_i(t) \mid B_i, \mathbf{u}, A)\, P(B_i \mid \mathbf{u}, A) + (1 - P(B_i \mid \mathbf{u}, A))} = \phi_i \left( \frac{1 - \psi_i(\mathbf{u}, t)}{1 - \psi_i(\mathbf{u}, t)\phi_i} \right)$$

$$= \begin{cases} \phi_i \left( \dfrac{N_i - x_i(t) N_M}{N_i - \phi_i x_i(t) N_M} \right), & x_i(t) = 0, 1, \ldots, \lceil N_i / N_M \rceil - 1, \\ 0, & x_i(t) = \lceil N_i / N_M \rceil. \end{cases}$$

Observe that this probability is the original probability multiplied by the ratio of two linear functions of $x_i(t)$. Because the numerator decreases at a faster rate than the denominator, this probability is strictly decreasing as $x_i(t)$ increases from 0 to $\lceil N_i / N_M \rceil$, provided $\phi_i > 0$. Just as with the conditional existence probability, the monotonicity of this conditional probability matches intuition: the more we query the followers of a certain former friend without finding the target, the less likely it becomes that the target has reconnected with this former friend.

6.4.3. Distribution of $w(\mathbf{x}(t), i)$

The probability of finding the target when querying former friend $i \in \mathcal{V}$ from state $\mathbf{x}(t)$ is found using the multiplication rule. Note that the event $\{w(\mathbf{x}(t), i) = 1\} \subseteq B_i \subseteq A$. Therefore,

$$P(w(\mathbf{x}(t), i) = 1) = P(w(\mathbf{x}(t), i) = 1 \mid B_i, A, \mathbf{x}(t))\, P(B_i \mid A, \mathbf{x}(t))\, P(A \mid \mathbf{x}(t))$$

$$= \begin{cases} \phi_i \left( \dfrac{N_M}{N_i - \phi_i x_i(t) N_M} \right) \left( \dfrac{\rho_0 \prod_{j \in \mathcal{V}} (1 - \psi_j(\mathbf{u}, t)\phi_j)}{1 - \rho_0 + \rho_0 \prod_{j \in \mathcal{V}} (1 - \psi_j(\mathbf{u}, t)\phi_j)} \right), & x_i(t) = 0, 1, \ldots, \lceil N_i / N_M \rceil - 2, \\[2ex] \phi_i \left( \dfrac{N_i - x_i(t) N_M}{N_i - \phi_i x_i(t) N_M} \right) \left( \dfrac{\rho_0 \prod_{j \in \mathcal{V}} (1 - \psi_j(\mathbf{u}, t)\phi_j)}{1 - \rho_0 + \rho_0 \prod_{j \in \mathcal{V}} (1 - \psi_j(\mathbf{u}, t)\phi_j)} \right), & x_i(t) = \lceil N_i / N_M \rceil - 1. \end{cases}$$

This expression offers several important insights into the dynamics of this search model.
First note that, conditioned on the existence of a new target account,

$$P(w(\mathbf{x}(t), i) = 1 \mid A, \mathbf{x}(t)) = \begin{cases} \phi_i \left( \dfrac{N_M}{N_i - \phi_i x_i(t) N_M} \right), & x_i(t) = 0, 1, \ldots, \lceil N_i / N_M \rceil - 2, \\[2ex] \phi_i \left( \dfrac{N_i - x_i(t) N_M}{N_i - \phi_i x_i(t) N_M} \right), & x_i(t) = \lceil N_i / N_M \rceil - 1, \end{cases} \tag{5}$$

$$P(w(\mathbf{x}(t), i) = 0 \mid A, \mathbf{x}(t)) = \begin{cases} \dfrac{N_i - \phi_i x_i(t+1) N_M}{N_i - \phi_i x_i(t) N_M}, & x_i(t) = 0, 1, \ldots, \lceil N_i / N_M \rceil - 2, \\[2ex] \dfrac{(1 - \phi_i) N_i}{N_i - \phi_i x_i(t) N_M}, & x_i(t) = \lceil N_i / N_M \rceil - 1, \end{cases} \tag{6}$$

where $x_i(t+1) = x_i(t) + 1$ because former friend $i$ is the one queried in stage $t$. We refer to equation (5) as the probability of success when querying former friend $i$ from state $\mathbf{x}(t)$. Likewise, equation (6) is the failure probability when querying former friend $i$ from state $\mathbf{x}(t)$.

Given that the target user has created a new account, the success probability for a specific friend $i \in \mathcal{V}$ is strictly increasing as $x_i(t)$ increases from 0 to $\lceil N_i / N_M \rceil - 2$, and is therefore nondecreasing over the corresponding stages $t$. However, this monotonicity property does not always hold for the final query. As we have discussed, the final query of $i$ does not necessarily return the same number ($N_M$) of results as previous queries of $i$, and it has a different functional form for the probability of success in equation (5).

Figure 8 illustrates this monotonicity property for two initial conditions. In both of the plotted trajectories, $\lceil N_i / N_M \rceil = 20$. For former friend 1, $N_1 \bmod N_M = 0$ and all queries return the same number ($N_M$) of results. In this case the probability of finding the target is strictly increasing over all queries of this former friend's followers. The second former friend's success probabilities depicted in Figure 8 do not have this characteristic, and the final query returns fewer results than the previous 19 queries.
In this case, we observe that the probability of finding the target is strictly increasing over the first 19 queries, but decreases in the final query because this query returns fewer results. This monotonicity property is an extension of the monotonicity theorem provided in [21].

Figure 8. Probability of finding the target user's new account, given it exists, as a function of the number of queries of former friend $j$.

As a final note on this property, we observe that this result holds even if we remove the conditioning on $A$. If in stage $t$ the searcher queried the followers of former friend $i$ and did not find the target user, then in stage $t + 1$,

$$P(w(\mathbf{x}(t+1), i) = 1) > P(w(\mathbf{x}(t), i) = 1)$$

for all $0 \le x_i(t+1) < \lceil N_i / N_M \rceil - 1$, $\phi_i > 0$, and $\rho(t) > 0$.

6.5. Analysis: $\bar{\rho} = 0$

We provide analysis for the case in which we initialize $\bar{\rho} = 0$, i.e., we continue to search until either the target account is found or all follower queries have been exhausted. If we were searching for a suspended user's new account, one course of action would be to first execute the query that is most likely to reveal the account. However, we have shown in [21] that this approach does not always yield the optimal policy. In this section we provide a characterization of the optimal policy that naturally extends the optimality condition derived in [21] for independent urns.

6.5.1. Expression for Expected Policy Cost

We now derive an expression for policy cost when $\bar{\rho} = 0$. Let $\mathbf{u}$ be a valid policy and $C_{\mathbf{u}}$ be the number of unsuccessful queries, or cost, of policy $\mathbf{u}$. Because $C_{\mathbf{u}}$ can only take nonnegative integer values $0, 1, \ldots, N$,

$$E[C_{\mathbf{u}}] = \sum_{t=0}^{N-1} P(C_{\mathbf{u}} > t) = \sum_{t=0}^{N-1} P(C_{\mathbf{u}} > t \mid A)\, P(A) + \sum_{t=0}^{N-1} P(C_{\mathbf{u}} > t \mid A^c)\, P(A^c)$$

$$= \rho_0 \sum_{t=0}^{N-1} \prod_{k=0}^{t} P(w(\mathbf{x}(k), u_k) = 0 \mid A) + N(1 - \rho_0). \tag{7}$$

The optimal search policy is the valid policy that minimizes this expression. Formally,

$$\mathbf{u}^\star = \arg\min_{\mathbf{u} \in \mathcal{U}} E[C_{\mathbf{u}}] = \arg\min_{\mathbf{u} \in \mathcal{U}} \left\{ \rho_0 \sum_{t=0}^{N-1} \prod_{k=0}^{t} P(w(\mathbf{x}(k), u_k) = 0 \mid A) + N(1 - \rho_0) \right\}$$

$$= \arg\min_{\mathbf{u} \in \mathcal{U}} \sum_{t=0}^{N-1} \prod_{k=0}^{t} P(w(\mathbf{x}(k), u_k) = 0 \mid A), \tag{8}$$

where $\mathbf{u}^\star$ is the optimal policy and $\mathcal{U}$ is the set of valid policies. Recall from equation (4) that the vectors $\mathbf{x}(t)$ can be written as a function of the search policy. Not surprisingly, if we commit to exhausting all possible queries in our search for the target, the initial existence probability $\rho_0$ does not affect policy optimality.

In order to simplify notation, we define the probability

$$q_{\mathbf{u}}(t) = P(w(\mathbf{x}(t), u_t) = 0 \mid A).$$

This is the probability of failing to find the target's new account when executing the $t$th query in policy $\mathbf{u}$. This probability is specified in equation (6), and allows us to rewrite the objective function in equation (8) as

$$\mathbf{u}^\star = \arg\min_{\mathbf{u} \in \mathcal{U}} \sum_{t=0}^{N-1} \prod_{k=0}^{t} q_{\mathbf{u}}(k).$$

6.5.2. Optimality Conditions

In [21] it is shown that there exists a block policy, in which each urn $i \in \mathcal{V}$ is exhaustively queried before moving on to another urn, that is optimal in any multi-urn search problem. We now provide an analogous result for this specific application, which is proved in Appendix A.
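Before the formal result, the objective in equations (7) and (8) can be evaluated numerically for any valid policy. The following Python sketch assumes the failure probabilities of equation (6) and treats the policy as a list of former-friend indices; the names `failure_probability` and `expected_cost` are hypothetical, and the sketch does not check that the policy queries every friend to exhaustion.

```python
def failure_probability(x_i: int, n_i: int, n_max: int, phi_i: float) -> float:
    """Equation (6): P(target not found | account exists) for the next query of one friend.

    x_i is the number of queries of this friend already executed.
    """
    last = -(-n_i // n_max) - 1  # index of the final query: ceil(N_i / N_M) - 1
    if x_i > last:
        raise ValueError("former friend has no queries remaining")
    denominator = n_i - phi_i * x_i * n_max
    if x_i < last:
        return (n_i - phi_i * (x_i + 1) * n_max) / denominator
    return (1.0 - phi_i) * n_i / denominator


def expected_cost(policy, followers, phi, n_max, rho0=1.0):
    """Equation (7): expected number of unsuccessful queries for a policy.

    policy    : sequence of former-friend indices, one per stage
    followers : follower counts N_i
    phi       : reconnection probabilities phi_i
    """
    x = [0] * len(followers)  # queries executed so far on each friend
    survive = 1.0             # P(no success through the current stage | A)
    total = 0.0
    for i in policy:
        survive *= failure_probability(x[i], followers[i], n_max, phi[i])
        total += survive
        x[i] += 1
    return rho0 * total + len(policy) * (1.0 - rho0)
```

As a small hand-checkable case under these assumptions, two former friends with 3 and 4 followers and reconnection probabilities 0.5 and 0.2 give an expected cost of 0.5 + 0.4 = 0.9 when the first friend is queried first, and 0.8 + 0.4 = 1.2 in the reverse order, so the ordering matters even in this tiny instance.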
Theorem 1 (Necessary Conditions for Optimality). If in a suspended user search, follower queries of former friends are executed until either the target user is found or all queries have been exhausted, then any optimal policy must satisfy the following conditions:
(1) The first $\lceil N_i / N_M \rceil - 1$ queries of each former friend $i \in \mathcal{V}$ are executed in succession in a single block.
(2) For all friends $i \in \mathcal{V}$ such that

$$\frac{N_i}{N_M \phi_i} - \frac{1}{2} \left\lceil \frac{N_i}{N_M} \right\rceil > \frac{N_i (1 - \phi_i)}{\phi_i \left( N_i - \lceil N_i / N_M \rceil N_M + N_M \right)}, \tag{9}$$

all $\lceil N_i / N_M \rceil$ queries of $i$'s followers are executed in succession.

The first part of Theorem 1 follows from the monotonicity of the success probability. If querying former friend $i$ is optimal in stage $t$, and in the next stage ($t + 1$) the success probability for $i$ has increased while success probabilities for all $j \in \mathcal{V} \setminus \{i\}$ have remained the same, then intuitively it would be optimal to query $i$ again in stage $t + 1$.

The condition in equation (9) is related to how the success probability changes in the final query of each former friend. If

$$\frac{N_i}{N_M \phi_i} - \frac{1}{2} \left\lceil \frac{N_i}{N_M} \right\rceil > \frac{N_i (1 - \phi_i)}{\phi_i \left( N_i - \lceil N_i / N_M \rceil N_M + N_M \right)},$$

then the final query of $i$ has a lower cost than the previous queries of $i$. This is the case depicted in Figure 8 for former friend 1. In this case, querying all of $i$'s followers in succession starting at any stage $t$ is more valuable than executing only the first $\lceil N_i / N_M \rceil - 1$ queries, and any optimal policy will include all of these queries in a single block.

If on the other hand

$$\frac{N_i}{N_M \phi_i} - \frac{1}{2} \left\lceil \frac{N_i}{N_M} \right\rceil < \frac{N_i (1 - \phi_i)}{\phi_i \left( N_i - \lceil N_i / N_M \rceil N_M + N_M \right)},$$

then the final query of former friend $i$ has a higher cost than the previous query. This is the case of former friend 2 depicted in Figure 8.
In this case querying all of $i$'s followers in succession starting at any stage $t$ is less beneficial, in terms of minimizing cost, than executing only the first $\lceil N_i / N_M \rceil - 1$ queries. The optimal policy might separate the final query of $i$ from the first $\lceil N_i / N_M \rceil - 1$ queries in this case.

If the inequality in equation (9) is instead satisfied with equality, then executing all of the queries of $i$'s followers in succession from any stage $t$ provides essentially the same benefit as executing only the first $\lceil N_i / N_M \rceil - 1$ queries. In this case, an optimal policy will always exist in which these queries are executed together in a single block, but alternative policies with equal cost might also exist in which the first $\lceil N_i / N_M \rceil - 1$ queries of $i$ are separated from the final query.

Theorem 1 establishes that the optimal policy is a block policy, but it does not specify the details of this policy. The following theorem, which is proved in Appendix B, provides a full characterization of an optimal policy.

Theorem 2 (Necessary and Sufficient Conditions for Optimality). In a suspended user search, define

$$\gamma(\mathbf{x}(t), i) = \begin{cases} \dfrac{1}{\phi_i} \left\lceil \dfrac{N_i}{N_M} \right\rceil - \dfrac{N_M}{2 N_i} \left\lceil \dfrac{N_i}{N_M} \right\rceil \left( \left\lceil \dfrac{N_i}{N_M} \right\rceil - 1 \right) - 1, & \dfrac{N_i}{N_M \phi_i} - \dfrac{1}{2} \left\lceil \dfrac{N_i}{N_M} \right\rceil > \dfrac{N_i (1 - \phi_i)}{\phi_i \left( N_i - \lceil N_i / N_M \rceil N_M + N_M \right)}, \; x_i(t) = 0, 1, \ldots, \lceil N_i / N_M \rceil - 1; \\[3ex] \dfrac{N_i}{N_M \phi_i} - \dfrac{1}{2} \left\lceil \dfrac{N_i}{N_M} \right\rceil, & \dfrac{N_i}{N_M \phi_i} - \dfrac{1}{2} \left\lceil \dfrac{N_i}{N_M} \right\rceil \le \dfrac{N_i (1 - \phi_i)}{\phi_i \left( N_i - \lceil N_i / N_M \rceil N_M + N_M \right)}, \; x_i(t) = 0, 1, \ldots, \lceil N_i / N_M \rceil - 2; \\[3ex] \dfrac{N_i (1 - \phi_i)}{\phi_i \left( N_i - \lceil N_i / N_M \rceil N_M + N_M \right)}, & \dfrac{N_i}{N_M \phi_i} - \dfrac{1}{2} \left\lceil \dfrac{N_i}{N_M} \right\rceil \le \dfrac{N_i (1 - \phi_i)}{\phi_i \left( N_i - \lceil N_i / N_M \rceil N_M + N_M \right)}, \; x_i(t) = \lceil N_i / N_M \rceil - 1; \\[3ex] \infty, & \text{otherwise.} \end{cases}$$
A valid policy is optimal if and only if it satisfies the condition in Theorem 1 and it minimizes $\gamma(\mathbf{x}(t), i)$ in each stage, i.e.,

$$u_t = \arg\min_{i \in \mathcal{V}} \gamma(\mathbf{x}(t), i), \quad t = 0, 1, \ldots, N - 1.$$

The function $\gamma(\mathbf{x}(t), i)$ arises in the proof of Theorem 2 when comparing the costs of policies which swap the order of querying former friend $i$ with another former friend. Theorem 2 simply says that always choosing the former friend that minimizes $\gamma(\mathbf{x}(t), i)$ produces an optimal policy. The different cases for $\gamma(\mathbf{x}(t), i)$ correspond to the number of queries remaining for each former friend, along with the optimality conditions from Theorem 1. The first case corresponds to the condition in equation (9). As discussed, this condition implies that executing all queries of $i$ in a single block is more beneficial than executing only the first $\lceil N_i / N_M \rceil - 1$ queries. The other cases follow similar logic: the second case is the value function for the first $\lceil N_i / N_M \rceil - 1$ queries of former friend $i$, and the condition indicates that executing only these queries in a single block is best. The third case is for the final query of former friend $i$, and the fourth case sets $\gamma(\mathbf{x}(t), i)$ to infinity if there are no queries remaining for $i$.

6.6. Results

Using the classification results from Section 4, we identified 169 account pairs from our ISIS seed users for testing. Each pair of accounts consisted of an earlier account, which had been suspended, and an account opened later that belonged to the same user. Without being able to verify exactly when account suspensions took place, we assumed the later account in each case was opened or used in response to the former account's suspension. Having collected the friends and followers lists for all of these accounts, we were able to evaluate the performance and effectiveness of the search policy we developed.

From the set of 169 account pairs, we randomly chose 15 for testing. Table 11 shows the number of former friends, the reconnection rate, and the total number of queries possible (or policy length) for each of these account pairs.

Table 11. Randomly selected account pairs for testing.

Pair   Former friends   Reconnection %   Max queries (N)
1                  35           40.00%                38
2                 310           59.68%              6609
3                  94           17.02%               247
4                  87           21.84%               431
5                 185            8.11%               198
6                  84           22.62%               101
7                  63            9.52%             12007
8                 189            4.23%              2078
9                 257           88.72%              4312
10                109           30.28%               152
11                302           82.12%              5559
12                344           22.67%              1314
13                181            9.94%               190
14                 87            3.45%              2965
15                221            2.26%              2654

For each account pair, we identified the friends from the earlier (suspended) account as the "former friends" of the subsequent account. For each of these former friends we determined their reconnection probability using the logistic regression classifier from Section 5. We also had the number of followers for each former friend stored in our data set. We assumed that all of the former friend accounts were still active when the second account was opened. Finally, we initialized $\rho_0 = 1$. This initial value is useful because it reduces the expression for expected policy cost to the objective function in equation (8) and allows for direct comparison of actual performance with our theoretical expected number of unsuccessful queries.

In order to evaluate policy performance, we consider the following policies:
• Optimal. This is a policy that minimizes expected cost, found using the necessary and sufficient conditions in Theorem 2.
• Greedy. This policy maximizes the probability of finding the new account at each stage.
Because this probability strictly increases for each former friend $i \in \mathcal{V}$ every time $i$ is queried, excepting the final query of $i$, this policy always meets the necessary condition for optimality given in Theorem 1.
• Min-$N$. This policy selects the former friend with the minimum number of unqueried followers at each stage. Because these values strictly decrease for each former friend $i \in \mathcal{V}$ with each query of $i$, this policy always meets the necessary condition for optimality given in Theorem 1.
• Max-$P$. This policy selects the former friend with the highest conditional reconnection probability at each stage. Because conditional reconnection probabilities strictly decrease for each former friend $i \in \mathcal{V}$ with each query of $i$, this policy does not necessarily meet the conditions in Theorem 2.
• Random. This policy randomly chooses a query from those that are possible at each stage.

6.6.1. Comparison of Expected Costs

We computed the expected cost for each policy using equation (7). These values do not account for our knowledge of the true reconnections of the second account in each case. Instead, we assume that our probability model is correct in these computations.

Table 12. Cost comparisons for different policies.
Expected costs:

Pair   Optimal    Greedy     Min-N     Max-P    Random
1         5.72      5.74      9.18      5.89      7.93
2         2.26      2.27      4.15     88.23     68.87
3         1.22      1.22      2.00      6.09      4.91
4         1.20      1.20      2.74     20.48      8.90
5         2.96      2.96      9.27      3.36      6.19
6         0.96      0.96      2.52      4.43      1.86
7       103.51    103.98    107.53    400.48   2170.52
8         4.98      5.10      9.05     74.36     71.86
9         2.28      2.28      4.73     80.68     57.27
10        1.01      1.01      2.99      8.64      2.27
11        0.89      0.89      2.02    126.65     38.17
12        2.88      2.88      6.98     42.15     18.78
13        1.50      1.50      3.82      3.26      2.52
14        8.84      8.85     15.28    141.16    322.96
15        1.17      1.17      2.84     61.06     20.02

Actual costs:

Pair   Optimal    Greedy     Min-N     Max-P    Random
1         3.00      3.00      2.00      2.74      1.70
2         0.00      0.00      2.00     44.16     20.97
3         1.00      1.00      6.00      6.28     11.63
4         2.00      2.00     15.00     26.81     20.79
5         5.00      5.00     15.00      6.56     10.75
6         1.00      1.00      7.00      5.53      4.49
7         5.00      5.00     12.00    283.13   1582.28
8         6.00      6.00    136.00     82.50    242.40
9         0.00      0.00      2.00     54.75      5.87
10        3.00      3.00      3.00     13.76      3.27
11        0.00      0.00      0.00    111.03     18.18
12        0.00      0.00      6.00     44.57     19.57
13        1.00      1.00     10.00      3.41      9.57
14        4.00      4.00     52.00    150.62    736.53
15        7.00      7.00     61.00    143.00    390.86

Table 12 gives the expected costs computed for each policy. Expected cost values are analytically computed in all cases except for the random policy. In order to estimate expected cost for a random policy, we generated 500 random policies and computed the expected cost for each. The average of these 500 expected costs is reported as the random policy expected cost in Table 12.

The results show that in many cases, the greedy policy and the optimal policy achieve the same cost. Comparison of these two policies reveals that they are very similar in all cases. This finding agrees with the findings in [21], which also suggest that there is a bound on the suboptimality of the greedy policy. The Min-$N$ policy also produces costs close to those of the optimal and greedy policies, while the Max-$P$ and random policies have substantially higher costs in many cases.

6.6.2. Comparison of Actual Costs

In this section we compare the performance of the different policies in finding the target user based on the actual reconnections. If the target users tended to reconnect in accordance with our probability model, we would expect these actual cost values to be similar to the expected costs in Table 12. In cases in which the target user reconnected to former friends in a way that would be very unlikely according to our probability model, the actual policy costs might differ substantially from the expected costs.

The actual costs for each policy are reported in Table 12. The values reported are the expected number of queries one would have to execute before finding the target user, conditional on the target user's actual reconnections. In some of the 15 cases, the actual costs in Table 12 differ substantially from the expected costs. However, the same trend holds: the optimal and greedy policies tend to perform the best, and are nearly indistinguishable in terms of costs. The Min-$N$ policy performs as well or nearly as well as the optimal policy in some cases, but in a few cases it is much worse. The Max-$P$ and random policies tend to perform poorly, especially when the target user has not connected with very many former friends (see Table 11). Using the optimal or greedy policies can result in substantial cost savings in these cases.

Account pair 1 provides an example of a case where a random policy can outperform the optimal policy in practice. The reconnection rate for this target user was 40% (from Table 11), but the target did not reconnect with the most probable former friends, according to our probability model (in actuality, it is possible these accounts were suspended when the target opened the new account).
From Table 11, it is apparent that most of the 35 former friends have fewer than 5,000 followers, because the valid policy length is at most 38 queries. The random policy performs approximately as we would expect in this case: each random query has approximately a 40% chance of returning the target. From the well-known geometric probability distribution, the expected number of failures before the first success is 1.5, which is very close to the value reported in Table 12.

In each of the 15 account pairs, the optimal, greedy, and Min-$N$ policies located the target user when querying a former friend that had fewer than 5,000 followers. For this reason, the actual number of queries in these cases was deterministic, resulting in the integer costs reported in Table 12. For pair 15, for example, the optimal policy would always find the target user on the 8th query because this is the first former friend in the policy to whom the target user had reconnected, and a single query retrieves all of this former friend's followers.

In this application, many of the former friend accounts have fewer than 5,000 followers and are therefore exhausted in a single followers query. These accounts, when coupled with a high reconnection probability, are very valuable in a search policy. Both the greedy and the optimal policies prioritize queries of former friends with relatively high reconnection probabilities and low numbers of followers.

6.7. Discussion: $\bar{\rho} > 0$

When an existence probability threshold is applied as a termination criterion, Theorems 1 and 2 no longer hold. However, the queries that have the highest probability of finding the target user are also the queries that have the largest effect on reducing the conditional existence probability.
We conjecture that the optimal policy in the case for which $\bar{\rho} > 0$ will be the same as the optimal policy when $\bar{\rho} = 0$ in the initial queries. At some point, a stage is reached for which a greedy policy becomes more desirable, because it reaches the termination criterion $\rho(t) < \bar{\rho}$ earlier than a $\bar{\rho} = 0$ optimal policy characterized by the conditions in Theorems 1 and 2.

A final consideration for the case in which $\bar{\rho} > 0$ involves the initial condition. The values that the conditional existence probability $\rho(t)$ takes all depend explicitly on the initial existence probability $\rho_0$. This sensitivity should be explored in analyses or execution of searches that employ this termination criterion.

7. Conclusion

The growth of online extremism has created the need for capabilities to mitigate the threat posed by the abusive or threatening behavior of these extremist users. In this work we have developed a set of capabilities which allow for more effective mitigation of these threats. These capabilities can be used to enhance the performance of law enforcement or other entities that are responsible for protecting the public from online extremist groups. Our approach combined statistical modeling of extremist behavior with optimized search policies. Our behavioral modeling allowed us to predict new extremist users, determine if two accounts belong to the same extremist user, and predict the network connections of suspended extremist users when they create new accounts. We used our behavioral models to formulate a network search policy to find the new accounts of suspended extremist users when they return to the social network. Simulations based on actual ISIS users found that our policy was much more efficient than other benchmark approaches.
While our analysis focused on terrorist extremist groups such as ISIS in the social network Twitter, the capabilities we developed can apply to any online extremist group and any social network. Nothing in our modeling or search policy is specialized to ISIS or Twitter. Users that engage in some form of online extremism or harassment will have very similar behavioral characteristics in social networks. They will connect to a specific set of users which form their extremist group. They will create new accounts which will resemble their old accounts after being suspended. When they return to the social network after being suspended, they will reconnect with certain former friends with higher probability. In addition, none of our capabilities requires the cooperation of social network operators. Therefore, all the capabilities we developed here are agnostic to the extremist group and social network.

References

[1] Yasmeen Abutaleb. Twitter suspended 360,000 accounts for 'promotion of terrorism'. http://www.reuters.com/article/us-twitter-terrorism-idUSKCN10T1ST, August 2016. Accessed: 2016-10-13.
[2] Spencer Ackerman, Ewen MacAskill, and Alice Ross. Junaid Hussain: British hacker for ISIS believed killed in US air strike. http://www.theguardian.com/world/2015/aug/27/junaid-hussain-british-hacker-for-isis-believed-killed-in-us-airstrike, August 27 2015. Accessed October 16, 2015.
[3] Steve Alpern and Thomas Lidbetter. Mining coal or finding terrorists: The expanding search paradigm. Operations Research, 61(2):265–279, 2013.
[4] J. Berger and Jonathan Morgan. The ISIS Twitter census: Defining and describing the population of ISIS supporters on Twitter. The Brookings Project on US Relations with the Islamic World, 3:20, 2015.
[5] JM Berger and Heather Perez. The Islamic State's diminishing returns on Twitter: How suspensions are limiting the social networks of English-speaking ISIS supporters. The Program on Extremism at George Washington University Occasional Paper, Washington DC, 2016.
[6] William L Black. Discrete sequential search. Information and Control, 8(2):159–162, 1965.
[7] Russell Brandom. President Obama says Orlando killer was inspired by online extremism. http://www.theverge.com/2016/6/13/11922034/orlando-attack-barack-obama-briefing-isis-internet-terrorism, July 2016. Accessed: 2016-10-13.
[8] Kristen V. Brown. Twitter just permanently suspended Milo Yiannopoulos, the internet's biggest troll. http://fusion.net/story/327536/milo-yiannopoulos-nero-permanently-banned-from-twitter/, July 2016. Accessed: 2016-10-13.
[9] Zi Chu, Steven Gianvecchio, Haining Wang, and Sushil Jajodia. Detecting automation of Twitter accounts: Are you a human, bot, or cyborg? IEEE Transactions on Dependable and Secure Computing, 9(6):811–824, 2012.
[10] Arnon Dagan and Shmuel Gal. Network search games, with arbitrary searcher starting point. Networks, 52(3):156–161, 2008. ISSN 1097-0037. doi: 10.1002/net.20241. URL http://dx.doi.org/10.1002/net.20241.
[11] John P Dickerson, Vadim Kagan, and VS Subrahmanian. Using sentiment to detect bots on Twitter: Are humans more opinionated than bots? In Advances in Social Networks Analysis and Mining (ASONAM), 2014 IEEE/ACM International Conference on, pages 620–627. IEEE, 2014.
[12] Karthik Dinakar, Roi Reichart, and Henry Lieberman. Modeling the detection of textual cyberbullying. The Social Mobile Web, 11:02, 2011.
[13] Karthik Dinakar, Birago Jones, Catherine Havasi, Henry Lieberman, and Rosalind Picard. Common sense reasoning for detection, prevention, and mitigation of cyberbullying. ACM Transactions on Interactive Intelligent Systems (TiiS), 2(3):18, 2012.
[14] Emilio Ferrara, Wen-Qiang Wang, Onur Varol, Alessandro Flammini, and Aram Galstyan. Predicting online extremism, content adopters, and interaction reciprocity. arXiv preprint arXiv:1605.00659, 2016.
[15] Scott Higham. Why the Islamic State leaves tech companies torn between free speech and security. https://www.washingtonpost.com/world/national-security/islamic-states-embrace-of-social-media-puts-tech-companies-in-a-bind/2015/07/15/0e5624c4-169c-11e5-89f3-61410da94eb1_story.html?kmap=1, July 2015. Accessed: 2016-10-13.
[16] NF Johnson, M Zheng, Y Vorobyeva, A Gabriel, H Qi, N Velasquez, P Manrique, D Johnson, E Restrepo, C Song, et al. New online ecology of adversarial aggregates: ISIS and beyond. Science, pages 1459–1463, 2016.
[17] Eric Jones, Travis Oliphant, Pearu Peterson, et al. SciPy: Open source scientific tools for Python, 2001–. URL http://www.scipy.org/. Accessed 2016-04-14.
[18] Jytte Klausen. Tweeting the jihad: Social media networks of Western foreign fighters in Syria and Iraq. Studies in Conflict & Terrorism, 38(1):1–22, 2015.
[19] Srijan Kumar, Francesca Spezzano, and VS Subrahmanian. Accurately detecting trolls in Slashdot Zoo via decluttering. In Advances in Social Networks Analysis and Mining (ASONAM), 2014 IEEE/ACM International Conference on, pages 188–195. IEEE, 2014.
[20] Kyumin Lee, Brian David Eoff, and James Caverlee. Seven months with the devils: A long-term study of content polluters on Twitter. In ICWSM, 2011.
[21] Christopher Marks and Tauhid Zaman. A multi-urn model for network search. ArXiv e-prints, August 2016.
[22] Chris Pickett. Safari Books Online, November 2013. https://www.safaribooksonline.com/blog/2013/11/26/image-hashing-with-python/.
[23] Kelly Reynolds, April Kontostathis, and Lynne Edwards. Using machine learning to detect cyberbullying. In Machine Learning and Applications and Workshops (ICMLA), 2011 10th International Conference on, volume 2, pages 241–244. IEEE, 2011.
[24] Jacob R Scanlon and Matthew S Gerber. Automatic detection of cyber-recruitment by violent extremists. Security Informatics, 3(1):1, 2014.
[25] Robert Spencer. UK: Female rocker embraces Islam, hopes to behead Christians. jihadwatch.org, August 31, 2014. URL http://www.jihadwatch.org/2014/08/uk-female-rocker-embraces-islam-hopes-to-behead-christians. Accessed October 13, 2013.
[26] Stackoverflow.com. How python-Levenshtein.ratio is computed. Online documentation, 2015. URL http://stackoverflow.com/questions/14260126/how-python-levenshtein-ratio-is-computed. Accessed October 13, 2015.
[27] Ashish Sureka and Swati Agarwal. Learning to classify hate and extremism promoting tweets. In Intelligence and Security Informatics Conference (JISIC), 2014 IEEE Joint, pages 320–320. IEEE, 2014.
[28] Amar Toor. Twitter may be cracking down on ISIS, but white nationalists are still thriving. http://www.theverge.com/2016/9/5/12798196/twitter-nazi-white-nationalist-isis-study, September 2015. Accessed: 2016-10-13.
[29] Twitter. An update on our efforts to combat violent extremism. https://blog.twitter.com/2016/an-update-on-our-efforts-to-combat-violent-extremism, August 2016. Accessed: 2016-10-13.
[30] Twitter. Twitter REST API, 2016. https://dev.twitter.com/rest/public.
[31] twitter.com. Twitter microblogging website, 2016. http://www.twitter.com.

Appendix A.
Proof of Theorem 1

We provide a proof by contradiction that follows the same logic as the block policy proof in [21]. Suppose policy $u$ is optimal and does not satisfy condition (1) in Theorem 1. Then there exist $i \in \mathcal{V}$ and integers $\tau \ge 0$, $\delta > 1$, and $\Delta > 0$ such that

$$
\begin{aligned}
u_{\tau} &= i \\
u_{\tau+\ell} &\ne i \quad \forall\, \ell \in \{1, 2, \ldots, \delta-1\} \\
u_{\tau+\delta} &= i \\
u_{\tau+\delta+\Delta} &= i.
\end{aligned}
$$

Note that this final condition simply implies that the query of $i$ in stage $\tau+\delta$ is not the final query of this former friend. From equation (6), the query failure probabilities in stages $\tau$ and $\tau+\delta$ are

$$
q_u(\tau) = \frac{N_i - \varphi_i\,(x_i(\tau)+1)\,N_M}{N_i - \varphi_i\,x_i(\tau)\,N_M},
\qquad
q_u(\tau+\delta) = \frac{N_i - \varphi_i\,(x_i(\tau)+2)\,N_M}{N_i - \varphi_i\,(x_i(\tau)+1)\,N_M}.
$$

We construct two alternative policies. The first alternative policy, $\hat{u}$, moves the query of $i$ from stage $\tau+\delta$ to stage $\tau+1$. The second alternative policy, $\tilde{u}$, moves the query of $i$ from stage $\tau$ to stage $\tau+\delta-1$. Each of these alternative policies rearranges the sequence of queries in $u$ so that the two queries of former friend $i$ in stages $\tau$ and $\tau+\delta$ are instead executed in succession. Formally,

$$
\hat{u}_t =
\begin{cases}
u_{t-1} & t = \tau+1, \ldots, \tau+\delta \\
u_t & \text{otherwise}
\end{cases}
\qquad
\tilde{u}_t =
\begin{cases}
u_{t+1} & t = \tau, \ldots, \tau+\delta-1 \\
u_t & \text{otherwise.}
\end{cases}
$$

The relationship between the query failure probabilities follows from these policy definitions:

$$
q_{\hat{u}}(t) =
\begin{cases}
q_u(\tau+\delta) & t = \tau+1 \\
q_u(t-1) & t = \tau+2, \ldots, \tau+\delta \\
q_u(t) & \text{otherwise}
\end{cases}
\qquad
q_{\tilde{u}}(t) =
\begin{cases}
q_u(\tau) & t = \tau+\delta-1 \\
q_u(t+1) & t = \tau, \ldots, \tau+\delta-2 \\
q_u(t) & \text{otherwise.}
\end{cases}
$$

We now compare the costs of these policies.
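To make the quantities in this argument concrete, the following is a minimal Python sketch and not the authors' code. It evaluates the failure-probability expressions displayed in this appendix and the cost expression, the sum over stages of the running product of failure probabilities, that the comparison relies on. The names `Friend`, `failure_prob`, and `expected_cost` and all numbers are hypothetical. Merging the generic and final-query expressions with a `min` cap is our assumption.

```python
from typing import NamedTuple, Sequence


class Friend(NamedTuple):
    """Hypothetical container for one former friend's parameters."""
    n_i: int     # N_i in the text
    phi: float   # phi_i in the text


def failure_prob(friend: Friend, n_m: int, x: int) -> float:
    """Failure probability of the (x+1)-th query of a former friend.

    Implements (N_i - phi_i (x+1) N_M) / (N_i - phi_i x N_M). The term
    (x+1) N_M is capped at N_i so that the last query reproduces the
    final-query expression (1 - phi_i) N_i / (N_i - phi_i (c - 1) N_M)
    with c = ceil(N_i / N_M). The cap is an assumption of this sketch.
    """
    covered_after = min((x + 1) * n_m, friend.n_i)
    return (friend.n_i - friend.phi * covered_after) / (friend.n_i - friend.phi * x * n_m)


def expected_cost(policy: Sequence[int], friends: Sequence[Friend], n_m: int) -> float:
    """Sum over stages t of the product of failure probabilities up to t."""
    queries_so_far = [0] * len(friends)
    running_product = 1.0
    total = 0.0
    for i in policy:
        running_product *= failure_prob(friends[i], n_m, queries_so_far[i])
        queries_so_far[i] += 1
        total += running_product
    return total


if __name__ == "__main__":
    # Hypothetical instance: friend 0 needs three queries, friend 1 needs one.
    friends = [Friend(n_i=10, phi=0.5), Friend(n_i=4, phi=0.3)]
    n_m = 4
    u = [0, 1, 0, 0]        # first two queries of friend 0 are separated
    u_hat = [0, 0, 1, 0]    # later query moved forward to stage tau + 1
    u_tilde = [1, 0, 0, 0]  # earlier query moved back to stage tau + delta - 1
    for name, policy in [("u", u), ("u_hat", u_hat), ("u_tilde", u_tilde)]:
        print(name, round(expected_cost(policy, friends, n_m), 4))
```

In this toy instance one of the two rearranged policies has a strictly lower cost than the separated policy, which is the pattern the contradiction argument exploits.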
Optimality of $u$ implies that the expected cost of policy $\hat{u}$ must be at least as high as the cost of $u$:

$$
\begin{aligned}
\mathbb{E}[C_{\hat{u}}] &\ge \mathbb{E}[C_u] \\
\sum_{t=0}^{N-1} \prod_{k=0}^{t} q_{\hat{u}}(k) &\ge \sum_{t=0}^{N-1} \prod_{k=0}^{t} q_u(k) \\
\sum_{t=\tau+1}^{\tau+\delta} \prod_{k=\tau+1}^{t} q_{\hat{u}}(k) &\ge \sum_{t=\tau+1}^{\tau+\delta} \prod_{k=\tau+1}^{t} q_u(k) \\
q_{\hat{u}}(\tau+1) + q_{\hat{u}}(\tau+1) \sum_{t=\tau+2}^{\tau+\delta} \prod_{k=\tau+2}^{t} q_{\hat{u}}(k) &\ge \sum_{t=\tau+1}^{\tau+\delta-1} \prod_{k=\tau+1}^{t} q_u(k) + q_u(\tau+\delta) \prod_{k=\tau+1}^{\tau+\delta-1} q_u(k) \\
q_u(\tau+\delta) - q_u(\tau+\delta) \prod_{k=\tau+1}^{\tau+\delta-1} q_u(k) &\ge \sum_{t=\tau+1}^{\tau+\delta-1} \prod_{k=\tau+1}^{t} q_u(k) - q_u(\tau+\delta) \sum_{t=\tau+1}^{\tau+\delta-1} \prod_{k=\tau+1}^{t} q_u(k) \\
\frac{q_u(\tau+\delta)}{1 - q_u(\tau+\delta)} &\ge \frac{\sum_{t=\tau+1}^{\tau+\delta-1} \prod_{k=\tau+1}^{t} q_u(k)}{1 - \prod_{k=\tau+1}^{\tau+\delta-1} q_u(k)}.
\end{aligned}
$$

Likewise, optimality of $u$ implies that the expected cost of policy $\tilde{u}$ must also be at least as high as the cost of $u$:

$$
\begin{aligned}
\mathbb{E}[C_{\tilde{u}}] &\ge \mathbb{E}[C_u] \\
\sum_{t=0}^{N-1} \prod_{k=0}^{t} q_{\tilde{u}}(k) &\ge \sum_{t=0}^{N-1} \prod_{k=0}^{t} q_u(k) \\
\sum_{t=\tau}^{\tau+\delta-1} \prod_{k=\tau}^{t} q_{\tilde{u}}(k) &\ge \sum_{t=\tau}^{\tau+\delta-1} \prod_{k=\tau}^{t} q_u(k) \\
\sum_{t=\tau}^{\tau+\delta-2} \prod_{k=\tau}^{t} q_{\tilde{u}}(k) + q_{\tilde{u}}(\tau+\delta-1) \prod_{k=\tau}^{\tau+\delta-2} q_{\tilde{u}}(k) &\ge q_u(\tau) + q_u(\tau) \sum_{t=\tau+1}^{\tau+\delta-1} \prod_{k=\tau+1}^{t} q_u(k) \\
\sum_{t=\tau+1}^{\tau+\delta-1} \prod_{k=\tau+1}^{t} q_u(k) - q_u(\tau) \sum_{t=\tau+1}^{\tau+\delta-1} \prod_{k=\tau+1}^{t} q_u(k) &\ge q_u(\tau) - q_u(\tau) \prod_{k=\tau+1}^{\tau+\delta-1} q_u(k) \\
\frac{\sum_{t=\tau+1}^{\tau+\delta-1} \prod_{k=\tau+1}^{t} q_u(k)}{1 - \prod_{k=\tau+1}^{\tau+\delta-1} q_u(k)} &\ge \frac{q_u(\tau)}{1 - q_u(\tau)}.
\end{aligned}
$$

Combining these two conditions, we have

$$
\begin{aligned}
\frac{q_u(\tau)}{1 - q_u(\tau)} &\le \frac{\sum_{t=\tau+1}^{\tau+\delta-1} \prod_{k=\tau+1}^{t} q_u(k)}{1 - \prod_{k=\tau+1}^{\tau+\delta-1} q_u(k)} \le \frac{q_u(\tau+\delta)}{1 - q_u(\tau+\delta)} \\
\frac{N_i - \varphi_i\,(x_i(\tau)+1)\,N_M}{\varphi_i N_M} &\le \frac{\sum_{t=\tau+1}^{\tau+\delta-1} \prod_{k=\tau+1}^{t} q_u(k)}{1 - \prod_{k=\tau+1}^{\tau+\delta-1} q_u(k)} \le \frac{N_i - \varphi_i\,(x_i(\tau)+2)\,N_M}{\varphi_i N_M}.
\end{aligned}
$$

However, under the minimal assumptions that $\varphi_i$, $N_i$ and $N_M$ are positive, the inequality

$$
\frac{N_i - \varphi_i\,(x_i(\tau)+1)\,N_M}{\varphi_i N_M} > \frac{N_i - \varphi_i\,(x_i(\tau)+2)\,N_M}{\varphi_i N_M}
$$

is strict, which provides a contradiction.

Now suppose optimal policy $u$ satisfies condition (1) but does not satisfy condition (2), i.e., there exists a former friend $i \in \mathcal{V}$ for which

$$
\frac{N_i}{N_M \varphi_i} - \frac{1}{2} \lceil N_i/N_M \rceil > \frac{N_i (1 - \varphi_i)}{\varphi_i \left( N_i - \lceil N_i/N_M \rceil N_M + N_M \right)}
$$

that is not queried in a single block. Let $\tau - \lceil N_i/N_M \rceil + 2$ be the first stage in policy $u$ in which former friend $i$ is queried. Because the policy satisfies condition (1), it follows that

$$
u_t = i, \qquad t = \tau - \lceil N_i/N_M \rceil + 2,\; \tau - \lceil N_i/N_M \rceil + 3,\; \ldots,\; \tau.
$$

Also, let $\tau+\delta$ be the stage corresponding to the final query of former friend $i$. By assumption this final query is not executed in succession with the first $\lceil N_i/N_M \rceil - 1$ queries of $i$, so $\delta > 1$. As in the previous part of the proof, we define two alternative policies, each of which moves queries of $i$ so that all of them are executed in a single block:

$$
\hat{u}_t =
\begin{cases}
u_{t-1} & t = \tau+1, \ldots, \tau+\delta \\
u_t & \text{otherwise}
\end{cases}
\qquad
\tilde{u}_t =
\begin{cases}
u_{t + \lceil N_i/N_M \rceil - 1} & t = \tau - \lceil N_i/N_M \rceil + 2, \ldots, \tau - \lceil N_i/N_M \rceil + \delta \\
i & t = \tau - \lceil N_i/N_M \rceil + \delta + 1, \ldots, \tau+\delta \\
u_t & \text{otherwise.}
\end{cases}
$$

The relationship between the query failure probabilities follows from these policy definitions:

$$
q_{\hat{u}}(t) =
\begin{cases}
q_u(\tau+\delta) & t = \tau+1 \\
q_u(t-1) & t = \tau+2, \ldots, \tau+\delta \\
q_u(t) & \text{otherwise}
\end{cases}
\qquad
q_{\tilde{u}}(t) =
\begin{cases}
q_u(t + \lceil N_i/N_M \rceil - 1) & t = \tau - \lceil N_i/N_M \rceil + 2, \ldots, \tau - \lceil N_i/N_M \rceil + \delta \\
q_u(t - \delta + 1) & t = \tau - \lceil N_i/N_M \rceil + \delta + 1, \ldots, \tau+\delta-1 \\
q_u(t) & \text{otherwise.}
\end{cases}
$$

Note also that, from equation (6),

$$
\begin{aligned}
\prod_{k=\tau - \lceil N_i/N_M \rceil + 2}^{\tau - \lceil N_i/N_M \rceil + 2 + t} q_u(k) &= \frac{N_i - \varphi_i (t+1) N_M}{N_i}, \qquad t = 0, 1, \ldots, \lceil N_i/N_M \rceil - 2 \\
\prod_{t=\tau - \lceil N_i/N_M \rceil + 2}^{\tau} q_u(t) &= \frac{N_i - \varphi_i \left( \lceil N_i/N_M \rceil - 1 \right) N_M}{N_i} \\
\sum_{t=\tau - \lceil N_i/N_M \rceil + 2}^{\tau} \prod_{k=\tau - \lceil N_i/N_M \rceil + 2}^{t} q_u(k) &= \left( \lceil N_i/N_M \rceil - 1 \right) - \frac{\varphi_i N_M}{2 N_i} \lceil N_i/N_M \rceil \left( \lceil N_i/N_M \rceil - 1 \right) \\
q_u(\tau+\delta) &= \frac{(1 - \varphi_i) N_i}{N_i - \varphi_i \left( \lceil N_i/N_M \rceil - 1 \right) N_M}.
\end{aligned}
$$

As in the previous part of the proof, we compare the costs of the policies:

$$
\begin{aligned}
\mathbb{E}[C_{\hat{u}}] &\ge \mathbb{E}[C_u] \\
\sum_{t=0}^{N-1} \prod_{k=0}^{t} q_{\hat{u}}(k) &\ge \sum_{t=0}^{N-1} \prod_{k=0}^{t} q_u(k) \\
\sum_{t=\tau+1}^{\tau+\delta} \prod_{k=\tau+1}^{t} q_{\hat{u}}(k) &\ge \sum_{t=\tau+1}^{\tau+\delta} \prod_{k=\tau+1}^{t} q_u(k) \\
q_{\hat{u}}(\tau+1) + q_{\hat{u}}(\tau+1) \sum_{t=\tau+2}^{\tau+\delta} \prod_{k=\tau+2}^{t} q_{\hat{u}}(k) &\ge \sum_{t=\tau+1}^{\tau+\delta-1} \prod_{k=\tau+1}^{t} q_u(k) + q_u(\tau+\delta) \prod_{k=\tau+1}^{\tau+\delta-1} q_u(k) \\
q_u(\tau+\delta) - q_u(\tau+\delta) \prod_{k=\tau+1}^{\tau+\delta-1} q_u(k) &\ge \sum_{t=\tau+1}^{\tau+\delta-1} \prod_{k=\tau+1}^{t} q_u(k) - q_u(\tau+\delta) \sum_{t=\tau+1}^{\tau+\delta-1} \prod_{k=\tau+1}^{t} q_u(k) \\
\frac{q_u(\tau+\delta)}{1 - q_u(\tau+\delta)} &\ge \frac{\sum_{t=\tau+1}^{\tau+\delta-1} \prod_{k=\tau+1}^{t} q_u(k)}{1 - \prod_{k=\tau+1}^{\tau+\delta-1} q_u(k)}.
\end{aligned}
$$

Likewise, optimality of $u$ implies that the expected cost of policy $\tilde{u}$ must also be at least as high as the cost of $u$:

$$
\begin{aligned}
&\mathbb{E}[C_{\tilde{u}}] \ge \mathbb{E}[C_u] \\
&\sum_{t=0}^{N-1} \prod_{k=0}^{t} q_{\tilde{u}}(k) \ge \sum_{t=0}^{N-1} \prod_{k=0}^{t} q_u(k) \\
&\sum_{t=\tau - \lceil N_i/N_M \rceil + 2}^{\tau+\delta-1} \; \prod_{k=\tau - \lceil N_i/N_M \rceil + 2}^{t} q_{\tilde{u}}(k) \ge \sum_{t=\tau - \lceil N_i/N_M \rceil + 2}^{\tau+\delta-1} \; \prod_{k=\tau - \lceil N_i/N_M \rceil + 2}^{t} q_u(k) \\
&\sum_{t=\tau - \lceil N_i/N_M \rceil + 2}^{\tau - \lceil N_i/N_M \rceil + \delta} \; \prod_{k=\tau - \lceil N_i/N_M \rceil + 2}^{t} q_{\tilde{u}}(k) + \left( \prod_{k=\tau - \lceil N_i/N_M \rceil + 2}^{\tau - \lceil N_i/N_M \rceil + \delta} q_{\tilde{u}}(k) \right) \sum_{t=\tau - \lceil N_i/N_M \rceil + \delta + 1}^{\tau+\delta-1} \; \prod_{k=\tau - \lceil N_i/N_M \rceil + \delta + 1}^{t} q_{\tilde{u}}(k) \\
&\qquad \ge \sum_{t=\tau - \lceil N_i/N_M \rceil + 2}^{\tau} \; \prod_{k=\tau - \lceil N_i/N_M \rceil + 2}^{t} q_u(k) + \left( \prod_{k=\tau - \lceil N_i/N_M \rceil + 2}^{\tau} q_u(k) \right) \sum_{t=\tau+1}^{\tau+\delta-1} \prod_{k=\tau+1}^{t} q_u(k) \\
&\sum_{t=\tau+1}^{\tau+\delta-1} \prod_{k=\tau+1}^{t} q_u(k) + \left( \prod_{k=\tau+1}^{\tau+\delta-1} q_u(k) \right) \sum_{t=\tau - \lceil N_i/N_M \rceil + 2}^{\tau} \; \prod_{k=\tau - \lceil N_i/N_M \rceil + 2}^{t} q_u(k) \\
&\qquad \ge \sum_{t=\tau - \lceil N_i/N_M \rceil + 2}^{\tau} \; \prod_{k=\tau - \lceil N_i/N_M \rceil + 2}^{t} q_u(k) + \left( \prod_{k=\tau - \lceil N_i/N_M \rceil + 2}^{\tau} q_u(k) \right) \sum_{t=\tau+1}^{\tau+\delta-1} \prod_{k=\tau+1}^{t} q_u(k) \\
&\frac{\sum_{t=\tau+1}^{\tau+\delta-1} \prod_{k=\tau+1}^{t} q_u(k)}{1 - \prod_{k=\tau+1}^{\tau+\delta-1} q_u(k)} \ge \frac{\sum_{t=\tau - \lceil N_i/N_M \rceil + 2}^{\tau} \prod_{k=\tau - \lceil N_i/N_M \rceil + 2}^{t} q_u(k)}{1 - \prod_{k=\tau - \lceil N_i/N_M \rceil + 2}^{\tau} q_u(k)}.
\end{aligned}
$$

Combining these two conditions, we have

$$
\frac{\sum_{t=\tau - \lceil N_i/N_M \rceil + 2}^{\tau} \prod_{k=\tau - \lceil N_i/N_M \rceil + 2}^{t} q_u(k)}{1 - \prod_{k=\tau - \lceil N_i/N_M \rceil + 2}^{\tau} q_u(k)}
\le \frac{\sum_{t=\tau+1}^{\tau+\delta-1} \prod_{k=\tau+1}^{t} q_u(k)}{1 - \prod_{k=\tau+1}^{\tau+\delta-1} q_u(k)}
\le \frac{q_u(\tau+\delta)}{1 - q_u(\tau+\delta)}.
$$

This inequality implies that

$$
\begin{aligned}
\frac{\sum_{t=\tau - \lceil N_i/N_M \rceil + 2}^{\tau} \prod_{k=\tau - \lceil N_i/N_M \rceil + 2}^{t} q_u(k)}{1 - \prod_{k=\tau - \lceil N_i/N_M \rceil + 2}^{\tau} q_u(k)} &\le \frac{q_u(\tau+\delta)}{1 - q_u(\tau+\delta)} \\
\frac{N_i}{\varphi_i N_M} - \frac{1}{2} \lceil N_i/N_M \rceil &\le \frac{(1 - \varphi_i) N_i}{\varphi_i \left( N_i - \lceil N_i/N_M \rceil N_M + N_M \right)},
\end{aligned}
$$

which is a contradiction.

B. Proof of Theorem 2

First observe that a policy satisfying the condition in Theorem 2 always exists. Such a policy can be constructed algorithmically by picking the former friend $i$:

$$
i = \arg\min_{j \in \mathcal{V}} \gamma(x(t), j)
$$

and querying $i$ successively until a stage $t'$ is reached for which $\gamma(x(t'), i) \ne \gamma(x(t), i)$. At this stage a new former friend is chosen for querying according to the same criterion.

An important characteristic of $\gamma(x(t), j)$ is that it is nondecreasing in $t$ for all $j \in \mathcal{V}$. This property implies that for any policy $u$ that satisfies the condition in Theorem 2, $\gamma(x(t), u_t) \le \gamma(x(t+1), u_{t+1})$.

We now show by contradiction that a policy which does not meet the condition of Theorem 2 cannot be optimal. We consider only policies that meet the condition of Theorem 1, as we have shown this condition to be necessary for optimality. Suppose optimal policy $u$ meets the necessary condition for optimality in Theorem 1 but does not meet the condition of Theorem 2.
Then, there must be at least one stage $\tau$ in which $\gamma(x(\tau), u_\tau) > \gamma(x(\tau+1), u_{\tau+1})$. Because $\gamma(x(t), j)$ is nondecreasing in $t$ for all $j$, this condition implies $u_\tau \ne u_{\tau+1}$. For clarity of notation, assume that $u_\tau = i$ and $u_{\tau+1} = j$. We construct an alternate policy in which the order of these former friends $i$ and $j$ is reversed. Let $\ell$ be the earliest stage for which $\gamma(x(\ell), i) = \gamma(x(\tau), i)$ and $u_t = i \;\forall\, t \in \{\ell, \ell+1, \ldots, \tau\}$. Also let $L$ be the latest stage for which $\gamma(x(L), j) = \gamma(x(\tau+1), j)$ and $u_t = j \;\forall\, t \in \{\tau+1, \tau+2, \ldots, L\}$. Let $\delta = \tau - \ell + 1$ be the number of successive stages that $i$ is queried in this sequence and let $\Delta = L - \tau$ be the number of successive stages that $j$ is queried in this sequence. In our alternative policy $\tilde{u}$, we let

$$
\tilde{u}_t =
\begin{cases}
u_t & t = 0, \ldots, \ell-1,\; L+1, \ldots, N-1 \\
j & t \in \{\ell, \ell+1, \ldots, \ell+\Delta-1\} \\
i & t \in \{\ell+\Delta, \ell+\Delta+1, \ldots, L\}.
\end{cases}
$$

This relationship implies

$$
q_{\tilde{u}}(t) =
\begin{cases}
q_u(t) & t = 0, \ldots, \ell-1,\; L+1, \ldots, N-1 \\
q_u(t+\delta) & t \in \{\ell, \ell+1, \ldots, \ell+\Delta-1\} \\
q_u(t-\Delta) & t \in \{\ell+\Delta, \ell+\Delta+1, \ldots, L\}.
\end{cases}
$$

From the optimality of $u$, we have

$$
\begin{aligned}
\mathbb{E}[C_{\tilde{u}}] &\ge \mathbb{E}[C_u] \\
\sum_{t=0}^{N-1} \prod_{k=0}^{t} q_{\tilde{u}}(k) &\ge \sum_{t=0}^{N-1} \prod_{k=0}^{t} q_u(k) \\
\sum_{t=\ell}^{L} \prod_{k=\ell}^{t} q_{\tilde{u}}(k) &\ge \sum_{t=\ell}^{L} \prod_{k=\ell}^{t} q_u(k) \\
\sum_{t=\ell}^{\ell+\Delta-1} \prod_{k=\ell}^{t} q_{\tilde{u}}(k) + \left( \prod_{k=\ell}^{\ell+\Delta-1} q_{\tilde{u}}(k) \right) \sum_{t=\ell+\Delta}^{L} \prod_{k=\ell+\Delta}^{t} q_{\tilde{u}}(k) &\ge \sum_{t=\ell}^{\tau} \prod_{k=\ell}^{t} q_u(k) + \left( \prod_{k=\ell}^{\tau} q_u(k) \right) \sum_{t=\tau+1}^{L} \prod_{k=\tau+1}^{t} q_u(k) \\
\sum_{t=\tau+1}^{L} \prod_{k=\tau+1}^{t} q_u(k) + \left( \prod_{k=\tau+1}^{L} q_u(k) \right) \sum_{t=\ell}^{\tau} \prod_{k=\ell}^{t} q_u(k) &\ge \sum_{t=\ell}^{\tau} \prod_{k=\ell}^{t} q_u(k) + \left( \prod_{k=\ell}^{\tau} q_u(k) \right) \sum_{t=\tau+1}^{L} \prod_{k=\tau+1}^{t} q_u(k)
\end{aligned}
$$

$$
\frac{\sum_{t=\tau+1}^{L} \prod_{k=\tau+1}^{t} q_u(k)}{1 - \prod_{k=\tau+1}^{L} q_u(k)} \ge \frac{\sum_{t=\ell}^{\tau} \prod_{k=\ell}^{t} q_u(k)}{1 - \prod_{k=\ell}^{\tau} q_u(k)}. \tag{10}
$$

We have multiple cases to consider when comparing these policy costs. Consider the sequence of queries of former friend $i$, starting in stage $\ell$ and ending in stage $\tau$. Our assumptions on policy $u$ (that it satisfies Theorem 1) and our method of selecting stage $\ell$ allow for three distinct possibilities:

Case 1. Stage $\ell$ is the first query of $i$ in policy $u$ and stage $\tau$ is the $\left( \lceil N_i/N_M \rceil - 1 \right)$th query of $i$. By adhering to the necessary conditions for optimality in Theorem 1, this case implies that

$$
\frac{N_i}{\varphi_i N_M} - \frac{1}{2} \lceil N_i/N_M \rceil \le \frac{(1 - \varphi_i) N_i}{\varphi_i \left( N_i - \lceil N_i/N_M \rceil N_M + N_M \right)}.
$$

Observe that in this case the quantity

$$
\frac{\sum_{t=\ell}^{\tau} \prod_{k=\ell}^{t} q_u(k)}{1 - \prod_{k=\ell}^{\tau} q_u(k)}
= \frac{\left( \lceil N_i/N_M \rceil - 1 \right) - \dfrac{\varphi_i N_M}{2 N_i} \left( \lceil N_i/N_M \rceil - 1 \right) \lceil N_i/N_M \rceil}{\dfrac{\varphi_i \left( \lceil N_i/N_M \rceil - 1 \right) N_M}{N_i}}
= \frac{N_i}{\varphi_i N_M} - \frac{1}{2} \lceil N_i/N_M \rceil
= \gamma(x(\tau), i).
$$

Case 2. Stage $\ell < \tau$ is the first query of $i$ in policy $u$ and stage $\tau$ is the final (or $\lceil N_i/N_M \rceil$th) query of $i$. Equality of $\gamma(x(t), i)$ across these stages implies

$$
\frac{N_i}{\varphi_i N_M} - \frac{1}{2} \lceil N_i/N_M \rceil \ge \frac{(1 - \varphi_i) N_i}{\varphi_i \left( N_i - \lceil N_i/N_M \rceil N_M + N_M \right)}.
$$

Observe that in this case the quantity

$$
\frac{\sum_{t=\ell}^{\tau} \prod_{k=\ell}^{t} q_u(k)}{1 - \prod_{k=\ell}^{\tau} q_u(k)}
= \frac{\left( \lceil N_i/N_M \rceil - 1 \right) - \dfrac{\varphi_i N_M}{2 N_i} \left( \lceil N_i/N_M \rceil - 1 \right) \lceil N_i/N_M \rceil + (1 - \varphi_i)}{\varphi_i}
= \frac{1}{\varphi_i} \lceil N_i/N_M \rceil - \frac{N_M}{2 N_i} \lceil N_i/N_M \rceil \left( \lceil N_i/N_M \rceil - 1 \right) - 1
= \gamma(x(\tau), i).
$$

Case 3. Stage $\ell = \tau$ is the final query of $i$ in policy $u$. If this is the case, then

$$
\gamma(x(\tau), i) = \frac{N_i (1 - \varphi_i)}{\varphi_i \left( N_i - \lceil N_i/N_M \rceil N_M + N_M \right)},
$$

irrespective of whether this query is the only query of $i$. Observe that in this final case,

$$
\frac{\sum_{t=\ell}^{\tau} \prod_{k=\ell}^{t} q_u(k)}{1 - \prod_{k=\ell}^{\tau} q_u(k)}
= \frac{\dfrac{(1 - \varphi_i) N_i}{N_i - \varphi_i \left( \lceil N_i/N_M \rceil - 1 \right) N_M}}{\dfrac{N_i - \varphi_i \left( \lceil N_i/N_M \rceil - 1 \right) N_M - (1 - \varphi_i) N_i}{N_i - \varphi_i \left( \lceil N_i/N_M \rceil - 1 \right) N_M}}
= \frac{(1 - \varphi_i) N_i}{\varphi_i \left( N_i - \lceil N_i/N_M \rceil N_M + N_M \right)}
= \gamma(x(\tau), i).
$$
Note that if stage $\tau$ corresponds to the only query of $i$ in policy $u$, then $\lceil N_i/N_M \rceil = 1$ and this expression reduces to $\frac{1 - \varphi_i}{\varphi_i}$, which is correct.

These three cases similarly apply to the sequence of queries of former friend $j$ in stages $\tau+1, \ldots, L$. Therefore, in all cases equation (10) reduces to

$$
\begin{aligned}
\frac{\sum_{t=\tau+1}^{L} \prod_{k=\tau+1}^{t} q_u(k)}{1 - \prod_{k=\tau+1}^{L} q_u(k)} &\ge \frac{\sum_{t=\ell}^{\tau} \prod_{k=\ell}^{t} q_u(k)}{1 - \prod_{k=\ell}^{\tau} q_u(k)} \\
\gamma(x(\tau+1), j) &\ge \gamma(x(\tau), i),
\end{aligned}
$$

which is a contradiction and shows that Theorem 2 provides a necessary condition for optimality.

To show that this condition is sufficient for optimality, suppose now that policy $u$ satisfies the conditions in Theorems 1 and 2, but that it is not optimal. This assumption implies that there is another policy, $u^\star$, with a lower cost. From our previous arguments, $u^\star$ must also satisfy the conditions in Theorems 1 and 2. Because $\gamma(x(t), i)$ is nondecreasing in $t$ for all $i \in \mathcal{V}$, the possible differences between the policies $u$ and $u^\star$ are in stages where ties exist, i.e.,

$$
\exists\, i, j \in \mathcal{V} : i \ne j, \quad \gamma(x(t), i) = \gamma(x(t), j).
$$

However, it follows from our development above that these reorderings do not result in a change in cost. In other words, all policies that meet the conditions in Theorems 1 and 2 result in the same cost and therefore must be optimal.

C. Classification Threshold Sensitivity

We provide a brief discussion of the sensitivity of the results to changes in the classification threshold. In the previous section, we selected threshold P = 0.782 based on the shape of the ROC curve and our desire to keep the number of false positive classifications low. We now consider how different values of threshold P affect the "paired accounts" graph depicted in Figure 5. Figure 9 gives several properties of the "paired account" graph as a function of P.
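A threshold sweep of this kind can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the code used to produce Figure 9. The pair scores stand in for the same-user probabilities from the regression model, the function names and dictionary keys are hypothetical, "connected accounts" is taken to mean accounts with at least one edge, and the clustering coefficient is computed exactly as the mean local clustering coefficient, whereas the text refers to an estimated value.

```python
from itertools import combinations
from typing import Dict, Iterable, List, Set, Tuple

Pair = Tuple[str, str]
Graph = Dict[str, Set[str]]


def build_graph(nodes: Iterable[str], scores: Dict[Pair, float], threshold: float) -> Graph:
    """Adjacency sets with an edge wherever the pair score is >= threshold."""
    adj: Graph = {n: set() for n in nodes}
    for (a, b), score in scores.items():
        if score >= threshold:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def component_sizes(adj: Graph) -> List[int]:
    """Sizes of the connected components, largest first (iterative DFS)."""
    seen: Set[str] = set()
    sizes: List[int] = []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            node = stack.pop()
            size += 1
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        sizes.append(size)
    return sorted(sizes, reverse=True)


def average_clustering(adj: Graph) -> float:
    """Mean local clustering coefficient; nodes of degree < 2 contribute 0."""
    if not adj:
        return 0.0
    total = 0.0
    for nbrs in adj.values():
        degree = len(nbrs)
        if degree < 2:
            continue
        # Count neighbour pairs that are themselves connected.
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2.0 * links / (degree * (degree - 1))
    return total / len(adj)


def graph_properties(nodes: Iterable[str], scores: Dict[Pair, float], threshold: float) -> Dict[str, float]:
    """Summary statistics of the paired-accounts graph at one threshold."""
    adj = build_graph(nodes, scores, threshold)
    sizes = component_sizes(adj)
    return {
        "connected_accounts": sum(1 for nbrs in adj.values() if nbrs),
        "giant_component": sizes[0] if sizes else 0,
        "avg_clustering": average_clustering(adj),
    }


if __name__ == "__main__":
    # Toy scores for four hypothetical accounts.
    nodes = ["a", "b", "c", "d"]
    scores = {("a", "b"): 0.9, ("b", "c"): 0.9, ("a", "c"): 0.8, ("c", "d"): 0.7}
    for p in (0.7, 0.75, 0.85):
        print(p, graph_properties(nodes, scores, p))
```

Calling `graph_properties` over a grid of thresholds yields one curve per property, which is the kind of summary the figure reports.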
As we would expect, when our classification threshold P = 0 the graph is fully connected, which indicates that all accounts are classified as belonging to the same user. As P increases, the number of connected accounts and the size of the giant component decrease rapidly.

Of interest is the estimated average clustering coefficient, measured on the right-hand scale in Figure 9. If we had access to the true classifications so that we could produce a graph of connected accounts that belonged to the same users, each component would be fully connected. Average clustering provides a measure of how much a graph exhibits this property by estimating how often a triad of connected nodes is fully connected. We see from Figure 9 that the average clustering coefficient is relatively stable for a wide range of threshold values, but as P increases beyond approximately 0.85 we observe an increase in the average clustering coefficient that suggests that there are clusters of profiles in our data that are all very similar. Component A, indicated in Figure 5 and enumerated in Table 7, is an example of such a cluster. There are other fully connected clusters in Figure 5 consisting of more nodes. These clusters represent users who open many Twitter accounts and retain very similar profile features. Further investigation of these accounts reveals that they are nearly all suspended, suggesting that account suspensions are the driving force behind the creation of these multiple accounts. As noted earlier, in at least some cases these accounts are created by high-profile jihadists.

Figure 9  Paired accounts graph properties as a function of threshold P. The threshold value 0.782 from equation (3) is indicated on the plot.

Decreasing P from 0.782 appears to rapidly increase the number of false positive classifications.
This result becomes quickly apparent in the appearance of a large but loosely connected component in the paired graph structure. For example, reducing the classification threshold to P = 0.668 (indicated on the ROC plot in Figure 4) increases the profile pairs classified as belonging to the same user to 455. In many cases, these additional pairs appear to be correct classifications. For example, component B in Figure 5 appears as a fully connected component using this threshold. However, we also observe the formation of the loosely connected component indicated as "component C" in Figure 10. Table 13 shows the profile features for the accounts comprising this component, which appear to belong to several different users.

Figure 10  Graph representation of accounts belonging to the same user using our regression model and equation (3) with a threshold of 0.668.

Table 13  Accounts comprising component C.

Screen Name         Name               Profile Pic
AAbuAAwlaki         Abu Awlaki         [None]
abu alia2           abu alia           [None]
Abdullah4510394     Abdullah           [None]
abu abdillah12      Abu Abdullah       [None]
dewdropz69          Abdullah           [None]
Ummabdullaa         Umm Abdullah       [None]
abouabdullah7       abou abdullah      ff...e0
AbuAbdullah1400     Abu Abdullah       ff...ff
abouosama6          Abouosama          [None]
Abuusamah17         Abu usamah         [None]
AbuIabulfida        Abu Abdullah       e1...00
AbuAyman2011        Abu Ayman          [None]
AbuMuhammad1503     Abu Muhammad       [None]
abu malhama4        Abu Malhama        [None]
moabibkhab          abu hamad          [None]
nahida muhammad     Nahida muhammad    [None]
abumusab musab      Abu musab          [None]
xcon cp dc          Abu Musa           [None]
AbuSaalihah06       Abu Saalihah       [None]
AbuSaalihah07       Abu Saalihah       00...00
AbuSaalihah08       Abu Saalihah       00...00
AbuSaalihah13       Abu Saalihah       00...00
Abu swaaliha        abu swaaliha       1e...c3
Abu Malhama5        Abu Malhama        bf...00
omertalhaa          Abu Talha          [None]
islamobjective      Abu Ramadi         [None]

D. Features for Refollowing Model

The complete list of features used in the refollow model from Section 5 is listed below.
• Friend's number of Twitter friends (Log).
• Friend's number of Twitter followers (Log).
• Friend's number of Tweets (Log).
• Account age difference between Friend and User0.
• Binary indicator of whether Friend was following User0.
• Number of times User0 mentioned Friend in a tweet (Log).
• Number of times User0 retweeted one of Friend's tweets (Log).
• Number of times User0 replied to one of Friend's tweets (Log).
• User0's number of Twitter friends (Log).
• User0's number of Twitter followers (Log).
• User0's number of Tweets (Log).
• User0's number of favorite tweets (Log).
• User0's total number of retweets (Log).
• Average number of friends of User0's friends (Log).
• Median number of friends of User0's friends (Log).
• Standard deviation of the number of friends of User0's friends (Log).
• Average number of followers of User0's friends (Log).
• Median number of followers of User0's friends (Log).
• Standard deviation of the number of followers of User0's friends (Log).
• Average number of tweets of User0's friends (Log).
• Median number of tweets of User0's friends (Log).
• Standard deviation of the number of tweets of User0's friends (Log).
• Average number of favorite tweets of User0's friends (Log).
• Median number of favorite tweets of User0's friends (Log).
• Standard deviation of the number of favorite tweets of User0's friends (Log).
• Binary indicator of whether Friend's account authenticity had been verified by Twitter.
• Fraction of User0's friends that had account authenticity verified by Twitter.
• Binary indicator of whether Friend and User0 had the same account language setting.
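As an illustration of how a few of the listed features could be assembled, the following is a minimal Python sketch, not the authors' pipeline. The list does not state how the "(Log)" transform treats zero counts or whether it is applied before or after aggregation, so the sketch assumes log(1 + value) applied after aggregation and a population standard deviation. All function names, dictionary keys, and units such as `age_days` are hypothetical.

```python
import math
import statistics
from typing import Dict, Sequence


def log_count(value: float) -> float:
    """log(1 + value); the +1 offset is an assumption so zero counts are defined."""
    return math.log1p(value)


def neighbourhood_features(counts: Sequence[int], label: str) -> Dict[str, float]:
    """Log-scaled mean, median and standard deviation of one count over User0's friends."""
    return {
        f"mean_{label}_log": log_count(statistics.mean(counts)),
        f"median_{label}_log": log_count(statistics.median(counts)),
        f"std_{label}_log": log_count(statistics.pstdev(counts)),
    }


def pair_features(friend: Dict[str, object], user0: Dict[str, object]) -> Dict[str, float]:
    """A subset of the Friend/User0 pair features from the list above."""
    return {
        "friend_friends_log": log_count(friend["friends"]),
        "friend_followers_log": log_count(friend["followers"]),
        "friend_tweets_log": log_count(friend["tweets"]),
        "account_age_diff": float(friend["age_days"] - user0["age_days"]),
        "friend_followed_user0": float(bool(friend["followed_user0"])),
        "same_language": float(friend["lang"] == user0["lang"]),
    }


if __name__ == "__main__":
    # Hypothetical profile values for one Friend/User0 pair.
    friend = {"friends": 120, "followers": 4500, "tweets": 800,
              "age_days": 400, "followed_user0": True, "lang": "en"}
    user0 = {"age_days": 90, "lang": "en"}
    features = pair_features(friend, user0)
    features.update(neighbourhood_features([0, 9, 99], "followers"))
    print(features)
```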
