Revealing social networks of spammers through spectral clustering

To date, most studies on spam have focused only on the spamming phase of the spam cycle and have ignored the harvesting phase, which consists of the mass acquisition of email addresses. It has been observed that spammers conceal their identity to a l…

Authors: Kevin S. Xu, Mark Kliger, Yilun Chen, Peter J. Woolf, Alfred O. Hero III

Revealing Social Networks of Spammers Through Spectral Clustering

Kevin S. Xu∗, Mark Kliger†, Yilun Chen∗, Peter J. Woolf∗, and Alfred O. Hero III∗
∗University of Michigan, Ann Arbor, MI 48109 USA
†Medasense Biometrics Ltd., PO Box 633, Ofakim, 87516 Israel
∗{xukevin,yilun,pwoolf,hero}@umich.edu, †mark@medasense.com

Abstract: To date, most studies on spam have focused only on the spamming phase of the spam cycle and have ignored the harvesting phase, which consists of the mass acquisition of email addresses. It has been observed that spammers conceal their identity to a lesser degree in the harvesting phase, so it may be possible to gain new insights into spammers' behavior by studying the behavior of harvesters, which are individuals or bots that collect email addresses. In this paper, we reveal social networks of spammers by identifying communities of harvesters with high behavioral similarity using spectral clustering. The data analyzed was collected through Project Honey Pot, a distributed system for monitoring harvesting and spamming. Our main findings are (1) that most spammers either send only phishing emails or no phishing emails at all, (2) that most communities of spammers also send only phishing emails or no phishing emails at all, and (3) that several groups of spammers within communities exhibit coherent temporal behavior and have similar IP addresses. Our findings reveal some previously unknown behavior of spammers and suggest that there is indeed social structure between spammers to be discovered.

I. INTRODUCTION

Previous studies on spam have mostly focused on studying its content or its source. Likewise, currently used anti-spam methods mostly involve filtering emails based on their content or by their email server IP address. More recently, there have been studies on the network-level behavior of spammers [1], [2].
However, very little attention has been devoted to studying how spammers acquire the email addresses that they send spam to, a process commonly referred to as harvesting. Harvesting is the first phase of the spam cycle; sending the spam emails to the acquired addresses is the second phase. Spammers send spam emails using spam servers, which are typically compromised computers or open proxies, both of which allow spammers to hide their identities. On the other hand, it has been observed that spammers do not make the same effort to conceal their identities during the harvesting phase [3], indicating that harvesters, which are individuals or bots that collect email addresses, are closely related to the spammers who are sending the spam emails. The harvester and spam server are the two intermediaries in the path of spam, illustrated in Fig. 1.

Fig. 1. The path of spam: from an email address on a web page to a recipient's inbox.

In this paper we try to reveal social networks of spammers by identifying communities of harvesters using data from both phases of the spam cycle. The source of the data analyzed in this paper is Project Honey Pot [4], a web-based network for monitoring harvesting and spamming activity by using trap email addresses. For every spam email received at a trap email address, the Project Honey Pot data set provides us with the IP address of the harvester that acquired the recipient's email address in addition to the IP address of the spam server, which is contained in the header of the email. Spammers make use of both harvesters and spam servers in order to distribute emails to recipients, but the IP address of the harvester that acquired the recipient's email address is typically unknown; it is only through Project Honey Pot that we are able to uncover it. The Project Honey Pot data set is described in detail in Section II.
Project Honey Pot happens to be an ideal data source for studying phishing emails. Phishing is an attempt to fraudulently acquire sensitive information by appearing to represent a trustworthy entity. It is impossible for a trap email address to, for example, sign up for a PayPal account, so all emails supposedly received from financial institutions can immediately be classified as phishing emails. We investigate the prevalence of phishing emails and their distribution among harvesters.

We look for community structure within the network of harvesters by partitioning harvesters into groups such that the harvesters in each group exhibit high behavioral similarity. This is a clustering problem, and we adopt a method commonly referred to as spectral clustering. Identifying community structure not only reveals groups of harvesters that have high behavioral similarity but also groups of spammers who may be socially connected, due to the close relation between harvesters and spammers. We provide an overview of spectral clustering in Section III, and we discuss our choices of behavioral similarity measures in Section IV.

Our main findings are as follows:
1) Most harvesters are either phishers or non-phishers (Section II). We find that most harvesters either send only phishing emails or no phishing emails at all (we define what it means for a harvester to send an email in Section II).
2) Phishers and non-phishers tend to separate into different communities when clustering based on similarity in spam server usage (Section V-A). That is, phishers tend to associate with other phishers, and non-phishers tend to associate with other non-phishers. In particular, phishers appear in small communities with strong ties, which suggests that they are sharing resources (spam servers) with other members of their community.
3) Several groups of harvesters have coherent temporal behavior and similar IP addresses (Section V-B). In particular, we identify a group of ten harvesters that send extremely large amounts of spam and have the same /24 IP address prefix, which happens to be owned by a rogue Internet service provider. This indicates that these harvesters are either the same spammer or a group of spammers operating from the same physical location.

These findings suggest that spammers do indeed form social networks, and we are able to identify meaningful communities.

II. PROJECT HONEY POT

Project Honey Pot is a distributed system for monitoring harvesting and spamming activity via a network of decoy web pages with trap email addresses, known as honey pots. These trap addresses are embedded within the HTML source of a web page and are invisible to human visitors. Spammers typically acquire email addresses either by browsing web sites and looking for them or by running automated harvesting bots that scan the HTML source of web pages and collect email addresses automatically. Since the trap email addresses in the honey pots are invisible to human visitors, Project Honey Pot is trapping only the harvesting bots, and as a result, this is the only type of harvester that we investigate in this paper.

Each time a harvester visits a honey pot, the centralized Project Honey Pot server generates a unique trap email address. The harvester's IP address is recorded and sent to the Project Honey Pot server. The email address embedded into each honey pot is unique, so a particular email address could only have been collected by the visitor to that particular honey pot. Thus, when an email is received at one of the trap addresses, we know exactly who acquired the address.
These email addresses are not published anywhere besides the honey pot, so we can assume that all emails received at these addresses are spam.

As of February 2009, over 35 million trap email addresses, 39 million spam servers, and 59,000 harvesters have been identified by Project Honey Pot [4]. Honey pots are located in over 119 countries. The total number of emails received at the trap email addresses monitored by Project Honey Pot is shown by month in Fig. 2, starting from its inception in October 2004. The number of emails received has been normalized by the number of addresses collected to distinguish between growth of Project Honey Pot and an increase in spam volume. October 2006 is a month of particular interest. Notice that the number of emails received in October 2006 increased significantly from September 2006 and then came back down in November 2006. This observation agrees with media reports of a spam outbreak in October 2006 [5]; thus we will focus our analysis around this time. We refer readers to [3], [4] for additional details on Project Honey Pot.

Fig. 2. Number of emails received by month per address collected. (Horizontal axis: months from October 2004; vertical axis: emails per address; October 2006 is marked.)

In order to discover social networks of spammers, we need to associate emails to the spammers who sent them. Since we do not know the identity of the spammer who sent a particular email, we can associate the email either to the spam server that was used to send it or to the harvester that acquired the recipient's email address. A previous study using the Project Honey Pot data set has suggested that the harvester is more likely to be associated with the spammer than the spam server [3], so we associate each email with the harvester that acquired the recipient's email address.
In particular, this is different from studies [1], [2], which did not involve harvesters and implicitly associated emails with the spam servers that were used to send them. Note that we are not assuming that the harvesters are the spammers themselves. A harvester may collect email addresses for multiple spammers, or a spammer may use multiple harvesters to collect email addresses. Thus, when we say that a particular harvester sends an email, we mean that a spammer who obtained an address collected by this harvester sends an email. To summarize, we are associating emails with harvesters and trying to discover communities of harvesters, which are closely related to communities of actual spammers.

As mentioned previously, Project Honey Pot is an ideal data source for studying phishing emails because the trap email addresses cannot sign up for accounts at financial institutions and other sources that phishing emails fraudulently represent. Note that this is not possible with legitimate email addresses, which may receive legitimate emails from these sources. Since we know that any email mentioning such a source is a phishing email, we can classify each email as phishing or non-phishing based on its content. We classify an email as phishing if its subject contains a commonly used phishing word. The list of such words was built using common phishing words such as "password" and "account" and includes those found in a study on phishing [6] and names of large financial institutions that do business on-line such as PayPal and Chase. In general we find that a small percentage of the spam received through Project Honey Pot consists of phishing emails. As of February 2009, 3.5% of the spam received was phishing spam. We define a phishing level for each harvester as the ratio of the number of phishing emails it sent to the total number of emails it sent.
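The subject-based classification and the phishing level defined above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the keyword set `PHISHING_WORDS` is a small hypothetical stand-in for the full list (which the paper does not reproduce), matching whole words is an assumption, and the function names are made up for this example.

```python
# Hypothetical keyword list. The paper's actual list is longer and is not
# reproduced in the text; only "password", "account", PayPal, and Chase
# are named there.
PHISHING_WORDS = {"password", "account", "paypal", "chase"}


def is_phishing(subject: str) -> bool:
    """Flag an email as phishing if its subject contains a listed word."""
    words = subject.lower().split()
    # Strip surrounding punctuation so "account!" still matches "account".
    cleaned = {w.strip(".,:;!?\"'()[]") for w in words}
    return not PHISHING_WORDS.isdisjoint(cleaned)


def phishing_level(subjects: list[str]) -> float:
    """Ratio of phishing emails to all emails attributed to one harvester."""
    if not subjects:
        raise ValueError("harvester has no emails")
    return sum(is_phishing(s) for s in subjects) / len(subjects)


# Subjects of the emails attributed to one (made-up) harvester.
example = ["Verify your password", "Cheap watches", "Hot deals", "Win now"]
level = phishing_level(example)
```

Here one of the four subjects matches a keyword, so `level` is the ratio of one phishing email to four emails in total.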
An interesting finding is that most harvesters either send only phishing emails or no phishing emails at all. In particular, 14% of harvesters have a phishing level of 0.9 or higher while 77% have a phishing level of 0.1 or lower, with only 9% of harvesters in between. Thus we can label all harvesters as phishers or non-phishers based on their phishing level. We label a harvester as a phisher if its phishing level exceeds 0.5. As of February 2009, about 18% of harvesters were labeled as phishers. We note that phishers send fewer emails on a per-harvester basis than non-phishers, as only 3.5% of emails received were phishing emails as mentioned earlier. The labeling of harvesters as phishers or non-phishers will be used later when interpreting the clustering results.

III. OVERVIEW OF SPECTRAL CLUSTERING

In this paper, we employ spectral clustering to identify groups of harvesters with high behavioral similarity. We choose spectral clustering over other clustering techniques because of its close relation to the graph partitioning problem of minimizing the normalized cut between partitions, which is a natural choice of objective function for community detection as discussed in [7] where it is referred to as conductance.

A. The graph partitioning problem

We represent the network of harvesters by a weighted undirected graph $G = (V, E, W)$ where $V$ is the set of vertices, representing harvesters; $E$ is the set of edges between vertices; and $W = [w_{ij}]_{i,j=1}^{M}$ is the matrix of edge weights with $w_{ij}$ indicating the similarity between harvesters $i$ and $j$. The choice of similarities is discussed in Section IV. $W$ is the adjacency matrix of the graph and is also referred to in the literature as the similarity matrix or affinity matrix. $M = |V|$ is the total number of harvesters.
The total weight of edges between two sets of vertices $A, B \subset V$ is defined by

$$\operatorname{links}(A, B) = \sum_{i \in A} \sum_{j \in B} w_{ij}, \tag{1}$$

and the degree of a set $A$ is defined by

$$\deg(A) = \operatorname{links}(A, V). \tag{2}$$

Our objective is to find highly similar groups of vertices in the graph, which represent harvesters that behave in a similar manner. This is a graph partitioning problem, and our objective translates into minimizing similarity between groups, maximizing similarity within groups, or preferably both.

Let the groups be denoted by $V_1, V_2, \ldots, V_K$ where $K$ denotes the number of groups to partition the graph into. We represent the graph partition by an $M$-by-$K$ partition matrix $X$. Let $X = [x_1, x_2, \ldots, x_K]$ where $x_{ij} = 1$ if harvester $i$ is in cluster $j$ and $x_{ij} = 0$ otherwise. We adopt the normalized cut disassociation measure proposed in [8]. One favorable property of this measure is that minimizing the normalized cut between groups simultaneously maximizes the normalized association within groups. Thus we attempt to minimize the normalized cut by maximizing the normalized association within groups, which is defined by

$$\operatorname{KNassoc}(X) = \frac{1}{K} \sum_{i=1}^{K} \frac{\operatorname{links}(V_i, V_i)}{\deg(V_i)}. \tag{3}$$

B. Finding a near global-optimal solution

Unfortunately, maximizing KNassoc is NP-complete even for $K = 2$ as noted in [8], so we turn to an approximate method. Define the degree matrix $D = \operatorname{diag}(W 1_M)$ where $\operatorname{diag}(\cdot)$ creates a diagonal matrix from its vector argument, and $1_M$ is a vector of $M$ ones. Rewrite links and deg as

$$\operatorname{links}(V_i, V_i) = x_i^T W x_i \tag{4}$$

$$\deg(V_i) = x_i^T D x_i. \tag{5}$$

We can formulate the KNassoc maximization problem as follows:

$$\begin{aligned}
\text{maximize} \quad & \operatorname{KNassoc}(X) = \frac{1}{K} \sum_{i=1}^{K} \frac{x_i^T W x_i}{x_i^T D x_i} && (6) \\
\text{subject to} \quad & X \in \{0, 1\}^{M \times K} && (7) \\
& X 1_K = 1_M. && (8)
\end{aligned}$$

As mentioned earlier, finding the optimal partition matrix $X$ is an NP-complete problem.
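Although the optimization itself is intractable, the objective in (3) is cheap to evaluate for any given partition, which is useful for comparing candidate clusterings. The sketch below is illustrative only: it stores $W$ as nested lists and represents the partition as one cluster index per vertex instead of the 0/1 matrix $X$; the function names and the toy graph are assumptions for this example.

```python
def links(W: list[list[float]], A: list[int], B: list[int]) -> float:
    """Total edge weight between vertex sets A and B, as in Eq. (1)."""
    return sum(W[i][j] for i in A for j in B)


def deg(W: list[list[float]], A: list[int]) -> float:
    """Degree of vertex set A, i.e. links(A, V), as in Eq. (2)."""
    return links(W, A, list(range(len(W))))


def knassoc(W: list[list[float]], labels: list[int]) -> float:
    """Normalized association of Eq. (3) for a hard partition.

    labels[i] is the cluster index of vertex i, which is equivalent to a
    0/1 partition matrix X with exactly one 1 per row. Every cluster is
    assumed to have nonzero degree.
    """
    groups: dict[int, list[int]] = {}
    for vertex, cluster in enumerate(labels):
        groups.setdefault(cluster, []).append(vertex)
    total = sum(links(W, g, g) / deg(W, g) for g in groups.values())
    return total / len(groups)


# Toy graph: two tight pairs {0, 1} and {2, 3} joined by weak edges.
W = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.0, 0.1],
    [0.1, 0.0, 1.0, 0.9],
    [0.0, 0.1, 0.9, 1.0],
]
good = knassoc(W, [0, 0, 1, 1])  # splits along the weak edges
bad = knassoc(W, [0, 1, 0, 1])   # cuts through the strong edges
```

On this toy graph the partition that keeps the strongly tied pairs together scores higher than the one that separates them, which is the behavior the objective is meant to reward.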
A near global-optimal solution can be found by first relaxing a transformed version of $X$ into the continuous domain and finding the optimal continuous partition matrix by solving a generalized eigenvalue problem. This is followed by solving a discretization problem where the closest discrete partition matrix to the optimal continuous partition matrix is sought. We refer interested readers to [9] for details on this method, commonly referred to as spectral clustering.

C. Choosing the number of clusters

As with most clustering algorithms, the proper choice of $K$, the number of clusters, is unknown in spectral clustering. A useful heuristic particularly well-suited for choosing $K$ in spectral clustering problems is the eigengap heuristic. The goal is to choose $K$ such that the highest eigenvalues $\lambda_1, \ldots, \lambda_K$ of the adjacency matrix $W$ are very close to 1 but $\lambda_{K+1}$ is relatively far away from 1. This procedure was justified in [10] and is used to choose $K$ in this paper.

IV. METHODOLOGY

A social network is a social structure composed of nodes, also known as actors, and ties, which indicate the relationships between nodes. We cannot observe direct relationships between harvesters (the actors), so we use indirect relationships as the ties. We explore two types of ties in this paper. Each type of tie corresponds to a similarity measure for choosing the edge weights $w_{ij}$, which indicate the behavioral similarity between harvesters.

Note that the network may evolve over time, so we need to choose a time frame for analysis that is short enough so that we should be able to see this evolution if it is present yet long enough so that we have a large enough sample for the clustering results to be meaningful. There is no clear-cut method for choosing the time frame. As a starting point, we split the data set by month and analyze each month independently.

A. Similarity measures

In this paper, we study two measures of behavioral similarity: similarity in spam server usage and temporal similarity. For both of these similarity measures, we create a coincidence matrix $H$ as an intermediate step to the creation of the adjacency matrix $W$, which is discussed in Section IV-B. The choice of similarity measure is crucial because it determines the topology of the graph. Each similarity measure provides a different view of the social network, so a poor choice of similarity measure may lead to detecting no community structure if harvesters are too similar or too dissimilar.

1) Similarity in spam server usage: We note that harvesters typically send emails through multiple spam servers, so common usage of spam servers is one way to link harvesters. Consider a mixed network of harvesters and spam servers described by the $M \times N$ coincidence matrix $H = [h_{ij}]_{i,j=1}^{M,N}$, where $M$ is the number of harvesters and $N$ is the number of spam servers. We choose $h_{ij} = p_{ij} / (d_j e_i) \in [0, 1]$ where $p_{ij}$ denotes the number of emails sent by harvester $i$ using spam server $j$, $d_j$ denotes the total number of emails sent (by all harvesters) through spam server $j$, and $e_i$ denotes the total number of email addresses harvester $i$ has acquired. $d_j$ is a normalization term that is included to account for the variation in the total number of emails sent through each spam server. For example, a harvester that sent four emails through a spam server which only sent four emails total should indicate a much stronger connection to that spam server than one that sent four emails through a spam server which sent one thousand emails total. $e_i$ is also a normalization term to account for the variation in the number of email addresses each harvester has acquired, based on the assumption that harvesters send an equal amount of spam to each address they have acquired.
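The construction of the spam-server coincidence matrix can be made concrete with a small sketch. It assumes the raw data is available as one `(harvester, spam_server)` pair per received email plus a per-harvester count of acquired addresses; that input layout, the function name, and the toy numbers are assumptions for illustration, not the Project Honey Pot data format.

```python
from collections import Counter


def server_coincidence(
    emails: list[tuple[str, str]],
    addresses_acquired: dict[str, int],
) -> tuple[list[str], list[str], list[list[float]]]:
    """Build H with h_ij = p_ij / (d_j * e_i).

    emails: one (harvester, spam_server) pair per received email.
    addresses_acquired: e_i, the number of addresses each harvester collected.
    Returns the harvester order, the server order, and H as nested lists.
    """
    p = Counter(emails)                          # p_ij: emails per pair
    d = Counter(server for _, server in emails)  # d_j: emails per server
    harvesters = sorted({h for h, _ in emails})
    servers = sorted(d)
    H = [
        [p[(h, s)] / (d[s] * addresses_acquired[h]) for s in servers]
        for h in harvesters
    ]
    return harvesters, servers, H


# Harvester "a" sends four emails through each server. Server "s1" carried
# only those four emails, while "s2" carried one thousand in total.
emails = [("a", "s1")] * 4 + [("a", "s2")] * 4 + [("b", "s2")] * 996
harvesters, servers, H = server_coincidence(emails, {"a": 2, "b": 4})
```

As in the four-versus-one-thousand example in the text, the entry for harvester `a` and server `s1` comes out far larger than its entry for the heavily shared server `s2`, even though the raw email counts are identical.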
We can interpret $h_{ij}$ as harvester $i$'s percentage of usage of spam server $j$ per address it has acquired. The similarity between two harvesters $i_1$ and $i_2$ is the inner product between rows $i_1$ and $i_2$ of $H$.

2) Temporal similarity: Harvesters that exhibit high similarity in their temporal behavior may also indicate a social connection, so another possibility for linking harvesters is by their temporal spamming behavior. We look at the timestamps of all emails sent by a particular harvester and bin them into 1-hour intervals, resulting in a vector indicating how many emails a harvester sent in each interval. Doing this for all of the harvesters, we get another coincidence matrix $H$ but with the columns representing time (in 1-hour intervals) rather than spam servers. The entries of $H$ are $h_{ij} = s_{ij} / e_i$ where $s_{ij}$ denotes the number of emails sent by harvester $i$ in the $j$th time interval, and $e_i$ is defined as before. Again we normalize by the number of email addresses acquired, but no other normalizations are necessary because the columns represent time, which does not vary for different harvesters.

B. Creating the adjacency matrix

From the coincidence matrix $H$ we can obtain an unnormalized matrix of pairwise similarities $S = H H^T$. We normalize $S$ to form a normalized matrix of pairwise similarities $S' = D_S^{-1/2} S D_S^{-1/2}$, where $D_S$ is a diagonal matrix consisting of the diagonal elements of $S$. We can interpret this final normalization as a scaling of the edge weights between harvesters such that each harvester's self-edge has unit weight. This ensures that each harvester is equally important because we have no prior information on the importance of a particular harvester in the network.

We create an adjacency matrix $W$ describing the graph by connecting the harvesters together according to their similarities in $S'$.
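The temporal binning and the normalization of $S$ described above can be sketched as follows. Timestamps are assumed to be seconds since the start of the analysis window, and every harvester is assumed to have sent at least one email (otherwise the corresponding row of $H$ is zero and the normalization divides by zero). The function names and toy data are illustrative assumptions.

```python
import math


def hourly_counts(timestamps: list[float], start: float, n_hours: int) -> list[int]:
    """Bin email timestamps (in seconds) into 1-hour intervals from `start`."""
    counts = [0] * n_hours
    for t in timestamps:
        hour = int((t - start) // 3600)
        if 0 <= hour < n_hours:
            counts[hour] += 1
    return counts


def normalized_similarity(H: list[list[float]]) -> list[list[float]]:
    """S' = D_S^{-1/2} (H H^T) D_S^{-1/2}, so s'_ij = s_ij / sqrt(s_ii s_jj)."""
    S = [[sum(a * b for a, b in zip(ri, rj)) for rj in H] for ri in H]
    scale = [math.sqrt(S[i][i]) for i in range(len(S))]
    return [
        [S[i][j] / (scale[i] * scale[j]) for j in range(len(S))]
        for i in range(len(S))
    ]


# Two harvesters active in the same hours, and a third active later.
e = [2, 4, 1]  # hypothetical numbers of addresses acquired (e_i)
raw = [
    hourly_counts([10, 20, 3700], start=0, n_hours=3),
    hourly_counts([30, 40, 50, 60, 3800, 3900], start=0, n_hours=3),
    hourly_counts([7300], start=0, n_hours=3),
]
H = [[s / e_i for s in row] for row, e_i in zip(raw, e)]  # h_ij = s_ij / e_i
S_prime = normalized_similarity(H)
```

After normalization each self-similarity is 1 (up to rounding), matching the unit self-edge interpretation in the text. The first two harvesters have proportional activity profiles and therefore a similarity of about 1, while the third shares no active hour with them and has similarity 0.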
There are several methods of connecting the graph, including $k$-nearest neighbors and the fully-connected graph. We opt for the $k$-nearest neighbor method, which translates into connecting each node to its neighbors with the $k$ highest similarities. This is the recommended choice in [10] and is less vulnerable to improper choices of the connection parameters (in this case, the value of $k$). It also results in a sparse adjacency matrix, which speeds up computations and makes the graph easier to visualize. Unfortunately, there are not many guidelines on how to choose $k$. A heuristic suggested in [10], motivated by asymptotic results, is to choose $k = \log M$. We use this choice of $k$ as a starting point and increase $k$ as necessary to avoid artificially disconnecting the graph.

V. RESULTS

We present visualizations for our clustering results from October 2006, which is a month of particular interest as noted in Section II. The visualizations were created using the force-directed layout in Cytoscape [11]. Key statistics of the clustering results over a period of one year starting in July 2006 are presented in 3-month intervals in tables.

A. Similarity in spam server usage

The graph created using similarity in spam server usage usually consists of a large connected component and many small connected components. The small components are easily recognized as clusters, while the large component is divided into multiple clusters. In Fig. 3 we show the social network of harvesters, connected using similarity in spam server usage, from October 2006. The shape and color of a harvester indicate the cluster it belongs to. The eigengap heuristic suggests that the large connected component should be divided into 64 clusters, but to make the figure easier to interpret, we present a clustering result that divides the large component into 7 clusters.
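For reference, the $k$-nearest neighbor construction from Section IV-B and the eigengap rule from Section III-C might be sketched as below. Several details are assumptions rather than statements about the authors' code: an edge is kept if either endpoint selects the other, self-edges are dropped, $\log$ is taken as the natural logarithm and rounded up, and the eigengap rule is simplified to "largest gap between consecutive eigenvalues". The eigenvalues passed to `eigengap_k` are assumed to come from a degree-normalized adjacency matrix, whose largest eigenvalue is 1.

```python
import math


def knn_adjacency(S: list[list[float]], k: int) -> list[list[float]]:
    """Connect each node to its k most similar neighbors (self excluded).

    An edge is kept if either endpoint selects the other, which keeps W
    symmetric; this symmetrization rule is an assumption.
    """
    n = len(S)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: S[i][j],
            reverse=True,
        )
        for j in others[:k]:
            W[i][j] = W[j][i] = S[i][j]
    return W


def default_k(n_nodes: int) -> int:
    """Starting point k = log M, using the natural log and rounding up."""
    return max(1, math.ceil(math.log(n_nodes)))


def eigengap_k(eigenvalues: list[float]) -> int:
    """Pick K at the largest gap between consecutive sorted eigenvalues."""
    lam = sorted(eigenvalues, reverse=True)
    gaps = [lam[i] - lam[i + 1] for i in range(len(lam) - 1)]
    return gaps.index(max(gaps)) + 1


# Toy similarity matrix with two obvious pairs, and made-up eigenvalues.
S = [
    [1.0, 0.9, 0.2, 0.0],
    [0.9, 1.0, 0.1, 0.0],
    [0.2, 0.1, 1.0, 0.8],
    [0.0, 0.0, 0.8, 1.0],
]
W = knn_adjacency(S, k=1)
K = eigengap_k([1.0, 0.98, 0.97, 0.55, 0.4])
```

With `k=1` the toy graph falls apart into two disconnected pairs, which illustrates why the text increases $k$ when the graph becomes artificially disconnected. The made-up eigenvalues have three values near 1 followed by a clear drop, so the rule selects three clusters.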
We also remove connected components of fewer than ten harvesters. These modifications were made for visualization purposes only. In our analysis, and in particular when calculating the validation indices we present later, we use the number of clusters suggested by the eigengap heuristic and include all small connected components.

Notice that the majority of harvesters belong in a large cluster with weak ties, which is a subset of the large component. Meanwhile there exist several smaller clusters with strong ties, some of which are connected to the large cluster. Each cluster represents a community of harvesters that happen to use the same resources (spam servers), indicating that there is a strong likelihood that these harvesters are working together.

Fig. 3. Social network of harvesters formed by similarity in spam server usage in October 2006 (best viewed in color). The color and shape of a harvester indicate the cluster it belongs to.

Fig. 4. Alternate view of social network pictured in Fig. 3, where the color of a harvester corresponds to its phishing level (color bar from 0.0 to 1.0).

As with any clustering problem, the results need to be validated. If common usage of spam servers indeed indicates social connections between harvesters, perhaps we can find some other property that is consistent within clusters. Recall from Section II that harvesters can be classified as either phishers or non-phishers. In Fig. 4 we show the same social network colored by phishing level, as defined in Section II, rather than cluster. Note that each of the clusters consists almost entirely of phishers or almost entirely of non-phishers. In particular, phishers appear to concentrate in small clusters with strong ties. This observation is further enhanced when clustering using 64 clusters as suggested by the eigengap heuristic. Thus, phishing level appears to be consistent within clusters.
We consider a cluster as a phishing cluster if it contains more phishers than non-phishers and as a non-phishing cluster otherwise. Using phisher or non-phisher as a label for each harvester, we compute the Rand index and adjusted Rand index [12], both commonly used indices for clustering validation. The Rand index is a measure of agreement between clustering results and a set of class labels and is given by

$$\text{Rand index} = \frac{a + d}{a + b + c + d} \tag{9}$$

where $a$ is the number of pairs of nodes with the same label and in the same cluster, $b$ is the number of pairs with the same label but in different clusters, $c$ is the number of pairs with different labels but in the same cluster, and $d$ is the number of pairs with different labels and in different clusters. A Rand index of 0 indicates complete disagreement between clusters and labels, and a Rand index of 1 indicates perfect agreement. The adjusted Rand index is corrected for chance so that the range is $[-1, 1]$ with an expected index of 0 for a random clustering result.

TABLE I
VALIDATION INDICES FOR CLUSTERING RESULTS

| Year | 2006 | 2006 | 2007 | 2007 | 2007 |
|---|---|---|---|---|---|
| Month | July | October | January | April | July |
| Rand index | 0.923 | 0.954 | 0.942 | 0.964 | 0.901 |
| Adj. Rand index | 0.821 | 0.871 | 0.810 | 0.809 | 0.649 |

In this clustering problem, the Rand index indicates how well phishers and non-phishers divide into phishing and non-phishing clusters, respectively. The adjusted Rand index indicates how well phishers and non-phishers divide compared to the expected division that a random clustering algorithm would produce. Both indices are shown in Table I for five months. Note that the clustering results have excellent agreement with the labels, and the agreement is much higher than expected by chance.
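Both validation indices are simple to compute from two parallel lists, one giving each node's class label and one giving its cluster. The sketch below implements Eq. (9) by direct pair counting and the adjusted Rand index with the standard Hubert and Arabie contingency-table formula from [12]. The function names and toy data are assumptions for this example, and the adjusted index is undefined (division by zero) in degenerate cases such as a single cluster and a single label.

```python
from collections import Counter
from itertools import combinations
from math import comb


def rand_index(labels: list[int], clusters: list[int]) -> float:
    """Eq. (9): fraction of node pairs on which labels and clusters agree."""
    agree = total = 0
    for i, j in combinations(range(len(labels)), 2):
        same_label = labels[i] == labels[j]
        same_cluster = clusters[i] == clusters[j]
        agree += same_label == same_cluster  # counts the a and d pairs
        total += 1
    return agree / total


def adjusted_rand_index(labels: list[int], clusters: list[int]) -> float:
    """Chance-corrected Rand index computed from the contingency table."""
    n = len(labels)
    pair_sum = sum(comb(c, 2) for c in Counter(zip(labels, clusters)).values())
    label_sum = sum(comb(c, 2) for c in Counter(labels).values())
    cluster_sum = sum(comb(c, 2) for c in Counter(clusters).values())
    expected = label_sum * cluster_sum / comb(n, 2)
    max_index = (label_sum + cluster_sum) / 2
    return (pair_sum - expected) / (max_index - expected)


# Six nodes: 1 = phisher, 0 = non-phisher. One phisher sits in the
# non-phishing cluster.
labels = [1, 1, 1, 0, 0, 0]
clusters = [1, 1, 0, 0, 0, 0]
ri = rand_index(labels, clusters)
ari = adjusted_rand_index(labels, clusters)
```

In this toy case 10 of the 15 node pairs agree, so the Rand index is 10/15, while the adjusted index is noticeably lower because part of that agreement is expected by chance.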
The division between phishers and non-phishers is not perfect, as there are some phishers belonging in non-phishing clusters and vice versa, but the high adjusted Rand index indicates that this split is highly unlikely to be due to chance alone. Hence we have found empirical evidence that phishers tend to form small communities with strong ties, suggesting that they share resources (spam servers) between members of their community.

B. Temporal similarity

Unlike the graph created by similarity in spam server usage, the graph created by temporal similarity is usually connected. In Fig. 5 we show the social network of harvesters, connected using temporal similarity, from October 2006, where again the shape and color of a harvester indicate the cluster it belongs to. Any similarity in color with Fig. 3 is coincidental; Fig. 5 represents a completely different view of the social network and provides different insights.

Unfortunately we do not have validation for this clustering result on a global scale like we did with phishing level for similarity in spam server usage. However, by looking at temporal spamming plots of the small clusters, we find some local validation. Namely, we see groups of harvesters in the same cluster with extremely coherent temporal spamming behavior. We notice that in many of these groups, the harvesters also have similar IP addresses. In particular, we notice a group of ten harvesters that have extremely coherent temporal spamming patterns and have the same /24 IP address prefix, namely 208.66.195/24, indicating that they are also in the same physical location. In Fig. 5 they can be found in the light green cluster of triangular nodes at the top right of the network.

Fig. 5. Social network of harvesters formed by temporal similarity in October 2006 (best viewed in color). The color and shape of a harvester indicate the cluster it belongs to.
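One way to quantify the kind of temporal coherence and address proximity described above is to average the pairwise Pearson correlation of the harvesters' hourly email counts and to check whether their addresses fall in a common prefix. The sketch below uses only the Python standard library. The count vectors and the host addresses are made up for illustration (only the 208.66.195/24 prefix itself comes from the text), and the function names are assumptions.

```python
import ipaddress
from itertools import combinations
from math import sqrt


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length count vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_correlation(counts: list[list[float]]) -> float:
    """Mean Pearson correlation over all pairs of harvesters in a group."""
    pairs = list(combinations(counts, 2))
    return sum(pearson(x, y) for x, y in pairs) / len(pairs)


def share_prefix(ips: list[str], prefix_len: int = 24) -> bool:
    """True if all IPv4 addresses fall in the same /prefix_len network."""
    networks = {
        ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False) for ip in ips
    }
    return len(networks) == 1


# Hypothetical hourly email counts for three harvesters in one group.
counts = [[0, 5, 9, 1], [0, 10, 18, 2], [1, 4, 9, 0]]
rho_avg = average_correlation(counts)

# Hypothetical host addresses inside the prefix named in the text.
same_24 = share_prefix(["208.66.195.2", "208.66.195.9"])
```

A constant count vector has zero variance and would make `pearson` divide by zero, so inactive harvesters would need to be filtered out first.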
TABLE II
AVERAGE TEMPORAL CORRELATION COEFFICIENTS OF 208.66.195/24 GROUP OF TEN HARVESTERS

| Year | 2006 | 2006 | 2007 | 2007 | 2007 |
|---|---|---|---|---|---|
| Month | July | October | January | April | July |
| $\rho_{avg}$ | 0.980 | 0.988 | 0.950 | 0.949 | 0.935 |

Upon further investigation, we find that their IP addresses are in the 208.66.192/22 prefix owned by McColo Corp., a known rogue Internet service provider that acted as a gateway to spammers and was finally removed from the Internet in November 2008 [13]. This serves as further confirmation that these harvesters are likely to be socially connected. They first appeared at the end of May 2006 and have been among the heaviest harvesters, in terms of the number of emails sent, in every month since then. The average correlation coefficients $\rho_{avg}$ between two harvesters in this group are listed in Table II for five months. Notice that their average correlation coefficients are extremely high and strongly suggest that they are working together in a coordinated manner. Also note that their behavior is still highly correlated more than a year after they first appeared. Furthermore, we discover that they have high temporal correlation in the harvesting phase; that is, they collect email addresses in a very similar manner as well. We would certainly expect them to belong to the same cluster, which agrees with the clustering results. Hence we believe that this group is either the same spammer or a community of spammers operating from the same physical location.

VI. CONCLUSIONS

In this paper, we revealed social networks of spammers by discovering communities of harvesters from the data collected through Project Honey Pot. Specifically, we clustered harvesters using two similarity measures reflecting their behavioral correlations. In addition, we studied the distribution of phishing emails among harvesters and among clusters.
We found that harvesters typically send either only phishing emails or no phishing emails at all. Moreover, we discovered that communities of harvesters divide into communities of mostly phishers and mostly non-phishers when clustering according to similarity in spam server usage. In particular, we observed that phishers tend to form small communities with strong ties. We also discovered several groups of harvesters with extremely coherent temporal behavior and very similar IP addresses, indicating that these groups are close geographically in addition to socially. Note that the two similarity measures we studied provided us with different views of the social networks of harvesters, and we gained useful insights from both of them.

All of our findings are empirical; however, we believe that they reveal some previously unknown behavior of spammers, namely that spammers do indeed form social networks. Since harvesters are closely related to spammers, the discovered communities of harvesters are closely related to communities of spammers. If we further hypothesize that harvesters are the spammers themselves, then the discovered communities of harvesters correspond exactly to communities of spammers. Identifying communities of spammers allows us to fight spam from a new perspective: by using spammers' social structure.

ACKNOWLEDGMENT

We thank Matthew Prince, Eric Langheinrich, and Lee Holloway of Unspam Technologies Inc. for providing us with the Project Honey Pot data set. We are also grateful to Nitin Nayar, John Bell, and Dr. Olaf Maennel for their assistance with the data retrieval. This research was partially supported by the Office of Naval Research grant N00014-08-1-1065 and the National Science Foundation grant CCR-0325571. Kevin Xu was supported in part by an award from the Natural Sciences and Engineering Research Council of Canada.

REFERENCES

[1] A. Ramachandran and N. Feamster, "Understanding the network-level behavior of spammers," in Proc. ACM SIGCOMM, Sep. 2006.
[2] Z. Duan, K. Gopalan, and X. Yuan, "Behavioral characteristics of spammers and their network reachability properties," in Proc. Int. Conf. Communications, Jun. 2007.
[3] M. Prince, L. Holloway, E. Langheinrich, B. M. Dahl, and A. M. Keller, "Understanding how spammers steal your e-mail address: An analysis of the first six months of data from Project Honey Pot," in Proc. 2nd Conf. Email and Anti-Spam, Jul. 2005.
[4] (2009) Project Honey Pot. [Online]. Available: http://www.projecthoneypot.org
[5] M. Austin. (2006, Nov.) Spam at epic levels. ITPro. [Online]. Available: http://www.itpro.co.uk/97589/spam-at-epic-levels
[6] M. Chandrasekaran, K. Narayanan, and S. Upadhyaya, "Phishing email detection based on structural properties," in Proc. 9th Annual NYS Cyber Security Conf., Jun. 2006.
[7] J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney, "Statistical properties of community structure in large social and information networks," in 17th Int. WWW, Apr. 2008, pp. 695–704.
[8] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, pp. 888–905, Aug. 2000.
[9] S. Yu and J. Shi, "Multiclass spectral clustering," in Proc. 9th IEEE Int. Conf. Computer Vision, Oct. 2003.
[10] U. von Luxburg, "A tutorial on spectral clustering," Statistics and Computing, vol. 17, no. 4, pp. 395–416, Aug. 2007.
[11] P. Shannon, A. Markiel, O. Ozier, N. S. Baliga, J. T. Wang, D. Ramage, N. Amin, B. Schwikowski, and T. Ideker, "Cytoscape: A software environment for integrated models of biomolecular interaction networks," Genome Research, vol. 13, pp. 2498–2504, Nov. 2003.
[12] L. Hubert and P. Arabie, "Comparing partitions," J. Classification, vol. 2, no. 1, pp. 193–218, Dec. 1985.
[13] J. Nazario. (2008, Nov.) Third "bad ISP" disappears – McColo gone. Arbor Networks. [Online]. Available: http://asert.arbornetworks.com/2008/11/third-bad-isp-dissolves-mccolo-gone/
