Finding Deceptive Opinion Spam by Any Stretch of the Imagination
Myle Ott    Yejin Choi    Claire Cardie
Department of Computer Science, Cornell University, Ithaca, NY 14853
{myleott, ychoi, cardie}@cs.cornell.edu

Jeffrey T. Hancock
Department of Communication, Cornell University, Ithaca, NY 14853
jth34@cornell.edu

Abstract

Consumers increasingly rate, review and research products online (Jansen, 2010; Litvin et al., 2008). Consequently, websites containing consumer reviews are becoming targets of opinion spam. While recent work has focused primarily on manually identifiable instances of opinion spam, in this work we study deceptive opinion spam: fictitious opinions that have been deliberately written to sound authentic. Integrating work from psychology and computational linguistics, we develop and compare three approaches to detecting deceptive opinion spam, and ultimately develop a classifier that is nearly 90% accurate on our gold-standard opinion spam dataset. Based on feature analysis of our learned models, we additionally make several theoretical contributions, including revealing a relationship between deceptive opinions and imaginative writing.

1 Introduction

With the ever-increasing popularity of review websites that feature user-generated opinions (e.g., TripAdvisor [1] and Yelp [2]), there comes an increasing potential for monetary gain through opinion spam: inappropriate or fraudulent reviews. Opinion spam can range from annoying self-promotion of an unrelated website or blog to deliberate review fraud, as in the recent case [3] of a Belkin employee who hired people to write positive reviews for an otherwise poorly reviewed product. [4]

[1] http://tripadvisor.com
[2] http://yelp.com
[3] http://news.cnet.com/8301-1001_3-10145399-92.html
[4] It is also possible for opinion spam to be negative, potentially in order to sully the reputation of a competitor.

While other kinds of spam have received considerable computational attention, regrettably there has been little work to date (see Section 2) on opinion spam detection. Furthermore, most previous work in the area has focused on the detection of DISRUPTIVE OPINION SPAM: uncontroversial instances of spam that are easily identified by a human reader, e.g., advertisements, questions, and other irrelevant or non-opinion text (Jindal and Liu, 2008). And while the presence of disruptive opinion spam is certainly a nuisance, the risk it poses to the user is minimal, since the user can always choose to ignore it.

We focus here on a potentially more insidious type of opinion spam: DECEPTIVE OPINION SPAM, fictitious opinions that have been deliberately written to sound authentic, in order to deceive the reader. For example, one of the following two hotel reviews is truthful and the other is deceptive opinion spam:

1. I have stayed at many hotels traveling for both business and pleasure and I can honestly stay that The James is tops. The service at the hotel is first class. The rooms are modern and very comfortable. The location is perfect within walking distance to all of the great sights and restaurants. Highly recommend to both business travellers and couples.

2. My husband and I stayed at the James Chicago Hotel for our anniversary. This place is fantastic! We knew as soon as we arrived we made the right choice! The rooms are BEAUTIFUL and the staff very attentive and wonderful!! The area of the hotel is great, since I love to shop I couldn't ask for more!! We will definatly be back to Chicago and we will for sure be back to the James Chicago.
Typically, these deceptive opinions are neither easily ignored nor even identifiable by a human reader; [5] consequently, there are few good sources of labeled data for this research. Indeed, in the absence of gold-standard data, related studies (see Section 2) have been forced to utilize ad hoc procedures for evaluation. In contrast, one contribution of the work presented here is the creation of the first large-scale, publicly available [6] dataset for deceptive opinion spam research, containing 400 truthful and 400 gold-standard deceptive reviews.

To obtain a deeper understanding of the nature of deceptive opinion spam, we explore the relative utility of three potentially complementary framings of our problem. Specifically, we view the task as: (a) a standard text categorization task, in which we use n-gram-based classifiers to label opinions as either deceptive or truthful (Joachims, 1998; Sebastiani, 2002); (b) an instance of psycholinguistic deception detection, in which we expect deceptive statements to exemplify the psychological effects of lying, such as increased negative emotion and psychological distancing (Hancock et al., 2008; Newman et al., 2003); and (c) a problem of genre identification, in which we view deceptive and truthful writing as sub-genres of imaginative and informative writing, respectively (Biber et al., 1999; Rayson et al., 2001).

We compare the performance of each approach on our novel dataset. Particularly, we find that machine learning classifiers trained on features traditionally employed in (a) psychological studies of deception and (b) genre identification are both outperformed at statistically significant levels by n-gram-based text categorization techniques.
Notably, a combined classifier with both n-gram and psychological deception features achieves nearly 90% cross-validated accuracy on this task. In contrast, we find deceptive opinion spam detection to be well beyond the capabilities of most human judges, who perform roughly at-chance, a finding that is consistent with decades of traditional deception detection research (Bond and DePaulo, 2006).

[5] The second example review is deceptive opinion spam.
[6] Available by request at: http://www.cs.cornell.edu/~myleott/op_spam

Additionally, we make several theoretical contributions based on an examination of the feature weights learned by our machine learning classifiers. Specifically, we shed light on an ongoing debate in the deception literature regarding the importance of considering the context and motivation of a deception, rather than simply identifying a universal set of deception cues. We also present findings that are consistent with recent work highlighting the difficulties that liars have encoding spatial information (Vrij et al., 2009). Lastly, our study of deceptive opinion spam detection as a genre identification problem reveals relationships between deceptive opinions and imaginative writing, and between truthful opinions and informative writing.

The rest of this paper is organized as follows: in Section 2, we summarize related work; in Section 3, we explain our methodology for gathering data and evaluate human performance; in Section 4, we describe the features and classifiers employed by our three automated detection approaches; in Section 5, we present and discuss experimental results; finally, conclusions and directions for future work are given in Section 6.

2 Related Work

Spam has historically been studied in the contexts of e-mail (Drucker et al., 2002), and the Web (Gyöngyi et al., 2004; Ntoulas et al., 2006).
Recently, researchers have begun to look at opinion spam as well (Jindal and Liu, 2008; Wu et al., 2010; Yoo and Gretzel, 2009).

Jindal and Liu (2008) find that opinion spam is both widespread and different in nature from either e-mail or Web spam. Using product review data, and in the absence of gold-standard deceptive opinions, they train models using features based on the review text, reviewer, and product, to distinguish between duplicate opinions [7] (considered deceptive spam) and non-duplicate opinions (considered truthful).

[7] Duplicate (or near-duplicate) opinions are opinions that appear more than once in the corpus with the same (or similar) text. While these opinions are likely to be deceptive, they are unlikely to be representative of deceptive opinion spam in general. Moreover, they are potentially detectable via off-the-shelf plagiarism detection software.

Wu et al. (2010) propose an alternative strategy for detecting deceptive opinion spam in the absence of gold-standard data, based on the distortion of popularity rankings. Both of these heuristic evaluation approaches are unnecessary in our work, since we compare gold-standard deceptive and truthful opinions.

Yoo and Gretzel (2009) gather 40 truthful and 42 deceptive hotel reviews and, using a standard statistical test, manually compare the psychologically relevant linguistic differences between them. In contrast, we create a much larger dataset of 800 opinions that we use to develop and evaluate automated deception classifiers.

Research has also been conducted on the related task of psycholinguistic deception detection. Newman et al. (2003), and later Mihalcea and Strapparava (2009), ask participants to give both their true and untrue views on personal issues (e.g., their stance on the death penalty). Zhou et al.
(2004; 2008) consider computer-mediated deception in role-playing games designed to be played over instant messaging and e-mail. However, while these studies compare n-gram-based deception classifiers to a random guess baseline of 50%, we additionally evaluate and compare two other computational approaches (described in Section 4), as well as the performance of human judges (described in Section 3.3).

Lastly, automatic approaches to determining review quality have been studied, both directly (Weimer et al., 2007) and in the contexts of helpfulness (Danescu-Niculescu-Mizil et al., 2009; Kim et al., 2006; O'Mahony and Smyth, 2009) and credibility (Weerkamp and De Rijke, 2008). Unfortunately, most measures of quality employed in those works are based exclusively on human judgments, which we find in Section 3 to be poorly calibrated to detecting deceptive opinion spam.

3 Dataset Construction and Human Performance

While truthful opinions are ubiquitous online, deceptive opinions are difficult to obtain without resorting to heuristic methods (Jindal and Liu, 2008; Wu et al., 2010). In this section, we report our efforts to gather (and validate with human judgments) the first publicly available opinion spam dataset with gold-standard deceptive opinions.

Following the work of Yoo and Gretzel (2009), we compare truthful and deceptive positive reviews for hotels found on TripAdvisor. Specifically, we mine all 5-star truthful reviews from the 20 most popular hotels on TripAdvisor [8] in the Chicago area. [9] Deceptive opinions are gathered for those same 20 hotels using Amazon Mechanical Turk [10] (AMT). Below, we provide details of the collection methodologies for deceptive (Section 3.1) and truthful opinions (Section 3.2). Ultimately, we collect 20 truthful and 20 deceptive opinions for each of the 20 chosen hotels (800 opinions total).
3.1 Deceptive opinions via Mechanical Turk

Crowdsourcing services such as AMT have made large-scale data annotation and collection efforts financially affordable by granting anyone with basic programming skills access to a marketplace of anonymous online workers (known as Turkers) willing to complete small tasks.

To solicit gold-standard deceptive opinion spam using AMT, we create a pool of 400 Human-Intelligence Tasks (HITs) and allocate them evenly across our 20 chosen hotels. To ensure that opinions are written by unique authors, we allow only a single submission per Turker. We also restrict our task to Turkers who are located in the United States, and who maintain an approval rating of at least 90%. Turkers are allowed a maximum of 30 minutes to work on the HIT, and are paid one US dollar for an accepted submission.

Each HIT presents the Turker with the name and website of a hotel. The HIT instructions ask the Turker to assume that they work for the hotel's marketing department, and to pretend that their boss wants them to write a fake review (as if they were a customer) to be posted on a travel review website; additionally, the review needs to sound realistic and portray the hotel in a positive light. A disclaimer indicates that any submission found to be of insufficient quality (e.g., written for the wrong hotel, unintelligible, unreasonably short, [11] plagiarized, [12] etc.) will be rejected.

[8] TripAdvisor utilizes a proprietary ranking system to assess hotel popularity. We chose the 20 hotels with the greatest number of reviews, irrespective of the TripAdvisor ranking.
[9] It has been hypothesized that popular offerings are less likely to become targets of deceptive opinion spam, since the relative impact of the spam in such cases is small (Jindal and Liu, 2008; Lim et al., 2010). By considering only the most popular hotels, we hope to minimize the risk of mining opinion spam and labeling it as truthful.
[10] http://mturk.com
[11] A submission is considered unreasonably short if it contains fewer than 150 characters.
[12] Submissions are individually checked for plagiarism at http://plagiarisma.net.

Table 1: Descriptive statistics for 400 deceptive opinion spam submissions gathered using AMT. s corresponds to the sample standard deviation.

  Time spent t (minutes)
    All submissions     count: 400   t_min: 0.08   t_max: 29.78   mean: 8.06     s: 6.32
  Length ℓ (words)
    All submissions                  ℓ_min: 25     ℓ_max: 425     mean: 115.75   s: 61.30
    Time spent t < 1    count: 47    ℓ_min: 39     ℓ_max: 407     mean: 113.94   s: 66.24
    Time spent t ≥ 1    count: 353   ℓ_min: 25     ℓ_max: 425     mean: 115.99   s: 60.71

It took approximately 14 days to collect 400 satisfactory deceptive opinions. Descriptive statistics appear in Table 1. Submissions vary quite dramatically both in length and in time spent on the task. Particularly, nearly 12% of the submissions were completed in under one minute. Surprisingly, an independent two-tailed t-test between the mean length of these submissions and that of the other submissions reveals no significant difference (p = 0.83). We suspect that these "quick" users may have started working prior to having formally accepted the HIT, presumably to circumvent the imposed time limit. Indeed, the quickest submission took just 5 seconds and contained 114 words.

3.2 Truthful opinions from TripAdvisor

For truthful opinions, we mine all 6,977 reviews from the 20 most popular Chicago hotels on TripAdvisor. From these we eliminate:

• 3,130 non-5-star reviews;
• 41 non-English reviews; [13]
• 75 reviews with fewer than 150 characters since, by construction, deceptive opinions are
at least 150 characters long (see footnote 11 in Section 3.1);

• 1,607 reviews written by first-time authors (new users who have not previously posted an opinion on TripAdvisor), since these opinions are more likely to contain opinion spam, which would reduce the integrity of our truthful review data (Wu et al., 2010).

[13] Language is determined using http://tagthe.net.

Finally, we balance the number of truthful and deceptive opinions by selecting 400 of the remaining 2,124 truthful reviews, such that the document lengths of the selected truthful reviews are similarly distributed to those of the deceptive reviews. Work by Serrano et al. (2009) suggests that a log-normal distribution is appropriate for modeling document lengths. Thus, for each of the 20 chosen hotels, we select 20 truthful reviews from a log-normal (left-truncated at 150 characters) distribution fit to the lengths of the deceptive reviews. [14] Combined with the 400 deceptive reviews gathered in Section 3.1, this yields our final dataset of 800 reviews.

3.3 Human performance

Assessing human deception detection performance is important for several reasons. First, there are few other baselines for our classification task; indeed, related studies (Jindal and Liu, 2008; Mihalcea and Strapparava, 2009) have only considered a random guess baseline. Second, assessing human performance is necessary to validate the deceptive opinions gathered in Section 3.1. If human performance is low, then our deceptive opinions are convincing, and therefore, deserving of further attention.

Our initial approach to assessing human performance on this task was with Mechanical Turk. Unfortunately, we found that some Turkers selected among the choices seemingly at random, presumably to maximize their hourly earnings by obviating the need to read the review.
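The length matching of Section 3.2 can be sketched in a few lines. This is a simplified stand-in for the procedure above, which fits a left-truncated log-normal with the R package GAMLSS: here a plain log-normal is fit by maximum likelihood to hypothetical deceptive-review lengths (no truncation correction), and target lengths above the 150-character cutoff are drawn by rejection sampling.

```python
import math
import random

def fit_lognormal(lengths):
    """MLE of a log-normal: mean and sd of the log lengths
    (no truncation correction, unlike the GAMLSS fit)."""
    logs = [math.log(x) for x in lengths]
    mu = sum(logs) / len(logs)
    var = sum((v - mu) ** 2 for v in logs) / len(logs)
    return mu, math.sqrt(var)

def sample_truncated(mu, sigma, lower, n, rng):
    """Draw n lengths from a log-normal left-truncated at `lower`
    by simple rejection sampling."""
    out = []
    while len(out) < n:
        x = rng.lognormvariate(mu, sigma)
        if x >= lower:
            out.append(x)
    return out

rng = random.Random(0)
# Hypothetical deceptive-review character lengths (illustration only).
deceptive_lengths = [rng.lognormvariate(6.2, 0.5) for _ in range(400)]
mu, sigma = fit_lognormal(deceptive_lengths)
# Target lengths for the 20 truthful reviews of one hotel.
targets = sample_truncated(mu, sigma, 150, 20, rng)
```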
While a similar effect has been observed previously (Akkaya et al., 2010), there remains no universal solution.

Instead, we solicit the help of three volunteer undergraduate university students to make judgments on a subset of our data. This balanced subset, corresponding to the first fold of our cross-validation experiments described in Section 5, contains all 40 reviews from each of four randomly chosen hotels. Unlike the Turkers, our student volunteers are not offered a monetary reward. Consequently, we consider their judgements to be more honest than those obtained via AMT.

[14] We use the R package GAMLSS (Rigby and Stasinopoulos, 2005) to fit the left-truncated log-normal distribution.

Table 2: Performance of three human judges and two meta-judges on a subset of 160 opinions, corresponding to the first fold of our cross-validation experiments in Section 5. The largest value in each column is marked with an asterisk.

                             TRUTHFUL                 DECEPTIVE
              Accuracy     P      R      F          P      R      F
  HUMAN
   JUDGE 1     61.9%*    57.9   87.5   69.7*      74.4   36.3   48.7
   JUDGE 2     56.9%     53.9   95.0*  68.8       78.9*  18.8   30.3
   JUDGE 3     53.1%     52.3   70.0   59.9       54.7   36.3   43.6
  META
   MAJORITY    58.1%     54.8   92.5   68.8       76.0   23.8   36.2
   SKEPTIC     60.6%     60.8*  60.0   60.4       60.5   61.3*  60.9*

Additionally, to test the extent to which the individual human judges are biased, we evaluate the performance of two virtual meta-judges. Specifically, the MAJORITY meta-judge predicts "deceptive" when at least two out of three human judges believe the review to be deceptive, and the SKEPTIC meta-judge predicts "deceptive" when any human judge believes the review to be deceptive. Human and meta-judge performance is given in Table 2.

It is clear from the results that human judges are not particularly effective at this task.
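The two meta-judges are simple functions of the three individual judgments; a minimal sketch (the label strings are hypothetical):

```python
def majority(judgements):
    """MAJORITY meta-judge: 'deceptive' when at least two of the
    three human judges believe the review to be deceptive."""
    votes = sum(1 for j in judgements if j == "deceptive")
    return "deceptive" if votes >= 2 else "truthful"

def skeptic(judgements):
    """SKEPTIC meta-judge: 'deceptive' when any judge believes
    the review to be deceptive."""
    return "deceptive" if "deceptive" in judgements else "truthful"

# One review's three (hypothetical) judgments.
votes = ["truthful", "deceptive", "deceptive"]
# majority(votes) -> "deceptive"; skeptic(votes) -> "deceptive"
```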
Indeed, a two-tailed binomial test fails to reject the null hypothesis that JUDGE 2 and JUDGE 3 perform at-chance (p = 0.003, 0.10, 0.48 for the three judges, respectively). Furthermore, all three judges suffer from truth-bias (Vrij, 2008), a common finding in deception detection research in which human judges are more likely to classify an opinion as truthful than deceptive. In fact, JUDGE 2 classified fewer than 12% of the opinions as deceptive! Interestingly, this bias is effectively smoothed by the SKEPTIC meta-judge, which produces nearly perfectly class-balanced predictions. A subsequent reevaluation of human performance on this task suggests that the truth-bias can be reduced if judges are given the class-proportions in advance, although such prior knowledge is unrealistic; and ultimately, performance remains similar to that of Table 2.

Inter-annotator agreement among the three judges, computed using Fleiss' kappa, is 0.11. While there is no precise rule for interpreting kappa scores, Landis and Koch (1977) suggest that scores in the range (0.00, 0.20] correspond to "slight agreement" between annotators. The largest pairwise Cohen's kappa is 0.12, between JUDGE 2 and JUDGE 3, a value far below generally accepted pairwise agreement levels. We suspect that agreement among our human judges is so low precisely because humans are poor judges of deception (Vrij, 2008), and therefore they perform nearly at-chance respective to one another.

4 Automated Approaches to Deceptive Opinion Spam Detection

We consider three automated approaches to detecting deceptive opinion spam, each of which utilizes classifiers (described in Section 4.4) trained on the dataset of Section 3. The features employed by each strategy are outlined here.
4.1 Genre identification

Work in computational linguistics has shown that the frequency distribution of part-of-speech (POS) tags in a text is often dependent on the genre of the text (Biber et al., 1999; Rayson et al., 2001). In our genre identification approach to deceptive opinion spam detection, we test if such a relationship exists for truthful and deceptive reviews by constructing, for each review, features based on the frequencies of each POS tag. [15] These features are also intended to provide a good baseline with which to compare our other automated approaches.

[15] We use the Stanford Parser (Klein and Manning, 2003) to obtain the relative POS frequencies.

4.2 Psycholinguistic deception detection

The Linguistic Inquiry and Word Count (LIWC) software (Pennebaker et al., 2007) is a popular automated text analysis tool used widely in the social sciences. It has been used to detect personality traits (Mairesse et al., 2007), to study tutoring dynamics (Cade et al., 2010), and, most relevantly, to analyze deception (Hancock et al., 2008; Mihalcea and Strapparava, 2009; Vrij et al., 2007).

While LIWC does not include a text classifier, we can create one with features derived from the LIWC output. In particular, LIWC counts and groups the number of instances of nearly 4,500 keywords into 80 psychologically meaningful dimensions. We construct one feature for each of the 80 LIWC dimensions, which can be summarized broadly under the following four categories:

1. Linguistic processes: Functional aspects of text (e.g., the average number of words per sentence, the rate of misspelling, swearing, etc.)

2. Psychological processes: Includes all social, emotional, cognitive, perceptual and biological processes, as well as anything related to time or space.

3. Personal concerns: Any references to work, leisure, money, religion, etc.

4.
Spoken categories: Primarily filler and agreement words.

While other features have been considered in past deception detection work, notably those of Zhou et al. (2004), early experiments found LIWC features to perform best. Indeed, the LIWC2007 software used in our experiments subsumes most of the features introduced in other work. Thus, we focus our psycholinguistic approach to deception detection on LIWC-based features.

4.3 Text categorization

In contrast to the other strategies just discussed, our text categorization approach to deception detection allows us to model both content and context with n-gram features. Specifically, we consider the following three n-gram feature sets, with the corresponding features lowercased and unstemmed: UNIGRAMS, BIGRAMS+, TRIGRAMS+, where the '+' indicates that the feature set subsumes the preceding feature set.

4.4 Classifiers

Features from the three approaches just introduced are used to train Naïve Bayes and Support Vector Machine classifiers, both of which have performed well in related work (Jindal and Liu, 2008; Mihalcea and Strapparava, 2009; Zhou et al., 2008).

For a document x, with label y, the Naïve Bayes (NB) classifier gives us the following decision rule:

    ŷ = argmax_c  Pr(y = c) · Pr(x | y = c)    (1)

When the class prior is uniform, for example when the classes are balanced (as in our case), (1) can be simplified to the maximum likelihood classifier (Peng and Schuurmans, 2003):

    ŷ = argmax_c  Pr(x | y = c)    (2)

Under (2), both the NB classifier used by Mihalcea and Strapparava (2009) and the language model classifier used by Zhou et al. (2008) are equivalent. Thus, following Zhou et al. (2008), we use the SRI Language Modeling Toolkit (Stolcke, 2002) to estimate individual language models, Pr(x | y = c), for truthful and deceptive opinions.
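Under the uniform class prior of (2), classification reduces to comparing class-conditional language-model likelihoods. The sketch below substitutes add-one smoothed unigram models for the Kneser-Ney-smoothed models estimated with SRILM, and the toy token lists are invented for illustration:

```python
import math
from collections import Counter

class UnigramLM:
    """Add-one smoothed unigram model: a simplified stand-in for the
    smoothed language models the paper estimates with SRILM."""
    def __init__(self, docs):
        self.counts = Counter(w for d in docs for w in d)
        self.total = sum(self.counts.values())

    def logprob(self, doc, vocab_size):
        return sum(math.log((self.counts[w] + 1) / (self.total + vocab_size))
                   for w in doc)

def classify(doc, models):
    """Maximum-likelihood rule of (2): argmax_c Pr(x | y = c)."""
    vocab = set(doc)
    for m in models.values():
        vocab |= set(m.counts)
    return max(models, key=lambda c: models[c].logprob(doc, len(vocab)))

# Toy tokenized training reviews (invented for illustration).
truthful = [["the", "location", "was", "great"],
            ["small", "bathroom", "but", "great", "location"]]
deceptive = [["my", "husband", "loved", "it"],
             ["we", "loved", "our", "vacation"]]
models = {"truthful": UnigramLM(truthful), "deceptive": UnigramLM(deceptive)}
```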
We consider all three n-gram feature sets, namely UNIGRAMS, BIGRAMS+, and TRIGRAMS+, with corresponding language models smoothed using the interpolated Kneser-Ney method (Chen and Goodman, 1996).

We also train Support Vector Machine (SVM) classifiers, which find a high-dimensional separating hyperplane between two groups of data. To simplify feature analysis in Section 5, we restrict our evaluation to linear SVMs, which learn a weight vector w and bias term b, such that a document x can be classified by:

    ŷ = sign(w · x + b)    (3)

We use SVM_light (Joachims, 1999) to train our linear SVM models on all three approaches and feature sets described above, namely POS, LIWC, UNIGRAMS, BIGRAMS+, and TRIGRAMS+. We also evaluate every combination of these features, but for brevity include only LIWC+BIGRAMS+, which performs best. Following standard practice, document vectors are normalized to unit-length. For LIWC+BIGRAMS+, we unit-length normalize LIWC and BIGRAMS+ features individually before combining them.
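The per-group normalization for the combined LIWC + BIGRAMS+ representation can be sketched as follows (the feature values are made up):

```python
import math

def unit_normalize(vec):
    """Scale a feature vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

def combine(liwc_vec, bigram_vec):
    """Unit-normalize each feature group separately, then concatenate,
    so neither group dominates the combined vector by magnitude."""
    return unit_normalize(liwc_vec) + unit_normalize(bigram_vec)

# Made-up feature values for one document.
doc = combine([3.0, 4.0], [1.0, 2.0, 2.0])
# Each group of doc now has Euclidean norm 1.
```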
Table 3: Automated classifier performance for three approaches based on nested 5-fold cross-validation experiments. Reported precision, recall and F-score are computed using a micro-average, i.e., from the aggregate true positive, false positive and false negative rates, as suggested by Forman and Scholz (2009). Human performance is repeated here for JUDGE 1, JUDGE 2 and the SKEPTIC meta-judge, although they cannot be directly compared, since the 160-opinion subset on which they are assessed only corresponds to the first cross-validation fold.

                                                            TRUTHFUL              DECEPTIVE
  Approach                 Features             Accuracy   P     R     F        P     R     F
  GENRE IDENTIFICATION     POS_SVM               73.0%   75.3  68.5  71.7     71.1  77.5  74.2
  PSYCHOLINGUISTIC         LIWC_SVM              76.8%   77.2  76.0  76.6     76.4  77.5  76.9
  DECEPTION DETECTION
  TEXT CATEGORIZATION      UNIGRAMS_SVM          88.4%   89.9  86.5  88.2     87.0  90.3  88.6
                           BIGRAMS+_SVM          89.6%   90.1  89.0  89.6     89.1  90.3  89.7
                           LIWC+BIGRAMS+_SVM     89.8%   89.8  89.8  89.8     89.8  89.8  89.8
                           TRIGRAMS+_SVM         89.0%   89.0  89.0  89.0     89.0  89.0  89.0
                           UNIGRAMS_NB           88.4%   92.5  83.5  87.8     85.0  93.3  88.9
                           BIGRAMS+_NB           88.9%   89.8  87.8  88.7     88.0  90.0  89.0
                           TRIGRAMS+_NB          87.6%   87.7  87.5  87.6     87.5  87.8  87.6
  HUMAN / META             JUDGE 1               61.9%   57.9  87.5  69.7     74.4  36.3  48.7
                           JUDGE 2               56.9%   53.9  95.0  68.8     78.9  18.8  30.3
                           SKEPTIC               60.6%   60.8  60.0  60.4     60.5  61.3  60.9

5 Results and Discussion

The deception detection strategies described in Section 4 are evaluated using a 5-fold nested cross-validation (CV) procedure (Quadrianto et al., 2009), where model parameters are selected for each test fold based on standard CV experiments on the training folds. Folds are selected so that each contains all reviews from four hotels; thus, learned models are always evaluated on reviews from unseen hotels. Results appear in Table 3.
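The hotel-grouped fold construction can be sketched as below; the review records are hypothetical (the real dataset has 20 hotels with 40 reviews each):

```python
def hotel_folds(reviews, n_folds=5):
    """Group reviews into folds by hotel, so that models are always
    evaluated on reviews from hotels unseen during training."""
    hotels = sorted({r["hotel"] for r in reviews})
    per_fold = len(hotels) // n_folds  # 20 hotels / 5 folds = 4 hotels per fold
    folds = []
    for i in range(n_folds):
        group = set(hotels[i * per_fold:(i + 1) * per_fold])
        folds.append([r for r in reviews if r["hotel"] in group])
    return folds

# Hypothetical records: 20 hotels, 2 reviews each (illustration only).
reviews = [{"hotel": "h%02d" % i, "text": ""} for i in range(20) for _ in range(2)]
folds = hotel_folds(reviews)
```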
We observe that automated classifiers outperform human judges for every metric, except truthful recall where JUDGE 2 performs best. [16] However, this is expected given that untrained humans often focus on unreliable cues to deception (Vrij, 2008). For example, one study examining deception in online dating found that humans perform at-chance detecting deceptive profiles because they rely on text-based cues that are unrelated to deception, such as second-person pronouns (Toma and Hancock, In Press).

[16] As mentioned in Section 3.3, JUDGE 2 classified fewer than 12% of opinions as deceptive. While achieving 95% truthful recall, this judge's corresponding precision was not significantly better than chance (two-tailed binomial p = 0.4).

Among the automated classifiers, baseline performance is given by the simple genre identification approach (POS_SVM) proposed in Section 4.1. Surprisingly, we find that even this simple automated classifier outperforms most human judges (one-tailed sign test p = 0.06, 0.01, 0.001 for the three judges, respectively, on the first fold). This result is best explained by theories of reality monitoring (Johnson and Raye, 1981), which suggest that truthful and deceptive opinions might be classified into informative and imaginative genres, respectively. Work by Rayson et al. (2001) has found strong distributional differences between informative and imaginative writing, namely that the former typically consists of more nouns, adjectives, prepositions, determiners, and coordinating conjunctions, while the latter consists of more verbs, [17] adverbs, [18] pronouns, and pre-determiners. Indeed, we find that the weights learned by POS_SVM (found in Table 4) are largely in agreement with these findings, notably except for adjective and adverb superlatives, the latter of which was found to be an exception by Rayson et al.
(2001). However, that deceptive opinions contain more superlatives is not unexpected, since deceptive writing (but not necessarily imaginative writing in general) often contains exaggerated language (Buller and Burgoon, 1996; Hancock et al., 2008).

[17] Past participle verbs were an exception.
[18] Superlative adverbs were an exception.

                TRUTHFUL / INFORMATIVE                      DECEPTIVE / IMAGINATIVE
Category         Variant            Weight    Category          Variant                 Weight
NOUNS            Singular            0.008    VERBS             Base                   -0.057
                 Plural              0.002                      Past tense              0.041*
                 Proper, singular   -0.041*                     Present participle     -0.089
                 Proper, plural      0.091                      Singular, present      -0.031
ADJECTIVES       General             0.002                      Third person            0.026*
                 Comparative         0.058                        singular, present
                 Superlative        -0.164*                     Modal                  -0.063
PREPOSITIONS     General             0.064    ADVERBS           General                 0.001*
DETERMINERS      General             0.009                      Comparative            -0.035
COORD. CONJ.     General             0.094    PRONOUNS          Personal               -0.098
VERBS            Past participle     0.053                      Possessive             -0.303
ADVERBS          Superlative        -0.094*   PRE-DETERMINERS   General                 0.017*

Table 4: Average feature weights learned by POS_SVM. Based on work by Rayson et al. (2001), we expect weights on the left to be positive (predictive of truthful opinions), and weights on the right to be negative (predictive of deceptive opinions). Entries marked with * are at odds with these expectations. We report average feature weights of unit-normalized weight vectors, rather than raw weight vectors, to account for potential differences in magnitude between the folds.

Both remaining automated approaches to detecting deceptive opinion spam outperform the simple genre identification baseline just discussed. Specifically, the psycholinguistic approach (LIWC_SVM) proposed in Section 4.2 performs 3.8% more accurately (one-tailed sign test p = 0.02), and the standard text categorization approach proposed in Section 4.3 performs between 14.6% and 16.6% more accurately. However, best performance overall is achieved by combining features from these two approaches. Particularly, the combined model LIWC+BIGRAMS+_SVM is 89.8% accurate at detecting deceptive opinion spam.[19] Surprisingly, models trained only on UNIGRAMS (the simplest n-gram feature set) outperform all non-text-categorization approaches, and models trained on BIGRAMS+ perform even better (one-tailed sign test p = 0.07). This suggests that a universal set of keyword-based deception cues (e.g., LIWC) is not the best approach to detecting deception, and a context-sensitive approach (e.g., BIGRAMS+) might be necessary to achieve state-of-the-art deception detection performance.

[19] The result is not significantly better than BIGRAMS+_SVM.

To better understand the models learned by these automated approaches, we report in Table 5 the top 15 highest weighted features for each class (truthful and deceptive) as learned by LIWC+BIGRAMS+_SVM and LIWC_SVM. In agreement with theories of reality monitoring (Johnson and Raye, 1981), we observe that truthful opinions tend to include more sensorial and concrete language than deceptive opinions; in particular, truthful opinions are more specific about spatial configurations (e.g., small, bathroom, on, location). This finding is also supported by recent work by Vrij et al. (2009) suggesting that liars have considerable difficulty encoding spatial information into their lies. Accordingly, we observe an increased focus in deceptive opinions on aspects external to the hotel being reviewed (e.g., husband, business, vacation).

          LIWC+BIGRAMS+_SVM                           LIWC_SVM
TRUTHFUL          DECEPTIVE            TRUTHFUL          DECEPTIVE
-                 chicago              hear              i
...               my                   number            family
on                hotel                allpunct          perspron
location          , and                negemo            see
)                 luxury               dash              pronoun
allpunct_LIWC     experience           exclusive         leisure
floor             hilton               we                exclampunct
(                 business             sexual            sixletters
the hotel         vacation             period            posemo
bathroom          i                    otherpunct        comma
small             spa                  space             cause
helpful           looking              human             auxverb
$                 while                past              future
hotel .           husband              inhibition        perceptual
other             my husband           assent            feel

Table 5: Top 15 highest weighted truthful and deceptive features learned by LIWC+BIGRAMS+_SVM and LIWC_SVM. Ambiguous features are subscripted to indicate the source of the feature. LIWC features correspond to groups of keywords as explained in Section 4.2; more details about LIWC and the LIWC categories are available at http://liwc.net.

We also acknowledge several findings that, on the surface, are in contrast to previous psycholinguistic studies of deception (Hancock et al., 2008; Newman et al., 2003). For instance, while deception is often associated with negative emotion terms, our deceptive reviews have more positive and fewer negative emotion terms. This pattern makes sense when one considers the goal of our deceivers, namely to create a positive review (Buller and Burgoon, 1996).

Deception has also previously been associated with decreased usage of first person singular, an effect attributed to psychological distancing (Newman et al., 2003). In contrast, we find increased first person singular to be among the largest indicators of deception, which we speculate is due to our deceivers attempting to enhance the credibility of their reviews by emphasizing their own presence in the review.
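The feature-weight analysis behind Tables 4 and 5 amounts to training a linear classifier on per-document feature counts and sorting the learned weight vector. The sketch below is a simplified, dependency-free stand-in for that procedure: it substitutes a plain perceptron for the paper's SVM and uses toy reviews invented for illustration (they are not drawn from the gold-standard corpus).

```python
from collections import defaultdict

def ngram_features(text):
    """Unigram + bigram counts (lowercased), a rough analogue of the
    paper's BIGRAMS+ feature set, which also retains unigrams."""
    toks = text.lower().split()
    feats = defaultdict(int)
    for tok in toks:
        feats[tok] += 1
    for a, b in zip(toks, toks[1:]):
        feats[a + "_" + b] += 1
    return dict(feats)

def train_linear(examples, max_epochs=100):
    """Train a linear classifier by perceptron updates (label +1 =
    truthful, -1 = deceptive).  The paper trained a linear SVM with
    5-fold cross-validation; a perceptron is used here only to keep the
    sketch dependency-free -- both learn one weight per feature."""
    w = defaultdict(float)
    for _ in range(max_epochs):
        errors = 0
        for feats, y in examples:
            score = sum(w[f] * v for f, v in feats.items())
            if y * score <= 0:            # misclassified: nudge weights
                errors += 1
                for f, v in feats.items():
                    w[f] += y * v
        if errors == 0:                   # converged on the training set
            break
    return w

# Toy examples, invented for illustration (not from the paper's corpus).
train = [
    (ngram_features("the bathroom was small but the location was great"), +1),
    (ngram_features("my husband and i loved our luxury vacation"), -1),
    (ngram_features("the floor and the bathroom were clean"), +1),
    (ngram_features("my family had a wonderful vacation experience"), -1),
]
w = train_linear(train)

# Mirror the Table 5 analysis: top-weighted features per class.
top_truthful  = sorted(w, key=w.get, reverse=True)[:5]
top_deceptive = sorted(w, key=w.get)[:5]
```

Sorting a linear model's weights this way is how feature tables like Tables 4 and 5 are read off a trained model; per the Table 4 caption, the paper additionally unit-normalizes each fold's weight vector before averaging across folds.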
Additional work is required, but these findings further suggest the importance of moving beyond a universal set of deceptive language features (e.g., LIWC) by considering both the contextual (e.g., BIGRAMS+) and motivational parameters underlying a deception as well.

6 Conclusion and Future Work

In this work we have developed the first large-scale dataset containing gold-standard deceptive opinion spam. With it, we have shown that the detection of deceptive opinion spam is well beyond the capabilities of human judges, most of whom perform roughly at-chance. Accordingly, we have introduced three automated approaches to deceptive opinion spam detection, based on insights coming from research in computational linguistics and psychology. We find that while standard n-gram-based text categorization is the best individual detection approach, a combination approach using psycholinguistically-motivated features and n-gram features can perform slightly better.

Finally, we have made several theoretical contributions. Specifically, our findings suggest the importance of considering both the context (e.g., BIGRAMS+) and motivations underlying a deception, rather than strictly adhering to a universal set of deception cues (e.g., LIWC). We have also presented results based on the feature weights learned by our classifiers that illustrate the difficulties faced by liars in encoding spatial information. Lastly, we have discovered a plausible relationship between deceptive opinion spam and imaginative writing, based on POS distributional similarities.

Possible directions for future work include an extended evaluation of the methods proposed in this work, applied both to negative opinions and to opinions from other domains.
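The combination approach summarized above amounts to concatenating two feature groups, universal keyword-category counts and context-sensitive n-gram counts, into a single vector before training one classifier. The sketch below illustrates that shape; the category names and word lists stand in for the proprietary LIWC lexicon and are invented assumptions, not the actual LIWC dictionary (see http://liwc.net).

```python
# Hypothetical stand-ins for LIWC categories.  The real LIWC lexicon is
# proprietary (http://liwc.net); these names and word lists are invented
# for illustration only.
CATEGORIES = {
    "posemo":   {"great", "wonderful", "loved", "helpful"},  # positive emotion
    "perspron": {"i", "we", "my", "our"},                    # personal pronouns
    "space":    {"small", "floor", "on", "location"},        # spatial terms
}

def combined_features(text):
    """One feature vector holding both keyword-category counts and
    unigram/bigram counts -- the shape of the combined
    LIWC + BIGRAMS+ representation described above."""
    toks = text.lower().split()
    feats = {}
    # Keyword-category counts ("universal" cues).
    for cat, words in CATEGORIES.items():
        feats["LIWC_" + cat] = sum(1 for tok in toks if tok in words)
    # Context-sensitive n-gram counts.
    for tok in toks:
        feats[tok] = feats.get(tok, 0) + 1
    for a, b in zip(toks, toks[1:]):
        bigram = a + "_" + b
        feats[bigram] = feats.get(bigram, 0) + 1
    return feats

feats = combined_features("my husband loved our luxury vacation")
# Category counts (e.g., feats["LIWC_perspron"]) and n-gram counts
# (e.g., feats["my_husband"]) coexist in the same vector, so a single
# linear classifier can weight both kinds of cue jointly.
```

Because the two feature groups live in one vector, the classifier itself decides how much to rely on universal cues versus contextual ones, which is one way to read the paper's finding that the combined model edges out either group alone.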
Many additional approaches to detecting deceptive opinion spam are also possible, and a focus on approaches with high deceptive precision might be useful for production environments.

Acknowledgments

This work was supported in part by National Science Foundation Grants BCS-0624277, BCS-0904822, HSD-0624267, IIS-0968450, and NSCC-0904822, as well as a gift from Google, and the Jack Kent Cooke Foundation. We also thank, alphabetically, Rachel Boochever, Cristian Danescu-Niculescu-Mizil, Alicia Granstein, Ulrike Gretzel, Danielle Kirshenblat, Lillian Lee, Bin Lu, Jack Newton, Melissa Sackler, Mark Thomas, and Angie Yoo, as well as members of the Cornell NLP seminar group and the ACL reviewers for their insightful comments, suggestions and advice on various aspects of this work.

References

C. Akkaya, A. Conrad, J. Wiebe, and R. Mihalcea. 2010. Amazon Mechanical Turk for subjectivity word sense disambiguation. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, Los Angeles, pages 195-203.

D. Biber, S. Johansson, G. Leech, S. Conrad, E. Finegan, and R. Quirk. 1999. Longman Grammar of Spoken and Written English, volume 2. MIT Press.

C.F. Bond and B.M. DePaulo. 2006. Accuracy of deception judgments. Personality and Social Psychology Review, 10(3):214.

D.B. Buller and J.K. Burgoon. 1996. Interpersonal deception theory. Communication Theory, 6(3):203-242.

W.L. Cade, B.A. Lehman, and A. Olney. 2010. An exploration of off topic conversation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 669-672. Association for Computational Linguistics.

S.F. Chen and J. Goodman. 1996. An empirical study of smoothing techniques for language modeling.
In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 310-318. Association for Computational Linguistics.

C. Danescu-Niculescu-Mizil, G. Kossinets, J. Kleinberg, and L. Lee. 2009. How opinions are received by online communities: a case study on amazon.com helpfulness votes. In Proceedings of the 18th International Conference on World Wide Web, pages 141-150. ACM.

H. Drucker, D. Wu, and V.N. Vapnik. 2002. Support vector machines for spam categorization. Neural Networks, IEEE Transactions on, 10(5):1048-1054.

G. Forman and M. Scholz. 2009. Apples-to-apples in cross-validation studies: Pitfalls in classifier performance measurement. ACM SIGKDD Explorations, 12(1):49-57.

Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen. 2004. Combating web spam with TrustRank. In Proceedings of the Thirtieth International Conference on Very Large Data Bases, Volume 30, pages 576-587. VLDB Endowment.

J.T. Hancock, L.E. Curry, S. Goorha, and M. Woodworth. 2008. On lying and being lied to: A linguistic analysis of deception in computer-mediated communication. Discourse Processes, 45(1):1-23.

J. Jansen. 2010. Online product research. Pew Internet & American Life Project Report.

N. Jindal and B. Liu. 2008. Opinion spam and analysis. In Proceedings of the International Conference on Web Search and Web Data Mining, pages 219-230. ACM.

T. Joachims. 1998. Text categorization with support vector machines: Learning with many relevant features. Machine Learning: ECML-98, pages 137-142.

T. Joachims. 1999. Making large-scale support vector machine learning practical. In Advances in Kernel Methods, page 184. MIT Press.

M.K. Johnson and C.L. Raye. 1981. Reality monitoring. Psychological Review, 88(1):67-85.

S.M. Kim, P. Pantel, T. Chklovski, and M. Pennacchiotti. 2006. Automatically assessing review helpfulness.
In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 423-430. Association for Computational Linguistics.

D. Klein and C.D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Volume 1, pages 423-430. Association for Computational Linguistics.

J.R. Landis and G.G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics, 33(1):159.

E.P. Lim, V.A. Nguyen, N. Jindal, B. Liu, and H.W. Lauw. 2010. Detecting product review spammers using rating behaviors. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, pages 939-948. ACM.

S.W. Litvin, R.E. Goldsmith, and B. Pan. 2008. Electronic word-of-mouth in hospitality and tourism management. Tourism Management, 29(3):458-468.

F. Mairesse, M.A. Walker, M.R. Mehl, and R.K. Moore. 2007. Using linguistic cues for the automatic recognition of personality in conversation and text. Journal of Artificial Intelligence Research, 30(1):457-500.

R. Mihalcea and C. Strapparava. 2009. The lie detector: Explorations in the automatic recognition of deceptive language. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 309-312. Association for Computational Linguistics.

M.L. Newman, J.W. Pennebaker, D.S. Berry, and J.M. Richards. 2003. Lying words: Predicting deception from linguistic styles. Personality and Social Psychology Bulletin, 29(5):665.

A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. 2006. Detecting spam web pages through content analysis. In Proceedings of the 15th International Conference on World Wide Web, pages 83-92. ACM.

M.P. O'Mahony and B. Smyth. 2009. Learning to recommend helpful hotel reviews.
In Proceedings of the Third ACM Conference on Recommender Systems, pages 305-308. ACM.

F. Peng and D. Schuurmans. 2003. Combining naive Bayes and n-gram language models for text classification. Advances in Information Retrieval, pages 547-547.

J.W. Pennebaker, C.K. Chung, M. Ireland, A. Gonzales, and R.J. Booth. 2007. The development and psychometric properties of LIWC2007. Austin, TX, LIWC.Net.

N. Quadrianto, A.J. Smola, T.S. Caetano, and Q.V. Le. 2009. Estimating labels from label proportions. The Journal of Machine Learning Research, 10:2349-2374.

P. Rayson, A. Wilson, and G. Leech. 2001. Grammatical word class variation within the British National Corpus sampler. Language and Computers, 36(1):295-306.

R.A. Rigby and D.M. Stasinopoulos. 2005. Generalized additive models for location, scale and shape. Journal of the Royal Statistical Society: Series C (Applied Statistics), 54(3):507-554.

F. Sebastiani. 2002. Machine learning in automated text categorization. ACM Computing Surveys (CSUR), 34(1):1-47.

M.Á. Serrano, A. Flammini, and F. Menczer. 2009. Modeling statistical properties of written text. PLoS ONE, 4(4):5372.

A. Stolcke. 2002. SRILM: an extensible language modeling toolkit. In Seventh International Conference on Spoken Language Processing, volume 3, pages 901-904. Citeseer.

C. Toma and J.T. Hancock. In Press. What lies beneath: The linguistic traces of deception in online dating profiles. Journal of Communication.

A. Vrij, S. Mann, S. Kristen, and R.P. Fisher. 2007. Cues to deception and ability to detect lies as a function of police interview styles. Law and Human Behavior, 31(5):499-518.

A. Vrij, S. Leal, P.A. Granhag, S. Mann, R.P. Fisher, J. Hillman, and K. Sperry. 2009. Outsmarting the liars: The benefit of asking unanticipated questions. Law and Human Behavior, 33(2):159-166.

A. Vrij. 2008.
Detecting Lies and Deceit: Pitfalls and Opportunities. Wiley-Interscience.

W. Weerkamp and M. De Rijke. 2008. Credibility improves topical blog post retrieval. ACL-08: HLT, pages 923-931.

M. Weimer, I. Gurevych, and M. Mühlhäuser. 2007. Automatically assessing the post quality in online discussions on software. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 125-128. Association for Computational Linguistics.

G. Wu, D. Greene, B. Smyth, and P. Cunningham. 2010. Distortion as a validation criterion in the identification of suspicious reviews. Technical report, UCD-CSI-2010-04, University College Dublin.

K.H. Yoo and U. Gretzel. 2009. Comparison of deceptive and truthful travel reviews. Information and Communication Technologies in Tourism 2009, pages 37-47.

L. Zhou, J.K. Burgoon, D.P. Twitchell, T. Qin, and J.F. Nunamaker Jr. 2004. A comparison of classification methods for predicting deception in computer-mediated communication. Journal of Management Information Systems, 20(4):139-166.

L. Zhou, Y. Shi, and D. Zhang. 2008. A statistical language modeling approach to online deception detection. IEEE Transactions on Knowledge and Data Engineering, 20(8):1077-1081.