Privacy Preserving Spam Filtering
Authors: Manas A. Pathak, Mehrbod Sharifi, Bhiksha Raj
Manas A. Pathak, Carnegie Mellon University, manasp@cs.cmu.edu
Mehrbod Sharifi, Carnegie Mellon University, mehrbod@cs.cmu.edu
Bhiksha Raj, Carnegie Mellon University, bhiksha@cs.cmu.edu

ABSTRACT
Email is a private medium of communication, and the inherent privacy constraints form a major obstacle in developing effective spam filtering methods, which require access to a large amount of email data belonging to multiple users. To mitigate this problem, we envision a privacy-preserving spam filtering system, where the server is able to train and evaluate a logistic regression based spam classifier on the combined email data of all users without being able to observe any emails, using primitives such as homomorphic encryption and randomization. We analyze the protocols for correctness and security, and perform experiments with a prototype system on a large-scale spam filtering task. State-of-the-art spam filters often use character n-grams as features, which result in a large, sparse data representation that is not feasible to use directly with our training and evaluation protocols. We explore various data-independent dimensionality reduction techniques which decrease the running time of the protocol, making it feasible to use in practice while achieving high accuracy.

General Terms
Privacy Preserving Machine Learning, Spam Filtering

1. INTRODUCTION
Email is a private medium of communication, with the message intended to be read only by the recipients. Due to the sensitive nature of the information content, there might be personal, strategic, and legal constraints against sharing and releasing email data. These constraints form formidable obstacles in many email processing applications such as spam filtering, which are usually supplied by a separate service provider. Over the years, spam has become a major problem: 75.9% of all emails sent in August 2011 were spam [11].
Email users can benefit from using accurate spam filters, which could greatly reduce the loss of time and productivity due to spam email. A proficient user can directly learn a spam filtering classifier on her own private data and send it to the spam filtering provider or apply it herself, diminishing the need for a privacy-preserving spam filtering system. It is, however, seen that the accuracy of spam filters based on classification models can be vastly improved by training on aggregates of data obtained from a large number of email users. This training and application of spam filters should, however, not come at the expense of user privacy, with users being required to make their emails available to the spam filtering service provider.

In this paper we propose a solution that enables users to share their private email data to train and apply spam filters while satisfying privacy constraints. We choose logistic regression as our classification model, as it is widely used in spam filtering and text classification applications and is observed to achieve very high accuracy in these tasks. The training algorithm for logistic regression based on gradient ascent is also more amenable to being modified to satisfy privacy constraints.
The update step in the training algorithm is also particularly convenient when the training data is split among multiple parties, who can simply compute the gradient on their private email data, and the server can privately aggregate these gradients to update the model parameters. Furthermore, logistic regression can also be easily modified to the online learning setting. In a practical spam filtering system, as the users are unlikely to relabel previously read emails as spam, the classifier needs to be learned on a continuously arriving stream of email data.

Although primarily directed at spam filtering, our solution can also be applied to any form of private text classification and, in general, to any binary classification setting where privacy is important, e.g., predicting the likelihood of disease based on an individual's private medical records. Our methods also extend to batch processing scenarios.

Formally, we consider two kinds of parties: a set of users who have access to their private emails, and a server who is interested in training a spam classification model over the complete email data. The users can communicate with the server but not with each other, as is typically the case in an email service. The primary privacy constraint is that the server should not be able to observe the emails belonging to any of the users, and similarly, any user should not be able to observe emails belonging to any other user. The secondary privacy constraint is that the users should not be able to observe the parameters of the classification model learned by the server. While the motivation behind the former privacy constraint is more obvious, the server might want to keep the classification model private if it was privately trained over large quantities of training data pooled from a large number of users and if the server is interested in offering a restricted, pay-per-use spam filtering service.
We present protocols to train and evaluate logistic regression models while maintaining these privacy constraints. Our privacy-preserving protocol falls into the broad class of secure multiparty computation (SMC) algorithms [15]. In the SMC framework, multiple parties desire to compute a function that combines their individual inputs. The privacy constraint is that no party should learn anything about inputs belonging to any other party besides what can be inferred from the value of the function. We construct our protocol using a cryptosystem satisfying homomorphic encryption [9], in which operations on encrypted data correspond to operations on the original unencrypted data (Section 3.2). We further augment our protocol with additive and multiplicative randomization, and present an information-theoretic analysis of the security of the protocol.

The benefit of training and evaluating a spam filtering classifier privately comes with a substantial overhead of computation and data transmission costs. We find that these costs are linear in the number of training data instances and the data dimensionality. As the size of our character four-gram feature representation of the text data is extremely large (e.g., one million features), application of our protocol on a typical email dataset is prohibitively expensive. Towards this, we apply suitable data dimensionality reduction techniques to make the training protocol computationally usable in practical settings. As the same dimensionality reduction has to be applied by all the parties to their private data, we require that the techniques used are data-independent and do not need to be computed separately. We present an extensive evaluation of our protocol on a large-scale email dataset from the CEAS 2008 spam filtering challenge.
With data-independent dimensionality reduction techniques such as locality sensitive hashing, multinomial sampling, and hash space reduction, we demonstrate that our protocol is able to achieve state-of-the-art performance in a feasible amount of running time. To summarize, our main contributions are:

• Protocols for training and evaluating the logistic regression based spam filtering classifier with online updates, while preserving the privacy of the email data belonging to multiple parties.
• Analysis of the protocols for security and efficiency.
• Dimensionality reduction for making the protocol feasible to use in a practical spam filtering task.
• Experiments with the privacy-preserving training and evaluation protocols over a large-scale spam dataset: the trade-off between running time and accuracy.

2. RELATED WORK
Email spam filtering is a well-established area of research. The accuracy of the best systems in the 2007 CEAS spam filtering competition was better than 0.9999 [3]. Our implementation is an online logistic regression classifier inspired by [5], which on application to binary character four-gram features was shown to have near state-of-the-art accuracy [3].

The application of privacy-preserving techniques to large-scale real-world problems of practical importance, such as spam filtering, is an emerging area of research. Li, et al. [6] present a distributed framework for privacy-aware spam filtering. Their method is based on applying a one-way fingerprinting transformation [1] to the message text and comparing two emails using a Hamming distance metric, and does not involve statistical learning. Additionally, this method requires that the spam emails belonging to all users be revealed, which does not match our privacy criteria.
We consider all emails to be private, as the nature of the spam emails a user receives might be correlated with the user's online and offline activities. There has also been recent work on constructing privacy-preserving protocols for general data mining tasks, including decision trees [12], clustering [7], naive Bayes [13], and support vector machines [14]. To the best of our knowledge, this paper is the first to describe a practical privacy-preserving framework using a logistic regression classifier applied to a real-world spam filtering task.

3. PRELIMINARIES
3.1 Classification Model: Logistic Regression in the Batch and Online Settings
The training dataset consisting of n documents classified by the user as spam or ham (i.e., not spam) is represented as the labeled data instances (x, y) = {(x_1, y_1), ..., (x_n, y_n)}, where x_i ∈ R^d and y_i ∈ {−1, 1}. In the batch learning setting, we assume that the complete dataset is available at a given time. In the logistic regression classification algorithm, we model the class probabilities by a sigmoid function

    P(y_i = 1 | x_i, w) = 1 / (1 + e^{−y_i w^T x_i}).

We denote the log-likelihood for the weight vector w computed over the data instances (x, y) by L(w, x, y). Assuming the data instances to be i.i.d., the data log-likelihood L(w, x, y) is equal to

    L(w, x, y) = log ∏_i 1 / (1 + e^{−y_i w^T x_i}) = −∑_i log[1 + e^{−y_i w^T x_i}].

We maximize the data log-likelihood L(w, x, y) using gradient ascent to obtain the classifier with the optimal weight vector w*. Starting with a uniformly initialized vector w^(0), in the t-th iteration we update w^(t) as

    w^(t+1) = w^(t) + η ∇L(w^(t), x, y) = w^(t) + η ∑_i y_i x_i^T / (1 + e^{y_i w^(t)T x_i}),   (1)

where η is the pre-defined step size. We terminate the procedure on convergence between consecutive values of w^(t).
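The batch update in Equation (1) can be sketched in a few lines. The code below is a minimal, non-private illustration (function and variable names are ours, not from the paper's implementation, and w is initialized at zero for simplicity):

```python
import numpy as np

def train_logistic_gd(X, y, eta=0.1, iters=500, tol=1e-8):
    """Gradient ascent on the logistic log-likelihood L(w, x, y).

    X: (n, d) data matrix, y: (n,) labels in {-1, +1}.
    Update: w <- w + eta * sum_i y_i x_i / (1 + exp(y_i w^T x_i)).
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        margins = y * (X @ w)                        # y_i w^T x_i
        grad = X.T @ (y / (1.0 + np.exp(margins)))   # Eq. (1)
        w_new = w + eta * grad
        if np.linalg.norm(w_new - w) < tol:          # convergence check
            return w_new
        w = w_new
    return w

# Toy example: two separable points are classified correctly.
X = np.array([[1.0, 2.0], [-1.0, -1.5]])
y = np.array([1.0, -1.0])
w = train_logistic_gd(X, y)
assert y[0] * (X[0] @ w) > 0 and y[1] * (X[1] @ w) > 0
```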
In the online learning setting, the data instances are obtained incrementally rather than being completely available at a given instance of time. In this case, we start with a model with the uniformly random weight vector w^(0). A model w^(t) learned using the first t instances is updated after observing a small block of k instances, with the gradient of the log-likelihood computed over that block.

3.2 Homomorphic Encryption
In a homomorphic cryptosystem, operations performed on encrypted data (ciphertext) map to corresponding operations performed on the original unencrypted data (plaintext). If + and · are two operators and x and y are two plaintexts, a homomorphic encryption function E satisfies

    E[x] · E[y] = E[x + y].

This allows one party to encrypt the data using a homomorphic encryption scheme and another party to perform operations without being able to observe the plaintext data. This property forms the fundamental building block of our privacy-preserving protocol.

In this work we use the additively homomorphic Paillier cryptosystem [9], which also satisfies semantic security. The Paillier key generation algorithm produces a pair of b-bit numbers (N, g) constituting the public key, corresponding to the encryption function E: Z_N → Z_{N^2}, and another pair of b-bit numbers (λ, μ) constituting the private key, corresponding to the decryption function D: Z_{N^2} → Z_N. Given a plaintext x ∈ Z_N, the encrypted text is given by

    E[x] = g^x r^N mod N^2,

where r is a random number sampled uniformly from Z_N. Using a different value of the random number r provides semantic security, i.e., two different encryptions of a number x, say E[x; r_1] and E[x; r_2], will have different values, but decrypting each of them will result in the same number x. It can be easily verified that the above encryption function satisfies the following properties:
1. For any two ciphertexts E[x] and E[y],
    E[x] E[y] = E[x + y mod N].
2. As a corollary, for any ciphertext E[x] and plaintext y,
    E[x]^y = E[xy mod N].

Extending the Encryption Function to Real Numbers
Paillier encryption, like most other cryptosystems, is defined over the finite field Z_N = {0, ..., N − 1}. However, in our protocol we need to encrypt real numbers, such as the training data and model parameters. We make the following modifications to the encryption function to support this.

1. Real numbers are converted to a fixed-precision representation. For a large constant C, a real number x is represented as ⌊Cx⌋ = x̄:
    E[x̄] = E[⌊Cx⌋],  D[E[x̄]] / C = ⌊Cx⌋ / C ≈ x.
2. The encryption of a negative integer is represented by the encryption of its modular additive inverse. If −x is a negative integer,
    E[−x] = E[N − x].
3. Exponentiation of an encrypted number by a negative integer is represented as the exponentiation of the multiplicative inverse of the encryption in the Z_{N^2} field by the corresponding positive integer. We represent the exponentiation¹ of the ciphertext E[x] by a negative integer −y as
    E[x]^{−y} = (E[x]^{−1} mod N^2)^y.

¹We slightly abuse the notation of non-modular exponentiation of the ciphertext, writing E[x]^a to refer to E[x] · E[x] ··· (a times).

Representing real numbers by a fixed-precision number introduces a small error due to the truncation, which is inversely proportional to the value of C. This representation also reduces the domain of the encryption function from {0, ..., N − 1} to {0, ..., ⌊(N − 1)/C⌋}. We need to ensure that the results of homomorphic operations on encrypted values do not overflow this range, so we need to increase the bit-size b of the encryption keys proportionally with C.
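The homomorphic properties and the fixed-point encoding above can be exercised with a toy Paillier implementation. This is a sketch for illustration only: the primes are tiny and insecure (real keys are 1024 bits or more), and it uses the common g = N + 1 simplification:

```python
import math
import random

# Toy Paillier cryptosystem over tiny primes (insecure; demo only).
p, q = 293, 433
N, N2 = p * q, (p * q) ** 2
g = N + 1                                  # standard simplified generator
lam = math.lcm(p - 1, q - 1)               # private key component
mu = pow((pow(g, lam, N2) - 1) // N, -1, N)

def E(x):
    """Encrypt x in Z_N: E[x] = g^x r^N mod N^2."""
    r = random.randrange(1, N)
    return (pow(g, x, N2) * pow(r, N, N2)) % N2

def D(c):
    """Decrypt: L(c^lam mod N^2) * mu mod N, where L(u) = (u-1)//N."""
    return ((pow(c, lam, N2) - 1) // N) * mu % N

# Additive homomorphism: E[x] * E[y] = E[x + y]
assert D(E(3) * E(4) % N2) == 7
# Plaintext multiplication: E[x]^y = E[x*y]
assert D(pow(E(3), 5, N2)) == 15
# Negative numbers via the modular additive inverse: E[-3] = E[N - 3]
assert D(E(10) * E(N - 3) % N2) == 7
# Fixed-point reals: encrypt round(C*x), decode by dividing by C
C = 1000
assert abs(D(E(round(C * 3.141))) / C - 3.141) < 1e-9
```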
As the computational cost of the encryption operations is also proportional to b, this creates a trade-off between accuracy and computation cost. The representation of negative integers, on the other hand, does not introduce any error but further halves the domain of the encryption function from {0, ..., ⌊(N − 1)/C⌋} to {0, ..., ⌊(N − 1)/(2C)⌋}, which we denote by D.

4. PRIVACY PRESERVING CLASSIFIER TRAINING AND EVALUATION
4.1 Data Setup and Privacy Conditions
We define the party "Bob" who is interested in training a logistic regression classifier with weight vector w ∈ R^d. In the online learning setting, multiple users interact with Bob, one at a time, using their private training data as input. As all these parties play the same role in their interactions with Bob in one update step, we represent them by a generic user "Alice". Later on we see how Bob privately aggregates the encrypted gradients provided by individual parties.

Alice has a sequence of labeled training data instances (x, y) = {(x_1, y_1), ..., (x_n, y_n)}. Bob is interested in training a logistic regression classifier with weight vector w ∈ R^d over (x, y), as discussed in Section 3.1. The privacy constraint implies that Alice should not be able to observe w and Bob should not be able to observe (x_i, y_i). The parties are assumed to be semi-malicious, i.e., they correctly execute the steps of the protocol and do not attempt to cheat by using fraudulent data as input in order to extract additional information about the other parties. The parties are also assumed to be curious, i.e., they keep a transcript of all intermediate results and can use it to gain as much information as possible.

4.2 Private Training Protocol
Bob generates a public and private key pair for a b-bit Paillier cryptosystem and provides the public key to Alice.
In this cryptosystem, Bob is able to perform both encryption and decryption operations, while Alice can perform only encryption.

As mentioned before, we use the homomorphic properties of Paillier encryption to allow the parties to perform computations using private data. The update rule requires Bob to compute the gradient of the data log-likelihood function ∇L(w^(t), x, y), which involves exponentiation and division and cannot be done using only homomorphic additions and multiplications. We supplement the homomorphic operations with Bob performing those operations on multiplicative shares to maintain the privacy constraints. As mentioned in Section 3.2, the domain of the encryption function is D = {0, ..., ⌊(N − 1)/(2C)⌋}. We sample the randomizations uniformly from this set.

Bob initiates the protocol with a uniform w^(0), and the gradient step size η is publicly known. We describe the t-th iteration of the protocol below.

Input: Alice has (x, y) and the encryption key; Bob has w^(t) and both encryption and decryption keys.
Output: Bob has w^(t+1).

1. Bob encrypts w^(t) and transfers E[w^(t)] to Alice.
2. For each training instance x_i, i = 1, ..., n, Alice computes
    ∏_{j=1}^{d} E[w_j^(t)]^{y_i x_ij} = E[∑_{j=1}^{d} y_i w_j^(t) x_ij] = E[y_i w^(t)T x_i].
3. Alice samples n numbers r_1, ..., r_n uniformly from Z_N and computes
    E[y_i w^(t)T x_i] · E[−r_i] = E[y_i w^(t)T x_i − r_i].
Alice transfers E[y_i w^(t)T x_i − r_i] to Bob.
4. Bob decrypts this to obtain y_i w^(t)T x_i − r_i. In this way, Alice and Bob hold additive shares of the inner products y_i w^(t)T x_i.
5. Bob exponentiates and encrypts his shares of the inner products. He transfers E[e^{y_i w^(t)T x_i − r_i}] to Alice.
6. Alice homomorphically multiplies the quantities she obtained from Bob by the exponentiations of her corresponding random shares to obtain the encryption of the exponentiations of the inner products:²
    E[e^{y_i w^(t)T x_i − r_i}]^{e^{r_i}} = E[e^{y_i w^(t)T x_i}].
Alice homomorphically adds E[1] to these quantities to obtain E[1 + e^{y_i w^(t)T x_i}].
7. Alice samples n numbers q_1, ..., q_n from D using a bounded power-law distribution.³ She then homomorphically computes
    E[1 + e^{y_i w^(t)T x_i}]^{q_i} = E[q_i (1 + e^{y_i w^(t)T x_i})].
She transfers these quantities to Bob.
8. Bob decrypts these quantities and computes the reciprocals
    1 / (q_i (1 + e^{y_i w^(t)T x_i})).
He then encrypts the reciprocals and sends them to Alice.
9. Alice homomorphically multiplies q_i with the encrypted reciprocals to cancel out her multiplicative share:
    E[1 / (q_i (1 + e^{y_i w^(t)T x_i}))]^{q_i} = E[1 / (1 + e^{y_i w^(t)T x_i})].
10. Alice then homomorphically multiplies the encrypted reciprocal by each component of y_i x_i^T to obtain the encrypted d-dimensional vector
    E[1 / (1 + e^{y_i w^(t)T x_i})]^{y_i x_i^T} = E[y_i x_i^T / (1 + e^{y_i w^(t)T x_i})].
She homomorphically adds these encrypted vectors over all instances to obtain
    ∏_i E[y_i x_i^T / (1 + e^{y_i w^(t)T x_i})] = E[∑_i y_i x_i^T / (1 + e^{y_i w^(t)T x_i})].
This is the encrypted gradient vector E[∇L(w^(t), x, y)].
11. Alice homomorphically updates the encrypted weight vector she obtained in Step 1 with the gradient.

²In some cases, the exponentiation might cause the plaintext to overflow the domain of the encryption function. This can be handled by computing the sigmoid function homomorphically using a piecewise linear sum of components.
³We require that q has the pdf P(q) ∝ 1/q for 1 ≤ q ≤ |D|; q can be generated using inverse transform sampling. We discuss the reasons for this in Section 5.2.
    E[w^(t+1)] = E[w^(t)] · E[∇L(w^(t), x, y)]^η = E[w^(t) + η ∇L(w^(t), x, y)].
12. Alice then sends the updated weight vector E[w^(t+1)] to Bob, who decrypts it to obtain his output.

In this way, Bob is able to update his weight vector using Alice's data while maintaining the privacy constraints. In the batch setting, Alice and Bob repeat Steps 2 to 11 to perform the iterative gradient ascent. Bob can check for convergence in the value of w between iterations by performing Step 12. In the online setting, Alice and Bob execute the protocol only once, with Alice using a typically small block of k data instances as input.

Extensions to the Training Protocol
1. Training on private data horizontally split across multiple parties. In the online setting we do not make any assumption about which data-holding party is participating in the protocol. Just as Alice uses her data to update w privately, other parties can then use their data to perform the online update using the same protocol. In the batch setting, multiple parties can execute one iteration of the protocol individually with Bob to compute the encrypted gradient on their own data. Finally, Bob can receive the encrypted gradients from all the parties and update the weight vector as follows:
    w^(t+1) = w^(t) + η ∑_k ∇L(w^(t), x^k, y^k),
where (x^1, y^1), ..., (x^K, y^K) are the individual datasets belonging to the K parties.
2. Training a regularized classifier. The protocol can easily be extended to introduce ℓ2 regularization, which is a commonly used method to prevent over-fitting. In this case the update rule becomes
    w^(t+1) = w^(t) + η ∇L(w^(t), x, y) + 2λ w^(t),
where λ is the regularization parameter. This can be accommodated by Alice homomorphically adding the term 2λ w^(t) to the gradient in Step 11:
    E[w^(t+1)] = E[w^(t)]^{1+2λ} · E[∇L(w^(t), x, y)]^η = E[(1 + 2λ) w^(t) + η ∇L(w^(t), x, y)].
In order to identify the appropriate value of λ to use, Alice and Bob can perform m-fold cross-validation by repeatedly executing the private training and evaluation protocols over different subsets of data belonging to Alice.

4.3 Private Evaluation Protocol
Another party, "Carol", having one test data instance x′ ∈ R^d, is interested in applying the classification model with weight vector w belonging to Bob. Here, the privacy constraints require that Bob should not be able to observe x′ and Carol should not be able to observe w. Similar to the training protocol, Bob generates a public and private key pair for a b-bit Paillier cryptosystem and provides the public key to Carol.

In order to label the data instance as y′ = 1, Carol needs to check if P(y′ = 1 | x′, w) = 1 / (1 + e^{−w^T x′}) > 1/2, and vice versa for y′ = −1. This is equivalent to checking if w^T x′ > 0. We develop the following protocol towards this purpose.

Input: Bob has w and generates a public-private key pair. Carol has x′ and Bob's public key.
Output: Carol knows whether w^T x′ > 0.

1. Bob encrypts w and transfers E[w] to Carol.
2. Carol homomorphically computes the encrypted inner product:
    ∏_{j=1}^{d} E[w_j]^{x′_j} = E[∑_{j=1}^{d} w_j x′_j] = E[w^T x′].
3. Carol generates a random number r and sends E[w^T x′ − r] to Bob.
4. Bob decrypts it to obtain his additive share w^T x′ − r. Let us denote it by −s, so that r − s = w^T x′.
5. Bob and Carol execute a variant of the secure millionaire protocol [15] with inputs r and s, and both learn whether r > s. If r > s, Carol concludes w^T x′ > 0, and if r < s, she concludes w^T x′ < 0.

In this way, Carol and Bob are able to perform the classification operation while maintaining the privacy constraints.
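The blinding and unblinding pipeline of training Steps 3 to 9 can be checked in the clear, without any encryption. The simulation below (our own sketch; all names are illustrative) shows that the additive mask r and the power-law multiplicative mask q cancel exactly, so Alice ends up with the sigmoid factor 1/(1 + e^{y_i w^T x_i}) while Bob only ever sees the blinded values z − r and q(1 + e^z). The mask r is kept to a small range here only so that float exponentials stay representable (footnote 2 addresses overflow in the real protocol):

```python
import math
import random

def sample_powerlaw(dmax, rng=random):
    """Inverse-transform sample of P(q) proportional to 1/q on [1, dmax]
    (footnote 3): CDF F(q) = ln(q)/ln(dmax), so q = dmax ** u for uniform u."""
    return dmax ** rng.random()

def blinded_sigmoid_factor(z, dmax=1e6, rng=random):
    """Simulate Steps 3-9 of the training protocol in the clear.

    z = y_i w^T x_i. Bob only sees z - r and q*(1 + e^z), yet Alice
    recovers 1/(1 + e^z)."""
    r = rng.uniform(0, 20)                    # Step 3: additive blinding (small range for the demo)
    bob_share = z - r                         # Step 4: Bob's additive share
    e_z = math.exp(bob_share) * math.exp(r)   # Steps 5-6: exponentiate, unblind
    q = sample_powerlaw(dmax, rng)            # Step 7: multiplicative blinding
    recip = 1.0 / (q * (1.0 + e_z))           # Step 8: Bob's reciprocal
    return q * recip                          # Step 9: Alice cancels q

z = 0.7
assert abs(blinded_sigmoid_factor(z) - 1.0 / (1.0 + math.exp(z))) < 1e-9
```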
If Bob has to repeatedly execute the same protocol, he can pre-compute E[w] to be used in Step 1.

5. ANALYSIS
5.1 Correctness
The private training protocol does not alter any of the computations of the original training algorithm and therefore results in the same output. The additive randomization r_i introduced in Step 3 is removed in Step 6, leaving the results unchanged. Similarly, the multiplicative randomization q_i introduced in Step 7 is removed in Step 9. As discussed in Section 3.2, the only source of error is the truncation of less significant digits in the finite-precision representation of real numbers. In practice, we observe that the error in computing the weight vector w is negligibly small and does not result in any loss of accuracy.

5.2 Security
The principal requirement of a valid secure multiparty computation (SMC) protocol is that any party must not learn anything about the input data provided by the other parties apart from what can be inferred from the result of the computation itself. As we mentioned earlier, we assume that the parties are semi-malicious. From this perspective, it can be seen that the private training protocol (Section 4.2) is demonstrably secure.

Alice/Carol: In the private training protocol, Alice can only observe encrypted inputs from Bob and hence she does not learn anything about the weight vector used by Bob. In the private classifier evaluation protocol, the party Carol with the test email only receives the final outcome of the classifier in plaintext. Thus, the only additional information available to her is the output of the classifier itself, which, being the output, is permissible under the privacy criteria of the problem.

Bob: In the training stage, Bob receives unencrypted data from Alice in Steps 3, 8, and 12.

• Step 3: Bob receives y_i w^(t)T x_i − r_i. Let us denote this quantity by v and y_i w^(t)T x_i by z, giving us v = z − r_i.
Since r_i is drawn from a uniform distribution over the entire finite field Z_N, for any v and for every value of z there exists a unique value of r_i such that v = z − r_i. Thus, P_z(z | v) ∝ P_z(z) P_r(z − v) = P_z(z).⁴ The conditional entropy H(z | v) = H(z), i.e., Bob receives no information from the operation.

• Step 8: A similar argument can be made for this step. Here Bob receives v = qz, where z = 1 + e^{y_i w^(t)T x_i}. It can be shown that for any value v that Bob receives, P_z(z | v) ∝ P_z(z) P_q(v/z)/z. Since q is drawn from a power-law distribution, i.e., P_q(q) ∝ 1/q, the factor P_q(v/z)/z ∝ 1/v does not depend on z, and for all v < |D|, P_z(z | v) = P_z(z). Once again, the conditional entropy H(z | v) = H(z), i.e., Bob receives no information from the operation.

• Step 12: The information Bob receives in this step is the updated weight vector, which is the result of the computation that Bob is permitted to receive by the basic premise of the SMC protocol.

⁴The notation P_x(X) denotes the probability with which the random variable x takes the value X.

Information Revealed by the Output
We assume that all the parties agree with Bob receiving the updated classifier at the end of the training protocol; this forms the premise behind their participation in the protocol to start with. If the parties use the modified training protocol which results in a differentially private classifier, no information about the data can be gained from the output classifier. In case the parties use the original training protocol, the output classifier does reveal information about the input data, which we quantify and present ways to minimize in the following analysis.

At the end of Step 12 in each iteration, Bob receives the updated weight vector w^(t+1) = w^(t) + η ∇L(w^(t), x, y).
As he also has the previous weight vector w^(t), he effectively observes the gradient
    ∇L(w^(t), x, y) = ∑_i y_i x_i^T (1 + e^{y_i w^(t)T x_i})^{−1}.
In the online setting, we normally use one training data instance at a time to update the classifier. If Alice participates in the training protocol using only one document (x_1, y_1), the gradient observed by Bob will be
    y_1 x_1 (1 + e^{y_1 w^(t)T x_1})^{−1},
which is simply a scaling of the data vector y_1 x_1. As Bob knows w^(t), he effectively knows y_1 x_1. In particular, if x_1 is a vector of non-negative counts, as is the case for n-grams, the knowledge of y_1 x_1 is equivalent to knowing x_1. Although the protocol itself is secure, the output reveals Alice's data completely.

Alice can prevent this by updating the classifier using blocks of K document vectors (x, y) at a time. The protocol ensures that for each block of K vectors, Bob only receives the gradient computed over them:
    ∇L(w^(t), x, y) = ∑_{i=1}^{K} y_i x_i^T (1 + e^{y_i w^(t)T x_i})^{−1} = ∑_{i=1}^{K} g(w^(t), x_i, y_i) x_i,
where g(w^(t), x_i, y_i) is a scalar function of the data instance such that g(w^(t), x_i, y_i) x_i has a one-to-one mapping to x_i. Assuming that all data vectors x_i are i.i.d., using Jensen's inequality we can show that the conditional entropy
    H[x_i | ∇L(w^(t), x, y)] ≤ ((K − 1)/K) H[x_i] + log(K).   (2)
In other words, while Bob gains some information about the data belonging to Alice, the amount of this information is inversely proportional to the block size. In the online learning setting, choosing a large block size decreases the accuracy of the classifier. Therefore, the choice of the block size effectively becomes a parameter that Alice can control to trade off giving away some information about her data against the accuracy of the classifier.
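The single-document leak is easy to make concrete with the paper's binary four-gram representation. In the sketch below (our own illustration; names and values are hypothetical), Bob knows w and sees the one-instance gradient: for binary features, the support of the gradient is exactly the document's feature vector, and the common sign of its nonzero entries reveals the label:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 12
w = rng.normal(size=d)           # weight vector known to Bob after Step 12

# Alice's single document: binary four-gram features and label y1.
x1 = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0], dtype=float)
y1 = -1.0

# Bob observes the one-instance gradient y1*x1 / (1 + exp(y1 w^T x1)).
g = y1 * x1 / (1.0 + np.exp(y1 * (w @ x1)))

# The scaling factor is strictly positive, so the support of g *is* x1,
# and the sign of any nonzero entry is the label y1.
x_recovered = (g != 0).astype(float)
y_recovered = np.sign(g[g != 0][0])

assert (x_recovered == x1).all() and y_recovered == y1
```

Aggregating over a block of K documents mixes the K scaled vectors into one sum, which is exactly what bound (2) quantifies.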
In Section 6.2, we empirically analyze the performance of the classifier for varying block sizes. We observe that in practice, the accuracy of the classifier is not reduced even after choosing substantially large blocks of 1000 documents, which would hardly cause any loss of information, as given by Equation 2.

5.3 Complexity
We analyze the encryption/decryption and the data transmission costs for a single execution of the protocol, as these consume a vast majority of the time. There are 6 steps of the protocol where encryption or decryption operations are carried out:

1. In Step 1, Bob encrypts the d-dimensional vector w^(t).
2. In Step 3, Alice encrypts the n random numbers r_i.
3. In Step 4, Bob decrypts the n inner products obtained from Alice.
4. In Step 5, Bob encrypts the exponentiations of the n inner products.
5. In Step 8, Bob decrypts, takes a reciprocal, and encrypts the n multiplicatively scaled quantities.
6. In Step 12, Bob decrypts the d-dimensional updated weight vector obtained from Alice.

Total: 3n + 2d encryptions and decryptions.

Similarly, there are 6 steps of the protocol where Alice and Bob transfer data to each other:

1. In Step 1, Bob transfers the d-dimensional vector w^(t) to Alice.
2. In Step 3, Alice transfers n randomized inner products to Bob.
3. In Step 5, Bob transfers the n encrypted exponentials to Alice.
4. In Step 7, Alice transfers n scaled quantities to Bob.
5. In Step 8, Bob transfers the n encrypted reciprocals to Alice.
6. In Step 11, Alice transfers the d-dimensional encrypted updated weight vector to Bob.

Total: 4n + 2d elements transmitted.

The speed of performing the encryption and decryption operations depends directly on the size of the key of the cryptosystem. Similarly, when we transfer encrypted data, the size of an individual element also depends on the size of the encryption key.
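A back-of-the-envelope cost model follows directly from these tallies. The sketch below is our own (not from the paper); it uses the section's stated totals and the fact that a Paillier ciphertext for a b-bit modulus N lives in Z_{N^2}, i.e., roughly 2b bits per transmitted element:

```python
def protocol_cost(n, d, key_bits):
    """Per-iteration cost model from Section 5.3 (our own estimate).

    n: number of training instances, d: feature dimensionality,
    key_bits: Paillier key size b; each ciphertext is about 2b bits."""
    ops = 3 * n + 2 * d                       # encryptions + decryptions
    elements = 4 * n + 2 * d                  # transmitted ciphertexts
    megabytes = elements * 2 * key_bits / 8 / 1e6
    return ops, elements, megabytes

# One update over a block of n = 1000 documents with d = 10^4 reduced
# features and a 1024-bit key.
ops, elems, mb = protocol_cost(1000, 10_000, 1024)
assert (ops, elems) == (23_000, 24_000)
```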
As the security of the encryption function is largely determined by the size of the encryption key, this reflects a direct trade-off between security and efficiency.

6. EXPERIMENTS
We provide an experimental evaluation of our approach for the task of email spam filtering. The privacy preserving training protocol requires a substantially larger running time as compared to the non-private algorithm. In this section, we analyze the training protocol for running time and accuracy. As the execution of the protocol on the original dataset requires an infeasible amount of time, we show how data independent dimensionality reduction can be used to effectively reduce the running time while still achieving comparable accuracy.

As is conventional in spam filtering research, we report AUC scores.^5 AUC is considered a more appropriate metric for this task than metrics such as classification accuracy or F-measure because it averages the performance of the classifier over different precision-recall points, which correspond to different thresholds on the prediction confidence of the classifier. The AUC score of a random classifier is 0.5 and that of a perfect classifier is 1. We compared the AUC performance of the classifier given by the privacy preserving training protocol with the non-private training algorithm, and in all cases the numbers were identical up to five significant digits. Therefore, the error due to the finite precision representation mentioned in Section 5.1 is negligible for practical purposes.

^5 Area under the ROC curve.

Table 1: Email spam dataset summary.
Section    Spam         Non-spam    Total
Training   2466 (82%)   534 (18%)   3000
Testing    2383 (79%)   617 (21%)   3000

6.1 Email Spam Dataset
We used the public spam email corpus from the CEAS 2008 spam filtering challenge.^6 For generality, we refer to emails as documents. Performance of various algorithms on this dataset is reported in [10].
The dataset consists of 3,067 training and 206,207 testing documents manually labeled as spam or ham (i.e., not spam). To simplify the benchmark calculations, we used the first 3000 documents from each set (Table 1). The accuracy of the baseline majority classifier, which labels all documents as spam, is 0.79433.

6.2 Spam Filter Implementation
Our classification approach is based on online logistic regression [5], as described in Section 3.1. The features are overlapping character four-grams, extracted from the documents by a sliding window of four characters. The features are binary, indicating the presence or absence of the given four-gram. The documents are in ASCII or UTF-8 encoding, which represents each character in 8 bits; therefore the space of possible four-gram features is 2^32. Following the previous work, we used modulo 10^6 to reduce the four-gram feature space to one million features, and only the first 35 KB of each document is used to compute the features. For all experiments, we use a step size of η = 0.001, and no regularization or noise required for differential privacy is used.

Table 2: Running time comparison of online training of logistic regression (LR) and the privacy preserving logistic regression (PPLR) for one document.
Feature Count     LR      PPLR
Original: 10^6    0.5 s   1.14 hours
Reduced: 10^4     5 ms    41 s

Table 3: Running time of privacy preserving logistic regression for one document of 10^4 features with different encryption key sizes.
Encryption Key Size   Time
256 bit               41 s
1024 bit              2013 s

6.3 Protocol Implementation
We created a prototype implementation of the protocol in C++ and used the variable precision arithmetic libraries provided by OpenSSL [8] to implement the Paillier cryptosystem. We used the GSL libraries [4] for matrix operations. We performed the experiments on a 3.2 GHz Intel Pentium 4 machine with 2 GB RAM running 64-bit Ubuntu.
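The four-gram feature extraction of Section 6.2 can be sketched as follows. This is an illustrative interpretation, not the paper's exact code: it reads each overlapping 4-byte window as a 32-bit integer (matching the 2^32 feature space) and folds it into the reduced space with a modulus.

```python
def fourgram_features(text, num_features=10**6, max_bytes=35 * 1024):
    """Binary character four-gram features, hashed into a reduced space.

    Each overlapping window of 4 bytes is read as a 32-bit integer
    (the full space is 2**32 possible four-grams) and mapped into
    num_features buckets by taking it modulo num_features.
    """
    data = text.encode("utf-8")[:max_bytes]  # only the first 35 KB is used
    indices = set()
    for i in range(len(data) - 3):
        gram = int.from_bytes(data[i:i + 4], "big")
        indices.add(gram % num_features)     # binary presence indicator
    return indices

feats = fourgram_features("cheap viagra now")  # set of active feature indices
```

The returned index set is exactly the sparse binary vector used by the classifier: a feature is 1 if its index appears in the set.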
^6 The dataset is available at http://plg.uwaterloo.ca/~gvcormac/ceascorpus/. The part of the dataset we have used corresponds to the pretrain-nofeedback task.

Table 4: Time requirement for steps of the protocol for random matrices of the dimensions shown (documents × features).
Steps    Time (s) - 200 × 20    Time (s) - 200 × 100
1        0.06                   0.31
2, 3     2.59                   10.14
4, 5     0.82                   0.73
6, 7     0.46                   0.41
8        0.84                   0.73
9, 10    1.81                   8.33
11       0.05                   0.18
Total    6.61                   20.81

The original dataset has 10^6 features, as described in Section 6.2. Consistent with the complexity analysis of the training protocol (Section 5.1), we observed that the time required for the training protocol is linear in the number of documents and the number of features.

Table 2 compares the time required to train a logistic regression classifier with and without the privacy preserving protocol using 256-bit encryption for one document. The protocol is slower than the non-private version by a factor of 10^4, mainly due to the encryption in each step of the protocol. We also observe that the running time is drastically reduced by dimensionality reduction. While the execution time of the training protocol over the original feature space would be infeasible for most applications, the execution time over the reduced feature space is usable in spam filtering applications. This motivated us to consider various dimensionality reduction schemes, which we discuss in Section 6.4.

To further analyze the behavior of various steps of the protocol, in Table 4 we report the running time of individual steps of the protocol outlined in Section 4.2 on two test datasets of random vectors. It can be observed that encryption is the main bottleneck among the operations in the protocol. We use the Paillier cryptosystem with 256-bit keys in the following experiments.
As shown in Table 3, using the more secure 1024-bit encryption keys resulted in a slowdown by a factor of about 50 as compared to using 256-bit encryption keys. This is a constant factor which can be applied to all our timing results if the stronger level of security provided by 1024-bit keys is desired.

Using a pre-computed value of the encrypted weight vector E[w], the private evaluation protocol took 210.956 seconds for one document using 10^6 features and 2.059 seconds for one document using 10^4 features, which again highlights the necessity of dimensionality reduction to make the private computation feasible.

6.4 Dimensionality Reduction
Since the time requirement of the privacy preserving protocol varies linearly with the data dimensionality, we can improve it by dimensionality reduction, principally because data with fewer features will require fewer encryptions and decryptions. On the other hand, reducing the dimensionality of the features, particularly for sparse features such as n-gram counts, can affect the classification performance. We study this behavior by experimenting with six different dimensionality reduction techniques, and compare the running time and AUC of the classifier learned by the training protocol. We consider PCA, which is a data-dependent dimensionality reduction technique, and five other techniques which are data independent. The latter techniques are much more suitable in our setting as they can be used by multiple parties on their individual documents without violating privacy.

Table 5: Performance of PCA for dimensionality reduction.
Dimension   Time (s)   AUC
5           18         0.96159
10          37         0.99798
50          242        0.99944
100         599        0.99967
300         5949       0.99981

Figure 1: Time comparison for the dimensionality reduction approaches, reduced from 10^6 to 10^4 dimensions. (Plot of AUC against time in seconds for PCA, LSH, Hash Space, Sample Multinomial, Sample Uniform, and Document Frequency.)

1. Principal Component Analysis (PCA): PCA is perhaps the most commonly used dimensionality reduction technique; it computes the lower dimensional projection of the data based on the most dominant eigenvectors of the covariance matrix of the original data. Since we only compute a small number of eigenvectors, PCA is found to be efficient for our sparse binary dataset. Table 5 summarizes the running time and the AUC of the classifier trained on the reduced dimension data. While the performance of PCA is excellent, it has the following disadvantages, motivating us to look at other techniques.

Table 6: Time and space requirement for dimensionality reduction methods for reduction from 10^6 to 10^4 features.
Method               Time (s)    Space (GB)
PCA                  7 × 10^6    41
LSH                  50 × 10^3   40
Hash Space           41          –
Document Frequency   1           –
Sample Uniform       2           –
Sample Multinomial   490         –

Figure 2: Performance of one iteration of logistic regression training on 300 dimensional PCA feature vectors with different batch sizes. (Plot of AUC against batch size in documents.)

(a) When training in a multiparty setting, all the parties are required to use a common feature representation. Among the methods we considered, only PCA computes a projection matrix which is data dependent. This projection matrix cannot be computed over the private training data because it reveals information about the data.
(b) For many classification tasks, reduction to an extremely small subspace hurts the performance much more significantly than in our case. Furthermore, computing PCA with high dimensional data is not efficient, and we are interested in efficient and scalable dimensionality reduction techniques.

2. Locality Sensitive Hashing (LSH): In LSH [2], we choose k random hyperplanes in the original d-dimensional space, each representing one dimension in the target space. The reduced dimensions are binary and indicate the side of the hyperplane on which the original point lies.

3. Hash Space Reduction: As mentioned in Section 6.2, we reduce the original feature space by taking each feature index modulo 10^6. We experimented with different sizes of this hash space.

4. Document Frequency Based Pruning: We select features which occur in at least k documents. This is a common approach to removing rarely-occurring features, although some of those features could be discriminative, especially in a spam filtering task.

5. Uniform Sampling: In this approach, we draw from the uniform distribution until the desired number of unique features is selected.

6. Multinomial Sampling: This approach is similar to uniform sampling, except that we first fit a multinomial distribution based on the document frequency of the features and then draw from this distribution. This biases the sampling toward features with higher variance, which are often the more informative features.

We ran each of these algorithms on 6000 documents of 10^6 dimensions. Table 6 summarizes the time and space requirement of each algorithm for reducing the dimensionality to 10^4. We trained the logistic regression classifier on 3000 training documents with various reduced dimensions and measured the running time and AUC of the learned classifier on the 3000 test documents. The results are shown in Figure 1.
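The random-hyperplane LSH of technique 2 can be sketched as follows (a minimal illustration assuming NumPy; the shared random seed stands in for the common projection that all parties must agree on):

```python
import numpy as np

def lsh_project(X, k, seed=0):
    """Reduce d-dimensional vectors to k binary dimensions with LSH.

    Each of the k random hyperplanes contributes one output bit:
    the side of the hyperplane on which the input point lies.
    A shared seed lets all parties use the same data-independent
    projection without exchanging any private data.
    """
    rng = np.random.default_rng(seed)
    H = rng.standard_normal((X.shape[1], k))  # k random hyperplane normals
    return (X @ H >= 0).astype(np.int8)       # one bit per hyperplane

X = np.random.default_rng(1).random((3, 1000))  # 3 documents, d = 1000
Z = lsh_project(X, k=64)                        # 3 x 64 binary codes
```

Because the hyperplanes depend only on the seed and not on the data, this reduction is data independent, unlike the PCA projection matrix.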
We observe that the data independent dimensionality reduction techniques such as LSH, multinomial sampling, and hash space reduction achieve close to perfect AUC.

Classifier Performance for Varying Batch Size
As we discussed in Section 5.2, another important requirement of our protocol is to train in batches of documents rather than training on one document at a time. We have shown that the extra information gained by Bob about any party's data decreases with increasing batch size. On the other hand, increasing the batch size gives the optimization procedure of the training algorithm fewer chances to correct itself in a single pass over the entire training dataset. In Figure 2, we see that the trade-off in AUC is negligible even with batch sizes of around 1000 documents.

6.5 Parallel Processing
An alternative approach to addressing the performance issue is parallelization. We experimented with a multi-threaded implementation of the algorithm. On average, we observed a 6.3% speed improvement on a single core machine. We expect the improvement to be more significant on a multi-core architecture. A similar scheme can be used to parallelize the protocol across a cluster of machines, such as in a MapReduce framework. In both of these cases, the accuracy of the online algorithm will decrease slightly as the number of threads or machines increases, because the gradient ∇L(w_(t), x, y) computed in each of the parallel processes is based on an older value of the weight vector w_(t).

A more promising approach which does not impact the accuracy is encrypting vectors in parallel. In the present implementation of the protocol, we encrypt vectors serially, and the procedure used for the individual elements is identical. We can potentially reduce the encryption time of a feature vector substantially by using a parallel processing infrastructure such as GPUs.
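Since element-wise encryptions are independent of one another, the vector encryption step parallelizes trivially. A sketch of the idea, using a hypothetical stand-in for the per-element encryption (the modulus `N` and the exponentiation here are placeholders, not a real cryptosystem):

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a Paillier-style encryption of one element:
# a single large modular exponentiation dominates the cost.
N = (1 << 256) + 297  # hypothetical modulus, not a real key

def encrypt_elem(m):
    return pow(m + 1, 65537, N * N)

def encrypt_vector_parallel(vec, workers=4):
    # Element-wise encryptions are independent, so they can be
    # dispatched to a pool of workers (threads, processes, or a GPU)
    # without changing the result of the protocol.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(encrypt_elem, vec))
```

Unlike parallelizing the gradient updates, this changes only the schedule of the encryptions, so the classifier's accuracy is unaffected.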
We leave the experiments with such an implementation for future work.

7. CONCLUSION
We developed protocols for training and evaluating a logistic regression based spam filtering classifier over emails belonging to multiple parties while preserving privacy constraints. We presented an information theoretic analysis of the security of the protocol, and found that both the encryption/decryption and data transmission costs of the protocol are linear in the number of training instances and the dimensionality of the data. We also experimented with a prototype implementation of the protocol on a large scale email dataset and demonstrated that our protocol is able to achieve close to state of the art performance in a feasible amount of execution time.

Future directions of this work include applying our methods to other spam filtering classification algorithms. We also plan to extend our protocols to make extensive use of parallel architectures such as GPUs to further increase speed and scalability.

8. REFERENCES
[1] A. Z. Broder. Some applications of Rabin's fingerprinting method. Sequences II: Methods in Communications, Security, and Computer Science, pages 143–152, 1993.
[2] M. Charikar. Similarity estimation techniques from rounding algorithms. In 34th Annual ACM Symposium on Theory of Computing, 2002.
[3] G. V. Cormack. TREC 2007 spam track overview. In Text REtrieval Conference TREC, 2007.
[4] M. Galassi, J. Davies, J. Theiler, B. Gough, G. Jungman, P. Alken, M. Booth, and F. Rossi. GNU Scientific Library Reference Manual (v1.12). Network Theory Ltd., third edition, 2009.
[5] J. Goodman and W. Yih. Online discriminative spam filter training. In Conference on Email and Anti-Spam CEAS, 2006.
[6] K. Li, Z. Zhong, and L. Ramaswamy. Privacy-aware collaborative spam filtering. IEEE Transactions on Parallel and Distributed Systems, 20(5):725–739, 2009.
[7] X. Lin, C. Clifton, and M. Y. Zhu. Privacy-preserving clustering with distributed EM mixture modeling. Knowledge and Information Systems, 8(1):68–81, 2005.
[8] http://www.openssl.org/docs/crypto/bn.html.
[9] P. Paillier. Public-key cryptosystems based on composite degree residuosity classes. In EUROCRYPT, 1999.
[10] D. Sculley and G. V. Cormack. Going mini: Extreme lightweight spam filters. In Conference on Email and Anti-Spam CEAS, 2008.
[11] Symantec intelligence report: August 2011. http://www.symantec.com/connect/blogs/symantec-intelligence-report-august-2011.
[12] J. Vaidya, C. Clifton, M. Kantarcioglu, and S. Patterson. Privacy-preserving decision trees over vertically partitioned data. TKDD, 2(3), 2008.
[13] J. Vaidya, M. Kantarcioglu, and C. Clifton. Privacy-preserving naive Bayes classification. VLDB Journal, 17(4):879–898, 2008.
[14] J. Vaidya, H. Yu, and X. Jiang. Privacy-preserving SVM classification. Knowledge and Information Systems, 14(2):161–178, 2008.
[15] A. Yao. Protocols for secure computations. In IEEE Symposium on Foundations of Computer Science, 1982.