Learning Character Strings via Mastermind Queries, with a Case Study Involving mtDNA

Michael T. Goodrich∗

Abstract

We study the degree to which a character string, $Q$, leaks details about itself any time it engages in comparison protocols with strings provided by a querier, Bob, even if those protocols are cryptographically guaranteed to produce no additional information other than the scores that assess the degree to which $Q$ matches strings offered by Bob. We show that such scenarios allow Bob to play variants of the game of Mastermind with $Q$ so as to learn the complete identity of $Q$. We show that there are a number of efficient implementations for Bob to employ in these Mastermind attacks, depending on knowledge he has about the structure of $Q$, which show how quickly he can determine $Q$. Indeed, we show that Bob can discover $Q$ using a number of rounds of test comparisons that is much smaller than the length of $Q$, under reasonable assumptions regarding the types of scores that are returned by the cryptographic protocols and whether he can use knowledge about the distribution that $Q$ comes from. We also provide the results of a case study we performed on a database of mitochondrial DNA, showing the vulnerability of existing real-world DNA data to the Mastermind attack.

Keywords: character strings, Mastermind, mitochondrial DNA.

1 Introduction

Mastermind [10, 25] is a game played between two players, a codemaker and a codebreaker, using colored pegs. (See Figure 1.) Viewed mathematically, Mastermind is abstracted as a game where the codemaker selects a plaintext string¹, $Q$, of length $N$, whose elements are selected from an alphabet of size $K$. For consistency with the board game, the members of this alphabet are often referred to as “colors.” The codemaker and codebreaker both know the values of $N$ and $K$, and play consists of the codebreaker repeatedly making guesses, $V_1, V_2, \ldots$, about the identity of $Q$.
For each guess, $V_i$, the codemaker provides a score on how well $V_i$ matches $Q$. In double-count Mastermind, which is the standard version based on the board game, this score consists of a pair of numbers:

• A black count, $b(Q, V_i)$, which is the number of elements in $V_i$ and $Q$ that match in both value and location. That is, $b(Q, V_i) = |\{ j : V_i[j] = Q[j] \}|$.

• A white count, $w(Q, V_i)$, which is the number of elements in $V_i$ that appear in $Q$ but in different locations than their locations in $V_i$. That is, letting $\pi$ denote an arbitrary permutation, $w(Q, V_i) = \max_{\pi} |\{ j : V_i[\pi(j)] = Q[j] \}| - b(Q, V_i)$.

In single-count Mastermind, which has been less studied, the codebreaker is given only the black count, $b(Q, V_i)$, for each guess, $V_i$. (Note that it is impossible to solve the problem given only white-count scores.) The goal is for the codebreaker to discover $Q$ using a small number of guesses.

Figure 1: The Mastermind game. The four large pegs in the middle are used for guessing. The four smaller peg locations on the right are used to score each guess, with black-peg and white-peg scores. The two pegs on the left are used to keep score across multiple games. (This image is adapted from http://en.wikipedia.org/wiki/File:Mastermind.jpg, by User:ZeroOne, under the Creative Commons Attribution ShareAlike 2.0 License.)

∗ Michael T. Goodrich is with the Department of Computer Science, University of California, Irvine, CA 92697-3435. E-mail and web page: see http://www.ics.uci.edu/˜goodrich .

¹ Throughout this paper we use the terms “string,” “sequence,” and “vector” synonymously.

1.1 Previous Related Work

The original Mastermind game was invented in 1970 by Meirowitz, as a board game having holes for vectors of length $N = 4$ and $K = 6$ colored pegs. Knuth [25] subsequently showed that this instance of the Mastermind game can be solved in five guesses or less.
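The black and white counts defined in the introduction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `black_count` and `white_count` are hypothetical, and the white count relies on the standard observation that the maximum over permutations equals the multiset overlap of the two strings.

```python
from collections import Counter


def black_count(q, v):
    """b(Q, V): number of positions where q and v hold the same element."""
    return sum(1 for a, b in zip(q, v) if a == b)


def white_count(q, v):
    """w(Q, V): elements of v that occur in q, but at a different position."""
    # The best permutation matches as many elements as the two multisets
    # share, so the max over permutations is the size of their intersection.
    overlap = sum((Counter(q) & Counter(v)).values())
    return overlap - black_count(q, v)
```

For example, with `q = "ACGT"` and `v = "AGCT"`, the first and last positions match exactly, and the two middle characters are present but swapped.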
Chvátal [10] studied the combinatorics of general Mastermind, showing that it can be solved in polynomial time, in the $K \ge N$ case, using $2N\lceil \log K \rceil + 4N$ guesses, and Chen et al. [9] showed how this bound can be improved, in this case, to $2N\lceil \log N \rceil + 2N + \lceil K/N \rceil + 2$ guesses. Stuckman and Zhang [33] showed that it is NP-complete to determine if a sequence of guesses and responses in general double-count Mastermind is satisfiable. Goodrich [20] shows that single-count (black-peg) Mastermind satisfiability is NP-complete and that a specific vector $Q$ can be guessed using a single-count (black-peg) query vector that is of length $N\lceil \log K \rceil + \lceil (2 - 1/K)N \rceil + K$.

Several researchers have explored privacy-preserving data querying methods that can be applied to character strings (e.g., see [2, 15, 16]). In particular, Atallah et al. [2] and Atallah and Li [3] studied privacy-preserving protocols for edit-distance string comparisons, such as in the longest common subsequence (LCS) problem [21, 22, 36], where each party learns the score for the comparison, but neither learns the contents of the string of the other party. Such comparisons are common in DNA sequence alignment, for example. Troncoso-Pastoriza et al. [35] described a privacy-preserving protocol for searching for a certain regular-expression pattern in a DNA sequence. At last year's Oakland conference, Jha et al. [23] gave privacy-preserving protocols for computing edit distance similarity scores between two genomic sequences, improving the privacy-preserving edit distance algorithm of Szajda et al. [34]. Single-count matching between two strings can be done in a privacy-preserving manner, as well, using privacy-preserving set intersection, e.g., using the method of Freedman et al. [16], Vaidya and Clifton [37], or Sang and Shen [31, 32].
The string matching problem can also be done using privacy-preserving dot product computations [1] or even general multi-party computation protocols (e.g., see [12, 18, 39]) or systems [6]. Jiang et al. [24] study a secure multiparty method for comparing a genomic sequence against every sequence in a genomic database, providing a score indicating the match strength between the query sequence and each sequence in the database.

In terms of the framework of this paper, the closest previous work is that of Du and Atallah [14], who studied a privacy-preserving protocol for querying a string $Q$ in a database of strings, $D$, where comparisons are based on approximate matching (but not sequence alignment). Their protocols assume that the parties are honest-but-curious, however, so that, for instance, the database owner cannot introduce fake strings in his database whose intent is to discover the identity of the query string, $Q$. The attack model we explore in this paper, on the other hand, allows for “cheating” in the comparison protocol, so that the owner of $D$ can introduce strings whose sole purpose is to help him discover the identity of $Q$.

1.2 Attack Scenarios

In this paper we study the Mastermind attack on string data, which is a way that a genomic querier, Bob, can “play” a type of Mastermind game with an unknown string, $Q$. The owner of $Q$, Alice, thinks that she is comparing $Q$ with Bob's strings in a privacy-preserving manner, but Bob is instead discovering the full identity of $Q$.

The attack scenario is that Alice repeatedly participates in privacy-preserving comparisons of $Q$ with strings provided by Bob. All that is learned from each comparison is the score measuring the similarity of the two strings ($Q$ and a string $V_i$ provided by Bob), with the score for each string comparison being revealed to Bob (and possibly also Alice) before the next comparison begins.
Bob's goal is to learn the complete identity of $Q$ with a reasonably small number of comparisons.

We distinguish two versions of this attack scenario. In the first scenario, the comparison between $Q$ and each string $V_i$ provided by Bob is scored according to the single-count (black-peg) straight-match score, $b(Q, V_i) = |\{ j : V_i[j] = Q[j] \}|$. In our second scenario, which is more common in genomic databases, the comparison between $Q$ and each $V_i$ provided by Bob is scored according to a sequence-alignment score, $a(Q, V_i) = |\{ (j, k) \in I : V_i[j] = Q[k] \}|$, where $I$ is an ordered index set of pairs of integers such that if $(j, k)$ appears before $(l, m)$ in $I$, then $j < l$ and $k < m$, and $I$ is chosen so as to maximize this count. This is also known as the longest common subsequence (LCS) [21, 22, 36] score between $Q$ and $V_i$. (See Figure 2.) Incidentally, as we observe below, Levenshtein edit distance scores are strongly related to the LCS score, and our attack scenarios should be able to be translated to this other measure, as well.

Figure 2: Illustrating two types of matches between two DNA sequences. (a) A single-count (black-peg) straight-match. Note that the second “A” in the bottom string is not matched, since it doesn't line up exactly with the second “A” in the top string. (b) A sequence-alignment match. In going from the top string to the bottom string, the first “C” in the top string corresponds to a deletion event, the first “C” in the bottom string corresponds to an insertion event, and the penultimate characters in each string correspond to a substitution event.

There are a number of motivating usage environments that could be susceptible to Mastermind attacks. For example, Bob could be a genomic database owner, storing genomic strings for a number of individuals, and Alice could be a database user who is searching Bob's database to find the closest match to a string $Q$ of interest.
Bob could, for instance, be the owner of a database of DNA from every male attending a certain university and Alice could be an FBI agent searching through that database for a match with DNA evidence gathered after a sexual assault. Both parties in this example are likely to be under legal restrictions not to reveal the complete identity of their strings unless there is a match.

In another example, Alice could be the owner of a database of genomic sequences and Bob could be an attacker trying to learn the identity of a string $Q$ in Alice's database, which Bob can identify only by an anonymized index, $j$. In this case, Bob repeatedly does queries with each of his strings, $V_i$, indexing into Alice's database using the name “$j$” to locate $Q$ and get Alice to do a privacy-preserving comparison of $Q$ with $V_i$. Bob could, for instance, be an employer trying to learn the genomic sequence of a prospective employee, Charlie, by querying a university DNA sequence database owned by Alice, which he could query simply knowing the index of Charlie's DNA in Alice's database (e.g., Bob might be able to infer this index from Charlie's student number).

In every case, Bob gets to ask Alice to compare her string, $Q$, to each of his query strings, $V_i$, in a privacy-preserving manner until these comparisons have leaked enough information that he can easily infer the identity of $Q$.

1.3 Our Results

In this paper we study various aspects of the Mastermind attack, deriving the following results.

• We show that the problem of determining whether a sequence of Mastermind responses has a valid solution is NP-complete even if each response is a sequence-alignment response. At first, this might seem to provide some security for the privacy of the unknown string, $Q$, for it implies a degree of intractability to the problem of learning a query string $Q$ just from Mastermind responses involving $Q$.
Unfortunately, as was learned with Knapsack cryptosystems [28], having the security of a system be based on the difficulty of solving an NP-complete problem is no guarantee that it is safe in practice.

Indeed, such is the case for the security of genomic sequences being susceptible to the Mastermind attack. We show that character strings can be discovered by a surprisingly short sequence of guesses. In particular, we also provide the following results:

• We show that an arbitrary query string, $Q$, of length $N$ from an alphabet of size $K$, can be discovered with $(N + 2)K$ queries, each of which reports the result of a sequence-alignment (LCS) test. Such queries are common in genomic applications. We also show that this bound can be further improved if the distribution of characters in the alphabet follows Zipf's Law [27].

• We show how a Mastermind attacker can take advantage of known distributional information for genomic data. Armed with distributional knowledge about a query string, $Q$, with respect to a reference string, $R$, such as the Revised Cambridge Reference Sequence, rCRS (GenBank accession number: AC_000021), the Mastermind attacker can discover $Q$ much more quickly than in the general case, using either single-count or sequence-alignment responses.

• We provide experimental analysis of the distribution-based Mastermind attack for genomic data, showing that, for a case study involving mitochondrial DNA (mtDNA), using either single-count responses or sequence-alignment responses, the attack works surprisingly well. Given the relative abundance of mtDNA data, and its ethnic sensitivity, we focus our experiments on 1000 human mtDNA sequences, showing that most can be discovered with a Mastermind attack of just a few hundred guesses, even though mtDNA sequences are typically over 16,500 bp long.
Given that current mtDNA databases already have thousands of members (e.g., see [5]), this experimental analysis shows that it would be relatively easy for an attacker, Bob, to interleave an undetected Mastermind attack with privacy-preserving responses to actual sequences.

We conclude by discussing some of the issues that would have to be addressed in order to defeat Mastermind attacks on genomic data, as well as some possible directions for future research.

2 Alternative Sequence Comparison Scores

Throughout this paper, we assume that the attacker, Bob, can learn the value of either a straight-match score, $b(Q, V_i)$, or a sequence-alignment score, $a(Q, V_i)$, between the unknown string, $Q$, and each of his given strings, $V_i$. These are not the only types of scores of interest with respect to genomic data, however. So, before we discuss the privacy risks of genomic data from Mastermind attacks that use the $b$ or $a$ functions as scores, let us discuss two other kinds of score functions and how they could alternatively be used for similar attacks.

There are a number of score functions that measure the similarity between two strings. We review two here, including how they can be reduced to similarity measures using the functions $b$ or $a$, for comparing two strings, $Q$ and $V$.

• Hamming distance: the Hamming distance, $H(Q, V)$, between $Q$ and $V$, is given by $H(Q, V) = |\{ j : V[j] \neq Q[j] \}|$. That is, the two strings $Q$ and $V$ are aligned in a way that disallows insertions and deletions, and a score is computed based on the number of substitutions needed to convert $Q$ to $V$. Note that, given a Hamming distance score, $H(Q, V)$, we can compute a straight-match score as $b(Q, V) = |Q| - H(Q, V)$.
• Levenshtein distance: the Levenshtein distance, $L(Q, V)$, between $Q$ and $V$, which is a kind of edit distance, is the minimum number of insertions, deletions, and substitutions needed to convert $Q$ into $V$ (or vice versa). Note that, given a Levenshtein distance score, $L(Q, V)$, we can compute a sequence-alignment score as $a(Q, V) = \frac{|Q| + |V| - L(Q, V)}{2}$.

Thus, the Mastermind attacks we mention in this paper apply equally well to systems that support string comparisons using Hamming distance or Levenshtein distance.

3 NP-Completeness of Sequence-Alignment Mastermind Satisfiability

As mentioned above, Stuckman and Zhang [33] show that double-count Mastermind satisfiability is NP-complete and Goodrich [20] shows that single-count (black-peg) Mastermind satisfiability is also NP-complete (which applies equally well for Hamming distance).

In the Sequence-Alignment Mastermind Satisfiability problem, we are given a collection of Mastermind queries, $V_1, V_2, \ldots, V_N$, and the responses, $a(Q, V_1), a(Q, V_2), \ldots, a(Q, V_N)$, each of which is said to report the sequence-alignment (LCS) score between $V_i$ and an unknown vector, $Q$. We are asked to determine if there indeed exists a vector $Q$ that satisfies all of these responses.

Theorem 1: Sequence-Alignment Mastermind Satisfiability is NP-complete.

Proof: Our proof is an adaptation of the NP-completeness proof of Goodrich [20] showing that single-count (black-peg) Mastermind Satisfiability is NP-complete.

It is easy to see that Sequence-Alignment Mastermind Satisfiability is in NP. For example, we could nondeterministically guess a vector $Q$ and then test in polynomial time whether it satisfies all the responses, $a(Q, V_1), a(Q, V_2), \ldots, a(Q, V_N)$.

To prove that Sequence-Alignment Mastermind Satisfiability is NP-hard, we provide a reduction from 3-Dimensional Matching (3DM), which is a well-known NP-complete problem (e.g., see [17]).
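The polynomial-time check used in the membership-in-NP argument above can be sketched as follows. This is a minimal illustration, not part of the original proof: the names `lcs_length` and `satisfies` are hypothetical, and the LCS score is computed with the textbook dynamic program.

```python
def lcs_length(q, v):
    """a(Q, V): length of a longest common subsequence of q and v."""
    prev = [0] * (len(v) + 1)  # DP row for the prefix of q seen so far
    for a in q:
        cur = [0]
        for j, b in enumerate(v, start=1):
            if a == b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def satisfies(q, queries, responses):
    """Check a guessed vector q against every (query, response) pair."""
    return all(lcs_length(q, v) == r for v, r in zip(queries, responses))
```

Each call to `lcs_length` takes time proportional to the product of the two string lengths, so verifying a candidate against all responses is polynomial in the input size.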
In the 3DM problem, we are given three sets, $X = \{x_1, \ldots, x_n\}$, $Y = \{y_1, \ldots, y_n\}$, and $Z = \{z_1, \ldots, z_n\}$, of $n$ elements each. In addition, we are given a set $T$ of $m$ triples, $\{(x_{i_1}, y_{j_1}, z_{k_1}), \ldots, (x_{i_m}, y_{j_m}, z_{k_m})\}$, whose elements are respectively taken from the three sets, $X$, $Y$, and $Z$. The problem is to determine if there is a subset of triples such that each element in $X$, $Y$, and $Z$ appears in exactly one triple in this subset.

Suppose, then, that we are given an instance of the 3DM problem, as described above. We consider the unknown vector, $Q$, to consist of the following vector of variables:

$$(X_1, \ldots, X_{2n};\ Y_1, \ldots, Y_{2n};\ Z_1, \ldots, Z_{2n};\ T_1, \ldots, T_{2m-1}),$$

where the semicolons are used for the sake of notation to separate the four sections in the unknown vector, $Q$. We perform our reduction by constructing a sequence of guess vectors, $V_0, V_1, \ldots, V_N$, together with their sequence-alignment responses, $a(Q, V_0), a(Q, V_1), \ldots, a(Q, V_N)$, so that there is a satisfying vector $Q$ for these responses if and only if there is a solution to the given instance of the 3DM problem.

Our construction begins by setting the number of colors, $K$, to be $m + 2$. Intuitively, there is a color associated with each triple in $T$, plus a “null” color, $\phi$, which is guaranteed to appear nowhere in our unknown vector, $Q$, and a separator color, $\mu$, which occurs in every other (even-indexed) position of $Q$.

We begin our sequence of queries with four special “enforcer” queries. The first two of these are

$$V_0 = (\phi, \ldots, \phi;\ \phi, \ldots, \phi;\ \phi, \ldots, \phi;\ \phi, \ldots, \phi),$$

which has response $a(Q, V_0) = 0$, and

$$V_1 = (\mu, \ldots, \mu;\ \mu, \ldots, \mu;\ \mu, \ldots, \mu;\ \mu, \ldots, \mu),$$

which has response $a(Q, V_1) = 3n + m - 1$.
Intuitively, $V_0$ enforces the fact that the null color, $\phi$, appears nowhere in the unknown vector, and $V_1$ enforces the fact that the separator color, $\mu$, appears exactly often enough to separate every other (non-$\mu$) character in the unknown vector.

So as to better understand the characteristics of the other queries, let us set $h = 3n + m - 1$, the number of $\mu$ colors in our unknown vector $Q$. We then define two additional enforcer queries,

$$V_2 = (\phi, \mu, \ldots, \phi, \mu;\ \phi, \mu, \ldots, \phi, \mu;\ \phi, \mu, \ldots, \phi, \mu;\ 1, \mu, 1, \mu, \ldots, \mu, 1),$$

which has response $a(Q, V_2) = h + n$, and

$$V_3 = (\phi, \mu, \ldots, \phi, \mu;\ \phi, \mu, \ldots, \phi, \mu;\ \phi, \mu, \ldots, \phi, \mu;\ 0, \mu, 0, \mu, \ldots, \mu, 0),$$

which has response $a(Q, V_3) = h + m - n$. Intuitively, $V_2$ enforces a counting rule that exactly $n$ of the $T_i$'s will be set to $1$, and $V_3$ enforces a counting rule that the remaining $m - n$ of the $T_i$'s will be set to $0$.

For each triple, $T_s = (x_{i_s}, y_{j_s}, z_{k_s})$, we construct three query vectors, as follows.

$$V_{s,1} = (\phi, \mu, \ldots, \phi, \mu, s, \mu, \phi, \mu, \ldots, \phi, \mu;\ \phi, \mu, \ldots, \phi, \mu;\ \phi, \mu, \ldots, \phi, \mu;\ \phi, \mu, \ldots, \phi, \mu, 0, \mu, \phi, \mu, \ldots, \mu, \phi),$$

where the $s$ is in position $2i_s - 1$ in the first group and the $0$ is in position $2s - 1$ in the fourth group. This vector has response $a(Q, V_{s,1}) = h + 1$.

$$V_{s,2} = (\phi, \mu, \ldots, \phi, \mu;\ \phi, \mu, \ldots, \phi, \mu, s, \mu, \phi, \mu, \ldots, \phi, \mu;\ \phi, \mu, \ldots, \phi, \mu;\ \phi, \mu, \ldots, \phi, \mu, 0, \mu, \phi, \mu, \ldots, \mu, \phi),$$

where the $s$ is in position $2j_s - 1$ in the second group and the $0$ is in position $2s - 1$ in the fourth group. This vector has response $a(Q, V_{s,2}) = h + 1$.

$$V_{s,3} = (\phi, \mu, \ldots, \phi, \mu;\ \phi, \mu, \ldots, \phi, \mu;\ \phi, \mu, \ldots, \phi, \mu, s, \mu, \phi, \mu, \ldots, \phi, \mu;\ \phi, \mu, \ldots, \phi, \mu, 0, \mu, \phi, \mu, \ldots, \mu, \phi),$$

where the $s$ is in position $2k_s - 1$ in the third group and the $0$ is in position $2s - 1$ in the fourth group.
This vector has response $a(Q, V_{s,3}) = h + 1$.

Intuitively, these three responses collectively form a “chooser” gadget, where we will either have $T_{2s-1} = 0$ or the three variables $X_{2i_s-1}$, $Y_{2j_s-1}$, and $Z_{2k_s-1}$ will each be set to have color $s$ (and $T_{2s-1} = 1$). Moreover, note that there are $m$ odd-indexed positions in the $T$ section, and each of them has to match either a $0$ or $1$ color.

This reduction can clearly be done in polynomial time. So all that remains is for us to show that it works.

Suppose, then, that there is a possible solution to the given instance of 3DM. Then for each chosen triple, $T_s = (x_{i_s}, y_{j_s}, z_{k_s})$, we can assign colors $T_{2s-1} = 1$, $X_{2i_s-1} = s$, $Y_{2j_s-1} = s$, and $Z_{2k_s-1} = s$, which will satisfy each of the $V_{s,1}$, $V_{s,2}$, and $V_{s,3}$ vector responses for this value of $s$. Likewise, setting $T_{2s-1} = 0$ will satisfy each of the $V_{s,1}$, $V_{s,2}$, and $V_{s,3}$ vector responses for a triple $T_s$ that is not chosen. Finally, given that there are $n$ chosen triples, we will satisfy the four preliminary vector responses as well.

Suppose, alternatively, that we have a vector $Q$ that satisfies all our vector responses. We know that each $X_i$, $Y_j$, and $Z_k$ must be assigned a color other than $\phi$. Moreover, every even-indexed position in $Q$ must be assigned the color $\mu$ and every odd-indexed position must be a color other than $\mu$, because there are exactly $h = 3n + m - 1$ instances of $\mu$ in $Q$ and we have introduced a query that enforces the fact that there is exactly one non-$\mu$ color between every consecutive pair of $\mu$-colored positions. Since there are only $m + 2$ colors, this implies each odd-indexed position $X_{2i-1}$, $Y_{2j-1}$, and $Z_{2k-1}$ must be assigned a color corresponding to a triple number, $s$; that is, it is not assigned $\phi$ or $\mu$.
If the corresponding $T_{2s-1} = 1$, then in order to have satisfied the vectors $V_{s,1}$, $V_{s,2}$, and $V_{s,3}$, we must have set $X_{2i_s-1} = s$, $Y_{2j_s-1} = s$, and $Z_{2k_s-1} = s$, which implies we can include the triple $(x_{i_s}, y_{j_s}, z_{k_s})$ in our matching. If $T_{2s-1} = 0$, then we do not include this triple in our matching. By the vector responses $V_2$ and $V_3$, we know that the number of triples chosen in this way is exactly $n$. Thus, we have found a valid 3-dimensional matching.

Thus, it is extremely unlikely that we will be able to find a polynomial-time algorithm that can always satisfy arbitrary Mastermind sequence-alignment query strings, or even single-count queries [20]. Unfortunately, this is not the same as a guarantee of security for the kinds of query strings that would result from an interaction between a Mastermind attacker, Bob, and a character string owner, Alice, where Bob is trying to learn Alice's string, $Q$, through a sequence of privacy-preserving string comparisons. Indeed, we show, in the sections that follow, that such query strings, $Q$, can be discovered fairly efficiently using the Mastermind attack.

4 The Mastermind Attack for Sequence-Alignment Queries

Recall that in a sequence-alignment query we wish to compare two strings $Q$ and $V$, where the score for a match is the length of the longest common subsequence (LCS) [21, 22, 36] between $Q$ and $V$. Several researchers have studied this problem and have come up with privacy-preserving protocols to determine such scores (e.g., see [2]). In this section, we show that performing such a series of sequence-alignment queries with Bob is susceptible to a type of Mastermind attack of its own.

Suppose we are given an unknown string $Q$ of length $N$ over an alphabet of size $K$, the members of which we call “colors.” Suppose further that we are going to engage in a protocol with Bob to test $Q$ against strings provided by Bob, where each test returns the length of a longest common subsequence between $Q$ and one of Bob's strings. That is, we score matches using the sequence-alignment scoring function, $a(Q, V)$, for a guess vector $V$, which is the length of a longest common subsequence between $V$ and $Q$. We are interested in this section in studying an efficient scheme for Bob to discover $Q$ using this query scheme.

A Mastermind-attack algorithm for Bob begins as follows:

• Bob begins by guessing $K$ vectors, $V_1, V_2, \ldots, V_K$, with each vector $V_i$ consisting of elements of all the same color, $i$.

The sequence-alignment score for each of the initial guesses will tell Bob the cardinality of each color in $Q$. Let $c_k$ denote the number of occurrences of color $k$ in $Q$. Let us now imagine that we reorder the colors so that they are listed $1$ to $K$ in nondecreasing order of how often they each appear in $Q$. Thus, color $1$ is now the least frequent color in $Q$ and $K$ is the most frequent color.

Our algorithm continues by incrementally building up a vector $W$, such that $W$ either completely matches all its characters with $Q$ (in the specified order) or it misses by just one character. Initially, we set $W$ to be a vector consisting of exactly $c_1$ elements of color $1$, so that if we were to guess $W$, then we would get a score of $a(Q, W) = c_1$. We allow indexing and insertion into $W$ so that we can add a character before the $i$th element in $W$ for $i = 1$ to $|W| + 1$ (with an insertion “before” position $|W| + 1$ taken to mean an insertion just after position $|W|$, the last position in $W$). Our algorithm for Bob's Mastermind attack continues as shown in Figure 3.

    for k = 2 to K do                      {take each color in turn}
        Set i = 1                          {position in W where to insert items}
        Set j = 0                          {count of number of items of color k found}
        while j < c_k do                   {find the places for color k}
            Add a color k item just before the i-th item in W.
            Make a guess for W to learn the value of a(Q, W).
            if a(Q, W) = |W| then          {all of W matches}
                Increment i and j.
            else                           {there's one too many of color k before i}
                Remove the color k item just added before i.
                Increment i.
            end if
        end while
    end for

Figure 3: The sequence-alignment learning algorithm.

Note inductively that, at the end of each iteration of the while-loop, every character in $W$ matches in $Q$, that is, $a(Q, W) = |W|$. Thus, any time the if-statement finds that $a(Q, W) \neq |W|$, we have just added an item of color $k$ in a place where it cannot match any item without causing a previously matched neighboring item to mismatch what it previously could match. Therefore, in each iteration of the for-loop, the algorithm correctly finds all the places where items of color $k$ fit with respect to items of colors $1$ to $k - 1$. So, when the algorithm completes, we have $W = Q$; that is, we have learned $Q$.

Consider now the analysis of this algorithm. Note that in each iteration of the while-loop, we increment $i$, our index into $W$, and that at the end of the while-loop the length of $W$ is $c_1 + c_2 + \cdots + c_k$, where $k$ is the index of the for-loop. Thus, the total number of queries made is at most

$$K + \sum_{i=1}^{K} \sum_{j=1}^{i} c_j,$$

which is the same as

$$K + \sum_{i=1}^{K} (K - i + 1)\, c_i,$$

since each term $c_i$ appears $K - i + 1$ times in the double sum. Let us perform a substitution of variables, where we let $d_1, d_2, \ldots, d_K$ denote the cardinalities of the colors in $Q$ in nonincreasing order, so $d_1$ is the most frequent color and $d_K$ is the least frequent; that is, $d_i = c_{K-i+1}$. Then we can rewrite the total number of queries performed to be bounded by

$$K + \sum_{i=1}^{K} i\, d_i.$$

Note that, by definition, $d_i \le N/i$, for otherwise, $d_i$ could not be the $i$th largest-cardinality color. Thus, the total number of queries is at most

$$K + \sum_{i=1}^{K} i\,(N/i) = K + KN = (N + 1)K.$$
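The algorithm of Figure 3 and its query bound can be exercised with a short Python sketch. This is an illustration under stated assumptions rather than the paper's implementation: the names `lcs_length` and `lcs_mastermind_attack` are hypothetical, the privacy-preserving protocol is modeled as a plain `oracle` callable that returns only the LCS score, and positions are 0-based.

```python
def lcs_length(q, v):
    """a(Q, V): length of a longest common subsequence of q and v."""
    prev = [0] * (len(v) + 1)
    for a in q:
        cur = [0]
        for j, b in enumerate(v, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcs_mastermind_attack(oracle, n, alphabet):
    """Recover a hidden length-n string from LCS scores alone.

    Returns (recovered_string, number_of_queries).
    """
    queries = 0

    def ask(w):
        nonlocal queries
        queries += 1
        return oracle(w)

    # K monochromatic guesses reveal the count c_k of each color.
    counts = {c: ask([c] * n) for c in alphabet}
    # Process colors from least frequent to most frequent.
    order = sorted(alphabet, key=lambda c: counts[c])
    w = [order[0]] * counts[order[0]]
    for color in order[1:]:
        i = 0      # insertion position in w (0-based)
        found = 0  # items of this color placed so far
        while found < counts[color]:
            w.insert(i, color)
            if ask(w) == len(w):   # all of w still matches
                found += 1
            else:                  # one too many of this color here
                del w[i]
            i += 1
    return "".join(w), queries
```

As a usage example, `lcs_mastermind_attack(lambda w: lcs_length("GATTACA", w), 7, "ACGT")` recovers the hidden string while staying within the $(N + 1)K = 32$ query bound derived above.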
This is the number of tests done by Bob, the Mastermind attacker, making no additional assumptions about the distribution of colors in the query string, $Q$.

This analysis can be refined, however, if the colors are distributed in $Q$ according to Zipf's Law [27], which in this context would imply that

$$d_i \le \frac{N}{i^s H_{N,s}},$$

where $H_{N,s}$ is the $N$-th generalized harmonic number of order $s$,

$$H_{N,s} = \sum_{i=1}^{N} 1/i^s,$$

and $s$ is between $1$ and $2$, inclusive. In this case, the total number of guesses done by Bob would be at most

$$K + \sum_{i=1}^{K} \frac{iN}{i^s H_{N,s}} \le K + \frac{KN}{H_{N,s}},$$

for $s \ge 1$, since each term $i / i^s$ is at most $1$. Thus, we have the following:

Theorem 2: Given an unknown length-$N$ string $Q$, defined on an alphabet of size $K$, a malicious Mastermind attacker can discover $Q$ in polynomial time using $(N + 1)K$ sequence-alignment tests against $Q$, each of which reveals only the length of a longest common subsequence between $Q$ and the test string. If the cardinalities of elements of $Q$ follow Zipf's Law, with parameter $s \ge 1$, then a malicious Mastermind attacker can discover $Q$ using at most $K + KN/H_{N,s}$ sequence-alignment tests.

5 Exploiting Data Distributions

Up to this point, we have focused on how the Mastermind attacker, Bob, could learn a general string $Q$ using the types of queries typically asked of genomic databases, even if those queries are privacy preserving. In this section, we explore how Bob can significantly improve the effectiveness of the Mastermind attack if he exploits publicly available information about the distributions of the character strings of interest. Moreover, to drive the point home, we provide a case study showing the effectiveness of such Mastermind attacks on a real-world genomic database, in the section that follows.

Genomic sequences typically have a great deal of similarity.
Indeed, recent compression schemes have shown that it is effective to represent a genomic sequence in terms of its differences with a reference sequence, $R$ (e.g., see [4]). That is, we can start from a reference sequence, $R$, which contains the most common components of a typical genomic sequence. Then we define each other sequence, $Q$, in terms of its differences with $R$. Each difference is defined by an index location, $i$, in $R$ and an operation to perform at that location, such as a substitution, insertion, or deletion.

This difference pattern is present, for example, in human mitochondrial DNA, which is the type of genomic data we use in our case study. This type of DNA is inherited only through the maternal line and is already available in sequenced form in sizeable enough quantities to support obfuscated Mastermind attacks. Moreover, because it is passed only through the maternal line, it functions as a highly tuned notion of race, allowing researchers in some cases to trace a person's ancestry to individual villages. Thus, mitochondrial DNA is highly sensitive from a privacy-protection viewpoint.

As shown in recent work of Baldi et al. [4], mitochondrial DNA sequences can be encoded in significantly compressed form by using a standard reference sequence [7, 30]. This reference sequence, $R$ = rCRS, is 16,568 bp long. So, in terms of the notation used above, we have $N = 16568$ and $K = 4$, since there are 4 types of base pairs possible. But these parameters suggest that there is more variation in the data than actually occurs. In fact, the vulnerability of DNA sequences to the Mastermind attack is much worse than this in practice. For example, there are a limited number of locations along the reference sequence where any changes appear statistically in the mitochondrial DNA data.
So let us use M to denote the number of different possible locations where any query sequence might differ from the reference sequence, R. Worse yet, from a privacy-preservation standpoint, the average number of differences between any human DNA sequence and the reference is orders of magnitude smaller than M in practice. (We explore these statistics in detail below.) Here we show how a Mastermind attack can exploit these statistical properties of genomic data.

5.1 The Substitution-Only Case

In this section, we explore the version of the Mastermind attack where the attacker, Bob, engages in a series of privacy-preserving protocols with Alice, each of which reveals only the single-count straight-match score between Alice's string, Q, and strings provided by Bob, in an iterative online fashion (recall Figure 2a). In the attack model we consider, Bob is allowed to use self-constructed sequences in comparisons with Q, from which he learns the value of $b(Q, V_i)$ for each of his query strings, $V_i$.

Given additional knowledge of the distributional properties of DNA data, we can construct a Mastermind attack that takes this knowledge into consideration. In this case, we make the assumption that the unknown string, Q, differs from a reference string R only through a relatively small number of substitutions, which is true, for example, for 45% of the mitochondrial DNA data. (We will explore the more general case later in this section.) Our algorithm is an adaptation of an algorithm of Goodrich [20] for solving the board-game version of Mastermind to the specific case of a Mastermind attack on a string Q relative to a reference string R.

We begin the attack for Bob by having him perform a query against Q with a reference sequence, R. For any string, Q, let $s(Q)$ denote the number of substitutional differences Q has with the reference sequence, R.
Note, then, that our first query (for the reference string R itself) allows us to determine the value of $s(Q)$, using the formula

$$s(Q) = N - b(Q, R).$$

For example, R could be a genomic sequence derived from a sequencing of the DNA of a specific reference human, or it could be a canonical genomic reference sequence derived from analyzing commonalities among a number of human sequences. Even though few humans have presently had their complete genomes sequenced [11, 26, 38], any of these could serve as a reference, R, for a Mastermind attack on a complete genome sequence. For the more widespread instances of mitochondrial DNA, the Revised Cambridge Reference Sequence (rCRS) (GenBank accession number: AC_000021) is commonly used as an mtDNA reference sequence [7, 8, 30], and it could serve as the sequence R in a Mastermind attack on a mitochondrial DNA sequence.

Imagine that we cyclically order the K characters in our alphabet, so, for instance, if our alphabet is {A, C, G, T}, then we could use the cyclic ordering (A, C, G, T, A, C, G, T, ...). Note that this ordering allows us to choose any character as a base color, i.e., a "color 0," and then specify all other characters as offsets from that base. For example, in the DNA case, we could pick "C" as the base, color 0, in which case "G" becomes color 1, "T" becomes color 2, and "A" becomes color 3. Or we could pick "T" as the base, color 0, in which case "A" becomes color 1, "C" becomes color 2, and "G" becomes color 3. In the context of a Mastermind attack, we consider each character, $R_i$, in the reference sequence, R, to be color "0" for that position, i. Viewed mathematically, we can then number the K − 1 remaining characters, according to our cyclic ordering, as offsets from these respective color 0's.
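The offset view of colors can be made concrete with a short sketch. The following Python fragment is only an illustration: the function names `offset_color`, `apply_offset`, and `to_offsets` are hypothetical and not part of the original presentation, and the alphabet is assumed to be the DNA alphabet in the cyclic order (A, C, G, T).

```python
ALPHABET = "ACGT"  # assumed cyclic ordering (A, C, G, T, A, C, ...)


def offset_color(ref_char, char, alphabet=ALPHABET):
    """Return the offset of `char` from `ref_char` in the cyclic ordering.

    Offset 0 means "same as the reference character" (color 0).
    """
    k = len(alphabet)
    return (alphabet.index(char) - alphabet.index(ref_char)) % k


def apply_offset(ref_char, offset, alphabet=ALPHABET):
    """Inverse of offset_color: the character that is `offset` steps past `ref_char`."""
    k = len(alphabet)
    return alphabet[(alphabet.index(ref_char) + offset) % k]


def to_offsets(reference, query, alphabet=ALPHABET):
    """Encode an equal-length query as a vector of offsets from the reference."""
    return [offset_color(r, q, alphabet) for r, q in zip(reference, query)]
```

With "C" as the base, `offset_color("C", "G")` is 1 and `offset_color("C", "A")` is 3, matching the example in the text. A query that agrees with the reference everywhere except for a few substitutions becomes a vector that is mostly zeros.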
Assuming that Bob's first guess, of R, is not a perfect match for the query sequence, Q, we can view Bob's remaining task as that of determining the cardinality and location of all the non-zero offset values for positions in R. In fact, if we think of the characters in the respective positions of R as the respective color 0's for those positions, then we can view the remaining task as that of determining the locations of the colors 0 through K − 1.

After Bob makes his initial guess using R, we then have him perform K − 1 additional queries, each of which is a vector of elements that are all the same offset from R, i.e., a vector of all the same "colors" with respect to R, but only at the M places that are statistically possible locations for a substitution. Thus, let us assume we can view Q as now consisting of just the M places where substitutions may occur (for the other locations we simply repeat a guess for color 0 every time). This allows us to initially know the cardinality, $c_0, c_1, \ldots, c_{K-1}$, of every (offset) color in the (compressed) unknown vector, Q. If any $c_i = 0$, then we remove the color i from our alphabet of colors, and update the value of K accordingly.

The remainder of Bob's computation proceeds as a recursive divide-and-conquer algorithm, which is similar in structure to the approach of [10, 20]. The generic problem is to determine the offset values of all the elements in a range $Q[l..r]$, which initially is the entire (compressed) vector $Q = Q[0..M-1]$, assuming we know the cardinalities, $c_0, c_1, \ldots, c_{K-1}$, of every color in $Q[l..r]$, and each $c_i > 0$. If $K \le 1$, we are done; so let us assume without loss of generality that $K \ge 2$. In addition, we assume inductively that we know d, the number of instances of color 0 outside of the range $Q[l..r]$. Initially, of course, d = 0.
Given this initial setup, we split $Q[l..r]$ into $Q[l..m]$ and $Q[m+1..r]$, where m is in the middle of the interval $[l, r]$. The main challenge, then, is to provide for $Q[l..m]$ and $Q[m+1..r]$ the same setup we had for $Q[l..r]$. This setup can be accomplished by determining the cardinalities, $x_0, x_1, \ldots, x_{K-1}$ and $y_0, y_1, \ldots, y_{K-1}$, of every color that respectively appears in $Q[l..m]$ and $Q[m+1..r]$. We do this with a series of K − 1 additional queries, where we guess that the elements in $Q[l..m]$ are of color i, for $i = 1, 2, \ldots, K-1$, and that the rest of Q is of color 0. Let the values of these queries be denoted as $b_1, b_2, \ldots, b_{K-1}$, and note that, at this point, we know the following:

$$x_i + y_i = c_i, \quad \text{for } i = 0, 1, \ldots, K-1 \qquad (1)$$

$$x_i + y_0 = b_i - d, \quad \text{for } i = 1, 2, \ldots, K-1 \qquad (2)$$

$$x_0 + x_1 + \cdots + x_{K-1} = m - l + 1. \qquad (3)$$

Thus, we can determine $y_0$, as

$$y_0 = \frac{c_0 + \sum_{i=1}^{K-1} (b_i - d) - (m - l + 1)}{K},$$

for $y_0$ is counted K times in the sum of $c_0$ and all the $(b_i - d)$'s, and the sum of the $x_i$'s is $m - l + 1$, by Equation (3). Given the value of $y_0$, we can then determine all the $x_i$ values, by using Equation (1) for $x_0$ and Equation (2) for $x_1, x_2, \ldots, x_{K-1}$. Moreover, once we have all these $x_i$ values, we can determine the values, $y_1, y_2, \ldots, y_{K-1}$, using Equation (1). Finally, we can determine the values $d' = d + y_0$ and $d'' = d + x_0$ and use these respectively for the role of d in $Q[l..m]$ and $Q[m+1..r]$. This gives us all the values necessary to then recursively determine $Q[l..m]$ and $Q[m+1..r]$. Of course, if the $c_i$ values for either of these subproblems are all 0, except for one (which would be equal to the size of this problem), then there is no need to recursively solve this problem; so we would not perform a recursive call in this case.
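The recursive step above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the case study: Q is assumed to already be compressed to a vector of M offset colors, the oracle `b` simulates the single-count score by counting matching positions, and the names `make_oracle`, `solve`, and `mastermind_attack` are hypothetical. Colors absent from a subrange are dropped before querying, as the text prescribes, so K in the formula for $y_0$ becomes the number of colors still in play (including color 0).

```python
def make_oracle(secret):
    """Simulated single-count score b(Q, V); also counts how many queries are made."""
    calls = [0]

    def b(guess):
        calls[0] += 1
        return sum(1 for s, g in zip(secret, guess) if s == g)

    return b, calls


def solve(b, M, l, r, c, d, out):
    """Recover out[l..r], given color counts c in that range and d = number of
    color-0 positions outside the range."""
    present = [i for i, n in c.items() if n > 0]
    if len(present) == 1:  # a single color fills the whole range: no queries needed
        out[l:r + 1] = [present[0]] * (r - l + 1)
        return
    m = (l + r) // 2
    left_size = m - l + 1
    nonzero = [i for i in present if i != 0]
    scores = {}
    for i in nonzero:  # guess color i on Q[l..m] and color 0 everywhere else
        guess = [0] * M
        guess[l:m + 1] = [i] * left_size
        scores[i] = b(guess)
    c0 = c.get(0, 0)
    k = len(nonzero) + 1  # colors in play, including color 0
    y0 = (c0 + sum(scores[i] - d for i in nonzero) - left_size) // k
    x = {0: c0 - y0}  # Equation (1) for x_0
    for i in nonzero:
        x[i] = scores[i] - d - y0  # Equation (2)
    y = {i: c.get(i, 0) - x[i] for i in x}  # Equation (1)
    solve(b, M, l, m, x, d + y0, out)  # d' = d + y_0
    solve(b, M, m + 1, r, y, d + x[0], out)  # d'' = d + x_0


def mastermind_attack(b, M, K):
    """Recover a hidden length-M vector over colors 0..K-1 using only scores from b."""
    counts = {i: b([i] * M) for i in range(K)}  # the initial K guesses
    out = [None] * M
    solve(b, M, 0, M - 1, counts, 0, out)
    return out


# Example: M = 16 candidate locations, K = 4 colors, 4 substitutions.
secret = [0, 0, 2, 0, 1, 0, 0, 3, 0, 0, 0, 1, 0, 0, 0, 0]
b, calls = make_oracle(secret)
recovered = mastermind_attack(b, len(secret), 4)
```

In this sketch the division that yields `y0` is always exact, because the numerator equals `k * y0` by Equations (1) through (3). A subrange that contains no substitutions has a single present color and is filled in without any further queries, which is what keeps the query count proportional to the number of substitutions rather than to M.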
Let us, therefore, analyze the number of vector guesses performed by this algorithm. Ignoring for the time being the initial set of K guesses, note that we only continue to search if we are guaranteed to be homing in on a substitution, and each substitution is pursued through at most $\lceil \log M \rceil$ levels of halving. Thus, adding back the initial K guesses, we get that the total number of guesses is at most

$$s(Q) \lceil \log M \rceil + K.$$

Thus, we have the following.

Theorem 3: Given an unknown length-N sequence Q, defined on an alphabet of size K, with Q having M possible locations of deviation from a reference sequence, R, a malicious Mastermind attacker can discover Q in polynomial time using $s(Q) \lceil \log M \rceil + K$ guesses, each of which reveals only the number of positions where Q and the test sequence match, where $s(Q)$ denotes the number of substitutions that would transform R into Q.

As we note in Section 6, this performance is more than adequate to show that nearly half of all mitochondrial DNA data in our case study are vulnerable to this version of the Mastermind attack. Before we provide those statistics, however, let us study how the Mastermind attack with sequence-alignment queries can be streamlined to exploit DNA data distributions.

5.2 The Sequence-Alignment Case

As mentioned above, roughly half of the sequences in the mitochondrial DNA data set include insertions and/or deletions in addition to substitutions in the reference sequence, R. Thus, we discuss in this subsection how we can modify the Mastermind attack algorithm of Section 4 to take advantage of the distributional properties common in genomic data sets, so as to discover a query sequence that can have arbitrary kinds of differences with the reference sequence, R.

In this case, we view differences with R procedurally as events, each of which is either a singleton deletion or an arbitrary-length insertion, which would transform R into the query sequence, Q.
(Note: for this algorithm, we view a substitution as a deletion event followed by an insertion event.) In this case, we run the attack algorithm in two phases. In Phase 1, we aim to discover all the deletion events, and in Phase 2, we aim to discover all the insertion events. In both phases, we make the simplifying assumption that insertion and deletion events are disjoint. That is, they do not overlap or interfere with one another. This assumption is based on the fact that these events come from a statistical characterization of genomic sequences, which is designed to keep events disjoint (for overlapping events are better subdivided further and considered as separate sub-events). So, for example, we assume that there is no insertion event that is then followed by a deletion event that removes part of the sequence that was just inserted.

We begin by performing a guess for the reference sequence, R. Armed with the sequence-alignment score, $a(Q, R)$, for R, we then perform a divide-and-conquer computation to find all the deletion events that occur in going from R to Q. Note that if we next perform a guess V for a collection of deletion events at some subset of the M statistically possible (deletion) locations in R, then we can detect how many deletions actually occurred at these locations. Moreover, note that the insertion events do not change this score, since the insertions and deletions do not interfere, by assumption. For each deletion event that is present in one of the queried locations, our score will not change with respect to the score for R, and, for each location that should not be deleted, we will record a score for V that is one worse than that for R. Thus, we can determine the number of deletion events for any test we do by the difference between the score we observe and the score we would expect if all of the deletions were removing actual matches.
That is, if we test for r singleton deletion events in V, then the number that actually occur is

$$a(Q, V) - (a(Q, R) - r),$$

where a is the sequence-alignment score function.

Let $Z_{1,M} = \{z_1, z_2, \ldots, z_M\}$ be a set of Boolean variables, such that $z_i$ is 1 if and only if the i-th statistically possible deletion event in R actually occurs in going from R to Q. We can perform a divide-and-conquer search in $Z_{1,M}$ to determine which of the $z_i$'s are 1. We begin by testing for all the deletion events in $Z_{1,M}$. This gives us the number of 1's in $Z_{1,M}$. We then perform a test for every deletion event in $Z_{1,M/2} = \{z_1, \ldots, z_{M/2}\}$, which by deduction gives us the number in $Z_{M/2+1,M} = \{z_{M/2+1}, \ldots, z_M\}$. We then recursively determine the number in either or both of these two sets so long as there is at least one deletion event in that set. Thus, we perform a divide-and-conquer parallel "binary" search for each of the exact locations of singleton deletions.

Once we have completed this computation for R, with queries against Q, we will have determined the locations of all the deletion events from R to Q, including those deletions that are really substitution events. Thus, this set of guesses uses at most $1 + d(Q) \lceil \log M \rceil$ tests, where $d(Q)$ is the number of (singleton) deletion events in going from R to Q.

Once we know the locations of all the deletions in going from R to Q, we perform a second set of binary searches, just among these locations, to find the locations among this group that are actually the sites of substitution events. Let us now define $R'$ to be the reference sequence resulting from performing the events we discovered in Phase 1. In particular, we perform a binary search for each of the K colors, with respect to $R'$, searching, for each color i, in the statistically possible insertion locations in $R'$ where we improve our score by adding a single character of color i.
Note that there may be more than a single character of color i inserted at this location, but it is sufficient to do a single-character query to determine that there is an insertion here, since there is a non-deleted element between every possible insertion location in $R'$. Since we continue to perform recursive binary-type searches only for insertion locations that actually contain insertions, the number of additional guesses we do in this part of the second phase is at most $K + e(Q) \lceil \log M \rceil$, where $e(Q)$ is the number of insertion events.

At this point in the algorithm, we know where all the insertion events are located, but we do not know the full extent of each of their sizes. So for each location, we perform a set of K guesses of length 2 to see if we get a higher score by considering a longer insertion. If there are no differences from the singleton queries, then we can infer the length of the insertion from the previous queries. Otherwise, we perform a set of K guesses of length 3, 4, and so on, until we observe no change from the previous set of guesses. Thus, with a total number of guesses equal to $K \varepsilon(Q)$, where $\varepsilon(Q)$ is the total size of all the insertion events, we discover the length of each insertion event.

To complete the computation, then, we perform a miniature version of our algorithm from Section 4 at each location determined to be the site of an insertion event. Each such computation requires $(m + 1)K$ guesses, where m is the length of the insertion. Thus, the total number of guesses made in this part of Phase 2 is $(\varepsilon(Q) + 1)K$. Therefore, we have the following.
Theorem 4: Given an unknown length-N sequence Q, defined on an alphabet of size K, with Q having M possible locations of deviation from a reference sequence, R, a malicious Mastermind attacker can discover Q in polynomial time using $(d(Q) + e(Q)) \lceil \log M \rceil + (\varepsilon(Q) + 2)K + 1$ guesses, each of which reveals only the number of positions where Q and the test sequence match, using sequence-alignment LCS tests, where
• $d(Q)$ is the number of deletion events,
• $e(Q)$ is the number of insertion events,
• $\varepsilon(Q)$ is the total length of all insertion events.

6 Case Study for Mitochondrial DNA

We are at the point where hundreds of thousands of people have had their mitochondrial DNA (mtDNA) sequenced [5, 29]. An mtDNA sequence is typically about 16,500 base pairs (bp) long, whereas the entire diploid human genome is roughly 6 billion bp long. Interestingly, since mtDNA is transferred only along the maternal line, scientists have used differences from a reference mtDNA sequence as a way to plot human migration from the earliest days of the modern human species. (See Figure 4.)

Figure 4: A confluent illustration [13] of the pattern of human migration implied by mtDNA mutations [5, 29]. Each letter stands for a major human mitochondrial haplogroup, that is, a canonical set of genetic mutations from a common ancestor.

Because of this knowledge of migration patterns and its correlation to known mtDNA mutations, given someone's mtDNA sequence, it is possible to trace their maternal ancestry back to individual villages [5], just by identifying differences in their mtDNA relative to a reference sequence, e.g., rCRS (see Figure 5). In other words, mtDNA alone is sufficient to determine a person's ethnic background with incredible accuracy.
Thus, we are at a point where privacy is a real concern with respect to genomic sequences, and this concern is sure to increase in the future.

In addition to ethnicity, there are, of course, other privacy concerns with respect to genomic data, including sensitive information related to disease susceptibility, and possible genetic influences on sexual orientation, personality, addiction, and intelligence. Concerns that employers or insurers will use genetic information to screen those at high risk for a disease are already a public concern, and stories involving such risks are widespread in the press. Indeed, the U.S. government and several states have already created laws dealing with DNA data access, and many more are considering such legislation. Thus, there is a need for technologies that can safeguard the privacy and security of genomic data.

Fortunately, several researchers have started exploring privacy-preserving data querying methods that can be applied to genomic sequences (e.g., see [2, 15, 16]). That is, cryptographic techniques can be used to allow for queries to be performed in a way that answers the specific question (such as a score rating the quality of a query for DNA matching or sequence alignment) but does not reveal any other information about the data, such as race or disease risk of the individual whose DNA is being queried.

GATCACAGGTCTATCACCCTATTAA CCACTCACGGGAGCTCTCCATGCAT
TTGGTATTTTCGTCTGGGGGGTATG CACGCGATAGCATTGCGAGACGCTG
GAGCCGGAGCACCCTATGTCGCAGT ATCTGTCTTTGATTCCTGCCTCATC
...
ATCTGGTTCCTACTTCAGGGTCATA AAGCCTAAATAGCCCACACGTTCCC
CTTAAATAAGACATCACGATG

Figure 5: A portion of the Revised Cambridge Reference Sequence, rCRS (GenBank accession number: AC_000021), which is 16,568 bp long.

The purpose of this case study is to show that, while being sufficient for single-shot comparisons of DNA sequences, such cryptographic techniques have a weakness when they are employed repeatedly.
Specifically, we explore in this section how the Mastermind attack allows a genomic querier, Bob, to iteratively discover the full identity of a genomic query sequence, Q, with surprising efficiency, even if each comparison of Q with Bob's sequences is done using cryptographic privacy-preserving protocols. It is not surprising that iterated privacy-preserving sequence comparisons leak some information about the sequences being compared; what is surprising is how quickly the Mastermind attack can work, especially on genomic data.

To demonstrate the vulnerability of real-world DNA data to the Mastermind attack, we have performed a case study of our distribution-based Mastermind attack algorithms. We used 1000 human mitochondrial sequences downloaded from a recent version of GenBank (http://www.ncbi.nlm.nih.gov/Genbank/index.html). We focused on the sequences alone, ignoring any header and other information, and have simulated Mastermind attacks on each one. The Revised Cambridge Reference Sequence (rCRS) (GenBank accession number: AC_000021) was also downloaded and used as the reference sequence [7, 8, 30]. The reference sequence is 16,568 bp long. All the sequences were aligned to the reference sequence and, for each sequence, the indices of the location of each variation were recorded together with the type (substitution, insertion, deletion) and content of each variation. This step is also essential if one is interested in compressing the data [4], for example. Statistics for the number of substitutions, deletions, and insertions for this data set of 1000 mtDNA sequences are given in Table 1.

                 mean     standard dev.
Substitutions    28.00    18.38
Deletions         0.90     2.46
Insertions        0.95     1.10

Table 1: Frequency statistics for 1000 mtDNA sequences. Mean and standard deviation statistics are given for the frequency of substitutions, deletions, and insertions in going from the reference sequence, R = rCRS, to each sampled sequence.
Of the 1000 sequences, 453 have only substitution events with respect to the reference sequence, R = rCRS. So we used this subset of 453 sequences to test the simulated performance of the method of Theorem 3. The distribution of the number of substitutions in each of these sequences is shown in Figure 6.

Figure 6: Histogram of number of substitutions in 1000 mtDNA with respect to the reference sequence, R = rCRS.

Note that these frequencies do not follow a normal distribution, which shows the importance of our using real-world data, such as this, rather than randomly generated or simulated data. The statistical diversity of the mtDNA data is actually a reflection of the racial diversity of the people whose mtDNA data is included in our data set. That is, edit distance from the reference sequence, R = rCRS, across the human species, is not uniformly or normally distributed. Instead, edit distance from rCRS is a reflection of human migration patterns, as illustrated in Figure 4.

The 45.3% of the sampled mtDNA sequences with substitution-only modifications from rCRS are exactly the set of sequences that can be effectively discovered by the single-count Mastermind attack of Theorem 3. Thus, we simulated the performance of this attack on each one of these sequences and tabulated the number of guesses that would be needed in each case in order to discover the complete identity of each sequence. Interestingly, 90% of the simulated substitution-only Mastermind attacks completed with 375 guesses or fewer. The complete distribution of single-count Mastermind attack lengths for this data set is shown in Figure 7.

All 1000 sampled mtDNA sequences were then used to test the performance of the method of Theorem 4. Sequence-alignment Mastermind attacks were simulated for each such mtDNA sequence, and the number of sequence-alignment tests was counted for each.
Interestingly, 90% of these simulated sequence-alignment Mastermind attacks completed with 875 guesses or fewer, and some completed with far fewer than this. The complete distribution of sequence-alignment Mastermind attack lengths for this data set is shown in Figure 8.

7 Discussion and Future Directions

We have shown that, even though the single-count and sequence-alignment Mastermind satisfiability problems are NP-complete, one can effectively mount Mastermind attacks on arbitrary genomic sequences just by knowing basic information about the length of the sequences and the number of characters in the alphabet used to construct those sequences. Moreover, if one has some basic statistical information about these sequences, relative to a reference sequence, then one can mount the Mastermind attack with surprising effectiveness. In fact, we provided a case study suggesting that such attacks are already possible and surprisingly efficient for mtDNA sequences.

Figure 7: Histogram of Mastermind attack lengths for 453 substitution-only mtDNA sequences with standard single-count Mastermind scores. The mean attack length for this data set was 219.6 and the standard deviation was 139.1.

One conclusion to draw from this work is that privacy-preserving protocols for performing a query with a sequence, Q, against a genomic database, D, should take into account the entire set of comparisons [14], with Q and the sequences in D, rather than relying on the privacy preservation of each individual comparison in turn. For example, in the usage model where Bob is a user querying a genomic database, the Mastermind attack is weakened if it is difficult for Bob to know the index of the sequences he is comparing against, as happens if the database owner, Alice, presents her sequences in a different random order each time.
Such an obfuscation does not defeat the Mastermind attack, however, if Bob is able to use other inferences to match scores of his query sequences across multiple queries in Alice's database of sequences.

In terms of further exploration of the vulnerability of genomic data to the Mastermind attack, one interesting direction for future work would be to test the vulnerability of entire human genomes to the Mastermind attack, once we have enough completed genomes to do such an experimental study. Other directions for future research could include new, efficient privacy-preserving schemes for querying entire genomic databases with respect to sequence-alignment queries. Such results would negate the privacy-exposing vulnerabilities of the Mastermind attack.

Acknowledgments

We would like to thank Pierre Baldi for suggesting the security of genomic data as an important research question and for providing the mitochondrial DNA data used in our experiments, including the characterizations in terms of the reference sequence, rCRS. We would also like to thank David Eppstein, Daniel Hirschberg, Stas Jarecki, and Michael Nelson for helpful discussions regarding the topics of this paper. This research was supported in part by the National Science Foundation under grants 0724806, 0713046, and 0847968. Some of the results of this paper appeared in preliminary form as [19], albeit with some flawed arguments for justifying previous versions of Theorems 2 and 4.

Figure 8: Histogram of simulated Mastermind attack lengths for 1000 mtDNA sequences with sequence-alignment scores. The mean sequence-alignment simulated Mastermind attack length was 536.3 with a standard deviation of 373.9.

References

[1] A. Amirbekyan and V. Estivill-Castro. A new efficient privacy-preserving scalar product protocol.
In AusDM '07: Proceedings of the sixth Australasian conference on Data mining and analytics, pages 209–214, Darlinghurst, Australia, Australia, 2007. Australian Computer Society, Inc.

[2] M. J. Atallah, F. Kerschbaum, and W. Du. Secure and private sequence comparisons. In WPES '03: Proceedings of the 2003 ACM workshop on Privacy in the electronic society, pages 39–44, New York, NY, USA, 2003. ACM.

[3] M. J. Atallah and J. Li. Secure outsourcing of sequence comparisons. Int. J. Inf. Secur., 4(4):277–287, 2005.

[4] P. Baldi, R. W. Benz, D. Hirschberg, and S. Swamidass. Lossless compression of chemical fingerprints using integer entropy codes improves storage and retrieval. Journal of Chemical Information and Modeling, 47(6):2098–2109, 2007.

[5] D. M. Behar, S. Rosset, J. Blue-Smith, O. Balanovsky, S. Tzur, D. Comas, R. J. Mitchell, L. Quintana-Murci, C. Tyler-Smith, and R. S. Wells. The genographic project public participation mitochondrial DNA database. PLoS Genetics, 3(6), 2005. doi:10.1371/journal.pgen.0030104.

[6] A. Ben-David, N. Nisan, and B. Pinkas. FairplayMP - a system for secure multi-party computation. In Proceedings of the ACM Computer and Communications Security Conference (ACM CCS), pages 257–266, New York, NY, USA, 2008. ACM.

[7] M. Brandon, M. Lott, K. Nguyen, S. Spolim, S. Navathe, P. Baldi, and D. Wallace. MITOMAP: a human mitochondrial genome database - 2004 update. Nucleic Acids Research, 33:D611–D613, 2005. Database Issue.

[8] M. C. Brandon, E. Ruiz-Pesini, D. Mishmar, V. Procaccio, M. T. Lott, K. C. Nguyen, S. Spolim, U. Patil, P. Baldi, and D. C. Wallace. MITOMASTER: A bioinformatics tool for the analysis of mitochondrial DNA sequences. Human Mutation, 0:1–6, 2008.

[9] Z. Chen, C. Cunha, and S. Homer. Finding a hidden code by asking questions. In COCOON '96: Proceedings of the Second Annual International Conference on Computing and Combinatorics, volume 1090 of LNCS, pages 50–55.
Springer, 1996.

[10] V. Chvátal. Mastermind. Combinatorica, 3(3/4):325–329, 1983.

[11] I. H. G. S. Consortium. Initial sequencing and analysis of the human genome. Nature, 409:860–921, 2001.

[12] I. Damgård, M. Fitzi, E. Kiltz, J. B. Nielsen, and T. Toft. Unconditionally secure constant-rounds multi-party computation for equality, comparison, bits and exponentiation. In S. Halevi and T. Rabin, editors, Theory of Cryptography, volume 3876 of Lecture Notes in Computer Science, pages 285–304. Springer, 2006.

[13] M. Dickerson, D. Eppstein, M. T. Goodrich, and J. Meng. Confluent drawings: Visualizing non-planar diagrams in a planar way. In Proc. 11th Int. Symp. on Graph Drawing, volume 2912 of Lecture Notes in Computer Science, pages 1–12. Springer-Verlag, 2003.

[14] W. Du and M. J. Atallah. Protocols for secure remote database access with approximate matching. In A. K. Ghosh, editor, E-Commerce Security and Privacy: Advances in Information Security, Volume 2, pages 87–112. Kluwer Academic Publishers, 2001.

[15] W. Du and M. J. Atallah. Secure multi-party computation problems and their applications: a review and open problems. In NSPW '01: Proceedings of the 2001 workshop on New security paradigms, pages 13–22, New York, NY, USA, 2001. ACM.

[16] M. Freedman, K. Nissim, and B. Pinkas. Efficient private matching and set intersection. In Advances in Cryptology — EUROCRYPT 2004, 2004.

[17] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, New York, NY, 1979.

[18] O. Goldreich, S. Micali, and A. Wigderson. How to play any mental game. In STOC '87: Proceedings of the nineteenth annual ACM symposium on Theory of computing, pages 218–229, New York, NY, USA, 1987. ACM.

[19] M. T. Goodrich. The mastermind attack on genomic data. In IEEE Symposium on Security and Privacy, pages 204–218. IEEE Press, 2009.

[20] M. T. Goodrich.
On the algorithmic complexity of the mastermind game with black-peg results. Information Processing Letters, 109:675–678, 2009.

[21] D. S. Hirschberg. A linear space algorithm for computing maximal common subsequences. Commun. ACM, 18(6):341–343, 1975.

[22] C. S. Iliopoulos and M. S. Rahman. Algorithms for computing variants of the longest common subsequence problem. Theor. Comput. Sci., 395(2-3):255–267, 2008.

[23] S. Jha, L. Kruger, and V. Shmatikov. Towards practical privacy for genomic computation. In Proceedings of the 2008 IEEE Symposium on Security and Privacy, pages 216–230, Washington, DC, USA, 2008. IEEE Computer Society.

[24] W. Jiang, M. Murugesan, C. Clifton, and L. Si. Similar document detection with limited information disclosure. In ICDE, pages 735–743. IEEE, 2008.

[25] D. Knuth. The computer as a master mind. Journal of Recreational Mathematics, 9:1–5, 1977.

[26] S. Levy, G. Sutton, P. C. Ng, et al. The diploid genome sequence of an individual human. PLOS Biology, 5(10):2113–2144, 2007.

[27] M. Newman. Power laws, Pareto distributions, and Zipf's law. Contemporary Physics, 46(5):323–351, 2005.

[28] A. M. Odlyzko. The rise and fall of knapsack cryptosystems. In C. Pomerance, editor, Cryptology and Computational Number Theory, pages 75–88. Am. Math. Soc., 1990.

[29] B. Pakendorf and M. Stoneking. Mitochondrial DNA and human evolution. Annual Rev. Genomics Hum. Genet., 6:165–183, 2005.

[30] E. Ruiz-Pesini, M. T. Lott, V. Procaccio, J. Poole, M. C. Brandon, D. Mishmar, C. Yi, J. Kreuziger, P. Baldi, and D. C. Wallace. An enhanced MITOMAP with a global mtDNA mutational phylogeny. Nucleic Acids Research, 35:D823–D828, 2007. Database Issue.

[31] Y. Sang and H. Shen. Privacy preserving set intersection protocol secure against malicious behaviors.
In PDCA T ’07: Pr oceedings of the Eighth International Confer ence on P arallel and Distributed Computing, Applications and T echnologies , pages 461–468, W ashington, DC, USA, 2007. IEEE Computer Society . [32] Y . Sang and H. Shen. Pri vac y preserving set intersection based on bilinear groups. In A CSC ’08: Pr oceedings of the thirty-ﬁrst Austr alasian confer ence on Computer science , pages 47–54, Darlinghurst, Australia, Australia, 2008. Australian Computer Society , Inc. [33] J. Stuckman and G.-Q. Zhang. Mastermind is np-complete, 2005. http://arxi v .org/abs/cs/0512049. [34] D. Szajda, M. Pohl, J. Owen, and B. G. Lawson. T ow ard a practical data priv acy scheme for a distributed implementation of the Smith-W aterman genome sequence comparison algorithm. In Pr oceedings of the Network and Distributed System Security Symposium (NDSS) , 2006. [35] J. R. T roncoso-Pastoriza, S. Katzenbeisser , and M. Celik. Pri vacy preserving error resilient dna searching through obli vious automata. In CCS ’07: Pr oceedings of the 14th A CM confer ence on Computer and communications security , pages 519–528, Ne w Y ork, NY , USA, 2007. A CM. [36] J. D. Ullman, A. V . Aho, and D. S. Hirschber g. Bounds on the complexity of the longest common subsequence problem. J. A CM , 23(1):1–12, 1976. [37] J. V aidya and C. Clifton. Secure set intersection cardinality with application to association rule mining. J. Comput. Secur . , 13(4):593–622, 2005. [38] J. C. V enter , M. D. Adams, E. W . Myers, and P . W . Li et al. The sequence of the human genome. Science , 291:1304–1351, 2001. [39] A. C. Y ao. Protocols for secure computations. In Pr oc. of 23rd Symp. on F oundations of Computer Science , pages 160–164, W ashington, DC, USA, 1982. IEEE Computer Society . 21
