Improved Lower Bounds for Constant GC-Content DNA Codes

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 54, NO. 1, JANUARY 2008 391

Yeow Meng Chee and San Ling

Abstract: The design of large libraries of oligonucleotides having constant GC-content and satisfying Hamming distance constraints between oligonucleotides and their Watson–Crick complements is important in reducing hybridization errors in DNA computing, DNA microarray technologies, and molecular bar coding. Various techniques have been studied for the construction of such oligonucleotide libraries, ranging from algorithmic constructions via stochastic local search to theoretical constructions via coding theory. A new stochastic local search method is introduced, which yields improvements for more than one third of the benchmark lower bounds of Gaborit and King (2005) for $n$-mer oligonucleotide libraries when $n \le 14$. Several optimal libraries are also found by computing maximum cliques on certain graphs.

Index Terms: DNA codes, exhaustive search, Hamming distance model, oligonucleotide libraries, stochastic local search.

I. INTRODUCTION

Oligonucleotides (short single-stranded DNA) made by chemical synthesis are important structures for information storage in DNA computing [1], [2], as probes in DNA microarray technologies [3], [4], and as tags in molecular bar coding [5]–[7]. The critical property of DNA in these applications is the tendency of oligonucleotides to specifically hybridize to their Watson–Crick complements and form a stable duplex [8].

Unfortunately, nonspecific hybridizations can also occur between oligonucleotides used in a self-assembly step, in a polymerase chain reaction, or in an extraction operation. The probability of such hybridization errors is related to the combinatorial as well as the thermodynamic properties of the oligonucleotides.
Among the basic constraints that must be fulfilled in order to reduce the probability of erroneous hybridizations for a library of oligonucleotides, the following are of particular importance:
1) two oligonucleotides in the library must be dissimilar;
2) an oligonucleotide in the library must be dissimilar to the (Watson–Crick) complement of another oligonucleotide in the library;
3) every oligonucleotide in the library has similar melting temperature;
4) an oligonucleotide must not fold back onto itself in a manner that renders it chemically inactive.

Manuscript received October 19, 2006; revised September 2, 2007. This work was supported in part by the Singapore Ministry of Education under Research Grant T206B2204.
Y. M. Chee is with the Interactive Digital Media R&D Program Office, Media Development Authority, Singapore 179369, Republic of Singapore, the Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637616, Republic of Singapore, and with the Department of Computer Science, School of Computing, National University of Singapore, Singapore 117590, Republic of Singapore (e-mail: ymchee@alumni.uwaterloo.ca).
S. Ling is with the Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637616, Republic of Singapore (e-mail: lingsan@ntu.edu.sg).
Communicated by V. A. Vaishampayan, Associate Editor At Large.
Digital Object Identifier 10.1109/TIT.2007.911167

The measure of similarity between oligonucleotides depends on the hybridization model adopted. On two extremes of the spectrum, we have the following.
• Hamming distance model [9]–[11]. The sugar-phosphate backbone of oligonucleotides is nonelastic, and an oligonucleotide can only hybridize to its Watson–Crick complement.
• Levenshteĭn distance model [12], [13].
The sugar-phosphate backbone of oligonucleotides is completely elastic, and an oligonucleotide $\alpha$ can hybridize to any oligonucleotide containing the Watson–Crick complement of $\alpha$ as a subsequence.

In actual fact, the sugar-phosphate backbone of oligonucleotides shows some limited elasticity, and the stability of a hybridized duplex is determined by the nearest neighbor interaction energies and stacking energies of the hybridized bases [14], which are difficult to model accurately with purely combinatorial constraints. Hybridization models based on thermodynamical properties of oligonucleotides have been proposed as better approximations [15]. Other measures of similarity between oligonucleotides have also been considered [16]. Recently, Chen et al. [17] addressed the problems of predicting hybridization properties of long oligonucleotides. In short, the problem of what properties oligonucleotides have to possess in order to exhibit very specific hybridization behavior is not well understood, except those of short lengths.

The model we adopt in this correspondence is the Hamming distance model. It should be noted that the constraints and the hybridization model we consider do not address certain issues related to hybridization which may be important in practical applications, for example insensitivity to frame-shifts, the avoidance of secondary structures, and the use of a more accurate model of melting temperature [18]–[20]. Our model also does not consider DNA folding, which is one of the most important properties one has to test in the process of probe selection. However, for the sequence lengths that we consider in this correspondence, folding is not expected to be severe (not too many oligonucleotides of up to 8-mers fold, and even if they do, the folds are usually not very stable) [21].
For the purpose of efficiency in the applications mentioned above, it is desirable that for a given $n$, we have as large a library of $n$-mer oligonucleotides as possible that satisfies constraints 1) to 3) above. This is the oligonucleotide (or DNA) sequence design problem [9], [22]–[24].

Many approaches have been considered for this problem. These include template-based constructions [11], [24]–[26], stochastic local search [27]–[31], lexicographic search [9], and coding theoretic constructions [10]. A survey of the best lower bounds for the sizes of oligonucleotide libraries has been undertaken by Gaborit and King [10].

The purpose of this correspondence is to introduce a new stochastic local search method for the oligonucleotide sequence design problem. This search method has been implemented and yielded many record-breaking oligonucleotide libraries. Several optimal oligonucleotide libraries were also obtained via an exhaustive search algorithm based on computing maximum cliques on graphs.

0018-9448/$25.00 © 2008 IEEE

II. DEFINITIONS AND NOTATIONS

We model oligonucleotides as sequences over the alphabet $\Sigma = \{\mathrm{A}, \mathrm{C}, \mathrm{G}, \mathrm{T}\}$. If $\alpha \in \Sigma^n$, the element in position $i$ of the sequence $\alpha$ is denoted $\alpha_i$. The Hamming distance between two sequences $\alpha, \beta \in \Sigma^n$, denoted $d_H(\alpha, \beta)$, is the number of positions where $\alpha$ and $\beta$ differ, that is
$$d_H(\alpha, \beta) = |\{1 \le i \le n : \alpha_i \ne \beta_i\}|.$$
The (Watson–Crick) complement of a sequence $\alpha = \alpha_1 \cdots \alpha_n \in \Sigma^n$ is the sequence $\overline{\alpha} = \overline{\alpha_n} \cdots \overline{\alpha_1} \in \Sigma^n$, where
$$\overline{\mathrm{A}} = \mathrm{T}, \quad \overline{\mathrm{C}} = \mathrm{G}, \quad \overline{\mathrm{G}} = \mathrm{C}, \quad \overline{\mathrm{T}} = \mathrm{A}.$$
The GC-content of a sequence $\alpha \in \Sigma^n$, denoted $\mathrm{GC}(\alpha)$, is the number of occurrences of G and C in $\alpha$:
$$\mathrm{GC}(\alpha) = |\{1 \le i \le n : \alpha_i \in \{\mathrm{G}, \mathrm{C}\}\}|.$$
Henceforth, lower case Greek letters are used to denote oligonucleotides, and if not otherwise stated, they are assumed to belong to a generic set $\mathcal{L}$.
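
As a concrete illustration of the three definitions above, the following minimal Python sketch computes the Hamming distance, the (reversed) Watson–Crick complement, and the GC-content of sequences written as strings over A, C, G, T. The function and variable names are assumptions made for this example and do not appear in the paper.

```python
# Illustrative helpers only; all names here are hypothetical.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def wc_complement(a: str) -> str:
    """Watson-Crick complement: reverse the sequence, then complement each base."""
    return "".join(COMPLEMENT[x] for x in reversed(a))


def gc_content(a: str) -> int:
    """Number of positions holding G or C."""
    return sum(x in "GC" for x in a)


if __name__ == "__main__":
    print(hamming_distance("ACGT", "AGGT"))  # 1
    print(wc_complement("AACG"))             # CGTT
    print(gc_content("AACG"))                # 2
```

Note that the complement reverses the order of positions, so a sequence such as ACGT is its own complement under this definition.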
A library of $n$-mer oligonucleotides $\mathcal{L} \subseteq \Sigma^n$ satisfying all the constraints
1) Hamming distance constraint: $d_H(\alpha, \beta) \ge d$ for all $\alpha, \beta \in \mathcal{L}$, $\alpha \ne \beta$;
2) Complementary distance constraint: $d_H(\alpha, \overline{\beta}) \ge d$ for all $\alpha, \beta \in \mathcal{L}$;
3) Constant GC-content constraint: $\mathrm{GC}(\alpha) = w$ for all $\alpha \in \mathcal{L}$;
is called an $(n, d, w)$-DNA code. Note that the second constraint has to hold also for $\alpha = \beta$. If $\mathcal{L} \subseteq \Sigma^n$ satisfies only the Hamming distance and the constant GC-content constraints, we call $\mathcal{L}$ a weak $(n, d, w)$-DNA code. Following King [9], we denote the maximum size of an $(n, d, w)$-DNA code by $A_4^{GC,RC}(n, d, w)$, and the maximum size of a weak $(n, d, w)$-DNA code by $A_4^{GC}(n, d, w)$. A (weak) $(n, d, w)$-DNA code containing $A_4^{GC,RC}(n, d, w)$ ($A_4^{GC}(n, d, w)$) sequences is said to be optimal.

The following halving bound is known [9], [23].

Lemma 1 (Marathe et al., King): For 0
