p-Adic Degeneracy of the Genetic Code

Degeneracy of the genetic code is a biological way to minimize effects of the undesirable mutation changes. Degeneration has a natural description on the 5-adic space of 64 codons $\mathcal{C}_5 (64) = \{n_0 + n_1 5 + n_2 5^2 : n_i = 1, 2, 3, 4 \} …

Authors: Branko Dragovich, Alexandra Dragovich

Branko Dragovich (a,*) and Alexandra Dragovich (b)

(a) Institute of Physics, P.O. Box 57, 11001 Belgrade, Serbia
(b) Vavilov Institute of General Genetics, Gubkin St. 3, 119991 Moscow, Russia
(*) E-mail: dragovich@phy.bg.ac.yu

Abstract

Degeneracy of the genetic code is a biological way to minimize the effects of undesirable mutations. This degeneracy has a natural description on the 5-adic space of 64 codons $\mathcal{C}_5(64) = \{n_0 + n_1 5 + n_2 5^2 : n_i = 1, 2, 3, 4\}$, where the $n_i$ are digits related to nucleotides as follows: C = 1, A = 2, T = U = 3, G = 4. The smallest 5-adic distance between codons joins them into 16 quadruplets, which under the 2-adic distance split into 32 doublets. $p$-Adically close codons are assigned to one of the 20 amino acids, which are the building blocks of proteins, or code for the termination of protein synthesis. We show that genetic code multiplets are made of the $p$-adically nearest codons.

1 Introduction

Genetic information in living systems is contained in the deoxyribonucleic acid (DNA) sequence. DNA macromolecules are composed of two polynucleotide chains with a double-helical structure. The building blocks of the genetic information are four nucleotides: adenine (A), guanine (G), cytosine (C) and thymine (T). A and G are purines, while C and T are pyrimidines. Nucleotides are arranged along the double helix through the base pairs A-T and C-G. The DNA is packaged in chromosomes, which are localized in the nucleus of eukaryotic cells. One of the basic processes within DNA is its replication. The passage of DNA gene information to proteins, called gene expression, is carried out by messenger ribonucleic acids (mRNA), which are usually single polynucleotide chains.
The mRNA is synthesized in the first part of this process, known as transcription, when the nucleotides A, G, C, T of DNA are respectively transcribed into their complements U, C, G, A of mRNA, where T is replaced by U (uracil). The next step is translation, when the information coded by codons in the mRNA is translated into proteins. Two other RNAs are involved in this process: transfer RNA (tRNA) and ribosomal RNA (rRNA). Codons are ordered sequences of three nucleotides taken from A, G, C, U. Protein synthesis in all eukaryotic cells takes place in the ribosomes of the cytoplasm.

The genetic code relates the sequence of codons in mRNA to the sequence of amino acids in a protein. Although there are about a dozen codes (see, e.g., [1]), the two most important are the eukaryotic code and the vertebrate mitochondrial code. In the sequel we shall mainly consider the vertebrate mitochondrial code, because it looks the simplest and the others can be regarded as its modifications. There are obviously 4 × 4 × 4 = 64 codons. However, in the vertebrate mitochondrial code, 60 of them are distributed over the 20 different amino acids and 4 are stop codons, which serve as termination signals. According to experimental observations, two amino acids are coded by six codons each, six amino acids by four codons each, and twelve amino acids by two codons each. This property, that more than one codon corresponds to an amino acid, is known as genetic code degeneracy. Degeneracy is a very important property of the genetic code and gives an efficient way to minimize the effects of undesirable mutations.
Since there is a huge number (about $10^{80}$) of all possible assignments between codons and amino acids, and only a very small number (about a dozen) of them is represented in living cells, it has been a persistent theoretical challenge to find an appropriate model explaining contemporary genetic codes. There is still no generally accepted explanation of the genetic code. For detailed and comprehensive information on the molecular biology aspects of DNA, RNA and the genetic code, see Ref. [2]. It is worth mentioning that the human genome, which contains all genetic information of Homo sapiens, is composed of about three billion DNA base pairs and contains more than 20,000 genes.

Modeling of DNA, RNA and the genetic code is a challenge as well as an opportunity for modern mathematical physics. An interesting model based on the quantum algebra $U_q(sl(2) \oplus sl(2))$ in the $q \to 0$ limit was proposed as a symmetry algebra for the genetic code (see [1] and references therein). In a sense this approach mimics the quark model of baryons. To describe the correspondence between codons and amino acids, an operator was constructed which acts on the space of codons and whose eigenvalues are related to amino acids. Despite some successes of this approach, the operator contains rather many parameters. There are also papers [3] starting with a 64-dimensional irreducible representation of a Lie (super)algebra and trying to connect the multiplicity of codons with irreducible representations of subalgebras arising in a chain of symmetry breaking. Although interesting as an attempt to describe the evolution of the genetic code, these Lie algebra approaches did not succeed in reproducing its modern form. For a very brief review of these and some other theoretical approaches to the genetic code, see Ref. [1].
Recently we introduced a $p$-adic approach to DNA and RNA sequences and the genetic code [4]. Let us mention that $p$-adic models in mathematical physics have been actively considered since 1987 (see [5], [6] for early reviews and [7], [8] for some recent reviews). It is worth noting that $p$-adic models with pseudodifferential operators have been successfully applied to the interbasin kinetics of proteins [9]. Some $p$-adic aspects of cognitive, psychological and social phenomena have also been considered [10]. The present status of applications of $p$-adic numbers in physics and related branches of science is reflected in the proceedings of the 2nd International Conference on p-Adic Mathematical Physics [11].

The main goal of this paper is to present the $p$-adic root of the genetic code and, in particular, of its degeneracy.

2 p-Adic space of codons

An elementary introduction to $p$-adic numbers can be found in the book [12]. However, for our purposes we will use here only a small part of $p$-adics, mainly a finite set of integers and the ultrametric distance between them.

Let us introduce the set of natural numbers

$$ \mathcal{C}_5(64) = \{ n_0 + n_1 5 + n_2 5^2 : n_i = 1, 2, 3, 4 \}, \tag{1} $$

where the $n_i$ are digits related to nucleotides by the following assignment: C = 1, A = 2, T = U = 3, G = 4. This is an expansion to the base 5. Note that 5 is a prime number and that the set $\mathcal{C}_5(64)$ contains 64 numbers between 31 and 124 (in the usual base 10). In the sequel we shall denote elements of $\mathcal{C}_5(64)$ by their digits to the base 5 in the following way: $n_0 + n_1 5 + n_2 5^2 \equiv n_0 n_1 n_2$. Note that here the ordering of digits is the same as in the expansion (1), i.e. it is opposite to the usual one. There is now an evident one-to-one correspondence between codons in the letter representation XYZ and in the number representation $n_0 n_1 n_2$.
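The correspondence between letter codons and elements of $\mathcal{C}_5(64)$ can be illustrated with a minimal Python sketch. The digit assignment is the one given in the text; the names `DIGIT`, `codon_to_int` and `codon_to_digits` are hypothetical helpers introduced only for this illustration.

```python
# Digit assignment taken from the text: C = 1, A = 2, T = U = 3, G = 4.
DIGIT = {"C": 1, "A": 2, "U": 3, "T": 3, "G": 4}


def codon_to_int(codon: str) -> int:
    """Map a codon XYZ to n0 + n1*5 + n2*5**2 (first nucleotide gives n0)."""
    n0, n1, n2 = (DIGIT[x] for x in codon.upper())
    return n0 + 5 * n1 + 25 * n2


def codon_to_digits(codon: str) -> str:
    """Return the digit string n0 n1 n2, in the same order as the letters."""
    return "".join(str(DIGIT[x]) for x in codon.upper())


# All 64 RNA codons and their integer values.
codons = [x + y + z for x in "CAUG" for y in "CAUG" for z in "CAUG"]
values = sorted(codon_to_int(c) for c in codons)

print(len(set(values)), values[0], values[-1])      # 64 31 124
print(codon_to_digits("AUG"), codon_to_int("AUG"))  # 234 117
```

The 64 values are pairwise distinct because a base-5 expansion is unique, which is the one-to-one correspondence stated above.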
In addition to arithmetic operations, it is often important to know a distance between numbers. A distance can be defined by a norm. On the set $\mathbb{Z}$ of integers there are two kinds of nontrivial norm: the usual absolute value $|\cdot|_\infty$ and the $p$-adic absolute value $|\cdot|_p$, where $p$ is any prime number. The usual absolute value is well known from elementary courses of mathematics, and the corresponding distance between two numbers $x$ and $y$ is $d_\infty(x, y) = |x - y|_\infty$.

The $p$-adic absolute value is related to the divisibility of integers by prime numbers, and the $p$-adic distance can be understood as a measure of this divisibility for the difference of two numbers (the more divisible, the shorter). By definition, the $p$-adic norm of an integer $m \in \mathbb{Z}$ is $|m|_p = p^{-k}$, where $k \in \mathbb{N} \cup \{0\}$ is the degree of divisibility of $m$ by the prime $p$ (i.e. $m = p^k m'$, $p \nmid m'$), and $|0|_p = 0$. This norm is a mapping from $\mathbb{Z}$ into the non-negative real numbers and has the following properties:

(i) $|x|_p \geq 0$, and $|x|_p = 0$ if and only if $x = 0$,
(ii) $|x y|_p = |x|_p \, |y|_p$,
(iii) $|x + y|_p \leq \max\{|x|_p, |y|_p\} \leq |x|_p + |y|_p$ for all $x, y \in \mathbb{Z}$.

Because of the strong triangle inequality $|x + y|_p \leq \max\{|x|_p, |y|_p\}$, the $p$-adic absolute value is a non-Archimedean (ultrametric) norm. One can easily conclude that $0 \leq |m|_p \leq 1$. The $p$-adic distance between two integers $x$ and $y$ is

$$ d_p(x, y) = |x - y|_p. \tag{2} $$

Since the $p$-adic absolute value is ultrametric, the $p$-adic distance (2) is also ultrametric, i.e. it satisfies

$$ d_p(x, y) \leq \max\{d_p(x, z), d_p(z, y)\} \leq d_p(x, z) + d_p(z, y), \tag{3} $$

where $x$, $y$ and $z$ are any three integers.

The set $\mathcal{C}_5(64)$ introduced above, endowed with the $p$-adic distance, we shall call the $p$-adic space of codons. The 5-adic distance between two codons $a, b \in \mathcal{C}_5(64)$ is

$$ d_5(a, b) = |a_0 + a_1 5 + a_2 5^2 - b_0 - b_1 5 - b_2 5^2|_5. \tag{4} $$
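The norm and the distance defined in (2)-(4) can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than part of the original paper: `padic_norm` and `padic_dist` are hypothetical names, and exact fractions are used so that values such as 1/25 are not rounded.

```python
from fractions import Fraction


def padic_norm(m: int, p: int) -> Fraction:
    """|m|_p = p**(-k), where p**k is the largest power of p dividing m; |0|_p = 0."""
    if m == 0:
        return Fraction(0)
    k = 0
    while m % p == 0:
        m //= p
        k += 1
    return Fraction(1, p ** k)


def padic_dist(x: int, y: int, p: int) -> Fraction:
    """p-adic distance d_p(x, y) = |x - y|_p, as in equation (2)."""
    return padic_norm(x - y, p)


# Codons written as integers: 31 = CCC (111), 56 = CCA (112),
# 36 = CAC (121), 32 = ACC (211).
print(padic_dist(31, 56, 5))  # 1/25
print(padic_dist(31, 36, 5))  # 1/5
print(padic_dist(31, 32, 5))  # 1

# Spot check of the strong triangle inequality (3) on a sample of integers.
sample = range(31, 125, 3)
assert all(
    padic_dist(x, y, 5) <= max(padic_dist(x, z, 5), padic_dist(z, y, 5))
    for x in sample for y in sample for z in sample
)
```

The final assertion is only a numerical spot check on a sample; the inequality itself holds for all integers by property (iii).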
When $a \neq b$, the distance $d_5(a, b)$ may have three different values: (i) $d_5(a, b) = 1$ if $a_0 \neq b_0$, (ii) $d_5(a, b) = 1/5$ if $a_0 = b_0$ and $a_1 \neq b_1$, and (iii) $d_5(a, b) = 1/5^2$ if $a_0 = b_0$, $a_1 = b_1$ and $a_2 \neq b_2$. We see that the maximum 5-adic distance between codons is 1, and it is equal to the maximum $p$-adic distance on $\mathbb{Z}$. Let us also note that whenever two codons differ in one of their first two nucleotides, the 5-adic distance between them (1 or 1/5) does not depend on the third nucleotide. The 5-adic distance between codons is a natural one to describe information similarity between them. In the case of the standard distance $d_\infty(a, b) = |a_0 + a_1 5 + a_2 5^2 - b_0 - b_1 5 - b_2 5^2|_\infty$, the third nucleotides $a_2$ and $b_2$ play a more important role than those at the second place (i.e. $a_1$ and $b_1$), and the nucleotides $a_0$ and $b_0$ are of the smallest importance.

3 p-Adic genetic code

Living cells are very complex systems composed mainly of proteins, which play various roles. These proteins are long linear chains made of only 20 amino acids, which are the same for the whole living world on the Earth. Different sequences of amino acids form different proteins. An intensive study of the connection between the ordering of nucleotides in DNA (and RNA) and the ordering of amino acids in proteins led to the experimental discovery of the genetic code in the mid-1960s. The genetic code is understood as a dictionary for the translation of information from DNA (through RNA) to the production of proteins from amino acids. The information on amino acids is contained in codons. To a sequence of codons in the RNA corresponds a quite definite sequence of amino acids in a protein, and this sequence of amino acids determines the primary structure of the protein. However, there is no simple theoretical understanding of genetic coding. In particular, it is not clear why the genetic code exists just in the known way and not in many other possible ways.
What is the principle (or principles) used in the establishment of the basic (mitochondrial) code? What are the properties of codons that connect them into definite multiplets which code for the same amino acid or termination signal? These are only some of the many questions whose answers should lead us to an appropriate theoretical model of the genetic code.

111 CCC Pro    211 ACC Thr    311 UCC Ser    411 GCC Ala
112 CCA Pro    212 ACA Thr    312 UCA Ser    412 GCA Ala
113 CCU Pro    213 ACU Thr    313 UCU Ser    413 GCU Ala
114 CCG Pro    214 ACG Thr    314 UCG Ser    414 GCG Ala

121 CAC His    221 AAC Asn    321 UAC Tyr    421 GAC Asp
122 CAA Gln    222 AAA Lys    322 UAA Ter    422 GAA Glu
123 CAU His    223 AAU Asn    323 UAU Tyr    423 GAU Asp
124 CAG Gln    224 AAG Lys    324 UAG Ter    424 GAG Glu

131 CUC Leu    231 AUC Ile    331 UUC Phe    431 GUC Val
132 CUA Leu    232 AUA Met    332 UUA Leu    432 GUA Val
133 CUU Leu    233 AUU Ile    333 UUU Phe    433 GUU Val
134 CUG Leu    234 AUG Met    334 UUG Leu    434 GUG Val

141 CGC Arg    241 AGC Ser    341 UGC Cys    441 GGC Gly
142 CGA Arg    242 AGA Ter    342 UGA Trp    442 GGA Gly
143 CGU Arg    243 AGU Ser    343 UGU Cys    443 GGU Gly
144 CGG Arg    244 AGG Ter    344 UGG Trp    444 GGG Gly

Table: The vertebrate mitochondrial code

Let us now look at the experimental Table of the vertebrate mitochondrial code and compare it with the $\mathcal{C}_5(64)$ codon space introduced above. To this end, codons are simultaneously denoted by three digits and by standard capital letters (recall: C = 1, A = 2, U = 3, G = 4). The corresponding amino acids are presented in the usual three-letter form. First of all, let us note that our Table is constructed according to the gradual change of digits and, as a consequence, the spatial distribution of amino acids differs from that of the standard (Watson-Crick) table (see, e.g., [1]).
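The codon counts quoted for the vertebrate mitochondrial code can be checked directly against the Table. The following Python sketch uses a compact transcription of the Table; the `BOXES` encoding (one entry per pair of first two nucleotides) is an illustrative choice made here, not a structure used in the paper.

```python
from collections import Counter

# For each pair of first two nucleotides: (assignment for codons ending in a
# pyrimidine C or U, assignment for codons ending in a purine A or G).
# "Ter" marks a stop codon. Entries are transcribed from the Table.
BOXES = {
    "CC": ("Pro", "Pro"), "AC": ("Thr", "Thr"), "UC": ("Ser", "Ser"), "GC": ("Ala", "Ala"),
    "CA": ("His", "Gln"), "AA": ("Asn", "Lys"), "UA": ("Tyr", "Ter"), "GA": ("Asp", "Glu"),
    "CU": ("Leu", "Leu"), "AU": ("Ile", "Met"), "UU": ("Phe", "Leu"), "GU": ("Val", "Val"),
    "CG": ("Arg", "Arg"), "AG": ("Ser", "Ter"), "UG": ("Cys", "Trp"), "GG": ("Gly", "Gly"),
}

# Expand the boxes into the full codon -> amino acid dictionary.
code = {}
for prefix, (pyrimidine_end, purine_end) in BOXES.items():
    for third in "CAUG":
        code[prefix + third] = pyrimidine_end if third in "CU" else purine_end

codons_per_subject = Counter(code.values())
stop_codons = codons_per_subject.pop("Ter")
multiplicities = Counter(codons_per_subject.values())

print(len(code), len(codons_per_subject), stop_codons)  # 64 20 4
print(sorted(multiplicities.items()))  # [(2, 12), (4, 6), (6, 2)]
```

The last line reads as: twelve amino acids with two codons, six with four codons and two with six codons, which together with the four stop codons accounts for all 64 codons.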
Any of these tables can be regarded as a big rectangle divided into 16 equal smaller rectangles: 8 of them are quadruplets which correspond one-to-one to 8 amino acids, and the other 8 rectangles are divided into 16 doublets coding for 14 amino acids and the termination (stop) signal (by two doublets at different places). Note that 2 of the 16 doublets code for 2 amino acids (Ser and Leu) which are already coded by 2 quadruplets; thus the amino acids serine and leucine are each coded by 6 codons.

In our Table quadruplets and doublets together form a figure which is symmetric with respect to the middle vertical line, i.e. it is invariant under the interchange $1 \leftrightarrow 4$ and $2 \leftrightarrow 3$ of the first digits in codons. Recall that DNA is symmetric under the simultaneous interchange of complementary nucleotides in its strands. In other words, DNA is invariant under the nucleotide interchange $1 \leftrightarrow 4$ and $2 \leftrightarrow 3$ between strands. All doublets in the Table form a nice figure which looks like the letter T.

Now we can look at the Table as a representation of the $\mathcal{C}_5(64)$ codon space. Namely, we observe that there are 16 quadruplets such that the codons in each of them have the same first two digits. Hence the 5-adic distance between any two different codons inside a quadruplet is

$$ d_5(a, b) = |a_0 + a_1 5 + a_2 5^2 - a_0 - a_1 5 - b_2 5^2|_5 = |(a_2 - b_2) 5^2|_5 = 5^{-2}, \tag{5} $$

because $a_0 = b_0$, $a_1 = b_1$ and $|a_2 - b_2|_5 = 1$.

Since codons are composed of three nucleotides, each of which is either a purine or a pyrimidine, it is natural to try to quantify the similarity inside purines and pyrimidines, as well as the distinction between elements from these two groups of nucleotides. Fortunately there is a tool, again related to $p$-adics, namely the 2-adic distance. One can easily see that the 2-adic distance between the pyrimidines C and U is 1/2, as is the distance between the purines A and G.
However, the 2-adic distance between C and A or G, as well as the distance between U and A or G, is 1 (i.e. the maximum).

With respect to the 2-adic distance, the above quadruplets may be regarded as composed of two doublets: $a = a_0 a_1 1$ and $b = a_0 a_1 3$ make the first doublet, and $c = a_0 a_1 2$ and $d = a_0 a_1 4$ form the second one. The 2-adic distance between codons within each of these doublets is 1/2, i.e.

$$ d_2(a, b) = |(3 - 1) 5^2|_2 = \frac{1}{2}, \qquad d_2(c, d) = |(4 - 2) 5^2|_2 = \frac{1}{2}, \tag{6} $$

because $3 - 1 = 4 - 2 = 2$.

One can now look at the Table as a system of 32 doublets. Thus the 64 codons are clustered in a very regular way into 32 doublets. Each of 21 subjects (20 amino acids and 1 termination signal) is coded by one, two or three doublets. In fact, there are two, six and twelve amino acids coded by three, two and one doublets, respectively. The remaining two doublets code for the termination signal.

To have a more complete picture of the genetic code it is useful to consider possible distances between codons from different quadruplets as well as from different doublets. Also, we introduce a distance between quadruplets or between doublets, especially when the distances between their codons have the same value. Thus the 5-adic distance between a quadruplet and the quadruplets in the same column is 1/5, while the distance to all other quadruplets is 1. The 5-adic distance between doublets coincides with the distance between quadruplets, and this distance is $1/5^2$ when the doublets are inside the same quadruplet.

The 2-adic distance between codons, doublets and quadruplets is more complex. There are three basic cases: (1) codons differ in only one digit, (2) codons differ in two digits, and (3) codons differ in all three digits. In the first case, the 2-adic distance can be 1/2 or 1, depending on whether the difference between the digits is 2 or not, respectively.
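The clustering of the 64 codons into 16 quadruplets and then into 32 doublets can be reproduced numerically. The Python sketch below is an illustration under stated assumptions: the helper names (`codon_to_int`, `padic_norm`, `balls`) are hypothetical, and the doublets are computed inside each quadruplet, as in the text, rather than over the whole codon space.

```python
from fractions import Fraction

DIGIT = {"C": 1, "A": 2, "U": 3, "G": 4}  # digit assignment from the text


def codon_to_int(codon):
    """Map a codon XYZ to n0 + n1*5 + n2*5**2."""
    n0, n1, n2 = (DIGIT[x] for x in codon)
    return n0 + 5 * n1 + 25 * n2


def padic_norm(m, p):
    """|m|_p = p**(-k) with p**k the largest power of p dividing m; |0|_p = 0."""
    if m == 0:
        return Fraction(0)
    k = 0
    while m % p == 0:
        m //= p
        k += 1
    return Fraction(1, p ** k)


def balls(codons, p, radius):
    """Distinct sets of codons lying within p-adic distance `radius` of one another."""
    return {
        frozenset(b for b in codons
                  if padic_norm(codon_to_int(a) - codon_to_int(b), p) <= radius)
        for a in codons
    }


codons = [x + y + z for x in "CAUG" for y in "CAUG" for z in "CAUG"]

# Quadruplets: codons at 5-adic distance at most 1/25 (same first two digits).
quadruplets = balls(codons, 5, Fraction(1, 25))

# Doublets: inside each quadruplet, codons at 2-adic distance at most 1/2.
doublets = set()
for q in quadruplets:
    doublets |= balls(sorted(q), 2, Fraction(1, 2))

print(len(quadruplets), len(doublets))  # 16 32
print(sorted(sorted(d) for d in doublets if "CCC" in d or "CCA" in d))
# [['CCA', 'CCG'], ['CCC', 'CCU']]
```

Because both distances are ultrametric, the sets returned by `balls` do not overlap, so each codon belongs to exactly one quadruplet and one doublet.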
Let us now look at the 2-adic distances between doublets coding for leucine, and also between doublets coding for serine. These are the two cases of amino acids coded by three doublets. The doublet consisting of codons 332 and 334 should be compared with the doublet of codons 132 and 134. The largest 2-adic distance between them is 1/2. We again obtain the maximum distance 1/2 for serine when we compare the doublets (311, 313) and (241, 243).

Other known codes may be regarded as modifications of the vertebrate mitochondrial code (inside five quadruplets of the T-like region and the quadruplet coding for leucine). The modification means that some codons change their meaning and code for either other amino acids or the termination signal. So, in the universal (standard, canonical) code there are the following changes: (i) 232 AUA: Met → Ile, (ii) 242 AGA and 244 AGG: Ter → Arg, (iii) 342 UGA: Trp → Ter.

4 Discussion and concluding remarks

We have chosen $p = 5$ as the base in the expansion of an element of the $\mathcal{C}_5(64)$ space of codons, because 5 is the smallest prime number which accommodates the four nucleotides (A, T, G, C) in DNA, or (A, U, G, C) in RNA, in the form of four different digits. At first glance, because there are four nucleotides, one could think that a 4-adic expansion, which has just four digits, might be more appropriate. However, note that 4 is a composite integer and that such an expansion is not suitable, since the corresponding $|\cdot|_4$ absolute value is not a norm but a pseudonorm, which creates a problem with the uniqueness of the distance between two points. To illustrate this problem let us consider, for instance, the distance between the numbers 4 and 0. Then we have $d_4(0, 4) = |4|_4 = \frac{1}{4}$, but on the other hand $d_4(0, 4) = |2|_4 \, |2|_4 = 1$.

Recall that there are generally 5 digits (0, 1, 2, 3, 4) in the representation of 5-adic numbers.
In this approach, we omitted the digit 0 as a representation of a nucleotide, because its only consistent meaning is the absence of any nucleotide. Let us note that there are in general 24 possibilities to connect four digits with four nucleotides. However, we find that the above choice seems to be the most appropriate.

An essential property of the $\mathcal{C}_5(64)$ space of codons is the ultrametric behavior of distances between its elements, which radically differs from usual distances. One can easily observe that quadruplets and doublets of codons in the vertebrate mitochondrial code have a natural explanation within 5-adic and 2-adic closeness. It follows that the degeneracy of the genetic code in the form of doublets, quadruplets and sextuplets is a direct consequence of $p$-adic ultrametricity between codons.

There is an important aspect of genetic coding related to particular connections between codons and amino acids. Namely, which amino acid corresponds to which multiplet of codons? An answer should be related to connections between stereochemical properties of codons and amino acids.

Let us also note a recent paper [13], where an ultrametric approach to the genetic code is considered on a dyadic plane.

Acknowledgments

The work on this paper was partially supported by the Ministry of Science and Environmental Protection, Serbia, under contract No 144032D. One of the authors (B.D.) would like to thank M. Rakocevic for useful discussions on the genetic code and amino acids.

References

[1] L. Frappat, A. Sciarrino and P. Sorba, J. Biol. Phys. 27 (2001) 1-38; physics/0003037.
[2] J.D. Watson, T.A. Baker, S.P. Bell, A. Gann, M. Levine and R. Losick, Molecular Biology of the Gene, CSHL Press, Benjamin Cummings, San Francisco, 2004.
[3] J.E.M. Hornos and Y.M.M. Hornos, Phys. Rev. Lett. 71 (1993) 4401-4404; M. Forger and S. Sachse, Lie Superalgebras and the Multiplet Structure of the Genetic Code I: Codon Representations; math-ph/9808001.
[4] B. Dragovich and A. Dragovich, A p-Adic Model of DNA Sequence and Genetic Code; q-bio.GN/0607018.
[5] L. Brekke and P.G.O. Freund, Phys. Rept. 233 (1993) 1-66.
[6] V.S. Vladimirov, I.V. Volovich and E.I. Zelenov, p-Adic Analysis and Mathematical Physics, World Scientific, Singapore, 1994.
[7] B. Dragovich, p-Adic and Adelic Quantum Mechanics, Proc. V.A. Steklov Inst. Math. 245 (2004) 72-85; hep-th/0312046.
[8] B. Dragovich, p-Adic and Adelic Cosmology: p-Adic Origin of Dark Energy and Dark Matter, in 'p-Adic Mathematical Physics', AIP Conference Proceedings 826 (2006) 25-42; hep-th/0602044.
[9] V.A. Avetisov, A.Kh. Bikulov and V.A. Osipov, J. Phys. A: Math. Gen. 36 (2003) 4239-4246.
[10] A. Khrennikov, Information Dynamics in Cognitive, Psychological, Social and Anomalous Phenomena, Kluwer AP, Dordrecht, 2004.
[11] p-Adic Mathematical Physics, Proceedings of the 2nd International Conference on p-Adic Mathematical Physics, AIP Conference Proceedings 826, 2006.
[12] F.Q. Gouvea, p-Adic Numbers: An Introduction (Universitext), Springer, Berlin, 1993.
[13] A.Yu. Khrennikov and S.V. Kozyrev, Genetic code on a diadic plane; q-bio/0701007.
