Inverse Folding of RNA Pseudoknot Structures

Background: RNA exhibits a variety of structural configurations. Here we consider a structure to be tantamount to the noncrossing Watson-Crick and G-U base pairings (secondary structure) and additional cross-serial base pairs. These interactions …

Authors: James Z.M. Gao, Linda Y.M. Li, Christian M. Reidys

Affiliation: Center for Combinatorics, LPMC-TJKLC, Nankai University, Tianjin 300071, PR China
Email: Gao: gzm55@cfc.nankai.edu.cn; Li: liyanmei@mail.nankai.edu.cn; Reidys (corresponding author): duck@santafe.edu

Abstract

Background: RNA exhibits a variety of structural configurations. Here we consider a structure to be tantamount to the noncrossing Watson-Crick and G-U base pairings (secondary structure) and additional cross-serial base pairs. These interactions are called pseudoknots and are observed across the whole spectrum of RNA functionalities. In the context of studying natural RNA structures, searching for new ribozymes and designing artificial RNA, it is of interest to find RNA sequences folding into a specific structure and to analyze their induced neutral networks. Since the established inverse folding algorithms, RNAinverse, RNA-SSD as well as INFO-RNA, are limited to RNA secondary structures, we present in this paper the inverse folding algorithm Inv, which can deal with 3-noncrossing, canonical pseudoknot structures.

Results: In this paper we present the inverse folding algorithm Inv. We give a detailed analysis of Inv, including pseudocodes. We show that Inv allows one to design, in particular, 3-noncrossing nonplanar RNA pseudoknot structures, a class which is difficult to construct via dynamic programming routines. Inv is freely available at http://www.combinatorics.cn/cbpc/inv.html.

Conclusions: The algorithm Inv extends inverse folding capabilities to RNA pseudoknot structures. In comparison with RNAinverse it uses new ideas, for instance by considering sets of competing structures.
As a result, Inv is not only able to find novel sequences even for RNA secondary structures, it does so in the context of competing structures that potentially exhibit cross-serial interactions.

1 Introduction

Pseudoknots are structural elements of central importance in RNA structures [1], see Figure 1. They represent cross-serial base pairing interactions between RNA nucleotides that are functionally important in tRNAs, RNaseP [2], telomerase RNA [3], and ribosomal RNAs [4]. Pseudoknot structures are observed in the mimicry of tRNA structures in plant virus RNAs as well as in the binding to the HIV-1 reverse transcriptase in in vitro selection experiments [5]. Furthermore, basic mechanisms like ribosomal frame shifting involve pseudoknots [6].

Figure 1: The pseudoknot structure of the glmS ribozyme pseudoknot P1.1 [7] as a diagram (top) and as a planar graph (bottom).

Despite playing a key role in a variety of contexts, pseudoknots are excluded from large-scale computational studies. Although the problem has attracted considerable attention in the last decade, pseudoknots are considered a somewhat "exotic" structural concept. For all we know [8], the ab initio prediction of general RNA pseudoknot structures is NP-complete, and the algorithmic difficulties of pseudoknot folding are compounded by the fact that the thermodynamics of pseudoknots is far from being well understood.

As for the folding of RNA secondary structures, Waterman et al. [9, 10], Zuker et al. [11] and Nussinov [12] established the dynamic programming (DP) folding routines. The first mfe-folding algorithm for RNA secondary structures, however, dates back to the 60's [13–15].
For restricted classes of pseudoknots, several algorithms have been designed: Rivas and Eddy [16], Dirks and Pierce [17], Reeder and Giegerich [18] and Ren et al. [19]. Recently, a novel ab initio folding algorithm Cross has been introduced [20]. Cross generates minimum free energy (mfe), 3-noncrossing, 3-canonical RNA structures, i.e. structures that do not contain three or more mutually crossing arcs and in which each stack, i.e. sequence of parallel arcs, see eq. (1), has size greater than or equal to three. In particular, in a 3-canonical structure there are no isolated arcs, see Figure 2.

Figure 2: σ-canonical RNA structures: each stack of "parallel" arcs has to have minimum size σ. Here we display a 3-canonical structure.

The notion of mfe-structure is based on a specific concept of pseudoknot loops and respective loop-based energy parameters. This thermodynamic model was conceived by Tinoco and refined by Freier, Turner, Ninio, and others [14, 21–25].

1.1 k-noncrossing, σ-canonical RNA pseudoknot structures

Let us turn back the clock: three decades ago Waterman et al. [26], Nussinov et al. [12] and Kleitman et al. [27] analyzed RNA secondary structures. Secondary structures are coarse grained RNA contact structures, see Figure 3.

Figure 3: The phenylalanine tRNA secondary structure represented as 2-noncrossing diagram (top) and as planar graph (bottom).
Secondary structures can be represented as diagrams, i.e. labeled graphs over the vertex set $[n] = \{1, \dots, n\}$ with vertex degrees $\le 1$, represented by drawing the vertices on a horizontal line and the arcs $(i, j)$, $i < j$, in the upper half-plane, see Figure 1 and Figure 4. Here, vertices and arcs correspond to the nucleotides A, G, U, C and to the Watson-Crick (A-U, G-C) and (U-G) base pairs, respectively.

In a diagram, two arcs $(i_1, j_1)$ and $(i_2, j_2)$ are called crossing if $i_1 < i_2 < j_1 < j_2$ holds. Accordingly, a k-crossing is a sequence of arcs $(i_1, j_1), \dots, (i_k, j_k)$ such that $i_1 < i_2 < \dots < i_k < j_1 < j_2 < \dots < j_k$, see Figure 5.

Figure 4: Setting k = 2 we observe that secondary structures are a particular type of k-noncrossing structures. They coincide with noncrossing diagrams having minimum arc-length two.

Figure 5: k-noncrossing diagrams: we display a 4-noncrossing diagram containing the three mutually crossing arcs (1, 7), (4, 9), (5, 11) (drawn in red).

We call diagrams containing at most (k − 1)-crossings k-noncrossing diagrams. RNA secondary structures have no crossings in their diagram representation, see Figure 3 and Figure 4, and are therefore 2-noncrossing diagrams. A structure in which any stack has at least size σ is called σ-canonical, where a stack of size σ is a sequence of "parallel" arcs of the form

$$((i, j), (i+1, j-1), \dots, (i+(\sigma-1), j-(\sigma-1))). \tag{1}$$

As a natural generalization of RNA secondary structures, k-noncrossing RNA structures [28–30] were introduced. A k-noncrossing RNA structure is a k-noncrossing diagram without arcs of the form $(i, i+1)$. In the following we assume k = 3, i.e. in the diagram representation there are at most two mutually crossing arcs, a minimum arc-length of four and a minimum stack-size of three base pairs. The notion k-noncrossing stipulates that the complexity of a pseudoknot is related to the maximal number of mutually crossing bonds. Indeed, most natural RNA pseudoknots are 3-noncrossing [31].

1.2 Neutral networks

Before considering an inverse folding algorithm into specific RNA structures one has to have at least some rationale as to why there exists a sequence realizing a given target as mfe-configuration. In fact this is, on the level of entire folding maps, guaranteed by the combinatorics of the target structures alone. It has been shown in [32] that the number of 3-noncrossing RNA pseudoknot structures satisfying the biophysical constraints grows asymptotically as $c_3\, n^{-5}\, 2.03^n$, where $c_3 > 0$ is some explicitly known constant. In view of the central limit theorems of [33], this fact implies the existence of extended (exponentially large) sets of sequences that all fold into one 3-noncrossing RNA pseudoknot structure, S. In other words, the combinatorics of 3-noncrossing RNA structures alone implies that there are many sequences mapping (folding) into a single structure. The set of all such sequences is called the neutral network (footnote 1) of the structure S [34, 35], see Figure 6.

Footnote 1: The term "neutral network" as opposed to "neutral set" stems from giant component results of random induced subgraphs of n-cubes. That is, neutral networks are typically connected in sequence space.

Figure 6: Neutral networks in sequence space: we display sequence space (left) and structure space (right) as grids. We depict a set of sequences that all fold into a particular structure. Any two of these sequences are connected by a red edge. The neutral network of this fixed structure consists of all sequences folding into it and is typically a connected subgraph of sequence space.

Figure 7: A structure and a particular compatible sequence organized in the segments of unpaired and paired bases.

By construction, all the sequences contained in such a neutral network are compatible with S. That is, at any two positions paired in S, we find two bases capable of forming a bond (A-U, U-A, G-C, C-G, G-U and U-G), see Figure 7. Let $s'$ be a sequence derived via a mutation (footnote 2) of s. If $s'$ is again compatible with S, we call this mutation "compatible".

Footnote 2: We do not consider insertions or deletions.

Let $C[S]$ denote the set of S-compatible sequences. The structure S motivates a new adjacency relation within $C[S]$. Indeed, we may reorganize a sequence $(s_1, \dots, s_n)$ into the pair

$$\bigl((u_1, \dots, u_{n_u}),\ (p_1, \dots, p_{n_p})\bigr), \tag{2}$$

where the $u_h$ denote the unpaired nucleotides and the $p_h = (s_i, s_j)$ denote base pairs, respectively, see Figure 7. We can then view $s_u = (u_1, \dots, u_{n_u})$ and $s_p = (p_1, \dots, p_{n_p})$ as elements of the formal cubes $Q_4^{n_u}$ and $Q_6^{n_p}$, implying the new adjacency relation for elements of $C[S]$.

Figure 8: Diagram representation of an RNA structure (top) and its induced compatible neighbors in sequence space (bottom). Here the neighbors on the inner circle have Hamming distance one while those on the outer circle have Hamming distance two. Note that each base pair gives rise to five compatible neighbors (red), exactly one of which is at Hamming distance one.

Accordingly, there are two types of compatible neighbors in the sequence space, u- and p-neighbors: a u-neighbor has Hamming distance one and differs exactly by a point mutation at an unpaired position. Analogously, a p-neighbor differs by a compensatory base pair-mutation, see Figure 8. Note, however, that a p-neighbor has either Hamming distance one (G-C ↦ G-U) or Hamming distance two (G-C ↦ C-G). We call a u- or a p-neighbor, y, a compatible neighbor. In light of the adjacency notion for the set of compatible sequences we call the set of all sequences folding into S the neutral network of S. By construction, the neutral network of S is contained in $C[S]$. If y is contained in the neutral network we refer to y as a neutral neighbor. This gives rise to the compatible and neutral distance of two sequences, denoted by $C(s, s')$ and $N(s, s')$. These are the minimum length of a $C[S]$-path and of a path in the neutral network between s and $s'$, respectively. Note that since each neutral path is in particular a compatible path, the compatible distance is always smaller than or equal to the neutral distance.

In this paper we study the inverse folding problem for RNA pseudoknot structures: for a given 3-noncrossing target structure S, we search for sequences from $C[S]$ that have S as mfe configuration.

2 Background

For RNA secondary structures, there are three different strategies for inverse folding, RNAinverse, RNA-SSD and INFO-RNA [36–38]. They all use a local search routine to generate iteratively sequences whose structures have smaller and smaller distances to a given target.
Here the distance between two structures is obtained by aligning them as diagrams and counting "0" if a given position is either unpaired in both structures or incident to an arc contained in both structures, and "1" otherwise, see Figure 9.

Figure 9: Positions paired differently in $S_1$ and $S_2$ are assigned a "1". There are two types of positions: I. p is contained in different arcs, see position 4, $(4, 20) \in S_1$ and $(4, 17) \in S_2$. II. p is unpaired in one structure and p is paired in the other, such as position 18.

One common assumption in these inverse folding algorithms is that the energies of specific substructures contribute additively to the energy of the entire structure. Let us proceed by analyzing the algorithms.

RNAinverse is the first inverse-folding algorithm that derives sequences that realize given RNA secondary structures as mfe-configuration. In its initialization step, a random compatible sequence s for the target T is generated. Then RNAinverse proceeds by updating the sequence s to $s', s'', \dots$ step by step, minimizing the structure distance between the mfe structure of $s'$ and the target structure T. Based on the observation that the energy of a substructure contributes additively to the mfe of the molecule, RNAinverse optimizes "small" substructures first, eventually extending these to the entire structure. While optimizing substructures, RNAinverse does an adaptive walk in order to decrease the structure distance. In fact, this walk is based entirely on random compatible mutations.

RNA-SSD first assigns specific probabilities to the bases located in unpaired positions and the base pairs (G-C, A-U, U-G) of T, respectively.
In this assignment the probability of an unpaired position being assigned either A or U is greater than that of being assigned G or C. Similarly, the probability of G-C and C-G base pairs is greater than that of the other base pairs. Then, RNA-SSD derives a hierarchical decomposition of the target structure. It recursively splits the structure and thereby derives a binary decomposition tree rooted in T and whose leaves correspond to T-substructures. Each non-leaf node of this tree represents a substructure obtained by merging the two substructures of its respective children. Given this tree, RNA-SSD performs a stochastic local search, starting at the leaves and subsequently working its way up to the root.

INFO-RNA employs a dynamic programming method for finding a well suited initial sequence. This sequence has a lowest energy with respect to T. Since the latter does not necessarily fold into T (due to potentially existing competing configurations), INFO-RNA then utilizes an improved (footnote 3) stochastic local search in order to find a sequence in the neutral network of T. In contrast to RNAinverse, INFO-RNA allows for increasing the distance to the target structure. At the same time, only positions that do not pair correctly and positions adjacent to these are examined.

Footnote 3: Relative to the local search routine used in RNAinverse.

2.1 Cross

Cross is an ab initio folding algorithm that maps RNA sequences into 3-noncrossing RNA structures. It is guaranteed to search all 3-noncrossing, σ-canonical structures and derives some (not necessarily unique) loop-based mfe-configuration. In the following we always assume σ ≥ 3. The input of Cross is an arbitrary RNA sequence s and an integer N. Its output is a list of N 3-noncrossing, σ-canonical structures, the first of which is the mfe-structure for s. This list of N structures $(C_0, C_1, \dots, C_{N-1})$ is ordered by the free energy and the first list-element, the mfe-structure, is denoted by Cross(s). If no N is specified, Cross assumes N = 1 as default.

Cross generates a mfe-structure based on specific loop-types of 3-noncrossing RNA structures. For a given structure S, let α be an arc contained in S (S-arc) and denote the set of S-arcs that cross α by $A_S(\alpha)$. For two arcs $\alpha = (i, j)$ and $\alpha' = (i', j')$, we next specify the partial order "≺" over the set of arcs:

$$\alpha' \prec \alpha \quad \text{if and only if} \quad i < i' < j' < j.$$

All notions of minimal or maximal elements are understood to be with respect to ≺. An arc $\alpha \in A_S(\beta)$ is called a minimal β-crossing if there exists no $\alpha' \in A_S(\beta)$ such that $\alpha' \prec \alpha$. Note that $\alpha \in A_S(\beta)$ can be minimal β-crossing, while β is not minimal α-crossing.

Figure 10: The standard loop-types: hairpin-loop (top), interior-loop (middle) and multi-loop (bottom). These represent all loop-types that occur in RNA secondary structures.

3-noncrossing diagrams exhibit the following four basic loop-types:

(1) A hairpin-loop is a pair $((i, j), [i+1, j-1])$, where $(i, j)$ is an arc and $[i, j]$ is an interval, i.e. a sequence of consecutive vertices $(i, i+1, \dots, j-1, j)$.

(2) An interior-loop is a sequence $((i_1, j_1), [i_1+1, i_2-1], (i_2, j_2), [j_2+1, j_1-1])$, where $(i_2, j_2)$ is nested in $(i_1, j_1)$. That is, we have $i_1 < i_2 < j_2 < j_1$.

(3) A multi-loop, see Figure 10 [20], is a sequence $((i_1, j_1), [i_1+1, \omega_1-1], S_{\omega_1}^{\tau_1}, [\tau_1+1, \omega_2-1], S_{\omega_2}^{\tau_2}, \dots)$, where $S_{\omega_h}^{\tau_h}$ denotes a pseudoknot structure over $[\omega_h, \tau_h]$ (i.e. nested in $(i_1, j_1)$) and subject to the following condition: if $S_{\omega_h}^{\tau_h} = (\omega_h, \tau_h)$ for all h, i.e. all substructures are just arcs, then we have $h \ge 2$.
(4) A pseudoknot, see Figure 11 [20], consists of the following data:

(P1) A set of arcs $P = \{(i_1, j_1), (i_2, j_2), \dots, (i_t, j_t)\}$, where $i_1 = \min\{i_h\}$ and $j_t = \max\{j_h\}$, such that (i) the diagram induced by the arc-set P is irreducible, i.e. the dependency-graph of P (i.e. the graph having P as vertex set and in which α and α′ are adjacent if and only if they cross) is connected and (ii) for each $(i_h, j_h) \in P$ there exists some arc β (not necessarily contained in P) such that $(i_h, j_h)$ is minimal β-crossing.

(P2) Any $i_1 < x < j_t$ not contained in hairpin-, interior- or multi-loops.

Figure 11: Pseudoknot loops, formed by all blue vertices and arcs.

Having discussed the basic loop-types, we are now in position to state

Theorem 1. Any 3-noncrossing RNA pseudoknot structure has a unique loop-decomposition [20].

Figure 12 illustrates the loop decomposition of a 3-noncrossing structure.

Figure 12: Loop decomposition: here a hairpin-loop (I), an interior-loop (II), a multi-loop (III) and a pseudoknot (IV).

A motif in Cross is a 3-noncrossing structure having only ≺-maximal stacks of size exactly σ, see Figure 13.

Figure 13: Motif: a 3-noncrossing, 3-canonical motif.

A skeleton, S, is a k-noncrossing structure such that
• its core, c(S), has no noncrossing arcs and
• its L-graph, L(S), is connected.
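Condition (P1)(i) and the connectivity requirement on a skeleton both come down to the same test: build the graph whose vertices are arcs and whose edges join crossing arcs, then check that it is connected. The following is a minimal sketch of that test, not code from Cross; the function names `crosses`, `crossing_graph` and `is_irreducible` are hypothetical, and arcs are assumed to be given as 1-based tuples $(i, j)$ with $i < j$.

```python
from itertools import combinations


def crosses(a, b):
    """Two arcs cross if i1 < i2 < j1 < j2 once ordered by start point."""
    (i1, j1), (i2, j2) = sorted([a, b])
    return i1 < i2 < j1 < j2


def crossing_graph(arcs):
    """Adjacency sets of the graph whose vertices are arcs and whose
    edges connect exactly the pairs of crossing arcs."""
    graph = {a: set() for a in arcs}
    for a, b in combinations(arcs, 2):
        if crosses(a, b):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def is_irreducible(arcs):
    """True if the crossing graph of a non-empty arc set is connected."""
    arcs = list(arcs)
    if not arcs:
        return False
    graph = crossing_graph(arcs)
    seen, stack = {arcs[0]}, [arcs[0]]
    while stack:  # depth-first search from the first arc
        for neighbour in graph[stack.pop()]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return len(seen) == len(arcs)
```

For the three red arcs of Figure 5, `is_irreducible([(1, 7), (4, 9), (5, 11)])` holds, whereas two nested arcs such as `(1, 10)` and `(2, 9)` do not cross and therefore fail the test.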
The core of a structure, c(S), is obtained by collapsing its stacks into single arcs (thereby reducing its length) and the graph L(S) is obtained by mapping arcs into vertices and connecting any two if they cross in the diagram representation of S, see Figure 14.

Figure 14: Skeleton and its L-graph: we display a skeleton (left) and its L-graph (right).

As for the general strategy, Cross constructs 3-noncrossing RNA structures "from top to bottom" via three subroutines:

I (Shadow): Here we generate all maximal stacks of the structure. Note that a stack is maximal with respect to ≺ if it is not nested in some other stack. This is derived by "shadowing" the motifs, i.e. their σ-stacks are extended "from top to bottom".

II (SkeletonBranch): Given a shadow, the second step of Cross consists in generating the skeleta-tree. The nodes of this tree are particular 3-noncrossing structures, obtained by successive insertions of stacks. Intuitively, a skeleton encapsulates all cross-serial arcs that cannot be recursively computed. Here the tree complexity is controlled via limiting the (total) number of pseudoknots.

III (Saturation): In the third subroutine each skeleton is saturated via DP-routines. After the saturation the mfe-3-noncrossing structure is derived.

Figure 15 provides an overview of how the three subroutines are combined.

3 The algorithm

The inverse folding algorithm Inv is based on the ab initio folding algorithm Cross. The input of Inv is the target structure, T. The latter is expressed as a character string over ":()[]{}", where ":" denotes an unpaired base and "()", "[]", "{}" denote paired bases. In Algorithm 1, we present the pseudocode of algorithm Inv.
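The input format can be illustrated by a small parser that turns a target string over ":()[]{}" into a list of arcs and checks the requirement k ≤ 3. This is a sketch under stated assumptions, not the Check-Stru routine of Inv: the names `parse_target` and `is_3_noncrossing` are hypothetical, positions are 1-based, and the minimum arc-length and stack-size constraints are deliberately not checked here.

```python
from itertools import combinations

OPEN_FOR = {")": "(", "]": "[", "}": "{"}  # closing bracket -> opening bracket


def parse_target(target):
    """Parse a string over ':()[]{}' into a sorted list of arcs (i, j), i < j.

    Raises ValueError on unknown characters or unbalanced brackets.
    """
    stacks = {"(": [], "[": [], "{": []}  # one stack per bracket type
    arcs = []
    for pos, ch in enumerate(target, start=1):
        if ch == ":":
            continue  # unpaired base
        if ch in stacks:
            stacks[ch].append(pos)
        elif ch in OPEN_FOR:
            if not stacks[OPEN_FOR[ch]]:
                raise ValueError(f"unmatched '{ch}' at position {pos}")
            arcs.append((stacks[OPEN_FOR[ch]].pop(), pos))
        else:
            raise ValueError(f"illegal character '{ch}' at position {pos}")
    if any(stacks.values()):
        raise ValueError("unmatched opening bracket")
    return sorted(arcs)


def crosses(a, b):
    """Two arcs cross if i1 < i2 < j1 < j2 once ordered by start point."""
    (i1, j1), (i2, j2) = sorted([a, b])
    return i1 < i2 < j1 < j2


def is_3_noncrossing(arcs):
    """True if no three arcs are mutually crossing (brute force, cubic time)."""
    return not any(
        crosses(a, b) and crosses(b, c) and crosses(a, c)
        for a, b, c in combinations(sorted(arcs), 3)
    )
```

For example, `parse_target("((:[[:)):]]")` yields the arcs (1, 8), (2, 7), (4, 11), (5, 10), which form a 3-noncrossing structure, while "(:[:{:):]:}" encodes three mutually crossing arcs and is rejected by `is_3_noncrossing`.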
After validation of the target structure (lines 2 to 5 in Algorithm 1), similar to INFO-RNA, Inv constructs an initial sequence and then proceeds by a stochastic local search based on the loop decomposition of the target. This sequence is derived via the routine Adjust-Seq. We then decompose the target structure into loops and endow these with a linear order. According to this order we use the routine Local-Search in order to find for each loop a "proper" local solution.

Figure 15: An outline of Cross (for illustration purposes we assume here σ = 1): The routines Shadow, SkeletonBranch and Saturation are depicted. Due to space limitations we only represent a few select motifs and for the same reason only one of the motifs displayed in the first row is extended by one arc (drawn in blue). Furthermore note that only motifs with crossings give rise to nontrivial skeleton-trees; all other motifs are considered directly as input for Saturation.

3.1 Adjust-Seq

In this section we describe Steps 2 and 3 of the pseudocode presented in Algorithm 1. The routine Make-Start, see line 8, generates a random sequence, start, which is compatible to the target, with uniform probability.
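One way to draw a compatible sequence uniformly is to use the decomposition of eq. (2): choose every unpaired position uniformly from the four nucleotides and every arc uniformly from the six admissible base pairs. The sketch below follows this idea; it is an assumption about how Make-Start could be realized, not its actual implementation, and the names `make_start` and `is_compatible` are hypothetical. Arcs are 1-based tuples over a sequence of length n.

```python
import random

# The six ordered base pairs allowed by the Watson-Crick and G-U rules.
PAIRS = ("AU", "UA", "GC", "CG", "GU", "UG")


def make_start(n, arcs, rng=None):
    """Draw a sequence of length n that is compatible with the given arcs.

    Each unpaired position is uniform over A, C, G, U and each arc is
    uniform over the six base pairs, so every compatible sequence has
    the same probability (arcs are assumed to share no end points).
    """
    rng = rng or random.Random()
    seq = [rng.choice("ACGU") for _ in range(n)]
    for i, j in arcs:
        left, right = rng.choice(PAIRS)
        seq[i - 1], seq[j - 1] = left, right  # overwrite both paired positions
    return "".join(seq)


def is_compatible(seq, arcs):
    """True if every arc joins two bases that can form a bond."""
    return all(seq[i - 1] + seq[j - 1] in PAIRS for i, j in arcs)
```

Passing a seeded generator, e.g. `make_start(11, arcs, random.Random(0))`, makes a run reproducible; by construction the result always satisfies `is_compatible`.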
We then initialize the variable $seq_{min}$ via the sequence start and set the variable $d_{min} = +\infty$, where $d_{min}$ denotes the structure distance between Cross($seq_{min}$) and T.

Algorithm 1 Inv
Input: k-noncrossing target structure T
Output: an RNA sequence seq
Require: k ≤ 3 and T is composed with ":()[]{}"
Ensure: Cross(seq) = T
 1: ▷ Step 1: Validate structure
 2: if false = Check-Stru(T) then
 3:     print incorrect structure
 4:     return NIL
 5: end if
 6:
 7: ▷ Step 2: Generate the start sequence
 8: start ← Make-Start(T)
 9:
10: ▷ Step 3: Adjust the start sequence
11: seq_middle ← Adjust-Seq(start, T)
12:
13: ▷ Step 4: Decompose T and derive the ordered intervals.
14: Interval array I
15: m ← |I|    ▷ I satisfies I_m = T
16:
17: ▷ Step 5: Stochastic Local Search
18: seq ← seq_middle
19: for all intervals I_w in the array I do
20:     l ← start-point(I_w)
21:     r ← end-point(I_w)
22:     s′ ← seq|[l,r]    ▷ get sub-sequence
23:     seq|[l,r] ← Local-Search(s′, I_w)
24: end for
25:
26: ▷ Step 6: output
27: if Cross(seq) = T then
28:     return seq
29: else
30:     print Failed!
31:     return NIL
32: end if

Given the sequence start, we construct a set of potential "competitors", C, i.e. a set of structures suited as folding targets for start. In Algorithm 2 we show how to adjust the start sequence using the routine Adjust-Seq. Lines 3 to 36 of Algorithm 2 contain a For-loop, executed at most $\sqrt{n}/2$ times. Here the loop-length $\sqrt{n}/2$ is heuristically determined. Setting the Cross-parameter N (footnote 4), the subroutine executed in the loop-body consists of the following three steps.

Footnote 4: For all computer experiments we set N = 50.

Step I. Generating $C^0(\lambda_{i-1})$ via Cross. Suppose we are in the i-th step of the For-loop and are given the sequence $\lambda_{i-1}$, where $\lambda_0 = start$. We consider Cross($\lambda_{i-1}$, N), i.e. the list of suboptimal structures with respect to $\lambda_{i-1}$,

$$C^0(\lambda_{i-1}) = \mathrm{Cross}(\lambda_{i-1}, N) = \bigl(C^0_h(\lambda_{i-1})\bigr)_{h=0}^{N-1}.$$

If $C^0_0(\lambda_{i-1}) = T$, then Inv returns $\lambda_{i-1}$. Else, in case of $d = d(C^0_0(\lambda_{i-1}), T) < d_{min}$, we set

$$seq_{min} = \lambda_{i-1}, \qquad d_{min} = d(C^0_0(\lambda_{i-1}), T).$$

Otherwise we do not update $seq_{min}$ and go directly to Step II.

Step II. The competitors. We introduce a specific procedure that "perturbs" arcs of a given RNA pseudoknot structure, S. Let a be an arc of S and let $l(a)$, $r(a)$ denote the start- and end-point of a. A perturbation of a is a procedure which generates a new arc $a'$ such that

$$|l(a) - l(a')| \le 1 \quad \text{and} \quad |r(a) - r(a')| \le 1.$$

Clearly, there are nine perturbations of any given arc a (including a itself), see Figure 16.

Figure 16: Nine perturbations of an arc (i, j). Original arcs are drawn dotted, and the arcs incident to red bases are the perturbations.

We proceed by keeping a, replacing the arc a by a nontrivial perturbation, or removing a, arriving at a set of ten structures $\nu(S, a)$. Now we use this method in order to generate the set $C^1(\lambda_{i-1})$ by perturbing each arc of each structure $C^0_h(\lambda_{i-1}) \in C^0(\lambda_{i-1})$. If $C^0_h(\lambda_{i-1})$ has $A_h$ arcs, $\{a_h^1, \dots, a_h^{A_h}\}$, then

$$C^1(\lambda_{i-1}) = \bigcup_{h=0}^{N-1} \bigcup_{j=1}^{A_h} \nu\bigl(C^0_h(\lambda_{i-1}), a_h^j\bigr).$$

This construction may result in duplicate, inconsistent or incompatible structures. Here, a structure is inconsistent if there exists at least one position paired with more than one base, and incompatible if there exists at least one arc not compatible with $\lambda_{i-1}$, see Figures 17 and 18. Here compatibility is understood with respect to the Watson-Crick and G-U base pairing rules.

Figure 17: Inconsistent structures: the dotted arc is perturbed by shifting its end-point. This perturbation leads to a nucleotide establishing two base pairs, which is impossible.

Figure 18: Incompatible structures: we display a perturbation of the dotted arc leading to a structure that is incompatible to the given sequence.

Deleting inconsistent and incompatible structures, as well as those identical to the target, we arrive at the set of competitors,

$$C(\lambda_{i-1}) \subset C^1(\lambda_{i-1}).$$

Step III. Mutation. Here we adjust $\lambda_{i-1}$ with respect to T as well as the set of competitors, $C(\lambda_{i-1})$, derived in the previous step. Suppose $\lambda_{i-1} = s^{i-1}_1 s^{i-1}_2 \dots s^{i-1}_n$. Let $p(S, w)$ be the position paired to the position w in the RNA structure $S \in C(\lambda_{i-1})$, or 0 if position w is unpaired. For instance, in Figure 19, we have $p(T, 1) = 4$, $p(T, 2) = 0$ and $p(T, 4) = 1$.

Figure 19: Mutation: Suppose the top and middle structures represent the set of competitors and the bottom structure is the target. We display $\lambda_{i-1}$ (top sequence) and its mutation, $\lambda_i$ (bottom sequence). Two nucleotides of base pairs not contained in T are colored green, nucleotides subject to mutations are colored red.

For each position w of the target T, if there exists a structure $C_h(\lambda_{i-1}) \in C(\lambda_{i-1})$ such that $p(C_h(\lambda_{i-1}), w) \ne p(T, w)$ (see positions 5, 6, 9, and 11 in Figure 19), we modify $\lambda_{i-1}$ as follows:

1. unpaired position: If $p(T, w) = 0$, we update $s^{i-1}_w$ randomly into a nucleotide $s^i_w \ne s^{i-1}_w$ such that for each $C_h(\lambda_{i-1}) \in C(\lambda_{i-1})$, either $p(C_h(\lambda_{i-1}), w) = 0$ or $s^i_w$ is not compatible with $s^{i-1}_v$, where $v = p(C_h(\lambda_{i-1}), w) > 0$, see position 6 in Figure 19.

2. start-point: If $p(T, w) > w$, set $v = p(T, w)$. We randomly choose a compatible base pair $(s^i_w, s^i_v)$ different from $(s^{i-1}_w, s^{i-1}_v)$ such that for each $C_h(\lambda_{i-1}) \in C(\lambda_{i-1})$, either $p(C_h(\lambda_{i-1}), w) = 0$ or $s^i_w$ is not compatible with $s^{i-1}_u$, where $u = p(C_h(\lambda_{i-1}), w) > 0$ is the end-point paired with $s^{i-1}_w$ in $C_h(\lambda_{i-1})$ (Figure 19: (5, 9). The pair G-C retains the compatibility to (5, 9), but is incompatible to (5, 10)). By Figure 20 we show feasibility of this step.

3. end-point: If $0 < p(T, w) < w$, then by construction the nucleotide has already been considered in the previous step.

Figure 20: Mutations are always possible: suppose p is paired with q in T and p is paired with $q_1$ in one competitor and $q_2$ in another one. For a fixed nucleotide at p there are at most two scenarios, since a base can pair with at most two different bases. For instance, for G we have the pairs G-C, G-U. We display all nucleotide configurations (LHS) and their corresponding solutions (RHS).

Therefore, updating all the nucleotides of $\lambda_{i-1}$, we arrive at the new sequence $\lambda_i = s^i_1 s^i_2 \dots s^i_n$. Note that the above mutation steps heuristically decrease the structure distance.
However, the resulting sequence is not necessarily incompatible with all competitors. For instance, consider a competitor $C_h$ whose arcs are all contained in $T$. Since $\lambda^i$ is compatible with $T$, $\lambda^i$ is also compatible with $C_h$. Since competitors are obtained from suboptimal folds, such a scenario may arise.

In practice, this situation is not a problem, since competitors of this type are likely to be ruled out by virtue of the fact that they have a free energy larger than that of the target structure.

Accordingly, competitors are eliminated by two equally important criteria: incompatibility as well as minimum free energy considerations.

If the distance of Cross($\lambda^i$) to $T$ is less than or equal to $d_{min} + 5$, we return to Step I (with $\lambda^i$). Otherwise, we repeat Step III (at most 5 times), thereby generating $\lambda^i_1, \dots, \lambda^i_5$, and set $\lambda^i = \lambda^i_w$, where $d(\text{Cross}(\lambda^i_w), T)$ is minimal.

The procedure Adjust-Seq employs the negative paradigm [17] in order to exclude energetically close conformations. It returns the sequence $seq_{middle}$, which is tailored to realize the target structure as mfe-fold.
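The containment argument above (a sequence compatible with $T$ is compatible with any structure whose arcs all lie in $T$) can be checked mechanically. The following is a minimal sketch, assuming structures are lists of 1-based arcs; the helper `is_compatible`, the toy target and the toy sequence are illustrative assumptions, not data from the paper.

```python
# Watson-Crick and G-U pairs written as two-letter strings, in both orientations.
PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}


def is_compatible(seq, arcs):
    """True if every arc (i, j) joins two nucleotides that can form a base pair."""
    return all(seq[i - 1] + seq[j - 1] in PAIRS for i, j in arcs)


target = [(1, 10), (2, 9), (3, 8)]
competitor = [(1, 10), (2, 9)]  # all arcs are contained in the target
seq = "GCAAAAAUGC"

compatible_with_target = is_compatible(seq, target)
compatible_with_competitor = is_compatible(seq, competitor)
```

Both flags are true for this toy input, so incompatibility alone cannot eliminate such a competitor; it has to be ruled out on energetic grounds.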
Algorithm 2 Adjust-Seq
Input: the original start sequence $start$
Input: the target structure $T$
Output: an initialized sequence $seq_{middle}$
 1: $n \leftarrow$ length of $T$
 2: $d_{min} \leftarrow +\infty$, $seq_{min} \leftarrow start$
 3: for $i = 1$ to $\frac{1}{2}\sqrt{n}$ do
 4:   ▷ Step I: generate the set $C^0(\lambda^{i-1})$ via Cross
 5:   $C^0(\lambda^{i-1}) \leftarrow$ Cross$(\lambda^{i-1}, N)$
 6:   $d \leftarrow d(C^0_0(\lambda^{i-1}), T)$
 7:   if $d = 0$ then
 8:     return $\lambda^{i-1}$
 9:   else if $d < d_{min}$ then
10:     $d_{min} \leftarrow d$, $seq_{min} \leftarrow \lambda^{i-1}$
11:   end if
12:
13:   ▷ Step II: generate the competitor set $C(\lambda^{i-1})$
14:   $C^1(\lambda^{i-1}) \leftarrow \emptyset$
15:   for all $C^0_h(\lambda^{i-1}) \in C^0(\lambda^{i-1})$ do
16:     for all arcs $a^j_h$ of $C^0_h(\lambda^{i-1})$ do
17:       $C^1(\lambda^{i-1}) \leftarrow C^1(\lambda^{i-1}) \cup \nu(C^0_h(\lambda^{i-1}), a^j_h)$
18:     end for
19:   end for
20:   $C(\lambda^{i-1}) =$
21:     $\{ C^1_h(\lambda^{i-1}) \in C^1(\lambda^{i-1}) : C^1_h(\lambda^{i-1}) \text{ is valid} \}$
22:
23:   ▷ Step III: mutation
24:   $seq \leftarrow \lambda^{i-1}$
25:   for $w = 1$ to $n$ do
26:     if $\exists\, C_h(\lambda^{i-1}) \in C(\lambda^{i-1})$ s.t. $p(C_h, w) \neq p(T, w)$ then
27:       $seq[w] \leftarrow$ random nucleotide or pair, s.t. $\forall\, C_h(\lambda^{i-1}) \in C(\lambda^{i-1})$, $seq \in C[T]$ and $seq \notin C[C_h(\lambda^{i-1})]$
28:     end if
29:   end for
30:   $T_{seq} \leftarrow$ Cross$(seq)$
31:   if $d(T_{seq}, T) < d_{min} + 5$ then
32:     $seq_{middle} \leftarrow seq$
33:   else if Step III run less than 5 times then
34:     goto Step III
35:   end if
36: end for ▷ loop to line 3
37:
38: return $seq_{middle}$

3.2 Decompose and Local-Search

In this section we introduce the two routines Decompose and Local-Search. The routine Decompose partitions $T$ into linearly ordered, energy independent components, see Figure 12 and Section 2.1. Local-Search iteratively constructs an optimal sequence for $T$ via local solutions that are optimal for certain substructures of $T$.

Decompose: Suppose $T$ is decomposed as follows,
\[ B = \{T_1, \dots, T_{m'}\}, \]
where the $T_w$ are the loops together with all arcs in the associated stems of the target. We define a linear order over $B$ as follows: $T_w < T_h$ if either

1. $T_w$ is nested in $T_h$, or
2. the start-point of $T_w$ precedes that of $T_h$.

In Figure 21 we display the linear order of the loops of the structure shown in Figure 12. Next we define the interval
\[ a_w = [l(T_w), r(T_w)], \quad 1 \le w \le m', \]
projecting the loop $T_w$ onto the interval $[l(T_w), r(T_w)]$, and $b_w = [l', r'] \supset a_w$, being the maximal interval consisting of $a_w$ and its adjacent unpaired consecutive nucleotides, see Figure 12. Given two consecutive loops $T_w < T_{w+1}$, we have two scenarios:

• either $b_w$ and $b_{w+1}$ are adjacent, see $b_5$ and $b_6$ in Figure 21,
• or $b_w \subseteq b_{w+1}$, see $b_1$ and $b_2$ in Figure 21.

Let $c_w = \bigcup_{h=1}^{w} b_h$; then we have the sequence of intervals $a_1, b_1, c_1, \dots, a_{m'}, b_{m'}, c_{m'}$. If there are no unpaired nucleotides adjacent to $a_w$, then $a_w = b_w$ and we simply delete all such $b_w$.

Figure 21: Linear ordering of loops $T_1, \dots, T_7$: $a_1 = [11, 19]$, $b_1 = [10, 20]$, $a_2 = [7, 37]$, $b_2 = [5, 39]$, $a_3 = [21, 42]$, $b_3 = [20, 44]$, $a_4 = [25, 47]$, $b_4 = [24, 48]$, $a_5 = [7, 47]$, $b_5 = [5, 48]$, $a_6 = [49, 57]$, $b_6 = [48, 59]$, $a_7 = [1, 63]$, $b_7 = [1, 65]$.

Figure 22: Loops and their induced sequence of intervals $I_1, I_2, I_3, I_4$.
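The interval construction can be sketched in a few lines. This is a hedged illustration rather than the authors' code: `extend_to_unpaired` and `interval_sequence` are hypothetical helper names, the union $c_w$ is assumed to be a single interval, and removing repeated intervals (not only those $b_w$ that equal $a_w$) is an assumption made here so that the output matches the four intervals reported for Figure 22. The values $a_1 = [3, 5]$, $b_1 = [3, 6]$, $a_2 = [2, 9]$, $b_2 = [1, 10]$ used in the example are likewise assumed readings of Figure 22.

```python
def extend_to_unpaired(a, paired, n):
    """Extend a = (l, r) to the maximal b = (l', r') that adds only
    unpaired nucleotides adjacent to a. Positions are 1-based, n is the length."""
    l, r = a
    while l > 1 and (l - 1) not in paired:
        l -= 1
    while r < n and (r + 1) not in paired:
        r += 1
    return (l, r)


def interval_sequence(a_list, b_list):
    """Build a_1, b_1, c_1, ..., a_m', b_m', c_m' with c_w the union of b_1..b_w,
    then drop repeated intervals while preserving their order."""
    raw = []
    c = None
    for a, b in zip(a_list, b_list):
        # Cumulative union, assumed to be one contiguous interval.
        c = b if c is None else (min(c[0], b[0]), max(c[1], b[1]))
        raw.extend([a, b, c])
    result = []
    for interval in raw:
        if interval not in result:
            result.append(interval)
    return result


# Toy check of the extension step: positions 2, 4, 8, 11 are paired, n = 12.
b_toy = extend_to_unpaired((4, 8), {2, 4, 8, 11}, 12)

# Assumed loop intervals for the example of Figure 22.
intervals = interval_sequence([(3, 5), (2, 9)], [(3, 6), (1, 10)])
```

Under these assumptions `intervals` consists of $[3, 5]$, $[3, 6]$, $[2, 9]$ and $[1, 10]$.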
Thereby we derive the sequence of intervals $I_1, I_2, \dots, I_m$. In Figure 22 we illustrate how to obtain this interval sequence: here the target decomposes into the loops $T_1$, $T_2$ and we have $I_1 = [3, 5]$, $I_2 = [3, 6]$, $I_3 = [2, 9]$, and $I_4 = [1, 10]$.

Local-Search: Given the sequence of intervals $I_1, I_2, \dots, I_m$, we proceed by performing a local stochastic search on the subsequences $seq|_{I_1}, seq|_{I_2}, \dots, seq|_{I_m}$ (initialized via $seq = seq_{middle}$, and where $s|_{[x,y]} = s_x s_{x+1} \dots s_y$). When we perform the local search on $seq|_{I_w}$, only positions that contribute to the distance to the target, see Figure 9, or positions adjacent to the latter, will be altered. We use the arrays $U_1$, $U_2$ to store the unpaired and paired positions of $T$. In this process, we allow mutations that increase the structure distance by less than five with probability 0.1. The latter parameter is heuristically determined. We iterate this routine until the distance is either zero or some halting criterion is met.

4 Discussion

The main result of this paper is the presentation of the algorithm Inv, freely available at

http://www.combinatorics.cn/cbpc/inv.html

Its input is a 3-noncrossing RNA structure $T$, given in terms of its base pairs $(i_1, i_2)$ (where $i_1 < i_2$). The output of Inv is an RNA sequence $s = (s_1 s_2 \dots s_n)$, where $s_h \in \{A, C, G, U\}$, with the property Cross$(s) = T$, see Figure 23.

The core of Inv is a stochastic local search routine which is based on the fact that each 3-noncrossing RNA structure has a unique loop-decomposition, see Theorem 1 in Section 2.1.
Inv generates “optimal” subsequences and eventually arrives at a global solution for $T$ itself. Inv generalizes existing inverse folding algorithms by considering arbitrary 3-noncrossing canonical pseudoknot structures.

Algorithm 3 Local-Search
Input: $seq_{middle}$
Input: the target $T$
Output: $seq$
Ensure: Cross$(seq) = T$
 1: $seq \leftarrow seq_{middle}$
 2: if Cross$(seq) = T$ then
 3:   return $seq$
 4: end if
 5: decompose $T$ and derive the ordered intervals
 6: $I \leftarrow [I_1, I_2, \dots, I_m]$
 7: for all $I_w$ in $I$ do
 8:   ▷ Phase I: Identify positions.
 9:   $d_{min} = d(\text{Cross}(seq|_{I_w}), T|_{I_w})$ ▷ initialize $d_{min}$
10:
11:   derive $U_1$ via Cross$(seq|_{I_w})$, $T|_{I_w}$
12:   derive $U_2$ via Cross$(seq|_{I_w})$, $T|_{I_w}$
13:
14:   ▷ Phase II: Test and Update.
15:   for all $p$ in $U_1$ do
16:     random $T$-compatible mutate $seq_p$
17:   end for
18:   for all $[p, q]$ in $U_2$ do
19:     random $T$-compatible mutate $seq_p$
20:   end for
21:
22:   $E \leftarrow \emptyset$
23:   for all $p \in U_1, U_2$ do
24:
25:
26:     $d \leftarrow d(T, \text{Cross}(seq_p))$
27:     if $d < d_{min}$ then
28:       $d_{min} \leftarrow d$, $seq \leftarrow seq_p$
29:       goto Phase I
30:     else if $d_{min} < d < d_{min} + 5$ then
31:       goto Phase I with the probability 0.1
32:     end if
33:     if $d = d_{min}$ then
34:       $E \leftarrow E \cup \{seq_p\}$
35:     end if
36:   end for
37:   $seq \leftarrow e_0 \in E$, where $e_0$ has the lowest mfe in $E$
38:   if Phase I run less than $10n$ times then
39:     goto Phase I
40:   end if
41: end for
42: return $seq$

Figure 23: UTR pseudoknot of bovine coronavirus [39]: its diagram representation and three sequences of its neutral network as constructed by Inv.
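The acceptance rule in Phase II of Algorithm 3 (lines 26 to 35) can be sketched as a small decision function. This is an illustrative reading of the pseudocode, not the Inv source: the name `accept` and its parameters `tolerance` and `p` are assumptions introduced here, with defaults taken from the values 5 and 0.1 in the text, and a candidate accepted with probability 0.1 is assumed to replace the current sequence.

```python
import random


def accept(d, d_min, rng, tolerance=5, p=0.1):
    """Decide whether a mutated candidate with structure distance d replaces
    the current sequence whose distance is d_min.

    - d < d_min: always accept (strict improvement).
    - d_min < d < d_min + tolerance: accept with probability p.
    - otherwise (including d == d_min): do not accept here; in Algorithm 3
      candidates with d == d_min are collected in the set E instead.
    """
    if d < d_min:
        return True
    if d_min < d < d_min + tolerance:
        return rng.random() < p
    return False


rng = random.Random(0)
# Empirical acceptance frequency for a slightly worse candidate.
accepted = sum(accept(7, 5, rng) for _ in range(10000))
```

Allowing occasional uphill moves of this kind is a common way for a stochastic local search to escape local minima of the distance function; the frequency `accepted / 10000` should be close to 0.1.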
Conceptually, Inv differs from INFO-RNA in how the start sequence is generated and in the particulars of the local search itself.

As discussed in the introduction, an argument is required as to why the inverse folding of pseudoknot RNA structures works. While folding maps into RNA secondary structures are well understood, the generalization to 3-noncrossing RNA structures is nontrivial. However, the combinatorics of RNA pseudoknot structures [28, 29, 40] implies the existence of large neutral networks, i.e. networks composed of sequences that all fold into a specific pseudoknot structure. Therefore, the fact that it is indeed possible to generate via Inv sequences contained in the neutral networks of targets against competing pseudoknot configurations, see Figure 23 and Figure 24, confirms the predictions of [32].

An interesting class are the 3-noncrossing nonplanar pseudoknot structures. A nonplanar pseudoknot structure is a 3-noncrossing structure which is not a bi-secondary structure in the sense of Stadler [31]. That is, it cannot be represented by noncrossing arcs using the upper and lower half planes. Since DP-folding paradigms for pseudoknot folding are based on gap-matrices [16], the minimal class of “missed” structures (see footnote 5) consists of exactly these nonplanar, 3-noncrossing structures. In Figure 25 we showcase a nonplanar RNA pseudoknot structure and 3 sequences of its neutral network, generated by Inv.

Figure 24: The pseudoknot PKI of the internal ribosomal entry site (IRES) region [41]: its diagram representation and three sequences of its neutral network as constructed by Inv.
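The difference between 3-noncrossing and bi-secondary structures can be made concrete with a small check. The sketch below assumes structures are given as lists of arcs $(i, j)$ with $i < j$ and pairwise distinct endpoints; it uses the standard facts that a structure is $k$-noncrossing if it contains no $k$ mutually crossing arcs, and that it is bi-secondary exactly when its arcs can be split into two noncrossing sets, i.e. when the crossing graph is 2-colorable. The function names and the five-arc toy example are illustrative assumptions; the example ignores minimum arc length and stack size constraints of canonical structures.

```python
from itertools import combinations


def crosses(a, b):
    """Two arcs (i, j) and (k, l) cross iff i < k < j < l after ordering by start."""
    (i, j), (k, l) = sorted([a, b])
    return i < k < j < l


def is_k_noncrossing(arcs, k=3):
    """True if no k arcs are mutually crossing (brute force over all k-subsets)."""
    return not any(
        all(crosses(a, b) for a, b in combinations(subset, 2))
        for subset in combinations(arcs, k)
    )


def is_bisecondary(arcs):
    """True if the arcs can be drawn in the upper and lower half planes without
    crossings, i.e. if the crossing graph admits a proper 2-coloring."""
    color = {}
    for start in arcs:
        if start in color:
            continue
        color[start] = 0
        stack = [start]
        while stack:
            a = stack.pop()
            for b in arcs:
                if b != a and crosses(a, b):
                    if b not in color:
                        color[b] = 1 - color[a]
                        stack.append(b)
                    elif color[b] == color[a]:
                        return False
    return True


# Toy structure whose crossing graph is a 5-cycle: triangle-free but not bipartite.
nonplanar = [(1, 4), (3, 6), (5, 8), (7, 10), (2, 9)]
```

Here `nonplanar` is 3-noncrossing but not bi-secondary, which is the defining property of the nonplanar class discussed above, whereas a simple H-type arrangement such as `[(1, 3), (2, 4)]` is bi-secondary.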
As for the complexity of Inv, the determining factor is the subroutine Local-Search. Suppose that the target is decomposed into $m$ intervals of lengths $\ell_1, \dots, \ell_m$. For each interval, we may assume that line 2 of Local-Search runs $f_h$ times, and that line 14 is executed $g_h$ times. Since Local-Search will stop (line 4) if $T_{start} = T$ (line 3), the remainder of Local-Search, i.e. lines 7 to 41, runs $(f_h - 1)$ times, each such execution having complexity $O(\ell_h)$. Therefore we arrive at the complexity
\[ \sum_{h=1}^{m} \Bigl( (f_h + g_h)\, c(\ell_h) + (f_h - 1)\, O(\ell_h) \Bigr), \]
where $c(\ell)$ denotes the complexity of Cross. The multiplicities $f_h$ and $g_h$ depend on various factors, such as $start$, the random order of the elements of $U_1$, $U_2$ (see Algorithm 3) and the probability $p$. According to [33], $c(\ell_h)$ is $O(e^{1.146\,\ell_h})$ and accordingly the complexity of Inv is given by
\[ \sum_{h=1}^{m} \Bigl( (f_h + g_h)\, O(e^{1.146\,\ell_h}) \Bigr). \]

In Figure 26 we present the average inverse folding time of several natural RNA structures taken from the PKdatabase [42]. These averages are computed by generating 200 sequences of the target's neutral network. In addition we present in Table 1 the total time for 100 executions of Inv for an additional set of RNA pseudoknot structures.

Figure 25: A nonplanar 3-noncrossing RNA structure together with three sequences realizing it as mfe-structure.

Footnote 5: given the implemented truncations.
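To get a feeling for the bound, one can evaluate the dominant sum $\sum_h (f_h + g_h)\, e^{1.146\,\ell_h}$ numerically. The sketch below is only an arithmetic illustration: the function name `inv_cost_bound`, the interval lengths and the multiplicities are made-up inputs, and the constants hidden in the $O(\cdot)$ notation are ignored.

```python
import math


def inv_cost_bound(lengths, f, g, rate=1.146):
    """Evaluate sum_h (f_h + g_h) * exp(rate * l_h), the dominant term of the
    complexity estimate, up to the constants hidden in the O-notation."""
    return sum((fh + gh) * math.exp(rate * l) for l, fh, gh in zip(lengths, f, g))


# Hypothetical decomposition into two intervals of lengths 10 and 20,
# each with f_h = g_h = 1.
total = inv_cost_bound([10, 20], [1, 1], [1, 1])
longest = inv_cost_bound([20], [1], [1])
share_of_longest = longest / total
```

Because the bound is exponential in the interval length, the longest interval accounts for almost all of the estimated cost in this toy input, which suggests why a decomposition into short intervals matters.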
5 Competing interests

The authors declare that they have no competing interests.

6 Authors' contributions

All authors contributed equally to this paper.

7 Acknowledgments

We are grateful to Fenix W.D. Huang for discussions. Special thanks go to the two anonymous referees, whose thoughtful comments have greatly helped in deriving an improved version of the paper. This work was supported by the 973 Project, the PCSIRT of the Ministry of Education, the Ministry of Science and Technology, and the National Science Foundation of China.

Figure 26: Approximation using 2 cubic splines fitting of mean inverse folding time (seconds) over sequence length (axes: sequence length, time). For $n = 35, \dots, 75$ we choose a natural pseudoknot structure from the PKdatabase and display the average inverse folding time based on sampling 200 sequences of the neutral network of the respective target.

References

1. Westhof E, Jaeger L: RNA pseudoknots. Curr Opin Struct Biol 1992, 2(3):327–333.
2. Loria A, Pan T: Domain structure of the ribozyme from eubacterial ribonuclease P. RNA 1996, 2:551–563.
3. Staple DW, Butcher SE: Pseudoknots: RNA structures with diverse functions. PLoS Biol 2005, 3(6):e213.
4. Konings DA, Gutell RR: A comparison of thermodynamic foldings with comparatively derived structures of 16S and 16S-like rRNAs. RNA 1995, 1:559–574.
5. Tuerk C, MacDougal S, Gold L: RNA pseudoknots that inhibit human immunodeficiency virus type 1 reverse transcriptase. Proc Natl Acad Sci USA 1992, 89(15):6988–6992.
6. Chamorro A, Manko VS, Denisova TE: New exact solution for the exterior gravitational field of a charged spinning mass. Phys Rev D 1991, 44(10):3147–3151.
7. The pseudoknot structure of the glmS ribozyme pseudoknot P1.1 [http://www.ekevanbatenburg.nl/PKBASE/PKB00276.HTML].
8. Lyngsø RB, Pedersen CNS: RNA pseudoknot prediction in energy-based models. J Comput Biol 2000, 7(3–4):409–427.
9. Smith TF, Waterman MS: RNA secondary structure: A complete mathematical analysis. Math Biol 1978, 42:257–266.
10. Waterman MS, Smith TF: Rapid dynamic programming methods for RNA secondary structure. Adv Appl Math 1986, 7(4):455–464.
11. Zuker M, Stiegler P: Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucl Acids Res 1981, 9:133–148.
12. Nussinov B, Jacobson AB: Fast algorithm for predicting the secondary structure of single-stranded RNA. Proc Natl Acad Sci USA 1980, 77(11):6309–6313.
13. Fresco JR, Alberts BM, Doty P: Some molecular details of the secondary structure of ribonucleic acid. Nature 1960, 188:98–101.
14. Jun IT, Uhlenbeck OC, Levine MD: Estimation of Secondary Structure in Ribonucleic Acids. Nature 1971, 230(5293):362–367.
15. Delisi C, Crothers DM: Prediction of RNA secondary structure. Proc Natl Acad Sci USA 1971, 68(11):2682–2685.
16. Rivas E, Eddy SR: A dynamic programming algorithm for RNA structure prediction including pseudoknots. J Mol Biol 1999, 285(5):2053–2068.
17. Dirks RM, Lin M, Winfree E, Pierce NA: Paradigms for computational nucleic acid design. Nucleic Acids Res 2004, 32(4):1392–1403.
18. Reeder J, Giegerich R: Design, implementation and evaluation of a practical pseudoknot folding algorithm based on thermodynamics. BMC Bioinformatics 2004, 5(104):2053–2068.
19. Ren J, Rastegari B, Condon A, Hoos H: HotKnots: Heuristic prediction of RNA secondary structures including pseudoknots. RNA 2005, 15:1494–1504.
20. Huang FWD, Peng WWJ, Reidys CM: Folding 3-noncrossing RNA pseudoknot structures. J Comp Biol 2009, 16(11):1549–75.
21. Borer PN, Dengler B, Tinoco JI, Uhlenbeck OC: Stability of ribonucleic acid doublestranded helices. J Mol Biol 1974, 86(4):843–853.
22. Papanicolaou C, Gouy M, Ninio J: An energy model that predicts the correct folding of both the tRNA and the 5S RNA molecules. Nucleic Acids Res 1984, 12:31–44.
23. Turner DH, Sugimoto N, Freier SM: RNA structure prediction. Ann Rev Biophys Biophys Chem 1988, 17:167–192.
24. Walter AE, Turner DH, Kim J, Lyttle MH, Muller P, Mathews DH, Zuker M: Coaxial stacking of helixes enhances binding of oligoribonucleotides and improves predictions of RNA folding. Proc Natl Acad Sci USA 1994, 91(20):9218–9222.
25. Xia T, SantaLucia JJ, Burkard ME, Kierzek R, Schroeder SJ, Jiao X, Cox C, Turner DH: Thermodynamic parameters for an expanded nearest-neighbor model for formation of RNA duplexes with Watson-Crick base pairs. Biochemistry 1998, 37(42):14719–14735.
26. Waterman MS: Combinatorics of RNA hairpins and cloverleaves. Stud Appl Math 1979, 60:91–96.
27. D Kleitman BR: The number of finite topologies. Proc Amer Math Soc 1970, 25:276–282.
28. Jin EY, Qin J, Reidys CM: Combinatorics of RNA structures with pseudoknots. Bull Math Biol 2008, 70:45–67.
29. Jin EY, Reidys CM: Combinatorial Design of Pseudoknot RNA. Adv Appl Math 2009, 42(2):135–151.
30. Chen WYC, Han HSW, Reidys CM: Random k-noncrossing RNA Structures. Proc Natl Acad Sci USA 2009, 106(52):22061–22066.
31. Stadler PF: RNA Structures with Pseudo-Knots. Bull Math Biol 1999, 61:437–467.
32. Ma G, Reidys CM: Canonical RNA Pseudoknot Structures. J Comput Biol 2008, 15(10):1257–1273.
33. Huang FWD, Reidys CM: Statistics of canonical RNA pseudoknot structures. J Theor Biol 2008, 253(3):570–578.
34. Reidys CM, Stadler PF, Schuster P: Generic properties of combinatory maps: neutral networks of RNA secondary structures. Bull Math Biol 1997, 59(2):339–397.
35. Reidys CM: Local connectivity of neutral networks. Bull Math Biol 2008, 71(2):265–290.
36. Hofacker I, Fontana W, Stadler P, Bonhoeffer L, Tacker M, Schuster P: Fast folding and comparison of RNA secondary structures. Chem Month 1994, 125(2):167–188.
37. Andronescu M, Fejes AP, Hutter F, Hoos HH, AC: A New Algorithm for RNA Secondary Structure Design. J Mol Biol 2004, 336(2):607–624.
38. Busch A, Backofen R: INFO-RNA - a fast approach to inverse RNA folding. Bioinformatics 2006, 22(15):1823–1831.
39. 3'UTR pseudoknot of bovine coronavirus [http://www.ekevanbatenburg.nl/PKBASE/PKB00256.HTML].
40. Jin EY, Reidys CM: Central and local limit theorems for RNA structures. J Theor Biol 2008, 253(3):547–559.
41. Pseudoknot PKI of the internal ribosomal entry site (IRES) region [http://www.ekevanbatenburg.nl/PKBASE/PKB00221.HTML].
42. PseudoBase [http://www.ekevanbatenburg.nl/PKBASE/PKBGETCLS.HTML].
43. The pseudoknot of SELEX-isolated inhibitor (ligand 70.28) of HIV-1 reverse transcriptase [http://www.ekevanbatenburg.nl/PKBASE/PKB00066.HTML].
44. Pseudoknot PK2 of E.coli tmRNA [http://www.ekevanbatenburg.nl/PKBASE/PKB00050.HTML].
45. Pineapple mealy bug wilt associated virus - 2 [http://www.ekevanbatenburg.nl/PKBASE/PKB00270.HTML].

8 Tables

8.1 Table 1

RNA structure     length   trials   total time   success rate
TPK-70.28 [43]    40       100      4m 57.81s    100%
Ec PK2 [44]       59       100      5m 33.28s    100%
PMWaV-2 [45]      62       100      1m 7.12s     100%
tRNA              76       100      5m 2.49s     100%

Table 1: Inverse folding times for 100 executions of Inv for various RNA pseudoknot structures. In all cases all trials successfully generated sequences of the respective neutral networks.
