An Ant-Based Model for Multiple Sequence Alignment

An An t-Based Mo del for Multiple Sequence Alignmen t F r´ ed´ eric Guinand and Y o ann Pign´ e ⋆ LITIS lab oratory , Le Havre Unive rsity , F rance www.litisl ab.eu Abstract. Multiple sequence alignment is a key process in t o day’s biol- ogy , and ﬁn ding a relev an t alignmen t of severa l sequences is muc h more chal lenging than just optimizing some improbable ev aluation functions. Our ap p roac h fo r addressing multiple sequence alig nment fo cuses on the building of structures in a new g raph mo del: the factor graph model. This mod el relies on block-based form ulation of the original problem, form ulation th at seems to be one of th e most suitable w ays for capturing evol utionary aspects of alignmen t. The structures are implicitly built b y a colony of an ts laying down pheromones in the factor graphs, according to relations b etw een blo cks belonging to th e diﬀerent sequences. 1 In tro duction F or y ears, manipulation and study of biolo gical sequence s hav e b een added to the set of common tasks p erfo r med by bio logists in their daily activities. Among the numerous analysis metho ds, mult iple sequence alignment (MSA) is probably one of the most use d. Biolo gical sequences come from actua l liv ing b eings, and the r ole o f MSA consists in exhibiting the similarities and diﬀerences b etw een them. Considering sets of homolog o us sequences, diﬀerences may b e used to assess the ev olutionary distance betw een species in the cont ext of ph ylogeny . The results of this analys is may also be us ed to determine conserv ation of pr otein domains or structures. While most of the time the pro cess is p erformed for aligning a limited num ber of thous ands bp-lo ng s equences, it can also b e used at the g enome level allowing biologists to discover new features that co uld not b e exhibited at a lower level of study [5]. In all c a ses, one of the ma jor diﬃculties is the determination of a biolo gically relev ant alignmen t, p erformed witho ut relying explicitly on evolutionary information like a phylogenetic tree. Among existing a ppr oaches for determining such relev a nt alig nmen ts, o ne o f them r e sts on the notion of blo ck. A blo ck is a set of factors present in several sequences. E a ch factor belong ing to one blo ck is an almost identical substring. It may corre s po nd to a highly conser ved zone from an evolutionary p oint of view. Starting from the set of facto rs for eac h sequence, the problem w e address is the building of blo cks. It consists in c ho osing a nd gathering a lmost ident ical factors ⋆ Authors are alphab etically sorted. The work of Y. Pign´ e is partially sup p orted by F rench Ministry of Researc h and Higher Education. common to several se quences in the most appropr iate wa y , given that one blo ck cannot co n tain more than one fac to r p er sequence, that each factor can b elong to o nly one block and that tw o blo cks ca nno t cro ss each other. F or building such blo cks, we prop ose an appro ach based o n ant colonies. This problem is very close to so me classical optimiza tion issues except that the pro cess does not use an y ev aluation function since it s eems unlikely to ﬁnd a bio logically r e le v an t one. As such, it also diﬀers notably from other works in the domain setting up ant colonies for computing alignments for a se t o f biolo g ical seq uences [6 ,2]. The following section details the prop osed graph model. Sect. 3 go es deep er int o the ant algorithm details. Finally , Sect. 4 studies the b ehavior of the algo- rithm with examples. 2 Mo del There exist many diﬀeren t families of algorithms for deter mining m ultiple s e- quence alignments, dynamic pro gramming, prog ressive or itera tive metho ds, motif-based appro aches... How ever, if the num ber o f methods is imp orta n t, the nu mber of models o n which these methods oper ate is muc h more limited. In- deed, most algorithms use to cons ider n ucleotide sequences either as s trings or as graphs. In any case how ever, the problem is fo r mulated as an optimizatio n problem and an ev aluation function is g iven. Within this pap er, we prop ose an- other approa ch based on a graph of factors , where the factors are sub-sequences present in, at least, t wo s e quences. Ins tead of considering these fac to rs individu- ally , the fo rmulation considers that they interact with each other when they are neighbors in diﬀerent sequence s , such that our factor gr aph may b e understo o d as a factor/pa ttern in teraction net work. Considering suc h a graph, a mult iple sequence alignment corres po nds to a set of s tructures representing highly in ter- acting sets o f factors. The original goal may b e no w expres sed as the detectio n of such structures and we pro p o s e to p erfor m suc h a task with the help of artiﬁcial ants. 2.1 Graph Mo del An alignment is usually display ed sequence by sequence, with the nucleotides or amino acids that co mpo se it. Here the interest is giv en to the factor s that comp ose each sequence. So each sequence of the a lignment is displayed as a list of the factor s it is co mpo sed of. Fig. 1 illustrates such a represe n tation, where sequences are dis play ed as serie s of factors . There exists a relation b etw een factor s (named 1 , 2 and 3 in Fig. 1) as so on as there are almost identical. Indeed, tw o identical factors on diﬀerent sequences may b e aligned. Suc h an alignment aims at cr e ating blo cks. T ogether with factors, these relations can b e represented by a graph G = ( V , E ). The set V of no des represents a ll the factors a ppe aring in the s e quences, a nd edges of E link factors that may be aligned. These graphs ar e called factor gr aphs . A factor gr aph is a complete graph wher e edges linking factors attending on the same se q uence ar e remov ed. Indeed, a given factor f may align with any other Fig. 1. This is a set of three sequences. Common subsequences of these sequences which are rep eated are labele d. After the conv ersion, each sequence o f the a lign- men t is dis played as a list of facto rs. Her e sequence 1 = [1,2 ,3], seq uence 2 = [2,1,3] and sequence 3 = [2,1 ,3 ,3]. Thin lines link fa c to rs tha t may be alig ned together. ident ical facto r f ′ provided f ′ do es not b elong to the s ame sequence. In Fig. 1, thin links betw een factors illustrate the po ssible alignments b et ween them. F rom a gra ph point of view the seq uen tial or der ”se quence by seq uence” has no sense . The alignment pro blem is mo deled as a set o f factor gr aphs . So as to diﬀerentiate the diﬀerent factors, they are given a unique iden tiﬁer. Each factor is assigned a triplet [ x, y , z ] where x is a n iden tiﬁer for the pattern, y is the ident iﬁer of the sequence the factor is lo cated o n and z is the o cc ur rence of this pattern on the given seq ue nce . F or instance, on Fig. 1, the bo ttom r ight fac to r of the third sequence is iden tiﬁed b y the triplet [3 , 3 , 2]. Namely , it is the pattern ”3”, lo cated on the seq uence 3 and it o ccurs for the seco nd time on this sequence, given the sequences are read from left to rig ht . Using that mo del the sequences are no t o rdered a s it is the case when co n- sidering progres sive alignment metho ds. Fig. 2 illustr ates such g raphs according to the r epresentation of Fig. 1. Fig. 2. The alig nment seen in Fig. 1 display ed as a s et of factor gr aphs . F rom each factor gr ap h a s ubs e t of fac to rs ma y be sele cted to create a blo ck. If the block is compo sed of one factor p er sequence, it is a c o mplete blo ck, but if one sequence is missing the blo ck is said partial. Not all blo ck c o nstructions are p ossible since blo cks cr ossing is not allow ed and the se le ction of one blo ck may preven t the construction of a nother one. F or instance, from Fig. 1 one can observe that a blo ck made of factors “1” may b e created. Another blo ck with factor “2 ” may a ls o be created. How ever, b oth blo cks cannot b e present tog e ther in the a lignment. These blocks are s a id inc omp atible . A gr oup of blo cks is said to b e c omp atible if a ll couples of blo cks ar e compatible. Another rela tion b etw een p otential blo cks can b e o bserved in their neigh- bo rho o d. Indeed, a strong relev ance has to be accor de d to p otential blo cks that are clo sed to o ne ano ther and tha t do not cros s. If tw o factor s are neig hbo rs in many sequences, there is a high probability for these factors to b e part of a bigger factor with little diﬀerences. Such r e la tions are tak en in to account within our approach a nd are called friendly r e lations. An example of friendly relatio n in the sample alignment can be observed sequences 2 and 3 betw een fac tors “1” and “ 2”. The factor gr aph s are intended to represent the se a rch spa ce of blo cks accor d- ing to the se t of considered factors. How ev er, they do not capture compatibility constraints and friendly relations b etw een blo cks. F o r that purpo se, we ﬁrst c on- sider G ′ = ( V ′ , E ′ ) dual graph of G = ( V , E ), G b eing the g raph compo sed of the ent ire set of factor gr aphs . The set V ′ of G ′ corres p o nds to the set E of G . Each couple of adjacent edg e s e 1 , e 2 ∈ E cor resp onds to one edge in E ′ . Moreover, in order to represent co mpatibility cons tr aints a nd friendly relations tw o new kinds of edg es hav e to be added to this graph: namely E c for compatibility con- straints a nd E f for friendly r elations. W e call r elation gr aph (Fig. 3) the gra ph G r = ( V ′ , E ′ ∪ E f ∪ E c ). Fig. 3. The relation gra ph G r based on the dual graph of G with additiona l edges corres p o nding to c ompatibility constra ints, repr esented b y thick plain edges , and friendly relations, represented by dashed links. Our approa ch makes use of both graphs. An ts mov e within the factor graphs , but their actions may also pro duce some eﬀects in remote parts of the fac to r graphs, according to relation g raph topo logy a s e xplained in Sec t. 3. 2.2 An ts for Multiple Sequence Ali gnment An t bas ed appro aches hav e shown their eﬃciency for sing le or multiple criteria optimization pro blems. The cen tral metho dolog y b eing known as An t Co lo ny Optimization [3]. An ts in these algor ithms usually evolv e in a discr ete space mo deled b y a graph. This graph r epresents the sea rch spa ce of the consider ed problem. An ts collectively build and maintain a solution. Actually , this construc- tion takes place thanks to pheromone tr a ils laid down the gra ph. The sear ch is led b o th b y lo cal information in the gra ph and by the glo bal ev aluation of the pro duced solutions. The model issued in the previous s ection propo sed a g raph model that raises lo cal conﬂicts and attractions that may exist in the neighborho o d of the fac- tors in the alignmen t. F o r this, the mo del may ha ndle the lo cal s earch needed by classical ACOs. Howev er, providing a glo bal ev aluation function for this problem is unlik ely . Indeed, deﬁning a r elev an t ev alua tion function for MSA is in itself a problem s ince the ev aluation should be aw are of the evolutionary histor y of the underlying sp ecies, whic h is part of the MSA pr oblem. Molecular biolog ists themselves do not all ag r ee whether o r not one given alig nment is a go o d one. Popular ev aluation functions like the classica l sum of p airs [1] ar e still debated. As a consequence, instead of focus ing on suc h a function, our approach co n- centrates on building str uctures in the factor graphs acco rding to the relation graph. Thes e structures corresp ond to compatible blo cks. The building o f blo cks is made b y ants which b ehavior is dir ectly constrained b y the pheromo nes they laid down in the graph and indirectly b y the re la tion graph since this g raph has a crucial impact on pher omone depo s it lo ca tion. The globa l pro cess can be further reﬁned by taking in to a ccount the size of the selected facto rs, as well as the num ber of n ucleotides or amino acids lo ca ted betw een the factors of t wo neig h b or blo cks. These num bers a r e calle d r ela tive distanc es betw een factors in the sequel. Thus, the factor graphs carry lo ca l in- formation necessary to the a nt sy stem a nd acts like the environment for an ts. Communication via the environment also known as stigmerg ic communication [4] takes place in that g raph; pher omones trails are laid down on the edges a c - cording to an ts mo ve, but also according to the relatio n gr aph. A solution of the original problem is obtained by listing the set of compatible blo cks tha t hav e bee n bring to the fore by a n ts and r evealed b y phero mones. 3 Algorithm The prop osed ant-based system do es not ev aluate the pro duce d solutions, in this wa y it is not an ACO. How ever, the lo cal search pr o cess remains widely inspired from ACOs. The gener a l sc heme of the behavior of the a nt based system follows these rules: – An ts p erform walks into the factor gra phs. – During these walks each a nt lay co ns tant quantities of pheromones down on the edges they cross. – This dep osit entails a change in pheromone quant ities o f some remote edges according to the r elation graph G r as describ ed in Sect. 2 . – An ts are attracted b y the pheromone trails alr eady laid do wn in the en vi- ronment. Finally a s olution to the problem is a set of the most pher o mone lo aded edges of the factor gr aph that are fre e from conﬂicts. In the following section pheromone management is more formally detailled. 3.1 Pheromone T rails Let τ ij be the quantit y o f pher omone present o n edge ( i, j ) of graph G which links no des i a nd j . If τ ij ( t ) is the quant ity o f pheromone pre s ent on the edge ( i, j ) a t time t , then ∆τ ij is the quantit y of pheromo ne to b e added to the total quantit y on the edge at the current step (time t + 1). So : τ ij ( t + 1) = (1 − ρ ) .τ ij ( t ) + ∆τ ij (1) Note that the initial amount of pheromone in the gra ph is close to zero and that ρ repres en ts the ev apora tio n rate of the pher omones. Indeed, the modeling of the ev ap oration (like natural pheromones) is useful b ecause it makes it p o ssible to co ntrol the impo rtance o f the pro duced eﬀect. In practice, the control of this ev apora tion makes it p o ssible to limit the risks of pre ma ture conv erg ence o f the pro cess. The q uantit y of phero mone ∆τ ij added on the edge ( i, j ) is the sum of the pheromone dep osited b y all the ants cro ssing the e dg e ( i, j ) with the new step of time. The volume of pheromo ne depos ited on each pass age of an ant is a constant v alue Q . If m ants use the edge ( i, j ) during the current step, then: ∆τ ij = mQ (2) 3.2 Constrain ts and F eedbac k Lo ops Generally spea king, feedba ck loo ps rule self-org anized systems. Positive feedback lo ops incr ease the system tendencies while nega tiv e feedbac k loo ps prevent the system fro m co n tinually inc r easing or decreas ing to critical limits. In this ca se, pheromone trails pla y the role of p ositive feedba ck lo op attracting an ts that will depo sit pheromo ne s o n the same paths getting them more desira ble. F riendly relationship found in the G r graph may also play a p ositive feedback role. Indeed, more pheromones ar e laid do wn ar ound friendly linked blo cks. On the other side, conﬂicts betw een blocks act as negative feedback lo ops laying down ’negative’ quantities of pheromone. Let cons ider an edge ( i, j ) that has conﬂict links w ith s ome other edges. During the current step, the c a nt s that cr oss edg es in conﬂict with edge ( i, j ) deﬁne the amount of negative pheromones to assign to ( i , j ): ∆τ conf lict ij = cQ . Besides, a n edge ( i, j ) with some friendly rela tions will be assig ne d p os itiv e pheromones according to the f ants that cro ss edges in frie ndly relation with ( i, j ) during the current step : ∆τ f r iendly ij = f Q . Finally , the overall quantit y of pheromone o n one given edge ( i, j ) for the current step deﬁned in equation 2 is mo diﬁed as follow: ∆τ ij = ∆τ ij + ∆τ f r iendly i,j − ∆τ conf lict i,j (3) 3.3 T ransition Rule When an ant is o n vertex i , the c hoice of the next v ertex to b e visited m ust b e carried out according to a deﬁned proba bilit y rule. According to the metho d classically prop osed in a n t algo rithms, the choice of a next v ertex to visit is inﬂuenced b y 2 terms. The ﬁrst is a lo cal heur istic based on lo cal information, namely the r elative distanc e b etw een the factors ( d ). The second term is repr esentativ e of the stigmergic b ehavior of the system. It is the quantit y of phero mo ne dep osited ( τ ). R emark 1. Interaction b etw een p ositive and nega tiv e pheromones can lead on some edge s to an ov erall negative v alue of pher omone. Thus, pheromones quan- tities need normalizatio n b e fore the random dra w is made. Let max b e an upp er bo und v alue computed fro m the large st quantit y of pheromones on the neigh- bo rho o d of the curr e n t vertex. The qua nt ity of pheromone τ ij betw een edge i and j is normalize d as max − τ ij . The function N ( i ) returns the list of vertices a djacent to i (its neighbor s). The next vertex w ill be chosen in this lis t. The pr obability for an ant being on vertex i , to g o on j ( j b elo ng ing to ( N ( i )) is: P ( ij ) = [ 1 max − τ ij α . 1 d ij β ] P s ∈ N ( i ) [ 1 max − τ is α . 1 d is β ] (4) The par ameters α and β make it p os sible to balance the impact o f pher omone trails relatively to the r elative distanc es . 4 Analysis First exp eriments on rea l sequences ha ve bee n per formed. Fig. 4 compares some results obtained by the prop osed structural appr oach and the alignment provided by ClustalW. This sample shows that the visible aligned blo cks can be regained in the results g iven by ClustalW. How ever, results are still to b e ev aluated. Indeed, it may happe n tha t s ome blo cks a re found b y our metho d while they are not detected by C lus talW. Anyw ay , we are aware of the necessity o f additiona l analyses and discussions with bio logists in o rder to v alidate this approach. Clu stalW Str uctural Appr oach Fig. 4. Alignment of 3 s e q uences of TP53 regulated inhibitor of ap optosis 1 for Homo s a piens, Bos ta urus and Mus musculus. Comparison of the alig nmen t given by ClustalW and o ur structural approa ch. The r e d upp erc a se n ucleotides on the s tr uctural approach are the blo cks. 5 Conclusion In this pa per was prop osed a diﬀeren t a pproach, for the pro blem o f multiple sequence alignment. The key idea was to consider a problem of building and maintaining a structure in a set of bio logical sequences instea d of consider ing a n optimization problem. P reliminary results show that the structures built by our ant-based algorithm can be informally compared, on a pattern bas is, with the results given by ClustalW. The outlo ok fo r the pro ject is now to prov e the eﬃciency and the re le v ance of the metho d, in particular , an important ch unk of future work will concern the compariso n of the diﬀerences betw een conserved regions provided by ClustalW and other well-known m ultiple s equence alig nment methods and our approach. The s econd p ersp ective fo cuses on the per formance of the metho d. Indeed, the wa y blo cks ar e built and intermediate results allow us to consider a kind of div ide- and-conquer pa rallel v ersio n of this to ol. Most recent results and a dv ances will be ma de av ailable on www. litisl ab.eu . References 1. S . F. Altsch ul. Gap costs for m ultiple sequ ence alignmen t. Journal of The or et ic al Biolo gy , 138:29 7–309, 1989. 2. Y . Chen, Y. Pan, J. Chen, W. Liu, and L. Chen. Partitioned optimization algorithms for multiple sequence alignmen t. In Pr o c e e dings of the 20th International Confer- enc e on A dvanc e d Inf ormation Networking and Applic ations - V olume 2 (AINA’06) , vol ume 2, pages 618–622. IEEE Computer So ciety , 2006. 3. M. Dorigo and G. D i Caro. New Ide as in Optim ization. D. Corne and M. Dorigo and F. Glover e ds , chapter The Ant Colon y Op timization Meta-Heuristic, pages 11–32. McGra w-Hill, 1997. 4. P .-P . Grass´ e. La reconstruction du nid et les co ordinations inter-individuelles chez belicositermes natalensis et cubitermes s.p. la th´ eorie de la stigmergie : es- sai d’interpr ´ etation du comp ortement des termites constructeurs. Inse ct es so ciaux , 6:41–80 , 1959. 5. S . Kurtz, A. Phillippy , A. L. Delcher, M. Smo ot, M. Shumw a y , C. Antonescu, and S. L. Salzberg. V ersatil e and op en softw are for comparing large genomes. Genome Biolo gy , 5(2):R12, 2004. 6. J. D. Moss and C. G. Johnson. An ant colon y algorithm for multiple sequence alignmen t in bioinformatics. In David W. P earson, Nigel C. Steele, and Rudolf F. Albrech t, editors, A rtiﬁcial Neur al Networks and Genetic Algorithms , pages 182– 186. Springer, 2003.

An Ant-Based Model for Multiple Sequence Alignment

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment