Detecting phylogenetic relations out from sparse context trees

DETECTING PHYLOGENETIC RELA TIONS OUT FR OM SP ARSE CONTEXT TREES FLORENCIA LEONARDI, SER GIO R. MA TIOLI, HUGO A. ARMELIN, AND ANTONIO GAL VES Abstract. The goal of this pap er is to study the similarit y b etw een sequences using a distance b et ween the c ontext trees asso ciated to the sequences. These trees are deﬁned in the framework of Sp arse Pr ob abilistic Suﬃx T r e es (SPST), and can b e estimated using the SPST algorithm. W e implement the Phyl-SPST pac k age to compute the distance b etw een the sparse context trees estimated with the SPST algorithm. The distance takes in to account the structure of the trees, and indirectly the transition probabilities. W e apply this approach to reconstruct a ph ylogenetic tree of protein sequences in the globin family of v ertebrates. W e compare this tree with the one obtained using the well-kno wn P AM distance. 1. Introduction In this w ork we prop ose to use the framew ork of Sparse Probabilistic Suﬃx T rees (SPST) to analyze the similarit y b etw een sequences and to infer the evolution of protein families. SPST was ﬁrst in tro duced in Leonardi and Galv es (2005) as a generalization of the PST algorithm, prop osed in Ron et al. (1996). SPST has sho wn to b e useful in protein mo deling and classiﬁcation, p erforming b etter than the PST algorithm (Leonardi; 2006). The model that inspired the SPST algorithm is a generalization of V ariable Length Mark o v Chains (VLMC), in tro duced b y Rissanen (1983), and tak es in to accoun t the prop ert y of sparseness of the sequences. Giv en a sequence, SPST estimates a set of sp arse c ontexts . A sparse context is a short sequence of sub-sets of symbols (in a given alfab et) that are relev ant to predict an y sym b ol in the sequence, giv en that the preceding sym b ols b elong to the sub- sets of the con text. The SPST algorithm also estimates the transition probabilities asso ciated to eac h con text. The transition probabilities give the probability of each sym b ol conditioned on the fact that the preceding symbols b elong to the sparse con text. An interesting prop ert y of the set of sparse contexts is that it induces a partition of the set of all p ossible sequences and can b e represen ted as a tree. W e use this partition prop erty to deﬁne a distance b etw een context trees. This distance can b e used to measure the similarit y b etw een protein sequences. 1 DETECTING PHYLOGENETIC RELA TIONS OUT FROM SP ARSE CONTEXT TREES 2 T o our kno wledge it has not b een prop osed yet in the literature a metho d for sequence comparison using the information contained in the architecture of the con- text trees asso ciated to the sequences. The more closely related approaches proposed un til date are those that mo del the sequences as ﬁrst order Marko v c hains and use a statistical measure to infer the similarity b etw een them (W u et al.; 2001; Pham and Zuegg; 2004). The more remark able diﬀerence b etw een these approac hes and our is that we do not use directly the estimated probabilities of the mo del. Instead of that w e use the context tree architecture, that is trivial in ﬁrst order Marko v chains. W e sho w here that the con text tree arc hitecture can ha v e imp ortan t structural informa- tion that may b e useful to measure the similarit y b etw een sequences. The pap er is organized as follows. In Section 2 we review some deﬁnitions in the framew ork of SPST. In Section 3 w e introduce the distance b et w een sparse trees. In Section 4 w e present the results obtained for the globin protein family of v ertebrates and ﬁnally in Section 5 we discuss some asp ects of our metho d. 2. Sp arse Context Trees Let A b e a ﬁnite alphab et (for example, the set of tw en t y amino acids) of size | A | . W e will denote by P A the set of parts of A . That is, P A = { v : v ⊂ A } . The elements in P j A will b e denoted by w = ( w − j , . . . , w − 1 ). On the other hand, we will denote by P ∗ A the set of all ﬁnite sequences of elemen ts in P A ; that is, P ∗ A = ∞ [ j =1 P j A . Deﬁnition 2.1. Let ( X t ) t ∈ N b e a sto chastic pro cess taking v alues on the ﬁnite alphab et A . W e will say that the pro cess ( X t ) t ∈ N is a sp arse sto chastic chain if there exists a set τ ⊂ P ∗ A suc h that: (1) F or an y sequence x 0 , . . . , x n satisfying P [ X 0 = x 0 , . . . , X n − 1 = x n − 1 ] > 0 , there exists an element ( w − k , . . . , w − 1 ) ∈ τ such that P [ X n = x n | X n − 1 = x n − 1 , . . . , X 0 = x 0 ] = P [ X n = x n | X n − 1 ∈ w − 1 , . . . , X n − k ∈ w − k ] . (2.2) (2) If ( w − k , . . . , w − 1 ) and ( ¯ w − ¯ k , . . . , ¯ w − 1 ) belong to τ and there exists j such that w − i ∩ ¯ w − i 6 = ∅ for i = 1 , . . . , j , then w − i = ¯ w − i for i = 1 , . . . , j . DETECTING PHYLOGENETIC RELA TIONS OUT FROM SP ARSE CONTEXT TREES 3 (a) {a,c} {a,b,c} {d} ROOT {b,d} X n − 2 6 X n − 1 6 X n (b) ROOT {a} {c} {bd} {a,b,c} {d} {a} {b,c,d} (c) ROOT {a,b,c} {d} {a,b,c} {d} {a} {b,c,d} {c} {a} {b,d} Figure 1. Examples of sparse trees o v er the alphab et A = { a, b, c, d } . (a) The index of the v ariables grows in the direction from the lea v es to the ro ot. In this case, the set of sparse contexts is { ( { a, b, c } , { a, c } ) , ( { d } , { a, c } ) , ( { b, d } ) } . (c) Maximum b et w een the trees in (a) and (b). (3) The set τ is the minimum that satisﬁes 1. and 2. That is; if ¯ τ satisﬁes 1. and 2. then, for an y ( ¯ w − ¯ k , . . . , ¯ w − 1 ) ∈ ¯ τ there exists ( w − k , . . . , w − 1 ) ∈ τ suc h that ¯ k ≥ k and ¯ w j ⊂ w j for all j = 1 , . . . , k . Eac h sequence ( w − k , . . . , w − 1 ) ∈ τ is called sp arse c ontext and the set τ is called sp arse c ontext tr e e . This name is justiﬁed b ecause the set of sparse contexts can b e represen ted as a ro oted tree. In this tree, each context w = ( w − k , . . . , w − 1 ) is represen ted by a complete branch, in whic h the ﬁrst no de on top is w − 1 and so on un til the last element w − k whic h is represen ted by the terminal no de of the branch (Fig. 1). Recen tly , it was prop osed an algorithm to estimate the set of sparse contexts and the transition probabilities giv en by 2.2 (Leonardi and Galves; 2005; Leonardi; 2006). This algorithm represen ts in ternally the set of sparse con texts as a tree, as described ab o ve. W e b elieve that this tree contains imp ortant structural information that can b e used to measure the similarit y betw een sequences. Our goal in this pap er is to DETECTING PHYLOGENETIC RELA TIONS OUT FROM SP ARSE CONTEXT TREES 4 sho w some results concerning this conjecture. With this aim we prop ose to use a distance b etw een sparse con text trees to measure the relatedness b etw een symbolic sequences. This distance is deﬁned in the next section. 3. A metric sp ace of sp arse trees Giv en a sparse context w = ( w − k , . . . , w − 1 ) w e denote by l ( w ) its length, that is l ( w ) = k . W e use the notation s ( w ) for the pro duct of the cardinals of the w i ’s, that is s ( w ) = l ( w ) Y i =1 | w i | , where | w i | is the num b er of symbols in w i . Giv en tw o sparse contexts w = ( w − k , . . . , w − 1 ) and ¯ w = ( ¯ w − ¯ k , . . . , ¯ w − 1 ) we deﬁne the intersection b etw een w and ¯ w (assuming without loss of generalit y that k ≥ ¯ k ) b y w ∩ ¯ w = ( w − k , . . . , w − ( ¯ k +1) , w − ¯ k ∩ ¯ w − ¯ k , . . . , w − 1 ∩ ¯ w − 1 ), if w i ∩ ¯ w i 6 = ∅ for all i = 1 , . . . , ¯ k . In the case w i ∩ ¯ w i = ∅ for some i = 1 , . . . , ¯ k we deﬁne w ∩ ¯ w = ∅ . Giv en t wo sparse trees τ = { w 1 , . . . , w n } and ¯ τ = { ¯ w 1 , . . . , ¯ w m } , we deﬁne the maxim um b etw een τ and ¯ τ b y τ ∨ ¯ τ = { w i ∩ ¯ w j | w i ∩ ¯ w j 6 = ∅ ; i = 1 , . . . , n ; j = 1 , . . . , m } . The maxim um b etw een the trees of Figure 1(a)-(b) can b e seen in Figure 1(c). Before deﬁning the distance b etw een sparse context trees we in tro duce the notion of β -entrop y of a tree τ . F ollowing Simo vici and Szymon (2006) we deﬁne, for all β > 0, H β ( τ ) = 1 2 1 − β − 1  X w ∈ τ  s ( w ) | A | − l ( w )  β − 1  , if β 6 = 1, and H β ( τ ) = − X w ∈ τ s ( w ) | A | − l ( w ) · log 2  s ( w ) | A | − l ( w )  , if β = 1. Then, giv en tw o sparse trees, τ and ¯ τ , w e deﬁne the β -distance b et w een them as d β ( τ , ¯ τ ) = 2 H β ( τ ∧ ¯ τ ) − H β ( τ ) − H β ( ¯ τ ) . (3.1) It can b e seen that d β ( · , · ) deﬁnes a distance ov er the set of all con text trees. The pro of of this assertion can b e found in Simovici and Szymon (2006). DETECTING PHYLOGENETIC RELA TIONS OUT FROM SP ARSE CONTEXT TREES 5 0.0 0.5 1.0 1.5 2.0 2.5 3.0 0.5 1.0 1.5 2.0 PAM vs SPST PAM distance SPST distance Figure 2. Comparison of the SPST and P AM distance matrices. 4. Resul ts W e implemen ted an algorithm co ded in C, called Ph yl-SPST, to calculate dis- tances b etw een context trees, as deﬁned by (3.1). The source co de and compiled v ersions for Mac OS X, Linux/Unix and Windows can b e do wnloaded from the site http://www.ime.usp.br/numec/softwares/phyl-spst/ . W e applied the Phyl-SPST pack age to study the similarit y b etw een the protein sequences of the globin family of vertebrates. The 41 sequences used in this analysis w ere obtained from the SCOP database (Andreev a et al.; 2004) and can b e found in the supplementary material. The program estimated, for each sequence in this set, a sparse context tree. Then it computed the distance matrix using the β -distance deﬁned by (3.1). In what follo ws we call this distance the SPST distance. In order to compare our metho d with an alignment-based distance we used the structure based alignmen t of the 41 globin sequences of vertebrates present in the P ALI database (Go wri et al.; 2003) (alignment a v ailable in supplementary material). Then, w e applied the algorithm PR OTDIST of the Ph ylip3.65 pac k age (F elsenstein; 2004), with the Dayhoﬀ P AM matrix option, to compute the distance matrix. DETECTING PHYLOGENETIC RELA TIONS OUT FROM SP ARSE CONTEXT TREES 6 (a) (b) Figure 3. Ph ylogenetic trees made with Neighbor Joining clustering algorhitm on SPST distances (a) and on P AM distances (b) DETECTING PHYLOGENETIC RELA TIONS OUT FROM SP ARSE CONTEXT TREES 7 When the P AM and SPST distances are plotted against eac h other (Fig. 2) a non linear relation is clearly observed. With each distance matrix we reconstructed a phylogenetic tree using the NEIGHBOR and DRA W GRAM algorithms of the Ph ylip3.65 pack age. These phylogenetic trees can b e seen in Figure 3. In b oth trees the lamprey globin was used as outgroup. 5. Discussion The dataset we used to verify the p oten tial use of the SPST distances on phylo- genetic reconstruction is a vertebrate subset of the globin gene family . This family is one of the ﬁrst protein families that w as characterized (Dayhoﬀ; 1972) and is, p erhaps, the most known to date (Vinogrado v et al.; 2006). Besides, the vertebrate ph ylogen y is also well studied and is ground in relatively abundan t paleontological, morphological, molecular, and physiological analyses (Cotton and Page; 2002). The phylogenetic tree sho wn in Fig. 3(a) prov es that in fact the con text trees inferred from sym b olic sequences (in this case, protein sequences) can oﬀer imp or- tan t ev olutionary information of the sequences. This constitutes an original and v ery promise asp ect of the mo deling of sequences b y v ariable memory sto chastic pro cesses, and it needs to b e studied in more details. The phylogenetic analysis here p erformed also reﬂects the ov erall b ehavior of the SPST distance. The tree pro duced with the SPST present larger branches in the most inclusive sequences, and shorter branc hes in the most basal sequences. With resp ect to the tree top ology , the main diﬀerences b etw een them is the placement of the m y oglobin cluster, that is closer to the b eta c hain of hemoglobin in the SPST tree and, in the P AM tree, it is outside of the hemoglobin c hain. Other remark able diﬀerence is the placemen t of the red tail deer ( Odo c oileus vir ginianus ) outside the cluster that contains the mammals, a reptile ( Ge o chelone gigante a ), and a bird ( Gal lus gal lus ) in the b eta c hain cluster of the SPST tree. Although there are minor misplacemen ts in the tree based on P AM distances with resp ect to the vertebrate and globin traditional ph ylogenies, it is sup erior in reconstructing the phylogen y than with the use of SPST distances. The relationship b et w een the SPST distance and the classical P AM distance of the globin family of v ertebrates sho ws a plateau behavior. The short P AM distances yields larger SPST distances, and the opp osite occurs when distances are longer. This ma y b e caused by the b ounded nature of the context trees and by the sp eciﬁc form of the distance we prop ose. Therefore, this analysis shows that small diﬀerences in sequences causes enough changes in the con text trees to increase the SPST dis- tance b et w een them. It remains y et as an op en problem the characterization of the c hanges pro duced in the context trees by stationary mo diﬁcations of the sequences DETECTING PHYLOGENETIC RELA TIONS OUT FROM SP ARSE CONTEXT TREES 8 as mutations, insertions or deletions. W e think that these c haracterizations could help to improv e the results shown here. On the other hand, it is also imp ortan t to deﬁne and test other distances ov er the set of trees to study their sp eciﬁc b ehaviors and compare them to the one prop osed here. A ckno wledgments This work is part of PR ONEX/F APESP’s pro ject Sto chastic b ehavior, critic al phenomena and rhythmic p attern identiﬁc ation in natur al languages (grant n um- b er 03/09930-9) and CNPq’s pro ject Sto chastic mo deling of sp e e ch (grant num ber 475177/2004-5). During the preparation of this pap er F.G.L w as supp orted by a CAPES grant and b y a F APESP fellowship (pro cess 06/56980-0). The authors A.G., H.A.A., and S.R.M. would like to thank the fello wship gran ts receiv ed from CNPq. The authors w ould like to thank Dr. Eleonora T ra jano for commen ts on the v ertebrates phylogenies. References Andreev a, A., How orth, D., Brenner, S. E., Hubbard, T. J. P ., Chothia, C. and Murzin, A. G. (2004). SCOP database in 2004: reﬁnements in tegrate structure and sequence family data, Nucl. A cids R es. 32 (suppl 1): D226–229. Cotton, J. A. and Page, R. (2002). Going n uclear: gene family evolution and v ertebrate phylogen y reconciled, Pr o c. R. So c. L ond. B 269 : 1555–1561. Da yhoﬀ, M. (1972). A tlas of protein sequence and structure. National Biomedical Researc h F oundation, W ashington. F elsenstein, J. (2004). Phylip (ph ylogeny inference pack age) v ersion 3.6. Distributed b y the author. Department of Genome Sciences, Univ ersity of W ashington, Seat- tle. Go wri, V. S., P andit, S. B., Karthik, P . S., Sriniv asan, N. and Bala ji, S. (2003). In tegration of related sequences with protein three-dimensional structural families in an up dated v ersion of P ALI database, Nucleic A cids R es. 31 : 486–488. Leonardi, F. (2006). A generalization of the PST algorithm: mo deling the sparse nature of protein sequences, Bioinformatics 22 (11): 1302–1307. Leonardi, F. and Galves, A. (2005). Sequence motif iden tiﬁcation and protein family classiﬁcation using probabilistic trees, A dvanc es in Bioinformatics and Computa- tional Biolo gy. Pr o c. BSB 2005. , V ol. LNBI 3594, pp. 190–193. Pham, T. and Zuegg, J. (2004). A probabilistic measure for alignment-free sequence comparison, Bioinformatics 20 (18): 3455–3461. Rissanen, J. (1983). A univ ersal data compression system, IEEE T r ans. Inform. The ory 29 (5): 656–664. DETECTING PHYLOGENETIC RELA TIONS OUT FROM SP ARSE CONTEXT TREES 9 Ron, D., Singer, Y. and Tish b y , N. (1996). The p ow er of amnesia: Learning proba- bilistic automata with v ariable memory length, Machine L e arning 25 (2-3): 117– 149. Simo vici, D. and Szymon, J. (2006). A new metric splitting criterion for decision trees, Journal of Par al lel, Emer ging and Distribute d Computing 21 (4): 239–256. Vinogrado v, S. N., Ho ogewijs, D., Bailly , X., Arredondo-Peter, R., Gough, J., Dewilde, S., Mo ens, L. and V anﬂeteren, J. R. (2006). A phylogenomic proﬁle of globins, BMC Evolutionary Biolo gy 6 : 31–47. W u, T. J., Hsieh, Y. C. and Li, L. A. (2001). Statistical measures of dna dissimilarit y under mark ov chain mo dels of base comp osition, Biometrics 57 : 441–448. Instituto de Ma tem ´ atica e Est a t ´ ıstica, Universidade de S ˜ ao P aulo., Rua do Ma t ˜ ao 1010 CEP 05508-090, S ˜ ao P aulo, SP, Brazil. E-mail addr ess : leonardi@ime.usp.br Instituto de Bioci ˆ encias, Universidade de S ˜ ao P aulo., Rua do Ma t ˜ ao, tra v. 14, n 321 CEP 05508-900, S ˜ ao P aulo, SP, Brazil., E-mail addr ess : srmatiol@ib.usp.br Instituto de Qu ´ ımica, Universidade de S ˜ ao P aulo., A v. Prof. Lineu Prestes, 748 CEP 05508-900, S ˜ ao P aulo, SP, Brazil. E-mail addr ess : haarmeli@iq.usp.br Instituto de Ma tem ´ atica e Est a t ´ ıstica, Universidade de S ˜ ao P aulo., Rua do Ma t ˜ ao 1010 CEP 05508-090, S ˜ ao P aulo, SP, Brazil., E-mail addr ess : galves@ime.usp.br

Detecting phylogenetic relations out from sparse context trees

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment