A Simple Characterization of the Minimal Obstruction Sets for Three-State Perfect Phylogenies

Lam, Gusfield, and Sridhar (2009) showed that a set of three-state characters has a perfect phylogeny if and only if every subset of three characters has a perfect phylogeny. They also gave a complete characterization of the sets of three three-state…

Authors: Brad Shutters, David Fernández-Baca

Brad Shutters, David Fernández-Baca
Department of Computer Science, Iowa State University
{shutters,fernande}@iastate.edu

Abstract

Lam, Gusfield, and Sridhar (2009) showed that a set of three-state characters has a perfect phylogeny if and only if every subset of three characters has a perfect phylogeny. They also gave a complete characterization of the sets of three three-state characters that do not have a perfect phylogeny. However, it is not clear from their characterization how to find a subset of three characters that does not have a perfect phylogeny without testing all triples of characters. In this note, we build upon their result by giving a simple characterization of when a set of three-state characters does not have a perfect phylogeny that can be inferred from testing all pairs of characters.

1 Introduction

The $k$-state perfect phylogeny problem is one of the classic decision problems in computational biology. The input is an $n$ by $m$ matrix $M$ of integers from the set $\{1, \ldots, k\}$. We call a row of $M$ a taxon (plural taxa), a column of $M$ a character, and a value in column $c$ of $M$ a state of character $c$. A perfect phylogeny for $M$ is an undirected tree $t$ with $n$ leaves, each labeled by a distinct taxon of $M$, in such a way that, for each character $c$ and each pair $i, j$ of states of $c$, the minimal subtree of $t$ containing all the leaves labeled by a taxon with state $i$ for character $c$ is node-disjoint from the minimal subtree of $t$ containing all the leaves labeled by a taxon with state $j$ for character $c$. The $k$-state perfect phylogeny problem is to decide whether $M$ has a perfect phylogeny. If $M$ has a perfect phylogeny, we say that the characters in $M$ are compatible; otherwise they are incompatible.
See [6, 17] for more on the perfect phylogeny problem. See Figure 1 for an example.

    M | a  b  c
    --+--------
    1 | 3  2  1
    2 | 2  1  3
    3 | 2  2  2
    4 | 3  3  2
    5 | 1  1  3
    6 | 2  2  3

Figure 1: Example 3-state perfect phylogeny for input matrix $M$. (The tree drawing is not reproduced here; its leaves are labeled 321, 332, 222, 223, 213, 113.)

If the number of states of each character is unbounded (so $k$ can grow with $n$), then the perfect phylogeny problem is NP-complete [2, 18]. However, if the number of states of each character is fixed, the perfect phylogeny problem is solvable in $O(m^2 n)$ time (in fact, linear time for $k = 2$) [9, 4, 13, 1, 14]. Each of these algorithms can also construct a perfect phylogeny for $M$ if one exists. However, since every subset of a compatible set of characters is itself compatible, if no perfect phylogeny exists for $M$, there must be some minimal subset of the characters of $M$ that does not have a perfect phylogeny. We call such a set a minimal obstruction set for $M$. None of the above-mentioned algorithms output a minimal obstruction set when there is no perfect phylogeny for $M$.

If the characters in $M$ are two-state characters, then $M$ has a perfect phylogeny if and only if the characters in $M$ are pairwise compatible. Hence, a minimal obstruction set for $k = 2$ is of cardinality two [3, 16, 18, 5]. A recent breakthrough by Lam, Gusfield, and Sridhar [15] shows that if the characters in $M$ are three-state characters, then any minimal obstruction set for $M$ has cardinality at most three. It is conjectured that given an input matrix $M$ of $k$-state characters, there exists a function $f(k)$ such that $M$ has a perfect phylogeny if and only if every subset of $f(k)$ characters of $M$ has a perfect phylogeny [7, 12, 8, 16, 9, 15, 11]. From the discussion above, it follows that $f(2) = 2$ and $f(3) = 3$. Recent work of Habib and To [11] shows that $f(4) \geq 5$.
If the characters in $M$ are $k$-state characters and the cardinality of a minimal obstruction set for $M$ is bounded above by $f(k)$, then it is preferable to have a test for the existence of such an obstruction set that does not require testing all subsets of $f(k)$ characters in $M$, and ideally one that can be inferred from testing all pairs of the characters in $M$. Since we can decide if $M$ has a perfect phylogeny in $O(m^2 n)$ time, and construct a perfect phylogeny in such a case, we should hope to also output a minimal obstruction set in $O(m^2 n)$ time when a perfect phylogeny for $M$ does not exist.

Here, we will focus on the three-state perfect phylogeny problem. Hence, we restrict $M$ to be an $n$ by $m$ matrix of integers from the set $\{1, 2, 3\}$. We build upon the work of Lam, Gusfield, and Sridhar [15], who showed that if $M$ does not have a perfect phylogeny, then $M$ has an obstruction set of cardinality at most three. They also gave a complete characterization of the minimal obstruction sets of cardinality three. However, it is not clear from their characterization how to find such an obstruction set without independently testing all triples of characters in $M$, requiring $O(m^3 n)$ time. In this note, we remedy this situation by giving a simple characterization of when a set of three-state characters does not have a perfect phylogeny that can be inferred from testing all pairs of characters in $M$. This leads to an $O(m^2 n)$ time algorithm to find an obstruction set when $M$ does not have a perfect phylogeny. If $M$ does admit a perfect phylogeny, then any of the above-mentioned algorithms can be used to construct a perfect phylogeny for $M$ in $O(m^2 n)$ time.
2 Preliminaries

2.1 Perfect Phylogenies and Partition Intersection Graphs

The partition intersection graph of $M$, denoted $\mathrm{pig}(M)$, is the graph that has a vertex $c_i$ for each character $c$ and each state $i$ of $c$, and an edge between two vertices $c_i$ and $d_j$ precisely if there is a taxon that has both state $i$ for character $c$ and state $j$ for character $d$. Note that there can be no edges between two vertices of the same character. See Figure 2 for an example. In this section we give a brief overview of some known results relating three-state perfect phylogenies to partition intersection graphs.

[Figure 2: Partition intersection graph of the matrix $M$ from Figure 1, drawn on the vertices $a_1, a_2, a_3, b_1, b_2, b_3, c_1, c_2, c_3$.]

A graph $G$ is triangulated if and only if there are no induced chordless cycles of length four or greater. A proper triangulation of $\mathrm{pig}(M)$ is a triangulated supergraph of $\mathrm{pig}(M)$ such that each edge is between vertices of different characters.

Theorem 1 (Buneman [3], Meacham [16], Steel [18]). There is a perfect phylogeny for $M$ if and only if $\mathrm{pig}(M)$ has a proper triangulation.

For a subset $C = \{c_1, \ldots, c_j\}$ of the characters in $M$, we write $M[c_1, \ldots, c_j]$ to denote $M$ restricted to the columns in $C$. We say that $M$ is pairwise compatible if, for every pair $a, b$ of characters in $M$, there is a perfect phylogeny for $M[a, b]$.

Theorem 2 (Estabrook and McMorris [5]). Let $a$ and $b$ be two characters of $M$. Then $M[a, b]$ has a perfect phylogeny if and only if $\mathrm{pig}(M[a, b])$ is acyclic.

Theorem 3 (Lam, Gusfield, and Sridhar [15]). $M$ has a perfect phylogeny if and only if, for every three characters $a, b, c$ in $M$, $M[a, b, c]$ has a perfect phylogeny.
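The definition of the partition intersection graph and the acyclicity test of Theorem 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `pig_edges`, `is_acyclic`, and `pairwise_compatible`, the representation of $M$ as a list of rows, and the 0-based column indices standing for the characters $a, b, c$ are all hypothetical choices, not part of the paper. The matrix is the one shown in Figure 1.

```python
from itertools import combinations

# Matrix M from Figure 1: rows are taxa, columns are the characters a, b, c.
M = [(3, 2, 1), (2, 1, 3), (2, 2, 2), (3, 3, 2), (1, 1, 3), (2, 2, 3)]


def pig_edges(M, c, d):
    """Edge set of pig(M[c, d]) for two distinct column indices (hypothetical helper).

    An edge (i, j) joins vertex c_i to vertex d_j whenever some taxon has
    state i for character c and state j for character d.
    """
    return {(row[c], row[d]) for row in M}


def is_acyclic(edges):
    """Union-find cycle test on the bipartite graph described by `edges`."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for i, j in edges:
        # Tag each endpoint with its side so that c_1 and d_1 stay distinct.
        root_c, root_d = find(("c", i)), find(("d", j))
        if root_c == root_d:
            return False  # this edge would close a cycle
        parent[root_c] = root_d
    return True


def pairwise_compatible(M):
    """Apply the test of Theorem 2 to every pair of columns of M."""
    m = len(M[0])
    return all(
        is_acyclic(pig_edges(M, c, d)) for c, d in combinations(range(m), 2)
    )
```

Building the edge set as a Python set removes duplicate edges contributed by taxa that share the same pair of states, which the union-find test would otherwise misread as a cycle.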
If three of the characters are incompatible, then either they are not pairwise compatible, or, as the following theorem shows, the edge set of their partition intersection graph is a superset (up to renaming of states) of one of a collection of "forbidden" edge sets.

Theorem 4 (Lam, Gusfield, and Sridhar [15]). Let $M$ be pairwise compatible. Then, a triple $\{a, b, c\}$ of characters in $M$ is a minimal obstruction set if and only if (under possibly renaming states) $\mathrm{pig}(M[a, b, c])$ contains all of the edges of one of the graphs of Figure 3.

[Figure 3: The forbidden sets of edges of the partition intersection graph of three characters that have a perfect phylogeny (adapted from Figure 42 in [15]). Panels (a) and (c) are drawn on the vertices $a_1, a_2, a_3, b_1, b_2, b_3, c_1, c_2, c_3$; panel (b) is drawn on the same vertices without $a_3$.]

We note that in [15] there are four forbidden sets of edges; however, one of the sets of edges is a superset of one of the other sets of edges. Thus, only three are needed here.

2.2 Solving Three-State Perfect Phylogeny with Two-State Characters

Here we review a result of Dress and Steel [4]. Our exposition closely follows that of [10]. Our goal is to derive a matrix of two-state characters $\overline{M}$ from the matrix $M$ of three-state characters. The properties of $\overline{M}$ are such that they enable us to find a perfect phylogeny for $M$. The matrix $\overline{M}$ contains three characters $c(1)$, $c(2)$, $c(3)$ for each character $c$ in $M$, such that all of the taxa that have state $i$ for $c$ in $M$ are given state 1 for character $c(i)$ in $\overline{M}$, and the other taxa are given state 2 for $c(i)$ in $\overline{M}$. Since every character in $\overline{M}$ has two states, two characters $c(i)$ and $d(j)$ of $\overline{M}$ are incompatible if and only if the two columns corresponding to $c(i)$ and $d(j)$ contain all four of the pairs $(1, 1)$, $(1, 2)$, $(2, 1)$, and $(2, 2)$; otherwise they are compatible.
This is known as the four gametes test [17].

Theorem 5 (Dress and Steel [4]). There is a perfect phylogeny for $M$ if and only if there is a subset $C$ of the characters of the derived two-state matrix $\overline{M}$ such that (i) the characters in $C$ are pairwise compatible, and (ii) for each character $c$ in $M$, $C$ contains at least two of the characters $c(1)$, $c(2)$, $c(3)$.

Theorem 5 was used in [4] to give an $O(m^2 n)$ time algorithm to decide if there is a perfect phylogeny for $M$. It was also used in [10] to reduce the three-state perfect phylogeny problem in polynomial time to the well-known 2-SAT problem, which is in P.

3 A Simple Characterization of Minimal Obstruction Sets

In this section, we focus on the case where $M$ is pairwise compatible. Our main result is a characterization of the situation where $M$ does not have a perfect phylogeny that is based on the partition intersection graphs for the pairs of characters in $M$. Theorem 2 gives a simple characterization of the situation when $M$ is not pairwise compatible.

We say that a state $i$ for a character $c$ of $M$ is dependent precisely when there exists a character $d$ of $M$, and two states $j, k$ of $d$, such that $c(i)$ is incompatible with both $d(j)$ and $d(k)$. The character $d$ is a witness that state $i$ of $c$ is dependent.

Lemma 6. Let $c$ be a character of $M$ and let $i$ be a dependent state of $c$. Then no pairwise compatible subset of characters in $\overline{M}$ satisfying Theorem 5 contains $c(i)$.

Proof. Let $I$ be a pairwise compatible subset of the characters in $\overline{M}$ that contains $c(i)$. Since state $i$ of $c$ is dependent, there is a character $b$ in $M$ and two states $j, k$ of $b$, such that $c(i)$ is incompatible with both $b(j)$ and $b(k)$. It follows that $b(j) \notin I$ and $b(k) \notin I$. But then $I$ cannot possibly contain two of $b(1)$, $b(2)$, and $b(3)$. Thus, $I$ cannot satisfy the condition required in Theorem 5.
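The two-state characters $c(i)$, the four gametes test, and the definition of a dependent state translate directly into code. The following Python is a minimal sketch under stated assumptions: the names `binary_character`, `incompatible`, and `dependent_states` are hypothetical, characters are identified by 0-based column index, and the matrix is the one from Figure 1. It checks the definition by brute force rather than aiming for the running time discussed in the paper.

```python
# Matrix M from Figure 1: rows are taxa, columns are the characters a, b, c.
M = [(3, 2, 1), (2, 1, 3), (2, 2, 2), (3, 3, 2), (1, 1, 3), (2, 2, 3)]

ALL_GAMETES = {(1, 1), (1, 2), (2, 1), (2, 2)}


def binary_character(M, c, i):
    """Column c(i): state 1 for taxa with state i in column c, state 2 otherwise."""
    return [1 if row[c] == i else 2 for row in M]


def incompatible(x, y):
    """Four gametes test: two two-state columns clash iff all four pairs occur."""
    return set(zip(x, y)) == ALL_GAMETES


def dependent_states(M, c):
    """Map each dependent state i of column c to one witness column d.

    State i is dependent when some other column d has two states j, k
    such that c(i) is incompatible with both d(j) and d(k).
    """
    result = {}
    for i in (1, 2, 3):
        ci = binary_character(M, c, i)
        for d in range(len(M[0])):
            if d == c:
                continue
            clashes = sum(
                incompatible(ci, binary_character(M, d, j)) for j in (1, 2, 3)
            )
            if clashes >= 2:
                result[i] = d  # first witness found, in column order
                break
    return result
```

For the Figure 1 matrix, each of the three characters has exactly one dependent state (state 2), so no character $c$ is reduced to a single usable column among $c(1)$, $c(2)$, $c(3)$.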
The next lemma gives a characterization of when a state is dependent using partition intersection graphs. We first introduce some notation: if $p : p_1 p_2 p_3 p_4 p_5$ is a path of length four in a graph, then we write $\mathrm{middle}[p]$ to denote $p_3$, the middle vertex of $p$.

Lemma 7. Let $M$ be pairwise compatible. A state $i$ of a character $c$ of $M$ is a dependent state if and only if there is a character $d$ of $M$ and a path $p$ of length four in $\mathrm{pig}(M[c, d])$ with $\mathrm{middle}[p] = c_i$.

Proof. W.l.o.g. assume that $i = 1$, i.e., $c_i = c_1$.

($\Rightarrow$) Since 1 is a dependent state of $c$, there exists a character $d$ in $M$ such that $c(1)$ is incompatible with two of $d(1)$, $d(2)$, and $d(3)$. W.l.o.g., assume $c(1)$ is incompatible with both $d(1)$ and $d(2)$. Then, $c_1 d_1$ and $c_1 d_2$ are edges of $\mathrm{pig}(M[c, d])$, and, since $\mathrm{pig}(M[c, d])$ has no cycles (by Theorem 2, because $M$ is pairwise compatible), either $d_2 c_2$ and $d_1 c_3$ or $d_2 c_3$ and $d_1 c_2$ are edges of $\mathrm{pig}(M[c, d])$. If $d_2 c_2$ and $d_1 c_3$ are edges of $\mathrm{pig}(M[c, d])$, then $c_2 d_2 c_1 d_1 c_3$ is the required path of length four. If $d_2 c_3$ and $d_1 c_2$ are edges of $\mathrm{pig}(M[c, d])$, then $c_3 d_2 c_1 d_1 c_2$ is the required path of length four.

[Figure 4: Illustrating the proof of Lemma 7. (a) $c(1)$ and $d(1)$ are incompatible: the splits $c_1 \mid c_2 c_3$ and $d_1 \mid d_2 d_3$. (b) $c(1)$ and $d(2)$ are incompatible: the splits $c_1 \mid c_2 c_3$ and $d_2 \mid d_1 d_3$.]

($\Leftarrow$) Let $d$ be a character of $M$ such that there is a path $p$ of length four in $\mathrm{pig}(M[c, d])$ with $\mathrm{middle}[p] = c_1$. Since $\mathrm{pig}(M[c, d])$ cannot contain edges between two states of the same character, we can assume w.l.o.g. that $p$ is the path $c_2 d_1 c_1 d_2 c_3$. Then, it is easy to verify that $c(1)$ is incompatible with both $d(1)$ and $d(2)$. This is illustrated in Figure 4.

Lemma 8. If $M$ is pairwise compatible and there is a character $c$ of $M$ that has two dependent states, then no perfect phylogeny exists for $M$.

Proof. Let $i$ and $j$ be two dependent states of $c$.
Then, by Lemma 6, no pairwise compatible subset $I$ of the two-state characters of $\overline{M}$ that satisfies the condition required in Theorem 5 can contain $c(i)$ or $c(j)$. But then $I$ can only contain one of $c(1)$, $c(2)$, or $c(3)$. Hence, no pairwise compatible subset $I$ of the characters of $\overline{M}$ can satisfy the condition required in Theorem 5. Hence, by Theorem 5, there is no perfect phylogeny for $M$.

We now show that the converse of Lemma 8 holds.

Lemma 9. If $M$ is pairwise compatible and has no perfect phylogeny, then there exists a character $c$ of $M$ that has two dependent states.

Proof. By Theorem 4, there exist characters $a, b, c$ in $M$ such that $G = \mathrm{pig}(M[a, b, c])$ (under possibly renaming of states) contains all of the edges of at least one of the subgraphs of Figure 3. If $G$ contains all of the edges of Figure 3a, then $c_3 b_1 c_1 b_2 c_2$ is a path witnessing that $c_1$ is dependent and $c_3 a_1 c_2 a_3 c_1$ is a path witnessing that $c_2$ is dependent (this is illustrated in Figure 5a). If $G$ contains all of the edges of Figure 3b, then $c_3 b_1 c_1 b_2 c_2$ is a path witnessing that $c_1$ is dependent and $c_3 a_1 c_2 a_2 c_1$ is a path witnessing that $c_2$ is dependent (this is illustrated in Figure 5b). If $G$ contains all of the edges of Figure 3c, then $c_3 a_2 c_1 a_1 c_2$ is a path witnessing that $c_1$ is dependent and $c_3 b_3 c_2 b_1 c_1$ is a path witnessing that $c_2$ is dependent (this is illustrated in Figure 5c). In all three cases, $M$ contains a character that has two dependent states.

[Figure 5: Illustrating the proof of Lemma 9; panels (a), (b), and (c).]

Lemmas 8 and 9 together immediately imply our main theorem.

Theorem 10. If $M$ is pairwise compatible, then there is a perfect phylogeny for $M$ if and only if there is at most one dependent state of each character $c$ of $M$.

Observation 11.
Let $M$ be pairwise compatible and let $c$ be a character of $M$ with two dependent states. Let $a$ be a witness for one dependent state of $c$ and let $b$ be a witness for another dependent state of $c$. Then, the set $\{a, b, c\}$ is an obstruction set for $M$.

This leads to the following $O(m^2 n)$ time algorithm to find a minimal obstruction set for $M$, if one exists.

Algorithm 1 MinimalObstructionSet(M)
Input: M is an n by m matrix of integers from the set {1, 2, 3}.
Output: A minimal obstruction set for M if one exists, otherwise the empty set.
 1  for all characters x in M do
 2      for all states i of x do
 3          mark[x_i] ← ∅;
 4  S ← ∅;
 5  for all pairs of characters a, b in M do
 6      G ← pig(M[a, b]);
 7      if G contains a cycle then
 8          return {a, b};
 9      else if S = ∅ then
10          for all x_i ∈ {a_1, a_2, a_3, b_1, b_2, b_3} such that mark[x_i] is empty do
11              if there is a path p of length four in G with middle[p] = x_i then
12                  mark[x_i] ← {a, b} \ {x};
13          if two states i, j of a have non-empty marks then
14              S ← {a} ∪ mark[a_i] ∪ mark[a_j];
15          else if two states i, j of b have non-empty marks then
16              S ← {b} ∪ mark[b_i] ∪ mark[b_j];
17  return S;

The correctness of the algorithm follows from Theorem 2, Theorem 10, Observation 11, and Lemma 7. To see that the algorithm takes $O(m^2 n)$ time, note that the runtime is dominated by the loop of lines 5-16, which executes once for each of the $O(m^2)$ pairs of characters in $M$. Constructing the partition intersection graph of two three-state characters takes $O(n)$ time. Since the partition intersection graph of two three-state characters is of constant size, each of the other operations performed in the loop takes constant time.
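Algorithm 1 can be sketched in Python as follows. This is an illustrative reading of the pseudocode, not the authors' implementation: the function names, the list-of-rows representation of $M$, the use of 0-based column indices as character names, and the choice to store a single witness column per marked vertex are all assumptions made for the example. Comments refer to the line numbers of Algorithm 1.

```python
from itertools import combinations


def pig(M, a, b):
    """Adjacency sets of pig(M[a, b]); a vertex is a (column, state) pair."""
    adj = {}
    for row in M:
        u, v = (a, row[a]), (b, row[b])
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def has_cycle(adj):
    """A simple graph is a forest iff |E| = |V| - (number of components)."""
    n_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    seen, components = set(), 0
    for start in adj:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:
            for nb in adj[stack.pop()]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
    return n_edges > len(adj) - components


def is_middle_of_path4(adj, v):
    """True if some path p1 p2 v p4 p5 of length four has v as its middle vertex."""
    for p2, p4 in combinations(adj[v], 2):
        ends_left = adj[p2] - {v}
        ends_right = adj[p4] - {v}
        if any(p1 != p5 for p1 in ends_left for p5 in ends_right):
            return True
    return False


def minimal_obstruction_set(M):
    """Return a set of column indices forming an obstruction set, or set()."""
    m = len(M[0])
    mark = {}  # dependent vertex (column, state) -> witness column (lines 1-3)
    S = set()  # line 4
    for a, b in combinations(range(m), 2):  # line 5
        adj = pig(M, a, b)  # line 6
        if has_cycle(adj):  # lines 7-8
            return {a, b}
        if S:  # line 9: once S is set, only keep scanning for cycles
            continue
        for v in adj:  # lines 10-12
            if v not in mark and is_middle_of_path4(adj, v):
                mark[v] = b if v[0] == a else a
        for x in (a, b):  # lines 13-16
            witnesses = [w for (col, _), w in mark.items() if col == x]
            if len(witnesses) >= 2:
                S = {x, witnesses[0], witnesses[1]}
                break
    return S  # line 17
```

As in the pseudocode, the loop keeps checking the remaining pairs for cycles even after a triple has been found, so an incompatible pair is always reported in preference to a triple. On the Figure 1 matrix `[(3, 2, 1), (2, 1, 3), (2, 2, 2), (3, 3, 2), (1, 1, 3), (2, 2, 3)]` the sketch returns the empty set.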
We note that if no obstruction set exists for $M$, then a perfect phylogeny for $M$ can be constructed in $O(m^2 n)$ time by using one of the existing algorithms for the three-state perfect phylogeny problem [4, 13, 1, 14, 10].

Acknowledgments

This work was supported in part by the National Science Foundation under grants CCF-1017189 and DEB-0829674.

References

[1] R. Agarwala and D. Fernández-Baca. A polynomial-time algorithm for the perfect phylogeny problem when the number of character states is fixed. SIAM Journal on Computing, 23(6):1216–1224, 1994.
[2] H. Bodlaender, M. Fellows, and T. Warnow. Two strikes against perfect phylogeny. In ICALP, volume 623 of Lecture Notes in Computer Science, pages 273–283. Springer-Verlag, 1992.
[3] P. Buneman. A characterisation of rigid circuit graphs. Discrete Mathematics, 9(3):205–212, 1974.
[4] A. Dress and M. Steel. Convex tree realizations of partitions. Applied Mathematics Letters, 5(3):3–6, 1992.
[5] G. F. Estabrook and F. R. McMorris. When are two qualitative taxonomic characters compatible? Journal of Mathematical Biology, 4:195–200, 1977.
[6] D. Fernández-Baca. The perfect phylogeny problem. In Steiner Trees in Industry, pages 203–234. Kluwer, 2001.
[7] W. M. Fitch. Toward finding the tree of maximum parsimony. In Proceedings of the 8th International Conference on Numerical Taxonomy, pages 189–220, 1975.
[8] W. M. Fitch. On the problem of discovering the most parsimonious tree. American Naturalist, 11:223–257, 1977.
[9] D. Gusfield. Efficient algorithms for inferring evolutionary trees. Networks, 21(1):19–28, 1991.
[10] D. Gusfield and Y. Wu. The three-state perfect phylogeny problem reduces to 2-SAT. Communications in Information & Systems, 9(4):195–301, 2009.
[11] M. Habib and T.-H. To. On a conjecture of compatibility of multi-states characters.
Technical Report 1105.1109, Computing Research Repository, May 2011.
[12] C. Johnson, G. Estabrook, and F. R. McMorris. A mathematical formulation for the analysis of cladistic character compatibility. Mathematical Bioscience, 29, 1976.
[13] S. Kannan and T. Warnow. Inferring evolutionary history from DNA sequences. SIAM Journal on Computing, 23(4):713–737, 1994.
[14] S. Kannan and T. Warnow. A fast algorithm for the computation and enumeration of perfect phylogenies. SIAM Journal on Computing, 26(6):1749–1763, 1997.
[15] F. Lam, D. Gusfield, and S. Sridhar. Generalizing the four gamete condition and splits equivalence theorem: Perfect phylogeny on three state characters. In Workshop on Algorithms in Bioinformatics, pages 206–219, 2009.
[16] C. A. Meacham. Theoretical and computational considerations of the compatibility of qualitative taxonomic characters. In Numerical Taxonomy, volume G1 of NATO ASI Series, pages 304–314. Springer-Verlag, 1983.
[17] C. Semple and M. Steel. Phylogenetics. Oxford Lecture Series in Mathematics and Its Applications. Oxford University Press, 2003.
[18] M. Steel. The complexity of reconstructing trees from qualitative characters and subtrees. Journal of Classification, 9:91–116, 1992.
