Identifiability of a Markovian model of molecular evolution with Gamma-distributed rates

Inference of evolutionary trees and rates from biological sequences is commonly performed using continuous-time Markov models of character change. The Markov process evolves along an unknown tree while observations arise only from the tips of the tre…

Authors: ** (논문에 명시된 저자 목록을 그대로 기재해 주세요. 예: J. Rogers, S. Steel, 등) **

Applie d Pr ob ability T rust (25 October 2018) IDENTIFI ABIL ITY OF A MARK O VIAN MODEL OF MOLECULAR EV OLUTIO N WITH GAMMA-DISTRIBUTED RA TES ELIZABETH S. ALLMA N, ∗ University of Alaska F airb anks C ´ ECILE AN ´ E, ∗∗ University of W isc onsin Madison JOHN A . R HODES, ∗∗∗ University of Alaska F airb anks ∗ Po stal address: Departmen t of Mat hematics and Statistics, Univ ersity of Alask a F airbanks, PO Box 756660, F airbanks, AK 99775 ∗ Email address: e.allman@uaf.edu ∗∗ Po stal address: Departmen t of St atistics, University of W isconsin Madison, Medical Science Cen ter, 1300 University Ave., Madison, WI 53706 ∗∗ Email address: ane@stat.wisc.edu ∗∗∗ Po stal address: Departmen t of M athematics and Statistics, Universit y of Alask a F airbanks, PO Bo x 756660, F airbanks, AK 99775 ∗∗∗ Email address: j .rho des@uaf.edu 1 2 Allman, A n´ e, Rho des Abstract Inference of evolutionary trees and rates from b iologica l sequences is commonly p erformed using con tinuous-time Marko v models of c haracter c h ange. The Mark ov pro cess evolve s along an u nknown tree while observ ations arise only from the tips of the tree. Rate h eterogeneit y is presen t in most real data sets and is accounted for by th e use of flexible mixture mo dels where eac h site is allo w ed its ow n rate. V ery little has b een rigorously established concerning the identifiabili ty of the models cu rrently in common use in data analysis, although non- identifiabilit y w as p ro ven for a semi-parametric mo del and an incorrect pro of of identifiabilit y wa s pu b lished for a general parametric mod el (GTR+Γ+I). Here w e prov e that one of the most widely used mod els (GTR+Γ) is identi fiable for generic parameters, and for all p arameter choices in the case of 4-state (DNA) mo dels. This is th e first pro of of identifiabilit y of a phylogenetic mod el with a con tinuous distribution of rates. Keywor ds: phylogenetics, identifiabilit y AMS 2000 Su b ject Classification: Primary 60J2 5 Secondary 92D15, 92D20 1. In tro duction A central goal of m olecular p h ylogenetics is to infer evol utionary trees from DNA or protein sequ ences. Suc h sequence d ata come fr om extan t sp ecies at the tips of the tree – the tree of life – while the top ology of the tr ee r elating these sp ecies is un k n o wn. Inferring this tree helps us understand the ev olutionary relationships b et w een sequences. Ph ylogenetic data analysis is often p erformed using Mark o vian mo d els of ev olution: Mutations o ccur along the branches of the tree under a fin ite-state Mark o v pro cess. There is ample evidence that some p laces in the genome undergo m utations at a h igh rate, wh ile other lo ci ev olv e v ery slowly , p erhap s due to s ome f u nctional constrain t. Suc h r ate variation o ccurs at all spatial scales, across genes as wel l as across sites w ithin genes. I n p erforming inference, this heterogeneit y is accoun ted for b y the use of flexible mixture mo dels where eac h site is allo w ed its o wn rate according to a rate d istribution µ . In the Identifiability of a m o del of mole cular evolution 3 con text of molecular phylog enetics, the use of a parametric family for µ is generally considered b oth adv an tageous and su fficien tly flexible. The question of identifiabilit y for suc h a rate-v ariation mo d el is a funda- men tal one, as standard pro ofs of consistency of statistical in ference metho ds b egin b y establishing identifiabilit y . Without id en tifiabilit y , inference of some or all mo del parameters ma y b e u njustified . Ho w ev er, since p hylogenetic data is gathered only from the tips of the tree, unders tanding when one has id en tifi- abilit y of th e tree top ology and other parameters for p hyl ogenetic mo dels p oses substanti al m athematical c hallenges. Ind eed, it has b een shown th at the tree and mo del p arameters are not identifiable if the distr ibution of rates µ is to o general, ev en wh en the Mark o vian m utation mo del is quite simple [13]. The m ost commonly u sed p h ylogenetic mo del is a gener al time-r eve rsible (GTR) Marko vian mutat ion mo del along w ith a Gamma distribu tion family (Γ) f or µ . F or more flexibilit y , a class of in v ariable sites (I) can b e added by allo w in g µ to b e the mixture of a Gamma distr ib ution w ith an atom at 0 [4]. Numerous studies ha v e sho wn that the addition to the GTR mo del of rate heterogeneit y th rough Γ, I, or b oth, can considerably improv e fit to data at the exp ense of only a few additional parameters. In fact, when mo d el selection pro cedur es are p erformed , the GT R + Γ+I mo d el is preferr ed in most s tudies. These sto c hastic mo d els are the basis of hundreds of p ublications ev ery yea r in the biologica l sciences — o ve r 40 in Systematic Biolo gy alone in 2006. T heir impact is imm en se in the fi elds of evo lutionary biology , ecology , conserv ation biology , and biogeograph y , as wel l as in medicine, where, for example, they app ear in the study of the ev olution of in fectious d iseases s uc h as HIV and influenza viru ses. The main result claimed in th e widely-cited pap er [11] is the follo wing: The 4 -b ase (DNA) GTR+ Γ +I mo del, with unknown mixing p ar ameter and Γ shap e p ar ameter, is identifiable fr om the joint distributions of p airs of taxa. Ho w ev er, th e pro of giv en in [11] of this statemen t is flaw ed; in f act, t wo gaps 4 Allman, A n´ e, Rho des o ccur in the argument. The first gap is in th e u se of an unjustified claim concerning graphs of the sort exemplified by Figure 3 of that pap er. As this claim pla ys a crucial role in the ent ire argument, th e statemen t ab o ve remains unpr ov en. The second gap, though less sw eeping in its imp act, is still significan t. As- suming the unjus tifi ed graphical claim men tioned ab o v e could b e pr o v ed, the argumen t of [11] still u ses an assump tion that the eigen v alues of th e GTR rate matrix b e distinct. While this is true f or generic GTR parameters, there are exceptions, includ ing the well-kno wn Ju k es-Can tor and Kimura 2-parameter mo dels [4]. Without substantia l additional argum ents, the reasoning giv en in [11] cannot prov e iden tifiabilit y in all cases. F ur th ermore, b r idging either o f the gaps in [11] is not a trivial matter. Though we s u sp ect that Rogers’ statemen t of identifiabilit y is correct, at least for generic parameters, we hav e not b een able to establish it b y h is metho ds. F or furth er exp osition on the n ature of the gaps, see the App endix. In this p ap er, w e consider only the GTR+Γ m o del, bu t for c haracters with an y num b er κ ≥ 2 states, wh ere th e case κ = 4 corresp onds to DNA sequences. Our main r esult is the follo wing: Theorem 1. The κ -state GTR+ Γ mo del is identifiable fr om the j oint distribu- tions of triples of taxa for ge ne ric p ar ameters on any tr e e with 3 or mor e taxa. Mor e over, when κ = 4 the mo del is identifiable f or al l p ar ameters. The term ‘generic’ here means for th ose GTR state distrib u tions and r ate matrices w hic h d o not satisfy at least one of a collection of equalities to b e explicitly giv en in Theorem 2. C onsequent ly , the set of non-generic parameters is of Leb esgue measure zero in the full p arameter sp ace. Our arguments are quite differen t from those attempted in [11]. W e com bine arguments from algebra, algebraic geometry and analysis. W e b eliev e this pap er pr esen ts the fir st correct p ro of of identifiabilit y for Identifiability of a m o del of mole cular evolution 5 an y mo d el with a contin u ous distribu tion µ of rates across s ites that is not fully kno wn . The non-identifiabilit y of some mo d els with m ore freely-v arying rate d istributions of rates across sites was established in [13]. T hat p ap er also sho we d identifiabilit y of rate-across-sites mo dels built up on certain group -based mo dels provided the rate distr ibution µ is completely kn o wn. More recen tly , [1] prov ed that tree top ologies are identifiable for generic parameters in rather general mixture mo dels with a small num b er of classes. Th at resu lt sp ecializes to give the identi fiability of trees for the κ -state GTR mo d els with at most κ − 1 rates-across-site s classes, includ ing th e GTR+I mo del. Iden tifiabilit y of n umerical mo del parameters for GTR+I is f urther exp lored in [2]. T here hav e also b een a num b er of r ecen t w orks dealing with non-iden tifiabilit y of m ixture mo dels whic h are n ot of the rates-across-sites t yp e; these include [15, 16, 9, 8]. In Section 2 we d efine th e GTR+Γ mo del, in tro du ce n otation, and r educe Theorem 1 to the case of a 3-taxon tree. In Section 3, w e use pu rely algebraic ar- gumen ts to determin e from a join t distribu tion certain useful quanti ties defin ed in terms of the mo del p arameters. In Section 4, in the generic case of certain algebraic expressions n ot v anishing, an analytic argument uses these quant ities to iden tify the m o del p arameters. F o cusin g on the imp ortant case of κ = 4 for th e r emainder of the pap er , in Section 5 we completely charact erize the exceptional cases of parameters not co vered by our generic argument. Using this additional information, in Section 6 we establish iden tifiabilit y for these cases as well . Finally , Section 7 briefly men tions s ev eral p roblems concerning iden tifiabilit y of phylo genetic mo d els that r emain op en. 2. Preliminaries 2.1. The GTR+rates-across-sites substi tuti on mo del The κ -state across-site rate-v ariation mo del is parameterized by: 1. An u n ro oted top ological tree T , with all internal v ertices of v alence ≥ 3, 6 Allman, A n´ e, Rho des and with lea ve s lab eled b y a 1 , a 2 , . . . , a n . These lab els represent taxa, and the tree their evo lutionary r elationships. 2. A collectio n of edge lengths t e ≥ 0, where e ranges ov er the edges of T . W e r equire t e > 0 for all in ternal edges of the tree, but allo w t e ≥ 0 for p endan t edges, pro vided no tw o taxa are total-e dge-length-distance 0 apart. Th us if an edge e is p end an t, the lab el on its leaf m a y represent either an ancestral ( t e = 0) or n on-ancestral ( t e > 0) taxon. 3. A distribu tion vec tor π = ( π 1 , . . . , π κ ) with π i > 0, P π i = 1, represent ing the frequencies of states o ccurring in biologic al sequences at all v ertices of T . 4. A κ × κ matrix Q = ( q ij ), w ith q ij > 0 for i 6 = j and P j q ij = 0 for eac h i , such that d iag( π ) Q is symmetric. Q rep r esen ts the in stan taneous substitution rates b etw een states in a r ev ersible Mark o v pr o cess. W e will also assume some normalization of Q h as b een imp osed, for instance that diag( π ) Q has trace − 1. Note that the symmetry and ro w s ummation conditions imply that π is a left eigen v ector of Q with eigen v alue 0, whic h in turn implies π is stationary u nder the contin uou s -time p ro cess defi n ed by Q . 5. A distribu tion µ , with non-negativ e sup p ort and exp ectation E ( µ ) = 1, describ in g th e d istribution of rates among sites. If a site has rate parameter r , th en its in stan taneous su bstitution rates w ill b e giv en by r Q . Letting [ κ ] = { 1 , 2 , . . . , κ } denote the states, the joint distribution of s tates at the lea ve s of the tree T w h ic h arises f r om a rate-across-sites GTR mo del is computed as follo ws. F or eac h rate r and edge e of the tree, let M e,r = exp( t e r Q ). Then with an arbitrary v ertex ρ of T c hosen as a ro ot, let P r ( i 1 , . . . , i n ) = X ( h v ) ∈ H π ( h ρ ) Y e M e,r ( h s ( e ) , h f ( e ) ) ! , (1) Identifiability of a m o del of mole cular evolution 7 where the pro d u ct is taken o ve r all edges e of T directed a w a y fr om ρ , edge e has initial v ertex s ( e ) and final vertex f ( e ), and th e su m is tak en o ver the set H = H i 1 i 2 ...i n =  ( h v ) v ∈ V ert ( T ) | h v ∈ [ κ ] if v 6 = a j , h v = i j if v = a j  ⊂ [ κ ] | V ert ( T ) | . Th us H represents the set of all ‘histories’ consistent with the sp ecified states i 1 , . . . , i n at the lea v es, and the n -dimensional table P r giv es the joint distribu - tion of states at the lea ve s giv en a site has r ate parameter r . S ince the Mark o v pro cess is rev ersible and stationary on π , this distribu tion is indep enden t of the c hoice of r o ot ρ . Finally , the joint distribution for the GTR+ µ mo del is giv en by the n - dimensional table P = Z r P r dµ ( r ) . The d istr ibution for the GTR+Γ mo del is given by additionally sp ecifying a parameter α > 0, with µ then sp ecialized to b e the Γ-distribution with shap e parameter α and mean 1, i. e. , with s cale parameter β = 1 /α . 2.2. Di agonalization of Q The r eversibilit y assumptions on a GTR mo d el imply that diag( π 1 / 2 ) Q diag( π − 1 / 2 ) is symmetric, and that Q can b e repr esen ted as Q = U diag(0 , λ 2 , λ 3 , . . . , λ κ ) U − 1 , where the eigen v alues of Q s atisfy 0 = λ 1 > λ 2 ≥ λ 3 ≥ · · · ≥ λ κ [6], and U is a real matrix of asso ciated eigen v ectors satisfying the equiv alent statemen ts U U T = diag ( π ) − 1 , U T diag( π ) U = I . (2) F ur th ermore, the first column of U may b e tak en to b e the v ector 1 . While the λ i are u niquely d etermin ed by these considerations, in th e case that all λ i are distinct the matrix U is determined only u p to multiplica tion of its ind ividual column s by ± 1. If the λ i are not distinct, eigenspaces are uniquely d etermined bu t U is not. 8 Allman, A n´ e, Rho des Our metho d of determining Q from a join t distrib ution w ill pro ceed by deter- mining eigenspaces (via U ) and the λ i separately . Although the non-uniquen ess of U will not matter for our argum en ts, the normalization d etermined b y equations (2) w ill b e used to simplify our presenta tion. 2.3. M oment generating function W e also use the moment generating fu nction ( i.e. , essentia lly the Laplace transform) of the density fu nction for the distrib u tion of rates in our m o del. As our algebraic arguments w ill apply to arb itrary rate d istributions, wh ile our analytic arguments are fo cused on Γ d istributions, we introd uce notation for the momen t generating functions in b oth settings. Definition 1. F or any fixed distribution µ of rates r , let L ( u ) = L µ ( u ) = E ( e r u ) for −∞ < u ≤ 0, denote the exp ectation of e r u . In the s p ecial case of Γ- distributed rates, with parameters α > 0 and β = 1 /α , let L α ( u ) = L Γ ,α = E ( e r u ) =  1 − u α  − α . Note that L , and in particular L α , is an increasing function throughout its domain. 2.4. R eduction to 3-taxon case T o prov e Theorem 1, it is su fficien t to consider only the case of 3-taxo n trees. Lemma 1. If the statements of The or e m 1 holds for 3 -taxon tr e es, then they also hold for n -taxon tr e es when n > 3 . Pr o of. As th e generic condition of Theorem 1 is a condition on π and Q (see Theorem 2 b elo w for a precise statement ), parameters on a n -taxon tree are generic if and only if the indu ced parameters on all induced 3-taxon trees are generic. Identifiability of a m o del of mole cular evolution 9 a b c t a t b t c Figure 1: The unique 3-taxon tree relating taxa a , b , and c , with branch lengths t a , t b and t c . If the mo del on 3-taxon trees is identi fiable for certain parameters, then from the join t distribution for a tr ee such as that of Figure 1, we may determine α , Q , π and the 3 edge lengths t a , t b , t c . Thus we ma y determin e the pairwise distances t a + t b , t a + t c , t b + t c b et wee n the taxa. F rom an n -taxon d istribution, b y considering marginalizations to 3 taxa we may thus determine α , Q , π , and all pairwise distances b et w een taxa. F rom all pairwise distances, w e may reco v er the top ological tree and all edge lengths by standard com binatorial arguments, as in [12 ]. 3. Algebraic arguments W e now d etermine some in formation that w e ma y obtain algebraically from a joint distr ibution kno wn to hav e arisen from the GTR+ µ mo del on a tree T r elating 3 taxa. While in this p ap er we will only apply the r esu lts to the GTR+Γ mo del, we d eriv e th em at their natural lev el of generalit y . W e therefore denote the moment generating function of the rate distribution by L , with its dep end ence on µ left implicit. As marginalizations of the joint distribution corresp on d to the mo del on induced trees T ′ with few er taxa, we work with trees with 1, 2, or 3 lea v es. If T ′ has only 1 leaf, it is s imply a single vertex, and the d istribution of states is therefore π . Th us π is id entifiable f rom a j oin t distrib ution for 1 or more taxa. 10 Allman, A n´ e, Rho des If T ′ has exactly 2 lea ves, joined by an ed ge of length t e > 0, then the j oin t distribution can b e exp r essed as P = diag( π ) E (exp( t e r Q )) = diag( π ) U diag ( L ( λ 1 t e ) , . . . , L ( λ κ t e )) U − 1 . Therefore, diagonalizing diag ( π ) − 1 P determines the collection of L ( λ i t e ) and the column s of U u p to factors of ± 1. Since L is increasing, we may determine individual L ( λ i t e ) by the requirement that 1 = L (0) = L ( λ 1 t e ) > L ( λ 2 t e ) ≥ · · · ≥ L ( λ κ t e ) . (3) When the λ i are distinct, this fixes an ordering to the columns of U . Regardless, w e simply make a fixed c hoice of some U consistent with th e in equalities (3 ) and satisfying equations (2). W e can fur ther require this c hoice of U b e made consisten tly for all 2-taxon marginalizations of the join t d istribution. Th us f or an y tree relating 2 or more taxa, we ma y d etermine the eigenspaces of Q via U and the v alue L ( λ i d j k ) f or eac h i and pair of taxa a j , a k , where d j k is the total edge-length d istance b etw een a j and a k . F or T with exactly 3 lea v es, let a, b, c b e the taxa lab eling them, with edge lengths as in Figure 1, and let X a , X b , X c denote the charact er states at th ese taxa. As in [3 ], denote by P ab,γ the square matrix con taining the probab ilities P ab,γ ( i, j ) = P ( X b = j, X c = γ | X a = i ) , whic h can b e computed f r om the joint distribution. But P ab,γ = E  e r t a Q diag  e r t c Q · γ  e r t b Q  where e r t c Q · γ is the γ th column of matrix e r t c Q , so U − 1 P ab,γ U = E  diag( e r t a λ 1 , . . . , e r t a λ κ ) U − 1 diag  e r t c Q · γ  U diag ( e r t b λ 1 , . . . , e r t b λ κ )  . Note that the j th column of diag  e r t c Q · γ  U Identifiability of a m o del of mole cular evolution 11 is the same as th e γ th column of diag ( U · j ) e r t c Q . Th us when ( i, j ) is fixed, the row v ector formed b y U − 1 P ab,γ U ( i, j ) f or γ = 1 , . . . , κ is µ ij E  e r t a λ i e r t b λ j e r t c Q  (4) where µ ij is the ro w v ector with µ ij ( k ) = U − 1 ( i, k ) U ( k , j ) = π ( k ) U ( k, i ) U ( k , j ) . (5) Finally , m ultiplying (4) by U on the righ t, and setting ν ij = µ ij U , we see that the information b rough t by the triple of taxa { a, b, c } amounts to the knowle dge of ν ij E  e r t a λ i e r t b λ j diag( e r t c λ 1 , . . . , e r t c λ κ )  , i.e., to the knowledge of eac h E  e r t a λ i e r t b λ j e r t c λ k  = L ( t a λ i + t b λ j + t c λ k ) for which ν ij ( k ) 6 = 0. This motiv ates the follo wing notation, w here for conciseness we let U ij = U ( i, j ): F or i, j, k ∈ [ κ ], let ν ij k = X l π l U li U lj U lk . Note that wh ile ν ij k = ν ij ( k ), w e p refer this new n otation since the v alue of ν ij k is unchanged by p erm uting sub scripts: ν ij k = ν ik j = ν j ik = ν j k i = ν k ij = ν k j i . F ur th ermore, since π can b e determined from 1-taxon marginalizati ons, and U from 2-taxon m arginalizations, from a 3-taxon distribution w e may compute ν ij k for all i, j, k . In su m mary , w e ha ve sh o wn th e follo wing: 12 Allman, A n´ e, Rho des Prop osition 1. F r om a distribution arising fr om the GTR+ µ mo del on the 3 -taxon tr e e of Figur e 1, we may obtain the f ol lowing information: 1. π , fr om 1-mar ginalizations 2. al l matric es U which diagonalize Q as ab ove, and for al l i the values L ( λ i ( t a + t b )) , L ( λ i ( t a + t c )) , L ( λ i ( t b + t c )) , fr om 2-mar ginalizations, and 3. the values L ( λ i t a + λ j t b + λ k t c ) for al l i, j, k such that ν ij k 6 = 0 for some such choic e of U . Note th at (2) can b e obtained as a sp ecial case of (3) by taking j = i, k = 1, as it is easy to see ν ii 1 6 = 0. W e shall also see that ν ij 1 = 0 if i 6 = j , so certainly some of th e ν ij k can v anish. One migh t exp ect that f or most c hoices of GTR parameters all the ν ij k 6 = 0 for i, j, k > 1. Indeed, this is generally the case, but for certain c hoices one or more of these ν ij k can v anish. The J uk es-Canto r and Kimura 2- and 3- parameter mo d els p ro vide simple examples of this for κ =4: F or these mo dels, one ma y c ho ose π = (1 / 4 , 1 / 4 , 1 / 4 , 1 / 4) , U =         1 1 1 1 1 − 1 1 − 1 1 1 − 1 − 1 1 − 1 − 1 1         , and ν ij k 6 = 0 for i, j, k > 1 only wh en i, j, k are distinct. While for th e J uk es- Can tor and Kim ura 2-parameter mo dels one ma y m ake other choic es for U , one can sh o w th at these alternativ e choice s of U do not lead to the r eco very of any additional information. Nonetheless, for κ ≥ 3 th ere is alwa ys some genuine 3-taxon information a v ailable from a distrib ution, as we no w sho w. Although we do not need th e Identifiability of a m o del of mole cular evolution 13 follo w ing pr op osition f or the pro of of Theorem 1, the metho d of argum en t it in tro du ces und er lies Section 5 b elo w. Prop osition 2. With κ ≥ 3 , for any choic e of GTR p ar ameters ther e exists at le ast one triple i, j, k > 1 with ν ij k 6 = 0 . Pr o of. S upp ose for all triples i, j, k > 1, ν ij k = X l π l U li U lj U lk = 0 . (6) F rom equation (2 ) w e also kno w th at if j 6 = k , then ν 1 j k = X l π l U lj U lk = 0 . (7) Both of these equations can b e expressed more conv eniently b y in tro d ucing the inner pro du ct h x, y i = x T diag( π ) y . Then with U i b eing the i th column of U , and W j k b eing the vecto r wh ose l th en try is the p ro duct U lj U lk , equations (6 ) giv e the orth ogonalit y statemen ts h U i , W j k i = 0 , if i, j, k > 1, while equations (7) yield b oth h U 1 , W j k i = 0 , if j 6 = k , and h U j , U k i = 0 , if j 6 = k . In particular, we s ee for j, k > 1, j 6 = k , that W j k is orthogonal to all U i , and so W j k = 0 . Considering in dividual entries of W j k giv es that, for ev ery l , U lj U lk = 0 , for all j, k > 1, j 6 = k . (8) No w note that for an y j > 1, the v ector U j m ust hav e at least 2 non-zero en tries. (Th is is simply b ecause U j is a n on -zero v ector, and h 1 , U j i = 0 s ince 14 Allman, A n´ e, Rho des U 1 = 1 .) W e use this observ ation, together w ith equation (8), to arriv e at a con tradiction. First, without loss of generalit y , assum e the fi rst t wo en tries of U 2 are non- zero. Then b y equation (8) the first t wo en tries of all the v ectors U 3 , U 4 , . . . m ust b e 0. But then we ma y assu m e the third and f ourth entries of U 3 are non-zero, and so th e fi rst 4 en tries of U 4 , . . . are zero. F or the 4-state DNA mo del, this shows U 4 = 0 , whic h is imp ossible. More generally , f or a κ -state mo del, w e fin d U k = 0 as so on as 2( k − 2) ≥ κ . Note that for κ ≥ 4 this happ ens for some v alue of k ≤ κ , thus con tradicting that the U k are non-zero. In the κ = 3 case the s ame argument give s that U 3 has only one non-zero entry , wh ic h is still a con tradiction, since U 3 is orthogonal to U 1 = 1 . Thus the lemma is established f or a κ -state mo del with κ ≥ 3. F or κ = 2, th e statemen t of Prop osition 2 do es not hold, as is sh o wn by considering the 2-state symmetric mo del, with π = (1 / 2 , 1 / 2) , and U =   1 1 1 − 1   . Ho w ev er, one can sho w th is is the only c hoice of π and U for whic h ν 222 = 0. 4. Iden tifiabili t y for generi c parameters W e no w complete the p ro of of the first statement in Theorem 1, the id en tifia- bilit y of the GTR+Γ m o del for generic parameters, which is v alid for all v alues of κ ≥ 2. As w e no w consider only Γ-distrib u ted rates, w e use th e sp ecialized momen t generating fu nction L α in our argumen ts. More p recisely , w e will establish the f ollo w ing: Theorem 2. F or κ ≥ 2 , c onsider those GTR p ar ameters for which ther e exist some i, j , with 1 < i ≤ j , such that ν ij j 6 = 0 . Then r estricte d to these p ar ameters, the GTR+ Γ mo del is identifiable on 3 -taxon tr e es. Identifiability of a m o del of mole cular evolution 15 Remark 1. Note that the cond itions ν ij j = 0 are p olynomial in the entries of U and π . Viewing the GTR mo del as parameterized by those v ariables together with the λ i , th en the set of p oin ts in parameter sp ace for whic h ν ij j = 0 for some i, j w ith 1 < i ≤ j forms a p rop er algebraic v ariet y . Basic f acts of algebraic geometry then implies this set is of strictly lo w er dimension than the full parameter sp ace. A generic p oin t in parameter s p ace therefore lies off this exceptional v ariet y , and the exceptional p oin ts h a v e Leb esgue measure zero in the full p arameter space. Remark 2. F or κ = 2, ident ifiabilit y do es not hold for the 3-taxon tree if the generic condition that ν ij j 6 = 0 f or some 1 < i ≤ j is dropp ed. Ind eed, if ν 222 = 0, then, as comment ed in the last section, π and U arise f rom the 2-state symmetric mo del. Since there are only t wo eigen v alues of Q , λ 1 = 0 and λ 2 < 0, the second of these is determined by the normalization of Q . As the pro of of P r op osition 1 indicates, the only add itional information we ma y obtain fr om the joint d istribution is the th ree quant ities L α ( λ 2 ( t a + t b )) , L α ( λ 2 ( t a + t c )) , L α ( λ 2 ( t b + t c )) . Since these dep end on four u nknown parameters α, t a , t b , t c , it is straigh tforw ard to see the parameter v alues are not u niquely determined . Our p ro of of Theorem 2 will dep end on the follo wing tec hnical lemma. Lemma 2. Supp ose c ≥ a ≥ d 1 > 0 and c ≥ b > d 2 > 0 . Then the e quation d − β 1 + d − β 2 − a − β − b − β − c − β + 1 = 0 . has at most one solution with β > 0 . Pr o of. T h e equation can b e rewritten as  c d 1  β −  c a  β ! +  c d 2  β −  c b  β ! +  c β − 1  = 0 (9) No w a fun ction g ( β ) = r β − s β is strictly con vex on β ≥ 0 p ro vided r > s ≥ 1, since g ′′ ( β ) > 0. I f r = s , then g ( β ) = 0 is still con vex. Thus when viewe d as 16 Allman, A n´ e, Rho des a fu nction of β the fir s t expression on the left side of equation (9) is con ve x, and the second expression is str ictly conv ex. Also, for an y r > 0 th e fun ction h ( β ) = r β − 1 is con ve x, so the third expression in equation (9) is con v ex as well. Th us the sum of these three terms, the left s ide of equation (9), is a strictly con v ex fun ction of β . But a strictly conv ex function of one v ariable can hav e at most t wo zeros. Since the function d efined by the left side of equation (9) has on e zero at β = 0, it therefore can hav e at most one zero with β > 0. Pr o of of The or em 2. F or some j ≥ i > 1, we are give n that ν ij j 6 = 0. As ν ij j = ν j ij , by Prop osition 1 w e may determine the v alues D ij j = L α ( λ i t a + λ j t b + λ j t c ) , D j ij = L α ( λ j t a + λ i t b + λ j t c ) , as w ell as C k = L α ( λ k ( t a + t b )) , B k = L α ( λ k ( t a + t c )) , A k = L α ( λ k ( t b + t c )) for k = 1 , . . . , κ . Since L α is increasing, for an y k > 1 we can use the v alues of C k , B k to determine whic h of t b and t c is larger. Pr o ceeding similarly , we ma y determine the relativ e rankin g of t a , t b , and t c . Without loss of generalit y , we therefore assume 0 ≤ t a ≤ t b ≤ t c for the remaind er of this pro of. Note ho wev er that if t a = 0, then t b > 0, b y our assumption on mo del p arameters that no t w o taxa b e total-edge-length- distance 0 apart. Observe th at L − 1 α ( D ij j ) + L − 1 α ( D j ij ) = L − 1 α ( A j ) + L − 1 α ( B j ) + L − 1 α ( C i ) , Identifiability of a m o del of mole cular evolution 17 or, using th e formula for L α and letting β = 1 /α , D − β ij j + D − β j ij − A − β j − B − β j − C − β i + 1 = 0 . (10) Since j ≥ i > 1, w e hav e that λ j ≤ λ i < 0 . Because L α is an increasing function, and 0 ≤ t a ≤ t b ≤ t c , with t b > 0, this implies C i ≥ A j ≥ D ij j , and C i ≥ B j > D j ij . Th us app lying Lemma 2 to equation (10), with a = A j , b = B j , c = C i , d 1 = D ij j , d 2 = D j ij , w e find β is uniqu ely determined, so α = 1 /β is id en tifiable. Once α is kn o wn, f or ev ery k we may determine the quantit ies λ k ( t a + t b ) = L − 1 α ( C k ) , λ k ( t a + t c ) = L − 1 α ( B k ) , λ k ( t b + t c ) = L − 1 α ( A k ) . Th us we may determine the ratio b et wee n any tw o eigen v alues λ k . As U is kno wn, this determines Q up to scaling. Sin ce we hav e r equired a normalization of Q , this means Q is iden tifiable. Wit h the λ k no w determined, we can fi nd t a + t b , t a + t c and t b + t c , and h ence t a , t b , t c . 5. Exceptional cases ( κ = 4) In the previous section, identifiabilit y w as prov ed un der the assump tion th at ν ij j 6 = 0 for some j ≥ i > 1. W e no w sp ecialize to the case of κ = 4, and determine those GTR p arameters for whic h none of these cond itions holds. In the subsequent section, w e will u se this information to argue that even in these exceptional cases the GTR+Γ mo del is ident ifiable. 18 Allman, A n´ e, Rho des Note that while we work only with a 4-state mo del approp r iate to DNA, the approac h we use may w ell app ly for larger κ , though one should exp ect additional exceptional sub cases to app ear. Lemma 3. F or κ = 4 , c onsider a choic e of GTR p ar ameters for which ν ij j = 0 for al l j ≥ i > 1 . Then, u p to p ermutation of the states and multiplic ation of some c olumns of U by − 1 , the distribution ve ctor π and eigenve ctor matrix U satisfy one of the two fol lowing sets of c onditions: Case A: π = (1 / 4 , 1 / 4 , 1 / 4 , 1 / 4) , and for some b, c ≥ 0 with b 2 + c 2 = 2 , U =         1 c b 1 1 − c − b 1 1 − b c − 1 1 b − c − 1         Case B: π = (1 / 8 , 1 / 8 , 1 / 4 , 1 / 2) , and U =         1 2 √ 2 1 1 − 2 √ 2 1 1 0 − √ 2 1 1 0 0 − 1         Pr o of. W e us e the notation of Pr op osition 2, including th e inner pr o duct and d efinition of ve ctors W ij giv en in its p ro of. Orthogonalit y and lengths will alw a ys b e w ith r esp ect to th at inner pro d uct. W e will rep eatedly u se that for i, j with 1 < i ≤ j , h W j j , U i i = ν ij j = 0 . In particular, setting j = 4, w e fin d W 44 is orthogonal to U 2 , U 3 , U 4 , and hence is a multiple of U 1 = 1 . Th is implies U 4 = ( ± 1 , ± 1 , ± 1 , ± 1) , Identifiability of a m o del of mole cular evolution 19 since U 4 has length 1. Without loss of generalit y , by p ossibly p ermuting the ro ws of U (wh ic h is equiv alen t to changing the ord ering of the states in wr iting do wn the rate matrix Q ), and then p ossibly multiplying U 4 b y − 1, we n eed no w only consider t w o cases: either Case A: U 4 = (1 , 1 , − 1 , − 1) , or Case B: U 4 = (1 , 1 , 1 , − 1) . W e consider these tw o cases separately . Case A: Since U 1 = 1 and U 4 = (1 , 1 , − 1 , − 1), the orthogonalit y of U 1 and U 4 giv es π 1 + π 2 − π 3 − π 4 = 0 . Since P 4 i =1 π i = 1, this tells us π 1 + π 2 = 1 / 2 , π 3 + π 4 = 1 / 2 . (11) No w since W 33 is orthogonal to b oth U 2 and U 3 , then W 33 is a linear com bination of U 1 and U 4 , and h ence W 33 = ( b 2 , b 2 , c 2 , c 2 ). Th us U 3 = ( ± b, ± b, ± c, ± c ) . Since U 3 is orthogonal to b oth U 1 and U 4 , it is orthogonal to their linear com- binations, and in particular to (1 , 1 , 0 , 0) and (0 , 0 , 1 , 1) . Thus, by p ermuting the first tw o entries of the U i , and also p ermuting th e last tw o en tries of the U i , if necessary , we m ay assume U 3 = ( b, − b, c, − c ) with b, c ≥ 0. Th is orthogonalit y further sh o ws bπ 1 − bπ 2 = 0 , cπ 3 − cπ 4 = 0 . Th us π 1 = π 2 , or b = 0 , 20 Allman, A n´ e, Rho des and π 3 = π 4 , or c = 0 . In ligh t of equations (11), w e ha ve π 1 = π 2 = 1 / 4 , or b = 0 , and π 3 = π 4 = 1 / 4 , or c = 0 . In any of these cases, U 3 has length 1 so b 2 ( π 1 + π 2 ) + c 2 ( π 3 + π 4 ) = 1 . T ogether with equations (11) th is give s that b 2 + c 2 = 2 . No w since U 2 is orthogonal to U 1 , U 3 , U 4 , w e m us t ha ve that U 2 = a ( c/π 1 , − c/π 2 , − b/π 3 , b/π 4 ) for some a , and we ma y assume a > 0. Bu t th e length of U 2 is 1, and U 2 is orthogonal to W 22 , so c 2 /π 1 + c 2 /π 2 + b 2 /π 3 + b 2 /π 4 = 1 /a 2 , (12) c 3 /π 2 1 − c 3 /π 2 2 − b 3 /π 2 3 + b 3 /π 2 4 = 0 . (13) If neither of b, c is zero, so all π i = 1 / 4, then equation (12) tells us a = 1 / 4, as the s tatemen t of the th eorem claims. If b = 0, then we already kno w c = √ 2, and π 3 = π 4 = 1 / 4. But equation (13) implies π 1 = π 2 , s o these are also 1 / 4. W e then find fr om equation (12) that a = 1 / 4, and we ha ve another instance of the claimed characte rization of case A. S imilarly , if c = 0 we ob tain the r emaining instance. Identifiability of a m o del of mole cular evolution 21 Case B: Since U 1 = 1 and U 4 = (1 , 1 , 1 , − 1), the orthogonalit y of U 1 and U 4 implies π 1 + π 2 + π 3 − π 4 = 0 . No w W 33 is orthogonal to U 2 and U 3 , and hence is a lin ear com bination of U 1 and U 4 . Thus W 33 = ( b 2 , b 2 , b 2 , c 2 ) , so U 3 = ( ± b, ± b, ± b, c ) . But U 3 is orthogonal to b oth U 1 and U 4 , and hence orthogonal to their linear com binations, including (0 , 0 , 0 , 1) and (1 , 1 , 1 , 0). This sh o ws c = 0 and that (p ossibly b y p erm uting the first three ro ws of U , and multiplying U 3 b y − 1) w e may assume U 3 = b (1 , 1 , − 1 , 0) for some b > 0. Or thogonalit y of U 3 and U 1 then shows π 1 + π 2 − π 3 = 0 . Also W 22 is orthogonal to U 2 , and h ence is a linear com bin ation of U 1 , U 3 , U 4 , so W 22 = ( d 2 , d 2 , e 2 , f 2 ). Th us U 2 = ( ± d, ± d, e, f ) . Ho w ev er, since U 2 is orthogonal to U 1 , U 3 , U 4 , it is orthogonal to (0 , 0 , 0 , 1 ), (0 , 0 , 1 , 0 ), and (1 , 1 , 0 , 0) . Thus we ma y assume U 2 = d (1 , − 1 , 0 , 0) with d > 0. Finally , orthogonalit y of U 2 and U 1 implies π 1 − π 2 = 0 . All th e ab o v e equations relating the π i , together with th e fact that P 4 i =1 π i = 1 giv es π = π 1 (1 , 1 , 2 , 4) = (1 / 8 , 1 / 8 , 1 / 4 , 1 / 2) . W e can no w determine the U i exactly , u sing that they must h a v e length 1, to show U is as claimed. 22 Allman, A n´ e, Rho des 6. Iden tifiabili t y in the exceptional cases ( κ = 4) W e no w complete the pro of of Theorem 1 by showing identifiabilit y in cases A and B of Lemma 3. W e do this by fi rst establishing some inequalities for the eigen v alues of Q that m ust hold in eac h of these cases, using the assumption that the off-diagonal entries of Q are p ositiv e. Note that as U − 1 = U T diag( π ), and th e entries of π are p ositiv e, the p ositivit y of the off-diagonal en tries of Q is equiv alen t to th e p ositivit y of the off-diagonal ent ries of the sy m metric matrix e Q = U diag (0 , λ 2 , λ 3 , λ 4 ) U T . Lemma 4. F or κ = 4 , let 0 = λ 1 > λ 2 ≥ λ 3 ≥ λ 4 denote the eigenvalues of a GTR r ate matrix Q . Then the fol lowing additional ine qualities hold in c ases A and B of L e mma 3: Case A: If bc 6 = 0 , then λ 4 > λ 2 + λ 3 , while if bc = 0 , then λ 4 > 2 λ 2 . Case B: λ 4 > 2 λ 2 . Pr o of. F or case A, on e computes th at e Q =         ∗ − λ 2 c 2 − λ 3 b 2 + λ 4 − λ 2 bc + λ 3 bc − λ 4 λ 2 bc − λ 3 bc − λ 4 ∗ ∗ λ 2 bc − λ 3 bc − λ 4 − λ 2 bc + λ 3 bc − λ 4 ∗ ∗ ∗ − λ 2 b 2 − λ 3 c 2 + λ 4 ∗ ∗ ∗ ∗         where the stars indicate quanti ties not of interest. F r om the p ositivit y of the (1,2) and (3,4) entries of e Q , we thus kno w λ 4 > max( λ 2 c 2 + λ 3 b 2 , λ 2 b 2 + λ 3 c 2 ) ≥ ( λ 2 c 2 + λ 3 b 2 ) + ( λ 2 b 2 + λ 3 c 2 ) 2 . Since b 2 + c 2 = 2, this s h o ws λ 4 > λ 2 + λ 3 . In the case when bc = 0, so ( b, c ) = (0 , √ 2) or ( √ 2 , 0), the fi rst inequalit y giv es th e stronger statemen t of the prop osition. Identifiability of a m o del of mole cular evolution 23 F or case B, e Q =         ∗ − 4 λ 2 + 2 λ 3 + λ 4 − 2 λ 3 + λ 4 − λ 4 ∗ ∗ − 2 λ 3 + λ 4 − λ 4 ∗ ∗ ∗ − λ 4 ∗ ∗ ∗ ∗         F rom the p ositivit y of the off-diagonal entries, we see that λ 4 > 2 λ 3 , λ 4 + 2 λ 3 > 4 λ 2 . T ogether, these imp ly that λ 4 > 2 λ 2 . W e no w return to p ro ving ident ifiabilit y for the exceptional cases. As in the pro of of Theorem 2, w e ma y determine the relativ e rankings of t a , t b and t c , and therefore assu me 0 ≤ t a ≤ t b ≤ t c , with t b > 0. In case A, we find that ν 234 = bc , so we b reak th at case into t w o sub cases, Case A1: if b, c 6 = 0; and Case A2: if b or c = 0. Case A1: In this case, we find that ν ij k 6 = 0 for all distinct i, j, k > 1. Letting D 342 = L α ( λ 3 t a + λ 4 t b + λ 2 t c ) , D 423 = L α ( λ 4 t a + λ 2 t b + λ 3 t c ) . and A k , B k , C k b e as in the pro of of Theorem 2, observe th at L − 1 α ( D 342 ) + L − 1 α ( D 423 ) = L − 1 α ( A 2 ) + L − 1 α ( B 3 ) + L − 1 α ( C 4 ) . Setting β = 1 /α and using the explicit form ula for L α yields D − β 342 + D − β 423 − A − β 2 − B − β 3 − C − β 4 + 1 = 0 . (14) 24 Allman, A n´ e, Rho des Note that b y Prop osition 1 all constan ts in this equation, except p ossibly β , are un iquely determined by the joint distribution. In pr eparation for applying Lemm a 2, w e claim that the follo wing inequalities hold: D 342 ≤ A 2 , (15) D 423 < B 3 , (16) D 342 < C 4 , (17) D 423 < C 4 . (18) Inequalities (15,16) follo w easily from the fact that L α is increasing. F or inequalit y (17), note first that λ 3 t a + λ 4 t b + λ 2 t c ≤ ( λ 2 + λ 3 ) t a + λ 4 t b . But Lemma 4 ind icates λ 2 + λ 3 < λ 4 , so, again u sing that L α is increasing, the claim follo ws. I nequalit y (18) is similarly sho wn to hold. Finally , to app ly L emm a 2, let d 1 = D 342 , d 2 = D 423 . The remaind er of the constan ts in the lemma are c hosen in one of three wa ys, dep en ding on which of A 2 , B 3 , C 4 is largest: If C 4 ≥ A 2 , B 3 , then let a = A 2 , b = B 3 , c = C 4 . If A 2 ≥ C 4 , B 3 , then let a = C 4 , b = B 3 , c = A 2 . If B 3 ≥ C 4 , A 2 , then let a = A 2 , b = C 4 , c = B 3 . Th us in all sub cases, fr om equation (14) we find that β > 0 is uniquely determined. The remainder of the pro of no w p ro ceeds exactly as f or Theorem 2 . Cases A2 and B: In b oth of these cases ν 224 6 = 0, s o, similarly to the previous case, letting D 422 = L α ( λ 4 t a + λ 2 t b + λ 2 t c ) , D 242 = L α ( λ 2 t a + λ 4 t b + λ 2 t c ) , leads to D − β 422 + D − β 242 − C − β 4 − A − β 2 − B − β 2 + 1 = 0 . (19) Identifiability of a m o del of mole cular evolution 25 By Prop osition 1, we kno w all quant ities in this equ ation except p ossibly β are uniquely d etermined from the joint d istribution. W e also n ote the follo wing inequalities hold: A 2 ≤ B 2 , (20) D 422 ≤ A 2 , (21) D 242 < B 2 , (22) D 242 < C 4 . (23) Inequalities (20–22) are im p lied by the fact the L α is increasing. Inequalit y (23) w ill follo w from λ 2 ( t a + t c ) < λ 4 t a . Ho w eve r, λ 2 ( t a + t c ) ≤ 2 λ 2 t a < λ 4 t a b y Lemm a 4. T o apply Lemma 2, let d 1 = D 422 and d 2 = D 242 . In ligh t of inequalit y (20), w e need assign the remaining constan ts according to only tw o cases: If C 4 ≥ B 2 , let a = A 2 , b = B 2 , and c = C 4 . If B 2 ≥ C 4 , let a = A 2 , b = C 4 , and c = B 2 . In b oth cases, we fin d β is uniquely determined, and the the p ro of of iden tifia- bilit y can b e completed as in T heorem 2. Th us iden tifiabilit y of the GTR+Γ mo del when κ = 4 is established for all cases. 7. Op en problem s Man y questions remain on the identifiabilit y of phyloge netic mo dels, in clud- ing those commonly used for d ata analysis. P erhaps the most immediate one is the identi fiability of the GTR+Γ+ I mo del. Despite its w id espread use in inference, no pro of has app eared that the tree top ology is iden tifiable for th is mo d el, m uch less its numerical parameters. Although our algebraic arguments of Section 3 apply , analogs for GTR+Γ+I of the an alytic argumen ts we ga v e for GTR+Γ are not ob vious. While th e Γ rate 26 Allman, A n´ e, Rho des distribution h as only one unkn own parameter, Γ+I has tw o, and this increase in dimensionalit y seems to b e at the heart of th e d ifficult y . In terestingly , emp irical studies [14] ha v e also sho wn that th ese p arameters can b e difficult to tease apart, as errors in th eir inf erred v alues can b e highly correlated in some circum stances. Although we conjecture that GTR+Γ+I is identifiable for generic p arameters, w e mak e n o guess as to its identi fiability for all p arameters. F or computational reasons, standard softw are pac k ages f or p h ylogenetic in- ference imp lemen t a discr etized Γ d istr ibution [17], r ather than the contin u ous one dealt with in this pap er. While resu lts on contin uou s distributions are suggestiv e of w hat might h old in the discrete case, they offer no guaran tee. It w ould therefore also b e highly desirable to hav e pro ofs of the identifiabilit y of the d iscretized v ariants of GTR+ Γ and GTR+Γ+ I, either for generic or all parameters. Note that suc h results migh t dep end on the n umb er of discrete rate classes used, as well as on other details of the discretization p ro cess. So far the only result in this d irection is that of [1] on th e id entifiabilit y of the tree parameter, for generic numerical parameter choice s wh en the num b er of rate classes is less than the num b er of observ able charac ter states ( e.g. , at m ost 3 rate classes for 4-stat e nucleotide mo dels, or at most 60 rate classes for 61-state co don mo d els). As the arguments in that w ork use n o sp ecial features of a Γ d istribution, or eve n of an across-site rate v ariation m o del, w e sus p ect that stronger claims sh ould hold when sp ecializing to a p articular form of a discrete rate distrib u tion. Finally , w e ment ion that b ey ond [1], almost nothing is kn o wn on ident ifia- bilit y of m o dels with other types of heterogeneit y , s u c h as co v arion-lik e mo dels and general mixtures. As these are of gro wing interest for add ressing biologica l questions [5, 10, 7], m uch remains to b e u ndersto o d . Ac kno wledgem en ts ESA and JAR th an k the In stitute for Mathemati cs and Its Applications and Identifiability of a m o del of mole cular evolution 27 the Isaac Newton Institute, wher e parts of this wo rk were undertak en, for their hospitalit y and fu nding. W ork b y ESA and JAR was also sup p orted b y the National Science F ound ation (DMS 07148 30). References [1] Elizab eth S . Allman and John A. Rho d es. The identifiabilit y of tree top ology for ph ylogenetic mo d els, in clud ing co v arion and mixture mo d els. J. Comput. Biol. , 13(5):1101 –1113, 2006. arX iv:q-bio .PE/0511 009 . [2] Elizab eth S. Allman and John A. Rho d es. Id en tifying evolutio nary tr ees and subs titution parameters for the general Marko v mo del with in v ariable sites. M ath. Biosci. , 2007. T o app ear, arXiv :q-bio.P E/070205 0 . [3] Joseph T. Chang. F u ll reconstruction of Mark ov mo d els on ev olutionary trees: id en tifiabilit y and consistency . Math. Biosci. , 137(1):51 –73, 1996. [4] Joseph F elsenstein. Inferring Phylo genies . S inauer Asso ciates, S underland , MA, 2004. [5] Olivier Ga scuel and Stephane Guidon. Mo delling the v ariabilit y of ev olutionary pr o cesses. In Olivier Gascuel and Mik e Steel, editors, R e- c onstructing Evolution: New M athematic al and Computationa l A dvanc es , pages 65–107. O xford Unive rsity Press, 2007. [6] Roger A. Horn and Charles R. John son. Matrix Anal ysis . Cam brid ge Univ ersit y Press, 1985. [7] Bryan Kolaczk o wski and Joseph Thornto n. Pe rform ance of maximum parsimon y and like liho o d phylog enetics wh en evo lution is heterogeneous. Natur e , 431:98 0–984, 2004. [8] F rederic k A. Matsen, Elc hanan Mossel, and Mik e Steel. Mixe d-u p trees: the structur e of phylogeneti c mixtu r es. Bul l. M ath. Biol. , to app ear. arXiv:07 05.4328 . 28 Allman, A n´ e, Rho des [9] F rederic k A. Matsen and Mike A. Steel. Ph ylogenetic mixtur es on a s in gle tree can mimic a tree of another top ology . Syst. Biol. , 56(5):7 67–775, 2007. [10] Mark Pag el and An drew Meade. A ph ylogenetic mixtu r e mo del for detecting pattern-heterogeneit y in gene sequence or c haracter-state data. Syst. Biol. , 53(4):5 71–581, 2004. [11] James S . Rogers. Maxim um lik eliho o d estimation of phylogenet ic trees is consisten t wh en su bstitution rates v ary according to the inv ariable sites plus gamma d istribution. Syst. Biol. , 50(5):7 13–722, 2001. [12] Charles Semple and Mik e Steel. Phylo genetics , v olume 24 of O xfor d L e ctur e Series in M athematics and its Applic ations . Oxford Universit y Press, Oxford, 2003. [13] M.A. S teel, L. Sz´ ek ely , and M.D. Hend y . Reconstructing trees from sequences whose sites ev olv e at v ariable rates. J. Comput. Biol. , 1(2):1 53– 163, 1994. [14] Jac k Sulliv an, Da vid L. S w offord, an d Ga vin J. P . Na ylor. The effect of taxon sampling on estimating rate h eterogeneit y parameters of m axim um- lik eliho o d mo dels. Mole cular Biolo gy and Evolution , 16(10 ):1347–13 56, 1999. [15] Daniel ˇ Stefank o vi ˇ c and Eric Vigod a. Ph ylogen y of mixture mod els: Robustness of maximum like liho o d and non-identifiable distribu tions. J. Comput. Biol. , 14(2):156–1 89, 2007. arXiv:q -bio.PE/ 0609038 . [16] Daniel ˇ Stefank o vi ˇ c and Eric Vigod a. Pitfalls of heterogeneous pro cesses for phyloge netic reconstruction. Sys. Biol. , 56(1):113– 124, 2007. [17] Zihen g Y ang. Maxim um like liho o d ph ylogenetic estimation from DNA sequences with v ariable rates o ver sites: aproxi mate m etho ds. J. Mol. Evol. , 39:306– 314, 1994. Identifiability of a m o del of mole cular evolution 29 App endix A. The g aps i n Rogers’ pro o f Here we explain the gaps in the pu blished pro of of Rogers [11] that the GTR+Γ+I mo del is iden tifiable. Since that pap er has b een widely cited and accepted as correct, our goal is to clearly in dicate where the argument is flaw ed, and illustrate, through some examples, th e natur e of the logical gaps. W e emp hasize that we do not pr ove that the gaps in th e p ublished argumen t cannot b e bridged. In deed, it seems most likel y that the GTR+Γ+I mo del is iden tifiable, at least for generic parameters, and it is p ossible a correct pr o of migh t follo w the rough outline of [11]. Ho w ev er, w e hav e not b een able to complete the argument Rogers attempts. Our o wn p ro of of the iden tifiabilit y of the GTR+ Γ mo del presented in the b o dy of this pap er follo ws a d ifferen t line of argu m en t. W e assume the r eader of this app endix will consu lt [11], as pinp ointing the fla ws in that p ap er requires rather tec hnical atten tion to the details in it. A.1. Gaps in the publ ished pro of There are tw o gaps in Rogers’ argu m en t wh ic h we ha v e id en tified. In this section we ind icate the lo cations and n ature of these fla ws, and in subsequent ones we elab orate on them individu ally . The fir st gap in the argument o ccurs roughly at the break fr om page 717 to page 718 of th e article. T o explain the gap, w e first outline Rogers’ w ork leading up to it. Before this p oin t, prop erties of the graph of the function ν − 1 ( µ ( x )) ha v e b een carefully derive d. An example of such a grap h , for p articular v alues of the parameters α, a, π , p o ccurring in the d efinitions of ν and µ , is shown in Figure 2 of the pap er. F or these parameter v alues and others, th e article has carefully and correctly shown that for x ≥ 0 the graph of ν − 1 ( µ ( x )) 1. is increasing, 2. has a single in flection p oin t, wher e the graph changes from conv ex to conca ve (i.e, the conca vit y c hanges f r om upw ard to d o wnw ard ), 30 Allman, A n´ e, Rho des 3. has a horizon tal asymptote as x → ∞ . Although the article outlines other cases for different ranges of the p arameter v alues, Rogers highlights the case when these three p rop erties hold. A t the top of page 718 of the article, Figure 3 is presented, p lotting the p oints whose co ord in ates are given by the pairs ( ν − 1 ( µ ( τ 1 λ i )) , ν − 1 ( µ ( τ 2 λ i ))) for all λ i ≥ 0. Here τ 2 > τ 1 are particular v alues, wh ile α, a, π , p are giv en the v alues leading to Figure 2. Rogers p oin ts out that “As in Figure 2, th e graph [of Figure 3] has an inflection p oin t, is conca v e upw ard s b efore the infl ection p oint, and is conca ve do wnw ards after the inflection p oint. ” Then h e claims that “Similar graph s will b e pro du ced for an y p air of path distances su c h that τ 2 > τ 1 .” Ho w ev er, he giv es no argumen t for this claim. As the remainder of the argumen t strongly u ses the conca vity prop erties of the graph of his Figure 3 (in the s econd column on page 718 the p hrase “. . . as shown b y Figure 3” app ears), without a pro of of this claim the main result of the p ap er is left u nprov ed. Judging from the con text in which it is placed, a more complete statemen t of the un pro ve d claim would b e that for any v alues of α, a, π , p r esulting in a graph of ν − 1 ( µ ( x )) with the geometric prop er ties of Figur e 2, and an y τ 2 > τ 1 , the graph analogous to Figure 3 has a single infl ection p oin t. As no argumen t is giv en to establish the claim, we can only guess what the author intended for its justification. F rom w hat app ears earlier in the pap er, it seems lik ely that the author b eliev ed the three geometric prop erties of th e graph in Figure 2 enumerated ab o ve implied the claimed prop erties of Figure 3. Ho w ev er, that is definitely not the case, as w e will sh o w in Section A.2 b elo w. Note that we d o not assert that the graphs analogous to Figure 3 for v arious parameter v alues are n ot as d escrib ed in [11]. While plots of them for many c hoices of parameter v alues certainly suggest that Rogers’ claim holds, it is of course inv alid to claim a pro of fr om examples. Mo reov er, with 4 parameters α, a, π , p to v ary , it is not clear how confid ent one sh ould b e of ev en havi ng explored th e parameter sp ace well enough to make a solid conjecture. In light Identifiability of a m o del of mole cular evolution 31 of the examp le we giv e in Section A.2, justifying Rogers’ claim w ould require a m uch more detailed analysis of the fu n ctions ν and µ than Rogers attempts. If this fir st gap in the p ro of w ere fi lled, a second p roblem wo uld remain. Though less fun damen tal to the ov erall argument, this gap w ould mean that iden tifiabilit y of the mo del w ould b e established for generic parameters, but that there migh t b e exceptional choice s of p arameters for which ident ifiability failed. (‘Generic’ here can b e tak en to mean for all parameters except th ose lying in a set of Leb esgue measure zero in p arameter sp ace. More inf orm ally , for an y reasonable pr obabilit y distribution placed on the parameter space, randomly-c hosen parameters will b e generic.) Although the origin of th is pr oblem with non-generic parameters is clearly p ointe d out b y Rogers, it is op en to int erp r etation whether he attempts to extend the p ro of to all parameter v alues at the v ery end of the article. Ho w ev er, as the abstract and int ro ductory material of [11] mak e no mentio n of the issue, this p oin t at the very least seems to hav e escap ed many readers atten tion. This gap o ccurs b ecause the pu blished argument requires that the n on-zero eigen v alues of the GTR rate matrix Q b e three distinct num b ers. On p age 718, at the conclusion of the main argumen t, it is stated that “Therefore, if the substitution rate matrix has th r ee distinct eigen v alues, the p arameters of the I+Γ rate heterogeneit y will b e uniqu ely determined.” Th e author then goes on to p oint ou t that for the Juk es-Canto r an d Kim ura 2-parameter mo dels this assumption on eigen v alues is violated, but “[f ]or r eal data sets, ho wev er, it is unlik ely that any tw o or all three of the eigen v alues will b e exactly identica l.” Lea ving aside the question of wh at parameters one might ha ve for a mo del whic h fits a r eal data set w ell, R ogers here clearly ind icates that his pro of of iden tifiabilit y up to this p oint omits some exceptional cases. In the concluding lines of the pap er, he p oin ts out that these exceptional cases can b e approxi- mated arbitrarily closely by parameters with thr ee distinct eigenv alues. While this is true, such an obs er v ation cannot b e used to argue that th e exceptional 32 Allman, A n´ e, Rho des cases are n ot exceptional, as we w ill d iscuss b elo w in Section A.3. It is unclear whether the concluding lines of [11] w ere mean t to ‘fi ll the gap’ or n ot. Of course, one migh t not b e to o concerned ab out exceptional cases. Indeed, if the first flaw w ere not p resen t in his argument, then Rogers’ pro of would still b e a v aluable con tribu tion in showing th at for ‘most’ parameter v alues iden tifiabilit y h eld. One migh t then lo ok for other arguments to s h o w ident ifi- abilit y also held in the exceptional cases. Nonetheless, it is disapp oin ting that the exceptional cases include mo d els suc h as the Jukes-Can tor and Kimura 2- parameter that are w ell-kno wn to biologists and might b e considered at least reasonable appr o ximations of realit y in s ome circumstances. A.2. A coun terexample to the graphical argume nt It seems that the origin of the first fl a w in Rogers’ argumen t is in a b elief that the thr ee en umerated prop erties he pro ves are exh ibited in h is Figur e 2 result in the claimed pr op erties of his Figure 3. In this section, we sho w this implication is not v alid, b y exh ibiting a function wh ose graph has the three prop erties, but w h en the graph analogous to Figure 3 is constructed, it has m ultiple inflection p oin ts. Let f ( x ) = Z x 0 exp  exp  − 10( t − 1) 2  − (1 − t ) 2 10  dt. Then f (0) = 0, and f ′ ( x ) = exp  exp  − 10( x − 1) 2  − (1 − x ) 2 10  , so f ′ ( x ) > 0 and f is in cr easing. F urthermore, one sees that f ′ ( x ) d eca ys quic kly enough to 0 as x → ∞ , so that f ( x ) has a horizonta l asymp tote as x → ∞ . T o see that f ( x ) has a single in flection p oint where the graph passes from con v ex to conca v e, it is enough to show f ′ ( x ) has a unique lo cal m axim um and no lo cal min im a. But this wo uld f ollo w from g ( x ) = ln( f ′ ( x )) having a u nique Identifiability of a m o del of mole cular evolution 33 lo cal maxim um and n o lo cal minima. Since g ( x ) = exp  − 10( x − 1) 2  − (1 − x ) 2 10 , and the tw o s ummands her e hav e u nique lo cal maxima at x = 1 and no lo cal minima, g must as we ll. Thus f exh ibits the enumerated prop erties of Rogers’ Figure 2. F or comparison, w e graph f in our Figure 2 b elo w . 0 1 2 3 4 2 4 6 8 10 Figure 2: The graph y = f ( x ). The analog of Figure 3 for the function f wo uld sho w the p oin ts ( f ( τ 1 x ) , f ( τ 2 x )). If w e c ho ose τ 1 = 1 , τ 2 = 2, w e obtain the graph sho wn in our Figure 3. Ob viously , the curv e in Figure 3 has multiple — at least thr ee — in fl ection 0 1 2 3 4 1 2 3 4 Figure 3: The p oints ( f ( x ) , f (2 x )). p oints. Although w e will not giv e a formal pr o of h ere that th is curve has m ultiple inflection p oin ts, it is not difficult to d o so. A.3. Iden tifiabil i t y for gene ric parameters vs. all parameters The second gap in Rogers’ argument arises b ecause it is p ossible to ha ve iden tifiabilit y for generic p arameters, but n ot for all parameters. Even if iden tifiabilit y of generic parameters has b een p ro v ed, th en one cann ot easily 34 Allman, A n´ e, Rho des argue that identifiabilit y must hold for th e non-generic, exceptional cases as w ell. T o illustrate th is, w e giv e a s im p le example. Consider the map φ : R 2 → R 2 , defi ned by φ ( a, b ) = ( a, ab ) . Here a, b pla y the roles of ‘parameters’ for a hyp othetical mo del, wh ose ‘joint distribution’ is giv en by the v ector-v alued function φ . Supp ose ( x, y ) is a particular distribu tion wh ic h arises f r om the mo d el ( i.e. , is in the im age of φ ), and we wish to find a, b such that φ ( a, b ) = ( x, y ). Th en pro vided x 6 = 0 (or equiv alen tly a 6 = 0), it is str aigh tforw ard to see that a, b m ust b e giv en by the formulas a = x, b = y /x. Th us for generic a, b (more sp ecifically , for all ( a, b ) with a 6 = 0) this hyp othetical mo del is id en tifiable. Notice, h o w ev er, that if ( x, y ) = (0 , 0), the situation is qu ite differen t. F rom x = 0, we see th at we m ust h a v e a = 0. But since φ (0 , b ) = (0 , 0), we find that all parameters of the form (0 , b ) lead to the same d istribution (0 , 0). T h us th ese exceptional parameters are not iden tifiable. Therefore, we ha ve identifiabilit y precisely for all parameters in th e 2-dimensional ab -plane exc ept th ose lying on the 1-dimensional line where a = 0. These exceptional parameters, form ing a set of lo w er d imension than the f u ll space, h a v e Leb esgue measure zero w ithin it. Notice that ev en though there are parameter v alues arbitrarily close to the exceptional ones (0 , b ) which are identifiable (for instance, ( ǫ, b ) for an y small ǫ 6 = 0), it is inv alid to argue that the parameters (0 , b ) m ust b e identi fiable as w ell. This example sho ws that ev en if the firs t flaw in the argumen t of [11] we re repaired, the approac h outlined th ere will at b est giv e iden tifiabilit y for generic Identifiability of a m o del of mole cular evolution 35 parameters. Th e final lines of that p ap er are n ot sufficient to pr ov e iden tifia- bilit y f or all parameter v alues. Ob viously the fu nction φ giv en h ere could not really b e a join t distrib ution for a statistica l mo del, since the en tries of the v ector φ ( a, b ) do not add to one, nor are they necessarily non -n egativ e. Ho wev er, these features can b e easily work ed in to a more complicated examp le. If one p refers a less contriv ed example, then instances of generic iden tifiabilit y of parameters but not full identifiabilit y o ccur in standard statistical mo dels u sed outside of phyloge netics (for in s tance, in laten t class mo dels). W e hav e c hosen to giv e this simpler example to highligh t the essen tial p roblem most clearly .

Original Paper

Loading high-quality paper...

Comments & Academic Discussion

Loading comments...

Leave a Comment