Computing Geodesic Distances in Tree Space

We present two algorithms for computing the geodesic distance between phylogenetic trees in tree space, as introduced by Billera, Holmes, and Vogtmann (2001). We show that the possible combinatorial types of shortest paths between two trees can be co…

Authors: Megan Owen

Computing Geodesic Distances in Tree Space
Computing Geo desic Distances in T ree Space Megan Ow en ∗ Abstract W e presen t t wo algorithms for computing the geodesic distance b et w een ph ylogenetic trees in tree space, as introduced by Billera, Holmes, and V ogtmann (2001). W e sho w that the p ossible com binatorial t ypes of shortest paths b et w een tw o trees can b e compactly represented b y a partially ordered set. W e calculate the shortest distance along each candidate path b y con verting the problem in to one of finding the shortest path through a certain region of Euclidean space. In particular, we sho w there is a linear time algorithm for finding the shortest path betw een a p oin t in the all p ositiv e orthan t and a point in the all negativ e orthant of R k con tained in the subspace of R k consisting of all orthants with the first i co ordinates non-p ositiv e and the remaining co ordinates non-negative for 0 ≤ i ≤ k . 1 In tro duction Ph ylogenetic trees, or ph ylogenies, are used throughout biology to understand the ev olutionary history of organisms ranging from primates to the HIV virus. Outside of biology , they are used in studying the evolution of languages and culture, for example. Often, reconstruction metho ds give m ultiple plausible phylogenetic trees on the same set of taxa, whic h w e wish to compare using a quan titativ e distance measure. A more general op en question is how best to analyze sets of trees in a statistically rigourous manner, for example, b y providing confidence in terv als for the generated trees. The tree space of Billera, Holmes, and V ogtmann [3] and its corresponding geo desic distance measure w ere developed to provide a framew ork for addressing these issues ([13] and [14]). In this pap er, w e giv e sev eral com binatorial and metric properties of this space in the process of dev eloping t w o practical algorithms for computing this distance. There are many differen t algorithms to construct ph ylogenetic trees from biological data ([9] and its references), but their accuracy can b e affected b y suc h factors as the underlying tree shap e [12] or the rate of mutation in the DNA sequences used [15]. T o compare these metho ds through sim ulation, or to find the likelihoo d that a certain tree is generated from the data, researchers need to be able to compute a biologically meaningful distance b et w een trees [15]. Several different distances b et w een ph ylogenetic trees ha ve b een prop osed (e.g. [7], [10], [11], [23], [25]). With the exception of the weigh ted Robinson-F oulds distance [24], none of these distances incorp orate tree edge lengths. In resp onse to the need for a distance measure betw een ph ylogenetic trees that naturally in- corp orates both the tree top ology and the lengths of the edges, Billera et al. [3] in tro duced the ge o desic distanc e . This distance measure is deriv ed from the tree space, T n , which contains all ph ylogenetic trees with n leav es. The tree space is formed from a set of Euclidean regions, called ∗ maowen@berkeley.edu . Universit y of California Berkeley , Berkeley , CA, 95720. This w ork was supp orted in part b y NSF grant DMS-0555268 at Cornell Univ ersit y . A 2-page extended abstract of a preliminary version of Section 4 w as published in the online pro ceedings of the 17th F all W orkshop on Computational and Combinatorial Geometry (FW CG 2007). 1 orthan ts, one for each top ologically differen t tree. Two regions are connected if their corresp onding trees are considered to b e neigh b ours. Eac h ph ylogenetic tree with n leav es is represented as a p oin t within this space. There is a unique shortest path, called the ge o desic , betw een each pair of trees. The length of this path is our distance metric. The most closely related work is by Staple [29] and Kup czok et al. [16], who dev elop ed al- gorithms to compute the geo desic distance based on the notes of V ogtmann [30]. Both of these algorithms are exp onen tial in the num ber of different edges in the t w o trees. Although Kup czok et al. dev elop ed their algorithm GeoMeTree independently , it can b e considered a direct impro v e- men t to the algorithm of Staple. W e sho w in Section 5 that our algorithm p erforms significantly b etter than GeoMeTree , although it is still exp onential. A p olynomial time, √ 2-appro ximation of the geo desic distance w as giv en by Amenta et al. [1]. Since the submission of this pap er, a p olynomial time algorithm has b een developed to compute the geo desic distance [21]. Our primary con tribution is the three main combinatorial and geometric ideas b ehind the tw o algorithms w e give for computing the geo desic distance. First, the candidate shortest paths b et w een trees can b e represen ted as an easily constructible partially ordered set, giving information ab out the combinatorics of the tree space. Second, w e can find the length of each candidate shortest path b y translating the problem into one of finding the shortest path through a region of a low er dimensional Euclidean space. The solution to this new problem is a linear algorithm for a sp ecial case of the Euclidean shortest-path problem with obstacles. Since the general problem is NP-hard for dimensions greater than 2, this result is also of interest to computational geometers. Finally , we sho w that the combinatorics of the geo desic dep end on the com binatorics of the geo desic b etw een t w o simpler trees. This observ ation mak es it possible to use either a dynamic programming or a divide and conquer approach to significantly reduce the search space. The tw o resulting algorithms are computationally practical on some biological data sets of in terest. The remainder of this paper is organized as follows. In Section 2, we describ e the tree space and the geo desic distance. The problem of finding the geo desic distance has b oth a com binatorial comp onen t, which is in vestigated in Section 3, and a geometric comp onen t, whic h is co v ered in Section 4. More sp ecifically , w e in troduce a combinatorial framew ork in Section 3, whic h represen ts the candidate shortest paths b et w een trees b y an easily constructible partially ordered set (Theo- rem 3.7). In Section 4, we translate the problem of calculating the length of a candidate shortest path into a problem in Euclidean space (Theorem 4.4), and then show that this Euclidean problem can be solved in linear time (Theorem 4.10 and Theorem 4.11). Section 5 combines the ideas of Sections 3 and 4 to show that the path tak en by a geo desic is related to the geo desic path b et w een t w o simpler trees (Theorem 5.2). This theorem is exploited via dynamic programming and divide and conquer techniques to giv e tw o algorithms. 2 T ree Space and Geo desic Distance This section describes the space of ph ylogenetic trees, T n , and the geo desic distance. F or further details, see [3]. A phylo genetic tr e e , or just tr e e , T = ( X , Σ) is a ro oted tree, whose lea v es are in bijection with a set of lab els X represen ting differen t organisms, and whose interior edges are represen ted b y the set Σ of non-trivial splits . F or this pap er, let X = { 1 , ..., n } . The ro ot is lab elled with 0 and sometimes treated like a leaf. W e consider b oth bifurcating (or binary) trees, in whic h eac h in terior v ertex has degree 3, and m ultifurcating (or degenerate) trees, in whic h at least one in terior vertex has degree > 3. A split A | B is a partition of X ∪ { 0 } in to t wo non-empty sets A and B . A split is in T if it corresp onds to some edge e in T , such that deleting edge e from T divides T in to tw o subtrees, 2 with one subtree containing exactly the lea ves in A and the other subtree con taining exactly the lea v es in B . F or example, in Figure 1, the split corresp onding to the edge e 3 partitions the leav es in to the sets { 2 , 3 } and { 0 , 1 , 4 , 5 } . W e will refer to a split corresp onding to an edge ending in a leaf as a trivial split , and to all other splits as simply splits. A split of typ e n is a partition of the set { 0 , 1 , ..., n } into t w o blo cks, each containing at least tw o elemen ts. If A ⊆ Σ is a set of splits in T , then let T / A b e the tree T with the edges that correspond to A contracted. Tw o 2 4 3 1 5 e 1 e 3 e 2 0 Figure 1: The split corresp onding to the edge e 3 . splits e = X | X 0 and e 0 = Y | Y 0 are c omp atible if one of X ∩ Y , X ∩ Y 0 , X 0 ∩ Y or X 0 ∩ Y 0 is empt y . Equiv alently , t wo splits are compatible if their corresp onding edges can exist in the same ph ylogenetic tree. F or example, in Figure 1, the split e 3 = { 2 , 3 }|{ 0 , 1 , 4 , 5 } is compatible with the split e 2 = { 2 , 3 , 4 }|{ 0 , 1 , 5 } , b ecause { 2 , 3 } ∩ { 0 , 1 , 5 } = ∅ . How ever, e 3 is incompatible with f = { 1 , 2 }|{ 0 , 3 , 4 , 5 } . Two sets of mutually compatible splits of t yp e n , A and B , are c omp atible if A ∪ B is a set of mutually compatible splits. F or a tree T = ( X , Σ), each edge, and hence split, e ∈ Σ is asso ciated with a non-negative length | e | T . F or example, this length often represen ts the exp ected n um b er of mutations p er DNA c haracter site. Two splits are considered the same if they hav e iden tical partitions, regardless of their asso ciated lengths. F or any set of compatible splits A ⊆ Σ, let k A k = q P e ∈ A | e | 2 T . 2.1 T ree Space W e no w describ e the space of ph ylogenetic trees, T n , as constructed b y Billera et al. [3]. It is homeomorphic, but not isometric, to the tropical Grassmannian [27] and the Bergman fan of the graphic matroid of the complete graph [2]. This space con tains all bifurcating and multifurcating ph ylogenetic trees with n leav es. In this space, each tree topology with n lea v es is asso ciated with a Euclidean region, called an orthant . The p oints in the orthant represent trees with the same top ology , but different edge lengths. These orthan ts are attac hed, or glued together, to form the tree space. W e do not use the lengths of the edges ending in lea v es in the definition of tree space, but can easily include them by considering geo desics through T n × R n + , as noted in Billera et al. [3]. An y set of n − 2 compatible splits corresp onds to a unique rooted ph ylogenetic tree top ology [26, Theorem 3.1.4]. F or any such split set Σ corresp onding to tree T , asso ciate eac h split with a v ector suc h that the n − 2 v ectors are mutually orthogonal. The cone formed by these v e ctors is the orthan t asso ciated with the top ology of T . Recall that the k -dimensional (nonne gative) orthant is the non-negativ e part of R k , denoted R k + . A p oin t ( x 1 , ..., x n − 2 ) in R n − 2 + represen ts the tree in whic h the edge associated with the i -axis has length x i , for all 1 ≤ i ≤ n − 2, as illustrated in Figure 2(a). If x i = 0, then the tree is on a face of the orthan t, and we sa y that it do es not con tain the edge asso ciated with the i -axis. F urthermore, t wo orthants can share the same b oundary face, and th us 3 are attached. F or example, in Figure 2(a), the trees T 1 and T 0 1 are represented as tw o distinct p oin ts in the same orthan t, b ecause they hav e the same top ology , but different edge lengths. The tree T 0 has only one edge, e 1 , and thus is a p oin t on the e 1 axis. Notice that although Figure 2(a) is dra wn in the plane, it actually sits in R 3 , with eac h of the axes or splits corresponding to a differen t dimension. In general, T n sits in R N , where N = 2 n − n − 2 is the n um b er of possible splits of type n . How ever, as no p oin t in T n has a negativ e co ordinate in R N , we may dra w the p ositiv e and negative parts of an axis as corresponding to different splits. e 2 e 3 0 e 2 e 1 1 2 3 4 0 e 1 0 2 1 3 4 e 1 e 3 0 1 2 3 4 e 1 = geodesic T 1 T 2 T 0 e 1 ’ e 2 ’ 2 1 4 3 0 T 1 ’ (a) Two orthan ts in T 4 . T 2 e 2 e 1 1 2 3 4 0 e 1 e 3 0 1 2 3 4 e 4 e 3 0 1 2 3 4 e 4 e 5 0 1 2 3 4 e 5 e 5 e 2 0 1 2 3 4 e 4 e 3 e 1 e 2 T 1 ’ T 2 ’ T 1 = geodesic (b) Both edge length and tree topology determine the geo desic. Figure 2: The geometry of tree space. F or an y set A of compatible splits with lengths, let T ( A ) represent the tree con taining exactly the edges corresp onding to the splits A , with the given lengths. Let O ( A ) b e the orthant of lo w est dimension containing T ( A ). F or an y t ≥ 0, let t · A b e the set of splits A whose lengths hav e all b een m ultiplied by t . If A and B are tw o compatible sets of m utually compatible splits of t yp e n , then we define the binary op erator + on the orthants of T n b y O ( A ) + O ( B ) = O ( A ∪ B ). 2.2 Geo desic Distance There is a natural metric on T n . The distance b et ween t w o trees in the same orthan t is the Euclidean distance b et ween them. The distance b et w een tw o trees in differen t orthan ts is the length of the shortest path betw een them, where the length of a path is the sum of the Euclidean lengths of the in tersections of this path with eac h orthant. F or any trees T 1 and T 2 in T n , the ge o desic distanc e , d ( T 1 , T 2 ), betw een T 1 and T 2 is the length of the ge o desic , or lo cally shortest path, betw een T 1 and T 2 in T n . Billera et al. defined this distance, and prov ed that T n is non-p ositiv ely curv ed [5], and in particular CA T(0) [3, Lemma 4.1], and thus the geo desic b etw een an y t w o trees in T n is unique. F or example, in Figure 2(a), the geo desic b etw een the trees T 1 and T 2 is represented b y the dashed line. Figure 2(b) depicts 5 of the 15 orthants in T 4 . This figure also illustrates that the edge lengths, in addition to the tree top ologies, determine the in termediate orthants through whic h the geo desic passes. 2.3 The Essen tial Problem The problem of finding the geo desic b et ween tw o arbitrary trees in T n can b e reduced in p oly- nomial time to the problem of finding the geo desic betw een t w o trees with no splits in common. 4 F urthermore, the lengths of the p endant edges can easily b e included in the distance calculation, if desired. V ogtmann [30] pro ved the follo wing theorem, which explains ho w to decomp ose the problem of finding the geo desic when the trees share a common split. An alternative pro of is given in [20]. Let T 1 and T 2 b e t w o trees with a common split e = X | Y , where 0 ∈ X , as shown in Figure 3(a). F or i ∈ { 1 , 2 } , let T X i b e the tree T i with edge e and any edge b elo w e contracted. That is, an y edge e 0 = X 0 | Y 0 suc h that X 0 ⊂ Y or Y 0 ⊂ Y is contracted, as sho wn in Figure 3(b). F or i ∈ { 1 , 2 } , let T Y i b e the tree T i formed b y contracting edge e and all edges not contracted in T X i . That is, any edge e 0 = X 0 | Y 0 suc h that X 0 ⊂ X or Y 0 ⊂ X is contracted, as in Figure 3(c). e { 0 { X \ 0 Y (a) T ree T i . { 0 { X \ 0 Y (b) T ree T X i . { 0 { X \ 0 Y (c) T ree T Y i . Figure 3: F orming the trees T X i and T Y i from T i for i ∈ { 1 , 2 } . Theorem 2.1. If T 1 and T 2 have a c ommon split e , and T X i and T Y i ar e as describ e d in the ab ove p ar agr aph for i ∈ { 1 , 2 } , then d ( T 1 , T 2 ) = q d ( T X 1 , T X 2 ) 2 + d ( T Y 1 , T Y 2 ) 2 + ( | e | T 1 − | e | T 2 ) 2 . As noted in Section 2.1, the length of the edges ending in leav es can be included in the distance calculations b y considering the product space T n × R n + , and the shortest distance, d l ( T 1 , T 2 ), betw een the trees in this space. In this case, if the length of the edge to leaf i in tree T is | l i | T for all 1 ≤ i ≤ n , then d l ( T 1 , T 2 ) = q d ( T 1 , T 2 ) 2 + P n i =1 ( | l i | T 1 − | l i | T 2 ) 2 . Therefore, the essential problem is as follows, and w e dev ote the rest of this paper to it. Problem 1. Find the geo desic distance b et ween T 1 and T 2 , tw o trees in T n with no common splits. 3 Com binatorics of Path Spaces The properties of the geodesic imply that it is restricted to certain orthants in the tree space. In this section, we mo del this section of tree space as a partially ordered set (p oset), called the p ath p oset , in which each element corresp onds to an orthant in tree space. This p oset enables us to en umerate all orthan t sequences that could con tain the geo desic, b ecause each such orthan t sequence, called a p ath sp ac e , corresp onds to one of the maximal chains of this p oset b y Theorem 3.7. F or this section, assume that T 1 = ( X, Σ 1 ) and T 2 = ( X , Σ 2 ) are tw o trees in T n with no common splits. That is, Σ 1 ∩ Σ 2 = ∅ . 3.1 The Incompatibilit y and P ath P artially Ordered Sets W e first define the incompatibility p oset, which enco des the incompatibilities b et ween splits in T 1 and T 2 . It will b e used to construct the path p oset. T o define these p osets, we introduce the follo wing tw o definitions. 5 Let A and B b e tw o sets of mutually compatible splits of t yp e n , such that A ∩ B = ∅ . Define the c omp atibility set of A in B , C B ( A ), to b e the set of splits in B which are compatible with ev ery split in A . Define the cr ossing set of A in B , X B ( A ), to b e the set of splits in B which are incompatible with at least one split in A . If D is a set of mutually compatible splits of type n such that D ⊆ A , then: 1. C B ( A ) ⊆ C B ( D ) ( opp osite monotonicity of the c omp atibility set ), 2. X B ( D ) ⊆ X B ( A ) ( monotonicity of the cr ossing set ), 3. C B ( A ) and X B ( A ) partition B ( p artitioning ). A pr ep oset or quasi-or der e d set is a set P and binary relation ≤ that is reflexive and transitiv e. See [28, Exercise 1] for more details. Define the inc omp atibility pr ep oset , e P (Σ 1 , Σ 2 ), to b e the prep oset containing the elemen ts of Σ 2 , ordered by inclusion of their crossing sets. So, for any f , f 0 ∈ Σ 2 , f ≤ f 0 in e P (Σ 1 , Σ 2 ) if and only if X Σ 1 ( f ) ⊆ X Σ 1 ( f 0 ). Define the equiv alence relation f ∼ f 0 if and only if f ≤ f 0 and f 0 ≤ f . Thus, all the splits in an equiv alence class hav e the same crossing set, which w e define to b e the crossing set of that equiv alence class. Definition 3.1. The inc omp atibility p oset , P (Σ 1 , Σ 2 ), consists of the equiv alence classes defined b y ∼ in the prep oset e P (Σ 1 , Σ 2 ) ordered by inclusion of their crossing sets. Generally , we will b e informal, and treat the elements of the incompatibility poset as sets of Σ 2 , ordered b y inclusion of their crossing sets in Σ 1 . F or example, Figure 4(c) sho ws the incompatibilit y p oset P (Σ 1 , Σ 2 ) for the trees T 1 and T 2 , given in Figures 4(a) and 4(b), resp ectively . F or any A ∈ Σ 2 , define A ∈ Σ 2 b y A 7→ A = { f ∈ Σ 2 : X Σ 1 ( f ) ⊆ X Σ 1 ( A ) } . Note that by definition, X Σ 1 ( A ) = X Σ 1 ( A ). The map X 7→ X is a closur e op er ator on a set I if for ev ery subset X ⊂ I , it is extensive ( X ⊂ X ), idemp oten t ( X = X ), and isotone (if X ⊂ Y , then X ⊂ Y ) [4]. F rom the definition and the monotonicity of crossing set, A 7→ A is a closure op erator on Σ 2 . Definition 3.2. The p ath p oset fr om Σ 1 to Σ 2 , K (Σ 1 , Σ 2 ), is the closed sets of Σ 2 ordered by inclusion. The path p oset represen ts the p ossible orthan t sequences con taining the geo desic betw een T 1 and T 2 , and we next mak e clear this corresp ondence. The path p oset is b ounded below b y ∅ , and ab o ve by Σ 2 . It is a sublattice of the lattice of order ideals of P (Σ 1 , Σ 2 ), but need not b e graded [20]. Figure 4(d) gives an example of a path p oset. F or simplicity in the figures, w e omit the brac k ets, writing f 1 f 4 instead of { f 1 , f 4 } , for example. 3.2 P ath Spaces The geo desic is con tained in some sequence of orthants connecting the orthan ts containing T 1 and T 2 . Billera et al. [3] defined a set of orthant sequences, suc h that at least one of them con tains the geo desic. W e call suc h orthan t sequences p ath sp ac es . W e characterize all maximal path spaces in Theorem 3.6, and sho w that they are in one-to-one corresp ondence with the maximal chains in K (Σ 1 , Σ 2 ) in Theorem 3.7. Definition 3.3. F or trees T 1 and T 2 with no common splits, let Σ 1 = E 0 ⊃ E 1 ⊃ ... ⊃ E k − 1 ⊃ E k = ∅ , and ∅ = F 0 ⊂ F 1 ⊂ ... ⊂ F k − 1 ⊂ F k = Σ 2 b e sets of splits such that E i and F i are compatible for all 0 ≤ i ≤ k . Then ∪ k i =0 O ( E i ∪ F i ) is a p ath sp ac e b et ween T 1 and T 2 . 6 4 2 5 3 e 1 e 2 0 6 1 e 4 e 3 (a) T ree T 1 = ( X, Σ 1 ). 3 2 4 1 f 3 f 1 0 5 6 f 4 f 2 (b) T ree T 2 = ( X, Σ 2 ). f 3 , X ! 1 (f 3 ) = {e 1 , e 2 , e 3 , e 4 } f 2 , X ! 1 (f 2 ) = {e 3 , e 4 } f 4 , X ! 1 (f 4 ) = {e 4 } f 1 , X ! 1 (f 1 ) = {e 1 } (c) Incompatibility p oset P (Σ 1 , Σ 2 ) 󲰖 f 1 = { f 1 } f 4 = { f 4 } f 2 = { f 2 , f 4 } f 1 f 4 = { f 1 , f 4 } f 1 f 2 = { f 1 , f 2 , f 4 } f 3 = { f 1 , f 2 , f 3 , f 4 } (d) Path p oset K (Σ 1 , Σ 2 ) Figure 4: The incompatibility p oset for the trees T 1 (a) and T 2 (b) is shown in (c). The crossing sets of the elements of Σ 2 , which are ordered by inclusion to giv e the incompatibilit y p oset, are also shown in the lab els. The path p oset of T 1 and T 2 is given in (d). A path space is a subspace of T n consisting of the closed orthan ts corresp onding to the trees with in terior edges E i ∪ F i for all 0 ≤ i ≤ k . The intersection of O i and O i +1 is the orthan t O ( E i +1 ∪ F i ). If the i th step transforms the tree with splits E i − 1 ∪ F i − 1 in to the tree with splits E i ∪ F i , then at this step we remo ve the splits A i , E i − 1 \ E i and add the splits B i , F i \ F i − 1 . Using this notation, the i -th orthant corresp onds to the splits B 1 ∪ ... ∪ B i ∪ A i +1 ∪ .... ∪ A k . T o simplify notation, let O i = O ( E i ∪ F i ) and O 0 i = O ( E 0 i ∪ F 0 i ). The following prop erty of path spaces follows directly from the definition. Prop osition 3.4. L et ∪ k i =0 O ( E i ∪ F i ) b e a p ath sp ac e b etwe en T 1 and T 2 . Then E i ⊆ C Σ 1 ( F i ) and F i ⊆ C Σ 2 ( E i ) for al l 0 ≤ i ≤ k . Remark 3.5. In order to ensure a unique representation of a path space in terms of E i ’s and F i ’s, w e make the inclusions strict in the definition of a path space. How ever, if w e ha ve sets of splits Σ 1 = E 0 ⊇ E 1 ⊇ · · · ⊇ E k − 1 ⊇ E k = ∅ and ∅ = F 0 ⊆ F 1 ⊆ · · · ⊆ F k − 1 ⊆ F k = Σ 2 suc h that E i and F i are compatible for all 0 ≤ i ≤ k , then ∪ k i =0 O ( E i ∪ F i ) can b e represented by some ∪ k 0 i =0 O 0 i suc h that Σ 1 = E 0 0 ⊃ E 0 1 ⊃ · · · ⊃ E 0 k 0 − 1 ⊃ E 0 k 0 = ∅ and ∅ = F 0 0 ⊂ F 0 1 ⊂ · · · ⊂ F 0 k 0 − 1 ⊂ F k 0 = Σ 2 . T o do this, w e group consecutive E i ’s and F i ’s in to larger sets that are still m utually compatible with eac h other, until w e ha ve a path space. A path space is maximal if it is not con tained in any other path space. Since [3, Prop osition 4.1] prov es that the geo desic is contained in a path space, it m ust b e con tained in some maximal path space. W e now c haracterize the maximal path spaces using split compatibility . 7 Theorem 3.6. The maximal p ath sp ac es fr om T 1 to T 2 ar e exactly those p ath sp ac es ∪ k i =0 O i such that: 1. E i = C Σ 1 ( F i ) , for al l 0 ≤ i ≤ k . 2. F i = C Σ 2 ( E i ) , for al l 0 ≤ i ≤ k . 3. for al l 1 ≤ i ≤ k , the set of splits B i is a minimal element in the inc omp atibility p oset P ( A i ∪ ... ∪ A k , B i ∪ ... ∪ B k ) Pr o of. Let M b e the set of path spaces describ ed in the theorem. W e first show, by contradiction, that all path spaces in M are maximal. Supp ose not. Then there exists some path space M = ∪ k i =0 O i ∈ M that is strictly contained in another path space S 0 = ∪ k 0 i =0 O 0 i . If O j ⊆ O 0 l for some 0 ≤ j ≤ k and some 0 ≤ l 0 ≤ k 0 , then since Σ 1 and Σ 2 are disjoint, we hav e E j ⊆ E 0 l and F j ⊆ F 0 l . By Prop osition 3.4 and the opp osite monotonicit y of compatibility sets, F 0 l ⊆ C Σ 2 ( E 0 l ) ⊆ C Σ 2 ( E j ) = F j , where the last equalit y follo ws from Condition 2 on path spaces in M . Hence, F 0 l = F j . Similarly , E 0 l ⊆ C Σ 1 ( F 0 l ) = C Σ 1 ( F j ) = E j , where the last equality follo ws from Condition 1. Therefore, E 0 l = E j , and hence O j = O 0 l . Therefore, every orthan t of M is also an orthan t of S 0 , and th us S 0 m ust con tain at least one other orthan t not in M . Let j b e the smallest index for such an orthant. More sp ecifically , the orthant O j − 1 is in M and S 0 , but O 0 j , O 0 j +1 , ..., O 0 j + l − 1 are not in M and O j = O 0 j + l . Then b y definition of M and S 0 , B 0 j ⊆ B j and A 0 j ⊆ A j . By Condition 3 and the definition of the incompatibilit y poset, X A j ∪ ... ∪ A k ( B 0 j ) = X A j ∪ ... ∪ A k ( B j ). Therefore, A 0 j = A j , which implies that O 0 j ⊆ O j , a contradiction. Let S = ∪ k i =0 O i b e some path space that is not in M . W e will no w pro v e that S is contained in another path space, S 0 , and hence is not maximal. Since S / ∈ M , at least one of the three conditions do es not hold. Case 1: There exists a 0 ≤ j ≤ k suc h that E 0 = C Σ 1 ( F j ) \ E j is not empt y . That is, Condition 1 do es not hold. W e no w construct a path space in which the splits E 0 are dropp ed at the j -th step instead of an earlier one. Define S 0 = ∪ k i =0 O 0 i , where O 0 i = ( O i + O ( E 0 ) if 0 ≤ i ≤ j O i if j < i ≤ k Since we ha ve only added dimensions to orthants in S to define S 0 and O 0 i ⊂ O i + O ( E 0 ), we ha ve S ⊂ S 0 . It remains to sho w that S 0 is a path space. By definition, E 0 is compatible with F j , and hence F 0 ⊂ ... ⊂ F j − 1 ⊂ F j , so the splits sp ecifying each orthant of S 0 are compatible. Since Σ 1 = E 0 0 ⊇ E 0 1 ⊇ ... ⊇ E 0 j ⊃ ... ⊃ E 0 k = ∅ , then b y Remark 3.5, S 0 can b e relab elled as a path space and hence S is not a maximal path space. Case 2: There exists 0 ≤ j ≤ k suc h that F 0 = C Σ 2 ( E j ) \ F j is not empt y . That is, Condition 2 do es not hold. W e will no w construct a path space in whic h the splits F 0 are added to the tree at the j -th step, instead of a later step. Define S 0 = ∪ k i =0 O 0 i , where O 0 i = ( O i if 0 ≤ i < j O i + O ( F 0 ) if j ≤ i ≤ k By analogous reasoning to Case 1, S 0 is a path space strictly containing S , and therefore S is not maximal. 8 Case 3: Let P = P ( E j − 1 , Σ 2 \ F j − 1 ) = P ( A j ∪ ... ∪ A k , B j ∪ ... ∪ B k ). Neither Case 1 nor Case 2 holds, and, for some 1 ≤ j ≤ k , there exist splits f ∈ B j and g ∈ B j ∪ ... ∪ B k suc h that g < f in P . That is, Conditions 1 and 2 hold, but Condition 3 do es not hold. W e now construct a path space with an extra orthan t, which we get by adding the splits g and f in t wo distinct steps, instead of during the same step. Define S 0 = ∪ k +1 i =0 O 0 i , where O 0 i =      O i if 0 ≤ i < j O  E i − 1 \ X E i − 1 ( g )  + O  F i − 1 ∪ g  if i =j O i − 1 if j < i ≤ k W e w ill first sho w that O 0 j is neither contained in nor contains an y orthant from S , by sho wing that E 0 j − 1 ⊃ E 0 j ⊃ E 0 j +1 and F 0 j − 1 ⊂ F 0 j ⊂ F j +1 . W e must ha ve X E j − 1 ( g ) 6 = ∅ , or else g ∈ C Σ 2 ( E j − 1 ) \ F j − 1 , implying Case 2 holds, which is a contradiction. This implies that E j − 1 ⊃ E j − 1 \ X E j − 1 ( g ), or E 0 j − 1 ⊃ E 0 j . Since g < f in P , w e ha v e X E j − 1 ( g ) ⊂ X E j − 1 ( f ). T o add f at step j in S , we m ust drop all splits in E j − 1 that are incompatible with f , so X E j − 1 ( f ) ⊆ A j . Along with the previous statement, this implies that X E j − 1 ( g ) ⊂ A j , and hence E 0 j +1 ⊂ E 0 j . Therefore, w e hav e shown that E 0 j − 1 ⊃ E 0 j ⊃ E 0 j +1 , as desired. Since g / ∈ F j − 1 , we hav e F j − 1 ⊂ F j − 1 ∪ g , and hence F 0 j − 1 ⊂ F 0 j . It now remains to sho w that F 0 j ⊂ F 0 j +1 , whic h we will do by sho wing that f ∈ F j but f / ∈ F j − 1 ∪ g . The first statemen t follows b ecause f ∈ B j = F j \ F j − 1 . F or the second statement, g < f in P implies X E j − 1 ( g ) ⊂ X E j − 1 ( f ). Since S is a path space, X E j − 1 ( F j − 1 ) = ∅ . Th us, X E j − 1 ( F j − 1 ) ⊂ X E j − 1 ( g ) ⊂ X E j − 1 ( f ), whic h implies that X Σ 1 ( f ) * X Σ 1 ( F j − 1 ) ∪ X Σ 1 ( g ), and hence f / ∈ F j − 1 ∪ g . Therefore, F 0 j − 1 ⊂ F 0 j ⊂ F j +1 . Finally w e sho w that the splits in O 0 j are m utually compatible. By the definitions, C Σ 1 ( F j − 1 ∪ g ) = C Σ 1 ( F j − 1 ) ∩ C Σ 1 ( g ) ⊇ E j − 1 \ X Σ 1 ( g ) ⊇ E j − 1 \ X E j − 1 ( g ), and hence the splits of O 0 j are mutually compatible. The other orthants remain unchanged, and thus S 0 is a path space. Since S 0 strictly contains S , the path space S is not maximal. Recall that in a poset P , x < y is a c over r elation , or y c overs x , if there do es not exist an y z ∈ P such that x < z < y . A chain is a totally ordered subset of a p oset. A c hain is maximal when no other elemen ts from P can b e added to that subset. See [28, Chapter 3] for an exp osition of partially ordered sets. Theorem 3.7. L et g : K (Σ 1 , Σ 2 ) → T n b e given by g ( L ) = O L , wher e O L = O ( C Σ 1 ( L ) ∪ L ) , for any element L ∈ K (Σ 1 , Σ 2 ) . F or any maximal chain L 0 < L 1 < ... < L k in K (Σ 1 , Σ 2 ) , define h ( L 0 < L 1 < ... < L k ) = ∪ k i =0 g ( L i ) . Then ∪ k i =0 g ( L i ) = ∪ k i =0 O L i is a maximal p ath sp ac e and h is a bije ction b etwe en maximal p ath sp ac es fr om T 1 to T 2 and maximal chains in K (Σ 1 , Σ 2 ) . Pr o of. The map g is one-to-one, b ecause if L 6 = L 0 , then O L 6 = O L 0 . W e no w show that h maps maximal chains in K (Σ 1 , Σ 2 ) to maximal path spaces. Let ∅ = L 0 < L 1 < ... < L k = Σ 2 b e a maximal chain in K (Σ 1 , Σ 2 ). F or every 0 ≤ i ≤ k , let F i = L i and E i = C Σ 1 ( L i ). W e no w sho w that ∪ k i =0 O i is a path space. Since K (Σ 1 , Σ 2 ) is the closed sets of Σ 2 ordered by inclusion, F i ⊂ F i +1 for all 0 ≤ i < k . By the monotonicity of crossing sets, X Σ 1 ( L i ) ⊆ X Σ 1 ( L i +1 ). If X Σ 1 ( L i ) = X Σ 1 ( L i +1 ), then L i +1 ⊆ L i = L i , since L i is a closed set. This is a contradiction, and therefore, X Σ 1 ( L i ) ⊂ X Σ 1 ( L i +1 ). This implies that C Σ 1 ( L i ) ⊃ C Σ 1 ( L i +1 ) b y the partitioning prop erty , and hence E i ⊃ E i +1 for all 0 ≤ i < k . Since L 0 = ∅ , E 0 = C Σ 1 ( L 0 ) = Σ 1 , and since L k = Σ 2 , E k = C Σ 1 ( L k ) = ∅ , or else T 2 w ould con tain more than n − 2 splits. Finally , for all 0 ≤ i ≤ k , E i is compatible with F i b y definition. Therefore, ∪ k i =0 O ( E i ∪ F i ) is a path space. 9 W e will now show that ∪ k i =0 O i satisfies the three conditions of Theorem 3.6, and hence is maximal. Since E i = C Σ 1 ( F i ), Condition 1 is met. By Prop osition 3.4, F i ⊆ C Σ 2 ( E i ). W e now sho w that F i ⊇ C Σ 2 ( E i ). F or any f ∈ C Σ 2 ( E i ), by definition of the crossing set, X Σ 1 ( f ) ∩ E i = ∅ . Since X Σ 1 ( L i ) and C Σ 1 ( L i ) = E i partition Σ 1 , then X Σ 1 ( f ) ⊆ X Σ 1 ( L i ). This implies that f ∈ L i = L i = F i , and hence Condition 2 holds. T o sho w Condition 3, supp ose that for some 1 ≤ j ≤ k , there exists f ∈ B j and a minimal elemen t g in P ( E j − 1 , Σ 2 \ F j − 1 ) suc h that g < f in P ( E j − 1 , Σ 2 \ F j − 1 ). As shown in the proof of Theorem 3.6, F i − 1 ⊂ F i − 1 ∪ g ⊂ F i . This implies that L i − 1 < F i − 1 ∪ g < L i , and hence L i < L i − 1 is not a cov er relation, whic h is a contradiction. Therefore, Condition 3 also holds, and ∪ k i =0 O i is a maximal path space. So as claimed, if L 0 < L 1 < ... < L k is a maximal c hain, then h ( L 0 < ... < L k ) is a maximal path space. It remains to show that h is a bijection. F or an y maximal path space ∪ k i =0 O i , F i < F i +1 is a co ver relation for all 0 ≤ i < k since for any f ∈ B i , F i ∪ f = F i +1 b y Condition 3 of Theorem 3.6. This implies that ∅ = F 0 < F 1 < ... < F k = Σ 2 is a maximal c hain in K (Σ 1 , Σ 2 ) such that h ( F 0 < F 1 < ... < F k ) = ∪ k i =0 O i , and hence h is on to. W e ha ve that h is one-to-one, b ecause g is one-to-one. Therefore, h is a bijection, which establishes the corresp ondence. 0 1 3 2 5 4 6 n n-1 n-2 n-3 e 1 e 2 e 3 e 4 e n-4 e n-2 e n-3 (a) T ree T 1 . 0 1 2 6 3 7 n n-1 n-2 4 5 f 1 f 3 f n-3 f 2 f 4 f 6 f n-2 (b) T ree T 2 . f 2 , X ! 1 (f 2 ) = {e 1 ,e 2 } f 4 , X ! 1 (f 4 ) = {e 2 ,e 3 ,e 4 } f 6 , X ! 1 (f 6 ) = {e 4 ,e 5 ,e 6 } f 1 , X ! 1 (f 1 ) = {e 2 } f 3 , X ! 1 (f 3 ) = {e 4 } f n-1 , X ! 1 (f n-1 ) = {e n-2 } f n-2 , X ! 1 (f n-2 ) = {e n-4 ,e n-3 ,e n-2 } (c) Incompatibility p oset P (Σ 1 , Σ 2 ). Figure 5: A family of trees whose path p oset is exp onen tial in the num b er of lea ves. Remark 3.8. The n um b er of elemen ts in a path poset K (Σ 1 , Σ 2 ) can be exponential in the n umber splits in the tw o sets. F or example, for an y ev en p ositive integer n , consider the trees T 1 = ( X, Σ 1 ) and T 2 = ( X , Σ 2 ) depicted in Figures 5(a) and 5(b). Their incompatibilit y p oset is given in Figure 5(c). Let W b e the set of minimal elemen ts in P (Σ 1 , Σ 2 ). Then | W | = n − 2 2 . Each subset of W is a distinct closed set, and hence an elemen t in K (Σ 1 , Σ 2 ). This implies there are at least 2 ( n − 2) / 2 elemen ts in K (Σ 1 , Σ 2 ), and hence also an exp onen tial num b er of maximal chains. 4 Geo desics in P ath Spaces Giv en a path space, this section shows how to find the locally shortest path, or p ath sp ac e ge o desic , b et ween T 1 and T 2 within that space in linear time. W e do this b y transforming the problem into a Euclide an shortest-p ath pr oblem with obstacles ([18] and references) in Theorem 4.4. W e next reform ulate the problem as a touring problem [8]. A touring pr oblem asks for the shortest path through Euclidean space that visits a sequence of regions in the prescrib ed order. Lemma 4.8 and Lemma 4.9 give conditions on the path solving the touring problem. The linear algorithm for computing the path space geo desic is giv en in Section 4.2.1, with Theorem 4.10 proving its correctness. 10 4.1 Tw o Equiv alent Euclidean Space Problems Let T 1 and T 2 b e tw o trees with no common splits, and let S = ∪ k i =0 O ( E i ∪ F i ) b e a path space b et ween them. Define the p ath sp ac e ge o desic b etwe en T 1 and T 2 thr ough S to b e the shortest path b et ween T 1 and T 2 con tained in S . Let d S ( T 1 , T 2 ) b e the length of this path. W e will no w show that the path space geo desic b et ween T 1 and T 2 through a path space con taining k + 1 orthan ts is contained in a subspace of T n isometric to the follo wing subset of R k . F or 0 ≤ i ≤ k , define the orthant V i = { ( x 1 , ..., x k ) ∈ R k : x j ≤ 0 if j ≤ i and x j ≥ 0 if j > i } . Let V ( R k ) = ∪ k i =0 V i . W e prov e three properties of path space geo desics, and hence also geodesics, in Prop osition 4.1, Prop osition 4.2, and Corollary 4.3. These properties imply that the path space geodesic is a straigh t line except p ossibly at the in tersections b etw een orthants, where it may b end. F urthermore, if we kno w the p oin t on the path space geo desic at whic h an edge is added or dropp ed, then w e kno w the length of that edge at an y other p oin t on the path space geo desic. Analogous prop erties w ere pro v en by V ogtmann [30] for geodesics. Prop osition 4.1. The p ath sp ac e ge o desic is a str aight line in e ach orthant that it tr averses. Pr o of. If not, replace the path within each orthant with a straight line, which en ters and exits the orthan t at the same p oin ts as the original path, to get a shorter path. Prop osition 4.2. Moving along the p ath sp ac e ge o desic, the length of e ach non-zer o e dge changes in the tr e es on it at a c onstant r ate with r esp e ct to the ge o desic ar c length. That is, for any e dge e ∈ Σ 1 ∪ Σ 2 , ther e exists a c onstant c e > 0 such that | e | T d S ( T 1 ,T ) = c e for any tr e e T on the ge o desic that c ontains e dge e . Pr o of. By Prop osition 4.1, eac h edge must shrink or grow at a constan t rate with respect to the other edges within eac h orthant, but these rates can differ betw een orthants. That is, Prop osition 4.1 allo ws the constant c e to dep end on the orthant con taining T , but we will no w show that it do es not. It suffices to consider when the geo desic goes through the in teriors of the t wo adjacen t orthan ts O i − 1 = O ( E i − 1 ∪ F i − 1 ) and O i = O ( E i ∪ F i ), and b ends in the in tersection of these t w o orthants. Let a be the p oin t at whic h the geo desic enters O i − 1 , and let b b e the p oin t at whic h the geo desic lea v es O i . The edges A i = E i − 1 \ E i are dropped and the edges B i = F i \ F i − 1 are added as the geo desic mo v es from O i − 1 to O i . Thus the edges A i and B i all ha ve length 0 in the in tersection O ( E i ∪ F i − 1 ). Let m = | E i ∪ F i − 1 | , the dimension of O i − 1 ∩ O i . An affine h ull of a set S in R n is the intersection of all affine sets containing S . Consider the subset S = H a ∪ H b of O i − 1 ∪ O i , where H a is the affine h ull of a ∪ ( O i − 1 ∩ O i ) intersected with O i − 1 and H b is the affine h ull of b ∪ ( O i − 1 ∩ O i ) intersected with O i . This subset can b e isometrically mapp ed in to tw o orthants in R m +1 as follows. F or eac h tree T ∈ H a , let the first m co ordinates b e giv en by the pro jection of T on to O i − 1 ∩ O i . Let the ( m + 1)-st co ordinate be the length of the pro jection of T orthogonal to O i − 1 ∩ O i . More sp ecifically , let the edges in E i ∪ F i − 1 b e e 1 , e 2 , ..., e m . Then w e map T to the p oin t ( | e 1 | T , | e 2 | T , ...., | e m | T , s ) in R m +1 , where s = q P e ∈ A i | e | 2 T . Similarly , for e ac h tree T ∈ H b , let the first m co ordinates be giv en by the pro jection of T onto O i − 1 ∩ O i . Let the ( m + 1)-st co ordinate b e the negative of the length of the pro jection of T orthogonal to O i − 1 ∩ O i . In other words, we map T to the p oin t ( | e 1 | T , | e 2 | T , ...., | e m | T , − s ) in R m +1 , where s = q P e ∈ B i | e | 2 T . 11 W e ha v e mapp ed S in to Euclidean space, and hence the shortest path b et ween the image of a and the image of b is the straigh t line b etw een them. Along this line, eac h edge e 1 , ..., e m c hanges at the same rate with resp ect to the geo desic arc length. Since we can make this argument for each pair of consecutive orthan ts, we ha v e prov en this prop osition. Corollary 4.3. L et T b e a tr e e on the p ath sp ac e ge o desic b etwe en T 1 and T 2 thr ough the p ath sp ac e S = ∪ k i =0 O ( E i ∪ F i ) . Supp ose T ∈ O i . Then if 1 ≤ j ≤ i , we have | f 1 | T | f 1 | T 2 = | f 2 | T | f 2 | T 2 for any f 1 , f 2 ∈ B j , and if i < j ≤ k , we have | e 1 | T | e 1 | T 1 = | e 2 | T | e 2 | T 1 for any e 1 , e 2 ∈ A j . Pr o of. Let f 1 , f 2 ∈ B j b e edges in the tree T ∈ O i from the hypothesis. Then by Prop osition 4.2, there exist c f 1 , c f 2 > 0 suc h that | f 1 | T = c f 1 · d S ( T 1 , T ), | f 1 | T 2 = c f 1 · d S ( T 1 , T 2 ), | f 2 | T = c f 2 · d S ( T 1 , T ), and | f 2 | T 2 = c f 2 · d S ( T 1 , T 2 ). Then | f 1 | T | f 1 | T 2 = c f 1 · d S ( T 1 ,T ) c f 1 · d S ( T 1 ,T 2 ) = d S ( T 1 ,T ) d S ( T 1 ,T 2 ) = c f 2 · d S ( T 1 ,T ) c f 2 · d S ( T 1 ,T 2 ) = | f 2 | T | f 2 | T 2 . The argumen t to show | e 1 | T | e 1 | T 1 = | e 2 | T | e 2 | T 1 for any e 1 , e 2 ∈ A j is analogous. Therefore, there is one degree of freedom for each set of edges dropp ed, or alternatively for eac h set of edges added, at the transition b et ween orthants. Thus, the path space geo desic lies in a space of dimension equal to the n umber of transitions b et w een orthan ts. W e will now show that each path space geo desic lives in a space isometric to V ( R k ). F or example, in Figure 6(a), the path space Q consists of the orthants O ( { e 1 , e 2 , e 3 } ), O ( { f 1 , e 2 , e 3 } ), and O ( { f 1 , f 2 , f 3 } ). W e apply Theorem 4.4 to see that the geo desic through Q is contained in the shaded region of R 2 sho wn in Figure 6(b). A 2 B 2 (e 1 ,e 2 ,e 3 ) (-f 1 ,e 2 ,e 3 ) e 2 e 3 e 1 (-f 1 ,-f 2 ,-f 3 ) (0, e 2 ,e 3 ) (-f 1 ,0,0) = geodesic f 1 f 2 f 3 (a) Part of T 5 . B 2 A 2 e 1 f 1 ! e 1 , " e 2 2 + e 2 3 # ! − f 1 , " e 2 2 + e 2 3 # ! − f 1 , − " f 2 2 + f 2 3 # (b) Isometric mapping to V ( R 2 ). Figure 6: An isometric map b etw een a path space and V ( R 2 ). Theorem 4.4. L et Q = ∪ k i =0 O ( E i ∪ F i ) b e a p ath sp ac e b etwe en T 1 and T 2 , two tr e es in T n with no c ommon splits. Then the p ath sp ac e ge o desic b etwe en T 1 and T 2 thr ough Q is c ontaine d in a sp ac e isometric to V ( R k ) . Pr o of. By Corollary 4.3, any tree T 0 ∈ Q on the path space geo desic satisfies the following t wo conditions for each 1 ≤ j ≤ k : 1. if T 0 ∈ O i and j ≤ i , then there exists a c j = c j ( T 0 ) ≥ 0, dep ending on T 0 , such that | f | T 0 | f | T 2 = c j for all f ∈ B j , 12 2. if T 0 ∈ O i and j > i , then there exists a d j = d j ( T 0 ) ≥ 0, dep ending on T 0 , suc h that | e | T 0 | e | T 1 = d j for all e ∈ A j . Let Q 0 ⊂ T n b e the set of trees satisfying this prop ert y . F or 0 ≤ i ≤ n , define h i : Q 0 ∩ O i → V i b y h i  T 0  = h i  T ( c 1 · B 1 ∪ ... ∪ c i · B i ∪ d i +1 · A i +1 ∪ ... ∪ d k · A k )  =  − c 1 || B 1 || , ..., − c i || B i || , d i +1 || A i +1 || , ..., d k || A k ||  . W e claim that h i is a bijection from Q 0 ∩ O i to the orthant V i in V ( R k ). All trees in the in terior of orthant O i ha v e exactly the edges { B 1 , ..., B i , A i +1 , ..., A k } . Let N = | B 1 | + | B 2 | + ... + | B i | + | A i +1 | + .... + | A k | , the num b er of edges in trees in O i . Then O i is an N -dimensional orthan t, and w e can assign each edge to a coordinate axis so that the edges in B 1 are assigned to co ordinates 1 to | B 1 | , the edges in B 2 are assigned to coordinates | B 1 | + 1 to | B 1 | + | B 2 | , the edges in A i +1 are assigned to the co ordinates | B 1 | + | B 2 | + ... + | B i | + 1 to | B 1 | + | B 2 | + ... + | B i | + | A i +1 | , etc. Let e j b e the edge assigned to the j -th co ordinate. By abuse of notation, for all 1 ≤ j ≤ i , let B j b e the N -dimensional v ector with a 0 in every coordinate except those corresp onding to the edges B j , where we put the length of that edge in T 2 . Similarly , for all i < j ≤ k , let A j b e the N -dimensional vector with a 0 in ev ery co ordinate except those corresp onding to the edges in A j , where we put the length of that edges in T 1 . F or example, B 1 is the N -dimensional vector ( | f 1 | T 2 , | f 2 | T 2 , ..., | f | B 1 | | T 2 , 0 , ..., 0). Then Q 0 ∩ O i is generated by the v ectors n B 1 k B 1 k , B 2 k B 2 k , ..., B i k B i k , A i + 1 k A i + 1 k , .., A k k A k k o . Since these gen- erating vectors are pairwise orthogonal, they are indep enden t, and hence Q 0 ∩ O i is a k -dimensional orthan t con tained in O i . F urthermore, for all 1 ≤ j ≤ i , B j k B j k corresp onds to the tree T  1 k B j k · B j  , and for all i < j ≤ k , A j k A j k corresp onds to the tree T  1 k A j k · A j  . F or all 1 ≤ j ≤ k , let u j b e the k -dimensional unit v ector with a 1 in the j -th co ordinate. Then for 1 ≤ j ≤ i , h i  B j k B j k  = h i  T  1 k B j k · B j  = − 1 k B j k · k B j k u j = − u j . Similarly , for all i < j ≤ k , h i  A j k A j k  = h i  T  1 k A j k · A j  = 1 k A j k · k A j k u j = u j . The basis of V i is {− u 1 , ..., − u i , u i + 1 , ..., u k } , so h i maps each basis elemen t of Q 0 ∩ Q i to a unique basis element of V i . Thus, h i is a linear transformation, whose corresp onding matrix is the identit y matrix, and hence a bijection b etw een Q 0 ∩ Q i and V i for all i . F urthermore, since the determinan t of the matrix of h i is 1, h i is also an isometry . So Q 0 is piecewise linearly isometric to V ( R k ). F or all 0 ≤ i ≤ n , the inv erse of h i is g i : V i → Q 0 defined by g i ( − x 1 , ... − x i , x i +1 , ..., x k ) = T 0 , where x j ≥ 0 for all 1 ≤ j ≤ k and T 0 is the tree with edges E i ∪ F i with lengths | x j | k B j k · | e | T 2 if e ∈ B j for 1 ≤ j ≤ i and | x j | k A j k · | e | T 1 if e ∈ A j for i < j ≤ k . Notice that if T 0 ∈ Q 0 ∩ O i ∩ O i +1 , then h i ( T 0 ) = h i +1 ( T 0 ), since the lengths of all the edges in A i +1 and B i +1 are 0. Therefore, define h : Q 0 → V ( R k ) to b e h ( T 0 ) = h i ( T 0 ) if T 0 ∈ O i ∩ Q 0 , whic h is well-defined. Define g : V ( R k ) → Q 0 b y setting g ( − x 1 , ... − x i , x i +1 , ..., x k ) = g i ( − x 1 , ... − x i , x i +1 , ..., x k ) , for all 1 ≤ i ≤ k and for all x j ≥ 0 for all 1 ≤ j ≤ k . Then g is also well-defined and the inv erse of h . F or an y geodesic q in Q 0 , map it into V ( R k ) b y applying h to eac h point on q to get path p . Notice that since b oth h i and g i are distance preserving, p is the same length as q . W e claim p 13 is a geo desic in V ( R k ). T o pro ve this, supp ose not. Let p 0 b e the geo desic in V ( R k ) b et ween the same endp oints as path p . Then p 0 is strictly shorter than p . Use g to map p 0 bac k to Q 0 to get q 0 . Again distance is preserv ed, so q 0 is strictly shorter than q . But q was a geo desic, and hence the shortest path b etw een those t wo endp oin ts in Q 0 , so we ha ve a contradiction. Therefore, the geo desic b et w een T 1 and T 2 in Q is isometric to the geodesic b et ween A = ( k A 1 k , ..., k A k k ) and B = ( −k B 1 k , ..., −k B k k ) in V ( R k ). Th us, finding the geo desic through a ( k + 1) − orthant path space ∪ k i =0 O ( E i ∪ F i ) is equiv alent to finding the geo desic through V ( R k ) b et w een the p oin t A = ( k A 1 k , ..., k A k k ) and the p oin t B = ( −k B 1 k , ..., −k B k k ). Now consider the Euclidean space R k in which every orthant that is not in V ( R k ) is replaced by an obstacle. Then finding the shortest path from A to B in this new space with obstacles will give us the path space geodesic in tree space. W e will now generalize, and somewhat abuse notation, by letting A b e any p oin t in the all- p ositiv e orthan t of R k and by letting B b e any p oin t in the all-negativ e orthan t of R k . Then we can reformulate this general problem as the following touring problem: Problem 2 (T ouring) . Let A b e an y point in the positive orthan t of R k and let B b e an y p oin t in the negativ e orthant of R k . Let P i b e the b oundary b etw een the i -th and ( i + 1)-st orthants in V ( R k ), for all 1 ≤ i ≤ k . That is, P i = { ( x 1 , ..., x k ) ∈ R k : x j ≤ 0 if j < i ; x j = 0 if j = i ; x j ≥ 0 if j > i } . Find the shortest path b et ween A and B in R k that intersects P 1 , P 2 , ..., P k in that order. In dimensions 3 and higher, the Euclidean shortest path problem with obstacles is NP-hard in general [6], including when the obstacles are disjoin t axis-aligned b o xes [19]. The touring problem can be solved in p olynomial time as a second order cone problem when the regions are p olyhedra [22]. In the sp ecial case of the ab o v e touring problem, w e find a simple linear algorithm. 4.2 T ouring Problem Solution In this section, we give a solution to Problem 2. Since this is a conv ex optimization problem, this solution is unique [22], and we will call it the shortest, ordered path. As in the problem statemen t, let A = ( a 1 , ..., a k ), where a i ≥ 0 for all 1 ≤ i ≤ k , and let B = ( − b 1 , ..., − b k ), where b i ≥ 0 for all 1 ≤ i ≤ k . First, Lemma 4.5 establishes when a straigh t line from A to B passes through the regions in the desired order. Two further prop erties of the shortest, ordered path are giv en in Lemmas 4.8 and 4.9. Theorem 4.10 shows how exploiting this last prop ert y , in conjunction with using Theorem 4.4 to reduce the dimension of the problem, gives a linear algorithm for finding the shortest, ordered path from A to B . Lemma 4.5. The line fr om A to B , AB , p asses thr ough the r e gions P 1 , P 2 , ..., P k in that or der and has length q P k i =1 ( a i + b i ) 2 if and only if a 1 b 1 ≤ a 2 b 2 ≤ ... ≤ a k b k . Pr o of. P arametrize the line AB with resp ect to the v ariable t , so that t = 0 at A and t = 1 at B , to get ( x 1 , ..., x k ) = ( a 1 , ..., a k ) + t ( − a 1 − b 1 , ..., − a k − b k ). Let t i b e the v alue of t at the intersection of AB and P i . Setting x i = 0, and solving for t giv es t i = a i a i + b i . F or AB to cross P 1 , P 2 , ..., P k in that order, w e need t 1 ≤ t 2 ≤ ... ≤ t k or a 1 a 1 + b 1 ≤ a 2 a 2 + b 2 ≤ ... ≤ a k a k + b k . Since for an y 1 ≤ i, j ≤ k , a i a i + b i ≤ a j a j + b j is equiv alent to a i b i ≤ a j b j b y cross multiplication, we get the desired condition. By the Euclidean distance formula, the length AB is q P k i =1 ( a i + b i ) 2 . 14 Corollary 4.6. L et A = ( a 1 , ..., a k ) and B = ( − b 1 , ..., − b k ) b e p oints in R k with a i , b i ≥ 0 for al l 1 ≤ i ≤ k . Then a i b i = a i +1 b i +1 if and only if AB interse cts P i ∩ P i +1 . Pr o of. This follows directly from the pro of of Lemma 4.5. In general, w e will not ha ve a 1 b 1 ≤ a 2 b 2 ≤ ... ≤ a k b k , and hence the shortest path is not a straight line. Since the shortest, ordered path corresp onds to a path space geo desic in the shortest Euclidean path with obstacles problem, Prop osition 4.1, Prop osition 4.2, and Corollary 4.3 also hold here. Therefore, the shortest, ordered path intersects eac h region P i at a unique point p i , where the path ma y bend. The path is a straight line from p i to p i +1 for 1 ≤ i < k . W e can straighten a b end in the path by isometrically mapping the problem to a low er dimensional space using the follo wing Corollary 4.7 to Theorem 4.4. W e rep eat this process for each successive bend until Lemma 4.5 applies. Corollary 4.7. Consider the shortest p ath fr om A = ( a 1 , a 2 , ..., a k ) to B = ( − b 1 , − b 2 , ..., − b k ) in R k p assing thr ough P 1 , ..., P k in that or der. L et { M j } m j =1 b e any or der e d p artition of { 1 , 2 , ..., k } such that i, l ∈ M j implies p i = p l . Then this p ath is c ontaine d in a r e gion of R k isometric to V ( R m ) . Pr o of. Supp ose i, i + 1 are in the same blo c k in { M j } m j =1 . Then p i = p i +1 , and trav elling along the pre-image of the path in tree space, the tree loses splits A i and A i +1 sim ultane- ously , and gains splits B i and B i +1 sim ultaneously . Hence, this path is in the path space S = O 0 ∪  ∪ m j =1 O  ( ∩ i ∈ M j E i ) ∪ ( ∪ i ∈ M j F i )   . Apply Theorem 4.4 to S to see that its path space geo desic is con tained in a region isometric to V ( R m ), as desired. Notice that under the mapping to V ( R m ) describ ed in the ab o ve pro of, A is mapp ed to e A =  q P i ∈ M 1 a 2 i , q P i ∈ M 2 a 2 i , ..., q P i ∈ M m a 2 i  and B is mapp ed to e B =  − q P i ∈ M 1 b 2 i , − q P i ∈ M 2 b 2 i , ..., − q P i ∈ M m b 2 i  . T o apply Corollary 4.7, we need to kno w when p i = p i +1 . A condition for this is given in Lemma 4.9. The follo wing Lemma 4.8 is used in pro ving Lemma 4.9, but it also shows that the shortest path only b ends at the in tersection of t w o or more P i ’s (by setting J = i ). Lemma 4.8. L et q b e the shortest p ath fr om A to B that p asses thr ough P 1 , P 2 , ..., P k in that or der. L et p j b e the interse ction of q and P j for e ach 1 ≤ j ≤ k . If a J b J ≤ a J +1 b J +1 ≤ ... ≤ a i b i , for some 1 ≤ J ≤ i < k , q is a str aight line until it b ends at p J = p J +1 = ... = p i , and p J − 1 6 = p J if J > 1 , then p i = p i +1 . Pr o of. This pro of is by contradiction, so assume that p i 6 = p i +1 . Since q is a shortest, ordered path, q is a straigh t line from p i to p i +1 . Let Y = ( − y 1 , ..., − y i , y i +1 , ..., y k ), where y j ≥ 0 for all 1 ≤ j ≤ k , b e a point on the line p i p i +1 , ε > 0 past p i . Note that AY p J forms a non-trivial triangle, since q b ends at p J . W e will no w sho w that AY intersects P 1 , P 2 , ..., P i in that order. P arametrize the paths q and AY with resp ect to time t , so that t = 0 at A and t = 1 at Y . The j -th co ordinate, for 1 ≤ j ≤ J − 1, decreases linearly from a j to − y j in b oth q and AY , and th us b ecome 0 at the same time in b oth paths. This implies that since q crosses P 1 , ..., P J − 1 in that order, AY also crosses P 1 , ..., P J − 1 in that order. Let t j b e the time at whic h AY intersects P j , for 1 ≤ j ≤ i . Then 0 = a j + t j ( − y j − a j ) or t j = a j y j + a j . In q , eac h co ordinate b et w een J and i b ecomes 0 at the same time. These co ordinates then decrease linearly , so the ratio betw een an y t w o consecutiv e coordinates remains constan t as time increases. This implies y j y j +1 = b j b j +1 for each J ≤ j ≤ i . Since a J b J ≤ a J +1 b J +1 ≤ ... ≤ a i b i b y 15 the hypothesis, then a J y J ≤ a J +1 y J +1 ≤ ... ≤ a i y i . This implies a J a J + y J ≤ a J +1 a J +1 + y J +1 ≤ ... ≤ a i a i + y i , or t J ≤ t J +1 ≤ ... ≤ t i . Thus AY intersects P J , P J +1 , ..., P i in that order. It remains to show that AY intersects P J − 1 b efore P J if J > 1, which we do by con tradiction. So assume that t J < t J − 1 . Let r J − 1 and r J b e the p oin ts of in tersection of AY with P J − 1 and P J , resp ectiv ely . By the h yp otheses and assumption, r J and p J are contained in P J \ P J − 1 . Since P J − 1 and P J are conv ex, r J − 1 p J − 1 and r J p J are contained in P J − 1 and P J , resp ectiv ely . Now r J − 1 p J − 1 in tersects r J p J inside the triangle AY p J . This implies that r J p J passes from P J \ P J − 1 in to P J − 1 ∩ P J , on the b oundary of P J , and bac k in to P J \ P J − 1 . But this con tradicts the conv exity of P J . Thus t J − 1 ≤ t J , and AY passes through P 1 , P 2 , ..., P i in that order. By the triangle inequality , AY is shorter than the section of q from A to Y . This contradicts q b eing the shortest, ordered path, and th us p i = p i +1 . Lemma 4.9. F or the shortest p ath q fr om A to B that p asses thr ough P 1 , P 2 , ..., P k in that or der, if a 1 b 1 ≤ a 2 b 2 ≤ ... ≤ a i b i > a i +1 b i +1 , then this p ath interse cts P i ∩ P i +1 . Pr o of. P arametrize q with respect to the v ariable t , so that the path starts at A when t = 0, ends at B when t = 1, and passes through P j at p oin t p j = ( p j, 1 , p j, 2 , ..., p j,k ) when t = t j , for all 1 ≤ j ≤ k . If q bends b efore p i +1 , then let p j b e the first place that it b ends. By rep eated applications of Lemma 4.8, q also passes through P i ∩ P i +1 and w e are done. So assume that q is a straigh t line from A to p i +1 . Thus, the i -th coordinate c hanges linearly from a i to − b i , and from the parametrization of this, we get t i +1 = a i − p i +1 ,i a i + b i . Case 1: p i +1 ,i +2 6 = 0 (That is, the shortest ordered path q does not bend at p i +1 .) In this case, p i +1 ,i +1 = 0 = a i +1 + t i +1 ( − b i +1 − a i +1 ) , which implies t i +1 = a i +1 a i +1 + b i +1 . Equate this v alue of t i +1 with the one found ab o ve, and rearrange to get p i +1 ,i = a i − a i +1 ( a i + b i ) a i +1 + b i +1 . The definition of P i +1 and the assumption p i 6 = p i +1 implies that p i +1 ,i < 0. Hence, a i < a i +1 ( a i + b i ) a i +1 + b i +1 , whic h can b e rearranged to a i b i < a i +1 b i +1 , a contradiction. Case 2: p i +1 ,i +2 = 0 (That is, the shortest ordered path q bends at p i +1 , and p i +1 = p i +2 .) Let J ≥ 2 b e the largest integer such that p i + J = p i +1 , but p i + J +1 6 = p i +1 . Apply Corollary 4.7 using the partition { 1 } , { 2 } , ..., { i } , { i + 1 } , { i + 2 , ..., i + J } , { i + J + 1 } , ..., { k } to reduce the space b y J − 2 dimensions. A and B are mapped to e A = ( e a 1 , ..., e a k − ( J − 2) ) and e B = ( − e b 1 , ..., − e b k − ( J − 2) ), resp ectiv ely , in the lo wer dimension space, where: e a j =        a j if j ≤ i + 1 q P J l =2 a 2 i + l if j = i + 2 a j + J − 2 if j > i + 2 and e b j =        b j if j ≤ i + 1 q P J l =2 b 2 i + l if j = i + 2 b j + J − 2 if j > i + 2 Let e k = k − ( J − 2). Let e p j b e the image of p j in R e k under the ab o ve mapping if j ≤ i +2 and the image of p j + J − 2 if j > i + 2. Let e P j = { ( x 1 , ..., x e k ) ∈ R e k : x l ≤ 0 if l < j ; x l = 0 if l = j ; x l ≥ 0 if l > j } . So e P j is the b oundary b et ween the j -th and ( j + 1)-st orthants in the low er dimension space R e k . Let e q be the image of q . Then e q is a straigh t line from e A to e p i +1 , and e p i +1 = e p i +2 6 = e p i +3 , so e q b ends in e P i +1 ∩ e P i +2 . Since e q do es not in tersect e P i +2 ∩ e P i +3 , by the contrapositive of Lemma 4.8, e a i +1 e b i +1 > e a i +2 e b i +2 . In R k , this translates in to the condition that a i +1 b i +1 > q P J l =2 a 2 i + l q P J l =2 b 2 i + l . Cross-multiply , square each side, add 16 a 2 i +1 b 2 i +1 , and rearrange to get a i +1 b i +1 > q P J l =1 a 2 i + l q P J l =1 b 2 i + l . The remaining analysis is in R k . If the shortest, ordered path is a straigh t line through p i +1 , then w e make the same argumen t as in Case 1. Otherwise, since the path do es not b end at p i , the i -th co ordinate changes linearly from a i to − b i . W e use this parametrization to find t i +2 = t i +1 = a i − p i +1 ,i a i + b i . F urthermore, the ( i + 1)-st to ( i + J )-th co ordinates decrease at the same rate from A to p i +1 and at the same, but p ossibly different than the first, rate from p i +1 to B . Therefore, we can apply Corollary 4.7 to the partition { 1 } , { 2 } , ..., { i } , { i + 1 , i + 2 , ..., i + J } , { i + J + 1 } , ..., { k } to isometrically map the shortest, ordered path into R m − ( J − 1) . Let e a = q P J l =1 a 2 i + l , and let e b = q P J l =1 ( − b i + l ) 2 . Then in R m − ( J − 1) , the ( i + 1)-st co ordinate of the shortest ordered path changes at a constant rate from e a to − e b . This implies 0 = e a + t i +1 ( − e b − e a ), or t i +1 = e a e a + e b . Equate the tw o expressions for t i +1 to get p i +1 ,i = a i − ( a i + b i ) e a e a + e b . By definition of P i +1 , p i +1 ,i < 0. This implies a i b i < e a e b = q P J l =1 a 2 i + l √ P J l =1 ( − b i + l ) 2 . But we show ed that q P J l =1 a 2 i + l √ P J l =1 ( − b i + l ) 2 < a i +1 b i +1 , so a i b i < a i +1 b i +1 , which is also a contradiction. By rep eatedly applying this lemma, we find the low est dimensional space con taining the shortest, ordered path. In this space, the ratios deriv ed from the coordinates of the im ages of A and B form a non-descending sequence. The following theorem gives the shortest path through V ( R k ) from a p oin t in the p ositiv e orthant to a p oin t in the negative orthant, or equiv alently , the shortest tour that passes through P 1 , ..., P k in R k . Theorem 4.10. L et A = ( a 1 , a 2 , ..., a k ) and B = ( − b 1 , − b 2 , ..., − b k ) with a i , b i ≥ 0 for al l 1 ≤ i ≤ k b e p oints in R k . A lternate b etwe en applying L emma 4.9 and Cor ol lary 4.7 until ther e is a non- desc ending se quenc e of r atios e a 1 e b 1 ≤ e a 2 e b 2 ≤ ... ≤ e a m e b m , wher e e a i and e b i ar e the c o or dinates in the lower dimensional sp ac e. Ther e is a unique shortest p ath b etwe en e A = ( e a 1 , ...., e a m ) and e B = ( − e b 1 , ..., − e b m ) in V ( R m ) , with distanc e q P m i =1 ( e a i + e b i ) 2 . This is the length of the shortest p ath b etwe en A and B in V ( R k ) . Pr o of. F or the smallest i suc h that a i b i > a i +1 b i +1 , Lemma 4.9 implies that p i = p i +1 in the shortest, ordered path in R k . Thus, we can isometrically map this problem to the space one dimension lo w er that results from applying Corollary 4.7 using the partition { 1 } , { 2 } , ..., { i − 1 } , { i, i + 1 } , { i + 2 } , ..., { m } . W e rep eat these tw o steps, iteratively mapping this problem to low er dimensional spaces, un til the new ratio sequence is non-descending. Let e a 1 e b 1 ≤ e a 2 e b 2 ≤ ... ≤ e a m e b m b e this ratio sequence. By Lemma 4.5, the geo desic b et ween e A and e B is the straight line. F urthermore, its length is q P m i =1 ( e a i + e b i ) 2 . Since we mapp ed from V ( R k ) to V ( R m ) by rep eated isometries, b oth the length of the path and the order it passes through P 1 , ..., P m , or their images, remain the same. Th us the pre-image of this path is the shortest path in V ( R k ). 4.2.1 P athSpaceGeo: A Linear Algorithm for Computing P ath Space Geo desics Theorem 4.10 can b e translated in to a linear algorithm called P a thSp aceGeo , for computing the path space geo desic betw een T 1 and T 2 through some path space S = ∪ k i =0 O k . F or all 1 ≤ i ≤ k , let A i = E i − 1 \ E i and B i = F i \ F i − 1 , and let a i = k A i k and b i = k B i k . 17 Let 1 ≤ i < k be the least integer such that a i b i > a i +1 b i +1 . Then b y Theorem 4.10, to find the path space geo desic through S , w e should apply Lemma 4.9 and Corollary 4.7 to the ra- tio sequence a 1 b 1 , a 2 b 2 , ..., a k b k to map the problem to V ( R k − 1 ), where the ratio sequence b ecomes a 1 b 1 , ..., a i − 1 b i − 1 , √ a 2 i + a 2 i +1 √ b 2 i + b 2 i +1 , a i +2 b i +2 ..., a k b k . Rep eat this pro cess until the ratio sequence is non-descending. Unfortunately , this pro cess is not deterministic, in that differen t non-descending ratio sequences can be found for the same geo desic, dep ending on the starting path space. This o ccurs, b ecause b y Corollary 4.6, tw o equal ratios can b e com bined to giv e a ratio sequence corresp onding to a path with the same length. Ho wev er, if we mo dify the algorithm to also com bine equal ratios, the output ascending ratio sequence will b e unique for a given geodesic. Define the c arrier of the p ath sp ac e ge o desic thr ough S b et ween T 1 and T 2 to b e the path space Q = ∪ l i =0 O c ( i ) ⊆ S such that the path space geo desic through S trav erses the relativ e in teriors of O c (0) , O c (1) , ..., O c ( l ) , where the function c : { 0 , 1 , ..., l } → { 0 , ..., k } tak es i to c ( i ) if the i -th orthan t is Q is the c ( i )-th orthant in S . If a path space geo desic is the geo desic, we just write c arrier of the ge o desic . Then the carrier of the path space geo desic is the path space whose corresp onding ratio sequence is the unique ascending ratio sequence for the path space geo desic. W e no w explicitly describ e the algorithm for computing the ascending ratio sequence corre- sp onding to the path space geo desic, P a thSp aceGeo , and prov e it has linear run time. P a thSp aceGeo Input : P ath space S or its corresp onding ratio sequence R = a 1 b 1 , a 2 b 2 , ..., a k b k Output : The path space geo desic, represented as an asce nding ratio sequence, whic h is understo o d to b e the partition of R where the ratio q P J j =0 a 2 i + j q P J j =0 b 2 i + j corresp onds to the blo c k n a i b i , a i +1 b i +1 , ..., a i + J b i + J o . A lgorithm : Starting with the ratio pair a 1 b 1 , a 2 b 2 , P a thSp aceGeo compares consecutive ratios. If for the i -th pair, we ha v e a i b i ≥ a i +1 b i +1 , then c ombine the tw o ratios b y replacing them b y √ a 2 i + a 2 i +1 √ b 2 i + b 2 i +1 in the ratio sequence. Compare this new, com bined ratio with the previous ratio in the sequence, and com bine these tw o ratios if they are not ascending. Again the newly combined ratio m ust b e compared with the ratio b efore it in the sequence, and so on. Once the last com bined ratio is strictly greater then the previous one in the sequence, we again start moving forw ard through the ratio sequence, comparing consecutive ratios. The algorithm ends when it reac hes the end of the ratio sequence, and the ratios form an ascending ratio sequence. Theorem 4.11. P a thSp aceGeo has c omplexity Θ( k ) , wher e k + 1 is the numb er of orthants in the p ath sp ac e b etwe en T 1 and T 2 . Pr o of. W e first sho w the complexit y is O ( k ). Combining t w o ratios reduces the num b er of ratios b y 1, so this op eration is done at most k − 1 = O ( k ) times. It remains to coun t the n umber of comparisons b et ween ratios. Eac h ratio is inv olved in a comparison when it is first encountered in the sequence. There are k − 1 such comparisons. All other comparisons o ccur after ratios are com bined, so there are at most k − 1 of these comparisons. Therefore, P a thSp aceGeo has complexit y O ( k ). Any algorithm m ust make k − 1 comparisons to ensure the ratios are in ascending order, so the complexity is Ω( k ), and thus this b ound is tigh t. 5 Algorithms In this section, we show in Theorem 5.2 how to compute the geo desic distance b et w een t wo trees T 1 and T 2 b y computing the geo desic b et ween certain smaller, related trees. This allows us to use 18 the results from Sections 3 and 4, as w ell as either dynamic programming or divide and conquer tec hniques, to devise t wo algorithms for finding the geo desic betw een t wo trees with no common splits. Exp erimen ts on random trees sho w these algorithms are exp onen tial, but practical on trees with up to 40 lea ves, as well as larger trees from biological data. 5.1 A Relation b etw een Geo desics Let T 1 and T 2 b e t w o trees in T n with no common splits. The following theorem shows that there exists a path space con taining the geo desic b et ween T 1 and T 2 suc h that a certain subspace of it con tains the geo desic b et w een tw o smaller, related trees, T 0 1 and T 0 2 . As T 0 1 and T 0 2 ha v e fewer splits than T 1 and T 2 , it is easier to compute this geodesic. Therefore, w e can find the geo desic b et w een T 1 and T 2 b y finding the geo desic b et w een all such p ossible T 0 1 and T 0 2 . Definition 5.1. Let S = ∪ k i =0 O i b e a path space b etw een T 1 and T 2 . Define r ( S ) = ∪ k − 1 i =0 O ( E i \ E k − 1 ∪ F i ) to b e the trunc ation of S . Then r ( S ) is a path space b et w een T 0 1 = T (Σ 1 \ E k − 1 ) and T 0 2 = T ( F k − 1 ). That is, T 0 1 and T 0 2 are exactly the trees T 1 and T 2 with the edges E k − 1 = A k and F k \ F k − 1 = B k con tracted, and r ( S ) is the subspace of S formed by remo ving all trees having edges in A k or B k of non-zero length. Finally , if the path space S 0 = ∪ k − 1 i =0 O 0 i is the truncation of a path space b etw een trees T 1 and T 2 , then there is a unique path space S = ∪ k − 1 i =0  O 0 i + O (Σ 1 \ E 0 k − 1 )  ∪ O (Σ 2 ) b et ween T 1 and T 2 suc h that r ( S ) = S 0 . Theorem 5.2. L et T 1 and T 2 b e two tr e es in T n with no c ommon splits. Then ther e exists a p ath sp ac e Q = ∪ k i =0 O i that c ontains the ge o desic b etwe en T 1 and T 2 , such that the trunc ation Q 0 = r ( Q ) is the c arrier of the ge o desic b etwe en T 0 1 = T (Σ 1 \ E k − 1 ) and T 0 2 = T ( F k − 1 ) . T o prov e this theorem, we first pro v e t wo lemmas which hold for an y path space S betw een T 1 and T 2 , with truncation S 0 . Lemma 5.3 shows that the path space geo desic through S is contained in a path space whose truncation is the carrier of the path space geo desic of S 0 . Lemma 5.4 shows that if S 0 do es not con tain the geo desic betw een T 0 1 and T 0 2 , and hence w e can find another path space P 0 con taining a shorter path space geo desic, then the corresp onding path space P b et ween T 1 and T 2 do es not con tain a path space geo desic longer than the one in S . Lemma 5.3. L et T 1 and T 2 b e two tr e es in T n with no c ommon splits, and let S = ∪ k i =0 O i b e a p ath sp ac e b etwe en them. L et Q 0 b e the c arrier of the p ath sp ac e ge o desic thr ough S 0 = r ( S ) b etwe en T 0 1 = T (Σ 1 \ E k − 1 ) and T 0 2 = T ( F k − 1 ) . L et Q b e the p ath sp ac e b etwe en T 1 and T 2 such that r ( Q ) = Q 0 . Then d Q ( T 1 , T 2 ) = d S ( T 1 , T 2 ) . Pr o of. Since Q 0 is the carrier of the path space geo desic through S 0 , both Q 0 and S 0 ha v e the same path space geodesic, and hence P a thSp aceGeo will return the same ascending ratio sequence for either input Q 0 or S 0 . Let this ascending ratio sequence b e a 0 1 b 0 1 , a 0 2 b 0 2 , . . . , a 0 l b 0 l . The ratio sequences corresp onding to the path spaces Q and S are just the ratio sequences for Q 0 and S 0 , resp ectiv ely , with the ratio k A k k k B k k added to the end of eac h. So for b oth inputs Q and S , the ratio sequence when P a thSp aceGeo compares k A k k k B k k for the first time is a 0 1 b 0 1 , a 0 2 b 0 2 , . . . , a 0 l b 0 l , k A k k k B k k . This implies that the ratio sequence output by P a thSp aceGeo ( Q ) is the same as that output b y P a thSp aceGeo ( S ), and hence d Q ( T 1 , T 2 ) = d S ( T 1 , T 2 ). Lemma 5.4. L et T 1 and T 2 b e two tr e es in T n with no c ommon splits, and let S b e a p ath sp ac e b etwe en them. If S 0 = r ( S ) do es not c ontain the ge o desic b etwe en T 0 1 = T (Σ 1 \ E k − 1 ) and T 0 2 = 19 T ( F k − 1 ) , then ther e exists a p ath sp ac e P 0 b etwe en T 0 1 and T 0 2 such that d P 0 ( T 0 1 , T 0 2 ) < d S 0 ( T 0 1 , T 0 2 ) and d P ( T 1 , T 2 ) ≤ d S ( T 1 , T 2 ) , wher e P is the p ath sp ac e b etwe en T 1 and T 2 with trunc ation P 0 . Pr o of. Let S 0 = ∪ l i =0 O 0 i , and let Q 0 = ∪ l i =0 O 0 c ( i ) b e the carrier of the path space geo desic through S 0 . Let q b e the path space geodesic through Q 0 b et ween T 0 1 and T 0 2 , and let q i = O 0 c ( i − 1) ∩ O 0 c ( i ) ∩ q for every 1 ≤ i ≤ l . Since q is not the geo desic from T 0 1 to T 0 2 , q cannot b e lo cally shortest in T n . By Prop osition 4.1, for all 1 ≤ i ≤ l − 1, the part of q b etw een q i and q i +1 is a line, and cannot b e made shorter in T n . Th us we can only find a lo cally shorter path in T n b y v arying q in the neigh b ourhoo d of some q j . In particular, there exists some ε such that if s and t are the p oin ts on q , ε b efore and after q j in the orthan ts O c ( j − 1) and O c ( j ) , resp ectiv ely , then the geo desic b et ween s and t do es not follo w q . Replace the part of q b et ween s and t with the true geo desic b et w een s and t to get a shorter path in T n , with distance d s . Let O c ( j − 1) , O 00 1 = O ( E 00 1 ∪ F 00 1 ) , ..., O 00 m = O ( E 00 m ∪ F 00 m ) , O c ( j ) b e the sequence of orthants through whose relativ e in teriors the geo desic betw een s and t passes. Note that O 00 1 , ..., O 00 m are not in S 0 . These orthants m ust form a path space, and th us P 0 = Q 0 ∪ ( ∪ m i =0 O 00 i ) is a path space. Since the path space geo desic is the shortest path through a path space, d P 0 ( T 0 1 , T 0 2 ) ≤ d s < d Q 0 ( T 0 1 , T 0 2 ). By definition of Q 0 , d Q 0 ( T 0 1 , T 0 2 ) = d S 0 ( T 0 1 , T 0 2 ), and hence d P 0 ( T 0 1 , T 0 2 ) < d S 0 ( T 0 1 , T 0 2 ), as desired. T o show that d P ( T 1 , T 2 ) ≤ d S ( T 1 , T 2 ), let Q b e the path space b et w een T 1 and T 2 suc h that r ( Q ) = Q 0 . Then Q ⊂ P , whic h implies d P ( T 1 , T 2 ) ≤ d Q ( T 1 , T 2 ). By Lemma 5.3, d Q ( T 1 , T 2 ) = d S ( T 1 , T 2 ), and so P 0 is the desired path space. W e use Lemma 5.3 and Lemma 5.4 to prov e Theorem 5.2. Pr o of of The or em 5.2. W e first show there exists a path space M containing the geo desic betw een T 1 and T 2 , suc h that its truncation M 0 con tains the geo desic b et ween T 0 1 and T 0 2 . So let S b e an y path space containing the geo desic b et w een T 1 and T 2 , with truncation S 0 = r ( S ). If S 0 con tains the geo desic betw een T 0 1 and T 0 2 , then we are done. If not, then by Lemma 5.4, there exists a path space P 0 from T 0 1 to T 0 2 with d P 0 ( T 0 1 , T 0 2 ) < d S 0 ( T 0 1 , T 0 2 ) and d P ( T 1 , T 2 ) ≤ d S ( T 1 , T 2 ), where P is the path space b et ween T 1 and T 2 suc h that P 0 = r ( P ). Since S contains the geo desic from T 1 to T 2 , we ha ve d P ( T 1 , T 2 ) = d S ( T 1 , T 2 ), and hence P also contains the geodesic. If P 0 con tains the geo desic b et ween T 0 1 and T 0 2 , then w e are done. Otherwise, rep eat this step by applying Lemma 5.4 to P and P 0 . This pro cess pro duces a path space containing a strictly shorter path space geodesic at each iteration, so since there are only a finite num b er of path spaces, it even tually finds a path space containing the geo desic from T 0 1 to T 0 2 . Let Q 0 b e the carrier of the path space M 0 con taining the geo desic b et w een T 0 1 and T 0 2 , and let Q b e the path space from T 1 to T 2 suc h that r ( Q ) = Q 0 . Then by Lemma 5.3, Q also contains the geo desic betw een T 1 and T 2 , and we are done. W e will no w present t w o algorithms for computing geo desics. Both of these algorithms use Theorem 5.2 to av oid computing the path space geo desic for ev ery maximal path space b etw een T 1 and T 2 . This significantly decreases the runtime. W e call these algorithms GeodeMaps , whic h stands for GEOdesic DistancE via MAximal Path Spaces. The first algorithm uses dynamic programming tec hniques, and is denoted GeodeMaps-D ynamic , while the second uses a divide and conquer strategy , and is denoted GeodeMaps-Divide . 5.2 Geo deMaps-Dynamic: a Dynamic Programming Algorithm Theorem 5.2 implies that we can find the geo desic b et ween trees T 1 and T 2 b y just considering certain geo desics corresponding to the elements co vered b y Σ 2 in K (Σ 1 , Σ 2 ). More sp ecifically , for 20 an y A ∈ K (Σ 1 , Σ 2 ) co vered by Σ 2 , let Q 0 A b e the carrier for the geo desic g A from T ( X Σ 1 ( A )) to T ( A ). Then the geodesic from T 1 to T 2 is the minim um-length path space geodesic through the path spaces { Q A : A ∈ K (Σ 1 , Σ 2 ) is cov ered by Σ 2 and Q 0 A = r ( Q A ) } . An analogous metho d can b e applied to find the geo desic g A . In general, for an y element A 6 = ∅ in K (Σ 1 , Σ 2 ), the geo desic b et ween trees T ( X Σ 1 ( A )) and T ( A ) can b e computed from the carriers Q 0 B of the geo desics from T ( X Σ 1 ( B )) to T ( B ) for each B co vered by A . This is done by finding the minimum-length path space geo desic through the path spaces { Q B : B ∈ K (Σ 1 , Σ 2 ) is cov ered by A and Q 0 B = r ( Q B ) } . This suggests the following algorithm. Let G K (Σ 1 , Σ 2 ) b e the directed graph with v ertices in bijection with the elemen ts of K (Σ 1 , Σ 2 ), and with an edge b et ween tw o vertices if and only if there is a cov er relation b et ween their corresp onding elements in K (Σ 1 , Σ 2 ). The edge is directed from the cov ered element to the co v ering elemen t. Then we can compute the geo desic distance b y doing a breath-first search on G K (Σ 1 , Σ 2 ) . As w e visit each no de A in G K (Σ 1 , Σ 2 ) , w e construct the geo desic betw een T ( X Σ 1 ( A )) and T ( A ) using the geo desics b et w een T ( X Σ 1 ( B )) and T ( B ) for eac h B cov ered by A . This algorithm visits every no de in the graph, of which there can b e an exp onen tial num b er as sho wn in Remark 3.8, so this algorithm is exp onential in the w orst case. Ho w ev er, this is a significan t improv ement ov er considering eac h maximal path space. W e implemented a more memory-efficient version of this algorithm, called GeodeMaps- D ynamic . This version uses a depth-first searc h of G K (Σ 1 , Σ 2 ) . F or eac h element A in G K (Σ 1 , Σ 2 ) , store the distance of the shortest path space geodesic found so far b et ween T ( X Σ 1 ( A )) and T ( A ). If GeodeMaps-Dynamic revisits an elemen t with a longer path space geo desic, it prunes this branc h of the search. GeodeMaps-D ynamic stores the carrier of the shortest path space geo desic found so far b e- t w een T 1 and T 2 . As a heuristic improv ement, at each step in the depth-first search, GeodeMaps- D ynamic c ho oses the no de with the lo west transition ratio of the no des not y et visited. F or more details and an example of GeodeMaps-Dynamic , see [20, Section 5.2.1]. 5.3 Geo deMaps-Divide: a Divide And Conquer Algorithm If A is an element in K (Σ 1 , Σ 2 ), then the trees in the corresp onding orthant share the splits A with T 2 . This inspires the follo wing algorithm, which w e call GeodeMaps-Divide . Cho ose some minimal elemen t of P (Σ 1 , Σ 2 ), and add the splits in this equiv alence class to T 1 b y first dropping the incompatible splits. F or example, if w e c ho ose to add the split set F 1 , then w e m ust drop X Σ 1 ( F 1 ). The trees with this new topology no w hav e splits F 1 in common with T 2 . Apply Theorem 2.1 to divide the problem into subproblems along these common splits. F or each subproblem, recursively call GeodeMaps-Divide . Since some subproblems will b e encoun tered many times, store the geo desics for eac h solved subproblem in a hash table. Eac h subproblem corresp onds to an elemen t in K (Σ 1 , Σ 2 ), and GeodeMaps-Divide is p oly- nomial in the n umber of subproblems solved. Hence an upp er bound on the complexity of GeodeMaps-Divide is the num b er of elements in K (Σ 1 , Σ 2 ), which is exp onen tial in general b y Remark 3.8. See [20, Section 5.2.2] for details of this algorithm, an example, and a family of trees for which GeodeMaps-D ynamic has exp onen tial runtime. 5.4 P erformance of GeodeMaps -Dynamic and Geo deMaps-Divide W e no w compare the runtime p erformance of GeodeMaps-D ynamic and GeodeMaps-Divide with GeoMeTree [16], the only other geo desic distance algorithm published when this pap er w as written. F or n = 10 , 15 , 20 , 25 , 30 , 35 , 40 , 45, w e generated 200 random ro oted trees with n 21 lea v es, using a birth-death pro cess. Sp ecifically , w e ran evol ver , part of P AML [31] with the parameters estimated for the ph ylogeny of primates in [32], that is 6.7 for the birth rate ( λ ), 2.5 for the death rate ( µ ), 0.3333 for the sampling rate, and 0.24 for the mutation rate. F or eac h n , w e divided the 200 trees into 100 pairs, and computed the geo desic distance b etw een each pair. The av erage computation times are given in Figure 7. Memory was the limiting factor for all three algorithms, and preven ted us from calculating the missing data p oin ts. Both GeodeMaps- Comparison of Algorithm Average Runtimes -3 -2 -1 0 1 2 3 4 10 15 20 25 30 35 40 Number of Leaves in Trees Average Time per Distance Computation (log s) GeodeMaps-Dynamic GeodeMaps-Divide GeoMeTree Figure 7: Av erage runtimes of the three geo desic distance algorithms. D ynamic and GeodeMaps-Divide exhibit exp onential run time, but they are significantly faster the GeoMeTree . Note that as the trees used w ere random, they hav e v ery few common splits. Biologically meaningful trees often hav e man y common splits, resulting in muc h faster run times. F or example, for a data set of 31 43-lea ved trees represen ting p ossible ancestral histories of bacteria and arc haea [17], w e computed the geo desic distance b et ween each pair of trees. Using GeodeMaps- D ynamic the av erage computation time was 0.531 s, while using GeodeMaps-Divide the a verage time was 0.23 s. This contrasts to an av erage computation time of 22 s by GeodeMaps-Dynamic for t wo random trees with 40 lea ves. All computations were done on a Dell Po werEdge Quadcore with 4.0 GB memory , and 2.66 GHz x 4 pro cessing sp eed. The implementation of these algorithms, Geo deMaps 0.2, is av ailable for do wnload from www.math.b erkeley .edu/˜megan/geo demaps.html. 6 Conclusion W e ha ve used the com binatorics and geometry of the tree space T n to develop tw o algorithms to compute the geo desic distance b et ween tw o trees in this space. In doing so, we dev elop ed a p oset represen tation for the p ossible orthant sequences con taining the geo desic, and ga ve a linear time algorithm for computing the shortest path in the subspace V ( R n ) of R n , whic h will help c haracterize when the general problem of finding the shortest path through R n with obstacles is NP-hard. W e also show ed that geo desics can b e computed by solving smaller subproblems. Ac knowledgemen ts W e thank Louis Billera for n umerous helpful discussions and suggestions ab out this w ork; Karen V ogtmann for sharing her notes and thoughts on the problem; Seth Sulliv ant for suggestions that 22 greatly improv ed the presentation of this work; Philippe Lopez for the kind provision of the bio- logical data set; Jo e Mitc hell for pointing out that finding the geodesic in V ( R k ) is equiv alent to solving a touring problem; and an anonymous referee for constructiv e and helpful commen ts. References [1] N. Amen ta, M. Go dwin, N. Postarnak evich, and K. St. John. Approximating geodesic tree distance. Inform. Pr o c ess. L ett. , 103:61–65, 2007. [2] F. Ardila and C. Kliv ans. The Bergman complex of a matroid and ph ylogenetic trees. J. Combin. The ory Ser. B , 96:38–49, 2006. [3] L. Billera, S. Holmes, and K. V ogtmann. Geometry of the space of phylogenetic trees. A dv. in Appl. Math. , 27:733–767, 2001. [4] G. Birkhoff. L attic e The ory . American Mathematical So ciet y , 1967. [5] M.R. Bridson and A. Haefliger. Metric Sp ac es of Non-p ositive Curvatur e . Springer-V erlag, 1999. [6] J. Canny and J. Reif. Low er bounds for shortest path and related problems. In Pr o c e e dings of the 28th A nnual Symp osium on F oundations of Computer Scienc e (FOCS) , 1987. [7] B. DasGupta, X. He, T. Jiang, M. Li, and J. T romp. On the linear-cost subtree-transfer distance b et ween ph ylogenetic trees. Algorithmic a , 25:176–195, 1999. [8] M. Dror, A. Efrat, A. Lubiw, and J. Mitchell. T ouring a sequence of p olygons. In Pr o c e e dings of the 35th A nnual ACM Symp osium on The ory of Computing (STOC) , 2003. [9] R. Durbin, S. Eddy , A. Krogh, and G. Mitchison. Biolo gic al Se quenc e A nalysis: Pr ob abilistic Mo dels of Pr oteins and Nucleic A cids . Cam bridge Universit y Press, 1998. [10] G.F. Estabro ok, F.R. McMorris, and C.A. Meacham. Comparison of undirected ph ylogenetic trees based on subtrees of four evolutionary units. Syst. Zo ol. , 34:193–200, 1985. [11] J. Hein. Reconstructing evolution of sequences sub ject to recombination using parsimon y . Math. Biosci. , 98:185–200, 1990. [12] M.D. Hendy and D. P enny . A framework for the quan titative study of ev olutionary trees. Syst. Zo ol. , 38:297–309, 1989. [13] S. Holmes. Statistics for phylogenetic trees. The or etic al Population Biolo gy , 63:17–32, 2003. [14] S. Holmes. Statistical approac h to tests inv olving phylogenetics. In Mathematics of Evolution and Phylo geny . Oxford Universit y Press, 2005. [15] M.K. Kuhner and J. F elsenstein. A simulation comparison of phylogen y algorithms under equal and unequal evolutionary rates. Mol. Biol. Evol. , 11:459–468, 1994. [16] A. Kup czok, A. von Haeseler, and S. Klaere. An exact algorithm for the geo desic distance b et ween ph ylogenetic trees. J. Comput. Biol. , 15:577–591, 2008. [17] P . Lop ez. P ersonal communications, 2006. 23 [18] J.S.B. Mitc hell. Geometric shortest paths and netw ork optimization. In Handb o ok of Compu- tational Ge ometry , pages 633–701. Elsevier Science, 2000. [19] J.S.B. Mitchell and M. Sharir. New results on shortest paths in three dimensions. In 20 th A nnual Symp osium on Computational Ge ometry , 2004. [20] M. Ow en. Distanc e Computation in the Sp ac e of Phylo genetic T r e es . PhD thesis, Cornell Univ ersit y , 2008. [21] M. Ow en and J.S. Pro v an. A fast algorithm for computing geo desic distances in tree space. IEEE/A CM T r ansactions on Computational Biolo gy and Bioinformatics , 8:2–13, 2011. [22] V. P olishch uk and J.S.B. Mitc hell. T ouring conv ex b odies - a conic programming solution. In 17th Canadian Confer enc e on Computational Ge ometry , 2005. [23] D.F. Robinson. Comparison of lab eled trees with v alency three. J. Combinatorial The ory , 11:105–119, 1971. [24] D.F. Robinson and L.R. F oulds. Comparison of weigh ted lab elled trees. In Combinatorial Mathematics VI , v olume 748 of L e ctur e Notes in Mathematics , pages 119–126, Berlin, 1979. Springer. [25] D.F. Robinson and L.R. F oulds. Comparison of phylogenetic trees. Math. Biosci. , 53:131–147, 1981. [26] C. Semple and M. Steel. Phylo genetics . Oxford Universit y Press, Oxford, 2003. [27] D. Sp ey er and B. Sturmfels. The tropical Grassmannian. A dv. Ge om. , 4:389–411, 2004. [28] R.P . Stanley . Enumer ative Combinatorics , volume 1. Cambridge Univ ersity Press, 1997. [29] A. Staple. Computing distances in tree space. Unpublished research rep ort, Stanford Univ er- sit y , 2004. [30] K. V ogtmann. Geo desics in the space of trees. Av ailable at www.math.cornell.edu/ v v ogtmann/pap ers/T reeGeo desicss/index.h tml, 2007. [31] Z. Y ang. P AML 4: a program pack age for phylogenetic analysis b y maximum likelihoo d. Mol. Biol. Evol. , 24:1586–1591, 2007. [32] Z. Y ang and B. Rannala. Bay esian ph ylogenetic inference using DNA sequences: A Mark ov Chain Monte Carlo metho d. Mol. Biol. Evol. , 14:717–724, 1997. 24

Original Paper

Loading high-quality paper...

Comments & Academic Discussion

Loading comments...

Leave a Comment