The Butterfly effect in Cayley graphs, and its relevance for evolutionary genomics

Suppose a finite set $X$ is repeatedly transformed by a sequence of permutations of a certain type acting on an initial element $x$ to produce a final state $y$. We investigate how 'different' the resulting state $y'$ to $y$ can be if a slight change…

Authors: Vincent Moulton, Mike Steel

Vincent Moulton and Mike Steel

(VM) School of Computing Sciences, University of East Anglia, Norwich, NR4 7TJ, UK.
(MS) Biomathematics Research Centre, University of Canterbury, Christchurch, New Zealand.
email: vincent.moulton@cmp.uea.ac.uk, mike.steel@canterbury.ac.nz

Abstract

Suppose a finite set $X$ is repeatedly transformed by a sequence of permutations of a certain type acting on an initial element $x$ to produce a final state $y$. We investigate how 'different' the resulting state $y'$ can be from $y$ if a slight change is made to the sequence, either by deleting one permutation, or replacing it with another. Here the 'difference' between $y$ and $y'$ might be measured by the minimum number of permutations of the permitted type required to transform $y$ to $y'$, or by some other metric. We discuss this first in the general setting of sensitivity to perturbation of walks in Cayley graphs of groups with a specified set of generators. We then investigate some permutation groups and generators arising in computational genomics, and the statistical implications of the findings.

Keywords: evolutionary distance, permutation, metric, group action, genome rearrangements

1. Introduction

In evolutionary genomics, two genomes$^1$ are frequently compared by the minimum number of 'rearrangements' (of various types) required to transform one genome into another [7]. This minimum number is then used as an estimate of the actual number of events, and thereby of the 'evolutionary distance' between the species involved.
Since both the precise number and the actual rearrangement events that occurred in the evolution of the two genomes from a common ancestor are unknown, it is pertinent to have some idea of how sensitive this distance estimate might be to the sequence of events (not just the number) that really took place [19]. This question has important implications for the accurate inference of evolutionary relationships between species from their genomes, and we discuss some of these further in Section 5. However, we begin by framing the type of mathematical questions that we will be considering in a general algebraic context.

$^1$ For the purposes of this paper a genome is simply an ordered sequence of objects (usually taken from the DNA alphabet or a collection of genes) which may occur with or without repetition, and with or without an orientation $(+,-)$.

Preprint submitted to Elsevier, November 7, 2021

Let $G$ be a finite group, whose identity element we write as $1_G$, and let $S$ be a subset of generators that is symmetric (i.e. closed under inverses, so $x \in S \Rightarrow x^{-1} \in S$). In addition, let $\Gamma = \mathrm{Cay}(G, S)$ be the associated Cayley graph, with vertex set $G$ and an edge connecting $g$ and $g'$ if there exists $s \in S$ with $g' = gs$ (unless otherwise stated, we use the convention of multiplying group elements from left to right). For any two elements $g, g' \in G$, the distance $d_S(g, g')$ in $\mathrm{Cay}(G, S)$ is the minimum value of $k$ for which there exist elements $s_1, \ldots, s_k$ of $S$ so that $g' = g s_1 \cdots s_k$ (for $g = g'$, we set $d_S(g, g') = 0$). Note that $d_S$ is a metric; in particular, $d_S(g, g') = d_S(g', g)$, since $S$ is symmetric. In this paper, our focus is on the following two quantities:
$$\lambda_1(G, S) := \max_{g \in G,\, s \in S} \{ d_S(sg, g) \},$$
and
$$\lambda_2(G, S) := \max_{g \in G,\, s, s' \in S} \{ d_S(sg, s'g) \}.$$
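For small groups both quantities can be computed by brute force. The following is a minimal sketch, not part of the original analysis: it assumes permutations are stored as 0-based tuples, multiplies them as compositions of maps, and uses the fact that $g' = g s_1 \cdots s_k$ holds exactly when $g^{-1} g' = s_1 \cdots s_k$, so that $d_S(g, g')$ is the breadth-first-search depth of $g^{-1} g'$ from the identity. All function names are illustrative.

```python
from collections import deque

def compose(g, h):
    """Product gh as a composition of maps: (gh)(i) = g(h(i)), 0-based."""
    return tuple(g[h[i]] for i in range(len(g)))

def inverse(g):
    inv = [0] * len(g)
    for i, gi in enumerate(g):
        inv[gi] = i
    return tuple(inv)

def adjacent_transpositions(n):
    """The generators (12), (23), ..., (n-1 n) as 0-based tuples."""
    gens = []
    for i in range(n - 1):
        s = list(range(n))
        s[i], s[i + 1] = s[i + 1], s[i]
        gens.append(tuple(s))
    return gens

def word_lengths(n, gens):
    """Breadth-first search from the identity along edges g -- gs.

    Returns a dict mapping every reachable g to its distance from the identity.
    """
    identity = tuple(range(n))
    length = {identity: 0}
    queue = deque([identity])
    while queue:
        g = queue.popleft()
        for s in gens:
            h = compose(g, s)
            if h not in length:
                length[h] = length[g] + 1
                queue.append(h)
    return length

def cayley_distance(length, g, h):
    """d_S(g, h), read off as the search depth of g^{-1} h."""
    return length[compose(inverse(g), h)]

def lambda_1(n, gens):
    length = word_lengths(n, gens)
    return max(cayley_distance(length, compose(s, g), g)
               for g in length for s in gens)

def lambda_2(n, gens):
    length = word_lengths(n, gens)
    return max(cayley_distance(length, compose(s, g), compose(t, g))
               for g in length for s in gens for t in gens)

if __name__ == "__main__":
    S = adjacent_transpositions(4)
    print(lambda_1(4, S), lambda_2(4, S))  # prints: 5 6
```

The search visits all $n!$ permutations, so this is only practical for very small $n$; it is meant as a sanity check on the definitions rather than as an efficient algorithm.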
One way to view these quantities is via the following result, which is easily proved.

Lemma 1. Let $S$ be a symmetric set of generators for a finite group $G$. Then:
• $\lambda_1(G, S)$ is the maximum value of $d_S(g, g')$ between any pair of elements $g$ and $g'$ of $G$ for which $g = s_1 s_2 \cdots s_k$ and $g' = s'_1 s'_2 \cdots s'_k$, where $s'_i = s_i \in S$ for all but at most one value (say $j$) of $i$, and $s'_j = 1_G$.
• $\lambda_2(G, S)$ is the maximum value of $d_S(g, g')$ between any pair of elements $g$ and $g'$ of $G$ for which $g = s_1 s_2 \cdots s_k$ and $g' = s'_1 s'_2 \cdots s'_k$, where $s'_i = s_i \in S$ for all but at most one value (say $j$) of $i$, and $s'_j \in S$, $s'_j \neq s_j$.

Thus, $\lambda_1(G, S)$ tells us how much (under $d_S$) a product of generators can change if we drop one value of $s$, whilst $\lambda_2(G, S)$ tells us how much (again under $d_S$) a product of generators in $S$ can change if we substitute one value of $s$ by another $s'$ (see Fig. 1 for an example where $\lambda_2(G, S) = 6$). As such, $\lambda_m$ is a measure of the 'sensitivity' of walks in the Cayley graph to a switch in or deletion of a generator at some point.

Figure 1: The Cayley graph $\mathrm{Cay}(G, S)$ for $G = \Sigma_4$ (the permutation group on $\{1, 2, 3, 4\}$) and the set of transpositions $S = \{(12), (23), (34)\}$. Substituting just one element, namely $(34)$ for $(12)$, in the product corresponding to the walk in the lower front face (which starts and returns to the lower-most point $[1234]$) results in a walk that ends at a point ($[4321]$, top) that is very distant (under $d_S$) from the end-point of the original walk. In fact, the two end-points are at maximal distance in this example.

Moreover, if $G$ acts transitively
and freely$^2$ on a set $X$, then $\lambda_m$ provides a corresponding measure of sensitivity of this action to a switch in or deletion of a generator (since a transitive, free action of $G$ on $X$ is isomorphic to the action of $G$ on itself by right multiplication). Actions with large $\lambda_m$ values can thus be viewed as exhibiting a discrete, group-theoretic analogue of the 'butterfly effect' in non-linear dynamics (see e.g. [9]).

$^2$ $G$ acts transitively on $X$ if for any pair $x, y \in X$ there exists $g \in G$ with $g \circ x = y$; the action is free if $g \circ x = h \circ x \Rightarrow g = h$, for all $g, h \in G$ and $x \in X$, where '$\circ$' denotes the action of $G$ on $X$.

In the genomics applications that we shall consider, elements of the group $G$ correspond to genomes, and $d_S$ to the evolutionary distance between them. After presenting some general results concerning $\lambda_m$ in the next section, in Sections 3 and 4 we discuss some applications arising for various choices of $G$ and $S$. These include the Klein four-group, which arises in evolutionary models of DNA sequence evolution, and the permutation group, which typically appears when studying rearrangement distances between genomes. We conclude in Section 5 with some statistical implications of our results.

One can imagine many other settings besides genomics where similar questions arise. For example, in a sequence of moves that should unscramble the Rubik's cube from a given position [12], what will be the consequences (in terms of the number of moves required) for completing the unscrambling if a mistake is made at some point (or one move is forgotten)? In addition, related questions arise in the study of 'automatic' groups, where the group under consideration is typically infinite [4].

2. General inequalities

We first make some basic observations about Cayley graphs and the metric $d_S$ (further background on basic group theory, Cayley graphs, and group actions can be found in [15]).
It is well known that $\Gamma$ is a connected regular graph of degree equal to the cardinality of $S$ and that $\Gamma$ is also vertex-transitive (see, for example, [11], Proposition 1). Consider the function $l_S : G \to \{0, 1, 2, 3, \ldots, |G|\}$, where $l_S(1_G) = 0$ and, for each $g \in G - \{1_G\}$, $l_S(g)$ is the smallest number $l$ of elements $s_1, \ldots, s_l$ from $S$ for which we can write $g = s_1 \cdots s_l$. The function $l_S$ clearly satisfies the subadditivity property that, for all $g, g' \in G$:
$$l_S(g g') \leq l_S(g) + l_S(g').$$
In addition,
$$l_S(g^{-1}) = l_S(g), \quad l_S(g) = 1 \Leftrightarrow g \in S, \quad l_S(g) = 0 \Leftrightarrow g = 1_G.$$
Note that $l_S(g g')$ is generally not equal to $l_S(g' g)$. The metric $d_S$, described in the previous section, is related to $l_S$ as follows:
$$d_S(g, g') = l_S(g^{-1} g').$$
Consequently, by definition:
$$\lambda_1(G, S) = \max_{g \in G,\, s \in S} \{ l_S(g^{-1} s g) \}, \qquad (1)$$
and
$$\lambda_2(G, S) = \max_{g \in G,\, s, s' \in S} \{ l_S(g^{-1} s s' g) \}. \qquad (2)$$
Let $l_S(G) = \max\{ l_S(g) : g \in G \}$, which is the diameter of $\mathrm{Cay}(G, S)$, that is, the maximum length of a shortest path connecting any two elements of $G$. Clearly, $\lambda_1(G, S), \lambda_2(G, S) \leq l_S(G)$. Moreover:
$$\lambda_2(G, S) \leq 2 \cdot \lambda_1(G, S), \qquad (3)$$
since, for any $g \in G$ and $s, s' \in S$, we have:
$$d_S(sg, s'g) \leq d_S(sg, g) + d_S(g, s'g).$$
A partial converse to Inequality (3) is provided by the following:
$$\lambda_1(G, S) \leq \lambda_2(G, S) + \lambda'_1(G, S), \qquad (4)$$
where
$$\lambda'_1(G, S) = \max_{g \in G} \min_{s \in S} \{ l_S(g^{-1} s g) \}.$$
To verify (4), select a pair $g \in G$, $s \in S$ so that $l_S(g^{-1} s g) = \lambda_1(G, S)$. Then:
$$\lambda_1(G, S) = d_S(sg, g) \leq d_S(sg, s_1 g) + d_S(s_1 g, g),$$
where $s_1$ is an element $s'$ (possibly equal to $s$) in $S$ that minimizes $l_S(g^{-1} s' g)$. Now, $d_S(sg, s_1 g) \leq \lambda_2(G, S)$ (even if $s_1 = s$) and $d_S(s_1 g, g) \leq \lambda'_1(G, S)$, and so we obtain (4).
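As a numerical illustration of (1) to (4), the sketch below evaluates the diameter, $\lambda_1$, $\lambda_2$ and $\lambda'_1$ on $\Sigma_4$ for two generating sets (adjacent transpositions and all transpositions) and checks the inequalities. It is a brute-force check under the assumption that permutations are 0-based tuples composed as maps; the helper names are hypothetical.

```python
from collections import deque

def compose(g, h):
    """(gh)(i) = g(h(i)) on 0-based tuples."""
    return tuple(g[h[i]] for i in range(len(g)))

def inverse(g):
    inv = [0] * len(g)
    for i, gi in enumerate(g):
        inv[gi] = i
    return tuple(inv)

def transposition(n, i, j):
    s = list(range(n))
    s[i], s[j] = s[j], s[i]
    return tuple(s)

def word_lengths(n, gens):
    """l_S for every group element, by breadth-first search from the identity."""
    identity = tuple(range(n))
    length = {identity: 0}
    queue = deque([identity])
    while queue:
        g = queue.popleft()
        for s in gens:
            h = compose(g, s)
            if h not in length:
                length[h] = length[g] + 1
                queue.append(h)
    return length

def sensitivity_summary(n, gens):
    length = word_lengths(n, gens)
    group = list(length)

    def conj_len(g, x):
        # l_S(g^{-1} x g), the quantity appearing in Eqns (1) and (2)
        return length[compose(inverse(g), compose(x, g))]

    return {
        "diameter": max(length.values()),
        "lambda1": max(conj_len(g, s) for g in group for s in gens),
        "lambda2": max(conj_len(g, compose(s, t))
                       for g in group for s in gens for t in gens),
        "lambda1_prime": max(min(conj_len(g, s) for s in gens) for g in group),
    }

def check_inequalities(summary):
    """Diameter bound, Inequality (3) and Inequality (4)."""
    return (summary["lambda1"] <= summary["diameter"]
            and summary["lambda2"] <= summary["diameter"]
            and summary["lambda2"] <= 2 * summary["lambda1"]
            and summary["lambda1"] <= summary["lambda2"] + summary["lambda1_prime"])

if __name__ == "__main__":
    n = 4
    coxeter = [transposition(n, i, i + 1) for i in range(n - 1)]
    all_transpositions = [transposition(n, i, j)
                          for i in range(n) for j in range(i + 1, n)]
    for name, gens in [("adjacent", coxeter), ("all", all_transpositions)]:
        summary = sensitivity_summary(n, gens)
        print(name, summary, check_inequalities(summary))
```

Because every generator used here is an involution, taking the maximum of $l_S(g^{-1} s s' g)$ over $s, s' \in S$ agrees with the definition via $d_S(sg, s'g)$; for a general symmetric $S$ the same holds because $s^{-1}$ ranges over $S$ as $s$ does.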
Note also that if $G$ is Abelian, then $\lambda_1(G, S) = 1$ and $\lambda_2(G, S) \leq 2$ for any symmetric set $S$ of generators. Moreover, for the Abelian 2-group $G = \mathbb{Z}_2^n$, with the symmetric set $S$ of generators consisting of all $n$ elements with the identity at all but one position, we have $l_S(G) = n$ and $\lambda_1(G, S) = 1$. This shows that the gap in the inequality $\lambda_1(G, S) \leq l_S(G)$ can be arbitrarily large. Our next result generalizes this observation further.

Lemma 2. Let $G_1, G_2, \ldots, G_k$ be finite groups, and let $S_i$ be a symmetric set of generators of $G_i$ for $i = 1, \ldots, k$. Consider the direct product $G = G_1 \times G_2 \times \cdots \times G_k$ along with the symmetric set of generators $S$ of $G$ consisting of all possible $k$-tuples which consist of the identity element of $G_i$ at all but one co-ordinate $i$, where it takes some value in $S_i$. Then
(i) $\lambda_1(G, S) \leq \max_{1 \leq i \leq k} \{ l_{S_i}(G_i) \}$, and
(ii) $l_S(G) = \sum_{i=1}^{k} l_{S_i}(G_i)$.

Proof: For Part (i), let $\lambda_1(G, S) = l_S(g^{-1} s g)$, where $s \in S$ has its non-identity entry at some co-ordinate $\nu$. Notice that $(g^{-1} s g)_j = 1_{G_j}$ for all $j \neq \nu$. Moreover, $(g^{-1} s g)_\nu = s_1 \cdots s_l$ where $l \leq l_{S_\nu}(G_\nu)$. Thus $l_S(g^{-1} s g) \leq l_{S_\nu}(G_\nu)$, as claimed. For Part (ii), the inequality $l_S(G) \leq \sum_{i=1}^{k} l_{S_i}(G_i)$ is clear; to establish the reverse inequality, let $g_i$ be an element of $G_i$ with $l_{S_i}(g_i) = l_{S_i}(G_i)$, and $g = (g_1, \ldots, g_k) \in G$. Then $l_S(g) = \sum_{i=1}^{k} l_{S_i}(G_i)$, and so $l_S(G) \geq \sum_{i=1}^{k} l_{S_i}(G_i)$. $\Box$

We now consider how $\lambda_m$ behaves under group homomorphisms. Suppose $H$ is the homomorphic image of a group $G$ under a map $p$. Let $N = \mathrm{Ker}(p)$ be the kernel of $p$, which is a normal subgroup of $G$, with $H \cong G/N$. Thus we have a short exact sequence:
$$1 \to N \to G \xrightarrow{\,p\,} H \to 1. \qquad (5)$$
Let $S$ be a symmetric set of generators of $G$. Then $S_H = \{ p(s) : s \in S - N \}$ is a symmetric set of generators of $H$.

Lemma 3.
For $m = 1, 2$, $\lambda_m(H, S_H) \leq \lambda_m(G, S)$.

Proof: First suppose that $m = 1$. For $x \in S_H$ and $h \in H$, consider $h^{-1} x h$. There exist elements $g \in G$ and $s \in S - N$ for which $p(g) = h$ and $p(s) = x$. Now the element $g^{-1} s g \in G$ can be written as a product of at most $l = \lambda_1(G, S)$ elements of $S$, that is, $g^{-1} s g = s_1 s_2 \cdots s_k$ for $k \leq l$. Applying $p$ to both sides of this equation gives $h^{-1} x h = p(s_1) p(s_2) \cdots p(s_k)$. Notice that some of the elements on the right may equal the identity element of $H$ (since $p(s_i) = 1_H \Leftrightarrow s_i \in N$), but they are elements of $S_H$ otherwise. Thus $l_{S_H}(h^{-1} x h) \leq l$. Since this holds for all such elements $h, x$, Eqn. (1) shows that $\lambda_1(H, S_H) \leq \lambda_1(G, S)$. The corresponding result for $m = 2$ follows by an analogous argument. $\Box$

To obtain a lower bound for $\lambda_m(G, S)$, suppose that the short exact sequence (5) is a split extension, i.e. there is a homomorphism $i : H \to G$ so that $p \circ i$ is the identity map on $H$, which (by the splitting lemma) is equivalent to the condition that $G$ is the semidirect product of $N$ with a subgroup $H'$ isomorphic to $H$ (i.e. $G = N H' = H' N$, $H' \cap N = \{1_G\}$). In this case we have the following bounds.

Proposition 4. Suppose a finite group $G$ is a semidirect product of subgroups $N$ (normal) and $H$. Let $S_N$, $S_H$ be symmetric generator sets for $N$ and $H$ respectively, and let $S = S_N \cup S_H$, which is a symmetric generator set for $G$. Then:
$$\lambda_1(H, S_H) \leq \lambda_1(G, S) \leq \lambda_1(H, S_H) + l_{S_N}(N).$$
In particular, by (3), $\lambda_2(G, S) \leq 2 \lambda_1(H, S_H) + 2\, l_{S_N}(N)$.

Proof: The lower bound on $\lambda_1(G, S)$ follows from Lemma 3. For the upper bound we must show that for all $s \in S$ and $g \in G$, $d_S(sg, g) \leq \lambda_1(H, S_H) + l_{S_N}(N)$ holds. We consider two cases: (i) $s \in N$, and (ii) $s \in H$.
In Case (i), note that the conjugate element $g^{-1} s g$ is also an element of $N$; in this case we have the tighter bound $d_S(sg, g) \leq l_{S_N}(N)$. In Case (ii), write $g = hn$ where $n \in N$ and $h \in H$. Consider the word
$$w = g^{-1} s g = n^{-1} h^{-1} s h n.$$
Since $N$ is normal we have $n^{-1}(h^{-1} s h) = (h^{-1} s h) n'$ for some element $n' \in N$. Thus $w = h^{-1} s h\, n' n$. Write $w = w_1 w_2$ where $w_1 = h^{-1} s h \in H$ and $w_2 = n' n \in N$. We can select $w_2$ to be a product of terms of $S_N$ of length at most $l_{S_N}(N)$ and, by Eqn. (1), we can select $w_1$ to be a product of terms of $S_H$ of length at most $\lambda_1(H, S_H)$. Thus $w$ can be written as a product of, at most, $\lambda_1(H, S_H) + l_{S_N}(N)$ elements of $S$. $\Box$

3. Permutation groups and genomic applications

We first describe a direct application that is relevant to the evolution of a DNA sequence under a simple model of site substitution (Kimura's 3ST model) [10]. Consider the four-letter DNA alphabet $\mathcal{A} = \{A, C, G, T\}$ and the Klein four-group $K = \mathbb{Z}_2 \times \mathbb{Z}_2$ with an action on $\mathcal{A}$ in which the three non-zero elements of $K$ correspond to 'transitions' (A ↔ G, C ↔ T) and the two types of 'transversions' (A ↔ C, G ↔ T; and A ↔ T, G ↔ C). This representation of the Kimura 3ST model was first described and exploited by [6].

For $g \in K$ and $x \in \mathcal{A}$, let $g \circ x$ denote the element of $\mathcal{A}$ obtained by the action of $g$ on $x$ (the identity element fixes each element of $\mathcal{A}$). The resulting component-wise action of $K^n$ on $\mathcal{A}^n$, defined by
$$(g_1, \ldots, g_n) \circ (x_1, \ldots, x_n) = (g_1 \circ x_1, \ldots, g_n \circ x_n),$$
can be regarded as the set of all changes that can occur to a DNA sequence over a period of time under site substitutions. Now, under any continuous-time Markovian process these change events ('site substitutions') occur just one at a time, and so a natural generating set of $K^n$ is the set $S_n$ of all elements of $K^n$ that consist of $1_K$ at all but one co-ordinate.
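The action just described can be made concrete with a small sketch. The bit-pair encoding of nucleotides below is an assumption chosen for illustration (any labelling that realises the same three involutions would do); with it, $K$ acts by bitwise XOR, and one can check directly that the action on $\mathcal{A}$ is free and transitive.

```python
from itertools import product

# Hypothetical encoding (an assumption of this sketch, not taken from the paper):
# each nucleotide is a pair of bits, and K = Z_2 x Z_2 acts by bitwise XOR.
BITS = {"A": (0, 0), "G": (1, 0), "C": (0, 1), "T": (1, 1)}
BASE = {bits: base for base, bits in BITS.items()}
K = list(product((0, 1), repeat=2))  # (0, 0) plays the role of the identity 1_K

def act(g, x):
    """The image g o x of nucleotide x under the group element g."""
    a, b = BITS[x]
    return BASE[(a ^ g[0], b ^ g[1])]

def act_on_sequence(gs, seq):
    """Component-wise action of (g_1, ..., g_n) on a DNA sequence of length n."""
    return "".join(act(g, x) for g, x in zip(gs, seq))

def is_free_and_transitive():
    """For each x, the map g -> g o x must be a bijection from K onto the alphabet."""
    return all(sorted(act(g, x) for g in K) == list("ACGT") for x in "ACGT")

def single_site_generators(n):
    """Elements of K^n equal to the identity at all but one co-ordinate."""
    identity = (0, 0)
    gens = []
    for site in range(n):
        for k in K:
            if k != identity:
                gens.append(tuple(k if i == site else identity for i in range(n)))
    return gens

if __name__ == "__main__":
    print(act((1, 0), "A"), act((0, 1), "A"), act((1, 1), "A"))  # prints: G C T
    print(is_free_and_transitive())                              # prints: True
    print(len(single_site_generators(5)))                        # prints: 15
```

Under this encoding $(1,0)$ is the transition and $(0,1)$, $(1,1)$ are the two transversion types; the list returned by `single_site_generators` excludes the identity of $K^n$, which is a choice made here for convenience.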
Moreover, since the action of $K^n$ on $\mathcal{A}^n$ is transitive and free (and so is isomorphic to the action of $K^n$ on itself by right multiplication), $\lambda_m(K^n, S_n)$ measures the impact of ignoring (for $m = 1$) or replacing (for $m = 2$) one substitution in a chain of such events over time. As $K^n$ is Abelian, one has $\lambda_1(K^n, S_n) = 1$ and $\lambda_2(K^n, S_n) = 2$, which implies that this impact is minor and, more significantly, is independent of $n$; this has important statistical implications which we will describe further in Section 5.

For a related example, consider the ordered sequence of distinct genes $(g_1, g_2, \ldots, g_n)$ partitioned into regions $R_1, R_2, \ldots, R_k$ so that genomic rearrangements occur within each region, but not between regions (e.g. the $R_i$ might refer to different chromosomes). This situation can be modelled by the setting of Lemma 2 in which $G_i$ is a permutation group on the genes within $R_i$, and $S_i$ is a set of elementary gene order rearrangement events that generates $G_i$ (we discuss some examples below). In this case, Lemma 2 provides a bound on $\lambda_1$ and $\lambda_2$ that is independent of the number of regions $k$.

We turn now to the calculation of $\lambda_m(\Sigma_n, S)$ for the permutation group $\Sigma_n$ on $n$ elements (a group of order $n!$) and various sets $S$ of generators. This group commonly arises when studying genome rearrangements [11]. Our main interest is to determine, for each instance of $S$, whether there is a constant $C$ (independent of $n$) for which $\lambda_m(\Sigma_n, S) \leq C$, for $m = 1, 2$.

A permutation $g$ on the set $[n] := \{1, 2, \ldots, n\}$ is a bijective mapping from $[n]$ to itself. We will also write $g$ as $g = [g_1, g_2, \ldots, g_n]$ where $g_i = g(i)$ is the image of the map $g$ for $i \in [n]$. Note that, following the usual convention, the product $g g'$ of two permutations $g, g' \in \Sigma_n$ will be considered as the composition of the functions $g$ and $g'$.
In particular, $g g'(i) = g(g'(i))$ for all $i \in [n]$.

When studying genomes, each entry $g_i$ of a permutation $g$ corresponds to a gene and the full list $[g_1, g_2, \ldots, g_n]$ to a genome. Multiplying $g$ by a permutation leads to a rearrangement of the genome. For example, multiplying by a transposition $t_{i,j}$ interchanges the values at positions $i$ and $j$ of $g$, i.e.
$$[\ldots, g_i, \ldots, g_j, \ldots]\, t_{i,j} = [\ldots, g_j, \ldots, g_i, \ldots],$$
and multiplying by a reversal $r_{i,j}$ reverses the segment $[g_i, \ldots, g_j]$, $1 \leq i < j \leq n$, of $g$, i.e.
$$[\ldots, g_i, g_{i+1}, \ldots, g_{j-1}, g_j, \ldots]\, r_{i,j} = [\ldots, g_j, g_{j-1}, \ldots, g_{i+1}, g_i, \ldots].$$
Such rearrangements are widely observed and studied in molecular biology [7].

In genomics applications, we are often interested in defining some distance between genomes. One distance that is commonly used in the context of permutations is the breakpoint distance [17, 7.3]. For $g, g' \in \Sigma_n$, $d_{BP}(g, g')$ is defined as the number of pairs of elements that are adjacent in the list $[0, g_1, g_2, \ldots, g_n, n+1]$, but not in the list $[0, g'_1, g'_2, \ldots, g'_n, n+1]$. For example, if $g = [1, 2, 3, 4, 5]$, $g' = [1, 4, 3, 2, 5] \in \Sigma_5$, we have $d_{BP}(g, g') = 2$. It is clear that $\max\{ d_{BP}(g, g') : g, g' \in \Sigma_n \} = n + 1$.

Alternatively, one can consider the rearrangement distance between two genomes, i.e. the minimal number of operations of a certain type (such as transpositions or reversals) that can be applied to one of the genomes to obtain the other [7]. In terms of Cayley graphs, this distance can be conveniently expressed for transpositions and reversals as follows. Let
$$T = T_n := \{ t_{i,j} \in \Sigma_n : 1 \leq i < j \leq n \},$$
$$C = C_n := \{ t_{i,i+1} \in T : 1 \leq i \leq n - 1 \} \quad \text{(the Coxeter generators)},$$
and
$$R = R_n := \{ r_{i,j} \in \Sigma_n : 1 \leq i < j \leq n \}.$$
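The operations and the breakpoint distance defined above translate directly into code. The sketch below uses 1-based positions to match the text and treats adjacency as unordered (so the pairs 2, 3 and 3, 2 count as the same adjacency); that reading is an assumption of this sketch, and it reproduces the worked example $d_{BP}(g, g') = 2$.

```python
def apply_transposition(g, i, j):
    """Return g t_{i,j}: swap the entries at 1-based positions i and j."""
    h = list(g)
    h[i - 1], h[j - 1] = h[j - 1], h[i - 1]
    return h

def apply_reversal(g, i, j):
    """Return g r_{i,j}: reverse the entries at 1-based positions i..j inclusive."""
    h = list(g)
    h[i - 1:j] = h[i - 1:j][::-1]
    return h

def breakpoint_distance(g, h):
    """Number of adjacencies of [0, g_1, ..., g_n, n+1] absent from the list for h.

    Adjacencies are compared as unordered pairs (an assumption of this sketch).
    """
    n = len(g)
    ext_g = [0] + list(g) + [n + 1]
    ext_h = [0] + list(h) + [n + 1]
    adjacencies_h = {frozenset(pair) for pair in zip(ext_h, ext_h[1:])}
    return sum(1 for pair in zip(ext_g, ext_g[1:])
               if frozenset(pair) not in adjacencies_h)

if __name__ == "__main__":
    g = [1, 2, 3, 4, 5]
    g_prime = apply_reversal(g, 2, 4)
    print(g_prime)                          # prints: [1, 4, 3, 2, 5]
    print(breakpoint_distance(g, g_prime))  # prints: 2
```

In this example the single reversal $r_{2,4}$ destroys exactly the two adjacencies at the ends of the reversed segment, which is why breakpoint counts are often used as a cheap proxy for reversal distance.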
Note that all three of these sets generate $\Sigma_n$ [11] and that they are all symmetric, since each generator is its own inverse. The metric $d_S$, $S = T, C, R$, is precisely the rearrangement distance. The diameters of $\mathrm{Cay}(\Sigma_n, T)$ and $\mathrm{Cay}(\Sigma_n, R)$ are both $n - 1$, and the diameter of $\mathrm{Cay}(\Sigma_n, C)$ is $\binom{n}{2}$ [11]. Regarding the quantities $\lambda_m(\Sigma_n, S)$, we have the following result for $S = T, C, R$:

Theorem 5. For $n \geq 7$ the following hold:
(i) $\lambda_1(\Sigma_n, T_n) = 1$ and $\lambda_2(\Sigma_n, T_n) = 2$.
(ii) $\lambda_1(\Sigma_n, C_n) = 2n - 3$ and $2n - 2 \leq \lambda_2(\Sigma_n, C_n) \leq 4n - 6$.
(iii) $\frac{n+1}{2} \leq \lambda_m(\Sigma_n, R_n) \leq n - 1$, $m = 1, 2$.

Figure 2: (a) A diagrammatic representation of the element $g = [3, 2, 5, 4, 7, 6, 1]$ in $\Sigma_7$, defined in the proof of Theorem 5 (iii). (b) The product $r_{1,7}\, g = [5, 6, 3, 4, 1, 2, 7]$. Note that $d_{BP}(r_{1,7}\, g, g) = 8$.

Proof: (i) Note that if $g \in \Sigma_n$ and $t_{i,j} \in T$, then:
$$g^{-1} t_{i,j}\, g = t_{g^{-1}(i),\, g^{-1}(j)}. \qquad (6)$$
Therefore $\lambda_1(\Sigma_n, T) = 1$ by (1). Thus, by Inequality (3), we have $\lambda_2(\Sigma_n, T) \leq 2$. The equality $\lambda_2(\Sigma_n, T) = 2$ follows by (2) and the fact that
$$g^{-1} t_{k,l}\, t_{i,j}\, g = t_{g^{-1}(i),\, g^{-1}(j)}\; t_{g^{-1}(k),\, g^{-1}(l)}$$
holds for any $g \in \Sigma_n$ and $1 \leq i < j < k < l \leq n$.

(ii) Consider the permutation $g \in \Sigma_n$ given by $g = [2, 3, \ldots, n-1, n, 1]$. Then $g^{-1} t_{1,2}\, g = [n, 2, 3, \ldots, n-1, 1]$. Therefore, $l_C(g^{-1} t_{1,2}\, g) \geq 2n - 3$ (since to transform $[n, 2, 3, \ldots, n-1, 1]$ to $1_{\Sigma_n}$ requires moving $1$ and $n$ back to their original positions). Therefore, $\lambda_1(\Sigma_n, C) \geq 2n - 3$ by (1). But, by Equality (6), $\lambda_1(\Sigma_n, C) \leq 2n - 3$, since any transposition is the product of at most $2n - 3$ elements in $C$. In particular, $\lambda_1(\Sigma_n, C) = 2n - 3$. Similarly, $l_C(g^{-1} t_{1,2}\, t_{3,4}\, g) \geq 2n - 2$, and so $\lambda_2(\Sigma_n, C) \geq 2n - 2$ by (2).
Moreover, by Inequality (3), we have $\lambda_2(\Sigma_n, C) \leq 2(2n - 3)$.

(iii) The inequality $\lambda_m(\Sigma_n, R_n) \leq n - 1$, $m = 1, 2$, follows as the diameter of $\mathrm{Cay}(\Sigma_n, R)$ is at most $n - 1$. Now, suppose $n$ is odd. Let $g \in \Sigma_n$ be given by $g = [3, 2, 5, 4, 7, 6, \ldots, n-3, n, n-1, 1]$. Then it is straightforward to check that $d_{BP}(r_{1,n}\, g, g) = n + 1$ (see Figure 2 for the case $n = 7$). In particular, since the length of any shortest path in $\mathrm{Cay}(\Sigma_n, R)$ joining any $g, h \in \Sigma_n$ is at least $d_{BP}(h, g)/2$ by [17, p.238], we have $\lambda_1(\Sigma_n, R) \geq \frac{n+1}{2}$. Similarly, $d_{BP}(r_{2,3}\, r_{1,n}\, g, g) = n + 1$ for this $g$, and so $\lambda_2(\Sigma_n, R) \geq \frac{n+1}{2}$. In case $n$ is even, consider $g = [3, 2, 5, 4, 7, 6, \ldots, n-4, n-1, n-2, 1, n]$. Then $d_{BP}(r_{2,n}\, g, g) = n + 1$ and $d_{BP}(r_{3,4}\, r_{2,n}\, g, g) = n + 1$. Similar reasoning yields the desired result. $\Box$

In genomics, the direction in which a gene is oriented in a genome can also provide useful information to incorporate in rearrangement models, which can be expressed as follows in terms of Cayley graphs [11]. The hyperoctahedral group $B_n$ is defined as the group of all permutations $g^\sigma$ acting on the set $\{\pm 1, \ldots, \pm n\}$ such that $g^\sigma(-i) = -g^\sigma(i)$ for all $i \in [n]$. An element of $B_n$ is a signed permutation. Signed versions of transpositions and reversals can be defined in the obvious way; a sign change transposition $t^\sigma_{i,j}$ switches the values in the $i$th and $j$th positions of a signed permutation as well as both of their signs, and so forth. Note that we also allow $i = j$ for signed transpositions and reversals so that $t_{i,i} = r_{i,i}$, $i \in [n]$, simply switches the sign of the $i$th value. We denote the set of signed elements corresponding to those in $S = T, C, R$, together with the elements $t_{i,i}$, $1 \leq i \leq n$, by $S^\sigma$. Note that the diameter of $\mathrm{Cay}(B_n, R^\sigma)$ is $n + 1$ [11].
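To make the definition of $B_n$ concrete, the sketch below represents a signed permutation by the tuple $(g^\sigma(1), \ldots, g^\sigma(n))$ of non-zero integers, which determines $g^\sigma$ because $g^\sigma(-i) = -g^\sigma(i)$. This one-line representation and the function names are assumptions of the sketch. It enumerates $B_3$ and checks that the map which forgets signs respects products.

```python
from itertools import permutations, product

def signed_permutations(n):
    """All signed permutations of {1..n}, as tuples (g(1), ..., g(n))."""
    for perm in permutations(range(1, n + 1)):
        for signs in product((1, -1), repeat=n):
            yield tuple(sign * value for sign, value in zip(signs, perm))

def image(g, v):
    """g(v) for v in {+-1, ..., +-n}, using g(-i) = -g(i)."""
    return g[v - 1] if v > 0 else -g[-v - 1]

def compose_signed(g, h):
    """Product gh as a composition of maps: (gh)(i) = g(h(i))."""
    return tuple(image(g, v) for v in h)

def forget_signs(g):
    """The ordinary permutation i -> |g(i)|."""
    return tuple(abs(v) for v in g)

def compose_unsigned(g, h):
    """Composition of ordinary permutations written with values 1..n."""
    return tuple(g[v - 1] for v in h)

if __name__ == "__main__":
    B3 = list(signed_permutations(3))
    print(len(B3))  # prints: 48
    respects_products = all(
        forget_signs(compose_signed(g, h))
        == compose_unsigned(forget_signs(g), forget_signs(h))
        for g in B3 for h in B3
    )
    print(respects_products)  # prints: True
    # Elements sent to the identity are exactly the pure sign changes.
    print(sum(1 for g in B3 if forget_signs(g) == (1, 2, 3)))  # prints: 8
```

The counts $48 = 2^3 \cdot 3!$ and $8 = 2^3$ reflect that each of the $n!$ unsigned permutations carries $2^n$ independent sign choices.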
Now, regarding the group $B_n$ as a wreath product [11, p. 2756], we have a short exact sequence:
$$1 \to N \to B_n \xrightarrow{\,p\,} \Sigma_n \to 1, \qquad (7)$$
where the homomorphism $p : B_n \to \Sigma_n$ sends $g^\sigma \in B_n$ to the permutation of $[n]$ that maps $i$ to $|g^\sigma(i)|$ (i.e. it ignores the sign). Notice that $p$ maps $S^\sigma$ onto $S$ when $S = T, C, R$. In particular, from Lemma 3, the following holds for $m = 1, 2$:
$$\lambda_m(B_n, S^\sigma) \geq \lambda_m(\Sigma_n, S). \qquad (8)$$
Moreover, $N = \mathrm{Ker}(p)$ is isomorphic to the elementary Abelian 2-group $\mathbb{Z}_2^n$ and the short exact sequence in (7) splits, so $B_n$ is a semidirect product of $\mathbb{Z}_2^n$ and a subgroup isomorphic to $\Sigma_n$. Using these observations, we obtain:

Corollary 6. For $n \geq 7$, the following hold:
(i) $\lambda_1(B_n, T^\sigma_n) \leq 3$ and $\lambda_2(B_n, T^\sigma_n) \leq 6$.
(ii) $2n - 3 \leq \lambda_1(B_n, C^\sigma_n) \leq 2n - 1$ and $2n - 2 \leq \lambda_2(B_n, C^\sigma_n) \leq 4n - 2$.
(iii) $\frac{n+1}{2} \leq \lambda_m(B_n, R^\sigma_n) \leq n + 1$, $m = 1, 2$.

Proof: The inequalities $\lambda_1(B_n, T^\sigma_n) \leq 3$ and $\lambda_1(B_n, C^\sigma_n) \leq 2n - 1$ follow from similar arguments to those used in the proof of Theorem 5 (i) and (ii), using the signed analogue of Equation (6). Inequality (3) then implies that the inequalities $\lambda_2(B_n, T^\sigma_n) \leq 6$ and $\lambda_2(B_n, C^\sigma_n) \leq 4n - 2$ both hold. The inequality $\lambda_m(B_n, R^\sigma_n) \leq n + 1$, $m = 1, 2$, follows as the diameter of $\mathrm{Cay}(B_n, R^\sigma_n)$ is at most $n + 1$. The inequalities $2n - 3 \leq \lambda_1(B_n, C^\sigma_n)$ and $2n - 2 \leq \lambda_2(B_n, C^\sigma_n)$, and the remaining ones in (iii), follow by Inequality (8) and Theorem 5. $\Box$

4. Beyond $d_S$: properties of breakpoint distance

As we have seen for the breakpoint distance on $\Sigma_n$ in the last section, it can sometimes be useful to consider metrics on a group other than the distance $d_S$ arising from some Cayley graph.
Motivated by this, given an arbitrary metric $d$ on a finite group $G$, with symmetric generator set $S$, we define:
$$\lambda_1(G, S, d) := \max_{g \in G,\, s \in S} \{ d(sg, g) \}$$
and
$$\lambda_2(G, S, d) := \max_{g \in G,\, s, s' \in S} \{ d(sg, s'g) \}.$$
In particular, $\lambda_m(G, S) = \lambda_m(G, S, d_S)$ and $\lambda_m(G, S, d) \leq \max_{g, g' \in G} \{ d(g, g') \}$, $m = 1, 2$. Moreover, the following analogue of Inequality (3) for an arbitrary metric $d$ on $G$ is easily seen to hold:
$$\lambda_2(G, S, d) \leq 2 \cdot \lambda_1(G, S, d). \qquad (9)$$
Note that, although the quantities $\lambda_m(G, S)$ and $\lambda_m(G, S, d)$ need not be directly related to one another, in certain circumstances they are. For example, if $d$ has the property that $d(g, gs) \leq c$ for some constant $c$ (for all $g \in G$ and $s \in S$), it is an easy exercise to show that $\lambda_m(G, S, d) \leq c \cdot \lambda_m(G, S)$, for $m = 1, 2$.

We now return to considering the breakpoint distance $d_{BP}$. In genomics, this distance is commonly used as a proxy for rearrangement distances. Thus it is of interest to note:

Lemma 7. For $n \geq 7$, the following hold:
(i) $\lambda_1(\Sigma_n, T_n, d_{BP}) \leq 4$ and $\lambda_2(\Sigma_n, T_n, d_{BP}) \leq 8$.
(ii) $\lambda_1(\Sigma_n, C_n, d_{BP}) \leq 4$ and $\lambda_2(\Sigma_n, C_n, d_{BP}) \leq 8$.
(iii) $\frac{n+1}{2} \leq \lambda_m(\Sigma_n, R_n, d_{BP}) \leq n + 1$, $m = 1, 2$.

Proof: Suppose $t = t_{i,j} \in T_n$, $1 \leq i < j \leq n$. Using Equation (6), it is straightforward to see that $d_{BP}(tg, g) \leq 4$ holds for any $g \in \Sigma_n$. Therefore $\lambda_1(\Sigma_n, T_n, d_{BP}), \lambda_1(\Sigma_n, C_n, d_{BP}) \leq 4$. The inequalities in (i) and (ii) involving $\lambda_2$ now follow from Inequality (9). The inequalities in (iii) follow from the argument used in the proof of Theorem 5 (iii) and the diameter of $d_{BP}$ on $\Sigma_n$. $\Box$

In particular, for $C$, the set of Coxeter generators of $\Sigma_n$ in the last section, and $m = 1, 2$, we have $\lambda_m(\Sigma_n, C) \geq 2n - 3$, but $\lambda_m(\Sigma_n, C, d_{BP}) \leq 4m \leq 8$. Intriguingly, this observation can be extended as follows.
For $k \geq 1$, let $R^{(k)}$ denote the set of reversals of the form $\{ r_{i,j} : 1 \leq i < j \leq n,\ |i - j| \leq k \}$. Such 'fixed-length' reversals have been considered in the context of genome rearrangements in e.g. [2]. Note that $R^{(1)} = C$ and $R^{(k)} \subseteq R^{(k+1)}$, so that $R^{(k)}$ generates $\Sigma_n$.

Proposition 8. For $n \geq 7$, $n \geq k \geq 1$ and $m = 1, 2$,
$$\lambda_m(\Sigma_n, R^{(k)}) \geq 2 \lceil n/k \rceil - 2, \quad \text{and} \quad \lambda_m(\Sigma_n, R^{(k)}, d_{BP}) \leq 4(k + 1).$$

Proof: As in the proof of Theorem 5 (ii), let $g \in \Sigma_n$ be given by $g = [2, 3, \ldots, n-1, n, 1]$, so that $g^{-1} r_{1,2}\, g = [n, 2, 3, \ldots, n-1, 1]$. Then $l_{R^{(k)}}(g^{-1} r_{1,2}\, g) \geq 2 \lceil n/k \rceil - 3$, since to transform $[n, 2, 3, \ldots, 1]$ to $1_{\Sigma_n}$ requires moving $1$ and $n$ back to their original positions. Similarly, $l_{R^{(k)}}(g^{-1} r_{1,2}\, r_{3,4}\, g) \geq 2 \lceil n/k \rceil - 2$. This gives the first inequality in the proposition. Moreover, if $r_{i,j}, r_{p,q} \in R^{(k)}$, then it is straightforward to see that $d_{BP}(r_{i,j}\, g, g) \leq 2(k + 1)$ and $d_{BP}(r_{p,q}\, r_{i,j}\, g, g) \leq 4(k + 1)$ hold, which gives the second inequality in the proposition. $\Box$

This proposition implies that in genomics applications, adding or substituting a single reversal in a sequence of reversals in $R^{(k)}$ could potentially have a large effect on $d_{R^{(k)}}$, but a relatively small effect on $d_{BP}$ (especially for large values of $n$; e.g. there are $n \geq 20{,}000$ genes in the human genome). It could be of interest to see whether other combinations of generating sets and metrics for $\Sigma_n$ commonly used in genomics (such as transpositions [13] and the $k$-mer distance [20]) exhibit a similar type of behaviour.

5. Statistical implications

So far we have considered metric sensitivity from a purely combinatorial and deterministic perspective. But it is also of interest to investigate the sensitivity of the metrics discussed above when the elements of $S$ are randomly assigned.
Again, the motivation for this question comes from genomics, where stochastic models often play a central role (see, for example, [14], [22]). In this section, we establish a result (Proposition 9) in which the quantity $\lambda_2$ plays a crucial role in allowing underlying parameters in such stochastic models to be estimated accurately given sufficiently long genome sequences. Our motivation here is to provide some basis for eventually extending the well-developed (and tight) results on the sequence length requirements for tree reconstruction under site-substitution models (see e.g. [3, 5, 8, 14]) to more general models of genome evolution.

Consider any model of genome evolution, where an associated transformation group $G$ acts freely on a set $X$ of genomes of length $n$, and for which events in some symmetric generating set $S$ occur independently according to a Poisson process. Regard the elements of $X$ as leaves of an evolutionary (phylogenetic) tree with weighted edges [18], and let $\mu(x, y)$ be the sum of the weights of the edges of the tree connecting leaves $x, y$. Then we make the following assumption:
• The expected number of times that $s \in S$ occurs along the path in the tree connecting $x$ and $y$ can be written as $n \cdot \mu_s(x, y)$ (i.e. we assume that the rate of events scales linearly with the length of the genome).

Let $\mu(x, y) = \sum_{s \in S} \mu_s(x, y)$. Then the total number of events in $S$ that occur on the path separating $x$ and $y$ has a Poisson distribution with mean $n \cdot \mu(x, y)$.

Now suppose $d$ is some metric on genomes that satisfies the following three properties:
(i) $d(x, g \circ x)$ depends just on $g$, for each $x \in X$ and $g \in G$.
(ii) $\lambda_2(G, S, d)$ is independent of $n$.
(iii) $\bar{d} = n f(\mu(x, y))$, where $\bar{d}$ is the expected value in the model of $d(x, y)$ and $f$ is a function with strictly positive but bounded first derivative on $(0, \infty)$.
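A small simulation can illustrate what properties (i) to (iii) look like in the simplest case, site substitutions on $K^n$ with $d$ the Hamming distance. The sketch below makes assumptions that go beyond the text: each site independently receives a Poisson($\mu$) number of substitutions, and the three non-identity elements of $K$ are equally likely at each event. Under those assumptions the expected proportion of differing sites is $\frac{3}{4}(1 - e^{-4\mu/3})$, and the simulated value of $d/n$ should lie close to it for large $n$.

```python
import math
import random

def poisson(rng, mean):
    """Sample a Poisson variate by Knuth's multiplication method (small means only)."""
    threshold = math.exp(-mean)
    count, running_product = 0, rng.random()
    while running_product > threshold:
        count += 1
        running_product *= rng.random()
    return count

def simulate_hamming_fraction(n, mu, seed=0):
    """Proportion of sites whose state differs after the simulated substitutions.

    Each site carries an element of K = Z_2 x Z_2 (as a bit pair) recording the
    net change at that site; the site differs iff that element is not the identity.
    """
    rng = random.Random(seed)
    non_identity = [(1, 0), (0, 1), (1, 1)]
    differing = 0
    for _ in range(n):
        state = (0, 0)
        for _ in range(poisson(rng, mu)):
            a, b = rng.choice(non_identity)
            state = (state[0] ^ a, state[1] ^ b)
        if state != (0, 0):
            differing += 1
    return differing / n

def expected_fraction(mu):
    """Closed form for the equal-rates case assumed in this sketch."""
    return 0.75 * (1.0 - math.exp(-4.0 * mu / 3.0))

if __name__ == "__main__":
    mu = 0.5
    for n in (100, 1000, 20000):
        print(n, simulate_hamming_fraction(n, mu, seed=1), expected_fraction(mu))
```

Replacing or deleting one event changes the Hamming distance by at most two sites here, which is the bounded-difference property that makes $d/n$ concentrate as $n$ grows.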
An example to illustrate this process is site substitutions under the Kimura 3ST model, described at the start of Section 3, taking $d = d_S$, where we observed that Properties (i) and (ii) hold (note that in this case, $d(x, y)$ is the 'Hamming distance' between the sequences, which counts the number of sites at which $x$ and $y$ differ). In that case, Property (iii) also holds, since
$$\bar{d} = n \cdot \tfrac{3}{4} \left( 1 - \exp(-4 \mu(x, y)/3) \right).$$
Note that both breakpoint distance and $d_S$ satisfy (i), and we have described above some cases where (ii) is satisfied. Whether (iii) holds (or the assumption that the expected number of events scales linearly with $n$) depends on the details of the underlying stochastic process of genome rearrangement. For example, for the approximation to the Nadeau-Taylor model of genome rearrangement studied in Section 2 of [21], Property (iii) holds under the assumption that the number of events separating $x$ and $y$ has a Poisson distribution whose mean scales linearly with $n$ (the proof relies on Corollary 1(a) of [21]).

The following result shows how $d/n$ can be used to estimate $f(\mu(x, y))$ accurately, and thereby $\mu(x, y)$ (by the assumptions regarding $f$). The ability to estimate $\mu(x, y)$ accurately provides a direct route to accurate tree reconstruction by standard phylogenetic methods (such as 'neighbor-joining' [16]) since $\mu(x, y)$ is 'additive' on the underlying tree but not on alternative binary trees (for details, see [18]).

Proposition 9. Consider any stochastic model of genome evolution for which events in $S$ occur according to a Poisson process with a rate that scales linearly with $n$, and any metric $d$ that satisfies conditions (i)-(iii) above. Then the probability that $d(x, y)/n$ differs from $f(\mu(x, y))$ by more than $z$ converges to zero exponentially quickly with increasing $n$.
More precisely, for constants $b > 0$ and $c > 0$ that depend just on $\mu(x, y)$ and on the pair $(\lambda_2(G, S, d), \mu(x, y))$, respectively, we have:

$$P(|d/n - f(\mu(x, y))| \geq z) \leq \exp(-bn) + 2\exp(-cz^2 n),$$

for $d = d(x, y)$.

Proof of Proposition 9: We first recall the Azuma-Hoeffding inequality (see e.g. [1]), in which $X_1, X_2, \ldots, X_k$ are independent random variables taking values in some set $S$, and $h$ is any real-valued function defined on $S^k$ that satisfies the following property for some constant $\xi$:

$$|h(x_1, x_2, \ldots, x_k) - h(x'_1, x'_2, \ldots, x'_k)| \leq \xi,$$

whenever $(x_i)$ and $(x'_i)$ differ at just one coordinate. In this case, the random variable $Y := h(X_1, X_2, \ldots, X_k)$ has the tight concentration bound for all $k > 1$:

$$P(|Y - E[Y]| \geq z) \leq 2\exp\left(-\frac{z^2}{2\xi^2 k}\right). \tag{10}$$

We apply this general result as follows. Let $K$ be the random total number of events in $S$ that occur in the path separating $x$ and $y$. By assumption, $K$ has a Poisson distribution with mean $n \cdot \mu(x, y)$. Conditional on the event $K = k$, let $X_1, \ldots, X_k$ be the actual elements of $S$ that occur. It is assumed that these events are independent. Moreover, by (i), $d(x, y)$ is a function of $X_1, \ldots, X_k$, and by (ii) this function satisfies the requirements of the Azuma-Hoeffding inequality for $\xi = \lambda_2(G, S, d)$. Thus (10) furnishes the following inequality:

$$P(|d/n - \overline{d}/n| \geq z \mid K = k) \leq 2\exp\left(-\frac{z^2 n^2}{2\lambda^2 k}\right). \tag{11}$$

Invoking Property (iii) and the law of total probability, we obtain:

$$P(|d/n - f(\mu(x, y))| \geq z) = \sum_{k \geq 0} P(|d/n - \overline{d}/n| \geq z \mid K = k)\, P(K = k),$$

from which (11) ensures the inequality:

$$P(|d/n - f(\mu(x, y))| \geq z) \leq 2E\left[\exp\left(-\frac{z^2 n^2}{2\lambda^2 K}\right)\right], \tag{12}$$

where $E$ denotes expectation with respect to $K$.
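As a numerical aside on (12), the expectation on its right-hand side can be evaluated directly by summing over the Poisson distribution of $K$. The sketch below is a minimal illustration with caller-supplied values of $z$, $n$, $\lambda$ and $\mu$. The function names are hypothetical, the $k = 0$ term is taken to be zero (the limit of $\exp(-A/k)$ as $k \to 0^+$, a convention assumed here), and the sum is truncated far in the upper tail, so the returned value is a very slight underestimate.

```python
import math


def poisson_log_pmf(k, mean):
    """Log of P(K = k) for K ~ Poisson(mean), with mean > 0."""
    return -mean + k * math.log(mean) - math.lgamma(k + 1)


def rhs_of_12(z, n, lam, mu):
    """Evaluate 2 * E[exp(-z^2 n^2 / (2 lam^2 K))] for K ~ Poisson(n * mu)."""
    mean = n * mu
    # Truncate about 20 standard deviations above the mean; the neglected
    # tail mass is negligible, and dropping it can only lower the result.
    upper = int(mean + 20.0 * math.sqrt(mean) + 50)
    total = 0.0
    # The k = 0 term is skipped: by the convention assumed here it is zero.
    for k in range(1, upper + 1):
        log_term = poisson_log_pmf(k, mean) - (z * n) ** 2 / (2.0 * lam ** 2 * k)
        total += math.exp(log_term)
    return 2.0 * total
```

With, say, `z = 0.1`, `lam = 2.0` and `mu = 0.5`, the returned value shrinks rapidly as `n` increases from 100 to 10000, consistent with decay that is exponential in $n$.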
Let us write $E\left[\exp\left(-\frac{z^2 n^2}{2\lambda^2 K}\right)\right]$ as a weighted sum of two conditional expectations:

$$E\left[\exp\left(-\frac{z^2 n^2}{2\lambda^2 K}\right) \,\Big|\, K > 2n \cdot \mu(x, y)\right] \cdot p + E\left[\exp\left(-\frac{z^2 n^2}{2\lambda^2 K}\right) \,\Big|\, K \leq 2n \cdot \mu(x, y)\right] \cdot (1 - p), \tag{13}$$

where $p = P(K > 2n \cdot \mu(x, y))$. The first term in (13) is bounded above by $P(K > 2n \cdot \mu(x, y))$ since $\exp\left(-\frac{z^2 n^2}{2\lambda^2 K}\right) \leq 1$; moreover, since $K$ has a Poisson distribution with mean $n \cdot \mu(x, y)$ (and so is asymptotically normally distributed with mean and variance equal to $\mu n$), the quantity $P(K > 2n \cdot \mu(x, y))$ is bounded above by a term of the form $\exp(-bn)$ where $b$ depends just on $\mu(x, y)$ (for instance, the standard Chernoff bound for a Poisson variable allows $b = (2\ln 2 - 1)\mu(x, y)$). The second term in (13) is bounded above by $\exp\left(-\frac{z^2 n}{4\lambda^2 \mu(x, y)}\right)$, where $\lambda = \lambda_2(G, S, d)$, since the function $x \mapsto \exp(-A/x)$ increases monotonically on $[0, \infty)$. Combining these two bounds in (13), the result now follows from (12). □

Remark. Referring again to the particular case of site substitutions under the Kimura 3ST model, Proposition 9 can be strengthened to:

$$P(|d/n - f(\mu(x, y))| \geq z) \leq 2\exp(-c' z^2 n),$$

where $c' > 0$ can be chosen to be independent of $\mu(x, y)$. This stronger result is the basis of numerous results in the phylogenetic literature that show that large trees can be reconstructed from remarkably short sequences under simple site-substitution models [5]. Although the bound in Proposition 9 is less incisive, it would be of interest to explore similar phylogenetic applications for other models of genome evolution in which $\lambda_2$ is independent of $n$, such as those involving breakpoint distance under reversals of fixed length.

Acknowledgments

We thank Marston Conder, Eamonn O'Brien and Li San Wang for some helpful comments. VM thanks the Royal Society for supporting his visit to the University of Canterbury, where most of this work was undertaken.
MS thanks the Royal Society of New Zealand for support under its James Cook Fellowship scheme.

References

[1] Alon, N., Spencer, J., 1992. The Probabilistic Method. Wiley, New York.
[2] Chen, T., Skiena, S., 1996. Sorting with fixed-length reversals. Discr. Appl. Math. 71, 269–295.
[3] Daskalakis, C., Mossel, E., Roch, S., 2010. Evolutionary trees and the Ising model on the Bethe lattice: a proof of Steel's conjecture. Probab. Th. Rel. Fields 149, 149–189.
[4] Epstein, D.B.A., 1992. Word Processing in Groups. A K Peters/CRC Press.
[5] Erdős, P.L., Steel, M.A., Székely, L.A., Warnow, T., 1999. A few logs suffice to build (almost) all trees (Part 1). Rand. Struc. Alg. 14(2), 153–184.
[6] Evans, S.N., Speed, T.P., 1993. Invariants of some probability models used in phylogenetic inference. Ann. Stat. 21, 355–377.
[7] Fertin, G., Labarre, A., Rusu, I., Tannier, E., Vialette, S., 2009. Combinatorics of Genome Rearrangements. The MIT Press, Cambridge, Massachusetts, London, England.
[8] Gronau, I., Moran, S., Snir, S., 2008. Fast and reliable reconstruction of phylogenetic trees with very short edges. pp. 379–388. In: SODA: ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA.
[9] Hilborn, R.C., 2004. Sea gulls, butterflies, and grasshoppers: A brief history of the butterfly effect in nonlinear dynamics. Am. J. Phys. 72(4), 425–427.
[10] Kimura, M., 1981. Estimation of evolutionary distances between homologous nucleotide sequences. Proc. Natl. Acad. Sci. USA 78, 454–458.
[11] Konstantinova, E., 2008. Some problems on Cayley graphs. Lin. Alg. Appl. 429, 2754–2769.
[12] Kunkle, D., Cooperman, G., 2009. Harnessing parallel disks to solve Rubik's cube. Journal of Symbolic Computation 44(7), 872–890.
[13] Labarre, A., 2006. New bounds and tractable instances for the transposition distance. IEEE/ACM Trans. Comput. Biol. Bioinf. 3(4), 380–394.
[14] Mossel, E., Steel, M., 2005. How much can evolved characters tell us about the tree that generated them? pp. 384–412. In: Mathematics of Evolution and Phylogeny (O. Gascuel ed.), Oxford University Press.
[15] Rotman, J.J., 1995. An Introduction to the Theory of Groups. Springer-Verlag New York Inc.
[16] Saitou, N., Nei, M., 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4(4), 406–425.
[17] Setubal, J., Meidanis, J., 1997. Introduction to Computational Molecular Biology. PWS Publishing Company.
[18] Semple, C., Steel, M., 2003. Phylogenetics. Oxford University Press.
[19] Sinha, A., Meller, J., 2008. Sensitivity analysis for reversal distance and breakpoint re-use in genome rearrangements. Pacific J. Biocomput. 13, 37–48.
[20] Trifonov, V., Rabadan, R., 2010. Frequency analysis techniques for identification of viral genetic data. mBio 1(3), e00156-10.
[21] Wang, L.-S., 2002. Genome Rearrangement Phylogeny Using Weighbor. pp. 112–125. In: Lecture Notes for Computer Sciences No. 2452: Proceedings for the Second Workshop on Algorithms in BioInformatics (WABI'02), Rome, Italy.
[22] Wang, L.-S., Warnow, T., 2005. Distance-based genome rearrangement phylogeny. pp. 353–380. In: Mathematics of Evolution and Phylogeny (O. Gascuel ed.), Oxford University Press.
