The Butterfly effect in Cayley graphs, and its relevance for evolutionary genomics

Vincent Moulton and Mike Steel

(VM) School of Computing Sciences, University of East Anglia, Norwich, NR4 7TJ, UK.
(MS) Biomathematics Research Centre, University of Canterbury, Christchurch, New Zealand.
email: vincent.moulton@cmp.uea.ac.uk, mike.steel@canterbury.ac.nz

Abstract

Suppose a finite set $X$ is repeatedly transformed by a sequence of permutations of a certain type acting on an initial element $x$ to produce a final state $y$. We investigate how 'different' the resulting state $y'$ can be from $y$ if a slight change is made to the sequence, either by deleting one permutation or by replacing it with another. Here the 'difference' between $y$ and $y'$ might be measured by the minimum number of permutations of the permitted type required to transform $y$ to $y'$, or by some other metric. We discuss this first in the general setting of sensitivity to perturbation of walks in Cayley graphs of groups with a specified set of generators. We then investigate some permutation groups and generators arising in computational genomics, and the statistical implications of the findings.

Keywords: evolutionary distance, permutation, metric, group action, genome rearrangements

1. Introduction

In evolutionary genomics, two genomes^1 are frequently compared by the minimum number of 'rearrangements' (of various types) required to transform one genome into another [7]. This minimum number is then used to estimate the actual number of events and thereby the 'evolutionary distance' between the species involved.
Since both the precise number and the actual rearrangement events that occurred in the evolution of the two genomes from a common ancestor are unknown, it is pertinent to have some idea of how sensitive this distance estimate might be to the sequence of events (not just the number) that really took place [19]. This question has important implications for the accurate inference of evolutionary relationships between species from their genomes, and we discuss some of these further in Section 5. However, we begin by framing the type of mathematical questions that we will be considering in a general algebraic context.

^1 For the purposes of this paper a genome is simply an ordered sequence of objects (usually taken from the DNA alphabet or a collection of genes), which may occur with or without repetition, and with or without an orientation (+, −).

Preprint submitted to Elsevier, November 7, 2021

Let $G$ be a finite group, whose identity element we write as $1_G$, and let $S$ be a subset of generators that is symmetric (i.e. closed under inverses, so $x \in S \Rightarrow x^{-1} \in S$). In addition, let $\Gamma = Cay(G,S)$ be the associated Cayley graph, with vertex set $G$ and an edge connecting $g$ and $g'$ if there exists $s \in S$ with $g' = gs$ (unless otherwise stated, we use the convention of multiplying group elements from left to right). For any two elements $g, g' \in G$, the distance $d_S(g,g')$ in $Cay(G,S)$ is the minimum value of $k$ for which there exist elements $s_1, \dots, s_k$ of $S$ so that $g' = g s_1 \cdots s_k$ (for $g = g'$, we set $d_S(g,g') = 0$). Note that $d_S$ is a metric; in particular, $d_S(g,g') = d_S(g',g)$, since $S$ is symmetric.

In this paper, our focus is on the following two quantities:

$$\lambda_1(G,S) := \max_{g \in G,\, s \in S} \{ d_S(sg, g) \},$$

and

$$\lambda_2(G,S) := \max_{g \in G,\, s, s' \in S} \{ d_S(sg, s'g) \}.$$
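For small groups, both quantities can be computed by brute force. The following Python sketch is an illustration only and not part of the original analysis: it stores permutations as 0-indexed tuples, obtains word lengths by breadth-first search in the Cayley graph, and relies on the fact that $d_S(g,g')$ equals the word length of $g^{-1}g'$. All function names (`compose`, `word_lengths`, `sensitivities`) are hypothetical.

```python
from collections import deque

def compose(p, q):
    """Group product used here: the map i -> p[q[i]]."""
    return tuple(p[i] for i in q)

def inverse(p):
    inv = [0] * len(p)
    for i, v in enumerate(p):
        inv[v] = i
    return tuple(inv)

def transposition(n, i, j):
    """Transposition exchanging i and j (0-indexed) in the symmetric group on n points."""
    p = list(range(n))
    p[i], p[j] = p[j], p[i]
    return tuple(p)

def word_lengths(n, gens):
    """Breadth-first search from the identity: shortest word length of every group element."""
    identity = tuple(range(n))
    length = {identity: 0}
    queue = deque([identity])
    while queue:
        g = queue.popleft()
        for s in gens:
            h = compose(g, s)
            if h not in length:
                length[h] = length[g] + 1
                queue.append(h)
    return length

def sensitivities(n, gens):
    """Return (lambda_1, lambda_2, diameter) for the Cayley graph generated by gens."""
    length = word_lengths(n, gens)

    def dist(a, b):
        # Cayley-graph distance: word length of a^{-1} b
        return length[compose(inverse(a), b)]

    lam1 = max(dist(compose(s, g), g) for g in length for s in gens)
    lam2 = max(dist(compose(s, g), compose(t, g))
               for g in length for s in gens for t in gens)
    return lam1, lam2, max(length.values())

n = 4
coxeter = [transposition(n, i, i + 1) for i in range(n - 1)]
all_transpositions = [transposition(n, i, j) for i in range(n) for j in range(i + 1, n)]

print(sensitivities(n, coxeter))             # (5, 6, 6)
print(sensitivities(n, all_transpositions))  # (1, 2, 3)
```

For the adjacent transpositions on four points, the middle value 6 agrees with the example of Fig. 1, where $\lambda_2(G,S)$ equals the diameter of the Cayley graph.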
One way to view these quantities is via the following result, which is easily proved.

Lemma 1. Let $S$ be a symmetric set of generators for a finite group $G$. Then:
• $\lambda_1(G,S)$ is the maximum value of $d_S(g,g')$ between any pair of elements $g$ and $g'$ of $G$ for which $g = s_1 s_2 \cdots s_k$ and $g' = s'_1 s'_2 \cdots s'_k$, where $s'_i = s_i \in S$ for all but at most one value (say $j$) of $i$, and $s'_j = 1_G$.
• $\lambda_2(G,S)$ is the maximum value of $d_S(g,g')$ between any pair of elements $g$ and $g'$ of $G$ for which $g = s_1 s_2 \cdots s_k$ and $g' = s'_1 s'_2 \cdots s'_k$, where $s'_i = s_i \in S$ for all but at most one value (say $j$) of $i$, and $s'_j \in S$, $s'_j \neq s_j$.

Thus, $\lambda_1(G,S)$ tells us how much (under $d_S$) a product of generators can change if we drop one value of $s$, whilst $\lambda_2(G,S)$ tells us how much (again under $d_S$) a product of generators in $S$ can change if we substitute one value of $s$ by another $s'$ (see Fig. 1 for an example where $\lambda_2(G,S) = 6$). As such, $\lambda_m$ ($m = 1, 2$) is a measure of the 'sensitivity' of walks in the Cayley graph to a switch in, or deletion of, a generator at some point. Moreover, if $G$ acts transitively and freely^2 on a set $X$ then $\lambda_m$ provides a corresponding measure of sensitivity of this action to a switch in or deletion of a generator (since a transitive, free action of $G$ on $X$ is isomorphic to the action of $G$ on itself by right multiplication). Actions with large $\lambda_m$ values can thus be viewed as exhibiting a discrete, group-theoretic analogue of the 'butterfly effect' in non-linear dynamics (see e.g. [9]).

Figure 1: The Cayley graph $Cay(G,S)$ for $G = \Sigma_4$ (the permutation group on $\{1,2,3,4\}$) and the set of transpositions $S = \{(12), (23), (34)\}$. Substituting just one element, namely (34) for (12), in the product corresponding to the walk in the lower front face (which starts and returns to the lower-most point [1234]) results in a walk that ends at a point ([4321], top) that is very distant (under $d_S$) from the end-point of the original walk. In fact, the two end-points are at maximal distance in this example.

^2 $G$ acts transitively on $X$ if for any pair $x, y \in X$ there exists $g \in G$ with $g \circ x = y$; the action is free if $g \circ x = h \circ x \Rightarrow g = h$, for all $g, h \in G$ and $x \in X$, where '$\circ$' denotes the action of $G$ on $X$.

In the genomics applications that we shall consider, elements of the group $G$ correspond to genomes, and $d_S$ to the evolutionary distance between them. After presenting some general results concerning $\lambda_m$ in the next section, in Sections 3 and 4 we discuss some applications arising for various choices of $G$ and $S$. These include the Klein four-group, which arises in evolutionary models of DNA sequence evolution, and the permutation group, which typically appears when studying rearrangement distances between genomes. We conclude in Section 5 with some statistical implications of our results.

One can imagine many other settings besides genomics where similar questions arise. For example, in a sequence of moves that should unscramble the Rubik's cube from a given position [12], what will be the consequences (in terms of the number of moves required) for completing the unscrambling if a mistake is made at some point (or one move is forgotten)? In addition, related questions arise in the study of 'automatic' groups, where the group under consideration is typically infinite [4].

2. General inequalities

We first make some basic observations about Cayley graphs and the metric $d_S$ (further background on basic group theory, Cayley graphs, and group actions can be found in [15]).
It is well known that $\Gamma$ is a connected regular graph of degree equal to the cardinality of $S$ and that $\Gamma$ is also vertex-transitive (see, for example, [11], Proposition 1). Consider the function $l_S : G \to \{0, 1, 2, \dots, |G|\}$, where $l_S(1_G) = 0$ and, for each $g \in G - \{1_G\}$, $l_S(g)$ is the smallest number $l$ of elements $s_1, \dots, s_l$ from $S$ for which we can write $g = s_1 \cdots s_l$. The function $l_S$ clearly satisfies the subadditivity property that, for all $g, g' \in G$:

$$l_S(gg') \le l_S(g) + l_S(g').$$

In addition, $l_S(g^{-1}) = l_S(g)$, and

$$l_S(g) = 1 \Leftrightarrow g \in S, \qquad l_S(g) = 0 \Leftrightarrow g = 1_G.$$

Note that $l_S(gg')$ is generally not equal to $l_S(g'g)$. The metric $d_S$, described in the previous section, is related to $l_S$ as follows:

$$d_S(g,g') = l_S(g^{-1}g').$$

Consequently, by definition:

$$\lambda_1(G,S) = \max_{g \in G,\, s \in S} \{ l_S(g^{-1} s g) \}, \tag{1}$$

and

$$\lambda_2(G,S) = \max_{g \in G,\, s, s' \in S} \{ l_S(g^{-1} s s' g) \}. \tag{2}$$

Let $l_S(G) = \max\{ l_S(g) : g \in G \}$, which is the diameter of $Cay(G,S)$, that is, the maximum length of a shortest path connecting any two elements of $G$. Clearly, $\lambda_1(G,S), \lambda_2(G,S) \le l_S(G)$. Moreover:

$$\lambda_2(G,S) \le 2 \cdot \lambda_1(G,S), \tag{3}$$

since, for any $g \in G$ and $s, s' \in S$, we have:

$$d_S(sg, s'g) \le d_S(sg, g) + d_S(g, s'g).$$

A partial converse to Inequality (3) is provided by the following:

$$\lambda_1(G,S) \le \lambda_2(G,S) + \lambda'_1(G,S), \tag{4}$$

where $\lambda'_1(G,S) = \max_{g \in G} \min_{s \in S} \{ l_S(g^{-1} s g) \}$. To verify (4), select a pair $g \in G$, $s \in S$ so that $l_S(g^{-1} s g) = \lambda_1(G,S)$. Then:

$$\lambda_1(G,S) = d_S(sg, g) \le d_S(sg, s_1 g) + d_S(s_1 g, g),$$

where $s_1$ is an element $s'$ (possibly equal to $s$) in $S$ that minimizes $l_S(g^{-1} s' g)$. Now, $d_S(sg, s_1 g) \le \lambda_2(G,S)$ (even if $s_1 = s$) and $d_S(s_1 g, g) \le \lambda'_1(G,S)$, and so we obtain (4).
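Inequalities (3) and (4) can be checked numerically on a small example. The sketch below is only an illustration under stated assumptions: it works in $\Sigma_4$ with permutations as 0-indexed tuples, takes either the adjacent transpositions or all reversals as generators (both sets consist of involutions), and uses hypothetical helper names throughout.

```python
from collections import deque

def compose(p, q):
    """Group product used here: the map i -> p[q[i]]."""
    return tuple(p[i] for i in q)

def inverse(p):
    inv = [0] * len(p)
    for i, v in enumerate(p):
        inv[v] = i
    return tuple(inv)

def reversal(n, i, j):
    """Permutation that reverses the block of points i..j (0-indexed, inclusive)."""
    p = list(range(n))
    p[i:j + 1] = reversed(p[i:j + 1])
    return tuple(p)

def word_lengths(n, gens):
    """Shortest word length of every element, by breadth-first search."""
    identity = tuple(range(n))
    length = {identity: 0}
    queue = deque([identity])
    while queue:
        g = queue.popleft()
        for s in gens:
            h = compose(g, s)
            if h not in length:
                length[h] = length[g] + 1
                queue.append(h)
    return length

def lambda_values(n, gens):
    """Return (lambda_1, lambda_2, lambda_1') following Equations (1), (2) and (4)."""
    length = word_lengths(n, gens)

    def conj_len(x, g):
        # word length of g^{-1} x g
        return length[compose(inverse(g), compose(x, g))]

    lam1 = max(conj_len(s, g) for g in length for s in gens)
    lam2 = max(conj_len(compose(s, t), g) for g in length for s in gens for t in gens)
    lam1_prime = max(min(conj_len(s, g) for s in gens) for g in length)
    return lam1, lam2, lam1_prime

n = 4
coxeter = [reversal(n, i, i + 1) for i in range(n - 1)]  # adjacent transpositions
reversals = [reversal(n, i, j) for i in range(n) for j in range(i + 1, n)]

for gens in (coxeter, reversals):
    lam1, lam2, lam1_prime = lambda_values(n, gens)
    print(lam1, lam2, lam1_prime, lam2 <= 2 * lam1, lam1 <= lam2 + lam1_prime)
```

The last two printed entries report whether (3) and (4) hold for the chosen generating set.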
Note also that if $G$ is Abelian, then $\lambda_1(G,S) = 1$ and $\lambda_2(G,S) \le 2$ for any symmetric set $S$ of generators. Moreover, for the Abelian 2-group $G = \mathbb{Z}_2^n$, with the symmetric set $S$ of generators consisting of all $n$ elements that have the identity at all but one position, we have $l_S(G) = n$ and $\lambda_1(G,S) = 1$. This shows that the gap in the inequality $\lambda_1(G,S) \le l_S(G)$ can be arbitrarily large. Our next result generalizes this observation further.

Lemma 2. Let $G_1, G_2, \dots, G_k$ be finite groups, and let $S_i$ be a symmetric set of generators of $G_i$ for $i = 1, \dots, k$. Consider the direct product $G = G_1 \times G_2 \times \cdots \times G_k$ along with the symmetric set of generators $S$ of $G$ consisting of all possible $k$-tuples which consist of the identity element of $G_i$ at all but one co-ordinate $i$, where it takes some value in $S_i$. Then
(i) $\lambda_1(G,S) \le \max_{1 \le i \le k} \{ l_{S_i}(G_i) \}$, and
(ii) $l_S(G) = \sum_{i=1}^{k} l_{S_i}(G_i)$.

Proof: For Part (i), let $\lambda_1(G,S) = l_S(g^{-1} s g)$, where $s \in S$ is a non-identity element at some co-ordinate $\nu$. Notice that $(g^{-1} s g)_j = 1_{G_j}$ for all $j \neq \nu$. Moreover, $(g^{-1} s g)_\nu = s_1 \cdots s_l$ where $s_1, \dots, s_l \in S_\nu$ and $l \le l_{S_\nu}(G_\nu)$. Thus $l_S(g^{-1} s g) \le l_{S_\nu}(G_\nu)$, as claimed. For Part (ii), the inequality $l_S(G) \le \sum_{i=1}^{k} l_{S_i}(G_i)$ is clear; to establish the reverse inequality, let $g_i$ be an element of $G_i$ with $l_{S_i}(g_i) = l_{S_i}(G_i)$, and $g = (g_1, \dots, g_k) \in G$. Then $l_S(g) = \sum_{i=1}^{k} l_{S_i}(G_i)$, and so $l_S(G) \ge \sum_{i=1}^{k} l_{S_i}(G_i)$. □

We now consider how $\lambda_m$ behaves under group homomorphisms. Suppose $H$ is the homomorphic image of a group $G$ under a map $p$. Let $N = Ker(p)$ be the kernel of $p$, which is a normal subgroup of $G$, and with $H \cong G/N$. Thus we have a short exact sequence:

$$1 \to N \to G \xrightarrow{\;p\;} H \to 1. \tag{5}$$

Let $S$ be a symmetric set of generators of $G$. Then $S_H = \{ p(s) : s \in S - N \}$ is a symmetric set of generators of $H$.

Lemma 3. For $m = 1, 2$, $\lambda_m(H, S_H) \le \lambda_m(G, S)$.

Proof: First suppose that $m = 1$. For $x \in S_H$ and $h \in H$, consider $h^{-1} x h$. There exist elements $g \in G$ and $s \in S - N$ for which $p(g) = h$ and $p(s) = x$. Now the element $g^{-1} s g \in G$ can be written as a product of at most $l = \lambda_1(G,S)$ elements of $S$, that is, $g^{-1} s g = s_1 s_2 \cdots s_k$ for $k \le l$. Applying $p$ to both sides of this equation gives $h^{-1} x h = p(s_1) p(s_2) \cdots p(s_k)$. Notice that some of the elements on the right may equal the identity element of $H$ (since $p(s_i) = 1_H \Leftrightarrow s_i \in N$), but they are elements of $S_H$ otherwise. Thus $l_{S_H}(h^{-1} x h) \le l$. Since this holds for all such elements $h, x$, Eqn. (1) shows that $\lambda_1(H, S_H) \le \lambda_1(G, S)$. The corresponding result for $m = 2$ follows by an analogous argument. □

To obtain a lower bound for $\lambda_m(G,S)$, suppose that the short exact sequence (5) is a split extension, i.e. there is a homomorphism $i : H \to G$ so that $p \circ i$ is the identity map on $H$, which (by the splitting lemma) is equivalent to the condition that $G$ is the semidirect product of $N$ with a subgroup $H'$ isomorphic to $H$ (i.e. $G = NH' = H'N$, $H' \cap N = \{1_G\}$). In this case we have the following bounds.

Proposition 4. Suppose a finite group $G$ is a semidirect product of subgroups $N$ (normal) and $H$. Let $S_N$, $S_H$ be symmetric generator sets for $N$ and $H$ respectively, and let $S = S_N \cup S_H$, which is a symmetric generator set for $G$. Then:

$$\lambda_1(H, S_H) \le \lambda_1(G, S) \le \lambda_1(H, S_H) + l_{S_N}(N).$$

In particular, by (3), $\lambda_2(G,S) \le 2\lambda_1(H, S_H) + 2 l_{S_N}(N)$.

Proof: The lower bound on $\lambda_1(G,S)$ follows from Lemma 3. For the upper bound we must show that for all $s \in S$ and $g \in G$, $d_S(sg, g) \le \lambda_1(H, S_H) + l_{S_N}(N)$ holds. We consider two cases: (i) $s \in N$, and (ii) $s \in H$.
In Case (i), note that the conjugate element $g^{-1} s g$ is also an element of $N$; in this case we have the tighter bound $d_S(sg, g) \le l_{S_N}(N)$. In Case (ii), write $g = hn$ where $n \in N$ and $h \in H$. Consider the word

$$w = g^{-1} s g = n^{-1} h^{-1} s h n.$$

Since $N$ is normal we have $n^{-1}(h^{-1} s h) = (h^{-1} s h) n'$ for some element $n' \in N$. Thus $w = h^{-1} s h n' n$. Write $w = w_1 w_2$ where $w_1 = h^{-1} s h \in H$ and $w_2 = n' n \in N$. We can select $w_2$ to be a product of terms of $S_N$ of length at most $l_{S_N}(N)$ and, by Equation (1), we can select $w_1$ to be a product of terms of $S_H$ of length at most $\lambda_1(H, S_H)$. Thus $w$ can be written as a product of at most $\lambda_1(H, S_H) + l_{S_N}(N)$ elements of $S$. □

3. Permutation groups and genomic applications

We first describe a direct application that is relevant to the evolution of a DNA sequence under a simple model of site substitution (Kimura's 3ST model) [10]. Consider the four-letter DNA alphabet $A = \{A, C, G, T\}$ and the Klein four-group $K = \mathbb{Z}_2 \times \mathbb{Z}_2$ with an action on $A$ in which the three non-zero elements of $K$ correspond to 'transitions' (A ↔ G, C ↔ T) and the two types of 'transversions' (A ↔ C, G ↔ T; and A ↔ T, G ↔ C). This representation of the Kimura 3ST model was first described and exploited by [6].

For $g \in K$ and $x \in A$, let $g \circ x$ denote the element of $A$ obtained by the action of $g$ on $x$ (the identity element fixes each element of $A$). The resulting component-wise action of $K^n$ on $A^n$, defined by

$$(g_1, \dots, g_n) \circ (x_1, \dots, x_n) = (g_1 \circ x_1, \dots, g_n \circ x_n),$$

can be regarded as the set of all changes that can occur to a DNA sequence over a period of time under site substitutions. Now, under any continuous-time Markovian process these change events ('site substitutions') occur just one at a time, and so a natural generating set of $K^n$ is the set $S_n$ of all elements of $K^n$ that consist of $1_K$ at all but one co-ordinate.
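The action just described is easy to prototype. The following Python sketch is an assumption-laden illustration rather than anything taken from the original: it encodes the bases as A = 0, G = 1, C = 2, T = 3, so that the three non-identity elements of $K$ act by bitwise XOR with 1 (transitions), 2 and 3 (the two transversion types), and it computes the sensitivities of $K^n$ with the single-site generating set by brute force. The names `act`, `single_site_generators` and `klein_sensitivities` are hypothetical.

```python
from collections import deque

BASES = "AGCT"  # assumed encoding: A=0, G=1, C=2, T=3

def act(g, seq):
    """Component-wise action of g in K^n (a tuple of ints 0..3) on a DNA string."""
    return "".join(BASES[BASES.index(x) ^ gi] for gi, x in zip(g, seq))

def multiply(g, h):
    """Group operation in K^n: component-wise XOR (every element is its own inverse)."""
    return tuple(a ^ b for a, b in zip(g, h))

def single_site_generators(n):
    """All elements of K^n that are the identity at all but one co-ordinate."""
    gens = []
    for site in range(n):
        for k in (1, 2, 3):
            g = [0] * n
            g[site] = k
            gens.append(tuple(g))
    return gens

def klein_sensitivities(n):
    """Return (lambda_1, lambda_2, diameter) for K^n with single-site generators."""
    gens = single_site_generators(n)
    identity = (0,) * n
    length = {identity: 0}
    queue = deque([identity])
    while queue:
        g = queue.popleft()
        for s in gens:
            h = multiply(g, s)
            if h not in length:
                length[h] = length[g] + 1
                queue.append(h)
    # d(a, b) is the word length of a^{-1} b, and a^{-1} = a in K^n
    lam1 = max(length[multiply(multiply(s, g), g)] for g in length for s in gens)
    lam2 = max(length[multiply(multiply(s, g), multiply(t, g))]
               for g in length for s in gens for t in gens)
    return lam1, lam2, max(length.values())

print(act((1, 0, 3), "ACG"))   # GCC
print(klein_sensitivities(3))  # (1, 2, 3)
```

Because the group is Abelian, conjugation does nothing, which is why the first two values do not grow with $n$ while the diameter does.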
Moreover, since the action of $K^n$ on $A^n$ is transitive and free (and so is isomorphic to the action of $K^n$ on itself by right multiplication), $\lambda_m(K^n, S_n)$ measures the impact of ignoring (for $m = 1$) or replacing (for $m = 2$) one substitution in a chain of such events over time. As $K^n$ is Abelian, one has $\lambda_1(K^n, S_n) = 1$ and $\lambda_2(K^n, S_n) = 2$, which implies that this impact is minor and, more significantly, is independent of $n$; this has important statistical implications which we will describe further in Section 5.

For a related example, consider the ordered sequence of distinct genes $(g_1, g_2, \dots, g_n)$ partitioned into regions $R_1, R_2, \dots, R_k$ so that genomic rearrangements occur within each region, but not between regions (e.g. the $R_i$ might refer to different chromosomes). This situation can be modelled by the setting of Lemma 2, in which $G_i$ is a permutation group on the genes within $R_i$, and $S_i$ is a set of elementary gene order rearrangement events that generates $G_i$ (we discuss some examples below). In this case, Lemma 2 provides a bound on $\lambda_1$ and $\lambda_2$ that is independent of the number of regions $k$.

We turn now to the calculation of $\lambda_m(\Sigma_n, S)$ for the permutation group $\Sigma_n$ (which has $n!$ elements) and various sets $S$ of generators. This group commonly arises when studying genome rearrangements [11]. Our main interest is to determine, for each instance of $S$, whether there is a constant $C$ (independent of $n$) for which $\lambda_m(\Sigma_n, S) \le C$, for $m = 1, 2$.

A permutation $g$ on the set $[n] := \{1, 2, \dots, n\}$ is a bijective mapping from $[n]$ to itself. We will also write $g$ as $g = [g_1, g_2, \dots, g_n]$ where $g_i = g(i)$ is the image of the map $g$ for $i \in [n]$. Note that, following the usual convention, the product $gg'$ of two permutations $g, g' \in \Sigma_n$ will be considered as the composition of the functions $g$ and $g'$. In particular, $gg'(i) = g(g'(i))$ for all $i \in [n]$.

When studying genomes, each entry $g_i$ of a permutation $g$ corresponds to a gene and the full list $[g_1, g_2, \dots, g_n]$ to a genome. Multiplying $g$ by a permutation leads to a rearrangement of the genome. For example, multiplying by a transposition $t_{i,j}$ interchanges the values at positions $i$ and $j$ of $g$, i.e.

$$[\dots, g_i, \dots, g_j, \dots]\, t_{i,j} = [\dots, g_j, \dots, g_i, \dots],$$

and multiplying by a reversal $r_{i,j}$ reverses the segment $[g_i, g_j]$, $1 \le i < j \le n$, of $g$, i.e.

$$[\dots, g_i, g_{i+1}, \dots, g_{j-1}, g_j, \dots]\, r_{i,j} = [\dots, g_j, g_{j-1}, \dots, g_{i+1}, g_i, \dots].$$

Such rearrangements are widely observed and studied in molecular biology [7].

In genomics applications, we are often interested in defining some distance between genomes. One distance that is commonly used in the context of permutations is the breakpoint distance [17, 7.3]. For $g, g' \in \Sigma_n$, $d_{BP}(g, g')$ is defined as the number of pairs of elements that are adjacent in the list $[0, g_1, g_2, \dots, g_n, n+1]$, but not in the list $[0, g'_1, g'_2, \dots, g'_n, n+1]$. For example, if $g = [1,2,3,4,5]$, $g' = [1,4,3,2,5] \in \Sigma_5$, we have $d_{BP}(g, g') = 2$. It is clear that $\max\{ d_{BP}(g, g') : g, g' \in \Sigma_n \} = n + 1$.

Alternatively, one can consider the rearrangement distance between two genomes, i.e. the minimal number of operations of a certain type (such as transpositions or reversals) that can be applied to one of the genomes to obtain the other [7]. In terms of Cayley graphs, this distance can be conveniently expressed for transpositions and reversals as follows. Let

$$T = T_n := \{ t_{i,j} \in \Sigma_n : 1 \le i < j \le n \},$$
$$C = C_n := \{ t_{i,i+1} \in T : 1 \le i \le n-1 \} \quad \text{(the Coxeter generators)},$$

and

$$R = R_n := \{ r_{i,j} \in \Sigma_n : 1 \le i < j \le n \}.$$
Note that all three of these sets generate $\Sigma_n$ [11] and that they are all symmetric, since each generator is its own inverse. The metric $d_S$, $S = T, C, R$, is precisely the rearrangement distance. The diameters of $Cay(\Sigma_n, T)$ and $Cay(\Sigma_n, R)$ are both $n - 1$, and the diameter of $Cay(\Sigma_n, C)$ is $\binom{n}{2}$ [11].

Regarding the quantities $\lambda_m(\Sigma_n, S)$, we have the following result for $S = T, C, R$:

Theorem 5. For $n \ge 7$ the following hold:
(i) $\lambda_1(\Sigma_n, T_n) = 1$ and $\lambda_2(\Sigma_n, T_n) = 2$.
(ii) $\lambda_1(\Sigma_n, C_n) = 2n - 3$ and $2n - 2 \le \lambda_2(\Sigma_n, C_n) \le 4n - 6$.
(iii) $\frac{n+1}{2} \le \lambda_m(\Sigma_n, R_n) \le n - 1$, $m = 1, 2$.

Figure 2: (a) A diagrammatic representation of the element $g = [3,2,5,4,6,7,1]$ in $\Sigma_7$, defined in the proof of Theorem 5 (iii). (b) The product $r_{1,7}\, g = [5,6,3,4,1,2,7]$. Note that $d_{BP}(r_{1,7}\, g, g) = 8$.

Proof: (i) Note that if $g \in \Sigma_n$ and $t_{i,j} \in T$, then:

$$g^{-1} t_{i,j}\, g = t_{g^{-1}(i),\, g^{-1}(j)}. \tag{6}$$

Therefore $\lambda_1(\Sigma_n, T) = 1$ by (1). Thus, by Inequality (3), we have $\lambda_2(\Sigma_n, T) \le 2$. The equality $\lambda_2(\Sigma_n, T) = 2$ follows by (2) and the fact that

$$g^{-1} t_{k,l}\, t_{i,j}\, g = t_{g^{-1}(i),\, g^{-1}(j)}\, t_{g^{-1}(k),\, g^{-1}(l)}$$

holds for any $g \in \Sigma_n$ and $1 \le i < j < k < l \le n$.

(ii) Consider the permutation $g \in \Sigma_n$ given by $g = [2, 3, \dots, n-1, n, 1]$. Then $g^{-1} t_{1,2}\, g = [n, 2, 3, \dots, n-1, 1]$. Therefore, $l_C(g^{-1} t_{1,2}\, g) \ge 2n - 3$ (since to transform $[n, 2, 3, \dots, n-1, 1]$ to $1_{\Sigma_n}$ requires moving 1 and $n$ back to their original positions). Therefore, $\lambda_1(\Sigma_n, C) \ge 2n - 3$ by (1). But, by Equality (6), $\lambda_1(\Sigma_n, C) \le 2n - 3$, since any transposition is the product of at most $2n - 3$ elements in $C$. In particular, $\lambda_1(\Sigma_n, C) = 2n - 3$. Similarly, $l_C(g^{-1} t_{1,2}\, t_{3,4}\, g) \ge 2n - 2$, and so $\lambda_2(\Sigma_n, C) \ge 2n - 2$ by (2).
Hence, by Inequality (3), we have $\lambda_2(\Sigma_n, C) \le 2(2n - 3)$.

(iii) The inequality $\lambda_m(\Sigma_n, R_n) \le n - 1$, $m = 1, 2$, follows as the diameter of $Cay(\Sigma_n, R)$ is at most $n - 1$. Now, suppose $n$ is odd. Let $g \in \Sigma_n$ be given by $g = [3, 2, 5, 4, 7, 6, \dots, n-3, n, n-1, 1]$. Then it is straightforward to check that $d_{BP}(r_{1,n}\, g, g) = n + 1$ (see Figure 2 for the case $n = 7$). In particular, since the length of any shortest path in $Cay(\Sigma_n, R)$ joining any $g, h \in \Sigma_n$ is at least $d_{BP}(h, g)/2$ by [17, p.238], we have $\lambda_1(\Sigma_n, R) \ge \frac{n+1}{2}$. Similarly, $d_{BP}(r_{2,3}\, r_{1,n}\, g, g) = n + 1$ for any $g \in \Sigma_n$, and so $\lambda_2(\Sigma_n, R) \ge \frac{n+1}{2}$. In case $n$ is even, consider $g = [3, 2, 5, 4, 7, 6, \dots, n-4, n-1, n-2, 1, n]$. Then $d_{BP}(r_{2,n}\, g, g) = n + 1$ and $d_{BP}(r_{3,4}\, r_{2,n}\, g, g) = n + 1$. Similar reasoning yields the desired result. □

In genomics, the direction in which a gene is oriented in a genome can also provide useful information to incorporate in rearrangement models, which can be expressed as follows in terms of Cayley graphs [11]. The hyperoctahedral group $B_n$ is defined as the group of all permutations $g^\sigma$ acting on the set $\{\pm 1, \dots, \pm n\}$ such that $g^\sigma(-i) = -g^\sigma(i)$ for all $i \in [n]$. An element of $B_n$ is a signed permutation. Signed versions of transpositions and reversals can be defined in the obvious way; a sign change transposition $t^\sigma_{i,j}$ switches the values in the $i$th and $j$th positions of a signed permutation as well as both of their signs, and so forth. Note that we also allow $i = j$ for signed transpositions and reversals so that $t_{i,i} = r_{i,i}$, $i \in [n]$, simply switches the sign of the $i$th value. We denote the set of signed elements corresponding to those in $S = T, C, R$, together with the elements $t_{i,i}$, $1 \le i \le n$, by $S^\sigma$. Note that the diameter of $Cay(B_n, R^\sigma)$ is $n + 1$ [11].
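Signed permutations and the signed operations described above can be sketched as operations on lists of non-zero integers. The following Python fragment is an illustration only; it assumes that a signed reversal reverses the chosen segment and negates every entry in it, and that the sign-forgetting map simply takes absolute values. The function names are hypothetical.

```python
from itertools import product

def signed_reversal(g, i, j):
    """Reverse positions i..j (1-indexed, i <= j) and flip the sign of each entry in the segment."""
    g = list(g)
    g[i - 1:j] = [-v for v in reversed(g[i - 1:j])]
    return g

def signed_transposition(g, i, j):
    """Exchange the entries at positions i and j and flip both signs; for i == j, flip one sign."""
    g = list(g)
    if i == j:
        g[i - 1] = -g[i - 1]
    else:
        g[i - 1], g[j - 1] = -g[j - 1], -g[i - 1]
    return g

def forget_signs(g):
    """Map a signed permutation to the underlying unsigned permutation."""
    return [abs(v) for v in g]

def reachable_by_signed_reversals(n):
    """All signed permutations obtainable from the identity by signed reversals r_{i,j}, i <= j."""
    start = tuple(range(1, n + 1))
    seen = {start}
    frontier = [start]
    while frontier:
        nxt = []
        for g in frontier:
            for i, j in product(range(1, n + 1), repeat=2):
                if i <= j:
                    h = tuple(signed_reversal(g, i, j))
                    if h not in seen:
                        seen.add(h)
                        nxt.append(h)
        frontier = nxt
    return seen

g = [1, -2, 3, 4]
print(signed_reversal(g, 2, 3))               # [1, -3, 2, 4]
print(forget_signs(signed_reversal(g, 2, 3)))  # [1, 3, 2, 4]
print(len(reachable_by_signed_reversals(3)))   # 48, the order 2^3 * 3! of B_3
```

Applying the same signed reversal twice returns the original list, which reflects the fact that each generator is its own inverse.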
Now, regarding the group $B_n$ as a wreath product [11, p. 2756], we have a short exact sequence:

$$1 \to N \to B_n \xrightarrow{\;p\;} \Sigma_n \to 1, \tag{7}$$

where the homomorphism $p : B_n \to \Sigma_n$ sends $g^\sigma \in B_n$ to the permutation of $[n]$ that maps $i$ to $|g^\sigma(i)|$ (i.e. it ignores the sign). Notice that $p$ maps $S^\sigma$ onto $S$ when $S = T, C, R$. In particular, from Lemma 3, the following holds for $m = 1, 2$:

$$\lambda_m(B_n, S^\sigma) \ge \lambda_m(\Sigma_n, S). \tag{8}$$

Moreover, $N = Ker(p)$ is isomorphic to the elementary Abelian 2-group $\mathbb{Z}_2^n$ and the short exact sequence in (7) splits, so $B_n$ is a semidirect product of $\mathbb{Z}_2^n$ and a subgroup isomorphic to $\Sigma_n$. Using these observations, we obtain:

Corollary 6. For $n \ge 7$, the following hold:
(i) $\lambda_1(B_n, T^\sigma_n) \le 3$ and $\lambda_2(B_n, T^\sigma_n) \le 6$.
(ii) $2n - 3 \le \lambda_1(B_n, C^\sigma_n) \le 2n - 1$ and $2n - 2 \le \lambda_2(B_n, C^\sigma_n) \le 4n - 2$.
(iii) $\frac{n+1}{2} \le \lambda_m(B_n, R^\sigma_n) \le n + 1$, $m = 1, 2$.

Proof: The inequalities $\lambda_1(B_n, T^\sigma_n) \le 3$ and $\lambda_1(B_n, C^\sigma_n) \le 2n - 1$ follow from similar arguments to those used in the proof of Theorem 5 (i) and (ii), using the signed analogue of Equation (6). Inequality (3) then implies that the inequalities $\lambda_2(B_n, T^\sigma_n) \le 6$ and $\lambda_2(B_n, C^\sigma_n) \le 4n - 2$ both hold. The inequality $\lambda_m(B_n, R^\sigma_n) \le n + 1$, $m = 1, 2$, follows as the diameter of $Cay(B_n, R^\sigma_n)$ is at most $n + 1$. The inequalities $2n - 3 \le \lambda_1(B_n, C^\sigma_n)$ and $2n - 2 \le \lambda_2(B_n, C^\sigma_n)$, and the remaining ones in (iii), follow by Inequality (8) and Theorem 5. □

4. Beyond $d_S$: properties of breakpoint distance

As we have seen for the breakpoint distance on $\Sigma_n$ in the last section, it can sometimes be useful to consider metrics on a group other than the distance $d_S$ arising from some Cayley graph.
Motivated by this, given an arbitrary metric $d$ on a finite group $G$ with symmetric generator set $S$, we define:

$$\lambda_1(G, S, d) := \max_{g \in G,\, s \in S} \{ d(sg, g) \}$$

and

$$\lambda_2(G, S, d) := \max_{g \in G,\, s, s' \in S} \{ d(sg, s'g) \}.$$

In particular, $\lambda_m(G,S) = \lambda_m(G, S, d_S)$ and $\lambda_m(G, S, d) \le \max_{g, g' \in G} \{ d(g, g') \}$, $m = 1, 2$. Moreover, the following analogue of Inequality (3) for an arbitrary metric $d$ on $G$ is easily seen to hold:

$$\lambda_2(G, S, d) \le 2 \cdot \lambda_1(G, S, d). \tag{9}$$

Note that, although the quantities $\lambda_m(G,S)$ and $\lambda_m(G, S, d)$ need not be directly related to one another, in certain circumstances they are. For example, if $d$ has the property that $d(g, gs) \le c$ for some constant $c$ and all $g \in G$, $s \in S$, it is an easy exercise to show that $\lambda_m(G, S, d) \le c \cdot \lambda_m(G, S)$, for $m = 1, 2$.

We now return to considering the breakpoint distance $d_{BP}$. In genomics, this distance is commonly used as a proxy for rearrangement distances. Thus it is of interest to note:

Lemma 7. For $n \ge 7$, the following hold:
(i) $\lambda_1(\Sigma_n, T_n, d_{BP}) \le 4$ and $\lambda_2(\Sigma_n, T_n, d_{BP}) \le 8$.
(ii) $\lambda_1(\Sigma_n, C_n, d_{BP}) \le 4$ and $\lambda_2(\Sigma_n, C_n, d_{BP}) \le 8$.
(iii) $\frac{n+1}{2} \le \lambda_m(\Sigma_n, R_n, d_{BP}) \le n + 1$, $m = 1, 2$.

Proof: Suppose $t = t_{i,j} \in T_n$, $1 \le i < j \le n$. Using Equation (6), it is straightforward to see that $d_{BP}(tg, g) \le 4$ holds for any $g \in \Sigma_n$. Therefore $\lambda_1(\Sigma_n, T_n, d_{BP}), \lambda_1(\Sigma_n, C_n, d_{BP}) \le 4$. The inequalities in (i) and (ii) involving $\lambda_2$ now follow from Inequality (9). The inequalities in (iii) follow from the argument used in the proof of Theorem 5 (iii) and the diameter of $d_{BP}$ on $\Sigma_n$. □

In particular, for $C$, the set of Coxeter generators of $\Sigma_n$ in the last section, and $m = 1, 2$, we have $\lambda_m(\Sigma_n, C) \ge 2n - 3$, but $\lambda_m(\Sigma_n, C, d_{BP}) \le 4$. Intriguingly, this observation can be extended as follows.
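The contrast between the two sensitivities for the Coxeter generators can be seen by exhaustive computation for $n = 7$. The Python sketch below is an illustration under stated assumptions: permutations are 0-indexed tuples framed by the sentinels $-1$ and $n$, adjacencies are unordered pairs, left multiplication by a generator exchanges two consecutive values, and the word length with respect to adjacent transpositions is computed as the number of inversions (a standard fact about Coxeter generators, not taken from the text). The function names are hypothetical.

```python
from itertools import permutations

def compose(p, q):
    """Group product used here: the map i -> p[q[i]]."""
    return tuple(p[i] for i in q)

def inverse(p):
    inv = [0] * len(p)
    for i, v in enumerate(p):
        inv[v] = i
    return tuple(inv)

def inversions(p):
    """Number of inversions, i.e. the word length in adjacent transpositions."""
    n = len(p)
    return sum(1 for a in range(n) for b in range(a + 1, n) if p[a] > p[b])

def adjacencies(g):
    """Unordered adjacent pairs of the framed list [-1, g_0, ..., g_{n-1}, n]."""
    framed = (-1,) + tuple(g) + (len(g),)
    return frozenset(frozenset(pair) for pair in zip(framed, framed[1:]))

def coxeter_vs_breakpoint(n):
    """Return (lambda_1 under d_C, lambda_1 under d_BP, lambda_2 under d_BP) for Coxeter generators."""
    group = list(permutations(range(n)))
    adj = {g: adjacencies(g) for g in group}
    gens = []
    for i in range(n - 1):
        s = list(range(n))
        s[i], s[i + 1] = s[i + 1], s[i]
        gens.append(tuple(s))

    def d_bp(a, b):
        return len(adj[a] - adj[b])

    lam1_cayley = lam1_bp = lam2_bp = 0
    for g in group:
        g_inv = inverse(g)
        images = [compose(s, g) for s in gens]  # the elements sg
        for sg in images:
            # d_C(sg, g) is the word length of g^{-1} s g because s is an involution
            lam1_cayley = max(lam1_cayley, inversions(compose(g_inv, sg)))
            lam1_bp = max(lam1_bp, d_bp(sg, g))
            for tg in images:
                lam2_bp = max(lam2_bp, d_bp(sg, tg))
    return lam1_cayley, lam1_bp, lam2_bp

result = coxeter_vs_breakpoint(7)
print(result)  # the first two entries are 11 (= 2n - 3) and 4; the third is at most 8
```

The first entry grows linearly with $n$, whereas the breakpoint sensitivities stay bounded, which is the behaviour that Lemma 7 (ii) describes.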
For $k \ge 1$, let $R^{(k)}$ denote the set of reversals $\{ r_{i,j} : 1 \le i < j \le n,\ |i - j| \le k \}$. Such 'fixed-length' reversals have been considered in the context of genome rearrangements in e.g. [2]. Note that $R^{(1)} = C$ and $R^{(k)} \subseteq R^{(k+1)}$, so that $R^{(k)}$ generates $\Sigma_n$.

Proposition 8. For $n \ge 7$, $n \ge k \ge 1$ and $m = 1, 2$,

$$\lambda_m(\Sigma_n, R^{(k)}) \ge 2\lceil n/k \rceil - 2,$$

and

$$\lambda_m(\Sigma_n, R^{(k)}, d_{BP}) \le 4(k + 1).$$

Proof: As in the proof of Theorem 5 (ii), let $g \in \Sigma_n$ be given by $g = [2, 3, \dots, n-1, n, 1]$, so that $g^{-1} r_{1,2}\, g = [n, 2, 3, \dots, n-1, 1]$. Then $l_{R^{(k)}}(g^{-1} r_{1,2}\, g) \ge 2\lceil n/k \rceil - 3$, since to transform $[n, 2, 3, \dots, 1]$ to $1_{\Sigma_n}$ requires moving 1 and $n$ back to their original positions. Similarly, $l_{R^{(k)}}(g^{-1} r_{1,2}\, r_{3,4}\, g) \ge 2\lceil n/k \rceil - 2$. This gives the first inequality in the proposition. Moreover, if $r_{i,j}, r_{p,q} \in R^{(k)}$, then it is straightforward to see that $d_{BP}(r_{i,j}\, g, g) \le 2(k+1)$ and $d_{BP}(r_{p,q}\, r_{i,j}\, g, g) \le 4(k+1)$ hold, which gives the second inequality in the proposition. □

This proposition implies that in genomics applications, adding or substituting a single reversal in a sequence of reversals in $R^{(k)}$ could potentially have a large effect on $d_{R^{(k)}}$, but a relatively small effect on $d_{BP}$ (especially for large values of $n$; e.g. there are $n \ge 20{,}000$ genes in the human genome). It could be of interest to see whether other combinations of generating sets and metrics for $\Sigma_n$ commonly used in genomics (such as transpositions [13] and the $k$-mer distance [20]) exhibit a similar type of behaviour.

5. Statistical implications

So far we have considered metric sensitivity from a purely combinatorial and deterministic perspective. But it is also of interest to investigate the sensitivity of the metrics discussed above when the elements of $S$ are randomly assigned.
Again, the motivation for this question comes from genomics, where stochastic models often play a central role (see, for example, [14], [22]). In this section, we establish a result (Proposition 9) in which the quantity $\lambda_2$ plays a crucial role in allowing underlying parameters in such stochastic models to be estimated accurately given sufficiently long genome sequences. Our motivation here is to provide some basis for eventually extending the well-developed (and tight) results on the sequence length requirements for tree reconstruction under site-substitution models (see e.g. [3, 5, 8, 14]) to more general models of genome evolution.

Consider any model of genome evolution, where an associated transformation group $G$ acts freely on a set $X$ of genomes of length $n$, and for which events in some symmetric generating set $S$ occur independently according to a Poisson process. Regard the elements of $X$ as leaves of an evolutionary (phylogenetic) tree with weighted edges [18], and let $\mu(x,y)$ be the sum of the weights of the edges of the tree connecting leaves $x, y$. Then we make the following assumption:

• The expected number of times that $s \in S$ occurs along the path in the tree connecting $x$ and $y$ can be written as $n \cdot \mu_s(x,y)$ (i.e. we assume that the rate of events scales linearly with the length of the genome).

Let $\mu(x,y) = \sum_{s \in S} \mu_s(x,y)$. Then the total number of events in $S$ that occur on the path separating $x$ and $y$ has a Poisson distribution with mean $n \cdot \mu(x,y)$. Now suppose $d$ is some metric on genomes that satisfies the following three properties:

(i) $d(x, g \circ x)$ depends just on $g$, for each $x \in X$ and $g \in G$.
(ii) $\lambda_2(G, S, d)$ is independent of $n$.
(iii) $\bar{d} = n f(\mu(x,y))$, where $\bar{d}$ is the expected value in the model of $d(x,y)$ and $f$ is a function with strictly positive but bounded first derivative on $(0, \infty)$.
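To make the assumption and properties (i)-(iii) concrete, the following Python sketch simulates the site-substitution process of Section 3 and compares the observed fraction of changed sites with its expectation. It is a minimal illustration under loudly stated assumptions: the three non-identity substitution types occur at equal rates, each site receives a Poisson number of events with mean $\mu$ (equivalently, a Poisson total with mean $n\mu$ spread uniformly over sites), and in this equal-rate case $f(\mu) = \frac{3}{4}(1 - e^{-4\mu/3})$. The function names, the sequence length, the value of $\mu$ and the seed are all arbitrary choices.

```python
import math
import random

def poisson(rng, mean):
    """Sample a Poisson variate by Knuth's multiplication method (adequate for small means)."""
    limit = math.exp(-mean)
    count = 0
    prod = rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count

def simulate_fraction_changed(n, mu, rng):
    """Fraction of the n sites whose net change is a non-identity element of the Klein four-group."""
    changed = 0
    for _ in range(n):
        state = 0  # identity element; elements of K are encoded as 0..3 under XOR
        for _ in range(poisson(rng, mu)):
            state ^= rng.choice((1, 2, 3))  # one substitution event at this site
        if state != 0:
            changed += 1
    return changed / n

def expected_fraction(mu):
    """f(mu) for the equal-rate model: (3/4) * (1 - exp(-4 mu / 3))."""
    return 0.75 * (1.0 - math.exp(-4.0 * mu / 3.0))

rng = random.Random(0)  # fixed seed so that the run is reproducible
n, mu = 20000, 0.5
observed = simulate_fraction_changed(n, mu, rng)
print(observed, expected_fraction(mu))
```

With these settings the two printed numbers should be close; the sampling error of the observed fraction is of order $1/\sqrt{n}$.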
An example to illustrate this pro cess is site substitutions, under the Kimura 3ST model, describ ed at the start of Section 3, taking d = d S , where we observed that Prop erties (i) and (ii) hold (note that in this case, d ( x, y ) is the ‘Hamming distance’ betw een the sequences whic h coun ts the num b er of sites at whic h x and y diﬀer). In that case, Prop erty (iii) also holds, since d = n 3 4 (1 − exp( − 4 µ ( x, y ) / 3)) . Note that, b oth breakp oint distance and d S satisfy (i), and we ha ve described ab o ve some cases where (ii) is satisﬁed. Whether (iii) holds (or the assumption that the exp ected n um b er of ev en ts scales linearly with n ) dep ends on the details of the underlying sto c hastic pro cess of genome rearrangemen t. F or example, for the appro ximation to the Nadeau-T a ylor mo del of genome rearrangemen t studied in Section 2 of [21], Prop erty (iii) holds under the assumption that the n umber of ev ents separating x and y has a Poisson distribution whose mean scales linearly with n (the pro of relies on Corollary 1(a) of [21]). The following result shows ho w d/n can b e used to estimate f ( µ ( x, y )) ac- curately , and thereb y µ ( x, y ) (by the assumptions regarding f ). The abilit y to estimate µ ( x, y ) accurately pro vides a direct route to accurate tree reconstruction b y standard phylogenetic metho ds (such as ‘neighbor-joining’ [16]) since µ ( x, y ) is ‘additiv e’ on the underlying tree but not on alternative binary trees (for details, see [18]). Prop osition 9. Consider any sto chastic mo del of genome evolution for which events in S o c cur ac c or ding to a Poisson pr o c ess with a r ate that sc ales line arly with n , and any metric d that satisﬁes c onditions (i) –(iii) ab ove. Then the pr ob ability 13 that d ( x, y ) /n diﬀers fr om f ( µ ( x, y )) by mor e than z c onver ges to zer o exp onen- tial ly quickly with incr e asing n . 
More precisely, for constants $b > 0$ and $c > 0$ that depend just on $\mu(x, y)$ and on the pair $(\lambda_2(G, S, d), \mu(x, y))$, respectively, we have:

$$P(|d/n - f(\mu(x, y))| \geq z) \leq \exp(-bn) + 2\exp(-cz^2 n), \quad \text{for } d = d(x, y).$$

Proof of Proposition 9: We first recall the Azuma-Hoeffding inequality (see e.g. [1]) in which $X_1, X_2, \ldots, X_k$ are independent random variables taking values in some set $S$, and $h$ is any real-valued function defined on $S^k$ that satisfies the following property for some constant $\xi$:

$$|h(x_1, x_2, \ldots, x_k) - h(x'_1, x'_2, \ldots, x'_k)| \leq \xi,$$

whenever $(x_i)$ and $(x'_i)$ differ at just one coordinate. In this case, the random variable $Y := h(X_1, X_2, \ldots, X_k)$ has the tight concentration bound for all $k > 1$:

$$P(|Y - E[Y]| \geq z) \leq 2\exp\left(-\frac{z^2}{2\xi^2 k}\right). \tag{10}$$

We apply this general result as follows. Let $K$ be the random total number of events in $S$ that occur in the path separating $x$ and $y$. By assumption, $K$ has a Poisson distribution with mean $n \cdot \mu(x, y)$. Conditional on the event $K = k$, let $X_1, \ldots, X_k$ be the actual elements of $S$ that occur. It is assumed that these events are independent. Moreover, by (i), $d(x, y)$ is a function of $X_1, \ldots, X_k$, and by (ii) this function satisfies the requirements of the Azuma-Hoeffding inequality for $\xi = \lambda_2(G, S, d)$. Thus (10) furnishes the following inequality:

$$P(|d/n - \overline{d}/n| \geq z \mid K = k) \leq 2\exp\left(-\frac{z^2 n^2}{2\lambda^2 k}\right). \tag{11}$$

Invoking Property (iii) and the law of total probability, we obtain:

$$P(|d/n - f(\mu(x, y))| \geq z) = \sum_{k \geq 0} P(|d/n - \overline{d}/n| \geq z \mid K = k)\, P(K = k),$$

from which (11) ensures the inequality:

$$P(|d/n - f(\mu(x, y))| \geq z) \leq 2E\left[\exp\left(-\frac{z^2 n^2}{2\lambda^2 K}\right)\right], \tag{12}$$

where $E$ denotes expectation with respect to $K$.
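For completeness, the substitution behind (11) and the averaging step behind (12) can be written out. This is a sketch of one reading of the argument, not a statement from the original text: it takes $h = d(x, y)$ and $\xi = \lambda = \lambda_2(G, S, d)$ in (10), uses the threshold $zn$ in place of $z$, and follows (11) in writing the centring value as $\overline{d}$.

```latex
\begin{align*}
% (10) applied conditionally on K = k, with threshold zn and xi = lambda
P\bigl(|d - \overline{d}| \geq zn \,\big|\, K = k\bigr)
  &\leq 2\exp\!\left(-\frac{(zn)^2}{2\lambda^2 k}\right)
   = 2\exp\!\left(-\frac{z^2 n^2}{2\lambda^2 k}\right), \\
% dividing by n inside the event gives (11); averaging over K gives (12)
\sum_{k \geq 0} 2\exp\!\left(-\frac{z^2 n^2}{2\lambda^2 k}\right) P(K = k)
  &= 2\,E\!\left[\exp\!\left(-\frac{z^2 n^2}{2\lambda^2 K}\right)\right].
\end{align*}
```

The first line is (11) after dividing both sides of the inequality inside the probability by $n$; the second line is the right-hand side of (12).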
Let us write $E\left[\exp\left(-\frac{z^2 n^2}{2\lambda^2 K}\right)\right]$ as a weighted sum of two conditional expectations:

$$E\left[\exp\left(-\frac{z^2 n^2}{2\lambda^2 K}\right) \,\Big|\, K > 2n \cdot \mu(x, y)\right] \cdot p + E\left[\exp\left(-\frac{z^2 n^2}{2\lambda^2 K}\right) \,\Big|\, K \leq 2n \cdot \mu(x, y)\right] \cdot (1 - p), \tag{13}$$

where $p = P(K > 2n \cdot \mu(x, y))$. The first term in (13) is bounded above by $P(K > 2n \cdot \mu(x, y))$ since $\exp\left(-\frac{z^2 n^2}{2\lambda^2 K}\right) \leq 1$; moreover, since $K$ has a Poisson distribution with mean $n \cdot \mu(x, y)$ (and so is asymptotically normally distributed with mean and variance equal to $\mu n$), the quantity $P(K > 2n \cdot \mu(x, y))$ is bounded above by a term of the form $\exp(-bn)$ where $b$ depends just on $\mu(x, y)$. The second term in (13) is bounded above by $\exp\left(-\frac{z^2 n}{4\lambda^2 \mu(x, y)}\right)$, where $\lambda = \lambda_2(G, S, d)$, since the function $x \mapsto \exp(-A/x)$ increases monotonically on $(0, \infty)$ for any constant $A > 0$. Combining these two bounds in (13), the result now follows from (12). □

Remark. Referring again to the particular case of site substitutions under the Kimura 3ST model, Proposition 9 can be strengthened to:

$$P(|d/n - f(\mu(x, y))| \geq z) \leq 2\exp(-c' z^2 n),$$

where $c' > 0$ can be chosen to be independent of $\mu(x, y)$. This stronger result is the basis of numerous results in the phylogenetic literature that show that large trees can be reconstructed from remarkably short sequences under simple site-substitution models [5]. Although the bound in Proposition 9 is less incisive, it would be of interest to explore similar phylogenetic applications for other models of genome evolution in which $\lambda_2$ is independent of $n$, such as those involving breakpoint distance under reversals of fixed length.

Acknowledgments

We thank Marston Conder, Eamonn O'Brien and Li San Wang for some helpful comments. VM thanks the Royal Society for supporting his visit to the University of Canterbury, where most of this work was undertaken.
MS thanks the Royal Society of New Zealand under its James Cook Fellowship scheme.

References

[1] Alon, N., Spencer, J., 1992. The Probabilistic Method. Wiley, New York.
[2] Chen, T., Skiena, S., 1996. Sorting with fixed-length reversals. Discr. Appl. Math. 71, 269–295.
[3] Daskalakis, C., Mossel, E., Roch, S., 2010. Evolutionary Trees and the Ising model on the Bethe lattice: a proof of Steel's conjecture. Probab. Th. Rel. Fields 149, 149–189.
[4] Eppstein, D.B.A., 1992. Word Processing in Groups. A K Peters/CRC Press.
[5] Erdős, P.L., Steel, M.A., Székely, L.A., Warnow, T., 1999. A few logs suffice to build (almost) all trees (Part 1). Rand. Struc. Alg. 14(2), 153–184.
[6] Evans, S.N., Speed, T.P., 1993. Invariants of some probability models used in phylogenetic inference. Ann. Stat. 21, 355–377.
[7] Fertin, G., Labarre, A., Rusu, I., Tannier, E., Vialette, S., 2009. Combinatorics of Genome Rearrangements, The MIT Press, Cambridge, Massachusetts, London, England.
[8] Gronau, I., Moran, S., Snir, S., 2008. Fast and reliable reconstruction of phylogenetic trees with very short edges. pp. 379–388. In: SODA: ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA.
[9] Hilborn, R.C., 2004. Sea gulls, butterflies, and grasshoppers: A brief history of the butterfly effect in nonlinear dynamics. Am. J. Phys. 72(4), 425–427.
[10] Kimura, M., 1981. Estimation of evolutionary distances between homologous nucleotide sequences. Proc. Natl. Acad. Sci. USA 78, 454–458.
[11] Kostantinova, E., 2008. Some problems on Cayley graphs. Lin. Alg. Appl. 429, 2754–2769.
[12] Kunkle, D., Cooperman, G., 2009. Harnessing parallel disks to solve Rubik's cube. Journal of Symbolic Computation 44(7), 872–890.
[13] Labarre, L., 2006. New bounds and tractable instances for the transposition distance, IEEE/ACM Trans. Comput. Biol. Bioinf. 3(4), 380–394.
[14] Mossel, E., Steel, M., 2005. How much can evolved characters tell us about the tree that generated them? pp. 384–412. In: Mathematics of Evolution and Phylogeny (Olivier Gascuel ed.), Oxford University Press.
[15] Rotman, J.J., 1995. An Introduction to the Theory of Groups. Springer-Verlag New York Inc.
[16] Saitou, N., Nei, M., 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees, Mol. Biol. Evol. 4(4), 406–425.
[17] Setubal, J., Meidanis, M., 1997. Introduction to Computational Molecular Biology, PWS Publishing Company.
[18] Semple, C., Steel, M., 2003. Phylogenetics. Oxford University Press.
[19] Sinha, A., Meller, J., 2008. Sensitivity analysis for reversal distance and breakpoint re-use in genome rearrangements, Pacific J. Biocomput. 13, 37–48.
[20] Trifonov, V., Rabadan, R., 2010. Frequency analysis techniques for identification of viral genetic data. mBio 1(3), e00156-10.
[21] Wang, L.-S., 2002. Genome Rearrangement Phylogeny Using Weighbor. pp. 112–125. In: Lecture Notes for Computer Sciences No. 2452: Proceedings for the Second Workshop on Algorithms in BioInformatics (WABI'02), Rome, Italy.
[22] Wang, L.-S., Warnow, T., 2005. Distance-based genome rearrangement phylogeny. pp. 353–380. In: Mathematics of Evolution and Phylogeny (O. Gascuel ed.), Oxford University Press.
