On the maximal sum of exponents of runs in a string
A run is an inclusion maximal occurrence in a string (as a subinterval) of a repetition $v$ with a period $p$ such that $2p \le |v|$. The exponent of a run is defined as $|v|/p$ and is $\ge 2$. We show new bounds on the maximal sum of exponents of ru…
Authors: Maxime Crochemore, Marcin Kubica, Jakub Radoszewski, Wojciech Rytter, Tomasz Waleń
Maxime Crochemore (1, 3), Marcin Kubica (2), Jakub Radoszewski (⋆, 2), Wojciech Rytter (2, 4), and Tomasz Waleń (2)

1 King's College London, London WC2R 2LS, UK, maxime.crochemore@kcl.ac.uk
2 Dept. of Mathematics, Computer Science and Mechanics, University of Warsaw, Warsaw, Poland, [kubica,jrad,rytter,walen]@mimuw.edu.pl
3 Université Paris-Est, France
4 Dept. of Math. and Informatics, Copernicus University, Toruń, Poland

Abstract. A run is an inclusion maximal occurrence in a string (as a subinterval) of a repetition $v$ with a period $p$ such that $2p \le |v|$. The exponent of a run is defined as $|v|/p$ and is $\ge 2$. We show new bounds on the maximal sum of exponents of runs in a string of length $n$. Our upper bound of $4.1n$ is better than the best previously known proven bound of $5.6n$ by Crochemore & Ilie (2008). The lower bound of $2.035n$, obtained using a family of binary words, contradicts the conjecture of Kolpakov & Kucherov (1999) that the maximal sum of exponents of runs in a string of length $n$ is smaller than $2n$.

1 Introduction

Repetitions and periodicities in strings are one of the fundamental topics in combinatorics on words [1, 14]. They are also important in other areas: lossless compression, word representation, computational biology, etc. In this paper we consider bounds on the sum of exponents of repetitions that a string of a given length may contain. In general, repetitions are studied also from other points of view, like: the classification of words (both finite and infinite) not containing repetitions of a given exponent, efficient identification of factors being repetitions of different types and computing the bounds on the number of various types of repetitions occurring in a string.
The known results in the topic and a deeper description of the motivation can be found in a survey by Crochemore et al. [4].

The concept of runs (also called maximal repetitions) has been introduced to represent all repetitions in a string in a succinct manner. The crucial property of runs is that their maximal number in a string of length $n$ (denoted as $\rho(n)$) is $O(n)$, see Kolpakov & Kucherov [10]. This fact is the cornerstone of any algorithm computing all repetitions in strings of length $n$ in $O(n)$ time. Due to the work of many people, much better bounds on $\rho(n)$ have been obtained. The lower bound $0.927n$ was first proved by Franek & Yang [7]. Afterwards, it was improved by Kusano et al. [13] to $0.944565n$ employing computer experiments, and very recently by Simpson [18] to $0.944575712n$. On the other hand, the first explicit upper bound $5n$ was settled by Rytter [16]; afterwards it was systematically improved to $3.48n$ by Puglisi et al. [15], $3.44n$ by Rytter [17], $1.6n$ by Crochemore & Ilie [2, 3] and $1.52n$ by Giraud [8]. The best known result $\rho(n) \le 1.029n$ is due to Crochemore et al. [5], but it is conjectured [10] that $\rho(n) < n$. Some results are known also for repetitions of exponent higher than 2. For instance, the maximal number of cubic runs (maximal repetitions with exponent at least 3) in a string of length $n$ (denoted $\rho_{cubic}(n)$) is known to be between $0.406n$ and $0.5n$, see Crochemore et al. [6].

⋆ Some parts of this paper were written during the author's Erasmus exchange at King's College London.

A stronger property of runs is that the maximal sum of their exponents in a string of length $n$ (notation: $\sigma(n)$) is linear in terms of $n$, see Kolpakov & Kucherov [12].
It has applications to the analysis of various algorithms, such as computing branching tandem repeats: the linearity of the sum of exponents solves a conjecture of [9] concerning the linearity of the number of maximal tandem repeats and implies that all of them can be found in linear time. For other applications, we refer to [12]. The proof that $\sigma(n) < cn$ in Kolpakov and Kucherov's paper [12] is very complex and does not provide any particular value for the constant $c$. A bound can be derived from the proof of Rytter [16], but he mentioned only that the bound that he obtains is "unsatisfactory" (it seems to be $25n$). The first explicit bound $5.6n$ for $\sigma(n)$ was provided by Crochemore and Ilie [3], who claim that it could be improved to $2.9n$ employing computer experiments. As for the lower bound on $\sigma(n)$, no exact values were previously known and it was conjectured [11, 12] that $\sigma(n) < 2n$.

In this paper we provide an upper bound of $4.1n$ on the maximal sum of exponents of runs in a string of length $n$ and also a stronger upper bound of $2.5n$ for the maximal sum of exponents of cubic runs in a string of length $n$. As for the lower bound, we refute the conjecture $\sigma(n) < 2n$ by providing an infinite family of binary strings for which the sum of exponents of runs is greater than $2.035n$.

2 Preliminaries

We consider words (strings) $u$ over a finite alphabet $\Sigma$, $u \in \Sigma^*$; the empty word is denoted by $\varepsilon$; the positions in $u$ are numbered from 1 to $|u|$. For $u = u_1 u_2 \ldots u_m$, let us denote by $u[i..j]$ a factor of $u$ equal to $u_i \ldots u_j$ (in particular $u[i] = u[i..i]$). Words $u[1..i]$ are called prefixes of $u$, and words $u[i..|u|]$ suffixes of $u$.

We say that an integer $p$ is the (shortest) period of a word $u = u_1 \ldots u_m$ (notation: $p = per(u)$) if $p$ is the smallest positive integer such that $u_i = u_{i+p}$ holds for all $1 \le i \le m - p$. We say that words $u$ and $v$ are cyclically equivalent (or that one of them is a cyclic rotation of the other) if $u = xy$ and $v = yx$ for some $x, y \in \Sigma^*$.

A run (also called a maximal repetition) in a string $u$ is an interval $[i..j]$ such that:
– the period $p$ of the associated factor $u[i..j]$ satisfies $2p \le j - i + 1$,
– the interval cannot be extended to the right nor to the left without violating the above property, that is, $u[i-1] \ne u[i+p-1]$ and $u[j-p+1] \ne u[j+1]$.

A cubic run is a run $[i..j]$ for which the shortest period $p$ satisfies $3p \le j - i + 1$. For simplicity, in the rest of the text we sometimes refer to runs and cubic runs as occurrences of the corresponding factors of $u$. The (fractional) exponent of a run is defined as $(j - i + 1)/p$.

For a given word $u \in \Sigma^*$, we introduce the following notation:
– $\rho(u)$ and $\rho_{cubic}(u)$ are the numbers of runs and cubic runs in $u$, respectively,
– $\sigma(u)$ and $\sigma_{cubic}(u)$ are the sums of exponents of runs and cubic runs in $u$, respectively.

For a non-negative integer $n$, we use the same notations $\rho(n)$, $\rho_{cubic}(n)$, $\sigma(n)$ and $\sigma_{cubic}(n)$ to denote the maximal value of the respective function for a word of length $n$.

3 Lower bound for $\sigma(n)$

Tables 1 and 2 list the sums of exponents of runs for several words of two known families that contain a very large number of runs: the words $x_i$ defined by Franek and Yang [7] (giving the lower bound $\rho(n) \ge 0.927n$, conjectured for some time to be optimal) and the modified Padovan words $y_i$ defined by Simpson [18] (giving the best known lower bound $\rho(n) \ge 0.944575712n$). These values have been computed experimentally.
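The quantities tabulated in this section follow directly from the definitions in Section 2. The paper does not describe the program used for its experiments, so the following Python snippet is only a minimal brute-force sketch of how $\sigma(u)$ can be evaluated for a short word. The names `runs` and `sum_of_exponents` and the 0-based indexing are assumptions of this illustration.

```python
from fractions import Fraction


def runs(u):
    """Return all runs of u as (i, j, p): 0-based inclusive interval and shortest period.

    For every candidate period p we scan maximal stretches on which
    u[k] == u[k + p]; a stretch of total length >= 2p is a run.
    Periods are tried in increasing order, so the first p recorded
    for an interval is its shortest period.
    """
    n = len(u)
    found = {}
    for p in range(1, n // 2 + 1):
        i = 0
        while i + p < n:
            if u[i] != u[i + p]:
                i += 1
                continue
            start = i
            while i + p < n and u[i] == u[i + p]:
                i += 1
            end = i + p - 1  # last position of the p-periodic factor
            if end - start + 1 >= 2 * p:
                found.setdefault((start, end), p)
    return [(i, j, p) for (i, j), p in sorted(found.items())]


def sum_of_exponents(u):
    """sigma(u): exact sum of (j - i + 1) / p over all runs of u."""
    return sum(Fraction(j - i + 1, p) for i, j, p in runs(u))


if __name__ == "__main__":
    for word in ("aabaab", "ababa"):
        print(word, runs(word), sum_of_exponents(word))
```

This sketch takes quadratic time, which is adequate for short words only; the longer words in the tables would call for a linear-time runs algorithm such as the one of [10].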
The values in Tables 1 and 2 suggest that for the families of words $x_i$ and $y_i$ the maximal sum of exponents could be less than $2n$. We show, however, a lower bound for $\sigma(n)$ that is greater than $2n$.

Theorem 1. There are infinitely many binary strings $w$ such that $\frac{\sigma(w)}{|w|} > 2.035$.

Proof. Let us define two morphisms $\phi : \{a, b, c\} \mapsto \{a, b, c\}$ and $\psi : \{a, b, c\} \mapsto \{0, 1\}$ as follows:

$\phi(a) = baaba$, $\phi(b) = ca$, $\phi(c) = bca$,
$\psi(a) = 01011$, $\psi(b) = \psi(c) = 01001011$.

We define $w_i = \psi(\phi^i(a))$. Table 3 shows the sums of exponents of runs in words $w_i$, computed experimentally. Clearly, for any word $w = (w_8)^k$, $k \ge 1$, we have $\frac{\sigma(w)}{|w|} > 2.035$. □

  i    |x_i|     ρ(x_i)/|x_i|    σ(x_i)        σ(x_i)/|x_i|
  1    6         0.3333          4.00          0.6667
  2    27        0.7037          39.18         1.4510
  3    116       0.8534          209.70        1.8078
  4    493       0.9047          954.27        1.9356
  5    2090      0.9206          4130.66       1.9764
  6    8855      0.9252          17608.48      1.9885
  7    37512     0.9266          74723.85      1.9920
  8    158905    0.9269          316690.85     1.9930
  9    673134    0.9270          1341701.95    1.9932

Table 1. Number of runs and sum of exponents of runs in Franek & Yang's [7] words $x_i$.

  i    |y_i|     ρ(y_i)/|y_i|    σ(y_i)        σ(y_i)/|y_i|
  4    37        0.7568          57.98         1.5671
  8    125       0.8640          225.75        1.8060
  12   380       0.9079          726.66        1.9123
  16   1172      0.9309          2303.21       1.9652
  20   3609      0.9396          7165.93       1.9856
  24   11114     0.9427          22148.78      1.9929
  28   34227     0.9439          68307.62      1.9957
  32   105405    0.9443          210467.18     1.9967
  36   324605    0.9445          648270.74     1.9971
  40   999652    0.9445          1996544.30    1.9972

Table 2. Number of runs and sum of exponents of runs in Simpson's [18] modified Padovan words $y_i$.

  i    |w_i|      σ(w_i)         σ(w_i)/|w_i|
  1    31         47.10          1.5194
  2    119        222.26         1.8677
  3    461        911.68         1.9776
  4    1751       3533.34        2.0179
  5    6647       13498.20       2.0307
  6    25205      51264.37       2.0339
  7    95567      194470.30      2.0349
  8    362327     737393.11      2.0352
  9    1373693    2795792.39     2.0352
  10   5208071    10599765.15    2.0353

Table 3. Sums of exponents of runs in words $w_i$.

4 Upper bounds for $\sigma(n)$ and $\sigma_{cubic}(n)$

In this section we utilize the concept of handles of runs as defined in [6]. The original definition refers only to cubic runs, but here we extend it also to ordinary runs.

Let $u \in \Sigma^*$ be a word of length $n$. Let us denote by $P = \{p_1, p_2, \ldots, p_{n-1}\}$ the set of inter-positions in $u$ that are located between pairs of consecutive letters of $u$. We define a function $H$ assigning to each run $v$ in $u$ a set of some inter-positions within $v$ (called later on handles); that is, $H$ is a mapping from the set of runs occurring in $u$ to the set $2^P$ of subsets of $P$. Let $v$ be a run with period $p$ and let $w$ be the prefix of $v$ of length $p$. Let $w_{min}$ and $w_{max}$ be the minimal and maximal words (in lexicographical order) cyclically equivalent to $w$. $H(v)$ is defined as follows:
a) if $w_{min} = w_{max}$ then $H(v)$ contains all inter-positions within $v$,
b) if $w_{min} \ne w_{max}$ then $H(v)$ contains inter-positions between consecutive occurrences of $w_{min}$ in $v$ and between consecutive occurrences of $w_{max}$ in $v$.
Note that $H(v)$ can be empty for a non-cubic run $v$.

[Figure omitted] Fig. 1. An example of a word with two highlighted runs $v_1$ and $v_2$. For $v_1$ we have $w_{min1} \ne w_{max1}$ and for $v_2$ the corresponding words are equal to b (a one-letter word). The inter-positions belonging to the sets $H(v_1)$ and $H(v_2)$ are pointed by arrows.

Proofs of the following properties of handles of runs can be found in [6]:
1. Case (a) in the definition of $H(v)$ implies that $|w_{min}| = 1$.
2. $H(v_1) \cap H(v_2) = \emptyset$ for any two distinct runs $v_1$ and $v_2$ in $u$.

To prove the upper bound for $\sigma(n)$, we need to state an additional property of handles of runs.
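Before that property is stated, the definition of handles can be made concrete with a short Python sketch. It is an illustration only and not code from [6] or from the authors. All helper names are hypothetical, indexing is 0-based, and inter-position `k` is taken to mean the gap between `u[k-1]` and `u[k]`. The sketch computes $H(v)$ for every run found by brute force and checks property 2 (disjointness) on small binary words.

```python
from itertools import product


def runs(u):
    """Brute-force runs of u as (i, j, p): 0-based inclusive interval, shortest period."""
    n, found = len(u), {}
    for p in range(1, n // 2 + 1):
        i = 0
        while i + p < n:
            if u[i] != u[i + p]:
                i += 1
                continue
            start = i
            while i + p < n and u[i] == u[i + p]:
                i += 1
            end = i + p - 1
            if end - start + 1 >= 2 * p:
                found.setdefault((start, end), p)  # smallest period is recorded first
    return [(i, j, p) for (i, j), p in sorted(found.items())]


def handles(u, i, j, p):
    """H(v) for the run v = u[i..j] with period p, as a set of inter-positions of u."""
    v = u[i:j + 1]
    w = v[:p]
    rotations = [w[k:] + w[:k] for k in range(p)]
    w_min, w_max = min(rotations), max(rotations)
    if w_min == w_max:
        # Case (a): every inter-position strictly inside v.
        return set(range(i + 1, j + 1))
    result = set()
    for word in (w_min, w_max):
        starts = [s for s in range(len(v) - p + 1) if v[s:s + p] == word]
        for first, second in zip(starts, starts[1:]):
            if second - first == p:  # adjacent occurrences share one boundary
                result.add(i + second)
    return result


def handles_are_disjoint(u):
    """Check property 2: handles of distinct runs of u never share an inter-position."""
    seen = set()
    for i, j, p in runs(u):
        h = handles(u, i, j, p)
        if seen & h:
            return False
        seen |= h
    return True


def check_all_binary_words(max_len):
    """Check disjointness on every word over {a, b} of length 1..max_len."""
    return all(
        handles_are_disjoint("".join(letters))
        for n in range(1, max_len + 1)
        for letters in product("ab", repeat=n)
    )
```

For instance, `handles("ababa", 0, 4, 2)` gives the two inter-positions 2 and 3: one between the two occurrences of `ab` and one between the two occurrences of `ba`.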
Let $\mathcal{R}(u)$ be the set of all runs in a word $u$, and let $\mathcal{R}_1(u)$ and $\mathcal{R}_{\ge 2}(u)$ be the sets of runs with period 1 and at least 2, respectively.

Lemma 1. If $v \in \mathcal{R}_1(u)$ then $\sigma(v) = |H(v)| + 1$. If $v \in \mathcal{R}_{\ge 2}(u)$ then $\lceil \sigma(v) \rceil \le \frac{|H(v)|}{2} + 3$.

Proof. For the case of $v \in \mathcal{R}_1(u)$, the proof is straightforward from the definition of handles. In the opposite case, it is sufficient to note that both words $w_{min}^k$ and $w_{max}^k$ for $k = \lfloor \sigma(v) \rfloor - 1$ are factors of $v$, and thus $|H(v)| \ge 2 \cdot (\lfloor \sigma(v) \rfloor - 2)$. Hence $\lceil \sigma(v) \rceil \le \lfloor \sigma(v) \rfloor + 1 \le \frac{|H(v)|}{2} + 3$. □

Now we are ready to prove the upper bound for $\sigma(n)$. In the proof we use the bound $\rho(n) \le 1.029n$ on the number of runs from [5].

Theorem 2. The sum of the exponents of runs in a string of length $n$ is less than $4.1n$.

Proof. Let $u$ be a word of length $n$. Using Lemma 1, we obtain:

$$
\begin{aligned}
\sum_{v \in \mathcal{R}(u)} \sigma(v)
 &= \sum_{v \in \mathcal{R}_1(u)} \sigma(v) + \sum_{v \in \mathcal{R}_{\ge 2}(u)} \sigma(v) \\
 &\le \sum_{v \in \mathcal{R}_1(u)} \bigl(|H(v)| + 1\bigr) + \sum_{v \in \mathcal{R}_{\ge 2}(u)} \left(\frac{|H(v)|}{2} + 3\right) \\
 &= \sum_{v \in \mathcal{R}_1(u)} |H(v)| + |\mathcal{R}_1(u)| + \sum_{v \in \mathcal{R}_{\ge 2}(u)} \frac{|H(v)|}{2} + 3 \cdot |\mathcal{R}_{\ge 2}(u)| \\
 &\le 3 \cdot |\mathcal{R}(u)| + A + B/2,
\end{aligned}
\tag{1}
$$

where $A = \sum_{v \in \mathcal{R}_1(u)} |H(v)|$ and $B = \sum_{v \in \mathcal{R}_{\ge 2}(u)} |H(v)|$. Due to the disjointness of handles of runs (the second property of handles) and the fact that $u$ has only $n - 1$ inter-positions, $A + B < n$, and thus $A + B/2 < n$. Combining this with (1), we obtain:

$$
\sum_{v \in \mathcal{R}(u)} \sigma(v) < 3 \cdot |\mathcal{R}(u)| + n \le 3 \cdot \rho(n) + n \le 3 \cdot 1.029n + n < 4.1n.
$$
□

A similar approach for cubic runs, this time using the bound of $0.5n$ for $\rho_{cubic}(n)$ from [6], enables us to immediately provide a stronger upper bound for the function $\sigma_{cubic}(n)$.

Theorem 3. The sum of the exponents of cubic runs in a string of length $n$ is less than $2.5n$.

Proof. Let $u$ be a word of length $n$. Using the same inequalities as in the proof of Theorem 2, we obtain:

$$
\sum_{v \in \mathcal{R}_{cubic}(u)} \sigma(v) < 3 \cdot |\mathcal{R}_{cubic}(u)| + n \le 3 \cdot \rho_{cubic}(n) + n \le 3 \cdot 0.5n + n = 2.5n,
$$

where $\mathcal{R}_{cubic}(u)$ denotes the set of all cubic runs of $u$. □

References

1. J. Berstel and J. Karhumäki. Combinatorics on words: a tutorial. Bulletin of the EATCS, 79:178–228, 2003.
2. M. Crochemore and L. Ilie. Analysis of maximal repetitions in strings. In L. Kucera and A. Kucera, editors, MFCS, volume 4708 of Lecture Notes in Computer Science, pages 465–476. Springer, 2007.
3. M. Crochemore and L. Ilie. Maximal repetitions in strings. J. Comput. Syst. Sci., 74(5):796–807, 2008.
4. M. Crochemore, L. Ilie, and W. Rytter. Repetitions in strings: Algorithms and combinatorics. Theor. Comput. Sci., 410(50):5227–5235, 2009.
5. M. Crochemore, L. Ilie, and L. Tinta. Towards a solution to the "runs" conjecture. In P. Ferragina and G. M. Landau, editors, CPM, volume 5029 of Lecture Notes in Computer Science, pages 290–302. Springer, 2008.
6. M. Crochemore, C. Iliopoulos, M. Kubica, J. Radoszewski, W. Rytter, and T. Walen. On the maximal number of cubic runs in a string. In Proceedings of LATA, 2010 (to appear).
7. F. Franek and Q. Yang. An asymptotic lower bound for the maximal number of runs in a string. Int. J. Found. Comput. Sci., 19(1):195–203, 2008.
8. M. Giraud. Not so many runs in strings. In C. Martín-Vide, F. Otto, and H. Fernau, editors, LATA, volume 5196 of Lecture Notes in Computer Science, pages 232–239. Springer, 2008.
9. D. Gusfield and J. Stoye. Simple and flexible detection of contiguous repeats using a suffix tree (preliminary version). In M. Farach-Colton, editor, CPM, volume 1448 of Lecture Notes in Computer Science, pages 140–152. Springer, 1998.
10. R. M. Kolpakov and G. Kucherov. Finding maximal repetitions in a word in linear time. In Proceedings of the 40th Symposium on Foundations of Computer Science, pages 596–604, 1999.
11. R. M. Kolpakov and G. Kucherov. On maximal repetitions in words. J. of Discr. Alg., 1:159–186, 1999.
12. R. M. Kolpakov and G. Kucherov. On the sum of exponents of maximal repetitions in a word. Tech. Report 99-R-034, LORIA, 1999.
13. K. Kusano, W. Matsubara, A. Ishino, H. Bannai, and A. Shinohara. New lower bounds for the maximum number of runs in a string. CoRR, abs/0804.1214, 2008.
14. M. Lothaire. Combinatorics on Words. Addison-Wesley, Reading, MA., U.S.A., 1983.
15. S. J. Puglisi, J. Simpson, and W. F. Smyth. How many runs can a string contain? Theor. Comput. Sci., 401(1-3):165–171, 2008.
16. W. Rytter. The number of runs in a string: Improved analysis of the linear upper bound. In B. Durand and W. Thomas, editors, STACS, volume 3884 of Lecture Notes in Computer Science, pages 184–195. Springer, 2006.
17. W. Rytter. The number of runs in a string. Inf. Comput., 205(9):1459–1469, 2007.
18. J. Simpson. Modified Padovan words and the maximum number of runs in a word. Australasian J. of Comb., 46:129–145, 2010.