Understanding maximal repetitions in strings
The cornerstone of any algorithm computing all repetitions in a string of length n in O(n) time is the fact that the number of runs (or maximal repetitions) is O(n). We give a simple proof of this result. As a consequence of our approach, the stronge…
Authors: Maxime Crochemore (IGM), Lucian Ilie
Symposium on Theoretical Aspects of Computer Science 2008 (Bordeaux), pp. 11-16, www.stacs-conf.org

UNDERSTANDING MAXIMAL REPETITIONS IN STRINGS

MAXIME CROCHEMORE (1) AND LUCIAN ILIE (2)

(1) King's College London, Strand, London WC2R 2LS, United Kingdom, and Institut Gaspard-Monge, Université Paris-Est, France. E-mail address: maxime.crochemore@kcl.ac.uk
(2) Department of Computer Science, University of Western Ontario, N6A 5B7, London, Ontario, Canada. E-mail address: ilie@csd.uwo.ca

Abstract. The cornerstone of any algorithm computing all repetitions in a string of length $n$ in $O(n)$ time is the fact that the number of runs (or maximal repetitions) is $O(n)$. We give a simple proof of this result. As a consequence of our approach, the stronger result concerning the linearity of the sum of exponents of all runs follows easily.

1. Introduction

Repetitions in strings constitute one of the most fundamental areas of string combinatorics with very important applications to text algorithms, data compression, or analysis of biological sequences. One of the most important problems in this area was finding an algorithm for computing all repetitions in linear time. A major obstacle was encoding all repetitions in linear space because there can be $\Theta(n \log n)$ occurrences of squares in a string of length $n$ (see [1]). All repetitions are encoded in runs (that is, maximal repetitions) and Main [9] used the s-factorization of Crochemore [1] to give a linear-time algorithm for finding all leftmost occurrences of runs. What was essentially missing to have a linear-time algorithm for computing all repetitions was proving that there are at most linearly many runs in a string. Iliopoulos et al. [4] showed that this property is true for Fibonacci words. The general result was achieved by Kolpakov and Kucherov [7] who gave a linear-time algorithm for locating all runs in [6].
Kolpakov and Kucherov proved that the number of runs in a string of length $n$ is at most $cn$ but could not provide any value for the constant $c$. Recently, Rytter [10] proved that $c \le 5$. The conjecture in [7] is that $c = 1$ for binary alphabets, as supported by computations for string lengths up to 31. Using the technique of this note, we have proved [2] that it is smaller than 1.6, which is the best value so far.

1998 ACM Subject Classification: F.2.2 Nonnumerical Algorithms and Problems; G.2.1 Combinatorics.
Key words and phrases: combinatorics on words, repetitions in strings, runs, maximal repetitions, maximal periodicities, sum of exponents.
This work has been done during the second author's stay at Institut Gaspard-Monge. The same author's research was supported in part by NSERC.
© M. Crochemore and L. Ilie. Creative Commons Attribution-NoDerivs License.

Both proofs in [6] and [10] are very intricate and our contribution is a simple proof of the linearity. On the one hand, the search for a simple proof is motivated by the very importance of the result: this is the core of the analysis of any optimal algorithm computing all repetitions in strings. None of the above-mentioned proofs can be included in a textbook. We believe that the simple proof shows very clearly why the number of runs is linear. On the other hand, a better understanding of the structure of runs could pave the way for simpler linear-time algorithms for finding all repetitions. For the algorithm of [6] (and [9]), relatively complicated and space-consuming data structures are needed, such as suffix trees.

The technical contribution of the paper is based on the notion of $\delta$-close runs (runs having close centers), which is an improvement on the notion of neighbors (runs having close starting positions) introduced by Rytter [10].
On top of that, our approach enables us to derive easily the stronger result concerning the linearity of the sum of exponents of all runs of a string. Clearly this result implies the first one, but the converse is not obvious. The second result was given another long proof in [7]; it follows also from [10].

Finally, we strongly believe that our ideas in this paper can be further refined to improve significantly the upper bound on the number of runs, if not to prove the conjecture. The latest refinements and computations (December 2007) show a $1.084n$ bound.

2. Definitions

Let $A$ be an alphabet and $A^*$ the set of all finite strings over $A$. We denote by $|w|$ the length of a string $w$, by $w[i]$ its $i$th letter, and by $w[i\,..\,j]$ its factor $w[i]w[i+1]\cdots w[j]$. We say that $w$ has period $p$ iff $w[i] = w[i+p]$, for all $1 \le i \le |w| - p$. The smallest period of $w$ is called the period of $w$ and the ratio between the length and the period of $w$ is called the exponent of $w$.

For a positive integer $n$, the $n$th power of $w$ is defined inductively by $w^1 = w$, $w^n = w^{n-1}w$. A string is primitive if it cannot be written as a proper integer (two or more) power of another string. Any nonempty string can be uniquely written as an integer power of a primitive string, called its primitive root. It can also be uniquely written in the form $u^e v$ where $|u|$ is its (smallest) period, $e$ is the integral part of its exponent, and $v$ is a proper prefix of $u$.

The following well-known synchronization property will be useful: If $w$ is primitive, then $w$ appears as a factor of $ww$ only as a prefix and as a suffix (not in-between). Another property we use is Fine and Wilf's periodicity lemma: If $w$ has periods $p$ and $q$ and $|w| \ge p + q$, then $w$ has also period $\gcd(p, q)$.
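The definitions above translate directly into a few lines of code. The following is a minimal brute-force sketch, not part of the paper: the function names (`has_period`, `smallest_period`, `exponent`, `is_primitive`, `synchronizes`, `fine_wilf_holds`) are illustrative assumptions, indices are 0-based (the text uses 1-based positions), and strings are assumed nonempty. It checks the synchronization property and the weak form of Fine and Wilf's lemma exhaustively on short binary strings.

```python
from fractions import Fraction
from itertools import product
from math import gcd


def has_period(w: str, p: int) -> bool:
    """True iff w[i] == w[i + p] for every valid index i (0-based here)."""
    return all(w[i] == w[i + p] for i in range(len(w) - p))


def smallest_period(w: str) -> int:
    """The period of a nonempty string: its smallest period (|w| always works)."""
    return next(p for p in range(1, len(w) + 1) if has_period(w, p))


def exponent(w: str) -> Fraction:
    """Ratio between the length of w and its smallest period."""
    return Fraction(len(w), smallest_period(w))


def is_primitive(w: str) -> bool:
    """w is a proper power iff its smallest period p < |w| divides |w|."""
    p = smallest_period(w)
    return p == len(w) or len(w) % p != 0


def synchronizes(w: str) -> bool:
    """True iff w occurs in ww only as a prefix and as a suffix."""
    return (w + w).find(w, 1) == len(w)


def fine_wilf_holds(w: str, p: int, q: int) -> bool:
    """Weak Fine and Wilf: periods p, q and |w| >= p + q imply period gcd(p, q)."""
    if has_period(w, p) and has_period(w, q) and len(w) >= p + q:
        return has_period(w, gcd(p, q))
    return True  # hypotheses not met, nothing to check


if __name__ == "__main__":
    print(smallest_period("babab"), exponent("babab"))  # prints: 2 5/2
    # Exhaustive sanity check on all binary strings of length 1..8.
    for n in range(1, 9):
        for letters in product("ab", repeat=n):
            w = "".join(letters)
            assert is_primitive(w) == synchronizes(w)
            assert all(fine_wilf_holds(w, p, q)
                       for p in range(1, n + 1) for q in range(p, n + 1))
    print("checks passed")
```

The quadratic-time period computation is chosen for readability only; a linear-time border array would serve the same purpose on longer inputs.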
The form of Fine and Wilf's lemma stated above is a bit weaker than the original lemma, which works as soon as $|w| \ge p + q - \gcd(p, q)$, but it is good enough for our purpose. We refer the reader to [8] for all concepts used here.

For a string $w = w[1\,..\,n]$, a run$^1$ (or maximal repetition) is an interval $[i\,..\,j]$, $1 \le i < j \le n$, such that (i) the factor $w[i\,..\,j]$ is periodic (its exponent is 2 at least) and (ii) both $w[i-1\,..\,j]$ and $w[i\,..\,j+1]$, if defined, have a strictly higher (smallest) period. As an example, consider $w = abbababbaba$; $[3\,..\,7]$ is a run with period 2 and exponent 2.5; we have $w[3\,..\,7] = babab = (ba)^{2.5}$. Other runs are $[2\,..\,3]$, $[7\,..\,8]$, $[8\,..\,11]$, $[5\,..\,10]$ and $[1\,..\,11]$.

$^1$ Runs were introduced in [9] under the name maximal periodicities; they are called m-repetitions in [7] and runs in [4].

For a run starting at $i$ and having period $|x| = p$, we shall call $w[i\,..\,i+2p-1] = x^2$ the square of the run (this is the only part of a run we can count on). Note that $x$ is primitive and the square of a run cannot be extended to the left (with the same period) but may be extendable to the right. The center of the run is the position $c = i + p$. We shall denote the beginning of the run by $i_x = i$, the end of its square by $e_x = i_x + 2p - 1$, and its center by $c_x = i_x + p$.

3. Linear number of runs

We describe in this section our proof of the linear number of runs. The idea is to partition the runs by grouping together those having close centers and similar periods. To this aim, for any $\delta > 0$, we say that two runs having squares $x^2$ and $y^2$ are $\delta$-close if (i) $|c_x - c_y| \le \delta$ and (ii) $2\delta \le |x|, |y| \le 3\delta$. We prove that there cannot be more than three mutually $\delta$-close runs. (There is one exception to this rule, case (vi) below, but then even fewer runs are obtained.)
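The definitions of a run, its center, and $\delta$-closeness can be checked by brute force on the example string $abbababbaba$. The sketch below is only an illustration under stated assumptions, not the linear-time algorithm of [6]: the names `runs`, `center` and `delta_close` are hypothetical, a run is represented as a triple `(i, j, p)` with a 1-based inclusive interval as in the text, and the enumeration is deliberately naive (it recomputes the smallest period of every factor).

```python
def smallest_period(w: str) -> int:
    """Smallest p >= 1 with w[k] == w[k + p] for all valid k (w nonempty)."""
    return next(p for p in range(1, len(w) + 1)
                if all(w[k] == w[k + p] for k in range(len(w) - p)))


def runs(w: str) -> list[tuple[int, int, int]]:
    """All runs of w as (i, j, p): 1-based inclusive interval [i..j], period p."""
    n = len(w)
    found = []
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):
            p = smallest_period(w[i - 1:j])
            if j - i + 1 < 2 * p:
                continue  # condition (i) fails: exponent below 2
            # Condition (ii): each one-letter extension, if defined,
            # must have a strictly higher smallest period.
            left_ok = i == 1 or smallest_period(w[i - 2:j]) > p
            right_ok = j == n or smallest_period(w[i - 1:j + 1]) > p
            if left_ok and right_ok:
                found.append((i, j, p))
    return found


def center(run: tuple[int, int, int]) -> int:
    """Center c = i + p of a run (i, j, p)."""
    i, _, p = run
    return i + p


def delta_close(a: tuple[int, int, int], b: tuple[int, int, int],
                delta: float) -> bool:
    """(i) centers within delta and (ii) both periods in [2*delta, 3*delta]."""
    close_centers = abs(center(a) - center(b)) <= delta
    similar_periods = all(2 * delta <= r[2] <= 3 * delta for r in (a, b))
    return close_centers and similar_periods


if __name__ == "__main__":
    print(runs("abbababbaba"))
    # [(1, 11, 5), (2, 3, 1), (3, 7, 2), (5, 10, 3), (7, 8, 1), (8, 11, 2)]
```

The six triples agree with the runs listed for this example in Section 2. The enumeration is far from linear time and is meant only for experimenting with short strings.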
Having at most three mutually $\delta$-close runs means that the number of runs with periods between $2\delta$ and $3\delta$ in a string of length $n$ is at most $\frac{3n}{\delta}$. Summing up for values $\delta_i = \frac{1}{2}\left(\frac{3}{2}\right)^i$, $i \ge 0$, all periods are considered and we obtain that the number of runs is at most

$$\sum_{i=0}^{\infty} \frac{3n}{\delta_i} = \sum_{i=0}^{\infty} \frac{3n}{\frac{1}{2}\left(\frac{3}{2}\right)^i} = 6n \sum_{i=0}^{\infty} \left(\frac{2}{3}\right)^i = 18n. \tag{3.1}$$

For this purpose, we start investigating what happens when three runs in a string $w$ are $\delta$-close. Let us denote their squares by $x^2$, $y^2$, $z^2$, their periods by $|x| = p$, $|y| = q$, $|z| = r$, and assume $p \le q \le r$. We discuss below all the ways in which $x^2$ and $y^2$ can be positioned relative to each other and see that long factors of both runs have small periods which $z^2$ has to synchronize. This will restrict the beginning of $z^2$ to only one choice as otherwise some run would be left extendable. Then a fourth run $\delta$-close to the previous three cannot exist. Notice that, for cases (i)-(v) we assume the centers of the runs are different; the case when they coincide is covered by (vi).

(i) $(i_y < i_x <)\ c_y < c_x < e_x \le e_y$. Then $x$ and the suffix of length $e_y - c_x$ of $y$ have period $q - p$; see Fig. 1(i). We may assume the string corresponding to this period is a primitive string as otherwise we can make the same reasoning with its primitive root. Since $z^2$ is $\delta$-close to both $x^2$ and $y^2$, it must be that $c_z \in [c_x - \delta\,..\,c_y + \delta]$. Consider the interval of length $q - p$ that ends at the leftmost possible position for $c_z$, that is, $I = [c_x - \delta - (q - p)\,..\,c_x - \delta - 1]$. It is included in the first period of $z^2$, that is, $[i_z\,..\,c_z - 1]$, and in $[i_x\,..\,c_y]$. Thus $w[I]$ is primitive and equal, due to $z^2$, to $w[I + r]$ which is a factor of $w[c_x\,..\,e_y]$. Therefore, the periods inside the former must synchronize with the ones in the latter.
It follows, in the case $i_z > i_x - (q - p)$, that $w[i_z - 1] = w[c_z - 1]$, that is, $z^2$ is left extendable, a contradiction. If $i_z < i_x - (q - p)$, then $w[c_x - 1] = w[i_x - (q - p) - 1] = w[i_x - 1]$, that is, $x^2$ is left extendable, a contradiction. The only possibility is that $i_z = i_x - (q - p)$ and $r$ equals $q$ plus a multiple of $q - p$. Here is an example: $w = baabababaababababaab$, $x^2 = w[5\,..\,14] = (ababa)^2$, $y^2 = w[1\,..\,14] = (baababa)^2$, and $z^2 = w[3\,..\,20] = (abababaab)^2$. We have already, due to $z^2$, that $x = \rho^{\ell}\rho'$, where $|\rho| = q - p$ and $\rho'$ is a prefix of $\rho$. A fourth run $\delta$-close to the previous three would have to have the same beginning as $z^2$ and the length of its period would have to be also $q$ plus a multiple of $q - p$. This would imply an equation of the form $\rho^m \rho' = \rho' \rho^m$ and then $\rho$ and $\rho'$ are powers of the same string, a contradiction with the primitivity of $x$.

(ii) $(i_y < i_x <)\ c_y < c_x < e_y \le e_x$; this is similar to (i); see Fig. 1(ii). Here the prefix of length $e_y - c_x$ of $x$ is a suffix of $y$ and has period $q - p$.

[Figure 1: Relative position of $x^2$ and $y^2$; panels (i)-(vi).]

(iii) $i_y < i_x < c_x < c_y\ (< e_x < e_y)$. Here $x$ and the prefix of length $c_x - i_y$ of $y$ have period $q - p$; see Fig. 1(iii). As above, a third $\delta$-close run $z^2$ would have to share the same beginning with $y^2$, otherwise one of $y^2$ or $z^2$ would be left extendable. A fourth $\delta$-close run would have to start at the same place and, because of the three-prefix-square lemma$^2$ of [3], since $y$ is primitive, it would have a period at least $q + r$, which is impossible.

(iv) $i_x < i_y\ (< c_x < c_y < e_x < e_y)$; this is similar to (iii); see Fig. 1(iv).
A third run would begin at the same position as $y^2$ and there is no fourth run.

(v) $i_x = i_y$; see Fig. 1(v). Here not even a third $\delta$-close run exists because of the three-square lemma that implies $r \ge p + q$.

(vi) $c_x = c_y$. This case is significantly different from the other ones, as we can have many $\delta$-close runs here. However, the existence of many runs with the same center implies very strong periodicity properties of the string which allow us to count the runs globally and obtain even fewer runs than before. In this case both $x$ and $y$ have the same small period $\ell = q - p$; see Fig. 1(vi). If we write $c = c_y$ then we have $h$ runs $x_j^{\alpha_j}$, $1 \le j \le h$, beginning at positions $i_{x_j} = c - ((j - 1)\ell + \ell')$, where $\ell'$ is the length of the suffix of $x$ that is a prefix of the period. We show that in this case we have fewer runs than counted in the sum (3.1). For $h \le 9$ there is nothing to prove as no four of our $x_j^{\alpha_j}$ runs are counted for the same $\delta$. Assume $h \ge 10$. There exists $\delta_i$ such that $\frac{\ell}{2} \le \delta_i \le \frac{3\ell}{4}$, that is, this $\delta_i$ is considered in (3.1). Then it is not difficult to see that there is no run in $w$ with period between $\ell$ and $\frac{9}{4}\ell$ and center inside $J = [c + \ell + 1\,..\,c + (h - 2)\ell + \ell']$. But $\ell \le 2\delta_i < 3\delta_i \le \frac{9}{4}\ell$ and the length of $J$ is $(h - 3)\ell + \ell' \ge (h + 1)\delta_i$. This means that at least $h$ intervals of length $\delta_i$ in the sum (3.1) are covered by $J$ and therefore at least $3h$ runs in (3.1) are replaced by our $h$ runs. We also need to mention that these $h$ intervals of length $\delta_i$ are not reused by a different center with multiple runs since such centers cannot be close to each other.

$^2$ For three words $u$, $v$, $w$, it states that if $uu$ is a prefix of $vv$, $vv$ is a prefix of $ww$, and $u$ is primitive, then $|u| + |v| \le |w|$.
Indeed, if we have two centers $c_j$ with the above parameters $h_j$, $\ell_j$, $j = 1, 2$, then, as soon as the longest runs overlap over $\ell_1 + \ell_2$ positions, we have $\ell_1 = \ell_2$, due to Fine and Wilf's lemma. Then, the closest positions of $J_1$ and $J_2$ cannot be closer than $\ell_1 = \ell_2 \ge \delta_i$ as this would make some of the runs non-primitive, a contradiction. Thus the bound in (3.1) still holds and we proved

Theorem 3.1. The number of runs in a string of length $n$ is $O(n)$.

4. The sum of exponents

Using the above approach, we show in this section that the sum of exponents of all runs is also linear. The idea is to prove that the sum of exponents of all runs with the centers in an interval of length $\delta$ and periods between $2\delta$ and $3\delta$ is less than 8. (As in the previous proof, there are exceptions to this rule, but in those cases we get a smaller sum of exponents.) Then a computation similar to (3.1) gives that the sum of exponents is at most $48n$.

To start with, Fine and Wilf's periodicity lemma can be rephrased as follows: For two primitive strings $x$ and $y$, any powers $x^{\alpha}$ and $y^{\beta}$ cannot have a common factor longer than $|x| + |y|$ as such a factor would have also period $\gcd(|x|, |y|)$, contradicting the primitivity of $x$ and $y$.

Next consider two $\delta$-close runs, $x^{\alpha}$ and $y^{\beta}$, $\alpha, \beta \in \mathbb{Q}$. It cannot be that both $\alpha$ and $\beta$ are 2.5 or larger, as this would imply an overlap of length at least $|x| + |y|$ between the two runs, which is forbidden by Fine and Wilf's lemma since $x$ and $y$ are primitive. Therefore, in case we have three mutually $\delta$-close runs, two of them must have their exponents smaller than 2.5. If the exponent of the third run is less than 3, we obtain the total of 8 we were looking for. However, the third run, say $z^{\gamma}$, $\gamma \in \mathbb{Q}$, may have a larger exponent.
If it does, that affects the runs in the neighboring intervals of length $\delta$. More precisely, if $\gamma \ge 3$, then there cannot be any center of a run with period between $2\delta$ and $3\delta$ in the next (to the right) interval of length $\delta$. Indeed, the overlap between any such run and $z^{\gamma}$ would imply, as above, that their roots are not primitive, a contradiction. In general, the following $\lfloor 2(\gamma - 2.5) \rfloor$ intervals of length $\delta$ cannot contain any center of such runs. Thus, we obtain a smaller sum of exponents when this situation is met.

The second exception is given by case (vi) in the previous proof, that is, when many runs share the same center; we use the same notation as in (vi). We need to be aware of the exponent of the run $x_1^{\alpha_1}$, with the smallest period, as $\alpha_1$ can be as large as $\ell$ (and unrelated to $h$, the number of runs with the same center). We shall count $\alpha_1$ into the appropriate interval of length $\delta_i$; notice that $x_1^{\alpha_1}$ and $x_2^{\alpha_2}$ are never $\delta$-close, for any $\delta$, because $|x_2| > 2|x_1|$. For $2 \le j \le h - 1$, the period $|x_j|$ cannot be extended by more than $\ell$ positions to the right past the end of the initial square, and thus $\alpha_j \le 2 + \frac{1}{j}$. Therefore, their contribution to the sum of exponents is less than $3(h - 2)$. They replace the exponents of the runs with centers in the interval $J$ and periods between $\ell$ and $\frac{9}{4}\ell$ which otherwise would contribute at least $6h$ to the sum of exponents. The run with the longest period, $x_h^{\alpha_h}$, can have an arbitrarily high exponent but the replaced runs in $J$ need to account only for a fraction (3 units) of it since $\alpha_h \ge 3$ implies new centers with multiple runs and hence new $J$ intervals (precisely $\lfloor \alpha_h - 2 \rfloor$) that account for the rest. We proved

Theorem 4.1. The sum of exponents of the runs in a string of length $n$ is $O(n)$.

References

[1] M. Crochemore, An optimal algorithm for computing the repetitions in a string, Inform. Proc. Letters 12 (1981) 244-250.
[2] M. Crochemore and L. Ilie, Maximal repetitions in strings, Journal of Computer and System Sciences, 2007. In press.
[3] M. Crochemore and W. Rytter, Squares, cubes, and time-space efficient string searching, Algorithmica 13 (1995) 405-425.
[4] C.S. Iliopoulos, D. Moore, W.F. Smyth, A characterization of the squares in a Fibonacci string, Theoret. Comput. Sci. 172 (1997) 281-291.
[5] R. Kolpakov and G. Kucherov, On the sum of exponents of maximal repetitions in a word, Tech. Report 99-R-034, LORIA, 1999.
[6] R. Kolpakov and G. Kucherov, Finding maximal repetitions in a word in linear time, Proc. of FOCS'99, IEEE Computer Society Press, 1999, 596-604.
[7] R. Kolpakov and G. Kucherov, On maximal repetitions in words, J. Discrete Algorithms 1(1) (2000) 159-186.
[8] M. Lothaire, Algebraic Combinatorics on Words, Cambridge Univ. Press, 2002.
[9] M.G. Main, Detecting leftmost maximal periodicities, Discrete Applied Math. 25 (1989) 145-153.
[10] W. Rytter, The number of runs in a string: improved analysis of the linear upper bound, in: B. Durand and W. Thomas (eds.), Proc. of STACS'06, Lecture Notes in Comput. Sci. 3884, Springer-Verlag, Berlin, 2006, 184-195.

This work is licensed under the Creative Commons Attribution-NoDerivs License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nd/3.0/.