Constructions for Clumps Statistics

Discr ete Mathematics and Theoretical Computer Science DMTCS vol. (subm.) , by the authors, 1–1 Constr uctions f or Cl umps Statistics F . Bass ino 1 , J. Cl ´ ement 2 , J. F ayolle 3 , and P . Nicod ` eme 4 1 IGM, Universit ´ e de Marne la V all ´ ee, 77454 Marne-la-V all ´ ee Cedex 2, F rance . Frederique.Bassino@univ-ml v.fr 2 GREYC, CNRS-UMR 6072, Universit ´ e de Caen, 14032 Caen, F rance . Julien.Clement@inf o.unicaen.fr 3 LRI; Univ . P aris-Su d, CNR S ; B ˆ at 490, 91405 Orsa y , F ranc e. Julien.Fayolle@lri.f r 4 LIX, CNRS-UMR 7161, ´ Ecole polytechn ique, 91128, P alaiseau, F rance . nicodeme@lix. polytechnique.fr W e consider a componen t of the word statistics known as clump; starting from a ﬁnite set of words, clumps are maximal o verlapping sets of these o ccurrences. This p arameter has ﬁrst been studied by Schbath [22] with the aim of counting the number of occurrences of words in random texts. Later work with similar probabilistic approach used the Chen-Stein ap proximation for a co mpound Poisson distribution , where the numb er of clumps follo ws a law close to Poisson. Pr esently there is no combinatorial counterpart to this approach, and we ﬁl l the gap here. W e emph asize the fact that, in contrast with the probabilistic approa ch which only provides asymptotic results, the comb inatorial approach provides exa ct results that are useful when considering short sequences. 1 Introdu ction Counting word s and m otifs in random texts h as pr ovided e xten ded studies with theoretica l and p ractical reasons. Mu ch of the present comb inatorial research has b uilt over the work of Guibas and Odlyzko [10, 11] who deﬁned the autocorrelation polynomial o f a word . As an a pparently surprising consequence o f their work, t he waiting time for the ﬁrst occurrenc e of the word 111 in a Bernoulli string with probability 1 / 2 for zeroes and ones is larger tha n the waiting time fo r th e ﬁrst occurren ce of the word 1 00 . This is due to the fact that the words 111 o ccur by clumps of o nes, the probability of extending a clu mp by one position being 1 / 2 ; this implies that the a verage number of 111 in a clump is larger than one; in contrast, there is only one 100 in each clump o f 100 . Since the pro bability that the word 1 11 and the word 1 00 start at a given position both are 1 / 8 , the intera rri val ti me of clump s of 111 is larger than the interarr i val time of clumps of 100 . W e analyze in this article sev eral statictics c onnected to clu mps of one word or of a redu ced set o f words. Our approach is based on properties of th e R ´ egnier-Szpankowski [18] deco mposition of languages along occu rrences of the considered word or s et of words and on pro perties o f the preﬁx codes gener ating the clu mps. W e provide explicit generating func tions in the Bernoulli mo del for statistics such a s (i) the number o f clump s, ( ii) th e num ber of k -clu mps, (iii) the num ber of p ositions of th e texts covered by clumps, a nd (iv) th e size of clumps in in ﬁnite texts; th ese results may b e extended to a Markov model, providing some technicalities. W e consider also in the Berno ulli model an algor ithmic app roach where we co nstruct d eterministic ﬁnite au tomatas recog nizing clump s. This ap proach extends directly to the Markov model, and we obtain as a direct con sequence a Gau ssian limit law fo r the numb er of clumps in random texts. Consider a rough ﬁrst appro ximation for clump s of on e word. If the pro bability occurren ce of a w or d w is small, the pro bability of clumps K of this word is sm all. Th is implies tha t the num ber of clum ps in texts o f size n fo llo ws a Poisson law of param eter λ = n × P ( a clump starts at position i ) , wh ere i is a r andom position . Approxim ating fu rther , the rando m n umber of occ urrences Ω of th e word w in a clump fo llows a geo metric law with param eter ω , wher e ω is the pr obability of self-overlap of the word . Schbath and Reinert [ 19] obtained in the Markov case of any order a cou mpound Poisson limit law for the count of number o f occu rrences by the Chein -Stein meth od. See Reinert e t al. [20] for a revie w a nd subm. to DMTCS c  by the a uthors Discrete Mathematics and Theoretical Computer Science (DMTCS), Nancy , France 2 F . Bassino, J. Cl ´ ement, J. F ayolle, and P . Nicod ` eme Barbour et al. [1] f or an e xten si ve introduction to the Poisson ap proximation . Schb ath [22] give the ﬁrst moment o f th e n umber of k -clumps an d of the number of clump s in Bernoulli texts. Recently , Stefanov et al. [24] use a stopping time method to comp ute the distribution of clumps; their results are not explicit and practical application of their method requir es the in version of a probability generating function. W e describ e in Section 2 o ur notation s and the R ´ egnier-Szpankowski language d ecomposition. Sec- tion 3.2 and Section 3.3 respecti vely provide our analysis in the case of counting clumps and k -clumps of one word and of a ﬁnite set o f words. W e prove by an a utomaton construction a nor mal limit law for the number of clumps in Section 5 2 Preliminar ies W e con sider a ﬁnite alp habet A . Unless explicitely stated when considering a Markov source, the texts are generated by a no n-unifo rm Bernoulli source over th e alphabet A . Gi ven a set of words, clump s of these words may be seen as a generalization of runs of one letter . Clumps and k -c lumps. When considerin g a reduced set of w ords U = { u 1 , . . . , u r } where each word u i has size at least 2 , a clump is a maximal set of occurr ences of words of U such that • any two consecu ti ve letters of the clump belong to (is a factor of) at least one occurren ce, • either th e clump is com posed of a single occu rrence th at overlaps no other o ccurrences, or each occurre nce overlaps at least one other occurr ence. This deﬁnition natura lly applies also to the case where U is com posed of a single w ord. As example, co nsidering the set U = { aba, bb a } and the text T = bbbabab abababbbb abaababb , we have T = bbbabababa b a bbbbaba ababb where the clu mps are underlined . The word bbab abababa beginning at the second position of th e text is a clump, an d so are th e word s bbaba an d aba beginning at the 1 5th an d 20 th positions. On the c ontrary , the word ababa beginning at the sixth position is not a clum p since it is n ot maximal; neither is a clump the word bbabaaba beginnin g at the 15th po sition, since its two-letters factor aa is neither a factor of an occurre nce of aba nor of an occur rence of bba . More forma lly , w e use as an intermediate step clusters , following Gould en and Jackson [9]. Deﬁnition 1 (Clumps) A clustering-word for the set U = { u 1 , . . . , u r } is a wor d w ∈ A ∗ such that any two c onsecutive position s in w are covered by th e same occurr ence in w of a wor d u ∈ U . The p osition i o f the wor d w is covered by a wor d u if u = w [( j − | u | + 1) . . . j ] for some j ∈ {| u | , . . . , n } and j − | u | + 1 ≤ i ≤ j . A cluster of a clustering-word w in K U is a set of occurrence positions subsets { S u ⊂ Occ( u , w ) | u ∈ U } which covers e xactly w , that is, every two consecutive positions i and i + 1 in w ar e cover ed by at least one same occurr ence of some u ∈ U . Mor e formally ∀ i ∈ { 1 , . . . , | w | − 1 } ∃ u ∈ U , ∃ p ∈ S u such that p − | u | + 1 < i + 1 ≤ p. A clum p , generically denoted here by K is a maximal cluster in th e sense tha t ther e exists no occurrence of the set U that overlaps the corr espond ing clustering wor d without being a factor of it. Note that a single word is a cluster and that, as mentionn ed pr e viou sly , a clump may be com posed o f a single word. A k -clump of occurre nces of U (den oted by K ( k ) ) is a clump contain ing exactly k o ccurrences of U . W e aim here at providing explicit analytic formu las f or the moments of the number of clumps, the total size of text covered by clumps or the number of clumps with exactly k occu rrences. Constructions for Clumps Statistics 3 Notatio ns. W e consider the residual languag e D = L .w − as D = { x, x.w ∈ L} . In case of ambiguity , we will use a bracket notatio n {L} ( z , . . . ) to represen t the genera ting function of the language {L} ; in particular, for D = L .w − , we write {L .w − } ( z , . . . ) = D ( z , . . . ) . Considering two languag es L 1 and L 2 , if we h a ve L 1 ⊂ L 2 , we write L 2 − L 1 = L 2 \ L 1 as the difference of sets; Reduced set of words. A set of words U = { u 1 , . . . , u r } is reduced if no u i is factor of a u j with i different of j . A utocorrelations, correlations a nd rig ht extension sets of words. W e recall here the deﬁn ition of Right Extension Set introdu ced in Bassino et al. [2]. The right extension set of a pair of words ( h 1 , h 2 ) is E h 1 ,h 2 = { e | ther e e xists e ′ ∈ A + such that h 1 e = e ′ h 2 with 0 < | e | < | h 2 |} . If the word h 1 is not factor of h 2 this extension set of h 1 to h 2 is the usual corre lation set of h 1 and h 2 When we have h 1 = h 2 , we get the a utocorrelation set C h,h of th e word h that we will note fur ther C when there is no ambiguity . W e also no te C ◦ = C − ǫ . Remark that C ◦ is empty if the word w has n o autocorrelation. W e remark here tha t the em pty word ǫ b elongs to the autocorre lation set of a word . N ote also that the correlation set of two words may be empty . W e have as e xamp les C aabaa,aab = { b, a b } , C ababa,ababa = { ǫ , ba, b a ba } . Generating functio ns. W e aim at compu ting th e numb er o f a given objec t in rand om texts by use of generating function s such as L v ( z , x ) = X T ∈L P ( T ) z | T | x | T | v = X l n,i x i z n (1) where | T | v is the number of occurr ences of the object v in the te xt T and l n,i is the probability that a text of size n h as i occur rences of this ob ject. This extends natu rally for coun ting more than o ne object by considerin g multiv ariate generating functions with se veral parameters. If the random variable X n counts the number of objects in a text of size n , we get fro m Equation (1) E ( X n ) = [ z n ] ∂ L ( z , x ) ∂ x     x =1 , E ( X 2 n ) = [ z n ] ∂ ∂ x x ∂ L ( z , x ) ∂ x     x =1 . Recovering exactly or asymptotically these moments follows then from classical method s. 3 F or mal language approach 3.1 R ´ egnier and Szpank owski decomposition Since our work extends the formal language approach of R ´ egnier and Szpankows ki [18], we recall it here. Considering one word w , R ´ egnier and Szp anko wski use a na tural p arsing o r de composition of texts with at least one occurr ence of w , where • there is a ﬁrst occurr ence at the right extremity of a “subtext”, th e set of which constitute a Right languag e, • possibly fo llo wed by other oc currences, that a re separ ated b y “sub texts” that constitute the Minimal languag e, 4 F . Bassino, J. Cl ´ ement, J. F ayolle, and P . Nicod ` eme • and completed by “subtexts” that provide no other occurre nces. Moreover , there is a language witho ut any match of the considered word w . R ´ egnier [17], fu rther extended this appro ach to a reduce d set of words. W e follow here the book of Lothaire [15](Chapter 7) which presents their method. W e consider a set of words V = { v 1 , . . . , v r } . W e h a ve, forma lly Deﬁnition 2 Right, Minimal, Ultimate and Not languages • The :”Right” langu age R i associated to the wor d v i is the set of wor ds R i = { r | r = e.v i and 6 ∃ e ′ ∈ V , r = xe ′ y , | y | > 0 } . • The “Minimal” language M ij leading fr om a wor d v i to a wor d v j is the set of wor ds M ij = { m | v i .m = e.v j and 6 ∃ e ′ ∈ V , v i .m = xe ′ y , | x | > 0 , | y | > 0 } . • The “Ultimate” langu age completing a te xt after an occurrence of the wor d v i is the set of wor ds U i = { u | 6 ∃ e ∈ V , v i .u = xey , | x | > 0 } . • The “Not” language completing a te xt after an occurr ence of the wor d v i is the set of wor ds N = { n | 6 ∃ e ∈ V , n = xey } . The no tations R , M , U and N refer h ere to th e Right, Min imal, Ultimate a nd Not lan guages of a single word. Considering as examp le th e word w = ababa ; in the following texts, the underlined words be long to th e set M ; th e overlined text do es n ot since the word represented in bo ld faces is an inter mediate occurre nce. ababaaaaaab aba ab ababa bbbbb bbb abababa. Considering the matrix M such that M ij = M ij , we have S k ≥ 1  M k  i,j = A ⋆ · w j + C ij − δ ij ǫ, U i · A = [ j M ij + U i − ǫ, (2) A · R j − ( R j − w j ) = S i w i M ij , N · w j = R j + [ i R i ( C ij − δ ij ǫ ) . (3) If the size of the te xts is counted by the variable z and the occurrences of the words v 1 , . . . , v r are coun ted respectively by x 1 , . . . , x r , we get the matrix equation F ( z , x 1 , . . . , x r ) = N ( z ) + ( x 1 R 1 ( z ) , . . . , x r R r ( z ))  I − M ( z , x 1 , . . . , x r )  − 1    U 1 ( z ) . . . U r ( z )    . (4) In this last equa tion, we have M ij ( z , x 1 , . . . , x r ) = x j M ij ( z ) a nd the gener ating functions R i ( z ) , M ij ( z ) , U j ( z ) and N ( z ) ca n be computed explicitly from the s et of Eq uations (2, 3). In particu lar , when con sidering the Bern oulli weighted case A ( z ) = z an d a single word w with π w = P ( w ) , we have the set of equations R ( z ) = π w z | w | D ( z ) , M ( z ) = 1 + z − 1 D ( z ) , U ( z ) = 1 D ( z ) , N ( z ) = C ( z ) D ( z ) „ 1 D ( z ) = 1 π w z | w | + ( 1 − z ) C ( z ) « (5) Constructions for Clumps Statistics 5 A ⋆ = N + RM ⋆ U = ⇒ F ( z , x ) = 1 1 − z + π w z | w | 1 − x x + (1 − x ) C ( z ) = X n,k f n,k x k z n . (6) In this last equation , f n,k is the prob ability that a text of size n has k occu rrences of w . 3.2 Clump analysis f or one word The d ecomposition of R ´ egnier and Szpankowski is based o n a parsing by th e occu rrences of the consid- ered words. W e use a similar app roach, but parse with respect to the occu rrences of clump s. As a m ajor difference, when they consider the minimal languag e separating two occur rences, these two occurr ences may overlap; in contrast, by deﬁnitio n, overlapping of clumps is forbidden. A c lump of th e word w is basically d eﬁned as w C ⋆ , since any elemen t of C ◦ concatanate d to a c luster extends this cluster . Considering the word w = aaa , we h a ve C = { ǫ, a, aa } and C ⋆ is am biguous. W e can however generate unambig uously C ⋆ as described in the next section. 3.2.1 A preﬁx code K to gener ate unambiguously C ⋆ Since C ◦ is a ﬁnite language, it is possible to ﬁnd a preﬁx co de K gen erating C ◦ ; moreover, fo r c 1 , c 2 ∈ C − ǫ and | c 1 | < | c 2 | , the word c 1 is a proper sufﬁx of c 2 . Otherwise stated, the preﬁx code K = { κ 1 , . . . , κ k } is built ov er words q 1 , q 2 , . . . , q k and may be written as K = { q 1 , q 2 q 1 , . . . , q k q k − 1 . . . q 1 } . W e Refer to Ber stel and Perrin [4] for an intro duction to preﬁx co des. See also Berstel [3] for an analysis o f coun ts of words o f the pattern U by semaphore codes U − A ⋆ U A + . W e have the f ollowing lemma Lemma 1 The pr eﬁx code K = C ◦ \ C ◦ A + generates unamb igously the lang uage C ⋆ . Proof: It is clear that K is pr eﬁx. Con sider w ∈ C ◦ − K if this last set is n ot emp ty . Since K is a set of words o f C witho ut any pre ﬁx in C , we have a con trario w = u.v with u and v non- empty and in C . W e have | u | < | w | an d | v | < | w | ; if u or v does not be long to K , we may iter ate the pro cess on the c orrespond ing word. Since | w | is ﬁnite, after a ﬁnite num ber o f steps, we get to a deco mposition w = κ i 1 . . . κ i j where each κ i is in K . Since K is a code, the decomposition of each word of C over K is unique and so is the decompo sition of any w ord o f C ⋆ . ✷ Example 1 Let w = abaabaaba . W e ha ve abaabaaba | ǫ abaaba | aba aba | abaaba a | baabaaba = ⇒ C = { ǫ, ab a, ab aaba, baabaaba } = ⇒ K = { aba , b aabaaba } . The periods of a w ord w is th e set of integers {| h | , h ∈ C ◦ } ; the irr educible per iods is the subset of periods o f which all the periods may b e ded uced. As f ollows from Guibas and Odlyzko [1 0] and Ri vals an d Rahmann [21], when c onsidering the word abab accababa , the irred ucible perio ds a re 7 , 9 while the period 11 can be dedu ced from the perio ds 7 and 9 . Howe ver , we have here K = C = { ccababa, b a ccababa, babacca b aba } , which implies so mehow again st in tuition that, in gener al, there is no bijection between the irreduc ible periods and the preﬁx code of a word. 6 F . Bassino, J. Cl ´ ement, J. F ayolle, and P . Nicod ` eme Constructing the preﬁx-code K . W e use the follo wing algorithm : 1. start with the word w ; 2. shift w to th e right to the ﬁrst self-overlapping p osition; Let κ 1 be the trailing sufﬁx so obtain ed; insert it in a trie E ; 3. repeat shiftin g, obtainin g new trailing sufﬁxes; f or each new sufﬁx gen erated, try an insertion in the trie. If you reach a leaf, drop the sufﬁx; else where insert it. The worst case complexity for this construction is O ( | w | ) , but the a verage complexity is O ( |K| log( |K | )) , the av erage path length of a trie built over |K| keys. 3.2.2 The language decomposition Considering the word w = aaaaa , we have C = { a, aa, aa a , aaaa } and K = { a } . Mo reover , we hav e M = { a, b ( b + ab + aab + aaab + aaaab ) ⋆ aaaaa } . W e get here K ⊂ M and M − K = L w ; The languag e M and K are inde ed connected by a simple pr operty that we describe no w . Lemma 2 F or any wor d w with autocorrelation set C , pr eﬁx code K generating C ⋆ and minimal lan guage M , ther e exists a non-empty language L such that K ⊂ M an d M − K = L w . (7) Proof: W e have K ⊂ C and K ⊂ M ; th erefore, we have K ⊂ M ∩ C . W e p rove that if w ∈ C − K then w 6∈ M . Let us suppo se th at w 6 = ǫ and w ∈ C − K . Th is implies that w ∈ KA ⋆ by deﬁnition o f K . Therefo re, we have w = κv with κ ∈ K and | v | > 0 . As a con sequence, w cann ot belong to t he m inimal languag e M , the word κ correspond ing to a previous occurre nce of w . ✷ This leads immediately to the fundam ental lemma. Lemma 3 The basic equ ation for the comb inatorial decomposition of texts on the a lphabet A wher e v counts some object in the clump of a wor d w is A ⋆ v = N + R w − ( w C ⋆ ) v  ( M − K ) w − ( w C ⋆ ) v  ⋆ U , (8) Proof: The Equation (8) follows from the parsing • either there is no occurrence of w , the Not lan guage N , • or 1. we read until the ﬁrst occurrence : R w − w , 2. follo wed by any numb er of ov erlap ping occurren ces of w (a clump less the ﬁrst o ccurence): C ⋆ , 3. follo wed by any number of (a) next occurrence of w without overlap: ( M − K ) w − w (b) and any number of ov erlapp ing occurren ces of w : C ⋆ . ✷ W e can now use the preceeding lemma to count se veral parameters related to the clumps. Constructions for Clumps Statistics 7 3.2.3 Counting parame ters related to the clumps Let K ( z , x, t ) be th e generating func tion where the variable x counts the numbe r of occurrence of w in a clump, and th e variable t cou nts the size of the clumps; the v ariab le z is used here to count the total length of the texts. W e also use a variable u to count the number of clumps. W e have the follo wing theorem Theorem 1 In the weighted mo del such th at A ( z ) = z , th e generating fu nction cou nting the number o f occurr ences of a wor d w and the number of positions cover ed by the clumps of w veriﬁes F ( z , K ( z , x, t )) = N ( z ) + R ( z ) π w z | w | K ( z t, x ) 1 1 − M ( z ) − K ( z ) π w z | w | × K ( z t, x ) U ( z ) (9) wher e the generating function of the clumps veriﬁes K ( z , x, t ) = xπ w ( z t ) | w | 1 1 − x K ( z t ) (10) As a consequen ce, the generating functio n counting also the number of clumps is G ( z , x, t, u ) = F ( z , u K ( z , x, t )) . (11) Proof: Th is theore m fo llo ws f rom Lem ma (1) and fro m a direct translation of Equation (8 ) into generating function s. ✷ 3.2.4 Occurrences of clumps. Considering G ( z , u K ( z , 1 , 1 )) in Equ ation (9) and using Equatio n (10) provid es the generating function O ( γ ) ( z , u ) = X n,i o ( γ ) n,i u i z n = N ( z ) + u R ( z ) U ( z ) 1 − u M ( z ) + ( u − 1) K ( z ) (12) where o ( γ ) n,i is the pro bability of getting i clu mps (of any size) in a text of size n . Considering Γ n , th e expectation of number of clumps in te xts of size n , we ge t by differentiation X n Γ n z n = R ( z ) U ( z )(1 − K ( z )) (1 − M ( z )) 2 = π w z | w | (1 − K ( z )) (1 − z ) 2 . This implies th at Γ n = ( n − | w | + 1) π w (1 − K (1)) − π w K ′ (1) , to compare with the expectation ( n − | w | + 1) π w of the numerb of occurren ces of the word w . 3.2.5 Occurrences of k -clumps. By considerin g the equation of a clump of occur rences of w , we can write w C ⋆ = w + w K + w K 2 + . . . ( v − 1 + 1) w K k − 1 + . . . to count clumps with exactly k o ccurrences of w . Writing K ( k ) ( z , v ) the gener ating functio n which coun ts with the variable z the size of the clumps and where the variable v selects k -clump s, we have K ( k ) ( z , v ) = π w z | w |  1 1 − K ( z ) + ( v − 1) K ( z ) k − 1  Substituting this in Equation 9 giv es O ( γ k ) ( z , v ) = X o ( γ k ) n,i v i z n = N ( z ) + R ( z ) π w z | w | K ( k ) ( z , v ) 1 1 − M ( z ) − K ( z ) π w z | w | × K ( k ) ( z , v ) U ( z ) , where o ( γ k ) n,i is the probab ility that a text of size n contains e xactly i k -clump s. 8 F . Bassino, J. Cl ´ ement, J. F ayolle, and P . Nicod ` eme 3.2.6 Probab ility that a ra ndom position is covere d by a clump This follows from the kn owledge of the number of positions of the texts covered by the clumps. Let P n be the random v ariable counting the numbe r of positions cov ered by the clumps of a word w in texts of size n and H n be the probab ility that a random position is co vered by a clump in a text of s ize n . Let F ( z , t ) = G ( z , K ( z t, 1 )) where G ( z , K ) is given by Equation (9) be the generating functio n co unt- ing the size of the texts and the number of positions cov ered by clumps. W e h a ve H n = X i ≥ 0 i n P ( P n = i ) ⇐ ⇒ H n = [ z n ] ∂ ∂ t z Z z 0 F ( y , t ) dy     t =1 . 3.3 Clumps of a ﬁnite set of words W e provide in this sectio n a matricial solution f or cou nting clump s of a red uced ﬁnite set o f word s. For simplicity sake we consider a set of two words w 1 and w 2 but our approach is amenab le to any reduced ﬁnite set. Similarly to the one word case, w e ar e lead to consider preﬁx codes generating th e correlation o f two words. Writing C ⋆ ij with i 6 = j makes no sen se in terms of langua ge d ecomposition . Howe ver, we can write as previously K ij = C ij − C ij A + , wh ich deﬁnes a min imal correlation languag e with goo d proper ties. W e have as e xamp les Example 2 Let w 1 = aabaa and w 2 = aaa . W e have C 12 = { a, aa } and K 12 = { a } . In this case, we have C 12 = C 22 − { ǫ } and K 12 = K 22 . Example 3 Let w 1 = abab and w 2 = bab a . W e h ave C 12 = K 12 = { a, aba } . I n this ca se, we ha ve C 22 = { ǫ, b a } an d K 12 = a. K 22 . Follo wing a proo f similar to the proo f of Lemm a (2), there exists a language L such that M ij − K ij = L .w j . W e can ther efore wr ite a minimal correlation matrix K , co nsider the matrix S = K ⋆ and write a clump matrix G as follows K =  K 11 K 12 K 21 K 22  , S = K ⋆ , G =  w 1 S 11 w 1 S 12 w 2 S 21 w 2 S 22  (13) In this equation , G ij is a clump starting with the word w i and ﬁnishing with the w ord w j . W e ob tain now a fund amental matricial decompo sition that can be used for furthe r analy sis, A ⋆ = ( R 1 w − 1 , R 2 w − 2 ) G  ( M − K ) − G  ⋆  U 1 U 2  where we have ( M − K ) − ij = ( M ij − K ij ) w − j . 4 A utomato n ap proach For a set U = { u 1 , . . . , u r } , we build a kind o f “ Ah o-Corasick” au tomaton built on the fo llowing set of words X X = { u i · w | 1 ≤ i ≤ r and w ∈ { ǫ } ∪ E i,j for some j } . The automato n T is built on X with Q = Pref ( X ) (set of states), i = ǫ (in itial state). The tran sition function is deﬁned (as in Aho-Cor asick construction ) as δ ( p, x ) = the longest sufﬁx of px ∈ Pref ( X ) . Constructions for Clumps Statistics 9 In o rder to count the number of clumps (f or instance) the set of ﬁnal states T nee ds more a ttention: it is deﬁned as T = X \ X A + . This automaton accepts the languag e of words ending by the ﬁrst occurren ce of a word in a clump. W e can easily derive from this automaton the generating fu nction f ( z , x 1 , . . . , x r , t, u ) where x i marks an occurren ce of u i , t marks the numb er of clumps, and u the to tal length covered by the clump. Indeed, one has to mark some transitions in the adjacency matrix A acc ording to some simple rules. • T o co unt occurren ces of the u i ’ s, we ha ve to mark with th e formal variable x i transitions go ing to states A ∗ u i ∩ Pr e f ( X ) ( for 1 ≤ i ≤ r ). • For the nu mber of clump s, on can mark transition s going to states in U \ U A + = X \ X A + , that is states correspon ding to ﬁrst occurenc es inside a clump. • Finally , for the total length covered b y clumps. W e have to put a formal weigh t o n tran sitions going to a state p ∈ A ∗ U ∩ Pr ef ( X ) taking into account the num ber of symbols betwe en the last occurre nce of a word of X and the new one at the end of p . Let us deﬁne fo r a state p (cor responding to a word with a occurrence of some word of X at the en d) the fu nction ℓ ( p ) the maximal proper preﬁx q of p in A ∗ U if it exists or ǫ if ther e is n o such preﬁx. Then we must mark all tran sitions going to p with u | p |−| ℓ ( p ) | (if p ∈ A ∗ U ∩ Pref ( X ) ). Of course the construction d oes n ot gi ves a m inimal a utomaton. Howe ver th e au tomaton is co mplete and determin istic so that the translation to gener ating functio n is straightfor ward. Example 1. For one word U = { u = bababa } , and E u = { b a, ba ba } . The set X is X = { bababa, ba b ababa, bababa b aba } . + + + b a b a b axtu 6 b au 2 x b u 2 xa b b b b b a a a a a b N.B.: The sign ’ + ’ on the au tomaton indicates that the corresp onding preﬁx ends with some o c- curence of U . The dou ble oval states indicates the states where we k now we have en tered a new clump. 2. For the set U = { u 1 = aabaa, u 2 = baab } a nd the matrix of right extension sets i s E =  baa + ab aa b aa aab  . The set X is X = { aabaa, aabaab , aab aabaa, aabaaabaa, baab, baabaa, baabaab } . 10 F . Bassino, J. Cl ´ ement, J. F ayolle, and P . Nicod ` eme W e h a ve the following autom aton (with x and y mark ing occur ences re specti vely of u 1 and u 2 . The automaton is comp lete an d d eterministic. Ho wever , f or clarity’ s sake, all transitions labelled by a an d b en ding respecti vely on s tate A and B are omitted. As bef ore, the sign ’ + ’ indicates that the co rrespondin g preﬁx ( or , equ i valently , state) ends with som e occu rence of U . Th e dou ble ov al attribute indicates the state where we kn ow we ha ve entered a new clump. A + + + + a a b a atu 5 x a b a au 4 x by u a axu 2 B + + + b a a btu 4 y a axu 2 buy a buy buy a a a 5 Limit laws 5.1 Normal law A no rmal limit law for th e num ber of clumps U when U = O ( n ) in texts of size n f ollows from th e automaton construction of Section 4. A Perron -Frobenius property asserts the existence of a unique dom- inant eigen value of the positive system; app ly next a suitable Cauchy in tegral and large power The orem of Hwang [12, 13]; see [16] for details. 5.2 P o isson law f or rare word s In a Bernou lli model, if p and p are the minimal and maximal proba bility of letters of the alphabet, words of size l < log n log(1 /q ) have O ( n ) n umber of occurren ces in texts of size n with probability one. W e consider rare words with size over this threshold and n umber of occurrences O (1 ) . W e prove in this case a Po isson- like limit law . T aking a T aylor expansion of O ( γ ) ( z , u ) in E quation (12) at u = 0 , and conside ring th e k th T aylor coefﬁcient, with k = O (1) p rovide a rationa l generating function with respect to the variable z of the form H k ( z ) = [ u k ] O ( γ ) ( z , u ) = R ( z ) U ( z )( M ( z ) − K ( z )) k − 1 (1 − K ( z )) k = π w z | w |  z − 1 + (1 − K ( z )) D ( z )  k − 1 (1 − K ( z )) k ( D ( z )) k +1 . (14) W e follow Fayolle [6] to prove tha t the do minant root o f the d enominator of this last equatio n is the smallest and positive ro ot o f D ( z ) = π w z | w | + (1 − z ) C ( z ) ; (see Equ ation ( 5)). Let d be the smallest period of w . If d ≤ l / 2 classical results about periods on words provide C ( z ) = 1 + π u z | u | + · · · + ( π u z | u | ) r + S ( z ) f or a gi ven word u with | u | < l/ 2 , an d r ≥ 2 ; more over S ( z ) is a poly nomial of minimal degree at least l / 2 . Mo reover , we have K ( z ) = π u z | u | + R ( z ) where S ( z ) − R ( z ) is a p olynomial with positive coefﬁcients. This entails th at | S ( z ) | an d | R ( z ) | are o (1) f or | z | < 1 / p . Up to negligible terms, we get | C ( z ) | =     1 1 − π u z | u |     ≥ 1 1 + π u | z | | u | ≥ 1 1 + p | z | for | z | < 1 p . Constructions for Clumps Statistics 11 W e also have | 1 − K ( z ) | > 0 and π w z | w | = o (1) for | z | < 1 /p . The Rouch ´ e theorem in the d isk | z | < 1 /p the genera ting fu nction H k ( z ) has a single pole which is a smallest mo dulus r oot ρ of the equ ation D ( z ) = 0 . Per ron-Frob enius consider ations on the automaton cou nting the n umber of o ccurrences of w imply that this pole is real positive. A similar p roof follo ws when d > l / 2 . Writing D ( z ) = Q ( z )(1 − z /ρ ) and P ( z ) = z − 1 + (1 − K ( z )) D ( z ) we get as a ﬁrst approximation P ( O γ n = k ) ≈ π w ρ | w | Q ( ρ ) × 1 k !  ρP ( ρ ) × n (1 − K ( ρ )) Q ( ρ )  k × ρ − n . A similar behaviour has been observed for occurr ences of one word by R ´ egnier an d Szpanko wski [18]. 5.3 Length of the clumps in inﬁnite te xts Generating functio n of the size of the clumps in inﬁnite texts is a sum of geometric random v ariab les. 6 Conclusion An interesting application of this article would be a comb inatorial analysis of tandem repeats or multiple repeats that occur in ge nomes; large variations of such r epeats are characteristics of some genetic diseases. W ould i t be p ossible to extend ou r app roach to clumps of regular expressions? W e consider clumps of a regular expr ession ( i.e. contiguou s sets of positions suc h that each po sition is cov ered b y at least one word of th e associated regular languag e and such that leading and terminating p ositions of each occu rrence is covered by at least two occurren ces). In this case the star-height th eorem (CITE) inp lies that we cannot in gen eral ﬁn d a ﬁnite set of words w i and a ﬁnite set of preﬁx cod es K i with 1 ≤ i ≤ ℓ su ch that the languag e S 1 ≤ i ≤ n w i ( K i ) ⋆ describes the clumps. Ref ere nces [1] B A R B O U R , A . , H O L S T , L . , A N D J A N S O N , S . P oisson Appr oximation . Ox ford Univ ersity Press, 1992. [2] B A S S I N O , F. , C L ´ E M E N T , J . , F A YO L L E , J . , A N D N I C O D ` E M E , P . Counting occur rences for a ﬁn ite set of words: an inclusion-exclusion approach. I n Pr oceedings of the 2007 Confer ence on Ana lysis of Algorithms ( 2007), P . Jacquet, Ed., D MTCS, proc. AH, p p. 29–44. Procee dings of a colloq uium organized by Juan- les-Pins, F ran ce, June 2007. [3] B E R S T E L , J . Gro wth of re petition-free words - a revie w . T heo r etical Co mputer Scienc e , 34 0 (200 5), 280–2 90. [4] B E R S T E L , J . , A N D P E R R I N , D . Theory of Codes . Pure and Applied Mathematics. Academ ic Press, 1985. [5] C H O M S K Y , N . , A N D S C H ¨ U T Z E N B E R G E R , M . The algebr aic theor y of co ntext-free lang uages. Computer Pr ogr ammin g a nd F ormal Languages, (1963) , 1 18–161 . P . Braffort an d D. Hir schberg, eds, North Holland. [6] F A Y O L L E , J . An average-case an alysis of basic parameters o f the sufﬁx tree. In Mathematics a nd Computer Scien ce (2004), M. Drmota, P . Flajolet, D. Ga rdy , and B. Gittenberger, Eds., Birkh ¨ auser , pp. 21 7–227. Pro ceedings of a colloquium organized by TU W ien, V ienna, Austria, September 2004. [7] G A N T M A C H E R , F . The theory o f matrices. V ols. 1,2 . Encyclopedia of Mathematics. New Y ork: Chelsea Publishing Co. T ranslated by K. A. Hirsch, 195 9. 12 F . Bassino, J. Cl ´ ement, J. F ayolle, and P . Nicod ` eme [8] G O U L D E N , I . , A N D J AC K S O N , D . An in version theorem for clusters dec ompositions of sequenc es with distinguished subsequen ces. J. London Math. Soc. 2 , 20 (1979), 567–576. [9] G O U L D E N , I . , A N D J AC K S O N , D . Combinato rial Enumeration . John W iley , New Y ork, 1983. [10] G U I BA S , L . , A N D O D L Y Z K O , A . Per iods in strings. J. Combin. Theory A , 30 (1981), 19–42. [11] G U I BA S , L . , A N D O D LY Z K O , A . Strings overlaps, pattern matching, and non-tran siti ve games. J. Combin. Theory A , 30 (1981) , 108–2 03. [12] H W A N G , H . - K . Th ´ eor ` emes limites pour les structur es combina toir es et les fonctions arithm ´ etiques . PhD thesis, ´ Ecole polytech nique, Palaiseau, France, Dec. 1994. [13] H W A N G , H . - K . Large deviations for combina torial distrib utions I: Central limit theo rems. A nn in Appl. Pr obab. 6 (1 996), 297–319. [14] J A C Q U E T , P . , A N D S Z PA N K O W S K I , W . Autocorr elation on words and its applications. Analysis of Sufﬁx Trees by String Ruler Appro ach. J. Combin. Theory A , 66 (1994), 237–26 9. [15] L OT H A I R E , M . App lied Combinatorics on W or ds . Encycloped ia of Mathem atics. Cambridge Uni- versity Press, 2005. [16] N I C O D ` E M E , P . , S A LV Y , B . , A N D F L A J O L E T , P . Motif statistics. Theo r etical Computer Scien ce 287 , 2 (2002) , 593–6 18. [17] R ´ E G N I E R , M . A uniﬁed ap proach to word occurrenc es p robabilities. Discr ete Applied Mathematics 104 , 1 (2000) , 259–2 80. Special issue on Computatio nal Biology . [18] R ´ E G N I E R , M . , A N D S Z PA N K O W S K I , W . On pattern freq uency occu rrences in a markovian se- quence? A lgorithmica 22 , 4 (1998), 631–649. This paper was presented in part at the 1997 Interna- tional Symposium on Inform ation Theory , Ulm, Germany . [19] R E I N E RT , G . , A N D S C H BAT H . Compou nd poisson and poisson proc ess approximatio ns for o ccur- rences of multiple words in markov chains. J. Comp. Biol. 5 (1998) , 223–2 53. [20] R E I N E RT , G . , S C H BAT H , S . , A N D W A T E R M A N , M . Probabilistic and statistical pro perties of words: an overview . J. Comp. Biol. 7 (2000), 1–46. [21] R I V A L S , E . , A N D R A H M A N N , S . Combin atorics of periods in strings. Journal of Combina torial Theory - Series A 104 , 1 (2003 ), 95–1 13. [22] S C H B A T H , S . Compoun d poisson appro ximation of word co unts in DNA sequences. E SAIM Pr obab. Statist. 1 (1995) , 1–16. [23] S E D G E W I C K , R . , A N D F L A J O L E T , P . An Intr oduction to the Analysis of Algorithm s . Addison- W esley Publishing Company , 1996. [24] S T E FA N O V , V . , R O B I N , S . , A N D S C H BAT H , S . W aiting times for clumps of p atterns and for structured motifs in random sequences. Discrete Applied Mathematics 155 (2 007), 868 –880. [25] S Z PA N K O W S K I , W . A verage Case Analysis of Algorithms on Sequences . Series in Discrete Mathe- matics and Optimization. John W iley & Sons, 2001.

Constructions for Clumps Statistics

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment