Computing q-gram Non-overlapping Frequencies on SLP Compressed Texts

Keisuke Goto, Hideo Bannai, Shunsuke Inenaga, and Masayuki Takeda

Department of Informatics, Kyushu University
744 Motooka, Nishiku, Fukuoka 819–0395, Japan
{keisuke.gotou,bannai,inenaga,takeda}@inf.kyushu-u.ac.jp

Abstract. Length-$q$ substrings, or $q$-grams, can represent important characteristics of text data, and determining the frequencies of all $q$-grams contained in the data is an important problem with many applications in the field of data mining and machine learning. In this paper, we consider the problem of calculating the non-overlapping frequencies of all $q$-grams in a text given in compressed form, namely, as a straight line program (SLP). We show that the problem can be solved in $O(q^2 n)$ time and $O(qn)$ space, where $n$ is the size of the SLP. This generalizes and greatly improves previous work (Inenaga & Bannai, 2009), which solved the problem only for $q = 2$ in $O(n^4 \log n)$ time and $O(n^3)$ space.

1 Introduction

In many situations, large-scale text data is first compressed for storage, and then is usually decompressed when it is processed afterwards, where we must again face the size of the data. To circumvent this problem, algorithms that work directly on the compressed representation without explicit decompression have gained attention, especially for the string pattern matching problem [1], and there has been growing interest in what problems can be efficiently solved in this kind of setting [14, 17, 7, 16, 8, 6, 4].

The non-overlapping occurrence frequency of a string $P$ in a text string $T$ is defined as the maximum number of non-overlapping occurrences of $P$ in $T$ [3]. Non-overlapping frequencies are required in several grammar based compression algorithms [13, 2], as well as ...
In this paper, we consider the problem of computing the non-overlapping occurrence frequencies of all $q$-grams (length-$q$ substrings) occurring in a text $T$, when the text is given as a straight line program (SLP) [10] of size $n$. An SLP is a context free grammar in the Chomsky normal form that derives a single string. SLPs are a widely accepted abstract model of various text compression schemes, since texts compressed by any grammar-based compression algorithm (e.g. [18, 13]) can be represented as SLPs, and those compressed by the LZ-family (e.g. [19, 20]) can be quickly transformed to SLPs. Theoretically, the length $N$ of the text represented by an SLP of size $n$ can be as large as $O(2^n)$, and therefore a polynomial time algorithm that runs on an SLP representation is, in the worst case, faster than any algorithm which works on the uncompressed string.

For SLP compressed texts, the problem was first considered in [8], where an algorithm for $q = 2$ running in $O(n^4 \log n)$ time and $O(n^3)$ space was presented. However, the algorithm cannot be readily extended to handle $q > 2$. Intuitively, the problem for $q = 2$ is much easier compared to larger values of $q$, since there is only one way for a 2-gram to overlap, while there can be many ways that a longer $q$-gram can overlap. In this paper we present the first algorithm for calculating the non-overlapping occurrence frequency of all $q$-grams that works for any $q \ge 2$, and runs in $O(q^2 n)$ time and $O(qn)$ space. Not only do we solve a more general problem, but the complexity is greatly improved compared to previous work.

A similar problem for SLPs, where occurrences of $q$-grams are allowed to overlap, was also considered in [8], where an $O(|\Sigma|^2 n^2)$ time and $O(n^2)$ space algorithm was presented for $q = 2$.
A much simpler and more efficient $O(qn)$ time and space algorithm for general $q \ge 2$ was recently developed [6]. As is the case with uncompressed strings, ideas from the algorithms allowing overlapping occurrences can be applied somewhat to the problem of obtaining non-overlapping occurrence frequencies. However, there are still difficulties arising from the overlapping of occurrences that must be overcome: the occurrences of each $q$-gram can be obtained in the same way, but we must somehow compute their non-overlapping occurrence frequency, which is not a trivial task.

For uncompressed texts, the problem considered in this paper can be solved in $O(|T|)$ time by applying string indices such as suffix arrays. A similar problem is the string statistics problem [3], which asks for the non-overlapping occurrence frequency of a given string $P$ in a text string $T$. The problem can be solved in $O(|P|)$ time for any $P$, provided that the text is pre-processed in $O(|T| \log |T|)$ time using the sophisticated algorithm of [5]. However, note that the preprocessing requires only $O(|T|)$ time if occurrences are allowed to overlap. This perhaps indicates the intrinsic difficulty that arises when considering overlaps.

2 Preliminaries

2.1 Notation

Let $\Sigma$ be a finite alphabet. An element of $\Sigma^*$ is called a string. The length of a string $T$ is denoted by $|T|$. The empty string $\varepsilon$ is a string of length 0, namely, $|\varepsilon| = 0$. A string of length $q > 0$ is called a $q$-gram. The set of $q$-grams is denoted by $\Sigma^q$. For a string $T = XYZ$, $X$, $Y$ and $Z$ are called a prefix, substring, and suffix of $T$, respectively. The $i$-th character of a string $T$ is denoted by $T[i]$ for $1 \le i \le |T|$, and the substring of a string $T$ that begins at position $i$ and ends at position $j$ is denoted by $T[i:j]$ for $1 \le i \le j \le |T|$. For convenience, let $T[i:j] = \varepsilon$ if $j < i$.
Let $T^R$ denote the reversal of $T$, namely, $T^R = T[N]\,T[N-1] \cdots T[1]$, where $N = |T|$. For an integer $i$ and a set of integers $A$, let $i \oplus A = \{ i + x \mid x \in A \}$ and $i \ominus A = \{ i - x \mid x \in A \}$. If $A = \emptyset$, then let $i \oplus A = i \ominus A = \emptyset$. Similarly, for a pair of integers $(x, y)$, let $i \oplus (x, y) = (i + x, i + y)$.

2.2 Occurrences and Frequencies

For any strings $T$ and $P$, let $\mathit{Occ}(T, P)$ be the set of occurrences of $P$ in $T$, i.e., $\mathit{Occ}(T, P) = \{ k > 0 \mid T[k : k + |P| - 1] = P \}$. The number of occurrences of $P$ in $T$, or the frequency of $P$ in $T$, is $|\mathit{Occ}(T, P)|$. Any two occurrences $k_1, k_2 \in \mathit{Occ}(T, P)$ with $k_1 < k_2$ are said to be overlapping if $k_1 + |P| - 1 \ge k_2$. Otherwise, they are said to be non-overlapping. The non-overlapping frequency $\mathit{nOcc}(T, P)$ of $P$ in $T$ is defined as the size of a largest subset of $\mathit{Occ}(T, P)$ where any two occurrences in the set are non-overlapping.

For any strings $X$, $Y$, we say that an occurrence $i$ of a string $Z$ in $XY$, with $|Z| \ge 2$, crosses $X$ and $Y$ if $i \in [\,|X| - |Z| + 2 : |X|\,] \cap \mathit{Occ}(XY, Z)$.

For any strings $T$ and $P$, we define the sets of right and left priority non-overlapping occurrences of $P$ in $T$, respectively, as follows:

$$\mathit{RnOcc}(T, P) = \begin{cases} \emptyset & \text{if } \mathit{Occ}(T, P) = \emptyset, \\ \{ i \} \cup \mathit{RnOcc}(T[1 : i - 1], P) & \text{otherwise,} \end{cases}$$

$$\mathit{LnOcc}(T, P) = \begin{cases} \emptyset & \text{if } \mathit{Occ}(T, P) = \emptyset, \\ \{ j \} \cup (j + |P| - 1) \oplus \mathit{LnOcc}(T[j + |P| : |T|], P) & \text{otherwise,} \end{cases}$$

where $i = \max \mathit{Occ}(T, P)$ and $j = \min \mathit{Occ}(T, P)$. For all $k \in \mathit{RnOcc}(T, P)$, it trivially holds that $\mathit{RnOcc}(T[k : |T|], P) \subseteq \mathit{RnOcc}(T, P)$, where positions in $T[k : |T|]$ are identified with the corresponding positions in $T$ (that is, shifted by $k - 1$). A similar statement holds for $\mathit{LnOcc}$. Note that $\mathit{RnOcc}(T, P) \subseteq \mathit{Occ}(T, P)$, $\mathit{LnOcc}(T, P) \subseteq \mathit{Occ}(T, P)$, and $\mathit{LnOcc}(T, P) = (|T| - |P| + 2) \ominus \mathit{RnOcc}(T^R, P^R)$.

Lemma 1. $\mathit{nOcc}(T, P) = |\mathit{RnOcc}(T, P)| = |\mathit{LnOcc}(T, P)|$.

Proof. See Appendix.

Lemma 2.
For any strings $T$ and $P$, and any integer $i$ with $1 \le i \le |T|$, let $u_1 = \max \mathit{LnOcc}(T[1 : i - 1], P) + |P| - 1$ and $u_2 = i - 1 + \min \mathit{RnOcc}(T[i : |T|], P)$. Then

$$\mathit{nOcc}(T, P) = |\mathit{LnOcc}(T[1 : u_1], P)| + \mathit{nOcc}(T[u_1 + 1 : u_2 - 1], P) + |\mathit{RnOcc}(T[u_2 : |T|], P)|.$$

Proof. By Lemma 1 and the definitions of $u_1$, $u_2$, $\mathit{LnOcc}$ and $\mathit{RnOcc}$, we have

$$\begin{aligned}
\mathit{nOcc}(T, P) &= |\mathit{LnOcc}(T[1 : u_1], P)| + |\mathit{LnOcc}(T[u_1 + 1 : |T|], P)| \\
&= |\mathit{LnOcc}(T[1 : u_1], P)| + |\mathit{RnOcc}(T[u_1 + 1 : |T|], P)| \\
&= |\mathit{LnOcc}(T[1 : u_1], P)| + |\mathit{RnOcc}(T[u_1 + 1 : u_2 - 1], P)| + |\mathit{RnOcc}(T[u_2 : |T|], P)| \\
&= |\mathit{LnOcc}(T[1 : u_1], P)| + \mathit{nOcc}(T[u_1 + 1 : u_2 - 1], P) + |\mathit{RnOcc}(T[u_2 : |T|], P)|.
\end{aligned}$$
⊓⊔

We will later make use of the solution to the following problem, where occurrences of $q$-grams are weighted and allowed to overlap.

Problem 1 (weighted overlapping $q$-gram frequencies). Given a string $T$, an integer $q$, and an integer array $w$ ($|w| = |T|$), compute $\sum_{i \in \mathit{Occ}(T, P)} w[i]$ for all $q$-grams $P \in \Sigma^q$ where $\mathit{Occ}(T, P) \neq \emptyset$.

Theorem 1 ([6]). Problem 1 can be solved in $O(|T|)$ time.

Proof. See Appendix.

2.3 Straight Line Programs

In this paper, we treat strings described in terms of straight line programs (SLPs). A straight line program $\mathcal{T}$ is a sequence of assignments $\{ X_1 = \mathit{expr}_1, X_2 = \mathit{expr}_2, \ldots, X_n = \mathit{expr}_n \}$. Each $X_i$ is a variable and each $\mathit{expr}_i$ is an expression where $\mathit{expr}_i = a$ ($a \in \Sigma$), or $\mathit{expr}_i = X_\ell X_r$ ($\ell, r < i$). We will sometimes abuse notation and denote $\mathcal{T}$ as $\{ X_i \}_{i=1}^{n}$. Denote by $T$ the string derived from the last variable $X_n$ of the program $\mathcal{T}$. Fig. 1 shows an example of an SLP. The size of the program $\mathcal{T}$ is the number $n$ of assignments in $\mathcal{T}$.
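As a concrete illustration of this definition, the following sketch encodes an SLP as a Python list and expands it into the string it derives. The list encoding and the name `expand` are assumptions made here for illustration and are not notation from the paper; the rules are those of the example SLP of Fig. 1. Explicit expansion can take time exponential in $n$, so it is meant only for small inputs.

```python
# Hypothetical encoding (not from the paper): rule i (1-based) is either a
# single character, or a pair (l, r) of indices of earlier variables.
FIG1_SLP = ["a", "b", (1, 2), (1, 3), (3, 4), (4, 5), (6, 5)]


def expand(slp):
    """Return the string derived from the last variable of the SLP."""
    val = [None] * (len(slp) + 1)  # val[i] is the string derived from X_i
    for i, rule in enumerate(slp, start=1):
        if isinstance(rule, str):
            val[i] = rule  # X_i = a
        else:
            left, right = rule  # X_i = X_left X_right
            if not (1 <= left < i and 1 <= right < i):
                raise ValueError("a rule must refer to earlier variables")
            val[i] = val[left] + val[right]
    return val[len(slp)]


print(expand(FIG1_SLP))  # prints aababaababaab
```

Storing every intermediate string keeps the sketch short; a practical implementation would avoid materializing them.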
[Figure 1: derivation tree, with leaf positions 1 to 13 (image not reproduced).]

Fig. 1. The derivation tree of SLP $\mathcal{T} = \{ X_1 = \mathtt{a}, X_2 = \mathtt{b}, X_3 = X_1 X_2, X_4 = X_1 X_3, X_5 = X_3 X_4, X_6 = X_4 X_5, X_7 = X_6 X_5 \}$, which represents string $T = \mathit{val}(X_7) = \mathtt{aababaababaab}$.

Let $\mathit{val}(X_i)$ represent the string derived from $X_i$. When it is not confusing, we identify a variable $X_i$ with $\mathit{val}(X_i)$. Then, $|X_i|$ denotes the length of the string $X_i$ derives, and $X_i[j] = \mathit{val}(X_i)[j]$, $X_i[j : k] = \mathit{val}(X_i)[j : k]$ for $1 \le j, k \le |X_i|$. Let $\mathit{vOcc}(X_i)$ denote the number of times a variable $X_i$ occurs in the derivation of $T$. For example, $\mathit{vOcc}(X_4) = 3$ in Fig. 1. Both $|X_i|$ and $\mathit{vOcc}(X_i)$ can be computed for all $1 \le i \le n$ in a total of $O(n)$ time by a simple iteration on the variables: $|X_i| = 1$ for any $X_i = a$ ($a \in \Sigma$), and $|X_i| = |X_\ell| + |X_r|$ for any $X_i = X_\ell X_r$. Also, $\mathit{vOcc}(X_n) = 1$ and for $i < n$,

$$\mathit{vOcc}(X_i) = \sum \{ \mathit{vOcc}(X_k) \mid X_k = X_\ell X_i \} + \sum \{ \mathit{vOcc}(X_k) \mid X_k = X_i X_r \}.$$

We shall assume, as in various previous work on SLPs, that the word size is at least $\log |T|$, and hence, values representing lengths and positions of $T$ in our algorithms can be manipulated in constant time.

3 q-gram Non-Overlapping Frequencies on Compressed String

The goal of this paper is to efficiently solve the following problem.

Problem 2 (Non-overlapping $q$-gram frequencies on SLP). Given an SLP $\mathcal{T}$ of size $n$ that describes string $T$ and a positive integer $q$, compute $\mathit{nOcc}(T, P)$ for all $q$-grams $P \in \Sigma^q$.

If we decompress the given SLP $\mathcal{T}$ obtaining the string $T$, then we can solve the problem in $O(|T|)$ time. However, it holds that $|T| = O(2^n)$.
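The two $O(n)$ iterations for $|X_i|$ and $\mathit{vOcc}(X_i)$, together with the decompress-then-count baseline, can be sketched as follows. This is a minimal sketch under stated assumptions: the list encoding of the SLP and all function names are hypothetical, and the baseline uses a hash map with a greedy left-priority scan (justified by Lemma 1) instead of the suffix-array method, so it runs in $O(q|T|)$ expected time rather than $O(|T|)$.

```python
from collections import defaultdict

# Hypothetical encoding (not from the paper): rule i (1-based) is either a
# single character, or a pair (l, r) of indices of earlier variables.
FIG1_SLP = ["a", "b", (1, 2), (1, 3), (3, 4), (4, 5), (6, 5)]


def lengths(slp):
    """|X_i| for every variable, by one left-to-right pass (index 0 unused)."""
    length = [0] * (len(slp) + 1)
    for i, rule in enumerate(slp, start=1):
        if isinstance(rule, str):
            length[i] = 1
        else:
            left, right = rule
            length[i] = length[left] + length[right]
    return length


def variable_occurrences(slp):
    """vOcc(X_i) for every variable, by one right-to-left pass.

    When variable i is processed, all of its parents (which have larger
    indices) have already pushed their counts down to it.
    """
    n = len(slp)
    vocc = [0] * (n + 1)
    vocc[n] = 1
    for i in range(n, 0, -1):
        rule = slp[i - 1]
        if not isinstance(rule, str):
            left, right = rule
            vocc[left] += vocc[i]
            vocc[right] += vocc[i]
    return vocc


def naive_nonoverlapping_frequencies(text, q):
    """nOcc(text, P) for every q-gram P of an uncompressed text.

    Greedy left-priority selection: an occurrence is counted if it starts
    at or after the end of the last counted occurrence of the same q-gram.
    """
    count = defaultdict(int)
    next_free = {}  # q-gram -> first 0-based start position still allowed
    for k in range(len(text) - q + 1):
        gram = text[k:k + q]
        if next_free.get(gram, 0) <= k:
            count[gram] += 1
            next_free[gram] = k + q
    return dict(count)


print(lengths(FIG1_SLP)[1:])
print(variable_occurrences(FIG1_SLP)[1:])
print(naive_nonoverlapping_frequencies("aababaababaab", 3))
```

For the SLP of Fig. 1 the second list has the value 3 at the entry for $X_4$, matching the example in the text. The baseline is useful as a reference against which a compressed-domain implementation can be checked on small inputs.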
Hence, in order to solve the problem efficiently, we have to establish an algorithm that does not explicitly decompress the given SLP $\mathcal{T}$.

3.1 Key Ideas

For any variable $X_i$ and integer $k \ge 1$, let $\mathit{pre}(X_i, k) = X_i[1 : \min\{k, |X_i|\}]$ and $\mathit{suf}(X_i, k) = X_i[\,|X_i| - \min\{k, |X_i|\} + 1 : |X_i|\,]$. That is, $\mathit{pre}(X_i, k)$ and $\mathit{suf}(X_i, k)$ are the prefix and the suffix of $\mathit{val}(X_i)$ of length $\min\{k, |X_i|\}$, respectively. For all variables $X_i$, $\mathit{pre}(X_i, k)$ can be computed in a total of $O(nk)$ time and space, as follows:

$$\mathit{pre}(X_i, k) = \begin{cases} \mathit{val}(X_i) & \text{if } |X_i| \le k, \\ \mathit{pre}(X_\ell, k)\, \mathit{pre}(X_r, k - |X_\ell|) & \text{if } X_i = X_\ell X_r \text{ and } |X_\ell| < k < |X_i|, \\ \mathit{pre}(X_\ell, k) & \text{if } X_i = X_\ell X_r \text{ and } k \le |X_\ell|. \end{cases}$$

$\mathit{suf}(X_i, k)$ can be computed similarly in $O(nk)$ time and space.

For any string $T$ and positive integers $q$ and $j$ ($1 \le j \le j + q - 1 \le |T|$), the longest overlapping cover of the $q$-gram $P = T[j : j + q - 1]$ w.r.t. position $j$ of $T$ is an ordered pair $\overleftrightarrow{\mathit{loc}}_q(T, j) = (b, e)$ of positions in $T$ which is defined as:

$$\overleftrightarrow{\mathit{loc}}_q(T, j) = \arg\max_{(b, e)} \left\{ e - b \;\middle|\; \begin{array}{l} (b, e) \in \mathit{Occ}(T, P) \times ((q - 1) \oplus \mathit{Occ}(T, P)), \\ b \le j \le j + q - 1 \le e, \\ \forall k \in [b : e - q] \cap \mathit{Occ}(T, P),\ [k + 1 : \min\{k + q - 1, e - q + 1\}] \cap \mathit{Occ}(T, P) \neq \emptyset \end{array} \right\}$$

Namely, $\overleftrightarrow{\mathit{loc}}_q(T, j)$ represents the beginning and ending positions of the maximum chain of overlapping occurrences of the $q$-gram $T[j : j + q - 1]$ that contains position $j$. For example, consider the string $T = \mathtt{aaabaabaaabaabaaaabaa}$ of length 21. For $q = 5$ and $j = 9$, we have $\overleftrightarrow{\mathit{loc}}_q(T, j) = (2, 16)$, since $T[2 : 6] = T[5 : 9] = T[9 : 13] = T[12 : 16] = \mathtt{aabaa}$. Note that $T[17 : 21] = \mathtt{aabaa}$ is not contained in this chain since it does not overlap with $T[12 : 16]$.

Lemma 3.
Given a string $T$ and integers $q$, $j$, the longest overlapping cover $\overleftrightarrow{\mathit{loc}}_q(T, j)$ can be computed in $O(|T|)$ time.

Proof. Using, for example, the KMP algorithm [12], we can obtain a sorted list of $\mathit{Occ}(T, T[j : j + q - 1])$ in $O(|T|)$ time. We can then scan this list forwards and backwards to obtain $b$ and $e$. ⊓⊔

For a variable $X_i = X_\ell X_r$ and a position $1 \le j \le |X_i| - q + 1$, a longest overlapping cover $(b, e) = \overleftrightarrow{\mathit{loc}}_q(X_i, j)$ is said to be closed in $X_i$ if $q - 1 < b$ and $e < |X_i| - q + 2$.

Theorem 2. Problem 2 can be solved in $O(q^2 n)$ time, provided that, for all variables $X_i = X_\ell X_r$ and $j$ s.t. $|X_i| \ge q$ and $\max\{1, |X_\ell| - 2(q - 1) + 1\} \le j \le \min\{|X_\ell| + q - 1, |X_i| - q + 1\}$, $(b, e) = \overleftrightarrow{\mathit{loc}}_q(X_i, j)$ and $\mathit{nOcc}(X_i[b : e], s)$ are already computed, where $s = X_i[j : j + q - 1]$.

Proof. Algorithm 1 shows a pseudo-code of our algorithm to solve Problem 2. Consider the $q$-gram $s = X_i[j : j + q - 1]$ at position $j$ for which $(b, e) = \overleftrightarrow{\mathit{loc}}_q(X_i, j)$ is closed in $X_i$. A key observation is that, if $(b, e)$ is closed in $X_i$, then $(b, e)$ is never closed in $X_\ell$ or $X_r$. Therefore, by summing up $\mathit{vOcc}(X_i) \cdot \mathit{nOcc}(X_i[b : e], s)$ for each closed $(b, e)$ in $X_i$, for all such variables $X_i$, we obtain $\mathit{nOcc}(T, s)$. Line 14 is sufficient to check if $(b, e)$ is closed. For all $1 \le i \le n$, $\mathit{vOcc}(X_i)$ can be computed in $O(n)$ time, and $t_i = \mathit{suf}(X_\ell, 2(q - 1))\, \mathit{pre}(X_r, 2(q - 1))$, as in line 8 of Algorithm 1, can be computed in $O(qn)$ time and space. The problem amounts to summing up the values of $\mathit{vOcc}(X_i) \cdot \mathit{nOcc}(X_i[b : e], s)$ for each $q$-gram $s$ contained in each $t_i$, and can be reduced to Problem 1 on string $z$ and integer array $w$ of length $O(qn)$, which can be solved in $O(qn)$ time by Theorem 1.
In line 15, we check that there is no previous position $h$ ($\max\{1, |X_\ell| - 2(q - 1) + 1\} \le h < j$) such that $X_i[h : h + q - 1] = X_i[j : j + q - 1]$ and $\overleftrightarrow{\mathit{loc}}_q(X_i, h) = \overleftrightarrow{\mathit{loc}}_q(X_i, j)$, so that we do not count the same $q$-gram more than once. If there is no such $h$, we set the value of $w_i[k - |X_\ell| + j]$ to $\mathit{vOcc}(X_i) \cdot \mathit{nOcc}(X_i[b : e], s)$. This can be checked in $O(q^2 n)$ time for all $X_i$ and $j$. For convenience, we assume that $T = \mathit{val}(X_n)$ starts and ends with special characters $\#^{q-1}$ and $\$^{q-1}$, respectively, that do not occur anywhere else in $T$. Then we can cope with the last variable $X_n$ as described above. Hence the theorem holds. ⊓⊔

3.2 Computing Longest Overlapping Covers

In this subsection, we will show how to compute the longest overlapping cover $(b, e) = \overleftrightarrow{\mathit{loc}}_q(X_i, j)$, where $s = X_i[j : j + q - 1]$, for all $X_i$ and all $j$ required for Theorem 2. For any string $T$ and integers $q$ and $j$ ($1 \le j < q$), let

$$\overrightarrow{\mathit{loc}}_q(T, j) = \begin{cases} (j, \hat{e}) & \text{if } j + q - 1 \le |T|, \\ (j, |T|) & \text{otherwise,} \end{cases}$$

$$\overleftarrow{\mathit{loc}}_q(T, j) = \begin{cases} (\tilde{b}, |T| - j + 1) & \text{if } |T| - j - q + 2 \ge 1, \\ (1, |T| - j + 1) & \text{otherwise,} \end{cases}$$

where $(j, \hat{e}) = (j - 1) \oplus \overleftrightarrow{\mathit{loc}}_q(T[j : |T|], 1)$ and $(\tilde{b}, |T| - j + 1) = \overleftrightarrow{\mathit{loc}}_q(T[1 : |T| - j + 1], |T| - j - q + 2)$. Namely, $\overrightarrow{\mathit{loc}}_q(T, j)$ is a suffix of the longest overlapping cover of the $q$-gram $T[j : j + q - 1]$ that begins at position $j$ ($1 \le j < q$) in $T$, and $\overleftarrow{\mathit{loc}}_q(T, j)$ is a prefix of the longest overlapping cover of the $q$-gram $T[\,|T| - j - q + 2 : |T| - j + 1\,]$ that ends at position $|T| - j + 1$ in $T$.

Algorithm 1: Computing $q$-gram non-overlapping frequencies from SLP
  Input: SLP $\mathcal{T} = \{ X_i \}_{i=1}^{n}$ representing string $T$, integer $q \ge 2$.
  Output: $\mathit{nOcc}(T, P)$ for all $q$-grams $P \in \Sigma^q$ where $\mathit{Occ}(T, P) \neq \emptyset$.
   1  Compute $\mathit{vOcc}(X_i)$ for all $1 \le i \le n$;
   2  Compute $\mathit{pre}(X_i, 2(q - 1))$ and $\mathit{suf}(X_i, 2(q - 1))$ for all $1 \le i \le n - 1$;
   3  $z \leftarrow \varepsilon$; $w \leftarrow [\,]$;
   4  for $i \leftarrow 1$ to $n$ do
   5      if $|X_i| \ge q$ then
   6          let $X_i = X_\ell X_r$;
   7          $k \leftarrow |\mathit{suf}(X_\ell, 2(q - 1))|$;
   8          $t_i = \mathit{suf}(X_\ell, 2(q - 1))\, \mathit{pre}(X_r, 2(q - 1))$;
   9          $z$.append($t_i$);
  10          $w_i \leftarrow$ create integer array of length $|t_i|$, each element set to 0;
  11          for $j \leftarrow \max\{1, |X_\ell| - 2(q - 1) + 1\}$ to $\min\{|X_\ell| + q - 1, |X_i| - q + 1\}$ do
  12              $s \leftarrow X_i[j : j + q - 1]$;
  13              $(b, e) \leftarrow \overleftrightarrow{\mathit{loc}}_q(X_i, j)$;
  14              if $q - 1 < b$ and $e < |X_i| - q + 2$ then
  15                  if $\overleftrightarrow{\mathit{loc}}_q(X_i, h) \neq \overleftrightarrow{\mathit{loc}}_q(X_i, j)$ for any position $h$ s.t. $\max\{1, |X_\ell| - 2(q - 1) + 1\} \le h < j$ then
  16                      $w_i[k - |X_\ell| + j] \leftarrow \mathit{vOcc}(X_i) \cdot \mathit{nOcc}(X_i[b : e], s)$;
  17          $w$.append($w_i$);
  18  Calculate $q$-gram frequencies in $z$, where each $q$-gram starting at position $d$ is weighted by $w[d]$.

Lemma 4. For all $1 \le i \le n$ and $1 \le j \le 2(q - 1)$, $\overrightarrow{\mathit{loc}}_q(X_i, j)$ can be computed in a total of $O(q^2 n)$ time.

Proof. We use dynamic programming. Let $X_i = X_\ell X_r$, $p_j = X_i[j : j + q - 1]$, and assume $\overrightarrow{\mathit{loc}}_q(X_\ell, j)$ and $\overrightarrow{\mathit{loc}}_q(X_r, j)$ have been calculated for all $1 \le j \le 2(q - 1)$. We examine the string $X_i[\max\{j, |X_\ell| - q + 2\} : \min\{|X_i|, |X_\ell| + q - 1\}]$ for occurrences of $p_j$ that cross $X_\ell$ and $X_r$, obtain its longest overlapping cover $(b_i, e_i)$, and check if it overlaps with $\overrightarrow{\mathit{loc}}_q(X_\ell, j)$. Furthermore, let $\hat{b}_r$ be the leftmost occurrence of $p_j$ in $X_r$ that has the possibility of overlapping with $(b_i, e_i)$. Then, $\overrightarrow{\mathit{loc}}_q(X_i, j)$ is either $\overrightarrow{\mathit{loc}}_q(X_\ell, j)$, or its end can be extended to $e_i$, or further to the end of $\overrightarrow{\mathit{loc}}_q(X_r, \hat{b}_r)$, depending on how the covers overlap.
More precisely, let $(j, \hat{e}_\ell) = \overrightarrow{\mathit{loc}}_q(X_\ell, j)$,

$$(b_i, e_i) = \max\{j - 1, |X_\ell| - q + 1\} \oplus \overleftrightarrow{\mathit{loc}}_q(X_i[\max\{j, |X_\ell| - q + 2\} : \min\{|X_i|, |X_\ell| + q - 1\}], h)$$

where $h \in \mathit{Occ}(X_i[\max\{j, |X_\ell| - q + 2\} : \min\{|X_i|, |X_\ell| + q - 1\}], p_j)$, and $(\hat{b}_r, \hat{e}_r) = (|X_\ell| + k - 1) \oplus \overrightarrow{\mathit{loc}}_q(X_r, k)$ where $k = \min \mathit{Occ}(\mathit{pre}(X_r, 2(q - 1)), p_j)$. (Note that $(\hat{b}_r, \hat{e}_r)$ and $(b_i, e_i)$ are not defined if the occurrences $h$, $k$ of $p_j$ do not exist.) Then we have

$$\overrightarrow{\mathit{loc}}_q(X_i, j) = \begin{cases} (j, \hat{e}_\ell) & \text{if } \hat{e}_\ell < b_i \text{ or } \nexists h, \\ (j, e_i) & \text{if } b_i \le \hat{e}_\ell \text{ and } (e_i < \hat{b}_r \text{ or } \nexists k), \\ (j, \hat{e}_r) & \text{otherwise.} \end{cases}$$

(See also Fig. 2 in Appendix.)

For all variables $X_i$ we pre-compute $\mathit{pre}(X_i, 2(q - 1))$ and $\mathit{suf}(X_i, 2(q - 1))$. This can be done in a total of $O(qn)$ time. Then, each $\overrightarrow{\mathit{loc}}_q(X_i, j)$ can be computed in $O(q)$ time using the KMP algorithm, Lemma 3, and the above recursion, giving a total of $O(q^2 n)$ time for all $1 \le i \le n$ and $1 \le j \le 2(q - 1)$. ⊓⊔

Lemma 5. For all $1 \le i \le n$ and $1 \le j \le 2(q - 1)$, $\overleftarrow{\mathit{loc}}_q(X_i, j)$ can be computed in a total of $O(q^2 n)$ time.

Proof. The proof is essentially the same as the proof for $\overrightarrow{\mathit{loc}}_q(X_i, j)$ in Lemma 4.

Recall that we have assumed in Theorem 2 that $\overleftrightarrow{\mathit{loc}}_q(X_i, j)$ are already computed. The following lemma describes how $\overleftrightarrow{\mathit{loc}}_q(X_i, j)$ can actually be computed in a total of $O(q^2 n)$ time.

Lemma 6. For all variables $X_i = X_\ell X_r$ and $j$ s.t. $\max\{1, |X_\ell| - 2(q - 1) + 1\} \le j \le \min\{|X_\ell| + q - 1, |X_i| - q + 1\}$, $(b, e) = \overleftrightarrow{\mathit{loc}}_q(X_i, j)$ can be computed in a total of $O(q^2 n)$ time.

Proof. Let $s_j = X_i[j : j + q - 1]$.
Firstly, we compute $(b_i, e_i) = \overleftrightarrow{\mathit{loc}}_q(X_i[\,|X_\ell| - 2(q - 1) + 1 : \min\{|X_i|, |X_\ell| + 2(q - 1)\}], j)$, and then $\overleftrightarrow{\mathit{loc}}_q(X_i, j)$ can be computed based on $(b_i, e_i)$, as follows. Let $(\tilde{b}_\ell, \tilde{e}_\ell) = \overleftarrow{\mathit{loc}}_q(X_\ell, |X_\ell| - \tilde{e}_\ell + 1)$ and $(\hat{b}_r, \hat{e}_r) = |X_\ell| \oplus \overrightarrow{\mathit{loc}}_q(X_r, \hat{b}_r - |X_\ell|)$, where $\tilde{e}_\ell = \max \mathit{Occ}(X_i[\max\{1, |X_\ell| - 2(q - 1) + 1\} : |X_\ell|], s_j)$ and $\hat{b}_r = \min \mathit{Occ}(X_i[\,|X_\ell| + 1 : \min\{|X_i|, |X_\ell| + 2(q - 1)\}], s_j)$.

1. If $b_i \le |X_\ell|$ and $e_i > |X_\ell|$, then we have $b \le b_i \le |X_\ell| < e_i \le e$. $(b, e) = \overleftrightarrow{\mathit{loc}}_q(X_i, j)$ can be computed by checking whether $(\tilde{b}_\ell, \tilde{e}_\ell)$, $(b_i, e_i)$, and $(\hat{b}_r, \hat{e}_r)$ are overlapping or not. (See also Fig. 3 in Appendix.)
2. If $e_i \le |X_\ell|$, then trivially $b = \tilde{b}_\ell$ and $e = e_i$.
3. If $b_i > |X_\ell|$, then trivially $b = b_i$ and $e = \hat{e}_r$.

Each $\tilde{e}_\ell = h$ and $\hat{b}_r = |X_\ell| + k$ can be computed using the KMP algorithm on the string $\mathit{suf}(X_\ell, 2(q - 1))\, \mathit{pre}(X_r, 2(q - 1))$ in $O(q)$ time. By Lemmas 4 and 5, $(\tilde{b}_\ell, \tilde{e}_\ell)$ and $(\hat{b}_r, \hat{e}_r)$ can be pre-computed in a total of $O(q^2 n)$ time for all $1 \le i \le n$. Hence the lemma holds. ⊓⊔

3.3 Largest Left-Priority and Smallest Right-Priority Occurrences

In order to compute $\mathit{nOcc}(X_i[b : e], s)$ for all $X_i$ and all $j$ required for Theorem 2, where $(b, e) = \overleftrightarrow{\mathit{loc}}_q(X_i, j)$ and $s = X_i[j : j + q - 1]$, we will use the largest and second largest occurrences of $\mathit{LnOcc}$ and the smallest and second smallest occurrences of $\mathit{RnOcc}$. For any set $S$ of integers and integer $1 \le k \le |S|$, let $\max_k S$ and $\min_k S$ denote the $k$-th largest and the $k$-th smallest element of $S$. For $1 \le i \le n$ and $1 \le j \le 2(q - 1)$, consider computing $\max_k \mathit{LnOcc}(X_i[j : \hat{e}_i], p_j)$ for $k = 1, 2$, where $(j, \hat{e}_i) = \overrightarrow{\mathit{loc}}_q(X_i, j)$ and $p_j = X_i[j : j + q - 1]$.
Intuitively, difficulties in computing $\max_k \mathit{LnOcc}(X_i[j : \hat{e}_i], p_j)$ come from the fact that the string $\mathit{val}(X_i)[j : \hat{e}_i]$ can be as long as $O(2^n)$, but we only have the prefix $\mathit{pre}(X_i, 3(q - 1))$ and suffix $\mathit{suf}(X_i, 3(q - 1))$ of $\mathit{val}(X_i)$ of length $O(q)$. Hence we cannot compute the value of $\hat{e}_i$ by simply running the KMP algorithm on those partial strings. For the same reason, the size of $\mathit{LnOcc}(X_i[j : \hat{e}_i], p_j)$ can be as large as $O(2^n / q)$. Hence we cannot store $\mathit{LnOcc}(X_i[j : \hat{e}_i], p_j)$ as is. Still, as will be seen in the following lemma, we can compute those values efficiently, in only $O(q^2 n)$ time.

Lemma 7. For all variables $X_i = X_\ell X_r$ and $1 \le j \le 2(q - 1)$, let $(j, \hat{e}_i) = \overrightarrow{\mathit{loc}}_q(X_i, j)$ and $p_j = X_i[j : j + q - 1]$. We can compute the values $\max_1 \mathit{LnOcc}(X_i[j : \hat{e}_i], p_j)$ and $\max_2 \mathit{LnOcc}(X_i[j : \hat{e}_i], p_j)$ for all $1 \le i \le n$ and $1 \le j \le 2(q - 1)$, in a total of $O(q^2 n)$ time.

Proof. See Appendix.

The next lemma can be shown similarly to Lemma 7.

Lemma 8. For all variables $X_i = X_\ell X_r$ and $1 \le j \le 2(q - 1)$, let $(\tilde{b}, \tilde{e}) = \overleftarrow{\mathit{loc}}_q(X_i, j)$ and $s_j = X_i[\,|X_i| - j - q + 2 : |X_i| - j + 1\,]$. We can compute the values $\min_1 \mathit{RnOcc}(X_i[\tilde{b} : \tilde{e}], s_j)$ and $\min_2 \mathit{RnOcc}(X_i[\tilde{b} : \tilde{e}], s_j)$ for all $1 \le i \le n$ and $1 \le j \le 2(q - 1)$, in a total of $O(q^2 n)$ time.

Lemma 9. For all variables $X_i = X_\ell X_r$ and $1 \le j < q$, $\max \mathit{LnOcc}(X_i[\tilde{b}_i : \tilde{e}_i], s_j)$ can be computed in a total of $O(q^2 n)$ time, where $(\tilde{b}_i, \tilde{e}_i) = \overleftarrow{\mathit{loc}}_q(X_i, j)$ and $s_j = X_i[\,|X_i| - j - q + 2 : |X_i| - j + 1\,]$.

Proof. The lemma can be shown by using Lemma 7. See Appendix for details.

Lemma 10.
For all variables $X_i = X_\ell X_r$ and $1 \le j < q$, $\min \mathit{RnOcc}(X_i[\hat{b}_i : \hat{e}_i], p_j)$ can be computed in a total of $O(q^2 n)$ time, where $(\hat{b}_i, \hat{e}_i) = \overrightarrow{\mathit{loc}}_q(X_i, j)$ and $p_j = X_i[j : j + q - 1]$.

Proof. The lemma can be shown in a similar way to Lemma 9, using Lemma 8 instead of Lemma 7. ⊓⊔

3.4 Counting Non-Overlapping Occurrences in Longest Overlapping Covers

Firstly, we show how to count non-overlapping occurrences of the $q$-gram $p_j$ in $X_i[j : \hat{e}_i]$, for all $i$ and $j$, where $p_j = X_i[j : j + q - 1]$ and $(j, \hat{e}_i) = \overrightarrow{\mathit{loc}}_q(X_i, j)$.

Lemma 11. For all variables $X_i = X_\ell X_r$ and $1 \le j \le 2(q - 1)$, let $(j, \hat{e}_i) = \overrightarrow{\mathit{loc}}_q(X_i, j)$ and $p_j = X_i[j : j + q - 1]$. We can compute $\mathit{nOcc}(X_i[j : \hat{e}_i], p_j)$ for all $1 \le i \le n$ and $1 \le j \le 2(q - 1)$, in a total of $O(q^2 n)$ time.

Proof. By Lemma 1, we have $\mathit{nOcc}(X_i[j : \hat{e}_i], p_j) = |\mathit{LnOcc}(X_i[j : \hat{e}_i], p_j)|$. We compute the occurrence $b_i$ in $(j - 1) \oplus \mathit{LnOcc}(X_i[j : \hat{e}_i], p_j)$ that crosses $X_\ell$ and $X_r$, if such exists. Note that at most one such occurrence exists. Also, we compute the smallest occurrence $\hat{b}_r$ in $(j - 1) \oplus \mathit{LnOcc}(X_i[j : \hat{e}_i], p_j)$ that is completely within $X_r$. Then the desired value $\mathit{nOcc}(X_i[j : \hat{e}_i], p_j)$ can be computed depending on whether $b_i$ and $\hat{b}_r$ exist or not.

Formally: Consider the set $S = ((j - 1) \oplus \mathit{LnOcc}(X_i[j : \hat{e}_i], p_j)) \cap [\,|X_\ell| - q + 2 : |X_\ell|\,]$ of occurrences of $p_j$, which is either empty or a singleton. If $S$ is a singleton, then let $b_i$ be its single element. Let

$$\hat{b}_r = \min \{ k \mid k \in ((j - 1) \oplus \mathit{LnOcc}(X_i[j : \hat{e}_i], p_j)) \cap [\,|X_\ell| + 1 : |X_\ell| + q - 1\,],\ \text{if } \exists b_i \text{ then } k \ge b_i + q \}.$$
Then we have

$$\mathit{nOcc}(X_i[j : \hat{e}_i], p_j) = \begin{cases} \mathit{nOcc}(X_r[j - |X_\ell| : \hat{e}_i - |X_\ell|], p_j) & \text{if } j > |X_\ell|, \\ \mathit{nOcc}(X_\ell[j : \hat{e}_\ell], p_j) & \text{if } \nexists b_i \text{ and } \nexists \hat{b}_r, \\ \mathit{nOcc}(X_\ell[j : \hat{e}_\ell], p_j) + 1 & \text{if } \exists b_i \text{ and } \nexists \hat{b}_r, \\ \mathit{nOcc}(X_\ell[j : \hat{e}_\ell], p_j) + \mathit{nOcc}(X_r[b_r : \hat{e}_r], p_j) & \text{if } \nexists b_i \text{ and } \exists \hat{b}_r, \\ \mathit{nOcc}(X_\ell[j : \hat{e}_\ell], p_j) + \mathit{nOcc}(X_r[b_r : \hat{e}_r], p_j) + 1 & \text{if } \exists b_i \text{ and } \exists \hat{b}_r, \end{cases}$$

where $(\hat{b}_r, \hat{e}_r) = \overrightarrow{\mathit{loc}}_q(X_r, \hat{b}_r)$.

For all variables $X_i$ we pre-compute $\mathit{pre}(X_i, 3(q - 1))$ and $\mathit{suf}(X_i, 3(q - 1))$. This can be done in a total of $O(qn)$ time. If $b_i$ or $\hat{b}_r$ exists, then $|X_\ell| - 3(q - 1) < j - 1 + \max \mathit{LnOcc}(X_\ell[j : \hat{e}_\ell], p_j) \le |X_\ell| - q + 2$. Then, each $b_i$ and $\hat{b}_r$ can be computed from $\mathit{LnOcc}(X_i[(j - 1 + \max \mathit{LnOcc}(X_\ell[j : \hat{e}_\ell], p_j)) : |X_\ell| + 3(q - 1)], p_j)$ by running the KMP algorithm on the string $\mathit{suf}(X_\ell, 3(q - 1))\, \mathit{pre}(X_r, 3(q - 1))$. Based on the above recursion, we can compute $\mathit{nOcc}(X_i[j : \hat{e}_i], p_j)$ in a total of $O(q^2 n)$ time for all $1 \le i \le n$ and $1 \le j \le 2(q - 1)$. ⊓⊔

The next lemma can be shown similarly to Lemma 11.

Lemma 12. For all variables $X_i = X_\ell X_r$ and $1 \le j \le 2(q - 1)$, let $(\tilde{b}_i, \tilde{e}_i) = \overleftarrow{\mathit{loc}}_q(X_i, j)$ and $s_j = X_i[\,|X_i| - j - q + 2 : |X_i| - j + 1\,]$. We can compute $\mathit{nOcc}(X_i[\tilde{b}_i : \tilde{e}_i], s_j)$ for all $1 \le i \le n$ and $1 \le j \le 2(q - 1)$, in a total of $O(q^2 n)$ time.

We have also assumed in Theorem 2 that $\mathit{nOcc}(X_i[b : e], s_j)$ are already computed. These values can be computed efficiently, as follows:

Lemma 13. For all variables $X_i = X_\ell X_r$ and $j$ s.t. $\max\{1, |X_\ell| - 2(q - 1) + 1\} \le j \le \min\{|X_i| - q + 1, |X_\ell| + q - 1\}$, $\mathit{nOcc}(X_i[b : e], s_j)$ can be computed in a total of $O(q^2 n)$ time, where $(b, e) = \overleftrightarrow{\mathit{loc}}_q(X_i, j)$ and $s_j = X_i[j : j + q - 1]$.

Proof.
We consider the case where $\max\{1, |X_\ell| - q + 2\} \le j \le |X_\ell|$, as the other cases can be shown similarly. Our basic strategy for computing $\mathit{nOcc}(X_i[b : e], s_j)$ is as follows. Firstly, we compute the largest element of $\mathit{LnOcc}(X_i[b : e], s_j)$ that occurs completely within $X_\ell$. Secondly, we compute the smallest element of $\mathit{RnOcc}(X_i[b : e], s_j)$ that occurs completely within $X_r$. Thirdly, we compute an occurrence of $s_j$ that crosses the boundary of $X_\ell$ and $X_r$, and does not overlap the above occurrences of $s_j$ completely within $X_\ell$ and $X_r$.

Formally: Let $\tilde{e}_\ell = b + q - 2 + \max \mathit{Occ}(X_i[b : |X_\ell|], s_j)$, $\hat{b}_r = |X_\ell| + \min \mathit{Occ}(X_i[\,|X_\ell| + 1 : e], s_j)$, $u_1 = b + q - 2 + \max \mathit{LnOcc}(X_i[b : \tilde{e}_\ell], s_j)$, and $u_2 = \hat{b}_r - 1 + \min \mathit{RnOcc}(X_i[\hat{b}_r : e], s_j)$. We consider the case where all these values exist, as the other cases can be shown similarly. It follows from Lemmas 1 and 2 that

$$\begin{aligned}
\mathit{nOcc}(X_i[b : e], s_j) &= |\mathit{LnOcc}(X_i[b : u_1], s_j)| + \mathit{nOcc}(X_i[u_1 + 1 : u_2 - 1], s_j) + |\mathit{RnOcc}(X_i[u_2 : e], s_j)| \\
&= \mathit{nOcc}(X_i[b : \tilde{e}_\ell], s_j) + \mathit{nOcc}(X_i[u_1 + 1 : u_2 - 1], s_j) + \mathit{nOcc}(X_i[\hat{b}_r : e], s_j).
\end{aligned}$$

(See also Fig. 6 in Appendix.)

By Lemma 6, $(b, e) = \overleftrightarrow{\mathit{loc}}_q(X_i, j)$ can be pre-computed in a total of $O(q^2 n)$ time. Since $b < \tilde{e}_\ell$ and $\hat{b}_r < e$, $\tilde{e}_\ell$ and $\hat{b}_r$ can be computed in $O(q)$ time using the KMP algorithm. By Lemmas 11 and 12, $\mathit{nOcc}(X_i[b : \tilde{e}_\ell], s_j)$ and $\mathit{nOcc}(X_i[\hat{b}_r : e], s_j)$ can be pre-computed in a total of $O(q^2 n)$ time (notice that $(b, \tilde{e}_\ell) = \overleftarrow{\mathit{loc}}_q(X_\ell, \tilde{e}_\ell)$ and $(\hat{b}_r, e) = |X_\ell| \oplus \overrightarrow{\mathit{loc}}_q(X_r, \hat{b}_r - |X_\ell|)$). By Lemmas 9 and 10, $u_1$ and $u_2$ can be pre-computed in a total of $O(q^2 n)$ time. Hence $\mathit{nOcc}(X_i[u_1 + 1 : u_2 - 1], s_j)$ can be computed in $O(q)$ time using the KMP algorithm for each $i$ and $j$. The lemma thus holds. ⊓⊔
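The quantities handled on the SLP in this section can be cross-checked on an uncompressed string with a direct implementation of the definitions. The sketch below is only such a reference check, not the compressed-domain algorithm: the function names are hypothetical, occurrences are found by a plain scan rather than KMP, and positions are 1-based as in the paper. It computes the longest overlapping cover by scanning the sorted occurrence list in both directions (as in the proof of Lemma 3) and counts non-overlapping occurrences by greedy left-priority selection (Lemma 1).

```python
def occurrences(text, pattern):
    """Sorted 1-based starting positions of pattern in text (the set Occ)."""
    q = len(pattern)
    return [k + 1 for k in range(len(text) - q + 1) if text[k:k + q] == pattern]


def longest_overlapping_cover(text, q, j):
    """(b, e) for the q-gram starting at 1-based position j of text."""
    if not 1 <= j <= len(text) - q + 1:
        raise ValueError("no q-gram starts at position j")
    occ = occurrences(text, text[j - 1:j - 1 + q])
    lo = hi = occ.index(j)
    # Extend the chain to the left while consecutive occurrences overlap,
    # i.e. while k1 + q - 1 >= k2 for neighbours k1 < k2.
    while lo > 0 and occ[lo - 1] + q - 1 >= occ[lo]:
        lo -= 1
    # Extend the chain to the right in the same way.
    while hi + 1 < len(occ) and occ[hi] + q - 1 >= occ[hi + 1]:
        hi += 1
    return occ[lo], occ[hi] + q - 1


def non_overlapping_count(text, pattern):
    """nOcc(text, pattern), via greedy left-priority selection (|LnOcc|)."""
    count, next_free = 0, 1
    for k in occurrences(text, pattern):
        if k >= next_free:
            count += 1
            next_free = k + len(pattern)
    return count


T = "aaabaabaaabaabaaaabaa"
b, e = longest_overlapping_cover(T, 5, 9)
print((b, e))                                   # prints (2, 16)
print(non_overlapping_count(T[b - 1:e], "aabaa"))  # prints 2
print(non_overlapping_count(T, "aabaa"))           # prints 3
```

On the example string of Section 3.1, the cover $(2, 16)$ contributes 2 non-overlapping occurrences and the separate cover $(17, 21)$ contributes 1, which together give the value 3 obtained on the whole string.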
3.5 Main Result

The following theorem concludes this whole section.

Theorem 3. Problem 2 can be solved in $O(q^2 n)$ time and $O(qn)$ space.

Proof. The time complexity and correctness follow from Theorem 2, Lemma 6, and Lemma 13. We compute and store strings $\mathit{suf}(X_i, 3(q - 1))$ and $\mathit{pre}(X_i, 3(q - 1))$ of length $O(q)$ for each variable $X_i$, hence this requires a total of $O(qn)$ space for all $1 \le i \le n$. We use a constant number of dynamic programming tables, each of which is of size $O(qn)$. Hence the total space complexity is $O(qn)$. ⊓⊔

4 Conclusion and Discussion

We considered the problem of computing the non-overlapping frequencies for all $q$-grams that occur in a given text represented as an SLP. Our algorithm greatly improves previous work, which solved the problem only for $q = 2$ and required $O(n^4 \log n)$ time and $O(n^3)$ space. We give the first algorithm which works for any $q \ge 2$, running in $O(q^2 n)$ time and $O(qn)$ space, where $n$ is the size of the SLP.

References

1. Amir, A., Benson, G.: Efficient two-dimensional compressed matching. In: Proc. DCC'92. pp. 279–288 (1992)
2. Apostolico, A., Lonardi, S.: Off-line compression by greedy textual substitution. Proceedings of the IEEE 88(11), 1733–1744 (2000)
3. Apostolico, A., Preparata, F.P.: Data structures and algorithms for the string statistics problem. Algorithmica 15(5), 481–494 (1996)
4. Bille, P., Landau, G.M., Raman, R., Sadakane, K., Satti, S.R., Weimann, O.: Random access to grammar-compressed strings. In: Proc. SODA'11. pp. 373–389 (2011)
5. Brodal, G.S., Lyngsø, R.B., Östlin, A., Pedersen, C.N.S.: Solving the string statistics problem in time O(n log n). In: Proc. ICALP'02. LNCS, vol. 2380, pp. 728–739 (2002)
6. Goto, K., Bannai, H., Inenaga, S., Takeda, M.: Towards efficient mining and classification on compressed strings.
In: Accepted for SPIRE'11 (2011), preprint available at arXiv:1103.3114v1
7. Hermelin, D., Landau, G.M., Landau, S., Weimann, O.: A unified algorithm for accelerating edit-distance computation via text-compression. In: Proc. STACS'09. pp. 529–540 (2009)
8. Inenaga, S., Bannai, H.: Finding characteristic substring from compressed texts. In: Proc. The Prague Stringology Conference 2009. pp. 40–54 (2009), full version to appear in the International Journal of Foundations of Computer Science
9. Kärkkäinen, J., Sanders, P.: Simple linear work suffix array construction. In: Proc. ICALP'03. LNCS, vol. 2719, pp. 943–955. Springer (2003)
10. Karpinski, M., Rytter, W., Shinohara, A.: An efficient pattern-matching algorithm for strings with short descriptions. Nordic Journal of Computing 4, 172–186 (1997)
11. Kasai, T., Lee, G., Arimura, H., Arikawa, S., Park, K.: Linear-time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications. In: Proc. CPM'01. LNCS, vol. 2089, pp. 181–192 (2001)
12. Knuth, D.E., Morris, J.H., Pratt, V.R.: Fast pattern matching in strings. SIAM Journal on Computing 6(2), 323–350 (1977)
13. Larsson, N.J., Moffat, A.: Off-line dictionary-based compression. Proceedings of the IEEE 88(11), 1722–1732 (2000)
14. Lifshits, Y.: Processing compressed texts: A tractability border. In: Proc. CPM 2007. LNCS, vol. 4580, pp. 228–240 (2007)
15. Manber, U., Myers, G.: Suffix arrays: A new method for on-line string searches. SIAM Journal on Computing 22(5), 935–948 (1993)
16. Matsubara, W., Inenaga, S., Ishino, A., Shinohara, A., Nakamura, T., Hashimoto, K.: Efficient algorithms to compute compressed longest common substrings and compressed palindromes. Theoretical Computer Science 410(8–10), 900–913 (2009)
17. Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Computing Surveys 39(1), 2 (2007)
18.
Nevill-Manning, C.G., Witten, I.H., Maulsby, D.L.: Compression by induction of hierarchical grammars. In: Proc. DCC'94. pp. 244–253 (1994)
19. Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Transactions on Information Theory IT-23(3), 337–349 (1977)
20. Ziv, J., Lempel, A.: Compression of individual sequences via variable-length coding. IEEE Transactions on Information Theory 24(5), 530–536 (1978)

Appendix

A Proofs

Proof of Theorem 1.

Proof. We will make use of the suffix array and lcp array. The suffix array [15] $\mathit{SA}$ of any string $T$ is an array of length $|T|$ such that $\mathit{SA}[i] = j$, where $T[j:|T|]$ is the $i$-th lexicographically smallest suffix of $T$. The lcp array of any string $T$ is an array of length $|T|$ such that $\mathit{LCP}[i]$ is the length of the longest common prefix of $T[\mathit{SA}[i-1]:|T|]$ and $T[\mathit{SA}[i]:|T|]$ for $2 \le i \le |T|$, and $\mathit{LCP}[1] = 0$. It is well known that the suffix array for any string of length $|T|$ can be constructed in $O(|T|)$ time (e.g. [9]) assuming an integer alphabet. Given the text and suffix array, the lcp array can also be calculated in $O(|T|)$ time [11].

We can calculate the overlapping $q$-gram frequencies of string $T$ using suffix array $\mathit{SA}$ and lcp array $\mathit{LCP}$. $\mathit{SA}[i]$ represents an occurrence of a $q$-gram $T[\mathit{SA}[i]:\mathit{SA}[i]+q-1]$. Since the suffixes are lexicographically sorted in the suffix array, intervals on the suffix array where the values of the lcp array are at least $q$ represent occurrences of the same $q$-gram. The sum of $w[\mathit{SA}[i]]$ in this interval is the desired value for the $q$-gram. Constructing $\mathit{SA}$ and $\mathit{LCP}$ can be done in $O(|T|)$ time, and summing up $w[\mathit{SA}[i]]$ for each interval where $\mathit{LCP}[i] \ge q$ can easily be done in $O(|T|)$ time by a simple scan. □

Proof of Lemma 1.

Proof. We prove $\mathit{nOcc}(T[1:i], P) = |\mathit{LnOcc}(T[1:i], P)|$ by induction on $i$.
For $i \le 1$, the statement clearly holds. Now, assume that the statement holds for $i < k$, where $k \ge 2$. For $i = k$, notice that $0 \le \mathit{nOcc}(T[1:k], P) - |\mathit{LnOcc}(T[1:k], P)| \le 1$, since there can be at most one new occurrence of $P$ ending at position $k$, which may or may not be counted for $\mathit{nOcc}(T[1:k], P)$. If we assume on the contrary that the statement does not hold for $i = k$, then $\mathit{nOcc}(T[1:k], P) - \mathit{nOcc}(T[1:k-1], P) = \mathit{nOcc}(T[1:k], P) - |\mathit{LnOcc}(T[1:k], P)| = 1$. Since the change was caused by the new occurrence, we have $\mathit{nOcc}(T[1:k], P) = \mathit{nOcc}(T[1:k-|P|], P) + 1$. By the inductive hypothesis, we have $\mathit{nOcc}(T[1:k-|P|], P) = |\mathit{LnOcc}(T[1:k-|P|], P)|$. Also, $|\mathit{LnOcc}(T[1:k], P)| = |\mathit{LnOcc}(T[1:k-|P|], P)| + 1$, since the new occurrence does not overlap with any occurrences in $\mathit{LnOcc}(T[1:k-|P|], P)$. This leads to $\mathit{nOcc}(T[1:k], P) = |\mathit{LnOcc}(T[1:k], P)|$, a contradiction. $\mathit{nOcc}(T, P) = |\mathit{RnOcc}(T, P)|$ can be shown symmetrically. □

Proof of Lemma 7.

Proof. We compute the smallest occurrence $b_i$ in $(j-1) \oplus \mathit{LnOcc}(X_i[j:\hat{e}_i], p_j)$ that crosses $X_\ell$ and $X_r$. Also, we compute the smallest occurrence $\hat{b}_r$ in $(j-1) \oplus \mathit{LnOcc}(X_i[j:\hat{e}_i], p_j)$ that is completely within $X_r$. Then the desired value $\max_1 \mathit{LnOcc}(X_i[j:\hat{e}_i], p_j)$ can be computed depending on whether $b_i$ and $\hat{b}_r$ exist or not.

Formally, consider the set $S = ((j-1) \oplus \mathit{LnOcc}(X_i[j:\hat{e}_i], p_j)) \cap [|X_\ell| - q + 2 : |X_\ell|]$ of occurrences of $p_j$, which is either empty or a singleton. If $S$ is a singleton, then let $b_i$ be its single element. Let $\hat{b}_r = \min\{k \mid k \in ((j-1) \oplus \mathit{LnOcc}(X_i[j:\hat{e}_i], p_j)) \cap [|X_\ell| + 1 : |X_\ell| + 2(q-1)],\ \text{if } \exists b_i \text{ then } k \ge b_i + q\}$.
Then we have

$$\max\nolimits_1 \mathit{LnOcc}(X_i[j:\hat{e}_i], p_j) = \begin{cases}
\max_1 \mathit{LnOcc}(X_\ell[j:\hat{e}_\ell], p_j) & \text{if } \nexists b_i \text{ and } \nexists \hat{b}_r, \\
b_i - j + 1 & \text{if } \exists b_i \text{ and } \nexists \hat{b}_r, \\
\hat{b}_r - j + \max_1 \mathit{LnOcc}(X_r[\hat{b}_r - |X_\ell| : \hat{e}_r], p_j) & \text{if } \exists \hat{b}_r.
\end{cases}$$

(See also Fig. 4 in Appendix B.) For all variables $X_i$ we pre-compute $\mathit{pre}(X_i, 3(q-1))$ and $\mathit{suf}(X_i, 3(q-1))$. This can be done in a total of $O(qn)$ time. If $b_i$ or $\hat{b}_r$ exists, $|X_\ell| - 3(q-1) \le j - 1 + \max \mathit{LnOcc}(X_\ell[j:\hat{e}_\ell], p_j) \le |X_\ell| - q + 1$. Then, each $b_i$ and $\hat{b}_r$ can be computed from $\mathit{LnOcc}(X_i[(j - 1 + \max \mathit{LnOcc}(X_\ell[j:\hat{e}_\ell], p_j)) : |X_\ell| + 3(q-1)], p_j)$ by running the KMP algorithm on the string $\mathit{pre}(X_i, 3(q-1))\,\mathit{suf}(X_i, 3(q-1))$. Based on the above recursion, we can compute $\max_1 \mathit{LnOcc}(X_i[j:\hat{e}_i], p_j)$ in a total of $O(q^2 n)$ time for all $1 \le i \le n$ and $1 \le j \le 2(q-1)$.

It is not difficult to see that similar claims, with slightly different conditions, can be made for $\max_2 \mathit{LnOcc}(X_i[j:\hat{e}_i], p_j)$, where the value corresponds to one of 4 values: $\max_2 \mathit{LnOcc}(X_\ell[j:\hat{e}_\ell], p_j)$, $\max_1 \mathit{LnOcc}(X_\ell[j:\hat{e}_\ell], p_j)$, $b_i$, or $\max_2 \mathit{LnOcc}(X_r[\hat{b}_r - |X_\ell| : \hat{e}_r], p_j)$, with appropriate offsets. □

Proof of Lemma 9.

Proof. Our basic strategy for computing $\max \mathit{LnOcc}(X_i[\tilde{b}_i:\tilde{e}_i], s_j)$ is as follows. Firstly, we compute the largest element of $\mathit{LnOcc}(X_i[\tilde{b}_i:\tilde{e}_i], s_j)$ that occurs completely within $X_\ell$. Secondly, we compute the smallest element of $\mathit{LnOcc}(X_i[\tilde{b}_i:\tilde{e}_i], s_j)$ that crosses the boundary of $X_\ell$ and $X_r$. Let $d$ be this occurrence, if such exists. Then the desired output $\max \mathit{LnOcc}(X_i[\tilde{b}_i:\tilde{e}_i], s_j)$ is given as either the largest or the second largest element of $(d + q - 1) \oplus \mathit{LnOcc}(X_r[d + q - |X_\ell| : |X_r|], s_j)$.

More formally: We consider the case where $\tilde{b}_i + q - 1 \le |X_\ell|$.
Let $\tilde{e}_\ell = q - 1 + \max(\mathit{Occ}(X_i, s_j) \cap [|X_\ell| - 2(q-1) + 1 : |X_\ell| - q + 1])$ and $m = \tilde{b}_i - 1 + \max \mathit{LnOcc}(X_\ell[\tilde{b}_i:\tilde{e}_\ell], s_j)$, where $(\tilde{b}_i, \tilde{e}_\ell) = \overleftarrow{\mathit{loc}}_q(X_\ell, |X_\ell| - \tilde{e}_\ell + 1)$. Let $d = m + q - 1 + \min \mathit{LnOcc}(X_i[m+q:\tilde{e}_i], s_j)$. Let

$$\hat{b}_r = \begin{cases}
d & \text{if } \tilde{e}_i - q + 1 \le |X_\ell| \text{ or } d > |X_\ell|, \\
d + q - 1 + \min \mathit{LnOcc}(X_i[d+q:|X_i|], s_j) & \text{otherwise.}
\end{cases}$$

Let $h' = |X_\ell| + \max_2 \mathit{LnOcc}(X_r[\hat{b}'_r:\hat{e}'_r], s_j)$ and $h = |X_\ell| + \max_1 \mathit{LnOcc}(X_r[\hat{b}'_r:\hat{e}'_r], s_j)$, where $(\hat{b}'_r, \hat{e}'_r) = \overrightarrow{\mathit{loc}}_q(X_r, \hat{b}_r - |X_\ell|)$. (See also Fig. 5 in Appendix B.) Then

$$\max \mathit{LnOcc}(X_i[\tilde{b}_i:\tilde{e}_i], s_j) = \begin{cases}
h & \text{if } h \le \tilde{e}_i - q + 1, \\
h' & \text{otherwise.}
\end{cases}$$

The case where $\tilde{b}_i + q - 1 > |X_\ell|$ can be solved similarly. Each $\tilde{e}_\ell$, $d$ and $\hat{b}_r$ can be computed in $O(q)$ time using the KMP algorithm, hence requiring a total of $O(q^2 n)$ time. By Lemmas 4 and 5, $\overleftarrow{\mathit{loc}}_q(X_\ell, \tilde{e}_\ell)$ and $\overrightarrow{\mathit{loc}}_q(X_i, \hat{b}_r)$ can be computed in $O(q^2 n)$ time for all $X_i = X_\ell X_r$ and $1 \le j < n$. By Lemma 7, $h'$ and $h$ can be computed in a total of $O(q^2 n)$ time for all $X_i = X_\ell X_r$ and $1 \le j < n$. Therefore, by dynamic programming we can compute $\mathit{LnOcc}(X_i[\tilde{b}_i:\tilde{e}_i], s_j)$ in a total of $O(q^2 n)$ time. □

B Figures

Fig. 2. Illustration for Lemma 4. In this figure, $\overrightarrow{\mathit{loc}}_q(X_i, j) = (j, e_i)$.

Fig. 3. Illustration for Lemma 6. Rectangles show important occurrences of $X_i[j:j+q-1]$. In this case $b = \tilde{b}_\ell$ and $e = \hat{e}_r$.

Fig. 4. Illustration for Lemma 7, calculating $\max \mathit{LnOcc}(X_i[j:\hat{e}], p_j)$. Shadowed occurrences are not in $\mathit{LnOcc}(X_i[j:\hat{e}_i], p_j)$, while white ones are in $\mathit{LnOcc}(X_i[j:\hat{e}_i], p_j)$.
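To make the distinction between shadowed and white occurrences in Fig. 4 concrete, the following Python sketch separates the occurrences of a pattern into those kept by a left-to-right greedy scan and those skipped. It assumes that LnOcc denotes exactly the greedily kept occurrences, reports positions 1-based as in the paper, and works on a plain string rather than on an SLP. The function names are hypothetical. The last line compares the greedy count with an independent dynamic program, which is the statement of Lemma 1 on this one input.

```python
def occurrences(text, pattern):
    """All 1-based start positions of pattern in text (overlaps allowed)."""
    return [i + 1 for i in range(len(text) - len(pattern) + 1)
            if text.startswith(pattern, i)]


def greedy_left(text, pattern):
    """Return (kept, skipped): occurrences chosen by the left-to-right greedy
    scan, and occurrences dropped because they overlap a kept one."""
    kept, skipped, next_free = [], [], 1
    for pos in occurrences(text, pattern):
        if pos >= next_free:
            kept.append(pos)
            next_free = pos + len(pattern)
        else:
            skipped.append(pos)
    return kept, skipped


def max_nonoverlapping(text, pattern):
    """nOcc computed by dynamic programming over prefixes, without greedy."""
    q = len(pattern)
    best = [0] * (len(text) + 1)
    for end in range(1, len(text) + 1):
        best[end] = best[end - 1]
        if end >= q and text[end - q:end] == pattern:
            best[end] = max(best[end], best[end - q] + 1)
    return best[-1]


kept, skipped = greedy_left("aabaabaabaa", "aabaa")
print(kept, skipped)        # [1, 7] [4]
print(kept[-1], kept[-2])   # 7 1  (largest and second largest kept positions)
print(max_nonoverlapping("aabaabaabaa", "aabaa") == len(kept))  # True
```

Here the occurrence at position 4 plays the role of a shadowed occurrence, while positions 1 and 7 are the white ones; the values 7 and 1 correspond to what the proofs write as $\max_1$ and $\max_2$.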
Fig. 5. Illustration for Lemma 9. Rectangles show important occurrences of $s_j$. In this case $\max \mathit{LnOcc}(X_i[\tilde{b}_i:\tilde{e}_i], s_j) = h'$, as $h > \tilde{e}_i - q + 1$.

Fig. 6. Illustration for Lemma 13. Rectangles show important occurrences of $X_i[j:j+q-1]$. In this case $\mathit{nOcc}(X_i[b:\tilde{e}_\ell], s_j) = 3$, $\mathit{nOcc}(X_i[u_1+1:u_2-1], s_j) = 1$, and $\mathit{nOcc}(X_i[\hat{b}_r:e], s_j) = 3$.
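The scan described in the proof of Theorem 1 (Appendix A) can also be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the linear-time procedure of the proof: the suffix array is built by naive sorting instead of the construction of [9], positions are 0-based, `w[i]` is taken to be the weight of the occurrence starting at position `i`, and the function names are hypothetical. The lcp array follows the usual formulation of the algorithm of [11].

```python
def suffix_array(t):
    """Naive suffix array by sorting suffixes (not linear time)."""
    return sorted(range(len(t)), key=lambda i: t[i:])


def lcp_array(t, sa):
    """lcp[r] = longest common prefix length of suffixes sa[r-1] and sa[r]."""
    n = len(t)
    rank = [0] * n
    for r, start in enumerate(sa):
        rank[start] = r
    lcp, h = [0] * n, 0
    for i in range(n):
        if rank[i] == 0:
            h = 0
            continue
        j = sa[rank[i] - 1]
        while i + h < n and j + h < n and t[i + h] == t[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1
    return lcp


def weighted_qgram_freq(t, w, q):
    """Sum w[start] over each run of suffixes sharing the same length-q prefix."""
    sa = suffix_array(t)
    lcp = lcp_array(t, sa)
    freq, current = {}, None
    for r, start in enumerate(sa):
        if start + q > len(t):       # suffix shorter than q: no q-gram here
            current = None
            continue
        if current is None or lcp[r] < q:
            current = t[start:start + q]   # a new interval begins
            freq[current] = 0
        freq[current] += w[start]
    return freq


print(weighted_qgram_freq("abab", [1, 1, 1, 1], 2))  # {'ab': 2, 'ba': 1}
```

With all weights equal to 1 the result is the ordinary overlapping $q$-gram frequency; other weights simply change what each occurrence contributes to the sum.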
