Computing q-gram Frequencies on Collage Systems

Keisuke Goto, Hideo Bannai, Shunsuke Inenaga, Masayuki Takeda
Department of Informatics, Kyushu University, Japan
{keisuke.gotou,bannai,inenaga,takeda}@inf.kyushu-u.ac.jp

Abstract

Collage systems are a general framework for representing outputs of various text compression algorithms. We consider the all $q$-gram frequency problem on a compressed string represented as a collage system, and present an $O((q + h \log n)n)$-time, $O(qn)$-space algorithm for calculating the frequencies of all $q$-grams that occur in the string. Here, $n$ and $h$ are respectively the size and height of the collage system.

1 Introduction

Due to the ever-increasing size of the data that we generate and utilize, data is often stored in compressed form. Since merely decompressing such large-scale data can be demanding, methods for processing compressed strings as is, that is, processing a given compressed string without explicitly decompressing it, have been gaining attention [9, 12, 4, 11, 5, 3, 1]. An interesting property of these methods is that they can be theoretically, and sometimes even practically, faster than algorithms that work on an uncompressed representation of the same data.

Collage systems [7] are a general framework for describing compressed representations of strings using grammar-like variable assignments. The basic operations are concatenation, repetition, and truncation. Collage systems can model outputs of various compression algorithms [7], such as grammar-based compression algorithms (e.g. [13, 8]) and those of the LZ family (e.g. [15, 16]). By considering collage systems, it is possible to develop general processing algorithms that work on compressed strings generated by any of these compression algorithms.
In this paper, we consider the problem of determining the frequencies of all $q$-grams occurring in a string $T$, given a collage system representing $T$. The problem was previously considered for regular collage systems (or equivalently, straight line programs (SLPs) [6]), which are collage systems that contain neither truncation nor repetition. In [5], an $O(|\Sigma|^2 n^2)$-time and $O(n^2)$-space algorithm was presented for $q = 2$, where $|\Sigma|$ denotes the alphabet size and $n$ is the size of the SLP. More recently, a much simpler and more efficient $O(qn)$ time and space algorithm for general $q \geq 2$ was developed, and was shown to be practically faster than an algorithm working on uncompressed strings when $q$ is small [3].

The main contribution of this paper is an $O((q + h \log n)n)$-time and $O(qn)$-space algorithm that computes the frequencies of all $q$-grams that occur in a given string represented as a collage system, where $n$ is the size of the collage system and $h \leq n$ is the height of its derivation tree. The algorithm is a non-trivial extension of the algorithm of [3] that can deal with repetitions and truncations. Given a collage system of size $n$ that describes a string $T$, it is possible to construct an SLP of size $O(nh \log n)$ that describes the same string $T$. We can then apply the algorithm of [3] to the SLP, achieving an $O(qnh \log n)$-time, $O(qnh \log n)$-space solution. The new $O((q + h \log n)n)$-time, $O(qn)$-space solution improves on that. General collage systems allow for more powerful compression schemes: for example, while an LZ77 encoded representation of size $m$ with self-referencing may require $O(m^2 \log m)$ size when represented as an SLP, it can be represented as a collage system of size $O(m \log m)$ [2].

2 Preliminaries

2.1 Strings

Let $\Sigma$ be a nonempty finite set of symbols called the alphabet.
An element of $\Sigma^*$ is called a string. The length of a string $T$ is denoted by $|T|$. The empty string $\varepsilon$ is a string of length 0, namely, $|\varepsilon| = 0$. For a string $T = XYZ$, $X$, $Y$ and $Z$ are called a prefix, substring, and suffix of $T$, respectively. The $i$-th character of a string $T$ is denoted by $T[i]$ for $1 \leq i \leq |T|$, and the substring of $T$ that begins at position $i$ and ends at position $j$ is denoted by $T[i:j]$ for $1 \leq i \leq j \leq |T|$. For convenience, let $T[i:j] = \varepsilon$ if $j < i$. For any string $X$, let $X^0 = \varepsilon$ and, for any integer $p \geq 1$, let $X^p = X^{p-1}X$. For strings $T$ and $P$, let $\mathit{Occ}(T, P) = \{ i \mid T[i : i + |P| - 1] = P \}$ denote the set of occurrences of $P$ in $T$. For a string $T$ and an integer $k \geq 1$, let $\mathit{pre}(T, k) = T[1 : \min\{k, |T|\}]$ and $\mathit{suf}(T, k) = T[|T| - \min\{k, |T|\} + 1 : |T|]$, i.e., respectively the prefix and the suffix of $T$ of length at most $k$.

2.2 Collage Systems

We consider strings described by collage systems, proposed in [7]. Collage systems are a general framework for representing outputs of various compression algorithms. A collage system $\mathcal{T}$ is a set of assignments $\{X_1 = \mathit{expr}_1, X_2 = \mathit{expr}_2, \ldots, X_n = \mathit{expr}_n\}$, where each $X_i$ is a variable and each $\mathit{expr}_i$ is an expression:

\[
\mathit{expr}_i =
\begin{cases}
a & (a \in \Sigma), & \text{(terminal symbol)} \\
X_\ell X_r & (\ell, r < i), & \text{(concatenation)} \\
(X_s)^p & (s < i,\ p > 2), & \text{(repetition)} \\
{}^{[k]}X_s & (s < i,\ 1 \leq k < |\mathit{val}(X_s)|), & \text{(prefix truncation)} \\
X_s^{[k]} & (s < i,\ 1 \leq k < |\mathit{val}(X_s)|), & \text{(suffix truncation)}
\end{cases}
\]

where $\mathit{val}$ is a function defined below. To simplify the presentation, our definition of collage systems differs from the original in that we only consider a single variable $X_n$ for the sequence part. A collage system is said to be truncation-free if neither prefix truncation nor suffix truncation is used.
A collage system is said to be regular if it is truncation-free and no repetition is used. (Regular collage systems are equivalent to straight line programs (SLPs) [6], a general framework for grammar-based compression.) Outputs of the SEQUITUR [13] and RE-PAIR [8] algorithms can be seen as regular collage systems. Furthermore, a collage system is simple if it is regular and, for any variable $X_i = X_\ell X_r$, we have $|X_\ell| = 1$ or $|X_r| = 1$. Outputs of the LZ78 [16] and LZW [14] algorithms can be seen as simple collage systems.

To define the derivation tree of a collage system, we introduce two special symbols $\triangleright$ and $\triangleleft$ that are not in $\Sigma$. In any sequence over $\Sigma \cup \{\triangleright, \triangleleft\}$, each symbol $\triangleright$ (resp. $\triangleleft$) "cancels" the immediately-right (resp. -left) symbol in $\Sigma$. For any assignment $X_i = \mathit{expr}_i$ of a collage system $\mathcal{T}$, the derivation tree of $X_i$ is a tree with root $v$ labeled $X_i$ such that:

• $v$ has one subtree consisting of a single node labeled $a$, if $\mathit{expr}_i = a$ ($a \in \Sigma$).
• $v$ has two subtrees such that the left and the right ones are the derivation trees of $X_\ell$ and $X_r$, respectively, if $\mathit{expr}_i = X_\ell X_r$.
• $v$ has $p$ subtrees, each of which is the derivation tree of $X_s$, if $\mathit{expr}_i = (X_s)^p$.
• $v$ has $(k + 1)$ subtrees such that the rightmost one is the derivation tree of $X_s$ and the others are single-node trees labeled $\triangleright$, if $\mathit{expr}_i = {}^{[k]}X_s$.
• $v$ has $(k + 1)$ subtrees such that the leftmost one is the derivation tree of $X_s$ and the others are single-node trees labeled $\triangleleft$, if $\mathit{expr}_i = X_s^{[k]}$.

The derivation tree of $\mathcal{T}$ is defined to be the derivation tree of $X_n$. Figure 1 shows the derivation tree of an example collage system. We note that the sequence of leaf-labels of the derivation tree of $\mathcal{T}$ is a string over $\Sigma \cup \{\triangleright, \triangleleft\}$, and can be rewritten to $\mathit{val}(\mathcal{T})$ by applying the cancellation rules $\triangleright c \rightarrow \varepsilon$ and $c \triangleleft \rightarrow \varepsilon$ for any character $c \in \Sigma$.
For example, the leaf-label sequence $\triangleright\triangleright abcabcabc \triangleleft\triangleleft abc$ of the derivation tree of Figure 1 can be rewritten into $cabcaabc$.

The size of a collage system $\mathcal{T}$ is the number $n$ of assignments in $\mathcal{T}$. Let $\mathit{height}(X_i)$ represent the height of the derivation tree of $X_i$. The height of a collage system $\mathcal{T}$, denoted by $\mathit{height}(\mathcal{T})$, is defined to be $\mathit{height}(X_n)$.

The truncated derivation tree of a collage system $\mathcal{T}$ is the tree obtained from the derivation tree of $\mathcal{T}$ as follows: (1) remove a pair of adjacent leaves of the form $\triangleright c$ or $c \triangleleft$ ($c \in \Sigma$); (2) recursively remove internal nodes that have no children; (3) repeat until no leaves labeled with $\triangleright$ or $\triangleleft$ remain in the tree.

We define a function $\mathit{val}$ that maps variables $X_i$ to strings over $\Sigma$ recursively as follows:

\[
\mathit{val}(X_i) =
\begin{cases}
a & \text{for } X_i = a, \\
\mathit{val}(X_\ell)\,\mathit{val}(X_r) & \text{for } X_i = X_\ell X_r, \\
\mathit{val}(X_s)^p & \text{for } X_i = (X_s)^p, \\
\mathit{val}(X_s)[k + 1 : |\mathit{val}(X_s)|] & \text{for } X_i = {}^{[k]}X_s, \\
\mathit{val}(X_s)[1 : |\mathit{val}(X_s)| - k] & \text{for } X_i = X_s^{[k]}.
\end{cases}
\]

A variable $X_i$ is said to derive the string $\mathit{val}(X_i)$. Notice that $\mathit{val}(X_i)$ is identical to the leaf-label string of the subtree of the truncated derivation tree of the collage system that is rooted at node $X_i$. A collage system $\mathcal{T}$ is said to derive the string $T = \mathit{val}(X_n)$, i.e., the string derived from the last variable $X_n$ of $\mathcal{T}$. When it is not confusing, we identify a variable $X_i$ with $\mathit{val}(X_i)$. Let $|X_i| = |\mathit{val}(X_i)|$ for any variable $X_i$. The values $|X_i|$ for all $X_i$ can be computed in a total of $O(n)$ time by a simple iteration on the variables. Although $|T|$ can be very large compared to $n$, we shall assume, as in previous work, that the word size is at least $\log |T|$, and hence values representing lengths and positions of $T$ in our algorithms can be manipulated in constant time.
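The recursive definition of $\mathit{val}$ can be checked on the example collage system of Figure 1 with a short Python sketch. The tuple encoding of the assignments (`"char"`, `"cat"`, `"rep"`, `"pretrunc"`, `"suftrunc"`) and the names `val_all` and `EXAMPLE` are assumptions made for illustration only. The sketch decompresses every variable explicitly, which is exactly what the algorithms in this paper avoid, so it is meant purely as a reference for small inputs.

```python
def val_all(rules):
    """Return {i: val(X_i)} for a collage system given as {i: expression}.

    Hypothetical encoding: ("char", a), ("cat", l, r), ("rep", s, p),
    ("pretrunc", s, k) for [k]X_s and ("suftrunc", s, k) for X_s[k].
    Operand indices must be smaller than i.
    """
    val = {}
    for i in sorted(rules):
        kind, *args = rules[i]
        if kind == "char":
            val[i] = args[0]
        elif kind == "cat":
            left, right = args
            val[i] = val[left] + val[right]
        elif kind == "rep":
            s, p = args
            val[i] = val[s] * p
        elif kind == "pretrunc":          # drop the first k characters
            s, k = args
            val[i] = val[s][k:]
        elif kind == "suftrunc":          # drop the last k characters
            s, k = args
            val[i] = val[s][:len(val[s]) - k]
        else:
            raise ValueError(f"unknown expression kind: {kind}")
    return val


# The nine assignments of the running example (Figure 1).
EXAMPLE = {
    1: ("char", "a"), 2: ("char", "b"), 3: ("char", "c"),
    4: ("cat", 1, 2), 5: ("cat", 4, 3), 6: ("rep", 5, 3),
    7: ("suftrunc", 6, 2), 8: ("cat", 7, 5), 9: ("pretrunc", 8, 2),
}

print(val_all(EXAMPLE)[9])  # cabcaabc
```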
For each variable $X_i$, let

• $\mathit{vOcc}(X_i)$ be the number of subtrees rooted at $X_i$ that have exactly $|\mathit{val}(X_i)|$ leaves,
• $\mathit{trPrevOcc}(X_i)$ be the number of subtrees rooted at $X_i$ such that a non-empty proper prefix of $\mathit{val}(X_i)$ is truncated and no non-empty suffix is truncated from its leaf-label string,
• $\mathit{trSufvOcc}(X_i)$ be the number of subtrees rooted at $X_i$ such that a non-empty proper suffix of $\mathit{val}(X_i)$ is truncated and no non-empty prefix is truncated from its leaf-label string,
• $\mathit{trvOcc}(X_i)$ be the number of subtrees rooted at $X_i$ such that both a non-empty proper prefix and a non-empty proper suffix are truncated from its leaf-label string,

in the truncated derivation tree of a collage system $\mathcal{T}$. Let $\mathit{avOcc}(X_i)$ denote the number of subtrees rooted at $X_i$ in the (non-truncated) derivation tree of $\mathcal{T}$. Let $\mathit{dvOcc}(X_i) = \mathit{avOcc}(X_i) - \mathit{vOcc}(X_i) - \mathit{trPrevOcc}(X_i) - \mathit{trSufvOcc}(X_i) - \mathit{trvOcc}(X_i)$, i.e., $\mathit{dvOcc}(X_i)$ denotes the number of subtrees rooted at $X_i$ in the derivation tree that are completely removed in the truncated derivation tree. For variable $X_5$ in the running example of Figure 1, we have $\mathit{vOcc}(X_5) = 2$, $\mathit{trPrevOcc}(X_5) = 1$, $\mathit{trSufvOcc}(X_5) = 1$, $\mathit{avOcc}(X_5) = 4$, and $\mathit{dvOcc}(X_5) = 0$.

For each variable $X_i$ and $1 \leq k \leq |\mathit{val}(X_i)|$, let $\mathit{leaf}_i(k)$ denote the leaf of the derivation tree of $X_i$ that corresponds to the $k$-th character of $\mathit{val}(X_i)$. In the running example of Figure 1, $\mathit{leaf}_9(6)$ is the 6th leaf of the truncated derivation tree, which corresponds to $\mathit{val}(X_9)[6] = a$. For a string $\mathit{val}(X_i) = WYZ$, the leaves that correspond to $W$ are said to be prefix leaves, the leaves that correspond to $Y$ are said to be substring leaves, and the leaves that correspond to $Z$ are said to be suffix leaves.
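As a small aid for the notation of Section 2.1, the string primitives $\mathit{pre}$, $\mathit{suf}$ and $\mathit{Occ}$ can be written down directly in Python. This is only an illustrative sketch on uncompressed strings: the function names mirror the notation of the text, and the 1-based positions returned by `occ` follow the indexing convention used there, while Python's own slices are 0-based.

```python
def pre(t, k):
    """Prefix of t of length at most k (k >= 1)."""
    return t[:min(k, len(t))]


def suf(t, k):
    """Suffix of t of length at most k (k >= 1)."""
    return t[len(t) - min(k, len(t)):]


def occ(t, p):
    """1-based starting positions of the occurrences of p in t."""
    return [i + 1 for i in range(len(t) - len(p) + 1) if t[i:i + len(p)] == p]


text = "cabcaabc"            # the string derived by the example of Figure 1
print(pre(text, 3), suf(text, 3), occ(text, "abc"))
```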
3 Computing $q$-gram Frequencies on Collage Systems

The main problem we consider in this paper is the following:

Problem 1 ($q$-gram frequencies on collage systems) Given a collage system $\mathcal{T}$ that describes string $T$, compute $|\mathit{Occ}(T, P)|$ for all $q$-grams $P \in \Sigma^q$.

Figure 1: Derivation tree (left) and truncated derivation tree (right) of the collage system $\{X_1 = a, X_2 = b, X_3 = c, X_4 = X_1 X_2, X_5 = X_4 X_3, X_6 = (X_5)^3, X_7 = X_6^{[2]}, X_8 = X_7 X_5, X_9 = {}^{[2]}X_8\}$, which represents the string $cabcaabc$.

For regular collage systems (SLPs), a simple and practically efficient $O(qn)$ time and space algorithm was recently developed [3]. The basic idea is to construct, in $O(qn)$ time, a new string $T'$ of length $O(qn)$ and an integer array $w$ of the same length so that $\sum_{j \in \mathit{Occ}(T', P)} w[j] = |\mathit{Occ}(T, P)|$ for all $P \in \Sigma^q$ where $\mathit{Occ}(T, P) \neq \emptyset$. We briefly describe the idea below. For each $q$-gram occurrence in the text, we identify with it the lowest variable in the derivation tree of $\mathcal{T}$ that contains the $q$-gram occurrence. Thus, each $q$-gram occurrence corresponds to a unique variable $X_i = X_\ell X_r$ such that the $q$-gram crosses the boundary between $X_\ell$ and $X_r$. Noticing that all $q$-grams that are identified with $X_i$ are contained in the string $t_i = \mathit{suf}(\mathit{val}(X_\ell), q - 1)\,\mathit{pre}(\mathit{val}(X_r), q - 1)$, consider the array $w_i$ with $w_i[j] = \mathit{vOcc}(X_i)$ for $1 \leq j \leq |t_i| - q + 1$, and $w_i[j] = 0$ for $|t_i| - q + 2 \leq j \leq |t_i|$, where $\mathit{vOcc}(X_i)$ is the number of nodes in the derivation tree with label $X_i$ (footnote 1).
This gives us that $\sum_{j \in \mathit{Occ}(t_i, P)} w_i[j] = \mathit{vOcc}(X_i) \cdot |\mathit{Occ}(t_i, P)|$ is the total number of occurrences of $q$-gram $P$ in $T$ that are identified with $X_i$, for all $P \in \Sigma^q$. It remains to sum these values over all $n$ variables, that is, $|\mathit{Occ}(T, P)| = \sum_{i=1}^{n} \mathit{vOcc}(X_i) \cdot |\mathit{Occ}(t_i, P)|$. Thus, Problem 1 reduces to the following problem on $T' = t_1 \cdots t_n$ and $w = w_1 \cdots w_n$:

Problem 2 (weighted $q$-gram frequencies) Given a string $T'$, an integer $q$, and an integer array $w$ ($|w| = |T'|$), compute $\sum_{j \in \mathit{Occ}(T', P)} w[j]$ for all $q$-grams $P \in \Sigma^q$ where $\mathit{Occ}(T', P) \neq \emptyset$.

(Actually, $t_i$ and $w_i$ for which $|t_i| < q$ can be safely ignored when constructing $T'$ and $w$.) Since Problem 2 is solvable in $O(|T'|)$ time using standard string indices such as suffix arrays [10], Problem 1 can be solved in $O(qn)$ time and space.

Our algorithm for more general collage systems follows this approach of [3], but with new challenges lying in the construction of $T'$ and $w$. First, we show how to adapt the algorithm to cope with repetitions, and then go on to describe how to further extend the algorithm to cope with truncations.

Footnote 1: Note that the derivation tree and the truncated derivation tree of any truncation-free collage system are identical. Hence $\mathit{vOcc}(X_i) = \mathit{avOcc}(X_i)$ and $\mathit{trPrevOcc}(X_i) = \mathit{trSufvOcc}(X_i) = \mathit{trvOcc}(X_i) = \mathit{dvOcc}(X_i) = 0$ trivially hold.

3.1 Truncation-Free Collage Systems

Theorem 1 Problem 1 can be solved in $O(qn)$ time and space, if the collage system $\mathcal{T}$ is truncation-free.

Proof. The strings $\mathit{pre}(X_i, d)$ for all variables $X_i$ can be computed in $O(dn)$ time and space using the following dynamic programming recursion, where the array $P_d[i]$ holds the value of $\mathit{pre}(X_i, d)$:
\[
P_d[i] =
\begin{cases}
a & \text{for } X_i = a, \\
P_d[\ell] & \text{for } X_i = X_\ell X_r \text{ with } d \leq |X_\ell|, \\
P_d[\ell] \cdot \mathit{pre}(P_d[r], d - |X_\ell|) & \text{for } X_i = X_\ell X_r \text{ with } |X_\ell| < d, \\
(P_d[s])^y \cdot \mathit{pre}(P_d[s], d - |X_s| y) & \text{for } X_i = (X_s)^p,
\end{cases}
\]

where $y = \lfloor d / |X_s| \rfloor$. (Note that for $X_i = (X_s)^p$, we have $y = 0$ and $(P_d[s])^y = \varepsilon$ when $|X_s| > d$.) Similarly, the strings $\mathit{suf}(X_i, d)$ for all variables $X_i$ can also be computed in $O(dn)$ time and space by dynamic programming on an array $S_d[i]$.

$\mathit{vOcc}(X_i)$ for all $1 \leq i \leq n$ can be computed in $O(n)$ time by a simple iteration on the variables, since $\mathit{vOcc}(X_n) = 1$ and, for $i < n$,

\[
\mathit{vOcc}(X_i) = \sum \{ \mathit{vOcc}(X_j) \mid X_j = X_\ell X_i \} + \sum \{ \mathit{vOcc}(X_j) \mid X_j = X_i X_r \} + \sum \{ \mathit{vOcc}(X_j) \cdot p \mid X_j = (X_i)^p \}.
\]

As mentioned previously, we extend the idea of [3] for regular collage systems so that it handles repetitions. For each $q$-gram occurrence in the text, we identify the lowest variable in the derivation tree of $\mathcal{T}$ that contains the $q$-gram occurrence. For each variable of the form $X_i = X_\ell X_r$ with $|X_i| \geq q$, $t_i$ and $w_i$ are defined as in the case of regular collage systems. For each variable of the form $X_i = (X_s)^p$ with $|X_i| \geq q$, there are two cases:

1. If $q \leq |X_s|$, then let $t_i = \mathit{suf}(\mathit{val}(X_s), q - 1)\,\mathit{pre}(\mathit{val}(X_s), q - 1)$. There exist $p - 1$ copies of $t_i$ which cross the boundaries of the $X_s$'s within $X_i$. Let $w_i$ be an integer array of length $|t_i|$ such that $w_i[j] = \mathit{vOcc}(X_i) \cdot (p - 1)$ for $1 \leq j \leq |t_i| - q + 1$, and $w_i[j] = 0$ for $|t_i| - q + 2 \leq j \leq |t_i|$.

2. If $|X_s| < q$, then let $t_i = \mathit{val}(X_s)\,\mathit{pre}(\mathit{val}(X_s)^{p-1}, q - 1)$, which can easily be obtained in $O(q)$ time, given $\mathit{pre}(\mathit{val}(X_s), q - 1)$. Let $y = |X_s| - ((q - 1) \bmod |X_s|)$.
Then, for $1 \leq j \leq y$, $t_i[j : j + q - 1]$ occurs $p - \lceil q / |X_s| \rceil + 1$ times in $X_i$, and hence we let $w_i[j] = \mathit{vOcc}(X_i) \cdot (p - \lceil q / |X_s| \rceil + 1)$. For $y < j \leq |X_s|$, $t_i[j : j + q - 1]$ occurs $p - \lceil q / |X_s| \rceil$ times in $X_i$, and hence we let $w_i[j] = \mathit{vOcc}(X_i) \cdot (p - \lceil q / |X_s| \rceil)$. For $|X_s| < j \leq |t_i|$, we let $w_i[j] = 0$.

Now we construct a string $z$ by concatenating each $t_i$ with $q \leq |t_i| \leq 2(q - 1)$, and its corresponding weight array $w$ by concatenating each $w_i$ with $q \leq |w_i| \leq 2(q - 1)$. Then the problem is reduced to Problem 2 on string $z$ and weight array $w$. The 0's inserted at the end of each $w_i$ prevent us from counting unwanted $q$-grams that are generated by the concatenation of the $t_i$'s into $z$ and are not substrings of any single $t_i$. Since $|z| = |w| \leq 2(q - 1)n$, the problem can be solved in $O(qn)$ time.

Algorithm 1 in the appendix shows pseudo-code of our algorithm that solves Problem 1 for a given truncation-free collage system.

3.2 General Collage Systems

We show an $O((q + h \log n)n)$ time and $O(qn)$ space algorithm to solve Problem 1 for arbitrary general collage systems, where $h$ is the height of the collage system.

The trPrePath and trSufPath functions

For a variable $X_i = {}^{[k]}X_s$, the path from $X_s$ to the leaf $\mathit{leaf}_i(k + 1)$ in the derivation tree of $X_s$ is called the prefix truncation path of $X_i$. For a variable $X_i = {}^{[k]}X_s$ and $0 \leq x \leq \mathit{height}(X_i)$, let $\mathit{trPrePath}_x(X_i)$ be a function that returns the triple $(X_{u(x)}, \mathit{trPre}_x, \mathit{trSuf}_x)$, where $X_{u(x)}$ is the $x$-th node in the prefix truncation path, and $X_{u(x)}[\mathit{trPre}_x + 1 : |X_{u(x)}| - \mathit{trSuf}_x]$ corresponds to the prefix of $X_i[k + 1 : |X_i|]$ that is derived from this $X_{u(x)}$ in the derivation tree. Note that the value $|X_{u(x)}| - \mathit{trPre}_x - \mathit{trSuf}_x$ is monotonically non-increasing.
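Looking back at the truncation-free construction of Section 3.1, the following Python sketch puts its steps together: lengths and length-$(q-1)$ prefixes and suffixes bottom-up, $\mathit{vOcc}$ top-down, and then the strings $t_i$ with their weights. It is a sketch under stated assumptions rather than the paper's Algorithm 1: the tuple encoding of the assignments and the name `qgram_frequencies` are hypothetical, the number of copies used for prefixes and suffixes of a repetition is capped by $p$, and the weighted counts of Problem 2 are accumulated in a hash map instead of a suffix array, which costs an extra factor of $q$ in time.

```python
from collections import Counter


def qgram_frequencies(rules, q):
    """Count the q-grams (q >= 2) of the string derived by a truncation-free
    collage system. rules maps i in 1..n to ("char", a), ("cat", l, r) or
    ("rep", s, p) with l, r, s < i (an encoding assumed for this sketch)."""
    n, d = max(rules), q - 1
    length, pre, suf = {}, {}, {}
    for i in range(1, n + 1):                 # bottom-up pass
        kind, *a = rules[i]
        if kind == "char":
            length[i], pre[i], suf[i] = 1, a[0], a[0]
        elif kind == "cat":
            l, r = a
            length[i] = length[l] + length[r]
            pre[i] = (pre[l] + pre[r])[:d]
            suf[i] = (suf[l] + suf[r])[-d:]
        else:                                 # repetition (X_s)^p
            s, p = a
            length[i] = length[s] * p
            copies = min(p, d // length[s] + 1)   # enough copies to cover d characters
            pre[i] = (pre[s] * copies)[:d]
            suf[i] = (suf[s] * copies)[-d:]

    vocc = {i: 0 for i in rules}              # top-down pass
    vocc[n] = 1
    for i in range(n, 0, -1):
        kind, *a = rules[i]
        if kind == "cat":
            vocc[a[0]] += vocc[i]
            vocc[a[1]] += vocc[i]
        elif kind == "rep":
            vocc[a[0]] += a[1] * vocc[i]

    freq = Counter()

    def add(t, weights):
        # weights[j] is the weight of the q-gram starting at 0-based position j of t
        for j, wt in enumerate(weights):
            if wt > 0:
                freq[t[j:j + q]] += wt

    for i in range(1, n + 1):
        kind, *a = rules[i]
        if kind == "char" or length[i] < q:
            continue
        if kind == "cat":
            t = suf[a[0]] + pre[a[1]]
            add(t, [vocc[i]] * (len(t) - q + 1))
        else:
            s, p = a
            m = length[s]
            if q <= m:                        # case 1: q-grams cross one boundary
                t = suf[s] + pre[s]
                add(t, [vocc[i] * (p - 1)] * (len(t) - q + 1))
            else:                             # case 2: |X_s| < q, pre[s] is val(X_s)
                t = (pre[s] * min(p, d // m + 2))[:m + min(d, (p - 1) * m)]
                y, c = m - d % m, -(-q // m)  # c = ceil(q / |X_s|)
                add(t, [vocc[i] * (p - c + 1 if j < y else p - c) for j in range(m)])
    return freq
```

For instance, with the hypothetical system `{1: ("char", "a"), 2: ("char", "b"), 3: ("cat", 1, 2), 4: ("rep", 3, 3), 5: ("cat", 4, 1), 6: ("rep", 5, 2)}`, which derives `abababaabababa`, calling `qgram_frequencies(rules, 3)` yields the counts `aba: 6`, `bab: 4`, `baa: 1` and `aab: 1`.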
For a variable $X_i = {}^{[k]}X_s$, we can recursively compute $\mathit{trPrePath}_x(X_i)$ as follows. Let $\mathit{trPrePath}_0(X_i) = (X_s, k, 0)$, and for $x \geq 0$ let

\[
\mathit{trPrePath}_{x+1}(X_i) =
\begin{cases}
(X_\ell,\ \mathit{trPre}_x,\ \max(0, \mathit{trSuf}_x - |X_r|)) & \text{if } X_{u(x)} = X_\ell X_r \text{ and } 0 \leq \mathit{trPre}_x < |X_\ell|, \\
(X_r,\ \mathit{trPre}_x - |X_\ell|,\ \mathit{trSuf}_x) & \text{if } X_{u(x)} = X_\ell X_r \text{ and } |X_\ell| \leq \mathit{trPre}_x, \\
(X_e,\ \mathit{trPre}_x \bmod |X_e|,\ \max\{0, \lceil \mathit{trPre}_x / |X_e| \rceil \cdot |X_e| + \mathit{trSuf}_x - |X_{u(x)}|\} \bmod |X_e|) & \text{if } X_{u(x)} = (X_e)^p, \\
(X_e,\ \mathit{trPre}_x + k',\ \mathit{trSuf}_x) & \text{if } X_{u(x)} = {}^{[k']}X_e, \\
(X_e,\ \mathit{trPre}_x,\ \mathit{trSuf}_x + k') & \text{if } X_{u(x)} = X_e^{[k']}, \\
\text{undefined} & \text{if } X_{u(x)} = a,
\end{cases}
\]

where $\mathit{trPrePath}_x(X_i) = (X_{u(x)}, \mathit{trPre}_x, \mathit{trSuf}_x)$. For instance, see Figure 1. There, $\mathit{trPrePath}_x(X_9)$ for $0 \leq x \leq 4$ are respectively $(X_8, 2, 0)$, $(X_7, 2, 0)$, $(X_6, 2, 2)$, $(X_5, 2, 0)$, and $(X_3, 0, 0)$. For a variable $X_i = X_s^{[k]}$ and its suffix truncation path, $\mathit{trSufPath}_x(X_i)$ can be defined and computed analogously.

Computing length $q - 1$ prefixes and suffixes of $\mathit{val}(X_i)$

For all variables $X_i$ and a positive integer $d$, let the array $P_d[i]$ (resp. $S_d[i]$) hold the value of $\mathit{pre}(X_i, d)$ (resp. $\mathit{suf}(X_i, d)$). The strings $\mathit{pre}(X_i, d)$ and $\mathit{suf}(X_i, d)$ can be computed in a total of $O((d + h)n)$ time and $O(dn)$ space using a dynamic programming recursion on $P_d[i]$ and $S_d[i]$ (footnote 2). The cases where $X_i = a$, $X_i = X_\ell X_r$ and $X_i = (X_s)^p$ were mentioned in Section 3.1. If $X_i = X_s^{[k]}$, then $P_d[i] = \mathit{pre}(P_d[s], |X_i|)$. Let us now consider the case where $X_i = {}^{[k]}X_s$. If $|X_i| \leq d$, then $P_d[i] = \mathit{suf}(S_d[s], |X_i|)$. Otherwise, $|X_i| > d$.
From the monotonicity of $|X_{u(x)}| - \mathit{trPre}_x - \mathit{trSuf}_x$, there exists a unique integer $x$ such that $X_{u(x)}$ and $X_{u(x+1)}$ are descendants of $X_i$, where $(X_{u(x)}, \mathit{trPre}_x, \mathit{trSuf}_x) = \mathit{trPrePath}_x(X_i)$, $(X_{u(x+1)}, \mathit{trPre}_{x+1}, \mathit{trSuf}_{x+1}) = \mathit{trPrePath}_{x+1}(X_i)$, $|X_{u(x)}| - \mathit{trPre}_x - \mathit{trSuf}_x \geq d$, $|X_{u(x+1)}| - \mathit{trPre}_{x+1} - \mathit{trSuf}_{x+1} < d$, and $X_{u(x)}$ is a concatenation or repetition. This means that $\mathit{pre}(X_i, d)$ crosses the boundary of the children of $X_{u(x)}$ and can be represented by their suffix and prefix. Thus, using this $X_{u(x)}$, we have for $X_i = {}^{[k]}X_s$,

\[
P_d[i] =
\begin{cases}
\mathit{suf}(S_d[\ell], |X_\ell| - \mathit{trPre}_x) \cdot \mathit{pre}(P_d[r], d - (|X_\ell| - \mathit{trPre}_x)) & \text{if } X_{u(x)} = X_\ell X_r, \\
\mathit{suf}(S_d[e], \alpha) \cdot (P_d[e])^{\beta} \cdot \mathit{pre}(P_d[e], (\mathit{trPre}_x + d) \bmod |X_e|) & \text{if } X_{u(x)} = (X_e)^p,
\end{cases}
\]

where $\alpha = |X_e| - (\mathit{trPre}_x \bmod |X_e|)$ and $\beta = \lfloor (d - (|X_e| - (\mathit{trPre}_x \bmod |X_e|))) / |X_e| \rfloor$. The corresponding variable $X_{u(x)}$ can be found in $O(h)$ time. $S_d[i]$ can be calculated analogously. Since $\mathit{pre}(X_i, d)$ and $\mathit{suf}(X_i, d)$ are strings of length at most $d$, they can be computed in a total of $O((d + h)n)$ time and $O(dn)$ space for all variables $X_i$.

Footnote 2: Unlike with truncation-free collage systems, $P_d[i]$ and $S_d[i]$ are not calculated independently.

Computing $\mathit{vOcc}(X_i)$

Here, we describe how the values of $\mathit{avOcc}(X_i)$, $\mathit{vOcc}(X_i)$, $\mathit{trPrevOcc}(X_i)$, $\mathit{trSufvOcc}(X_i)$, $\mathit{trvOcc}(X_i)$, and $\mathit{dvOcc}(X_i)$ are computed for each variable $X_i$.

Let $\mathit{trSufAnc}(X_i)$ be the set of pairs $(X_j, d)$ such that $X_j = X_s^{[k]}$, $X_i = X_{u(x)}$ and $d = \mathit{trSuf}_x > 0$ for some $x \geq 0$, where $\mathit{trSufPath}_x(X_j) = (X_{u(x)}, \mathit{trPre}_x, \mathit{trSuf}_x)$. See also Figure 2. The suffix truncation path of $X_s$ can contain at most one node that is labeled with $X_i$, and hence there is at most one such value $x$ for each pair of $i$ and $j$. Also, the first elements of any two pairs in $\mathit{trSufAnc}(X_i)$ are distinct, and therefore the size of $\mathit{trSufAnc}(X_i)$ does not exceed $n$.

Figure 2: The path between the white and gray area is the suffix truncation path for $X_s$, where $X_j = X_s^{[k]}$. $X_i$ lies on this path, and the suffix of $X_i$ of length $\mathit{trSuf}_x > 0$ is truncated in the truncated derivation tree of $X_j$.

Consider a conceptual $n \times n$ table $D$ such that

\[
D[i, j] =
\begin{cases}
\mathit{trSuf}_x & \text{if } X_j = X_s^{[k]},\ X_i = X_{u(x)} \text{ and } \mathit{trSuf}_x > 0 \text{ for some } x \geq 0, \\
0 & \text{otherwise.}
\end{cases}
\]

Obviously, the number of non-zero elements in each row $i$ does not exceed $n$. On the other hand, the number of non-zero elements in each column $j$ does not exceed $\mathit{height}(X_j)$ (see Figure 2). Hence the total number of non-zero elements in $D$ does not exceed $nh$, which means that $\sum_{i=1}^{n} |\mathit{trSufAnc}(X_i)| \leq nh$. We can compute $\mathit{trSufAnc}(X_i)$ for all $X_i$ in a total of $O(nh)$ time, where $h$ is the height of the collage system. After that, we sort each $\mathit{trSufAnc}(X_i)$ in increasing order of the second value of its pairs. The total time cost to sort $\mathit{trSufAnc}(X_i)$ for all $X_i$ is

\[
O\Bigl(\sum_{i=1}^{n} |\mathit{trSufAnc}(X_i)| \log |\mathit{trSufAnc}(X_i)|\Bigr) = O(nh \log n).
\]

The $l$-th element of $\mathit{trSufAnc}(X_i)$ is denoted by $\mathit{trSufAnc}(X_i)[l]$ for $1 \leq l \leq |\mathit{trSufAnc}(X_i)|$. $\mathit{trPreAnc}(X_i)$ can be defined and computed analogously.

Lemma 2 Let $\mathcal{T} = \{X_i = \mathit{expr}_i\}_{i=1}^{n}$ be a general collage system. Assume that, for all variables $X_i = {}^{[k]}X_s$ and $X_{i'} = X_{s'}^{[k']}$, $\mathit{trSufAnc}(X_i)$ and $\mathit{trPreAnc}(X_{i'})$ are already computed with their elements sorted.
Then, we can compute $\mathit{vOcc}(X_i)$, $\mathit{trPrevOcc}(X_i)$, $\mathit{trSufvOcc}(X_i)$, $\mathit{trvOcc}(X_i)$, $\mathit{dvOcc}(X_i)$, and $\mathit{avOcc}(X_i)$ for all variables $X_i$ in a total of $O(nh)$ time, where $h$ is the height of $\mathcal{T}$.

Proof. Clearly $\mathit{vOcc}(X_n) = \mathit{avOcc}(X_n) = 1$ and $\mathit{trPrevOcc}(X_n) = \mathit{trSufvOcc}(X_n) = \mathit{trvOcc}(X_n) = \mathit{dvOcc}(X_n) = 0$. Suppose that, for $i \leq n$, we have already computed $\mathit{vOcc}(X_{i'})$, $\mathit{avOcc}(X_{i'})$, $\mathit{trPrevOcc}(X_{i'})$, $\mathit{trSufvOcc}(X_{i'})$, $\mathit{trvOcc}(X_{i'})$, $\mathit{dvOcc}(X_{i'})$, $\mathit{trPreAnc}(X_{i'})$, and $\mathit{trSufAnc}(X_{i'})$ for all $i \leq i' \leq n$. We propagate some of those values to the descendants of $X_i$ as follows.

If $X_i = X_\ell X_r$, then there are also $\mathit{avOcc}(X_i)$ occurrences of $X_\ell$ in the derivation tree. Thus we increase $\mathit{avOcc}(X_\ell)$ by $\mathit{avOcc}(X_i)$. There are also $\mathit{dvOcc}(X_i)$ occurrences of $X_\ell$ that are completely truncated in the truncated derivation tree. Thus we increase $\mathit{dvOcc}(X_\ell)$ by $\mathit{dvOcc}(X_i)$. $\mathit{avOcc}(X_r)$ and $\mathit{dvOcc}(X_r)$ are computed similarly. This takes a total of $O(n)$ time for all $X_i = X_\ell X_r$.

Figure 3: The circles represent nodes (i.e. variables) that lie on the prefix truncation path of $X_i = {}^{[k]}X_s$. The white circles in the left diagram represent nodes in $W_l - W_{l+1}$. For each $Y \in W_l - W_{l+1}$, we increase either $\mathit{trPrevOcc}(Y)$ or $\mathit{trvOcc}(Y)$ by $\sum_{m=l}^{|\mathit{trSufAnc}(X_i)|} \mathit{vOcc}(X_{j(m)})$, depending on whether a non-empty suffix of $Y$ is truncated or not.

If $X_i = (X_s)^p$, then there are $p \cdot \mathit{avOcc}(X_i)$ occurrences of $X_s$ in the derivation tree, and there are $p \cdot \mathit{dvOcc}(X_i)$ occurrences of $X_s$ that are completely truncated in the truncated derivation tree. Thus we increase $\mathit{avOcc}(X_s)$ and $\mathit{dvOcc}(X_s)$ by $p \cdot \mathit{avOcc}(X_i)$ and $p \cdot \mathit{dvOcc}(X_i)$, respectively. This takes a total of $O(n)$ time for all $X_i = (X_s)^p$.
If $X_i = {}^{[k]}X_s$, then we increase $\mathit{avOcc}(X_s)$ and $\mathit{dvOcc}(X_s)$ by $\mathit{avOcc}(X_i)$ and $\mathit{dvOcc}(X_i)$, respectively. For $x \geq 0$, let $\mathit{trPrePath}_x(X_i) = (X_{u(x)}, \mathit{trPre}_x, \mathit{trSuf}_x)$. Consider the path $X_{u(0)} = X_s, X_{u(1)}, \ldots, X_{u(v)}$, where $v$ is the largest integer satisfying $\mathit{trPre}_v > 0$. By the definition of $\mathit{trPrePath}$, we know that $\mathit{trPre}_x > 0$ for any $0 \leq x \leq v$. Since $\mathit{trPre}_x + \mathit{trSuf}_x < |\mathit{val}(X_{u(x)})|$, we do not increase the value of $\mathit{dvOcc}(X_{u(x)})$ at this time. We increase $\mathit{trPrevOcc}(X_{u(x)})$ if $\mathit{trSuf}_x = 0$, and $\mathit{trvOcc}(X_{u(x)})$ if $\mathit{trSuf}_x > 0$, by $\mathit{vOcc}(X_i)$.

Now we consider the nodes that lie on the left of the path. If $X_{u(x)}$ is of the form $X_{u(x)} = X_\ell X_r$ and $\mathit{trPre}_x \geq |X_\ell|$, then $X_\ell$ is completely truncated in the truncated derivation tree. Hence we increase $\mathit{dvOcc}(X_\ell)$ by $\mathit{vOcc}(X_i) + \mathit{trSufvOcc}(X_i)$. If $X_{u(x)}$ is of the form $X_{u(x)} = (X_e)^p$, then the first $\lfloor \mathit{trPre}_x / |X_e| \rfloor$ repetitions of $X_e$ are completely truncated, and hence we increase $\mathit{dvOcc}(X_e)$ by $\lfloor \mathit{trPre}_x / |X_e| \rfloor \cdot (\mathit{vOcc}(X_i) + \mathit{trSufvOcc}(X_i))$.

Further care is taken for the occurrences of $X_i$ whose non-empty suffix is truncated due to its ancestors corresponding to $\mathit{trSufAnc}(X_i)$, as follows. For each $1 \leq l \leq |\mathit{trSufAnc}(X_i)|$, let $(X_{j(l)}, d(l)) = \mathit{trSufAnc}(X_i)[l]$, where $X_{j(l)} = X_p^{[g]}$. By definition, on the suffix truncation path of $X_p$ there exists a subtree rooted at $X_i$ whose suffix of length $d(l)$ is truncated. A key observation is that the nodes which lie on the prefix truncation path of $X_i$ but do not lie on the suffix truncation path of $X_{j(l)}$ have $\mathit{vOcc}(X_{j(l)})$ occurrences in the truncated derivation tree of $X_{j(l)}$. Let $W_l$ be the subset of these nodes which consists of the nodes whose non-empty prefix is truncated.
For each variable $Y \in W_l$, either $\mathit{trPrevOcc}(Y)$ or $\mathit{trvOcc}(Y)$ has to be increased by $\mathit{vOcc}(X_{j(l)})$ accordingly. For each fixed $X_i$, we have to do this for all the ancestors of $X_i$ corresponding to $\mathit{trSufAnc}(X_i)$. If this is done separately for each ancestor, it takes a total of $O(n^2 h)$ time for all $i$. We can however speed this up by processing the elements of $\mathit{trSufAnc}(X_i)$ in increasing order of $d(l)$: for each $1 \leq l \leq |\mathit{trSufAnc}(X_i)|$, we propagate $\sum_{m=l}^{|\mathit{trSufAnc}(X_i)|} \mathit{vOcc}(X_{j(m)})$ to the nodes in $W_l - W_{l+1}$ (see also Figure 3), where we let $W_{|\mathit{trSufAnc}(X_i)|+1} = \emptyset$ for simplicity. For each fixed $X_i$, this can take $O(n)$ time. However, the overall time complexity is $O(nh)$ for all $X_i$, since $\sum_{i=1}^{n} |\mathit{trSufAnc}(X_i)| = O(nh)$ as stated previously. For the nodes that lie on the left of the prefix truncation path of $X_i$, we increase their $\mathit{dvOcc}$ value by $\sum_{m=1}^{|\mathit{trSufAnc}(X_i)|} \mathit{vOcc}(X_{j(m)})$. This can also be done in $O(nh)$ time.

If $X_i = X_s^{[k]}$, then the values are propagated similarly to the case of $X_i = {}^{[k]}X_s$, in a total of $O(nh)$ time.

Figure 4: A non-empty truncated prefix and a possibly non-empty truncated suffix of $X_{u(x)} = X_\ell X_r$ are shown in gray. The weights for $w_{u(x)}$ are set accordingly for the white range of $t_{u(x)}$.

Algorithm 2 in the appendix shows pseudo-code of our algorithm to compute $\mathit{vOcc}(X_i)$, $\mathit{trPrevOcc}(X_i)$, $\mathit{trSufvOcc}(X_i)$, $\mathit{trvOcc}(X_i)$, $\mathit{dvOcc}(X_i)$, and $\mathit{avOcc}(X_i)$.

Construction of weight array

As with truncation-free collage systems, we again consider reducing Problem 1 to Problem 2 of computing weighted $q$-gram frequencies on a single uncompressed string.
For each $q$-gram occurrence in the text, we again identify the lowest variable in the truncated derivation tree of $\mathcal{T}$ that contains the $q$-gram occurrence. Observe that, in this strategy, no $q$-grams will be identified with a truncation variable $X$, as there always exists a non-truncation descendant of $X$ with which the corresponding $q$-grams are identified. Thus we construct the string $t_i$ for variables $X_i = X_\ell X_r$ and $X_i = (X_s)^p$ as in Section 3.1, and it remains to set the value of $w_i[j]$ so that it represents the total number of occurrences of the $q$-gram in the text corresponding to $t_i[j : j + q - 1]$ derived by $X_i$.

Firstly, we consider complete (i.e. non-truncated) occurrences of variable $X_i$ in the truncated derivation tree of the collage system. By definition, there are $\mathit{vOcc}(X_i)$ such occurrences, and hence we set the weights for $w_i$ in a similar way to Section 3.1.

Secondly, we consider the occurrences of $X_i$ where a non-empty prefix and/or non-empty suffix of the leaf-label string of the subtree rooted at $X_i$ is truncated in the truncated derivation tree of the collage system. Consider a variable $X_y = {}^{[k]}X_s$ with $y > i$ and let $v \geq 0$ be the largest integer satisfying $\mathit{trPre}_v > 0$, where $\mathit{trPrePath}_v(X_y) = (X_{u(v)}, \mathit{trPre}_v, \mathit{trSuf}_v)$. Assume that there exists an integer $0 \leq x \leq v$ such that $u(x) = i$, where $\mathit{trPrePath}_x(X_y) = (X_{u(x)}, \mathit{trPre}_x, \mathit{trSuf}_x)$. This implies that $X_i$ lies on the prefix truncation path of $X_y$ and a non-empty prefix of $X_i$ is truncated in the truncated derivation tree of $X_y$. We have the following cases depending on the type of $X_{u(x)}$ (recall $u(x) = i$).

If $X_{u(x)} = X_\ell X_r$, there are two sub-cases: (1) If $\mathit{trPre}_x \geq |X_\ell|$ or $\mathit{trSuf}_x \geq |X_r|$, then no $q$-grams are identified with this occurrence of $X_{u(x)}$ ($= X_i$).
(2) If $\mathit{trPre}_x < |X_\ell|$ and $\mathit{trSuf}_x < |X_r|$, then $X_{u(x)}$ $(= X_i)$ derives the string $\mathit{val}(X_\ell)[\mathit{trPre}_x + 1 : |X_\ell|] \cdot \mathit{val}(X_r)[1 : |X_r| - \mathit{trSuf}_x]$. Then the string $t_{u(x)}[\max(1, \mathit{trPre}_x - \max(0, |X_\ell| - q + 1) + 1) : \min(q-1, |X_\ell|) - \max(0, \mathit{trSuf}_x + q - 1 - |X_r|) + q - 1]$ crosses the boundary of $X_\ell$ and $X_r$, so we increase the weight of $w_{u(x)}[j]$ by $\mathit{vOcc}(X_i)$ for each $j$, where $\max(1, \mathit{trPre}_x - \max(0, |X_\ell| - q + 1) + 1) \le j \le \min(q-1, |X_\ell|) - \max(0, \mathit{trSuf}_x + q - 1 - |X_r|)$. See also Figure 4.

If $X_{u(x)} = (X_e)^p$, let $r = p - \lfloor \mathit{trPre}_x / |X_e| \rfloor - \lfloor \mathit{trSuf}_x / |X_e| \rfloor - 2$. This occurrence of $X_{u(x)}$ derives the string $\mathit{val}(X_e)[(\mathit{trPre}_x \bmod |X_e|) + 1 : |X_e|] \cdot \mathit{val}(X_e)^{\max\{0, r\}} \cdot \mathit{val}(X_e)[1 : (\mathit{trSuf}_x \bmod |X_e|) - 1]$. In what follows we consider the case where $r > 0$ and $|X_e| < q \le |X_e| r$. Let $g = |X_e| - ((q-1) \bmod |X_e|)$. There are four types of occurrences of the $q$-gram $t_u[j : j+q-1]$:

• $t_u[j : j+q-1]$ occurs $(r - \lceil q/|X_e| \rceil + 1)$ times for $1 \le j \le g$, within the $(X_e)^r$ term.
• $t_u[j : j+q-1]$ occurs $(r - \lceil q/|X_e| \rceil)$ times for $g < j < q$, within the $(X_e)^r$ term.
• $t_u[j : j+q-1]$ occurs crossing the boundary of ${}^{[\mathit{trPre}_x \bmod |X_e|]}X_e$ and $(X_e)^r$ for $(\mathit{trPre}_x \bmod |X_e|) < j \le |X_e|$.
• $t_u[j : j+q-1]$ occurs crossing the boundary of $(X_e)^r$ and $X_e^{[\mathit{trSuf}_x \bmod |X_e|]}$ for $1 \le j \le |X_e| - ((\mathit{trSuf}_x + q - 1) \bmod |X_e|)$.

We can set the weights of $w_{u(x)}$ for each of the four ranges of $j$ above accordingly. For example, if $X_{u(x)} = (X_e)^9$, $\mathit{trPre}_x = 4$, $\mathit{trSuf}_x = 5$, $\mathit{val}(X_e) = aba$ and $q = 5$, then we have $t_{u(x)} = abaabaa$ and $w_{u(x)} = [4, 5, 5, 0, 0, 0, 0]$ (see also Figure 5). For the other cases, we can compute the weights similarly.

Figure 5: Illustration for ${}^{[\mathit{trPre}_x]}X_e \, (X_e)^5 \, X_e^{[\mathit{trSuf}_x]}$ with $\mathit{trPre}_x = 4$ and $\mathit{trSuf}_x = 5$. Variable $X_e$ derives the string $aba$. The number of 5-grams starting inside ${}^{[\mathit{trPre}_x]}X_e$ is 2, the number of 5-grams completely contained within $(X_e)^5$ is 11, and the number of 5-grams ending inside $X_e^{[\mathit{trSuf}_x]}$ is 1.

Note that there are $O(h)$ variables in the prefix truncation path of $X_y = {}^{[k]}X_s$. This may lead to $O(qnh)$ time complexity, as the total length of the $w$ array is $O(qn)$. We can, however, reduce the time cost to $O((q+h)n)$ using a differential representation $\mathit{witv}$ of $w$ such that $w[j] = \sum_{l=1}^{j} \mathit{witv}[l]$ for every $1 \le j \le |w|$. Given positive integers $b, e$ such that $1 \le b \le e \le |w|$, increasing the value of $w[j]$ by $d$ for all $b \le j \le e$ reduces to increasing the value of $\mathit{witv}[b]$ by $d$ and decreasing the value of $\mathit{witv}[e+1]$ by $d$, which can be done in $O(1)$ time. For all variables $X_i = X_\ell X_r$ and $X_i = (X_s)^p$, we can compute the weight arrays $\mathit{witv}_i$ in $O(n)$ time. For all variables $X_i = {}^{[k]}X_s$ and $X_i = X_s^{[k]}$, we can compute the weight arrays $\mathit{witv}_{u(x)}$ for all variables $X_{u(x)}$ in the prefix or suffix truncation path of $X_i$, in $O(hn)$ time. Then $w$ can be obtained by a simple scan of $\mathit{witv}$ in $O(qn)$ time.

Now, we construct a string $z$ by concatenating each $t_i$ with $q \le |t_i| \le 2(q-1)$, and its corresponding weight array $w$ by concatenating each $w_i$ with $q \le |w_i| \le 2(q-1)$. Then, Problem 1 for a general collage system reduces to Problem 2 of weighted $q$-gram frequencies on a single uncompressed string, and hence we obtain:

Theorem 3 Problem 1 can be solved in $O((q + h \log n) n)$ time and $O(qn)$ space, for general collage systems.
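The differential representation used in the reduction can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names `range_add` and `materialize` are hypothetical, and the four range updates are the Figure 5 counts (11, 2 and 1) split by start position $j$, worked out by hand for this example rather than taken from the general formulas.

```python
from itertools import accumulate

def range_add(witv, b, e, d):
    """Add d to w[b..e] (1-based, inclusive) by touching two cells of witv."""
    witv[b - 1] += d
    if e < len(witv):  # witv[e + 1] in 1-based terms; nothing to undo past the end
        witv[e] -= d

def materialize(witv):
    """Recover w by one prefix-sum scan: w[j] = witv[1] + ... + witv[j]."""
    return list(accumulate(witv))

# Figure 5 example: t = "abaabaa", so the weight array has length 7.
witv = [0] * 7
range_add(witv, 1, 2, 4)  # 5-grams inside (X_e)^5 starting at j = 1, 2
range_add(witv, 3, 3, 3)  # 5-grams inside (X_e)^5 starting at j = 3
range_add(witv, 2, 3, 1)  # 5-grams starting inside the truncated first copy
range_add(witv, 3, 3, 1)  # 5-gram ending inside the truncated last copy
w = materialize(witv)
```

Each update costs $O(1)$ regardless of the width of the range, so the $O(h)$ variables on a truncation path cost $O(h)$ in total, and the single scan at the end is proportional to the length of the array.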
References

[1] Bille, P., Landau, G.M., Raman, R., Sadakane, K., Satti, S.R., Weimann, O.: Random access to grammar-compressed strings. In: Proc. SODA'11. pp. 373–389 (2011)
[2] Gąsieniec, L., Karpinski, M., Plandowski, W., Rytter, W.: Efficient algorithms for Lempel-Ziv encoding. In: Proc. SWAT'96. LNCS, vol. 1097, pp. 392–403. Springer (1996)
[3] Goto, K., Bannai, H., Inenaga, S., Takeda, M.: Towards efficient mining and classification on compressed strings. In: Accepted for SPIRE'11 (2011), available as arXiv:1103.3114v2
[4] Hermelin, D., Landau, G.M., Landau, S., Weimann, O.: A unified algorithm for accelerating edit-distance computation via text-compression. In: Proc. STACS'09. pp. 529–540 (2009)
[5] Inenaga, S., Bannai, H.: Finding characteristic substring from compressed texts. In: Proc. The Prague Stringology Conference 2009. pp. 40–54 (2009), full version to appear in the International Journal of Foundations of Computer Science
[6] Karpinski, M., Rytter, W., Shinohara, A.: An efficient pattern-matching algorithm for strings with short descriptions. Nordic Journal of Computing 4, 172–186 (1997)
[7] Kida, T., Shibata, Y., Takeda, M., Shinohara, A., Arikawa, S.: Collage system: A unifying framework for compressed pattern matching. Theoretical Computer Science 298(1), 253–272 (2003)
[8] Larsson, N.J., Moffat, A.: Offline dictionary-based compression. In: Proc. DCC'99. pp. 296–305. IEEE Computer Society (1999)
[9] Lifshits, Y.: Processing compressed texts: A tractability border. In: Proc. CPM 2007. LNCS, vol. 4580, pp. 228–240 (2007)
[10] Manber, U., Myers, G.: Suffix arrays: A new method for on-line string searches. SIAM Journal on Computing 22(5), 935–948 (1993)
[11] Matsubara, W., Inenaga, S., Ishino, A., Shinohara, A., Nakamura, T., Hashimoto, K.: Efficient algorithms to compute compressed longest common substrings and compressed palindromes. Theoretical Computer Science 410(8–10), 900–913 (2009)
[12] Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Computing Surveys 39(1), 2 (2007)
[13] Nevill-Manning, C.G., Witten, I.H., Maulsby, D.L.: Compression by induction of hierarchical grammars. In: Proc. DCC'94. pp. 244–253 (1994)
[14] Welch, T.A.: A technique for high performance data compression. IEEE Computer 17, 8–19 (1984)
[15] Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Transactions on Information Theory IT-23(3), 337–349 (1977)
[16] Ziv, J., Lempel, A.: Compression of individual sequences via variable-length coding. IEEE Transactions on Information Theory 24(5), 530–536 (1978)

A Appendix

Algorithm 1: Calculating $q$-gram frequencies of a truncation-free collage system for $q \ge 2$
Input: SLP $\mathcal{T} = \{X_i\}_{i=1}^{n}$ representing string $T$, integer $q \ge 2$.
Report: all $q$-grams and their frequencies which occur in $T$.
    Calculate $\mathit{vOcc}(X_i)$ for all $1 \le i \le n$;
    Calculate $\mathit{pre}(\mathit{val}(X_i), q-1)$ and $\mathit{suf}(\mathit{val}(X_i), q-1)$ for all $1 \le i \le n-1$;
    $z \leftarrow \varepsilon$; $w \leftarrow []$;
    for $i \leftarrow 1$ to $n$ do
        if $|X_i| \ge q$ then
            if $X_i = X_\ell X_r$ and $|X_i| \ge q$ then
                $t_i = \mathit{suf}(\mathit{val}(X_\ell), q-1)\,\mathit{pre}(\mathit{val}(X_r), q-1)$;
                $w_i \leftarrow$ create integer array of length $|t_i|$, each element set to 0;
                for $j \leftarrow 1$ to $|t_i| - q + 1$ do $w_i[j] \leftarrow \mathit{vOcc}(X_i)$;
            else if $X_i = (X_s)^p$ and $|X_s| \ge q$ then
                $t_i = \mathit{suf}(\mathit{val}(X_s), q-1)\,\mathit{pre}(\mathit{val}(X_s), q-1)$;
                $w_i \leftarrow$ create integer array of length $|t_i|$, each element set to 0;
                for $j \leftarrow 1$ to $|t_i| - q + 1$ do $w_i[j] \leftarrow \mathit{vOcc}(X_i) \cdot (p - 1)$;
            else if $X_i = (X_s)^p$ and $|X_s| < q$ then
                $t_i = \mathit{pre}(\mathit{val}(X_s)^{\min\{p, \lceil (|X_s| + q - 1)/|X_s| \rceil\}}, |X_s| + q - 1)$;
                $w_i \leftarrow$ create integer array of length $|t_i|$, each element set to 0;
                $y = |X_s| - ((q-1) \bmod |X_s|)$;
                for $j \leftarrow 1$ to $y$ do $w_i[j] \leftarrow \mathit{vOcc}(X_i) \cdot (p - \lceil q/|X_s| \rceil + 1)$;
                for $j \leftarrow y + 1$ to $|X_s|$ do $w_i[j] \leftarrow \mathit{vOcc}(X_i) \cdot (p - \lceil q/|X_s| \rceil)$;
            $z$.append($t_i$);
            $w$.append($w_i$);
    Report $q$-gram frequencies in $z$, where each $q$-gram $z[i : i+q-1]$ is weighted by $w[i]$.

Algorithm 2: Calculate $\mathit{vOcc}(X_i)$ for all variables of a general collage system
Input: A general collage system $\mathcal{T} = \{X_i\}_{i=1}^{n}$
Output: $\mathit{vOcc}(X_i)$ for all $1 \le i \le n$
    compute $\mathit{trPreAnc}(X_i)$, $\mathit{trSufAnc}(X_i)$ for all variables $X_i$;
    Initialize the values of $\mathit{avOcc}(X_i)$, $\mathit{vOcc}(X_i)$, $\mathit{trPrevOcc}(X_i)$, $\mathit{trSufvOcc}(X_i)$, $\mathit{trvOcc}(X_i)$, $\mathit{dvOcc}(X_i)$ to 0 for all $X_i$;
    $\mathit{avOcc}(X_n) \leftarrow 1$;
    for $i \leftarrow n$ to $1$ do
        $\mathit{vOcc}(X_i) \leftarrow \mathit{avOcc}(X_i) - \mathit{dvOcc}(X_i) - \mathit{trvOcc}(X_i) - \mathit{trPrevOcc}(X_i) - \mathit{trSufvOcc}(X_i)$;
        if $X_i = X_\ell X_r$ then
            $\mathit{avOcc}(X_\ell) \leftarrow \mathit{avOcc}(X_\ell) + \mathit{avOcc}(X_i)$; $\mathit{avOcc}(X_r) \leftarrow \mathit{avOcc}(X_r) + \mathit{avOcc}(X_i)$;
            $\mathit{dvOcc}(X_\ell) \leftarrow \mathit{dvOcc}(X_\ell) + \mathit{dvOcc}(X_i)$; $\mathit{dvOcc}(X_r) \leftarrow \mathit{dvOcc}(X_r) + \mathit{dvOcc}(X_i)$;
        else if $X_i = (X_s)^p$ then
            $\mathit{avOcc}(X_s) \leftarrow \mathit{avOcc}(X_s) + p * \mathit{avOcc}(X_i)$;
            $\mathit{dvOcc}(X_s) \leftarrow \mathit{dvOcc}(X_s) + p * \mathit{dvOcc}(X_i)$;
        else if $X_i = {}^{[k]}X_s$ then
            $\mathit{avOcc}(X_s) \leftarrow \mathit{avOcc}(X_s) + \mathit{avOcc}(X_i)$; $\mathit{dvOcc}(X_s) \leftarrow \mathit{dvOcc}(X_s) + \mathit{dvOcc}(X_i)$;
            $x \leftarrow 0$; $l \leftarrow 1$; $\mathit{trR} \leftarrow 0$; $\mathit{trSsum} \leftarrow 0$;
            $(X_{u(x)}, \mathit{trPre}_x, \mathit{trSuf}_x) \leftarrow \mathit{trPrePath}_x(X_s, k)$;
            while $\mathit{trPre}_x > 0$ do
                $(X_{j(l)}, d(l)) \leftarrow \mathit{trSufAnc}(X_i)[l]$;
                while $\mathit{trR} \le d(l)$ do
                    $\mathit{trSsum} \leftarrow \mathit{trSsum} + \mathit{vOcc}(X_{j(l)})$;
                    $l \leftarrow l + 1$; $(X_{j(l)}, d(l)) \leftarrow \mathit{trSufAnc}(X_i)[l]$;
                // propagate $\mathit{vOcc}(X_i) + \sum_{m=1}^{l-1} \mathit{vOcc}(X_{j(m)})$ to nodes in $W_l - W_{l+1}$.
                if $\mathit{trSuf}_x > 0$ then $\mathit{trvOcc}(X_{u(x)}) \leftarrow \mathit{trvOcc}(X_{u(x)}) + \mathit{vOcc}(X_i) + \mathit{trSsum}$;
                else $\mathit{trPrevOcc}(X_{u(x)}) \leftarrow \mathit{trPrevOcc}(X_{u(x)}) + \mathit{vOcc}(X_i) + \mathit{trSsum}$;
                if $X_{u(x)} = X_\ell X_r$ then
                    if $|X_\ell| \le \mathit{trPre}_x$ then
                        $\mathit{dvOcc}(X_\ell) \leftarrow \mathit{dvOcc}(X_\ell) + (\mathit{vOcc}(X_i) + \mathit{trSufvOcc}(X_i))$;
                    else $\mathit{trR} \leftarrow \mathit{trR} + |X_r|$;
                else if $X_{u(x)} = (X_e)^p$ then
                    $\mathit{dvOcc}(X_e) \leftarrow \mathit{dvOcc}(X_e) + \lfloor \mathit{trPre}_x / |X_e| \rfloor * (\mathit{vOcc}(X_i) + \mathit{trSufvOcc}(X_i))$;
                    $\mathit{trR} \leftarrow \mathit{trR} + p - \lceil |\mathit{trPre}_x| / |X_e| \rceil$;
                $x \leftarrow x + 1$; $(X_{u(x)}, \mathit{trPre}_x, \mathit{trSuf}_x) \leftarrow \mathit{trPrePath}_x(X_s, k)$;
        else if $X_i = X_s^{[k]}$ then
            // omitted: analogous to prefix truncation
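As a complement to Algorithm 1 in this appendix, the following is a rough executable sketch of the truncation-free case (concatenation and repetition only). The rule encoding (`"chr"`, `"cat"`, `"rep"` tuples with 0-based indices), the function names, and the use of a hash-map counter in place of the concatenated pair $(z, w)$ are all assumptions made for illustration; `expand` decompresses the text and is included only to cross-check results on small inputs.

```python
from collections import Counter

def qgram_freqs(rules, q):
    """rules[i] is ("chr", c), ("cat", l, r) or ("rep", s, p); children precede
    parents and the last rule derives the text. Requires q >= 2."""
    n = len(rules)
    length, pre, suf = [0] * n, [""] * n, [""] * n
    for i, rule in enumerate(rules):  # lengths and (q-1)-length prefixes/suffixes
        if rule[0] == "chr":
            length[i], pre[i], suf[i] = 1, rule[1], rule[1]
        elif rule[0] == "cat":
            _, l, r = rule
            length[i] = length[l] + length[r]
            pre[i] = (pre[l] + pre[r])[: q - 1]
            suf[i] = (suf[l] + suf[r])[-(q - 1):]
        else:
            _, s, p = rule
            length[i] = length[s] * p
            k = min(p, -(-(q - 1) // len(pre[s])))  # copies needed to reach q-1 chars
            pre[i] = (pre[s] * k)[: q - 1]
            suf[i] = (suf[s] * k)[-(q - 1):]
    vocc = [0] * n  # occurrences of each variable in the derivation tree
    vocc[-1] = 1
    for i in range(n - 1, -1, -1):
        rule = rules[i]
        if rule[0] == "cat":
            vocc[rule[1]] += vocc[i]
            vocc[rule[2]] += vocc[i]
        elif rule[0] == "rep":
            vocc[rule[1]] += rule[2] * vocc[i]
    freq = Counter()
    for i, rule in enumerate(rules):
        if rule[0] == "chr" or length[i] < q or vocc[i] == 0:
            continue
        if rule[0] == "cat":  # q-grams crossing the boundary of X_l and X_r
            t = suf[rule[1]] + pre[rule[2]]
            w = [vocc[i]] * (len(t) - q + 1)
        else:
            _, s, p = rule
            m = length[s]
            if m >= q:  # q-grams crossing one of the p-1 copy boundaries
                t = suf[s] + pre[s]
                w = [vocc[i] * (p - 1)] * (len(t) - q + 1)
            else:  # every q-gram spans several copies; pre[s] is val(X_s) here
                k = min(p, -(-(m + q - 1) // m))
                t = (pre[s] * k)[: m + q - 1]
                y = m - (q - 1) % m
                c = p - (-(-q // m))  # p - ceil(q / m)
                w = [vocc[i] * (c + 1 if j <= y else c) for j in range(1, m + 1)]
        for j, weight in enumerate(w):
            if weight > 0:
                freq[t[j:j + q]] += weight
    return freq

def expand(rules, i=None):
    """Decompress variable i (default: the last rule); for validation only."""
    i = len(rules) - 1 if i is None else i
    rule = rules[i]
    if rule[0] == "chr":
        return rule[1]
    if rule[0] == "cat":
        return expand(rules, rule[1]) + expand(rules, rule[2])
    return expand(rules, rule[1]) * rule[2]
```

Counting each $t_i$ separately avoids having to reason about windows that straddle two consecutive $t_i$ in $z$; the weights themselves follow the three cases of Algorithm 1.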
