Restructuring Compressed Texts without Explicit Decompression

Keisuke Goto (1), Shirou Maruyama (1), Shunsuke Inenaga (1), Hideo Bannai (1), Hiroshi Sakamoto (2), Masayuki Takeda (1)
(1) Kyushu University, Japan
(2) Kyushu Institute of Technology, Japan
shiro.maruyama@i.kyushu-u.ac.jp, {keisuke.gotou,bannai,inenaga,takeda}@inf.kyushu-u.ac.jp, hiroshi@ai.kyutech.ac.jp

Abstract

We consider the problem of restructuring compressed texts without explicit decompression. We present algorithms which allow conversions from compressed representations of a string $T$ produced by any grammar-based compression algorithm, to representations produced by several specific compression algorithms including LZ77, LZ78, run length encoding, and some grammar-based compression algorithms. These are the first algorithms that achieve running times polynomial in the size of the compressed input and output representations of $T$. Since most of the representations we consider can achieve exponential compression, our algorithms are theoretically faster in the worst case than any algorithm which first decompresses the string for the conversion.

1 Introduction

Data compression is an indispensable technology for the handling of large scale data available today. The traditional objective of compression has been to save storage and communication costs, whereas actually using the data normally requires a decompression step which can require enormous computational resources.
However, recent advances in compressed string processing algorithms give us an intriguing new perspective in which compression can be regarded as a form of pre-processing which not only reduces space requirements for storage, but allows efficient processing of the strings, including compressed pattern matching [25, 40, 10, 11], string indices [33, 7, 22], edit distance and its variants [9, 15, 38], and various other applications [12, 16, 14, 30, 2]. These methods assume a compressed representation of the text as input, and process it without explicit decompression. An interesting property of these methods is that they can be theoretically, and sometimes even practically, faster than algorithms which work on an uncompressed representation of the same data.

The main focus of this paper is to develop a framework in which various processing on strings can be conducted entirely in the world of compressed representations. A primary tool for this objective is restructuring, or conversion, of the compressed representation. Key results for this problem were obtained independently by Rytter [36] and Charikar et al. [5]: given a non-self-referential LZ77-encoding of size $n$ that represents a string of length $N$, they gave algorithms for constructing a balanced grammar of size at most $O(n \log(N/n))$ in output linear time. The size of the resulting grammar is an $O(\log(N/g))$ approximation of the smallest grammar, whose size is $g$. Grammars are generally easier to handle than LZ-encodings, for example, in compressed pattern matching [11], and this result is the motivational backbone of many efficient algorithms on grammar compressed strings.
Our Results: In this paper, we present a comprehensive collection of new algorithms for restructuring to and from compressed texts represented in terms of run length encoding (RLE), LZ77 and LZ78 encodings, the grammar-based compressors Re-pair and Bisection, edit sensitive parsing (ESP), straight line programs (SLPs), and admissible grammars. All algorithms achieve running times polynomial in the size of the compressed input and output representations of the string. Since (most of) the representations we consider can achieve exponential compression, our algorithms are theoretically faster in the worst case than any algorithm which first decompresses the string for the conversion. Figure 1 summarizes our results. Our algorithms immediately allow the following applications to be solvable in polynomial time in the compressed world:

Dynamic compressed texts: Although data structures for dynamic compressed texts have been studied somewhat in the literature [3, 13, 4, 27, 35, 23], grammar-based or LZ77 compression have not been considered in this perspective. It has recently been argued that for highly repetitive strings, grammar-based compression and LZ77 compression algorithms are better suited and achieve better compression [7, 22]. Modification of the grammar corresponding to edit operations on the string can be conducted in $O(h)$ time, where $h$ is the height of the grammar. (Note that when the grammar is balanced, $h = O(\log N)$ even in the worst case.) However, these modifications are ad hoc, and do not assure that the resulting grammar is a good compressed representation of the string, and repeated edit operations will inevitably cause degradation of the compression ratio. Note that the previous work of Rytter and Charikar et al. is not sufficient in this respect: their algorithms can balance an arbitrary grammar, but they must be given an LZ-encoding of the modified string in order for the grammar to be small.

Post-selection of compression format: Some methods in the field of data mining and machine learning utilize compression as a means of detecting and extracting meaningful information from string data [8, 6]. Compression of a given string is achieved by exploiting various regularities contained in the string, and since different compression algorithms capture different regularities, the usefulness of a specific representation will vary depending on the application. As it is impossible to predetermine the best compression algorithm for all future applications, conversion of the representation is an essential task.

Figure 1: Summary of transformations between compressed representations (diagram not reproduced; its nodes are RLE, LZ78, LZ77 with self-references, LZ77 without self-references, Re-pair, Bisection, ESP, SLPs, $\alpha$-balanced SLPs, AVL-balanced SLPs, and admissible grammars). The label of each arc shows the time complexity of each transformation, where $n$ and $m$ are respectively the input and output sizes of each transformation, and $N$ is the length of the uncompressed string. The broken arcs mean naive $O(n)$-time transformations. Complexities without references are results shown in this paper.
For example, the normalized compression distance (NCD) [6] between two strings $X$ and $Y$ with respect to compression algorithm $A$ is defined by the values $C_A(XY)$, $C_A(X)$, and $C_A(Y)$, which respectively denote the sizes of the compressed representations of strings $XY$, $X$, and $Y$ when compressed by algorithm $A$. Restructuring enables us to solve, in the compressed world, the problem of calculating the NCD with respect to some compression algorithm, given strings which were compressed previously by a (possibly) different compression algorithm.

2 Preliminaries

2.1 Notations

Let $\Sigma$ be a finite alphabet. An element of $\Sigma^*$ is called a string. The length of a string $S$ is denoted by $|S|$. The empty string $\varepsilon$ is a string of length 0, namely, $|\varepsilon| = 0$. For a string $S = XYZ$, $X$, $Y$ and $Z$ are called a prefix, substring, and suffix of $S$, respectively. The set of all substrings of a string $S$ is denoted by $Substr(S)$. The $i$-th character of a string $S$ is denoted by $S[i]$ for $1 \le i \le |S|$, and the substring of a string $S$ that begins at position $i$ and ends at position $j$ is denoted by $S[i:j]$ for $1 \le i \le j \le |S|$. For convenience, let $S[i:j] = \varepsilon$ if $j < i$. For any strings $S$ and $P$, let $Occ(S, P)$ be the set of occurrences of $P$ in $S$, i.e., $Occ(S, P) = \{ k > 0 \mid S[k : k + |P| - 1] = P \}$.

We shall assume that the computer word size is at least $\log |S|$, and hence, values representing lengths and positions of $S$ in our algorithms can be manipulated in constant time.

2.2 Suffix Arrays and LCP Arrays

The suffix array $SA$ [28] of any string $S$ is an array of length $|S|$ such that $SA[i] = j$, where $S[j : |S|]$ is the $i$-th lexicographically smallest suffix of $S$. Let $lcp(S_1, S_2)$ be the length of the longest common prefix of $S_1$ and $S_2$.
The lcp array of any string $S$ is an array of length $|S|$ such that $LCP[i]$ is $lcp(S[SA[i-1] : |S|], S[SA[i] : |S|])$ for $2 \le i \le |S|$, and $LCP[1] = 0$. The suffix array for any string $S$ can be constructed in $O(|S|)$ time (e.g. [17]) assuming an integer alphabet. Given the string and suffix array, the lcp array can also be calculated in $O(|S|)$ time [19].

2.3 Run Length Encoding

Definition 1 The Run-Length (RL) factorization of a string $S$ is the factorization $f_1, \ldots, f_n$ of $S$ such that for every $i = 1, \ldots, n$, factor $f_i$ is the longest prefix of $f_i \cdots f_n$ with $f_i \in F$, where $F = \bigcup_{a \in \Sigma} \{ a^p \mid p > 0 \}$.

We note that each factor $f_i$ can be written as $f_i = a_i^{p_i}$ for some symbol $a_i \in \Sigma$ and some integer $p_i > 0$, and the repeating symbols $a_i$ and $a_{i+1}$ of consecutive factors $f_i$ and $f_{i+1}$ are different. The output of RLE is a sequence of pairs of symbol $a_i$ and integer $p_i$. The number of distinct bigrams occurring in $S$ is at most $2n - 1$, since these are $\{ a_i a_i \mid 1 \le i \le n \} \cup \{ a_i a_{i+1} \mid 1 \le i < n \}$.

2.4 LZ Encodings

LZ encodings are dynamic dictionary based encodings. There are two main variants of LZ encodings, LZ78 and LZ77. The LZ78 encoding [42] has several variants. One of the most popular variants is the LZW encoding [39], which is based on the LZ78 factorization defined below.

Definition 2 (LZ78 factorization) The LZ78-factorization of a string $S$ is the factorization $f_1, \ldots, f_n$ of $S$ where for every $i = 1, \ldots, n$, factor $f_i$ is the longest prefix of $f_i \cdots f_n$ with $f_i \in F_i$, where $F_i$ is defined by $F_1 = \Sigma$ and $F_{i+1} = F_i \cup \{ f_i f_{i+1}[1] \}$.

The output is the sequence of IDs of factors $f_i$ in $F_i$. We note that $F_i$ can be recovered from this sequence and thus is not included in the output. The LZ77 encoding [41] also has many variants. The LZSS encoding [37] is based on the LZ77 factorization below.
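As an illustration of Definitions 1 and 2, the following is a minimal Python sketch that computes the RL factorization and the LZ78 factorization directly on an uncompressed string. It is not one of the conversion algorithms of this paper. The function names `rl_factorization` and `lz78_factorization` are hypothetical, $\Sigma$ is assumed to be the set of symbols occurring in the input, and the LZ78 function returns the factors themselves rather than their IDs.

```python
from itertools import groupby


def rl_factorization(s):
    """Definition 1: maximal runs of s as (symbol, run length) pairs."""
    return [(symbol, len(list(run))) for symbol, run in groupby(s)]


def lz78_factorization(s):
    """Definition 2: greedy factorization with F_1 = Sigma and
    F_{i+1} = F_i united with {f_i f_{i+1}[1]}."""
    known = set(s)  # F_1: assumed to be the symbols occurring in s
    factors = []
    pos = 0
    while pos < len(s):
        length = 1
        # Every F_i is prefix-closed, so the longest match can be found
        # by extending the candidate one symbol at a time.
        while pos + length < len(s) and s[pos:pos + length + 1] in known:
            length += 1
        factors.append(s[pos:pos + length])
        if pos + length < len(s):
            # Add f_i followed by the first symbol of f_{i+1}.
            known.add(s[pos:pos + length + 1])
        pos += length
    return factors
```

For instance, under these assumptions the string abababab has the RL factorization of eight runs of length 1, while its LZ78 factors are a, b, ab, aba, b.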
The LZ77 factorization has two variations depending upon whether self-references are allowed.

Definition 3 (LZ77 factorization w/o self-references) The LZ77-factorization without self-references of a string $S$ is the factorization $f_1, \ldots, f_n$ of $S$ such that for every $i = 1, \ldots, n$, factor $f_i$ is the longest prefix of $f_i \cdots f_n$ with $f_i \in F_i$, where $F_i = Substr(f_1 \cdots f_{i-1}) \cup \Sigma$.

Definition 4 (LZ77 factorization w/ self-references) The LZ77-factorization with self-references of a string $S$ is the factorization $f_1, \ldots, f_n$ of $S$ such that for every $i = 1, \ldots, n$, factor $f_i$ is the longest prefix of $f_i \cdots f_n$ with $f_i \in F_i$, where $F_i = Substr(f_1 \cdots f_{i-1} f'_i) \cup \Sigma$, where $f'_i$ is the prefix of $f_i$ obtained by removing the last symbol.

The LZSS is based on the LZ77 with self-references and its output is a sequence of pointers to factors $f_i$.

2.5 Grammar-based compression methods

An admissible grammar [20] is a context-free grammar that generates a single string.

2.5.1 Re-pair

Starting with $w_1 = S$, we repeat the following until no bigram occurs more than once in $w_i$: we find a most frequent bigram $\gamma_i$ in the string $w_i$, and then replace every non-overlapping occurrence of $\gamma_i$ in $w_i$ with a new variable $X_i$ to obtain string $w_{i+1}$. Let $r$ be the number of iterations. The resulting grammar has the production rules $\{ X_i \to \gamma_i \}_{i=1}^{r} \cup \{ X_{r+1} \to w_{r+1} \}$.

Theorem 5 ([5]) For any string $S$ of length $N$, Re-pair constructs in $O(N)$ time an admissible grammar of size $O(g (N / \log N)^{2/3})$, where $g$ is the size of the smallest grammar that derives $S$.

2.5.2 Bisection

The Bisection algorithm [20, 21] constructs a grammar that can be described recursively as follows: the variable representing string $S$ ($|S| \ge 2$) is derived by the rule $X \to YZ$, with $|Y| = 2^k$ and $|Z| = |X| - 2^k$, where $k$ is the largest integer s.t. $2^k < |X|$.
The production rules for $S[1 : 2^k]$ and $S[2^k + 1 : |S|]$ are defined recursively. Whenever $S[i : i + q - 1] = S[j : j + q - 1]$ for some $i, j, q \ge 1$ which appear in the above construction, the same variable is to be used for deriving these substrings.

Theorem 6 ([5]) For any string $S$ of length $N$, Bisection constructs an admissible grammar of size $O(g (N / \log N)^{1/2})$, where $g$ is the size of the smallest grammar that derives $S$.

2.6 Edit-sensitive parsing (ESP)

A string $a^k$ ($k \ge 2$) is called a repetition of symbol $a$, and $a^+$ is its abbreviation. We let $\log^{(1)} n = \log n$, $\log^{(i+1)} n = \log \log^{(i)} n$, and $\log^* n = \min \{ i \mid \log^{(i)} n \le 1 \}$. For example, $\log^* n \le 5$ for any $n \le 2^{65536}$. We thus treat $\log^* n$ as a constant for sufficiently large $n$.

We assume that any context-free grammar $G$ is admissible, i.e., $G$ derives just one string and for each variable $X$, exactly one production rule $X \to \alpha$ exists. The set of variables is denoted by $V(G)$, and the set of production rules, called the dictionary, is denoted by $D(G)$. We also assume that $X \to \alpha \in D(G)$ and $Y \to \alpha \in D(G)$ implies $X = Y$, because otherwise one of them is unnecessary. We use $V$ and $D$ instead of $V(G)$ and $D(G)$ when $G$ is clear from the context. The string derived by $D$ from a string $S \in (\Sigma \cup V)^*$ is denoted by $S(D)$. For example, when $S = aYY$ and $D = \{ X \to bc, Y \to Xa \}$, we obtain $S(D) = abcabca$.

Any string is uniquely partitioned into $w_1 a_1^+ w_2 a_2^+ \cdots w_k a_k^+ w_{k+1}$ by maximal repetitions, where each $a_i$ is a symbol and each $w_i$ is a string containing no repetition. Each $a_i^+$ is called a Type1 metablock, $w_i$ is called a Type2 metablock if $|w_i| \ge \log^* n$, and any other (short) $w_i$ is called a Type3 metablock, where if $|w_i| = 1$, it is attached to $a_{i-1}^+$ or $a_i^+$, with preference for $a_{i-1}^+$ when both are possible. Thus, the length of any metablock is at least two.
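The Bisection construction of Section 2.5.2 can be sketched on an uncompressed string as follows. This is only an illustrative sketch, not the compressed-domain conversion studied later in this paper. The names `bisection` and `expand` are hypothetical, single symbols are used directly as terminals, and equal substrings are detected by memoizing on the substring itself.

```python
def bisection(s):
    """Return (start symbol, rules) for a non-empty string s.

    rules maps a variable name to its pair (left symbol, right symbol);
    a symbol is either a variable name or a single terminal character.
    """
    rules = {}
    var_of = {}  # substring -> variable deriving it (shared on equality)

    def build(sub):
        if len(sub) == 1:
            return sub
        if sub in var_of:
            return var_of[sub]
        # Largest power of two strictly smaller than len(sub).
        half = 1 << ((len(sub) - 1).bit_length() - 1)
        left, right = build(sub[:half]), build(sub[half:])
        name = f"X{len(rules) + 1}"
        rules[name] = (left, right)
        var_of[sub] = name
        return name

    return build(s), rules


def expand(symbol, rules):
    """Decompress a symbol back into the string it derives."""
    if symbol not in rules:
        return symbol
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)
```

On the string abababab, this sketch produces three rules, one each for ab, abab, and the whole string, since both halves at every level coincide.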
Let $S$ be a metablock and $D$ be a current dictionary starting with $D = \emptyset$. We set $ESP(S, D) = (S', D \cup D')$ for $S'(D') = S$ and $S'$ described as follows:

1. When $S$ is Type1 or Type3 of length $k \ge 2$,
   (a) If $k$ is even, let $S' = t_1 t_2 \cdots t_{k/2}$, and make $t_i \to S[2i-1 : 2i] \in D'$.
   (b) If $k$ is odd, let $S' = t_1 t_2 \cdots t_{(k-3)/2} t$, and make $t_i \to S[2i-1 : 2i] \in D'$ and $t \to S[k-2 : k] \in D'$, where $t_0$ denotes the empty string for $k = 3$.
2. When $S$ is Type2,
   (c) for the partitioned $S = s_1 s_2 \cdots s_k$ ($2 \le |s_i| \le 3$) by alphabet reduction, let $S' = t_1 t_2 \cdots t_k$, and make $t_i \to s_i \in D'$.

Figure 2: Parsing for a Type1 string (diagram not reproduced). Line (1) is an original Type1 string $S = a^9$ with its position blocks. Line (2) is the resulting string $AAAB$, with the production rules $A \to aa$ and $B \to aaa$. Any Type3 string is parsed analogously.

Figure 3: Parsing for a Type2 string (diagram not reproduced). Line (1) is an original Type2 string 'adeghecadeg' with its position blocks by alphabet reduction, whose definition is omitted in this paper. Line (2) is the resulting string $ABCDB$, with the production rules $A \to ad$, $B \to eg$, etc.

Cases (a) and (b) denote a typical left aligned parsing. For example, in case $S = a^6$, $S' = x^3$ and $x \to a^2 \in D'$, and in case $S = a^9$, $S' = x^3 y$ and $x \to a^2$, $y \to aaa \in D'$. In Case (c), we omit the description of alphabet reduction [9] because the details are unnecessary in this paper. Case (b) is illustrated in Fig. 2 for a Type1 string, and the parsing manner in Case (a) is obtained by ignoring the last three symbols in Case (b). Parsing for Type3 is analogous. Case (c) is illustrated in Fig. 3.
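The left aligned parsing of Cases (a) and (b) can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `left_aligned_parse` and the variable naming scheme `X1, X2, ...` are hypothetical, the dictionary is stored as a map from right-hand sides to variables so that equal right-hand sides share one variable, and Case (c), which requires alphabet reduction, is not covered.

```python
def left_aligned_parse(block, dictionary):
    """Parse a Type1 or Type3 metablock given as a list of symbols.

    dictionary maps a right-hand side (tuple of symbols) to its variable
    and is extended in place. Returns the parsed string S' as a list.
    """
    k = len(block)
    if k < 2:
        raise ValueError("a metablock has length at least two")
    # Case (a): k even, pairs only.
    # Case (b): k odd, pairs followed by one final triple.
    cut = k if k % 2 == 0 else k - 3
    groups = [tuple(block[i:i + 2]) for i in range(0, cut, 2)]
    if k % 2 == 1:
        groups.append(tuple(block[k - 3:]))
    parsed = []
    for rhs in groups:
        if rhs not in dictionary:
            dictionary[rhs] = f"X{len(dictionary) + 1}"
        parsed.append(dictionary[rhs])
    return parsed
```

Applied to $a^9$ with an empty dictionary, the sketch returns three copies of the variable for $aa$ followed by the variable for $aaa$, matching the example $S' = x^3 y$ above.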
Finally, we define ESP for any string $S \in (\Sigma \cup V)^*$ that is partitioned into $S_1 S_2 \cdots S_k$ by $k$ metablocks: $ESP(S, D) = (S', D \cup D') = (S'_1 \cdots S'_k, D \cup D')$, where $D'$ and each $S'_i$ satisfying $S'_i(D') = S_i$ are defined as above. Iteration of ESP is defined by $ESP^i(S, D) = ESP^{i-1}(ESP(S, D))$. In particular, $ESP^*(S, D)$ denotes the iterations of ESP until $|S| = 1$. After computing $ESP^*(S, D)$, the final dictionary represents a rooted ordered binary tree deriving $S$, which is denoted by $ET(S)$.

Lemma 7 (Cormode and Muthukrishnan [9]) The height of $ET(S)$ is $O(\log |S|)$ and $ET(S)$ can be computed in $O(|S| \log^* |S|)$ time.

Lemma 8 (Cormode and Muthukrishnan [9]) Let $S = s_1 s_2 \cdots s_k$ be the partition of a Type2 metablock $S$ by alphabet reduction. For any $1 \le j \le |S|$, the block $s_i$ containing $S[j]$ is determined by at most $S[j - \log^* N - 5 : j + 5]$.

We refer to another characteristic of ESP for the pattern embedding problem. Nodes $v_1, v_2$ in $T = ET(S)$ are adjacent in this order if the subtrees on $v_1, v_2$ are adjacent in this order. A string $p_1 \cdots p_k$ of length $k$ is embedded in $T$ if there exist nodes $v_1, \ldots, v_k$ such that $label(v_i) = p_i$ and any $v_i, v_{i+1}$ are adjacent in this order. If $T[i]$, the $i$-th leaf of $T$, is the leftmost leaf of $v_1$ and $T[j]$ is the rightmost leaf of $v_k$, we say that $p_1 \cdots p_k$ is embedded as $T[i : j]$.

Figure 4: The derivation tree of the SLP with $X_1 \to a$, $X_2 \to b$, $X_3 \to X_1 X_2$, $X_4 \to X_1 X_3$, $X_5 \to X_3 X_4$, $X_6 \to X_4 X_5$, and $X_7 \to X_6 X_5$, representing string $S = val(X_7) = aababaababaab$ (diagram not reproduced).
Definition 9 $Q \in (\Sigma \cup V)^*$ is called an evidence of $P \in \Sigma^*$ in $S$ if the following holds: $S[i : j] = P$ iff $Q$ is embedded as $T[i : j]$.

We note that any $P$ has at least one evidence since $P$ itself is an evidence of $P$.

Lemma 10 (Maruyama et al. [29]) Given $T = ET(S)$, for any $T[i : i + t] = P$, there exists an evidence $Q = q_1 \cdots q_k$ of $P$ with maximal repetitions $q_\ell$ and $k = O(\log t)$. We can compute $Q$ in $O(\log t \log |S|)$ time, and we can also check if $Q$ is embedded as $T[j : j + t]$ in $O(\log t \log |S|)$ time for any $j$.

2.7 Straight Line Programs

A straight line program (SLP) [18] is a widely accepted abstract model of outputs of grammar-based compression methods. An SLP is a sequence of assignments $\{ X_i \to expr_i \}_{i=1}^{n}$, where each $X_i$ is a variable and each $expr_i$ is an expression, where $expr_i = a$ ($a \in \Sigma$), or $expr_i = X_\ell X_r$ ($\ell, r < i$). Namely, SLPs are admissible grammars in the Chomsky normal form, and hence outputs of admissible grammars can be easily converted to SLPs in linear time (see also Figure 1). Let $val(X_i)$ represent the string derived from $X_i$. When it is not confusing, we identify a variable $X_i$ with $val(X_i)$. Then, $|X_i|$ denotes the length of the string $X_i$ derives. An SLP $\{ X_i \to expr_i \}_{i=1}^{n}$ represents the string $S = val(X_n)$. The size of an SLP is the number of assignments in it. The height of variable $X_i$ is denoted $height(X_i)$, and is 1 if $X_i = a$ ($a \in \Sigma$), and $1 + \max \{ height(X_\ell), height(X_r) \}$ if $X_i = X_\ell X_r$. The height of an SLP $\{ X_i \to expr_i \}_{i=1}^{n}$ is defined to be $height(X_n)$. Note that $|X_i|$ and $height(X_i)$ for all variables can be calculated in a total of $O(n)$ time by simple dynamic programming iterations. In the rest of the paper, we will therefore assume that these values are available.
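The dynamic programming iteration for $|X_i|$ and $height(X_i)$ mentioned above can be sketched as follows. The SLP encoding used here is an assumption made for illustration only: an SLP is a Python list whose entry for a terminal rule is a single character and whose entry for $X_i \to X_\ell X_r$ is a pair of 0-based indices of earlier entries. The function names are hypothetical, and `val` decompresses explicitly only so that the result can be checked.

```python
def slp_lengths_and_heights(slp):
    """One left-to-right pass; each rule only refers to earlier rules."""
    length, height = [], []
    for rule in slp:
        if isinstance(rule, str):       # X_i -> a
            length.append(1)
            height.append(1)
        else:                           # X_i -> X_l X_r
            l, r = rule
            length.append(length[l] + length[r])
            height.append(1 + max(height[l], height[r]))
    return length, height


def val(slp, i):
    """Explicit decompression of variable i (for checking only)."""
    rule = slp[i]
    if isinstance(rule, str):
        return rule
    l, r = rule
    return val(slp, l) + val(slp, r)


# The SLP of Figure 4, with X_1..X_7 stored at indices 0..6.
FIGURE4_SLP = ["a", "b", (0, 1), (0, 2), (2, 3), (3, 4), (5, 4)]
```

For the SLP of Figure 4, the lengths come out as 1, 1, 2, 3, 5, 8, 13 and the heights as 1, 1, 2, 3, 4, 5, 6.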
The following results are known for SLP compressed strings:

Theorem 11 ([25]) Given two SLPs of total size $n$ that describe strings $S$ and $P$, respectively, a succinct representation of $Occ(S, P)$ can be computed in $O(n^3)$ time and $O(n^2)$ space.

Since we can compute $|S|$ and $|P|$ in $O(n)$ time and a membership query to the succinct representation can be answered in $O(n)$ time [31], the equality checking of whether $S = P$ can be done in a total of $O(n^3)$ time.

Lemma 12 ([24]) Given an SLP of size $n$ representing string $S$, an SLP of size $O(n)$ which represents an arbitrary substring $S[i : j]$ can be constructed in $O(n)$ time.

3 Algorithms for Restructuring Compressed Texts

In this section we present our polynomial-time algorithms that convert an input compressed representation to another compressed representation. In the sequel, $n$ and $m$ will denote the sizes of the input and output compressed representations, respectively.

3.1 Conversions from Run Length Encoding

For conversions from Run Length Encodings, we obtain the results below.

Theorem 13 (Run Length Encoding to Re-pair) Given an RL factorization of size $n$ that represents string $S$, the grammar of size $m$ produced by applying the Re-pair algorithm to $S$ can be computed in $O(nm \log m)$ time.

Proof. We consider a simple simulation of the Re-pair algorithm that works on the RL factorization of the string $S$. We shall assume that the Re-pair algorithm replaces non-overlapping bigrams with a new variable in a left-first manner. Let $Y_i \to Y_\ell Y_r$ denote the $i$-th rule produced by the Re-pair algorithm running on $S$. Let $S_1 = S$, and for $i \ge 1$ let $S_{i+1}$ denote the string obtained by replacing frequent bigrams by $Y_1, Y_2, \ldots$, and $Y_i$. Note that the bigram $Y_\ell Y_r$ will not occur in $S_{i+1}$.
Consider the RL factorization of $S_i$, and let $w_i$ denote the string obtained by concatenating the RL factors of $S_i$ consisting of characters in $\Sigma \cup \{ Y_j \}_{j=1}^{i-1}$. We find the most frequent bigram $Y_\ell Y_r$ in $w_i$, and then replace non-overlapping occurrences of $Y_\ell Y_r$ in $w_i$ with a new variable $Y_i$ on the left priority basis, and then compute $w_{i+1}$.

Let $a, b, c \in \Sigma$ with $a \ne b$ and $a \ne c$, and let $b a^p c$ be a substring of the original string $S$, where $p \ge 1$. Consider any occurrence of $b a^p c$ that begins at position $v$ in $S$, namely, let $S[v : v + p + 1] = b a^p c$. There are two cases to consider: (1) the range $[v : v + p + 1]$ is fully contained within a variable $Y_k$ in $w_i$; (2) the range $[v : v + p + 1]$ is contained in a substring of $w_i$ of the form $(Y_s)^e (Y_{k(1)})^q Y_{k(2)} \cdots Y_{k(l)} (Y_r)^t$ with $k(1) > k(2) > \cdots > k(l)$, where $val(Y_{k(1)})^q \cdot val(Y_{k(2)}) \cdots val(Y_{k(l)}) = a^{p'}$ for some $p' \le p$, $b a^x$ is a suffix of $val(Y_s)$, $a^y c$ is a prefix of $val(Y_r)$, and $x + p' + y = p$.

Let $Y_\ell Y_r$ be the most frequent bigram in $w_i$. It is possible to replace non-overlapping occurrences of $Y_\ell Y_r$ in $O(n)$ time, as follows: We can see that $val(Y_\ell) val(Y_r)$ occurs either (A) in a sequence $f_j f_{j+1} \cdots f_{j+d-1}$ of $d \ge 2$ consecutive factors in $w_1$ or (B) entirely within a single factor $f_j$ of $w_1$. This is because, if $val(Y_\ell) val(Y_r)$ contains at least two distinct characters $a \ne b$, then it occurs in a sequence of $d$ factors, and if $val(Y_\ell) val(Y_r) = a^z$, then it is fully contained in a factor. Consider case (A): Let $val(Y_\ell)[1] = c$. Since the number of factors of the form $f = c^p$ does not exceed $n$, the number of occurrences of the bigram of case (A) is $O(n)$.
Now consider case (B): According to the observation (2) above, any bigram $Y_\ell Y_r$ with $val(Y_\ell) val(Y_r) = a^z$ and $\ell \ne r$ occurs at most once in each substring of $w_i$ that corresponds to a factor $f_j$. Hence the number of occurrences of such a bigram in $w_i$ is at most $n$. If $\ell = r$, then the bigram $Y_\ell Y_\ell$ can occur $q - 1$ times at each factor. We then replace $(Y_\ell)^q$ with $(Y_i)^{q/2}$ if $q$ is even, and with $(Y_i)^{(q-1)/2} Y_\ell$ otherwise, in $O(1)$ time.

Since each $w_i$ consists of characters in $\Sigma \cup \{ Y_j \}_{j=1}^{i-1}$, the number of all bigrams in $w_i$ is $O(m^2 + m |\Sigma|) = O(m^2)$. We find the most frequent bigram in $O(\log m)$ time using a heap, and the total time complexity for converting the RL factorization to the grammar corresponding to Re-pair is $O(nm \log m)$.

Theorem 14 (Run Length Encoding to LZ77/LZ78) Given an RL factorization of size $n$ that represents string $S$, the LZ factorization of $S$ can be computed in $O(nm + n \log n)$ time, where $m$ is the size of the LZ factorization.

Proof. Let $a_1^{p_1}, \ldots, a_n^{p_n}$ be the RL factorization of a text $S$. Assume that we have already computed the first $i - 1$ LZ77 factors, $f_1, \ldots, f_{i-1}$, of $S$. Let the pair of integers $(u, q)$ satisfy $q + p_1 + \cdots + p_{u-1} = |f_1 \cdots f_{i-1}|$, where $1 \le u \le n$ and $1 \le q \le p_u$. For a new factor $f_i$, compute the lengths $l_j$ of the longest common prefix of $a_{u+1}^{p_{u+1}} \cdots a_n^{p_n}$ and each suffix $a_j^{p_j} \cdots a_n^{p_n}$ ($1 \le j \le u$) of the RLE, where each RL factor $a^p$ is regarded as a single symbol. The length of the $i$-th LZ77 factor is then $\max_j \{ p_{u+1} + \cdots + p_{u+l_j} + P + Q \}$, where $P = 0$ if $a_u \ne a_j$ and $P = \min \{ p_u - q, p_j - 1 \}$ otherwise, and $Q = 0$ if $a_{u+l_j+1} \ne a_{j+l_j}$ and $Q = \min \{ p_{u+l_j+1}, p_{j+l_j} \}$ otherwise. The process is then repeated to obtain $f_{i+1}$ from the pair of integers $(u + l_j + 1, p_{u+l_j+1} - Q)$.
A naive algorithm for obtaining each $l_j$ costs $O(n)$ time, and therefore results in an $O(n^2 m)$ time algorithm to check each of the $O(n)$ suffixes to construct the $O(m)$ factors. If we construct a suffix array and lcp array on the RLE string beforehand, $l_j$ can be computed in $O(1)$ time, since it amounts to a range minimum query on the lcp array. Note that the sum $p_{u+1} + \cdots + p_{u+l_j}$ can also be obtained in constant time with $O(n)$ preprocessing, by constructing an array $sum[i] = p_1 + \cdots + p_i$ and computing $sum[u + l_j] - sum[u]$. Therefore, conversion can be done in $O(nm)$ time provided that the suffix array and lcp array are constructed. The construction of the arrays requires $O(n \log n)$ time, to sort and number each character $a_i^{p_i}$ of the alphabet. LZ78 factorization can be achieved by a simple modification.

Theorem 15 (Run Length Encoding to Bisection) Given an RL factorization of size $n$ that represents string $S$, the grammar of size $m$ produced by applying the Bisection algorithm to $S$ can be computed in $O(m^2 + (m + n) \log n)$ time.

Proof. Consider the following top-down algorithm which closely follows the description of Bisection in Section 2.5.2. Assume we want to construct the children $Y_\ell, Y_r$ of variable $Y_s$ representing $S[i : j]$, to produce the grammar rule $Y_s \to Y_\ell Y_r$. Note that an arbitrary substring $S[i : j]$ which is contained in the RLE $a_{k-1}^{p_{k-1}} \cdots a_{k+l}^{p_{k+l}}$ can be represented as a 4-tuple $(x, k, l, y)$, where $i = p_1 + \cdots + p_{k-1} - x + 1$, $j = p_1 + \cdots + p_{k+l-1} + y - 1$, $0 \le x < p_{k-1}$, and $0 \le y < p_{k+l}$. Let $k$ represent the largest integer where $2^k < j - i + 1$. For the substring $S[i : j]$ under consideration, the 4-tuples for substrings $S[i : i + 2^k - 1]$ and $S[i + 2^k : j]$ can be obtained in $O(\log n)$ time.
Note that equality checks between substrings represented as 4-tuples can be conducted in $O(1)$ time with $O(n \log n)$ preprocessing, using range minimum queries on the lcp arrays, similar to the technique used in the conversion to LZ encodings. Equality checks are conducted against the $O(m)$ variables that will be contained in the output. If there exist variables which derive the same string, the existing variables are used in place of $Y_\ell$ and/or $Y_r$, and $Y_\ell$ and/or $Y_r$ will not be contained in the output. Since equality checks are conducted only for the children of variables which are contained in the output, they are conducted only $O(m)$ times. Therefore, conversion can be done in $O(n \log n + m(m + \log n)) = O(m^2 + (m + n) \log n)$ time.

3.2 Conversions from arbitrary SLP

Theorem 16 (SLP to Run Length Encoding) Given an SLP of size $n$ that represents string $S$, the RL factorization of $S$ can be computed in $O(n + m)$ time and $O(n)$ space, where $m$ is the size of the RL factorization.

Proof. For each variable $X_i$, we first compute the maximal length of the run of identical characters which is a prefix (resp. suffix) of $X_i$, denoted by $plen(X_i)$ (resp. $slen(X_i)$). This can be computed in $O(n)$ time by a simple dynamic programming: for $X_i \to X_\ell X_r$, we have $plen(X_i) = plen(X_\ell)$ if $plen(X_\ell) < |X_\ell|$ or $X_\ell[|X_\ell|] \ne X_r[1]$, and $plen(X_i) = plen(X_\ell) + plen(X_r)$ otherwise. $slen(X_i)$ can be computed likewise. Next, for each variable $X_i \to X_\ell X_r$, let $Llink(X_i)$ denote the variable $X_{i'} \to X_{\ell'} X_{r'}$ such that $X_{i'}$ is the shallowest descendant of $X_i$ lying on the leftmost path of the derivation tree of $X_i$, satisfying $slen(X_{i'}) \le |X_{r'}|$. $Llink(X_i)$ can also be computed for all $X_i$ in $O(n)$ time, by a simple dynamic programming. $Rlink(X_i)$ can be defined and computed likewise.
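The dynamic programming for $plen$ and $slen$ described in this proof can be sketched as follows. The sketch covers only this precomputation step, not $Llink$, $Rlink$, or the traversal. It assumes an illustrative SLP encoding in which a terminal rule is a single character and a rule $X_i \to X_\ell X_r$ is a pair of 0-based indices of earlier rules; the function name `prefix_suffix_run_lengths` is hypothetical.

```python
def prefix_suffix_run_lengths(slp):
    """Return (plen, slen) for every variable of the SLP in O(n) time."""
    first, last, length, plen, slen = [], [], [], [], []
    for rule in slp:
        if isinstance(rule, str):               # X_i -> a
            first.append(rule)
            last.append(rule)
            length.append(1)
            plen.append(1)
            slen.append(1)
            continue
        l, r = rule                             # X_i -> X_l X_r
        first.append(first[l])
        last.append(last[r])
        length.append(length[l] + length[r])
        joins = last[l] == first[r]             # runs meet at the boundary
        # The prefix run extends into X_r only if X_l is a single run.
        if plen[l] < length[l] or not joins:
            plen.append(plen[l])
        else:
            plen.append(plen[l] + plen[r])
        # Symmetrically for the suffix run.
        if slen[r] < length[r] or not joins:
            slen.append(slen[r])
        else:
            slen.append(slen[r] + slen[l])
    return plen, slen
```

For example, for an SLP with rules $X_1 \to a$, $X_2 \to X_1 X_1$, $X_3 \to X_2 X_2$, $X_4 \to X_3 X_3$ deriving $a^8$, both values double at each level and reach 8 at the root.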
The conversion algorithm is then a top-down post-order traversal on the derivation tree of the SLP, but with jumps using $Llink$ and $Rlink$. For the root $X_n$, we output (1) $X_n[1]^{plen(X_n)}$, (2) the RLE of $X_n$ except for the first and last RL factors of $X_n$, and (3) $X_n[|X_n|]^{slen(X_n)}$. (2) can be computed recursively as follows: at each variable $X_i \to X_\ell X_r$, we output (2.1) the RLE of $Llink(X_i)$ except for the first and last RL factors of $Llink(X_i)$, (2.2) either $X_\ell[|X_\ell|]^{slen(X_\ell)} X_r[1]^{plen(X_r)}$ if $X_\ell[|X_\ell|] \ne X_r[1]$, or $X_r[1]^{slen(X_\ell) + plen(X_r)}$ if $X_\ell[|X_\ell|] = X_r[1]$, and (2.3) the RLE of $Rlink(X_i)$ except for the first and last RL factors of $Rlink(X_i)$. The theorem follows since the output of each RL factor is done in constant time.

Theorem 17 (SLP to LZ77) Given an SLP of size $n$ that represents string $S$, the LZ77 factorization of size $m$ can be computed in $O(mn^3 \log N)$ time.

Proof. Assume we have already computed $f_1, \ldots, f_{i-1}$ of $S$ from a given SLP of size $n$. Firstly we consider the LZ77 factorization without self-references. For a new factor $f_i$, do a binary search on the length of the factor: create a new SLP of that length, and conduct pattern matching on the input SLP. If a match exists in the range that corresponds to the previous factors $f_1, \ldots, f_{i-1}$, i.e., in the prefix $S[1 : \sum_{j=1}^{i-1} |f_j|]$ of $S$, then the length of $f_i$ can be longer, and if not, it must be shorter. Using Theorem 11 and Lemma 12, the LZ77 factorization of size $m$ can thus be computed in $O(mn^3 \log N)$ time. To compute the LZ77 factorization with self-references, we search for the longest match that begins at a position from 1 to $\sum_{j=1}^{i-1} |f_j|$ in $S$. The time complexity is the same as above.
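The binary search on the factor length used in the proof of Theorem 17 can be sketched as follows for the variant without self-references. This sketch works on an uncompressed string: the helper `occurs_in_prefix` is a plain substring test that merely stands in for building an SLP of the candidate (Lemma 12) and running compressed pattern matching (Theorem 11), so it illustrates the search structure only and not the stated running time. Both function names are hypothetical.

```python
def occurs_in_prefix(s, pos, length):
    """Stand-in oracle: does s[pos:pos+length] occur inside s[:pos]?"""
    return s[pos:pos + length] in s[:pos]


def lz77_without_self_references(s):
    """Definition 3, with each factor length found by binary search."""
    factors = []
    pos = 0
    while pos < len(s):
        # If a candidate of some length occurs in the prefix, so does
        # every shorter one, hence the predicate is monotone in length.
        lo, hi = 0, min(len(s) - pos, pos)
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if occurs_in_prefix(s, pos, mid):
                lo = mid          # the factor can be at least this long
            else:
                hi = mid - 1      # the factor must be shorter
        length = max(lo, 1)       # a single symbol of Sigma is always allowed
        factors.append(s[pos:pos + length])
        pos += length
    return factors
```

On the string abababab this yields the factors a, b, ab, abab. Each factor needs $O(\log N)$ oracle calls, which is where the $\log N$ term of the theorem comes from.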
Theorem 18 (SLP to LZ78) Given an SLP of size $n$ that represents string $S$, the LZ78 factorization of size $m$ can be computed in $O(n^3 m^2)$ time.

Proof. Our algorithm for converting an SLP to LZ78 follows a similar idea: When computing a new factor $f_i$, we construct a new SLP of $f_k f_{k+1}[1]$ for each $1 \le k < i$, and run the pattern matching algorithm on the input SLP. The longest match in the suffix $S[\sum_{j=1}^{i-1} |f_j| + 1 : |S|]$ provides the new factor $f_i$. By Theorem 11 and Lemma 12, the pattern matching tasks for computing each factor $f_i$ take $O(n^3 m)$ time, and therefore the total time complexity is $O(n^3 m^2)$.

3.3 SLP to Bisection

Theorem 19 Given an SLP of size $n$ that represents string $S$, the grammar of size $m$ produced by applying the Bisection algorithm to $S$ can be computed in $O(n^3 m^2)$ time.

Proof. Given an arbitrary SLP of size $n$ representing $S$, consider the following top-down algorithm which closely follows the description of Bisection in Section 2.5.2. Assume we want to construct the children $Y_\ell, Y_r$ of variable $Y_s$ representing $S[i : j]$, to produce the grammar rule $Y_s \to Y_\ell Y_r$. Let $k$ represent the largest integer where $2^k < j - i + 1$. By using Lemma 12, SLPs $Y_\ell$ representing $S[i : i + 2^k - 1]$ and $Y_r$ representing $S[i + 2^k : j]$ can be constructed in $O(n)$ time. For these SLPs, equality checks are conducted against all $O(m)$ variables corresponding to variables that will be contained in the output produced so far. If there exist variables which derive the same string, the existing variables are used in place of $Y_\ell$ and/or $Y_r$, and $Y_\ell$ and/or $Y_r$ will not be contained in the output. From Theorem 11, the equality checks for $Y_\ell$ and $Y_r$ can be conducted in a total of $O(n^3 m)$ time.
Since equality checks are conducted only for the children of variables which are contained in the output, the total time is $O(n^3 m^2)$.

3.4 Conversions to and from ESP

Given a representation of an SLP $G$ for a string $S$, we design algorithms to compute the LZ77 and LZ78 factorizations of $S$ without explicit decompression of $G$ in $O((n + m) \log^d N)$ time. Here $n$ and $m$ are the sizes of the input and output grammars, respectively, $N = |S|$, and $d$ is a constant. Our method is based on the transformation of any SLP to its canonical form by way of an equivalent ESP.

Lemma 20 Given a dictionary $D$ from $\mathit{ESP}^*(S, D)$ for some $S \in \Sigma^*$, and the set $V$ of variables in $D$, we can compute an SLP with the dictionary $D'$ and the set $V'$ of variables which satisfies the following conditions: (1) $|D'| \leq 2|D|$ and (2) for any $X_i, X_j \in V'$, $\mathit{val}(X_i) \leq_{\mathrm{lex}} \mathit{val}(X_j)$ iff $i \leq j$, where $\leq_{\mathrm{lex}}$ denotes the lexical order over $\Sigma$. The computation time is $O(n \log n \log^3 N)$ for $|V| = n$ and $|S| = N$.

Proof. Consider $T_X = \mathit{ET}(\mathit{val}(X))$ and $T_Y = \mathit{ET}(\mathit{val}(Y))$ for any $X, Y \in V$. Let $t = \lfloor |\mathit{val}(Y)| / 2 \rfloor$. By Lemma 10, we can compute an evidence $Q$ of the pattern $T_Y[1 : t]$ in $O(\log^2 t) = O(\log^2 N)$ time. We can also check if $Q$ is embedded as $T_X[1 : t]$ in $O(\log^2 N)$ time. By this binary search, we can find the length of the longest common prefix of $\mathit{val}(X)$ and $\mathit{val}(Y)$ in $O(\log^2 N)$ time. Thus, we can sort all variables in $V$ in $O(n \log n \log^3 N)$ time. After sorting all variables in $V$, we rename each variable according to its rank. If there is a variable $X$ with $X \to X_i X_j X_k$, we divide it into $X \to Y X_k$ and $Y \to X_i X_j$ by an intermediate variable $Y$, and we can determine the rank of such new variables in additional $O(n \log n)$ time.

Dictionaries $D_1, D_2$ of two admissible grammars are called consistent if $X \to \alpha, Y \to \alpha \in D_1 \cup D_2$ implies $X = Y$, and consistent dictionaries $D_1, \ldots$
, $D_k$ are similarly defined. For $\alpha \in (\Sigma \cup V)^*$, $\alpha = q_1 \cdots q_k$ is called a run-length representation of $\alpha$ if each $q_i$ is a maximal repetition of $p_i \in \Sigma \cup V$. For example, the run-length representation of $abbaaacaa$ is $q_1 q_2 q_3 q_4 q_5 = a b^2 a^3 c a^2$. The number $k$ of $\alpha = q_1 \cdots q_k$ is called the change of $\alpha$. Let $S = \alpha\beta\gamma$ and $S' = \alpha'\beta'\gamma'$ satisfy $\mathit{ESP}(S, D) = (S', D \cup D')$ with $\alpha'(D') = \alpha$, $\beta'(D') = \beta$, and $\gamma'(D') = \gamma$. Then we call such $S = \alpha\beta\gamma$ a stable decomposition of $S$. An expression $\mathit{ESP}(\alpha[\beta]\gamma, D) = (\alpha'[\beta']\gamma', D \cup D')$ denotes an ESP that replaces $\alpha$, $\beta$, $\gamma$ by $\alpha'$, $\beta'$, $\gamma'$, respectively. For a string $\alpha$, $\overline{\alpha}$ and $\underline{\alpha}$ denote a prefix of $\alpha$ and a suffix of $\alpha$, respectively.

Lemma 21 Let $\mathit{ESP}(\alpha[\beta]\gamma, D) = (\alpha'[\beta']\gamma', D \cup D')$ for a stable decomposition $S = \alpha\beta\gamma$. There exist substrings $\underline{\alpha}$, $\underline{\alpha\beta}$, $\overline{\beta\gamma}$, $\overline{\gamma}$, each of whose change is at most $\log^* |S| + 5$, such that $\mathit{ESP}([\alpha]\overline{\beta\gamma}, D) = ([\alpha']y_1, D \cup D_1)$, $\mathit{ESP}(\underline{\alpha}[\beta]\overline{\gamma}, D) = (x_2[\beta']y_2, D \cup D_2)$, $\mathit{ESP}(\underline{\alpha\beta}[\gamma], D) = (x_3[\gamma'], D \cup D_3)$, and $D' = D_1 \cup D_2 \cup D_3$.

Proof. Since $S = \alpha\beta\gamma$ is a stable decomposition of an ESP for $S$, the translated string $\alpha'$ and the dictionary $D_1$ for $D_1(\alpha') = \alpha$ are determined by only $\alpha$ and a prefix $\overline{\beta\gamma}$. In case $p^+$ is the maximal prefix of $\beta\gamma$, we can set $\overline{\beta\gamma} = p^+$. Otherwise, by Lemma 8, we can set $\overline{\beta\gamma}$ to be a prefix of length at most $\log^* |S| + 5$. For $\beta$ and $\gamma$, we can set $\underline{\alpha}\,\overline{\gamma}$ and $\underline{\alpha\beta}$ with the bounded change, respectively. The above ESP defines $\alpha'(D_1) = \alpha$, $\beta'(D_2) = \beta$, and $\gamma'(D_3) = \gamma$. By renaming all variables in the dictionaries, there is a consistent $D' = D_1 \cup D_2 \cup D_3$ satisfying $\alpha'(D')\beta'(D')\gamma'(D') = \alpha\beta\gamma$.

Lemma 22 Let $D$ be a dictionary of an SLP encoding a string $S \in \Sigma^*$.
A dictionary $D'$ of an ESP equivalent to $D$ is computable in $O(n \log^2 N + m)$ time, where $n = |D|$, $m = |D'|$, and $N = |S|$.

Proof. We assume that $\mathit{ESP}^*(\mathit{val}(X_\ell), D)$ ($\ell \leq i, j$) is already computed and let $D'$ be the current dictionary consistent with all $\mathit{val}(X_\ell)$. For $X_k \to X_i X_j$ ($k > i, j$), we estimate the time to update $D'$. Let $\mathit{val}(X_i) = \alpha$ and $\mathit{val}(X_j) = \gamma$. For the initial strings $\alpha, \gamma$, we can obtain a suffix $\underline{\alpha}$ of length $\log^* N + 6$ and a prefix $\overline{\gamma}$ of length $6$ in $O(\log N \log^* N)$ time. By the result of $\mathit{ESP}(\underline{\alpha}\,\overline{\gamma}, D')$, we determine the position block $\beta$ to which $\alpha[|\alpha|]$ and $\gamma[1]$ belong. Then we can find a stable decomposition $S = \alpha\beta\gamma$ for the obtained $\beta$ and the reformed $\alpha$ and $\gamma$, where $\alpha$ (and $\gamma$) is represented by a path from the root to a leaf in the derivation tree of $D_x$ (and $D_y$). They are called the current tail and head, respectively. Note that we can avoid decoding $\alpha$ and $\gamma$ for the parsing in Lemma 21. To simulate this, we use only the compressed representations $D_x$, $D_y$, the current tail/head, and $\beta$. Using the run-length representation, the change of $\beta$ is bounded by $O(\log^* N)$ as follows. By Lemma 21, when $\mathit{ESP}(\alpha[\beta]\gamma, D) = (\alpha'[\beta']\gamma', D \cup D')$ is computed by $\mathit{ESP}([\alpha]\overline{\beta\gamma}, D) = ([\alpha']y_1, D \cup D_1)$, $\mathit{ESP}(\underline{\alpha}[\beta]\overline{\gamma}, D) = (x_2[\beta']y_2, D \cup D_2)$, and $\mathit{ESP}(\underline{\alpha\beta}[\gamma], D) = (x_3[\gamma'], D \cup D_3)$, the resulting string $\beta'$ is treated as the next $\beta$, and the current tail and head are replaced by $\alpha'[|\alpha'|]$ and $\gamma'[1]$, which represent the next $\alpha$ and $\gamma$. Let us consider the case $\mathit{ESP}([\alpha]\overline{\beta\gamma}, D) = ([\alpha']y_1, D \cup D_1)$. If $\alpha\beta\gamma$ contains a maximal repetition of $p$ as $\alpha[|\alpha| - N_1, |\alpha|] \cdot \beta\gamma[1, N_2] = p^+$, the next tail is the parent of $\alpha[|\alpha| - N_1 - 1]$, which is determined in $O(\log^2 N)$ time since any repetition is replaced by the left aligned parsing and $N_1 + N_2 = O(N)$.
Otherwise, by Lemma 8, we can determine the next tail by tracing a suffix of $\alpha$ of length at most $\log^* N + 5$ in $O(\log N \log^* N)$ time. Thus, $\mathit{ESP}(\alpha[\beta]\gamma, D) = (\alpha'[\beta']\gamma', D \cup D')$ is simulated in $O(\log^2 N + m_k)$ time for $X_k \to X_i X_j$, where $m_k$ is the number of new variables produced in this ESP. Therefore we conclude that the final dictionary $D'$ equivalent to $D$ is obtained in $O((\log^2 N + m_1) + \cdots + (\log^2 N + m_n)) = O(n \log^2 N + m)$ time.

Theorem 23 (SLP to Canonical SLP) Given an SLP $D$ of size $n$ for string $S$ of length $N$, we can construct another SLP $D'$ of size $m$ in $O(n \log^2 N + m \log m \log^3 N)$ time such that $D'$ is a final dictionary of $\mathit{ESP}^*(S, D')$ equivalent to $D$ and all variables in $D'$ are sorted by the lexical order of their encoded strings.

Theorem 24 (Canonical SLP to LZ77) Given a canonical SLP $D$ of size $n$ for string $S$ of length $N$, we can compute the LZ77 factorization $f_1, \ldots, f_m$ of $S$ in $O(m \log^2 n \log^3 N + n \log^2 n)$ time.

Proof. Using the technique in Lemma 20, we can sort all variables $Z$ associated with $Z \to XY \in D$ by the following two keys: the first key is the lexical order of $\mathit{val}(X)^R$ and the second is the lexical order of $\mathit{val}(Y)$, where $S^R$ denotes the reverse string of $S \in \Sigma^*$. Then $Z$ is mapped to a point $(i, j, \mathit{pos})$ in a 3-dimensional space such that $i$ is the index of the first key on the $X$-axis, $j$ is the index of the second key on the $Y$-axis, and $\mathit{pos}$ is the index of the leftmost occurrence of $\mathit{val}(Z)$ on the $Z$-axis. A data structure supporting range queries for the point set is constructed in $O(n \log^2 n)$ time and space, achieving $O(\log^2 n)$ query time (see [26]). Using this, we can compute $f_{\ell+1}$ from $f_1, \ldots, f_\ell$ and the remaining suffix $S'$ such that $S = f_1 \cdots f_\ell \cdot S'$ as follows. By Lemma 10, an evidence $Q = q_1 \cdots q_k$ of $S'[1 : j]$ satisfying $k = O(\log j)$ is found in $O(\log^2 j)$ time. Let $q_i$ be a symbol.
Then we guess the division $\alpha = q_1 \cdots q_i$ and $\beta = q_{i+1} \cdots q_k$ to find the range of $X$ in which $\alpha$ is embedded as its suffix, the range of $Y$ in which $\beta$ is embedded as its prefix, and the range of $Z$ whose leftmost position $\mathit{pos}$ satisfies $\mathit{pos} + j \leq |f_1 \cdots f_\ell|$. This query time is $O(\log^2 n \log^2 N)$. Let $q_i = p_i^{j}$ for a symbol $p_i$. Any maximal repetition is replaced by the left aligned parsing, and a resulting new repetition is recursively replaced in the same manner. Thus, an embedding of $q_1 \cdots q_k$ into $Z \to XY$ dividing $q_i = \alpha\beta$, such that $q_1 \cdots q_{i-1}\alpha$ is embedded in $X$ as a suffix and $\beta q_{i+1} \cdots q_k$ is embedded in $Y$ as a prefix, is possible in $O(\log j) = O(\log N)$ divisions for $q_i$. In this case, the query time is $O(\log^2 n \log^3 N)$. Therefore, the total time to compute the required LZ77 factorization is bounded by $O(m \log^2 n \log^3 N + n \log^2 n)$.

Theorem 25 (Canonical SLP to LZ78) Given a canonical SLP $D$ of size $n$ for string $S$ of length $N$, we can compute the LZ78 factorization $f_1, \ldots, f_m$ of $S$ in $O(m \log^3 N + n)$ time.

Proof. Assume that the first $\ell$ factors $f_1, \ldots, f_\ell$ are obtained. By Lemma 10, we can find an evidence $Q_i$ of $f_i$ ($1 \leq i \leq \ell$), and all evidences $Q_i$ ($1 \leq i \leq \ell$) are represented by a trie. Let $S'$ be the remaining suffix of $S$. For each $j$, we can compute an evidence of $S'[1 : j]$ in $O(\log^2 j) = O(\log^2 N)$ time. Thus, we can find the greatest $j$ satisfying $f_i = S'[1 : j]$ for some $1 \leq i \leq \ell$ in $O(\log^3 N)$ time using binary search. Therefore, the total time to compute the required LZ78 factorization is bounded by $O(m \log^3 N + n)$.

Theorem 26 (Run Length Encoding to ESP) Given a text $S$ represented as an RL encoding $S = f_1 \cdots f_n$ of length $n$, we can compute an ESP $D$ representing $S$ in $O(n \log^* N)$ time.

Proof. We make a small change to the replacement of maximal repetitions. Suppose a maximal repetition $\alpha = a^k$ appears in $S$.
If $k$ is even, then we replace $\alpha$ by $A^{k/2}$; otherwise we replace it by $A^{\lfloor k/2 \rfloor - 1} B$, where $A \to aa$ and $B \to aaa$. In the exceptional case where a prefix and/or suffix of $\alpha$ is replaced together with the left/right symbol adjacent to $\alpha$, we must instead consider the string $\alpha'$ obtained by removing such a prefix/suffix from $\alpha$. The computation time to replace such a repetition is $O(1)$ since the number of repeated symbols is represented as an integer. Therefore, the time for the conversion is bounded by $O(n \log^* N)$.

Theorem 27 (ESP to Bisection) Given an ESP $D$ of size $n$ representing $S$ of length $N$, we can compute an SLP of size $m$ generated by Bisection in $O(m \log^2 N)$ time.

Proof. For each substring $S[i : j]$ under consideration, we can obtain an evidence $Q$ corresponding to $S[i : i + 2^k - 1]$ in $O(\log^2 N)$ time. By Lemma 10, we can check if $Q$ is embedded as $T[i + 2^k : j]$ in $O(\log^2 N)$ time. If $Q$ can be embedded, we can allocate the same variable for $S[i : i + 2^k - 1]$ and $S[i + 2^k : j]$ since both substrings are equal; otherwise different variables are allocated. Therefore, the conversion can be done in $O(m \log^2 N)$ time.

4 Conclusions and Future Work

In this paper we presented new efficient algorithms which, without explicit decompression, convert to/from compressed strings represented in terms of run length encoding (RLE), LZ77 and LZ78 encodings, the grammar-based compressors RE-PAIR and BISECTION, edit sensitive parsing (ESP), straight line programs (SLPs), and admissible grammars. All the proposed algorithms run in time polynomial in the input and output sizes, while algorithms that first decompress the compressed input string can take exponential time. Examples of applications of our results are dynamic compressed strings allowing for edit operations, and post-selection of specific compression formats.
Future work is to extend our results to other text compression schemes, such as Sequitur [34], Longest-First Substitution [32], and Greedy [1].

References

[1] Apostolico, A., and Lonardi, S. Off-line compression by greedy textual substitution. Proc. IEEE 88 (2000), 1733–1744.
[2] Bille, P., Landau, G. M., Raman, R., Sadakane, K., Satti, S. R., and Weimann, O. Random access to grammar-compressed strings. In Proc. SODA'11 (2011), pp. 373–389.
[3] Chan, H.-L., Hon, W.-K., Lam, T.-W., and Sadakane, K. Dynamic dictionary matching and compressed suffix trees. In Proc. SODA'05 (2005), pp. 13–22.
[4] Chan, H.-L., Hon, W.-K., Lam, T.-W., and Sadakane, K. Compressed indexes for dynamic text collections. ACM Trans. Algorithms 3, 2 (2007), 21:1–21:29.
[5] Charikar, M., Lehman, E., Liu, D., Panigrahy, R., Prabhakaran, M., Sahai, A., and shelat, a. The smallest grammar problem. IEEE Transactions on Information Theory 51, 7 (2005), 2554–2576.
[6] Cilibrasi, R., and Vitányi, P. M. B. Clustering by compression. IEEE Transactions on Information Theory 51 (2005), 1523–1545.
[7] Claude, F., and Navarro, G. Self-indexed grammar-based compression. Fundamenta Informaticae (to appear). Preliminary version: Proc. MFCS 2009, pp. 235–246.
[8] Cormode, G., and Muthukrishnan, S. Substring compression problems. In Proc. SODA'05 (2005), pp. 321–330.
[9] Cormode, G., and Muthukrishnan, S. The string edit distance matching problem with moves. ACM Trans. Algor. 3, 1 (2007), Article 2.
[10] Gawrychowski, P. Optimal pattern matching in LZW compressed strings. In Proc. SODA'11 (2011), pp. 362–372.
[11] Gawrychowski, P. Pattern matching in Lempel-Ziv compressed strings: fast, simple, and deterministic. In Proc. ESA 2011 (2011). Accepted (available as arXiv:1104.4203v1).
[12] Gąsieniec, L., Karpinski, M., Plandowski, W., and Rytter, W. Efficient algorithms for Lempel-Ziv encoding. In Proc. SWAT 1996 (1996), vol. 1097 of LNCS, pp. 392–403.
[13] González, R., and Navarro, G. Rank/select on dynamic compressed sequences and applications. Theoretical Computer Science 410 (2009), 4414–4422.
[14] Goto, K., Bannai, H., Inenaga, S., and Takeda, M. Towards efficient mining and classification on compressed strings. Accepted for SPIRE'11 (2011). Preprint available at arXiv:1103.3114v1.
[15] Hermelin, D., Landau, G. M., Landau, S., and Weimann, O. A unified algorithm for accelerating edit-distance computation via text-compression. In Proc. STACS 2009 (2009), pp. 529–540.
[16] Inenaga, S., and Bannai, H. Finding characteristic substring from compressed texts. In Proc. The Prague Stringology Conference 2009 (2009), pp. 40–54.
[17] Kärkkäinen, J., and Sanders, P. Simple linear work suffix array construction. In Proc. ICALP 2003 (2003), vol. 2719 of LNCS, pp. 943–955.
[18] Karpinski, M., Rytter, W., and Shinohara, A. An efficient pattern-matching algorithm for strings with short descriptions. Nordic Journal of Computing 4 (1997), 172–186.
[19] Kasai, T., Lee, G., Arimura, H., Arikawa, S., and Park, K. Linear-time longest-common-prefix computation in suffix arrays and its applications. In Proc. CPM 2001 (2001), vol. 2089 of LNCS, pp. 181–192.
[20] Kieffer, J., and Yang, E. Grammar-based codes: a new class of universal lossless source codes. IEEE Transactions on Information Theory 46, 3 (2000), 737–754.
[21] Kieffer, J., Yang, E., Nelson, G., and Cosman, P. Universal lossless compression via multilevel pattern matching. IEEE Transactions on Information Theory 46, 4 (2000), 1227–1245.
[22] Kreft, S., and Navarro, G. Self-indexing based on LZ77. In Proc. CPM'11 (2011), vol. 6661 of LNCS, pp. 41–54.
[23] Lee, S., and Park, K. Dynamic rank/select structures with applications to run-length encoded texts. Theoretical Computer Science 410, 43 (2009), 4402–4413.
[24] Lifshits, Y. Solving classical string problems on compressed texts. In Combinatorial and Algorithmic Foundations of Pattern and Association Discovery (2006), no. 06201 in Dagstuhl Seminar Proceedings.
[25] Lifshits, Y. Processing compressed texts: A tractability border. In Proc. CPM 2007 (2007), vol. 4580 of LNCS, pp. 228–240.
[26] Lueker, G. A data structure for orthogonal range queries. In Proc. FOCS'78 (1978), pp. 28–34.
[27] Mäkinen, V., and Navarro, G. Dynamic entropy-compressed sequences and full-text indexes. ACM Trans. Algorithms 4, 3 (2008), 32:1–32:38.
[28] Manber, U., and Myers, G. Suffix arrays: A new method for on-line string searches. SIAM J. Computing 22, 5 (1993), 935–948.
[29] Maruyama, S., Nakahara, M., Kishiue, N., and Sakamoto, H. ESP-Index: A compressed index based on edit-sensitive parsing. In SPIRE'11 (2011). Accepted (available from http://hdl.handle.net/2324/19843).
[30] Matsubara, W., Inenaga, S., Ishino, A., Shinohara, A., Nakamura, T., and Hashimoto, K. Efficient algorithms to compute compressed longest common substrings and compressed palindromes. Theoret. Comput. Sci. 410, 8–10 (2009), 900–913.
[31] Miyazaki, M., Shinohara, A., and Takeda, M. An improved pattern matching algorithm for strings in terms of straight-line programs. In Proc. 8th Annual Symposium on Combinatorial Pattern Matching (CPM '97) (1997), vol. 1264 of Lecture Notes in Computer Science, Springer-Verlag, pp. 1–11.
[32] Nakamura, R., Inenaga, S., Bannai, H., Funamoto, T., Takeda, M., and Shinohara, A. Linear-time off-line text compression by longest-first substitution. Algorithms 2, 4 (2009), 1429–1448.
[33] Navarro, G., and Mäkinen, V. Compressed full-text indexes. ACM Computing Surveys 39, 1 (2007), 2.
[34] Nevill-Manning, C. G., Witten, I. H., and Maulsby, D. L. Compression by induction of hierarchical grammars. In Proc. Data Compression Conference 1994 (DCC '94) (1994), pp. 244–253.
[35] Russo, L. M. S., Navarro, G., and Oliveira, A. L. Dynamic fully-compressed suffix trees. In Proc. CPM'08 (2008), vol. 5029 of LNCS, pp. 191–203.
[36] Rytter, W. Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theoret. Comput. Sci. 302, 1–3 (2003), 211–222.
[37] Storer, J., and Szymanski, T. Data compression via textual substitution. Journal of the ACM 29, 4 (1982), 928–951.
[38] Tiskin, A. Towards approximate matching in compressed strings: Local subsequence recognition. In Proc. CSR'11 (2011), vol. 6651 of LNCS, pp. 410–414.
[39] Welch, T. A. A technique for high performance data compression. IEEE Computer 17 (1984), 8–19.
[40] Yamamoto, T., Bannai, H., Inenaga, S., and Takeda, M. Faster subsequence and don't-care pattern matching on compressed texts. In Proc. CPM'11 (2011), vol. 6661 of LNCS, pp. 309–322.
[41] Ziv, J., and Lempel, A. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory IT-23, 3 (1977), 337–343.
[42] Ziv, J., and Lempel, A. Compression of individual sequences via variable-length coding. IEEE Transactions on Information Theory 24, 5 (1978), 530–536.
