Bit-Optimal Lempel-Ziv compression

One of the most famous and most investigated lossless data-compression schemes is the one introduced by Lempel and Ziv about 40 years ago. This compression scheme is known as "dictionary-based compression" and consists of squeezing an input string by repla…

Authors: Paolo Ferragina, Igor Nitto, Rossano Venturini

Dipartimento di Informatica, Università di Pisa, Italy

Abstract. One of the most famous and most investigated lossless data-compression schemes is the one introduced by Lempel and Ziv about 40 years ago [23]. This compression scheme is known as "dictionary-based compression" and consists of squeezing an input string by replacing some of its substrings with (shorter) codewords which are actually pointers to a dictionary of phrases built as the string is processed. Surprisingly enough, although many fundamental results are nowadays known about upper bounds on the speed and effectiveness of this compression process (see e.g. [12, 16] and references therein), "we are not aware of any parsing scheme that achieves optimality when the LZ77-dictionary is in use under any constraint on the codewords other than being of equal length" [16, pag. 159]. Here optimality means to achieve the minimum number of bits in compressing each individual input string, without any assumption on its generating source. In this paper we provide the first LZ-based compressor which computes the bit-optimal parsing of any input string in efficient time and optimal space, for a general class of variable-length codeword encodings which encompasses most of the ones typically used in data compression and in the design of search engines and compressed indexes [14, 17, 22].

1 Introduction

The problem of lossless data compression consists of compactly representing data in a format that can be faithfully recovered from the compressed file. Lossless compression is achieved by taking advantage of the redundancy which is often present in the data generated by either humans or machines.
⋆ It has been partially supported by Yahoo! Research, the Italian MIUR Italy-Israel FIRB Project, and PRIN MAINSTREAM. The authors' address is Dipartimento di Informatica, L.go B. Pontecorvo 3, 56127 Pisa, Italy. Email: {ferragina,nitto,rossano}@di.unipi.it

One of the most famous lossless data-compression schemes is the one introduced by Lempel and Ziv in the late 70s [23, 24]. It has been "the solution" to lossless compression for nearly 15 years, and indeed many (non-)commercial programs are currently based on it, like gzip, zip, pkzip, arj, rar, just to cite a few. This compression scheme is known as dictionary-based compression, and consists of squeezing an input string S[1, n] by replacing some of its substrings (phrases) with (shorter) codewords which are actually pointers to a dictionary being either static (in that it has been constructed before the compression starts) or dynamic (in that it is built as the input string is compressed). The well-known LZ77 and LZ78 compressors, proposed by Lempel and Ziv in [23, 24], and all their numerous variants [17], are interesting examples of dynamic dictionary-based compression algorithms. In LZ77, and its variants, the dictionary consists of all substrings occurring in the previously scanned portion of the input string, and each codeword consists of a triple ⟨d, ℓ, c⟩ where d is the relative offset of the copied phrase, ℓ is its length, and c is the single (new) character following it. In LZ78, the dictionary is built upon phrases extracted from the previously scanned prefix of the input string, and each codeword consists of a pair ⟨id, c⟩ where id is the identifier of the copied phrase in the dictionary and c is the character following that phrase in the subsequent suffix of the string. Many theoretical and experimental results have been dedicated to LZ-compressors in these thirty years (see e.g.
[17] and references therein); and, although today there are alternative solutions to the problem of lossless data compression potentially offering better compression bounds (e.g., Burrows-Wheeler compression and Prediction by Partial Matching [22]), dictionary-based compression is still widely used in everyday applications because of its unique combination of compression power and speed. Over the years dictionary-based compression has also gained importance as a general algorithmic tool, being employed in the design of compressed text indexes [14], in universal clustering tools [3], and in designing optimal prefetching mechanisms [21]. Surprisingly enough, some important problems on the combinatorial properties of the LZ-parsing are still open, and they will be the main topic of investigation of this paper.

Take the classical LZ77 and LZ78 algorithms. They adopt a greedy parsing of the input string: namely, at each step, they take the longest dictionary phrase which is a prefix of the currently unparsed string suffix. It is well known [16] that greedy parsing is optimal with respect to the number of phrases in which S can be parsed by any suffix-complete dictionary (like LZ77); and a small variation of it (called flexible parsing [13]) is optimal for prefix-complete dictionaries (like LZ78). Of course, the number of parsed phrases influences the compression ratio and, indeed, various authors [23, 24, 12] proved that greedy parsing achieves asymptotically the (empirical) entropy of the source generating the input string S. However, these fundamental results have not yet closed the problem of optimally compressing S, because optimality in the number of parsed phrases is not necessarily equal to optimality in the number of bits output by the final compressor.
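To make the scheme concrete, here is a minimal, deliberately naive greedy LZ77 parser emitting the triples ⟨d, ℓ, c⟩ described above, together with its decoder. It is only a sketch: the function names are ours, the match search is a quadratic scan rather than the indexing structures used by real implementations, and overlapping copies are allowed as in LZ77.

```python
def lz77_parse(s):
    """Greedy LZ77: emit triples (d, l, c) where d is the offset of the
    longest copy found in the scanned prefix, l its length, and c the
    (new) character following the copy. Naive O(n^2) match search."""
    i, out, n = 0, [], len(s)
    while i < n:
        best_d = best_l = 0
        for p in range(i):                      # candidate copy positions
            l = 0
            # the copy may overlap the phrase itself (p + l may pass i)
            while i + l < n - 1 and s[p + l] == s[i + l]:
                l += 1
            if l > best_l:
                best_d, best_l = i - p, l
        out.append((best_d, best_l, s[i + best_l]))
        i += best_l + 1
    return out

def lz77_decode(triples):
    """Invert lz77_parse: replay each copy, then append the literal."""
    s = []
    for d, l, c in triples:
        start = len(s) - d
        for k in range(l):
            s.append(s[start + k])              # char-by-char: handles overlap
        s.append(c)
    return "".join(s)
```

For instance, `lz77_decode(lz77_parse("aacaacab"))` returns the original string; the same round-trip holds for any input.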
Clearly, if the phrases are compressed via an equal-length encoder, like in [17, 23, 12], then the produced output is bit optimal. But if one aims for higher compression by using variable-length encoders for the parsed phrases (see e.g. [22, 7]), then the bit-length of the compressed output produced by the greedy-parsing scheme is not necessarily optimal. As an illustrative example, consider the LZ77-compressor and assume that the copy of the i-th phrase occurs very far from the current unparsed position, and thus its d-value is large. In this case it could probably be more convenient to renounce the maximality of that phrase and split it into (several) smaller phrases which possibly occur closer to that position and thus can be encoded in fewer bits overall. Several solutions are indeed known for determining the bit-optimal parsing of S, but they are either inefficient [18, 8], taking Θ(n^2) time and space in the worst case, or approximate [9], or they rely on heuristics [11, 19, 2, 4, 8] which do not provide any guarantee on the time/space performance of the compression process. This is the reason why Rajpoot and Sahinalp stated in [16, pag. 159] that "We are not aware of any on-line or off-line parsing scheme that achieves optimality when the LZ77-dictionary is in use under any constraint on the codewords other than being of equal length". Motivated by these poor results, we address in this paper the question posed by Rajpoot and Sahinalp by investigating a general class of variable-length codeword encodings which are typically used in data compression and in the design of search engines and compressed indexes [14, 17, 22].
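The far-copy phenomenon is easy to quantify. Taking the Elias gamma code as f = g (its codeword for x costs 2⌊log2 x⌋ + 1 bits), a single long phrase copied from far away can cost more bits than the same symbols split into two shorter, closer copies; the distances below are hypothetical, chosen only to illustrate the trade-off:

```python
def gamma_len(x):
    """Bit length of the Elias gamma code of x >= 1: 2*floor(log2 x) + 1."""
    return 2 * (x.bit_length() - 1) + 1

# one maximal phrase of length 8 copied from distance 10^6 ...
far = gamma_len(10**6) + gamma_len(8)        # 39 + 7 = 46 bits
# ... versus the same 8 symbols as two length-4 phrases copied from distance 100
near = 2 * (gamma_len(100) + gamma_len(4))   # 2 * (13 + 5) = 36 bits
assert near < far
```

Renouncing the maximal phrase saves 10 bits in this toy configuration, which is exactly the effect a bit-optimal parser must exploit systematically.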
We prove that the classic greedy-parsing scheme deploying these encoders may be far from the bit-optimal parsing by a multiplicative factor Ω(log n / log log n), which is indeed unbounded asymptotically (Section 2, Lemma 1). This result is obtained by considering an infinite family of strings S of increasing length n and low empirical entropy, and by showing that for these strings copying the longest phrase, as LZ77 does, may be dramatically inefficient. We notice that this gap between LZ77 and the bit-optimal compressor strengthens the results proved by Kosaraju and Manzini in [12], who showed that the compression rate of LZ77 converges asymptotically to the k-th order empirical entropy H_k(S) of string S, and that this rate is dominated for low-entropy strings by an additive term O(log log n / log n) which depends on the string length n and not on its compressibility H_k(S). These are properly the strings for which the bit-optimal parser is much better than LZ77, and thus closer to the entropy of S.

Given these premises, we investigate and design an LZ-based compressor that computes the bit-optimal parsing for those variable-length integer encoders in efficient time and optimal space, in the worst case (Section 5, Theorem 2). Due to space limitations, we will detail our results only for the LZ77-dictionary, and defer the discussion on other dictionary-based schemes (like LZ78) to the final Section 6. Technically speaking, we follow [18] and model the search for a bit-optimal parsing of an input string S[1, n] as a single-source shortest path problem (shortly, SSSP) on a weighted DAG G(S) consisting of n nodes, one per character of S, and e edges, one per possible LZ77-parsing step. Every edge is weighted according to the length in bits of the codeword adopted to compress the corresponding LZ77-phrase.
Since LZ-codewords are tuples of integers (see above), we consider in this paper a class of codeword encoders which satisfy the so-called increasing cost property: the larger is the integer to be encoded, the longer is the codeword. This class encompasses most of the encoders frequently used in the literature to design data compressors [7], compressed full-text indexes [14] and search engines [22]. We prove new combinatorial properties for the SSSP-problem formulated on the graph G(S) weighted according to these encoding functions, and show that, unlike [18] (for which e = Θ(n^2) in the worst case), the computation of the SSSP in G(S) can be restricted onto a subgraph G̃(S) whose size is provably smaller than the complete graph (see Theorem 1). Actually, we show that the size of G̃(S) is related to the structural features of the integer-encoding functions adopted to compress the LZ-phrases (Lemma 3). Finally, we design an algorithm that computes the SSSP of G̃(S) without materializing that subgraph all at once, but by creating and exploring its edges on-the-fly in optimal O(1) amortized time per edge and using O(n) optimal space overall. As a result, our LZ77-compressor achieves the optimal compression ratio, by using optimal O(n) working space and taking time proportional to |G̃(S)| (hence, it is optimal in its size). If the LZ-phrases are encoded with equal-length codewords, our approach is optimal in compression ratio and time/space performance, as it is classically known [17]. But if we consider the more general (and open) case of the variable-length Elias or Fibonacci codes, as is typical of data compressors and compressed indexes [7, 14], then our approach is optimal in compression ratio and working-space occupancy, and takes O(n log n) time in the worst case.
Most other variable-length integer encoders fall in this case too (see e.g. [22]). To the best of our knowledge, this is the first result providing a positive answer to Rajpoot-Sahinalp's question above! The final Section 6 will discuss variations and extensions of our approach, considering the cases of a bounded compression-window and of other suffix- or prefix-complete dictionary construction schemes (like LZ78).

2 On the Bit-Optimality of LZ-parsing

Let S[1, n] be a string drawn from an alphabet Σ of size σ. We will use S[i] to denote the i-th symbol of S; S[i : j] to denote the substring (also called the phrase) extending from the i-th to the j-th symbol in S (extremes included); and S_i = S[i : n] to denote the i-th suffix of S.

Dictionary-based compression works in two intermingled phases: parsing and encoding. Let w_1, w_2, ..., w_{i−1} be the phrases in which a prefix of S has already been parsed. The parser selects the next phrase w_i as one of the phrases in the current dictionary that prefix the remaining suffix of S, and possibly attaches to this phrase a few other following symbols (typically one) in S. This addition is sometimes needed to enrich the dictionary with new symbols occurring in S and never encountered before in the parsing process. Phrase w_i is then represented via a proper reference to the dynamic dictionary, which is built during the parsing process. The well-known LZ77 and LZ78 compressors [23, 24], and all their variants [17], are incarnations of the above compression scheme and differentiate themselves mainly by the way the dynamic dictionary is built, by the rule adopted to select the next phrase, and by the encoding of the dictionary references. In the rest of the paper we will concentrate on the LZ77-scheme and defer the discussion of LZ78's, and other approaches, to the final Section 6.
In LZ77 and its variants, the dictionary consists of all substrings starting in the scanned prefix of S,^1 and thus it consists (implicitly) of a quadratic number of phrases which are (explicitly) represented via a (dynamic) indexing data structure that takes linear space in |S| and efficiently searches for repeated substrings. The LZ77-dictionary satisfies two main properties: Prefix Completeness, i.e. for any given dictionary phrase, all of its prefixes are also dictionary phrases; and Suffix Completeness, i.e. for any given dictionary phrase, all of its suffixes are also dictionary phrases. At any step, the LZ77-parser proceeds by adopting the so-called longest match heuristic: that is, w_i is taken as the longest phrase of the current dictionary which prefixes the remaining suffix of S. This will hereafter be called greedy parsing. The classic LZ77-parser finally adds one further symbol to w_i (namely, the one following the phrase in S) to form the next phrase of S's parsing. In the rest of our paper, and without loss of generality, we consider the LZ77-variant which avoids the additional symbol per phrase, and thus represents the next phrase w_i by the integer pair ⟨d_i, ℓ_i⟩ where d_i is the relative offset of the copied phrase w_i within the prefix w_1 ··· w_{i−1} and ℓ_i is its length (i.e. |w_i|). We notice that every first occurrence of a new symbol c is encoded as ⟨0, c⟩. Once phrases are identified and represented via pairs of integers, their components are compressed via variable-length integer encoders which will eventually produce the compressed output of S as a sequence of bits. In order to study and design bit-optimal parsing schemes, we therefore need to deal with integer encoders.
Let f be an integer-encoding function that maps any integer x ∈ [n] into a (bit-)codeword f(x) whose length (in bits) is denoted by |f(x)|. In this paper we consider variable-length encodings which use longer codewords for larger integers:

Property 1 (Increasing Cost Property). For any x, y ∈ [n], it is x ≤ y iff |f(x)| ≤ |f(y)|.

This property is satisfied by most practical integer encoders [22], as well as by equal-length codewords, Elias codes (i.e. gamma, delta, and their derivatives [6]), and Fibonacci's codes [7]. Therefore, this class encompasses all encoders typically used in the literature to design data compressors [17], compressed full-text indexes [14] and search engines [22].

Given two integer encoders f and g (possibly f = g) which satisfy the Increasing Cost Property 1, we denote by LZ_{f,g}(S) the compressed output produced by the greedy-parsing strategy in which we have used f to compress the distance d_i, and g to compress the length ℓ_i of any parsed phrase w_i. LZ_{f,g}(S) thus encodes any phrase w_i in |f(d_i)| + |g(ℓ_i)| bits. We have already noticed that LZ_{f,g}(S) is not necessarily bit optimal, so we will hereafter use OPT_{f,g}(S) to denote the (f, g)-optimal parser, namely the one that parses S into a sequence of phrases which are drawn from the LZ77-dictionary and which minimize the total number of bits produced by their encoders f and g. Given the above observations it is immediate to infer that |LZ_{f,g}(S)| ≥ |OPT_{f,g}(S)|. However, this bound does not provide us with any estimate of how much worse the greedy parsing can be with respect to OPT_{f,g}(S).

^1 Notice that it admits the overlapping between the current dictionary and the next phrase to be constructed.
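As a quick sanity check of Property 1, the codeword lengths of the Elias gamma and delta codes can be computed from their standard closed forms and tested for monotonicity (a sketch; the helper names are ours):

```python
def gamma_len(x):
    """|gamma(x)| = 2*floor(log2 x) + 1 bits, for x >= 1."""
    return 2 * (x.bit_length() - 1) + 1

def delta_len(x):
    """|delta(x)| = floor(log2 x) + 2*floor(log2(floor(log2 x) + 1)) + 1."""
    L = x.bit_length() - 1                 # floor(log2 x)
    return L + 2 * ((L + 1).bit_length() - 1) + 1

# Increasing Cost Property: x <= y implies |f(x)| <= |f(y)|
assert all(gamma_len(x) <= gamma_len(x + 1) for x in range(1, 5000))
assert all(delta_len(x) <= delta_len(x + 1) for x in range(1, 5000))
```

Both length functions are non-decreasing, so either code can play the role of f or g in the rest of the discussion.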
In what follows we identify an infinite family of strings for which the compressed output of the greedy parser is a multiplicative factor Ω(log n / log log n) worse than that of the bit-optimal parser. This result shows that the ratio |LZ_{f,g}(S)| / |OPT_{f,g}(S)| is indeed asymptotically unbounded, and thus poses the serious need for an (f, g)-optimal parser, as clearly requested by [16]. Our argument holds for any choice of f and g from the family of encoding functions that represent an integer x with a bit string of Θ(log x) bits. (Thus the well-known Elias' and Fibonacci's coders belong to this family.)

Taking inspiration from the proof of Lemma 4.2 in [12], we consider the infinite family of strings S_l = b a^l c^{2^l} b a b a^2 b a^3 ... b a^l, parameterized by the positive value l. The LZ77-parser partitions S_l as^2 (b) (a) (a^{l−1}) (c) (c^{2^l−1}) (ba) (ba^2) (ba^3) ... (ba^l), where the symbols forming a parsed phrase have been delimited within a pair of brackets. LZ77 thus copies the latest l phrases from the beginning of S_l, and takes at least l |f(2^l)| = Θ(l^2) bits. Let us now consider a more parsimonious parser, called rOPT, which selects the copy of b a^{i−1} (with i > 1) from its immediately previous occurrence: (b) (a) (a^{l−1}) (c) (c^{2^l−1}) (b) (a) (ba) (a) (ba^2) (a) ... (ba^{l−1}) (a). rOPT(S_l) takes |g(2^l)| + |g(l)| + Σ_{i=2}^{l} [ |f(i)| + |g(i)| + |f(0)| ] + O(l) = O(l log l) bits. Since OPT(S_l) ≤ rOPT(S_l), we can conclude that

|LZ_{f,g}(S_l)| / |OPT_{f,g}(S_l)| ≥ |LZ_{f,g}(S_l)| / |rOPT(S_l)| ≥ Θ(l / log l)    (1)

Since |S_l| = 2^l + Θ(l^2), we have that l = Θ(log |S_l|) for sufficiently long strings. Using this estimate in Inequality (1), we finally obtain:

Lemma 1.
There exists an infinite family of strings such that, for any of its elements S, it is |LZ_{f,g}(S)| ≥ Θ(log |S| / log log |S|) · |OPT_{f,g}(S)|.

On the other hand, we can prove that this lower bound is tight up to a log log |S| multiplicative factor, by easily extending to the LZ77-dictionary and Property 1 a result proved in [10] for static dictionaries. Precisely, we can show that |LZ_{f,g}(S)| / |OPT_{f,g}(S)| ≤ (|f(n)| + |g(n)|) / (|f(0)| + |g(0)|), which is upper bounded by O(log n) because |S| = n, |f(n)| = |g(n)| = Θ(log n) and |f(0)| = |g(0)| = O(1).

3 Bit-Optimal Parsing and the SSSP-problem

Following [18], we model the design of a bit-optimal LZ77-parsing strategy for a string S as a Single-Source Shortest Path problem (shortly, SSSP-problem) on a weighted DAG G(S) defined as follows. Graph G(S) = (V, E) has one vertex per symbol of S plus a dummy vertex v_{n+1}, and its edge set E is defined so that (v_i, v_j) ∈ E iff (1) j = i + 1 or (2) the substring S[i : j − 1] occurs in S starting from a (previous) position p < i (clearly i < j, and thus G(S) is a DAG). Every edge (v_i, v_j) is labeled with the pair ⟨d_{i,j}, ℓ_{i,j}⟩ where:

– d_{i,j} = i − p is the distance between S[i : j − 1] and the position p of its right-most copy in S. We set d_{i,j} = 0 whenever j = i + 1.
– We set ℓ_{i,j} = j − i, if j > i + 1, or ℓ_{i,j} = S[i] otherwise.

It is easy to see that the edges outgoing from v_i denote all possible parsing steps that can be taken by any parsing strategy which uses a LZ77-dictionary.

^2 Recall the variant of LZ77 we are considering in this paper, which uses just a pair of integers per phrase, and thus drops the char following that phrase in S.
Hence, there exists a one-to-one correspondence between paths from v_1 to v_{n+1} in G(S) and parsings of the whole string S: any path π = v_1 ··· v_{n+1} corresponds to the parsing P_π(S) represented by the phrases labeling the edges traversed by π. If we weight every edge (v_i, v_j) ∈ E with an integer c(v_i, v_j) = |f(d_{i,j})| + |g(ℓ_{i,j})| which accounts for the cost of encoding its label (phrase) via the encoding functions f and g, then the length in bits of the encoded parsing P_π(S) is equal to the cost of the weighted path π in G(S). We have therefore reduced the problem of determining OPT_{f,g}(S) to the problem of computing the SSSP of G(S) from v_1 to v_{n+1}.

Given that G(S) is a DAG, its shortest path from v_1 to v_{n+1} can be computed in O(|E|) time and space. In the worst case (take e.g. S = a^n), this is Θ(n^2), and thus it is inefficient and unusable in practice [18, 9]. In what follows we show that the computation of the SSSP can actually be restricted to a subgraph of G(S) whose size is provably o(n^2) in the worst case, and typically O(n log n) for most known integer-encoding functions. Then we will design efficient and sophisticated algorithms and data structures that will allow us to generate this subgraph on-the-fly by taking O(1) amortized time per edge and O(n) space overall. These algorithms will therefore be time-and-space optimal for the subgraph at hand!

4 A useful, small subgraph of G(S)

We use FS(v) to denote the forward star of a vertex v, namely the set of vertices pointed to by v in G(S); and we use BS(v) to denote the backward star of v, namely the set of vertices pointing to v in G(S). By construction of G(S), for any v_j ∈ FS(v_i) it is i < j; so all of the edges are oriented rightward, and in fact G(S) is a DAG.
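The SSSP reduction of Section 3 is easy to realize directly, at the price of the Θ(n^2) edge set: the sketch below relaxes every edge of G(S) in topological order using Elias gamma costs, and charges a literal edge ⟨0, c⟩ with 1 + 8 bits, an assumption of ours made only to fix a concrete weight for raw characters:

```python
def gamma_len(x):
    """Elias gamma bit length; x = 0 (the literal marker) charged 1 bit."""
    return 1 if x == 0 else 2 * (x.bit_length() - 1) + 1

LITERAL_BITS = gamma_len(0) + 8        # assumed cost of an edge <0, c>

def opt_bits(s):
    """Bit-optimal parsing cost via SSSP on the full DAG G(S):
    C[i] = cheapest encoding of S[0:i]; naive Theta(n^2)-edge build."""
    n = len(s)
    C = [0] + [float("inf")] * n
    for i in range(n):
        # literal edge (v_i, v_{i+1})
        C[i + 1] = min(C[i + 1], C[i] + LITERAL_BITS)
        # copy edges: prefixes of matches starting at p < i (overlap allowed)
        for p in range(i):
            l = 0
            while i + l < n and s[p + l] == s[i + l]:
                l += 1
            for m in range(1, l + 1):
                w = gamma_len(i - p) + gamma_len(m)
                C[i + m] = min(C[i + m], C[i] + w)
    return C[n]

# e.g. "abab"*10: two literals (9 + 9 bits) plus one overlapping copy
# <d=2, l=38> costing gamma(2) + gamma(38) = 3 + 11 = 14 bits, total 32
assert opt_bits("abab" * 10) == 32
```

On S = a^n the inner loops already touch Θ(n^2) (position, length) pairs, which is precisely the inefficiency the rest of the paper removes.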
We can actually show a stronger property on the distribution of the indices of the vertices in FS(v) and BS(v), namely that they form a contiguous range.

Fact 1. Given a vertex v_i, it is FS(v_i) = {v_{i+1}, ..., v_{i+x−1}, v_{i+x}} and BS(v_i) = {v_{i−y}, ..., v_{i−2}, v_{i−1}}. Note that x, y are smaller than the length of the longest repeated substring in S.

Proof: By definition of (v_i, v_{i+x}), string S[i : i + x − 1] occurs at some position p < i in S. Any prefix S[i : k − 1] of S[i : i + x − 1] also occurs at that position p, thus v_k ∈ FS(v_i). The bound on x immediately derives from the definition of (v_i, v_{i+x}). A similar argument derives the property on BS(v_i). ⊓⊔

This actually means that if an edge does exist in G(S), then all the edges which are nested within it and are incident to one of its extremes do exist as well. The following property relates the indices of the vertices v_j ∈ FS(v_i) with the cost of their connecting edge (v_i, v_j), and not surprisingly shows that the smaller is j (i.e. the shorter the edge), the smaller is the cost of encoding the parsed phrase S[i : j − 1].^3

Fact 2. Given a vertex v_i, for any pair of vertices v_{j′}, v_{j′′} ∈ FS(v_i) such that j′ < j′′, we have that c(v_i, v_{j′}) ≤ c(v_i, v_{j′′}). The same property holds for v_{j′}, v_{j′′} ∈ BS(v_i).

Proof: We have that d_{i,j′} ≤ d_{i,j′′} and ℓ_{i,j′} < ℓ_{i,j′′} because S[i : j′ − 1] is a prefix of S[i : j′′ − 1], and thus the first substring occurs wherever the latter occurs. The property holds because f and g satisfy the Increasing Cost Property 1.

^3 Recall that c(v_i, v_j) = |f(d_{i,j})| + |g(ℓ_{i,j})| if the edge does exist; otherwise we set c(v_i, v_j) = +∞.
⊓⊔

Given these monotonicity properties, we are ready to characterize a special subset of the vertices in FS(v_i), and their connecting edges.

Definition 1. An edge (v_i, v_j) ∈ E is called:
– d-maximal iff the next edge from v_i takes more bits to encode its distance; namely, we have that |f(d_{i,j})| < |f(d_{i,j+1})|;
– ℓ-maximal iff the next edge from v_i takes more bits to encode its length; namely, we have that |g(ℓ_{i,j})| < |g(ℓ_{i,j+1})|.
Overall, we say that edge (v_i, v_j) is maximal if it is either d-maximal or ℓ-maximal: thus c(v_i, v_j) < c(v_i, v_{j+1}).

Now, we wish to count the number of maximal edges outgoing from v_i. This number clearly depends on the integer-encoding functions f and g (which satisfy Property 1). Let Q(f, n) (resp. Q(g, n)) be the number of different codeword lengths generated by f (resp. g) when applied to integers in the range [n]. We can partition [n] into contiguous sub-ranges I_1, I_2, ..., I_{Q(f,n)} such that the integers in I_i are mapped by f to codewords (strictly) shorter than the codewords for the integers in I_{i+1}. Similarly, g partitions the range [n] into Q(g, n) contiguous sub-ranges.

Lemma 2. There are at most Q(f, n) + Q(g, n) maximal edges outgoing from any vertex v_i.

Proof: By Fact 1, the vertices in FS(v_i) have indices in a range R, and by Fact 2, c(v_i, v_j) is monotonically non-decreasing as j increases in R. Moreover, we know that the codeword length of f (resp. g) cannot change more than Q(f, n) (resp. Q(g, n)) times, so the statement follows. ⊓⊔

In order to speed up the computation of a SSSP connecting v_1 to v_{n+1} in G(S), we construct a subgraph G̃(S) which is provably smaller than G(S) and contains one of those SSSP. The next theorem shows that G̃(S) can be formed by taking just the maximal edges of G(S).

Theorem 1.
There exists a shortest path in G(S) connecting v_1 to v_{n+1} and traversing only maximal edges.

Proof: By contradiction, assume that every such shortest path contains at least one non-maximal edge. Let π = v_{i_1} v_{i_2} ... v_{i_k}, with i_1 = 1 and i_k = n + 1, be one of these shortest paths, and let γ = v_{i_1} ... v_{i_r} be the longest initial subpath of π which traverses only maximal edges. Assume w.l.o.g. that π is the shortest path maximizing the value of |γ|. We know that (v_{i_r}, v_{i_{r+1}}) is a non-maximal edge, and thus we can take the maximal edge (v_{i_r}, v_j) that has the same cost. By definition of maximal edge, it is j > i_{r+1}; furthermore, we must have j < n + 1 because we assumed that no such shortest path is formed only by maximal edges. Thus there must exist an index i_h ≥ i_r such that j ∈ [i_h, i_{h+1}], because the indices in π are increasing, given that G(S) is a DAG. Since (v_{i_h}, v_{i_{h+1}}) is an edge of π, by Fact 1 it follows that the edge (v_j, v_{i_{h+1}}) does exist, and by Fact 2 on BS(v_{i_{h+1}}) we can conclude that c(v_j, v_{i_{h+1}}) ≤ c(v_{i_h}, v_{i_{h+1}}). Consequently, the path v_{i_1} ··· v_{i_r} v_j v_{i_{h+1}} ··· v_{i_k} is also a shortest path, but its longest initial subpath of maximal edges consists of |γ| + 1 vertices, which is a contradiction! ⊓⊔

Optimal-Parser(S[1, n])
1. C[1] = 0; P[1] = 1;
2. for each i ∈ [2, n + 1] do C[i] = +∞; P[i] = NIL;
3. for i = 1 to n do
4.   generate on-the-fly all maximal edges in FS(v_i);
5.   for any maximal (v_i, v_j) do
6.     if C[j] > C[i] + c(v_i, v_j) then C[j] = C[i] + c(v_i, v_j); P[j] = i;

Fig. 1. Algorithm to compute the SSSP of the subgraph G̃(S).

Theorem 1 implies that the distance between v_1 and v_{n+1} is the same in G(S) and G̃(S), with the advantage that computing distances in G̃(S) can be done faster and in reduced space, because of its smaller size.
In fact, Lemma 2 implies that |FS(v)| ≤ Q(f, n) + Q(g, n), so that:

Lemma 3. Subgraph G̃(S) consists of n + 1 vertices and at most n (Q(f, n) + Q(g, n)) edges.

For Elias' codes [6], Fibonacci's codes [7], and most practical integer encoders used for search engines and data compressors [17, 22], it is Q(f, n) = Q(g, n) = O(log n). Therefore |G̃(S)| = O(n log n), and hence it is provably smaller than the complete graph built and used by the previous papers [18, 9, 8].

The next technical step consists of achieving time efficiency and optimality in working space, because we cannot construct G̃(S) all at once. In the next sections we design an algorithm that generates G̃(S) on-the-fly as the computation of its SSSP goes on, and pays O(1) amortized time per edge and no more than O(n) optimal space overall. In some sense, this algorithm is optimal for the identified subgraph G̃(S).

5 A bit-optimal parser

From a high level, our solution proceeds as in Figure 1, where a variant of a classic linear-time algorithm for SSSP over a DAG is reported [5, Section 24.2]. In that pseudo-code, entries C[i] and P[i] hold, respectively, the shortest-path distance from v_1 to v_i and the predecessor of v_i in that shortest path. The main idea of Optimal-Parser consists of scanning the vertices of G̃(S) in topological order, and of generating on-the-fly and relaxing (Step 6) the edges outgoing from a vertex v_i only when v_i becomes the current vertex. The correctness of Optimal-Parser follows directly from Theorem 24.5 of [5] and our Theorem 1. The key difficulty in this process consists of how to generate on-the-fly and efficiently (in time and space) the maximal edges outgoing from vertex v_i. We will refer to this problem as the forward-star generation problem, and use FSG for brevity.
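As an illustration, here is a direct Python rendition of Figure 1 in which the maximal edges of FS(v_i) are generated by brute force; the O(1)-amortized FSG machinery developed next is deliberately not reproduced, and the Elias gamma costs and the 9-bit literal charge are our own assumptions:

```python
def gamma_len(x):
    """Elias gamma bit length; x = 0 (the literal marker) charged 1 bit."""
    return 1 if x == 0 else 2 * (x.bit_length() - 1) + 1

def maximal_edges(s, i):
    """Maximal edges leaving v_i, by brute force: d[m] = distance of the
    rightmost copy of the length-m phrase at i (non-decreasing in m by
    Fact 2); keep (d, m) whenever the f-cost or the g-cost steps up."""
    n, d = len(s), {}
    for p in range(i):
        l = 0
        while i + l < n and s[p + l] == s[i + l]:
            l += 1
        for m in range(1, l + 1):
            if m not in d or i - p < d[m]:
                d[m] = i - p
    lmax, edges = max(d, default=0), []
    for m in range(1, lmax + 1):
        if (m == lmax or gamma_len(d[m]) < gamma_len(d[m + 1])
                or gamma_len(m) < gamma_len(m + 1)):
            edges.append((d[m], m))
    return edges

def optimal_parser(s):
    """Figure 1: DAG shortest path relaxing only literal and maximal edges."""
    n = len(s)
    C = [0] + [float("inf")] * n
    for i in range(n):
        C[i + 1] = min(C[i + 1], C[i] + gamma_len(0) + 8)   # literal <0, c>
        for d, m in maximal_edges(s, i):
            C[i + m] = min(C[i + m], C[i] + gamma_len(d) + gamma_len(m))
    return C[n]
```

By Theorem 1, restricting the relaxation to maximal edges preserves the optimum, so `optimal_parser` returns the same cost as a shortest path over the full graph G(S).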
In the next section we show that, when σ ≤ n, FSG takes O(1) amortized time per edge and O(n) space in total. As a result (Theorem 2), Optimal-Parser requires O(n (Q(f, n) + Q(g, n))) time in the worst case, since the main loop is repeated n times and we have no more than Q(f, n) + Q(g, n) maximal edges per vertex (Lemma 2). The space used is that for the FSG-computation plus the two arrays C and P; hence, it will be shown to be O(n) in total. In case of a large alphabet σ > n, we need to add T_sort(n, σ) time because of the sorting/remapping of S's symbols into [n].

Theorem 2. Given a string S[1, n] drawn from an alphabet of size σ, and two integer-encoding functions f and g that satisfy Property 1, there exists an LZ77-based compression algorithm that computes the (f, g)-optimal parsing of S in O(n (Q(f, n) + Q(g, n)) + T_sort(n, σ)) time and O(n) space in the worst case.

5.1 On-the-fly generation of d-maximal edges

We concentrate only on the computation of the d-maximal edges, because this is the hardest task. In fact, we know that the edges outgoing from v_i can be partitioned into no more than Q(f, n) groups according to the distance from S[i] of the copied string they represent (proof of Lemma 2). Let I_1, I_2, ..., I_{Q(f,n)} be the intervals of distances such that all distances in I_k are encoded with the same number of bits by f. Take now the d-maximal edge (v_i, v_{h_k}) for the interval I_k. We can infer that substring S[i : h_k − 1] is the longest substring having a copy at distance within I_k because, by Definition 1 and Fact 2, any edge following (v_i, v_{h_k}) denotes a longer substring which must lie in a subsequent interval (by d-maximality of (v_i, v_{h_k})), and thus must have a longer distance from S[i].
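For a concrete feel of these intervals, the sketch below partitions [1, n] into the runs I_1, I_2, ... induced by the Elias gamma code: each run collects the integers sharing one codeword length, and there are Q(f, n) = ⌊log2 n⌋ + 1 = O(log n) of them:

```python
def gamma_len(x):
    """Elias gamma bit length of x >= 1: 2*floor(log2 x) + 1."""
    return 2 * (x.bit_length() - 1) + 1

def cost_intervals(n):
    """Maximal runs I_1, I_2, ... of [1, n] with constant gamma length."""
    runs, start = [], 1
    for x in range(2, n + 1):
        if gamma_len(x) != gamma_len(start):
            runs.append((start, x - 1))
            start = x
    runs.append((start, n))
    return runs

# gamma groups [1, n] by powers of two: [1,1], [2,3], [4,7], [8,15], ...
assert cost_intervals(20) == [(1, 1), (2, 3), (4, 7), (8, 15), (16, 20)]
```

For a different encoder f, only `gamma_len` changes; the run structure is what the per-interval passes of FSG iterate over.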
Once d-maximal edges are known, the computation of the ℓ-maximal edges is easy, because it suffices to further decompose the edges between successive d-maximal edges, say between (v_i, v_{h_{k−1}+1}) and (v_i, v_{h_k}), according to the distinct values assumed by the encoding function g on the lengths in the range [h_{k−1}, ..., h_k − 1]. This takes O(1) time per ℓ-maximal edge, because it needs some algebraic calculations and can then infer the corresponding copied substring as a prefix of S[i : h_k − 1]. So, let us concentrate on the computation of d-maximal edges outgoing from vertex v_i. This is based on two key ideas.

The first idea aims at optimizing the space usage by achieving the optimal O(n) working-space bound. It consists of proceeding in Q(f, n) passes, one per interval I_k of possible d-costs for the edges in G̃(S). During the kth pass, we logically partition the vertices of G̃(S) into blocks of |I_k| contiguous vertices, say v_{i_k}, v_{i_k+1}, ..., v_{i_k+|I_k|−1}, and compute all d-maximal edges which spread from that block and have distance within I_k (thus the same d-cost c(I_k)). These edges are kept in memory until they are used by Optimal-Parser, and discarded as soon as the first vertex of the next block, i.e. v_{i_k+|I_k|}, needs to be processed. The next block of |I_k| vertices is then fetched and the process repeats. Actually, all passes are executed in parallel to guarantee that all d-maximal edges of v_i are available when processing it. There are n/|I_k| distinct blocks, each vertex belongs to exactly one block at each pass, and all of its d-maximal edges are considered in some pass (because they have d-cost in some I_k). The space is ∑_{k=1}^{Q(f,n)} |I_k| = O(n), because we keep one d-maximal edge per vertex at any pass.
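The decomposition by g described at the start of this paragraph can be illustrated with a hypothetical sketch: within each equal-cost class of lengths, only the largest length can yield an ℓ-maximal edge, since it copies more at the same cost. Here g is instantiated with Elias' gamma code purely as an assumption; any encoder satisfying Property 1 behaves analogously.

```python
# Lengths in [a, b] between two successive d-maximal edges are split by the
# number of bits g spends on them; the largest length in each class gives the
# l-maximal edge.  Gamma code stands in for g here (an illustrative choice).
def gamma_bits(x):
    return 2 * x.bit_length() - 1

def l_maximal_lengths(a, b):
    """Largest length of each equal-gamma-cost class intersecting [a, b]."""
    out, lo = [], a
    while lo <= b:
        hi = min(2 ** lo.bit_length() - 1, b)  # last length with same cost
        out.append(hi)
        lo = hi + 1
    return out
```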
The second key idea aims at computing the d-maximal edges for that block of |I_k| contiguous vertices in O(|I_k|) time and space. This is what we address in the rest of this paper, because its solution will allow us to state that the time complexity of FSG is ∑_{k=1}^{Q(f,n)} ∑_{i=1}^{n/|I_k|} O(|I_k|) = O(n Q(f, n)), namely O(1) amortized time per d-maximal edge. Combining this fact with the above observation on the computation of the ℓ-maximal edges, we get Theorem 2.

So, let us assume that the alphabet size σ ≤ n, and consider the kth pass of FSG, in which we assume that I_k = [l, r]. Recall that all distances in I_k can be f-encoded in the same number of, say, c(I_k) bits. Let B = [i, i + |I_k| − 1] be the block of (indices of) vertices for which we wish to compute on-the-fly the d-maximal edges of cost c(I_k). This means that the d-maximal edge from vertex v_h, h ∈ B, represents a phrase that starts at S[h] and has a copy whose starting position is in the window W_h = [h − r, h − l]. Thus the distance of that copy can be f-encoded in c(I_k) bits, and so we will say that the edge has d-cost c(I_k).

Fig. 2. The figure shows the interval B = [i, j] with j = i + |I_k| − 1, and the window W_B and its two halves W'_B and W''_B.

Since this computation must be done for all vertices in B, it is useful to consider W_B = W_i ∪ W_{i+|I_k|−1}, which merges the first and last window and thus spans all positions that can be the (copy-)reference of any d-maximal edge outgoing from B. Note that |W_B| = 2|I_k| (see Figure 2).
The following fact is crucial to efficiently compute all these d-maximal edges via proper indexing data structures:

Fact 3. If there exists a d-maximal edge outgoing from v_h having d-cost c(I_k), then this edge can be found by determining a position s ∈ W_h whose suffix S_s shares the maximum longest common prefix (shortly, lcp) with S_h.

Proof: Among all positions s in W_h, take one whose suffix S_s shares the maximum lcp with S_h, and let q be the length of this lcp. Of course, there may exist many such positions; we take just one of them. Then the edge (v_h, v_{h+q+1}) has d-cost c(I_k) and is d-maximal. In fact, any other position s' ∈ W_h induces the edge (v_h, v_{h+q'+1}), where q' < q is the length of the lcp shared between S_{s'} and S_h. This edge cannot be d-maximal because its d-cost is still c(I_k) but its length is shorter. ⊓⊔

In the rest of the paper we will call the position s of Fact 3 the maximal position for vertex v_h. The maximal position for v_h exists only if v_h has a d-maximal edge of cost c(I_k). Therefore we design an algorithm which computes the maximal positions of every vertex v_h in B whenever they do exist, and otherwise assigns an arbitrary position to v_h. The net result will be that we will be generating a supergraph of G̃(S) which is still guaranteed to have the size stated in Lemma 2 and can be created efficiently in O(|I_k|) time and space, as we required above.

Fact 3 has related the computation of maximal positions for the vertices in B to lcp-computations between suffixes in B and suffixes in W_B. Therefore it is natural to resort to some indexing data structure, like the compact trie T_B, built over the suffixes of S which start in the range of positions B ∪ W_B. Trie T_B takes O(|B| + |W_B|) = O(|I_k|) space, and that is within our required space bounds.
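Fact 3 can be checked with a brute-force sketch that scans the window and keeps the position of maximum lcp. This quadratic-time version only illustrates the reduction; the paper achieves O(|I_k|) time via the trie T_B. The names and the example string are ours.

```python
# Brute-force illustration of Fact 3: the d-maximal edge of v_h corresponds
# to a position s in W_h whose suffix shares the longest lcp with S_h.
def lcp(S, a, b):
    n, q = len(S), 0
    while a + q < n and b + q < n and S[a + q] == S[b + q]:
        q += 1
    return q

def maximal_position(S, h, lo, hi):
    """Position s in W_h = [lo, hi] maximizing lcp(S_s, S_h), with that lcp."""
    best, best_q = None, -1
    for s in range(max(lo, 0), min(hi, len(S) - 1) + 1):
        q = lcp(S, s, h)
        if q > best_q:
            best, best_q = s, q
    return best, best_q
```

For S = "abcabcabx" and h = 6 (suffix "abx"), the whole-window search over [0, 5] finds the copy "ab" of length 2; by Fact 3, the corresponding d-maximal edge is (v_h, v_{h+2+1}).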
We then notice that the maximal position s for a vertex v_h in B having d-cost c(I_k) can be computed by finding the leaf of T_B which is labeled with an index s ∈ W_h and has the deepest lowest common ancestor (shortly, lca) with the leaf labeled h. We need to answer this query in O(1) amortized time per vertex v_h, since we aim at achieving an O(|I_k|) time complexity over all vertices in B. However, this is tricky. In fact, this is not the classic lca-query, because we do not know s, which is actually the position we are searching for. Furthermore, since the leaf s is the closest one to h in T_B among the leaves with index in W_h, one could think of using proper predecessor/successor queries on a suitable dynamic set of suffixes in W_h. Unfortunately, this would take ω(1) time because of well-known lower bounds [1]. Therefore, answering this query in constant (amortized) time per vertex requires devising and deploying proper structural properties of the trie T_B and of the problem at hand. This is what we do in our algorithm, whose underlying intuition follows.

Let u be the lca of the leaves h and s in T_B. For simplicity, we assume that interval W_h strictly precedes B and that s is the unique maximal position for v_h (our algorithm deals with the other cases too, see the proof of Lemma 4). We observe that h must be the smallest index that lies in B and labels a leaf descending from u in T_B. In fact, assume by contradiction that a smaller index h' < h does exist. By definition h' ∈ B, and thus v_h would not have a d-maximal edge of d-cost c(I_k), because it could copy from the closer h' a possibly longer phrase, instead of copying from the farther set of positions in W_h.
This observation implies that we have to search for only one maximal position per node u of T_B, and this position refers to the vertex v_{a(u)} whose index a(u) is the smallest one that lies in B and labels a leaf descending from u. Computing the a-values clearly takes O(|T_B|) = O(|I_k|) time and space. Now we need to compute the maximal position for v_{a(u)}, for each node u ∈ T_B. We cannot traverse the subtree of u searching for the maximal position of v_{a(u)}, because this would take quadratic time. Conversely, we define W'_B and W''_B to be the first and the second half of W_B, respectively, and observe that any window W_h has its left extreme in W'_B and its right extreme in W''_B (see Figure 2 for an illustrative example). Therefore the window W_{a(u)} containing the maximal position s for v_{a(u)} overlaps both W'_B and W''_B. So if s does exist, then s belongs either to W'_B or to W''_B, and leaf s descends from u. Therefore, the maximum (resp. minimum) among the elements in W'_B (resp. W''_B) that label leaves descending from u must belong to W_{a(u)}. This suggests computing min(u) and max(u) as, respectively, the rightmost position in W'_B and the leftmost position in W''_B that label leaves descending from u. These values can be computed in O(|I_k|) time by a post-order visit of T_B.

We are now ready to compute mp[h] as the maximal position for v_h, if it exists, or otherwise set mp[h] arbitrarily. We initially set all mp's entries to nil; then we visit T_B in post-order and perform, at each node u, the following checks whenever mp[a(u)] = nil:
– If min(u) ∈ W_{a(u)}, set mp[a(u)] = min(u).
– If max(u) ∈ W_{a(u)}, set mp[a(u)] = max(u).
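The post-order bookkeeping of a(u), min(u), and max(u) can be sketched on a toy trie. The representation (nested lists whose leaves are integer position labels) and the half-open ranges standing in for B, W'_B, and W''_B are illustrative assumptions, not the paper's data structure.

```python
# One post-order visit computes, for each subtree: a = smallest leaf label in
# B, mn = rightmost label in W'_B, mx = leftmost label in W''_B (or None).
def annotate(node, B, W1, W2):
    """Return (a, mn, mx) for the subtree `node`; ranges are half-open."""
    if isinstance(node, int):                 # leaf labeled with a position
        a = node if B[0] <= node < B[1] else None
        mn = node if W1[0] <= node < W1[1] else None
        mx = node if W2[0] <= node < W2[1] else None
        return a, mn, mx
    a = mn = mx = None
    for child in node:                        # post-order: children first
        ca, cmn, cmx = annotate(child, B, W1, W2)
        a = ca if a is None else (a if ca is None else min(a, ca))
        mn = cmn if mn is None else (mn if cmn is None else max(mn, cmn))
        mx = cmx if mx is None else (mx if cmx is None else min(mx, cmx))
    return a, mn, mx
```

On the toy trie [[10, 3], [12, 5, 7]] with B = [10, 14), W'_B = [0, 4), W''_B = [4, 8), the root gets a = 10, min = 3, max = 5, exactly the candidates the two checks above compare against W_{a(u)}.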
At the end of the visit, if mp[a(u)] is still nil, we set mp[a(u)] = a(parent(u)) whenever a(u) ≠ a(parent(u)). This last check is needed (proof below) to manage the case in which S[a(u)] can copy the phrase starting at its position from position a(parent(u)) and, additionally, we have that B overlaps W_B (which may occur depending on f). Since T_B has size O(|I_k|), the overall algorithm requires O(|I_k|) time and space in the worst case, as required. The following lemma proves its correctness:

Lemma 4. For each position h ∈ B, if there exists a d-maximal edge outgoing from v_h and having d-cost c(I_k), then mp[h] is equal to its maximal position.

Proof: Recall that B = [i, i + |I_k| − 1] and consider the longest path π = u_1 u_2 ... u_z in T_B that starts from the leaf u_1 labeled with h ∈ B and goes upward as long as the traversed nodes satisfy a(u_j) = h, for j = 1, ..., z. By definition of a-value, we know that all leaves descending from u_z and occurring in B are labeled with an index which is larger than h. Clearly, a(parent(u_z)) < h (if any). There are two cases for the final value stored in mp[h].

Suppose that mp[h] ∈ W_h. We want to prove that mp[h] is the index of the leaf which has the deepest lca with h among all the other leaves in W_h. Let u_x ∈ π be the node in which the value of mp[h] is assigned (so a(u_x) = h). Assume that there exists at least another index in W_h whose leaf has a deeper lca with leaf h. This lca must lie on u_1 ... u_{x−1}, say at u_l. Since W_h is an interval having its left extreme in W'_B and its right extreme in W''_B, the value max(u_l) or min(u_l) must lie in W_h, and thus the algorithm has set mp[h] to one of these positions, because of the post-order visit of T_B.
Therefore mp[h] must be the index of the leaf having the deepest lca with h, and thus by Fact 3 it is its maximal position (if any).

Now suppose that mp[h] ∉ W_h and, thus, it cannot be a maximal position for v_h. We have to prove that there exists no d-maximal edge outgoing from the vertex v_h with cost c(I_k). Let S_s be the suffix in W_h having the maximum lcp with S_h, and let l be the lcp-length. Values min(u_i) and max(u_i) do not belong to W_h, for any node u_i ∈ π (with a(u_i) = h), otherwise mp[h] would have been assigned an index in W_h (contradicting the hypothesis). The value of mp[h] remains nil up to node u_z. This implies that no suffix descending from u_z starts in W_h and, in particular, S_s does not descend from u_z. Therefore, the lca between leaves h and s is a node on the path from parent(u_z) to the root, and lcp(S_{a(parent(u_z))}, S_h) ≥ lcp(S_s, S_h) = l. Since a(parent(u_z)) < a(u_z) and belongs to B, it is nearer to h than any other position in W_h, and shares a longer prefix with S_h. So we have found a longer edge from v_h with smaller d-cost. This implies that v_h has no d-maximal edge of cost c(I_k) in G̃(S). ⊓⊔

We are left with the problem of building T_B in O(|I_k|) time and space, thus with a time complexity which is independent of the length of the indexed suffixes and of the alphabet size. In Appendix A we show how to achieve this result by deploying the fact that the above algorithm does not make any assumption on the ordering of T_B, because it just computes (a sort of) lca-queries on its structure. This is the last step to prove Theorem 2.

6 Conclusions

Our parsing scheme can be extended to variants of LZ77 which deploy parsers that refer to a bounded compression-window (the typical scenario of gzip and its derivatives [17]).
In this case, LZ77 selects the next phrase by looking only at the most recent w input symbols. Since w is usually a constant, chosen as a power of 2 of a few KBs [17], the running time of our algorithm becomes O(n Q(g, n)), since Q(f, w) is a constant. We notice that the remaining term could be further refined by considering the length ℓ of the longest repeated substring in S, and stating the time complexity as O(n Q(g, ℓ)). If S is generated by an ergodic source [20] and g is taken to be the classic Elias' code, then Q(g, ℓ) = O(log log n), so that the complexity of our algorithm results in O(n log log n) time and O(n) space for this class of strings.

We finally notice that, although we have mainly dealt with the LZ77-dictionary, the techniques presented in this paper could be extended to design efficient bit-optimal compressors for other on-line dictionary construction schemes, like LZ78. Intuitively, we can show that Theorem 1 still holds for any suffix- or prefix-complete dictionary under the hypothesis that the codewords assigned to each suffix or prefix of a dictionary phrase w are shorter than the codewords assigned to w itself. In this case the notion of edge maximality (Definition 1) can be generalized by calling an edge (v_i, v_j) maximal iff all longer edges, say (v_i, v_{j'}) with j < j', have no larger cost, namely w_{i,j'} ≤ w_{i,j}. In this case, we can provide an efficient bit-optimal parser for the LZ78-dictionary (details in the full paper).

The main open question is to extend our results to statistical encoding functions like Huffman or Arithmetic coders applied on the integral range [n] [22]. They do not necessarily satisfy Property 1, because it might be the case that |f(x)| > |f(y)| whenever the integer y occurs more frequently than the integer x in the parsing of S.
We argue that it is not trivial to design a bit-optimal compressor for these encoding functions, because their codeword lengths change as the set of distances and lengths used in the parsing process changes. Practically, we would like to implement our optimal parser, motivated by the encouraging experimental results of [8], which have improved the standard LZ77 by a heuristic that tries to optimize the encoding cost of just the phrases' lengths.

References

1. P. Beame and F. E. Fich. Optimal bounds for the predecessor problem. In Proceedings of the 31st ACM Symposium on Theory of Computing, pages 295–304, 1999.
2. J. Békési, G. Galambos, U. Pferschy, and G. J. Woeginger. Greedy algorithms for on-line data compression. Journal of Algorithms, 25(2):274–289, 1997.
3. R. Cilibrasi and P. M. B. Vitányi. Clustering by compression. CoRR, cs.CV/0312044, 2003.
4. M. Cohn and R. Khazan. Parsing with prefix and suffix dictionaries. In Data Compression Conference, pages 180–189, 1996.
5. T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms, Second Edition. The MIT Press and McGraw-Hill Book Company, 2001.
6. P. Elias. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, 21(2):194–203, 1975.
7. P. Fenwick. Universal codes. In Lossless Compression Handbook, pages 55–78. Academic Press, 2003.
8. A. Langiu, G. Della Penna, F. Mignosi, and A. Ulisse. On dictionary-symbolwise data compression. http://www.di.univaq.it/%7mignosi/ulicompressor.php, 2006.
9. J. Katajainen and T. Raita. An approximation algorithm for space-optimal encoding of a text. Computer Journal, 32(3):228–237, 1989.
10. J. Katajainen and T. Raita. An analysis of the longest match and the greedy heuristics in text encoding. Journal of the ACM, 39(2):281–294, 1992.
11. S. T. Klein. Efficient optimal recompression. Computer Journal, 40(2/3):117–126, 1997.
12. R. Kosaraju and G. Manzini. Compression of low entropy strings with Lempel–Ziv algorithms. SIAM Journal on Computing, 29(3):893–911, 1999.
13. Y. Matias and S. C. Şahinalp. On the optimality of parsing in dynamic dictionary based data compression. In Proceedings of the 10th ACM-SIAM Symposium on Discrete Algorithms, pages 943–944, 1999.
14. G. Navarro and V. Mäkinen. Compressed full-text indexes. ACM Computing Surveys, 39(1), 2007.
15. S. J. Puglisi, W. F. Smyth, and A. H. Turpin. A taxonomy of suffix array construction algorithms. ACM Computing Surveys, 39(2), 2007.
16. N. Rajpoot and C. Şahinalp. Handbook of Lossless Data Compression, chapter Dictionary-based data compression, pages 153–167. Academic Press, 2002.
17. D. Salomon. Data Compression: the Complete Reference, 3rd Edition. Springer Verlag, 2004.
18. E. J. Schuegraf and H. S. Heaps. A comparison of algorithms for data base compression by use of fragments as language elements. Information Storage and Retrieval, 10(9-10):309–319, 1974.
19. M. E. Gonzalez Smith and J. A. Storer. Parallel algorithms for data compression. Journal of the ACM, 32(2):344–373, 1985.
20. W. Szpankowski. Asymptotic properties of data compression and suffix trees. IEEE Transactions on Information Theory, 39(5):1647–1659, 1993.
21. J. S. Vitter and P. Krishnan. Optimal prefetching via data compression. Journal of the ACM, 43(5):771–793, 1996.
22. I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers, second edition, 1999.
23. J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23:337–343, 1977.
24. J. Ziv and A. Lempel. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, 24(5):530–536, 1978.
APPENDIX A. Building T_B optimally

We are left with the problem of building T_B in O(|I_k|) time and space, thus with a time complexity which is independent of the length of the indexed suffixes and of the alphabet size. We show how to achieve this result by deploying the crucial fact that the algorithm of Section 5.1 to compute the d-maximal edges does not make any assumption on the ordering of edges in T_B, because it just computes (a sort of) lca-queries on its structure. This is the last step needed to complete the proof of Theorem 2, and we give its algorithmic details here.

At preprocessing time we build the suffix array of the whole string S and a data structure that answers constant-time lcp-queries between pairs of suffixes (see e.g. [15]). These data structures can be built in O(n) time and space, when σ = O(n). For larger alphabets, we need to add T_sort(n, σ) time, which takes into account the cost of sorting the symbols of S and re-mapping them to [n] (see Theorem 2). Let us first assume that B and W_B are contiguous and form the range [i, i + 3|I_k| − 1]. If we had the sorted sequence of suffixes starting in S[i, i + 3|I_k| − 1], we could easily build T_B in O(|I_k|) time and space by deploying the above lcp-data structure. Unfortunately, it is unclear how to obtain from the suffix array of the whole S the sorted subsequence of suffixes starting in the range [i, i + 3|I_k| − 1] within O(|B| + |W_B|) = O(|I_k|) time (notice that these suffixes have length Θ(n − i)). We cannot perform a sequence of predecessor/successor queries, because they would take ω(1) time each [1].
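A naive sketch of this preprocessing follows. The sorting below costs O(n log n) and the range minimum is scanned directly, so this version only illustrates the structure; linear-time suffix sorting and a sparse-table RMQ would give the stated O(n)-construction / O(1)-query bounds [15]. All names are ours.

```python
# Suffix array of S plus lcp-queries between arbitrary suffixes: lcp(S_a, S_b)
# is the minimum of the LCP array between the ranks of a and b.
def lcp_structure(S):
    n = len(S)
    sa = sorted(range(n), key=lambda i: S[i:])   # suffix array (naive sort)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp = [0] * n                                # lcp[r] = lcp(sa[r-1], sa[r])
    for r in range(1, n):
        a, b, q = sa[r - 1], sa[r], 0
        while a + q < n and b + q < n and S[a + q] == S[b + q]:
            q += 1
        lcp[r] = q
    def query(a, b):                             # lcp of suffixes S_a and S_b
        if a == b:
            return n - a
        lo, hi = sorted((rank[a], rank[b]))
        return min(lcp[lo + 1:hi + 1])           # RMQ, scanned naively here
    return query
```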
Conversely, we resort to the key observation above that T_B does not need to be ordered, and thus devise a solution which builds an unordered T_B in O(|I_k|) time and space, passing through the construction of the suffix array of a transformed string. The transformation is simple. We first map the distinct symbols of S[i, i + 3|I_k| − 1] to the first O(|I_k|) integers. This mapping does not need to reflect their lexicographic order, and thus can be computed in O(|I_k|) time by a simple scan of those symbols and the use of a table M of size σ < n. Then, we define S^M as the string S which has been transformed by re-mapping some of the symbols according to table M (namely, those occurring in S[i, i + 3|I_k| − 1]). We can prove that:

Lemma 5. Let S_i, ..., S_j be a contiguous sequence of suffixes in S. The remapped suffixes S^M_i, ..., S^M_j can be lexicographically sorted in O(j − i + 1) time.

Proof: Consider the string of pairs w = ⟨S^M[i], b_i⟩ ... ⟨S^M[j], b_j⟩ $, where b_h is 1 if S^M_{h+1} > S^M_{j+1}, −1 if S^M_{h+1} < S^M_{j+1}, or 0 if h = j. The ordering of the pairs is defined component-wise, and we assume that $ is a special "pair" larger than any other pair in w. For any pair of indices p, q ∈ [1 ... j − i], it is S^M_{p+i} > S^M_{q+i} iff w_p > w_q. In fact, suppose that w_p > w_q and set r = lcp(w_p, w_q). We have that w[p + r] = ⟨S^M[p + i + r], b_{p+i+r}⟩ > ⟨S^M[q + i + r], b_{q+i+r}⟩ = w[q + r]. Hence S^M_{p+i+r} > S^M_{q+i+r}, by definition of the b's. Therefore S^M_{p+i} > S^M_{q+i}, since their first r symbols are equal. This implies that sorting S^M_i, ..., S^M_j reduces to computing the suffix array of w, and this takes O(|w|) time, given that the alphabet size is O(|w|) [15].
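The reduction in the proof of Lemma 5 can be sketched directly. The comparisons that the paper answers in O(1) via lcp-queries are done naively here (and the remapping M is omitted for brevity), so this version is not linear-time; it only illustrates why sorting the suffixes of w sorts the block's suffixes.

```python
# Lemma 5 reduction: build w = <S[i], b_i> ... <S[j], b_j> $, where b_h
# compares the full suffix S_{h+1} against S_{j+1}; sorting w's suffixes then
# sorts the block suffixes S_i .. S_j.
def sort_block_suffixes(S, i, j):
    def b(h):                     # sign of S_{h+1} vs S_{j+1}; 0 when h = j
        if h == j:
            return 0
        return (S[h + 1:] > S[j + 1:]) - (S[h + 1:] < S[j + 1:])
    # '\uffff' makes the terminator pair larger than any other pair
    w = [(S[h], b(h)) for h in range(i, j + 1)] + [("\uffff", 2)]
    # suffix p of w corresponds to suffix i + p of S; sort them via w
    order = sorted(range(j - i + 1), key=lambda p: w[p:])
    return [i + p for p in order]
```

For S = "banana" and the block [1, 4], the suffixes "anana", "nana", "ana", "na" come out in the order "ana" < "anana" < "na" < "nana", i.e. starting positions 3, 1, 4, 2, matching a direct lexicographic sort.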
Clearly, w can be constructed in that time bound, because comparing S^M_z with S^M_{j+1} takes O(1) time via an lcp-query on S (using the proper data structure above) and a check at their first mismatch. ⊓⊔

Lemma 5 allows us to generate the compact trie of S^M_i, ..., S^M_{i+3|I_k|−1}, which is equal to the (unordered) compacted trie of S_i, ..., S_{i+3|I_k|−1} after replacing every ID assigned by M with its original symbol in S. We finally notice that if B and W_B are not contiguous (as we instead assumed above), we can use a similar strategy to sort separately the suffixes in B and the suffixes in W_B, and then merge these two sequences together by deploying the lcp-data structure mentioned at the beginning of this section.
