Bounds for Compression in Streaming Models
Compression algorithms and streaming algorithms are both powerful tools for dealing with massive data sets, but many of the best compression algorithms -- e.g., those based on the Burrows-Wheeler Transform -- at first seem incompatible with streaming…
Authors: Travis Gagie (corresponding author)
Bounds for Compression in Streaming Models⋆

Abstract. Compression algorithms and streaming algorithms are both powerful tools for dealing with massive data sets, but many of the best compression algorithms — e.g., those based on the Burrows-Wheeler Transform — at first seem incompatible with streaming. In this paper we consider several popular streaming models and ask in which, if any, we can compress as well as we can with the BWT. We first prove a nearly tight tradeoff between memory and redundancy for the Standard, Multipass and W-Streams models, demonstrating a bound that is achievable with the BWT but unachievable in those models. We then show we can compute the related Schindler Transform in the StreamSort model and the BWT in the Read-Write model and, thus, achieve that bound.

1 Introduction

The increasing size of data sets over the past decade has inspired work on both data compression and streaming algorithms. In compression research, the advent of the Burrows-Wheeler Transform [5] (BWT) has led to great improvements in both theory and practice. Streaming algorithms, meanwhile, are now used not only to process data online but also, because sequential access is so much faster than random access, to process them on disk more quickly. To combine these advances, it seems we must find a way to compute something like the BWT in a streaming model, e.g.:

1. Standard: In the simplest and most restrictive model, we are allowed only one pass over the input and memory sublinear (usually polylogarithmic) in the input's size (see, e.g., [23]).
2. Multipass: In one of the earliest papers on what is now called streaming, Munro and Paterson [22] proposed a model in which the input is stored on a one-way, read-only tape — representing external memory — that is completely rewound whenever we reach the end.
3. W-Streams: Ruhl [25] proposed a model in which we can also write: during each pass over the tape, we can replace its contents with something up to a constant factor larger.
4. StreamSort: Since we cannot sort in the W-Streams model with polylogarithmic memory and passes, Ruhl also proposed a generalization in which we can sort the contents of the tape at the cost of a constant number of passes (see also [1]).
5. Read-Write: Grohe and Schweikardt [13] noted that, with an additional tape, we can sort even in the W-Streams model; they proposed a model in which we have a read-write input tape, some number of read-write work tapes, and a write-only output tape.

⋆ Partially supported by the Italian MIUR and Italian-Israeli FIRB project "Pattern Discovery Algorithms in Discrete Structures, with Applications to Bioinformatics".

In a previous paper [12] we proved nearly tight bounds on how well we can compress in the Standard model with constant memory. We also showed how LZ77 [28] can be implemented with a growing sliding window to use slightly sublinear memory without increasing the bound on its redundancy, but that this cannot be done for sublogarithmic memory. Those bounds are, however, tangential to the common assumption of polylogarithmic memory. In this paper we consider whether the following bound can be achieved in the models described above with polylogarithmic resources: Given a string s of length n over an alphabet of constant size σ, can we store s in O(nH_k(s) + σ^k log n) bits for all k simultaneously, where H_k(s) is the kth-order empirical entropy of s? We first prove the bound unachievable in the first three models, via a nearly tight tradeoff between memory and redundancy.
We then show we can compute the Schindler Transform [26] (ST) in the StreamSort model and the BWT in the Read-Write model, so the bound is achievable in them. We start by reviewing some preliminary material in Section 2. In Section 3 we show how, for any constants c and ε with 1 ≥ c ≥ 0 and ε > 0, we can store s in nH_k(s) + O(σ^k n^{1−c+ε}) bits in the Standard model with O(n^c) bits of memory; we then show this bound is nearly optimal in the sense that we cannot always store s in, nor recover it from, O(nH_k(s) + σ^k n^{1−c−ε}) bits. In Section 4 we extend our tradeoff to the Multipass and W-Streams models. Since we can store BWT(s) in, and recover it from, 3.4 nH_k(s) + O(σ^k) bits in even the Standard model, our tradeoff implies we can neither compute nor invert the BWT in the first three models. In Section 5 we show how we can compute the ST in the StreamSort model and in Section 6 we show how we can compute the BWT in the Read-Write model. Using either transform, we can store s in 1.8 nH_k(s) + O(σ^k log n) bits in these models.

2 Preliminaries

Throughout this paper, we assume s = s_1 ··· s_n is a string of length n over an alphabet of constant size σ, and c and ε are constants with 1 − ε > c > ε > 0. The 0th-order empirical entropy H_0(s) of s is the entropy of the characters' distribution in s, i.e.,

  H_0(s) = (1/n) Σ_{a ∈ s} n_a log(n / n_a),

where a ∈ s means character a occurs in s and n_a is its frequency. The kth-order empirical entropy H_k(s) of s for k ≥ 1, described in detail by Manzini [21], is defined as

  H_k(s) = (1/n) Σ_{|w| = k} |w_s| H_0(w_s),

where w_s is the concatenation of characters that immediately follow occurrences of w in s.
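These definitions can be checked directly on small examples. The following is a minimal sketch (the function names h0 and hk are ours, not from the paper):

```python
import math
from collections import Counter

def h0(s):
    """0th-order empirical entropy: entropy of the character distribution,
    H_0(s) = (1/n) * sum over characters a of n_a * log2(n / n_a)."""
    n = len(s)
    return sum(c / n * math.log2(n / c) for c in Counter(s).values())

def hk(s, k):
    """kth-order empirical entropy: H_k(s) = (1/n) * sum over k-tuples w of
    |w_s| * H_0(w_s), where w_s collects the characters following w in s."""
    n = len(s)
    followers = {}
    for i in range(n - k):
        followers.setdefault(s[i:i + k], []).append(s[i + k])
    return sum(len(ws) * h0(ws) for ws in followers.values()) / n
```

For example, hk("abababab", 1) is 0, since each character is fully determined by its predecessor, while h0("abababab") is 1 bit per character.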
The BWT is an invertible transform that permutes the characters in s so that s_i comes before s_j if s_{i−1} ··· s_{i−n} is lexicographically less than s_{j−1} ··· s_{j−n}, taking indices modulo n + 1 and considering s_0 = s_{n+1} to be lexicographically less than any character in the alphabet. In other words, the BWT sorts s's characters in order of their contexts, which start at their predecessors and extend backwards. Although the distribution of characters remains the same — so H_0(BWT(s)) = H_0(s) — characters with similar contexts are moved close together so, if s is compressible with an algorithm that takes advantage of contexts, then BWT(s) is compressible with an algorithm that takes advantage of local homogeneity. Move-to-front [3] and distance coding [4,8] are two reversible transforms that turn strings with local homogeneity into strings of numbers with low 0th-order empirical entropy: move-to-front keeps a list of the characters in the alphabet and replaces each character in the input by its position in the list and then moves it to the front of the list; distance coding writes the distance to the first occurrence of each character and the length of the string, then replaces each character in the input by the distance from the last occurrence of that character, omitting 1s. (The numbers are often written with, e.g., Elias' delta code [9], in which case move-to-front and distance coding become compression algorithms themselves.) Building on work by Kaplan, Landau and Verbin [16], we showed in another previous paper [11] that composing the BWT, move-to-front, run-length coding and arithmetic coding produces an encoding that contains at most 3.4 nH_k(s) + O(σ^k) bits for all k simultaneously; with the BWT, distance coding and arithmetic coding, the bound is 1.8 nH_k(s) + O(σ^k log n) bits.
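A small sketch of this pipeline's first two stages follows. Note one caveat: for brevity it uses the textbook rotation-sorting BWT, which sorts each character by the context that follows it, whereas the paper's variant sorts by the preceding context; either way, characters with similar contexts are grouped, which is what move-to-front exploits. Function names are ours:

```python
def bwt(s):
    """Textbook BWT: append a sentinel smaller than every character, sort all
    rotations, take the last column (each output character is the predecessor
    of the rotation it is sorted by)."""
    t = s + "\0"
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(r[-1] for r in rotations)

def move_to_front(s, alphabet):
    """Replace each character by its position in a self-organizing list,
    moving the character to the front after each use; runs of similar
    characters become runs of small numbers."""
    lst = list(alphabet)
    out = []
    for ch in s:
        i = lst.index(ch)
        out.append(i)
        lst.insert(0, lst.pop(i))
    return out
```

On "mississippi" the transform groups the four s's and clusters the i's, so the move-to-front output of the transformed string is dominated by small numbers, which is exactly what makes the subsequent run-length and arithmetic coding effective.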
Using the BWT in more sophisticated ways, Ferragina, Manzini, Mäkinen and Navarro [10] and Mäkinen and Navarro [20] achieved a bound on the encoding's length of nH_k(s) + o(n / log n) bits for all k at most a constant proper fraction of log_σ n; Grossi, Gupta and Vitter [14] achieved a bound of nH_k(s) + O(σ^k log n) bits for all k simultaneously, matching a lower bound due to Rissanen [24]. In Section 3 we show how, because length times empirical entropy is superadditive — i.e., |a| H_k(a) + |b| H_k(b) ≤ |ab| H_k(ab) — we can trade off between memory and redundancy by breaking s into blocks and encoding each block in turn with Grossi, Gupta and Vitter's algorithm: the memory needed decreases by a factor roughly equal to the number of blocks, while the redundancy increases by that much. In Section 5 we use the ST instead of the BWT. When using contexts of length k, the ST permutes the characters of s so that s_i comes before s_j if s_{i−1} ··· s_{i−k} is lexicographically less than s_{j−1} ··· s_{j−k} or, when they are equal, if i < j. By the same arguments as for the BWT, composing the ST, distance coding and arithmetic coding produces an encoding that contains at most 1.8 nH_k(s) + O(σ^k log n) bits for the chosen context length k. A σ-ary De Bruijn sequence of order k ≥ 1 contains each possible k-tuple exactly once and, it follows, has length σ^k + k − 1 and starts and ends with the same k − 1 characters. Notice that, if d consists of the first σ^k characters of such a sequence, then H_k(d^i) = 0 for any i. However, there are (σ!)^{σ^{k−1}} such sequences [6,17,18] so, by Li and Vitányi's Incompressibility Lemma [19], for any fixed algorithm A, the maximum Kolmogorov complexity of d relative to A is at least ⌊σ^{k−1} log σ!⌋ = Ω(σ^k) bits.
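The stated De Bruijn properties are easy to verify for the binary case (σ = 2) with the classical greedy "prefer-one" construction (Martin's algorithm); this sketch, with a function name of our choosing, builds one such sequence and is not the paper's construction:

```python
def de_bruijn_binary(k):
    """Greedy 'prefer-one' (Martin's) construction of a binary De Bruijn
    sequence of order k: start with k zeros and repeatedly append the largest
    bit whose new length-k suffix has not yet occurred.  The result has every
    k-tuple exactly once, length 2**k + k - 1, and matching first and last
    k - 1 characters."""
    seq = "0" * k
    seen = {seq}
    extended = True
    while extended:
        extended = False
        suffix = seq[-(k - 1):] if k > 1 else ""
        for b in "10":                      # prefer appending a 1
            t = suffix + b
            if t not in seen:
                seen.add(t)
                seq += b
                extended = True
                break
    return seq
```

For k = 3 this yields 0001110100: all eight 3-tuples occur exactly once, and the prefix d of length 2^k repeats with kth-order empirical entropy zero, exactly the kind of string used in the lower bounds below.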
We used this fact in [12] to prove that a one-pass algorithm cannot always compress well in terms of the kth-order empirical entropy without using memory exponential in k: we can compute d from the configurations of A's memory when it starts and finishes reading any copy of d in d^i and its output while reading that copy (if there were another string d′ that took A between those configurations while producing that output, then we could substitute d′ for that copy of d without changing the total encoding); therefore, if A uses o(σ^k) bits of memory, then its total output must be Ω(|d^i|) bits.

3 The Standard model

We can easily store s in nH_0(s) + 2n + O(1) bits in one pass with O(log n) bits of memory with dynamic Huffman coding [27], for example; by running a separate copy of the algorithm for each possible k-tuple for any given k, we can store s in nH_k(s) + 2n + O(σ^k) bits in one pass with O(σ^k log n) bits of memory. With adaptive arithmetic coding instead of dynamic Huffman coding, the 2n term becomes approximately n/100 (see, e.g., [15]). With the version of LZ77 we described in [12], for any fixed k we can store s in nH_k(s) + o(n) bits in one pass with o(n) bits of memory, but with both o(n) terms nearly linear. We now show we can substantially reduce those terms simultaneously, answering a question we posed in that paper.

Theorem 3.1. In the Standard model with O(n^c) bits of memory, we can store s in, and later recover it from, nH_k(s) + O(σ^k n^{1−c+ε}) bits for all k simultaneously.

Proof. Let A be Grossi, Gupta and Vitter's algorithm. They proved A stores s in nH_k(s) + O(σ^k log n) bits for all k simultaneously using O(n) time so, although they did not give an explicit bound on the memory used, we know it is at most A's time complexity multiplied by the word size, i.e., O(n log n) bits.
First, suppose we know n in advance. We process s in O(n^{1−c+ε/2}) blocks b_1, ..., b_m, each of length O(n^{c−ε/2}): we read each block b_i in turn, compute and output A(b_i) — using O(|b_i| log |b_i|) = O(n^c) bits of memory — and erase b_i from memory. As we noted in Section 2, empirical entropy is superadditive, so the total length of the encodings we output is at most

  Σ_{i=1}^{m} (|b_i| H_k(b_i) + O(σ^k log |b_i|)) ≤ nH_k(s) + O(σ^k n^{1−c+ε})

bits for all k simultaneously. Now, suppose we do not know n in advance. We work as before but we start with a constant estimate of n and, each time we have read that many characters of s, we double it. This way, we increase the size of the largest block by less than 2 and the number of blocks by an O(log n)-factor, so our asymptotic bounds on the memory used and the whole encoding's length do not change. ⊓⊔

Extending our lower bounds from that paper, we now show we cannot reduce the factor n^{1−c+ε} much further unless we increase the factor σ^k, not even if we multiply the bound on the encoding's length by any constant coefficient. This lower bound holds for both compression and decompression; as we noted in [12], "good bounds [for decompression] are equally important, because often data is compressed once by a powerful machine (e.g., a server or base-station) and then transmitted to many weaker machines (clients or agents) who decompress it individually."

Theorem 3.2. In the Standard model with O(n^c) bits of memory, we cannot store s in O(nH_k(s) + σ^k n^{1−c−ε}) bits in the worst case for, e.g., k = ⌈(c + ε/2) log_σ n⌉.

Proof. Consider any compression algorithm A that works in the Standard model with O(n^c) bits of memory.
Suppose s consists of copies of the string d of the first σ^k characters of a σ-ary De Bruijn sequence of order k = ⌈(c + ε/2) log_σ n⌉ whose Kolmogorov complexity relative to A is Ω(σ^k) = Ω(n^{c+ε/2}) bits. As we noted in Section 2, we can compute d from the configurations of A's memory when it starts and finishes reading any copy of d and its output while reading that copy. Since A's memory is asymptotically smaller than d's Kolmogorov complexity relative to A, it must output Ω(σ^k) bits for every copy of d in s, i.e., Ω(n) bits altogether. Since H_k(s) = 0, however, O(nH_k(s) + σ^k n^{1−c−ε}) = O(n^{1−ε/2}). ⊓⊔

Theorem 3.3. In the Standard model with O(n^c) bits of memory, we cannot recover s from O(nH_k(s) + σ^k n^{1−c−ε}) bits in the worst case for, e.g., k = ⌈(c + ε/2) log_σ n⌉.

Proof. Let A be a decompression algorithm that works in the Standard model with O(n^c) bits of memory and let s be as in the proof of Theorem 3.2. We can also compute d from the configuration of A's memory when it starts outputting any copy of d and the bits it reads while outputting that copy; since A's memory is asymptotically smaller than d's Kolmogorov complexity relative to A, it must read Ω(σ^k) bits for each copy of d in s, i.e., Ω(n) bits altogether, whereas O(nH_k(s) + σ^k n^{1−c−ε}) = O(n^{1−ε/2}). ⊓⊔

Finally, we now prove that, if we could compute the BWT in the Standard model, then we could achieve the bounds we have just proven unachievable; we will draw the obvious conclusion in Section 4 as part of a more general theorem.

Lemma 3.4. In the Standard model with O(log n) bits of memory, we can store BWT(s) in, and later recover it from, 3.4 nH_k(s) + O(σ^k) bits for all k simultaneously.

Proof.
We encode BWT(s) by composing move-to-front, run-length coding and adaptive arithmetic coding; since encoding or decoding each of the three steps takes one pass and O(log n) bits of memory, so does their composition. As we noted in Section 2, the resulting encoding contains at most 3.4 nH_k(s) + O(σ^k) bits for all k simultaneously. ⊓⊔

4 The Multipass and W-Streams models

We now extend our tradeoff to the Multipass and W-Streams models. Of course, anything we can do in the Standard model we can do in those models, so the upper bounds are immediate.

Theorem 4.1. In the Multipass and W-Streams models with O(n^c) bits of memory and one pass, we can store s in nH_k(s) + O(σ^k n^{1−c+ε}) bits for all k simultaneously.

Conversely, lower bounds for the W-Streams model apply to the Multipass model. We could quite easily extend our proofs to include the Multipass model alone: e.g., to see we cannot compress s very well with polylogarithmic passes, notice we can compute d from the configurations of A's memory when it starts and finishes reading any copy of d and its output while reading that copy during each pass; since log^{O(1)} n = o(n^{ε/2}), even a polylogarithmic number of A's memory configurations are asymptotically smaller than d's Kolmogorov complexity relative to A, so the rest of our argument still holds. The proof for the W-Streams model must be slightly different because, for example, after the first pass the tape will generally not contain copies of d.

Theorem 4.2. In the Multipass and W-Streams models with O(n^c) bits of memory and log^{O(1)} n passes, we cannot store s in O(nH_k(s) + σ^k n^{1−c−ε}) bits in the worst case for, e.g., k = ⌈(c + ε/2) log_σ n⌉.

Proof. We need consider only the more general W-Streams model.
Consider any compression algorithm A that works in the W-Streams model with O(n^c) bits of memory and log^{O(1)} n passes, and let s be as in the proof of Theorem 3.2. Without loss of generality, assume A's output consists of the contents of the tape after its last pass (any output from intermediate passes can be written on the tape instead). Notice any substring of characters on the tape immediately after a particular pass must have been written (or left untouched) while A was reading a substring of characters on the tape immediately before that pass. Suppose A makes p passes over s. Consider any copy d_0 of d in s and, for p ≥ i ≥ 1, let d_i be the substring A writes while reading d_{i−1}. We claim we can compute d from d_p and the memory configurations of A when it starts and finishes reading each d_i. To see why, suppose there were a sequence d′_0 ≠ d_0, d′_1, ..., d′_{p−1}, d′_p = d_p such that, for p ≥ i ≥ 1, d′_{i−1} took A between the ith pair of memory configurations while writing d′_i; then we could substitute d′_0 for d_0 without changing the total encoding. It follows that d_p must contain Ω(σ^k) bits and, so, the whole tape must contain Ω(n) bits after the last pass, whereas O(nH_k(s) + σ^k n^{1−c−ε}) = O(n^{1−ε/2}). ⊓⊔

Theorem 4.3. In the Multipass and W-Streams models with O(n^c) bits of memory and log^{O(1)} n passes, we cannot recover s from O(nH_k(s) + σ^k n^{1−c−ε}) bits in the worst case for, e.g., k = ⌈(c + ε/2) log_σ n⌉.

Proof. Again, we consider only the W-Streams model. Consider any decompression algorithm A that works in the W-Streams model with O(n^c) bits of memory and log^{O(1)} n passes, and let s be as in the proof of Theorem 3.2. Without loss of generality, assume the tape contains s after A's last pass.
Consider any copy d_0 of d in s and, for p ≥ i ≥ 1, let d_i be the substring A reads while writing d_{i−1}; notice this is the reverse of the definition in the proof of Theorem 4.2. Notice we can compute d from d_p and the memory configuration of A when it starts reading each d_i: running A on d_i, starting in the memory configuration, produces d_{i−1}. It follows that A must read Ω(σ^k) bits for each copy of d in s, i.e., Ω(n) bits altogether, whereas O(nH_k(s) + σ^k n^{1−c−ε}) = O(n^{1−ε/2}). ⊓⊔

We note in passing that in the W-Streams model, we can easily reduce sorting the characters in s to computing the BWT: we compute s′ = (s_1, s_0)(s_2, s_1) ··· (s_n, s_{n−1})(s_0, s_n); we compute BWT(s′) = (s_1, s_0)(s_{i_1+1}, s_{i_1}) ··· (s_{i_n+1}, s_{i_n}); and we output s_{i_1}, ..., s_{i_n}. To see why s_{i_1}, ..., s_{i_n} are sorted, suppose (s_{i+1}, s_i) precedes (s_{j+1}, s_j) in BWT(s′). The BWT arranges the pairs in s′ in the lexicographic order of their predecessors (how it breaks ties does not concern us now), which is that of their predecessors' first components, or that of their own second components — so s_i ≤ s_j. If σ were unbounded, this reduction would imply we could not compute the BWT in the first three models; since σ is constant, however, it is meaningless — we can sort the characters in s in the Multipass model anyway. Fortunately, we can also easily reduce storing s in O(nH_k(s) + σ^k log n) bits to computing the BWT.

Theorem 4.4. In the Standard, Multipass and W-Streams models with log^{O(1)} n bits of memory and passes, we can neither compute nor invert BWT(s) in the worst case.

Proof. If we could compute or invert BWT(s) then, by Lemma 3.4, we could store s in, or recover it from, 3.4 nH_k(s) + O(σ^k) bits for all k simultaneously; however, by Theorems 3.2, 3.3, 4.2 and 4.3 we cannot achieve this bound in the worst case. ⊓⊔

5 The StreamSort model

The ST is known as both the "Schindler Transform" and the "Sort Transform", so it is perhaps not surprising that it can be computed in the StreamSort model. Indeed, computing the ST for any given k = O(log n) takes only a constant number of passes once we have padded the input from O(n) bits to Ω(n log n) bits; we do this padding, which takes O(log log n) passes, because we are allowed to expand the tape contents by only a constant factor during each pass, and we want to eventually associate each character with an O(log n)-bit key — the k-tuple that precedes that character in s — and then stably sort by the keys. Unfortunately, we do not yet know how to invert the ST in either the StreamSort or Read-Write models.

Lemma 5.1. For any given k = O(log n), we can compute ST(s) in the StreamSort model with O(log n) bits of memory and O(log log n) passes.

Proof. We make O(log log n) passes, each time doubling the length of each character's representation by padding it with 0s, until each character takes Ω(log n) bits. We make another pass to associate each character in s with the k-tuple that precedes it; since log σ^k = O(log n), we can use O(log n) bits of memory to keep track of the last k characters we have seen and write them as a key in front of the next character while only doubling the number of bits on the tape. We then use O(1) passes to stably sort by those keys — i.e., computing the ST — and, finally, delete the keys and padding. ⊓⊔

Theorem 5.2. In the StreamSort model with O(log n) bits of memory and O(log n log log n) passes, we can store s in 1.8 nH_k(s) + O(σ^k log n) bits for all k simultaneously.

Proof.
We compute ST(s) in O(log log n) passes and encode it in O(1) passes by composing distance coding and adaptive arithmetic coding, for each k = O(log n); in total, we use O(log n) bits of memory and O(log n log log n) passes. As we noted in Section 2, each resulting encoding contains at most 1.8 nH_k(s) + O(σ^k log n) bits for the value of k used to compute it; thus, the shortest encoding contains 1.8 nH_k(s) + O(σ^k log n) bits for all k = O(log n) simultaneously and so — because 1.8 nH_0(s) + O(log n) < 1.8 nH_k(s) + O(σ^k log n) = ω(n) for all k = ω(log n) — for all k simultaneously. ⊓⊔

6 The Read-Write model

Anything we can do in the StreamSort model we can do in the Read-Write model using an O(log n)-factor more passes, so Theorem 5.2 implies we can store s in 1.8 nH_k(s) + O(σ^k log n) bits for all k simultaneously in the Read-Write model with O(log n) bits of memory and O(log^2 n log log n) passes. It does not, however, imply we can recover s again in the Read-Write model. Fortunately, using techniques based on the doubling algorithm by Arge, Ferragina, Grossi and Vitter [2] for sorting strings in external memory (see also [7]), we can both compute and invert BWT(s) in the Read-Write model. Figures 1 and 2 show how we compute and invert BWT(mississippi); to save space, Figure 2 shows two rounds of the algorithm in each row.

Theorem 6.1. In the Read-Write model with O(log n) bits of memory and O(log^2 n) passes, we can store s in, and later recover it from, 1.8 nH_k(s) + O(σ^k log n) bits for all k simultaneously.

Proof. As we noted in Section 2, once we have BWT(s), we can store it in 1.8 nH_k(s) + O(σ^k log n) bits for all k simultaneously by composing distance coding and adaptive arithmetic coding.
To compute or invert distance coding and to encode or decode adaptive arithmetic coding all take O(log n) bits of memory and O(1) passes. Therefore, we need consider only how to compute and invert BWT(s). Due to space constraints, here we only sketch these procedures; we will give full descriptions and analyses in the full paper.

[Fig. 1. Computing the BWT in the Read-Write model. (Worked example on mississippi#; columns: triples, copy and sort, merge, sort, shrink.)]

[Fig. 2. Inverting the BWT in the Read-Write model. (Columns: triples, copy and sort, merge, shrink, copy and sort, merge, shrink.)]

To compute BWT(s), we append a special character that is lexicographically less than any character in the alphabet (# in Figures 1 and 2). We tag each character in s with a unique identifier (e.g., its position; in this model, we can expand the contents of the tape by more than a constant factor during a pass, so we do not need to pad — although O(log log n) extra rounds would not make a difference here anyway), then form a triple from it by appending a 1 and its successor in s. We make two copies of the set of triples, sort the first copy by the last component and the second copy by the first component (breaking ties by characters' identifiers), and merge them to form quintuples. (It is the copying and sorting step that we do not see how to do in the StreamSort model.) We sort the set of quintuples by the fourth component, breaking ties by the third component (ignoring characters' identifiers), breaking continued ties by the second component, and breaking continued ties arbitrarily. (In the first round, all the fourth and second components are 1, so we effectively sort by the third component; notice the third component is the first component's successor in s and the fifth component's predecessor in s, taking indices modulo n + 1.) Finally, we replace the middle triples — the second, third and fourth components — with numbers, starting at one and incrementing whenever we find a triple different from the one before. This process results in another set of triples; if the second components are the numbers 1 through n + 1, we stop; otherwise, we repeat the procedure from the point of copying the triples. Notice that, at the end of the first round, the first and third components in any triple are two positions apart in s, taking indices modulo n + 1.
Also, for any two triples (s_i, x, s_{i+2}) and (s_j, y, s_{j+2}), the comparative relationship between x and y is the same as the lexicographic relationship between s_{i+1} and s_{j+1}. At the end of the second round, the first and third components in any triple are four positions apart in s and, for any two triples (s_i, x, s_{i+4}) and (s_j, y, s_{j+4}), the comparative relationship between x and y is the same as the lexicographic relationship between s_{i+3} s_{i+2} s_{i+1} and s_{j+3} s_{j+2} s_{j+1}. To see why, notice the relationship between x and y depends, in decreasing order of priority, on the relationships between x_1 and y_1, s_{i+2} and s_{j+2}, and x_2 and y_2 in the quintuples (s_i, x_2, s_{i+2}, x_1, s_{i+4}) and (s_j, y_2, s_{j+2}, y_1, s_{j+4}); since these quintuples are formed by joining triples (s_i, x_2, s_{i+2}) and (s_{i+2}, x_1, s_{i+4}), and (s_j, y_2, s_{j+2}) and (s_{j+2}, y_1, s_{j+4}), created during the first round, the comparative relationships between x_1 and y_1 and between x_2 and y_2 are the same as the lexicographic relationships between s_{i+3} and s_{j+3} and between s_{i+1} and s_{j+1}. After O(log n) rounds, when the second components are the numbers 1 through n + 1, they indicate the lexicographic relationships of the prefixes of the third components, so the third components are BWT(s); in our example in Figure 1, BWT(s) = ms#spipissii. (We note that, if we sorted quintuples by the second component, using the third and fourth to break ties, then we would compute the suffix array of s.) We need only O(log n) bits of memory for this procedure; each sorting step takes O(log n) passes and each other step takes O(1) passes, so we use O(log^2 n) passes altogether.

To invert BWT(s), we again tag each character in BWT(s) with a unique identifier and form triples. This time, however, we prepend a ?
to each character, then prepend the corresponding character in the stable sort of BWT(s); finally, in the triple whose first component is the special character not in the alphabet, we replace the ? by n + 1. As for computing BWT(s), we make two copies of the set of triples, sort them and merge them. This time, for each triple, if the second component is not a ? or the third component is a ?, then we simply delete the third and fourth components; if the second component is a ? and the third component is not, then we put the third component minus 1 in the second component, then delete the third and fourth components. This process results in another set of triples; if the second components are the numbers 1 through n + 1 in some order, we stop; otherwise, we repeat the procedure from the point of copying the triples. By the definition of the BWT, the ith character in the stable sort of BWT(s) is the predecessor in s of the ith character in BWT(s). Therefore, at the end of the first round, the first and third components in any triple are two positions apart in s, taking indices modulo n + 1; also, the triples that have s_n and s_{n+1} as their first components have n and n + 1 as their second components. At the end of the second round, the first and third components of any triple are four positions apart in s and the triples that have s_{n-2}, s_{n-1}, s_n and s_{n+1} as their first components have n - 2, n - 1, n and n + 1 as their second components. After O(log n) rounds, when the second components are the numbers 1 through n + 1 in some order, they indicate the positions in s of the triples' first components. Sorting by the second component and ignoring the special character, we can recover s.
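The inversion admits a similar sequential sketch (again our own reconstruction and naming, not the streaming implementation): because the ith character of the stable sort of BWT(s) is the predecessor in s of the ith character of BWT(s), the stable-sort permutation steps one position backwards through s, and following it from the special character recovers s.

```python
def invert_prefix_bwt(t, sentinel='#'):
    n = len(t)
    # p[i] is the index in t of the i-th character of the stable sort of t;
    # Python's sort is stable, so equal characters keep their original order
    p = sorted(range(n), key=lambda i: t[i])
    # the i-th character of the stable sort of t is the predecessor in s of
    # t[i], so i -> p[i] steps one position backwards through s
    i = t.index(sentinel)  # the sentinel is the last character of s
    out = []
    for _ in range(n - 1):
        i = p[i]
        out.append(t[i])
    # the walk visits s back to front; reverse it and restore the sentinel
    return ''.join(reversed(out)) + sentinel
```

Here the single stable sort plays the role of the O(log n) copying-and-merging rounds; on the example of Figure 2, `invert_prefix_bwt("ms#spipissii")` returns `mississippi#`.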
In our example in Figure 2, the triple that starts 'm' then has a 1 as its second component; the triples that start 'i' then have 2, 5, 8 and 11; the triples that start 'p' then have 9 and 10; and the triples that start 's' then have 3, 4, 6 and 7; thus, sorting them by the second component and outputting the first, we recover mississippi. Again, we need only O(log n) bits of memory for this procedure; each sorting step takes O(log n) passes and each other step takes O(1) passes, so we use O(log^2 n) passes altogether. ⊓⊔

Grohe and Schweikardt [13] showed that, given a sequence of n numbers x_1, ..., x_n, each of 2 log n bits, we cannot generally sort them using o(log n) passes, O(n^{1-ε}) bits of memory and O(1) tapes, for any positive constant ε. We can easily obtain the same lower bound for the BWT, via the following reduction from sorting: given x_1, ..., x_n, using one pass, O(log n) bits of memory and two tapes, for 1 ≤ i ≤ n and 1 ≤ j ≤ 2 log n, we replace the jth bit x_i[j] of x_i by x_i[j] 2 x_i i j, writing 2 as a single character, x_i in 2 log n bits, i in log n bits and j in log log n + 1 bits; notice the resulting string is of length n(3 log n + log log n + 2). (For simplicity, we now consider each character's context as starting at its successor and extending forwards.) The only characters followed by 2s in this string are the bits at the beginning of replacement phrases so, if we perform the BWT of it, the last 2n log n characters of the transformed string are the bits of x_1, . . .
, x_n; moreover, since the lexicographic order of equal-length binary strings is the same as their numeric order, the x_i[j]s will be arranged by the x_is, with ties broken by the is (so if x_i = x_{i'} with i < i', then every x_i[j] comes before every x_{i'}[j']), and further ties broken by the js; thus, the last 2n log n bits of the transformed string are x_1, ..., x_n in sorted order.

References

1. G. Aggarwal, M. Datar, S. Rajagopalan, and M. Ruhl. On the streaming model augmented with a sorting primitive. In Proceedings of the 45th IEEE Symposium on Foundations of Computer Science, pages 540–549, 2004.
2. L. Arge, P. Ferragina, R. Grossi, and J. S. Vitter. On sorting strings in external memory. In Proceedings of the 29th ACM Symposium on Theory of Computing, pages 540–548, 1997.
3. J. L. Bentley, D. D. Sleator, R. E. Tarjan, and V. K. Wei. A locally adaptive data compression scheme. Communications of the ACM, 29:320–330, 1986.
4. E. Binder. Distance coding. Usenet group comp.compression, April 13th, 2000.
5. M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Technical Report 24, Digital Equipment Corporation, 1994.
6. N. G. de Bruijn. A combinatorial problem. Koninklijke Nederlandse Akademie van Wetenschappen, 49:758–764, 1946.
7. R. Dementiev, J. Kärkkäinen, J. Mehnert, and P. Sanders. Better external memory suffix array construction. ACM Journal of Experimental Algorithmics, to appear.
8. S. Deorowicz. Second step algorithms in the Burrows-Wheeler compression algorithm. Software - Practice and Experience, 32:99–111, 2002.
9. P. Elias. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, 21:194–203, 1975.
10. P. Ferragina, G. Manzini, V. Mäkinen, and G. Navarro. Compressed representations of sequences and full-text indexes.
ACM Transactions on Algorithms, 3, 2007.
11. T. Gagie and G. Manzini. Move-to-front, distance coding, and inversion frequencies revisited. In Proceedings of the 18th Symposium on Combinatorial Pattern Matching, pages 71–82, 2007.
12. T. Gagie and G. Manzini. Space-conscious compression. In Proceedings of the 32nd Symposium on Mathematical Foundations of Computer Science, pages 206–217, 2007.
13. M. Grohe and N. Schweikardt. Lower bounds for sorting with few random accesses to external memory. In Proceedings of the 24th ACM Symposium on Principles of Database Systems, pages 238–249, 2005.
14. R. Grossi, A. Gupta, and J. S. Vitter. An algorithmic framework for compression and text indexing. Personal communication.
15. P. Howard and J. S. Vitter. Analysis of arithmetic coding for data compression. Information Processing and Management, 28:749–763, 1992.
16. H. Kaplan, S. Landau, and E. Verbin. A simpler analysis of Burrows-Wheeler based compression. Theoretical Computer Science, 387:220–235, 2007.
17. D. E. Knuth. The Art of Computer Programming, volume 1. Addison-Wesley, 3rd edition, 1997.
18. D. E. Knuth. The Art of Computer Programming, volume 4, fascicle 2. Addison-Wesley, 2005.
19. M. Li and P. Vitányi. An Introduction to Kolmogorov Complexity and Its Applications. Springer-Verlag, 2nd edition, 1997.
20. V. Mäkinen and G. Navarro. Dynamic entropy-compressed sequences and full-text indexes. ACM Transactions on Algorithms, to appear.
21. G. Manzini. An analysis of the Burrows-Wheeler transform. Journal of the ACM, 48:407–430, 2001.
22. J. I. Munro and M. S. Paterson. Selection and sorting with limited storage. Theoretical Computer Science, 12:315–323, 1980.
23. S. Muthukrishnan. Data Streams: Algorithms and Applications. Foundations and Trends in Theoretical Computer Science.
Now Publishers, 2005.
24. J. Rissanen. A universal data compression system. IEEE Transactions on Information Theory, 29:656–663, 1983.
25. M. Ruhl. Efficient Algorithms for New Computational Models. PhD thesis, Massachusetts Institute of Technology, 2003.
26. M. Schindler. A fast block-sorting algorithm for lossless data compression. In Proceedings of the 1997 IEEE Data Compression Conference, page 469, 1997.
27. J. S. Vitter. Design and analysis of dynamic Huffman codes. Journal of the ACM, 34:825–845, 1987.
28. J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23:337–343, 1977.