String algorithms and data structures


Authors: Paolo Ferragina

Dipartimento di Informatica, Università di Pisa, Italy

Abstract. The string-matching field has grown to such a complicated stage that various issues come into play when studying it: data structure and algorithmic design, database principles, compression techniques, architectural features, cache and prefetching policies. The expertise nowadays required to design good string data structures and algorithms is therefore transversal to many computer science fields, and much more study on the orchestration of known, or novel, techniques is needed to make progress in this fascinating topic. This survey aims at illustrating the key ideas which should constitute, in our opinion, the current background of every index designer. We also discuss the positive features and drawbacks of known indexing schemes and algorithms, and devote much attention to detailing research issues and open problems on both the theoretical and the experimental side.

1 Introduction

String data is ubiquitous; commonplace applications are digital libraries and product catalogs (for books, music, software, etc.), electronic white and yellow page directories, specialized information sources (e.g. patent or genomic databases), customer relationship management data, etc. The amount of textual information managed by these applications is increasing at a staggering rate. The two best illustrative examples of this growth are the World-Wide Web, which is estimated to provide access to at least three terabytes of textual data, and the genomic databases, which are estimated to store more than fifteen billion base pairs. Even in private hands, collection sizes which were unimaginable a few years ago are now common. This scenario is destined to become more pervasive due to the migration of current databases toward XML storage [2].
XML is emerging as the de facto standard for the publication and interchange of heterogeneous, incomplete and irregular data over the Internet and amongst applications. It provides ground rules to mark up data so it is self-describing and easily readable by humans and computers. Large portions of XML data are textual and include descriptive fields and tags. Evaluating an XML query involves navigating paths through a tree (or, in general, a graph) structure. In order to speed up query processing, current approaches consist of encoding document paths into strings of arbitrary length (e.g. book/author/firstname/) and replacing tree navigational operations with string prefix queries (see e.g. [52,129,4]).

⋆ Address: Dipartimento di Informatica, Corso Italia 40, 56125 Pisa, Italy, ferragina@di.unipi.it, http://www.di.unipi.it/~ferragina. Partially supported by the Italian MIUR projects "Technologies and services for enhanced content delivery" and "A high-performance distributed platform".

In all these situations brute-force scanning of such large collections is not a viable approach to performing string searches. Some kind of index necessarily has to be built over these massive textual data to effectively process string queries (of arbitrary length), possibly taking into account the presence in our computers of various memory levels, each with its own technological and performance characteristics [8]. The index design problem therefore turns out to be more challenging than ever before. The American Heritage Dictionary (2000, fourth edition) defines index as follows: pl. in·dex·es or in·di·ces "1. Something that serves to guide, point out, or otherwise facilitate reference, especially: a. An alphabetized list of names, places, and subjects treated in a printed work, giving the page or pages on which each item is mentioned. b. A thumb index. c.
Any table, file, or catalog. [...]" Some definitions proposed by experts are: "The most important of the tools for information retrieval is the index—a collection of terms with pointers to places where information about documents can be found" [119]; "indexing is building a data structure that will allow quick searching of the text" [22]; or "the act of assigning index terms to documents which are the objects to be retrieved" [111]. From our point of view an index is a persistent data structure that allows, at query time, focusing the search for a user-provided string (or a set of them) on a very small portion of the indexed data collection, namely the locations at which the queried string(s) occur. Of course the index is just one of the tools needed to fully solve a user query, since the retrieval of the queried string locations is just the first step of what is called the "query answering process". Information retrieval (IR) models, ranking algorithms, query languages and operations, user-feedback models and interfaces, and so on, all constitute the rest of this complicated process and are beyond the scope of this survey. Hereafter we will concentrate our attention on the challenging problems concerned with the design of efficient and effective indexing data structures, the basic block upon which every IR system is built. We refer the reader interested in those other topics to the vast literature, browsing from e.g. [79,114,163,22,188].

The right step into the text-indexing field. The publications regarding indexing techniques and methodologies are a common outcome of database and algorithmic research. Their number is ever growing, so that citing all of them is a task doomed to fail.
This fact is contributing to making the evaluation of the novelty, impact and usefulness of the plethora of recent index proposals more and more difficult. Hence, to approach the huge field of text indexing from the correct angle, we first need a clear framework for the development, presentation and comparison of indexing schemes [193]. The lack of this framework has led some researchers to underestimate the features of known indexes, disregard important criteria, or make simplifying assumptions which have led them to unrealistic and/or distorted results.

The design of a new index passes through the evaluation of many criteria, not just its description and some toy experiments. We need at a minimum to consider overall speed, disk and memory space requirements, CPU time and measures of disk traffic (such as number of seeks and volume of data transferred), and ease of index construction. In a dynamic setting we should also consider index maintenance in the presence of addition, modification and deletion of documents/records, and implications for concurrency, transactions and recoverability. Also of interest for both static and dynamic data collections are applicability, extensibility and scalability. Indeed no indexing scheme is all-powerful: different indexes support different classes of queries and manage different kinds of data, so that they may turn out to be useful in different application contexts. As a consequence there is no single winner among the indexing data structures nowadays available; each one has its own positive features and drawbacks, and we must know all of their fine details in order to make the right choice when implementing an effective and efficient search engine or IR system.
In what follows we therefore go into the main aspects which influence the design of an indexing data structure, thus providing an overall view of the text indexing field; we introduce the arguments which will be detailed in the next sections, and we briefly comment on some recent topics of research that will be fully addressed at the end of each of these subsequent sections.

The first key issue: The I/O subsystem. The large amount of textual information currently available in electronic form requires storing it on external storage devices, like (multiple) disks and CD-ROMs. Although these mechanical devices provide a large amount of space at low cost, their access time is more than 10^5 times slower than the time to access the internal memory of computers [158]. This gap is currently widening with the impressive technological progress in circuit design. Ongoing research on the engineering side is therefore trying to improve the input/output subsystem by introducing hardware mechanisms such as disk arrays, disk caches, etc. Nevertheless, the improvement achievable by means of a proper arrangement of data and a properly structured algorithmic computation on disk devices abundantly surpasses the best expected technology advancements [186]. Larger datasets can stress the need for locality of reference in that they may reduce the chance of sequential (cheap) disk accesses to the same block or cylinder; they may increase the data fetch costs (which are typically linear in the dataset size); and they may even affect the proportion of documents/records that answer a user query. In this situation a naïve index might incur the so-called I/O bottleneck, that is, its update and query operations might spend most of their time transferring data to/from the disk, with a consequent sensible slowdown of their performance.
As a result, index scalability and the asymptotic analysis of index performance, orchestrated with the disk consciousness of index design, are nowadays hot and challenging research topics which have been shown to induce a positive effect not limited just to mechanical storage devices, but extending to all other memory levels (L1 and L2 caches, internal memory, etc.).

To design and carefully analyze the scalability and query performance of an index we need a computational model that abstracts the I/O subsystem in a reasonable way. Accurate disk models are complex [164], and it is virtually impossible to exploit all the fine points of disk characteristics systematically, either in practice or for algorithmic design. In order to capture in an easy, yet significant, way the differences between the internal (electronic) memory and the external (mechanical) disk, we adopt the external memory model proposed in [186]. Here a computer is abstracted to consist of a two-level memory: a fast and small internal memory, of size M, and a slow and arbitrarily large external memory, called disk. Data between the internal memory and the disk are transferred in blocks of size B (called disk pages). Since disk accesses are the dominating factor in the running time of many algorithms, the asymptotic performance of the algorithms is evaluated by counting the total number of disk accesses performed during the computation. This is a workable approximation for algorithm design, and we will use it to evaluate the performance of query and update algorithms. However there are situations, like the construction of indexing data structures (Sections 2.1 and 3.5), in which this accounting scheme does not accurately predict the running time of algorithms on real machines, because it does not take into account some important specialties of disk systems [162].
Namely, disk access costs have mainly two components: the time to fetch the first bit of requested data (seek time) and the time required to transmit the requested data (transfer rate). Transfer rates are more or less stable, but seek times are highly variable. It is thus well known that accessing one page from the disk in most cases decreases the cost of accessing the page succeeding it, so that "bulk" I/Os are less expensive per page than "random" I/Os. This difference becomes much more prominent if we also consider the read-ahead/buffering/caching optimizations which are common in current disks and operating systems. To deal with these specialties and avoid the introduction of many new parameters, we will sometimes refer to the simple accounting scheme introduced in [64]: a bulk I/O is the reading/writing of a contiguous sequence of cM/B disk pages, where c is a proper constant; a random I/O is any single disk-page access which is not part of a bulk I/O. In summary, the performance of the algorithms designed to build, process or query an indexing data structure is evaluated by measuring: (a) the number of random I/Os, and possibly the bulk I/Os; (b) the internal running time (CPU time); (c) the number of disk pages occupied by the indexing data structure and the working space of the query, update and construction algorithms.

The second key issue: types of queries and indexed data. Up to now we have talked about indexing data structures without specifying the type of queries that an index should be able to support, and no attention has been devoted to the type of data an index is called to manage. These issues have a surprising impact on the design complexity and space occupancy of the index, and will be strictly interrelated in the discussion below. There are two main approaches to index design: word-based indexes and full-text indexes.
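The two-parameter accounting scheme above can be made concrete with a small sketch (our own illustration, with hypothetical function names; not code from the survey): a sequential scan of contiguous pages is charged in bulk I/Os of cM/B pages each, whereas a scattered access pattern pays one random I/O per page touched.

```python
# Illustrative cost accounting in the external memory model: internal
# memory of size M, disk pages of B items, bulk I/Os moving c*M/B pages.
import math

def bulk_ios_for_scan(num_pages: int, M: int, B: int, c: int = 1) -> int:
    """Bulk I/Os needed to read `num_pages` contiguous disk pages."""
    bulk_size = c * M // B            # pages moved by one bulk I/O
    return math.ceil(num_pages / bulk_size)

def random_ios_for_scatter(num_pages: int) -> int:
    """Each isolated page access is charged as one random I/O."""
    return num_pages

# Example: M = 2**20 memory words, B = 2**10 words/page -> bulk size 1024 pages.
print(bulk_ios_for_scan(5000, M=2**20, B=2**10))   # -> 5
print(random_ios_for_scatter(5000))                # -> 5000
```

The gap between the two counts (5 vs. 5000 here) is exactly why disk-conscious index designs strive to turn random accesses into bulk ones.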
Word-based indexes are designed to work on linguistic texts, or on documents where a tokenization into words may be devised. Their main idea is to store the occurrences of each word (token) in a table that is indexed via a hashing function or a tree structure (they are usually called inverted files or inverted indexes). To reduce the size of the table, common words are either not indexed (e.g. the, at, a) or the index is later compressed. The advantage of this approach is to support very fast word (or prefix-word) queries and to allow, at reasonable speed, some complex searches like regular-expression or approximate matches; two weaknesses are the impossibility of dealing with non-tokenizable texts, like genomic sequences, and the slowness in supporting arbitrary substring queries. Section 2 will be devoted to the discussion of word-based indexes and some recent advancements in their implementation, compression and supported operations. Particular attention will be devoted to the techniques used to compress the inverted index or the input data collection, and to the algorithms adopted for implementing more complex queries.

Full-text indexes have been designed to overcome the limitations above by dealing with arbitrary texts and general queries, at the cost of an increase in the additional space occupied by the underlying data structure. Examples of such indexes are: suffix trees [128], suffix arrays [121] and String B-trees [71]. They have been successfully applied to fundamental string-matching problems as well as to text compression [42], analysis of genetic sequences [88], optimization of XPath queries on XML documents [52,129,4] and the indexing of special linguistic texts [67].
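To fix ideas, here is a toy full-text index (illustrative only; real systems use suffix trees, suffix arrays built in O(n log n) time, or the String B-trees discussed in Section 3): a suffix array over the text, answering arbitrary substring queries by binary search over the sorted suffixes.

```python
# Minimal suffix-array sketch: build by sorting suffixes, query by
# binary search for the range of suffixes that start with the pattern.

def build_suffix_array(text: str) -> list[int]:
    # O(n^2 log n) construction; fine for small examples only
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    # binary search for the leftmost suffix >= pattern
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:] < pattern:
            lo = mid + 1
        else:
            hi = mid
    # collect the consecutive suffixes that actually start with pattern
    occ = []
    while lo < len(sa) and text[sa[lo]:sa[lo] + len(pattern)] == pattern:
        occ.append(sa[lo])
        lo += 1
    return sorted(occ)

text = "banana"
sa = build_suffix_array(text)              # [5, 3, 1, 0, 4, 2]
print(find_occurrences(text, sa, "ana"))   # -> [1, 3]
```

Note how the query never tokenizes the text: this is precisely the flexibility (arbitrary substrings, non-linguistic data) that word-based indexes cannot offer.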
General full-text indexes are therefore the natural choice to perform fast complex searches without any restrictions on the query sequences or on the format of the indexed data; however, a reader should always keep in mind that these indexes are usually more space-demanding than their word-based counterparts [112,49] (cf. opportunistic indexes [75] below). Section 3 will be devoted to a deep discussion of full-text indexes, paying particular attention to the String B-tree data structure and its engineering. In particular we will introduce some novel algorithmic and data-structural solutions which are not confined to this specific data structure. Attention will be devoted to the challenging, yet difficult, problem of the construction of a full-text index, both from a theoretical and a practical perspective. We will show that this problem is related to the more general problem of string sorting, and then discuss the known results and a novel randomized algorithm which may have practical utility and whose technical details may be of independent interest.

The third key issue: the space vs. time trade-off. The discussion on the two indexing approaches above has pointed out an interesting trade-off: space occupancy vs. flexibility and efficiency of the supported queries. It indeed seems that in order to support substring queries, and deal with arbitrary data collections, we need to incur the additional space overhead required by the more complicated structure of the full-text indexes. Some authors argue that this extra space occupancy is a false problem because of the continued decline in the cost of external storage devices.
However, the impact of space reduction goes far beyond the intuitive memory saving, because it may induce a better utilization of (the fast) cache and (the electronic) internal memory levels, may virtually expand the disk bandwidth, and may significantly reduce the (mechanical) seek time of disk systems. Hence data compression is an attractive choice, if not mandatory, not only for storage saving but also for its favorable impact on algorithmic performance. This is very well known in algorithmics [109] and engineering [94]: IBM has recently delivered MXT (Memory eXpansion Technology) for its x330 eServers, which consists of a memory chip that compresses/decompresses data on cache writebacks/misses, thus yielding a factor-of-two expansion of memory size at just a slightly larger cost. It is not surprising, therefore, that we are witnessing in the algorithmic field an upsurge of interest in designing succinct (or implicit) data structures (see e.g. [38,143,144,142,87,168,169]) that try to reduce as much as possible the auxiliary information kept for indexing purposes, without introducing any significant slowdown in the supported operations. Such a research trend has led to some surprising results on the design of compressed full-text indexes [75] whose impact goes beyond the text-indexing field. These results lie at the crossing of three distinct research fields (compression, algorithmics, databases) and orchestrate together their latest achievements, thus showing once more that the design of an indexing data structure is nowadays an interdisciplinary task. In Section 4 we will briefly overview this issue by introducing the concept of opportunistic index: a data structure that tries to take advantage of the compressibility of the input data to reduce its overall space occupancy.
This index encapsulates both the compressed data and the indexing information in a space which is proportional to the entropy of the indexed collection, thus being optimal in an information-content sense. Yet these results are mainly theoretical in flavor and open to significant improvements with respect to their I/O performance. Some of them have been implemented and tested in [76,77], showing that these data structures use roughly the same space required by traditional compressors (such as gzip and bzip2 [176]) but with added functionality: they allow retrieval of the occurrences of an arbitrary substring within texts of several megabytes in a few milliseconds. These experiments show a promising line of research and suggest the design of a new family of text retrieval tools, which will be discussed at the end of Section 4.

The fourth key issue: string transactions and index caching. Not only is string data proliferating, but data stores increasingly handle large numbers of string transactions that add, delete, modify or search strings. As a result, the problem of managing massive string data under a large number of transactions is emerging as a fundamental challenge. Traditionally, string algorithms focus on supporting each of these operations individually in the most efficient manner in the worst case. There is however an ever-increasing need for indexes that are efficient on an entire sequence of string transactions, possibly adapting themselves to the time-varying distribution of the queries and to the repetitiveness present in the query sequence at both the string and the prefix level. Indeed it is well known that some user queries are frequently issued in certain time intervals [173], and some search engines improve their precision by expanding the query terms with some of their morphological variations (e.g. synonyms, plurals, etc.) [22].
Consequently, in the spirit of amortized analysis [180], we would like to design indexing data structures that are competitive (optimal) over the entire sequence of string operations. This challenging issue has been addressed at the heuristic level in the context of word-based indexes [173,39,125,131,101], but it has unfortunately been disregarded when designing and analyzing full-text indexes. Here the problem is particularly difficult because: (1) a string may be so long that it does not fit in one single disk page, or cannot even be contained in internal memory; (2) each string comparison may need many disk accesses if executed in a brute-force manner; and (3) the distribution of the string queries may be unknown or may vary over time. A first, preliminary contribution in this setting has been achieved in [48], where a self-adjusting and external-memory variant of the skip-list data structure [161] has been presented. By properly orchestrating the caching of this data structure, the caching of some query-string prefixes and the effective management of string items, the authors prove an external-memory version for strings of the famous Static Optimality Theorem [180]. This introduces a new framework for designing and analyzing full-text indexing data structures and string-matching algorithms, where a stream of user queries is issued by an unknown source and caching effects must then be exploited and accounted for when analyzing the query operations. In the next sections we will address the caching issue both for word-based and full-text indexing schemes, pointing out some interesting research topics which deserve a deeper investigation.
The moral that we would like to convey to the reader is that the text-indexing field has grown to such a complicated stage that various issues come into play when studying it: data structure design, database principles, compression techniques, architectural considerations, cache and prefetching policies. The expertise nowadays required to design a good index is therefore transversal to many algorithmic fields, and much more study on the orchestration of known, or novel, techniques is needed to make progress in this fascinating topic. The rest of the survey is therefore devoted to illustrating the key ideas which should constitute, in our opinion, the current background of every index designer. The guiding principles of our discussion will be the four key issues above; they will guide the description of the positive features and drawbacks of known indexing schemes as well as the investigation of research issues and open problems. A vast, but obviously not complete, literature will accompany our discussion and should be the reference where an eager reader may find further technical details and research hints.

2 On the word-based indexes

There are three main approaches to the design of a word-based index: inverted indexes, signature files and bitmaps [188,22,19,63]. The inverted index (also known as inverted file, posting file, or in normal English usage as concordance) is doubtless the simplest and most popular technique for indexing large text databases storing natural-language documents. The other two mechanisms are usually adopted in certain applications even if, recently, they have been mostly abandoned in favor of inverted indexes, because some extensive experimental results [194] have shown that: "Inverted indexes offer better performance than signature files and bitmaps, in terms of both size of index and speed of query handling" [188].
As a consequence, the emphasis of this section is on inverted indexing; a reader interested in signature files and/or bitmaps may start browsing from [188,22] and have a look at some more recent, correlated and stimulating results in [33,134]. An inverted index is typically composed of two parts: the lexicon, also called the vocabulary, containing all the distinct words of the text collection; and the inverted list, also called the posting list, storing for each vocabulary term a list of all text positions in which that term occurs. The vocabulary therefore supports a mapping from words to their corresponding inverted lists, and in its simplest form it is a list of strings and disk addresses. The search for a single word in an inverted index consists of two main phases: it first locates the word in the vocabulary and then retrieves its list of text positions. The search for a phrase or a proximity pattern (where the words must appear consecutively or close to each other, respectively) consists of three main phases: each word is searched separately, their posting lists are then retrieved and finally intersected, taking care of the consecutiveness or closeness of the word positions in the text. It is apparent that the inverted index is a simple and natural indexing scheme, and this has obviously contributed to its spread among IR systems. Starting from this simple theme, researchers have indulged their whims by proposing numerous variations and improvements. The main aspect which has been investigated is the compression of the vocabulary and of the inverted lists. In both cases we are faced with some challenging problems. Since the vocabulary is a textual file, any classical compression technique might be used, provided that subsequent pattern searches can be executed efficiently.
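The two search phases just described can be sketched in a few lines (a toy, in-memory illustration with hypothetical names, not a production design): a vocabulary mapping each term to its sorted list of word positions, a single-word lookup, and a phrase query implemented by intersecting position-shifted posting lists.

```python
# Toy inverted index with word-position posting lists; phrase search
# intersects the lists while enforcing consecutiveness of positions.
from collections import defaultdict

def build_index(text: str) -> dict:
    index = defaultdict(list)          # vocabulary term -> sorted word positions
    for pos, word in enumerate(text.lower().split()):
        index[word].append(pos)
    return index

def phrase_search(index: dict, phrase: str) -> list[int]:
    words = phrase.lower().split()
    if any(w not in index for w in words):
        return []
    # start from the first word's postings, then keep only positions p such
    # that word i of the phrase occurs at p + i
    result = set(index[words[0]])
    for offset, w in enumerate(words[1:], start=1):
        result &= {p - offset for p in index[w]}
    return sorted(result)

idx = build_index("to be or not to be")
print(idx["be"])                       # -> [1, 5]   (single-word lookup)
print(phrase_search(idx, "to be"))     # -> [0, 4]   (phrase query)
```

Disk-resident indexes differ in representation (compressed lists, blocked vocabularies), but the lookup-then-intersect logic is the same.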
Since the inverted lists are constituted by numbers, any variable-length encoding of integers might be used, provided that subsequent sequential decodings can be executed efficiently. Of course, any choice in vocabulary or inverted-list implementation influences both the processing speed of queries and the overall space occupied by the inverted index. We proceed to comment on each of these points below, referring the reader interested in their fine details to the cited literature.

The vocabulary is the basic block of the inverted index, and its "content" constrains the type of queries that a user can issue. Actually the index designer is free to decide what a word is, and which are the representative words to be included in the vocabulary. One simple possibility is to take each of the words that appear in the documents and declare them verbatim to be vocabulary terms. This tends both to enlarge the vocabulary, i.e. the number of distinct terms that appear in it, and to increase the number of document/position identifiers that must be stored in the posting lists. Having a large vocabulary not only affects the storage space requirements of the index but can also make it harder to use, since there are more potential query terms that must be considered when formulating a query. For this reason it is common to transform each word into some normal form before it is included in the vocabulary. The two classical approaches are case folding, the conversion of all upper-case letters to their lower-case equivalents (or vice versa), and stemming, the reduction of each word to its morphological root by removing suffixes or other modifiers.
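The two normalizations can be sketched as follows (a deliberately crude illustration: the suffix list is invented for the example, and real systems use a proper stemmer such as Porter's algorithm rather than this naive stripping):

```python
# Case folding + naive suffix-stripping "stemmer" (illustration only).
SUFFIXES = ("ing", "ed", "es", "s")    # hypothetical, for the example

def normalize(word: str) -> str:
    w = word.lower()                   # case folding
    for suf in SUFFIXES:
        # strip one suffix, keeping a stem of at least 3 letters
        if w.endswith(suf) and len(w) - len(suf) >= 3:
            return w[: -len(suf)]
    return w

print([normalize(w) for w in ["Indexes", "indexing", "indexed", "Index"]])
# -> ['index', 'index', 'index', 'index']
```

All four surface forms collapse onto a single vocabulary term, which is exactly the vocabulary-compression effect (and the precision loss) discussed next.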
It is evident that both approaches present advantages (vocabulary compression) and disadvantages (extraneous material can be retrieved at query time) which should be taken into account when designing an IR system. Another common transformation consists of omitting the so-called stop words from the indexing process (e.g., a, the, in): these are words which occur too often, or carry such small information content, that their use in a query would be unlikely to eliminate any documents. In the literature there has been a big debate on the usefulness of removing or keeping the stop words. Recent progress on the compaction of the inverted lists has shown that the space overhead induced by those words is not significant, and is abundantly paid for by the simplification of the indexing process and by the increased flexibility of the resulting index.

The size of the vocabulary deserves particular attention. It is intuitive that it should be small, but more insight into its cardinality and structure must be acquired in order to go into more complex considerations regarding its compression and querying. An empirical law widely accepted in IR is Heaps' Law [91], which states that the vocabulary of a text of n words is of size V = O(n^β), where β is a small positive constant depending on the text. As shown in [16], β is practically between 0.4 and 0.6, so the vocabulary needs space roughly proportional to the square root of the indexed data. Hence for large data collections the overhead of storing the vocabulary, even in its extended form, is minimal. Classical implementations of a set of words via hash tables and trie structures seem appropriate for exact-word or prefix-word queries.
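Heaps' Law V = O(n^β) can be checked empirically by measuring vocabulary growth at two sizes of a word stream and fitting β as the slope on a log-log plot. The sketch below uses synthetic, Zipf-like data (invented for illustration; the exponent it produces depends on this data, not on any real collection):

```python
# Empirical Heaps'-law exponent on a synthetic Zipf-like word stream:
# beta = log(V2/V1) / log(n2/n1), i.e. the log-log slope of V against n.
import math
import random

random.seed(7)
# word "ranks" drawn log-uniformly, so P(rank = k) decays roughly like 1/k
stream = [int(math.exp(random.uniform(0, math.log(50_000)))) for _ in range(200_000)]

seen, points = set(), []
for n, w in enumerate(stream, 1):
    seen.add(w)
    if n in (10_000, 200_000):
        points.append((n, len(seen)))   # (text size, vocabulary size)

(n1, v1), (n2, v2) = points
beta = math.log(v2 / v1) / math.log(n2 / n1)
print(f"estimated beta = {beta:.2f}")   # sublinear growth: 0 < beta < 1
```

On real natural-language collections the fitted exponent is reported to fall between 0.4 and 0.6 [16]; the point of the sketch is only that vocabulary growth is markedly sublinear in n.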
As soon as the user aims for more complicated queries, like approximate or regular-expression searches, it is preferable to keep the vocabulary in its plain form, as a vector of words, and then answer a user query via one of the powerful scan-based string-matching algorithms currently known [148]. The increase in query time is paid for by the more complicated queries the index is able to support. As we observed in the Introduction, space saving is intimately related to time optimization in a hierarchical memory system, so it is natural to ask ourselves if, and how, compression can help in vocabulary storage and searching. On the one hand, vocabulary compression might seem useless because of the vocabulary's small size; on the other hand, any improvement in the vocabulary search phase is appealing because the vocabulary is examined, over all of its constituting terms, at each query. Numerous scientific results [9,118,82,81,184,65,139,108,154,178,57,140,149,106] have recently shown how to compress a textual file and perform exact or approximate searches directly on the compressed text without passing through its whole decompression. This approach may obviously be applied to vocabularies, thus introducing two immediate improvements: it squeezes them to a size that can easily be kept in internal memory even for large data collections; and it reduces the amount of data examined during the query phase, fully exploiting the processing speed of current processors with respect to the bandwidth and access time of internal memories, thus impacting fruitfully on the overall query performance. Experiments have shown a speed-up of a factor of about two in query processing, and a reduction of more than a factor of three in space occupancy.
Nonetheless, the whole compressed dictionary is still scanned, so some room for query-time improvement remains. We will come back to this issue in Section 4.

Most of the space usage of inverted indexes is devoted to the storage of the inverted lists; a proper implementation for them thus becomes crucial in order to make this approach competitive against the other word-based indexing methods: signature files and bitmaps [188,194]. A large research effort has therefore been devoted to compressing the inverted lists effectively while still guaranteeing fast sequential access to their contents. Three different types of compaction approaches have been proposed in the literature, distinguished according to the accuracy with which the inverted lists identify the location of a vocabulary term, usually called the granularity of the index. A coarse-grained index identifies only the documents where a term occurs; an index of moderate grain partitions the texts into blocks and stores the block numbers where a term occurs; a fine-grained index returns instead a sentence, a term number, or even the character position of every term in the text. Coarse indexes require less storage (less than 25% of the collection size), but during the query phase parts of the text must be scanned in order to find the exact locations of the query terms; also, with a coarse index, multi-term queries are likely to give rise to insignificant matches, because the query terms might appear in the same document but far from each other. At the other extreme, word-level indexing enables queries involving adjacency and proximity to be answered quickly, because the desired relationship can be checked without accessing the text.
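To make the granularity distinction concrete, here is a minimal Python sketch (the collection and all names are ours) contrasting a coarse document-level index with a fine word-level one:

```python
# Sketch of two index granularities over a toy collection: a coarse
# document-level index versus a fine word-level index.
from collections import defaultdict

docs = ["the cat sat", "the dog ran", "a cat ran home"]

doc_index = defaultdict(set)     # coarse: term -> document numbers
word_index = defaultdict(list)   # fine:   term -> (document, word position) pairs

for d, text in enumerate(docs):
    for pos, term in enumerate(text.split()):
        doc_index[term].add(d)
        word_index[term].append((d, pos))

# The coarse index answers "which documents contain 'cat'?" without the text...
print(sorted(doc_index["cat"]))   # -> [0, 2]
# ...but a proximity query ("cat" right before "ran") needs word positions:
pairs = [(d, p) for d, p in word_index["cat"] if (d, p + 1) in word_index["ran"]]
print(pairs)                      # -> [(2, 1)]
```

With only the coarse index, the proximity query would report documents 1 and 2 as candidates and then scan their text, which is exactly the post-retrieval cost discussed above.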
However, adding precise locational information expands the index by a factor of at least two or three compared with document-level indexing, since there are more pointers in the index and each one requires more bits of storage. In this case the inverted lists take nearly 60% of the collection size. Unless a significant fraction of the queries are expected to be proximity-based, or "snippets" containing text portions where the query terms occur must be efficiently visualized, it is preferable to choose a document-level granularity; proximity and phrase-based queries, as well as snippet extraction, can then be handled by a post-retrieval scan.

In all these cases the size of the resulting index can be further squeezed by adopting a compression approach which is orthogonal to the previous ones. The key idea is that each inverted list can be sorted in increasing order, so the gaps between consecutive positions can be stored instead of their absolute values. Compression techniques for small integers can then be used. As the gaps of longer lists are smaller, longer lists compress better, and thus stop words can be kept without introducing a significant overhead in the overall index space. A number of suitable codes are described in detail in [188]; more experiments are reported in [187]. Golomb codes are suggested as the best ones in many situations, e.g. on the TREC collection, especially when the integers are distributed according to a geometric law. Our experience, however, suggests a simpler yet effective coding scheme, called continuation bit, currently adopted by the AltaVista and Google search engines for storing their inverted lists compactly. This coding scheme yields a byte-aligned and compact representation of an integer x as follows.
First, the binary representation of x is partitioned into groups of 7 bits each, possibly appending zeros to its beginning; then, one bit is appended to the front of each group, set to one for the first group and to zero for the other groups; finally, the resulting sequence of 8-bit groups is allocated to a contiguous sequence of bytes. The byte alignment ensures fast decoding/encoding operations, whereas the tagging of the first bit of every byte ensures fast detection of codeword beginnings. For an integer x, this representation needs ⌈(⌊log₂ x⌋ + 1)/7⌉ bytes; experiments show that its overhead with respect to Golomb codes is small, but the continuation-bit scheme is by far faster in decoding, making it the natural choice whenever space is not the main concern. If further space overhead is allowed and queries have to be sped up, other integer coding approaches exist. Among others we cite the frequency-sorted index organization of [159], which sorts the posting lists in decreasing order of frequency to facilitate the immediate retrieval of relevant occurrences, and the blocked index of [7], which computes the gaps with respect to some equally-spaced pivots to avoid decoding parts of the inverted lists during their intersection at query time.

There is another approach to index compression which encompasses all the others, because it can be seen as their generalization. It is called the block-addressing index and was introduced some years ago in a system called Glimpse [122]. The renewed interest in it is due to some recent results [153,75] which have shed new light on its structure and opened the door to further improvements. In this indexing scheme, the whole text collection is divided into blocks of fixed size; these blocks may span many documents, be part of a document, or overlap document boundaries.
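The gap coding and the continuation-bit code described above can be combined in a short Python sketch (our own rendition; production systems operate directly on disk pages and tune many details):

```python
# Sketch: a sorted posting list is turned into gaps, and each gap is emitted
# as a continuation-bit codeword (7-bit groups; the tag bit is 1 only on the
# first byte of each codeword, so codeword beginnings are easy to detect).

def to_gaps(postings):
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def encode(x):
    groups = []
    while True:                      # split x into 7-bit groups
        groups.append(x & 0x7F)
        x >>= 7
        if x == 0:
            break
    groups.reverse()                 # most significant group first
    return bytes((0x80 if i == 0 else 0x00) | g for i, g in enumerate(groups))

def decode_stream(data):
    values, current = [], 0
    for i, b in enumerate(data):
        if b & 0x80:                 # tag bit set: a new codeword begins
            if i > 0:
                values.append(current)
            current = b & 0x7F
        else:
            current = (current << 7) | b
    values.append(current)
    return values

postings = [3, 7, 8, 300]
stream = b"".join(encode(g) for g in to_gaps(postings))
print(decode_stream(stream))         # -> [3, 4, 1, 292]
```

Note how the byte alignment makes decoding a matter of a few shifts and masks, and how a decoder can resynchronize at any byte whose high bit is set.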
The index stores only the block numbers where each vocabulary term appears. This introduces two space savings: multiple occurrences of a vocabulary term in a block are represented only once, and few bits are needed to encode a block number. Since there are normally many fewer blocks than documents, the space occupied by the index is very small and can be tuned according to the user's needs. On the other hand, the index may be used only as a device to identify candidate blocks which may contain a query-string occurrence. As a result, a post-processing phase is needed to filter out the candidate blocks which do not actually contain a match (e.g., the block spans two documents and the query terms are spread between them). Like the document-level indexing scheme, block addressing requires very little space, close to 5% of the collection size [122], but its query performance is modest because of the post-processing step, and it critically depends on the block size. Indeed, by varying the block size we can make the block-addressing scheme range from coarse-grained to fine-grained indexing. The smaller the block size, the closer we are to a word-level index: the larger the index, but the faster the query processing. At the other extreme, the larger the block size, the smaller the space occupancy but the larger the query time. Finding a good trade-off between these two quantities is then a matter of user needs; the analysis we conduct below is based on some reasonable assumptions on the distribution of the vocabulary terms and the linguistic structure of the documents [20,21]. This allows us to argue about some positive features of the block-addressing scheme. Heaps' law, introduced above, gives a bound on the vocabulary size.
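Before turning to the analysis, the mechanics just described (index the blocks only, then verify the candidates) can be sketched in Python; the collection and the block size are toy values of ours:

```python
# Sketch of block addressing: the index maps each term only to the fixed-size
# blocks where it occurs; queries verify candidate blocks by scanning them.
from collections import defaultdict

text = "the quick fox the lazy dog the fox den"
BLOCK_SIZE = 3                                  # words per block (toy value)
words = text.split()
blocks = [words[i:i + BLOCK_SIZE] for i in range(0, len(words), BLOCK_SIZE)]

index = defaultdict(set)
for b, block in enumerate(blocks):
    for term in block:
        index[term].add(b)                      # one entry per (term, block)

def search(term):
    # Post-processing step: scan only the candidate blocks for true matches.
    return [(b, pos) for b in sorted(index[term])
            for pos, w in enumerate(blocks[b]) if w == term]

print(search("fox"))    # -> [(0, 2), (2, 1)]
```

Note that "the" occurs three times in the collection but contributes only three block entries, one per block, which is exactly the space saving discussed above.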
Another useful law related to the vocabulary is Zipf's Law [190], which states that, in a text of n terms, the i-th most frequent term appears n/(i^θ z) times, where θ is a constant that depends on the data collection (typical experimental values [90] are in [1.7, 2.0]) and z is a normalization factor. Given this model, it has been shown in [21] that the block-addressing scheme may achieve O(n^0.85) space and query time complexity; notice that both complexities are sublinear in the data size.

Apart from these analytical calculations, it is apparent that speeding up the post-processing step (i.e. the scanning of the candidate blocks) would improve the query performance of the index. This was the starting point of the fascinating paper [153], which investigated how to combine in a single scheme: index compression, block addressing and sequential search on compressed text. In this paper the specialized compression technique of [140] is adopted to squeeze each text block to less than 25% of its original size, and to perform direct searching on the compressed candidate blocks without their whole decompression. The specialty of this compression technique is that it is a variant of Huffman's algorithm with byte-aligned and tagged codewords. Its basic idea is to build a Huffman tree with fan-out 128, so that the binary codewords have length a multiple of 7 bits. These codewords are then partitioned into groups of 7 bits; to each group is appended a bit that is set to 1 for the first group and to 0 for the others; finally, each 8-bit group is allocated to a byte.
The resulting codewords have many nice properties: (1) they are byte-aligned, hence their decoding is fast and requires very few shift/masking operations; (2) they are tagged, hence the beginning of each codeword can be easily identified; (3) they allow exact pattern matching directly over the compressed block, because no tagged codeword can overlap more than two tagged codewords; (4) they allow the search for more complex patterns directly on the compressed blocks [140,153]. The overall result is an improvement of a factor of about 3 over well-known tools like Agrep [189] and Cgrep [140], which operate on uncompressed blocks. If we add to these interesting features the fact that the symbol table of this Huffman variant is actually the vocabulary of the indexed collection, we may conclude that this approach couples perfectly well with the inverted-index scheme. Figure 1 provides a pictorial summary of the block-addressing structure. We will come back to this approach in Section 4, where we discuss and analyze a novel compressed index for the candidate blocks which has opened the door to further improvements.

Fig. 1. The high-level structure of the block-addressing scheme: the vocabulary, the Huffman tree (fan-out 128), the inverted lists (compressed and paged), and the block structure over the compressed documents.

2.1 Constructing an inverted index

This journey among the inverted-index variations and results has highlighted some of their positive features as well as their drawbacks. It is clear that the structure of the inverted index is suitable to be mapped onto a two-level memory system, like the disk/memory case.
The vocabulary can be kept in internal memory: it is usually small, and random accesses must be performed on its terms in order to answer the user queries. The inverted lists can be allocated on disk, each in a contiguous sequence of disk pages, thus fully exploiting the prefetching/caching capabilities of current disks during the subsequent gap-decoding operations. In this case the performance of current processors is sufficient to make the decoding cost transparent with respect to the cost incurred for fetching the compressed lists from disk.

There is, however, another issue which has not been addressed in the previous sections and offers some challenging problems to be dealt with: the construction of the inverted lists. Here the I/O bottleneck can play a crucial role, and a naïve algorithm might be unable to build the index even for collections of moderate size. The use of in-memory data structures larger than the actual internal memory, accessed non-sequentially, might cause such high paging activity as to require one I/O per operation! Efficient methods have been presented in the literature [136,188] to allow a more economical index construction. From a high-level point of view, they follow an algorithmic scheme reminiscent of the multiway mergesort algorithm; however, the specialties of the problem make compression a key tool to reduce the volume of processed data, and force a reorganization of the operations so as to make use of sequential disk-based processing. For the sake of completeness, we sketch here an algorithm that has been used to build an inverted index over a multi-gigabyte collection of texts within a few tens of megabytes of internal memory and only a small amount of extra disk space.
The algorithm will be detailed for the case of a document-level indexing scheme; other extensions are possible and are left to the reader as an exercise. The basis of the method is a process that creates a file of pairs ⟨d, t⟩, where d is a document number and t is a term number. Initially the file is ordered by increasing d; then the file is reordered by increasing t using an in-place multiway external mergesort. This sorting phase is then followed by an in-place permutation of the disk pages that collectively constitute the inverted lists, in order to store each of them in a consecutive sequence of disk pages. In detail, the collection is read in document order and parsed into terms, which will form the vocabulary of the inverted index. A bounded amount of internal memory is set aside as a working buffer. Pairs ⟨d, t⟩ are collected into the buffer until it is full; the buffer is then sorted according to the term numbers and a run of disk pages is written to disk in a compressed format (padding is used to get disk-page alignment). Once the whole collection has been processed, the resulting runs are combined via a multiway merge: just one block of each run is resident in memory at any given time, so the memory requirement is modest. As the merge proceeds, output blocks are produced and written back to disk (properly compressed) to any available slot. Notice that there will always be one slot available, because the reading (merging) process frees block slots at a faster rate than blocks are consumed by the writing process. Once all the runs have been exhausted, the index is complete, but the inverted lists are spread over the disk, so that locality of reference is absent and this would slow down the subsequent query operations.
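The run-formation and merging steps of this scheme can be sketched in Python; heapq.merge stands in for the disk-based multiway merge, and the buffer bound plays the role of the internal-memory budget (all values are toy ones of ours):

```python
# Sketch of the run-formation/merge scheme described above: (d, t) pairs are
# accumulated in a bounded buffer, each full buffer is sorted by term and
# flushed as a run, and the runs are finally merged (heapq.merge keeps only
# one head element of each run in memory at a time, like one block per run).
import heapq

docs = ["b a c", "a c", "c b b"]
BUFFER_SIZE = 4                                   # toy bound on internal memory

runs, buffer = [], []
for d, text in enumerate(docs):
    for t in text.split():
        buffer.append((t, d))
        if len(buffer) == BUFFER_SIZE:
            runs.append(sorted(buffer))           # flush one sorted run "to disk"
            buffer = []
if buffer:
    runs.append(sorted(buffer))

inverted = {}
for t, d in heapq.merge(*runs):                   # multiway merge of the runs
    postings = inverted.setdefault(t, [])
    if not postings or postings[-1] != d:         # document-level: one entry per doc
        postings.append(d)

print(inverted)    # -> {'a': [0, 1], 'b': [0, 2], 'c': [0, 1, 2]}
```

The real algorithm additionally compresses each run, pads to page boundaries, and writes merged output into freed slots; those disk-level details are omitted here.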
An in-place permutation is then used to reorder the blocks so as to allocate each inverted list in a contiguous sequence of disk pages. This step is disk-intensive but usually takes a short amount of time. At the end, a further pass over the lists can be executed to "refine" their compression; any now-unused space at the end of the file can be released. Experimental results [188,153] have shown that the amount of internal memory dedicated to the sorting process has, as expected, a large impact on the final time complexity. Just to give an idea, a 5 Gb collection can be inverted using internal memory space equal to just that required for the vocabulary, and disk space about 10% larger than the final inverted lists, at an overall rate of about 300 Mb of text per hour [188]. If more internal memory is reserved for the sorting process, an overall rate of about 1 Gb of text per hour can be achieved [153].

2.2 Some open problems and future research directions

We conclude this section by addressing some other interesting questions which, we think, deserve attention and further investigation. First, we point out one challenging feature of the block-addressing scheme which has not yet been fully exploited: the vocabulary allows approximate or complex pattern queries on the text collection to be turned into an exact search, over the candidate blocks, for the (possibly many) vocabulary terms matching the complex user query. This feature has been deployed in the solutions presented in [140,153] to speed up the whole scanning of the compressed candidate blocks. We point out here a different perspective which may help in further improving the post-processing phase.
Indeed, we might build a succinct index that supports just exact pattern searches on each compressed block, and then use it in combination with the block-addressing scheme to support arbitrarily complex pattern searches. This index would gain powerful queries, reduced space occupancy and, more importantly, a faster search operation, because the cost of searching a candidate block could be o(b). This would impact the overall index design and performance. A proposal in this direction has been pursued in [75], where it has been shown that this novel approach achieves both space overhead and query time sublinear in the data collection size, independently of the block size b. Conversely, inverted indexes achieve only the second goal [188], and classical block-addressing schemes achieve both goals but under some restrictive conditions on the value of b [21].

Another interesting topic of research concerns the design of indexes and methods supporting faster vocabulary searches for complex pattern queries. Hashing or trie structures are well suited to implementing (prefix-)word queries, but they fail in supporting suffix, substring or approximate word searches. In these cases the common approach consists of scanning the whole vocabulary, thus incurring a performance slowdown that prevents its use in search engines aiming for high throughput. Filtering methods [148] as well as novel metric indexes [45] might possibly help in this respect, but simple yet effective data structures with provable query bounds are still to be designed.

We have observed that the block-addressing scheme and gap-coding methods are the most effective tools to squeeze the posting lists into a reduced space. A gap-coding algorithm achieves the best compression ratio when most of the differences are very small.
Several authors [34,35,135] have noticed that this occurs when the document numbers in each posting list have high locality, and hence they designed methods to passively exploit this locality whenever it is present in the posting lists. A different approach to this problem has been undertaken recently in [32], where the authors suggest permuting the document numbers in order to actively create locality in the individual posting lists. They propose a hierarchical clustering technique which is applied to the document collection as a whole, using the cosine measure as the basis of document similarity. The hierarchical clustering tree is then traversed in preorder and numbers are assigned to the documents as they are encountered. The authors argue that documents sharing many term lists should be close together in the tree, and therefore be labeled with nearby numbers. This idea was tested on the TREC-8 data (disks 4 and 5, excluding the Congressional Record), and showed a space improvement of 14%. Different similarity measures to build the hierarchical tree, as well as different clustering approaches which possibly avoid exploring the complete graph of all documents, constitute good avenues for research.

Another interesting issue is the exploitation of the large internal memory currently available in our PCs to improve the query performance. A small fraction of the internal memory is already used at run time to maintain the vocabulary of the document terms and thus to support fast word searches in response to a user query. It is therefore natural to aim at using the rest of the internal memory to cache parts of the inverted index, or the last query answers, in order to exploit the reference and temporal locality commonly present in query streams [99,179] and so achieve improved query performance.
Due to the ubiquitous use of inverted lists in current web search engines, and the ever-increasing number of user queries issued per day, the design of caching methodologies suitable for inverted-indexing schemes is becoming a hot topic of research. Numerous papers have recently been published on this subject, see e.g. [173,39,125,131,101], which offer some challenging problems for further study: how the interplay between the retrieval and ranking phases impacts the caching strategy, how the compression of inverted lists affects the behavior of caching schemes, and how to extend the caching ideas developed for stand-alone machines to a distributed information retrieval architecture [131,183]. We refer the reader to the latest WWW, VLDB and SIGMOD/PODS conferences to keep track of this active research field.

On the software development side, there is much room for data-structural and algorithmic engineering, as well as code tuning and library design. Here we would like to point out just one of the numerous research directions, which encompasses the interesting XML language [2]. XML is an extremely versatile markup language, capable of labeling the information content of diverse data sources including structured or semi-structured documents, relational databases and object repositories. A query issued on XML documents might intelligently exploit their structure to manage all these kinds of data uniformly and to enrich the precision of the query answers. Since XML was completed in early 1998 by the World Wide Web Consortium [2], it has spread through science and industry, thus becoming a de facto standard for the publication and interchange of structured data over the Internet and amongst applications. The turning point is that XML allows the semantics of data to be represented in a structured, documented, machine-readable form.
This has led some researchers to talk about the "semantic Web", to capture the idea of having data on the Web defined and linked in a way that can be used by machines not just for display (cf. HTML), but for automation, integration, reuse across various applications and, last but not least, for performing "semantic searches". This is nowadays a vision, but a huge number of people all around the world are working toward its concretization. One of the most tangible results of this effort is the plethora of IR systems specialized today to work on XML data [116,98,27,175,6,61,129,3,52,104,18]. Various approaches have been undertaken for their implementation, but the most promising for flexibility, space/time efficiency and complexity of the supported queries is doubtless the one based on a "native" management of the XML documents via inverted indexes [24,151]. Here the idea is to support structured text queries by indexing (real or virtual) tags as distinct terms, and then answering the queries via complex combinations of searches for words and tags. In this realm of solutions there is a lack of a public, easily usable and customizable repository of algorithms and data structures for indexing and querying XML documents. We are currently working in this direction [78]: at the present time we have a C library, called the XCDE Library (XCDE stands for Xml Compressed Document Engine), that provides a set of algorithms and data structures for indexing and searching an XML document collection in its "native" form. The library offers various features: state-of-the-art algorithms and data structures for text indexing, compressed space occupancy, and novel succinct data structures for the management of the hierarchical structure present in the XML documents.
Currently we are using the XCDE Library to implement a search engine for a collection of Italian literary texts marked up with XML-TEI. The XCDE Library offers a researcher the possibility to investigate and experiment with novel algorithmic solutions for indexing and retrieval without being obliged to rewrite from scratch all the basic procedures which constitute the kernel of any classic IR system.

3 On the full-text indexes

The inverted-indexing scheme, as well as any other word-based indexing method, is well suited to managing text-retrieval queries on linguistic texts, namely texts composed in a natural language or properly structured to allow the identification of the "terms" upon which the user queries will be formulated. Other assumptions are usually made to ensure an effective use of this indexing method: the text has to follow some statistical properties that ensure, for example, small vocabulary size, short words, and queries mostly concerning rare terms and aiming at the retrieval of parts of words or entire phrases. Under these restrictions, which are nonetheless satisfied in many practical user settings, inverted indexes are the method of choice, since they provide efficient query performance, small space usage and cheap construction time, and allow the easy implementation of effective ranking techniques.

Full-text indexes, on the other hand, overcome the limitations of the word-based indexes. They allow any kind of data to be managed and support complex queries spanning arbitrarily long parts of it; they allow statistics to be drawn from the indexed data, as well as many kinds of complex text comparisons and investigations: detecting pattern motifs, auto-repetitions with and without errors, longest-repeated strings, etc.
The full-text indexes may clearly be applied to classical information retrieval, but there they are less adequate than inverted indexes, since their additional power comes at some cost: they are more expensive to build and occupy significantly more space. The real interest in these indexing data structures is motivated by application settings where inverted indexes turn out to be inappropriate, or even unusable: building an inverted index on all the substrings of the indexed data would need quadratic space! The applications we have in mind are: genomic databases (where the data collection consists of DNA or protein sequences), intrusion detection (where the data are sequences of events, or logs of accesses over time), oriental languages (where word delimiters are not so clear), linguistic analysis of text statistics (where the texts are composed of words but the queries require complex statistical elaborations, e.g. to detect plagiarism), XPath queries in XML search engines (where the indexed strings are paths in the hierarchical tree structure of an XML document), and vocabulary implementations to support exact or complex pattern searches (even the inverted indexes might benefit from full-text indexes!).

These fascinating properties and the powerful nature of full-text indexes are the starting points of our discussion. To begin with, we need some notation and definitions. For the inverted indexes we defined as index points the block numbers, word numbers or word starts in the indexed text. In the context of full-text indexes an index point is, instead, any character position or, classically, any position where a text suffix may start. In the case of a text collection, an index point is an integer pair (j, i), where i is the starting position of the suffix in the j-th text of the collection.
In most current applications, an index point is represented using from 3 to 6 bytes, thus being independent of the actual length of the pointed suffix, and characters are encoded as bit sequences, thus allowing the uniform management of arbitrarily large alphabets.

Let Σ be an arbitrarily large alphabet of characters, and let # be a new character larger than any other alphabet character. We denote by lcp(P, Q) the length of the longest common prefix of two strings P and Q, by maxlcp(P, S) the value max{lcp(P, Q) : Q ∈ S}, and by ≤_L the lexicographic order between pairs of strings drawn from Σ. Finally, given a text T[1, n], we denote by SUF(T) the lexicographically ordered set of all suffixes of T. Given a pattern P[1, p], we say that there is an occurrence of P at position i of the text T if P is a prefix of the suffix T[i, n], i.e., P = T[i, i + p − 1]. A key observation is that: searching for the occurrences of a pattern P in T amounts to retrieving all text suffixes that have the pattern P as a prefix. In this respect, the ordered set SUF(T) enjoys an interesting property pointed out by Manber and Myers [121]: the suffixes having prefix P occupy a contiguous part of SUF(T). In addition, the leftmost (resp. rightmost) suffix of this contiguous part follows (resp. precedes) the lexicographic position of P (resp. P#) in the ordered set SUF(T). To perform fast string searches it is then paramount to use a data structure that efficiently retrieves the lexicographic position of a string in the ordered set SUF(T). As an example, let us set T = abababbc and consider the lexicographically ordered set of all text suffixes SUF(T) = {1, 3, 5, 2, 4, 6, 7, 8} (indicated by means of their starting positions in T).
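This running example can be checked with a brute-force construction that simply sorts the suffixes (fine for a sketch; [121] and later work give much more efficient algorithms):

```python
# Sketch: building the lexicographically ordered set SUF(T) by direct sorting.

def suffix_array(T):
    """Return the starting positions (1-based, as in the text) of the suffixes
    of T in lexicographic order."""
    return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])

print(suffix_array("abababbc"))    # -> [1, 3, 5, 2, 4, 6, 7, 8]
```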
If P = ab, its lexicographic position in SUF(T) precedes the first text suffix T[1, 8] = abababbc, whereas the lexicographic position of P# in SUF(T) follows the fifth text suffix T[5, 8] = abbc. By Manber and Myers' observation above, the three text suffixes between T[1, 8] and T[5, 8] in SUF(T) are the only ones prefixed by P, and thus P occurs in T three times, at positions 1, 3 and 5. If instead P = baa, then both P and P# have their lexicographic position in SUF(T) between T[5, 8] = abbc and T[2, 8] = bababbc, so P does not occur in T. The above definitions can be immediately extended to a text collection ∆ by replacing SUF(T) with the set SUF(∆), obtained by merging lexicographically the suffixes in SUF(S) for all texts S ∈ ∆.

3.1 Suffix arrays and suffix trees

The suffix array [121], or PAT-array [84], is an indexing data structure that supports fast substring searches whose cost does not depend on the alphabet size. A suffix array is an array-based implementation of the set SUF(T). In the example above, the suffix array SA equals [1, 3, 5, 2, 4, 6, 7, 8]. The search in T for an arbitrary pattern P[1, p] exploits the lexicographic order present in SA and the two structural observations made above. It first determines the lexicographic position of P in SUF(T) via a binary search with one level of indirection: P is compared against the text suffix pointed to by the examined SA entry. Each pattern-suffix comparison needs O(p) time in the worst case, and thus O(p log n) time suffices for the overall binary search. In our example, at the first step P = ab is compared against the entry SA[4] = 2, i.e. the second suffix of T, and the binary search proceeds within the first half of SA since P ≤_L T[2, 8] = bababbc.
After the lexicographic position of P in SA has been found, the search algorithm scans the suffix array rightward, collecting suffixes as long as they are prefixed by P. This takes O(p · occ) time in the worst case, where occ is the number of occurrences of P in T. In our example, the lexicographic position of P is immediately before the first entry of SA, and there are three suffixes prefixed by P, since P is not a prefix of T[SA[4], 8] = T[2, 8] = bababbc. Of course, the true behavior of the search algorithm depends on how many long prefixes of P occur in T. If there are very few such long prefixes, then it will rarely happen that a pattern–suffix comparison in a binary-search step takes Θ(p) time, and in general the O(p log n) bound is quite pessimistic. On "random" strings this algorithm requires O(p + log n) time. This latter bound can be forced to hold in the worst case too, by adding an auxiliary array, called the Lcp array, and designing a novel search procedure [121]. The Lcp array stores the longest-common-prefix information between any two adjacent suffixes of SUF(T), and thus it has the same length as SA. The novel search procedure still proceeds via a binary search, but now a pattern–suffix comparison does not start from the first character of the compared strings: it takes advantage of the comparisons already executed and of the information available in the Lcp array. However, since practitioners prefer simplicity and space compaction to time-efficiency guarantees, this faster but space-consuming algorithm is rarely used in practice. From a practical point of view, suffix arrays are a very space-efficient full-text indexing data structure, because they store only one pointer per indexed suffix (i.e., usually 3 bytes suffice).
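The binary search with one level of indirection, followed by the rightward scan, can be sketched as follows (plain in-memory Python; the function name `sa_search` is a hypothetical one of ours):

```python
def sa_search(t, sa, p):
    """Find the lexicographic position of p in SUF(t) via binary search
    with one level of indirection (each probe compares p against the
    suffix pointed to by the examined SA entry, O(p) per probe), then
    scan rightward collecting the suffixes prefixed by p."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if t[sa[mid] - 1:] < p:     # probed suffix strictly precedes p
            lo = mid + 1
        else:
            hi = mid
    occ = []                        # rightward scan: O(p * occ) worst case
    while lo < len(sa) and t[sa[lo] - 1:].startswith(p):
        occ.append(sa[lo])
        lo += 1
    return occ

T, SA = "abababbc", [1, 3, 5, 2, 4, 6, 7, 8]
assert sa_search(T, SA, "ab") == [1, 3, 5]
assert sa_search(T, SA, "bab") == [2, 4]
```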
Nonetheless, suffix arrays are pretty much static and, in the case of long text strings, the contiguous space needed for storing them can become too constraining and may induce poor performance in an external-memory setting. In fact, SA can be easily mapped onto disk by stuffing Θ(B) suffix pointers per page [84], but in this case the search bound is O((p/B) log_2 N + occ/B) I/Os, and it is poor in practice because all of these I/Os are random. To remedy this situation, [23] proposed the use of supra-indices over the suffix array. The key idea is to sample one out of b suffix-array entries (usually b = Θ(B), so that one entry per disk page is sampled), and to store the first ℓ characters of each sampled suffix in the supra-index. This supra-index is then used as a first step to reduce the portion of the suffix array where the binary search is performed. Such a reduction impacts favorably on the overall number of random I/Os required by the search operation. Some variations on this theme are possible, of course. For example, the supra-index does not need to sample the suffix-array entries at fixed intervals, and it does not need to copy into memory the same number ℓ of suffix characters from each sampled suffix. Both these quantities might be set according to the text structure and the space available in internal memory for the supra-index. It goes without saying that if the sampled suffixes are chosen to start at word boundaries and entire words are copied into the supra-index, the resulting data structure turns out to be actually an inverted index. This shows the high flexibility of full-text indexing data structures which, for a proper setting of their parameters, eventually boil down to the weaker class of word-based indexes.
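A minimal sketch of the supra-index idea follows. It is our own simplification: the samples are kept as (prefix, position) pairs in memory, and the returned interval merely brackets the portion of SA left to the (external-memory) binary search.

```python
def build_supra(t, sa, b, ell):
    """Sample one SA entry out of b, keeping the first ell characters of
    each sampled suffix together with its position in SA."""
    return [(t[sa[j] - 1 : sa[j] - 1 + ell], j) for j in range(0, len(sa), b)]

def narrowed_range(supra, sa_len, p, ell):
    """Use the in-memory supra-index to restrict the on-disk binary search:
    the samples' prefixes are nondecreasing, so the answer lies between the
    last sample below p's ell-character prefix and the first one above it."""
    key = p[:ell]
    lo, hi = 0, sa_len
    for prefix, j in supra:
        if prefix < key:
            lo = j          # the searched range cannot start before this sample
        elif prefix > key:
            hi = j          # nor extend past this one
            break
    return lo, hi

T, SA = "abababbc", [1, 3, 5, 2, 4, 6, 7, 8]
supra = build_supra(T, SA, 2, 2)
assert supra == [("ab", 0), ("ab", 2), ("ba", 4), ("bc", 6)]
assert narrowed_range(supra, 8, "ba", 2) == (2, 6)   # binary search on SA[2..6) only
```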
On the other extreme, the smaller the sampling step, the larger the memory requirement for the supra-index, and the faster the search operation. Sampling every suffix would be fabulous for query performance, but the quadratic space occupancy would make this approach unaffordable. Actually, if a compacted trie is used to store all the suffixes, we end up with the most famous, elegant, powerful and widely employed [15,88] full-text indexing data structure, known as the suffix tree [128]. Each arc of the suffix tree is labeled with a text substring T[i, j], represented via the triple (T, i, j), and the sibling arcs are ordered according to their first characters, which are distinct (see Figure 2). There are no nodes having only one child, except possibly the root, and each node has associated the string obtained by concatenating the labels found along the downward path from the root to the node itself. By appending the special character # to the text, the leaves are in one-to-one correspondence with the text suffixes: each leaf stores a different suffix, and their rightward scanning actually gives the suffix array. It is an interesting exercise to design an algorithm which goes from the suffix array and the Lcp array to the suffix tree in linear time. Suffix trees are also augmented by means of some special node-to-node pointers, called suffix links [128], which turn out to be crucial for the efficiency of complex searches and updates. The suffix link from a node storing a nonempty string, say aS for a character a, leads to the node storing S, and this node always exists. There can be Θ(|Σ|) suffix links leading to a suffix-tree node, because we can have one suffix link for each possible character a ∈ Σ. Suffix trees require linear space, and are sometimes called generalized suffix trees when built upon a text collection ∆ [10,89].
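The exercise admits a compact solution: sweep the suffixes left to right, maintaining the rightmost root-to-leaf path of the tree under construction on a stack. A sketch follows (Python; the class and function names are ours, and Lcp[i] here stores the plain lcp length of adjacent suffixes):

```python
class Node:
    def __init__(self, depth, suffix=None):
        self.depth = depth        # length of the string spelled out by the node
        self.children = []        # children in lexicographic (left-to-right) order
        self.suffix = suffix      # starting position of the suffix, for leaves

def sa_lcp_to_suffix_tree(n, sa, lcp):
    """Build the suffix tree of a text of length n from its suffix array sa
    (1-based positions) and Lcp array, in linear time overall: each node is
    pushed and popped at most once."""
    root = Node(0)
    leaf = Node(n - sa[0] + 1, sa[0])
    root.children.append(leaf)
    stack = [root, leaf]                    # rightmost root-to-leaf path
    for i in range(1, n):
        l = lcp[i - 1]                      # lcp of suffixes sa[i-1] and sa[i]
        last = None
        while stack[-1].depth > l:          # climb up the rightmost path
            last = stack.pop()
        if stack[-1].depth < l:             # split the edge to `last` at depth l
            node = Node(l)
            stack[-1].children[-1] = node
            node.children.append(last)
            stack.append(node)
        leaf = Node(n - sa[i] + 1, sa[i])
        stack[-1].children.append(leaf)
        stack.append(leaf)
    return root

def leaves(v):
    return [v.suffix] if not v.children else [s for c in v.children for s in leaves(c)]

# T = "abababbc": reading the leaves rightward gives back the suffix array
root = sa_lcp_to_suffix_tree(8, [1, 3, 5, 2, 4, 6, 7, 8], [4, 2, 0, 3, 1, 1, 0])
assert leaves(root) == [1, 3, 5, 2, 4, 6, 7, 8]
assert len(root.children) == 3   # branching characters a, b, c
```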
Suffix trees, and compacted tries in general, are very efficient in searching an arbitrary pattern string, because the search is directed by the pattern itself along a downward tree path starting from the root. This gives a search time proportional to the pattern length, instead of the logarithmic bound which occurred for suffix arrays. Hence searching for the occ occurrences of a pattern P[1, p] as a substring of ∆'s texts requires O(p log |Σ| + occ) time. Inserting a new text T[1, m] into ∆, or deleting an indexed text from ∆, takes O(m log |Σ|) time.

Fig. 2. (a) The suffix tree for the string T = "abababbc". Node v spells out the string 'abab'. The substrings are represented by triples so as to occupy constant space, each internal node stores the length of its associated string, and each leaf stores the starting position of its corresponding suffix. For convenience, (b) illustrates the suffix tree shown in (a) by explicitly writing down the string T[i, j] represented by each triple (T, i, j). The endmarker # is not shown. Reading the leaves rightward we get the suffix array of T.

The structure of a suffix tree is rich in information, so that statistics on text substrings [15] and numerous types of complex queries [88,148] can be efficiently implemented. Since the suffix tree is such a powerful data structure, it would seem appropriate to use it in external memory. To our surprise, however, suffix trees lose their good searching and updating worst-case performance when used for indexing large text collections that do not fit into internal memory. This is due to the following reasons: a.
Suffix trees have an unbalanced topology which is text-dependent, because their internal nodes are in correspondence with some repeated substrings. Consequently, these trees inevitably inherit the drawbacks pointed out in the scientific literature with regard to paging unbalanced trees in external memory. There are some good average-case solutions to this problem that group Θ(B) nodes per page under node insertions only [109, Sect. 6.2.4] (deletions make the analysis extremely difficult [182]), but they cannot avoid storing a downward path of k nodes in Ω(k) distinct pages in the worst case. b. Since the outdegree of a node can be Θ(|Σ|), its pointers to children might not fit into O(1) disk pages, so they would have to be stored in a separate B-tree. This causes an O(log_B |Σ|) disk-access overhead for each branch out of a node, both in searching and in updating operations. c. Branching from a node to one of its children requires further disk accesses in order to retrieve the disk pages containing the substring that labels the traversed arc. d. Updating suffix trees under string insertions or deletions [10,89] requires the insertion or deletion of some nodes in their unbalanced structure. This operation inevitably relies on merging and splitting disk pages in order to occupy Θ(N/B) of them. This approach is very expensive: splitting or merging a disk page can take O(B |Σ|) disk accesses, because Θ(B) nodes can move from one page to another, and the Θ(|Σ|) suffix links leading to each moved node must be redirected, possibly being contained in different pages. Hence we can conclude that, if the text collection ∆ is stored on disk, the search for a pattern P[1, p] as a substring of ∆'s texts takes O(p log_B |Σ| + occ) worst-case disk accesses (according to Points a–c).
Inserting an m-length text into ∆, or deleting an m-length text from ∆, takes O(m B |Σ|) disk accesses in the worst case (there can be Θ(m) page splits or merges, according to Point d). From the point of view of average-case analysis, suffix-tree and compacted-trie performances in external memory are heuristic and usually confirmed by experimentation [14,132,144,59,13]. The best result to date is the Compact PAT-tree [49]. It is a succinct representation of the (binary) Patricia tree [137]; it occupies about 5 bytes per suffix and requires about 5 disk accesses to search for a pattern in a text collection of 100 Mb. The paging strategy proposed to store the Compact PAT-tree on disk is a heuristic that achieves only 40% page occupancy and slow update performance [49]. From the theoretical point of view, pattern searches require O(h √p + log_p N) I/Os, where h is the Patricia tree's height; inserting or deleting a text in ∆ costs at least as much as searching for all of its suffixes individually. Therefore this solution is attractive only in practice, and for static textual archives. Another interesting implementation of suffix trees has been proposed in [112]. Here the space occupancy has been confined between 10 and 20 bytes per text suffix, assuming a text shorter than 2^27 characters.

3.2 Hybrid data structures

Although suffix arrays and compacted tries present good properties, none of them is explicitly designed to work on a hierarchy of memory levels. The simple paging heuristics shown above are not acceptable when dealing with large text collections, which extensively and randomly access the external storage devices for both searching and updating operations. This is the reason why various researchers have tried to properly combine these two approaches in the light of the characteristics of the current hierarchy of memory levels.
The result is a family of hybrid data structures which can be divided into two large subclasses. One subclass contains data structures that exploit the no-longer-negligible size of the internal memory of current computers by keeping two indexing levels: one level consists of a compacted trie (or a variant of it) built on a subset of the text suffixes and stored in internal memory (previously called a supra-index); the other level is just a plain suffix array built over all the suffixes of the indexed text. The trie is used to route the search onto a small portion of the suffix array, by exploiting the efficient random-access time of internal memory; an external-memory binary search is subsequently performed on the restricted part of the suffix array so identified, thus requiring a reduced number of disk accesses. Various approaches to suffix sampling have been introduced in the literature [50,102,144,11], and various trie coding methods have been employed to stuff as many suffixes as possible into internal memory [23,13,59,105]. In all these cases the aim has been to balance the efficient search performance of compacted tries with the small space occupancy of suffix arrays, taking into account the limited space available in internal memory. The result is that: (1) the search time is faster than in suffix arrays (see e.g. [23,11]), but it is still not optimal because of the binary search on disk; (2) the updates are slow because of the external-memory suffix array; and (3) slightly more space is needed because of the internal-memory trie. The second subclass of hybrid data structures has been obtained by properly combining the B-tree data structure [51] with the effective routing properties of suffix arrays, tries or their variants.
An example is the Prefix B-tree [28], which explicitly stores prefixes of the indexed suffixes (or indexed strings) as routing information (called separators) in its internal nodes. This design choice poses some algorithmic constraints. In fact, the updates of Prefix B-trees are complex because of the presence of arbitrarily long separators, which require recalculations and possibly trigger new expansions/contractions of the B-tree nodes. Various works have investigated the splitting of Prefix B-tree nodes when dealing with variable-length keys [28,115], but all of them have been faced with the problem of choosing a proper splitting separator. For these reasons, while B-trees and their basic variants are among the most used data structures for primary-key retrieval [51,109], Prefix B-trees are not a common choice as full-text indices, because their performance is known to be not efficient enough when dealing with arbitrarily long keys or highly dynamic environments.

3.3 The String B-tree data structure

The String B-tree [71] is a hybrid data structure introduced to overcome the limitations and drawbacks of Prefix B-trees. The key idea is to plug a Patricia tree [137] into the nodes of the B-tree, thus providing a routing tool that efficiently drives the subsequent searches and, more importantly, occupies space proportional to the number of indexed strings instead of their total length. The String B-tree achieves optimal search bounds (in the case of an unbounded alphabet) and attractive update performance. In practice it requires a negligible, guaranteed, number of disk accesses to search for an arbitrary pattern string in a large text collection, independent of the character distribution. We now recall the main ideas underlying the String B-tree data structure.
For more theoretical details we refer the reader to [71]; for a practical analysis, to [70] and Section 3.4. String B-trees are similar to B+-trees [51]: the keys are pointers to the strings in SUF(∆) (i.e., to suffixes of ∆'s strings), they reside in the leaves, and some copies of these keys are stored in the internal nodes for routing the subsequent traversals. The order between any two keys is the lexicographic order among the corresponding pointed strings. The novelty of the String B-tree is that the keys in each node are not explicitly stored, so that they may be of arbitrary length. Only the string pointers are kept in the nodes, organized by means of a Patricia tree [137], which ensures a small overhead in routing string searches or updates, and occupies space proportional to the number of indexed strings rather than to their total length. We denote by SBT_∆ the String B-tree built on the text collection ∆, and we adopt two conventions: there is no distinction between a key and its corresponding pointed string; each disk page can contain up to 2b keys, where b = Θ(B) is a parameter depending on the actual space occupancy of a node (this will be discussed in Section 3.4). In detail, the strings of SUF(∆) are distributed among the String B-tree nodes as shown in Figure 3. SUF(∆) is partitioned into groups of at most 2b strings each (except the last group, which may contain fewer strings), and every group is stored in a leaf of SBT_∆ in such a way that the left-to-right scanning of these leaves gives the ordered set SUF(∆) (i.e., the suffix array of ∆). Each internal node π has n(π) children, with b/2 ≤ n(π) ≤ b (except the root, which has from 2 to b children). Node π also stores the string set S_π formed by copying the leftmost and the rightmost strings contained in each of its children.
As a result, the set S_π consists of 2n(π) strings, node π has n(π) = Θ(B) children, and thus the height of SBT_∆ is O(log_B N), where N is the total length of ∆'s strings or, equivalently, the cardinality of SUF(∆). The main advantage of String B-trees is that they support the standard B-tree operations, now on arbitrarily long keys. Since the String B-tree leaves form a suffix array on SUF(∆), the search for a pattern string P[1, p] in SBT_∆ must first identify the lexicographic position of P among the text suffixes in SUF(∆), and thus among the text pointers in the String B-tree leaves. Once this position is known, all the occurrences of P as a substring of ∆'s strings are given by the consecutive pointers to text suffixes which start from that position and have P as a prefix (refer to the observation on suffix arrays in Section 3). Their retrieval takes O((p/B) · occ) I/Os in the case of a brute-force match between the pattern P and the checked suffixes, or the optimal O(occ/B) I/Os if some additional information about the longest-common-prefix length shared by adjacent suffixes is kept in each String B-tree leaf. In the example of Figure 3, the search for the pattern P = "CT" traces a downward path of String B-tree nodes and identifies the lexicographic position of P in the fourth String B-tree leaf (from the left), immediately before the 42nd text suffix. The pattern occurrences are then retrieved by scanning the String B-tree leaves from that position until the 32nd text suffix is encountered, because it is not prefixed by P. The text positions {42, 20, 13, 24, 16} denote the five occurrences of P as a substring of ∆'s texts. Therefore the efficient implementation of string searches in String B-trees boils down to the efficient routing of the pattern search among the String B-tree nodes.
In this respect, it is clear that the way the string set S_π of each traversed node π is organized plays a crucial role. The innovative idea in String B-trees is to use a Patricia tree PT_π to organize the string pointers in S_π [137]. Patricia trees preserve the searching power and properties of compacted tries, although in a reduced space occupancy. In fact, PT_π is a simplified trie in which each arc label is replaced by only its first character. See Figure 4 for an illustrative example. When the String B-tree is traversed downward starting from the root, the traversal is routed by using the Patricia tree PT_π stored in each visited node π. The goal of PT_π is to help find the lexicographic position of the searched pattern P in the ordered set S_π. This search is a little more complicated than the one in classical tries (and suffix trees), because of the presence of only one character per arc label, and in fact it consists of two stages:
– Trace a downward path in PT_π to locate a leaf l which points to an interesting string of S_π. This string does not necessarily identify P's position in S_π (which is our goal), but it provides enough information to find that position in the second stage (see Figure 4). The retrieval of the interesting leaf l is done by traversing PT_π from the root and comparing the characters of P with the single characters which label the traversed arcs, until a leaf is reached or no further branching is possible (in this case, choose l to be any descendant leaf of the last traversed node).
– Compare the string pointed to by l with P in order to determine their longest common prefix. A useful property holds [71]: the leaf l stores one of the strings in S_π that share the longest common prefix with P.
The length ℓ of this common prefix and the mismatch character P[ℓ + 1] are used in two ways: first, to determine the shallowest ancestor of l spelling out a string longer than ℓ; and then, to select the leaf descending from that ancestor which identifies the lexicographic position of P in S_π. An illustrative example of a search in a Patricia tree is shown in Figure 4 for the pattern P = "GCACGCAC". The leaf l found after the first stage is the second one from the right. In the second stage, the algorithm first computes ℓ = 2 and P[ℓ + 1] = A; then, it proceeds along the leftmost path descending from the node u, since the 3rd character on the arc leading to u (i.e., the mismatch character G) is greater than the corresponding pattern character A. The position reached by this two-stage process is indicated in Figure 4, and it is the correct lexicographic position of P among S_π's strings. We remark here that PT_π requires space linear in the number of strings of S_π; therefore its space usage is independent of their total length. Consequently, the number of strings in S_π can be properly chosen so that PT_π fits in the disk page allocated for π. An additional nice property of PT_π is that it allows finding the lexicographic position of P in S_π by exploiting the information available in π's page and by fully comparing P with just one of the strings in S_π. This clearly allows a reduction in the number of disk accesses needed in the routing step. By counting the number of disk accesses required for searching P[1, p] in the strings of ∆, and recalling that ∆'s strings have overall length N, we get the I/O bound O((p/B) log_B N). In fact, SBT_∆ has height O(log_B N), and at each traversed node π we may need to fully compare P against one string of S_π, thus taking O(p/B + 1) disk accesses.
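The first stage, and the maxlcp property it guarantees, can be illustrated on a toy Patricia tree. This is a Python sketch of ours: the string set is hypothetical (it is not the set of Figure 4, though it yields the same ℓ = 2 for P = "GCACGCAC"), and only the blind descent is shown; the second stage then works from ℓ and the mismatch character as described above.

```python
def lcp(a, b):
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def build_patricia(strings):
    """Patricia tree over sorted, distinct, endmarker-terminated strings:
    each internal node keeps only its branching depth (the `len` value),
    arc labels are reduced to their first character (the dict keys)."""
    if len(strings) == 1:
        return {"string": strings[0]}                # leaf
    d = lcp(strings[0], strings[-1])                 # branching depth
    children, start = {}, 0
    for i in range(1, len(strings) + 1):
        if i == len(strings) or strings[i][d] != strings[start][d]:
            children[strings[start][d]] = build_patricia(strings[start:i])
            start = i
    return {"depth": d, "children": children}

def blind_search(node, p):
    """Stage 1: descend comparing only the single branching characters;
    on a dead end, go down to any descendant leaf of the last node."""
    while "string" not in node:
        c = p[node["depth"]] if node["depth"] < len(p) else ""
        node = node["children"].get(c) or next(iter(node["children"].values()))
    return node["string"]

S = ["AAGT#", "AGCC#", "CGTA#", "GCGCA#", "GCGCG#", "GTTA#"]   # already sorted
root = build_patricia(S)
P = "GCACGCAC"
l = blind_search(root, P)
assert l == "GCGCG#"
# the reached leaf shares the longest common prefix with P (here ell = 2)
assert lcp(P, l) == max(lcp(P, s) for s in S) == 2
```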
A further refinement of this idea is possible, though, by observing that we do not necessarily need to compare the two strings, i.e., P and the candidate string of S_π, starting from their first character: we can take advantage of the comparisons executed on the ancestors of π, thus skipping some character comparisons and reducing the number of disk accesses. An incremental accounting strategy allows one to prove that O(p/B + log_B N) disk accesses are indeed sufficient, and this bound is optimal in the case of an unbounded alphabet. A more complete analysis and description of the search and update operations is given in [71], where the following is formally proved:

Theorem 1. String B-trees support the search for all the occ occurrences of an arbitrary pattern P[1, p] in the strings of a set ∆ taking O((p + occ)/B + log_B N) disk accesses, where N is the overall length of ∆'s strings. The insertion or deletion of an m-length string in/from the set ∆ takes O(m log_B (N + m)) disk accesses. The required space is Θ(N/B) disk pages.

As a corollary, we get a result which points out the String B-tree as an effective data structure also for dictionary applications.

Corollary 1. String B-trees support the search for all the occ occurrences of an arbitrary pattern P[1, p] as a prefix of the K strings in a set ∆ taking O((p + occ)/B + log_B K) disk accesses. The insertion or deletion of an m-length string in/from the set ∆ takes O(m/B + log_B K) disk accesses. The space usage of the String B-tree is Θ(K/B) disk pages, whereas the space occupied by the string set ∆ is Θ(N/B) disk pages.
Some authors have successfully used String B-trees in other settings: multi-dimensional prefix-string queries [97], conjunctive boolean queries on two substrings [72], dictionary matching problems [73], distributed search engines [74], indexing of XML texts [54]. All of these applications show the flexibility of this data structure and its efficiency in external memory, and foretell engineered implementations; up to now, however, String B-trees have been confined mainly to the theoretical realm, perhaps because of their space occupancy: the best known implementation uses about 12 bytes per indexed suffix [70]. Given this bottleneck, less I/O-efficient but space-cheaper data structures have been preferred in practice (e.g., supra-indexes [23]). In the next section we try to overcome this limitation by proposing a novel engineered version of String B-trees suitable for practical implementations.

3.4 Engineering the String B-tree

String B-trees have the characteristic that their height decreases exponentially as b's value increases (with N fixed). The value of b is strictly related to the number of strings contained in each node π, because b ≤ |S_π| ≤ 2b. If the disk-page size B increases, we can store more suffixes in S_π. However, since B is typically chosen to be proportional to the size of a disk page, we need a technique that maximizes |S_π| for a fixed disk-page size B. The space occupancy of a String B-tree node π is evaluated as the sum of three quantities: 1. The amount of auxiliary and bookkeeping information necessary to node π. This is practically negligible and, hereafter, it will not be accounted for. 2. The amount of space needed to store the pointers to the children of π. This quantity is absent for the leaves; in the case of internal nodes, a 4-byte pointer usually suffices. 3.
The amount of space required to store the pointers to the strings in S_π and the associated machinery PT_π. This space is highly implementation-dependent, so it deserves an accurate discussion. Let us therefore concentrate on the amount of space required to store S_π and PT_π. This is determined by three kinds of information: (i) the Patricia tree topology; (ii) the integer values kept in the internal nodes of PT_π (denoted by len); and (iii) the pointers to the strings in S_π. The naïve approach to implementing (i–iii) is to use explicit pointers to represent the parent–child relationships in PT_π and the strings in S_π, and to allocate 4 bytes for the len values. Although simple and efficient in supporting search and update operations, this implementation induces an unacceptable space occupancy of about 24 bytes per string of S_π! The literature about space-efficient implementations of Patricia trees is huge, but some "pruning" of known results can be done according to the features of our trie-encoding problem. Hash-based representations of tries [58], although elegant and succinct, can be discarded because they do not have guaranteed performance in time and space, and they are not better than classical tries on small string sets [5,31], as is the case for our S_π sets. List- or array-based implementations of Patricia trees adopting path and/or level compression strategies [13,12,157] are space-consuming and effective mainly on random data. More appealing for our purposes is a recent line of research pioneered by [96] and extended by other authors [143,144,49,107,117] on the succinct encoding of Patricia trees. The main idea is to succinctly encode the Patricia tree topology and then use some other data structures to properly encode the remaining information, like the string pointers (kept in the leaves) and the len values (kept in the internal nodes).
The general policy is therefore to handle the data and the tree structure separately. This makes it possible to compress the plain data using any of the known methods (see e.g. [188]) and, independently, to find an efficient coding method for the tree structure, irrespective of the form and contents of the data items stored in its nodes and leaves. In the original implementation of String B-trees [70], the shape of PT_π was succinctly encoded via two operations, called compress and uncompress. These operations allow one to go from a Patricia tree to a binary sequence, and vice versa, by means of a preorder traversal of PT_π. Although space-efficient and simple, this encoding is CPU-intensive to update or search, so that a small page size of B = 1 kilobyte was chosen in [70] to balance the CPU cost of node compression/uncompression and the I/O cost of the update operations (see [70] for details). Here we propose a novel encoding scheme that, surprisingly, throws away the Patricia tree topology, keeps just the string pointers and the len values, and is still able to support pattern searches in a constant number of I/Os per visited String B-tree node. As a result, the asymptotic I/O bounds stated in Theorem 1 still hold, with a significant space improvement in the constants hidden in the big-Oh notation. The starting point is the beautiful result of [69], which we briefly recall here. Let us be given a lexicographically ordered array of string pointers, called SP, and the array of the longest common prefixes shared by strings adjacent in SP, called Lcp. We can look at SP and Lcp as the sequence of string pointers and len values encountered in an inorder traversal of the Patricia tree PT_π stored in a given String B-tree node π.
Now, assume that we wish to route the search for a pattern P[1, p] through node π; we then need to find the lexicographic position of P in SP, since it indexes S_π. We might implement that search via the classical binary-search procedure on suffix arrays, within a logarithmic number of I/Os (see Section 3.1). The result in [69] shows instead that it is enough to execute only one string access, a few more Θ(p + k) bit comparisons, and one full scan of the arrays Lcp and SP. Of course this new algorithm is unaffordable on large arrays, but this is not our context of application: the string set S_π actually consists of a few thousand items (stored in one disk page), and the arrays SP and Lcp reside in memory when the search is performed (i.e., the disk page has been fetched). Hence the search is I/O-cheap, in that it requires just one sequential string access; it is CPU-effective, because the array scan can benefit from the read-ahead policy of the internal cache; and it is space-efficient, because it avoids the storage of PT_π's topology. Let us therefore detail the search algorithm, which assumes a binary pattern P and consists of two phases (see [69] for the uneasy proof of correctness). In the first phase, the algorithm scans the array SP rightward and inductively keeps x as the position of P in this array (initially x = 0). At a generic step i it computes ℓ = Lcp[i], the mismatching position between the two adjacent strings SP[i] and SP[i + 1]. Notice that the ℓth bit of the string SP[i] is surely 0, whereas the ℓth bit of the string SP[i + 1] is surely 1, because the strings are binary and lexicographically ordered. Hence the algorithm sets x = i + 1 and increments i if P[ℓ] = 1; otherwise (i.e., P[ℓ] = 0), it leaves x unchanged and increments i until it meets an index i such that Lcp[i] < ℓ.
Actually, in this latter case the algorithm is jumping over all the succeeding strings which have the ℓ-th bit set to 1 (since P[ℓ] = 0). The first phase ends when i reaches the end of SP; it is possible to prove that SP[x] is one of the strings in SP sharing the longest common prefix with P. In the illustrative example of Figure 5, we have P = "GCACGCAC", with its characters coded in binary; the first phase ends by computing x = 4. The second phase of the search algorithm starts by computing the length ℓ′ of the longest common prefix between P and the candidate string SP[x]. If SP[x] = P then it stops; otherwise the algorithm starts from position x a backward scan of SP if P[ℓ′ + 1] = 0, or a forward scan if P[ℓ′ + 1] = 1. This scan searches for the lexicographic position of P in SP and proceeds until the position x′ such that Lcp[x′] < ℓ′ is met. The searched position lies between the two strings SP[x′] and SP[x′ + 1]. In the example of Figure 5, we have ℓ′ = 4 (in bits) and P[5] = 0 (the first bit of A's binary code); hence SP is scanned backward from SP[4] for just one step, since Lcp[3] = 0 < 4 = ℓ′. This is the correct position of P among the strings indexed by SP. Notice that the algorithm needs to access the disk just to fetch the string SP[x] and compare it against P. Hence O(p/B) I/Os suffice to route P through the String B-tree node π. An incremental accounting strategy, as the one devised in [71], allows one to prove that we can skip some character comparisons and therefore require O((p + occ)/B + log_B N) I/Os to search for the occ occurrences of a pattern P[1, p] as a substring of ∆'s strings. Preliminary experiments have shown that searching a few thousand strings via this approach takes about 200 µs, which is negligible compared to the 5,000 µs required by a single I/O on modern disks.
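The two phases above can be rendered as the following in-memory Python sketch, on an invented toy instance (0-based indices; in the real data structure SP holds string pointers and only the single access to SP[x] touches the disk). For simplicity we assume no indexed string is a proper prefix of the pattern or vice versa, as happens when strings carry a unique terminator:

```python
def locate(P, SP, lcp):
    """Find the lexicographic position of binary pattern P among the
    sorted binary strings SP, inspecting only the lcp values plus ONE
    string of SP.  lcp[i] = |lcp(SP[i], SP[i+1])|.
    Returns (insertion position, exact_match)."""
    n = len(SP)
    # Phase 1: rightward scan; x tracks a string sharing the
    # longest common prefix with P.
    x, i = 0, 0
    while i < n - 1:
        m = lcp[i]                  # mismatch bit: SP[i][m]='0', SP[i+1][m]='1'
        if m < len(P) and P[m] == '1':
            x = i + 1
            i += 1
        else:                       # P has '0' there: skip strings with bit m = 1
            i += 1
            while i < n - 1 and lcp[i] >= m:
                i += 1
    # Phase 2: the ONE string access computes l = lcp(P, SP[x]).
    l = 0
    while l < min(len(P), len(SP[x])) and P[l] == SP[x][l]:
        l += 1
    if l == len(P) == len(SP[x]):
        return x, True              # exact match
    if l < len(P) and P[l] == '1':  # P lies to the right of SP[x]
        y = x
        while y < n - 1 and lcp[y] >= l:
            y += 1
        return y + 1, False
    y = x - 1                       # P lies to the left of SP[x]
    while y >= 0 and lcp[y] >= l:
        y -= 1
    return y + 1, False
```

For example, with SP = ["0010", "0100", "0101", "1000", "1011"] and lcp = [1, 3, 0, 2], the pattern "0110" is routed to insertion position 3, between "0101" and "1000".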
Furthermore, the incremental search sometimes even avoids the I/Os needed to access SP[x]! Some improvements to this idea are still possible, both in time and in space. First, we can reduce the CPU-time of search and update operations by adopting a sort of supra-index on SP, defined as follows. We decompose the array SP (and hence Lcp) into sub-arrays of size Θ(log² |SP|). The rightmost string of each sub-array is stored in a pointer-based Patricia tree. This way, the (sampled) Patricia tree is used to determine the sub-array containing the position of the searched pattern; then the search procedure above is applied to that sub-array to find the correct position of P within it. The overall time complexity is O(p) to traverse the Patricia tree, and O(p + log² |SP|) to explore the reached sub-array. Notice also that only two strings in SP are accessed on disk. The data structure is dynamic, and every insertion or deletion of an m-length string takes O(m + log² |SP|) time and only two string accesses to the disk. The resulting data structure turns out to be simple, its construction from scratch is fast, and thus split/merge operations on String B-tree nodes should be effective if PT_π is implemented in this way. We point out that, due to the sequential access to the array Lcp, a further space saving is possible. We can compactly encode the entries of the array Lcp by representing only their differences. Namely, we use a novel array Skip in which each value denotes the difference between two consecutive Lcp entries (i.e., Skip[i] = Lcp[i] − Lcp[i − 1], see Figure 5). Various experimental studies on the distribution of the skips over standard text collections have shown that most of them (about 90% [177]) are small, and thus they are suitably represented via variable-length codes [49,132].
We suggest the use of the continuation-bit code, described in Section 2, because of two facts: the string sampling at the internal nodes of SBT_∆ and the results in [177] lead us to conjecture small skips, and thus one-byte codes for them; furthermore, this coding scheme is simple to program, induces byte-aligned codes, and hence is CPU efficient. We conclude this section by observing that up to now we have assumed the text collection ∆ to be fixed. In a real-life context, we should expect that new texts are added to the collection and old texts are removed from it. While handling deletions is not really a problem, as we have a plethora of tools inherited from standard B-trees, implementing the addition of a new text requires decidedly new techniques. This asymmetry between deletion and insertion is better understood if we observe that the insertion of a new text T[1, m] into ∆ requires the insertion of all of its m suffixes {T[1, m], T[2, m], ..., T[m, m]} into the lexicographically ordered set SUF(∆). Consequently, the dominant cost is due to the comparison of all characters in each text suffix, which may sum up to Θ(m²). Since T can be as large as m = 10⁶ characters (or even more), the rescanning of the text characters might be a computational bottleneck. On the other hand, the deletion of a text T[1, m] from ∆ consists of a sequence of m standard deletions of T's suffix pointers, and hence can exploit standard B-tree techniques. The approach proposed in [71] to avoid the "rescanning" in text insertion is mainly theoretical in flavor, and considers an augmented String B-tree where some pointers are added to its leaves. The counterpart of this I/O improvement is that a larger space occupancy is needed and, when rebalancing the String B-tree, the redirection of some of these additional pointers may cause the execution of random I/Os.
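A minimal sketch of the Skip encoding with a continuation-bit byte code follows (Python; our rendition, not the paper's code). Since consecutive Lcp entries can also decrease, the sketch first folds the sign of each difference into the low bit (a zigzag step, which is our addition) and then emits 7 payload bits per byte, using the high bit to flag that more bytes follow; small skips thus cost exactly one byte:

```python
def encode_skips(lcp):
    """Delta-encode the Lcp array: Skip[i] = Lcp[i] - Lcp[i-1],
    each skip stored as a continuation-bit (varint-style) byte code."""
    out, prev = bytearray(), 0
    for v in lcp:
        d, prev = v - prev, v
        u = 2 * d if d >= 0 else -2 * d - 1   # zigzag: sign into low bit
        while True:
            b = u & 0x7F
            u >>= 7
            if u:
                out.append(b | 0x80)          # continuation bit: more bytes follow
            else:
                out.append(b)
                break
    return bytes(out)

def decode_skips(buf, n):
    """Inverse transform: rebuild the n Lcp values from the byte stream."""
    lcp, prev, pos = [], 0, 0
    for _ in range(n):
        u = shift = 0
        while True:
            b = buf[pos]
            pos += 1
            u |= (b & 0x7F) << shift
            shift += 7
            if not (b & 0x80):
                break
        d = u // 2 if u % 2 == 0 else -(u + 1) // 2
        prev += d
        lcp.append(prev)
    return lcp
```

With mostly small skips the stream uses one byte per entry, matching the conjecture above; occasional large skips simply spill over into further bytes.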
Therefore, it is questionable whether this approach is really attractive from a practical point of view. Starting from these considerations, [70] proposed an alternative approach based on a batched insertion of the m suffixes of T. This approach exploits the LRU buffering strategy of the underlying operating system and proves effective in the case of a large m. In the case of a small m, a different approach must be adopted, based on the suffix-array merging procedure presented in [84]: a suffix array SA is built for T, together with its Lcp array; the suffix array SA_∆ on the suffixes in SUF(∆) is instead derived from the leaves of SBT_∆ within O(N/B) I/Os. The merge of SA and SA_∆ (and their corresponding Lcp arrays) gives the new set of String B-tree leaves; the internal nodes are then constructed within O(N/B) I/Os via the simple approach devised in Section 3.3. Even if the merging of the two suffix arrays can be dramatically slow in theory, since every suffix comparison might require one disk access, the character distribution of real text collections makes the Lcp arrays very helpful and allows most of the suffix comparisons to be solved in practice without accessing the disk. A thorough experimentation of these approaches is still needed to validate such empirical considerations.

3.5 String B-tree construction

The efficient construction of full-text indexes on very large text collections is a hot topic: "We have seen many papers in which the index simply 'is', without discussion of how it was created. But for an indexing scheme to be useful it must be possible for the index to be constructed in a reasonable amount of time, ....." [193]. The construction phase may be, in fact, a bottleneck that can prevent these powerful indexing tools from being used even in medium-scale applications.
Known construction algorithms are very fast when employed on textual data that fits in the internal memory of computers [121,165,112,124], but their performance immediately degrades when the text size becomes so large that the texts must be arranged on (slow) external storage devices. In the previous section we addressed the problem of updating the String B-tree under the insertion/deletion of a single text. Obviously those algorithms cannot be adopted to construct from scratch the String B-tree over a largely populated text collection, because they would incur an enormous number of random I/Os. In this section we first describe an efficient algorithm to build the suffix array SA_∆ for a text collection ∆ of size N, and then present a simple algorithm which derives the String B-tree SBT_∆ from this array in O(N/B) I/Os. For further theoretical and experimental results on this interesting topic we refer the reader to [66,55,165,84].

How to build SA_∆. As shown in [55], the most attractive algorithm for building large suffix arrays is the one proposed in [84], because it requires only 4 bytes of working space per indexed suffix, it accesses the disk mostly in a sequential manner, and it is very simple to program. For simplicity of presentation, let us assume that all the texts in ∆ are concatenated into one single long text T of length N, and let us concentrate on the construction of the suffix array SA_T of T. The transformation from SA_T to SA_∆ is easy and left to the reader as an exercise. The algorithm computes the suffix array SA_T incrementally in Θ(N/M) stages. Let ℓ < 1 be a positive constant fixed below, and set a parameter m = ℓM which, for the sake of presentation, divides N. This parameter denotes the size of the text pieces loaded in memory at each stage.
The algorithm maintains at each stage the following invariant: At the beginning of stage h, with h = 1, 2, ..., N/m, the algorithm has stored on the disk an array SA_ext containing the sequence of the first (h − 1)m suffixes of T, ordered lexicographically and represented via their starting positions in T. During the h-th stage, the algorithm incrementally updates SA_ext by properly inserting into it the text suffixes which start in the substring T[(h − 1)m + 1, hm]. This preserves the invariant above, thus ensuring that after all the N/m stages, we have SA_ext = SA_T. We are therefore left with showing how the generic h-th stage works. In the h-th stage, the text substring T[(h − 1)m + 1, hm] is loaded into internal memory, and the suffix array SA_int containing only the suffixes starting in that text substring is built. Then, SA_int is merged with the current SA_ext in two steps, with the help of a counter array C[1, m + 1]:

1. The text T is scanned rightwards and the lexicographic position p_i of each text suffix T[i, N], with 1 ≤ i ≤ (h − 1)m, is determined in SA_int via a binary search. The entry C[p_i] is then incremented by one unit in order to record the fact that T[i, N] lexicographically lies between the SA_int[p_i − 1]-th and the SA_int[p_i]-th suffix of T.

2. The information kept in the array C is employed to quickly merge SA_int with SA_ext: entry C[j] indicates how many consecutive suffixes in SA_ext follow the SA_int[j − 1]-th text suffix and precede the SA_int[j]-th text suffix. This implies that a simple disk scan of SA_ext is sufficient to perform the merging process.

At the end of these two steps, the invariant on SA_ext has been properly preserved, so that h can be incremented and the next stage can start correctly. Some comments are in order at this point.
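The staged scheme can be simulated in memory as follows (Python sketch; the "disk" array SA_ext becomes a list, and the parameter m plays the role of ℓM; function and variable names are ours):

```python
def build_sa(T, m):
    """Staged suffix-array construction: at each stage the suffixes of
    the next text piece of length m are sorted in 'memory' (sa_int)
    and merged into the on-'disk' array sa_ext via a counter array."""
    n = len(T)
    sa_ext = []                              # invariant: sorted suffixes seen so far
    for start in range(0, n, m):
        piece = range(start, min(start + m, n))
        sa_int = sorted(piece, key=lambda i: T[i:])
        # Step 1: C[p] = number of old suffixes lying just before sa_int[p]
        C = [0] * (len(sa_int) + 1)
        for i in sa_ext:
            lo, hi = 0, len(sa_int)
            while lo < hi:                   # binary search of T[i:] in sa_int
                mid = (lo + hi) // 2
                if T[sa_int[mid]:] < T[i:]:
                    lo = mid + 1
                else:
                    hi = mid
            C[lo] += 1
        # Step 2: one sequential scan of sa_ext merges the two arrays
        merged, q = [], 0
        for p, s in enumerate(sa_int):
            merged += sa_ext[q:q + C[p]]     # bucket of old suffixes before sa_int[p]
            merged.append(s)
            q += C[p]
        merged += sa_ext[q:]
        sa_ext = merged
    return sa_ext
```

The counting pass is correct because sa_ext is globally sorted, so the old suffixes of each bucket occupy a contiguous interval of it and the sequential consumption in Step 2 preserves their order.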
It is clear that the algorithm proceeds mainly by executing two disk scans: one is performed to load the text piece T[(h − 1)m + 1, hm] into internal memory, the other to merge SA_int and SA_ext via the counter array C. However, the algorithm might incur many I/Os: either when SA_int is built or when the lexicographic position p_i of each text suffix T[i, N] within SA_int has to be determined. In both cases, we may need to compare a pair of text suffixes which share a long prefix not entirely available in internal memory (i.e., extending beyond T[(h − 1)m + 1, hm]). In the pathological case T = a^N, the comparison between two text suffixes takes O(N/M) bulk I/Os, so that: O(N log₂ m) bulk I/Os are needed to build SA_int; the computation of C takes O(hN log₂ m) bulk I/Os; whereas O(h) bulk I/Os are needed to merge SA_int with SA_ext. No random I/Os are executed, and thus the global number of bulk I/Os is O((N³ log₂ M)/M²). The total space occupancy is 4N bytes for SA_ext and 8m bytes for both C and SA_int, plus m bytes to keep T[(h − 1)m + 1, hm] in internal memory (the value of ℓ is derived consequently). The merging step can be easily implemented using some extra space (indeed 4N additional bytes are sufficient), or by employing just the space allocated for SA_int and SA_ext via a trickier implementation. Since the worst-case number of total I/Os is cubic, a purely theoretical analysis would classify this algorithm as not very interesting. But there are some considerations that are crucial to shed new light on it, and to look at this algorithm from a different perspective.
First of all, we must observe that in practical situations it is very reasonable to assume that each suffix comparison finds in internal memory all the (usually, constantly many) characters needed to compare the two involved suffixes. Consequently, the practical behavior is more reasonably described by the formula O(N²/M²) bulk I/Os. Additionally, in the analysis above all I/Os are sequential, and the actual number of random seeks is O(N/M) (i.e., at most a constant number per stage). Consequently, the algorithm takes full advantage of the large bandwidth of current disks and of the high CPU speed of the processors [162,164]. Moreover, the reduced working space facilitates the prefetching and caching policies of the underlying operating system; and finally, a careful look at the algebraic calculations shows that the constants hidden in the big-Oh notation are very small. A recent result [55] has also shown how to make the algorithm no longer questionable in theoretical terms, by proposing a modification that achieves efficient performance in the worst case.

From SA_∆ to SBT_∆. The construction of SA_∆ can be coupled with the computation of the array Lcp_∆ containing the sequence of longest-common-prefix lengths (lcp) between any pair of adjacent suffixes. Given these two arrays, the String B-tree for the text collection ∆ can be easily derived by proceeding in a bottom-up fashion. We split SA_∆ into groups of about 2b suffix pointers each (a similar splitting is adopted on the array Lcp_∆) and use them to form the leaves of the String B-tree. This requires scanning SA_∆ and Lcp_∆ once. For each leaf π we have its string set S_π and its sequence of lcps, so that the construction of the Patricia tree PT_π takes linear time and no I/Os.
After the leaf level of the String B-tree has been constructed, we proceed to the next higher level by determining new string and lcp sequences. For this, we scan the leaf level rightward and take the leftmost string L(π) and the rightmost string R(π) from each leaf π. This gives the new string sequence, whose length is a factor Θ(1/B) smaller than the sequence of strings stored in the leaf level. Each pair of adjacent strings is either an L(π)/R(π) pair or an R(π)/L(π′) pair (derived from consecutive leaves π and π′). In the former case, the lcp of the two strings is obtained by taking the minimum of all the lcps stored in π; in the latter case, the lcp is directly available in the array Lcp_∆, since R(π) and L(π′) are contiguous there. After the two new sequences of strings and lcps have been constructed, we repeat the partitioning process above, thus forming a new level of internal nodes of the String B-tree. The process continues for O(log_B N) iterations, until the string sequence has length smaller than 2b; at that point the root of the String B-tree is formed and the construction process stops. The implementation is quite standard and not fully detailed here. Preliminary experiments [70] have shown that the time taken to build a String B-tree from its suffix array is negligible with respect to the time taken for the construction of the suffix array itself. Hence we refer the reader to [55] for the latter timings. We conclude this section by observing that if we aim for optimal I/O-bounds, then we have to resort to a suffix-tree construction method [66] explicitly designed to work in external memory.
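The per-level computation just described can be sketched as follows (Python, with explicit strings instead of pointers and a hypothetical node capacity b; it relies on the standard fact that the lcp of two strings in a sorted sequence equals the minimum of the lcps of the adjacent pairs between them):

```python
def next_level(strings, lcps, b):
    """One bottom-up round of String B-tree construction: partition the
    sorted string sequence into nodes of up to b strings each, and derive
    the string/lcp sequences of the level above without re-reading the
    strings.  lcps[i] = |lcp(strings[i], strings[i+1])|."""
    up_strings, up_lcps = [], []
    for start in range(0, len(strings), b):
        end = min(start + b, len(strings))
        L, R = strings[start], strings[end - 1]
        if up_strings:
            # R(prev)/L(cur) pair: the two strings are adjacent in the
            # lower sequence, so their lcp is directly available
            up_lcps.append(lcps[start - 1])
        up_strings.append(L)
        if end - 1 > start:
            # L/R pair of the same node: minimum of the lcps inside it
            up_lcps.append(min(lcps[start:end - 1]))
            up_strings.append(R)
    return up_strings, up_lcps
```

Iterating `next_level` until fewer than 2b strings remain yields the root, mirroring the O(log_B N) rounds of the construction.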
That algorithm is too sophisticated to be detailed here; we therefore refer the reader to the corresponding literature and just point out that the two arrays SA_∆ and Lcp_∆ can be obtained from the suffix tree by means of an inorder traversal. It can be shown that all these steps require sorting and sequential disk-scan procedures, thus accounting for overall O((N/B) log_{M/B} (N/B)) I/Os [66].

3.6 String vs suffix sorting

The construction of full-text indexes involves the sorting of the suffixes of the indexed text collection. Since a suffix is a string of arbitrary length, we would be driven to conclude that suffix sorting and string sorting are "similar" problems. This is not true because, intuitively, the suffixes participating in the sorting process share such long substrings that some I/Os may possibly be saved when comparing them, and indeed this saving can be achieved, as shown theoretically in [66]. Conversely, [17] showed that sorting strings on disk is not nearly as simple as it is in internal memory, and introduced a collection of sophisticated, deterministic string-sorting algorithms which achieve I/O-optimality under some conditions on the string-comparison model. In this section we present a simpler randomized algorithm that comes close to the I/O-optimal complexity, and surprisingly matches the O(N/B) linear I/O-bound under some reasonable conditions on the problem parameters. Let K be the number of strings to be sorted, arbitrarily long, and let N be their total length. For the sake of presentation, we introduce the notation n = N/B, k = K/B and m = M/B. Since algorithms do exist that match the Ω(K log₂ K + N) lower bound for string sorting in the comparison model, it seems reasonable to expect that the complexity of sorting strings in external memory is Θ(k log_m k + n) I/Os.
But any naïve algorithm does not even come close to meeting this I/O-bound. In fact, in internal memory a trie data structure suffices to achieve the optimal complexity; whereas in external memory the use of the powerful String B-tree achieves O(K log_B K + n) I/Os. The problem here is that strings have variable length, and their brute-force comparison over the sorting process may induce a lot of I/Os. We aim at speeding up the string comparisons, and we achieve this goal by shrinking the long strings via a hashing of some of their pieces. Since hashing does not preserve the lexicographic order, we will orchestrate the selection of the string pieces to be hashed with a carefully designed sorting process, so that the correct sorted order can eventually be computed. Details follow; see Figure 6 for the pseudocode of this algorithm. We illustrate the behavior of the algorithm on a running example and then sketch a proof of its correctness. Let S be a set of six strings, each of length 10. In Figure 8 these strings are drawn vertically, divided into pieces of L = 2 characters each. The hash function used to assign names to the L-pieces is depicted in Figure 7. We remark that L ≫ 2 log₂ K in order to ensure, with high probability, that the names of the (at most 2K) mismatching L-pieces are different. Our setting L = 2 is only to simplify the presentation. Figure 8 illustrates the execution of Steps 1–4: from the naming of the L-pieces to the sorting of the c-strings and finally to the identification of the mismatching names. We point out that each c-string in C actually has associated with it a pointer to the corresponding string of S, which is depicted in Figure 8 below every table; this pointer is exploited in the last Step 8 to derive the sorted permutation of S from the sorted table T.
Looking at Figure 8(iii), we note with interest that C differs from the sorted set S (in C the 4th string of S precedes its 5th string!), and this is due to the fact that the names do not, of course, reflect the lexicographic order of their original string pieces. The subsequent steps of the algorithm are then designed to take care of this apparent disorder, by driving the c-strings to their correctly ordered positions. Step 6 builds the logical table T by substituting marked names with their ranks (assigned in Step 5 and detailed in Figure 7), and the other names with zeros. Of course this transformation is lossy, because we have lost many c-string characters (e.g. the piece bc, which was not marked); nonetheless we will show below that the canceled characters would not have been compared in sorting the strings of S, so that their eviction has no impact on the final sorting step. Figure 9(i-ii) shows how the forward and backward scanning of table T fills some of its entries that got zeros in Step 6. In particular, Step 7(a) does not change table T, whereas Step 7(b) changes the first two columns. The resulting table T is finally sorted to produce the correct sequence of string pointers 5,3,1,6,2,4 (Figure 9(iii)). As far as the I/O-complexity is concerned, we let sort(η, µ) denote the I/O-cost of sorting η strings of total length µ via multiway Mergesort; actually sort(η, µ) = O((µ/B) log_m (µ/B)). Since the string set S is sequentially stored on disk, Steps 1–2 take O(n) I/Os. Step 3 sorts K c-strings of total length N′ = Θ((N/L)·2 log₂ K + K), where the second additive term accounts for those strings which are shorter than L, thus requiring sort(K, N′) I/Os. Step 4 marks two names per c-string, so Step 5 requires sort(2K, 2KL) I/Os. Table T consists of K columns of total length N′ bits.
Hence, the forward and backward scanning of Step 7 takes O(N′/B) I/Os. Sorting the columns of table T takes sort(K, N′) I/Os in Step 8. Summing up, we have:

Theorem 2. The randomized algorithm detailed in Figure 6 sorts K strings of total length N in sort(K, N′ + 2KL) + n expected I/Os. By setting L = Θ(log_m n · log₂ K), the cost is O(n + k (log_m n)² log₂ K) I/Os. Moreover, if K ≤ N/(log²_m n · log₂ K), i.e. the average string length is polylogarithmic in n, then the total sorting cost results in the optimal O(n) I/Os.

It goes without saying that if one replaced the mismatching names with their original L-pieces (instead of their ranks), one would still get the correct lexicographic order, but would possibly end up with the same I/O-cost as classical mergesort: in the worst case, Step 7 would expand all entries of T, thus reverting to a string set of size N! The argument underlying the proof of correctness of this algorithm is non-trivial. The key point is to prove that, given any pair of strings in S, the corresponding columns of T (i.e. c-strings of C) contain enough information after Step 7 that the column comparison in Step 8 reflects their correct lexicographic order. For simplicity we assume the use of a perfect hash function, so that different L-pieces get different names in Step 2. Let α and β be any two c-strings of C, and assume that they agree up to the i-th name (included). After C is sorted (Step 3), α and β are possibly separated by some c-strings which satisfy the following two properties: (1) all these c-strings agree at least up to their i-th name; (2) at least two adjacent c-strings among them disagree at their (i + 1)-th name.
According to Step 6 and Property (1), the columns in T corresponding to α and β will initially get zeros in their first i entries; according to Step 6 and Property (2), at least two columns between α's and β's will get a rank value in their (i + 1)-th entry. The leftmost of these ranks equals the rank of the (i + 1)-th name of α; the rightmost equals the rank of the (i + 1)-th name of β. After Step 7, the first i entries of α's and β's columns will be filled with equal values, and their (i + 1)-th entries will contain two distinct ranks which correctly reflect the two L-pieces occupying the corresponding positions. Hence the comparison executed in Step 8 between these two columns gives the correct lexicographic order between the two original strings. Of course this argument holds for any pair of c-strings in C, and thus overall for all the columns of T. We can then conclude that the string permutation derived in Step 8 is the correct one.

3.7 Some open problems and future research directions

An important advantage of String B-trees is that they are a variant of B-trees, and consequently most of the technological advances and know-how acquired on B-trees can be smoothly applied to them. For example, split and merge strategies ensuring good page-fill ratios, node buffering techniques to speed up search operations, B-tree distribution over multi-disk systems, as well as adaptive overflow techniques to defer node splitting and B-tree re-organization, can be applied to String B-trees without any significant modification. Surprisingly enough, there are no publicly available implementations of the String B-tree, although some software systems are based on it [54,97,110]. The novel ideas presented in this paper foretell an engineered, publicly available implementation of this data structure.
In particular, it would be worthwhile to design a library for full-text indexing of large text collections based on the String B-tree data structure. This library should be designed to follow the API of Berkeley DB [181], thus facilitating its use in well-established applications. The String B-tree could also be adopted as the main search engine for genomic databases, thus competing with the numerous results based on suffix trees that have recently appeared in the literature [88,46,103,133,126]. Another setting where an implementation of the String B-tree could find successful use is the indexing of the tagged structure of an XML document. Recent results [52,47,4] adopt a Patricia tree or a suffix tree to solve and/or estimate the selectivity of structural queries on XML documents. However, they are forced either to summarize the trie structure, in order to fit it into internal memory, or to propose disk-paging heuristics, in order to achieve reasonable performance. Unfortunately these proposals [52] overlook the advancements in the string-matching literature and thus inevitably incur the well-known I/O bottleneck discussed in depth in Section 3.1. Of course String B-trees might be successfully used here to manage, in an I/O-efficient manner, the arbitrarily long XML paths into which an XML document can be parsed, as well as to provide better caching behavior for the in-memory implementations. The problem of multi-dimensional substring search, i.e. the search for the simultaneous occurrence of k substrings, also deserves some attention. The approach proposed in [72] provides some insights into the nature of two-dimensional queries, but what can we say about multiple dimensions? Can we combine the String B-tree with some known multi-dimensional data structure [172,86] in order to achieve guaranteed worst-case bounds?
Or, can we design a full-text index which allows proximity queries between two substrings [120,72]? More study should be devoted to this important subject because of its ubiquitous applications to databases, data mining and search engines. When dealing with word-based indexes, we addressed the document listing problem: given a word-based query w, find all the documents in the indexed collection that contain w. Conversely, when dealing with full-text indexes, we addressed the occurrence listing problem: given an arbitrary pattern string P, find all the document positions where P occurs. Although more natural from an application-specific point of view, the document listing problem has surprisingly received little attention from the algorithmic community in the area of full-text indexes, so that efficient (optimal) solutions are still missing for many of its variants. Some papers [127,145] have recently initiated the study of challenging variations of the document listing problem and solved them via simple and efficient algorithms. Improving these approaches, as well as extending these results to multiple-pattern queries and to the external-memory setting, turns out to be a stimulating direction of research. Exact searches are just one side of the coin, probably the tool with the narrowest range of application! The design of search engines for approximate or similarity string searches is becoming more urgent because of the doubtless theoretical interest and the numerous applications in the fields of genomic databases, audio/video collections and textual databases in general. Significant biological breakthroughs have already been achieved in genome research based on the analysis of similar genetic sequences, and the algorithmic field is overflowing with results in this setting [148].
However, most of these similarity-based or approximate-matching algorithms require the whole scan of the data collection, which is very costly in the presence of a large amount of string data and user queries. Indexes for approximate, or similarity, searches turn out to be the holy grail of the Information Retrieval field. Several proposals have appeared in the literature, and it would be impossible to comment on the specialties of, or even list, all of them. Just to give an idea, a search for "(approximate OR similarity) AND (index OR search)" returned on Altavista more than 500,000 matches. To guide ourselves in this jungle of proposals we state the following consideration: "no index is yet known which efficiently routes the search to the correct positions where an approximate/similar string occurrence lies". Most of the research effort has been devoted to designing filters: they transform the approximate/similarity pattern search into another string or geometric query problem for which efficient data structures are known. The transformation is of course "not perfect", because it introduces some false positive matches that must then be filtered out via a (costly) scan-based algorithm. The more filtration is achieved by the index, the smaller is the part on which the approximate/similarity scan-based search is applied, and the faster is the overall algorithmic solution. The key point therefore lies in the design of a good distance-preserving transformation. Some approaches transform the approximate search into a set of q-gram exact searches, then solved via known full-text indexes [185,40,155,100,160,41]. Other approaches map a string onto a multi-dimensional integral point via a wavelet-based transformation, and then use multi-dimensional geometric structures to solve the transformed query [103].
Recently a more sophisticated distance-preserving transformation has been introduced in [146,53] which maps a string into a binary vector such that the Hamming distance between two of these vectors provides a provably good approximation of the (block) edit distance between the two original strings. This way an efficient approximate nearest-neighbor data structure (see e.g. [95,113]) can be used to search over these multi-dimensional vectors and achieve guaranteed good average-case performance. Notice that this solution applies to whole strings; its practical performance has been tested over genomic data in [147]. It goes without saying that in the plethora of results about complex pattern searches a special place is occupied by the solutions based on suffix trees [88,152,126,93]. The suffix-tree structure is well suited to performing regular-expression, approximate or similarity-based searches, but at an average-time cost which may be exponential in the pattern length or polynomial in the text length [148]. Although some recent papers [93,171,126] have investigated the effectiveness of those results on genomic databases, their usefulness remains limited due to the I/O bottlenecks incurred by the suffix tree both in the construction phase and in its space occupancy (see Section 3.1). Perhaps the adaptation of these complex searching algorithms to the String B-tree might make these approaches appealing also from a practical perspective. As a final remark, we mention that the techniques for designing filtering indexes are not limited to genomic or textual databases, but may be used to extend the search functionalities of relational and object-oriented databases, e.g. to support approximate string joins [85]. This shows a new interesting direction of research for pattern-matching tools.
In Section 2.1 we addressed the problem of caching inverted indexes for improving their query time under biased operations. This issue is challenging over all the indexing schemes, and it becomes particularly difficult in the case of full-text indexes because of their complicated structure. For example, in the case of a suffix tree its unbalanced tree structure asks for an allocation of its nodes to disk pages, usually called packing, that optimizes the cache performance for some pattern of accesses to the tree nodes. This problem has been investigated in [83], where an algorithm is presented that finds an optimal packing with respect to both the total number of different pages visited in the search and the number of page faults incurred. It is also shown that finding an optimal packing which also minimizes the space occupancy is, unfortunately, NP-complete, and an efficient approximation algorithm is presented. These results deal with a static tree, so it would be interesting to explore the general situation in which the distribution of the queries is not known in advance, changes over time, and new strings are inserted into or deleted from the indexed set. A preliminary insight on this challenging question has been achieved in [48]. There a novel self-adjusting full-text index for external memory has been proposed, called SASL, based on a variant of the Skip List data structure [161]. Usually a skip list is turned into a self-adjusting data structure by promoting the accessed items up its levels and demoting certain other items down its levels [62,141,130]. However, all of the known approaches fail to work effectively in an external-memory setting because they lack locality of reference and thus elicit a lot of random I/Os.
A technical novelty of SASL is a simple randomized demotion strategy that, together with a doubly-exponential grouping of the skip-list levels, guides the demotions and guarantees locality of reference in all the updating operations; this way, frequent items remain at the highest levels of the skip list with high probability, and effective I/O-bounds are achieved in expectation both for the search and the update operations. SASL furthermore ensures balancedness without explicit weights on the data structure; its update algorithms are simple and guarantee a good use of disk space; in addition, SASL is with high probability no worse than String B-trees on the search operations, but can be significantly better if the sequence of queries is highly skewed or changes over time (as most transaction sequences do in practice). Using SASL over a sequence of m string searches S_{i_1}, S_{i_2}, ..., S_{i_m} takes O( Σ_{j=1}^{m} ⌈|S_{i_j}|/B⌉ + Σ_{i=1}^{n} n_i log_B (m/n_i) ) expected I/Os, where n_i is the number of times the string S_i is queried. The first term is a lower bound for scanning the query strings; the second term is the entropy of the query sequence and is a standard information-theoretic lower bound. This is actually an extension of the Static Optimality Theorem to external-memory string access [180].
In the last few years a number of models and techniques have been developed in order to make it easier to reason about multi-level hierarchies [186]. Recently the elegant cache-oblivious model has been introduced in [80]: it assumes a two-level view of the computer memory but allows one to prove results for an unknown multilevel memory hierarchy. Cache-oblivious algorithms are designed to achieve good memory performance on all levels of the memory hierarchy, even though they avoid any memory-specific parameterization. Several basic problems, e.g.
matrix multiplication, FFT, sorting [80,36], have been solved optimally, and irregular and dynamic problems have recently been addressed and solved via efficient cache-oblivious data structures [29,37,30]. In this research flow the design of a cache-oblivious trie turns out to be challenging, and we feel that it would probably shed new light on the indexing problem: it is not clear how to guarantee cache obliviousness in a setting where items are arbitrarily long and the size of the disk page is unknown.

4 Space-time tradeoff in index design

A leitmotiv of the previous sections has been the following: Inverted indexes occupy less space than full-text indexes but are limited to efficiently supporting poorer search operations. This is a frequent statement in text-indexing papers and talks, and it has driven many authors to conclude that the increased query power of full-text indexes has to be paid for by additional storage space. Although this observation is frequent and apparently established, it is challenging to ask ourselves whether it is provable that such a tradeoff does exist when designing an index. In this context compression appears as an attractive tool because it allows not only to squeeze the space occupancy but also to improve the computing speed. Indeed "space optimization is closely related to time optimization in a disk memory" [109] because it allows a better use of the fast and small memory levels close to the CPU (i.e. the L1 or L2 caches), reduces the disk accesses, virtually increases the disk bandwidth, and comes at a negligible cost because of the significant speed of current CPUs.
It is therefore not surprising that IBM has recently installed on the eServers x330 a novel memory chip (based on the Memory eXpansion Technology [94]) that stores data in compressed form, thus ensuring a performance similar to the one achieved by a server with double the real memory but, of course, at a much lower cost. All these considerations have driven developers to state that it is more economical to store data in compressed form than uncompressed, so that a renewed interest in compression techniques has arisen within the algorithmic and IR communities. We have already discussed in Section 2 the use of compression in word-based index design; now we address the impact of compression on full-text index design.
Compression may of course operate at the text level or at the index level, or both. The simplest approach consists of compressing the text via a lexicographic-order-preserving code [92] and then building a suffix array upon it [138]. The improvement in space occupancy is however negligible, since the index is much larger than the text. A more promising and sophisticated direction was initiated in [143,144] with the aim of compressing the full-text index itself. These authors showed how to build a suffix-tree based index on a text T[1,n] within n log₂ n + O(n) bits of storage which supports the search for a pattern P[1,p] in O(p + occ) worst-case time. This result stimulated active research on succinct encodings of full-text indexes that ended up with a breakthrough [87], in which it was shown that a succinct suffix array can be built within Θ(n) bits and can support pattern searches in O(p log₂ n + occ log^ε n) time, where ε is an arbitrarily small positive constant. This result has shown that the apparently "random" permutation of the text suffixes can be succinctly coded in optimal space in the worst case [60].
In [168,169] extensions and variations of this result, e.g. to an arbitrarily large alphabet, have been considered. The above index, however, uses space linear in the size of the indexed collection and therefore is not yet competitive against word-based indexes, whose space occupancy is usually o(n) (see Section 2). Real text collections are compressible, and thus a full-text index should desirably exploit the repetitiveness present in them to squeeze its space occupancy via a much more succinct coding of the suffix pointers.
The first step toward the design of a truly compressed full-text index ensuring effective search performance in the worst case has recently been pursued in [75]. The novelty of this approach resides in the careful combination of the Burrows-Wheeler compression algorithm [42] with the suffix array data structure, thus designing a sort of compressed suffix array. It is actually a self-indexing tool because it encapsulates a compressed version of the original text inside the compressed suffix array. Overall we can say that the index is opportunistic in that, although no assumption on a particular text distribution is made, it takes advantage of the compressibility of the indexed text by decreasing the space occupancy at no significant slowdown in the query performance. More precisely, the index in [75] occupies O(n H_k(T)) + o(n) bits of storage, where H_k(T) is the k-th order empirical entropy of the indexed text T, and supports the search for an arbitrary pattern P[1,p] as a substring of T in O(p + occ log^ε n) time. In what follows we sketch the basic ideas underlying the design of this compressed index, hereafter called FM-index [75], and we briefly discuss some experimental results [77,76] on various text collections.
These experiments show that the FM-index is compact (its space occupancy is close to the one achieved by the best known compressors), it is fast in counting the number of pattern occurrences, and the cost of their retrieval is reasonable when they are few (i.e. in the case of a selective query). As a further contribution, we briefly mention an interesting adaptation of the FM-index to word-based indexing, called WFM-index. This result highlights further the interplay between compression and index design, as well as the recent plot between word-based and full-text indexes: both of these worlds must be deeply understood in order to perform valuable research in this topic.

4.1 The Burrows-Wheeler transform

Let T[1,n] denote a text over a finite alphabet Σ. In [42] Burrows and Wheeler introduced a new compression algorithm based on a reversible transformation, now called the Burrows-Wheeler Transform (BWT from now on). The BWT permutes the input text T into a new string that is easier to compress. The BWT consists of three basic steps (see Figure 10): (1) append to the end of T a special character # smaller than any other text character; (2) form a logical matrix M whose rows are the cyclic shifts of the string T#, sorted in lexicographic order; (3) construct the transformed text L by taking the last column of M. Notice that every column of M, hence also the transformed text L, is a permutation of T#. In particular the first column of M, call it F, is obtained by lexicographically sorting the characters of T# (or, equally, the characters of L). The transformed string L usually contains long runs of identical symbols and therefore can be efficiently compressed using move-to-front coding, in combination with statistical coders (see for example [42,68]).
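The three steps above can be sketched naively as follows (a didactic rendition: real BWT computation sorts suffixes via a suffix array rather than materializing all n cyclic shifts):

```python
def bwt(text, end_marker="#"):
    """Burrows-Wheeler transform by the three steps in the text:
    (1) append '#', (2) sort all cyclic shifts of T# (the rows of the
    logical matrix M), (3) take the last column L.  Returns (L, F)."""
    s = text + end_marker
    rows = sorted(s[i:] + s[:i] for i in range(len(s)))   # matrix M
    first = "".join(r[0] for r in rows)   # column F: sorted chars of T#
    last = "".join(r[-1] for r in rows)   # column L: the transformed text
    return last, first
```

For T = mississippi this yields L = ipssm#pissii, whose runs of equal symbols are precisely what move-to-front coding followed by a statistical coder exploits.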
4.2 An opportunistic index

There is a bijective correspondence between the rows of M and the suffixes of T (see Figure 10); thus there is a strong relation between the string L and the suffix array built on T [121]. This is a crucial observation for the design of the FM-index. We recall below the basic ideas underlying the search operation in the FM-index, referring for the other technical details to the seminal paper [75]. In order to simplify the presentation, we distinguish between two search tools: the counting of the number of pattern occurrences in T, and the retrieval of their positions. The counting is implemented by exploiting two nice structural properties of the matrix M: (i) all suffixes of T prefixed by a pattern P[1,p] occupy a contiguous set of rows of M (see also Section 3.1); (ii) this set of rows has starting position first and ending position last, where first is the lexicographic position of the string P among the ordered rows of M. The value (last − first + 1) accounts for the total number of pattern occurrences. For example, in Figure 10 for the pattern P = si we have first = 9 and last = 10, for a total of two occurrences. The retrieval of the rows first and last is implemented by the procedure get_rows, which takes O(p) time in the worst case, working in p constant-time phases numbered from p down to 1 (see the pseudocode in Fig. 11). Each phase preserves the following invariant: At the i-th phase, the parameter "first" points to the first row of M prefixed by P[i,p] and the parameter "last" points to the last row of M prefixed by P[i,p]. After the final phase, first and last delimit the rows of M containing all the text suffixes prefixed by P.
The location of a pattern occurrence is found by means of algorithm locate.
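Before turning to locate, the counting machinery just described can be sketched as follows (our own naive rendition of the get_rows idea: Occ is computed by rescanning L instead of by the constant-time auxiliary structures of [75], and row intervals are half-open and 0-based):

```python
from bisect import bisect_left

def count_occurrences(L, pattern):
    """Backward search over the BWT string L (last column of M):
    p phases, from the last pattern character to the first; each phase
    maintains the interval [first, last) of rows of M prefixed by the
    current pattern suffix.  Returns the number of occurrences."""
    sorted_chars = sorted(L)          # column F
    def C(c):                         # text characters smaller than c
        return bisect_left(sorted_chars, c)
    def occ(c, i):                    # occurrences of c in L[0:i] (naive)
        return L[:i].count(c)
    first, last = 0, len(L)
    for c in reversed(pattern):
        first = C(c) + occ(c, first)
        last = C(c) + occ(c, last)
        if first >= last:             # empty interval: no occurrence
            return 0
    return last - first
```

For the BWT of mississippi# (L = ipssm#pissii) and P = si this returns 2, matching the example in the text.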
Given an index i, locate(i) returns the starting position in T of the suffix corresponding to the i-th row of M. For example, in Figure 10 we have pos(3) = 8 since the third row ippi#mississ corresponds to the suffix T[8,11] = ippi. The basic idea for implementing locate(i) is the following. We logically mark a suitable subset of the rows of M, and for each marked row j we store the starting position pos(j) of its corresponding text suffix. As a result, if locate(i) finds the i-th row marked then it immediately returns its position pos(i); otherwise, locate uses the so-called LF-computation to move to the row corresponding to the suffix T[pos(i) − 1, n]. Actually, the index of this row can be computed as LF[i] = C[L[i]] + Occ(L[i], i), where C[c] is the number of occurrences in T of the characters smaller than c, and Occ(c, i) is the number of occurrences of c in the prefix L[1, i]. The LF-computation is iterated v times until we reach a marked row i_v for which pos(i_v) is available; we can then set pos(i) = pos(i_v) + v. Notice that the LF-computation considers text suffixes of increasing length, until the corresponding marked row is encountered.
Given the appealing asymptotic performance and structural properties of the FM-index, the authors have investigated in [77,76] its practical behavior by performing an extensive set of experiments on various text collections: the 1992 CIA world fact book (shortly world) of about 2Mb, the King James Bible (shortly bible) of about 4Mb, a DNA sequence (shortly e.coli) of about 4Mb, SGML-tagged texts of AP-news (shortly ap90) of about 65Mb, the Java documentation (shortly jdk13) of about 70Mb, and the Canterbury Corpus (shortly cantrbry) of about 3Mb.
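The LF-computation and the marking scheme just described can be sketched as follows (again a naive 0-based illustration: `marked` is a hypothetical dictionary mapping some rows to their text positions, and real implementations compute LF on the fly via C and Occ rather than storing the whole array):

```python
from bisect import bisect_left

def lf_array(L):
    """LF[i] = C[L[i]] + Occ(L[i], i): the row of M whose suffix starts
    one position earlier in the text than the suffix of row i."""
    sorted_chars = sorted(L)
    seen = {}                      # occurrences of each char so far
    lf = []
    for ch in L:
        lf.append(bisect_left(sorted_chars, ch) + seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1
    return lf

def locate(i, marked, lf):
    """Iterate the LF-computation v times until a marked row i_v is
    met, then return pos(i) = pos(i_v) + v."""
    v = 0
    while i not in marked:
        i, v = lf[i], v + 1
    return marked[i] + v
```

With L = ipssm#pissii and only the row of the full text mississippi# marked (row 5, text position 0, counting from 0), the LF chain recovers e.g. position 7 for the row of the suffix ippi#.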
On these files they actually experimented with two different implementations of the FM-index:
– A tiny index designed to achieve high compression but supporting only the counting of the pattern occurrences.
– A fat index designed to support both the counting and the retrieval of the pattern occurrences.
Both the tiny and the fat indexes consist of a compressed version of the input text plus some additional information used for pattern searching. In Table 1 we report a comparison among these compressed full-text indexes, gzip (the standard Unix compressor) and bzip2 (the best known compressor based on the BWT [176]). These figures have been derived from [76,77]. The experiments show that the tiny index takes significantly less space than the corresponding gzip-compressed file, and for all files except bible and cantrbry it takes less space than bzip2. This may appear surprising since bzip2 is also based on the BWT [176]. The explanation is simply that the FM-index computes the BWT of the entire file whereas bzip2 splits the input into 900Kb blocks. This compression improvement is paid for in terms of speed: the construction of the tiny index takes more time than bzip2. The experiments also show that the fat index takes slightly more space than the corresponding gzip-compressed file. For what concerns the query time, both the tiny and the fat index compute the number of occurrences of a pattern in a few milliseconds, independently of the size of the searched file. Using the fat index one can also compute the position of each occurrence in a few milliseconds per occurrence.

File                    bible  e.coli  world  cantbry  jdk13   ap90
tiny index
  Compr. ratio          21.09   26.92  19.62    24.02   5.87  22.14
  Construction time      2.24    2.19   2.26     2.21   3.43   3.04
  Decompression time     0.45    0.49   0.44     0.38   0.42   0.57
  Ave. count time         4.3    12.3    4.7      8.1    3.2    5.6
fat index
  Compr. ratio          32.28   33.61  33.23    46.10  17.02  35.49
  Construction time      2.28    2.17   2.33     2.39   3.50   3.10
  Decompression time     0.46    0.51   0.46     0.41   0.43   0.59
  Ave. count time         1.0     2.3    1.5      2.7    1.3    1.6
  Ave. locate time        7.5     7.6    9.4      7.1   21.7    5.3
bzip2
  Compression ratio     20.90   26.97  19.79    20.24   7.03  27.36
  Compression time       1.16    1.28   1.17     0.89   1.52   1.16
  Decompression time     0.39    0.48   0.39     0.31   0.28   0.43
gzip
  Compr. ratio          29.07   28.00  29.17    26.10  10.79  37.35
  Compression time       1.74   10.48   0.87     5.04   0.39   0.97
  Decompression time     0.07    0.07   0.06     0.06   0.04   0.07

Table 1. Compression ratio (percentage) and compression/decompression speed (microseconds per input byte) of the tiny and fat indexes compared with those of gzip (run with option -9 for maximum compression) and bzip2. For the compressed indexes we also report the average time (in milliseconds) for the count and locate operations. The experiments were run on a machine equipped with GNU/Linux Debian 2.2, a 600MHz Pentium III and 1Gb of RAM.

These experiments show that the FM-index is compact (its space occupancy is close to the one achieved by the best known compressors), it is fast in counting the number of pattern occurrences, and the cost of their retrieval is reasonable when they are few (i.e. in the case of a selective query). In addition, the FM-index allows one to trade space occupancy for search time by choosing the amount of auxiliary information stored in it. As a result the FM-index combines compression and full-text indexing: like gzip and bzip2 it encapsulates a compressed version of the original file; like suffix trees and arrays it allows one to search for arbitrary patterns by looking only at a small portion of the compressed file.

4.3 A word-based opportunistic index

As long as user queries are formulated on arbitrary substrings, the FM-index is an effective and compact search tool.
In the information retrieval setting, though, user queries are commonly word-based, since they are formulated on entire words or on their parts, like prefixes or suffixes. In these cases the FM-index suffers from the same drawbacks as classical full-text indexes: at any word-based query formulated on a pattern P, it needs a post-processing phase which aims at filtering out the occurrences of P which are not word occurrences because they lie entirely inside a text word. This mainly consists of checking whether an occurrence of P, found via the get_rows operation, is preceded and followed by a non-word character. In the presence of frequent query patterns such a filtering process may be very time consuming, thus slowing down the overall query performance. This effect is more dramatic when the goal is to count the occurrences of a word, or when we just need to check whether a word occurs or not in the indexed text. Starting from these considerations, the FM-index has been enriched with some additional information concerning the linguistic structure of the indexed text. The new data structure, called WFM-index, is actually obtained by building the FM-index onto a "digested" version of the input text. This digested text, shortly DT, is a special compressed version of the original text T that allows us to map word-based queries on T onto substring queries on DT. More precisely, the digested text DT is obtained by compressing the text T with the byte-aligned and tagged Huffword algorithm described in Section 2 (see [153]). This way DT is a byte sequence which possesses a crucial property: Given a word w and its corresponding tagged codeword cw, we have that w occurs in T iff cw occurs in DT. The tagged codewords are in some sense self-synchronizing at the byte level because the most significant bit of their first byte is set to 1.
In fact, it is not possible that a byte-aligned codeword overlaps two or more other codewords, since it would then have at least one internal byte with its most significant bit set to 1. Similarly, it is not possible that a codeword is byte-aligned and starts inside another codeword, because the latter would again have at least one internal byte with its most significant bit set to 1. Such a bijection allows us to convert every word-based query formulated on a pattern w and the text T into a byte-aligned substring query formulated on the tagged codeword cw, relative to w, and the digested text DT. Of course more complicated word queries on T, like prefix-word or suffix-word queries, can be translated into multiple substring queries on DT as follows. Searching for the occurrences of a pattern P as a prefix of a word in T consists of three steps: (1) search in the Huffword dictionary D for all the words prefixed by P, say w_1, w_2, ..., w_k; (2) compute the tagged codewords cw_1, cw_2, ..., cw_k for these words; and then (3) search for the occurrences of the cw_i in the digested text DT. Other word-based queries can be similarly implemented. It is natural to use an FM-index built over DT to support the codeword searches over the digested text. Here the FM-index takes as characters of the indexed text DT its constituent bytes. This approach has a twofold advantage: it reduces the space occupied by the (digested) byte sequence DT and supports over DT effective searches for byte-aligned substrings (i.e. codewords). The WFM-index therefore consists of two parts: a full-text index FM-index(D) built over the Huffword dictionary D, and a full-text index FM-index(DT) built over the digested text DT.
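The tagging trick can be illustrated on a toy digested text (the dictionary and codewords below are invented for the example; real Huffword codewords come from a byte-oriented Huffman construction [153]): because the MSB of a codeword's first byte is 1 while the MSB of its internal bytes is 0, a search for cw in DT can only hit genuine codeword occurrences.

```python
def tag(payload):
    """Build a tagged codeword from 7-bit payload bytes: set the MSB
    of the first byte to 1, keep it 0 in all the internal bytes."""
    b = bytearray(payload)
    assert all(x < 0x80 for x in b), "payload must use 7 bits per byte"
    b[0] |= 0x80
    return bytes(b)

def digest(words, codeword_of):
    """Digested text DT: the concatenation of the tagged codewords."""
    return b"".join(codeword_of[w] for w in words)

def count_word(DT, cw):
    """Count w's occurrences in T by counting cw in DT; the tag rules
    out matches overlapping or starting inside other codewords."""
    n, i = 0, DT.find(cw)
    while i != -1:
        n, i = n + 1, DT.find(cw, i + 1)
    return n
```

For instance, with the (hypothetical) codewords the → 81, cat → 82 03, mat → 83, the word sequence (the, cat, the, mat) digests to the bytes 81 82 03 81 83: searching for mat's codeword 83 does not match the internal byte 03 of cat, exactly the property the WFM-index exploits.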
The former index is used to search for the queried word (or for its variants) in the dictionary D; from the retrieved words we derive the corresponding (set of) codewords, which are then searched in DT via FM-index(DT). Hence a single word-based query on T can be translated by the WFM-index into a set of exact substring queries to be performed by FM-index(DT). The advantage of the WFM-index over the standard FM-index should be apparent. Queries are word-oriented, so the time-consuming post-processing phase has been avoided; counting or existential queries are directly executed on the (small) dictionary D without even accessing the compressed file; the overall space occupancy is usually smaller than the one required by the FM-index because D is small and DT has a lot of structure that can be exploited by the Burrows-Wheeler compressor present in the WFM-index. This approach needs further experimental investigation and engineering, although some preliminary experiments have shown that the WFM-index is very promising.

4.4 Some open problems and future research directions

In this section we have discussed the interplay between data compression and indexing. The FM-index is a promising data structure which combines effective space compression and efficient full-text queries. Recently, the authors of [75] have shown that another compressed index does exist that, based on the BWT and the Lempel-Ziv parsing [192], answers arbitrary pattern queries in O(p + occ) time and occupies O(n H_k(T) log^ε n) + o(n) bits of storage. Independently, [150] has presented a simplified compressed index that does not achieve these good asymptotic bounds but could be suitable for practical implementation.
The main open problem left in this line of research is the design of a data structure which achieves the best of the previous bounds: O(p + occ) query time and O(n H_k(T)) + o(n) bits of storage occupancy. However, in our opinion, the most challenging question is whether, and how, locality of reference can be exploited in these data structures to achieve efficient I/O-bounds. We aim at obtaining O(occ/B) I/Os for the location of the pattern occurrences, where B is the disk-page size. In fact, the additive term O(p) I/Os is negligible in practice because any user query is commonly composed of few characters. Conversely, occ might be large and thus force the locate procedure to execute many random I/Os in the case of a large indexed text collection. An I/O-conscious compressed index might compete successfully against the String B-tree data structure (see Section 3.3).
The Burrows-Wheeler transform plays a central role in the design of the FM-index. Its computation relies on the construction of the suffix array of the string to be compressed; this is the actual algorithmic bottleneck for a fast implementation of this compression algorithm. Although a plethora of papers have been devoted to engineering the suffix-sorting step [42,174,68,176,165,156,31], there is still room for improvement [124] and investigation. Any advancement in this direction would immediately impact the compression-time performance of bzip2. As far as the compression ratio of bzip2 is concerned, we point out that the recent improvements presented in the literature are either limited to special data collections or not fully validated [43,44,166,167,68,26,25]. Hence the open-source software bzip2 still remains the choice [176].
Further study, simplification or variation of the Burrows-Wheeler transform is needed to improve its compression ratio and/or possibly impact the design of new compressed indexes. The approach followed in the WFM-index is an example of this line of research.
Although we have explained in the previous sections how to perform simple exact searches, full-text indexes can do much more. In Section 3.1 we have mentioned that suffix trees can support complex searches like approximate or similarity-based matches, as well as regular-expression searches. It is also well known that suffix arrays can simulate any algorithm designed on suffix trees at an O(log n) extra-time penalty. This slowdown is paid for by the small space occupied by the suffix array. It is clear at this point that it should be easy to adapt these algorithms to work on the FM-index or on the WFM-index. The resulting search procedures might benefit from the compactness of these indexes, and therefore possibly turn into in-memory computations some (e.g. genomic) computations which now require the use of disk, with consequent poor performance. This line of research has been pioneered in the experimental setting by [170], which showed that compressed suffix arrays can be used as a filtering data structure to speed up similarity-based searches on large genomic databases. From the theoretical point of view, [56] recently proposed another interesting use of compression for speeding up similarity-based computations in the worst case. There the dynamic-programming matrix has been divided into variable-sized blocks, as induced by the Lempel-Ziv parsing of both strings [192], and the inherent periodic nature of the strings has been exploited to achieve O(n²/log n) time and space complexity.
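For concreteness, the classic suffix-array pattern search mentioned above (binary searches replacing a suffix-tree descent at an O(log n) factor) can be sketched as follows (naive construction and eager key extraction, for illustration only):

```python
from bisect import bisect_left, bisect_right

def suffix_array(text):
    """Positions of the suffixes of text in lexicographic order
    (naive O(n^2 log n) construction, fine for illustration)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def sa_search(text, sa, pattern):
    """All occurrences of pattern: the suffixes prefixed by it form a
    contiguous interval of the suffix array, found by binary search."""
    p = len(pattern)
    # Conceptual: real code compares suffixes lazily inside the search.
    keys = [text[i:i + p] for i in sa]
    lo = bisect_left(keys, pattern)
    hi = bisect_right(keys, pattern)
    return sorted(sa[lo:hi])
```

The same contiguous-interval property is the one the FM-index exploits via backward search, which is why adapting suffix-array algorithms to it is natural.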
It would be interesting to combine these ideas with the ones developed for the FM-index in order to reduce the space requirements of these algorithms without impairing their sub-quadratic time complexity (which is conjectured in [56] to be close to optimal).
The FM-index can also be used as a building block of sophisticated Information Retrieval tools. In Section 2 we have discussed the block-addressing scheme as a promising approach to index moderately sized textual collections, and presented some approaches combining compression and block-addressing for achieving better performance [122,153]. In these approaches opportunistic string-matching algorithms have been used to perform searches on the compressed blocks, thus achieving an improvement of about 30-50% in the final performance. The FM-index and WFM-index naturally fit in this framework because they can be used to index each text block individually [75]; this way, at query time, the compressed indexes built over the candidate blocks can be employed to speed up the detection of the pattern occurrences. It must be noted here that this approach fully exploits one of the positive properties of the block-addressing scheme: The vocabulary allows one to turn complex searches on the indexed text into multiple exact-pattern searches on the candidate text blocks. These are precisely the types of searches efficiently supported by the FM-index and WFM-index. A theoretical investigation using a model generally accepted in Information Retrieval [21] has shown in [75] that this approach achieves both sublinear space overhead and sublinear query time independent of the block size. Conversely, inverted indices achieve only the second goal [188], and the classical Glimpse tool achieves both goals but under some restrictive conditions on the block size [21].
Algorithmic engineering and further experiments on this novel IR system are still missing and worth pursuing to validate these good theoretical results.

5 Conclusions

In this survey we have focused our attention on algorithmic and data-structural issues arising in two aspects of the design of information retrieval systems: (1) representing textual collections in a form which is suitable for efficient searching and mining; (2) designing algorithms to build these representations in reasonable time and to perform effective searches and processing operations over them. Of course this is not the whole story about a field as huge as Information Retrieval. We therefore conclude this paper by citing other important aspects that would deserve further consideration: (a) file structures and database maintenance; (b) ranking techniques and clustering methods for scoring and improving query results; (c) computational linguistics; (d) user interfaces and models; (e) distributed retrieval issues, as well as security and access-control management. Every one of these aspects has been the subject of thousands of papers and surveys! We content ourselves with citing here just some good starting points from which a reader can browse for further technical deepening and bibliographic links [188,22,123,1].

Acknowledgments

This survey is the outcome of hours of highlighting and of, sometimes hard and fatiguing, discussions with many fellow researchers and friends. It encapsulates some results which have already seen the light in various papers of mine; some other scientific results, detailed in the previous pages, are however yet unpublished and will probably remain in this state! So I would like to point out the persons who participated in the discovery of those ideas.
The engineered version of String B-trees (Section 3.4) has been devised in collaboration with Roberto Grossi; the randomized algorithm for string sorting in external memory (Section 3.6) is a joint result with Mikkel Thorup; finally, the WFM-index (Section 4.3) is a recent advancement achieved together with Giovanni Manzini. I finally thank Valentina Ciriani and Giovanni Manzini for carefully reading and commenting on the preliminary versions of this survey.

References

1. Home Page of ACM's Special Interest Group on information retrieval, http://info.sigir.acm.org/sigir/.
2. The XML home page at the WWW Consortium, http://www.w3.org/XML/.
3. Abiteboul, S., Quass, D., McHugh, J., Widom, J., and Wiener, J. L. The Lorel query language for semistructured data. International Journal on Digital Libraries 1, 1 (1997), 68–88.
4. Aboulnaga, A., Alameldeen, A. R., and Naughton, J. F. Estimating the selectivity of XML path expressions for Internet scale applications. In Proc. of the International Conference on Very Large Data Bases (2001), pp. 591–600.
5. Acharya, A., Zhu, H., and Shen, K. Adaptive algorithms for cache-efficient tries. In Proc. of the Workshop on Algorithm Engineering and Experimentation (1999), Lecture Notes in Computer Science vol. 1619, Springer Verlag, pp. 296–311.
6. Aguilera, V., Cluet, S., Veltri, P., Vodislav, D., and Wattez, F. Querying XML documents in Xyleme. In Proc. of the ACM-SIGIR Workshop on XML and Information Retrieval (2000), http://www.haifa.il.ibm.com/sigir00-xml/.
7. Anh, V., and Moffat, A. Compressed inverted files with reduced decoding overhead. In Proc. of the ACM-SIGIR Conference on Research and Development in Information Retrieval (1998), pp. 290–297.
8. Ailamaki, A., DeWitt, D., Hill, M., and Wood, D. DBMSs on a modern processor: where does time go? In Proc.
of the International Conference on Very Large Data Bases (1999), pp. 266–277.
9. Amir, A., Benson, G., and Farach, M. Let sleeping files lie: Pattern matching in Z-compressed files. Journal of Computer and System Sciences 52, 2 (1996), 299–307.
10. Amir, A., Farach, M., Idury, R., La Poutré, J., and Schäffer, A. Improved dynamic dictionary matching. Information and Computation 119, 2 (1995), 258–282.
11. Andersson, A., Larsson, N. J., and Swanson, K. Suffix trees on words. In Proc. of the Symposium on Combinatorial Pattern Matching (1996), Lecture Notes in Computer Science vol. 1075, Springer Verlag, pp. 102–115.
12. Andersson, A., and Nilsson, S. Improved behaviour of tries by adaptive branching. Information Processing Letters 46, 6 (1993), 295–300.
13. Andersson, A., and Nilsson, S. Efficient implementation of suffix trees. Software–Practice and Experience 25, 3 (1995), 129–141.
14. Aoe, J.-I., Morimoto, K., Shishibori, M., and Park, K.-H. A trie compaction algorithm for a large set of keys. IEEE Transactions on Knowledge and Data Engineering 8, 3 (1996), 476–491.
15. Apostolico, A. The myriad virtues of suffix trees. In Combinatorial Algorithms on Words (1985), NATO Advanced Science Institutes vol. 12, Series F, Springer Verlag, pp. 85–96.
16. Araújo, M., Navarro, G., and Ziviani, N. Large text searching allowing errors. In Proc. of the Workshop on String Processing (1997), Carleton University Press, pp. 2–20.
17. Arge, L., Ferragina, P., Grossi, R., and Vitter, J. On sorting strings in external memory (extended abstract). In Proc. of the ACM Symposium on Theory of Computing (1997), pp. 540–548.
18. Azagury, A., Factor, M., and Mandler, B. XMLFS: An XML-aware file system. In Proc. of the ACM-SIGIR Workshop on XML and Information Retrieval (2000), http://www.haifa.il.ibm.com/sigir00-xml/.
19. Baeza-Yates, R., Moffat, A., and Navarro, G.
Searching large text collections. In Handbook of Massive Data Sets, Kluwer Academic, 2000.
20. Baeza-Yates, R., and Navarro, G. Block addressing indices for approximate text retrieval. In Proc. of the International Conference on Information and Knowledge Management (1997), pp. 1–8.
21. Baeza-Yates, R., and Navarro, G. Block addressing indices for approximate text retrieval. Journal of the American Society for Information Science 51, 1 (2000), 69–82.
22. Baeza-Yates, R., and Ribeiro-Neto, B. Modern Information Retrieval. Addison-Wesley, 1999.
23. Baeza-Yates, R. A., Barbosa, E. F., and Ziviani, N. Hierarchies of indices for text searching. Information Systems 21, 6 (1996), 497–514.
24. Baeza-Yates, R. A., and Navarro, G. Integrating contents and structure in text retrieval. SIGMOD Record 25, 1 (1996), 67–79.
25. Balkenhol, B., and Kurtz, S. Universal data compression based on the Burrows-Wheeler transformation: Theory and practice. IEEE Transactions on Computers 49, 10 (2000), 1043–1053.
26. Balkenhol, B., Kurtz, S., and Shtarkov, Y. M. Modification of the Burrows and Wheeler data compression algorithm. In Proc. of the Data Compression Conference (1999), pp. 188–197.
27. Barbosa, D., Barta, A., Mendelzon, A. O., Mihaila, G. A., Rizzolo, F., and Rodriguez-Gianolli, P. TOX - the Toronto XML engine. In Proc. of the Workshop on Information Integration on the Web (2001), pp. 66–73.
28. Bayer, R., and Unterauer, K. Prefix B-Trees. ACM Transactions on Database Systems 2, 1 (1977), 11–26.
29. Bender, M. A., Demaine, E. D., and Farach-Colton, M. Cache-oblivious B-trees. In Proc. of the IEEE Symposium on Foundations of Computer Science (2000), pp. 399–409.
30. Bender, M. A., Duan, Z., Iacono, J., and Wu, J. A locality-preserving cache-oblivious dynamic dictionary. In Proc. of the ACM-SIAM Symposium on Discrete Algorithms (2002), pp.
29–38.
31. Bentley, J. L., and Sedgewick, R. Fast algorithms for sorting and searching strings. In Proceedings of the 8th ACM-SIAM Symposium on Discrete Algorithms (1997), pp. 360–369.
32. Blandford, D., and Blelloch, G. Index compression through document reordering. In Proc. of the IEEE Data Compression Conference (2002).
33. Bloom, B. H. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM 13, 7 (1970), 422–426.
34. Bookstein, A., Klein, S. T., and Raita, T. Markov models for clusters in concordance compression. In Proc. of the IEEE Data Compression Conference (1994), pp. 116–125.
35. Bookstein, A., Klein, S. T., and Raita, T. Detecting content-bearing words by serial clustering. In Proc. of the ACM-SIGIR Conference on Research and Development in Information Retrieval (1995), pp. 319–327.
36. Brodal, G. S., and Fagerberg, R. Cache oblivious distribution sweeping. In Proc. of the International Colloquium on Automata, Languages and Programming (2002), Lecture Notes in Computer Science vol. 2380, Springer Verlag, pp. 426–438.
37. Brodal, G. S., Fagerberg, R., and Jacob, R. Cache oblivious search trees via binary trees of small height. In Proc. of the ACM-SIAM Symposium on Discrete Algorithms (2002), pp. 39–48.
38. Brodnik, A., and Munro, I. Membership in constant time and almost-minimum space. SIAM Journal on Computing 28, 5 (1999), 1627–1640.
39. Brown, E., Callan, J., Croft, W., and Moss, J. Supporting full-text information retrieval with a persistent object store. In Proc. of the International Conference on Extending Database Technology (1994), pp. 365–378.
40. Burkhardt, S., Crauser, A., Ferragina, P., Lenhof, H.-P., Rivals, E., and Vingron, M. QUASAR: Q-gram based database searching using suffix array. In Proc. of the International Conference on Computational Molecular Biology (1999), pp. 77–83.
41. Burkhardt, S., and Kärkkäinen, J. One-gapped q-gram filters for Levenshtein distance. In Proc. of the Symposium on Combinatorial Pattern Matching (2002), Lecture Notes in Computer Science vol. 2373, Springer Verlag, pp. 225–234.
42. Burrows, M., and Wheeler, D. A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994.
43. Chapin, B. Switching between two on-line list update algorithms for higher compression of Burrows-Wheeler transformed data. In Proc. of the IEEE Data Compression Conference (2000), pp. 183–192.
44. Chapin, B., and Tate, S. R. Higher compression from the Burrows-Wheeler transform by modified sorting. In Proc. of the IEEE Data Compression Conference (1998), p. 532.
45. Chávez, E., and Navarro, G. A metric index for approximate string matching. In Proc. of the Latin American Symposium on Theoretical Informatics (2002), Lecture Notes in Computer Science vol. 2286, Springer Verlag, pp. 181–195.
46. Chen, T., and Skiena, S. S. Trie-based data structures for sequence assembly. In Proc. of the Symposium on Combinatorial Pattern Matching (1997), Lecture Notes in Computer Science vol. 1264, Springer Verlag, pp. 206–223.
47. Chen, Z., Jagadish, H. V., Korn, F., Koudas, N., Muthukrishnan, S., Ng, R. T., and Srivastava, D. Counting twig matches in a tree. In Proc. of the International Conference on Data Engineering (2001), pp. 595–604.
48. Ciriani, V., Ferragina, P., Luccio, F., and Muthukrishnan, S. Static Optimality Theorem for external-memory string access. In Proc. of the IEEE Symposium on Foundations of Computer Science (2002).
49. Clark, D. R., and Munro, I. Efficient suffix trees on secondary storage. In Proc. of the ACM-SIAM Symposium on Discrete Algorithms (1996), pp. 383–391.
50. Colussi, L., and Del Col, A.
A time and space efficient data structure for string searching on large texts. Information Processing Letters 58, 5 (1996), 217–222.
51. Comer, D. The ubiquitous B-tree. ACM Computing Surveys 11, 2 (1979), 121–137.
52. Cooper, B., Sample, N., Franklin, M. J., Hjaltason, G. R., and Shadmon, M. A fast index for semistructured data. The VLDB Journal (2001), 341–350.
53. Cormode, G., Paterson, M., Sahinalp, S. C., and Vishkin, U. Communication complexity of document exchange. In Proc. of the ACM-SIAM Symposium on Discrete Algorithms (2000), pp. 197–206.
54. Corti, F., Ferragina, P., and Paoli, M. TReSy: A tool to index SGML document collections. Technical Report, 1999 (in Italian), see also http://www.cribecu.sns.it/en_index.html.
55. Crauser, A., and Ferragina, P. A theoretical and experimental study on the construction of suffix arrays in external memory. Algorithmica 32, 1 (2002), 1–35.
56. Crochemore, M., Landau, G. M., and Ziv-Ukelson, M. A sub-quadratic sequence alignment algorithm for unrestricted cost matrices. In Proc. of the ACM-SIAM Symposium on Discrete Algorithms (2002), pp. 679–688.
57. Crochemore, M., Mignosi, F., Restivo, A., and Salemi, S. Text compression using antidictionaries. In Proc. of the International Colloquium on Automata, Languages and Programming (1999), Lecture Notes in Computer Science vol. 1644, Springer Verlag, pp. 261–270.
58. Darragh, J. J., Cleary, J. G., and Witten, I. H. Bonsai: a compact representation of trees. Software–Practice and Experience 23, 3 (1993), 277–291.
59. De Jonge, W., Tanenbaum, A. S., and VanDeRiet, R. P. Two access methods using compact binary trees. IEEE Transactions on Software Engineering 13, 7 (1987), 799–810.
60. Demaine, E. D., and Lopez-Ortiz, A. A linear lower bound on index size for text retrieval. In Proc. of the ACM-SIAM Symposium on Discrete Algorithms (2001), pp. 289–294.
61.
Egnor, D., and Lord, R. Structured information retrieval using XML. In Proc. of the ACM-SIGIR Workshop on XML and Information Retrieval (2000), http://www.haifa.il.ibm.com/sigir00-xml/.
62. Ergun, F., Sahinalp, S. C., Sharp, J., and Sinha, R. K. Biased dictionaries with fast insert/deletes. In Proc. of the ACM Symposium on Theory of Computing (2001), pp. 483–491.
63. Faloutsos, C. Access methods for text. ACM Computing Surveys 17, 1 (1985), 49–74.
64. Farach, M., Ferragina, P., and Muthukrishnan, S. Overcoming the memory bottleneck in suffix tree construction. In Proc. of the IEEE Symposium on Foundations of Computer Science (1998), pp. 174–183.
65. Farach, M., and Thorup, M. String matching in Lempel-Ziv compressed strings. Algorithmica 20, 4 (1998), 388–404.
66. Farach-Colton, M., Ferragina, P., and Muthukrishnan, S. On the sorting-complexity of suffix tree construction. Journal of the ACM 47, 6 (2000), 987–1011.
67. Feng, C. Pat-tree-based keyword extraction for Chinese information retrieval. ACM-SIGIR (1997), 50–58.
68. Fenwick, P. The Burrows-Wheeler transform for block sorting text compression: principles and improvements. The Computer Journal 39, 9 (1996), 731–740.
69. Ferguson, D. E. Bit-Tree: a data structure for fast file processing. Communications of the ACM 35, 6 (1992), 114–120.
70. Ferragina, P., and Grossi, R. Fast string searching in secondary storage: Theoretical developments and experimental results. In Proc. of the ACM-SIAM Symposium on Discrete Algorithms (1996), pp. 373–382.
71. Ferragina, P., and Grossi, R. The String B-Tree: A new data structure for string search in external memory and its applications. Journal of the ACM 46, 2 (1999), 236–280.
72. Ferragina, P., Koudas, N., Muthukrishnan, S., and Srivastava, D. Two-dimensional substring indexing. In Proc. of the ACM Symposium on Principles of Database Systems (2001), pp. 282–288.
73.
Ferragina, P., and Luccio, F. Dynamic dictionary matching in external memory. Information and Computation 146, 12 (1998).
74. Ferragina, P., and Luccio, F. String search in coarse-grained parallel computers. Algorithmica 24, 3–4 (1999), 177–194.
75. Ferragina, P., and Manzini, G. Opportunistic data structures with applications. In Proc. of the IEEE Symposium on Foundations of Computer Science (2000), pp. 390–398.
76. Ferragina, P., and Manzini, G. An experimental study of a compressed index. Information Sciences: special issue on "Dictionary Based Compression" 135 (2001), 13–28.
77. Ferragina, P., and Manzini, G. An experimental study of an opportunistic index. In Proc. of the ACM-SIAM Symposium on Discrete Algorithms (2001), pp. 269–278.
78. Ferragina, P., and Mastroianni, A. The XCDE library: indexing and compressing XML documents. http://sbrinz.di.unipi.it/~xcde, April 2002.
79. Frakes, W., and Baeza-Yates, R. Information Retrieval: Data Structures and Algorithms. Prentice Hall, 1992.
80. Frigo, M., Leiserson, C. E., Prokop, H., and Ramachandran, S. Cache-oblivious algorithms. In Proc. of the IEEE Symposium on Foundations of Computer Science (1999), pp. 285–298.
81. Gąsieniec, L., Karpinski, M., Plandowski, W., and Rytter, W. Efficient algorithms for Lempel-Ziv encoding. In Proc. of the Scandinavian Workshop on Algorithm Theory (1996), Lecture Notes in Computer Science vol. 1097, Springer Verlag, pp. 392–403.
82. Gąsieniec, L., Karpinski, M., Plandowski, W., and Rytter, W. Randomized efficient algorithms for compressed strings: The finger-print approach. In Proc. of the Symposium on Combinatorial Pattern Matching (1996), Lecture Notes in Computer Science vol. 1075, Springer Verlag, pp. 39–49.
83. Gil, J., and Itai, A. How to pack trees. Journal of Algorithms 32, 2 (1999), 108–132.
84. Gonnet, G. H., Baeza-Yates, R.
A., and Snider, T. Information Retrieval: Data Structures and Algorithms, ch. 5, pp. 66–82. Prentice-Hall, 1992.
85. Gravano, L., Ipeirotis, P. G., Jagadish, H. V., Koudas, N., Muthukrishnan, S., and Srivastava, D. Approximate string joins in a database (almost) for free. In Proc. of the International Conference on Very Large Data Bases (2001), pp. 491–500.
86. Grossi, R., and Italiano, G. Efficient techniques for maintaining multidimensional keys in linked data structures. In Proc. of the International Colloquium on Algorithms, Languages and Programming (1999), Lecture Notes in Computer Science vol. 1644, Springer Verlag, pp. 372–381.
87. Grossi, R., and Vitter, J. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. In Proc. of the ACM Symposium on Theory of Computing (2000), pp. 397–406.
88. Gusfield, D. Algorithms on strings, trees and sequences: computer science and computational biology. Cambridge University Press, 1997.
89. Gusfield, D., Landau, G. M., and Schieber, B. An efficient algorithm for the all pairs suffix-prefix problem. Information Processing Letters 41, 4 (1992), 181–185.
90. Harman, D. Overview of the third text retrieval conference. In Proc. of the Text REtrieval Conference (TREC-3) (1995), pp. 1–19.
91. Heaps, H. S. Information retrieval: theoretical and computational aspects. Academic Press, 1978.
92. Hu, T., and Tucker, A. Optimal computer search trees and variable length alphabetic codes. SIAM Journal of Applied Mathematics 21 (1971), 514–532.
93. Hunt, E., Atkinson, M. P., and Irving, R. W. A database index to large biological sequences. In Proc. of the International Conference on Very Large Data Bases (2001), pp. 139–148.
94. IBM Journal on Research and Development. The Memory eXpansion Technology for xSeries servers, March 2001.
95.
Indyk, P., and Motwani, R. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proc. of the ACM Symposium on Theory of Computing (1998), pp. 604–613.
96. Jacobson, G. Space-efficient static trees and graphs. In IEEE Symposium on Foundations of Computer Science (1989), pp. 549–554.
97. Jagadish, H. V., Koudas, N., and Srivastava, D. On effective multi-dimensional indexing for strings. ACM SIGMOD Record 29, 2 (2000), 403–414.
98. Jang, H., Kim, Y., and Shin, D. An effective mechanism for index update in structured documents. In Proc. of the ACM-CIKM International Conference on Information and Knowledge Management (1999), pp. 383–390.
99. Jin, S., and Bestavros, A. Temporal locality in web request streams (poster session): sources, characteristics, and caching implications. In Proc. of the International Conference on Measurements and Modeling of Computer Systems (2000), pp. 110–111.
100. Jokinen, P., and Ukkonen, E. Two algorithms for approximate string matching in static texts. In Proc. of Mathematical Foundations of Computer Science (1991), pp. 240–248.
101. Jónsson, B., Franklin, M., and Srivastava, D. Interaction of query evaluation and buffer management for information retrieval. In Proc. of the ACM-SIGMOD Conference on Management of Data (1998), pp. 118–129.
102. Kärkkäinen, J., and Ukkonen, E. Sparse suffix trees. International Conference on Computing and Combinatorics (1996), Lecture Notes in Computer Science vol. 1090, Springer Verlag, pp. 219–230.
103. Kahveci, T., and Singh, A. K. Efficient index structures for string databases. In Proc. of the International Conference on Very Large Data Bases (2001), pp. 351–360.
104. Kanne, C.-C., and Moerkotte, G. Efficient storage of XML data. In Proc. of the International Conference on Data Engineering (2000), p. 198.
105. Kärkkäinen, J.
Suffix cactus: A cross between suffix tree and suffix array. In Proc. of the Symposium on Combinatorial Pattern Matching (1995), Lecture Notes in Computer Science vol. 937, Springer Verlag, pp. 191–204.
106. Kärkkäinen, J., Navarro, G., and Ukkonen, E. Approximate string-matching over Ziv-Lempel compressed data. In Proc. of the Symposium on Combinatorial Pattern Matching (2000), Lecture Notes in Computer Science vol. 1848, Springer Verlag, pp. 195–209.
107. Katajainen, J., and Mäkinen, E. Tree compression and optimization with applications. International Journal of Foundations of Computer Science 1, 4 (1990), 425–447.
108. Kida, T., Takeda, M., Shinohara, A., and Arikawa, S. Shift-And approach to pattern matching in LZW compressed text. In Proc. of the Symposium on Combinatorial Pattern Matching (1999), Lecture Notes in Computer Science vol. 1645, Springer Verlag, pp. 1–13.
109. Knuth, D. E. Sorting and Searching, vol. 3 of The Art of Computer Programming. Addison Wesley, 1998.
110. Kodeks Software. The morphological analysis module, July 2002. http://www.gubin.spb.ru/articles/dictionary.html.
111. Korfhage, R. Information Storage and Retrieval. John Wiley and Sons, 1997.
112. Kurtz, S. Reducing the space requirement of suffix trees. Software–Practice and Experience 29, 13 (1999), 1149–1171.
113. Kushilevitz, E., Ostrovsky, R., and Rabani, Y. Efficient search for approximate nearest neighbor in high dimensional spaces. In Proc. of the ACM Symposium on Theory of Computing (1998), pp. 614–623.
114. Lesk, M. Practical digital libraries: books, bytes, and bucks. Morgan-Kaufman, 1997.
115. Litwin, W., Zegour, D., and Levy, G. Multilevel trie hashing. In Proc. of the International Conference on Extending Database Technology (1988), Lecture Notes in Computer Science vol. 303, Springer Verlag, pp. 309–335.
116.
Luk, R., Chan, A., Dillon, T., and Leong, H. A survey of search engines for XML documents. In Proc. of the ACM-SIGIR Workshop on XML and Information Retrieval (2000), http://www.haifa.il.ibm.com/sigir00-xml/.
117. Mäkinen, E. A survey on binary tree codings. The Computer Journal 34, 5 (1991), 438–443.
118. Manber, U. A text compression scheme that allows fast searching directly in the compressed file. In Proc. of the Symposium on Combinatorial Pattern Matching (1994), Lecture Notes in Computer Science vol. 807, Springer Verlag, pp. 113–124.
119. Manber, U. Foreword. In Modern Information Retrieval (1999), R. Baeza-Yates and B. Ribeiro-Neto, Eds., Addison-Wesley.
120. Manber, U., and Baeza-Yates, R. A. An algorithm for string matching with a sequence of don't cares. Information Processing Letters 37, 3 (1991), 133–136.
121. Manber, U., and Myers, G. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing 22, 5 (1993), 935–948.
122. Manber, U., and Wu, S. GLIMPSE: A tool to search through entire file systems. In Proc. of the USENIX Winter Technical Conference (1994), pp. 23–32.
123. Manning, C. D., and Schütze, H. Foundations of Statistical Natural Language Processing. The MIT Press, 2001.
124. Manzini, G., and Ferragina, P. Engineering a lightweight suffix-array construction algorithm. In Proc. of the European Symposium on Algorithms (2002), Lecture Notes in Computer Science, Springer Verlag.
125. Markatos, E. On caching search engine results. Computer Communications 24, 2 (2001), 137–143.
126. Marsan, L., and Sagot, M.-F. Algorithms for extracting structured motifs using a suffix tree with application to promoter and regulatory site consensus identification. Journal of Computational Biology 7 (2000), pp. 345–360.
127. Matias, Y., Muthukrishnan, S., Sahinalp, S. C., and Ziv, J. Augmenting suffix trees, with applications. In Proc.
of the European Symposium on Algorithms (1998), Lecture Notes in Computer Science vol. 1461, Springer Verlag, pp. 67–78.
128. McCreight, E. M. A space-economical suffix tree construction algorithm. Journal of the ACM 23, 2 (1976), 262–272.
129. McHugh, J., Abiteboul, S., Goldman, R., Quass, D., and Widom, J. LORE: A database management system for semistructured data. SIGMOD Record 26, 3 (1997), pp. 54–66.
130. Mehlhorn, K., and Näher, S. Algorithm design and software libraries: Recent developments in the LEDA project. In Proc. of IFIP Congress (1992), vol. 1, pp. 493–505.
131. Meira, W., Cesário, M., Fonseca, R., and Ziviani, N. Integrating WWW caches and search engines. In Proc. of the IEEE Global Telecommunications Conference (1999), pp. 1763–1769.
132. Merrett, T. H., and Shang, H. Trie methods for representing text. In Proc. of the International Conference on Foundations of Data Organization and Algorithms (1993), Lecture Notes in Computer Science vol. 730, Springer Verlag, pp. 130–145.
133. Mewes, H. W., and Heumann, K. Genome analysis: Pattern search in biological macromolecules. In Proc. of the Symposium on Combinatorial Pattern Matching (1995), Lecture Notes in Computer Science vol. 937, Springer Verlag, pp. 261–285.
134. Mitzenmacher, M. Compressed Bloom filters. In Proc. of the ACM Symposium on Principles of Distributed Computing (2001), pp. 144–150.
135. Moffat, A., and Stuiver, L. Exploiting clustering in inverted file compression. In Proc. of the IEEE Data Compression Conference (1996), pp. 82–91.
136. Moffat, A., and Bell, T. In-situ generation of compressed inverted files. Journal of the American Society for Information Science 46, 7 (1995), 537–550.
137. Morrison, D. R. PATRICIA - practical algorithm to retrieve information coded in alphanumeric. Journal of the ACM 15, 4 (1968), 514–534.
138. Moura, E., Navarro, G., and Ziviani, N.
Indexing compressed text. In Proc. of the South American Workshop on String Processing (1997), Carleton University Press.
139. Moura, E., Navarro, G., Ziviani, N., and Baeza-Yates, R. Fast searching on compressed text allowing errors. In Proc. of the International ACM-SIGIR Conference on Research and Development in Information Retrieval (1998), pp. 298–306.
140. Moura, E., Navarro, G., Ziviani, N., and Baeza-Yates, R. Fast and flexible word searching on compressed text. ACM Transactions on Information Systems 18, 2 (2000), 113–139.
141. Mulmuley, K. Computational Geometry: An introduction through randomized algorithms. Prentice-Hall, 1994.
142. Munro, I. Succinct data structures. In Proc. of the Workshop on Data Structures, within the Conference on Foundations of Software Technology and Theoretical Computer Science (1999), pp. 1–6.
143. Munro, I., and Raman, V. Succinct representation of balanced parentheses, static trees and planar graphs. In Proc. of the IEEE Symposium on Foundations of Computer Science (1997), pp. 118–126.
144. Munro, I., Raman, V., and Srinivasa Rao, S. Space efficient suffix trees. In Proc. of the Conference on Foundations of Software Technology and Theoretical Computer Science (1998), Lecture Notes in Computer Science vol. 1530, Springer Verlag, pp. 186–195.
145. Muthukrishnan, S. Efficient algorithms for document retrieval problems. In Proc. of the ACM-SIAM Annual Symposium on Discrete Algorithms (2002), pp. 657–666.
146. Muthukrishnan, S., and Sahinalp, S. C. Approximate nearest neighbors and sequence comparison with block operations. In Proc. of the ACM Symposium on Theory of Computing (2000), pp. 416–424.
147. Muthukrishnan, S., and Sahinalp, S. C. Simple and practical sequence nearest neighbors with block operations. In Proc.
of the Symposium on Combinatorial Pattern Matching (2002), Lecture Notes in Computer Science vol. 2373, Springer Verlag, pp. 262–278.
148. Navarro, G. A guided tour to approximate string matching. ACM Computing Surveys 33, 1 (2001), 31–88.
149. Navarro, G. Regular expression searching over Ziv-Lempel compressed text. In Proc. of the Symposium on Combinatorial Pattern Matching (2001), Lecture Notes in Computer Science vol. 2089, Springer Verlag, pp. 1–17.
150. Navarro, G. On compressed indexing via Lempel-Ziv parsing. Personal Communication, 2002.
151. Navarro, G., and Baeza-Yates, R. Proximal nodes: A model to query document databases by content and structure. ACM Transactions on Information Systems 15, 4 (1997), 400–435.
152. Navarro, G., and Baeza-Yates, R. A. A new indexing method for approximate string matching. In Proc. of the Symposium on Combinatorial Pattern Matching (1999), Lecture Notes in Computer Science vol. 1645, Springer Verlag, pp. 163–185.
153. Navarro, G., de Moura, E., Neubert, M., Ziviani, N., and Baeza-Yates, R. Adding compression to block addressing inverted indexes. Information Retrieval Journal 3, 1 (2000), 49–77.
154. Navarro, G., and Raffinot, M. A general practical approach to pattern matching over Ziv-Lempel compressed text. In Proc. of the Symposium on Combinatorial Pattern Matching (1999), Lecture Notes in Computer Science vol. 1645, Springer Verlag, pp. 14–36.
155. Navarro, G., Sutinen, E., Tanninen, J., and Tarhio, J. Indexing text with approximate q-grams. In Proc. of the Symposium on Combinatorial Pattern Matching (2000), Lecture Notes in Computer Science vol. 1848, Springer Verlag, pp. 350–363.
156. Nelson, M. Data compression with the Burrows-Wheeler transform. Dr. Dobb's Journal of Software Tools 21, 9 (1996), 46–50.
157. Nilsson, S., and Tikkanen, M. Implementing a dynamic compressed trie.
In Proc. of the Workshop on Algorithmic Engineering (1998), pp. 1–12.
158. Patt, Y. N. Guest editor's introduction: The I/O subsystem, a candidate for improvement. IEEE Computer 27, 3 (1994).
159. Persin, M., Zobel, J., and Sacks-Davis, R. Filtered document retrieval with frequency-sorted indexes. Journal of the American Society for Information Science 47, 10 (1996), 749–764.
160. Pevzner, P. A., and Waterman, M. S. Multiple filtration and approximate pattern matching. Algorithmica 13, 1–2 (1995), 135–154.
161. Pugh, W. Skip Lists: A probabilistic alternative to balanced trees. Communications of the ACM 33, 6 (1990), 668–676.
162. Quantum Corporation. Storage technology and trends. http://www.quantum.com/src/tt/storage_tech_trends.htm, 2000.
163. Raghavan, P. Information retrieval algorithms: A survey. Proc. of the ACM-SIAM Symposium on Discrete Algorithms (1997), 11–18.
164. Ruemmler, C., and Wilkes, J. An introduction to disk drive modeling. IEEE Computer 27, 3 (1994), 17–29.
165. Sadakane, K. A fast algorithm for making suffix arrays and for Burrows-Wheeler transformation. In Proc. of the IEEE Data Compression Conference (1998), pp. 129–138.
166. Sadakane, K. On optimality of variants of the block sorting compression. In Proc. of the IEEE Data Compression Conference (1998), p. 570.
167. Sadakane, K. A modified Burrows-Wheeler transformation for case-insensitive search with application to suffix array compression. In Proc. of the IEEE Data Compression Conference (1999), p. 548.
168. Sadakane, K. Compressed text databases with efficient query algorithms based on the compressed suffix array. In Proc. of the International Symposium on Algorithms and Computation (2000), Lecture Notes in Computer Science vol. 1969, Springer Verlag, pp. 410–421.
169. Sadakane, K.
Succinct representations of LCP information and improvements in the compressed suffix arrays. In Proc. of the ACM-SIAM Annual Symposium on Discrete Algorithms (2002), pp. 225–232.
170. Sadakane, K., and Shibuya, T. Indexing huge genome sequences for solving various problems. Genome Informatics (2002), pp. 175–183.
171. Sagot, M.-F., and Viari, A. Flexible identification of structural objects in nucleic acid sequences: palindromes, mirror repeats, pseudoknots and triple helices. In Proc. of the Symposium on Combinatorial Pattern Matching (1997), Lecture Notes in Computer Science vol. 1264, Springer-Verlag, pp. 224–246.
172. Samet, H. The Design and Analysis of Spatial Data Structures. Addison-Wesley, 1990.
173. Saraiva, P. C., de Moura, E. S., Fonseca, R. C., Meira, Jr., W., Ribeiro-Neto, B. A., and Ziviani, N. Rank-preserving two-level caching for scalable search engines. In Proc. of the International Conference on Research and Development in Information Retrieval (2001), pp. 51–58.
174. Schindler, M. A fast block-sorting algorithm for lossless data compression. http://www.compressconsult.com/szip/, 1996.
175. Schöning, H. Tamino - A DBMS designed for XML. In Proc. of the International Conference on Data Engineering (2001), pp. 149–154.
176. Seward, J. The bzip2 home page. http://sources.redhat.com/bzip2/.
177. Shang, H. Trie methods for text and spatial data structures on secondary storage. PhD thesis, McGill University, 1995.
178. Shibata, Y., Takeda, M., Shinohara, A., and Arikawa, S. Pattern matching in text compressed by using antidictionaries. In Proc. of the Symposium on Combinatorial Pattern Matching (1999), Lecture Notes in Computer Science vol. 1645, Springer-Verlag, pp. 37–49.
179. Silverstein, C., Henzinger, M., Marais, H., and Moricz, M. Analysis of a very large web search engine query log. ACM SIGIR Forum 33, 1 (1999), 6–12.
180.
Sleator, D. D., and Tarjan, R. E. Self-adjusting binary search trees. Journal of the ACM 32, 3 (1985), 652–686.
181. Sleepycat Software. The Berkeley DB. http://www.sleepycat.com/.
182. Sprugnoli, R. On the allocation of binary trees in secondary storage. BIT 21 (1981), 305–316.
183. Tomasic, A., and Garcia-Molina, H. Caching and database scaling in distributed shared-nothing information retrieval systems. ACM SIGMOD Record 22, 2 (1993), 129–138.
184. Turpin, A., and Moffat, A. Fast file search using text compression. Australian Computer Science Communications (1997), 1–8.
185. Ukkonen, E. Approximate string matching with q-grams and maximal matches. Theoretical Computer Science 92, 1 (1992), 191–211.
186. Vitter, J. External memory algorithms and data structures: Dealing with massive data. ACM Computing Surveys 33, 2 (2001), 209–271.
187. Williams, H., and Zobel, J. Compressing integers for fast file access. The Computer Journal 42, 3 (1999), 193–201.
188. Witten, I. H., Moffat, A., and Bell, T. C. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers, 1999.
189. Wu, S., and Manber, U. Fast text searching allowing errors. Communications of the ACM 35, 10 (1992), 83–91.
190. Zipf, G. Human Behaviour and the Principle of Least Effort. Addison-Wesley, 1949.
191. Ziv, J., and Lempel, A. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory 23 (1977), 337–343.
192. Ziv, J., and Lempel, A. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory 24 (1978), 530–536.
193. Zobel, J., Moffat, A., and Ramamohanarao, K. Guidelines for presentation and comparison of indexing techniques. SIGMOD Record 25, 1 (1996), 10–15.
194. Zobel, J., Moffat, A., and Ramamohanarao, K. Inverted files versus signature files for text indexing.
ACM Transactions on Database Systems 23 (1998), 453–490.

Fig. 3. An illustrative example depicting a String B-tree built on a set ∆ of DNA sequences. ∆'s strings are stored in a file separated by special characters, here denoted with black boxes. The triangles labeled with PT depict the Patricia trees stored into each String B-tree node. The figure also shows in bold the String B-tree nodes traversed by the search for a pattern P = "CT". The circled pointers denote the suffixes, one per level, explicitly checked during the search; the pointers in bold, in the leaf level, denote the five suffixes prefixed by P and thus the five positions where P occurs in ∆.

Fig. 4. An example of Patricia tree built on a set of k = 7 DNA strings drawn from the alphabet Σ = {A, G, C, T}. Each leaf points to one of the k strings; each internal node u (they are at most k − 1) is labeled with an integer len(u) which denotes the length of the common prefix shared by all the strings pointed by the leaves descending from u; each arc (they are at most 2k − 1) is labeled with only one character (called branching character).
The characters between square-brackets are not explicitly stored, and denote the other characters labeling a trie arc.

Fig. 5. The arrays SA and Lcp computed on the Patricia tree of Figure 4: SA = [p1, p2, p3, p4, p5, p6, p7] and Lcp = [10, 6, 0, 12, 8, 12]. The array Skip = [10, −4, −6, 12, −4, 4] is derived from the array Lcp by subtracting its adjacent entries. The Skips and Lcps are expressed in bits.

Input: A set S of K strings, whose total length is N (bits).
Output: A sorted permutation of S.
1. Every string of S is partitioned into pieces of L bits each. L is chosen to be much larger than 2 log2 K.
2. Compute for each string piece a name, i.e. a bit string of length 2 log2 K, by means of a proper hash function. Each string of S is then compressed by replacing its L-pieces with their corresponding names. The resulting set of compressed strings is denoted with C, and its elements are called c-strings.
3. Sort C via any known external-memory sorting algorithm (e.g. Mergesort).
4. Compute the longest common prefix between any pair of c-strings adjacent in (the sorted) C and mark the (at most two) mismatching names. Let lcp_x be the number of names shared by the x-th and the (x+1)-th string of C.
5. Scan the set C and collect the (two) marked names of each c-string together with their corresponding L-pieces. Sort these string pieces (they are at most 2K) and assign a rank to each of them; equal pieces get the same rank. The rank is represented with 2 log2 K bits (like the names of the string pieces), possibly padding the most significant digits with zeros.
6.
Build a (logical) table T by mapping c-strings to columns and names of L-pieces to table entries: T[a, b] contains the a-th name in the b-th c-string of C. Subsequently, transform T's entries as follows: replace the marked names with their corresponding ranks, and the other names with a bit-sequence of 2 log2 K zeros. If the c-strings do not have equal length, pad them logically with zeros. This way names and ranks are formed by the same number of bits, c-strings have the same length, and their (name or rank) pieces are correctly aligned.
7. Perform a forward and a backward pass through the columns of T as follows:
   (a) In the rightward pass, copy the first lcp_{x−1} entries of the (x−1)-th column of T into the subsequent x-th column, for x = 2, ..., K. The mismatching names of the x-th column are not overridden.
   (b) In the leftward pass, copy the first lcp_x entries of the (x+1)-th column of T into the x-th column, for x = K − 1, ..., 1.
8. The columns of T are sorted via any known external-memory sorting algorithm (e.g. Mergesort). From the bijection string ↔ c-string ↔ column, we derive the sorted permutation of S.

Fig. 6. A randomized algorithm for sorting arbitrarily long strings in external memory.

L-piece   name   rank
  aa       6      1
  ab       1      2
  bb       4      3
  bc       2      -
  ca       5      4
  cb       3      5
  cc       7      6

Fig. 7. Names of all L-pieces and ranks of the marked L-pieces. Notice that the L-piece bc has no rank because it has not been marked in Step 4.

Fig. 8. Strings are written from the top to the bottom of each table column. (i) Strings are divided into pieces of 2 chars each.
(ii) Each L-piece is substituted with its name taken from the (hash) table of Figure 7. (iii) Columns are sorted and mismatching names between adjacent columns are underlined.

Fig. 9. (i) The rightward pass through table T. (ii) The leftward pass through table T. (iii) The sorted T.

Fig. 10. Example of Burrows-Wheeler transform for the string T = mississippi. The matrix on the right has the rows sorted in lexicographic order. The output of the BWT is column L; in this example the string ipssm#pissii.

Algorithm get_rows(P[1, p])
1. i = p, c = P[p], first = C[c] + 1, last = C[c + 1];
2. while ((first ≤ last) and (i ≥ 2)) do
3.    c = P[i − 1];
4.    first = C[c] + Occ(c, first − 1) + 1;
5.    last = C[c] + Occ(c, last);
6.    i = i − 1;
7. if (last < first) then return "no rows prefixed by P[1, p]" else return (first, last).

Fig. 11. Algorithm get_rows finds the set of rows prefixed by pattern P[1, p]. Procedure Occ(c, k) counts the number of occurrences of the character c in the string prefix L[1, k]. In [75] it is shown how to implement Occ(c, k) in constant time.
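The backward search of Fig. 11 needs only the column L, the counting array C, and the procedure Occ. The following Python sketch (our own illustration, not code from the survey) builds the BWT of Fig. 10 and runs the search on it; the names bwt, build_C, occ and get_rows are ours, Occ is implemented naively in O(k) time in place of the constant-time structure of [75], and the initial value of last is computed as C[c] plus the count of c, which equals the C[c + 1] of the pseudocode.

```python
def bwt(t):
    # Sort all cyclic rotations of t and output the last column L (Fig. 10)
    n = len(t)
    rows = sorted(range(n), key=lambda i: t[i:] + t[:i])
    return "".join(t[(i - 1) % n] for i in rows)

def build_C(L):
    # C[c] = number of characters of L that are lexicographically smaller than c
    return {c: sum(L.count(d) for d in set(L) if d < c) for c in set(L)}

def occ(L, c, k):
    # Occurrences of c in the prefix L[1, k] (1-based, as in Fig. 11);
    # a naive stand-in for the constant-time Occ of [75]
    return L[:k].count(c)

def get_rows(P, L, C):
    # Backward search: [first, last] is the range of rows of the sorted
    # matrix prefixed by the pattern suffix processed so far
    c = P[-1]
    first, last = C[c] + 1, C[c] + L.count(c)   # rows starting with c
    for c in reversed(P[:-1]):
        first = C[c] + occ(L, c, first - 1) + 1
        last = C[c] + occ(L, c, last)
        if last < first:
            return None                         # no rows prefixed by P
    return first, last

L_col = bwt("mississippi#")   # the column L of Fig. 10: "ipssm#pissii"
C = build_C(L_col)
rows = get_rows("iss", L_col, C)
```

On the example of Fig. 10, get_rows("iss", L_col, C) returns the row range (4, 5): rows 4 and 5 of the sorted matrix are the two rotations prefixed by "iss", matching the two occurrences of "iss" in mississippi.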