Towards an Optimal Space-and-Query-Time Index for Top-k Document Retrieval

Wing-Kai Hon¹, Rahul Shah², and Sharma V. Thankachan²

¹ Department of CS, National Tsing Hua University, Taiwan. wkhon@cs.nthu.edu.tw
² Department of CS, Louisiana State University, USA. {rahul,thanks}@csc.lsu.edu

Abstract. Let $\mathcal{D} = \{d_1, d_2, \ldots, d_D\}$ be a given set of $D$ string documents of total length $n$. Our task is to index $\mathcal{D}$ such that the $k$ most relevant documents for an online query pattern $P$ of length $p$ can be retrieved efficiently. We propose an index of size $|CSA| + n \log D\,(2 + o(1))$ bits and $O(t_s(p) + k \log\log n + \operatorname{poly}\log\log n)$ query time for the basic relevance metric term-frequency, where $|CSA|$ is the size (in bits) of a compressed full text index of $\mathcal{D}$, with $O(t_s(p))$ time for searching a pattern of length $p$. We further reduce the space to $|CSA| + n \log D\,(1 + o(1))$ bits; however, the query time becomes $O(t_s(p) + k(\log\sigma \log\log n)^{1+\epsilon} + \operatorname{poly}\log\log n)$, where $\sigma$ is the alphabet size and $\epsilon > 0$ is any constant.

1 Introduction and Related Work

Document retrieval is a special type of pattern matching that is closely related to information retrieval and web searching. In this problem, the data consists of a collection of text documents, and given a query pattern $P$, we are required to report all the documents in which this pattern occurs (not all the occurrences). In addition, the notion of relevance is commonly applied to rank all the documents that satisfy the query, and only those documents with the highest relevance are returned. Such a concept of relevance has been central to the effectiveness and usability of present-day search engines like Google, Bing, Yahoo, or Ask.
When relevance is considered, the query has an additional input parameter $k$, and the task is to report the $k$ documents with the highest relevance to the query pattern (in decreasing order of relevance), instead of finding all the documents that contain the query pattern (as there may be too many). More formally, let $\mathcal{D} = \{d_1, d_2, \ldots, d_D\}$ denote a given set of $D$ string documents to be indexed, whose total length is $n$, and let $P$ denote a query pattern of length $p$. Let $occ$ be the number of occurrences of this pattern over the entire collection $\mathcal{D}$, and $ndoc$ be the number of documents out of $D$ in which the pattern $P$ appears. One of the main issues is the fact that $k \ll ndoc \ll occ$. Thus, it is important to design indexes which do not have to go through all the occurrences or even all the documents in order to answer a query.

The research in string document retrieval was introduced by Matias et al. [19], and Muthukrishnan [21] formalized it with the introduction of relevance metrics like term-frequency (tf) and min-dist,³ and proposed indexes with efficient query performance. Since then, this has been an active research area [28,29]. The top-$k$ document retrieval problem was introduced in [12], where an $O(n \log n)$-word index is proposed with $O(p + k + \log n \log\log n)$ query time for the case when the relevance metric is term-frequency. A recent flurry of activities in this area [25,15,8,2,4,26,23,17,13,24] came with Hon et al.'s work [14], where they gave a linear-space index with $O(p + k \log k)$ query time, which works for a wide class of relevance metrics.
The recent structure by Navarro and Nekrich [22] achieves optimal $O(p + k)$ query time using $O(n(\log\sigma + \log D + \log\log n))$ bits, which improves the results in [14] in both space and time. If the relevance metric is term-frequency, their index space can be further improved to $O(n(\log\sigma + \log D))$ bits. All these interesting results have contributed towards the goal of achieving an optimal query time index. However, the space is far from optimal; moreover, the constants hidden in the space bound can restrict the use of these indexes in practice. On the other side, the succinct index proposed by Hon et al. [14] takes about $O(\log^4 n)$ time to report each document, which is likely to be impractical. This time bound has been further improved by [2,8], but still $\operatorname{polylog}(n)$ time is required per reported document. Another line of work is to derive indexes using about $n \log D$ bits of additional space, and the best known index takes a per-document report time of $O(\log k \log^{1+\epsilon} n)$ [2]. Efficient practical indexes are also known [4], but their query algorithms are heuristics with no worst-case bound.

³ $tf(P, d)$ is the number of occurrences of $P$ in $d$, and $min\text{-}dist(P, d)$ is the minimum distance between two occurrences of $P$ in $d$.

In this paper, we introduce two space-efficient indexes with per-document report time poly-log-logarithmic in $n$. The main results are summarized as follows.

Theorem 1. There exists an index of size $|CSA| + n \log D\,(2 + o(1))$ bits with a query time of $O(t_s(p) + k \log\log n + \operatorname{poly}\log\log n)$ for retrieving top-$k$ documents with the highest term frequencies, where $|CSA|$ is the size (in bits) of a compressed full text index of $\mathcal{D}$ with $O(t_s(p))$ time for searching a pattern of length $p$.
Theorem 2. There exists an index of size $|CSA| + n \log D\,(1 + o(1))$ bits with a query time of $O(t_s(p) + k(\log\sigma \log\log n)^{1+\epsilon} + \operatorname{poly}\log\log n)$ for retrieving top-$k$ documents with the highest term frequencies, where $|CSA|$ is the size (in bits) of a compressed full text index of $\mathcal{D}$ with $O(t_s(p))$ time for searching a pattern of length $p$, $\sigma$ is the alphabet size and $\epsilon > 0$ is a constant.

Table 1 gives a summary of the major results in the top-$k$ frequent document retrieval problem. The time complexities are simplified by assuming that we are using the full text index proposed by Belazzougui and Navarro, of size $|CSA| = nH_h + O(n) + o(n \log\sigma)$ bits and $t_s(p) = O(p)$, where $H_h$ is the $h$-th order empirical entropy of $\mathcal{D}$ [1]. We also assume $D < n^{\varepsilon}$ for some $\varepsilon < 1$, and $\epsilon > 0$ is any constant.

Table 1. Indexes for Top-k Frequent Document Retrieval

Source | Index Space (in bits)                     | Time per reported document
[12]   | $O(n \log n + n \log^2 D)$                | $O(1)$
[14]   | $O(n \log n)$                             | $O(\log k)$
[4]    | $|CSA| + n \log D\,(1 + o(1))$            | Unbounded
[14]   | $2|CSA| + o(n)$                           | $O(\log^{4+\epsilon} n)$
[2]    | $2|CSA| + o(n)$                           | $O(\log k \log^{2+\epsilon} n)$
[8]    | $|CSA| + O(n \log D / \log\log D)$        | $O(\log^{3+\epsilon} n)$
[2]    | $|CSA| + O(n \log D / \log\log D)$        | $O(\log k \log^{2+\epsilon} n)$
[2]    | $|CSA| + O(n \log\log\log D)$             | $O(\log k \log^{2+\epsilon} n)$
[22]   | $O(n \log\sigma + n \log D)$              | $O(1)$
[8]    | $|CSA| + n \log D + o(n)$                 | $O(\log^{2+\epsilon} n)$
[2]    | $|CSA| + n \log D + o(n)$                 | $O(\log k \log^{1+\epsilon} n)$
Ours   | $|CSA| + 2n \log D\,(1 + o(1))$           | $O(\log\log n)$
Ours   | $|CSA| + n \log D\,(1 + o(1))$            | $O((\log\sigma \log\log n)^{1+\epsilon})$

2 Preliminaries

2.1 Top-k Using Range Maximum/Minimum Queries

One of the main tools in top-$k$ retrieval is the range maximum/minimum query (RMQ) structure [6]. We summarize the results in the following lemmas (we defer the proofs to Appendices A and B, respectively).

Lemma 1. Let $A[1..n]$ be an array of $n$ numbers.
We can preprocess $A$ in linear time and associate $A$ with a $2n + o(n)$-bit RMQ data structure such that, given a set of $t$ non-overlapping ranges $[L_1, R_1], [L_2, R_2], \ldots, [L_t, R_t]$, we can find the largest (or smallest) $k$ numbers in $A[L_1..R_1] \cup A[L_2..R_2] \cup \cdots \cup A[L_t..R_t]$ in unsorted order in $O(t + k)$ time.

Lemma 2. Let $A[1..n]$ be an array of $n$ integers taken from the set $[1, \pi]$, and let each number $A[i]$ be associated with a score (which may be stored separately and can be computed in $t_{score}$ time). Then the array $A$ can be maintained in $O(n \log\pi)$ bits, such that given two ranges $[x', x'']$, $[y', y'']$, and a parameter $k$, we can search among those entries $A[i]$ with $x' \le i \le x''$ and $y' \le A[i] \le y''$, and report the $k$ highest scoring entries in unsorted order in $O((\log\pi + k)(\log\pi + t_{score}))$ time.

3 A Brief Review of Hon et al.'s Index

In this section we give a brief description of Hon et al.'s index [14]. Let $T = d_1 \# d_2 \# \cdots \# d_D \#$ be a text obtained by concatenating all the documents in $\mathcal{D}$, separated by a special symbol $\#$ not appearing elsewhere inside any of the $d_i$'s. Then the suffix tree [30,20,18] of $T$ is called the generalized suffix tree (GST) of $\mathcal{D}$. Any given substring $T[a..b]$ of $T$ (which does not contain $\#$) is a substring of some document $d_x \in \mathcal{D}$, and the value of $x$ can be computed in $O(1)$ time by maintaining an $(n + D)(1 + o(1))$-bit auxiliary data structure.⁴ Each edge in GST is labeled by a character string, and for any node $u$, the path label of $u$, denoted by $path(u)$, is the string formed by concatenating the edge labels from the root to $u$. Note that the path label of the $i$-th leftmost leaf in GST is exactly the $i$-th lexicographically smallest suffix of $T$.
For a pattern $P[1..p]$ that appears in $T$, the locus node of $P$ is denoted by $locus(P)$; it is the unique node closest to the root such that $P$ is a prefix of $path(locus(P))$, and it can be determined in $O(p)$ time. We augment the following structures on GST.

N-structure: An N-structure entry is a triplet $(doc, score, parent)$ and is associated with some node in GST. If $u$ is a leaf node and $path(u)$ is a suffix of document $d$, then an N-structure entry with $doc = d$ is stored at $u$. However, if $u$ is an internal node, multiple N-structure entries may be stored at $u$ as follows: an entry with $doc = d$ is stored if and only if at least two children of $u$ contain (a suffix of) document $d$ in their subtrees. The score field in an N-structure entry for a document $d$ associated with a node $u$ is $score(path(u), d)$: the relevance score of $d$ with respect to the pattern $path(u)$.⁵ The parent field stores (the pre-order rank of) the lowest ancestor of $u$ which has an entry for document $d$ in its N-structure. In case there is no such ancestor, we assign a dummy node which is regarded as the parent of the root of GST.

I-structure: An I-structure entry is a triplet $(doc, score, origin)$ and is associated with some node in GST. If node $u$ has an N-structure entry for document $d$, and an N-structure entry of another node $v$ is given by $(d, score(path(v), d), u)$, then $u$ will have an I-structure entry $(d, score(path(v), d), v)$. An internal node may be associated with multiple I-structure entries, and these entries are maintained in an array, sorted by the origin field.

⁴ Maintain a bit vector $B[1..(n + D)]$, where $B[i] = 1$ if and only if $T[i] = \#$; then $x = rank_B(a) + 1$, which can be computed in $O(1)$ time using [27].
⁵ The score is dependent only on $d$ and the set of occurrences of $path(u)$ in $d$.
In addition, a range maximum query (RMQ) structure is maintained over the array based on the score field.

3.1 Query Answering

To answer a top-$k$ query, we first search for the query pattern $P$ in GST and find its locus node $locus(P)$. We also find the rightmost leaf $locus_R(P)$ in the subtree of $locus(P)$. Now, our task is to find, among the documents whose suffixes appear in the subtree of $locus(P)$, which $k$ of them have the most occurrences of $P$. Hon et al. showed that this can be done by checking only the I-structure entries associated with the proper ancestors of $locus(P)$, and then retrieving those $k$ entries which have the highest score values and whose origin is from the subtree of $locus(P)$ (inclusively). The number of ancestors of $locus(P)$ is bounded by $p$, and since the I-structure entries are sorted according to the origin values, the entries to be checked will occupy a contiguous region in the sorted array. The boundaries of the contiguous region can be obtained by performing a binary search based on (the pre-order ranks of) $locus(P)$ and $locus_R(P)$. Once we get the boundaries of the contiguous region in each proper ancestor of $locus(P)$, we can apply RMQ queries repeatedly over score and retrieve the top-$k$ scoring documents in sorted order in $O(p \log n + k \log k)$ time. The binary search step can be made faster by maintaining a predecessor structure [31], and the resulting time will become $O(p \log\log n + k \log k)$. This time has been further improved to $O(p + k \log k)$ by introducing two additional fields $\delta_f$ and $\delta_\ell$ in each N-structure entry. The number of N-structure entries (hence I-structure entries) is at most $2n$. Therefore the index space is $O(n \log n)$ bits.
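The repeated-RMQ step can be illustrated with a small sketch. It assumes each ancestor's I-structure is given as a plain score array together with the already-located contiguous region; the names `SparseTableRMQ` and `top_k` are hypothetical, and a sparse table stands in for the succinct RMQ structure of [6]. A max-heap over range maxima yields the answers in sorted order, splitting a range around each reported maximum.

```python
import heapq


class SparseTableRMQ:
    """Range-maximum queries via a sparse table (illustrative, not succinct)."""

    def __init__(self, values):
        self.values = values
        n = len(values)
        self.table = [list(range(n))]  # level 0: every index is its own argmax
        j = 1
        while (1 << j) <= n:
            prev = self.table[-1]
            half = 1 << (j - 1)
            row = []
            for i in range(n - (1 << j) + 1):
                a, b = prev[i], prev[i + half]
                row.append(a if values[a] >= values[b] else b)
            self.table.append(row)
            j += 1

    def argmax(self, lo, hi):
        """Index of a maximum of values[lo..hi] (inclusive, 0-based)."""
        j = (hi - lo + 1).bit_length() - 1
        a = self.table[j][lo]
        b = self.table[j][hi - (1 << j) + 1]
        return a if self.values[a] >= self.values[b] else b


def top_k(arrays, ranges, k):
    """Return up to k (score, array_id, index) triples in decreasing score order.

    arrays[t] is the score array of the t-th I-structure; ranges[t] is its
    contiguous region (lo, hi), or None if no entry qualifies.
    """
    # In a real index the RMQ structures are built once, at construction time.
    rmqs = [SparseTableRMQ(a) for a in arrays]
    heap = []

    def push(t, lo, hi):
        if lo <= hi:
            m = rmqs[t].argmax(lo, hi)
            heapq.heappush(heap, (-arrays[t][m], t, m, lo, hi))

    for t, r in enumerate(ranges):
        if r is not None:
            push(t, r[0], r[1])

    out = []
    while heap and len(out) < k:
        neg_score, t, m, lo, hi = heapq.heappop(heap)
        out.append((-neg_score, t, m))
        push(t, lo, m - 1)   # left part of the range, without the maximum
        push(t, m + 1, hi)   # right part of the range
    return out
```

With $t$ ranges, the heap holds $O(t + k)$ items at any time, so each heap operation costs $O(\log(t + k))$.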
4 Our Linear-Space Index

In this section, we derive a modified version of Hon et al.'s linear index without the $\delta$ fields that still achieves an $O(p)$ term in the query time. The main technique is to introduce a novel criterion that categorizes the I-structure entries as near and far. The far entries associated with certain nodes can be maintained together as a combined I-structure, which reduces the number of I-structure boundaries to be searched to $O(p/\pi + \pi)$, where $\pi$ is a sampling factor. By choosing $\pi = \log\log n$, we shall use a predecessor search structure (instead of the $\delta$ fields) and can compute the I-structure boundaries in $O((p/\pi + \pi) \log\log n) = O(p + \log^2\log n)$ time. We have the following result.

Theorem 3. There exists an index of size $O(n \log n)$ bits for top-$k$ document retrieval with $O(p + \log^2\log n + k \log\log\log n + k \log k)$ query time.

Proof. Firstly, we mark all nodes in GST whose node-depths are multiples of $\pi$ (the node-depth of the root is 0). Thus, any unmarked node is at most $\pi$ nodes away from its lowest marked ancestor. Also, the number of marked ancestors of any node is $\lceil(\text{number of ancestors})/\pi\rceil$. For any node $w$ in GST, we define a value $\zeta(w) < \pi$, where $\zeta(w) = 0$ if $w$ is marked; otherwise it is the number of nodes in the path from $w$ (exclusively) to its lowest marked ancestor (inclusively). In each I-structure entry $(d, s, v)$ associated with a node $w$, we maintain a fourth component $\zeta(w)$. Next, we categorize the I-structure entries as far and near as follows: An I-structure entry associated with a node $w$, with $origin = v$, is near if there exists no marked node in the path from $v$ (inclusively) to $w$ (exclusively); otherwise it is far.
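The marking rule and the near/far criterion can be checked on a toy tree. The sketch below assumes the tree is given by hypothetical `parent` and `depth` arrays and that $w$ is a proper ancestor of $v$. Because marked nodes are exactly those at depths that are multiples of $\pi$, $\zeta(w)$ equals $depth(w) \bmod \pi$, and an entry is near exactly when $\lfloor depth(v)/\pi \rfloor = \lfloor depth(w)/\pi \rfloor$; the closed form is a consequence of the definitions rather than something stated in the text, so the code also includes the direct path walk for comparison.

```python
def zeta(depth_w, pi):
    """zeta(w): distance from w up to its lowest marked ancestor (0 if marked)."""
    return depth_w % pi


def is_near_by_walk(parent, depth, v, w, pi):
    """Direct check: no marked node on the path from v (incl.) to w (excl.).

    Assumes w is a proper ancestor of v.
    """
    node = v
    while node != w:
        if depth[node] % pi == 0:   # node is marked
            return False
        node = parent[node]
    return True


def is_near(depth, v, w, pi):
    """Closed form: no multiple of pi lies in the depth interval (depth[w], depth[v]]."""
    return depth[v] // pi == depth[w] // pi
```

For example, with $\pi = 3$ on a single root-to-leaf path, an entry stored at depth 1 with origin at depth 2 is near, while one with origin at depth 3 is far, since the origin itself is marked.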
We restructure the entries such that all far entries are maintained in a combined I-structure associated with some marked nodes as follows: if $(d, s, v, \zeta(w))$ is a far entry in the I-structure $I_w$ associated with node $w$, then we remove this entry from $I_w$ and move it to a combined I-structure associated with the node $u$, where $u = w$ if $w$ is marked; otherwise $u$ is the lowest marked ancestor of $w$ (i.e., $u$ is $\zeta(w)$ nodes above $w$). All the entries in the combined I-structure are maintained in the sorted order of origin values. A predecessor search structure over the origin field and an RMQ structure over the score field are maintained over all I-structures. Next, to understand how to answer a query with our index, we introduce the following auxiliary lemma.

Lemma 3. The top-$k$ documents corresponding to a pattern $P$ can be obtained by checking the following I-structure entries (with origins coming from the subtree of $locus(P)$):
(i) near entries in the regular I-structures associated with the nodes in the path from $locus(P)$ (exclusively) to its lowest marked ancestor $u$ (inclusively), and there are at most $\pi$ such nodes;
(ii) far entries with $\zeta < \zeta(locus(P))$ in the combined I-structure of $u$; and
(iii) far entries in the combined I-structures associated with the marked proper ancestors of $u$ (at most $p/\pi$ of them).

Proof. In the original index by Hon et al., we need to check the I-structure entries in all ancestors of $locus(P)$. We may categorize them as follows:
(a) near entries associated with a node in the subtree of $u$ (inclusively);
(b) far entries associated with a node in the subtree of $u$ (inclusively);
(c) far entries associated with an ancestor node of $u$;
(d) near entries associated with an ancestor node of $u$.
All entries in (a) belong to category (i) in the lemma.
The valid entries in (b) belong to category (ii), where the inequality $\zeta < \zeta(locus(P))$ ensures that all entries in category (ii) were originally from an ancestor of $locus(P)$. All those entries in (c) which may be a possible candidate for the top-$k$ documents belong to category (iii) in the lemma. None of the entries in (d) can be a valid output, as the origins of those entries do not come from the subtree of $u$ (by the definition of a near entry), hence not from the subtree of $locus(P)$. On the other hand, since we always check for the entries with origins coming from the subtree of $locus(P)$, these entries must be a subset of those checked in the original index by Hon et al. In conclusion, the entries checked in both indexes are exactly the same, and the lemma follows. □

Based on the above lemma, we may compute $k$ candidate answers from each category, and the actual top-$k$ answers can be computed by comparing the scores of these $3k$ documents. In category (i) we have at most $\pi$ boundaries to be searched, which takes $O(\pi \log\log n)$ time, and we then retrieve the $k$ candidate answers in unsorted order in $O(\pi + k)$ time using Lemma 1. Similarly, in category (iii), the number of I-structure boundaries to be searched is $p/\pi$, and it takes $O((p/\pi) \log\log n + k)$ time in total. However, for category (ii), we have an additional constraint on the $\zeta$ value of the entries. To facilitate the process, the $\zeta$ components are maintained by the data structure in Lemma 2 in $O(n \log\pi)$ bits, so that the desired answers can be reported in $O((\log\pi + k)(\log\pi + O(1)))$ time. The $O(k \log k)$ term is for sorting the answers. The time for the initial pattern search is $O(p)$. Putting it all together with $\pi = \log\log n$, we obtain Theorem 3. □

5 Space-Efficient Encoding of Our Index

In this section, we derive a space-efficient index for the relevance metric term-frequency.
The major contribution is that, instead of using $O(\log n)$ bits for an I-structure entry, we design some novel encodings so that each entry requires only $\log D + \log\pi + O(1)$ bits. The GST will be replaced by a compressed full text index $CSA$ of size $|CSA|$ bits [11,5,10,1] along with the tree encoding of GST in $4n + o(n)$ bits [16].⁶ Thus $locus(P)$ can be computed in $O(p)$ time by taking the LCA (lowest common ancestor) of the leftmost and rightmost leaves in the suffix range of $P$. A core component of our index is the document array $DA$, where $DA[i]$ stores the id of the document to which the $i$-th smallest suffix in GST belongs. The $DA$ can be maintained in $n \log D + O(n \log D / \log\log D)$ bits and can answer the following queries in $O(\log\log D)$ time [9]:
(i) $access(i)$: returns $DA[i]$;
(ii) $rank(d, i)$: returns the number of occurrences of document $d$ in $DA[1..i]$;
(iii) $select(d, j)$: returns $-1$ if $j > |d|$, else the position $i$ where $DA[i] = d$ and $rank(d, i) = j$.
Now we show how to use $DA$ for efficient encoding and decoding of the different components in an I-structure entry.

Term-frequency Encoding: Given an I-structure entry with $origin = v$ and $doc = d$, the corresponding term-frequency score is exactly the number of occurrences of $d$ in $DA[i..j]$, where $i$ and $j$ are the leftmost leaf and the rightmost leaf of $v$, respectively. Thus, given the values $v$ and $d$, we can find $i$ and $j$ in constant time based on the tree encoding of the GST, and then compute the term-frequency in $O(\log\log D)$ time based on two rank queries on $DA$. Thus, we will discard the score field completely for all I-structure entries, keeping only the RMQ structure over it.

Origin Encoding: Origin encoding is the trickiest part, and is based on the following observation by Hon et
al. [14]: for any document $d$ and for any node $v$ in GST, there is at most one ancestor of $v$ that contains an I-structure entry with $doc = d$ and origin from a node in the subtree of $v$ (inclusively). We introduce two separate schemes for encoding origin fields in near and far entries. This reduces the origin array space from $O(n \log n)$ bits to $O(n)$ bits, and decoding takes $O(\log\log D)$ time.

Encoding near entries: Let $I_w$ be a regular I-structure (with only near entries) associated with a node $w$, and let $w_q$ denote the $q$-th child of $w$ from the left, $1 \le q \le degree(w)$; the pre-order rank of $w_q$ can be computed in constant time [16]. Then, from the definition of I-structures, for a given document $d$, there exists at most one entry in $I_w$ with $doc = d$ and origin from the subtree of $w_q$ (inclusively). Thus, for a given document $d$ and an internal node $w$, an entry in $I_w$ can be associated with a unique child node $w_q$ of $w$ such that the origin is in the subtree of $w_q$. Moreover, this origin must be the node, closest to the root, in the subtree of $w_q$ which has an N-structure entry for $d$. From the definition of N-structure, this origin node must be the lowest common ancestor (LCA) of the leaves corresponding to the first and last suffixes of $d$ in the subtree of $w_q$, which can be computed using the tree encoding of GST and a constant number of rank/select operations on $DA$ in $O(\log\log D)$ total time. Therefore, by maintaining the information about $w_q$ (origin-child $= q$) for each I-structure entry, the corresponding origin value can be decoded in $O(\log\log D)$ time.
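The two uses of $DA$ described above can be sketched in a few lines. This is a minimal illustration, not the succinct structure of [9]: the hypothetical class `DocumentArray` stores sorted position lists per document and answers rank by binary search. Positions are 1-based to match the text. The term frequency follows from two rank queries, and the first and last suffixes of $d$ inside a leaf range follow from rank plus select; the LCA step on the tree encoding is omitted.

```python
from bisect import bisect_right


class DocumentArray:
    """Toy document array with access/rank/select (1-based positions)."""

    def __init__(self, da):
        self.da = list(da)
        self.pos = {}                      # document id -> sorted positions
        for i, d in enumerate(self.da, start=1):
            self.pos.setdefault(d, []).append(i)

    def access(self, i):
        return self.da[i - 1]

    def rank(self, d, i):
        """Number of occurrences of d in DA[1..i]."""
        return bisect_right(self.pos.get(d, []), i)

    def select(self, d, j):
        """Position of the j-th occurrence of d, or -1 if there is none."""
        p = self.pos.get(d, [])
        return p[j - 1] if 1 <= j <= len(p) else -1


def term_frequency(da, d, i, j):
    """Occurrences of d in DA[i..j], via two rank queries."""
    return da.rank(d, j) - da.rank(d, i - 1)


def first_last_in_range(da, d, i, j):
    """Positions of the first and last occurrence of d in DA[i..j], or None."""
    before = da.rank(d, i - 1)
    upto = da.rank(d, j)
    if upto == before:
        return None
    return da.select(d, before + 1), da.select(d, upto)
```

In the index, the origin would then be the LCA of the two leaves returned by `first_last_in_range`.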
Thus, the origin array can be replaced completely by the origin-child array. Recall that each node maintains the I-structure entries in sorted order of the origins, so that the corresponding origin-child array will be monotonically increasing. In addition, the value of each entry is between 1 and $degree(w)$, so that the array can be encoded using a bit vector of length $|I_w| + degree(w)$.⁷ The total size of the bit vectors associated with all nodes can be bounded by $\sum_{w \in GST} (|I_w| + degree(w)) = O(n)$ bits. The $O(n \log n)$-bit predecessor search structure over the origin array is replaced by a structure of $o(n)$ bits of space and $O(\log\log n)$ search time.⁸

⁶ Any $n$-node ordered tree can be represented in $2n + o(n)$ bits, such that if each node is labeled by its pre-order rank in the tree, any of the following operations can be supported in constant time [16]: $parent(i)$, which returns the parent of node $i$; $child(i, q)$, which returns the $q$-th child of node $i$; $child\text{-}rank(i)$, which returns the number of siblings to the left of node $i$; $lca(i, j)$, which returns the lowest common ancestor of two nodes $i$ and $j$; and $lmost\text{-}leaf(i)$ / $rmost\text{-}leaf(i)$, which return the leftmost/rightmost leaf of node $i$.

Encoding far entries: In order to encode the origin values in far entries, we introduce the following notions. Let $w^*$ be a marked node; then another node $w^*_q$ is called its $q$-th marked child if $w^*_q$ is the $q$-th smallest (in terms of pre-order rank) marked node with $w^*$ as its lowest marked ancestor. Given the pre-order rank of $w^*$, the pre-order rank of $w^*_q$ can be computed in constant time by maintaining an additional $O(n)$-bit structure.⁹ Let $I_{w^*}$ represent the combined I-structure (with only far entries) associated with a marked node $w^*$.
The origin value of any far entry in $I_{w^*}$ is always a node in the subtree of some marked child $w^*_q$ of $w^*$, and is always unique for a given $q$ and $doc = d$. Thus, by maintaining the information about $w^*_q$ (origin-child$^* = q$), we can decode the corresponding origin value for a particular document $d$; i.e., the origin is the LCA of the leaves corresponding to the first and last suffixes of $d$ in the subtree of $w^*_q$, which can be computed using the tree encoding of GST and a constant number of rank/select operations on $DA$ in $O(\log\log D)$ total time. Now the origin array can be replaced by the origin-child$^*$ array, which can be encoded in $\sum_{w^* \in GST^*} (|I_{w^*}| + degree(w^*)) = O(n)$ bits (using a scheme similar to the one for encoding the origin-child array for near entries). The predecessor search structure is replaced by an $o(n)$-bit sampled predecessor search structure.

Query Answering: The query answering algorithm remains the same as that in our linear index, except for the fact that decoding origin and term-frequency takes $O(\log\log D)$ time. Then the time complexities for the steps in Lemma 3 are as follows: Step (i) $O((\pi \log\log n + k) \log\log D)$, Step (ii) $O((\log\pi + k)(\log\pi + \log\log D))$, and Step (iii) $O(((p/\pi) \log\log n + k) \log\log D)$. Since the term-frequencies are positive integers $\le n$, we shall use a y-fast trie [31] to get the sorted answer in $O(k \log\log n)$ time. By choosing $\pi = \log^2\log n$, the query time can be bounded by $O(t_s(p) + p + \log^4\log n + k \log\log n)$, which gives the query time in Theorem 1. Here $t_s(p)$ is the time for the initial pattern search in $CSA$, and is $\Omega(p)$ for space-optimal CSAs [5,1].
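The bit-vector encoding used for the monotone origin-child arrays can be made concrete. The sketch below writes, for each element, one 1-bit per unit of increase followed by a 0-bit, so that the $i$-th element equals the number of 1s preceding the $i$-th 0. The function names are hypothetical, and the linear scan in `decode_at` stands in for the constant-time rank/select structure of [27].

```python
def encode_monotone(seq):
    """Encode a non-decreasing sequence of positive integers as a list of bits."""
    bits = []
    prev = 0
    for x in seq:
        bits.extend([1] * (x - prev))  # unary-coded increase since the last element
        bits.append(0)                 # one 0 terminates each element
        prev = x
    return bits


def decode_at(bits, i):
    """Return the i-th element (1-based): the number of 1s before the i-th 0."""
    zeros = ones = 0
    for b in bits:
        if b == 0:
            zeros += 1
            if zeros == i:
                return ones
        else:
            ones += 1
    raise IndexError("sequence has fewer than i elements")
```

The encoding uses one bit per element plus one bit per unit of the largest value, which is where the $|I_w| + degree(w)$ bound comes from.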
Space Analysis: The index consists of a full text index of $|CSA|$ bits, $DA$ of $n \log D\,(1 + o(1))$ bits, I-structures of $2n(\log D + O(\log\pi) + O(1))$ bits in total, tree encodings, RMQ structures and sampled predecessor search structures (together $O(n)$ bits). By choosing $\pi = \log^2\log n$, the index space can be bounded by $|CSA| + n \log D\,(3 + o(1)) + O(n \log\log\log n)$ bits. In order to obtain the space bounds in Theorem 1, we may categorize $D$ into the following two cases.

⁷ A monotonically increasing sequence $S = 1333445$ can be encoded as $B = 101100010010$ in $|B|(1 + o(1))$ bits, where $S[i] = rank_1(select_0(i))$ on $B$, and can be computed in constant time [27].
⁸ Construct a new array by sampling every $\log^2 n$-th element in the original array, and maintain a predecessor search structure over it. Now, when we perform the query, we can first query this sampled structure to get an approximate answer, and the exact answer can be obtained by performing a binary search on a smaller range of only $\log^2 n$ elements in the original array. The search time still remains $O(\log\log n)$.
⁹ Let $GST^*$ be the tree induced by the marked nodes in GST, so that $w^*$ is the lowest marked ancestor of $w^*_q$ in GST if and only if the node corresponding to $w^*$ in $GST^*$ (say, $w$) is the parent of the node corresponding to $w^*_q$ (say, $w_q$) in $GST^*$. Moreover, $w^*_q$ is said to be the $q$-th marked child of node $w^*$ in GST if $w_q$ is the $q$-th child of $w$ in $GST^*$. Given the pre-order rank of any marked node in GST, its pre-order rank in $GST^*$ (and vice versa) can be computed in constant time by maintaining an additional bit vector of $2n + o(n)$ bits which records whether each node is marked or not.

1. When $\log D / \log\log D > \log\log\log n$, the $O(n \log\log\log n)$ term can be absorbed in $o(n \log D)$.
The space can be further reduced by $n \log D$ bits from the following observation: the term-frequency is 1 for those I-structure entries whose origin is a leaf in GST, and there are $n$ such entries. Therefore all such entries can be deleted, and if such a document is within the top-$k$, it can be reported using document listing. For that we shall use Muthukrishnan's chain array idea [21]. The chain array $C[1..n]$ is defined as follows: $C[i] = j$, where $j < i$ is the largest number with $DA[i] = DA[j]$, and it can be simulated using $DA$ as $j = select(DA[i], rank(DA[i], i) - 1)$ in $O(\log\log D)$ time. Thus we do not maintain the chain array itself, but only a $2n + o(n) = o(n \log D)$-bit RMQ structure [6] over it. Let $[L, R]$ be the suffix range of $P$ in the full text index; then document listing can be performed (in $O(\log\log D)$ time per document) by reporting all those documents $DA[i]$ such that $L \le i \le R$ and $C[i] < L$ using repeated RMQs. Although those documents with frequency $> 1$ will get retrieved again (but only once), this will not affect the overall time complexity.

2. When $\log D / \log\log D \le \log\log\log n$, we shall use the index described in Theorem 4. Thus the space and query bounds will be $|CSA| + n \log D\,(1 + o(1))$ bits and $O(t_s(p) + \log\log n + k \log D \log^2\log D) = O(t_s(p) + k \log\log n)$, respectively.

By combining the above cases, we get the result in Theorem 1. □

Theorem 4. There exists an index of size $|CSA| + n \log D\,(1 + o(1))$ bits with a query time of $O(t_s(p) + \log\log n + k \log D \log^2\log D)$ for retrieving top-$k$ documents with the highest term frequencies for a query pattern $P$ of length $p$.

Proof. See Appendix C.

6 Saving More Space

The most space-efficient version of our index (described in Theorem 2) is proved in this section.
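Since the chain array from the space analysis of Section 5 is reused in this section, a small sketch of chain-array document listing may help. It assumes the chain array is stored explicitly, with 0 as a sentinel for a first occurrence, and uses a linear scan in place of the succinct range-minimum structure; the function names are hypothetical. Positions are 1-based as in the text.

```python
def chain_array(da):
    """C[i] = largest j < i with DA[j] == DA[i], or 0 if there is none (1-based)."""
    last = {}
    c = [0] * (len(da) + 1)            # index 0 is unused
    for i, d in enumerate(da, start=1):
        c[i] = last.get(d, 0)
        last[d] = i
    return c


def list_documents(da, left, right):
    """Report each distinct document in DA[left..right] exactly once."""
    c = chain_array(da)
    out = []
    stack = [(left, right)]
    while stack:
        lo, hi = stack.pop()
        if lo > hi:
            continue
        # Range-minimum query on C; a succinct RMQ would replace this scan.
        m = min(range(lo, hi + 1), key=lambda i: c[i])
        if c[m] >= left:
            continue  # every position here has an earlier occurrence inside [left, right]
        out.append(da[m - 1])          # position m is the first occurrence in the range
        stack.append((lo, m - 1))
        stack.append((m + 1, hi))
    return out
```

Each reported document costs a constant number of range-minimum queries, which matches the per-document cost structure described in the text.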
First, we give the following auxiliary lemma (see Appendix D for the proof).

Lemma 4. There exists an $O(n \log\sigma \log\log n)$-bit structure which can answer access/rank/select queries on $DA$ in $O(\log^2\log n)$ time, and can compute an entry $C[i]$ in the chain-array data structure (for document listing) in $O(\log\log n)$ time.

To achieve the space reduction, we categorize $D$ into the following cases:

1. $\log D < (\log\sigma \log\log n)^{1+\epsilon/2}$: We shall use the index described in Theorem 4, and the query time will be $O(t_s(p) + k(\log\sigma \log\log n)^{1+\epsilon})$.

2. $\log D \ge (\log\sigma \log\log n)^{1+\epsilon/2}$: In this case $DA$ is replaced by the structure described in Lemma 4, which makes the index space $n \log D\,(1 + o(1))$ bits. Then, by re-deriving the bounds with $\pi = \log^3\log n$, our query time will be $O(t_s(p) + \log^6\log n + k \log^2\log n)$. The $O(k \log^2\log n)$ term can be further improved to $O(k \log\log n)$ from the following observation: once we get the I-structure boundaries, we do not need any information about the origin fields for further query processing. Thus the only value needed is the term-frequency, which can be computed as follows: a sampled document array $DA_s$ is maintained, such that $DA[i] = d$ is stored if and only if $rank_{DA}(d, i) \bmod \rho = 0$, for an integer $\rho = \Theta(\log D)$; otherwise we store a NIL value. Here $rank_{DA}(d, j)$ is the number of occurrences of $d$ in $DA[1..j]$. Then $DA_s$ can be maintained in $O(n \log D / \rho) = O(n)$ bits and can compute an approximate rank. That is, $\rho \cdot rank_{DA_s}(d, j) \le rank_{DA}(d, j) \le \rho \cdot rank_{DA_s}(d, j) + \rho$. Thus, associated with each I-structure entry, we shall store this error (of magnitude $O(\rho) = O(\log D)$), which is equal to the actual term-frequency minus the approximate term-frequency (computed using $DA_s$).
Thus, by storing this error for each I-structure entry in a total of $O(n \log\rho) = O(n \log\log D) = o(n \log D)$ bits, the term-frequency can be obtained in $O(\log\log D) = O(\log\log n)$ time by first computing the approximate term-frequency using $D_A^s$ and then adding the stored value. Note that for the initial I-structure boundary searches, the origin decoding is performed using the structure in Lemma 4. Moreover, this structure can compute chain array values in $O(\log\log n)$ time, which can be used for document listing in $O(\log\log n)$ time per reported document (needed when the I-structure entries with term-frequency 1 have been deleted from the index and such a document is an answer to a query).

By combining the above cases, we obtain an index of $|CSA| + n \log D (1 + o(1))$ bits with query time $O(t_s(p) + k(\log\sigma \log\log n)^{1+\epsilon} + \log^6\log n)$, which completes the proof of Theorem 2. □

References

1. D. Belazzougui and G. Navarro. Alphabet-Independent Compressed Text Indexing. In ESA, pages 748–759, 2011.
2. D. Belazzougui and G. Navarro. Improved Compressed Indexes for Full-Text Document Retrieval. In SPIRE, pages 386–397, 2011.
3. M. Blum, R. W. Floyd, V. Pratt, R. Rivest, and R. Tarjan. Time Bounds for Selection. Journal of Computer and System Sciences, 7(4):448–481, 1973.
4. S. Culpepper, G. Navarro, S. Puglisi, and A. Turpin. Top-k Ranked Document Search in General Text Databases. In ESA, pages 194–205, 2010.
5. P. Ferragina, G. Manzini, V. Mäkinen, and G. Navarro. Compressed Representations of Sequences and Full-Text Indexes. ACM Trans. Alg., 3(2): art. 20, 2007.
6. J. Fischer. Optimal Succinctness for Range Minimum Queries. In LATIN, pages 158–169, 2010.
7. G. N. Frederickson. An Optimal Algorithm for Selection in a Min-Heap. Information and Computation, 104(2):197–214, 1993.
8. T. Gagie, G. Navarro, and S. J. Puglisi.
Colored Range Queries and Document Retrieval. In SPIRE, pages 67–81, 2010.
9. A. Golynski, J. I. Munro, and S. S. Rao. Rank/Select Operations on Large Alphabets: A Tool for Text Indexing. In SODA, pages 368–373, 2006.
10. R. Grossi and J. S. Vitter. Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching. SIAM Journal on Computing, 35(2):378–407, 2005.
11. R. Grossi, A. Gupta, and J. S. Vitter. High-Order Entropy-Compressed Text Indexes. In SODA, pages 841–850, 2003.
12. W. K. Hon, M. Patil, R. Shah, and S.-B. Wu. Efficient Index for Retrieving Top-k Most Frequent Documents. Journal of Discrete Algorithms, 8(4):402–417, 2010.
13. W. K. Hon, R. Shah, S. V. Thankachan, and J. S. Vitter. String Retrieval for Multi-pattern Queries. In SPIRE, pages 55–66, 2010.
14. W. K. Hon, R. Shah, and J. S. Vitter. Space-Efficient Framework for Top-k String Retrieval Problems. In FOCS, pages 713–722, 2009.
15. W. K. Hon, R. Shah, and J. S. Vitter. Compression, Indexing, and Retrieval for Massive String Data. In CPM, pages 260–274, 2010.
16. J. Jansson, K. Sadakane, and W. K. Sung. Ultra-succinct Representation of Ordered Trees. In SODA, pages 575–584, 2007.
17. M. Karpinski and Y. Nekrich. Top-k Color Queries for Document Retrieval. In SODA, pages 401–411, 2011.
18. U. Manber and G. Myers. Suffix Arrays: A New Method for On-Line String Searches. SIAM Journal on Computing, 22(5):935–948, 1993.
19. Y. Matias, S. Muthukrishnan, S. C. Sahinalp, and J. Ziv. Augmenting Suffix Trees, with Applications. In ESA, pages 67–78, 1998.
20. E. M. McCreight. A Space-Economical Suffix Tree Construction Algorithm. Journal of the ACM, 23(2):262–272, 1976.
21. S. Muthukrishnan. Efficient Algorithms for Document Retrieval Problems. In SODA, pages 657–666, 2002.
22. G. Navarro and Y. Nekrich. Top-k Document Retrieval in Optimal Time and Linear Space.
In SODA, pages 1066–1077, 2012.
23. G. Navarro, S. J. Puglisi, and D. Valenzuela. Practical Compressed Document Retrieval. In SEA, pages 193–205, 2011.
24. G. Navarro and D. Valenzuela. Space-Efficient Top-k Document Retrieval. To appear in SEA, 2012.
25. G. Navarro and S. J. Puglisi. Dual-Sorted Inverted Lists. In SPIRE, pages 309–321, 2010.
26. M. Patil, S. V. Thankachan, R. Shah, W. K. Hon, J. S. Vitter, and S. Chandrasekaran. Inverted Indexes for Phrases and Strings. In SIGIR, pages 555–564, 2011.
27. R. Raman, V. Raman, and S. Rao. Succinct Indexable Dictionaries with Applications to Encoding k-ary Trees, Prefix Sums and Multisets. ACM Transactions on Algorithms, 3(4), 2007.
28. K. Sadakane. Succinct Data Structures for Flexible Text Retrieval Systems. Journal of Discrete Algorithms, 5(1):12–22, 2007.
29. N. Välimäki and V. Mäkinen. Space-Efficient Algorithms for Document Retrieval. In CPM, pages 205–215, 2007.
30. P. Weiner. Linear Pattern Matching Algorithms. In SWAT, 1973.
31. D. E. Willard. Log-logarithmic Worst-Case Range Queries Are Possible in Space $\Theta(N)$. Information Processing Letters, 17(2):81–84, 1983.

A Proof of Lemma 1

In [14], Hon et al. described an $O(t + k \log k)$-time algorithm for retrieving the $k$ largest numbers in sorted order. However, if sorted order is not necessary, the time can be improved to $O(t + k)$ based on the following result of Frederickson [7]: the $k$th largest number from a set of numbers maintained in a binary max heap $\Delta$ can be retrieved in $O(k)$ time by visiting $O(k)$ nodes in $\Delta$. In order to solve our problem, we consider a conceptual binary max heap $\Delta$ as follows. Let $\Delta'$ denote the balanced binary subtree with $t$ leaves that is located at the top part of $\Delta$ (with the same root). Each of the $t - 1$ internal nodes in $\Delta'$ holds the value $\infty$.
The $i$th leaf node $\ell_i$ in $\Delta'$ (for $i = 1, 2, \ldots, t$) holds the value $A[M_i]$, which is the maximum element in the interval $A[L_i..R_i]$. The values held by the nodes below $\ell_i$ are defined recursively as follows: for a node $\ell$ storing the maximum element $A[M]$ from the range $A[L..R]$, its left child stores the maximum element in $A[L..(M-1)]$ and its right child stores the maximum element in $A[(M+1)..R]$. Note that this is a conceptual heap which is built on the fly, where the value associated with a node is computed in constant time from the RMQ structures only when needed. Therefore, we first find the $(t - 1 + k)$th largest element $X$ in this heap by visiting $O(t + k)$ nodes (with $O(t + k)$ RMQ queries) using Frederickson's algorithm. Then, we obtain all numbers in $\Delta$ which are $\ge X$ in $O(t + k)$ time by a pre-order traversal of $\Delta$, in which we do not visit the subtree of a node whose value is $< X$. From the retrieved numbers, we delete all the $\infty$ values and then separate out the $k$ largest elements in $O(t + k)$ time.

B Proof of Lemma 2

In order to answer the above query, we maintain $A$ in the form of a wavelet tree [11], which is an ordered balanced binary tree of $n$ leaves, where each leaf is labeled with a symbol in $\Pi$, and the leaves are sorted alphabetically from left to right. Each internal node $w_q$ represents an alphabet set $\Pi_q$ and is associated with a bit-vector $B_q$. In particular, the alphabet set of the root is $\Pi$, and the alphabet set of a leaf is the singleton set containing its corresponding symbol. Each node partitions its alphabet set among its two children (almost) equally, such that all symbols represented by the left child are lexicographically (or numerically) smaller than those represented by the right child.
For a node $w_q$, let $A_q$ be the subsequence of $A$ obtained by retaining only those symbols that are in $\Pi_q$. Then $B_q$ is a bit-vector of length $|A_q|$, such that $B_q[i] = 0$ if $A_q[i]$ is a symbol represented by the left child of $w_q$, and $B_q[i] = 1$ otherwise. Indeed, the subtree rooted at $w_q$ itself forms a wavelet tree of $A_q$. To reduce the space requirement, the array $A$ is not stored explicitly in the wavelet tree. Instead, we only store the bit-vectors $B_q$, each of which is augmented with Raman et al.'s scheme [27] to support constant-time rank/select operations. The total size of the bit-vectors and the augmenting structures in a particular level of the wavelet tree is $n(1 + o(1))$ bits. We maintain an additional range maximum query (RMQ) structure [6] over the scores of all elements of the sequence $A_q$ (in $O(|A_q|)$ bits). As there are $\log\pi$ levels in the wavelet tree, the total space is $O(n \log\pi)$ bits. Note that the value of any $A_q[i]$, for any given $w_q$ and $i$, can be computed in $O(\log\pi)$ time by traversing $\log\pi$ levels in the wavelet tree. Similarly, any given range $[x'...x'']$ can be translated to $w_q$ as $[x'_q..x''_q]$ in $O(\log\pi)$ time, where $A_q[x'_q..x''_q]$ is the subsequence of $A[x'...x'']$ containing only those elements in $\Pi_q$.

The desired $k$ highest-scoring entries can be found as follows. First, the given range $[y', y'']$ can be split into at most $2 \log\pi$ disjoint subranges, such that each subrange is represented by the set $\Pi_q$ associated with some internal node $w_q$. All the numbers in the subsequence $A_q$ associated with such an internal node $w_q$ satisfy the condition $y' \le A_q[i] \le y''$. For all such (at most $2 \log\pi$) sequences $A_q$, the range $[x', x'']$ can be translated into the corresponding range $[x'_q, x''_q]$ in $O(\log^2\pi)$ total time. Now, we can apply Lemma 1 (where $t \le 2 \log\pi$) to solve the desired query.
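The way Lemma 1 is applied to the $t$ translated ranges can be sketched as follows. This is only an illustration under stated assumptions: it explores the conceptual heap with an ordinary binary heap (so it reports results in decreasing order, in the spirit of the $O(t + k \log k)$ approach of [14]) rather than with Frederickson's selection algorithm, and a linear scan stands in for the constant-time RMQ structure. The names `range_argmax` and `top_k_in_ranges` are hypothetical.

```python
import heapq


def range_argmax(values, lo, hi):
    """Index of the maximum of values[lo..hi]; a stand-in for an RMQ structure."""
    return max(range(lo, hi + 1), key=values.__getitem__)


def top_k_in_ranges(values, ranges, k):
    """Return up to k (index, value) pairs with the largest values among
    the given disjoint inclusive ranges, in decreasing order of value."""
    heap = []
    for lo, hi in ranges:
        if lo <= hi:
            m = range_argmax(values, lo, hi)
            heapq.heappush(heap, (-values[m], m, lo, hi))
    out = []
    while heap and len(out) < k:
        neg, m, lo, hi = heapq.heappop(heap)
        out.append((m, -neg))
        # The two children of a heap node cover the ranges left and right of m.
        for a, b in ((lo, m - 1), (m + 1, hi)):
            if a <= b:
                c = range_argmax(values, a, b)
                heapq.heappush(heap, (-values[c], c, a, b))
    return out


values = [5, 1, 9, 3, 7, 2, 8, 6]
best = top_k_in_ranges(values, [(0, 2), (4, 6)], 3)
```

Each reported element triggers at most two further range-maximum queries, which mirrors the node-visiting argument in the proof of Lemma 1.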
However, retrieving a node value in the conceptual max heap (in the proof of Lemma 1) requires us to compute the score of $A_q[i]$ for some $w_q$ and $i$ on the fly. We do so by first finding the entry $A[i']$ that corresponds to $A_q[i]$, and then retrieving the score of $A[i']$. This takes $O(\log\pi + t_{score})$ time, so the total query time is bounded by $O(\log^2\pi + (2\log\pi + k)(\log\pi + t_{score})) = O((\log\pi + k)(\log\pi + t_{score}))$.

C Proof of Theorem 4

A simple index can be derived based on the succinct framework proposed by Hon et al. [14] and Gagie et al. [8], which consists of the compressed version of GST ($CSA$ and a tree encoding) and the document array $D_A$ (in $n \log D + O(n \log D / \log\log D)$ bits of space, with rank/select/access capabilities in $O(\log\log D)$ time for any $d \in \mathcal{D}$ [9]). Also, for a particular value $q$ to be defined, we group every $g = q \log D \log\log D$ leaves in the GST together (from left to right) and mark the lowest common ancestor (LCA) of all these leaves. Further, we mark the LCA of all pairs of marked nodes. Thus the number of marked nodes in GST can be bounded by $O(n/g)$ [14]. For each marked node, we maintain the top-$q$ documents in its subtree explicitly, which takes $O(n/g \times q \log D) = O(n / \log\log D)$ bits. We perform this marking and store the top-$q$ answers for $q = 1, 2, 4, \ldots$, which takes $O(n \log D / \log\log D)$ bits of storage space. The total index space can thus be bounded by $|CSA| + O(n) + n \log D + O(n \log D / \log\log D) = |CSA| + n \log D (1 + o(1))$ bits (assuming $D > \sqrt{\log\log n}$).

In order to retrieve the top-$k$ answers corresponding to a suffix range, we first search for $P$ in GST and obtain its locus node $locus(P)$ in $O(p)$ time. Further, we round the value of $k$ up to the next power of 2, say $q$.
Now we search for a marked node $locus^*(P)$ (corresponding to this $q$), which is the same as $locus(P)$ if $locus(P)$ is marked; otherwise it is the highest marked descendant of $locus(P)$. Let $[L, R]$ be the suffix range of $locus(P)$ and $[L^*, R^*]$ be the suffix range of $locus^*(P)$. The leaves corresponding to the ranges $[L, L^* - 1]$ and $[R^* + 1, R]$ are called fringe leaves. It is easy to show that the number of fringe leaves is at most $2g$ (see [14]). Hence, in order to retrieve the top-$k$ answers, we first check the top-$q$ answers stored at $locus^*(P)$ (and compute their scores), and then retrieve the score of each of the at most $2g$ documents corresponding to the fringe leaves. Recall that the score of a document $d$ is the frequency of $P$ in $d$, which can be computed in $O(\log\log D)$ time. Thus, the total time can be bounded by $O(t_s(p) + (g + k) \log\log D) = O(t_s(p) + k \log D \log^2\log D)$.

Next, we find the top-$k$ answers from this candidate set of $2g + q < 2g + 2k$ documents. As there may be repetitions in the set, we first remove them by scanning the set once (using an auxiliary bit vector of length $D$ to mark whether we have already seen a document). After that, we find the document $d$ which has the $k$th highest frequency in $O(k + g) = O(k \log D \log\log D)$ time [3]. Finally, we isolate the top-$k$ answers in unsorted order based on the score of $d$, and sort them in $O(k \log k) = O(k \log D)$ time. If $D \le \sqrt{\log\log n}$, we can retrieve the term-frequency of all documents in $\mathcal{D}$ and trivially find the top-$k$ documents in $O(D \log D) = O(\log\log n)$ time. Putting everything together, the overall query time can be bounded by $O(t_s(p) + \log\log n + k \log D \log^2\log D)$.

D Proof of Lemma 4

Let $CSA$ be the compressed suffix array corresponding to the suffix array associated with GST.
Let $t_{sa}$ and $t_{sa^{-1}}$ denote the time for computing $SA[i]$ (the starting position of the $i$th smallest suffix of $T$) and the time for computing $SA^{-1}[j]$ (the rank of the $j$th suffix $T[j...n]$ among all suffixes of $T$), respectively. Hon et al. [14] showed that the above operations on $D_A$ can be simulated by an index of size $2|CSA| + o(n)$, and the best query time complexities are due to Belazzougui and Navarro [2]. We summarize these results below.

We maintain $CSA$ corresponding to GST and the compressed suffix arrays $CSA_d$ (for $d = 1, 2, 3, \ldots, D$) corresponding to each individual document. Now, $access(i)$ can be obtained by computing $SA[i]$ in $CSA$ and identifying the document that contains this text position. For $select(d, j)$, we first compute the $j$th smallest suffix in $CSA_d$ and obtain the position $pos$ of this suffix within document $d$, from which we can easily obtain the position $pos'$ of this suffix within the concatenated text of all documents. After that, we compute $SA^{-1}[pos']$ in $CSA$ as the desired answer for $select(d, j)$. By doing a binary search on $select$, $rank(d, i)$ can be obtained in $O((t_{sa} + t_{sa^{-1}}) \log n)$ time. This time can be improved to $O((t_{sa} + t_{sa^{-1}}) \log\log n)$ as follows: at every $(\log^2 n)$th leaf of each $CSA_d$, we explicitly maintain its corresponding position in $CSA$ and maintain a predecessor structure over these positions [31]. The size of this additional structure is $o(n)$ bits. When we perform the query, we first query this predecessor structure to get an approximate answer, and the exact answer is then obtained by performing a binary search on a smaller range of only $\log^2 n$ leaves. By choosing the $O(n \log\sigma \log\log n)$-bit CSA of Grossi and Vitter [10], where $t_{sa}$ and $t_{sa^{-1}}$ are both $O(\log\log n)$, we obtain the lemma.
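The simulation of access/select/rank described above can be sketched in a few lines. This is a toy illustration under several assumptions that are not part of the paper: plain (uncompressed) suffix arrays built by naive sorting replace $CSA$ and $CSA_d$, each document is terminated by a separator `$` assumed to be smaller than every document character, positions are 0-based, and the sampled predecessor structure is omitted so that `rank` uses the plain binary search on `select`. The class name `DocArraySimulator` is hypothetical.

```python
from bisect import bisect_right


def suffix_array(text):
    """Naive suffix array: starting positions sorted by suffix."""
    return sorted(range(len(text)), key=lambda i: text[i:])


class DocArraySimulator:
    def __init__(self, docs, sep="$"):
        self.texts = [d + sep for d in docs]
        self.starts = []          # start offset of each document in the concatenation
        offset = 0
        for t in self.texts:
            self.starts.append(offset)
            offset += len(t)
        concat = "".join(self.texts)
        self.sa = suffix_array(concat)
        self.isa = [0] * len(concat)          # inverse suffix array
        for r, p in enumerate(self.sa):
            self.isa[p] = r
        self.doc_sa = [suffix_array(t) for t in self.texts]

    def access(self, i):
        """D_A[i]: the document containing text position SA[i]."""
        return bisect_right(self.starts, self.sa[i]) - 1

    def select(self, d, j):
        """Position in D_A of the j-th (1-based) occurrence of document d."""
        pos = self.doc_sa[d][j - 1]               # position inside document d
        return self.isa[self.starts[d] + pos]     # map to the concatenation, then SA^{-1}

    def rank(self, d, i):
        """Number of occurrences of d in D_A[0..i], by binary search on select."""
        lo, hi = 0, len(self.texts[d])
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if self.select(d, mid) <= i:
                lo = mid
            else:
                hi = mid - 1
        return lo


sim = DocArraySimulator(["abab", "ba", "aab"])
doc_array = [sim.access(i) for i in range(len(sim.sa))]
```

The binary search is valid because, with a smallest separator, the suffixes of one document appear in the same relative order in its own suffix array and in the suffix array of the concatenation, so `select(d, j)` is increasing in `j`.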
An entry in the chain array is $C[i] = j$ if $j < i$ is the largest number with $D_A[i] = D_A[j]$ (say, $d$), and is NIL if there is no such $j$. We use the following steps to compute $j$. Using $SA[i]$, we compute the starting position of the lexicographically $i$th smallest suffix of the concatenated text and the corresponding value $d$. Let this be the lexicographically $i_d$th smallest suffix of $d$; then the $(i_d - 1)$th smallest suffix of $d$ can be computed using one $SA_d$ and one $SA_d^{-1}$ operation. Further, we map this text position in $d$ back to the concatenated text and perform an $SA^{-1}$ operation on it to obtain $j$. The total time can be bounded by $O(\log\log n)$.
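To make the role of the chain array concrete, the following sketch builds $C$ explicitly and performs document listing by repeated range-minimum queries, following Muthukrishnan's idea [21] as described in Section 5. It is an illustration only: positions are 0-based, NIL is represented by -1, the chain array is stored explicitly rather than simulated, and a linear scan stands in for the succinct RMQ structure. The function names are hypothetical.

```python
def build_chain_array(doc_array):
    """chain[i] = largest j < i with doc_array[j] == doc_array[i], or -1 (NIL)."""
    last_seen = {}
    chain = []
    for i, d in enumerate(doc_array):
        chain.append(last_seen.get(d, -1))
        last_seen[d] = i
    return chain


def list_documents(doc_array, chain, left, right):
    """Report every distinct document in doc_array[left..right] exactly once.

    Position i is reported iff chain[i] < left, i.e. i is the first
    occurrence of its document inside the query range.
    """
    result = []
    stack = [(left, right)]
    while stack:
        lo, hi = stack.pop()
        if lo > hi:
            continue
        m = min(range(lo, hi + 1), key=chain.__getitem__)  # naive RMQ
        if chain[m] >= left:
            continue  # no first occurrence anywhere in [lo, hi]
        result.append(doc_array[m])
        stack.append((lo, m - 1))
        stack.append((m + 1, hi))
    return result


doc_array = [0, 1, 0, 2, 1, 0, 2, 2]
chain = build_chain_array(doc_array)
listed = list_documents(doc_array, chain, 2, 6)
```

Every range-minimum query either reports a new document or prunes a whole subrange, so the number of queries is proportional to the number of documents reported.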
