Succinct Data Structures for Retrieval and Approximate Membership

The retrieval problem is the problem of associating data with keys in a set. Formally, the data structure must store a function f: U ->{0,1}^r that has specified values on the elements of a given set S, a subset of U, |S|=n, but may have any value on…

Authors: Martin Dietzfelbinger, Rasmus Pagh

Succinct Data Structures for Retriev al and Appro ximate Mem b ership ∗ Martin Dietzfelbinger † Rasm us Pagh ‡ No vem b er 1, 2018 Abstract The r etrieval pr oblem is the problem of asso cia ting data with keys in a set. F or mally , the data structure must store a function f : U → { 0 , 1 } r that ha s s pe cified v a lues on the e le ment s of a given set S ⊆ U , | S | = n , but may hav e any v a lue o n element s outside S . Minimal per fect hashing makes it po s sible to avoid stor ing the set S , but this induces a space o verhead o f Θ( n ) bits in addition to the nr bits needed for function v alues. In this pap er we show how to eliminate this overhead. Moreover, we show tha t for a ny k quer y time O ( k ) can b e achiev e d using s pace that is within a factor 1 + e − k of optimal, asymptotically fo r large n . If we allow logarithmic ev aluation time, the a dditive overhead can be r e duced to O (log lo g n ) bits whp. The time to construct the data structure is O ( n ), exp e cted. A main tec hnica l ingre dient is to utilize existing tight b ounds on the probability of almos t squar e rando m matrices with r ows of low weigh t to hav e full row ra nk . In addition to direct constr uctions, we p oint out a close connectio n betw een retriev a l structur es and ha sh tables where keys are store d in an ar ray and some kind of probing sc heme is used. F urther, w e prop ose a genera l r eduction that trans fer s the results o n retriev al in to analo gous res ults on appr oximate memb ership , a problem tra ditionally a ddressed using Blo om filter s. Again, we show how to eliminate the space overhead present in previously known methods, and g et arbitr a rily close to the lower b ound. The ev aluation pro cedures of our data structures a re extremely simple (similar to a Blo om filter). F o r the results stated a bove we assume fr ee access to fully ra ndom hash functions. 
How ever, we show how to justify this assumption using extra space o ( n ) to simulate full rando mness on a RAM. ∗ The main ideas for this paper were conceived while the authors were participating in the 2006 Seminar on Data Structures at IBFI S chloss Dagstuhl, Germany . † F aculty of Computer Science and Automation, T echnisc he U niversi t¨ at Ilmenau, P .O.Bo x 100565, 98684 Ilmenau, German y , email: martin.d ietzfelbin ger@tu-ilm enau.de ‡ Computational Logic and Algorithms Group, IT U niversi ty of Cop enh agen, Rued Langgaards V ej 7, 2300 Køb enhavn S, D enmark, email: pagh@itu.dk 1 1 In tro duction Supp ose we wan t to build a d ata str ucture that is ab le to d istinguish b et ween girls’ and b o ys’ names, in a collection of n names. Given a str ing not in the set of names, the data structure m a y return an y answe r. It is clear that in the worst case this data structure needs at least n bits, even if it is give n access to the list of names. The previously b est solution (implicit in [22]) th at do es not require the set of n ames to b e stored r equires around 1 . 23 n bits. Surpr isingly , as w e will s ee in this pap er, n + o ( n ) b its is enough, still allo wing fast queries. If “global” hash f unctions, shared among all data structures, are a v ailable th e space u sage drops all the wa y to n + O (log log n ) b its whp. This is a rare example of a data stru cture with n on-trivial functionalit y and a space usage that essen tially matc hes th e ent rop y lo wer b ound . 1.1 Problem definition The dictionary pr oblem consists of storing a s et S of n k eys, and r bits of data asso ciated w ith eac h k ey . A lo okup query for x rep orts whether or n ot x ∈ S , and in the p ositiv e case rep orts the data asso ciated with x . W e will denote th e size of S b y n , and assume th at keys come from a set U of size n O (1) . 
In th is p ap er, we restrict our selv es to the static pr oblem, w here S and the asso ciated data are fixed and do not change. W e study tw o relaxations of the static dictionary problem that allo w d ata str uctures using less sp ace than a fu ll-fled ged dictionary: • Th e r etrieval pr oblem differs fr om the dictionary pr oblem in that the set S do es not need to b e stored. A r etriev al qu ery on x ∈ S is required to rep ort the d ata asso ciated with x , while a retriev al query on x 6∈ S m ay retur n any r -bit strin g. • Th e appr oximate memb ership pr oblem consists of storing a data str u cture th at supp orts m em- b ersh ip queries in the follo w ing manner: F or a qu ery on x ∈ S it is r ep orted th at x ∈ S . F or a query on x 6∈ S it is rep orted with p r obabilit y at least 1 − ε that x 6∈ S , and with pr obabilit y at most ε that x ∈ S (a “false p ositiv e”). F or simplicit y we will assu me that ε is a negativ e p o w er of 2. The mo del of computation is a unit cost RAM with a standard instruction set. F or simplicit y we assume that a key fits in a single machine word, and that asso ciated v alues are no larger than keys. Some resu lts will assu me free access to fully rand om hash functions, such that an y function v alue can b e compu ted in constan t time. (Th is is explicitly stated in such cases.) 1.2 Motiv ation The appro ximate mem b ersh ip problem has attracted significan t in terest in recent y ears du e to a n um b er of applications, mainly in distribu ted systems and d atabase sys tems, where false p ositiv es can b e tolerated and space u s age is crucial (see [3] for a surv ey). Often the false p ositiv e probabilit y that can b e tolerated is r elativ ely large, say , in the range 1% − 10%, whic h en tails that the space usage can b e made muc h smaller than what wo uld b e required to store S exactly . 
The retriev al problem sho ws up in situations where th e amount of data asso ciated w ith eac h k ey is small, and it is either kno wn th at queries will only b e aske d on keys in S , or where the answ ers r eturned for k eys not in S d o n ot matter. As an example, s u pp ose that we ha ve ranked the URLs of the W orld Wide W eb on a 2 r step scale, w here r is a small in teger. Then a retriev al data structure w ould b e able to pro v id e the ranking of a giv en URL, without ha vin g to s tore the URL itself. The retriev al pr oblem is also the k ey to obtaining a space-optimal RAM data str ucture that is able to answer r ange queries in constant time [1, 25]. 2 1.3 Previous results Appro ximate membership. The stud y of app ro ximate mem b ersh ip w as initiated b y Blo om [5] who describ ed the Blo om filter data str ucture which provides an elegan t, near-optimal solution to the problem: The data structur e is a b it arr a y . Use k = log 2 (1 /ε ) hash fun ctions to asso ciate eac h k ey with k randomly located b its in the array . Set these bits to 1 for all x ∈ S , and put 0s elsewhere. O n a query for x , the Blo om filter r ep orts that x ∈ S if and only if all bits asso ciated with x are 1. Blo om sho wed 1 that a space usage of n log 2 (1 /ε ) log 2 e bits suffices f or a f alse p ositiv e probabilit y of ε . Carter e t al . [9 ] show ed that n log 2 (1 /ε ) bits are r equired for solving the app ro ximate mem b ership problem wh en | U | ≫ n (see App endix C f or details). Th us, the analysis of [5] sho ws that Blo om filters ha ve sp ace u sage w ithin a factor log 2 e ≈ 1 . 44 of the lo w er b ound, which is tigh t. Another approac h to app ro ximate mem b ersh ip is p erfe c t hashing . A minimal p erfect h ash func- tion for S maps the k eys of S bijectiv ely to [ n ] = { 0 , . . . , n − 1 } . 
Hagerup and T holey [19] show ed ho w to store a m inimal p er f ect h ash fu n ction h in a data stru cture of n log 2 e + o ( n ) bits suc h that it can b e ev aluated on a given input in constan t time. T his space usage is the b est p ossible, u p to the lo wer order term. No w store an arra y of n entries where, for eac h x ∈ S , ent ry h ( x ) con tains a log 2 (1 /ε )- bit hash signature q ( x ). When lo oking up a ke y x , we answe r x ∈ S if and only if th e hash signature at en try h ( x ) is equal to q ( x ). The origin of this idea is un kno wn to us, but it is describ ed e.g. in [3]. Th e sp ace u sage f or th e resulting d ata stru ctur e differs from the lo w er b ound n log 2 (1 /ε ) by th e space required for the minim um p erfect hash function, and impro v es up on Blo om filters when ε ≤ 2 − 4 and n is su fficien tly large. Mitzenmac her [24] considered the enc o ding p roblem where the task is to represen t and transmit an approximat e set repr esen tation (n o fast queries requir ed ). Ho wev er, even in this case existing tec hniques ha ve a space o verhead similar to that of the p erfect hashing approac h. Retriev al. The retriev al problem h as traditionally b een add ressed through the use of p erf ect hashing. Using the Hagerup-Tholey data structure yields a space usage of n r + n log 2 e + o ( n ) bits with constan t q u ery time. Recen tly , Chazelle et al . [11] presente d a different app r oac h to the p roblem based on an idea similar to that of a Blo om fi lter: Eac h ke y is asso ciated with k = O (1) lo cations in an arra y with O ( n ) entries of r bits. The answer to a retriev al query on x is found by com bining th e v alues of entries asso ciated with x , using bit-wise XOR. In place of th e XOR op eration, an y ab elian group op er ation ma y b e used. 
I n fact, this idea w as used earlier by Ma jewski, W ormald, Ha v as, and Czec h [22] and by S eiden and Hirsc hb erg [31] to address the sp ecial case of order-p reserving minimal p erfect hashing. It is not hard to see that these data structure in fact solv e the retriev al problem. The main result of [22] is that f or k = 3 a space usage of around 1 . 23 nr bits is p ossib le, and this is the b est p ossible using th e constru ction algorithm of [11, 22] (other v alues of k giv e w ors e r esults). The appr oac h of th ese pap ers do es not giv e a data stru cture that is m ore efficien t than p erfect hashing, asymp totical ly for large n , but the simplicit y and the lac k of lo wer order terms in the space usage that may dominate for small n mak es it in teresting fr om a practical viewp oin t. A particular f eature is that (lik e for Blo om filters) all memory lo okups are nonadaptive , i.e., the memory ad d resses can b e d etermined from the quer y only . Th is can b e exploited b y m o dern CPU arc hitectures that are able to p arallelize memory lo okups (see e.g. [33]). In fact, Chazelle et al . also sh o w ho w approximat e m emb ership can b e in corp orated int o th eir data structure b y extending arra y entries to r + log 2 (1 /ε ) bits. This generalized data structure is 1 Bloom used a certain simplifying assumption, indep enden ce of certain slightly correlated events, that has since b een justified, see [3]. 3 called a Blo omier filter . Again, the space usage is a constan t factor higher, asymp totical ly , than the solution based on p erfect h ashing. 
1.4 New c ontributions Our fir st con tribution sho ws that the approac h of [11, 22, 31] can b e used to ac hiev e space for retriev al that is very close to the lo wer b oun d: Theorem 1 F or any γ > 0 , r = O (log n ) , and any sufficiently lar ge n ther e exist data structur es for the r etrieval pr oblem having the fol lowing sp ac e and time c omplexity on a unit c ost RAM with fr e e ac c ess to a ful ly r andom hash function : (a) Sp ac e nr + O (log log n ) bits whp. 2 , que ry time O (log n ) , exp e cte d c onstruction time O ( n 3 ) . (b) Sp ac e (1 + γ ) nr bits, query time O (1 + log ( 1 γ )) , e xp e cte d c onstruction time O ( n ) . Our basic data structur e and query ev aluation algorithm is the same as in [11, 22]. The new con tribu tion is to analyze a differen t construction algorithm (suggested in [31]) that is able to ac hiev e a b etter space usage. Ou r analysis n eeds to ols and theorems fr om linear algebra, while that of [11, 22] is based on r andom graph theory ([31] provided only exp erimen tal results). T o get a data stru ctur e that allo ws exp ected linear construction time we devise a n ew v arian t of the d ata structure and query ev aluation algorithm, r etaining s implicit y and non -adaptivity . Our second contribution is to p oint out an in timate connection b et we en the app ro ximate mem- b ersh ip problem and the retriev al p r oblem: Theorem 2 Assuming fr e e ac c ess to ful ly r andom hash functions, any static r etrieval data structur e c an b e use d to implement an appr oximate memb ership data structur e having false p ositive pr ob ability 2 − r , with no additional c ost in sp ac e, and O (1) extr a time. This r e duction is ne ar-optimal in the sense that it c an b e use d to solve the appr oximate memb ership pr oblem in sp ac e that is within O (log log n ) bits of optimal whp. 
The pap ers on Blo om fi lters, and the p ap ers on retriev al [11, 22] all m ak e the assumption of access to fully r andom hash functions, as in the ab o v e. W e sho w ho w our data stru ctures can b e realized on a RAM, w ith a small additional cost in space: Theorem 3 In the setting of The or em 1, for some ε > 0 , we c an avoid the assumption of ful ly r ando m hash functions and get data structur es with the fol lowing sp ac e and time c omplexities : (a) Sp ac e nr + O ( n 1 − ε ) b its, query time O (log n ) , exp e cte d c onstruction time O ( n 1+ γ ) . (b) Sp ac e (1 + γ ) nr bits, query time O (1 + log ( 1 γ )) , e xp e cte d c onstruction time O ( n ) . Our results ha ve a couple of other implications in data structures. W e impro v e the sp ace u s age of a recen t simple construction of (minimal) p erfect h ashing of Botelho et al . [4] (Section 7 ). In addition, we sh o w a close relationship b etw een “cuck o o h ash ing”-lik e dictionaries and retriev al structures (Section 6). Th is implies impro v ed upp er b oun ds on the space us age of k -ary cuc k o o hashing [18] (or equiv alen tly , of the 1-orien tabilit y thr eshold of a k -u n iform random h yp ergraph with n ed ges). 2 “whp.” means with probabilit y 1 − O ( 1 p oly( n ) ). 4 1.5 Ov erview of pap er Section 2 describ es our b asic retriev al data stru cture and its analysis, usin g a result du e to Calkin [8]. This leads to part (b) of Theorem 1, except that th e construction time is O ( n 3 ). Part (a) is sho wn in Section 3, using a r esult of Co op er [13]. The reduction of appro ximate member s hip to retriev al, Theorem 2, is pr esented in S ection 4. Section 5 completes the pro of of part (b) of Theorem 1 by sho w in g h ow the construction algorithm can b e made to r un in linear time. S ection 7 describ es an application of our r esults to p erf ect h ashing. 
Some issues, su c h as circum v enting th e full randomness assumption, leading to T heorem 3 , are d iscussed in the app endices. 2 Retriev al in constan t time and almost optimal sp ace In th is section, we give the basic construction of a data structure for r etriev al with constan t time lo okup op eration and (1 + δ ) nr space. As a tec h nical basis, we start with describing a result by Calkin [8] r egarding the probabilit y that 0-1-matrices with sparse ro ws chosen randomly hav e f ull ro w rank. 2.1 Calkin’s results All calculations are o ver the fi eld GF(2) = Z 2 with 2 element s. W e consider binary matrices M = ( p ij ) 1 ≤ i ≤ n, 0 ≤ j 2 ther e is a c onstant β k < 1 such that the fol lowing holds : Assume the n r ows p 1 , . . . , p n of a matrix M ar e chosen at r andom fr om the set of bi nary ve ctors of length m and weight ( numb er of 1 s ) exactly k . Then : (a) If n/m ≤ β < β k , then Pr ( M has f u l l r ow r ank ) → 1 ( as n → ∞ ) . (b) If n/m ≥ β > β k , then Pr ( M has f u l l r ow r ank ) → 0 ( as n → ∞ ) . F urthermor e, β k − (1 − ( e − k / (ln 2)) → 0 for k → ∞ ( exp onential ly fast in k ) . Remark 1 It has b een noted earlier in related work [22] that the question wh ether a matrix with m columns and r andomly c h osen ro ws of w eight 2 has full r o w ran k is equiv alen t to the question whether the graph with M as its vertex- edge incidence matrix is cyclic. The thr eshold v alue for this case is β 2 = 2, as is well known from the theory of rand om grap h s. In [22] and [4] it is explored ho w this fact can b e used for constr u cting p erf ect h ash functions, in a w ay that implicitly includ es the construction of r etriev al structures. Remark 2 A closer lo ok in to the pro of of T heorem 1.2 in [8] revea ls th at for eac h k there is some ε = ε k > 0 such that in the situation of Theorem 4(a) w e hav e Pr ( M has full ro w rank) = 1 − O ( n − ε ). The follo wing v alues are suitable: ε 3 = 2 7 , ε 4 = 5 7 , ε k = 1 for k ≥ 5. 
According to [8 ], the thr eshold v alue β k is c haracterized as f ollo ws: Defin e f ( α, β ) = − ln 2 − α ln α − (1 − α ) ln(1 − α ) + β ln(1 + (1 − 2 α ) k ) , (1) for 0 < α < 1. L et β k b e the minimal β so that f ( α, β ) attains the v alue 0 for some α ∈ (0 , 1 2 ). Using a computer algebra system, it is easy to fin d app ro ximate v alues for small k , see T able 1 . 5 k 3 4 5 6 β k 0.8894 9 0.9671 4 0.9891 6 0.9 9622 β appr k 0.9091 0.9690 0.9 893 0.9962 4 β − 1 k 1.1243 1.034 1.011 1.00 38 T ab le 1: App ro ximate thresh old v alues fr om Theorem 4, using (1 ) and (2). This table also lists upp er b ounds for the recipro cals β − 1 k , since these are the figures w e will utilize later. Calkin further pro ves that β k = 1 − e − k ln 2 − 1 2 ln 2  k 2 − 2 k + 2 k ln 2 − 1  · e − 2 k ± O ( k 4 ) · e − 3 k , (2) as k → ∞ . It seems that the appro ximation obtained by omitting the last term in (2) is qu ite go o d already for s mall v alues of k . (See the r o w for β appr k in T able 1 .) Remark 3 Resu lts similar to those of Calkin [7, 8], b u t for a differen t mo del, w ere obtained indep end en tly by Balakin, Kolc h in, and Kh okhlo v [2, 20, 21]. F urth er resu lts in a similar vein can b e found in a p ap er b y Co op er [12]. 2.2 The basic ret r iev al data structure No w w e are r eady to describ e a retriev al data structur e. Assume f : S → { 0 , 1 } r is give n, for a set S = { x 1 , . . . , x n } . F or a give n (fixed) k ≥ 3 let 1 + δ > β − 1 k b e arbitrary and let m = (1 + δ ) n . W e can arrange the lo okup time to b e O ( k ) and the n um b er of bits in the data structure to b e mr = (1 + δ ) nr plus lo wer ord er terms. 3 W e assume th at we ha ve access to k hash f unctions with ranges [ m ] , . . . , [ m − k + 1] that b eha v e fully rand omly on the ke ys of S , and that, in case the constru ction b elo w fails, we ma y switc h to a n ew in dep end en t set of k hash functions, again random on S . 
It is not hard to see th at this assumption mak es it p ossible to d efine a mappin g U ∋ x 7→ A x ∈  [ m ] k  , where  X k  denotes the set of all su b sets of X with k elements, so that computing A x from x ∈ U tak es time O ( k ), and so that ( A x ) x ∈ S is fully rand om on S . (F or details see App endix A.) W e n eed to store a few bits to record w hic h set of h ash functions wa s used in the s uccessful construction. The construction s tarts from = { x 1 , . . . , x n } and the bit strings u i = f ( x i ) ∈ { 0 , 1 } r , 1 ≤ i ≤ n . W e consider th e matrix M = ( p ij ) 1 ≤ i ≤ n, 0 ≤ j 0. Assume n is so large that this happ ens with pr obabilit y at least 3 4 . If M do es ha v e full ro w rank, the column sp ace of M is all of { 0 , 1 } n , hence for all u ∈ { 0 , 1 } n there is some a ∈ { 0 , 1 } m with M · a = u . More generally , we arrange the bit strings u 1 , . . . , u n ∈ { 0 , 1 } r as a column v ector u = ( u 1 , . . . , u n ) T . W e stretch notation a bit (but in a n atural wa y) so that we can m ultiply binary m atrices with vect ors of r -bit strin gs: m ultiplication is just bit/ve ctor m u ltiplication and addition is bit w ise X O R. It is then easy to see, w orkin g with the comp onents of 3 F or simplicity , in the notation we sup p ress rounding nonintegral v alues to a suitable near integer. 6 the u i separately , that th ere is a (column) v ector a = ( a 0 , . . . , a m − 1 ) T with entries in { 0 , 1 } r suc h that M · a = u . — W e can reph rase this as follo ws (using ⊕ as n otation f or b itwise X OR): F or a ∈ ( { 0 , 1 } r ) m and x ∈ U define h a ( x ) = L j ∈ A x a j . (4) Then for an arbitrary s equence ( u 1 , . . . , u n ) of prescrib ed v alues from { 0 , 1 } r there is s ome a ∈ ( { 0 , 1 } r ) m with h a ( x i ) = u i , for 1 ≤ i ≤ n . 
Suc h a v ector a ∈ ( { 0 , 1 } r ) m , toget her with an identifier for the set of h ash fun ctions used in the successful constru ction, is a data structure for retrieving v alue u i = f ( x i ), giv en x i . Th ere are k accesses to the d ata stru ctur e, plus the effort to ev aluate k hash functions on x and calculate the set A x from x (see App end ix A). Remark 4 A s imilar construction (o ve r arb itrary fields GF( q )) w as d escrib ed b y Seiden and Hirsc hb erg [31] . How ev er, those authors did not ha ve Calkin’s results, and so could not giv e theoretical b ound s on the n u mb er m of columns needed. Also, our construction generalizes the approac h of Chazelle e t al. [11, 22] who requ ired that M could b e transformed into ec h elon form b y p erm u ting ro w s and columns, w h ic h is su fficien t, but not necessary , for M to hav e fu ll row rank. Some d etails of the construction are m iss ing. W e d escrib e one of seve ral p ossible wa ys to pro ceed. — F rom S , we first calculate the sets A x i , 1 ≤ i ≤ n , in time O ( n ). Using Gaussian elimination, w e can chec k w hether th e indu ced matrix M = ( p ij ) has full ro w rank. If this is not the case, w e start all o ver w ith a new set of k hash fun ctions, leading to new sets A x i . This is rep eated until a suitable m atrix M is obtained. T h e exp ected num b er of rep etitions is 1 + O ( n − ε ). F or a matrix M with ind ep endent rows Gaussian elimination will also yield a “pseudoin verse” of M , that is, an inv ertible n × n -matrix C (co d ing a s equ ence of elementa ry r o w transform ations w ithout r o w exc hanges) w ith th e prop erty that in C · M the n un it vec tors o ccur as columns: ∀ i , 1 ≤ i ≤ n , ∃ b i ∈ [ m ]: column b i of C · M equals e T i = (0 , . . . , 0 , 1 , 0 , . . . , 0) T . (5) Giv en u = ( u 1 , . . . , u n ) ∈ { 0 , 1 } n w e w ish to fi nd a solution a ∈ { 0 , 1 } m of the sy s tem ( C · M ) · a = C · u = u ′ = ( u ′ 1 , . . . , u ′ n ) T . 
(6) Since C · M has the un it vect ors in columns b 1 , . . . , b n , w e can easily read off a sp ecial a that solv es (6): Let a j = 0 for j / ∈ { b 1 , . . . , b n } , and let a b i = u ′ i for 1 ≤ i ≤ n . Exactly the same formula w orks if u , u ′ , and a are v ectors of r -bit strings. — W e h a ve established the follo wing. Theorem 5 Assume that ful ly r ando m hash functions fr om keys x ∈ S to r anges [ m ] , . . . , [ m − k + 1] ar e available ( with the option to cho ose such fu nctions r ep e ate d ly and indep endently ) . L et k > 2 b e fixe d, and let 1 + δ > β − 1 k . Then for n lar ge enough the fol lowing holds : Given S = { x 1 , . . . , x n } and a se quenc e ( u 1 , . . . , u n ) of pr escrib e d elements in { 0 , 1 } r , we c an find a ve ctor a = ( a 0 , . . . , a m − 1 ) with elements in { 0 , 1 } r such that h a ( x i ) = u i , for 1 ≤ i ≤ n . The e xp e cte d c onstruction time is O ( n 3 ) , the scr atch sp ac e ne e de d is O ( n 2 ) . Remark 5 W e note that only n en tries of a are significant, namely en tries b 1 , . . . , b n . Th e other en tries can b e c hosen arbitrarily , for example equal to 0 . This observ ation imp lies that there is an alternativ e data stru cture wh ose redund ancy is indep end en t of r . A constant time rank data structure (e.g. T h eorem 4.4 in [27]) can b e used to identify the entries in { b 1 , . . . , b n } an d map them to entries in a “compressed” arra y of size n . T h e space usage of the rank data structure is within a lo wer order term of the entrop y of th e set { b 1 , . . . , b n } , whic h is log 2  m n  . The dr awbac k of th is is that accesses to the compr essed arra y are now adaptive, as they dep en d on lo oku p s in the rank data structure. 7 Remark 6 At the fir s t glance, the time complexit y of the construction s eems to b e forbiddin gly large. 
Ho wev er, using a tr ic k describ ed in Ap p end ix B (“split-and-share”) make s it p ossible to obtain a data structur e with the same f unctionalit y and space b ounds (up to a o ( n ) term) in time O ( n 1+ γ ) for any giv en γ > 0. I n S ection 5 w e show how to construct a retriev al s tructure with essen tially th e s ame sp ace r equiremen ts in exp ected linear time. 3 A retriev al structure with optimal space The pu rp ose of this s ection is to pro ve Th eorem 1(a), i e., to sho w that there is a data structure supp orting retriev al that requires only sp ace n r + O (log log n ) bits whp. (n ote that nr bits is a lo we r b oun d), and in whic h one retriev al op eration take s logarithmic time. More p recisely , O (log n ) table entries are r ead and com bined by X OR. The idea is the follo wing. W e u se th e s ame setup as in Section 2.2, excepting th at the r an ge size m is equal to n , and that k , the size of the sets A x , is c h osen to b e Θ(log n ). W e set up the matrix M as in (3). The ro ws corresp ond to the n ke ys x 1 , . . . , x n in S , the columns to the ran ge [ n ]. In order to argue that the ind u ced n × n -matrix M has f ull rank at least w ith constant probabilit y , we wish to use the follo wing th eorem. Theorem 6 (C o op er [13 ], Theorem 2(a) ) L et M = ( p ij ) 1 ≤ i ≤ n, 0 ≤ j 0 is an arbitr ary c onstant. Then lim n →∞ Pr ( M i s r e gular ) = c 2 , wher e c 2 = Q 1 ≤ i ≤ n (1 − 2 − i ) ≈ 0 . 288 79 . Note. c 2 is the probabilit y that a random 0-1-matrix, eac h ent ry b eing 0 or 1 with p r obabilit y 1 2 , is regular. W e will wo r k with the constant c = 1 or p = 2 log ( n ) /n throughout, b ut any other constan t w ould d o as we ll. The statemen t of the theorem in Co op er’s p ap er is ev en more general. A sligh t d ifficult y arises in that the num b er of 1s in a row is fi x ed to b e k in the setup of Section 2.2, and is binomially distributed in Co op er’s theorem. 
The id ea to resolv e this is as follo ws: The s ize k ( x ) of set A x , for x ∈ S , is chosen at random according to the binomial distribution. Then the ro w s of M will hav e we igh t Θ(log n ) w ith high probabilit y , as n oted in the follo wing lemma, whic h is easy to pr o ve by Chernoff b ou n ds, e. g., [23, Theorems 4.4 and 4.5]. Lemma 7 In the situation of The or e m 6, with c = 1 , the pr ob ability that M has a r ow in which ther e ar e mor e than 4 log n or fewer than 1 2 log n 1 s is 1 /n Ω(1) . 3.1 Sampling the binomial distribution Using one extra hash function q (with range [ n 3 ], sa y), for eac h x ∈ S w e can choose a num b er k ( x ) at random from the bin omial distribution conditioned on [ 1 2 log n, 4 log n ]. T h en Lemma 7 implies that the d eviation in probabilit y fr om th e situation of Th eorem 6 is o (1). Sp ecifically , assume n and p = 2 log ( n ) /n are giv en, and th at a fu lly r an d om hash function g with range [ n 4 ] is av ailable. W e wish to samp le from B( n, p ) conditioned on [ 1 2 log n, 4 log n ] = [ 1 4 np, 2 np ], at least appro ximately . F or this, w e prepare a table of all O (log n ) v alues F ( i ) = Pr ( X ≤ i ), 1 2 log n ≤ i ≤ 4 log n , for the corresp ond ing d istr ibution function F . F or giv en g ( x ) we find i with 1 2 ( F ( i − 1) + F ( i )) ≤ g ( x ) /n 4 < 1 2 ( F ( i ) + F B ( i + 1)), and return i . An easy calculation sho ws that b oth B( n, p ; 1 4 np ) and B( n, p ; 2 n p ) are in [ n − 1 , n − 2 ], so that the error w e mak e in comparison to the true binomial distribution is in O (1 /n 2 ), whic h for our pu r p oses is negligible. Ha ving fixed k ( x ), x ∈ S , usin g 4 log n further h ash fun ctions with r an ges [ n − ℓ + 1], 1 ≤ ℓ ≤ 4 log n , for eac h x ∈ S a set A x = { h 1 ( x ) , h 2 ( x ) , . . . , h k ( x ) ( x ) } is c h osen at r andom from  [ n ] k ( x )  . (F or details s ee App endix A.) 
8 3.2 Putting things t ogether According to Theorem 6 the resulting matrix M will ha v e full rank with probability 0 . 2887 9 ± o (1). If it turn s out M is singular, we start all o ver, w ith a n ew set of 1 + 4 log n h ash fun ctions. With O (log n ) suc h trials, th e probabilit y that all resulting matrices M are s ingular can b e made as small as n − d for an arb itrary constan t d . The information whic h set of hash f unctions succeeds is part of the data stru cture. Recording th is information tak es at most O (log log n ) bits. The r emaining details of the constru ction are the same as in Section 2.2. Lo okup works as follo w s: Giv en a k ey x ∈ U , use q to calculate k ( x ). If this happ ens to b e outside the range [ 1 2 log n, 4 log n ], return an arbitrary v alue. Otherwise calculate A x = { h 1 ( x ) , h 2 ( x ) , . . . , h k ( x ) ( x ) } and r eturn the v alue give n by (4), with k ( x ) in place of k . This prov es Theorem 1(a). W e do n ot try to redu ce the cost O ( n 3 ) for solving the linear system. Using the “split-and-share” tric k describ ed in App end ix B we ma y a vo id the assu m ption of fully random hash f u nctions and obtain a retriev al structur e as d escrib ed in Th eorem 3(a). 4 Appro ximate mem b ership W e prov e Theorem 2 . — Let S ⊆ U w ith | S | = n and an error b ound 2 − s b e given. W e let q : U → [2 s ] b e a fully r andom hash fun ction. Using the metho d s from S ections 2.2 resp. 3 w e build a retriev al structure D that asso ciates v alue q ( x ) with x ∈ S . The space requir ements f or th e data structures and the times for construction an d retriev al are inh erited from the corresp ond ing retriev al structures. A query for x ∈ U returns “y es” if D ( x ) = q ( x ) and “no” otherwise. It is then clear that a query for an x ∈ S alwa ys yields “ye s”. 
A query for an x ∈ U − S yields “y es” w ith p r obabilit y 2 − s , since D ( x ) is giv en by th e data stru cture, whic h is determined by q ( y ) , y ∈ S (and some other random c hoices), and h en ce is ind ep endent of q ( x ). Remark 7 O f course, we ma y com bine th e retriev al data structure of Section 2.2 with the ap- pro x im ate memb ership data structure just d escrib ed to obtain a d ata s tr ucture that needs s p ace (1 + δ ) n ( r + s ) and has the fu nctionalit y of a “Blo omier filter” as describ ed in [11]: On a qu er y f or x ∈ S , a prescrib ed v alue f ( x ) ∈ { 0 , 1 } r is return ed , while for x ∈ U − S the probabilit y th at some v alue f rom { 0 , 1 } r (and not some error sym b ol) is r eturned is 2 − s . Remark 8 If w e drop the assumption of fully ran d om hash fun ctions (like q ) b eing pr o vided f or free, only a o (1) term h as to b e added to the false p ositive probability . This is brief ly discussed in Remark 9 in App endix B. 5 Retriev al in almost optimal space, with linear construction time In this section w e sho w how, using a v ariant of th e retriev al data structure describ ed in Section 2.2, w e can ac hieve linear exp ected construction time and still get arbitrarily close to optimal space. This will p ro ve T heorem 1(b). Th e reader should b e a ware that the r esu lts in this section hold asymptotically , only for rather large n . Using the n otation of Sections 2.1 and 2.2 , we fix some k and some δ > 0 su c h that (1 + δ ) β k > 1. F ur ther, some constan t ε > 0 is fixed. W e assume th at 2 k + 1 f ully random hash functions are at our disp osal, with ranges w e can c h o ose, and in case the construction fails w e can choose a new set of suc h f u nctions, even r ep eatedly . (App endix B exp lains h o w th is can b e justified.) 9 Define b = 1 2 √ log n . W e assum e th at ε and δ are so small th at (1 + ε ) 2 (1 + δ ) < 4, and hence that b · 2 (1+ ε ) 2 (1+ δ ) b 2 = o ( n/ (log n ) 3 ). 
Assume f : S → {0,1}^r is given, the value f(x) being denoted by u_x. The global setup is as follows: We use one fully random hash function ϕ to map S into the range [m_0] with m_0 = n/b. In this way, m_0 blocks B_i = {x ∈ S | ϕ(x) = i}, 0 ≤ i < m_0, are created, each with expected size b. The construction has two parts: a primary structure and a secondary structure for the "overflow keys" that cannot be accommodated in the primary structure. This is similar to the global structure of a number of well-known dictionary implementations.

For the primary structure, we try to apply the construction from Section 2.2 to each of the blocks separately, but only once, with a fixed set of k hash functions. This construction may fail for one of two reasons: (i) the block may be too large (we do not allow more than b′ = (1 + ε)b keys in a block if it is to be treated in the primary structure), or (ii) the construction from Section 2.2 fails because the row vectors in the matrix M_i induced by the sets A_x, x ∈ B_i, are not linearly independent. For the primary structure, we set up a table T with (1 + δ)(1 + ε)n entries, partitioned into m_0 segments of size (1 + δ)(1 + ε)b = (1 + δ)b′. Segment number i is associated with block B_i. If the construction from Section 2.2 fails, we set all the bits in segment number i to 0 and use the secondary structure to associate the keys in B_i with the correct values.

As secondary structure we choose a retrieval structure as in [11, 22], built on the basis of a second set of, say, 3 hash functions (used to associate sets A′_x ⊆ [1.3n′] with the keys x ∈ S′) and a table T′[0..1.3n′ − 1]. This uses space 1.3n′r bits, where n′ is the size of the set S′ of keys for which the construction failed (the "overflow keys").
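Looking ahead to the query procedure, the two structures are combined by bitwise XOR: the primary segment of a successful block stores f(x) ⊕ f′(x), failed segments are all zeros, and the secondary associates a value f′(x) with every key (which is f(x) itself for overflow keys). A minimal sketch of this compensation trick, with hypothetical dicts standing in for the tables:

```python
# Minimal sketch of the XOR compensation between primary and secondary
# structures. Dicts stand in for the tables T and T'; values are hypothetical.
f      = {1: 0b1010, 2: 0b0111, 3: 0b1100}    # target values f(x)
f_sec  = {1: 0b0011, 2: 0b0111, 3: 0b0101}    # secondary values f'(x)
failed = {2}                                   # key 2's block overflowed,
                                               # so f_sec[2] == f[2]
primary = {x: (0 if x in failed else f[x] ^ f_sec[x]) for x in f}

def lookup(x):
    return primary[x] ^ f_sec[x]               # the two XOR terms of a query

assert all(lookup(x) == f[x] for x in f)
```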
Of course, the secondary structure associates a value f′(x) with every key x ∈ S. Rather than storing information about which blocks succeed, we compensate for the contribution from f′(x) as follows: If the construction succeeds for B_i, we store (1 + δ)b′ vectors of length r in segment number i of table T so that x ∈ B_i is associated with the value f(x) ⊕ f′(x). On a query for x ∈ U, calculate i = ϕ(x), then the offset d_i = (i − 1)(1 + δ)b′ of the segment for block B_i in T, and return

⊕_{j ∈ A_x} T[j + d_i]  ⊕  ⊕_{j ∈ A′_x} T′[j],

with ⊕ representing bitwise XOR in {0,1}^r. It is clear that for x ∈ S the result will be f(x): For x ∈ S′ the two terms are 0 and f(x), and for x ∉ S′ the two terms are f(x) ⊕ f′(x) and f′(x). Note that the accesses to the tables are nonadaptive: all k + 3 lookups may be carried out in parallel. In fact, if T and T′ are concatenated, this can be seen as the same evaluation procedure as in our basic algorithm (4), the difference being that the hash functions were chosen in a different way (e.g., they do not all have the same range).

Lemma 8 below says that E(n′) = o(n). Before proving the lemma we conclude the space analysis, assuming that it is true. The overall space is (1 + δ)(1 + ε)n(r + 1/b) + c|S′|r bits (apart from lower order terms). If γ > 0 is given, we may choose ε and δ (and k) so that this bound is smaller than (1 + γ)nr for n large enough. We proceed to show that |S′| = o(n) and how to achieve construction time O(n).

Lemma 8 The expected number of overflow keys is o(n).

Proof: It is sufficient to show that the expected number of blocks that have more than (1 + ε)b keys or for which the construction from Section 2.2 fails is o(m_0). Let x_0 ∈ S be an arbitrary key.
The number of keys colliding with x_0 is B(n − 1, 1/m_0)-distributed; we may assume it is B(n, 1/m_0)-distributed. The expected number of keys that collide with x_0 under ϕ then is n/m_0 ≤ b. Since b′ = (1 + ε)b, a standard Chernoff bound ([23, Thm. 4.4]) together with the fact that b = ω(1) yields that Pr(x_0 collides with more than b′ keys) ≤ e^(−bε^2/3) = o(1). Hence the expected number of keys in overfull blocks is o(n). Now assume a block B_i has no more than (1 + ε)b keys, and the construction from Section 2.2 is applied (once). There we noted, referring to Remark 2, that the probability that the b′ × (1 + δ)b′ matrix induced by the sets A_x, x ∈ B_i, has linearly dependent rows is 1/b^Ω(1), which is o(1) again. Hence the expected number of keys in blocks that create matrices that do not have full rank is o(n). □

5.1 Matrix computations using tables

As a building block in our construction algorithm for the primary structure we need some tables that allow us to do computations on small matrices efficiently. The details offer no surprises, but are given here for completeness.

Let ε, δ > 0 be such that (1 + ε)^2 (1 + δ) < 4. For given n define b = (1/2)√(log n), and let b′ = (1 + ε)b. We want to set up auxiliary tables that help in dealing with binary matrices M with at most b′ rows and (1 + δ)b′ columns. There are no more than b′ · 2^((1+ε)^2 (1+δ) b^2) = o(n/(log n)^3) such matrices. For each such matrix M we determine and store whether its rows are independent; if this is the case we also calculate and store a pseudoinverse C, as described in Section 2.2. The overall space needed for this table is o(n) bits, and the total time to calculate the entries is o(n) steps, even if a simple method like Gaussian elimination is used.
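At toy scale, this precomputation might look as follows: enumerate every matrix as a tuple of row bitmasks and tabulate whether it has full row rank, using Gaussian elimination over GF(2). (The paper's table additionally stores a pseudoinverse for the good matrices; the dimensions below are illustrative, not b′ × (1 + δ)b′.)

```python
from itertools import product

def rank_gf2(rows):
    """Rank over GF(2) of a matrix given as a sequence of row bitmasks."""
    piv = {}                          # pivot bit position -> reduced row
    for r in rows:
        while r:
            top = r.bit_length() - 1
            if top not in piv:
                piv[top] = r
                break
            r ^= piv[top]             # eliminate the leading bit
    return len(piv)

ROWS, COLS = 3, 4                     # toy dimensions
full_rank = {M: rank_gf2(M) == ROWS
             for M in product(range(1 << COLS), repeat=ROWS)}

assert full_rank[(0b0001, 0b0010, 0b0100)]
assert not full_rank[(0b0011, 0b0101, 0b0110)]   # row 3 = row 1 XOR row 2
```

With (1 + ε)^2 (1 + δ) b^2 < log n bits per matrix, a construction-time query simply indexes this table with the machine word holding M_i.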
In Section 2.2 we described what we mean by multiplying a matrix with a vector of bit strings. For some integer constant ℓ ≥ 3 we prepare a table of all matrix-vector pairs (L, v) and their product L · v, where the 0-1-matrix L has at most b′/ℓ rows and b′(1 + ε)/ℓ columns, and the vector v has b′(1 + ε)/ℓ entries that are bit strings of length (log n)/ℓ. Such a table makes it possible to multiply a matrix with at most b′ rows and b′(1 + ε) columns with a vector whose entries are bit strings of length O(log n) in O(ℓ^3) word operations, hence in O(1) time. This table has size o(n) bits as well.

5.2 Primary structure construction algorithm

We are now ready to show the following lemma.

Lemma 9 The primary structure can be constructed in time O(n).

Proof: It is clear that linear time is sufficient to find the blocks B_i and identify the blocks that are too large. Now consider a fixed block B_i of size at most (1 + ε)b. We must evaluate |B_i| · k hash functions to find the sets A_x, x ∈ B_i, and can piece together the matrix M_i that is induced by these sets in time O(b) (assuming one can establish a word of O(b) 0s in constant time and set a bit in such a word, given by its position, in constant time). The whole matrix has fewer than log n bits and fits into a single word. This makes it possible to use the precomputed tables described in Section 5.1. Similarly, using precomputed tables we can check in constant time whether M_i has linearly independent rows or not, and in the positive case find a pseudoinverse C_i. Now assume a bit vector u = (u_1, ..., u_{|B_i|})^T ∈ {0,1}^{|B_i|} is given. Using C_i and a lookup table as in Section 5.1 we can find C_i · u in constant time. A bit vector a = (a_j), 1 ≤ j ≤ (1 + δ)b′, that solves M_i · a = u can then be found in time O(b).
This leads to an overall construction time of O(n) for the whole primary structure. If the values in the range are bit vectors f(x) = u_x ∈ {0,1}^r, x ∈ B_i, a construction in time O(nr) follows trivially. We may improve this time bound as follows: The lookup tables from Section 5.1 make it possible to multiply C_i even with vectors of length up to O(log n) in constant time. This establishes a construction time of O(n) for the general problem of representing a function f : S → {0,1}^r, using r = O(log n). □

We have proved the following result, which is a precise and more general version of Theorem 1(b):

Theorem 10 There is an algorithm A with the following properties. For every γ > 0 there is some k = O(log(1/γ)) such that for all sufficiently large n the following holds: Given a set S ⊆ U with n = |S| and a function f : S → {0,1}^r, with probability 1 − o(1) algorithm A will succeed in building a data structure D such that:
(a) D supports retrieval in time O(k), with no more than 2k + 1 hash function evaluations and 2k + 1 (nonadaptive) random accesses into tables storing r-bit vectors or bits;
(b) the space occupied by D is no more than (1 + γ)nr bits;
(c) A runs in time O(n).

6 Retrieval and dictionaries by balanced allocation

In several recent papers, the following scenario for (statically) storing a set S ⊆ U of keys was studied. A set S = {x_1, ..., x_n} ⊆ U is to be stored in a table T[0..m − 1] of size m = (1 + δ)n as follows: To each key x we associate a set A_x ⊆ [m] of k possible table positions. Assume there is a mapping σ : {1, ..., n} → [m] that is one-to-one and satisfies σ(i) ∈ A_{x_i} for 1 ≤ i ≤ n. (In this case we say (A_x, x ∈ S) is suitable for S.) Choose one such mapping and store x_i in T[σ(i)].
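Whether (A_x, x ∈ S) is suitable for S, and a witness σ, can be computed as a bipartite perfect matching between keys and table positions. A minimal sketch using Kuhn's augmenting-path algorithm; the seeded pseudo-random position sets and all constants are illustrative stand-ins:

```python
import random

def positions(x, k, m, seed):
    """Stand-in for a fully random choice of the position set A_x (k positions in [m])."""
    return random.Random(x * 0x9E3779B1 + seed).sample(range(m), k)

def find_sigma(keys, k, m, seed):
    """Kuhn's augmenting-path algorithm: a one-to-one sigma with
    sigma(i) in A_{x_i}, i.e. a perfect matching of keys to positions."""
    match = {}                            # position -> key index
    def augment(i, seen):
        for pos in positions(keys[i], k, m, seed):
            if pos not in seen:
                seen.add(pos)
                if pos not in match or augment(match[pos], seen):
                    match[pos] = i
                    return True
        return False
    for i in range(len(keys)):
        if not augment(i, set()):
            return None                   # (A_x, x in S) not suitable for S
    return {i: pos for pos, i in match.items()}

keys = list(range(1000, 1100))
k, m = 4, int(1.05 * len(keys))           # m = (1 + delta) n
sigma, seed = None, 0
while sigma is None:                      # re-draw the sets A_x on failure
    seed += 1
    sigma = find_sigma(keys, k, m, seed)

assert len(set(sigma.values())) == len(keys)              # one-to-one
assert all(sigma[i] in positions(keys[i], k, m, seed)
           for i in range(len(keys)))                     # sigma(i) in A_{x_i}
```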
Examples of constructions that follow this scheme are cuckoo hashing [28], k-ary cuckoo hashing [18], blocked cuckoo hashing [16, 29], and perfectly balanced allocation [14]. In [6, 17] threshold densities for blocked cuckoo hashing were determined exactly. These schemes are the most space-efficient dictionary structures known among schemes that store the keys explicitly in a hash table. For example, k-ary cuckoo hashing [18] works in space m = (1 + ε_k)n with ε_k = e^(−Θ(k)). Perfectly balanced allocation [14] works in optimal space m = n, with A_x consisting of 2 contiguous segments of [n] of length O(log n) each.

Here, we point out a close relationship between dictionary structures of this kind and retrieval structures for functions f : S → R, whenever the range R is not too small. We will assume that R = F for a finite field F with |F| ≥ n. (Using a simple splitting trick, this condition can be attenuated to |F| ≥ n^δ, see Appendix B.) From Section 2.2 we recall equation (3), with the matrix M = (p_ij), 1 ≤ i ≤ n, 0 ≤ j < m. Botelho et al. [4] used results from random (hyper)graph theory to state bounds (no closed formula): m > 1.222n for k = 3 and m > 1.295n for k = 4, and so on, with the constants growing for growing k. (The question asked in [4] is whether the hypergraph given by A_x, x ∈ S, is "acyclic".) We avoid the use of random graph theory and resort to Calkin's theorem (Theorem 4) to show that the bounds β_k^(−1) from Table 1 are relevant for this situation as well. The disadvantage of our approach is that the algorithms that construct the data structure need more time, since they involve Gaussian elimination. Again, the splitting trick from Appendix B can alleviate this problem.

Assume the matrix M has full row rank. We first calculate a pseudoinverse C that satisfies eq.
(5) in Section 2.2. Since columns b_1, ..., b_n of C · M form a regular quadratic matrix, and C · M is obtained from M only by row transformations, columns b_1, ..., b_n of M also form a regular matrix. This means that the determinant of the submatrix of M formed by these columns is nonzero; hence, by the definition of the determinant as a sum of products over all permutations, there must be a bijection ϕ : {1, ..., n} ↔ {b_1, ..., b_n} with p_{i,ϕ(i)} ≠ 0, hence p_{i,ϕ(i)} = 1, for 1 ≤ i ≤ n. This means that ϕ(i) ∈ A_{x_i} for 1 ≤ i ≤ n. The mapping ϕ may be found by an efficient algorithm for calculating perfect matchings in bipartite graphs.

For each i, from ϕ(i) we obtain a value λ(i) ∈ {1, ..., k} such that ϕ(i) = h_{λ(i)}(x_i). We form a vector (u_1, ..., u_n) by defining u_i to be (the binary representation of) λ(i) − 1, using r = ⌈log k⌉ bits. Applying the construction from Section 2.2 we find a vector a = (a_0, ..., a_{m−1}) with elements in {0,1}^r such that h_a(x_i) = λ(i) − 1, for 1 ≤ i ≤ n. Then the function h : U → {0, ..., m − 1}, x ↦ h_{h_a(x)+1}(x), is a perfect hash function for S with range {0, ..., m − 1}. Evaluating the function amounts to calculating h_a(x) as given by (4). The function h is represented by the table that contains the components of a. This takes m = (1 + δ)n words of ⌈log k⌉ bits, where δ > β_k^(−1) − 1 is arbitrary. Since β_k^(−1) ∼ 1 + e^(−k)/ln 2 for k → ∞, the relative space overhead δ may be made as small as we wish, at the cost of larger k.

A particularly attractive choice is k = 4. Since β_4^(−1) < 1.035, we could choose m = 1.035n and spend 2m bits for the representation of the vector a, which amounts to space requirements of 2.07n bits. Botelho et al.
[4] describe how a perfect hash function may be turned into a minimal perfect hash function. There are several plausible techniques for this, one of them as follows: One stores the set of locations in {0, ..., m − 1} − h(S) in a succinct rank data structure [30]. This table requires additional space of 0.035n · log_2(1.035/0.035) + n · log_2(1.035/1) ≈ 0.22n + o(n) bits. The total space needed for the minimal perfect hash function is 2.29n + o(n) bits, which is a little better than the 2.7n from [4]. The price to pay for this improvement is that to find the vector a we must solve a system of linear equations and solve an instance of the perfect matching problem, while in [4] a very simple linear-time algorithm is sufficient. There is no big difference in the time needed to evaluate the (minimal) perfect hash function.

8 Open problems

Our construction relies either on the assumption that the hash functions are fully random, or on the split-and-share construction. An obvious open problem (shared with most data structures that use multiple hash functions) is whether simpler hash functions can do the same job. Another question is to which extent the correspondence shown in Theorem 11 of Section 6 also holds for small values of r. For example, is the space threshold for k-ary cuckoo hashing identical to Calkin's threshold for random matrices with k 1s in each row? Similarly, if we imitate blocked cuckoo hashing [16] by restricting the sets A_{x_i} to be a subset of the union of two intervals of size k (depending on x_i), what is the best space usage we can get?

Acknowledgement. The authors thank Philipp Woelfel for several motivating discussions on the subject.

References

[1] S. Alstrup, G. S. Brodal, and T. Rauhe, Optimal static range reporting in one dimension, Proc. 33rd ACM STOC, 2001, pp. 476–482.

[2] G. V.
Balakin, V. F. Kolchin, and V. I. Khokhlov, Hypercycles in a random hypergraph, Discrete Mathematics and Applications 2 1992: 563–570.

[3] A. Z. Broder and M. Mitzenmacher, Network applications of Bloom filters: a survey, in: Proc. 40th Annual Allerton Conference on Communication, Control, and Computing, pp. 636–646, ACM Press, 2002.

[4] F. C. Botelho, R. Pagh, and N. Ziviani, Simple and space-efficient minimal perfect hash functions, in: Proc. 10th WADS 2007, Springer LNCS 4619, pp. 139–150.

[5] B. H. Bloom, Space/time trade-offs in hash coding with allowable errors, Commun. ACM 13 (7) 1970: 422–426.

[6] J. A. Cain, P. Sanders, N. C. Wormald, The random graph threshold for k-orientability and a fast algorithm for optimal multiple-choice allocation, Proc. 18th ACM-SIAM SODA, 2007, pp. 469–476.

[7] N. J. Calkin, Dependent sets of constant weight vectors in GF(q), Random Structures and Algorithms 9 1996: 49–54.

[8] N. J. Calkin, Dependent sets of constant weight binary vectors, Combinatorics, Probability and Computing 6 (3) 1997: 263–271.

[9] L. Carter, R. W. Floyd, J. Gill, G. Markowsky, and M. N. Wegman, Exact and approximate membership testers, Proc. 10th ACM STOC, 1978, pp. 59–65.

[10] L. Carter and M. N. Wegman, Universal classes of hash functions, J. Comput. Syst. Sci. 18 (2) 1979: 143–154.

[11] B. Chazelle, J. Kilian, R. Rubinfeld, A. Tal, The Bloomier filter: an efficient data structure for static support lookup tables, Proc. 15th ACM-SIAM SODA, 2004, pp. 30–39.

[12] C. Cooper, Asymptotics for dependent sums of random vectors, Random Struct. Algorithms 14 (3) 1999: 267–292.

[13] C. Cooper, On the rank of random matrices, Random Struct. Algorithms 16 (2) 2001: 209–232.

[14] A. Czumaj, C. Riley, C. Scheideler, Perfectly balanced allocation, in: Proc. RANDOM-APPROX 2003, Springer LNCS 2764, pp. 240–251.

[15] M.
Dietzfelbinger, Design strategies for minimal perfect hash functions, in: Proc. 4th Int. Symp. on Stochastic Algorithms: Foundations and Applications (SAGA), 2007, Springer LNCS 4665, pp. 2–17.

[16] M. Dietzfelbinger and C. Weidling, Balanced allocation and dictionaries with tightly packed constant size bins, Theoret. Comput. Sci. 380 (1–2) 2007: 47–68.

[17] D. Fernholz and V. Ramachandran, The k-orientability thresholds for G_{n,p}, Proc. 18th ACM-SIAM SODA, 2007, pp. 459–468.

[18] D. Fotakis, R. Pagh, P. Sanders, P. G. Spirakis, Space efficient hash tables with worst case constant access time, Theory Comput. Syst. 38 (2) 2005: 229–248.

[19] T. Hagerup, T. Tholey, Efficient minimal perfect hashing in nearly minimal space, in: Proc. 18th STACS 2001, Springer LNCS 2010, pp. 317–326.

[20] V. F. Kolchin, Random graphs and systems of linear equations in finite fields, Random Structures and Algorithms 5 1994: 135–146.

[21] V. F. Kolchin and V. I. Khokhlov, A threshold effect for systems of random equations of a special form, Discrete Mathematics and Applications 5 1995: 425–436.

[22] B. S. Majewski, N. C. Wormald, G. Havas, and Z. J. Czech, A family of perfect hashing methods, Computer J. 39 (6) 1996: 547–554.

[23] M. Mitzenmacher and E. Upfal, Probability and Computing, Cambridge University Press, Cambridge, 2005.

[24] M. Mitzenmacher, Compressed Bloom filters, IEEE/ACM Transactions on Networking 10 (5): 604–612 (2002).

[25] C. W. Mortensen, R. Pagh, and M. Pătraşcu, On dynamic range reporting in one dimension, Proc. 37th ACM STOC, 2005, pp. 104–111.

[26] R. Motwani and P. Raghavan, Randomized Algorithms, Cambridge University Press, 1995.

[27] R. Pagh, Low redundancy in static dictionaries with constant query time, SIAM J. Comput. 31 (2): 353–363, 2001.

[28] R. Pagh and F. F. Rodler, Cuckoo hashing, J. Algorithms 51: 122–144 (2004).
[29] R. Panigrahy, Efficient hashing with lookups in two memory accesses, Proc. 16th ACM-SIAM SODA, 2005, pp. 830–839.

[30] V. Raman and S. S. Rao, Static dictionaries supporting rank, in: Proc. 10th Int. Symp. on Algorithms And Computation (ISAAC), 1999, Springer LNCS 1741, pp. 18–26.

[31] S. S. Seiden and D. S. Hirschberg, Finding succinct ordered minimal perfect hash functions, Inf. Process. Lett. 51 (6) 1994: 283–288.

[32] M. N. Wegman and L. Carter, New classes and applications of hash functions, in: Proc. 20th IEEE FOCS, 1979, pp. 175–182.

[33] M. Zukowski, S. Heman, and P. A. Boncz, Architecture-conscious hashing, in: Proc. Int. Workshop on Data Management on New Hardware (DaMoN), Chicago, 2006, Article No. 6 (8 pages).

A Creating random sets of size k without repetitions

We briefly justify the assumption that, given k fully random hash functions with ranges we can choose, there is a way to map each key x to a fully random sequence (or ordered set) A_x = (h_1(x), ..., h_k(x)) with all different values in [m]. Just take k fully random hash functions g_1, ..., g_k, where g_ℓ has range [m − ℓ + 1], for ℓ = 1, ..., k. The existence of the sequence A_x is then easily proved: For ℓ = 1, ..., k, let h_ℓ(x) be element number g_ℓ(x) in the set [m] − {h_1(x), ..., h_{ℓ−1}(x)}.

Algorithmically, it is simpler to work with an array B[0..m − 1], initialized with B[j] = j. Then, sequentially for ℓ = 1, ..., k, the value at position B[g_ℓ(x)] is exchanged with the value at B[m − ℓ]. The output sequence is (h_1(x), ..., h_k(x)) = (B[m − 1], ..., B[m − k]). Clearly, this is a random sequence of k distinct elements of [m]. If space is an issue, one might not actually use this array B, but just simulate the effect.
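The array-based procedure is a partial Fisher-Yates shuffle and can be stated directly; the seeded pseudo-randomness below is only a stand-in for the fully random functions g_1, ..., g_k.

```python
import random

def distinct_positions(x, m, k, seed=0):
    """Map key x to a sequence of k distinct positions in [m] = {0, ..., m-1}
    via a partial Fisher-Yates shuffle driven by g_1, ..., g_k."""
    rng = random.Random(x * 1000003 + seed)     # stand-in for g_1, ..., g_k
    B = list(range(m))                          # B[j] = j
    for l in range(1, k + 1):
        g = rng.randrange(m - l + 1)            # g_l(x) has range [m - l + 1]
        B[g], B[m - l] = B[m - l], B[g]         # exchange B[g_l(x)] and B[m - l]
    return [B[m - l] for l in range(1, k + 1)]  # (h_1(x), ..., h_k(x))

pos = distinct_positions(12345, m=20, k=5)
assert len(set(pos)) == 5 and all(0 <= p < 20 for p in pos)
```

Round ℓ fixes position m − ℓ, which no later round can touch, so the k outputs are distinct.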
Time O(k log k) (using a search tree) or expected time O(k) (using hashing) is definitely sufficient. If the "split-and-share" approach of Appendix B is employed, the space used for array B is O(n^ε).

B How to circumvent the full randomness assumption: "split-and-share"

While it is quite common to assume that fully random hash functions are available for free in the context of Bloom filters and similar data structures, in the realm of dictionary implementations and construction of perfect hash functions one prefers to use randomization in the algorithm, viz., universal hashing. We briefly sketch how, in the context of the static data structures of this paper, universal hashing may be used to justify the full randomness assumptions. (For details, see [15].) We assume the reader is familiar with the concept of a universal class of hash functions from U to M (for distinct keys x, y ∈ U and h chosen at random from the class we have Pr(h(x) = h(y)) ≤ 1/|M|), as well as the concept of k-wise independent hash classes (for arbitrary distinct keys x_1, ..., x_k ∈ U and h chosen at random from the class, the hash values h(x_1), ..., h(x_k) are fully random in M). Constructions of such classes, with arbitrary ranges M = [m], are well known, see [10, 32].

Let S ⊆ U, |S| = n, be fixed. It is a common idea to use a hash function h : U → [m] to split S into "chunks" S_i = {x ∈ S | h(x) = i}, for i ∈ [m], and then work on the chunks separately. In our context we would, e.g., construct a retrieval structure for each S_i separately in a dedicated segment of memory. A more recent idea [15, 16] is to use a "shared" table of random words to simulate hash functions that are fully random on each single chunk. The total space usage of this table may be kept at o(n).
If m ≥ 2n^(2/3) and h : U → [m] is chosen from a 4-universal class, then (as a standard calculation shows) the largest chunk S_i will have at most √n elements with probability larger than 3/4. We repeat choosing such h's until one is found that satisfies max{|S_i| | 0 ≤ i < m} ≤ √n. This we fix and call h_0. The size of S_i is called n_i. It is a folklore fact that using n_i^(1+ǫ) space it is very easy to provide a data structure that gives fully random hash functions on S_i, which can be evaluated in constant time. A concrete construction might look as follows.

Lemma 12 (e.g. [15]) Let r = 2n^(3/4), and let S′ ⊆ U with n′ = |S′| ≤ √n. Given a 1-universal class of hash functions from U to [r], we may in expected time O(|S′|) find two functions h_0, h_1 from that class such that if the two tables T_0[0..r − 1] and T_1[0..r − 1] are filled with random numbers from [t], then h′(x) = (T_0[h_0(x)] + T_1[h_1(x)]) mod t defines a function that is fully random on S′.

This simple observation may be used as follows. For each of the m chunks S_i we find, fix, and store two functions h_{i,0}, h_{i,1} (constant description size) as in Lemma 12. Further, we provide a suitable number L of pairs of tables T_{j,0} and T_{j,1}, 1 ≤ j ≤ L, independently filled with random numbers from [t] (we may even use varying ranges [t_j]). Then

h′_{i,j}(x) = (T_{j,0}[h_{i,0}(x)] + T_{j,1}[h_{i,1}(x)]) mod t, for 1 ≤ j ≤ L,

provides L fully random hash functions on S_i. The total space taken up by the tables is no more than 2n^(3/4) · L numbers, so any L = O(n^(1/4−η)/log t) will lead to a space usage of O(n^(1−η)) bits.
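For a single chunk, the shared-table simulation can be sketched as follows; the multiply-add-mod hash class and all parameter values are illustrative stand-ins (any 1-universal class would do), and each pair of shared tables yields one simulated function.

```python
import random

def make_universal(r, rng, p=(1 << 61) - 1):
    """A multiply-add-mod class standing in for the 1-universal functions."""
    a, b = rng.randrange(1, p), rng.randrange(p)
    return lambda x: ((a * x + b) % p) % r

rng = random.Random(42)
n, t, L = 1000, 2 ** 16, 8                  # parameters are illustrative
r = 2 * int(n ** 0.75)                      # shared table length, as in Lemma 12
h0, h1 = make_universal(r, rng), make_universal(r, rng)
tables = [([rng.randrange(t) for _ in range(r)],     # pair (T_{j,0}, T_{j,1})
           [rng.randrange(t) for _ in range(r)]) for _ in range(L)]

def h_prime(j, x):
    """The j-th simulated hash function h'_j(x), with values in [t]."""
    T0, T1 = tables[j]
    return (T0[h0(x)] + T1[h1(x)]) % t

assert all(0 <= h_prime(j, x) < t for j in range(L) for x in range(30))
```

The point of the construction is that the tables are shared across all chunks, while the cheap per-chunk functions h_{i,0}, h_{i,1} make the combined values fully random on each small chunk.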
The central observation is that this setup makes it possible to work with L truly fully random hash functions on S_i to construct a data structure D_i for solving the respective problem (retrieval, approximate membership) for the keys in S_i. As long as we keep the data structures disjoint, the fact that the tables are shared is harmless. More details may be found in [15].

Remark 9 In Remark 8 we mentioned a subtle problem with the false positive probability in the approximate membership problem in case we use the split-and-share approach to simulate the hash functions. The problem is caused by the fact that a query point y ∈ U − S is not in S_i for i = h(y), and hence there is no guarantee that q (simulated by h_{i,0}, h_{i,1} and some pair T_{j,0} and T_{j,1}) is fully random on S_i ∪ {y}. By the proof of Lemma 12 as given in [15] it can be seen that the probability that this happens for any fixed y is O(1/√n). This term has to be added to the false positive probability.

C Space lower bound for approximate membership

For completeness we prove a lower bound on the space needed for an approximate membership data structure. The proof is a slight extension of the lower bound of Carter et al. [9].

Theorem 13 (Carter et al. [9]) Let u = |U| and consider an approximate membership data structure for sets of size n ≥ 1 and false positive probability ε, 0 < ε < 1. Then the space usage in bits must be at least

n log_2(1/ε) − O((1 − ε)n^2 / (εu + (1 − ε)n)).

Specifically, for u > n^2/ε the space usage must be at least n log_2(1/ε) − O(1) bits.

Proof: Any instance I of the data structure corresponds to a subset U_I ⊆ U, namely the set of elements x for which the data structure answers "x ∈ S". For any set S ⊆ U there must be some instance I for which S ⊆ U_I.
Furthermore, there must exist such an instance where |U_I| ≤ ε(u − n) + n. This is because the expected number of false positives in U − S is at most ε(u − n) (when choosing the data structure for S). We say that the instance covers S if these two conditions hold. The number of sets that can be covered by an instance I is

(|U_I| choose n) ≤ ((⌊ε(u − n)⌋ + n) choose n).

This means that the number of instances needed to cover all subsets of U of size n is bounded from below by

(u choose n) / ((⌊ε(u − n)⌋ + n) choose n)
  ≥ u(u − 1)···(u − n + 1) / [(ε(u − n) + n)(ε(u − n) + n − 1)···(ε(u − n) + 1)]
  > (u / (ε(u − n) + n))^n
  = ((1/ε)(1 − (1 − ε)n / (εu + (1 − ε)n)))^n
  ≥ (1/ε)^n exp(−(1 − ε)n^2 / (εu + (1 − ε)n)).

Since each instance has a unique memory representation, this means that the number of bits used by the data structure in the worst case must be at least log_2 of the number of instances. □
