A cuckoo hashing variant with improved memory utilization and insertion time

Cuckoo hashing [4] is a multiple choice hashing scheme in which each item can be placed in multiple locations, and collisions are resolved by moving items to their alternative locations. In the classical implementation of two-way cuckoo hashing, the …

Authors: Ely Porat, Bar Shalem

1. Abstract

Cuckoo hashing [4] is a multiple choice hashing scheme in which each item can be placed in multiple locations, and collisions are resolved by moving items to their alternative locations. In the classical implementation of two-way cuckoo hashing, the memory is partitioned into contiguous disjoint fixed-size buckets. Each item is hashed to two buckets, and may be stored in any of the positions within those buckets. Ref. [2] analyzed a variation in which the buckets are contiguous and overlap. However, many systems retrieve data from secondary storage in same-size blocks called pages. Fetching a page is a relatively expensive process, but once a page is fetched, its contents can be accessed orders of magnitude faster. We utilize this property of memory retrieval, presenting a variant of cuckoo hashing incorporating the following constraint: each bucket must be fully contained in a single page, but buckets are not necessarily contiguous.

Empirical results show that this modification increases memory utilization and decreases the number of iterations required to insert an item. If each item is hashed to two buckets of capacity two, the page size is 8, and each bucket is fully contained in a single page, the memory utilization equals 89.71% in the classical contiguous disjoint bucket variant, 93.78% in the contiguous overlapping bucket variant, and increases to 97.46% in our new non-contiguous bucket variant. When the memory utilization is 92% and we use breadth first search to look for a vacant position, the number of iterations required to insert a new item is dramatically reduced from 545 in the contiguous overlapping buckets variant to 52 in our new non-contiguous bucket variant.
In addition to the empirical results, we present a theoretical lower bound on the memory utilization of our variation as a function of the page size.

2. Introduction

Cuckoo hashing [4] is a multiple choice hashing scheme in which each item can be placed in multiple locations, and collisions are resolved by moving items to their alternative locations. This hashing scheme resembles the cuckoo's nesting habits: the cuckoo lays its eggs in other birds' nests. When the cuckoo chick hatches, it pushes the other eggs out of the nest. Hence the name "cuckoo hashing."

As Ref. [2] explains, analysis of hashing is similar to the analysis of balls and bins. Hashing an item to a memory location corresponds to throwing a ball into a bin. Insights from balls and bins processes led to breakthroughs in hashing methods. For example, if we throw $n$ balls into $n$ bins independently and uniformly, it is highly probable that the largest bin will get $(1 + o(1)) \log(n) / \log\log(n)$ balls. Azar et al. [10] found that if each ball selects two bins independently and uniformly, and is placed in the bin with fewer balls, the final distribution is much more uniform. This led to hashing each item to one of two possible buckets, decreasing the load on the most-loaded bucket to $\log(\log(n)) + O(1)$ with high probability. In general, if each item is hashed into $d \ge 2$ buckets, the maximum load decreases to $\log(\log(n)) / \log(d) + O(1)$.

Cuckoo hashing [4] is an extension of two-way hashing. Each item is hashed to a few possible buckets, and existing items may be moved to their alternate buckets in order to free space for a new item. There are many variants of cuckoo hashing. The goals of cuckoo hashing are to increase memory utilization (the number of items that can be successfully hashed to a given memory size) and to decrease insertion complexity.
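The balls-and-bins effect cited above can be observed with a small simulation. The following is only an illustrative sketch, not part of the paper's experiments; the function name `max_load`, the seed, and the bin count are assumptions chosen for the example.

```python
import random


def max_load(n, choices, seed=0):
    """Throw n balls into n bins and return the load of the fullest bin.

    Each ball draws `choices` bins independently and uniformly at random
    and is placed in the least loaded of them (choices=1 is plain hashing).
    """
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    load = [0] * n
    for _ in range(n):
        candidates = [rng.randrange(n) for _ in range(choices)]
        best = min(candidates, key=load.__getitem__)
        load[best] += 1
    return max(load)


if __name__ == "__main__":
    n = 100_000  # hypothetical size, chosen only for illustration
    print("one choice :", max_load(n, 1))
    print("two choices:", max_load(n, 2))
```

With two choices the fullest bin is typically noticeably less loaded than with one choice, which is the observation that motivates multiple-choice hashing.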
Pagh and Rodler [4] analyzed hashing of each item to $d = 2$ buckets of capacity $k = 1$, and demonstrated that moving items during inserts results in 50% space utilization with high probability. Fotakis et al. [5] analyzed hashing of each item into more than two buckets. Ref. [6] analyzed a practically important case in which each item is hashed to $d = 2$ buckets of capacity $k = 2$. Refs. [7, 8] found tight memory utilization thresholds for $d = 2$ buckets of any size $k \ge 2$. Specifically, they proved that the memory utilization for $d = 2$ and $k = 2$ is 89.7%. Ref. [1] proved that the maximum memory utilization thresholds for $d \ge 3$ and $k = 1$ are equal to the previously known thresholds for the random k-XORSAT problem. Refs. [12, 13] developed a tight formula for memory utilization for any $d \ge 3$ and $k = 1$, and Ref. [11] extended the formula to any $d \ge 3$ and $k \ge 1$.

Comment added. While this work was being completed, we became aware of Ref. [14], which proposed a model where the memory is divided into pages and each key has several possible locations on a single page as well as additional choices on a second backup page. They provide interesting experimental results.

In a classical implementation of two-way cuckoo hashing, the memory is partitioned into contiguous disjoint fixed-size buckets of size $k$. Each item is hashed to 2 buckets and may be stored in any of the $2k$ locations within those buckets. Ref. [2] analyzed a variation in which the buckets overlap. For example, if the bucket capacity $k$ is 3, the disjoint bucket memory locations are
\[ \{0,1,2\},\ \{3,4,5\},\ \{6,7,8\},\ \ldots \]
whereas the overlapping bucket memory locations are
\[ \{0,1,2\},\ \{1,2,3\},\ \{2,3,4\},\ \ldots \]
Their empirical results show that this variation increases memory utilization from 89.7% to 96.5% for $d = 2$ and $k = 2$.

However, many systems retrieve data from secondary storage in same-size blocks called pages. Fetching a page is a relatively expensive process, but once a page is fetched, its contents can be accessed orders of magnitude more quickly. We utilize this property of memory retrieval to present a variant of cuckoo hashing requiring that each bucket be fully contained in a single page, while buckets are not necessarily contiguous.

In this paper we compare the following three variants of cuckoo hashing:

(1) CUCKOO-CHOOSE-K: the algorithm introduced in this paper. The buckets are any $k$ cells in a page, not necessarily in contiguous locations. There are $\binom{t}{k}$ buckets in a page, where $t$ is the size of the page.

(2) CUCKOO-OVERLAP [2]: the buckets are contiguous and overlap. Here we assume that all buckets are fully contained in a single page, so there are $t - k + 1$ buckets in a page. This is a generalization of Ref. [2]. Originally Ref. [2] did not consider dividing the memory into pages.

(3) CUCKOO-DISJOINT [4]: the buckets are contiguous and not overlapping. There are $t/k$ buckets in a page. This is a generalization of Ref. [4]. Originally Ref. [4] did not consider larger buckets.

Note that CUCKOO-DISJOINT is the extreme case of the CUCKOO-OVERLAP and CUCKOO-CHOOSE-K algorithms when the size of the page $t$ equals the size of the bucket $k$.

We prove theoretically and present empirical evidence that our CUCKOO-CHOOSE-K modification increases memory utilization. Moreover, using the classical cuckoo hashing scheme, an item insertion requires multiple look-ups of candidates to displace. Empirical results show that our modification dramatically decreases the number of candidate look-ups required to insert an item compared to Ref. [2].
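The three bucket layouts compared above can be enumerated for a single page with a few lines of Python. This is a minimal sketch for illustration; the function names are hypothetical, and locations are numbered $0, \ldots, t-1$ within the page.

```python
from itertools import combinations
from math import comb


def disjoint_buckets(t, k):
    """CUCKOO-DISJOINT: t/k contiguous, non-overlapping buckets per page."""
    if t % k != 0:
        raise ValueError("k must divide t")
    return [tuple(range(start, start + k)) for start in range(0, t, k)]


def overlap_buckets(t, k):
    """CUCKOO-OVERLAP: t - k + 1 contiguous, overlapping buckets per page."""
    return [tuple(range(start, start + k)) for start in range(t - k + 1)]


def choose_k_buckets(t, k):
    """CUCKOO-CHOOSE-K: any k cells of the page, C(t, k) buckets per page."""
    return list(combinations(range(t), k))


if __name__ == "__main__":
    t, k = 8, 2
    print(len(disjoint_buckets(t, k)), "disjoint buckets")
    print(len(overlap_buckets(t, k)), "overlapping buckets")
    print(len(choose_k_buckets(t, k)), "choose-k buckets, C(t, k) =", comb(t, k))
```

When $t = k$ all three functions return the single bucket that covers the whole page, which matches the remark that CUCKOO-DISJOINT is the extreme case of the other two variants.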
In the overlapping buckets variant [2], some buckets are split between two pages, so that each item resides in up to $2d$ pages. In our variant, each bucket is fully contained in a single page.

An appealing experimental result is that CUCKOO-CHOOSE-K memory utilization converges very quickly as a function of the page size $t$. When $k = 2$ and $t = 16$, memory utilization is 0.9763. This value is almost identical to memory utilization when $t = 2^{20}$, which equals 0.9767. CUCKOO-OVERLAP memory utilization when $t = 16$ is 0.9494, and CUCKOO-DISJOINT memory utilization is only 0.8970. If we allow a tiny gap of one inside the buckets ($t = 3$), memory utilization increases from 0.9229 (CUCKOO-OVERLAP) to 0.9480 (CUCKOO-CHOOSE-K).

Table 1 specifies the parameters used in our analysis.

Symbol        Description                                  Comments
n             number of vertices (hash table capacity)     n → ∞
m             number of edges (hashed items)               m ≤ n, m → ∞
d             number of buckets each item is hashed to     typical value: 2, dk > 2
k             bucket size                                  typical value: 2-3, dk > 2
t             page size                                    t ≥ k; k divides t; t divides n
g             number of pages                              g = n/t
β             memory utilization                           β = m/n, 0 < β ≤ 1
V             a set of vertices                            |V| = v
S = S(V,E)    a sub-graph
v             v = |V|                                      dk ≤ v < m (if v ≥ m or v < dk then S cannot fail)
x             x = v/n                                      dk/n ≤ x < β
ε             a small constant                             0 < ε ≪ 1
δ             a small constant                             0 < δ ≪ 1

Table 1. Parameter names and descriptions

3. Theoretical analysis

We can determine the success of cuckoo hashing by analyzing the cuckoo hypergraph. The vertices of the graph are the memory locations. The hyper-edges of the graph connect all the memory locations where each item could be placed. Recall that each item can be placed in $d$ buckets chosen uniformly and independently of other items. Each bucket is composed of any $k$ locations in a page.
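To make the hypergraph model concrete, the sketch below draws the hyper-edge of one item under CUCKOO-CHOOSE-K: $d$ buckets, each consisting of $k$ locations inside one uniformly chosen page. The function name and the use of Python's `random` module are assumptions for illustration; the paper's own experiments used Matlab.

```python
import random


def sample_hyperedge(n, t, k, d, rng):
    """Return the set of memory locations in which one item may be stored.

    n: table size, t: page size (t divides n), k: bucket size, d: number of
    buckets per item. Picking a uniform page and then a uniform k-subset of
    that page is the same as picking uniformly among all (n/t) * C(t, k)
    buckets.
    """
    pages = n // t
    edge = set()
    for _ in range(d):
        page = rng.randrange(pages)          # uniform page
        offsets = rng.sample(range(t), k)    # k distinct cells in that page
        edge.update(page * t + off for off in offsets)
    return frozenset(edge)


if __name__ == "__main__":
    rng = random.Random(0)
    print(sorted(sample_hyperedge(n=64, t=8, k=2, d=2, rng=rng)))
```

A hyper-edge produced this way contains at most $dk$ vertices, and each of its $d$ buckets lies entirely inside one page.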
It is well known (see, e.g., Ref. [2] for a proof) that a cuckoo hash fails if and only if there is a sub-graph $S$ with $v$ vertices and more than $v$ edges. We say that a sub-graph $S$ has failed if it has more edges than vertices.

We will begin by analyzing the probability of success of CUCKOO-CHOOSE-K for the case where the page size $t$ equals the array size $n$. This simple and special case is presented here to introduce the main ideas applied in the following section, where we analyze the general case where the page size is a finite constant (independent of $n$).

3.1. Memory utilization when the page size t equals the array size n

An analysis of memory utilization has been performed previously in [5]. In their analysis, they assume $k = 1$ and prove that if $d \ge \frac{2}{\beta} \log\left(\frac{e\beta}{1-\beta}\right)$, then the hashing will be successful with a probability of at least $1 - O(n^{4-2d})$. Here we derive a similar constraint on memory utilization $\beta$. We solve the constraint numerically for different values of $k$ and $d$, and obtain a lower bound on possible memory utilization for the specified values. We perform the analysis using a modification of the method in [5], which we will later generalize to page sizes being equal to any given constant.

We will bound the failure probability using the union bound. But first, we would like to reduce redundant summations. We observe that:

• If there exists a sub-graph $S(V, E)$ with $|E| > |V|$ and there exists an edge that has exactly one vertex $v_0$ outside of $V$, then the sub-graph $S'(V' = V \cup \{v_0\}, E')$ also has more edges than vertices, since $|E'| \ge |E| + 1 > |V| + 1 = |V'|$.

• If there exists a sub-graph $S(V, E)$ such that $|V| = v$ and $|E| > v + 1$, then there exists a sub-graph $S'(V', E')$ such that $|E'| = |V'| + 1$.
We can find such a sub-graph simply by adding vertices to $V$ one by one until we get a sub-graph where the number of edges equals the number of vertices plus one.

For each sub-graph $S$ we define an indicator variable $Z_S$:
\[
Z_S = \begin{cases}
1 & \text{if } S(V, E) \text{ has } |V| = v \text{ vertices and } |E| = v + 1 \text{ edges, and no edge connects } V \text{ to exactly one vertex outside of } V, \\
0 & \text{otherwise.}
\end{cases}
\]
If $Z_S = 1$, we say that $S$ is a bad sub-graph; otherwise we say that $S$ is a good sub-graph. If the sum of $Z_S$ over all sub-graphs is equal to zero, then every sub-graph is good and the cuckoo hash succeeded. We will find the memory utilization $\beta$ such that the sum of $Z_S$ over all sub-graphs is $o(1)$ as $n \to \infty$.

Let $p_{hit}$ be the probability that a random edge hits $V$. Let $p_1$ be the probability that a random edge connects $V$ to exactly one vertex from outside of $V$, and let $p_{bad}(v)$ be the probability that a given sub-graph $S(V, E)$ is bad.

Lemma 1.
\[ p_{bad}(v) = \binom{m}{v+1}\, p_{hit}^{\,v+1}\, (1 - p_1 - p_{hit})^{m-(v+1)} \]

Proof. Immediate from the definition of $Z_S$. For $S$ to be bad, exactly $v + 1$ edges out of $m$ edges must hit $V$, and all the rest must miss $V$ and must not connect $V$ to exactly one vertex from outside of $V$. □

Lemma 2. The probability that a random edge hits $V$ is
\[ p_{hit}(v) = \left( \frac{\binom{v}{k}}{\binom{n}{k}} \right)^{d} \]

Proof. Each item is hashed independently to $d$ buckets and the size of each bucket is $k$. The number of buckets in $V$ is therefore $\binom{v}{k}$ and the total number of buckets is $\binom{n}{k}$. □

Lemma 3. The probability that a random edge connects $V$ to exactly one vertex from outside of $V$ is
\[ p_1 = d\, \frac{(n - v)\binom{v}{k-1}}{\binom{n}{k}} \left( \frac{\binom{v}{k}}{\binom{n}{k}} \right)^{d-1} \]

Proof. $d - 1$ buckets must all fall in $V$, and one bucket must contain any $k - 1$ vertices from $V$ and any of the $n - v$ vertices from outside of $V$. □

Let $P_{bad}(v)$ be the probability that there exists a sub-graph $S$ with $v$ vertices such that $S$ is bad. According to the union bound, $P_{bad}(v) < N(v) \cdot p_{bad}(v)$, where $N(v) = \binom{n}{v}$ is the number of sub-graphs with $v$ vertices.

We are going to analyze $P_{bad}(v)$ as $n \to \infty$. If for all $v$, $P_{bad}(v) = o\left(\frac{1}{n}\right)$, then $\sum_{v=dk}^{n} P_{bad}(v) \le o(1)$ and the cuckoo hash succeeds with high probability. The analysis is similar to the analysis given in [5].

Let $x_0 = \exp\left(\frac{-2}{dk-2}\right)$. We divide the analysis into two sections. In section 3.1.1 we show that for any memory utilization $\beta$ and $\forall\, \frac{dk}{n} \le x < x_0$, $P_{bad}(x)$ is $o\left(\frac{1}{n}\right)$. In section 3.1.2 we find the maximum memory utilization $\beta$ such that $\forall\, x_0 \le x < \beta$, $P_{bad}(x)$ is exponentially small.

3.1.1. $P_{bad}(x)$ analysis for $\frac{dk}{n} \le x < x_0$

In this section we show that if $dk > 2$, then for any memory utilization $\beta$ and $\forall\, \frac{dk}{n} \le x < x_0$, $P_{bad}(x)$ is $o\left(\frac{1}{n}\right)$.

Lemma 4. $P_{bad}(x) < c_0(x) \cdot c_1^n(x)$, where $c_0(x) = e \cdot x^{dk-1}$ and $c_1(x) = e^{2x} \cdot x^{(dk-2)x}$.

Proof. See appendix A. □

Theorem 5. Let $\delta$ be a small constant, $0 < \delta \ll 1$. If $dk \ge 3$, then for any load $\beta$:
1. If $\frac{dk}{n} \le x \le \frac{dk}{n} + \delta$, then $P_{bad}(x) < O\left(n^{-((dk)^2 - dk - 1)}\right) = o\left(\frac{1}{n}\right)$.
2. If $\frac{dk}{n} + \delta \le x \le x_0 - \delta$, then $P_{bad}(x)$ decreases exponentially as $n \to \infty$.

Proof. For any memory utilization $\beta$, $P_{bad}(x) < c_0(x) \cdot c_1^n(x)$, where $c_0(x) = e \cdot x^{dk-1}$ and $c_1(x) = e^{2x} \cdot x^{(dk-2)x}$. Note that $c_0(x)$ and $c_1(x)$ are independent of $\beta$. Recall that $x_0 = \exp\left(\frac{-2}{dk-2}\right)$. $\forall\, \frac{dk}{n} + \delta \le x \le x_0 - \delta$, there exists a constant $\epsilon$ such that $c_1(x) < 1 - \epsilon$. We obtain that:
1. If $x \to \frac{dk}{n}$, then $P_{bad}(x) < \lim_{x \to dk/n} c_0(x) \cdot c_1^n(x) = O\left(n^{-((dk)^2 - dk - 1)}\right) \le O\left(n^{-5}\right) = o\left(\frac{1}{n}\right)$.
2. If $\frac{dk}{n} + \delta \le x \le x_0 - \delta$, then $c_1(x) < 1 - \epsilon$, and $P_{bad}(x) < c_0(x) \cdot c_1^n(x)$ decreases exponentially as $n \to \infty$. □

3.1.2. $P_{bad}(x)$ analysis for $x_0 \le x < \beta$

In this section we are going to find the maximum memory utilization $\beta$ such that $\forall\, x_0 \le x < \beta$, the probability that there exists a bad sub-graph is exponentially small.

Lemma 6. $\forall\, x_0 \le x < \beta$, $P_{bad}(x) < O(1) \cdot c_5^n(x, \beta)$, where
\[
c_5(x, \beta) = \left(\frac{1}{1-x}\right)^{(1-x)} \left(\frac{1}{x}\right)^{x} \left(\frac{\beta}{\beta-x}\right)^{(\beta-x)} \left(\frac{\beta}{x}\right)^{x} x^{dkx} \left(1 - dk(1-x)x^{dk-1} - x^{dk}\right)^{\beta-x}
\]

Proof. See appendix B. □

Theorem 7. If $c_5(x, \beta) < 1 - \epsilon$ for all $x_0 \le x < \beta$, then $P_{bad}(x)$ decreases exponentially as $n \to \infty$. Any memory utilization $\beta$ that satisfies the constraint is a lower bound on the possible memory utilization.

Proof. The theorem follows directly from the inequality $P_{bad}(x) < O(1) \cdot c_5^n(x, \beta)$. □

Numerical solutions to the constraint $c_5(x, \beta) < 1$ indicate that the memory utilization of the CUCKOO-CHOOSE-K algorithm is $\beta_{Choose-2}(k = 2, d = 2, t = n) > 0.937$ and $\beta_{Choose-3}(k = 3, d = 2, t = n) > 0.993$. Our empirical results show that $\beta_{Choose-2}(k = 2, d = 2, t = n) = 0.9768$ and $\beta_{Choose-3}(k = 3, d = 2, t = n) = 0.9974$. The memory utilization for $kd > 6$ rapidly approaches one.

Theoretical analysis performed by [8, 7] provided tight thresholds of the memory utilization of the CUCKOO-DISJOINT algorithm: $\beta_{Disjoint}(k = 2, d = 2, t = n) = 0.8970$ and $\beta_{Disjoint}(k = 3, d = 2, t = n) = 0.9592$.

Ref. [2] does not provide a theoretical memory utilization threshold for small $k$. The empirical results of Ref. [2] show that in the CUCKOO-OVERLAP algorithm $\beta_{Overlap}(k = 2, d = 2, t = n) = 0.9650$ and $\beta_{Overlap}(k = 3, d = 2, t = n) = 0.9945$.
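The numerical solution of the constraint $c_5(x, \beta) < 1$ of Theorem 7 can be approximated with a grid scan over $x$ and a bisection over $\beta$. The sketch below is not the authors' solver: the function names, the grid resolution, the upper search limit 0.999, and the assumption that feasibility is monotone in $\beta$ are all choices made for illustration. It requires $dk > 2$.

```python
import math


def log_c5(x, beta, d, k):
    """Natural log of c_5(x, beta) from Lemma 6 (page size t = n)."""
    dk = d * k
    # Base of the factor that is raised to the power (beta - x) in c_5.
    inner = 1.0 - dk * (1.0 - x) * x ** (dk - 1) - x ** dk
    if inner <= 0.0:
        return math.inf  # conservative guard against round-off near x = 1
    return (
        -(1.0 - x) * math.log(1.0 - x)
        - x * math.log(x)
        + (beta - x) * math.log(beta / (beta - x))
        + x * math.log(beta / x)
        + dk * x * math.log(x)
        + (beta - x) * math.log(inner)
    )


def constraint_holds(beta, d, k, steps=2000):
    """True if c_5(x, beta) < 1 at every grid point x in [x_0, beta)."""
    x0 = math.exp(-2.0 / (d * k - 2))
    if beta <= x0:
        return True  # the interval [x_0, beta) is empty
    width = beta - x0
    return all(
        log_c5(x0 + width * i / steps, beta, d, k) < 0.0 for i in range(steps)
    )


def largest_beta(d, k, hi=0.999, tol=1e-4):
    """Bisection for the largest beta (up to tol) passing the grid check."""
    lo = math.exp(-2.0 / (d * k - 2))  # beta = x_0 passes trivially
    if constraint_holds(hi, d, k):
        return hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if constraint_holds(mid, d, k):
            lo = mid
        else:
            hi = mid
    return lo


if __name__ == "__main__":
    print("d = 2, k = 2:", largest_beta(2, 2))
    print("d = 2, k = 3:", largest_beta(2, 3))
```

Because the check is performed only on a finite grid and with the strict bound $1$ rather than $1 - \epsilon$, the returned value should be read as an estimate of the lower bound, not as a proof of it.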
The theoretical analysis of the CUCKOO-DISJOINT algorithm performed in [12, 1, 13] does not apply for $k > 1$, and the theoretical analysis of the CUCKOO-DISJOINT algorithm performed by [11] does not apply for $d < 3$.

3.2. Memory utilization when the page size t is a given constant

In this section we analyze the probability that the hashing fails for the case where the page size $t$ equals a constant. Let $P_{fail}(v)$ be the probability that there exists a sub-graph $S$ with $v$ vertices such that $S$ has more edges than vertices and every vertex is in at least one edge.

Let $x_1 = \exp\left(\frac{-(k+1)}{dk-(k+1)}\right)$. Here again we divide the analysis into two sections. In section 3.2.1 we show that for any memory utilization $\beta$ and $\forall\, \frac{dk}{n} \le x < x_1$, $P_{fail}(x)$ is $o\left(\frac{1}{n}\right)$. In section 3.2.2 we find the maximum memory utilization $\beta$ such that $\forall\, x_1 \le x < \beta$, $P_{bad}(x)$ is exponentially small.

3.2.1. $P_{fail}(x)$ analysis for $\frac{dk}{n} \le x < x_1$

In this section we show that for any memory utilization $\beta$ and $\forall\, \frac{dk}{n} \le x < x_1$, $P_{fail}(x)$ is $o\left(\frac{1}{n}\right)$. The analysis here is similar to the case above where $t = n$; however, now we need to take into consideration the distribution of the vertices over the pages.

Let $V$ be a given set of vertices, and let $v_i$ be the number of vertices in page $i$. The probability that a random edge hits $V$ is equal to
\[
p_{hit}(V, t) = \left( \frac{\sum_{i=1}^{g} \binom{v_i}{k}}{g \cdot \binom{t}{k}} \right)^{d} \le \left( \frac{1}{g} \sum_{i=1}^{g} \left(\frac{v_i}{t}\right)^{k} \right)^{d}.
\]
Let $\tilde{p}_{hit}(V, t) = \left( \frac{1}{g} \sum_{i=1}^{g} \left(\frac{v_i}{t}\right)^{k} \right)^{d}$ be an upper bound on $p_{hit}(V, t)$. We are going to use the following lemma, which states that increasing the page size reduces $\tilde{p}_{hit}$.

Lemma 8. For any set of vertices $V$ and any integer $c$, $\tilde{p}_{hit}(V, t) \ge \tilde{p}_{hit}(V, ct)$.

Proof. Since the function $f(v) = \left(\frac{v}{t}\right)^{k}$ is convex, for any sequence of $c$ pages,
\[
\frac{1}{c} \left( \left(\frac{v_1}{t}\right)^{k} + \ldots + \left(\frac{v_c}{t}\right)^{k} \right) \ge \left( \frac{v_1 + \ldots + v_c}{ct} \right)^{k},
\]
thus multiplying the page size by a factor $c$ does not increase $\tilde{p}_{hit}$. For convenience, we restrict our analysis to pages of size $t = k \cdot c$, where $c$ is any integer. The worst case is obtained when $t = k$, which is equivalent to the classical CUCKOO-DISJOINT hashing. We will now analyze $P_{fail}(x)$ and $P_{bad}(x)$ as $n \to \infty$. Recall that $\beta = \frac{m}{n}$ is the memory utilization, $0 < \beta \le 1$, and $x = \frac{v}{n}$. □

Lemma 9. $P_{fail}(x) < c_6(x) \cdot c_7^n(x)$, where $c_6(x) = e\, x^{d-1}$ and $c_7(x) = e^{\frac{(k+1)x}{k}}\, x^{\frac{(dk-1-k)x}{k}}$.

Proof. See appendix C. □

Theorem 10. Let $\delta$ be a small constant, $0 < \delta \ll 1$. If $(d-1)dk \ge 3$, then for any load $\beta$:
1. If $\frac{dk}{n} \le x \le \frac{dk}{n} + \delta$, then $P_{fail}(x) < O\left(n^{-((d-1)dk - 1)}\right) = o\left(\frac{1}{n}\right)$.
2. If $\frac{dk}{n} + \delta \le x \le x_1 - \delta$, then $P_{fail}(x)$ decreases exponentially as $n \to \infty$.

Proof. For any memory utilization $\beta$, $P_{fail}(x) < c_6(x) \cdot c_7^n(x)$, where $c_6(x) = e\, x^{d-1}$ and $c_7(x) = e^{\frac{(k+1)x}{k}}\, x^{\frac{(dk-1-k)x}{k}}$. Note that $c_6(x)$ and $c_7(x)$ are independent of $\beta$. Recall that $x_1 = \exp\left(\frac{-(k+1)}{dk-(k+1)}\right)$. $\forall\, \frac{dk}{n} + \delta \le x \le x_1 - \delta$, there exists a constant $\epsilon$ such that $c_7(x) < 1 - \epsilon$. We obtain that:
1. If $x \to \frac{dk}{n}$, then $P_{fail}(x) < \lim_{x \to dk/n} c_6(x) \cdot c_7^n(x) = O\left(n^{-((d-1)dk - 1)}\right) \le O\left(n^{-2}\right) = o\left(\frac{1}{n}\right)$.
2. If $\frac{dk}{n} + \delta \le x < x_1 - \delta$, then $c_7(x) < 1 - \epsilon$, and $P_{fail}(x) < c_6(x) \cdot c_7^n(x)$ decreases exponentially as $n \to \infty$. □

3.2.2. $P_{bad}(x)$ analysis for $x_1 \le x < \beta$

In this section we find the maximum memory utilization $\beta$ such that $\forall\, x_1 < x < \beta$, the probability that there exists a bad sub-graph is exponentially small.

We examine the set of sub-graphs that have a given distribution $a$ of vertices over the pages: $a = (a_0, \ldots, a_t)$, where $a_i$ is the number of pages that have $i$ vertices.
For example, when the page size $t$ was equal to $n$ and the number of pages $g$ was 1, the number of sub-graphs with $v$ vertices was $\binom{n}{v}$ and the corresponding $a$ of those sub-graphs was $\{0, \ldots, 0, a_v = 1, 0, \ldots, 0\}$. By definition of $a$,
\[ \sum_{i=0}^{t} a_i = g \tag{3.1} \]

Lemma 11. When the page size $t$ is a given constant, the number of different possible values of $a$ is polynomial in $n$.

Proof. We denote by $\#a$ the number of different possible values of $a$. $\#a < g^t = \left(\frac{n}{t}\right)^t$. Since $t$ is constant, $\#a$ is polynomial in $n$. □

Let $P_{bad}(a)$ be the probability that there exists a bad sub-graph with distribution $a$ of vertices over the pages. If $P_{bad}(a)$ is exponentially small for every $a$, then the union bound over a polynomial number of all possible values of $a$ is also exponentially small. Let $p_{bad}(a)$ be the probability that a given sub-graph $S(V, E)$ with $a = (a_0, \ldots, a_t)$ is bad.

Let $\hat{a} = \frac{a}{g}$ be a unit vector, that is, its entries sum to one. When the page size $t$ is a constant, the probability that a random edge hits $V$ is
\[
p_{hit}(a) = \left( \frac{\sum_{i=k}^{t} a_i \binom{i}{k}}{g \binom{t}{k}} \right)^{d} = \left( \frac{\sum_{i=k}^{t} \hat{a}_i \binom{i}{k}}{\binom{t}{k}} \right)^{d} \tag{3.2}
\]
and the probability that a random edge connects $V$ to exactly one vertex from outside of $V$ is
\[
p_1(a) = d \left( \frac{\sum_{i=k}^{t} a_i (t-i) \binom{i}{k-1}}{g \binom{t}{k}} \right) \left( \frac{\sum_{i=k}^{t} a_i \binom{i}{k}}{g \binom{t}{k}} \right)^{d-1} = d \left( \frac{\sum_{i=k}^{t} \hat{a}_i (t-i) \binom{i}{k-1}}{\binom{t}{k}} \right) \left( \frac{\sum_{i=k}^{t} \hat{a}_i \binom{i}{k}}{\binom{t}{k}} \right)^{d-1} \tag{3.3}
\]
Using the union bound we obtain that
\[ P_{bad}(a) < N(a) \cdot p_{bad}(a) \tag{3.4} \]
where $N(a)$ is the number of sub-graphs with $a = (a_0, \ldots, a_t)$:
\[ N(a) = \binom{g}{a_0, \ldots, a_t} \prod_{i=0}^{t} \binom{t}{i}^{a_i} \tag{3.5} \]
For the asymptotic behavior of $N(a)$, we are going to use the following lemma:

Lemma 12. $\binom{g}{a_0, \ldots, a_t} \le \prod_{i=0}^{t} \left(\frac{g}{a_i}\right)^{a_i}$

Proof. See Appendix D. □

Lemma 13. $P_{bad}(\hat{a}, \beta) < O(1) \cdot c_8(\hat{a}, \beta) \cdot c_9^n(\hat{a}, \beta)$, where
\[ c_8(\hat{a}, \beta) = p_{hit}\, (1 - p_1 - p_{hit})^{-1} \left( \frac{\beta - x}{x} \right) \]
and
\[
c_9(\hat{a}, \beta) = \prod_{i=0}^{t} \left(\frac{1}{\hat{a}_i}\right)^{\frac{\hat{a}_i}{t}} \prod_{i=0}^{t} \binom{t}{i}^{\frac{\hat{a}_i}{t}} \left(\frac{\beta}{\beta-x}\right)^{(\beta-x)} \left(\frac{\beta}{x}\right)^{x} p_{hit}^{\,x}\, (1 - p_1 - p_{hit})^{\beta-x}
\]

Proof. See appendix E. □

Theorem 14. If $c_9(\hat{a}, \beta) < 1 - \epsilon$ for all $x_1 \le x < \beta$, then $P_{bad}(\hat{a}, \beta)$ decreases exponentially as $n \to \infty$. Any memory utilization $\beta$ that satisfies the constraint is a lower bound on the possible memory utilization.

Proof. The theorem follows directly from the inequality $P_{bad}(\hat{a}, \beta) < O(1) \cdot c_8(\hat{a}, \beta) \cdot c_9^n(\hat{a}, \beta)$. □

Theoretical lower bounds of the memory utilization obtained from numerical solutions of the constraint $c_9(\hat{a}, \beta) < 1$ are displayed in figure 4.1.

4. Empirical Results

The experiments were conducted with a similar protocol to the one described in [2]. In all experiments the number of buckets, $d$, was two. The capacity of each bucket, $k$, was either two or three. The size of the hash tables, $n$, was 1,209,600. The reported memory utilization $\beta$ is the mean memory utilization over twenty trials. The random hash functions were based on the Matlab "rand" function with the twister method. Items were inserted into the hash table one by one until an item could not be inserted.

The results of both CUCKOO-CHOOSE-K and CUCKOO-OVERLAP were notably stable. In each case, the standard deviation was a few hundredths of a percent, so error bars would be invisible in the figure. Such strongly predictable behavior is appealing from a practical standpoint.

Since we added a paging constraint, our results are not comparable to previous works that do not include a paging constraint.

Figure 4.1. Memory utilization vs. page size.
Empirical CUCKOO-CHOOSE-K (green), empirical CUCKOO-OVERLAP (red), approximation formula of empirical CUCKOO-CHOOSE-K (blue), and theoretical lower bound of CUCKOO-CHOOSE-K (black). Left: $k = 2$, right: $k = 3$.

Figure 4.2. Number of lookups required to insert an item vs. memory utilization (left) and vs. page size (right). In the left figure the page size is 8. In the right figure the memory utilization is 92%. The left figure was smoothed with an averaging filter. The variance in the number of lookups required to insert an item was much smaller in CUCKOO-CHOOSE-K.

Experiments show that CUCKOO-CHOOSE-K improves memory utilization significantly even for a small page size $t$ and a small bucket capacity $k$ when compared to the classical cuckoo hashing CUCKOO-DISJOINT. It outperforms CUCKOO-OVERLAP as well. Recall that CUCKOO-DISJOINT is the extreme case of CUCKOO-CHOOSE-K and CUCKOO-OVERLAP when the page size is equal to the bucket size $k$.

The memory utilization $\beta_{Choose-k}$ converges very quickly to its maximum value. For example, $\beta_{Choose-2}(t = 16) = 0.9763$ is almost equal to $\beta_{Choose-2}(t = 1{,}209{,}600) = 0.9767$, whereas $\beta_{Overlap-2}(t = 16) = 0.9494$ and $\beta_{Disjoint-2} = 0.8970$.

The empirical memory utilization can be approximated very accurately by the following formulas:
\[ \beta_{Choose-2}(t) \approx 0.977 \cdot \left( 1 - \left( \frac{t}{0.764} \right)^{-2.604} \right) \tag{4.1} \]
\[ \beta_{Choose-3}(t) \approx 0.997 \cdot \left( 1 - \left( \frac{t}{1.011} \right)^{-2.998} \right) \tag{4.2} \]
The maximum approximation error is 0.0011 for $k = 2$ and 0.0015 for $k = 3$. The empirical results and their approximations are displayed in figure 4.1 together with empirical results of CUCKOO-OVERLAP. The memory utilization for $kd > 6$ rapidly approaches one (not displayed).
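The approximation formulas (4.1) and (4.2) are easy to evaluate directly. The helper below is a small illustrative sketch; the function name and the dictionary of fitted constants are just a convenient packaging of the numbers quoted in the formulas, valid for $d = 2$ and $k \in \{2, 3\}$ only.

```python
# Fitted constants (scale, t0, power) taken from formulas (4.1) and (4.2).
FITTED = {
    2: (0.977, 0.764, 2.604),
    3: (0.997, 1.011, 2.998),
}


def approx_utilization(t, k):
    """Approximate CUCKOO-CHOOSE-K memory utilization for page size t."""
    scale, t0, power = FITTED[k]
    return scale * (1.0 - (t / t0) ** (-power))


if __name__ == "__main__":
    for t in (2, 4, 8, 16, 64):
        print(t, round(approx_utilization(t, 2), 4))
```

For $k = 2$ the formula stays within the stated error of 0.0011 of the empirical values reported in the text for $t = 8$ (0.9746) and $t = 16$ (0.9763).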
CUCKOO-CHOOSE-K outperforms CUCKOO-OVERLAP not only in memory utilization, but in the number of iterations required to insert a new item as well. Figure 4.2 illustrates the number of iterations required to insert a new item when the hash table is 92% full and we use breadth first search to search for a vacant position: $\#itr_{Choose-2}(t = 8) = 52$, whereas $\#itr_{Overlap-2}(t = 8) = 545$.

Note that in these simulations we did not limit the number of insert iterations, and we continued to insert items as long as we could find free locations. Most applications limit the number of insert iterations and maintain a low memory utilization in order to find a free location easily. If an empty position is not found within a fixed number of iterations, a rehash is performed or the item is placed outside of the cuckoo array.

Appendix A. Proof of Lemma 4

Here we prove that $P_{bad}(x) < c_0(x) \cdot (c_1(x))^n$, where $c_0(x) = e \cdot x^{dk-1}$ and $c_1(x) = e^{2x} \cdot x^{(dk-2)x}$.

Proof. Recall that $\beta = \frac{m}{n}$ is the memory utilization, $0 < \beta \le 1$, and $x = \frac{v}{n}$, $\frac{dk}{n} \le x < \beta$ (if $v \ge m$ or $v < dk$, then the sub-graph $S$ cannot fail). $P_{bad}(v) < \binom{n}{v} \cdot p_{bad}(v)$, where
\[
p_{bad}(v) = \binom{m}{v+1} p_{hit}^{\,v+1} (1 - p_1 - p_{hit})^{m-(v+1)}, \quad
p_{hit}(v) = \left( \frac{\binom{v}{k}}{\binom{n}{k}} \right)^{d}, \quad
p_1 = d\, \frac{(n-v)\binom{v}{k-1}}{\binom{n}{k}} \left( \frac{\binom{v}{k}}{\binom{n}{k}} \right)^{d-1}.
\]
$\binom{n}{v}$, $p_{bad}$, $\binom{m}{v+1}$ and $p_{hit}$ are bounded by:
\[ \binom{n}{v} < \left( \frac{e \cdot n}{v} \right)^{v} = \left( \frac{e}{x} \right)^{xn}. \tag{A.1} \]
\[ p_{bad}(v) = \binom{m}{v+1} p_{hit}^{\,v+1} (1 - p_1 - p_{hit})^{m-(v+1)} < \binom{m}{v+1} p_{hit}^{\,v+1}. \tag{A.2} \]
\[ \binom{m}{v+1} < \left( \frac{e \cdot m}{v+1} \right)^{v+1} < \left( \frac{e \cdot m}{v} \right)^{v+1} = \left( \frac{e \cdot \beta}{x} \right)^{xn+1} < \left( \frac{e}{x} \right)^{xn+1}. \tag{A.3} \]
\[ p_{hit} < \left( \frac{v}{n} \right)^{dk} = x^{dk}. \tag{A.4} \]
And we get that
\[ P_{bad}(v) < \binom{n}{v} \cdot p_{bad}(v) < \left( \frac{e}{x} \right)^{xn} \left( \frac{e}{x} \right)^{xn+1} \left( x^{dk} \right)^{xn+1} = c_0(x) \cdot c_1^n(x). \tag{A.5} \]
□

Appendix B. Proof of Lemma 6

Here we prove that $\forall\, x_0 \le x < \beta$, $P_{bad}(x, \beta) < O(1) \cdot c_5^n(x, \beta)$, where
\[
c_5(x, \beta) = \left(\frac{1}{1-x}\right)^{(1-x)} \left(\frac{1}{x}\right)^{x} \left(\frac{\beta}{\beta-x}\right)^{(\beta-x)} \left(\frac{\beta}{x}\right)^{x} x^{dkx} \left(1 - dk(1-x)x^{dk-1} - x^{dk}\right)^{\beta-x}.
\]

Proof. Recall that:
\[ P_{bad}(v) < N(v) \cdot p_{bad}(v) = \binom{n}{v} \binom{m}{v+1} p_{hit}^{\,v+1} (1 - p_1 - p_{hit})^{m-(v+1)}. \tag{B.1} \]
$\binom{n}{v}$, $p_{hit}$ and $p_1$ are bounded by:
\[ \binom{n}{v} < \left( \frac{n}{n-v} \right)^{n-v} \left( \frac{n}{v} \right)^{v} = \left( \frac{1}{1-x} \right)^{(1-x)n} \left( \frac{1}{x} \right)^{xn}, \tag{B.2} \]
\[ \left( x - \frac{k}{n} \right)^{dk} < p_{hit} < x^{dk}, \tag{B.3} \]
\[ dk\, \frac{(1-x)}{x} \left( x - \frac{k}{n} \right)^{dk} < p_1. \tag{B.4} \]
Since $x \ge x_0 \gg \frac{1}{n}, \frac{k}{n}$, we neglect the term $\frac{1}{n} \ll x$ in the following approximation of $\binom{m}{v+1}$, and we neglect the term $\frac{k}{n} \ll x$ in the lower bounds for $p_{hit}$ and $p_1$. As $n \to \infty$, these terms contribute to $P_{bad}(v)$ factors which are bounded by $O(1)$.
\[ \binom{m}{v+1} < \left( \frac{m}{m-(v+1)} \right)^{m-(v+1)} \left( \frac{m}{v+1} \right)^{v+1} < O(1) \cdot \left( \frac{\beta}{\beta-x} \right)^{(\beta-x)n-1} \left( \frac{\beta}{x} \right)^{xn+1} \tag{B.5} \]
\[ (1 - p_1 - p_{hit})^{m-(v+1)} < O(1) \cdot \left( 1 - dk(1-x)x^{dk-1} - x^{dk} \right)^{\beta n-(xn+1)} \tag{B.6} \]
The proof for the left inequalities is given in [5] and is also a special case of the more general inequality we prove later in lemma 12. We get that:
\[ P_{bad}(v) < \binom{n}{v} \cdot p_{bad}(v) < O(1) \cdot c_4(x, \beta) \cdot c_5^n(x, \beta), \tag{B.7} \]
where
\[ c_4(x, \beta) = (\beta - x)\, x^{dk-1} \left( 1 - dk(1-x)x^{dk-1} - x^{dk} \right)^{-1} \tag{B.8} \]
and
\[
c_5(x, \beta) = \left(\frac{1}{1-x}\right)^{(1-x)} \left(\frac{1}{x}\right)^{x} \left(\frac{\beta}{\beta-x}\right)^{(\beta-x)} \left(\frac{\beta}{x}\right)^{x} x^{dkx} \left(1 - dk(1-x)x^{dk-1} - x^{dk}\right)^{\beta-x} \tag{B.9}
\]
Since $c_4 < O(1)$, we get $P_{bad}(x) < O(1) \cdot c_5^n(x, \beta)$. □

Appendix C. Proof of Lemma 9

Here we prove that $P_{fail}(v) < c_6(x) \cdot c_7^n(x)$, where $c_6(x) = e\, x^{d-1}$ and $c_7(x) = e^{\frac{(k+1)x}{k}}\, x^{\frac{(dk-1-k)x}{k}}$.

Proof. According to lemma 8, for any set of vertices $V$, $\tilde{p}_{hit}$ decreases when the page size $t$ is multiplied by an integer.
For simplicity, we restrict our analysis to pages of size $t = k \cdot c$, where $c$ is any integer. The worst case is therefore $t = k$, which is equivalent to the classical CUCKOO-DISJOINT hashing. We use the union bound to obtain $P_{fail}(v) < N(v) \cdot p_{fail}(v)$, where $N(v)$ is the number of sub-graphs with $v$ vertices in which each vertex is hit by at least one edge, and $p_{fail}(v)$ is the probability that a given sub-graph $S(V, E)$ was hit by more than $v$ edges. By definition of $p_{fail}$:

$$p_{fail}(v) < \binom{m}{v+1} \tilde{p}_{hit}^{\,v+1}. \tag{C.1}$$

When $t = k$ we get:

$$\tilde{p}_{hit}^{\,v+1} = \left(x^{d}\right)^{xn+1} = x^{d} \left(x^{dx}\right)^{n}, \tag{C.2}$$

and

$$N(v) \le \binom{n/k}{v/k} < \left(\frac{e\,n/k}{v/k}\right)^{v/k} = \left(\left(\frac{e}{x}\right)^{x/k}\right)^{n}, \tag{C.3}$$

since when the page size $t$ equals $k$, an element can be placed either in all of the locations of a page or in none of them.

$$\binom{m}{v+1} < \binom{n}{v+1} < \left(\frac{e\,n}{v+1}\right)^{v+1} < \left(\frac{e\,n}{v}\right)^{v+1} = \frac{e}{x} \left(\left(\frac{e}{x}\right)^{x}\right)^{n}. \tag{C.4}$$

We get that:

$$P_{fail}(v) < N(v) \cdot p_{fail}(v) < N(v) \binom{m}{v+1} \tilde{p}_{hit}^{\,v+1} < c_6(x) \cdot c_7^n(x), \tag{C.5}$$

where $c_6(x) = e\,x^{d-1}$ and $c_7(x) = e^{\frac{(k+1)x}{k}}\, x^{\frac{(dk-1-k)x}{k}}$. $\square$

Appendix D. Proof of Lemma 12

The proof of $\binom{g}{a_0, \ldots, a_t} \le \prod_{i=0}^{t} \left(\frac{g}{a_i}\right)^{a_i}$ is a generalization of the proof given in [5]. For any positive integer $t$ and any non-negative integer $g$:

$$(y_0 + \ldots + y_t)^{g} = \sum_{\alpha_0 + \ldots + \alpha_t = g} \binom{g}{\alpha_0, \ldots, \alpha_t} \prod_{i=0}^{t} y_i^{\alpha_i},$$

where the summation is taken over all sequences of non-negative integer indices $\alpha_0$ through $\alpha_t$ such that the sum of all $\alpha_i$ is $g$. For the special case where $y_i = \frac{a_i}{g}$, each $a_i$ is a non-negative integer and $\sum_{i=0}^{t} a_i = g$, we get:

$$\binom{g}{a_0, \ldots, a_t} \prod_{i=0}^{t} \left(\frac{a_i}{g}\right)^{a_i} \le \sum_{\alpha_0 + \ldots + \alpha_t = g} \binom{g}{\alpha_0, \ldots, \alpha_t} \prod_{i=0}^{t} \left(\frac{a_i}{g}\right)^{\alpha_i} = (y_0 + \ldots + y_t)^{g} = 1 \tag{D.1}$$

So

$$\binom{g}{a_0, \ldots, a_t} \le \frac{1}{\prod_{i=0}^{t} \left(\frac{a_i}{g}\right)^{a_i}} = \prod_{i=0}^{t} \left(\frac{g}{a_i}\right)^{a_i} \tag{D.2}$$

Appendix E. Proof of Lemma 13

Here we prove $P_{bad}(\hat{a}, \beta) < O(1) \cdot c_8(\hat{a}, \beta) \cdot c_9^n(\hat{a}, \beta)$, where

$$c_8(\hat{a}, \beta) = p_{hit}\,(1 - p_1 - p_{hit})^{-1} \left(\frac{\beta - x}{x}\right)$$

and

$$c_9(\hat{a}, \beta) = \prod_{i=0}^{t} \left(\frac{1}{\hat{a}_i}\right)^{\frac{\hat{a}_i}{t}} \prod_{i=0}^{t} \binom{t}{i}^{\frac{\hat{a}_i}{t}} \left(\frac{\beta}{\beta-x}\right)^{(\beta-x)} \left(\frac{\beta}{x}\right)^{x} p_{hit}^{x}\,(1 - p_1 - p_{hit})^{\beta-x}.$$

Proof. Recall that $\hat{a}$ is a unit vector and that $v$ and $x$ are equal to the following functions of $\hat{a}$:

$$\sum_{i=0}^{t} \hat{a}_i = 1 \tag{E.1}$$

$$v = \sum_{i=0}^{t} a_i \cdot i = g \sum_{i=0}^{t} \hat{a}_i \cdot i = \frac{n}{t} \sum_{i=0}^{t} \hat{a}_i \cdot i \tag{E.2}$$

$$x = \frac{v}{n} = \frac{1}{t} \sum_{i=0}^{t} \hat{a}_i \cdot i. \tag{E.3}$$

$P_{bad}(v) < N(v) \cdot p_{bad}(v)$, where

$$N(v) = \binom{g}{a_0, \ldots, a_t} \prod_{i=0}^{t} \binom{t}{i}^{a_i}, \tag{E.4}$$

$$p_{bad}(v) = \binom{m}{v+1} p_{hit}^{v+1} (1 - p_1 - p_{hit})^{m-(v+1)}, \tag{E.5}$$

$$p_{hit} = \left(\frac{\sum_{i=k}^{t} a_i \binom{i}{k}}{g \binom{t}{k}}\right)^{d} = \left(\frac{\sum_{i=k}^{t} \hat{a}_i \binom{i}{k}}{\binom{t}{k}}\right)^{d}, \tag{E.6}$$

$$p_1 = d \left(\frac{\sum_{i=k}^{t} \hat{a}_i\,(t-i) \binom{i}{k-1}}{\binom{t}{k}}\right) \left(\frac{\sum_{i=k}^{t} \hat{a}_i \binom{i}{k}}{\binom{t}{k}}\right)^{d-1}. \tag{E.7}$$

Using Lemma 12 we get:

$$N(v) = \binom{g}{a_0, \ldots, a_t} \prod_{i=0}^{t} \binom{t}{i}^{a_i} \le \left(\prod_{i=0}^{t} \left(\frac{1}{\hat{a}_i}\right)^{\frac{\hat{a}_i}{t}} \prod_{i=0}^{t} \binom{t}{i}^{\frac{\hat{a}_i}{t}}\right)^{n}, \tag{E.8}$$

and

$$\binom{m}{v+1} < \left(\frac{m}{m-(v+1)}\right)^{m-(v+1)} \left(\frac{m}{v+1}\right)^{v+1} < O(1) \cdot \left(\frac{\beta}{\beta-x}\right)^{(\beta-x)n-1} \left(\frac{\beta}{x}\right)^{xn+1}. \tag{E.9}$$

Finally we get:

$$P_{bad}(v) < N(v) \cdot p_{bad}(v) = O(1) \cdot c_8(\hat{a}, \beta) \cdot c_9^n(\hat{a}, \beta), \tag{E.10}$$

where

$$c_8(\hat{a}, \beta) = p_{hit}\,(1 - p_1 - p_{hit})^{-1} \left(\frac{\beta - x}{x}\right), \tag{E.11}$$

and

$$c_9(\hat{a}, \beta) = \prod_{i=0}^{t} \left(\frac{1}{\hat{a}_i}\right)^{\frac{\hat{a}_i}{t}} \prod_{i=0}^{t} \binom{t}{i}^{\frac{\hat{a}_i}{t}} \left(\frac{\beta}{\beta-x}\right)^{(\beta-x)} \left(\frac{\beta}{x}\right)^{x} p_{hit}^{x}\,(1 - p_1 - p_{hit})^{\beta-x}. \tag{E.12}$$

$\square$

References

[1] M. Dietzfelbinger, A. Goerdt, M. Mitzenmacher, A. Montanari, R. Pagh, and M. Rink, "Tight thresholds for cuckoo hashing via XORSAT". ICALP (1) 2010: 213-225.
[2] E. Lehman and R. Panigrahy, "3.5-way cuckoo hashing for the price of 2 and a bit". In Proceedings of the 17th Annual European Symposium on Algorithms, pages 671-681, 2009.
[3] M. Dietzfelbinger and P. Woelfel, "Almost random graphs with simple hash functions", 35th STOC, pages 629-638, 2003.
[4] R. Pagh and F. Rodler, "Cuckoo hashing". Journal of Algorithms 51 (2004), pages 122-144.
[5] P. Sanders, D. Fotakis, R. Pagh and P. Spirakis, "Space efficient hash tables with worst case constant access time". Theory of Computing Systems 38, 229-248 (2005).
[6] R. Panigrahy, "Efficient hashing with lookups in two memory accesses". SODA '05: Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms, pages 830-839, Philadelphia, PA, USA, 2005. Society for Industrial and Applied Mathematics.
[7] D. Fernholz and V. Ramachandran, "The k-orientability thresholds for $G_{n,p}$". In SODA '07: Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 459-468, Philadelphia, PA, USA, 2007. Society for Industrial and Applied Mathematics.
[8] J. A. Cain, P. Sanders and N. Wormald, "The random graph threshold for k-orientiability and a fast algorithm for optimal multiple-choice allocation". In SODA '07: Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 469-476, Philadelphia, PA, USA, 2007. Society for Industrial and Applied Mathematics.
[9] A. Kirsch, M. Mitzenmacher, and U. Wieder, "More robust hashing: Cuckoo hashing with a stash". SIAM Journal on Computing 39:1543-1561, 2009.
[10] Y. Azar, A. Z. Broder, A. R. Karlin, and E. Upfal, "Balanced allocations". SIAM Journal on Computing, 29:180-200, 1999. A preliminary version of this paper appeared in Proceedings of the Twenty-Sixth Annual ACM Symposium on the Theory of Computing, 1994.
[11] N. Fountoulakis, M. Khosla, and K. Panagiotou, "The Multiple-orientability Thresholds for Random Hypergraphs". In Proc. 22nd SODA. SIAM (2011).
[12] A. Frieze and P. Melsted, "Maximum Matchings in Random Bipartite Graphs and the Space Utilization of Cuckoo Hash tables". CoRR abs/0910.5535 (2009).
[13] Fountoulakis and Panagiotou, "Orientability of Random Hypergraphs and the Power of Multiple Choices". ICALP (1) 2010: 348-359.
[14] M. Dietzfelbinger, M. Mitzenmacher and M. Rink, "Cuckoo Hashing with Pages". arXiv:1104.5111v1 [cs.DS], 2011.
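As a supplementary sanity check on Lemma 12 (Appendix D), the multinomial bound $\binom{g}{a_0, \ldots, a_t} \le \prod_{i=0}^{t} (g/a_i)^{a_i}$ can be verified exhaustively for small parameters. The following is a minimal Python sketch that is not part of the original analysis. The function names `multinomial`, `lemma12_bound` and `check_all` are hypothetical, and the sketch assumes the usual convention $0^0 = 1$, so factors with $a_i = 0$ are skipped. Exact rational arithmetic is used to avoid floating-point error.

```python
from fractions import Fraction
from itertools import product
from math import factorial


def multinomial(parts):
    """Exact multinomial coefficient g! / (a_0! ... a_t!), with g = sum(parts)."""
    result = factorial(sum(parts))
    for a in parts:
        result //= factorial(a)  # every partial quotient is an integer
    return result


def lemma12_bound(parts):
    """Right-hand side of (D.2): product of (g / a_i)^{a_i}.

    Factors with a_i = 0 are skipped (assumed convention 0^0 = 1).
    """
    g = sum(parts)
    bound = Fraction(1)
    for a in parts:
        if a > 0:
            bound *= Fraction(g, a) ** a
    return bound


def check_all(g, t):
    """Check (D.2) for every tuple (a_0, ..., a_t) of non-negative integers
    summing to g, and return the number of tuples checked."""
    checked = 0
    for head in product(range(g + 1), repeat=t):
        if sum(head) > g:
            continue
        parts = head + (g - sum(head),)  # last entry is fixed by the sum
        assert multinomial(parts) <= lemma12_bound(parts), parts
        checked += 1
    return checked


if __name__ == "__main__":
    print(check_all(12, 3))
```

With $t = 1$ the same check covers the binomial bound of (B.2), since $\binom{n}{v} = \binom{n}{v,\, n-v}$. A passing run is only evidence for the small cases enumerated; the proof in Appendix D is what establishes the inequality in general.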
