Distinct Counting with a Self-Learning Bitmap

Aiyou Chen, Jin Cao, Larry Shepp and Tuan Nguyen*
Bell Laboratories and Rutgers University

October 22, 2018

Abstract

Counting the number of distinct elements (cardinality) in a dataset is a fundamental problem in database management. In recent years, due to many of its modern applications, there has been significant interest in addressing the distinct counting problem in a data stream setting, where each incoming data item can be seen only once and cannot be stored for long periods of time. Many probabilistic approaches based on either sampling or sketching have been proposed in the computer science literature that only require limited computing and memory resources. However, the performance of these methods is not scale-invariant, in the sense that their relative root mean square estimation errors (RRMSE) depend on the unknown cardinalities. This is not desirable in many applications where cardinalities can be very dynamic or inhomogeneous and many cardinalities need to be estimated.

* Aiyou Chen (E-mail: aiyouchen@google.com) is a statistician at Google Inc., Mountain View, CA 94043; Jin Cao (E-mail: cao@research.bell-labs.com) is a Distinguished Member of Technical Staff at Bell Labs, Alcatel-Lucent, Murray Hill, NJ 07974; Larry Shepp is a Professor, Department of Statistics, Rutgers University, Piscataway, NJ 08854; and Tuan Nguyen is a Ph.D. candidate, Department of Statistics, Rutgers University, Piscataway, NJ 08854. The work was done when Aiyou Chen was a Member of Technical Staff at Bell Labs, Alcatel-Lucent. The authors would like to thank the Editor, AE and two anonymous referees for useful reviews and constructive suggestions which have improved the paper significantly. Lawrence Menten at Bell Labs, Alcatel-Lucent proposed an idea of hardware implementation for S-bitmap.
In this paper, we develop a novel approach, called self-learning bitmap (S-bitmap), that is scale-invariant for cardinalities in a specified range. S-bitmap uses a binary vector whose entries are updated from 0 to 1 by an adaptive sampling process for inferring the unknown cardinality, where the sampling rates are reduced sequentially as more and more entries change from 0 to 1. We prove rigorously that the S-bitmap estimate is not only unbiased but scale-invariant. We demonstrate that to achieve a small RRMSE value of ε or less, our approach requires significantly less memory and consumes similar or fewer operations than state-of-the-art methods for many common practice cardinality scales. Both simulation and experimental studies are reported.

Keywords: Distinct counting, sampling, streaming data, bitmap, Markov chain, martingale.

1 Introduction

Counting the number of distinct elements (cardinality) in a dataset is a fundamental problem in database management. In recent years, due to high rate data collection in many modern applications, there has been significant interest in addressing the distinct counting problem in a data stream setting where each incoming data item can be seen only once and cannot be stored for long periods of time. Algorithms that deal with streaming data are often called online algorithms. For example, in modern high speed networks, data traffic in the form of packets can arrive at a network link at the speed of gigabits per second, creating a massive data stream. A sequence of packets between the same pair of source and destination hosts and their application protocols forms a flow, and the number of distinct network flows is an important monitoring metric for network health (for example, the early stage of a worm attack often results in a significant increase in the number of network flows as infected machines randomly scan others; see Bu et al. (2006)).
As another example, it is often useful to monitor connectivity patterns among network hosts and count the number of distinct peers that each host is communicating with over time (Karasaridis et al., 2007), in order to analyze the presence of peer-to-peer networks that are used for file sharing (e.g. songs, movies).

The challenge of distinct counting in the stream setting is due to the constraint of limited memory and computation resources. In this scenario, the exact solution is infeasible, and a lightweight algorithm that derives an approximate count with low memory and computational cost but high accuracy is desired. In particular, such a solution is much preferred for counting tasks performed over Android-based smart phones (with only limited memory and computing resources), whose use is growing rapidly (Menten et al., 2011). Another difficulty is that in many applications, the unknown cardinalities to be estimated may fall into a wide range, from 1 to N, where N ≫ 1 is a known upper bound. Hence an algorithm that can perform uniformly well within the range is preferred. For instance, there can be millions of hosts (e.g. home users) active in a network, and the number of flows each host has may change dramatically from host to host and from time to time. Similarly, a core network may be composed of many links with varying link speeds, and a traffic snapshot of the network can reveal variations between links by several orders of magnitude. (A real data example is given in Section 7.) It is problematic if the algorithm for counting the number of flows works well (e.g. relative root mean square estimation errors are below some threshold) on some links but not on others due to different scales.
There have been many solutions developed in the computer science literature to address the distinct counting problem in the stream setting, most notably Flajolet and Martin (1985), Whang et al. (1990), Gibbons (2001), Durand and Flajolet (2003), Estan et al. (2006), and Flajolet et al. (2007), among others. Various asymptotic analyses have been carried out recently; see Kane et al. (2010) and references therein. The key idea is to obtain a statistical estimate by designing a compact and easy-to-compute summary statistic (also called a sketch in computer science) from the streaming data. Some of these methods (e.g. LogLog counting by Durand and Flajolet (2003) and HyperLogLog counting by Flajolet et al. (2007)) have nice statistical properties such as asymptotic unbiasedness. However, the performance of these existing solutions often depends on the unknown cardinalities and cannot be uniformly good in the targeted range of cardinalities [1, N]. For example, with limited memory, linear counting proposed by Whang et al. (1990) works best with small cardinalities while the LogLog counting method works best with large cardinalities.

Let the performance of a distinct counting method be measured by its relative root mean square error (RRMSE), defined by
\[ \mathrm{Re}(\hat{n}) = \sqrt{E\left(n^{-1}\hat{n} - 1\right)^2}, \]
where n is the distinct count parameter and n̂ is its estimate. In this article we develop a novel statistics-based distinct counting algorithm, called S-bitmap, that is scale-invariant, in the sense that its RRMSE is invariant to the unknown cardinalities in a wide range without additional memory and computational costs, i.e. there exists a constant ε > 0 such that
\[ \mathrm{Re}(\hat{n}) \equiv \epsilon, \quad \text{for } n = 1, \cdots, N. \tag{1} \]
S-bitmap uses the bitmap, i.e., a binary vector, to summarize the data for approximate counting, where the binary entries are changed from 0 to 1 by an adaptive sampling process.
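As a concrete illustration of the performance measure (hypothetical code, not from the paper), the RRMSE of any estimator can be computed empirically from a batch of estimates against the true cardinality:

```python
import math

def rrmse(estimates, n):
    """Empirical relative root mean square error of a list of
    cardinality estimates against the true count n."""
    return math.sqrt(sum((est / n - 1.0) ** 2 for est in estimates) / len(estimates))

# A scale-invariant method keeps this value (approximately) constant
# as the true cardinality n varies over [1, N].
print(rrmse([95, 103, 101, 99], 100))  # a small relative error
```

Scale-invariance in the sense of (1) means this quantity would be the same whether n is in the hundreds or in the millions.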
In the spirit of Morris (1978), the sampling rates decrease sequentially as more entries change to 1, with the optimal rate learned from the current state of the bitmap. The cardinality estimate is then obtained using a non-stationary Markov chain model derived from S-bitmap. We use martingale properties to prove that our S-bitmap estimate is unbiased and, more importantly, that its RRMSE is indeed scale-invariant. Both simulation and experimental studies are reported. To achieve the same accuracy as state-of-the-art methods, S-bitmap requires significantly less memory for many common practice cardinality scales with similar or lower computational cost.

The distinct counting problem we consider here is weakly related to the traditional 'estimating the number of species' problem; see Bunge and Fitzpatrick (1993), Haas and Stokes (1998), Mao (2006) and references therein. However, traditional solutions that rely on sample sets of the population are impractical in the streaming context due to restrictive memory and computational constraints. While traditional statistical studies (see Bickel and Doksum, 2001) mostly focus on statistical inference given a measurement model, a critical new component of the solution in the online setting, as we study in this paper, is that one has to design much more compact summary statistics from the data (equivalent to a model), which can be computed online.

The remainder of the paper is organized as follows. Section 2 further elaborates the background and reviews several competing online algorithms from the literature. Sections 3 and 4 describe S-bitmap and estimation. Section 5 provides the dimensioning rule for S-bitmap and its analysis. Section 6 reports simulation studies including both performance evaluation and comparison with state-of-the-art algorithms. Experimental studies are reported in Section 7.
Throughout the paper, P and E denote probability and expectation, respectively, ln(x) and log(x) denote the natural logarithm and base-2 logarithm of x, and Table 1 lists most notations used in the paper. The S-bitmap algorithm has been successfully implemented in some Alcatel-Lucent network monitoring products. A 4-page poster about the basic idea of S-bitmap (see Chen and Cao, 2009) was presented at the International Conference on Data Engineering in 2009.

2 Background

In this section, we provide some background and review in detail a few classes of benchmark online distinct counting algorithms from the existing literature that only require limited memory and computation. Readers familiar with the area can simply skip this section.

2.1 Overview

Let X = {x_1, x_2, · · · , x_T} be a sequence of items with possible replicates, where x_i can be numbers, texts, images or other digital symbols. The problem of distinct counting is to estimate the number of distinct items in the sequence, denoted as n = |{x_i : 1 ≤ i ≤ T}|. For example, if x_i is the i-th word in a book, then n is the number of unique words in the book. It is obvious that an exact solution can be obtained by listing all distinct items (e.g. words in the example).
Table 1: Some notations used in the paper.

  Variable      Meaning
  m             memory requirement in bits
  n             cardinality to be estimated
  n̂             S-bitmap estimate of n
  P, E, var     probability, expectation, variance
  Re(n̂)         sqrt(E(n̂ n^{-1} − 1)^2) (relative root mean square error)
  [0, N]        the range of cardinalities to be estimated
  C^{-1/2}, ε   (expected, theoretical) relative root mean square error of S-bitmap
  V             a bitmap vector
  p_b           sequential sampling rate (1 ≤ b ≤ m)
  S_t           bucket location in V
  L_t           number of 1s in V after the t-th distinct item is hashed into V
  I_t           indicator of whether the t-th distinct item fills an empty bucket in V
  𝓛(t)          the set of locations of buckets filled with 1s in V
  T_b           number of distinct items after b buckets are filled with 1s in V
  t_b           expectation of T_b

However, as we can easily see, this solution quickly becomes less attractive when n becomes large, as it requires memory linear in n for storing the list and on the order of log n item comparisons for checking the membership of an item in the list. The objective of online algorithms is to process the incoming data stream in real time, where each data item can be seen only once, and derive an approximate count with accuracy guarantees but with a limited storage and computation budget. A typical online algorithm consists of the following two steps. First, instead of storing the original data, one designs a compact sketch such that the essential information about the unknown quantity (cardinality in this case) is kept. The second step is an inference step where the unknown quantity is treated as the parameter of interest, and the sketch is modeled as random variables (functions) associated with the parameter. In the following, we first review a class of bitmap algorithms including linear counting by Whang et al. (1990) and multi-resolution bitmap (mr-bitmap) by Estan et al. (2006), which are closely related to our new approach.
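For reference, the exact solution described above amounts to maintaining a set of all items seen; its linear memory growth is exactly what the streaming algorithms below avoid (a minimal illustration, not one of the paper's algorithms):

```python
def exact_distinct_count(stream):
    """Exact cardinality: memory grows linearly with the number
    of distinct items, which is infeasible for large streams."""
    seen = set()
    for x in stream:
        seen.add(x)   # duplicates are absorbed by the set
    return len(seen)

print(exact_distinct_count(["a", "b", "a", "c", "b"]))  # → 3
```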
Then we describe another class of Flajolet-Martin type algorithms. We also briefly cover other methods, such as sampling, that do not follow exactly the above online sketching framework. An excellent review of these and other existing methods can be found in Beyer et al. (2009), Metwally et al. (2008) and Gibbons (2009); in particular, Metwally et al. (2008) provides extensive simulation comparisons. Our new approach will be compared with three state-of-the-art algorithms from the first two classes of methods: mr-bitmap, LogLog counting and HyperLogLog counting.

2.2 Bitmap

The bitmap scheme for distinct counting was first proposed in Astrahan et al. (1987) and then analyzed in detail in Whang et al. (1990). To estimate the cardinality of the sequence, the basic idea of bitmap is to first map the n distinct items uniformly randomly to m buckets such that replicate items are mapped to the same bucket, and then estimate the cardinality based on the number of non-empty buckets. Here the uniform random mapping is achieved using a universal hash function (see Knuth, 1998), which is essentially a pseudo uniform random number generator that takes a variable-size input, called a 'key' (i.e. seed), and returns an integer distributed uniformly in the range [1, m].¹ For convenience, let h : X → {1, · · · , m} be a universal hash function, which takes a key x ∈ X and maps it to a hash value h(x). For theoretical analysis, we assume that the hash function distributes the items randomly, e.g. for any x, y ∈ X with x ≠ y, h(x) and h(y) can be treated as two independent uniform random numbers.

¹ As an example, taking the input datum x as an integer, the Carter-Wegman hash function is as follows: h(x) = ((ax + b) mod p) mod m, where p is a large prime, and a, b are two arbitrarily chosen integers modulo p with a ≠ 0. Here x is the key and the output is an integer in {1, · · · , m} if we replace 0 with m.

A bitmap of length m is simply a binary vector, say V = (V[1], . . . , V[m]), where each element V[k] ∈ {0, 1}. The basic bitmap algorithm for online distinct counting is as follows. First, initialize V[k] = 0 for k = 1, · · · , m. Then for each incoming data item x ∈ X, compute its hash value k = h(x) and update the corresponding entry in the bitmap by setting V[k] = 1. For convenience, this is summarized in Algorithm 1.

Algorithm 1 Basic bitmap
Input: a stream of items x; V (a bitmap vector of zeros with size m)
Output: |V| (number of entries with 1s in V)
Configuration: m
1: for x ∈ X do
2:   compute its hash value k = h(x)
3:   if V[k] = 0 then
4:     update V[k] = 1
5: Return |V| = Σ_{k=1}^m V[k].

Notice that the bitmap algorithm requires a storage of m bits attributed to the bitmap and requires no additional storage for the data. It is easy to show that each entry V[k] is Bernoulli(1 − (1 − m^{-1})^n), and hence the distribution of |V| = Σ_{k=1}^m V[k] only depends on n. Various estimates of n have been developed based on |V|; for example, linear counting, as mentioned above, uses the estimator m ln(m(m − |V|)^{-1}). The name 'linear counting' comes from the fact that its memory requirement is almost linear in n in order to obtain good estimation. Typically, N is much larger than the required memory m (in bits); thus the mapping from {0, · · · , m} to {1, · · · , N} cannot be one-to-one, i.e. perfect estimation, but one-to-multiple. A bitmap of size m can only be used to estimate cardinalities less than m log m with certain accuracy. In order to make it scalable to a larger cardinality scale, a few improved methods based on bitmap have been developed (see Estan et al., 2006). One method, called virtual bitmap, is to apply the bitmap scheme on a subset of items that is obtained by sampling the original items with a given rate r. Then an estimate of n can be obtained by estimating the cardinality of the sampled subset. But it is impossible for a virtual bitmap with a single r to estimate a wide range of cardinalities accurately. Estan et al. (2006) proposed a multiresolution bitmap (mr-bitmap) to improve virtual bitmap. The basic idea of mr-bitmap is to make use of multiple virtual bitmaps, each with a different sampling rate, and embed them into one bitmap in a memory-efficient way. To be precise, it first partitions the original bitmap into K blocks (equivalent to K virtual bitmaps), and then associates buckets in the k-th block with a sampling rate r_k for screening distinct items. It may be worth pointing out that mr-bitmap determines K and the sampling rates with a quasi-optimal strategy; it is still an open question how to optimize them, which we leave for future study. Though there is no rigorous analysis in Estan et al. (2006), mr-bitmap is not scale-invariant, as suggested by the simulations in Section 6.

2.3 Flajolet-Martin type algorithms

The approach of Flajolet and Martin (1985) (FM) has pioneered a different class of algorithms. The basic idea of FM is to first map each item x to a geometric random number g, and then record the maximum value max(g) of the geometric random numbers, which can be updated sequentially. In the implementation of FM, upon the arrival of an item x, the corresponding g is the location of the left-most 1 in the binary vector h(x) (each entry of the binary vector follows Bernoulli(1/2)), where h is a universal hash function as mentioned earlier. Therefore P(g = k) = 2^{-k}.
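The basic bitmap with the linear counting estimator m ln(m/(m − |V|)) can be sketched as follows (an illustrative implementation; Python's built-in `hash` stands in for the universal hash function, which is an assumption of this sketch):

```python
import math

def linear_count(stream, m=1024):
    """Basic bitmap (Algorithm 1) with the linear counting
    estimator m * ln(m / (m - |V|)) of Whang et al."""
    V = [0] * m
    for x in stream:
        k = hash(x) % m        # stand-in for a universal hash h(x)
        V[k] = 1               # duplicates hit the same bucket
    filled = sum(V)
    if filled == m:            # bitmap saturated; estimator undefined
        return float("inf")
    return m * math.log(m / (m - filled))

stream = [f"item-{i % 500}" for i in range(10_000)]  # 500 distinct items
print(round(linear_count(stream)))  # close to 500 for m = 1024
```

As the text notes, accuracy degrades once n approaches m log m, which motivates the virtual and multi-resolution variants.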
Naturally, by hashing, replicate items are mapped to the same geometric random number. The maximum order statistic max(g) is the summary statistic for FM, also called the FM sketch in the literature. Note that the distribution of max(g) is completely determined by the number of distinct items. By randomly partitioning the items into m groups, the FM approach obtains m maximum random numbers, one for each group, which are independent and identically distributed, and then estimates the distinct count by a moment method. Since FM makes use of the binary value of h(x), which requires at most log(N) bits of memory, where N is the upper bound of distinct counts (taken as a power of 2), it is also called log-counting. Various extensions of the FM approach have been explored in the literature based on the k-th maximum order statistic, where k = 1 corresponds to FM (see Giroire, 2005; Beyer et al., 2009). Flajolet and his collaborators have more recently proposed two innovative methods, called LogLog counting and HyperLogLog as mentioned above, published in 2003 and 2007, respectively. Both methods use the technique of recording the binary value of g directly, which requires at most log(log N) bits (taking N such that log(log N) is an integer), and therefore are also called loglog-counting. This provides a more compact summary statistic than FM. HyperLogLog is built on a more efficient estimator than LogLog; see Flajolet et al. (2007) for the exact formulas of the estimators. Simulations suggest that although HyperLogLog may have a bounded RRMSE for cardinalities in a given range, its RRMSE fluctuates as cardinalities change and thus it is not scale-invariant.
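The shared summary statistic of this family — per-group maxima of geometric ranks — can be sketched as follows (illustrative only: SHA-1 stands in for the universal hash function, and the bias-corrected estimators of FM, LogLog and HyperLogLog are omitted; see the cited papers for the exact formulas):

```python
import hashlib

def geometric_rank(x, salt="rank"):
    """Position of the left-most 1 bit in a pseudo-random bit string
    derived from x, so that P(g = k) = 2**(-k)."""
    digest = hashlib.sha1((salt + repr(x)).encode()).digest()
    bits = bin(int.from_bytes(digest, "big"))[2:].zfill(8 * len(digest))
    return bits.find("1") + 1

def fm_registers(stream, m=64):
    """Per-bucket maxima of geometric ranks: the summary statistic
    underlying FM-type sketches (the estimators applied to it differ)."""
    M = [0] * m
    for x in stream:
        h = int(hashlib.sha1(repr(x).encode()).hexdigest(), 16)
        j = h % m                          # bucket index
        M[j] = max(M[j], geometric_rank(x))  # duplicates cannot raise the max
    return M
```

Because each register stores only a small integer rank, it fits in about log(log N) bits, which is the source of the loglog-counting name.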
2.4 Distinct sampling

Flajolet (1990) proposed a novel sampling algorithm, called Wegman's adaptive sampling, which collects a random sample of the distinct elements (binary values) of size no more than a pre-specified number. Upon arrival of a new distinct element, if the sample size of the existing collection is more than a threshold, the algorithm removes some of the collected sample, and the new element is inserted with a sampling rate 2^{-k}, where k starts from 0 and grows adaptively according to available memory. The distinct sampling of Gibbons (2001) uses the same idea to collect a random sample of distinct elements. These sampling algorithms are essentially different from the above two classes of algorithms based on one-scan sketches, and are computationally less attractive as they require scanning the whole existing collection periodically. They belong to the log-counting family with memory cost on the order of ε^{-2} log(N), where ε is the asymptotic RRMSE, but their asymptotic memory efficiency is somewhat worse than that of the original FM method; see Flajolet et al. (2007) for an asymptotic comparison. Flajolet (1990) has shown that with a finite population, the RRMSE of Wegman's adaptive sampling exhibits periodic fluctuations, depending on the unknown cardinalities, and thus it is not scale-invariant as defined by (1). Our new approach makes use of the general idea of adaptive sampling, but is quite different from these sampling algorithms, as ours does not require collecting a sample set of distinct values, and furthermore is scale-invariant as shown later.

3 Self-learning Bitmap

As we have explained in Section 2.2, the basic bitmap (see Algorithm 1), as well as virtual bitmap, provides a memory-efficient data summary, but they cannot be used to estimate cardinalities accurately in a wide range.
In this section, we describe a new approach for online distinct counting by building a self-learning bitmap (S-bitmap for abbreviation), which not only is memory-efficient, but provides a scale-invariant estimator with high accuracy. The basic idea of S-bitmap is to build an adaptive sampling process into a bitmap as our summary statistic, where the sampling rates decrease sequentially as more and more new distinct items arrive. The motivation for decreasing sampling rates is easy to perceive: if one draws a Bernoulli sample with rate p from a population of unknown size n and obtains a Binomial count, say X ∼ Binomial(n, p), then the maximum likelihood estimate p^{-1}X of n has relative mean square error E(n^{-1}p^{-1}X − 1)^2 = (1 − p)/(np). So, to achieve a constant relative error, one needs to use a smaller sampling rate p on a larger population of size n. The sampling idea is similar to the "adaptive sampling" of Morris (1978), which was proposed for counting a large number of items with no item-duplication using a small memory space. However, since the main issue of distinct counting is item-duplication, Morris' approach does not apply here. Below we describe S-bitmap and show how it deals with the item-duplication issue effectively.

The basic algorithm for extracting the S-bitmap summary statistic is as follows. Let 1 ≥ p_1 ≥ p_2 ≥ · · · ≥ p_m > 0 be specified sampling rates. A bitmap vector V ∈ {0, 1}^m of length m is initialized with 0, and a counter L, which tracks the number of buckets filled with 1s, is initialized to 0. Upon the arrival of a new item x (treated as a string or binary vector), it is mapped by a universal hash function, using x as the key, to some k ∈ {1, · · · , m}. If V[k] = 1, then skip to the next item; otherwise, with probability p_{L+1}, V[k] is changed from 0 to 1, in which case L is increased by 1. (See Figure 1 for an illustration.)
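The motivating calculation can be checked numerically (an illustrative simulation, not from the paper): the relative mean square error of p^{-1}X is (1 − p)/(np), so keeping it constant as n grows forces p to shrink.

```python
import random

def rel_mse(n, p, trials=2000, seed=7):
    """Monte Carlo relative mean square error of the MLE X/p
    for the population size n, with X ~ Binomial(n, p)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x = sum(rng.random() < p for _ in range(n))  # Binomial(n, p) draw
        total += (x / (p * n) - 1.0) ** 2
    return total / trials

n, p = 1000, 0.1
print(rel_mse(n, p), (1 - p) / (n * p))  # both about 0.009
```

For instance, doubling n while halving p leaves (1 − p)/(np) nearly unchanged, which is exactly the behavior S-bitmap's decreasing rate sequence is designed to exploit.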
Figure 1: Update of the bitmap vector: in case 1, just skip to the next item, and in case 2, with probability p_{L+1}, where L is the number of 1s in V so far, the bucket value is changed from 0 to 1.

Note that the sampling is also realized with a universal hash function using x as the key. Here, L ∈ {0, 1, · · · , m} indicates how many buckets have been filled with 1s by the end of the stream update. Obviously, the bigger L is, the larger the cardinality is expected to be. We show in Section 4 how to use L to characterize the distinct count.

If m = 2^c for some integer c, then S-bitmap can be implemented efficiently as follows. Let d be an integer. Each item x is mapped by a universal hash function, using x as the key, to a binary vector of length c + d. Let j and u be the two integers that correspond to the binary representations of the first c bits and the last d bits, respectively. Then j is the bucket location in the bitmap that the item is hashed into, and u is used for sampling. It is easy to see that j and u are independent. If the bucket is empty, i.e. V[j] = 0, then check whether u2^{-d} < p_{L+1}, and if true, update V[j] = 1. If the bucket is not empty, then just skip to the next item. This is summarized in Algorithm 2, where the choice of (p_1, · · · , p_m) is described in Section 5. Here we follow the setting of the LogLog counting paper by Durand and Flajolet (2003) and take X = {0, 1}^{c+d}. There is a chance of collision for hash functions. Typically d = 30, which is small relative to m, is sufficient for N in the order of millions. Since the sequential sampling rates p_L depend only on L, which allows us to learn the number of distinct items that have already passed, the algorithm is called the Self-learning bitmap (S-bitmap).²

² Statistically, the self-learning process can also be called adaptive sampling. We note that Estan et al. (2006) have used 'adaptive bitmap' to stand for a virtual bitmap where the sampling rate is chosen adaptively based on another rough estimate, and that Flajolet (1990) has used 'adaptive sampling' for subset sampling. To avoid potential confusion with these, we use the name 'self-learning bitmap' instead of 'adaptive sampling bitmap'.

Algorithm 2 S-bitmap (SKETCHING UPDATE)
Input: a stream of items x (hashed binary vector with size c + d); V (a bitmap vector of zeros with size m = 2^c)
Output: B (number of buckets with 1s in V)
Configuration: m
1: Initialize L = 0
2: for x = b_1 · · · b_{c+d} ∈ X do
3:   set j := [b_1 · · · b_c]_2 (integer value of first c bits in base 2)
4:   if V[j] = 0 then
5:     u = [b_{c+1} · · · b_{c+d}]_2
6:     # sampling #
7:     if u2^{-d} < p_{L+1} then
8:       V[j] = 1
9:       L = L + 1
10: Return B = L.

We note that the decreasing property of the sampling rates, beyond the above heuristic optimality, is also sufficient and necessary for filtering out all duplicated items. To see the sufficiency, just note that if an item is not sampled on its first appearance, then the d-bit number associated with it (say u, in line 5 of Algorithm 2) is larger than its current sampling rate, say p_L. Thus its later replicates, still mapped to u, will not be sampled either, due to the monotone property. Mathematically, if the item is mapped to u with u2^{-d} > p_L, then u2^{-d} > p_{L+1} since p_{L+1} ≤ p_L. On the other hand, if p_{L+1} > p_L, then in line 7 of Algorithm 2, P(p_L < u2^{-d} < p_{L+1}) > 0; that is, there is a positive probability that the item mapped to u, on its first appearance, is not sampled at L, but a later replicate is sampled at L + 1, which establishes the necessity.
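The update of Algorithm 2 can be sketched in Python as follows (an illustrative implementation: SHA-1 stands in for the universal hash function, and the toy geometric rate sequence in the example is hypothetical, since the paper's dimensioning rule for (p_1, · · · , p_m) is derived in Section 5):

```python
import hashlib

def sbitmap_update(stream, p, c=10, d=30):
    """S-bitmap sketching update (Algorithm 2).

    p : sampling rates indexed p[1..m], non-increasing.
    Returns B, the number of buckets filled with 1s."""
    m = 2 ** c
    V = [0] * m
    L = 0
    for x in stream:
        h = int(hashlib.sha1(repr(x).encode()).hexdigest(), 16)
        bits = h & ((1 << (c + d)) - 1)   # c + d hashed bits
        j = bits >> d                     # first c bits: bucket location
        u = bits & ((1 << d) - 1)         # last d bits: sampling
        # A replicate reuses the same (j, u): if its u failed the test at
        # some rate, monotonicity means it fails at all later rates too.
        if V[j] == 0 and u * 2.0 ** (-d) < p[L + 1]:
            V[j] = 1
            L += 1
    return L

m = 2 ** 10
p = [None] + [0.99 ** k for k in range(1, m + 1)]  # hypothetical rates
B = sbitmap_update((f"item-{i % 2000}" for i in range(10_000)), p)
# B feeds the estimator of Section 4.
```

Note that the output depends only on the distinct items in order of first arrival, which is the duplicate-filtering property argued above.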
The argument of sufficiency here will be used to derive S-bitmap's Markov property in Section 4.1, which leads to the S-bitmap estimate of the distinct count using L. It is interesting to see that, unlike mr-bitmap, the sampling rates for S-bitmap are not associated with the bucket locations, but only depend on the arrival of new distinct items, through increases of L. In addition, we use the memory more efficiently, since we can adaptively change the sampling rates to fill in more buckets, while mr-bitmap may leave some virtual bitmaps unused or some completely filled, which leads to some waste of memory.

We further note that in the S-bitmap update process, only one hash is needed for each incoming item. For the bucket update, only if the mapped bucket is empty are the last d bits of the hashed value used to determine whether the bucket should be filled with 1 or not. Note that the sampling rate changes only when an empty bucket is filled with 1. For example, if K buckets become filled by the end of the stream, the sampling rates only need to be updated K times. Therefore, the computational cost of S-bitmap is very low, and is similar to or lower than that of benchmark algorithms such as mr-bitmap, LogLog and HyperLogLog (in fact, HyperLogLog uses the same summary statistic as LogLog and thus their computational costs are the same).

4 Estimation

In this section, we first derive a Markov chain model for the above L sequence and then obtain the S-bitmap estimator.

4.1 A non-stationary Markov chain model

From the S-bitmap update process, it is clear that the n distinct items are randomly mapped into the m buckets, but not all corresponding buckets have value 1. From the above sufficiency argument, due to the decreasing sampling rates, the bitmap filters out replicate items automatically and its update only depends on the first arrival of each distinct item, i.e. new item.
W ithout los s of generality , let the n distinct items be hashed into locations S 1 , S 2 , · · · , S n with 1 ≤ S i ≤ m , indexed by the sequence of t heir fi rst arri v als. Obvious ly , the S i are i.i.d.. Let I t be the indicator of whether or n ot th e t -th di stinct item fill s an empty bucket with 1. In other words, I t = 1 if and only if t he t -th distinct item is hashed into an empty bucket (i.e. 17 with value 0) and further fills i t with 1. Given the first t − 1 distinct items, let L ( t − 1) = { S j : I j = 1 , 1 ≤ j ≤ t − 1 } be the bucke ts t hat are fill ed with 1, and L t − 1 = |L ( t − 1 ) | be the number of buckets filled wi th 1. Then L t = L t − 1 + I t . Upon the arri va l of the t -th d istinct it em that is hash ed to bucket locati on S t , if S t does not belong to L ( t − 1) , i.e, the buck et is empty , then by the design of S -bitmap, I t is independent o f S t . T o be precise, as defined in line 3 and 5 of Algorithm 2, j and u associated wi th x are i ndependent, one determining the locatio n S t and the other determining sampl ing I t . Obviously , according to line 7 of A lgorithm 2, the condit ional p robability t hat the t -th distinct it em fills the S t -th bucket with 1 is p L t − 1 +1 , otherwise is 0, that is, P ( I t = 1 | S t / ∈ L ( t − 1) , L t − 1 ) = p L t − 1 +1 and P ( I t = 1 | S t ∈ L ( t − 1) , L t − 1 ) = 0 . The final output from the update algorithm is denoted by B , i.e. B ≡ L n = n X t =1 I t , where n is th e parameter to be estimated. Since S t and L ( t − 1) are independent, we ha ve P ( I t = 1 | L t − 1 ) = P ( I t = 1 | S t / ∈ L ( t − 1) , L t − 1 ) P ( S t / ∈ L ( t − 1) | L t − 1 ) = p L t − 1 +1 · (1 − L t − 1 m ) . This lea ds to the Markov chain property of L t as summarized in the theorem belo w . 18 Theor em 1 Let q k = (1 − m − 1 ( k − 1) ) p k for k = 1 , · · · , m . If the mono tonicity condition h olds, i.e . 
$p_1 \ge p_2 \ge \cdots$, then $\{L_t : t = 1, \cdots, n\}$ follows a non-stationary Markov chain model:
$$L_t = \begin{cases} L_{t-1} + 1, & \text{with probability } q_{L_{t-1}+1}, \\ L_{t-1}, & \text{with probability } 1 - q_{L_{t-1}+1}. \end{cases}$$

4.2 Estimation

Let $T_k$ be the index of the distinct item that fills an empty bucket with 1 such that there are $k$ buckets filled with 1 by that time. That is, $\{T_k = t\}$ is equivalent to $\{L_{t-1} = k-1 \text{ and } I_t = 1\}$. Now given the output $B$ from the update algorithm, obviously $T_B \le n < T_{B+1}$. A natural estimate of $n$ is
$$\hat{n} = t_B, \qquad (2)$$
where $t_b = E T_b$, $b = 1, 2, \cdots$. Let $T_0 \equiv 0$ and $t_0 = 0$ for convenience. The following properties hold for $T_b$ and $t_b$.

Lemma 1 Under the monotonicity condition on $\{p_k\}$, the increments $T_k - T_{k-1}$, for $1 \le k \le m$, are independently distributed with geometric distributions, and for $t \ge 1$,
$$P(T_k - T_{k-1} = t) = (1 - q_k)^{t-1} q_k.$$
The expectation and variance of $T_b$, $1 \le b \le m$, can be expressed as
$$t_b = \sum_{k=1}^b q_k^{-1} \quad \text{and} \quad \mathrm{var}(T_b) = \sum_{k=1}^b (1 - q_k)\, q_k^{-2}.$$

The proof of Lemma 1 follows from standard Markov chain theory and is provided in the appendix for completeness. Below we analyze how to choose the sequential sampling rates $\{p_1, \cdots, p_m\}$ such that $Re(\hat{n})$ is stabilized for arbitrary $n \in \{1, \cdots, N\}$.

5 Dimensioning rule and analysis

In this section, we first describe the dimensioning rule for choosing the sampling rates $\{p_k\}$. Notice that $T_b$ would be an unbiased estimate of $t_b = E T_b$ if $T_b$ were observable but $t_b$ unknown, where $t_1 < t_2 < \cdots < t_m$. Again formally denote $Re(T_b) = \sqrt{E(T_b t_b^{-1} - 1)^2}$ as the relative error. In order to make the RRMSE of S-bitmap invariant to the unknown cardinality $n$, our idea is to choose the sampling rates $\{p_k\}$ such that $Re(T_b)$ is invariant for $1 \le b \le m$, since $n$ must fall between two consecutive $T_b$'s.
We then prove that although the $T_b$ are unobservable, choosing parameters that stabilize $Re(T_b)$ is sufficient for stabilizing the RRMSE of S-bitmap for all $n \in \{1, \cdots, N\}$.

5.1 Dimensioning rule

To stabilize $Re(T_b)$, we need some constant $C$ such that for $b = 1, \cdots, m$,
$$Re(T_b) \equiv C^{-1/2}. \qquad (3)$$
This leads to the dimensioning rule for S-bitmap, as summarized by the following theorem, where $C$ is determined later as a function of $N$ and $m$.

Theorem 2 Let $\{T_k - T_{k-1} : 1 \le k \le m\}$ follow independent geometric distributions as in Lemma 1. Let $r = 1 - 2(C+1)^{-1}$. If
$$p_k = \frac{m}{m+1-k}\,(1 + C^{-1})\, r^k,$$
then we have, for $k = 1, \cdots, m$,
$$\frac{\sqrt{\mathrm{var}(T_k)}}{E T_k} \equiv C^{-1/2}. \qquad (4)$$
That is, the relative errors $Re(T_b)$ do not depend on $b$.

Proof. Note that (4) is equivalent to
$$\frac{\mathrm{var}(T_{b+1})}{t_{b+1}^2} = \frac{\mathrm{var}(T_b)}{t_b^2}.$$
By Lemma 1, this is equivalent to
$$\frac{\mathrm{var}(T_b) + (1 - q_{b+1})\, q_{b+1}^{-2}}{(t_b + q_{b+1}^{-1})^2} = \frac{\mathrm{var}(T_b)}{t_b^2}.$$
Since $\mathrm{var}(T_b) = C^{-1} t_b^2$, then
$$q_{b+1}^{-1} = \frac{C}{C-1} + \frac{2 t_b}{C-1}. \qquad (5)$$
Since $t_{b+1} = t_b + q_{b+1}^{-1}$, we have
$$t_{b+1} = \frac{C+1}{C-1}\, t_b + \frac{C}{C-1}.$$
By induction,
$$t_{b+1} = \Big(\frac{C+1}{C-1}\Big)^b \big(t_1 + 2^{-1}C\big) - \frac{C}{2}.$$
Since $\mathrm{var}(T_1) = (1 - q_1)\, q_1^{-2} = C^{-1} t_1^2$ and $t_1 = q_1^{-1}$, we have $t_1 = C(C-1)^{-1}$. Hence with some calculus we have, for $r = 1 - 2(C+1)^{-1}$,
$$t_b = \frac{C}{2}\big(r^{-b} - 1\big), \qquad q_b = (1 + C^{-1})\, r^b.$$
Since $q_b = (1 - \frac{b-1}{m})\, p_b$, the sequential sampling rate $p_b$, for $b = 1, \cdots, m$, can be expressed as
$$p_b = \frac{m}{m+1-b}\,(1 + C^{-1})\, r^b.$$
The conclusion follows as the steps can be reversed.

It is easy to check that the monotonicity property holds strictly for $\{p_k : 1 \le k \le m - 2^{-1}C\}$, thus satisfying the condition of Lemma 1. For $k > m - 2^{-1}C$, the monotonicity does not hold. So it is natural to expect that the upper bound $N$ is achieved when $m - 2^{-1}C$ buckets (suppose $C$ is even) in the bitmap turn into 1, i.e.
$t_{m - 2^{-1}C} = N$, or,
$$N = \frac{C}{2}\Big(r^{-(m - 2^{-1}C)} - 1\Big). \qquad (6)$$
Since $r = 1 - 2(C+1)^{-1}$, we obtain
$$m = \frac{C}{2} + \frac{\ln(1 + 2NC^{-1})}{\ln(1 + 2(C-1)^{-1})}. \qquad (7)$$
Now, given the maximum possible cardinality $N$ and bitmap size $m$, $C$ can be solved uniquely from this equation. For example, if $N = 10^6$ and $m = 30{,}000$ bits, then from (7) we can solve $C \approx 0.01^{-2}$. That is, if the sampling rates $\{p_k\}$ in Theorem 2 are designed using such $(m, N)$, then $Re(\hat{n})$ can be expected to be approximately 1% for all $n \in \{1, \cdots, 10^6\}$. In other words, to achieve errors of no more than 1% for all possible cardinalities from 1 to $N$, we need only about 30 kilobits of memory for S-bitmap. Since $\ln(1+x) \approx x(1 - \frac{1}{2}x)$ for $x$ close to 0, (7) also implies that to achieve a small RRMSE $\epsilon$, which equals $(C-1)^{-1/2}$ according to Theorem 3 below, the memory requirement can be approximated as follows:
$$m \approx \frac{1}{2}\,\epsilon^{-2}\big(1 + \ln(1 + 2N\epsilon^2)\big).$$
Therefore, asymptotically, the memory efficiency of S-bitmap is much better than that of log-counting algorithms, which require memory on the order of $\epsilon^{-2}\log N$. Furthermore, assuming $N\epsilon^{-2} \gg 1$, if $\epsilon < \sqrt{(\log N)^{\eta}/(2eN)}$, where $\eta \approx 3.1206$, then S-bitmap is better than Hyper-LogLog counting, which requires memory of approximately $1.04^2\, \epsilon^{-2} \log(\log N)$ (see Flajolet et al., 2007) in order to achieve an asymptotic RRMSE $\epsilon$; otherwise it is worse than Hyper-LogLog.

Remark. In implementation, we set $p_b \equiv p_{m - 2^{-1}C}$ for $m - 2^{-1}C \le b \le m$ so that the sampling rates satisfy the monotone property, which is necessary by Lemma 1. Since the focus is on cardinalities in the range from 1 to $N$ as pre-specified, which corresponds to $B \le m - 2^{-1}C$ as discussed above, we simply truncate the output $L_n$ by $m - 2^{-1}C$ if it is larger than this value, which becomes possible when $n$ is close to $N$; that is,
$$B = \min(L_n,\, m - 2^{-1}C). \qquad (8)$$
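The dimensioning rule can be checked numerically. The sketch below is our own (the function names and the bisection bracket are illustrative choices): it solves equation (7) for $C$ given $(N, m)$, builds the rates $p_k$ of Theorem 2, and verifies via Lemma 1 that $\sqrt{\mathrm{var}(T_b)}/t_b$ stays at $C^{-1/2}$ over the monotone range. For $N = 10^6$ and $m = 30{,}000$ it recovers $\epsilon \approx 1\%$, as in the example above.

```python
import math

def required_m(C, N):
    """Right-hand side of equation (7): bitmap size needed for scale N."""
    return C / 2 + math.log(1 + 2 * N / C) / math.log(1 + 2 / (C - 1))

def solve_C(N, m, lo=2.0, hi=1e8):
    """Bisection on C; required_m is increasing in C over this range."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if required_m(mid, N) < m:
            lo = mid
        else:
            hi = mid
    return lo

N, m = 10 ** 6, 30_000
C = solve_C(N, m)
eps = (C - 1) ** -0.5             # the scale-invariant RRMSE

# Lemma 1: t_b = sum 1/q_k, var(T_b) = sum (1 - q_k)/q_k^2.  Theorem 2's
# rates make sqrt(var(T_b))/t_b constant over the monotone range k <= m - C/2.
r = 1 - 2 / (C + 1)
t_b, var_b, ratios = 0.0, 0.0, []
for k in range(1, int(m - C / 2)):
    p_k = m / (m + 1 - k) * (1 + 1 / C) * r ** k
    q_k = (1 - (k - 1) / m) * p_k          # equals (1 + 1/C) r^k
    t_b += 1 / q_k
    var_b += (1 - q_k) / q_k ** 2
    ratios.append(math.sqrt(var_b) / t_b)

print(round(eps, 4), round(min(ratios), 4), round(max(ratios), 4))
```

The minimum and maximum of the ratios agree to many digits, confirming that the relative error of $T_b$ does not depend on $b$.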
5.2 Analysis

Here we prove that the S-bitmap estimate is unbiased and that its relative estimation error is indeed "scale-invariant", as we had expected, if we ignore the truncation effect in (8) for simplicity.

Theorem 3 Let $B = L_n$, where $L_n$ is the number of 1-bits in the S-bitmap, as defined in Theorem 1, for $1 \le n \le N$. Under the dimensioning rule of Theorem 2, for the S-bitmap estimator $\hat{n} = t_B$ as defined in (2), we have
$$E\hat{n} = n \quad \text{and} \quad \mathrm{RRMSE}(\hat{n}) = (C-1)^{-1/2}.$$

Proof. Let, for $a > 1$,
$$Y_n = \prod_{j=0}^{L_n} \big(1 + (a-1)\, q_j^{-1}\big).$$
By Theorem 1, $L_{n+1} = i + \mathrm{Bernoulli}(q_{i+1})$ if $L_n = i$. Thus
$$E(Y_{n+1} \mid Y_0, Y_1, \cdots, Y_n) = E\big(Y_n I(L_{n+1} = i) + Y_n (1 + (a-1)\, q_{L_{n+1}}^{-1}) I(L_{n+1} = i+1) \mid L_n\big) = Y_n \big\{1 - q_{i+1} + q_{i+1}(1 + (a-1)\, q_{i+1}^{-1})\big\} = Y_n a$$
if $L_n = i$. Therefore $\{a^{-n} Y_n : n = 0, 1, \cdots\}$ is a martingale. Note that $q_i = (1 + C^{-1})\, r^i$, $i \ge 0$, where $r = 1 - 2(C+1)^{-1}$. Since $L_0 = 0$, $E Y_0 = 1 + (a-1)\, q_0^{-1}$, and since $a^{-n} E Y_n = E Y_0$, we have $E Y_n = a^n (1 + (a-1)\, q_0^{-1})$, that is,
$$a^n \big(1 + (a-1)\, q_0^{-1}\big) = E \prod_{j=0}^{L_n} \big(1 + (a-1)\, q_j^{-1}\big).$$
Recall that $t_b = \sum_{j=1}^b q_j^{-1}$ and $\sum_{j=1}^b q_j^{-2}(1 - q_j) = C^{-1}\big(\sum_{j=1}^b q_j^{-1}\big)^2$. Taking the first derivative at $a = 1^{+}$, we have (since $B = L_n$)
$$n + q_0^{-1} = E \sum_{j=0}^{L_n} q_j^{-1} = E t_B + q_0^{-1},$$
and taking the second derivative at $a = 1^{+}$, we have
$$n(n-1) + 2n q_0^{-1} = E\Big(\sum_{j=0}^{L_n} q_j^{-1}\Big)^2 - E \sum_{j=0}^{L_n} q_j^{-2} = E\big(t_B + q_0^{-1}\big)^2 - E\big(q_0^{-2} + t_B + C^{-1} t_B^2\big).$$
Therefore, $E t_B = n$ and $E t_B^2 = n^2 C/(C-1)$. Thus $\mathrm{var}(t_B) = n^2 (C-1)^{-1}$.

Remark. This elegant martingale argument already appeared in Rosenkrantz (1987), but under a different and simpler setting, and we rediscovered it. In implementation, we use the truncated version of $B$, i.e. (8), which is equivalent to truncating the theoretical estimate by $N$ if it is greater than $N$.
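The identities in this proof can be verified numerically. The following sketch is our own illustrative check, with small arbitrary parameters ($C = 400$, $n = 100$): it propagates the distribution of the Markov chain of Theorem 1 with $q_k = (1 + C^{-1}) r^k$ exactly, and confirms $E\,t_B = n$ and $\mathrm{RRMSE}(t_B) = (C-1)^{-1/2}$ up to floating-point error. Since each distinct item fills at most one bucket, $L_n \le n$, so no truncation occurs here.

```python
import math

# Exact check of Theorem 3 by propagating the chain's distribution.
# C and n are small arbitrary test values of our own choosing.
C = 400.0
r = 1 - 2 / (C + 1)
n = 100

q = [0.0] + [(1 + 1 / C) * r ** k for k in range(1, n + 1)]   # q_1..q_n
t = [0.0] * (n + 1)
for b in range(1, n + 1):
    t[b] = t[b - 1] + 1 / q[b]            # t_b = sum_{k<=b} 1/q_k

dist = [1.0] + [0.0] * n                  # dist[k] = P(L_t = k); L_0 = 0
for _ in range(n):                        # one chain step per distinct item
    new = [0.0] * (n + 1)
    for k, pk in enumerate(dist):
        if pk > 0.0:
            new[k] += pk * (1 - q[k + 1])     # no new bucket filled
            new[k + 1] += pk * q[k + 1]       # an empty bucket is filled
    dist = new

mean = sum(pk * t[k] for k, pk in enumerate(dist))
second = sum(pk * t[k] ** 2 for k, pk in enumerate(dist))
rrmse = math.sqrt(second - mean ** 2) / mean
print(mean, rrmse, (C - 1) ** -0.5)
```

The mean comes out at exactly $n$ (unbiasedness holds for any valid rate sequence, as the first-derivative identity shows), while the RRMSE matches $(C-1)^{-1/2}$ only because of the dimensioned $q_k$.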
Since by assumption the true cardinalities are no more than $N$, this truncation removes one-sided bias and thus reduces the theoretical RRMSE as shown in the above theorem. Our simulation below shows that this truncation effect is practically ignorable.

6 Simulation studies and comparison

In this section, we first present empirical studies that justify the theoretical analysis of S-bitmap. Then we compare S-bitmap with state-of-the-art algorithms in the literature in terms of memory efficiency and the scale-invariance property.

6.1 Simulation validation of S-bitmap's theoretical performance

In the above, our theoretical analysis shows that, without truncation by $N$, the S-bitmap has a scale-invariant relative error $\epsilon = (C-1)^{-1/2}$ for $n$ in a wide range $[1, N]$, where $C$ satisfies Equation (7) given bitmap size $m$. We study the S-bitmap estimates based on (8) with two sets of simulations, both with $N = 2^{20}$ (about one million), and then compare empirical errors with the theoretical results. In the first set, we fix $m = 4{,}000$, which gives $C = 915.6$ and $\epsilon = 3.3\%$, and in the second set, we fix $m = 1{,}800$, which gives $C = 373.7$ and $\epsilon = 5.2\%$. We design the sequential sampling rates according to Section 5.1. For $1 \le n \le N$, we simulate $n$ distinct items and obtain the S-bitmap estimate. For each $n$ (a power of 2), we replicate the simulation 1000 times and obtain the empirical RRMSE. These empirical errors are compared with the theoretical errors in Figure 2.

Figure 2: Empirical and theoretical estimation errors of S-bitmap with $m = 4{,}000$ bits and $m = 1{,}800$ bits of memory for estimating cardinalities $1 \le n \le 2^{20}$.
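The same kind of validation can be reproduced at a smaller scale by simulating the Markov chain of Theorem 1 directly, rather than hashing items. The parameters and replication count below are our own small test choices, not the paper's $N = 2^{20}$ settings:

```python
import math
import random

# Monte Carlo validation in the spirit of this section, at a small scale:
# simulate L_t with q_k = (1 + 1/C) r^k, estimate n by t_B, and compare the
# empirical RRMSE against the theoretical (C - 1)^(-1/2).
random.seed(7)
C = 400.0
r = 1 - 2 / (C + 1)
n, reps = 1000, 400

q = [0.0] + [(1 + 1 / C) * r ** k for k in range(1, n + 1)]
t = [0.0] * (n + 1)
for b in range(1, n + 1):
    t[b] = t[b - 1] + 1 / q[b]

sq_err = 0.0
for _ in range(reps):
    L = 0
    for _ in range(n):                 # one Bernoulli step per new item
        if random.random() < q[L + 1]:
            L += 1
    sq_err += (t[L] / n - 1) ** 2      # estimator n_hat = t_B with B = L_n

emp = math.sqrt(sq_err / reps)
print(emp, (C - 1) ** -0.5)            # empirical vs theoretical RRMSE
```

With a few hundred replicates the empirical RRMSE lands close to the theoretical $5\%$ for this choice of $C$, mirroring the agreement seen in Figure 2.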
The results show that, for both sets, the empirical errors and theoretical errors match extremely well, and the truncation effect is hardly visible.

6.2 Comparison with state-of-the-art algorithms

In this subsection, we demonstrate that S-bitmap is more efficient in terms of memory and accuracy, and more reliable, than state-of-the-art algorithms such as mr-bitmap, LogLog and Hyper-LogLog for many practical settings.

Memory efficiency

Hereafter, the memory cost of a distinct counting algorithm stands for the size of the summary statistics (in bits) and does not account for hash functions (whose seeds require some small memory space); we note that the algorithms to be compared here all require at least one universal hash function.

Table 2: Memory cost (in units of 100 bits) of Hyper-LogLog and S-bitmap with given $N$, $\epsilon$.

               $\epsilon$ = 1%         $\epsilon$ = 3%         $\epsilon$ = 9%
    N        HLLog  S-bitmap     HLLog  S-bitmap     HLLog  S-bitmap
    10^3     432.6     59.1       48.1     11.3       5.3      2.4
    10^4     432.6    104.9       48.1     21.9       5.3      3.8
    10^5     540.8    202.2       60.1     34.5       6.7      5.2
    10^6     540.8    315.2       60.1     47.2       6.7      6.6
    10^7     540.8    430.1       60.1     60.0       6.7      8.1

Figure 3: Contour plot of the ratios of the memory cost of Hyper-LogLog to that of S-bitmap with the same $(N, \epsilon)$: the contour line with small circles and label '1' represents the contour with ratio values equal to 1.

From (7), the memory cost for S-bitmap is approximately linear in $\log(2N/C)$. By the theory developed in Durand and Flajolet (2003) and Flajolet et al. (2007), the space requirements for LogLog counting and Hyper-LogLog are approximately $1.30^2 \times \alpha \epsilon^{-2}$ and $1.04^2 \times \alpha \epsilon^{-2}$, respectively, in order to achieve RRMSE $\epsilon = (C-1)^{-1/2}$, where $\alpha = 5$ if $2^{16} \le N < 2^{32}$, and $\alpha = 4$ if $2^8 \le N < 2^{16}$; in general, $\alpha = k+1$ if $2^{2^k} \le N < 2^{2^{k+1}}$ for any positive integer $k$. So LogLog requires about 56% more memory than Hyper-LogLog to achieve the same asymptotic error.
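The two memory-cost formulas are easy to tabulate. The sketch below (function names are ours) computes the theoretical S-bitmap cost from equation (7) with $C = 1 + \epsilon^{-2}$ and the Hyper-LogLog approximation $1.04^2 \alpha \epsilon^{-2}$ quoted above, and reproduces entries of Table 2:

```python
import math

# Theoretical memory costs in bits; illustrative helper names are our own.
def sbitmap_memory(N, eps):
    C = 1 + eps ** -2                       # since eps = (C - 1)^(-1/2)
    return C / 2 + math.log(1 + 2 * N / C) / math.log(1 + 2 / (C - 1))

def hyperloglog_memory(N, eps):
    k = 1
    while not 2 ** (2 ** k) <= N < 2 ** (2 ** (k + 1)):
        k += 1                              # alpha = k + 1 register bits
    return 1.04 ** 2 * (k + 1) * eps ** -2

for N in (10 ** 4, 10 ** 6):
    s, h = sbitmap_memory(N, 0.01), hyperloglog_memory(N, 0.01)
    print(N, round(s / 100, 1), round(h / 100, 1))   # in units of 100 bits
```

For $(N, \epsilon) = (10^6, 1\%)$ this yields about 315 versus 541 hundred-bit units, matching the corresponding row of Table 2.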
There is no analytic study of the memory cost for mr-bitmap in the literature, thus below we report a thorough memory cost comparison only between S-bitmap and Hyper-LogLog. Given $N$ and $\epsilon$, the theoretical memory costs for S-bitmap and Hyper-LogLog can be calculated as above. Figure 3 shows the contour plot of the ratios of the memory requirement of Hyper-LogLog to that of S-bitmap, where the ratios are shown as the labels of the corresponding contour lines. Here $\epsilon \times 100\%$ is shown on the horizontal axis and $N$ on the vertical axis, both in the scale of log base 2. The contour line with small circles and label '1' shows the boundary where Hyper-LogLog and S-bitmap require the same memory cost $m$. The lower left side of this contour line is the region where Hyper-LogLog requires more memory than S-bitmap, and the upper right side shows the opposite. Table 2 lists the detailed memory cost for both S-bitmap and Hyper-LogLog in a few cases where $\epsilon$ takes values 1%, 3% and 9%, and $N$ takes values from $10^3$ to $10^7$. For example, for $N = 10^6$ and $\epsilon \le 3\%$, which is a suitable setup for core network flow monitoring, Hyper-LogLog requires at least 27% more memory than S-bitmap. As another example, for $N = 10^4$ and $\epsilon \le 3\%$, which is a reasonable setup for household network monitoring, Hyper-LogLog requires at least 120% more memory than S-bitmap. In summary, S-bitmap is uniformly more memory-efficient than Hyper-LogLog when $N$ is medium or small and $\epsilon$ is small, though the advantage of S-bitmap against Hyper-LogLog dissipates when $N \ge 10^7$ and $\epsilon$ is large.

Scale-invariance property

In many applications, the cardinalities of interest are on the scale of a million or less. Therefore we report simulation studies with $N = 2^{20}$. In the first experiment, $m = 40{,}000$ bits of memory is used for all four algorithms. The design of mr-bitmap is optimized according to Estan et al. (2006).
We let the true cardinality $n$ vary from 10 to $10^6$, and run the algorithms to obtain corresponding estimates $\hat{n}$ and estimation errors $n^{-1}\hat{n} - 1$. The empirical RRMSE is computed based on 1000 replicates of this procedure. In the second and third experiments, the setting is similar except that $m = 3{,}200$ and $m = 800$ are used, respectively. The performance comparison is reported in Figure 4.

Figure 4: Comparison among mr-bitmap, LogLog, Hyper-LogLog and S-bitmap for estimating cardinalities from 10 to $10^6$ with $m = 40{,}000$, $m = 3{,}200$ and $m = 800$, respectively.

The results show that in the first experiment, mr-bitmap has smaller errors than LogLog and Hyper-LogLog, but S-bitmap has smaller errors than all competitors for cardinalities greater than 40,000; in the second experiment, Hyper-LogLog performs better than mr-bitmap, but S-bitmap performs better than all competitors for cardinalities greater than 1,000; and in the third experiment, with higher errors, S-bitmap still performs slightly better than Hyper-LogLog for cardinalities greater than 1,000, and both are better than mr-bitmap and LogLog. Obviously, the scale-invariance property is validated for S-bitmap consistently, while that is not the case for the competitors. We note that mr-bitmap performs badly at the boundary; these points are not plotted in the figures as they are out of range.

Other performance measures

Besides RRMSE, which is the $L_2$ metric, we have also evaluated the performance based on other metrics, such as $E|n^{-1}\hat{n} - 1|$, namely the $L_1$ metric, and the quantiles of $|n^{-1}\hat{n} - 1|$. As examples, Table 3 and Table 4 report the comparison of three error metrics ($L_1$, $L_2$ and the 99% quantile) for the
cases with $(N = 10^4, m = 2700)$ and $(N = 10^6, m = 6720)$, which represent two settings of different scales. In both settings, mr-bitmap works very well for small cardinalities and worse as cardinalities get large, with a strong boundary effect. Hyper-LogLog has a similar behavior, but is much more reliable. Interestingly, empirical results suggest that the scale-invariance property holds for S-bitmap not only with RRMSE, but also approximately with the $L_1$ and 99%-quantile metrics. For large cardinalities relative to $N$, the errors of Hyper-LogLog are all higher than those of S-bitmap in both settings.

Table 3: Comparison of $L_1$, $L_2$ metrics and 99%-quantiles (times 100) among mr-bitmap (mr), Hyper-LogLog (H) and S-bitmap (S) for $N = 10^4$ and $m = 2700$.

                 L1                L2 (RRMSE)           99% quantile
    n         S     mr     H      S      mr     H      S      mr      H
    10       1.3   0.6   0.8    2.6    1.6   3.0     10     10     10
    100      2.1   1.4   2.5    2.6    1.7   3.2      6      4      8
    1000     2.1   1.6   3.5    2.6    2.0   4.4    6.7      5   11.4
    5000     2.1   2.3   3.4    2.6    3.4   4.2    6.6    7.5   11.3
    7500     2.1 100.7   3.5    2.6  100.9   4.3    6.9  119.0   11.2
    10000    2.1 101.9   3.5    2.6  102.4   4.4    6.6  131.1   11.5

Table 4: Comparison of $L_1$, $L_2$ metrics and 99%-quantiles (times 100) among mr-bitmap (mr), Hyper-LogLog (H) and S-bitmap (S) for $N = 10^6$ and $m = 6720$.

                 L1                L2 (RRMSE)           99% quantile
    n         S     mr     H      S      mr     H      S      mr      H
    10       1.1   0.5   0.4    2.4    1.3   1.9     10     10     10
    100      1.8   1.4   1.6    2.3    1.7   2.0      6      4      5
    1000     1.9   1.5   1.8    2.4    1.9   2.2    6.2      5    5.5
    10000    2.0   2.5   2.1    2.5    3.1   2.7    6.8    7.9    7.0
    1e+05    1.9   2.6   2.3    2.4    3.3   2.9    6.5    7.9    7.6
    5e+05    1.9   2.6   2.3    2.4    3.3   2.8    6.2    8.6    7.3
    750000   2.0  22.9   2.2    2.5   48.2   2.8    6.1  116.9    7.0
    1e+06    1.9 100.5   2.2    2.4  100.8   2.8    6.2  120.3    7.4
Figure 5: Time series of true flow counts (triangles) and S-bitmap estimates (dotted lines) per minute on both links during the Slammer outbreak: link 1 (a) and link 0 (b).

7 Experimental evaluation

We now evaluate the S-bitmap algorithm on a few real network data sets and also compare it with the three competitors as above.

7.1 Worm traffic monitoring

We first evaluate the algorithms on worm traffic data, using two 9-hour traffic traces (www.rbeverly.net/research/slammer). The traces were collected by the MIT Laboratory for Computer Science from a peering exchange point (two independent links, namely link 0 and link 1) on Jan 25th, 2003, during the period of the "Slammer" worm outbreak. We report the results of estimating flow counts for each link. We take $N = 10^6$, which is sufficient for most university traffic in normal scenarios.

Figure 6: Proportions of estimates (y-axis) that have RRMSE more than a threshold (x-axis) based on S-bitmap, mr-bitmap, LogLog and Hyper-LogLog, respectively, on the two links during the Slammer outbreak: link 1 (a) and link 0 (b), where the three vertical lines show 2, 3 and 4 times the expected standard deviation for S-bitmap, separately.

Since in practice routers may not allocate much resource for flow counting, we use $m = 8000$ bits.
According to (7), we obtain $C = 2026.55$ for designing the sampling rates for S-bitmap, which corresponds to an expected standard deviation of $\epsilon = 2.2\%$ for S-bitmap. The same memory is used for the other algorithms. The two panels of Figure 5 show the time series of flow counts per one-minute interval (triangles) on link 1 and link 0, respectively, and the corresponding S-bitmap estimates (dashed lines). Occasionally the flows become very bursty (an order of magnitude difference), probably due to a few heavy worm scanners, while most of the time the time series are quite stable. The estimation errors of the S-bitmap estimates are almost invisible despite the non-stationary and bursty points.

The performance comparison between S-bitmap and alternative methods is reported in Figure 6 (left for link 1 and right for link 0), where the y-axis is the proportion of estimates that have absolute relative estimation errors more than a given threshold on the x-axis. The three thin vertical lines show 2, 3 and 4 times the expected standard deviation for S-bitmap, respectively. For example, the proportion of S-bitmap estimates whose absolute relative errors are more than 3 times the expected standard deviation is almost 0 on both links, while for the competitors, the proportions are at least 1.5% given the same threshold. The results show that S-bitmap is the most resistant to large errors among all four algorithms for both link 1 and link 0.

Figure 7: Histogram of five-minute flow counts on backbone links (log base 2).

7.2 Flow traffic on backbone network links

Now we apply the algorithms to counting network link flows in a core network.
The real data was obtained from a Tier-1 US service provider for 600 backbone links in the core network, and includes time series of traffic volume in flow counts on MPLS (Multi-Protocol Label Switching) paths in every five-minute interval. The traffic scales vary dramatically from link to link as well as from time to time. Since the original traces are not available, we use simulated data for each link to compute the S-bitmap and then obtain estimates. We set $N = 1.5 \times 10^6$ and use $m = 7{,}200$ bits of memory to configure all algorithms as above, which corresponds to an expected standard deviation of 2.4% for S-bitmap. The simulation uses a snapshot of flow counts from a five-minute interval, whose histogram in log base 2 is presented in Figure 7. The vertical lines show that the 1%, 25%, 50%, 75% and 99% quantiles are 18, 196, 2817, 19401 and 361485, respectively, where about 10% of the links, with no flows or flow counts less than 10, are not considered.

Figure 8: Proportions of estimates (y-axis) that have RRMSE more than a threshold (x-axis) based on S-bitmap, mr-bitmap, LogLog and Hyper-LogLog, respectively, where the three vertical lines show 2, 3 and 4 times the expected standard deviation for S-bitmap, separately.

The performance comparison between S-bitmap and alternative methods is reported in Figure 8, similar to Figure 6. The results show that both S-bitmap and Hyper-LogLog give very accurate estimates, with relative estimation errors bounded by 8%, while mr-bitmap has worse performance and LogLog is the worst (off the range). Overall, S-bitmap is the most resistant to large errors among all four algorithms.
For example, the absolute relative errors based on S-bitmap are within 3 times the standard deviation for all links, while there is one link whose absolute relative error is beyond this threshold for Hyper-LogLog, and two such links for mr-bitmap.

8 Conclusion

Distinct counting is a fundamental problem in the database literature and has found important applications in many areas, especially in modern computer networks. In this paper, we have proposed a novel statistical solution (S-bitmap), which is scale-invariant in the sense that its relative root mean square error is independent of the unknown cardinalities in a wide range. To achieve the same accuracy, with similar computational cost, S-bitmap consumes significantly less memory than state-of-the-art methods such as multiresolution bitmap, LogLog counting and Hyper-LogLog for common practice scales.

Appendix

8.1 Proof of Lemma 1

By the definition of $\{T_k : 1 \le k \le m\}$, we have
$$P(T_k - T_{k-1} = t) = \sum_{s=k-1}^{\infty} P(T_{k-1} = s, T_k = t+s) = \sum_{s=k-1}^{\infty} P(I_s = 1, I_{t+s} = 1, L_s = k-1, L_{t+s} = k).$$
Since $L_s \le L_{s+1} \le \cdots \le L_{s+t}$, by the Markov chain property of $\{L_t : t = 1, \cdots\}$, we have for $k \ge 1$ and $s \ge k-1$,
$$P(I_s = 1, I_{t+s} = 1, L_s = k-1, L_{t+s} = k) = P(L_s = k-1, I_s = 1)\, P(L_{t+s} = k \mid L_{t+s-1} = k-1) \times \prod_{j=s+1}^{s+t-1} P(L_j = k-1 \mid L_{j-1} = k-1) = P(T_{k-1} = s)\, q_k \prod_{j=s+1}^{s+t-1} (1 - q_k) = P(T_{k-1} = s)\, q_k (1 - q_k)^{t-1}.$$
Notice that $\sum_{s=k-1}^{\infty} P(T_{k-1} = s) = P(T_{k-1} \ge k-1)$ is the probability that the $(k-1)$-th filled bucket happens when or after the $(k-1)$-th distinct item arrives, which is 100% since each distinct item can fill at most one empty bucket. Therefore
$$P(T_k - T_{k-1} = t) = q_k (1 - q_k)^{t-1}.$$
That is, $T_k - T_{k-1}$ follows a geometric distribution.
The independence of $\{T_k - T_{k-1} : 1 \le k \le m\}$ can be proved similarly using the Markov property of $\{L_t : t = 1, 2, \cdots\}$, for which we refer to Chapter 3 of Durrett (1996). This completes the proof of Lemma 1.

References

Astrahan, M., Schkolnick, M., and Whang, K. (1987). Approximating the number of unique values of an attribute without sorting. Information Systems, 12, 11–15.

Beyer, K., Gemulla, R., Haas, P., Reinwald, B., and Sismanis, Y. (2009). Distinct-value synopses for multiset operations. Communications of the ACM, 52(10), 87–95.

Bickel, P. J. and Doksum, K. A. (2001). Mathematical Statistics: Basic Ideas and Selected Topics. Prentice Hall.

Bu, T., Chen, A., Wiel, S. A. V., and Woo, T. Y. C. (2006). Design and evaluation of a fast and robust worm detection algorithm. In Proceedings of the 25th IEEE International Conference on Computer Communications (INFOCOM).

Bunge, J. and Fitzpatrick, M. (1993). Estimating the number of species: a review. Journal of the American Statistical Association, 88, 364–373.

Chen, A. and Cao, J. (2009). Distinct counting with a self-learning bitmap (poster). In Proceedings of the International Conference on Data Engineering.

Durand, M. and Flajolet, P. (2003). Loglog counting of large cardinalities. In European Symposium on Algorithms, pages 605–617.

Durrett, R. (1996). Probability: Theory and Examples, Second Edition. Duxbury Press.

Estan, C., Varghese, G., and Fisk, M. (2006). Bitmap algorithms for counting active flows on high speed links. IEEE/ACM Transactions on Networking, 14(5), 925–937.

Flajolet, P. (1990). On adaptive sampling. Computing, 34, 391–400.

Flajolet, P. and Martin, G. N. (1985). Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2), 182–209.

Flajolet, P., Fusy, E., Gandouet, O., and Meunier, F. (2007).
Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm. Analysis of Algorithms.

Gibbons, P. (2001). Distinct sampling for highly-accurate answers to distinct values queries and event reports. The VLDB Journal, pages 541–550.

Gibbons, P. B. (2009). Distinct-values estimation over data streams. In Data Stream Management: Processing High-Speed Data, Editors: M. Garofalakis, J. Gehrke, and R. Rastogi. Springer.

Giroire, F. (2005). Order statistics and estimating cardinalities of massive datasets. In Proceedings of the 6th DMTCS Discrete Mathematics and Theoretical Computer Science, pages 157–166.

Haas, P. and Stokes, L. (1998). Estimating the number of classes in a finite population. Journal of the American Statistical Association, 93, 1475–1487.

Kane, D. M., Nelson, J., and Woodruff, D. P. (2010). An optimal algorithm for the distinct elements problem. In Proceedings of the 29th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 41–52.

Karasaridis, A., Rexroad, B., and Hoeflin, D. (2007). Wide-scale botnet detection and characterization. In Proceedings of the First Conference on First Workshop on Hot Topics in Understanding Botnets. USENIX Association.

Knuth, D. (1998). The Art of Computer Programming, Volume 3, 2nd edition. Addison-Wesley Professional.

Mao, C. (2006). Inference on the number of species through geometric lower bounds. Journal of the American Statistical Association, 101, 1663–1670.

Menten, L. E., Chen, A., and Stiliadis, D. (2011). NoBot: embedded malware detection for endpoint devices. Bell Labs Technical Journal (accepted).

Metwally, A., Agrawal, D., and Abbadi, A. E. (2008). Why go logarithmic if we can go linear? Towards effective distinct counting of search traffic. In Proc. ACM EDBT.

Morris, R. (1978). Counting large numbers of events in small registers. Communications of the ACM, 21(10), 840–842.
Rosenkrantz, W. (1987). Approximate counting: a martingale approach. Stochastics, 27, 111–120.

Whang, K., Vander-Zanden, B., and Taylor, H. (1990). A linear-time probabilistic counting algorithm for database applications. ACM Transactions on Database Systems, 15(2), 208–229.