
Fast Computation of the Median by Successive Binning

Ryan J. Tibshirani
Dept. of Statistics, Stanford University, Stanford, CA 94305
email: ryantibs@stanford.edu
October 14, 2008

Abstract

In many important problems, one uses the median instead of the mean to estimate a population's center, since the former is more robust. But in general, computing the median is considerably slower than the standard mean calculation, and a fast median algorithm is of interest. The fastest existing algorithm is quickselect. We investigate a novel algorithm binmedian, which has O(n) average complexity. The algorithm uses a recursive binning scheme and relies on the fact that the median and mean are always at most one standard deviation apart. We also propose a related median approximation algorithm binapprox, which has O(n) worst-case complexity. These algorithms are highly competitive with quickselect when computing the median of a single data set, but are significantly faster in updating the median when more data is added.

1 Introduction

In many applications it is useful to use the median instead of the mean to measure the center of a population. Such applications include, but are in no way limited to: biological sciences; computational finance; image processing; and optimization problems. Essentially, one uses the median because it is not sensitive to outliers, but this robustness comes at a price: computing the median takes much longer than computing the mean. Sometimes one finds the median as a step in a larger iterative process (as in many optimization algorithms), and this step is the bottleneck. Still, in other problems (like those in biological applications), one finds the median of a data set and then data is added or subtracted.
In order to find the new median, one must recompute the median in the usual way, because the standard median algorithm cannot utilize previous work to shortcut this computation. This brute-force method leaves much to be desired.

We propose a new median algorithm binmedian, which uses buckets or "bins" to recursively winnow the set of the data points that could possibly be the median. Because these bins preserve some information about the original data set, the binmedian algorithm can use previous computational work to effectively update the median, given more (or less) data. For situations in which an exact median is not required, binmedian gives rise to an approximate median algorithm binapprox, which rapidly computes the median to within a small margin of error.

We compare binmedian and binapprox to quickselect, the fastest existing algorithm for computing the median. We first describe each algorithm in detail, beginning with quickselect.

2 Quickselect

2.1 The Quickselect Algorithm

Quickselect (Floyd & Rivest 1975a) is not just used to compute the median, but is a more general algorithm to select the kth smallest element out of an array of n elements. (When n is odd the median is the kth smallest with k = (n + 1)/2, and when n is even it is the mean of the elements k = n/2 and k = n/2 + 1.) The algorithm is derived from quicksort (Hoare 1961), the well-known sorting algorithm. First it chooses a "pivot" element p from the array, and rearranges ("partitions") the array so that all elements less than p are to its left, and all elements greater than p are to its right. (The elements equal to p go on either side.) Then it recurses on the left or right subarray, depending on where the median element lies. In pseudo-code:

Quickselect(A, k):
1. Given an array A, choose a pivot element p
2. Partition A around p; let A_1, A_2, A_3 be the subarrays of points <, =, > p
3. If k ≤ len(A_1): return Quickselect(A_1, k)
4. Else if k > len(A_1) + len(A_2): return Quickselect(A_3, k − len(A_1) − len(A_2))
5. Else: return p

The fastest implementations of quickselect perform steps 3 and 4 iteratively (not recursively), and use a standard in-place partitioning method for step 2. On the other hand, there isn't a universally accepted strategy for choosing the pivot p in step 1. For the comparisons in sections 5 and 6, we use a very fast Fortran implementation of quickselect taken from Press, Teukolsky, Vetterling & Flannery (1992). It chooses the pivot p to be the median of the bottom, middle, and top elements in the array.

2.2 Strengths and Weaknesses of Quickselect

There are many advantages to quickselect. Not only does it run very fast in practice, but when the pivot is chosen randomly from the array, quickselect has O(n) average computational complexity (Floyd & Rivest 1975b). Another desirable property is that it uses only O(1) space, meaning that the amount of extra memory it needs doesn't depend on the input size. Perhaps the most underappreciated strength of quickselect is its simplicity. The algorithm's strategy is quite easy to understand, and the best implementations are only about 30 lines long.

One disadvantage of quickselect is that it rearranges the array for its own computational convenience, necessitated by the use of in-place partitioning. This seemingly innocuous side-effect can be a problem when maintaining the array order is important (for example, when one wants to find the median of a column of a matrix). To resolve this problem, quickselect must make a copy of the array, and use this scratch copy to do its computations (in which case it uses O(n) space).
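As a concrete reference for the pseudo-code of section 2.1, here is a minimal Python sketch. It is purely illustrative: the paper's timed implementations are in Fortran and C, and this list-based version uses a simple middle-element pivot rather than the median-of-three rule of Press et al.

```python
def quickselect(a, k):
    """Return the k-th smallest element (1-indexed) of a, following the
    pseudo-code of section 2.1, iterated rather than recursed.  Works on
    a scratch copy, so the caller's order is preserved (O(n) space)."""
    a = list(a)
    while True:
        p = a[len(a) // 2]                 # pivot (illustrative choice)
        a1 = [v for v in a if v < p]       # partition around p
        a2 = [v for v in a if v == p]
        a3 = [v for v in a if v > p]
        if k <= len(a1):
            a = a1                         # k-th smallest is in the left part
        elif k > len(a1) + len(a2):
            k -= len(a1) + len(a2)         # ... or in the right part
            a = a3
        else:
            return p

def median(a):
    """k = (n + 1)/2 for odd n; mean of the two middle elements otherwise."""
    n = len(a)
    if n % 2 == 1:
        return quickselect(a, (n + 1) // 2)
    return (quickselect(a, n // 2) + quickselect(a, n // 2 + 1)) / 2
```

Note that copying the input here trades the O(1) space of the in-place version for the order-preservation discussed above.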
But the biggest weakness of quickselect is that it is not well-suited to problems where the median needs to be updated with the addition of more data. Though it is able to efficiently compute the median from an in-memory array, its strategy is not conducive to saving intermediate results. Therefore, if data is appended to this array, we have no choice with quickselect but to compute the median in the usual way, and hence repeat much of our previous work. This will be our focus in section 6, where we will give an extension of binmedian that is able to deal with this update problem effectively.

3 Binmedian

3.1 The Binmedian Algorithm

First we give a useful lemma:

Lemma. If X is a random variable having mean µ, variance σ², and median m, then m ∈ [µ − σ, µ + σ].

Proof. Consider

  |µ − m| = |E(X − m)| ≤ E|X − m| ≤ E|X − µ| = E√((X − µ)²) ≤ √(E(X − µ)²) = σ.

Here the first inequality is due to Jensen's inequality, and the second inequality is due to the fact that the median minimizes the function Φ(a) = E|X − a|. The third inequality is due to the concave version of Jensen's inequality.

Given data points x_1, …, x_n and assuming that n is odd¹, the basic strategy of the binmedian algorithm is:

1. Compute the mean µ and standard deviation σ
2. Form B bins across [µ − σ, µ + σ], map each data point to a bin
3. Find the bin b that contains the median
4. Recurse on the set of points mapped to b

Here B is a fixed constant, and in step 2 we consider B equally spaced bins across [µ − σ, µ + σ]. That is, we consider intervals of the form

  [µ − σ + (i/B)·2σ, µ − σ + ((i + 1)/B)·2σ]   for i = 0, 1, …, B − 1.

Iterating over the data x_1, …, x_n, we count how many points lie in each one of these bins, and how many points lie to the left of these bins. We denote these counts by N_i and N_L.
In step 3, to find the median bin we simply start adding up the counts from the left until the total is ≥ (n + 1)/2. That is, b is minimal such that

  N_L + Σ_{i=0}^{b} N_i ≥ (n + 1)/2.

Note that our lemma guarantees that b ∈ {0, …, B − 1}.

Finally, in step 4 we iterate through x_1, …, x_n once again to determine which points were mapped to bin b. Then we consider B equally spaced bins across [µ − σ + (b/B)·2σ, µ − σ + ((b + 1)/B)·2σ], and continue as before (except that now N_L is the number of points to the left of the median bin, i.e. N_L ← N_L + Σ_{i=0}^{b−1} N_i). We stop when there is only one point left in the median bin; this point is the median. (In practice, it is actually faster to stop when the remaining number of points in the median bin is ≤ C, another fixed constant, and then find the median directly using insertion sort.)

¹ An analogous strategy holds when n is even. Now at each stage we have to keep track of the bin that contains the "left" median (the kth smallest element with k = n/2) and the bin that contains the "right" median (k = n/2 + 1). This is more tedious than the odd case, but is conceptually the same, so hereafter we'll assume that n is odd whenever we talk about binmedian.

Notice that a larger value of B gives fewer points that map to the median bin, but it also means more work to find b at each recursive step. For the time trials in sections 5 and 6, we choose B = 1000 because it weighs these two factors nicely. We also replace the recursive step with an iterative loop, and use C = 20. Binmedian is implemented in Fortran for the time trials, and this code, as well as an equivalent implementation in C, is freely available at http://stat.stanford.edu/~ryantibs/median/.
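A compact Python sketch of steps 1 through 4 follows, assuming n odd as in the text. The defaults match the paper's B = 1000 and C = 20, but the code is illustrative rather than the Fortran implementation: for simplicity it carries an explicit candidate list between levels instead of re-scanning x, and it finishes small bins with a plain sort rather than insertion sort.

```python
def binmedian(x, B=1000, C=20):
    """Median of x by successive binning (steps 1-4 of section 3.1).
    Assumes len(x) is odd, as in the text; B bins per level, switching
    to a direct sort once at most C candidates remain."""
    n = len(x)
    k = (n + 1) // 2                              # median = k-th smallest
    mu = sum(x) / n
    sigma = (sum((v - mu) ** 2 for v in x) / n) ** 0.5
    lo, hi = mu - sigma, mu + sigma               # lemma: median lies here
    NL = sum(1 for v in x if v < lo)              # points left of all bins
    pts = [v for v in x if lo <= v <= hi]         # candidate points
    while len(pts) > C and min(pts) < max(pts):
        width = (hi - lo) / B
        counts = [0] * B                          # N_0, ..., N_{B-1}
        for v in pts:
            counts[min(int((v - lo) / width), B - 1)] += 1
        total, b = NL, 0                          # find median bin b:
        while total + counts[b] < k:              # minimal b with
            total += counts[b]                    # NL + sum <= b >= (n+1)/2
            b += 1
        pts = [v for v in pts if min(int((v - lo) / width), B - 1) == b]
        NL = total                                # points left of bin b
        lo, hi = lo + b * width, lo + (b + 1) * width
    return sorted(pts)[k - NL - 1]                # direct finish (cf. C)
```

The invariant is that N_L always counts the points to the left of the current candidate interval, so the median is the (k − N_L)-th smallest of the remaining candidates.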
3.2 Strengths and Weaknesses of Binmedian

Beginning with its weaknesses, the binmedian algorithm will run faster with fewer points in the median's bin, and therefore its runtime is necessarily dependent on the distribution of the input data. The worst case for binmedian is when there is a huge cluster of points surrounding the median, with a very large standard deviation. With enough outliers, the median's bin could be swamped with points for many iterations. However, we can show that the algorithm's asymptotic running time is largely unaffected by this problem: with very reasonable assumptions on the data's distribution, binmedian has O(n) average complexity (see the Appendix). Moreover, this robustness also seems to be true in practice. As we shall see in section 5, the algorithm's runtime is only moderately affected by unfavorable distributions, and in fact, binmedian is highly competitive with quickselect in terms of running time. Like quickselect, the binmedian algorithm uses O(1) space, but (like quickselect) this requires it to change the order of the input data.

The greatest strength of binmedian is that the algorithm is able to quickly cache some information about the input data, when it maps the data points to bins. This gives rise to a fast median approximation algorithm, binapprox. It also enables binmedian, as well as binapprox, to efficiently recompute the median when we are given more data. We'll see this in section 6, but first we discuss binapprox in the next section.

4 Binapprox

4.1 The Binapprox Algorithm

In some situations, we do not need to compute the median exactly, and care more about getting a quick approximation. Binapprox is a simple approximation algorithm derived from binmedian.
It follows the same steps as binmedian, except that it stops once it has computed the median bin b, and just returns the midpoint of this interval. Hence, to be perfectly clear, the algorithm's steps are:

1. Compute the mean µ and standard deviation σ
2. Form B bins across [µ − σ, µ + σ], map each data point to a bin
3. Find the bin b that contains the median
4. Return the midpoint of bin b

The median can differ from this midpoint by at most 1/2 the width of the interval, or (1/2)·(2σ/B) = σ/B. Since we use B = 1000 in practice, binapprox is accurate to within 1/1000th of a standard deviation. Again, we implement binapprox in Fortran for sections 5 and 6. This code, as well as C code for binapprox, is available at http://stat.stanford.edu/~ryantibs/median/.

4.2 Strengths and Weaknesses of Binapprox

Unlike binmedian, the runtime of binapprox doesn't depend on the data's distribution, since it doesn't perform the recursive step. It requires O(1) space, and doesn't perturb the input data, so rearranging is never a problem. The algorithm has O(n) worst-case computational complexity, as it only needs 3 passes through the data. It consistently runs faster than quickselect and binmedian in practice. Most importantly, it extends to a very fast algorithm to deal with the update problem. Binapprox's main weakness is fairly obvious: if the standard deviation is extremely large, the reported approximation could be significantly different from the actual median.

5 Speed Comparisons on Single Data Sets

We compare quickselect, binmedian, and binapprox across data sets coming from various distributions. For each data set we perform a sort and take the middle element, and report these runtimes as a reference point for the slowest way to compute the median. We test a total of eight distributions.
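Before describing the test distributions, note that the σ/B guarantee of section 4.1 is easy to check numerically. The sketch below is an illustrative Python version of the four steps above (assuming n odd; the names are mine, not the paper's Fortran interface):

```python
def binapprox(x, B=1000):
    """Approximate median (section 4.1): one pass for mu and sigma, one
    pass to bin [mu - sigma, mu + sigma], then return the midpoint of
    the bin containing the median.  Assumes len(x) odd."""
    n = len(x)
    mu = sum(x) / n
    s = (sum((v - mu) ** 2 for v in x) / n) ** 0.5
    if s == 0:
        return mu                     # degenerate: all points identical
    lo, width = mu - s, 2 * s / B
    counts = [0] * B
    NL = 0                            # points to the left of every bin
    for v in x:
        if v < lo:
            NL += 1
        elif v <= mu + s:
            counts[min(int((v - lo) / width), B - 1)] += 1
    total, b = NL, 0                  # accumulate counts until >= (n+1)/2
    while total + counts[b] < (n + 1) // 2:
        total += counts[b]
        b += 1
    return lo + (b + 0.5) * width     # midpoint: within s/B of the median
```

Since the lemma holds for the empirical distribution as well, the median bin always exists, and the returned midpoint is within half a bin width, σ/B, of the true median.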
The first four (uniform over [0, 1]; normal with µ = 0, σ = 1; exponential with λ = 1; chi-square with k = 5) are included because they are fundamental, and frequently occur in practice. The last four are mixed distributions that are purposely unfavorable for the binmedian algorithm. They have a lot of points around the median, and a huge standard deviation. Two of these are mixtures of standard normal data and uniformly distributed data over [−10³, 10³], resp. [−10⁴, 10⁴]. The other two are mixtures of standard normal data and exponential data with λ = 10⁻³, resp. λ = 10⁻⁴. All these mixtures are even, meaning that an equal number of points come from each of the two distributions.

Table 1 gives the results of all these timings, which were performed on an Intel Xeon 64-bit, 3.00 GHz processor. For the first four distributions, quickselect and binmedian are very competitive in terms of runtime, with quickselect doing better on the data from U(0, 1) and N(0, 1), and binmedian doing better on the data from E(1) and χ²_5.
It makes sense that binmedian does well on the data drawn from E(1) and χ²_5, upon examining the probability density functions of these distributions.

                   U(0, 1)                 N(0, 1)
                time            ratio   time            ratio
  Quickselect   107.05 (0.47)   1.17    99.70 (1.05)    1.17
  Binmedian     112.45 (0.55)   1.23    106.95 (1.51)   1.25
  Binapprox     91.30 (0.32)    1       85.35 (0.86)    1
  Sort          3205.84 (5.97)  35.11   3231.22 (4.24)  37.86

                   E(1)                    χ²_5
                time            ratio   time            ratio
  Quickselect   106.75 (0.83)   1.62    107.80 (4.29)   1.32
  Binmedian     87.25 (0.96)    1.32    104.65 (4.12)   1.28
  Binapprox     65.90 (0.75)    1       81.75 (3.97)    1
  Sort          3217.51 (5.40)  48.82   3225.82 (4.70)  39.46

                N(0, 1) and U(−10³, 10³)   N(0, 1) and U(−10⁴, 10⁴)
                time            ratio   time            ratio
  Quickselect   103.65 (1.15)   1.33    101.00 (0.96)   1.30
  Binmedian     126.65 (1.20)   1.62    130.90 (1.25)   1.69
  Binapprox     78.15 (1.02)    1       77.50 (1.28)    1
  Sort          3225.40 (5.58)  41.27   3214.16 (3.89)  41.47

                N(0, 1) and E(10⁻³)        N(0, 1) and E(10⁻⁴)
                time            ratio   time            ratio
  Quickselect   98.45 (1.13)    1.59    100.90 (1.23)   1.58
  Binmedian     88.90 (1.63)    1.43    106.85 (0.89)   1.67
  Binapprox     62.10 (1.16)    1       63.80 (1.25)    1
  Sort          3228.93 (5.56)  52.00   3236.71 (2.95)  50.73

Table 1: Runtimes in milliseconds of quickselect, binmedian, binapprox, and sort for computing the median of 1 + 10⁷ data points from different distributions. We performed 10 repetitions for each distribution, and report the average and standard deviation of the runtimes. (In fact, for each repetition we drew 20 different data sets from the distribution, and timed how long it takes for the algorithm to sequentially compute the medians of these 20 data sets, in one contiguous block. Then we divided this number by 20; this was done to avoid timing inaccuracies.) We also list the runtime ratios, relative to the fastest time.
The density function of the exponential distribution, for example, is at its highest at 0 and declines quickly, remaining in sharp decline around the median (log 2/λ). This implies that there will be (relatively) few points in the median's bin, so binmedian gets off to a good start.

Generally speaking, binmedian runs slower than quickselect over the last four distributions. But considering that these distributions are intended to mimic binmedian's degenerate case (and are extreme examples, at that), the results are not too dramatic at all. The binmedian algorithm does worse with the normal/uniform mixtures (skewed on both sides) than it does with the normal/exponential mixtures (skewed on one side). At its worst, binmedian is 1.3 times slower than quickselect (130.9 versus 101 ms, on the N(0, 1) and U(−10⁴, 10⁴) mixture), but is actually slightly faster than quickselect on the N(0, 1) and E(10⁻³) mixture.

Binapprox is the fastest algorithm across every distribution. It maintains a reasonable margin on quickselect and binmedian (up to 1.62 and 1.69 times faster, respectively), though the differences are not striking. In the next section, we focus on the median update problem, where binmedian and binapprox display a definite advantage.

6 Speed Comparisons on the Update Problem

Suppose that we compute the median of some data x_1, …, x_{n_0}, and then we're given more data x_{n_0+1}, …, x_n, and we're asked for the median of the aggregate data set x_1, …, x_n. Of course we could just go ahead and compute the median of x_1, …, x_n directly, but then we would be redoing much of our previous work to find the median of x_1, …, x_{n_0}. We'd like a better strategy for updating the median.

Consider the following: we use binmedian to compute the median of x_1, …, x_{n_0}, and save the bin counts N_i, i = 0, …,
B − 1, and N_L, from the first iteration. We also need to save the mean µ_0 and the standard deviation σ_0. Given x_{n_0+1}, …, x_n, we map these points to the original B bins and just increment the appropriate counts. Then we compute the median bin in the usual way: start adding N_L + N_0 + N_1 + ⋯ and stop when this is ≥ (n + 1)/2.

But now, since we haven't mapped the data x_1, …, x_n to bins over its proper mean and standard deviation, we aren't guaranteed that the median lies in one of the bins. In the case that the median lies to the left of our bins (i.e. N_L ≥ (n + 1)/2) or to the right of our bins (i.e. N_L + N_0 + ⋯ + N_{B−1} < (n + 1)/2), we have to start all over again and perform the usual binmedian algorithm on x_1, …, x_n. But if the median does lie in one of the bins, then we can continue on to the second iteration as usual. This can provide a dramatic savings in time, because we didn't even have to touch x_1, …, x_{n_0} in the first iteration.

We can use binapprox in a similar way: save µ_0, σ_0, N_i, N_L from the median computation on x_1, …, x_{n_0}, and then use them to map x_{n_0+1}, …, x_n. We then determine where the median lies: if it lies outside of the bins, we perform the usual binapprox on x_1, …, x_n. Otherwise we can just return the midpoint of the median's bin. In this same situation, quickselect does not have a way to save its previous work, and must redo the whole computation.²

In table 2, we demonstrate the effectiveness of binmedian and binapprox on this update problem. Instead of adding data x_{n_0+1}, …, x_n only once, we add it many times, and we record how long it takes for the algorithms to compute these medians in one contiguous block. Binmedian and binapprox each keep a global copy of µ_0, σ_0, N_i, N_L. They set these globals given the initial data x_1, …,
x_{n_0}, and as more data is added, they dynamically change them if the median ever lies outside of the bins.

The first two situations from table 2 represent the best-case scenario for binmedian and binapprox. The added data sets are fairly small when compared to the initial data set (n − n_0 = 10⁵ data points versus n_0 = 1 + 10⁷ data points). Also, the added data lies mainly inside the initial bin range [µ_0 − σ_0, µ_0 + σ_0], so the median will never lie outside of the bins. This means that binmedian and binapprox never have to undo any of their previous work (i.e. they never have to recompute the mean or standard deviation of the data set, and re-bin all the points), and are able to update the median very quickly. At their best, binmedian is faster than quickselect by a factor of 3.97, and binapprox is faster by a factor of 24.55.

The last two situations represent realistic, not-so-ideal cases for binmedian and binapprox. The added data sets are equal in size to the initial data set (n − n_0 = 10⁶ and n_0 = 1 + 10⁶). Moreover, the added data lies mainly outside of the initial bin range, which means that the median will eventually lie outside of the bins, forcing binmedian and binapprox to recompute the mean and standard deviation and re-bin all of the data points. (This may happen more than once.) Nonetheless, binmedian and binapprox still maintain a very good margin on quickselect, with binmedian running up to 3.94 times faster than quickselect and binapprox running up to 19.01 times faster.

It is important to note that when binapprox successfully updates the median (that is, it finds the new median bin without having to recompute the mean, standard deviation, etc.), it is accurate to within σ_0/1000, where σ_0 is the standard deviation of the original data set x_1, …, x_{n_0}.
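The caching scheme just described can be sketched as follows, in its binapprox variant. The class layout and names are illustrative (the paper keeps µ_0, σ_0, N_i, N_L as globals in Fortran), and the fallback simply re-bins the retained data when the median escapes the cached bins:

```python
class MedianUpdater:
    """Binapprox-style median updating (section 6): cache mu_0, sigma_0,
    and the bin counts from the initial data; added points only touch
    the counts.  Falls back to a full re-bin if the median escapes."""

    def __init__(self, x, B=1000):
        self.B = B
        self.x = list(x)              # retained only for the fallback
        self._rebin()

    def _rebin(self):
        n = len(self.x)
        self.mu0 = sum(self.x) / n
        self.s0 = (sum((v - self.mu0) ** 2 for v in self.x) / n) ** 0.5
        if self.s0 == 0:
            self.s0 = 1.0             # degenerate guard: all points equal
        self.lo = self.mu0 - self.s0
        self.width = 2 * self.s0 / self.B
        self.NL = 0
        self.counts = [0] * self.B
        for v in self.x:
            self._count(v)

    def _count(self, v):
        if v < self.lo:
            self.NL += 1              # left of every bin
        elif v <= self.mu0 + self.s0:
            self.counts[min(int((v - self.lo) / self.width), self.B - 1)] += 1
        # points right of all bins are implicit: n - NL - sum(counts)

    def add(self, new):
        self.x.extend(new)
        for v in new:                 # increment the cached counts only
            self._count(v)

    def median(self):
        k = (len(self.x) + 1) // 2
        if self.NL >= k or self.NL + sum(self.counts) < k:
            self._rebin()             # median escaped the cached bins
        total, b = self.NL, 0
        while total + self.counts[b] < k:
            total += self.counts[b]
            b += 1
        return self.lo + (b + 0.5) * self.width
```

When no re-bin is triggered, `add` does work proportional only to the new points, which is the source of the speedups in table 2.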
² One might argue that quickselect sorts half of the array x_1, …, x_{n_0}, and it could use this sorted order when given x_{n_0+1}, …, x_n to expedite the partitioning steps, and thus quickly compute the median of x_1, …, x_n. First, keeping this sorted order does not actually provide much of a speedup. Second, and most importantly, that quickselect might keep x_1, …, x_{n_0} in the order that it produced is an unrealistic expectation in practice. There may be several other steps that occur between initially computing the median and recomputing the median, and these steps could manipulate the order of the data. Also, if the data's order shouldn't be perturbed in the first place (for example, if we're computing the median of a column of a matrix), then quickselect has to make a copy of x_1, …, x_{n_0}, and only rearranges the order of this copy. It's not practical to keep this copy around until we're given x_{n_0+1}, …, x_n, even if n_0 is only reasonably large.

  1 + 10⁷ pts ∼ N(0, 25), then 20 × (10⁵ pts ∼ N(0, 25))
                time           ratio
  Quickselect   2236 (8.23)    22.36
  Binmedian     584 (5.27)     5.84
  Binapprox     100 (9.42)     1

  1 + 10⁷ pts ∼ N(0, 25), then 20 × (10⁵ pts ∼ N(2, 4))
                time           ratio
  Quickselect   2234 (7.07)    24.55
  Binmedian     563 (10.75)    6.19
  Binapprox     91 (12.65)     1

  1 + 10⁶ pts ∼ N(0, 25), then 20 × (10⁶ pts ∼ N(1/2 · j, 25), j = 1, …, 20)
                time           ratio
  Quickselect   2429 (11.36)   18.99
  Binmedian     616 (9.94)     4.81
  Binapprox     128 (5.27)     1

  1 + 10⁶ pts ∼ N(0, 25), then 20 × (10⁶ pts ∼ N(10, 25))
                time           ratio
  Quickselect   2376 (16.19)   19.01
  Binmedian     629 (11.01)    5.03
  Binapprox     125 (5.68)     1

Table 2: Runtimes in milliseconds of quickselect, binmedian, and binapprox for the median update problem. The algorithms initially compute the median of a data set, then we sequentially add more data, and the algorithms must recompute the median after each addition. Whenever new data is added, quickselect computes the median of the aggregate data set with its usual strategy, and hence repeats much of its previous work. On the other hand, binmedian and binapprox can use their cached bin counts to effectively update the median. For each distribution, we performed 10 repetitions, and we list the average and standard deviation of these times. We give the runtime ratios, which are taken relative to the fastest time.

To bound its accuracy in terms of the standard deviation σ of the whole data set x_1, …, x_n, consider

  σ = √((1/n) Σ_{i=1}^{n} (x_i − µ)²) ≥ √((1/n) Σ_{i=1}^{n_0} (x_i − µ)²) ≥ √((1/n) Σ_{i=1}^{n_0} (x_i − µ_0)²) = √(n_0/n) · σ_0,

where µ_0 is the mean of x_1, …, x_{n_0} and µ is the mean of x_1, …, x_n. Therefore, binapprox's error in this situation is at most

  σ_0/1000 ≤ √(n/n_0) · σ/1000.   (1)

If n = 2n_0, this becomes √2 · σ/1000 ≈ σ/700, which is still quite small.

Finally, observe that we can use an analogous strategy for binmedian and binapprox to recompute the median after removing (instead of adding) a subset of data. Suppose that, having computed the median of x_1, …, x_n, we wish to know the median when x_{n_0+1}, …, x_n are removed from the data set. Then as before, we save µ, σ, N_i, N_L from our median computation on x_1, …, x_n, map x_{n_0+1}, …, x_n to bins, and decrement (instead of increment) the counts appropriately.

7 Discussion and Summary

We have proposed an algorithm binmedian, which computes the median by repeated binning on a selectively smaller subset of the data, and a corresponding approximation algorithm binapprox.
Binmedian has O(n) average complexity when mild assumptions are made on the data's distribution function, and binapprox has O(n) worst-case complexity. For finding the median of a single data set, binmedian is highly competitive with quickselect, although it becomes slower when there are many points near the median and the data has an extremely large standard deviation. On single data sets, binapprox is consistently faster no matter the distribution. Binmedian and binapprox both outperform quickselect on the median update problem, wherein new data is successively added to a base data set; on this problem, binapprox can be nearly 25 times faster than quickselect.

In most real applications, an error of σ/1000 (an upper bound for binapprox's error) is perfectly acceptable. In many biological applications this can be less than the error attributed to the machine that collects the data. Also, we emphasize that binapprox performs all of its computations without side effects, and so it does not perturb the input data in any way. Specifically, binapprox doesn't change the order of the input data, as do quickselect and binmedian. For applications in which the input's order needs to be preserved, quickselect and binmedian must make a copy of the input array, and this will only increase the margin by which binapprox is faster (in tables 1 and 2).

Finally, it is important to note that the strategy for binmedian and binapprox is well-suited to situations in which computations must be distributed. The computations to find the mean and standard deviation can be easily distributed; hence, with the data set divided into many parts, we can compute bin counts separately on each part, and then share these counts to find the median bin.
For binapprox we need only to report the midpoint of this bin, and for binmedian we recurse and use this same distributed strategy. This could work well in the context of wireless sensor networks, which have recently become very popular in many different areas of research, including various military and environmental applications (see Zhao & Guibas (2004)).

For any situation in which an error of σ/1000 is acceptable, there seems to be no downside to using the binapprox algorithm. It is in general a little bit faster than quickselect, and it admits many nice properties, most notably its abilities to rapidly update the median and to be distributed. We hope to see diverse applications that benefit from this.

Appendix

Proof that Binmedian Has O(n) Expected Running Time

Let X_1, …, X_n be iid points from a continuous distribution with density f. Assuming that f is bounded and that E(X_1⁴) < ∞, we show that the binmedian algorithm has expected O(n) running time.³

Now for notation: let f ≤ M, and E(X_1²) = σ². Let µ̂ and σ̂ be the empirical mean and standard deviation of X_1, …, X_n. Let N_j be the number of points at the jth iteration (N_0 = n), and let T be the number of iterations before termination. We can assure T ≤ n.⁴ At the jth iteration, the algorithm performs N_j + B operations; therefore its runtime is

  R = Σ_{j=0}^{T} (N_j + B) ≤ Σ_{j=0}^{n} (N_j + B) = Σ_{j=0}^{n} N_j + B(n + 1)

and its expected runtime is

  E(R) ≤ Σ_{j=0}^{n} E(N_j) + B(n + 1).

³ Notice that this expectation is taken over draws of the data X_1, …, X_n from their distribution. This is different from the usual notion of expected running time, which refers to an expectation taken over random decisions made by the algorithm, holding the data fixed.

⁴ To do this we make a slight modification to the algorithm, which does not change the running time.
Right before the recursive step (step 4), we check if all of the data points lie in the median bin. If this is true, then instead of constructing B new bins across the bin's endpoints, we construct the bins across the minimum and maximum data points. This assures that at least one data point will be excluded at each iteration, so that T ≤ n.

[Figure 1 shows the density f, bounded above by M, over the median bin, an interval of width 2σ̂/B ≤ 2(σ + ε)/B.]
Figure 1: The area under the curve between the two red lines is ≤ M · 2(σ + ε)/B.

Now we compute E(N_j). Let b denote the median bin; this is a (random) interval of length 2σ̂/B. Well,

  E(N_1) = n P(X_1 ∈ b) ≤ n P(X_1 ∈ b; σ̂ ≤ σ + ε) + n P(σ̂ > σ + ε).

Well, P(X_1 ∈ b; σ̂ ≤ σ + ε) < M · 2(σ + ε)/B (see the picture), and P(σ̂ − σ > ε) = C/n (see the lemma; this is why we need a finite fourth moment of X_1). Therefore

  E(N_1) ≤ (n/B) · 2M(σ + ε) + C.

In general,

  E(N_j) = n P(X_1 ∈ b_1, X_1 ∈ b_2, …, and X_1 ∈ b_j),

where b_i is the median's bin at the ith iteration, so that we can similarly bound

  E(N_j) ≤ (n/B^j) · 2M(σ + ε) + C.

Therefore

  E(R) ≤ Σ_{j=0}^{n} [(n/B^j) · 2M(σ + ε) + C] + B(n + 1)
       = 2M(σ + ε) (Σ_{j=0}^{n} 1/B^j) n + (B + C)(n + 1)
       ≤ 2M(σ + ε) (Σ_{j=0}^{∞} 1/B^j) n + (B + C)(n + 1)
       = O(n).

Lemma. With the same assumptions as above, P(σ̂ − σ > ε) = O(n⁻¹) for any ε > 0.

Proof. Notice

  {σ̂ − σ > ε} = {(σ̂ − σ)² > ε²; σ̂ > σ} = {σ̂² − 2σ̂σ + σ² > ε²; σ̂ > σ} ⊆ {σ̂² − σ² > ε²; σ̂ > σ} = {σ̂² − σ² > ε²}.

Recall that σ̂² = (1/n) Σ_{i=1}^{n} (X_i − X̄)². Consider the unbiased estimator for σ², S² = (n/(n − 1)) σ̂², and

  {σ̂² − σ² > ε²} ⊆ {S² − σ² > ε²}.

By Chebyshev's inequality, P(S² − σ² > ε²) ≤ Var(S²)/ε⁴. The rest is a tedious but straightforward calculation, using E(X_1⁴) < ∞, to show that Var(S²) = O(n⁻¹).
Proof that Binmedian Has O(log n) Expected Number of Iterations

Under the same set of assumptions on the data X_1, …, X_n, we can show that the binmedian algorithm has an expected number of O(log n) iterations before it converges. We established that

  E(N_1) ≤ (n/B) · 2M(σ + ε) + C,

and by that same logic it follows that

  E(N_j | N_{j−1} = m) ≤ (m/B) · 2M(σ + ε) + C = m − g(m),

where

  g(m) = m(1 − 2M(σ + ε)/B) − C.

By Theorem 1.3 of Motwani & Raghavan (1995), this means that

  E(T) ≤ ∫_1^n dx/g(x) = O(log n).

Acknowledgements

We could not have done this work without Rob Tibshirani's positive encouragement and consistently great input. We also thank Jerry Friedman for many helpful discussions and for his mentorship. Finally, we thank the Cytobank group, especially William Lu, for discussing ideas which gave rise to the whole project. Binapprox is implemented in the current version of the Cytobank software package for flow cytometry data analysis (www.cytobank.org).

References

Floyd, R. W. & Rivest, R. L. (1975a), 'Algorithm 489: Select', Communications of the ACM 18(3), 173.

Floyd, R. W. & Rivest, R. L. (1975b), 'Expected time bounds for selection', Communications of the ACM 18(3), 165–172.

Hoare, C. A. R. (1961), 'Algorithm 64: Quicksort', Communications of the ACM 4(7), 321.

Motwani, R. & Raghavan, P. (1995), Randomized Algorithms, Cambridge University Press.

Press, W. H., Teukolsky, S. A., Vetterling, W. T. & Flannery, B. P. (1992), Numerical Recipes in Fortran 77: The Art of Scientific Computing, Second Edition, Cambridge University Press, New York.

Zhao, F. & Guibas, L. (2004), Wireless Sensor Networks: An Information Processing Approach, Elsevier/Morgan-Kaufmann.
