A comparison of Gap statistic definitions with and without logarithm function

Mojgan Mohajer†
Helmholtz Zentrum München and Department of Statistics, University of Munich

Karl-Hans Englmeier
Helmholtz Zentrum München

Volker J. Schmid
Department of Statistics, University of Munich

Summary. The Gap statistic is a standard method for determining the number of clusters in a set of data. The Gap statistic standardizes the graph of $\log(W_k)$, where $W_k$ is the within-cluster dispersion, by comparing it to its expectation under an appropriate null reference distribution of the data. We suggest using $W_k$ instead of $\log(W_k)$ and comparing it to the expectation of $W_k$ under a null reference distribution. In fact, whenever a number fulfills the original Gap statistic inequality, this number also fulfills the inequality of a Gap statistic using $W_k$, but not vice versa. The two definitions of the Gap function are evaluated on several simulated data sets and on real DCE-MR image data.

Keywords: average linkage, Gap statistic, log function, number of clusters, within cluster dispersion

1. Introduction

In clustering methods the number of clusters is either a direct parameter, or it may be controlled by other parameters of the method. Estimating the proper number of clusters is an important problem in selecting the clustering method as well as in validating the result. The Gap statistic is one of the most popular techniques to determine the optimal number of clusters. The idea of the Gap statistic is to compare the within-cluster dispersion to its expectation under an appropriate null reference distribution (Tibshirani et al., 2001). It outperforms many other methods, including the method by Kaufman and Rousseeuw (1990), the Caliński and Harabasz (1974) index, the Krzanowski and Lai (1988) method, and the Hartigan (1975) statistic (Tibshirani et al., 2001).
Therefore, the Gap statistic is frequently used in a variety of applications, from image segmentation (Zheng-Jun and Yao-Qin, 2009) and image edge detection (Yang et al., 2009) to genome clustering (Wendl and Yang, 2004). However, there are few works investigating the method itself. The tendency of the Gap statistic to overestimate the number of clusters was reported by Dudoit and Fridlyand (2002). It is also known that the Gap statistic may not work correctly in cases where data are derived from exponential distributions (Sugar and James, 2003). The weighted Gap statistic, proposed by Yan and Ye (2007), is an improvement, for example in the case of mixtures of exponential distributions. Yin et al. (2008) pointed out that the Gap statistic might fail in situations where a data set contains clusters of different densities. They suggested using reference data sets sampled from a normal distribution rather than a uniform distribution.

† Address for correspondence: Mojgan Mohajer, Helmholtz Zentrum München, Ingolstädter Landstr. 1, 85764 Neuherberg, Germany. E-mail: mojgan.mohajer@helmholtz-muenchen.de

The original Gap statistic is based on some empirical choices, such as the "one standard error"-style rule for simulation error, and using the logarithm of the within-cluster dispersion $W_k$. However, few studies have focused on analyzing the effect of these choices. In this paper we will show that using the logarithm of $W_k$ is actually disadvantageous for finding the number of clusters in data sets. Especially in cases where clustering data are sampled from multi-dimensional uniform distributions with large differences in the variances of the different clusters, it is better to use $W_k$ instead of $\log(W_k)$.

The paper is organized as follows.
In section 2 the original Gap statistic is described and the difference between using the logarithm of $W_k$ and calculating the Gap statistic directly from $W_k$ is discussed. In section 3 both Gap functions, with and without log function, are applied to simulated and real data, using hierarchical clustering with the average linkage method. We end with a discussion of the results and the proposed method.

2. Theory

2.1. Gap statistic

Let $\{x_{ij}\}$ be observations with $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, p$, that is, $p$ features measured on $n$ independent samples, clustered into $k$ clusters $C_1, C_2, \ldots, C_k$, where $C_r$ denotes the indexes of samples in cluster $r$, and $n_r = |C_r|$. Let $d_{ii'}$ be the distance between samples $i$ and $i'$. For example, this distance might be the squared Euclidean distance $d_{ii'} = \sum_j (x_{ij} - x_{i'j})^2$. The sum of the pairwise distances $D_r$ for all points in cluster $r$ is

$$D_r = \sum_{i, i' \in C_r} d_{ii'}. \tag{1}$$

We define

$$W_k := \sum_{r=1}^{k} \frac{1}{2 n_r} D_r. \tag{2}$$

If $d$ is the squared Euclidean distance, then $W_k$ is the within-cluster sum of squared distances from the cluster means. $W_k$ decreases monotonically as the number of clusters $k$ increases. For the calculation of the Gap function, Tibshirani et al. (2001) proposed to use the difference between the expected value of $\log(W^*_k)$ under an appropriate null reference and the $\log(W_k)$ of the data set,

$$\mathrm{Gap}_n(k) := E^*_n \log(W^*_k) - \log(W_k). \tag{3}$$

Then, the proper number of clusters for the given data set is the smallest $k$ such that

$$\mathrm{Gap}_n(k) \ge \mathrm{Gap}_n(k+1) - s_{k+1}, \tag{4}$$

where $s_k$ is the simulation error calculated from the standard deviation $sd(k)$ of $B$ Monte Carlo replicates $\log(W^*_k)$ according to the equation $s_k = \sqrt{1 + 1/B}\; sd(k)$.
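As a minimal sketch of Eqns. 1 to 4, assuming squared Euclidean distance and NumPy, the following Python code computes $W_k$ from a given cluster labelling and applies the selection rule. The function names (`within_dispersion`, `gap_log`, `select_k`) and the array layout are illustrative assumptions, not part of the original method description; the clustering itself and the generation of the reference data sets are left out.

```python
import numpy as np

def within_dispersion(X, labels):
    """W_k of Eqn. 2 for data X (n x p) and integer cluster labels (n,)."""
    W = 0.0
    for r in np.unique(labels):
        C = X[labels == r]
        diff = C[:, None, :] - C[None, :, :]
        D_r = (diff ** 2).sum()        # Eqn. 1: sum over all ordered pairs
        W += D_r / (2 * len(C))
    return float(W)

def gap_log(W, W_ref):
    """Gap_n(k) of Eqn. 3 and the simulation error s_k.

    W     : array of shape (K,), dispersion of the data for k = 1..K
    W_ref : array of shape (B, K), dispersion of the B reference data sets
    """
    log_ref = np.log(W_ref)
    B = W_ref.shape[0]
    gap = log_ref.mean(axis=0) - np.log(W)
    s = np.sqrt(1.0 + 1.0 / B) * log_ref.std(axis=0)
    return gap, s

def select_k(gap, s):
    """Smallest k (1-based) fulfilling Eqn. 4, or None if no k qualifies."""
    for i in range(len(gap) - 1):
        if gap[i] >= gap[i + 1] - s[i + 1]:
            return i + 1
    return None
```

Here `W` holds $W_1, \ldots, W_K$ for the data and `W_ref` holds the corresponding values for the $B$ reference data sets, one row per replicate. `select_k` returns `None` when no $k$ satisfies Eqn. 4, which is the situation that arises when the Gap function is strictly increasing.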
The expected value $E^*_n \log(W^*_k)$ of the within-dispersion measures $W^*_{kb}$ is determined as

$$E^*_n \log(W^*_k) = \frac{1}{B} \sum_b \log(W^*_{kb}), \tag{5}$$

where the $W^*_{kb}$ are obtained by clustering the $B$ reference data sets. Since a sum of logarithms is the logarithm of the product, this can be written as

$$E^*_n \log(W^*_k) = \frac{1}{B} \log\Big( \prod_b W^*_{kb} \Big). \tag{6}$$

Therefore the Gap function from Eqn. 3 can be rewritten as

$$\mathrm{Gap}_n(k) = \log\left( \frac{\big(\prod_b W^*_{kb}\big)^{1/B}}{W_k} \right). \tag{7}$$

The number $(\prod_b W^*_{kb})^{1/B}$ is the geometric mean of the $W^*_{kb}$. Thus, the Gap statistic is the logarithm of the ratio of the geometric mean of $W^*_{kb}$ to $W_k$. In the next section, we will compare this to using the difference between the arithmetic mean of $W^*_{kb}$ and $W_k$.

2.2. Gap statistic without logarithm function

Let us consider using $W_k$ instead of $\log(W_k)$. That is, we use an alternative definition of the Gap function,

$$\mathrm{Gap}^*_n(k) = E^*_n(W^*_k) - W_k, \tag{8}$$

where

$$E^*_n(W^*_k) = \frac{1}{B} \sum_b W^*_{kb}. \tag{9}$$

We refer to the proposed alternative Gap statistic, defined by using $W_k$ directly, as $\mathrm{Gap}^*_n$; the original Gap statistic, calculated using the logarithm of $W_k$, is referred to as $\mathrm{Gap}_n$.

Tibshirani et al. (2001) note that in the case of a special Gaussian mixture model $\log(W_k)$ can be interpreted as a log-likelihood (Scott and Symons, 1971). In maximum likelihood inference, it is usually more convenient to work with the log-likelihood function than with the likelihood function, in order to have sums instead of products. However, using $\log(W_k)$ has no computational advantage over using $W_k$ directly in the definition of the Gap statistic.

It can be shown that an answer in the original $\mathrm{Gap}_n$ is a sufficient condition for the proposed $\mathrm{Gap}^*_n$ statistic, but not vice versa. Let $A = (\prod_b W^*_{kb})^{1/B}$, $B = (\prod_b W^*_{k+1,b})^{1/B}$, $C = \frac{1}{B} \sum_b W^*_{kb}$, $D = \frac{1}{B} \sum_b W^*_{k+1,b}$, $d_1 = W_k$, and $d_2 = W_{k+1}$.

Proposition 1.
For all $d_1, d_2 > 0$ with $d_1 > d_2$, $A > B$, $C > D$, $A, C > d_1$ and $B, D > d_2$: if $\log(A/d_1) \ge \log(B/d_2)$, then $C - d_1 \ge D - d_2$.

Proposition 2. There exist $d_1, d_2 > 0$ with $d_1 > d_2$, $A > B$, $C > D$, $A, C > d_1$ and $B, D > d_2$ such that $C - d_1 \ge D - d_2$ holds while $\log(A/d_1) < \log(B/d_2)$.

Proofs are given in Appendix A. Hence, if there is a possible candidate in $\mathrm{Gap}_n$ at point $k$, it is also a possible candidate in $\mathrm{Gap}^*_n$. On the other hand, it is possible that there is no such $k$ in the $\mathrm{Gap}_n$ function while the $\mathrm{Gap}^*_n$ function indicates a possible candidate at point $k$. Sections 3.4 and 3.5 give examples from simulated and real data in which the original $\mathrm{Gap}_n$ function is strictly increasing, so that no $k$ fulfills the condition in Eqn. 4. However, the proposed $\mathrm{Gap}^*_n$ function may be able to suggest a number of clusters for these data sets.

2.3. Weighted Gap statistic

In Eqn. 2, $W_k$ is the pooled within-cluster sum of squares. This implies that a point far away from the cluster mean has more impact on $W_k$ than points with small distances from the cluster mean. To address this, Yan and Ye (2007) suggested computing $W'_k$ as the average of all pairwise distances for all points in a cluster,

$$W'_k = \sum_{r=1}^{k} \frac{2}{n_r (n_r - 1)} D_r. \tag{10}$$

This approach is called the "weighted Gap function". Similar to the original Gap function, the weighted Gap function can also be computed with or without logarithm. However, $W_k$ in Eqn. 2 is monotonically decreasing in $k$ if the distance $d_{ii'}$ is the Euclidean distance. On the other hand, $W'_k$ in Eqn. 10 is neither a decreasing nor an increasing function of $k$. Therefore, the propositions given in section 2.2 are not valid for the weighted Gap method.
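The difference in monotonicity between $W_k$ (Eqn. 2) and $W'_k$ (Eqn. 10) can be illustrated with a small Python sketch. It assumes squared Euclidean distance, takes the constant factor of Eqn. 10 as printed above, and lets singleton clusters contribute zero to $W'_k$ (an assumption, since Eqn. 10 is undefined for $n_r = 1$). The function names and the two toy data sets are hypothetical and chosen only for illustration.

```python
import numpy as np

def pairwise_sum(C):
    """D_r of Eqn. 1: squared Euclidean distances summed over ordered pairs."""
    diff = C[:, None, :] - C[None, :, :]
    return float((diff ** 2).sum())

def pooled_dispersion(X, labels):
    """W_k of Eqn. 2."""
    total = 0.0
    for r in np.unique(labels):
        C = X[labels == r]
        total += pairwise_sum(C) / (2 * len(C))
    return total

def weighted_dispersion(X, labels):
    """W'_k of Eqn. 10; singleton clusters are assumed to contribute 0."""
    total = 0.0
    for r in np.unique(labels):
        C = X[labels == r]
        n_r = len(C)
        if n_r > 1:
            total += 2.0 * pairwise_sum(C) / (n_r * (n_r - 1))
    return total

# Toy data 1: four equidistant points (all pairwise squared distances are 2).
simplex = np.eye(4)
# Toy data 2: two well separated pairs on a line.
line = np.array([[0.0], [1.0], [10.0], [11.0]])

one = np.array([0, 0, 0, 0])   # k = 1
two = np.array([0, 0, 1, 1])   # k = 2
```

For the equidistant points, going from `one` to `two` lowers `pooled_dispersion` but raises `weighted_dispersion`, whereas for the points on a line both quantities fall. The constant factor in Eqn. 10 does not affect this behaviour.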
We will compare results from the original and the weighted Gap function on two historical data sets in section 3.1.

3. Application to simulated and real data sets

In the previous section we discussed the differences between the Gap functions computed with and without logarithm. In this section we apply the original Gap and the proposed Gap$^*$ statistics to simulated and real data sets in order to evaluate the effect of the differences between both approaches. Here, we use agglomerative hierarchical clustering with the group average linkage method (Kaufman and Rousseeuw, 1990). The average linkage method has some advantages over the widely used k-means clustering. Hierarchical clustering methods produce hierarchical representations in which the clusters at each level of the hierarchy are created by merging clusters at the next lower level. Each level of the hierarchy represents a particular grouping of the data into disjoint clusters of samples. The entire hierarchy represents an ordered sequence of such groupings. Unlike k-means clustering, where the choice of different numbers of clusters can lead to totally different assignments of elements to the clusters, in hierarchical clustering the sets of clusters are nested within one another.

Table 1. Results of standard and weighted Gap and Gap$^*$ functions on Iris and Breast Cancer data sets. "+" indicates the correct number of clusters for that data set.

Gap function      Number of clusters
                  Iris     Breast
Gap               3 +      2 +
Gap*              3 +      2 +
weighted Gap      2        1
weighted Gap*     7        1

The average linkage method has another interesting property: the group average dissimilarity $d(G, H)$ between two groups $G$ and $H$ is defined as

$$d(G, H) = \frac{1}{N_G N_H} \sum_{i \in G} \sum_{i' \in H} d_{ii'}, \tag{11}$$

where $N_G$ and $N_H$ are the number of samples in each group.
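Eqn. 11 translates directly into a few lines of Python. The sketch below assumes squared Euclidean distance as the dissimilarity $d_{ii'}$ and uses NumPy; the function name `group_average_dissimilarity` is a hypothetical choice for illustration.

```python
import numpy as np

def group_average_dissimilarity(G, H):
    """d(G, H) of Eqn. 11 for two groups given as arrays of shape (N_G, p) and (N_H, p)."""
    diff = G[:, None, :] - H[None, :, :]   # all pairs (i in G, i' in H)
    d = (diff ** 2).sum(axis=2)            # squared Euclidean distance per pair
    return float(d.mean())                 # divide the double sum by N_G * N_H

G = np.array([[0.0, 0.0], [0.0, 2.0]])
H = np.array([[3.0, 0.0]])
d_GH = group_average_dissimilarity(G, H)   # (9 + 13) / 2
```

Average linkage clustering repeatedly merges the two groups with the smallest $d(G, H)$. In practice a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` with `method='average'` performs this merging; the text does not state which software was used.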
The group average dissimilarity is an estimate of

$$\int\!\!\int d(x, x')\, p_G(x)\, p_H(x')\, dx\, dx' \tag{12}$$

as the number of observations $N \to \infty$, where $d(x, x')$ is the dissimilarity between points $x$ and $x'$. That is, $d(G, H)$ in Eqn. 11 approaches Eqn. 12 when $N$ approaches infinity. This quantity characterizes the relationship between the two densities $p_G(x)$ and $p_H(x')$ of the samples in groups $G$ and $H$. The average linkage method attempts to produce relatively compact clusters that are relatively far apart (Kaufman and Rousseeuw, 1990).

3.1. Two historical data sets

Two historical data sets are frequently used when discussing clustering: Fisher's "Iris data set" (Fisher, 1963) and Wolberg's "Breast Cancer Wisconsin data set" (Wolberg et al., 1993). We apply the four different definitions of the Gap statistic to these two data sets. Fisher's Iris data set consists of 50 samples from each of three species of Iris flowers. Four variables were measured for each sample. For the "Breast Cancer Wisconsin data set", samples arrived periodically as Dr. Wolberg reported his clinical cases. The data set consists of 699 samples. Each sample is described by nine variables. The whole data set has two main groups, consisting of 458 benign and 241 malignant tumors.

Table 1 lists the estimated number of clusters for both the Iris and the breast cancer data sets using the original Gap statistic from Eqn. 3 and the proposed Gap statistic without logarithm, Gap$^*$, as defined in Eqn. 8. These two Gap functions are compared with the results of the weighted Gap as described in section 2.3 and the weighted Gap$^*$, i.e., the weighted Gap using $W'_k$ instead of $\log(W'_k)$. In contrast to the result from k-means clustering reported by Yan and Ye (2007), when using average linkage clustering the Gap statistic with the original $W_k$, Eqn.
2, estimates the number of clusters for both data sets correctly. Figs. 1 and 2 show the calculated Gap functions for the two data sets. Both the Iris and the breast cancer data sets represent their natural clusters in average linkage clustering. Thus, Gap and Gap$^*$ show similar behavior. It can be observed that in the case of the Iris data, the weighted Gap suggests 2 as the proper number of clusters, but the weighted Gap$^*$ suggests 7. According to the discussion in section 2.2, whenever a number fulfills inequality 4, this number also fulfills the inequality for the proposed Gap$^*$. However, this statement is not valid for the weighted Gap because $W'_k$ from Eqn. 10 is not monotonically decreasing.

[Figure 1: four panels (weighted Gap*, weighted Gap, Gap*, Gap), each plotted against the number of clusters.]
Fig. 1. Standard and weighted Gap and Gap$^*$ functions for the Iris data set.

3.2. Not well separated clusters

Now we consider clusters which are not well separated. We simulated 1000 data sets with two clusters each, with different proportions of overlap. Each cluster had 50 observations with two variables. Both variables were drawn independently from Gaussian distributions; for observations from the first cluster both variables had expected value 0 and standard deviation 1. For observations from the second cluster both variables were again randomly drawn from a Gaussian distribution, with expected value $\Delta$ and standard deviation 1. As a result, there are two clusters, where the distance between the means of the two clusters decreases with decreasing value of $\Delta$. We use $\Delta = 0.5, 1, 1.5, \ldots, 5.0$.
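The simulation design just described can be sketched in Python as follows. This is only an illustration under stated assumptions: NumPy's random generator is used, the seed is arbitrary, and the function name `simulate_two_clusters` is hypothetical. With ten values of $\Delta$, 100 repetitions per value give the 1000 data sets.

```python
import numpy as np

def simulate_two_clusters(delta, n_per_cluster=50, rng=None):
    """Two bivariate Gaussian clusters with unit standard deviation.

    The first cluster is centred at (0, 0), the second at (delta, delta).
    Returns the data matrix and the true cluster labels.
    """
    rng = np.random.default_rng() if rng is None else rng
    first = rng.normal(0.0, 1.0, size=(n_per_cluster, 2))
    second = rng.normal(delta, 1.0, size=(n_per_cluster, 2))
    X = np.vstack([first, second])
    labels = np.repeat([0, 1], n_per_cluster)
    return X, labels

deltas = np.linspace(0.5, 5.0, 10)       # 0.5, 1.0, ..., 5.0
rng = np.random.default_rng(0)           # seed chosen only for reproducibility
data_sets = [simulate_two_clusters(d, rng=rng)
             for d in deltas for _ in range(100)]
```

Each element of `data_sets` is a pair `(X, labels)` with `X` of shape `(100, 2)`; the Gap and Gap$^*$ functions would then be computed on each `X` after clustering.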
For each of the ten unique values of $\Delta$, 100 data sets were generated, and the original Gap and the proposed Gap$^*$ functions were calculated for these data sets. Figure 3 shows the percentage of data sets for which two was found as the number of clusters, for each type of data set. It can be observed that the original Gap was better than Gap$^*$ at estimating the proper number of clusters in overlapping clusters. These results were expected due to the tendency of the Gap to overestimate the number of clusters, which has been reported by Dudoit and Fridlyand (2002).

[Figure 2: four panels (weighted Gap*, weighted Gap, Gap*, Gap), each plotted against the number of clusters.]
Fig. 2. Standard and weighted Gap and Gap$^*$ functions for the Breast Cancer Wisconsin data set.

[Figure 3: percentage of 2 clusters chosen, plotted against delta, for Gap and Gap*.]
Fig. 3. Gap function from Eqn. 3 and Gap$^*$ function from Eqn. 8 are compared for 10 data sets with two clusters. The two clusters have a different portion of overlapping area in each data set.

3.3. Unequally sized clusters

Yin et al. (2008) report that whenever the number of observations in one cluster is more than six-fold the number of observations in the other clusters, the Gap statistic is not able to estimate the number of clusters accurately. This effect depends not only on the difference in size between clusters but also on the distance between clusters.

Table 2. Five simulated data sets with two clusters, with $N_1$ and $N_2$ samples in the first and second cluster, respectively.

simulation    N1      N2     m = N1/N2
1             765     765    1
2             1020    510    2
3             1224    306    4
4             1360    170    8
5             1440    90     16
We study this effect in the special case of two clusters sampled from two 2D normal distributions $N(\mu, I)$ and $N(\mu', I)$, where $\mu$ and $\mu'$ are two different expected values and $I$ is the identity matrix. Details of this study are given in Appendix B. Suppose $N_1$ is the number of samples in the first cluster and $N_2$ is the number of samples in the second cluster, with $N_1 = m \cdot N_2$ and $n = N_1 + N_2$. For a fixed total number of samples $n$, the value of $W_1$ decreases as $m$ increases. Thus, Gap(1) increases while Gap(2) is almost unchanged. When $m$ becomes large enough, Gap(1) will be greater than Gap(2), and the estimated number of clusters will be one. The values of $m$ for which Gap and Gap$^*$ can still estimate two as the proper number of clusters can be estimated from the following two inequalities (see Appendix B, Eqns. 24 and 25):

(a) for Gap

$$\frac{m d}{(m + 1)^2} \ge \frac{E(d_1)}{E(d_2)} - 1 \tag{13}$$

(b) for Gap$^*$

$$\frac{2 m d}{(m + 1)^2} \ge E(d_1) - E(d_2) \tag{14}$$

where $d$ is the average distance between the points in the first cluster and the points in the second cluster, $E(d_1)$ is the expected distance of two points from a rectangular uniform distribution with sides $a$ and $b$, and $E(d_2)$ is the expected distance of two points from a rectangular uniform distribution with sides $a/2$ and $b$.

These results are illustrated in an example in Fig. 4. In this example we compared five data sets with two clusters of different sizes. The total number of observations is the same in all five data sets; however, the ratio of observations is varied. Table 2 summarizes the size of the clusters in each data set and the ratio between the number of observations in the two clusters. In the first data set the numbers of observations in the first ($N_1$) and in the second cluster ($N_2$) are equal. In the other four data sets $N_1$ increases and $N_2$ decreases as given in Table 2.
Samples were drawn as follows:

(a) Select $N_1^{max}$ as the maximum number of samples in the first cluster over all five data sets.
(b) Select $N_2^{max}$ as the maximum number of samples in the second cluster over all five data sets.
(c) Draw $N_1^{max}$ samples from a bivariate normal distribution with parameters $(\mu, I)$, where $\mu = (0, 0)$.
(d) Draw $N_2^{max}$ samples from a bivariate normal distribution with parameters $(\mu', I)$, where $\mu' = (5, 0)$.
(e) For each data set, select the first $N_1$ samples from the $N_1^{max}$ sample points according to the number $N_1$ given for this data set in Table 2.
(f) For each data set, select the first $N_2$ samples from the $N_2^{max}$ sample points according to the number $N_2$ given for this data set in Table 2.

[Figure 4: four panels (log(Wk), Wk, Gap, Gap*), each plotted against the number of clusters; curves for 1 Cluster, N1 = N2, N1 = 2N2, N1 = 4N2, N1 = 8N2, N1 = 16N2.]
Fig. 4. Top: $\log(W_k)$ (left) and $W_k$ (right) for five simulated data sets with two clusters each, where $N_1$ and $N_2$ are the number of samples in the first and second cluster, respectively. Bottom: Gap (left) and Gap$^*$ (right) for these data sets.

According to the estimations in Appendix B and the inequalities 13 and 14, in this example $E(d_1) \approx 4.53$, $E(d_2) \approx 2.99$, and $d \approx 3.48$. As a result, the Gap statistic determines two as the proper number of clusters only for $m < 6$ with the original Gap, and only for $m < 2$ with the proposed Gap$^*$. Figure 4 shows $\log(W_k)$ and $W_k$ for all five simulated data sets. The blue dotted line is the expected $\log(W_k)$ (top left) and the expected $W_k$ (top right) under the null reference distribution.
As demonstrated in Fig. 4, when the number of samples in the first cluster increases relative to the second cluster, the within-cluster dispersion $W_2$ remains the same but $W_1$ decreases. Depending on how far apart the two clusters are, increasing the ratio of observations in the two clusters increases the Gap(1) value. Figure 4 shows the original Gap function (bottom left) and the proposed Gap$^*$ function (bottom right) for these five data sets. The estimated $m$ from the inequalities 13 and 14 is confirmed by the results illustrated in Fig. 4.

3.4. Simulated data with increasing Gap function

In this experiment, data were simulated such that the calculated Gap function (Gap from Eqn. 3) is a strictly increasing function. A data set was simulated 2000 times, and for each simulated data set the original Gap and the proposed Gap$^*$ statistic were calculated. Each simulated data set consists of two clusters. Each cluster contains 50 observations from an n-dimensional variable space. In the first cluster, each feature was sampled at random from a uniform distribution on the interval $[0, 10]$. For the second cluster only the first variable was sampled from the same uniform distribution. All other variables of observations in the second cluster were set to zero. Half of the data sets were simulated in a 100-dimensional variable space while the other half were simulated in a 2-dimensional variable space.

Figure 5 depicts the average Gap and the average Gap$^*$ functions for both the 2D data sets and the 100D data sets. For the 2D data sets, both Gap functions suggest two as the proper number of clusters. However, it can be seen that the Gap function for the 100D data sets is a strictly increasing function. This is indeed expected due to the "curse of dimensionality" (Bellman, 1961). Beyer et al.
(1999) have shown that the minimum and the maximum occurring distances become indiscernible, as the difference between the minimum and the maximum value relative to the minimum value converges to 0 as the dimensionality $d$ goes to infinity,

$$\lim_{d \to \infty} \frac{dist_{max} - dist_{min}}{dist_{min}} \to 0. \tag{15}$$

Consequently, all the distances $d_{ii'}$ from Eqn. 1 can be considered to be equal in a high dimensional space. Consider $n$ observations from a 100-dimensional uniform distribution and suppose these samples are divided into $k$ clusters $C_1, C_2, \ldots, C_k$, where $|C_1| = |C_2| = \ldots = |C_k| = n/k$. If all $d_{ii'} = dist$, then $W_k$ is equal to

$$W_k = \left( \frac{n}{2} - \frac{k}{2} \right) dist. \tag{16}$$

[Figure 5: four panels showing average Gap* and average Gap against the number of clusters.]
Fig. 5. Average Gap and Gap$^*$ for simulated 2D (left) and 100D (right) data sets from the experiments in section 3.4.

As the number of clusters $k$ increases, $W_k$ in Eqn. 16 decreases linearly. The slope of this line is the same for all data sets sampled from the same high dimensional uniform population, even with different numbers of samples. Here, in the case of a 100D data set, for all $k > 2$ only the first cluster will be divided further, due to the large distances between the samples in this cluster compared to the second cluster. Hence, $W_k$ will be linear for $k > 2$ and parallel to $E^*(W^*_k)$. The difference $E^*(W^*_k) - W_k$ remains constant as $E^*(W^*_k)$ and $W_k$ decrease, so their ratio grows. Therefore, the Gap function, which is the logarithm of a ratio (Eqn. 7), is strictly increasing. On the other hand, whenever the difference $E^*(W^*_k) - W_k$ remains constant, Gap$^*(k)$ and Gap$^*(k + 1)$ will be equal. Therefore, due to the Gap condition in Eqn. 4, $k$ will be suggested as the proper number of clusters by the proposed Gap$^*$ statistic.
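The argument above can be checked numerically with a toy example in Python. The numbers below are hypothetical and not taken from the simulations: the reference curve is linear in $k$, the data curve runs parallel to it for $k \ge 2$, and the reference replicates are assumed to be identical, so that $s_k = 0$ and the mean of the logarithms equals the logarithm of the mean.

```python
import numpy as np

def select_k(gap, s):
    """Smallest k (1-based) with gap(k) >= gap(k+1) - s(k+1), else None (Eqn. 4)."""
    for i in range(len(gap) - 1):
        if gap[i] >= gap[i + 1] - s[i + 1]:
            return i + 1
    return None

k = np.arange(1, 11)
W_ref = 100.0 - 2.0 * k                    # hypothetical E*(W*_k), linear in k
W = np.where(k == 1, 95.0, W_ref - 20.0)   # data: parallel to W_ref for k >= 2
s = np.zeros(len(k))                       # identical replicates, so s_k = 0

gap = np.log(W_ref) - np.log(W)            # original Gap (Eqn. 3)
gap_star = W_ref - W                       # proposed Gap* (Eqn. 8)

k_gap = select_k(gap, s)                   # None: gap is strictly increasing
k_gap_star = select_k(gap_star, s)         # 2: gap_star is constant from k = 2 on
```

In this constructed case the logarithmic Gap keeps growing although the distance between the two curves is fixed at 20 for all $k \ge 2$, so the rule in Eqn. 4 never fires, while Gap$^*$ satisfies it at $k = 2$.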
Table 3 lists the number of clusters found with the original Gap and the proposed Gap$^*$ statistic for 1000 simulations of 2D and 100D data sets, respectively. While for the 2D simulation both the original Gap and the proposed Gap$^*$ statistic perform similarly, the original Gap fails to find the true number of clusters for all of the 1000 simulated 100D data sets. The proposed Gap$^*$ statistic, however, is able to determine the true number of clusters for these simulations.

Table 3. Number of clusters for 1000 2D and 1000 100D data sets, estimated by Gap and Gap$^*$.

                 Estimate of number of clusters
        Method   1     2      3     4    5    6    7    8    9    ≥10
2D      Gap      368   489    143   0    0    0    0    0    0    0
2D      Gap*     270   567    162   1    0    0    0    0    0    0
100D    Gap      0     0      0     0    0    0    1    3    1    995
100D    Gap*     0     1000   0     0    0    0    0    0    0    0

3.5. Real data set with increasing Gap function

We evaluated both Gap functions further on seven real data sets from Dynamic Contrast-Enhanced Magnetic Resonance Imaging (DCE-MRI) of breast tumors (German Cancer Research Center (DKFZ), 2004). For each data set a selected slice through the tumor with thickness TH = 6 mm and field of view FOV = 320 mm × 320 mm was measured every 3.25 s for 6.9 minutes. As a result, each voxel in a data set is described by a signal time curve of length T = 128 during the contrast agent passage through the tumor (Brix et al., 2004). These curves give valuable information about blood circulation and permeability of tumor tissue. Hence, it is of interest to detect voxels with similar signal curves. Previously, different clustering methods were applied to DCE-MRI data (Fischer and Hennig, 1999; Nattkemper et al., 2005; Varini et al., 2006; Wismüller et al., 2006; Schlossbauer et al., 2008; Castellani et al., 2009). One of the main challenges in this approach is to determine the number of underlying patterns in the signal curves.
To this end we applied the Gap statistic to DCE-MRI data. As before, we used the average linkage clustering method with squared Euclidean distance as the measure of dissimilarity. The samples are the signal curves of voxels, each of which is described by 128 features, i.e., time points. Table 4 gives the number of clusters found with the original Gap and the proposed Gap$^*$ for the seven DCE-MRI data sets. The tumors in all of these images are of the same type. Using the proposed Gap$^*$ statistic, five clusters were found in five of the seven images, whereas with the original Gap statistic no consistent number of clusters, i.e., regions, was found.

Fig. 6 shows the resulting Gap and Gap$^*$ functions for one of the DCE-MR images (data set 4). Similar to the simulated data set in section 3.4, the Gap function is a strictly increasing function, whereas the Gap$^*$ function is not strictly increasing and suggests five as the number of clusters for this data set. In Fig. 7(a) the first and second principal components of the data set are depicted, and the five identified clusters are shown in different colors and with different symbols. The intensity curves for the voxels in each cluster are shown in Fig. 7(c); the mean curve of each cluster is depicted in red. Fig. 7(b) depicts the tumor image with voxels colored according to their cluster, with the same colors as in sub-figure (a). A ring-shaped ordering of the five clusters can be observed in this image. This ordering is in agreement with enhancement patterns reported in medicine, such as circumferential, centripetal and peripheral ring contrast (Buadu et al., 1997). However, so far there is no information on the number of regions.

4. Discussion

The Gap statistic is one of the most popular methods for estimating the number of clusters in a data set. It is rather simple to implement and is used in many diverse applications.
As reported by Tibshirani et al. (2001), it outperforms many other methods. However, the Gap statistic is not able to suggest the correct number of clusters in some cases. Yin et al. (2008) have reported that in cases where the ratio of cluster sizes is more than six-fold, the Gap statistic does not work accurately. Dudoit and Fridlyand (2002) have mentioned the overestimation by the Gap statistic in some applications. Sugar and James (2003) have reported the failure of the Gap statistic in the case that data were derived from exponential distributions.

Table 4. Results for all seven DCE-MRI data sets analyzed with the Gap and the Gap$^*$ statistic. "nd" stands for not defined.

data set    number of voxels    Gap    Gap*
1           1260                7      7
2           207                 9      5
3           116                 9      5
4           262                 nd     5
5           141                 11     5
6           277                 nd     5
7           151                 13     4

[Figure 6: two panels (a, b) showing Gap* and Gap against the number of clusters.]
Fig. 6. Gap functions Gap and Gap$^*$ for the DCE-MRI data set of a breast tumor.

[Figure 7: (a) scatter plot of 1st against 2nd principal component, clusters 1 to 5; (b) segmentation map; (c) intensity against time for each of the five clusters.]
Fig. 7. Five clusters of the DCE-MRI breast tumor found with average linkage clustering. (a) First and second principal component of the DCE-MRI signals per voxel. Voxels are colored according to their cluster affiliation. (b) Segmentation map of the tumor. Voxels are colored as in subfigure (a). (c) Signal time curves for each voxel in the five respective clusters along with the mean curve (bold red line).
In this paper we have shown that using $\log(W_k)$ instead of $W_k$ in the calculation of the Gap function can be one cause of the overestimation of the number of clusters by the Gap statistic. There is no compelling theoretical reason to choose Eqn. 3 over Eqn. 8 for the definition of the Gap statistic. Indeed, using the logarithm function in the definition of the Gap statistic has a fundamental effect on its results. This is due to a property of the logarithm function described in the following example: Consider four positive numbers $a$, $b$, $c$, and $d$, with the logarithm of all of them greater than 1. Let $a > c$, $b > d$ and $a - b = c - d > 0$; then $\log(a) - \log(b) < \log(c) - \log(d)$. As the number of clusters increases, the within-cluster dispersion $W_k$ decreases. Consequently, the Gap function increases even when the distance between $W^*_k$ and $W_k$ remains the same.

Estimating the number of clusters depends on many factors. The choice of clustering method is one of these factors. The Gap statistic is designed to be applicable to any clustering method. In general, the results and discussions given in this work are not restricted to any clustering method. However, the choice of the clustering method influences the result of the Gap statistic. Different clustering methods look for different structures in data. The average linkage method, used for the Gap calculation in section 3.1, was able to find the real number of clusters for both the "Iris" (Fisher, 1963) and the "Breast Cancer Wisconsin data set" (Wolberg et al., 1993), in contrast to the Gap function with k-means clustering reported by Yan and Ye (2007).

Comparing the original Gap and the proposed Gap$^*$ statistic, the original Gap statistic performs better than Gap$^*$ in the case of overlapping clusters, due to the tendency of the Gap to overestimate the number of clusters.
In real applications, however, it is up to the user to decide whether two clusters with an overlapping area should be considered as one cluster or two.

In previous studies (Tibshirani et al., 2001; Yan and Ye, 2007; Yin et al., 2008; Dudoit and Fridlyand, 2002; Sugar and James, 2003) it was reported that null reference data generated from a uniform distribution aligned with the principal components of the data lead to a better performance of the Gap statistic. The Gap function calculated from such null reference data is referred to as $\mathrm{Gap}_{pc}$. It would be interesting to compare $\mathrm{Gap}_{pc}$ and $\mathrm{Gap}^*_{pc}$ in further studies.

We have introduced Gap*, which compares the expected value of $W^*_k$ with $W_k$. Thus, it reflects exactly the changes in the within-cluster dispersion of the real data against the expected $W^*_k$ of the null reference data set. Whenever the original Gap yields a $k$ as the proper number of clusters, this $k$ is also a possible answer with the proposed Gap*. In contrast, there are situations where the proposed Gap* function is able to offer a number as the proper number of clusters while the original Gap has no answer. The evaluations in section 3 support this idea. In subsections 3.4 and 3.5, the original Gap function is strictly increasing; hence it cannot find any cluster number. On the other hand, Gap* is not strictly increasing and is therefore able to suggest a cluster number for the data. For the simulated data in subsection 3.4 the suggested number is equal to the real number of clusters. For the real data set in subsection 3.5, however, we have no reference to decide whether the number suggested by the proposed Gap* statistic is the proper number of clusters. Further experiments are
necessary on real data with known cluster number to verify the accuracy of the proposed Gap* statistic in cases where the original Gap is a strictly increasing function. Our experiments suggest that such data possibly come from a multidimensional feature space with different variances along the different feature axes.

5. Acknowledgments

We are thankful to the UCI Machine Learning Repository, which provides free access to real data sets (http://archive.ics.uci.edu/ml). We also thank the German Cancer Research Center (DKFZ), which provided us with the DCE-MRI data sets. This study was performed as part of a joint research project supported by the German "Competence Alliance on Radiation Research." Volker J. Schmid was funded by the LMUinnovative project "BioMed-S – Analysis and Modeling of Complex Systems".

A. Proofs

A.1. Proof of proposition (1):

Proof.

$$\log\left(\frac{A}{d_1}\right) \geq \log\left(\frac{B}{d_2}\right) \Rightarrow \frac{A}{B} \geq \frac{d_1}{d_2} \Rightarrow \frac{A}{B} \geq 1 \text{ and } \frac{d_1}{d_2} \geq 1 \Rightarrow \frac{A}{B} - 1 \geq \frac{d_1}{d_2} - 1$$

Proof by contradiction: if $C - d_1 \geq D - d_2$ is not true, then we have:

$$C - d_1 < D - d_2$$

$$C - D < d_1 - d_2 \Rightarrow \frac{C - D}{d_2} < \frac{d_1}{d_2} - 1 \Rightarrow \frac{C - D}{d_2} < \frac{A}{B} - 1$$

$$\frac{1}{B}\sum_b W^*_{kb} - \frac{1}{B}\sum_b W^*_{k+1\,b} < d_2\,\frac{\left(\prod_b W^*_{kb}\right)^{1/B}}{\left(\prod_b W^*_{k+1\,b}\right)^{1/B}} - d_2 \tag{17}$$

The relationship between the geometric and the arithmetic mean says:

$$\left(\prod_b \frac{W^*_{kb}}{W^*_{k+1\,b}}\right)^{1/B} \leq \frac{1}{B}\sum_b \frac{W^*_{kb}}{W^*_{k+1\,b}}$$

so we can rewrite Eqn. 17 as follows:

$$\frac{1}{d_2}\sum_b W^*_{kb} - \frac{1}{d_2}\sum_b W^*_{k+1\,b} < \sum_b \frac{W^*_{kb}}{W^*_{k+1\,b}} - B \tag{18}$$

$$\Rightarrow \sum_b \frac{W^*_{kb}}{d_2} - \sum_b \frac{W^*_{kb}}{W^*_{k+1\,b}} < \sum_b \frac{W^*_{k+1\,b}}{d_2} - B$$

$$\Rightarrow \sum_b \frac{W^*_{kb}}{W^*_{k+1\,b}}\left(\frac{W^*_{k+1\,b} - d_2}{d_2}\right) < \sum_b \frac{W^*_{k+1\,b} - d_2}{d_2}$$

For all values of $b$, $\frac{W^*_{kb}}{W^*_{k+1\,b}} \geq 1$. Setting $\frac{W^*_{kb}}{W^*_{k+1\,b}}$ to its minimum value 1, we have:

$$\sum_b \frac{W^*_{k+1\,b} - d_2}{d_2} < \sum_b \frac{W^*_{k+1\,b} - d_2}{d_2}$$

which is a contradiction. □

A.2. Proof of proposition (2):

Proof. From the previous proof we have:

$$\frac{A}{B} - 1 < \frac{C - D}{d_2}$$

Thus, in the case of $d_1 - d_2 = C - D$ we will have:

$$\frac{A}{B} < \frac{d_1}{d_2}$$

□

B. Case Study: Unequally sized clusters

In the following case study, the effect of a difference in the number of samples between clusters on the Gap statistic was studied. The case study considered data sets each consisting of two clusters sampled from two 2D normal distributions $N(\mu, \sigma^2 I)$ and $N(\mu', \sigma^2 I)$, where $\mu$ and $\mu'$ are the expected values, $I$ is the identity matrix, and $\sigma^2 > 0$ is a positive real number. According to the standard score (Glenberg and Andrzejewski, 2008), 99.7% of samples will be inside a circle with radius $3\sigma$. Here, the rectangle of the uniform distribution, from which the null references are sampled, was estimated as a rectangle with sides $6\sigma + \Delta$ and $6\sigma$, as illustrated in Fig. 8, where $\Delta = \lVert \mu - \mu' \rVert$.

Fig. 8. Two 2D distributions from the case study in Appendix B. Each distribution is depicted with three areas containing 68.2%, 95.45%, and 99.7% of the samples.

Let $N_1$ be the number of samples in the first cluster and $N_2$ be the number of samples in the second cluster, with $N_1 = m \cdot N_2$ and $n = N_1 + N_2$. In section 3.2 we observed that in the case of $N_1 = N_2$, for $\Delta \geq 5\sigma$ both Gap functions estimate two as the proper number of clusters. In this study we want to show how changes in $m$ affect the result of the Gap statistic. Let $\Delta \geq 5\sigma$ and $n$ be fixed. For the Gap statistic it is necessary to have $\mathrm{Gap}_n(1) \leq \mathrm{Gap}_n(2) - s_2$ in order to be able to choose $k = 2$ as the proper number of clusters; otherwise it suggests $k = 1$. We ignore $s_2$ and consider the inequality $\mathrm{Gap}_n(1) \leq \mathrm{Gap}_n(2)$. The next two inequalities follow from Eqns.
3 and 8 for Gap and Gap*, respectively:

(a) Gap:

$$\left(\prod_b \frac{W^*_{1b}}{W^*_{2b}}\right)^{1/B} \leq \frac{W_1}{W_2} \tag{19}$$

(b) Gap*:

$$\frac{1}{B}\sum_b \left(W^*_{1b} - W^*_{2b}\right) \leq W_1 - W_2 \tag{20}$$

Each $W^*_{1b}$ can be estimated as $nE(d_1)$, where $E(d_1)$ is the expected distance between two random points from a rectangular uniform distribution with sides $6\sigma + \Delta$ and $6\sigma$. In a similar way, $W^*_{2b}$ can be estimated as $nE(d_2)$, where $E(d_2)$ is the expected distance between two random points from a rectangular uniform distribution with sides $\frac{6\sigma + \Delta}{2}$ and $6\sigma$. The expected distance between two random points sampled from a rectangular uniform distribution with sides $a$ and $b$, with $a \geq b$, is given by (Santalo, 1976)

$$E(d) = \frac{1}{15}\left[\frac{a^3}{b^2} + \frac{b^3}{a^2} + d\left(3 - \frac{a^2}{b^2} - \frac{b^2}{a^2}\right) + \frac{5}{2}\left(\frac{b^2}{a}\log\frac{a + d}{b} + \frac{a^2}{b}\log\frac{b + d}{a}\right)\right] \tag{21}$$

where $d = \sqrt{a^2 + b^2}$. Using these estimates and Eqns. 19 and 20 we obtain

(a) Gap:

$$\frac{E(d_1)}{E(d_2)} \leq \frac{W_1}{W_2} \tag{22}$$

(b) Gap*:

$$n\left(E(d_1) - E(d_2)\right) \leq W_1 - W_2 \tag{23}$$

Furthermore, we can take into account that $W_1$ includes the inter-cluster distances between the first and second clusters in addition to all distances which are used in the calculation of $W_2$. Therefore $W_1$ can be written as $W_2 + \frac{2 N_1 N_2 d_\Delta}{n}$, where $d_\Delta$ is the average inter-cluster distance. Consequently, inequalities (22) and (23) can be rewritten as:

(a) Gap:

$$\frac{E(d_1)}{E(d_2)} - 1 \leq \frac{m\, d_\Delta}{\sigma (m + 1)^2} \tag{24}$$

(b) Gap*:

$$E(d_1) - E(d_2) \leq \frac{2 m\, d_\Delta}{(m + 1)^2} \tag{25}$$

References

Bellman, R. (1961). Adaptive control processes: a guided tour. A Rand Corporation Research Study Series. Princeton University Press.

Beyer, K., J. Goldstein, R. Ramakrishnan, and U. Shaft (1999). When is nearest neighbor meaningful? In C. Beeri and P. Buneman (Eds.), Database Theory ICDT99, Volume 1540 of Lecture Notes in Computer Science, pp. 217–235. Springer Berlin / Heidelberg.

Brix, G., F.
Kiessling, R. Lucht, S. Darai, K. Wasser, S. Delorme, and J. Griebel (2004). Microcirculation and microvasculature in breast tumors: pharmacokinetic analysis of dynamic MR image series. Magnetic Resonance in Medicine 52(2), 420–429.

Buadu, L., J. Murakami, S. Murayama, N. Hashiguchi, S. Sakai, S. Toyoshima, K. Masuda, S. Kuroki, and S. Ohno (1997). Patterns of peripheral enhancement in breast masses: correlation of findings on contrast medium enhanced MRI with histologic features and tumor angiogenesis. Journal of computer assisted tomography 21(3), 421.

Caliński, T. and J. Harabasz (1974). A dendrite method for cluster analysis. Communications in Statistics-Theory and Methods 3(1), 1–27.

Castellani, U., M. Cristiani, A. Daducci, P. Farace, P. Marzola, V. Murino, and A. Sbarbati (2009). DCE-MRI data analysis for cancer area classification. Methods of information in medicine 48(3), 248–253.

Dudoit, S. and J. Fridlyand (2002). A prediction-based resampling method for estimating the number of clusters in a dataset. Genome biology 3(7).

Fischer, H. and J. Hennig (1999). Neural network-based analysis of MR time series. Magnetic Resonance in Medicine 41(1), 124–131.

Fisher, R. (1963). Irvine, CA: University of California, School of Information and Computer Science: UCI Machine Learning Repository. http://archive.ics.uci.edu/ml.

German Cancer Research Center (DKFZ) (2004). Research program "Innovative Diagnosis and Therapy". Heidelberg, Germany: German Cancer Research Center (DKFZ).

Glenberg, A. and M. Andrzejewski (2008). Learning from data: An introduction to statistical reasoning. Taylor & Francis Group, LLC.

Hartigan, J. (1975). Clustering algorithms. John Wiley & Sons, Inc. New York, NY, USA.

Kaufman, L. and P. Rousseeuw (1990). Finding Groups in Data: An Introduction to Cluster Analysis. New York: Wiley Interscience.
Krzanowski, W. and Y. Lai (1988). A criterion for determining the number of groups in a data set using sum-of-squares clustering. Biometrics 44(1), 23–34.

Nattkemper, T., B. Arnrich, O. Lichte, W. Timm, A. Degenhard, L. Pointon, C. Hayes, and M. Leach (2005). Evaluation of radiological features for breast tumour classification in clinical screening with machine learning methods. Artificial Intelligence in Medicine 34(2), 129–139.

Santalo, L. A. (1976). Integral geometry and geometric probability / Luis A. Santalo; with a foreword by Mark Kac, pp. 49. Addison-Wesley Pub. Co., Advanced Book Program, Reading, Mass.

Schlossbauer, T., G. Leinsinger, A. Wismuller, O. Lange, M. Scherr, A. Meyer-Baese, and M. Reiser (2008). Classification of small contrast enhancing breast lesions in dynamic magnetic resonance imaging using a combination of morphological criteria and dynamic analysis based on unsupervised vector-quantization. Investigative radiology 43(1), 56.

Scott, A. and M. Symons (1971). Clustering methods based on likelihood ratio criteria. Biometrics 27(2), 387–397.

Sugar, C. A. and G. M. James (2003). Finding the Number of Clusters in a Data Set - An Information Theoretic Approach. J. Am. Statist. Ass. 98(463), 750–763.

Tibshirani, R., G. Walther, and T. Hastie (2001). Estimating the number of clusters in a data set via the gap statistic. J. R. Statist. Soc. B 63(2), 411–423.

Varini, C., A. Degenhard, and T. Nattkemper (2006). Visual exploratory analysis of DCE-MRI data in breast cancer by dimensional data reduction: A comparative study. Biomedical Signal Processing and Control 1(1), 56–63.

Wendl, M. and S. Yang (2004). Gap statistics for whole genome shotgun DNA sequencing projects. Bioinformatics 20(10), 1527–1534.

Wismüller, A., A. Meyer-Bäse, O. Lange, T. Schlossbauer, M. Kallergi, M.
Reiser, and G. Leinsinger (2006). Segmentation and classification of dynamic breast magnetic resonance image data. Journal of Electronic Imaging 15, 013020.

Wolberg, W., W. Street, and O. Mangasarian (1993). Irvine, CA: University of California, School of Information and Computer Science: UCI Machine Learning Repository. http://archive.ics.uci.edu/ml.

Yan, M. and K. Ye (2007). Determining the number of clusters using the weighted gap statistic. Biometrics 63(4), 1031–1037.

Yang, Q., L. Tang, W. Dong, and Y. Sun (2009). Image edge detecting based on gap statistic model and relative entropy. In Y. Chen, H. Deng, D. Zhang, and Y. Xiao (Eds.), FSKD (5), pp. 384–387. IEEE Computer Society.

Yin, Z., X. Zhou, C. Bakal, F. Li, Y. Sun, N. Perrimon, and S. Wong (2008). Using iterative cluster merging with improved gap statistics to perform online phenotype discovery in the context of high-throughput RNAi screens. BMC bioinformatics 9(1), 264.

Zheng-Jun, Z. and Z. Yao-Qin (2009). Estimating the image segmentation number via the entropy gap statistic. In ICIC '09: Proceedings of the 2009 Second International Conference on Information and Computing Science, Washington, DC, USA, pp. 14–16. IEEE Computer Society.
