Data spectroscopy: Eigenspaces of convolution operators and clustering
Authors: Tao Shi (Ohio State University) – principal investigator; Mikhail Belkin (Ohio State University) – theoretical analysis; Bin Yu (University of California, Berkeley) – algorithm design and experiments
The Annals of Statistics
2009, Vol. 37, No. 6B, 3960–3984
DOI: 10.1214/09-AOS700
© Institute of Mathematical Statistics, 2009

DATA SPECTROSCOPY: EIGENSPACES OF CONVOLUTION OPERATORS AND CLUSTERING

By Tao Shi(1), Mikhail Belkin(2) and Bin Yu(3)

Ohio State University, Ohio State University and University of California, Berkeley

This paper focuses on obtaining clustering information about a distribution from its i.i.d. samples. We develop theoretical results to understand and use clustering information contained in the eigenvectors of data adjacency matrices based on a radial kernel function with a sufficiently fast tail decay. In particular, we provide population analyses to gain insights into which eigenvectors should be used and when the clustering information for the distribution can be recovered from the sample. We learn that a fixed number of top eigenvectors might at the same time contain redundant clustering information and miss relevant clustering information. We use this insight to design the data spectroscopic clustering (DaSpec) algorithm that utilizes properly selected eigenvectors to determine the number of clusters automatically and to group the data accordingly. Our findings extend the intuitions underlying existing spectral techniques such as spectral clustering and Kernel Principal Components Analysis, and provide new understanding into their usability and modes of failure. Simulation studies and experiments on real-world data are conducted to show the potential of our algorithm. In particular, DaSpec is found to handle unbalanced groups and recover clusters of different shapes better than the competing methods.

Received July 2008; revised March 2009.
(1) Supported in part by NASA Grant NNG06GD31G.
(2) Supported in part by NSF Early Career Award 0643916.
(3) Supported in part by NSF Grant DMS-06-05165, ARO Grant W911NF-05-1-0104, NSFC Grant 60628102, a grant from MSRA and a Guggenheim Fellowship in 2006.
AMS 2000 subject classifications. Primary 62H30; secondary 68T10.
Key words and phrases. Gaussian kernel, spectral clustering, kernel principal component analysis, support vector machines, unsupervised learning.

This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Statistics, 2009, Vol. 37, No. 6B, 3960–3984. This reprint differs from the original in pagination and typographic detail.

1. Introduction. Data clustering based on eigenvectors of a proximity or affinity matrix (or its normalized versions) has become popular in machine learning, computer vision and many other areas. Given data x_1, . . . , x_n ∈ R^d, this family of algorithms constructs an affinity matrix (K_n)_ij = K(x_i, x_j)/n based on a kernel function, such as the Gaussian kernel K(x, y) = exp(−‖x − y‖²/(2ω²)). Clustering information is obtained by taking eigenvectors and eigenvalues of the matrix K_n or of the closely related graph Laplacian matrix L_n = D_n − K_n, where D_n is the diagonal matrix with (D_n)_ii = Σ_j (K_n)_ij. The basic intuition is that when the data come from several clusters, distances between clusters are typically far larger than distances within the same cluster, so that K_n and L_n are (close to) block-diagonal matrices up to a permutation of the points. Eigenvectors of such block-diagonal matrices keep the same structure. For example, the bottom few eigenvectors of L_n (those with the smallest eigenvalues) can be shown to be constant on each cluster, assuming infinite separation between clusters, allowing one to distinguish the clusters by looking for data points corresponding to the same or similar values of the eigenvectors.
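The two matrices above are easy to write down concretely. The following minimal numpy sketch (function names ours) follows this paper's convention (K_n)_ij = K(x_i, x_j)/n:

```python
import numpy as np

def gaussian_kernel_matrix(X, omega):
    """Affinity matrix (K_n)_ij = exp(-||x_i - x_j||^2 / (2 omega^2)) / n."""
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    # squared pairwise distances, clipped at 0 to absorb rounding error
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * omega ** 2)) / n

def graph_laplacian(K):
    """Graph Laplacian L_n = D_n - K_n with (D_n)_ii = sum_j (K_n)_ij."""
    return np.diag(K.sum(axis=1)) - K
```

For two well-separated clusters, K_n is nearly block diagonal (cross-cluster entries essentially vanish), and every row of L_n sums to zero, which is why cluster indicator vectors are (near-)null vectors of L_n.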
In particular, we note the algorithm of Scott and Longuet-Higgins [13], who proposed to embed data into the space spanned by the top eigenvectors of K_n, normalize the data in that space and group the data by investigating the block structure of the inner product matrix of the normalized data. Perona and Freeman [10] suggested clustering the data into two groups by directly thresholding the top eigenvector of K_n. Another important algorithm, the normalized cut, was proposed by Shi and Malik [14] in the context of image segmentation. It separates data into two groups by thresholding the second smallest generalized eigenvector of L_n. Assuming k groups, Malik et al. [6] and Ng, Jordan and Weiss [8] suggested embedding the data into the span of the bottom k eigenvectors of the normalized graph Laplacian(1) I_n − D_n^{−1/2} K_n D_n^{−1/2} and applying the k-means algorithm to group the data in the embedding space. For further discussions on spectral clustering, we refer the reader to Weiss [20], Dhillon, Guan and Kulis [2] and von Luxburg [18]. An empirical comparison of various methods is provided in Verma and Meila [17]. A discussion of some limitations of spectral clustering can be found in Nadler and Galun [7]. A theoretical analysis of statistical consistency of different types of spectral clustering is provided in von Luxburg, Belkin and Bousquet [19].

Similarly to spectral clustering methods, Kernel Principal Component Analysis (Schölkopf, Smola and Müller [12]) and spectral dimensionality reduction (e.g., Belkin and Niyogi [1]) seek lower dimensional representations of the data by embedding them into the space spanned by the top eigenvectors of K_n or the bottom eigenvectors of the normalized graph Laplacian, with the expectation that this embedding keeps the nonlinear structure of the data.
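As a concrete reference point for the procedures just described, here is a minimal numpy sketch in the Ng–Jordan–Weiss style: embed into the top k eigenvectors of D_n^{−1/2} K_n D_n^{−1/2} (equivalently, the bottom k of the normalized Laplacian), row-normalize, then run k-means. The function name and the deterministic farthest-point k-means initialization are our choices, not part of [8]:

```python
import numpy as np

def njw_spectral_clustering(X, k, omega, n_iter=50):
    """Embed into the top-k eigenvectors of D^{-1/2} K D^{-1/2}, row-normalize,
    then cluster the embedded rows with a plain k-means."""
    sq = (X ** 2).sum(axis=1)
    K = np.exp(-np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0)
               / (2 * omega ** 2))
    np.fill_diagonal(K, 0.0)              # zero diagonal, as in footnote (1)
    d = K.sum(axis=1)
    M = K / np.sqrt(np.outer(d, d))       # D^{-1/2} K D^{-1/2}
    _, vecs = np.linalg.eigh(M)           # eigh returns ascending eigenvalues
    E = vecs[:, -k:]                      # top-k eigenvectors as columns
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    # k-means with deterministic farthest-point initialization
    centers = [E[0]]
    for _ in range(1, k):
        dist = np.min([((E - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(E[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = np.argmin(((E[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.array([E[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels
```

On well-separated clusters the row-normalized embedding concentrates each cluster near a single point on the unit sphere, which is what makes the final k-means step easy.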
Empirical observations have also been made that KPCA can sometimes capture clusters in the data. The concept of using eigenvectors of the kernel matrix is also closely connected to other kernel methods in the machine learning literature, notably Support Vector Machines (cf. Vapnik [16] and Schölkopf and Smola [11]), which can be viewed as fitting a linear classifier in the eigenspace of K_n.

(1) We assume here that the diagonal terms of K_n are replaced by zeros.

Although empirical results and theoretical studies both suggest that the top eigenvectors contain clustering information, the effectiveness of these algorithms hinges heavily on the choice of the kernel and its parameters, the number of top eigenvectors used, and the number of groups employed. As far as we know, there are no explicit theoretical results or practical guidelines on how to make these choices. Instead of tackling these questions for particular data sets, it may be more fruitful to investigate them from a population point of view. Williams and Seeger [21] investigated the dependence of the spectrum of K_n on the data density function and analyzed this dependence in the context of lower-rank matrix approximations to the kernel matrix. To the best of our knowledge, that work was the first theoretical study of this dependence.

In this paper we aim to understand spectral clustering methods based on a population analysis. We concentrate on exploring the connections between the distribution P and the eigenvalues and eigenfunctions of the distribution-dependent convolution operator

    K_P f(x) = ∫ K(x, y) f(y) dP(y).    (1.1)

The kernels we consider will be positive (semi-)definite radial kernels. Such kernels can be written as K(x, y) = k(‖x − y‖), where k : [0, ∞) → [0, ∞) is a decreasing function.
We will use kernels with sufficiently fast tail decay, such as the Gaussian kernel or the exponential kernel K(x, y) = exp(−‖x − y‖/ω).

The connections found allow us to gain some insight into when and why these algorithms are expected to work well. In particular, we learn that a fixed number of top eigenvectors of the kernel matrix do not always contain all of the clustering information. In fact, when the clusters are not balanced and/or have different shapes, the top eigenvectors may be inadequate and redundant at the same time. That is, some of the top eigenvectors may correspond to the same cluster while missing other significant clusters. Consequently, we devise a clustering algorithm that selects only those eigenvectors which carry clustering information not represented by the eigenvectors already selected.

The rest of the paper is organized as follows. In Section 2, we cover the basic definitions, notation and mathematical facts about the distribution-dependent convolution operator and its spectrum. We point out the strong connection between K_P and its empirical version, the kernel matrix K_n, which allows us to approximate the spectrum of K_P given data.

In Section 3, we characterize the dependence of the eigenfunctions of K_P on both the distribution P and the kernel function K(·,·). We show that the eigenfunctions of K_P decay to zero at the tails of the distribution P and that their decay rates depend on both the tail decay rate of P and that of the kernel K(·,·). For distributions with only one high-density component, we provide theoretical analysis. A discussion of three special cases can be found in Appendix A. In the first two examples, the exact form of the eigenfunctions of K_P can be found; in the third, the distribution is concentrated on or around a curve in R^d.
Further, we consider the case when the distribution P contains several separate high-density components. Through classical results of perturbation theory, we show that the top eigenfunctions of K_P are approximated by the top eigenfunctions of the corresponding operators defined on some of those components. However, not every component will contribute to the top few eigenfunctions of K_P, as the eigenvalues are determined by the size and configuration of the corresponding component. Based on this key property, we show that the top eigenvectors of the kernel matrix may or may not preserve all clustering information, which explains some empirical observations of certain spectral clustering methods. A real-world high-dimensional dataset, the USPS postal code digit data, is also analyzed to illustrate this property.

In Section 4, we utilize our theoretical results to construct the data spectroscopic clustering (DaSpec) algorithm that estimates the number of groups data-dependently, assigns labels to each observation, and provides a classification rule for unobserved data, all based on the same eigendecomposition. Data-dependent choices of algorithm parameters are also discussed. In Section 5, the proposed DaSpec algorithm is tested on two simulations against the commonly used k-means and spectral clustering algorithms. In both situations, the DaSpec algorithm provides favorable results even when the other two algorithms are provided with the number of groups in advance. Section 6 contains conclusions and discussion.

2. Notation and mathematical preliminaries.

2.1. Distribution-dependent convolution operator. Given a probability distribution P on R^d, we define L²_P(R^d) to be the space of square integrable functions, f ∈ L²_P(R^d) if ∫ f² dP < ∞, and the space is equipped with the inner product ⟨f, g⟩ = ∫ f g dP.
Given a kernel (symmetric function of two variables) K(x, y) : R^d × R^d → R, (1.1) defines the corresponding integral operator K_P. Recall that an eigenfunction φ : R^d → R and the corresponding eigenvalue λ of K_P are defined by the equation

    K_P φ = λφ,    (2.1)

and the constraint ∫ φ² dP = 1. If the kernel satisfies the condition

    ∫∫ K²(x, y) dP(x) dP(y) < ∞,    (2.2)

the corresponding operator K_P is a trace class operator, which, in turn, implies that it is compact and has a discrete spectrum. In this paper, we will only consider the case when a positive semi-definite kernel K(x, y) and a distribution P generate a trace class operator K_P, so that it has only countably many nonnegative eigenvalues λ_0 ≥ λ_1 ≥ λ_2 ≥ ··· ≥ 0. Moreover, there is a corresponding orthonormal basis in L²_P of eigenfunctions φ_i satisfying (2.1). The dependence of the eigenvalues and eigenfunctions of K_P on P will be one of the main foci of our paper. We note that an eigenfunction φ is uniquely defined not only on the support of P, but at every point x ∈ R^d through φ(x) = (1/λ) ∫ K(x, y) φ(y) dP(y), assuming that the kernel function K is defined everywhere on R^d × R^d.

2.2. Kernel matrix. Let x_1, . . . , x_n be an i.i.d. sample drawn from the distribution P. The corresponding empirical operator K_{P_n} is defined as

    K_{P_n} f(x) = ∫ K(x, y) f(y) dP_n(y) = (1/n) Σ_{i=1}^n K(x, x_i) f(x_i).

This operator is closely related to the n × n kernel matrix K_n, where (K_n)_ij = K(x_i, x_j)/n. Specifically, the eigenvalues of K_{P_n} are the same as those of K_n, and an eigenfunction φ with an eigenvalue λ ≠ 0 of K_{P_n} is connected with the corresponding eigenvector v = [v_1, v_2, . . . , v_n]′ of K_n by

    φ(x) = (1/(nλ)) Σ_{i=1}^n K(x, x_i) v_i    for all x ∈ R^d.

It is easy to verify that K_{P_n} φ = λφ.
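This correspondence between the eigenvector v and the extended eigenfunction φ can be verified numerically in a few lines (variable names and toy data are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
n, omega = X.shape[0], 1.0

sq = (X ** 2).sum(axis=1)
Kraw = np.exp(-np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0)
              / (2 * omega ** 2))
Kn = Kraw / n                              # (K_n)_ij = K(x_i, x_j)/n
lams, vecs = np.linalg.eigh(Kn)
lam, v = lams[-1], vecs[:, -1]             # top eigenpair of K_n

def phi(x):
    """phi(x) = (1/(n*lam)) sum_i K(x, x_i) v_i, defined for every x in R^d."""
    k = np.exp(-((X - x) ** 2).sum(axis=1) / (2 * omega ** 2))
    return k @ v / (n * lam)

# at the sample points, phi reproduces the eigenvector entries exactly
assert np.allclose(np.array([phi(x) for x in X]), v)
```

Away from the sample, phi simply interpolates with kernel weights, which is what makes it usable for labeling new points later on.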
Thus the values of φ at the locations x_1, . . . , x_n coincide with the corresponding entries of the eigenvector v. However, unlike v, φ is defined everywhere in R^d. As for the spectra of K_{P_n} and K_n, the only difference is that the spectrum of K_{P_n} contains 0 with infinite multiplicity; the corresponding eigenspace includes all functions vanishing on the sample points.

It is well known that, under mild conditions and when d is fixed, the eigenvectors and eigenvalues of K_n converge to the eigenfunctions and eigenvalues of K_P as n → ∞ (e.g., Koltchinskii and Giné [4]). Therefore, we expect the properties of the top eigenfunctions and eigenvalues of K_P to also hold for K_n, assuming that n is reasonably large.

3. Spectral properties of K_P. In this section, we study the spectral properties of K_P and their connection to the data-generating distribution P. We start with several basic properties of the top spectrum of K_P and then investigate the case when the distribution P is a mixture of several high-density components.

3.1. Basic spectral properties of K_P. Through Theorem 1 and its corollary, we obtain an important property of the eigenfunctions of K_P: these eigenfunctions decay fast away from the majority of the mass of the distribution, provided the tails of K and P decay fast. A second theorem establishes the important property that the top eigenfunction has no sign change and has multiplicity one. (Three detailed examples are provided in Appendix A to illustrate these two properties.)

Theorem 1 (Tail decay property of eigenfunctions). An eigenfunction φ with corresponding eigenvalue λ > 0 of K_P satisfies

    |φ(x)| ≤ (1/λ) (∫ [K(x, y)]² dP(y))^{1/2}.

Proof.
By the Cauchy–Schwarz inequality and the definition of an eigenfunction (2.1), we see that

    λ|φ(x)| = |∫ K(x, y) φ(y) dP(y)|
            ≤ ∫ K(x, y) |φ(y)| dP(y)
            ≤ (∫ [K(x, y)]² dP(y))^{1/2} (∫ [φ(y)]² dP(y))^{1/2}
            = (∫ [K(x, y)]² dP(y))^{1/2}.

The conclusion follows.

We see that the "tails" of the eigenfunctions of K_P decay to zero and that the decay rate depends on the tail behaviors of both the kernel K and the distribution P. This observation will be useful for separating high-density areas when P has several components. In fact, we immediately have the following corollary:

Corollary 1. Let K(x, y) = k(‖x − y‖) with k(·) nonincreasing. Assume that P is supported on a compact set D ⊂ R^d. Then

    |φ(x)| ≤ k(dist(x, D))/λ,

where dist(x, D) = inf_{y∈D} ‖x − y‖.

The proof follows from Theorem 1 and the fact that k(·) is nonincreasing. We now give an important property of the top eigenfunction (the one corresponding to the largest eigenvalue).

Theorem 2 (Top eigenfunction). Let K(x, y) be a positive semi-definite kernel with full support on R^d. The top eigenfunction φ_0(x) of the convolution operator K_P:

1. is the only eigenfunction with no sign change on R^d;
2. has multiplicity one;
3. is nonzero on the support of P.

The proof is given in Appendix B, and these properties will be used later when we propose our clustering algorithm in Section 4.

3.2. An example: top eigenfunctions of K_P for mixture distributions. We now study the spectrum of K_P defined by a mixture distribution

    P = Σ_{g=1}^G π^g P^g,    (3.1)

which is a commonly used model in clustering and classification. To reduce notational confusion, we use italicized superscripts 1, 2, . . . , g, . . . , G as the index of the mixing component and ordinary superscripts for the power of a number.
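As a brief numerical aside before the mixture analysis: for the empirical operator K_{P_n}, the integrals in Theorem 1 become sample averages, so its tail bound can be checked directly (toy data and variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 1))               # sample defining P_n
n, omega = 50, 0.5
K = np.exp(-(X - X.T) ** 2 / (2 * omega ** 2)) / n
lams, vecs = np.linalg.eigh(K)
lam = lams[-1]                             # top eigenvalue
v = vecs[:, -1] * np.sqrt(n)               # rescale so (1/n) sum v_i^2 = 1

def check_bound(x):
    """Theorem 1 with P = P_n: |phi(x)| <= (1/lam) sqrt(mean_i K(x, x_i)^2)."""
    kx = np.exp(-(X[:, 0] - x) ** 2 / (2 * omega ** 2))
    phi_x = kx @ v / (n * lam)             # extended eigenfunction at x
    bound = np.sqrt(np.mean(kx ** 2)) / lam
    return abs(phi_x) <= bound + 1e-12

assert all(check_bound(x) for x in np.linspace(-5.0, 5.0, 21))
```

The bound becomes tiny as x moves away from the sample, since every kernel value kx_i decays with distance: exactly the tail-decay phenomenon used below to separate components.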
For each mixing component P^g, we define the corresponding operator K_{P^g} as

    K_{P^g} f(x) = ∫ K(x, y) f(y) dP^g(y).

We start with the mixture Gaussian example given in Figure 1. Gaussian kernel matrices K_n, K¹_n and K²_n (ω = 0.3) are constructed from three batches of 1000 i.i.d. samples, one from each of the three distributions 0.5 N(2, 1²) + 0.5 N(−2, 1²), N(2, 1²) and N(−2, 1²). We observe that the top eigenvectors of K_n are nearly identical to the top eigenvectors of K¹_n or K²_n. From the point of view of operator theory, it is easy to understand this phenomenon: with a properly chosen kernel, the top eigenfunctions of an operator defined on each mixing component are approximate eigenfunctions of the operator defined on the mixture distribution.

Fig. 1. Eigenvectors of a Gaussian kernel matrix (ω = 0.3) of 1000 data points sampled from the mixture Gaussian distribution 0.5 N(2, 1²) + 0.5 N(−2, 1²). Left panels: histogram of the data (top), first eigenvector of K_n (middle) and second eigenvector of K_n (bottom). Right panels: histograms of data from each component (top), first eigenvector of K¹_n (middle) and first eigenvector of K²_n (bottom).

To be explicit, let us consider the Gaussian convolution operator K_P defined by P = π¹P¹ + π²P², with Gaussian components P¹ = N(μ¹, [σ¹]²) and P² = N(μ², [σ²]²), and the Gaussian kernel K(x, y) with bandwidth ω. Due to the linearity of convolution operators, K_P = π¹K_{P¹} + π²K_{P²}. Consider an eigenfunction φ¹(x) of K_{P¹} with corresponding eigenvalue λ¹, so that K_{P¹}φ¹(x) = λ¹φ¹(x). We have

    K_P φ¹(x) = π¹λ¹φ¹(x) + π² ∫ K(x, y) φ¹(y) dP²(y).
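Under empirical measures the displayed decomposition is an exact linear-algebra identity, and with the separation of the Figure 1 example the second term is numerically negligible. A sketch (sample sizes and seed are our choices):

```python
import numpy as np

rng = np.random.default_rng(0)
omega = 0.3
x1 = rng.normal(2.0, 1.0, 500)
x2 = rng.normal(-2.0, 1.0, 500)
x = np.concatenate([x1, x2])       # sample from 0.5 N(2,1) + 0.5 N(-2,1)

def K(a, b):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * omega ** 2))

# top eigenpair of the component operator K_{P^1_n}
lam1s, V1 = np.linalg.eigh(K(x1, x1) / len(x1))
lam1, v1 = lam1s[-1], V1[:, -1]
phi1 = K(x, x1) @ v1 / (len(x1) * lam1)       # phi^1 extended to all points

lhs = K(x, x) @ phi1 / len(x)                 # K_{P_n} phi^1
main = 0.5 * lam1 * phi1                      # pi^1 lam^1 phi^1
rem = 0.5 * K(x, x2) @ phi1[500:] / len(x2)   # pi^2 K_{P^2_n} phi^1
assert np.allclose(lhs, main + rem)           # decomposition holds exactly
assert np.linalg.norm(rem) < 0.1 * np.linalg.norm(main)
```

The remainder stays small because phi1 has already decayed to nearly zero on the second component's high-density region, so phi1 is an approximate eigenfunction of the mixture operator.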
As shown in Proposition 1 in Appendix A, in the Gaussian case φ¹(x) is centered at μ¹ and its tail decays exponentially. Therefore, assuming enough separation between μ¹ and μ², π² ∫ K(x, y) φ¹(y) dP²(y) is close to 0 everywhere, and hence φ¹(x) is an approximate eigenfunction of K_P. In the next section, we show that a similar approximation holds for general mixture distributions whose components need not be Gaussian.

3.3. Perturbation analysis. For K_P defined by a mixture distribution (3.1) and a positive semi-definite kernel K(·,·), we now study the connection between its top eigenvalues and eigenfunctions and those of each K_{P^g}. Without loss of generality, let us consider a mixture of two components. We state the following theorem regarding the top eigenvalue λ_0 of K_P.

Fig. 2. Illustration of the separation condition (3.2) in Theorem 3.

Theorem 3 (Top eigenvalue of a mixture distribution). Let P = π¹P¹ + π²P² be a mixture distribution on R^d with π¹ + π² = 1. Given a positive semi-definite kernel K, denote the top eigenvalues of K_P, K_{P¹} and K_{P²} as λ_0, λ¹_0 and λ²_0, respectively. Then λ_0 satisfies

    max(π¹λ¹_0, π²λ²_0) ≤ λ_0 ≤ max(π¹λ¹_0, π²λ²_0) + r,

where

    r = (π¹π² ∫∫ [K(x, y)]² dP¹(x) dP²(y))^{1/2}.    (3.2)

The proof is given in Appendix B. As illustrated in Figure 2, the value of r in (3.2) is small when P¹ and P² do not overlap much. Meanwhile, the size of r is also affected by how fast K(x, y) approaches zero as ‖x − y‖ increases. When r is small, the top eigenvalue of K_P is close to the larger of π¹λ¹_0 and π²λ²_0. Without loss of generality, we assume π¹λ¹_0 > π²λ²_0 in the rest of this section.

The next lemma is a general perturbation result for the eigenfunctions of K_P.
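Before stating the lemma, note that Theorem 3 holds for any mixture, in particular for empirical measures, so its bounds can be checked directly on samples (weights, parameters and seed below are our toy choices):

```python
import numpy as np

rng = np.random.default_rng(3)
omega = 0.5
x1 = rng.normal(3.0, 0.7, 300)     # sample from P^1, weight pi1 = 300/500
x2 = rng.normal(-3.0, 1.0, 200)    # sample from P^2, weight pi2 = 200/500
pi1, pi2 = 0.6, 0.4
x = np.concatenate([x1, x2])       # empirical P = pi1 P^1_n + pi2 P^2_n

def K(a, b):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * omega ** 2))

lam0 = np.linalg.eigvalsh(K(x, x) / len(x))[-1]      # top eigenvalue of K_P
l1 = np.linalg.eigvalsh(K(x1, x1) / len(x1))[-1]     # lambda^1_0
l2 = np.linalg.eigvalsh(K(x2, x2) / len(x2))[-1]     # lambda^2_0
r = np.sqrt(pi1 * pi2 * np.mean(K(x1, x2) ** 2))     # (3.2) with P^g = P^g_n
m = max(pi1 * l1, pi2 * l2)
assert m - 1e-10 <= lam0 <= m + r + 1e-10            # Theorem 3 sandwich
```

With this much separation, r is tiny and the top eigenvalue of the mixture operator is essentially max(π¹λ¹_0, π²λ²_0), as the theorem predicts.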
The empirical (matrix) version of this lemma appeared in Diaconis, Goel and Holmes [3], and more general results can be traced back to Parlett [9].

Lemma 1. Consider an operator K_P with discrete spectrum λ_0 ≥ λ_1 ≥ ···. If ‖K_P f − λf‖_{L²_P} ≤ ε for some λ, ε > 0 and f ∈ L²_P with ‖f‖_{L²_P} = 1, then K_P has an eigenvalue λ_k such that |λ_k − λ| ≤ ε. If we further assume that

    s = min_{i : λ_i ≠ λ_k} |λ_i − λ_k| > ε,

then K_P has an eigenfunction f_k corresponding to λ_k such that

    ‖f − f_k‖_{L²_P} ≤ ε/(s − ε).

The lemma shows that a constant λ must be "close" to an eigenvalue of K_P if the operator "almost" projects a function f to λf. Moreover, the function f must be "close" to an eigenfunction of K_P if the distance between K_P f and λf is smaller than the eigen-gap between λ_k and the other eigenvalues of K_P.

We are now in a position to state the perturbation result for the top eigenfunction of K_P. Given the facts that |λ_0 − π¹λ¹_0| ≤ r and

    K_P φ¹_0 = π¹K_{P¹}φ¹_0 + π²K_{P²}φ¹_0 = (π¹λ¹_0)φ¹_0 + π²K_{P²}φ¹_0,

Lemma 1 indicates that φ¹_0 is close to φ_0 if ‖π²K_{P²}φ¹_0‖_{L²_P} is small enough. To be explicit, we formulate the following corollary.

Corollary 2 (Top eigenfunction of a mixture distribution). Let P = π¹P¹ + π²P² be a mixture distribution on R^d with π¹ + π² = 1. Given a positive semi-definite kernel K(·,·), denote the top eigenvalues of K_{P¹} and K_{P²} as λ¹_0 and λ²_0, respectively (assuming π¹λ¹_0 > π²λ²_0), and define t = λ_0 − λ_1, the eigen-gap of K_P. If the constant r defined in (3.2) satisfies r < t, and

    ‖π² ∫_{R^d} K(x, y) φ¹_0(y) dP²(y)‖_{L²_P} ≤ ε,    (3.3)

such that ε + r < t, then π¹λ¹_0 is close to K_P's top eigenvalue λ_0,

    |π¹λ¹_0 − λ_0| ≤ ε,

and φ¹_0 is close to K_P's top eigenfunction φ_0 in the L²_P sense,

    ‖φ¹_0 − φ_0‖_{L²_P} ≤ ε/(t − ε).    (3.4)
The proof is trivial, so it is omitted here. Since Theorem 3 yields |π¹λ¹_0 − λ_0| ≤ r and Lemma 1 yields |π¹λ¹_0 − λ_k| ≤ ε for some k, the condition r + ε < t = λ_0 − λ_1 guarantees that φ_0 is the only possible eigenfunction for φ¹_0 to be close to. Therefore, φ¹_0 is approximately the top eigenfunction of K_P.

It is worth noting that the separation conditions in Theorem 3 and Corollary 2 are based mainly on the overlap of the mixture components, not on their shapes or parametric forms. Therefore, clustering methods based on spectral information are able to handle problems more general than traditional mixture models based on a parametric family, such as mixtures of Gaussians or mixtures of exponential families.

3.4. Top spectrum of K_P for mixture distributions. For a mixture distribution with enough separation between its mixing components, we now extend the perturbation results of Corollary 2 to the other top eigenfunctions of K_P. Given the close agreement between (λ_0, φ_0) and (π¹λ¹_0, φ¹_0), we observe that the second top eigenvalue of K_P is approximately max(π¹λ¹_1, π²λ²_0), by investigating the top eigenvalue of the operator defined by the new kernel K_new(x, y) = K(x, y) − λ_0φ_0(x)φ_0(y) and P. Accordingly, one may also derive conditions under which the second eigenfunction of K_P is approximated by φ¹_1 or φ²_0, depending on the magnitudes of π¹λ¹_1 and π²λ²_0. By sequentially applying the same argument, we arrive at the following property.

Property 1 (Mixture property of the top spectrum).
For a convolution operator K_P defined by a positive semi-definite kernel with a fast tail decay and a mixture distribution P = Σ_{g=1}^G π^g P^g with enough separation between its mixing components, the top eigenfunctions of K_P are approximately chosen from the top eigenfunctions (φ^g_i) of the K_{P^g}, i = 0, 1, . . . , n, g = 1, . . . , G. The ordering of the eigenfunctions is determined by the mixture magnitudes π^g λ^g_i.

This property suggests that each of the top eigenfunctions of K_P corresponds to exactly one of the separable mixture components. Therefore, we can approximate the top eigenfunctions of the K_{P^g} through those of K_P when enough separation exists among the mixing components. However, several of the top eigenfunctions of K_P can correspond to the same component, and a fixed number of top eigenfunctions may miss some components entirely, specifically those with small mixing weights π^g or small eigenvalues λ^g.

When there is a large i.i.d. sample from a mixture distribution whose components are well separated, we expect the top eigenvalues and eigenfunctions of K_P to be close to those of the empirical operator K_{P_n}. As discussed in Section 2.2, the eigenvalues of K_{P_n} are the same as those of the kernel matrix K_n, and the eigenfunctions of K_{P_n} coincide with the eigenvectors of K_n at the sampled points. Therefore, assuming a good approximation of K_{P_n} to K_P, the eigenvalues and eigenvectors of K_n give us access to the spectrum of K_P.

This understanding sheds light on the algorithms proposed in Scott and Longuet-Higgins [13] and Perona and Freeman [10], in which the top (several) eigenvectors of K_n are used for clustering. While the top eigenvectors may contain clustering information, smaller or less compact groups may not be identified using only the very top part of the spectrum.
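A quick numerical illustration of this redundancy/missing-cluster phenomenon, with an unbalanced, well-separated mixture (toy parameters ours): the first several eigenvectors of K_n all localize on the large component, while the small component only appears further down the spectrum.

```python
import numpy as np

rng = np.random.default_rng(0)
omega = 0.3
big = rng.normal(0.0, 1.0, 900)      # component with pi = 0.9
small = rng.normal(6.0, 0.3, 100)    # component with pi = 0.1
x = np.concatenate([big, small])
Kn = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * omega ** 2)) / len(x)
_, V = np.linalg.eigh(Kn)            # columns in ascending eigenvalue order

def mass_on_big(j):
    """Squared mass the j-th top eigenvector puts on the large component."""
    return np.sum(V[:900, -j] ** 2)

# the top three eigenvectors all belong to the large component ...
assert all(mass_on_big(j) > 0.9 for j in (1, 2, 3))
# ... while the small component's top eigenvector only shows up further down
assert any(mass_on_big(j) < 0.1 for j in range(4, 9))
```

This matches the ordering by π^g λ^g_i in Property 1: the large component contributes several eigenfunctions before the small component contributes any.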
More eigenvectors need to be investigated to see these clusters. On the other hand, information in the top few eigenvectors may also be redundant for clustering, as some of these eigenvectors may represent the same group.

3.5. A real-data example: a USPS digits dataset. Here we use a high-dimensional U.S. Postal Service (USPS) digit dataset to illustrate the properties of the top spectrum of K_P. The dataset contains normalized handwritten digits, automatically scanned from envelopes by the USPS. The images have been rescaled and size-normalized, resulting in 16 × 16 grayscale images (see Le Cun et al. [5] for details). Each image is treated as a vector x_i in R^256. In this experiment, 658 "3"s, 652 "4"s and 556 "5"s from the training data are pooled together as our sample (size 1866).

Taking the Gaussian kernel with bandwidth ω = 2, we construct the kernel matrix K_n and compute its eigenvectors v_1, v_2, . . . , v_1866. We visualize the digits corresponding to large absolute values of the top eigenvectors. Given an eigenvector v_j, we rank the digits x_i, i = 1, 2, . . . , 1866, according to the absolute value |(v_j)_i|. In each row of Figure 3, we show the 1st, 36th, 71st, . . . , 316th digits according to that order for a fixed eigenvector v_j, j = 1, 2, 3, 15, 16, 17, 48, 49, 50. It turns out that the digits with large absolute values in the top 15 eigenvectors, some shown in Figure 3, all represent the number "4." The 16th eigenvector is the first one representing "3," and the 49th eigenvector is the first one for "5." The plot of the data embedded using the top three eigenvectors, shown in the left panel of Figure 4, suggests no separation of the digits.
These results are strongly consistent with our theoretical findings: a fixed number of the top eigenvectors of K_n may correspond to the same cluster while missing other significant clusters. This leads to the failure of clustering algorithms that use only the top eigenvectors of K_n.

Fig. 3. Digits ranked by the absolute values of the entries of eigenvectors v_1, v_2, . . . , v_50. The digits in each row correspond to the 1st, 36th, 71st, . . . , 316th largest absolute value of the selected eigenvector. Three eigenvectors, v_1, v_16 and v_49, are identified by our DaSpec algorithm.

Fig. 4. Left: scatter plots of digits embedded in the top three eigenvectors; right: digits embedded in the 1st, 16th and 49th eigenvectors.

The k-means algorithm based on the top eigenvectors (normalized as suggested in Scott and Longuet-Higgins [13]) produces accuracies below 80% and reaches its best performance only once the 49th eigenvector is included. Meanwhile, the data embedded in the 1st, 16th and 49th eigenvectors (the right panel of Figure 4) do present the three groups of digits "3," "4" and "5" nearly perfectly. If one can intelligently identify these eigenvectors and cluster the data in the space spanned by them, good performance is expected. In the next section, we utilize our theoretical analysis to construct a clustering algorithm that automatically selects these most informative eigenvectors and groups the data accordingly.

4. A data spectroscopic clustering (DaSpec) algorithm. In this section, we propose a data spectroscopic clustering (DaSpec) algorithm based on our theoretical analyses. We chose the commonly used Gaussian kernel, but it may be replaced by other positive definite radial kernels with a fast tail decay rate.

4.1. Justification and the DaSpec algorithm.
As shown in Property 1 for mixture distributions in Section 3.4, we have access to approximate eigenfunctions of the K_{P^g} through those of K_P when each mixing component has enough separation from the others. We know from Theorem 2 that, among the eigenfunctions of each component operator K_{P^g}, the top one is the only eigenfunction with no sign change. When the spectrum of K_{P^g} is close to that of K_P, we expect that there is exactly one eigenfunction with no sign change beyond a certain small threshold ε. Therefore, the number of separable components of P is indicated by the number of eigenfunctions φ(x) of K_P with no sign change after thresholding.

Meanwhile, the eigenfunctions of each component decay quickly to zero in the tails of that component's distribution if the components are well separated. At a given location x in the high-density area of a particular component, which lies in the tails of the other components, we expect the eigenfunctions from all other components to be close to zero. Among the top eigenfunctions φ^g_0 of the operators K_{P^g} defined on the components P^g, g = 1, . . . , G, the group identity of x then corresponds to the eigenfunction with the largest absolute value |φ^g_0(x)|. Combining this observation with the previous discussion of the approximation of K_n to K_P, we propose the following clustering algorithm.

Data spectroscopic clustering (DaSpec) algorithm.

Input: Data x_1, . . . , x_n ∈ R^d.
Parameters: Gaussian kernel bandwidth ω > 0; thresholds ε_j > 0.
Output: Estimated number of separable components Ĝ and a cluster label L̂(x_i) for each data point x_i, i = 1, . . . , n.

Step 1. Construct the Gaussian kernel matrix K_n,

    (K_n)_ij = (1/n) exp(−‖x_i − x_j‖²/(2ω²)),    i, j = 1, . . . , n,

and compute its eigenvalues λ_1, λ_2, . . . , λ_n and eigenvectors v_1, v_2, . . . , v_n.
Step 2. Estimate the number of clusters:
- Identify all eigenvectors $v_j$ that have no sign changes up to precision $\varepsilon_j$. [We say that a vector $e = (e_1, \dots, e_n)'$ has no sign changes up to $\varepsilon$ if either $e_i > -\varepsilon$ for all $i$ or $e_i < \varepsilon$ for all $i$.]
- Estimate the number of groups by $\hat G$, the number of such eigenvectors.
- Denote these eigenvectors and the corresponding eigenvalues by $v_0^1, v_0^2, \dots, v_0^{\hat G}$ and $\lambda_0^1, \lambda_0^2, \dots, \lambda_0^{\hat G}$, respectively.

Step 3. Assign a cluster label to each data point $x_i$ as
$$ \hat L(x_i) = \arg\max_g \{ |v_{0,i}^g| : g = 1, 2, \dots, \hat G \}. $$

It is obviously important to have data-dependent choices for the parameters of the DaSpec algorithm, $\omega$ and the $\varepsilon_j$'s. We will discuss some heuristics for those choices in the next section. Given a DaSpec clustering result, one important feature of our algorithm is that little adjustment is needed to classify a new data point $x$. Thanks to the connection between an eigenvector $v$ of $K_n$ and the corresponding eigenfunction $\phi$ of the empirical operator $K_{P_n}$, we can compute the eigenfunction $\phi_0^g$ corresponding to $v_0^g$ by
$$ \phi_0^g(x) = \frac{1}{\lambda_0^g} \sum_{i=1}^n K(x, x_i)\, v_{0,i}^g, \qquad x \in \mathbb{R}^d. $$
Therefore, Step 3 of the algorithm can readily be applied to any $x$ by replacing $v_{0,i}^g$ with $\phi_0^g(x)$. The algorithm output can thus serve as a clustering rule that separates not only the data, but also the underlying distribution, which is aligned with the motivation behind our data spectroscopy algorithm: learning properties of a distribution through the empirical spectrum of $K_{P_n}$.

4.2. Data-dependent parameter specification. Following the justification of our DaSpec algorithm, we provide some heuristics for choosing the algorithm parameters in a data-dependent way.

Gaussian kernel bandwidth $\omega$. The bandwidth controls both the eigengaps and the tail decay rates of the eigenfunctions.
When $\omega$ is too large, the tails of the eigenfunctions may not decay fast enough to make condition (3.3) in Corollary 2 hold. However, if $\omega$ is too small, the eigengaps may vanish, in which case each data point will end up as a separate group. Intuitively, we want to select a small $\omega$ that still keeps enough (say, $n \times 5\%$) neighbors for most (95% of) data points within the "range" of the kernel, which we define as a length $l$ such that $P(\|X\| < l) = 95\%$. In the case of a Gaussian kernel in $\mathbb{R}^d$, $l = \omega \sqrt{95\%\ \text{quantile of}\ \chi^2_d}$. Given data $x_1, \dots, x_n$, or their pairwise $L_2$ distances $d(x_i, x_j)$, we can find an $\omega$ that satisfies the above criteria by first calculating
$$ q_i = 5\%\ \text{quantile of}\ \{ d(x_i, x_j),\ j = 1, \dots, n \} $$
for each $i = 1, \dots, n$, then taking
$$ \omega = \frac{95\%\ \text{quantile of}\ \{q_1, \dots, q_n\}}{\sqrt{95\%\ \text{quantile of}\ \chi^2_d}}. \tag{4.1} $$
As shown in the simulation studies in Section 5, this particular choice of $\omega$ works well in the low-dimensional case. For high-dimensional data generated from a lower-dimensional structure, such as an $m$-manifold, the procedure usually leads to an $\omega$ that is too small. We suggest starting with the $\omega$ defined in (4.1) and trying some neighboring values to see if the results improve, perhaps based on some labeled data, expert opinions, data visualization or the trade-off between between-cluster and within-cluster distances.

Threshold $\varepsilon_j$. When identifying the eigenvectors with no sign changes in Step 2, a threshold $\varepsilon_j$ is included to deal with the small perturbations introduced by the other well-separated mixture components. Since $\|v_j\|_2 = 1$ and the elements of the eigenvector decrease quickly (exponentially) from $\max_i(|v_j(x_i)|)$, we suggest thresholding $v_j$ at $\varepsilon_j = \max_i(|v_j(x_i)|)/n$ ($n$ being the sample size) to accommodate the perturbation.
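Putting Steps 1–3 together with the bandwidth rule (4.1) and the threshold $\varepsilon_j = \max_i(|v_j(x_i)|)/n$, the algorithm can be sketched in a few lines of NumPy. This is our own minimal illustration, not a reference implementation: the function names are ours, and we use SciPy's `chi2.ppf` for the $\chi^2_d$ quantile in (4.1).

```python
import numpy as np
from scipy.stats import chi2

def daspec(X, omega=None):
    """Sketch of the DaSpec algorithm (Steps 1-3) with the data-driven
    bandwidth of (4.1) and thresholds eps_j = max_i |v_j(x_i)| / n."""
    n, d = X.shape
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    if omega is None:
        q = np.quantile(D, 0.05, axis=1)                       # q_i per point
        omega = np.quantile(q, 0.95) / np.sqrt(chi2.ppf(0.95, d))   # rule (4.1)
    # Step 1: kernel matrix K_n and its eigendecomposition.
    K = np.exp(-D**2 / (2 * omega**2)) / n
    lam, V = np.linalg.eigh(K)
    lam, V = lam[::-1], V[:, ::-1]                 # sort eigenvalues descending
    # Step 2: keep eigenvectors with no sign change up to precision eps_j.
    picked = [j for j in range(n)
              if (V[:, j] > -np.abs(V[:, j]).max() / n).all()
              or (V[:, j] < np.abs(V[:, j]).max() / n).all()]
    G = len(picked)
    # Step 3: label each point by the selected eigenvector largest in magnitude.
    labels = np.argmax(np.abs(V[:, picked]), axis=1) if G else np.zeros(n, int)
    return G, labels, omega, lam[picked], V[:, picked]

def classify(x, X, lam0, V0, omega):
    """Out-of-sample rule: phi_g0(x) = (1/lam_g0) * sum_i K(x, x_i) v_g0_i."""
    k = np.exp(-np.sum((X - x)**2, axis=1) / (2 * omega**2)) / len(X)
    return int(np.argmax(np.abs((k @ V0) / lam0)))
```

For well-separated groups the no-sign-change test typically returns one eigenvector per separable component; when $\omega$ is chosen too small, many eigenvectors pass the test and $\hat G$ explodes, matching the discussion above.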
We note that the proper selection of the algorithm parameters is critical to the separation of the spectrum and hence to the success of clustering algorithms that hinge on that separation. Although the described heuristics seem to work well for low-dimensional data sets (as we will show in the next section), they are still preliminary and more research is needed, especially for high-dimensional data analysis. We plan to study data-adaptive parameter selection procedures further in the future.

Fig. 5. Clustering results on four simulated data sets described in Section 5.1. First column: scatter plots of data; second column: results of the proposed spectroscopic clustering algorithm; third column: results of the $k$-means algorithm; fourth column: results of the spectral clustering algorithm (Ng, Jordan and Weiss [8]).

5. Simulation studies.

5.1. Gaussian-type components. In this simulation, we examine the effectiveness of the proposed DaSpec algorithm on data sets generated from Gaussian mixtures. Each data set (of size 400) is sampled from a mixture of six bivariate Gaussians, with the size of each group following a multinomial distribution ($n = 400$ and $p_1 = \dots = p_6 = 1/6$). The mean and standard deviation of each Gaussian are randomly drawn from a Uniform on $(-5, 5)$ and a Uniform on $(0, 0.8)$, respectively. Four data sets generated from this distribution are plotted in the left column of Figure 5. It is clear that the groups may be highly unbalanced and may overlap with each other. Therefore, rather than trying to separate all six components, we expect good clustering algorithms to identify groups with reasonable separation between high-density areas. The DaSpec algorithm is applied with the parameters $\omega$ and $\varepsilon_j$ chosen by the procedure described in Section 4.2.
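The simulation design of Section 5.1 is easy to reproduce; the sketch below is our own (function name and seed included), and we draw each mean coordinate independently from Uniform$(-5, 5)$, which is our reading of the description above.

```python
import numpy as np

def sample_mixture(n=400, G=6, seed=0):
    """One data set from the Section 5.1 design: group sizes are
    Multinomial(n, 1/G), and each group is a bivariate Gaussian whose mean
    coordinates are Uniform(-5, 5) and whose sd is Uniform(0, 0.8)."""
    rng = np.random.default_rng(seed)
    sizes = rng.multinomial(n, np.full(G, 1.0 / G))   # multinomial group sizes
    means = rng.uniform(-5.0, 5.0, size=(G, 2))       # random group means
    sds = rng.uniform(0.0, 0.8, size=G)               # random group sds
    return np.vstack([rng.normal(means[g], sds[g], (sizes[g], 2))
                      for g in range(G)])
```

Varying the seed produces data sets like the four in the left column of Figure 5, with unbalanced and possibly overlapping groups.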
Taking the number of groups identified by our DaSpec algorithm, the commonly used $k$-means algorithm and the spectral clustering algorithm proposed in Ng, Jordan and Weiss [8] (using the same $\omega$ as DaSpec) are also tested, to serve as baselines for comparison. As is common practice with the $k$-means algorithm, fifty random initializations are used, and the final result is the one that minimizes the optimization criterion $\sum_{i=1}^n (x_i - y_{k(i)})^2$, where $x_i$ is assigned to group $k(i)$ and $y_k = \sum_{i=1}^n x_i I(k(i) = k) / \sum_{i=1}^n I(k(i) = k)$.

As shown in the second column of Figure 5, the proposed DaSpec algorithm (with data-dependent parameter choices) identifies the number of separable groups, isolates potential outliers and groups the data accordingly. The results are similar to the $k$-means results (the third column) when the groups are balanced and their shapes are close to round. In these cases, the $k$-means algorithm is expected to work well, given that the data in each group are well represented by their average. The last column shows the results of Ng et al.'s spectral clustering algorithm, which sometimes (see the first row) assigns data to one group even when they are actually far away.

In summary, for this simulated example, we find that the proposed DaSpec algorithm, with data-adaptively chosen parameters, identifies the number of separable groups reasonably well and produces good clustering results when the separations are large enough. It is also interesting to note that the algorithm isolates possible "outliers" into a separate group so that they do not affect the clustering results on the majority of the data. The proposed algorithm competes well against the commonly used $k$-means and spectral clustering algorithms.

5.2. Beyond Gaussian components.
We now compare the performance of the aforementioned clustering algorithms on data sets that contain non-Gaussian groups, various levels of noise and possible outliers. Data set $D_1$ contains three well-separable groups and an outlier in $\mathbb{R}^2$. The first group is generated by adding independent Gaussian noise $N((0,0)^T, 0.15^2 I_{2\times 2})$ to 200 uniform samples from three fourths of a ring with radius 3, which is the same distribution as that plotted in the right panels of Figure 8. The second group includes 100 data points sampled from a bivariate Gaussian $N((3,-3)^T, 0.5^2 I_{2\times 2})$, and the last group has only 5 data points sampled from a bivariate Gaussian $N((0,0)^T, 0.3^2 I_{2\times 2})$. Finally, one outlier is located at $(5,5)^T$. Given $D_1$, three more data sets ($D_2$, $D_3$ and $D_4$) are created by gradually adding independent Gaussian noise (with standard deviations 0.3, 0.6 and 0.9, respectively). The scatter plots of the four data sets are shown in the left column of Figure 6. It is clear that the degree of separation decreases from top to bottom.

Fig. 6. Clustering results on four simulated data sets described in Section 5.2. First column: scatter plots of data; second column: labels of the $G$ identified groups by the proposed spectroscopic clustering algorithm; third and fourth columns: $k$-means algorithm assuming $G-1$ and $G$ groups, respectively; fifth and sixth columns: spectral clustering algorithm (Ng, Jordan and Weiss [8]) assuming $G-1$ and $G$ groups, respectively.

Similarly to the previous simulation, we examine the DaSpec algorithm with data-driven parameters, the $k$-means algorithm and Ng et al.'s spectral clustering algorithm on these data sets.
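Data set $D_1$ as just described can be generated as follows; this is our own sketch, and the uniform-angle parameterization (and which three fourths of the ring is used) is our assumption.

```python
import numpy as np

def make_D1(seed=0):
    """A sketch of data set D1 from Section 5.2: 200 noisy samples from
    three fourths of a radius-3 ring, 100 points from N((3,-3)', 0.5^2 I),
    5 points from N((0,0)', 0.3^2 I) and one outlier at (5,5)'."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(0.0, 1.5 * np.pi, 200)        # angles on 3/4 of the ring
    ring = 3.0 * np.c_[np.cos(t), np.sin(t)] + rng.normal(0.0, 0.15, (200, 2))
    g2 = rng.normal([3.0, -3.0], 0.5, (100, 2))   # second group
    g3 = rng.normal([0.0, 0.0], 0.3, (5, 2))      # small third group
    return np.vstack([ring, g2, g3, [[5.0, 5.0]]])  # outlier appended last
```

Adding independent $N(0, s^2 I)$ noise with $s = 0.3, 0.6, 0.9$ to the output yields analogues of $D_2$, $D_3$ and $D_4$.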
The latter two algorithms are tested under two different assumptions on the number of groups: the number $G$ identified by the DaSpec algorithm, or one group fewer ($G-1$). Note that the DaSpec algorithm claims only one group for $D_4$, so the other two algorithms are not applied to $D_4$. The DaSpec algorithm (the second column of Figure 6) produces a reasonable number of groups and reasonable clustering results. For the perfectly separable case $D_1$, three groups are identified and the single outlier is isolated. It is worth noting that the incomplete ring is separated from the other groups, which is not a simple task for algorithms based on group centroids. We also see that the DaSpec algorithm starts to combine inseparable groups as the components become less separable.

Not surprisingly, the $k$-means algorithms (the third and fourth columns) do not perform well because of the presence of the non-Gaussian component, unbalanced groups and outliers. Given enough separation, the spectral clustering algorithm reports reasonable results (the fifth and sixth columns). However, it is sensitive to outliers and to the specification of the number of groups.

6. Conclusions and discussion. Motivated by recent developments in kernel and spectral methods, we study the connection between a probability distribution and the associated convolution operator. For a convolution operator defined by a radial kernel with a fast tail decay, we show that each top eigenfunction of the convolution operator defined by a mixture distribution is approximated by one of the top eigenfunctions of the operator corresponding to a mixture component. The separation condition is mainly based on the overlap between high-density components, instead of their explicit parametric forms, and thus is quite general.
These theoretical results explain why the top eigenvectors of a kernel matrix may reveal the clustering information but do not always do so. More importantly, our results reveal that not every component will contribute to the top few eigenfunctions of the convolution operator $K_P$, because the size and configuration of a component determine the corresponding eigenvalues. Hence the top eigenvectors of the kernel matrix may or may not preserve all clustering information, which explains some empirical observations about certain spectral clustering methods.

Following our theoretical analyses, we propose the data spectroscopic clustering algorithm based on finding eigenvectors with no sign change. Compared to the commonly used $k$-means and spectral clustering algorithms, DaSpec is simple to implement and provides a natural estimator of the number of separable components. We found that DaSpec handles unbalanced groups and outliers better than the competing algorithms. Importantly, unlike $k$-means and certain spectral clustering algorithms, DaSpec does not require random initialization, which is a potentially significant advantage in practice. Simulations show favorable results compared to the $k$-means and spectral clustering algorithms. For practical applications, we also provide some guidelines for choosing the algorithm parameters.

Our analyses and discussions of connections to other spectral or kernel methods shed light on why radial kernels, such as Gaussian kernels, perform well in many classification and clustering algorithms. We expect that this line of investigation would also prove fruitful in understanding other kernel algorithms, such as Support Vector Machines.

APPENDIX A

Here we provide three concrete examples to illustrate the properties of the eigenfunctions of $K_P$ shown in Section 3.1.

Example 1 (Gaussian kernel, Gaussian density).
Let us start with the univariate Gaussian case, where the distribution $P \sim N(\mu, \sigma^2)$ and the kernel function is also Gaussian. Shi, Belkin and Yu [15] provided the eigenvalues and eigenfunctions of $K_P$; the result is a slightly refined version of a result in Zhu et al. [22].

Proposition 1. For $P \sim N(\mu, \sigma^2)$ and a Gaussian kernel $K(x, y) = e^{-(x-y)^2/(2\omega^2)}$, let $\beta = 2\sigma^2/\omega^2$ and let $H_i(x)$ be the $i$th-order Hermite polynomial. Then the eigenvalues and eigenfunctions of $K_P$, for $i = 0, 1, \dots$, are given by
$$ \lambda_i = \sqrt{\frac{2}{1+\beta+\sqrt{1+2\beta}}} \left( \frac{\beta}{1+\beta+\sqrt{1+2\beta}} \right)^i, $$
$$ \phi_i(x) = \frac{(1+2\beta)^{1/8}}{\sqrt{2^i i!}} \exp\!\left( -\frac{(x-\mu)^2}{2\sigma^2}\,\frac{\sqrt{1+2\beta}-1}{2} \right) H_i\!\left( \left( \frac{1}{4} + \frac{\beta}{2} \right)^{1/4} \frac{x-\mu}{\sigma} \right). $$

Clearly from the explicit expression, and as expected from Theorem 2, $\phi_0$ is the only positive eigenfunction of $K_P$. We note that each eigenfunction $\phi_i$ decays quickly (as it is a Gaussian multiplied by a polynomial) away from the mean $\mu$ of the probability distribution. We also see that the eigenvalues of $K_P$ decay exponentially, with the rate dependent on the bandwidth $\omega$ of the Gaussian kernel and the variance $\sigma^2$ of the probability distribution. These observations can easily be generalized to the multivariate case; see Shi, Belkin and Yu [15].

Example 2 (Exponential kernel, uniform distribution on an interval). To give another concrete example, consider the exponential kernel $K(x, y) = \exp(-|x-y|/\omega)$ for the uniform distribution on the interval $[-1, 1] \subset \mathbb{R}$. In Diaconis, Goel and Holmes [3] it was shown that the eigenfunctions of this kernel can be written as $\cos(bx)$ or $\sin(bx)$ inside the interval $[-1, 1]$, for appropriately chosen values of $b$, and decay exponentially away from it.
The top eigenfunction can be written explicitly as
$$ \phi(x) = \frac{1}{\lambda} \int_{[-1,1]} e^{-|x-y|/\omega} \cos(by)\,dy \qquad \forall x \in \mathbb{R}, $$
where $\lambda$ is the corresponding eigenvalue. Figure 7 illustrates this behavior for $\omega = 0.5$.

Fig. 7. Top two eigenfunctions of the exponential kernel with bandwidth $\omega = 0.5$ and the uniform distribution on $[-1, 1]$.

Example 3 (A curve in $\mathbb{R}^d$). We now give a brief informal discussion of the important case when our probability distribution is concentrated on or around a low-dimensional submanifold of a (potentially high-dimensional) ambient space. The simplest example of this setting is a Gaussian distribution, which can be viewed as a zero-dimensional manifold (the mean of the distribution) plus noise. A more interesting example of a manifold is a curve in $\mathbb{R}^d$. We observe that such data are generated by any time-dependent smooth deterministic process whose parameters depend continuously on time $t$. Let $\psi(t): [0, 1] \to \mathbb{R}^d$ be such a curve. Consider the restriction of the kernel $K_P$ to $\psi$. Let $x, y \in \psi$ and let $d(x, y)$ be the geodesic distance along the curve. It can be shown that $d(x, y) = \|x - y\| + O(\|x - y\|^3)$ when $x, y$ are close, with the remainder term depending on how the curve is embedded in $\mathbb{R}^d$. Therefore, we see that if the kernel $K_P$ is a sufficiently local radial basis kernel, the restriction of $K_P$ to $\psi$ is a perturbation of $K_P$ in the one-dimensional case. For the exponential kernel, the one-dimensional kernel can be written explicitly (see Example 2), and we have an approximation to the kernel on the manifold, with decay off the manifold (assuming that the kernel is a decreasing function of the distance). For the Gaussian kernel a similar extension holds, although no explicit formula can easily be obtained.
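Returning to Example 1, the closed-form spectrum of Proposition 1 is easy to check numerically: the eigenvalues of the empirical matrix $K_n$ from Step 1 approximate the $\lambda_i$ of $K_P$. The sanity check below is our own (the sample size and seed are arbitrary), not part of the paper.

```python
import numpy as np

# Compare the empirical spectrum of K_n with the lambda_i of Proposition 1
# for P ~ N(mu, sigma^2) and a Gaussian kernel with bandwidth omega.
rng = np.random.default_rng(0)
n, mu, sigma, omega = 2000, 0.0, 1.0, 1.0
x = rng.normal(mu, sigma, n)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * omega ** 2)) / n
emp = np.sort(np.linalg.eigvalsh(K))[::-1]       # empirical spectrum, descending

beta = 2 * sigma ** 2 / omega ** 2
denom = 1 + beta + np.sqrt(1 + 2 * beta)
theory = np.sqrt(2 / denom) * (beta / denom) ** np.arange(5)  # Proposition 1
```

For these parameters `emp[:5]` and `theory` agree to within sampling error, and both exhibit the geometric decay with ratio $\beta/(1+\beta+\sqrt{1+2\beta})$ noted above.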
The behaviors of the top eigenfunctions of the Gaussian and exponential kernels, respectively, are demonstrated in Figure 8. The exponential kernel corresponds to the bottom left panel. The behavior of the eigenfunction is generally consistent with the top eigenfunction of the exponential kernel on $[-1, 1]$ shown in Figure 7. The Gaussian kernel (top left panel) behaves similarly but produces level lines more consistent with the data distribution, which may be preferable in practice. Finally, we observe that the addition of small noise (top right and bottom right panels) does not significantly change the eigenfunctions.

Fig. 8. Contours of the top eigenfunction of $K_P$ for Gaussian (upper panels) and exponential kernels (lower panels) with bandwidth 0.7. The curve is 3/4 of a ring with radius 3, with independent noise of standard deviation 0.15 added in the right panels.

APPENDIX B

Proof of Theorem 2. For a semi-positive definite kernel $K(x, y)$ with full support on $\mathbb{R}^d$, we first show that the top eigenfunction $\phi_0$ of $K_P$ has no sign change on the support of the distribution. We define $R_+ = \{x \in \mathbb{R}^d : \phi_0(x) > 0\}$, $R_- = \{x \in \mathbb{R}^d : \phi_0(x) < 0\}$ and $\bar\phi_0(x) = |\phi_0(x)|$. It is clear that $\int \bar\phi_0^2\,dP = \int \phi_0^2\,dP = 1$. Assuming that $P(R_+) > 0$ and $P(R_-) > 0$, we will show that
$$ \int\!\!\int K(x,y)\,\bar\phi_0(x)\bar\phi_0(y)\,dP(x)\,dP(y) > \int\!\!\int K(x,y)\,\phi_0(x)\phi_0(y)\,dP(x)\,dP(y), $$
which contradicts the assumption that $\phi_0(\cdot)$ is the eigenfunction associated with the largest eigenvalue. Denoting $g(x, y) = K(x, y)\phi_0(x)\phi_0(y)$ and $\bar g(x, y) = K(x, y)\bar\phi_0(x)\bar\phi_0(y)$, we have
$$ \int_{R_+}\!\int_{R_+} \bar g(x,y)\,dP(x)\,dP(y) = \int_{R_+}\!\int_{R_+} g(x,y)\,dP(x)\,dP(y), $$
and the same equation holds on the region $R_- \times R_-$.
However, over the region $\{(x, y) : x \in R_+ \text{ and } y \in R_-\}$, we have
$$ \int_{R_+}\!\int_{R_-} \bar g(x,y)\,dP(x)\,dP(y) > \int_{R_+}\!\int_{R_-} g(x,y)\,dP(x)\,dP(y), $$
since $K(x, y) > 0$, $\phi_0(x) > 0$ and $\phi_0(y) < 0$. The same inequality holds on $\{(x, y) : x \in R_- \text{ and } y \in R_+\}$. Putting the four integration regions together, we arrive at the contradiction. Therefore, the assumptions $P(R_+) > 0$ and $P(R_-) > 0$ cannot both be true, which implies that $\phi_0(\cdot)$ has no sign change on the support of the distribution.

Now consider any $x \in \mathbb{R}^d$. We have
$$ \lambda_0 \phi_0(x) = \int K(x,y)\,\phi_0(y)\,dP(y). $$
Given the facts that $\lambda_0 > 0$, $K(x, y) > 0$ and $\phi_0(y)$ has the same sign on the support, it is straightforward to see that $\phi_0(x)$ has no sign change and has full support in $\mathbb{R}^d$.

Finally, the isolation of $(\lambda_0, \phi_0)$ follows. If there existed another eigenfunction $\phi$ sharing the same eigenvalue $\lambda_0$ with $\phi_0$, both would have no sign change and full support on $\mathbb{R}^d$. Therefore $\int \phi_0(x)\phi(x)\,dP(x) > 0$, which contradicts the orthogonality of eigenfunctions.

Proof of Theorem 3. By definition, the top eigenvalue of $K_P$ satisfies
$$ \lambda_0 = \max_f \frac{\int\!\!\int K(x,y) f(x) f(y)\,dP(x)\,dP(y)}{\int [f(x)]^2\,dP(x)}. $$
For any function $f$,
$$ \begin{aligned} \int\!\!\int K(x,y) f(x) f(y)\,dP(x)\,dP(y) &= \pi_1^2 \int\!\!\int K(x,y) f(x) f(y)\,dP_1(x)\,dP_1(y) \\ &\quad + \pi_2^2 \int\!\!\int K(x,y) f(x) f(y)\,dP_2(x)\,dP_2(y) \\ &\quad + 2\pi_1\pi_2 \int\!\!\int K(x,y) f(x) f(y)\,dP_1(x)\,dP_2(y) \\ &\le \pi_1^2 \lambda_0^1 \int [f(x)]^2\,dP_1(x) + \pi_2^2 \lambda_0^2 \int [f(x)]^2\,dP_2(x) \\ &\quad + 2\pi_1\pi_2 \int\!\!\int K(x,y) f(x) f(y)\,dP_1(x)\,dP_2(y). \end{aligned} $$
Now we concentrate on the last term. By the Cauchy–Schwarz inequality,
$$ 2\pi_1\pi_2 \int\!\!\int K(x,y) f(x) f(y)\,dP_1(x)\,dP_2(y) \le 2\pi_1\pi_2 \sqrt{\int\!\!\int [K(x,y)]^2\,dP_1(x)\,dP_2(y)}\, \sqrt{\int\!\!\int [f(x)]^2 [f(y)]^2\,dP_1(x)\,dP_2(y)} $$
$$ \begin{aligned} &= 2\sqrt{\pi_1\pi_2 \int\!\!\int [K(x,y)]^2\,dP_1(x)\,dP_2(y)}\; \sqrt{\pi_1 \int [f(x)]^2\,dP_1(x)}\, \sqrt{\pi_2 \int [f(y)]^2\,dP_2(y)} \\ &\le \sqrt{\pi_1\pi_2 \int\!\!\int [K(x,y)]^2\,dP_1(x)\,dP_2(y)} \left( \pi_1 \int [f(x)]^2\,dP_1(x) + \pi_2 \int [f(x)]^2\,dP_2(x) \right) \\ &= r \int [f(x)]^2\,dP(x), \end{aligned} $$
where $r = (\pi_1\pi_2 \int\!\!\int [K(x,y)]^2\,dP_1(x)\,dP_2(y))^{1/2}$. Thus,
$$ \begin{aligned} \lambda_0 &= \max_{f: \int f^2\,dP = 1} \int\!\!\int K(x,y) f(x) f(y)\,dP(x)\,dP(y) \\ &\le \max_{f: \int f^2\,dP = 1} \left( \pi_1 \lambda_0^1 \int [f(x)]^2\,\pi_1\,dP_1(x) + \pi_2 \lambda_0^2 \int [f(x)]^2\,\pi_2\,dP_2(x) \right) + r \\ &\le \max(\pi_1 \lambda_0^1,\ \pi_2 \lambda_0^2) + r. \end{aligned} $$

The other side of the inequality is easier to prove. Assuming $\pi_1 \lambda_0^1 > \pi_2 \lambda_0^2$ and taking the top eigenfunction $\phi_0^1$ of $K_{P_1}$ as $f$, we derive the following by using the same decomposition on $\int\!\!\int K(x,y)\,\phi_0^1(x)\phi_0^1(y)\,dP(x)\,dP(y)$ and the facts that $\int K(x,y)\,\phi_0^1(x)\,dP_1(x) = \lambda_0^1 \phi_0^1(y)$ and $\int [\phi_0^1]^2\,dP_1 = 1$. Denoting $h(x, y) = K(x, y)\phi_0^1(x)\phi_0^1(y)$, we have
$$ \begin{aligned} \lambda_0 &\ge \frac{\int\!\!\int K(x,y)\,\phi_0^1(x)\phi_0^1(y)\,dP(x)\,dP(y)}{\int [\phi_0^1(x)]^2\,dP(x)} \\ &= \frac{\pi_1^2 \lambda_0^1 + \pi_2^2 \int\!\!\int h(x,y)\,dP_2(x)\,dP_2(y) + 2\pi_1\pi_2 \lambda_0^1 \int [\phi_0^1(x)]^2\,dP_2(x)}{\pi_1 + \pi_2 \int [\phi_0^1(x)]^2\,dP_2(x)} \\ &= \pi_1 \lambda_0^1\, \frac{\pi_1 + 2\pi_2 \int [\phi_0^1(x)]^2\,dP_2(x)}{\pi_1 + \pi_2 \int [\phi_0^1(x)]^2\,dP_2(x)} + \frac{\pi_2^2 \int\!\!\int h(x,y)\,dP_2(x)\,dP_2(y)}{\pi_1 + \pi_2 \int [\phi_0^1(x)]^2\,dP_2(x)} \\ &\ge \pi_1 \lambda_0^1. \end{aligned} $$
This completes the proof.

Acknowledgment. The authors would like to thank Yoonkyung Lee, Prem Goel, Joseph Verducci and Donghui Yan for helpful discussions, suggestions and comments.

REFERENCES

[1] Belkin, M. and Niyogi, P. (2003). Using manifold structure for partially labeled classification. In Advances in Neural Information Processing Systems (S. Becker, S. Thrun and K. Obermayer, eds.) 15 953–960. MIT Press, Cambridge, MA.
[2] Dhillon, I., Guan, Y. and Kulis, B. (2005). A unified view of kernel k-means, spectral clustering, and graph partitioning. Technical Report UTCS TR-04-25, Univ. Texas, Austin.
[3] Diaconis, P., Goel, S. and Holmes, S. (2008). Horseshoes in multidimensional scaling and kernel methods. Ann. Appl. Stat. 2 777–807.
[4] Koltchinskii, V. and Giné, E. (2000). Random matrix approximation of spectra of integral operators. Bernoulli 6 113–167. MR1781185
[5] Le Cun, Y., Boser, B., Denker, J., Henderson, D., Howard, R., Hubbard, W. and Jackel, L. (1990). Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems (D. Touretzky, ed.) 2. Morgan Kaufmann, Denver, CO.
[6] Malik, J., Belongie, S., Leung, T. and Shi, J. (2001). Contour and texture analysis for image segmentation. International Journal of Computer Vision 43 7–27.
[7] Nadler, B. and Galun, M. (2007). Fundamental limitations of spectral clustering. In Advances in Neural Information Processing Systems (B. Schölkopf, J. Platt and T. Hoffman, eds.) 19 1017–1024. MIT Press, Cambridge, MA.
[8] Ng, A., Jordan, M. and Weiss, Y. (2002). On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems (T. Dietterich, S. Becker and Z. Ghahramani, eds.) 14 955–962. MIT Press, Cambridge, MA.
[9] Parlett, B. N. (1980). The Symmetric Eigenvalue Problem. Prentice Hall, Englewood Cliffs, NJ. MR0570116
[10] Perona, P. and Freeman, W. T. (1998). A factorization approach to grouping. In Proceedings of the 5th European Conference on Computer Vision 655–670. Springer, London.
[11] Schölkopf, B. and Smola, A. (2002). Learning with Kernels. MIT Press, Cambridge, MA.
[12] Schölkopf, B., Smola, A. and Müller, K. R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput. 10 1299–1319.
[13] Scott, G. and Longuet-Higgins, H. (1990). Feature grouping by relocalisation of eigenvectors of the proximity matrix. In Proceedings of the British Machine Vision Conference 103–108. Oxford, UK.
[14] Shi, J. and Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22 888–905.
[15] Shi, T., Belkin, M. and Yu, B. (2008). Data spectroscopy: Learning mixture models using eigenspaces of convolution operators. In Proceedings of the 25th Annual International Conference on Machine Learning (ICML 2008) (A. McCallum and S. Roweis, eds.) 936–943. Omnipress, Madison, WI.
[16] Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer, New York. MR1367965
[17] Verma, D. and Meila, M. (2001). A comparison of spectral clustering algorithms. Technical report, Univ. Washington Computer Science and Engineering.
[18] von Luxburg, U. (2007). A tutorial on spectral clustering. Stat. Comput. 17 395–416. MR2409803
[19] von Luxburg, U., Belkin, M. and Bousquet, O. (2008). Consistency of spectral clustering. Ann. Statist. 36 555–586. MR2396807
[20] Weiss, Y. (1999). Segmentation using eigenvectors: A unifying view. In Proceedings of the Seventh IEEE International Conference on Computer Vision 975–982. IEEE, Los Alamitos, CA.
[21] Williams, C. K. and Seeger, M. (2000). The effect of the input density distribution on kernel-based classifiers. In Proceedings of the 17th International Conference on Machine Learning (P. Langley, ed.) 1159–1166. Morgan Kaufmann, San Francisco, CA.
[22] Zhu, H., Williams, C., Rohwer, R. and Morciniec, M. (1998). Gaussian regression and optimal finite-dimensional linear models. In Neural Networks and Machine Learning (C. Bishop, ed.) 167–184. Springer, Berlin.

T. Shi
Department of Statistics
Ohio State University
1958 Neil Avenue, Cockins Hall 404
Columbus, Ohio 43210-1247
USA
E-mail: taoshi@stat.osu.edu

M. Belkin
Department of Computer Science and Engineering
Ohio State University
2015 Neil Avenue, Dreese Labs 597
Columbus, Ohio 43210-1277
USA
E-mail: mbelkin@cse.osu.edu

B. Yu
Department of Statistics
University of California, Berkeley
367 Evans Hall
Berkeley, California 94720-3860
USA
E-mail: binyu@stat.berkeley.edu