A Tutorial on Spectral Clustering
Ulrike von Luxburg
Max Planck Institute for Biological Cybernetics
Spemannstr. 38, 72076 Tübingen, Germany
ulrike.luxburg@tuebingen.mpg.de

This article appears in Statistics and Computing, 17 (4), 2007. The original publication is available at www.springer.com.

Abstract

In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. At first glance spectral clustering appears slightly mysterious, and it is not obvious why it works at all and what it really does. The goal of this tutorial is to give some intuition on those questions. We describe different graph Laplacians and their basic properties, present the most common spectral clustering algorithms, and derive those algorithms from scratch by several different approaches. Advantages and disadvantages of the different spectral clustering algorithms are discussed.

Keywords: spectral clustering; graph Laplacian

1 Introduction

Clustering is one of the most widely used techniques for exploratory data analysis, with applications ranging from statistics, computer science, and biology to the social sciences and psychology. In virtually every scientific field dealing with empirical data, people attempt to get a first impression of their data by trying to identify groups of "similar behavior". In this article we would like to introduce the reader to the family of spectral clustering algorithms. Compared to "traditional algorithms" such as k-means or single linkage, spectral clustering has many fundamental advantages.
Results obtained by spectral clustering often outperform the traditional approaches; moreover, spectral clustering is very simple to implement and can be solved efficiently by standard linear algebra methods.

This tutorial is set up as a self-contained introduction to spectral clustering. We derive spectral clustering from scratch and present different points of view on why spectral clustering works. Apart from basic linear algebra, no particular mathematical background is required of the reader. However, we do not attempt to give a concise review of the whole literature on spectral clustering, which is impossible due to the overwhelming amount of literature on this subject. The first two sections are devoted to a step-by-step introduction to the mathematical objects used by spectral clustering: similarity graphs in Section 2, and graph Laplacians in Section 3. The spectral clustering algorithms themselves are presented in Section 4. The next three sections are then devoted to explaining why those algorithms work. Each section corresponds to one explanation: Section 5 describes a graph partitioning approach, Section 6 a random walk perspective, and Section 7 a perturbation theory approach. In Section 8 we study some practical issues related to spectral clustering, and we discuss various extensions and literature related to spectral clustering in Section 9.

2 Similarity graphs

Given a set of data points $x_1, \ldots, x_n$ and some notion of similarity $s_{ij} \ge 0$ between all pairs of data points $x_i$ and $x_j$, the intuitive goal of clustering is to divide the data points into several groups such that points in the same group are similar and points in different groups are dissimilar to each other. If we do not have more information than the similarities between data points, a nice way of representing the data is in the form of the similarity graph $G = (V, E)$.
Each vertex $v_i$ in this graph represents a data point $x_i$. Two vertices are connected if the similarity $s_{ij}$ between the corresponding data points $x_i$ and $x_j$ is positive or larger than a certain threshold, and the edge is weighted by $s_{ij}$. The problem of clustering can now be reformulated using the similarity graph: we want to find a partition of the graph such that the edges between different groups have very low weights (which means that points in different clusters are dissimilar from each other) and the edges within a group have high weights (which means that points within the same cluster are similar to each other). To be able to formalize this intuition we first want to introduce some basic graph notation and briefly discuss the kind of graphs we are going to study.

2.1 Graph notation

Let $G = (V, E)$ be an undirected graph with vertex set $V = \{v_1, \ldots, v_n\}$. In the following we assume that the graph $G$ is weighted, that is, each edge between two vertices $v_i$ and $v_j$ carries a non-negative weight $w_{ij} \ge 0$. The weighted adjacency matrix of the graph is the matrix $W = (w_{ij})_{i,j=1,\ldots,n}$. If $w_{ij} = 0$, the vertices $v_i$ and $v_j$ are not connected by an edge. As $G$ is undirected we require $w_{ij} = w_{ji}$. The degree of a vertex $v_i \in V$ is defined as

    $d_i = \sum_{j=1}^{n} w_{ij}.$

Note that, in fact, this sum only runs over the vertices adjacent to $v_i$, as for all other vertices $v_j$ the weight $w_{ij}$ is 0. The degree matrix $D$ is defined as the diagonal matrix with the degrees $d_1, \ldots, d_n$ on the diagonal. Given a subset of vertices $A \subset V$, we denote its complement $V \setminus A$ by $\bar{A}$. We define the indicator vector $\mathbb{1}_A = (f_1, \ldots, f_n)' \in \mathbb{R}^n$ as the vector with entries $f_i = 1$ if $v_i \in A$ and $f_i = 0$ otherwise. For convenience we introduce the shorthand notation $i \in A$ for the set of indices $\{i \mid v_i \in A\}$, in particular when dealing with a sum like $\sum_{i \in A} w_{ij}$.
For two not necessarily disjoint sets $A, B \subset V$ we define

    $W(A, B) := \sum_{i \in A, j \in B} w_{ij}.$

We consider two different ways of measuring the "size" of a subset $A \subset V$:

    $|A|$ := the number of vertices in $A$
    $\mathrm{vol}(A) := \sum_{i \in A} d_i.$

Intuitively, $|A|$ measures the size of $A$ by its number of vertices, while $\mathrm{vol}(A)$ measures the size of $A$ by summing over the weights of all edges attached to vertices in $A$. A subset $A \subset V$ of a graph is connected if any two vertices in $A$ can be joined by a path such that all intermediate points also lie in $A$. A subset $A$ is called a connected component if it is connected and if there are no connections between vertices in $A$ and $\bar{A}$. The nonempty sets $A_1, \ldots, A_k$ form a partition of the graph if $A_i \cap A_j = \emptyset$ and $A_1 \cup \ldots \cup A_k = V$.

2.2 Different similarity graphs

There are several popular constructions to transform a given set $x_1, \ldots, x_n$ of data points with pairwise similarities $s_{ij}$ or pairwise distances $d_{ij}$ into a graph. When constructing similarity graphs the goal is to model the local neighborhood relationships between the data points.

The ε-neighborhood graph: Here we connect all points whose pairwise distances are smaller than ε. As the distances between all connected points are roughly of the same scale (at most ε), weighting the edges would not incorporate more information about the data into the graph. Hence, the ε-neighborhood graph is usually considered an unweighted graph.

k-nearest neighbor graphs: Here the goal is to connect vertex $v_i$ with vertex $v_j$ if $v_j$ is among the k nearest neighbors of $v_i$. However, this definition leads to a directed graph, as the neighborhood relationship is not symmetric. There are two ways of making this graph undirected.
The first way is to simply ignore the directions of the edges, that is, we connect $v_i$ and $v_j$ with an undirected edge if $v_i$ is among the k nearest neighbors of $v_j$ or if $v_j$ is among the k nearest neighbors of $v_i$. The resulting graph is what is usually called the k-nearest neighbor graph. The second choice is to connect vertices $v_i$ and $v_j$ if both $v_i$ is among the k nearest neighbors of $v_j$ and $v_j$ is among the k nearest neighbors of $v_i$. The resulting graph is called the mutual k-nearest neighbor graph. In both cases, after connecting the appropriate vertices we weight the edges by the similarity of their endpoints.

The fully connected graph: Here we simply connect all points with positive similarity with each other, and we weight all edges by $s_{ij}$. As the graph should represent the local neighborhood relationships, this construction is only useful if the similarity function itself models local neighborhoods. An example of such a similarity function is the Gaussian similarity function $s(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / (2\sigma^2))$, where the parameter σ controls the width of the neighborhoods. This parameter plays a similar role as the parameter ε in the case of the ε-neighborhood graph.

All graphs mentioned above are regularly used in spectral clustering. To our knowledge, theoretical results on how the choice of the similarity graph influences the spectral clustering result do not exist. For a discussion of the behavior of the different graphs we refer to Section 8.

3 Graph Laplacians and their basic properties

The main tools for spectral clustering are graph Laplacian matrices. There exists a whole field dedicated to the study of those matrices, called spectral graph theory (e.g., see Chung, 1997). In this section we want to define different graph Laplacians and point out their most important properties.
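Before turning to Laplacians, the graph constructions of Section 2.2 can be made concrete. The following is a minimal numpy sketch (not part of the original tutorial); the function names are my own, and Euclidean distances on vector data are assumed:

```python
import numpy as np

def gaussian_similarity(X, sigma=1.0):
    """Full similarity matrix s_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def eps_graph(X, eps):
    """Unweighted epsilon-neighborhood graph: connect points closer than eps."""
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    W = (d < eps).astype(float)
    np.fill_diagonal(W, 0.0)   # no self-edges
    return W

def knn_graph(X, k, mutual=False, sigma=1.0):
    """(Mutual) k-nearest-neighbor graph, weighted by Gaussian similarity."""
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(d, np.inf)            # a point is not its own neighbor
    nn = np.argsort(d, axis=1)[:, :k]      # indices of the k nearest neighbors
    A = np.zeros((len(X), len(X)))
    A[np.repeat(np.arange(len(X)), k), nn.ravel()] = 1.0
    # "or" symmetrization gives the kNN graph, "and" the mutual kNN graph
    A = np.minimum(A, A.T) if mutual else np.maximum(A, A.T)
    return A * gaussian_similarity(X, sigma)
```

All three constructions return a symmetric weighted adjacency matrix $W$, which is the input to everything that follows.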
We will carefully distinguish between different variants of graph Laplacians. Note that in the literature there is no unique convention as to which matrix exactly is called the "graph Laplacian". Usually, every author just calls "his" matrix the graph Laplacian. Hence, a lot of care is needed when reading the literature on graph Laplacians.

In the following we always assume that $G$ is an undirected, weighted graph with weight matrix $W$, where $w_{ij} = w_{ji} \ge 0$. When using eigenvectors of a matrix, we will not necessarily assume that they are normalized. For example, the constant vector $\mathbb{1}$ and a multiple $a\mathbb{1}$ for some $a \ne 0$ will be considered as the same eigenvector. Eigenvalues will always be ordered increasingly, respecting multiplicities. By "the first k eigenvectors" we refer to the eigenvectors corresponding to the k smallest eigenvalues.

3.1 The unnormalized graph Laplacian

The unnormalized graph Laplacian matrix is defined as

    $L = D - W.$

An overview of many of its properties can be found in Mohar (1991, 1997). The following proposition summarizes the most important facts needed for spectral clustering.

Proposition 1 (Properties of L) The matrix $L$ satisfies the following properties:

1. For every vector $f \in \mathbb{R}^n$ we have
    $f' L f = \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} (f_i - f_j)^2.$
2. $L$ is symmetric and positive semi-definite.
3. The smallest eigenvalue of $L$ is 0, the corresponding eigenvector is the constant one vector $\mathbb{1}$.
4. $L$ has $n$ non-negative, real-valued eigenvalues $0 = \lambda_1 \le \lambda_2 \le \ldots \le \lambda_n$.

Proof. Part (1): By the definition of $d_i$,

    $f' L f = f' D f - f' W f = \sum_{i=1}^{n} d_i f_i^2 - \sum_{i,j=1}^{n} f_i f_j w_{ij}$
    $= \frac{1}{2} \left( \sum_{i=1}^{n} d_i f_i^2 - 2 \sum_{i,j=1}^{n} f_i f_j w_{ij} + \sum_{j=1}^{n} d_j f_j^2 \right) = \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} (f_i - f_j)^2.$

Part (2): The symmetry of $L$ follows directly from the symmetry of $W$ and $D$. The positive semi-definiteness is a direct consequence of Part (1), which shows that $f' L f \ge 0$ for all $f \in \mathbb{R}^n$.
Part (3): Obvious. Part (4) is a direct consequence of Parts (1)–(3). □

Note that the unnormalized graph Laplacian does not depend on the diagonal elements of the adjacency matrix $W$. Each adjacency matrix which coincides with $W$ on all off-diagonal positions leads to the same unnormalized graph Laplacian $L$. In particular, self-edges in a graph do not change the corresponding graph Laplacian. The unnormalized graph Laplacian and its eigenvalues and eigenvectors can be used to describe many properties of graphs, see Mohar (1991, 1997). One example which will be important for spectral clustering is the following:

Proposition 2 (Number of connected components and the spectrum of L) Let $G$ be an undirected graph with non-negative weights. Then the multiplicity $k$ of the eigenvalue 0 of $L$ equals the number of connected components $A_1, \ldots, A_k$ in the graph. The eigenspace of eigenvalue 0 is spanned by the indicator vectors $\mathbb{1}_{A_1}, \ldots, \mathbb{1}_{A_k}$ of those components.

Proof. We start with the case $k = 1$, that is, the graph is connected. Assume that $f$ is an eigenvector with eigenvalue 0. Then we know that

    $0 = f' L f = \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} (f_i - f_j)^2.$

As the weights $w_{ij}$ are non-negative, this sum can only vanish if all terms $w_{ij}(f_i - f_j)^2$ vanish. Thus, if two vertices $v_i$ and $v_j$ are connected (i.e., $w_{ij} > 0$), then $f_i$ needs to equal $f_j$. With this argument we can see that $f$ needs to be constant for all vertices which can be connected by a path in the graph. Moreover, as all vertices of a connected component in an undirected graph can be connected by a path, $f$ needs to be constant on the whole connected component. In a graph consisting of only one connected component we thus only have the constant one vector $\mathbb{1}$ as eigenvector with eigenvalue 0, which obviously is the indicator vector of the connected component. Now consider the case of $k$ connected components.
Without loss of generality we assume that the vertices are ordered according to the connected components they belong to. In this case, the adjacency matrix $W$ has a block diagonal form, and the same is true for the matrix $L$:

    $L = \begin{pmatrix} L_1 & & \\ & \ddots & \\ & & L_k \end{pmatrix}$

Note that each of the blocks $L_i$ is a proper graph Laplacian on its own, namely the Laplacian corresponding to the subgraph of the $i$-th connected component. As is the case for all block diagonal matrices, we know that the spectrum of $L$ is given by the union of the spectra of the $L_i$, and the corresponding eigenvectors of $L$ are the eigenvectors of the $L_i$, filled with 0 at the positions of the other blocks. As each $L_i$ is a graph Laplacian of a connected graph, we know that every $L_i$ has eigenvalue 0 with multiplicity 1, and the corresponding eigenvector is the constant one vector on the $i$-th connected component. Thus, the matrix $L$ has as many eigenvalues 0 as there are connected components, and the corresponding eigenvectors are the indicator vectors of the connected components. □

3.2 The normalized graph Laplacians

There are two matrices which are called normalized graph Laplacians in the literature. Both matrices are closely related to each other and are defined as

    $L_{\mathrm{sym}} := D^{-1/2} L D^{-1/2} = I - D^{-1/2} W D^{-1/2}$
    $L_{\mathrm{rw}} := D^{-1} L = I - D^{-1} W.$

We denote the first matrix by $L_{\mathrm{sym}}$ as it is a symmetric matrix, and the second one by $L_{\mathrm{rw}}$ as it is closely related to a random walk. In the following we summarize several properties of $L_{\mathrm{sym}}$ and $L_{\mathrm{rw}}$. The standard reference for normalized graph Laplacians is Chung (1997).

Proposition 3 (Properties of $L_{\mathrm{sym}}$ and $L_{\mathrm{rw}}$) The normalized Laplacians satisfy the following properties:

1. For every $f \in \mathbb{R}^n$ we have
    $f' L_{\mathrm{sym}} f = \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} \left( \frac{f_i}{\sqrt{d_i}} - \frac{f_j}{\sqrt{d_j}} \right)^2.$
2. $\lambda$ is an eigenvalue of $L_{\mathrm{rw}}$ with eigenvector $u$ if and only if $\lambda$ is an eigenvalue of $L_{\mathrm{sym}}$ with eigenvector $w = D^{1/2} u$.
3.
$\lambda$ is an eigenvalue of $L_{\mathrm{rw}}$ with eigenvector $u$ if and only if $\lambda$ and $u$ solve the generalized eigenproblem $L u = \lambda D u$.
4. 0 is an eigenvalue of $L_{\mathrm{rw}}$ with the constant one vector $\mathbb{1}$ as eigenvector. 0 is an eigenvalue of $L_{\mathrm{sym}}$ with eigenvector $D^{1/2} \mathbb{1}$.
5. $L_{\mathrm{sym}}$ and $L_{\mathrm{rw}}$ are positive semi-definite and have $n$ non-negative real-valued eigenvalues $0 = \lambda_1 \le \ldots \le \lambda_n$.

Proof. Part (1) can be proved similarly to Part (1) of Proposition 1. Part (2) can be seen immediately by multiplying the eigenvalue equation $L_{\mathrm{sym}} w = \lambda w$ with $D^{-1/2}$ from the left and substituting $u = D^{-1/2} w$. Part (3) follows directly by multiplying the eigenvalue equation $L_{\mathrm{rw}} u = \lambda u$ with $D$ from the left. Part (4): The first statement is obvious as $L_{\mathrm{rw}} \mathbb{1} = 0$; the second statement follows from (2). Part (5): The statement about $L_{\mathrm{sym}}$ follows from (1), and then the statement about $L_{\mathrm{rw}}$ follows from (2). □

As is the case for the unnormalized graph Laplacian, the multiplicity of the eigenvalue 0 of the normalized graph Laplacians is related to the number of connected components:

Proposition 4 (Number of connected components and spectra of $L_{\mathrm{sym}}$ and $L_{\mathrm{rw}}$) Let $G$ be an undirected graph with non-negative weights. Then the multiplicity $k$ of the eigenvalue 0 of both $L_{\mathrm{rw}}$ and $L_{\mathrm{sym}}$ equals the number of connected components $A_1, \ldots, A_k$ in the graph. For $L_{\mathrm{rw}}$, the eigenspace of 0 is spanned by the indicator vectors $\mathbb{1}_{A_i}$ of those components. For $L_{\mathrm{sym}}$, the eigenspace of 0 is spanned by the vectors $D^{1/2} \mathbb{1}_{A_i}$.

Proof. The proof is analogous to the one of Proposition 2, using Proposition 3. □

4 Spectral Clustering Algorithms

Now we would like to state the most common spectral clustering algorithms. For references and the history of spectral clustering we refer to Section 9. We assume that our data consists of $n$ "points" $x_1, \ldots, x_n$ which can be arbitrary objects.
We measure their pairwise similarities $s_{ij} = s(x_i, x_j)$ by some similarity function which is symmetric and non-negative, and we denote the corresponding similarity matrix by $S = (s_{ij})_{i,j=1,\ldots,n}$.

Unnormalized spectral clustering

Input: Similarity matrix $S \in \mathbb{R}^{n \times n}$, number $k$ of clusters to construct.
• Construct a similarity graph by one of the ways described in Section 2. Let $W$ be its weighted adjacency matrix.
• Compute the unnormalized Laplacian $L$.
• Compute the first $k$ eigenvectors $u_1, \ldots, u_k$ of $L$.
• Let $U \in \mathbb{R}^{n \times k}$ be the matrix containing the vectors $u_1, \ldots, u_k$ as columns.
• For $i = 1, \ldots, n$, let $y_i \in \mathbb{R}^k$ be the vector corresponding to the $i$-th row of $U$.
• Cluster the points $(y_i)_{i=1,\ldots,n}$ in $\mathbb{R}^k$ with the k-means algorithm into clusters $C_1, \ldots, C_k$.
Output: Clusters $A_1, \ldots, A_k$ with $A_i = \{ j \mid y_j \in C_i \}$.

There are two different versions of normalized spectral clustering, depending on which of the normalized graph Laplacians is used. We name both algorithms after two popular papers; for more references and history please see Section 9.

Normalized spectral clustering according to Shi and Malik (2000)

Input: Similarity matrix $S \in \mathbb{R}^{n \times n}$, number $k$ of clusters to construct.
• Construct a similarity graph by one of the ways described in Section 2. Let $W$ be its weighted adjacency matrix.
• Compute the unnormalized Laplacian $L$.
• Compute the first $k$ generalized eigenvectors $u_1, \ldots, u_k$ of the generalized eigenproblem $L u = \lambda D u$.
• Let $U \in \mathbb{R}^{n \times k}$ be the matrix containing the vectors $u_1, \ldots, u_k$ as columns.
• For $i = 1, \ldots, n$, let $y_i \in \mathbb{R}^k$ be the vector corresponding to the $i$-th row of $U$.
• Cluster the points $(y_i)_{i=1,\ldots,n}$ in $\mathbb{R}^k$ with the k-means algorithm into clusters $C_1, \ldots, C_k$.
Output: Clusters $A_1, \ldots, A_k$ with $A_i = \{ j \mid y_j \in C_i \}$.
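The steps of the unnormalized algorithm box translate almost line by line into numpy. The sketch below is illustrative only; in particular, the tiny deterministic k-means used here is my own stand-in (farthest-first initialization, assuming well-separated embedded points) for any standard k-means implementation:

```python
import numpy as np

def tiny_kmeans(Y, k, iters=50):
    """Deterministic Lloyd's k-means with farthest-first initialization.
    A stand-in for a standard k-means routine; assumes well-separated data."""
    centers = [Y[0]]
    for _ in range(k - 1):
        dists = np.min([((Y - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Y[int(np.argmax(dists))])   # farthest point so far
    C = np.array(centers)
    for _ in range(iters):
        labels = np.argmin(((Y[:, None, :] - C[None, :, :]) ** 2).sum(-1), axis=1)
        C = np.array([Y[labels == j].mean(axis=0) for j in range(k)])
    return labels

def unnormalized_spectral_clustering(W, k):
    """Unnormalized spectral clustering: L = D - W, first k eigenvectors,
    k-means on the rows of U (the new representation y_i)."""
    D = np.diag(W.sum(axis=1))
    L = D - W
    _, vecs = np.linalg.eigh(L)   # eigenvalues returned in increasing order
    U = vecs[:, :k]               # first k eigenvectors as columns
    return tiny_kmeans(U, k)
```

On a graph with $k$ disconnected components, Proposition 2 guarantees that the rows of $U$ take only $k$ distinct values, so the final k-means step is trivial.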
Note that this algorithm uses the generalized eigenvectors of $L$, which according to Proposition 3 correspond to the eigenvectors of the matrix $L_{\mathrm{rw}}$. So in fact, the algorithm works with eigenvectors of the normalized Laplacian $L_{\mathrm{rw}}$, and hence is called normalized spectral clustering. The next algorithm also uses a normalized Laplacian, but this time the matrix $L_{\mathrm{sym}}$ instead of $L_{\mathrm{rw}}$. As we will see, this algorithm needs to introduce an additional row normalization step which is not needed in the other algorithms. The reasons will become clear in Section 7.

Normalized spectral clustering according to Ng, Jordan, and Weiss (2002)

Input: Similarity matrix $S \in \mathbb{R}^{n \times n}$, number $k$ of clusters to construct.
• Construct a similarity graph by one of the ways described in Section 2. Let $W$ be its weighted adjacency matrix.
• Compute the normalized Laplacian $L_{\mathrm{sym}}$.
• Compute the first $k$ eigenvectors $u_1, \ldots, u_k$ of $L_{\mathrm{sym}}$.
• Let $U \in \mathbb{R}^{n \times k}$ be the matrix containing the vectors $u_1, \ldots, u_k$ as columns.
• Form the matrix $T \in \mathbb{R}^{n \times k}$ from $U$ by normalizing the rows to norm 1, that is, set $t_{ij} = u_{ij} / (\sum_k u_{ik}^2)^{1/2}$.
• For $i = 1, \ldots, n$, let $y_i \in \mathbb{R}^k$ be the vector corresponding to the $i$-th row of $T$.
• Cluster the points $(y_i)_{i=1,\ldots,n}$ with the k-means algorithm into clusters $C_1, \ldots, C_k$.
Output: Clusters $A_1, \ldots, A_k$ with $A_i = \{ j \mid y_j \in C_i \}$.

All three algorithms stated above look rather similar, apart from the fact that they use three different graph Laplacians. In all three algorithms, the main trick is to change the representation of the abstract data points $x_i$ to points $y_i \in \mathbb{R}^k$. It is due to the properties of the graph Laplacians that this change of representation is useful. We will see in the next sections that this change of representation enhances the cluster properties in the data, so that the clusters can be trivially detected in the new representation.
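The distinctive step of the Ng, Jordan, and Weiss algorithm is the row normalization of $U$. A minimal numpy sketch of computing the embedding $T$ (the function name is my own; positive degrees are assumed):

```python
import numpy as np

def njw_embedding(W, k):
    """First k eigenvectors of L_sym = I - D^{-1/2} W D^{-1/2},
    with each row of U rescaled to unit Euclidean norm."""
    d = W.sum(axis=1)                    # degrees (assumed strictly positive)
    D_is = np.diag(1.0 / np.sqrt(d))
    L_sym = np.eye(len(W)) - D_is @ W @ D_is
    _, vecs = np.linalg.eigh(L_sym)      # eigenvalues in increasing order
    U = vecs[:, :k]
    # row normalization: t_ij = u_ij / (sum_k u_ik^2)^{1/2}
    return U / np.linalg.norm(U, axis=1, keepdims=True)
```

The rows of $T$ are the points $y_i$ that are handed to k-means in the last step of the algorithm box.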
In particular, the simple k-means clustering algorithm has no difficulty detecting the clusters in this new representation. Readers not familiar with k-means can read up on this algorithm in numerous text books, for example in Hastie, Tibshirani, and Friedman (2001).

[Figure 1: Toy example for spectral clustering where the data points have been drawn from a mixture of four Gaussians on $\mathbb{R}$. Left upper corner: histogram of the data. First and second row: eigenvalues and eigenvectors of $L_{\mathrm{rw}}$ and $L$ based on the k-nearest neighbor graph. Third and fourth row: eigenvalues and eigenvectors of $L_{\mathrm{rw}}$ and $L$ based on the fully connected graph. For all plots, we used the Gaussian kernel with σ = 1 as similarity function. See text for more details.]

Before we dive into the theory of spectral clustering, we would like to illustrate its principle on a very simple toy example.
This example will be used at several places in this tutorial; we chose it because it is so simple that the relevant quantities can easily be plotted. The toy data set consists of a random sample of 200 points $x_1, \ldots, x_{200} \in \mathbb{R}$ drawn according to a mixture of four Gaussians. The first row of Figure 1 shows the histogram of a sample drawn from this distribution (the x-axis represents the one-dimensional data space). As similarity function on this data set we choose the Gaussian similarity function $s(x_i, x_j) = \exp(-|x_i - x_j|^2 / (2\sigma^2))$ with σ = 1. As similarity graph we consider both the fully connected graph and the 10-nearest neighbor graph. In Figure 1 we show the first eigenvalues and eigenvectors of the unnormalized Laplacian $L$ and the normalized Laplacian $L_{\mathrm{rw}}$. That is, in the eigenvalue plot we plot $i$ versus $\lambda_i$ (for the moment ignore the dashed line and the different shapes of the eigenvalues in the plots for the unnormalized case; their meaning will be discussed in Section 8.5). In the eigenvector plots of an eigenvector $u = (u_1, \ldots, u_{200})'$ we plot $x_i$ versus $u_i$ (note that in the example chosen, $x_i$ is simply a real number, hence we can depict it on the x-axis). The first two rows of Figure 1 show the results based on the 10-nearest neighbor graph. We can see that the first four eigenvalues are 0, and the corresponding eigenvectors are cluster indicator vectors. The reason is that the clusters form disconnected parts in the 10-nearest neighbor graph, in which case the eigenvectors are given as in Propositions 2 and 4. The next two rows show the results for the fully connected graph. As the Gaussian similarity function is always positive, this graph consists of only one connected component. Thus, eigenvalue 0 has multiplicity 1, and the first eigenvector is the constant vector. The following eigenvectors carry the information about the clusters.
For example, in the unnormalized case (last row), if we threshold the second eigenvector at 0, then the part below 0 corresponds to clusters 1 and 2, and the part above 0 to clusters 3 and 4. Similarly, thresholding the third eigenvector separates clusters 1 and 4 from clusters 2 and 3, and thresholding the fourth eigenvector separates clusters 1 and 3 from clusters 2 and 4. Altogether, the first four eigenvectors carry all the information about the four clusters. In all the cases illustrated in this figure, spectral clustering using k-means on the first four eigenvectors easily detects the correct four clusters.

5 Graph cut point of view

The intuition of clustering is to separate points in different groups according to their similarities. For data given in the form of a similarity graph, this problem can be restated as follows: we want to find a partition of the graph such that the edges between different groups have a very low weight (which means that points in different clusters are dissimilar from each other) and the edges within a group have high weight (which means that points within the same cluster are similar to each other). In this section we will see how spectral clustering can be derived as an approximation to such graph partitioning problems.

Given a similarity graph with adjacency matrix $W$, the simplest and most direct way to construct a partition of the graph is to solve the mincut problem. To define it, please recall the notation $W(A, B) := \sum_{i \in A, j \in B} w_{ij}$ and $\bar{A}$ for the complement of $A$. For a given number $k$ of subsets, the mincut approach simply consists in choosing a partition $A_1, \ldots, A_k$ which minimizes

    $\mathrm{cut}(A_1, \ldots, A_k) := \frac{1}{2} \sum_{i=1}^{k} W(A_i, \bar{A}_i).$

Here we introduce the factor 1/2 for notational consistency, as otherwise we would count each edge twice in the cut.
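Given the weight matrix, $W(A, B)$ and the cut value of a partition are straightforward to evaluate. A small sketch (the function names are my own):

```python
import numpy as np

def W_between(W, A, B):
    """W(A, B) = sum of w_ij over i in A, j in B (index lists A, B)."""
    return W[np.ix_(list(A), list(B))].sum()

def cut_value(W, partition):
    """cut(A_1, ..., A_k) = (1/2) * sum_i W(A_i, complement of A_i).
    The factor 1/2 avoids counting each crossing edge twice."""
    n = len(W)
    total = 0.0
    for A in partition:
        comp = [i for i in range(n) if i not in set(A)]
        total += W_between(W, A, comp)
    return total / 2.0
```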
In particular for $k = 2$, mincut is a relatively easy problem and can be solved efficiently; see Stoer and Wagner (1997) and the discussion therein. However, in practice it often does not lead to satisfactory partitions. The problem is that in many cases, the solution of mincut simply separates one individual vertex from the rest of the graph. Of course this is not what we want to achieve in clustering, as clusters should be reasonably large groups of points. One way to circumvent this problem is to explicitly request that the sets $A_1, \ldots, A_k$ are "reasonably large". The two most common objective functions to encode this are RatioCut (Hagen and Kahng, 1992) and the normalized cut Ncut (Shi and Malik, 2000). In RatioCut, the size of a subset $A$ of a graph is measured by its number of vertices $|A|$, while in Ncut the size is measured by the weights of its edges $\mathrm{vol}(A)$. The definitions are:

    $\mathrm{RatioCut}(A_1, \ldots, A_k) := \frac{1}{2} \sum_{i=1}^{k} \frac{W(A_i, \bar{A}_i)}{|A_i|} = \sum_{i=1}^{k} \frac{\mathrm{cut}(A_i, \bar{A}_i)}{|A_i|}$

    $\mathrm{Ncut}(A_1, \ldots, A_k) := \frac{1}{2} \sum_{i=1}^{k} \frac{W(A_i, \bar{A}_i)}{\mathrm{vol}(A_i)} = \sum_{i=1}^{k} \frac{\mathrm{cut}(A_i, \bar{A}_i)}{\mathrm{vol}(A_i)}.$

Note that both objective functions take a small value if the clusters $A_i$ are not too small. In particular, the minimum of the function $\sum_{i=1}^{k} (1/|A_i|)$ is achieved if all $|A_i|$ coincide, and the minimum of $\sum_{i=1}^{k} (1/\mathrm{vol}(A_i))$ is achieved if all $\mathrm{vol}(A_i)$ coincide. So what both objective functions try to achieve is that the clusters are "balanced", as measured by the number of vertices or edge weights, respectively. Unfortunately, introducing balancing conditions makes the previously simple-to-solve mincut problem become NP hard; see Wagner and Wagner (1993) for a discussion. Spectral clustering is a way to solve relaxed versions of those problems.
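Both balanced objectives are likewise easy to evaluate for a given partition; a minimal sketch (helper names are my own):

```python
import numpy as np

def _cut(W, A):
    """cut(A, complement of A) for an index list A."""
    comp = np.setdiff1d(np.arange(len(W)), A)
    return W[np.ix_(np.asarray(A), comp)].sum()

def ratio_cut(W, partition):
    """RatioCut = sum_i cut(A_i, comp(A_i)) / |A_i|."""
    return sum(_cut(W, A) / len(A) for A in partition)

def ncut(W, partition):
    """Ncut = sum_i cut(A_i, comp(A_i)) / vol(A_i),
    where vol(A) is the sum of the degrees of the vertices in A."""
    d = W.sum(axis=1)
    return sum(_cut(W, A) / d[np.asarray(A)].sum() for A in partition)
```

The only difference between the two functions is the denominator, mirroring the definitions above.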
We will see that relaxing Ncut leads to normalized spectral clustering, while relaxing RatioCut leads to unnormalized spectral clustering (see also the tutorial slides by Ding (2004)).

5.1 Approximating RatioCut for k = 2

Let us start with the case of RatioCut and $k = 2$, because the relaxation is easiest to understand in this setting. Our goal is to solve the optimization problem

    $\min_{A \subset V} \mathrm{RatioCut}(A, \bar{A}).$    (1)

We first rewrite the problem in a more convenient form. Given a subset $A \subset V$ we define the vector $f = (f_1, \ldots, f_n)' \in \mathbb{R}^n$ with entries

    $f_i = \begin{cases} \sqrt{|\bar{A}| / |A|} & \text{if } v_i \in A \\ -\sqrt{|A| / |\bar{A}|} & \text{if } v_i \in \bar{A}. \end{cases}$    (2)

Now the RatioCut objective function can be conveniently rewritten using the unnormalized graph Laplacian. This is due to the following calculation:

    $f' L f = \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} (f_i - f_j)^2$
    $= \frac{1}{2} \sum_{i \in A, j \in \bar{A}} w_{ij} \left( \sqrt{\frac{|\bar{A}|}{|A|}} + \sqrt{\frac{|A|}{|\bar{A}|}} \right)^2 + \frac{1}{2} \sum_{i \in \bar{A}, j \in A} w_{ij} \left( -\sqrt{\frac{|\bar{A}|}{|A|}} - \sqrt{\frac{|A|}{|\bar{A}|}} \right)^2$
    $= \mathrm{cut}(A, \bar{A}) \left( \frac{|\bar{A}|}{|A|} + \frac{|A|}{|\bar{A}|} + 2 \right)$
    $= \mathrm{cut}(A, \bar{A}) \left( \frac{|A| + |\bar{A}|}{|A|} + \frac{|A| + |\bar{A}|}{|\bar{A}|} \right)$
    $= |V| \cdot \mathrm{RatioCut}(A, \bar{A}).$

Additionally, we have

    $\sum_{i=1}^{n} f_i = \sum_{i \in A} \sqrt{\frac{|\bar{A}|}{|A|}} - \sum_{i \in \bar{A}} \sqrt{\frac{|A|}{|\bar{A}|}} = |A| \sqrt{\frac{|\bar{A}|}{|A|}} - |\bar{A}| \sqrt{\frac{|A|}{|\bar{A}|}} = 0.$

In other words, the vector $f$ as defined in Equation (2) is orthogonal to the constant one vector $\mathbb{1}$. Finally, note that $f$ satisfies

    $\|f\|^2 = \sum_{i=1}^{n} f_i^2 = |A| \frac{|\bar{A}|}{|A|} + |\bar{A}| \frac{|A|}{|\bar{A}|} = |\bar{A}| + |A| = n.$

Altogether we can see that the problem of minimizing (1) can be equivalently rewritten as

    $\min_{A \subset V} f' L f \quad \text{subject to } f \perp \mathbb{1}, \; f_i \text{ as defined in Eq. (2)}, \; \|f\| = \sqrt{n}.$    (3)

This is a discrete optimization problem as the entries of the solution vector $f$ are only allowed to take two particular values, and of course it is still NP hard. The most obvious relaxation in this setting is to discard the discreteness condition and instead allow $f_i$ to take arbitrary values in $\mathbb{R}$.
This leads to the relaxed optimization problem

    $\min_{f \in \mathbb{R}^n} f' L f \quad \text{subject to } f \perp \mathbb{1}, \; \|f\| = \sqrt{n}.$    (4)

By the Rayleigh–Ritz theorem (e.g., see Section 5.5.2 of Lütkepohl, 1997) it can be seen immediately that the solution of this problem is given by the vector $f$ which is the eigenvector corresponding to the second smallest eigenvalue of $L$ (recall that the smallest eigenvalue of $L$ is 0 with eigenvector $\mathbb{1}$). So we can approximate a minimizer of RatioCut by the second eigenvector of $L$. However, in order to obtain a partition of the graph we need to re-transform the real-valued solution vector $f$ of the relaxed problem into a discrete indicator vector. The simplest way to do this is to use the sign of $f$ as indicator function, that is, to choose

    $v_i \in A$ if $f_i \ge 0$, and $v_i \in \bar{A}$ if $f_i < 0$.

However, in particular in the case of $k > 2$ treated below, this heuristic is too simple. What most spectral clustering algorithms do instead is to consider the coordinates $f_i$ as points in $\mathbb{R}$ and cluster them into two groups $C, \bar{C}$ by the k-means clustering algorithm. Then we carry over the resulting clustering to the underlying data points, that is, we choose

    $v_i \in A$ if $f_i \in C$, and $v_i \in \bar{A}$ if $f_i \in \bar{C}$.

This is exactly the unnormalized spectral clustering algorithm for the case of $k = 2$.

5.2 Approximating RatioCut for arbitrary k

The relaxation of the RatioCut minimization problem in the case of a general value $k$ follows a similar principle as the one above. Given a partition of $V$ into $k$ sets $A_1, \ldots, A_k$, we define $k$ indicator vectors $h_j = (h_{1,j}, \ldots, h_{n,j})'$ by

    $h_{i,j} = \begin{cases} 1 / \sqrt{|A_j|} & \text{if } v_i \in A_j \\ 0 & \text{otherwise} \end{cases} \quad (i = 1, \ldots, n; \; j = 1, \ldots, k).$    (5)

Then we set the matrix $H \in \mathbb{R}^{n \times k}$ as the matrix containing those $k$ indicator vectors as columns. Observe that the columns in $H$ are orthonormal to each other, that is, $H' H = I$.
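The identities established so far, namely $f' L f = |V| \cdot \mathrm{RatioCut}(A, \bar{A})$ for the vector of Eq. (2), its orthogonality to $\mathbb{1}$ and norm $\sqrt{n}$, and the orthonormality $H'H = I$ of the indicator matrix of Eq. (5), can all be checked numerically. A small sketch on an arbitrary example graph (the graph and partition are my own choices):

```python
import numpy as np

# Small weighted graph on 5 vertices and a candidate partition A vs. complement
W = np.zeros((5, 5))
for i, j, w in [(0, 1, 2.0), (1, 2, 2.0), (0, 2, 1.0), (2, 3, 0.25), (3, 4, 3.0)]:
    W[i, j] = W[j, i] = w
L = np.diag(W.sum(axis=1)) - W
A, Ac = [0, 1, 2], [3, 4]

# f as defined in Eq. (2)
f = np.where(np.isin(np.arange(5), A),
             np.sqrt(len(Ac) / len(A)), -np.sqrt(len(A) / len(Ac)))

cut = W[np.ix_(A, Ac)].sum()
ratiocut = cut / len(A) + cut / len(Ac)

# Indicator matrix H from Eq. (5) for the 2-set partition
H = np.zeros((5, 2))
H[A, 0] = 1 / np.sqrt(len(A))
H[Ac, 1] = 1 / np.sqrt(len(Ac))
```

Here $f' L f$ should equal $5 \cdot \mathrm{RatioCut}(A, \bar{A})$, and $\mathrm{Tr}(H' L H)$ should equal $\mathrm{RatioCut}(A, \bar{A})$ directly.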
Similar to the calculations in the last section we can see that

$$h_i' L h_i = \frac{\operatorname{cut}(A_i, \bar{A_i})}{|A_i|}.$$

Moreover, one can check that $h_i' L h_i = (H'LH)_{ii}$. Combining those facts we get

$$\operatorname{RatioCut}(A_1, \dots, A_k) = \sum_{i=1}^{k} h_i' L h_i = \sum_{i=1}^{k} (H'LH)_{ii} = \operatorname{Tr}(H'LH),$$

where Tr denotes the trace of a matrix. So the problem of minimizing $\operatorname{RatioCut}(A_1, \dots, A_k)$ can be rewritten as

$$\min_{A_1, \dots, A_k} \operatorname{Tr}(H'LH) \quad \text{subject to } H'H = I, \; H \text{ as defined in Eq. (5)}.$$

Similar to above we now relax the problem by allowing the entries of the matrix $H$ to take arbitrary real values. Then the relaxed problem becomes:

$$\min_{H \in \mathbb{R}^{n \times k}} \operatorname{Tr}(H'LH) \quad \text{subject to } H'H = I.$$

This is the standard form of a trace minimization problem, and again a version of the Rayleigh-Ritz theorem (e.g., see Section 5.2.2.(6) of Lütkepohl, 1997) tells us that the solution is given by choosing $H$ as the matrix which contains the first $k$ eigenvectors of $L$ as columns. We can see that the matrix $H$ is in fact the matrix $U$ used in the unnormalized spectral clustering algorithm as described in Section 4. Again we need to re-convert the real-valued solution matrix to a discrete partition. As above, the standard way is to use the $k$-means algorithm on the rows of $U$. This leads to the general unnormalized spectral clustering algorithm as presented in Section 4.

5.3 Approximating Ncut

Techniques very similar to the ones used for RatioCut can be used to derive normalized spectral clustering as relaxation of minimizing Ncut. In the case $k = 2$ we define the cluster indicator vector $f$ by

$$f_i = \begin{cases} \sqrt{\dfrac{\operatorname{vol}(\bar{A})}{\operatorname{vol}(A)}} & \text{if } v_i \in A \\ -\sqrt{\dfrac{\operatorname{vol}(A)}{\operatorname{vol}(\bar{A})}} & \text{if } v_i \in \bar{A}. \end{cases} \qquad (6)$$

Similar to above one can check that $(Df)'\mathbb{1} = 0$, $f'Df = \operatorname{vol}(V)$, and $f'Lf = \operatorname{vol}(V) \operatorname{Ncut}(A, \bar{A})$. Thus we can rewrite the problem of minimizing Ncut by the equivalent problem

$$\min_{A} f'Lf \quad \text{subject to } f \text{ as in (6)}, \; Df \perp \mathbb{1}, \; f'Df = \operatorname{vol}(V). \qquad (7)$$
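The trace identity $\operatorname{RatioCut}(A_1, \dots, A_k) = \operatorname{Tr}(H'LH)$ can be verified numerically. The graph and the three-way partition below are assumed examples chosen only for illustration:

```python
import numpy as np

# Assumed example: a random symmetric weighted graph on 6 vertices
# and a partition into k = 3 sets of size 2.
rng = np.random.default_rng(0)
W = rng.random((6, 6))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W

partition = [[0, 1], [2, 3], [4, 5]]
n, k = 6, 3
H = np.zeros((n, k))
for j, Aj in enumerate(partition):        # indicator matrix of Eq. (5)
    H[Aj, j] = 1.0 / np.sqrt(len(Aj))
assert np.allclose(H.T @ H, np.eye(k))    # columns are orthonormal

ratiocut = 0.0
for Aj in partition:
    comp = [i for i in range(n) if i not in Aj]
    ratiocut += W[np.ix_(Aj, comp)].sum() / len(Aj)
assert np.isclose(np.trace(H.T @ L @ H), ratiocut)
```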
Again we relax the problem by allowing $f$ to take arbitrary real values:

$$\min_{f \in \mathbb{R}^n} f'Lf \quad \text{subject to } Df \perp \mathbb{1}, \; f'Df = \operatorname{vol}(V). \qquad (8)$$

Now we substitute $g := D^{1/2} f$. After substitution, the problem is

$$\min_{g \in \mathbb{R}^n} g' D^{-1/2} L D^{-1/2} g \quad \text{subject to } g \perp D^{1/2} \mathbb{1}, \; \|g\|^2 = \operatorname{vol}(V). \qquad (9)$$

Observe that $D^{-1/2} L D^{-1/2} = L_{\mathrm{sym}}$, $D^{1/2} \mathbb{1}$ is the first eigenvector of $L_{\mathrm{sym}}$, and $\operatorname{vol}(V)$ is a constant. Hence, Problem (9) is in the form of the standard Rayleigh-Ritz theorem, and its solution $g$ is given by the second eigenvector of $L_{\mathrm{sym}}$. Re-substituting $f = D^{-1/2} g$ and using Proposition 3 we see that $f$ is the second eigenvector of $L_{\mathrm{rw}}$, or equivalently the generalized eigenvector of $Lu = \lambda D u$.

For the case of finding $k > 2$ clusters, we define the indicator vectors $h_j = (h_{1,j}, \dots, h_{n,j})'$ by

$$h_{i,j} = \begin{cases} 1/\sqrt{\operatorname{vol}(A_j)} & \text{if } v_i \in A_j \\ 0 & \text{otherwise} \end{cases} \quad (i = 1, \dots, n; \; j = 1, \dots, k). \qquad (10)$$

Figure 2: The cockroach graph from Guattery and Miller (1998).

Then we set the matrix $H$ as the matrix containing those $k$ indicator vectors as columns. Observe that $H'DH = I$, $h_i' D h_i = 1$, and $h_i' L h_i = \operatorname{cut}(A_i, \bar{A_i}) / \operatorname{vol}(A_i)$. So we can write the problem of minimizing Ncut as

$$\min_{A_1, \dots, A_k} \operatorname{Tr}(H'LH) \quad \text{subject to } H'DH = I, \; H \text{ as in (10)}.$$

Relaxing the discreteness condition and substituting $T = D^{1/2} H$ we obtain the relaxed problem

$$\min_{T \in \mathbb{R}^{n \times k}} \operatorname{Tr}(T' D^{-1/2} L D^{-1/2} T) \quad \text{subject to } T'T = I. \qquad (11)$$

Again this is the standard trace minimization problem which is solved by the matrix $T$ which contains the first $k$ eigenvectors of $L_{\mathrm{sym}}$ as columns. Re-substituting $H = D^{-1/2} T$ and using Proposition 3 we see that the solution $H$ consists of the first $k$ eigenvectors of the matrix $L_{\mathrm{rw}}$, or the first $k$ generalized eigenvectors of $Lu = \lambda D u$. This yields the normalized spectral clustering algorithm according to Shi and Malik (2000).
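The generalized eigenproblem $Lu = \lambda D u$ at the heart of this relaxation can be solved directly with standard software. The toy graph below is an assumed example; `scipy.linalg.eigh` with a second matrix argument solves the symmetric-definite generalized problem:

```python
import numpy as np
from scipy.linalg import eigh

# Assumed toy graph: two cliques with weak coupling. The first k
# generalized eigenvectors of L u = lambda D u are what the
# Shi-Malik algorithm feeds into k-means.
W = np.ones((6, 6)) - np.eye(6)
W[:3, 3:] = 0.05
W[3:, :3] = 0.05
D = np.diag(W.sum(axis=1))
L = D - W

vals, vecs = eigh(L, D)                   # generalized eigenproblem, ascending
H = vecs[:, :2]                           # first k = 2 generalized eigenvectors

# the second generalized eigenvector separates the two blocks by sign
assert np.sign(H[0, 1]) == np.sign(H[1, 1]) == np.sign(H[2, 1])
assert np.sign(H[3, 1]) != np.sign(H[0, 1])
```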
5.4 Comments on the relaxation approach

There are several comments we should make about this derivation of spectral clustering. Most importantly, there is no guarantee whatsoever on the quality of the solution of the relaxed problem compared to the exact solution. That is, if $A_1, \dots, A_k$ is the exact solution of minimizing RatioCut, and $B_1, \dots, B_k$ is the solution constructed by unnormalized spectral clustering, then $\operatorname{RatioCut}(B_1, \dots, B_k) - \operatorname{RatioCut}(A_1, \dots, A_k)$ can be arbitrarily large. Several examples for this can be found in Guattery and Miller (1998). For instance, the authors consider a very simple class of graphs called "cockroach graphs". Those graphs essentially look like a ladder with a few rungs removed, see Figure 2. Obviously, the ideal RatioCut for $k = 2$ just cuts the ladder by a vertical cut such that $A = \{v_1, \dots, v_k, v_{2k+1}, \dots, v_{3k}\}$ and $\bar{A} = \{v_{k+1}, \dots, v_{2k}, v_{3k+1}, \dots, v_{4k}\}$. This cut is perfectly balanced with $|A| = |\bar{A}| = 2k$ and $\operatorname{cut}(A, \bar{A}) = 2$. However, by studying the properties of the second eigenvector of the unnormalized graph Laplacian of cockroach graphs the authors prove that unnormalized spectral clustering always cuts horizontally through the ladder, constructing the sets $B = \{v_1, \dots, v_{2k}\}$ and $\bar{B} = \{v_{2k+1}, \dots, v_{4k}\}$. This also results in a balanced cut, but now we cut $k$ edges instead of just 2. So $\operatorname{RatioCut}(A, \bar{A}) = 2/k$, while $\operatorname{RatioCut}(B, \bar{B}) = 1$. This means that compared to the optimal cut, the RatioCut value obtained by spectral clustering is $k/2$ times worse, that is a factor in the order of $n$. Several other papers investigate the quality of the clustering constructed by spectral clustering, for example Spielman and Teng (1996) (for unnormalized spectral clustering) and Kannan, Vempala, and Vetta (2004) (for normalized spectral clustering).
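The two competing cuts on the cockroach graph can be compared directly in code. The construction below (0-based vertex numbering, rungs only on the second halves of the two paths, and $k = 5$) is an assumed reading of the Guattery-Miller construction, but it reproduces the RatioCut values $2/k$ and $1$ from the text:

```python
import numpy as np

# Cockroach graph sketch (assumed layout): two paths of 2k vertices,
# joined by rungs only on their second halves.
def cockroach(k):
    n = 4 * k
    W = np.zeros((n, n))
    for i in range(2 * k - 1):               # top path v_0 .. v_{2k-1}
        W[i, i + 1] = W[i + 1, i] = 1
    for i in range(2 * k, 4 * k - 1):        # bottom path v_{2k} .. v_{4k-1}
        W[i, i + 1] = W[i + 1, i] = 1
    for i in range(k, 2 * k):                # k rungs on the second halves
        W[i, i + 2 * k] = W[i + 2 * k, i] = 1
    return W

def ratiocut(W, A):
    comp = [i for i in range(len(W)) if i not in A]
    cut = W[np.ix_(A, comp)].sum()
    return cut / len(A) + cut / len(comp)

k = 5
W = cockroach(k)
vertical = list(range(k)) + list(range(2 * k, 3 * k))    # ideal cut, 2 edges
horizontal = list(range(2 * k))                          # spectral cut, k rungs
assert np.isclose(ratiocut(W, vertical), 2.0 / k)
assert np.isclose(ratiocut(W, horizontal), 1.0)
```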
In general it is known that efficient algorithms to approximate balanced graph cuts up to a constant factor do not exist. To the contrary, this approximation problem can be NP hard itself (Bui and Jones, 1992).

Of course, the relaxation we discussed above is not unique. For example, a completely different relaxation which leads to a semi-definite program is derived in Bie and Cristianini (2006), and there might be many other useful relaxations. The reason why the spectral relaxation is so appealing is not that it leads to particularly good solutions. Its popularity is mainly due to the fact that it results in a standard linear algebra problem which is simple to solve.

6 Random walks point of view

Another line of argument to explain spectral clustering is based on random walks on the similarity graph. A random walk on a graph is a stochastic process which randomly jumps from vertex to vertex. We will see below that spectral clustering can be interpreted as trying to find a partition of the graph such that the random walk stays long within the same cluster and seldom jumps between clusters. Intuitively this makes sense, in particular together with the graph cut explanation of the last section: a balanced partition with a low cut will also have the property that the random walk does not have many opportunities to jump between clusters. For background reading on random walks in general we refer to Norris (1997) and Brémaud (1999), and for random walks on graphs we recommend Aldous and Fill (in preparation) and Lovász (1993). Formally, the transition probability of jumping in one step from vertex $v_i$ to vertex $v_j$ is proportional to the edge weight $w_{ij}$ and is given by $p_{ij} := w_{ij}/d_i$. The transition matrix $P = (p_{ij})_{i,j=1,\dots,n}$ of the random walk is thus defined by $P = D^{-1} W$.
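The definition of $P$ is a one-liner in code. The following sketch (the small weighted graph is an assumed example) builds $P = D^{-1}W$ and checks two facts used in the surrounding discussion: the rows of $P$ are probability distributions, and $\pi_i = d_i/\operatorname{vol}(V)$ is stationary:

```python
import numpy as np

# Assumed small weighted graph.
W = np.array([[0., 1., 2.],
              [1., 0., 1.],
              [2., 1., 0.]])
d = W.sum(axis=1)
P = W / d[:, None]                        # p_ij = w_ij / d_i, i.e. P = D^{-1} W
pi = d / d.sum()                          # pi_i = d_i / vol(V)

assert np.allclose(P.sum(axis=1), 1.0)    # each row is a probability distribution
assert np.allclose(pi @ P, pi)            # pi is stationary

# L_rw = I - P, so lambda is an eigenvalue of L_rw iff 1 - lambda is one of P
L_rw = np.eye(3) - np.diag(1.0 / d) @ W
assert np.allclose(L_rw, np.eye(3) - P)
```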
If the graph is connected and non-bipartite, then the random walk always possesses a unique stationary distribution $\pi = (\pi_1, \dots, \pi_n)'$, where $\pi_i = d_i/\operatorname{vol}(V)$. Obviously there is a tight relationship between $L_{\mathrm{rw}}$ and $P$, as $L_{\mathrm{rw}} = I - P$. As a consequence, $\lambda$ is an eigenvalue of $L_{\mathrm{rw}}$ with eigenvector $u$ if and only if $1 - \lambda$ is an eigenvalue of $P$ with eigenvector $u$. It is well known that many properties of a graph can be expressed in terms of the corresponding random walk transition matrix $P$, see Lovász (1993) for an overview. From this point of view it does not come as a surprise that the largest eigenvectors of $P$ and the smallest eigenvectors of $L_{\mathrm{rw}}$ can be used to describe cluster properties of the graph.

Random walks and Ncut

A formal equivalence between Ncut and transition probabilities of the random walk has been observed in Meila and Shi (2001).

Proposition 5 (Ncut via transition probabilities) Let $G$ be connected and non-bipartite. Assume that we run the random walk $(X_t)_{t \in \mathbb{N}}$ starting with $X_0$ in the stationary distribution $\pi$. For disjoint subsets $A, B \subset V$, denote by $P(B|A) := P(X_1 \in B \mid X_0 \in A)$. Then:

$$\operatorname{Ncut}(A, \bar{A}) = P(\bar{A}|A) + P(A|\bar{A}).$$

Proof. First of all observe that

$$P(X_0 \in A, X_1 \in B) = \sum_{i \in A, j \in B} P(X_0 = i, X_1 = j) = \sum_{i \in A, j \in B} \pi_i p_{ij} = \sum_{i \in A, j \in B} \frac{d_i}{\operatorname{vol}(V)} \frac{w_{ij}}{d_i} = \frac{1}{\operatorname{vol}(V)} \sum_{i \in A, j \in B} w_{ij}.$$

Using this we obtain

$$P(X_1 \in B \mid X_0 \in A) = \frac{P(X_0 \in A, X_1 \in B)}{P(X_0 \in A)} = \left( \frac{1}{\operatorname{vol}(V)} \sum_{i \in A, j \in B} w_{ij} \right) \left( \frac{\operatorname{vol}(A)}{\operatorname{vol}(V)} \right)^{-1} = \frac{\sum_{i \in A, j \in B} w_{ij}}{\operatorname{vol}(A)}.$$

Now the proposition follows directly with the definition of Ncut. $\Box$

This proposition leads to a nice interpretation of Ncut, and hence of normalized spectral clustering. It tells us that when minimizing Ncut, we actually look for a cut through the graph such that a random walk seldom transitions from $A$ to $\bar{A}$ and vice versa.
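Proposition 5 can be checked numerically. The graph and partition below are assumed examples; the conditional probabilities are computed via the closed form $P(X_1 \in B \mid X_0 \in A) = \sum_{i \in A, j \in B} w_{ij} / \operatorname{vol}(A)$ derived in the proof:

```python
import numpy as np

# Assumed toy graph and partition for a numerical check of Proposition 5.
W = np.array([[0., 3., 1., 0.],
              [3., 0., 0., 1.],
              [1., 0., 0., 2.],
              [0., 1., 2., 0.]])
d = W.sum(axis=1)
A, Abar = [0, 1], [2, 3]

cut = W[np.ix_(A, Abar)].sum()
ncut = cut / d[A].sum() + cut / d[Abar].sum()      # Ncut(A, Abar)

# stationary one-step transition probabilities between the two sets
p_out_of_A = W[np.ix_(A, Abar)].sum() / d[A].sum()
p_out_of_Abar = W[np.ix_(Abar, A)].sum() / d[Abar].sum()
assert np.isclose(ncut, p_out_of_A + p_out_of_Abar)
```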
The commute distance

A second connection between random walks and graph Laplacians can be made via the commute distance on the graph. The commute distance (also called resistance distance) $c_{ij}$ between two vertices $v_i$ and $v_j$ is the expected time it takes the random walk to travel from vertex $v_i$ to vertex $v_j$ and back (Lovász, 1993; Aldous and Fill, in preparation). The commute distance has several nice properties which make it particularly appealing for machine learning. As opposed to the shortest path distance on a graph, the commute distance between two vertices decreases if there are many different short ways to get from vertex $v_i$ to vertex $v_j$. So instead of just looking for the one shortest path, the commute distance looks at the set of short paths. Points which are connected by a short path in the graph and lie in the same high-density region of the graph are considered closer to each other than points which are connected by a short path but lie in different high-density regions of the graph. In this sense, the commute distance seems particularly well-suited to be used for clustering purposes.

Remarkably, the commute distance on a graph can be computed with the help of the generalized inverse (also called pseudo-inverse or Moore-Penrose inverse) $L^\dagger$ of the graph Laplacian $L$. In the following we denote $e_i = (0, \dots, 0, 1, 0, \dots, 0)'$ as the $i$-th unit vector. To define the generalized inverse of $L$, recall that by Proposition 1 the matrix $L$ can be decomposed as $L = U \Lambda U'$ where $U$ is the matrix containing all eigenvectors as columns and $\Lambda$ the diagonal matrix with the eigenvalues $\lambda_1, \dots, \lambda_n$ on the diagonal. As at least one of the eigenvalues is 0, the matrix $L$ is not invertible. Instead, we define its generalized inverse as $L^\dagger := U \Lambda^\dagger U'$ where the matrix $\Lambda^\dagger$ is the diagonal matrix with diagonal entries $1/\lambda_i$ if $\lambda_i \neq 0$ and 0 if $\lambda_i = 0$.
The entries of $L^\dagger$ can be computed as $l^\dagger_{ij} = \sum_{k=2}^{n} \frac{1}{\lambda_k} u_{ik} u_{jk}$. The matrix $L^\dagger$ is positive semi-definite and symmetric. For further properties of $L^\dagger$ see Gutman and Xiao (2004).

Proposition 6 (Commute distance) Let $G = (V, E)$ be a connected, undirected graph. Denote by $c_{ij}$ the commute distance between vertex $v_i$ and vertex $v_j$, and by $L^\dagger = (l^\dagger_{ij})_{i,j=1,\dots,n}$ the generalized inverse of $L$. Then we have:

$$c_{ij} = \operatorname{vol}(V)(l^\dagger_{ii} - 2 l^\dagger_{ij} + l^\dagger_{jj}) = \operatorname{vol}(V)(e_i - e_j)' L^\dagger (e_i - e_j).$$

This result has been published by Klein and Randic (1993), where it has been proved by methods of electrical network theory. For a proof using first step analysis for random walks see Fouss, Pirotte, Renders, and Saerens (2007). There also exist other ways to express the commute distance with the help of graph Laplacians. For example a method in terms of eigenvectors of the normalized Laplacian $L_{\mathrm{sym}}$ can be found as Corollary 3.2 in Lovász (1993), and a method computing the commute distance with the help of determinants of certain sub-matrices of $L$ can be found in Bapat, Gutman, and Xiao (2003).

Proposition 6 has an important consequence. It shows that $\sqrt{c_{ij}}$ can be considered as a Euclidean distance function on the vertices of the graph. This means that we can construct an embedding which maps the vertices $v_i$ of the graph on points $z_i \in \mathbb{R}^n$ such that the Euclidean distances between the points $z_i$ coincide with the commute distances on the graph. This works as follows. As the matrix $L^\dagger$ is positive semi-definite and symmetric, it induces an inner product on $\mathbb{R}^n$ (or to be more formal, it induces an inner product on the subspace of $\mathbb{R}^n$ which is perpendicular to the vector $\mathbb{1}$). Now choose $z_i$ as the point in $\mathbb{R}^n$ corresponding to the $i$-th row of the matrix $U (\Lambda^\dagger)^{1/2}$.
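Proposition 6 is easy to evaluate numerically via `numpy.linalg.pinv`. The unweighted triangle below is an assumed example; by electrical network theory its effective resistance between any two vertices is $2/3$, so with $\operatorname{vol}(V) = 6$ the commute distance should be $4$:

```python
import numpy as np

# Assumed example: the unweighted triangle graph.
W = np.ones((3, 3)) - np.eye(3)
L = np.diag(W.sum(axis=1)) - W
L_pinv = np.linalg.pinv(L)                # generalized inverse L†
vol = W.sum()                             # vol(V) = sum of degrees = 6

e = np.eye(3)
c01 = vol * (e[0] - e[1]) @ L_pinv @ (e[0] - e[1])
# both forms of Proposition 6 agree
assert np.isclose(c01, vol * (L_pinv[0, 0] - 2 * L_pinv[0, 1] + L_pinv[1, 1]))
assert np.isclose(c01, 4.0)               # vol(V) * effective resistance = 6 * 2/3
```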
Then, by Proposition 6 and by the construction of $L^\dagger$ we have that $\langle z_i, z_j \rangle = e_i' L^\dagger e_j$ and $c_{ij} = \operatorname{vol}(V) \|z_i - z_j\|^2$.

The embedding used in unnormalized spectral clustering is related to the commute time embedding, but not identical. In spectral clustering, we map the vertices of the graph on the rows $y_i$ of the matrix $U$, while the commute time embedding maps the vertices on the rows $z_i$ of the matrix $U(\Lambda^\dagger)^{1/2}$. That is, compared to the entries of $y_i$, the entries of $z_i$ are additionally scaled by the inverse eigenvalues of $L$. Moreover, in spectral clustering we only take the first $k$ columns of the matrix, while the commute time embedding takes all columns. Several authors now try to justify why $y_i$ and $z_i$ are not so different after all, and state somewhat hand-wavingly that the fact that spectral clustering constructs clusters based on the Euclidean distances between the $y_i$ can be interpreted as building clusters of the vertices in the graph based on the commute distance. However, note that both approaches can differ considerably. For example, in the optimal case where the graph consists of $k$ disconnected components, the first $k$ eigenvalues of $L$ are 0 according to Proposition 2, and the first $k$ columns of $U$ consist of the cluster indicator vectors. However, the first $k$ columns of the matrix $U(\Lambda^\dagger)^{1/2}$ consist of zeros only, as the first $k$ diagonal elements of $\Lambda^\dagger$ are 0. In this case, the information contained in the first $k$ columns of $U$ is completely ignored in the matrix $U(\Lambda^\dagger)^{1/2}$, and all the non-zero elements of the matrix $U(\Lambda^\dagger)^{1/2}$, which can be found in columns $k+1$ to $n$, are not taken into account in spectral clustering, which discards all those columns. On the other hand, those problems do not occur if the underlying graph is connected. In this case, the only eigenvector with eigenvalue 0 is the constant one vector, which can be ignored in both cases.
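The identity $c_{ij} = \operatorname{vol}(V) \|z_i - z_j\|^2$ for the commute time embedding can be verified on an assumed random connected graph:

```python
import numpy as np

# Assumed example: a dense random symmetric weighted graph (hence connected).
rng = np.random.default_rng(1)
W = rng.random((5, 5))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W
vol = W.sum()

vals, U = np.linalg.eigh(L)
# (Lambda†)^{1/2}: invert only the non-zero eigenvalues
inv = np.where(vals > 1e-10, 1.0 / np.maximum(vals, 1e-30), 0.0)
Z = U * np.sqrt(inv)                      # row i of U (Lambda†)^{1/2} is z_i

L_pinv = np.linalg.pinv(L)
e = np.eye(5)
for i, j in [(0, 1), (2, 4)]:
    c_ij = vol * (e[i] - e[j]) @ L_pinv @ (e[i] - e[j])
    assert np.isclose(vol * np.sum((Z[i] - Z[j]) ** 2), c_ij)
```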
The eigenvectors corresponding to small eigenvalues $\lambda_i$ of $L$ are then stressed in the matrix $U(\Lambda^\dagger)^{1/2}$ as they are multiplied by $\lambda^\dagger_i = 1/\lambda_i$. In such a situation, it might be true that the commute time embedding and the spectral embedding do similar things.

All in all, it seems that the commute time distance can be a helpful intuition, but without making further assumptions there is only a rather loose relation between spectral clustering and the commute distance. It might be possible that those relations can be tightened, for example if the similarity function is strictly positive definite. However, we have not yet seen a precise mathematical statement about this.

7 Perturbation theory point of view

Perturbation theory studies the question of how eigenvalues and eigenvectors of a matrix $A$ change if we add a small perturbation $H$, that is we consider the perturbed matrix $\tilde{A} := A + H$. Most perturbation theorems state that a certain distance between eigenvalues or eigenvectors of $A$ and $\tilde{A}$ is bounded by a constant times a norm of $H$. The constant usually depends on which eigenvalue we are looking at, and how far this eigenvalue is separated from the rest of the spectrum (for a formal statement see below). The justification of spectral clustering is then the following: Let us first consider the "ideal case" where the between-cluster similarity is exactly 0. We have seen in Section 3 that then the first $k$ eigenvectors of $L$ or $L_{\mathrm{rw}}$ are the indicator vectors of the clusters. In this case, the points $y_i \in \mathbb{R}^k$ constructed in the spectral clustering algorithms have the form $(0, \dots, 0, 1, 0, \dots, 0)'$ where the position of the 1 indicates the connected component this point belongs to. In particular, all $y_i$ belonging to the same connected component coincide. The $k$-means algorithm will trivially find the correct partition by placing a center point on each of the points $(0, \dots, 0, 1, 0, \dots, 0)' \in \mathbb{R}^k$. In a "nearly ideal case" where we still have distinct clusters, but the between-cluster similarity is not exactly 0, we consider the Laplacian matrices to be perturbed versions of the ones of the ideal case. Perturbation theory then tells us that the eigenvectors will be very close to the ideal indicator vectors. The points $y_i$ might not completely coincide with $(0, \dots, 0, 1, 0, \dots, 0)'$, but do so up to some small error term. Hence, if the perturbations are not too large, then the $k$-means algorithm will still separate the groups from each other.

7.1 The formal perturbation argument

The formal basis for the perturbation approach to spectral clustering is the Davis-Kahan theorem from matrix perturbation theory. This theorem bounds the difference between eigenspaces of symmetric matrices under perturbations. We state those results for completeness, but for background reading we refer to Section V of Stewart and Sun (1990) and Section VII.3 of Bhatia (1997). In perturbation theory, distances between subspaces are usually measured using "canonical angles" (also called "principal angles"). To define principal angles, let $\mathcal{V}_1$ and $\mathcal{V}_2$ be two $p$-dimensional subspaces of $\mathbb{R}^d$, and $V_1$ and $V_2$ two matrices such that their columns form orthonormal systems for $\mathcal{V}_1$ and $\mathcal{V}_2$, respectively. Then the cosines $\cos \Theta_i$ of the principal angles $\Theta_i$ are the singular values of $V_1' V_2$. For $p = 1$, the so-defined canonical angles coincide with the normal definition of an angle. Canonical angles can also be defined if $\mathcal{V}_1$ and $\mathcal{V}_2$ do not have the same dimension, see Section V of Stewart and Sun (1990), Section VII.3 of Bhatia (1997), or Section 12.4.3 of Golub and Van Loan (1996). The matrix $\sin \Theta(\mathcal{V}_1, \mathcal{V}_2)$ will denote the diagonal matrix with the sine of the canonical angles on the diagonal.
Theorem 7 (Davis-Kahan) Let $A, H \in \mathbb{R}^{n \times n}$ be symmetric matrices, and let $\|\cdot\|$ be the Frobenius norm or the two-norm for matrices, respectively. Consider $\tilde{A} := A + H$ as a perturbed version of $A$. Let $S_1 \subset \mathbb{R}$ be an interval. Denote by $\sigma_{S_1}(A)$ the set of eigenvalues of $A$ which are contained in $S_1$, and by $V_1$ the eigenspace corresponding to all those eigenvalues (more formally, $V_1$ is the image of the spectral projection induced by $\sigma_{S_1}(A)$). Denote by $\sigma_{S_1}(\tilde{A})$ and $\tilde{V}_1$ the analogous quantities for $\tilde{A}$. Define the distance between $S_1$ and the spectrum of $A$ outside of $S_1$ as

$$\delta = \min \{ |\lambda - s| ; \; \lambda \text{ eigenvalue of } A, \; \lambda \notin S_1, \; s \in S_1 \}.$$

Then the distance $d(V_1, \tilde{V}_1) := \|\sin \Theta(V_1, \tilde{V}_1)\|$ between the two subspaces $V_1$ and $\tilde{V}_1$ is bounded by

$$d(V_1, \tilde{V}_1) \leq \frac{\|H\|}{\delta}.$$

For a discussion and proofs of this theorem see for example Section V.3 of Stewart and Sun (1990).

Let us try to decrypt this theorem, for simplicity in the case of the unnormalized Laplacian (for the normalized Laplacian it works analogously). The matrix $A$ will correspond to the graph Laplacian $L$ in the ideal case where the graph has $k$ connected components. The matrix $\tilde{A}$ corresponds to a perturbed case, where due to noise the $k$ components in the graph are no longer completely disconnected, but are only connected by few edges with low weight. We denote the corresponding graph Laplacian of this case by $\tilde{L}$. For spectral clustering we need to consider the first $k$ eigenvalues and eigenvectors of $\tilde{L}$. Denote the eigenvalues of $L$ by $\lambda_1, \dots, \lambda_n$ and the ones of the perturbed Laplacian $\tilde{L}$ by $\tilde{\lambda}_1, \dots, \tilde{\lambda}_n$. Choosing the interval $S_1$ is now the crucial point. We want to choose it such that both the first $k$ eigenvalues of $\tilde{L}$ and the first $k$ eigenvalues of $L$ are contained in $S_1$.
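The quantities in Theorem 7 can all be computed explicitly on a small example. Everything in the sketch below is an assumed toy choice: $A$ is diagonal with eigenvalues $\{0, 0, 5, 6\}$, $S_1 = [-0.1, 0.1]$ (so $V_1$ is spanned by the first two standard basis vectors and $\delta = 4.9$), and the perturbation $H$ is small random symmetric noise. The sines of the canonical angles are read off the singular values of $V_1'\tilde{V}_1$:

```python
import numpy as np

# Assumed toy setting for Theorem 7.
A = np.diag([0., 0., 5., 6.])             # ideal matrix with eigengap 5
rng = np.random.default_rng(2)
H = rng.normal(scale=0.01, size=(4, 4))
H = (H + H.T) / 2                         # small symmetric perturbation
A_tilde = A + H

_, V1 = np.linalg.eigh(A)
_, V1t = np.linalg.eigh(A_tilde)
V1, V1t = V1[:, :2], V1t[:, :2]           # eigenspaces of the two smallest eigenvalues

cosines = np.linalg.svd(V1.T @ V1t, compute_uv=False)
sines = np.sqrt(np.clip(1.0 - cosines ** 2, 0.0, None))
d = np.linalg.norm(sines)                 # ||sin Theta||_F

delta = 4.9                               # distance from S1 = [-0.1, 0.1] to {5, 6}
assert d <= np.linalg.norm(H, 'fro') / delta   # the Davis-Kahan bound holds
```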
This is easier the smaller the perturbation $H = \tilde{L} - L$ and the larger the eigengap $|\lambda_k - \lambda_{k+1}|$ is. If we manage to find such a set, then the Davis-Kahan theorem tells us that the eigenspaces corresponding to the first $k$ eigenvalues of the ideal matrix $L$ and the first $k$ eigenvalues of the perturbed matrix $\tilde{L}$ are very close to each other, that is their distance is bounded by $\|H\|/\delta$. Then, as the eigenvectors in the ideal case are piecewise constant on the connected components, the same will approximately be true in the perturbed case. How good "approximately" is depends on the norm of the perturbation $\|H\|$ and the distance $\delta$ between $S_1$ and the $(k+1)$st eigenvalue of $L$. If the set $S_1$ has been chosen as the interval $[0, \lambda_k]$, then $\delta$ coincides with the spectral gap $|\lambda_{k+1} - \lambda_k|$. We can see from the theorem that the larger this eigengap is, the closer the eigenvectors of the ideal case and the perturbed case are, and hence the better spectral clustering works. Below we will see that the size of the eigengap can also be used in a different context as a quality criterion for spectral clustering, namely when choosing the number $k$ of clusters to construct.

If the perturbation $H$ is too large or the eigengap is too small, we might not find a set $S_1$ such that both the first $k$ eigenvalues of $L$ and $\tilde{L}$ are contained in $S_1$. In this case, we need to make a compromise by choosing the set $S_1$ to contain the first $k$ eigenvalues of $L$, but maybe a few more or fewer eigenvalues of $\tilde{L}$. The statement of the theorem then becomes weaker in the sense that either we do not compare the eigenspaces corresponding to the first $k$ eigenvectors of $L$ and $\tilde{L}$, but the eigenspaces corresponding to the first $k$ eigenvectors of $L$ and the first $\tilde{k}$ eigenvectors of $\tilde{L}$ (where $\tilde{k}$ is the number of eigenvalues of $\tilde{L}$ contained in $S_1$).
Or, it can happen that $\delta$ becomes so small that the bound on the distance $d(V_1, \tilde{V}_1)$ blows up so much that it becomes useless.

7.2 Comments about the perturbation approach

A bit of caution is needed when using perturbation theory arguments to justify clustering algorithms based on eigenvectors of matrices. In general, any block diagonal symmetric matrix has the property that there exists a basis of eigenvectors which are zero outside the individual blocks and real-valued within the blocks. For example, based on this argument several authors use the eigenvectors of the similarity matrix $S$ or adjacency matrix $W$ to discover clusters. However, being block diagonal in the ideal case of completely separated clusters can be considered as a necessary condition for a successful use of eigenvectors, but not a sufficient one. At least two more properties should be satisfied:

First, we need to make sure that the order of the eigenvalues and eigenvectors is meaningful. In case of the Laplacians this is always true, as we know that any connected component possesses exactly one eigenvector which has eigenvalue 0. Hence, if the graph has $k$ connected components and we take the first $k$ eigenvectors of the Laplacian, then we know that we have exactly one eigenvector per component. However, this might not be the case for other matrices such as $S$ or $W$. For example, it could be the case that the two largest eigenvalues of a block diagonal similarity matrix $S$ come from the same block. In such a situation, if we take the first $k$ eigenvectors of $S$, some blocks will be represented several times, while there are other blocks which we will miss completely (unless we take certain precautions). This is the reason why using the eigenvectors of $S$ or $W$ for clustering should be discouraged.
The second prop erty is that in the ideal case, the entries of the eigenv ectors on the comp onen ts should b e “safely b ounded aw a y” from 0. Assume that an eigenv ector on the first connected comp onen t has an en try u 1 ,i > 0 at p osition i . In the ideal case, the fact that this entry is non-zero indicates that the corresp onding p oin t i b elongs to the first cluster. The other wa y round, if a p oin t j do es not b elong to cluster 1, then in the ideal case it should b e the case that u 1 ,j = 0. Now consider the same situation, but with p erturb ed data. The p erturb ed eigenv ector ˜ u will usually not hav e any non-zero comp onen t an y more; but if the noise is not to o large, then p erturbation theory tells us that the en tries ˜ u 1 ,i and ˜ u 1 ,j are still “close” to their original v alues u 1 ,i and u 1 ,j . So b oth entries ˜ u 1 ,i and ˜ u 1 ,j will tak e some small v alues, say ε 1 and ε 2 . In practice, if those v alues are very small it is unclear how we should in terpret this situation. Either we believe that small en tries in ˜ u indicate that the p oin ts do not b elong to the first cluster (whic h then misclassifies the first data p oin t i ), or we think that the entries already indicate class membership and classify b oth p oin ts to the first cluster (which misclassifies p oin t j ). F or b oth matrices L and L rw , the eigen v ectors in the ideal situation are indicator v ectors, so the second problem describ ed ab o ve cannot o ccur. How ev er, this is not true for the matrix L sym , which is used in the normalized sp ectral clustering algorithm of Ng et al. (2002). Even in the ideal case, the eigen- v ectors of this matrix are giv en as D 1 / 2 1 A i . If the degrees of the vertices differ a lot, and in particular if there are vertices which ha ve a very low degree, the corresp onding entries in the eigen v ectors are v ery small. 
To counteract the problem described above, the row-normalization step in the algorithm of Ng et al. (2002) comes into play. In the ideal case, the matrix $U$ in the algorithm has exactly one non-zero entry per row. After row-normalization, the matrix $T$ in the algorithm of Ng et al. (2002) then consists of the cluster indicator vectors. Note however, that this might not always work out correctly in practice. Assume that we have $\tilde{u}_{i,1} = \varepsilon_1$ and $\tilde{u}_{i,2} = \varepsilon_2$. If we now normalize the $i$-th row of $U$, both $\varepsilon_1$ and $\varepsilon_2$ will be multiplied by the factor $1/\sqrt{\varepsilon_1^2 + \varepsilon_2^2}$ and become rather large. We now run into a similar problem as described above: both points are likely to be classified into the same cluster, even though they belong to different clusters. This argument shows that spectral clustering using the matrix $L_{\mathrm{sym}}$ can be problematic if the eigenvectors contain particularly small entries. On the other hand, note that such small entries in the eigenvectors only occur if some of the vertices have particularly low degrees (as the eigenvectors of $L_{\mathrm{sym}}$ are given by $D^{1/2} \mathbb{1}_{A_i}$). One could argue that in such a case, the data point should be considered an outlier anyway, and then it does not really matter in which cluster the point will end up.

To summarize, the conclusion is that both unnormalized spectral clustering and normalized spectral clustering with $L_{\mathrm{rw}}$ are well justified by the perturbation theory approach. Normalized spectral clustering with $L_{\mathrm{sym}}$ can also be justified by perturbation theory, but it should be treated with more care if the graph contains vertices with very low degrees.

8 Practical details

In this section we will briefly discuss some of the issues which come up when actually implementing spectral clustering. There are several choices to be made and parameters to be set.
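The row-normalization step and its failure mode on tiny entries can be sketched in a few lines. The matrix and the $\varepsilon$ values below are assumed for illustration only:

```python
import numpy as np

# Assumed eigenvector matrix U: rows 0 and 2 are clean indicator rows,
# row 1 mimics a low-degree vertex whose entries are both tiny.
U = np.array([[1.0, 0.0],
              [1e-6, 2e-6],
              [0.0, 1.0]])

# row-normalization step of Ng et al. (2002)
T = U / np.linalg.norm(U, axis=1, keepdims=True)

assert np.isclose(np.linalg.norm(T[1]), 1.0)   # every row now has unit norm
# the tiny entries have been blown up to order-one values, so the
# cluster membership of the low-degree vertex has become ambiguous
assert T[1, 1] > 0.5
```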
However, the discussion in this section is mainly meant to raise awareness about the general problems which can occur. For thorough studies on the behavior of spectral clustering for various real-world tasks we refer to the literature.

8.1 Constructing the similarity graph

Constructing the similarity graph for spectral clustering is not a trivial task, and little is known on theoretical implications of the various constructions.

The similarity function itself

Before we can even think about constructing a similarity graph, we need to define a similarity function on the data. As we are going to construct a neighborhood graph later on, we need to make sure that the local neighborhoods induced by this similarity function are "meaningful". This means that we need to be sure that points which are considered to be "very similar" by the similarity function are also closely related in the application the data comes from. For example, when constructing a similarity function between text documents it makes sense to check whether documents with a high similarity score indeed belong to the same text category. The global "long-range" behavior of the similarity function is not so important for spectral clustering: it does not really matter whether two data points have similarity score 0.01 or 0.001, say, as we will not connect those two points in the similarity graph anyway. In the common case where the data points live in the Euclidean space $\mathbb{R}^d$, a reasonable default candidate is the Gaussian similarity function $s(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / (2\sigma^2))$ (but of course we need to choose the parameter $\sigma$ here, see below). Ultimately, the choice of the similarity function depends on the domain the data comes from, and no general advice can be given.
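The Gaussian similarity function is straightforward to compute for all pairs at once. The data points and $\sigma = 1$ in this sketch are assumed choices:

```python
import numpy as np

def gaussian_similarity(X, sigma):
    """Pairwise s(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * sigma ** 2))

# Assumed data: two nearby points and one far-away point.
X = np.array([[0., 0.], [0., 1.], [5., 5.]])
S = gaussian_similarity(X, sigma=1.0)

assert np.isclose(S[0, 0], 1.0)           # self-similarity is 1
assert S[0, 1] > S[0, 2]                  # nearby points are more similar
assert np.isclose(S[0, 1], np.exp(-0.5))  # ||x_0 - x_1||^2 = 1
```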
Which type of similarity graph

The next choice one has to make concerns the type of the graph one wants to use, such as the $k$-nearest neighbor or the $\varepsilon$-neighborhood graph. Let us illustrate the behavior of the different graphs using the toy example presented in Figure 3.

Figure 3: Different similarity graphs (panels: data points; $\varepsilon$-neighborhood graph with $\varepsilon = 0.3$; kNN graph with $k = 5$; mutual kNN graph with $k = 5$), see text for details.

As underlying distribution we choose a distribution on $\mathbb{R}^2$ with three clusters: two "moons" and a Gaussian. The density of the bottom moon is chosen to be larger than the one of the top moon. The upper left panel in Figure 3 shows a sample drawn from this distribution. The next three panels show the different similarity graphs on this sample.

In the $\varepsilon$-neighborhood graph, we can see that it is difficult to choose a useful parameter $\varepsilon$. With $\varepsilon = 0.3$ as in the figure, the points on the middle moon are already very tightly connected, while the points in the Gaussian are barely connected. This problem always occurs if we have data "on different scales", that is the distances between data points are different in different regions of the space.

The $k$-nearest neighbor graph, on the other hand, can connect points "on different scales". We can see that points in the low-density Gaussian are connected with points in the high-density moon. This is a general property of $k$-nearest neighbor graphs which can be very useful. We can also see that the $k$-nearest neighbor graph can break into several disconnected components if there are high-density regions which are reasonably far away from each other. This is the case for the two moons in this example.
The mutual k-nearest neighbor graph has the property that it tends to connect points within regions of constant density, but does not connect regions of different densities with each other. So the mutual k-nearest neighbor graph can be considered as being "in between" the ε-neighborhood graph and the k-nearest neighbor graph. It is able to act on different scales, but does not mix those scales with each other. Hence, the mutual k-nearest neighbor graph seems particularly well-suited if we want to detect clusters of different densities.

The fully connected graph is very often used in connection with the Gaussian similarity function s(x_i, x_j) = exp(−‖x_i − x_j‖² / (2σ²)). Here the parameter σ plays a similar role as the parameter ε in the ε-neighborhood graph. Points in local neighborhoods are connected with relatively high weights, while edges between far-away points have positive, but negligible weights. However, the resulting similarity matrix is not a sparse matrix.

As a general recommendation we suggest to work with the k-nearest neighbor graph as the first choice. It is simple to work with, results in a sparse adjacency matrix W, and in our experience is less vulnerable to unsuitable choices of parameters than the other graphs.

The parameters of the similarity graph
Once one has decided for the type of the similarity graph, one has to choose its connectivity parameter k or ε, respectively. Unfortunately, barely any theoretical results are known to guide us in this task. In general, if the similarity graph contains more connected components than the number of clusters we ask the algorithm to detect, then spectral clustering will trivially return connected components as clusters.
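Whether the constructed graph has more connected components than the desired number of clusters is easy to check before running the algorithm. A sketch using SciPy's sparse graph routines (the toy adjacency matrix is our own):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Toy adjacency matrix: an edge 0-1 and an edge 2-3, hence two components.
A = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
n_comp, labels = connected_components(csr_matrix(A), directed=False)
# If n_comp exceeds the number of clusters we ask for, spectral clustering
# will simply return (unions of) connected components as the clusters.
```
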
Unless one is perfectly sure that those connected components are the correct clusters, one should make sure that the similarity graph is connected, or only consists of "few" connected components and very few or no isolated vertices. There are many theoretical results on how connectivity of random graphs can be achieved, but all those results only hold in the limit for the sample size n → ∞. For example, it is known that for n data points drawn i.i.d. from some underlying density with a connected support in R^d, the k-nearest neighbor graph and the mutual k-nearest neighbor graph will be connected if we choose k on the order of log(n) (e.g., Brito, Chavez, Quiroz, and Yukich, 1997). Similar arguments show that the parameter ε in the ε-neighborhood graph has to be chosen on the order of (log(n)/n)^{1/d} to guarantee connectivity in the limit (Penrose, 1999). While being of theoretical interest, all those results do not really help us for choosing k on a finite sample.

Now let us give some rules of thumb. When working with the k-nearest neighbor graph, the connectivity parameter should be chosen such that the resulting graph is connected, or at least has significantly fewer connected components than the number of clusters we want to detect. For small or medium-sized graphs this can be tried out "by foot". For very large graphs, a first approximation could be to choose k on the order of log(n), as suggested by the asymptotic connectivity results. For the mutual k-nearest neighbor graph, we have to admit that we are a bit lost for rules of thumb. The advantage of the mutual k-nearest neighbor graph compared to the standard k-nearest neighbor graph is that it tends not to connect areas of different density.
While this can be good if there are clear clusters induced by separate high-density areas, it can hurt in less obvious situations, as disconnected parts in the graph will always be chosen to be clusters by spectral clustering. Very generally, one can observe that the mutual k-nearest neighbor graph has much fewer edges than the standard k-nearest neighbor graph for the same parameter k. This suggests choosing k significantly larger for the mutual k-nearest neighbor graph than one would do for the standard k-nearest neighbor graph. However, to take advantage of the property that the mutual k-nearest neighbor graph does not connect regions of different density, it would be necessary to allow for several "meaningful" disconnected parts of the graph. Unfortunately, we do not know of any general heuristic to choose the parameter k such that this can be achieved.

For the ε-neighborhood graph, we suggest to choose ε such that the resulting graph is safely connected. Determining the smallest value of ε for which the graph is connected is very simple: one has to choose ε as the length of the longest edge in a minimal spanning tree of the fully connected graph on the data points. The latter can be determined easily by any minimal spanning tree algorithm. However, note that when the data contains outliers this heuristic will choose ε so large that even the outliers are connected to the rest of the data. A similar effect happens when the data contains several tight clusters which are very far apart from each other. In both cases, ε will be chosen too large to reflect the scale of the most important part of the data.
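The minimal-spanning-tree heuristic for ε can be sketched in a few lines with SciPy (the helper name is our own): the smallest ε for which the ε-neighborhood graph is connected equals the longest edge of an MST of the complete distance graph. The outlier caveat from the text is visible in the example below, where one far-away point inflates ε for the whole data set.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def eps_for_connectivity(X):
    """Smallest eps such that the eps-neighborhood graph on X is connected:
    the length of the longest edge in a minimum spanning tree of the
    fully connected distance graph."""
    D = squareform(pdist(X))
    mst = minimum_spanning_tree(D)   # sparse matrix holding the MST edge lengths
    return mst.data.max()

X = np.array([[0.0], [1.0], [2.0], [10.0]])   # the last point is an "outlier"
eps = eps_for_connectivity(X)                  # 8.0, dominated by the outlier
```
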
Finally, if one uses a fully connected graph together with a similarity function which can be scaled itself, for example the Gaussian similarity function, then the scale of the similarity function should be chosen such that the resulting graph has similar properties as a corresponding k-nearest neighbor or ε-neighborhood graph would have. One needs to make sure that for most data points the set of neighbors with a similarity significantly larger than 0 is "not too small and not too large". In particular, for the Gaussian similarity function several rules of thumb are frequently used. For example, one can choose σ on the order of the mean distance of a point to its k-th nearest neighbor, where k is chosen similarly as above (e.g., k ~ log(n) + 1). Another way is to determine ε by the minimal spanning tree heuristic described above, and then choose σ = ε. But note that all those rules of thumb are very ad hoc, and depending on the given data at hand and its distribution of inter-point distances they might not work at all.

In general, experience shows that spectral clustering can be quite sensitive to changes in the similarity graph and to the choice of its parameters. Unfortunately, to our knowledge there has been no systematic study which investigates the effects of the similarity graph and its parameters on clustering and comes up with well-justified rules of thumb. None of the recommendations above is based on firm theoretical ground. Finding rules which have a theoretical justification should be considered an interesting and important topic for future research.

8.2 Computing the eigenvectors

To implement spectral clustering in practice one has to compute the first k eigenvectors of a potentially large graph Laplacian matrix. Luckily, if we use the k-nearest neighbor graph or the ε-neighborhood graph, then all those matrices are sparse.
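Because these matrices are sparse, the first eigenvectors can be obtained with a sparse (Lanczos-type) eigensolver instead of a full eigendecomposition. Below is a sketch using scipy.sparse.linalg.eigsh on a toy graph of our own, two disconnected triangles: eigenvalue 0 then has multiplicity two, and the solver returns some orthonormal basis of the indicator-vector eigenspace, so each returned vector is still piecewise constant on the two triangles.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import eigsh

# Two disconnected triangles: vertices {0, 1, 2} and {3, 4, 5}.
W = np.zeros((6, 6))
for a, b in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    W[a, b] = W[b, a] = 1.0
L = np.diag(W.sum(axis=1)) - W                   # unnormalized Laplacian

# Lanczos iterations for the 2 smallest eigenpairs of the sparse matrix.
vals, vecs = eigsh(csr_matrix(L), k=2, which='SM')
# vals ~ [0, 0]; each column of vecs is constant on each triangle,
# i.e., some orthonormal basis of span{1_{A_1}, 1_{A_2}}.
```
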
Efficient methods exist to compute the first eigenvectors of sparse matrices, the most popular ones being the power method or Krylov subspace methods such as the Lanczos method (Golub and Van Loan, 1996). The speed of convergence of those algorithms depends on the size of the eigengap (also called spectral gap) γ_k = |λ_k − λ_{k+1}|. The larger this eigengap is, the faster the algorithms computing the first k eigenvectors converge.

Note that a general problem occurs if one of the eigenvalues under consideration has multiplicity larger than one. For example, in the ideal situation of k disconnected clusters, the eigenvalue 0 has multiplicity k. As we have seen, in this case the eigenspace is spanned by the k cluster indicator vectors. But unfortunately, the vectors computed by the numerical eigensolvers do not necessarily converge to those particular vectors. Instead they just converge to some orthonormal basis of the eigenspace, and it usually depends on implementation details to which basis exactly the algorithm converges. But this is not so bad after all. Note that all vectors in the space spanned by the cluster indicator vectors 1_{A_i} have the form u = Σ_{i=1}^k a_i 1_{A_i} for some coefficients a_i, that is, they are piecewise constant on the clusters. So the vectors returned by the eigensolvers still encode the information about the clusters, which can then be used by the k-means algorithm to reconstruct the clusters.

8.3 The number of clusters

Choosing the number k of clusters is a general problem for all clustering algorithms, and a variety of more or less successful methods have been devised for this problem. In model-based clustering settings there exist well-justified criteria to choose the number of clusters from the data. Those criteria are usually based on the log-likelihood of the data, which can then be treated in a frequentist or Bayesian way; for examples see Fraley and Raftery (2002).
In settings where no or few assumptions on the underlying model are made, a large variety of different indices can be used to pick the number of clusters. Examples range from ad hoc measures such as the ratio of within-cluster and between-cluster similarities, over information-theoretic criteria (Still and Bialek, 2004) and the gap statistic (Tibshirani, Walther, and Hastie, 2001), to stability approaches (Ben-Hur, Elisseeff, and Guyon, 2002; Lange, Roth, Braun, and Buhmann, 2004; Ben-David, von Luxburg, and Pál, 2006). Of course all those methods can also be used for spectral clustering.

[Figure 4: Three data sets (histograms of the samples), and the smallest 10 eigenvalues of L_rw. See text for more details.]

Additionally, one tool which is particularly designed for spectral clustering is the eigengap heuristic, which can be used for all three graph Laplacians. Here the goal is to choose the number k such that all eigenvalues λ_1, ..., λ_k are very small, but λ_{k+1} is relatively large. There are several justifications for this procedure. The first one is based on perturbation theory, where we observe that in the ideal case of k completely disconnected clusters, the eigenvalue 0 has multiplicity k, and then there is a gap to the (k+1)-th eigenvalue λ_{k+1} > 0. Other explanations can be given by spectral graph theory. Here, many geometric invariants of the graph can be expressed or bounded with the help of the first eigenvalues of the graph Laplacian. In particular, the sizes of cuts are closely related to the size of the first eigenvalues. For more details on this topic we refer to Bolla (1991), Mohar (1997) and Chung (1997).
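Once the smallest eigenvalues are available, the eigengap heuristic itself is a one-liner. A sketch (the function name and the illustrative eigenvalue list are our own):

```python
import numpy as np

def eigengap_heuristic(eigenvalues, max_k=10):
    """Pick k such that lambda_1, ..., lambda_k are small but the gap to
    lambda_{k+1} is large: the position of the largest difference between
    consecutive eigenvalues (sorted ascending)."""
    lam = np.sort(np.asarray(eigenvalues))[:max_k]
    gaps = np.diff(lam)               # gaps[i] = lambda_{i+2} - lambda_{i+1}
    return int(np.argmax(gaps)) + 1   # +1 converts the gap index to a cluster count

# Four near-zero eigenvalues followed by a clear jump suggest k = 4,
# as for the first data set of Figure 4.
lam = [0.0, 0.001, 0.002, 0.003, 0.25, 0.30, 0.32]
k = eigengap_heuristic(lam)
```
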
We would like to illustrate the eigengap heuristic on our toy example introduced in Section 4. For this purpose we consider similar data sets as in Section 4, but to vary the difficulty of clustering we consider the Gaussians with increasing variance. The first row of Figure 4 shows the histograms of the three samples. We construct the 10-nearest neighbor graph as described in Section 4, and plot the eigenvalues of the normalized Laplacian L_rw on the different samples (the results for the unnormalized Laplacian are similar). The first data set consists of four well-separated clusters, and we can see that the first 4 eigenvalues are approximately 0. Then there is a gap between the 4th and 5th eigenvalue, that is, |λ_5 − λ_4| is relatively large. According to the eigengap heuristic, this gap indicates that the data set contains 4 clusters. The same behavior can also be observed for the results of the fully connected graph (already plotted in Figure 1). So we can see that the heuristic works well if the clusters in the data are very well pronounced. However, the more noisy or overlapping the clusters are, the less effective this heuristic is. We can see that for the second data set, where the clusters are more "blurry", there is still a gap between the 4th and 5th eigenvalue, but it is not as clear to detect as in the case before. Finally, in the last data set there is no well-defined gap; the differences between all eigenvalues are approximately the same. But on the other hand, the clusters in this data set overlap so much that many non-parametric algorithms will have difficulties detecting the clusters, unless they make strong assumptions on the underlying model. In this particular example, even for a human looking at the histogram it is not obvious what the correct number of clusters should be.
This illustrates that, like most methods for choosing the number of clusters, the eigengap heuristic usually works well if the data contains very well pronounced clusters, but in ambiguous cases it also returns ambiguous results.

Finally, note that the choice of the number of clusters and the choice of the connectivity parameters of the neighborhood graph affect each other. For example, if the connectivity parameter of the neighborhood graph is so small that the graph breaks into, say, k_0 connected components, then choosing k_0 as the number of clusters is a valid choice. However, as soon as the neighborhood graph is connected, it is not clear how the number of clusters and the connectivity parameters of the neighborhood graph interact. Both the choice of the number of clusters and the choice of the connectivity parameters of the graph are difficult problems on their own, and to our knowledge nothing non-trivial is known on their interactions.

8.4 The k-means step

The three spectral clustering algorithms we presented in Section 4 use k-means as the last step to extract the final partition from the real-valued matrix of eigenvectors. First of all, note that there is nothing principled about using the k-means algorithm in this step. In fact, as we have seen from the various explanations of spectral clustering, this step should be very simple if the data contains well-expressed clusters. For example, in the ideal case of completely separated clusters we know that the eigenvectors of L and L_rw are piecewise constant. In this case, all points x_i which belong to the same cluster C_s are mapped to exactly the same point y_i, namely to the unit vector e_s ∈ R^k. In such a trivial case, any clustering algorithm applied to the points y_i ∈ R^k will be able to extract the correct clusters.
While it is somewhat arbitrary what clustering algorithm exactly one chooses in the final step of spectral clustering, one can argue that at least the Euclidean distance between the points y_i is a meaningful quantity to look at. We have seen that the Euclidean distance between the points y_i is related to the "commute distance" on the graph, and in Nadler, Lafon, Coifman, and Kevrekidis (2006) the authors show that the Euclidean distances between the y_i are also related to a more general "diffusion distance". Also, other uses of the spectral embeddings (e.g., Bolla (1991) or Belkin and Niyogi (2003)) show that the Euclidean distance in R^d is meaningful.

Instead of k-means, people also use other techniques to construct the final solution from the real-valued representation. For example, in Lang (2006) the authors use hyperplanes for this purpose. A more advanced post-processing of the eigenvectors is proposed in Bach and Jordan (2004). Here the authors study the subspace spanned by the first k eigenvectors, and try to approximate this subspace as well as possible using piecewise constant vectors. This also leads to minimizing certain Euclidean distances in the space R^k, which can be done by some weighted k-means algorithm.

8.5 Which graph Laplacian should be used?

A fundamental question related to spectral clustering is which of the three graph Laplacians should be used to compute the eigenvectors. Before deciding this question, one should always look at the degree distribution of the similarity graph. If the graph is very regular and most vertices have approximately the same degree, then all the Laplacians are very similar to each other, and will work equally well for clustering. However, if the degrees in the graph are very broadly distributed, then the Laplacians differ considerably.
In our opinion, there are several arguments which advocate for using normalized rather than unnormalized spectral clustering, and in the normalized case to use the eigenvectors of L_rw rather than those of L_sym.

Clustering objectives satisfied by the different algorithms
The first argument in favor of normalized spectral clustering comes from the graph partitioning point of view. For simplicity let us discuss the case k = 2. In general, clustering has two different objectives:

1. We want to find a partition such that points in different clusters are dissimilar to each other, that is, we want to minimize the between-cluster similarity. In the graph setting, this means minimizing cut(A, Ā).

2. We want to find a partition such that points in the same cluster are similar to each other, that is, we want to maximize the within-cluster similarities W(A, A) and W(Ā, Ā).

Both RatioCut and Ncut directly implement the first objective by explicitly incorporating cut(A, Ā) in the objective function. However, concerning the second point, both algorithms behave differently. Note that

W(A, A) = W(A, V) − W(A, Ā) = vol(A) − cut(A, Ā).

Hence, the within-cluster similarity is maximized if cut(A, Ā) is small and if vol(A) is large. As this is exactly what we achieve by minimizing Ncut, the Ncut criterion implements the second objective. This can be seen even more explicitly by considering yet another graph cut objective function, namely the MinMaxCut criterion introduced by Ding, He, Zha, Gu, and Simon (2001):

MinMaxCut(A_1, ..., A_k) := Σ_{i=1}^k cut(A_i, Ā_i) / W(A_i, A_i).

Compared to Ncut, which has the terms vol(A) = cut(A, Ā) + W(A, A) in the denominator, the MinMaxCut criterion only has W(A, A) in the denominator.
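The identity W(A, A) = vol(A) − cut(A, Ā) used above is easy to verify numerically. The following is a small sketch on a weighted toy graph (all names and the example weights are our own), which also evaluates the Ncut and MinMaxCut terms for one set A:

```python
import numpy as np

def cut(W, in_A):
    """cut(A, A-bar): total weight of edges crossing the partition."""
    return W[np.ix_(in_A, ~in_A)].sum()

def vol(W, in_A):
    """vol(A): sum of the degrees of the vertices in A."""
    return W[in_A].sum()

def within(W, in_A):
    """W(A, A): total weight of edges inside A (each edge counted in both directions)."""
    return W[np.ix_(in_A, in_A)].sum()

W = np.array([[0, 2, 1, 0],
              [2, 0, 0, 1],
              [1, 0, 0, 3],
              [0, 1, 3, 0]], dtype=float)
in_A = np.array([True, True, False, False])     # A = {0, 1}

# Within-cluster similarity equals vol(A) - cut(A, A-bar): 4 = 6 - 2.
ncut_term = cut(W, in_A) / vol(W, in_A)         # Ncut uses vol(A) in the denominator
minmax_term = cut(W, in_A) / within(W, in_A)    # MinMaxCut uses W(A, A) instead
```
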
In practice, Ncut and MinMaxCut are often minimized by similar cuts, as a good Ncut solution will have a small value of cut(A, Ā) anyway and hence the denominators are not so different after all. Moreover, relaxing MinMaxCut leads to exactly the same optimization problem as relaxing Ncut, namely to normalized spectral clustering with the eigenvectors of L_rw. So one can see in several ways that normalized spectral clustering incorporates both clustering objectives mentioned above.

Now consider the case of RatioCut. Here the objective is to maximize |A| and |Ā| instead of vol(A) and vol(Ā). But |A| and |Ā| are not necessarily related to the within-cluster similarity, as the within-cluster similarity depends on the edges and not on the number of vertices in A. For instance, just think of a set A which has very many vertices, all of which only have very low-weighted edges to each other. Minimizing RatioCut does not attempt to maximize the within-cluster similarity, and the same is then true for its relaxation by unnormalized spectral clustering.

So this is our first important point to keep in mind: Normalized spectral clustering implements both clustering objectives mentioned above, while unnormalized spectral clustering only implements the first objective.

Consistency issues
A completely different argument for the superiority of normalized spectral clustering comes from a statistical analysis of both algorithms. In a statistical setting one assumes that the data points x_1, ..., x_n have been sampled i.i.d. according to some probability distribution P on some underlying data space X. The most fundamental question is then the question of consistency: if we draw more and more data points, do the clustering results of spectral clustering converge to a useful partition of the underlying space X?
For both normalized spectral clustering algorithms, it can be proved that this is indeed the case (von Luxburg, Bousquet, and Belkin, 2004, 2005; von Luxburg, Belkin, and Bousquet, to appear). Mathematically, one proves that as we take the limit n → ∞, the matrix L_sym converges in a strong sense to an operator U on the space C(X) of continuous functions on X. This convergence implies that the eigenvalues and eigenvectors of L_sym converge to those of U, which in turn can be transformed to a statement about the convergence of normalized spectral clustering. One can show that the partition which is induced on X by the eigenvectors of U can be interpreted similarly to the random walks interpretation of spectral clustering. That is, if we consider a diffusion process on the data space X, then the partition induced by the eigenvectors of U is such that the diffusion does not transition between the different clusters very often (von Luxburg et al., 2004).

[Figure 5: Consistency of unnormalized spectral clustering. Plotted are eigenvalues and eigenvectors of L, for parameter σ = 2 (first row) and σ = 5 (second row). The dashed line indicates min d_j; the eigenvalues below min d_j are plotted as red diamonds, the eigenvalues above min d_j are plotted as blue stars. See text for more details.]
All consistency statements about normalized spectral clustering hold, for both L_sym and L_rw, under very mild conditions which are usually satisfied in real-world applications. Unfortunately, explaining more details about those results goes beyond the scope of this tutorial, so we refer the interested reader to von Luxburg et al. (to appear).

In contrast to the clear convergence statements for normalized spectral clustering, the situation for unnormalized spectral clustering is much more unpleasant. It can be proved that unnormalized spectral clustering can fail to converge, or that it can converge to trivial solutions which construct clusters consisting of one single point of the data space (von Luxburg et al., 2005, to appear). Mathematically, even though one can prove that the matrix (1/n)L itself converges to some limit operator T on C(X) as n → ∞, the spectral properties of this limit operator T can be so nasty that they prevent the convergence of spectral clustering. It is possible to construct examples which show that this is not only a problem for very large sample size, but that it can lead to completely unreliable results even for small sample size. At least it is possible to characterize the conditions when those problems do not occur: we have to make sure that the eigenvalues of L corresponding to the eigenvectors used in unnormalized spectral clustering are significantly smaller than the minimal degree in the graph. This means that if we use the first k eigenvectors for clustering, then

λ_i ≪ min_{j=1,...,n} d_j

should hold for all i = 1, ..., k. The mathematical reason for this condition is that eigenvectors corresponding to eigenvalues larger than min_j d_j approximate Dirac functions, that is, they are approximately 0 in all but one coordinate.
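The condition above can be checked directly on a given graph by counting how many eigenvalues of L lie below the minimal degree. A sketch (the function name is our own), using a toy graph of two disconnected triangles: the unnormalized Laplacian of each triangle has eigenvalues 0, 3, 3 and every degree equals 2, so exactly the two zero eigenvalues qualify.

```python
import numpy as np

def count_usable_eigenvalues(W):
    """Number of eigenvalues of the unnormalized Laplacian L that lie
    below the minimal degree min_j d_j."""
    d = W.sum(axis=1)
    L = np.diag(d) - W
    lam = np.linalg.eigvalsh(L)       # ascending eigenvalues of the symmetric L
    return int((lam < d.min()).sum())

W = np.zeros((6, 6))
for a, b in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    W[a, b] = W[b, a] = 1.0
usable = count_usable_eigenvalues(W)  # 2: only the two zero eigenvalues
```
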
If those eigenvectors are used for clustering, then they separate the one vertex where the eigenvector is non-zero from all other vertices, and we clearly do not want to construct such a partition. Again we refer to the literature for precise statements and proofs.

For an illustration of this phenomenon, consider again our toy data set from Section 4. We consider the first eigenvalues and eigenvectors of the unnormalized graph Laplacian based on the fully connected graph, for different choices of the parameter σ of the Gaussian similarity function (see last row of Figure 1 and all rows of Figure 5). The eigenvalues above min_j d_j are plotted as blue stars, the eigenvalues below min_j d_j are plotted as red diamonds. The dashed line indicates min_j d_j. In general, we can see that the eigenvectors corresponding to eigenvalues which are much below the dashed line are "useful" eigenvectors. In case σ = 1 (plotted already in the last row of Figure 1), Eigenvalues 2, 3 and 4 are significantly below min_j d_j, and the corresponding Eigenvectors 2, 3, and 4 are meaningful (as already discussed in Section 4). If we increase the parameter σ, we can observe that the eigenvalues tend to move towards min_j d_j. In case σ = 2, only the first three eigenvalues are below min_j d_j (first row in Figure 5), and in case σ = 5 only the first two eigenvalues are below min_j d_j (second row in Figure 5). We can see that as soon as an eigenvalue gets close to or above min_j d_j, its corresponding eigenvector approximates a Dirac function. Of course, those eigenvectors are unsuitable for constructing a clustering. In the limit for n → ∞, those eigenvectors would converge to perfect Dirac functions. Our illustration of the finite sample case shows that this behavior not only occurs for large sample size, but can be generated even on the small example in our toy data set.
It is very important to stress that those problems only concern the eigenvectors of the matrix L; they do not occur for L_rw or L_sym. Thus, from a statistical point of view, it is preferable to avoid unnormalized spectral clustering and to use the normalized algorithms instead.

Which normalized Laplacian?
Looking at the differences between the two normalized spectral clustering algorithms using L_rw and L_sym, all three explanations of spectral clustering are in favor of L_rw. The reason is that the eigenvectors of L_rw are cluster indicator vectors 1_{A_i}, while the eigenvectors of L_sym are additionally multiplied with D^{1/2}, which might lead to undesired artifacts. As using L_sym also does not have any computational advantages, we thus advocate for using L_rw.

9 Outlook and further reading

Spectral clustering goes back to Donath and Hoffman (1973), who first suggested to construct graph partitions based on eigenvectors of the adjacency matrix. In the same year, Fiedler (1973) discovered that bi-partitions of a graph are closely connected with the second eigenvector of the graph Laplacian, and he suggested to use this eigenvector to partition a graph. Since then, spectral clustering has been discovered, re-discovered, and extended many times in different communities; see for example Pothen, Simon, and Liou (1990), Simon (1991), Bolla (1991), Hagen and Kahng (1992), Hendrickson and Leland (1995), Van Driessche and Roose (1995), Barnard, Pothen, and Simon (1995), Spielman and Teng (1996), Guattery and Miller (1998). A nice overview over the history of spectral clustering can be found in Spielman and Teng (1996). In the machine learning community, spectral clustering has been made popular by the works of Shi and Malik (2000), Ng et al. (2002), Meila and Shi (2001), and Ding (2004).
Subsequently, spectral clustering has been extended to many non-standard settings, for example spectral clustering applied to the co-clustering problem (Dhillon, 2001), spectral clustering with additional side information (Joachims, 2003), connections between spectral clustering and the weighted kernel k-means algorithm (Dhillon, Guan, and Kulis, 2005), learning similarity functions based on spectral clustering (Bach and Jordan, 2004), or spectral clustering in a distributed environment (Kempe and McSherry, 2004). Also, new theoretical insights about the relation of spectral clustering to other algorithms have been found. A link between spectral clustering and the weighted kernel k-means algorithm is described in Dhillon et al. (2005). Relations between spectral clustering and (kernel) principal component analysis rely on the fact that the smallest eigenvectors of graph Laplacians can also be interpreted as the largest eigenvectors of kernel matrices (Gram matrices). Two different flavors of this interpretation exist: while Bengio et al. (2004) interpret the matrix D^{−1/2} W D^{−1/2} as a kernel matrix, other authors (Saerens, Fouss, Yen, and Dupont, 2004) interpret the Moore-Penrose inverses of L or L_sym as kernel matrices. Both interpretations can be used to construct (different) out-of-sample extensions for spectral clustering. Concerning application cases of spectral clustering, in the last few years such a huge number of papers has been published in various scientific areas that it is impossible to cite all of them. We encourage the reader to query his favorite literature database with the phrase "spectral clustering" to get an impression of the variety of applications.

The success of spectral clustering is mainly based on the fact that it does not make strong assumptions on the form of the clusters.
As opposed to k-means, where the resulting clusters form convex sets (or, to be precise, lie in disjoint convex sets of the underlying space), spectral clustering can solve very general problems like intertwined spirals. Moreover, spectral clustering can be implemented efficiently even for large data sets, as long as we make sure that the similarity graph is sparse. Once the similarity graph is chosen, we just have to solve a linear problem, and there are no issues of getting stuck in local minima or restarting the algorithm several times with different initializations. However, we have already mentioned that choosing a good similarity graph is not trivial, and spectral clustering can be quite unstable under different choices of the parameters for the neighborhood graphs. So spectral clustering cannot serve as a "black box algorithm" which automatically detects the correct clusters in any given data set. But it can be considered as a powerful tool which can produce good results if applied with care.

In the field of machine learning, graph Laplacians are not only used for clustering, but also emerge in many other tasks such as semi-supervised learning (e.g., Chapelle, Schölkopf, and Zien, 2006 for an overview) or manifold reconstruction (e.g., Belkin and Niyogi, 2003). In most applications, graph Laplacians are used to encode the assumption that data points which are "close" (i.e., w_ij is large) should have a "similar" label (i.e., f_i ≈ f_j). A function f satisfies this assumption if w_ij (f_i − f_j)² is small for all i, j, that is, if f′Lf is small. With this intuition one can use the quadratic form f′Lf as a regularizer in a transductive classification problem. One other way to interpret the use of graph Laplacians is by the smoothness assumptions they encode.
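The regularizer f′Lf can be evaluated directly. A small sketch on a path graph (the example is our own) shows the smoothness intuition: a labeling that is constant inside two blocks pays only for the one edge it cuts, while a labeling that flips sign at every edge pays for all of them.

```python
import numpy as np

def laplacian_quadratic_form(W, f):
    """f' L f = 1/2 * sum_ij w_ij (f_i - f_j)^2 for the unnormalized Laplacian."""
    L = np.diag(W.sum(axis=1)) - W
    return float(f @ L @ f)

# Path graph 0 - 1 - 2 - 3 with unit weights.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

smooth = laplacian_quadratic_form(W, np.array([1.0, 1.0, -1.0, -1.0]))  # cuts 1 edge -> 4
rough = laplacian_quadratic_form(W, np.array([1.0, -1.0, 1.0, -1.0]))   # flips at 3 edges -> 12
```
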
A function f which has a low value of f′Lf has the property that it varies only "a little bit" in regions where the data points lie dense (i.e., the graph is tightly connected), whereas it is allowed to vary more (e.g., to change its sign) in regions of low data density. In this sense, a small value of f′Lf encodes the so-called "cluster assumption" in semi-supervised learning, which requests that the decision boundary of a classifier should lie in a region of low density.

An intuition often used is that graph Laplacians formally look like a continuous Laplace operator (and this is also where the name "graph Laplacian" comes from). To see this, transform a local similarity w_ij to a distance d_ij by the relationship w_ij = 1/d_ij² and observe that

w_ij (f_i − f_j)² ≈ ((f_i − f_j) / d_ij)²

looks like a difference quotient. As a consequence, the equation f′Lf = (1/2) Σ_ij w_ij (f_i − f_j)² from Proposition 1 looks like a discrete version of the quadratic form associated to the standard Laplace operator L on R^n, which satisfies

⟨g, Lg⟩ = ∫ |∇g|² dx.

This intuition has been made precise in the works of Belkin (2003), Lafon (2004), Hein, Audibert, and von Luxburg (2005, 2007), Belkin and Niyogi (2005), Hein (2006), and Giné and Koltchinskii (2005). In general, it is proved that graph Laplacians are discrete versions of certain continuous Laplace operators, and that if the graph Laplacian is constructed on a similarity graph of randomly sampled data points, then it converges to some continuous Laplace operator (or Laplace-Beltrami operator) on the underlying space. Belkin (2003) studied the first important step of the convergence proof, which deals with the convergence of a continuous operator related to discrete graph Laplacians to the Laplace-Beltrami operator. His results were generalized from uniform distributions to general distributions by Lafon (2004).
Then in Belkin and Niyogi (2005), the authors prove pointwise convergence results for the unnormalized graph Laplacian using the Gaussian similarity function on manifolds with uniform distribution. At the same time, Hein et al. (2005) prove more general results, taking into account all the different graph Laplacians L, L_rw, and L_sym, more general similarity functions, and manifolds with arbitrary distributions. In Giné and Koltchinskii (2005), distributional and uniform convergence results are proved on manifolds with uniform distribution. Hein (2006) studies the convergence of the smoothness functional induced by the graph Laplacians and shows uniform convergence results.

Apart from applications of graph Laplacians to partitioning problems in the widest sense, graph Laplacians can also be used for completely different purposes, for example for graph drawing (Koren, 2005). In fact, there are many more tight connections between the topology and properties of graphs and the graph Laplacian matrices than we have mentioned in this tutorial. Now equipped with an understanding of the most basic properties, the interested reader is invited to further explore and enjoy the huge literature in this field on his own.

References

Aldous, D. and Fill, J. (in preparation). Reversible Markov Chains and Random Walks on Graphs. Online version available at http://www.stat.berkeley.edu/users/aldous/RWG/book.html.

Bach, F. and Jordan, M. (2004). Learning spectral clustering. In S. Thrun, L. Saul, and B. Schölkopf (Eds.), Advances in Neural Information Processing Systems 16 (NIPS) (pp. 305 – 312). Cambridge, MA: MIT Press.

Bapat, R., Gutman, I., and Xiao, W. (2003). A simple method for computing resistance distance. Z. Naturforsch., 58, 494 – 498.

Barnard, S., Pothen, A., and Simon, H. (1995). A spectral algorithm for envelope reduction of sparse matrices. Numerical Linear Algebra with Applications, 2(4), 317 – 334.
Belkin, M. (2003). Problems of Learning on Manifolds. PhD Thesis, University of Chicago.

Belkin, M. and Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6), 1373 – 1396.

Belkin, M. and Niyogi, P. (2005). Towards a theoretical foundation for Laplacian-based manifold methods. In P. Auer and R. Meir (Eds.), Proceedings of the 18th Annual Conference on Learning Theory (COLT) (pp. 486 – 500). Springer, New York.

Ben-David, S., von Luxburg, U., and Pál, D. (2006). A sober look on clustering stability. In G. Lugosi and H. Simon (Eds.), Proceedings of the 19th Annual Conference on Learning Theory (COLT) (pp. 5 – 19). Springer, Berlin.

Bengio, Y., Delalleau, O., Roux, N., Paiement, J., Vincent, P., and Ouimet, M. (2004). Learning eigenfunctions links spectral embedding and kernel PCA. Neural Computation, 16, 2197 – 2219.

Ben-Hur, A., Elisseeff, A., and Guyon, I. (2002). A stability based method for discovering structure in clustered data. In Pacific Symposium on Biocomputing (pp. 6 – 17).

Bhatia, R. (1997). Matrix Analysis. Springer, New York.

Bie, T. D. and Cristianini, N. (2006). Fast SDP relaxations of graph cut clustering, transduction, and other combinatorial problems. JMLR, 7, 1409 – 1436.

Bolla, M. (1991). Relations between spectral and classification properties of multigraphs (Technical Report No. DIMACS-91-27). Center for Discrete Mathematics and Theoretical Computer Science.

Brémaud, P. (1999). Markov chains: Gibbs fields, Monte Carlo simulation, and queues. New York: Springer-Verlag.

Brito, M., Chavez, E., Quiroz, A., and Yukich, J. (1997). Connectivity of the mutual k-nearest-neighbor graph in clustering and outlier detection. Statistics and Probability Letters, 35, 33 – 42.

Bui, T. N. and Jones, C. (1992). Finding good approximate vertex and edge partitions is NP-hard. Inf. Process. Lett., 42(3), 153 – 159.

Chapelle, O., Schölkopf, B., and Zien, A. (Eds.). (2006). Semi-Supervised Learning. MIT Press, Cambridge.

Chung, F. (1997). Spectral graph theory (Vol. 92 of the CBMS Regional Conference Series in Mathematics). Conference Board of the Mathematical Sciences, Washington.

Dhillon, I. (2001). Co-clustering documents and words using bipartite spectral graph partitioning. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining (KDD) (pp. 269 – 274). New York: ACM Press.

Dhillon, I., Guan, Y., and Kulis, B. (2005). A unified view of kernel k-means, spectral clustering, and graph partitioning (Technical Report No. UTCS TR-04-25). University of Texas at Austin.

Ding, C. (2004). A tutorial on spectral clustering. Talk presented at ICML. (Slides available at http://crd.lbl.gov/~cding/Spectral/)

Ding, C., He, X., Zha, H., Gu, M., and Simon, H. (2001). A min-max cut algorithm for graph partitioning and data clustering. In Proceedings of the first IEEE International Conference on Data Mining (ICDM) (pp. 107 – 114). Washington, DC, USA: IEEE Computer Society.

Donath, W. E. and Hoffman, A. J. (1973). Lower bounds for the partitioning of graphs. IBM J. Res. Develop., 17, 420 – 425.

Fiedler, M. (1973). Algebraic connectivity of graphs. Czechoslovak Math. J., 23, 298 – 305.

Fouss, F., Pirotte, A., Renders, J.-M., and Saerens, M. (2007). Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation. IEEE Trans. Knowl. Data Eng., 19(3), 355 – 369.

Fraley, C. and Raftery, A. E. (2002). Model-based clustering, discriminant analysis, and density estimation. JASA, 97, 611 – 631.

Giné, E. and Koltchinskii, V. (2005). Empirical graph Laplacian approximation of Laplace-Beltrami operators: large sample results. In Proceedings of the 4th International Conference on High Dimensional Probability (pp. 238 – 259).

Golub, G. and Van Loan, C. (1996). Matrix computations. Baltimore: Johns Hopkins University Press.

Guattery, S. and Miller, G. (1998). On the quality of spectral separators. SIAM Journal of Matrix Anal. Appl., 19(3), 701 – 719.

Gutman, I. and Xiao, W. (2004). Generalized inverse of the Laplacian matrix and some applications. Bulletin de l'Academie Serbe des Sciences at des Arts (Cl. Math. Natur.), 129, 15 – 23.

Hagen, L. and Kahng, A. (1992). New spectral methods for ratio cut partitioning and clustering. IEEE Trans. Computer-Aided Design, 11(9), 1074 – 1085.

Hastie, T., Tibshirani, R., and Friedman, J. (2001). The elements of statistical learning. New York: Springer.

Hein, M. (2006). Uniform convergence of adaptive graph-based regularization. In Proceedings of the 19th Annual Conference on Learning Theory (COLT) (pp. 50 – 64). Springer, New York.

Hein, M., Audibert, J.-Y., and von Luxburg, U. (2005). From graphs to manifolds - weak and strong pointwise consistency of graph Laplacians. In P. Auer and R. Meir (Eds.), Proceedings of the 18th Annual Conference on Learning Theory (COLT) (pp. 470 – 485). Springer, New York.

Hein, M., Audibert, J.-Y., and von Luxburg, U. (2007). Graph Laplacians and their convergence on random neighborhood graphs. JMLR, 8, 1325 – 1370.

Hendrickson, B. and Leland, R. (1995). An improved spectral graph partitioning algorithm for mapping parallel computations. SIAM J. on Scientific Computing, 16, 452 – 469.

Joachims, T. (2003). Transductive learning via spectral graph partitioning. In T. Fawcett and N. Mishra (Eds.), Proceedings of the 20th international conference on machine learning (ICML) (pp. 290 – 297). AAAI Press.

Kannan, R., Vempala, S., and Vetta, A. (2004). On clusterings: Good, bad and spectral. Journal of the ACM, 51(3), 497 – 515.

Kempe, D. and McSherry, F. (2004). A decentralized algorithm for spectral analysis. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing (STOC) (pp. 561 – 568). New York, NY, USA: ACM Press.

Klein, D. and Randic, M. (1993). Resistance distance. Journal of Mathematical Chemistry, 12, 81 – 95.

Koren, Y. (2005). Drawing graphs by eigenvectors: theory and practice. Computers and Mathematics with Applications, 49, 1867 – 1888.

Lafon, S. (2004). Diffusion maps and geometric harmonics. PhD Thesis, Yale University.

Lang, K. (2006). Fixing two weaknesses of the spectral method. In Y. Weiss, B. Schölkopf, and J. Platt (Eds.), Advances in Neural Information Processing Systems 18 (pp. 715 – 722). Cambridge, MA: MIT Press.

Lange, T., Roth, V., Braun, M., and Buhmann, J. (2004). Stability-based validation of clustering solutions. Neural Computation, 16(6), 1299 – 1323.

Lovász, L. (1993). Random walks on graphs: a survey. In Combinatorics, Paul Erdős is eighty (pp. 353 – 397). Budapest: János Bolyai Math. Soc.

Lütkepohl, H. (1997). Handbook of Matrices. Chichester: Wiley.

Meila, M. and Shi, J. (2001). A random walks view of spectral segmentation. In 8th International Workshop on Artificial Intelligence and Statistics (AISTATS).

Mohar, B. (1991). The Laplacian spectrum of graphs. In Graph theory, combinatorics, and applications. Vol. 2 (Kalamazoo, MI, 1988) (pp. 871 – 898). New York: Wiley.

Mohar, B. (1997). Some applications of Laplace eigenvalues of graphs. In G. Hahn and G. Sabidussi (Eds.), Graph Symmetry: Algebraic Methods and Applications (Vol. NATO ASI Ser. C 497, pp. 225 – 275). Kluwer.

Nadler, B., Lafon, S., Coifman, R., and Kevrekidis, I. (2006). Diffusion maps, spectral clustering and eigenfunctions of Fokker-Planck operators. In Y. Weiss, B. Schölkopf, and J. Platt (Eds.), Advances in Neural Information Processing Systems 18 (pp. 955 – 962). Cambridge, MA: MIT Press.

Ng, A., Jordan, M., and Weiss, Y. (2002). On spectral clustering: analysis and an algorithm. In T. Dietterich, S. Becker, and Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems 14 (pp. 849 – 856). MIT Press.

Norris, J. (1997). Markov Chains. Cambridge: Cambridge University Press.

Penrose, M. (1999). A strong law for the longest edge of the minimal spanning tree. Ann. of Prob., 27(1), 246 – 260.

Pothen, A., Simon, H. D., and Liou, K. P. (1990). Partitioning sparse matrices with eigenvectors of graphs. SIAM Journal of Matrix Anal. Appl., 11, 430 – 452.

Saerens, M., Fouss, F., Yen, L., and Dupont, P. (2004). The principal components analysis of a graph, and its relationships to spectral clustering. In Proceedings of the 15th European Conference on Machine Learning (ECML) (pp. 371 – 383). Springer, Berlin.

Shi, J. and Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 888 – 905.

Simon, H. (1991). Partitioning of unstructured problems for parallel processing. Computing Systems Engineering, 2, 135 – 148.

Spielman, D. and Teng, S. (1996). Spectral partitioning works: planar graphs and finite element meshes. In 37th Annual Symposium on Foundations of Computer Science (Burlington, VT, 1996) (pp. 96 – 105). Los Alamitos, CA: IEEE Comput. Soc. Press. (See also extended technical report.)

Stewart, G. and Sun, J. (1990). Matrix Perturbation Theory. New York: Academic Press.

Still, S. and Bialek, W. (2004). How many clusters? An information-theoretic perspective. Neural Comput., 16(12), 2483 – 2506.

Stoer, M. and Wagner, F. (1997). A simple min-cut algorithm. J. ACM, 44(4), 585 – 591.

Tibshirani, R., Walther, G., and Hastie, T. (2001). Estimating the number of clusters in a dataset via the gap statistic. J. Royal. Statist. Soc. B, 63(2), 411 – 423.

Van Driessche, R. and Roose, D. (1995). An improved spectral bisection algorithm and its application to dynamic load balancing. Parallel Comput., 21(1), 29 – 48.

von Luxburg, U., Belkin, M., and Bousquet, O. (to appear). Consistency of spectral clustering. Annals of Statistics. (See also Technical Report 134, Max Planck Institute for Biological Cybernetics, 2004.)

von Luxburg, U., Bousquet, O., and Belkin, M. (2004). On the convergence of spectral clustering on random samples: the normalized case. In J. Shawe-Taylor and Y. Singer (Eds.), Proceedings of the 17th Annual Conference on Learning Theory (COLT) (pp. 457 – 471). Springer, New York.

von Luxburg, U., Bousquet, O., and Belkin, M. (2005). Limits of spectral clustering. In L. Saul, Y. Weiss, and L. Bottou (Eds.), Advances in Neural Information Processing Systems (NIPS) 17 (pp. 857 – 864). Cambridge, MA: MIT Press.

Wagner, D. and Wagner, F. (1993). Between min cut and graph bisection. In Proceedings of the 18th International Symposium on Mathematical Foundations of Computer Science (MFCS) (pp. 744 – 750). London: Springer.