Approximate kernel clustering
Subhash Khot* (Courant Institute of Mathematical Sciences, khot@cims.nyu.edu)
Assaf Naor† (Courant Institute of Mathematical Sciences, naor@cims.nyu.edu)

Abstract. In the kernel clustering problem we are given a large $n\times n$ positive semi-definite matrix $A=(a_{ij})$ with $\sum_{i,j=1}^n a_{ij}=0$ and a small $k\times k$ positive semi-definite matrix $B=(b_{ij})$. The goal is to find a partition $S_1,\ldots,S_k$ of $\{1,\ldots,n\}$ which maximizes the quantity
$$\sum_{i,j=1}^k \sum_{(p,q)\in S_i\times S_j} a_{pq}b_{ij}.$$
We study the computational complexity of this generic clustering problem, which originates in the theory of machine learning. We design a constant factor polynomial time approximation algorithm for this problem, answering a question posed by Song, Smola, Gretton and Borgwardt. In some cases we manage to compute the sharp approximation threshold for this problem assuming the Unique Games Conjecture (UGC). In particular, when $B$ is the $3\times 3$ identity matrix the UGC hardness threshold of this problem is exactly $\frac{16\pi}{27}$. We present and study a geometric conjecture of independent interest which we show would imply that the UGC threshold when $B$ is the $k\times k$ identity matrix is $\frac{8\pi}{9}\left(1-\frac{1}{k}\right)$ for every $k\ge 3$.

1 Introduction

This paper is devoted to an investigation of the polynomial time approximability of a generic clustering problem which originates in the theory of machine learning. In doing so, we uncover a connection with a continuous geometric/analytic problem which is of independent interest. In [23] Song, Smola, Gretton and Borgwardt introduced the following framework for kernel clustering problems. Assume that we are given a centered kernel, i.e.
an $n\times n$ positive semidefinite matrix $A=(a_{ij})$ with real entries such that $\sum_{i,j=1}^n a_{ij}=0$ (the assumption that the kernel is centered is a commonly used normalization in learning theory; see [22] for more information on this topic). Such matrices arise, for example, as correlation matrices of random variables $(X_1,\ldots,X_n)$ that measure attributes of certain empirical data, i.e. $a_{ij}=\mathbb{E}[X_iX_j]$. We think of $n$ as very large, and our goal is to "cluster" the matrix $A$ into a much smaller $k\times k$ matrix in such a way that certain features can still be extracted from the clustered matrix. Formally, given a partition of $\{1,\ldots,n\}$ into $k$ sets $S_1,\ldots,S_k$, define the clustering of $A$ with respect to this partition to be the $k\times k$ matrix whose $(i,j)$th entry is
$$\sum_{(p,q)\in S_i\times S_j} a_{pq}. \qquad (1)$$

* Research supported in part by NSF CAREER award CCF-0643626 and a Microsoft New Faculty Fellowship.
† Research supported by NSF grants CCF-0635078 and DMS-0528387.

Let $A(S_1,\ldots,S_k)$ denote the $k\times k$ matrix given by (1). In the kernel clustering problem we are given a positive semidefinite $k\times k$ matrix $B=(b_{ij})$, and we wish to find the clustering $A(S_1,\ldots,S_k)=C=(c_{ij})$ of $A$ which is most similar to $B$ in the sense that $\sum_{i,j=1}^k c_{ij}b_{ij}$, i.e. its scalar product with $B$, is as large as possible. In other words, our goal is to compute the number (and the corresponding partition):
$$\mathrm{Clust}(A|B) \coloneqq \max\left\{\sum_{i,j=1}^k \sum_{(p,q)\in S_i\times S_j} a_{pq}b_{ij} :\ \{S_1,\ldots,S_k\}\ \text{is a partition of}\ \{1,\ldots,n\}\right\}$$
$$= \max\left\{\sum_{i,j=1}^k A(S_1,\ldots,S_k)_{ij}\cdot b_{ij} :\ \{S_1,\ldots,S_k\}\ \text{is a partition of}\ \{1,\ldots,n\}\right\}$$
$$= \max\left\{\sum_{i,j=1}^n a_{ij}b_{\sigma(i)\sigma(j)} :\ \sigma:\{1,\ldots,n\}\to\{1,\ldots,k\}\right\}. \qquad (2)$$
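To make the last formulation of $\mathrm{Clust}(A|B)$ concrete, here is a minimal brute-force sketch (the function name and the tiny example instance are ours, not from the paper); it enumerates all assignments $\sigma$ and is therefore exponential in $n$, for illustration only.

```python
from itertools import product

def clust(A, B):
    """Brute-force Clust(A|B): maximize sum_{i,j} a_ij * b_{sigma(i) sigma(j)}
    over all assignments sigma: {0,...,n-1} -> {0,...,k-1}.
    Exponential in n; only meant for tiny illustrative instances."""
    n, k = len(A), len(B)
    best = float("-inf")
    for sigma in product(range(k), repeat=n):
        val = sum(A[i][j] * B[sigma[i]][sigma[j]]
                  for i in range(n) for j in range(n))
        best = max(best, val)
    return best

# Example: A is the (centered, PSD) Laplacian of a 4-cycle and B = I_2,
# so the objective is the total intra-cluster weight.
A = [[ 2, -1,  0, -1],
     [-1,  2, -1,  0],
     [ 0, -1,  2, -1],
     [-1,  0, -1,  2]]
B = [[1, 0], [0, 1]]
print(clust(A, B))  # the alternating partition {0,2},{1,3} attains 8
```

For this instance the maximum, attained by the alternating bipartition, equals $8$; every coarser partition gives a strictly smaller value.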
The flexibility in the above formulation of the kernel clustering problem is clearly in the choice of the comparison matrix $B$, which allows us to enforce a wide range of clustering criteria. Using the statistical interpretation of $(a_{ij})$ as a correlation matrix, we can think of the matrix $B$ as encoding our belief/hypothesis that the empirical data has a certain structure, and the kernel clustering problem aims to efficiently expose this structure. Several explicit examples of useful "test matrices" $B$ are discussed in [23], including hierarchical clustering and clustering data on certain manifolds. We refer to [23] for additional information which illustrates the versatility of this general clustering problem, including its relation to the Hilbert Schmidt Independence Criterion (HSIC) and various experimental results. In [23] it was asked whether there is a polynomial time approximation algorithm for computing $\mathrm{Clust}(A|B)$. Here we obtain a constant factor approximation algorithm for this problem, and prove some computational hardness of approximation results.

Before stating our results in full generality we shall now present a few simple illustrative examples. If $B=I_k$ is the $k\times k$ identity matrix, then thinking once more of $a_{ij}$ as correlations $\mathbb{E}[X_iX_j]$, our goal is to find a partition $S_1,\ldots,S_k$ of $\{1,\ldots,n\}$ which maximizes the quantity
$$\sum_{i=1}^k \sum_{p,q\in S_i} \mathbb{E}[X_pX_q],$$
i.e. we wish to cluster the variables so as to maximize the total intra-cluster correlations. As we shall see below, our results yield a polynomial time algorithm which approximates $\mathrm{Clust}(A|I_k)$ up to a factor of $\frac{8\pi}{9}\left(1-\frac{1}{k}\right)$. In particular, when $k=3$ we obtain a $\frac{16\pi}{27}$ approximation algorithm, and we show that, assuming the Unique Games Conjecture (UGC), no polynomial time algorithm can achieve an approximation guarantee which is smaller than $\frac{16\pi}{27}$.
The Unique Games Conjecture was posed by Khot in [13], and it will be described momentarily. For readers who are not familiar with this computational hypothesis and its remarkable applications to hardness of approximation, it suffices to say that this hardness result should be viewed as strong evidence that $\frac{16\pi}{27}$ is the sharp threshold below which no polynomial time algorithm can solve the kernel clustering problem when $B=I_3$. Moreover, we conjecture that $\frac{8\pi}{9}\left(1-\frac{1}{k}\right)$ is the sharp approximability threshold (assuming the UGC) for $\mathrm{Clust}(A|I_k)$ for every $k\ge 3$. In this paper we reduce this conjecture to a purely geometric/analytic conjecture, which we will describe in detail later, and prove some partial results about it.

Another illustrative example of the kernel clustering problem is the case $B=\begin{pmatrix} 1 & -1\\ -1 & 1\end{pmatrix}$. In this case we clearly have
$$\mathrm{Clust}\left(A\,\Big|\,\begin{pmatrix} 1 & -1\\ -1 & 1\end{pmatrix}\right) = \max\left\{\sum_{i,j=1}^n a_{ij}\varepsilon_i\varepsilon_j :\ \varepsilon_1,\ldots,\varepsilon_n\in\{-1,1\}\right\}. \qquad (3)$$
The optimization problem in (3) is well known as the positive semi-definite Grothendieck problem and has several algorithmic applications (see [20, 18, 2, 5]). It has been shown by Rietz [20] that the natural semidefinite relaxation of (3) has integrality gap $\frac{\pi}{2}$ (see also Nesterov's work [18]). Our results imply that, assuming the UGC, $\frac{\pi}{2}$ is the sharp approximation threshold for the positive semi-definite Grothendieck problem. Note that without the assumption that $A$ is positive semidefinite the natural semidefinite relaxation of (3) has integrality gap $\Theta(\log n)$. See [17, 6, 1] for more information, and [3] for hardness results for this problem.

We can also view the problem (3) as a generalization of the MaxCut problem. Indeed, let $G=(V=\{1,\ldots,n\},E)$ be an $n$-vertex loop-free graph. For every vertex $i\in V$ let $d_i$ denote its degree in $G$. Let $A$ be the Laplacian of $G$, i.e.
$A$ is the $n\times n$ matrix given by
$$a_{ij}=\begin{cases} d_i & \text{if } i=j,\\ -1 & \text{if } i\neq j \text{ and } ij\in E,\\ 0 & \text{if } i\neq j \text{ and } ij\notin E.\end{cases} \qquad (4)$$
Then $A$ is positive semi-definite since it is diagonally dominant. For every $\varepsilon_1,\ldots,\varepsilon_n\in\{-1,1\}$ let $S\subseteq V$ be the set $S\coloneqq\{i\in V:\ \varepsilon_i=1\}$. Then:
$$\sum_{i,j=1}^n a_{ij}\varepsilon_i\varepsilon_j = \sum_{i=1}^n d_i - 2|E(S,S)| - 2|E(V\setminus S, V\setminus S)| + 2|E(S,V\setminus S)| = 2|E| - 2\big(|E| - |E(S,V\setminus S)|\big) + 2|E(S,V\setminus S)| = 4|E(S,V\setminus S)|. \qquad (5)$$
Hence
$$\mathrm{Clust}\left(A\,\Big|\,\begin{pmatrix} 1 & -1\\ -1 & 1\end{pmatrix}\right) = 4\,\mathrm{MaxCut}(G).$$
Using Håstad's inapproximability result for MaxCut [11] it follows that if $P\neq NP$ there is no polynomial time algorithm which approximates (3) up to a factor smaller than $\frac{17}{16}$.

Our algorithmic results. For a fixed positive semidefinite matrix $B$, the approximability threshold for the problem of computing $\mathrm{Clust}(A|B)$ depends on $B$. It is therefore of interest to study the performance of our algorithms in terms of the matrix $B$. We do obtain bounds which depend on $B$ (which are probably suboptimal in general); the precise statements are contained in Theorem 2.1 and Theorem 2.3. For the sake of simplicity, in the introduction we state bounds which are independent of $B$. We believe that the problem of computing the approximation threshold (perhaps under the UGC) for each fixed $B$ is an interesting problem which deserves further research.

If $A$ is centered, i.e. $\sum_{i,j=1}^n a_{ij}=0$, then for every $k\times k$ positive semi-definite matrix $B$ our algorithm achieves an approximation ratio of $\pi\left(1-\frac{1}{k}\right)$. If, in addition, $B$ is centered and spherical, i.e. $\sum_{i,j=1}^k b_{ij}=0$ and $b_{ii}=1$ for all $i$, then our algorithm achieves an approximation ratio of $\frac{8\pi}{9}\left(1-\frac{1}{k}\right)$. This ratio is also valid if $B$ is the identity matrix, and, as we mentioned above, we believe that this approximation guarantee cannot be improved assuming the UGC (and here we prove this conjecture for $k=3$).
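The cut identity (5) is easy to verify numerically. The sketch below (helper name and example graph are ours) builds the Laplacian of a small graph as in (4) and checks $\sum_{i,j} a_{ij}\varepsilon_i\varepsilon_j = 4|E(S,V\setminus S)|$ over all $2^n$ sign vectors.

```python
from itertools import product

def laplacian(n, edges):
    """Graph Laplacian as in (4): degrees on the diagonal, -1 on edges."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] -= 1
        A[j][i] -= 1
        A[i][i] += 1
        A[j][j] += 1
    return A

n = 5
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (1, 4)]
A = laplacian(n, edges)

# Check (5): the quadratic form equals 4 times the cut size, for every cut.
for eps in product([-1, 1], repeat=n):
    quad = sum(A[i][j] * eps[i] * eps[j] for i in range(n) for j in range(n))
    cut = sum(1 for i, j in edges if eps[i] != eps[j])
    assert quad == 4 * cut
print("identity (5) verified on all", 2 ** n, "sign vectors")
```

In particular, maximizing the quadratic form over signs is exactly $4\,\mathrm{MaxCut}(G)$, as claimed.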
When $A$ is not necessarily centered (note that this case is of lesser interest in terms of the applications in machine learning) we obtain an algorithm which achieves an approximation ratio of $1+\frac{3\pi}{2}$ (this is probably sub-optimal).

All of our algorithms, which are described in Section 2, use semi-definite programming in a perhaps non-obvious way. The rounding algorithm of our semi-definite relaxation amounts to proving certain geometric inequalities which can be viewed as variants of the positive semi-definite Grothendieck inequality. This analysis is presented in Section 2. As a concrete example, we state in this introduction the following Grothendieck-type inequality, which corresponds to our $\frac{8\pi}{9}\left(1-\frac{1}{k}\right)$ algorithm:

Theorem 1.1. Let $(a_{ij})$ be an $n\times n$ positive semi-definite matrix with $\sum_{i,j=1}^n a_{ij}=0$. Then for every $k\ge 3$ and every $v_1,\ldots,v_k\in S^{k-1}$ with $\sum_{i=1}^k v_i=0$ we have
$$\max_{x_1,\ldots,x_n\in S^{n-1}} \sum_{i,j=1}^n a_{ij}\langle x_i,x_j\rangle \le \frac{8\pi}{9}\left(1-\frac{1}{k}\right) \max_{\sigma:\{1,\ldots,n\}\to\{1,\ldots,k\}} \sum_{i,j=1}^n a_{ij}\langle v_{\sigma(i)},v_{\sigma(j)}\rangle. \qquad (6)$$

Inequality (6) is sharp when $k=3$, and we conjecture that it is sharp for all $k\ge 4$. This conjecture is related to a geometric conjecture which we describe below.

The Unique Games Conjecture, hardness of approximation, and the propeller problem. Our hardness result for the kernel clustering problem is based on the Unique Games Conjecture, which was put forth by Khot in [13]. We shall now describe this conjecture. A Unique Game is an optimization problem with an instance $L=L\big(G(V,W,E),\ n,\ \{\pi_{vw}\}_{(v,w)\in E}\big)$. Here $G(V,W,E)$ is a regular bipartite graph with vertex sets $V$ and $W$ and edge set $E$. Each vertex is supposed to receive a label from the set $\{1,\ldots,n\}$. For every edge $(v,w)\in E$ with $v\in V$ and $w\in W$, there is a given permutation $\pi_{vw}:\{1,\ldots,n\}\to\{1,\ldots,n\}$.
A labeling of the Unique Game instance is an assignment $\rho:V\cup W\to\{1,\ldots,n\}$. An edge $(v,w)$ is satisfied by a labeling $\rho$ if and only if $\rho(v)=\pi_{vw}(\rho(w))$. The goal is to find a labeling that maximizes the fraction of edges satisfied (call this maximum $\mathrm{OPT}(L)$). We think of the number of labels $n$ as a constant and the size of the graph $G(V,W,E)$ as the size of the problem instance. The Unique Games Conjecture asserts that for arbitrarily small constants $\varepsilon,\delta>0$ there exists a constant $n=n(\varepsilon,\delta)$ such that no polynomial time algorithm can distinguish whether a Unique Games instance $L$ with $n$ labels satisfies $\mathrm{OPT}(L)\ge 1-\varepsilon$ or $\mathrm{OPT}(L)\le\delta$.¹ This conjecture is by now a commonly used complexity assumption for proving hardness of approximation results. Despite several recent attempts to obtain better polynomial time approximation algorithms for the Unique Game problem (see the table in [4] for a description of known results), the Unique Games Conjecture still stands.

Our UGC hardness result for kernel clustering, which is presented in Section 3, is based at heart on the "dictatorship vs. low-influence" paradigm that is recurrent in UGC hardness results (for example [13, 15]). In order to apply this paradigm one usually designs a probabilistic test on a given Boolean function on the Boolean hypercube, and then analyzes the acceptance probability of this test in the two extremes: dictatorship functions and functions without influential variables. The gap between these two acceptance probabilities translates into the hardness of approximation factor. In our case, instead of a probabilistic test we need to design a positive semidefinite quadratic form on the truth table of the function. Our form is the sum of the squares of the Fourier coefficients of level 1. This already yields $\frac{\pi}{2}$ UGC hardness when $k=2$.
¹ As stated in [13], the conjecture says that it is NP-hard to distinguish between these two cases. However, if one only wants to rule out polynomial time algorithms, the conjecture as stated here suffices.

For larger $k$ we need to work with functions from $\{1,\ldots,k\}^n$ to $\{1,\ldots,k\}$. The analysis of this approach leads to the "propeller problem" which we now describe. The details of this connection are explained in Section 3.

We believe that one of the interesting aspects of the present paper is that complexity considerations lead to geometric/analytic problems which are of independent interest. Similar such connections have been recently discovered in [14, 8]. In our case the reduction from UGC to kernel clustering leads to the following question, which we call the "propeller problem" for reasons that will become clear presently. Let $\gamma_{k-1}$ denote the standard Gaussian measure on $\mathbb{R}^{k-1}$, i.e. the density of $\gamma_{k-1}$ is $(2\pi)^{-(k-1)/2}e^{-\|x\|_2^2/2}$. Let $A_1,\ldots,A_k$ be a partition of $\mathbb{R}^{k-1}$ into measurable sets. For each $i\in\{1,\ldots,k\}$ consider the Gaussian moment of the set $A_i$, i.e. the vector
$$z_i \coloneqq \int_{A_i} x\, d\gamma_{k-1}(x) \in \mathbb{R}^{k-1}.$$
Our goal is to find the partition which maximizes the sum of the squared Euclidean lengths of the Gaussian moments of the elements of the partition, i.e. $\sum_{i=1}^k \|z_i\|_2^2$. Let $C(k)$ denote the value of this maximum (in Section 3.1 we show that this is indeed a maximum and not just a supremum). In Section 3 we show that assuming the UGC there is no polynomial time algorithm which approximates $\mathrm{Clust}(A|I_k)$ to a factor smaller than $\frac{1-1/k}{C(k)}$. In Section 3.1 we show that $C(2)=\frac{1}{\pi}$ and $C(3)=\frac{9}{8\pi}$. The value of $C(3)$ comes from the partition of the plane $\mathbb{R}^2$ into a "propeller", i.e. three cones of angle $\frac{2\pi}{3}$ with cusp at the origin.
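The propeller value $C(3)=\frac{9}{8\pi}\approx 0.358$ can be checked by simulation. The sketch below (our own, not from the paper) samples standard Gaussians in $\mathbb{R}^2$, assigns each to one of the three $120°$ cones by angle, and estimates $\sum_{i=1}^3\|z_i\|_2^2$ by Monte Carlo.

```python
import math
import random

random.seed(0)
N = 200_000
# Per-cone running sums of the sampled points; cone index = floor(angle/120 deg).
sums = [[0.0, 0.0] for _ in range(3)]
step = 2 * math.pi / 3
for _ in range(N):
    x, y = random.gauss(0, 1), random.gauss(0, 1)
    ang = math.atan2(y, x) % (2 * math.pi)
    c = min(int(ang // step), 2)  # guard against float rounding at 2*pi
    sums[c][0] += x
    sums[c][1] += y

# z_i is the Gaussian moment of cone i, estimated as the sample mean of x*1_{A_i}.
est = sum((sx / N) ** 2 + (sy / N) ** 2 for sx, sy in sums)
print("estimate:", est, " exact 9/(8*pi):", 9 / (8 * math.pi))
```

With this sample size the estimate typically lands within a few thousandths of $\frac{9}{8\pi}$; perturbing the $120°$ angles only decreases the value, in line with the propeller conjecture.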
Most of Section 3.1 is devoted to the proof of the following theorem:

Theorem 1.2. $C(k)$ is attained at a simplicial conical partition, i.e. a partition $A_1,\ldots,A_k$ of $\mathbb{R}^{k-1}$ of the following form: let $A_1,\ldots,A_m$ be the elements of the partition which have positive measure. Then $A_j=B_j\times\mathbb{R}^{k-m}$, where $B_j\subseteq\mathbb{R}^{m-1}$ is a cone with cusp at $0$ whose base is a simplex.

It is tempting to believe that the optimal simplicial conical partition described in Theorem 1.2 occurs when the cones $B_1,\ldots,B_m$ are generated by the regular simplex. However, in Section 3.1 we prove that among such regular simplicial conical partitions the one which maximizes the sum of the squared lengths of its Gaussian moments is the one with $m=3$. We therefore conjecture that for every $k\ge 3$ an optimal partition for the problem described above is actually $\{C_1\times\mathbb{R}^{k-3},\ C_2\times\mathbb{R}^{k-3},\ C_3\times\mathbb{R}^{k-3}\}$, where $\{C_1,C_2,C_3\}$ is the propeller partition of $\mathbb{R}^2$; see Figure 1. If this "propeller conjecture" holds true, then it would follow that our $\frac{8\pi}{9}\left(1-\frac{1}{k}\right)$ approximation algorithm is optimal assuming the UGC for every $k\ge 4$, and not just for $k\in\{2,3\}$. The full propeller conjecture seems to be a challenging geometric problem of independent interest, not just due to the connection that we establish between it and the study of hardness of approximation for kernel clustering.

We end this introduction with an explanation of how our work relates to the recent result of Raghavendra [19], which shows that for any generalized constraint satisfaction problem² (CSP) there is a generic way of writing a semidefinite relaxation that achieves an optimal approximation ratio assuming the Unique Games Conjecture. Our clustering problem fits in the framework of [19] as follows: we wish to compute
$$\max\left\{\sum_{i,j=1}^n a_{ij}b_{\sigma(i)\sigma(j)} :\ \sigma:\{1,\ldots,n\}\to\{1,\ldots,k\}\right\}, \qquad (7)$$
where $(a_{ij})$ is a centered positive semi-definite matrix and $(b_{ij})$ is a positive semi-definite matrix. One can think of this problem as a CSP (with an extra global constraint corresponding to the positive semi-definiteness) where the set of variables is $\{1,\ldots,n\}$ and we wish to assign each variable a value from the domain $\{1,\ldots,k\}$. For every pair $(i,j)\in\{1,\ldots,n\}\times\{1,\ldots,n\}$ there is a constraint with weight $a_{ij}$, and we get a payoff of $b_{st}$ if variables $i$ and $j$ are assigned $s\in\{1,\ldots,k\}$ and $t\in\{1,\ldots,k\}$ respectively. Raghavendra shows that every integrality gap instance for his generic SDP relaxation can be translated into a UGC-hardness of approximation result, with the hardness factor (essentially) the same as the integrality gap. We make here the non-trivial observation that in the reduction of [19], starting with an integrality gap instance for (the generic SDP relaxation of) the clustering problem (7), the matrix of constraint weights $(a_{ij})$ indeed turns out to be positive semi-definite, as required in the kernel clustering problem (this requires proof; the details are omitted since this is a digression from the topic of this paper). Thus Raghavendra's result can be made to apply to the kernel clustering problem (i.e. the generic SDP achieves the optimum approximation ratio assuming the UGC).

² In a generalized CSP, every assignment to variables in a constraint has a real-valued (possibly negative) payoff, instead of a simple decision saying whether the assignment is satisfying or not.

Figure 1: The conjectured optimal partition for the "sum of squares of Gaussian moments problem" described above consists of a partition of $\mathbb{R}^{k-1}$ into 3 parts, with the remaining $k-3$ parts empty. This partition corresponds to a planar $120°$ "propeller" multiplied by an orthogonal copy of $\mathbb{R}^{k-3}$.
Nevertheless, it is also useful to look at different relaxations and rounding procedures, for the following reasons. Firstly, for a given problem there could be an SDP relaxation that is more natural than the generic one and might be easier to work with. Secondly, Raghavendra's result (that the integrality gap is the same as the hardness factor) applies only when the integrality gap is a constant. This is a priori not clear for the kernel clustering problem. For instance, a priori the integrality gap could be $\Omega(\log n)$ (as is the case for the Grothendieck problem on a general graph; see [1]). So before applying the result of [19] one would need to show that the integrality gap of the generic SDP is indeed a constant. Thirdly, for CSPs with negative payoffs (as is the case in the kernel clustering problem), Raghavendra shows that the value computed by the generic SDP achieves the optimal approximation ratio (modulo the UGC), but the paper does not give a rounding procedure. Finally, Raghavendra's result does not really shed light on the exact hardness threshold, in the sense that it shows how to translate integrality gap instances into a UGC hardness result, but gives no idea as to how to construct an integrality gap instance in the first place. Constructing the integrality gap instance in general amounts to answering a certain isoperimetric-type geometric question (naturally leading to a dictatorship test, or the other way round: the geometric question itself might be inspired by the dictatorship test that we have in mind). Thus, as far as we know, we cannot avoid designing an explicit dictatorship test and answering an isoperimetric-type question, whether or not we start with Raghavendra's generic SDP that is guaranteed to be optimal.
As mentioned before, in the clustering problem where $B=(b_{st})$ is centered and spherical, we show that the UGC-hardness threshold is at least $\frac{1-1/k}{C(k)}$, and characterizing $C(k)$ seems to be a challenging geometric question.

2 Constant factor approximation algorithms for kernel clustering

Let $A\in M_n(\mathbb{R})$ and $B\in M_k(\mathbb{R})$ be positive semidefinite matrices. Then there are $u_1,\ldots,u_n\in\mathbb{R}^n$ and $v_1,\ldots,v_k\in\mathbb{R}^k$ such that $a_{ij}=\langle u_i,u_j\rangle$ and $b_{ij}=\langle v_i,v_j\rangle$. Such vectors can be found in polynomial time (this is simply the Cholesky decomposition). The instance of the kernel clustering problem will be called centered if $\sum_{i,j=1}^n a_{ij}=0$, or equivalently $\sum_{i=1}^n u_i=0$. The instance will be called spherical if $b_{ii}=\|v_i\|_2^2=1$ for all $i\in\{1,\ldots,k\}$. Let $R(B)$ be the radius of the smallest Euclidean ball containing $\{v_1,\ldots,v_k\}$. Note that $R(B)$ is indeed only a function of $B$, i.e. it does not depend on the particular representation of $B$ as a Gram matrix. Moreover, it is possible to compute $R(B)$, and, given the decomposition $b_{ij}=\langle v_i,v_j\rangle$, a vector $w\in\mathbb{R}^k$ such that $\max_{j\in\{1,\ldots,k\}}\|v_j-w\|_2=R(B)$, in polynomial time (see [10]). Our goal is to compute in polynomial time the quantity
$$\mathrm{Clust}(A|B) \coloneqq \max_{\sigma:\{1,\ldots,n\}\to\{1,\ldots,k\}} \sum_{i,j=1}^n a_{ij}b_{\sigma(i)\sigma(j)} = \max_{\sigma:\{1,\ldots,n\}\to\{1,\ldots,k\}} \sum_{i,j=1}^n \langle u_i,u_j\rangle\langle v_{\sigma(i)},v_{\sigma(j)}\rangle.$$

Our algorithm, which is based on semidefinite programming, proceeds via the following steps:

1. Compute a Cholesky decomposition of $B$, i.e. $v_1,\ldots,v_k\in\mathbb{R}^k$ with $b_{ij}=\langle v_i,v_j\rangle$.

2. Compute (using, for example, [10]) $R(B)$ and a vector $w\in\mathbb{R}^k$ such that $\max_{j\in\{1,\ldots,k\}}\|v_j-w\|_2=R(B)$.

3. Solve the semidefinite program
$$\max\left\{\sum_{i,j=1}^n a_{ij}\cdot\big\langle \|w\|_2 u + R(B)x_i,\ \|w\|_2 u + R(B)x_j\big\rangle :\ u,x_1,\ldots,x_n\in\mathbb{R}^{n+1},\ \|u\|_2=1,\ \forall i\ \|x_i\|_2\le 1\right\}.$$
4. Choose $p,q\in\{1,\ldots,k\}$ such that $\|v_p-v_q\|_2=\max_{i,j\in\{1,\ldots,k\}}\|v_i-v_j\|_2$. Let $g_1,g_2\in\mathbb{R}^{n+1}$ be i.i.d. standard Gaussian vectors and define $\sigma:\{1,\ldots,n\}\to\{1,\ldots,k\}$ by
$$\sigma(r)=\begin{cases} p & \text{if } \langle g_1,x_r\rangle \ge \langle g_2,x_r\rangle,\\ q & \text{if } \langle g_2,x_r\rangle \ge \langle g_1,x_r\rangle.\end{cases} \qquad (8)$$

5. Choose distinct $\alpha,\beta,\gamma\in\{1,\ldots,k\}$ such that
$$\left\|v_\alpha-\frac{v_\alpha+v_\beta+v_\gamma}{3}\right\|_2^2 + \left\|v_\beta-\frac{v_\alpha+v_\beta+v_\gamma}{3}\right\|_2^2 + \left\|v_\gamma-\frac{v_\alpha+v_\beta+v_\gamma}{3}\right\|_2^2$$
is maximized among all such choices of $\alpha,\beta,\gamma$. Let $g_1,g_2,g_3\in\mathbb{R}^{n+1}$ be i.i.d. standard Gaussian vectors and define $\tau:\{1,\ldots,n\}\to\{1,\ldots,k\}$ by
$$\tau(r)=\begin{cases}\alpha & \text{if } \langle g_1,x_r\rangle \ge \max\{\langle g_2,x_r\rangle,\langle g_3,x_r\rangle\},\\ \beta & \text{if } \langle g_2,x_r\rangle \ge \max\{\langle g_1,x_r\rangle,\langle g_3,x_r\rangle\},\\ \gamma & \text{if } \langle g_3,x_r\rangle \ge \max\{\langle g_1,x_r\rangle,\langle g_2,x_r\rangle\}.\end{cases} \qquad (9)$$

6. Output $\sigma$ if $\sum_{i,j=1}^n a_{ij}b_{\sigma(i)\sigma(j)} \ge \sum_{i,j=1}^n a_{ij}b_{\tau(i)\tau(j)}$. Otherwise output $\tau$.

Remark 2.1. The astute reader might notice that there is an obvious generalization of the above algorithm. Namely, for every fixed integer $s\in[2,k]$ we can choose a subset $S\subseteq\{1,\ldots,k\}$ of cardinality $s$ which maximizes the quantity
$$\sum_{i\in S}\left\|v_i-\frac{1}{s}\sum_{j\in S}v_j\right\|_2^2.$$
Then we can choose $s$ i.i.d. standard Gaussians $\{g_i\}_{i\in S}\subseteq\mathbb{R}^{n+1}$ and define $\sigma_s:\{1,\ldots,n\}\to\{1,\ldots,k\}$ analogously to the above, namely $\sigma_s(r)=i$ if $\langle g_i,x_r\rangle=\max_{j\in S}\langle g_j,x_r\rangle$. Then we can consider the assignments $\sigma_2,\sigma_3,\ldots,\sigma_s$ and choose the one which maximizes the objective $\sum_{i,j=1}^n a_{ij}b_{\sigma_\ell(i)\sigma_\ell(j)}$. In spite of this flexibility, it turns out that the rounding method described above does not improve if we take $s\ge 4$. In order to demonstrate this fact we will proceed below to analyze the algorithm for general $s$, and then optimize over $s$.
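The rounding in steps 4 and 5 (and its generalization in Remark 2.1) is plain Gaussian rounding: each SDP vector is assigned to whichever of the chosen Gaussian directions it correlates with most. A minimal sketch of just this rounding map, assuming the SDP vectors $x_r$ are already given (the function name and the toy input are ours):

```python
import random

def gaussian_round(xs, labels):
    """Round vectors xs to labels as in (8)/(9): draw one i.i.d. standard
    Gaussian vector g_l per label and map each x_r to the label l that
    maximizes <g_l, x_r>."""
    dim = len(xs[0])
    gs = {l: [random.gauss(0, 1) for _ in range(dim)] for l in labels}

    def dot(a, b):
        return sum(p * q for p, q in zip(a, b))

    return [max(labels, key=lambda l: dot(gs[l], x)) for x in xs]

random.seed(1)
# Two antipodal unit vectors always land in different clusters when rounding
# with two labels: the label maximizing <g_l, x> minimizes <g_l, -x>.
assignment = gaussian_round([[1.0, 0.0], [-1.0, 0.0]], labels=[0, 1])
assert assignment[0] != assignment[1]
print(assignment)
```

Step 6 then simply evaluates the objective on the competing assignments and keeps the better one.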
Bounds on the performance of the above algorithm are contained in the following theorem:

Theorem 2.1. Assume that $A$ is centered, i.e. that $\sum_{i,j=1}^n a_{ij}=0$. Let $p,q,\alpha,\beta,\gamma\in\{1,\ldots,k\}$ and $v_1,\ldots,v_k$ be as in the description above. Then the algorithm outputs in polynomial time a random assignment $\lambda:\{1,\ldots,n\}\to\{1,\ldots,k\}$ satisfying
$$\mathrm{Clust}(A|B) \le \min\left\{\frac{2\pi R(B)^2}{\|v_p-v_q\|_2^2},\ \frac{16\pi R(B)^2}{9\left(\left\|v_\alpha-\frac{v_\alpha+v_\beta+v_\gamma}{3}\right\|_2^2+\left\|v_\beta-\frac{v_\alpha+v_\beta+v_\gamma}{3}\right\|_2^2+\left\|v_\gamma-\frac{v_\alpha+v_\beta+v_\gamma}{3}\right\|_2^2\right)}\right\}\cdot\mathbb{E}\sum_{i,j=1}^n a_{ij}b_{\lambda(i)\lambda(j)}. \qquad (10)$$
In particular, we always have
$$\mathrm{Clust}(A|B) \le \pi\left(1-\frac{1}{k}\right)\mathbb{E}\sum_{i,j=1}^n a_{ij}b_{\lambda(i)\lambda(j)}, \qquad (11)$$
and if $B$ is centered and spherical, i.e. $\sum_{i,j=1}^k b_{ij}=0$ and $b_{ii}=1$ for all $i$, then
$$\mathrm{Clust}(A|B) \le \frac{8\pi}{9}\left(1-\frac{1}{k}\right)\mathbb{E}\sum_{i,j=1}^n a_{ij}b_{\lambda(i)\lambda(j)}. \qquad (12)$$
The same bound (12) holds true if $B$ is the identity matrix.

We single out in the next theorem the case $k\in\{2,3\}$, since in these cases we have matching UGC hardness results. Note that for general $k$ we obtain a factor $\pi$ approximation algorithm, answering positively the question posed by Song, Smola, Gretton and Borgwardt in [23].

Theorem 2.2. Assume that $A$ is centered and $B$ is a $2\times 2$ matrix. Then our algorithm achieves a $\frac{\pi}{2}$ approximation factor. Assuming the Unique Games Conjecture, no polynomial time algorithm achieves an approximation guarantee smaller than $\frac{\pi}{2}$ in this case.

Assume that $A$ is centered, $k=3$, and $B$ is centered and spherical (since $k=3$ this forces $B$ to be the Gram matrix of the three cube roots of unity in the complex plane). Then our algorithm achieves an approximation factor of $\frac{16\pi}{27}$. Assuming the Unique Games Conjecture, no polynomial time algorithm achieves an approximation guarantee smaller than $\frac{16\pi}{27}$ in this case.
In fact, we believe that the UGC hardness threshold for the kernel clustering problem when $A$ is centered and $B$ is spherical and centered is exactly $\frac{8\pi}{9}\left(1-\frac{1}{k}\right)$. In Section 3 we describe a geometric conjecture which we show implies this tight UGC threshold for general $k$.

We end the discussion by stating a (probably suboptimal) constant factor approximation result when $A$ is not necessarily centered (note that this case is of lesser interest in terms of the applications in machine learning). In this case the above algorithm gives a constant factor approximation. The slightly better bound on the approximation factor in Theorem 2.3 below follows from a variant of the above algorithm which will be described in its proof.

Theorem 2.3. For general $A$ and $B$ (not necessarily centered) there exists a polynomial time algorithm that achieves an approximation factor of
$$1 + \frac{2\pi}{\|v_p-v_q\|_2^2}\cdot\max_{i\in\{1,\ldots,k\}}\left\|v_i-\frac{v_p+v_q}{2}\right\|_2^2 \le 1+\frac{3\pi}{2}.$$

The proof of Theorem 2.2 is contained in Section 3. We shall now proceed to prove Theorem 2.1. Before doing so, we will show how the general bound (10) implies the bounds (11) and (12). The proof of Theorem 2.3 is deferred to the end of this section.

To prove that (10) implies (11), let $D$ denote the diameter of the set $\{v_1,\ldots,v_k\}$, i.e. $D=\|v_p-v_q\|_2$. A classical theorem of Jung [12] (see [7]) says that
$$R(B) \le D\sqrt{\frac{k-1}{2k}},$$
and (11) follows immediately by taking the first term in the minimum in (10).

We shall now show that (10) implies (12) when $B$ is either centered and spherical or the identity matrix. Assume first that $B$ is centered and spherical. Note that since $v_1,\ldots,v_k$ are unit vectors, $R(B)\le 1$. Hence, by considering the second term in the minimum in (10), we see that it is enough to show that there exist $\alpha,\beta,\gamma\in\{1,\ldots,k\}$ for which
$$\left\|v_\alpha-\frac{v_\alpha+v_\beta+v_\gamma}{3}\right\|_2^2+\left\|v_\beta-\frac{v_\alpha+v_\beta+v_\gamma}{3}\right\|_2^2+\left\|v_\gamma-\frac{v_\alpha+v_\beta+v_\gamma}{3}\right\|_2^2 \ge \frac{2k}{k-1}.$$
This follows from an averaging argument. Indeed,
$$\frac{1}{\binom{k}{3}}\sum_{\substack{\alpha,\beta,\gamma\in\{1,\ldots,k\}\\ \alpha<\beta<\gamma}}\left(\left\|v_\alpha-\frac{v_\alpha+v_\beta+v_\gamma}{3}\right\|_2^2+\left\|v_\beta-\frac{v_\alpha+v_\beta+v_\gamma}{3}\right\|_2^2+\left\|v_\gamma-\frac{v_\alpha+v_\beta+v_\gamma}{3}\right\|_2^2\right)$$
$$= \frac{2}{k}\sum_{i=1}^k \|v_i\|_2^2 - \frac{2}{k(k-1)}\sum_{\substack{i,j\in\{1,\ldots,k\}\\ i\neq j}}\langle v_i,v_j\rangle = \frac{2}{k}\sum_{i=1}^k b_{ii} - \frac{2}{k(k-1)}\left(\sum_{i,j=1}^k b_{ij}-\sum_{i=1}^k b_{ii}\right) = \frac{2k}{k-1}.$$
This completes the proof of (12) when $B$ is spherical and centered. The same bound holds true when $B=I_k$ is the identity matrix since, in this case, if we denote by $e_1,\ldots,e_k$ the standard unit basis of $\mathbb{R}^k$ and $e=\frac{1}{k}\sum_{i=1}^k e_i$, then for every assignment $\lambda:\{1,\ldots,n\}\to\{1,\ldots,k\}$ we have
$$\sum_{i,j=1}^n a_{ij}(I_k)_{\lambda(i)\lambda(j)} = \sum_{i,j=1}^n \langle u_i,u_j\rangle\langle e_{\lambda(i)},e_{\lambda(j)}\rangle = \sum_{i,j=1}^n \langle u_i,u_j\rangle\langle e_{\lambda(i)}-e,\ e_{\lambda(j)}-e\rangle + 2\left\langle \sum_{i=1}^n u_i,\ \sum_{j=1}^n \langle e, e_{\lambda(j)}\rangle u_j\right\rangle - \|e\|_2^2\left\|\sum_{i=1}^n u_i\right\|_2^2. \qquad (13)$$
The last two terms in (13) vanish since $A$ is centered. Thus
$$\sum_{i,j=1}^n a_{ij}(I_k)_{\lambda(i)\lambda(j)} = \frac{k-1}{k}\sum_{i,j=1}^n a_{ij}c_{\lambda(i)\lambda(j)},$$
where $C=(c_{ij})=\frac{k}{k-1}\big(\langle e_i-e,\ e_j-e\rangle\big)$ is spherical and centered. Thus the case of the identity matrix reduces to the previous analysis.

Proof of Theorem 2.1. Denote
$$\mathrm{SDP} \coloneqq \max \sum_{i,j=1}^n a_{ij}\cdot\big\langle \|w\|_2 u + R(B)x_i,\ \|w\|_2 u + R(B)x_j\big\rangle,$$
where the maximum is taken over all $u,x_1,\ldots,x_n\in\mathbb{R}^{n+1}$ such that $\|u\|_2=1$ and $\|x_i\|_2\le 1$ for all $i$. Observe that
$$\mathrm{SDP} \ge \mathrm{Clust}(A|B). \qquad (14)$$
Indeed, for every $\lambda:\{1,\ldots,n\}\to\{1,\ldots,k\}$ define $u=\frac{w}{\|w\|_2}$ and $x_i=\frac{v_{\lambda(i)}-w}{R(B)}$, and note that in this case
$$\sum_{i,j=1}^n a_{ij}\cdot\big\langle \|w\|_2 u + R(B)x_i,\ \|w\|_2 u + R(B)x_j\big\rangle = \sum_{i,j=1}^n a_{ij}b_{\lambda(i)\lambda(j)}.$$
Let $u^*, x_1^*,\ldots,x_n^*$ be the optimal solution of the SDP.
It will be convenient to think of the SDP solution as being split into two parts, so we rewrite
$$\mathrm{SDP} = \sum_{i,j=1}^n a_{ij}\cdot\big\langle \|w\|_2 u^* + R(B)x_i^*,\ \|w\|_2 u^* + R(B)x_j^*\big\rangle = \sum_{i,j=1}^n \langle u_i,u_j\rangle\cdot\big\langle \|w\|_2 u^* + R(B)x_i^*,\ \|w\|_2 u^* + R(B)x_j^*\big\rangle = \left\|\sum_{i=1}^n u_i\otimes\big(\|w\|_2 u^* + R(B)x_i^*\big)\right\|_2^2 \qquad (15)$$
$$= \left\|\,\|w\|_2\sum_{i=1}^n u_i\otimes u^* + R(B)\sum_{i=1}^n u_i\otimes x_i^*\right\|_2^2 = \|P+Q\|_2^2, \qquad (16)$$
where
$$P \coloneqq \|w\|_2\sum_{i=1}^n u_i\otimes u^*, \qquad (17)$$
and
$$Q \coloneqq R(B)\sum_{i=1}^n u_i\otimes x_i^*. \qquad (18)$$
Observe in passing that (16) implies that the objective function of the SDP is convex as a function of $u,x_1,\ldots,x_n$, and therefore we may assume that $\|u^*\|_2=1$ and $\|x_i^*\|_2=1$ for all $i$.

We shall now proceed with the analysis of our algorithm using the variant described in Remark 2.1. This will not create any additional complications, and will allow us to explain why there is no advantage in working with subsets of size $s\ge 4$. Recall the setting: for a fixed integer $s\in[2,k]$ we choose a subset $S\subseteq\{1,\ldots,k\}$ of cardinality $s$ which maximizes the quantity
$$\sum_{i\in S}\left\|v_i-\frac{1}{s}\sum_{j\in S}v_j\right\|_2^2.$$
Then we choose $s$ i.i.d. standard Gaussians $\{g_i\}_{i\in S}\subseteq\mathbb{R}^{n+1}$ and define $\sigma:\{1,\ldots,n\}\to\{1,\ldots,k\}$ by setting $\sigma(r)=i$ if $\langle g_i,x_r^*\rangle=\max_{j\in S}\langle g_j,x_r^*\rangle$.

Fix $i,j\in\{1,\ldots,n\}$. As proved by Frieze and Jerrum in [9] (see Lemma 5 there), we have³
$$\Pr[\sigma(i)=\sigma(j)] = \sum_{m=0}^\infty R_m(s)\langle x_i^*,x_j^*\rangle^m,$$
where the power series converges on $[-1,1]$ and all the coefficients $R_m(s)$ are non-negative. Moreover, $R_0(s)=\frac{1}{s}$ and
$$R_1(s) = \frac{1}{s-1}\left(\mathbb{E}\left[\max_{j\in S} g_j\right]\right)^2 = \frac{1}{s-1}\left(\frac{s}{(2\pi)^{s/2}}\int_{-\infty}^\infty xe^{-x^2/2}\left(\int_{-\infty}^x e^{-y^2/2}\,dy\right)^{s-1}dx\right)^2.$$
Note that, conditioned on the event $\sigma(i)=\sigma(j)$, the random index $\sigma(i)$ is uniformly distributed over $S$.

³ We are using here the fact that $x_1^*,\ldots,x_n^*$ are unit vectors.
Also, conditioned on the event $\sigma(i)\neq\sigma(j)$, the pair $(\sigma(i),\sigma(j))$ is uniformly distributed over all $s(s-1)$ pairs of distinct indices in $S$. Thus
$$\mathbb{E}\left[b_{\sigma(i)\sigma(j)}\right]=\Pr[\sigma(i)=\sigma(j)]\cdot\frac1s\sum_{\ell\in S}b_{\ell\ell}+\Pr[\sigma(i)\neq\sigma(j)]\cdot\frac{1}{s(s-1)}\sum_{\substack{\ell,t\in S\\ \ell\neq t}}b_{\ell t}.$$
Denote $\Phi=\frac1s\sum_{\ell\in S}b_{\ell\ell}$ and $\Psi=\frac{1}{s(s-1)}\sum_{\ell\neq t}b_{\ell t}$ (note that $\Phi,\Psi$ depend on the matrix $B$ as well as on the choice of the subset $S\subseteq\{1,\dots,k\}$). Thus
$$\mathbb{E}\left[b_{\sigma(i)\sigma(j)}\right]=\left(\sum_{m=0}^\infty R_m(s)\langle x_i^*,x_j^*\rangle^m\right)\cdot\Phi+\left(1-\sum_{m=0}^\infty R_m(s)\langle x_i^*,x_j^*\rangle^m\right)\cdot\Psi=\left(\Psi+(\Phi-\Psi)R_0(s)\right)+(\Phi-\Psi)\sum_{m=1}^\infty R_m(s)\langle x_i^*,x_j^*\rangle^m. \quad (19)$$
Write $v\coloneqq\frac1s\sum_{\ell\in S}v_\ell$. Observe that
$$\Psi+(\Phi-\Psi)R_0(s)=\|v\|_2^2. \quad (20)$$
Indeed, since $R_0(s)=1/s$ we have
$$\Psi+(\Phi-\Psi)R_0(s)=\left(1-\frac1s\right)\frac{1}{s(s-1)}\sum_{\substack{\ell,t\in S\\ \ell\neq t}}b_{\ell t}+\frac1s\cdot\frac1s\sum_{\ell\in S}b_{\ell\ell}=\frac{1}{s^2}\sum_{\ell,t\in S}b_{\ell t}=\left\|\frac1s\sum_{\ell\in S}v_\ell\right\|_2^2=\|v\|_2^2.$$
Moreover,
$$(s-1)(\Phi-\Psi)=\sum_{\ell\in S}\|v_\ell-v\|_2^2. \quad (21)$$
In particular, $\Phi-\Psi\ge0$. To prove (21) we simply expand:
$$\sum_{\ell\in S}\|v_\ell-v\|_2^2=\sum_{\ell\in S}\|v_\ell\|_2^2-s\|v\|_2^2=s\Phi-\frac1s\sum_{\ell,t\in S}b_{\ell t}=s\Phi-\frac1s\left(s\Phi+s(s-1)\Psi\right)=(s-1)(\Phi-\Psi).$$
Multiplying both sides of equation (19) by $a_{ij}$ and summing over $i,j\in\{1,\dots,n\}$, while using (20), we get
$$\mathbb{E}\left[\sum_{i,j=1}^n a_{ij}b_{\sigma(i)\sigma(j)}\right]=\|v\|_2^2\sum_{i,j=1}^n a_{ij}+(\Phi-\Psi)R_1(s)\sum_{i,j=1}^n a_{ij}\langle x_i^*,x_j^*\rangle+(\Phi-\Psi)\sum_{m=2}^\infty R_m(s)\sum_{i,j=1}^n a_{ij}\langle x_i^*,x_j^*\rangle^m. \quad (22)$$
Note that for every $m\ge1$ we have
$$\sum_{i,j=1}^n a_{ij}\langle x_i^*,x_j^*\rangle^m=\sum_{i,j=1}^n\langle u_i,u_j\rangle\left\langle(x_i^*)^{\otimes m},(x_j^*)^{\otimes m}\right\rangle=\left\|\sum_{i=1}^n u_i\otimes(x_i^*)^{\otimes m}\right\|_2^2\ge0. \quad (23)$$
Plugging (23) into (22), and using the facts that $\Phi-\Psi\ge0$ and that the coefficients $R_m(s)$ are non-negative, we conclude that
$$\mathbb{E}\left[\sum_{i,j=1}^n a_{ij}b_{\sigma(i)\sigma(j)}\right]\ge\|v\|_2^2\sum_{i,j=1}^n a_{ij}+(\Phi-\Psi)R_1(s)\sum_{i,j=1}^n a_{ij}\langle x_i^*,x_j^*\rangle. \quad (24)$$
We shall now use the assumption $\sum_{i,j=1}^n a_{ij}=0$ for the first time. In this case $P=0$ (see equations (16) and (17)), so that
$$\mathrm{SDP}=R(B)^2\sum_{i,j=1}^n a_{ij}\langle x_i^*,x_j^*\rangle. \quad (25)$$
Hence, using (24), (21) and (25), we get the bound
$$\mathbb{E}\left[\sum_{i,j=1}^n a_{ij}b_{\sigma(i)\sigma(j)}\right]\ge\frac{R_1(s)\sum_{\ell\in S}\|v_\ell-v\|_2^2}{(s-1)R(B)^2}\cdot\mathrm{SDP}\stackrel{(14)}{\ge}\frac{R_1(s)\sum_{\ell\in S}\|v_\ell-v\|_2^2}{(s-1)R(B)^2}\cdot\mathrm{Clust}(A|B). \quad (26)$$
The term $R_1(s)$ is studied in Section 3.1, where its geometric interpretation is explained. In particular, it follows from Corollary 3.6 and Corollary 3.4 that $R_1(s)<R_1(3)$ for every $s\ge4$, and that $R_1(2)=\frac1\pi$ and $R_1(3)=\frac{9}{8\pi}$. Hence the cases $s\in\{2,3\}$ of (26) conclude the proof of Theorem 2.1. Moreover, we see that for $s\ge4$ the lower bound in (26) is worse than the lower bound obtained in the case $s=3$. Indeed, we have already noted that in this case $R_1(s)<R_1(3)$. In addition,
$$\frac{1}{\binom{s}{3}}\sum_{\substack{T\subseteq S\\ |T|=3}}\frac12\sum_{\ell\in T}\left\|v_\ell-\frac13\sum_{t\in T}v_t\right\|_2^2=\frac1s\sum_{\ell\in S}\|v_\ell\|_2^2-\frac{1}{s(s-1)}\sum_{\substack{\ell,t\in S\\ \ell\neq t}}\langle v_\ell,v_t\rangle=\frac{1}{s-1}\sum_{\ell\in S}\left\|v_\ell-\frac1s\sum_{t\in S}v_t\right\|_2^2.$$
This implies that there exists $T\subseteq S$ with $|T|=3$ for which
$$\frac12\sum_{\ell\in T}\left\|v_\ell-\frac13\sum_{t\in T}v_t\right\|_2^2\ge\frac{1}{s-1}\sum_{\ell\in S}\|v_\ell-v\|_2^2,$$
so that for $s\ge4$ the lower bound in (26) is indeed inferior to the same lower bound with $s=3$.

It remains to deal with the case $\sum_{i,j=1}^n a_{ij}>0$, i.e., to prove Theorem 2.3.

Proof of Theorem 2.3. We slightly modify the algorithm that was studied in Theorem 2.1. Let $v_1,\dots,v_k$ and $p,q\in\{1,\dots,k\}$ be as before, that is, $b_{ij}=\langle v_i,v_j\rangle$ and $\|v_p-v_q\|_2=\max_{i,j\in\{1,\dots,k\}}\|v_i-v_j\|_2=D$, the diameter of the set $\{v_1,\dots,v_k\}\subseteq\mathbb{R}^k$.
Denote
$$w'\coloneqq\frac{v_p+v_q}{2}\qquad\text{and}\qquad R'(B)\coloneqq\max_{i\in\{1,\dots,k\}}\|v_i-w'\|_2.$$
We now consider the modified semidefinite program
$$\mathrm{SDP}\coloneqq\max\sum_{i,j=1}^n a_{ij}\cdot\left\langle\|w'\|_2u+R'(B)x_i,\|w'\|_2u+R'(B)x_j\right\rangle,$$
where the maximum is taken over all $u,x_1,\dots,x_n\in\mathbb{R}^{n+1}$ such that $\|u\|_2=1$ and $\|x_i\|_2\le1$ for all $i$. From now on we will use the notation of the proof of Theorem 2.1 with $w$ replaced by $w'$ and $R(B)$ replaced by $R'(B)$ (this slight abuse of notation will not create any confusion). As before, we let $g_1,g_2\in\mathbb{R}^{n+1}$ be i.i.d. standard Gaussian vectors and define $\sigma:\{1,\dots,n\}\to\{1,\dots,k\}$ by
$$\sigma(r)=\begin{cases}p&\text{if }\langle g_1,x_r\rangle\ge\langle g_2,x_r\rangle,\\ q&\text{if }\langle g_2,x_r\rangle>\langle g_1,x_r\rangle.\end{cases} \quad (27)$$
Note that the first place in the proof of Theorem 2.1 where the assumption that $A$ is centered was used is equation (25). Hence, in the present setting we still have the bounds
$$\mathrm{Clust}(A|B)\le\mathrm{SDP}=\|P+Q\|_2^2\le\left(\|P\|_2+\|Q\|_2\right)^2, \quad (28)$$
where $P$ and $Q$ are defined in (17) and (18) (with $w$ and $R(B)$ replaced by $w'$ and $R'(B)$, respectively). Also, it follows from (24) that
$$\mathbb{E}\left[\sum_{i,j=1}^n a_{ij}b_{\sigma(i)\sigma(j)}\right]\ge\|v\|_2^2\sum_{i,j=1}^n a_{ij}+\left(\|v_p-v\|_2^2+\|v_q-v\|_2^2\right)R_1(2)\sum_{i,j=1}^n a_{ij}\langle x_i^*,x_j^*\rangle, \quad (29)$$
where $v=\frac{v_p+v_q}{2}=w'$. Note that $\|v_p-v\|_2^2+\|v_q-v\|_2^2=\frac{D^2}{2}$, and recall that $R_1(2)=\frac1\pi$. Thus (29) becomes
$$\mathbb{E}\left[\sum_{i,j=1}^n a_{ij}b_{\sigma(i)\sigma(j)}\right]\ge\|w'\|_2^2\sum_{i,j=1}^n a_{ij}+\frac{D^2}{2\pi}\sum_{i,j=1}^n a_{ij}\langle x_i^*,x_j^*\rangle. \quad (30)$$
Note that
$$\|P\|_2^2=\|w'\|_2^2\cdot\left\|\sum_{i=1}^n u_i\otimes u^*\right\|_2^2=\|w'\|_2^2\cdot\|u^*\|_2^2\sum_{i,j=1}^n\langle u_i,u_j\rangle=\|w'\|_2^2\sum_{i,j=1}^n a_{ij} \quad (31)$$
and
$$\|Q\|_2^2=R'(B)^2\cdot\left\|\sum_{i=1}^n u_i\otimes x_i^*\right\|_2^2=R'(B)^2\sum_{i,j=1}^n a_{ij}\langle x_i^*,x_j^*\rangle. \quad (32)$$
Combining (28) and (30) with (31) and (32), we see that
$$\mathrm{Clust}(A|B)\le\frac{\left(\|P\|_2+\|Q\|_2\right)^2}{\|P\|_2^2+c\|Q\|_2^2}\cdot\mathbb{E}\left[\sum_{i,j=1}^n a_{ij}b_{\sigma(i)\sigma(j)}\right], \quad (33)$$
where $c=\frac{D^2}{2\pi R'(B)^2}$. The convexity of the function $x\mapsto x^2$ implies that
$$\left(\|P\|_2+\|Q\|_2\right)^2=\left(\frac{c}{c+1}\cdot\frac{c+1}{c}\|P\|_2+\left(1-\frac{c}{c+1}\right)(c+1)\|Q\|_2\right)^2\le\frac{c+1}{c}\|P\|_2^2+(c+1)\|Q\|_2^2=\left(1+\frac1c\right)\left(\|P\|_2^2+c\|Q\|_2^2\right).$$
Thus (33) implies that our algorithm achieves an approximation guarantee bounded above by
$$1+\frac1c=1+\frac{2\pi R'(B)^2}{D^2}=1+\frac{2\pi}{\|v_p-v_q\|_2^2}\cdot\max_{i\in\{1,\dots,k\}}\left\|v_i-\frac{v_p+v_q}{2}\right\|_2^2.$$
It remains to note that for every $i\in\{1,\dots,k\}$ we know that $\|v_i-v_p\|_2,\|v_i-v_q\|_2\le D$, and therefore $\|v_i-w'\|_2\le\frac{\sqrt3}{2}D$. This implies that our approximation guarantee is bounded from above by $1+\frac{3\pi}{2}$.

3 UGC hardness

3.1 Geometric preliminaries: Propeller problems

Let $\gamma_n$ be the standard Gaussian measure on $\mathbb{R}^n$. For any integer $k\ge2$ define
$$C(n,k)\coloneqq\sup\left\{\sum_{j=1}^k\left\|\int_{\mathbb{R}^n}xf_j(x)\,d\gamma_n(x)\right\|_2^2:\ f_1,\dots,f_k\in L_2(\gamma_n)\ \wedge\ \forall j,\ f_j\ge0\ \wedge\ \sum_{j=1}^k f_j\le1\right\}. \quad (34)$$
We first observe that the supremum in (34) is attained at a $k$-tuple of functions which corresponds to a partition of $\mathbb{R}^n$:

Lemma 3.1. There exist disjoint measurable sets $A_1,\dots,A_k\subseteq\mathbb{R}^n$ such that $A_1\cup A_2\cup\cdots\cup A_k=\mathbb{R}^n$ and
$$\sum_{j=1}^k\left\|\int_{A_j}x\,d\gamma_n(x)\right\|_2^2=C(n,k).$$
Proof. Let $H$ be the Hilbert space $L_2(\gamma_n)\oplus L_2(\gamma_n)\oplus\cdots\oplus L_2(\gamma_n)$ ($k$ times). Define $K\subseteq H$ to be the set of all $(f_1,\dots,f_k)\in H$ such that $f_j\ge0$ for all $j$ and $\sum_{j=1}^k f_j\le1$. Then $K$ is a closed, convex and bounded subset of $H$, and hence by the Banach-Alaoglu theorem it is weakly compact. The mapping $\psi:K\to\mathbb{R}$ given by
$$\psi(f_1,\dots,f_k)\coloneqq\sum_{j=1}^k\left\|\int_{\mathbb{R}^n}xf_j(x)\,d\gamma_n(x)\right\|_2^2=\sum_{j=1}^k\sum_{i=1}^n\left(\int_{\mathbb{R}^n}x_if_j(x)\,d\gamma_n(x)\right)^2$$
is weakly continuous, since each coordinate function $x_i$ lies in $L_2(\gamma_n)$.
Hence $\psi$ attains its maximum on $K$, say at $(f_1^*,\dots,f_k^*)\in K$. Define $z_j\coloneqq\int_{\mathbb{R}^n}xf_j^*(x)\,d\gamma_n(x)\in\mathbb{R}^n$ and let
$$w\coloneqq-\sum_{j=1}^k z_j=\int_{\mathbb{R}^n}x\left(1-\sum_{j=1}^k f_j^*(x)\right)d\gamma_n(x).$$
Note that
$$\frac1k\sum_{i=1}^k\left(\sum_{\substack{1\le j\le k\\ j\neq i}}\|z_j\|_2^2+\|z_i+w\|_2^2\right)=\sum_{j=1}^k\|z_j\|_2^2+\left(1-\frac2k\right)\|w\|_2^2\ge\sum_{j=1}^k\|z_j\|_2^2,$$
which implies the existence of $i\in\{1,\dots,k\}$ for which
$$\sum_{\substack{1\le j\le k\\ j\neq i}}\|z_j\|_2^2+\|z_i+w\|_2^2\ge\sum_{j=1}^k\|z_j\|_2^2.$$
Hence, if we define for $j\in\{1,\dots,k\}$
$$g_j\coloneqq\begin{cases}f_j^*&j\neq i,\\ f_i^*+1-\sum_{r=1}^k f_r^*&j=i,\end{cases}$$
then $(g_1,\dots,g_k)\in K$, and
$$C(n,k)\ge\sum_{j=1}^k\left\|\int_{\mathbb{R}^n}xg_j(x)\,d\gamma_n(x)\right\|_2^2=\sum_{\substack{1\le j\le k\\ j\neq i}}\|z_j\|_2^2+\|z_i+w\|_2^2\ge\sum_{j=1}^k\|z_j\|_2^2=C(n,k).$$
So
$$\sum_{j=1}^k\left\|\int_{\mathbb{R}^n}xg_j(x)\,d\gamma_n(x)\right\|_2^2=C(n,k).$$
Note that $\sum_{j=1}^k g_j=1$, so we can define a random partition $A_1,\dots,A_k$ of $\mathbb{R}^n$ as follows: let $\{s_x\}_{x\in\mathbb{R}^n}$ be independent random variables taking values in $\{1,\dots,k\}$ such that $\Pr(s_x=j)=g_j(x)$, and define $A_j\coloneqq\{x\in\mathbb{R}^n:\ s_x=j\}$. Then by convexity and the definition of $C(n,k)$ we see that
$$\mathbb{E}\left[\sum_{j=1}^k\left\|\int_{A_j}x\,d\gamma_n(x)\right\|_2^2\right]\ge\sum_{j=1}^k\left\|\int_{\mathbb{R}^n}\mathbb{E}\left[\mathbf{1}_{A_j}(x)\right]x\,d\gamma_n(x)\right\|_2^2=\sum_{j=1}^k\left\|\int_{\mathbb{R}^n}xg_j(x)\,d\gamma_n(x)\right\|_2^2=C(n,k).$$
It therefore follows that there exists a partition as required.

Lemma 3.2. If $n\ge k-1$ then $C(n,k)=C(k-1,k)$, and if $n<k-1$ then $C(n,k)=C(n,n+1)$.

Proof. Assume first that $n\ge k-1$. The inequality $C(n,k)\ge C(k-1,k)$ is easy: for every $f_1,\dots,f_k\in L_2(\gamma_{k-1})$ which satisfy $f_j\ge0$ for all $j\in\{1,\dots,k\}$ and $f_1+\cdots+f_k\le1$, we can define $\widetilde f_1,\dots,\widetilde f_k:\mathbb{R}^n=\mathbb{R}^{k-1}\times\mathbb{R}^{n-k+1}\to\mathbb{R}$ by $\widetilde f_j(x,y)=f_j(x)$. Then $\widetilde f_1,\dots,\widetilde f_k\in L_2(\gamma_n)$, $\widetilde f_j\ge0$, $\widetilde f_1+\cdots+\widetilde f_k\le1$, and $\sum_{j=1}^k\left\|\int_{\mathbb{R}^{k-1}}xf_j(x)\,d\gamma_{k-1}(x)\right\|_2^2=\sum_{j=1}^k\left\|\int_{\mathbb{R}^n}x\widetilde f_j(x)\,d\gamma_n(x)\right\|_2^2$.
In the reverse direction, by Lemma 3.1 there is a measurable partition $A_1,\dots,A_k$ of $\mathbb{R}^n$ such that if we define $z_j\coloneqq\int_{A_j}x\,d\gamma_n(x)\in\mathbb{R}^n$ then $\sum_{j=1}^k\|z_j\|_2^2=C(n,k)$. Note that
$$\sum_{j=1}^k z_j=\int_{\mathbb{R}^n}\left(\sum_{j=1}^k\mathbf{1}_{A_j}\right)x\,d\gamma_n(x)=\int_{\mathbb{R}^n}x\,d\gamma_n(x)=0.$$
Hence the dimension $d$ of the subspace $V\coloneqq\mathrm{span}\{z_1,\dots,z_k\}$ is at most $k-1$. Define $g_1,\dots,g_k:V\to[0,1]$ by
$$g_j(x)=\gamma_{V^\perp}\left((A_j-x)\cap V^\perp\right).$$
Then $g_1+\cdots+g_k=1$, so that
$$C(k-1,k)\ge C(d,k)\ge\sum_{j=1}^k\left\|\int_V xg_j(x)\,d\gamma_V(x)\right\|_2^2=\sum_{j=1}^k\left\|\int_V\int_{V^\perp}\mathbf{1}_{A_j}(x+y)\,x\,d\gamma_{V^\perp}(y)\,d\gamma_V(x)\right\|_2^2=\sum_{j=1}^k\left\|\int_{A_j}\mathrm{Proj}_V(w)\,d\gamma_n(w)\right\|_2^2=\sum_{j=1}^k\left\|\mathrm{Proj}_V(z_j)\right\|_2^2=\sum_{j=1}^k\|z_j\|_2^2=C(n,k).$$
We now pass to the case $n<k-1$. The inequality $C(n,n+1)\le C(n,k)$ is trivial, so we need to show that $C(n,k)\le C(n,n+1)$. We observe that since $k>n+1$, for every $v_1,\dots,v_k\in\mathbb{R}^n$ there exist two distinct indices $i,j\in\{1,\dots,k\}$ such that $\langle v_i,v_j\rangle\ge0$. The proof of this fact is by induction on $n$. If $n=1$ then our assumption is that $k\ge3$, and therefore at least two of the real numbers $v_1,\dots,v_k$ must have the same sign. For $n>1$ we may assume that $\langle v_1,v_j\rangle<0$ for all $j\ge2$ (otherwise we are done). Consider the vectors
$$\left\{v_j-\frac{\langle v_1,v_j\rangle}{\|v_1\|_2^2}\cdot v_1\right\}_{j=2}^k,$$
i.e., the projections of $v_2,\dots,v_k$ onto the orthogonal complement of $v_1$. By induction there are distinct $i,j\in\{2,\dots,k\}$ such that
$$0\le\left\langle v_i-\frac{\langle v_1,v_i\rangle}{\|v_1\|_2^2}\cdot v_1,\ v_j-\frac{\langle v_1,v_j\rangle}{\|v_1\|_2^2}\cdot v_1\right\rangle=\langle v_i,v_j\rangle-\frac{\langle v_i,v_1\rangle\langle v_j,v_1\rangle}{\|v_1\|_2^2}\le\langle v_i,v_j\rangle.$$
Now, let $A_1,\dots,A_k$ be a partition of $\mathbb{R}^n$ as in Lemma 3.1 and denote $z_j\coloneqq\int_{A_j}x\,d\gamma_n(x)\in\mathbb{R}^n$.
By the above argument there are distinct $i,j\in\{1,\dots,k\}$ such that $\langle z_i,z_j\rangle\ge0$. Hence
$$C(n,k-1)\ge\sum_{\substack{1\le\ell\le k\\ \ell\notin\{i,j\}}}\left\|\int_{A_\ell}x\,d\gamma_n(x)\right\|_2^2+\left\|\int_{A_i\cup A_j}x\,d\gamma_n(x)\right\|_2^2=\sum_{\substack{1\le\ell\le k\\ \ell\notin\{i,j\}}}\|z_\ell\|_2^2+\|z_i+z_j\|_2^2\ge\sum_{\ell=1}^k\|z_\ell\|_2^2=C(n,k)\ge C(n,k-1).$$
So $C(n,k)=C(n,k-1)$, and the required identity follows by induction.

In light of Lemma 3.2 we denote from now on $C(k)\coloneqq C(k-1,k)$. Given distinct $z_1,\dots,z_k\in\mathbb{R}^{k-1}$ and $j\in\{1,\dots,k\}$, define a set $P_j(z_1,\dots,z_k)\subseteq\mathbb{R}^{k-1}$ by
$$P_j(z_1,\dots,z_k)\coloneqq\left\{x\in\mathbb{R}^{k-1}:\ \langle x,z_j\rangle=\max_{i\in\{1,\dots,k\}}\langle x,z_i\rangle\right\}.$$
Thus $\left\{P_j(z_1,\dots,z_k)\right\}_{j=1}^k$ is a partition of $\mathbb{R}^{k-1}$, which we call the simplicial partition induced by $z_1,\dots,z_k$ (strictly speaking the elements of this partition are not disjoint, but they intersect in sets of measure 0).

Lemma 3.3. Let $A_1,\dots,A_k\subseteq\mathbb{R}^{k-1}$ be a partition as in Lemma 3.1, i.e., if we set $z_j\coloneqq\int_{A_j}x\,d\gamma_{k-1}(x)$ then $C(k)=\sum_{j=1}^k\|z_j\|_2^2$. Assume also that this partition is minimal in the sense that the number of elements of positive measure in it is minimal among all possible partitions from Lemma 3.1. By relabeling we may assume without loss of generality that for some $1\le\ell\le k$ we have $\gamma_{k-1}(A_1),\dots,\gamma_{k-1}(A_\ell)>0$ and $\gamma_{k-1}(A_{\ell+1})=\cdots=\gamma_{k-1}(A_k)=0$. Then, up to an orthogonal transformation, $z_1,\dots,z_\ell\in\mathbb{R}^{\ell-1}$; for any distinct $i,j\in\{1,\dots,\ell\}$ we have $\langle z_i,z_j\rangle<0$; and for each $j\in\{1,\dots,\ell\}$ we have $A_j=P_j(z_1,\dots,z_\ell)\times\mathbb{R}^{k-\ell}$ up to sets of measure zero.

Proof. Since $\mathbf{1}_{A_1}+\cdots+\mathbf{1}_{A_\ell}=1$ almost everywhere, we have $z_1+\cdots+z_\ell=0$. Thus the dimension of the span of $z_1,\dots,z_\ell$ is at most $\ell-1$, and by applying an orthogonal transformation we may assume that $z_1,\dots,z_\ell\in\mathbb{R}^{\ell-1}$.
Also, if for some distinct $i,j\in\{1,\dots,\ell\}$ we had $\langle z_i,z_j\rangle\ge0$, we could replace $A_i$ by $A_i\cup A_j$ and $A_j$ by the empty set, obtaining a partition of $\mathbb{R}^{k-1}$ which contains exactly $\ell-1$ elements of positive measure and for which
$$C(k)\ge\sum_{\substack{1\le r\le k\\ r\notin\{i,j\}}}\left\|\int_{A_r}x\,d\gamma_{k-1}(x)\right\|_2^2+\left\|\int_{A_i\cup A_j}x\,d\gamma_{k-1}(x)\right\|_2^2=\sum_{\substack{1\le r\le k\\ r\notin\{i,j\}}}\|z_r\|_2^2+\|z_i+z_j\|_2^2\ge\sum_{r=1}^k\|z_r\|_2^2=C(k).$$
This contradicts the minimality of the partition $A_1,\dots,A_k$. Note that the above reasoning implies in particular that the vectors $z_1,\dots,z_\ell$ are distinct, and therefore $\left\{P_j(z_1,\dots,z_\ell)\times\mathbb{R}^{k-\ell}\right\}_{j=1}^\ell$ is a partition of $\mathbb{R}^{k-1}$ (up to sets of measure 0). Assume for the sake of contradiction that there exists $i\in\{1,\dots,\ell\}$ such that
$$\gamma_{k-1}\left(A_i\setminus\left(P_i(z_1,\dots,z_\ell)\times\mathbb{R}^{k-\ell}\right)\right)>0.$$
Note that up to sets of measure 0 we have
$$A_i\setminus\left(P_i(z_1,\dots,z_\ell)\times\mathbb{R}^{k-\ell}\right)=\bigcup_{\substack{j\in\{1,\dots,\ell\}\\ j\neq i}}\ \bigcup_{m=1}^\infty\left\{x\in A_i:\ \langle x,z_j\rangle\ge\langle x,z_i\rangle+\frac1m\right\}.$$
Hence there exist $m>0$ and $j\in\{1,\dots,\ell\}\setminus\{i\}$ such that if we denote $E\coloneqq\left\{x\in A_i:\ \langle x,z_j\rangle\ge\langle x,z_i\rangle+\frac1m\right\}$ then $\gamma_{k-1}(E)>0$. Define a partition $\widetilde A_1,\dots,\widetilde A_k$ of $\mathbb{R}^{k-1}$ by
$$\widetilde A_r\coloneqq\begin{cases}A_r&r\notin\{i,j\},\\ A_i\setminus E&r=i,\\ A_j\cup E&r=j.\end{cases}$$
Then for $w\coloneqq\int_E x\,d\gamma_{k-1}(x)$ we have
$$C(k)\ge\sum_{r=1}^k\left\|\int_{\widetilde A_r}x\,d\gamma_{k-1}(x)\right\|_2^2=\sum_{\substack{1\le r\le k\\ r\notin\{i,j\}}}\|z_r\|_2^2+\|z_i-w\|_2^2+\|z_j+w\|_2^2=\sum_{r=1}^k\|z_r\|_2^2+2\|w\|_2^2+2\langle z_j,w\rangle-2\langle z_i,w\rangle\ge C(k)+2\|w\|_2^2+2\int_E\left(\langle z_j,x\rangle-\langle z_i,x\rangle\right)d\gamma_{k-1}(x)\ge C(k)+\frac{2\gamma_{k-1}(E)}{m}>C(k),$$
a contradiction.

Corollary 3.4. We have $C(2)=\frac1\pi$ and $C(3)=\frac{9}{8\pi}$.

Proof. Note that Lemma 3.3 implies that for each $k\ge2$ there exists a partition $A_1,\dots,A_k$ of $\mathbb{R}^{k-1}$ such that each $A_j$ is a cone and $C(k)=\sum_{j=1}^k\left\|\int_{A_j}x\,d\gamma_{k-1}(x)\right\|_2^2$.
When $k=2$ the only such partition of $\mathbb{R}$ consists of the positive and negative half-lines. Thus
$$C(2)=2\left(\frac{1}{\sqrt{2\pi}}\int_0^\infty xe^{-x^2/2}\,dx\right)^2=\frac1\pi.$$
When $k=3$ the partition $A_1,A_2,A_3$ consists of disjoint cones with angles $\alpha_1,\alpha_2,\alpha_3\in[0,2\pi]$, respectively, where $\alpha_1+\alpha_2+\alpha_3=2\pi$. Now, for $j\in\{1,2,3\}$ we have
$$\left\|\int_{A_j}x\,d\gamma_2(x)\right\|_2^2=\left|\frac{1}{2\pi}\int_0^\infty\int_{-\alpha_j/2}^{\alpha_j/2}e^{i\theta}r^2e^{-r^2/2}\,d\theta\,dr\right|^2=\frac{\sin^2(\alpha_j/2)}{2\pi}.$$
Hence
$$C(3)=\frac{1}{2\pi}\max\left\{\sin^2(\alpha_1/2)+\sin^2(\alpha_2/2)+\sin^2(\alpha_3/2):\ \alpha_1,\alpha_2,\alpha_3\in[0,2\pi]\ \wedge\ \alpha_1+\alpha_2+\alpha_3=2\pi\right\}=\frac{3}{2\pi}\cdot\sin^2\left(\frac\pi3\right)=\frac{9}{8\pi}, \quad (35)$$
where (35) follows from a simple Lagrange multiplier argument.

It is tempting to believe that for every $k\ge2$, $C(k)$ is attained at a regular simplicial partition, i.e., a partition of $\mathbb{R}^{k-1}$ of the form $\{P_j(v_1,\dots,v_k)\}_{j=1}^k$, where $v_1,\dots,v_k$ are the vertices of the regular simplex in $\mathbb{R}^{k-1}$. This was shown to be true for $k\in\{2,3\}$ in Corollary 3.4. We will now show that this is not the case for $k\ge4$.

Lemma 3.5. Let $v_1,v_2,\dots,v_k\in\mathbb{R}^{k-1}$ be the vertices of a regular simplex in $\mathbb{R}^{k-1}$, i.e., for each $i\in\{1,\dots,k\}$ we have $\|v_i\|_2=1$ and for all distinct $i,j\in\{1,\dots,k\}$ we have $\langle v_i,v_j\rangle=-\frac{1}{k-1}$. Let
$$z_i\coloneqq\int_{P_i(v_1,\dots,v_k)}x\,d\gamma_{k-1}(x).$$
Then
$$\sum_{i=1}^k\|z_i\|_2^2=\frac{1}{k-1}\left(\mathbb{E}\left[\max_{j\in\{1,\dots,k\}}g_j\right]\right)^2,$$
where $g_1,g_2,\dots,g_k$ are independent standard Gaussian random variables.

Proof. By symmetry all the $z_i$ have the same length $r>0$, and $z_i$ has the same direction as $v_i$. Thus for all $i$ we have $\langle z_i,v_i\rangle=r$. Now,
$$\sum_{i=1}^k\langle z_i,v_i\rangle=\sum_{i=1}^k\int_{P_i(v_1,\dots,v_k)}\langle x,v_i\rangle\,d\gamma_{k-1}(x)=\sum_{i=1}^k\int_{P_i(v_1,\dots,v_k)}\left(\max_{j\in\{1,\dots,k\}}\langle x,v_j\rangle\right)d\gamma_{k-1}(x)=\int_{\mathbb{R}^{k-1}}\left(\max_{j\in\{1,\dots,k\}}\langle x,v_j\rangle\right)d\gamma_{k-1}(x)=\mathbb{E}\left[\max_{j\in\{1,\dots,k\}}h_j\right],$$
where $h_1,\dots,h_k$ are standard Gaussian random variables with covariances $\mathbb{E}[h_ih_j]=\langle v_i,v_j\rangle$.
Let $h$ be a standard Gaussian random variable which is independent of $h_1,\dots,h_k$. Then
$$\mathbb{E}\left[\max_{j\in\{1,\dots,k\}}h_j\right]=\mathbb{E}\left[\frac{h}{\sqrt{k-1}}+\max_{j\in\{1,\dots,k\}}h_j\right]=\mathbb{E}\left[\max_{j\in\{1,\dots,k\}}\left(\frac{h}{\sqrt{k-1}}+h_j\right)\right]=\mathbb{E}\left[\max_{j\in\{1,\dots,k\}}\widetilde h_j\right], \quad (36)$$
where we set $\widetilde h_j\coloneqq\frac{h}{\sqrt{k-1}}+h_j$, so that the $\widetilde h_j$ are independent Gaussians with mean zero and variance $\frac{k}{k-1}$. The last term in (36) equals $\sqrt{\frac{k}{k-1}}\cdot\mathbb{E}\left[\max_{j\in\{1,\dots,k\}}g_j\right]$, where $g_1,\dots,g_k$ are independent standard Gaussians.

Corollary 3.6. For $k\ge2$ denote
$$R(k)\coloneqq\frac{1}{k-1}\left(\mathbb{E}\left[\max_{j\in\{1,\dots,k\}}g_j\right]\right)^2=\frac{1}{k-1}\left(\frac{k}{(2\pi)^{k/2}}\int_{-\infty}^\infty xe^{-x^2/2}\left(\int_{-\infty}^x e^{-y^2/2}\,dy\right)^{k-1}dx\right)^2. \quad (37)$$
Then for every integer $k\in\{2,4,5,\dots\}$ we have $R(k)<R(3)=\frac{9}{8\pi}$. Thus, if $v_1^k,\dots,v_k^k$ are the vertices of the regular simplex in $\mathbb{R}^{k-1}$, then for $k\ge4$ we have
$$\sum_{j=1}^k\left\|\int_{P_j(v_1^k,\dots,v_k^k)}x\,d\gamma_{k-1}(x)\right\|_2^2<\sum_{j=1}^3\left\|\int_{P_j(v_1^3,v_2^3,v_3^3)\times\mathbb{R}^{k-3}}x\,d\gamma_{k-1}(x)\right\|_2^2.$$
Proof. It follows from Corollary 3.4 that $R(3)=C(3)=\frac{9}{8\pi}$. We require a crude bound on $R(k)$. An application of Stirling's formula shows that for $p\ge2$ we have
$$\left(\mathbb{E}\left[|g_1|^p\right]\right)^{1/p}=\left(\frac{2^{p/2}}{\sqrt\pi}\,\Gamma\left(\frac{p+1}{2}\right)\right)^{1/p}\le\sqrt{\frac p2}.$$
Hence
$$R(k)\le\frac{1}{k-1}\left(\mathbb{E}\left[\max_{j\in\{1,\dots,k\}}|g_j|\right]\right)^2\le\frac{1}{k-1}\left(\mathbb{E}\left[\left(\sum_{j=1}^k|g_j|^p\right)^{1/p}\right]\right)^2\le\frac{1}{k-1}\left(\sum_{j=1}^k\mathbb{E}\left[|g_j|^p\right]\right)^{2/p}\le\frac{1}{k-1}\cdot k^{2/p}\cdot\frac p2.$$
Choosing $p=2\log k\ge2\log4>2$, we see that
$$R(k)\le\frac{e\log k}{k-1}. \quad (38)$$
The function $k\mapsto\frac{\log k}{k-1}$ is decreasing on $[4,\infty)$, and therefore a direct computation using (38) shows that $R(k)<\frac{9}{8\pi}$ for $k\ge26$. For $k\le25$ one can compute the integral in (37) numerically (say, using Maple) and get the following values: $R(4)=0.3532045529$, $R(5)=0.3381215916$, $R(6)=0.3211623921$, $R(7)=0.3047310600$, $R(8)=0.2895196903$,
$R(9)=0.2756580116$, $R(10)=0.2630844408$, $R(11)=0.2516780298$, $R(12)=0.2413075184$, $R(13)=0.2318492693$, $R(14)=0.2231929784$, $R(15)=0.2152425349$, $R(16)=0.2079150401$, $R(17)=0.2011392394$, $R(18)=0.1948538849$, $R(19)=0.1890062248$, $R(20)=0.1835506894$, $R(21)=0.1784477705$, $R(22)=0.1736630840$, $R(23)=0.1691665868$, $R(24)=0.1649319261$, $R(25)=0.1609358965$. Since $R(3)=0.3580986219$, it follows that $R(k)<R(3)$ for every integer $k\in[4,25]$ as well.

We conjecture that $C(k)\le C(3)$ for every integer $k\ge2$. For future reference we end this section with the following alternative characterization of $C(k)$:

Lemma 3.7. We have the following identity:
$$C(k)=\sup\left\{\frac{\left(\mathbb{E}\left[\max_{j\in\{1,\dots,k\}}g_j\right]\right)^2}{\sum_{j=1}^k\mathbb{E}\left[g_j^2\right]}:\ (g_1,\dots,g_k)\in\mathbb{R}^k\ \text{a mean-zero Gaussian vector}\right\}. \quad (39)$$
Proof. First we show that $C(k)$ is at most the right-hand side of (39). We know that there exists a partition $A_1,\dots,A_k$ of $\mathbb{R}^{k-1}$ such that if we write $z_i\coloneqq\int_{A_i}x\,d\gamma_{k-1}(x)$ then $A_j=P_j(z_1,\dots,z_k)$ for all $j\in\{1,\dots,k\}$ and $C(k)=\sum_{j=1}^k\|z_j\|_2^2$. Now,
$$C(k)=\sum_{j=1}^k\|z_j\|_2^2=\sum_{j=1}^k\int_{P_j(z_1,\dots,z_k)}\langle x,z_j\rangle\,d\gamma_{k-1}(x)=\sum_{j=1}^k\int_{P_j(z_1,\dots,z_k)}\left(\max_{i\in\{1,\dots,k\}}\langle x,z_i\rangle\right)d\gamma_{k-1}(x)=\int_{\mathbb{R}^{k-1}}\left(\max_{i\in\{1,\dots,k\}}\langle x,z_i\rangle\right)d\gamma_{k-1}(x)=\mathbb{E}\left[\max_{j\in\{1,\dots,k\}}h_j\right], \quad (40)$$
where in (40) $h_1,\dots,h_k$ are mean-zero Gaussians with covariances $\mathbb{E}[h_ih_j]=\langle z_i,z_j\rangle$. Thus
$$C(k)=\frac{\left(\mathbb{E}\left[\max_{j\in\{1,\dots,k\}}h_j\right]\right)^2}{\sum_{j=1}^k\mathbb{E}\left[h_j^2\right]},$$
which implies the desired upper bound on $C(k)$. For the other direction, fix a mean-zero Gaussian vector $(g_1,\dots,g_k)\in\mathbb{R}^k$ and let $v_1,v_2,\dots,v_k\in\mathbb{R}^k$ be vectors such that $\mathbb{E}[g_ig_j]=\langle v_i,v_j\rangle$ for all $i,j\in\{1,\dots,k\}$.
For $i\in\{1,\dots,k\}$ let $w_i\coloneqq\int_{P_i(v_1,\dots,v_k)}x\,d\gamma_{k-1}(x)$. Now, by Cauchy-Schwarz,
$$\sqrt{\left(\sum_{i=1}^k\|w_i\|_2^2\right)\left(\sum_{i=1}^k\|v_i\|_2^2\right)}\ge\sum_{i=1}^k\langle w_i,v_i\rangle=\sum_{i=1}^k\int_{P_i(v_1,\dots,v_k)}\langle x,v_i\rangle\,d\gamma_{k-1}(x)=\sum_{i=1}^k\int_{P_i(v_1,\dots,v_k)}\left(\max_{j\in\{1,\dots,k\}}\langle x,v_j\rangle\right)d\gamma_{k-1}(x)=\int_{\mathbb{R}^{k-1}}\left(\max_{j\in\{1,\dots,k\}}\langle x,v_j\rangle\right)d\gamma_{k-1}(x)=\mathbb{E}\left[\max_{j\in\{1,\dots,k\}}g_j\right].$$
Therefore,
$$C(k)\ge\sum_{i=1}^k\|w_i\|_2^2\ge\frac{\left(\mathbb{E}\left[\max_{j\in\{1,\dots,k\}}g_j\right]\right)^2}{\sum_{j=1}^k\|v_j\|_2^2}=\frac{\left(\mathbb{E}\left[\max_{j\in\{1,\dots,k\}}g_j\right]\right)^2}{\sum_{j=1}^k\mathbb{E}\left[g_j^2\right]}.$$
This completes the proof of (39).

3.2 Dictatorships vs. functions with small influences

In what follows all functions are assumed to be measurable, and we use the notation $[k]\coloneqq\{1,\dots,k\}$. In this section we will associate to every function from $\{1,\dots,k\}^n$ to
$$\Delta_k\coloneqq\left\{x\in\mathbb{R}^k:\ \forall i\in[k],\ x_i\ge0\ \wedge\ \sum_{i=1}^k x_i\le1\right\}$$
a numerical parameter, or "objective value". We will show that the value of this parameter for functions which depend only on a single coordinate (i.e., dictatorships) differs markedly from its value on functions which do not depend significantly on any particular coordinate (i.e., functions with small influences). This step is an analogue of the "dictatorship test" which is prevalent in PCP-based hardness proofs.

We begin with some notation and preliminaries on Fourier-type expansions. For any function $f:\mathbb{R}^n\to\Delta_k$ we write $f=(f_1,f_2,\dots,f_k)$, where $f_i:\mathbb{R}^n\to[0,1]$ and $\sum_{i=1}^k f_i\le1$. With this notation we have
$$C(k)=\sup_{f:\mathbb{R}^{k-1}\to\Delta_k}\sum_{i=1}^k\left\|\int_{\mathbb{R}^{k-1}}xf_i(x)\,d\gamma_{k-1}(x)\right\|_2^2,$$
where $C(k)$ is as in Section 3.1. We have already seen that the supremum above is actually attained, and that at the supremum we have $\sum_{i=1}^k f_i=1$. Also, $C(k)$ remains the same if the supremum is taken over functions on $\mathbb{R}^n$ with $n\ge k-1$, i.e., for every $n\ge k-1$,
$$C(k)=\sup_{f:\mathbb{R}^n\to\Delta_k}\sum_{i=1}^k\left\|\int_{\mathbb{R}^n}xf_i(x)\,d\gamma_n(x)\right\|_2^2.$$
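For $k=3$ this supremum is the propeller value of Corollary 3.4: over cone partitions of the plane the objective equals $\frac{1}{2\pi}\sum_{j=1}^3\sin^2(\alpha_j/2)$, maximized at three equal angles $2\pi/3$. A quick grid search reproducing $C(3)=9/(8\pi)$ (the discretization is our own illustration, not part of the paper):

```python
import math

def propeller_objective(a1, a2):
    # cone partition of R^2 with angles a1, a2 and a3 = 2*pi - a1 - a2
    a3 = 2 * math.pi - a1 - a2
    return (math.sin(a1 / 2) ** 2 + math.sin(a2 / 2) ** 2
            + math.sin(a3 / 2) ** 2) / (2 * math.pi)

step = 0.01
best_val, best_arg = max(
    (propeller_objective(i * step, j * step), (i * step, j * step))
    for i in range(1, 629) for j in range(1, 629 - i)
)
# the maximum is C(3) = 9/(8*pi) ~ 0.3581, attained near a1 = a2 = a3 = 2*pi/3
assert abs(best_val - 9 / (8 * math.pi)) < 1e-4
assert abs(best_arg[0] - 2 * math.pi / 3) < 0.02
assert abs(best_arg[1] - 2 * math.pi / 3) < 0.02
```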
Let $(\Omega=[k],\mu)$ be a probability space, $\mu$ being the uniform measure, and let $(\Omega^n,\mu^n)$ be the product space. We will be analyzing functions $f:\Omega^n\to\Delta_k$ (and more generally into $\mathbb{R}^k$). Fix a basis of orthonormal random variables on $\Omega$, one of which is the constant $1$: $\{X_0,X_1,\dots,X_{k-1}\}$, where each $X_i:\Omega\to\mathbb{R}$, $X_0\equiv1$, and $\mathbb{E}_{\omega\in\Omega}[X_i(\omega)X_j(\omega)]$ equals $0$ for $i\neq j$ and $1$ for $i=j$. Then any function $f:\Omega\to\mathbb{R}$ can be written as a linear combination of the $X_i$'s. In order to analyze functions $f:\Omega^n\to\mathbb{R}$, we let $X=(X_1,X_2,\dots,X_n)$ be an "ensemble" of random variables, where for $1\le i\le n$, $X_i=\{X_{i,0},X_{i,1},\dots,X_{i,k-1}\}$, and for every $i$, $\{X_{i,j}\}_{j=0}^{k-1}$ are independent copies of $\{X_j\}_{j=0}^{k-1}$.

Any $\sigma=(\sigma_1,\sigma_2,\dots,\sigma_n)\in\{0,1,2,\dots,k-1\}^n$ will be called a multi-index. We shall denote by $|\sigma|$ the number of non-zero entries of $\sigma$. Each multi-index defines a monomial $x_\sigma:=\prod_{i\in[n],\,\sigma_i\neq0}x_{i,\sigma_i}$ on a set of $n(k-1)$ indeterminates $\{x_{ij}\mid i\in[n],\ j\in\{1,2,\dots,k-1\}\}$, and also a random variable $X_\sigma:\Omega^n\to\mathbb{R}$ given by
$$X_\sigma(\omega):=\prod_{i=1}^n X_{i,\sigma_i}(\omega_i).$$
It is easy to see that the random variables $\{X_\sigma\}_\sigma$ form an orthonormal basis for the space of functions $f:\Omega^n\to\mathbb{R}$. Thus, every such $f$ can be written uniquely as (the "Fourier expansion")
$$f=\sum_\sigma\widehat f(\sigma)X_\sigma,\qquad\widehat f(\sigma)\in\mathbb{R}.$$
We denote the corresponding multilinear polynomial by $Q_f=\sum_\sigma\widehat f(\sigma)x_\sigma$. One can think of $f$ as the polynomial $Q_f$ applied to the ensemble $X$, i.e., $f=Q_f(X)$. Of course, one can also apply $Q_f$ to any other ensemble, and specifically to the Gaussian ensemble $G=(G_1,G_2,\dots,G_n)$, where $G_i=\{G_{i,0}\equiv1,G_{i,1},\dots,G_{i,k-1}\}$ and the $G_{i,j}$, $i\in[n]$, $1\le j\le k-1$, are i.i.d. standard Gaussians.
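The ensemble formalism can be made concrete for small $k$ and $n$. The sketch below (our own minimal construction, not code from the paper) builds an orthonormal basis $\{X_0\equiv1,X_1,\dots,X_{k-1}\}$ on $\Omega=[k]$ by Gram-Schmidt and recovers the Fourier coefficients $\widehat f(\sigma)=\mathbb{E}[f\,X_\sigma]$ of a function on $\Omega^n$; Parseval's identity serves as the sanity check:

```python
import itertools

def orthonormal_basis(k):
    # X_0 = 1, then Gram-Schmidt of coordinate indicators under the uniform measure
    basis = [[1.0] * k]
    for i in range(1, k):
        v = [1.0 if w == i else 0.0 for w in range(k)]
        for b in basis:
            c = sum(v[w] * b[w] for w in range(k)) / k      # inner product E[v * b]
            v = [v[w] - c * b[w] for w in range(k)]
        norm = (sum(x * x for x in v) / k) ** 0.5
        basis.append([x / norm for x in v])
    return basis

def fourier_coeffs(f, k, n):
    # f: dict from omega in [k]^n to reals; returns hat f(sigma) = E[f * X_sigma]
    X = orthonormal_basis(k)
    coeffs = {}
    for sigma in itertools.product(range(k), repeat=n):
        total = 0.0
        for omega, val in f.items():
            x_sigma = 1.0
            for i in range(n):
                x_sigma *= X[sigma[i]][omega[i]]
            total += val * x_sigma
        coeffs[sigma] = total / k ** n
    return coeffs

# example: k = 3, n = 2; f is the coordinate f_1 of the dictatorship on coordinate 0
k, n = 3, 2
f1 = {om: float(om[0] == 1) for om in itertools.product(range(k), repeat=n)}
c = fourier_coeffs(f1, k, n)
parseval = sum(v * v for v in c.values())            # should equal E[f1^2] = 1/3
level1 = sum(v * v for s, v in c.items() if sum(t != 0 for t in s) == 1)
assert abs(parseval - 1.0 / 3.0) < 1e-9
assert abs(level1 - (k - 1) / k ** 2) < 1e-9         # the level-1 mass behind (44)
```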
Define the influence of the $i$'th variable on $f$ as
$$\mathrm{Inf}_i(f)\coloneqq\sum_{\sigma:\ \sigma_i\neq0}\widehat f(\sigma)^2.$$
Roughly speaking, the results of [21, 16] say that if $f:\Omega^n\to[0,1]$ is a function with all low influences, then $f=Q_f(X)$ and $Q_f(G)$ are almost identically distributed; in particular, the values of $Q_f(G)$ are essentially contained in $[0,1]$. Note that $Q_f(G)$ is a random variable on the probability space $(\mathbb{R}^{n(k-1)},\gamma_{n(k-1)})$.

Consider functions $f:\Omega^n\to\Delta_k$. We write $f=(f_1,f_2,\dots,f_k)$, where $f_i:\Omega^n\to[0,1]$ and $\sum_{i=1}^k f_i\le1$. Each $f_i$ has a unique representation (along with the corresponding multilinear polynomial)
$$f_i=\sum_\sigma\widehat f_i(\sigma)X_\sigma,\qquad Q_i:=Q_{f_i}=\sum_\sigma\widehat f_i(\sigma)x_\sigma.$$
We shall define an objective function $\mathrm{OBJ}(f)$ that is a positive semi-definite quadratic form on the table of values of $f$. We then analyze the value of this objective function when $f$ is a "dictatorship" versus when $f$ has all low influences.

The objective value. For a function $f:\Omega^n\to\Delta_k$ (or, more generally, $f:\Omega^n\to\mathbb{R}^k$) define
$$\mathrm{OBJ}(f):=\sum_{i=1}^k\sum_{\sigma:\ |\sigma|=1}\widehat f_i(\sigma)^2. \quad (41)$$
In words, $\mathrm{OBJ}(f)$ is the total "Fourier mass" of the functions $\{f_i\}_{i=1}^k$ at level 1. Note that there are $n(k-1)$ multi-indices $\sigma$ with $|\sigma|=1$.

The objective value for dictatorships. For $\ell\in[n]$ we define a dictatorship function $f^{\mathrm{dict},\ell}:\Omega^n\to\Delta_k$ as follows. The range of the function is limited to only $k$ points of $\Delta_k$, namely the points $\{e_1,e_2,\dots,e_k\}$, where $e_i$ is the vector with $i$th coordinate $1$ and all other coordinates zero:
$$f^{\mathrm{dict},\ell}(\omega):=e_i\quad\text{if }\omega_\ell=i. \quad (42)$$
In other words, when one writes $f^{\mathrm{dict},\ell}=(f_1,f_2,\dots,f_k)$, each $f_i$ is $\{0,1\}$-valued and $f_i(\omega)=1$ iff $\omega_\ell=i$. It is easy to see that the Fourier expansion of $f_i$ is
$$f_i(\omega)=\frac1k\sum_{\sigma:\ \sigma_j=0\ \forall j\neq\ell}X_{\sigma_\ell}(i)\,X_\sigma(\omega). \quad (43)$$
Indeed, the right-hand side of (43) equals
$$\frac1k\sum_{0\le\sigma_\ell\le k-1}X_{\sigma_\ell}(i)X_{\sigma_\ell}(\omega_\ell)=\begin{cases}1&\text{if }\omega_\ell=i,\\ 0&\text{otherwise.}\end{cases}$$
The Fourier mass of $f_i^{\mathrm{dict},\ell}$ at level 1 equals
$$\sum_{1\le\sigma_\ell\le k-1}\left(\frac{X_{\sigma_\ell}(i)}{k}\right)^2=-\left(\frac{X_0(i)}{k}\right)^2+\sum_{0\le\sigma_\ell\le k-1}\left(\frac{X_{\sigma_\ell}(i)}{k}\right)^2=-\frac{1}{k^2}+\frac{k}{k^2}=\frac{k-1}{k^2}.$$
Summing the Fourier mass of all the $f_i^{\mathrm{dict},\ell}$'s at level 1, we get
$$\mathrm{OBJ}\left(f^{\mathrm{dict},\ell}\right)=1-\frac1k. \quad (44)$$
The objective value for functions with low influences. For $f:\Omega^n\to\mathbb{R}$, $j\in[n]$ and $m\in\mathbb{N}$ denote
$$\mathrm{Inf}_j^{\le m}(f)\coloneqq\sum_{\substack{|\sigma|\le m\\ \sigma_j\neq0}}\widehat f(\sigma)^2.$$
For every $\eta>0$ we will use the smoothing operator
$$T_\eta f=\sum_\sigma\eta^{|\sigma|}\widehat f(\sigma)X_\sigma.$$
The following theorem is the key analytic fact used in our UGC hardness result:

Theorem 3.8. For every $\varepsilon>0$ there exists $\tau>0$ so that the following holds: for any function $f:\Omega^n\to\Delta_k$ such that
$$\forall i\in[k],\ \forall j\in[n],\qquad\mathrm{Inf}_j^{\le\log(1/\tau)}(f_i)\le\tau,$$
we have $\mathrm{OBJ}(f)\le C(k)+\varepsilon$.

Proof. Let $\delta,\eta>0$ be sufficiently small constants to be chosen later. Let $Q_i=Q_{f_i}$ be the multilinear polynomial associated with $f_i$. Recall that $Q_i$ is a multilinear polynomial in the $n(k-1)$ indeterminates $\{x_{j\ell}\mid j\in[n],\ \ell\in[k-1]\}$. Moreover, $f_i=Q_i(X)$ has range $[0,1]$ and $\sum_{i=1}^k f_i\le1$. Let $R_i=(T_{1-\delta}Q_i)(X)$ and $S_i=(T_{1-\delta}Q_i)(G)$ (the smoothing operator $T_{1-\delta}$ helps us meet some technical preconditions before applying the invariance principle of [16]). Note that $R_i$ has range $[0,1]$ and $S_i$ has range $\mathbb{R}$; it will follow from [16], however, that $S_i$ is with high probability in $[0,1]$. First we relate $\mathrm{OBJ}(f)$ to the functions $S_i$, which will, up to truncation, induce a partition of $\mathbb{R}^{n(k-1)}$, which in turn will give the bound in terms of $C(k)$.
$$(1-\delta)^2\cdot\mathrm{OBJ}(f)=(1-\delta)^2\sum_{i=1}^k\sum_{\sigma:\ |\sigma|=1}\widehat f_i(\sigma)^2=(1-\delta)^2\sum_{i=1}^k\sum_{j=1}^n\sum_{\ell=1}^{k-1}\left(\int_{\mathbb{R}^{n(k-1)}}x_{j\ell}\,Q_i(x)\,d\gamma_{n(k-1)}(x)\right)^2=(1-\delta)^2\sum_{i=1}^k\left\|\int_{\mathbb{R}^{n(k-1)}}x\,Q_i(x)\,d\gamma_{n(k-1)}(x)\right\|_2^2=\sum_{i=1}^k\left\|\int_{\mathbb{R}^{n(k-1)}}x\,(T_{1-\delta}Q_i)(x)\,d\gamma_{n(k-1)}(x)\right\|_2^2=\sum_{i=1}^k\left\|\int_{\mathbb{R}^{n(k-1)}}x\,S_i(x)\,d\gamma_{n(k-1)}(x)\right\|_2^2. \quad (45)$$
We bound the last term by $C(k)+o(1)$. For any real-valued function $h$ on $\mathbb{R}^{n(k-1)}$, let
$$\mathrm{chop}(h)(x):=\begin{cases}0&\text{if }h(x)<0,\\ h(x)&\text{if }h(x)\in[0,1],\\ 1&\text{if }h(x)>1.\end{cases}$$
For every subset $I\subseteq[k]$, let $Q_I:=\sum_{i\in I}Q_i$. Since every $Q_i$ has small low-degree influences, so does every $Q_I$. Let $R_I\coloneqq\sum_{i\in I}(T_{1-\delta}Q_i)(X)$ and $S_I\coloneqq\sum_{i\in I}(T_{1-\delta}Q_i)(G)$; note that $R_{\{i\}}=R_i$ and $S_{\{i\}}=S_i$. Applying Theorem 3.20 of [16] to the polynomial $Q_I$, it follows (provided $\tau$ is sufficiently small compared to $\delta$ and $\eta$) that
$$\left\|S_I-\mathrm{chop}(S_I)\right\|_2^2=\int_{\mathbb{R}^{n(k-1)}}\left(S_I(x)-\mathrm{chop}(S_I)(x)\right)^2d\gamma_{n(k-1)}(x)\le\eta. \quad (46)$$
The functions $\mathrm{chop}(S_i)$ are almost what we want, except that they might not sum up to at most 1. So we further define
$$S_i^*(x):=\begin{cases}\mathrm{chop}(S_i)(x)&\text{if }\sum_{i=1}^k\mathrm{chop}(S_i)(x)\le1,\\[4pt] \dfrac{\mathrm{chop}(S_i)(x)}{\sum_{i=1}^k\mathrm{chop}(S_i)(x)}&\text{if }\sum_{i=1}^k\mathrm{chop}(S_i)(x)>1.\end{cases}$$
Clearly the $S_i^*$ have range $[0,1]$ and $\sum_{i=1}^k S_i^*\le1$. Observe that the following holds pointwise:
$$0\le\mathrm{chop}(S_i)-S_i^*\le\sum_{j=1}^k\left(\mathrm{chop}(S_j)-S_j^*\right)\le\max\left\{0,\ \sum_{j=1}^k\mathrm{chop}(S_j)-1\right\}\le\sum_{I\subseteq[k]}\left|S_I-\mathrm{chop}(S_I)\right|,$$
where the last inequality holds since, for every $x$, setting $I=I(x)=\{j\mid S_j(x)\ge0\}$,
$$\sum_{j=1}^k\mathrm{chop}(S_j)(x)-1=\sum_{j\in I}\mathrm{chop}(S_j)(x)-1\le\sum_{j\in I}S_j(x)-1\le S_I(x)-\mathrm{chop}(S_I)(x).$$
It follows that
$$\left\|\mathrm{chop}(S_i)-S_i^*\right\|_2\le\sum_{I\subseteq[k]}\left\|S_I-\mathrm{chop}(S_I)\right\|_2\le2^k\sqrt\eta,$$
where we used (46). Finally,
$$\left\|S_i-S_i^*\right\|_2\le\left\|S_i-\mathrm{chop}(S_i)\right\|_2+\left\|\mathrm{chop}(S_i)-S_i^*\right\|_2\le\left(2^k+1\right)\sqrt\eta. \quad (47)$$
Now write
$$\int_{\mathbb{R}^{n(k-1)}}x\,S_i(x)\,d\gamma_{n(k-1)}(x)=\int_{\mathbb{R}^{n(k-1)}}x\,S_i^*(x)\,d\gamma_{n(k-1)}(x)+\int_{\mathbb{R}^{n(k-1)}}x\left(S_i(x)-S_i^*(x)\right)d\gamma_{n(k-1)}(x). \quad (48)$$
The norm of the second integral is bounded by $(2^k+1)\sqrt\eta$ using (47) and Lemma 3.9 below. Since $\|S_i^*\|_2\le1$, the norm of the first integral is bounded by 1, and thus
$$\left\|\int_{\mathbb{R}^{n(k-1)}}x\,S_i(x)\,d\gamma_{n(k-1)}(x)\right\|_2^2\le\left\|\int_{\mathbb{R}^{n(k-1)}}x\,S_i^*(x)\,d\gamma_{n(k-1)}(x)\right\|_2^2+2\left(2^k+1\right)\sqrt\eta+\left(2^k+1\right)^2\eta.$$
Returning to the estimate (45) and noting that $\sum_{i=1}^k S_i^*\le1$,
$$\sum_{i=1}^k\left\|\int_{\mathbb{R}^{n(k-1)}}x\,S_i(x)\,d\gamma_{n(k-1)}(x)\right\|_2^2\le\sum_{i=1}^k\left(\left\|\int_{\mathbb{R}^{n(k-1)}}x\,S_i^*(x)\,d\gamma_{n(k-1)}(x)\right\|_2^2+2\left(2^k+1\right)^2\sqrt\eta\right)\le\sup_{f:\mathbb{R}^{n(k-1)}\to\Delta_k}\sum_{i=1}^k\left\|\int_{\mathbb{R}^{n(k-1)}}x\,f_i(x)\,d\gamma_{n(k-1)}(x)\right\|_2^2+2\left(2^k+1\right)^3\sqrt\eta=C(k)+2\left(2^k+1\right)^3\sqrt\eta.$$
It follows that
$$\mathrm{OBJ}(f)\le\frac{C(k)+2\left(2^k+1\right)^3\sqrt\eta}{(1-\delta)^2}\le C(k)+\varepsilon,$$
provided that $\eta$ and $\delta$ are small enough.

Lemma 3.9. Let $g\in L_2(\mathbb{R}^n,\gamma_n)$. Then
$$\left\|\int_{\mathbb{R}^n}x\,g(x)\,d\gamma_n(x)\right\|_2\le\|g\|_2.$$
Proof. Note that the square of the left-hand side equals
$$\sum_{i=1}^n\left(\int_{\mathbb{R}^n}x_ig(x)\,d\gamma_n(x)\right)^2=\sum_{i=1}^n\langle x_i,g\rangle^2.$$
Since the coordinate functions $x_i\in L_2(\mathbb{R}^n,\gamma_n)$ form an orthonormal set, the sum of the squares of the projections of $g$ onto them is at most the squared norm of $g$.

The intended hardness factor. As we show next, the dictatorship test can be translated (in a by-now more or less standard way) into a UGC-hardness result. The hardness factor, as usual, turns out to be the ratio of the objective value when the function is a dictatorship to the objective value when the function has all low influences, i.e.,
$$\frac{1-1/k}{C(k)+o(1)}=\frac{1-1/k}{C(k)}-o(1).$$

3.3 The reduction from unique games to kernel clustering

Given a Unique Games instance $\mathcal L\left(G(V,W,E),[n],\{\pi_{vw}:[n]\to[n]\}_{(v,w)\in E}\right)$, we construct an instance of the clustering problem.
We first reformulate the kernel clustering problem for ease of presentation.

Reformulation of the problem

Given an instance $(A = (a_{st}), B = (b_{ij}))$ of the kernel clustering problem, where $A$ and $B$ are $N\times N$ and $k\times k$ PSD matrices respectively, we note that

$$\max_{\sigma:[N]\to[k]} \sum_{s,t} a_{st}\, b_{\sigma(s),\sigma(t)} = \max_{F:[N]\to\Delta_k} \sum_{s,t} a_{st} \sum_{i,j} b_{ij}\, F(s)_i\, F(t)_j \qquad (49)$$
$$= \max_{F:[N]\to\Delta_k} \sum_{i,j} b_{ij} \sum_{s,t} a_{st}\, F_i(s)\, F_j(t) \qquad (50)$$
$$= \max_{F:[N]\to\Delta_k} \sum_{i,j} b_{ij}\, Q_A(F_i, F_j), \qquad (51)$$

where on line (49), instead of choosing a label $\sigma(s)\in[k]$, we allow a distribution over the $k$ labels, $F(s)\in\Delta_k$. The equality holds because any such probabilistic labeling $F$ yields a labeling $\sigma$ with the same expected objective value by picking, for every $s\in[N]$, the label $i$ with probability $F(s)_i$. On line (50) we interchange the order of summation and interpret the $i$-th coordinate of $F(s)$ (i.e., $F(s)_i$) as the value of a function $F_i:[N]\to[0,1]$ at index $s$ (i.e., $F_i(s)$); thus $F = (F_1, F_2, \ldots, F_k)$. On line (51) we rewrite $\sum_{s,t} a_{st} F_i(s) F_j(t)$ as a PSD quadratic form $Q_A(F_i, F_j)$ on the tables of values of the functions $F_i$ and $F_j$.

This enables us to reformulate the clustering problem as follows: given a PSD matrix $B$ and a PSD quadratic form $Q(\cdot,\cdot)$ on $\mathbb{R}^N\times\mathbb{R}^N$, find $F:[N]\to\Delta_k$, $F = (F_1, F_2, \ldots, F_k)$, so as to maximize $\sum_{i,j} b_{ij}\, Q(F_i, F_j)$.

The clustering problem instance

Given a Unique Games instance $\mathcal{L}\left( G(V,W,E), [n], \{\pi_{vw}:[n]\to[n]\}_{(v,w)\in E} \right)$, the clustering problem is to find $F: W\times\Omega^n \to \Delta_k$ so as to maximize $\sum_{i=1}^k Q(F_i, F_i)$, where $Q$ is a suitably defined PSD quadratic form. Thus the matrix $B$ is the $k\times k$ identity matrix. For notational convenience, we let

$$F_w := F(w,\cdot), \qquad F_w : \Omega^n \to \Delta_k.$$

Also, for every $v\in V$, we let

$$F_v := \mathbb{E}_{(v,w)\in E}\left[ F_w \circ \pi_{vw} \right], \qquad F_v : \Omega^n \to \Delta_k.$$
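The chain (49)–(51) can be illustrated concretely: encoding a hard labeling $\sigma$ as the row-stochastic matrix $F$ whose rows are point masses, the combinatorial objective and the quadratic-form objective coincide exactly, since then $Q_A(F_i,F_j) = (F^\top A F)_{ij}$. A minimal sketch (the names `hard_objective` and `relaxed_objective` are ours, and the random instance is purely illustrative):

```python
import numpy as np

def hard_objective(A, B, sigma):
    """sum_{s,t} a_st * b_{sigma(s),sigma(t)} for a labeling sigma: [N] -> [k]."""
    N = len(sigma)
    return float(sum(A[s, t] * B[sigma[s], sigma[t]]
                     for s in range(N) for t in range(N)))

def relaxed_objective(A, B, F):
    """sum_{i,j} b_ij * Q_A(F_i, F_j), where F is an N x k matrix whose row s
    is the distribution F(s) in Delta_k, and Q_A(F_i, F_j) = (F^T A F)_{ij}."""
    return float(np.sum(B * (F.T @ A @ F)))

rng = np.random.default_rng(1)
N, k = 6, 3
M = rng.normal(size=(N, N)); A = M @ M.T   # a random PSD matrix A
M = rng.normal(size=(k, k)); B = M @ M.T   # a random PSD matrix B
sigma = rng.integers(k, size=N)

# A hard labeling is the special case in which each F(s) is a point mass:
F = np.eye(k)[sigma]                        # one-hot rows
```

For PSD $A$ and $B$ the relaxed objective is always nonnegative, and on one-hot $F$ it reproduces the hard objective term by term.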
We use the following notation: for any function $g:\Omega^n\to\Delta_k$ and $\pi:[n]\to[n]$, $g\circ\pi : \Omega^n\to\Delta_k$ denotes the function

$$(g\circ\pi)(\omega) := g\left( \omega_{\pi(1)}, \omega_{\pi(2)}, \ldots, \omega_{\pi(n)} \right).$$

As usual, we write $F_w = (F_{w,1}, F_{w,2}, \ldots, F_{w,k})$, where each $F_{w,i}$ has range $[0,1]$ and $\sum_{i=1}^k F_{w,i} \le 1$. Similarly, $F_v = (F_{v,1}, F_{v,2}, \ldots, F_{v,k})$.

Now we are ready to define the clustering problem instance.

Clustering instance: the goal is to find $F: W\times\Omega^n\to\Delta_k$ so as to maximize

$$\max_{F: W\times\Omega^n\to\Delta_k} \mathbb{E}_{v\in V}\left[ \mathrm{OBJ}(F_v) \right] = \max_{F: W\times\Omega^n\to\Delta_k} \sum_{i=1}^k \mathbb{E}_{v\in V}\left[ \sum_{\sigma:|\sigma|=1} \widehat{F_{v,i}}(\sigma)^2 \right]. \qquad (52)$$

Completeness. We will show that if the Unique Games instance has an almost-satisfying labeling, then the objective value of the clustering problem is $(1-o(1))\cdot(1-1/k)$. So, let $\rho: V\cup W\to[n]$ be a labeling such that for at least a $1-\varepsilon$ fraction of the vertices $v\in V$ (call such $v$ good) we have $\pi_{vw}(\rho(w)) = \rho(v)$ for all $(v,w)\in E$. Define $F: W\times\Omega^n\to\Delta_k$ as follows: for every $w\in W$, $F_w:\Omega^n\to\Delta_k$ equals the dictatorship of $\rho(w)\in[n]$, i.e., $F_w := f_{\mathrm{dict},\rho(w)}$.

Lemma 3.10. $f_{\mathrm{dict},j}\circ\pi = f_{\mathrm{dict},\pi(j)}$.

Proof. $f_{\mathrm{dict},\pi(j)}(\omega)$ equals $e_\ell$ if $\omega_{\pi(j)} = \ell$. On the other hand,

$$(f_{\mathrm{dict},j}\circ\pi)(\omega) = f_{\mathrm{dict},j}\left( \omega_{\pi(1)}, \omega_{\pi(2)}, \ldots, \omega_{\pi(n)} \right),$$

which also equals $e_\ell$, since the $j$-th coordinate of the permuted input is $\omega_{\pi(j)} = \ell$.

Lemma 3.11. For a good $v\in V$, $F_v = f_{\mathrm{dict},\rho(v)}$.

Proof. For a good $v$, $\pi_{vw}(\rho(w)) = \rho(v)$ for every $(v,w)\in E$. Thus

$$F_v = \mathbb{E}_{(v,w)\in E}\left[ F_w\circ\pi_{vw} \right] = \mathbb{E}_{(v,w)\in E}\left[ f_{\mathrm{dict},\rho(w)}\circ\pi_{vw} \right] = \mathbb{E}_{(v,w)\in E}\left[ f_{\mathrm{dict},\pi_{vw}(\rho(w))} \right] = \mathbb{E}_{(v,w)\in E}\left[ f_{\mathrm{dict},\rho(v)} \right] = f_{\mathrm{dict},\rho(v)}.$$

Thus the contribution of a good $v$ to (52) is $\mathrm{OBJ}(f_{\mathrm{dict},\rho(v)}) = 1-1/k$, as observed in Equation (44). Since a $1-\varepsilon$ fraction of the $v\in V$ are good, (52) is at least $(1-\varepsilon)\cdot(1-1/k)$.
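Lemma 3.10 can be verified exhaustively over a small domain. The sketch below uses 0-indexed coordinates and represents $e_\ell$ as a one-hot vector (all names are illustrative, not from the paper):

```python
from itertools import product
import numpy as np

def f_dict(j, k):
    """The dictatorship of coordinate j: omega -> e_{omega_j} in Delta_k."""
    def f(omega):
        e = np.zeros(k)
        e[omega[j]] = 1.0
        return e
    return f

def compose(g, pi):
    """(g o pi)(omega) = g(omega_{pi(0)}, ..., omega_{pi(n-1)}), 0-indexed."""
    return lambda omega: g(tuple(omega[pi[i]] for i in range(len(pi))))

# Check f_dict_j o pi == f_dict_{pi(j)} on every input, for a small n and k.
n, k = 3, 2
pi = (2, 0, 1)  # a permutation of {0, 1, 2}
for j in range(n):
    lhs = compose(f_dict(j, k), pi)
    rhs = f_dict(pi[j], k)
    for omega in product(range(k), repeat=n):
        assert np.array_equal(lhs(omega), rhs(omega))
```

The exhaustive loop is exactly the proof of Lemma 3.10: the $j$-th coordinate of the permuted input $(\omega_{\pi(0)},\dots,\omega_{\pi(n-1)})$ is $\omega_{\pi(j)}$.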
Soundness. Suppose, on the contrary, that the value of (52) is at least $C(k)+2\varepsilon$. We will prove that the Unique Games instance must then have a labeling that satisfies at least an $\frac{\varepsilon\tau^2}{4k\log(1/\tau)}$ fraction of its edges, reaching a contradiction, provided its soundness is chosen to be lower to begin with.

We define a labeling as follows. First we define a not-too-large set of labels $L(w)\subseteq[n]$ for every $w\in W$. Let $\tau$ be as in Theorem 3.8, and set

$$L(w) := \left\{ j\in[n] \;\middle|\; \exists\, i\in[k],\ \mathrm{Inf}_j^{\le\log(1/\tau)}(F_{w,i}) \ge \tau/2 \right\}.$$

Clearly $|L(w)| \le \frac{2k\log(1/\tau)}{\tau}$, since each $F_{w,i}$ has range $[0,1]$ and therefore the sum of all its degree-$\log(1/\tau)$ influences is at most $\log(1/\tau)$.

Now assume that the value of (52) is at least $C(k)+2\varepsilon$. By an averaging argument, for at least an $\varepsilon$ fraction of the $v\in V$ (call such $v$ nice), $\mathrm{OBJ}(F_v)\ge C(k)+\varepsilon$. Applying Theorem 3.8, we conclude that there exist $i_0\in[k]$ and $j_0\in[n]$ such that $\mathrm{Inf}_{j_0}^{\le\log(1/\tau)}(F_{v,i_0}) \ge \tau$. Observe that

$$\tau \le \mathrm{Inf}_{j_0}^{\le\log(1/\tau)}(F_{v,i_0}) = \mathrm{Inf}_{j_0}^{\le\log(1/\tau)}\left( \mathbb{E}_{(v,w)\in E}\left[ F_{w,i_0}\circ\pi_{vw} \right] \right) \le \mathbb{E}_{(v,w)\in E}\left[ \mathrm{Inf}_{j_0}^{\le\log(1/\tau)}\left( F_{w,i_0}\circ\pi_{vw} \right) \right] \quad \text{(using Lemma 3.12 below)}$$
$$= \mathbb{E}_{(v,w)\in E}\left[ \mathrm{Inf}_{\pi_{vw}^{-1}(j_0)}^{\le\log(1/\tau)}\left( F_{w,i_0} \right) \right] \quad \text{(using Lemma 3.14 below)}.$$

This implies that for at least a $\tau/2$ fraction of the $w$ with $(v,w)\in E$, we have $\mathrm{Inf}_{\pi_{vw}^{-1}(j_0)}^{\le\log(1/\tau)}(F_{w,i_0}) \ge \tau/2$, and thus $\pi_{vw}^{-1}(j_0)\in L(w)$ by the definition of $L(w)$. Define $j_0$ to be the label of $v$. Finally, for every $w\in W$, select a random label from $L(w)$ (or an arbitrary label if $L(w)=\emptyset$). Noting that an $\varepsilon$ fraction of the $v\in V$ are nice, and $|L(w)| \le \frac{2k\log(1/\tau)}{\tau}$, it follows that the labeling satisfies at least an

$$\varepsilon \cdot \frac{\tau}{2} \cdot \frac{1}{2k\tau^{-1}\log(1/\tau)} = \frac{\varepsilon\tau^2}{4k\log(1/\tau)}$$

fraction of the edges of the Unique Games instance.

Lemma 3.12. Suppose $\mathcal{C}$ is a class of functions $g:\Omega^n\to\mathbb{R}$ and $h := \mathbb{E}_{g\in\mathcal{C}}[g]$.
Then for any $j\in[n]$ and integer $d$,

$$\mathrm{Inf}_j(h) \le \mathbb{E}_{g\in\mathcal{C}}\left[ \mathrm{Inf}_j(g) \right], \qquad \mathrm{Inf}_j^{\le d}(h) \le \mathbb{E}_{g\in\mathcal{C}}\left[ \mathrm{Inf}_j^{\le d}(g) \right].$$

Proof. We prove the first inequality; the second is similar, restricting the summations to multi-indices with $|\sigma|\le d$. By Jensen's inequality applied to each Fourier coefficient,

$$\mathrm{Inf}_j(h) := \sum_{\sigma:\,\sigma_j\neq 0} \widehat{h}(\sigma)^2 = \sum_{\sigma:\,\sigma_j\neq 0} \left( \mathbb{E}_{g\in\mathcal{C}}\left[ \widehat{g}(\sigma) \right] \right)^2 \le \sum_{\sigma:\,\sigma_j\neq 0} \mathbb{E}_{g\in\mathcal{C}}\left[ \widehat{g}(\sigma)^2 \right] = \mathbb{E}_{g\in\mathcal{C}}\left[ \mathrm{Inf}_j(g) \right].$$

Lemma 3.13. Suppose $g:\Omega^n\to\mathbb{R}$, $\pi:[n]\to[n]$, and let $\sigma$ be a multi-index. Then

$$\widehat{g\circ\pi}(\sigma) = \widehat{g}\left( \pi^{-1}(\sigma) \right).$$

Proof. The proof is a straightforward computation, which we omit.

Lemma 3.14. Suppose $g:\Omega^n\to\mathbb{R}$, $\pi:[n]\to[n]$, and $j\in[n]$. Then

$$\mathrm{Inf}_j(g\circ\pi) = \mathrm{Inf}_{\pi^{-1}(j)}(g), \qquad \mathrm{Inf}_j^{\le d}(g\circ\pi) = \mathrm{Inf}_{\pi^{-1}(j)}^{\le d}(g).$$

Proof. We prove the first equality; the second is similar, restricting the summations to multi-indices with $|\sigma|\le d$. Using Lemma 3.13,

$$\mathrm{Inf}_j(g\circ\pi) := \sum_{\sigma:\,\sigma_j\neq 0} \widehat{g\circ\pi}(\sigma)^2 = \sum_{\sigma:\,\sigma_j\neq 0} \widehat{g}\left( \pi^{-1}(\sigma) \right)^2 = \sum_{\sigma:\,\sigma_{\pi^{-1}(j)}\neq 0} \widehat{g}(\sigma)^2 = \mathrm{Inf}_{\pi^{-1}(j)}(g).$$

Acknowledgements

We thank Alex Smola for bringing the problem of approximation algorithms for kernel clustering to our attention and for encouraging us to publish our results.

References

[1] N. Alon, K. Makarychev, Y. Makarychev, and A. Naor. Quadratic forms on graphs. Invent. Math., 163(3):499–522, 2006.

[2] N. Alon and A. Naor. Approximating the cut-norm via Grothendieck's inequality. SIAM J. Comput., 35(4):787–803, 2006.

[3] S. Arora, E. Berger, G. Kindler, E. Hazan, and S. Safra. On non-approximability for quadratic programs. In 46th Annual Symposium on Foundations of Computer Science, pages 206–215. IEEE Computer Society, 2005.

[4] M. Charikar, K. Makarychev, and Y. Makarychev. Near-optimal algorithms for unique games (extended abstract). In STOC '06: Proceedings of the 38th Annual ACM Symposium on Theory of Computing, pages 205–214, New York, 2006. ACM.

[5] M. Charikar, K. Makarychev, and Y. Makarychev. Near-optimal algorithms for maximum constraint satisfaction problems. In SODA '07: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 62–68, Philadelphia, PA, USA, 2007. Society for Industrial and Applied Mathematics.

[6] M. Charikar and A. Wirth. Maximizing quadratic programs: extending Grothendieck's inequality. In 45th Annual Symposium on Foundations of Computer Science, pages 54–60. IEEE Computer Society, 2004.

[7] L. Danzer, B. Grünbaum, and V. Klee. Helly's theorem and its relatives. In Proc. Sympos. Pure Math., Vol. VII, pages 101–180. Amer. Math. Soc., Providence, R.I., 1963.

[8] U. Feige, G. Kindler, and R. O'Donnell. Understanding parallel repetition requires understanding foams. In IEEE Conference on Computational Complexity, pages 179–192. IEEE Computer Society, 2007.

[9] A. Frieze and M. Jerrum. Improved approximation algorithms for MAX k-CUT and MAX BISECTION. Algorithmica, 18(1):67–81, 1997.

[10] P. Gritzmann and V. Klee. Inner and outer j-radii of convex bodies in finite-dimensional normed spaces. Discrete Comput. Geom., 7(3):255–280, 1992.

[11] J. Håstad. Some optimal inapproximability results. J. ACM, 48(4):798–859, 2001.

[12] H. W. E. Jung. Über die kleinste Kugel, die eine räumliche Figur einschließt. J. Reine Angew. Math., 123:241–257, 1901.

[13] S. Khot. On the power of unique 2-prover 1-round games. In Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing, pages 767–775, New York, 2002. ACM.

[14] S. Khot, G. Kindler, E. Mossel, and R. O'Donnell. Optimal inapproximability results for MAX-CUT and other 2-variable CSPs? In 45th Annual Symposium on Foundations of Computer Science, pages 146–154. IEEE Computer Society, 2004.

[15] S. Khot, G. Kindler, E. Mossel, and R. O'Donnell. Optimal inapproximability results for MAX-CUT and other 2-variable CSPs? SIAM J. Comput., 37(1):319–357, 2007.

[16] E. Mossel, R. O'Donnell, and K. Oleszkiewicz. Noise stability of functions with low influences: invariance and optimality. In 46th Annual Symposium on Foundations of Computer Science, pages 21–30. IEEE Computer Society, 2005.

[17] A. Nemirovski, C. Roos, and T. Terlaky. On maximization of quadratic form over intersection of ellipsoids with common center. Math. Program., 86(3, Ser. A):463–473, 1999.

[18] Y. Nesterov. Semidefinite relaxation and nonconvex quadratic optimization. Optim. Methods Softw., 9(1-3):141–160, 1998.

[19] P. Raghavendra. Optimal algorithms and inapproximability results for every CSP? In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, pages 245–254, 2008.

[20] R. E. Rietz. A proof of the Grothendieck inequality. Israel J. Math., 19:271–276, 1974.

[21] V. I. Rotar′. Limit theorems for polylinear forms. J. Multivariate Anal., 9(4):511–530, 1979.

[22] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2001.

[23] L. Song, A. Smola, A. Gretton, and K. M. Borgwardt. A dependence maximization view of clustering. In Proceedings of the 24th International Conference on Machine Learning, pages 815–822, 2007.