Data Stability in Clustering: A Closer Look

Shalev Ben-David¹ and Lev Reyzin²

1 Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139. shalev@mit.edu
2 Department of Mathematics, Statistics, and Computer Science, University of Illinois at Chicago, 851 South Morgan Street, Chicago, IL 60607. lreyzin@math.uic.edu

Abstract

We consider the model introduced by Bilu and Linial [13], who study problems for which the optimal clustering does not change when distances are perturbed. They show that even when a problem is NP-hard, it is sometimes possible to obtain efficient algorithms for instances resilient to certain multiplicative perturbations, e.g. on the order of O(√n) for max-cut clustering. Awasthi et al. [6] consider center-based objectives, and Balcan and Liang [9] analyze the k-median and min-sum objectives, giving efficient algorithms for instances resilient to certain constant multiplicative perturbations. Here, we are motivated by the question of to what extent these assumptions can be relaxed while allowing for efficient algorithms. We show there is little room to improve these results by giving NP-hardness lower bounds for both the k-median and min-sum objectives. On the other hand, we show that constant multiplicative resilience parameters can be so strong as to make the clustering problem trivial, leaving only a narrow range of resilience parameters for which clustering is interesting. We also consider a model of additive perturbations and give a correspondence between additive and multiplicative notions of stability. Our results provide a close examination of the consequences of assuming stability in data.

1 Introduction

Clustering is one of the most widely-used techniques in statistical data analysis.
The need to partition, or cluster, data into meaningful categories naturally arises in virtually every domain where data is abundant. Unfortunately, most of the natural clustering objectives, including k-median, k-means, and min-sum, are NP-hard to optimize [17, 18]. It is, therefore, unsurprising that many of the clustering algorithms used in practice come with few guarantees.

Motivated by overcoming the hardness results, Bilu and Linial [13] consider a perturbation resilience assumption that they argue is often implicitly made when choosing a clustering objective: that the optimum clustering to the desired objective Φ is preserved under multiplicative perturbations up to a factor α > 1 to the distances between the points. They reason that if the optimum clustering to an objective Φ is not resilient, as in, if small perturbations to the distances can cause the optimum to change, then Φ may have been the wrong objective to be optimizing in the first place.

Bilu and Linial [13] show that for max-cut clustering, instances resilient to perturbations of α = O(√n) have efficient algorithms for recovering the optimum itself. Continuing that line of research, Awasthi et al. [6] give a polynomial time algorithm that finds the optimum clustering for instances resilient to multiplicative perturbations of α = 3 for center-based clustering objectives¹ when centers must come from the data (we call this the proper setting), and α = 2 + √3 when the centers do not need to (we call this the Steiner setting). Their method relies on a stability property implied by perturbation resilience (see Section 2). For the Steiner case, they also prove an NP-hardness lower bound of α = 3.
Subsequently, Balcan and Liang [9] consider the proper setting and improve the constant past α = 3 by giving a new polynomial time algorithm for the k-median objective for α = 1 + √2 ≈ 2.4 stable instances.

1.1 Our results

Our work further delves into the proper setting, for which no lower bounds have previously been shown for the stability property.² In Section 3 we show that even in the proper case, where the algorithm is restricted to choosing its centers from the data, for any ε > 0, it is NP-hard to optimally cluster (2 − ε)-stable instances, both for the k-median and min-sum objectives (Theorems 5 and 7). To prove this for the min-sum objective, we define a new notion of stability that is implied by perturbation resilience, a notion that may be of independent interest.

Then in Section 4, we look at the implications of assuming resilience or stability in the data, even for a constant perturbation parameter α. We show that for even fairly small constants, the data begins to have very strong structural properties, so as to make the clustering task fairly trivial. When α exceeds 2 + √3, the data begins to show what is called strict separation, where each point is closer to points in its own cluster than to points in other clusters (Theorem 8).

Finally, in Section 5, we look at whether the picture can be improved for clustering data that is stable under additive, rather than multiplicative, perturbations. One hope would be that additive stability is a more useful assumption, where a polynomial time algorithm for ε-stable instances might be possible. Unfortunately, this is not the case. We consider a natural additive model and show that severe lower bounds hold for the additive notion as well (Theorems 13 and 17).
On the positive side, we show via reductions that algorithms for multiplicatively stable data also work for additively stable data for a different but related parameter.

Our results demonstrate that on the one hand, it is hard to improve the algorithms to work for low stability constants, and that on the other hand, higher stability constants can be quite strong, to the point of trivializing the problem. Furthermore, switching from a multiplicative to an additive stability assumption does not help to circumvent the hardness results, and perhaps makes matters worse. These results, taken together, narrow the range of interesting parameters for theoretical study and highlight the strong role that the choice of constant plays in stability assumptions.

¹ For center-based clustering objectives, the clustering is defined by a choice of centers, and the objective is a function of the distances of the points to their closest center.

² One thing to note is that there is some difference between the very related resilience and stability properties (see Section 2), stability being weaker and more general [6]. Some of our results apply to both notions, and some only to stability. This still leaves open the possibility of devising polynomial-time algorithms that, for a much smaller α, work on all the α-perturbation resilient instances, but not on all α-stable ones.

1.2 Previous work

We examine previous work on stability, both as a data-dependent assumption in clustering and in other settings.
1.2.1 Stability as a data assumption in clustering

The classical approach in theoretical computer science to dealing with the worst-case NP-hardness of clustering has been to develop efficient approximation algorithms for the various clustering objectives [2, 3, 10, 14, 19, 15], and significant efforts have been exerted to improve approximation ratios and to prove lower bounds. In particular, for metric k-median, the best known guarantee is a (3 + ε)-approximation [3], and the best known lower bound is (1 + 1/e)-hardness of approximation [17, 18]. For metric min-sum, the best known result is an O(polylog(n))-approximation to the optimum [10].

In contrast, a more recent direction of research has been to characterize under what conditions we can find a desirable clustering efficiently. Perturbation resilience/stability are such conditions, but they are related to other stability notions in clustering. Ostrovsky et al. [23] demonstrate the effectiveness of Lloyd-type algorithms [21] on instances with the stability property that the cost of the optimal k-means solution is small compared to the cost of the optimal (k − 1)-means solution, and their guarantees have later been improved by Awasthi et al. [5]. In a different line of work, Balcan et al. [8] consider what stability properties of a similarity function, with respect to the ground truth clustering, are sufficient to cluster well. In a related direction, Balcan et al. [7] argue that, for a given objective Φ, approximation algorithms are most useful when the clusterings they produce are structurally close to the optimum originally sought in choosing to optimize Φ in the first place.
They then show that, for many objectives, if one makes this assumption explicit (that all c-approximations to the objective yield a clustering that is ε-close to the optimum), then one can recover an ε-close clustering in polynomial time, even for values of c below the hardness of approximation constant. The assumptions and algorithms of Balcan et al. [7] have subsequently been carefully analyzed by Schalekamp et al. [24].

Ackerman and Ben-David [1] also study various notions of resilience and, among their results, introduce a notion of stability similar to the one studied herein, except only the positions of cluster centers are perturbed. Their notion is strictly weaker, i.e. any perturbation resilient instance is also stable in their framework. They show that Euclidean instances stable to perturbations of cluster centers will have polynomial algorithms for finding near-optimal clusterings. However, they require the desired number of clusters to be small compared to the input size: their algorithms have running times exponential in the number of clusters. In this work, we do not make that assumption; this means our positive results are more general, but our lower bounds don't apply.

1.2.2 Stability in other settings

Just as the Bilu and Linial [13] notion of stability gives conditions under which efficient clustering is possible, similar concepts have been studied in game theory. Lipton et al. [20] propose a notion of stability for solution concepts of games. They define a game to be stable if small perturbations to the payoff matrix do not significantly change the value of the game, and they show games are generally not stable under this definition. Then, in a similar spirit to the work of Bilu and Linial, Awasthi et al.
[4] propose a related stability condition for a game, which can be leveraged in finding its approximate Nash equilibria.

The Bilu and Linial [13] notion of stability has also been studied in the context of the metric traveling salesman problem, for which Mihalák et al. [22] give efficient algorithms for 1.8-perturbation resilient instances, illustrating another case where a stability assumption can circumvent NP-hardness.

From a different direction, Ben-David et al. [12] consider the stability of clustering algorithms, as opposed to instances. They say an algorithm is stable if it produces similar clusterings for different inputs drawn from the same distribution. They argue that stability is not as useful a notion as had been previously thought in determining various parameters, such as the optimal number of clusters.

2 Notation and preliminaries

In a clustering instance, we are given a set S of n points in a finite metric space, and we denote d : S × S → R≥0 as the distance function. Φ denotes the objective function over a partition of S into k clusters which we want to optimize over the metric, i.e. Φ assigns a score to every clustering. The optimal clustering with respect to Φ is denoted as C = {C_1, C_2, ..., C_k}.

The k-median objective requires S to be partitioned into k disjoint subsets {S_1, ..., S_k} and each subset S_i to be assigned a center s_i ∈ S. The goal is to minimize Φ_med, measured by

    Φ_med(S_1, ..., S_k) := Σ_{i=1}^{k} Σ_{p ∈ S_i} d(p, s_i).

The centers in the optimal clustering are denoted as c_1, ..., c_k. In an optimal solution, each point is assigned to its nearest center.

For the min-sum objective, S is partitioned into k disjoint subsets, and the objective is to minimize Φ_{m-s}, measured by

    Φ_{m-s}(S_1, ..., S_k) := Σ_{i=1}^{k} Σ_{p,q ∈ S_i} d(p, q).
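For concreteness, both objectives can be computed directly from a distance table. The sketch below is our own illustration (the helper names are not from the paper); it takes d as a dict of dicts with d[p][p] = 0.

```python
# Sketch (ours, not from the paper): the two clustering objectives above,
# with a clustering given as a list of clusters (lists of point labels).

def phi_med(clusters, centers, d):
    # k-median cost: each point pays its distance to its cluster's center.
    return sum(d[p][centers[i]]
               for i, cluster in enumerate(clusters)
               for p in cluster)

def phi_min_sum(clusters, d):
    # min-sum cost: sum of d(p, q) over all ordered intra-cluster pairs;
    # the diagonal terms d[p][p] = 0 contribute nothing.
    return sum(d[p][q]
               for cluster in clusters
               for p in cluster
               for q in cluster)
```

For a single cluster {a, b, c} with all pairwise distances 1/2 and center a, phi_med gives 1 and phi_min_sum gives 3 (each unordered pair is counted twice, matching the ordered-pair sum in Φ_{m-s}).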
Now, we define the perturbation resilience notion introduced by Bilu and Linial [13].

Definition 1. For α > 1, a clustering instance (S, d) is α-perturbation resilient to a given objective Φ if for any function d' : S × S → R≥0 such that ∀p, q ∈ S, d(p, q) ≤ d'(p, q) ≤ α d(p, q), the optimal clustering C' for Φ under d' is equal to the optimal clustering C for Φ under d.

In this paper, we consider the k-median and min-sum objectives, and we thereby investigate the following definitions of stability, which are implied by perturbation resilience, as shown in Sections 3.1 and 3.2. The following definition is adapted from Awasthi et al. [6].

Definition 2. A clustering instance (S, d) is α-center stable for the k-median objective if for any optimal cluster C_i ∈ C with center c_i, C_j ∈ C (j ≠ i) with center c_j, any point p ∈ C_i satisfies α d(p, c_i) < d(p, c_j).

Next, we define a new analogous notion of stability for the min-sum objective, and we show in Section 3.2 that for the min-sum objective, perturbation resilience implies min-sum stability. To help with exposition for the min-sum objective, we define the distance from a point p to a set of points A,

    d(p, A) := Σ_{q ∈ A} d(p, q).

Definition 3. A clustering instance (S, d) is α-min-sum stable for the min-sum objective if for all optimal clusters C_i, C_j ∈ C (j ≠ i), any point p ∈ C_i satisfies α d(p, C_i) < d(p, C_j).

This is a useful generalization because, as we shall see, known algorithms working under the perturbation resilience assumption can also be made to work under the weaker notion of min-sum stability.

3 Lower bounds

3.1 The k-median objective

Awasthi et al. [6] prove the following connection between perturbation resilience and stability.
Both their algorithms and the algorithms of Balcan and Liang [9] crucially use this stability assumption.

Lemma 4. Any clustering instance that is α-perturbation resilient for the k-median objective also satisfies α-center stability.

Awasthi et al. [6] proved that for α < 3 − ε, k-median clustering of α-center stable instances is NP-hard when Steiner points are allowed in the data. Afterwards, Balcan and Liang [9] circumvented this lower bound and achieved a polynomial time algorithm for α = 1 + √2 by assuming the algorithm must choose cluster centers from within the data. In the theorem below, we prove a lower bound for the center stable property in this more restricted setting, showing there is little hope of progress even for data where each point is nearly twice closer to its own center than to any other.

Theorem 5. For any ε > 0, the problem of solving (2 − ε)-center stable k-median instances is NP-hard.

Proof. We reduce from what we call the perfect dominating set promise problem (PDS-PP), which we prove to be NP-hard (see Appendix), where we are promised that the input graph G = (V, E) is such that all of its smallest dominating sets D are perfect, and we are asked to find a dominating set of size at most d. The reduction is simple. We take an instance of the NP-hard problem PDS-PP on G = (V, E) on n vertices and reduce it to an α = (2 − ε)-center stable instance. Our distance metric is as follows. Every vertex v ∈ V becomes a point in the k-median instance. For any two vertices (u, v) ∈ E we define d(u, v) = 1/2. When (u, v) ∉ E, we set d(u, v) = 1. This trivially satisfies the triangle inequality for any graph G, as the sum of the distances along any two edges is at least 1. We set k = d.
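As an aside, this edge/non-edge metric is mechanical to build and check. The following sketch is our own (helper names are ours, and it is not part of the proof); it constructs the metric from an adjacency list and spot-checks the triangle inequality.

```python
# Sketch (ours): the metric used in the reduction, plus a brute-force check
# that it is indeed a metric.

def reduction_metric(vertices, edges):
    # d(u, v) = 1/2 for edges, 1 for non-edges, 0 on the diagonal. Any two
    # such distances sum to at least 1, so the triangle inequality holds.
    edge_set = {frozenset(e) for e in edges}
    d = {}
    for u in vertices:
        d[u] = {}
        for v in vertices:
            if u == v:
                d[u][v] = 0.0
            else:
                d[u][v] = 0.5 if frozenset((u, v)) in edge_set else 1.0
    return d

def satisfies_triangle_inequality(vertices, d):
    return all(d[u][w] <= d[u][v] + d[v][w]
               for u in vertices for v in vertices for w in vertices)
```

For instance, on the star graph with center 0 and leaves 1, 2, 3, the construction gives d(0, 1) = 1/2 and d(1, 2) = 1, and the triangle inequality holds with equality along leaf-center-leaf paths.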
We observe that a k-median solution of cost (n − k)/2 corresponds to a dominating set of size d in the PDS-PP instance, and is therefore NP-hard to find. We also observe that because all solutions of size ≤ d in the PDS-PP instance are perfect, each (non-center) point in the k-median solution has distance 1/2 to exactly one (its own) center, and a distance of 1 to every other center. Hence, this instance is α = (2 − ε)-center stable, completing the proof.

3.2 The min-sum objective

Analogously to Lemma 4, we can show that α-perturbation resilience implies our new notion of α-min-sum stability.

Lemma 6. If a clustering instance is α-perturbation resilient, then it is also α-min-sum stable.

Proof. Assume to the contrary that the instance is α-perturbation resilient but is not α-min-sum stable. Then, there exist clusters C_i, C_j in the optimal solution C and a point p ∈ C_i such that α d(p, C_i) ≥ d(p, C_j). We perturb d as follows. We define d' such that for all points q ∈ C_i, d'(p, q) = α d(p, q), and for the remaining distances, d' = d. Clearly d' is an α-perturbation of d. We now note that C is not optimal under d'. Namely, we can create a cheaper solution C' that assigns point p to cluster C_j and leaves the remaining clusters unchanged, which contradicts the optimality of C. This shows that C is not the optimum under d', which contradicts the instance being α-perturbation resilient. Therefore we can conclude that if a clustering instance is α-perturbation resilient, then it must also be α-min-sum stable.

Moreover, we show in the Appendix that the min-sum algorithm of Balcan and Liang [9], which requires α to be bounded from below by 3 (max_{C∈C} |C|) / (min_{C∈C} |C| − 1), works with this more general condition. This further motivates the following bound.

Theorem 7.
For any ε > 0, the problem of finding an optimal min-sum k-clustering in (2 − ε)-min-sum stable instances is NP-hard.

Proof. Consider the triangle partition problem. Let G = (V, E) be a graph with |V| = n = 3k, and let each vertex have maximum degree d = 4. The problem of whether the vertices of G can be partitioned into sets V_1, V_2, ..., V_k such that each V_i contains a triangle in G is NP-complete [16], even with the degree restriction [25].

We reduce the triangle partition problem to a (2 − ε)-min-sum stable clustering instance. The metric is as follows. Every vertex v ∈ V becomes a point in the min-sum instance. For any two vertices (u, v) ∈ E we define d(u, v) = 1/2. When (u, v) ∉ E, we set d(u, v) = 1. This satisfies the triangle inequality for any graph, as the sum of the distances along any two edges is at least 1.

Now we show that we can cluster this instance into k clusters such that the cost of the min-sum objective is exactly n if and only if the original instance is a YES instance of triangle partition. This follows from two facts.

1. A YES instance of triangle partition maps to a clustering into k = n/3 clusters of size 3 with pairwise distances 1/2, for a total cost of n.

2. A cost of n is the best achievable because a balanced clustering with all minimum pairwise intra-cluster distances is optimal. In particular, (1/2) Σ_i n_i(n_i − 1) subject to Σ_i n_i = n is a lower bound on the cost of any k-clustering with cluster sizes n_i. By convexity, this expression is minimized by the balanced clustering, where it equals n.

In the clustering from our reduction, each point has a sum of distances to its own cluster of 1. Now we examine the sum of distances of any point to other clusters.
A point has two distances of 1/2 (edges) to its own cluster, and because d = 4, it can have at most two more distances of 1/2 (edges) into any other cluster, leaving the third distance to the other cluster to be 1, yielding a total cost of ≥ 2 into any other cluster. Hence, the instance is (2 − ε)-min-sum stable.

We note that it is tempting to restrict the degree bound to 3 in order to further improve the lower bound. Unfortunately, the triangle partition problem on graphs of maximum degree 3 is polynomial-time solvable [25], and we cannot improve the factor of 2 − ε by restricting to graphs of degree 3 in this reduction.

4 Strong consequences of stability

In Section 3, we showed that k-median clustering of even (2 − ε)-center stable instances is NP-hard. In this section we show that even for resilience to constant multiplicative perturbations of α > 2 + √3 ≈ 3.7, the data obtains a property referred to as strict separation, where all points are closer to all other points in their own cluster than to points in any other cluster; this property is known to be helpful in clustering [8].

4.1 Strict separation

Now we prove the following theorem, which shows that even for relatively small multiplicative constants α, center stable, and therefore perturbation resilient, instances exhibit strict separation.

Theorem 8. Let C = {C_1, ..., C_k} be the optimal clustering of an α-center stable instance with α > 2 + √3. Let p, p' ∈ C_i and q ∈ C_j (i ≠ j). Then d(p, q) > d(p, p').

Proof. Let C be an α-center stable clustering. Let p, p' have center c_1 in C and let q have center c_2, with c_1 ≠ c_2. We have

    d(p, p') ≤ d(p', c_1) + d(p, c_1) ≤ d(p', c_1) + (1/α) d(p, c_2),    (1)

where the second inequality follows from the definition of α-center stability.
Note that

    d(p', c_1) ≤ (1/α) d(p', c_2),

and subtracting (1/α) d(p', c_1) from both sides gives

    (1 − 1/α) d(p', c_1) ≤ (1/α) (d(p', c_2) − d(p', c_1)) ≤ (1/α) d(c_1, c_2).

This then implies

    d(p', c_1) ≤ (1/(α − 1)) d(c_1, c_2)
              ≤ (1/(α − 1)) d(p, c_1) + (1/(α − 1)) d(p, c_2)
              ≤ (1/(α(α − 1))) d(p, c_2) + (1/(α − 1)) d(p, c_2)
              = ((α + 1)/(α(α − 1))) d(p, c_2).

Plugging this into Equation 1 gives

    d(p, p') ≤ ((α + 1)/(α(α − 1))) d(p, c_2) + (1/α) d(p, c_2) = (2α/(α(α − 1))) d(p, c_2) = (2/(α − 1)) d(p, c_2).

We also have

    d(p, c_2) ≤ d(p, q) + d(q, c_2)
             ≤ d(p, q) + (1/α) d(q, c_1)
             ≤ d(p, q) + (1/α) (d(p, q) + d(p, c_1))
             = ((α + 1)/α) d(p, q) + (1/α) d(p, c_1)
             ≤ ((α + 1)/α) d(p, q) + (1/α²) d(p, c_2),

and subtracting (1/α²) d(p, c_2) from both sides gives

    ((α² − 1)/α²) d(p, c_2) ≤ ((α + 1)/α) d(p, q),   or   d(p, c_2) ≤ (α/(α − 1)) d(p, q).

We then conclude

    d(p, p') ≤ (2/(α − 1)) d(p, c_2) ≤ (2/(α − 1)) (α/(α − 1)) d(p, q) = (2α/(α − 1)²) d(p, q).

Thus we get d(p, p') < d(p, q) if we set α such that 2α < (α − 1)². Solving this gives α > 2 + √3.

The following example shows that Theorem 8 is tight, in that lower values of α cannot alone imply strict separation. Consider the metric space R, the line. The example is given by p' = 0, c_1 = √3, p = 1 + √3, q = 2 + 2√3, and c_2 = 3 + 3√3, with p and p' belonging to the cluster of c_1 and q belonging to the cluster of c_2. This example is α-center stable for α = 2 + √3, but it does not satisfy strict separation. In particular, this example makes every inequality in the proof above tight when α = 2 + √3.

The strict separation property, however, is quite strong, as can be seen from the following corollary.

Corollary 9. Let C = {C_1, ..., C_k} be the optimal clustering of a (2 + √3)-center stable instance. Any algorithm that chooses centers {c'_1, ...
, c'_k} such that c'_i ∈ C_i induces the partition C when points are assigned to their closest centers.

In fact, Balcan et al. [8] show that in instances satisfying the strict separation property, a simple "Single Linkage" algorithm will produce a tree, a pruning of which gives the optimal clustering. Such a pruning can be found using dynamic programming to produce the optimal clustering. Recovering the optimum for (2 + √3)-resilient instances is not a new result, but is rather meant to illustrate the power of stability assumptions. Balcan and Liang [9] give a more involved polynomial algorithm for finding optima in (1 + √2)-resilient instances.

5 Additive stability

So far in this paper, our notions of stability were defined with respect to multiplicative perturbations. Similarly, we can imagine an instance being resilient with respect to additive perturbations. Consider the following definition.

Definition 10. Let d : S × S → [0, 1], and let 0 < β ≤ 1. A clustering instance (S, d) is additive β-perturbation resilient to a given objective Φ if for any function d' : S × S → R≥0 such that ∀p, q ∈ S, d(p, q) ≤ d'(p, q) ≤ d(p, q) + β, the optimal clustering C' for Φ under d' is equal to the optimal clustering C for Φ under d.

We note that in the definition above, we require all pairwise distances between points to be at most 1. Otherwise, resilience to additive perturbations would be a very weak notion, as the distances in most instances could be scaled so as to be resilient to arbitrary additive perturbations. Especially in light of positive results for other additive stability notions [1, 11], one possible hope is that our hardness results might only apply to the multiplicative case, and that we might be able to get polynomial time clustering algorithms for instances resilient to arbitrarily small additive perturbations.
We show that this is unfortunately not the case: we introduce notions of additive stability, similar to Definitions 2 and 3, and for the k-median and min-sum objectives, we show correspondences between multiplicative and additive stability.

5.1 The k-median objective

Analogously to Definition 2, we can define a notion of additive β-center stability.

Definition 11. Let d : S × S → [0, 1], and let 0 ≤ β ≤ 1. A clustering instance (S, d) is additive β-center stable for the k-median objective if for any optimal cluster C_i ∈ C with center c_i, C_j ∈ C (j ≠ i) with center c_j, any point p ∈ C_i satisfies d(p, c_i) + β < d(p, c_j).

We can now prove that perturbation resilience implies center stability.

Lemma 12. Any clustering instance satisfying additive β-perturbation resilience for the k-median objective also satisfies additive β-center stability.

Proof. The proof is similar to that of Lemma 4. We prove that for every point p and its center c_i in the optimal clustering of an additive β-perturbation resilient instance, it holds that d(p, c_j) > d(p, c_i) + β for any j ≠ i. Consider an additive β-perturbation resilient clustering instance. Assume we blow up all the pairwise distances within cluster C_i by an additive factor of β. As this is a legitimate perturbation of the distance function, the optimal clustering under this perturbation is the same as the original one. Hence, p is still assigned to the same cluster. Furthermore, since the distances within C_i were all changed by the same additive constant, c_i will remain the center of the cluster. The same holds for any other optimal clusters. Since the optimal clustering under the perturbed distances is unique, it follows that even in the perturbed distance function, p prefers c_i to c_j, which implies the lemma.
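Definition 11 also suggests a simple diagnostic for a candidate clustering: the largest admissible β is the smallest gap d(p, c_j) − d(p, c_i) over all points p and wrong centers c_j. A minimal sketch (our own helper, not from the paper):

```python
# Sketch (ours): the supremum of beta for which a given clustering satisfies
# Definition 11 (the definition's inequality is strict, so any smaller beta works).

def max_additive_beta(clusters, centers, d):
    gaps = []
    for i, cluster in enumerate(clusters):
        for p in cluster:
            for j, c_j in enumerate(centers):
                if j != i:
                    # Gap between p's distance to a wrong center and to its own.
                    gaps.append(d[p][c_j] - d[p][centers[i]])
    return min(gaps)
```

For two clusters on the unit interval, {0, 0.1} with center 0 and {0.9, 1} with center 1 (distances |x − y|), the binding points are 0.1 and 0.9, giving a margin of 0.8.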
Now we prove a lower bound that shows that the task of clustering additive (1/2 − ε)-center stable instances with respect to the k-median objective remains NP-hard.

Theorem 13. For any ε > 0, the problem of finding an optimal k-median clustering in additive (1/2 − ε)-center stable instances is NP-hard.

Proof. We use the reduction in Theorem 5, in which the metric satisfies the needed property that d : S × S → [0, 1]. We observe that the instances from the reduction are additive (1/2 − ε)-center stable. Hence, an algorithm for solving k-median on a (1/2 − ε)-center stable instance can decide whether a PDS-PP instance contains a dominating set of a given size, completing the proof.

We now consider center stability, as in the multiplicative case. We prove that additive center stability implies multiplicative center stability, and this gives us the property that any algorithm for (multiplicative) 1/(1 − β)-center stable instances will work for additive β-center stable instances.

Lemma 14. Any additive β-center stable clustering instance for the k-median objective is also (multiplicative) 1/(1 − β)-center stable.

Proof. Let the optimal clustering of an additive β-center stable clustering instance be C_1, ..., C_k, with centers c_1, ..., c_k. Let p ∈ C_i and let i ≠ j. From the stability property,

    d(p, c_j) > d(p, c_i) + β ≥ β.    (2)

We also have d(p, c_i) < d(p, c_j) − β, from which we can see

    1/(d(p, c_j) − β) < 1/d(p, c_i).

This gives us

    d(p, c_j)/d(p, c_i) > d(p, c_j)/(d(p, c_j) − β) ≥ 1/(1 − β).    (3)

Equation 3 is derived as follows. The middle term, for d(p, c_j) ≥ β (which we have from Equation 2), is monotonically decreasing in d(p, c_j). Using d(p, c_j) ≤ 1 bounds it from below. Relating the LHS to the RHS of Equation 3 gives us the needed stability property.
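Lemma 14's conclusion can be illustrated numerically. The values below are our own arbitrary choices consistent with its hypotheses (distances in [0, 1], additive gap at least β), not data from the paper.

```python
# Numeric illustration (ours) of Lemma 14: whenever d(p, c_i) + beta < d(p, c_j)
# and all distances lie in [0, 1], the ratio d(p, c_j) / d(p, c_i) must exceed
# 1 / (1 - beta).
beta = 0.3
pairs = [(d_ci, d_cj)
         for d_ci in (0.05, 0.2, 0.4)
         for d_cj in (d_ci + beta + 1e-6, 1.0)]
for d_ci, d_cj in pairs:
    assert d_ci + beta < d_cj <= 1.0  # Definition 11's hypothesis holds

# The smallest ratio among these examples, about 1.75, still clears the
# lemma's bound of 1 / (1 - 0.3), roughly 1.43.
min_ratio = min(d_cj / d_ci for d_ci, d_cj in pairs)
```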
5.2 The min-sum objective

Here we define additive min-sum stability and prove the analogous theorems for the min-sum objective.

Definition 15. Let d : S × S → [0, 1], and let 0 ≤ β ≤ 1. A clustering instance is additive β-min-sum stable for the min-sum objective if for every point p in any optimal cluster C_i and every other optimal cluster C_j (j ≠ i), it holds that d(p, C_i) + β(|C_i| − 1) < d(p, C_j).

Lemma 16. If a clustering instance is additive β-perturbation resilient, then it is also additive β-min-sum stable.

Proof. Assume to the contrary that the instance is additive β-perturbation resilient but is not additive β-min-sum stable. Then, there exist clusters C_i, C_j in the optimal solution C and a point p ∈ C_i such that d(p, C_i) + β(|C_i| − 1) ≥ d(p, C_j). Then, we perturb d as follows. We define d' such that for all points q ∈ C_i, d'(p, q) = d(p, q) + β, and for the remaining distances, d' = d. Clearly d' is a valid additive β-perturbation of d. We now note that C is not optimal under d'. Namely, we can create a cheaper solution C' that assigns point p to cluster C_j and leaves the remaining clusters unchanged, which contradicts the optimality of C. This shows that C is not the optimum under d', which contradicts the fact that the instance is additive β-perturbation resilient. Therefore we conclude that if a clustering instance is additive β-perturbation resilient, then it is also additive β-min-sum stable.

As with the k-median objective, we show that additive min-sum stability exhibits similar lower bounds as in the multiplicative case.

Theorem 17. For any ε > 0, the problem of finding an optimal min-sum clustering in additive (1/2 − ε)-min-sum stable instances is NP-hard.

Proof. We use the reduction in Theorem 7, in which the metric satisfies the property that d : S × S → [0, 1].
The instances from the reduction are additive (1/2 − ε)-min-sum stable. Hence, an algorithm for clustering a (1/2 − ε)-min-sum stable instance can solve the triangle partition problem.

Finally, as we did for the k-median objective, we can also reduce additive stability to multiplicative stability for the min-sum objective.

Lemma 18. Let t = max_{C ∈ C} |C| / (min_{C ∈ C} |C| − 1). Any additive β-min-sum stable clustering instance for the min-sum objective is also (multiplicative) 1/(1 − β/t)-min-sum stable.

Proof. Let the optimal clustering be C_1, …, C_k and let p ∈ C_i. Let i ≠ j. From the stability property, we have

    d(p, C_j) > d(p, C_i) + β(|C_i| − 1) ≥ β(|C_i| − 1).    (4)

We also have d(p, C_i) < d(p, C_j) − β(|C_i| − 1). Taking reciprocals and multiplying by d(p, C_j), we get

    d(p, C_j) / d(p, C_i) > d(p, C_j) / (d(p, C_j) − β(|C_i| − 1)) ≥ |C_j| / (|C_j| − β(|C_i| − 1))    (5)
        ≥ max_{C ∈ C} |C| / (max_{C ∈ C} |C| − β(min_{C ∈ C} |C| − 1)) ≥ 1 / (1 − β/t).    (6)

Equation 5 is derived as follows: the middle term, for d(p, C_j) ≥ β(|C_i| − 1) (which we have from Equation 4), is monotonically decreasing in d(p, C_j), and observing that d(p, C_j) ≤ |C_j| bounds it from below. Equation 6 gives the needed property.

6 Discussion

Our lower bounds, together with the structural properties implied by fairly small constants, illustrate the importance parameter settings play in stability assumptions. These results make us wonder to what degree the assumptions studied herein hold in practice; empirical study of real datasets is warranted. Another interesting direction is to relax the assumptions. Awasthi et al. [6] suggest considering stability under random, rather than worst-case, perturbations.
Balcan and Liang [9] also study a relaxed version of the assumption, where perturbations can change the optimal clustering, but not by much. It is open to what extent, and on what data, any of these approaches will yield practical improvements.

Acknowledgements

We thank Maria-Florina Balcan and Yingyu Liang for helpful discussions, and we appreciate Shai Ben-David's, Avrim Blum's, and Santosh Vempala's feedback on the writing. We also especially thank Shai Ben-David for helping us find a bug in the ALT 2012 version of this paper: its claimed one-pass streaming result was incorrect. This work was supported in part by a Simons Postdoctoral Fellowship in Theoretical Computer Science and by ARC while Lev Reyzin was at the Georgia Institute of Technology.

References

[1] Ackerman, M., and Ben-David, S. Clusterability: A theoretical study. Journal of Machine Learning Research - Proceedings Track 5 (2009), 1–8.
[2] Arora, S., Raghavan, P., and Rao, S. Approximation schemes for Euclidean k-medians and related problems. In STOC (1998), pp. 106–113.
[3] Arya, V., Garg, N., Khandekar, R., Meyerson, A., Munagala, K., and Pandit, V. Local search heuristics for k-median and facility location problems. SIAM J. Comput. 33, 3 (2004), 544–562.
[4] Awasthi, P., Balcan, M. F., Blum, A., Sheffet, O., and Vempala, S. On Nash-equilibria of approximation-stable games. In SAGT (2010), pp. 78–89.
[5] Awasthi, P., Blum, A., and Sheffet, O. Stability yields a PTAS for k-median and k-means clustering. In FOCS (2010), pp. 309–318.
[6] Awasthi, P., Blum, A., and Sheffet, O. Center-based clustering under perturbation stability. Inf. Process. Lett. 112, 1-2 (2012), 49–54.
[7] Balcan, M. F., Blum, A., and Gupta, A. Approximate clustering without the approximation. In SODA (2009).
[8] Balcan, M. F., Blum, A., and Vempala, S.
A discriminative framework for clustering via similarity functions. In STOC (2008), pp. 671–680.
[9] Balcan, M.-F., and Liang, Y. Clustering under perturbation resilience. In ICALP (1) (2012), pp. 63–74.
[10] Bartal, Y., Charikar, M., and Raz, D. Approximating min-sum k-clustering in metric spaces. In STOC (2001), pp. 11–20.
[11] Ben-David, S. Alternative measures of computational complexity with applications to agnostic learning. In TAMC (2006), pp. 231–235.
[12] Ben-David, S., von Luxburg, U., and Pál, D. A sober look at clustering stability. In COLT (2006), pp. 5–19.
[13] Bilu, Y., and Linial, N. Are stable instances easy? In Innovations in Computer Science (2010), pp. 332–341.
[14] Charikar, M., Guha, S., Tardos, É., and Shmoys, D. B. A constant-factor approximation algorithm for the k-median problem. J. Comput. Syst. Sci. 65, 1 (2002), 129–149.
[15] de la Vega, W. F., Karpinski, M., Kenyon, C., and Rabani, Y. Approximation schemes for clustering problems. In STOC (2003), pp. 50–58.
[16] Garey, M. R., and Johnson, D. S. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, 1979.
[17] Guha, S., and Khuller, S. Greedy strikes back: Improved facility location algorithms. J. Algorithms 31, 1 (1999), 228–248.
[18] Jain, K., Mahdian, M., and Saberi, A. A new greedy approach for facility location problems. In STOC (2002), ACM, pp. 731–740.
[19] Kumar, A., Sabharwal, Y., and Sen, S. Linear-time approximation schemes for clustering problems in any dimensions. J. ACM 57, 2 (2010).
[20] Lipton, R. J., Markakis, E., and Mehta, A. On stability properties of economic solution concepts. Manuscript, 2006.
[21] Lloyd, S. Least squares quantization in PCM. Information Theory, IEEE Transactions on 28, 2 (Mar. 1982), 129–137.
[22] Mihalák, M., Schöngens, M., Srámek, R., and Widmayer, P.
On the complexity of the metric TSP under stability considerations. In SOFSEM (2011), pp. 382–393.
[23] Ostrovsky, R., Rabani, Y., Schulman, L. J., and Swamy, C. The effectiveness of Lloyd-type methods for the k-means problem. In FOCS (2006), pp. 165–176.
[24] Schalekamp, F., Yu, M., and van Zuylen, A. Clustering with or without the approximation. In COCOON (2010), pp. 70–79.
[25] van Rooij, J. M. M., van Kooten Niekerk, M. E., and Bodlaender, H. L. Partition into triangles on bounded degree graphs. In SOFSEM (2011).

A Dominating set promise problem

A dominating set in an unweighted graph G = (V, E) is a subset D ⊆ V of vertices such that each vertex in V \ D has a neighbor in D. A dominating set is perfect if each vertex in V \ D has exactly one neighbor in D. The problems of finding the smallest dominating set and the smallest perfect dominating set are NP-hard. We introduce a related problem, called the perfect dominating set promise problem. In this problem we are promised that the input graph is such that all its dominating sets of size at most d are perfect, and we are asked to find a set of cardinality at most d. First, we prove the following.

Theorem 19. The perfect dominating set promise problem (PDS-PP) is NP-hard.

Proof. The 3-dimensional matching problem (3DM) is as follows: let X, Y, Z be finite disjoint sets with m = |X| = |Y| = |Z|. Let T contain triples (x, y, z) with x ∈ X, y ∈ Y, z ∈ Z, and let L = |T|. A subset M ⊆ T is a perfect 3d-matching if for any two triples (x_1, y_1, z_1), (x_2, y_2, z_2) ∈ M, we have x_1 ≠ x_2, y_1 ≠ y_2, z_1 ≠ z_2. We notice that a perfect 3d-matching M covers each element of X ∪ Y ∪ Z exactly once. Determining whether a perfect 3d-matching exists (YES vs. NO instance) in a 3d-matching instance is known to be NP-complete. Now we reduce an instance of the 3DM problem to PDS-PP on G = (V, E).
For 3DM elements X, Y, and Z we construct vertex sets V_X, V_Y, and V_Z, respectively. For each triple in T we construct a vertex in a set V_T. Additionally, we make an extra vertex v. This gives V = V_X ∪ V_Y ∪ V_Z ∪ V_T ∪ {v}. We make the edge set E as follows. Every vertex in V_T (which corresponds to a triple) has an edge to the vertices it contains in the corresponding 3DM instance (one in each of V_X, V_Y, and V_Z). Every vertex in V_T also has an edge to v.

Now we examine the structure of the smallest dominating set D in the constructed PDS-PP instance. The vertex v must belong to D so that all vertices in V_T are covered. Then, what remains is to optimally cover the vertices in V_X ∪ V_Y ∪ V_Z; the cheapest solution is to use m vertices from V_T, and this is precisely the 3DM problem, which is NP-hard. Hence, any solution of size d = m + 1 for the PDS-PP instance gives a solution to the 3DM instance. We also observe that such a solution makes a perfect dominating set: each vertex in V_T \ D has one neighbor in D, namely v, and each vertex in V_X ∪ V_Y ∪ V_Z has a unique neighbor in D, namely the vertex in V_T corresponding to its respective set in the 3DM instance.

B Average linkage for min-sum stability

Here, we further support the claim that algorithms designed for α-perturbation resilient instances with respect to the min-sum objective can often be made to work for data satisfying the more general α-min-sum stability property.

Algorithm 1: min-sum, α perturbation resilience
Input: Data set S, distance function d(·,·) on S, min_i |C_i|.
Phase 1:
• Connect each point with its (1/2) min_i |C_i| nearest neighbors.
• Initialize the clustering C′ with each connected component being a cluster.
• Repeat till only one cluster remains in C′: merge the clusters C, C′ in C′ which minimize d_avg(C, C′).
• Let T be the tree with components as leaves and internal nodes corresponding to the merges performed.
Phase 2: Apply dynamic programming on T to get the minimum min-sum cost pruning C̃.
Output: Output C̃.

One such algorithm is Algorithm 1. Balcan and Liang [9] proved that Algorithm 1 correctly clusters instances for which the condition in Lemma 20 holds. We can prove that the condition indeed holds for α-min-sum stable instances (their proof of the lemma holds for the more restricted class of perturbation-resilient instances). To state the lemma, we first define the distance between two point sets, A and B:

    d(A, B) := Σ_{p ∈ A} Σ_{q ∈ B} d(p, q).

Lemma 20. Assume the optimal clustering C is α-min-sum stable. For any two different clusters C and C′ in C and every A ⊂ C, with Ā = C \ A,

    α d(A, Ā) < d(A, C′).

Proof. From the definition of d(A, Ā), we have

    α d(A, Ā) = α Σ_{p ∈ A} Σ_{q ∈ Ā} d(p, q) ≤ α Σ_{p ∈ A} Σ_{q ∈ C} d(p, q) = Σ_{p ∈ A} α Σ_{q ∈ C} d(p, q) < Σ_{p ∈ A} Σ_{q ∈ C′} d(p, q) = d(A, C′).

The first inequality comes from Ā ⊂ C, and the second from the definition of min-sum stability. This, in addition to Lemma 6, can be used to show their algorithm can be employed for min-sum stable instances.
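To make the merge phase concrete, here is a minimal Python sketch (our own illustration, not the authors' or Balcan and Liang's code) of the set distance d(A, B), the average linkage d_avg, and the greedy merge loop of Phase 1; the nearest-neighbor preprocessing and the Phase 2 dynamic program are omitted, and the toy data is hypothetical.

```python
# Sketch of the average-linkage merge phase: repeatedly merge the pair of
# clusters minimizing d_avg, recording the merge tree Phase 2 would prune.

def d_set(A, B, d):
    """Set distance d(A, B) = sum of d(p, q) over p in A, q in B."""
    return sum(d(p, q) for p in A for q in B)

def d_avg(A, B, d):
    """Average linkage: d(A, B) / (|A| |B|)."""
    return d_set(A, B, d) / (len(A) * len(B))

def average_linkage(clusters, d):
    """Merge until one cluster remains; return the sequence of merges."""
    clusters = [tuple(C) for C in clusters]
    merges = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest average distance.
        i, j = min(
            ((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
            key=lambda ab: d_avg(clusters[ab[0]], clusters[ab[1]], d),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [C for k, C in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Toy 1-d data: two well-separated groups; the final merge joins them.
d = lambda x, y: abs(x - y)
merges = average_linkage([(0.0,), (0.1,), (0.9,), (1.0,)], d)
assert set(merges[-1][0]) | set(merges[-1][1]) == {0.0, 0.1, 0.9, 1.0}
```

Starting from singletons for simplicity (rather than the connected components of Phase 1), the two nearby pairs merge first and the last merge joins the two groups, as the final assertion checks.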