Data Stability in Clustering: A Closer Look

Shalev Ben-David¹ and Lev Reyzin²

1 Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139. shalev@mit.edu
2 Department of Mathematics, Statistics, and Computer Science, University of Illinois at Chicago, 851 South Morgan Street, Chicago, IL 60607. lreyzin@math.uic.edu

Abstract

We consider the model introduced by Bilu and Linial [13], who study problems for which the optimal clustering does not change when distances are perturbed. They show that even when a problem is NP-hard, it is sometimes possible to obtain efficient algorithms for instances resilient to certain multiplicative perturbations, e.g. on the order of O(√n) for max-cut clustering. Awasthi et al. [6] consider center-based objectives, and Balcan and Liang [9] analyze the k-median and min-sum objectives, giving efficient algorithms for instances resilient to certain constant multiplicative perturbations. Here, we are motivated by the question of to what extent these assumptions can be relaxed while allowing for efficient algorithms. We show there is little room to improve these results by giving NP-hardness lower bounds for both the k-median and min-sum objectives. On the other hand, we show that constant multiplicative resilience parameters can be so strong as to make the clustering problem trivial, leaving only a narrow range of resilience parameters for which clustering is interesting. We also consider a model of additive perturbations and give a correspondence between additive and multiplicative notions of stability. Our results provide a close examination of the consequences of assuming stability in data.

1 Introduction

Clustering is one of the most widely-used techniques in statistical data analysis.
The need to partition, or cluster, data into meaningful categories naturally arises in virtually every domain where data is abundant. Unfortunately, most of the natural clustering objectives, including k-median, k-means, and min-sum, are NP-hard to optimize [17, 18]. It is, therefore, unsurprising that many of the clustering algorithms used in practice come with few guarantees.

Motivated by overcoming the hardness results, Bilu and Linial [13] consider a perturbation resilience assumption that they argue is often implicitly made when choosing a clustering objective: that the optimum clustering to the desired objective Φ is preserved under multiplicative perturbations up to a factor α > 1 to the distances between the points. They reason that if the optimum clustering to an objective Φ is not resilient, as in, if small perturbations to the distances can cause the optimum to change, then Φ may have been the wrong objective to be optimizing in the first place.

Bilu and Linial [13] show that for max-cut clustering, instances resilient to perturbations of α = O(√n) have efficient algorithms for recovering the optimum itself. Continuing that line of research, Awasthi et al. [6] give a polynomial time algorithm that finds the optimum clustering for instances resilient to multiplicative perturbations of α = 3 for center-based clustering objectives¹ when centers must come from the data (we call this the proper setting), and α = 2 + √3 when the centers do not need to (we call this the Steiner setting). Their method relies on a stability property implied by perturbation resilience (see Section 2). For the Steiner case, they also prove an NP-hardness lower bound of α = 3.
Subsequently, Balcan and Liang [9] consider the proper setting and improve the constant past α = 3 by giving a new polynomial time algorithm for the k-median objective for α = 1 + √2 ≈ 2.4 stable instances.

1.1 Our results

Our work further delves into the proper setting, for which no lower bounds have previously been shown for the stability property.² In Section 3 we show that even in the proper case, where the algorithm is restricted to choosing its centers from the data, for any ε > 0, it is NP-hard to optimally cluster (2 − ε)-stable instances, both for the k-median and min-sum objectives (Theorems 5 and 7). To prove this for the min-sum objective, we define a new notion of stability that is implied by perturbation resilience, a notion that may be of independent interest.

Then in Section 4, we look at the implications of assuming resilience or stability in the data, even for a constant perturbation parameter α. We show that for even fairly small constants, the data begins to have very strong structural properties, so as to make the clustering task fairly trivial. When α exceeds 2 + √3, the data begins to show what is called strict separation, where each point is closer to points in its own cluster than to points in other clusters (Theorem 8).

Finally, in Section 5, we look at whether the picture can be improved for clustering data that is stable under additive, rather than multiplicative, perturbations. One hope would be that additive stability is a more useful assumption, where a polynomial time algorithm for ε-stable instances might be possible. Unfortunately, this is not the case. We consider a natural additive model and show that severe lower bounds hold for the additive notion as well (Theorems 13 and 17).
On the positive side, we show via reductions that algorithms for multiplicatively stable data also work for additively stable data for a different but related parameter.

Our results demonstrate that on the one hand, it is hard to improve the algorithms to work for low stability constants, and that on the other hand, higher stability constants can be quite strong, to the point of trivializing the problem. Furthermore, switching from a multiplicative to an additive stability assumption does not help to circumvent the hardness results, and perhaps makes matters worse. These results, taken together, narrow the range of interesting parameters for theoretical study and highlight the strong role that the choice of constant plays in stability assumptions.

¹ For center-based clustering objectives, the clustering is defined by a choice of centers, and the objective is a function of the distances of the points to their closest center.

² One thing to note is that there is some difference between the very related resilience and stability properties (see Section 2), stability being weaker and more general [6]. Some of our results apply to both notions, and some only to stability. This still leaves open the possibility of devising polynomial-time algorithms that, for a much smaller α, work on all the α-perturbation resilient instances, but not on all α-stable ones.

1.2 Previous work

We examine previous work on stability, both as a data-dependent assumption in clustering and in other settings.
1.2.1 Stability as a data assumption in clustering

The classical approach in theoretical computer science to dealing with the worst-case NP-hardness of clustering has been to develop efficient approximation algorithms for the various clustering objectives [2, 3, 10, 14, 19, 15], and significant efforts have been exerted to improve approximation ratios and to prove lower bounds. In particular, for metric k-median, the best known guarantee is a (3 + ε)-approximation [3], and the best known lower bound is (1 + 1/e)-hardness of approximation [17, 18]. For metric min-sum, the best known result is an O(polylog(n))-approximation to the optimum [10].

In contrast, a more recent direction of research has been to characterize under what conditions we can find a desirable clustering efficiently. Perturbation resilience/stability are such conditions, but they are related to other stability notions in clustering. Ostrovsky et al. [23] demonstrate the effectiveness of Lloyd-type algorithms [21] on instances with the stability property that the cost of the optimal k-means solution is small compared to the cost of the optimal (k − 1)-means solution, and their guarantees have later been improved by Awasthi et al. [5]. In a different line of work, Balcan et al. [8] consider what stability properties of a similarity function, with respect to the ground truth clustering, are sufficient to cluster well. In a related direction, Balcan et al. [7] argue that, for a given objective Φ, approximation algorithms are most useful when the clusterings they produce are structurally close to the optimum originally sought in choosing to optimize Φ in the first place.
They then show that, for many objectives, if one makes this assumption explicit (that all c-approximations to the objective yield a clustering that is ε-close to the optimum), then one can recover an ε-close clustering in polynomial time, even for values of c below the hardness of approximation constant. The assumptions and algorithms of Balcan et al. [7] have subsequently been carefully analyzed by Schalekamp et al. [24].

Ackerman and Ben-David [1] also study various notions of resilience and, among their results, introduce a notion of stability similar to the one studied herein, except only the positions of cluster centers are perturbed. Their notion is strictly weaker, i.e. any perturbation resilient instance is also stable in their framework. They show that Euclidean instances stable to perturbations of cluster centers will have polynomial algorithms for finding near-optimal clusterings. However, they require the desired number of clusters to be small compared to the input size: their algorithms have running times exponential in the number of clusters. In this work, we do not make that assumption; this means our positive results are more general, but our lower bounds don't apply.

1.2.2 Stability in other settings

Just as the Bilu and Linial [13] notion of stability gives conditions under which efficient clustering is possible, similar concepts have been studied in game theory. Lipton et al. [20] propose a notion of stability for solution concepts of games. They define a game to be stable if small perturbations to the payoff matrix do not significantly change the value of the game, and they show games are generally not stable under this definition. Then, in a similar spirit to the work of Bilu and Linial, Awasthi et al.
[4] propose a related stability condition for a game, which can be leveraged in finding its approximate Nash equilibria.

The Bilu and Linial [13] notion of stability has also been studied in the context of the metric traveling salesman problem, for which Mihalák et al. [22] give efficient algorithms for 1.8-perturbation resilient instances, illustrating another case where a stability assumption can circumvent NP-hardness.

From a different direction, Ben-David et al. [12] consider the stability of clustering algorithms, as opposed to instances. They say an algorithm is stable if it produces similar clusterings for different inputs drawn from the same distribution. They argue that stability is not as useful a notion as had been previously thought in determining various parameters, such as the optimal number of clusters.

2 Notation and preliminaries

In a clustering instance, we are given a set S of n points in a finite metric space, and we denote d : S × S → R≥0 as the distance function. Φ denotes the objective function over a partition of S into k clusters which we want to optimize over the metric, i.e. Φ assigns a score to every clustering. The optimal clustering with respect to Φ is denoted as C = {C_1, C_2, ..., C_k}.

The k-median objective requires S to be partitioned into k disjoint subsets {S_1, ..., S_k} and each subset S_i to be assigned a center s_i ∈ S. The goal is to minimize Φ_med, measured by

    Φ_med(S_1, ..., S_k) := Σ_{i=1}^{k} Σ_{p ∈ S_i} d(p, s_i).

The centers in the optimal clustering are denoted as c_1, ..., c_k. In an optimal solution, each point is assigned to its nearest center.

For the min-sum objective, S is partitioned into k disjoint subsets, and the objective is to minimize Φ_{m-s}, measured by

    Φ_{m-s}(S_1, ..., S_k) := Σ_{i=1}^{k} Σ_{p,q ∈ S_i} d(p, q).
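For concreteness, both objectives can be computed directly from a distance table. The sketch below is our own illustration (the helper names are not from the paper); it takes d as a dict of dicts with d[p][p] = 0.

```python
# Sketch (ours, not from the paper): the two clustering objectives above,
# with a clustering given as a list of clusters (lists of point labels).

def phi_med(clusters, centers, d):
    # k-median cost: each point pays its distance to its cluster's center.
    return sum(d[p][centers[i]]
               for i, cluster in enumerate(clusters)
               for p in cluster)

def phi_min_sum(clusters, d):
    # min-sum cost: sum of d(p, q) over all ordered intra-cluster pairs;
    # the diagonal terms d[p][p] = 0 contribute nothing.
    return sum(d[p][q]
               for cluster in clusters
               for p in cluster
               for q in cluster)
```

For a single cluster {a, b, c} with all pairwise distances 1/2 and center a, phi_med gives 1 and phi_min_sum gives 3 (each unordered pair is counted twice, matching the ordered-pair sum in Φ_{m-s}).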
Now, we define the perturbation resilience notion introduced by Bilu and Linial [13].

Definition 1. For α > 1, a clustering instance (S, d) is α-perturbation resilient to a given objective Φ if for any function d' : S × S → R≥0 such that ∀p, q ∈ S, d(p, q) ≤ d'(p, q) ≤ α d(p, q), the optimal clustering C' for Φ under d' is equal to the optimal clustering C for Φ under d.

In this paper, we consider the k-median and min-sum objectives, and we thereby investigate the following definitions of stability, which are implied by perturbation resilience, as shown in Sections 3.1 and 3.2. The following definition is adapted from Awasthi et al. [6].

Definition 2. A clustering instance (S, d) is α-center stable for the k-median objective if for any optimal cluster C_i ∈ C with center c_i, C_j ∈ C (j ≠ i) with center c_j, any point p ∈ C_i satisfies α d(p, c_i) < d(p, c_j).

Next, we define a new analogous notion of stability for the min-sum objective, and we show in Section 3.2 that for the min-sum objective, perturbation resilience implies min-sum stability. To help with exposition for the min-sum objective, we define the distance from a point p to a set of points A,

    d(p, A) := Σ_{q ∈ A} d(p, q).

Definition 3. A clustering instance (S, d) is α-min-sum stable for the min-sum objective if for all optimal clusters C_i, C_j ∈ C (j ≠ i), any point p ∈ C_i satisfies α d(p, C_i) < d(p, C_j).

This is a useful generalization because, as we shall see, known algorithms working under the perturbation resilience assumption can also be made to work under the weaker notion of min-sum stability.

3 Lower bounds

3.1 The k-median objective

Awasthi et al. [6] prove the following connection between perturbation resilience and stability.
Both their algorithms and the algorithms of Balcan and Liang [9] crucially use this stability assumption.

Lemma 4. Any clustering instance that is α-perturbation resilient for the k-median objective also satisfies α-center stability.

Awasthi et al. [6] proved that for α < 3 − ε, k-median clustering of α-center stable instances is NP-hard when Steiner points are allowed in the data. Afterwards, Balcan and Liang [9] circumvented this lower bound and achieved a polynomial time algorithm for α = 1 + √2 by assuming the algorithm must choose cluster centers from within the data. In the theorem below, we prove a lower bound for the center stable property in this more restricted setting, showing there is little hope of progress even for data where each point is nearly twice closer to its own center than to any other.

Theorem 5. For any ε > 0, the problem of solving (2 − ε)-center stable k-median instances is NP-hard.

Proof. We reduce from what we call the perfect dominating set promise problem (PDS-PP), which we prove to be NP-hard (see Appendix), where we are promised that the input graph G = (V, E) is such that all of its smallest dominating sets D are perfect, and we are asked to find a dominating set of size at most d. The reduction is simple. We take an instance of the NP-hard problem PDS-PP on G = (V, E) on n vertices and reduce it to an α = (2 − ε)-center stable instance. Our distance metric is as follows. Every vertex v ∈ V becomes a point in the k-median instance. For any two vertices (u, v) ∈ E we define d(u, v) = 1/2. When (u, v) ∉ E, we set d(u, v) = 1. This trivially satisfies the triangle inequality for any graph G, as the sum of the distances along any two edges is at least 1. We set k = d.
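As an aside, this edge/non-edge metric is mechanical to build and check. The following sketch is our own (helper names are ours, and it is not part of the proof); it constructs the metric from an adjacency list and spot-checks the triangle inequality.

```python
# Sketch (ours): the metric used in the reduction, plus a brute-force check
# that it is indeed a metric.

def reduction_metric(vertices, edges):
    # d(u, v) = 1/2 for edges, 1 for non-edges, 0 on the diagonal. Any two
    # such distances sum to at least 1, so the triangle inequality holds.
    edge_set = {frozenset(e) for e in edges}
    d = {}
    for u in vertices:
        d[u] = {}
        for v in vertices:
            if u == v:
                d[u][v] = 0.0
            else:
                d[u][v] = 0.5 if frozenset((u, v)) in edge_set else 1.0
    return d

def satisfies_triangle_inequality(vertices, d):
    return all(d[u][w] <= d[u][v] + d[v][w]
               for u in vertices for v in vertices for w in vertices)
```

For instance, on the star graph with center 0 and leaves 1, 2, 3, the construction gives d(0, 1) = 1/2 and d(1, 2) = 1, and the triangle inequality holds with equality along leaf-center-leaf paths.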
We observe that a k-median solution of cost (n − k)/2 corresponds to a dominating set of size d in the PDS-PP instance, and is therefore NP-hard to find. We also observe that because all solutions of size ≤ d in the PDS-PP instance are perfect, each (non-center) point in the k-median solution has distance 1/2 to exactly one (its own) center, and a distance of 1 to every other center. Hence, this instance is α = (2 − ε)-center stable, completing the proof.

3.2 The min-sum objective

Analogously to Lemma 4, we can show that α-perturbation resilience implies our new notion of α-min-sum stability.

Lemma 6. If a clustering instance is α-perturbation resilient, then it is also α-min-sum stable.

Proof. Assume to the contrary that the instance is α-perturbation resilient but is not α-min-sum stable. Then, there exist clusters C_i, C_j in the optimal solution C and a point p ∈ C_i such that α d(p, C_i) ≥ d(p, C_j). We perturb d as follows. We define d' such that for all points q ∈ C_i, d'(p, q) = α d(p, q), and for the remaining distances, d' = d. Clearly d' is an α-perturbation of d. We now note that C is not optimal under d'. Namely, we can create a cheaper solution C' that assigns point p to cluster C_j and leaves the remaining clusters unchanged, which contradicts the optimality of C. This shows that C is not the optimum under d', which contradicts the instance being α-perturbation resilient. Therefore we can conclude that if a clustering instance is α-perturbation resilient, then it must also be α-min-sum stable.

Moreover, we show in the Appendix that the min-sum algorithm of Balcan and Liang [9], which requires α to be bounded from below by 3 (max_{C∈C} |C|) / (min_{C∈C} |C| − 1), works with this more general condition. This further motivates the following bound.

Theorem 7.
For any ε > 0, the problem of finding an optimal min-sum k-clustering in (2 − ε)-min-sum stable instances is NP-hard.

Proof. Consider the triangle partition problem. Let G = (V, E) be a graph with |V| = n = 3k, and let each vertex have maximum degree d = 4. The problem of whether the vertices of G can be partitioned into sets V_1, V_2, ..., V_k such that each V_i contains a triangle in G is NP-complete [16], even with the degree restriction [25].

We reduce the triangle partition problem to a (2 − ε)-min-sum stable clustering instance. The metric is as follows. Every vertex v ∈ V becomes a point in the min-sum instance. For any two vertices (u, v) ∈ E we define d(u, v) = 1/2. When (u, v) ∉ E, we set d(u, v) = 1. This satisfies the triangle inequality for any graph, as the sum of the distances along any two edges is at least 1.

Now we show that we can cluster this instance into k clusters such that the cost of the min-sum objective is exactly n if and only if the original instance is a YES instance of triangle partition. This follows from two facts.

1. A YES instance of triangle partition maps to a clustering into k = n/3 clusters of size 3 with pairwise distances 1/2, for a total cost of n.

2. A cost of n is the best achievable because a balanced clustering with all minimum pairwise intra-cluster distances is optimal. In particular, (1/2) Σ_i n_i(n_i − 1) subject to Σ_i n_i = n is a lower bound on the cost of any k-clustering with cluster sizes n_i. By convexity, this expression is minimized by the balanced clustering, where it equals n.

In the clustering from our reduction, each point has a sum of distances to its own cluster of 1. Now we examine the sum of distances of any point to other clusters.
A point has two distances of 1/2 (edges) to its own cluster, and because d = 4, it can have at most two more distances of 1/2 (edges) into any other cluster, leaving the third distance to the other cluster to be 1, yielding a total cost of ≥ 2 into any other cluster. Hence, the instance is (2 − ε)-min-sum stable.

We note that it is tempting to restrict the degree bound to 3 in order to further improve the lower bound. Unfortunately, the triangle partition problem on graphs of maximum degree 3 is polynomial-time solvable [25], and we cannot improve the factor of 2 − ε by restricting to graphs of degree 3 in this reduction.

4 Strong consequences of stability

In Section 3, we showed that k-median clustering of even (2 − ε)-center stable instances is NP-hard. In this section we show that even for resilience to constant multiplicative perturbations of α > 2 + √3 ≈ 3.7, the data obtains a property referred to as strict separation, where all points are closer to all other points in their own cluster than to points in any other cluster; this property is known to be helpful in clustering [8].

4.1 Strict separation

Now we prove the following theorem, which shows that even for relatively small multiplicative constants α, center stable, and therefore perturbation resilient, instances exhibit strict separation.

Theorem 8. Let C = {C_1, ..., C_k} be the optimal clustering of an α-center stable instance with α > 2 + √3. Let p, p' ∈ C_i and q ∈ C_j (i ≠ j). Then d(p, q) > d(p, p').

Proof. Let C be an α-center stable clustering. Let p, p' have center c_1 in C and let q have center c_2, with c_1 ≠ c_2. We have

    d(p, p') ≤ d(p', c_1) + d(p, c_1) ≤ d(p', c_1) + (1/α) d(p, c_2),    (1)

where the second inequality follows from the definition of α-center stability.
Note that

    d(p', c_1) ≤ (1/α) d(p', c_2),

and subtracting (1/α) d(p', c_1) from both sides gives

    (1 − 1/α) d(p', c_1) ≤ (1/α) (d(p', c_2) − d(p', c_1)) ≤ (1/α) d(c_1, c_2).

This then implies

    d(p', c_1) ≤ (1/(α − 1)) d(c_1, c_2)
              ≤ (1/(α − 1)) d(p, c_1) + (1/(α − 1)) d(p, c_2)
              ≤ (1/(α(α − 1))) d(p, c_2) + (1/(α − 1)) d(p, c_2)
              = ((α + 1)/(α(α − 1))) d(p, c_2).

Plugging this into Equation 1 gives

    d(p, p') ≤ ((α + 1)/(α(α − 1))) d(p, c_2) + (1/α) d(p, c_2) = (2α/(α(α − 1))) d(p, c_2) = (2/(α − 1)) d(p, c_2).

We also have

    d(p, c_2) ≤ d(p, q) + d(q, c_2)
             ≤ d(p, q) + (1/α) d(q, c_1)
             ≤ d(p, q) + (1/α) (d(p, q) + d(p, c_1))
             = ((α + 1)/α) d(p, q) + (1/α) d(p, c_1)
             ≤ ((α + 1)/α) d(p, q) + (1/α²) d(p, c_2),

and subtracting (1/α²) d(p, c_2) from both sides gives

    ((α² − 1)/α²) d(p, c_2) ≤ ((α + 1)/α) d(p, q),   or   d(p, c_2) ≤ (α/(α − 1)) d(p, q).

We then conclude

    d(p, p') ≤ (2/(α − 1)) d(p, c_2) ≤ (2/(α − 1)) (α/(α − 1)) d(p, q) = (2α/(α − 1)²) d(p, q).

Thus we get d(p, p') < d(p, q) if we set α such that 2α < (α − 1)². Solving this gives α > 2 + √3.

The following example shows that Theorem 8 is tight, in that lower values of α cannot alone imply strict separation. Consider the metric space R, the line. The example is given by p' = 0, c_1 = √3, p = 1 + √3, q = 2 + 2√3, and c_2 = 3 + 3√3, with p and p' belonging to the cluster of c_1 and q belonging to the cluster of c_2. This example is α-center stable for α = 2 + √3, but it does not satisfy strict separation. In particular, this example makes every inequality in the proof above tight when α = 2 + √3.

The strict separation property, however, is quite strong, as can be seen from the following corollary.

Corollary 9. Let C = {C_1, ..., C_k} be the optimal clustering of a (2 + √3)-center stable instance. Any algorithm that chooses centers {c'_1, ...
, c'_k} such that c'_i ∈ C_i induces the partition C when points are assigned to their closest centers.

In fact, Balcan et al. [8] show that in instances satisfying the strict separation property, a simple "Single Linkage" algorithm will produce a tree, a pruning of which gives the optimal clustering. Such a pruning can be found using dynamic programming to produce the optimal clustering. Recovering the optimum for (2 + √3)-resilient instances is not a new result, but is rather meant to illustrate the power of stability assumptions. Balcan and Liang [9] give a more involved polynomial algorithm for finding optima in (1 + √2)-resilient instances.

5 Additive stability

So far in this paper, our notions of stability were defined with respect to multiplicative perturbations. Similarly, we can imagine an instance being resilient with respect to additive perturbations. Consider the following definition.

Definition 10. Let d : S × S → [0, 1], and let 0 < β ≤ 1. A clustering instance (S, d) is additive β-perturbation resilient to a given objective Φ if for any function d' : S × S → R≥0 such that ∀p, q ∈ S, d(p, q) ≤ d'(p, q) ≤ d(p, q) + β, the optimal clustering C' for Φ under d' is equal to the optimal clustering C for Φ under d.

We note that in the definition above, we require all pairwise distances between points to be at most 1. Otherwise, resilience to additive perturbations would be a very weak notion, as the distances in most instances could be scaled so as to be resilient to arbitrary additive perturbations. Especially in light of positive results for other additive stability notions [1, 11], one possible hope is that our hardness results might only apply to the multiplicative case, and that we might be able to get polynomial time clustering algorithms for instances resilient to arbitrarily small additive perturbations.
We show that this is unfortunately not the case: we introduce notions of additive stability, similar to Definitions 2 and 3, and for the k-median and min-sum objectives, we show correspondences between multiplicative and additive stability.

5.1 The k-median objective

Analogously to Definition 2, we can define a notion of additive β-center stability.

Definition 11. Let d : S × S → [0, 1], and let 0 ≤ β ≤ 1. A clustering instance (S, d) is additive β-center stable for the k-median objective if for any optimal cluster C_i ∈ C with center c_i, C_j ∈ C (j ≠ i) with center c_j, any point p ∈ C_i satisfies d(p, c_i) + β < d(p, c_j).

We can now prove that perturbation resilience implies center stability.

Lemma 12. Any clustering instance satisfying additive β-perturbation resilience for the k-median objective also satisfies additive β-center stability.

Proof. The proof is similar to that of Lemma 4. We prove that for every point p and its center c_i in the optimal clustering of an additive β-perturbation resilient instance, it holds that d(p, c_j) > d(p, c_i) + β for any j ≠ i. Consider an additive β-perturbation resilient clustering instance. Assume we blow up all the pairwise distances within cluster C_i by an additive factor of β. As this is a legitimate perturbation of the distance function, the optimal clustering under this perturbation is the same as the original one. Hence, p is still assigned to the same cluster. Furthermore, since the distances within C_i were all changed by the same additive constant, c_i will remain the center of the cluster. The same holds for any other optimal clusters. Since the optimal clustering under the perturbed distances is unique, it follows that even in the perturbed distance function, p prefers c_i to c_j, which implies the lemma.
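Definition 11 also suggests a simple diagnostic for a candidate clustering: the largest admissible β is the smallest gap d(p, c_j) − d(p, c_i) over all points p and wrong centers c_j. A minimal sketch (our own helper, not from the paper):

```python
# Sketch (ours): the supremum of beta for which a given clustering satisfies
# Definition 11 (the definition's inequality is strict, so any smaller beta works).

def max_additive_beta(clusters, centers, d):
    gaps = []
    for i, cluster in enumerate(clusters):
        for p in cluster:
            for j, c_j in enumerate(centers):
                if j != i:
                    # Gap between p's distance to a wrong center and to its own.
                    gaps.append(d[p][c_j] - d[p][centers[i]])
    return min(gaps)
```

For two clusters on the unit interval, {0, 0.1} with center 0 and {0.9, 1} with center 1 (distances |x − y|), the binding points are 0.1 and 0.9, giving a margin of 0.8.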
Now we prove a lower bound that shows that the task of clustering additive (1/2 − ε)-center stable instances with respect to the k-median objective remains NP-hard.

Theorem 13. For any ε > 0, the problem of finding an optimal k-median clustering in additive (1/2 − ε)-center stable instances is NP-hard.

Proof. We use the reduction in Theorem 5, in which the metric satisfies the needed property that d : S × S → [0, 1]. We observe that the instances from the reduction are additive (1/2 − ε)-center stable. Hence, an algorithm for solving k-median on a (1/2 − ε)-center stable instance can decide whether a PDS-PP instance contains a dominating set of a given size, completing the proof.

We now consider center stability, as in the multiplicative case. We prove that additive center stability implies multiplicative center stability, and this gives us the property that any algorithm for (multiplicative) 1/(1 − β)-center stable instances will work for additive β-center stable instances.

Lemma 14. Any additive β-center stable clustering instance for the k-median objective is also (multiplicative) 1/(1 − β)-center stable.

Proof. Let the optimal clustering of an additive β-center stable clustering instance be C_1, ..., C_k, with centers c_1, ..., c_k. Let p ∈ C_i and let i ≠ j. From the stability property,

    d(p, c_j) > d(p, c_i) + β ≥ β.    (2)

We also have d(p, c_i) < d(p, c_j) − β, from which we can see

    1/(d(p, c_j) − β) < 1/d(p, c_i).

This gives us

    d(p, c_j)/d(p, c_i) > d(p, c_j)/(d(p, c_j) − β) ≥ 1/(1 − β).    (3)

Equation 3 is derived as follows. The middle term, for d(p, c_j) ≥ β (which we have from Equation 2), is monotonically decreasing in d(p, c_j). Using d(p, c_j) ≤ 1 bounds it from below. Relating the LHS to the RHS of Equation 3 gives us the needed stability property.
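Lemma 14's conclusion can be illustrated numerically. The values below are our own arbitrary choices consistent with its hypotheses (distances in [0, 1], additive gap at least β), not data from the paper.

```python
# Numeric illustration (ours) of Lemma 14: whenever d(p, c_i) + beta < d(p, c_j)
# and all distances lie in [0, 1], the ratio d(p, c_j) / d(p, c_i) must exceed
# 1 / (1 - beta).
beta = 0.3
pairs = [(d_ci, d_cj)
         for d_ci in (0.05, 0.2, 0.4)
         for d_cj in (d_ci + beta + 1e-6, 1.0)]
for d_ci, d_cj in pairs:
    assert d_ci + beta < d_cj <= 1.0  # Definition 11's hypothesis holds

# The smallest ratio among these examples, about 1.75, still clears the
# lemma's bound of 1 / (1 - 0.3), roughly 1.43.
min_ratio = min(d_cj / d_ci for d_ci, d_cj in pairs)
```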
5.2 The min-sum objective

Here we define additive min-sum stability and prove the analogous theorems for the min-sum objective.

Definition 15. Let d : S × S → [0, 1], and let 0 ≤ β ≤ 1. A clustering instance is additive β-min-sum stable for the min-sum objective if for every point p in any optimal cluster C_i and every other optimal cluster C_j (j ≠ i), it holds that d(p, C_i) + β(|C_i| − 1) < d(p, C_j).

Lemma 16. If a clustering instance is additive β-perturbation resilient, then it is also additive β-min-sum stable.

Proof. Assume to the contrary that the instance is additive β-perturbation resilient but is not additive β-min-sum stable. Then, there exist clusters C_i, C_j in the optimal solution C and a point p ∈ C_i such that d(p, C_i) + β(|C_i| − 1) ≥ d(p, C_j). Then, we perturb d as follows. We define d' such that for all points q ∈ C_i, d'(p, q) = d(p, q) + β, and for the remaining distances, d' = d. Clearly d' is a valid additive β-perturbation of d. We now note that C is not optimal under d'. Namely, we can create a cheaper solution C' that assigns point p to cluster C_j and leaves the remaining clusters unchanged, which contradicts the optimality of C. This shows that C is not the optimum under d', which contradicts the fact that the instance is additive β-perturbation resilient. Therefore we conclude that if a clustering instance is additive β-perturbation resilient, then it is also additive β-min-sum stable.

As with the k-median objective, we show that additive min-sum stability exhibits similar lower bounds as in the multiplicative case.

Theorem 17. For any ε > 0, the problem of finding an optimal min-sum clustering in additive (1/2 − ε)-min-sum stable instances is NP-hard.

Proof. We use the reduction in Theorem 7, in which the metric satisfies the property that d : S × S → [0, 1].
The instances from the reduction are additive (1/2 − ε)-min-sum stable. Hence, an algorithm for clustering a (1/2 − ε)-min-sum stable instance can solve the triangle partition problem.

Finally, as we did for the k-median objective, we can also reduce additive stability to multiplicative stability for the min-sum objective.

Lemma 18. Let t = max_{C ∈ C} |C| / (min_{C ∈ C} |C| − 1). Any additive β-min-sum stable clustering instance for the min-sum objective is also (multiplicative) 1/(1 − β/t)-min-sum stable.

Proof. Let the optimal clustering be C_1, …, C_k and let p ∈ C_i. Let i ≠ j. From the stability property, we have

    d(p, C_j) > d(p, C_i) + β(|C_i| − 1) ≥ β(|C_i| − 1).    (4)

We also have d(p, C_i) < d(p, C_j) − β(|C_i| − 1). Taking reciprocals and multiplying by d(p, C_j), we get

    d(p, C_j) / d(p, C_i) > d(p, C_j) / (d(p, C_j) − β(|C_i| − 1)) ≥ |C_j| / (|C_j| − β(|C_i| − 1))    (5)
        ≥ max_{C ∈ C} |C| / (max_{C ∈ C} |C| − β(min_{C ∈ C} |C| − 1)) ≥ 1 / (1 − β/t).    (6)

Equation 5 is derived as follows: the middle term, for d(p, C_j) ≥ β(|C_i| − 1) (which we have from Equation 4), is monotonically decreasing in d(p, C_j), and observing that d(p, C_j) ≤ |C_j| bounds it from below. Equation 6 gives the needed property.

6 Discussion

Our lower bounds, together with the structural properties implied by fairly small constants, illustrate the importance parameter settings play in stability assumptions. These results make us wonder to what degree the assumptions studied herein hold in practice; empirical study of real datasets is warranted. Another interesting direction is to relax the assumptions. Awasthi et al. [6] suggest considering stability under random, rather than worst-case, perturbations.
Balcan and Liang [9] also study a relaxed version of the assumption, where perturbations can change the optimal clustering, but not by much. It is open to what extent, and on what data, any of these approaches will yield practical improvements.

Acknowledgements

We thank Maria-Florina Balcan and Yingyu Liang for helpful discussions, and we appreciate Shai Ben-David's, Avrim Blum's, and Santosh Vempala's feedback on the writing. We also especially thank Shai Ben-David for helping us find a bug in the ALT 2012 version of this paper: its claimed one-pass streaming result was incorrect. This work was supported in part by a Simons Postdoctoral Fellowship in Theoretical Computer Science and by ARC while Lev Reyzin was at the Georgia Institute of Technology.

References

[1] Ackerman, M., and Ben-David, S. Clusterability: A theoretical study. Journal of Machine Learning Research - Proceedings Track 5 (2009), 1–8.
[2] Arora, S., Raghavan, P., and Rao, S. Approximation schemes for Euclidean k-medians and related problems. In STOC (1998), pp. 106–113.
[3] Arya, V., Garg, N., Khandekar, R., Meyerson, A., Munagala, K., and Pandit, V. Local search heuristics for k-median and facility location problems. SIAM J. Comput. 33, 3 (2004), 544–562.
[4] Awasthi, P., Balcan, M. F., Blum, A., Sheffet, O., and Vempala, S. On Nash-equilibria of approximation-stable games. In SAGT (2010), pp. 78–89.
[5] Awasthi, P., Blum, A., and Sheffet, O. Stability yields a PTAS for k-median and k-means clustering. In FOCS (2010), pp. 309–318.
[6] Awasthi, P., Blum, A., and Sheffet, O. Center-based clustering under perturbation stability. Inf. Process. Lett. 112, 1-2 (2012), 49–54.
[7] Balcan, M. F., Blum, A., and Gupta, A. Approximate clustering without the approximation. In SODA (2009).
[8] Balcan, M. F., Blum, A., and Vempala, S.
A discriminative framework for clustering via similarity functions. In STOC (2008), pp. 671–680.
[9] Balcan, M.-F., and Liang, Y. Clustering under perturbation resilience. In ICALP (1) (2012), pp. 63–74.
[10] Bartal, Y., Charikar, M., and Raz, D. Approximating min-sum k-clustering in metric spaces. In STOC (2001), pp. 11–20.
[11] Ben-David, S. Alternative measures of computational complexity with applications to agnostic learning. In TAMC (2006), pp. 231–235.
[12] Ben-David, S., von Luxburg, U., and Pál, D. A sober look at clustering stability. In COLT (2006), pp. 5–19.
[13] Bilu, Y., and Linial, N. Are stable instances easy? In Innovations in Computer Science (2010), pp. 332–341.
[14] Charikar, M., Guha, S., Tardos, É., and Shmoys, D. B. A constant-factor approximation algorithm for the k-median problem. J. Comput. Syst. Sci. 65, 1 (2002), 129–149.
[15] de la Vega, W. F., Karpinski, M., Kenyon, C., and Rabani, Y. Approximation schemes for clustering problems. In STOC (2003), pp. 50–58.
[16] Garey, M. R., and Johnson, D. S. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, 1979.
[17] Guha, S., and Khuller, S. Greedy strikes back: Improved facility location algorithms. J. Algorithms 31, 1 (1999), 228–248.
[18] Jain, K., Mahdian, M., and Saberi, A. A new greedy approach for facility location problems. In STOC (2002), ACM, pp. 731–740.
[19] Kumar, A., Sabharwal, Y., and Sen, S. Linear-time approximation schemes for clustering problems in any dimensions. J. ACM 57, 2 (2010).
[20] Lipton, R. J., Markakis, E., and Mehta, A. On stability properties of economic solution concepts. Manuscript, 2006.
[21] Lloyd, S. Least squares quantization in PCM. Information Theory, IEEE Transactions on 28, 2 (Mar. 1982), 129–137.
[22] Mihalák, M., Schöngens, M., Srámek, R., and Widmayer, P.
On the complexity of the metric TSP under stability considerations. In SOFSEM (2011), pp. 382–393.
[23] Ostrovsky, R., Rabani, Y., Schulman, L. J., and Swamy, C. The effectiveness of Lloyd-type methods for the k-means problem. In FOCS (2006), pp. 165–176.
[24] Schalekamp, F., Yu, M., and van Zuylen, A. Clustering with or without the approximation. In COCOON (2010), pp. 70–79.
[25] van Rooij, J. M. M., van Kooten Niekerk, M. E., and Bodlaender, H. L. Partition into triangles on bounded degree graphs. In SOFSEM (2011).

A Dominating set promise problem

A dominating set in an unweighted graph G = (V, E) is a subset D ⊆ V of vertices such that each vertex in V \ D has a neighbor in D. A dominating set is perfect if each vertex in V \ D has exactly one neighbor in D. The problems of finding the smallest dominating set and the smallest perfect dominating set are NP-hard. We introduce a related problem, called the perfect dominating set promise problem. In this problem we are promised that the input graph is such that all its dominating sets of size at most d are perfect, and we are asked to find a set of cardinality at most d. First, we prove the following.

Theorem 19. The perfect dominating set promise problem (PDS-PP) is NP-hard.

Proof. The 3-dimensional matching problem (3DM) is as follows: let X, Y, Z be finite disjoint sets with m = |X| = |Y| = |Z|. Let T contain triples (x, y, z) with x ∈ X, y ∈ Y, z ∈ Z, and let L = |T|. A subset M ⊆ T is a perfect 3d-matching if for any two triples (x_1, y_1, z_1), (x_2, y_2, z_2) ∈ M, we have x_1 ≠ x_2, y_1 ≠ y_2, z_1 ≠ z_2. We notice that a perfect 3d-matching M covers each element of X ∪ Y ∪ Z exactly once. Determining whether a perfect 3d-matching exists (YES vs. NO instance) in a 3d-matching instance is known to be NP-complete. Now we reduce an instance of the 3DM problem to PDS-PP on G = (V, E).
For 3DM elements X, Y, and Z we construct vertex sets V_X, V_Y, and V_Z, respectively. For each triple in T we construct a vertex in a set V_T. Additionally, we make an extra vertex v. This gives V = V_X ∪ V_Y ∪ V_Z ∪ V_T ∪ {v}. We make the edge set E as follows. Every vertex in V_T (which corresponds to a triple) has an edge to the vertices it contains in the corresponding 3DM instance (one in each of V_X, V_Y, and V_Z). Every vertex in V_T also has an edge to v.

Now we examine the structure of the smallest dominating set D in the constructed PDS-PP instance. The vertex v must belong to D so that all vertices in V_T are covered. Then, what remains is to optimally cover the vertices in V_X ∪ V_Y ∪ V_Z; the cheapest solution is to use m vertices from V_T, and this is precisely the 3DM problem, which is NP-hard. Hence, any solution of size d = m + 1 for the PDS-PP instance gives a solution to the 3DM instance. We also observe that such a solution makes a perfect dominating set: each vertex in V_T \ D has one neighbor in D, namely v, and each vertex in V_X ∪ V_Y ∪ V_Z has a unique neighbor in D, namely the vertex in V_T corresponding to its respective set in the 3DM instance.

B Average linkage for min-sum stability

Here, we further support the claim that algorithms designed for α-perturbation resilient instances with respect to the min-sum objective can often be made to work for data satisfying the more general α-min-sum stability property.

Algorithm 1: min-sum, α perturbation resilience
Input: Data set S, distance function d(·,·) on S, min_i |C_i|.
Phase 1:
• Connect each point with its (1/2) min_i |C_i| nearest neighbors.
• Initialize the clustering C′ with each connected component being a cluster.
• Repeat till only one cluster remains in C′: merge the clusters C, C′ in C′ which minimize d_avg(C, C′).
• Let T be the tree with components as leaves and internal nodes corresponding to the merges performed.
Phase 2: Apply dynamic programming on T to get the minimum min-sum cost pruning C̃.
Output: Output C̃.

One such algorithm is Algorithm 1. Balcan and Liang [9] proved that Algorithm 1 correctly clusters instances for which the condition in Lemma 20 holds. We can prove that the condition indeed holds for α-min-sum stable instances (their proof of the lemma holds for the more restricted class of perturbation-resilient instances). To state the lemma, we first define the distance between two point sets, A and B:

    d(A, B) := Σ_{p ∈ A} Σ_{q ∈ B} d(p, q).

Lemma 20. Assume the optimal clustering C is α-min-sum stable. For any two different clusters C and C′ in C and every A ⊂ C, with Ā = C \ A,

    α d(A, Ā) < d(A, C′).

Proof. From the definition of d(A, Ā), we have

    α d(A, Ā) = α Σ_{p ∈ A} Σ_{q ∈ Ā} d(p, q) ≤ α Σ_{p ∈ A} Σ_{q ∈ C} d(p, q) = Σ_{p ∈ A} α Σ_{q ∈ C} d(p, q) < Σ_{p ∈ A} Σ_{q ∈ C′} d(p, q) = d(A, C′).

The first inequality comes from Ā ⊂ C, and the second from the definition of min-sum stability. This, in addition to Lemma 6, can be used to show their algorithm can be employed for min-sum stable instances.
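To make the merge phase concrete, here is a minimal Python sketch (our own illustration, not the authors' or Balcan and Liang's code) of the set distance d(A, B), the average linkage d_avg, and the greedy merge loop of Phase 1; the nearest-neighbor preprocessing and the Phase 2 dynamic program are omitted, and the toy data is hypothetical.

```python
# Sketch of the average-linkage merge phase: repeatedly merge the pair of
# clusters minimizing d_avg, recording the merge tree Phase 2 would prune.

def d_set(A, B, d):
    """Set distance d(A, B) = sum of d(p, q) over p in A, q in B."""
    return sum(d(p, q) for p in A for q in B)

def d_avg(A, B, d):
    """Average linkage: d(A, B) / (|A| |B|)."""
    return d_set(A, B, d) / (len(A) * len(B))

def average_linkage(clusters, d):
    """Merge until one cluster remains; return the sequence of merges."""
    clusters = [tuple(C) for C in clusters]
    merges = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest average distance.
        i, j = min(
            ((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
            key=lambda ab: d_avg(clusters[ab[0]], clusters[ab[1]], d),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [C for k, C in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Toy 1-d data: two well-separated groups; the final merge joins them.
d = lambda x, y: abs(x - y)
merges = average_linkage([(0.0,), (0.1,), (0.9,), (1.0,)], d)
assert set(merges[-1][0]) | set(merges[-1][1]) == {0.0, 0.1, 0.9, 1.0}
```

Starting from singletons for simplicity (rather than the connected components of Phase 1), the two nearby pairs merge first and the last merge joins the two groups, as the final assertion checks.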