Performance Analysis of AIM-K-means & K-means in Quality Cluster Generation
Authors: Samarjeet Borah, Mrinal Kanti Ghose
JOURNAL OF COMPUTING, VOLUME 1, ISSUE 1, DECEMBER 2009, ISSN: 2151-9617
HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/

Abstract: Among all the partition-based clustering algorithms, K-means is the most popular and well-known method. It generally shows impressive results even on considerably large data sets, and its computational complexity does not suffer from the size of the data set. The main disadvantage of this clustering method is the selection of the initial means: if the user does not have adequate knowledge of the data set, the choice may lead to erroneous results. The algorithm Automatic Initialization of Means (AIM), an extension to K-means, has been proposed to overcome the problem of initial mean generation. In this paper an attempt has been made to compare the performance of the two algorithms through implementation.

Index Terms: Cluster, Distance Measure, K-means, Centroid, Average Distance, Mean

1 INTRODUCTION

Clustering [2][3][4] is a type of unsupervised learning in which a set of elements is separated into homogeneous groups. It seeks to discover groups, or clusters, of similar objects. Generally, patterns within a valid cluster are more similar to each other than to patterns belonging to different clusters. The similarity between objects is often determined using distance measures over the various dimensions of the dataset. The variety of techniques for representing data, measuring similarity between data elements, and grouping data elements has produced a rich and often confusing assortment of clustering methods.
Clustering is useful in several exploratory pattern-analysis, grouping, decision-making, and machine-learning situations, including data mining, document retrieval, image segmentation, and pattern classification [5][3].

2 PARTITION BASED CLUSTERING METHODS

Partition-based clustering methods create the clusters in one step. Only one set of clusters is created, although several different sets of clusters may be created internally within the various algorithms. Since only one set of clusters is output, the user must input the desired number of clusters. Given a database of n objects, a partition-based [5] clustering algorithm constructs k partitions of the data so that an objective function is optimized. In these clustering methods some metric or criterion function is used to determine the goodness of any proposed solution. This measure of quality could be the average distance between clusters or some other metric. One common measure of this kind is the squared error metric, which measures the squared distance from each point to the centroid of the associated cluster. Partition-based clustering algorithms try to locally improve a certain criterion. The majority of them can be considered greedy algorithms, i.e., algorithms that at each step choose the best solution, which may not lead to an optimal result in the end. The best solution at each step is the placement of a certain object in the cluster whose representative point is nearest to the object. This family of clustering algorithms includes the first ones that appeared in the data mining community. The most commonly used are K-means [JD88, KR90][6], PAM (Partitioning Around Medoids) [KR90], CLARA (Clustering LARge Applications) [KR90] and CLARANS (Clustering LARge ApplicatioNS) [NH94]. All of them are applicable to data sets with numerical attributes.
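The squared error measure described here can be computed as below (a minimal sketch; the points, centroids and labels are invented for illustration):

```python
import numpy as np

def sse(points, centroids, labels):
    """Sum of squared Euclidean distances from each point
    to the centroid of its assigned cluster."""
    diffs = points - centroids[labels]   # per-point offset from its own centroid
    return float(np.sum(diffs ** 2))

# Tiny invented example: two clusters on a line.
points = np.array([[0.0], [1.0], [9.0], [10.0]])
centroids = np.array([[0.5], [9.5]])
labels = np.array([0, 0, 1, 1])
print(sse(points, centroids, labels))  # each point is 0.5 away -> 4 * 0.25 = 1.0
```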
2.1 K-means Algorithm

K-means [7] is a prototype-based, simple partitional clustering technique which attempts to find a user-specified number K of clusters. These clusters are represented by their centroids, where a cluster centroid is typically the mean of the points in the cluster. It is a simple iterative clustering algorithm: easy to implement and run, relatively fast, easy to adapt, and common in practice. It is historically one of the most important algorithms in data mining. The general procedure was introduced by Cox (1957), and Ball and Hall (1967) and MacQueen (1967) [6] first named it K-means. Since then it has become widely popular and is classified as a partitional or non-hierarchical clustering method (Jain and Dubes, 1988). It has a number of variations [8][11]. The K-means algorithm works as follows:

a. Select initial centres of the K clusters. Repeat steps b and c until the cluster membership stabilizes.
b. Generate a new partition by assigning each data point to its closest cluster centre.
c. Compute new cluster centres as the centroids of the clusters.

The algorithm can be briefly described as follows. Let us consider a dataset D having n data points x1, x2, …, xn. The problem is to find minimum-variance clusters from the dataset.

• Samarjeet Borah is with the Department of Computer Science & Engineering, Sikkim Manipal Institute of Technology, Majitar, Rangpo, East Sikkim-737132.
• Mrinal Kanti Ghose is with the Department of Computer Science & Engineering as Professor & HOD, Sikkim Manipal Institute of Technology, Majitar, Rangpo, East Sikkim-737132.
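Steps a through c above can be sketched as follows (a minimal NumPy sketch with invented toy data, not the authors' C implementation):

```python
import numpy as np

def kmeans(points, k, initial_centroids, max_iter=100):
    """Minimal sketch of steps a-c: assign each point to its nearest
    centroid, recompute centroids, repeat until membership stabilizes."""
    centroids = initial_centroids.astype(float).copy()
    labels = np.full(len(points), -1)          # no assignments yet
    for _ in range(max_iter):
        # Step b: assign each data point to its closest cluster centre.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                              # cluster membership stabilized
        labels = new_labels
        # Step c: compute new cluster centres as centroids of the clusters.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Invented 1-D toy data with two obvious groups.
pts = np.array([[0.0], [1.0], [9.0], [10.0]])
c, lab = kmeans(pts, 2, np.array([[0.0], [10.0]]))
print(c.ravel(), lab)  # centroids settle near 0.5 and 9.5
```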
The objects have to be grouped into k clusters by finding k points {m_j} (j = 1, 2, …, k) in D such that

    E = (1/n) Σ_{i=1}^{n} min_{j=1,…,k} d(x_i, m_j)²        (1)

is minimized, where d(x_i, m_j) denotes the Euclidean distance between x_i and m_j. The points {m_j} (j = 1, 2, …, k) are known as cluster centroids. The problem in Eq. (1) is to find k cluster centroids such that the average squared Euclidean distance (mean squared error) between a data point and its nearest cluster centroid is minimized.

The K-means algorithm provides an easy way to implement an approximate solution to Eq. (1). The reasons for the popularity of K-means are ease and simplicity of implementation, scalability, speed of convergence and adaptability to sparse data. The K-means algorithm can be thought of as a gradient descent procedure, which begins at the starting cluster centroids and iteratively updates them to decrease the objective function in Eq. (1). K-means always converges, but only to a local minimum, and the particular local minimum found depends on the starting cluster centroids. Before the algorithm converges, distance and centroid calculations are performed while the loops execute a number of times, say l, where the positive integer l is known as the number of K-means iterations. The precise value of l varies depending on the initial starting cluster centroids, even on the same dataset. The computational time complexity of the algorithm is therefore O(nkl), where n is the total number of objects in the dataset, k is the required number of clusters, and l is the number of iterations (k ≤ n, l ≤ n).

The K-means clustering algorithm also faces a number of drawbacks. When the data points are not numerous, the initial grouping determines the clusters significantly. Moreover, the number of clusters, K, must be determined beforehand, and the algorithm is sensitive to the initial conditions.
Different initial conditions may produce different clustering results, and the algorithm may be trapped in a local optimum. A further weakness is that the arithmetic mean is not robust to outliers: data very far from a centroid may pull it away from its true position. The resulting clusters are also roughly circular in shape, because membership is based purely on distance.

The major problem faced during K-means clustering is the efficient selection of means. It is quite difficult to predict the number of clusters k in advance, and k varies from user to user; as a result, the clusters formed may not be up to the mark. Finding out exactly how many clusters should be formed is a difficult task, and to do it efficiently the user must have detailed knowledge of both the domain and the source data.

2.2 Automatic Initialization of Means

The Automatic Initialization of Means (AIM) algorithm [12] has been proposed to make K-means more efficient. The algorithm is able to detect the total number of clusters automatically, and it also makes the selection of the initial set of means automatic. AIM applies a simple statistical process which selects the set of initial means based on the dataset. The output of this algorithm can then be supplied to the K-means algorithm as part of its input.

2.2.1 Background

In probability theory and statistics, the Gaussian distribution is a continuous probability distribution that describes data clustering around a mean or average. Assuming a Gaussian distribution, the interval μ ± 1σ contains approximately 68% of the population, so significant values concentrate around the cluster mean μ; points beyond this interval may have a tendency to belong to other clusters. We could have taken μ ± 2σ instead of μ ± 1σ, but the problem with μ ± 2σ is that it covers about 95% of the population and may therefore lead to improper clustering.
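The population fractions quoted above are easy to check empirically (a quick simulation; the sample size and seed are arbitrary):

```python
import numpy as np

# Draw a large sample from a standard Gaussian and measure how much of
# it falls within one and two standard deviations of the mean.
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=100_000)
within_1s = np.mean(np.abs(sample) <= 1.0)
within_2s = np.mean(np.abs(sample) <= 2.0)
print(round(within_1s, 2), round(within_2s, 2))  # approximately 0.68 and 0.95
```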
Some points that are not really relevant to the cluster may then also be included in it.

2.2.2 Description

Let us assume a dataset D = {x_i, i = 1, 2, …, N} which consists of N data objects x1, x2, …, xN, where each object has M attribute values corresponding to M different attributes. The value of the i-th object can be given by:

    D_i = {x_i1, x_i2, …, x_iM}

Note that the relation x_i = x_k does not mean that x_i and x_k are the same objects in the real-world database; it means that the two objects have equal values for the attribute set A = {a_1, a_2, a_3, …, a_M}. The main objective of the algorithm is to find the value of k automatically prior to partitioning the dataset into k disjoint subsets. For distance calculation, the sum-of-squared-Euclidean-distance measure is used. The algorithm aims at minimizing the average square error criterion, which is a good measure of the within-cluster variation across all the partitions; thus it tries to make the k clusters as compact and separated as possible.

Let us assume a set of means M = {m_j, j = 1, 2, …, K} consisting of the initial means generated by the algorithm from the dataset. Based on these initial means the dataset will be grouped into K clusters; let the set of clusters be C = {c_j, j = 1, 2, …, K}. In the next phase the means have to be updated. In the algorithm the distance threshold has been taken as:

    dx = μ ± 1σ        (2)

where μ and σ denote the mean and the standard deviation of the object distances computed over the dataset.

Before the search for initial means, the original dataset D is copied to a temporary dataset T. This dataset is used only in the initial-means generation process. The algorithm repeats for n iterations (where n is the number of objects in the dataset).
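Eq. (2) can be sketched as follows. Since the explicit formulas for μ and σ were not preserved in the text, this sketch assumes they are the mean and standard deviation of the pairwise Euclidean distances among the objects in T:

```python
import numpy as np

def distance_threshold(T):
    """Assumed reading of Eq. (2): mu and sigma are the mean and standard
    deviation of the pairwise Euclidean distances among the objects in T."""
    diffs = T[:, None, :] - T[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    pair = dists[np.triu_indices(len(T), k=1)]   # each pair counted once
    mu, sigma = pair.mean(), pair.std()
    return mu - sigma, mu + sigma                # the interval mu +/- 1 sigma

# Invented toy data: three near points and one distant point.
T = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
lo, hi = distance_threshold(T)
print(round(lo, 3), round(hi, 3))
```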
The algorithm selects the first mean of the initial mean set randomly from the dataset; the object selected as a mean is then removed from the temporary dataset. The procedure Distance_Threshold computes the distance threshold as given in Eq. (2). Whenever a new object is considered as a candidate cluster mean, its average distance from the existing means is calculated as:

    Average_Distance = (1/m) Σ_{i=1}^{m} d(m_c, m_i)        (3)

where M is the set of initial means, i = 1, 2, …, m, m ≤ n, and m_c is the candidate for the new cluster mean. If the candidate satisfies the distance threshold, it is accepted as a new mean and removed from the temporary dataset. The algorithm is as follows:

Input:  D = {x1, x2, …, xn}        // set of objects
Output: K                          // total number of clusters to be generated
        M = {m1, m2, …, mk}        // the set of initial means

Algorithm:
    Copy D to a temporary dataset T
    Calculate Distance_Threshold on T
    Arbitrarily select x_i as m1
    Insert m1 into M
    Remove x_i from T
    For i = 1 to n - k do                  // check for the next mean
        Arbitrarily select x_i as m_c
        Set L = 0
        For j = 1 to k do                  // calculate the average distance
            L = L + Distance(m_c, M[j])
        End
        Average_Distance = L / k
        If Average_Distance ≥ Distance_Threshold then
            Remove x_i from T
            Insert m_c into M
            k = k + 1
    End

3 PERFORMANCE ANALYSIS

AIM is simply an extension of K-means that provides the number of clusters to be generated by the K-means algorithm, along with the initial set of means. It was therefore decided to make a comparative analysis of the clustering quality of AIM-K-means against conventional K-means.
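The pseudocode above can be sketched in Python as follows. This is a hedged reconstruction rather than the authors' implementation: reading the distance threshold as mean plus one standard deviation of the pairwise distances is an assumption, and candidates are scanned in dataset order instead of being picked arbitrarily.

```python
import numpy as np

def aim(D, rng=None):
    """Sketch of AIM: accept an object as a new initial mean when its
    average distance to the already-chosen means (Eq. 3) reaches the
    distance threshold (Eq. 2)."""
    rng = rng or np.random.default_rng(0)
    # Distance_Threshold on T (assumed: mu + 1*sigma of pairwise distances).
    diffs = D[:, None, :] - D[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    pair = dists[np.triu_indices(len(D), k=1)]
    threshold = pair.mean() + pair.std()

    T = list(range(len(D)))                    # temporary dataset (indices)
    first = T.pop(int(rng.integers(len(T))))   # arbitrary first mean
    means = [D[first]]
    for idx in list(T):                        # check each remaining object
        candidate = D[idx]
        avg = np.mean([np.linalg.norm(candidate - m) for m in means])
        if avg >= threshold:                   # Eq. (3) against the threshold
            means.append(candidate)
            T.remove(idx)                      # k grows with len(means)
    return len(means), np.array(means)

# Invented toy data: a tight group of six points plus one far point.
D = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
              [0.5, 0.5], [1.0, 0.5], [20.0, 0.0]])
k, M = aim(D)
print(k)  # one initial mean per well-separated group -> k == 2 here
```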
The main difference between the two algorithms is that with AIM-K-means it is not necessary to provide the number of clusters in advance, whereas for K-means the user has to provide it.

In this evaluation process three datasets were used, fed to the algorithms in increasing order of their size. The programs were developed in C; to test the algorithms thoroughly, separate programs were developed for AIM, AIM-K-means and conventional K-means. In the first phase the datasets were fed to K-means with a user-supplied value of k. Then AIM-K-means was applied to the same datasets, where the value of k is provided internally by AIM. Lastly, K-means was again applied to the same datasets with the value of k as given by AIM-K-means. The results reveal that:

1. There is a difference in performance between the K-means and AIM-K-means algorithms.
2. The difference reduces when the K-means algorithm is used with the value of k given by AIM-K-means.

[Figure 1: Comparison based on average SSE]

The above comparison was made on the basis of the average sum of squared error (SSE). The study found that AIM-K-means shows an improvement in average SSE, essentially because of the initial set of cluster means provided to the algorithm. In the K-means runs the value of k was provided based on the output of AIM, but it is not possible to supply an initial set of cluster means to standard K-means.

4 CONCLUSION

The most attractive property of the K-means algorithm in data mining is its efficiency in clustering large data sets. Its main disadvantage is that the number of clusters has to be provided by the user. The algorithm AIM, an extension of K-means, can be used to enhance this efficiency by automating the selection of the initial means.
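The evaluation protocol can be illustrated with a small harness. The paper's programs were written in C; this Python sketch uses invented blob data, and `init_good` merely stands in for the AIM-generated initial means:

```python
import numpy as np

def kmeans(X, k, init, iters=50):
    """Plain K-means returning the final centroids, labels and SSE."""
    c = init.astype(float).copy()
    for _ in range(iters):
        lab = np.linalg.norm(X[:, None] - c[None], axis=2).argmin(axis=1)
        for j in range(k):
            if (lab == j).any():
                c[j] = X[lab == j].mean(axis=0)
    sse = float(((X - c[lab]) ** 2).sum())
    return c, lab, sse

rng = np.random.default_rng(1)
# Invented dataset: three well-separated Gaussian blobs.
X = np.vstack([rng.normal(m, 0.3, size=(30, 2)) for m in ([0, 0], [5, 0], [0, 5])])

# Phase 1: conventional K-means with user-supplied k and random initial means.
k_user = 3
init_random = X[rng.choice(len(X), k_user, replace=False)]
_, _, sse_random = kmeans(X, k_user, init_random)

# Phase 2 stand-in: K-means seeded with deliberately good means,
# playing the role of the AIM-generated initial means.
init_good = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
_, _, sse_aim = kmeans(X, k_user, init_good)

print(sse_random, sse_aim)  # the seeded run should do at least as well
```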
From the experiments it has been found that AIM can improve the cluster generation process of the K-means algorithm without diminishing the clustering quality in most cases. The basic idea of AIM is to keep the simplicity and scalability of K-means while achieving automaticity.

ACKNOWLEDGMENT

This work has been carried out as part of a Research Promotion Scheme (RPS) project funded by the All India Council for Technical Education, Government of India, vide sanction order 8023/BOR/RID/RPS-217/2007-08.

REFERENCES

[1] Sheng-Yi Jiang, "Efficient Classification Method for Large Dataset," School of Informatics, Guangdong Univ. of Foreign Studies, Guangzhou.
[2] A. Hinneburg and D. A. Keim, "Clustering Techniques for Large Data Sets: From the Past to the Future."
[3] A. K. Jain (Michigan State University), M. N. Murty (Indian Institute of Science) and P. J. Flynn (The Ohio State University), "Data Clustering: A Review."
[4] L. Perez, "Data Clustering."
[5] R. Ali, U. Ghani and A. Saeed, "Data Clustering and Its Applications."
[6] J. MacQueen, "Some Methods for Classification and Analysis of Multivariate Observations," Proc. 5th Berkeley Symp. Math. Statist. Prob., 1:281-297, 1967.
[7] D. K. Dey (joint work with S. Ghosh, IUPUI, Indianapolis), "K-means Clustering: A Novel Probabilistic Modeling with Applications," Department of Statistics, University of Connecticut.
[8] S. Garg and R. C. Jain, "Variations of K-means Algorithm: A Study for High-Dimensional Large Data Sets," Dept. of Computer Engineering, A. D. Patel Institute of Technology, India.
[9] J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.
[10] M. R. Anderberg, Cluster Analysis for Applications. Academic Press, 1973.
[11] Z. Huang, "Extensions to the K-means Algorithm for Clustering Large Data Sets with Categorical Values," ACSys CRC, CSIRO Mathematical and Information Sciences, GPO Box 664, Canberra, ACT 2601, Australia.
[12] S. Borah and M. K. Ghose, "Automatic Initialization of Means (AIM): A Proposed Extension to the K-means Algorithm," accepted in International Journal of Information Technology and Knowledge Management (IJITKM).

Samarjeet Borah obtained his M.Tech. degree in Information Technology from Tezpur University, India, in 2006. His major field of study is data mining. He is a faculty member in the Department of Computer Science & Engineering at Sikkim Manipal Institute of Technology, Sikkim, India, and is the principal investigator in a research project sponsored by the Government of India. To date he has published ten papers in various conferences and journals. Borah is a member of the Computer Society of India, the International Association of Engineers, Hong Kong, and the International Association of Computer Science and Information Technology, Singapore. He has received an award for excellence in research initiatives from Sikkim Manipal University of Health, Medical & Technological Sciences.

Dr. Mrinal Kanti Ghose obtained his Ph.D. from Dibrugarh University, Assam, India, in 1981. He is currently working as Professor and Head of the Department of Computer Science & Engineering at Sikkim Manipal Institute of Technology, Majitar, Sikkim, India. Prior to this, Dr. Ghose worked in the internationally reputed R&D organisation ISRO: during 1981 to 1994 at Vikram Sarabhai Space Centre, ISRO, Trivandrum, in the areas of mission simulation and quality & reliability analysis of ISRO launch vehicles and satellite systems, and during 1995 to 2006 at the Regional Remote Sensing Service Centre, ISRO, IIT Campus, Kharagpur (WB), India, in the areas of RS & GIS techniques for natural resources management. Dr. Ghose has conducted a number of seminars, workshops and training programmes in the above areas and has published around 35 technical papers in various national and international journals, in addition to the presentation/publication of 125 research papers in international/national conferences. He has guided many M.Tech. and Ph.D. projects and extended consultancy services to many reputed institutes of the country. Dr. Ghose is a Life Member of the Indian Association for Productivity, Quality & Reliability, Kolkata; the National Institute of Quality & Reliability, Trivandrum; the Society for R&D Managers of India, Trivandrum; and the Indian Remote Sensing Society, IIRS, Dehradun.