Performance Analysis of AIM-K-means & K-means in Quality Cluster Generation


Authors: Samarjeet Borah, Mrinal Kanti Ghose

JOURNAL OF COMPUTING, VOLUME 1, ISSUE 1, DECEMBER 2009, ISSN: 2151-9617
HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/

Abstract: Among all the partition-based clustering algorithms, K-means is the most popular and well-known method. It generally shows impressive results even on considerably large data sets, and its computational complexity does not suffer from the size of the data set. The main disadvantage of this clustering method is the selection of the initial means: if the user does not have adequate knowledge of the data set, the choice may lead to erroneous results. The algorithm Automatic Initialization of Means (AIM), which is an extension to K-means, has been proposed to overcome the problem of initial mean generation. In this paper an attempt has been made to compare the performance of the algorithms through implementation.

Index Terms: Cluster, Distance Measure, K-means, Centroid, Average Distance, Mean

1 INTRODUCTION

Clustering [2][3][4] is a type of unsupervised learning method in which a set of elements is separated into homogeneous groups. It seeks to discover groups, or clusters, of similar objects. Generally, patterns within a valid cluster are more similar to each other than to a pattern belonging to a different cluster. The similarity between objects is often determined using distance measures over the various dimensions in the dataset. The variety of techniques for representing data, measuring similarity between data elements, and grouping data elements has produced a rich and often confusing assortment of clustering methods.
 Clustering  is  useful  in  several  ex ‐ ploratory  pattern ‐ analysis,  grouping,  dec isi on ‐ making,  and  machine ‐ learning  situations,  including  data  mining,  document  retriev al,  image  segmentation,  and  pattern  classification  [5][3].  2 P ARTITION B ASED C LUSTERING M ETHODS P artition  based  clustering  methods  create  the  clusters  in  one  step.  Only  one  set  of  clusters  is  created,  although  sev ‐ eral  different  sets  of  cl usters  ma y  be  created  internally  within  the  va r i o u s  algorithms.  Since  only  one  set  of  clus ‐ ters  is  output,  the  users  mus t  input  the  desired  number  of  clusters.  Given  a  database  of  n  objects,  a  partition  based  [5]  clustering  algorithm  constructs  k  partitions  of  the  da ‐ ta,  so  that  an  objective  function  is  optimized.  In  these  clustering  met hods  some  metric  or  criterion  function  is  used  to  determine  the  goodness  of  any  proposed  solution.  This  meas ure  of  quality  could  be  average  distance  be ‐ tween  clusters  or  some  other  metric.  One  common  mea s ‐ ure  of  such  kind  is  the  squired  error  metr ic,  which  meas ‐ ures  the  squired  distance  from  each  point  to  the  centro id  for  the  associated  cluster .  Pa rtition  based  clustering  algo ‐ rithms  try  to  locally  improv e  a  certain  criterion.  The  ma ‐ jority  of  them  could  be  considered  as  greedy  algorithms,  i.e.,  algorithms  that  at  each  step  choose  the  best  solution  and  may  not  lead  to  optimal  results  in  the  end.  The  best  solution  at  each  step  is  the  placement  of  a  certain  object  in  the  cluster  for  which  the  representative  point  is  nearest  to  the  object. 
 This  family  of  clustering  algorithms  incl udes  the  first  ones  that  appeared  in  the  Data  Mining  Commu ‐ nity .  The  most  commonly  used  are  K ‐ means  [J D 8 8 ,  KR90][6],  PA M  (Partitioning  Around  Medoids)  [KR90],  CLARA  (Clustering  LARge  Applications)  [KR90]  and  CLARANS  (Clustering  LARge  ApplicatioNS  )  [NH94].  All  ofthem  are  ap plicable  to  data  sets  with  numerica l  attrib ‐ utes.  2.1 K-means Algorithm K ‐ means  [7]  is  a  prototype ‐ based,  simple  p artitional  clus ‐ tering  technique  which  attempts  to  find  a  user ‐ specified  K  number  of  clusters.  These  clusters  are  represented  by  their  centroids.  A  cluster  centroid  is  typically  the  mean  of  the  points  in  the  cluster .  This  algorithm  is  a  simple  itera ‐ tive  clustering  algorithm.  The  algorithm  is  simple  to  im ‐ plement  and  run,  relatively  fast,  easy  to  adapt,  and  com ‐ mon  in  practice.  It  is  historically  one  of  the  most  impor ‐ tant  algorithms  in  data  mining.  The  general  algorithm  was  introduced  by  Cox  (1957),  and  (Ball  and  Hall,  1967;  MacQueen,  1967)  [6]  first  named  it  K ‐ means.  Since  then  it  has  become  widely  popular  and  is  classified  as  a  parti ‐ tional  or  non ‐ hierarchical  clustering  method  ( Jain  and  Dubes,  1988).  It  has  a  number  of  va r i a t i o n s  [8][11].  The  K ‐ means  algorithm  wo r ks  as  follows:  a. Select  initial  centres  of  the  K  clusters.  Repeat  steps  b  through  c  until  the  cluster  me mbership  stabilizes.  b. Generate  a  new  partition  by  assigning  each  data  to  its  closest  cluster  centres.  c. Compute  new  cluster  centres  as  the  centroids  of  the  clusters. 
  The  algorithm  can  be  briefly  described  as  follows:  Let  us  conside r  a  dataset  D  having  n  data  points  x 1 ,  x 2 …  • Samar j eet Borah is with the De p artment o f Com p uter Science & En g ineer- ing, Sikkim Manipal Institute of Technology, Majitar, Rangpo, East Sik - kim-737132. • Mrinal Kanti Ghose is with the Depa rtment of Compute r Science & Engi- neering as Professor & HOD, Sikkim Manipal Insti tute of Technology, Majitar, Rangpo, East Sikkim-737132. JOURNAL OF COMPUTING, VOLUME 1, ISSUE 1, DECEMBER 2009, ISSN: 2151-9617 HTTPS://SITES.GOOGLE.COM/S ITE/JOURNALOFCOMPUTING/ 176 x n .  The  proble m  is  to  find  mini mum  va r i an c e  clusters  from  the  dataset.  The  objects  have  to  be  grouped  into  k  clusters  finding  k  points  { m j }( j =1,  2,  …,  k )  in  D  such  that   (1)   is  minimized,  where  d ( x i ,  m j )  denotes  the  Euclidean  dis ‐ tance  between  x i  and  m j .  The  points  {  m j  }  ( j =1,  2,  …,  k )  are  known  as  cluster  centroids.  The  problem  in  Eq.(1)  is  to  find  k  cluster  centroids,  such  that  the  average  squared  Euclidean  distance  (mean  squared  error)  between  a  data  point  and  its  nearest  cluster  centroid  is  minimized.   The  K ‐ means  algorithm  provides  an  easy  method  to  im ‐ plement  approximate  sol ution  to  Eq.(1).  The  reasons  for  the  popularity  of  K ‐ means  are  ease  and  simplicity  of  im ‐ plementation,  scalability ,  speed  of  convergence  and  adap ‐ tability  to  sparse  data.  The  K ‐ means  algorith m  can  be  thought  of  as  a  gradient  descent  procedure,  which  begins  at  starting  cluster  centroids,  and  iteratively  updates  these  centroids  to  decrease  the  objective  function  in  Eq.(1). 
 The  K ‐ means  always  conv erge  to  a  local  mini mum.  The  par ‐ ticular  local  minimum  found  depends  on  the  starting  cluster  centroids.  The  K ‐ means  algorithm  updates  cluster  centroids  till  local  mi nimum  is  found.  Before  the  K ‐ means  algorithm  con verges,  distance  and  cent roid  calculations  are  done  while  loops  are  executed  a  number  of  times,  say  l ,  where  the  positive  integer  l  is  known  as  the  number  of  K ‐ means  iterat ions.  The  prec ise  va l u e  of  l  var i e s  depend ‐ ing  on  the  initial  starting  cluster  centroids  even  on  the  same  dataset.  So  the  c omputational  time  complexity  of  the  algorithm  is  O ( nkl ),  where  n  is  the  total  number  of  objects  in  the  dataset,  k  is  the  required  number  of  clusters  we  identified  and  l  is  the  number  of  iterations,  k ≤ n ,  l ≤ n .  K ‐ mean  cl ustering  algorith m  is  also  facing  a  numb er  of  drawbacks.  When  the  numbers  of  data  are  not  so  many ,  initial  grouping  will  determine  the  cluster  significantly .  Again  the  numbe r  of  cluster,  K,  must  be  determined  be ‐ fore  hand.  It  is  sensitive  to  the  initial  condition.  Different  initial  conditions  may  produce  different  results  of  cluster .  The  algorithm  may  be  trapped  in  the  local  optimum.  W eakness  of  arithmetic  mean  is  not  robust  to  outliers.  Ve r y  far  data  from  the  centroid  may  pull  the  centroid  away  from  the  real  one.  Here  the  result  is  of  circular  clus ‐ ter  shaped  because  based  on  distance.   The  major  problem  faced  during  K ‐ means  clustering  is  the  efficient  selection  of  means. 
 It  is  quite  difficult  to  pre ‐ dict  the  number  of  clusters  k  in  pri or .  The  k  va r i e s  from  user  to  user .  As  a  result,  the  clusters  formed  may  not  be  up  to  mark.  The  finding  out  of  exactly  how  many  cl usters  will  have  to  be  formed  is  a  quite  difficult  task.  To  perform  it  efficiently  the  user  must  have  detailed  knowledge  of  the  domain.  Agai n  the  de tail  knowle dge  of  the  source  data  is  also  required.   2.2 Automatic Initialization of Means The  Automat ic  Initialization  of  Means  (AIM)  [12]  has  been  proposed  to  make  the  K ‐ means  algorithm  a  bit  more  efficient.  The  algorithm  is  able  to  detect  the  number  of  total  number  of  clusters  automatically .  This  alg orithm  also  has  made  the  selection  process  of  the  initial  set  of  means  automatic.  AIM  applies  a  simple  statistical  process  which  selects  the  set  of  initial  means  automatically  based  on  the  dataset.  The  output  of  this  alg orithm  can  be  ap ‐ plied  to  the  K ‐ means  algorithm  as  one  of  the  inputs.   2.2.1 Background In  probability  theory  and  statistics,  the  Gaussian  distribu ‐ tion  is  a  continuo us  probabi lity  distribut ion  that  describes  data  that  clusters  around  a  mean  or  average.  Assuming  Gaussian  distribution  it  is  known  that  μ ±1 σ  contain  67.5%  of  the  population  and  thus  significant  val ue s  concentra te  around  the  cluster  mean μ .  Po i n t s  beyond  this  may  have  tendency  of  belonging  to  other  clusters. 
 We  could  have  taken  μ ±2 σ  instead  of μ ±1 σ ,  but  problem  with  μ ±2 σ  is  that  it  will  cover  about  95%  of  the  population  and  as  a  result  it  may  lead  to  improper  clustering.  Some  points  that  are  not  so  relevant  to  the  cluster  may  also  be  in cluded  in  the  clu s ‐ ter .   2.2.2 Description Let  us  assume  that  data  set  D  as  {x i,  i=1,  2…  N}  which  con ‐ sists  of  N  data  obje cts  x 1 ,  x 2 ,  …,  x N . ,  where  each  object  has  M  different  attribute  val ue s  corresponding  to  the  M  dif ‐ ferent  attributes.  The  va l u e  of  i ‐ th  object  ca n  be  given  by:  D i ={x i1 ,  x i2, …,x iM }  Again  let  us  assume  that  the  relation  x i =x k  does  not  mean  that  x i  and  x k  are  the  same  objects  in  the  real  wor l d  data ‐ base.  It  mean s  that  the  two  objects  has  equal  va l u e s  for  the  attribute  set  A={a l ,  a 2 ,  a 3 ,  …,  a m }.  The  main  objective  of  the  algorithm  is  to  find  out  the  va l u e  k  automatically  in  prior  to  partit ion  the  dataset  into  k  disjoint  subsets.  For  distance  calculation  the  distance  measure  sum  of  square  Euclidian  distance  is  used  in  this  algorithm.  It  aims  at  minimizing  the  average  square  error  criterion  which  is  a  good  measure  of  the  within  cluster  v ariation  ac ross  all  the  partitions.  Thus  the  average  square  error  criterion  tries  to  make  the  k ‐ clusters  as  compac t  and  separated  as  possible.   Let  us  assume  a  set  of  me ans  M={m j ,  j=1,  2,  …,  K}  which  consists  of  initial  set  of  means  that  has  been  generated  by  the  algorithm  based  on  the  dataset. 
 Based  on  these  initial  means  the  dataset  will  be  grouped  into  K  clusters.  Let  us  assume  the  set  of  clusters  as  C=(c j ,  j=1,2,…,M} .  In  the  next  phase  the  means  has  to  be  updated.  In  the  algorithm  the  distance  threshold  has  been  taken  as:  dx  = μ ±1 σ   (2)  where μ =   and σ =   Before  the  searching  of  initial  means  the  origina l  dataset  D  will  be  copied  to  a  temporary  dataset  T .  This  dataset  will  be  used  only  in  initial  set  of  means  gener ation  proc ‐ ess.  The  algori thm  will  be  repeated  for  n  times  (where  n  is  the  number  of  objects  in  the  dataset).  The  algorithm  will  JOURNAL OF COMPUTING, VOLUME 1, ISSUE 1, DECEMBER 2009, ISSN : 2151-9617 HTTPS://SITES.GOOGLE.COM/S ITE/JOURNALOFCOMPUTING/ 177 select  the  first  mean  of  the  in itial  mean  set  randomly  fro m  the  dataset.  The n  the  object  selected  as  me an  will  be  re ‐ moved  from  the  temporary  dataset.  The  procedure  Dis ‐ tance_Threshold  will  compute  the  distance  threshold  as  given  in  eq.  2.  Whenever  a  new  object  is  considered  as  the  candidate  for  a  cluster  mea n,  its  average  distance  with  existing  me ans  will  be  calculated  as  given  in  the  equation  below .  A verage_Distance  (3)  where  M  is  the  set  of  initial  means,  i=1,  2,  …,  m  and  m ≤ n ,  m c  is  the  candidate  for  new  clu ster  mean.   If  it  satisfies  the  distance  threshold  then  it  will  considered  as  new  mean  and  will  be  removed  from  the  temporary  dataset. 
 The  algorithm  is  as  follows:   Input:  D  =  {x 1 ,  x 2 ,  …,  x n }  //set  of  objects  Output:   K  //  To t a l  number  of  clust ers  to  be  generated   M  =  {m 1 ,  m 2 ,  …,  m k }  //  The  set  of  initial  means  Algorithm:   Copy  D  to  a  temporar y  dataset  T   Calculate  Distance_Threshold  on  T   Arbitrarily  select  x i  as  m 1   Insert  m 1  to  M   Remov e  x i  from  T  For  i=  1  to  n ‐ k  do   //  Check  for  the  next  mean  Begin:   Arbitrarily  select  x i  as  m c   Set  L=0   For  j=  1  to  k  do  //Calculate  the  avg.  dist   Begin:   L=  L+  Distance(m c ,M[j])   End   A verage_Distance=L/k    If  A verage_Distance ≥ Distance_Threshold  then:   Remov e  x i  from  T   Insert  m c  to  M   k=k+1  End  3 P ERFORMANCE A NAL YSIS The  AIM  is  just  the  extension  of  K ‐ means  to  provide  the  number  of  clusters  to  be  generated  by  the  K ‐ means  algo ‐ rithm.  It  also  provides  the  initial  set  of  means  to  K ‐ means.  Therefore  it  has  been  decided  to  make  a  comperative  analysis  of  the  clustering  quality  of  AIM ‐ K ‐ means  with  convensional  K ‐ means.  The  main  difference  between  the  two  algorithms  is  that  in  case  of  AIM ‐ K ‐ means  it  is  not  necessary  to  provide  the  number  of  clus ters  to  be  gener ‐ ated  in  prior  and  for  K ‐ means,  users  have  to  provide  the  number  of  clusters  to  be  generated.  In  this  evaluation  process  three  datasets  have  been  used.  They  have  been  fed  to  the  algorith ms  according  to  the  increasing  order  of  their  size.  The  programs  we re  devel ‐ oped  in  C. 
 To  test  the  algorithms  thoroughly ,  separate  programs  we r e  developed  for  AIM,  AIM ‐ K ‐ means  and  conventional  K ‐ means.  In  the  first  phase  the  datasets  have  been  fed  to  the  K ‐ means  with  the  user  fed  k ‐ val u e .  Then  the  AIM ‐ K ‐ means  was  applied  to  the  same  data  se ts  where  the  val u e  of  k  means  is  provided  internally  by  AIM.  Lastly  the  K ‐ means  algorithm  is  again  applied  to  the  same  datasets  with  the  va l u e  of  k  as  given  by  the  AIM ‐ K ‐ means  method.  The  results  reavels  that:  1. There  is  a  difference  in  performance  for  K ‐ means  and  AIM ‐ K ‐ means  algorithm.  2. But the difference reduces when we use K-means algorithm with the value of k as given by the AIM-K-means. Figure 1: Compar ision Based on Aver age SSE The  above  comparison  wa s  made  on  the  basis  of  average  sum  of  squa re  error .  From  the  study  it  has  been  found  that  AIM ‐ K ‐ means  is  showing  improv ements  in  average  sum  of  squa re.  This  is  basically  because  of  the  initial  set  of  cluster  means  provided  to  the  algorithm.  In  case  of  K ‐ means  the  va l u e  of  k  has  been  provided  based  on  the  out ‐ put  provided  by  AIM.  But  it  is  not  possible  to  provide  initial  set  of  clusters  in  K ‐ means.   4 C ONCLUSION The  most  attractive  propert y  of  the  K ‐ means  algorithm  in  data  mining  is  its  efficiency  in  clustering  large  data  sets.  But  the  main  disadvantage  it  is  facing  is  the  number  of  clusters  that  is  to  be  provided  from  the  user . 
 The  algo ‐ rithm  AIM,  which  is  an  extension  of  K ‐ means,  can  be  used  to  enhan ce  the  efficiency  automating  the  selection  of  the  initial  means.  From  the  experiments  it  has  been  found  that  it  can  improve  the  clu ster  generation  process  of  the  K ‐ means  alg o rithm,  without  diminishing  the  clustering  quality  in  most  of  the  cases .  The  basic  idea  of  AIM  is  to  keep  the  simplicity  and  scalability  of  K ‐ means,  wh ile  achieving  automaticity .  A CKNOWLEDGMENT This  wo rk  has  been  carried  out  as  part  of  Research  Pro ‐ motion  Scheme  (RPS)  Projec t  funded  by  All  India  Council  for  T echnical  Education,  Government  of  India;  vide  san c ‐ tion  order  8023/BOR/RID/RPS ‐ 217/2007 ‐ 08.  JOURNAL OF COMPUTING, VOLUME 1, ISSUE 1, DECEMBER 2009, ISSN: 2151-9617 HTTPS://SITES.GOOGLE.COM/S ITE/JOURNALOFCOMPUTING/ 178 R EFERENCES [1] Efficient  Classification  Method  for  Large  Dataset,  Sheng ‐ Yi  Jiang  School  of  Informatics,  Guangdong  Univ .  of  Foreign  Studies,  Guangzhou  [2] C LUSTERING  T ECHNIQUES  FOR  L ARGES  D AT A  S ETS  F ROM  THE  P AST  T O  THE  F UTURE ,  A LEXANDER  H INNEBURG ,  D ANIEL  A.  K EIM  [3] Data  Clustering:  A  Review ,  A.K.  Jain  (Michigan  State  University),  M.N.  Murty  (Indian  Institute  of  Science)  and  P. J .  Flynn  (The  Ohio  State  University)  [4] Data  Clustering,  by  Lourdes  Perez  [5] Data  Clustering  and  Its  Ap plications,  Raza  Ali,  Us ‐ man  Ghani,  Aasim  Saeed.  [6] MacQueen,  J.  Some  met hods  for  classification  and  analysis  of  multivariate  observations.  Proc:  5 th  Berke ‐ ley  Symp.  Math.  Statist,  Prob,  1:218 ‐ 297,  1967. 
[7] Dipak K. Dey (joint work with Samiran Ghosh, IUPUI, Indianapolis), "K-means Clustering: A Novel Probabilistic Modeling with Applications," Department of Statistics, University of Connecticut.
[8] Sanjay Garg and Ramesh Ch. Jain, "Variations of K-means Algorithm: A Study for High-Dimensional Large Data Sets," Dept. of Computer Engineering, A. D. Patel Institute of Technology, India.
[9] J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.
[10] Michael R. Anderberg, Cluster Analysis for Applications. Academic Press, 1973.
[11] Zhexue Huang, "Extensions to the K-means Algorithm for Clustering Large Data Sets with Categorical Values," ACSys CRC, CSIRO Mathematical and Information Sciences, GPO Box 664, Canberra, ACT 2601, Australia.
[12] Samarjeet Borah and M. K. Ghose, "Automatic Initialization of Means (AIM): A Proposed Extension to the K-means Algorithm," accepted in International Journal of Information Technology and Knowledge Management (IJITKM).

Samarjeet Borah obtained his M.Tech. degree in Information Technology from Tezpur University, India, in 2006. His major field of study is data mining. He is a faculty member in the Department of Computer Science & Engineering at Sikkim Manipal Institute of Technology, Sikkim, India. He is the principal investigator of a research project sponsored by the Government of India. To date he has published ten papers in various conferences and journals. Borah is a member of the Computer Society of India, the International Association of Engineers, Hong Kong, and the International Association of Computer Science and Information Technology, Singapore.
He has received an award for excellence in research initiatives from Sikkim Manipal University of Health, Medical & Technological Sciences.

Dr. Mrinal Kanti Ghose obtained his Ph.D. from Dibrugarh University, Assam, India, in 1981. He is currently working as Professor and Head of the Department of Computer Science & Engineering at Sikkim Manipal Institute of Technology, Majitar, Sikkim, India. Prior to this, Dr. Ghose worked in the internationally reputed R&D organisation ISRO: during 1981 to 1994 at Vikram Sarabhai Space Centre, ISRO, Trivandrum, in the areas of mission simulation and quality & reliability analysis of ISRO launch vehicles and satellite systems, and during 1995 to 2006 at the Regional Remote Sensing Service Centre, ISRO, IIT Campus, Kharagpur (WB), India, in the areas of RS & GIS techniques for natural resources management. Dr. Ghose has conducted a number of seminars, workshops and training programmes in the above areas and published around 35 technical papers in various national and international journals, in addition to the presentation/publication of 125 research papers in international/national conferences. He has guided many M.Tech. and Ph.D. projects and extended consultancy services to many reputed institutes of the country. Dr. Ghose is a Life Member of the Indian Association for Productivity, Quality & Reliability, Kolkata; the National Institute of Quality & Reliability, Trivandrum; the Society for R&D Managers of India, Trivandrum; and the Indian Remote Sensing Society, IIRS, Dehradun.
