Kmeans聚类算法

最新推荐文章于 2024-06-19 20:16:31 发布

xiaofang168

最新推荐文章于 2024-06-19 20:16:31 发布

阅读量795

点赞数

Kmeans聚类算法

(2012-05-17 23:56:46)

转载▼

标签：

杂谈

分类：数据挖掘

聚类算法是这样的一种算法：给定一坨样本数据Sample，要求将样本Sample中相似的数据聚到一类。有了这个认识之后，就应该了解了聚类算法要干什么了吧。说白了，就是归类。

首先，我们需要考虑的是，如何衡量数据之间的相似程度？比如说，有一群说不同语言的人，我们一般是根据他们的方言来聚类的（当然，你也可以指定以身高来聚类）。这里，语言的相似性（或者身高）就成了我们衡量相似的量度了。在考虑存在海量数据，如微博上各种用户的关系网，如何根据用户的关注和被关注来聚类，给用户推荐他们感兴趣的用户？这就是聚类算法研究的内容之一了。

Kmeans就是这样的聚类算法中比较简单的算法，给定数据样本集Sample和应该划分的类数K，对样本数据Sample进行聚类，最终形成K个cluster，其相似的度量是某条数据i与中心点的“距离”（这里所说的距离，不止于二维）。如下图：

算法的过程是这样的：

1):input Samples and it's number N， numSample, k ;

2):randomize select k sample from the Samples asthe initialed clusters , set the maxCompute to m ,and the passNumto 0;

3) :fori=0 to N-1

a): for j=0 to K-1

i): compute the distance between record i and cluster center j;

ii): addthe record i to closest cluster;

4):for j=0 to K-1

a) : reSelect center between the cluster j ,according givenrule;

b): clear all the older merber in the cluster;

5):passNum++ ; if all the new center same to theolder center or passNum >m,terminate!else continue;

下面给出Kmeans运行的一个实例：

样本数据如下：

2.1 3.0 10.0
4.0 5.2 -1.0
5.1 1.5 2.3
10.5 12.6 10.8
12.1 10.9 11.0
4.2 5.3 -9.8
5.4 1.6 8.7
-1.0 -2.1 -0.9
0.5 0.3 0.4

需要分成三个聚类

用这个数据运行后，程序打印出划分结果：

---- 第1个聚类 ----
聚类中心：3.75,2.3,9.35
2.1,3,10
5.4,1.6,8.7

---- 第2个聚类 ----
聚类中心：4.1,5.25,-5.4
4,5.2,-1
4.2,5.3,-9.8

---- 第3个聚类 ----
聚类中心：1.53333,-0.1,0.6
5.1,1.5,2.3
-1,-2.1,-0.9
0.5,0.3,0.4

---- 第4个聚类 ----
聚类中心：11.3,11.75,10.9
10.5,12.6,10.8
12.1,10.9,11

至此，我们得到了三个相似性很高的聚类

下面，就是c++的粗略实现了：

//the single sample
class Sample{
public:
unsigned int r, g ,b;
Sample(){};
Sample(unsigned int x, unsigned int y , unsignedint z){
  r=x ; g= y; b= z;
}
//compute the distance of two sample
int distance(Sample other){
  return ((r-other.r)*(r-other.r)+
   (g-other.g) *(g-other.g) +
   (b-other.b) *(b-other.b));
}
};

//cluster class

class Cluster{
private :
Sample center; //the center ofcluster
int iCenter; //the sub ofcenter
int number; //number of themerber
int *merber;// index of the merber, record thesub
public :
Cluster(){merber = NULL; number = 0; iCenter =-1;}
Cluster(int n , Sample c , int*m , int i)
:number(n), center(c) ,merber(m) , iCenter(i)
{

}
Sample getCenter(){
return center;
}

void setCenterNumber(int i){
iCenter = i;
}
int getCenterNumber(){
return iCenter;
}

void setCenter(Sample c){
  center = c;
}
void setNumber(int n){
  number = n;
}
int getMerberNumber (){
  return number;
}
void addMerber(int n){
  merber[number++] = n;
}
int getNewCenter(){
   int newSub=0;
   for(int i=0 ;i<number ; i++){
   newSub+=merber[i];
   }
   return newSub/number;
}
};

//Kmeans class

#include <time.h>
#include <stdio>
#include "Sample.h"
#include "Cluster.h"
class Kmeans{
private:
int num , clusterNum ;
Cluster * clusters;
Sample * samples;
public :
Kmeans(int n, int d, int k, Sample * ob)
  : num(n)
  , clusterNum(k)
  , samples(ob)
  , clusters(newCluster[k])
{
  //assert(n>1&& k<n) ;
}
Kmeans(){};
~Kmeans(){delete []clusters ; delete []samples ;}

//initial the clusters
void initialCluster(){
  int temp =num/clusterNum;
  for(inti=0;i<clusterNum ; i++){
   clusters[i].setCenter(samples[(temp*i)%num]);
   clusters[i].setNumber(0);
  }
}

bool run(){
  bool converged = false; //signing converged
  int passNum = 0;
  while (!converged&& passNum <999)   //if not converged , tryagain
   // thecompute times passNum should under 999
  {
   for(int i=0 ;i<clusterNum ; i++){
    //converged= ( clusters[i].getCenter()) ==clusters[clusters[i].getNewCenter()].getCenter();
    inttemp = clusters[i].getCenterNumber();
    converged= clusters[i].getNewCenter()!=temp ;
    clusters[i].setCenter(samples[temp]);
   }
   distribute();                    //distribute the samples to the closest cluster
   //converged =(); //
   passNum++;
  }
}

void distribute()
{
  // clear all the older clustermerber , and reCluster
  for(int k=0;k<clusterNum; k++)
   clusters[k].setNumber(0);
  // compute the new closestcluster
  for(int i=0;i<num; i++){
   clusters[getClosestCluster(i)].addMerber(i);
  }
}

//get closest cluster,return the sub
int getClosestCluster(int i){
  int iShortestDistance=0 ,sub=0;
  for(intj=0;j<clusterNum ; j++){
   int temp =samples[i].distance(clusters[j].getCenter());
   if(iShortestDistance<temp){
    iShortestDistance= temp;
    sub=j;
   }
  }
  return sub;
}
};

xiaofang168

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
Kmeans聚类算法

Kmeans聚类算法(2012-05-17 23:56:46)转载▼标签：杂谈分类：数据挖掘聚类算法是这样的一种算法：给定一坨样本数据Sample，要求将样本Sample中相似的数据聚到一类。有了这个认识之后，就应该了解了聚类算法要干什么了吧。说白了，就是归类。首先，我们需要考虑的是，如何衡量数据之间的相似程度？比如
复制链接

扫一扫