The K-means Clustering Algorithm
(2012-05-17 23:56:46)
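A clustering algorithm works like this: given a pile of sample data, Sample, it groups the similar items in Sample into the same class. With that in mind, it should be clear what a clustering algorithm is for. Put plainly, it sorts things into categories.
First, we need to consider how to measure the similarity between data items. For example, given a group of people who speak different languages, we would usually cluster them by dialect (of course, you could also choose to cluster them by height). Here, the similarity of language (or height) becomes our measure of similarity. Now consider massive data, such as the network of user relationships on Weibo: how do we cluster users by whom they follow and who follows them, so that we can recommend users they may be interested in? This is one of the things that clustering algorithms study.
K-means is one of the simpler algorithms of this kind. Given a sample set Sample and the number of classes K to divide it into, it clusters the samples so that K clusters are formed in the end. Its similarity measure is the "distance" between a data item i and a cluster's center point (the distance here is not limited to two dimensions). As shown in the figure below: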
The algorithm proceeds as follows:
1): input the Samples, their number N, and the number of clusters K;
2): randomly select K samples from the Samples as the initial cluster centers, set maxCompute to m, and set passNum to 0;
3): for i=0 to N-1
    a): for j=0 to K-1
        i): compute the distance between record i and cluster center j;
    b): add record i to the closest cluster;
4): for j=0 to K-1
    a): reselect the center of cluster j according to the given rule;
    b): clear all the old members of the cluster;
5): passNum++; if every new center is the same as the old center or passNum > m, terminate; else go back to step 3;
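The steps above can be sketched in C++ roughly as follows. This is a minimal sketch and not the implementation given later in this post: it assumes each sample is a std::vector<double>, that the "given rule" in step 4 is the arithmetic mean of the members, and that the caller supplies the initial centers from step 2. The names Point, squaredDistance, closestCenter and kmeans are made up for illustration.

```cpp
#include <cstddef>
#include <vector>

typedef std::vector<double> Point;

// Squared Euclidean distance; works for any number of dimensions.
double squaredDistance(const Point& a, const Point& b) {
    double sum = 0.0;
    for (std::size_t d = 0; d < a.size(); ++d) {
        double diff = a[d] - b[d];
        sum += diff * diff;
    }
    return sum;
}

// Step 3: subscript of the center closest to p (centers must not be empty).
std::size_t closestCenter(const Point& p, const std::vector<Point>& centers) {
    std::size_t best = 0;
    double bestDist = squaredDistance(p, centers[0]);
    for (std::size_t j = 1; j < centers.size(); ++j) {
        double dist = squaredDistance(p, centers[j]);
        if (dist < bestDist) {
            bestDist = dist;
            best = j;
        }
    }
    return best;
}

// Steps 3 to 5. `centers` holds the initial centers and is updated in place.
// Returns, for every sample, the subscript of the cluster it was assigned to.
std::vector<std::size_t> kmeans(const std::vector<Point>& samples,
                                std::vector<Point>& centers,
                                int maxPasses) {
    std::vector<std::size_t> label(samples.size(), 0);
    if (centers.empty()) return label;
    const std::size_t dim = centers[0].size();
    for (int pass = 0; pass < maxPasses; ++pass) {
        // Step 3: assign every sample to its closest center.
        for (std::size_t i = 0; i < samples.size(); ++i)
            label[i] = closestCenter(samples[i], centers);
        // Step 4: recompute each center as the mean of its members.
        std::vector<Point> sums(centers.size(), Point(dim, 0.0));
        std::vector<std::size_t> counts(centers.size(), 0);
        for (std::size_t i = 0; i < samples.size(); ++i) {
            for (std::size_t d = 0; d < dim; ++d)
                sums[label[i]][d] += samples[i][d];
            ++counts[label[i]];
        }
        // Step 5: stop as soon as no center moves.
        bool changed = false;
        for (std::size_t j = 0; j < centers.size(); ++j) {
            if (counts[j] == 0) continue;  // an empty cluster keeps its center
            for (std::size_t d = 0; d < dim; ++d) {
                double mean = sums[j][d] / counts[j];
                if (mean != centers[j][d]) changed = true;
                centers[j][d] = mean;
            }
        }
        if (!changed) break;
    }
    return label;
}
```

The squared distance is enough here because the algorithm only compares distances, so the square root can be skipped.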
Here is an example of a K-means run:
The sample data is as follows:
2.1 3.0 10.0
4.0 5.2 -1.0
5.1 1.5 2.3
10.5 12.6 10.8
12.1 10.9 11.0
4.2 5.3 -9.8
5.4 1.6 8.7
-1.0 -2.1 -0.9
0.5 0.3 0.4
The data needs to be divided into four clusters
After running on this data, the program prints the resulting partition:
---- Cluster 1 ----
Cluster center: 3.75,2.3,9.35
2.1,3,10
5.4,1.6,8.7
---- Cluster 2 ----
Cluster center: 4.1,5.25,-5.4
4,5.2,-1
4.2,5.3,-9.8
---- Cluster 3 ----
Cluster center: 1.53333,-0.1,0.6
5.1,1.5,2.3
-1,-2.1,-0.9
0.5,0.3,0.4
---- Cluster 4 ----
Cluster center: 11.3,11.75,10.9
10.5,12.6,10.8
12.1,10.9,11
At this point, we have four clusters whose members are highly similar
Below is a rough C++ implementation:
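//the single sample
class Sample{
public:
    //the three components of a sample; being unsigned integers, they cannot
    //hold negative or fractional values
    unsigned int r, g, b;
    Sample() : r(0), g(0), b(0) {}
    Sample(unsigned int x, unsigned int y, unsigned int z) : r(x), g(y), b(z) {}
    //compute the squared distance between two samples
    //(the square root is not needed just to compare distances)
    int distance(const Sample& other) const {
        //convert to int first, so that a difference can be negative
        int dr = (int)r - (int)other.r;
        int dg = (int)g - (int)other.g;
        int db = (int)b - (int)other.b;
        return dr*dr + dg*dg + db*db;
    }
};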
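//cluster class
#include <vector>
class Cluster{
private :
    Sample center;           //the center of the cluster
    int iCenter;             //the subscript of the center in the sample array
    std::vector<int> merber; //subscripts of the members in the sample array
public :
    Cluster() : iCenter(-1) {}
    //n: number of members, c: the center,
    //m: array holding the n member subscripts, i: subscript of the center
    Cluster(int n, Sample c, int *m, int i)
    : center(c), iCenter(i), merber(m, m + n)
    {
    }
    Sample getCenter(){
        return center;
    }
    void setCenterNumber(int i){
        iCenter = i;
    }
    int getCenterNumber(){
        return iCenter;
    }
    void setCenter(Sample c){
        center = c;
    }
    //setNumber(0) removes all the members
    void setNumber(int n){
        merber.resize(n);
    }
    int getMerberNumber(){
        return (int)merber.size();
    }
    void addMerber(int n){
        merber.push_back(n);
    }
    //rule used here for the new center: the sample whose subscript is the
    //average of the member subscripts. This is only a rough stand-in;
    //standard K-means uses the mean of the members' coordinates
    int getNewCenter(){
        if(merber.empty())
            return iCenter; //an empty cluster keeps its old center
        int newSub = 0;
        for(int i = 0; i < (int)merber.size(); i++){
            newSub += merber[i];
        }
        return newSub / (int)merber.size();
    }
};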
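//Kmeans class
#include <time.h>
#include <cstdio>
#include "Sample.h"
#include "Cluster.h"
class Kmeans{
private:
    int num, clusterNum;  //number of samples, number of clusters
    Cluster * clusters;
    Sample * samples;
public :
    //n: number of samples, d: dimension (unused, a Sample always has 3 components),
    //k: number of clusters, ob: the samples; ob must have been allocated
    //with new[], because the destructor frees it
    Kmeans(int n, int d, int k, Sample * ob)
    : num(n)
    , clusterNum(k)
    , clusters(new Cluster[k])
    , samples(ob)
    {
        //assert(n>1 && k<n);
    }
    Kmeans() : num(0), clusterNum(0), clusters(NULL), samples(NULL) {}
    ~Kmeans(){ delete [] clusters; delete [] samples; }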
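    //initialize the clusters, taking evenly spaced samples as the centers
    void initialCluster(){
        int temp = num/clusterNum;
        for(int i=0; i<clusterNum; i++){
            int sub = (temp*i)%num;
            clusters[i].setCenter(samples[sub]);
            clusters[i].setCenterNumber(sub); //remember which sample is the center
            clusters[i].setNumber(0);
        }
    }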
    //call initialCluster() first; returns true if the centers stopped changing
    bool run(){
        bool converged = false; //true when no center changed during a pass
        int passNum = 0;
        distribute(); //first assignment, using the initial centers
        while (!converged && passNum < 999) //if not converged, try again;
                                             //passNum must stay under 999
        {
            converged = true;
            for(int i=0; i<clusterNum; i++){
                if(clusters[i].getMerberNumber() == 0)
                    continue; //an empty cluster keeps its old center
                int oldCenter = clusters[i].getCenterNumber();
                int newCenter = clusters[i].getNewCenter();
                if(newCenter != oldCenter)
                    converged = false;
                clusters[i].setCenterNumber(newCenter);
                clusters[i].setCenter(samples[newCenter]);
            }
            distribute(); //distribute the samples to the closest cluster
            passNum++;
        }
        return converged;
    }
    void distribute()
    {
        // clear all the old cluster members, and recluster
        for(int k=0; k<clusterNum; k++)
            clusters[k].setNumber(0);
        // add every sample to its closest cluster
        for(int i=0; i<num; i++){
            clusters[getClosestCluster(i)].addMerber(i);
        }
    }
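    //get the closest cluster of sample i, return its subscript
    int getClosestCluster(int i){
        //start from cluster 0, then keep whichever cluster is closer
        int iShortestDistance = samples[i].distance(clusters[0].getCenter());
        int sub = 0;
        for(int j=1; j<clusterNum; j++){
            int temp = samples[i].distance(clusters[j].getCenter());
            if(temp < iShortestDistance){
                iShortestDistance = temp;
                sub = j;
            }
        }
        return sub;
    }
};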
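Step 2 of the algorithm asks for randomly selected initial centers, while initialCluster() above takes evenly spaced samples instead. One possible way to pick K distinct random subscripts is sketched below. It assumes a C++11 compiler, and pickInitialCenters is a hypothetical helper that is not part of the classes above.

```cpp
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// Returns k distinct subscripts in [0, n), chosen at random.
// If k is larger than n, all n subscripts are returned.
std::vector<int> pickInitialCenters(int n, int k, unsigned int seed) {
    std::vector<int> indices(n > 0 ? n : 0);
    std::iota(indices.begin(), indices.end(), 0);       // 0, 1, ..., n-1
    std::mt19937 gen(seed);                             // seeded generator
    std::shuffle(indices.begin(), indices.end(), gen);  // random order
    if (k < 0) k = 0;
    if (k > static_cast<int>(indices.size()))
        k = static_cast<int>(indices.size());
    indices.resize(k);                                  // keep the first k
    return indices;
}
```

Each returned subscript sub could then be used in place of (temp*i)%num, passing samples[sub] to setCenter() and sub to setCenterNumber(). Shuffling and keeping the first K entries guarantees that no sample is chosen twice, which repeated calls to a random number generator would not.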