The K-means Algorithm

1. Introduction to K-means

Clustering is the act of partitioning a set of data points or records into groups, called clusters. K-means is one of the most widely used clustering algorithms.

K-means operates on a collection of n-dimensional elements and partitions those elements into a number of clusters.

K-means needs two things decided up front:

1. The number of clusters K. This can start as a rough guess and be tuned based on how the results look.

2. The K elements used to initialize the K clusters. We can pick K representative elements; with a large data set, we can simply pick K elements at random, one per cluster.

That covers initialization. The real core of K-means is a loop:

1. Compute each element's distance to every cluster center, find the cluster nearest to each element, and record it.

2. Move each element into its nearest cluster.

3. Recompute the cluster centers from the updated memberships.

The algorithm terminates when the cluster centers no longer change (i.e., no element moves to a different cluster); otherwise steps 1-3 repeat.
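The loop above can be sketched in a few lines of Python. This is a minimal sketch, not the article's full implementation (that comes in section 3); `dist` and `mean` are illustrative helper names, and the caller supplies the initial centers:

```python
def dist(p, c):
    # Squared Euclidean distance between a point and a center
    return sum((a - b) ** 2 for a, b in zip(p, c))

def mean(points):
    # Per-dimension mean of a non-empty list of points
    return [sum(col) / len(points) for col in zip(*points)]

def kmeans_loop(points, centers):
    assignment = None
    while True:
        # Step 1: find each point's nearest center
        new_assignment = [
            min(range(len(centers)), key=lambda j: dist(p, centers[j]))
            for p in points
        ]
        # Terminate when no point changes cluster
        if new_assignment == assignment:
            return assignment, centers
        assignment = new_assignment
        # Steps 2-3: regroup the points and recompute each center
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centers[j] = mean(members)
```

Seeded with the first three sample points from the worked example below, this sketch reproduces the same final grouping that the walkthrough arrives at.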

The steps above involve two concepts:

1. Cluster center: since the data K-means handles is n-dimensional, a cluster center is an array of length n. Each of its entries equals the sum of that dimension's values over all elements in the cluster, divided by the number of elements in the cluster.

For example, if K-means is processing 2-dimensional data and a cluster contains the elements (1,1), (2,2), (3,4), then its cluster center is:

((1+2+3)/3,(1+2+4)/3) = (2,2.33)
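In Python, a cluster center is just the per-dimension mean:

```python
# Centroid of a cluster: the per-dimension mean of its elements
items = [(1, 1), (2, 2), (3, 4)]
center = [sum(dim) / len(items) for dim in zip(*items)]
# center[0] == 2.0, center[1] == 7/3, i.e. roughly 2.33
```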

2. The distance between an element and a cluster = the sum, over all dimensions, of the squared difference between the element and the cluster center in that dimension (the squared Euclidean distance).

For example, if a cluster's center is (2,3) and an element is (2,5), their distance is:

(2-2)*(2-2)+(3-5)*(3-5)=4
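The same computation in Python:

```python
# Squared Euclidean distance between a cluster center and an element
center, point = (2, 3), (2, 5)
distance = sum((c - p) ** 2 for c, p in zip(center, point))
# distance == 4
```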

2. Worked Example

Suppose the data we want to cluster is a set of points, each given by its coordinates (x, y); that is, the data is two-dimensional:

samples = [
    [1,2],
    [15,16],
    [0,0],
    [2,5],
    [6,6],
    [14,15],
    [8,8]
]

First we set the number of clusters K = 3 and initialize the clusters by assigning the samples at indices 0, 1, and 2 to them:

Cluster 0's items are [[1, 2]],centers are [1.0, 2.0]
Cluster 1's items are [[15, 16]],centers are [15.0, 16.0]
Cluster 2's items are [[0, 0]],centers are [0.0, 0.0]

The first round of distance computations then gives:

[1, 2]  and cluster 0 distance:0.0
[1, 2]  and cluster 1 distance:392.0
[1, 2]  and cluster 2 distance:5.0
[1, 2] 's nearest cluster is:Cluster 0
[15, 16]  and cluster 0 distance:392.0
[15, 16]  and cluster 1 distance:0.0
[15, 16]  and cluster 2 distance:481.0
[15, 16] 's nearest cluster is:Cluster 1
[0, 0]  and cluster 0 distance:5.0
[0, 0]  and cluster 1 distance:481.0
[0, 0]  and cluster 2 distance:0.0
[0, 0] 's nearest cluster is:Cluster 2
[2, 5]  and cluster 0 distance:10.0
[2, 5]  and cluster 1 distance:290.0
[2, 5]  and cluster 2 distance:29.0
[2, 5] 's nearest cluster is:Cluster 0
[6, 6]  and cluster 0 distance:41.0
[6, 6]  and cluster 1 distance:181.0
[6, 6]  and cluster 2 distance:72.0
[6, 6] 's nearest cluster is:Cluster 0
[14, 15]  and cluster 0 distance:338.0
[14, 15]  and cluster 1 distance:2.0
[14, 15]  and cluster 2 distance:421.0
[14, 15] 's nearest cluster is:Cluster 1
[8, 8]  and cluster 0 distance:85.0
[8, 8]  and cluster 1 distance:113.0
[8, 8]  and cluster 2 distance:128.0
[8, 8] 's nearest cluster is:Cluster 0

We move each element into its nearest cluster and recompute the cluster centers. The result:

Cluster 0's items are [[1, 2], [2, 5], [6, 6], [8, 8]],centers are [4.25, 5.25]
Cluster 1's items are [[15, 16], [14, 15]],centers are [14.5, 15.5]
Cluster 2's items are [[0, 0]],centers are [0.0, 0.0]

The second round again computes each element's distance to every cluster center:

[1, 2]  and cluster 0 distance:21.125
[1, 2]  and cluster 1 distance:364.5
[1, 2]  and cluster 2 distance:5.0
[1, 2] 's nearest cluster is:Cluster 2
[15, 16]  and cluster 0 distance:231.125
[15, 16]  and cluster 1 distance:0.5
[15, 16]  and cluster 2 distance:481.0
[15, 16] 's nearest cluster is:Cluster 1
[0, 0]  and cluster 0 distance:45.625
[0, 0]  and cluster 1 distance:450.5
[0, 0]  and cluster 2 distance:0.0
[0, 0] 's nearest cluster is:Cluster 2
[2, 5]  and cluster 0 distance:5.125
[2, 5]  and cluster 1 distance:266.5
[2, 5]  and cluster 2 distance:29.0
[2, 5] 's nearest cluster is:Cluster 0
[6, 6]  and cluster 0 distance:3.625
[6, 6]  and cluster 1 distance:162.5
[6, 6]  and cluster 2 distance:72.0
[6, 6] 's nearest cluster is:Cluster 0
[14, 15]  and cluster 0 distance:190.125
[14, 15]  and cluster 1 distance:0.5
[14, 15]  and cluster 2 distance:421.0
[14, 15] 's nearest cluster is:Cluster 1
[8, 8]  and cluster 0 distance:21.625
[8, 8]  and cluster 1 distance:98.5
[8, 8]  and cluster 2 distance:128.0
[8, 8] 's nearest cluster is:Cluster 0

This yields an updated nearest cluster for each element. We again move the elements into their nearest clusters and recompute the centers:

Cluster 0's items are [[2, 5], [6, 6], [8, 8]],centers are [5.333333333333333, 6.333333333333333]
Cluster 1's items are [[15, 16], [14, 15]],centers are [14.5, 15.5]
Cluster 2's items are [[1, 2], [0, 0]],centers are [0.5, 1.0]

The third round computes the distances once more.

[1, 2]  and cluster 0 distance:37.5555555556
[1, 2]  and cluster 1 distance:364.5
[1, 2]  and cluster 2 distance:1.25
[1, 2] 's nearest cluster is:Cluster 2
[15, 16]  and cluster 0 distance:186.888888889
[15, 16]  and cluster 1 distance:0.5
[15, 16]  and cluster 2 distance:435.25
[15, 16] 's nearest cluster is:Cluster 1
[0, 0]  and cluster 0 distance:68.5555555556
[0, 0]  and cluster 1 distance:450.5
[0, 0]  and cluster 2 distance:1.25
[0, 0] 's nearest cluster is:Cluster 2
[2, 5]  and cluster 0 distance:12.8888888889
[2, 5]  and cluster 1 distance:266.5
[2, 5]  and cluster 2 distance:18.25
[2, 5] 's nearest cluster is:Cluster 0
[6, 6]  and cluster 0 distance:0.555555555556
[6, 6]  and cluster 1 distance:162.5
[6, 6]  and cluster 2 distance:55.25
[6, 6] 's nearest cluster is:Cluster 0
[14, 15]  and cluster 0 distance:150.222222222
[14, 15]  and cluster 1 distance:0.5
[14, 15]  and cluster 2 distance:378.25
[14, 15] 's nearest cluster is:Cluster 1
[8, 8]  and cluster 0 distance:9.88888888889
[8, 8]  and cluster 1 distance:98.5
[8, 8]  and cluster 2 distance:105.25
[8, 8] 's nearest cluster is:Cluster 0

Every element's nearest cluster is unchanged from the second round, so the algorithm terminates; we have our result.

The result is:

Cluster 0 [[2, 5], [6, 6], [8, 8]]
Cluster 1 [[15, 16], [14, 15]]
Cluster 2 [[1, 2], [0, 0]]

Plotting the result shows the grouping is quite reasonable.


3. Source Code

The algorithm's source code is as follows:

import json
import math

# The data set D: a collection of 2-dimensional points.
# Each point's two values are its (x, y) coordinates for plotting.
samples = [
    [1, 2],
    [15, 16],
    [0, 0],
    [2, 5],
    [6, 6],
    [14, 15],
    [8, 8]
]
# Predefined number of clusters
K = 3
clusters = list()

class Cluster(object):
    def __init__(self, clusterid):
        self.clusterid = clusterid
        self.init()
    def init(self):
        # Reset the cluster to an empty state
        self.centers = list()
        self.items = list()
    def add_item(self, val):
        self.items.append(val)
    def get_dimensions(self):
        if len(self.items) == 0:
            return 0
        return len(self.items[0])
    def adjust_center(self):
        # The center is the per-dimension mean of the cluster's items
        dimensions = self.get_dimensions()
        if dimensions == 0:
            return
        self.centers = [0] * dimensions
        for item in self.items:
            for j in range(dimensions):
                self.centers[j] += item[j]
        for j in range(dimensions):
            self.centers[j] = float(self.centers[j]) / len(self.items)
    def __str__(self):
        return "Cluster %s's items are %s,centers are %s" % (
            self.clusterid, json.dumps(self.items), json.dumps(self.centers))

def print_clusters():
    for cluster in clusters:
        print(cluster)

def kmeans():
    # Step 1 of k-means: create K clusters, seeding each with one of
    # the first K points of samples (indices 0, 1, 2 here).
    for i in range(K):
        curcluster = Cluster(i)
        curcluster.add_item(samples[i])
        curcluster.adjust_center()
        clusters.append(curcluster)
    # Print the initial state
    print_clusters()

    pre_record = dict()
    cur_record = cal_distance()
    print("**************************************")
    # Iterate until the nearest-cluster assignment stops changing
    while pre_record != cur_record:
        pre_record = cur_record
        adjust(cur_record)
        print_clusters()
        cur_record = cal_distance()
        print("**************************************")
    print("The result is:")
    for cluster in clusters:
        print('Cluster %s' % cluster.clusterid, cluster.items)

# Move each point into the cluster nearest to it
def adjust(record):
    for cluster in clusters:
        # Empty every cluster before reassigning the points
        cluster.init()
    for i in range(len(samples)):
        clusters[record[i]].add_item(samples[i])
    # Recompute the cluster centers from the new memberships
    for cluster in clusters:
        cluster.adjust_center()

# Compute each point's distance to every cluster center and record
# the nearest cluster in a dict of {point index: cluster index}
def cal_distance():
    record = dict()
    for i in range(len(samples)):
        curdistance = float('inf')
        curindex = -1
        for j in range(len(clusters)):
            distance = cal(samples[i], clusters[j].centers)
            print(samples[i], " and cluster %s distance:%s" % (j, distance))
            if distance < curdistance:
                curdistance = distance
                curindex = j
        print(samples[i], "'s nearest cluster is:Cluster %s" % curindex)
        record[i] = curindex
    return record

# Distance between two vectors: the sum of squared per-dimension differences
def cal(params1, params2):
    # If the dimensions don't even match, treat the distance as infinite
    if len(params1) != len(params2):
        return float('inf')
    distance = 0.0
    for i in range(len(params1)):
        distance += math.pow(params1[i] - params2[i], 2)
    return distance

if __name__ == '__main__':
    kmeans()

The algorithm prints the following:

Cluster 0's items are [[1, 2]],centers are [1.0, 2.0]
Cluster 1's items are [[15, 16]],centers are [15.0, 16.0]
Cluster 2's items are [[0, 0]],centers are [0.0, 0.0]
[1, 2]  and cluster 0 distance:0.0
[1, 2]  and cluster 1 distance:392.0
[1, 2]  and cluster 2 distance:5.0
[1, 2] 's nearest cluster is:Cluster 0
[15, 16]  and cluster 0 distance:392.0
[15, 16]  and cluster 1 distance:0.0
[15, 16]  and cluster 2 distance:481.0
[15, 16] 's nearest cluster is:Cluster 1
[0, 0]  and cluster 0 distance:5.0
[0, 0]  and cluster 1 distance:481.0
[0, 0]  and cluster 2 distance:0.0
[0, 0] 's nearest cluster is:Cluster 2
[2, 5]  and cluster 0 distance:10.0
[2, 5]  and cluster 1 distance:290.0
[2, 5]  and cluster 2 distance:29.0
[2, 5] 's nearest cluster is:Cluster 0
[6, 6]  and cluster 0 distance:41.0
[6, 6]  and cluster 1 distance:181.0
[6, 6]  and cluster 2 distance:72.0
[6, 6] 's nearest cluster is:Cluster 0
[14, 15]  and cluster 0 distance:338.0
[14, 15]  and cluster 1 distance:2.0
[14, 15]  and cluster 2 distance:421.0
[14, 15] 's nearest cluster is:Cluster 1
[8, 8]  and cluster 0 distance:85.0
[8, 8]  and cluster 1 distance:113.0
[8, 8]  and cluster 2 distance:128.0
[8, 8] 's nearest cluster is:Cluster 0
**************************************
Cluster 0's items are [[1, 2], [2, 5], [6, 6], [8, 8]],centers are [4.25, 5.25]
Cluster 1's items are [[15, 16], [14, 15]],centers are [14.5, 15.5]
Cluster 2's items are [[0, 0]],centers are [0.0, 0.0]
[1, 2]  and cluster 0 distance:21.125
[1, 2]  and cluster 1 distance:364.5
[1, 2]  and cluster 2 distance:5.0
[1, 2] 's nearest cluster is:Cluster 2
[15, 16]  and cluster 0 distance:231.125
[15, 16]  and cluster 1 distance:0.5
[15, 16]  and cluster 2 distance:481.0
[15, 16] 's nearest cluster is:Cluster 1
[0, 0]  and cluster 0 distance:45.625
[0, 0]  and cluster 1 distance:450.5
[0, 0]  and cluster 2 distance:0.0
[0, 0] 's nearest cluster is:Cluster 2
[2, 5]  and cluster 0 distance:5.125
[2, 5]  and cluster 1 distance:266.5
[2, 5]  and cluster 2 distance:29.0
[2, 5] 's nearest cluster is:Cluster 0
[6, 6]  and cluster 0 distance:3.625
[6, 6]  and cluster 1 distance:162.5
[6, 6]  and cluster 2 distance:72.0
[6, 6] 's nearest cluster is:Cluster 0
[14, 15]  and cluster 0 distance:190.125
[14, 15]  and cluster 1 distance:0.5
[14, 15]  and cluster 2 distance:421.0
[14, 15] 's nearest cluster is:Cluster 1
[8, 8]  and cluster 0 distance:21.625
[8, 8]  and cluster 1 distance:98.5
[8, 8]  and cluster 2 distance:128.0
[8, 8] 's nearest cluster is:Cluster 0
**************************************
Cluster 0's items are [[2, 5], [6, 6], [8, 8]],centers are [5.333333333333333, 6.333333333333333]
Cluster 1's items are [[15, 16], [14, 15]],centers are [14.5, 15.5]
Cluster 2's items are [[1, 2], [0, 0]],centers are [0.5, 1.0]
[1, 2]  and cluster 0 distance:37.5555555556
[1, 2]  and cluster 1 distance:364.5
[1, 2]  and cluster 2 distance:1.25
[1, 2] 's nearest cluster is:Cluster 2
[15, 16]  and cluster 0 distance:186.888888889
[15, 16]  and cluster 1 distance:0.5
[15, 16]  and cluster 2 distance:435.25
[15, 16] 's nearest cluster is:Cluster 1
[0, 0]  and cluster 0 distance:68.5555555556
[0, 0]  and cluster 1 distance:450.5
[0, 0]  and cluster 2 distance:1.25
[0, 0] 's nearest cluster is:Cluster 2
[2, 5]  and cluster 0 distance:12.8888888889
[2, 5]  and cluster 1 distance:266.5
[2, 5]  and cluster 2 distance:18.25
[2, 5] 's nearest cluster is:Cluster 0
[6, 6]  and cluster 0 distance:0.555555555556
[6, 6]  and cluster 1 distance:162.5
[6, 6]  and cluster 2 distance:55.25
[6, 6] 's nearest cluster is:Cluster 0
[14, 15]  and cluster 0 distance:150.222222222
[14, 15]  and cluster 1 distance:0.5
[14, 15]  and cluster 2 distance:378.25
[14, 15] 's nearest cluster is:Cluster 1
[8, 8]  and cluster 0 distance:9.88888888889
[8, 8]  and cluster 1 distance:98.5
[8, 8]  and cluster 2 distance:105.25
[8, 8] 's nearest cluster is:Cluster 0
**************************************
The result is:
Cluster 0 [[2, 5], [6, 6], [8, 8]]
Cluster 1 [[15, 16], [14, 15]]
Cluster 2 [[1, 2], [0, 0]]

4. Extending the Idea

The example in this article happens to cluster numeric data. What if what we need to cluster is not numbers but text?

What we need is a way to turn text into numbers. For a data set D, the number of documents is fixed, so the set of words they contain is also finite (for Chinese text, word segmentation is needed to extract the words). Each document can then be represented as a mapping of {word → that word's TF-IDF score}. Note that, to keep every document (and every cluster) in the same dimensionality, each document's mapping covers the full vocabulary, i.e. the union of all words across all documents. What does that mean?

For example, given three documents:

1. 我是中国人 ("I am Chinese")

2. 中国富强 ("China is prosperous")

3. 别克汽车 ("Buick cars")

the full vocabulary across the three documents is {我, 是, 中国, 人, 富强, 别克, 汽车}.

The first document can then be represented as {我:1, 是:1, 中国:1, 人:1, 富强:0, 别克:0, 汽车:0} (shown here with raw counts; in practice each value would be the word's TF-IDF score).

The remaining documents follow the same pattern. With this model the documents are numeric and all share the same dimensionality, so we can initialize clusters and compute each document's distance to a cluster exactly as K-means requires (plugging the TF-IDF values into a text-similarity formula, such as cosine similarity, for the distance).
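A minimal sketch of this vectorization, using raw word counts in place of TF-IDF to keep it short, and with the Chinese segmentation written out by hand rather than produced by a real segmenter:

```python
# Hand-segmented documents (a real system would use a Chinese word segmenter)
docs = [
    ["我", "是", "中国", "人"],   # 我是中国人
    ["中国", "富强"],             # 中国富强
    ["别克", "汽车"],             # 别克汽车
]

# Full vocabulary: the union of words across all documents, in first-seen order
vocab = []
for doc in docs:
    for word in doc:
        if word not in vocab:
            vocab.append(word)

# Each document becomes a vector over the full vocabulary, so every
# document shares the same dimensionality (counts here, not TF-IDF)
vectors = [[doc.count(word) for word in vocab] for doc in docs]
print(vocab)       # ['我', '是', '中国', '人', '富强', '别克', '汽车']
print(vectors[0])  # [1, 1, 1, 1, 0, 0, 0]
```

Once every document is such a fixed-length vector, the clusters and distances from the sections above apply unchanged.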




