The K-means Algorithm

1. Introduction to K-means

Clustering is the act of partitioning a set of data points or records into groups, called clusters. K-means is one of the most widely used clustering algorithms.

K-means operates on a collection of n-dimensional elements and partitions those elements into a number of clusters.

K-means needs two things decided up front:

1. The number of clusters K. This can start as a rough guess and be tuned based on how the results look.

2. The K elements used to initialize the K clusters. We can pick K representative elements; with a large data set, we can simply pick K elements at random, one per cluster.

That covers initialization. The real core of K-means is a loop:

1. Compute each element's distance to every cluster center, find the cluster nearest to each element, and record it.

2. Move each element into its nearest cluster.

3. Recompute the cluster centers from the updated memberships.

The algorithm terminates when the cluster centers no longer change (i.e., no element moves to a different cluster); otherwise steps 1-3 repeat.
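The loop above can be sketched in a few lines of Python. This is a minimal sketch, not the article's full implementation (that comes in section 3); `dist` and `mean` are illustrative helper names, and the caller supplies the initial centers:

```python
def dist(p, c):
    # Squared Euclidean distance between a point and a center
    return sum((a - b) ** 2 for a, b in zip(p, c))

def mean(points):
    # Per-dimension mean of a non-empty list of points
    return [sum(col) / len(points) for col in zip(*points)]

def kmeans_loop(points, centers):
    assignment = None
    while True:
        # Step 1: find each point's nearest center
        new_assignment = [
            min(range(len(centers)), key=lambda j: dist(p, centers[j]))
            for p in points
        ]
        # Terminate when no point changes cluster
        if new_assignment == assignment:
            return assignment, centers
        assignment = new_assignment
        # Steps 2-3: regroup the points and recompute each center
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centers[j] = mean(members)
```

Seeded with the first three sample points from the worked example below, this sketch reproduces the same final grouping that the walkthrough arrives at.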

The steps above involve two concepts:

1. Cluster center: since the data K-means handles is n-dimensional, a cluster center is an array of length n. Each of its entries equals the sum of that dimension's values over all elements in the cluster, divided by the number of elements in the cluster.

For example, if K-means is processing 2-dimensional data and a cluster contains the elements (1,1), (2,2), (3,4), then its cluster center is:

((1+2+3)/3,(1+2+4)/3) = (2,2.33)
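In Python, a cluster center is just the per-dimension mean:

```python
# Centroid of a cluster: the per-dimension mean of its elements
items = [(1, 1), (2, 2), (3, 4)]
center = [sum(dim) / len(items) for dim in zip(*items)]
# center[0] == 2.0, center[1] == 7/3, i.e. roughly 2.33
```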

2. The distance between an element and a cluster = the sum, over all dimensions, of the squared difference between the element and the cluster center in that dimension (the squared Euclidean distance).

For example, if a cluster's center is (2,3) and an element is (2,5), their distance is:

(2-2)*(2-2)+(3-5)*(3-5)=4
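The same computation in Python:

```python
# Squared Euclidean distance between a cluster center and an element
center, point = (2, 3), (2, 5)
distance = sum((c - p) ** 2 for c, p in zip(center, point))
# distance == 4
```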

2. Worked Example

Suppose the data we want to cluster is a set of points, each given by its coordinates (x, y); that is, the data is two-dimensional:

samples = [
    [1,2],
    [15,16],
    [0,0],
    [2,5],
    [6,6],
    [14,15],
    [8,8]
]

First we set the number of clusters K = 3 and initialize the clusters by assigning the samples at indices 0, 1, and 2 to them:

Cluster 0's items are [[1, 2]],centers are [1.0, 2.0]
Cluster 1's items are [[15, 16]],centers are [15.0, 16.0]
Cluster 2's items are [[0, 0]],centers are [0.0, 0.0]

The first round of distance computations then gives:

[1, 2]  and cluster 0 distance:0.0
[1, 2]  and cluster 1 distance:392.0
[1, 2]  and cluster 2 distance:5.0
[1, 2] 's nearest cluster is:Cluster 0
[15, 16]  and cluster 0 distance:392.0
[15, 16]  and cluster 1 distance:0.0
[15, 16]  and cluster 2 distance:481.0
[15, 16] 's nearest cluster is:Cluster 1
[0, 0]  and cluster 0 distance:5.0
[0, 0]  and cluster 1 distance:481.0
[0, 0]  and cluster 2 distance:0.0
[0, 0] 's nearest cluster is:Cluster 2
[2, 5]  and cluster 0 distance:10.0
[2, 5]  and cluster 1 distance:290.0
[2, 5]  and cluster 2 distance:29.0
[2, 5] 's nearest cluster is:Cluster 0
[6, 6]  and cluster 0 distance:41.0
[6, 6]  and cluster 1 distance:181.0
[6, 6]  and cluster 2 distance:72.0
[6, 6] 's nearest cluster is:Cluster 0
[14, 15]  and cluster 0 distance:338.0
[14, 15]  and cluster 1 distance:2.0
[14, 15]  and cluster 2 distance:421.0
[14, 15] 's nearest cluster is:Cluster 1
[8, 8]  and cluster 0 distance:85.0
[8, 8]  and cluster 1 distance:113.0
[8, 8]  and cluster 2 distance:128.0
[8, 8] 's nearest cluster is:Cluster 0

We move each element into its nearest cluster and recompute the cluster centers. The result:

Cluster 0's items are [[1, 2], [2, 5], [6, 6], [8, 8]],centers are [4.25, 5.25]
Cluster 1's items are [[15, 16], [14, 15]],centers are [14.5, 15.5]
Cluster 2's items are [[0, 0]],centers are [0.0, 0.0]

The second round again computes each element's distance to every cluster center:

[1, 2]  and cluster 0 distance:21.125
[1, 2]  and cluster 1 distance:364.5
[1, 2]  and cluster 2 distance:5.0
[1, 2] 's nearest cluster is:Cluster 2
[15, 16]  and cluster 0 distance:231.125
[15, 16]  and cluster 1 distance:0.5
[15, 16]  and cluster 2 distance:481.0
[15, 16] 's nearest cluster is:Cluster 1
[0, 0]  and cluster 0 distance:45.625
[0, 0]  and cluster 1 distance:450.5
[0, 0]  and cluster 2 distance:0.0
[0, 0] 's nearest cluster is:Cluster 2
[2, 5]  and cluster 0 distance:5.125
[2, 5]  and cluster 1 distance:266.5
[2, 5]  and cluster 2 distance:29.0
[2, 5] 's nearest cluster is:Cluster 0
[6, 6]  and cluster 0 distance:3.625
[6, 6]  and cluster 1 distance:162.5
[6, 6]  and cluster 2 distance:72.0
[6, 6] 's nearest cluster is:Cluster 0
[14, 15]  and cluster 0 distance:190.125
[14, 15]  and cluster 1 distance:0.5
[14, 15]  and cluster 2 distance:421.0
[14, 15] 's nearest cluster is:Cluster 1
[8, 8]  and cluster 0 distance:21.625
[8, 8]  and cluster 1 distance:98.5
[8, 8]  and cluster 2 distance:128.0
[8, 8] 's nearest cluster is:Cluster 0

This yields an updated nearest cluster for each element. We again move the elements into their nearest clusters and recompute the centers:

Cluster 0's items are [[2, 5], [6, 6], [8, 8]],centers are [5.333333333333333, 6.333333333333333]
Cluster 1's items are [[15, 16], [14, 15]],centers are [14.5, 15.5]
Cluster 2's items are [[1, 2], [0, 0]],centers are [0.5, 1.0]

The third round computes the distances once more.

[1, 2]  and cluster 0 distance:37.5555555556
[1, 2]  and cluster 1 distance:364.5
[1, 2]  and cluster 2 distance:1.25
[1, 2] 's nearest cluster is:Cluster 2
[15, 16]  and cluster 0 distance:186.888888889
[15, 16]  and cluster 1 distance:0.5
[15, 16]  and cluster 2 distance:435.25
[15, 16] 's nearest cluster is:Cluster 1
[0, 0]  and cluster 0 distance:68.5555555556
[0, 0]  and cluster 1 distance:450.5
[0, 0]  and cluster 2 distance:1.25
[0, 0] 's nearest cluster is:Cluster 2
[2, 5]  and cluster 0 distance:12.8888888889
[2, 5]  and cluster 1 distance:266.5
[2, 5]  and cluster 2 distance:18.25
[2, 5] 's nearest cluster is:Cluster 0
[6, 6]  and cluster 0 distance:0.555555555556
[6, 6]  and cluster 1 distance:162.5
[6, 6]  and cluster 2 distance:55.25
[6, 6] 's nearest cluster is:Cluster 0
[14, 15]  and cluster 0 distance:150.222222222
[14, 15]  and cluster 1 distance:0.5
[14, 15]  and cluster 2 distance:378.25
[14, 15] 's nearest cluster is:Cluster 1
[8, 8]  and cluster 0 distance:9.88888888889
[8, 8]  and cluster 1 distance:98.5
[8, 8]  and cluster 2 distance:105.25
[8, 8] 's nearest cluster is:Cluster 0

Every element's nearest cluster is unchanged from the second round, so the algorithm terminates; we have our result.

The result is:

Cluster 0 [[2, 5], [6, 6], [8, 8]]
Cluster 1 [[15, 16], [14, 15]]
Cluster 2 [[1, 2], [0, 0]]

Plotting the result shows the grouping is quite reasonable.


3. Source Code

The algorithm's source code is as follows:

import json
import math

# The data set D: a collection of 2-dimensional points.
# Each point's two values are its (x, y) coordinates for plotting.
samples = [
    [1, 2],
    [15, 16],
    [0, 0],
    [2, 5],
    [6, 6],
    [14, 15],
    [8, 8]
]
# Predefined number of clusters
K = 3
clusters = list()

class Cluster(object):
    def __init__(self, clusterid):
        self.clusterid = clusterid
        self.init()
    def init(self):
        # Reset the cluster to an empty state
        self.centers = list()
        self.items = list()
    def add_item(self, val):
        self.items.append(val)
    def get_dimensions(self):
        if len(self.items) == 0:
            return 0
        return len(self.items[0])
    def adjust_center(self):
        # The center is the per-dimension mean of the cluster's items
        dimensions = self.get_dimensions()
        if dimensions == 0:
            return
        self.centers = [0] * dimensions
        for item in self.items:
            for j in range(dimensions):
                self.centers[j] += item[j]
        for j in range(dimensions):
            self.centers[j] = float(self.centers[j]) / len(self.items)
    def __str__(self):
        return "Cluster %s's items are %s,centers are %s" % (
            self.clusterid, json.dumps(self.items), json.dumps(self.centers))

def print_clusters():
    for cluster in clusters:
        print(cluster)

def kmeans():
    # Step 1 of k-means: create K clusters, seeding each with one of
    # the first K points of samples (indices 0, 1, 2 here).
    for i in range(K):
        curcluster = Cluster(i)
        curcluster.add_item(samples[i])
        curcluster.adjust_center()
        clusters.append(curcluster)
    # Print the initial state
    print_clusters()

    pre_record = dict()
    cur_record = cal_distance()
    print("**************************************")
    # Iterate until the nearest-cluster assignment stops changing
    while pre_record != cur_record:
        pre_record = cur_record
        adjust(cur_record)
        print_clusters()
        cur_record = cal_distance()
        print("**************************************")
    print("The result is:")
    for cluster in clusters:
        print('Cluster %s' % cluster.clusterid, cluster.items)

# Move each point into the cluster nearest to it
def adjust(record):
    for cluster in clusters:
        # Empty every cluster before reassigning the points
        cluster.init()
    for i in range(len(samples)):
        clusters[record[i]].add_item(samples[i])
    # Recompute the cluster centers from the new memberships
    for cluster in clusters:
        cluster.adjust_center()

# Compute each point's distance to every cluster center and record
# the nearest cluster in a dict of {point index: cluster index}
def cal_distance():
    record = dict()
    for i in range(len(samples)):
        curdistance = float('inf')
        curindex = -1
        for j in range(len(clusters)):
            distance = cal(samples[i], clusters[j].centers)
            print(samples[i], " and cluster %s distance:%s" % (j, distance))
            if distance < curdistance:
                curdistance = distance
                curindex = j
        print(samples[i], "'s nearest cluster is:Cluster %s" % curindex)
        record[i] = curindex
    return record

# Distance between two vectors: the sum of squared per-dimension differences
def cal(params1, params2):
    # If the dimensions don't even match, treat the distance as infinite
    if len(params1) != len(params2):
        return float('inf')
    distance = 0.0
    for i in range(len(params1)):
        distance += math.pow(params1[i] - params2[i], 2)
    return distance

if __name__ == '__main__':
    kmeans()

The algorithm prints the following:

Cluster 0's items are [[1, 2]],centers are [1.0, 2.0]
Cluster 1's items are [[15, 16]],centers are [15.0, 16.0]
Cluster 2's items are [[0, 0]],centers are [0.0, 0.0]
[1, 2]  and cluster 0 distance:0.0
[1, 2]  and cluster 1 distance:392.0
[1, 2]  and cluster 2 distance:5.0
[1, 2] 's nearest cluster is:Cluster 0
[15, 16]  and cluster 0 distance:392.0
[15, 16]  and cluster 1 distance:0.0
[15, 16]  and cluster 2 distance:481.0
[15, 16] 's nearest cluster is:Cluster 1
[0, 0]  and cluster 0 distance:5.0
[0, 0]  and cluster 1 distance:481.0
[0, 0]  and cluster 2 distance:0.0
[0, 0] 's nearest cluster is:Cluster 2
[2, 5]  and cluster 0 distance:10.0
[2, 5]  and cluster 1 distance:290.0
[2, 5]  and cluster 2 distance:29.0
[2, 5] 's nearest cluster is:Cluster 0
[6, 6]  and cluster 0 distance:41.0
[6, 6]  and cluster 1 distance:181.0
[6, 6]  and cluster 2 distance:72.0
[6, 6] 's nearest cluster is:Cluster 0
[14, 15]  and cluster 0 distance:338.0
[14, 15]  and cluster 1 distance:2.0
[14, 15]  and cluster 2 distance:421.0
[14, 15] 's nearest cluster is:Cluster 1
[8, 8]  and cluster 0 distance:85.0
[8, 8]  and cluster 1 distance:113.0
[8, 8]  and cluster 2 distance:128.0
[8, 8] 's nearest cluster is:Cluster 0
**************************************
Cluster 0's items are [[1, 2], [2, 5], [6, 6], [8, 8]],centers are [4.25, 5.25]
Cluster 1's items are [[15, 16], [14, 15]],centers are [14.5, 15.5]
Cluster 2's items are [[0, 0]],centers are [0.0, 0.0]
[1, 2]  and cluster 0 distance:21.125
[1, 2]  and cluster 1 distance:364.5
[1, 2]  and cluster 2 distance:5.0
[1, 2] 's nearest cluster is:Cluster 2
[15, 16]  and cluster 0 distance:231.125
[15, 16]  and cluster 1 distance:0.5
[15, 16]  and cluster 2 distance:481.0
[15, 16] 's nearest cluster is:Cluster 1
[0, 0]  and cluster 0 distance:45.625
[0, 0]  and cluster 1 distance:450.5
[0, 0]  and cluster 2 distance:0.0
[0, 0] 's nearest cluster is:Cluster 2
[2, 5]  and cluster 0 distance:5.125
[2, 5]  and cluster 1 distance:266.5
[2, 5]  and cluster 2 distance:29.0
[2, 5] 's nearest cluster is:Cluster 0
[6, 6]  and cluster 0 distance:3.625
[6, 6]  and cluster 1 distance:162.5
[6, 6]  and cluster 2 distance:72.0
[6, 6] 's nearest cluster is:Cluster 0
[14, 15]  and cluster 0 distance:190.125
[14, 15]  and cluster 1 distance:0.5
[14, 15]  and cluster 2 distance:421.0
[14, 15] 's nearest cluster is:Cluster 1
[8, 8]  and cluster 0 distance:21.625
[8, 8]  and cluster 1 distance:98.5
[8, 8]  and cluster 2 distance:128.0
[8, 8] 's nearest cluster is:Cluster 0
**************************************
Cluster 0's items are [[2, 5], [6, 6], [8, 8]],centers are [5.333333333333333, 6.333333333333333]
Cluster 1's items are [[15, 16], [14, 15]],centers are [14.5, 15.5]
Cluster 2's items are [[1, 2], [0, 0]],centers are [0.5, 1.0]
[1, 2]  and cluster 0 distance:37.5555555556
[1, 2]  and cluster 1 distance:364.5
[1, 2]  and cluster 2 distance:1.25
[1, 2] 's nearest cluster is:Cluster 2
[15, 16]  and cluster 0 distance:186.888888889
[15, 16]  and cluster 1 distance:0.5
[15, 16]  and cluster 2 distance:435.25
[15, 16] 's nearest cluster is:Cluster 1
[0, 0]  and cluster 0 distance:68.5555555556
[0, 0]  and cluster 1 distance:450.5
[0, 0]  and cluster 2 distance:1.25
[0, 0] 's nearest cluster is:Cluster 2
[2, 5]  and cluster 0 distance:12.8888888889
[2, 5]  and cluster 1 distance:266.5
[2, 5]  and cluster 2 distance:18.25
[2, 5] 's nearest cluster is:Cluster 0
[6, 6]  and cluster 0 distance:0.555555555556
[6, 6]  and cluster 1 distance:162.5
[6, 6]  and cluster 2 distance:55.25
[6, 6] 's nearest cluster is:Cluster 0
[14, 15]  and cluster 0 distance:150.222222222
[14, 15]  and cluster 1 distance:0.5
[14, 15]  and cluster 2 distance:378.25
[14, 15] 's nearest cluster is:Cluster 1
[8, 8]  and cluster 0 distance:9.88888888889
[8, 8]  and cluster 1 distance:98.5
[8, 8]  and cluster 2 distance:105.25
[8, 8] 's nearest cluster is:Cluster 0
**************************************
The result is:
Cluster 0 [[2, 5], [6, 6], [8, 8]]
Cluster 1 [[15, 16], [14, 15]]
Cluster 2 [[1, 2], [0, 0]]

4. Extending the Idea

The example in this article happens to cluster numeric data. What if what we need to cluster is not numbers but text?

What we need is a way to turn text into numbers. For a data set D, the number of documents is fixed, so the set of words they contain is also finite (for Chinese text, word segmentation is needed to extract the words). Each document can then be represented as a mapping of {word → that word's TF-IDF score}. Note that, to keep every document (and every cluster) in the same dimensionality, each document's mapping covers the full vocabulary, i.e. the union of all words across all documents. What does that mean?

For example, given three documents:

1. 我是中国人 ("I am Chinese")

2. 中国富强 ("China is prosperous")

3. 别克汽车 ("Buick cars")

the full vocabulary across the three documents is {我, 是, 中国, 人, 富强, 别克, 汽车}.

The first document can then be represented as {我:1, 是:1, 中国:1, 人:1, 富强:0, 别克:0, 汽车:0} (shown here with raw counts; in practice each value would be the word's TF-IDF score).

The remaining documents follow the same pattern. With this model the documents are numeric and all share the same dimensionality, so we can initialize clusters and compute each document's distance to a cluster exactly as K-means requires (plugging the TF-IDF values into a text-similarity formula, such as cosine similarity, for the distance).
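A minimal sketch of this vectorization, using raw word counts in place of TF-IDF to keep it short, and with the Chinese segmentation written out by hand rather than produced by a real segmenter:

```python
# Hand-segmented documents (a real system would use a Chinese word segmenter)
docs = [
    ["我", "是", "中国", "人"],   # 我是中国人
    ["中国", "富强"],             # 中国富强
    ["别克", "汽车"],             # 别克汽车
]

# Full vocabulary: the union of words across all documents, in first-seen order
vocab = []
for doc in docs:
    for word in doc:
        if word not in vocab:
            vocab.append(word)

# Each document becomes a vector over the full vocabulary, so every
# document shares the same dimensionality (counts here, not TF-IDF)
vectors = [[doc.count(word) for word in vocab] for doc in docs]
print(vocab)       # ['我', '是', '中国', '人', '富强', '别克', '汽车']
print(vectors[0])  # [1, 1, 1, 1, 0, 0, 0]
```

Once every document is such a fixed-length vector, the clusters and distances from the sections above apply unchanged.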




