K-means Algorithm Explained, with Python Code

How the K-means Algorithm Works

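    K-means is a form of prototype-based clustering. Prototype-based algorithms assume that the cluster structure can be characterized by a set of prototypes, and they are very widely used in practical clustering tasks. Typically, such an algorithm first initializes the prototypes and then updates them iteratively until it reaches a solution.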

    Given a sample set D = \left \{ x_{1},x_{2},...,x_{n} \right \}, K-means seeks the cluster partition C = \left \{ C_{1}, C_{2},..., C_{k} \right \} that minimizes the squared error (the error function):

        E = \sum_{i=1}^{k}\sum_{x\in C_{i}}\left \| x - \mu _{i} \right \|_{2}^{2}

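 where \mu _{i} =\frac{1}{\left | C_{i} \right |} \sum _{x\in C_{i}}x is the mean vector of cluster C_{i}. Minimizing this error function is not easy: finding the optimal solution requires examining every possible partition of D into k clusters, which is an NP-hard problem. K-means therefore adopts a greedy strategy and approximates the minimum by iterative optimization.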
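To make the error function concrete, here is a minimal sketch that computes E for a given partition. The helper name `squared_error` and the toy 1-D points are assumptions made for illustration; they are not part of the code later in this post.

```python
def squared_error(clusters):
    """Sum of squared Euclidean distances from each sample to its cluster mean."""
    total = 0.0
    for cluster in clusters:
        if not cluster:
            continue  # an empty cluster contributes nothing to E
        dim = len(cluster[0])
        # Mean vector mu_i of this cluster
        mean = [sum(p[d] for p in cluster) / len(cluster) for d in range(dim)]
        total += sum((p[d] - mean[d]) ** 2 for p in cluster for d in range(dim))
    return total

# Two ways of splitting the same four 1-D points into k = 2 clusters
good = [[[0.0], [1.0]], [[10.0], [11.0]]]
bad = [[[0.0], [10.0]], [[1.0], [11.0]]]
print(squared_error(good))  # 1.0
print(squared_error(bad))   # 100.0
```

The partition that groups nearby points together has a far smaller E, which is exactly what K-means tries to achieve.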

K-means Pseudocode

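Input: sample set D = \left \{ x_{1},x_{2},...,x_{n} \right \}

            number of clusters: k

            termination condition: a maximum number of iterations (iteration), or a threshold \alpha on the change in error \Delta E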

Procedure:

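  1. Randomly select k samples from D as the initial mean vectors: \left \{ \mu _{1},\mu _{2}, ... , \mu _{k} \right \}
  2. repeat
  3.          set C_{i} = \emptyset (1 \leqslant i \leqslant k)
  4.          for j = 1, 2, ... , n
  5.                compute the distance d_{ji} between sample x_{j} and each mean vector \mu _{i} (1 \leqslant i \leqslant k)
  6.                assign x_{j} the label of its nearest mean vector: \lambda _{j} = \arg\min _{i} d_{ji}
  7.                add x_{j} to the corresponding cluster: C_{\lambda _{j}} = C_{\lambda _{j}} \cup \left \{ x_{j} \right \}
  8.         end for
  9.         for  i = 1, 2, ..., k
  10.                compute the new mean vector \mu _{i}^{'} = \frac{1}{\left | C_{i} \right |}\sum _{x \in C_{i}} x and use it as \mu _{i} in the next pass
  11.         end for
  12. until the termination condition is met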

Output: cluster partition C = \left \{ C_{1}, C_{2}, ... ,C_{k}\right \}

K-means Python code:

import math
import matplotlib.pyplot as plt
import random

def getEuclidean(point1, point2):
    dimension = len(point1)
    dist = 0.0
    for i in range(dimension):
        dist += (point1[i] - point2[i]) ** 2
    return math.sqrt(dist)

def k_means(dataset, k, iteration):
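    # Initialize the centroid vectors with k distinct random samples
    index = random.sample(range(len(dataset)), k)
    vectors = []
    for i in index:
        vectors.append(dataset[i])
    # Initialize the labels
    labels = []
    for i in range(len(dataset)):
        labels.append(-1)
    # Repeat the k-means step for the given number of iterations
    while(iteration > 0):
        # Start each pass with empty clusters
        C = []
        for i in range(k):
            C.append([])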
        for labelIndex, item in enumerate(dataset):
            classIndex = -1
            minDist = float('inf')  # no fixed upper bound on distances
            for i, point in enumerate(vectors):
                dist = getEuclidean(item, point)
                if(dist < minDist):
                    classIndex = i
                    minDist = dist
            C[classIndex].append(item)
            labels[labelIndex] = classIndex
        # Recompute each centroid as the mean of its cluster
        for i, cluster in enumerate(C):
            if not cluster:
                # An empty cluster keeps its previous centroid
                continue
            clusterHeart = []
            dimension = len(dataset[0])
            for j in range(dimension):
                clusterHeart.append(0)
            for item in cluster:
                for j, coordinate in enumerate(item):
                    clusterHeart[j] += coordinate / len(cluster)
            vectors[i] = clusterHeart
        iteration -= 1
    return C, labels
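
The function above stops only after a fixed number of iterations. The pseudocode also allows stopping once the change in error \Delta E falls below a threshold \alpha. The following is a minimal, self-contained sketch of that variant; the name `k_means_tol` and the parameters `alpha`, `max_iter`, and `seed` are assumptions introduced for illustration.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means_tol(dataset, k, alpha=1e-6, max_iter=100, seed=None):
    """k-means that stops once E decreases by less than alpha (hypothetical helper)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(dataset, k)]
    labels = []
    prev_error = float('inf')
    error = float('inf')
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every sample
        labels = [min(range(k), key=lambda i: sq_dist(p, centers[i]))
                  for p in dataset]
        # Update step: mean of each non-empty cluster
        for i in range(k):
            members = [p for p, lab in zip(dataset, labels) if lab == i]
            if members:
                centers[i] = [sum(c) / len(members) for c in zip(*members)]
        # Squared error E measured against the updated centers
        error = sum(sq_dist(p, centers[lab]) for p, lab in zip(dataset, labels))
        if prev_error - error < alpha:
            break
        prev_error = error
    return centers, labels, error

points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centers, labels, error = k_means_tol(points, 2, seed=0)
print(labels, error)
```

Because E never increases from one pass to the next, the difference `prev_error - error` is a reasonable convergence signal; `max_iter` remains as a safety cap.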

Test and results:

# Dataset: each group of three values is a watermelon's ID, density, and sugar content
data = """
1,0.697,0.46,2,0.774,0.376,3,0.634,0.264,4,0.608,0.318,5,0.556,0.215,
6,0.403,0.237,7,0.481,0.149,8,0.437,0.211,9,0.666,0.091,10,0.243,0.267,
11,0.245,0.057,12,0.343,0.099,13,0.639,0.161,14,0.657,0.198,15,0.36,0.37,
16,0.593,0.042,17,0.719,0.103,18,0.359,0.188,19,0.339,0.241,20,0.282,0.257,
21,0.748,0.232,22,0.714,0.346,23,0.483,0.312,24,0.478,0.437,25,0.525,0.369,
26,0.751,0.489,27,0.532,0.472,28,0.473,0.376,29,0.725,0.445,30,0.446,0.459"""

# Data preparation: dataset is a list of 30 samples, each [density, sugar content]
a = data.split(',')
dataset = [[float(a[i]), float(a[i+1])] for i in range(1, len(a)-1, 3)]
C, labels = k_means(dataset, 3, 20)

colValue = ['r', 'y', 'g', 'b', 'c', 'k', 'm']
for i in range(len(C)):
    coo_X = []    # x coordinates
    coo_Y = []    # y coordinates
    for j in range(len(C[i])):
        coo_X.append(C[i][j][0])
        coo_Y.append(C[i][j][1])
    plt.scatter(coo_X, coo_Y, marker='x', color=colValue[i%len(colValue)], label=i)

#plt.legend(loc='upper right')
plt.show()
print(labels)
