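How the K-means algorithm works
K-means clustering is a form of prototype-based clustering. Prototype-based clustering algorithms assume that the cluster structure can be characterized by a set of prototypes, and they are very widely used in practical clustering tasks. Typically, such an algorithm first initializes the prototypes and then updates them iteratively until a solution is reached.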
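Given a sample set $D = \{x_1, x_2, \dots, x_m\}$, the K-means algorithm minimizes the squared error (the error function) of the resulting cluster partition $C = \{C_1, C_2, \dots, C_k\}$:
$$E = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert_2^2$$
where $\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$ is the mean vector of cluster $C_i$. Minimizing the K-means error function is not easy: finding the optimal solution requires examining every possible cluster partition of the sample set, which is an NP-hard problem. K-means therefore adopts a greedy strategy and approximates a minimizer of the error function by iterative optimization.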
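To make the squared error concrete, here is a minimal sketch that computes it for a given partition: for each cluster it takes the mean vector, then sums the squared Euclidean distances from the cluster's points to that mean. The function names `mean_vector` and `squared_error` and the toy points are illustrative assumptions, not part of the original code.

```python
def mean_vector(cluster):
    """Component-wise mean of a non-empty list of equal-length points."""
    n = len(cluster)
    return [sum(coords) / n for coords in zip(*cluster)]


def squared_error(clusters):
    """Sum, over all clusters, of squared Euclidean distances to the cluster mean."""
    total = 0.0
    for cluster in clusters:
        if not cluster:  # an empty cluster contributes nothing
            continue
        mu = mean_vector(cluster)
        for x in cluster:
            total += sum((a - b) ** 2 for a, b in zip(x, mu))
    return total


# Hypothetical toy data: the same four 2-D points, partitioned two ways.
good_split = [[[0.0, 0.0], [0.0, 2.0]], [[10.0, 0.0], [10.0, 2.0]]]
bad_split = [[[0.0, 0.0], [10.0, 0.0]], [[0.0, 2.0], [10.0, 2.0]]]

print(squared_error(good_split))  # 4.0
print(squared_error(bad_split))   # 100.0
```

The partition that groups nearby points has the smaller error, which is exactly what K-means tries to achieve.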
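K-means pseudocode
Input: sample set $D = \{x_1, x_2, \dots, x_m\}$;
number of clusters: $k$;
termination condition: a number of iterations (iteration) or a threshold on the change between successive iterations (for example, in the error $E$)
Procedure:
- Randomly select $k$ samples from $D$ as the initial mean vectors: $\{\mu_1, \mu_2, \dots, \mu_k\}$
- repeat
  - Set $C_i = \varnothing$ for $1 \le i \le k$
  - for $j = 1, 2, \dots, m$
    - Compute the distance between sample $x_j$ and each mean vector $\mu_i$ ($1 \le i \le k$): $d_{ji} = \lVert x_j - \mu_i \rVert_2$
    - Determine the cluster label of $x_j$ from the nearest mean vector: $\lambda_j = \arg\min_{i \in \{1, 2, \dots, k\}} d_{ji}$
    - Put sample $x_j$ into the corresponding cluster: $C_{\lambda_j} = C_{\lambda_j} \cup \{x_j\}$
  - end for
  - for $i = 1, 2, \dots, k$
    - Compute the new mean vector: $\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$
  - end for
- until the termination condition is met
Output: cluster partition $C = \{C_1, C_2, \dots, C_k\}$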
Python code for K-means:
import math
import matplotlib.pyplot as plt
import random

# Euclidean distance between two points of the same dimension
def getEuclidean(point1, point2):
    dimension = len(point1)
    dist = 0.0
    for i in range(dimension):
        dist += (point1[i] - point2[i]) ** 2
    return math.sqrt(dist)

def k_means(dataset, k, iteration):
    # Initialize the cluster centers with k randomly chosen samples
    index = random.sample(range(len(dataset)), k)
    vectors = [dataset[i] for i in index]
    # Initialize the labels
    labels = [-1] * len(dataset)
    # Repeat the K-means procedure for the given number of iterations
    while iteration > 0:
        # Start each iteration with empty clusters
        C = [[] for _ in range(k)]
        # Assign every sample to its nearest cluster center
        for labelIndex, item in enumerate(dataset):
            classIndex = -1
            minDist = float('inf')
            for i, point in enumerate(vectors):
                dist = getEuclidean(item, point)
                if dist < minDist:
                    classIndex = i
                    minDist = dist
            C[classIndex].append(item)
            labels[labelIndex] = classIndex
        # Recompute each cluster center as the mean of its members
        for i, cluster in enumerate(C):
            if not cluster:
                # An empty cluster keeps its previous center
                continue
            clusterHeart = [0.0] * len(dataset[0])
            for item in cluster:
                for j, coordinate in enumerate(item):
                    clusterHeart[j] += coordinate / len(cluster)
            vectors[i] = clusterHeart
        iteration -= 1
    return C, labels
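The listing above stops only after a fixed number of iterations. The other termination option mentioned in the pseudocode, a threshold, could look roughly like the sketch below. It assumes the threshold applies to how far the mean vectors move between iterations, which is one common choice but not something the original text specifies. The function name `k_means_tol` and the parameters `max_iter`, `tol`, and `seed` are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math
import random


def k_means_tol(dataset, k, max_iter=100, tol=1e-6, seed=None):
    """K-means that stops once no mean vector moves more than `tol`.

    Assumption: the threshold is on center movement; `max_iter` (at least 1)
    is a safety cap on the number of iterations.
    """
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(dataset, k)]
    labels = [-1] * len(dataset)
    for iteration in range(1, max_iter + 1):
        # Assignment step: label each sample with its nearest center.
        clusters = [[] for _ in range(k)]
        for idx, x in enumerate(dataset):
            labels[idx] = min(range(k), key=lambda i: math.dist(x, centers[i]))
            clusters[labels[idx]].append(x)
        # Update step: recompute each mean; an empty cluster keeps its old center.
        new_centers = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centers.append([sum(c) / len(cluster) for c in zip(*cluster)])
            else:
                new_centers.append(centers[i])
        # Largest distance any center moved during this iteration.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return clusters, labels, centers, iteration


# Hypothetical toy data: two well-separated groups of 2-D points.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [9.0, 9.0], [9.0, 10.0], [10.0, 9.0]]
clusters, labels, centers, n_iter = k_means_tol(points, 2, seed=0)
print(n_iter, labels)
```

On small, well-separated data the loop usually ends after a handful of iterations, well before `max_iter`, so a threshold avoids running assignment and update steps that no longer change anything.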
Test and results:
# Dataset: each group of three values is a watermelon's ID, density, and sugar content
data = """
1,0.697,0.46,2,0.774,0.376,3,0.634,0.264,4,0.608,0.318,5,0.556,0.215,
6,0.403,0.237,7,0.481,0.149,8,0.437,0.211,9,0.666,0.091,10,0.243,0.267,
11,0.245,0.057,12,0.343,0.099,13,0.639,0.161,14,0.657,0.198,15,0.36,0.37,
16,0.593,0.042,17,0.719,0.103,18,0.359,0.188,19,0.339,0.241,20,0.282,0.257,
21,0.748,0.232,22,0.714,0.346,23,0.483,0.312,24,0.478,0.437,25,0.525,0.369,
26,0.751,0.489,27,0.532,0.472,28,0.473,0.376,29,0.725,0.445,30,0.446,0.459"""
# Preprocessing: dataset is a list of 30 samples, each [density, sugar content]
a = data.split(',')
dataset = [[float(a[i]), float(a[i+1])] for i in range(1, len(a)-1, 3)]
C, labels = k_means(dataset, 3, 20)
colValue = ['r', 'y', 'g', 'b', 'c', 'k', 'm']
for i in range(len(C)):
    coo_X = []  # list of x coordinates
    coo_Y = []  # list of y coordinates
    for j in range(len(C[i])):
        coo_X.append(C[i][j][0])
        coo_Y.append(C[i][j][1])
    plt.scatter(coo_X, coo_Y, marker='x', color=colValue[i%len(colValue)], label=i)
#plt.legend(loc='upper right')
plt.show()
print(labels)