1. An introduction to the k-means algorithm:
k-means is an unsupervised learning algorithm for numeric data, used to discover the internal structure of a dataset.
(1) The k-means workflow (pseudocode)
1> Create k points as the initial cluster centers (often chosen at random)
2> For each data point in the dataset, compute its distance to every centroid
3> Assign each data point to the cluster whose center is nearest
4> For each cluster, compute the mean of all its points and make that mean the new cluster center
5> Repeat steps 2, 3, and 4 until the new cluster centers move less than a given threshold from the previous ones
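The steps above can be sketched compactly with NumPy. This is a minimal sketch, not the implementation used later in this post; the function name `kmeans` and the tolerance parameter `tol` are illustrative:

```python
import numpy as np

def kmeans(points, k, tol=1e-4, max_iter=100, rng=None):
    """Minimal k-means: points is an (n, d) array; returns (centers, labels)."""
    rng = np.random.default_rng(rng)
    # Step 1: pick k random data points as the initial cluster centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Steps 2-3: compute every point-to-center distance and assign
        # each point to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each center as the mean of its assigned points
        # (assumes no cluster becomes empty).
        new_centers = np.array([points[labels == j].mean(axis=0)
                                for j in range(k)])
        # Step 5: stop once the centers move less than the threshold.
        if np.linalg.norm(new_centers - centers) < tol:
            return new_centers, labels
        centers = new_centers
    return centers, labels
```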
(2) For k = 2, the K-means clustering process on n sample points is shown in the figure below:
2. Implementing k-means in Python:
To test the algorithm, we can generate several clusters of normally distributed points, classify them with k-means, and then use matplotlib to plot both the k-means result and the known ground truth for comparison.
Taking three clusters of normally distributed points as an example, the Python code is as follows:
import numpy
import random
import pylab as pl

def cal_distance(a, b):  # squared Euclidean distance between two points
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

# Generate 3 clusters of normally distributed points
a1 = numpy.round(numpy.random.normal(20, 5, 250), 2)
b1 = numpy.round(numpy.random.normal(24, 5, 250), 2)
a2 = numpy.round(numpy.random.normal(5, 5, 250), 2)
b2 = numpy.round(numpy.random.normal(4, 5, 250), 2)
a3 = numpy.round(numpy.random.normal(40, 5, 250), 2)
b3 = numpy.round(numpy.random.normal(1, 5, 250), 2)
a = list(a1) + list(a2) + list(a3)
b = list(b1) + list(b2) + list(b3)

# Pick 3 random points as the initial cluster centers
k1 = [random.uniform(0, 50) for _ in range(2)]
k2 = [random.uniform(0, 50) for _ in range(2)]
k3 = [random.uniform(0, 50) for _ in range(2)]

while True:
    clu_k1 = []  # the 3 clusters, rebuilt on every iteration
    clu_k2 = []
    clu_k3 = []
    for i in range(750):
        # Compute the distance from each sample to every cluster center
        ab_distance1 = cal_distance(k1, [a[i], b[i]])
        ab_distance2 = cal_distance(k2, [a[i], b[i]])
        ab_distance3 = cal_distance(k3, [a[i], b[i]])
        # Assign each sample to its nearest cluster center
        if ab_distance1 <= ab_distance2 and ab_distance1 <= ab_distance3:
            clu_k1.append(i)
        elif ab_distance2 <= ab_distance3:
            clu_k2.append(i)
        else:
            clu_k3.append(i)
    # Compute each cluster's centroid; it becomes the new cluster center
    # (assumes no cluster ends up empty)
    k1_x = sum(a[i] for i in clu_k1) / len(clu_k1)
    k1_y = sum(b[i] for i in clu_k1) / len(clu_k1)
    k2_x = sum(a[i] for i in clu_k2) / len(clu_k2)
    k2_y = sum(b[i] for i in clu_k2) / len(clu_k2)
    k3_x = sum(a[i] for i in clu_k3) / len(clu_k3)
    k3_y = sum(b[i] for i in clu_k3) / len(clu_k3)
    # Check whether any new center moved more than the threshold 0.1
    if (cal_distance(k1, [k1_x, k1_y]) > 0.1 or
            cal_distance(k2, [k2_x, k2_y]) > 0.1 or
            cal_distance(k3, [k3_x, k3_y]) > 0.1):
        # Centers still moving: the centroids replace the old centers
        k1 = [k1_x, k1_y]
        k2 = [k2_x, k2_y]
        k3 = [k3_x, k3_y]
    else:
        break  # centers have (almost) stopped moving; iteration ends

# After convergence, store each cluster's x and y coordinates as lists
kv1_x = [a[i] for i in clu_k1]
kv1_y = [b[i] for i in clu_k1]
kv2_x = [a[i] for i in clu_k2]
kv2_y = [b[i] for i in clu_k2]
kv3_x = [a[i] for i in clu_k3]
kv3_y = [b[i] for i in clu_k3]
print('Cluster 1:', list(zip(kv1_x, kv1_y)), '\n')  # zip pairs each x with its y
print('Cluster 2:', list(zip(kv2_x, kv2_y)), '\n')
print('Cluster 3:', list(zip(kv3_x, kv3_y)))

pl.figure(1)  # all points in one color
pl.plot(a1, b1, 'ob')
pl.plot(a2, b2, 'ob')
pl.plot(a3, b3, 'ob')
pl.figure(2)  # the known ground-truth clusters
pl.plot(a1, b1, 'o')
pl.plot(a2, b2, 'o')
pl.plot(a3, b3, 'o')
pl.figure(3)  # the clusters found by k-means
pl.plot(kv1_x, kv1_y, '*')
pl.plot(kv2_x, kv2_y, '+')
pl.plot(kv3_x, kv3_y, 'o')
pl.show()
The final classification result we obtain is as follows:
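Beyond eyeballing the plots, one simple numeric check is the within-cluster sum of squares (WCSS), the quantity k-means implicitly minimizes. The helper below is a hypothetical addition, not part of the listing above; running the algorithm several times with different random starting centers and keeping the lowest-WCSS result guards against an unlucky initialization:

```python
import numpy as np

def wcss(points, centers, labels):
    """Within-cluster sum of squares: the objective that k-means minimizes."""
    return sum(np.sum((points[labels == j] - c) ** 2)
               for j, c in enumerate(centers))
```

A lower WCSS means tighter clusters, so it gives a single number for comparing two clusterings of the same data.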