Implementing the Standard Kmeans Algorithm in Python
Kmeans is very widely used, especially in unsupervised learning. It is simple and easy to understand, yet still quite powerful.
1 How the Kmeans algorithm works
The basic idea is to group the data by distance: each point is assigned to whichever cluster center it is closest to.
For reference: http://www.csdn.net/article/2012-07-03/2807073-k-means
That article explains it well; below is my own summary.
Algorithm overview
1. Randomly pick K seed points from the data, where K is set by the user;
2. For every point in the data set, compute its distance to each of the K seed points (centroids), and assign the point to the cluster of the nearest seed; Euclidean distance is the usual choice;
3. Move each seed point to the center of its cluster; the center is the arithmetic mean of the cluster's members, taken dimension by dimension (see the formulas after this list);
4. If the new centers differ from the previous ones, i.e. the algorithm has not yet converged, repeat steps (2) and (3).
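In symbols (a standard statement of steps 2 and 3, with $x$ a sample, $c_j$ the $j$-th centroid, $D$ the dimensionality, and $C_j$ the set of points currently assigned to cluster $j$):

$$d(x, c_j) = \sqrt{\sum_{m=1}^{D} (x_m - c_{j,m})^2}, \qquad c_j \leftarrow \frac{1}{|C_j|} \sum_{x \in C_j} x$$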
(Figure illustrating these iterations omitted.)
2 Implementing Kmeans in Python
First, create a simple data set (dataset.txt):
1.658985 4.285136
-3.453687 3.424321
4.838138 -1.151539
-5.379713 -3.362104
0.972564 2.924086
-3.567919 1.531611
0.450614 -3.302219
-3.487105 -1.724432
2.668759 1.594842
-3.156485 3.191137
3.165506 -3.999838
-2.786837 -3.099354
4.208187 2.984927
-2.123337 2.943366
0.704199 -0.479481
-0.392370 -3.963704
2.831667 1.574018
-0.790153 3.343144
2.943496 -3.357075
-3.195883 -2.283926
2.336445 2.875106
-1.786345 2.554248
2.190101 -1.906020
-3.403367 -2.778288
1.778124 3.880832
-1.688346 2.230267
2.592976 -2.054368
-4.007257 -3.207066
2.257734 3.387564
-2.679011 0.785119
0.939512 -4.023563
-3.674424 -2.261084
2.046259 2.735279
-3.189470 1.780269
4.372646 -0.822248
-2.579316 -3.497576
1.889034 5.190400
-0.798747 2.185588
2.836520 -2.658556
-3.837877 -3.253815
2.096701 3.886007
-2.709034 2.923887
3.367037 -3.184789
-2.121479 -4.232586
2.329546 3.179764
-3.284816 3.273099
3.091414 -3.815232
-3.762093 -2.432191
3.542056 2.778832
-1.736822 4.241041
2.127073 -2.983680
-4.323818 -3.938116
3.792121 5.135768
-4.786473 3.358547
2.624081 -3.260715
-4.009299 -2.978115
2.493525 1.963710
-2.513661 2.642162
1.864375 -3.176309
-3.171184 -3.572452
2.894220 2.489128
-2.562539 2.884438
3.491078 -3.947487
-2.565729 -2.012114
3.332948 3.983102
-1.616805 3.573188
2.280615 -2.559444
-2.651229 -3.103198
2.321395 3.154987
-1.685703 2.939697
3.031012 -3.620252
-4.599622 -2.185829
4.196223 1.126677
-2.133863 3.093686
4.668892 -2.562705
-2.793241 -2.149706
2.884105 3.043438
-2.967647 2.848696
4.479332 -1.764772
-4.905566 -2.911070
The Kmeans routines (kmeans2.py):
# -*- coding:utf-8 -*-
# kmeans : k-means cluster
from numpy import *
import time
import matplotlib.pyplot as plt

# compute the Euclidean distance between two vectors
def euclDistance(vector1, vector2):
    return sqrt(sum(pow(vector2 - vector1, 2)))  # pow() is a Python built-in; it works element-wise on numpy arrays
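# sanity check (hypothetical values): euclDistance(array([0.0, 0.0]), array([3.0, 4.0])) gives 5.0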
# initialize the centroids with randomly chosen samples
def initCentroids(dataSet, k):
    # k is the number of seed points (centroids) chosen by the user
    # dataSet is a 2-D numpy array (or mat object)
    numSamples, dim = dataSet.shape
    # numSamples - number of rows, i.e. the number of samples
    # dim - number of columns, i.e. the dimensionality (dim = 2 for x/y data)
    centroids = zeros((k, dim))  # k x dim matrix of zeros
    # draw k *distinct* row indices so that no two centroids start at the same sample
    # (the original int(random.uniform(0, numSamples)) could pick the same row twice)
    indices = random.permutation(numSamples)[:k]
    for i in range(k):
        centroids[i, :] = dataSet[indices[i], :]  # use row indices[i] as a seed point (centroid)
    return centroids
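# usage sketch: initCentroids(dataSet, 4) returns a 4 x dim array of starting centroids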
# k-means clustering
def kmeans(dataSet, k):
    numSamples = dataSet.shape[0]
    # first column stores which cluster this sample belongs to,
    # second column stores the squared error between this sample and its centroid
    clusterAssment = mat(zeros((numSamples, 2)))
    clusterChanged = True

    ## step 1: init centroids
    centroids = initCentroids(dataSet, k)

    while clusterChanged:
        clusterChanged = False
        ## for each sample
        for i in xrange(numSamples):
            minDist = inf  # smallest distance seen so far (inf avoids a magic constant)
            minIndex = 0   # cluster of the nearest centroid
            ## step 2: find the closest centroid
            for j in range(k):
                distance = euclDistance(centroids[j, :], dataSet[i, :])  # Euclidean distance to centroid j
                if distance < minDist:
                    minDist = distance  # update the smallest distance
                    minIndex = j        # and the corresponding cluster
            ## step 3: update the sample's cluster assignment
            if clusterAssment[i, 0] != minIndex:  # the sample switched clusters,
                clusterChanged = True             # so clustering has not converged yet
            clusterAssment[i, :] = minIndex, minDist ** 2  # always refresh the stored squared error
        ## step 4: update centroids
        for j in range(k):
            # nonzero returns the [row] and [column] indices of the non-zero elements;
            # .A converts the mat object to an array
            # gather every sample currently assigned to cluster j, then recompute its center
            pointsInCluster = dataSet[nonzero(clusterAssment[:, 0].A == j)[0]]
            centroids[j, :] = mean(pointsInCluster, axis=0)  # arithmetic mean of each dimension

    print "Congratulations, cluster complete!"
    return centroids, clusterAssment
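# usage sketch: centroids, clusterAssment = kmeans(dataSet, 4);
# clusterAssment[i, 0] is sample i's cluster id, clusterAssment[i, 1] its squared distance to that centroid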
# show the clusters - only available for 2-D data
def showCluster(dataSet, k, centroids, clusterAssment):
    numSamples, dim = dataSet.shape  # numSamples - number of samples; dim - dimensionality of the data
    if dim != 2:
        print "Sorry! I can not draw because the dimension of your data is not 2!"
        return 1

    mark = ['or', 'ob', 'og', 'ok', '^r', '+r', 'sr', 'dr', '<r', 'pr']
    if k > len(mark):
        print "Sorry! Your k is too large! Please contact Zouxy"
        return 1

    # draw all samples, with marker style chosen by cluster id
    for i in xrange(numSamples):
        markIndex = int(clusterAssment[i, 0])
        plt.plot(dataSet[i, 0], dataSet[i, 1], mark[markIndex])

    # draw the centroids with larger markers
    mark = ['Dr', 'Db', 'Dg', 'Dk', '^b', '+b', 'sb', 'db', '<b', 'pb']
    for i in range(k):
        plt.plot(centroids[i, 0], centroids[i, 1], mark[i], ms=12.0)

    plt.show()
The main program (test_kmeans.py):
# -*- coding:utf-8 -*-
from numpy import *
from kmeans2 import *
import time
import matplotlib.pyplot as plt

## step 1: load data
print "step 1: load data ..."
dataSet = []
fileIn = open("F:\\py2projects\\dataset.txt")
for line in fileIn.readlines():
    lineArr = line.strip().split()
    dataSet.append([float(lineArr[0]), float(lineArr[1])])
fileIn.close()

## step 2: clustering
print "step 2: clustering ..."
dataSet = array(dataSet)
k = 4
centroids, clusterAssment = kmeans(dataSet, k)

## step 3: show the result
print "step 3: show the result ..."
showCluster(dataSet, k, centroids, clusterAssment)
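Running test_kmeans.py prints the three step banners plus the completion message from kmeans, then opens the scatter plot (the clustering itself varies with the random initialization):

step 1: load data ...
step 2: clustering ...
Congratulations, cluster complete!
step 3: show the result ...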
3 Results
(Figure 1, Figure 2, Figure 3: clustering results; images omitted.)
The experiments expose one of Kmeans's weaknesses: the algorithm is initialized with randomly chosen seed points, so the choice of seeds matters a great deal, and different seeds can produce completely different clusterings (compare Figure 1 with Figure 2 above).
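A common remedy, sketched below under the assumption that kmeans2.py is importable as above, is to run the algorithm several times and keep the run with the smallest total squared error (the sum of the second column of clusterAssment). The helper name runKmeansBestOf and the restart count are my own choices for illustration, not part of the original code:

# -*- coding:utf-8 -*-
# a minimal sketch: restart kmeans a few times and keep the best run
from numpy import *
from kmeans2 import kmeans

def runKmeansBestOf(dataSet, k, nRestarts=10):
    bestSSE = inf
    bestCentroids, bestAssment = None, None
    for _ in range(nRestarts):
        centroids, clusterAssment = kmeans(dataSet, k)
        sse = clusterAssment[:, 1].sum()  # total squared error of this run
        if sse < bestSSE:                 # keep the run with the lowest error
            bestSSE = sse
            bestCentroids, bestAssment = centroids, clusterAssment
    print "best total squared error:", bestSSE
    return bestCentroids, bestAssment

Swapping this in for the single kmeans call in test_kmeans.py makes the result far less sensitive to an unlucky initialization.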