Pros and Cons of the Algorithm
Pros: simple, easy to understand and interpret, and the results are generally good.
Cons:
1. The number of clusters k must be specified in advance; different values of k can produce very different clusterings.
2. The choice of the k initial cluster centers affects the final result; different initializations may lead to different outcomes.
3. It only identifies roughly spherical clusters well.
4. The computation becomes expensive when there are many samples.
5. Categorical (discrete) data requires special handling.
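For con 1, a common way to choose k is the elbow method: run K-Means for several values of k and compare the total within-cluster squared error (SSE), which drops sharply until k matches the natural number of clusters and then flattens. Below is a minimal NumPy-only sketch; the helper `sse_for_k` and the synthetic blob data are illustrative assumptions, not part of the original code.

```python
import numpy as np

def sse_for_k(data, k, n_iter=20, seed=0):
    """Run a basic K-Means for a fixed k and return the total squared error."""
    rng = np.random.default_rng(seed)
    # initialize centers by sampling k distinct data points
    centers = data[rng.choice(len(data), k, replace=False)].astype(float)
    for _ in range(n_iter):
        # distance of every point to every center, shape (m, k)
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            pts = data[labels == c]
            if len(pts):
                centers[c] = pts.mean(axis=0)
    dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    return (dists.min(axis=1) ** 2).sum()

# four synthetic Gaussian blobs
rng = np.random.default_rng(1)
blobs = np.vstack([rng.normal(loc, 0.3, size=(50, 2))
                   for loc in [(-3, 3), (3, 3), (-3, -3), (3, -3)]])
for k in (1, 2, 4, 8):
    print(k, round(sse_for_k(blobs, k), 1))
```

On data like this, the printed SSE should fall steeply from k=1 to k=4 and change little afterwards, suggesting k=4.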
Algorithm Overview
Starting from the initial cluster centers, compute the distance from each sample to every center and assign each sample to its nearest cluster. Then update each cluster center (to the mean of its assigned samples) and recompute the distances to the new centers. Repeat until a stopping condition is met, e.g., no assignment changes.
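The loop described above can be traced by hand on a tiny example. The sketch below uses made-up points and centers to perform one assignment step and one center update:

```python
import numpy as np

points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
centers = np.array([[0.0, 0.5], [5.0, 4.0]])  # two initial centers

# Step 1: assign each point to its nearest center.
dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
labels = dists.argmin(axis=1)
print(labels)  # → [0 0 1 1]: first two points join center 0, last two center 1

# Step 2: move each center to the mean of its assigned points.
new_centers = np.array([points[labels == c].mean(axis=0) for c in range(2)])
print(new_centers)  # center 0 stays at (0, 0.5); center 1 moves to (5.5, 5.0)
```

Repeating the two steps until no label changes is exactly the loop the code below implements.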
Functions Used in the Code
loadDataSet(fileName)
Reads the data set from a file.
distEclud(vecA, vecB)
Computes the distance between two vectors.
randCent(dataSet, k)
Randomly generates the initial centroids.
# The distance metric and the centroid-initialization method can be swapped out.
kMeans(dataSet, k, distMeas=distEclud, createCent=randCent)
The K-Means algorithm; takes the data and the value of k.
show(dataSet, k, centroids, clusterAssment)
Visualizes the result.
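As the comment above notes, both the distance metric and the initialization are pluggable through the distMeas and createCent parameters. Below is a hedged sketch of two drop-in replacements matching those signatures; the names distManhattan and randSample are my own, not from the original code.

```python
import numpy as np

def distManhattan(vecA, vecB):
    """Manhattan (L1) distance, same signature as distEclud."""
    return np.sum(np.abs(vecA - vecB))

def randSample(dataSet, k):
    """Initialize centroids by picking k distinct data points at random."""
    dataSet = np.mat(dataSet)
    idx = np.random.choice(np.shape(dataSet)[0], k, replace=False)
    return dataSet[idx, :].copy()

# They could then be passed in as:
#   kMeans(dataMat, 4, distMeas=distManhattan, createCent=randSample)
print(distManhattan(np.array([0, 0]), np.array([3, 4])))  # → 7
```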
Data Used
Only part of the data is shown:
-3.415397 3.582124
-1.632707 1.624773
4.391960 1.415999
-3.446516 -2.464088
2.545512 2.212993
3.052709 3.494825
-3.277546 3.415180
-2.246377 1.499515
-1.206281 -3.436237
-2.697535 -3.134228
-2.067644 -0.626221
-3.122394 2.373387
-0.398199 2.010003
2.659758 -1.125584
-2.244237 0.277479
1.402242 -0.888432
0.478610 3.298247
-2.542210 3.787628
0.915269 3.205037
3.766489 -0.617184
Code
import numpy as np


def loadDataSet(fileName):
    """Read a tab-separated numeric data set from a file."""
    dataMat = []
    with open(fileName) as fr:
        for line in fr:
            curLine = line.strip().split('\t')
            fltLine = [float(x) for x in curLine]
            dataMat.append(fltLine)
    return dataMat


# Distance between two vectors, here the Euclidean distance.
def distEclud(vecA, vecB):
    return np.sqrt(np.sum(np.power(vecA - vecB, 2)))


# Randomly generate initial centroids inside the data's bounding box.
# (Andrew Ng's course instead initializes by picking k random data points.)
def randCent(dataSet, k):
    n = np.shape(dataSet)[1]
    centroids = np.mat(np.zeros((k, n)))
    for j in range(n):
        minJ = dataSet[:, j].min()
        rangeJ = float(dataSet[:, j].max() - minJ)
        centroids[:, j] = minJ + rangeJ * np.random.rand(k, 1)
    return centroids


def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
    m = np.shape(dataSet)[0]
    # Column 0: assigned cluster index; column 1: squared error of the point.
    clusterAssment = np.mat(np.zeros((m, 2)))
    centroids = createCent(dataSet, k)
    clusterChanged = True
    while clusterChanged:
        clusterChanged = False
        for i in range(m):  # assign each point to the closest centroid
            minDist = np.inf
            minIndex = -1
            for j in range(k):
                distJI = distMeas(centroids[j, :], dataSet[i, :])
                if distJI < minDist:
                    minDist = distJI
                    minIndex = j
            if clusterAssment[i, 0] != minIndex:
                clusterChanged = True
            clusterAssment[i, :] = minIndex, minDist ** 2
        print(centroids)
        for cent in range(k):  # recalculate centroids
            # all points currently assigned to this cluster
            ptsInClust = dataSet[np.nonzero(clusterAssment[:, 0].A == cent)[0]]
            if len(ptsInClust) > 0:  # guard against an empty cluster
                centroids[cent, :] = np.mean(ptsInClust, axis=0)
    return centroids, clusterAssment


def show(dataSet, k, centroids, clusterAssment):
    from matplotlib import pyplot as plt
    numSamples, dim = dataSet.shape
    mark = ['or', 'ob', 'og', 'ok', '^r', '+r', 'sr', 'dr', '<r', 'pr']
    for i in range(numSamples):
        markIndex = int(clusterAssment[i, 0])
        plt.plot(dataSet[i, 0], dataSet[i, 1], mark[markIndex])
    mark = ['Dr', 'Db', 'Dg', 'Dk', '^b', '+b', 'sb', 'db', '<b', 'pb']
    for i in range(k):
        plt.plot(centroids[i, 0], centroids[i, 1], mark[i], markersize=12)
    plt.show()


def main():
    dataMat = np.mat(loadDataSet('testSet.txt'))
    myCentroids, clustAssing = kMeans(dataMat, 4)
    print(myCentroids)
    show(dataMat, 4, myCentroids, clustAssing)


if __name__ == '__main__':
    main()
Results
With 4 cluster centers:
With 10 cluster centers:
Summary
Choosing different numbers of cluster centers produces different clusterings, and even with the same k the displayed result varies across runs because of the random initialization; overall, though, the clustering quality is reasonably good.
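Because randCent draws from NumPy's global random state, the run-to-run variation mentioned above can be removed by fixing the seed before each run. A minimal sketch (the seed value 42 is arbitrary):

```python
import numpy as np

np.random.seed(42)            # fix the seed before calling kMeans, e.g.:
first = np.random.rand(4, 1)  # kMeans would now draw these same numbers

np.random.seed(42)            # resetting the seed reproduces the exact draws
second = np.random.rand(4, 1)

print(np.array_equal(first, second))  # → True
```

This makes a run repeatable; it does not fix the sensitivity to initialization itself, for which running K-Means several times and keeping the result with the lowest total squared error is the usual remedy.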