Clustering is the data-analysis problem of grouping a given set of samples into a number of "classes" or "clusters" according to the similarity or distance between their features; each class is a subset of the given sample set. Below, we introduce the basic concepts of clustering along with hierarchical clustering and k-means clustering, and implement k-means in Python.
We now implement k-means clustering in Python.

k-means, version 1. The code is as follows:
from numpy import *

def loadDataSet(filename):
    # Read a tab-delimited text file into a matrix of floats
    with open(filename) as fr:
        curLine = [line.strip().split('\t') for line in fr.readlines()]
    dataMat = [list(map(float, line)) for line in curLine]
    return mat(dataMat)
# Euclidean distance
def distEclud(vecA, vecB):
    return sqrt(sum(power(vecA - vecB, 2)))

# Build k random initial centroids
def randCent(dataSet, k):
    # k is the number of clusters
    n = shape(dataSet)[1]
    centroids = mat(zeros((k, n)))
    for i in range(n):
        min_i = min(dataSet[:, i])
        range_i = float(max(dataSet[:, i]) - min_i)
        # Draw each coordinate uniformly within the data's range,
        # so the random centroids stay inside the data's bounding box
        centroids[:, i] = min_i + range_i * random.rand(k, 1)
    return centroids
# k-means algorithm
def kmeans(dataSet, k, distMeas=distEclud, createCent=randCent):
    m = shape(dataSet)[0]
    # Two columns: cluster index and squared distance to that centroid
    clusterAssment = mat(zeros((m, 2)))
    centroids = createCent(dataSet, k)
    clusterChanged = True
    while clusterChanged:
        clusterChanged = False
        for i in range(m):
            minDist = inf
            minIndex = -1
            for j in range(k):
                dist_ji = distMeas(centroids[j, :], dataSet[i, :])
                if dist_ji < minDist:
                    minDist = dist_ji
                    minIndex = j
            if clusterAssment[i, 0] != minIndex:
                clusterChanged = True
            clusterAssment[i, :] = minIndex, minDist ** 2  # two columns
        print(centroids)
        for cent in range(k):
            ptsInClust = dataSet[nonzero(clusterAssment[:, 0].A == cent)[0]]
            centroids[cent, :] = mean(ptsInClust, axis=0)  # column-wise mean
    return centroids, clusterAssment
Now test the algorithm:
filename = 'F:\\Louis爱学习\\机器学习\\k-means\\testSet.txt'
dataMat = loadDataSet(filename)
print(dataMat[:3])
dist = distEclud(dataMat[0],dataMat[1])
print(dist)
cent = randCent(dataMat, 2)
print(cent)
myCentroids, clustAssing = kmeans(dataMat,4)
print(myCentroids)
print(clustAssing[:5])
Output:
matrix([[ 1.658985, 4.285136],
[-3.453687, 3.424321],
[ 4.838138, -1.151539]])
5.184632816681332
matrix([[-1.00711767, 4.41548595],
[ 2.53692519, -0.67203069]])
[[-1.06540305 -2.77047941]
[ 3.21397877 -2.71440367]
[ 0.87963006 -1.83569246]
[-1.17989226 -3.6028796 ]]
[[-3.32164817 0.44062656]
[ 3.28256326 -2.248075 ]
[ 0.94527803 2.8089654 ]
[-3.01741792 -3.47251685]]
[[-2.84017553 2.6309902 ]
[ 2.8692781 -2.54779119]
[ 1.73775604 3.222066 ]
[-3.38237045 -2.9473363 ]]
[[-2.46154315 2.78737555]
[ 2.80293085 -2.7315146 ]
[ 2.6265299 3.10868015]
[-3.38237045 -2.9473363 ]]
[[-2.46154315 2.78737555]
[ 2.80293085 -2.7315146 ]
[ 2.6265299 3.10868015]
[-3.38237045 -2.9473363 ]]
[[ 2. 38.07193514]
[ 0. 0. ]
[ 1. 5.08043919]
[ 3. 17.69646711]
[ 2. 22.66412769]]
As we can see, the algorithm converges after four iterations. Because the initial centroids are random, the result and the number of iterations differ from run to run.
However, k is chosen subjectively; we do not know the best value of k in advance. And sometimes the algorithm converges yet the cluster assignments are still poor, because it has converged to a local minimum of the SSE rather than the global minimum.
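This sensitivity to initialization is easy to demonstrate. The sketch below is a minimal NumPy-only k-means, not the code from this post, and the four-blob synthetic data is an illustrative assumption; it runs the algorithm with ten different random initializations and compares the final SSE values. When those values differ, the runs have landed in different local minima.

```python
import numpy as np

def simple_kmeans(X, k, rng, iters=50):
    # Plain k-means: random init from the data, then alternate assign/update
    cents = X[rng.choice(len(X), k, replace=False)].copy()
    for _ in range(iters):
        d2 = ((X[:, None, :] - cents[None, :, :]) ** 2).sum(-1)
        lab = d2.argmin(1)
        for j in range(k):              # keep the old centroid if a cluster empties
            if (lab == j).any():
                cents[j] = X[lab == j].mean(0)
    return d2.min(1).sum()              # approximate SSE after convergence

rng = np.random.default_rng(0)
centers = np.array([[-3.0, 3.0], [3.0, 3.0], [-3.0, -3.0], [3.0, -3.0]])
X = np.vstack([c + rng.normal(scale=0.6, size=(50, 2)) for c in centers])

# Ten runs, each with its own random initialization
sses = [simple_kmeans(X, 4, np.random.default_rng(seed)) for seed in range(10)]
print(round(min(sses), 1), round(max(sses), 1))
```

A large gap between the smallest and largest SSE across runs is the signature of local minima; practical implementations mitigate this by keeping the best of many restarts.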
To mitigate convergence to a local minimum, we introduce the bisecting k-means algorithm. The idea is as follows:
Start with all points in a single cluster and split it with k-means (k = 2). Then pick one of the resulting clusters and split it again with k-means (k = 2), giving three clusters; pick one of those three and split it, giving four clusters; and so on, until the number of clusters reaches the specified k. At each step, the cluster chosen for splitting is the one whose split yields the lowest total SSE. The SSE is the sum of squared Euclidean distances from each point in a cluster to that cluster's centroid.
In pseudocode:

    Treat all the points as one cluster
    While the number of clusters is less than k:
        For each cluster:
            Compute the total SSE
            Run k-means (k = 2) on the cluster
            Compute the total SSE after this split
        Commit the split that gives the lowest total SSE

Below is the Python code, k-means version 2:
def bikmeans(dataSet, k, distMeas=distEclud):
    m = shape(dataSet)[0]
    # Two columns: cluster index and squared distance to that centroid
    clusterAssment = mat(zeros((m, 2)))
    # mean() returns a 1 x n matrix; tolist() gives a nested list with a
    # single element, so [0] extracts the centroid as a plain list
    centroid0 = mean(dataSet, axis=0).tolist()[0]
    centList = [centroid0]  # list of centroids, one per cluster
    for j in range(m):
        clusterAssment[j, 1] = distMeas(mat(centroid0), dataSet[j, :]) ** 2
    while (len(centList) < k):
        lowestSSE = inf
        for i in range(len(centList)):
            ptsInCurrCluster = dataSet[nonzero(clusterAssment[:, 0].A == i)[0], :]
            centroidMat, splitClustAss = kmeans(ptsInCurrCluster, 2, distMeas)
            sseSplit = sum(splitClustAss[:, 1])
            sseNotSplit = sum(clusterAssment[nonzero(clusterAssment[:, 0].A != i)[0], 1])
            print('sseSplit, and notSplit: ', sseSplit, sseNotSplit)
            if (sseSplit + sseNotSplit) < lowestSSE:
                bestCentToSplit = i
                bestNewCents = centroidMat
                bestClustAss = splitClustAss.copy()
                lowestSSE = sseSplit + sseNotSplit
        # Relabel the two halves of the best split: half 1 becomes a new
        # cluster, half 0 keeps the index of the cluster that was split
        bestClustAss[nonzero(bestClustAss[:, 0].A == 1)[0], 0] = len(centList)
        bestClustAss[nonzero(bestClustAss[:, 0].A == 0)[0], 0] = bestCentToSplit
        print('the bestCentToSplit is: ', bestCentToSplit)
        print('the len of bestClustAss is: ', len(bestClustAss))
        centList[bestCentToSplit] = bestNewCents[0, :].tolist()[0]
        centList.append(bestNewCents[1, :].tolist()[0])
        clusterAssment[nonzero(clusterAssment[:, 0].A == bestCentToSplit)[0], :] = bestClustAss
    return mat(centList), clusterAssment
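To see the bisecting strategy in isolation, here is a self-contained NumPy sketch; it does not reuse the functions above, and the three-blob synthetic data and k = 3 are illustrative assumptions. It mirrors the core logic: repeatedly split the cluster whose split yields the lowest total SSE.

```python
import numpy as np

rng = np.random.default_rng(42)
# Three well-separated synthetic blobs of 40 points each
X = np.vstack([c + rng.normal(scale=0.5, size=(40, 2))
               for c in [(-3.0, 0.0), (3.0, 3.0), (3.0, -3.0)]])

def two_means(pts, iters=30):
    # Plain k-means with k = 2: random init, then alternate assign/update
    cents = pts[rng.choice(len(pts), 2, replace=False)].copy()
    for _ in range(iters):
        d2 = ((pts[:, None, :] - cents[None, :, :]) ** 2).sum(-1)
        lab = d2.argmin(1)
        for j in (0, 1):                # keep the old centroid if a side empties
            if (lab == j).any():
                cents[j] = pts[lab == j].mean(0)
    return lab, d2.min(1).sum()

def sse(pts):
    # Sum of squared distances to the cluster's own centroid
    return ((pts - pts.mean(0)) ** 2).sum()

clusters = [X]                          # start with all points in one cluster
while len(clusters) < 3:
    # Try splitting each cluster; total cost of a candidate split is the
    # SSE of its two halves plus the SSE of all the untouched clusters
    splits = [two_means(c) for c in clusters]
    totals = [splits[i][1] + sum(sse(c) for j, c in enumerate(clusters) if j != i)
              for i in range(len(clusters))]
    best = int(np.argmin(totals))
    lab, _ = splits[best]
    part = clusters.pop(best)
    clusters += [part[lab == 0], part[lab == 1]]

print(sorted(len(c) for c in clusters))
```

Because every new cluster is produced by a k = 2 split scored against the total SSE, bisecting k-means tends to be far less sensitive to the initial centroids than running plain k-means once with the full k.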