Machine Learning: k-means Clustering

Clustering is the data-analysis problem of grouping a given set of samples into several "classes" or "clusters" according to the similarity or distance of their features; each class is a subset of the given sample set. Below, we introduce the basic idea of k-means clustering and then implement it in Python.
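For reference, the objective that k-means minimizes is the within-cluster sum of squared Euclidean distances. In the standard textbook formulation, for samples $x_1,\dots,x_m$ and $k$ centroids $\mu_1,\dots,\mu_k$:

$$\min_{\mu_1,\dots,\mu_k}\ \sum_{i=1}^{m} \min_{1 \le j \le k} \lVert x_i - \mu_j \rVert^2$$

Each iteration of the algorithm below alternately assigns every point to its nearest centroid and then recomputes each centroid as the mean of its assigned points; neither step can increase this objective.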



Below we implement k-means clustering in Python.

k-means algorithm 1:

The code is as follows:

from numpy import *

def loadDataSet(filename):
    fr = open(filename)
    '''
    # an equivalent explicit loop:
    dataMat = []
    for line in fr.readlines():
        curLine = line.strip().split('\t')
        fltLine = list(map(float, curLine))
        dataMat.append(fltLine)
    '''
    curLine = [line.strip().split('\t') for line in fr.readlines()]
    dataMat = [list(map(float, line)) for line in curLine]
    return mat(dataMat)

# Euclidean distance
def distEclud(vecA, vecB):
    return sqrt(sum(power(vecA - vecB, 2)))

# build k random initial centroids
def randCent(dataSet, k):
    # k is the desired number of clusters
    n = shape(dataSet)[1]
    centroids = mat(zeros((k, n)))
    for i in range(n):
        min_i = min(dataSet[:, i])
        range_i = float(max(dataSet[:, i]) - min_i)
        # scale the random points into [min, max] of each feature,
        # so no centroid falls outside the data's bounding box
        centroids[:, i] = min_i + range_i * random.rand(k, 1)
    return centroids

# k-means algorithm
def kmeans(dataSet, k, distMeas=distEclud, createCent=randCent):
    m = shape(dataSet)[0]
    # column 0: assigned cluster index, column 1: squared distance to that centroid
    clusterAssment = mat(zeros((m, 2)))
    centroids = createCent(dataSet, k)
    clusterChanged = True
    while clusterChanged:
        clusterChanged = False
        for i in range(m):
            minDist = inf
            minIndex = -1
            for j in range(k):
                dist_ji = distMeas(centroids[j, :], dataSet[i, :])
                if dist_ji < minDist:
                    minDist = dist_ji
                    minIndex = j
            if clusterAssment[i, 0] != minIndex:
                clusterChanged = True
            clusterAssment[i, :] = minIndex, minDist ** 2  # two columns
        print(centroids)
        for cent in range(k):
            ptsInClust = dataSet[nonzero(clusterAssment[:, 0].A == cent)[0]]
            centroids[cent, :] = mean(ptsInClust, axis=0)  # column-wise mean
    return centroids, clusterAssment


Now let's test the algorithm:

filename = 'F:\\Louis爱学习\\机器学习\\k-means\\testSet.txt'
dataMat = loadDataSet(filename)
print(dataMat[:3])
dist = distEclud(dataMat[0], dataMat[1])
print(dist)
cent = randCent(dataMat, 2)
print(cent)
myCentroids, clustAssing = kmeans(dataMat, 4)
print(myCentroids)
print(clustAssing[:5])

Results:

matrix([[ 1.658985,  4.285136],
        [-3.453687,  3.424321],
        [ 4.838138, -1.151539]])
5.184632816681332
matrix([[-1.00711767,  4.41548595],
        [ 2.53692519, -0.67203069]])
[[-1.06540305 -2.77047941]
 [ 3.21397877 -2.71440367]
 [ 0.87963006 -1.83569246]
 [-1.17989226 -3.6028796 ]]
[[-3.32164817  0.44062656]
 [ 3.28256326 -2.248075  ]
 [ 0.94527803  2.8089654 ]
 [-3.01741792 -3.47251685]]
[[-2.84017553  2.6309902 ]
 [ 2.8692781  -2.54779119]
 [ 1.73775604  3.222066  ]
 [-3.38237045 -2.9473363 ]]
[[-2.46154315  2.78737555]
 [ 2.80293085 -2.7315146 ]
 [ 2.6265299   3.10868015]
 [-3.38237045 -2.9473363 ]]
[[-2.46154315  2.78737555]
 [ 2.80293085 -2.7315146 ]
 [ 2.6265299   3.10868015]
 [-3.38237045 -2.9473363 ]]
[[ 2.         38.07193514]
 [ 0.          0.        ]
 [ 1.          5.08043919]
 [ 3.         17.69646711]
 [ 2.         22.66412769]]

We can see that the algorithm converges after four iterations. Because the initial centroids are chosen randomly, the result and the number of iterations will differ from run to run.
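To inspect the clustering visually, here is a minimal plotting sketch (matplotlib is assumed to be available; dataMat, myCentroids, and clustAssing come from the test above, and this snippet is not part of the original code):

import matplotlib.pyplot as plt

# color each point by its assigned cluster (column 0 of clustAssing)
labels = clustAssing[:, 0].A.flatten()
plt.scatter(dataMat[:, 0].A.flatten(), dataMat[:, 1].A.flatten(), c=labels, s=30)
# mark the final centroids with large crosses
plt.scatter(myCentroids[:, 0].A.flatten(), myCentroids[:, 1].A.flatten(),
            marker='+', s=300, color='red')
plt.show()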

However, k is set by us subjectively, and we do not know in advance what the best k is. Sometimes the algorithm converges, yet the actual cluster assignments are not that accurate, because the algorithm has converged to a local minimum rather than the global minimum.
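One common workaround, sketched below, is to run k-means several times with different random initializations and keep the run with the lowest total SSE; the SSE is simply the sum of the squared distances stored in the second column of clusterAssment. The helper name kmeansRestarts is ours, not from the original post, and the sketch reuses the kmeans function and numpy star import defined above:

def kmeansRestarts(dataSet, k, nRuns=10):
    # run kmeans nRuns times and keep the clustering with the lowest total SSE
    bestSSE = inf
    bestCentroids, bestAssment = None, None
    for _ in range(nRuns):
        centroids, clusterAssment = kmeans(dataSet, k)
        sse = sum(clusterAssment[:, 1])  # total within-cluster squared error
        if sse < bestSSE:
            bestSSE = sse
            bestCentroids, bestAssment = centroids, clusterAssment
    return bestCentroids, bestAssment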

To overcome the problem of converging to a local minimum, we introduce the bisecting k-means clustering algorithm. The idea is as follows:

First, treat all the points as one cluster and split it with k-means using k=2. Then pick one of the resulting clusters and split it again with k-means (k=2), giving three clusters in total; pick one of those three and split it, giving four clusters; and so on, until the number of clusters reaches the specified k. At each step, the cluster chosen for splitting is the one whose split yields the lowest SSE. The SSE (sum of squared errors) is the sum of the squared Euclidean distances from each point in a cluster to that cluster's centroid.

The pseudocode for this algorithm is:

create one cluster containing all of the points
while the number of clusters is less than k:
    for every cluster:
        measure the total error (SSE)
        run k-means with k=2 on that cluster
        measure the total error after splitting the cluster in two
    choose the cluster whose split gives the lowest total error and commit that split

Here is the Python code:

k-means algorithm 2:

def bikmeans(dataSet, k, distMeas=distEclud):
    m = shape(dataSet)[0]
    # column 0: assigned cluster index, column 1: squared distance (as in kmeans above)
    clusterAssment = mat(zeros((m, 2)))
    centroid0 = mean(dataSet, axis=0).tolist()[0]
    # tolist() turns the 1 x n mean matrix into a nested list with a single
    # inner list, so [0] extracts that inner list
    centList = [centroid0]  # list of centroid coordinate lists
    for j in range(m):
        clusterAssment[j, 1] = distMeas(mat(centroid0), dataSet[j, :]) ** 2
    while (len(centList) < k):
        lowestSSE = inf
        for i in range(len(centList)):
            ptsInCurrCluster = dataSet[nonzero(clusterAssment[:, 0].A == i)[0], :]
            centroidMat, splitClustAss = kmeans(ptsInCurrCluster, 2, distMeas)
            sseSplit = sum(splitClustAss[:, 1])
            sseNotSplit = sum(clusterAssment[nonzero(clusterAssment[:, 0].A != i)[0], 1])
            print('sseSplit, and notSplit: ', sseSplit, sseNotSplit)
            if (sseSplit + sseNotSplit) < lowestSSE:
                bestCentToSplit = i
                bestNewCents = centroidMat
                bestClustAss = splitClustAss.copy()
                lowestSSE = sseSplit + sseNotSplit
        # relabel the two sub-clusters: label 1 becomes a brand-new cluster index,
        # label 0 keeps the index of the cluster that was just split
        bestClustAss[nonzero(bestClustAss[:, 0].A == 1)[0], 0] = len(centList)
        bestClustAss[nonzero(bestClustAss[:, 0].A == 0)[0], 0] = bestCentToSplit
        print('the bestCentToSplit is: ', bestCentToSplit)
        print('the len of bestClustAss is: ', len(bestClustAss))
        centList[bestCentToSplit] = bestNewCents[0, :].tolist()[0]  # replace the split centroid
        centList.append(bestNewCents[1, :].tolist()[0])  # append the new centroid
        clusterAssment[nonzero(clusterAssment[:, 0].A == bestCentToSplit)[0], :] = bestClustAss
    return mat(centList), clusterAssment
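As a quick check, bikmeans can be run on the same data loaded earlier (a minimal sketch; the choice of k=3 here is arbitrary, not from the original post):

centList, myNewAssments = bikmeans(dataMat, 3)
print(centList)
print(myNewAssments[:5])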
