The core idea of the algorithm is simple; see the figure below:
1. The algorithm takes k = 2, meaning we partition the samples into 2 clusters; the multi-cluster case works the same way.
2. Randomly pick two points (circled in the figure) as the cluster centroids. Note that these two points must lie within the bounding range of the samples; the randCent function in the code below shows how this is done.
3. With these two centroids in hand, compute the distance from every point to each of them and assign each point to its nearest centroid; all points assigned to the same centroid form one cluster.
4. Within each of the two resulting clusters, take the mean of its points as the new cluster centroid, then repeat the steps above (a worked trace of steps 3 and 4 follows this list).
5. Stop iterating once no point changes the cluster it belongs to.
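Before the full implementation, here is a minimal sketch of one assignment-then-update iteration on a hypothetical four-point dataset (the points and initial centroids below are made up purely for illustration):

import numpy as np

points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])  # hypothetical samples
centroids = np.array([[1.0, 1.0], [4.0, 4.0]])                       # hypothetical initial centroids

# Step 3: assign each point to its nearest centroid
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
labels = dists.argmin(axis=1)   # -> [0, 0, 1, 1]

# Step 4: move each centroid to the mean of its assigned points
for c in range(2):
    centroids[c] = points[labels == c].mean(axis=0)
print(centroids)                # -> [[0.  0.5]  [5.  5.5]]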
Python implementation:
# -*- coding: utf-8 -*-
import numpy as np

def loadDataSet(filename):
    # Read a tab-separated text file into a list of float lists
    dataMat = []
    with open(filename) as fr:
        for line in fr.readlines():
            curLine = line.strip().split('\t')
            fltLine = list(map(float, curLine))  # list() is required under Python 3, where map() is lazy
            dataMat.append(fltLine)
    return dataMat

def randCent(dataset, k):
    # Generate k random centroids; each coordinate is drawn uniformly from
    # that column's [min, max] range, so the centroids stay inside the
    # bounding box of the samples
    n = np.shape(dataset)[1]
    centroids = np.mat(np.zeros((k, n)))
    for j in range(n):
        minJ = dataset[:, j].min()
        rangeJ = float(dataset[:, j].max() - minJ)
        centroids[:, j] = minJ + rangeJ * np.random.rand(k, 1)
    return centroids

def distEclud(vecA, vecB):
    # Euclidean distance between two row vectors
    return np.sqrt(np.sum(np.power(vecA - vecB, 2)))

def kMeans(dataset, k, distMeans=distEclud, createCent=randCent):
    m = np.shape(dataset)[0]
    clusterAssment = np.mat(np.zeros((m, 2)))  # column 0: cluster index, column 1: squared distance to centroid
    centroids = createCent(dataset, k)
    clusterChanged = True
    while clusterChanged:  # the loop only exits once every point keeps its nearest centroid
        clusterChanged = False
        for i in range(m):
            minDist = np.inf
            minIndex = -1
            for j in range(k):
                distJI = distMeans(centroids[j, :], dataset[i, :])
                if distJI < minDist:
                    minDist = distJI
                    minIndex = j
            if clusterAssment[i, 0] != minIndex:
                clusterChanged = True
            clusterAssment[i, :] = minIndex, minDist ** 2
        print(centroids)
        for cent in range(k):  # update each centroid to the mean of the points now assigned to it
            ptsInClust = dataset[np.nonzero(clusterAssment[:, 0].A == cent)[0]]
            centroids[cent, :] = np.mean(ptsInClust, axis=0)
    return centroids, clusterAssment

xdata = loadDataSet('D:\\machinelearninginaction\\Ch10\\testSet.txt')
xdata = np.mat(xdata)
cds, cluster = kMeans(xdata, 4)
print(cds)
print(cluster)
Output of running the code. First, the centroids printed at each iteration of the while loop:
[[-3.96004846 -0.2370727 ]
[-3.70008211 0.47518868]
[ 3.89055325 1.72265831]
[ 0.31740575 1.22857227]]
[[-3.38237045 -2.9473363 ]
[-3.00984169 2.66771831]
[ 3.09256987 0.734184 ]
[ 0.33583518 0.38735082]]
[[-3.38237045 -2.9473363 ]
[-2.54905874 2.81904858]
[ 2.98620229 0.54660226]
[ 0.89422714 -1.26508257]]
[[-3.53973889 -2.89384326]
[-2.46154315 2.78737555]
[ 2.95373358 2.32801413]
[ 2.19454347 -3.07604306]]
[[-3.53973889 -2.89384326]
[-2.46154315 2.78737555]
[ 2.6265299 3.10868015]
[ 2.65077367 -2.79019029]]
[[-3.53973889 -2.89384326]
[-2.46154315 2.78737555]
[ 2.6265299 3.10868015]
[ 2.65077367 -2.79019029]]
These iterations converge to the final cluster centroids; the last two printed matrices are identical because no point changed clusters in the final pass, so the loop exited. The per-point assignment matrix is printed next:
[[ 2. 2.3201915 ]
[ 1. 1.39004893]
[ 3. 7.46974076]
[ 0. 3.60477283]
[ 2. 2.7696782 ]
[ 1. 2.80101213]
[ 3. 5.10287596]
[ 0. 1.37029303]
[ 2. 2.29348924]
[ 1. 0.64596748]
[ 3. 1.72819697]
[ 0. 0.60909593]
[ 2. 2.51695402]
[ 1. 0.13871642]
[ 3. 9.12853034]
[ 3. 10.63785781]
[ 2. 2.39726914]
[ 1. 3.1024236 ]
[ 3. 0.40704464]
[ 0. 0.49023594]
[ 2. 0.13870613]
[ 1. 0.510241 ]
[ 3. 0.9939764 ]
[ 0. 0.03195031]
[ 2. 1.31601105]
[ 1. 0.90820377]
[ 3. 0.54477501]
[ 0. 0.31668166]
[ 2. 0.21378662]
[ 1. 4.05632356]
[ 3. 4.44962474]
[ 0. 0.41852436]
[ 2. 0.47614274]
[ 1. 1.5441411 ]
[ 3. 6.83764117]
[ 0. 1.28690535]
[ 2. 4.87745774]
[ 1. 3.12703929]
[ 3. 0.05182929]
[ 0. 0.21846598]
[ 2. 0.8849557 ]
[ 1. 0.0798871 ]
[ 3. 0.66874131]
[ 0. 3.80369324]
[ 2. 0.09325235]
[ 1. 0.91370546]
[ 3. 1.24487442]
[ 0. 0.26256416]
[ 2. 0.94698784]
[ 1. 2.63836399]
[ 3. 0.31170066]
[ 0. 1.70528559]
[ 2. 5.46768776]
[ 1. 5.73153563]
[ 3. 0.22210601]
[ 0. 0.22758842]
[ 2. 1.32864695]
[ 1. 0.02380325]
[ 3. 0.76751052]
[ 0. 0.59634253]
[ 2. 0.45550286]
[ 1. 0.01962128]
[ 3. 2.04544706]
[ 0. 1.72614177]
[ 2. 1.2636401 ]
[ 1. 1.33108375]
[ 3. 0.19026129]
[ 0. 0.83327924]
[ 2. 0.09525163]
[ 1. 0.62512976]
[ 3. 0.83358364]
[ 0. 1.62463639]
[ 2. 6.39227291]
[ 1. 0.20120037]
[ 3. 4.12455116]
[ 0. 1.11099937]
[ 2. 0.07060147]
[ 1. 0.2599013 ]
[ 3. 4.39510824]
[ 0. 1.86578044]]
The data above show the final cluster label of every sample point (0, 1, 2, or 3) in the first column; the second column is the squared distance from each point to its cluster's centroid. This value can be used to evaluate the quality of the clustering, as discussed below.
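As a quick sketch of that evaluation (reusing np, xdata, cds, and cluster from the run above; the plotting details are illustrative, not from the original post), the sum of the second column gives the total SSE, and matplotlib can render the clusters:

import matplotlib.pyplot as plt

sse = cluster[:, 1].sum()                   # total SSE: sum of squared point-to-centroid distances
print('total SSE:', sse)

labels = np.asarray(cluster[:, 0]).ravel()  # cluster index of every point
pts = np.asarray(xdata)
plt.scatter(pts[:, 0], pts[:, 1], c=labels)         # color each point by its cluster
cents = np.asarray(cds)
plt.scatter(cents[:, 0], cents[:, 1], marker='+', s=200, c='red')  # mark the centroids
plt.show()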
After enough iterations the k-means algorithm converges, but because the result depends on the random initial centroids it can get stuck in a poor local optimum, as shown in the figure below:
As this figure shows, the samples are not partitioned well. Bisecting k-means was introduced to address this problem.
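Before moving on, note that a common mitigation (a sketch of standard practice, not something from this post) is simply to restart kMeans with several random initializations and keep the result with the lowest total SSE:

bestSSE = np.inf
bestResult = None
for _ in range(10):                      # 10 hypothetical random restarts
    cds_i, cluster_i = kMeans(xdata, 4)  # each call draws new random centroids
    sse_i = cluster_i[:, 1].sum()
    if sse_i < bestSSE:                  # keep the lowest-SSE run
        bestSSE = sse_i
        bestResult = (cds_i, cluster_i)
print('best SSE over restarts:', bestSSE)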
Bisecting k-means algorithm:
1. Start with all of the samples in a single cluster.
2. Split that cluster into two with one round of 2-means, using the same procedure as above.
3. After the split, consider splitting each of the resulting clusters in two again in the same way.
4. Repeat until the desired number of clusters is reached.
A natural question arises from this process: which cluster should actually be split at each step? This is where the distance from each point to its centroid comes in, through a quantity called the SSE (sum of squared errors). We trial-split each cluster and commit the split that reduces the total SSE the most, i.e., the one that leaves the smallest overall error. A sketch of the full procedure follows.
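Here is a minimal sketch of bisecting k-means, reusing the kMeans and distEclud functions defined above; the function name biKmeans and its structure follow the well-known Machine Learning in Action version rather than code given in this post:

def biKmeans(dataset, k, distMeans=distEclud):
    m = np.shape(dataset)[0]
    clusterAssment = np.mat(np.zeros((m, 2)))
    # Step 1: all samples start in a single cluster whose centroid is the global mean
    centroid0 = np.mean(dataset, axis=0).tolist()[0]
    centList = [centroid0]
    for j in range(m):
        clusterAssment[j, 1] = distMeans(np.mat(centroid0), dataset[j, :]) ** 2
    while len(centList) < k:
        lowestSSE = np.inf
        # Trial-split every current cluster with 2-means and measure the resulting total SSE
        for i in range(len(centList)):
            ptsInCurrCluster = dataset[np.nonzero(clusterAssment[:, 0].A == i)[0], :]
            centroidMat, splitClustAss = kMeans(ptsInCurrCluster, 2, distMeans)
            sseSplit = np.sum(splitClustAss[:, 1])  # SSE of the two new sub-clusters
            sseNotSplit = np.sum(clusterAssment[np.nonzero(clusterAssment[:, 0].A != i)[0], 1])
            if sseSplit + sseNotSplit < lowestSSE:  # i.e. the largest reduction in total SSE
                bestCentToSplit = i
                bestNewCents = centroidMat
                bestClustAss = splitClustAss.copy()
                lowestSSE = sseSplit + sseNotSplit
        # Commit the best split: relabel its two halves into the cluster list
        bestClustAss[np.nonzero(bestClustAss[:, 0].A == 1)[0], 0] = len(centList)
        bestClustAss[np.nonzero(bestClustAss[:, 0].A == 0)[0], 0] = bestCentToSplit
        centList[bestCentToSplit] = bestNewCents[0, :].tolist()[0]
        centList.append(bestNewCents[1, :].tolist()[0])
        clusterAssment[np.nonzero(clusterAssment[:, 0].A == bestCentToSplit)[0], :] = bestClustAss
    return np.mat(centList), clusterAssment

cds2, cluster2 = biKmeans(xdata, 4)  # hypothetical usage on the same dataset

Because every committed split is the one that leaves the smallest total SSE, bisecting k-means is far less sensitive to the random initial centroids than plain k-means.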