K-Means Algorithm and Its Python Implementation

The K-Means Algorithm

Introduction

K-Means is a widely used clustering algorithm, also known as K-means clustering or the fast clustering method. It partitions the data into a preset number K of clusters, uses the distance between each sample point and its cluster center as the evaluation criterion, and takes the minimum sum of squared errors as the objective. The algorithm iterates until the sum of squared distances stabilizes below a given threshold or a specified number of iterations is reached, at which point the clustering is complete.

Principle

K-Means assigns every sample to the cluster whose center is nearest to it, so that the distance between each sample and its cluster center is minimized. For continuous attributes, common distance measures include the Euclidean, Manhattan, and Minkowski distances, with the Euclidean distance being the most widely used; for document data, cosine similarity is usually used instead.
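As a minimal illustration (these helper functions are mine, not part of the implementation later in this post), the four measures can be computed with NumPy as follows:

import numpy as np

def euclidean(x, y):
    # square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    # sum of absolute coordinate differences
    return np.sum(np.abs(x - y))

def minkowski(x, y, p=3):
    # generalizes Manhattan (p = 1) and Euclidean (p = 2)
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def cosine_similarity(x, y):
    # 1 means the same direction, 0 means orthogonal
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])
print(euclidean(x, y), manhattan(x, y), minkowski(x, y), cosine_similarity(x, y))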

  1. Input N samples {X1, X2, …, XN}, each lying in D-dimensional Euclidean space, and fix the number of clusters at K;
  2. Initialize K samples {Z1, Z2, …, ZK}, each of which serves as an initial cluster center;
  3. Assign every sample to the cluster whose center is nearest to it, measuring nearness with the Euclidean distance
     $d(X_i, Z_j) = \sqrt{\sum_{l=1}^{D} (x_{il} - z_{jl})^2}$;
  4. Compute the mean of each cluster and take it as the new cluster center.
     The aim of this iteration is to minimize the objective function (the sum of squared errors)
     $SSE = \sum_{j=1}^{K} \sum_{X_i \in C_j} \lVert X_i - Z_j \rVert^2$,
     i.e., samples within a cluster should be similar while samples in different clusters should differ.
     Setting the partial derivative of SSE with respect to each center to zero gives
     $\frac{\partial SSE}{\partial Z_j} = -2 \sum_{X_i \in C_j} (X_i - Z_j) = 0 \;\Rightarrow\; Z_j = \frac{1}{\lvert C_j \rvert} \sum_{X_i \in C_j} X_i$,
     where $\lvert C_j \rvert$ is the number of points in cluster j; in other words, the new cluster center is simply the mean of the points currently assigned to it.
  5. Repeat steps 3 and 4 until the cluster centers no longer change or the specified number of iterations is reached; the clustering at that point is the final result. (A compact NumPy sketch of one assignment-and-update pass is given after the flow chart below.)
     The flow chart is shown below:
     (figure: K-Means flow chart, omitted)
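As a supplement to the loop-based program below, here is a compact NumPy sketch of one assignment-and-update pass (the names data and centers are illustrative only, and the sketch assumes no cluster becomes empty):

import numpy as np

def kmeans_step(data, centers):
    # data: (N, D) array of samples; centers: (K, D) array of current cluster centers
    # squared Euclidean distance from every sample to every center, shape (N, K)
    dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)  # step 3: assign each sample to its nearest center
    new_centers = np.array([data[labels == j].mean(axis=0)  # step 4: cluster means
                            for j in range(centers.shape[0])])
    sse = dists[np.arange(len(data)), labels].sum()  # current value of the objective
    return labels, new_centers, sse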

Application

We run K-Means clustering on the following 80 two-dimensional data points:

Data

(1.658985,4.285136),(-3.453687,3.424321),(4.838138,-1.151539),(-5.379713,-3.362104),(0.972564,2.924086),(-3.567919,1.531611),(0.450614,-3.302219),(-3.487105 ,-1.724432),(2.668759,1.594842),(-3.156485,3.191137),(3.165506,-3.999838),(-2.786837,-3.099354),(4.208187,2.984927),(-2.123337,2.943366),(0.704199,-0.479481),(-0.39237,-3.963704),(2.831667,1.574018),(-0.790153,3.343144),(2.943496,-3.357075),(-3.195883,-2.283926),(2.336445,2.875106),(-1.786345 ,2.554248),(2.190101,-1.90602),(-3.403367,-2.778288),(1.778124, 3.880832),(-1.688346,2.230267),(2.592976,-2.054368),(-4.007257,-3.207066),(2.257734,3.387564),(-2.679011,0.785119),(0.939512,-4.023563),(-3.674424,-2.261084),(2.046259,2.735279),(-3.18947,1.780269),(4.372646,-0.822248),(-2.579316, -3.497576),(1.889034,5.1904),(-0.798747,2.185588),(2.83652,-2.658556),(-3.837877,-3.253815),(2.096701,3.886007),(-2.709034,2.923887),(3.367037,-3.184789),(-2.121479,-4.232586),(2.329546,3.179764),(-3.284816,3.273099),(3.091414,-3.815232),(-3.762093,-2.432191),(3.542056,2.778832),(-1.736822,4.241041),(2.127073,-2.98368),(-4.323818,-3.938116),(3.792121,5.135768),(-4.786473,3.358547),(2.624081,-3.260715),(-4.009299,-2.978115),(2.493525,1.96371),(-2.513661,2.642162),(1.864375,-3.176309),(-3.171184,-3.572452),(2.89422,2.489128),(-2.562539,2.884438),(3.491078,-3.947487),(-2.565729,-2.012114),(3.332948,3.983102),(-1.616805,3.573188),(2.280615,-2.559444),(-2.651229,-3.103198),(2.321395,3.154987),(-1.685703,2.939697),(3.031012,-3.620252),(-4.599622,-2.185829),(4.196223,1.126677),(-2.133863,3.093686),(4.668892,-2.562705),(-2.793241,-2.149706),(2.884105,3.043438),(-2.967647,2.848696),(4.479332,-1.764772),(-4.905566,-2.91107)
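The loadDataSet function in the program expects a tab-separated text file (read below as data111.txt). As a convenience, a small assumed helper for writing the points into that format is sketched here; the points list is a placeholder to be filled with the full data above:

import numpy as np

# points: list of (x, y) tuples copied from the data listed above
points = [(1.658985, 4.285136), (-3.453687, 3.424321), (4.838138, -1.151539)]  # ...and so on

# write one "x<TAB>y" line per point so np.loadtxt(..., delimiter='\t') can read it back
np.savetxt("data111.txt", np.array(points), delimiter="\t", fmt="%.6f")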

Program:

import numpy as np
import matplotlib.pyplot as plt

Load the data

def loadDataSet(fileName):
    data = np.loadtxt(fileName, delimiter='\t')  # read tab-separated x, y pairs, one point per line
    return data

Euclidean distance calculation

def distEclud(x, y):
    return np.sqrt(np.sum((x - y) ** 2))  # Euclidean distance between two points
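A quick sanity check (not part of the original program): the distance between (0, 0) and (3, 4) should be 5.

print(distEclud(np.array([0.0, 0.0]), np.array([3.0, 4.0])))  # prints 5.0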

Build a set of k random centroids for the given data set

def randCent(dataSet, k):
    m, n = dataSet.shape
    centroids = np.zeros((k, n))
    for i in range(k):
        index = int(np.random.uniform(0, m))  # pick a random sample index (duplicates are possible)
        centroids[i, :] = dataSet[index, :]
    return centroids
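Because np.random.uniform may return the same index twice, two initial centroids can coincide. If that matters, one possible alternative (not part of the original program) samples k distinct rows:

def randCentDistinct(dataSet, k):
    # choose k distinct rows of the data set as the initial centroids
    idx = np.random.choice(dataSet.shape[0], size=k, replace=False)
    return dataSet[idx, :].copy()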

K-means clustering (complete)

def KMeans(dataSet, k):
    m = np.shape(dataSet)[0]  # number of samples (rows)
    # column 0: index of the cluster the sample belongs to
    # column 1: squared error between the sample and its cluster center
    clusterAssment = np.mat(np.zeros((m, 2)))
    clusterChange = True

    # Step 1: initialize the centroids
    centroids = randCent(dataSet, k)
    while clusterChange:
        clusterChange = False

        # iterate over all samples (rows)
        for i in range(m):
            minDist = 100000.0
            minIndex = -1

            # iterate over all centroids
            # Step 2: find the nearest centroid
            for j in range(k):
                # Euclidean distance from this sample to centroid j
                distance = distEclud(centroids[j, :], dataSet[i, :])
                if distance < minDist:
                    minDist = distance
                    minIndex = j
            # Step 3: update the cluster assignment of each sample
            if clusterAssment[i, 0] != minIndex:
                clusterChange = True
                clusterAssment[i, :] = minIndex, minDist**2
        # Step 4: update the centroids
        for j in range(k):
            pointsInCluster = dataSet[np.nonzero(clusterAssment[:, 0].A == j)[0]]  # all points assigned to cluster j
            centroids[j, :] = np.mean(pointsInCluster, axis=0)  # mean of those points (NaN if the cluster is empty)

    print("Congratulations,cluster complete!")
    return centroids, clusterAssment
 
Show the clustering result

def showCluster(dataSet, k, centroids, clusterAssment):
    m, n = dataSet.shape
    if n != 2:
        print("The data is not two-dimensional")
        return 1

    mark = ['or', 'ob', 'og', 'ok', '^r', '+r', 'sr', 'dr', '<r', 'pr']
    if k > len(mark):
        print("The value of k is too large")
        return 1

    # plot all samples, colored by assigned cluster
    for i in range(m):
        markIndex = int(clusterAssment[i, 0])
        plt.plot(dataSet[i, 0], dataSet[i, 1], mark[markIndex])

    mark = ['Dr', 'Db', 'Dg', 'Dk', '^b', '+b', 'sb', 'db', '<b', 'pb']
    # plot the centroids
    for i in range(k):
        plt.plot(centroids[i, 0], centroids[i, 1], mark[i])

    plt.show()


dataSet = loadDataSet("data111.txt")
k = 4
centroids, clusterAssment = KMeans(dataSet, k)

showCluster(dataSet, k, centroids, clusterAssment)

print(centroids)
print(clusterAssment)
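Because clusterAssment stores each sample's cluster index and squared error, the total SSE and the cluster sizes can be read off directly; this short check is my addition, not part of the original program:

sse = clusterAssment[:, 1].sum()  # total within-cluster sum of squared errors
print("SSE:", sse)
for j in range(k):
    size = np.sum(clusterAssment[:, 0].A == j)  # number of samples assigned to cluster j
    print("cluster", j, "size:", size)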

Output:

Congratulations,cluster complete!

Cluster centers:

 [ 2.80293085 -2.7315146 ]
 [-2.46154315  2.78737555]
 [-3.38237045 -2.9473363 ]
 [ 2.6265299   3.10868015]

Cluster labels and distances to the assigned cluster centers (four clusters: 0, 1, 2, 3):

Coordinates              Cluster    Squared distance (error) to the assigned center
(1.658985, 4.285136)     3          7.76508437
(-3.453687, 3.424321)    1          32.62168092
(4.838138, -1.151539)    0          8.14381943
(-5.379713, -3.362104)   2          25.23551276
(0.972564, 2.924086)     3          3.06889551
(-3.567919, 1.531611)    1          37.38279441
(0.450614, -3.302219)    0          6.51328092
(-3.487105, -1.724432)   2          14.59172381
... (the remaining rows follow the same three-column format)