k-means clustering: a Python implementation

Tomator01


Article link: https://blog.csdn.net/Big_Pai/article/details/88607856


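If you find this useful, please give it a like; if not, feel free to downvote.

You are welcome to share this article; please credit the source when reposting.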

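The kmeans algorithm is also called the k-means (k 均值) algorithm. The basic idea is as follows: first pick k samples at random from the sample set as cluster centres, then compute the distance from every sample to each of these k "cluster centres". Each sample is assigned to the cluster whose "cluster centre" is nearest, and a new "cluster centre" is then computed for each of the resulting clusters. Assignment and update are repeated until the clusters stop changing.
From this description we can guess the three main points in implementing kmeans:
(1) choosing the number of clusters k
(2) computing the distance from each sample to the "cluster centres"
(3) updating the "cluster centres" from the newly assigned clusters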
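
Points (2) and (3) can be written compactly. The notation below is introduced here only for illustration and does not appear in the original post: x_i is a sample, \mu_j is the centre of cluster C_j, and c_i is the index of the cluster that x_i is assigned to. The distance is assumed to be the Euclidean distance, which is what the code in this post uses.

```latex
\begin{aligned}
c_i   &= \arg\min_{1 \le j \le k} \lVert x_i - \mu_j \rVert_2
      && \text{(assign } x_i \text{ to its nearest centre)} \\
\mu_j &= \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
      && \text{(new centre is the mean of cluster } C_j \text{)}
\end{aligned}
```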

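Algorithm steps:

(1) Randomly choose k samples as the initial cluster centres.
(2) For each sample, compute its distance to every cluster centre and find the nearest one.
(3) Assign the sample to the cluster of that nearest centre.
(4) Recompute each cluster centre as the mean of the samples assigned to it.
(5) Repeat (2) to (4) until no sample changes cluster.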
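
As a compact illustration of the loop described above (assign every sample to its nearest centre, recompute the centres, stop when nothing changes), here is a minimal vectorised sketch. It is not the implementation from this post: the function name `kmeans_sketch`, the `seed` and `max_iter` parameters and the toy data are all assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(data, k, seed=0, max_iter=100):
    """Illustrative k-means loop; names and defaults are assumptions."""
    rng = np.random.default_rng(seed)
    # Pick k distinct samples as the initial centres
    centers = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    labels = np.full(len(data), -1)
    for _ in range(max_iter):
        # Distance from every sample to every centre, shape (n_samples, k)
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # no sample changed cluster, so stop
        labels = new_labels
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:  # leave an empty cluster's centre where it is
                centers[j] = members.mean(axis=0)
    return centers, labels

# Toy data: two well-separated groups of three points each
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centers, labels = kmeans_sketch(points, k=2)
print(labels)
```

Which group ends up as cluster 0 depends on the random initial centres, so only the grouping itself is meaningful, not the cluster numbers.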

Code:

# -*- coding:utf-8 -*-
# kmeans : k-means cluster

import numpy as np
import matplotlib.pyplot as plt

def readfile(filename):
    """
    Read the dataset.
    W: array of feature vectors (only the first two features are kept)
    label: list of class labels
    :param filename: name of the data file inside save_path
    :return: the feature array and the list of labels
    """
    save_path="D:\\python3_anaconda3\\学习\\机器学习\\机器学习数据集\\"
    with open(save_path+filename,'r') as f:
        # Skip blank lines (e.g. an empty line at the end of the file)
        lines=[line.strip() for line in f if line.strip()]

    length=len(lines)
    print(filename,"length: %d"%length)
    W = np.zeros((length,2))
    label=[]

    for i,linestr in enumerate(lines):
        linestrlist=linestr.split(',')
        # print(linestrlist)
        # The iris dataset has four features; only the first two are used here as the feature vector, so the clustering cannot be fully accurate.
        number_data=[float(j) for j in linestrlist[0:2]]
        W[i,:]=np.array(number_data)
        label.append(linestrlist[4])
    return W,label

def createDataset(filename):
    """
    Build the dataset to be clustered.
    """
    data_vector,label_str=readfile(filename)
    # print(data_vector,"\n",label)

    # Replace the string labels of the original dataset with integers, for plotting later
    label_num=[]
    for i in label_str:
        if i=="Iris-setosa":
            label_num.append(0)
        elif i=="Iris-versicolor":
            label_num.append(1)
        else:
            label_num.append(2)
    return  data_vector,label_num

# Euclidean distance between two vectors
def euclDistance(vector1,vector2):
    return np.sqrt(np.sum(np.power(vector2-vector1,2)))  # element-wise square, then sum

# Initialise the centroids with randomly chosen samples
def initCentroids(dataSet,k):

    numSamples,dim = dataSet.shape
    # numSamples: number of rows, i.e. samples; dim: number of columns, i.e. features (dim=2 for x/y data)
    centroids = np.zeros((k, dim))  # k-by-dim zero matrix
    # Draw k distinct row indices so that the same sample is never picked twice
    indices = np.random.choice(numSamples, k, replace=False)
    for i in range(k):
        centroids[i, :] = dataSet[indices[i], :]  # row indices[i] becomes a centroid
    # print(centroids)
    return centroids

# k-means clustering
def kmeans(dataSet, k):
    numSamples = dataSet.shape[0]
    print(numSamples)
    # first column stores which cluster this sample belongs to,
    # second column stores the squared distance between this sample and its centroid

    clusterAssment = np.zeros((numSamples, 2))
    clusterAssment[:, 0] = -1  # nothing is assigned yet, so the first pass always counts as a change
    clusterChanged = True

    ## step 1: init centroids
    centroids = initCentroids(dataSet, k)

    while clusterChanged:
        clusterChanged = False
        ## for each sample
        for i in range(numSamples):
            minDist = np.inf  # smallest distance found so far
            minIndex = 0  # cluster that gives this smallest distance
            ## for each centroid
            ## step 2: find the centroid that is closest
            for j in range(k):
                distance = euclDistance(centroids[j, :], dataSet[i, :])  # distance from this sample to centroid j
                if distance < minDist:  # closer than the best so far
                    minDist = distance  # update the smallest distance
                    minIndex = j  # and the matching cluster

            ## step 3: update its cluster
            if clusterAssment[i, 0] != minIndex:  # the sample has moved to another cluster
                # This differs slightly from the textbook algorithm: here a single changed assignment sets clusterChanged = True and the centroids are recomputed, whereas the textbook stops when the new centroids equal those of the previous iteration.
                clusterChanged = True  # clustering has to continue
            # Always refresh the stored distance, because the centroids move between passes
            clusterAssment[i, :] = minIndex, minDist**2

        ## step 4: update centroids
        for j in range(k):
            # np.nonzero returns the indices of the rows assigned to cluster j;
            # those rows are collected in pointsInCluster
            pointsInCluster = dataSet[np.nonzero(clusterAssment[:, 0] == j)]
            # print("s",pointsInCluster.shape)
            if len(pointsInCluster) > 0:  # an empty cluster keeps its old centroid
                centroids[j, :] = np.mean(pointsInCluster, axis=0)  # column-wise mean

    # print("center",centroids)
    return centroids, clusterAssment


# show your cluster, only available with 2-D data
def showCluster(dataSet, k, centroids, clusterAssment,old_label):
    numSamples, dim = dataSet.shape  # numSamples: number of samples; dim: number of features
    if dim != 2:
        print ("not two-dimensional data")
        return 1

    mark = ['or', 'ob', 'og', 'ok', '^r', '+r', 'sr', 'dr', '<r', 'pr']
    if k > len(mark):
        print ("the k is too large! the max k is 10")
        return 1

    # draw all samples, marked by the cluster k-means assigned them to
    for i in range(numSamples):
        markIndex = int(clusterAssment[i, 0])
        plt.plot(dataSet[i, 0], dataSet[i, 1], mark[markIndex])
    plt.title("The classification results of k-means cluster")

    # separate marker list for the centroids, so the sample markers above stay available
    centroidMark = ['Dr', 'Db', 'Dg', 'Dk', '^b', '+b', 'sb', 'db', '<b', 'pb']

    # draw the centroids
    for i in range(k):
        plt.plot(centroids[i, 0], centroids[i, 1], centroidMark[i], ms=12.0)


    # Plot the samples by their original class labels, for comparison with the clustering
    plt.figure()     # open a second window instead of drawing into the same one
    for i in range(numSamples):
        markIndex = int(old_label[i])
        plt.plot(dataSet[i, 0], dataSet[i, 1], mark[markIndex])
    plt.title("Original classification result")
    plt.show()


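'''
Purpose: k-means clustering
Uses the UCI iris dataset, Link: http://archive.ics.uci.edu/ml/machine-learning-databases/iris/
To make plotting easy, only the first two dimensions of the feature space are used, i.e. sepal length and sepal width. Clustering on these two features alone will of course not be accurate!

'''
if __name__=="__main__":
    # name of the dataset file
    filename="iris_all.data"

    # data_vector and label are the feature vectors and the original labels
    data_vector, label=createDataset(filename)
    # initCentroids(data_vector,3)

    k=3
    centroids, clusterAssment=kmeans(data_vector,k)

    # plot the data twice: by k-means cluster and by original label
    showCluster(data_vector,k,centroids,clusterAssment,label)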

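The scatter plots show that the k-means clusters agree fairly well with the original classes; only a small share of the samples end up in the wrong group. That is forgivable, because only the first two of the four features were used for clustering, so the result cannot be very accurate. Note that cluster numbers are arbitrary, so the same group may appear in a different colour in the two figures.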
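
To put a number on "agree fairly well" instead of judging by eye, one option is purity: for each cluster, take its most common original label and count how many samples carry it. The sketch below is an addition for illustration, not part of the original post; the function name `cluster_purity` and the toy arrays are hypothetical, and the labels are assumed to be non-negative integers as produced by `createDataset`.

```python
import numpy as np

def cluster_purity(cluster_ids, true_labels):
    """Fraction of samples whose original label is the majority label of their cluster."""
    cluster_ids = np.asarray(cluster_ids).astype(int)
    true_labels = np.asarray(true_labels).astype(int)
    matched = 0
    for c in np.unique(cluster_ids):
        labels_in_cluster = true_labels[cluster_ids == c]
        # np.bincount counts how often each non-negative integer label occurs
        matched += np.bincount(labels_in_cluster).max()
    return matched / len(true_labels)

# Hypothetical toy assignment (not results from the iris run)
toy_clusters = [0, 0, 0, 1, 1, 2, 2, 2]
toy_labels   = [1, 1, 2, 0, 0, 2, 2, 1]
purity = cluster_purity(toy_clusters, toy_labels)
print(purity)  # 6 of the 8 samples carry their cluster's majority label: 0.75
```

With the variables from the script above, this could be called as `cluster_purity(clusterAssment[:, 0], label)`. Because purity looks only at the majority label inside each cluster, it does not depend on how the clusters happen to be numbered.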
