K-means Algorithm (Python)

老手er

Article link: https://blog.csdn.net/qq_43608549/article/details/99327447

K-means Algorithm (an introductory AI algorithm in Python)

Introduction to the K-means algorithm:

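K-means takes two inputs: the number of clusters k and a data set of n objects (this article uses random numbers as sample data). It partitions the n objects into k clusters that aim to minimize the within-cluster variance, so that objects in the same cluster are highly similar while objects in different clusters are dissimilar. Similarity is measured against a "center object" for each cluster, obtained as the mean of the objects in that cluster.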
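
The minimum-variance criterion can be made concrete as the within-cluster sum of squared distances: for every cluster, add up the squared Euclidean distance from each point to the cluster mean. The sketch below is only an illustration of that quantity. The function name `within_cluster_sse` and the sample points are assumptions made for this example and are not part of the article's code.

```python
def within_cluster_sse(clusters):
    """Sum of squared Euclidean distances from each point to its cluster mean.

    `clusters` is a list of clusters; each cluster is a non-empty list of
    points, and each point is a list of coordinates.
    """
    total = 0.0
    for cluster in clusters:
        # per-dimension mean of the cluster (the "center object")
        center = [sum(dim) / len(cluster) for dim in zip(*cluster)]
        for point in cluster:
            total += sum((x - c) ** 2 for x, c in zip(point, center))
    return total


# Hypothetical sample data: the same four 2-D points grouped two ways.
good = [[[1, 1], [1, 3]], [[9, 9], [9, 11]]]
bad = [[[1, 1], [9, 9]], [[1, 3], [9, 11]]]

print(within_cluster_sse(good))  # 4.0
print(within_cluster_sse(bad))   # 128.0
```

The grouping that keeps nearby points together scores far lower, which is the kind of partition K-means searches for.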

The code in this article is organized as follows:

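(1) Import the required modules: random (to generate random numbers) and matplotlib.pyplot (to plot and visualize the data).
(2) Create the KMeans class. Its constructor initializes the value of k, data (the original data), pointCenter (the centers, whose first values are drawn at random from __data), and result = [] (the points assigned to each cluster).
(3) def randomCenter(self): picks the k initial centers at random from the original __data. The detail to watch is the duplicate check: a point is added to the center list only if it is not already in it.
(4) def distance(self, pointer1, pointer2): computes the Euclidean distance between two points and supports any number of dimensions.
(5) def calPointCenterDistance(self, center, data): computes the distance from every point to every center. Parameters: center, the centers; data, the data source.
(6) def calNewPointCenter(self, result): computes the new centers.
(7) def calCenterToCenterDistance(self, old, new): computes the distances between the previous and the new centers, sums them, and takes the mean. Parameters: old, the previous centers; new, the newly computed centers.
(8) def fit(self, data, threshold, times=50000): trains the model. Parameters: data, the training data; threshold, the convergence threshold; times, the maximum number of iterations. It assigns each point to its nearest center, saves the previous centers, reassigns each point to its nearest new center, and returns the new centers together with the clustering result.
(9) if __name__ == "__main__": the program entry point, which creates the KMeans instance, calls its methods, and draws the plots.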
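
Steps (4) to (7) together make up one iteration: measure distances, assign each point to its nearest center, recompute the centers as means, and measure how far the centers moved. The following is a minimal standalone sketch of one such iteration on hand-picked points. The helper names (`euclidean`, `assign`, `update`, `mean_shift`) and the data are made up for illustration; they are not the methods of the KMeans class.

```python
def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def assign(points, centers):
    """Put every point into the cluster of its nearest center."""
    clusters = [[] for _ in centers]
    for point in points:
        distances = [euclidean(point, c) for c in centers]
        clusters[distances.index(min(distances))].append(point)
    return clusters


def update(clusters):
    """New center of each cluster: the per-dimension mean.

    Assumes every cluster is non-empty.
    """
    return [[sum(dim) / len(cluster) for dim in zip(*cluster)]
            for cluster in clusters]


def mean_shift(old, new):
    """Average distance the centers moved between two iterations."""
    return sum(euclidean(o, n) for o, n in zip(old, new)) / len(old)


points = [[1, 1], [2, 1], [8, 8], [9, 8]]
centers = [[1, 1], [2, 1]]          # deliberately poor starting centers

clusters = assign(points, centers)
new_centers = update(clusters)
print(clusters)      # [[[1, 1]], [[2, 1], [8, 8], [9, 8]]]
print(new_centers)
print(mean_shift(centers, new_centers))
```

Repeating `assign` and `update` until `mean_shift` drops below a threshold is the loop that step (8) describes.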

That is the overall structure of the program. The full code follows:


import random
import matplotlib.pyplot as plt


class KMeans:

    def __init__(self, k=1):
        '''
        k: number of clusters
        '''
        self.__k = k
        self.__data = []         # original data
        self.__pointCenter = []  # centers; the first ones are drawn at random from __data
        # one empty list per cluster, e.g. [[], [], [], [], []] for k=5
        self.__result = [[] for _ in range(k)]

    def randomCenter(self):
        '''
        Pick the k initial centers at random from the original __data.
        Note: this loops forever if __data holds fewer than k distinct points.
        '''
        while len(self.__pointCenter) < self.__k:
            # pick a random index
            index = random.randint(0, len(self.__data) - 1)
            # add the point only if it is not already a center
            if self.__data[index] not in self.__pointCenter:
                self.__pointCenter.append(self.__data[index])

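    def distance(self, pointer1, pointer2):
        '''
        Euclidean distance between two points; works for any number of dimensions.
        :param pointer1: first point (sequence of coordinates)
        :param pointer2: second point (same length as pointer1)
        '''
        return sum((x1 - x2) ** 2 for x1, x2 in zip(pointer1, pointer2)) ** 0.5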

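    def calPointCenterDistance(self, center, data):
        '''
        calPointCenterDistance()
        Compute the distance from every point to every center.
        center: list of centers
        data: data source
        Returns one row per point, holding its distance to each center.
        '''
        centerDistance = []
        for temp in data:
            centerDistance.append([self.distance(temp, point) for point in center])
        print(centerDistance)
        return centerDistance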

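    def calNewPointCenter(self, result):
        '''
        Compute the new centers as the per-dimension mean of each cluster.
        :param result: list of clusters, each a list of points
        '''
        newCenterPoint = []
        for temp in result:
            if not temp:
                # an empty cluster has no mean; re-seed it with a random data point
                newCenterPoint.append(list(random.choice(self.__data)))
                continue
            # zip(*temp) transposes the cluster so each tuple holds one dimension;
            # sum each dimension and take the average
            point = [sum(t) / len(t) for t in zip(*temp)]
            newCenterPoint.append(point)
        print(newCenterPoint)
        return newCenterPoint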

    def calCenterToCenterDistance(self, old, new):
        '''
        Mean distance between the previous centers and the new centers.
        old: centers from the previous iteration
        new: newly computed centers
        '''
        total = 0
        for point1, point2 in zip(old, new):
            total += self.distance(point1, point2)
        return total / len(old)

    def fit(self, data, threshold, times=50000):
        '''
        fit(): train the model
        data: training data
        threshold: stop once the mean movement of the centers is at or below this value
        times: maximum number of iterations, a second exit condition
        '''
        self.__data = data
        self.randomCenter()     # see randomCenter() above, item (3) in the outline
        print(self.__pointCenter)
        centerDistance = self.calPointCenterDistance(self.__pointCenter, self.__data)

        # first assignment: put every point into the cluster of its nearest center
        for point, temp in zip(self.__data, centerDistance):
            index = temp.index(min(temp))    # index of the smallest distance
            self.__result[index].append(point)

        # print(self.__result)   # print the clustering result
        oldCenterPoint = self.__pointCenter
        newCenterPoint = self.calNewPointCenter(self.__result)

        # stop when the centers have settled or the iteration budget is used up
        while times > 0 and self.calCenterToCenterDistance(oldCenterPoint, newCenterPoint) > threshold:
            times -= 1
            result = [[] for _ in range(self.__k)]
            # remember the previous centers
            oldCenterPoint = newCenterPoint
            centerDistance = self.calPointCenterDistance(newCenterPoint, self.__data)

            # reassign every point to the cluster of its nearest center
            for point, temp in zip(self.__data, centerDistance):
                index = temp.index(min(temp))
                result[index].append(point)  # e.g. result = [[[10, 20]], ...]

            newCenterPoint = self.calNewPointCenter(result)
            print(self.calCenterToCenterDistance(oldCenterPoint, newCenterPoint))
            self.__result = result
        self.__pointCenter = newCenterPoint
        return newCenterPoint, self.__result

if __name__ == "__main__":
    data = [[random.randint(1, 100), random.randint(1, 100)] for i in range(100)]
    for i in range(10):
        kmeans = KMeans(k=5)
        centerPoint, result = kmeans.fit(data, 0.0001)
        print(centerPoint)
        # plot the clusters so the result is easy to inspect
        plt.figure()
        plt.title("KMeans Classification")   # title
        tempx = []
        tempy = []
        color = []
        for colorIndex, temp in enumerate(result):
            # split the cluster into a list of x values and a list of y values
            tempx += [point[0] for point in temp]
            tempy += [point[1] for point in temp]
            # one numeric colour value per point, spaced by 2 between clusters
            color += [colorIndex * 2] * len(temp)
        plt.scatter(tempx, tempy, c=color, s=30)
        plt.show()
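
The randomCenter method retries until it has found k different points, which never terminates when the data contains fewer than k distinct points. One possible alternative, sketched here under the assumption that every point is a list of numbers, is to remove duplicates first and then draw with `random.sample`, which samples without replacement. The names `pick_initial_centers` and `seed` are hypothetical and introduced only for this example.

```python
import random


def pick_initial_centers(data, k, seed=None):
    """Return k distinct points from data, chosen at random.

    Raises ValueError when data holds fewer than k distinct points.
    """
    # lists are unhashable, so de-duplicate via tuples while keeping order
    unique = list(dict.fromkeys(tuple(point) for point in data))
    if len(unique) < k:
        raise ValueError("need at least k distinct points")
    rng = random.Random(seed)   # private generator; a fixed seed makes runs repeatable
    return [list(point) for point in rng.sample(unique, k)]


data = [[1, 2], [1, 2], [3, 4], [5, 6]]
centers = pick_initial_centers(data, 3, seed=0)
print(sorted(centers))  # [[1, 2], [3, 4], [5, 6]]
```

Passing a fixed seed also makes it easier to compare runs, since the 10 runs in the main block otherwise start from different centers each time.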

If you spot any mistakes, corrections are welcome!
