Classic Clustering Algorithms: K-means

eternal1995

Last modified: 2024-02-01 09:34:33

Tags: algorithms, Python, mathematical modeling, clustering, artificial intelligence

First published: 2024-02-01 09:30:37

Article link: https://blog.csdn.net/eternal1995/article/details/135966480

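About the author: I specialize in operations research modeling and algorithm design, including vehicle routing problems, production shop scheduling, and 2D/3D bin packing problems, and I am familiar with the CPLEX and Gurobi solvers.
WeChat official account: 运筹优化与学习

If you need custom operations research modeling or algorithm development, feel free to contact us by private message.

This post opens the series by introducing the K-means clustering algorithm and a Python implementation.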

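Algorithm Description

K-means is a widely used clustering algorithm. Its goal is to partition a set of data points into K clusters so that points within a cluster are as similar as possible and points in different clusters are as dissimilar as possible. The core idea is to iteratively refine the positions of the cluster centers so as to minimize the within-cluster sum of squared errors, which is given by: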

$\sum_{i=1}^{k}\sum_{x\in S_{i}}\Vert x-c_{i}\Vert^2$

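Here $k$ is the number of clusters K, $c_i$ is the center of cluster $S_{i}$, $x$ is a point in cluster $S_{i}$, and $\Vert x-c_{i}\Vert^2$ is the squared Euclidean distance from the point to the center $c_i$. It measures dissimilarity: the smaller the value, the more similar the point is to its cluster.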
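To make the objective concrete, here is a minimal sketch that evaluates it for a given partition. The function name `sse` and the toy coordinates are made up for illustration and are not part of the original code.

```python
def sse(clusters, centers):
    """Sum of squared Euclidean distances from each point to its cluster center."""
    total = 0.0
    for points, center in zip(clusters, centers):
        for p in points:
            total += sum((a - b) ** 2 for a, b in zip(p, center))
    return total

# Toy data (illustrative only): two clusters of 2D points and their centers
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 10.0), (10.0, 12.0)]]
centers = [(1.0, 0.0), (10.0, 11.0)]
print(sse(clusters, centers))  # 1 + 1 + 1 + 1 = 4.0
```

Each of the four points lies exactly one unit from its center, so every term contributes 1 to the total.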

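K-means is simple to implement and well suited to quick, preliminary grouping of data. It also has limitations: the result is sensitive to the choice of initial cluster centers, and the number of clusters k has to be chosen by the user.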

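Algorithm Steps

1. Initialization: randomly select K data points as the initial cluster centers.
2. Assignment: assign each data point to its nearest cluster center.
3. Update: recompute the center of each cluster (the mean of all points in the cluster).
4. Iteration: repeat steps 2 and 3 until the cluster centers no longer change or a preset number of iterations is reached.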
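The four steps can be sketched in a few lines of plain Python. This is a generic illustration on coordinate tuples, not the order-clustering class shown later; the function name `kmeans` and the parameters `max_iter`, `tol`, and `seed` are assumptions introduced here.

```python
import math
import random

def kmeans(points, k, max_iter=100, tol=1e-6, seed=0):
    """Minimal K-means on tuples of floats; returns (centers, labels)."""
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial centers
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest center
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Step 3: recompute each center as the mean of its members
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        # Step 4: stop once no center has moved by more than tol
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers, labels

# Toy data (illustrative only): two well-separated groups
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans(points, k=2)
print(centers, labels)
```

Which group receives label 0 depends on the random initialization, so only the grouping itself should be relied on. `math.dist` requires Python 3.8 or later.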

Code Walkthrough

Test Data

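We are given a set of food-delivery orders, each with a merchant location, a customer location, and other attributes. The orders are clustered based on this information, and delivery riders are then assigned according to the clustering result. Each order is treated as one data point, and the distance between two orders is defined as the sum of the distance between their merchants and the distance between their customers. (The implementation weights each of the two terms by 0.5, which does not change which cluster center is nearest.)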
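As a small illustration of this distance, the sketch below computes it for two orders. `OrderPoint` is a hypothetical stand-in for the author's Order class that keeps only the four coordinate fields used in the code (`bus_x`, `bus_y`, `cust_x`, `cust_y`); the function name `order_distance` and the weight parameters are also assumptions, with the 0.5 defaults taken from `assign_points`.

```python
import math
from dataclasses import dataclass

@dataclass
class OrderPoint:
    # Hypothetical stand-in for the article's Order class:
    # merchant (bus_*) and customer (cust_*) coordinates only
    bus_x: float
    bus_y: float
    cust_x: float
    cust_y: float

def order_distance(a, b, w_bus=0.5, w_cust=0.5):
    """Weighted sum of merchant-to-merchant and customer-to-customer distances."""
    bus_dist = math.hypot(a.bus_x - b.bus_x, a.bus_y - b.bus_y)
    cust_dist = math.hypot(a.cust_x - b.cust_x, a.cust_y - b.cust_y)
    return w_bus * bus_dist + w_cust * cust_dist

# Illustrative coordinates, not real order data
o1 = OrderPoint(0.0, 0.0, 0.0, 0.0)
o2 = OrderPoint(3.0, 4.0, 6.0, 8.0)
print(order_distance(o1, o2))  # 0.5 * 5 + 0.5 * 10 = 7.5
```

Setting both weights to 1.0 gives the plain sum described in the text; the two choices differ only by a constant factor.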

Main Framework

import matplotlib.pyplot as plt

# Method of the clustering class. The helpers initial_means, update_means and
# save are called here but are not shown in this post.
def Kmeans(self, plot_flag=False):
    # Run the k-means algorithm
    if len(self.geo_locations) < self.k:
        return -1   # Error: fewer data points than clusters
    points_ = self.geo_locations.copy()
    # Compute the initial means
    self.initial_means(points_)
    stop = False
    iterations = 0  # counts completed assignment/update passes
    print("****** Starting K-Means ******")
    while not stop:
        # Assignment step: assign each point to the cluster with the closest mean
        points_ = self.geo_locations.copy()
        clusters = self.assign_points(points_)
        # Update step: recompute the mean of every cluster
        means = self.compute_means(clusters)
        # The return value of update_means(means, 0.01) is used as the stop flag
        stop = self.update_means(means, 0.01)
        if not stop:
            self.means = means
        iterations += 1
    print(f"K-Means completed in {iterations} iterations.")
    self.clusters = clusters
    # Plot the clusters for evaluation
    if plot_flag:
        fig = plt.figure()
        ax = fig.add_subplot(111)
        markers = ['o', 'd', 'x', 'h', 'H', 7, 4, 5, 6, '8', 'p', ',', '+', '.', 's', '*', 3, 0, 1, 2]
        # Distinct colors only; white and duplicate entries were removed
        colors = ['r', 'k', 'b', 'g', 'c', 'm', 'y', 'orange', 'purple', 'brown', 'gray']
        for cluster_id, cluster in enumerate(clusters.values()):
            bus_x = []
            bus_y = []
            cust_x = []
            cust_y = []
            for point in cluster:
                bus_x.append(point.bus_x)
                bus_y.append(point.bus_y)
                cust_x.append(point.cust_x)
                cust_y.append(point.cust_y)
            color = colors[cluster_id % len(colors)]
            marker = markers[cluster_id % len(markers)]
            # Merchants and customers of one cluster share the same color and marker
            ax.scatter(bus_x, bus_y, s=20, color=color, marker=marker)
            ax.scatter(cust_x, cust_y, s=20, color=color, marker=marker)
        plt.show()
    self.save()
    return self.clusters
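The main loop relies on `update_means(means, 0.01)` to decide when to stop, but that helper is not shown in the post. The sketch below is one plausible way to implement such a check, assuming the second argument is a tolerance on how far any center may move. The name `means_converged`, the `Center` tuple, and the movement measure are all hypothetical and may differ from the author's version.

```python
import math
from collections import namedtuple

# Hypothetical lightweight center type with the same coordinate fields as the article's orders
Center = namedtuple("Center", "bus_x bus_y cust_x cust_y")

def means_converged(old_means, new_means, tol):
    """Return True if no center moved by more than tol (assumed stop criterion)."""
    if len(old_means) != len(new_means):
        return False
    for old, new in zip(old_means, new_means):
        # Movement of the merchant part plus movement of the customer part
        shift = (math.hypot(old.bus_x - new.bus_x, old.bus_y - new.bus_y)
                 + math.hypot(old.cust_x - new.cust_x, old.cust_y - new.cust_y))
        if shift > tol:
            return False
    return True

old = [Center(0.0, 0.0, 1.0, 1.0)]
print(means_converged(old, [Center(0.0, 0.005, 1.0, 1.0)], 0.01))  # True
print(means_converged(old, [Center(0.0, 0.5, 1.0, 1.0)], 0.01))    # False
```

A tolerance-based test like this avoids looping forever on tiny floating-point changes; adding a cap on the number of iterations, as mentioned in the algorithm steps, would be a further safeguard.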

Assignment: assign each data point to its nearest cluster center

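import math

def assign_points(self, points):
    # Assign each data point to its nearest cluster center
    clusters = {}
    for point in points:
        dist = []
        # Distance from the current point to every cluster center:
        # half the merchant distance plus half the customer distance
        for mean in self.means:
            dist.append(0.5 * math.sqrt((point.bus_x - mean.bus_x) ** 2 + (point.bus_y - mean.bus_y) ** 2)
                + 0.5 * math.sqrt((point.cust_x - mean.cust_x) ** 2 + (point.cust_y - mean.cust_y) ** 2))
        # Index of the closest center (the first one wins on ties)
        index = dist.index(min(dist))
        clusters.setdefault(index, []).append(point)
    return clusters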

Update: recompute the center of each cluster (the mean of all points in the cluster)

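# Order is the author's own module and is not shown in this post.
def compute_means(self, clusters):
    # Recompute each cluster center as the mean of its member points.
    # Iterating by cluster index keeps means[i] aligned with cluster i;
    # the dict itself is ordered by first assignment, not by index.
    means = []
    for index in range(len(self.means)):
        cluster = clusters.get(index, [])
        if not cluster:
            # Empty cluster: keep the previous center and avoid dividing by zero
            means.append(self.means[index])
            continue
        mean_point = Order.Order(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
        cnt = float(len(cluster))
        for point in cluster:
            mean_point.bus_x += point.bus_x
            mean_point.bus_y += point.bus_y
            mean_point.cust_x += point.cust_x
            mean_point.cust_y += point.cust_y
        mean_point.bus_x /= cnt
        mean_point.bus_y /= cnt
        mean_point.cust_x /= cnt
        mean_point.cust_y /= cnt
        means.append(mean_point)
    return means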
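
Why the update step uses the mean can be seen from a short derivation, sketched here under the assumption that the objective is the squared Euclidean error defined in the algorithm description. For a fixed cluster $S_{i}$, consider the cost as a function of a candidate center $c$:

$f(c)=\sum_{x\in S_{i}}\Vert x-c\Vert^2$

Setting the gradient to zero gives

$\nabla f(c)=-2\sum_{x\in S_{i}}(x-c)=0 \quad\Rightarrow\quad c_{i}=\frac{1}{|S_{i}|}\sum_{x\in S_{i}}x$

so the coordinate-wise average computed in compute_means is the minimizer of this cost over the four coordinates. Note that assign_points measures closeness with unsquared Euclidean distances, for which the exact minimizer would be a geometric median rather than the mean; using the mean with that distance is best read as a practical approximation.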