1. Introduction to KMeans
KMeans is a simple clustering method that assigns each sample to a cluster based on its distance to the cluster centers, where $k$ is the user-specified number of clusters. Initially, $k$ points are chosen at random as the cluster centers (centroids), and the centers are then updated iteratively until the clustering converges. Because the mean of each cluster serves as its centroid, the method is also called K-means.
2. The KMeans Model
2.1 KMeans
The KMeans algorithm itself is fairly simple. Denote the $k$ cluster centers by $\mu_1, \mu_2, \dots, \mu_k$ and the number of samples in each cluster by $N_1, N_2, \dots, N_k$. KMeans uses the sum of squared errors as its objective function:

$$J(\mu_1, \mu_2, \dots, \mu_k) = \frac{1}{2}\sum_{j=1}^{k}\sum_{i=1}^{N_j}\left(x_i - \mu_j\right)^2$$

Taking the partial derivative of the loss with respect to $\mu_j$ gives

$$\frac{\partial J}{\partial \mu_j} = -\sum_{i=1}^{N_j}\left(x_i - \mu_j\right)$$

Setting the derivative to zero and solving yields

$$\mu_j = \frac{1}{N_j}\sum_{i=1}^{N_j} x_i$$

that is, the centroid of each cluster is the mean of its samples.
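A quick numerical check of this result: for a toy one-dimensional cluster, the SSE evaluated at the cluster mean is no larger than at any nearby candidate centroid, matching the closed-form solution derived above.

```python
import numpy as np

# Toy 1-D cluster: verify that the mean minimizes the sum of squared errors.
x = np.array([1.0, 2.0, 4.0, 7.0])
mu = x.mean()  # closed-form centroid from setting the derivative to zero

def sse(c):
    # sum of squared errors of the cluster about a candidate centroid c
    return np.sum((x - c) ** 2)

# SSE at the mean is no larger than at any perturbed candidate
candidates = mu + np.linspace(-2.0, 2.0, 41)
assert sse(mu) <= min(sse(c) for c in candidates) + 1e-9
```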
The KMeans code is as follows:
```python
def kmeans(self, train_data, k):
    sample_num = len(train_data)
    distances = np.zeros([sample_num, 2])  # (cluster index, distance)
    centers = self.createCenter(train_data)
    # pass the k given to this call (biKmeans invokes this method with k=2)
    centers, distances = self.adjustCluster(centers, distances, train_data, k)
    return centers, distances
```
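The class relies on two helpers, `calculateDistance()` and `createCenter()`, whose implementations are not shown in the post. The sketch below is an assumption about their behavior (Euclidean distance; random distinct samples as initial centroids); in the class, `createCenter()` presumably reads the cluster count from `self.k`.

```python
import numpy as np

def calculateDistance(data, center):
    # Euclidean distance from every sample to one center -> shape (n,)
    return np.sqrt(np.sum((data - center) ** 2, axis=1))

def createCenter(data, k):
    # pick k distinct samples at random as the initial centroids
    idx = np.random.choice(len(data), size=k, replace=False)
    return data[idx].copy()
```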
The `adjustCluster()` function performs the refinement step after the initial centroids have been chosen; it minimizes the loss function $J$. Its code is:
```python
def adjustCluster(self, centers, distances, train_data, k):
    sample_num = len(train_data)
    flag = True  # if True, keep updating the cluster centers
    while flag:
        flag = False
        d = np.zeros([sample_num, len(centers)])
        for i in range(len(centers)):
            # calculate the distance between each sample and each cluster center
            d[:, i] = self.calculateDistance(train_data, centers[i])
        # assign each sample to its nearest cluster center
        old_label = distances[:, 0].copy()
        distances[:, 0] = np.argmin(d, axis=1)
        distances[:, 1] = np.min(d, axis=1)
        # np.any() detects any changed assignment; a summed difference could
        # cancel out and stop the loop too early
        if np.any(old_label != distances[:, 0]):
            flag = True
            # update each cluster center as the mean of its samples
            for j in range(k):
                current_cluster = train_data[distances[:, 0] == j]  # samples in the j-th cluster
                if len(current_cluster) != 0:
                    centers[j, :] = np.mean(current_cluster, axis=0)
    return centers, distances
```
2.2 Bisecting KMeans
Because KMeans can converge to a local minimum, bisecting KMeans was introduced to address this problem. The idea is to first treat all samples as one large cluster and split it in two; then one of the resulting clusters is chosen and split again, until the number of clusters reaches the specified $k$. How is the cluster to split chosen? The sum of squared errors (SSE) is used as the criterion. Suppose there are currently $m$ clusters, denoted

$$C = \{c_1, c_2, \dots, c_m\}$$

The selection procedure works as follows: split a cluster $c_i$ of $C$ into two parts $c_{i1}$ and $c_{i2}$ (using ordinary KMeans); the resulting SSE is

$$SSE_i = SSE(c_{i1}, c_{i2}) + SSE(C - c_i)$$

and the cluster chosen for splitting is

$$c = \underset{c_i}{\arg\min}\; SSE_i$$

This is repeated until the number of centroids equals the specified $k$.
The code is as follows:
```python
def biKmeans(self, train_data):
    sample_num = len(train_data)
    distances = np.zeros([sample_num, 2])  # (cluster index, distance)
    initial_center = np.mean(train_data, axis=0)  # initial centroid, shape (feature_dim,)
    centers = [initial_center]  # cluster center list
    # clustering with the initial cluster center; store plain distances,
    # consistent with what kmeans()/adjustCluster() return
    distances[:, 1] = self.calculateDistance(train_data, initial_center)
    # generate cluster centers
    while len(centers) < self.k:
        min_SSE = np.inf
        best_index = None
        best_centers = None
        best_distances = None
        # find the best split
        for j in range(len(centers)):
            centerj_data = train_data[distances[:, 0] == j]  # samples in the j-th cluster
            split_centers, split_distances = self.kmeans(centerj_data, 2)
            # SSE is the sum of squared distances, not the squared sum
            split_SSE = np.sum(split_distances[:, 1] ** 2)  # SSE of the split cluster
            other_distances = distances[distances[:, 0] != j]  # samples outside the j-th cluster
            other_SSE = np.sum(other_distances[:, 1] ** 2)  # SSE of the remaining clusters
            # save the best split result
            if (split_SSE + other_SSE) < min_SSE:
                best_index = j
                best_centers = split_centers
                best_distances = split_distances
                min_SSE = split_SSE + other_SSE
        # save the split assignment
        best_distances[best_distances[:, 0] == 1, 0] = len(centers)
        best_distances[best_distances[:, 0] == 0, 0] = best_index
        centers[best_index] = best_centers[0, :]
        centers.append(best_centers[1, :])
        distances[distances[:, 0] == best_index, :] = best_distances
    centers = np.array(centers)  # transform from list to array
    return centers, distances
```
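The split-selection rule $c = \arg\min_{c_i} SSE_i$ can be illustrated in isolation. The toy sketch below replaces the inner 2-means run with a crude median split, purely an assumption for brevity; the bookkeeping (SSE of the split plus the SSE of the untouched clusters) mirrors the loop above. As expected, the spread-out cluster is the one selected for splitting.

```python
import numpy as np

def sse(points):
    # SSE of a set of points about its own mean (the cluster centroid)
    return np.sum((points - points.mean(axis=0)) ** 2)

def split(points):
    # crude 2-way split along the first feature's median
    # (stands in for the inner 2-means run of the real algorithm)
    mask = points[:, 0] <= np.median(points[:, 0])
    return points[mask], points[~mask]

clusters = [
    np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [0.2, 0.2]]),  # tight
    np.array([[5.0, 5.0], [9.0, 9.0], [5.0, 9.0], [9.0, 5.0]]),  # spread out
]

totals = []
for i, c in enumerate(clusters):
    a, b = split(c)
    split_SSE = sse(a) + sse(b)
    other_SSE = sum(sse(o) for j, o in enumerate(clusters) if j != i)
    totals.append(split_SSE + other_SSE)

best = int(np.argmin(totals))  # index of the cluster chosen for splitting
```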
2.3 KMeans++
Because the initial choice of centroids has a large influence on the KMeans clustering, the KMeans++ algorithm was introduced. Its idea: suppose $n$ cluster centers ($n < k$) have already been chosen,

$$\mu_1, \mu_2, \dots, \mu_n$$

Then, when selecting the $(n+1)$-th cluster center, points farther from the current $n$ centers have a higher probability of being chosen as the $(n+1)$-th center. This also matches intuition: cluster centers should of course be as far apart from each other as possible. First compute the shortest distance from each sample $x$ to the existing cluster centers,

$$D(x) = \min_{1 \le i \le n} \lVert x - \mu_i \rVert$$

then compute the probability of each sample being selected as the next cluster center,

$$P(x) = \frac{D(x)^2}{\sum_{x'} D(x')^2}$$

and pick the next center by roulette wheel selection. Once all $k$ centroids have been chosen, `adjustCluster()` is run to refine them. The KMeans++ code is as follows:
```python
def kmeansplusplus(self, train_data):
    sample_num = len(train_data)
    distances = np.zeros([sample_num, 2])  # (cluster index, distance)
    # randomly select a sample as the initial cluster center;
    # np.random.randint draws from the half-open interval [0, sample_num)
    initial_center = train_data[np.random.randint(0, sample_num)]
    centers = [initial_center]
    while len(centers) < self.k:
        d = np.zeros([sample_num, len(centers)])
        for i in range(len(centers)):
            # calculate the distance between each sample and each cluster center
            d[:, i] = self.calculateDistance(train_data, centers[i])
        # find the shortest distance from each sample to the existing centers
        distances[:, 0] = np.argmin(d, axis=1)
        distances[:, 1] = np.min(d, axis=1)
        # roulette wheel selection
        prob = np.power(distances[:, 1], 2) / np.sum(np.power(distances[:, 1], 2))
        index = self.rouletteWheelSelection(prob, sample_num)
        new_center = train_data[index, :]
        centers.append(new_center)
    # adjust the clusters with the chosen centers
    centers = np.array(centers)  # transform from list to array
    centers, distances = self.adjustCluster(centers, distances, train_data, self.k)
    return centers, distances
```
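The `rouletteWheelSelection()` helper is not shown in the post. A minimal sketch of the standard technique (cumulative probabilities plus a single uniform draw) might look like the following; the `sample_num` argument the class passes is not needed for a single draw, so it is accepted but unused here.

```python
import numpy as np

def rouletteWheelSelection(prob, sample_num=None):
    # prob: 1-D array of selection probabilities summing to 1.
    # Returns one index i, chosen with probability prob[i].
    # sample_num is unused in this single-draw sketch.
    cum = np.cumsum(prob)
    r = np.random.rand()  # uniform draw in [0, 1)
    # side='right' maps r in [cum[i-1], cum[i]) to index i
    return int(np.searchsorted(cum, r, side='right'))
```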
3. Summary and Analysis
After the clustering converges, the centroids can be further refined with various techniques. There are also methods for choosing the value of $k$, such as the silhouette coefficient (Silhouette Coefficient). Finally, let us compare the results of the three clustering methods.

KMeans++ produces the best clustering results among the three methods, and their running times are roughly the same.
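As an aside on choosing $k$: the silhouette coefficient mentioned above can be computed with a small numpy-only sketch like the one below; in practice a library routine such as `sklearn.metrics.silhouette_score` does the same job.

```python
import numpy as np

def silhouette(data, labels):
    # Mean silhouette coefficient s(i) = (b - a) / max(a, b), where a is the
    # mean distance of sample i to its own cluster and b is the smallest mean
    # distance to any other cluster. data: 1-D or (n, d) array.
    n = len(data)
    scores = np.zeros(n)
    for i in range(n):
        if data.ndim == 1:
            d = np.abs(data - data[i])
        else:
            d = np.linalg.norm(data - data[i], axis=1)
        own = labels == labels[i]
        a = d[own & (np.arange(n) != i)].mean()
        b = min(d[labels == l].mean() for l in set(labels) if l != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

# two well-separated 1-D clusters give a score close to 1
data = np.array([0.0, 0.1, 0.2, 10.0, 10.1, 10.2])
labels = np.array([0, 0, 0, 1, 1, 1])
score = silhouette(data, labels)
```

To pick $k$, one would run the clustering for several candidate values and keep the $k$ with the highest mean silhouette score.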
The code and dataset for this article are available at: https://github.com/Ryuk17/MachineLearning