Implementing Machine Learning Algorithms from Scratch (11): KMeans

Contents

1. Introduction to KMeans

2. The KMeans Model

2.1 KMeans

2.2 Bisecting KMeans

2.3 KMeans++

3. Summary and Analysis


1. Introduction to KMeans

KMeans is a simple clustering method that assigns each sample to a cluster based on its distance to the cluster centers, where K is a user-specified number of clusters. Initially, K points are chosen at random as the cluster centers (centroids), and the centers are then updated iteratively until the result no longer improves. Because each centroid is computed as the mean of its cluster, the method is also known as K-means.

2. The KMeans Model

2.1 KMeans

The KMeans algorithm is fairly simple. Denote the K cluster centers by \mu_{1},\mu_{2},\ldots,\mu_{K} and the number of samples in each cluster by N_{1},N_{2},\ldots,N_{K}. KMeans uses the sum of squared errors as its objective function:

J\left(\mu_{1},\mu_{2},\ldots,\mu_{K}\right)=\frac{1}{2}\sum_{j=1}^{K}\sum_{i=1}^{N_{j}}\left(x_{i}-\mu_{j}\right)^{2}

Taking the partial derivative of the loss function with respect to each \mu_{j} gives

\frac{\partial J}{\partial\mu_{j}}=-\sum_{i=1}^{N_{j}}\left(x_{i}-\mu_{j}\right)

Setting the derivative to zero and solving yields

\mu_{j}=\frac{1}{N_{j}}\sum_{i=1}^{N_{j}}x_{i}

That is, the centroid of each cluster is the mean of its samples.
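As a quick sanity check of this result, here is a minimal NumPy snippet (a toy example, not from the original article) showing that the cluster mean attains a lower SSE than a slightly perturbed center:

    import numpy as np

    # toy cluster: for a fixed assignment, the mean minimizes the sum of squared errors
    x = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 4.0]])
    mu = x.mean(axis=0)                     # (2.0, 2.0)

    def sse(center):
        return np.sum(np.power(x - center, 2))

    print(sse(mu))          # 10.0, the smallest achievable value
    print(sse(mu + 0.1))    # 10.06, perturbing the mean increases J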

The KMeans code is as follows:

    def kmeans(self, train_data, k):
        sample_num = len(train_data)
        distances = np.zeros([sample_num, 2])                      # (cluster index, distance to its center)
        centers = self.createCenter(train_data, k)                 # randomly pick k initial centroids
        centers, distances = self.adjustCluster(centers, distances, train_data, k)
        return centers, distances
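The code above relies on two helpers that the article does not show: createCenter() and calculateDistance(). A minimal sketch of what they might look like, assuming random initialization and Euclidean distance, is:

    def createCenter(self, train_data, k):
        # assumed: pick k distinct samples at random as the initial centroids
        indices = np.random.choice(len(train_data), k, replace=False)
        return train_data[indices].copy()

    def calculateDistance(self, train_data, center):
        # assumed: Euclidean distance from every sample to a single center
        return np.sqrt(np.sum(np.power(train_data - center, 2), axis=1))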

The adjustCluster() function performs the refinement step once the initial centroids have been chosen; it iteratively minimizes the loss function J. Its code is:

    def adjustCluster(self, centers, distances, train_data, k):
        sample_num = len(train_data)
        flag = True  # if True, keep updating the cluster centers
        while flag:
            flag = False
            d = np.zeros([sample_num, len(centers)])
            for i in range(len(centers)):
                # calculate the distance between each sample and each cluster center
                d[:, i] = self.calculateDistance(train_data, centers[i])

            # assign each sample to its nearest cluster center
            old_label = distances[:, 0].copy()
            distances[:, 0] = np.argmin(d, axis=1)
            distances[:, 1] = np.min(d, axis=1)
            if np.any(old_label != distances[:, 0]):   # some label changed -> not converged yet
                flag = True
                # update each cluster center as the mean of its cluster
                for j in range(k):
                    current_cluster = train_data[distances[:, 0] == j]  # samples assigned to the j-th center
                    if len(current_cluster) != 0:
                        centers[j, :] = np.mean(current_cluster, axis=0)
        return centers, distances
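As a usage illustration (assuming the methods above live in a wrapper class, hypothetically called KMeans here, with a k attribute), clustering two synthetic blobs could look like this:

    import numpy as np

    np.random.seed(0)
    blob1 = np.random.randn(50, 2)                          # points around (0, 0)
    blob2 = np.random.randn(50, 2) + np.array([5.0, 5.0])   # points around (5, 5)
    train_data = np.vstack([blob1, blob2])

    model = KMeans(k=2)                                     # hypothetical wrapper class
    centers, distances = model.kmeans(train_data, 2)
    print(centers)                                          # should land near (0, 0) and (5, 5)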

2.2 Bisecting KMeans

Because KMeans can converge to a local minimum, bisecting KMeans was introduced to mitigate this problem. The idea is to first treat all samples as one large cluster and split it in two, then repeatedly pick one of the existing clusters and split it again until the number of clusters reaches the specified K. How do we choose which cluster to split? The sum of squared errors (SSE) serves as the criterion. Suppose there are currently n clusters, denoted

C=\left\{c_{1},c_{2},\ldots,c_{n}\right\},\quad n<K

The selection works as follows: splitting a cluster c_{i} of C into two parts c_{i1},c_{i2} (using ordinary KMeans) gives a total SSE of

SSE_{i}=SSE\left(c_{i1},c_{i2}\right)+SSE\left(C-c_{i}\right)

and the cluster chosen for splitting is the one that minimizes this total:

index=\arg\min_{i}SSE_{i}

This is repeated until the number of centroids equals the specified K.

The code is as follows:

    def biKmeans(self, train_data):
        sample_num = len(train_data)
        distances = np.zeros([sample_num, 2])                                  # (cluster index, distance to its center)
        initial_center = np.mean(train_data, axis=0)                           # initial centroid, shape (feature_dim,)
        centers = [initial_center]                                             # list of cluster centers

        # distances to the single initial center
        distances[:, 1] = self.calculateDistance(train_data, initial_center)

        # generate cluster centers
        while len(centers) < self.k:
            min_SSE = np.inf
            best_index = None
            best_centers = None
            best_distances = None

            # find the best split
            for j in range(len(centers)):
                centerj_data = train_data[distances[:, 0] == j]                  # samples assigned to the j-th center
                split_centers, split_distances = self.kmeans(centerj_data, 2)    # split the j-th cluster in two
                split_SSE = np.sum(np.power(split_distances[:, 1], 2))           # SSE of the two new clusters
                other_distances = distances[distances[:, 0] != j]                # samples not assigned to the j-th center
                other_SSE = np.sum(np.power(other_distances[:, 1], 2))           # SSE of the untouched clusters

                # keep the split with the lowest total SSE
                if (split_SSE + other_SSE) < min_SSE:
                    best_index = j
                    best_centers = split_centers
                    best_distances = split_distances
                    min_SSE = split_SSE + other_SSE

            # relabel the split data: label 1 becomes a new cluster index, label 0
            # keeps the index of the cluster that was split (1 must be relabeled
            # before 0 so the two assignments do not clash)
            best_distances[best_distances[:, 0] == 1, 0] = len(centers)
            best_distances[best_distances[:, 0] == 0, 0] = best_index

            centers[best_index] = best_centers[0, :]
            centers.append(best_centers[1, :])
            distances[distances[:, 0] == best_index, :] = best_distances
        centers = np.array(centers)   # convert from list to array
        return centers, distances
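Since all three methods in this article return the same (label, distance) table, the total SSE of a clustering can be computed directly from it to compare runs. A small helper (not part of the original code) might be:

    def total_sse(distances):
        # distances[:, 1] holds each sample's distance to its assigned center
        return np.sum(np.power(distances[:, 1], 2))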

2.3 KMeans++

Because the choice of initial centroids has a large impact on the result of KMeans, the KMeans++ algorithm was introduced. It works as follows: suppose there are currently n cluster centers,

C=\left\{c_{1},c_{2},\ldots,c_{n}\right\},\quad n<K

When choosing the (n+1)-th cluster center, points farther from the current n centers have a higher probability of being selected. This matches intuition: cluster centers should be as far apart from each other as possible. First, compute the shortest distance from each sample to the existing cluster centers:

D\left(x_{i}\right)=\min\limits_{c_{j}\in C}\left\|x_{i}-c_{j}\right\|

Then compute the probability of each sample being chosen as the next cluster center:

p_{i}=\frac{D\left(x_{i}\right)^{2}}{\sum_{x\in X}D\left(x\right)^{2}}

The next cluster center is then drawn by roulette wheel selection. After all K centroids have been chosen, adjustCluster() is run to refine them. The KMeans++ code is as follows:

    def kmeansplusplus(self, train_data):
        sample_num = len(train_data)
        distances = np.zeros([sample_num, 2])                                  # (cluster index, distance to its center)

        # randomly select a sample as the initial cluster center
        # (np.random.randint excludes the upper bound, so use sample_num)
        initial_center = train_data[np.random.randint(0, sample_num)]
        centers = [initial_center]

        while len(centers) < self.k:
            d = np.zeros([sample_num, len(centers)])
            for i in range(len(centers)):
                # calculate the distance between each sample and each cluster center
                d[:, i] = self.calculateDistance(train_data, centers[i])

            # find the shortest distance from each sample to the existing centers
            distances[:, 0] = np.argmin(d, axis=1)
            distances[:, 1] = np.min(d, axis=1)

            # roulette wheel selection: probability proportional to squared distance
            prob = np.power(distances[:, 1], 2)/np.sum(np.power(distances[:, 1], 2))
            index = self.rouletteWheelSelection(prob, sample_num)
            new_center = train_data[index, :]
            centers.append(new_center)

        # refine the chosen centroids
        centers = np.array(centers)   # convert from list to array
        centers, distances = self.adjustCluster(centers, distances, train_data, self.k)
        return centers, distances
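The rouletteWheelSelection() helper is not shown in the article. A possible implementation, assuming prob sums to 1, draws a uniform random number and walks the cumulative probability until it is exceeded:

    def rouletteWheelSelection(self, prob, sample_num):
        # assumed helper: sample an index with probability proportional to prob
        r = np.random.rand()
        acc = 0.0
        for i in range(sample_num):
            acc += prob[i]
            if acc >= r:
                return i
        return sample_num - 1   # guard against floating-point round-off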

3. Summary and Analysis

After the clustering algorithm converges, the centroids can be further refined with various methods. The value of K can also be chosen in a principled way, for example with the silhouette coefficient, as the sketch below shows.
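A hedged sketch of choosing K with the silhouette coefficient, using scikit-learn only for the metric itself together with the hypothetical KMeans wrapper class from earlier:

    from sklearn.metrics import silhouette_score

    best_k, best_score = None, -1.0
    for k in range(2, 10):
        model = KMeans(k=k)                              # hypothetical wrapper class
        centers, distances = model.kmeans(train_data, k) # train_data as in the usage example above
        labels = distances[:, 0].astype(int)
        score = silhouette_score(train_data, labels)     # in [-1, 1]; higher is better
        if score > best_score:
            best_k, best_score = k, score
    print(best_k, best_score)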

Finally, let us look at how the three clustering methods perform.

[Figures: clustering results of KMeans, bisecting KMeans, and KMeans++]
We can see that KMeans++ produces the best clustering results, while all three methods take roughly the same running time.

 

The code and datasets for this article are available at: https://github.com/Ryuk17/MachineLearning

 

