[Machine Learning] (5.2) Clustering -- Kmeans

K-means is an unsupervised model.

Clustering algorithms need a way to measure the distance between samples. For the available distance metrics, see 【机器学习】(5)聚类--距离度量 (mjiansun's blog on CSDN).

Euclidean distance is the usual choice.
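As a small illustration of the distance used throughout this post, the sketch below computes the Euclidean distance between two vectors and the squared distance from every sample to one center. The function names are hypothetical helpers, not part of any library.

```python
import numpy as np

def euclidean_distance(a, b):
    """Return the Euclidean (L2) distance between two feature vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

def squared_distances_to_center(x, center):
    """Squared Euclidean distance from every row of x to a single center."""
    x = np.asarray(x, dtype=float)
    center = np.asarray(center, dtype=float)
    return np.sum((x - center) ** 2, axis=1)

print(euclidean_distance([0, 0], [3, 4]))
print(squared_distances_to_center([[0, 0], [3, 4]], [0, 0]))
```

K-means only compares distances, so the squared form is enough for assigning samples and avoids the square root.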

1. K-means

1.1 Basic idea

1.2 Algorithm steps
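For reference, a common way to write the K-means objective and its two alternating steps. The notation is an assumption made here, not taken from this post: $x_i$ are the samples, $\mu_k$ the center of cluster $C_k$, and $c_i$ the cluster index assigned to $x_i$.

```latex
\begin{align}
J &= \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2
   && \text{(objective: total squared error)} \\
c_i &= \arg\min_{k} \lVert x_i - \mu_k \rVert^2
   && \text{(assignment step)} \\
\mu_k &= \frac{1}{\lvert C_k \rvert} \sum_{x_i \in C_k} x_i
   && \text{(update step)}
\end{align}
```

The two steps are repeated until $J$ stops changing by more than a small threshold or an iteration limit is reached, which matches the stopping criteria used in the scripts later in this post.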

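Notes and questions:

1. How should the initial values be chosen?

There are several options:

(1) Use prior knowledge to set the K initial values. For example, when clustering heights by sex, start from 175 cm for men and 165 cm for women.

(2) Pick K samples at random as the initial values. Random selection, however, can lead to the following situation:

(3) To mitigate the problem in (2), choose the initial values with kmeans++ (this is the usual choice).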
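The three options above can be tried side by side with the `init` parameter of scikit-learn's `KMeans`, which accepts an explicit array of centers, `"random"`, or `"k-means++"`. This is a minimal sketch; the height data is synthetic and made up for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic heights in cm (assumed data, not from the post).
rng = np.random.default_rng(0)
heights = np.concatenate([
    rng.normal(175, 5, size=200),
    rng.normal(165, 5, size=200),
]).reshape(-1, 1)

# (1) Prior knowledge: start from 175 cm and 165 cm.
prior_init = np.array([[175.0], [165.0]])
km_prior = KMeans(n_clusters=2, init=prior_init, n_init=1).fit(heights)

# (2) K samples drawn at random from the data.
km_random = KMeans(n_clusters=2, init="random", n_init=1, random_state=0).fit(heights)

# (3) k-means++ seeding.
km_pp = KMeans(n_clusters=2, init="k-means++", n_init=1, random_state=0).fit(heights)

for name, km in [("prior", km_prior), ("random", km_random), ("k-means++", km_pp)]:
    print(name, np.sort(km.cluster_centers_.ravel()), km.inertia_)
```

`n_init=1` is set so that each model reflects a single initialization rather than the best of several restarts.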

2. K-means uses the mean of all points in a cluster as the new centroid. Is that reasonable?

However, this alternative is rarely used in practice.
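One reason the question comes up is that the mean is sensitive to outliers. The sketch below compares the mean with the median on a single made-up cluster; the median is only an assumed example of a more robust center and may not be the alternative the post has in mind.

```python
import numpy as np

# One cluster of made-up 1-D values with a single extreme outlier.
cluster = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

mean_center = cluster.mean()        # what K-means uses as the new centroid
median_center = np.median(cluster)  # a more outlier-robust center (assumed alternative)

print(mean_center, median_center)
```

The single outlier drags the mean far away from the bulk of the points, while the median stays among them.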

2. Better initialization -- kmeans++

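The following simple example shows how K-means++ picks the initial cluster centers. The dataset has 8 samples; their distribution and index numbers are shown in the figure below:

Suppose point 6 is chosen as the first initial cluster center in step 1 of Figure 2. In step 2, each sample's D(x) and its probability of being chosen as the second cluster center are then as shown in the table below:

Here P(x) is the probability that each sample is chosen as the next cluster center. The Sum in the last row is the running total of P(x), which is used for roulette-wheel selection of the second cluster center: draw a random number between 0 and 1, find the interval it falls into, and take the sample that owns that interval as the second cluster center. For example, point 1 owns the interval [0, 0.2) and point 2 owns [0.2, 0.525).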
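The roulette-wheel step can be sketched as follows. The weights are illustrative assumptions: they are chosen so that the first two intervals match the ones quoted above ([0, 0.2) and [0.2, 0.525)), with point 6 at weight 0 because it is already a center. `roulette_pick` is a hypothetical helper.

```python
import numpy as np

# Assumed weights for points 1..8 (point 6 is the first center, so its weight is 0).
weights = np.array([8.0, 13.0, 5.0, 10.0, 1.0, 0.0, 2.0, 1.0])

p = weights / weights.sum()   # P(x): probability of each point
cum = np.cumsum(p)            # Sum: right edge of each point's interval

def roulette_pick(cum, r):
    """Return the 0-based index whose interval [cum[i-1], cum[i]) contains r."""
    idx = int(np.searchsorted(cum, r, side="right"))
    # Guard against rounding leaving the last edge slightly below 1.
    return min(idx, len(cum) - 1)

r = 0.3  # a fixed "random" number so the example is reproducible
print(p)
print(cum)
print("picked point", roulette_pick(cum, r) + 1)
```

With `r = 0.3` the draw falls into the second interval, so point 2 is selected. A point with zero weight owns an empty interval and can never be picked.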

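The table shows that the second initial cluster center is one of points 1, 2, 3, and 4 with probability 0.9, and these are exactly the four points farthest from the first center, point 6. This confirms the idea behind K-means++: points farther from the existing cluster centers are more likely to be chosen as the next center. K = 2 is a suitable choice for this example. When K is greater than 2, each sample has a distance to every center chosen so far, and the smallest of those distances is used as D(x).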
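For the K > 2 case, taking the smallest distance over all centers chosen so far can be sketched with NumPy broadcasting. The helper name and the toy points are assumptions for illustration; the distances are squared, as in the scripts later in this post.

```python
import numpy as np

def nearest_center_sq_dist(x, centers):
    """For each row of x, squared distance to the closest already-chosen center."""
    x = np.asarray(x, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # diff has shape (n_samples, n_centers, n_features) through broadcasting
    diff = x[:, None, :] - centers[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)
    return sq_dist.min(axis=1)

points = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]])
chosen = np.array([[0.0, 0.0], [4.0, 0.0]])
print(nearest_center_sq_dist(points, chosen))
```

The scripts below reach the same result incrementally with `np.minimum(closest_distance, candidate_distance)`, which avoids recomputing distances to earlier centers.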

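The method above uses probabilities, but in code the explicit probabilities can be skipped. It is enough to do the following:

import random

def pick_next_center(x, d):
    # d[i] is the D(x) weight of sample x[i]
    cur_sum = 0
    thr = random.random() * sum(d)
    for sample, d_i in zip(x, d):
        cur_sum += d_i
        if cur_sum > thr:
            return sample
    return x[-1]  # fallback in case rounding keeps cur_sum at or below thr

A complete K-means script that uses this initialization:
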
import numpy as np
import sklearn.datasets as ds
import random
import matplotlib as mpl
import matplotlib.pyplot as plt

def Init_center(x, k = 4):
    '''
    Choose k initial centers with kmeans++ seeding.
    '''

    # Pick the first center uniformly at random from the samples
    cluster_centers = []
    cluster_center_indice = random.randrange(len(x))
    cluster_center = x[cluster_center_indice, :]
    cluster_centers.append(cluster_center)
    # Squared distance from every sample to the first center
    closest_distance = np.sum((cluster_center - x) ** 2, axis=1)
    pot_distance = np.sum(closest_distance)
    for i in range(1, k):
        thr_distance = pot_distance * np.random.random()

        # Roulette-wheel selection: walk the running sum of squared
        # distances until it exceeds the threshold
        candidate_center = x[-1, :]  # fallback if rounding keeps the sum below the threshold
        temp_distance_sum = 0
        for temp_index, temp in enumerate(closest_distance):
            temp_distance_sum += temp
            if temp_distance_sum > thr_distance:
                candidate_center = x[temp_index, :]
                break
        # Squared distance from every sample to the new center
        candidate_distance = np.sum((candidate_center - x) ** 2, axis=1)

        closest_distance = np.minimum(closest_distance, candidate_distance)
        pot_distance = closest_distance.sum()

        cluster_centers.append(candidate_center)
    return np.array(cluster_centers)


if __name__ == "__main__":
    x, y = ds.make_blobs(400, n_features=2, centers=4, random_state=2018)

    sampleNum, featureNum = x.shape
    k = 4
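    ############ Stopping criteria: meeting either one is enough ################
    # Maximum number of iterations
    iter_num = 1000
    # Stop when the total squared error changes by less than this between two consecutive iterations
    loss_thr = 1e-4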

    cluster_centers = Init_center(x, k = 4)

    # Iteratively recompute the cluster centers
    previous_d = 0
    cur_iter = 1
    cluster_centers = np.array(cluster_centers)
    print(cluster_centers)
    while True:
        cluster_set = [[] for i in range(k)]
        data_cls = []
        cur_d = 0
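        for x_i in x:
            err_array = cluster_centers - x_i
            # Squared Euclidean distance to every center, for any number of features
            D = np.sum(err_array ** 2, axis=1)
            select_indice = np.argmin(D)
            cluster_set[select_indice].append(x_i)
            cur_d += D[select_indice]
            data_cls.append(select_indice)
        # Move each center to the mean of its members; keep the old center if a cluster is empty
        cluster_centers = np.array([np.mean(members, axis=0) if members else cluster_centers[j]
                                    for j, members in enumerate(cluster_set)])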

        if abs(cur_d - previous_d) <= loss_thr or cur_iter >= iter_num:
            break

        previous_d = cur_d
        cur_iter += 1

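    # Plot the raw data (left) and the clustering result (right)
    plt.figure(figsize=(8, 4))
    plt.subplot(121)
    plt.plot(x[:, 0], x[:, 1], 'r.', ms=3)
    plt.subplot(122)
    plt.scatter(x[:, 0], x[:, 1], c=data_cls, marker='.', cmap=mpl.colors.ListedColormap(list('rgbm')))
    plt.tight_layout(pad=2)  # pad must be passed by keyword in recent matplotlib versions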
    plt.show()


    print("end")

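As the figure above shows, the code was run twice: one run clustered poorly and the other did well. The cause turned out to be a poor choice of initial points. So how can the choice of initial points be improved?

Answer: draw several candidate points at each step, and keep the candidate that gives the smallest total squared Euclidean error. The code is as follows: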

import numpy as np
import sklearn.datasets as ds
import random
import matplotlib as mpl
import matplotlib.pyplot as plt


def Init_center_adjust(x, k = 4, n_local_trials=None):
    '''
    Improved kmeans++ seeding: try several candidates for each new center.
    '''
    # Set the number of local seeding trials if none is given
    if n_local_trials is None:
        # This is what Arthur/Vassilvitskii tried, but did not report
        # specific results for other than mentioning in the conclusion
        # that it helped.
        n_local_trials = 2 + int(np.log(k))

    # Pick the first center uniformly at random from the samples
    cluster_centers = []
    cluster_center_indice = random.randrange(len(x))
    cluster_center = x[cluster_center_indice, :]
    cluster_centers.append(cluster_center)
    # Squared distance from every sample to the first center
    closest_distance = np.sum((cluster_center - x) ** 2, axis=1)
    pot_distance = np.sum(closest_distance)
    for i in range(1, k):
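        # Draw n_local_trials candidates, each with probability proportional
        # to its squared distance to the nearest center chosen so far
        thr_distances = pot_distance * np.random.random(n_local_trials)
        closest_distance_cumsum = np.cumsum(closest_distance)
        candidate_indices = np.searchsorted(closest_distance_cumsum, thr_distances)
        # Guard against rounding pushing an index past the last sample
        candidate_indices = np.minimum(candidate_indices, len(x) - 1)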
        candidate_centers = x[candidate_indices]
        candidate_distances = []
        for candidate_center in candidate_centers:
            candidate_distance = np.sum((candidate_center - x) ** 2, axis=1)
            candidate_distances.append(candidate_distance)
        candidate_distances = np.array(candidate_distances)

        candidate_distances = np.minimum(closest_distance, candidate_distances)
        candidates_pots = candidate_distances.sum(axis=1)

        best_indice = np.argmin(candidates_pots)
        best_center = candidate_centers[best_indice]
        closest_distance = candidate_distances[best_indice]
        pot_distance = candidates_pots[best_indice]

        cluster_centers.append(best_center)
    return np.array(cluster_centers)


if __name__ == "__main__":
    x, y = ds.make_blobs(400, n_features=2, centers=4, random_state=2018)

    sampleNum, featureNum = x.shape
    k = 4
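    ############ Stopping criteria: meeting either one is enough ################
    # Maximum number of iterations
    iter_num = 1000
    # Stop when the total squared error changes by less than this between two consecutive iterations
    loss_thr = 1e-4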

    cluster_centers = Init_center_adjust(x, k = 4, n_local_trials=None)

    # Iteratively recompute the cluster centers
    previous_d = 0
    cur_iter = 1
    cluster_centers = np.array(cluster_centers)
    print(cluster_centers)
    while True:
        cluster_set = [[] for i in range(k)]
        data_cls = []
        cur_d = 0
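        for x_i in x:
            err_array = cluster_centers - x_i
            # Squared Euclidean distance to every center, for any number of features
            D = np.sum(err_array ** 2, axis=1)
            select_indice = np.argmin(D)
            cluster_set[select_indice].append(x_i)
            cur_d += D[select_indice]
            data_cls.append(select_indice)
        # Move each center to the mean of its members; keep the old center if a cluster is empty
        cluster_centers = np.array([np.mean(members, axis=0) if members else cluster_centers[j]
                                    for j, members in enumerate(cluster_set)])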

        if abs(cur_d - previous_d) <= loss_thr or cur_iter >= iter_num:
            break

        previous_d = cur_d
        cur_iter += 1

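    # Plot the raw data (left) and the clustering result (right)
    plt.figure(figsize=(8, 4))
    plt.subplot(121)
    plt.plot(x[:, 0], x[:, 1], 'r.', ms=3)
    plt.subplot(122)
    plt.scatter(x[:, 0], x[:, 1], c=data_cls, marker='.', cmap=mpl.colors.ListedColormap(list('rgbm')))
    plt.tight_layout(pad=2)  # pad must be passed by keyword in recent matplotlib versions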
    plt.show()


    print("end")

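Even so, the result can still be off from time to time. A further improvement is to run the algorithm several times and keep the set of centers with the smallest error. This is also what sklearn does: setting n_init=10 in KMeans runs the algorithm 10 times and keeps the run with the smallest error.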
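The restart strategy can be sketched in two equivalent ways with scikit-learn: looping over seeds by hand and keeping the lowest `inertia_` (the total squared error), or passing `n_init` and letting `KMeans` do it. The seed range and the value 10 are arbitrary choices for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

x, _ = make_blobs(400, n_features=2, centers=4, random_state=2018)

# Manual restarts: one initialization per run, keep the run with the lowest inertia.
runs = [
    KMeans(n_clusters=4, init="k-means++", n_init=1, random_state=seed).fit(x)
    for seed in range(10)
]
best_manual = min(runs, key=lambda km: km.inertia_)

# The same idea delegated to scikit-learn through n_init.
best_auto = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0).fit(x)

print(best_manual.inertia_, best_auto.inertia_)
```

The two results need not be identical because the random initializations differ, but both keep the best of ten runs.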

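3. Speeding up K-means

The initial points are still chosen with kmeans++. The difference is that each iteration updates the cluster centers from a subset of the samples (a mini-batch) rather than from the full dataset.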

import numpy as np
import sklearn.datasets as ds
import random
import matplotlib as mpl
import matplotlib.pyplot as plt

def Init_center(x, k = 4):
    '''
    Choose k initial centers with kmeans++ seeding.
    '''

    # Pick the first center uniformly at random from the samples
    cluster_centers = []
    cluster_center_indice = random.randrange(len(x))
    cluster_center = x[cluster_center_indice, :]
    cluster_centers.append(cluster_center)
    # Squared distance from every sample to the first center
    closest_distance = np.sum((cluster_center - x) ** 2, axis=1)
    pot_distance = np.sum(closest_distance)
    for i in range(1, k):
        thr_distance = pot_distance * np.random.random()

        # Roulette-wheel selection: walk the running sum of squared
        # distances until it exceeds the threshold
        candidate_center = x[-1, :]  # fallback if rounding keeps the sum below the threshold
        temp_distance_sum = 0
        for temp_index, temp in enumerate(closest_distance):
            temp_distance_sum += temp
            if temp_distance_sum > thr_distance:
                candidate_center = x[temp_index, :]
                break
        # Squared distance from every sample to the new center
        candidate_distance = np.sum((candidate_center - x) ** 2, axis=1)

        closest_distance = np.minimum(closest_distance, candidate_distance)
        pot_distance = closest_distance.sum()

        cluster_centers.append(candidate_center)
    return np.array(cluster_centers)


if __name__ == "__main__":
    x, y = ds.make_blobs(400, n_features=2, centers=4, random_state=2018)

    sampleNum, featureNum = x.shape
    k = 4
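    ############ Stopping criteria: meeting either one is enough ################
    # Maximum number of iterations
    iter_num = 1000
    # Stop when the total squared error changes by less than this between two consecutive iterations
    loss_thr = 1e-4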

    cluster_centers = Init_center(x, k = 4)

    # Iteratively recompute the cluster centers
    previous_d = 0
    cur_iter = 1
    cluster_centers = np.array(cluster_centers)
    print(cluster_centers)
    while True:
        cluster_set = [[] for i in range(k)]
        data_cls = []
        cur_d = 0

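        # Mini-batch step: this short section is the only change.
        # Each iteration uses a random 10% of the samples.
        batch_indices = random.sample(range(sampleNum), k=sampleNum // 10)
        x_batch = x[batch_indices, :]

        for x_i in x_batch:
            err_array = cluster_centers - x_i
            # Squared Euclidean distance to every center, for any number of features
            D = np.sum(err_array ** 2, axis=1)
            select_indice = np.argmin(D)
            cluster_set[select_indice].append(x_i)
            cur_d += D[select_indice]
            data_cls.append(select_indice)
        # Move each center to the mean of its batch members; keep the old center if a cluster is empty
        cluster_centers = np.array([np.mean(members, axis=0) if members else cluster_centers[j]
                                    for j, members in enumerate(cluster_set)])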

        if abs(cur_d - previous_d) <= loss_thr or cur_iter >= iter_num:
            break

        previous_d = cur_d
        cur_iter += 1


    print("end")
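
For comparison, scikit-learn ships a ready-made mini-batch variant, `MiniBatchKMeans`. The sketch below fits it next to the full-batch `KMeans` on the same data; `batch_size=40` is chosen here only to mirror the 10% batches in the script above. Note that `MiniBatchKMeans` updates centers incrementally across batches, so it is not a line-for-line equivalent of that script.

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

x, _ = make_blobs(400, n_features=2, centers=4, random_state=2018)

# Full-batch K-means: every iteration uses all 400 samples.
full = KMeans(n_clusters=4, n_init=10, random_state=0).fit(x)

# Mini-batch K-means: every iteration uses a batch of 40 samples.
mini = MiniBatchKMeans(n_clusters=4, batch_size=40, n_init=10, random_state=0).fit(x)

print("full-batch inertia:", full.inertia_)
print("mini-batch inertia:", mini.inertia_)
```

Comparing the two inertia values gives a rough sense of how much accuracy the mini-batch update gives up in exchange for cheaper iterations.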

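4. Where K-Means applies

K-Means works well only when the clusters are convex, for example data drawn from a mixture of several Gaussian distributions. The last dataset in the figure above has non-convex clusters, so K-Means clusters it poorly.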
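The difference can be checked numerically. This sketch runs K-means on two well-separated Gaussian blobs (convex) and on two interleaved half-moons (non-convex), then scores each result against the generating labels with the adjusted Rand index. The datasets and their parameters are assumptions chosen for illustration, not the ones in the figure.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, make_moons
from sklearn.metrics import adjusted_rand_score

# Convex case: two well-separated Gaussian blobs (centers chosen for illustration).
x_blobs, y_blobs = make_blobs(400, centers=[[-5, -5], [5, 5]], cluster_std=1.0, random_state=0)
# Non-convex case: two interleaved half-moons.
x_moons, y_moons = make_moons(400, noise=0.05, random_state=0)

def kmeans_ari(x, y_true):
    """Cluster with K=2 and score agreement with the generating labels."""
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(x)
    return adjusted_rand_score(y_true, labels)

ari_blobs = kmeans_ari(x_blobs, y_blobs)
ari_moons = kmeans_ari(x_moons, y_moons)
print(ari_blobs, ari_moons)
```

A score near 1 means the clusters match the true groups; the half-moons score noticeably lower because K-means splits the plane with a straight boundary.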

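5. Summary of K-Means clustering

6. Evaluation metrics for clustering

These metrics have fairly limited use. If the classes are already known, a supervised method such as a linear model, SVM, random forest, or deep learning will probably do better than clustering. In my view, the main uses of clustering are on unlabeled data: helping with labeling, or speeding up search over large-scale data (as in faiss).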
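To make the distinction concrete, the sketch below computes one metric that needs ground-truth labels (adjusted Rand index) and one that does not (silhouette score). Both are picked here as examples from scikit-learn and are not necessarily the metrics this section originally listed.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

x, y = make_blobs(400, n_features=2, centers=4, random_state=2018)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(x)

# Needs the ground-truth labels y; invariant to how cluster ids are numbered.
ari = adjusted_rand_score(y, labels)
# Needs only the data and the predicted clusters, so it works on unlabeled data.
sil = silhouette_score(x, labels)

print(ari, sil)
```

When no labels exist, a label-free score such as the silhouette is the only one of the two that can be computed.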

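References

1. K-means聚类算法的三种改进(K-means++,ISODATA,Kernel K-means)介绍与对比 (Three improvements to K-means: K-means++, ISODATA, and Kernel K-means, introduced and compared), Yixuan-Xu, 博客园 (cnblogs)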
