Efficient k-means implementations in Python

胖胖大海

Last modified 2022-03-31 17:39:58

Categories: Python programming, machine learning. Tags: python, efficient kmeans algorithm implementation

First published 2020-02-18 14:10:17

Original link: https://blog.csdn.net/xufive/article/details/101448969

The base code in this post is reposted from 天元浪子's technical blog.

1. k-means clustering, computing the Euclidean distances between vectors with numpy

import numpy as np


def fast_kmeans_numpy(ds, k):
    """k-means clustering.

    k       - number of clusters
    ds      - ndarray(m, n), a data set of m samples with n attributes each
    """

    m, n = ds.shape  # m: number of samples, n: number of attributes per sample
    result = np.full(m, -1, dtype=int)  # cluster label of each sample; -1 means "not assigned yet"
    cores = ds[np.random.choice(np.arange(m), k, replace=False)]  # pick k distinct samples at random as the initial centroids

    while True:  # iterate until the assignment stops changing
        d = np.square(np.repeat(ds, k, axis=0).reshape(m, k, n) - cores)
        distance = np.sqrt(np.sum(d, axis=2))  # ndarray(m, k): distance from each sample to each of the k centroids
        index_min = np.argmin(distance, axis=1)  # index of the nearest centroid for each sample

        if (index_min == result).all():  # no sample changed cluster
            return result, cores  # return the cluster labels and the centroids

        result[:] = index_min  # reassign the samples
        for i in range(k):  # loop over the centroids
            items = ds[result == i]  # samples currently assigned to centroid i
            if len(items) > 0:  # keep the old centroid if its cluster became empty
                cores[i] = np.mean(items, axis=0)  # move the centroid to the mean of its samples
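
The distance step in the loop above builds an (m, k, n) array with np.repeat and reshape. As a minimal sketch, the same array can also be obtained with plain numpy broadcasting; the helper names and the random test data below are made up for illustration and are not part of the original code.

```python
import numpy as np


def pairwise_distances_repeat(ds, cores):
    """Distances computed the way the listing above does it."""
    m, n = ds.shape
    k = cores.shape[0]
    # repeat every sample k times, then view the result as (m, k, n)
    d = np.square(np.repeat(ds, k, axis=0).reshape(m, k, n) - cores)
    return np.sqrt(np.sum(d, axis=2))


def pairwise_distances_broadcast(ds, cores):
    """Same result, letting numpy broadcast (m, 1, n) against (1, k, n)."""
    diff = ds[:, np.newaxis, :] - cores[np.newaxis, :, :]
    return np.sqrt(np.sum(np.square(diff), axis=2))


rng = np.random.default_rng(0)
ds = rng.random((6, 3))   # 6 hypothetical samples with 3 attributes
cores = ds[:2].copy()     # use the first 2 samples as centroids

a = pairwise_distances_repeat(ds, cores)
b = pairwise_distances_broadcast(ds, cores)
print(a.shape)            # one row per sample, one column per centroid
print(np.allclose(a, b))  # both formulations agree
```

Both versions allocate an intermediate array of m * k * n values, so memory use grows with the number of clusters as well as with the number of samples.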

2. k-means clustering, computing the Euclidean distances between vectors with scipy

import numpy as np
from scipy import spatial


def distance_euclidean_scipy(vec1, vec2, distance="euclidean"):
    # pairwise distances between the rows of vec1 and the rows of vec2
    return spatial.distance.cdist(vec1, vec2, distance)


def fast_kmeans_scipy(ds, k):
    """k-means clustering.

    k       - number of clusters
    ds      - ndarray(m, n), a data set of m samples with n attributes each
    """

    m, n = ds.shape  # m: number of samples, n: number of attributes per sample
    result = np.full(m, -1, dtype=int)  # cluster label of each sample; -1 means "not assigned yet"
    cores = ds[np.random.choice(np.arange(m), k, replace=False)]  # pick k distinct samples at random as the initial centroids

    while True:  # iterate until the assignment stops changing
        distance = distance_euclidean_scipy(ds, cores)  # ndarray(m, k): distance from each sample to each of the k centroids
        index_min = np.argmin(distance, axis=1)  # index of the nearest centroid for each sample

        if (index_min == result).all():  # no sample changed cluster
            return result, cores  # return the cluster labels and the centroids

        result[:] = index_min  # reassign the samples
        for i in range(k):  # loop over the centroids
            items = ds[result == i]  # samples currently assigned to centroid i
            if len(items) > 0:  # keep the old centroid if its cluster became empty
                cores[i] = np.mean(items, axis=0)  # move the centroid to the mean of its samples
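
The only difference from the numpy version is the distance step. As a small sanity check, the sketch below compares cdist against the explicit numpy formula on made-up random data, and also tries the "sqeuclidean" metric, which skips the square root; since only the argmin is used, the assignment should normally come out the same. The data and variable names are assumptions for illustration.

```python
import numpy as np
from scipy import spatial

rng = np.random.default_rng(0)
ds = rng.random((8, 3))        # 8 hypothetical samples with 3 attributes
cores = ds[[0, 3, 5]].copy()   # 3 of the samples serve as centroids

# Euclidean distances from scipy and from the explicit numpy formula
dist_scipy = spatial.distance.cdist(ds, cores, "euclidean")
dist_numpy = np.sqrt(np.sum(np.square(ds[:, np.newaxis, :] - cores), axis=2))
same_values = np.allclose(dist_scipy, dist_numpy)

# squared distances preserve the ordering, so the nearest centroid is unchanged
dist_sq = spatial.distance.cdist(ds, cores, "sqeuclidean")
same_assignment = np.array_equal(
    np.argmin(dist_scipy, axis=1), np.argmin(dist_sq, axis=1)
)

print(same_values, same_assignment)
```

Whether "sqeuclidean" is measurably faster on a matrix as small as the one used in this post has not been measured here.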

3. k-means clustering, scikit-learn implementation

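import numpy as np
from sklearn.cluster import KMeans


def kmeans_sklearn(ds, k):
    model = KMeans(n_clusters=k).fit(ds)
    centroids = model.cluster_centers_.astype(int)  # truncates the centroid coordinates to integers
    labels = model.labels_
    return labels, centroids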
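
With default arguments, KMeans uses k-means++ initialization and, depending on the scikit-learn version, may repeat the whole fit from several starting points, whereas the two hand-written functions do a single run from randomly chosen samples. The sketch below shows one way to request a comparable single run; the function name, the seed argument and the synthetic data are assumptions for illustration, not part of the original post.

```python
import numpy as np
from sklearn.cluster import KMeans


def kmeans_sklearn_single_run(ds, k, seed=0):
    """One randomly initialized run, closer to what the hand-written versions do."""
    model = KMeans(n_clusters=k, init="random", n_init=1, random_state=seed).fit(ds)
    return model.labels_, model.cluster_centers_


# three hypothetical, well-separated blobs of 50 points each
rng = np.random.default_rng(0)
ds = np.vstack([rng.normal(loc=c, scale=0.1, size=(50, 3)) for c in (0.0, 5.0, 10.0)])

labels, centroids = kmeans_sklearn_single_run(ds, 3)
print(labels.shape, centroids.shape)
```

A single random start can settle in a poorer local optimum than several restarts would, so this trades clustering quality for speed in the same way the hand-written versions do.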

4. Performance comparison

The three methods above were each used to cluster a matrix of shape (5500, 3) with k=3:
function name -> fast_kmeans_numpy, elapse time -> 22.405 ms
function name -> fast_kmeans_scipy, elapse time -> 13.116 ms
function name -> kmeans_sklearn, elapse time -> 120.774 ms
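These figures show that the scipy-based implementation takes about 41% less time than the numpy-based one (roughly 1.7 times as fast) and is about 9 times as fast as the kmeans method in scikit-learn.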

Note: the tests above were run on a local Windows machine; CPU usage was 25%, i.e. a single core at full load.
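
The post does not show how the timings were collected. As a hedged sketch, a decorator like the hypothetical timeit below could print lines in the same format as the log above; demo_workload is a stand-in for the clustering functions, and a single call is timed, so repeated runs would be needed for stable numbers.

```python
import time
from functools import wraps


def timeit(func):
    """Print the wall-clock time of each call in milliseconds (hypothetical helper)."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed_ms = (time.perf_counter() - start) * 1000
        print(f"function name -> {func.__name__}, elapse time -> {elapsed_ms:.3f} ms")
        return result
    return wrapper


@timeit
def demo_workload(n):
    # stand-in for a clustering call; sums the squares of 0..n-1
    return sum(i * i for i in range(n))


total = demo_workload(1000)
```

Because both hand-written functions start from randomly chosen centroids, the number of iterations, and therefore the elapsed time, can vary from run to run.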
