K-means
Partition-based clustering methods require the number of clusters, or the initial cluster centers, to be specified in advance; they then iterate repeatedly until reaching the goal that "points within a cluster are close together, and points in different clusters are far apart".
Their objective is to minimize the within-cluster sum-of-squares: Σᵢ minⱼ ‖xᵢ − μⱼ‖², where the μⱼ are the cluster centers.
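As a quick illustration, this objective (sklearn calls it the inertia) can be computed directly; a minimal NumPy sketch with made-up data and centers, not sklearn's implementation:

```python
import numpy as np

# toy data and centers, made up for illustration
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
centers = np.array([[0.1, 0.05], [5.05, 4.95]])

# squared distance from every sample to every center
sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
# each sample contributes its squared distance to its *nearest* center
inertia = sq_dists.min(axis=1).sum()
print(inertia)
```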
Although the original K-means algorithm has several drawbacks:
- the value of k must be chosen in advance
- it is sensitive to the initial centroids
- it is sensitive to outliers

sklearn includes an improved version of K-means: k-means++ initialization addresses the sensitivity to the initial centroids, and suitable preprocessing, namely variance normalization (variance-normalized features), addresses the sensitivity to outliers; see the usage sketch below.
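A minimal usage sketch (standard sklearn API; the data and parameter values are made up for illustration): `init="k-means++"` is sklearn's default initialization, and `StandardScaler` produces variance-normalized features:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])  # toy data

# variance normalization: scale each feature to unit variance
X_scaled = StandardScaler().fit_transform(X)

# k-means++ is the default smart initialization
km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(X_scaled)
print(km.inertia_)          # within-cluster sum-of-squares
print(km.cluster_centers_)  # the final centroids
```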
The following is a walkthrough of the source code.

Selecting the centers
```python
# run a k-means once
labels, inertia, centers, n_iter_ = kmeans_single(
    X,
    sample_weight,
    centers_init,
    max_iter=self.max_iter,
    verbose=self.verbose,
    tol=self._tol,
    x_squared_norms=x_squared_norms,
    n_threads=self._n_threads,
)
```
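For context, this single run sits inside a loop over `n_init` different initializations in `KMeans.fit`, and the run with the lowest inertia is kept. Roughly (a simplified paraphrase, not the verbatim source; argument handling is abbreviated):

```python
best_inertia = None
for i in range(self._n_init):
    # sample a fresh set of initial centers (k-means++ by default)
    centers_init = self._init_centroids(X, x_squared_norms, init, random_state)
    labels, inertia, centers, n_iter_ = kmeans_single(
        X, sample_weight, centers_init, max_iter=self.max_iter,
        verbose=self.verbose, tol=self._tol,
        x_squared_norms=x_squared_norms, n_threads=self._n_threads,
    )
    # keep the best run, measured by within-cluster sum-of-squares
    if best_inertia is None or inertia < best_inertia:
        best_labels, best_inertia = labels, inertia
        best_centers, best_n_iter = centers, n_iter_
```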
-> which expands into the main Lloyd loop inside `kmeans_single`:
```python
for i in range(max_iter):
    lloyd_iter(
        X,
        sample_weight,
        x_squared_norms,
        centers,
        centers_new,
        weight_in_clusters,
        labels,
        center_shift,
        n_threads,
    )
```
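Each pass records how far every center moved in `center_shift`, and right after this call `kmeans_single` tests for convergence. A simplified paraphrase (not the verbatim source):

```python
if np.array_equal(labels, labels_old):
    # strict convergence: no sample changed its cluster assignment
    break
else:
    # otherwise stop once the total squared center movement is below tol
    center_shift_tot = (center_shift ** 2).sum()
    if center_shift_tot <= tol:
        break
labels_old[:] = labels
```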
Assigning samples to clusters

Note the computational optimization here:
```cython
# Instead of computing the full pairwise squared distances matrix,
# ||X - C||² = ||X||² - 2 X.C^T + ||C||², we only need to store
# the - 2 X.C^T + ||C||² term since the argmin for a given sample only
# depends on the centers.
# pairwise_distances = ||C||²
```
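Why this is safe: ‖xᵢ‖² is the same for every candidate center of a given sample, so dropping it never changes which center attains the minimum. A quick NumPy check with made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # made-up samples
C = rng.normal(size=(8, 5))    # made-up centers

# full squared distances: ||X||² - 2 X·Cᵀ + ||C||²
full = (X**2).sum(axis=1)[:, None] - 2 * X @ C.T + (C**2).sum(axis=1)
# the term sklearn actually stores: -2 X·Cᵀ + ||C||²
partial = -2 * X @ C.T + (C**2).sum(axis=1)

# the nearest center is identical either way
assert (full.argmin(axis=1) == partial.argmin(axis=1)).all()
```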
Updating the centers

Back in the outer loop, the old and new center buffers are simply swapped between iterations:

```python
centers, centers_new = centers_new, centers
```

-> the accumulation itself happens inside `lloyd_iter`: each thread fills per-chunk local buffers, which are then reduced into the shared arrays under the GIL:
```cython
if update_centers:
    with gil:
        for j in range(n_clusters):
            weight_in_clusters[j] += weight_in_clusters_chunk[j]
            for k in range(n_features):
                centers_new[j, k] += centers_new_chunk[j * n_features + k]
```
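Note that after this reduction `centers_new` holds weighted sums, not means; a separate helper (`_average_centers` in sklearn's Cython code) divides each cluster's sum by its total weight. Roughly, in NumPy terms (empty clusters are handled separately in the real code):

```python
# turn the accumulated weighted sums into weighted means
nonempty = weight_in_clusters > 0
centers_new[nonempty] /= weight_in_clusters[nonempty, None]
```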
```cython
# the overall algorithm: for each sample, find the nearest center,
# then accumulate the weighted sums used in the center update
for i in range(n_samples):
    min_sq_dist = pairwise_distances[i * n_clusters]
    label = 0
    for j in range(1, n_clusters):
        sq_dist = pairwise_distances[i * n_clusters + j]
        if sq_dist < min_sq_dist:
            min_sq_dist = sq_dist
            label = j
    labels[i] = label

    if update_centers:
        weight_in_clusters[label] += sample_weight[i]
        for k in range(n_features):
            centers_new[label * n_features + k] += X[i, k] * sample_weight[i]
```
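Putting the pieces together, the following is an educational NumPy paraphrase of one Lloyd iteration with the same structure (my own sketch, not sklearn code; it omits chunking, threading, and empty-cluster relocation):

```python
import numpy as np

def lloyd_iteration(X, centers, sample_weight):
    """One assign-then-update step, vectorized."""
    # assignment: argmin of -2 X·Cᵀ + ||C||² (same trick as above)
    dist = -2 * X @ centers.T + (centers**2).sum(axis=1)
    labels = dist.argmin(axis=1)

    # update: weighted sums per cluster, then divide by the weights
    new_centers = np.zeros_like(centers)
    weights = np.zeros(len(centers))
    np.add.at(new_centers, labels, X * sample_weight[:, None])
    np.add.at(weights, labels, sample_weight)
    nonempty = weights > 0
    new_centers[nonempty] /= weights[nonempty, None]
    return labels, new_centers
```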
Reference
https://zhuanlan.zhihu.com/p/104355127