K-means
Partition-based clustering methods require the number of clusters, or the initial cluster centers, to be specified in advance; they then iterate repeatedly until reaching the goal that "points within a cluster are close together, and points in different clusters are far apart".
Their objective is to minimize the within-cluster sum-of-squares: Σᵢ minⱼ ‖xᵢ − μⱼ‖², where the μⱼ are the cluster centers.
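As a quick illustration, this objective (sklearn calls it the inertia) can be computed directly; a minimal NumPy sketch with made-up data and centers, not sklearn's implementation:

```python
import numpy as np

# toy data and centers, made up for illustration
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
centers = np.array([[0.1, 0.05], [5.05, 4.95]])

# squared distance from every sample to every center
sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
# each sample contributes its squared distance to its *nearest* center
inertia = sq_dists.min(axis=1).sum()
print(inertia)
```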
Although the original K-means algorithm has several drawbacks:
- the value of k must be chosen in advance
- it is sensitive to the initial centroids
- it is sensitive to outliers

sklearn includes an improved version of K-means: k-means++ initialization addresses the sensitivity to the initial centroids, and suitable preprocessing, namely variance normalization (variance-normalized features), addresses the sensitivity to outliers; see the usage sketch below.
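A minimal usage sketch (standard sklearn API; the data and parameter values are made up for illustration): `init="k-means++"` is sklearn's default initialization, and `StandardScaler` produces variance-normalized features:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])  # toy data

# variance normalization: scale each feature to unit variance
X_scaled = StandardScaler().fit_transform(X)

# k-means++ is the default smart initialization
km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(X_scaled)
print(km.inertia_)          # within-cluster sum-of-squares
print(km.cluster_centers_)  # the final centroids
```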
The following is a walkthrough of the source code.

Selecting the centers
```python
# run a k-means once
labels, inertia, centers, n_iter_ = kmeans_single(
    X,
    sample_weight,
    centers_init,
    max_iter=self.max_iter,
    verbose=self.verbose,
    tol=self._tol,
    x_squared_norms=x_squared_norms,
    n_threads=self._n_threads,
)
```
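For context, this single run sits inside a loop over `n_init` different initializations in `KMeans.fit`, and the run with the lowest inertia is kept. Roughly (a simplified paraphrase, not the verbatim source; argument handling is abbreviated):

```python
best_inertia = None
for i in range(self._n_init):
    # sample a fresh set of initial centers (k-means++ by default)
    centers_init = self._init_centroids(X, x_squared_norms, init, random_state)
    labels, inertia, centers, n_iter_ = kmeans_single(
        X, sample_weight, centers_init, max_iter=self.max_iter,
        verbose=self.verbose, tol=self._tol,
        x_squared_norms=x_squared_norms, n_threads=self._n_threads,
    )
    # keep the best run, measured by within-cluster sum-of-squares
    if best_inertia is None or inertia < best_inertia:
        best_labels, best_inertia = labels, inertia
        best_centers, best_n_iter = centers, n_iter_
```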
-> which expands into the main Lloyd loop inside `kmeans_single`:
```python
for i in range(max_iter):
    lloyd_iter(
        X,
        sample_weight,
        x_squared_norms,
        centers,
        centers_new,
        weight_in_clusters,
        labels,
        center_shift,
        n_threads,
    )
```
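Each pass records how far every center moved in `center_shift`, and right after this call `kmeans_single` tests for convergence. A simplified paraphrase (not the verbatim source):

```python
if np.array_equal(labels, labels_old):
    # strict convergence: no sample changed its cluster assignment
    break
else:
    # otherwise stop once the total squared center movement is below tol
    center_shift_tot = (center_shift ** 2).sum()
    if center_shift_tot <= tol:
        break
labels_old[:] = labels
```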
Assigning samples to clusters

Note the computational optimization here:
```cython
# Instead of computing the full pairwise squared distances matrix,
# ||X - C||² = ||X||² - 2 X.C^T + ||C||², we only need to store
# the - 2 X.C^T + ||C||² term since the argmin for a given sample only
# depends on the centers.
# pairwise_distances = ||C||²
```
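Why this is safe: ‖xᵢ‖² is the same for every candidate center of a given sample, so dropping it never changes which center attains the minimum. A quick NumPy check with made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # made-up samples
C = rng.normal(size=(8, 5))    # made-up centers

# full squared distances: ||X||² - 2 X·Cᵀ + ||C||²
full = (X**2).sum(axis=1)[:, None] - 2 * X @ C.T + (C**2).sum(axis=1)
# the term sklearn actually stores: -2 X·Cᵀ + ||C||²
partial = -2 * X @ C.T + (C**2).sum(axis=1)

# the nearest center is identical either way
assert (full.argmin(axis=1) == partial.argmin(axis=1)).all()
```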
Updating the centers

Back in the outer loop, the old and new center buffers are simply swapped between iterations:

```python
centers, centers_new = centers_new, centers
```

-> the accumulation itself happens inside `lloyd_iter`: each thread fills per-chunk local buffers, which are then reduced into the shared arrays under the GIL:
```cython
if update_centers:
    with gil:
        for j in range(n_clusters):
            weight_in_clusters[j] += weight_in_clusters_chunk[j]
            for k in range(n_features):
                centers_new[j, k] += centers_new_chunk[j * n_features + k]
```
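Note that after this reduction `centers_new` holds weighted sums, not means; a separate helper (`_average_centers` in sklearn's Cython code) divides each cluster's sum by its total weight. Roughly, in NumPy terms (empty clusters are handled separately in the real code):

```python
# turn the accumulated weighted sums into weighted means
nonempty = weight_in_clusters > 0
centers_new[nonempty] /= weight_in_clusters[nonempty, None]
```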
```cython
# the overall algorithm: for each sample, find the nearest center,
# then accumulate the weighted sums used in the center update
for i in range(n_samples):
    min_sq_dist = pairwise_distances[i * n_clusters]
    label = 0
    for j in range(1, n_clusters):
        sq_dist = pairwise_distances[i * n_clusters + j]
        if sq_dist < min_sq_dist:
            min_sq_dist = sq_dist
            label = j
    labels[i] = label

    if update_centers:
        weight_in_clusters[label] += sample_weight[i]
        for k in range(n_features):
            centers_new[label * n_features + k] += X[i, k] * sample_weight[i]
```
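Putting the pieces together, the following is an educational NumPy paraphrase of one Lloyd iteration with the same structure (my own sketch, not sklearn code; it omits chunking, threading, and empty-cluster relocation):

```python
import numpy as np

def lloyd_iteration(X, centers, sample_weight):
    """One assign-then-update step, vectorized."""
    # assignment: argmin of -2 X·Cᵀ + ||C||² (same trick as above)
    dist = -2 * X @ centers.T + (centers**2).sum(axis=1)
    labels = dist.argmin(axis=1)

    # update: weighted sums per cluster, then divide by the weights
    new_centers = np.zeros_like(centers)
    weights = np.zeros(len(centers))
    np.add.at(new_centers, labels, X * sample_weight[:, None])
    np.add.at(weights, labels, sample_weight)
    nonempty = weights > 0
    new_centers[nonempty] /= weights[nonempty, None]
    return labels, new_centers
```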
Reference
https://zhuanlan.zhihu.com/p/104355127