Note: download the full example code from https://urlify.cn/NnMjQz, or run this example in your browser via Binder.
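This example evaluates how well different k-means initialization strategies make the algorithm's convergence robust. Robustness is measured by the relative standard deviation of the inertia of the clustering, where inertia is the sum of squared distances from each sample to its nearest cluster center.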
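The two quantities named above can be written out directly. The following is a minimal NumPy sketch, not the code scikit-learn uses internally; the helper names `clustering_inertia` and `relative_std` and the toy arrays are assumptions made for illustration.

```python
import numpy as np


def clustering_inertia(X, centers):
    """Sum of squared distances from each sample to its nearest center."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Pairwise squared distances, shape (n_samples, n_centers)
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # Keep the distance to the nearest center for each sample, then sum
    return sq_dist.min(axis=1).sum()


def relative_std(values):
    """Standard deviation divided by the mean (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    return values.std() / values.mean()


# Toy data (illustrative only): three points on a line, two centers
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
centers_toy = np.array([[0.0, 0.0], [10.0, 0.0]])
toy_inertia = clustering_inertia(X_toy, centers_toy)
```

A low relative standard deviation across repeated runs means the initialization strategy reaches a similar inertia every time, which is what "robust convergence" refers to here.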
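The first plot shows the best inertia reached for each combination of model (KMeans or MiniBatchKMeans) and init method (init="random" or init="k-means++"), for increasing values of the n_init parameter, which controls the number of initializations.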
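Conceptually, n_init runs the algorithm several times from different starting centers and keeps the run with the lowest inertia. The sketch below emulates that by hand with five separate n_init=1 fits; the seed values, the variable names, and the small 3x3 grid dataset are assumptions for illustration, and scikit-learn's internal seeding of the repeated runs may differ from this loop.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative data: nine tight Gaussian clusters on a 3x3 grid
rng = np.random.RandomState(0)
grid = np.array([[i, j] for i in range(3) for j in range(3)], dtype=float)
X_demo = np.concatenate(
    [c + rng.normal(scale=0.1, size=(100, 2)) for c in grid])

# One fit per seed, each with a single random initialization
single_run_inertias = [
    KMeans(n_clusters=9, init="random", n_init=1,
           random_state=seed).fit(X_demo).inertia_
    for seed in range(5)
]

# Keeping the lowest inertia mimics what a larger n_init does
best_inertia = min(single_run_inertias)
```

The spread of `single_run_inertias` gives a feel for how much a single random initialization can vary, and why a larger n_init trades CPU time for a more reliable result.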
The second plot shows a single run of the MiniBatchKMeans estimator with init="random" and n_init=1. This run converges badly (to a local optimum), with estimated centers stuck between the ground truth clusters.
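One way to check numerically for this kind of failure is to look for ground truth centers that have no estimated center nearby. This is a rough sketch rather than part of the original example: the helper name `unmatched_true_centers`, the tolerance of 0.3, and the toy center coordinates are all assumptions.

```python
import numpy as np


def unmatched_true_centers(true_centers, estimated_centers, tol=0.3):
    """Indices of true centers with no estimated center within `tol`."""
    true_centers = np.asarray(true_centers, dtype=float)
    estimated_centers = np.asarray(estimated_centers, dtype=float)
    # Euclidean distances, shape (n_true, n_estimated)
    dist = np.linalg.norm(
        true_centers[:, None, :] - estimated_centers[None, :, :], axis=2)
    # Distance from each true center to its closest estimated center
    nearest = dist.min(axis=1)
    return np.flatnonzero(nearest > tol)


# Toy case (illustrative): one estimate sits halfway between two true
# centers, while two estimates share the third true cluster
true_demo = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
stuck_demo = np.array([[0.5, 0.0], [2.0, 0.0], [2.05, 0.0]])
missed = unmatched_true_centers(true_demo, stuck_demo)
```

With the grid dataset used in this example, the true centers are the integer grid points, so the same check could be applied to `km.cluster_centers_` after fitting.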
The dataset used for evaluation is a 2D grid of widely spaced isotropic Gaussian clusters.
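A comparable dataset can be sketched with `sklearn.datasets.make_blobs`. This is an alternative to the example's own data-generation helper, not a copy of it: here each cluster gets independently drawn noise, and the variable names and the seed are assumptions.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Cluster centers on a 3x3 integer grid (spacing 1 between neighbors)
grid_centers = np.array(
    [[i, j] for i in range(3) for j in range(3)], dtype=float)

# 900 samples split evenly over 9 centers, isotropic std of 0.1,
# so neighboring centers are about 10 standard deviations apart
X_blobs, y_blobs = make_blobs(
    n_samples=900, centers=grid_centers, cluster_std=0.1, random_state=0)
```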
Output:
Evaluation of KMeans with k-means++ init
Evaluation of KMeans with random init
Evaluation of MiniBatchKMeans with k-means++ init
Evaluation of MiniBatchKMeans with random init
print(__doc__)
# Author: Olivier Grisel
# License: BSD 3 clause
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from sklearn.utils import shuffle
from sklearn.utils import check_random_state
from sklearn.cluster import MiniBatchKMeans
from sklearn.cluster import KMeans
random_state = np.random.RandomState(0)
# Number of runs (each with a randomly generated dataset) per strategy,
# so that an estimate of the standard deviation can be computed
n_runs = 5
# k-means models can do several random inits, trading CPU time
# for more robust convergence
n_init_range = np.array([1, 5, 10, 15, 20])
# Dataset generation parameters
n_samples_per_center = 100
grid_size = 3
scale = 0.1
n_clusters = grid_size ** 2

def make_data(random_state, n_samples_per_center, grid_size, scale):
    random_state = check_random_state(random_state)
    # Cluster centers on a grid_size x grid_size integer grid
    centers = np.array([[i, j]
                        for i in range(grid_size)
                        for j in range(grid_size)])
    n_clusters_true, n_features = centers.shape

    # One noise sample, reused for every center
    noise = random_state.normal(
        scale=scale, size=(n_samples_per_center, centers.shape[1]))

    X = np.concatenate([c + noise for c in centers])
    y = np.concatenate([[i] * n_samples_per_center
                        for i in range(n_clusters_true)])
    return shuffle(X, y, random_state=random_state)

# Part 1: Quantitative evaluation of various init methods

plt.figure()
plots = []
legends = []

cases = [
    (KMeans, 'k-means++', {}),
    (KMeans, 'random', {}),
    (MiniBatchKMeans, 'k-means++', {'max_no_improvement': 3}),
    (MiniBatchKMeans, 'random', {'max_no_improvement': 3, 'init_size': 500}),
]

for factory, init, params in cases:
    print("Evaluation of %s with %s init" % (factory.__name__, init))
    # One row per n_init value, one column per run
    inertia = np.empty((len(n_init_range), n_runs))

    for run_id in range(n_runs):
        X, y = make_data(run_id, n_samples_per_center, grid_size, scale)
        for i, n_init in enumerate(n_init_range):
            km = factory(n_clusters=n_clusters, init=init, random_state=run_id,
                         n_init=n_init, **params).fit(X)
            inertia[i, run_id] = km.inertia_
    # Mean inertia across runs, with the standard deviation as error bars
    p = plt.errorbar(n_init_range, inertia.mean(axis=1), inertia.std(axis=1))
    plots.append(p[0])
    legends.append("%s with %s init" % (factory.__name__, init))

plt.xlabel('n_init')
plt.ylabel('inertia')
plt.legend(plots, legends)
plt.title("Mean inertia for various k-means init across %d runs" % n_runs)

# Part 2: Qualitative visual inspection of the convergence

X, y = make_data(random_state, n_samples_per_center, grid_size, scale)
km = MiniBatchKMeans(n_clusters=n_clusters, init='random', n_init=1,
                     random_state=random_state).fit(X)

plt.figure()
for k in range(n_clusters):
    my_members = km.labels_ == k
    color = cm.nipy_spectral(float(k) / n_clusters, 1)
    # Give the marker once: combining the 'o' format string with
    # marker='.' defines it twice
    plt.plot(X[my_members, 0], X[my_members, 1], '.', c=color)
    cluster_center = km.cluster_centers_[k]
    plt.plot(cluster_center[0], cluster_center[1], 'o',
             markerfacecolor=color, markeredgecolor='k', markersize=6)
plt.title("Example cluster allocation with a single random init\n"
          "with MiniBatchKMeans")

plt.show()
Total running time of the script: (0 minutes 3.557 seconds)
Estimated memory usage: 8 MB