python clustering: using python+sklearn to show how random assignment affects clustering metric values in clustering performance evaluation

Latest recommended article published on 2023-06-23 22:33:52

weixin_39925350

Tags: python, clustering, python clustering, sklearn clustering

Article link: https://blog.csdn.net/weixin_39925350/article/details/111293439

Note: click here https://urlify.cn/3iAzUr to download the full example code, or run this example in your browser via Binder.

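The plots generated by this example show how the number of clusters and the number of samples affect various clustering performance evaluation metrics. Non-adjusted measures such as the V-measure depend on the ratio between the number of clusters and the number of samples: the mean V-measure of a random labeling increases significantly as the number of clusters approaches the total number of samples used to compute the measure. Measures adjusted for chance, such as ARI, instead show only small random variations centered around a mean score of 0.0, for any number of samples and clusters. Only adjusted measures can therefore safely be used as a consensus index to evaluate the average stability of a clustering algorithm, for a given value of k, across various overlapping sub-samples of the dataset.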
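The contrast described above can be checked numerically with a few lines. The following is only a minimal sketch: the helper name `mean_random_score`, the seed, the number of runs, and the chosen values of k are illustrative assumptions, not part of the original example.

```python
import numpy as np
from sklearn import metrics

rng = np.random.RandomState(0)  # arbitrary seed, chosen for repeatability
n_samples = 100


def mean_random_score(score_func, n_clusters, n_runs=20):
    """Average score of pairs of independent uniform random labelings."""
    scores = []
    for _ in range(n_runs):
        labels_a = rng.randint(0, n_clusters, size=n_samples)
        labels_b = rng.randint(0, n_clusters, size=n_samples)
        scores.append(score_func(labels_a, labels_b))
    return float(np.mean(scores))


# Map k -> (mean V-measure, mean ARI) for purely random labelings.
results = {}
for k in (2, 10, 50):
    v = mean_random_score(metrics.v_measure_score, k)
    ari = mean_random_score(metrics.adjusted_rand_score, k)
    results[k] = (v, ari)
    print(f"k={k:3d}  V-measure={v:.3f}  ARI={ari:+.3f}")
```

With these settings the V-measure column should grow as k approaches `n_samples`, while the ARI column should stay close to zero, even though neither labeling carries any information about the other.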

Output:

Computing adjusted_rand_score for 10 values of n_clusters and n_samples=100
done in 0.050s
Computing v_measure_score for 10 values of n_clusters and n_samples=100
done in 0.068s
Computing ami_score for 10 values of n_clusters and n_samples=100
done in 0.356s
Computing mutual_info_score for 10 values of n_clusters and n_samples=100
done in 0.044s
Computing adjusted_rand_score for 10 values of n_clusters and n_samples=1000
done in 0.051s
Computing v_measure_score for 10 values of n_clusters and n_samples=1000
done in 0.064s
Computing ami_score for 10 values of n_clusters and n_samples=1000
done in 0.208s
Computing mutual_info_score for 10 values of n_clusters and n_samples=1000
done in 0.048s

print(__doc__)

# Author: Olivier Grisel
# License: BSD 3 clause

import numpy as np
import matplotlib.pyplot as plt
from time import time
from sklearn import metrics


def uniform_labelings_scores(score_func, n_samples, n_clusters_range,
                             fixed_n_classes=None, n_runs=5, seed=42):
    """Compute the score for 2 random uniform cluster labelings.

    Both random labelings have the same number of clusters for each
    possible value in ``n_clusters_range``.

    When fixed_n_classes is not None, the first labeling is considered a
    ground truth class assignment with a fixed number of classes.
    """
    random_labels = np.random.RandomState(seed).randint
    scores = np.zeros((len(n_clusters_range), n_runs))

    if fixed_n_classes is not None:
        labels_a = random_labels(low=0, high=fixed_n_classes, size=n_samples)

    for i, k in enumerate(n_clusters_range):
        for j in range(n_runs):
            if fixed_n_classes is None:
                labels_a = random_labels(low=0, high=k, size=n_samples)
            labels_b = random_labels(low=0, high=k, size=n_samples)
            scores[i, j] = score_func(labels_a, labels_b)
    return scores


def ami_score(U, V):
    return metrics.adjusted_mutual_info_score(U, V)


score_funcs = [
    metrics.adjusted_rand_score,
    metrics.v_measure_score,
    ami_score,
    metrics.mutual_info_score,
]

# 2 independent random clusterings with the same number of clusters
n_samples = 100
n_clusters_range = np.linspace(2, n_samples, 10).astype(int)

plt.figure(1)

plots = []
names = []
for score_func in score_funcs:
    print("Computing %s for %d values of n_clusters and n_samples=%d"
          % (score_func.__name__, len(n_clusters_range), n_samples))

    t0 = time()
    scores = uniform_labelings_scores(score_func, n_samples, n_clusters_range)
    print("done in %0.3fs" % (time() - t0))
    plots.append(plt.errorbar(
        n_clusters_range, np.median(scores, axis=1), scores.std(axis=1))[0])
    names.append(score_func.__name__)

plt.title("Clustering measures for 2 random uniform labelings\n"
          "with equal number of clusters")
plt.xlabel('Number of clusters (Number of samples is fixed to %d)' % n_samples)
plt.ylabel('Score value')
plt.legend(plots, names)
plt.ylim(bottom=-0.05, top=1.05)

# Random labelings with varying n_clusters, scored against ground truth
# class labels with a fixed number of classes
n_samples = 1000
n_clusters_range = np.linspace(2, 100, 10).astype(int)
n_classes = 10

plt.figure(2)

plots = []
names = []
for score_func in score_funcs:
    print("Computing %s for %d values of n_clusters and n_samples=%d"
          % (score_func.__name__, len(n_clusters_range), n_samples))

    t0 = time()
    scores = uniform_labelings_scores(score_func, n_samples, n_clusters_range,
                                      fixed_n_classes=n_classes)
    print("done in %0.3fs" % (time() - t0))
    plots.append(plt.errorbar(
        n_clusters_range, scores.mean(axis=1), scores.std(axis=1))[0])
    names.append(score_func.__name__)

plt.title("Clustering measures for random uniform labeling\n"
          "against reference assignment with %d classes" % n_classes)
plt.xlabel('Number of clusters (Number of samples is fixed to %d)' % n_samples)
plt.ylabel('Score value')
plt.ylim(bottom=-0.05, top=1.05)
plt.legend(plots, names)
plt.show()
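
The introduction mentions using an adjusted measure as a consensus index over overlapping sub-samples. One possible reading of that idea is sketched below. Everything specific here is an assumption made for illustration: the helper `consensus_index`, the choice of `KMeans` as the clustering algorithm, the synthetic `make_blobs` data with hand-picked centers, and the sub-sample fraction are not part of the original example.

```python
from itertools import combinations

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score


def consensus_index(X, k, n_subsamples=5, fraction=0.8, seed=0):
    """Mean pairwise ARI between clusterings of overlapping sub-samples.

    Hypothetical helper: each sub-sample is clustered independently, and
    two clusterings are compared only on the points they share.
    """
    rng = np.random.RandomState(seed)
    n = X.shape[0]
    size = int(fraction * n)
    runs = []
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=size, replace=False)
        labels = np.full(n, -1)  # -1 marks points outside this sub-sample
        model = KMeans(n_clusters=k, n_init=10, random_state=seed)
        labels[idx] = model.fit_predict(X[idx])
        runs.append(labels)
    scores = []
    for a, b in combinations(runs, 2):
        shared = (a >= 0) & (b >= 0)
        scores.append(adjusted_rand_score(a[shared], b[shared]))
    return float(np.mean(scores))


# Synthetic data with three well-separated groups (illustrative only).
X, _ = make_blobs(n_samples=300, centers=[[-5, -5], [0, 0], [5, 5]],
                  cluster_std=0.6, random_state=0)
for k in (2, 3, 6):
    print("k=%d  consensus=%.3f" % (k, consensus_index(X, k)))
```

Because ARI is adjusted for chance, scores obtained for different values of k sit on a comparable scale. A non-adjusted measure would tend to favor large k simply because random agreement inflates its value.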