Scikit Learn - Clustering Performance Evaluation
There are various functions with the help of which we can evaluate the performance of clustering algorithms.
Following are some important and most commonly used functions provided by Scikit-learn for evaluating clustering performance −
Adjusted Rand Index
The Rand Index is a function that computes a similarity measure between two clusterings. For this computation, the Rand Index considers all pairs of samples and counts the pairs that are assigned to the same or different clusters in both the predicted and the true clustering. Afterwards, the raw Rand Index score is 'adjusted for chance' into the Adjusted Rand Index score using the following formula −
$$Adjusted\:RI=\left(RI-Expected\_RI\right)/\left(max\left(RI\right)-Expected\_RI\right)$$
It has two parameters, namely labels_true, which is the ground-truth class labels, and labels_pred, which is the cluster labels to evaluate.
Example
from sklearn.metrics.cluster import adjusted_rand_score
labels_true = [0, 0, 1, 1, 1, 1]
labels_pred = [0, 0, 2, 2, 3, 3]
adjusted_rand_score(labels_true, labels_pred)
Output
0.4444444444444445
Perfect labeling is scored 1, while bad or independent labeling is scored 0 or negative.
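As a quick check of these bounds (an added illustration, not part of the original tutorial), the score stays 1 for a perfect labeling even when the cluster ids are renamed, and it can go negative for a labeling that agrees less than chance would:

```python
from sklearn.metrics.cluster import adjusted_rand_score

labels_true = [0, 0, 1, 1, 1, 1]

# Perfect agreement: the same grouping under renamed cluster ids
print(adjusted_rand_score(labels_true, [1, 1, 0, 0, 0, 0]))   # 1.0

# Worse-than-chance agreement yields a negative score
print(adjusted_rand_score([0, 0, 1, 1], [0, 1, 0, 1]))        # -0.5
```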
Mutual Information Based Score
Mutual Information is a function that computes the agreement of two assignments, ignoring permutations of the labels. The following versions are available −
Normalized Mutual Information (NMI)
Scikit-learn provides the sklearn.metrics.normalized_mutual_info_score function.
Example
from sklearn.metrics.cluster import normalized_mutual_info_score
labels_true = [0, 0, 1, 1, 1, 1]
labels_pred = [0, 0, 2, 2, 3, 3]
normalized_mutual_info_score(labels_true, labels_pred)
Output
0.7611702597222881
Adjusted Mutual Information (AMI)
Scikit-learn provides the sklearn.metrics.adjusted_mutual_info_score function.
Example
from sklearn.metrics.cluster import adjusted_mutual_info_score
labels_true = [0, 0, 1, 1, 1, 1]
labels_pred = [0, 0, 2, 2, 3, 3]
adjusted_mutual_info_score(labels_true, labels_pred)
Output
0.4444444444444448
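To illustrate the permutation invariance mentioned above (an added sketch, not part of the original tutorial), renaming the predicted cluster ids leaves both scores unchanged:

```python
from sklearn.metrics.cluster import (
    normalized_mutual_info_score,
    adjusted_mutual_info_score,
)

labels_true = [0, 0, 1, 1, 1, 1]
labels_pred = [0, 0, 2, 2, 3, 3]
renamed     = [7, 7, 5, 5, 9, 9]   # same grouping, different cluster ids

# NMI and AMI depend only on the grouping, not on the id values
print(normalized_mutual_info_score(labels_true, labels_pred))
print(normalized_mutual_info_score(labels_true, renamed))   # identical to the line above
print(adjusted_mutual_info_score(labels_true, renamed))     # identical to the AMI output above
```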
Fowlkes-Mallows Score
The Fowlkes-Mallows function measures the similarity of two clusterings of a set of points. It may be defined as the geometric mean of the pairwise precision and recall.
Mathematically,
$$FMS=\frac{TP}{\sqrt{\left(TP+FP\right)\left(TP+FN\right)}}$$
Here, TP = True Positives − the number of pairs of points belonging to the same cluster in both the true and the predicted labels.
FP = False Positives − the number of pairs of points belonging to the same cluster in the true labels but not in the predicted labels.
FN = False Negatives − the number of pairs of points belonging to the same cluster in the predicted labels but not in the true labels.
Scikit-learn provides the sklearn.metrics.fowlkes_mallows_score function −
Example
from sklearn.metrics.cluster import fowlkes_mallows_score
labels_true = [0, 0, 1, 1, 1, 1]
labels_pred = [0, 0, 2, 2, 3, 3]
fowlkes_mallows_score(labels_true, labels_pred)
Output
0.6546536707079771
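The pair counts behind this score can be verified by hand. The following sketch (an added illustration, not part of the original tutorial) enumerates all sample pairs, tallies TP, FP, and FN as defined above, and reproduces the library's result:

```python
from itertools import combinations
from math import sqrt

labels_true = [0, 0, 1, 1, 1, 1]
labels_pred = [0, 0, 2, 2, 3, 3]

TP = FP = FN = 0
for i, j in combinations(range(len(labels_true)), 2):
    same_true = labels_true[i] == labels_true[j]
    same_pred = labels_pred[i] == labels_pred[j]
    if same_true and same_pred:
        TP += 1          # pair together in both clusterings
    elif same_true:
        FP += 1          # together in the true labels only
    elif same_pred:
        FN += 1          # together in the predicted labels only

fms = TP / sqrt((TP + FP) * (TP + FN))
print(TP, FP, FN, fms)   # 3 4 0 0.6546536707079771
```

The value 3/√21 ≈ 0.6547 matches the output of fowlkes_mallows_score above.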
Silhouette Coefficient
The Silhouette function computes the mean Silhouette Coefficient over all samples, using the mean intra-cluster distance and the mean nearest-cluster distance for each sample.
Mathematically,
$$S=\left(b-a\right)/max\left(a,b\right)$$
Here, a is the mean intra-cluster distance, and b is the mean nearest-cluster distance.
Scikit-learn provides the sklearn.metrics.silhouette_score function −
Example
from sklearn.metrics import silhouette_score
from sklearn.metrics import pairwise_distances
from sklearn import datasets
import numpy as np
from sklearn.cluster import KMeans
dataset = datasets.load_iris()
X = dataset.data
y = dataset.target
kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
labels = kmeans_model.labels_
silhouette_score(X, labels, metric='euclidean')
Output
0.5528190123564091
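Because the score rewards dense, well-separated clusters, it is often used to compare different choices of k. A quick sketch (an added illustration, not part of the original tutorial; the exact scores depend on the KMeans initialization) on the same Iris data:

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = datasets.load_iris().data

# Compare the mean silhouette for a few candidate cluster counts
scores = {}
for k in (2, 3, 10):
    labels = KMeans(n_clusters=k, random_state=1, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels, metric='euclidean')
    print(k, round(scores[k], 3))
```

Over-segmenting the data (k=10 here) tends to lower the mean silhouette relative to a small, well-separated number of clusters.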
Contingency Matrix
This matrix reports the intersection cardinality for every (true, predicted) cluster pair. The confusion matrix used for classification problems is a square contingency matrix.
Scikit-learn provides the sklearn.metrics.cluster.contingency_matrix function.
Example
from sklearn.metrics.cluster import contingency_matrix
x = ["a", "a", "a", "b", "b", "b"]
y = [1, 1, 2, 0, 1, 2]
contingency_matrix(x, y)
Output
array([[0, 2, 1],
       [1, 1, 1]])
The first row of the above output shows that among the three samples whose true cluster is "a", none is in predicted cluster 0, two are in cluster 1, and one is in cluster 2. The second row shows that among the three samples whose true cluster is "b", one is in cluster 0, one is in cluster 1, and one is in cluster 2.
Translated from: https://www.tutorialspoint.com/scikit_learn/scikit_learn_clustering_performance_evaluation.htm