Background
Quite often, after we cluster a dataset, we need one or more metrics to analyze the result in an intuitive way. There are several commonly used evaluation metrics:
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
data = load_iris()
# Inspect what data contains
# print(data)
# Extract the features and labels of the iris dataset
# The dataset has four features: 'sepal length (cm)', 'sepal width (cm)',
# 'petal length (cm)', 'petal width (cm)'
train = data["data"]
label = data["target"]
y_pred = KMeans(n_clusters=3, random_state=9).fit_predict(train)
print(y_pred)
print(label)
#[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 2 2 2 2 0 2 2 2 2
# 2 2 0 0 2 2 2 2 0 2 0 2 0 2 2 0 0 2 2 2 2 2 0 2 2 2 2 0 2 2 2 0 2 2 2 0 2
# 2 0]
#[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
# 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
# 2 2]
1. Silhouette coefficient (silhouette_score)
The value lies in [-1, 1]: a value near -1 means samples have mostly been assigned to the wrong clusters, a value near 0 means the clusters overlap and are hard to tell apart, and a value near 1 means the clustering fits the data very well. In practice, though, this metric alone is often not very informative.
from sklearn.metrics import silhouette_score
# 1. Compute the silhouette coefficient (here evaluated on the reshaped label column)
silhouette_avg = silhouette_score(label.reshape(-1,1), y_pred)
silhouette_avg
0.7017242160053677
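The silhouette coefficient is an internal metric and is more commonly computed on the feature matrix itself rather than on the labels; a minimal sketch of that usage (its value will differ from the one above):
# Silhouette coefficient evaluated on the four iris features
silhouette_score(train, y_pred)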
2. Adjusted mutual information (adjusted_mutual_info_score)
from sklearn.metrics import adjusted_mutual_info_score
# 2. Compute the adjusted mutual information
mi_score = adjusted_mutual_info_score(label, y_pred)
mi_score
0.7551191675800484
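A useful property, sketched below with numpy: adjusted mutual information only compares how the two partitions group the samples, so renumbering the predicted clusters does not change the score.
import numpy as np
# Swap cluster ids 0 and 1 in the predictions; the AMI stays identical to mi_score
remapped = np.where(y_pred == 0, 1, np.where(y_pred == 1, 0, y_pred))
adjusted_mutual_info_score(label, remapped)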
3. Jaccard coefficient (jaccard_score)
from sklearn.metrics import jaccard_score
# 3. Compute the weighted Jaccard coefficient
jaccard_coeff = jaccard_score(label, y_pred, average='weighted')
jaccard_coeff
0.23076923076923075
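Unlike AMI, jaccard_score compares the raw label numbers, so the arbitrary numbering of the KMeans clusters drags the score down. From the two arrays printed earlier one can read off a plausible alignment (predicted 1 ↔ true 0, predicted 0 ↔ true 1, predicted 2 ↔ true 2); a sketch of re-scoring after this hypothetical remapping:
import numpy as np
# Hypothetical cluster-to-class mapping read off from the printed outputs above
mapping = {1: 0, 0: 1, 2: 2}
aligned = np.array([mapping[p] for p in y_pred])
jaccard_score(label, aligned, average='weighted')  # noticeably higher than before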
4. Weighted average (weighted_avg)
# Choose the weights
weights = [0.4, 0.3, 0.3]  # example weights that sum to 1 (they need not be equal)
# Weighted-average combination of the three scores
weighted_avg = weights[0] * mi_score + weights[1] * jaccard_coeff + weights[2] * silhouette_avg
weighted_avg
0.5817957010643988
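The same combination can also be written with numpy's weighted mean; a minimal equivalent sketch:
import numpy as np
# Weighted mean of the three scores, using the same weights as above
np.average([mi_score, jaccard_coeff, silhouette_avg], weights=weights)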
5. Calinski-Harabasz index (variance ratio criterion)
from sklearn.metrics import calinski_harabasz_score
# 5. Compute the Calinski-Harabasz index (here on the reshaped label column)
ch_score = calinski_harabasz_score(label.reshape(-1,1), y_pred)
ch_score
503.7200000000001
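Like the silhouette coefficient, the Calinski-Harabasz index is an internal metric that is usually evaluated on the feature matrix; a minimal sketch (the value will differ from the one above):
# Calinski-Harabasz index on the four iris features (higher is better)
calinski_harabasz_score(train, y_pred)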
6. Davies-Bouldin index
from sklearn.metrics import davies_bouldin_score
# 6. Compute the Davies-Bouldin index (here on the reshaped label column)
db_score = davies_bouldin_score(label.reshape(-1, 1), y_pred)
db_score
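The Davies-Bouldin index is likewise usually computed on the feature matrix; a minimal sketch:
# Davies-Bouldin index on the four iris features (lower is better)
davies_bouldin_score(train, y_pred)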
7. Adjusted Rand index (adjusted_rand_score)
from sklearn.metrics import adjusted_rand_score
# 7. Compute the adjusted Rand index
rand_index = adjusted_rand_score(label, y_pred)
rand_index
0.7302382722834697
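For comparison, scikit-learn (0.24 and later) also exposes the unadjusted Rand index as rand_score; the adjusted variant corrects it for chance agreement, so a random labeling scores close to 0. A minimal sketch, assuming your scikit-learn version provides it:
from sklearn.metrics import rand_score
# Unadjusted Rand index; typically higher than the chance-corrected value above
rand_score(label, y_pred)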
8. Maximum accuracy (acc)
First we need to define a custom function, which is derived from the task assignment problem in linear programming.
import numpy as np
from scipy.optimize import linear_sum_assignment

def acc(y_true, y_pred):
    y_true = y_true.astype(np.int64)
    assert y_pred.size == y_true.size
    D = max(y_pred.max(), y_true.max()) + 1
    # Contingency matrix: w[i, j] counts samples with predicted cluster i and true class j
    w = np.zeros((D, D), dtype=np.int64)
    print(w)
    print("--------- initialized ----------")
    for i in range(y_pred.size):
        w[y_pred[i], y_true[i]] += 1
    print(w)
    print("--------- counting finished ----------")
    # Convert the maximization into a cost-minimization problem for linear_sum_assignment
    print(w.max() - w)
    row_ind, col_ind = linear_sum_assignment(w.max() - w)
    print("--------- linear sum assignment ----------")
    print(row_ind, col_ind)
    # Accuracy under the best one-to-one cluster-to-class mapping
    return sum([w[i, j] for i, j in zip(row_ind, col_ind)]) * 1.0 / y_pred.size
acc(label, y_pred)
0.893333333333333
The last metric, maximum accuracy (acc), is a common evaluation metric in master's and doctoral theses (think for a moment about why it is called the "maximum" accuracy). The reasoning behind it is fairly simple, but articles explaining it are relatively scarce, so it is collected here for reference.
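The name comes from the fact that linear_sum_assignment chooses the cluster-to-class mapping that maximizes the number of correctly matched samples, so acc is the best accuracy achievable over all relabelings of the clusters. With only three clusters this can be checked by brute force; a minimal sketch (acc_bruteforce is a hypothetical helper introduced only for this check):
from itertools import permutations
import numpy as np

def acc_bruteforce(y_true, y_pred):
    # Try every permutation of the predicted cluster ids and keep the best accuracy;
    # only feasible when the number of clusters is small
    ids = np.unique(y_pred)
    best = 0.0
    for perm in permutations(ids):
        mapping = dict(zip(ids, perm))
        remapped = np.array([mapping[p] for p in y_pred])
        best = max(best, float(np.mean(remapped == y_true)))
    return best

acc_bruteforce(label, y_pred)  # agrees with acc(label, y_pred)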