Machine Learning | Clustering Evaluation Metrics

Related articles:

Machine Learning | Table of Contents

Machine Learning | Distance Calculation

Unsupervised Learning | Principles of KMeans and KMeans++

Unsupervised Learning | KMeans with Sklearn: Clustering Movie Ratings

1. Clustering Evaluation Metrics

Clustering performance evaluation

Clustering performance measures are also known as clustering "validity indices". As with performance measures in supervised learning, we need some measure to assess how good a clustering result is; conversely, once the measure to be used is fixed, it can serve directly as the optimization objective of the clustering process, helping to produce results that better meet our requirements.

Clustering partitions the sample set D into a number of disjoint subsets, the clusters. We want the result to have high "intra-cluster similarity" and low "inter-cluster similarity".

Clustering performance measures fall roughly into two categories. "External indices" compare the clustering result against a "reference model" (a labeled partition of the samples); "internal indices" assess the clustering result directly, without any reference model.

For a dataset $D=\{x_1,x_2,\cdots,x_n\}$, suppose clustering produces $k$ clusters $C=\{C_1,C_2,\cdots,C_k\}$ and the reference model gives $s$ clusters $C^*=\{C_1^*,C_2^*,\cdots,C_s^*\}$. Correspondingly, let $\lambda$ and $\lambda^*$ denote the cluster label vectors of $C$ and $C^*$. Considering the samples in pairs, define:

$$a=|SS|,\quad SS=\{(x_i,x_j)\mid \lambda_i=\lambda_j,\ \lambda_i^*=\lambda_j^*,\ i<j\} \tag{1}$$

$$b=|SD|,\quad SD=\{(x_i,x_j)\mid \lambda_i=\lambda_j,\ \lambda_i^*\neq\lambda_j^*,\ i<j\} \tag{2}$$

$$c=|DS|,\quad DS=\{(x_i,x_j)\mid \lambda_i\neq\lambda_j,\ \lambda_i^*=\lambda_j^*,\ i<j\} \tag{3}$$

$$d=|DD|,\quad DD=\{(x_i,x_j)\mid \lambda_i\neq\lambda_j,\ \lambda_i^*\neq\lambda_j^*,\ i<j\} \tag{4}$$

Here $SS$ is the set of sample pairs in which points $i$ and $j$ lie in the same cluster in the clustering result and also lie in the same cluster in the reference model, analogous to TP in a confusion matrix;

$SD$ is the set of pairs in which points $i$ and $j$ lie in the same cluster in the clustering result but in different clusters in the reference model, analogous to FP; and similarly for $DS$ and $DD$.

Since every sample pair $(x_i,x_j)$ with $i<j$ falls into exactly one of these sets, we have $a+b+c+d=C_n^2=n(n-1)/2$. [1]
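As a minimal illustration of these definitions (the helper `pair_counts` and the toy label vectors below are made up for this sketch, not part of any library), the four counts can be computed directly from two label vectors; the final line anticipates the Rand Index of the next section.

```python
from itertools import combinations

def pair_counts(labels_pred, labels_true):
    """Count the pair sets SS, SD, DS, DD, i.e. a, b, c, d in Eqs. (1)-(4)."""
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_pred)), 2):
        same_pred = labels_pred[i] == labels_pred[j]   # same cluster in C
        same_true = labels_true[i] == labels_true[j]   # same cluster in C*
        if same_pred and same_true:
            a += 1   # SS
        elif same_pred:
            b += 1   # SD
        elif same_true:
            c += 1   # DS
        else:
            d += 1   # DD
    return a, b, c, d

labels_pred = [0, 0, 1, 1, 1]
labels_true = [0, 0, 0, 1, 1]
a, b, c, d = pair_counts(labels_pred, labels_true)
print(a + b + c + d == len(labels_pred) * (len(labels_pred) - 1) // 2)  # True
print((a + d) / (a + b + c + d))  # Rand Index of this pair of labelings (see Sec. 1.1)
```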

1.1 External Evaluation Metrics

Based on Equations (1)-(4), the following commonly used external clustering performance metrics can be derived.

RI: Rand Index

$$RI=\frac{a+d}{C_n^2}$$

ARI: Adjusted Rand Index

sklearn.metrics.adjusted_rand_score

$$ARI=\frac{RI-E(RI)}{\max(RI)-E(RI)}\tag{5}$$

Advantages

  • Random (uniform) label assignments have an ARI score close to 0.0 for any value of n_clusters and n_samples (which is not the case for the raw Rand index or the V-measure, for instance).

  • Bounded range [-1, 1]: negative values are bad (independent labelings), similar clusterings have a positive ARI, 1.0 is the perfect match score.

  • No assumption is made on the cluster structure: it can be used to compare clustering algorithms such as k-means, which assumes isotropic blob shapes, with results of spectral clustering algorithms, which can find clusters with “folded” shapes.

Drawbacks

  • Contrary to inertia, ARI requires knowledge of the ground truth classes, which is almost never available in practice or requires manual assignment by human annotators (as in the supervised learning setting).

However ARI can also be useful in a purely unsupervised setting as a building block for a Consensus Index that can be used for clustering model selection (TODO).
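A brief usage sketch of the sklearn function listed above (the toy label vectors are purely illustrative): ARI is symmetric, invariant to a permutation of cluster ids, and near 0 for random labelings.

```python
from sklearn.metrics import adjusted_rand_score

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

print(adjusted_rand_score(labels_true, labels_pred))          # partial agreement, < 1.0
print(adjusted_rand_score(labels_true, [1, 1, 1, 0, 0, 0]))   # same partition, relabeled -> 1.0
print(adjusted_rand_score(labels_pred, labels_true))          # symmetric: same as the first call
```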

JC: Jaccard Coefficient

$$JC=\frac{a}{a+b+c}\tag{6}$$

FMI: Fowlkes-Mallows Index

sklearn.metrics.fowlkes_mallows_score

$$FMI=\sqrt{\frac{a}{a+b}\cdot\frac{a}{a+c}}\tag{7}$$

Advantages

  • Random (uniform) label assignments have an FMI score close to 0.0 for any value of n_clusters and n_samples (which is not the case for raw Mutual Information or the V-measure, for instance).

  • Upper-bounded at 1: values close to zero indicate two label assignments that are largely independent, while values close to one indicate significant agreement. Further, a value of exactly 0 indicates purely independent label assignments, and an FMI of exactly 1 indicates that the two label assignments are equal (with or without permutation).

  • No assumption is made on the cluster structure: it can be used to compare clustering algorithms such as k-means, which assumes isotropic blob shapes, with results of spectral clustering algorithms, which can find clusters with “folded” shapes.

Drawbacks

  • Contrary to inertia, FMI-based measures require knowledge of the ground truth classes, which is almost never available in practice or requires manual assignment by human annotators (as in the supervised learning setting).
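A short usage sketch of fowlkes_mallows_score (toy labels, for illustration only): it implements Eq. (7) from the pair counts and returns a value in [0, 1].

```python
from sklearn.metrics import fowlkes_mallows_score

labels_true = [0, 0, 0, 1, 1, 1]

print(fowlkes_mallows_score(labels_true, [0, 0, 1, 1, 2, 2]))  # partial agreement
print(fowlkes_mallows_score(labels_true, [1, 1, 1, 0, 0, 0]))  # identical up to relabeling -> 1.0
```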

MI: Mutual Information

sklearn.metrics.mutual_info_score

As before, for the dataset $D=\{x_1,x_2,\cdots,x_n\}$, suppose clustering gives $k$ clusters $C=\{C_1,C_2,\cdots,C_k\}$ and the reference model gives $s$ clusters $C^*=\{C_1^*,C_2^*,\cdots,C_s^*\}$.

$$\begin{aligned} MI(C,C^*) &= \sum_{i=1}^k \sum_{j=1}^s P(C_i\cap C_j^*)\log \frac{P(C_i\cap C_j^*)}{P(C_i)P(C_j^*)} \\ &= \sum_{i=1}^k \sum_{j=1}^s \frac{|C_i\cap C_j^*|}{n}\log\frac{n\cdot|C_i\cap C_j^*|}{|C_i||C_j^*|} \end{aligned}\tag{8}$$

$P(C_i)$, $P(C_j^*)$, and $P(C_i\cap C_j^*)$ can be interpreted as the probabilities that a sample belongs to cluster $C_i$, to class $C_j^*$, and to both, respectively.

Define the entropy H:

$$\begin{aligned} H(C) &= -\sum_{i=1}^k P(C_i)\log P(C_i) \\ &= -\sum_{i=1}^k \frac{|C_i|}{n}\log\frac{|C_i|}{n} \end{aligned}\tag{9}$$

Given the cluster information $C^*$, MI measures the gain in information about $C$, that is, the reduction in its uncertainty. Intuitively, it can be written as:

$$MI(C,C^*)=H(C)-H(C|C^*)\tag{10}$$

  • The minimum value of MI is 0, attained when the clusters are random with respect to the reference classes, i.e. the two labelings are independent and $C$ provides no useful information about $C^*$;

  • The more closely $C$ and $C^*$ are related, the larger $MI(C,C^*)$ becomes. If $C$ exactly reproduces $C^*$, MI reaches its maximum.

  • k = n k=n k=n 时,即类簇数和样本个数相等,MI 也能达到最大值。所以 MI 也存在和纯度类似的问题,即它并不对簇数目较大的聚类结果进行惩罚,因此也不能在其他条件一样的情况下,对簇数目越小越好的这种期望进行形式化。

NMI: Normalized Mutual Information

sklearn.metrics.normalized_mutual_info_score

NMI addresses this problem, because entropy grows with the number of clusters: when $k=n$, $H(C)$ reaches its maximum value $\log(n)$, which keeps NMI low. The denominator uses $\frac{1}{2}(H(C)+H(C^*))$ because it is a tight upper bound on $MI(C,C^*)$, which guarantees $NMI\in[0,1]$.

$$NMI(C,C^*)=\frac{2\times MI(C,C^*)}{H(C)+H(C^*)}\tag{11}$$

AMI: Adjusted Mutual Information

sklearn.metrics.adjusted_mutual_info_score

$$AMI(C,C^*)=\frac{MI(C,C^*)-E[MI(C,C^*)]}{avg(H(C),H(C^*))-E[MI(C,C^*)]}\tag{12}$$

a i = ∣ C i ∣ , b j = ∣ C J ∗ ∣ a_i=|C_i|,b_j=|C_J^*| ai=Ci,bj=CJ ,则 E [ M I ( C , C ∗ ) ] E[MI(C,C^*)] E[MI(C,C)] 为:

$$E[MI(C,C^*)] = \sum_{i=1}^{|C|}\sum_{j=1}^{|C^*|} \sum_{n_{ij}=(a_i+b_j-n)^+}^{\min(a_i,b_j)} \frac{n_{ij}}{n}\log\left(\frac{n\cdot n_{ij}}{a_i b_j}\right) \frac{a_i!\,b_j!\,(n-a_i)!\,(n-b_j)!}{n!\,n_{ij}!\,(a_i-n_{ij})!\,(b_j-n_{ij})!\,(n-a_i-b_j+n_{ij})!} \tag{13}$$

When the logarithm is taken base 2 the unit is bits; base e gives nats. [2]
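A small comparison of the three sklearn functions on randomly generated labels (the sizes and seed below are arbitrary): with many clusters, NMI of two independent labelings stays noticeably above zero, while AMI, which subtracts the expected MI, stays near 0.

```python
import numpy as np
from sklearn.metrics import (mutual_info_score,
                             normalized_mutual_info_score,
                             adjusted_mutual_info_score)

rng = np.random.default_rng(0)
labels_true = rng.integers(0, 10, size=100)  # 10 reference classes
labels_rand = rng.integers(0, 10, size=100)  # independent random clustering

print(mutual_info_score(labels_true, labels_rand))             # > 0 despite independence
print(normalized_mutual_info_score(labels_true, labels_rand))  # still clearly above 0
print(adjusted_mutual_info_score(labels_true, labels_rand))    # ~ 0 (can be slightly negative)
```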

1.2 Internal Evaluation Metrics

Consider a clustering result with $k$ clusters $C=\{C_1,C_2,\cdots,C_k\}$. Let $dist(\cdot,\cdot)$ denote the distance between two samples (see 2. Distance Calculation), and let $\mu$ denote the centroid of a cluster $C$, $\mu=\frac{1}{|C|}\sum_{x_i\in C} x_i$.

Define:

C C C 内样本间的平均距离 a v g ( C ) avg(C) avg(C)

$$avg(C)=\frac{2}{|C|(|C|-1)}\sum_{1\leq i<j\leq |C|} dist(x_i,x_j) \tag{14}$$

C C C 内样本间的最远距离 d i a m ( C ) diam(C) diam(C)

$$diam(C)=\max_{1\leq i<j\leq |C|} dist(x_i,x_j) \tag{15}$$

C i C_i Ci 与簇 C j C_j Cj 最近的样本间距离 d m i n ( C i , C j ) d_{min}(C_i,C_j) dmin(Ci,Cj)

$$d_{min}(C_i,C_j)=\min_{x_i\in C_i,\, x_j\in C_j} dist(x_i,x_j) \tag{16}$$

C i C_i Ci 与簇 C j C_j Cj 中心点间的距离 d c e n ( C i , C j ) d_{cen}(C_i,C_j) dcen(Ci,Cj)

$$d_{cen}(C_i,C_j)=dist(\mu_i,\mu_j) \tag{17}$$
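A minimal NumPy/SciPy sketch of Eqs. (14)-(17), assuming Euclidean distance for $dist(\cdot,\cdot)$ (the function names and toy clusters are illustrative, not from any library):

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

def avg(C):
    """Eq. (14): mean pairwise distance within a cluster."""
    return pdist(C).mean()

def diam(C):
    """Eq. (15): largest pairwise distance within a cluster."""
    return pdist(C).max()

def d_min(Ci, Cj):
    """Eq. (16): smallest distance between samples of two clusters."""
    return cdist(Ci, Cj).min()

def d_cen(Ci, Cj):
    """Eq. (17): distance between the two cluster centroids."""
    return np.linalg.norm(Ci.mean(axis=0) - Cj.mean(axis=0))

C1 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
C2 = np.array([[5.0, 5.0], [6.0, 5.0]])
print(avg(C1), diam(C1), d_min(C1, C2), d_cen(C1, C2))
```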

DBI: Davies-Bouldin Index

Lower values indicate better clustering.

sklearn.metrics.davies_bouldin_score

$$DBI=\frac{1}{k}\sum_{i=1}^k \max_{j\neq i}\left(\frac{avg(C_i)+avg(C_j)}{d_{cen}(\mu_i,\mu_j)}\right) \tag{18}$$

DI: Dunn Index

Higher values indicate better clustering.

$$DI=\min_{1\leq i\leq k}\left\{\min_{j\neq i}\left(\frac{d_{min}(C_i,C_j)}{\max_{1\leq l\leq k} diam(C_l)}\right)\right\} \tag{19}$$
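sklearn does not ship a Dunn Index, so the sketch below computes Eq. (19) directly (Euclidean distance assumed; the function and variable names are illustrative):

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

def dunn_index(X, labels):
    """Eq. (19): smallest inter-cluster distance over largest cluster diameter."""
    clusters = [X[labels == c] for c in np.unique(labels)]
    max_diam = max(pdist(C).max() for C in clusters if len(C) > 1)
    min_sep = min(cdist(Ci, Cj).min()
                  for i, Ci in enumerate(clusters)
                  for Cj in clusters[i + 1:])
    return min_sep / max_diam

X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1])
print(dunn_index(X, labels))  # well-separated, compact clusters give a large value
```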

SC: Silhouette Coefficient

sklearn.metrics.silhouette_score

For a single sample, let $a$ be the mean distance to the other samples in its own cluster and $b$ the mean distance to the samples in the nearest other cluster. The Silhouette Coefficient of that sample is

$$S=\frac{b-a}{\max(a,b)} \tag{20}$$

and the score for a whole clustering is the mean of $S$ over all samples.

Advantages

  • The score is bounded between -1 for incorrect clustering and +1 for highly dense clustering. Scores around zero indicate overlapping clusters.

  • The score is higher when clusters are dense and well separated, which relates to a standard concept of a cluster.

Drawbacks

  • The Silhouette Coefficient is generally higher for convex clusters than for other cluster concepts, such as density-based clusters like those obtained through DBSCAN (i.e. it is less suitable for the irregular, noisy clusterings that DBSCAN produces).
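To close the section, a short sketch computing both internal sklearn metrics on synthetic data (the make_blobs and KMeans parameters are chosen only for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

print(silhouette_score(X, labels))      # SC: closer to +1 is better
print(davies_bouldin_score(X, labels))  # DBI: closer to 0 is better
```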

References

[1] 周志华. 机器学习[M]. 北京: 清华大学出版社, 2016: 197-199.

[2] Gan Pan. [ML] 聚类评价指标 [EB/OL]. https://zhuanlan.zhihu.com/p/53840697, 2019-06-28.
