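The adjusted Rand index (ARI) is used to evaluate the performance of clustering models, but it requires ground-truth labels (true_label). Before introducing the adjusted Rand index, we first cover its predecessor, the Rand index.
Rand index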
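Given a set $S$ of $n$ objects, let $U = \{u_1, \dots, u_R\}$ and $V = \{v_1, \dots, v_C\}$ be two different partitions of $S$ satisfying $\bigcup_{i=1}^{R} u_i = S = \bigcup_{j=1}^{C} v_j$ and $u_i \cap u_{i^*} = \emptyset = v_j \cap v_{j^*}$, where $1 \le i \ne i^* \le R$ and $1 \le j \ne j^* \le C$.
Suppose $U$ is the external criterion, i.e. true_label, and $V$ is the clustering result. Define four statistics:
- $a$: the number of pairs of data points that are in the same class in $U$ and in the same cluster in $V$
- $b$: the number of pairs of data points that are in the same class in $U$ but in different clusters in $V$
- $c$: the number of pairs of data points that are in different classes in $U$ but in the same cluster in $V$
- $d$: the number of pairs of data points that are in different classes in $U$ and in different clusters in $V$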
Class\Cluster | Same Cluster | Different Cluster | SumU |
---|---|---|---|
Same Class | a | b | a+b |
Different Class | c | d | c+d |
SumV | a+c | b+d | a+b+c+d |
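The Rand index is then:
$$\mathrm{RI} = \frac{a + d}{a + b + c + d} = \frac{a + d}{\binom{n}{2}}$$
The Rand index lies in [0, 1]; it equals 1 when the clustering result matches the true classes perfectly.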
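The four pair counts can be checked by brute force on a small example. The following is a minimal sketch using only the Python standard library; the function names `pair_counts` and `rand_index` and the toy labels are illustrative assumptions, not part of any library.

```python
from itertools import combinations


def pair_counts(labels_true, labels_pred):
    """Return (a, b, c, d) by inspecting every unordered pair of points."""
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_class = labels_true[i] == labels_true[j]
        same_cluster = labels_pred[i] == labels_pred[j]
        if same_class and same_cluster:
            a += 1  # same class in U, same cluster in V
        elif same_class:
            b += 1  # same class in U, different clusters in V
        elif same_cluster:
            c += 1  # different classes in U, same cluster in V
        else:
            d += 1  # different classes in U, different clusters in V
    return a, b, c, d


def rand_index(labels_true, labels_pred):
    """RI = (a + d) / (a + b + c + d)."""
    a, b, c, d = pair_counts(labels_true, labels_pred)
    return (a + d) / (a + b + c + d)


# Hypothetical toy data: 6 points, 2 true classes, 3 predicted clusters.
true_label = [0, 0, 0, 1, 1, 1]
pred_label = [0, 0, 1, 1, 2, 2]

print(pair_counts(true_label, pred_label))  # (2, 4, 1, 8)
print(rand_index(true_label, pred_label))   # 10 / 15
```

The loop is O(n²), which is fine for illustration; for larger data the same counts can be obtained from a contingency table.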
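Adjusted Rand index
The problem with the Rand index is that, for two random partitions, its value is not a constant close to 0. Hubert and Arabie proposed the adjusted Rand index in 1985. It takes the generalized hypergeometric distribution as the model of randomness: the partitions $U$ and $V$ are drawn at random, subject to the number of data points in each class and in each cluster being fixed.
Let $n_{ij}$ denote the number of data points that are in both class $u_i$ and cluster $v_j$, $n_{i\cdot}$ the number of data points in class $u_i$, and $n_{\cdot j}$ the number of data points in cluster $v_j$, as in the following table: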
Class\Cluster | $v_1$ | $v_2$ | … | $v_C$ | Sums |
---|---|---|---|---|---|
$u_1$ | $n_{11}$ | $n_{12}$ | … | $n_{1C}$ | $n_{1\cdot}$ |
$u_2$ | $n_{21}$ | $n_{22}$ | … | $n_{2C}$ | $n_{2\cdot}$ |
… | … | … | … | … | … |
$u_R$ | $n_{R1}$ | $n_{R2}$ | … | $n_{RC}$ | $n_{R\cdot}$ |
Sums | $n_{\cdot 1}$ | $n_{\cdot 2}$ | … | $n_{\cdot C}$ | $n_{\cdot\cdot} = n$ |
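A contingency table of this kind can be built directly from two label lists. The sketch below uses only the standard library; `contingency_table` is a hypothetical helper name, and the labels are made-up toy data.

```python
from collections import Counter


def contingency_table(labels_true, labels_pred):
    """Return (table, row_sums, col_sums) for classes (rows) vs clusters (columns)."""
    classes = sorted(set(labels_true))
    clusters = sorted(set(labels_pred))
    # counts[(u, v)] is n_ij: points that are in class u and in cluster v.
    counts = Counter(zip(labels_true, labels_pred))
    table = [[counts[(u, v)] for v in clusters] for u in classes]
    row_sums = [sum(row) for row in table]        # class sizes
    col_sums = [sum(col) for col in zip(*table)]  # cluster sizes
    return table, row_sums, col_sums


true_label = [0, 0, 0, 1, 1, 1]
pred_label = [0, 0, 1, 1, 2, 2]

table, row_sums, col_sums = contingency_table(true_label, pred_label)
print(table)     # [[2, 1, 0], [0, 1, 2]]
print(row_sums)  # [3, 3]
print(col_sums)  # [2, 2, 2]
```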
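The adjusted Rand index is:
$$\mathrm{ARI} = \frac{\sum_{i,j}\binom{n_{ij}}{2} - \left[\sum_{i}\binom{n_{i\cdot}}{2}\sum_{j}\binom{n_{\cdot j}}{2}\right] \Big/ \binom{n}{2}}{\frac{1}{2}\left[\sum_{i}\binom{n_{i\cdot}}{2} + \sum_{j}\binom{n_{\cdot j}}{2}\right] - \left[\sum_{i}\binom{n_{i\cdot}}{2}\sum_{j}\binom{n_{\cdot j}}{2}\right] \Big/ \binom{n}{2}}$$
ARI is in fact a mean-subtracted, normalized form, $\mathrm{ARI} = (\text{Index} - \text{Expected Index}) / (\text{Max Index} - \text{Expected Index})$. The index is $\sum_{i,j}\binom{n_{ij}}{2} = a$. With the class and cluster sizes fixed, $a + d$ in RI is a linear transformation of it: $a + d = \binom{n}{2} + 2\sum_{i,j}\binom{n_{ij}}{2} - \sum_{i}\binom{n_{i\cdot}}{2} - \sum_{j}\binom{n_{\cdot j}}{2}$.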
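To make the formula concrete, here is a minimal sketch that computes ARI from the contingency counts with the standard library only. The name `adjusted_rand_index` is an illustrative assumption, and returning 1.0 in the degenerate case (maximum index equal to expected index, for example when both partitions consist of a single group) is a convention chosen for this sketch, not something stated in the text above.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """ARI = (index - expected) / (max_index - expected)."""
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # assumption: fewer than two points counts as a perfect match

    n_ij = Counter(zip(labels_true, labels_pred))  # cells of the contingency table
    n_i = Counter(labels_true)                     # class sizes (row sums)
    n_j = Counter(labels_pred)                     # cluster sizes (column sums)

    index = sum(comb(m, 2) for m in n_ij.values())
    sum_rows = sum(comb(m, 2) for m in n_i.values())
    sum_cols = sum(comb(m, 2) for m in n_j.values())

    expected = sum_rows * sum_cols / total_pairs
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # assumption: degenerate partitions are treated as identical
    return (index - expected) / (max_index - expected)


true_label = [0, 0, 0, 1, 1, 1]
pred_label = [0, 0, 1, 1, 2, 2]

# index = 2, expected = 6 * 3 / 15 = 1.2, max_index = 4.5
print(adjusted_rand_index(true_label, pred_label))  # (2 - 1.2) / (4.5 - 1.2) = 8/33
```

If scikit-learn is installed, `sklearn.metrics.adjusted_rand_score(true_label, pred_label)` can serve as a cross-check on non-degenerate inputs.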
Advantages:
- Random (uniform) label assignments have an ARI score close to 0.0 for any value of n_clusters and n_samples (which is not the case for the raw Rand index or the V-measure, for instance).
- Bounded range [-1, 1]: negative values are bad (independent labelings), similar clusterings have a positive ARI, and 1.0 is the perfect match score.
- No assumption is made on the cluster structure: ARI can be used to compare clustering algorithms such as k-means, which assumes isotropic blob shapes, with the results of spectral clustering algorithms, which can find clusters with "folded" shapes.
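The first point can be checked empirically. The sketch below draws two independent uniform random labelings several times and averages RI and ARI; the helper name `rand_and_adjusted_rand`, the seed, and the sizes (500 samples, 10 labels, 20 trials) are arbitrary choices for illustration.

```python
import random
from collections import Counter
from math import comb


def rand_and_adjusted_rand(labels_true, labels_pred):
    """Return (RI, ARI) computed from contingency counts."""
    total = comb(len(labels_true), 2)
    index = sum(comb(m, 2) for m in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(comb(m, 2) for m in Counter(labels_true).values())
    sum_cols = sum(comb(m, 2) for m in Counter(labels_pred).values())
    # a + d = C(n, 2) + 2 * index - sum_rows - sum_cols
    ri = (total + 2 * index - sum_rows - sum_cols) / total
    expected = sum_rows * sum_cols / total
    max_index = (sum_rows + sum_cols) / 2
    ari = (index - expected) / (max_index - expected)
    return ri, ari


rng = random.Random(0)  # fixed seed so the run is repeatable
n_samples, n_clusters, n_trials = 500, 10, 20

ri_values, ari_values = [], []
for _ in range(n_trials):
    labels_a = [rng.randrange(n_clusters) for _ in range(n_samples)]
    labels_b = [rng.randrange(n_clusters) for _ in range(n_samples)]
    ri, ari = rand_and_adjusted_rand(labels_a, labels_b)
    ri_values.append(ri)
    ari_values.append(ari)

mean_ri = sum(ri_values) / n_trials
mean_ari = sum(ari_values) / n_trials
print(f"mean RI  = {mean_ri:.3f}")   # stays well above 0 for unrelated labelings
print(f"mean ARI = {mean_ari:.3f}")  # close to 0
```

With 10 equally likely labels, a pair agrees across the two labelings with probability 1/100 + 81/100, so the raw RI is expected to sit near 0.82 even though the labelings are unrelated, while the ARI hovers around 0.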
Disadvantages:
- Contrary to inertia, ARI requires knowledge of the ground truth classes, which is almost never available in practice or requires manual assignment by human annotators (as in the supervised learning setting).
- However, ARI can also be useful in a purely unsupervised setting as a building block for a Consensus Index that can be used for clustering model selection.
References:
http://faculty.washington.edu/kayee/pca/supp.pdf
http://scikit-learn.org/stable/modules/clustering.html#adjusted-rand-index