Rand Index and Adjusted Rand Index

Latest recommended article published on 2025-03-17 22:26:20

若澜


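The Rand index and the adjusted Rand index are metrics for evaluating clustering quality. The Rand index measures the similarity of two partitions and takes values between 0 and 1, with 1 indicating a perfect match. The adjusted Rand index addresses the fact that the Rand index does not tend to 0 for random partitions; its value lies within [-1, 1], where 1 means the two partitions are identical, values near 0 mean chance-level agreement, and negative values mean agreement worse than chance. The adjusted Rand index makes no assumption about cluster shape, so it can compare clusterings of different shapes, but it requires ground-truth class labels.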



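The adjusted Rand index (ARI) is used to evaluate the performance of clustering models, but it requires ground-truth labels (true_label). Before formally introducing the adjusted Rand index, we first introduce its predecessor, the Rand index.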

Rand index

Given a set of $n$ objects $S=\{O_1,O_2,\ldots,O_n\}$, let $U=\{u_1,\ldots,u_R\}$ and $V=\{v_1,\ldots,v_C\}$ be two different partitions of $S$ satisfying $\bigcup_{i=1}^R u_i = S = \bigcup_{j=1}^C v_j$ and $u_i \cap u_{i^*} = \emptyset = v_j \cap v_{j^*}$, where $1 \leq i \neq i^* \leq R$ and $1 \leq j \neq j^* \leq C$.

Suppose $U$ is the external criterion, i.e. the true_label, and $V$ is the clustering result. Define four statistics:

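- $a$: the number of pairs of data points that are in the same class in $U$ and in the same cluster in $V$
- $b$: the number of pairs that are in the same class in $U$ but in different clusters in $V$
- $c$: the number of pairs that are in different classes in $U$ but in the same cluster in $V$
- $d$: the number of pairs that are in different classes in $U$ and in different clusters in $V$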

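| Class \ Cluster | Same cluster | Different cluster | Sum over $U$ |
| --- | --- | --- | --- |
| Same class | $a$ | $b$ | $a+b$ |
| Different class | $c$ | $d$ | $c+d$ |
| Sum over $V$ | $a+c$ | $b+d$ | $a+b+c+d$ |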

The Rand index is then:

$RI = \frac{a+d}{a+b+c+d}$

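The Rand index lies in [0, 1], and it equals 1 when the clustering result matches the true classes exactly. Every pair of points falls into exactly one of the four categories, so the denominator is $a+b+c+d=\binom{n}{2}$.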
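The pair counts and the Rand index defined above can be sketched directly in Python. This is a minimal brute-force illustration that enumerates all pairs. The function names `pair_counts` and `rand_index` and the sample label lists are assumptions made for this example, not part of the original post.

```python
from itertools import combinations


def pair_counts(true_labels, pred_labels):
    """Count the pair statistics a, b, c, d by enumerating every pair of points."""
    a = b = c = d = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_class = true_labels[i] == true_labels[j]
        same_cluster = pred_labels[i] == pred_labels[j]
        if same_class and same_cluster:
            a += 1  # same class in U, same cluster in V
        elif same_class:
            b += 1  # same class in U, different clusters in V
        elif same_cluster:
            c += 1  # different classes in U, same cluster in V
        else:
            d += 1  # different classes in U, different clusters in V
    return a, b, c, d


def rand_index(true_labels, pred_labels):
    """RI = (a + d) / (a + b + c + d); needs at least two data points."""
    a, b, c, d = pair_counts(true_labels, pred_labels)
    return (a + d) / (a + b + c + d)


# Hypothetical example: 6 points, 2 true classes, 3 predicted clusters.
true_labels = [0, 0, 0, 1, 1, 1]
pred_labels = [0, 0, 1, 1, 2, 2]
print(pair_counts(true_labels, pred_labels))  # (2, 4, 1, 8)
print(rand_index(true_labels, pred_labels))   # 10 / 15
```

Enumerating pairs costs O(n²) time, which is fine for illustration. For larger data the same counts can be obtained from a contingency table of class and cluster sizes.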

Adjusted Rand index

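The problem with the Rand index is that, for two random partitions, its value is not a constant close to 0. In 1985, Hubert and Arabie proposed the adjusted Rand index, which takes the generalized hypergeometric distribution as the model of randomness: the partitions $U$ and $V$ are drawn at random, subject to the number of data points in each class and in each cluster being fixed.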

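Let $n_{ij}$ denote the number of data points that are in both class $u_i$ and cluster $v_j$, $n_{i.}$ the number of data points in class $u_i$, and $n_{.j}$ the number of data points in cluster $v_j$, as in the following table: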

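| Class \ Cluster | $v_1$ | $v_2$ | … | $v_C$ | Sums |
| --- | --- | --- | --- | --- | --- |
| $u_1$ | $n_{11}$ | $n_{12}$ | … | $n_{1C}$ | $n_{1.}$ |
| $u_2$ | $n_{21}$ | $n_{22}$ | … | $n_{2C}$ | $n_{2.}$ |
| … | … | … | … | … | … |
| $u_R$ | $n_{R1}$ | $n_{R2}$ | … | $n_{RC}$ | $n_{R.}$ |
| Sums | $n_{.1}$ | $n_{.2}$ | … | $n_{.C}$ | $n_{..}=n$ |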
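To make the contingency table concrete, here is a small sketch that builds $n_{ij}$, the row sums $n_{i.}$ and the column sums $n_{.j}$ from two label lists. The helper name `contingency_table` and the sample labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def contingency_table(true_labels, pred_labels):
    """Return (table, row_sums, col_sums), where table[i][j] = n_ij."""
    classes = sorted(set(true_labels))   # u_1 ... u_R
    clusters = sorted(set(pred_labels))  # v_1 ... v_C
    counts = Counter(zip(true_labels, pred_labels))
    table = [[counts[(u, v)] for v in clusters] for u in classes]
    row_sums = [sum(row) for row in table]        # n_i.
    col_sums = [sum(col) for col in zip(*table)]  # n_.j
    return table, row_sums, col_sums


# Hypothetical example: 6 points, 2 true classes, 3 predicted clusters.
table, row_sums, col_sums = contingency_table(
    [0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2]
)
print(table)     # [[2, 1, 0], [0, 1, 2]]
print(row_sums)  # [3, 3]
print(col_sums)  # [2, 2, 2]
```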

The adjusted Rand index is:

$ARI = \frac{RI-E(RI)}{\max(RI)-E(RI)}$

The ARI is thus a mean-centred, normalised form of the index. With the class and cluster sizes fixed, $a+d$ is a linear function of $\sum_{i,j}\binom{n_{ij}}{2}$ (this sum equals $a$), so the sum can stand in for $RI$ in the formula above without changing the value of the ARI. Its expectation and maximum are:

$E(RI) = E(\sum_{i,j}\binom{n_{ij}}{2})=[\sum_i\binom{n_{i.}}{2}\sum_j\binom{n_{.j}}{2}]/\binom{n}{2}$

$\max(RI) = \frac{1}{2} [\sum_i\binom{n_{i.}}{2}+\sum_j\binom{n_{.j}}{2}]$

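The formulas above translate almost line by line into code. The following is a minimal sketch using only the standard library. The function name `adjusted_rand_index` and the handling of degenerate inputs (fewer than two points, or both partitions trivial) are assumptions of this example rather than something the original post specifies.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, pred_labels):
    """ARI computed from the contingency counts n_ij, n_i. and n_.j."""
    n = len(true_labels)
    if n < 2:
        return 1.0  # assumption: no pairs to compare, treat as perfect agreement

    n_ij = Counter(zip(true_labels, pred_labels))
    index = sum(comb(v, 2) for v in n_ij.values())                   # sum_ij C(n_ij, 2)
    sum_rows = sum(comb(v, 2) for v in Counter(true_labels).values())  # sum_i C(n_i., 2)
    sum_cols = sum(comb(v, 2) for v in Counter(pred_labels).values())  # sum_j C(n_.j, 2)

    expected = sum_rows * sum_cols / comb(n, 2)
    maximum = 0.5 * (sum_rows + sum_cols)
    if maximum == expected:
        # Happens only when both partitions are a single cluster or all singletons,
        # i.e. they are identical; returning 1.0 here is a convention, not a derivation.
        return 1.0
    return (index - expected) / (maximum - expected)


# Hypothetical example: index = 2, expected = 1.2, maximum = 4.5.
print(adjusted_rand_index([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2]))  # 0.8 / 3.3
```

If scikit-learn is available, `sklearn.metrics.adjusted_rand_score` computes the same quantity and is the more practical choice. The sketch only shows how the terms of the formula fit together.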
Advantages:
- Random (uniform) label assignments have an ARI score close to 0.0 for any value of n_clusters and n_samples (which is not the case for the raw Rand index or the V-measure, for instance).
- Bounded range [-1, 1]: values near 0 correspond to independent labelings and negative values to agreement worse than chance, similar clusterings have a positive ARI, and 1.0 is the perfect match score.
- No assumption is made on the cluster structure: the ARI can be used to compare clustering algorithms such as k-means, which assumes isotropic blob shapes, with results of spectral clustering algorithms, which can find clusters with “folded” shapes.
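The first advantage can be checked empirically. The sketch below draws pairs of independent uniform random labelings and compares the average raw Rand index with the average ARI. The sample size, number of clusters, number of trials, seed, and helper names are all arbitrary choices made for this illustration.

```python
import random
from collections import Counter
from math import comb


def pair_sums(labels_u, labels_v):
    """Return sum_ij C(n_ij,2), sum_i C(n_i.,2), sum_j C(n_.j,2)."""
    index = sum(comb(c, 2) for c in Counter(zip(labels_u, labels_v)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_u).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_v).values())
    return index, sum_rows, sum_cols


def rand_index(labels_u, labels_v):
    index, sum_rows, sum_cols = pair_sums(labels_u, labels_v)
    total = comb(len(labels_u), 2)
    # a + d expressed through the contingency sums (a = index).
    return (total + 2 * index - sum_rows - sum_cols) / total


def adjusted_rand_index(labels_u, labels_v):
    index, sum_rows, sum_cols = pair_sums(labels_u, labels_v)
    expected = sum_rows * sum_cols / comb(len(labels_u), 2)
    maximum = 0.5 * (sum_rows + sum_cols)
    # Degenerate inputs (maximum == expected) are not handled in this sketch;
    # they do not occur for the random labelings generated below.
    return (index - expected) / (maximum - expected)


rng = random.Random(0)  # fixed seed so the run is repeatable
n_samples, n_clusters, n_trials = 200, 5, 20

ri_values, ari_values = [], []
for _ in range(n_trials):
    u = [rng.randrange(n_clusters) for _ in range(n_samples)]
    v = [rng.randrange(n_clusters) for _ in range(n_samples)]
    ri_values.append(rand_index(u, v))
    ari_values.append(adjusted_rand_index(u, v))

mean_ri = sum(ri_values) / n_trials
mean_ari = sum(ari_values) / n_trials
print(f"mean RI  = {mean_ri:.3f}")
print(f"mean ARI = {mean_ari:.3f}")
```

With 5 equally likely labels, two random labelings agree on a pair with probability $0.2^2 + 0.8^2 = 0.68$, so the raw Rand index should sit well above 0 even though the labelings are unrelated, while the ARI should stay close to 0.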

Disadvantages:

- Contrary to inertia, ARI requires knowledge of the ground truth classes, which is almost never available in practice or requires manual assignment by human annotators (as in the supervised learning setting).

However, ARI can also be useful in a purely unsupervised setting as a building block for a Consensus Index that can be used for clustering model selection.

References:

http://faculty.washington.edu/kayee/pca/supp.pdf

http://scikit-learn.org/stable/modules/clustering.html#adjusted-rand-index