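This post summarizes several mainstream evaluation metrics for clustering algorithms. Main reference: 《机器学习》 (Machine Learning) by Zhou Zhihua (周志华).
Among them, we focus on the principle and implementation of clustering accuracy ($AC$).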
Broadly, evaluation metrics for clustering algorithms fall into two categories:
0) external metrics
1) internal metrics
External metrics measure how well the clustering result agrees with the ground-truth labels when those labels are known. The commonly used ones are:
0) Jaccard Coefficient ($JC$);
1) Fowlkes and Mallows Index ($FMI$);
2) Rand Index ($RI$);
3) $Purity$;
4) Accuracy ($AC$);
5) Normalized Mutual Information ($NMI$);
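Internal metrics measure the quality of the clustering result itself (for example, intra-cluster cohesion and inter-cluster separation) when the ground-truth labels are not available. Two are commonly used:
6) Davies-Bouldin Index ($DBI$);
7) Dunn Index ($DI$);
Each of them is introduced below.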
Suppose the data set is $D = \{\textbf{x}_1, \dots, \textbf{x}_n\}$, the labels produced by clustering are $\textbf{p} = [p_1, \dots, p_n]$, and the true labels are $\textbf{r} = [r_1, \dots, r_n]$. Considering the samples in pairs, define
$SS = \{(\textbf{x}_i, \textbf{x}_j) | p_i = p_j, r_i = r_j, i < j \}$,
$SD = \{(\textbf{x}_i, \textbf{x}_j) | p_i = p_j, r_i \neq r_j, i < j \}$,
$DS = \{(\textbf{x}_i, \textbf{x}_j) | p_i \neq p_j, r_i = r_j, i < j \}$,
$DD = \{(\textbf{x}_i, \textbf{x}_j) | p_i \neq p_j, r_i \neq r_j, i < j \}$,
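where $SS$ contains the sample pairs that are predicted to be in the same cluster and also share the same true label,
$SD$ contains the pairs that are predicted to be in the same cluster but have different true labels,
$DS$ contains the pairs that are predicted to be in different clusters but share the same true label,
and $DD$ contains the pairs that are predicted to be in different clusters and also have different true labels.
Every sample pair belongs to exactly one of these four sets, so $|SS|+|SD|+|DS|+|DD| = n(n-1)/2$.
From these sets, the following external metrics can be derived: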
0) $JC$
\begin{equation}
JC = \frac{|SS|}{|SS|+|SD|+|DS|}
\end{equation}
1) $FMI$
\begin{equation}
FMI = \frac{|SS|}{\sqrt{(|SS|+|SD|)(|SS|+|DS|)}}
\end{equation}
2) $RI$
\begin{equation}
RI = \frac{|SS|+|DD|}{|SS|+|SD|+|DS|+|DD|} = \frac{2(|SS|+|DD|)}{n(n-1)}
\end{equation}
All of these metrics take values in $[0, 1]$, and larger is better.
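The pair-counting definitions above translate almost directly into code. The following is a minimal sketch rather than part of the original implementation: the class name `PairCountingMetrics` and its method names are made up for illustration, and labels are assumed to be plain `int` arrays of equal length. It enumerates all pairs in $O(n^2)$ time, which is fine for small data sets.

```java
public class PairCountingMetrics {

    /** Returns {|SS|, |SD|, |DS|, |DD|} for predicted labels p and true labels r. */
    public static long[] pairCounts(int[] p, int[] r) {
        if (p.length != r.length) {
            throw new IllegalArgumentException("label arrays must have equal length");
        }
        long ss = 0, sd = 0, ds = 0, dd = 0;
        for (int i = 0; i < p.length; i++) {
            for (int j = i + 1; j < p.length; j++) {
                boolean samePredicted = p[i] == p[j];
                boolean sameTrue = r[i] == r[j];
                if (samePredicted && sameTrue) {
                    ss++;
                } else if (samePredicted) {
                    sd++;
                } else if (sameTrue) {
                    ds++;
                } else {
                    dd++;
                }
            }
        }
        return new long[] { ss, sd, ds, dd };
    }

    // The ratios below are NaN when their denominator is zero
    // (for example, when no pair falls into SS, SD or DS).
    public static double jaccard(long[] c) {
        return (double) c[0] / (c[0] + c[1] + c[2]);
    }

    public static double fmi(long[] c) {
        return c[0] / Math.sqrt((double) (c[0] + c[1]) * (c[0] + c[2]));
    }

    public static double rand(long[] c) {
        return (double) (c[0] + c[3]) / (c[0] + c[1] + c[2] + c[3]);
    }

    public static void main(String[] args) {
        int[] p = { 0, 0, 1, 1, 1 }; // hypothetical predicted labels
        int[] r = { 0, 0, 0, 1, 1 }; // hypothetical true labels
        long[] c = pairCounts(p, r);
        System.out.println("JC  = " + jaccard(c));
        System.out.println("FMI = " + fmi(c));
        System.out.println("RI  = " + rand(c));
    }
}
```

For this toy input the ten pairs split into $|SS|=2$, $|SD|=2$, $|DS|=2$, $|DD|=4$, which gives $JC = 1/3$, $FMI = 0.5$ and $RI = 0.6$.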
Suppose the clustering yields the partition $C = \{C_i\}_{i=1}^k$ and the true partition is $C' = \{C_i'\}_{i=1}^s$. We build a matrix $W=\{w_{ij} = |C_i\cap C_j'|\}_{k\times s}$, which stores the number of samples shared by each predicted cluster and each true cluster,
as shown in Table 1:
3) $Purity$
As the name suggests, $Purity$ measures how pure the predicted clusters are. It can be obtained from the following optimization problem:
\begin{equation}
\begin{aligned}
Purity= & \max \frac{\sum_{i=1}^{k} \sum_{j=1}^{s} w_{ij} \mathbf{x}_{ij}}{\mathbf{1^T} W \mathbf{1}} \\
s.t. \quad & \sum_{j=1}^s \mathbf{x}_{ij} = 1, \quad i = 1, \dots, k \\
& \mathbf{x}_{ij} \in \{0, 1\}, \quad i = 1, \dots, k, \; j = 1, \dots, s
\end{aligned}
\end{equation}
Clearly, $\mathbf{1^T} W \mathbf{1}=n$ is the number of samples.
In practice, $Purity$ is simply the sum of the row maxima of $W$ divided by the total number of samples.
For Table 1, $Purity = \frac{10 + 20 + 8 + 15}{102}= 0.5196$.
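The "sum of row maxima" rule can be sketched in a few lines. This is an illustrative snippet, not the author's code: the class name `PurityExample` and the small matrix in `main` are made up, since the entries of Table 1 are not reproduced here. `w[i][j]` is assumed to hold $|C_i \cap C_j'|$.

```java
public class PurityExample {

    /** Purity = (sum over rows of the row maximum of W) / (sum of all entries of W). */
    public static double purity(int[][] w) {
        long total = 0;
        long sumOfRowMax = 0;
        for (int[] row : w) {
            int rowMax = 0; // counts are non-negative, so 0 is a safe start
            for (int v : row) {
                total += v;
                rowMax = Math.max(rowMax, v);
            }
            sumOfRowMax += rowMax;
        }
        return (double) sumOfRowMax / total;
    }

    public static void main(String[] args) {
        // Hypothetical 3 x 3 contingency matrix with 20 samples.
        int[][] w = {
            { 5, 1, 0 },
            { 1, 4, 1 },
            { 0, 2, 6 }
        };
        System.out.println(purity(w)); // (5 + 4 + 6) / 20
    }
}
```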
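4) $AC$
$AC$ is one of the most widely used clustering evaluation metrics; many papers report it as the measure of clustering quality. $AC$ is defined as follows:
\begin{equation}
AC(\mathbf{p}, \mathbf{r}) = \frac{\sum_{i=1}^n \delta(r_i, map(p_i))}{n},
\end{equation}
where
\begin{equation}
\delta(a, b) = \left\{\begin{array}{ll} 1, & \textrm{if } a = b; \\
0, & \textrm{otherwise}, \end{array}\right.
\end{equation}
and $map(p_i)$ is a permutation mapping function that maps each cluster label to its equivalent true label. The mapping between cluster labels and true labels is one-to-one (but not necessarily onto).
Many papers state that the best $map(\cdot)$ can be found by the Kuhn-Munkres algorithm [Matching Theory]. In fact, $AC$ can be obtained from the following optimization problem: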
\begin{equation}
\begin{aligned}
AC= & \max \frac{\sum_{i=1}^{k} \sum_{j=1}^{s} w_{ij} \mathbf{x}_{ij}}{\mathbf{1^T} W \mathbf{1}} \\
s.t. \quad & \sum_{j=1}^s \mathbf{x}_{ij} = 1, \quad i = 1, \dots, k \\
& \sum_{i=1}^k \mathbf{x}_{ij} = 1, \quad j = 1, \dots, s \\
& \mathbf{x}_{ij} \in \{0, 1\}, \quad i = 1, \dots, k, \; j = 1, \dots, s
\end{aligned}
\end{equation}
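As can be seen, the optimization problem for $AC$ has just one more constraint than the one for $Purity$. $Purity$ requires exactly one entry to be selected in each row, whereas $AC$ additionally requires exactly one entry to be selected in each column. In other words, a predicted cluster can correspond to only one true cluster, and a true cluster can correspond to only one predicted cluster. Consequently, when $k=s$ the optimal solution $X=\{\mathbf{x}_{ij}\}_{k\times s}$ is a permutation matrix, which is orthogonal. This optimization problem is known as the assignment problem, and there is a dedicated algorithm for it: the Hungarian algorithm, also known as the Kuhn-Munkres algorithm. That algorithm is therefore all that is needed to compute $AC$.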
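To see concretely how the extra column constraint separates $AC$ from $Purity$, the optimization problem can be solved by brute force on a tiny matrix. The sketch below is not the Hungarian algorithm: it simply tries every one-to-one assignment, so it only makes sense for small square $W$ and is meant as a cross-check. The class name `BruteForceAccuracy` and the example matrix are hypothetical.

```java
public class BruteForceAccuracy {

    /** AC for a square contingency matrix, by exhaustive search over all permutations. */
    public static double accuracy(int[][] w) {
        int k = w.length;
        long total = 0;
        for (int[] row : w) {
            if (row.length != k) {
                throw new IllegalArgumentException("this sketch assumes a square matrix");
            }
            for (int v : row) {
                total += v;
            }
        }
        long best = search(w, 0, new boolean[k]);
        return (double) best / total;
    }

    // Best total weight obtainable for rows row..k-1 using only the unused columns.
    private static long search(int[][] w, int row, boolean[] usedColumn) {
        if (row == w.length) {
            return 0;
        }
        long best = 0; // weights are non-negative counts
        for (int j = 0; j < w.length; j++) {
            if (usedColumn[j]) {
                continue;
            }
            usedColumn[j] = true;
            best = Math.max(best, w[row][j] + search(w, row + 1, usedColumn));
            usedColumn[j] = false;
        }
        return best;
    }

    public static void main(String[] args) {
        // Both predicted clusters are dominated by the first true cluster.
        int[][] w = {
            { 5, 0 },
            { 4, 1 }
        };
        System.out.println(accuracy(w));
    }
}
```

For this matrix $Purity = (5+4)/10 = 0.9$, because both rows may pick the first column, while $AC = (5+1)/10 = 0.6$, because each column can be used only once.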
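The principle and procedure of the Hungarian algorithm are covered in many optimization textbooks. The blog post
http://blog.csdn.net/zhanghaor/article/details/52344766
gives a Java implementation of it. When I used that implementation, however, I found that it fails to converge in some cases. Out of frustration I wrote my own, which I find more dependable. The Java code is as follows: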
import java.util.Arrays;
import org.ujmp.core.Matrix;
import org.ujmp.core.calculation.Calculation.Ret;
/**
* The Hungarian method for solving the assignment problem.
* @author Yanxue
*
*/
public class Hungary {
Matrix graph;
int n, m;
//int minMatchValue;
Matrix mapMatrix;
int[] mapIndices;
public static final int MAX_ITE_NUM = 1000;
public Hungary(Matrix pGraph) {
graph = pGraph.plus(Ret.NEW, false, 0);
n = (int) pGraph.getRowCount();
m = (int) pGraph.getColumnCount();
if (n != m) {
graphSqureChange();
}
}
private void graphSqureChange() {
if (n < m) {
graph = graph.appendVertically(Ret.LINK,
Matrix.Factory.zeros(m - n, m));
} else {
graph = graph.appendHorizontally(Ret.LINK,
Matrix.Factory.zeros(n, n - m));
}
n = (int) graph.getRowCount();
m = n;
}
public void findMinMatch() {
// Compute C'
Matrix rowMinValue = graph.min(Ret.NEW, 1);
Matrix tC = Matrix.Factory.emptyMatrix();
for (int i = 0; i < n; i++) {
tC = tC.appendVertically(Ret.LINK, graph.selectRows(Ret.LINK, i)
.minus(rowMinValue.getAsInt(i, 0)));
}
Matrix columnMinValue = tC.min(Ret.NEW, 0);
Matrix _tC = Matrix.Factory.emptyMatrix();
for (int i = 0; i < m; i++) {
_tC = _tC.appendHorizontally(
Ret.LINK,
tC.selectColumns(Ret.LINK, i).minus(
columnMinValue.getAsInt(0, i)));
}
//System.out.println("C(1) computed");
Matrix tMapMatrix = constructMapAndUpdate(_tC)[0];
int tCount = 0;
while (!isOptimal(tMapMatrix) && tCount++ < MAX_ITE_NUM) {
Matrix[] tMatrix = constructMapAndUpdate(_tC);
tMapMatrix = tMatrix[0];
_tC = tMatrix[1];
}
mapMatrix = tMapMatrix;
mapIndices = new int[n];
Arrays.fill(mapIndices, -1);
for (int i = 0; i < n; i++) {
for (int j = 0; j < m; j++) {
if(mapMatrix.getAsInt(i, j) == 1) {
mapIndices[i] = j;
break;
}
}
}
}
private Matrix[] constructMapAndUpdate(Matrix c) {
Matrix tMap = Matrix.Factory.zeros(n, m);
Matrix updateC = c.plus(Ret.NEW, false, 0);
int[][] rowZeroIndices = getRowZeroIndices(c);
int[] indexSequence = findMinToMaxRowZeroCountIndexSequence(rowZeroIndices);
boolean[] rowComputed = new boolean[n];
boolean[] columnComputed = new boolean[m];
for (int i = 0; i < n; i++) {
int currentRow = indexSequence[i];
for (int j = 0; j < rowZeroIndices[currentRow].length; j++) {
if (!columnComputed[rowZeroIndices[currentRow][j]]) {
tMap.setAsInt(1, currentRow, rowZeroIndices[currentRow][j]);
columnComputed[rowZeroIndices[currentRow][j]] = true;
// 1) Flag for having bracket.
rowComputed[currentRow] = true;
break;
}
}
}
//System.out.println("C(1)\r\n" + tMap);
if (isOptimal(tMap)) {
return new Matrix[] { tMap, updateC };
}
// C' --> C''
boolean[] rowFlag = new boolean[n];
// 1)
for (int i = 0; i < n; i++) {
rowFlag[i] = !rowComputed[i];
}
//System.out.println("C(1): " + Arrays.toString(rowFlag));
boolean[] columnFlag = new boolean[m];
boolean[] _rowFlag = new boolean[n];
boolean[] _columnFlag = new boolean[m];
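// Repeat steps 2) and 3) until no additional row or column gets flagged.
while (!Arrays.equals(_rowFlag, rowFlag)
|| !Arrays.equals(_columnFlag, columnFlag)) {
// Compare against copies: assigning the references would alias the
// arrays, making them always equal and ending the loop after one pass.
_rowFlag = rowFlag.clone();
_columnFlag = columnFlag.clone();
// 2) Flag the column indices of all zero elements in the flagged rows.
for (int i = 0; i < n; i++) {
// flagged row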
if (rowFlag[i]) {
for (int j = 0; j < rowZeroIndices[i].length; j++) {
columnFlag[rowZeroIndices[i][j]] = true;
}
}
}
//System.out.println("C(1)" + Arrays.toString(columnFlag));
// 3) Flag the rows that hold a bracketed (assigned) zero in a flagged
// column.
for (int i = 0; i < m; i++) {
if (columnFlag[i]) {
for (int j = 0; j < n; j++) {
if (tMap.getAsInt(j, i) == 1) {
rowFlag[j] = true;
break;
}
}
}
}
}
// 5) Find minimum element in those locations uncovered by lines.
int tMinValue = Integer.MAX_VALUE;
for (int i = 0; i < n; i++) {
// skip row Lines
if (!rowFlag[i]) {
continue;
}
for (int j = 0; j < m; j++) {
if (!columnFlag[j]) {
if (c.getAsInt(i, j) < tMinValue) {
tMinValue = c.getAsInt(i, j);
}
}
}
}
// 6) Subtract the minimum value from the flagged rows.
for (int i = 0; i < n; i++) {
if (rowFlag[i]) {
for (int j = 0; j < m; j++) {
updateC.setAsInt(updateC.getAsInt(i, j) - tMinValue, i, j);
}
}
}
// 7) Add the minimum value to the flagged columns.
for (int i = 0; i < m; i++) {
if (columnFlag[i]) {
for (int j = 0; j < n; j++) {
updateC.setAsInt(updateC.getAsInt(j, i) + tMinValue, j, i);
}
}
}
return new Matrix[] { tMap, updateC };
}
private int[] findMinToMaxRowZeroCountIndexSequence(int[][] rowZeroIndices) {
int[] tSequence = new int[n];
int tIndex = 0;
boolean[] rowComputed = new boolean[n];
while (tIndex < n) {
int minZeroCountIndex = 0;
int minZeroCount = Integer.MAX_VALUE;
for (int i = 0; i < n; i++) {
if (rowComputed[i]) {
continue;
}
if (rowZeroIndices[i].length < minZeroCount) {
minZeroCount = rowZeroIndices[i].length;
minZeroCountIndex = i;
}
}
tSequence[tIndex++] = minZeroCountIndex;
rowComputed[minZeroCountIndex] = true;
}
return tSequence;
}
private int[][] getRowZeroIndices(Matrix c) {
int[][] tRowZeroIndices = new int[n][];
int[] tRowZeroCounts = new int[n];
for (int i = 0; i < n; i++) {
for (int j = 0; j < m; j++) {
if (c.getAsInt(i, j) == 0) {
tRowZeroCounts[i]++;
}
}
}
for (int i = 0; i < n; i++) {
tRowZeroIndices[i] = new int[tRowZeroCounts[i]];
tRowZeroCounts[i] = 0;
for (int j = 0; j < m; j++) {
if (c.getAsInt(i, j) == 0) {
tRowZeroIndices[i][tRowZeroCounts[i]++] = j;
}
}
}
return tRowZeroIndices;
}
/**
* Judge if the map matrix is optimal.
*
* @param mapC
* @return
*/
private boolean isOptimal(Matrix mapC) {
return mapC.sum(Ret.NEW, Matrix.ALL, false).getAsInt(0, 0) == n;
}
public int[] getMapIndices() {
return mapIndices;
}
/**
Testing method.
**/
public static void main(String[] args) {
int[][] m = null;
m = new int[][]{
{ 12, 7, 9, 7, 9 },
{ 8, 9, 6, 6, 6 },
{ 7, 17, 12, 14, 9 },
{ 15, 14, 6, 6, 10 },
{ 4, 10, 7, 10, 9 }
};
m = new int[][]{
{2, 15, 13, 4},
{10, 4, 14, 15},
{9, 14, 16, 13},
{7, 8, 11, 9},
};
Matrix mMatrix = Matrix.Factory.zeros(m.length, m[0].length);
for (int i = 0; i < m.length; i++) {
for (int j = 0; j < m[i].length; j++) {
mMatrix.setAsInt(m[i][j], i, j);
}
}
Hungary h = new Hungary(mMatrix);
h.findMinMatch();
System.out.println(h.mapMatrix);
System.out.println(Arrays.toString(h.mapIndices));
}
}
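When using this algorithm, note the following two points:
- The third-party UJMP library is required, because matrix operations are involved. Download link: https://ujmp.org/;
- The algorithm solves the minimization form of the assignment problem. To obtain the optimal solution of a maximization problem (computing $AC$ is one), $W$ has to be converted to $W' = \{w_{ij}'\}_{k \times s}$ with $w_{ij}'=max(W)-w_{ij}$, where $max(W)$ is the largest entry of $W$. The optimal assignment of the converted minimization problem is the same as that of the original maximization problem.
To compute $AC$, take this matching, sum the corresponding entries of $W$, and divide by the total number of samples.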
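The two steps described above, converting $W$ into a cost matrix and turning a matching back into $AC$, might look like the sketch below. The class `AccuracyFromMatching` and its methods are hypothetical helpers, not part of the `Hungary` class. The `mapIndices` argument is assumed to have the same meaning as the array returned by `getMapIndices()` (row `i` is matched to column `mapIndices[i]`), and in `main` it is hard-coded instead of being computed, so that the snippet does not depend on UJMP.

```java
public class AccuracyFromMatching {

    /** W' with w'_ij = max(W) - w_ij, turning the maximization into a minimization. */
    public static int[][] toMinimizationCost(int[][] w) {
        int max = Integer.MIN_VALUE;
        for (int[] row : w) {
            for (int v : row) {
                max = Math.max(max, v);
            }
        }
        int[][] cost = new int[w.length][];
        for (int i = 0; i < w.length; i++) {
            cost[i] = new int[w[i].length];
            for (int j = 0; j < w[i].length; j++) {
                cost[i][j] = max - w[i][j];
            }
        }
        return cost;
    }

    /** AC = (sum of the matched entries of W) / (sum of all entries of W). */
    public static double accuracy(int[][] w, int[] mapIndices) {
        long matched = 0;
        long total = 0;
        for (int i = 0; i < w.length; i++) {
            for (int v : w[i]) {
                total += v;
            }
            int j = mapIndices[i];
            // A row matched to a zero-padded column (or left unmatched) adds nothing.
            if (j >= 0 && j < w[i].length) {
                matched += w[i][j];
            }
        }
        return (double) matched / total;
    }

    public static void main(String[] args) {
        int[][] w = { { 5, 0 }, { 4, 1 } };      // hypothetical contingency matrix
        int[][] cost = toMinimizationCost(w);    // {{0, 5}, {1, 4}}
        // The cheapest assignment on 'cost' is row 0 -> column 0, row 1 -> column 1
        // (cost 0 + 4 = 4 versus 5 + 1 = 6); it is written out by hand here.
        int[] mapIndices = { 0, 1 };
        System.out.println(accuracy(w, mapIndices));
    }
}
```

The conversion works because every complete assignment selects the same number of entries, so minimizing $\sum (max(W) - w_{ij})$ is the same as maximizing $\sum w_{ij}$.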
A Matlab implementation of this algorithm is also available; see
http://www.cad.zju.edu.cn/home/dengcai/Data/code/hungarian.m
5) $NMI$
$NMI$ is the normalized mutual information. Given two random variables $P$ and $Q$, the NMI between $P$ and $Q$ is
\begin{equation}
NMI(P, Q) = \frac{I(P, Q)}{\sqrt{H(P)H(Q)}},
\end{equation}
where $I(P,Q)$ is the mutual information between $P$ and $Q$, and $H(\cdot)$ is the information entropy. Some papers use $\max(H(P), H(Q))$ as the denominator instead. Since $\sqrt{H(P)H(Q)} \leq \max(H(P), H(Q))$, that variant never gives a larger value, but the two behave similarly.
Applying this to the predicted partition $C$ and the true partition $C'$, the NMI between them is
\begin{equation}
NMI(C, C') = \frac{\sum_{i=1}^k \sum_{j=1}^{s} |C_i \cap C'_j| \log \frac{n|C_i \cap C'_j|}{|C_i||C'_j|}}{\sqrt{(\sum_{i=1}^k |C_i| \log \frac{|C_i|}{n}) (\sum_{j=1}^{s} |C'_j| \log \frac{|C'_j|}{n}) } }
\end{equation}
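The formula for $NMI(C, C')$ only needs the contingency matrix $W$, whose row sums are $|C_i|$ and whose column sums are $|C'_j|$. The following sketch evaluates it term by term. The class name `NmiExample` is made up, empty cells are skipped using the usual convention $0 \log 0 = 0$, and the result is NaN when either partition consists of a single cluster, because the corresponding entropy is zero.

```java
public class NmiExample {

    /** NMI computed from a k x s contingency matrix w, with w[i][j] = |C_i ∩ C'_j|. */
    public static double nmi(int[][] w) {
        int k = w.length;
        int s = w[0].length;
        double[] rowSum = new double[k]; // |C_i|
        double[] colSum = new double[s]; // |C'_j|
        double n = 0;
        for (int i = 0; i < k; i++) {
            for (int j = 0; j < s; j++) {
                rowSum[i] += w[i][j];
                colSum[j] += w[i][j];
                n += w[i][j];
            }
        }
        // Numerator: sum of |C_i ∩ C'_j| * log(n |C_i ∩ C'_j| / (|C_i| |C'_j|)).
        double numerator = 0;
        for (int i = 0; i < k; i++) {
            for (int j = 0; j < s; j++) {
                if (w[i][j] > 0) {
                    numerator += w[i][j] * Math.log(n * w[i][j] / (rowSum[i] * colSum[j]));
                }
            }
        }
        // The two factors under the square root; both are <= 0, so their product is >= 0.
        double rowTerm = 0;
        for (double size : rowSum) {
            if (size > 0) {
                rowTerm += size * Math.log(size / n);
            }
        }
        double colTerm = 0;
        for (double size : colSum) {
            if (size > 0) {
                colTerm += size * Math.log(size / n);
            }
        }
        return numerator / Math.sqrt(rowTerm * colTerm);
    }

    public static void main(String[] args) {
        System.out.println(nmi(new int[][] { { 3, 0 }, { 0, 3 } })); // perfect agreement
        System.out.println(nmi(new int[][] { { 2, 2 }, { 2, 2 } })); // independent partitions
    }
}
```

A perfect clustering yields a value of 1 (up to floating-point rounding), and two statistically independent partitions yield 0.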
Finally, we turn to the two internal metrics. Internal metrics do not use the true labels; they reflect the cohesion of the predicted clusters themselves or the separation between clusters. Consider the partition $C=\{C_i\}_{i=1}^k$ produced by clustering, and define
\begin{equation}
\begin{aligned}
& avg(C_i)=\frac{2}{|C_i|(|C_i| - 1)}\sum_{\mathbf{x}_l, \mathbf{x}_j \in C_i,l<j} dist(\mathbf{x}_l,\mathbf{x}_j), \\
& diam(C_i) = \max_{\mathbf{x}_l, \mathbf{x}_j \in C_i,l<j} dist(\mathbf{x}_l,\mathbf{x}_j), \\
& d_{min}(C_i, C_j) = \min_{\mathbf{x}_l \in C_i, \mathbf{x}_m \in C_j} dist(\mathbf{x}_l, \mathbf{x}_m), \\
& d_{cen}(C_i, C_j) = dist(\mathbf{u}_i, \mathbf{u}_j),
\end{aligned}
\end{equation}
where $dist(\cdot, \cdot)$ is the distance between two samples and $\mathbf{u}_i$ denotes the center of cluster $C_i$. From these quantities we can derive the following internal metrics.
6) $DBI$
\begin{equation}
DBI =\frac{1}{k}\sum_{i=1}^k \max_{j\neq i} \left(\frac{avg(C_i)+avg(C_j)}{d_{cen}(C_i, C_j)}\right)
\end{equation}
Note that $DBI$ reflects both the separation between clusters and the cohesion within clusters; smaller is better.
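A direct evaluation of $DBI$ under the definitions in this post could look as follows. This is only a sketch with several assumptions that are not in the original text: the class name `DaviesBouldinExample` is invented, $dist$ is taken to be the Euclidean distance, the center $\mathbf{u}_i$ is the mean of the cluster's points, labels run from `0` to `k - 1`, and every cluster is non-empty. $avg(C_i)$ is the mean pairwise distance, as defined above, and is treated as 0 for a single-point cluster.

```java
public class DaviesBouldinExample {

    static double dist(double[] a, double[] b) {
        double sum = 0;
        for (int t = 0; t < a.length; t++) {
            sum += (a[t] - b[t]) * (a[t] - b[t]);
        }
        return Math.sqrt(sum);
    }

    /** DBI for points x with cluster labels in 0..k-1 (assumes k >= 2, no empty cluster). */
    public static double dbi(double[][] x, int[] labels, int k) {
        int dim = x[0].length;
        double[][] center = new double[k][dim];
        int[] size = new int[k];
        for (int i = 0; i < x.length; i++) {
            size[labels[i]]++;
            for (int t = 0; t < dim; t++) {
                center[labels[i]][t] += x[i][t];
            }
        }
        for (int c = 0; c < k; c++) {
            for (int t = 0; t < dim; t++) {
                center[c][t] /= size[c];
            }
        }
        // avg(C_c): mean distance over all pairs inside cluster c.
        double[] avg = new double[k];
        for (int i = 0; i < x.length; i++) {
            for (int j = i + 1; j < x.length; j++) {
                if (labels[i] == labels[j]) {
                    avg[labels[i]] += dist(x[i], x[j]);
                }
            }
        }
        for (int c = 0; c < k; c++) {
            if (size[c] > 1) {
                avg[c] *= 2.0 / (size[c] * (size[c] - 1.0));
            }
        }
        double sum = 0;
        for (int i = 0; i < k; i++) {
            double worst = 0;
            for (int j = 0; j < k; j++) {
                if (j != i) {
                    worst = Math.max(worst, (avg[i] + avg[j]) / dist(center[i], center[j]));
                }
            }
            sum += worst;
        }
        return sum / k;
    }

    public static void main(String[] args) {
        double[][] x = { { 0, 0 }, { 0, 2 }, { 10, 0 }, { 10, 2 } }; // hypothetical points
        int[] labels = { 0, 0, 1, 1 };
        System.out.println(dbi(x, labels, 2));
    }
}
```

In the toy data each cluster has $avg = 2$ and the centers are 10 apart, so $DBI = (2+2)/10 = 0.4$.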
7) $DI$
\begin{equation}
DI=\frac{\min_{1\leq i\leq k} \left\{\min_{j\neq i} d_{min}(C_i, C_j)\right\}}{\max_{1\leq l \leq k} diam(C_l)}
\end{equation}
For $DI$, larger is better.
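Because the numerator of $DI$ is the smallest distance between two points in different clusters and the denominator is the largest distance between two points in the same cluster, a single pass over all pairs is enough. The sketch below assumes Euclidean distance; the class name `DunnIndexExample` is invented for illustration. If no cluster contains two distinct points, the largest diameter is 0 and the ratio is not meaningful.

```java
public class DunnIndexExample {

    static double dist(double[] a, double[] b) {
        double sum = 0;
        for (int t = 0; t < a.length; t++) {
            sum += (a[t] - b[t]) * (a[t] - b[t]);
        }
        return Math.sqrt(sum);
    }

    /** DI = (smallest between-cluster distance) / (largest cluster diameter). */
    public static double dunn(double[][] x, int[] labels) {
        double minBetween = Double.POSITIVE_INFINITY;
        double maxDiameter = 0;
        for (int i = 0; i < x.length; i++) {
            for (int j = i + 1; j < x.length; j++) {
                double d = dist(x[i], x[j]);
                if (labels[i] == labels[j]) {
                    maxDiameter = Math.max(maxDiameter, d); // contributes to diam(C_l)
                } else {
                    minBetween = Math.min(minBetween, d);   // contributes to d_min(C_i, C_j)
                }
            }
        }
        return minBetween / maxDiameter;
    }

    public static void main(String[] args) {
        double[][] x = { { 0, 0 }, { 0, 2 }, { 10, 0 }, { 10, 2 } }; // hypothetical points
        int[] labels = { 0, 0, 1, 1 };
        System.out.println(dunn(x, labels));
    }
}
```

For the toy data the closest points of the two clusters are 10 apart and both diameters are 2, so $DI = 10 / 2 = 5$.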