AI / Machine Learning Fundamentals: Clustering (Performance Measures & Distance Computation)

Clustering

Performance Measures

Goal: a good clustering has high "intra-cluster similarity" and low "inter-cluster similarity".

  • External Index

    Compare the clustering result against some "reference model"

    $$\begin{array}{ll} a=|SS|, & SS=\left\{\left(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}\right) \mid \lambda_{i}=\lambda_{j},\ \lambda_{i}^{*}=\lambda_{j}^{*},\ i<j\right\} \\ b=|SD|, & SD=\left\{\left(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}\right) \mid \lambda_{i}=\lambda_{j},\ \lambda_{i}^{*} \neq \lambda_{j}^{*},\ i<j\right\} \\ c=|DS|, & DS=\left\{\left(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}\right) \mid \lambda_{i} \neq \lambda_{j},\ \lambda_{i}^{*}=\lambda_{j}^{*},\ i<j\right\} \\ d=|DD|, & DD=\left\{\left(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}\right) \mid \lambda_{i} \neq \lambda_{j},\ \lambda_{i}^{*} \neq \lambda_{j}^{*},\ i<j\right\} \end{array}$$
    The superscript $*$ marks the reference model's cluster labels. Since every sample pair falls into exactly one of the four sets, $a+b+c+d = m(m-1)/2$. For all external indices below, larger values are better.

    • Jaccard Coefficient (JC)

      $$JC = \frac{a}{a+b+c}$$

    • Fowlkes and Mallows Index (FMI)

      $$FMI = \sqrt{\frac{a}{a+b} \cdot \frac{a}{a+c}}$$

    • Rand Index (RI)

      $$RI = \frac{2(a+d)}{m(m-1)}$$

      where $m$ is the number of samples.
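The three external indices can be computed directly from the pair counts $a, b, c, d$. A minimal Python sketch (the function names are my own, not from any library):

```python
from itertools import combinations

def pair_counts(labels, labels_ref):
    """Count sample pairs by agreement between a clustering and a reference model.

    a: same cluster in both; b: same in clustering, different in reference;
    c: different in clustering, same in reference; d: different in both.
    """
    a = b = c = d = 0
    for i, j in combinations(range(len(labels)), 2):
        same = labels[i] == labels[j]
        same_ref = labels_ref[i] == labels_ref[j]
        if same and same_ref:
            a += 1
        elif same:
            b += 1
        elif same_ref:
            c += 1
        else:
            d += 1
    return a, b, c, d

def external_indices(labels, labels_ref):
    """Return (JC, FMI, RI) for a clustering vs. a reference labeling."""
    m = len(labels)
    a, b, c, d = pair_counts(labels, labels_ref)
    jc = a / (a + b + c)
    fmi = (a / (a + b) * a / (a + c)) ** 0.5
    ri = 2 * (a + d) / (m * (m - 1))
    return jc, fmi, ri
```

For a clustering identical to the reference, all three indices reach their maximum of 1.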

  • Internal Index

    Evaluate the clustering result directly, without any reference model

    $$\begin{aligned} &\operatorname{avg}(C)=\frac{2}{|C|(|C|-1)} \sum_{1 \leqslant i<j \leqslant|C|} \operatorname{dist}\left(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}\right) \\ &\operatorname{diam}(C)=\max _{1 \leqslant i<j \leqslant|C|} \operatorname{dist}\left(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}\right) \\ &d_{\min }\left(C_{i}, C_{j}\right)=\min _{\boldsymbol{x}_{i} \in C_{i},\ \boldsymbol{x}_{j} \in C_{j}} \operatorname{dist}\left(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}\right) \\ &d_{\mathrm{cen}}\left(C_{i}, C_{j}\right)=\operatorname{dist}\left(\boldsymbol{\mu}_{i}, \boldsymbol{\mu}_{j}\right) \end{aligned}$$
    $\operatorname{avg}(C)$ is the average distance between samples within cluster $C$, $\operatorname{diam}(C)$ the largest distance within $C$, $d_{\min}(C_i, C_j)$ the distance between the closest samples of the two clusters, and $d_{\mathrm{cen}}(C_i, C_j)$ the distance between the two cluster centroids.

    • Davies-Bouldin Index (DBI)

      $$\mathrm{DBI}=\frac{1}{k} \sum_{i=1}^{k} \max _{j \neq i}\left(\frac{\operatorname{avg}\left(C_{i}\right)+\operatorname{avg}\left(C_{j}\right)}{d_{\mathrm{cen}}\left(\boldsymbol{\mu}_{i}, \boldsymbol{\mu}_{j}\right)}\right)$$

      Smaller is better.

    • Dunn Index (DI)

      $$\mathrm{DI}=\min _{1 \leqslant i \leqslant k}\left\{\min _{j \neq i}\left(\frac{d_{\min }\left(C_{i}, C_{j}\right)}{\max _{1 \leqslant l \leqslant k} \operatorname{diam}\left(C_{l}\right)}\right)\right\}$$

      Larger is better.
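The quantities $\operatorname{avg}$, $\operatorname{diam}$, $d_{\min}$, $d_{\mathrm{cen}}$ and the two internal indices can be sketched in pure Python (function names are my own; clusters are lists of coordinate tuples):

```python
import math
from itertools import combinations

def avg(C):
    """avg(C): mean pairwise distance within cluster C."""
    n = len(C)
    return 2 * sum(math.dist(x, y) for x, y in combinations(C, 2)) / (n * (n - 1))

def diam(C):
    """diam(C): largest pairwise distance within cluster C."""
    return max(math.dist(x, y) for x, y in combinations(C, 2))

def d_min(Ci, Cj):
    """Distance between the closest samples of two clusters."""
    return min(math.dist(x, y) for x in Ci for y in Cj)

def centroid(C):
    return [sum(col) / len(C) for col in zip(*C)]

def d_cen(Ci, Cj):
    """Distance between the two cluster centroids."""
    return math.dist(centroid(Ci), centroid(Cj))

def dbi(clusters):
    """Davies-Bouldin Index: smaller is better."""
    k = len(clusters)
    return sum(
        max((avg(clusters[i]) + avg(clusters[j])) / d_cen(clusters[i], clusters[j])
            for j in range(k) if j != i)
        for i in range(k)
    ) / k

def dunn(clusters):
    """Dunn Index: larger is better."""
    k = len(clusters)
    max_diam = max(diam(C) for C in clusters)
    return min(d_min(clusters[i], clusters[j])
               for i in range(k) for j in range(k) if j != i) / max_diam
```

On two tight, well-separated clusters DBI is small and DI is large, matching the "smaller/larger is better" directions above.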

Distance Computation

dist(·,·)

A distance measure should satisfy non-negativity, identity (zero distance iff the two points coincide), symmetry, and the triangle inequality.

  • Minkowski Distance

    Suitable for ordinal attributes

    $$\operatorname{dist}_{mk}\left(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}\right)=\left(\sum_{u=1}^{n}\left|x_{iu}-x_{ju}\right|^{p}\right)^{\frac{1}{p}}, \quad p \geqslant 1$$

    This is the $L_p$ norm of $\boldsymbol{x}_i - \boldsymbol{x}_j$; $p=2$ gives the Euclidean distance and $p=1$ the Manhattan distance.
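As a quick illustration, a one-line Minkowski distance (my own helper, not a library function):

```python
def minkowski(x, y, p=2):
    """Minkowski distance: (sum_u |x_u - y_u|^p)^(1/p), with p >= 1."""
    if p < 1:
        raise ValueError("Minkowski distance requires p >= 1")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)
```

With `p=2` this reduces to the Euclidean distance and with `p=1` to the Manhattan distance.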

  • Value Difference Metric (VDM)

    Suitable for nominal (unordered) attributes

    $$VDM_{p}(a, b)=\sum_{i=1}^{k}\left|\frac{m_{u, a, i}}{m_{u, a}}-\frac{m_{u, b, i}}{m_{u, b}}\right|^{p}$$

    $m_{u,a}$ is the number of samples taking value $a$ on attribute $u$, $m_{u,a,i}$ the number of such samples within the $i$-th cluster, and $k$ the number of clusters.
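A sketch of $VDM_p$ for a single nominal attribute, computed from the raw attribute values and cluster assignments (the function name and argument layout are my own):

```python
def vdm(values, cluster_ids, a, b, p=2):
    """VDM_p between two values a, b of one nominal attribute.

    values[i] is sample i's value on the attribute; cluster_ids[i] its cluster.
    m_a   plays the role of m_{u,a}:  samples with value a overall.
    m_a_i plays the role of m_{u,a,i}: samples with value a in cluster i.
    """
    m_a = values.count(a)
    m_b = values.count(b)
    total = 0.0
    for c in set(cluster_ids):  # sum over the k clusters
        m_a_i = sum(1 for v, cid in zip(values, cluster_ids) if v == a and cid == c)
        m_b_i = sum(1 for v, cid in zip(values, cluster_ids) if v == b and cid == c)
        total += abs(m_a_i / m_a - m_b_i / m_b) ** p
    return total
```

Two values distributed identically across clusters get distance 0; values concentrated in different clusters get a large distance.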

  • The two measures above can be combined to handle mixed attribute types. Assuming $n_c$ ordinal attributes and $n-n_c$ nominal ones:

    $$\operatorname{MinkovDM}_{p}\left(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}\right)=\left(\sum_{u=1}^{n_{c}}\left|x_{iu}-x_{ju}\right|^{p}+\sum_{u=n_{c}+1}^{n} VDM_{p}\left(x_{iu}, x_{ju}\right)\right)^{\frac{1}{p}}$$
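A sketch of $\operatorname{MinkovDM}_p$, assuming the ordinal attributes come first and per-cluster value counts for each nominal attribute have been precomputed (the `freq` layout is my own convention, not from the source):

```python
def minkov_dm(x, y, n_c, freq, p=2):
    """MinkovDM_p over mixed attributes.

    The first n_c attributes of x and y are ordinal (numeric); the rest
    are nominal. freq[u][v] is the list of per-cluster counts of value v
    on attribute u, so sum(freq[u][v]) equals m_{u,v}.
    """
    total = sum(abs(x[u] - y[u]) ** p for u in range(n_c))   # Minkowski part
    for u in range(n_c, len(x)):                              # VDM part
        ca, cb = freq[u][x[u]], freq[u][y[u]]
        m_a, m_b = sum(ca), sum(cb)
        total += sum(abs(ai / m_a - bi / m_b) ** p for ai, bi in zip(ca, cb))
    return total ** (1 / p)
```

When the nominal values agree (or are distributed identically across clusters), the VDM part vanishes and the result is just the Minkowski distance over the ordinal attributes.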

  • When different attributes vary in importance, a "weighted distance" can be used, e.g.

    $$\operatorname{dist}_{wmk}\left(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}\right)=\left(w_{1} \cdot\left|x_{i1}-x_{j1}\right|^{p}+\cdots+w_{n} \cdot\left|x_{in}-x_{jn}\right|^{p}\right)^{\frac{1}{p}}$$

    where the weights $w_{i} \geqslant 0$ reflect the importance of each attribute, usually with $\sum_{i=1}^{n} w_{i}=1$.
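The weighted variant is a one-line change (again my own helper, for illustration):

```python
def weighted_minkowski(x, y, w, p=2):
    """Weighted Minkowski distance; w_i >= 0 encodes attribute importance."""
    if any(wi < 0 for wi in w):
        raise ValueError("weights must be non-negative")
    return sum(wi * abs(a - b) ** p for wi, a, b in zip(w, x, y)) ** (1 / p)
```

With all weights equal to 1 it reduces to the plain Minkowski distance.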

  • Non-metric Distance

    A distance measure that does not satisfy all of the axioms above (often the triangle inequality)

    • A suitable distance can be obtained via distance metric learning