Data Mining and Analysis Course Notes (Chapter 14)

  • Reference textbook: *Data Mining and Analysis* by Mohammed J. Zaki and Wagner Meira Jr.

Contents

  1. Data Mining and Analysis Course Notes (Table of Contents)
  2. Data Mining and Analysis Course Notes (Chapter 1)
  3. Data Mining and Analysis Course Notes (Chapter 2)
  4. Data Mining and Analysis Course Notes (Chapter 5)
  5. Data Mining and Analysis Course Notes (Chapter 7)
  6. Data Mining and Analysis Course Notes (Chapter 14)
  7. Data Mining and Analysis Course Notes (Chapter 15)
  8. Data Mining and Analysis Course Notes (Chapter 20)
  9. Data Mining and Analysis Course Notes (Chapter 21)


Chapter 14: Hierarchical Clustering

14.1 Preliminaries

Def. 1 Given a dataset $\mathbf{D}=\{\mathbf{x}_1,\mathbf{x}_2,\cdots,\mathbf{x}_n\}$ with $\mathbf{x}_i \in \mathbb{R}^d$, a clustering of $\mathbf{D}$ is a partition $\mathcal{C}=\{C_1,C_2,\cdots,C_k\}$ of $\mathbf{D}$ such that $C_i \subseteq \mathbf{D}$, $C_i \cap C_j = \emptyset$ for $i \neq j$, and $\bigcup_{i=1}^k C_i = \mathbf{D}$.

A clustering $\mathcal{A}=\{A_1,\cdots,A_r\}$ is said to be nested in a clustering $\mathcal{B}=\{B_1,\cdots,B_s\}$ if $r > s$ and for every $A_i \in \mathcal{A}$ there exists $B_j \in \mathcal{B}$ such that $A_i \subseteq B_j$.

A hierarchical clustering of $\mathbf{D}$ is a sequence of nested clusterings $\mathcal{C}_1,\cdots,\mathcal{C}_n$, where $\mathcal{C}_1=\{\{\mathbf{x}_1\},\{\mathbf{x}_2\},\cdots,\{\mathbf{x}_n\}\}$, $\mathcal{C}_n=\{\{\mathbf{x}_1,\mathbf{x}_2,\cdots,\mathbf{x}_n\}\}$, and each $\mathcal{C}_t$ is nested in $\mathcal{C}_{t+1}$.

Def. 2 The dendrogram of a hierarchical clustering is the graph whose vertex set consists of all clusters appearing in $\mathcal{C}_1,\cdots,\mathcal{C}_n$; there is an edge between $C_i \in \mathcal{C}_t$ and $C_j \in \mathcal{C}_{t+1}$ whenever $C_i \subseteq C_j$.


Facts:

  1. The dendrogram is a binary tree (not necessarily in general; here we assume each step merges exactly two clusters), and a hierarchical clustering corresponds one-to-one with its dendrogram.
  2. Let $|\mathbf{D}| = n$ (i.e., the number of data points is $n$). Then the number of possible dendrograms is $(2n-3)!! = 1 \times 3 \times 5 \times \cdots \times (2n-3)$ (double factorial).
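As a quick sanity check on the count above, the double factorial can be computed directly (a small illustrative sketch; the function name is my own, not from the book):

```python
def num_dendrograms(n):
    """Number of distinct binary dendrograms over n points: (2n-3)!!."""
    result = 1
    for k in range(1, 2 * n - 2, 2):  # multiply 1 * 3 * 5 * ... * (2n-3)
        result *= k
    return result

# n = 2 allows a single merge order; n = 3 allows 3; and so on.
print([num_dendrograms(n) for n in range(2, 6)])  # [1, 3, 15, 105]
```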

14.2 Agglomerative Hierarchical Clustering

Algorithm 14.1:

Input: $\mathbf{D}, k$

Output: $\mathcal{C}$

  1. $\mathcal{C} \leftarrow \{C_i=\{\mathbf{x}_i\} \mid \mathbf{x}_i \in \mathbf{D}\}$
  2. $\Delta \leftarrow \{\delta(\mathbf{x}_i,\mathbf{x}_j) : \mathbf{x}_i,\mathbf{x}_j \in \mathbf{D}\}$
  3. repeat
  4. ​ find the closest pair $C_i, C_j \in \mathcal{C}$
  5. ​ $C_{ij} \leftarrow C_i \cup C_j$
  6. ​ $\mathcal{C} \leftarrow (\mathcal{C} \setminus \{C_i, C_j\}) \cup \{C_{ij}\}$
  7. ​ update the distance matrix $\Delta$ according to $\mathcal{C}$
  8. until $|\mathcal{C}| = k$
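Algorithm 14.1 can be sketched as follows (an illustrative implementation using single link as the cluster distance; the function names are my own, not from the book):

```python
import math

def single_link(Ci, Cj):
    """delta(Ci, Cj) = minimum Euclidean distance over cross-cluster pairs."""
    return min(math.dist(x, y) for x in Ci for y in Cj)

def agglomerative(D, k):
    """Repeatedly merge the closest pair of clusters until k clusters remain."""
    C = [[x] for x in D]                       # step 1: start from singletons
    while len(C) > k:                          # steps 3-8
        # step 4: find the closest pair of clusters
        i, j = min(
            ((a, b) for a in range(len(C)) for b in range(a + 1, len(C))),
            key=lambda ab: single_link(C[ab[0]], C[ab[1]]),
        )
        Cij = C[i] + C[j]                      # step 5: merge the pair
        # step 6: remove Ci, Cj and add the merged cluster
        C = [c for t, c in enumerate(C) if t not in (i, j)] + [Cij]
    return C

clusters = agglomerative([(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)], 2)
print(sorted(sorted(c) for c in clusters))
# [[(0.0, 0.0), (0.1, 0.0)], [(5.0, 5.0), (5.1, 5.0)]]
```

This naive version recomputes all pairwise cluster distances each round; the Lance–Williams update discussed below is what makes step 7 efficient.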

Question: how should the distance between clusters $C_i$ and $C_j$, i.e. $\delta(C_i,C_j)$, be defined and computed?

There are five common choices for $\delta(C_i,C_j)$:

  1. Single link: $\delta(C_i,C_j) := \min\{\delta(\mathbf{x},\mathbf{y}) \mid \mathbf{x} \in C_i, \mathbf{y} \in C_j\}$

  2. Complete link: $\delta(C_i,C_j) := \max\{\delta(\mathbf{x},\mathbf{y}) \mid \mathbf{x} \in C_i, \mathbf{y} \in C_j\}$

  3. Group average: $\delta(C_i,C_j) := \dfrac{\sum_{\mathbf{x}\in C_i}\sum_{\mathbf{y}\in C_j}\delta(\mathbf{x},\mathbf{y})}{n_i \cdot n_j}$, where $n_i = |C_i|$, $n_j = |C_j|$

  4. Mean distance: $\delta(C_i,C_j) := \|\boldsymbol{\mu}_i-\boldsymbol{\mu}_j\|^2$, where $\boldsymbol{\mu}_i = \frac{1}{n_i}\sum_{\mathbf{x}\in C_i}\mathbf{x}$ and $\boldsymbol{\mu}_j = \frac{1}{n_j}\sum_{\mathbf{y}\in C_j}\mathbf{y}$

  5. Minimum variance: for any cluster $C_i$, define the sum of squared errors $SSE_i = \sum_{\mathbf{x}\in C_i}\|\mathbf{x}-\boldsymbol{\mu}_i\|^2$.

     For $C_i, C_j$, let $SSE_{ij} := \sum_{\mathbf{x}\in C_i\cup C_j}\|\mathbf{x}-\boldsymbol{\mu}_{ij}\|^2$, where $\boldsymbol{\mu}_{ij} := \frac{1}{n_i+n_j}\sum_{\mathbf{x}\in C_i\cup C_j}\mathbf{x}$.

     Then $\delta(C_i,C_j) := SSE_{ij} - SSE_i - SSE_j$
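The five measures can be written down directly for Euclidean data (an illustrative sketch; the helper names are my own, not from the book):

```python
import math
from itertools import product

def single_link(Ci, Cj):
    """Minimum pairwise distance between the two clusters."""
    return min(math.dist(x, y) for x, y in product(Ci, Cj))

def complete_link(Ci, Cj):
    """Maximum pairwise distance between the two clusters."""
    return max(math.dist(x, y) for x, y in product(Ci, Cj))

def group_average(Ci, Cj):
    """Average pairwise distance between the two clusters."""
    return sum(math.dist(x, y) for x, y in product(Ci, Cj)) / (len(Ci) * len(Cj))

def mean(C):
    """Component-wise mean of a cluster of points."""
    d = len(C[0])
    return [sum(x[t] for x in C) / len(C) for t in range(d)]

def mean_distance(Ci, Cj):
    """Squared distance between the cluster means."""
    return math.dist(mean(Ci), mean(Cj)) ** 2

def sse(C):
    """Sum of squared errors of a cluster about its mean."""
    mu = mean(C)
    return sum(math.dist(x, mu) ** 2 for x in C)

def ward(Ci, Cj):
    """Minimum-variance criterion: SSE_ij - SSE_i - SSE_j."""
    return sse(Ci + Cj) - sse(Ci) - sse(Cj)

Ci = [(0.0, 0.0), (2.0, 0.0)]
Cj = [(0.0, 3.0), (2.0, 3.0)]
print(single_link(Ci, Cj), mean_distance(Ci, Cj), ward(Ci, Cj))  # 3.0 9.0 9.0
```

On this example $n_i = n_j = 2$, so $\frac{n_i n_j}{n_i+n_j} = 1$ and the minimum-variance value coincides with the mean distance, consistent with the identity proved next.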

Proof that $\delta(C_i,C_j) = \frac{n_in_j}{n_i+n_j}\|\boldsymbol{\mu}_i-\boldsymbol{\mu}_j\|^2$:

Notation: $C_{ij} := C_i \cup C_j$, $n_{ij} := n_i + n_j$.

Note that $C_i \cap C_j = \emptyset$, so $|C_{ij}| = n_i + n_j$.

$$\begin{aligned} \delta\left(C_{i}, C_{j}\right) &=\sum_{\mathbf{z} \in C_{i j}}\left\|\mathbf{z}-\boldsymbol{\mu}_{i j}\right\|^{2}-\sum_{\mathbf{x} \in C_{i}}\left\|\mathbf{x}-\boldsymbol{\mu}_{i}\right\|^{2}-\sum_{\mathbf{y} \in C_{j}}\left\|\mathbf{y}-\boldsymbol{\mu}_{j}\right\|^{2} \\ &=\sum_{\mathbf{z} \in C_{i j}} \mathbf{z}^{T} \mathbf{z}-n_{i j} \boldsymbol{\mu}_{i j}^{T} \boldsymbol{\mu}_{i j}-\sum_{\mathbf{x} \in C_{i}} \mathbf{x}^{T} \mathbf{x}+n_{i} \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{i}-\sum_{\mathbf{y} \in C_{j}} \mathbf{y}^{T} \mathbf{y}+n_{j} \boldsymbol{\mu}_{j}^{T} \boldsymbol{\mu}_{j} \\ &=n_{i} \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{i}+n_{j} \boldsymbol{\mu}_{j}^{T} \boldsymbol{\mu}_{j}-\left(n_{i}+n_{j}\right) \boldsymbol{\mu}_{i j}^{T} \boldsymbol{\mu}_{i j} \end{aligned}$$

Note that $\boldsymbol{\mu}_{ij} = \frac{1}{n_{ij}}\sum_{\mathbf{z}\in C_{ij}}\mathbf{z} = \frac{1}{n_i+n_j}\left(\sum_{\mathbf{x}\in C_i}\mathbf{x} + \sum_{\mathbf{y}\in C_j}\mathbf{y}\right) = \frac{1}{n_i+n_j}\left(n_i\boldsymbol{\mu}_i + n_j\boldsymbol{\mu}_j\right)$

Hence $\boldsymbol{\mu}_{ij}^{T}\boldsymbol{\mu}_{ij} = \frac{1}{(n_i+n_j)^2}\left(n_i^2\boldsymbol{\mu}_i^T\boldsymbol{\mu}_i + 2n_in_j\boldsymbol{\mu}_i^T\boldsymbol{\mu}_j + n_j^2\boldsymbol{\mu}_j^T\boldsymbol{\mu}_j\right)$

$$\begin{aligned} \delta\left(C_{i}, C_{j}\right) &=n_{i} \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{i}+n_{j} \boldsymbol{\mu}_{j}^{T} \boldsymbol{\mu}_{j}-\frac{1}{\left(n_{i}+n_{j}\right)}\left(n_{i}^{2} \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{i}+2 n_{i} n_{j} \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{j}+n_{j}^{2} \boldsymbol{\mu}_{j}^{T} \boldsymbol{\mu}_{j}\right) \\ &=\frac{n_{i}\left(n_{i}+n_{j}\right) \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{i}+n_{j}\left(n_{i}+n_{j}\right) \boldsymbol{\mu}_{j}^{T} \boldsymbol{\mu}_{j}-n_{i}^{2} \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{i}-2 n_{i} n_{j} \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{j}-n_{j}^{2} \boldsymbol{\mu}_{j}^{T} \boldsymbol{\mu}_{j}}{n_{i}+n_{j}} \\ &=\frac{n_{i} n_{j}\left(\boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{i}-2 \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{j}+\boldsymbol{\mu}_{j}^{T} \boldsymbol{\mu}_{j}\right)}{n_{i}+n_{j}} \\ &=\left(\frac{n_{i} n_{j}}{n_{i}+n_{j}}\right)\left\|\boldsymbol{\mu}_{i}-\boldsymbol{\mu}_{j}\right\|^{2} \end{aligned}$$
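The closed form just derived can be checked numerically on random clusters (a small sketch; the helper names are my own):

```python
import random

def mean(C):
    """Component-wise mean of a cluster of points."""
    d = len(C[0])
    return [sum(x[t] for x in C) / len(C) for t in range(d)]

def sse(C):
    """Sum of squared errors of a cluster about its mean."""
    mu = mean(C)
    return sum(sum((xt - mt) ** 2 for xt, mt in zip(x, mu)) for x in C)

random.seed(0)
Ci = [tuple(random.random() for _ in range(3)) for _ in range(4)]
Cj = [tuple(random.random() for _ in range(3)) for _ in range(7)]
ni, nj = len(Ci), len(Cj)

lhs = sse(Ci + Cj) - sse(Ci) - sse(Cj)       # SSE_ij - SSE_i - SSE_j
mi, mj = mean(Ci), mean(Cj)
rhs = ni * nj / (ni + nj) * sum((a - b) ** 2 for a, b in zip(mi, mj))
print(abs(lhs - rhs) < 1e-9)  # the two expressions agree up to rounding
```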
Question: how can step 7 of Algorithm 14.1, the distance-matrix update, be computed quickly?

Lance–Williams formula:

$$\delta\left(C_{i j}, C_{r}\right)=\alpha_{i} \cdot \delta\left(C_{i}, C_{r}\right)+\alpha_{j} \cdot \delta\left(C_{j}, C_{r}\right)+\beta \cdot \delta\left(C_{i}, C_{j}\right)+\gamma \cdot\left|\delta\left(C_{i}, C_{r}\right)-\delta\left(C_{j}, C_{r}\right)\right|$$

| Measure | $\alpha_i$ | $\alpha_j$ | $\beta$ | $\gamma$ |
| --- | --- | --- | --- | --- |
| Single link | $\frac{1}{2}$ | $\frac{1}{2}$ | $0$ | $-\frac{1}{2}$ |
| Complete link | $\frac{1}{2}$ | $\frac{1}{2}$ | $0$ | $\frac{1}{2}$ |
| Group average | $\frac{n_i}{n_i+n_j}$ | $\frac{n_j}{n_i+n_j}$ | $0$ | $0$ |
| Mean distance | $\frac{n_i}{n_i+n_j}$ | $\frac{n_j}{n_i+n_j}$ | $\frac{-n_in_j}{(n_i+n_j)^2}$ | $0$ |
| Minimum variance | $\frac{n_i+n_r}{n_i+n_j+n_r}$ | $\frac{n_j+n_r}{n_i+n_j+n_r}$ | $\frac{-n_r}{n_i+n_j+n_r}$ | $0$ |
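The single-link row of the table can be checked numerically: updating via Lance–Williams must agree with recomputing the minimum cross-cluster distance directly (an illustrative sketch; the function names are my own):

```python
import math
from itertools import product

def single_link(Ca, Cb):
    """Minimum pairwise Euclidean distance between two clusters."""
    return min(math.dist(x, y) for x, y in product(Ca, Cb))

def lw_single(d_ir, d_jr, d_ij):
    """Lance-Williams with alpha_i = alpha_j = 1/2, beta = 0, gamma = -1/2."""
    return 0.5 * d_ir + 0.5 * d_jr - 0.5 * abs(d_ir - d_jr)

Ci = [(0.0, 0.0), (1.0, 0.0)]
Cj = [(0.0, 2.0)]
Cr = [(4.0, 1.0), (5.0, 1.0)]

direct = single_link(Ci + Cj, Cr)            # recompute from scratch
updated = lw_single(single_link(Ci, Cr), single_link(Cj, Cr),
                    single_link(Ci, Cj))     # reuse the old distances
print(abs(direct - updated) < 1e-12)  # the update rule reproduces the minimum
```

This is exactly why step 7 is cheap: the new row of $\Delta$ depends only on already-stored distances, not on the raw points.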

Proof:

  1. Single link
     $$\begin{aligned} \delta\left(C_{i j}, C_{r}\right) &= \min \{\delta(\mathbf{x}, \mathbf{y}) \mid \mathbf{x}\in C_{ij}, \mathbf{y} \in C_r\} \\ &= \min \{\delta(C_{i}, C_{r}), \delta(C_{j}, C_{r})\} \end{aligned}$$
     using the identities $\min\{a,b\}=\frac{a+b-|a-b|}{2}$ and $\max\{a,b\}=\frac{a+b+|a-b|}{2}$.

  2. Complete link: analogous, using the identity for $\max$ above.

  3. Group average
     $$\begin{aligned} \delta\left(C_{i j}, C_{r}\right) &= \frac{\sum\limits_{\mathbf{x} \in C_i\cup C_j}\sum\limits_{\mathbf{y} \in C_r}\delta(\mathbf{x},\mathbf{y})}{(n_i+n_j )\cdot n_r} \\ &= \frac{\sum\limits_{\mathbf{x} \in C_i}\sum\limits_{\mathbf{y} \in C_r}\delta(\mathbf{x},\mathbf{y})+\sum\limits_{\mathbf{x} \in C_j}\sum\limits_{\mathbf{y} \in C_r}\delta(\mathbf{x},\mathbf{y})}{(n_i+n_j )\cdot n_r} \\ &=\frac{n_in_r\delta(C_i,C_r)+n_jn_r\delta(C_j,C_r)}{(n_i+n_j )\cdot n_r}\\ &=\frac{n_i\delta(C_i,C_r)+n_j\delta(C_j,C_r)}{n_i+n_j} \end{aligned}$$

  4. Mean distance: left as an exercise.

  5. Minimum variance

     Start from the mean-distance result and substitute via $\delta(C_i,C_j)=\frac{n_in_j}{n_i+n_j}\|\boldsymbol{\mu}_i-\boldsymbol{\mu}_j\|^2$.

Fact: the complexity of Algorithm 14.1 is $O(n^2\log n)$.
