# The EM Algorithm and Gaussian Mixture Clustering

## Introduction to the EM Algorithm

Consider a model with a latent variable $z$ (for concreteness, the three-coin model: a hidden coin comes up heads with probability $\pi$ and selects one of two coins with heads probabilities $p$ and $q$; only the final outcome $y\in\{0,1\}$ is observed):

$$P(y|\theta)=\sum_{z}P(y,z|\theta)=\sum_{z}P(z|\theta)P(y|z,\theta)=\pi p^{y}(1-p)^{1-y}+(1-\pi)q^{y}(1-q)^{1-y}$$

Here $P(y|\theta)=\sum_{z}P(y,z|\theta)$ is simply the law of total probability.

$$P(Y|\theta)=\sum_{Z}P(Z|\theta)P(Y|Z,\theta)$$

$$P(Y|\theta)=\prod_{j=1}^{n}\left[\pi p^{y_{j}}(1-p)^{1-y_{j}}+(1-\pi)q^{y_{j}}(1-q)^{1-y_{j}}\right]$$
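As a quick sanity check, the observed-data likelihood above can be evaluated numerically. This is a minimal sketch; the data vector and parameter values are made up for illustration:

```python
from math import prod

def three_coin_likelihood(y, pi, p, q):
    """P(Y|theta) for the two-component Bernoulli mixture:
    product over observations of pi*p^y*(1-p)^(1-y) + (1-pi)*q^y*(1-q)^(1-y)."""
    return prod(
        pi * p**yj * (1 - p)**(1 - yj) + (1 - pi) * q**yj * (1 - q)**(1 - yj)
        for yj in y
    )

y = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]  # hypothetical record of coin tosses
# With pi = p = q = 0.5 every factor equals 0.5, so the result is 0.5**10.
print(three_coin_likelihood(y, 0.5, 0.5, 0.5))
```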

## Derivation of the EM Algorithm

$$L(\theta)=\sum_{i=1}^{m}\log{P(Y^{(i)}|\theta)}=\sum_{i=1}^{m}\log\Big[\sum_{z}P(Y^{(i)}|z,\theta)P(z|\theta)\Big]$$

Writing $Y$ for the whole observed data set, the log-likelihood to maximize is

$$L(\theta)=\log{P(Y|\theta)}=\log\sum_{Z}P(Y|Z,\theta)P(Z|\theta)$$
Recall Jensen's inequality: for a concave function such as $\log$, $\log E[X]\geq E[\log X]$.

The EM algorithm maximizes $L(\theta)$ by iteration. Suppose the estimate after the $i$-th iteration is $\theta^{(i)}$; we want the new estimate $\theta$ to increase $L(\theta)$ so that it climbs step by step toward a maximum. Consider the difference:
$$L(\theta)-L(\theta^{(i)})=\log\sum_{Z}P(Y|Z,\theta)P(Z|\theta)-\log{P(Y|\theta^{(i)})} \\ =\log\sum_{Z}P(Z|Y,\theta^{(i)})\frac{P(Y|Z,\theta)P(Z|\theta)}{P(Z|Y,\theta^{(i)})}-\log{P(Y|\theta^{(i)})}$$

By Jensen's inequality, with the expectation taken over $Z\sim P(Z|Y,\theta^{(i)})$:

$$\log E_{Z|Y,\theta^{(i)}}\left(\frac{P(Y|Z,\theta)P(Z|\theta)}{P(Z|Y,\theta^{(i)})}\right)\geq E_{Z|Y,\theta^{(i)}}\log\left(\frac{P(Y|Z,\theta)P(Z|\theta)}{P(Z|Y,\theta^{(i)})}\right)$$

Hence (the second line pulls the constant $\log P(Y|\theta^{(i)})$ inside, using $\sum_{Z}P(Z|Y,\theta^{(i)})=1$):

$$L(\theta)-L(\theta^{(i)})=\log\Big[\sum_{Z}P(Z|Y,\theta^{(i)})\frac{P(Y|Z,\theta)P(Z|\theta)}{P(Z|Y,\theta^{(i)})}\Big]-\log{P(Y|\theta^{(i)})} \\ =\log \sum_{Z}P(Z|Y,\theta^{(i)})\frac{P(Y|Z,\theta)P(Z|\theta)}{P(Z|Y,\theta^{(i)})P(Y|\theta^{(i)})} \\ \geq \sum_{Z}P(Z|Y,\theta^{(i)})\log\frac{P(Y|Z,\theta)P(Z|\theta)}{P(Z|Y,\theta^{(i)})P(Y|\theta^{(i)})}$$

Define the lower bound

$$B(\theta,\theta^{(i)})=L(\theta^{(i)})+\sum_{Z}P(Z|Y,\theta^{(i)})\log\frac{P(Y|Z,\theta)P(Z|\theta)}{P(Z|Y,\theta^{(i)})P(Y|\theta^{(i)})}$$

so that $L(\theta)\geq B(\theta,\theta^{(i)})$.
When $\theta=\theta^{(i)}$, the $\log$ term vanishes, so $L(\theta^{(i)})=B(\theta^{(i)},\theta^{(i)})$.

A larger $B(\theta,\theta^{(i)})$ gives a larger lower bound on $L(\theta)$. Since $L(\theta)$ itself cannot be maximized directly within this iteration, we maximize the lower bound instead:
$$\theta^{(i+1)}=\arg\max_{\theta}B(\theta,\theta^{(i)})$$

Dropping the terms that do not depend on $\theta$:

$$\theta^{(i+1)}=\arg\max_{\theta}\Big(L(\theta^{(i)})+\sum_{Z}P(Z|Y,\theta^{(i)})\log\frac{P(Y|Z,\theta)P(Z|\theta)}{P(Z|Y,\theta^{(i)})P(Y|\theta^{(i)})}\Big) \\ =\arg\max_{\theta}\Big(\sum_{Z}P(Z|Y,\theta^{(i)})\log P(Y|Z,\theta)P(Z|\theta)\Big) \\ =\arg\max_{\theta}\Big(\sum_{Z}P(Z|Y,\theta^{(i)})\log P(Y,Z|\theta)\Big) \\ =\arg\max_{\theta}E_{Z|Y,\theta^{(i)}}(\log P(Y,Z|\theta))$$

Define the $Q$ function:

$$Q(\theta,\theta^{(i)})=E_{Z|Y,\theta^{(i)}}(\log P(Y,Z|\theta))$$

$$\theta^{(i+1)}=\arg\max_{\theta}Q(\theta,\theta^{(i)})$$

## Overall EM Procedure

1. Initialize the parameters $\theta^{(0)}$.
2. E-step: take the expectation of the complete-data log-likelihood with respect to $z$: $Q(\theta,\theta^{(i)})=E_{Z|Y,\theta^{(i)}}(\log P(Y,Z|\theta))$.
3. M-step: update $\theta^{(i+1)}=\arg\max_{\theta}Q(\theta,\theta^{(i)})$.
4. Repeat steps 2 and 3 until convergence.
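For the three-coin model from the introduction, both steps have closed forms: the E-step reduces to computing the responsibility $\mu_j$ of the first coin for each observation, and the M-step re-estimates $\pi$, $p$, $q$ from those responsibilities. A minimal sketch (data and starting values are illustrative):

```python
def em_three_coin(y, pi, p, q, n_iter=100):
    """EM for the mixture P(y) = pi*Bern(p) + (1-pi)*Bern(q)."""
    for _ in range(n_iter):
        # E-step: responsibility of the first coin for each observation y_j.
        mu = [
            pi * p**yj * (1 - p)**(1 - yj)
            / (pi * p**yj * (1 - p)**(1 - yj) + (1 - pi) * q**yj * (1 - q)**(1 - yj))
            for yj in y
        ]
        # M-step: closed-form maximization of Q(theta, theta^(i)).
        pi = sum(mu) / len(y)
        p = sum(m * yj for m, yj in zip(mu, y)) / sum(mu)
        q = sum((1 - m) * yj for m, yj in zip(mu, y)) / sum(1 - m for m in mu)
    return pi, p, q

y = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]      # hypothetical observations (6 heads)
# From a symmetric start the responsibilities stay at 0.5, so EM converges
# immediately to pi = 0.5, p = q = 0.6 (the overall heads frequency).
print(em_three_coin(y, 0.5, 0.5, 0.5))
```

Note that the symmetric start is a fixed point: EM only guarantees convergence to a stationary point, so different initializations can give different answers.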

## Gaussian Mixture Clustering

The EM algorithm is a general solution strategy that applies to many models; Gaussian mixture clustering is a typical application. As the figure illustrates, a single Gaussian distribution is usually too restrictive for clustering, so a mixture of several Gaussians is used.

The density of an $n$-dimensional Gaussian (with $\bm{x}$ treated as a row vector):

$$p(\bm{x})=\frac{1}{(2\pi)^{\frac{n}{2}}|\Sigma|^{\frac{1}{2}}}\exp\Big(-\frac{1}{2}(\bm{x}-\bm{\mu})\Sigma^{-1}(\bm{x}-\bm{\mu})^{T}\Big)$$

A Gaussian mixture with $K$ components and mixing coefficients $\alpha_{k}$ (with $\sum_{k=1}^{K}\alpha_{k}=1$):

$$p(\bm{x})=\sum_{k=1}^{K}\alpha_{k}p(\bm{x}|\bm{\mu}_{k},\bm{\Sigma}_{k})$$

By Bayes' rule, the posterior probability that $\bm{x}$ was generated by component $k$, with prior $p(k)=\alpha_{k}$:

$$P(k|\bm{x})=\frac{p(\bm{x}|\bm{\mu}_{k},\bm{\Sigma}_{k})p(k)}{p(\bm{x})}=\frac{\alpha_{k}\,p(\bm{x}|\bm{\mu}_{k},\bm{\Sigma}_{k})}{\sum_{l=1}^{K}\alpha_{l}p(\bm{x}|\bm{\mu}_{l},\bm{\Sigma}_{l})}$$

For a data set $D=\{\bm{x}_{1},\dots,\bm{x}_{m}\}$ the likelihood is

$$p(D)=\prod_{i=1}^{m}p(\bm{x}_{i})=\prod_{i=1}^{m}\sum_{k=1}^{K}\alpha_{k}p(\bm{x}_{i}|\bm{\mu}_{k},\bm{\Sigma}_{k})$$

and the log-likelihood is

$$L(D)=\sum_{i=1}^{m}\log\sum_{k=1}^{K}\alpha_{k}p(\bm{x}_{i}|\bm{\mu}_{k},\bm{\Sigma}_{k})$$

Introduce the latent indicator

$$\gamma_{jk}=\begin{cases}1, & \bm{x}_{j}\text{ comes from component }k\\ 0, & \text{otherwise}\end{cases}$$

The parameters are collected as $\theta_{k}=\{\bm{\mu}_{k},\bm{\Sigma}_{k},\alpha_{k}\}$ and $\bm{\theta}=\{\theta_{1},\theta_{2},\dots,\theta_{K}\}$.

E-step: estimate the responsibilities $\hat{\gamma}_{jk}$, $k=1,2,\dots,K$, i.e. the posterior $P(k|\bm{x}_{j})$ above.

M-step: maximize $Q$ under the constraint that the mixing coefficients sum to one:

$$\theta^{(i+1)}=\arg\max_{\theta}Q(\theta,\theta^{(i)})\quad \text{s.t.}\quad \sum_{k=1}^{K}\alpha_{k}=1$$

The constraint is handled with a Lagrange multiplier:

$$L(\theta,\lambda)=Q(\theta,\theta^{(i)})+\lambda\Big(\sum_{k=1}^{K}\alpha_{k}-1\Big)$$
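Setting the derivatives of this Lagrangian to zero yields the standard closed-form GMM updates, stated here for completeness (not derived in these notes; $\hat{\gamma}_{jk}$ is the E-step responsibility and $\bm{x}_{j}$, $\bm{\mu}_{k}$ are row vectors as above):

$$\bm{\mu}_{k}=\frac{\sum_{j=1}^{m}\hat{\gamma}_{jk}\bm{x}_{j}}{\sum_{j=1}^{m}\hat{\gamma}_{jk}},\qquad \bm{\Sigma}_{k}=\frac{\sum_{j=1}^{m}\hat{\gamma}_{jk}(\bm{x}_{j}-\bm{\mu}_{k})^{T}(\bm{x}_{j}-\bm{\mu}_{k})}{\sum_{j=1}^{m}\hat{\gamma}_{jk}},\qquad \alpha_{k}=\frac{1}{m}\sum_{j=1}^{m}\hat{\gamma}_{jk}$$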

After convergence, each sample is assigned to the component with the largest responsibility:

$$\lambda_{j}=\arg\max_{k\in\{1,2,\dots,K\}}\hat{\gamma}_{jk}$$
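Putting the E-step, M-step, and final hard assignment together, here is a one-dimensional, two-component sketch in plain Python (the data, initial values, and the variance floor are illustrative choices, not part of the derivation above):

```python
from math import exp, pi as PI, sqrt

def gauss(x, mu, var):
    """1-D Gaussian density."""
    return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * PI * var)

def gmm_em(xs, mus, variances, alphas, n_iter=50, var_floor=1e-3):
    """EM for a 1-D Gaussian mixture; returns parameters and hard labels."""
    K = len(mus)
    for _ in range(n_iter):
        # E-step: responsibilities gamma[j][k] = P(k | x_j).
        gamma = []
        for x in xs:
            w = [alphas[k] * gauss(x, mus[k], variances[k]) for k in range(K)]
            s = sum(w)
            gamma.append([wk / s for wk in w])
        # M-step: responsibility-weighted means, variances, mixing coefficients.
        for k in range(K):
            nk = sum(g[k] for g in gamma)
            mus[k] = sum(g[k] * x for g, x in zip(gamma, xs)) / nk
            variances[k] = max(
                sum(g[k] * (x - mus[k]) ** 2 for g, x in zip(gamma, xs)) / nk,
                var_floor,  # keep a component's variance from collapsing to zero
            )
            alphas[k] = nk / len(xs)
    # Hard assignment: each point goes to its most responsible component.
    labels = [max(range(K), key=lambda k: g[k]) for g in gamma]
    return mus, variances, alphas, labels

xs = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]                 # two hypothetical clusters
mus, variances, alphas, labels = gmm_em(xs, [0.0, 4.0], [1.0, 1.0], [0.5, 0.5])
print(mus, labels)
```

With well-separated clusters and a reasonable initialization the means converge to roughly 1.0 and 5.0 and the labels split the data accordingly; a poor initialization can still land in a worse stationary point, which is the usual caveat with EM.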
