The Gaussian mixture model (GMM) is a clustering algorithm widely used in practice. It uses Gaussian distributions as its parametric model and is trained with the EM algorithm.
Gaussian Distribution
The Gaussian distribution is a commonly used model for continuous random variables. If a single random variable $x$ follows a Gaussian distribution with mean $\mu$ and variance $\sigma^2$, written $x \sim \mathcal{N}(\mu, \sigma^2)$, its probability density function is:
$$p\left(x \mid \mu, \sigma^{2}\right)=\frac{1}{\left(2 \pi \sigma^{2}\right)^{1 / 2}} \exp \left\{-\frac{1}{2 \sigma^{2}}(x-\mu)^{2}\right\}$$
For a $D$-dimensional vector $\mathbf{x}$, if it follows a multivariate Gaussian distribution with mean vector $\mu$ and covariance matrix $\Sigma$, written $\mathbf{x} \sim \mathcal{N}(\mu, \Sigma)$, its probability density function is:
$$p(\mathbf{x} \mid \mu, \Sigma)=\frac{1}{(2 \pi)^{D / 2}} \frac{1}{|\Sigma|^{1 / 2}} \exp \left\{-\frac{1}{2}(\mathbf{x}-\mu)^{T} \Sigma^{-1}(\mathbf{x}-\mu)\right\}$$
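As a quick illustration, the multivariate density above can be implemented directly with NumPy. This is a minimal sketch; the two-dimensional test point and identity covariance are hypothetical values chosen only for the check.

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    # Directly implements the multivariate Gaussian density formula above.
    D = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.inv(Sigma) @ diff      # (x-mu)^T Sigma^{-1} (x-mu)
    norm = (2 * np.pi) ** (D / 2) * np.linalg.det(Sigma) ** 0.5
    return np.exp(-0.5 * quad) / norm

# With Sigma = I and x = mu, the density reduces to (2*pi)^(-D/2).
mu = np.zeros(2)
val = mvn_pdf(mu, mu, np.eye(2))
print(abs(val - 1 / (2 * np.pi)) < 1e-12)
```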
Derivatives of the Gaussian Density
Preliminaries (matrix calculus identities):
$$\frac{\partial}{\partial x}(x^Ta)=\frac{\partial}{\partial x}(a^Tx)=a$$
$$\frac{\partial}{\partial x}(AB)=\frac{\partial A}{\partial x}B+A\frac{\partial B}{\partial x}$$
$$\frac{\partial}{\partial x}(A^{-1})=-A^{-1}\frac{\partial A}{\partial x}A^{-1}$$
$$\frac{\partial}{\partial A}\ln |A|=(A^{-1})^T$$
The log of the Gaussian density is:
$$\ln p(\mathbf{x} \mid \mu, \Sigma)=-\frac{D}{2}\ln(2\pi)-\frac{1}{2}\ln |\Sigma|-\frac{1}{2}(\mathbf{x}-\mu)^{T} \Sigma^{-1}(\mathbf{x}-\mu)$$
Differentiating with respect to the mean and the covariance matrix:
$$\frac{\partial}{\partial \mu}\ln p(\mathbf{x} \mid \mu, \Sigma)=\Sigma^{-1}(\mathbf{x}-\mu)$$
$$\frac{\partial}{\partial \Sigma}\ln p(\mathbf{x} \mid \mu, \Sigma)=-\frac{1}{2}\Sigma^{-1}+\frac{1}{2}\Sigma^{-1}(\mathbf{x}-\mu)(\mathbf{x}-\mu)^T\Sigma^{-1}$$
Hence, since $\frac{\partial p}{\partial \theta}=p\,\frac{\partial \ln p}{\partial \theta}$ for any parameter $\theta$,
$$\frac{\partial}{\partial \mu}p(\mathbf{x} \mid \mu, \Sigma)=p(\mathbf{x} \mid \mu, \Sigma)\,\Sigma^{-1}(\mathbf{x}-\mu)$$
$$\frac{\partial}{\partial \Sigma} p(\mathbf{x} \mid \mu, \Sigma)=p(\mathbf{x} \mid \mu, \Sigma)\left[ -\frac{1}{2}\Sigma^{-1}+\frac{1}{2}\Sigma^{-1}(\mathbf{x}-\mu)(\mathbf{x}-\mu)^T\Sigma^{-1}\right]$$
These derivatives follow from the identities above. For the log-determinant term (using the symmetry of $\Sigma$):
$$\frac{\partial}{\partial \Sigma}\ln |\Sigma|=(\Sigma^{-1})^T=\Sigma^{-1}$$
For the quadratic term, differentiate with respect to each element $\Sigma_{ij}$:
$$\begin{aligned}\frac{\partial}{\partial \Sigma_{ij}}(\mathbf{x}-\mu)^{T} \Sigma^{-1}(\mathbf{x}-\mu)&=(\mathbf{x}-\mu)^{T} \frac{\partial \Sigma^{-1}}{\partial \Sigma_{ij}}(\mathbf{x}-\mu)\\ &=-(\mathbf{x}-\mu)^{T} \Sigma^{-1}\frac{\partial \Sigma}{\partial \Sigma_{ij}}\Sigma^{-1}(\mathbf{x}-\mu)\\ &=-B^T\frac{\partial \Sigma}{\partial \Sigma_{ij}}B=-B_iB_j=-(BB^T)_{ij} \end{aligned}$$
where $B=\Sigma^{-1}(\mathbf{x}-\mu)$ is a $D\times 1$ column vector, $B_i$ is its $i$-th element, and $\frac{\partial \Sigma}{\partial \Sigma_{ij}}$ is the matrix whose $(i,j)$ entry is 1 and whose other entries are all 0. Collecting the elements,
$$\frac{\partial}{\partial \Sigma}(\mathbf{x}-\mu)^{T} \Sigma^{-1}(\mathbf{x}-\mu)=-BB^T=-\Sigma^{-1}(\mathbf{x}-\mu)(\mathbf{x}-\mu)^T\Sigma^{-1}$$
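These matrix derivatives are easy to sanity-check numerically. A minimal sketch, verifying the log-determinant identity with a central finite difference on an arbitrary $2\times 2$ matrix:

```python
import numpy as np

# Finite-difference check of d/dSigma ln|Sigma| = (Sigma^{-1})^T,
# perturbing one (unconstrained) entry at a time.
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])   # arbitrary example matrix
eps = 1e-6
grad = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        E = np.zeros((2, 2))
        E[i, j] = eps
        grad[i, j] = (np.log(np.linalg.det(Sigma + E))
                      - np.log(np.linalg.det(Sigma - E))) / (2 * eps)
print(np.allclose(grad, np.linalg.inv(Sigma).T, atol=1e-6))
```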
Gaussian Mixture Model
A Gaussian mixture model (GMM) is a linear combination of several Gaussian density functions. A linear superposition of Gaussians can fit very complex densities: with enough components, and by tuning their means, covariance matrices, and mixing coefficients, one can approximate any continuous density to arbitrary accuracy.
Consider a linear superposition of $K$ Gaussians. The probability density function of this Gaussian mixture distribution is:
$$p(\mathbf{x})=\sum_{k=1}^{K} \pi_{k}\, p\left(\mathbf{x} \mid \mu_{k}, \Sigma_{k}\right)$$
where $p\left(\mathbf{x} \mid \mu_{k}, \Sigma_{k}\right)$ denotes the Gaussian density with parameters $\mu_k$ and $\Sigma_k$. Each Gaussian density is called a component of the mixture, with its own mean $\mu_k$ and covariance matrix $\Sigma_k$. The parameters $\pi_k$ are the mixing coefficients, which satisfy:
$$\sum_{k=1}^{K}\pi_{k}=1,\qquad 0\leq\pi_{k}\leq 1$$
The Gaussian mixture model can equivalently be written as:
$$p(x)=\sum_{k=1}^{K}p(k)\,p(x \mid k)$$
where $p(k)=\pi_{k}$ is the prior probability of selecting the $k$-th component; $p(x \mid k)=p\left(\mathbf{x} \mid \mu_{k}, \Sigma_{k}\right)$ is the conditional density of $x$ given $k$; and $p(k \mid x)$ is the posterior probability that an observed $x$ came from the $k$-th component, called the responsibility of component $k$.
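As a sketch of the decomposition $p(x)=\sum_k p(k)\,p(x \mid k)$, here is a hypothetical one-dimensional mixture with two components; the weights and component parameters are made up for illustration.

```python
import math

# Hypothetical 1-D mixture with K = 2 components.
pi = [0.3, 0.7]               # p(k): prior probability of each component
mu = [-2.0, 1.0]
s2 = [1.0, 0.5]

def gauss(x, m, v):
    # p(x | k) = p(x | mu_k, sigma_k^2)
    return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

def mixture_pdf(x):
    # p(x) = sum_k p(k) p(x | k)
    return sum(p * gauss(x, m, v) for p, m, v in zip(pi, mu, s2))

# The mixture density integrates to 1 (checked here on a fine grid).
dx = 0.001
approx = sum(mixture_pdf(-10 + i * dx) * dx for i in range(20000))
print(round(approx, 3))
```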
Suppose a sample $x$ is generated by a Gaussian mixture model, i.e., it is drawn from one of $K$ Gaussian distributions, but which one is unknown. Introduce a $K$-dimensional binary random variable $z=[z_{1},...,z_{K}]$ indicating which component generated $x$. It satisfies $z_{k}\in\{0,1\}$ and $\sum_{k=1}^{K}z_{k}=1$: $z_{k}=1$ means $x$ was sampled from component $k$, and the sum-to-one constraint means $x$ can be generated by only one Gaussian. Thus $z$ has $K$ possible states, and its marginal distribution is given by the mixing coefficients, $p(z_{k}=1)=\pi_k$, so the distribution of $z$ can be written as:
$$p(z)=\prod_{k=1}^{K}\pi_{k}^{z_{k}},\qquad \sum_{k=1}^{K}z_{k}=1$$
A sample $x$ from a Gaussian mixture model can thus be seen as generated in two steps:
- first draw $z$ from the discrete distribution $p(z)$;
- if the drawn $z$ selects the $k$-th component, draw $x$ from the Gaussian $p(x \mid \mu_k, \Sigma_k)$.
This two-step process corresponds to a simple probabilistic graphical model in which $z$ is the parent of $x$.
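The two-step generative process can be sketched as follows; the mixture parameters are hypothetical, and the empirical component frequency is checked against the mixing coefficient.

```python
import random

random.seed(0)
pi = [0.3, 0.7]                # hypothetical mixing coefficients p(z_k = 1)
mu = [-2.0, 1.0]
sigma = [1.0, 0.5]

def sample_one():
    # Step 1: choose a component k with probability pi_k.
    k = random.choices(range(2), weights=pi)[0]
    # Step 2: draw x from the chosen Gaussian component.
    return k, random.gauss(mu[k], sigma[k])

samples = [sample_one() for _ in range(10000)]
frac_k0 = sum(1 for k, _ in samples if k == 0) / len(samples)
print(round(frac_k0, 3))       # empirical frequency of component 0, near pi_0
```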
Given $z$, the conditional density of $x$ is:
$$p(x \mid z_{k}=1)=p(x \mid \mu_k, \Sigma_k)$$
or, equivalently:
$$p(x \mid z)=\prod_{k=1}^{K}p(x \mid \mu_k, \Sigma_k)^{z_{k}}$$
The marginal distribution of $x$ is obtained by summing the joint distribution $p(x,z)$ over all possible states of $z$:
$$\begin{aligned} p(x) &=\sum_{z} p(x, z)=\sum_{z} p(z)\, p(x \mid z) \\ &=\sum_{z}\left(\prod_{k=1}^{K} \pi_{k}^{z_{k}} \prod_{k=1}^{K} p\left(x \mid \mu_{k}, \Sigma_{k}\right)^{z_{k}}\right) \\ &=\sum_{z}\left(\prod_{k=1}^{K}\left[\pi_{k}\, p\left(x \mid \mu_{k}, \Sigma_{k}\right)\right]^{z_{k}}\right) \\ &=\sum_{k=1}^{K} \pi_{k}\, p\left(x \mid \mu_{k}, \Sigma_{k}\right) \end{aligned}$$
After observing $x$, the posterior probability that it came from the $k$-th component is:
$$\begin{aligned} p\left(z_{k}=1 \mid x\right) &=\frac{p\left(z_{k}=1\right) p\left(x \mid z_{k}=1\right)}{p(x)} \\ &=\frac{\pi_{k}\, p\left(x \mid \mu_{k}, \Sigma_{k}\right)}{\sum_{j=1}^{K} \pi_{j}\, p\left(x \mid \mu_{j}, \Sigma_{j}\right)} \end{aligned}$$
We define $\gamma(z_{k})=p(z_{k}=1 \mid x)$, called the responsibility of the $k$-th component for $x$.
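A minimal sketch of computing responsibilities, using a hypothetical symmetric two-component mixture (so a point midway between the means is equally attributable to both components):

```python
import math

# Hypothetical 1-D mixture: pi = [0.5, 0.5], components N(-1, 1) and N(1, 1).
pi = [0.5, 0.5]
mu = [-1.0, 1.0]
s2 = [1.0, 1.0]

def gauss(x, m, v):
    return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

def responsibilities(x):
    # gamma(z_k) = pi_k p(x|mu_k, s2_k) / sum_j pi_j p(x|mu_j, s2_j)
    num = [p * gauss(x, m, v) for p, m, v in zip(pi, mu, s2)]
    total = sum(num)
    return [n / total for n in num]

# By symmetry, a point at x = 0 is equally likely from either component.
print(responsibilities(0.0))
```

Note that the responsibilities always sum to 1 over $k$, since they are a posterior distribution.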
Parameter Estimation for the Gaussian Mixture Model
Suppose we have a dataset $\mathbf{X}=\{x_1,...,x_N\}$ generated by a Gaussian mixture model, and we want to estimate its parameters: $\pi=\{\pi_1,...,\pi_K\}$, $\mu=\{\mu_1,...,\mu_K\}$, $\Sigma=\{\Sigma_1,...,\Sigma_K\}$. Let $z_n=[z_{n1},...,z_{nK}]$ denote the latent variable associated with $x_n$; then:
$$p(z_n)=\prod_{k=1}^{K}\pi_{k}^{z_{nk}}$$
$$p(x_n \mid z_n)=\prod_{k=1}^{K}p(x_n \mid \mu_k, \Sigma_k)^{z_{nk}}$$
$$p(x_n)=\sum_{k=1}^{K} \pi_{k}\, p\left(x_n \mid \mu_{k}, \Sigma_{k}\right)$$
$$\gamma(z_{nk})=p\left(z_{nk}=1 \mid x_n\right)=\frac{\pi_{k}\, p\left(x_n \mid \mu_{k}, \Sigma_{k}\right)}{\sum_{j=1}^{K} \pi_{j}\, p\left(x_n \mid \mu_{j}, \Sigma_{j}\right)}$$
Maximum Likelihood Estimation
The likelihood function of the sample set $X$ is:
$$p(X \mid \pi,\mu,\Sigma)=\prod_{n=1}^{N}\left[\sum_{k=1}^{K}\pi_k\, p(x_n \mid \mu_k,\Sigma_k)\right]$$
The product inside the likelihood is awkward to differentiate, so we take the natural logarithm, turning the product into a sum and obtaining the log-likelihood:
$$\ln p(X \mid \pi,\mu,\Sigma)=\sum_{n=1}^{N}\ln\left[\sum_{k=1}^{K}\pi_k\, p(x_n \mid \mu_k,\Sigma_k)\right]$$
where
$$p(x_n \mid \mu_k,\Sigma_k)=\frac{1}{(2 \pi)^{D / 2}} \frac{1}{|\Sigma_k|^{1 / 2}} \exp \left\{-\frac{1}{2}(x_n-\mu_k)^{T} \Sigma_k^{-1}(x_n-\mu_k)\right\}$$
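The log-likelihood above can be evaluated directly. As a sanity check in the sketch below, a mixture of $K$ identical components gives the same log-likelihood as a single Gaussian; the data points are hypothetical.

```python
import math

def log_likelihood(X, pi, mu, s2):
    """Incomplete-data log-likelihood of a 1-D GMM:
    sum_n ln[ sum_k pi_k p(x_n | mu_k, s2_k) ]."""
    ll = 0.0
    for x in X:
        ll += math.log(sum(
            p * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
            for p, m, v in zip(pi, mu, s2)))
    return ll

# A degenerate mixture (K identical components) matches a single Gaussian.
X = [0.5, -1.2, 2.0]
two = log_likelihood(X, [0.5, 0.5], [0.0, 0.0], [1.0, 1.0])
one = log_likelihood(X, [1.0], [0.0], [1.0])
print(abs(two - one) < 1e-12)
```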
Estimating the GMM parameters by maximum likelihood means solving the optimization problem:
$$\begin{cases} \max_{\pi,\mu,\Sigma}\ \ln p(X \mid \pi,\mu,\Sigma)\\ \text{s.t.} \quad\sum_{k=1}^{K}\pi_k=1\end{cases}$$
We use Lagrange multipliers to find the extrema; the Lagrangian is:
$$L(\pi, \mu, \Sigma)=\ln p(X \mid \pi, \mu, \Sigma)+\lambda\left(\sum_{k=1}^{K} \pi_{k}-1\right)$$
Taking the gradient of $L(\pi,\mu,\Sigma)$ with respect to $\mu_k$:
$$\begin{aligned} \frac{\partial L}{\partial \mu_{k}} &=\frac{\partial}{\partial \mu_{k}} \ln p(X \mid \pi, \mu, \Sigma) \\ &=\frac{\partial}{\partial \mu_{k}} \left(\sum_{n=1}^{N} \ln \left[\sum_{k=1}^{K} \pi_{k}\, p\left(x_{n} \mid \mu_{k}, \Sigma_{k}\right)\right]\right)\\ &=\sum_{n=1}^{N} \frac{\partial}{\partial \mu_{k}} \ln \left[\sum_{k=1}^{K} \pi_{k}\, p\left(x_{n} \mid \mu_{k}, \Sigma_{k}\right)\right] \\ &=\sum_{n=1}^{N}\left[\frac{\pi_{k}\, p\left(x_{n} \mid \mu_{k}, \Sigma_{k}\right)}{\sum_{j=1}^{K} \pi_{j}\, p\left(x_{n} \mid \mu_{j}, \Sigma_{j}\right)}\left[\Sigma_{k}^{-1}\left(x_{n}-\mu_{k}\right)\right]\right]\\&=\sum_{n=1}^{N}\left[\gamma(z_{nk})\,\Sigma_k^{-1}(x_n-\mu_k)\right] \end{aligned}$$
Setting $\frac{\partial L}{\partial \mu_{k}}=0$, i.e., $\sum_{n=1}^{N}[\gamma(z_{nk})\Sigma_k^{-1}(x_n-\mu_k)]=0$, and left-multiplying by $\Sigma_k$, we get:
$$\sum_{n=1}^{N}\left[\gamma\left(z_{n k}\right) x_{n}\right]=\sum_{n=1}^{N}\left[\gamma\left(z_{n k}\right) \mu_{k}\right]=\mu_{k} \sum_{n=1}^{N} \gamma\left(z_{n k}\right)$$
Define $N_k=\sum_{n=1}^{N}\gamma(z_{nk})$, which can be interpreted as the effective number of samples assigned to the $k$-th component (cluster). Then:
$$\mu_k=\frac{1}{N_k} \sum_{n=1}^{N}\left[\gamma(z_{nk})\,x_n\right]$$
Taking partial derivatives with respect to the elements of $\Sigma_k$:
$$\begin{aligned} \frac{\partial L}{\partial \Sigma_{k}} &=\frac{\partial }{\partial \Sigma_{k}}\ln p(X \mid \pi, \mu, \Sigma) \\ &=\frac{\partial}{\partial \Sigma_{k}}\left(\sum_{n=1}^{N} \ln \left[\sum_{k=1}^{K} \pi_{k}\, p\left(x_{n} \mid \mu_{k}, \Sigma_{k}\right)\right]\right) \\ &=\sum_{n=1}^{N} \frac{\partial}{\partial \Sigma_{k}} \ln \left[\sum_{k=1}^{K} \pi_{k}\, p\left(x_{n} \mid \mu_{k}, \Sigma_{k}\right)\right]\\ &=\sum_{n=1}^{N}\left[\frac{\pi_{k}}{\sum_{j=1}^{K} \pi_{j}\, p\left(x_{n} \mid \mu_{j}, \Sigma_{j}\right)} \frac{\partial }{\partial \Sigma_{k}}p\left(x_{n} \mid \mu_{k}, \Sigma_{k}\right)\right] \end{aligned}$$
Setting $\frac{\partial L}{\partial \Sigma_{k}}=0$ and substituting the derivative of the Gaussian density with respect to $\Sigma$ derived earlier gives:
$$\Sigma_{k}=\frac{1}{N_{k}} \sum_{n=1}^{N} \gamma\left(z_{n k}\right)\left(x_{n}-\mu_{k}\right)\left(x_{n}-\mu_{k}\right)^{T}$$
Taking the partial derivative with respect to $\pi_k$ and setting it to zero,
$$\begin{aligned} \frac{\partial L}{\partial \pi_{k}} &=\frac{\partial}{\partial \pi_{k}} \left[\sum_{n=1}^{N} \ln \left[\sum_{k=1}^{K} \pi_{k}\, p\left(x_{n} \mid \mu_{k}, \Sigma_{k}\right)\right]+\lambda\left(\sum_{k=1}^{K} \pi_{k}-1\right)\right]\\ &=\sum_{n=1}^{N} \frac{p\left(x_{n} \mid \mu_{k}, \Sigma_{k}\right)}{\sum_{j=1}^{K} \pi_{j}\, p\left(x_{n} \mid \mu_{j}, \Sigma_{j}\right)}+\lambda=0 \end{aligned}$$
Multiplying both sides by $\pi_k$,
$$\begin{aligned} \sum_{n=1}^{N} \frac{\pi_{k}\, p\left(x_{n} \mid \mu_{k}, \Sigma_{k}\right)}{\sum_{j=1}^{K} \pi_{j}\, p\left(x_{n} \mid \mu_{j}, \Sigma_{j}\right)}+\lambda \pi_{k} &=\sum_{n=1}^{N} \gamma\left(z_{n k}\right)+\lambda \pi_{k} \\ &=N_{k}+\lambda \pi_{k}=0 \end{aligned}$$
Summing the above over $k$,
$$\sum_{k=1}^{K}\left(N_{k}+\lambda \pi_{k}\right)=\sum_{k=1}^{K} N_{k}+\lambda \sum_{k=1}^{K} \pi_{k}=\sum_{k=1}^{K} N_{k}+\lambda=0$$
Moreover,
$$\begin{aligned} \sum_{k=1}^{K} N_{k} &=\sum_{k=1}^{K} \sum_{n=1}^{N} \gamma\left(z_{n k}\right)=\sum_{n=1}^{N} \sum_{k=1}^{K} \gamma\left(z_{n k}\right) \\ &=\sum_{n=1}^{N} \sum_{k=1}^{K} \frac{\pi_{k}\, p\left(x_n \mid \mu_{k}, \Sigma_{k}\right)}{\sum_{j=1}^{K} \pi_{j}\, p\left(x_n \mid \mu_{j}, \Sigma_{j}\right)} \\ &=\sum_{n=1}^{N} 1=N \end{aligned}$$
so $\lambda=-N$, and therefore
$$\pi_k=-\frac{N_k}{\lambda}=\frac{N_k}{N}$$
In summary, the stationary points of the log-likelihood $\ln p(X \mid \pi,\mu,\Sigma)$ satisfy:
$$\begin{cases}\mu_{k}=\frac{1}{N_{k}} \sum_{n=1}^{N}\left[\gamma\left(z_{n k}\right) x_{n}\right] \\[4pt] \Sigma_{k}=\frac{1}{N_{k}} \sum_{n=1}^{N} \gamma\left(z_{n k}\right)\left(x_{n}-\mu_{k}\right)\left(x_{n}-\mu_{k}\right)^{T} \\[4pt] \pi_{k}=\frac{N_{k}}{N}\end{cases}$$
Note that these equations do not constitute a closed-form solution for the GMM parameters, because the responsibilities $\gamma(z_{nk})$ on the right-hand sides themselves depend on the parameters; they do, however, suggest an iterative algorithm.
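Given responsibilities, the three stationary-point conditions can be evaluated in one pass. A minimal NumPy sketch for one-dimensional data; the hard 0/1 responsibilities below are a hypothetical example, under which the updates reduce to per-cluster sample statistics.

```python
import numpy as np

def m_step(X, gamma):
    """Given responsibilities gamma (N x K), compute the stationary-point
    parameters for 1-D data X of shape (N,), per the three conditions above."""
    N_k = gamma.sum(axis=0)                        # effective sample counts
    mu = (gamma * X[:, None]).sum(axis=0) / N_k    # responsibility-weighted means
    var = (gamma * (X[:, None] - mu) ** 2).sum(axis=0) / N_k
    pi = N_k / len(X)
    return pi, mu, var

# With hard, equal assignments this reduces to per-cluster sample statistics.
X = np.array([0.0, 2.0, 10.0, 12.0])
gamma = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
pi, mu, var = m_step(X, gamma)
print(pi.tolist(), mu.tolist(), var.tolist())
```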
The EM Algorithm for Gaussian Mixture Models
- Initialization: assign initial values to the mean vectors $\mu_k$, covariance matrices $\Sigma_k$, and mixing coefficients $\pi_k$, and compute the initial value of the log-likelihood. (K-means clustering can be used to obtain the initial parameters.)
- E step (Expectation step): use the current parameter values $\mu_k$, $\Sigma_k$, $\pi_k$ to compute the posterior probabilities (responsibilities) $\gamma(z_{nk})$:
$$\gamma\left(z_{n k}\right) \leftarrow \frac{\pi_{k}\, p\left(x_{n} \mid \mu_{k}, \Sigma_{k}\right)}{\sum_{j=1}^{K} \pi_{j}\, p\left(x_{n} \mid \mu_{j}, \Sigma_{j}\right)}$$
- M step (Maximization step): use the current responsibilities $\gamma(z_{nk})$ to re-estimate the parameters:
$$\begin{aligned} \mu_{k}^{\text{new}} & \leftarrow \frac{1}{N_{k}} \sum_{n=1}^{N}\left[\gamma\left(z_{n k}\right) x_{n}\right] \\ \Sigma_{k}^{\text{new}} & \leftarrow \frac{1}{N_{k}} \sum_{n=1}^{N} \gamma\left(z_{n k}\right)\left(x_{n}-\mu_{k}^{\text{new}}\right)\left(x_{n}-\mu_{k}^{\text{new}}\right)^{T} \\ \pi_{k}^{\text{new}} & \leftarrow \frac{N_{k}}{N} \end{aligned}$$
where $N_k=\sum_{n=1}^{N}\gamma(z_{nk})$.
In the M step, $\mu_{k}^{\text{new}}$ is computed first and then used to compute the new covariance matrix $\Sigma_{k}^{\text{new}}$.
- Evaluate the log-likelihood with the new parameters $\mu_{k}^{\text{new}}$, $\Sigma_{k}^{\text{new}}$, $\pi_{k}^{\text{new}}$:
$$\ln p(X \mid \pi,\mu,\Sigma) =\sum_{n=1}^{N}\ln\left[\sum_{k=1}^{K}\pi_k\, p(x_n \mid \mu_k,\Sigma_k)\right]$$
and check for convergence. (Alternatively, one can check whether the parameters themselves have converged.)
If the algorithm has converged, the model parameters are found; otherwise, return to the E step and continue iterating.
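The full procedure can be sketched for one-dimensional data as follows. This is a minimal illustration: the simple min/max initialization replaces the K-means initialization mentioned above, a fixed iteration count replaces the convergence check, and the synthetic two-cluster data are made up for the demonstration.

```python
import math
import random

def gauss(x, m, v):
    return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

def em_gmm_1d(X, K, iters=100):
    # Crude deterministic initialization spreading the means over the data
    # range (K-means initialization would be better in practice).
    lo, hi = min(X), max(X)
    mu = [lo + (hi - lo) * (k + 0.5) / K for k in range(K)]
    var = [1.0] * K
    pi = [1.0 / K] * K
    for _ in range(iters):
        # E step: responsibilities gamma[n][k] under the current parameters.
        gamma = []
        for x in X:
            w = [pi[k] * gauss(x, mu[k], var[k]) for k in range(K)]
            s = sum(w)
            gamma.append([wk / s for wk in w])
        # M step: re-estimate parameters from the responsibilities.
        for k in range(K):
            Nk = sum(g[k] for g in gamma)
            mu[k] = sum(g[k] * x for g, x in zip(gamma, X)) / Nk
            var[k] = sum(g[k] * (x - mu[k]) ** 2 for g, x in zip(gamma, X)) / Nk
            pi[k] = Nk / len(X)
    return pi, mu, var

# Synthetic 1-D data from two well-separated clusters.
rng = random.Random(1)
X = [rng.gauss(-5, 1) for _ in range(200)] + [rng.gauss(5, 1) for _ in range(200)]
pi, mu, var = em_gmm_1d(X, 2)
print(sorted(round(m) for m in mu))   # recovered means, near -5 and 5
```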
The EM Algorithm: Another View of Maximum Likelihood for the GMM
In the maximum likelihood estimation above, the objective was the log-likelihood $\ln p(X \mid \pi,\mu,\Sigma)$ of the incomplete dataset $\mathbf{X}=\{x_1,...,x_N\}$. Now change perspective: suppose we also observed the latent variables $\mathbf{Z}=\{z_1,...,z_N\}$, i.e., we knew which component of the mixture each $x_n$ came from. In other words, we would have the complete dataset $\{\mathbf{X},\mathbf{Z}\}$ and could evaluate the complete-data log-likelihood $\ln p(\mathbf{X},\mathbf{Z} \mid \pi,\mu,\Sigma)$.
First, the likelihood of one complete observation $\{x_n,z_n\}$ is:
$$p\left(x_{n}, z_{n} \mid \pi, \mu, \Sigma\right)=\prod_{k=1}^{K} \pi_{k}^{z_{n k}}\, p\left(x_{n} \mid \mu_{k}, \Sigma_{k}\right)^{z_{n k}}$$
so the likelihood of the whole complete dataset $\{\mathbf{X},\mathbf{Z}\}$ (assuming the observations are independent) is:
$$p\left(\mathbf{X},\mathbf{Z} \mid \pi, \mu, \Sigma\right)=\prod_{n=1}^{N}\prod_{k=1}^{K} \pi_{k}^{z_{n k}}\, p\left(x_{n} \mid \mu_{k}, \Sigma_{k}\right)^{z_{n k}}$$
Taking the logarithm,
$$\ln p\left(\mathbf{X},\mathbf{Z} \mid \pi, \mu, \Sigma\right)=\sum_{n=1}^{N}\sum_{k=1}^{K} z_{n k}\left[\ln \pi_{k}+\ln p\left(x_{n} \mid \mu_{k}, \Sigma_{k}\right)\right]$$
Compared with the incomplete-data likelihood $p(X \mid \pi,\mu,\Sigma)=\prod_{n=1}^{N}[\sum_{k=1}^{K}\pi_k\, p(x_n \mid \mu_k,\Sigma_k)]$, the sum $\sum_{k=1}^{K}$ and the logarithm have swapped places.
In practice, however, the latent variables are not observed, so $\ln p(\mathbf{X},\mathbf{Z} \mid \pi,\mu,\Sigma)$ cannot be computed directly. Instead, we take its expectation under the posterior distribution of the latent variables $z_{nk}$; taking the expectation eliminates $z_{nk}$.
The expectation of $\ln p(\mathbf{X},\mathbf{Z} \mid \pi,\mu,\Sigma)$ under the posterior distribution of $z_{nk}$ is:
$$\begin{aligned} E_{Z} \ln p(X, Z \mid \pi, \mu, \Sigma) &=E \sum_{n=1}^{N} \sum_{k=1}^{K}\left\{z_{n k}\left[\ln \pi_{k}+\ln p\left(x_{n} \mid \mu_{k}, \Sigma_{k}\right)\right]\right\} \\ &=\sum_{n=1}^{N} \sum_{k=1}^{K}\left\{E\left(z_{n k}\right)\left[\ln \pi_{k}+\ln p\left(x_{n} \mid \mu_{k}, \Sigma_{k}\right)\right]\right\} \end{aligned}$$
where $E\left(z_{n k}\right)$ denotes the expectation of $z_{nk}$ under the posterior distribution, $E\left(z_{n k} \mid X,\theta\right)$. Since $z_{nk}\in\{0,1\}$,
$$\begin{aligned} E\left(z_{n k} \mid X, \theta\right) &=0 \times p\left(z_{n k}=0 \mid X, \theta\right)+1 \times p\left(z_{n k}=1 \mid X, \theta\right) \\ &=p\left(z_{n k}=1 \mid X, \theta\right) \\ &=\gamma\left(z_{n k}\right) \end{aligned}$$
Hence,
$$E_{Z} \ln p(X, Z \mid \pi, \mu, \Sigma)=\sum_{n=1}^{N} \sum_{k=1}^{K}\left\{\gamma\left(z_{n k}\right)\left[\ln \pi_{k}+\ln p\left(x_{n} \mid \mu_{k}, \Sigma_{k}\right)\right]\right\}$$
Now treat $\gamma\left(z_{n k}\right)$ as constants and seek the parameters $\pi$, $\mu$, $\Sigma$ that maximize $E_{Z} \ln p(X, Z \mid \pi, \mu, \Sigma)$, i.e.,
$$\begin{cases} \max_{\pi,\mu,\Sigma}\ E_{Z} \ln p(X, Z \mid \pi, \mu, \Sigma)\\ \text{s.t.} \quad \sum_{k=1}^{K}\pi_k=1 \end{cases}$$
Using Lagrange multipliers, the Lagrangian is:
$$L_E=E_{Z} \ln p(X, Z \mid \pi, \mu, \Sigma)+\lambda\left(\sum_{k=1}^{K}\pi_k-1\right)$$
First take the gradient with respect to $\mu_k$:
$$\begin{aligned} \frac{\partial L_{E}}{\partial \mu_{k}}=0 & \Rightarrow \sum_{n=1}^{N} \gamma\left(z_{n k}\right)\left[\Sigma_{k}^{-1}\left(x_{n}-\mu_{k}\right)\right]=0 \\ & \Rightarrow \sum_{n=1}^{N} \gamma\left(z_{n k}\right)\left(x_{n}-\mu_{k}\right)=0 \\ & \Rightarrow \sum_{n=1}^{N} \gamma\left(z_{n k}\right) x_{n}=\sum_{n=1}^{N} \gamma\left(z_{n k}\right) \mu_{k} \end{aligned}$$
Exactly as when maximizing $\ln p(X \mid \pi,\mu,\Sigma)$ directly, this yields
$$\mu_k=\frac{1}{N_k}\sum_{n=1}^{N}\left[\gamma(z_{nk})\,x_n\right]$$
Next take the gradient with respect to $\Sigma_k$:
$$\begin{aligned} \frac{\partial L_{E}}{\partial \Sigma_{k}} &=\sum_{n=1}^{N}\left[\frac{\partial \sum_{k=1}^{K} \gamma\left(z_{n k}\right) \ln p\left(x_{n} \mid \mu_{k}, \Sigma_{k}\right)}{\partial \Sigma_{k}}\right] \\ &=\sum_{n=1}^{N}\left[\gamma\left(z_{n k}\right) \frac{\partial \ln p\left(x_{n} \mid \mu_{k}, \Sigma_{k}\right)}{\partial \Sigma_{k}}\right] \\ &=\sum_{n=1}^{N}\left[\frac{\gamma\left(z_{n k}\right)}{p\left(x_{n} \mid \mu_{k}, \Sigma_{k}\right)} \frac{\partial p\left(x_{n} \mid \mu_{k}, \Sigma_{k}\right)}{\partial \Sigma_{k}}\right] \end{aligned}$$
When finding the stationary points of $\ln p(X \mid \pi,\mu,\Sigma)$ earlier, we had
$$\frac{\partial L}{\partial \Sigma_{k}}=\sum_{n=1}^{N}\left[\frac{\pi_{k}}{\sum_{j=1}^{K} \pi_{j}\, p\left(x_{n} \mid \mu_{j}, \Sigma_{j}\right)} \frac{\partial p\left(x_{n} \mid \mu_{k}, \Sigma_{k}\right)}{\partial \Sigma_{k}}\right]$$
Combining this with the formula for the responsibility $\gamma(z_{nk})$ shows that the two expressions are identical, so both approaches yield the same $\Sigma_k$.
Finally, consider $\pi_k$:
$$\begin{aligned} \frac{\partial L_{E}}{\partial \pi_{k}} &=\sum_{n=1}^{N}\left[\frac{\partial \sum_{k=1}^{K} \gamma\left(z_{n k}\right) \ln \pi_{k}}{\partial \pi_{k}}\right]+\lambda \\ &=\sum_{n=1}^{N}\left[\gamma\left(z_{n k}\right) \frac{\partial \ln \pi_{k}}{\partial \pi_{k}}\right]+\lambda \\ &=\sum_{n=1}^{N}\left[\gamma\left(z_{n k}\right) \frac{1}{\pi_{k}}\right]+\lambda=\frac{1}{\pi_{k}} \sum_{n=1}^{N} \gamma\left(z_{n k}\right)+\lambda \\ &=\frac{1}{\pi_{k}} N_{k}+\lambda \end{aligned}$$
Setting this to zero shows that $\pi_k$, too, agrees with the value at the stationary points of $\ln p(X \mid \pi,\mu,\Sigma)$.
In summary, the model parameters obtained by maximizing $E_{Z} \ln p(X, Z \mid \pi, \mu, \Sigma)$ coincide with the maximum likelihood estimates from $\ln p(X \mid \pi,\mu,\Sigma)$. Each iteration of the EM algorithm maximizes $E_{Z} \ln p(X, Z \mid \pi, \mu, \Sigma)$.
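Consistent with this view, an EM iteration (maximizing the expected complete-data log-likelihood with the responsibilities held fixed) cannot decrease the incomplete-data log-likelihood. A minimal sketch, starting from a deliberately poor parameter guess on hypothetical one-dimensional data:

```python
import math

def gauss(x, m, v):
    return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

def loglik(X, pi, mu, var):
    # Incomplete-data log-likelihood ln p(X | pi, mu, var).
    return sum(math.log(sum(p * gauss(x, m, v) for p, m, v in zip(pi, mu, var)))
               for x in X)

def em_step(X, pi, mu, var):
    # E step: gamma(z_nk) under the current parameters.
    gamma = []
    for x in X:
        w = [p * gauss(x, m, v) for p, m, v in zip(pi, mu, var)]
        s = sum(w)
        gamma.append([wk / s for wk in w])
    # M step: maximize E_Z ln p(X, Z | pi, mu, var) with gamma held fixed.
    K = len(pi)
    Nk = [sum(g[k] for g in gamma) for k in range(K)]
    mu = [sum(g[k] * x for g, x in zip(gamma, X)) / Nk[k] for k in range(K)]
    var = [sum(g[k] * (x - mu[k]) ** 2 for g, x in zip(gamma, X)) / Nk[k]
           for k in range(K)]
    pi = [Nk[k] / len(X) for k in range(K)]
    return pi, mu, var

X = [-3.1, -2.8, -3.4, 2.9, 3.2, 3.0]            # hypothetical data
params = ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])   # deliberately poor guess
before = loglik(X, *params)
after = loglik(X, *em_step(X, *params))
print(after > before)
```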