1. Maximum Likelihood Estimation
Consider a Gaussian distribution $p(\boldsymbol{x}\mid\theta)$ with parameters $\theta=(\mu,\Sigma)$. Each sample in the set $X=\{x_1,\dots,x_N\}$ is drawn independently from this Gaussian, i.e., the samples satisfy the i.i.d. (independent and identically distributed) assumption. The probability of observing this sample set is therefore:
$$p(X\mid\theta) = \prod_{i=1}^N p(x_i\mid\theta)$$

We want to estimate the model parameter vector $\theta$. Since the sample set $X$ is now known, $p(X\mid\theta)$ can be viewed as a function of the parameter vector $\theta$; we call it the likelihood function $L(\theta\mid X)$ of $\theta$ given the sample set $X$.
$$L(\theta\mid X) = \prod_{i=1}^N p(x_i\mid\theta)$$

Maximum likelihood estimation maximizes this likelihood function: it finds the parameter vector $\theta$ that maximizes $p(X\mid\theta)$, in other words, the parameter vector of the model that best explains the observed sample set $X$.
For convenience we usually work with the logarithm of the likelihood, which turns the product into a sum; this is called the "log-likelihood". Since a logarithm with base greater than 1 is strictly increasing, the parameter vector that maximizes the log-likelihood also maximizes the likelihood itself. The log-likelihood is:
$$\begin{aligned} L(\theta\mid X) &= \log\Big(\prod_{i=1}^N p(x_i\mid\theta)\Big) \\ &= \sum_{i=1}^N \log p(x_i\mid\theta) \end{aligned}$$

and the parameter estimate is

$$\hat{\theta}=\arg\underset{\theta}{\max}\,L(\theta\mid X)$$
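For a single Gaussian this maximization has a well-known closed form: the ML mean is the sample mean, and the ML covariance is the sample scatter divided by $N$ (not $N-1$). A minimal numpy sketch (function and variable names are my own):

```python
import numpy as np

def gaussian_mle(X):
    """ML estimates (mu, Sigma) for i.i.d. Gaussian samples X of shape (N, d).
    Note the ML covariance divides by N, not N - 1 (it is the biased estimator)."""
    N = X.shape[0]
    mu = X.mean(axis=0)                 # sample mean
    centered = X - mu
    Sigma = centered.T @ centered / N   # scatter matrix / N
    return mu, Sigma

rng = np.random.default_rng(0)
X = rng.normal(loc=[1.0, -2.0], scale=1.0, size=(500, 2))
mu_hat, Sigma_hat = gaussian_mle(X)
```

With 500 samples the estimates should land close to the true mean $(1,-2)$ and covariance $I$.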
2. Gaussian Mixture Models and Why Direct MLE Fails
A Gaussian mixture model (GMM) is a probability distribution formed by superposing $k$ Gaussian components, written as:

$$p(\mathbf{x}\mid\Theta) = \sum_{l=1}^{k}\alpha_l\, p_l(\mathbf{x}\mid\theta_l)$$
where the parameters are $\Theta=(\alpha_1,\dots,\alpha_k,\theta_1,\dots,\theta_k)$, the weights satisfy $\sum_{l=1}^{k}\alpha_l=1$, and $\alpha_l$ is the weight of the $l$-th Gaussian component within the mixture. Now suppose the observed sample set $X=(x_1,\dots,x_N)$ comes from this mixture model and satisfies the i.i.d. assumption. To estimate the mixture parameters $\Theta$, we write the log-likelihood of these $N$ data points:
$$\begin{aligned} L(\Theta\mid X) &= \log\Big(p(X\mid\Theta)\Big) \\ &= \log\Big(\prod_{i=1}^{N}p(x_i\mid\Theta)\Big) \\ &= \sum_{i=1}^{N}\log\Big(p(x_i\mid\Theta)\Big) \\ &= \sum_{i=1}^{N}\log\Big(\sum_{l=1}^{k}\alpha_l\,p_l(x_i\mid\theta_l)\Big) \end{aligned}$$

Our goal is to estimate the mixture parameters $\Theta$ by maximizing this likelihood.
Observe that each logarithm now contains a sum over components. If we imitate the single-Gaussian approach (differentiate and set to zero), no closed-form solution exists, so we need a better way to solve this problem.
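The obstruction is visible even in code that merely evaluates this log-likelihood: every $\log$ wraps a sum over components, so the parameters of different components never decouple. A sketch for a one-dimensional mixture (all names are mine):

```python
import numpy as np

def gmm_log_likelihood(x, alphas, mus, sigmas):
    """log L(Theta | X) = sum_i log( sum_l alpha_l N(x_i; mu_l, sigma_l^2) )
    for a 1-D mixture; x has shape (N,), the parameter arrays have shape (k,)."""
    x = np.asarray(x, dtype=float)[:, None]              # (N, 1)
    dens = np.exp(-0.5 * ((x - mus) / sigmas) ** 2) / (
        np.sqrt(2 * np.pi) * sigmas)                     # (N, k) component densities
    return np.sum(np.log(dens @ alphas))                 # log of the sum over l

ll = gmm_log_likelihood(
    x=[-1.0, 0.2, 3.1], alphas=np.array([0.4, 0.6]),
    mus=np.array([0.0, 3.0]), sigmas=np.array([1.0, 1.0]))
```

With $k=1$ the inner sum disappears and the expression reduces to the single-Gaussian log-likelihood, which is why that special case has a closed form.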
3. The EM Algorithm

Basic procedure
Suppose a sample set $X$ is observed from some distribution with parameters $\Theta$; we call $X$ the incomplete data. We introduce a set of random variables $Z$ that cannot be observed directly, called latent variables. Together, $X$ and $Z$ are called the complete data, whose joint distribution is:
$$p(X,Z\mid\Theta)=p(Z\mid X,\Theta)\,p(X\mid\Theta)$$

We define a new likelihood function, the complete-data likelihood:
$$L(\Theta\mid X,Z)=p(X,Z\mid\Theta)$$

In principle we could estimate $\Theta$ by maximizing this likelihood. However, $Z$ is latent and its distribution is unknown, so the expression cannot be maximized directly. Instead we compute the expectation of the complete-data log-likelihood with respect to $Z$, which lets us maximize the marginal likelihood of the observed data. This expectation is called the $Q$ function:
$$\begin{aligned} Q(\Theta,\Theta^{g}) &= \mathbb{E}_Z\Big[\log p(X,Z\mid\Theta)\mid X,\Theta^g\Big]\\ &= \sum_Z \log p(X,Z\mid\Theta)\,p(Z\mid X,\Theta^g) \end{aligned}$$

Here $\Theta^g$ is the current parameter estimate and $X$ is the observed data; both are treated as constants. $\Theta$ is the parameter we maximize over, and $Z$ is a random variable distributed as $p(Z\mid X,\Theta^g)$, the conditional distribution of the unobserved variables given the observed data $X$ and the current model parameters $\Theta^g$.
The E-step of the EM algorithm is exactly the computation of this expectation. The M-step then maximizes the expectation, yielding the parameters $\Theta$ that achieve the maximum:
$$\begin{aligned} \Theta^{g+1} &= \arg\underset{\Theta}{\max}\,Q(\Theta,\Theta^{g}) \\ &= \arg\underset{\Theta}{\max}\sum_Z \log p(X,Z\mid\Theta)\,p(Z\mid X,\Theta^g) \end{aligned}$$

The E-step and M-step are repeated until convergence.
Convergence proof (omitted).
4. Applying EM to Gaussian Mixture Models
Returning to the GMM parameter-estimation problem: we treat the sample set $X$ as the incomplete data and introduce unobservable random variables $Z=\{z_i\}_{i=1}^{N}$ with $z_i\in\{1,\dots,k\}$, indicating which Gaussian component generated each data point (think of $z_i$ as a class label that we cannot observe directly). For example, $z_i=l$ means the $i$-th sample was generated by the $l$-th component of the mixture. The distribution of the complete data $(X,Z)$ is then:
$$\begin{aligned} p(X,Z\mid\Theta) &= \prod_{i=1}^{N}p(x_i,z_i\mid\Theta) \\ &= \prod_{i=1}^{N}p(z_i\mid\Theta)\,p(x_i\mid z_i,\Theta) \\ &= \prod_{i=1}^{N}\alpha_{z_i}\,p_{z_i}(x_i\mid\theta_{z_i}) \end{aligned}$$

In the mixture model, $p(z_i\mid\Theta)$ is the prior probability of the $z_i$-th component, i.e., its weight $\alpha_{z_i}$. Given the component that generated the sample, $p(x_i\mid z_i,\Theta)$ is the density of the corresponding Gaussian, i.e., $p_{z_i}(x_i\mid\theta_{z_i})$.
By Bayes' rule, the conditional distribution of the latent variables is:
$$\begin{aligned} p(Z\mid X,\Theta) &= \prod_{i=1}^N p(z_i\mid x_i,\Theta) \\ &= \prod_{i=1}^N\frac{p(x_i,z_i\mid\Theta)}{p(x_i\mid\Theta)} \\ &= \prod_{i=1}^N\frac{\alpha_{z_i}\,p_{z_i}(x_i\mid\theta_{z_i})}{\sum_{z_i=1}^k\alpha_{z_i}\,p_{z_i}(x_i\mid\theta_{z_i})} \end{aligned}$$

We can therefore write out the corresponding $Q$ function:
$$\begin{aligned} Q(\Theta,\Theta^{g}) &= \sum_Z \log p(X,Z\mid\Theta)\,p(Z\mid X,\Theta^g) \\ &= \sum_Z \log\Big(\prod_{i=1}^{N}\alpha_{z_i}\,p_{z_i}(x_i\mid\theta_{z_i})\Big)\prod_{i=1}^N p(z_i\mid x_i,\Theta^g) \\ &= \sum_Z\sum_{i=1}^{N}\log\Big(\alpha_{z_i}\,p_{z_i}(x_i\mid\theta_{z_i})\Big)\prod_{i=1}^N p(z_i\mid x_i,\Theta^g) \\ &= \sum_{z_1=1}^k\sum_{z_2=1}^k\cdots\sum_{z_N=1}^k\sum_{i=1}^{N}\log\Big(\alpha_{z_i}\,p_{z_i}(x_i\mid\theta_{z_i})\Big)\prod_{i=1}^N p(z_i\mid x_i,\Theta^g) \\ &= \sum_{z_1=1}^k\sum_{z_2=1}^k\cdots\sum_{z_N=1}^k\log\Big(\alpha_{z_1}\,p_{z_1}(x_1\mid\theta_{z_1})\Big)\prod_{i=1}^N p(z_i\mid x_i,\Theta^g) \\ &\quad+ \sum_{z_1=1}^k\sum_{z_2=1}^k\cdots\sum_{z_N=1}^k\sum_{i=2}^{N}\log\Big(\alpha_{z_i}\,p_{z_i}(x_i\mid\theta_{z_i})\Big)\prod_{i=1}^N p(z_i\mid x_i,\Theta^g) \\ &= \mathcal{A} + \mathcal{B} \end{aligned}$$

where
$$\begin{aligned} \mathcal{A} &= \sum_{z_1=1}^k\sum_{z_2=1}^k\cdots\sum_{z_N=1}^k\log\Big(\alpha_{z_1}\,p_{z_1}(x_1\mid\theta_{z_1})\Big)\prod_{i=1}^N p(z_i\mid x_i,\Theta^g) \\ &= \sum_{z_1=1}^k\log\Big(\alpha_{z_1}\,p_{z_1}(x_1\mid\theta_{z_1})\Big)p(z_1\mid x_1,\Theta^g)\underbrace{\sum_{z_2=1}^k\cdots\sum_{z_N=1}^k\prod_{i=2}^N p(z_i\mid x_i,\Theta^g)}_{=\,1} \\ &= \sum_{z_1=1}^k\log\Big(\alpha_{z_1}\,p_{z_1}(x_1\mid\theta_{z_1})\Big)p(z_1\mid x_1,\Theta^g) \end{aligned}$$
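The key combinatorial step here, the nested sums over $z_2,\dots,z_N$ collapsing because each conditional sums to 1, is easy to verify by brute force for tiny $N$ and $k$. A verification sketch with arbitrary numbers (all names are mine):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
N, k = 3, 2
# Arbitrary conditionals p(z_i = l | x_i): each row normalized to sum to 1.
post = rng.random((N, k))
post /= post.sum(axis=1, keepdims=True)
f = rng.random((N, k))  # stand-in for log(alpha_l p_l(x_i | theta_l))

# Brute force over all k^N assignments Z:
#   sum_Z [ sum_i f(i, z_i) ] * prod_i p(z_i | x_i)
brute = sum(
    sum(f[i, z[i]] for i in range(N)) * np.prod([post[i, z[i]] for i in range(N)])
    for z in itertools.product(range(k), repeat=N))

# Collapsed form: sum_i sum_l f(i, l) * p(l | x_i)
collapsed = sum(f[i, l] * post[i, l] for i in range(N) for l in range(k))
```

The two quantities agree, which is exactly the simplification of $Q$ used below.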
$\mathcal{B}$ can be decomposed with the same trick, so we omit the details. Writing $l$ in place of $z_i$ to simplify notation, the $Q$ function reduces to:
$$\begin{aligned} Q(\Theta,\Theta^{g}) &= \sum_{i=1}^N\sum_{z_i=1}^k\log\Big(\alpha_{z_i}\,p_{z_i}(x_i\mid\theta_{z_i})\Big)p(z_i\mid x_i,\Theta^g) \\ &= \sum_{i=1}^N\sum_{l=1}^k\log\Big(\alpha_{l}\,p_{l}(x_i\mid\theta_{l})\Big)p(l\mid x_i,\Theta^g) \\ &= \sum_{l=1}^k\sum_{i=1}^N\log\Big(\alpha_{l}\,p_{l}(x_i\mid\theta_{l})\Big)p(l\mid x_i,\Theta^g) \\ &= \sum_{l=1}^k\sum_{i=1}^N\log(\alpha_l)\,p(l\mid x_i,\Theta^g)+\sum_{l=1}^k\sum_{i=1}^N\log\Big(p_l(x_i\mid\theta_{l})\Big)p(l\mid x_i,\Theta^g) \end{aligned}$$
The terms containing the weights $\alpha_l$ and the terms containing $\theta_l$ can now be maximized separately to obtain their respective estimates.
1. Estimating the parameters $\alpha_l$

This can be expressed as the following constrained optimization problem:
$$\begin{aligned} \underset{\alpha_l}{\max}&\quad \sum_{l=1}^k\sum_{i=1}^N\log(\alpha_l)\,p(l\mid x_i,\Theta^g) \\ \text{s.t.}&\quad \sum_{l=1}^k\alpha_l=1 \end{aligned}$$

Introducing a Lagrange multiplier $\lambda$, we build the Lagrangian:
$$\begin{aligned} \mathcal{L}(\alpha_1,\dots,\alpha_k,\lambda) &= \sum_{l=1}^k\sum_{i=1}^N\log(\alpha_l)\,p(l\mid x_i,\Theta^g)-\lambda\Big(\sum_{l=1}^k\alpha_l-1\Big) \\ &= \sum_{l=1}^k\log(\alpha_l)\sum_{i=1}^N p(l\mid x_i,\Theta^g)-\lambda\Big(\sum_{l=1}^k\alpha_l-1\Big) \end{aligned}$$

Taking the partial derivative with respect to $\alpha_l$ and setting it to $0$:
$$\frac{\partial\mathcal{L}}{\partial\alpha_l}=\frac{1}{\alpha_l}\sum_{i=1}^{N}p(l\mid x_i,\Theta^g)-\lambda=0$$

we obtain:
$$\alpha_l=\frac{1}{\lambda}\sum_{i=1}^{N}p(l\mid x_i,\Theta^g)$$

Substituting back into the constraint gives:
$$1-\frac{1}{\lambda}\sum_{l=1}^{k}\sum_{i=1}^{N}p(l\mid x_i,\Theta^g)=0$$

In the earlier derivation we obtained the conditional distribution of the latent variable $z$:
$$\begin{aligned} p(z_i\mid x_i,\Theta) &= \frac{p(x_i,z_i\mid\Theta)}{p(x_i\mid\Theta)} \\ &= \frac{\alpha_{z_i}\,p_{z_i}(x_i\mid\theta_{z_i})}{\sum_{z_i=1}^k\alpha_{z_i}\,p_{z_i}(x_i\mid\theta_{z_i})} \end{aligned}$$

Again writing $l$ in place of $z_i$, we get:
$$\begin{aligned} p(l\mid x_i,\Theta) &= \frac{p(x_i,l\mid\Theta)}{p(x_i\mid\Theta)} \\ &= \frac{\alpha_{l}\,p_{l}(x_i\mid\theta_{l})}{\sum_{l=1}^k\alpha_{l}\,p_{l}(x_i\mid\theta_{l})} \end{aligned}$$

Substituting this back into the previous equation:
$$\begin{aligned} 1-\frac{1}{\lambda}\sum_{l=1}^{k}\sum_{i=1}^{N}\frac{\alpha_{l}\,p_{l}(x_i\mid\theta_{l})}{\sum_{l=1}^k\alpha_{l}\,p_{l}(x_i\mid\theta_{l})}&=0 \\ 1-\frac{1}{\lambda}\sum_{i=1}^{N}\sum_{l=1}^{k}\frac{\alpha_{l}\,p_{l}(x_i\mid\theta_{l})}{\sum_{l=1}^k\alpha_{l}\,p_{l}(x_i\mid\theta_{l})}&=0 \\ 1-\frac{1}{\lambda}\sum_{i=1}^{N}(1)&=0 \\ 1-\frac{N}{\lambda}&=0 \\ \lambda&=N \end{aligned}$$

Hence,
$$\alpha_l=\frac{1}{N}\sum_{i=1}^{N}p(l\mid x_i,\Theta^g)$$

Before estimating the remaining parameters, we first review some matrix algebra that will be used shortly.
[Background: Matrix Algebra]

The trace of a matrix is the sum of its main-diagonal entries, and it has the following properties:
$$tr(A+B)=tr(A)+tr(B)$$

$$tr(AB)=tr(BA)$$

$$\sum_i x_i^T A x_i = tr(AB) \quad \text{where} \quad B=\sum_i x_i x_i^T$$
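These trace identities are straightforward to sanity-check numerically; a quick sketch with arbitrary matrices (all names are mine):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 4, 10
A = rng.random((d, d))
xs = rng.random((n, d))

B = sum(np.outer(x, x) for x in xs)      # B = sum_i x_i x_i^T
lhs = sum(x @ A @ x for x in xs)         # sum_i x_i^T A x_i
rhs = np.trace(A @ B)                    # tr(AB)
cyclic = np.trace(B @ A)                 # tr(BA)
```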
With $a_{i,j}$ denoting the element in row $i$, column $j$ of matrix $A$, we also list some matrix-derivative formulas (the last two hold for symmetric $A$, which suffices here since covariance matrices and their inverses are symmetric):

$$\frac{\partial\, x^T A x}{\partial x}=(A+A^T)x$$

$$\frac{\partial\log|A|}{\partial A}=2A^{-1}-diag(A^{-1})$$

$$\frac{\partial\, tr(AB)}{\partial A}=B+B^T-diag(B)$$
For a $d$-dimensional Gaussian, the parameters are $\theta=(\mu,\Sigma)$ and the density is:

$$p_l(x\mid\mu_l,\Sigma_l)=\frac{1}{(2\pi)^{d/2}\vert\Sigma_l\vert^{1/2}}\exp\Big[-\frac{1}{2}(x-\mu_l)^T\Sigma_l^{-1}(x-\mu_l)\Big]$$
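For reference, this density can be evaluated directly in numpy (a sketch; I use `np.linalg.solve` rather than explicitly inverting $\Sigma$):

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Density of a d-dimensional Gaussian N(mu, Sigma) at point x."""
    d = mu.shape[0]
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    quad = diff @ np.linalg.solve(Sigma, diff)   # (x - mu)^T Sigma^{-1} (x - mu)
    return np.exp(-0.5 * quad) / norm

# At the mean with Sigma = I (d = 2), the density is (2*pi)^{-d/2} = 1/(2*pi).
p = gaussian_pdf(np.zeros(2), np.zeros(2), np.eye(2))
```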
2. Estimating the parameters $\mu_l$
$$\begin{aligned} &\quad\sum_{l=1}^k\sum_{i=1}^N\log\Big(p_l(x_i\mid\mu_l,\Sigma_l)\Big)p(l\mid x_i,\Theta^g) \\ &= \sum_{l=1}^k\sum_{i=1}^N\Big(\log\big((2\pi)^{-d/2}\big)+\log|\Sigma_l|^{-1/2}-\frac{1}{2}(x_i-\mu_l)^T\Sigma_l^{-1}(x_i-\mu_l)\Big)p(l\mid x_i,\Theta^g) \end{aligned}$$
Ignoring the constant term (its derivative is zero), this simplifies to:
$$\sum_{l=1}^k\sum_{i=1}^N\Big(-\frac{1}{2}\log|\Sigma_l|-\frac{1}{2}(x_i-\mu_l)^T\Sigma_l^{-1}(x_i-\mu_l)\Big)p(l\mid x_i,\Theta^g)$$

Differentiating with respect to $\mu_l$ (using the quadratic-form rule above) gives:
$$\sum_{l=1}^k\sum_{i=1}^N\Bigg(-\frac{1}{2}\Big(\Sigma_l^{-1}+(\Sigma_l^{-1})^T\Big)\Big(x_i-\mu_l\Big)\Big(-1\Big)\Bigg)p(l\mid x_i,\Theta^g)$$
Because the covariance matrix $\Sigma_l$ (and hence its inverse) is symmetric, $\frac{1}{2}\big(\Sigma_l^{-1}+(\Sigma_l^{-1})^T\big)=\Sigma_l^{-1}$, so the expression simplifies further to:
$$\sum_{l=1}^k\sum_{i=1}^N\Sigma_l^{-1}(x_i-\mu_l)\,p(l\mid x_i,\Theta^g)$$

Setting this to $0$ (only the $l$-th term depends on $\mu_l$, so the equality must hold for each component separately), we get:

$$\begin{aligned} \sum_{l=1}^k\Sigma_l^{-1}\sum_{i=1}^N\mu_l\,p(l\mid x_i,\Theta^g)&=\sum_{l=1}^k\Sigma_l^{-1}\sum_{i=1}^N x_i\,p(l\mid x_i,\Theta^g) \\ \sum_{i=1}^N\mu_l\,p(l\mid x_i,\Theta^g)&=\sum_{i=1}^N x_i\,p(l\mid x_i,\Theta^g) \\ \mu_l\sum_{i=1}^N p(l\mid x_i,\Theta^g)&=\sum_{i=1}^N x_i\,p(l\mid x_i,\Theta^g) \\ \mu_l&=\frac{\sum_{i=1}^N x_i\,p(l\mid x_i,\Theta^g)}{\sum_{i=1}^N p(l\mid x_i,\Theta^g)} \end{aligned}$$
3. Estimating the parameters $\Sigma_l$
Starting from the same objective as before:

$$\begin{aligned} &\quad\sum_{l=1}^k\sum_{i=1}^N\log\Big(p_l(x_i\mid\mu_l,\Sigma_l)\Big)p(l\mid x_i,\Theta^g) \\ &= \sum_{l=1}^k\sum_{i=1}^N\Big(\log\big((2\pi)^{-d/2}\big)+\log|\Sigma_l|^{-1/2}-\frac{1}{2}(x_i-\mu_l)^T\Sigma_l^{-1}(x_i-\mu_l)\Big)p(l\mid x_i,\Theta^g) \end{aligned}$$

and again ignoring the constant term, the objective becomes (using $\log|\Sigma_l|^{-1/2}=\frac{1}{2}\log|\Sigma_l^{-1}|$ and the trace identity $x^TAx=tr(Axx^T)$):

$$\begin{aligned} &\quad\sum_{l=1}^k\sum_{i=1}^N\Big(-\frac{1}{2}\log|\Sigma_l|-\frac{1}{2}(x_i-\mu_l)^T\Sigma_l^{-1}(x_i-\mu_l)\Big)p(l\mid x_i,\Theta^g) \\ &= \sum_{l=1}^k\sum_{i=1}^N\Big(\frac{1}{2}\log|\Sigma_l^{-1}|\,p(l\mid x_i,\Theta^g)-\frac{1}{2}(x_i-\mu_l)^T\Sigma_l^{-1}(x_i-\mu_l)\,p(l\mid x_i,\Theta^g)\Big) \\ &= \sum_{l=1}^k\Bigg(\frac{1}{2}\log|\Sigma_l^{-1}|\sum_{i=1}^{N}p(l\mid x_i,\Theta^g)-\frac{1}{2}tr\Big(\Sigma_l^{-1}\sum_{i=1}^{N}(x_i-\mu_l)(x_i-\mu_l)^T p(l\mid x_i,\Theta^g)\Big)\Bigg) \end{aligned}$$

Let $S(\mu_l,\Sigma_{l}^{-1})$ denote the summand inside $\sum_{l=1}^k$, i.e.:

$$S(\mu_l,\Sigma_{l}^{-1})= \frac{1}{2}\log|\Sigma_l^{-1}|\sum_{i=1}^{N}p(l\mid x_i,\Theta^g)-\frac{1}{2}tr\Big(\Sigma_l^{-1}\underbrace{\sum_{i=1}^{N}(x_i-\mu_l)(x_i-\mu_l)^T p(l\mid x_i,\Theta^g)}_{M_l}\Big)$$

Differentiating $S$ with respect to $\Sigma_l^{-1}$ (treating $\Sigma_l^{-1}$ as the variable and applying the two symmetric-matrix formulas above):

$$\begin{aligned} \frac{\partial S(\mu_l,\Sigma_l^{-1})}{\partial\Sigma_l^{-1}} &= \frac{1}{2}\sum_{i=1}^{N}p(l\mid x_i,\Theta^g)\Big(2\Sigma_l-diag(\Sigma_l)\Big)-\frac{1}{2}\frac{\partial\, tr(\Sigma_l^{-1}M_l)}{\partial\Sigma_l^{-1}} \\ &= \frac{1}{2}\sum_{i=1}^{N}p(l\mid x_i,\Theta^g)\Big(2\Sigma_l-diag(\Sigma_l)\Big)-\frac{1}{2}\Big(2M_l-diag(M_l)\Big) \end{aligned}$$

Setting this derivative to $0$:

$$\begin{aligned} \sum_{i=1}^{N}p(l\mid x_i,\Theta^g)\Big(2\Sigma_l-diag(\Sigma_l)\Big) &= 2M_l-diag(M_l) \\ 2\Big(\underbrace{\sum_{i=1}^{N}\Sigma_l\,p(l\mid x_i,\Theta^g)-M_l}_{\mathcal{A}}\Big) &= diag\Big(\underbrace{\sum_{i=1}^{N}\Sigma_l\,p(l\mid x_i,\Theta^g)-M_l}_{\mathcal{A}}\Big) \\ 2\mathcal{A} &= diag(\mathcal{A}) \end{aligned}$$

which forces $\mathcal{A}=0$, i.e.:

$$\begin{aligned} \sum_{i=1}^{N}\Sigma_l\,p(l\mid x_i,\Theta^g) &= M_l = \sum_{i=1}^{N}(x_i-\mu_l)(x_i-\mu_l)^T p(l\mid x_i,\Theta^g) \\ \Sigma_l\sum_{i=1}^{N}p(l\mid x_i,\Theta^g) &= \sum_{i=1}^{N}(x_i-\mu_l)(x_i-\mu_l)^T p(l\mid x_i,\Theta^g) \\ \Sigma_l &= \frac{\sum_{i=1}^{N}(x_i-\mu_l)(x_i-\mu_l)^T p(l\mid x_i,\Theta^g)}{\sum_{i=1}^{N}p(l\mid x_i,\Theta^g)} \end{aligned}$$
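Collecting the E-step (the responsibilities $p(l\mid x_i,\Theta^g)$ from Bayes' rule) and the three M-step updates for $\alpha_l$, $\mu_l$, and $\Sigma_l$ gives the full EM loop for a GMM. A minimal numpy sketch; the initialization scheme and the small ridge added to the covariances are my own choices, not part of the derivation:

```python
import numpy as np

def em_gmm(X, k, n_iter=50):
    """EM for a Gaussian mixture model. X: (N, d) data. Returns (alphas, mus, Sigmas)."""
    N, d = X.shape
    alphas = np.full(k, 1.0 / k)
    mus = X[:: max(1, N // k)][:k].astype(float).copy()   # crude spread-out init
    Sigmas = np.stack([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])

    for _ in range(n_iter):
        # E-step: responsibilities gamma[i, l] = p(l | x_i, Theta^g)
        dens = np.empty((N, k))
        for l in range(k):
            diff = X - mus[l]
            quad = np.einsum('ij,ij->i', diff @ np.linalg.inv(Sigmas[l]), diff)
            dens[:, l] = alphas[l] * np.exp(-0.5 * quad) / (
                (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigmas[l])))
        gamma = dens / dens.sum(axis=1, keepdims=True)

        # M-step: the three closed-form updates derived above
        Nl = gamma.sum(axis=0)                    # N_l = sum_i p(l | x_i, Theta^g)
        alphas = Nl / N                           # alpha_l update
        mus = (gamma.T @ X) / Nl[:, None]         # mu_l update (weighted mean)
        for l in range(k):
            diff = X - mus[l]
            Sigmas[l] = ((gamma[:, l, None] * diff).T @ diff / Nl[l]
                         + 1e-6 * np.eye(d))      # Sigma_l update + tiny ridge
    return alphas, mus, Sigmas

# Toy data: two well-separated 2-D Gaussians with weights 0.4 and 0.6.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal([-3.0, 0.0], 0.5, size=(200, 2)),
               rng.normal([3.0, 0.0], 0.5, size=(300, 2))])
alphas, mus, Sigmas = em_gmm(X, k=2)
```

On well-separated toy data like this, the loop recovers the component weights and means; in practice the initialization (e.g., from k-means) and a log-likelihood-based stopping criterion matter much more than they do here.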