Preface:

Basic operations on Gaussians are used widely in both SLAM and robot learning. They include: linear transformations, marginal and conditional distributions of a joint Gaussian, MLE for the mean and covariance, and EM for mixture models. This post draws on the whiteboard-derivation series and a blog post on SLAM fundamentals. For the conditional distribution I follow the SLAM-style factorization $P(a,b)=P(b)\,P(a|b)$; PRML takes a similar route but relies on completing the square. I then follow the whiteboard series in introducing the Schur complement, except that instead of the construction used in those videos, I apply the Schur complement twice.
1. MVN (Multivariate Normal Distribution)
1.1 Linear transformations:
For $\boldsymbol{x}\sim \mathcal{N}(\boldsymbol{x}\mid\boldsymbol{\mu},\boldsymbol{\Sigma})$, the linear transformation $\boldsymbol{y}=\boldsymbol{Ax}+\boldsymbol{b}$ gives $\boldsymbol{y}\sim\mathcal{N}(A\mu+b,\,A\Sigma A^T)$.
Proof:
$$
\begin{aligned}
\mathrm{E}[y] &= \mathrm{E}[Ax+b] = A\,\mathrm{E}[x]+b = A\mu+b\\
\mathrm{Var}[y] &= \mathrm{Var}[Ax+b] = \mathrm{Var}[Ax] = A\,\mathrm{Var}[x]\,A^T = A\Sigma A^T
\end{aligned}
$$
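The linear-transform rule can be checked empirically; a minimal numpy sketch (all matrix values below are illustrative, not from the post):

```python
import numpy as np

# Check empirically: if x ~ N(mu, Sigma) and y = A x + b,
# then y ~ N(A mu + b, A Sigma A^T).
rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
A = np.array([[1.0, 2.0],
              [0.0, 3.0]])
b = np.array([0.5, -1.0])

x = rng.multivariate_normal(mu, Sigma, size=200_000)  # rows are samples of x
y = x @ A.T + b                                       # apply y = A x + b row-wise

emp_mean = y.mean(axis=0)                             # should approach A mu + b
emp_cov = np.cov(y.T)                                 # should approach A Sigma A^T
```

With 200k samples, `emp_mean` and `emp_cov` match the predicted moments to about two decimal places.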
1.2 From linear transformations to the joint Gaussian: for $(x_a,x_b)\sim \mathcal{N}(\mu,\Sigma)$ with $\Sigma=\begin{bmatrix}\Sigma_{aa}&\Sigma_{ab}\\ \Sigma_{ba}& \Sigma_{bb}\end{bmatrix}$, find the conditional and marginal distributions.
- Given the joint Gaussian, find $p(x_a),p(x_b),p(x_a|x_b),p(x_b|x_a)$.
- $p(x_a)$ and $p(x_b)$:
$$
\begin{aligned}
x_a &= [I,0]\begin{bmatrix}x_a\\x_b\end{bmatrix}, &
x_b &= [0,I]\begin{bmatrix}x_a\\x_b\end{bmatrix}
\end{aligned}
$$
$$
\begin{aligned}
\mathrm{E}[x_a] &= [I,0]\begin{bmatrix}\mu_a\\ \mu_b\end{bmatrix}=\mu_a, &
\mathrm{Var}[x_a] &= [I,0]\begin{bmatrix}\Sigma_{aa}&\Sigma_{ab}\\ \Sigma_{ba}&\Sigma_{bb}\end{bmatrix}\begin{bmatrix}I\\0\end{bmatrix}=\Sigma_{aa}\\
\mathrm{E}[x_b] &= [0,I]\begin{bmatrix}\mu_a\\ \mu_b\end{bmatrix}=\mu_b, &
\mathrm{Var}[x_b] &= [0,I]\begin{bmatrix}\Sigma_{aa}&\Sigma_{ab}\\ \Sigma_{ba}&\Sigma_{bb}\end{bmatrix}\begin{bmatrix}0\\I\end{bmatrix}=\Sigma_{bb}
\end{aligned}
$$
Hence $x_a \sim \mathcal{N}(\mu_a,\Sigma_{aa})$ and $x_b\sim \mathcal{N}(\mu_b,\Sigma_{bb})$.
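The "marginals come from slicing" result is easy to verify numerically; a small sketch with an illustrative 3-D joint (partitioned into the first two dimensions as $a$ and the last as $b$):

```python
import numpy as np

# Marginals of a joint Gaussian are obtained by slicing mu and Sigma:
# x_a ~ N(mu_a, Sigma_aa).
rng = np.random.default_rng(1)
mu = np.array([0.0, 1.0, 2.0])
Sigma = np.array([[2.0, 0.3, 0.1],
                  [0.3, 1.0, 0.2],
                  [0.1, 0.2, 1.5]])
a = slice(0, 2)                            # the "a" block: first two dims

samples = rng.multivariate_normal(mu, Sigma, size=100_000)
mu_a, Sigma_aa = mu[a], Sigma[a, a]        # slice out the marginal parameters
emp_mean_a = samples[:, a].mean(axis=0)    # empirical marginal mean
emp_cov_a = np.cov(samples[:, a].T)        # empirical marginal covariance
```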
- $p(x_a|x_b)$ and $p(x_b|x_a)$: these require the Schur complement (see a Schur-complement cheat sheet for the identities used below).
First, for the Gaussian density we can drop the normalizing factor in front and look only at the $\exp(\cdot)$ part.
$$
p(x_a,x_b)\propto \exp\left(-\frac{1}{2}\begin{bmatrix}x_a\\x_b\end{bmatrix}^T\begin{bmatrix}\Sigma_{aa}& \Sigma_{ab}\\ \Sigma_{ba} & \Sigma_{bb}\end{bmatrix}^{-1}\begin{bmatrix}x_a\\ x_b\end{bmatrix}\right)
$$
The expression involves inverting the block covariance matrix, and this is exactly where the Schur complement helps. Below are the basic Schur-complement identities:
$$
\begin{bmatrix}I & 0\\-CA^{-1}& I\end{bmatrix}\begin{bmatrix}A&B\\C &D\end{bmatrix}\begin{bmatrix}I & -A^{-1}B\\0& I\end{bmatrix}=\begin{bmatrix}A& 0 \\0& D-CA^{-1}B\end{bmatrix}
$$
$$
\begin{bmatrix}I & -BD^{-1}\\ 0 & I \end{bmatrix}\begin{bmatrix} A & B \\ C & D \end{bmatrix}\begin{bmatrix} I & 0\\ -D^{-1}C & I \end{bmatrix} = \begin{bmatrix} A-BD^{-1}C & 0\\ 0 & D \end{bmatrix}
$$
$$
\begin{bmatrix}A&B\\C&D\end{bmatrix} = \begin{bmatrix}I & 0\\CA^{-1}&I\end{bmatrix}\begin{bmatrix}A& 0 \\0 & D-CA^{-1}B\end{bmatrix}\begin{bmatrix}I & A^{-1}B \\ 0 & I\end{bmatrix}
$$
$$
\begin{bmatrix}A & B \\C &D\end{bmatrix}^{-1}=\begin{bmatrix}I & -A^{-1}B\\0 &I\end{bmatrix}\begin{bmatrix}A^{-1}& 0\\0& (D-CA^{-1}B)^{-1}\end{bmatrix}\begin{bmatrix}I & 0 \\-CA^{-1} & I\end{bmatrix}
$$
$$
\begin{bmatrix} A & B\\ C & D \end{bmatrix} = \begin{bmatrix} I & BD^{-1}\\ 0 & I \end{bmatrix}\begin{bmatrix} A-BD^{-1}C & 0 \\ 0 & D \end{bmatrix}\begin{bmatrix} I & 0\\ D^{-1}C & I \end{bmatrix}
$$
$$
\begin{bmatrix} A & B \\ C & D \end{bmatrix}^{-1}=\begin{bmatrix} I & 0\\ -D^{-1}C &I \end{bmatrix}\begin{bmatrix} (A-BD^{-1}C)^{-1} & 0\\ 0 & D^{-1} \end{bmatrix}\begin{bmatrix} I & -BD^{-1}\\ 0 & I \end{bmatrix}
$$
Now consider the exponent of $p(x_a,x_b)$:
$$
\begin{aligned}
-\frac{1}{2}\begin{bmatrix}x_a& x_b\end{bmatrix}\begin{bmatrix} \Sigma_{aa} & \Sigma_{ab}\\ \Sigma_{ba} & \Sigma_{bb} \end{bmatrix}^{-1}\begin{bmatrix} x_a \\ x_b \end{bmatrix} &= -\frac{1}{2} \begin{bmatrix} x_a & x_b \end{bmatrix}\begin{bmatrix}\Lambda_{aa} & \Lambda_{ab}\\ \Lambda_{ba} & \Lambda_{bb} \end{bmatrix} \begin{bmatrix} x_a\\ x_b\end{bmatrix} \\
&= -\frac{1}{2}\begin{bmatrix}x_a & x_b\end{bmatrix} \begin{bmatrix} I & 0 \\ \Lambda_{ba}\Lambda_{aa}^{-1} & I \end{bmatrix} \begin{bmatrix} \Lambda_{aa} & 0\\ 0 & \Lambda_{bb}-\Lambda_{ba}\Lambda_{aa}^{-1}\Lambda_{ab} \end{bmatrix}\begin{bmatrix} I & \Lambda^{-1}_{aa}\Lambda_{ab}\\ 0 & I \end{bmatrix}\begin{bmatrix}x_a \\ x_b \end{bmatrix}\\
&= -\frac{1}{2} (x_a+\Lambda^{-1}_{aa}\Lambda_{ab}x_b)^T\Lambda_{aa}(x_a+\Lambda^{-1}_{aa}\Lambda_{ab}x_b) -\frac{1}{2} x_b^T(\Lambda_{bb}-\Lambda_{ba}\Lambda^{-1}_{aa}\Lambda_{ab})x_b
\end{aligned}
$$
For the precision matrix $\Lambda$ we can use its relation to the covariance matrix $\Sigma$, together with the Schur complement, to solve for the blocks:
$$
\begin{aligned}
\begin{bmatrix} \Lambda_{aa} & \Lambda_{ab} \\ \Lambda_{ba} & \Lambda_{bb} \end{bmatrix} &=\begin{bmatrix} \Sigma_{aa} & \Sigma_{ab}\\ \Sigma_{ba} & \Sigma_{bb} \end{bmatrix}^{-1} \\
&= \begin{bmatrix} I & 0\\ -D^{-1}C & I \end{bmatrix} \begin{bmatrix} (A-BD^{-1}C)^{-1} & 0\\ 0 & D^{-1} \end{bmatrix} \begin{bmatrix} I & -BD^{-1}\\ 0 & I \end{bmatrix} \\
&= \begin{bmatrix} I & 0\\ -\Sigma^{-1}_{bb}\Sigma_{ba} &I \end{bmatrix} \begin{bmatrix} (\Sigma_{aa}-\Sigma_{ab}\Sigma^{-1}_{bb}\Sigma_{ba})^{-1} & 0\\ 0 & \Sigma^{-1}_{bb} \end{bmatrix}\begin{bmatrix} I & -\Sigma_{ab}\Sigma^{-1}_{bb}\\ 0 & I \end{bmatrix}\\
&=\begin{bmatrix} (\Sigma_{aa}-\Sigma_{ab}\Sigma^{-1}_{bb}\Sigma_{ba})^{-1} & -(\Sigma_{aa}-\Sigma_{ab}\Sigma^{-1}_{bb}\Sigma_{ba})^{-1}\Sigma_{ab}\Sigma^{-1}_{bb}\\ -\Sigma^{-1}_{bb}\Sigma_{ba}(\Sigma_{aa}-\Sigma_{ab}\Sigma^{-1}_{bb}\Sigma_{ba})^{-1} & \Sigma^{-1}_{bb}+\Sigma^{-1}_{bb}\Sigma_{ba}(\Sigma_{aa}-\Sigma_{ab}\Sigma^{-1}_{bb}\Sigma_{ba})^{-1}\Sigma_{ab}\Sigma^{-1}_{bb} \end{bmatrix}
\end{aligned}
$$
For the second quadratic term above:
$$
\begin{aligned}
\Lambda_{bb}-\Lambda_{ba}\Lambda^{-1}_{aa}\Lambda_{ab} &= \Sigma^{-1}_{bb}+\Sigma^{-1}_{bb}\Sigma_{ba}(\Sigma_{aa}-\Sigma_{ab}\Sigma^{-1}_{bb}\Sigma_{ba})^{-1}\Sigma_{ab}\Sigma^{-1}_{bb}\\
&\quad -\Sigma^{-1}_{bb}\Sigma_{ba}(\Sigma_{aa}-\Sigma_{ab}\Sigma^{-1}_{bb}\Sigma_{ba})^{-1}(\Sigma_{aa}-\Sigma_{ab}\Sigma^{-1}_{bb}\Sigma_{ba})(\Sigma_{aa}-\Sigma_{ab}\Sigma^{-1}_{bb}\Sigma_{ba})^{-1}\Sigma_{ab}\Sigma^{-1}_{bb}\\
&= \Sigma^{-1}_{bb}
\end{aligned}
$$
For the first term, the (conditional) precision is clearly $\Lambda_{aa}$. For the mean, substitute the precision blocks into
$$x_a + \Lambda^{-1}_{aa}\Lambda_{ab} x_b$$
which yields
$$x_a-\Sigma_{ab}\Sigma^{-1}_{bb}x_b.$$
Note that this plays the role of $(x-\mu)$ in the Gaussian quadratic form $(x-\mu)^T\Sigma^{-1}(x-\mu)$. Writing $x_a$ in that standard form: $(x_a-\mu_a)-\Sigma_{ab}\Sigma^{-1}_{bb}(x_b-\mu_b)$. The whiteboard series names this quantity $x_{a\cdot b}$, which genuinely puzzled me, until I realized: $p(a|b)$ is a distribution over $a$ with $b$ fixed, so its quadratic form reads $(x_a-\mu_{a|b})^T\Lambda_{aa}(x_a-\mu_{a|b})$. Matching terms, the conditional mean of $a$ given $b$ is $\mu_{a|b}=\mu_a+\Sigma_{ab}\Sigma^{-1}_{bb}(x_b-\mu_b)$.
Using the Schur complement twice was the part I found interesting, but working through it shows the result still needs the definitional matching above rather than falling out of the algebra directly. Readers who got this far may still prefer the SLAM blog's route.
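The two Schur-complement facts used above can be checked numerically; a minimal sketch on an illustrative 3-D joint Gaussian (the values are arbitrary):

```python
import numpy as np

# Check: the aa-block of the precision matrix equals the inverse of the
# Schur complement Sigma_aa - Sigma_ab Sigma_bb^{-1} Sigma_ba, and the
# conditional mean/covariance follow the derived formulas.
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.5, 0.2],
                  [0.3, 0.2, 1.0]])
a, b = slice(0, 2), slice(2, 3)
S_aa, S_ab = Sigma[a, a], Sigma[a, b]
S_ba, S_bb = Sigma[b, a], Sigma[b, b]

Lambda = np.linalg.inv(Sigma)                         # precision matrix
schur = S_aa - S_ab @ np.linalg.inv(S_bb) @ S_ba      # Schur complement of Sigma_bb

x_b = np.array([0.5])                                 # an observed value of x_b
cond_mean = mu[a] + S_ab @ np.linalg.inv(S_bb) @ (x_b - mu[b])  # mu_{a|b}
cond_cov = schur                                      # = Lambda_aa^{-1}
```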
2. MLE (Maximum Likelihood Estimation)
2.1 Matrix-calculus identities used in the MLE:
For proofs, see any matrix-derivative reference.
(1) $\frac{\partial w^TAw}{\partial w}=2Aw$ (for symmetric $A$; in general it is $(A+A^T)w$)
(2) $\mathrm{tr}[ABC]=\mathrm{tr}[CAB]=\mathrm{tr}[BCA]$
(3) $x^TAx=\mathrm{tr}[x^TAx]=\mathrm{tr}[xx^TA]$
(4) $\frac{\partial}{\partial A}\mathrm{tr}[AB]=B^T$
(5) $\frac{\partial}{\partial A}\log|A|=(A^{-1})^T=(A^T)^{-1}$
(6) $\mathrm{tr}[AB]=\mathrm{tr}[BA]$
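Identities (2), (3) and (6) can be sanity-checked numerically on random matrices; a small sketch (the shapes are arbitrary):

```python
import numpy as np

# Numerical sanity check of the trace identities on random 3x3 matrices.
rng = np.random.default_rng(2)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))
C = rng.standard_normal((3, 3))
x = rng.standard_normal((3, 1))

t = np.trace(A @ B @ C)
ok2 = np.isclose(t, np.trace(C @ A @ B)) and np.isclose(t, np.trace(B @ C @ A))  # (2)
ok3 = np.isclose((x.T @ A @ x).item(), np.trace(x @ x.T @ A))                    # (3)
ok6 = np.isclose(np.trace(A @ B), np.trace(B @ A))                               # (6)
```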
For i.i.d. samples $X^{(i)}\sim \mathcal{N}(\mu,\Sigma)$, the probability of drawing exactly the observed samples is $\prod_{i=1}^m f_{\mathbf{X^{(i)}}}(\mathbf{x^{(i)};\mu,\Sigma})$. Taking the log-likelihood:
$$
\begin{aligned}
l(\mathbf{ \mu, \Sigma \mid x^{(i)} }) & = \log \prod_{i=1}^m f_{\mathbf{X^{(i)}}}(\mathbf{x^{(i)} \mid \mu , \Sigma }) \\
& = \log \prod_{i=1}^m \frac{1}{(2 \pi)^{p/2} |\Sigma|^{1/2}} \exp \left( - \frac{1}{2} \mathbf{(x^{(i)} - \mu)^T \Sigma^{-1} (x^{(i)} - \mu) } \right) \\
& = \sum_{i=1}^m \left( - \frac{p}{2} \log (2 \pi) - \frac{1}{2} \log |\Sigma| - \frac{1}{2} \mathbf{(x^{(i)} - \mu)^T \Sigma^{-1} (x^{(i)} - \mu) } \right)
\end{aligned}
$$
$$
l(\mu, \Sigma) = - \frac{mp}{2} \log (2 \pi) - \frac{m}{2} \log |\Sigma| - \frac{1}{2} \sum_{i=1}^m \mathbf{(x^{(i)} - \mu)^T \Sigma^{-1} (x^{(i)} - \mu) }
$$
Setting the gradient with respect to $\mu$ to zero:
$$
\begin{aligned}
\frac{\partial }{\partial \mu} l(\mathbf{ \mu, \Sigma \mid x^{(i)} }) & = \sum_{i=1}^m \mathbf{ \Sigma^{-1} ( x^{(i)} - \mu ) } = 0 \\
& \quad\text{(since $\Sigma$ is positive definite)} \\
0 & = m \mu - \sum_{i=1}^m \mathbf{ x^{(i)} } \\
\hat \mu &= \frac{1}{m} \sum_{i=1}^m \mathbf{ x^{(i)} } = \mathbf{\bar{x}}
\end{aligned}
$$
For the covariance we will need:
$$
\frac{\partial}{\partial A} x^TAx =\frac{\partial}{\partial A} \mathrm{tr}\left[xx^TA\right] = [xx^T]^T = xx^T
$$
Rewriting the log-likelihood:
$$
\begin{aligned}
l(\mathbf{ \mu, \Sigma \mid x^{(i)} }) & = \text{C} - \frac{m}{2} \log |\Sigma| - \frac{1}{2} \sum_{i=1}^m \mathbf{(x^{(i)} - \mu)^T \Sigma^{-1} (x^{(i)} - \mu) } \\
& = \text{C} + \frac{m}{2} \log |\Sigma^{-1}| - \frac{1}{2} \sum_{i=1}^m \mathrm{tr}\left[ \mathbf{(x^{(i)} - \mu) (x^{(i)} - \mu)^T \Sigma^{-1} } \right] \\
\frac{\partial }{\partial \Sigma^{-1}} l(\mathbf{ \mu, \Sigma \mid x^{(i)} }) & = \frac{m}{2} \Sigma - \frac{1}{2} \sum_{i=1}^m \mathbf{(x^{(i)} - \mu) (x^{(i)} - \mu)}^T \quad \text{(since $\Sigma^T = \Sigma$)}
\end{aligned}
$$
$$
\begin{aligned}
0 &= m \Sigma - \sum_{i=1}^m \mathbf{(x^{(i)} - \mu) (x^{(i)} - \mu)}^T \\
\hat \Sigma & = \frac{1}{m} \sum_{i=1}^m \mathbf{(x^{(i)} - \hat \mu) (x^{(i)} -\hat \mu)}^T
\end{aligned}
$$
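The closed-form MLE above can be sketched in a few lines: the sample mean for $\hat\mu$, and the $1/m$-weighted outer products of centered samples for $\hat\Sigma$ (note the biased $1/m$, not the unbiased $1/(m-1)$). The ground-truth parameters below are illustrative:

```python
import numpy as np

# Recover mu and Sigma from data drawn from a known Gaussian via the MLE.
rng = np.random.default_rng(3)
mu_true = np.array([1.0, -1.0])
Sigma_true = np.array([[1.5, 0.4],
                       [0.4, 0.8]])
X = rng.multivariate_normal(mu_true, Sigma_true, size=50_000)  # rows: x^{(i)}

mu_hat = X.mean(axis=0)                      # hat mu = (1/m) sum x^{(i)}
centered = X - mu_hat
Sigma_hat = centered.T @ centered / len(X)   # hat Sigma = (1/m) sum outer products
```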
3. EM for the parameters of a GMM
A GMM is a model formed by a weighted linear combination of $k$ Gaussians. The main questions are:
(1) estimating the parameters $\theta$ of each Gaussian component;
(2) the mixture weights, viewed through a latent variable $z_i\sim [\pi_1,\pi_2,...,\pi_k]$;
(3) given the latent variable $z_i$, the corresponding conditional distribution $x_i \sim p(x\mid y_i,\theta_{y_i})$.
$$
p(x;\theta,\pi) = \sum^{k}_{c=1} \pi_c\,\mathcal{N}(x;\mu_c,\Sigma_c)
$$
The EM algorithm:
Deriving the E-step and M-step:
The MLE objective for a distribution is:
$$L(\theta)=\ln \mathcal{P}(\mathbf{X} \mid \theta)$$
Unlike MLE, which gives a closed form directly, EM iterates, requiring each step to improve the likelihood:
$$L(\theta)>L\left(\theta_n\right)$$
If $\mathcal{P}(X\mid\theta)$ is viewed as governed by a latent variable $\mathbf{z}$, then:
$$\mathcal{P}(\mathbf{X} \mid \theta)=\sum_{\mathbf{z}} \mathcal{P}(\mathbf{X} \mid \mathbf{z}, \theta)\, \mathcal{P}(\mathbf{z} \mid \theta)$$
Introduce Jensen's inequality (recall the log function is concave: the arc lies above the chord):
$$\ln \sum_{i=1}^n \lambda_i x_i \geq \sum_{i=1}^n \lambda_i \ln \left(x_i\right)$$
for constants $\lambda_i \geq 0$ with $\sum^n_{i=1}\lambda_i=1$. The latent variable's distribution qualifies: $\mathcal{P}(\mathbf{z}\mid X,\theta_n)\geq 0$ and $\sum_{\mathbf{z}}\mathcal{P}(\mathbf{z}\mid X,\theta_n)=1$.
$$
\begin{aligned}
\mathcal{L}&=\log\sum_{\mathbf{z}}p(\mathbf{x},\mathbf{z}) = \log\sum_{\mathbf{z}}q(\mathbf{z})\frac{p(\mathbf{x,z})}{q(\mathbf{z})}\\
&\geq \sum_{\mathbf{z}}q(\mathbf{z})\log\frac{p(\mathbf{x,z})}{q(\mathbf{z})}\\
&=\sum_{\mathbf{z}}q(\mathbf{z})\log p(\mathbf{x,z})-\sum_{\mathbf{z}}q(\mathbf{z})\log q(\mathbf{z})
\end{aligned}
$$
Define the entropy and the free-energy functional:
$$\mathcal{H}(q) \overset{\triangle}{=}-\sum_{\mathbf{z}}q(\mathbf{z})\log q(\mathbf{z})$$
$$\mathcal{F}(q,\theta)\overset{\triangle}{=} \sum_{\mathbf{z}}q(\mathbf{z})\log p(\mathbf{x,z})-\sum_{\mathbf{z}}q(\mathbf{z})\log q(\mathbf{z})$$
$\mathcal{F}(q,\theta)$ is a lower bound on the log-likelihood:
$$\mathcal{L}(\theta) \geq \mathcal{F}(q,\theta)$$
E-step: $q^{(t)} \leftarrow \arg\max_q \mathcal{F}(q,\theta^{t})$ — with $\theta^t$ fixed, determine the latent variable's distribution.
M-step: $\theta^{(t+1)} \leftarrow \arg\max_{\theta} \mathcal{F}(q^t,\theta)$ — with the latent distribution fixed, determine the model parameters $\theta$.
Property: every round improves $\mathcal{F}$ and therefore also improves $\mathcal{L}$.
E-step: find the latent distribution maximizing $\mathcal{F}(q,\theta^{(t)})$:
$$
\begin{aligned}
\mathcal{F}(q) &= \sum_{\mathbf{z}}q(\mathbf{z})\log p(\mathbf{x,z})-\sum_{\mathbf{z}}q(\mathbf{z})\log q(\mathbf{z})\\
&= \sum_{\mathbf{z}}q(\mathbf{z})\log \frac{p(\mathbf{z}\mid\mathbf{x})p(\mathbf{x})}{q(\mathbf{z})}\\
&= \sum_{\mathbf{z}}q(\mathbf{z})\log \frac{p(\mathbf{z}\mid\mathbf{x})}{q(\mathbf{z})}+\sum_{\mathbf{z}}q(\mathbf{z})\log p(\mathbf{x})\\
&= \sum_{\mathbf{z}}q(\mathbf{z})\log \frac{p(\mathbf{z}\mid\mathbf{x})}{q(\mathbf{z})}+\log p(\mathbf{x})\\
&=-\mathrm{KL}(q(\mathbf{z})\,\|\,p(\mathbf{z}\mid\mathbf{x}))+\mathcal{L}
\end{aligned}
$$
The KL term measures the discrepancy between the two distributions, and $\mathrm{KL}=0$ iff $q=p$. So only when $q^t(\mathbf{z})=p(\mathbf{z}\mid x,\theta^t)$ do we get $\mathcal{F}(q^t,\theta^t)=\mathcal{L}(\theta^t)$. Hence computing the posterior $p(\mathbf{z}\mid\mathbf{x},\theta^t)$ is exactly maximizing $\mathcal{F}(q,\theta^t)$.
M-step: find $\theta^{(t+1)}$ maximizing $\mathcal{F}(q^{(t)},\theta)$.
$$\mathcal{F}(q^{t},\theta)=\sum_{\mathbf{z}}q^{t}(\mathbf{z})\log p(\mathbf{x},\mathbf{z}\mid\theta)+ \mathcal{H}(q^{t})$$
$\mathcal{H}(q^t)$ is constant in this step, so we only need to maximize the expected complete-data log-likelihood:
$$\sum_{\mathbf{z}}q^{t}(\mathbf{z})\log p(\mathbf{x,z}\mid\theta)$$
Maximizing $\mathcal{F}$ also never decreases $\mathcal{L}$:
$$
\begin{aligned}
\mathcal{L}(\theta^{t+1}) &\geq \mathcal{F}(q^{t},\theta^{t+1})\\
&=\max_{\theta}\mathcal{F}(q^t,\theta)\\
&\geq \mathcal{F}(q^t,\theta^t)=\mathcal{L}(\theta^t)
\end{aligned}
$$
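The chain of inequalities above predicts that the data log-likelihood never decreases across EM iterations. A minimal 1-D, two-component GMM loop illustrating this (the synthetic data and starting values are illustrative; the M-step updates anticipate the closed forms derived later in this section):

```python
import numpy as np

def normal_pdf(x, mu, var):
    # univariate Gaussian density
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

rng = np.random.default_rng(4)
data = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.5, 700)])

pi = np.array([0.5, 0.5])        # mixture weights
mu = np.array([-1.0, 1.0])       # component means
var = np.array([1.0, 1.0])       # component variances
lls = []
for _ in range(30):
    # E-step: responsibilities p(z = k | x_i, theta^t)
    dens = pi * normal_pdf(data[:, None], mu, var)       # shape (n, 2)
    lls.append(np.log(dens.sum(axis=1)).sum())           # current log-likelihood
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: closed-form weighted updates
    Nk = resp.sum(axis=0)
    pi = Nk / len(data)
    mu = (resp * data[:, None]).sum(axis=0) / Nk
    var = (resp * (data[:, None] - mu) ** 2).sum(axis=0) / Nk
```

Plotting `lls` shows a monotonically non-decreasing curve, which is the guarantee the derivation establishes.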
E-step and M-step for the GMM: EM on a GMM is still hard; below I transcribe the whiteboard version.
The whiteboard GMM:
$$p(x)=\sum^K_{k=1}p_k\,\mathcal{N}(x\mid\mu_k,\Sigma_k)$$
$X$: observed data $\rightarrow x_1,x_2, ...,x_N$
$(X,Z)$: complete data $\rightarrow (x_1,z_1),(x_2,z_2), ..., (x_N,z_N)$
$x$: observed variable; $z$: latent variable.
| z | $c_1$ | $c_2$ | $\dots$ | $c_K$ |
|---|---|---|---|---|
| p | $p_1$ | $p_2$ | $\dots$ | $p_K$ |
with $\sum^K_{k=1}p_k=1$ and $x\mid z=c_k\sim\mathcal{N}(x\mid\mu_k,\Sigma_k)$. For the latent variable $z$, Li Hang's book gives the deeper reading: generating $x$ means first picking a Gaussian via the latent variable, then sampling from that Gaussian. The latent variable is discrete, so which Gaussian gets picked itself follows a probability distribution, and the individual Gaussians then act i.i.d. For sheer conciseness, though, the whiteboard's formulas win.
$$
\begin{aligned}
p(y,\gamma\mid\theta) &= \prod^N_{j=1}p(y_j,\gamma_{j1},\gamma_{j2},\cdots,\gamma_{jK}\mid\theta)\\
&=\prod^K_{k=1}\prod^N_{j=1}[\alpha_k\phi(y_j\mid\theta_k)]^{\gamma_{jk}}\\
&\quad\text{(the order of the two products does not matter)}\\
&=\prod^N_{j=1}\prod^K_{k=1}[\alpha_k\phi(y_j\mid\theta_k)]^{\gamma_{jk}}\\
&=\prod^N_{j=1}\underbrace{[\alpha_k\phi(y_j\mid\theta_k)]\cdot 1\cdot 1 \cdots 1}_{K\ \text{factors}}
\end{aligned}
$$
where, in the last line, $k$ is the component with $\gamma_{jk}=1$ for observation $j$; the remaining $K-1$ factors equal $1$.
$\alpha_k$ is the weight (geometric view), or probability value (mixture-model view), of the component responsible for the $j$-th observation.
The E-step notation $\arg\max E_{Z\mid X,\theta^{(t)}}[\log P(X,Z\mid\theta)]$ means: $\int_{Z} \log p(X,Z\mid\theta)\,p(Z\mid X,\theta^{t})\,dZ$, an integral (a sum, for discrete $Z$) over the latent variable, with the observations and the previous iteration's Gaussian parameters fixed.
$$
\begin{aligned}
Q(\theta,\theta^{t}) &= \int_Z \log P(X,Z\mid\theta)\,p(Z\mid X,\theta^t)\,dZ\\
&=\sum_{Z}\log\Big[\prod^N_{i=1}P(x_i,Z_i\mid\theta)\Big]\Big[\prod^N_{i=1}P(Z_i\mid x_i,\theta^t)\Big]\\
&=\sum_{Z}\Big[\sum^N_{i=1}\log P(x_i,Z_i\mid\theta)\Big]\Big[\prod^N_{i=1}P(Z_i\mid x_i,\theta^t)\Big]\\
&=\sum_{Z_1,Z_2,\cdots,Z_N} \sum^N_{i=1}\log P(x_i,Z_i\mid\theta)\prod^N_{i=1}P(Z_i\mid x_i,\theta^t)\\
&=\sum^N_{i=1}\sum_{Z_i}\log P(x_i,Z_i\mid\theta)\,P(Z_i\mid x_i,\theta^t)
\end{aligned}
$$
In the expression above:
$$
\begin{aligned}
&\sum_{Z_1,Z_2,\cdots,Z_N}\sum^N_{i=1}\log P(x_i,Z_i\mid\theta)\prod^N_{i=1}P(Z_i\mid x_i,\theta^{t})\\
&= \sum_{Z_1,Z_2,\cdots,Z_N}\big[\log P(x_1,Z_1\mid\theta)+\log P(x_2,Z_2\mid\theta)+\cdots+\log P(x_N,Z_N\mid\theta)\big]\prod^N_{i=1}P(Z_i\mid x_i,\theta^t)
\end{aligned}
$$
Taking the first term:
$$
\begin{aligned}
&\sum_{Z_1,Z_2,\cdots,Z_N}\log P(x_1,Z_1\mid\theta)\prod^N_{i=1}P(Z_i\mid x_i,\theta^t)\\
&= \sum_{Z_1,Z_2,\cdots,Z_N}\log P(x_1,Z_1\mid\theta)\,P(Z_1\mid x_1,\theta^t)\prod^N_{i=2}P(Z_i\mid x_i,\theta^t)\\
&= \sum_{Z_1}\log P(x_1,Z_1\mid\theta)\,P(Z_1\mid x_1,\theta^t) \sum_{Z_2,\cdots,Z_N}\prod^N_{i=2}P(Z_i\mid x_i,\theta^t)
\end{aligned}
$$
Look at the trailing sum of products:
$$
\begin{aligned}
\sum_{Z_2,\cdots,Z_N}P(Z_2\mid x_2,\theta^t)\,P(Z_3\mid x_3,\theta^t)\cdots P(Z_N\mid x_N,\theta^t)
&=\sum_{Z_2}P(Z_2\mid x_2)\sum_{Z_3}P(Z_3\mid x_3)\cdots\sum_{Z_N}P(Z_N\mid x_N)\\
&=1 \cdot 1 \cdots 1 \cdot 1
\end{aligned}
$$
so only
$$\sum_{Z_1}\log P(x_1,Z_1\mid\theta)\,P(Z_1\mid x_1,\theta^t)$$
remains, and likewise for the other terms.
Substituting into the simplified $Q(\theta,\theta^t)$:
$$P(x,Z)=P(Z)\,P(x\mid Z)=p_Z\,\mathcal{N}(x\mid\mu_Z,\Sigma_Z)$$
$$p(Z\mid x) = \frac{p(x,Z)}{p(x)} = \frac{p_Z\,\mathcal{N}(x\mid\mu_Z,\Sigma_Z)}{\sum^K_{k=1}p_k\,\mathcal{N}(x\mid\mu_k,\Sigma_k)}$$
This gives:
$$Q(\theta,\theta^t)=\sum^N_{i=1}\sum_{Z_i}\log\big[p_{Z_i}\mathcal{N}(x_i\mid\mu_{Z_i},\Sigma_{Z_i})\big]\frac{p^t_{Z_i}\mathcal{N}(x_i\mid\mu^t_{Z_i},\Sigma^t_{Z_i})}{\sum^K_{k=1}p^t_k\mathcal{N}(x_i\mid\mu^t_k,\Sigma^t_k)}$$
where the second factor is the posterior evaluated at the previous parameters $\theta^t$.
M-step: solve for the parameters in
$$Q = \sum^K_{k=1}\sum^N_{i=1}\big[\log p_k+\log\mathcal{N}(x_i\mid\mu_k,\Sigma_k)\big]\,p(Z_i=k\mid x_i,\theta^t)$$
First solve for $p^{t+1}_k$, subject to the constraint $\sum^K_{k=1}p_k=1$:
$$L = \sum^K_{k=1}\sum^N_{i=1}\log p_k\,p(z_i=k\mid x_i,\theta^t)+\lambda\Big(\sum^K_{k=1}p_k-1\Big)$$
The $\log\mathcal{N}(x_i\mid\mu_k,\Sigma_k)\,p(Z_i=k\mid x_i,\theta^t)$ term drops out because it is constant with respect to $p_k$. Taking the partial derivative of $L$ with respect to $p_k$:
$$\frac{\partial L}{\partial p_k}=\sum^N_{i=1}\frac{1}{p_k}p(z_i=k\mid x_i,\theta^t)+\lambda=0$$
Multiplying both sides by $p_k$ to solve for $\lambda$:
$$\sum^N_{i=1}p(z_i=k\mid x_i,\theta^t)+\lambda p_k = 0$$
Using $\sum^K_{k=1}p_k=1$ and summing over $k$:
$$
\begin{aligned}
&\sum^K_{k=1}\sum^N_{i=1}p(z_i=k\mid x_i,\theta^t)+\lambda \sum^K_{k=1}p_k=0\\
&\lambda = -\sum^N_{i=1}\underbrace{\sum^K_{k=1}p(z_i=k\mid x_i,\theta^t)}_{1}=-N
\end{aligned}
$$
Substituting $\lambda$ back:
$$p^{t+1}_{k}=\frac{\sum^N_{i=1}p(z_i=k\mid x_i,\theta^t)}{N}$$
Solving for $\mu_k$ and $\Sigma_k$ parallels the MLE case (use the matrix-derivative rules; also note the generalized-EM idea of updating later parameters with the already-updated earlier ones):
$$
\begin{aligned}
&\sum^K_{k=1}\sum^N_{i=1}\big[\log\mathcal{N}(x_i\mid\mu_k,\Sigma_k)\big]\,p(Z_i=k\mid x_i,\theta^t) \\
&= \sum^K_{k=1}\sum^N_{i=1}\Big[-\frac{p}{2}\log(2\pi)-\frac{1}{2}\log|\Sigma_k|-\frac{1}{2}(x_i-\mu_k)^T\Sigma^{-1}_k(x_i-\mu_k)\Big]\,p(Z_i=k\mid x_i,\theta^t)
\end{aligned}
$$
Taking the partial derivative with respect to $\mu_k$ and setting it to zero:
$$\mu^{t+1}_k = \frac{\sum^N_{i=1}p(Z_i=k\mid x_i,\theta^t)\,x_i}{\sum^N_{i=1}p(Z_i=k\mid x_i,\theta^t)}$$
Taking the partial derivative with respect to $\Sigma_k$ (note the sum $\sum^N_{i=1}p(z_i=k\mid x_i,\theta^t)$ appearing as the outer weight):
$$\Sigma^{t+1}_{k}=\frac{\sum^N_{i=1}p(z_i=k\mid x_i,\theta^t)(x_i-\mu^{t+1}_k)(x_i-\mu^{t+1}_k)^T}{\sum^N_{i=1}p(Z_i=k\mid x_i,\theta^t)}$$
GMR: combining the GMM with the Gaussian conditional distribution:
$$\pi_{y\mid x,k}=\frac{\mathcal{N}_k(x\mid\mu_{x,k},\Sigma_{x,k})}{\sum^K_{l=1}\mathcal{N}_l(x\mid\mu_{x,l},\Sigma_{x,l})}$$
$$p(y\mid x)=\sum^{K}_{k=1}\pi_{y\mid x,k}\,\mathcal{N}_k(y\mid\mu_{y\mid x,k},\Sigma_{y\mid x,k})$$
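A small GMR sketch for scalar $x$ and $y$, with equal component priors assumed for brevity (all numbers are illustrative): each joint 2-D component is conditioned on $x$ via the Section 1.2 formulas, and the mixture weights are re-weighted by how well each component explains the query $x$:

```python
import numpy as np

def normal_pdf(v, m, var):
    # univariate Gaussian density
    return np.exp(-0.5 * (v - m) ** 2 / var) / np.sqrt(2 * np.pi * var)

# two joint components over (x, y); equal priors cancel in pi_{y|x,k}
mus = [np.array([-1.0, -2.0]), np.array([2.0, 3.0])]
Sigmas = [np.array([[1.0, 0.6], [0.6, 1.0]]),
          np.array([[1.5, -0.4], [-0.4, 0.8]])]

x_query = 0.5
w = np.array([normal_pdf(x_query, m[0], S[0, 0]) for m, S in zip(mus, Sigmas)])
w /= w.sum()                                  # pi_{y|x,k}

# per-component conditional mean and variance of y given x (Section 1.2)
cond_means = np.array([m[1] + S[1, 0] / S[0, 0] * (x_query - m[0])
                       for m, S in zip(mus, Sigmas)])
cond_vars = np.array([S[1, 1] - S[1, 0] ** 2 / S[0, 0] for _, S in zip(mus, Sigmas)])
y_mean = w @ cond_means                       # E[y | x] under p(y | x)
```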