Math Cheat Sheet: Basic Operations on Gaussians

Preface:

Basic operations on Gaussians are used widely in both SLAM and robot learning. They include: linear transformations, the marginal and conditional distributions of a multivariate Gaussian, MLE for the mean and covariance, and EM for mixture models. This post draws on the whiteboard-derivation (白板推导) series and a SLAM-basics blog post. For the conditional distribution I follow the SLAM-style route, i.e. the factorization $P(a,b) = P(b)\,P(a|b)$; PRML takes a similar route but completes the square. I then follow the whiteboard series in introducing the Schur complement, except that instead of the construction used in the videos, I apply the Schur complement twice.

1. MVN (Multivariate Normal Distribution)

1.1 Linear transformation:
If $\boldsymbol{x}\sim \mathcal{N}(\boldsymbol{\mu},\boldsymbol{\Sigma})$ and we apply the linear transformation $\boldsymbol{y} = \boldsymbol{Ax+b}$, then $\boldsymbol{y} \sim \mathcal{N}(A\mu+b,\,A\Sigma A^T)$.
Proof:
$$\begin{equation} \begin{split} \mathrm{E}[y]&=\mathrm{E}[Ax+b]=A\,\mathrm{E}[x]+b=A\mu+b\\ \mathrm{Var}[y]&=\mathrm{Var}[Ax+b]=\mathrm{Var}[Ax]=A\,\mathrm{Var}[x]\,A^T=A\Sigma A^T \end{split} \end{equation}$$
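To make this concrete, here is a minimal numpy sketch (the values of $\mu$, $\Sigma$, $A$, $b$ are arbitrary, my own) that checks both moments by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
A = np.array([[1.0, 2.0],
              [0.0, 1.0]])
b = np.array([3.0, -1.0])

# Sample x ~ N(mu, Sigma) and push the samples through y = Ax + b.
x = rng.multivariate_normal(mu, Sigma, size=200_000)
y = x @ A.T + b

# Empirical moments should match A mu + b and A Sigma A^T.
print(y.mean(axis=0))           # ~ A @ mu + b
print(np.cov(y, rowvar=False))  # ~ A @ Sigma @ A.T
```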

1.2 From linear transformations to the conditional and marginal distributions of the joint Gaussian $(x_a,x_b)\sim \mathcal{N}(\mu,\Sigma)$, $\Sigma=\begin{bmatrix}\Sigma_{aa}&\Sigma_{ab}\\ \Sigma_{ba}& \Sigma_{bb}\end{bmatrix}$.

  • Given the joint Gaussian, find $p(x_a)$, $p(x_b)$, $p(x_a|x_b)$, $p(x_b|x_a)$.

  • $p(x_a)$ and $p(x_b)$:
$$\begin{aligned} x_a &= [I,0]\begin{bmatrix}x_a\\x_b\end{bmatrix}, & x_b &= [0,I]\begin{bmatrix}x_a \\ x_b \end{bmatrix} \end{aligned}$$
$$\begin{aligned} \mathrm{E}[x_a]&=[I,0]\begin{bmatrix}\mu_a\\ \mu_b\end{bmatrix}=\mu_a, & \mathrm{Var}[x_a]&=[I,0]\begin{bmatrix}\Sigma_{aa} & \Sigma_{ab}\\ \Sigma_{ba} & \Sigma_{bb} \end{bmatrix} \begin{bmatrix}I \\ 0 \end{bmatrix} = \Sigma_{aa}\\ \mathrm{E}[x_b]&=[0,I]\begin{bmatrix}\mu_a\\ \mu_b\end{bmatrix}=\mu_b, & \mathrm{Var}[x_b]&=[0,I]\begin{bmatrix}\Sigma_{aa} & \Sigma_{ab}\\ \Sigma_{ba} & \Sigma_{bb} \end{bmatrix} \begin{bmatrix}0 \\ I \end{bmatrix} = \Sigma_{bb} \end{aligned}$$
So $x_a \sim \mathcal{N}(\mu_a,\Sigma_{aa})$ and $x_b\sim \mathcal{N}(\mu_b,\Sigma_{bb})$.
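In code the marginal is literally a slice of $\mu$ and $\Sigma$; a small sketch (index sets and values are made up):

```python
import numpy as np

rng = np.random.default_rng(1)

mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[1.0, 0.3, 0.2],
                  [0.3, 2.0, 0.5],
                  [0.2, 0.5, 1.5]])
a = [0]      # indices of x_a
b = [1, 2]   # indices of x_b

# Marginals: just slice the mean vector and covariance matrix.
mu_a, Sigma_aa = mu[a], Sigma[np.ix_(a, a)]
mu_b, Sigma_bb = mu[b], Sigma[np.ix_(b, b)]

# Check: the marginal moments of joint samples match the slices.
x = rng.multivariate_normal(mu, Sigma, size=200_000)
print(x[:, b].mean(axis=0))           # ~ mu_b
print(np.cov(x[:, b], rowvar=False))  # ~ Sigma_bb
```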

  • $p(x_a|x_b)$ and $p(x_b|x_a)$: here we need the Schur complement (see the Schur-complement cheat sheet).

First, for the Gaussian density we can drop the normalizing factor in front and look only at the $\exp(\cdot)$ part:
$$p(x_a,x_b)\propto \exp\left(-\frac{1}{2}\begin{bmatrix}x_a\\x_b\end{bmatrix}^T\begin{bmatrix}\Sigma_{aa}& \Sigma_{ab}\\ \Sigma_{ba} & \Sigma_{bb}\end{bmatrix}^{-1}\begin{bmatrix}x_a\\ x_b\end{bmatrix}\right)$$
This expression requires inverting the block covariance matrix, which is where the Schur complement comes in. Below are some basic Schur-complement identities:
$$\begin{bmatrix}I & 0\\-CA^{-1}& I\end{bmatrix}\begin{bmatrix}A&B\\C &D\end{bmatrix}\begin{bmatrix}I & -A^{-1}B\\0& I\end{bmatrix}=\begin{bmatrix}A& 0 \\0& D-CA^{-1}B\end{bmatrix}$$
$$\begin{bmatrix}I & -BD^{-1}\\ 0 & I \end{bmatrix}\begin{bmatrix} A & B \\ C & D \end{bmatrix}\begin{bmatrix} I & 0\\ -D^{-1}C & I \end{bmatrix} = \begin{bmatrix} A-BD^{-1}C & 0\\ 0 & D \end{bmatrix}$$
$$\begin{bmatrix}A&B\\C&D\end{bmatrix} = \begin{bmatrix}I & 0\\CA^{-1}&I\end{bmatrix}\begin{bmatrix}A& 0 \\0 & D-CA^{-1}B\end{bmatrix}\begin{bmatrix}I & A^{-1}B \\ 0 & I\end{bmatrix}$$
$$\begin{bmatrix}A & B \\C &D\end{bmatrix}^{-1}=\begin{bmatrix}I & -A^{-1}B\\0 &I\end{bmatrix}\begin{bmatrix}A^{-1}& 0\\0& (D-CA^{-1}B)^{-1}\end{bmatrix}\begin{bmatrix}I & 0 \\-CA^{-1} & I\end{bmatrix}$$
$$\begin{bmatrix} A & B\\ C & D \end{bmatrix} = \begin{bmatrix} I & BD^{-1}\\ 0 & I \end{bmatrix}\begin{bmatrix} A-BD^{-1}C & 0 \\ 0 & D \end{bmatrix}\begin{bmatrix} I & 0\\ D^{-1}C & I \end{bmatrix}$$
$$\begin{bmatrix} A & B \\ C & D \end{bmatrix}^{-1}=\begin{bmatrix} I & 0\\ -D^{-1}C &I \end{bmatrix}\begin{bmatrix} (A-BD^{-1}C)^{-1} & 0\\ 0 & D^{-1} \end{bmatrix}\begin{bmatrix} I & -BD^{-1}\\ 0 & I \end{bmatrix}$$
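A quick numerical check of the last identity (a sketch; the blocks come from a random positive-definite matrix so that $D$ and the Schur complement are invertible):

```python
import numpy as np

rng = np.random.default_rng(2)

# Random symmetric positive-definite 4x4 matrix, split into 2x2 blocks.
M = rng.normal(size=(4, 4))
M = M @ M.T + 4 * np.eye(4)
A, B = M[:2, :2], M[:2, 2:]
C, D = M[2:, :2], M[2:, 2:]

I, Z = np.eye(2), np.zeros((2, 2))
Dinv = np.linalg.inv(D)
S = A - B @ Dinv @ C  # Schur complement of D

left = np.block([[I, Z], [-Dinv @ C, I]])
mid = np.block([[np.linalg.inv(S), Z], [Z, Dinv]])
right = np.block([[I, -B @ Dinv], [Z, I]])

print(np.allclose(left @ mid @ right, np.linalg.inv(M)))  # True
```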

Consider the exponent of $p(x_a,x_b)$:
$$\begin{align} -\frac{1}{2}\begin{bmatrix}x_a\\ x_b\end{bmatrix}^T\begin{bmatrix} \Sigma_{aa} & \Sigma_{ab}\\ \Sigma_{ba} & \Sigma_{bb} \end{bmatrix}^{-1}\begin{bmatrix} x_a \\ x_b \end{bmatrix} &= -\frac{1}{2} \begin{bmatrix} x_a \\ x_b \end{bmatrix}^T\begin{bmatrix}\Lambda_{aa} & \Lambda_{ab}\\ \Lambda_{ba} & \Lambda_{bb} \end{bmatrix} \begin{bmatrix} x_a\\ x_b\end{bmatrix} \\ &= -\frac{1}{2}\begin{bmatrix} x_a\\ x_b\end{bmatrix}^T \begin{bmatrix} I & 0 \\ \Lambda_{ba}\Lambda_{aa}^{-1} & I \end{bmatrix} \begin{bmatrix} \Lambda_{aa} & 0\\ 0 & \Lambda_{bb}-\Lambda_{ba}\Lambda_{aa}^{-1}\Lambda_{ab} \end{bmatrix}\begin{bmatrix} I & \Lambda^{-1}_{aa}\Lambda_{ab}\\ 0 & I \end{bmatrix}\begin{bmatrix}x_a \\ x_b \end{bmatrix}\\ &= -\frac{1}{2} (x_a+\Lambda^{-1}_{aa}\Lambda_{ab}x_b)^T\Lambda_{aa}(x_a+\Lambda^{-1}_{aa}\Lambda_{ab}x_b) -\frac{1}{2} x_b^T(\Lambda_{bb}-\Lambda_{ba}\Lambda^{-1}_{aa}\Lambda_{ab})x_b \end{align}$$
(the last step uses the symmetry of $\Lambda$, so that $x_b^T\Lambda_{ba}\Lambda_{aa}^{-1}=(\Lambda_{aa}^{-1}\Lambda_{ab}x_b)^T$).
The precision matrix $\Lambda$ can be obtained from its relation to the covariance matrix $\Sigma$ via the Schur complement:
$$\begin{align} \begin{bmatrix} \Lambda_{aa} & \Lambda_{ab} \\ \Lambda_{ba} & \Lambda_{bb} \end{bmatrix} &=\begin{bmatrix} \Sigma_{aa} & \Sigma_{ab}\\ \Sigma_{ba} & \Sigma_{bb} \end{bmatrix}^{-1} \\ &= \begin{bmatrix} I & 0\\ -D^{-1}C & I \end{bmatrix} \begin{bmatrix} (A-BD^{-1}C)^{-1} & 0\\ 0 & D^{-1} \end{bmatrix} \begin{bmatrix} I & -BD^{-1}\\ 0 & I \end{bmatrix} \\ &= \begin{bmatrix} I & 0\\ -\Sigma^{-1}_{bb}\Sigma_{ba} &I \end{bmatrix} \begin{bmatrix} (\Sigma_{aa}-\Sigma_{ab}\Sigma^{-1}_{bb}\Sigma_{ba})^{-1} & 0\\ 0 & \Sigma^{-1}_{bb} \end{bmatrix}\begin{bmatrix} I & -\Sigma_{ab}\Sigma^{-1}_{bb}\\ 0 & I \end{bmatrix}\\ &=\begin{bmatrix} (\Sigma_{aa}-\Sigma_{ab}\Sigma^{-1}_{bb}\Sigma_{ba})^{-1} & -(\Sigma_{aa}-\Sigma_{ab}\Sigma^{-1}_{bb}\Sigma_{ba})^{-1}\Sigma_{ab}\Sigma^{-1}_{bb}\\ -\Sigma^{-1}_{bb}\Sigma_{ba}(\Sigma_{aa}-\Sigma_{ab}\Sigma^{-1}_{bb}\Sigma_{ba})^{-1} & \Sigma^{-1}_{bb}+\Sigma^{-1}_{bb}\Sigma_{ba}(\Sigma_{aa}-\Sigma_{ab}\Sigma^{-1}_{bb}\Sigma_{ba})^{-1}\Sigma_{ab}\Sigma^{-1}_{bb} \end{bmatrix} \end{align}$$

For the second term in the last line of (4):
$$\begin{align} \Lambda_{bb}-\Lambda_{ba}\Lambda^{-1}_{aa}\Lambda_{ab} &= \Sigma^{-1}_{bb}+\Sigma^{-1}_{bb}\Sigma_{ba}(\Sigma_{aa}-\Sigma_{ab}\Sigma^{-1}_{bb}\Sigma_{ba})^{-1}\Sigma_{ab}\Sigma^{-1}_{bb} -\Sigma^{-1}_{bb}\Sigma_{ba}(\Sigma_{aa}-\Sigma_{ab}\Sigma^{-1}_{bb}\Sigma_{ba})^{-1}\\ &\quad \cdot(\Sigma_{aa}-\Sigma_{ab}\Sigma^{-1}_{bb}\Sigma_{ba})(\Sigma_{aa}-\Sigma_{ab}\Sigma^{-1}_{bb}\Sigma_{ba})^{-1}\Sigma_{ab}\Sigma^{-1}_{bb}\\ &= \Sigma^{-1}_{bb} \end{align}$$
so this term is exactly the exponent of the marginal $p(x_b)\sim\mathcal{N}(\mu_b,\Sigma_{bb})$.
For the first term of (4), the quadratic form shows that the conditional precision is $\Lambda_{aa}$, i.e. the conditional covariance is $\Lambda^{-1}_{aa}=\Sigma_{aa}-\Sigma_{ab}\Sigma^{-1}_{bb}\Sigma_{ba}$. For the mean, substitute the blocks computed above into
$$x_a + \Lambda^{-1}_{aa}\Lambda_{ab} x_b$$
Since $\Lambda_{aa}=(\Sigma_{aa}-\Sigma_{ab}\Sigma^{-1}_{bb}\Sigma_{ba})^{-1}$ and $\Lambda_{ab}=-(\Sigma_{aa}-\Sigma_{ab}\Sigma^{-1}_{bb}\Sigma_{ba})^{-1}\Sigma_{ab}\Sigma^{-1}_{bb}$, we get $\Lambda^{-1}_{aa}\Lambda_{ab}=-\Sigma_{ab}\Sigma^{-1}_{bb}$, so this reduces to
$$x_a-\Sigma_{ab}\Sigma^{-1}_{bb}x_b$$

Note that this corresponds to the $(x-\mu)$ part of the standard Gaussian quadratic form $(x-\mu)^T\Sigma^{-1}(x-\mu)$. Writing the $x_a$ expression in standard form (with the means restored) gives $(x_a-\mu_a)-\Sigma_{ab}\Sigma^{-1}_{bb}(x_b-\mu_b)$. The whiteboard series names this quantity $x_{a\cdot b}$, which genuinely confused me until I realized: $p(a|b)$ is the distribution of $a$ given that $b$ has occurred, so its quadratic form is $(x_a-\mu_{a|b})^T\Sigma^{-1}_{a|b}(x_a-\mu_{a|b})$. Matching the two expressions, the conditional mean of $a$ given $b$ is $\mu_a+\Sigma_{ab}\Sigma^{-1}_{bb}(x_b-\mu_b)$.
Applying the Schur complement twice is the part I found interesting, but working it through, you still need the definition of the conditional to read off the result; it does not fall out of pure manipulation. Readers who made it this far may prefer the SLAM blog post. The final formulas are easy to use, though; see the sketch below.
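A minimal numpy sketch of the result (the helper name `conditional_gaussian` and all numbers are my own, for illustration), computing $\mu_{a|b}=\mu_a+\Sigma_{ab}\Sigma^{-1}_{bb}(x_b-\mu_b)$ and $\Sigma_{a|b}=\Sigma_{aa}-\Sigma_{ab}\Sigma^{-1}_{bb}\Sigma_{ba}$:

```python
import numpy as np

def conditional_gaussian(mu, Sigma, a, b, xb):
    """Parameters of p(x_a | x_b = xb) for (x_a, x_b) ~ N(mu, Sigma)."""
    mu_a, mu_b = mu[a], mu[b]
    S_aa = Sigma[np.ix_(a, a)]
    S_ab = Sigma[np.ix_(a, b)]
    S_ba = Sigma[np.ix_(b, a)]
    S_bb = Sigma[np.ix_(b, b)]
    K = S_ab @ np.linalg.inv(S_bb)    # "gain" Sigma_ab Sigma_bb^{-1}
    mu_cond = mu_a + K @ (xb - mu_b)  # conditional mean
    Sigma_cond = S_aa - K @ S_ba      # Schur complement of Sigma_bb
    return mu_cond, Sigma_cond

mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[1.0, 0.3, 0.2],
                  [0.3, 2.0, 0.5],
                  [0.2, 0.5, 1.5]])
print(conditional_gaussian(mu, Sigma, [0], [1, 2], np.array([1.5, -0.5])))
```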

2. MLE (Maximum Likelihood Estimation)

2.1 Matrix calculus used in the MLE derivation:
For proofs, see the matrix-derivative formula notes.
(1) $\frac{\partial w^TAw}{\partial w}=(A+A^T)w$, which reduces to $2Aw$ for symmetric $A$
(2) $\mathrm{tr}[ABC]=\mathrm{tr}[CAB]=\mathrm{tr}[BCA]$
(3) $x^TAx=\mathrm{tr}[x^TAx]=\mathrm{tr}[xx^TA]$
(4) $\frac{\partial}{\partial A}\mathrm{tr}[AB]=B^T$
(5) $\frac{\partial}{\partial A}\log|A|=(A^{-1})^T=(A^T)^{-1}$
(6) $\mathrm{tr}[AB]=\mathrm{tr}[BA]$
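Identity (4) is easy to check numerically; a small sketch (random matrices; forward differences suffice here since $\mathrm{tr}[AB]$ is linear in $A$):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(3, 3))
B = rng.normal(size=(3, 3))
eps = 1e-6

# Finite-difference gradient of f(A) = tr(AB) with respect to A.
grad = np.zeros_like(A)
for i in range(3):
    for j in range(3):
        E = np.zeros_like(A)
        E[i, j] = eps
        grad[i, j] = (np.trace((A + E) @ B) - np.trace(A @ B)) / eps

print(np.allclose(grad, B.T, atol=1e-4))  # identity (4): d tr[AB]/dA = B^T
```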
For $m$ i.i.d. samples $X^{(i)}\sim \mathcal{N}(\mu,\Sigma)$, the probability of drawing exactly the observed samples is $\prod_{i=1}^m f_{X^{(i)}}(x^{(i)}; \mu, \Sigma)$. Taking the log-likelihood:
$$\begin{aligned} l(\mu, \Sigma \mid x^{(i)}) & = \log \prod_{i=1}^m f_{X^{(i)}}(x^{(i)} \mid \mu , \Sigma) \\ & = \log \prod_{i=1}^m \frac{1}{(2 \pi)^{p/2} |\Sigma|^{1/2}} \exp \left( - \frac{1}{2} (x^{(i)} - \mu)^T \Sigma^{-1} (x^{(i)} - \mu) \right) \\ & = \sum_{i=1}^m \left( - \frac{p}{2} \log (2 \pi) - \frac{1}{2} \log |\Sigma| - \frac{1}{2} (x^{(i)} - \mu)^T \Sigma^{-1} (x^{(i)} - \mu) \right) \end{aligned}$$
$$l(\mu, \Sigma) = - \frac{mp}{2} \log (2 \pi) - \frac{m}{2} \log |\Sigma| - \frac{1}{2} \sum_{i=1}^m (x^{(i)} - \mu)^T \Sigma^{-1} (x^{(i)} - \mu)$$

$$\begin{aligned} \frac{\partial }{\partial \mu} l(\mu, \Sigma \mid x^{(i)}) & = \sum_{i=1}^m \Sigma^{-1} ( x^{(i)} - \mu ) = 0 \\ & \quad \text{(since $\Sigma$ is positive definite, multiply through by $\Sigma$)} \\ 0 & = m \mu - \sum_{i=1}^m x^{(i)} \\ \hat \mu &= \frac{1}{m} \sum_{i=1}^m x^{(i)} = \bar{x} \end{aligned}$$

Using identities (3) and (4):
$$\frac{\partial}{\partial A} x^TAx =\frac{\partial}{\partial A} \mathrm{tr}\left[xx^TA\right] = \left[xx^T\right]^T = xx^T$$
Rewriting the log-likelihood and differentiating with respect to $\Sigma^{-1}$:
$$\begin{aligned} l(\mu, \Sigma \mid x^{(i)}) & = C - \frac{m}{2} \log |\Sigma| - \frac{1}{2} \sum_{i=1}^m (x^{(i)} - \mu)^T \Sigma^{-1} (x^{(i)} - \mu) \\ & = C + \frac{m}{2} \log |\Sigma^{-1}| - \frac{1}{2} \sum_{i=1}^m \mathrm{tr}\left[ (x^{(i)} - \mu) (x^{(i)} - \mu)^T \Sigma^{-1} \right] \\ \frac{\partial }{\partial \Sigma^{-1}} l(\mu, \Sigma \mid x^{(i)}) & = \frac{m}{2} \Sigma - \frac{1}{2} \sum_{i=1}^m (x^{(i)} - \mu) (x^{(i)} - \mu)^T \quad \text{(since $\Sigma^T = \Sigma$)} \end{aligned}$$

$$\begin{aligned} 0 &= m \Sigma - \sum_{i=1}^m (x^{(i)} - \mu) (x^{(i)} - \mu)^T \\ \hat \Sigma & = \frac{1}{m} \sum_{i=1}^m (x^{(i)} - \hat \mu) (x^{(i)} -\hat \mu)^T \end{aligned}$$
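A minimal numpy sketch of both estimators on synthetic data (note the MLE uses $1/m$, while `np.cov` defaults to the unbiased $1/(m-1)$):

```python
import numpy as np

rng = np.random.default_rng(4)
mu_true = np.array([1.0, -1.0])
Sigma_true = np.array([[1.5, 0.4],
                       [0.4, 0.8]])
X = rng.multivariate_normal(mu_true, Sigma_true, size=100_000)
m = X.shape[0]

mu_hat = X.mean(axis=0)                # MLE mean = sample mean
centered = X - mu_hat
Sigma_hat = centered.T @ centered / m  # MLE covariance (1/m, biased)

print(mu_hat, Sigma_hat, sep="\n")
# np.cov uses the unbiased 1/(m-1) normalization by default:
print(np.allclose(Sigma_hat, np.cov(X, rowvar=False) * (m - 1) / m))
```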

Reference link

3. Solving for the GMM parameters with EM

GMM: a model formed as a weighted linear combination of $k$ Gaussians. The main questions are (a sampling sketch follows the density below):
(1) estimating the parameters $\theta$ of each Gaussian component;
(2) treating the component assignment as a latent variable $z_i\sim [\pi_1,\pi_2,...,\pi_k]$;
(3) given the latent variable $z_i$, the corresponding conditional distribution $x_i \sim p(x|z_i,\theta_{z_i})$.
$$p(x;\theta,\pi) = \sum^{k}_{c=1} \pi_c\,\mathcal{N}(x;\mu_c,\Sigma_c)$$
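The generative reading of (2) and (3), first draw $z$, then draw $x$ from the chosen Gaussian, can be written directly as code (a sketch; the weights and component parameters are made up):

```python
import numpy as np

rng = np.random.default_rng(5)

pi = np.array([0.3, 0.7])  # mixture weights
mus = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
Sigmas = [np.eye(2), np.array([[1.0, 0.5], [0.5, 1.0]])]

def sample_gmm(n):
    # First draw the latent component z, then draw x from that Gaussian.
    z = rng.choice(len(pi), size=n, p=pi)
    x = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
    return x, z

X, z = sample_gmm(5)
print(z)
print(X)
```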

The EM algorithm:
Deriving the E-step and M-step:
For the MLE of a distribution we have
$$L(\theta)=\ln \mathcal{P}(\mathbf{X} \mid \theta)$$
Unlike MLE, which gives a closed-form solution, EM proceeds iteratively, requiring each new iterate to satisfy
$$L(\theta)>L\left(\theta_n\right)$$
If we regard $\mathcal{P}(X|\theta)$ as influenced by a latent variable $\mathbf{z}$, then
$$\mathcal{P}(\mathbf{X} \mid \theta)=\sum_{\mathbf{z}} \mathcal{P}(\mathbf{X} \mid \mathbf{z}, \theta) \mathcal{P}(\mathbf{z} \mid \theta)$$
Introducing Jensen's inequality (recall the log function: the arc lies above the chord):
$$\ln \sum_{i=1}^n \lambda_i x_i \geq \sum_{i=1}^n \lambda_i \ln \left(x_i\right)$$
for constants $\lambda_i \geq 0$ with $\sum^n_{i=1}\lambda_i=1$. Consider the latent-variable distribution: $\mathcal{P}(\mathbf{z}|X,\theta_n)\geq 0$ and $\sum_{\mathbf{z}}\mathcal{P}(\mathbf{z}|X,\theta_n)=1$.

$$\begin{split} \mathcal{L}&=\log\sum_{\mathbf{z}}p(\mathbf{x},\mathbf{z}) = \log\sum_{\mathbf{z}}q(\mathbf{z})\frac{p(\mathbf{x,z})}{q(\mathbf{z})}\\ &\geq \sum_{\mathbf{z}}q(\mathbf{z})\log\frac{p(\mathbf{x,z})}{q(\mathbf{z})}\\ &=\sum_{\mathbf{z}}q(\mathbf{z})\log p(\mathbf{x,z})-\sum_{\mathbf{z}}q(\mathbf{z})\log q(\mathbf{z}) \end{split}$$

$$\mathcal{H}(q) \overset{\triangle}{=}-\sum_{\mathbf{z}}q(\mathbf{z})\log q(\mathbf{z})$$
$$\mathcal{F}(q,\theta)\overset{\triangle}{=} \sum_{\mathbf{z}}q(\mathbf{z})\log p(\mathbf{x,z})-\sum_{\mathbf{z}}q(\mathbf{z})\log q(\mathbf{z})$$

$\mathcal{F}(q,\theta)$ is then a lower bound on the log-likelihood:
$$\mathcal{L}(\theta) \geq \mathcal{F}(q,\theta)$$

E-step: $q^{(t)} \leftarrow \arg\max_q \mathcal{F}(q,\theta^{t})$: with $\theta^t$ fixed, determine the latent-variable distribution.
M-step: $\theta^{(t+1)} \leftarrow \arg\max_{\theta} \mathcal{F}(q^t,\theta)$: with the latent-variable distribution fixed, determine the model parameters $\theta$.

Property: each round improves $\mathcal{F}$ and thereby also improves $\mathcal{L}$.

E-step: find the latent-variable distribution that maximizes $\mathcal{F}(q,\theta^{(t)})$:

$$\begin{split} \mathcal{F}(q) &= \sum_{\mathbf{z}}q(\mathbf{z})\log p(\mathbf{x,z})-\sum_{\mathbf{z}}q(\mathbf{z})\log q(\mathbf{z})\\ &= \sum_{\mathbf{z}}q(\mathbf{z})\log \frac{p(\mathbf{z}|\mathbf{x})p(\mathbf{x})}{q(\mathbf{z})}\\ &= \sum_{\mathbf{z}}q(\mathbf{z})\log \frac{p(\mathbf{z}|\mathbf{x})}{q(\mathbf{z})}+\sum_{\mathbf{z}}q(\mathbf{z})\log p(\mathbf{x})\\ &= \sum_{\mathbf{z}}q(\mathbf{z})\log \frac{p(\mathbf{z}|\mathbf{x})}{q(\mathbf{z})}+\log p(\mathbf{x})\\ &=-KL(q(\mathbf{z})\,\|\,p(\mathbf{z}|\mathbf{x}))+\mathcal{L} \end{split}$$
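A toy numeric check of this decomposition, using a discrete latent variable so all sums are finite (the joint-probability table is made up):

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy discrete latent z with 3 states: p(x, z) for one fixed observation x.
p_xz = np.array([0.1, 0.25, 0.05])
p_x = p_xz.sum()             # p(x) = sum_z p(x, z)
p_z_given_x = p_xz / p_x     # posterior p(z | x)

q = rng.dirichlet(np.ones(3))  # any distribution over z

F = np.sum(q * np.log(p_xz)) - np.sum(q * np.log(q))
KL = np.sum(q * np.log(q / p_z_given_x))
print(np.isclose(F, np.log(p_x) - KL))  # F(q) = L - KL(q || p(z|x))
```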

The KL term measures the distance between the two distributions, and KL $=0$ iff $q=p$; that is, only when $q^t(\mathbf{z})=p(\mathbf{z}|\mathbf{x},\theta^t)$ do we get $\mathcal{F}(q^t,\theta^t)=\mathcal{L}(\theta^t)$. Therefore, computing $p(\mathbf{z}|\mathbf{x},\theta^t)$ is exactly maximizing $\mathcal{F}(q,\theta^t)$.

M-step: find the $\theta^{(t+1)}$ that maximizes $\mathcal{F}(q^{(t)},\theta)$:
$$\mathcal{F}(q^{t},\theta)=\sum_{\mathbf{z}}q^{t}(\mathbf{z})\log p(\mathbf{x},\mathbf{z}|\theta)+ \mathcal{H}(q^{t})$$

$\mathcal{H}(q^t)$ is constant in this step, so we only need to maximize the expected complete-data log-likelihood:
$$\sum_{\mathbf{z}}q^{t}(\mathbf{z})\log p(\mathbf{x,z}|\theta)$$

Maximizing $\mathcal{F}$ also increases $\mathcal{L}$:
$$\begin{split} \mathcal{L}(\theta^{t+1}) &\geq \mathcal{F}(q^{t},\theta^{t+1})\\ &=\underset{\theta}{\max}\,\mathcal{F}(q^t,\theta)\\ &\geq \mathcal{F}(q^t,\theta^t)=\mathcal{L}(\theta^t) \end{split}$$

E-step and M-step for the GMM: applying EM to the GMM is still quite involved, so below I simply transcribe the whiteboard-series treatment.

The GMM in the whiteboard series: $p(x)=\sum^K_{k=1}p_k\mathcal{N}(x|\mu_k,\Sigma_k)$
$X$: observed data $\rightarrow x_1,x_2, ...,x_N$
$(X,Z)$: complete data $\rightarrow (x_1,z_1),(x_2,z_2), ..., (x_N,z_N)$
$x$: observed variable
$z$: latent variable

| $z$ | $c_1$ | $c_2$ | $\dots$ | $c_K$ |
| --- | --- | --- | --- | --- |
| $p$ | $p_1$ | $p_2$ | $\dots$ | $p_K$ |

$\sum^K_{k=1}p_k=1$; $x|z=c_k\sim\mathcal{N}(x|\mu_k,\Sigma_k)$. For the latent variable $z$, I find Li Hang's book gives the deeper picture: generating $x$ means first picking a Gaussian through the latent variable and then sampling from that Gaussian. The latent variable is a discrete random variable, so the choice of Gaussian itself follows a distribution, and the chosen Gaussians then act i.i.d. In terms of notational simplicity, though, the whiteboard formulation is cleaner.
$$\begin{split} p(y,\gamma|\theta) &= \prod^N_{j=1}p(y_j,\gamma_{j1},\gamma_{j2},\cdots,\gamma_{jK}|\theta)\\ &=\prod^K_{k=1}\prod^N_{j=1}[\alpha_k\phi(y_j|\theta_k)]^{\gamma_{jk}}\\ &\quad\text{(the order of the two products can be exchanged)}\\ &=\prod^N_{j=1}\prod^K_{k=1}[\alpha_k\phi(y_j|\theta_k)]^{\gamma_{jk}}\\ &=\prod^N_{j=1}\underbrace{[\alpha_k\phi(y_j|\theta_k)]\cdot 1\cdot1 \cdots 1}_{K \text{ factors}} \end{split}$$
where in the last line, for each $j$, exactly one $\gamma_{jk}=1$, so only that factor survives: $\alpha_k$ is the weight (geometric view), or probability value (mixture-model view), of the component responsible for the $j$-th observation.

In the E-step, the notation $\arg\max_\theta E_{Z|X,\theta^{(t)}}[\log P(X,Z|\theta)]$ means the following: $\int_{Z} \log p(X,Z|\theta)\,p(Z|X,\theta^{t})\,dZ$ is an integral over the latent variable, with the observed data and the previous iteration's Gaussian parameters given.

$$\begin{split} Q(\theta,\theta^{t}) &= \int_Z \log P(X,Z|\theta)\,p(Z|X,\theta^t)\,dZ\\ &=\sum_{Z}\log\Big[\prod^N_{i=1}P(x_i,Z_i|\theta)\Big]\prod^N_{i=1}P(Z_i|x_i,\theta^t)\\ &=\sum_{Z}\Big[\sum^N_{i=1}\log P(x_i,Z_i|\theta)\Big]\prod^N_{i=1}P(Z_i|x_i,\theta^t)\\ &=\sum_{Z_1,Z_2,\cdots,Z_N} \sum^N_{i=1}\log P(x_i,Z_i|\theta)\prod^N_{i=1}P(Z_i|x_i,\theta^t)\\ &=\sum^N_{i=1}\sum_{Z_i}\log P(x_i,Z_i|\theta)P(Z_i|x_i,\theta^t) \end{split}$$
In the expression above,
$$\begin{split} &\sum_{Z_1,Z_2,\cdots,Z_N}\sum^N_{i=1}\log P(x_i,Z_i|\theta)\prod^N_{i=1}P(Z_i|x_i,\theta^{t})\\ &= \sum_{Z_1,Z_2,\cdots,Z_N}[\log P(x_1,Z_1|\theta)+\log P(x_2,Z_2|\theta)+\cdots+\log P(x_N,Z_N|\theta)]\prod^N_{i=1}P(Z_i|x_i,\theta^t)\\ &\quad\text{taking the first term:}\\ &\phantom{=}\ \sum_{Z_1,Z_2,\cdots,Z_N}\log P(x_1,Z_1|\theta)\prod^N_{i=1}P(Z_i|x_i,\theta^t)\\ &= \sum_{Z_1,Z_2,\cdots,Z_N}\log P(x_1,Z_1|\theta)P(Z_1|x_1,\theta^t)\prod^N_{i=2}P(Z_i|x_i,\theta^t)\\ &= \sum_{Z_1}\log P(x_1,Z_1|\theta)P(Z_1|x_1,\theta^t) \sum_{Z_2,\cdots,Z_N}\prod^N_{i=2}P(Z_i|x_i,\theta^t)\\ &\quad\text{looking at the trailing sums and products:}\\ &\sum_{Z_2,\cdots,Z_N}P(Z_2|x_2,\theta^t)P(Z_3|x_3,\theta^t)\cdots P(Z_N|x_N,\theta^t)\\ &=\sum_{Z_2}P(Z_2|x_2)\sum_{Z_3}P(Z_3|x_3)\sum_{Z_4}P(Z_4|x_4)\cdots\sum_{Z_N}P(Z_N|x_N)\\ &=1 \cdot 1 \cdots 1 \cdot 1\\ &\quad\text{so only the following remains:}\\ &\sum_{Z_1}\log P(x_1,Z_1|\theta)P(Z_1|x_1,\theta^t) \end{split}$$
Summing the analogous result for each term gives the final line of $Q(\theta,\theta^t)$ above.

Substituting into the simplified $Q(\theta,\theta^t)$:
$$P(x,Z)=P(Z)P(x|Z)=p_Z\mathcal{N}(x|\mu_Z,\Sigma_Z)$$
$$p(Z|x) = \frac{p(x,Z)}{p(x)} = \frac{p_Z\mathcal{N}(x|\mu_Z,\Sigma_Z)}{\sum^K_{k=1}p_k\mathcal{N}(x|\mu_k,\Sigma_k)}$$
gives (the posterior factor is evaluated at the previous parameters $\theta^t$):
$$Q(\theta,\theta^t)=\sum^N_{i=1}\sum_{Z_i}\log\left[p_{Z_i}\mathcal{N}(x_i|\mu_{Z_i},\Sigma_{Z_i})\right]\frac{p^t_{Z_i}\mathcal{N}(x_i|\mu^t_{Z_i},\Sigma^t_{Z_i})}{\sum^K_{k=1}p^t_k\mathcal{N}(x_i|\mu^t_k,\Sigma^t_k)}$$
M-step: solve for the parameters in
$$Q = \sum^K_{k=1}\sum^N_{i=1}[\log p_k+\log\mathcal{N}(x_i|\mu_k,\Sigma_k)]\,p(Z_i=k|x_i,\theta^t)$$
First solve for $p^{t+1}_k$, subject to the constraint $\sum^K_{k=1}p_k=1$; form the Lagrangian:
$$L = \sum^K_{k=1}\sum^N_{i=1}\log p_k\,p(z_i=k|x_i,\theta^t)+\lambda\Big(\sum^K_{k=1}p_k-1\Big)$$
The $\log\mathcal{N}(x_i|\mu_k,\Sigma_k)\,p(Z_i=k|x_i,\theta^t)$ term has been dropped since it is constant with respect to $p_k$. Taking the partial derivative of $L$ with respect to $p_k$:
$$\frac{\partial}{\partial p_k}L=\sum^N_{i=1}\frac{1}{p_k}p(z_i=k|x_i,\theta^t)+\lambda=0$$
Multiplying both sides by $p_k$ to solve for $\lambda$:
$$\sum^N_{i=1}p(z_i=k|x_i,\theta^t)+\lambda p_k = 0$$
Using $\sum^K_{k=1}p_k=1$ and summing over $k$:
$$\begin{split} &\sum^K_{k=1}\sum^N_{i=1}p(z_i=k|x_i,\theta^t)+\lambda \sum^K_{k=1}p_k=0\\ &\lambda = -\sum^N_{i=1}\underbrace{\sum^K_{k=1}p(z_i=k|x_i,\theta^t)}_{1}=-N \end{split}$$
λ \lambda λ回代有:
p k t + 1 = ∑ i = 1 N p ( z i = k ∣ x i , θ t ) N p^{t+1}_{k}=\frac{\sum^N_{i=1}p(z_i=k|x_i,\theta^t)}{N} pkt+1=Ni=1Np(zi=kxi,θt)

Solving for $\mu_k$ and $\Sigma_k$ is analogous to the MLE case (use the matrix-derivative rules; also note the generalized-EM idea of updating later parameters with the already-updated earlier ones):
$$\begin{split} &\sum^K_{k=1}\sum^N_{i=1}[\log\mathcal{N}(x_i|\mu_k,\Sigma_k)]\,p(Z_i=k|x_i,\theta^t) \\ &= \sum^K_{k=1}\sum^N_{i=1}\Big[C-\frac{1}{2}\log|\Sigma_k|-\frac{1}{2}(x_i-\mu_k)^T\Sigma^{-1}_k(x_i-\mu_k)\Big]\,p(Z_i=k|x_i,\theta^t) \end{split}$$
μ k \mu_k μk求偏导有:
μ k t + 1 = ∑ i = 1 N p ( Z i = k ∣ x i , θ t ) x i ∑ i = 1 N p ( Z i = k ∣ x i , θ t ) \mu^{t+1}_k = \frac{\sum^N_{i=1}p(Z_i=k|x_i,\theta^t)x_i}{\sum^N_{i=1}p(Z_i=k|x_i,\theta^t)} μkt+1=i=1Np(Zi=kxi,θt)i=1Np(Zi=kxi,θt)xi
Σ k \Sigma_k Σk求偏导并注意在外层乘上 ∑ i = 1 N p ( z i = k ∣ x i , θ t ) \sum^N_{i=1}p(z_i=k|x_i,\theta^t) i=1Np(zi=kxi,θt)有:
Σ k t + 1 = ∑ i = 1 N p ( z i = k ∣ x i , θ t ) ( x i − μ k t + 1 ) ( x i − μ t + 1 ) T ∑ i = 1 N p ( Z i = k ∣ x i , θ t ) \Sigma^{t+1}_{k}=\frac{\sum^N_{i=1}p(z_i=k|x_i,\theta^t)(x_i-\mu^{t+1}_k)(x_i-\mu^{t+1})^T}{\sum^N_{i=1}p(Z_i=k|x_i,\theta^t)} Σkt+1=i=1Np(Zi=kxi,θt)i=1Np(zi=kxi,θt)(xiμkt+1)(xiμt+1)T
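Putting the E-step posterior and the three M-step updates together, a minimal EM-for-GMM sketch (the initialization and fixed iteration count are my own arbitrary choices, not part of the derivation):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Crude initialization: uniform weights, random data points as means.
    p = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, K, replace=False)]
    Sigma = np.stack([np.cov(X, rowvar=False)] * K)

    for _ in range(n_iter):
        # E-step: responsibilities r[i, k] = p(Z_i = k | x_i, theta^t).
        r = np.stack([p[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                      for k in range(K)], axis=1)
        r /= r.sum(axis=1, keepdims=True)

        # M-step: the closed-form updates derived above.
        Nk = r.sum(axis=0)
        p = Nk / N                    # p_k^{t+1}
        mu = (r.T @ X) / Nk[:, None]  # mu_k^{t+1}
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (r[:, k, None] * diff).T @ diff / Nk[k]  # Sigma_k^{t+1}
    return p, mu, Sigma
```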

GMR: combining the GMM with the Gaussian conditional distribution:
$$\pi_{y|x,k}=\frac{\mathcal{N}_k(x|\mu_{x,k},\Sigma_{x,k})}{\sum^K_{l=1}\mathcal{N}_l(x|\mu_{x,l},\Sigma_{x,l})}$$
$$p(y|x)=\sum^{K}_{k=1}\pi_{y|x,k}\,\mathcal{N}_k(y|\mu_{y|x,k},\Sigma_{y|x,k})$$
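A sketch of GMR built on these two formulas, reusing the `conditional_gaussian` helper from Section 1.2 (per-component joint parameters over $(x,y)$ are assumed given; the function name `gmr` is mine):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmr(x, pis, mus, Sigmas, ix, iy):
    """Mixture weights and per-component params of p(y | x) for a GMM over (x, y)."""
    # pi_{y|x,k}: each component's marginal density over x, renormalized.
    w = np.array([pi_k * multivariate_normal.pdf(x, mu[ix], S[np.ix_(ix, ix)])
                  for pi_k, mu, S in zip(pis, mus, Sigmas)])
    w /= w.sum()
    # Per-component conditional Gaussians p_k(y | x), via the Section 1.2 helper.
    cond = [conditional_gaussian(mu, S, iy, ix, x) for mu, S in zip(mus, Sigmas)]
    return w, cond
```

The predictive mean is then the weighted sum $\sum_k \pi_{y|x,k}\,\mu_{y|x,k}$ over the returned component means.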
