[ML] Gaussian Mixture Model (GMM): A Generative Model

  The Gaussian mixture model (GMM) is a clustering algorithm widely used in practice. It uses the Gaussian distribution as its parametric model and is trained with the EM algorithm.

The Gaussian Distribution

  The Gaussian distribution is a commonly used model for continuous random variables. If a scalar random variable $x$ follows a Gaussian distribution with mean $\mu$ and variance $\sigma^2$, written $x \sim \mathcal{N}(\mu,\sigma^2)$, its probability density function is:
$$p\left(x \mid \mu, \sigma^{2}\right)=\frac{1}{\left(2 \pi \sigma^{2}\right)^{1 / 2}} \exp \left\{-\frac{1}{2 \sigma^{2}}(x-\mu)^{2}\right\}$$

  For a $D$-dimensional vector $\mathbf{x}$, if it follows a multivariate Gaussian distribution with mean vector $\mu$ and covariance matrix $\Sigma$, written $\mathbf{x} \sim \mathcal{N}(\mu,\Sigma)$, its probability density function is:
$$p(\mathbf{x} \mid \mu, \Sigma)=\frac{1}{(2 \pi)^{D / 2}} \frac{1}{|\Sigma|^{1 / 2}} \exp \left\{-\frac{1}{2}(\mathbf{x}-\mu)^{T} \Sigma^{-1}(\mathbf{x}-\mu)\right\}$$
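As a quick numerical check, the density above can be evaluated directly. Below is a minimal NumPy sketch (the test point and parameters are invented for illustration):

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density N(x | mu, Sigma), following the formula above."""
    D = len(mu)
    diff = x - mu
    norm_const = 1.0 / ((2 * np.pi) ** (D / 2) * np.linalg.det(Sigma) ** 0.5)
    # solve() computes Sigma^{-1} diff without forming the explicit inverse
    return norm_const * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

# At the mean of a standard bivariate Gaussian the quadratic form vanishes,
# so the density reduces to (2*pi)^{-D/2} = 1/(2*pi) for D = 2.
p0 = gaussian_pdf(np.zeros(2), np.zeros(2), np.eye(2))
```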

Derivatives of the Gaussian Density

Preliminaries (matrix-calculus identities):
$$\frac{\partial}{\partial x}(x^Ta)=\frac{\partial}{\partial x}(a^Tx)=a$$

$$\frac{\partial}{\partial x}(AB)=\frac{\partial A}{\partial x}B+A\frac{\partial B}{\partial x}$$

$$\frac{\partial}{\partial x}(A^{-1})=-A^{-1}\frac{\partial A}{\partial x}A^{-1}$$

$$\frac{\partial}{\partial A}\ln |A|=(A^{-1})^T$$

The log-likelihood of the Gaussian for a single observation is:
$$\ln p(\mathbf{x} \mid \mu, \Sigma)=-\frac{D}{2}\ln(2\pi)-\frac{1}{2}\ln |\Sigma|-\frac{1}{2}(\mathbf{x}-\mu)^{T} \Sigma^{-1}(\mathbf{x}-\mu)$$
Differentiating with respect to the mean and the covariance matrix:
$$\frac{\partial}{\partial \mu}\ln p(\mathbf{x} \mid \mu, \Sigma)=\Sigma^{-1}(\mathbf{x}-\mu)$$

$$\frac{\partial}{\partial \Sigma}\ln p(\mathbf{x} \mid \mu, \Sigma)=-\frac{1}{2}\Sigma^{-1}+\frac{1}{2}\Sigma^{-1}(\mathbf{x}-\mu)(\mathbf{x}-\mu)^T\Sigma^{-1}$$
Hence, for the density itself,
$$\frac{\partial}{\partial \mu}p(\mathbf{x} \mid \mu, \Sigma)=p(\mathbf{x} \mid \mu, \Sigma)\,\Sigma^{-1}(\mathbf{x}-\mu)$$

$$\frac{\partial}{\partial \Sigma} p(\mathbf{x} \mid \mu, \Sigma)=p(\mathbf{x} \mid \mu, \Sigma)\left[-\frac{1}{2}\Sigma^{-1}+\frac{1}{2}\Sigma^{-1}(\mathbf{x}-\mu)(\mathbf{x}-\mu)^T\Sigma^{-1}\right]$$

The covariance derivative above uses two results. Since $\Sigma$ is symmetric,
$$\frac{\partial}{\partial \Sigma}\ln |\Sigma|=(\Sigma^{-1})^T=\Sigma^{-1}$$
and, element by element,
$$\begin{aligned}\frac{\partial}{\partial \Sigma_{ij}}(\mathbf{x}-\mu)^{T} \Sigma^{-1}(\mathbf{x}-\mu)&=(\mathbf{x}-\mu)^{T} \frac{\partial \Sigma^{-1}}{\partial \Sigma_{ij}}(\mathbf{x}-\mu)\\ &=-(\mathbf{x}-\mu)^{T} \Sigma^{-1}\frac{\partial \Sigma}{\partial \Sigma_{ij}}\Sigma^{-1}(\mathbf{x}-\mu)\\ &=-B^T\frac{\partial \Sigma}{\partial \Sigma_{ij}}B=-B_iB_j=-(BB^T)_{ij} \end{aligned}$$
where $B=\Sigma^{-1}(\mathbf{x}-\mu)$ is an $n\times 1$ column vector, $B_i$ is its $i$-th element, and $\frac{\partial \Sigma}{\partial \Sigma_{ij}}$ is the matrix that is 1 in row $i$, column $j$ and 0 everywhere else. In matrix form:
$$\frac{\partial}{\partial \Sigma}(\mathbf{x}-\mu)^{T} \Sigma^{-1}(\mathbf{x}-\mu)=-BB^T=-\Sigma^{-1}(\mathbf{x}-\mu)(\mathbf{x}-\mu)^T\Sigma^{-1}$$
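These gradient formulas can be sanity-checked numerically. The sketch below compares the analytic $\frac{\partial}{\partial \mu}\ln p = \Sigma^{-1}(\mathbf{x}-\mu)$ against central finite differences (the test point and parameters are arbitrary illustrative values):

```python
import numpy as np

def log_gauss(x, mu, Sigma):
    """ln N(x | mu, Sigma), as in the log-density formula above."""
    D = len(mu)
    diff = x - mu
    return -0.5 * (D * np.log(2 * np.pi) + np.linalg.slogdet(Sigma)[1]
                   + diff @ np.linalg.solve(Sigma, diff))

x = np.array([1.0, -0.5])
mu = np.array([0.2, 0.1])
Sigma = np.array([[1.5, 0.4], [0.4, 0.9]])

analytic = np.linalg.solve(Sigma, x - mu)  # Sigma^{-1} (x - mu)

# Central finite differences along each coordinate direction
eps = 1e-6
numeric = np.array([(log_gauss(x, mu + eps * e, Sigma)
                     - log_gauss(x, mu - eps * e, Sigma)) / (2 * eps)
                    for e in np.eye(2)])
```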

The Gaussian Mixture Model

  A Gaussian mixture model is a linear combination of several Gaussian densities. Such a superposition can fit very complex densities: with enough Gaussian components, and by tuning their means, covariance matrices, and mixing coefficients, almost any continuous density can be approximated to arbitrary accuracy.

  We consider a linear superposition of $K$ Gaussians. The density of this Gaussian mixture distribution is:
$$p(\mathbf{x})=\sum_{k=1}^{K} \pi_{k}\, p\left(\mathbf{x} \mid \mu_{k}, \Sigma_{k}\right)$$
where $p\left(\mathbf{x} \mid \mu_{k}, \Sigma_{k}\right)$ denotes the Gaussian density with parameters $\mu_{k}$ and $\Sigma_{k}$. Each Gaussian density is called a component of the mixture and has its own mean $\mu_{k}$ and covariance matrix $\Sigma_{k}$. The parameters $\pi_{k}$ are the mixing coefficients and satisfy:
$$\sum_{k=1}^{K}\pi_{k}=1,\qquad 0\leq\pi_{k}\leq 1$$

  Equivalently, the Gaussian mixture model can be written as:
$$p(x)=\sum_{k=1}^{K}p(k)\,p(x\mid k)$$
where $p(k)=\pi_{k}$ is the prior probability of selecting the $k$-th component; $p(x\mid k)=p\left(x \mid \mu_{k}, \Sigma_{k}\right)$ is the conditional density of $x$ given $k$; and $p(k\mid x)$, the posterior probability that $x$ came from the $k$-th component once $x$ is observed, is called the responsibility of the $k$-th component.

  Suppose a sample $x$ is generated by a Gaussian mixture model, i.e., it is drawn from one of the $K$ Gaussians, but we do not know which one. Introduce a $K$-dimensional binary random variable $z=[z_{1},\dots,z_{K}]$ indicating which component produced $x$, with $z_{k}\in\{0,1\}$ and $\sum_{k=1}^{K}z_{k}=1$. Here $z_{k}=1$ means $x$ was drawn from component $k$, and the sum-to-one constraint means $x$ can come from only one Gaussian. Thus $z$ has $K$ possible states, and its marginal distribution is given by the mixing coefficients: $p(z_{k}=1)=\pi_k$. The distribution of $z$ can therefore be written as:
$$p(z)=\prod_{k=1}^{K}\pi_{k}^{z_{k}},\qquad \sum_{k=1}^{K}z_{k}=1$$
  A sample $x$ from a Gaussian mixture model can thus be viewed as generated in two steps:

  1. Sample $z$ from the discrete distribution $p(z)$;
  2. Given that $z$ selects the $k$-th component, sample $x$ from the Gaussian $p(x\mid\mu_k, \Sigma_k)$.

  This process can be represented by the following probabilistic graphical model:
[Figure: graphical model of the GMM]
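The two-step generative process above can be sketched as follows (a minimal example; the mixture parameters are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative K = 3 mixture in 2-D
pis = np.array([0.5, 0.3, 0.2])
mus = np.array([[0.0, 0.0], [4.0, 4.0], [-4.0, 4.0]])
Sigmas = np.stack([np.eye(2)] * 3)

def sample_gmm(n):
    """Ancestral sampling: draw z ~ p(z), then x ~ N(mu_z, Sigma_z)."""
    zs = rng.choice(len(pis), size=n, p=pis)  # step 1: pick a component
    xs = np.array([rng.multivariate_normal(mus[z], Sigmas[z])
                   for z in zs])              # step 2: sample from that Gaussian
    return xs, zs

xs, zs = sample_gmm(1000)
```

Over many draws, the fraction of samples from component $k$ approaches $\pi_k$, as the marginal distribution of $z$ dictates.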
  Given $z$, the conditional density of $x$ is:
$$p(x\mid z_{k}=1)=p(x\mid\mu_k, \Sigma_k)$$
or equivalently:
$$p(x\mid z)=\prod_{k=1}^{K}p(x\mid\mu_k, \Sigma_k)^{z_{k}}$$
The marginal distribution of $x$ is obtained by summing the joint distribution $p(x,z)$ over all possible states of $z$:
$$\begin{aligned} p(x) &=\sum_{z} p(x, z)=\sum_{z} p(z)\, p(x \mid z) \\ &=\sum_{z}\left(\prod_{k=1}^{K} \pi_{k}^{z_{k}} \prod_{k=1}^{K} p\left(x \mid \mu_{k}, \Sigma_{k}\right)^{z_{k}}\right) \\ &=\sum_{z}\left(\prod_{k=1}^{K}\left[\pi_{k}\, p\left(x \mid \mu_{k}, \Sigma_{k}\right)\right]^{z_{k}}\right) \\ &=\sum_{k=1}^{K} \pi_{k}\, p\left(x \mid \mu_{k}, \Sigma_{k}\right) \end{aligned}$$

  After $x$ is observed, the posterior probability that it came from the $k$-th component is:
$$\begin{aligned} p\left(z_{k}=1 \mid x\right) &=\frac{p\left(z_{k}=1\right) p\left(x \mid z_{k}=1\right)}{p(x)} \\ &=\frac{\pi_{k}\, p\left(x \mid \mu_{k}, \Sigma_{k}\right)}{\sum_{j=1}^{K} \pi_{j}\, p\left(x \mid \mu_{j}, \Sigma_{j}\right)} \end{aligned}$$

  Define this quantity as $\gamma(z_{k})=p(z_{k}=1\mid x)$, called the responsibility of the $k$-th component for $x$.
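The responsibility formula can be implemented with the usual log-sum-exp trick for numerical stability. A sketch (the parameter values in the comments and check are arbitrary):

```python
import numpy as np

def log_gaussian(x, mu, Sigma):
    """ln N(x | mu, Sigma)."""
    D = len(mu)
    diff = x - mu
    return -0.5 * (D * np.log(2 * np.pi) + np.linalg.slogdet(Sigma)[1]
                   + diff @ np.linalg.solve(Sigma, diff))

def responsibilities(x, pis, mus, Sigmas):
    """gamma_k = pi_k N(x|mu_k,Sigma_k) / sum_j pi_j N(x|mu_j,Sigma_j)."""
    log_w = np.log(pis) + np.array([log_gaussian(x, m, S)
                                    for m, S in zip(mus, Sigmas)])
    log_w -= log_w.max()  # subtract the max before exponentiating (stability)
    w = np.exp(log_w)
    return w / w.sum()
```

A point lying at one component's mean, far from all others, receives responsibility close to 1 for that component, and the responsibilities always sum to 1.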

Parameter Estimation for the Gaussian Mixture Model

  Suppose we have a data set $\mathbf{X}=\{x_1,\dots,x_N\}$ generated by a Gaussian mixture model. We want to estimate the model parameters: $\pi=\{\pi_1,\dots,\pi_K\}$, $\mu=\{\mu_1,\dots,\mu_K\}$, $\Sigma=\{\Sigma_1,\dots,\Sigma_K\}$. Let $z_n=[z_{n1},\dots,z_{nK}]$ denote the latent variable associated with $x_n$. Then:
$$p(z_n)=\prod_{k=1}^{K}\pi_{k}^{z_{nk}}$$

$$p(x_n\mid z_n)=\prod_{k=1}^{K}p(x_n\mid\mu_k, \Sigma_k)^{z_{nk}}$$

$$p(x_n)=\sum_{k=1}^{K} \pi_{k}\, p\left(x_n \mid \mu_{k}, \Sigma_{k}\right)$$

$$\gamma(z_{nk})=p\left(z_{nk}=1 \mid x_n\right)=\frac{\pi_{k}\, p\left(x_n \mid \mu_{k}, \Sigma_{k}\right)}{\sum_{j=1}^{K} \pi_{j}\, p\left(x_n\mid\mu_{j}, \Sigma_{j}\right)}$$

Maximum Likelihood Estimation

  The likelihood of the sample set $X$ is:
$$p(X\mid\pi,\mu,\Sigma)=\prod_{n=1}^{N}\left[\sum_{k=1}^{K}\pi_k\, p(x_n\mid\mu_k,\Sigma_k)\right]$$
Differentiating this product is cumbersome, so we take the natural logarithm, turning the product into a sum, and obtain the log-likelihood:
$$\ln p(X\mid\pi,\mu,\Sigma)=\sum_{n=1}^{N}\ln\left[\sum_{k=1}^{K}\pi_k\, p(x_n\mid\mu_k,\Sigma_k)\right]$$
where
$$p(x_n\mid\mu_k,\Sigma_k)=\frac{1}{(2 \pi)^{D / 2}} \frac{1}{|\Sigma_k|^{1 / 2}} \exp \left\{-\frac{1}{2}(x_n-\mu_k)^{T} \Sigma_k^{-1}(x_n-\mu_k)\right\}$$
Maximum likelihood estimation of the GMM parameters amounts to solving the optimization problem:
$$\begin{cases} \max_{\pi,\mu,\Sigma}\ \ln p(X\mid\pi,\mu,\Sigma)\\ \\ \text{s.t.} \quad\sum_{k=1}^{K}\pi_k=1\end{cases}$$

We use the method of Lagrange multipliers to find the extremum. The Lagrangian is:
$$L(\pi, \mu, \Sigma)=\ln p(X \mid \pi, \mu, \Sigma)+\lambda\left(\sum_{k=1}^{K} \pi_{k}-1\right)$$

  Taking the gradient of $L(\pi,\mu,\Sigma)$ with respect to $\mu_k$:
$$\begin{aligned} \frac{\partial L}{\partial \mu_{k}} &=\frac{\partial}{\partial \mu_{k}} \ln p(X \mid \pi, \mu, \Sigma) \\ &=\frac{\partial}{\partial \mu_{k}} \left(\sum_{n=1}^{N} \ln \left[\sum_{k=1}^{K} \pi_{k} p\left(x_{n} \mid \mu_{k}, \Sigma_{k}\right)\right]\right)\\ &=\sum_{n=1}^{N} \frac{\partial}{\partial \mu_{k}} \ln \left[\sum_{k=1}^{K} \pi_{k} p\left(x_{n} \mid \mu_{k}, \Sigma_{k}\right)\right] \\ &=\sum_{n=1}^{N}\left[\frac{\pi_{k} p\left(x_{n} \mid \mu_{k}, \Sigma_{k}\right)}{\sum_{j=1}^{K} \pi_{j} p\left(x_{n} \mid \mu_{j}, \Sigma_{j}\right)}\left[\Sigma_{k}^{-1}\left(x_{n}-\mu_{k}\right)\right]\right]\\&=\sum_{n=1}^{N}\left[\gamma(z_{nk})\,\Sigma_k^{-1}(x_n-\mu_k)\right] \end{aligned}$$
Setting $\frac{\partial L}{\partial \mu_{k}}=0$, i.e. $\sum_{n=1}^{N}[\gamma(z_{nk})\Sigma_k^{-1}(x_n-\mu_k)]=0$, and left-multiplying by $\Sigma_k$, we get:
$$\sum_{n=1}^{N}\left[\gamma\left(z_{n k}\right) x_{n}\right]=\sum_{n=1}^{N}\left[\gamma\left(z_{n k}\right) \mu_{k}\right]=\mu_{k} \sum_{n=1}^{N} \gamma\left(z_{n k}\right)$$
Define $N_k=\sum_{n=1}^{N}\gamma(z_{nk})$, which can be interpreted as the effective number of samples assigned to the $k$-th component (cluster). Then:

$$\mu_k=\frac{1}{N_k} \sum_{n=1}^{N}\left[\gamma(z_{nk})\,x_n\right]$$

  Taking the partial derivatives with respect to the elements of $\Sigma_k$:
$$\begin{aligned} \frac{\partial L}{\partial \Sigma_{k}} &=\frac{\partial }{\partial \Sigma_{k}}\ln p(X\mid \pi, \mu, \Sigma) \\ &=\frac{\partial}{\partial \Sigma_{k}}\left(\sum_{n=1}^{N} \ln \left[\sum_{k=1}^{K} \pi_{k} p\left(x_{n} \mid \mu_{k}, \Sigma_{k}\right)\right]\right) \\ &=\sum_{n=1}^{N} \frac{\partial}{\partial \Sigma_{k}} \ln \left[\sum_{k=1}^{K} \pi_{k} p\left(x_{n} \mid \mu_{k}, \Sigma_{k}\right)\right]\\ &=\sum_{n=1}^{N}\left[\frac{\pi_{k}}{\sum_{j=1}^{K} \pi_{j} p\left(x_{n} \mid \mu_{j}, \Sigma_{j}\right)} \frac{\partial }{\partial \Sigma_{k}}p\left(x_{n} \mid \mu_{k}, \Sigma_{k}\right)\right] \end{aligned}$$
Setting $\frac{\partial L}{\partial \Sigma_{k}}=0$ yields:

$$\Sigma_{k}=\frac{1}{N_{k}} \sum_{n=1}^{N} \gamma\left(z_{n k}\right)\left(x_{n}-\mu_{k}\right)\left(x_{n}-\mu_{k}\right)^{T}$$

  Taking the partial derivative with respect to $\pi_k$ (the Lagrange term now contributes) and setting it to zero:
$$\begin{aligned} \frac{\partial L}{\partial \pi_{k}} &=\frac{\partial}{\partial \pi_{k}} \left[\sum_{n=1}^{N} \ln \left[\sum_{k=1}^{K} \pi_{k} p\left(x_{n} \mid \mu_{k}, \Sigma_{k}\right)\right]+\lambda\left(\sum_{k=1}^{K} \pi_{k}-1\right)\right]\\ &=\sum_{n=1}^{N} \frac{p\left(x_{n} \mid \mu_{k}, \Sigma_{k}\right)}{\sum_{j=1}^{K} \pi_{j} p\left(x_{n} \mid \mu_{j}, \Sigma_{j}\right)}+\lambda=0 \end{aligned}$$
Multiplying both sides by $\pi_k$:
$$\begin{aligned} \sum_{n=1}^{N} \frac{\pi_{k} p\left(x_{n} \mid \mu_{k}, \Sigma_{k}\right)}{\sum_{j=1}^{K} \pi_{j} p\left(x_{n} \mid \mu_{j}, \Sigma_{j}\right)}+\lambda \pi_{k} &=\sum_{n=1}^{N} \gamma\left(z_{n k}\right)+\lambda \pi_{k} \\ &=N_{k}+\lambda \pi_{k}=0 \end{aligned}$$
Summing over $k$:
$$\sum_{k=1}^{K}\left(N_{k}+\lambda \pi_{k}\right)=\sum_{k=1}^{K} N_{k}+\lambda \sum_{k=1}^{K} \pi_{k}=\sum_{k=1}^{K} N_{k}+\lambda=0$$

$$\begin{aligned} \sum_{k=1}^{K} N_{k} &=\sum_{k=1}^{K} \sum_{n=1}^{N} \gamma\left(z_{n k}\right)=\sum_{n=1}^{N} \sum_{k=1}^{K} \gamma\left(z_{n k}\right) \\ &=\sum_{n=1}^{N} \sum_{k=1}^{K} \frac{\pi_{k} p\left(x_n \mid \mu_{k}, \Sigma_{k}\right)}{\sum_{j=1}^{K} \pi_{j} p\left(x_n \mid \mu_{j}, \Sigma_{j}\right)} \\ &=\sum_{n=1}^{N} 1=N \end{aligned}$$

So $\lambda=-N$, and therefore

$$\pi_k=-\frac{N_k}{\lambda}=\frac{N_k}{N}$$

  In summary, a stationary point of the log-likelihood $\ln p(X\mid\pi,\mu,\Sigma)$ satisfies:
$$\left\{\begin{array}{l}{\mu_{k}=\frac{1}{N_{k}} \sum_{n=1}^{N}\left[\gamma\left(z_{n k}\right) \mathbf{x}_{n}\right]} \\ \\ {\boldsymbol{\Sigma}_{k}=\frac{1}{N_{k}} \sum_{n=1}^{N} \gamma\left(z_{n k}\right)\left(\mathbf{x}_{n}-\mu_{k}\right)\left(\mathbf{x}_{n}-\mu_{k}\right)^{T}} \\ \\ {\pi_{k}=\frac{N_{k}}{N}}\end{array}\right.$$

Note that this is not an analytical (closed-form) solution: the responsibilities $\gamma(z_{nk})$ on the right-hand sides themselves depend on the parameters. The parameters can, however, be computed by an iterative algorithm.

The EM Algorithm for Gaussian Mixture Models
  1. Initialization: assign initial values to the means $\mu_k$, covariance matrices $\Sigma_k$, and mixing coefficients $\pi_k$, and compute the initial value of the log-likelihood. (K-means clustering can be used to obtain the initial parameters.)
  2. E step (Expectation step): use the current parameter values $\mu_k$, $\Sigma_k$, $\pi_k$ to compute the posterior probabilities (responsibilities) $\gamma(z_{nk})$:
$$\gamma\left(z_{n k}\right) \leftarrow \frac{\pi_{k} p\left(x_{n} \mid \mu_{k}, \Sigma_{k}\right)}{\sum_{j=1}^{K} \pi_{j} p\left(x_{n} \mid \mu_{j}, \Sigma_{j}\right)}$$
  3. M step (Maximization step): use the current responsibilities $\gamma(z_{nk})$ to re-estimate the parameters:
$$\begin{aligned} \mu_{k}^{\text{new}} & \leftarrow \frac{1}{N_{k}} \sum_{n=1}^{N}\left[\gamma\left(z_{n k}\right) x_{n}\right] \\ \Sigma_{k}^{\text{new}} & \leftarrow \frac{1}{N_{k}} \sum_{n=1}^{N} \gamma\left(z_{n k}\right)\left(x_{n}-\mu_{k}^{\text{new}}\right)\left(x_{n}-\mu_{k}^{\text{new}}\right)^{T} \\ \pi_{k}^{\text{new}} & \leftarrow \frac{N_{k}}{N} \end{aligned}$$
  where $N_k=\sum_{n=1}^{N}\gamma(z_{nk})$.
  In the M step, compute $\mu_{k}^{\text{new}}$ first, then use $\mu_{k}^{\text{new}}$ to compute the new covariance matrix $\Sigma_{k}^{\text{new}}$.
  4. Compute the log-likelihood with the new parameters $\mu_{k}^{\text{new}}$, $\Sigma_{k}^{\text{new}}$, $\pi_{k}^{\text{new}}$:
$$\ln p(X\mid\pi,\mu,\Sigma) =\sum_{n=1}^{N}\ln\left[\sum_{k=1}^{K}\pi_k\, p(x_n\mid\mu_k,\Sigma_k)\right]$$
  Then check for convergence. (Alternatively, one can check directly whether the parameters have converged.)
  If converged, the model parameters have been obtained; otherwise, return to step 2 and continue iterating.
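Steps 1 to 4 can be put together into a compact EM loop. The sketch below is a minimal illustration, not a production implementation: it uses random data points (or explicitly supplied means) for initialization instead of K-means, identity covariances, and a fixed iteration count instead of a convergence check.

```python
import numpy as np

def em_gmm(X, K, n_iter=100, init_mus=None, seed=0):
    """Fit a GMM by EM, following steps 1-4 above."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # 1. Initialization: means from random data points (or given), identity
    #    covariances, uniform mixing coefficients.
    mus = (X[rng.choice(N, K, replace=False)] if init_mus is None
           else np.array(init_mus, float))
    Sigmas = np.stack([np.eye(D)] * K)
    pis = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # 2. E step: responsibilities gamma, shape (N, K), computed in log space.
        diff = X[:, None, :] - mus[None, :, :]                  # (N, K, D)
        quad = np.einsum('nkd,kde,nke->nk', diff, np.linalg.inv(Sigmas), diff)
        log_p = -0.5 * (D * np.log(2 * np.pi)
                        + np.linalg.slogdet(Sigmas)[1] + quad)
        log_w = np.log(pis) + log_p
        log_w -= log_w.max(axis=1, keepdims=True)
        gamma = np.exp(log_w)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # 3. M step: re-estimate parameters from the effective counts N_k.
        Nk = gamma.sum(axis=0)
        mus = (gamma.T @ X) / Nk[:, None]
        diff = X[:, None, :] - mus[None, :, :]
        Sigmas = np.einsum('nk,nkd,nke->kde', gamma, diff, diff) / Nk[:, None, None]
        pis = Nk / N
    return pis, mus, Sigmas
```

On two well-separated clusters, the recovered means land on the cluster centers (up to permutation of the components) and the mixing coefficients sum to 1.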

The EM Algorithm: Another View of Maximum Likelihood for the GMM

  In the maximum likelihood estimation above, the objective was the log-likelihood $\ln p(X\mid\pi,\mu,\Sigma)$ of the incomplete data set $\mathbf{X}=\{x_1,\dots,x_N\}$. Now consider a different view: suppose we also observed the latent variables $\mathbf{Z}=\{z_1,\dots,z_N\}$, i.e., we knew which component of the mixture each $x_n$ came from. In other words, we would have the complete data set $\{\mathbf{X},\mathbf{Z}\}$ and could compute the complete-data log-likelihood $\ln p(\mathbf{X},\mathbf{Z}\mid\pi,\mu,\Sigma)$.
  First, the likelihood of one complete-data pair $\{x_n,z_n\}$ is:
$$p\left(x_{n}, z_{n} \mid \pi, \mu, \Sigma\right)=\prod_{k=1}^{K} \pi_{k}^{z_{n k}}\, p\left(x_{n} \mid \mu_{k}, \Sigma_{k}\right)^{z_{n k}}$$

Assuming the data pairs are independent, the likelihood of the whole complete data set $\{\mathbf{X},\mathbf{Z}\}$ is:
$$p\left(\mathbf{X},\mathbf{Z} \mid \pi, \mu, \Sigma\right)=\prod_{n=1}^{N}\prod_{k=1}^{K} \pi_{k}^{z_{n k}}\, p\left(x_{n} \mid \mu_{k}, \Sigma_{k}\right)^{z_{n k}}$$

Taking the logarithm,
$$\ln p\left(\mathbf{X},\mathbf{Z} \mid \pi, \mu, \Sigma\right)=\sum_{n=1}^{N}\sum_{k=1}^{K} z_{n k}\left[\ln \pi_{k}+\ln p\left(x_{n} \mid \mu_{k}, \Sigma_{k}\right)\right]$$

Compared with the incomplete-data log-likelihood $\ln p(X\mid\pi,\mu,\Sigma)=\sum_{n=1}^{N}\ln[\sum_{k=1}^{K}\pi_k\, p(x_n\mid\mu_k,\Sigma_k)]$, the order of $\sum_{k=1}^{K}$ and $\ln$ has been swapped, which makes the complete-data form much easier to optimize.

  In reality, however, we do not observe the latent variables, so we cannot compute $\ln p\left(\mathbf{X},\mathbf{Z} \mid \pi, \mu, \Sigma\right)$ directly. Instead, we take the expectation of $\ln p\left(\mathbf{X},\mathbf{Z} \mid \pi, \mu, \Sigma\right)$ under the posterior distribution of the latent variables $z_{nk}$; the expectation eliminates $z_{nk}$.

  The expectation of $\ln p\left(\mathbf{X},\mathbf{Z} \mid \pi, \mu, \Sigma\right)$ under the posterior distribution of $z_{nk}$ is:
$$\begin{aligned} E_{Z} \ln p(X, Z \mid \pi, \mu, \Sigma) &=E \sum_{n=1}^{N} \sum_{k=1}^{K}\left\{z_{n k}\left[\ln \pi_{k}+\ln p\left(x_{n} \mid \mu_{k}, \Sigma_{k}\right)\right]\right\} \\ &=\sum_{n=1}^{N} \sum_{k=1}^{K}\left\{E\left(z_{n k}\right)\left[\ln \pi_{k}+\ln p\left(x_{n} \mid \mu_{k}, \Sigma_{k}\right)\right]\right\} \end{aligned}$$
where $E\left(z_{n k}\right)$ denotes the expectation of $z_{nk}$ under the posterior, $E\left(z_{n k}\mid X,\theta\right)$. Since $z_{n k}\in\{0,1\}$,
$$\begin{aligned} E\left(z_{n k} \mid X, \theta\right) &=0 \times p\left(z_{n k}=0 \mid X, \theta\right)+1 \times p\left(z_{n k}=1 \mid X, \theta\right) \\ &=p\left(z_{n k}=1 \mid X, \theta\right) \\ &=\gamma\left(z_{n k}\right) \end{aligned}$$
and therefore,
$$E_{Z} \ln p(X, Z \mid \pi, \mu, \Sigma)=\sum_{n=1}^{N} \sum_{k=1}^{K}\left\{\gamma\left(z_{n k}\right)\left[\ln \pi_{k}+\ln p\left(x_{n} \mid \mu_{k}, \Sigma_{k}\right)\right]\right\}$$
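The expected complete-data log-likelihood can be evaluated directly from the responsibilities. A sketch: for a single component with $\gamma = 1$ and $\pi_1 = 1$ it reduces to the ordinary Gaussian log-likelihood, which gives an easy check.

```python
import numpy as np

def expected_complete_ll(X, gamma, pis, mus, Sigmas):
    """E_Z ln p(X,Z) = sum_n sum_k gamma_nk [ln pi_k + ln N(x_n | mu_k, Sigma_k)]."""
    N, D = X.shape
    log_p = np.empty((N, len(pis)))
    for k in range(len(pis)):
        diff = X - mus[k]
        quad = np.einsum('nd,de,ne->n', diff, np.linalg.inv(Sigmas[k]), diff)
        log_p[:, k] = -0.5 * (D * np.log(2 * np.pi)
                              + np.linalg.slogdet(Sigmas[k])[1] + quad)
    return float(np.sum(gamma * (np.log(pis) + log_p)))
```

With one point at the mean of a single standard bivariate component, the expression equals $-\ln(2\pi)$, since $\ln\pi_1 = 0$ and the quadratic form vanishes.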
Now treat $\gamma\left(z_{n k}\right)$ as constants and find the parameters $\pi$, $\mu$, $\Sigma$ that maximize $E_{Z} \ln p(X, Z \mid \pi, \mu, \Sigma)$, i.e. solve
$$\begin{cases} \max_{\pi,\mu,\Sigma}\quad E_{Z} \ln p(X, Z \mid \pi, \mu, \Sigma)\\ \text{s.t.}\quad\quad\quad\ \sum_{k=1}^{K}\pi_k=1 \end{cases}$$
Using the method of Lagrange multipliers again, the Lagrangian is:
$$L_E=E_{Z} \ln p(X, Z \mid \pi, \mu, \Sigma)+\lambda\left(\sum_{k=1}^{K}\pi_k-1\right)$$
  First take the gradient with respect to $\mu_k$:
$$\begin{aligned} \frac{\partial L_{E}}{\partial \mu_{k}}=0 & \Rightarrow \sum_{n=1}^{N} \gamma\left(z_{n k}\right)\left[\Sigma_{k}^{-1}\left(x_{n}-\mu_{k}\right)\right]=0 \\ & \Rightarrow \sum_{n=1}^{N} \gamma\left(z_{n k}\right)\left(x_{n}-\mu_{k}\right)=0 \\ & \Rightarrow \sum_{n=1}^{N} \gamma\left(z_{n k}\right) x_{n}=\sum_{n=1}^{N} \gamma\left(z_{n k}\right) \mu_{k} \end{aligned}$$
Exactly as when maximizing $\ln p(X\mid\pi,\mu,\Sigma)$ directly, this yields
$$\mu_k=\frac{1}{N_k}\sum_{n=1}^{N}\left[\gamma(z_{nk})\,x_n\right]$$
  Take the gradient with respect to $\Sigma_k$:
$$\begin{aligned} \frac{\partial L_{E}}{\partial \Sigma_{k}} &=\sum_{n=1}^{N}\left[\frac{\partial \sum_{k=1}^{K} \gamma\left(z_{n k}\right) \ln p\left(x_{n} \mid \mu_{k}, \Sigma_{k}\right)}{\partial \Sigma_{k}}\right] \\ &=\sum_{n=1}^{N}\left[\gamma\left(z_{n k}\right) \frac{\partial \ln p\left(x_{n} \mid \mu_{k}, \Sigma_{k}\right)}{\partial \Sigma_{k}}\right] \\ &=\sum_{n=1}^{N}\left[\frac{\gamma\left(z_{n k}\right)}{p\left(x_{n} \mid \mu_{k}, \Sigma_{k}\right)} \frac{\partial p\left(x_{n} \mid \mu_{k}, \Sigma_{k}\right)}{\partial \Sigma_{k}}\right] \end{aligned}$$
At a stationary point of $\ln p(X\mid\pi,\mu,\Sigma)$ we had
$$\frac{\partial L}{\partial \Sigma_{k}}=\sum_{n=1}^{N}\left[\frac{\pi_{k}}{\sum_{j=1}^{K} \pi_{j} p\left(x_{n} \mid \mu_{j}, \Sigma_{j}\right)} \frac{\partial p\left(x_{n} \mid \mu_{k}, \Sigma_{k}\right)}{\partial \Sigma_{k}}\right]$$
Substituting the formula for the responsibility $\gamma(z_{nk})$ shows the two expressions are identical, so both approaches give the same $\Sigma_k$.
  Now consider $\pi_k$:
$$\begin{aligned} \frac{\partial L_{E}}{\partial \pi_{k}} &=\sum_{n=1}^{N}\left[\frac{\partial \sum_{k=1}^{K} \gamma\left(z_{n k}\right) \ln \pi_{k}}{\partial \pi_{k}}\right]+\lambda \\ &=\sum_{n=1}^{N}\left[\gamma\left(z_{n k}\right) \frac{\partial \ln \pi_{k}}{\partial \pi_{k}}\right]+\lambda \\ &=\sum_{n=1}^{N}\left[\gamma\left(z_{n k}\right) \frac{1}{\pi_{k}}\right]+\lambda=\frac{1}{\pi_{k}} \sum_{n=1}^{N} \gamma\left(z_{n k}\right)+\lambda \\ &=\frac{1}{\pi_{k}} N_{k}+\lambda \end{aligned}$$
Setting this to zero likewise recovers the same $\pi_k$ as the stationary point of $\ln p(X\mid\pi,\mu,\Sigma)$.

  In summary, the model parameters obtained by maximizing $E_{Z} \ln p(X, Z \mid \pi, \mu, \Sigma)$ coincide with the maximum likelihood estimates from $\ln p(X\mid\pi,\mu,\Sigma)$. In each iteration of the EM algorithm, we are maximizing $E_{Z} \ln p(X, Z \mid \pi, \mu, \Sigma)$.
