Gaussian Mixture Model (GMM)

0 Preface

  The Gaussian Mixture Model, usually abbreviated GMM, is a widely used clustering algorithm. It uses the Gaussian distribution as its parametric model and is trained with the Expectation-Maximization (EM) algorithm.

1 The Single Gaussian Model

  First, when the random variable X is one-dimensional, its Gaussian probability density function is:

$$P(x;\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \tag{1.1}$$

  where $\mu$ is the mean and $\sigma^2$ is the variance. (Figure: the bell-shaped curve of the one-dimensional Gaussian density.)
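As a quick illustration, here is a minimal sketch (assuming only NumPy and Matplotlib; the parameter values and grid range are arbitrary choices) that evaluates and plots the density in (1.1):

import numpy as np
import matplotlib.pyplot as plt

mu, sigma2 = 0.0, 1.0  # example mean and variance
x = np.linspace(-4, 4, 200)
p = np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)  # equation (1.1)
plt.plot(x, p)
plt.show()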

  When the random variable X is a multi-dimensional feature vector, the Gaussian distribution has the probability density function:

$$P(x;\mu,\Sigma) = \frac{1}{(2\pi)^{\frac{D}{2}}|\Sigma|^{\frac{1}{2}}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right) \tag{1.2}$$

  where $\mu$ is the mean vector, $\Sigma$ is the covariance matrix, and D is the dimension of the sample x. (Figure: the density surface of a two-dimensional Gaussian.)
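To make (1.2) concrete, the sketch below evaluates the multivariate density directly and cross-checks it against scipy.stats.multivariate_normal; the two-dimensional mean and covariance are made-up example values:

import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])  # example covariance matrix
x = np.array([1.0, -0.5])

D = len(mu)
diff = x - mu
p = np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / (
    (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma)))  # equation (1.2)
print(p, multivariate_normal(mu, Sigma).pdf(x))  # the two values should agree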

  For a single Gaussian model, if we know that the training samples all belong to one Gaussian, we can estimate its parameters by maximum likelihood. Suppose we have samples $\{x_1,x_2,\ldots,x_m\}$ with $x_i \in R^D$, where the sample points are independent. From the density function, the likelihood of the sample set is:
$$L(\mu,\Sigma) = \prod_{i=1}^m P(x_i) = \prod_{i=1}^m \frac{1}{(2\pi)^{\frac{D}{2}}|\Sigma|^{\frac{1}{2}}}\exp\left(-\frac{1}{2}(x_i-\mu)^T\Sigma^{-1}(x_i-\mu)\right) \tag{1.3}$$
Since the probability of each individual point is small, the product becomes extremely small, which is inconvenient to compute and to monitor, so we usually maximize the log-likelihood instead:
$$\ell(\mu,\Sigma) = \log L(\mu,\Sigma) = \sum_{i=1}^m \left[-\frac{D}{2}\log 2\pi - \frac{1}{2}\log|\Sigma| - \frac{1}{2}(x_i-\mu)^T\Sigma^{-1}(x_i-\mu)\right] \tag{1.4}$$
Taking the partial derivative of the log-likelihood with respect to each parameter and setting the result to zero, we can solve for the parameters.
$$\frac{\partial \ell(\mu,\Sigma)}{\partial \mu} = \sum_{i=1}^m \Sigma^{-1}(x_i-\mu) = 0 \qquad (\Sigma \text{ is positive definite}) \\ \Rightarrow m\mu = \sum_{i=1}^m x_i \quad\Rightarrow\quad \mu = \frac{1}{m}\sum_{i=1}^m x_i \tag{1.5}$$
Σ \Sigma Σ求偏导,我们先了解下面的矩阵求导知识:

  • $tr[ABC] = tr[CAB] = tr[BCA]$
  • $x^TAx = tr(x^TAx) = tr(xx^TA)$
  • $\frac{\partial\, tr(AB)}{\partial A} = B^T$
  • $\frac{\partial \log|A|}{\partial A} = A^{-T}$

$$\begin{aligned} \ell(\mu,\Sigma) &= \sum_{i=1}^m \left[-\frac{D}{2}\log 2\pi - \frac{1}{2}\log|\Sigma| - \frac{1}{2}(x_i-\mu)^T\Sigma^{-1}(x_i-\mu)\right] \\ &= \sum_{i=1}^m \left[-\frac{D}{2}\log 2\pi + \frac{1}{2}\log|\Sigma^{-1}| - \frac{1}{2}tr\left[(x_i-\mu)(x_i-\mu)^T\Sigma^{-1}\right]\right] \end{aligned} \tag{1.6}$$
$$\frac{\partial \ell(\mu,\Sigma)}{\partial \Sigma^{-1}} = \frac{m}{2}\Sigma - \frac{1}{2}\sum_{i=1}^m (x_i-\mu)(x_i-\mu)^T = 0 \\ \Rightarrow \Sigma = \frac{1}{m}\sum_{i=1}^m (x_i-\mu)(x_i-\mu)^T \tag{1.7}$$
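Equations (1.5) and (1.7) say that the maximum-likelihood estimates are simply the sample mean and the biased sample covariance. A minimal numeric check on synthetic data (the true parameters here are arbitrary):

import numpy as np

X = np.random.multivariate_normal([1.0, 3.0], [[2.0, 0.3], [0.3, 1.0]], size=5000)
mu_hat = np.mean(X, axis=0)  # equation (1.5)
Sigma_hat = (X - mu_hat).T @ (X - mu_hat) / len(X)  # equation (1.7), note the 1/m factor
print(mu_hat)  # close to [1.0, 3.0]
print(Sigma_hat)  # close to the true covariance; np.cov(X, rowvar=False, bias=True) matches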

2 The Gaussian Mixture Model

  Suppose we have a random variable X and the mixture consists of K Gaussian components (i.e., the data contain K classes). The probability density function of the GMM is:
$$P(x) = \sum_{k=1}^K p(k)\,p(x|k) = \sum_{k=1}^K \pi_k N(x;\mu_k,\Sigma_k) \tag{2.1}$$
  where $p(x|k) = N(x;\mu_k,\Sigma_k)$ is the density of the k-th Gaussian component and $\pi_k$ is the weight of the k-th component, interpreted as the prior probability of selecting it, with $\sum_{k=1}^K \pi_k = 1$.
The joint probability of the m sample points is:
$$L(\mu,\Sigma,\pi) = \prod_{i=1}^m p(x_i) = \prod_{i=1}^m \left[\sum_{j=1}^K \pi_j N(x_i;\mu_j,\Sigma_j)\right] \tag{2.2}$$
The log-likelihood is:
$$\ell(\theta) = \log L(\mu,\Sigma,\pi) = \sum_{i=1}^m \log \sum_{j=1}^K \pi_j N(x_i;\mu_j,\Sigma_j) \tag{2.3}$$
Setting the derivative of this likelihood to zero does not give a closed-form solution for the parameters, because of the sum inside the logarithm. Below we use the EM algorithm to estimate the parameters of the Gaussian mixture.
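Although (2.3) has no closed-form maximizer, it is cheap to evaluate numerically, which is how EM's progress is usually monitored. A sketch under my own parameter names (logsumexp is used for numerical stability):

import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pis, mus, Sigmas):
    # log p(x_i) = logsumexp_j [log pi_j + log N(x_i; mu_j, Sigma_j)], summed over i
    log_terms = np.column_stack([
        np.log(pj) + multivariate_normal(mj, Sj).logpdf(X)
        for pj, mj, Sj in zip(pis, mus, Sigmas)])
    return np.sum(logsumexp(log_terms, axis=1))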
Before presenting the derivation, we first introduce a few facts:

Let X be a discrete random variable with probability mass function p(x), and let g(x) be a function of X.
Expectation of a discrete variable: $E[g(X)] = \sum_i g(x_i)p(x_i)$
Jensen's inequality: if f is a convex function and X is a random variable, then $E[f(X)] \geq f(E[X])$.
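For the concave log (used below) the inequality direction flips, so $E[\log X] \leq \log E[X]$; a two-line numeric sanity check with arbitrary values:

import numpy as np

x = np.array([0.5, 1.0, 4.0])
p = np.array([0.2, 0.5, 0.3])  # a probability distribution over the values of X
print(np.sum(p * np.log(x)), np.log(np.sum(p * x)))  # E[log X] <= log E[X]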

  • E-step
    We introduce latent indicator variables and represent each sample point as:
    $(x_i, z_1, z_2, \ldots, z_K), \qquad i = 1,2,\ldots,m$
    where $z_j \in \{0,1\}$; $z_j = 1$ indicates that sample $x_i$ comes from the j-th Gaussian component, and $\sum_{j=1}^K z_j = 1$.
    The likelihood of the sample set is:
    $$L(\theta) = \prod_{i=1}^m p(x_i;\theta) = \prod_{i=1}^m \sum_{j=1}^K p(x_i,z_j;\theta) \tag{2.4}$$
    and the log-likelihood is:
    $$\begin{aligned} \ell(\theta) &= \log L(\theta) = \sum_{i=1}^m \log\left(\sum_{j=1}^K p(x_i,z_j;\theta)\right) \\ &= \sum_{i=1}^m \log \sum_{j=1}^K Q_i(z_j)\frac{p(x_i,z_j;\theta)}{Q_i(z_j)} \end{aligned} \tag{2.5}$$
    where $\sum_{j=1}^K Q_i(z_j) = 1$ and $Q_i(z_j) \geq 0$. Let:
    $$p(z_{ij}) = Q_i(z_j) \qquad f(z_{ij}) = \frac{p(x_i,z_j;\theta)}{Q_i(z_j)}$$
    By the expectation formula $E[f(X)] = \sum_i f(x_i)p(x_i)$, we see that
    $$\sum_{j=1}^K Q_i(z_j)\frac{p(x_i,z_j;\theta)}{Q_i(z_j)} \text{ is the expectation of } \frac{p(x_i,z_j;\theta)}{Q_i(z_j)} \text{ under } Q_i \tag{2.6}$$
    Since $f(x) = \log x$ is concave, Jensen's inequality holds with the direction reversed:
    $$\ell(\theta) = \sum_{i=1}^m \log\left[E\left(\frac{p(x_i,z_j;\theta)}{Q_i(z_j)}\right)\right] \geq \sum_{i=1}^m E\left[\log\frac{p(x_i,z_j;\theta)}{Q_i(z_j)}\right] \tag{2.7}$$
    By the definition of expectation, the right-hand side expands to:
    $$\sum_{i=1}^m E\left[\log\frac{p(x_i,z_j;\theta)}{Q_i(z_j)}\right] = \sum_{i=1}^m \sum_{j=1}^K Q_i(z_j)\log\frac{p(x_i,z_j;\theta)}{Q_i(z_j)} \tag{2.8}$$
    This yields a lower bound on the log-likelihood as a function of $\theta$:
    $$\ell(\theta) \geq \sum_{i=1}^m \sum_{j=1}^K Q_i(z_j)\log\frac{p(x_i,z_j;\theta)}{Q_i(z_j)} = J(\theta) \tag{2.9}$$
    Suppose $\theta$ is given; then the value of the bound depends on $Q_i(z_j)$ and $p(x_i,z_j;\theta)$. We can adjust these two probabilities to raise the lower bound $J(\theta)$ toward $\ell(\theta)$ until the bound equals $\ell(\theta)$. We then fix $Q_i(z_j)$ and maximize $J(\theta)$ over $\theta$ by maximum likelihood, obtaining a new $\theta^{t+1}$; then we fix $\theta$ and adjust $Q_i(z_j)$ again, and so on, until convergence to a maximizer $\theta^*$ of the likelihood $\ell(\theta)$. (Figure: schematic of the alternating lower-bound maximization.)

Next we look for the condition that makes the bound tight; equality in the bound is exactly equality in Jensen's inequality:
$$\sum_{i=1}^m \log\left[E\left(\frac{p(x_i,z_j;\theta)}{Q_i(z_j)}\right)\right] \geq \sum_{i=1}^m E\left[\log\frac{p(x_i,z_j;\theta)}{Q_i(z_j)}\right] \tag{2.10}$$
Equality holds above if and only if:
$$\begin{aligned} & \frac{p(x_i,z_j;\theta)}{Q_i(z_j)} = C \\ & \Rightarrow p(x_i,z_j;\theta) = C\,Q_i(z_j) \\ & \Rightarrow \sum_{j=1}^K p(x_i,z_j;\theta) = C\sum_{j=1}^K Q_i(z_j) \qquad \because \sum_{j=1}^K Q_i(z_j) = 1 \\ & \Rightarrow \sum_{j=1}^K p(x_i,z_j;\theta) = p(x_i;\theta) = C \\ & \therefore\; Q_i(z_j) = \frac{p(x_i,z_j;\theta)}{\sum_{j=1}^K p(x_i,z_j;\theta)} = \frac{p(x_i,z_j;\theta)}{p(x_i;\theta)} = p(z_j|x_i;\theta) \end{aligned} \tag{2.11}$$
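In code, (2.11) says the E-step simply normalizes the joint probabilities $\pi_j N(x_i;\mu_j,\Sigma_j)$ across components. A vectorized sketch (the names are my own; the implementation in Section 4 uses a loop-based variant of the same computation):

import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pis, mus, Sigmas):
    # joint[i, j] = pi_j * N(x_i; mu_j, Sigma_j)
    joint = np.column_stack([
        pj * multivariate_normal(mj, Sj).pdf(X)
        for pj, mj, Sj in zip(pis, mus, Sigmas)])
    # Q[i, j] = p(z_j | x_i; theta); each row sums to one
    return joint / joint.sum(axis=1, keepdims=True)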

  • M-step
    The E-step gives us the distribution $Q_i(z_j)$ of $z_j$ for each sample i. With $Q_i(z_j)$ now known and held fixed, we maximize the lower bound $J(\theta)$:
    $$J(\theta) = \sum_{i=1}^m \sum_{j=1}^K Q_i(z_j)\log\frac{p(x_i,z_j;\theta)}{Q_i(z_j)} \tag{2.12}$$
    We first expand $J(\theta)$ and then take partial derivatives with respect to each parameter:
    $$\begin{aligned} J(\theta) &= \sum_{i=1}^m \sum_{j=1}^K Q_i(z_j)\log\frac{p(x_i,z_j;\theta)}{Q_i(z_j)} \\ &= \sum_{i=1}^m \sum_{j=1}^K \Big[ Q_i(z_j)\log p(x_i,z_j;\theta) - Q_i(z_j)\log Q_i(z_j) \Big] \\ &= \sum_{i=1}^m \sum_{j=1}^K \Big[ Q_i(z_j)\log\big(\pi_j N(x_i;\mu_j,\Sigma_j)\big) - Q_i(z_j)\log Q_i(z_j) \Big] \\ &= \sum_{i=1}^m \sum_{j=1}^K \left\{ Q_i(z_j)\log\left[\pi_j \frac{1}{(2\pi)^{\frac{D}{2}}|\Sigma_j|^{\frac{1}{2}}}\exp\left(-\frac{1}{2}(x_i-\mu_j)^T\Sigma_j^{-1}(x_i-\mu_j)\right)\right] - Q_i(z_j)\log Q_i(z_j) \right\} \\ &= \sum_{i=1}^m \sum_{j=1}^K \left\{ Q_i(z_j)\left[\log\pi_j - \frac{D}{2}\log 2\pi - \frac{1}{2}\log|\Sigma_j| - \frac{1}{2}(x_i-\mu_j)^T\Sigma_j^{-1}(x_i-\mu_j)\right] - Q_i(z_j)\log Q_i(z_j) \right\} \end{aligned} \tag{2.13}$$
    π p \pi_p πp求偏导,需要使用拉格朗日乘子法:
    $$\begin{aligned} \frac{\partial J(\theta)}{\partial \pi_p} &= \partial\left[\sum_{i=1}^m\sum_{j=1}^K Q_i(z_j)\log\pi_j + \lambda\Big(1-\sum_{j=1}^K \pi_j\Big)\right]\Big/\partial\pi_p \\ &= \frac{\sum_{i=1}^m Q_i(z_p)}{\pi_p} - \lambda = 0 \\ &\Rightarrow \sum_{i=1}^m Q_i(z_p) = \lambda\pi_p \\ &\Rightarrow \sum_{i=1}^m\sum_{j=1}^K Q_i(z_j) = \lambda\sum_{j=1}^K \pi_j \\ &\because \sum_{j=1}^K Q_i(z_j) = 1, \qquad \sum_{j=1}^K \pi_j = 1 \\ &\therefore\; \lambda = m \quad\Rightarrow\quad \pi_p = \frac{\sum_{i=1}^m Q_i(z_p)}{m} \end{aligned} \tag{2.14}$$
    μ p \mu_p μp求偏导,只看和 μ p \mu_p μp有关的部分:
    $$\begin{aligned} \frac{\partial J(\theta)}{\partial \mu_p} &= \partial\left[\sum_{i=1}^m\sum_{j=1}^K Q_i(z_j)\left(-\frac{1}{2}(x_i-\mu_j)^T\Sigma_j^{-1}(x_i-\mu_j)\right)\right]\Big/\partial\mu_p \\ &= \sum_{i=1}^m -\frac{1}{2}Q_i(z_p)\frac{\partial (x_i-\mu_p)^T\Sigma_p^{-1}(x_i-\mu_p)}{\partial \mu_p} \qquad \because \frac{\partial\, \bm{u}^T\bm{v}}{\partial \bm{x}} = \frac{\partial \bm{u}}{\partial \bm{x}}\bm{v} + \frac{\partial \bm{v}}{\partial \bm{x}}\bm{u} \\ &= \sum_{i=1}^m -\frac{1}{2}Q_i(z_p)\left[\frac{\partial(x_i-\mu_p)}{\partial \mu_p}\Sigma_p^{-1}(x_i-\mu_p) + \frac{\partial(\Sigma_p^{-1}x_i - \Sigma_p^{-1}\mu_p)}{\partial \mu_p}(x_i-\mu_p)\right] \\ &= \sum_{i=1}^m -\frac{1}{2}Q_i(z_p)\left[-\Sigma_p^{-1}(x_i-\mu_p) - (\Sigma_p^{-1})^T(x_i-\mu_p)\right] \\ &= \sum_{i=1}^m Q_i(z_p)\,\Sigma_p^{-1}(x_i-\mu_p) = 0 \\ &\Rightarrow \mu_p = \frac{\sum_{i=1}^m Q_i(z_p)\,x_i}{\sum_{i=1}^m Q_i(z_p)} \end{aligned} \tag{2.15}$$
    Σ p \Sigma_p Σp求偏导,只看和 Σ p \Sigma_p Σp有关的部分:
    $$\begin{aligned} \frac{\partial J(\theta)}{\partial \Sigma_p} &= \partial\left[\sum_{i=1}^m\sum_{j=1}^K Q_i(z_j)\left(-\frac{1}{2}\log|\Sigma_j| - \frac{1}{2}(x_i-\mu_j)^T\Sigma_j^{-1}(x_i-\mu_j)\right)\right]\Big/\partial\Sigma_p \\ &= \partial\left[\sum_{i=1}^m -\frac{1}{2}Q_i(z_p)\left(\log|\Sigma_p| + (x_i-\mu_p)^T\Sigma_p^{-1}(x_i-\mu_p)\right)\right]\Big/\partial\Sigma_p \\ &\because \frac{\partial \log|\Sigma_p|}{\partial \Sigma_p} = \frac{1}{|\Sigma_p|}\frac{\partial|\Sigma_p|}{\partial \Sigma_p} = \frac{1}{|\Sigma_p|}|\Sigma_p|(\Sigma_p^{-1})^T = \Sigma_p^{-1} \\ &\because \frac{\partial\left[(x_i-\mu_p)^T\Sigma_p^{-1}(x_i-\mu_p)\right]}{\partial \Sigma_p} = \frac{\partial\, tr\left[\Sigma_p^{-1}(x_i-\mu_p)(x_i-\mu_p)^T\right]}{\partial \Sigma_p} = -\Sigma_p^{-1}(x_i-\mu_p)(x_i-\mu_p)^T\Sigma_p^{-1} \\ \frac{\partial J(\theta)}{\partial \Sigma_p} &= \sum_{i=1}^m -\frac{1}{2}Q_i(z_p)\left(\Sigma_p^{-1} - \Sigma_p^{-1}(x_i-\mu_p)(x_i-\mu_p)^T\Sigma_p^{-1}\right) = 0 \\ &\Rightarrow \sum_{i=1}^m Q_i(z_p)\Sigma_p^{-1} = \sum_{i=1}^m Q_i(z_p)\Sigma_p^{-1}(x_i-\mu_p)(x_i-\mu_p)^T\Sigma_p^{-1} \quad (\text{multiply by }\Sigma_p\text{ on the right}) \\ &\Rightarrow \sum_{i=1}^m Q_i(z_p)\,I = \sum_{i=1}^m Q_i(z_p)\Sigma_p^{-1}(x_i-\mu_p)(x_i-\mu_p)^T \quad (\text{multiply by }\Sigma_p\text{ on the left}) \\ &\Rightarrow \sum_{i=1}^m Q_i(z_p)\Sigma_p = \sum_{i=1}^m Q_i(z_p)(x_i-\mu_p)(x_i-\mu_p)^T \\ &\Rightarrow \Sigma_p = \frac{\sum_{i=1}^m Q_i(z_p)(x_i-\mu_p)(x_i-\mu_p)^T}{\sum_{i=1}^m Q_i(z_p)} \end{aligned} \tag{2.16}$$

Σ p \Sigma_p Σp求偏导的另一种方式,对 Σ p − 1 求 偏 导 \Sigma_p^{-1}求偏导 Σp1
$$\begin{aligned} \frac{\partial J(\theta)}{\partial \Sigma_p^{-1}} &= \partial\left[\sum_{i=1}^m\sum_{j=1}^K Q_i(z_j)\left(-\frac{1}{2}\log|\Sigma_j| - \frac{1}{2}(x_i-\mu_j)^T\Sigma_j^{-1}(x_i-\mu_j)\right)\right]\Big/\partial\Sigma_p^{-1} \\ &= \partial\left[\sum_{i=1}^m\sum_{j=1}^K Q_i(z_j)\left(\frac{1}{2}\log|\Sigma_j^{-1}| - \frac{1}{2}(x_i-\mu_j)^T\Sigma_j^{-1}(x_i-\mu_j)\right)\right]\Big/\partial\Sigma_p^{-1} \\ &= \partial\left[\sum_{i=1}^m \frac{1}{2}Q_i(z_p)\left(\log|\Sigma_p^{-1}| - (x_i-\mu_p)^T\Sigma_p^{-1}(x_i-\mu_p)\right)\right]\Big/\partial\Sigma_p^{-1} \\ &= \sum_{i=1}^m \frac{1}{2}Q_i(z_p)\left[\left((\Sigma_p^{-1})^{-1}\right)^T - \frac{\partial\, tr\left[(x_i-\mu_p)^T\Sigma_p^{-1}(x_i-\mu_p)\right]}{\partial \Sigma_p^{-1}}\right] \\ &= \sum_{i=1}^m \frac{1}{2}Q_i(z_p)\left[\Sigma_p - \frac{\partial\, tr\left[(x_i-\mu_p)(x_i-\mu_p)^T\Sigma_p^{-1}\right]}{\partial \Sigma_p^{-1}}\right] \\ &= \sum_{i=1}^m \frac{1}{2}Q_i(z_p)\left[\Sigma_p - (x_i-\mu_p)(x_i-\mu_p)^T\right] = 0 \\ &\Rightarrow \Sigma_p = \frac{\sum_{i=1}^m Q_i(z_p)(x_i-\mu_p)(x_i-\mu_p)^T}{\sum_{i=1}^m Q_i(z_p)} \end{aligned} \tag{2.17}$$
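The three update rules (2.14)-(2.16) translate directly into NumPy. A compact sketch under the same notation, where Q is the m-by-K responsibility matrix from the E-step:

import numpy as np

def m_step(X, Q):
    m = X.shape[0]
    Nk = Q.sum(axis=0)  # effective number of samples per component
    pis = Nk / m  # equation (2.14)
    mus = (Q.T @ X) / Nk[:, None]  # equation (2.15)
    Sigmas = []
    for j in range(Q.shape[1]):
        diff = X - mus[j]
        Sigmas.append((Q[:, j, None] * diff).T @ diff / Nk[j])  # equation (2.16)
    return pis, mus, Sigmas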

3 The EM Algorithm

Having derived the theory above, we now summarize the implementation steps of the EM algorithm:

  1. Initialize K and the parameters $\theta = \{\pi_j, \mu_j, \Sigma_j\}, \quad j = 1,2,\ldots,K$.
  2. E-step: using the current parameters $\pi_j, \mu_j, \Sigma_j$, compute the posterior probabilities $Q_i(z_j)$:
    $$Q_i(z_j) = \frac{\pi_j N(x_i;\mu_j,\Sigma_j)}{\sum_{p=1}^K \pi_p N(x_i;\mu_p,\Sigma_p)} \tag{3.1}$$
  3. M-step: using the $Q_i(z_j)$ computed in the E-step, re-estimate the parameters $\theta = \{\pi_j, \mu_j, \Sigma_j\}, \quad j = 1,2,\ldots,K$ with the updates (2.14)-(2.16).
  4. Compute the log-likelihood and check whether its value has converged; if so, stop, otherwise return to step 2.

Note: solving a Gaussian mixture with EM is not guaranteed to find the global optimum; it may converge to a local optimum.
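A common mitigation is to run EM from several random initializations and keep the run with the highest likelihood. For reference, scikit-learn's GaussianMixture does this through its n_init parameter; a minimal usage sketch on placeholder data:

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.multivariate_normal([0, 0], np.eye(2), size=200)  # placeholder data
gm = GaussianMixture(n_components=3, n_init=10).fit(X)  # 10 random restarts, keep the best
print(gm.weights_, gm.means_)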

4 Code Implementation

import numpy as np
import matplotlib.pyplot as plt

imgnum = 0  # global counter used to number the saved figures

# Gaussian probability density function
def gaussian(x, Mean, Cov_matrix):
    dim = np.shape(Cov_matrix)[0]  # dimension
    # A small ridge is added to the diagonal to guard against a singular covariance matrix
    covdet = np.linalg.det(Cov_matrix + np.eye(dim) * 0.01)  # determinant of the covariance matrix
    covinv = np.linalg.inv(Cov_matrix + np.eye(dim) * 0.01)  # inverse of the covariance matrix
    xdiff = x - Mean
    # density value, equation (1.2)
    prob = 1.0 / np.power(2 * np.pi, 1.0 * dim / 2) / np.sqrt(np.abs(covdet)) * np.exp(
        -0.5 * xdiff.T @ covinv @ xdiff)
    return prob
def GMM(X, K, iter_num=10):
    global imgnum
    m, n = X.shape
    # initialize parameters: equal weights, K distinct random samples as means, identity covariances
    pi = np.full(K, 1.0 / K)
    mu = [X[i] for i in np.random.choice(m, K, replace=False)]
    sigma = [np.eye(n) for _ in range(K)]
    Q = np.zeros((m, K))
    for itn in range(iter_num):
        oldpi, oldmu, oldsigma = np.ravel(pi).copy(), np.ravel(mu).copy(), np.ravel(sigma).copy()
        plt.close()
        plt.plot(X[:, 0], X[:, 1], '.')
        # E-step: responsibilities Q[i][j] = p(z_j | x_i; theta), equation (3.1)
        for i in range(m):
            pxz = [pi[j] * gaussian(X[i], mu[j], sigma[j]) for j in range(K)]
            Q[i] = pxz / np.sum(pxz)
        # M-step: re-estimate weights, means and covariances, equations (2.14)-(2.16)
        Qcol = np.sum(Q, axis=0)  # effective number of samples assigned to each component
        pi = Qcol / m
        mu = [np.sum(Q[:, j:j + 1] * X, axis=0) / Qcol[j] for j in range(K)]
        for j in range(K):
            xdif = X - mu[j]
            sigma[j] = 1.0 / Qcol[j] * np.sum(
                [Q[i][j] * (xdif[i:i + 1].T @ xdif[i:i + 1]) for i in range(m)], axis=0)
        # plot the mean of each Gaussian component
        for j in range(K):
            plt.plot(mu[j][0], mu[j][1], marker='o', c='r')
        plt.savefig(str(imgnum) + ".png")
        imgnum += 1
        plt.show()
        # stop once the parameters barely change between iterations
        curerror = (np.sum((oldpi - np.ravel(pi)) ** 2) + np.sum((oldmu - np.ravel(mu)) ** 2)
                    + np.sum((oldsigma - np.ravel(sigma)) ** 2))
        print("iteration {:d}, squared parameter change {:.6f}".format(itn, curerror))
        if curerror < 0.0001:
            break
    return pi, mu, sigma
if __name__ == '__main__':
    # generate three clusters of Gaussian-distributed samples
    mean = [2, 2]
    cov = [[1, 0], [0, 1]]
    s1 = np.random.multivariate_normal(mean, cov, 80, "raise")
    mean = [10, 8]
    cov = [[1, 0], [0, 4]]
    s2 = np.random.multivariate_normal(mean, cov, 100, "raise")
    mean = [4, 13]
    cov = [[2, 1], [1, 3]]
    s3 = np.random.multivariate_normal(mean, cov, 120, "raise")

    sall = np.vstack((s1, s2, s3))
    np.random.shuffle(sall)
    # number of Gaussian components
    K = 3
    # maximum number of iterations
    iter_numbers = 60
    pi, mu, sigma = GMM(sall, K, iter_numbers)
    pointnum = 100
    xrange = np.linspace(sall[:, 0].min(), sall[:, 0].max(), pointnum)
    yrange = np.linspace(sall[:, 1].min(), sall[:, 1].max(), pointnum)
    XX, YY = np.meshgrid(xrange, yrange)
    plt.close()
    plt.scatter(s1[:, 0], s1[:, 1], marker='.', c='r')
    plt.scatter(s2[:, 0], s2[:, 1], marker='o', c='b')
    plt.scatter(s3[:, 0], s3[:, 1], marker='x', c='g')
    plt_data = np.dstack((XX, YY))
    for j in range(K):
        Z = [[gaussian(plt_data[q][p], mu[j], sigma[j]) for p in range(pointnum)] for q in range(pointnum)]
        cs = plt.contour(XX, YY, Z)
        plt.clabel(cs)
    plt.savefig(str(imgnum) + ".png")
    plt.show()

