
PRML Chapter 9 Mixture Models and EM

9.1 K-means Clustering

Consider the problem of identifying groups or clusters of data points in a multidimensional space.

To describe the assignment of data points to clusters, we define an objective function (distortion measure):

$$J = \sum_{n=1}^{N}\sum_{k=1}^{K}r_{nk}||x_{n}-\mu_{k}||^{2}$$

To find $r_{nk}$ and $\mu_k$ so as to minimize $J$, we use an iterative procedure in which each iteration involves two successive steps, corresponding to successive optimizations with respect to $r_{nk}$ and $\mu_k$. Minimizing $J$ with respect to $r_{nk}$ with the $\mu_k$ fixed simply assigns each data point to its nearest prototype; minimizing with respect to $\mu_k$ with the $r_{nk}$ fixed gives:

$$\mu_{k} = \frac{\sum_{n}r_{nk}x_{n}}{\sum_{n}r_{nk}}$$
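As a concrete illustration, here is a minimal NumPy sketch of this two-step iteration; the function name `kmeans` and the initialization by randomly chosen data points are my own choices, not taken from the book.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal batch K-means: alternate hard assignments and mean updates."""
    rng = np.random.default_rng(seed)
    # Initialize the prototypes mu_k with K randomly chosen data points.
    mu = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(n_iters):
        # Step 1: minimize J w.r.t. r_nk with mu fixed (assign each point to its nearest prototype).
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K) squared distances
        r = d2.argmin(axis=1)                                     # hard assignments
        # Step 2: minimize J w.r.t. mu_k with r_nk fixed (cluster means).
        for k in range(K):
            if np.any(r == k):
                mu[k] = X[r == k].mean(axis=0)
    return mu, r
```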

We can also derive an on-line stochastic algorithm by applying the Robbins-Monro procedure to the problem, giving the sequential update:

$$\mu_{k}^{new} = \mu_{k}^{old} + \eta_{n}(x_{n} - \mu_{k}^{old})$$
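A sketch of one pass of this sequential update follows; the count-based learning rate $\eta_n = 1/N_k$ is one common choice and is my own assumption here.

```python
import numpy as np

def kmeans_online_step(x_n, mu, counts):
    """Update the nearest prototype towards a single new data point x_n."""
    k = int(np.argmin(((mu - x_n) ** 2).sum(axis=1)))  # index of the nearest prototype
    counts[k] += 1
    eta = 1.0 / counts[k]                              # decaying learning rate eta_n (one possible schedule)
    mu[k] += eta * (x_n - mu[k])                       # mu_k^new = mu_k^old + eta_n (x_n - mu_k^old)
    return mu, counts
```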

We can generalize the K-means algorithm by introducing a more general dissimilarity measure:

$$\tilde{J} = \sum_{n=1}^{N}\sum_{k=1}^{K}r_{nk}\nu(x_{n}, \mu_{k})$$

which gives the K-medoids algorithm.

9.1.1 Image segmentation and compression

9.2 Mixtures of Gaussians

We have already mentioned the Gaussian mixture model as a simple linear superposition of Gaussian components; we now turn to a formulation of Gaussian mixtures in terms of discrete latent variables:

$$p(x) = \sum_{k=1}^{K}\pi_{k}\mathcal{N}(x | \mu_{k}, \Sigma_{k})$$

We introduce a $K$-dimensional binary latent variable $z$ with a 1-of-$K$ representation, in which exactly one element $z_k$ equals 1 and all others equal 0, so that $z_k \in \{0, 1\}$ and $\sum_{k} z_{k} = 1$. The marginal distribution over $z$ is specified in terms of the mixing coefficients $\pi_k$:

$$p(z_{k} = 1) = \pi_{k}$$

where the parameters { π k } \{\pi_k\} {πk} must satisfy:

$$0 \leq \pi_{k} \leq 1$$

$$\sum_{k=1}^{K}\pi_{k} = 1$$

in order to be valid probabilities. Because $z$ uses a 1-of-$K$ representation, we can also write this distribution in the form:

$$p(z) = \prod_{k=1}^{K}\pi_{k}^{z_{k}}$$

Similarly, the conditional distribution of $x$ given a particular value of $z$ is a Gaussian:

$$p(x | z) = \prod_{k=1}^{K}\mathcal{N}(x | \mu_{k}, \Sigma_{k})^{z_{k}}$$

The joint distribution is given by $p(z)p(x | z)$, and the marginal distribution of $x$ is then obtained by summing the joint distribution over all possible states of $z$ to give:

$$p(x) = \sum_{z}p(z)p(x | z) = \sum_{k=1}^{K}\pi_{k}\mathcal{N}(x | \mu_{k}, \Sigma_{k})$$

The corresponding posterior probability $\gamma(z_k)$, also called the responsibility that component $k$ takes for explaining the observation $x$, is:

$$\gamma(z_{k}) \equiv p(z_{k} = 1 | x) = \frac{p(z_{k}=1)p(x | z_{k}=1)}{\sum_{j=1}^{K}p(z_{j}=1)p(x | z_{j} = 1)} = \frac{\pi_{k}\mathcal{N}(x | \mu_{k}, \Sigma_{k})}{\sum_{j=1}^{K}\pi_{j}\mathcal{N}(x | \mu_{j},\Sigma_{j})}$$
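As a small sketch, the responsibilities for a whole data set can be computed with SciPy as follows; the function name `responsibilities` is mine.

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pi, mu, Sigma):
    """gamma[n, k] = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)."""
    K = len(pi)
    # Weighted component densities, one column per mixture component.
    dens = np.column_stack([
        pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k]) for k in range(K)
    ])
    return dens / dens.sum(axis=1, keepdims=True)
```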

9.2.1 Maximum likelihood

The log of the likelihood function is given by:

$$\ln p(\mathbf{X} | \pi, \mu, \Sigma) = \sum_{n=1}^{N}\ln\left\{\sum_{k=1}^{K}\pi_{k}\mathcal{N}(x_{n} | \mu_{k}, \Sigma_{k}) \right\}$$

9.2.2 EM for Gaussian mixtures

An elegant and powerful method for finding maximum likelihood solutions for models with latent variables is the expectation-maximization algorithm (EM algorithm).

Setting the derivative of the log likelihood with respect to the means $\mu_k$ to zero, we obtain:

$$\mu_{k} = \frac{1}{N_{k}}\sum_{n=1}^{N}\gamma(z_{nk})x_{n}$$

where $N_{k} = \sum_{n=1}^{N}\gamma(z_{nk})$.

Setting the derivative with respect to the covariance matrix $\Sigma_k$ to zero, we similarly obtain:

$$\Sigma_{k} = \frac{1}{N_{k}}\sum_{n=1}^{N}\gamma(z_{nk})(x_{n} - \mu_{k})(x_{n} - \mu_{k})^{T}$$

For the mixing coefficients (using a Lagrange multiplier to enforce $\sum_{k}\pi_{k} = 1$) we obtain:

$$\pi_{k} = \frac{N_{k}}{N}$$

EM for Gaussian Mixtures

So, given a Gaussian mixture model, the goal is to maximize the likelihood function with respect to the parameters (the means and covariances of the components, and the mixing coefficients).

1. Initialize the means $\mu_k$, covariances $\Sigma_k$ and mixing coefficients $\pi_k$, and evaluate the initial value of the log likelihood.
2. E step. Evaluate the responsibilities using the current parameter values:

$$\gamma(z_{nk}) = \frac{\pi_{k}\mathcal{N}(x_{n} | \mu_{k}, \Sigma_{k})}{\sum_{j=1}^{K}\pi_{j}\mathcal{N}(x_{n} | \mu_{j}, \Sigma_{j})}$$

3. M step. Re-estimate the parameters using the current responsibilities:

$$\mu_{k}^{new} = \frac{1}{N_{k}}\sum_{n=1}^{N}\gamma(z_{nk})x_{n}$$

$$\Sigma_{k}^{new} = \frac{1}{N_{k}}\sum_{n=1}^{N}\gamma(z_{nk})(x_{n} - \mu_{k}^{new})(x_{n} - \mu_{k}^{new})^{T}$$

$$\pi_{k}^{new} = \frac{N_{k}}{N}$$

where

$$N_{k} = \sum_{n=1}^{N}\gamma(z_{nk})$$

4. Evaluate the log likelihood

$$\ln p(\mathbf{X} | \mu, \Sigma, \pi) = \sum_{n=1}^{N}\ln\left\{\sum_{k=1}^{K}\pi_{k}\mathcal{N}(x_{n} | \mu_{k}, \Sigma_{k}) \right\}$$

and check for convergence of either the parameters or the log likelihood. If the convergence criterion is not satisfied, return to step 2.
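Putting these four steps together, a minimal NumPy/SciPy sketch of the full EM loop for a Gaussian mixture might look as follows; the initialization scheme, the small regularization of the covariances, and the tolerance `tol` are my own choices rather than anything prescribed by the book.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, max_iters=200, tol=1e-6, seed=0):
    """EM for a K-component Gaussian mixture on data X of shape (N, D)."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    # 1. Initialize mu_k, Sigma_k, pi_k and evaluate the initial log likelihood.
    mu = X[rng.choice(N, size=K, replace=False)].copy()
    Sigma = np.stack([np.cov(X, rowvar=False) + 1e-6 * np.eye(D) for _ in range(K)])
    pi = np.full(K, 1.0 / K)

    def weighted_densities():
        # Column k holds pi_k * N(x_n | mu_k, Sigma_k) for every data point.
        return np.column_stack([
            pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k]) for k in range(K)
        ])

    old_ll = np.log(weighted_densities().sum(axis=1)).sum()
    ll = old_ll
    for _ in range(max_iters):
        # 2. E step: responsibilities gamma(z_nk).
        dens = weighted_densities()
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # 3. M step: re-estimate the parameters from the responsibilities.
        Nk = gamma.sum(axis=0)                       # effective number of points per component
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N
        # 4. Evaluate the log likelihood with the updated parameters and check convergence.
        ll = np.log(weighted_densities().sum(axis=1)).sum()
        if abs(ll - old_ll) < tol:
            break
        old_ll = ll
    return pi, mu, Sigma, ll
```

For practical use, `sklearn.mixture.GaussianMixture` implements essentially the same procedure with more careful initialization and numerics.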

9.3 An Alternative View of EM

The goal of the EM algorithm is to find maximum likelihood solutions for models having latent variables. Denoting the set of all model parameters by $\theta$, the log likelihood function is given by:

$$\ln p(\mathbf{X} | \theta) = \ln\left\{\sum_{\mathbf{Z}}p(\mathbf{X}, \mathbf{Z} | \theta) \right\}$$

In the E step, we use the current parameter values $\theta^{old}$ to find the posterior distribution $p(\mathbf{Z} | \mathbf{X}, \theta^{old})$ of the latent variables, and then use it to evaluate the expectation of the complete-data log likelihood for a general parameter value $\theta$:

$$\mathcal{Q}(\theta, \theta^{old}) = \sum_{\mathbf{Z}}p(\mathbf{Z} | \mathbf{X}, \theta^{old})\ln p(\mathbf{X}, \mathbf{Z} | \theta)$$

In the M step, we determine the revised parameter estimate $\theta^{new}$ by maximizing this function:

$$\theta^{new} = \arg\max_{\theta}\mathcal{Q}(\theta, \theta^{old})$$
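In code, the general scheme is simply an alternation of these two steps until the parameters or the log likelihood stop changing; the helpers `e_step`, `m_step` and `log_likelihood` below are hypothetical placeholders for whatever model is being fitted.

```python
def em(theta, e_step, m_step, log_likelihood, max_iters=100, tol=1e-8):
    """Generic EM skeleton: alternate the E step and the M step until convergence.

    e_step(theta)         -> statistics of p(Z | X, theta_old)          (hypothetical helper)
    m_step(stats)         -> theta maximizing Q(theta, theta_old)       (hypothetical helper)
    log_likelihood(theta) -> ln p(X | theta), used to monitor convergence
    """
    old_ll = float("-inf")
    for _ in range(max_iters):
        stats = e_step(theta)          # E step: posterior over the latent variables
        theta = m_step(stats)          # M step: theta_new = argmax_theta Q(theta, theta_old)
        ll = log_likelihood(theta)
        if abs(ll - old_ll) < tol:
            break
        old_ll = ll
    return theta
```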

9.3.1 Gaussian mixtures revisited

The complete-data log likelihood function takes the form:

$$\ln p(\mathbf{X}, \mathbf{Z} | \mu, \Sigma, \pi) = \sum_{n=1}^{N}\sum_{k=1}^{K}z_{nk}\left\{\ln \pi_{k} + \ln\mathcal{N}(x_{n} | \mu_{k}, \Sigma_{k}) \right\}$$

Using Bayes’ theorem, we see that the posterior distribution takes the form:

$$p(\mathbf{Z} | \mathbf{X}, \mu, \Sigma, \pi) \propto \prod_{n=1}^{N}\prod_{k=1}^{K}\left[\pi_{k}\mathcal{N}(x_{n} | \mu_{k}, \Sigma_{k})\right]^{z_{nk}}$$

The expected value of the complete-data log likelihood function is therefore given by:

$$E_{\mathbf{Z}}[\ln p(\mathbf{X}, \mathbf{Z} | \mu, \Sigma, \pi)] = \sum_{n=1}^{N}\sum_{k=1}^{K}\gamma(z_{nk})\left\{\ln \pi_{k} + \ln\mathcal{N}(x_{n} | \mu_{k}, \Sigma_{k}) \right\}$$

9.3.2 Relation to K-means

Consider a Gaussian mixture in which the covariance matrices of all components are given by $\varepsilon I$, where $\varepsilon$ is a shared variance parameter. The posterior probabilities (responsibilities) for a particular data point $x_n$ are then given by:

$$\gamma(z_{nk}) = \frac{\pi_{k}\exp\left(-||x_{n}-\mu_{k}||^{2}/2\varepsilon\right)}{\sum_{j}\pi_{j}\exp\left(-||x_{n}-\mu_{j}||^{2}/2\varepsilon\right)}$$
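In the limit $\varepsilon \to 0$, the responsibility for the component whose mean is closest to $x_n$ tends to 1 while all others tend to 0, so the E step reduces to the hard assignments $r_{nk}$ of K-means and the M step reduces to the K-means mean update. The small sketch below (the function name is mine) makes the responsibilities explicit as a function of $\varepsilon$; computing them in log space avoids underflow for small $\varepsilon$.

```python
import numpy as np

def soft_responsibilities(X, mu, pi, eps):
    """Responsibilities for a Gaussian mixture with shared isotropic covariances eps * I."""
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # ||x_n - mu_k||^2, shape (N, K)
    log_w = np.log(pi) - d2 / (2.0 * eps)
    log_w -= log_w.max(axis=1, keepdims=True)                  # stabilize before exponentiating
    w = np.exp(log_w)
    return w / w.sum(axis=1, keepdims=True)                    # rows approach one-hot as eps -> 0
```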

9.3.3 Mixtures of Bernoulli distributions

In order to derive the EM algorithm, we first write down the complete-data log likelihood function, which is given by:

$$\ln p(\mathbf{X}, \mathbf{Z} | \mu, \pi) = \sum_{n=1}^{N}\sum_{k=1}^{K}z_{nk}\left\{\ln \pi_{k} + \sum_{i=1}^{D}\left[x_{ni}\ln\mu_{ki} + (1-x_{ni})\ln(1-\mu_{ki})\right] \right\}$$

where $\mathbf{X} = \{x_n\}$ and $\mathbf{Z} = \{z_n\}$. Next we take the expectation of the complete-data log likelihood with respect to the posterior distribution of the latent variables to give:

$$E_{\mathbf{Z}}[\ln p(\mathbf{X}, \mathbf{Z} | \mu, \pi)] = \sum_{n=1}^{N}\sum_{k=1}^{K}\gamma(z_{nk})\left\{\ln\pi_{k} + \sum_{i=1}^{D}\left[x_{ni}\ln\mu_{ki} + (1-x_{ni})\ln(1-\mu_{ki})\right] \right\}$$
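Maximizing this expected complete-data log likelihood with respect to $\mu_k$ and $\pi_k$ gives $\mu_{k} = \frac{1}{N_{k}}\sum_{n}\gamma(z_{nk})x_{n}$ and $\pi_{k} = N_{k}/N$, in direct analogy with the Gaussian case. A minimal NumPy sketch of the resulting EM loop for binary data follows; all names and the clipping constant are my own.

```python
import numpy as np

def em_bernoulli_mixture(X, K, n_iters=50, seed=0):
    """EM for a mixture of multivariate Bernoulli distributions (X is binary, shape (N, D))."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    mu = rng.uniform(0.25, 0.75, size=(K, D))      # component probabilities mu_ki
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iters):
        # E step: responsibilities from log p(x_n | mu_k), computed in log space for stability.
        log_p = X @ np.log(mu).T + (1 - X) @ np.log(1 - mu).T    # (N, K)
        log_w = np.log(pi) + log_p
        log_w -= log_w.max(axis=1, keepdims=True)
        gamma = np.exp(log_w)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M step: mu_k = weighted mean of the data, pi_k = N_k / N.
        Nk = gamma.sum(axis=0)
        mu = np.clip((gamma.T @ X) / Nk[:, None], 1e-6, 1 - 1e-6)  # clip to avoid log(0)
        pi = Nk / N
    return pi, mu, gamma
```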

9.3.4 EM for Bayesian linear regression

The complete-data log likelihood function is then given by:

$$\ln p(\mathbf{t}, w | \alpha, \beta) = \ln p(\mathbf{t} | w, \beta) + \ln p(w | \alpha)$$

Taking the expectation with respect to the posterior distribution of w w w gives:

$$E[\ln p(\mathbf{t}, w | \alpha, \beta)] = \frac{M}{2}\ln\left(\frac{\alpha}{2\pi}\right) - \frac{\alpha}{2}E[w^{T}w] + \frac{N}{2}\ln\left(\frac{\beta}{2\pi}\right) - \frac{\beta}{2}\sum_{n=1}^{N}E\left[(t_{n} - w^{T}\phi_{n})^{2}\right]$$

Setting the derivatives with respect to $\alpha$ and $\beta$ to zero, we obtain the M-step re-estimation equations:

$$\alpha = \frac{M}{E[w^{T}w]} = \frac{M}{m_{N}^{T}m_{N} + Tr(S_{N})}$$

$$\frac{1}{\beta} = \frac{1}{N}\left(||\mathbf{t} - \mathbf{\Phi}m_{N}||^{2} + Tr\left[\mathbf{\Phi}^{T}\mathbf{\Phi}S_{N}\right]\right)$$
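Combined with the posterior mean and covariance of $w$ from Chapter 3, $S_{N} = (\alpha I + \beta\mathbf{\Phi}^{T}\mathbf{\Phi})^{-1}$ and $m_{N} = \beta S_{N}\mathbf{\Phi}^{T}\mathbf{t}$, these two updates form a simple EM loop over the hyperparameters. A hedged NumPy sketch (the function name and iteration count are mine):

```python
import numpy as np

def em_evidence(Phi, t, alpha=1.0, beta=1.0, n_iters=100):
    """EM re-estimation of alpha and beta for Bayesian linear regression.

    Phi: (N, M) design matrix, t: (N,) targets. The initial alpha, beta are arbitrary guesses.
    """
    N, M = Phi.shape
    for _ in range(n_iters):
        # E step: posterior over w given the current hyperparameters (PRML Chapter 3).
        S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
        m_N = beta * S_N @ Phi.T @ t
        # M step: re-estimate alpha and beta from the expectations above.
        alpha = M / (m_N @ m_N + np.trace(S_N))
        beta = N / (np.sum((t - Phi @ m_N) ** 2) + np.trace(Phi.T @ Phi @ S_N))
    return alpha, beta, m_N, S_N
```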

9.4 The EM Algorithm in General

Our goal is to maximize the likelihood function:

$$p(\mathbf{X} | \theta) = \sum_{\mathbf{Z}}p(\mathbf{X}, \mathbf{Z} | \theta)$$

Next we introduce a distribution q ( Z ) q(Z) q(Z) defined over the latent variables. For any choice of q ( Z ) q(Z) q(Z), the following decomposition holds:

$$\ln p(\mathbf{X} | \theta) = \mathcal{L}(q, \theta) + KL(q||p)$$

where we have defined

$$\mathcal{L}(q, \theta) = \sum_{\mathbf{Z}}q(\mathbf{Z})\ln\left\{\frac{p(\mathbf{X}, \mathbf{Z} | \theta)}{q(\mathbf{Z})} \right\}$$

$$KL(q||p) = -\sum_{\mathbf{Z}}q(\mathbf{Z})\ln\left\{\frac{p(\mathbf{Z} | \mathbf{X}, \theta)}{q(\mathbf{Z})} \right\}$$
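The decomposition follows directly from the product rule $p(\mathbf{X}, \mathbf{Z} | \theta) = p(\mathbf{Z} | \mathbf{X}, \theta)\,p(\mathbf{X} | \theta)$; adding the two terms gives:

$$\begin{aligned}
\mathcal{L}(q, \theta) + KL(q||p)
  &= \sum_{\mathbf{Z}} q(\mathbf{Z})\ln\frac{p(\mathbf{X}, \mathbf{Z} | \theta)}{q(\mathbf{Z})}
   - \sum_{\mathbf{Z}} q(\mathbf{Z})\ln\frac{p(\mathbf{Z} | \mathbf{X}, \theta)}{q(\mathbf{Z})} \\
  &= \sum_{\mathbf{Z}} q(\mathbf{Z})\ln\frac{p(\mathbf{X}, \mathbf{Z} | \theta)}{p(\mathbf{Z} | \mathbf{X}, \theta)}
   = \sum_{\mathbf{Z}} q(\mathbf{Z})\ln p(\mathbf{X} | \theta)
   = \ln p(\mathbf{X} | \theta).
\end{aligned}$$

Since $KL(q||p) \geq 0$, $\mathcal{L}(q, \theta)$ is a lower bound on $\ln p(\mathbf{X} | \theta)$: the E step maximizes $\mathcal{L}$ with respect to $q$ by setting $q(\mathbf{Z}) = p(\mathbf{Z} | \mathbf{X}, \theta^{old})$, which makes the KL term vanish, and the M step maximizes $\mathcal{L}$ with respect to $\theta$, which can only increase the log likelihood.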
