PRML Chapter 9 Mixture Models and EM
9.1 K-means Clustering
Consider the problem of identifying groups or clusters of data points in a multidimensional space.
To describe the assignment of data points to clusters, we define an objective function(distortion measure):
J = ∑ n = 1 N ∑ k = 1 K r n k ∣ ∣ x n − μ k ∣ ∣ 2 J = \sum_{n=1}^{N}\sum_{k=1}^{K}r_{nk}||x_{n}-\mu_{k}||^{2} J=n=1∑Nk=1∑Krnk∣∣xn−μk∣∣2
To find r n k r_{nk} rnk and μ k \mu_k μk so as to minimize J J J, we go through an iterative procedure in which each iteration involves two successive steps corresponding to successive optimizations. The solution will be:
μ k = ∑ n r n k x n ∑ n r n k \mu_{k} = \frac{\sum_{n}r_{nk}x_{n}}{\sum_{n}r_{nk}} μk=∑nrnk∑nrnkxn
We can also derive an on-line stochastic algorithm by applying the Robbins-Monro procedure to the problem
μ k n e w = μ k o l d + η n ( x n − x k o l d ) \mu_{k}^{new} = \mu_{k}^{old} + \eta_{n}(x_{n} - x_{k}^{old}) μknew=μkold+ηn(xn−xkold)
We can generalize the K-means algorithm by introducing a more general dissimilarity measure:
J ~ = ∑ n = 1 N ∑ k = 1 K r n k ν ( x n , μ k ) \tilde{J} = \sum_{n=1}^{N}\sum_{k=1}^{K}r_{nk}\nu(x_{n}, \mu_{k}) J~=n=1∑Nk=1∑Krnkν(xn,μk)
which gives the K-medoids algorithm.
9.1.1 Image segmentation and compression
9.2 Mixtures of Guassians
We have mentioned the Gaussian mixture model as a simple linear superposition of Gaussian components, we now turn to a formulation of Gaussian mixtures in terms of discrete latent variables.
p ( x ) = ∑ k = 1 K π k N ( x ∣ μ k , Σ k ) p(x) = \sum_{k=1}^{K}\pi_{k}N(x | \mu_{k}, \Sigma_{k}) p(x)=k=1∑KπkN(x∣μk,Σk)
The marginal distribution over z z z is specified in terms of mixing coefficients π k \pi_k πk
p ( z k = 1 ) = π k p(z_{k} = 1) = \pi_{k} p(zk=1)=πk
∑ k z k = 1 \sum_{k} z_{k} = 1 k∑zk=1
where the parameters { π k } \{\pi_k\} {πk} must satisfy:
0 ≤ π k ≤ 1 0 \leq \pi_{k} \leq 1 0≤πk≤1
∑ k = 1 K π k = 1 \sum_{k=1}^{K}\pi_k = 1 k=1∑Kπk=1
in order to be valid probabilities, because z z z uses a 1-of-K representation, we can also write this distribution in the form:
p ( z ) = ∏ k = 1 K π k z k p(z) = \prod_{k=1}^{K}\pi_{k}^{z_{k}} p(z)=k=1∏Kπkzk
Similarly, the conditional distribution of x is a Gaussian:
p ( x ∣ z ) = ∏ k = 1 K N ( x ∣ μ k , Σ k ) z k p(x | z) = \prod_{k=1}^{K}\mathcal{N}(x | \mu_{k}, \Sigma_{k})^{z_{k}} p(x∣z)=k=1∏KN(x∣μk,Σk)zk
The joint distribution is given by p ( z ) p ( x ∥ z ) p(z)p(x\|z) p(z)p(x∥z) and the marginal distribution of x x x is then obtained by summing the joint distribution over all possible states of z z z to give:
p ( x ) = ∑ z p ( z ) p ( x ∣ z ) = ∑ k = 1 K π k N ( x ∣ μ k , Σ k ) p(x) = \sum_{z}p(z)p(x | z) = \sum_{k=1}^{K}\pi_{k}\mathcal{N}(x | \mu_{k}, \Sigma_{k}) p(x)=z∑p(z)p(x∣z)=k=1∑KπkN(x∣μk,Σk)
The corresponding posterior probability will be:
γ ( z k ) ≡ p ( z k = 1 ∣ x ) = p ( z k = 1 ) p ( x ∣ z k = 1 ) ∑ j = 1 K p ( z j = 1 ) p ( x ∣ z j = 1 ) = π k N ( x ∣ μ k , Σ k ) ∑ j = 1 K π j N ( x ∣ μ j , Σ k ) \gamma(z_{k}) \equiv p(z_{k} = 1 | x) = \frac{p(z_{k}=1)p(x | z_{k}=1)}{\sum_{j=1}^{K}p(z_{j}=1)p(x | z_{j} = 1)} = \frac{\pi_{k}\mathcal{N}(x | \mu_{k}, \Sigma_{k})}{\sum_{j=1}^{K}\pi_{j}\mathcal{N}(x | \mu_{j},\Sigma_{k})} γ(zk)≡p(zk=1∣x)=∑j=1Kp(zj=1)p(x∣zj=1)p(zk=1)p(x∣zk=1)=∑j=1KπjN(x∣μj,Σk)πkN(x∣μk,Σk)
9.2.1 Maximum likelihood
The log of the likelihood function is given by:
ln p ( X ∣ π , μ , Σ ) = ∑ n = 1 N ln { ∑ k = 1 K π k N ( x n ∣ μ k , Σ k ) } \ln p(\mathbf{X} | \pi, \mu, \Sigma) = \sum_{n=1}^{N}\ln\left\{\sum_{k=1}^{K}\pi_{k}\mathcal{N}(x_{n} | \mu_{k}, \Sigma_{k}) \right\} lnp(X∣π,μ,Σ)=n=1∑Nln{k=1∑KπkN(xn∣μk,Σk)}
9.2.2 EM for Gaussian mixtures
An elegant and powerful method for finding maximum likelihood solutions for models with latent variables is called the espectation-maximization algorithm (EM algorithm).
Maximize the likelihood function, we obtain:
μ k = 1 N k ∑ n = 1 N γ ( z n k ) x n \mu_{k} = \frac{1}{N_{k}}\sum_{n=1}^{N}\gamma(z_{nk})x_{n} μk=Nk1n=1∑Nγ(znk)xn
where N k = ∑ n = 1 N γ ( z n k ) N_{k} = \sum_{n=1}^{N}\gamma(z_{nk}) Nk=∑n=1Nγ(znk).
And for the covariance matrix of a single Gaussian, we obtain:
Σ k = 1 N k ∑ n = 1 N γ ( z n k ) ( x n − μ k ) T \Sigma_{k} = \frac{1}{N_{k}}\sum_{n=1}^{N}\gamma(z_{nk})(x_{n} - \mu_{k})^{T} Σk=Nk1n=1∑Nγ(znk)(xn−μk)T
For the mixing coefficients (using Lagrange multiplier) we obtain:
π k = N k N \pi_{k} = \frac{N_{k}}{N} πk=NNk
EM for Gaussian Mixtures
So, given a Gaussian mixture model, the goal is to maximize the likelihood function with respect to the parameters (mean & covariances of the components & mixing coefficients).
-
- Initialize μ k \mu_k μk, Σ \Sigma Σ, π k \pi_k πk. Evaluate the initial value of the log likelihood.
-
- E step. Evaluate the responsibilities using the current parameter values.
γ ( z n k ) = π k N ( x n ∣ μ k , Σ k ) ∑ j = 1 K π j N ( x n ∣ μ j , Σ j ) \gamma(z_{nk}) = \frac{\pi_{k}\mathcal{N}(x_{n} | \mu_{k}, \Sigma_{k})}{\sum_{j=1}^{K}\pi_{j}\mathcal{N}(x_{n} | \mu_{j}, \Sigma_{j})} γ(znk)=∑j=1KπjN(xn∣μj,Σj)πkN(xn∣μk,Σk)
-
- M step. Re-estimate the parameters using the current responsibilities.
μ k n e w = 1 N k ∑ n = 1 N γ ( z n k ) x n \mu_{k}^{new} = \frac{1}{N_{k}}\sum_{n=1}^{N}\gamma(z_{nk})x_{n} μknew=Nk1n=1∑Nγ(znk)xn
∑ k n e w = 1 N k ∑ n = 1 N γ ( z n k ) ( x n − μ k n e w ) ( x n − μ k n e w ) T \sum_{k}^{new} = \frac{1}{N_{k}}\sum_{n=1}^{N}\gamma(z_{nk})(x_{n} - \mu_{k}^{new})(x_{n} - \mu_{k}^{new})^{T} k∑new=Nk1n=1∑Nγ(znk)(xn−μknew)(xn−μknew)T
π k n e w = N k N \pi_{k}^{new} = \frac{N_{k}}{N} πknew=NNk
where
N k = ∑ n = 1 N γ ( z n k ) N_{k} = \sum_{n=1}^{N}\gamma(z_{nk}) Nk=n=1∑Nγ(znk)
- Evaluate the log likelihood
ln p ( X ∣ μ , Σ , π ) = ∑ n = 1 N ln { ∑ k = 1 K π k N ( x n ∣ μ k , Σ k ) } \ln p(\mathbf{X} | \mu, \Sigma, \pi) = \sum_{n=1}^{N}\ln\left\{\sum_{k=1}^{K}\pi_{k}\mathcal{N}(x_{n} | \mu_{k}, \Sigma_{k}) \right\} lnp(X∣μ,Σ,π)=n=1∑Nln{k=1∑KπkN(xn∣μk,Σk)}
and check for convergence of either the parameters of the log likelihood. If the convergence criterion is not satisfied return to step 2.
9.3 An Alternative View of EM
The goal of the EM algorithm is to find maximum likelihood solutions for models having latent variables. Set the parameters θ \theta θ, and the likelihood function is given by:
ln p ( X ∣ θ ) = ln { ∑ Z p ( X , Z ∣ θ ) } \ln p(\mathbf{X}| \theta) = \ln\left\{\sum_{Z}p(\mathbf{X,Z} | \theta) \right\} lnp(X∣θ)=ln{Z∑p(X,Z∣θ)}
In the E step, we use current θ o l d \theta^{old} θold to find the posterior by which we can find the expectation of the complete-data log likelihood evaluated for some general parameter value θ \theta θ:
θ n e w = arg max θ Q ( θ , θ o l d ) \theta^{new} = \arg\max_{\theta}\mathcal{Q}(\theta, \theta^{old}) θnew=argθmaxQ(θ,θold)
In the M step, we determine the revised parameter estimate θ n e w \theta^{new} θnew by maximizing this function:
Q ( θ , θ o l d ) = ∑ Z p ( Z ∣ X , θ o l d ) ln p ( X , Z ∣ θ ) \mathcal{Q}(\theta, \theta^{old}) = \sum_{Z}p(\mathcal{Z} | \mathcal{X}, \theta^{old})\ln p(\mathcal{X, Z} | \theta) Q(θ,θold)=Z∑p(Z∣X,θold)lnp(X,Z∣θ)
9.3.1 Guassian mixtures revisited
The log likelihood function takes the form:
ln p ( X , Z ∣ μ , Σ . π ) = ∑ n = 1 N ∑ k = 1 K z n k { ln π k + ln N ( x n ∣ μ k , Σ k ) } \ln p(\mathbf{X, Z} | \mu, \Sigma. \pi) = \sum_{n=1}^{N}\sum_{k=1}^{K}z_{nk}\{\ln \pi_{k} + \ln\mathcal{N}(x_{n} | \mu_{k}, \Sigma_{k}) \} lnp(X,Z∣μ,Σ.π)=n=1∑Nk=1∑Kznk{lnπk+lnN(xn∣μk,Σk)}
Using Bayes’ theorem, we see that the posterior distribution takes the form:
p ( Z ∣ X , μ , Σ , π ) ∝ ∏ n = 1 N ∏ k = 1 K [ π k N ( x n ∣ μ k , Σ k ) ] z n k p(Z | X, \mu, \Sigma, \pi) \propto \prod_{n=1}^{N}\prod_{k=1}^{K}[\pi_{k}\mathcal{N}(x_{n} | \mu_{k}, \Sigma_{k}) ]^{z_{nk}} p(Z∣X,μ,Σ,π)∝n=1∏Nk=1∏K[πkN(xn∣μk,Σk)]znk
The expected value of the complete-data log likelihood function is therefore given by:
E Z [ ln p ( X , Z ∣ μ , Σ , π ) ] = ∑ n = 1 N ∑ k = 1 K γ ( z n k ) { ln π k + ln N ( x n ∣ μ k , Σ k ) } E_{Z}[\ln p(X, Z | \mu, \Sigma, \pi)] = \sum_{n=1}^{N}\sum_{k=1}^{K}\gamma(z_{nk})\{\ln \pi_{k} + \ln\mathcal{N}(x_{n} | \mu_{k}, \Sigma_{k}) \} EZ[lnp(X,Z∣μ,Σ,π)]=n=1∑Nk=1∑Kγ(znk){lnπk+lnN(xn∣μk,Σk)}
9.3.2 Relation to K-means
The posterior probabilities for a particular data point x n x_n xn are given by:
γ ( z n k ) = π k e x p ( − ∣ ∣ x n − μ k ∣ ∣ 2 / 2 ε ) ∑ j π j e x p ( − ∣ ∣ x n − μ j ∣ ∣ 2 / 2 ε ) \gamma(z_{nk})=\frac{\pi_k exp(-||x_n-\mu_k||^2/2\varepsilon)}{\sum_j \pi_j exp(-||x_n-\mu_j||^2/2\varepsilon)} γ(znk)=∑jπjexp(−∣∣xn−μj∣∣2/2ε)πkexp(−∣∣xn−μk∣∣2/2ε)
9.3.3 Mixtures of Bernoulli distributions
In order to derive the EM algorithm, we first write down the complete-data log likelihood function, which is given by:
ln p ( X , Z ∣ μ , π ) = ∑ n = 1 N ∑ k = 1 K z n k { ln π k + ∑ i = 1 D [ x n i ln μ k i + ( 1 − x n i ) ln ( 1 − μ k i ) ] } \ln p(\mathbf{X, Z} | \mu, \pi) = \sum_{n=1}^{N}\sum_{k=1}^{K}z_{nk} \left\{\ln \pi_{k} + \sum_{i=1}^{D}[x_{ni}\ln\mu_{ki} + (1-x_{ni})\ln(1-\mu_{ki})] \right\} lnp(X,Z∣μ,π)=n=1∑Nk=1∑Kznk{lnπk+i=1∑D[xnilnμki+(1−xni)ln(1−μki)]}
where X = { x n } \mathbf{X}=\{x_n\} X={xn} and Z = { z n } \mathbf{Z}=\{z_n\} Z={zn}. Next we take the expectation of the complete-data log likelihood with respect to the posterior distribution of the latent variables to give:
E Z [ ln p ( X , Z ∣ μ , π ) ] = ∑ n = 1 N ∑ k = 1 K γ ( z n k ) { ln π k + ∑ i = 1 D [ x n i ln μ k i + ( 1 − x n i ) ln ( 1 − μ k i ) ] } E_{Z}[\ln p(\mathbf{X, Z} | \mu, \pi)] = \sum_{n=1}^{N}\sum_{k=1}^{K}\gamma(z_{nk})\left\{\ln\pi_{k} + \sum_{i=1}^{D}[x_{ni}\ln\mu_{ki} + (1-x_{ni})\ln(1-\mu_{ki})] \right\} EZ[lnp(X,Z∣μ,π)]=n=1∑Nk=1∑Kγ(znk){lnπk+i=1∑D[xnilnμki+(1−xni)ln(1−μki)]}
9.3.4 EM for Bayesian linear regression
The complete-data log likelihood function is then given by:
l n p ( t , w ∣ α , β ) = l n p ( t ∣ w , β ) + l n p ( w ∣ α ) ln p(\mathbf{t},w|\alpha,\beta)=lnp(\mathbf{t}|w,\beta)+lnp(w|\alpha) lnp(t,w∣α,β)=lnp(t∣w,β)+lnp(w∣α)
Taking the expectation with respect to the posterior distribution of w w w gives:
E [ l n p ( t , w ∣ α , β ) ] = M 2 l n ( α 2 π ) − α 2 E [ w T w ] + N 2 l n ( β 2 π ) − β 2 ∑ n = 1 N E [ ( t n − w T ϕ n ) 2 ] E[ln p(\mathbf{t},w|\alpha,\beta)]=\frac{M}{2}ln(\frac{\alpha}{2\pi})-\frac{\alpha}{2}E[w^T w]+\frac{N}{2}ln(\frac{\beta}{2\pi})-\frac{\beta}{2}\sum_{n=1}^NE[(t_n-w^T\phi_n)^2] E[lnp(t,w∣α,β)]=2Mln(2πα)−2αE[wTw]+2Nln(2πβ)−2βn=1∑NE[(tn−wTϕn)2]
Setting the derivatives with respect to α \alpha α or β \beta β to zero, we obtain the M step re-estimation equation:
α = M E [ w T w ] = M m N T m N + T r ( S N ) \alpha=\frac{M}{E[w^T w]}=\frac{M}{m^T_N m_N+Tr(S_N)} α=E[wTw]M=mNTmN+Tr(SN)M
1 β = 1 N ( ∣ ∣ t − Φ m N ∣ ∣ 2 + T r [ Φ T Φ S N ] ) \frac{1}{\beta}=\frac{1}{N}(||\mathbf{t}-\mathbf{\Phi_{m_N}}||^2+Tr[\mathbf{\Phi^T\Phi}S_N]) β1=N1(∣∣t−ΦmN∣∣2+Tr[ΦTΦSN])
9.4 The EM Algorithm in General
Our goal is to maximize the likelihood function:
p ( X ∣ θ ) = ∑ Z p ( X , Z ∣ θ ) p(X|\theta)=\sum_Z p(X,Z|\theta) p(X∣θ)=Z∑p(X,Z∣θ)
Next we introduce a distribution q ( Z ) q(Z) q(Z) defined over the latent variables. For any choice of q ( Z ) q(Z) q(Z), the following decomposition holds:
ln p ( X ∣ θ ) = L ( q , θ ) + K L ( q ∣ ∣ p ) \ln p(X | \theta) = \mathcal{L}(q, \theta) + KL(q||p) lnp(X∣θ)=L(q,θ)+KL(q∣∣p)
where we have defined
L ( q , θ ) = ∑ Z q ( Z ) ln { p ( X , Z ∣ θ ) q ( Z ) } \mathcal{L}(q, \theta) = \sum_{Z}q(Z)\ln\left\{\frac{p(X,Z | \theta)}{q(Z)} \right\} L(q,θ)=Z∑q(Z)ln{q(Z)p(X,Z∣θ)}
K L ( q ∣ ∣ p ) = − ∑ Z q ( Z ) ln { p ( Z ∣ X , θ ) q ( Z ) } KL(q||p) = -\sum_{Z}q(Z)\ln \left\{\frac{p(Z | X, \theta)}{q(Z)} \right\} KL(q∣∣p)=−Z∑q(Z)ln{q(Z)p(Z∣X,θ)}