1. Introduction
- The EM algorithm is an efficient iterative procedure for computing the maximum likelihood (ML) estimate in the presence of missing or hidden data (variables).
- It estimates the model parameters under which the observed data are most likely.
Convexity
- Let f be a real-valued function defined on an interval I = [a, b]. f is said to be convex on I if, for all x1, x2 in I and λ in [0, 1],
  f(λ x1 + (1 - λ) x2) <= λ f(x1) + (1 - λ) f(x2).
- f is concave (strictly concave) if -f is convex (strictly convex).
- Theorem 1. If f is twice differentiable on [a, b] and f''(x) >= 0 on [a, b], then f is convex on [a, b].
- If x takes vector values, f(x) is convex if the Hessian matrix H is positive semi-definite (H >= 0).
- -ln(x) is strictly convex in (0, inf), and hence ln(x) is strictly concave in (0, inf).
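Theorem 1 can be spot-checked numerically. The sketch below (a minimal illustration, not part of the original notes) samples random points and weights and verifies that -ln(x), whose second derivative 1/x^2 is positive, satisfies the two-point convexity inequality above:

```python
import math
import random

# Spot-check Theorem 1 for f(x) = -ln(x) on (0, inf):
# f''(x) = 1/x^2 > 0, so f should satisfy
# f(a*x1 + (1-a)*x2) <= a*f(x1) + (1-a)*f(x2) for all a in [0, 1].
def f(x):
    return -math.log(x)

random.seed(0)
for _ in range(10_000):
    x1 = random.uniform(0.01, 100.0)
    x2 = random.uniform(0.01, 100.0)
    a = random.random()
    lhs = f(a * x1 + (1 - a) * x2)
    rhs = a * f(x1) + (1 - a) * f(x2)
    # Small epsilon guards against floating-point round-off.
    assert lhs <= rhs + 1e-12, (x1, x2, a)

print("convexity inequality holds for -ln(x) on all sampled points")
```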
Jensen's inequality
- Convexity generalizes from two points to arbitrary convex combinations (the multivariate form).
- Let f be a convex function defined on an interval I. If x1, ..., xn in I and λ1, ..., λn >= 0 with λ1 + ... + λn = 1, then
  f(Σi λi xi) <= Σi λi f(xi).
- Hence, for concave functions: f(Σi λi xi) >= Σi λi f(xi).
- Applying this to the concave function ln(x), we can verify that
  ln(Σi λi xi) >= Σi λi ln(xi).
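The ln(x) form of Jensen's inequality is the one the EM derivation relies on. As a quick sanity check (an illustrative sketch, not from the original notes), the snippet below samples random points and normalized weights and confirms the inequality numerically:

```python
import math
import random

# Spot-check Jensen's inequality for the concave function ln(x):
# ln(sum_i w_i * x_i) >= sum_i w_i * ln(x_i), w_i >= 0, sum_i w_i = 1.
random.seed(1)
for _ in range(1_000):
    n = random.randint(2, 6)
    xs = [random.uniform(0.1, 50.0) for _ in range(n)]
    ws = [random.random() for _ in range(n)]
    total = sum(ws)
    ws = [w / total for w in ws]  # normalize so the weights sum to 1
    lhs = math.log(sum(w * x for w, x in zip(ws, xs)))
    rhs = sum(w * math.log(x) for w, x in zip(ws, xs))
    assert lhs >= rhs - 1e-12

print("Jensen's inequality for ln(x) holds on all sampled weight/point sets")
```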
2. The EM Algorithm
- Objective: maximize the log-likelihood of the observed data x, which is drawn from an unknown distribution, given the model parameterized by θ: maximize L(θ) = ln P(x | θ).
- The basic idea:
- Introduce a hidden variable q such that knowledge of q would simplify the maximization of L(θ).
- Each iteration of the EM algorithm consists of two processes:
- E-step: estimate the distribution of the hidden variable given the data and the current values of the parameters.
- M-step: modify the parameters in order to maximize the expected log of the joint distribution of the data and the hidden variable (under the distribution estimated in the E-step).
- Convergence is assured since the algorithm is guaranteed to increase the likelihood at each iteration.
My understanding: it is usually difficult to estimate or maximize the objective function directly, since there are many parameters and the objective function may not be differentiable (so traditional differential methods are not applicable). Instead, the EM algorithm introduces a hidden variable that makes the parameter estimates easy to compute. Specifically, it maximizes the expected joint log-likelihood of the data and the hidden variable, which, by the convergence property of the EM algorithm, also increases the original objective function.
- For the detailed derivation, see Andrew Ng's or Sean Borman's tutorial.
Example: EM for GMM (briefly); more can be found in the GMM study.
- Assume a hidden variable q that answers, for each point: which Gaussian generated it? (see left figure).
- E-step: for each point, estimate the probability that each Gaussian generated it. (see middle figure).
- M-step: modify the parameters according to the hidden variable to maximize the likelihood of the data and the hidden variable. (see right figure).
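The E-step/M-step loop above can be sketched for a 1-D mixture of two Gaussians. This is a minimal illustration under assumed names (weights w, means mu, variances var) and synthetic data, not any particular tutorial's implementation; it also asserts the convergence property from Section 2, namely that the log-likelihood never decreases:

```python
import math
import random

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def log_likelihood(data, w, mu, var):
    return sum(math.log(w[0] * normal_pdf(x, mu[0], var[0])
                        + w[1] * normal_pdf(x, mu[1], var[1])) for x in data)

random.seed(42)
# Synthetic data: two well-separated Gaussians.
data = ([random.gauss(-2.0, 1.0) for _ in range(200)]
        + [random.gauss(3.0, 1.0) for _ in range(200)])

# Initial guesses for the parameters.
w, mu, var = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]

prev_ll = log_likelihood(data, w, mu, var)
for _ in range(50):
    # E-step: for each point, the posterior probability (responsibility)
    # that the first Gaussian generated it.
    resp = []
    for x in data:
        p0 = w[0] * normal_pdf(x, mu[0], var[0])
        p1 = w[1] * normal_pdf(x, mu[1], var[1])
        resp.append(p0 / (p0 + p1))
    # M-step: re-estimate weights, means, and variances
    # from the responsibilities.
    n0 = sum(resp)
    n1 = len(data) - n0
    w = [n0 / len(data), n1 / len(data)]
    mu = [sum(r * x for r, x in zip(resp, data)) / n0,
          sum((1 - r) * x for r, x in zip(resp, data)) / n1]
    var = [sum(r * (x - mu[0]) ** 2 for r, x in zip(resp, data)) / n0,
           sum((1 - r) * (x - mu[1]) ** 2 for r, x in zip(resp, data)) / n1]
    # EM guarantees the log-likelihood never decreases across iterations.
    ll = log_likelihood(data, w, mu, var)
    assert ll >= prev_ll - 1e-9
    prev_ll = ll

print("estimated means:", sorted(mu))
```

With the synthetic data above, the estimated means end up close to the true component means of -2 and 3.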
Let us consider the following auxiliary function: A(θ, θ_t) = Σ_q P(q | x, θ_t) ln P(x, q | θ). The algorithm finds the best parameters that maximize function A: θ_{t+1} = argmax_θ A(θ, θ_t).
References
- Andrew Ng, The EM Algorithm, CS229 lecture notes: http://cs229.stanford.edu/materials.html.
- Sean Borman, The Expectation Maximization Algorithm: A Short Tutorial.
- Samy Bengio, Statistical Machine Learning from Data: Gaussian Mixture Models.