Study notes for the Expectation-Maximization (EM) Algorithm

Felix_夜雨

Category: Machine Learning. Tags: machine learning, study notes

Article link: https://blog.csdn.net/u010693617/article/details/9098333


The EM algorithm is an efficient iterative procedure to compute the maximum likelihood (ML) estimate in the presence of missing or hidden data (variables).
It estimates the model parameters under which the observed data are most likely.

Let $f$ be a real function defined on an interval $I=[a, b]$. $f$ is said to be convex on $I$ if $\forall x_1, x_2\in I, \lambda\in [0,1]$ ,
$f(\lambda x_1+(1-\lambda)x_2) \le \lambda f(x_1)+(1-\lambda)f(x_2)$
$f$ is said to be strictly convex if the inequality is strict whenever $x_1 \ne x_2$ and $\lambda\in (0,1)$. Intuitively, this definition states that the function falls below (strictly convex) or is never above (convex) the straight line segment joining the points $(x_1, f(x_1))$ and $(x_2, f(x_2))$ .
$f$ is concave (strictly concave) if $-f$ is convex (strictly convex).
Theorem 1. If $f(x)$ is twice differentiable on $[a, b]$ and $f''(x) \ge 0$ on $[a, b]$, then $f(x)$ is convex on $[a, b]$.
- If $x$ is vector-valued, a twice-differentiable $f(x)$ is convex if its Hessian matrix $H$ is positive semi-definite ($H \succeq 0$).
- $-\ln(x)$ is strictly convex on $(0, \infty)$, and hence $\ln(x)$ is strictly concave on $(0, \infty)$.
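
As a quick sanity check of the convexity definition, the sketch below samples random points and tests the defining inequality numerically. The helper name `violates_convexity`, the interval `[0.1, 10]`, and the tolerance are illustrative assumptions, and random sampling can only find counterexamples; it does not prove convexity.

```python
import math
import random

def violates_convexity(f, lo, hi, trials=10_000, tol=1e-12, seed=0):
    """Return True if some sampled (x1, x2, lam) breaks
    f(lam*x1 + (1-lam)*x2) <= lam*f(x1) + (1-lam)*f(x2)."""
    rng = random.Random(seed)
    for _ in range(trials):
        x1, x2 = rng.uniform(lo, hi), rng.uniform(lo, hi)
        lam = rng.random()
        lhs = f(lam * x1 + (1 - lam) * x2)
        rhs = lam * f(x1) + (1 - lam) * f(x2)
        if lhs > rhs + tol:  # small tolerance absorbs floating-point noise
            return True
    return False

def neg_log(x):
    return -math.log(x)

# -ln(x) is convex on (0, inf): no violation should be found.
print(violates_convexity(neg_log, 0.1, 10.0))
# ln(x) is strictly concave, so the convexity inequality fails for it.
print(violates_convexity(math.log, 0.1, 10.0))
```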

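The convexity definition generalizes from two points to $n$ points (Jensen's inequality).
Let $f$ be a convex function defined on an interval $I$. If $x_1, x_2, \ldots, x_n \in I$ and $\lambda_1, \lambda_2, \ldots, \lambda_n \ge 0$ with $\sum\nolimits_{i=1}^n \lambda_i=1$ , then
$f\left(\sum_{i=1}^n \lambda_i x_i\right) \le \sum_{i=1}^n \lambda_i f(x_i)$
In probabilistic form, $E[f(X)] \ge f(E[X])$ for a random variable $X$. Note that, for strictly convex $f$, $E[f(X)]=f(E[X])$ holds true if and only if $X=E[X]$ with probability 1, i.e., if $X$ is a constant.
Hence, for concave functions the inequality is reversed:
$f\left(\sum_{i=1}^n \lambda_i x_i\right) \ge \sum_{i=1}^n \lambda_i f(x_i)$
Applying this to the concave function $\ln(x)$, we can verify that
$\frac{1}{n}\sum_{i=1}^n x_i \ge \sqrt[n]{x_1x_2 \cdots x_n}$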
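
The step from the concavity of $\ln(x)$ to the inequality above can be spelled out as follows, assuming all $x_i > 0$ and choosing equal weights $\lambda_i = 1/n$ (a choice made here for illustration):
$\ln\left(\frac{1}{n}\sum_{i=1}^n x_i\right) \ge \frac{1}{n}\sum_{i=1}^n \ln x_i = \ln\left((x_1x_2 \cdots x_n)^{1/n}\right)$
Since $\exp$ is increasing, exponentiating both sides preserves the inequality and gives the arithmetic mean on the left and the geometric mean on the right.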

Objective: maximize the likelihood $p(x|\theta)$ (equivalently, its logarithm) of the data $x$, which is drawn from an unknown distribution, given the model parameterized by $\theta$. Assuming the $n$ samples are independent:
$\theta^*=\arg\max_\theta p(x|\theta)=\arg\max_\theta \log \prod_{j=1}^n p(x_j|\theta)=\arg\max_\theta \sum_{j=1}^n \log p(x_j|\theta)$
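
To make the objective concrete, here is a minimal sketch for a case with no hidden variable: a single Gaussian with known variance, where the sum of log-densities is maximized over a grid of candidate means. The data values, the grid, and the function name `gaussian_log_likelihood` are made up for illustration.

```python
import math

def gaussian_log_likelihood(data, mu, sigma=1.0):
    """Sum over j of log N(x_j | mu, sigma^2), assuming i.i.d. samples."""
    const = -0.5 * math.log(2 * math.pi * sigma ** 2)
    return sum(const - (x - mu) ** 2 / (2 * sigma ** 2) for x in data)

data = [1.2, 0.7, 1.9, 1.4, 0.8]            # toy observations (hypothetical)
grid = [i / 100 for i in range(0, 301)]     # candidate values of mu in [0, 3]

# arg max over the grid of the log-likelihood
best_mu = max(grid, key=lambda mu: gaussian_log_likelihood(data, mu))
sample_mean = sum(data) / len(data)
print(best_mu, sample_mean)
```

For this simple model the grid maximizer lands next to the sample mean, which is the known closed-form ML estimate. EM is needed when hidden variables remove such a closed form.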
The basic idea:
- Introduce a hidden variable whose knowledge would simplify the maximization of $p(x|\theta)$
- Each iteration of the EM algorithm consists of two processes:
  - E-step: estimate the distribution of the hidden variable given the data and the current values of the parameters.
  - M-step: modify the parameters in order to maximize the expected log of the joint distribution of the data and the hidden variable, where the expectation is taken under the distribution from the E-step.
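Convergence is assured in the sense that the likelihood never decreases from one iteration to the next. The limit is in general a local maximum (or another stationary point) of the likelihood, not necessarily the global maximum.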
My understanding: it is usually difficult to maximize the objective function directly, since there are many parameters and the log of a sum over the hidden configurations generally has no closed-form maximizer (so setting the derivatives to zero does not yield a direct solution). Instead, the EM algorithm introduces a hidden variable that makes the parameters easy to estimate. Specifically, each iteration maximizes the expected log of the joint distribution of the data and the hidden variable, and by the convergence property of the EM algorithm this also improves the original objective function.
For the detailed derivation, refer to Andrew's or Sean's tutorial.
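
A condensed sketch of why the likelihood cannot decrease, assuming a discrete hidden variable $q$ and writing $Q(q)$ for an arbitrary distribution over $q$ (the symbol $Q$ is my notation here, not necessarily that of the tutorials). By the concavity of $\log$ and Jensen's inequality:
$\log p(x|\theta)=\log \sum_q p(x, q|\theta)=\log \sum_q Q(q)\frac{p(x, q|\theta)}{Q(q)} \ge \sum_q Q(q)\log \frac{p(x, q|\theta)}{Q(q)}$
Equality holds when $Q(q)=p(q|x, \theta)$. With $\theta^s$ denoting the parameters at iteration $s$, the E-step sets $Q(q)=p(q|x, \theta^s)$, which makes the bound tight at $\theta^s$, and the M-step picks $\theta^{s+1}$ to maximize the bound over $\theta$. Therefore
$\log p(x|\theta^{s+1}) \ge \sum_q Q(q)\log \frac{p(x, q|\theta^{s+1})}{Q(q)} \ge \sum_q Q(q)\log \frac{p(x, q|\theta^s)}{Q(q)}=\log p(x|\theta^s)$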

Example: a mixture of Gaussians. Assume a hidden variable $q$ that indicates, for each point, which Gaussian generated it (see left figure).
E-step: for each point, estimate the probability that each Gaussian generated it (see middle figure).
M-step: modify the parameters according to the hidden variable to maximize the likelihood of the data and the hidden variable (see right figure).
Consider the following auxiliary function: $A(\theta, \theta^s)=E_q[\log p(x, q|\theta) \mid x, \theta^s]$ , where $\theta^s$ denotes the parameters at iteration $s$. The M-step finds the parameters that maximize $A$: $\theta^{s+1}=\arg\max_\theta A(\theta, \theta^s)$
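
The E-step and M-step for Gaussians described above can be sketched in one dimension as follows. This is a minimal illustration, not the derivation from the tutorials: the function name `em_gmm_1d`, the quantile-based initialization, the variance floor, and the synthetic data are all assumptions made for the example.

```python
import math
import random

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, k=2, iters=50):
    """EM for a 1-D mixture of k Gaussians.
    Returns (weights, means, variances, log-likelihood trace)."""
    n = len(data)
    ordered = sorted(data)
    # Assumed initialization: means at evenly spaced quantiles of the data.
    means = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    mean_all = sum(data) / n
    var_all = sum((x - mean_all) ** 2 for x in data) / n
    variances = [var_all] * k
    weights = [1.0 / k] * k
    trace = []
    for _ in range(iters):
        # E-step: resp[j][i] = p(q_j = i | x_j, theta^s)
        resp, loglik = [], 0.0
        for x in data:
            joint = [weights[i] * normal_pdf(x, means[i], variances[i])
                     for i in range(k)]
            total = sum(joint)
            loglik += math.log(total)
            resp.append([p / total for p in joint])
        trace.append(loglik)  # log-likelihood at the current parameters
        # M-step: closed-form maximizer of A(theta, theta^s) for this model
        for i in range(k):
            n_i = sum(r[i] for r in resp)
            weights[i] = n_i / n
            means[i] = sum(r[i] * x for r, x in zip(resp, data)) / n_i
            var_i = sum(r[i] * (x - means[i]) ** 2
                        for r, x in zip(resp, data)) / n_i
            variances[i] = max(var_i, 1e-6)  # assumed floor against collapse
    return weights, means, variances, trace

# Synthetic data: two Gaussians centered at -2 and 3 (hypothetical example).
rng = random.Random(1)
data = ([rng.gauss(-2.0, 1.0) for _ in range(200)]
        + [rng.gauss(3.0, 1.0) for _ in range(200)])
weights, means, variances, trace = em_gmm_1d(data)
print(sorted(means))
print(trace[0], trace[-1])
```

The `trace` list records the log-likelihood before each M-step, so it can be used to check that the value does not decrease across iterations.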
