Study notes for Gaussian Mixture Model

Felix_夜雨

Article link: https://blog.csdn.net/u010693617/article/details/9100821


1. Mixture Model

A typical K-component mixture model is a hierarchical model consisting of:
- N observations, each of which is drawn from a mixture of K components.
- K components that belong to the same parametric family of distributions but have different parameters.
- A set of K mixture weights, each of which is a probability, all of which sum to one.
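
The hierarchical structure listed above can be read as a two-step sampling procedure: first pick a component according to the mixture weights, then draw the observation from that component. Below is a minimal sketch assuming 1-D Gaussian components; the helper name `sample_mixture` and all parameter values are made up for illustration.

```python
import random

def sample_mixture(weights, means, stds, n, seed=0):
    """Draw n samples from a 1-D Gaussian mixture (illustrative helper)."""
    rng = random.Random(seed)
    samples, labels = [], []
    for _ in range(n):
        # Step 1: pick a component index j with probability weights[j].
        j = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        # Step 2: draw the observation from the chosen component.
        samples.append(rng.gauss(means[j], stds[j]))
        labels.append(j)
    return samples, labels

# Two components with weights 0.3 and 0.7 (assumed values).
xs, zs = sample_mixture([0.3, 0.7], [-2.0, 3.0], [0.5, 1.0], n=1000)
```

In a real clustering problem only `xs` is observed; the labels `zs` are the hidden part that has to be inferred.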

2. Gaussian Mixture Model

The multivariate (d-dimensional) Gaussian distribution is elaborated in Anomaly Detection.
A Gaussian mixture model (GMM) is a weighted sum of K component (multivariate) Gaussian densities:
$p(x)=\sum_{j=1}^K w_j\cdot N(x| \mu_j, \Sigma_j)$
where $w_j$ is the prior probability (weight) that an observation x is generated by the j-th Gaussian component, with
$\sum_{j=1}^K w_j=1 \mbox{ and } 0\le w_j \le 1$
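
To make the weighted sum concrete, here is a small sketch that evaluates the mixture density in the 1-D case, where each $\Sigma_j$ reduces to a scalar variance. The function names and parameter values are assumptions for illustration only.

```python
import math

def gaussian_pdf(x, mu, var):
    """1-D Gaussian density N(x | mu, var)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def gmm_pdf(x, weights, means, variances):
    """Mixture density p(x) = sum_j w_j * N(x | mu_j, var_j)."""
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))

# Assumed example parameters: two components, weights summing to one.
weights, means, variances = [0.3, 0.7], [-2.0, 3.0], [0.25, 1.0]
density_at_zero = gmm_pdf(0.0, weights, means, variances)
```

Because the weights sum to one and each component integrates to one, the mixture density also integrates to one.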

My understanding: a GMM is a linear combination of K Gaussian distributions, i.e., a mixture of Gaussians, where the combination weight $w_j$ reflects the importance of the j-th Gaussian distribution. Put simply, a GMM is a probability distribution for a variable. When it is applied to a clustering problem, each Gaussian component corresponds to one cluster, so each example may be generated by different components with different probabilities. Instead of assigning each example to a single cluster, we give the probability that the example was generated by each Gaussian component.
Problem: given a set of data (i.e., observations), and assuming that each observation is drawn from an unknown distribution (here, a GMM), how do we estimate the parameters $\theta$ of the GMM that best fit the data?
Solution: maximize the likelihood $p(x|\theta)$ of the data with respect to the model parameters, assuming the observations are independent:
$\theta^*=\arg\max_\theta p(x|\theta)=\arg\max_\theta \prod_{i=1}^N p(x_i|\theta)$
However, since each $p(x_i|\theta)$ is often a small value, the product becomes so small that it falls below what a computer can represent (floating-point underflow). Hence, we usually maximize the log-likelihood instead:
$\theta^*=\arg\max_\theta \log p(x|\theta)=\arg\max_\theta \sum_{i=1}^N \log p(x_i|\theta)=\arg\max_\theta \sum_{i=1}^N \log \sum_{j=1}^K w_j N(x_i|\mu_j, \Sigma_j)$
Because the logarithm of a sum has no closed-form maximizer, this problem is usually solved with the EM algorithm, which iteratively increases the log-likelihood.
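
The underflow issue is easy to reproduce. The sketch below, assuming a 1-D GMM with made-up parameters and synthetic data, compares the naive product of densities with the log-likelihood computed using the log-sum-exp trick. All names here are illustrative, and the weights are assumed to be strictly positive so that `math.log(w)` is defined.

```python
import math

def log_gaussian_pdf(x, mu, var):
    """Log of the 1-D Gaussian density N(x | mu, var)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mu) ** 2 / var)

def log_sum_exp(values):
    """Compute log(sum(exp(v))) stably by factoring out the maximum."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def gmm_log_pdf(x, weights, means, variances):
    """log p(x) = log sum_j w_j N(x | mu_j, var_j); assumes every w_j > 0."""
    return log_sum_exp([
        math.log(w) + log_gaussian_pdf(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ])

# Synthetic data and assumed parameters, for illustration only.
data = [0.1 * (i % 50) for i in range(5000)]
weights, means, variances = [0.5, 0.5], [1.0, 4.0], [1.0, 1.0]

# Sum of log-densities: stays a finite negative number.
log_likelihood = sum(gmm_log_pdf(x, weights, means, variances) for x in data)

# Naive product of densities: every factor is below 0.4, so it underflows to 0.0.
naive_product = 1.0
for x in data:
    naive_product *= math.exp(gmm_log_pdf(x, weights, means, variances))
```

Here `naive_product` ends up as exactly `0.0`, while `log_likelihood` remains usable for comparing parameter settings.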

3. EM for GMM

For a GMM, the hidden variable Q describes which Gaussian generated each example. If Q were observed, maximizing the likelihood of the data would be simple: estimate the parameters Gaussian by Gaussian. Moreover, as the derivation shows, Q itself is easy to estimate.
The mixture of Gaussians model for each example $x_i$ can be written as follows, where $p(j|\theta)=w_j$ is the weight of the j-th Gaussian:
$p(x_i|\theta)=\sum_{j=1}^K p(j|\theta)p(x_i|j, \theta)$
Let us now introduce the following indicator variable:
$q_{i, j}=\left\{\begin{array}{ll} 1 & \mbox{if Gaussian j emitted } x_i, \\ 0 & \mbox{otherwise;} \end{array}\right.$
We can now write the joint likelihood of all the x and Q:
$p(x, Q|\theta)=\prod_{i=1}^N\prod_{j=1}^K p(j|\theta)^{q_{i,j}} p(x_i|j, \theta)^{q_{i,j}}$
Taking the logarithm gives:
$\log p(x, Q|\theta)=\sum_{i=1}^N\sum_{j=1}^K \left[q_{i,j}\log p(j|\theta) + q_{i,j} \log p(x_i|j, \theta)\right]$
Let us now write the corresponding auxiliary function:
$\begin{array}{ll}A(\theta, \theta^s)&=E_Q[\log p(x, Q|\theta) | x, \theta^s]\\&=E_Q\left[\sum_{i=1}^N \sum_{j=1}^K \left(q_{i,j}\log p(j|\theta)+q_{i, j}\log p(x_i|j, \theta)\right) | x, \theta^s\right]\\&=\sum_{i=1}^N \sum_{j=1}^K \left(E_Q[q_{i,j}|x, \theta^s] \log p(j|\theta)+E_Q[q_{i,j}|x, \theta^s] \log p(x_i|j, \theta)\right)\end{array}$
Hence, the E-step estimates the posterior:
$\begin{array}{ll}E_Q(q_{i, j}|x, \theta^s)&=1\cdot p(q_{i, j}=1|x, \theta^s)+0\cdot p(q_{i, j}=0|x, \theta^s)\\ &= p(j|x_i,\theta^s)\\ &=\frac{p(x_i|j, \theta^s)p(j|\theta^s)}{p(x_i|\theta^s)}\end{array}$
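
As a sketch of this E-step, the function below computes the posterior $p(j|x_i,\theta^s)$ for every example and component using Bayes' rule, assuming 1-D Gaussian components. The names `e_step` and `resp` and the parameter values are illustrative choices, not taken from the derivation.

```python
import math

def gaussian_pdf(x, mu, var):
    """1-D Gaussian density N(x | mu, var)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def e_step(data, weights, means, variances):
    """Return resp where resp[i][j] = p(j | x_i, theta^s)."""
    resp = []
    for x in data:
        # Numerator of Bayes' rule: p(j | theta^s) * p(x_i | j, theta^s).
        joint = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        # Denominator: p(x_i | theta^s), the sum over all components.
        evidence = sum(joint)
        resp.append([p / evidence for p in joint])
    return resp

# Assumed current parameters theta^s and a few example points.
weights, means, variances = [0.3, 0.7], [-2.0, 3.0], [0.25, 1.0]
data = [-2.1, -1.9, 0.5, 2.8, 3.3]
resp = e_step(data, weights, means, variances)
```

Each row of `resp` sums to one. For points very far from every component, `evidence` can underflow to zero, so a more careful version would do this computation in log space.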
The M-step then finds the parameters $\theta$ that maximize A, i.e., it solves:
$\frac{\partial A}{\partial \theta}=0$
for each parameter ($\mu_j$, $\sigma_j^2$, and the weights $w_j$, subject to the constraint $\sum_{j=1}^Kw_j=1$). The resulting updates are:
Means: $\hat{\mu}_j=\frac{\sum_{i=1}^N x_i \cdot p(j|x_i, \theta^s)}{\sum_{i=1}^N p(j|x_i, \theta^s)}$

Variance: $\hat{\sigma}_j^2=\frac{\sum_{i=1}^N (x_i-\hat{\mu}_j)^2 \cdot p(j|x_i, \theta^s)}{\sum_{i=1}^N p(j|x_i, \theta^s)}$

Weights: $\hat{w}_j=\frac{1}{N} \sum_{i=1}^N p(j|x_i, \theta^s)$
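
The three updates translate almost line by line into code. Below is a minimal 1-D sketch that takes the posteriors $p(j|x_i,\theta^s)$ as a precomputed table `resp[i][j]`; the function name and the toy inputs are assumptions for illustration.

```python
def m_step(data, resp):
    """Re-estimate weights, means and variances from posteriors resp[i][j]."""
    n = len(data)
    k = len(resp[0])
    weights, means, variances = [], [], []
    for j in range(k):
        # Soft count of examples attributed to component j.
        n_j = sum(resp[i][j] for i in range(n))
        # Posterior-weighted mean.
        mu_j = sum(resp[i][j] * data[i] for i in range(n)) / n_j
        # Posterior-weighted variance around the new mean.
        var_j = sum(resp[i][j] * (data[i] - mu_j) ** 2 for i in range(n)) / n_j
        weights.append(n_j / n)
        means.append(mu_j)
        variances.append(var_j)
    return weights, means, variances

# Toy check with hard (0/1) posteriors: the updates reduce to per-cluster statistics.
data = [1.0, 3.0, 10.0, 14.0]
resp = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
result = m_step(data, resp)  # → ([0.5, 0.5], [2.0, 12.0], [1.0, 4.0])
```

With hard 0/1 posteriors the formulas collapse to the ordinary per-cluster fraction, mean, and variance, which is a quick sanity check. A component whose soft count `n_j` reaches zero would cause a division by zero, so practical code usually guards against that case.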
EM only converges to a local maximum of the likelihood, so its result is very sensitive to the initial parameters. Hence, K-means is often used to initialize EM.
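
One way to realize this initialization, sketched here for 1-D data, is to run a few iterations of plain K-means (Lloyd's algorithm) and turn each cluster into an initial Gaussian. The function names, the iteration count, and the variance floor `min_var` are all assumptions for illustration.

```python
import random

def assign(data, centers):
    """Group each point with its nearest center."""
    clusters = [[] for _ in centers]
    for x in data:
        j = min(range(len(centers)), key=lambda c: (x - centers[c]) ** 2)
        clusters[j].append(x)
    return clusters

def kmeans_1d(data, k, iters=20, seed=0):
    """Plain Lloyd's algorithm on 1-D data."""
    rng = random.Random(seed)
    centers = rng.sample(data, k)  # k distinct data points as starting centers
    for _ in range(iters):
        clusters = assign(data, centers)
        # Keep the old center if a cluster happens to be empty.
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers, assign(data, centers)

def init_gmm_from_kmeans(data, k, min_var=1e-6):
    """Derive initial GMM weights, means and variances from K-means clusters."""
    centers, clusters = kmeans_1d(data, k)
    weights = [len(c) / len(data) for c in clusters]
    variances = [max(sum((x - m) ** 2 for x in c) / len(c), min_var) if c else min_var
                 for m, c in zip(centers, clusters)]
    return weights, centers, variances

data = [-2.2, -2.0, -1.8, 2.9, 3.0, 3.1]
weights, means, variances = init_gmm_from_kmeans(data, k=2)
```

The resulting `weights`, `means`, and `variances` can then serve as $\theta^s$ for the first E-step. An empty cluster would yield a zero weight, which a fuller implementation would need to handle, for example by re-seeding that center.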

4. Adapted GMM

In some cases, you have access to only a few examples from the target distribution, but many from a nearby distribution.
In such cases, maximum a posteriori (MAP) adaptation is the best-known and most widely used approach for GMMs.
Normal maximum likelihood training for a data set x:
$\theta^*=\arg\max_\theta p(x|\theta)$
MAP training:
$\begin{array}{ll}\theta^*&=\arg\max_\theta p(\theta|x)\\ &=\arg\max_\theta \frac{p(x|\theta)p(\theta)}{p(x)} \\& = \arg\max_\theta p(x|\theta)p(\theta)\end{array}$
where $p(\theta)$ represents your prior belief about the distribution of the parameters $\theta$. The last step holds because $p(x)$ does not depend on $\theta$.
To select a proper prior distribution, we often use conjugate priors, which keep the EM algorithm tractable:
- Dirichlet distribution for weights
- Gaussian densities for means and variances.
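
As an illustration of how MAP adaptation behaves in practice, the sketch below adapts only the means of a 1-D prior GMM using the relevance-factor form that is common in speaker verification: $\hat{\mu}_j=\alpha_j \bar{x}_j+(1-\alpha_j)\mu_j^{prior}$ with $\alpha_j=n_j/(n_j+r)$, where $n_j$ is the soft count and $\bar{x}_j$ the posterior-weighted data mean. This specific formula, the relevance factor `relevance`, and all names and values are assumptions brought in for the example; they are not derived in these notes.

```python
import math

def gaussian_pdf(x, mu, var):
    """1-D Gaussian density N(x | mu, var)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def map_adapt_means(data, weights, prior_means, variances, relevance=16.0):
    """Adapt only the means of a prior GMM (relevance-factor form, assumed)."""
    k = len(weights)
    counts = [0.0] * k  # soft counts n_j
    sums = [0.0] * k    # posterior-weighted sums of x
    for x in data:
        joint = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, prior_means, variances)]
        evidence = sum(joint)
        for j in range(k):
            r = joint[j] / evidence  # posterior p(j | x) under the prior model
            counts[j] += r
            sums[j] += r * x
    adapted = []
    for j in range(k):
        alpha = counts[j] / (counts[j] + relevance)
        data_mean = sums[j] / counts[j] if counts[j] > 0.0 else prior_means[j]
        # Interpolate between the data mean and the prior mean.
        adapted.append(alpha * data_mean + (1.0 - alpha) * prior_means[j])
    return adapted

# Prior model (assumed) and only three target examples.
adapted = map_adapt_means([0.8, 1.0, 1.2], [0.5, 0.5], [0.0, 10.0],
                          [1.0, 1.0], relevance=3.0)
```

Components that see a lot of target data move toward that data, while components that see almost none stay at their prior means, which matches the few-examples setting described above.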

References

Pluskid, 漫谈 Clustering (3): Gaussian Mixture Model
Samy Bengio, Statistical Machine Learning from Data Gaussian Mixture Models.
