Algorithm Learning: Clustering - Gaussian Mixture Model with Expectation-Maximization

Introduction

The Gaussian mixture model (GMM) is a widely used clustering algorithm in industry. It uses the Gaussian distribution as its parametric model and is trained with the Expectation-Maximization (EM) algorithm. At the same time, it also provides an accurate estimate of the posterior probability of each cluster assignment.



Inductive Bias

Before describing the GMM, inductive bias needs to be explained. Whether for a machine or a person, the learning process can be viewed as a process of induction, and induction requires some hypothetical preconditions. In machine learning, a learning algorithm also has such preconditions, which are called its inductive bias. In my understanding, inductive bias is related to but different from Dr. Ng's "hypothesis representation": both are inductive premises of the learner, but inductive bias is more general. It only needs to specify a constraining condition, not necessarily one expressed as a function.

The inductive bias must be chosen carefully. No bias, or a bias that is too general, leads to overfitting, while an overly restrictive bias often produces absurd results. The difficulty lies in finding a balance.

GMM has a very apt inductive bias: it assumes that the data obey a mixture of Gaussian distributions, i.e., the data can be thought of as being generated from several Gaussian distributions. This bias is reasonable, it has nice computational properties, and, by increasing the number of components, a GMM can approximate any continuous PDF arbitrarily well.


Theories

GMM is a simple extension of the single Gaussian model: it uses a combination of multiple Gaussian distributions to characterize the data distribution. GMM is often used when data in the same collection come from several different distributions. The Expectation-Maximization algorithm is used to train the Gaussian mixture model.

Drawing a random point from a GMM can be divided into two steps: first, randomly select one of the K Gaussian components, where the probability of each component being chosen is its weight coefficient; then draw a point from that component alone, which reduces to sampling from a single Gaussian distribution.
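To make this two-step generative view concrete, here is a minimal sketch in Python; the weights, means, and standard deviations below are illustrative assumptions, not values from this article:

```python
# A minimal sketch of the two-step sampling process for a 1-D GMM.
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.3, 0.5, 0.2])      # mixing coefficients, sum to 1
mu = np.array([-2.0, 0.0, 3.0])     # component means
sigma = np.array([0.5, 1.0, 0.8])   # component standard deviations

def sample_gmm(n):
    # Step 1: pick a component index k with probability pi_k.
    k = rng.choice(len(pi), size=n, p=pi)
    # Step 2: draw from the selected Gaussian N(mu_k, sigma_k).
    return rng.normal(mu[k], sigma[k])

x = sample_gmm(1000)
print(x[:5])
```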

The Gaussian mixture model gives you more information than the k-means model: k-means assigns each data point to exactly one cluster, whereas GMM gives the probability that each data point belongs to each cluster. This is also called a soft assignment.
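As a small illustration of the hard vs. soft distinction, the following sketch compares the two with scikit-learn on an assumed toy dataset (the data and hyperparameters are for demonstration only):

```python
# Compare k-means hard labels with GMM soft assignments on toy data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Two overlapping 2-D blobs as a toy dataset.
X = np.vstack([rng.normal([0, 0], 1.0, size=(200, 2)),
               rng.normal([3, 3], 1.0, size=(200, 2))])

hard = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
soft = GaussianMixture(n_components=2, random_state=0).fit(X).predict_proba(X)

print(hard[:3])           # one cluster label per point
print(soft[:3].round(3))  # a probability per cluster for each point
```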


Procedure Explanation

1. Hypothesis Representation

GMM refers to a PDF model with the following form:
$$p(x;\theta)=\sum^K_{k=1}\pi_k\phi(x;\theta_k)\qquad(1)$$

where $\pi_k$ is the mixing coefficient, $\pi_k\geq 0$ and $\sum^K_{k=1}\pi_k=1$, and $\theta_k=(\mu_k,\sigma_k)$.

$\phi(x\mid\theta_k)$ is the Gaussian PDF:

$$\phi(x\mid\theta_k)=\frac{1}{\sqrt{2\pi}\,\sigma_k}\exp\left(-\frac{(x-\mu_k)^2}{2\sigma_k^2}\right)\qquad(2)$$

Each GMM consists of $K$ Gaussian distributions; each Gaussian is called a "Component", and these components are linearly combined to form the PDF of the model. The probability that each component is selected is its coefficient $\pi_k$.

When GMM is used for clustering, we estimate the parameters $(\pi_k,\mu_k,\sigma_k)$ from the existing data and the assumed form of the probability density function; this process is known as "parameter estimation".
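Here is a minimal sketch of evaluating the mixture density in equations (1)-(2) with SciPy's Gaussian PDF; the parameter values are the same illustrative assumptions as above:

```python
# Evaluate the GMM density p(x; theta) = sum_k pi_k * N(x; mu_k, sigma_k).
import numpy as np
from scipy.stats import norm

pi = np.array([0.3, 0.5, 0.2])
mu = np.array([-2.0, 0.0, 3.0])
sigma = np.array([0.5, 1.0, 0.8])

def gmm_pdf(x):
    # Sum the weighted component densities (equation (1) with phi from (2)).
    x = np.asarray(x, dtype=float)
    return sum(p * norm.pdf(x, m, s) for p, m, s in zip(pi, mu, sigma))

print(gmm_pdf([-2.0, 0.0, 3.0]))
```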

2. The Dilemma of Maximum Likelihood Estimation

When a PDF is chosen as the model, the first idea is usually to fit it by maximum likelihood estimation. But why does GMM run into a "dilemma" with this method?

For the training data set $T=\lbrace x^{(1)},x^{(2)},\dots,x^{(m)}\rbrace$, $x^{(i)}\in\chi\subseteq R^n$, suppose each $x^{(i)}$ comes from a GMM. The log-likelihood function is:

$$L(\theta)=\sum^m_{i=1}\log p(x^{(i)};\mu,\sigma,\pi)=\sum^m_{i=1}\log\sum_{k=1}^K\pi_k N(x^{(i)};\mu_k,\sigma_k)\qquad(3)$$

Here a summation appears inside the logarithm, which makes the derivative hard to work with, so maximum likelihood estimation runs into trouble.
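For reference, a minimal sketch of evaluating the log-likelihood (3) numerically; log-sum-exp is used for stability, and the parameters are again illustrative assumptions:

```python
# Compute L(theta) = sum_i log sum_k pi_k * N(x_i; mu_k, sigma_k).
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

pi = np.array([0.3, 0.5, 0.2])
mu = np.array([-2.0, 0.0, 3.0])
sigma = np.array([0.5, 1.0, 0.8])

def log_likelihood(x):
    x = np.asarray(x, dtype=float)
    # Log of each weighted component density, shape (m, K).
    log_terms = np.log(pi) + norm.logpdf(x[:, None], mu, sigma)
    # Log-sum-exp over components, then sum over samples.
    return logsumexp(log_terms, axis=1).sum()

x = np.array([-1.9, 0.2, 2.8, 0.1])
print(log_likelihood(x))
```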

3. The Bayesian Understanding of GMM

As we mentioned, $\pi_k$ can be viewed as the probability that the $k$th class is selected. We introduce a $K$-dimensional random variable $z=(z_1,z_2,\dots,z_K)$, where $z_k\ (1\leq k\leq K)$ indicates whether the $k$th Gaussian distribution is selected: $z_k=1$ means it is, and $z_k=0$ means it is not. It has the following properties:

$$z_k\in\lbrace 0,1\rbrace,\qquad p(z_k=1)=\pi_k,\qquad \sum_{k=1}^K z_k=1$$

Since $z$ uses a 1-of-K representation (exactly one $z_k$ equals 1 and all the other $z_j\ (j\neq k)$ equal 0), the joint probability distribution of $z$ can be written as:

$$p(z)=\prod_{k=1}^K p(z_k=1)^{z_k}=\prod_{k=1}^K\pi_k^{z_k}\qquad(4)$$

The data generated by each selected class are normally distributed, which can be expressed as a conditional probability:

$$p(x\mid z_k=1)=N(x;\mu_k,\sigma_k)\qquad(5)$$

$$p(x\mid z)=\prod_{k=1}^K N(x;\mu_k,\sigma_k)^{z_k}\qquad(6)$$

According to (4) and (6), combined with the law of total probability, we can obtain the form of $p(x)$:

$$p(x)=\sum_z p(x,z)=\sum_z p(z)\,p(x\mid z)=\sum_z\prod_{k=1}^K\left(\pi_k^{z_k}N(x;\mu_k,\sigma_k)^{z_k}\right)\qquad(7)$$

$$\phantom{p(x)}=\sum_{k=1}^K\pi_k N(x;\mu_k,\sigma_k)\qquad(8)$$

It can be seen that (8) has the same form as the original Gaussian mixture model (1), while the latent variable $z$ has been introduced through (7).

Latent variable: we know that the data can be divided into classes, but if we randomly pick a point we cannot observe which class it belongs to, so we introduce a latent variable to describe this unobserved assignment.
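As a quick sanity check of the marginalization in (7)-(8), the sketch below sums $p(z)p(x\mid z)$ over the $K$ one-hot values of $z$ and compares it with the mixture density (the parameters are illustrative assumptions):

```python
# Numerically verify (7)-(8): summing p(z) p(x|z) over the K one-hot
# values of z reproduces the mixture density sum_k pi_k N(x; mu_k, sigma_k).
import numpy as np
from scipy.stats import norm

pi = np.array([0.3, 0.5, 0.2])
mu = np.array([-2.0, 0.0, 3.0])
sigma = np.array([0.5, 1.0, 0.8])

x = 0.7
# Marginalization over the latent variable z (one term per one-hot z).
p_marginal = sum(pi[k] * norm.pdf(x, mu[k], sigma[k]) for k in range(len(pi)))
# Direct mixture density, equation (8) / (1).
p_mixture = np.dot(pi, norm.pdf(x, mu, sigma))
print(np.isclose(p_marginal, p_mixture))   # True
```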

Based on Bayes' idea, we can find the posterior probability $p(z\mid x;\theta)$:

$$\begin{aligned}p(z_k=1\mid x;\theta)&=\frac{p(z_k=1)\,p(x\mid z_k=1)}{p(x)}\\&=\frac{p(z_k=1)\,p(x\mid z_k=1)}{\sum^K_{i=1}p(z_i=1)\,p(x\mid z_i=1)}\\&=\frac{\pi_k N(x;\mu_k,\sigma_k)}{\sum^K_{i=1}\pi_i N(x;\mu_i,\sigma_i)}\qquad(9)\end{aligned}$$

We denote the quantity in (9) by $\gamma(z_k)$; it represents the posterior probability of the $k$th component and lays the foundation for the Expectation-Maximization algorithm below.

4. Expectation-Maximization (EM) Algorithm

The EM algorithm is an iterative algorithm for maximum likelihood estimation (or maximum a posteriori estimation) of the parameters of probabilistic models with latent variables. Each iteration consists of two steps: Step E (expectation) and Step M (maximization).

The EM algorithm optimizes the parameters step by step, iterating to maximize $L(\theta)$. A "clever" trick is to work with a lower bound of $L(\theta)$: as long as this lower bound is maximized in each iteration, $L(\theta)$ is guaranteed not to decrease.

Jensen's Inequality

In order to solve the above problem, Jensen's inequality is introduced:

$$f\left(\sum_{i=1}^N\lambda_i x_i\right)\leq\sum_{i=1}^N\lambda_i f(x_i),\qquad \text{s.t.}\ \ \lambda_i\geq 0,\ \ \sum_{i=1}^N\lambda_i=1\qquad(10)$$

where $f(x)$ is a convex function. Equality holds if and only if $x_i\equiv C$ for some constant $C$.

In probabilistic form the inequality reads:

$$f(E[X])\leq E[f(X)]\qquad(11)$$
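A quick numeric check of (11) with the convex function $f(x)=x^2$, using arbitrary assumed sample values and weights:

```python
# Check f(E[X]) <= E[f(X)] for the convex function f(x) = x**2.
import numpy as np

x = np.array([1.0, 2.0, 5.0])
lam = np.array([0.2, 0.5, 0.3])   # weights: non-negative and sum to 1

f = lambda v: v ** 2
lhs = f(np.dot(lam, x))           # f(E[X])
rhs = np.dot(lam, f(x))           # E[f(X)]
print(lhs, rhs, lhs <= rhs)       # 7.29 9.7 True
```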

Step E Algorithm Derivation

We use Jensen's inequality to derive a lower bound of $L(\theta)$ at iteration $t+1$. Since $\log$ is concave, the direction of the inequality in (10) is reversed, and the bounding goes as follows:
$$L(\theta)\geq\sum^m_{i=1}\sum_{z^{(i)}}T(z^{(i)})\log\frac{p(x^{(i)},z^{(i)};\theta)}{T(z^{(i)})}\qquad(12)$$

where $T(z^{(i)})$ plays the role of the constructed $\lambda_i$ in (10); it must be a known quantity once the current estimate $\theta^{(t)}$ is given.

According to the Bayesian understanding above, the posterior probability of $z$ exactly meets the requirements for constructing $T(z^{(i)})$, because:

$$\sum_{z^{(i)}}p(z^{(i)}\mid x^{(i)};\theta^{(t)})=1\qquad(13)$$

$$\frac{p(x^{(i)},z^{(i)};\theta^{(t)})}{p(z^{(i)}\mid x^{(i)};\theta^{(t)})}=\frac{p(z^{(i)}\mid x^{(i)};\theta^{(t)})\,p(x^{(i)};\theta^{(t)})}{p(z^{(i)}\mid x^{(i)};\theta^{(t)})}=p(x^{(i)};\theta^{(t)})\equiv C\qquad(14)$$

With this choice the ratio inside the logarithm is constant in $z^{(i)}$, so the equality condition of Jensen's inequality holds at $\theta=\theta^{(t)}$.

In fact, this is exactly how the EM algorithm constructs the lower-bound function. We set:

$$T(z_k^{(i)}=1)=p(z_k^{(i)}=1\mid x^{(i)};\theta^{(t)})=\frac{p(z_k^{(i)}=1)\,p(x^{(i)}\mid z_k^{(i)}=1)}{\sum^K_{j=1}p(z^{(i)}_j=1)\,p(x^{(i)}\mid z^{(i)}_j=1)}=\frac{\pi_k N(x^{(i)};\mu_k,\sigma_k)}{\sum^K_{j=1}\pi_j N(x^{(i)};\mu_j,\sigma_j)}\qquad(15)$$

and write

$$w_k^i=p(z_k^{(i)}=1\mid x^{(i)};\theta^{(t)})\qquad(16)$$

$w_k^i$ is the probability that the $i$th sample comes from the $k$th Gaussian distribution; taken together, the $w_k^i$ form a matrix of shape n_samples $\times$ k_components.
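A minimal sketch of the E-step computation (15)-(16): the responsibilities form an n_samples × k_components matrix whose rows sum to 1 (the parameters are the same illustrative assumptions as before):

```python
# E-step: compute responsibilities w[i, k] = p(z_k = 1 | x_i; theta^(t)).
import numpy as np
from scipy.stats import norm

pi = np.array([0.3, 0.5, 0.2])
mu = np.array([-2.0, 0.0, 3.0])
sigma = np.array([0.5, 1.0, 0.8])

def e_step(x):
    x = np.asarray(x, dtype=float)
    # Weighted component densities, shape (n_samples, k_components).
    weighted = pi * norm.pdf(x[:, None], mu, sigma)
    # Normalize each row so the responsibilities sum to 1 over components.
    return weighted / weighted.sum(axis=1, keepdims=True)

w = e_step(np.array([-1.9, 0.2, 2.8, 0.1]))
print(w.round(3), w.sum(axis=1))   # each row sums to 1
```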

This gives us the key $Q$ function (Dr. Li Hang defines the $Q$ function in *Statistical Learning Methods* as the expectation of the complete-data log-likelihood $\log p(x^{(i)},z^{(i)};\theta)$ with respect to the conditional distribution $p(z^{(i)}\mid x^{(i)};\theta^{(t)})$ of the latent data $z$, given the observed data $x$ and the current parameters $\theta^{(t)}$):

$$Q(\theta,\theta^{(t)})=\sum^m_{i=1}\sum_{z^{(i)}}T(z^{(i)})\log\frac{p(x^{(i)},z^{(i)};\theta)}{T(z^{(i)})}=\sum^m_{i=1}\sum_{z^{(i)}}p(z^{(i)}\mid x^{(i)};\theta^{(t)})\log\frac{p(x^{(i)},z^{(i)};\theta)}{p(z^{(i)}\mid x^{(i)};\theta^{(t)})}\qquad(17)$$

Step M Algorithm Derivation

Once the $Q$ function is obtained, it is maximized to obtain the model parameters for the next iteration:

$$\theta^{(t+1)}=\arg\max_\theta Q(\theta,\theta^{(t)})$$

First, substituting (4), (5) and (16) into the $Q$ function gives:

$$Q(\theta,\theta^{(t)})=\sum^m_{i=1}\sum_{k=1}^K T(z^{(i)}_k=1)\log\frac{p(x^{(i)}\mid z^{(i)}_k=1;\mu,\sigma)\,p(z^{(i)}_k=1;\pi)}{T(z^{(i)}_k=1)}=\sum^m_{i=1}\sum_{k=1}^K w_k^i\log\frac{\pi_k\frac{1}{(2\pi)^{\frac{n}{2}}|\sigma_k|^{\frac{1}{2}}}\exp\left(-\frac{1}{2}(x^{(i)}-\mu_k)^T\sigma_k^{-1}(x^{(i)}-\mu_k)\right)}{w_k^i}\qquad(18)$$
Take the partial derivative of this expression with respect to each parameter, set it to zero, solve for the maximizing parameter values, and assign them to the parameters for the next iteration:

$$\mu_k:=\frac{\sum_{i=1}^m w_k^i x^{(i)}}{\sum_{i=1}^m w_k^i}\qquad(19)$$

$$\sigma_k:=\frac{\sum_{i=1}^m w_k^i (x^{(i)}-\mu_k)^T(x^{(i)}-\mu_k)}{\sum_{i=1}^m w_k^i}\qquad(20)$$

$$\pi_k:=\frac{\sum_{i=1}^m w_k^i}{m}\qquad(21)$$

where $w_k^i=p(z_k^{(i)}=1\mid x^{(i)};\theta^{(t)})$.
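A minimal sketch of the M-step updates (19)-(21) for the one-dimensional case, taking a responsibility matrix like the one produced by the E-step sketch above (the variable names are my own):

```python
# M-step: update (pi, mu, sigma) from data x and responsibilities w.
import numpy as np

def m_step(x, w):
    # w has shape (m, K); x has shape (m,).
    x = np.asarray(x, dtype=float)
    nk = w.sum(axis=0)                                   # effective count per component
    mu = (w * x[:, None]).sum(axis=0) / nk               # equation (19)
    var = (w * (x[:, None] - mu) ** 2).sum(axis=0) / nk  # equation (20), 1-D variance
    pi = nk / len(x)                                     # equation (21)
    return pi, mu, np.sqrt(var)                          # return std dev for N(mu, sigma)

# Example with a hand-made responsibility matrix for 4 samples, 2 components.
x = np.array([-1.9, 0.2, 2.8, 0.1])
w = np.array([[0.9, 0.1], [0.4, 0.6], [0.05, 0.95], [0.5, 0.5]])
print(m_step(x, w))
```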

EM Algorithm Flow

Input: observational dataset $T=\lbrace x^{(1)},x^{(2)},\dots,x^{(m)}\rbrace$, $x^{(i)}\in\chi\subseteq R^n$; Gaussian mixture model $p(x;\theta)$

Output: GMM parameters $(\pi,\mu,\sigma)$, each a $K$-dimensional vector

1. Define the number of components $K$, initialize the parameters $(\pi,\mu,\sigma)$, and start the iteration.
2. Step E: using the model parameters of the current iteration $t$, calculate the responsibility $w_k^i$ of sub-model $k$ for each observed data point $x^{(i)}$: $$w_k^i=\frac{\pi_k N(x^{(i)};\mu_k,\sigma_k)}{\sum^K_{j=1}\pi_j N(x^{(i)};\mu_j,\sigma_j)},\quad i=1,2,\dots,m;\ k=1,2,\dots,K$$
3. Step M: calculate the model parameters of the new iteration $t+1$: $$\mu_k^{new}:=\frac{\sum_{i=1}^m w_k^i x^{(i)}}{\sum_{i=1}^m w_k^i}\qquad \sigma_k^{new}:=\frac{\sum_{i=1}^m w_k^i(x^{(i)}-\mu_k)^T(x^{(i)}-\mu_k)}{\sum_{i=1}^m w_k^i}\qquad \pi_k^{new}:=\frac{\sum_{i=1}^m w_k^i}{m}$$
4. Compute the log-likelihood $L(\theta)$ and test whether it has converged; if not, return to step 2 (a complete sketch of this loop is given below).
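Below is a minimal, self-contained sketch of the whole flow for a one-dimensional GMM, combining the E-step and M-step pieces above; the toy data, $K$, and convergence tolerance are assumptions for illustration:

```python
# A compact EM loop for a 1-D Gaussian mixture model.
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

rng = np.random.default_rng(0)
# Toy data: two Gaussian clusters.
x = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 200)])

K = 2
pi = np.full(K, 1.0 / K)                 # step 1: initialization
mu = rng.choice(x, K, replace=False)
sigma = np.full(K, x.std())

prev_ll = -np.inf
for _ in range(200):
    # Step E: responsibilities w[i, k].
    log_w = np.log(pi) + norm.logpdf(x[:, None], mu, sigma)
    ll = logsumexp(log_w, axis=1)        # per-sample log-likelihood
    w = np.exp(log_w - ll[:, None])
    # Step M: update parameters per (19)-(21).
    nk = w.sum(axis=0)
    mu = (w * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((w * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    pi = nk / len(x)
    # Step 4: convergence test on the log-likelihood L(theta).
    if ll.sum() - prev_ll < 1e-6:
        break
    prev_ll = ll.sum()

print(pi.round(3), mu.round(3), sigma.round(3))
```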

Convergence Verification of EM Algorithm

In fact, the EM algorithm maximizes the log-likelihood by iterative stepwise approximation. Since the likelihood function has an upper bound, we only need to show that each parameter iteration does not decrease $L(\theta)$:

$$L(\theta^{(t+1)})-L(\theta^{(t)})=L(\theta^{(t+1)})-Q(\theta^{(t)},\theta^{(t)})\geq Q(\theta^{(t+1)},\theta^{(t)})-Q(\theta^{(t)},\theta^{(t)})$$

(using $L(\theta^{(t)})=Q(\theta^{(t)},\theta^{(t)})$ from the equality condition (14), and $L(\theta)\geq Q(\theta,\theta^{(t)})$ from (12)). Since the new iterate maximizes the $Q$ function, we have

$$Q(\theta^{(t+1)},\theta^{(t)})-Q(\theta^{(t)},\theta^{(t)})\geq 0$$

so the likelihood function is non-decreasing during iteration and converges, which is why the EM algorithm works.


Potential Applications

Since the normal distribution is common in daily life, GMM has high application value. Combined with big-data technology, GMM can help characterize individual behavior within a population, for example, using facial data to infer a person's ethnicity.

As a classification component, GMM can also be incorporated into a GAN: by providing more specific category information, it can help improve the performance of the final generator.
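Since a fitted GMM is itself a generative model, here is a hedged sketch of fitting one with scikit-learn and drawing new samples from it (the dataset is a toy assumption):

```python
# Fit a GMM with scikit-learn and draw new samples from the fitted model.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0, 0], 1.0, size=(300, 2)),
               rng.normal([4, 4], 0.7, size=(200, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
X_new, labels = gmm.sample(5)      # generated points and their component labels
print(X_new.round(2))
print(gmm.weights_.round(3))       # learned mixing coefficients pi_k
```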
