Algorithm Learning: Clustering - Gaussian Mixture Model with Expectation-Maximization

Introduction

The Gaussian mixture model (GMM) is a widely used clustering algorithm in industry. It uses the Gaussian distribution as its parametric model and is trained with the Expectation-Maximization (EM) algorithm. At the same time, it also provides an accurate estimate of the posterior probability of each cluster assignment.



Inductive Bias

Before describing the GMM, inductive bias needs to be explained. Whether for a machine or a person, the learning process can be viewed as a process of induction, and induction requires some hypothetical preconditions. In machine learning, a learning algorithm also has such preconditions, which are called its inductive bias. In my understanding, inductive bias is related to but different from Dr. Ng's "hypothesis representation": both are inductive premises of the learner, but inductive bias is more general. It only needs to specify a constraining condition, not necessarily one expressed as a function.

The inductive bias must be chosen carefully. No bias, or a bias that is too general, leads to overfitting, while an overly restrictive bias often produces absurd results. The difficulty lies in finding a balance.

GMM has a very apt inductive bias: it assumes that the data obey a mixture of Gaussian distributions, i.e., the data can be thought of as being generated from several Gaussian distributions. This bias is reasonable, it has nice computational properties, and, by increasing the number of components, a GMM can approximate any continuous PDF arbitrarily well.


Theories

GMM is a simple extension of the single Gaussian model: it uses a combination of multiple Gaussian distributions to characterize the data distribution. GMM is often used when data in the same collection come from several different distributions. The Expectation-Maximization algorithm is used to train the Gaussian mixture model.

Drawing a random point from a GMM can be divided into two steps: first, randomly select one of the K Gaussian components, where the probability of each component being chosen is its weight coefficient; then draw a point from that component alone, which reduces to sampling from a single Gaussian distribution.
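To make this two-step generative view concrete, here is a minimal sketch in Python; the weights, means, and standard deviations below are illustrative assumptions, not values from this article:

```python
# A minimal sketch of the two-step sampling process for a 1-D GMM.
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.3, 0.5, 0.2])      # mixing coefficients, sum to 1
mu = np.array([-2.0, 0.0, 3.0])     # component means
sigma = np.array([0.5, 1.0, 0.8])   # component standard deviations

def sample_gmm(n):
    # Step 1: pick a component index k with probability pi_k.
    k = rng.choice(len(pi), size=n, p=pi)
    # Step 2: draw from the selected Gaussian N(mu_k, sigma_k).
    return rng.normal(mu[k], sigma[k])

x = sample_gmm(1000)
print(x[:5])
```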

The Gaussian mixture model gives you more information than the k-means model: k-means assigns each data point to exactly one cluster, whereas GMM gives the probability that each data point belongs to each cluster. This is also called a soft assignment.
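As a small illustration of the hard vs. soft distinction, the following sketch compares the two with scikit-learn on an assumed toy dataset (the data and hyperparameters are for demonstration only):

```python
# Compare k-means hard labels with GMM soft assignments on toy data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Two overlapping 2-D blobs as a toy dataset.
X = np.vstack([rng.normal([0, 0], 1.0, size=(200, 2)),
               rng.normal([3, 3], 1.0, size=(200, 2))])

hard = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
soft = GaussianMixture(n_components=2, random_state=0).fit(X).predict_proba(X)

print(hard[:3])           # one cluster label per point
print(soft[:3].round(3))  # a probability per cluster for each point
```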


Procedure Explanation

1. Hypothesis Representation

GMM refers to a PDF model with the following form:
$$p(x;\theta)=\sum^K_{k=1}\pi_k\phi(x;\theta_k)\qquad(1)$$

where $\pi_k$ is the mixing coefficient, $\pi_k\geq 0$ and $\sum^K_{k=1}\pi_k=1$, and $\theta_k=(\mu_k,\sigma_k)$.

$\phi(x\mid\theta_k)$ is the Gaussian PDF:

$$\phi(x\mid\theta_k)=\frac{1}{\sqrt{2\pi}\,\sigma_k}\exp\left(-\frac{(x-\mu_k)^2}{2\sigma_k^2}\right)\qquad(2)$$

Each GMM consists of $K$ Gaussian distributions; each Gaussian is called a "Component", and these components are linearly combined to form the PDF of the model. The probability that each component is selected is its coefficient $\pi_k$.

When GMM is used for clustering, we estimate the parameters $(\pi_k,\mu_k,\sigma_k)$ from the existing data and the assumed form of the probability density function; this process is known as "parameter estimation".
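Here is a minimal sketch of evaluating the mixture density in equations (1)-(2) with SciPy's Gaussian PDF; the parameter values are the same illustrative assumptions as above:

```python
# Evaluate the GMM density p(x; theta) = sum_k pi_k * N(x; mu_k, sigma_k).
import numpy as np
from scipy.stats import norm

pi = np.array([0.3, 0.5, 0.2])
mu = np.array([-2.0, 0.0, 3.0])
sigma = np.array([0.5, 1.0, 0.8])

def gmm_pdf(x):
    # Sum the weighted component densities (equation (1) with phi from (2)).
    x = np.asarray(x, dtype=float)
    return sum(p * norm.pdf(x, m, s) for p, m, s in zip(pi, mu, sigma))

print(gmm_pdf([-2.0, 0.0, 3.0]))
```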

2. The Dilemma of Maximum Likelihood Estimation

When a PDF is chosen as the model, the first idea is usually to fit it by maximum likelihood estimation. But why does GMM run into a "dilemma" with this method?

For the training data set $T=\lbrace x^{(1)},x^{(2)},\dots,x^{(m)}\rbrace$, $x^{(i)}\in\chi\subseteq R^n$, suppose each $x^{(i)}$ comes from a GMM. The log-likelihood function is:

$$L(\theta)=\sum^m_{i=1}\log p(x^{(i)};\mu,\sigma,\pi)=\sum^m_{i=1}\log\sum_{k=1}^K\pi_k N(x^{(i)};\mu_k,\sigma_k)\qquad(3)$$

Here a summation appears inside the logarithm, which makes the derivative hard to work with, so maximum likelihood estimation runs into trouble.
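For reference, a minimal sketch of evaluating the log-likelihood (3) numerically; log-sum-exp is used for stability, and the parameters are again illustrative assumptions:

```python
# Compute L(theta) = sum_i log sum_k pi_k * N(x_i; mu_k, sigma_k).
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

pi = np.array([0.3, 0.5, 0.2])
mu = np.array([-2.0, 0.0, 3.0])
sigma = np.array([0.5, 1.0, 0.8])

def log_likelihood(x):
    x = np.asarray(x, dtype=float)
    # Log of each weighted component density, shape (m, K).
    log_terms = np.log(pi) + norm.logpdf(x[:, None], mu, sigma)
    # Log-sum-exp over components, then sum over samples.
    return logsumexp(log_terms, axis=1).sum()

x = np.array([-1.9, 0.2, 2.8, 0.1])
print(log_likelihood(x))
```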

3. The Bayesian Understanding of GMM

As we mentioned, $\pi_k$ can be viewed as the probability that the $k$th class is selected. We introduce a $K$-dimensional random variable $z=(z_1,z_2,\dots,z_K)$, where $z_k\ (1\leq k\leq K)$ indicates whether the $k$th Gaussian distribution is selected: $z_k=1$ means it is, and $z_k=0$ means it is not. It has the following properties:

$$z_k\in\lbrace 0,1\rbrace,\qquad p(z_k=1)=\pi_k,\qquad \sum_{k=1}^K z_k=1$$

Since $z$ uses a 1-of-K representation (exactly one $z_k$ equals 1 and all the other $z_j\ (j\neq k)$ equal 0), the joint probability distribution of $z$ can be written as:

$$p(z)=\prod_{k=1}^K p(z_k=1)^{z_k}=\prod_{k=1}^K\pi_k^{z_k}\qquad(4)$$

The data generated by each selected class are normally distributed, which can be expressed as a conditional probability:

$$p(x\mid z_k=1)=N(x;\mu_k,\sigma_k)\qquad(5)$$

$$p(x\mid z)=\prod_{k=1}^K N(x;\mu_k,\sigma_k)^{z_k}\qquad(6)$$

According to (4) and (6), combined with the law of total probability, we can obtain the form of $p(x)$:

$$p(x)=\sum_z p(x,z)=\sum_z p(z)\,p(x\mid z)=\sum_z\prod_{k=1}^K\left(\pi_k^{z_k}N(x;\mu_k,\sigma_k)^{z_k}\right)\qquad(7)$$

$$\phantom{p(x)}=\sum_{k=1}^K\pi_k N(x;\mu_k,\sigma_k)\qquad(8)$$

It can be seen that (8) has the same form as the original Gaussian mixture model (1), while the latent variable $z$ has been introduced through (7).

Latent variable: we know that the data can be divided into classes, but if we randomly pick a point we cannot observe which class it belongs to, so we introduce a latent variable to describe this unobserved assignment.
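As a quick sanity check of the marginalization in (7)-(8), the sketch below sums $p(z)p(x\mid z)$ over the $K$ one-hot values of $z$ and compares it with the mixture density (the parameters are illustrative assumptions):

```python
# Numerically verify (7)-(8): summing p(z) p(x|z) over the K one-hot
# values of z reproduces the mixture density sum_k pi_k N(x; mu_k, sigma_k).
import numpy as np
from scipy.stats import norm

pi = np.array([0.3, 0.5, 0.2])
mu = np.array([-2.0, 0.0, 3.0])
sigma = np.array([0.5, 1.0, 0.8])

x = 0.7
# Marginalization over the latent variable z (one term per one-hot z).
p_marginal = sum(pi[k] * norm.pdf(x, mu[k], sigma[k]) for k in range(len(pi)))
# Direct mixture density, equation (8) / (1).
p_mixture = np.dot(pi, norm.pdf(x, mu, sigma))
print(np.isclose(p_marginal, p_mixture))   # True
```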

Based on Bayes' idea, we can find the posterior probability $p(z\mid x;\theta)$:

$$\begin{aligned}p(z_k=1\mid x;\theta)&=\frac{p(z_k=1)\,p(x\mid z_k=1)}{p(x)}\\&=\frac{p(z_k=1)\,p(x\mid z_k=1)}{\sum^K_{i=1}p(z_i=1)\,p(x\mid z_i=1)}\\&=\frac{\pi_k N(x;\mu_k,\sigma_k)}{\sum^K_{i=1}\pi_i N(x;\mu_i,\sigma_i)}\qquad(9)\end{aligned}$$

We denote the quantity in (9) by $\gamma(z_k)$; it represents the posterior probability of the $k$th component and lays the foundation for the Expectation-Maximization algorithm below.

4. Expectation-Maximization (EM) Algorithm

The EM algorithm is an iterative algorithm for maximum likelihood estimation (or maximum a posteriori estimation) of the parameters of probabilistic models with latent variables. Each iteration consists of two steps: Step E (expectation) and Step M (maximization).

The EM algorithm optimizes the parameters step by step, iterating to maximize $L(\theta)$. A "clever" trick is to work with a lower bound of $L(\theta)$: as long as this lower bound is maximized in each iteration, $L(\theta)$ is guaranteed not to decrease.

Jensen's Inequality

In order to solve the above problem, Jensen's inequality is introduced:

$$f\left(\sum_{i=1}^N\lambda_i x_i\right)\leq\sum_{i=1}^N\lambda_i f(x_i),\qquad \text{s.t.}\ \ \lambda_i\geq 0,\ \ \sum_{i=1}^N\lambda_i=1\qquad(10)$$

where $f(x)$ is a convex function. Equality holds if and only if $x_i\equiv C$ for some constant $C$.

In probabilistic form the inequality reads:

$$f(E[X])\leq E[f(X)]\qquad(11)$$
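A quick numeric check of (11) with the convex function $f(x)=x^2$, using arbitrary assumed sample values and weights:

```python
# Check f(E[X]) <= E[f(X)] for the convex function f(x) = x**2.
import numpy as np

x = np.array([1.0, 2.0, 5.0])
lam = np.array([0.2, 0.5, 0.3])   # weights: non-negative and sum to 1

f = lambda v: v ** 2
lhs = f(np.dot(lam, x))           # f(E[X])
rhs = np.dot(lam, f(x))           # E[f(X)]
print(lhs, rhs, lhs <= rhs)       # 7.29 9.7 True
```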

Step E Algorithm Derivation

We use Jensen's inequality to derive a lower bound of $L(\theta)$ at iteration $t+1$. Since $\log$ is concave, the direction of the inequality in (10) is reversed, and the bounding goes as follows:
$$L(\theta)\geq\sum^m_{i=1}\sum_{z^{(i)}}T(z^{(i)})\log\frac{p(x^{(i)},z^{(i)};\theta)}{T(z^{(i)})}\qquad(12)$$

where $T(z^{(i)})$ plays the role of the constructed $\lambda_i$ in (10); it must be a known quantity once the current estimate $\theta^{(t)}$ is given.

According to the Bayesian understanding above, the posterior probability of $z$ exactly meets the requirements for constructing $T(z^{(i)})$, because:

$$\sum_{z^{(i)}}p(z^{(i)}\mid x^{(i)};\theta^{(t)})=1\qquad(13)$$

$$\frac{p(x^{(i)},z^{(i)};\theta^{(t)})}{p(z^{(i)}\mid x^{(i)};\theta^{(t)})}=\frac{p(z^{(i)}\mid x^{(i)};\theta^{(t)})\,p(x^{(i)};\theta^{(t)})}{p(z^{(i)}\mid x^{(i)};\theta^{(t)})}=p(x^{(i)};\theta^{(t)})\equiv C\qquad(14)$$

With this choice the ratio inside the logarithm is constant in $z^{(i)}$, so the equality condition of Jensen's inequality holds at $\theta=\theta^{(t)}$.

In fact, this is exactly how the EM algorithm constructs the lower-bound function. We set:

$$T(z_k^{(i)}=1)=p(z_k^{(i)}=1\mid x^{(i)};\theta^{(t)})=\frac{p(z_k^{(i)}=1)\,p(x^{(i)}\mid z_k^{(i)}=1)}{\sum^K_{j=1}p(z^{(i)}_j=1)\,p(x^{(i)}\mid z^{(i)}_j=1)}=\frac{\pi_k N(x^{(i)};\mu_k,\sigma_k)}{\sum^K_{j=1}\pi_j N(x^{(i)};\mu_j,\sigma_j)}\qquad(15)$$

and write

$$w_k^i=p(z_k^{(i)}=1\mid x^{(i)};\theta^{(t)})\qquad(16)$$

$w_k^i$ is the probability that the $i$th sample comes from the $k$th Gaussian distribution; taken together, the $w_k^i$ form a matrix of shape n_samples $\times$ k_components.
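A minimal sketch of the E-step computation (15)-(16): the responsibilities form an n_samples × k_components matrix whose rows sum to 1 (the parameters are the same illustrative assumptions as before):

```python
# E-step: compute responsibilities w[i, k] = p(z_k = 1 | x_i; theta^(t)).
import numpy as np
from scipy.stats import norm

pi = np.array([0.3, 0.5, 0.2])
mu = np.array([-2.0, 0.0, 3.0])
sigma = np.array([0.5, 1.0, 0.8])

def e_step(x):
    x = np.asarray(x, dtype=float)
    # Weighted component densities, shape (n_samples, k_components).
    weighted = pi * norm.pdf(x[:, None], mu, sigma)
    # Normalize each row so the responsibilities sum to 1 over components.
    return weighted / weighted.sum(axis=1, keepdims=True)

w = e_step(np.array([-1.9, 0.2, 2.8, 0.1]))
print(w.round(3), w.sum(axis=1))   # each row sums to 1
```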

This gives us the key $Q$ function (Dr. Li Hang defines the $Q$ function in *Statistical Learning Methods* as the expectation of the complete-data log-likelihood $\log p(x^{(i)},z^{(i)};\theta)$ with respect to the conditional distribution $p(z^{(i)}\mid x^{(i)};\theta^{(t)})$ of the latent data $z$, given the observed data $x$ and the current parameters $\theta^{(t)}$):

$$Q(\theta,\theta^{(t)})=\sum^m_{i=1}\sum_{z^{(i)}}T(z^{(i)})\log\frac{p(x^{(i)},z^{(i)};\theta)}{T(z^{(i)})}=\sum^m_{i=1}\sum_{z^{(i)}}p(z^{(i)}\mid x^{(i)};\theta^{(t)})\log\frac{p(x^{(i)},z^{(i)};\theta)}{p(z^{(i)}\mid x^{(i)};\theta^{(t)})}\qquad(17)$$

Step M Algorithm Derivation

Once the $Q$ function is obtained, it is maximized to obtain the model parameters for the next iteration:

$$\theta^{(t+1)}=\arg\max_\theta Q(\theta,\theta^{(t)})$$

First, substituting (4), (5) and (16) into the $Q$ function gives:

$$Q(\theta,\theta^{(t)})=\sum^m_{i=1}\sum_{k=1}^K T(z^{(i)}_k=1)\log\frac{p(x^{(i)}\mid z^{(i)}_k=1;\mu,\sigma)\,p(z^{(i)}_k=1;\pi)}{T(z^{(i)}_k=1)}=\sum^m_{i=1}\sum_{k=1}^K w_k^i\log\frac{\pi_k\frac{1}{(2\pi)^{\frac{n}{2}}|\sigma_k|^{\frac{1}{2}}}\exp\left(-\frac{1}{2}(x^{(i)}-\mu_k)^T\sigma_k^{-1}(x^{(i)}-\mu_k)\right)}{w_k^i}\qquad(18)$$
Take the partial derivative of this expression with respect to each parameter, set it to zero, solve for the maximizing parameter values, and assign them to the parameters for the next iteration:

$$\mu_k:=\frac{\sum_{i=1}^m w_k^i x^{(i)}}{\sum_{i=1}^m w_k^i}\qquad(19)$$

$$\sigma_k:=\frac{\sum_{i=1}^m w_k^i (x^{(i)}-\mu_k)^T(x^{(i)}-\mu_k)}{\sum_{i=1}^m w_k^i}\qquad(20)$$

$$\pi_k:=\frac{\sum_{i=1}^m w_k^i}{m}\qquad(21)$$

where $w_k^i=p(z_k^{(i)}=1\mid x^{(i)};\theta^{(t)})$.
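A minimal sketch of the M-step updates (19)-(21) for the one-dimensional case, taking a responsibility matrix like the one produced by the E-step sketch above (the variable names are my own):

```python
# M-step: update (pi, mu, sigma) from data x and responsibilities w.
import numpy as np

def m_step(x, w):
    # w has shape (m, K); x has shape (m,).
    x = np.asarray(x, dtype=float)
    nk = w.sum(axis=0)                                   # effective count per component
    mu = (w * x[:, None]).sum(axis=0) / nk               # equation (19)
    var = (w * (x[:, None] - mu) ** 2).sum(axis=0) / nk  # equation (20), 1-D variance
    pi = nk / len(x)                                     # equation (21)
    return pi, mu, np.sqrt(var)                          # return std dev for N(mu, sigma)

# Example with a hand-made responsibility matrix for 4 samples, 2 components.
x = np.array([-1.9, 0.2, 2.8, 0.1])
w = np.array([[0.9, 0.1], [0.4, 0.6], [0.05, 0.95], [0.5, 0.5]])
print(m_step(x, w))
```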

EM Algorithm Flow

Input: observational dataset $T=\lbrace x^{(1)},x^{(2)},\dots,x^{(m)}\rbrace$, $x^{(i)}\in\chi\subseteq R^n$; Gaussian mixture model $p(x;\theta)$

Output: GMM parameters $(\pi,\mu,\sigma)$, each a $K$-dimensional vector

1. Define the number of components $K$, initialize the parameters $(\pi,\mu,\sigma)$, and start the iteration.
2. Step E: using the model parameters of the current iteration $t$, calculate the responsibility $w_k^i$ of sub-model $k$ for each observed data point $x^{(i)}$: $$w_k^i=\frac{\pi_k N(x^{(i)};\mu_k,\sigma_k)}{\sum^K_{j=1}\pi_j N(x^{(i)};\mu_j,\sigma_j)},\quad i=1,2,\dots,m;\ k=1,2,\dots,K$$
3. Step M: calculate the model parameters of the new iteration $t+1$: $$\mu_k^{new}:=\frac{\sum_{i=1}^m w_k^i x^{(i)}}{\sum_{i=1}^m w_k^i}\qquad \sigma_k^{new}:=\frac{\sum_{i=1}^m w_k^i(x^{(i)}-\mu_k)^T(x^{(i)}-\mu_k)}{\sum_{i=1}^m w_k^i}\qquad \pi_k^{new}:=\frac{\sum_{i=1}^m w_k^i}{m}$$
4. Compute the log-likelihood $L(\theta)$ and test whether it has converged; if not, return to step 2 (a complete sketch of this loop is given below).
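Below is a minimal, self-contained sketch of the whole flow for a one-dimensional GMM, combining the E-step and M-step pieces above; the toy data, $K$, and convergence tolerance are assumptions for illustration:

```python
# A compact EM loop for a 1-D Gaussian mixture model.
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

rng = np.random.default_rng(0)
# Toy data: two Gaussian clusters.
x = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 200)])

K = 2
pi = np.full(K, 1.0 / K)                 # step 1: initialization
mu = rng.choice(x, K, replace=False)
sigma = np.full(K, x.std())

prev_ll = -np.inf
for _ in range(200):
    # Step E: responsibilities w[i, k].
    log_w = np.log(pi) + norm.logpdf(x[:, None], mu, sigma)
    ll = logsumexp(log_w, axis=1)        # per-sample log-likelihood
    w = np.exp(log_w - ll[:, None])
    # Step M: update parameters per (19)-(21).
    nk = w.sum(axis=0)
    mu = (w * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((w * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    pi = nk / len(x)
    # Step 4: convergence test on the log-likelihood L(theta).
    if ll.sum() - prev_ll < 1e-6:
        break
    prev_ll = ll.sum()

print(pi.round(3), mu.round(3), sigma.round(3))
```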

Convergence Verification of EM Algorithm

In fact, the EM algorithm maximizes the log-likelihood by iterative stepwise approximation. Since the likelihood function has an upper bound, we only need to show that each parameter iteration does not decrease $L(\theta)$:

$$L(\theta^{(t+1)})-L(\theta^{(t)})=L(\theta^{(t+1)})-Q(\theta^{(t)},\theta^{(t)})\geq Q(\theta^{(t+1)},\theta^{(t)})-Q(\theta^{(t)},\theta^{(t)})$$

(using $L(\theta^{(t)})=Q(\theta^{(t)},\theta^{(t)})$ from the equality condition (14), and $L(\theta)\geq Q(\theta,\theta^{(t)})$ from (12)). Since the new iterate maximizes the $Q$ function, we have

$$Q(\theta^{(t+1)},\theta^{(t)})-Q(\theta^{(t)},\theta^{(t)})\geq 0$$

so the likelihood function is non-decreasing during iteration and converges, which is why the EM algorithm works.


Potential Applications

Since the normal distribution is common in daily life, GMM has high application value. Combined with big-data technology, GMM can help characterize individual behavior within a population, for example, using facial data to infer a person's ethnicity.

As a classification component, GMM can also be incorporated into a GAN: by providing more specific category information, it can help improve the performance of the final generator.
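Since a fitted GMM is itself a generative model, here is a hedged sketch of fitting one with scikit-learn and drawing new samples from it (the dataset is a toy assumption):

```python
# Fit a GMM with scikit-learn and draw new samples from the fitted model.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0, 0], 1.0, size=(300, 2)),
               rng.normal([4, 4], 0.7, size=(200, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
X_new, labels = gmm.sample(5)      # generated points and their component labels
print(X_new.round(2))
print(gmm.weights_.round(3))       # learned mixing coefficients pi_k
```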
