Machine Learning (2) Estimate the probability density -- Mixture of Gaussians



Chenjing Ding
2018/02/21


| notation | meaning |
| --- | --- |
| M | the number of mixture components |
| p(j) | weight of mixture component j |
| p(x\|θ_j) | mixture component j |
| p(x\|θ) | mixture density |
| θ_j | parameters of the j-th component |

1. Mixture of Multivariate Gaussians

In some cases a single Gaussian distribution cannot represent p(x|θ) well (see the red model in figure 1), so in this chapter we estimate the mixture density of multivariate Gaussians.

1.1 Obtaining the mixture density

Weight of mixture component:

$$p(j) = \pi_j$$

Mixture component:

$$p(x|\theta_j)$$

Mixture density:

$$p(x|\theta) = \sum_{j=1}^{M} p(x|\theta_j)\, p(j)$$
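As a concrete illustration, here is a minimal NumPy/SciPy sketch that evaluates such a mixture density at a point. The two-component weights, means, and covariances are made-up example values, not taken from the post.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Made-up 2-component mixture in 2D: weights p(j), means and covariances theta_j.
weights = np.array([0.4, 0.6])                        # p(j) = pi_j, must sum to 1
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), 2.0 * np.eye(2)]

def mixture_density(x):
    """p(x|theta) = sum_j p(x|theta_j) * p(j)"""
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

print(mixture_density(np.array([1.0, 1.0])))
```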


[figure 1: mixture density]

2. Maximum Likelihood

Using maximum likelihood to estimate μ_j:

$$
\begin{aligned}
E_n(\theta) &= -\ln p(x_n|\theta) \\
E(\theta) &= \sum_{n=1}^{N} E_n(\theta) = -\sum_{n=1}^{N} \ln p(x_n|\theta) \\
\frac{\partial E(\theta)}{\partial \mu_j}
&= -\sum_{n=1}^{N} \frac{\partial p(x_n|\theta)/\partial \mu_j}{p(x_n|\theta)}
 = -\sum_{n=1}^{N} \frac{p(j)\,\partial p(x_n|\theta_j)/\partial \mu_j}{\sum_{k=1}^{M} p(x_n|\theta_k)\,p(k)} \\
&= -\sum_{n=1}^{N} \frac{p(j)\,\Sigma^{-1}(x_n-\mu_j)\,p(x_n|\theta_j)}{\sum_{k=1}^{M} p(x_n|\theta_k)\,p(k)}
 = -\Sigma^{-1}\sum_{n=1}^{N} (x_n-\mu_j)\,\frac{p(j)\,p(x_n|\theta_j)}{\sum_{k=1}^{M} p(x_n|\theta_k)\,p(k)}
\end{aligned}
$$

Define the responsibility

$$
\gamma_j(x_n) = \frac{p(j)\,p(x_n|\theta_j)}{\sum_{k=1}^{M} p(x_n|\theta_k)\,p(k)}
$$

Setting the derivative to zero then gives

$$
\mu_j = \frac{\sum_{n=1}^{N} x_n\,\gamma_j(x_n)}{\sum_{n=1}^{N} \gamma_j(x_n)}
$$

Problem with estimating μ_j:
μ_j depends on γ_j(x_n), and γ_j(x_n) in turn depends on μ_j, so there is no analytical (closed-form) solution.
$$
\gamma_J(x_n) = \frac{p(J)\,p(x_n|\theta_J)}{\sum_{k=1}^{M} p(x_n|\theta_k)\,p(k)}
= \frac{p(x_n|j=J,\theta)\,p(J)}{p(x_n|\theta)}
= \frac{p(x_n, j=J|\theta)}{p(x_n|\theta)}
= p(j=J|x_n,\theta)
$$
Thus γ_j(x_n) represents the "responsibility" of component j for the mixture density given x_n. If we can estimate γ_j(x_n), we can obtain μ_j; K-Means clustering is helpful for this.

3. K-Means Clustering

K-Means clustering assigns each data point to one of K clusters according to its distance to the mean of each cluster.

3.1 steps

step1: Initialization: pick K arbitrary centroids (cluster means)

step2: Assign each sample to the closest centroid.

step3: Adjust the centroids to be the means of the samples assigned to them.

step4: Go to step 2 until the centroids no longer change in step 3.


[figure 2: the process of K-Means clustering (K = 2)]

3.2 Objective function

K-Means optimizes the following objective function:

$$
L = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\,\|x_n - \mu_k\|^2,
\qquad
r_{nk} =
\begin{cases}
1, & k = \arg\min_{k'} \|x_n - \mu_{k'}\|^2 \\
0, & \text{otherwise}
\end{cases}
$$

r_nk is an indicator variable that checks whether μ_k is the nearest cluster center to point x_n.
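The following is a minimal NumPy sketch of the four steps from section 3.1 that also reports this objective L. The function name, arguments, and initialization strategy are illustrative choices, not a reference implementation.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain K-Means: X is an (N, D) data matrix, K the number of clusters."""
    rng = np.random.default_rng(seed)
    # step 1: pick K arbitrary centroids (here: K random data points)
    centroids = X[rng.choice(len(X), K, replace=False)]
    for _ in range(n_iter):
        # step 2: assign each sample to the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)  # (N, K)
        assign = dists.argmin(axis=1)
        # step 3: move each centroid to the mean of the samples assigned to it
        new_centroids = np.array([
            X[assign == k].mean(axis=0) if np.any(assign == k) else centroids[k]
            for k in range(K)
        ])
        # step 4: stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    L = np.sum((X - centroids[assign]) ** 2)  # within-cluster squared error
    return centroids, assign, L
```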

3.3 Advantages and Disadvantages

Advantages:

  • simple and fast to compute
  • converges to a local minimum of the within-cluster squared error

Disadvantages:

  • sensitive to initialization
  • sensitive to outliers
  • difficult to set K properly
  • only detects spherical clusters

[figure 3: the problem of K-Means clustering (K = 2)]

4. EM Algorithm

Once we have used K-Means clustering to obtain the mean of each cluster, we have θ_j = (μ_j, Σ_j) and can estimate the "responsibility" γ_j(x_n) of component j for the mixture density.

4.1 K-Means Clustering Revisited

step1: Initialization: pick K arbitrary centroids [compute θ_j^{(0)} = (μ_j^{(0)}, Σ_j^{(0)})]

step2: Assign each sample to the closest centroid. [compute γ_j(x_n) -- E-step]

step3: Adjust the centroids to be the means of the samples assigned to them. [compute θ_j^{(τ)} = (μ_j^{(τ)}, Σ_j^{(τ)}) -- M-step]

step4: Go to step 2 (until no change)

The process is almost the same as K-Means clustering, but in K-Means each point is hard-assigned to exactly one cluster; there is no soft assignment like γ_j(x_n).

4.2 E-step & M-step

E-step: softly assign samples to mixture components

$$
\gamma_j(x_n) = \frac{p(j)\,p(x_n|\theta_j)}{\sum_{k=1}^{M} p(x_n|\theta_k)\,p(k)},
\qquad \forall j = 1 \ldots K,\ \forall n = 1 \ldots N
$$

M-step: re-estimate the parameters (separately for each mixture component) based on the soft assignments.

$$
\begin{aligned}
\hat{N_j} &= \sum_{n=1}^{N} \gamma_j(x_n) \\
\hat{p(j)} &= \frac{\hat{N_j}}{N} \\
\hat{\mu_j^{new}} &= \frac{\sum_{n=1}^{N} \gamma_j(x_n)\,x_n}{\sum_{n=1}^{N} \gamma_j(x_n)} \\
\hat{\Sigma_j^{new}} &= \frac{1}{\hat{N_j}} \sum_{n=1}^{N} \gamma_j(x_n)\,(x_n - \hat{\mu_j^{new}})(x_n - \hat{\mu_j^{new}})^T
\end{aligned}
$$
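Putting the E-step and M-step together, here is a minimal NumPy/SciPy sketch of EM for a mixture of Gaussians. The function em_gmm and its crude random initialization are assumptions for illustration; a practical implementation would add the regularization and k-Means initialization discussed in section 4.4 below.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    """EM for a K-component Gaussian mixture; X is an (N, D) data matrix."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    # crude initialization: random means, identity covariances, uniform weights
    mu = X[rng.choice(N, K, replace=False)]
    sigma = np.array([np.eye(D) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: gamma_j(x_n) = p(j) p(x_n|theta_j) / sum_k p(k) p(x_n|theta_k)
        gamma = np.stack([pi[j] * multivariate_normal.pdf(X, mean=mu[j], cov=sigma[j])
                          for j in range(K)], axis=1)          # (N, K)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the soft assignments
        Nj = gamma.sum(axis=0)                                  # N_j_hat
        pi = Nj / N                                             # p(j)_hat
        mu = (gamma.T @ X) / Nj[:, None]                        # mu_j_new_hat
        for j in range(K):
            diff = X - mu[j]
            sigma[j] = (gamma[:, j, None] * diff).T @ diff / Nj[j]  # Sigma_j_new_hat
    return pi, mu, sigma, gamma
```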

4.3 Advantages
  • Very general, can represent any (continuous) distribution.
  • Once trained, very fast to evaluate.
  • Can be updated online.
4.4 Caveats
  1. Introduce regularization
    Instead of Σ^{-1}, use (Σ + σI)^{-1}, to avoid a collapsing covariance (Σ → 0) causing p(x_n|θ_j) to go to infinity.
  2. Initialize with k-Means to get better results (a rough sketch follows after this list)
    Typical steps:
    Run k-Means M times (e.g. M = 10~100)
    Pick the best result (lowest error J)
    Use this result to initialize EM
  3. EM for MoG is computationally expensive.
  4. The number of mixture components K must be selected properly (a model selection problem).
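As a rough illustration of caveats 1 and 2, the snippet below assumes the kmeans and em_gmm sketches above and hypothetical variables X, K, D, and sigma already defined; sigma_reg is an illustrative regularization constant, not a value from the post.

```python
# Caveat 1: regularize each covariance so it cannot collapse to a singular matrix,
# i.e. effectively use (Sigma + sigma*I)^{-1} instead of Sigma^{-1}.
sigma_reg = 1e-6
for j in range(K):
    sigma[j] += sigma_reg * np.eye(D)

# Caveat 2: run k-Means several times and keep the best result (lowest L)
# to initialize EM with its centroids.
best_centroids, _, best_L = min((kmeans(X, K, seed=s) for s in range(10)),
                                key=lambda result: result[2])
init_means = best_centroids  # pass these to the EM routine as initial mu
```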