Machine Learning (2) Estimate the probability density -- Mixture of Gaussians



Chenjing Ding
2018/02/21


| notation | meaning |
| --- | --- |
| M | the number of mixture components |
| p(j) | weight of mixture component j |
| p(x\|θ_j) | mixture component j |
| p(x\|θ) | mixture density |
| θ_j | parameters of the j-th component |

1. Mixture of Multivariate Gaussians

In some cases a single Gaussian distribution cannot represent p(x|θ) well (see the red model in figure 1), so in this chapter we estimate the mixture density of multivariate Gaussians.

1.1 Obtaining the mixture density

Weight of mixture component:

$$p(j) = \pi_j$$

Mixture component:

$$p(x|\theta_j)$$

Mixture density:

$$p(x|\theta) = \sum_{j=1}^{M} p(x|\theta_j)\, p(j)$$
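As a concrete illustration, here is a minimal NumPy/SciPy sketch that evaluates such a mixture density at a point. The two-component weights, means, and covariances are made-up example values, not taken from the post.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Made-up 2-component mixture in 2D: weights p(j), means and covariances theta_j.
weights = np.array([0.4, 0.6])                        # p(j) = pi_j, must sum to 1
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), 2.0 * np.eye(2)]

def mixture_density(x):
    """p(x|theta) = sum_j p(x|theta_j) * p(j)"""
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

print(mixture_density(np.array([1.0, 1.0])))
```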


[figure 1: mixture density]

2. Maximum Likelihood

Using maximum likelihood to estimate μ_j:

$$
\begin{aligned}
E_n(\theta) &= -\ln p(x_n|\theta) \\
E(\theta) &= \sum_{n=1}^{N} E_n(\theta) = -\sum_{n=1}^{N} \ln p(x_n|\theta) \\
\frac{\partial E(\theta)}{\partial \mu_j}
&= -\sum_{n=1}^{N} \frac{\partial p(x_n|\theta)/\partial \mu_j}{p(x_n|\theta)}
 = -\sum_{n=1}^{N} \frac{p(j)\,\partial p(x_n|\theta_j)/\partial \mu_j}{\sum_{k=1}^{M} p(x_n|\theta_k)\,p(k)} \\
&= -\sum_{n=1}^{N} \frac{p(j)\,\Sigma^{-1}(x_n-\mu_j)\,p(x_n|\theta_j)}{\sum_{k=1}^{M} p(x_n|\theta_k)\,p(k)}
 = -\Sigma^{-1}\sum_{n=1}^{N} (x_n-\mu_j)\,\frac{p(j)\,p(x_n|\theta_j)}{\sum_{k=1}^{M} p(x_n|\theta_k)\,p(k)}
\end{aligned}
$$

Define the responsibility

$$
\gamma_j(x_n) = \frac{p(j)\,p(x_n|\theta_j)}{\sum_{k=1}^{M} p(x_n|\theta_k)\,p(k)}
$$

Setting the derivative to zero then gives

$$
\mu_j = \frac{\sum_{n=1}^{N} x_n\,\gamma_j(x_n)}{\sum_{n=1}^{N} \gamma_j(x_n)}
$$

Problem with estimating μ_j:
μ_j depends on γ_j(x_n), and γ_j(x_n) in turn depends on μ_j, so there is no analytical (closed-form) solution.
$$
\gamma_J(x_n) = \frac{p(J)\,p(x_n|\theta_J)}{\sum_{k=1}^{M} p(x_n|\theta_k)\,p(k)}
= \frac{p(x_n|j=J,\theta)\,p(J)}{p(x_n|\theta)}
= \frac{p(x_n, j=J|\theta)}{p(x_n|\theta)}
= p(j=J|x_n,\theta)
$$
Thus γ_j(x_n) represents the "responsibility" of component j for the mixture density given x_n. If we can estimate γ_j(x_n), we can obtain μ_j; K-Means clustering is helpful for this.

3. K-Means Clustering

K-Means clustering assigns each data point to one of K clusters according to its distance to the mean of each cluster.

3.1 steps

step1: Initialization: pick K arbitrary centroids (cluster means)

step2: Assign each sample to the closest centroid.

step3: Adjust the centroids to be the means of the samples assigned to them.

step4: Go to step 2 until the centroids no longer change in step 3.


[figure 2: the process of K-Means clustering (K = 2)]

3.2 Objective function

K-Means optimizes the following objective function:

$$
L = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\,\|x_n - \mu_k\|^2,
\qquad
r_{nk} =
\begin{cases}
1, & k = \arg\min_{k'} \|x_n - \mu_{k'}\|^2 \\
0, & \text{otherwise}
\end{cases}
$$

r_nk is an indicator variable that checks whether μ_k is the nearest cluster center to point x_n.
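The following is a minimal NumPy sketch of the four steps from section 3.1 that also reports this objective L. The function name, arguments, and initialization strategy are illustrative choices, not a reference implementation.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain K-Means: X is an (N, D) data matrix, K the number of clusters."""
    rng = np.random.default_rng(seed)
    # step 1: pick K arbitrary centroids (here: K random data points)
    centroids = X[rng.choice(len(X), K, replace=False)]
    for _ in range(n_iter):
        # step 2: assign each sample to the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)  # (N, K)
        assign = dists.argmin(axis=1)
        # step 3: move each centroid to the mean of the samples assigned to it
        new_centroids = np.array([
            X[assign == k].mean(axis=0) if np.any(assign == k) else centroids[k]
            for k in range(K)
        ])
        # step 4: stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    L = np.sum((X - centroids[assign]) ** 2)  # within-cluster squared error
    return centroids, assign, L
```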

3.3 Advantages and Disadvantages

Advantages:

  • simple and fast to compute
  • converges to a local minimum of the within-cluster squared error

Disadvantages:

  • sensitive to initialization
  • sensitive to outliers
  • difficult to set K properly
  • only detects spherical clusters

[figure 3: the problem of K-Means clustering (K = 2)]

4. EM Algorithm

Once we have used K-Means clustering to obtain the mean of each cluster, we have θ_j = (μ_j, Σ_j) and can estimate the "responsibility" γ_j(x_n) of component j for the mixture density.

4.1 K-Means Clustering Revisited

step1: Initialization: pick K arbitrary centroids [compute θ_j^{(0)} = (μ_j^{(0)}, Σ_j^{(0)})]

step2: Assign each sample to the closest centroid. [compute γ_j(x_n) -- E-step]

step3: Adjust the centroids to be the means of the samples assigned to them. [compute θ_j^{(τ)} = (μ_j^{(τ)}, Σ_j^{(τ)}) -- M-step]

step4: Go to step 2 (until no change)

The process is almost the same as K-Means clustering, but in K-Means each point is hard-assigned to exactly one cluster; there is no soft assignment like γ_j(x_n).

4.2 E-step & M-step

E-step: softly assign samples to mixture components

$$
\gamma_j(x_n) = \frac{p(j)\,p(x_n|\theta_j)}{\sum_{k=1}^{M} p(x_n|\theta_k)\,p(k)},
\qquad \forall j = 1 \ldots K,\ \forall n = 1 \ldots N
$$

M-step: re-estimate the parameters (separately for each mixture component) based on the soft assignments.

$$
\begin{aligned}
\hat{N_j} &= \sum_{n=1}^{N} \gamma_j(x_n) \\
\hat{p(j)} &= \frac{\hat{N_j}}{N} \\
\hat{\mu_j^{new}} &= \frac{\sum_{n=1}^{N} \gamma_j(x_n)\,x_n}{\sum_{n=1}^{N} \gamma_j(x_n)} \\
\hat{\Sigma_j^{new}} &= \frac{1}{\hat{N_j}} \sum_{n=1}^{N} \gamma_j(x_n)\,(x_n - \hat{\mu_j^{new}})(x_n - \hat{\mu_j^{new}})^T
\end{aligned}
$$
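Putting the E-step and M-step together, here is a minimal NumPy/SciPy sketch of EM for a mixture of Gaussians. The function em_gmm and its crude random initialization are assumptions for illustration; a practical implementation would add the regularization and k-Means initialization discussed in section 4.4 below.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    """EM for a K-component Gaussian mixture; X is an (N, D) data matrix."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    # crude initialization: random means, identity covariances, uniform weights
    mu = X[rng.choice(N, K, replace=False)]
    sigma = np.array([np.eye(D) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: gamma_j(x_n) = p(j) p(x_n|theta_j) / sum_k p(k) p(x_n|theta_k)
        gamma = np.stack([pi[j] * multivariate_normal.pdf(X, mean=mu[j], cov=sigma[j])
                          for j in range(K)], axis=1)          # (N, K)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the soft assignments
        Nj = gamma.sum(axis=0)                                  # N_j_hat
        pi = Nj / N                                             # p(j)_hat
        mu = (gamma.T @ X) / Nj[:, None]                        # mu_j_new_hat
        for j in range(K):
            diff = X - mu[j]
            sigma[j] = (gamma[:, j, None] * diff).T @ diff / Nj[j]  # Sigma_j_new_hat
    return pi, mu, sigma, gamma
```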

4.3 Advantages
  • Very general, can represent any (continuous) distribution.
  • Once trained, very fast to evaluate.
  • Can be updated online.
4.4 Caveats
  1. Introduce regularization
    Instead of Σ^{-1}, use (Σ + σI)^{-1}, to avoid a collapsing covariance (Σ → 0) causing p(x_n|θ_j) to go to infinity.
  2. Initialize with k-Means to get better results (a rough sketch follows after this list)
    Typical steps:
    Run k-Means M times (e.g. M = 10~100)
    Pick the best result (lowest error J)
    Use this result to initialize EM
  3. EM for MoG is computationally expensive.
  4. The number of mixture components K must be selected properly (a model selection problem).
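As a rough illustration of caveats 1 and 2, the snippet below assumes the kmeans and em_gmm sketches above and hypothetical variables X, K, D, and sigma already defined; sigma_reg is an illustrative regularization constant, not a value from the post.

```python
# Caveat 1: regularize each covariance so it cannot collapse to a singular matrix,
# i.e. effectively use (Sigma + sigma*I)^{-1} instead of Sigma^{-1}.
sigma_reg = 1e-6
for j in range(K):
    sigma[j] += sigma_reg * np.eye(D)

# Caveat 2: run k-Means several times and keep the best result (lowest L)
# to initialize EM with its centroids.
best_centroids, _, best_L = min((kmeans(X, K, seed=s) for s in range(10)),
                                key=lambda result: result[2])
init_means = best_centroids  # pass these to the EM routine as initial mu
```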