ML Note 3 - Unsupervised Learning

In unsupervised learning, the input to the model no longer includes the desired outputs of the training examples. The training set is therefore defined as
$$S = \{x_1, x_2, \dots, x_m\}$$

All other definitions in this note are the same as in the supervised setting.

In this set of notes, the problems are essentially finding clumps and compressing data. For finding clumps, we can use k-means to locate the cluster centers directly, or use EM to model the distribution of each clump. For data compression, we can use factor analysis to estimate the probability of a data point lying in a specific subspace, or we can find the subspace directly with PCA.

The k-means Clustering Algorithm

Suppose the sample space contains $k$ clusters. The goal of the algorithm is to find the cluster centroids
$$\mu_1, \mu_2, \dots, \mu_k \in \mathbb{R}^n$$

Let $c_i$ denote the index of the cluster that sample $x_i$ belongs to, and define the distortion function
$$J(c,\mu) = \sum\limits_{i=1}^m \|x_i - \mu_{c_i}\|^2$$

Applying coordinate descent to $J$ yields the following algorithm
$$\begin{aligned}
& \text{initialize } \mu_1,\dots,\mu_k\\
& \text{repeat until stable } \{\\
& \qquad\text{for i in 1...m} \\
& \qquad\qquad c_i := \arg\min_j \|x_i-\mu_j\|^2\\
& \qquad\text{for j in 1...k} \\
& \qquad\qquad C_j := \{l \mid c_l = j\}\\
& \qquad\qquad \mu_j := \sum_{i \in C_j} x_i / |C_j| \\
& \}
\end{aligned}$$
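
A minimal NumPy sketch of this loop (assuming `X` is an $m \times n$ data matrix and the centroids `mu` have already been initialized) could look like:

```python
import numpy as np

def kmeans(X, mu, max_iter=100):
    """Coordinate descent on the distortion J(c, mu).
    X: (m, n) data matrix, mu: (k, n) initial centroids."""
    for _ in range(max_iter):
        # Assignment step: c_i = argmin_j ||x_i - mu_j||^2
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (m, k)
        c = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its cluster
        new_mu = np.stack([X[c == j].mean(axis=0) if np.any(c == j) else mu[j]
                           for j in range(len(mu))])
        if np.allclose(new_mu, mu):   # "repeat until stable"
            return c, new_mu
        mu = new_mu
    return c, mu
```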

Because the distortion function $J$ is non-convex, the way $\mu$ is initialized affects how the algorithm converges. A common strategy is to pick samples that are far apart from each other as the initial values

  1. Pick the two samples in the sample space that are farthest apart as the first two initial values
  2. While the number of initial values is less than $K$, add the sample $\arg\max\limits_i \min\{\lVert x_i - x_j \rVert \mid x_j \text{ already selected}\}$, i.e. the sample farthest from the ones already chosen (the distances can be computed from the pairwise inner products $\langle x_i, x_j \rangle$)

This strategy helps reduce the number of iterations, but it relies on the full matrix of pairwise inner products (the kernel matrix), so it is not suitable when $m$ is very large. A simpler alternative is to pick $K$ distinct samples uniformly at random as the initial values. Both strategies are sketched below.
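
The farthest-point variant below is a simplified sketch: it starts from an arbitrary sample and works with Euclidean distances directly rather than a precomputed kernel matrix.

```python
import numpy as np

def init_random(X, k, seed=0):
    # Pick k distinct samples uniformly at random.
    rng = np.random.default_rng(seed)
    return X[rng.choice(len(X), size=k, replace=False)]

def init_farthest(X, k):
    # Greedy farthest-point traversal: repeatedly add the sample whose
    # minimum distance to the already selected samples is largest.
    chosen = [0]
    for _ in range(k - 1):
        d = ((X[:, None, :] - X[chosen][None, :, :]) ** 2).sum(axis=2).min(axis=1)
        chosen.append(int(d.argmax()))
    return X[chosen]
```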

The EM Algorithm

Suppose there exist some latent r.v.s $z^{(1)}, \dots, z^{(m)}$ and we wish to fit the parameters $\theta$ of a model $p(x,z)$ to the training set.

The algorithm can be stated as follows (proof in the Formula Proof section below)
$$\begin{aligned}
& \text{repeat } \{\\
& \qquad\text{E-step: for i in 1...m} \\
& \qquad\qquad Q_i(z^{(i)}) := p(z^{(i)}|x^{(i)};\theta)\\
& \qquad\text{M-step:} \\
& \qquad\qquad\theta := \arg\max_\theta \sum\limits_i \sum\limits_j Q_i(j) \log \frac{p(x^{(i)},z^{(i)}=j;\theta)}{Q_i(j)}\\
& \}
\end{aligned}$$
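
The loop translates almost literally into code once a concrete E-step and M-step are supplied; the function below is a generic skeleton of my own, not a standard API:

```python
import numpy as np

def em(theta, e_step, m_step, X, max_iter=100, tol=1e-6):
    """Generic EM loop (sketch). theta is a tuple of parameter arrays,
    e_step(X, theta) returns the posteriors Q_i(z) = p(z | x_i; theta),
    and m_step(X, Q) re-estimates theta by maximizing the lower bound."""
    for _ in range(max_iter):
        Q = e_step(X, theta)              # E-step
        new_theta = m_step(X, Q)          # M-step
        if all(np.max(np.abs(a - b)) < tol for a, b in zip(new_theta, theta)):
            return new_theta
        theta = new_theta
    return theta
```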

Mixture of Gaussians

Suppose there are $k$ Gaussians and each training example $x^{(i)}$ belongs to one of them. Let $z^{(i)}$ denote the class that $x^{(i)}$ belongs to
$$\begin{array}{rcl}
z^{(i)} &\sim& \mathrm{Multinomial}_k(\phi)\\
x^{(i)}\,|\,z^{(i)} = j &\sim& N(\mu_j, \Sigma_j)
\end{array}$$

We wish to model the data by specifying $p(x^{(i)}, z^{(i)})$.
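
For intuition, the generative process can be simulated directly; the values of $k$, $\phi$, $\mu$ and $\Sigma$ below are arbitrary, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, m = 3, 2, 500
phi = np.array([0.5, 0.3, 0.2])                   # mixing proportions
mu = np.array([[0., 0.], [4., 4.], [-4., 4.]])    # component means
Sigma = np.stack([np.eye(n)] * k)                 # component covariances

z = rng.choice(k, size=m, p=phi)                  # z^(i) ~ Multinomial(phi)
X = np.stack([rng.multivariate_normal(mu[j], Sigma[j]) for j in z])  # x^(i) | z^(i)=j ~ N(mu_j, Sigma_j)
```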

The algorithm can be stated as follows (proof in the Formula Proof section below)
$$\begin{aligned}
& \text{repeat } \{\\
& \qquad\text{E-step: for i in 1...m, for j in 1...k} \\
& \qquad\qquad w_j^{(i)} := \frac{\phi_j p(x^{(i)}|z^{(i)}=j;\mu,\Sigma)}{\sum_{l=1}^k \phi_l p(x^{(i)}|z^{(i)}=l;\mu,\Sigma)}\\
& \qquad\text{M-step: for j in 1...k} \\
& \qquad\qquad\phi_j := \frac{1}{m}\sum_{i=1}^m w_j^{(i)}\\
& \qquad\qquad\mu_j := \frac{\sum_{i=1}^m w_j^{(i)}x^{(i)}}{\sum_{i=1}^m w_j^{(i)}}\\
& \qquad\qquad\Sigma_j := \frac{\sum_{i=1}^m w_j^{(i)}(x^{(i)}-\mu_j)(x^{(i)}-\mu_j)^T}{\sum_{i=1}^m w_j^{(i)}}\\
& \}
\end{aligned}$$
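
A compact NumPy/SciPy version of these updates might look like the sketch below; the random initialization and the small ridge added to the covariances for numerical stability are my own choices, not part of the algorithm as stated:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, k, n_iter=100, seed=0):
    m, n = X.shape
    rng = np.random.default_rng(seed)
    phi = np.full(k, 1.0 / k)
    mu = X[rng.choice(m, size=k, replace=False)]       # initialize means at random samples
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(n)] * k)
    for _ in range(n_iter):
        # E-step: w_j^(i) = p(z^(i) = j | x^(i); phi, mu, Sigma)
        w = np.stack([phi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                      for j in range(k)], axis=1)       # (m, k)
        w /= w.sum(axis=1, keepdims=True)
        # M-step
        Nj = w.sum(axis=0)                              # effective number of points per component
        phi = Nj / m
        mu = (w.T @ X) / Nj[:, None]
        for j in range(k):
            diff = X - mu[j]
            Sigma[j] = (w[:, j, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(n)
    return phi, mu, Sigma, w
```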

Factor Analysis

Consider the problem in which $n \gg m$. In such a setting, we assume that the data are generated by some latent r.v. $z$
$$x = \mu + \Lambda z + \epsilon$$

where
$$\begin{array}{rcccl}
\mathbb{R}^k &\ni& z &\sim& N(\vec 0, I)\\
\mathbb{R}^n &\ni& \epsilon &\sim& N(\vec 0, \Psi)
\end{array}$$

The value of $k$ is usually chosen to be smaller than $n$. The parameters of our model are

  • vector $\mu \in \mathbb{R}^n$
  • matrix $\Lambda \in \mathbb{R}^{n\times k}$
  • diagonal matrix $\Psi \in \mathbb{R}^{n\times n}$

Since
$$\left[\begin{array}{c}z\\x\end{array}\right] \sim N\left(\left[\begin{array}{c}\vec 0\\\mu\end{array}\right], \left[\begin{array}{cc}I & \Lambda^T\\\Lambda & \Lambda\Lambda^T+\Psi\end{array}\right]\right)$$

the parameters of $z^{(i)}|x^{(i)} \sim N(\mu_{z^{(i)}|x^{(i)}}, \Sigma_{z^{(i)}|x^{(i)}})$ are
$$\begin{array}{rcl}
\mu_{z^{(i)}|x^{(i)}} &=& \Lambda^T (\Lambda\Lambda^T+\Psi)^{-1}(x^{(i)}-\mu)\\
\Sigma_{z^{(i)}|x^{(i)}} &=& I - \Lambda^T(\Lambda\Lambda^T+\Psi)^{-1}\Lambda
\end{array}$$

To apply the EM algorithm, the updates are as follows (proof in the Formula Proof section below)
$$\begin{array}{rcl}
Q_i(z^{(i)}) &:=& p(z^{(i)}|x^{(i)};\mu,\Lambda,\Psi)\\
\Lambda &:=& \Big(\sum\limits_{i=1}^m(x^{(i)}-\mu)E[z^{(i)}]^T\Big)\Big(\sum\limits_{i=1}^m E[z^{(i)}(z^{(i)})^T]\Big)^{-1}\\
\mu &:=& \frac{1}{m}\sum\limits_{i=1}^m x^{(i)}\\
\Phi &:=& \frac{1}{m}\sum\limits_{i=1}^m E\Big[(x^{(i)}-\mu-\Lambda z^{(i)})(x^{(i)}-\mu-\Lambda z^{(i)})^T\Big]
\end{array}$$

where
$$\begin{array}{rcl}
E[z^{(i)}] &=& \mu_{z^{(i)}|x^{(i)}}\\
E[z^{(i)}(z^{(i)})^T] &=& \mu_{z^{(i)}|x^{(i)}}\mu^T_{z^{(i)}|x^{(i)}} + \Sigma_{z^{(i)}|x^{(i)}}\\
\Psi_{ii} &=& \Phi_{ii}
\end{array}$$
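
Putting the posterior parameters and the updates together gives the following sketch; here $\mu$ is fixed to the sample mean and $\Psi$ is stored as a vector holding its diagonal, which are implementation choices rather than part of the model:

```python
import numpy as np

def factor_analysis_em(X, k, n_iter=100, seed=0):
    m, n = X.shape
    rng = np.random.default_rng(seed)
    mu = X.mean(axis=0)                        # mu := (1/m) sum_i x^(i)
    Lam = rng.standard_normal((n, k))
    Psi = X.var(axis=0)                        # diagonal of Psi
    Xc = X - mu
    for _ in range(n_iter):
        # E-step: posterior of z^(i) given x^(i)
        G = Lam @ Lam.T + np.diag(Psi)         # Lambda Lambda^T + Psi
        Ginv = np.linalg.inv(G)
        mu_z = Xc @ Ginv @ Lam                 # (m, k), rows are mu_{z|x}
        Sigma_z = np.eye(k) - Lam.T @ Ginv @ Lam
        Ezz = m * Sigma_z + mu_z.T @ mu_z      # sum_i E[z^(i) (z^(i))^T]
        # M-step
        Lam = (Xc.T @ mu_z) @ np.linalg.inv(Ezz)
        resid = Xc - mu_z @ Lam.T              # x^(i) - mu - Lambda E[z^(i)]
        Phi = resid.T @ resid / m + Lam @ Sigma_z @ Lam.T
        Psi = np.diag(Phi)                     # keep only the diagonal entries
    return mu, Lam, Psi
```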

ICA

Independent components analysis finds a new basis in which to represent our data. Suppose some data $s \in \mathbb{R}^n$ is generated by $n$ independent sources, and we only observe a mixture of them
$$x = As$$

where $A$ is the mixing matrix.

Our goal is to find the unmixing matrix $W = A^{-1}$ so that we can recover $s^{(i)}$ from $x^{(i)}$. Write
$$W = \left[\begin{array}{c} w_1\\ \vdots\\ w_n \end{array}\right]$$

where $w_i \in \mathbb{R}^n$ and the $j^{th}$ source can be recovered by
$$s_j^{(i)} = w_j^T x^{(i)}$$

Assume that the sources are i.i.d. and follow the logistic distribution
$$p_s(s_i) = g'(s_i)$$

where $g(z) = \frac{1}{1+e^{-z}}$. Then the joint distribution is
$$p(s) = \prod\limits_{i=1}^n p_s(s_i)$$

By the transformation $s = Wx$,
$$p(x) = \prod\limits_{i=1}^n p_s(w_i^T x) \cdot |W|$$

Using maximum likelihood
$$l(W) = \sum\limits_{i=1}^m \left(\sum\limits_{j=1}^n \log g'(w_j^T x^{(i)}) + \log|W|\right)$$

we can derive the update rule for stochastic gradient ascent
$$W := W + \alpha\left(\left[\begin{array}{c} 1-2g(w_1^T x^{(i)})\\ 1-2g(w_2^T x^{(i)})\\ \vdots\\ 1-2g(w_n^T x^{(i)}) \end{array}\right](x^{(i)})^T + (W^T)^{-1}\right)$$
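
A stochastic gradient ascent sketch of this update rule (the learning rate, number of passes, and identity initialization of `W` are arbitrary choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ica(X, alpha=0.01, n_epochs=50, seed=0):
    """Maximum-likelihood ICA with the logistic source prior.
    X: (m, n) matrix whose rows are the mixed observations x^(i)."""
    m, n = X.shape
    rng = np.random.default_rng(seed)
    W = np.eye(n)
    for _ in range(n_epochs):
        for i in rng.permutation(m):
            x = X[i]
            g = sigmoid(W @ x)                 # g(w_j^T x^(i)) for every row w_j
            W += alpha * (np.outer(1 - 2 * g, x) + np.linalg.inv(W.T))
    return W                                   # recovered sources: S = X @ W.T
```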

Formula Proof

EM Algorithm

As before, we derive the log-likelihood
$$\begin{array}{rcl}
l(\theta) &=& \sum\limits_{i=1}^m \log p(x^{(i)};\theta)\\
&=& \sum\limits_{i=1}^m \log \sum\limits_j p(x^{(i)},z^{(i)} = j;\theta)
\end{array}$$

Assume that each $z^{(i)}$ has some distribution $Q_i$. Since
$$\begin{array}{rcl}
E_{z^{(i)}}\Big[\frac{p(x^{(i)},z^{(i)};\theta)}{Q_i(z^{(i)})}\Big] &=& \sum\limits_j Q_i(j)\frac{p(x^{(i)},z^{(i)} = j;\theta)}{Q_i(j)}\\
&=& \sum\limits_j p(x^{(i)},z^{(i)} = j;\theta)
\end{array}$$

By Jensen’s inequality
$$\begin{array}{rcl}
\log E_{z^{(i)}} \Big[\frac{p(x^{(i)},z^{(i)};\theta)}{Q_i(z^{(i)})}\Big] &\ge& E_{z^{(i)}} \Big[\log\frac{p(x^{(i)},z^{(i)};\theta)}{Q_i(z^{(i)})}\Big]\\
&=& \sum\limits_j Q_i(j)\log\frac{p(x^{(i)},z^{(i)}=j;\theta)}{Q_i(j)}
\end{array}$$

Therefore
$$l(\theta) \ge \sum\limits_i\sum\limits_j Q_i(j)\log\frac{p(x^{(i)},z^{(i)}=j;\theta)}{Q_i(j)}$$

Define the lower bound
$$J(Q, \theta) = \sum\limits_i\sum\limits_j Q_i(j)\log\frac{p(x^{(i)},z^{(i)}=j;\theta)}{Q_i(j)}$$

We can apply coordinate ascent to maximize $J$. With respect to $Q$, the bound is maximized when it is tight, i.e. when $l(\theta) = J(Q,\theta)$; Jensen's inequality holds with equality only if the random variable inside the expectation is constant, that is
$$\frac{p(x^{(i)},z^{(i)};\theta)}{Q_i(z^{(i)})} = c$$

where $c$ does not depend on $z^{(i)}$. Since
$$\sum\limits_z Q_i(z) = 1$$

we have that
$$c = \sum\limits_z p(x^{(i)},z;\theta) = p(x^{(i)};\theta)$$

Therefore
$$Q_i(z^{(i)}) = \frac{p(x^{(i)},z^{(i)};\theta)}{p(x^{(i)};\theta)} = p(z^{(i)}|x^{(i)};\theta)$$
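
The two facts used here, that $J(Q,\theta)$ lower-bounds $l(\theta)$ for any $Q_i$ and that the bound is tight when $Q_i$ is the posterior, can be checked numerically on a toy discrete joint distribution (the numbers below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
p_xz = rng.random((4, 3))          # a toy joint p(x, z) over 4 x-values and 3 z-values
p_xz /= p_xz.sum()
p_x = p_xz.sum(axis=1)

x = 2                              # one observed value of x
log_px = np.log(p_x[x])            # l = log p(x) = log sum_z p(x, z)

def lower_bound(Q):
    # J = sum_z Q(z) log( p(x, z) / Q(z) )
    return np.sum(Q * np.log(p_xz[x] / Q))

Q_arbitrary = rng.dirichlet(np.ones(3))   # any distribution over z gives a lower bound
Q_posterior = p_xz[x] / p_x[x]            # Q(z) = p(z | x) makes the bound tight

assert lower_bound(Q_arbitrary) <= log_px + 1e-12
assert np.isclose(lower_bound(Q_posterior), log_px)
```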

Mixture of Gaussians

The E-step is easy
$$\begin{array}{rcl}
w_j^{(i)} &=& Q_i(z^{(i)} = j)\\
&=& P(z^{(i)}=j \mid x^{(i)};\phi,\mu,\Sigma)\\
&=& \frac{\phi_j p(x^{(i)}|z^{(i)}=j;\mu,\Sigma)}{\sum_{l=1}^k \phi_l p(x^{(i)}|z^{(i)}=l;\mu,\Sigma)}
\end{array}$$

In the M-step
$$J(Q,\theta) = \sum\limits_{i=1}^m \sum\limits_{j=1}^k w_j^{(i)}\log\frac{\phi_j\exp\Big(-\frac{1}{2}(x^{(i)}-\mu_j)^T\Sigma_j^{-1}(x^{(i)}-\mu_j)\Big)}{w_j^{(i)}(2\pi)^{n/2}|\Sigma_j|^{1/2}}$$

To maximize $J$ w.r.t. $\mu_j$
$$\begin{aligned}
& \because & \nabla_{\mu_j} J &= \sum\limits_{i=1}^m w_j^{(i)} \Sigma_j^{-1} (x^{(i)}-\mu_j) \\
& \therefore & \mu_j &:= \frac{\sum_{i=1}^m w_j^{(i)}x^{(i)}}{\sum_{i=1}^m w_j^{(i)}}
\end{aligned}$$

Similarly, we have that
$$\Sigma_j := \frac{\sum_{i=1}^m w_j^{(i)}(x^{(i)}-\mu_j)(x^{(i)}-\mu_j)^T}{\sum_{i=1}^m w_j^{(i)}}$$

For the parameters $\phi_j$
$$J(Q,\theta) = \sum\limits_{i=1}^m \sum\limits_{j=1}^k w_j^{(i)}\log\phi_j + c$$

where $c$ does not depend on $\phi_j$. The problem can be stated as
$$\begin{array}{rl}
\max\limits_{\phi} & \sum\limits_{i=1}^m \sum\limits_{j=1}^k w_j^{(i)}\log\phi_j\\
\text{s.t.} & \sum\limits_{j=1}^k\phi_j = 1
\end{array}$$

Forming the Lagrangian
$$L(\phi) = \sum\limits_{i=1}^m \sum\limits_{j=1}^k w_j^{(i)}\log\phi_j + \beta\Big(\sum\limits_{j=1}^k\phi_j - 1\Big)$$

we have that
$$\frac{\partial}{\partial\phi_j}L = \sum\limits_{i=1}^m\frac{w_j^{(i)}}{\phi_j} + \beta$$

Setting the derivative to zero gives
$$\phi_j \propto \sum_{i=1}^m w_j^{(i)}$$

Combining this with the constraint $\sum_{j=1}^k \phi_j = 1$ and the fact that $\sum_{j=1}^k w_j^{(i)} = 1$ for every $i$,
$$\phi_j := \frac{1}{m}\sum_{i=1}^m w_j^{(i)}$$

Factor Analysis

For the M-step, the problem is to
$$\max\limits_{\mu,\Lambda,\Psi} \sum\limits_{i=1}^m \int_{z^{(i)}}Q_i(z^{(i)}) \log\frac{p(x^{(i)},z^{(i)};\mu,\Lambda,\Psi)}{Q_i(z^{(i)})}\,dz^{(i)}$$

The objective function can be written as
$$\begin{aligned}
J(\mu,\Lambda,\Psi) &= \sum\limits_{i=1}^m E\Big[\log\frac{p(x^{(i)},z^{(i)};\mu,\Lambda,\Psi)}{Q_i(z^{(i)})}\Big]\\
&= \sum\limits_{i=1}^m E\Big[\log p(x^{(i)}|z^{(i)};\mu,\Lambda,\Psi) + \log p(z^{(i)}) - \log Q_i(z^{(i)})\Big]
\end{aligned}$$

Dropping the terms that do not depend on the parameters,
$$\begin{aligned}
J(\mu,\Lambda,\Psi) &\equiv \sum\limits_{i=1}^m E\Big[\log p(x^{(i)}|z^{(i)};\mu,\Lambda,\Psi)\Big]\\
&\equiv -\frac{1}{2}\sum\limits_{i=1}^m E\Big[\log|\Psi| + (x^{(i)}-\mu-\Lambda z^{(i)})^T\Psi^{-1}(x^{(i)}-\mu-\Lambda z^{(i)})\Big]
\end{aligned}$$

Taking derivatives,
$$\begin{array}{rcl}
\nabla_\Lambda J &=& \Psi^{-1} \sum\limits_{i=1}^m E\Big[(x^{(i)}-\mu-\Lambda z^{(i)})(z^{(i)})^T \Big]\\
\nabla_\mu J &=& (\Lambda\Lambda^T+\Psi)^{-1}\sum\limits_{i=1}^m(x^{(i)}-\mu)\\
\nabla_\Psi J &=& -\frac{1}{2}\sum\limits_{i=1}^m E\Big[\Psi^{-1}-\Psi^{-1}(x^{(i)}-\mu-\Lambda z^{(i)})(x^{(i)}-\mu-\Lambda z^{(i)})^T\Psi^{-1}\Big]
\end{array}$$

Setting these gradients to zero yields the M-step updates stated above.
