In unsupervised learning, the model's input no longer includes the desired output for each training example. The definition of the training set therefore becomes
$$S = \{x_1, x_2, \dots, x_m\}$$
All other definitions in this note are the same as in the supervised setting.
The problems in this set of notes are essentially clustering and data compression. For clustering, we can use k-means to find the cluster centers directly, or use EM to model the distribution of each cluster. For data compression, we can use factor analysis to model the data as lying near a low-dimensional subspace probabilistically, or find that subspace directly with PCA.
The k-means Clustering Algorithm
Suppose the sample space contains $k$ clusters. The goal of the algorithm is to find the cluster centroids

$$\mu_1,\mu_2,\dots,\mu_k\in\mathbb{R}^n$$
Let $c_i$ denote the index of the cluster to which sample $x_i$ belongs, and define the distortion function
$$J(c,\mu) = \sum\limits_{i=1}^m||x_i - \mu_{c_i}||^2$$
Applying coordinate descent to $J$ yields the following algorithm:
$$
\begin{aligned}
& \text{initialize } \mu_1,\dots,\mu_k\\
& \text{repeat until stable } \{\\
& \qquad\text{for i in 1...m } \\
& \qquad\qquad c_i := \arg\min_j ||x_i-\mu_j||^2\\
& \qquad\text{for j in 1...k } \\
& \qquad\qquad C_j = \{l \mid c_l = j\}\\
& \qquad\qquad\mu_j := \sum_{i \in C_j} x_i / |C_j| \\
& \}
\end{aligned}
$$
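A minimal NumPy sketch of these two alternating steps is given below, with the initial centroids passed in (initialization is discussed next). The function name `kmeans`, the tolerance, and the iteration cap are my own choices rather than anything fixed in the notes.

```python
import numpy as np

def kmeans(X, k, mu_init, tol=1e-6, max_iter=100):
    """Plain k-means: X is (m, n), mu_init is (k, n) initial centroids."""
    mu = mu_init.copy()
    for _ in range(max_iter):
        # Assignment step: c_i = argmin_j ||x_i - mu_j||^2
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (m, k)
        c = dists.argmin(axis=1)
        # Update step: mu_j = mean of the samples assigned to cluster j
        new_mu = np.array([X[c == j].mean(axis=0) if np.any(c == j) else mu[j]
                           for j in range(k)])
        if np.linalg.norm(new_mu - mu) < tol:  # "repeat until stable"
            mu = new_mu
            break
        mu = new_mu
    return c, mu
```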
Because the loss function $J$ is non-convex, how $\mu$ is initialized affects the convergence of the algorithm. A common approach is to pick samples that are as far apart as possible as the initial centroids:
- pick the two samples in the sample space that are farthest apart as the first two initial centroids;
- while fewer than $K$ initial centroids have been chosen, add the sample $x_i$ with $i = \arg\max\limits_i \min\{ \langle x_i, x_j \rangle \mid j = 1, 2, \dots, m\}$.

This method helps reduce the number of iterations, but it relies on the kernel (inner-product) matrix, so it is not suitable when $m$ is very large. A simpler alternative is to pick $K$ distinct samples at random as the initial centroids.
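The notes phrase the selection rule in terms of inner products; the sketch below instead uses the more common Euclidean maximin (farthest-point) rule, maximizing the minimum distance to the centroids chosen so far. This is my substitution and only a rough analogue of the rule above; it also illustrates why the method needs the full pairwise matrix.

```python
import numpy as np

def farthest_point_init(X, k):
    """Farthest-point (maximin) initialization: start from the two samples that
    are farthest apart, then repeatedly add the sample whose minimum distance
    to the already-chosen centroids is largest."""
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)  # pairwise squared distances, O(m^2)
    i, j = np.unravel_index(d.argmax(), d.shape)             # farthest pair
    chosen = [int(i), int(j)]
    while len(chosen) < k:
        min_dist = d[:, chosen].min(axis=1)                  # distance to the nearest chosen centroid
        chosen.append(int(min_dist.argmax()))
    return X[chosen]
```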
The EM Algorithm
Suppose there exist some latent r.v.s $z^{(1)}, \dots, z^{(m)}$ and we wish to fit the parameters $\theta$ of a model $p(x,z)$ to the training set.
The algorithm can be stated as follows (proof below):
$$
\begin{aligned}
& \text{repeat } \{\\
& \qquad\text{E-step: for i in 1...m } \\
& \qquad\qquad Q_i(z^{(i)}) := p(z^{(i)}|x^{(i)};\theta)\\
& \qquad\text{M-step: } \\
& \qquad\qquad\theta := \arg\max_\theta \sum\limits_i \sum\limits_j Q_i(j) \log \frac{p(x^{(i)},z^{(i)}=j;\theta)}{Q_i(j)}\\
& \}
\end{aligned}
$$
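Viewed as code, the alternation is just two callables. The skeleton below is a sketch of mine, with placeholder names `e_step` and `m_step` that are not defined anywhere in the notes:

```python
def em(X, theta, e_step, m_step, n_iter=100):
    """Generic EM skeleton.

    e_step(X, theta) -> Q, the per-example posteriors p(z | x; theta)
    m_step(X, Q)     -> theta maximizing sum_i sum_j Q[i, j] * log(p(x_i, z=j; theta) / Q[i, j])
    """
    for _ in range(n_iter):
        Q = e_step(X, theta)   # E-step: tighten the lower bound at the current theta
        theta = m_step(X, Q)   # M-step: maximize the lower bound over theta
    return theta
```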
Mixture of Gaussians
Suppose there are $k$ Gaussians and each training example $x^{(i)}$ belongs to one of them. Let $z^{(i)}$ denote the class that $x^{(i)}$ belongs to:
$$
\begin{array}{rcl}
z^{(i)} &\sim& Multinomial_k(\phi)\\
x^{(i)}|z^{(i)} = j &\sim& N(\mu_j, \Sigma_j)
\end{array}
$$
We wish to model the data by specifying $p(x^{(i)}, z^{(i)})$.
The algorithm can be stated as follows (proof below):
$$
\begin{aligned}
& \text{repeat } \{\\
& \qquad\text{E-step: for i in 1...m, for j in 1...k} \\
& \qquad\qquad w_j^{(i)}:=\frac{\phi_jp(x^{(i)}|z^{(i)}=j;\mu,\Sigma)}{\sum_{l=1}^k\phi_lp(x^{(i)}|z^{(i)}=l;\mu,\Sigma)}\\
& \qquad\text{M-step: for j in 1...k} \\
& \qquad\qquad\phi_j := \frac{1}{m}\sum_{i=1}^m w_j^{(i)}\\
& \qquad\qquad\mu_j := \frac{\sum_{i=1}^m w_j^{(i)}x^{(i)}}{\sum_{i=1}^mw_j^{(i)}}\\
& \qquad\qquad\Sigma_j := \frac{\sum_{i=1}^mw_j^{(i)}(x^{(i)}-\mu_j)(x^{(i)}-\mu_j)^T}{\sum_{i=1}^mw_j^{(i)}}\\
& \}
\end{aligned}
$$
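A compact NumPy sketch of one such EM pass, using `scipy.stats.multivariate_normal` for the Gaussian densities, is shown below; the function name `gmm_em_step` and its interface are my own choices.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em_step(X, phi, mu, Sigma):
    """One EM iteration for a mixture of k Gaussians.
    X: (m, n); phi: (k,); mu: (k, n); Sigma: (k, n, n)."""
    k = phi.shape[0]

    # E-step: responsibilities w[i, j] = p(z_i = j | x_i; phi, mu, Sigma)
    w = np.column_stack([phi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                         for j in range(k)])
    w /= w.sum(axis=1, keepdims=True)

    # M-step: re-estimate the parameters from the responsibilities
    phi = w.mean(axis=0)
    mu = (w.T @ X) / w.sum(axis=0)[:, None]
    Sigma = np.empty_like(Sigma)
    for j in range(k):
        d = X - mu[j]
        Sigma[j] = (w[:, j, None] * d).T @ d / w[:, j].sum()
    return phi, mu, Sigma
```

Repeating this step until the parameters (or the log-likelihood) stop changing gives the full fitting loop.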
Factor Analysis
Consider the problem in which $n \gg m$. In such a setting, we assume that the data are generated by some latent r.v. $z$:
$$x = \mu+\Lambda z + \epsilon$$
where

$$
\begin{array}{rcccl}
\mathbb{R}^k &\ni& z &\sim& N(\vec 0, I)\\
\mathbb{R}^n &\ni& \epsilon &\sim& N(\vec 0, \Psi)
\end{array}
$$
The value of $k$ is usually chosen to be smaller than $n$. The parameters of our model are
- the vector $\mu \in \mathbb{R}^n$
- the matrix $\Lambda \in \mathbb{R}^{n\times k}$
- the diagonal matrix $\Psi \in \mathbb{R}^{n\times n}$
Since

$$\left[\begin{aligned}z\\x\end{aligned}\right] \sim N\left(\left[\begin{aligned}\vec0\\\mu\end{aligned}\right], \left[\begin{array}{cc}I & \Lambda^T\\\Lambda & \Lambda\Lambda^T+\Psi\end{array}\right]\right)$$
the parameters of the conditional distribution $z^{(i)}|x^{(i)} \sim N(\mu_{z^{(i)}|x^{(i)}}, \Sigma_{z^{(i)}|x^{(i)}})$ are
$$
\begin{array}{rcl}
\mu_{z^{(i)}|x^{(i)}} &=& \Lambda^T (\Lambda\Lambda^T+\Psi)^{-1}(x^{(i)}-\mu)\\
\Sigma_{z^{(i)}|x^{(i)}} &=& I - \Lambda^T(\Lambda\Lambda^T+\Psi)^{-1}\Lambda
\end{array}
$$
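A small NumPy helper computing these two conditional parameters for one observation might look as follows; the naming is mine, and `Psi_diag` is assumed to hold the diagonal of $\Psi$ as a length-$n$ vector.

```python
import numpy as np

def fa_posterior(x, mu, Lambda, Psi_diag):
    """Posterior mean and covariance of z given one observation x.
    x, mu: (n,); Lambda: (n, k); Psi_diag: (n,) diagonal of Psi."""
    S = Lambda @ Lambda.T + np.diag(Psi_diag)   # Lambda Lambda^T + Psi
    A = Lambda.T @ np.linalg.inv(S)             # Lambda^T (Lambda Lambda^T + Psi)^{-1}
    mu_post = A @ (x - mu)
    Sigma_post = np.eye(Lambda.shape[1]) - A @ Lambda
    return mu_post, Sigma_post
```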
Applying the EM algorithm gives the following updates (proof below):
$$
\begin{array}{rcl}
Q_i(z^{(i)}) &:=& p(z^{(i)}|x^{(i)};\mu,\Lambda,\Psi)\\
\Lambda &:=& \Big(\sum\limits_{i=1}^m(x^{(i)}-\mu)E[z^{(i)}]^T\Big)\Big(\sum\limits_{i=1}^mE[z^{(i)}(z^{(i)})^T]\Big)^{-1}\\
\mu &:=& \frac{1}{m}\sum\limits_{i=1}^mx^{(i)}\\
\Phi &:=& \frac{1}{m}\sum\limits_{i=1}^mE\Big[(x^{(i)}-\mu-\Lambda z^{(i)})(x^{(i)}-\mu-\Lambda z^{(i)})^T\Big]
\end{array}
$$
where

$$
\begin{array}{rcl}
E[z^{(i)}] &=& \mu_{z^{(i)}|x^{(i)}}\\
E[z^{(i)}(z^{(i)})^T] &=& \mu_{z^{(i)}|x^{(i)}}\mu^T_{z^{(i)}|x^{(i)}} + \Sigma_{z^{(i)}|x^{(i)}}\\
\Psi_{ii} &=& \Phi_{ii}
\end{array}
$$
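Putting the E-step quantities and the M-step updates together, one iteration could be sketched as below. The function name is mine, `fa_posterior` is reused from the sketch above, and only the diagonal of $\Phi$ is kept for $\Psi$.

```python
import numpy as np

def fa_em_step(X, mu, Lambda, Psi_diag):
    """One EM iteration for factor analysis. X: (m, n); Lambda: (n, k)."""
    m, n = X.shape
    k = Lambda.shape[1]

    # E-step: E[z] and the sum of E[z z^T] under the posterior for every example
    Ez = np.empty((m, k))
    Ezz = np.zeros((k, k))
    for i in range(m):
        mu_post, Sigma_post = fa_posterior(X[i], mu, Lambda, Psi_diag)
        Ez[i] = mu_post
        Ezz += np.outer(mu_post, mu_post) + Sigma_post

    # M-step
    mu = X.mean(axis=0)
    Xc = X - mu
    Lambda = (Xc.T @ Ez) @ np.linalg.inv(Ezz)        # (sum (x-mu) E[z]^T)(sum E[z z^T])^{-1}
    # Phi = 1/m sum E[(x - mu - Lambda z)(x - mu - Lambda z)^T]; Psi keeps its diagonal
    R = Xc - Ez @ Lambda.T                           # residuals using E[z]
    Phi = (R.T @ R + Lambda @ (Ezz - Ez.T @ Ez) @ Lambda.T) / m
    Psi_diag = np.diag(Phi).copy()
    return mu, Lambda, Psi_diag
```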
ICA
Independent components analysis finds a new basis in which to represent our data. Suppose some data $s\in\mathbb{R}^n$ is generated by $n$ independent sources, and we observe only their mixture

$$x = As$$
where $A$ is the mixing matrix.
Our goal is to find the unmixing matrix $W = A^{-1}$ so as to recover $s^{(i)}$ from $x^{(i)}$. Write
$$W = \left[\begin{array}{c} w_1\\ \vdots\\ w_n \end{array}\right]$$
where $w_i \in \mathbb{R}^n$, and the $j^{th}$ source can be recovered by

$$s_j^{(i)} = w_j^Tx^{(i)}$$
Assume that the sources are i.i.d. and follow the logistic distribution

$$p_s(s_i) = g'(s_i)$$
where $g(z) = \frac{1}{1+e^{-z}}$. Then the joint distribution is
$$p(s) = \prod\limits_{i=1}^n p_s(s_i)$$
By the change of variables $s = Wx$,
$$p(x) = \prod\limits_{i=1}^n p_s(w_i^Tx) \cdot|W|$$
Using maximum likelihood
$$l(W) = \sum\limits_{i=1}^m \left(\sum\limits_{j=1}^n \log g'(w_j^Tx^{(i)}) + \log|W|\right)$$
we can derive the update rule for stochastic gradient ascent
$$W := W + \alpha\left(\left[\begin{array}{c} 1-2g(w^T_1x^{(i)})\\ 1-2g(w^T_2x^{(i)})\\ \vdots\\ 1-2g(w^T_nx^{(i)})\\ \end{array}\right](x^{(i)})^T + (W^T)^{-1}\right)$$
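A small NumPy sketch of this stochastic gradient ascent update follows; the function name `ica_sga`, the fixed learning rate, and the identity initialization of $W$ are my own choices. In practice one would typically also whiten the data and anneal $\alpha$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ica_sga(X, alpha=0.01, n_epochs=10, seed=0):
    """Stochastic gradient ascent for ICA. X: (m, n) observed mixtures.
    Returns the unmixing matrix W; sources are recovered as X @ W.T."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = np.eye(n)
    for _ in range(n_epochs):
        for i in rng.permutation(m):
            x = X[i]                     # one training example x^(i)
            g = sigmoid(W @ x)           # g(w_j^T x) for every row j
            W += alpha * (np.outer(1 - 2 * g, x) + np.linalg.inv(W.T))
    return W
```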
Formula Proof
EM Algorithm
As before, we write the log-likelihood
$$
\begin{array}{rcl}
l(\theta) &=& \sum\limits_{i=1}^m \log p(x^{(i)};\theta)\\
&=& \sum\limits_{i=1}^m \log \sum\limits_j p(x^{(i)},z^{(i)} = j;\theta)
\end{array}
$$
Assume that each $z^{(i)}$ has some distribution $Q_i$, and take expectations with respect to $z^{(i)} \sim Q_i$. Since
$$
\begin{array}{rcl}
E_{z^{(i)}}\Big[\frac{p(x^{(i)},z^{(i)};\theta)}{Q_i(z^{(i)})}\Big] &=& \sum\limits_j Q_i(j)\frac{p(x^{(i)},z^{(i)} = j;\theta)}{Q_i(j)}\\
&=& \sum\limits_j p(x^{(i)},z^{(i)} = j;\theta)
\end{array}
$$
By Jensen’s inequality
$$
\begin{array}{rcl}
\log E_{z^{(i)}} \Big[\frac{p(x^{(i)},z^{(i)};\theta)}{Q_i(z^{(i)})}\Big] &\ge& E_{z^{(i)}} \Big[\log\frac{p(x^{(i)},z^{(i)};\theta)}{Q_i(z^{(i)})}\Big]\\\\
&=& \sum\limits_j Q_i(j)\log\frac{p(x^{(i)},z^{(i)}=j;\theta)}{Q_i(j)}
\end{array}
$$
Therefore
$$l(\theta) \ge \sum\limits_i\sum\limits_j Q_i(j)\log\frac{p(x^{(i)},z^{(i)}=j;\theta)}{Q_i(j)}$$
Define the lower bound
$$J(Q, \theta) = \sum\limits_i\sum\limits_j Q_i(j)\log\frac{p(x^{(i)},z^{(i)}=j;\theta)}{Q_i(j)}$$
We can apply coordinate ascent to maximize $J$. With respect to $Q$, for the equality $l(\theta) = J(Q,\theta)$ to hold, Jensen's inequality must be tight, which requires
$$\frac{p(x^{(i)},z^{(i)};\theta)}{Q_i(z^{(i)})} = c$$
where $c$ does not depend on $z^{(i)}$. Since

$$\sum\limits_zQ_i(z) = 1$$
we have that
$$c = \sum\limits_z p(x^{(i)},z;\theta) = p(x^{(i)};\theta)$$
Therefore
$$Q_i(z^{(i)}) = \frac{p(x^{(i)},z^{(i)};\theta)}{p(x^{(i)};\theta)} = p(z^{(i)}|x^{(i)};\theta)$$
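As a quick numerical sanity check (not from the notes), for a toy mixture of two univariate Gaussians with made-up parameters, choosing $Q_i$ as this posterior makes the lower bound coincide with $\log p(x^{(i)})$:

```python
import numpy as np
from scipy.stats import norm

phi = np.array([0.3, 0.7])                       # mixing weights
mus, sigmas = np.array([-1.0, 2.0]), np.array([1.0, 0.5])
x = 0.4                                          # a single observation

joint = phi * norm.pdf(x, mus, sigmas)           # p(x, z = j)
log_lik = np.log(joint.sum())                    # log p(x)
Q = joint / joint.sum()                          # posterior p(z = j | x)
bound = np.sum(Q * np.log(joint / Q))            # sum_j Q(j) log(p(x, z=j) / Q(j))
print(np.isclose(log_lik, bound))                # True: the bound is tight
```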
Mixture of Gaussians
The E-step is easy
$$
\begin{array}{rcl}
w_j^{(i)} &=& Q_i(z^{(i)} = j)\\\\
&=& P(z^{(i)}=j|x^{(i)};\phi,\mu,\Sigma)\\\\
&=& \frac{\phi_jp(x^{(i)}|z^{(i)}=j;\mu,\Sigma)}{\sum_{l=1}^k\phi_lp(x^{(i)}|z^{(i)}=l;\mu,\Sigma)}
\end{array}
$$
In the M-step
$$J(Q,\theta) = \sum\limits_{i=1}^m \sum\limits_{j=1}^k w_j^{(i)}\log\frac{\phi_j\exp\Big(-\frac{1}{2}(x^{(i)}-\mu_j)^T\Sigma_j^{-1}(x^{(i)}-\mu_j)\Big)}{w_j^{(i)}(2\pi)^{n/2}|\Sigma_j|^{1/2}}$$
To maximize $J$ w.r.t. $\mu_j$:
$$
\begin{aligned}
& \because & \nabla_{\mu_j} J &= \sum\limits_{i=1}^m w_j^{(i)} \Sigma_j^{-1} (x^{(i)}-\mu_j) \\
& \therefore & \mu_j &:= \frac{\sum_{i=1}^m w_j^{(i)}x^{(i)}}{\sum_{i=1}^mw_j^{(i)}}
\end{aligned}
$$
Similarly, we have that
$$\Sigma_j := \frac{\sum_{i=1}^mw_j^{(i)}(x^{(i)}-\mu_j)(x^{(i)}-\mu_j)^T}{\sum_{i=1}^mw_j^{(i)}}$$
For the parameters $\phi_j$,
$$J(Q,\theta) = \sum\limits_{i=1}^m \sum\limits_{j=1}^k w_j^{(i)}\log\phi_j + c$$
where $c$ does not depend on $\phi_j$. The problem can be stated as
$$
\begin{array}{rl}
\max\limits_{\phi} & \sum\limits_{i=1}^m \sum\limits_{j=1}^k w_j^{(i)}\log\phi_j\\
\text{s.t.} & \sum\limits_{j=1}^k\phi_j = 1
\end{array}
$$
Using the Lagrangian
$$L(\phi) = \sum\limits_{i=1}^m \sum\limits_{j=1}^k w_j^{(i)}\log\phi_j + \beta(\sum\limits_{j=1}^k\phi_j - 1)$$
we have that
$$\frac{\partial}{\partial\phi_j}L = \sum\limits_{i=1}^m\frac{w_j^{(i)}}{\phi_j} + \beta$$
Therefore
$$\phi_j \propto \sum_{i=1}^m w_j^{(i)}$$
Normalizing according to the constraint, and noting that $\sum_{j=1}^k\sum_{i=1}^m w_j^{(i)} = \sum_{i=1}^m 1 = m$, gives
$$\phi_j := \frac{1}{m}\sum_{i=1}^m w_j^{(i)}$$
Factor Analysis
For the M-step, the problem is to
$$\max\limits_{\mu,\Lambda,\Psi} \quad \sum\limits_{i=1}^m \int_{z^{(i)}}Q_i(z^{(i)}) \log\frac{p(x^{(i)},z^{(i)};\mu,\Lambda,\Psi)}{Q_i(z^{(i)})}dz^{(i)}$$
The objective function can be written as
$$
\begin{aligned}
& J(\mu,\Lambda,\Psi)\\
&= \sum\limits_{i=1}^m E\Big[\log\frac{ p(x^{(i)},z^{(i)};\mu,\Lambda,\Psi)}{Q_i(z^{(i)})}\Big]\\
&= \sum\limits_{i=1}^m E\Big[\log p(x^{(i)}|z^{(i)};\mu,\Lambda,\Psi)+ \log p(z^{(i)}) - \log Q_i(z^{(i)})\Big]\\
\end{aligned}
$$
Dropping the terms that do not depend on $\mu,\Lambda,\Psi$,
$$
\begin{aligned}
& J(\mu,\Lambda,\Psi)\\
&\equiv \sum\limits_{i=1}^m E\Big[\log p(x^{(i)}|z^{(i)};\mu,\Lambda,\Psi)\Big]\\
&\equiv -\frac{1}{2}\sum\limits_{i=1}^m E\Big[\log|\Psi| + (x^{(i)}-\mu-\Lambda z^{(i)})^T\Psi^{-1}(x^{(i)}-\mu-\Lambda z^{(i)})\Big]
\end{aligned}
$$
Taking derivatives,
$$
\begin{array}{rcl}
\nabla_\Lambda J &=& \Psi^{-1} \sum\limits_{i=1}^m E\Big[(x^{(i)}-\mu-\Lambda z^{(i)})(z^{(i)})^T \Big]\\
\nabla_\mu J &=& (\Lambda\Lambda^T+\Psi)^{-1}\sum\limits_{i=1}^m(x^{(i)}-\mu)\\
\nabla_\Psi J &=& -\frac{1}{2}\sum\limits_{i=1}^m E\Big[\Psi^{-1}-\Psi^{-1}(x^{(i)}-\mu-\Lambda z^{(i)})(x^{(i)}-\mu-\Lambda z^{(i)})^T\Psi^{-1}\Big]
\end{array}
$$

Setting these gradients to zero yields the M-step updates stated above.