Gaussian distribution
$$p(x\mid\mu,\sigma^2)=\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\Big(-\frac{(x-\mu)^2}{2\sigma^2}\Big)$$
The d-dimensional multivariate Gaussian
$$p(x\mid\mu,\Sigma)=\frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}\exp\!\Big(-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\Big)$$
Maximum likelihood estimation in d dimensions:
Given data $D=\{x_1,\dots,x_N\}$, the likelihood is
$$p(D\mid\mu,\Sigma)=\prod_{n=1}^{N}p(x_n\mid\mu,\Sigma)$$
The MLE estimates are:
$$(\mu_{ML},\Sigma_{ML})=\underset{\mu,\Sigma}{\arg\max}\ \log p(D\mid\mu,\Sigma)$$
$$\mu_{ML}=\frac{1}{N}\sum_{n=1}^{N}x_n$$
$$\Sigma_{ML}=\frac{1}{N}\sum_{n=1}^{N}(x_n-\mu_{ML})(x_n-\mu_{ML})^{T}$$
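Both closed-form estimators are one line each in NumPy. A minimal sketch (the `gaussian_mle` name and the toy data are illustrative, not from the original notes):

```python
import numpy as np

def gaussian_mle(X):
    """Closed-form MLE for a d-dimensional Gaussian; X has shape (N, d)."""
    mu = X.mean(axis=0)                     # mu_ML = (1/N) sum_n x_n
    diff = X - mu                           # center the data
    Sigma = diff.T @ diff / X.shape[0]      # Sigma_ML = (1/N) sum_n (x_n-mu)(x_n-mu)^T
    return mu, Sigma

# Sanity check on synthetic data with known parameters.
rng = np.random.default_rng(0)
X = rng.multivariate_normal([1.0, -2.0], [[2.0, 0.3], [0.3, 0.5]], size=10_000)
mu_ml, Sigma_ml = gaussian_mle(X)
print(mu_ml)      # approximately [1, -2]
print(Sigma_ml)   # approximately [[2, 0.3], [0.3, 0.5]]
```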
Why use the Gaussian distribution
If the joint distribution $p(x,y)$ is Gaussian, then the marginal $p(x)$ is Gaussian, and likewise $p(y)$.
Mixture of Gaussians
A single Gaussian has only one mode, so it cannot model multimodal data.
Using several Gaussians, we can also cluster the data.
Taking the unimodal Gaussian as a basis distribution and linearly superposing several of them (a line of thought similar to boosting) gives the Gaussian mixture:
$$p(x)=\sum_{k=1}^{K}\pi_k\,\mathcal{N}(x\mid\mu_k,\sigma_k^2)$$
The weights $\pi_k$ are constrained: $\sum_k\pi_k=1$ (with $\pi_k\ge 0$).
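As a quick illustration, a sketch that evaluates a 1-D mixture density with `scipy.stats.norm` (all parameters here are made up):

```python
import numpy as np
from scipy.stats import norm

def gmm_pdf(x, pis, mus, sigmas):
    """Density of a 1-D Gaussian mixture: p(x) = sum_k pi_k N(x | mu_k, sigma_k^2)."""
    return sum(pi * norm.pdf(x, loc=mu, scale=sigma)
               for pi, mu, sigma in zip(pis, mus, sigmas))

# Two well-separated components give a bimodal density.
pis, mus, sigmas = [0.3, 0.7], [-2.0, 3.0], [0.5, 1.0]
assert abs(sum(pis) - 1.0) < 1e-12          # the constraint sum_k pi_k = 1
print(gmm_pdf(0.0, pis, mus, sigmas))
```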
Learning a Gaussian mixture
Log-likelihood:
$$\mathcal{L}(\pi,\mu,\Sigma)=\log p(D\mid\pi,\mu,\Sigma)=\sum_{n=1}^{N}\log\Big(\sum_{k=1}^{K}\pi_k\,\mathcal{N}(x_n\mid\mu_k,\Sigma_k)\Big)$$
For a single Gaussian the MLE has the simple closed form above, but for the mixture it does not: the sum over components sits inside the log. A quick analysis:
- Setting $\frac{\partial\mathcal{L}}{\partial\mu_k}=0$ gives
$$0=\sum_{n=1}^{N}\frac{\pi_k\,\mathcal{N}(x_n\mid\mu_k,\Sigma_k)}{\sum_j\pi_j\,\mathcal{N}(x_n\mid\mu_j,\Sigma_j)}\,\Sigma_k^{-1}(x_n-\mu_k)$$
Let $\gamma(z_{nk})=\frac{\pi_k\,\mathcal{N}(x_n\mid\mu_k,\Sigma_k)}{\sum_j\pi_j\,\mathcal{N}(x_n\mid\mu_j,\Sigma_j)}$.
Then
$$\mu_k=\frac{1}{N_k}\sum_{n=1}^{N}\gamma(z_{nk})\,x_n,\qquad N_k=\sum_{n=1}^{N}\gamma(z_{nk})$$
where $N_k$ is the total responsibility weight component $k$ takes for the data. Note that $\mu_k$ is again an average, just weighted by $\frac{1}{N_k}$.
- Setting $\frac{\partial\mathcal{L}}{\partial\Sigma_k}=0$ gives
$$\Sigma_k=\frac{1}{N_k}\sum_{n=1}^{N}\gamma(z_{nk})(x_n-\mu_k)(x_n-\mu_k)^{T}$$
- Setting $\frac{\partial L}{\partial\pi_k}=0$:
Because the $\pi_k$ are constrained by $\sum_k\pi_k=1$, we solve for $\pi_k$ with a Lagrange multiplier:
$$L=\mathcal{L}(\mu,\Sigma)+\lambda\Big(\sum_{k=1}^{K}\pi_k-1\Big)$$
$$\sum_{n=1}^{N}\frac{\mathcal{N}(x_n\mid\mu_k,\Sigma_k)}{\sum_j\pi_j\,\mathcal{N}(x_n\mid\mu_j,\Sigma_j)}+\lambda=0$$
Multiplying by $\pi_k$ and summing over $k$ gives $\lambda=-N$, hence
$$\pi_k=\frac{N_k}{N}$$
Collecting the results:
$$\pi_k=\frac{N_k}{N}$$
$$\mu_k=\frac{1}{N_k}\sum_{n=1}^{N}\gamma(z_{nk})\,x_n$$
$$\Sigma_k=\frac{1}{N_k}\sum_{n=1}^{N}\gamma(z_{nk})(x_n-\mu_k)(x_n-\mu_k)^{T}$$
$$\gamma(z_{nk})=\frac{\pi_k\,\mathcal{N}(x_n\mid\mu_k,\Sigma_k)}{\sum_j\pi_j\,\mathcal{N}(x_n\mid\mu_j,\Sigma_j)}$$
These are exactly the updates we want to compute, but $\gamma(z_{nk})$ itself depends on the unknown parameters.
Introducing the EM algorithm
It resolves the chicken-and-egg problem above of solving for $\gamma(z_{nk})$.
E-step
$$\gamma(z_{nk})=\frac{\pi_k\,\mathcal{N}(x_n\mid\mu_k,\Sigma_k)}{\sum_j\pi_j\,\mathcal{N}(x_n\mid\mu_j,\Sigma_j)}$$
Here $\gamma$ is in fact a posterior distribution: the probability that the $n$-th sample was generated by component $k$, $p(z_{nk}=1\mid x_n,\mu,\Sigma)$.
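In practice the responsibilities are computed in log space with the log-sum-exp trick to avoid underflow; this is a standard implementation detail, not part of the derivation. A minimal sketch:

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def e_step(X, pis, mus, Sigmas):
    """Responsibilities gamma[n, k] = p(z_nk = 1 | x_n), computed stably in log space."""
    log_w = np.stack([np.log(pis[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
                      for k in range(len(pis))], axis=1)   # shape (N, K)
    # Normalize over components: gamma = exp(log pi_k N_k - log sum_j pi_j N_j).
    return np.exp(log_w - logsumexp(log_w, axis=1, keepdims=True))
```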
M-step
$$\pi_k=\frac{N_k}{N}$$
$$\mu_k=\frac{1}{N_k}\sum_{n=1}^{N}\gamma(z_{nk})\,x_n$$
$$\Sigma_k=\frac{1}{N_k}\sum_{n=1}^{N}\gamma(z_{nk})(x_n-\mu_k)(x_n-\mu_k)^{T}$$
Iterate the E-step and M-step until convergence. Note that the choice of the initial point affects the clustering result of the Gaussian mixture.
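Putting the two steps together, a minimal EM loop for a GMM, reusing the `e_step` sketch above; the small diagonal jitter on the covariances is a common numerical safeguard, not part of the derivation:

```python
import numpy as np

def em_gmm(X, K, n_iter=100, seed=0):
    """EM for a Gaussian mixture (sketch). X has shape (N, d); uses e_step from above."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    mus = X[rng.choice(N, size=K, replace=False)]          # random data points as initial means
    Sigmas = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * K)
    pis = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        gamma = e_step(X, pis, mus, Sigmas)                # E-step: responsibilities
        Nk = gamma.sum(axis=0)                             # N_k = sum_n gamma(z_nk)
        pis = Nk / N                                       # pi_k = N_k / N
        mus = (gamma.T @ X) / Nk[:, None]                  # mu_k = (1/N_k) sum_n gamma x_n
        for k in range(K):                                 # Sigma_k = weighted scatter / N_k
            diff = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return pis, mus, Sigmas
```

Running it from different seeds illustrates the point about initialization: EM only finds a local optimum of the log-likelihood, so different starting means can yield different clusterings.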
Understanding the Gaussian mixture
For
$$p(x)=\sum_{k=1}^{K}\pi_k\,\mathcal{N}(x\mid\mu_k,\Sigma_k)$$
we introduce a selection variable $z$: a one-hot indicator, e.g.
$$z=\begin{pmatrix}0\\1\\0\end{pmatrix}$$
$$p(x,z)=\prod_{k=1}^{K}\pi_k^{z_k}\,\mathcal{N}(x\mid\mu_k,\Sigma_k)^{z_k}$$
(a product rather than a sum: exactly one $z_k$ equals 1, which selects the component).
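The latent-variable factorization also says how to sample from the mixture: draw the one-hot indicator $z$ first, then $x$ from the selected component (ancestral sampling). A 1-D sketch with hypothetical parameters:

```python
import numpy as np

def sample_gmm(n, pis, mus, sigmas, seed=0):
    """Ancestral sampling: z ~ Categorical(pi), then x ~ N(mu_z, sigma_z^2)."""
    rng = np.random.default_rng(seed)
    z = rng.choice(len(pis), size=n, p=pis)    # component index; one-hot z picks a component
    mus, sigmas = np.asarray(mus), np.asarray(sigmas)
    return rng.normal(loc=mus[z], scale=sigmas[z]), z

x, z = sample_gmm(5, pis=[0.3, 0.7], mus=[-2.0, 3.0], sigmas=[0.5, 1.0])
print(x, z)
```

Marginalizing out $z$ (summing over its $K$ one-hot values) recovers exactly the mixture density $p(x)$ above.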
- Redefining the log-likelihood
$$\log p(D\mid\Theta)=\sum_{n=1}^{N}\log\Big(\sum_{z_n}p(x_n,z_n)\Big)$$
The $\log\sum$ here is hard to differentiate, so we lower-bound it using Jensen's inequality:
$$\log\frac{x_1+x_2}{2}\ \ge\ \frac{\log x_1+\log x_2}{2}$$
or, written with expectations,
$$\log\mathbb{E}_{p(x)}[x]\ \ge\ \mathbb{E}_{p(x)}[\log x]$$
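A quick numeric sanity check of the expectation form (the positive random variable here is arbitrary, chosen only for illustration):

```python
import numpy as np

# Jensen's inequality for the concave log: log E[x] >= E[log x].
rng = np.random.default_rng(0)
x = rng.uniform(0.1, 5.0, size=100_000)      # any positive random variable works
assert np.log(x.mean()) >= np.log(x).mean()
print(np.log(x.mean()), np.log(x).mean())
```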
Introduce a distribution $q(z_n)$; the resulting bound is known in machine learning as the evidence lower bound (ELBO):
$$\log p(D\mid\Theta)=\sum_{n=1}^{N}\log\Big(\sum_{z_n}q(z_n)\frac{p(x_n,z_n)}{q(z_n)}\Big)\ \ge\ \sum_{n=1}^{N}\sum_{z_n}q(z_n)\log\frac{p(x_n,z_n)}{q(z_n)}\ \equiv\ \mathcal{L}(\theta,q(Z))$$
$q$ is generally called the variational distribution (the variational method).
But the lower bound can be tight or loose; how do we characterize the gap?
$$
\begin{aligned}
\mathcal{L}(\theta,q(Z)) &= \sum_{n=1}^{N}\Big\{\sum_{z_n}q(z_n)\log p(x_n,z_n)-\sum_{z_n}q(z_n)\log q(z_n)\Big\}\\
&= \sum_{n=1}^{N}\Big\{\sum_{z_n}q(z_n)\log\frac{p(x_n,z_n)}{p(x_n)}+\log p(x_n)-\sum_{z_n}q(z_n)\log q(z_n)\Big\}\\
&= \log p(D\mid\theta)+\sum_{n=1}^{N}\Big\{\sum_{z_n}q(z_n)\log p(z_n\mid x_n)-\sum_{z_n}q(z_n)\log q(z_n)\Big\}\\
&= \log p(D\mid\theta)-KL\big(q(Z)\,\|\,p(Z\mid D)\big)
\end{aligned}
$$
where we used $\log p(D\mid\theta)=\sum_{n=1}^{N}\log p(x_n)$.
So the gap of the lower bound is exactly a KL divergence.
The gap between $\mathcal{L}(\theta,q(Z))$ and $\log p(D\mid\theta)$ is
$$\log p(D\mid\theta)-\mathcal{L}(\theta,q(Z))=KL\big(q(Z)\,\|\,p(Z\mid D)\big)$$
To minimize the gap, we need $KL\big(q(Z)\,\|\,p(Z\mid D)\big)=0$, i.e. $q(Z)=p(Z\mid D)$.
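The identity $\log p(D\mid\theta)=\mathcal{L}+KL$ is easy to verify numerically. A minimal sketch for a single observation with a binary latent variable; the joint probabilities $p(x,z)$ and the variational $q$ below are made up for illustration:

```python
import numpy as np

# Toy check of  log p(x) - ELBO(q) = KL(q || p(z|x))  for one observation x.
p_xz = np.array([0.12, 0.28])            # hypothetical p(x, z=0), p(x, z=1)
log_px = np.log(p_xz.sum())              # log p(x) = log sum_z p(x, z)
post = p_xz / p_xz.sum()                 # exact posterior p(z | x)

q = np.array([0.6, 0.4])                 # an arbitrary variational distribution
elbo = np.sum(q * np.log(p_xz / q))      # sum_z q(z) log(p(x,z) / q(z))
kl = np.sum(q * np.log(q / post))        # KL(q || p(z|x))
assert np.isclose(elbo + kl, log_px)     # the gap is exactly the KL divergence
```

Setting `q = post` makes `kl` zero and the bound tight, which is precisely the E-step below.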
- The EM algorithm
Maximize the lower bound, or equivalently minimize the gap.
E-step:
Maximize over $q(Z)$ $\rightarrow$ $\frac{\partial\mathcal{L}}{\partial q}=0$,
which gives $q(z_n)=p(z_n\mid x_n)$, equivalent to the earlier $\gamma(z_{nk})$.
M-step:
Maximize over $\theta$ $\rightarrow$ $\frac{\partial\mathcal{L}}{\partial\theta}=0$.