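Purpose of this article
LDA (Latent Dirichlet Allocation) is an important document topic model that is widely used in many fields. Training methods for LDA fall into two broad families: Gibbs sampling and the variational-inference EM algorithm. Drawing on several references, this article focuses on the mathematical derivation of the variational-inference EM algorithm for LDA, and assumes the reader already knows the basic principles of LDA.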
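A brief review of the LDA model
First, a brief review of the LDA model. Suppose the dataset $D$ contains $M$ documents, the $d$-th document contains $N_d$ words, and the dataset involves $V$ distinct words and $K$ topics in total. The probabilistic graphical model of LDA describes its generative process; the symbols used in it are defined as follows: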
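- $\alpha$: parameter of the prior Dirichlet distribution over "document-topic" distributions, a $K$-dimensional vector
- $\eta$: parameter of the prior Dirichlet distribution over "topic-word" distributions, a $V$-dimensional vector
- $\theta_d$: parameter of the topic multinomial distribution of the $d$-th document, with $\theta_d \sim Dir(\alpha)$
- $z_{dn}$: the topic of the $n$-th word in the $d$-th document, a $K$-dimensional one-hot vector
- $\beta_k$: parameter of the word multinomial distribution of the $k$-th topic, with $\beta_k \sim Dir(\eta)$
- $w_{dn}$: the $n$-th word in the $d$-th document, a $V$-dimensional one-hot vector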
The generative process of LDA is as follows:
- For each document $d$, draw the parameter $\theta_d$ of its topic multinomial distribution from the "document-topic" prior $Dir(\alpha)$, $(d=1,2,\dots,M)$
- For each word in document $d$, draw its topic $z_{dn}$ from the multinomial distribution $Multi(\theta_d)$, $(n=1,2,\dots,N_d)$
- For each topic $k$, draw the parameter $\beta_k$ of its word multinomial distribution from the "topic-word" prior $Dir(\eta)$, $(k=1,2,\dots,K)$
- For each word in document $d$, draw the word $w_{dn}$ from the multinomial distribution $Multi(\beta_{z_{dn}})$
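The generative process above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `generate_corpus`, the toy sizes, and the hyperparameter values are made up for the example, and topics and words are stored as integer indices rather than one-hot vectors.

```python
import numpy as np

def generate_corpus(M, K, V, doc_lengths, alpha, eta, seed=0):
    """Sample a toy corpus by following the LDA generative process."""
    rng = np.random.default_rng(seed)
    # Topic-word distributions: beta_k ~ Dir(eta), stacked into shape (K, V).
    beta = rng.dirichlet(eta, size=K)
    docs, thetas = [], []
    for d in range(M):
        # Document-topic distribution: theta_d ~ Dir(alpha), shape (K,).
        theta = rng.dirichlet(alpha)
        # Topic of each word: z_dn ~ Multi(theta_d), stored as an index in [0, K).
        z = rng.choice(K, size=doc_lengths[d], p=theta)
        # Word ids: w_dn ~ Multi(beta_{z_dn}), stored as an index in [0, V).
        w = np.array([rng.choice(V, p=beta[k]) for k in z], dtype=int)
        docs.append(w)
        thetas.append(theta)
    return docs, np.array(thetas), beta

docs, thetas, beta = generate_corpus(
    M=3, K=2, V=5, doc_lengths=[4, 6, 5],
    alpha=np.full(2, 0.5), eta=np.full(5, 0.5),
)
```

Only `docs` would be visible to a learning algorithm; `thetas`, `beta` and the topic assignments are the latent quantities that inference has to recover.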
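Solving LDA: the variational-inference EM algorithm
For derivations of the EM algorithm and of variational inference, see my earlier summary: 《EM算法与变分推断 —— 数学推导》 ("EM Algorithm and Variational Inference: Mathematical Derivation").
If you would rather not read a long article, that is fine: below I briefly review the EM algorithm and variational inference from the perspective of LDA.
1 EM algorithm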
When a probabilistic graphical model contains latent variables, its parameters can be estimated with the EM algorithm. Latent variables are variables that cannot be observed but take part in generating the samples. In LDA the latent variables are $\theta, z, \beta$, the observed variable is the words $w$, and the model parameters are $\alpha, \eta$. From the dependencies in the graphical model we can write down the joint distribution of all variables directly:

$$
\begin{aligned}
p(w, \theta, z, \beta;\alpha,\eta) &= p(\beta\mid\eta) \, p(\theta\mid\alpha)\,p(z\mid\theta) \, p(w\mid\beta,z)\\
&= \prod_{k=1}^K p(\beta_k\mid\eta) \prod_{d=1}^M \Big[\,p(\theta_d\mid\alpha) \prod_{n=1}^{N_d} p(z_{dn}\mid\theta_d) \, p(w_{dn}\mid\beta_{z_{dn}},z_{dn})\Big]
\end{aligned} \tag{1}
$$

Next, summing or integrating the latent variables out of the joint distribution gives the marginal likelihood of the observed variable $w$ given the parameters $\alpha, \eta$ (for the whole dataset $D$):
$$
p(w;\alpha,\eta)=\iint \prod_{k=1}^K p(\beta_k\mid\eta) \prod_{d=1}^M p(\theta_d\mid\alpha) \Big[\prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn}\mid\theta_d) \, p(w_{dn}\mid\beta_{z_{dn}},z_{dn})\Big]\, d\beta \, d\theta \tag{2}
$$

The goal of LDA is to maximize (2). The right-hand side, however, involves an integral that is hard to evaluate, so we introduce an approximate distribution $q(\theta, z, \beta)$ over the latent variables $\theta, z, \beta$ and rewrite the log-likelihood $\log p(w;\alpha,\eta)$ as follows:
$$
\begin{aligned}
\log p(w;\alpha,\eta) &= \iint \sum_z q(\theta, z, \beta) \log p(w;\alpha,\eta) \, d\beta\, d\theta \qquad \text{since } \iint \sum_z q(\theta,z,\beta)\, d\beta\, d\theta=1\\
&= \iint \sum_z q(\theta, z, \beta) \log\Big[\frac{p(w, \theta, z, \beta;\alpha, \eta)}{p(\theta, z, \beta\mid w;\alpha, \eta)} \cdot \frac{q(\theta, z, \beta)}{q(\theta, z, \beta)} \Big] \, d\beta\, d\theta \\
&= \iint \sum_z q(\theta, z, \beta) \log \frac{p(w, \theta, z, \beta;\alpha, \eta)}{q(\theta, z, \beta)} \, d\beta\, d\theta \\
&\qquad+ \iint \sum_z q(\theta, z, \beta) \log \frac{q(\theta, z, \beta)}{p(\theta, z, \beta\mid w;\alpha, \eta)} \, d\beta\, d\theta \\
&= ELBO(q, w;\alpha, \eta) + KL[q(\theta, z, \beta) \, \| \, p(\theta, z, \beta\mid w;\alpha, \eta)]
\end{aligned} \tag{3}
$$

Equation (3) consists of two parts. The first term, the ELBO (Evidence Lower BOund), is a lower bound on the log-likelihood $\log p(w;\alpha,\eta)$. The second term is a KL divergence, which measures how much two distributions differ: it is always non-negative, equals 0 only when the two distributions coincide, and gets closer to 0 the more similar they are.
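The decomposition in (3) can be checked numerically on a toy model. The sketch below is not LDA: it assumes a single discrete latent variable `h` with three states and arbitrary illustrative values for the joint probabilities $p(w, h)$ at a fixed observation, which is enough to see that the ELBO and the KL term always add up to the log evidence.

```python
import math

# joint[h] = p(w, h) for a fixed observation w; the values are arbitrary.
joint = [0.10, 0.25, 0.05]
evidence = sum(joint)                        # p(w)
posterior = [j / evidence for j in joint]    # p(h | w)

def elbo(q):
    """sum_h q(h) * log(p(w, h) / q(h))."""
    return sum(qh * math.log(jh / qh) for qh, jh in zip(q, joint) if qh > 0)

def kl(q, p):
    """KL[q || p] for two discrete distributions."""
    return sum(qh * math.log(qh / ph) for qh, ph in zip(q, p) if qh > 0)

q = [0.5, 0.3, 0.2]                          # an arbitrary approximate distribution
log_evidence = math.log(evidence)
gap = log_evidence - (elbo(q) + kl(q, posterior))
```

Here `gap` is zero up to rounding for any valid `q`, and choosing `q = posterior` makes the KL term vanish so that the ELBO equals the log evidence.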
The training objective is $\arg\max_{\alpha, \eta} \log p(w;\alpha, \eta)$. From (3) we know that $\log p(w;\alpha,\eta) \ge ELBO(q,w;\alpha,\eta)$, so the optimization objective can be replaced by $\arg\max_{\alpha,\eta}ELBO(q,w;\alpha,\eta)$. The EM algorithm is built around exactly this objective.
EM alternates between an E-step and an M-step. The E-step keeps the model parameters fixed and sets the approximate distribution to the posterior of the latent variables. This drives the KL term to 0, and because $\log p(w;\alpha,\eta)$ does not depend on $q$ and therefore does not change, this step is equivalent to maximizing the ELBO over $q$ at the current parameters. The M-step then keeps $q$ fixed and looks for the model parameters that maximize the ELBO. Iterating the E-step and the M-step makes the ELBO increase from one iteration to the next.
For LDA, the EM algorithm is therefore defined as follows:
E-step: fix the model parameters $\alpha_t, \eta_t$ and set $q_{t+1}(\theta, z, \beta)=p(\theta, z, \beta\mid w;\alpha_t, \eta_t)$
M-step: fix $q_{t+1}(\theta, z, \beta)$ and optimize the model parameters, $\alpha_{t+1}, \eta_{t+1}=\arg\max_{\alpha,\eta} ELBO(q_{t+1},w;\alpha, \eta)$
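The alternation between the two steps can be sketched as a generic loop. Everything here is hypothetical scaffolding: `run_em`, `e_step`, `m_step` and `elbo` are illustrative names for caller-supplied functions, and the toy objective at the bottom is not LDA, just a concave function on which each step has a closed-form maximizer.

```python
def run_em(e_step, m_step, elbo, params, max_iter=100, tol=1e-8):
    """Alternate E-step and M-step until the ELBO stops improving.

    e_step(params) -> q, m_step(q) -> params and elbo(q, params) -> float
    are supplied by the caller.
    """
    history = []
    q = None
    for _ in range(max_iter):
        q = e_step(params)          # E-step: update q with the parameters fixed
        params = m_step(q)          # M-step: update the parameters with q fixed
        history.append(elbo(q, params))
        if len(history) > 1 and abs(history[-1] - history[-2]) < tol:
            break
    return q, params, history


# Toy stand-in: the "ELBO" is a concave function of a scalar q and a scalar a.
def toy_elbo(q, a):
    return -(q - a) ** 2 - (a - 3.0) ** 2

q_final, a_final, history = run_em(
    e_step=lambda a: a,                  # argmax over q for fixed a
    m_step=lambda q: (q + 3.0) / 2.0,    # argmax over a for fixed q
    elbo=toy_elbo,
    params=0.0,
)
```

Because each step maximizes the objective over one block of variables while the other is held fixed, the recorded `history` never decreases, which mirrors the behaviour of the ELBO in EM.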
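2 Variational inference
2.1 Defining the objective
The key step of EM is the E-step assignment $q_{t+1}(\theta, z, \beta)=p(\theta, z, \beta\mid w;\alpha_t, \eta_t)$, which requires the posterior of the latent variables. For simpler models (such as Gaussian mixture models) this posterior can be derived directly. For LDA, however, the graphical model shows that once $w$ is observed, $\beta$ and $\theta$, and $\beta$ and $z$, are no longer conditionally independent. They are coupled, so the posterior of the latent variables cannot be derived in closed form. We therefore turn to variational inference: assume that the latent variables $\theta, z, \beta$ are generated by separate, independent distributions (the mean-field assumption), which gives a variational distribution $q(\theta,z,\beta;\gamma,\phi,\lambda)$, and ask this variational distribution to approximate the intractable posterior $p(\theta,z,\beta\mid w;\alpha,\eta)$.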
The joint distribution $q(\theta, z, \beta)$ of the latent variables $\theta,z,\beta$ can therefore be rewritten in the following form:

$$
\begin{aligned}
q(\theta, z, \beta) &= q(\theta, z, \beta;\gamma,\phi,\lambda) \\
&= \prod_{k=1}^K q(\beta_k\mid\lambda_k) \prod_{d=1}^M \Big[\, q(\theta_d\mid\gamma_d)\prod_{n=1}^{N_d}q(z_{dn}\mid\phi_{dn})\,\Big]
\end{aligned} \tag{4}
$$

The objective now becomes:

$$
\arg\min_{\gamma,\,\phi,\,\lambda} KL[q(\theta,z,\beta;\gamma,\phi,\lambda) \, \| \, p(\theta,z,\beta\mid w;\alpha,\eta)]
$$
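To make the factorization in (4) concrete, the sketch below lays out the three groups of variational parameters and their shapes. The container `VariationalParams`, the function `init_variational_params`, and the particular initial values are assumptions made for this example; the text does not prescribe an initialization.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class VariationalParams:
    gamma: np.ndarray   # (M, K): Dirichlet parameters of q(theta_d | gamma_d)
    phi: list           # M arrays of shape (N_d, K): q(z_dn | phi_dn), rows sum to 1
    lam: np.ndarray     # (K, V): Dirichlet parameters of q(beta_k | lambda_k)

def init_variational_params(doc_lengths, K, V, seed=0):
    """One possible starting point for the coordinate updates."""
    rng = np.random.default_rng(seed)
    gamma = np.ones((len(doc_lengths), K))
    phi = [np.full((n, K), 1.0 / K) for n in doc_lengths]
    # Random positive values break the symmetry between topics.
    lam = rng.gamma(100.0, 0.01, size=(K, V))
    return VariationalParams(gamma=gamma, phi=phi, lam=lam)

params = init_variational_params(doc_lengths=[4, 6, 5], K=2, V=5)
```

Note that `phi` and `gamma` hold one block per document, whereas `lam` is a single matrix shared by the whole corpus.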
The idea of variational inference is to approximate the posterior of the latent variables with the variational distribution, but the KL objective above still contains the posterior that cannot be derived because of the coupling between latent variables. We therefore transform the objective once more. By (3), the log-likelihood $\log p(w;\alpha,\eta)$ is the sum of the ELBO term and the KL term, and it does not depend on the variational parameters $\gamma,\phi,\lambda$. For fixed $\alpha,\eta$, minimizing the KL term is therefore equivalent to maximizing the ELBO term:

$$
\begin{aligned}
\gamma^*,\phi^*,\lambda^* &= \arg\min_{\gamma,\,\phi,\,\lambda} KL[q(\theta,z,\beta;\gamma,\phi,\lambda) \, \| \, p(\theta,z,\beta\mid w;\alpha,\eta)] \\
&= \arg\max_{\gamma,\,\phi,\,\lambda}ELBO(q, w;\alpha, \eta)
\end{aligned} \tag{5}
$$
Using (4), we now split $ELBO(q,w;\alpha,\eta)$ into separate terms, writing $\mathbb{E}_q$ as shorthand for the expectation under $q(\theta, z, \beta;\gamma,\phi,\lambda)$:

$$
\begin{aligned}
ELBO(q,w;\alpha,\eta) &= \iint \sum_z q(\theta, z, \beta) \log \frac{p(w, \theta, z, \beta;\alpha, \eta)}{q(\theta, z, \beta)} \, d\beta\, d\theta \\
&= \iint \sum_z q(\theta, z, \beta;\gamma,\phi,\lambda) \log \frac{p(w, \theta, z, \beta;\alpha, \eta)}{q(\theta, z, \beta;\gamma,\phi,\lambda)} \, d\beta\, d\theta \\
&= \mathbb{E}_{q}\log p(w, \theta, z, \beta;\alpha, \eta) - \mathbb{E}_{q}\log q(\theta, z, \beta;\gamma,\phi,\lambda) \\
&= \mathbb{E}_q\log p(\beta\mid\eta) + \mathbb{E}_q\log p(\theta\mid\alpha) + \mathbb{E}_q\log p(z\mid\theta) + \mathbb{E}_q \log p(w\mid\beta,z) \\
&\qquad- \mathbb{E}_q \log q(\beta\mid\lambda) - \mathbb{E}_q \log q(z\mid\phi) - \mathbb{E}_q \log q(\theta\mid\gamma)
\end{aligned} \tag{6}
$$

Following the graphical model, the ELBO has been decomposed into 7 terms. The main task below is to derive each of these 7 terms in turn and then build the EM algorithm from them.
2.2 A property of the exponential family
We need one property of the exponential family to simplify the derivations that follow. First, the Dirichlet distribution is defined as:

$$
Dir(\theta;\alpha) = \frac{\Gamma(\sum_{i=1}^K\alpha_i)}{\prod_{i=1}^K \Gamma(\alpha_i)} \prod_{i=1}^K \theta_i^{\alpha_i-1} \tag{7}
$$
where $\Gamma(\cdot)$ is the Gamma function, defined as:

$$
\Gamma(x) = \int_0^\infty t^{x-1}e^{-t}\,dt \tag{8}
$$
Because the Dirichlet distribution belongs to the exponential family, we state the following property of the exponential family without proof:

$$
\mathbb{E}_{Dir(\theta;\alpha)} \log(\theta_k) = \Psi(\alpha_k)-\Psi(\sum_{i=1}^K\alpha_i) \tag{9}
$$

where $\Psi(\cdot)$ is the digamma function:

$$
\Psi(x)=\frac{d}{dx}\log\Gamma(x)=\frac{\Gamma'(x)}{\Gamma(x)} \tag{10}
$$
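Property (9) is stated without proof, but it is easy to check by simulation. The sketch below compares the closed form with a Monte Carlo average, using `scipy.special.digamma`; the helper name `dirichlet_expected_log`, the parameter values and the sample size are choices made for this illustration.

```python
import numpy as np
from scipy.special import digamma

def dirichlet_expected_log(alpha):
    """E[log theta_k] under Dir(alpha), equation (9); acts row-wise on 2-D input."""
    alpha = np.asarray(alpha, dtype=float)
    return digamma(alpha) - digamma(alpha.sum(axis=-1, keepdims=True))

alpha = np.array([0.8, 2.0, 3.5])
rng = np.random.default_rng(0)
samples = rng.dirichlet(alpha, size=200_000)     # shape (200000, 3)
monte_carlo = np.log(samples).mean(axis=0)       # sample average of log theta_k
closed_form = dirichlet_expected_log(alpha)
```

The two vectors agree up to Monte Carlo noise. The same helper applies to $\gamma$ and $\lambda$, which is where (9) is used in the derivations of the ELBO terms.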
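2.3 Deriving the ELBO
Using the exponential-family property above, we now derive the 7 terms of (6) one by one. The terms that involve $\theta$ and $z$ are written for a single document, so the document index $d$ is dropped and $N$ denotes the length of that document.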
- $\mathbb{E}_q\log p(\beta\mid\eta)$

$$
\begin{aligned}
\mathbb{E}_q\log p(\beta\mid\eta) &= \mathbb{E}_q\log\prod_{k=1}^K Dir(\beta_k\mid\eta) \\
&= \mathbb{E}_q \log\prod_{k=1}^K\frac{\Gamma(\sum_{v=1}^V\eta_v)}{\prod_{v=1}^V\Gamma(\eta_v)}\prod_{v=1}^V\beta_{kv}^{\eta_v-1} \\
&= K\log \Gamma(\sum_{v=1}^V\eta_v) - K\sum_{v=1}^V\log\Gamma(\eta_v) + \sum_{k=1}^K\mathbb{E}_q\sum_{v=1}^V(\eta_v-1)\log\beta_{kv} \\
&= K\log \Gamma(\sum_{v=1}^V\eta_v) - K\sum_{v=1}^V\log\Gamma(\eta_v) \\
&\qquad + \sum_{k=1}^K\sum_{v=1}^V(\eta_v-1)[\Psi(\lambda_{kv})-\Psi(\sum_{i=1}^V\lambda_{ki})]
\end{aligned} \tag{11}
$$
- $\mathbb{E}_q \log p(\theta\mid\alpha)$ (written for a single document, so the product $\prod_{d=1}^M$ is omitted)

$$
\begin{aligned}
\mathbb{E}_q \log p(\theta\mid\alpha) &= \mathbb{E}_q\log Dir(\theta\mid\alpha) \\
&= \mathbb{E}_q \log\frac{\Gamma(\sum_{k=1}^K\alpha_k)}{\prod_{k=1}^K\Gamma(\alpha_k)}\prod_{k=1}^K\theta_k^{\alpha_k-1} \\
&= \log \Gamma(\sum_{k=1}^K\alpha_k) - \sum_{k=1}^K\log\Gamma(\alpha_k) + \sum_{k=1}^K\mathbb{E}_q(\alpha_k-1)\log\theta_k \\
&= \log \Gamma(\sum_{k=1}^K\alpha_k) - \sum_{k=1}^K\log\Gamma(\alpha_k) + \sum_{k=1}^K (\alpha_k-1)[\Psi(\gamma_k)-\Psi(\sum_{i=1}^K\gamma_i)]
\end{aligned} \tag{12}
$$
- $\mathbb{E}_q\log p(z\mid\theta)$ (here $z_n$ is a $K$-dimensional one-hot vector, so $z_{nk}$ is 1 or 0)

$$
\begin{aligned}
\mathbb{E}_q\log p(z\mid\theta) &= \sum_{n=1}^N\sum_{k=1}^K\mathbb{E}_q \log \theta_k^{z_{nk}} \\
&= \sum_{n=1}^N\sum_{k=1}^K\mathbb{E}_q \, z_{nk} \cdot\mathbb{E}_q\log\theta_k \\
&= \sum_{n=1}^N\sum_{k=1}^K\phi_{nk} \cdot [\Psi(\gamma_k)-\Psi(\sum_{i=1}^K\gamma_i)]
\end{aligned} \tag{13}
$$
- $\mathbb{E}_q\log p(w\mid\beta,z)$ (the word $w_{nv}$ is observed, so it is a constant under $q$ and needs no expectation)

$$
\begin{aligned}
\mathbb{E}_q\log p(w\mid\beta,z) &= \sum_{n=1}^N\sum_{k=1}^K\sum_{v=1}^V\mathbb{E}_q\log \beta_{kv}^{(z_{nk}\cdot w_{nv})} \\
&= \sum_{n=1}^N\sum_{k=1}^K\sum_{v=1}^V\mathbb{E}_q\,z_{nk}\cdot w_{nv} \cdot \mathbb{E}_q\log\beta_{kv} \\
&= \sum_{n=1}^N\sum_{k=1}^K\sum_{v=1}^V \phi_{nk}\cdot w_{nv}\cdot[\Psi(\lambda_{kv})-\Psi(\sum_{i=1}^V\lambda_{ki})\,]
\end{aligned} \tag{14}
$$
- $\mathbb{E}_q\log q(\beta\mid\lambda)$

$$
\begin{aligned}
\mathbb{E}_q\log q(\beta\mid\lambda) &= \mathbb{E}_q \log \prod_{k=1}^K\Big[\frac{\Gamma(\sum_{v=1}^V\lambda_{kv})}{\prod_{v=1}^V\Gamma(\lambda_{kv})}\prod_{v=1}^V \beta_{kv}^{\lambda_{kv}-1}\Big] \\
&= \sum_{k=1}^K[\,\log \Gamma(\sum_{v=1}^V\lambda_{kv}) - \sum_{v=1}^V \log \Gamma(\lambda_{kv})\,] \\
&\qquad + \sum_{k=1}^K\sum_{v=1}^V(\lambda_{kv}-1)[\Psi(\lambda_{kv}) - \Psi(\sum_{i=1}^V\lambda_{ki})]
\end{aligned} \tag{15}
$$
- $\mathbb{E}_q\log q(z\mid\phi)$

$$
\begin{aligned}
\mathbb{E}_q\log q(z\mid\phi) &= \sum_{n=1}^N\sum_{k=1}^K \mathbb{E}_q\log \phi_{nk}^{z_{nk}} \\
&= \sum_{n=1}^N\sum_{k=1}^K \mathbb{E}_q\,z_{nk}\cdot\log \phi_{nk} \\
&= \sum_{n=1}^N\sum_{k=1}^K\phi_{nk}\log \phi_{nk}
\end{aligned} \tag{16}
$$
- $\mathbb{E}_q\log q(\theta\mid\gamma)$

$$
\begin{aligned}
\mathbb{E}_q\log q(\theta\mid\gamma) &= \mathbb{E}_q \log \frac{\Gamma(\sum_{k=1}^K\gamma_k)}{\prod_{k=1}^K\Gamma(\gamma_k)} \prod_{k=1}^K\theta_k^{\gamma_k-1} \\
&= \log \Gamma(\sum_{k=1}^K\gamma_k) - \sum_{k=1}^K \log \Gamma(\gamma_k) \\
&\qquad + \sum_{k=1}^K(\gamma_k-1)[\,\Psi(\gamma_k)-\Psi(\sum_{i=1}^K\gamma_i)\,]
\end{aligned} \tag{17}
$$
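Putting terms (11) to (17) together gives a function that evaluates the ELBO for a whole corpus, which is useful for monitoring convergence. The sketch below is one possible arrangement under assumptions of my own: documents are arrays of word ids (so the one-hot $w_n$ becomes a column lookup), `phi` is a list with one `(N_d, K)` array per document, and all function names are illustrative.

```python
import numpy as np
from scipy.special import digamma, gammaln

def expected_log(param):
    """Row-wise E[log x] under a Dirichlet with these parameters, equation (9)."""
    return digamma(param) - digamma(param.sum(axis=-1, keepdims=True))

def log_dirichlet_norm(param):
    """Row-wise log normalizer: log Gamma(sum) - sum of log Gamma."""
    return gammaln(param.sum(axis=-1)) - gammaln(param).sum(axis=-1)

def elbo(docs, phi, gamma, lam, alpha, eta):
    e_log_theta = expected_log(gamma)    # shape (M, K)
    e_log_beta = expected_log(lam)       # shape (K, V)
    K = lam.shape[0]
    # (11) minus (15): terms involving beta, over all topics.
    total = K * log_dirichlet_norm(eta) + ((eta - 1.0) * e_log_beta).sum()
    total -= log_dirichlet_norm(lam).sum() + ((lam - 1.0) * e_log_beta).sum()
    for d, words in enumerate(docs):
        # (12) minus (17): terms involving theta_d.
        total += log_dirichlet_norm(alpha) + ((alpha - 1.0) * e_log_theta[d]).sum()
        total -= log_dirichlet_norm(gamma[d]) + ((gamma[d] - 1.0) * e_log_theta[d]).sum()
        # (13): E_q log p(z | theta).
        total += (phi[d] * e_log_theta[d]).sum()
        # (14): the one-hot w_n selects column w_n of E[log beta].
        total += (phi[d] * e_log_beta[:, words].T).sum()
        # (16): subtract E_q log q(z | phi); the clip guards against log(0).
        total -= (phi[d] * np.log(np.clip(phi[d], 1e-300, None))).sum()
    return float(total)

# Sanity check with one topic (K = 1) and q set equal to the priors:
# every term except (14) vanishes, leaving sum_n E[log beta_{1, w_n}].
docs = [np.array([0, 2, 2])]
alpha, eta = np.array([0.7]), np.array([0.5, 1.0, 2.0])
value = elbo(docs, [np.ones((3, 1))], alpha[None, :].copy(), eta[None, :].copy(), alpha, eta)
expected = digamma(0.5) + 2 * digamma(2.0) - 3 * digamma(3.5)
```

In the degenerate check at the end, `value` and `expected` coincide up to rounding, which exercises the sign of each term.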
2.4 Deriving the E-step
Having derived every term, $ELBO(q,w;\alpha,\eta)$ is now a function of $\gamma, \phi, \lambda,w,\alpha, \eta$. In the E-step the goal is to find the optimal variational parameters $\gamma^*, \phi^*, \lambda^*$ that maximize the ELBO, so we take the partial derivative with respect to each of the three variational parameters and set it to zero. Note the constraints on the variational parameters: $\gamma$ and $\lambda$ are Dirichlet parameters, so they carry no normalization constraint (they only need to be positive), whereas $\phi$ is a multinomial parameter, so the topic probabilities of each word $n$ must sum to 1, i.e. $\sum_{k=1}^K\phi_{nk}=1\,\,(n=1,2,\dots,N)$.
2.4.1 The variational parameter $\phi$
- Collect the ELBO terms that involve $\phi$ and add the Lagrange constraint

$$
\begin{aligned}
ELBO(q,w;\alpha,\eta)_{[\phi]} &= \sum_{n=1}^N\sum_{k=1}^K\phi_{nk}[\Psi(\gamma_k)-\Psi(\sum_{i=1}^K\gamma_i)] \\
&\qquad + \sum_{n=1}^N\sum_{k=1}^K\sum_{v=1}^V\phi_{nk}\,w_{nv}[\Psi(\lambda_{kv}) - \Psi(\sum_{i=1}^V\lambda_{ki})] \\
&\qquad -\sum_{n=1}^N\sum_{k=1}^K\phi_{nk}\log\phi_{nk} + \underbrace{\sum_{n=1}^N c_n(\sum_{k=1}^K\phi_{nk}-1)}_{\text{Lagrange constraint}}
\end{aligned} \tag{18}
$$

- Take the partial derivative with respect to $\phi_{nk}$

$$
\begin{aligned}
\frac{\partial}{\partial\phi_{nk}} ELBO(q,w;\alpha,\eta)_{[\phi]} &= \Psi(\gamma_k) - \Psi(\sum_{k'=1}^K\gamma_{k'}) \\
&\qquad + \sum_{v=1}^Vw_{nv}[\Psi(\lambda_{kv}) - \Psi(\sum_{i=1}^V\lambda_{ki})] \\
&\qquad - \log\phi_{nk} - 1 + c_n
\end{aligned} \tag{19}
$$

- Setting the partial derivative to 0 gives

$$
\begin{aligned}
\phi_{nk} &= \exp\{\,\Psi(\gamma_k) - \Psi(\sum_{k'=1}^K\gamma_{k'}) + \sum_{v=1}^Vw_{nv}[\Psi(\lambda_{kv}) - \Psi(\sum_{i=1}^V\lambda_{ki})] - 1 + c_n\}\\
&\propto \exp\{\,\Psi(\gamma_k) - \Psi(\sum_{k'=1}^K\gamma_{k'}) + \sum_{v=1}^Vw_{nv}[\Psi(\lambda_{kv}) - \Psi(\sum_{i=1}^V\lambda_{ki})]\}
\end{aligned} \tag{20}
$$
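In code, (20) is usually evaluated in log space and normalized afterwards, so that the unknown constant $-1 + c_n$ never has to be computed. The following is a minimal sketch for a single document, assuming word ids instead of one-hot vectors; `update_phi` and the toy inputs are hypothetical names and values.

```python
import numpy as np
from scipy.special import digamma

def update_phi(words, gamma_d, lam):
    """Equation (20) for one document; returns phi with shape (N_d, K)."""
    e_log_theta = digamma(gamma_d) - digamma(gamma_d.sum())               # (K,)
    e_log_beta = digamma(lam) - digamma(lam.sum(axis=1, keepdims=True))   # (K, V)
    # The one-hot w_n picks out column w_n of E[log beta].
    log_phi = e_log_theta[None, :] + e_log_beta[:, words].T               # (N_d, K)
    # Subtract the row maximum before exponentiating for numerical stability;
    # the constant is absorbed by the normalization, like -1 + c_n in (20).
    log_phi -= log_phi.max(axis=1, keepdims=True)
    phi = np.exp(log_phi)
    return phi / phi.sum(axis=1, keepdims=True)

words = np.array([0, 1, 1, 3])
gamma_d = np.array([1.0, 2.0])
lam = np.array([[5.0, 1.0, 1.0, 1.0],
                [1.0, 1.0, 1.0, 5.0]])
phi = update_phi(words, gamma_d, lam)
```

In this toy input, topic 0 puts most of its mass on word 0 and topic 1 on word 3, so the first word is assigned mostly to topic 0 and the last word mostly to topic 1.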
2.4.2 The variational parameter $\gamma$
- Collect the ELBO terms that involve $\gamma$

$$
\begin{aligned}
ELBO(q,w;\alpha,\eta)_{[\gamma]} &= \sum_{n=1}^N\sum_{k=1}^K\phi_{nk}[\Psi(\gamma_k)-\Psi(\sum_{i=1}^K\gamma_i)] \\
&\qquad + \sum_{k=1}^K(\alpha_k-1)[\Psi(\gamma_k)-\Psi(\sum_{i=1}^K\gamma_i)] \\
&\qquad - \log \Gamma(\sum_{k=1}^K\gamma_k) + \sum_{k=1}^K\log \Gamma(\gamma_k) \\
&\qquad - \sum_{k=1}^K(\gamma_k-1)[\Psi(\gamma_k)-\Psi(\sum_{i=1}^K\gamma_i)] \\
&= \sum_{k=1}^K \{\, (\sum_{n=1}^N\phi_{nk}+\alpha_k-\gamma_k)[\Psi(\gamma_k)-\Psi(\sum_{i=1}^K\gamma_i)]\,\} \\
&\qquad - \log \Gamma(\sum_{k=1}^K\gamma_k) + \sum_{k=1}^K\log \Gamma(\gamma_k)
\end{aligned} \tag{21}
$$

- Take the partial derivative with respect to $\gamma_k$. Because $\Psi(\sum_i\gamma_i)$ appears in every summand of (21), its derivative $\Psi'(\sum_i\gamma_i)$ is multiplied by a sum over all components $j$:

$$
\begin{aligned}
\frac{\partial}{\partial\gamma_k}ELBO(q,w;\alpha,\eta)_{[\gamma]} &= -\Psi(\gamma_k)+\Psi(\sum_{k'=1}^K\gamma_{k'}) \\
&\qquad + \Psi'(\gamma_k)\,(\sum_{n=1}^N\phi_{nk}+\alpha_k-\gamma_k) - \Psi'(\sum_{k'=1}^K\gamma_{k'})\sum_{j=1}^K(\sum_{n=1}^N\phi_{nj}+\alpha_j-\gamma_j) \\
&\qquad -\underbrace{\frac{\Gamma'(\sum_{k'=1}^K\gamma_{k'})}{\Gamma(\sum_{k'=1}^K\gamma_{k'})}}_{=\,\Psi(\sum_{k'=1}^K\gamma_{k'})} + \underbrace{\frac{\Gamma'(\gamma_k)}{\Gamma(\gamma_k)}}_{=\,\Psi(\gamma_k)} \\
&= \Psi'(\gamma_k)\,(\sum_{n=1}^N\phi_{nk}+\alpha_k - \gamma_k) - \Psi'(\sum_{k'=1}^K\gamma_{k'})\sum_{j=1}^K(\sum_{n=1}^N\phi_{nj}+\alpha_j-\gamma_j)
\end{aligned} \tag{22}
$$

- Setting the partial derivatives to 0: all $K$ equations are satisfied when every bracket $\sum_n\phi_{nk}+\alpha_k-\gamma_k$ vanishes, which gives

$$
\gamma_k = \alpha_k + \sum_{n=1}^N\phi_{nk} \tag{23}
$$
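The gradient of the $\gamma$-dependent part of the ELBO, including the term that couples all components through $\Psi'(\sum_i\gamma_i)$, can be cross-checked against finite differences, and the update (23) can be confirmed to be a stationary point. This is a verification sketch with made-up values for `alpha` and `phi`; `polygamma(1, x)` from SciPy is the trigamma function $\Psi'(x)$.

```python
import numpy as np
from scipy.special import digamma, gammaln, polygamma

def elbo_gamma(gamma, alpha, phi):
    """Equation (21): the gamma-dependent part of the ELBO for one document."""
    e_log_theta = digamma(gamma) - digamma(gamma.sum())
    coeff = phi.sum(axis=0) + alpha - gamma
    return (coeff * e_log_theta).sum() - gammaln(gamma.sum()) + gammaln(gamma).sum()

def grad_gamma(gamma, alpha, phi):
    """Gradient of elbo_gamma with respect to gamma."""
    coeff = phi.sum(axis=0) + alpha - gamma
    return polygamma(1, gamma) * coeff - polygamma(1, gamma.sum()) * coeff.sum()

alpha = np.array([0.5, 1.0, 1.5])
phi = np.array([[0.7, 0.2, 0.1],
                [0.1, 0.1, 0.8],
                [0.3, 0.3, 0.4],
                [0.2, 0.5, 0.3]])
gamma_star = alpha + phi.sum(axis=0)      # the closed-form update (23)

# Central finite differences at an arbitrary point.
gamma0 = np.array([1.2, 0.9, 2.3])
eps = 1e-6
numeric = np.array([
    (elbo_gamma(gamma0 + eps * e, alpha, phi) - elbo_gamma(gamma0 - eps * e, alpha, phi)) / (2 * eps)
    for e in np.eye(3)
])
```

`numeric` matches `grad_gamma(gamma0, alpha, phi)`, and the gradient evaluated at `gamma_star` is zero up to rounding.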
2.4.3 The variational parameter $\lambda$
- Collect the ELBO terms that involve $\lambda$

$$
\begin{aligned}
ELBO(q,w;\alpha,\eta)_{[\lambda]} &= \sum_{k=1}^K\sum_{v=1}^V(\eta_v-1)[\Psi(\lambda_{kv})-\Psi(\sum_{i=1}^V\lambda_{ki})] \\
&\qquad + \sum_{n=1}^N\sum_{k=1}^K\sum_{v=1}^V\phi_{nk}\,w_{nv}[\Psi(\lambda_{kv})-\Psi(\sum_{i=1}^V\lambda_{ki})] \\
&\qquad - \sum_{k=1}^K[\log \Gamma(\sum_{v=1}^V\lambda_{kv}) - \sum_{v=1}^V\log\Gamma(\lambda_{kv})] \\
&\qquad - \sum_{k=1}^K\sum_{v=1}^V(\lambda_{kv}-1)[\Psi(\lambda_{kv})-\Psi(\sum_{i=1}^V\lambda_{ki})] \\
&= \sum_{k=1}^K\sum_{v=1}^V \,[\Psi(\lambda_{kv})-\Psi(\sum_{i=1}^V\lambda_{ki})]\,[\eta_v + \sum_{n=1}^N\phi_{nk}\,w_{nv}-\lambda_{kv}] \\
&\qquad - \sum_{k=1}^K[\log \Gamma(\sum_{v=1}^V\lambda_{kv}) - \sum_{v=1}^V\log\Gamma(\lambda_{kv})]
\end{aligned} \tag{24}
$$

- Take the partial derivative with respect to $\lambda_{kv}$. As for $\gamma$, the term $\Psi'(\sum_i\lambda_{ki})$ is multiplied by a sum over all words $v'$ of topic $k$:

$$
\begin{aligned}
\frac{\partial}{\partial\lambda_{kv}}ELBO(q,w;\alpha,\eta)_{[\lambda]} &= \Psi'(\lambda_{kv})\,[\sum_{n=1}^N\phi_{nk}\,w_{nv}+\eta_v-\lambda_{kv}] - \Psi'(\sum_{i=1}^V\lambda_{ki})\sum_{v'=1}^V[\sum_{n=1}^N\phi_{nk}\,w_{nv'}+\eta_{v'}-\lambda_{kv'}] \\
&\qquad - \Psi(\lambda_{kv}) + \Psi(\sum_{i=1}^V\lambda_{ki}) \\
&\qquad - \underbrace{\frac{\Gamma'(\sum_{v'=1}^V\lambda_{kv'})}{\Gamma(\sum_{v'=1}^V\lambda_{kv'})}}_{=\,\Psi(\sum_{v'=1}^V\lambda_{kv'})} + \underbrace{\frac{\Gamma'(\lambda_{kv})}{\Gamma(\lambda_{kv})}}_{=\,\Psi(\lambda_{kv})} \\
&= \Psi'(\lambda_{kv})\,[\sum_{n=1}^N\phi_{nk}\,w_{nv}+\eta_v-\lambda_{kv}] - \Psi'(\sum_{i=1}^V\lambda_{ki})\sum_{v'=1}^V[\sum_{n=1}^N\phi_{nk}\,w_{nv'}+\eta_{v'}-\lambda_{kv'}]
\end{aligned} \tag{25}
$$

- Setting the partial derivatives to 0: the equations are satisfied when every bracket vanishes, which gives

$$
\lambda_{kv}=\eta_v + \sum_{n=1}^N\phi_{nk}\,w_{nv} \tag{26}
$$
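2.4.4 E-step update formulas
Equations (20), (23) and (26) already give the update formulas of the three variational parameters in the E-step. However, all derivations so far were carried out for a single document, so they have to be extended to the whole corpus $D$ to obtain the final update formulas. Note that the parameters $\phi$ and $\gamma$ are different for every document, whereas the parameter $\lambda$ is shared by the whole corpus.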
$$
\left\{
\begin{aligned}
\phi_{dnk} & \propto \exp\{\,\Psi(\gamma_{dk}) - \Psi(\sum_{k'=1}^K\gamma_{dk'}) + \sum_{v=1}^Vw_{dnv}[\Psi(\lambda_{kv}) - \Psi(\sum_{i=1}^V\lambda_{ki})] \,\} \\
\gamma_{dk} & = \alpha_k + \sum_{n=1}^{N_d}\phi_{dnk} \\
\lambda_{kv}&=\eta_v + \sum_{d=1}^M\sum_{n=1}^{N_d}\phi_{dnk}\,w_{dnv}
\end{aligned}
\right.\tag{27}
$$

In the E-step we simply cycle through the updates of the three parameters $\phi,\gamma,\lambda$ until convergence. One caveat: because $\phi$ is constrained, it must be normalized after each update, so that for every document $d$ and every word $n$ in it we have $\sum_{k=1}^K\phi_{dnk}=1$.
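The three updates in (27) can be combined into a coordinate-ascent loop over the corpus. The sketch below is one way to organize it, not a reference implementation: documents are arrays of word ids, the initialization and the stopping rule (mean absolute change of $\lambda$) are assumptions of this example, and `e_step` is an illustrative name.

```python
import numpy as np
from scipy.special import digamma

def expected_log(param):
    """Row-wise E[log x] under a Dirichlet with these parameters, equation (9)."""
    return digamma(param) - digamma(param.sum(axis=-1, keepdims=True))

def e_step(docs, alpha, eta, K, n_iter=50, tol=1e-4, seed=0):
    """Cycle through the phi, gamma and lambda updates of (27)."""
    rng = np.random.default_rng(seed)
    M, V = len(docs), eta.shape[0]
    # Random positive start for lambda breaks the symmetry between topics.
    lam = rng.gamma(100.0, 0.01, size=(K, V))
    gamma = np.ones((M, K))
    phi = [np.full((len(w), K), 1.0 / K) for w in docs]
    for _ in range(n_iter):
        e_log_beta = expected_log(lam)
        counts = np.zeros((V, K))
        for d, words in enumerate(docs):
            # phi update in log space, then normalization over topics.
            log_phi = expected_log(gamma[d])[None, :] + e_log_beta[:, words].T
            log_phi -= log_phi.max(axis=1, keepdims=True)
            phi[d] = np.exp(log_phi)
            phi[d] /= phi[d].sum(axis=1, keepdims=True)
            # gamma update for document d.
            gamma[d] = alpha + phi[d].sum(axis=0)
            # Accumulate sum_n phi_dnk * w_dnv; add.at handles repeated word ids.
            np.add.at(counts, words, phi[d])
        new_lam = eta[None, :] + counts.T
        change = np.abs(new_lam - lam).mean()
        lam = new_lam
        if change < tol:
            break
    return phi, gamma, lam

docs = [np.array([0, 0, 1]), np.array([2, 3, 3, 3])]
alpha, eta = np.full(2, 0.1), np.full(4, 0.01)
phi, gamma, lam = e_step(docs, alpha, eta, K=2)
```

A quick consistency check follows from (27): each row of `gamma` sums to $\sum_k\alpha_k + N_d$, and the entries of `lam` sum to $K\sum_v\eta_v$ plus the total number of words.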
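2.5 Deriving the M-step
After the E-step has found the best variational parameters $\phi, \gamma, \lambda$, we enter the M-step: fix the variational distribution $q(\theta,z,\beta;\phi,\gamma,\lambda)$ and look for the model parameters $\alpha,\eta$ that maximize $ELBO(q,w;\alpha, \eta)$, the lower bound of the log-likelihood. This requires the partial derivatives of the ELBO with respect to $\alpha$ and $\eta$, after which gradient ascent or Newton's method (which also uses second derivatives) can be used to find the optimal parameters.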
2.5.1 The model parameter $\alpha$
- Collect the ELBO terms that involve $\alpha$

Note that the model parameter $\alpha$ applies to all documents in the dataset, i.e. every document shares the same $\alpha$, so the derivation below sums over all documents.

$$
\begin{aligned}
ELBO(q,w;\alpha,\eta)_{[\alpha]} &= \sum_{d=1}^M[\, \log \Gamma(\sum_{k=1}^K\alpha_k) - \sum_{k=1}^K\log\Gamma(\alpha_k)\,] \\
&\qquad + \sum_{d=1}^M \sum_{k=1}^K(\alpha_k-1)[\Psi(\gamma_{dk})-\Psi(\sum_{i=1}^K\gamma_{di})]
\end{aligned} \tag{28}
$$

- Take the first-order partial derivative with respect to $\alpha_k$

$$
\begin{aligned}
\frac{\partial}{\partial\alpha_{k}}ELBO(q,w;\alpha,\eta)_{[\alpha]} &= M\cdot[\Psi(\sum_{i=1}^K\alpha_{i}) - \Psi(\alpha_k)\,]\\
&\qquad + \sum_{d=1}^M [\,\Psi(\gamma_{dk})-\Psi(\sum_{i=1}^K\gamma_{di})\,]
\end{aligned} \tag{29}
$$

- Take the second-order partial derivative with respect to $\alpha_j$

$$
\frac{\partial^2}{\partial\alpha_{k}\,\partial\alpha_j}ELBO(q,w;\alpha,\eta)_{[\alpha]} = M\cdot[\,\Psi'(\sum_{i=1}^K\alpha_i) - \delta(k,j)\Psi'(\alpha_k)\,] \tag{30}
$$

where

$$
\delta(k,j)=\left\{ \begin{aligned} 1, \,\,\, & k=j \\ 0, \,\,\, & k\ne j \end{aligned} \right.\tag{31}
$$
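Formulas (29) and (30) can be checked against finite differences of (28). The sketch below does this for made-up values of `gamma` and `alpha`; the function names are illustrative, and `polygamma(1, x)` from SciPy is the trigamma function $\Psi'(x)$.

```python
import numpy as np
from scipy.special import digamma, gammaln, polygamma

def alpha_objective(alpha, gamma):
    """Equation (28): the alpha-dependent part of the ELBO over all documents."""
    M = gamma.shape[0]
    e_log_theta = digamma(gamma) - digamma(gamma.sum(axis=1, keepdims=True))
    return (M * (gammaln(alpha.sum()) - gammaln(alpha).sum())
            + ((alpha - 1.0) * e_log_theta.sum(axis=0)).sum())

def alpha_gradient(alpha, gamma):
    """Equation (29)."""
    M = gamma.shape[0]
    e_log_theta = digamma(gamma) - digamma(gamma.sum(axis=1, keepdims=True))
    return M * (digamma(alpha.sum()) - digamma(alpha)) + e_log_theta.sum(axis=0)

def alpha_hessian(alpha, M):
    """Equation (30): a constant matrix minus a diagonal matrix."""
    return M * (polygamma(1, alpha.sum()) - np.diag(polygamma(1, alpha)))

def finite_difference(f, x, eps=1e-5):
    """Central differences of f at x, one coordinate at a time."""
    return np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                     for e in np.eye(len(x))])

gamma = np.array([[2.0, 1.0, 0.5],
                  [1.5, 3.0, 1.0],
                  [0.7, 0.9, 2.5],
                  [4.0, 1.0, 1.0]])
alpha = np.array([0.6, 1.1, 1.7])
num_grad = finite_difference(lambda a: alpha_objective(a, gamma), alpha)
num_hess = finite_difference(lambda a: alpha_gradient(a, gamma), alpha)
```

`num_grad` agrees with `alpha_gradient(alpha, gamma)` and `num_hess` with `alpha_hessian(alpha, 4)`. The Hessian does not depend on `gamma` at all, only on `alpha` and the number of documents.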
2.5.2 The model parameter $\eta$
- Collect the ELBO terms that involve $\eta$

$$
\begin{aligned}
ELBO(q,w;\alpha,\eta)_{[\eta]} &= K\log \Gamma(\sum_{v=1}^V\eta_v) - K\sum_{v=1}^V\log\Gamma(\eta_v) \\
&\qquad + \sum_{k=1}^K\sum_{v=1}^V(\eta_v-1)[\Psi(\lambda_{kv})-\Psi(\sum_{i=1}^V\lambda_{ki})]
\end{aligned} \tag{32}
$$

- Take the first-order partial derivative with respect to $\eta_i$

$$
\begin{aligned}
\frac{\partial}{\partial\eta_{i}}ELBO(q,w;\alpha,\eta)_{[\eta]} &= K\cdot[\Psi(\sum_{i'=1}^V\eta_{i'}) - \Psi(\eta_i)\,]\\
&\qquad + \sum_{k=1}^K [\,\Psi(\lambda_{ki})-\Psi(\sum_{i'=1}^V\lambda_{ki'})\,]
\end{aligned} \tag{33}
$$

- Take the second-order partial derivative with respect to $\eta_j$

$$
\frac{\partial^2}{\partial\eta_{i}\,\partial\eta_j}ELBO(q,w;\alpha,\eta)_{[\eta]} = K\cdot[\,\Psi'(\sum_{i'=1}^V\eta_{i'}) - \delta(i,j)\Psi'(\eta_{i})\,] \tag{34}
$$
2.5.3 M-step update formulas (Newton's method)
The M-step is usually solved with Newton's method, which typically needs fewer iterations than gradient ascent. Write $g$ for the gradient vector given by (29) or (33) and $H$ for the Hessian matrix given by (30) or (34). Because $H$ is a matrix, the Newton step multiplies the gradient by the inverse Hessian rather than dividing component by component:

$$
\left\{
\begin{aligned}
\alpha &\leftarrow \alpha - H(\alpha)^{-1}\, g(\alpha) \\
\eta &\leftarrow \eta - H(\eta)^{-1}\, g(\eta)
\end{aligned}
\right.\tag{35}
$$
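A minimal sketch of the Newton update for $\alpha$ follows. It relies on the structure of (30): the Hessian is a diagonal matrix plus a constant matrix, so $H^{-1}g$ can be obtained in linear time with the matrix inversion lemma instead of a general linear solve. The backtracking loop that keeps $\alpha$ positive and prevents the objective from decreasing is a safeguard added for this example, and all names and test values are illustrative.

```python
import numpy as np
from scipy.special import digamma, gammaln, polygamma

def newton_direction(grad, h, z):
    """Return -H^{-1} grad for H = diag(h) + z * ones((K, K))."""
    c = (grad / h).sum() / (1.0 / z + (1.0 / h).sum())
    return -(grad - c) / h

def newton_alpha(gamma, alpha0, n_iter=100, tol=1e-8):
    """Maximize (28) over alpha with Newton steps, for fixed gamma."""
    M = gamma.shape[0]
    # Sufficient statistic sum_d E_q[log theta_dk]; constant during the M-step.
    stat = (digamma(gamma) - digamma(gamma.sum(axis=1, keepdims=True))).sum(axis=0)

    def objective(a):
        return M * (gammaln(a.sum()) - gammaln(a).sum()) + ((a - 1.0) * stat).sum()

    alpha = np.asarray(alpha0, dtype=float).copy()
    for _ in range(n_iter):
        grad = M * (digamma(alpha.sum()) - digamma(alpha)) + stat      # (29)
        if np.abs(grad).max() < tol:
            break
        h = -M * polygamma(1, alpha)          # diagonal part of (30)
        z = M * polygamma(1, alpha.sum())     # constant part of (30)
        direction = newton_direction(grad, h, z)
        # Backtracking: halve the step until alpha stays positive
        # and the objective does not decrease.
        step = 1.0
        while step > 1e-10:
            candidate = alpha + step * direction
            if np.all(candidate > 0) and objective(candidate) >= objective(alpha):
                alpha = candidate
                break
            step *= 0.5
        else:
            break   # no acceptable step found; stop iterating
    return alpha

gamma = np.array([[2.0, 1.0, 0.5],
                  [1.5, 3.0, 1.0],
                  [0.7, 0.9, 2.5],
                  [4.0, 1.0, 1.0]])
alpha_hat = newton_alpha(gamma, alpha0=np.ones(3))
```

The same routine applies to $\eta$ after replacing $M$ by $K$ and `gamma` by `lam`, since (32) to (34) have the same form as (28) to (30).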