LDA and Gibbs Sampling
Introduction to Statistical Inference
To be added.
Properties of the $\mathrm{B}$ (Beta) and $\Gamma$ (Gamma) Functions
To be added.
The $\text{Beta}$ and $\text{Dirichlet}$ Distributions vs. the Binomial and Multinomial Distributions
To be added.
The LDA Model and Its Likelihood
LDA (Latent Dirichlet Allocation) describes the relationships among documents, topics, and words with a probabilistic generative model; see the graphical model in the original paper. For each word of each document, the original paper describes the generative process as follows:
- Choose $N \sim \text{Poisson}(\xi)$.
- Choose $\theta \sim \text{Dir}(\alpha)$.
- For each of the $N$ words $w_n$:
    - (a) Choose a topic $z_n \sim \text{Multinomial}(\theta)$.
    - (b) Choose a word $w_n$ from $p(w_n \mid z_n, \beta)$, a multinomial probability conditioned on the topic $z_n$.
Note that in this model the topic-word distribution has no prior of its own, so the model is not fully "Bayesian".
The LDA variant used with Gibbs sampling extends this graphical model with one extra variable $\vec{\phi}_k$: the topic-word distribution is also given a Dirichlet prior, making the model more thoroughly "Bayesian". Its generative process is (a Python sketch follows the notation list below):
- For each topic $k$:
    - Choose a topic-word distribution $\vec{\phi}_k \sim \text{Dir}(\beta)$.
- For each document $m$:
    - Choose a document-topic distribution $\vec{\theta}_m \sim \text{Dir}(\alpha)$.
    - For each word $n$:
        - (a) Choose a topic assignment $z_{m, n} \sim \text{Multinomial}(\vec{\theta}_m)$.
        - (b) Choose a word $w_{m, n} \sim \text{Multinomial}(\vec{\phi}_{z_{m, n}})$.
Notation:
- $M$: number of documents in the corpus
- $K$: number of topics
- $V$: vocabulary size
- $\vec{\alpha}$: hyperparameter of the document-topic Dirichlet prior
- $\vec{\beta}$: hyperparameter of the topic-word Dirichlet prior
- $\vec{\theta}_m$: topic distribution parameters of document $m$, i.e. the parameters of $p(z \mid d = m)$; the set over all documents is $\underline{\Theta} = \{\vec{\theta}_m\}_{m = 1}^M$
- $\vec{\phi}_k$: word distribution parameters of topic $k$, i.e. the parameters of $p(t \mid z = k)$; the set over all topics is $\underline{\Phi} = \{\vec{\phi}_k\}_{k = 1}^K$
- $N_m$: number of words in each document, Poisson-distributed with parameter $\xi$
- $z_{m, n}$: topic index chosen for the $n$-th word of document $m$
- $w_{m, n}$: vocabulary index of the $n$-th word of document $m$
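A minimal Python sketch of this generative process (all sizes and hyperparameter values are illustrative toy choices, not from the original papers):

```python
import numpy as np

rng = np.random.default_rng(0)

M, K, V, xi = 5, 3, 20, 50            # documents, topics, vocabulary, Poisson mean
alpha = np.full(K, 0.1)               # document-topic hyperparameter
beta = np.full(V, 0.01)               # topic-word hyperparameter

phi = rng.dirichlet(beta, size=K)     # topic plate: phi_k ~ Dir(beta), shape (K, V)
docs = []
for m in range(M):                    # document plate
    N_m = rng.poisson(xi)             # document length N_m ~ Poisson(xi)
    theta_m = rng.dirichlet(alpha)    # theta_m ~ Dir(alpha)
    z = rng.choice(K, size=N_m, p=theta_m)              # z_{m,n} ~ Mult(theta_m)
    w = np.array([rng.choice(V, p=phi[k]) for k in z])  # w_{m,n} ~ Mult(phi_{z_{m,n}})
    docs.append(w)
```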
By the definition of LDA, the probability that the $n$-th word of document $m$ equals term $t$ is:
$$
p(w_{m, n} = t \mid \vec{\theta}_m, \underline{\Phi}) = \sum_{k = 1}^K p(w_{m, n} = t \mid \vec{\phi}_k)\, p(z_{m, n} = k \mid \vec{\theta}_m)
$$
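With `phi` and a document's `theta_m` from the sketch above, this marginal is a one-line topic mixture:

```python
# p(w = t | theta_m, Phi) for every term t: sum_k theta_m[k] * phi[k, t]
p_w = theta_m @ phi    # shape (V,), sums to 1
```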
The joint distribution is:
$$
p(\vec{w}_m, \vec{z}_m, \vec{\theta}_m, \underline{\Phi} \mid \vec{\alpha}, \vec{\beta}) =
\overbrace{
\underbrace{\prod_{n = 1}^{N_m} p(w_{m, n} \mid \vec{\phi}_{z_{m, n}})\, p(z_{m, n} \mid \vec{\theta}_m)}_{\text{word plate}}
\; p(\vec{\theta}_m \mid \vec{\alpha})
}^{\text{document plate (1 document)}}
\;
\underbrace{p(\underline{\Phi} \mid \vec{\beta})}_{\text{topic plate}}
$$
Hence the likelihood of a single document is:
$$
\begin{aligned}
p(\vec{w}_m \mid \vec{\alpha}, \vec{\beta})
&= \iint p(\vec{\theta}_m \mid \vec{\alpha})\, p(\underline{\Phi} \mid \vec{\beta}) \prod_{n = 1}^{N_m} \sum_{z_{m, n}} p(w_{m, n} \mid \vec{\phi}_{z_{m, n}})\, p(z_{m, n} \mid \vec{\theta}_m)\; \mathrm{d}\underline{\Phi}\, \mathrm{d}\vec{\theta}_m \\
&= \iint p(\vec{\theta}_m \mid \vec{\alpha})\, p(\underline{\Phi} \mid \vec{\beta}) \prod_{n = 1}^{N_m} p(w_{m, n} \mid \vec{\theta}_m, \underline{\Phi})\; \mathrm{d}\underline{\Phi}\, \mathrm{d}\vec{\theta}_m
\end{aligned}
$$
And the likelihood of the whole corpus is:
$$
p(W \mid \vec{\alpha}, \vec{\beta}) = \prod_{m = 1}^M p(\vec{w}_m \mid \vec{\alpha}, \vec{\beta})
$$
Statistical Inference with Gibbs Sampling
To be added.
Collapsed LDA Gibbs sampler
To derive a Gibbs sampler for LDA we can use the latent-variable approach above. In LDA the latent variables are the $z_{m, n}$, the topic chosen for each word $w_{m, n}$ of each document. We do not need to sample $\underline{\Theta}$ and $\underline{\Phi}$, because both can be recovered from statistics of the $z_{m, n}$ (other references phrase this as the $z_{m, n}$ being sufficient statistics for them). This strategy of eliminating some variables by integrating them out is called "collapsing" and is commonly applied to Gibbs sampling.
The goal now reduces to obtaining the distribution $p(\vec{z} \mid \vec{w})$:
$$
p(\vec{z} \mid \vec{w}) = \frac{p(\vec{z}, \vec{w})}{p(\vec{w})} = \frac{\prod_{i = 1}^W p(z_i, w_i)}{\prod_{i = 1}^W \sum_{k = 1}^K p(z_i = k, w_i)}
$$
Hyperparameters are omitted here for the moment. The parameter space of this distribution is enormous; the hard part is the denominator, a sum over $K^W$ topic configurations, which is why Gibbs sampling is needed. In this setting we want to sample from $p(z_i \mid \vec{z}_{\neg i}, \vec{w})$ in order to simulate $p(\vec{z} \mid \vec{w})$. The latent-variable method gives us the full conditional, for which we need a closed form of the joint distribution; it factorizes as:
$$
p(\vec{w}, \vec{z} \mid \vec{\alpha}, \vec{\beta}) = p(\vec{w} \mid \vec{z}, \vec{\beta})\, p(\vec{z} \mid \vec{\alpha})
$$
The first factor is independent of $\vec{\alpha}$ (conditional independence: $\vec{w} \perp\!\!\!\perp \vec{\alpha} \mid \vec{z}$) and the second is independent of $\vec{\beta}$, so the two can be handled separately. For the first factor:
$$
p(\vec{w} \mid \vec{z}, \vec{\beta}) = \int p(\vec{w} \mid \vec{z}, \underline{\Phi})\, p(\underline{\Phi} \mid \vec{\beta})\, \mathrm{d}\underline{\Phi}
$$
Here $p(\underline{\Phi} \mid \vec{\beta})$ is the known prior, and $p(\vec{w} \mid \vec{z}, \underline{\Phi})$ can be read as the probability of observing all the words:
$$
p(\vec{w} \mid \vec{z}, \underline{\Phi}) = \prod_{i = 1}^W p(w_i \mid z_i) = \prod_{i = 1}^W \phi_{z_i, w_i}
$$
This assumes the words are independent. Regrouping the product by topic and vocabulary term gives:
$$
p(\vec{w} \mid \vec{z}, \underline{\Phi}) = \prod_{k = 1}^K \prod_{\{i: z_i = k\}} p(w_i = t \mid z_i = k) = \prod_{k = 1}^K \prod_{t = 1}^V \phi_{k, t}^{n_k^{(t)}}
$$
Note the difference between the indices $t$ and $i$: the former indexes terms in the vocabulary, the latter indexes word tokens. Keeping the two apart is essential for following the count statistics below.
We write $n_k^{(t)}$ for the number of times term $t$ is observed under topic $k$. The first factor can then be derived in full:
$$
\begin{aligned}
p(\vec{w} \mid \vec{z}, \vec{\beta}) &= \int p(\vec{w} \mid \vec{z}, \underline{\Phi})\, p(\underline{\Phi} \mid \vec{\beta})\, \mathrm{d}\underline{\Phi} \\
&= \prod_{z = 1}^K \int \frac{1}{\Delta(\vec{\beta})} \prod_{t = 1}^V \phi_{z, t}^{n_z^{(t)} + \beta_t - 1}\, \mathrm{d}\vec{\phi}_z \\
&= \prod_{z = 1}^K \frac{\Delta(\vec{n}_z + \vec{\beta})}{\Delta(\vec{\beta})}, \qquad \vec{n}_z = \{n_z^{(t)}\}_{t = 1}^V
\end{aligned}
$$
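Here $\Delta(\cdot)$ is the Dirichlet normalizing constant (notation from [1]); writing it out makes the last step a direct application of the Dirichlet integral:

$$
\Delta(\vec{\beta}) = \frac{\prod_{t = 1}^V \Gamma(\beta_t)}{\Gamma\!\left(\sum_{t = 1}^V \beta_t\right)}, \qquad
\int \prod_{t = 1}^V \phi_t^{\,n_t + \beta_t - 1}\, \mathrm{d}\vec{\phi} = \Delta(\vec{n} + \vec{\beta})
$$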
The derivation for $p(\vec{z} \mid \vec{\alpha})$ is analogous:
$$
p(\vec{z} \mid \vec{\alpha}) = \int p(\vec{z} \mid \underline{\Theta})\, p(\underline{\Theta} \mid \vec{\alpha})\, \mathrm{d}\underline{\Theta}
$$
Here $p(\underline{\Theta} \mid \vec{\alpha})$ is the known prior, and $p(\vec{z} \mid \underline{\Theta})$ is obtained by counting:
$$
p(\vec{z} \mid \underline{\Theta}) = \prod_{i = 1}^W p(z_i \mid d_i) = \prod_{m = 1}^M \prod_{k = 1}^K p(z = k \mid d = m)^{n_m^{(k)}} = \prod_{m = 1}^M \prod_{k = 1}^K \theta_{m, k}^{n_m^{(k)}}
$$
We write $d_i$ for the document index of each word token, and $n_m^{(k)}$ for the number of words in document $m$ assigned to topic $k$. The second factor can then be derived in full:
$$
\begin{aligned}
p(\vec{z} \mid \vec{\alpha}) &= \int p(\vec{z} \mid \underline{\Theta})\, p(\underline{\Theta} \mid \vec{\alpha})\, \mathrm{d}\underline{\Theta} \\
&= \prod_{m = 1}^M \int \frac{1}{\Delta(\vec{\alpha})} \prod_{k = 1}^K \theta_{m, k}^{n_m^{(k)} + \alpha_k - 1}\, \mathrm{d}\vec{\theta}_m \\
&= \prod_{m = 1}^M \frac{\Delta(\vec{n}_m + \vec{\alpha})}{\Delta(\vec{\alpha})}, \qquad \vec{n}_m = \{n_m^{(k)}\}_{k = 1}^K
\end{aligned}
$$
Combining the two factors gives the joint distribution:
$$
p(\vec{z}, \vec{w} \mid \vec{\alpha}, \vec{\beta}) = \prod_{z = 1}^K \frac{\Delta(\vec{n}_z + \vec{\beta})}{\Delta(\vec{\beta})} \prod_{m = 1}^M \frac{\Delta(\vec{n}_m + \vec{\alpha})}{\Delta(\vec{\alpha})}
$$
From this we can derive the Gibbs sampling formula $p(z_i = k \mid \vec{z}_{\neg i}, \vec{w})$ [1]:
$$
\begin{aligned}
p(z_i = k \mid \vec{z}_{\neg i}, \vec{w}) &= \frac{p(\vec{w}, \vec{z})}{p(\vec{w}, \vec{z}_{\neg i})} = \frac{p(\vec{w} \mid \vec{z})}{p(\vec{w}_{\neg i} \mid \vec{z}_{\neg i})\, p(w_i)} \cdot \frac{p(\vec{z})}{p(\vec{z}_{\neg i})} \\
&\propto \frac{\Delta(\vec{n}_z + \vec{\beta})}{\Delta(\vec{n}_{z, \neg i} + \vec{\beta})} \cdot \frac{\Delta(\vec{n}_m + \vec{\alpha})}{\Delta(\vec{n}_{m, \neg i} + \vec{\alpha})} \\
&\propto \frac{\Gamma(n_k^{(t)} + \beta_t)\, \Gamma\big(\sum_{t = 1}^V n_{k, \neg i}^{(t)} + \beta_t\big)}{\Gamma(n_{k, \neg i}^{(t)} + \beta_t)\, \Gamma\big(\sum_{t = 1}^V n_k^{(t)} + \beta_t\big)} \cdot \frac{\Gamma(n_m^{(k)} + \alpha_k)\, \Gamma\big(\sum_{k = 1}^K n_{m, \neg i}^{(k)} + \alpha_k\big)}{\Gamma(n_{m, \neg i}^{(k)} + \alpha_k)\, \Gamma\big(\sum_{k = 1}^K n_m^{(k)} + \alpha_k\big)} \\
&\propto \frac{n_{k, \neg i}^{(t)} + \beta_t}{\sum_{t = 1}^V n_{k, \neg i}^{(t)} + \beta_t} \cdot \frac{n_{m, \neg i}^{(k)} + \alpha_k}{\big[\sum_{k = 1}^K n_m^{(k)} + \alpha_k\big] - 1}
\end{aligned}
$$
Here $n_{\cdot, \neg i}^{(\cdot)}$ denotes the corresponding count with the $i$-th word token excluded from its document or topic. The first equality separates out $p(w_i)$ via conditional independence ($w_i \perp\!\!\!\perp \vec{w}_{\neg i} \mid \vec{z}_{\neg i}$); since $p(w_i)$ is a constant it can be dropped. The step from the first line to the second plugs the joint distribution above into both numerator and denominator (with the one word excluded in the denominator); the rest is $\Gamma$-function bookkeeping via $\Gamma(x + 1) = x\,\Gamma(x)$.
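The last line is exactly what a collapsed-Gibbs implementation computes for every token. Below is a minimal Python sketch of one sweep; all names are illustrative and follow the generative sketch above (`n_kt`, `n_mk`, `n_k` are the counts $n_k^{(t)}$, $n_m^{(k)}$ and the per-topic totals). The document-side denominator $\sum_k n_{m, \neg i}^{(k)} + \alpha_k$ does not depend on $k$, so it is dropped and the weights renormalized:

```python
import numpy as np

rng = np.random.default_rng(1)

def gibbs_sweep(docs, z, n_kt, n_mk, n_k, alpha, beta):
    """One collapsed-Gibbs pass over every word token.

    docs[m][n]: vocabulary index of the n-th token of document m
    z[m][n]:    its current topic assignment
    n_kt[k, t]: count of term t assigned to topic k
    n_mk[m, k]: count of tokens in document m assigned to topic k
    n_k[k]:     total tokens assigned to topic k (= n_kt.sum(axis=1))
    """
    K = n_kt.shape[0]
    for m, doc in enumerate(docs):
        for n, t in enumerate(doc):
            k_old = z[m][n]
            # Exclude token i from all counts: the "neg i" statistics.
            n_kt[k_old, t] -= 1
            n_mk[m, k_old] -= 1
            n_k[k_old] -= 1
            # Unnormalized full conditional p(z_i = k | z_neg_i, w).
            p = (n_kt[:, t] + beta[t]) / (n_k + beta.sum()) * (n_mk[m] + alpha)
            k_new = rng.choice(K, p=p / p.sum())
            # Re-add the token under its newly sampled topic.
            n_kt[k_new, t] += 1
            n_mk[m, k_new] += 1
            n_k[k_new] += 1
            z[m][n] = k_new
    return z
```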
Alternative derivation (begin)
The subtle part is what the step from the first line to the second really means; it can also be understood from another angle [4]:
$$
\begin{aligned}
p(z_i = j \mid \vec{z}_{\neg i}, \vec{w}) &\propto p(z_i = j, \vec{z}_{\neg i}, \vec{w}) \\
&= p(w_i \mid z_i = j, \vec{z}_{\neg i}, \vec{w}_{\neg i})\, p(z_i = j \mid \vec{z}_{\neg i}, \vec{w}_{\neg i}) \\
&= p(w_i \mid z_i = j, \vec{z}_{\neg i}, \vec{w}_{\neg i})\, p(z_i = j \mid \vec{z}_{\neg i})
\end{aligned}
$$
This also splits into two factors. For the first:
$$
\begin{aligned}
p(w_i \mid z_i = j, \vec{z}_{\neg i}, \vec{w}_{\neg i}) &= \int p(w_i \mid z_i = j, \phi_j)\, p(\phi_j \mid \vec{z}_{\neg i}, \vec{w}_{\neg i})\, \mathrm{d}\phi_j \\
&= \int \phi_{j, w_i}\, p(\phi_j \mid \vec{z}_{\neg i}, \vec{w}_{\neg i})\, \mathrm{d}\phi_j
\end{aligned}
$$

$$
\begin{aligned}
p(\phi_j \mid \vec{z}_{\neg i}, \vec{w}_{\neg i}) &\propto p(\vec{w}_{\neg i} \mid \phi_j, \vec{z}_{\neg i})\, p(\phi_j) \\
&\sim \text{Dirichlet}(\beta + n_{\neg i, j}^{(w)})
\end{aligned}
$$
Here $n_{\neg i, j}^{(w)}$ is the number of times word $w$ is assigned to topic $j$, not counting the current token. Using the expectation formula of the $\text{Dirichlet}$ distribution, we can derive:
$$
p(w_i \mid z_i = j, \vec{z}_{\neg i}, \vec{w}_{\neg i}) = \frac{n_{\neg i, j}^{(w_i)} + \beta}{n_{\neg i, j}^{(\cdot)} + V\beta}
$$
Here $n_{\neg i, j}^{(\cdot)}$ is the total number of words assigned to topic $j$, not counting the current token.
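The "expectation formula" invoked here is the standard mean of a Dirichlet distribution; spelled out (with $V$ the vocabulary size, matching the notation above):

$$
\mathbb{E}_{\text{Dir}(\vec{a})}[x_t] = \frac{a_t}{\sum_{t'} a_{t'}}, \qquad
\int \phi_{j, w_i}\, p(\phi_j \mid \vec{z}_{\neg i}, \vec{w}_{\neg i})\, \mathrm{d}\phi_j = \frac{n_{\neg i, j}^{(w_i)} + \beta}{n_{\neg i, j}^{(\cdot)} + V\beta}
$$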
The second factor is derived similarly:
$$
\begin{aligned}
p(z_i = j \mid \vec{z}_{\neg i}) &= \int p(z_i = j \mid \theta_d)\, p(\theta_d \mid \vec{z}_{\neg i})\, \mathrm{d}\theta_d \\
p(\theta_d \mid \vec{z}_{\neg i}) &\propto p(\vec{z}_{\neg i} \mid \theta_d)\, p(\theta_d) \\
&\sim \text{Dirichlet}(n_{\neg i, j}^{(d)} + \alpha)
\end{aligned}
$$
Here $n_{\neg i, j}^{(d)}$ is the number of words in document $d$ assigned to topic $j$, excluding the $i$-th word.
$$
p(z_i = j \mid \vec{z}_{\neg i}) = \frac{n_{\neg i, j}^{(d)} + \alpha}{n_{\neg i, \cdot}^{(d)} + K\alpha}
$$
Here $n_{\neg i, \cdot}^{(d)}$ is the total number of topic-assigned words in document $d$, excluding the $i$-th word.
The final Gibbs sampling conditional is therefore:
$$
p(z_i = j \mid \vec{z}_{\neg i}, \vec{w}) \propto \bigg(\frac{n_{\neg i, j}^{(w_i)} + \beta}{n_{\neg i, j}^{(\cdot)} + V\beta}\bigg) \bigg(\frac{n_{\neg i, j}^{(d)} + \alpha}{n_{\neg i, \cdot}^{(d)} + K\alpha}\bigg)
$$
Alternative derivation (end)
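Putting the pieces together: a hypothetical driver that initializes the assignments at random, builds the count matrices, and runs burn-in sweeps with the `gibbs_sweep` sketch above (it reuses `docs`, `rng`, `alpha`, and `beta` from the earlier sketches; the sweep count is illustrative):

```python
# Random initialization of topic assignments and count matrices.
K, V = alpha.size, beta.size
z = [rng.choice(K, size=len(doc)) for doc in docs]
n_kt = np.zeros((K, V))
n_mk = np.zeros((len(docs), K))
for m, doc in enumerate(docs):
    for n, t in enumerate(doc):
        n_kt[z[m][n], t] += 1
        n_mk[m, z[m][n]] += 1
n_k = n_kt.sum(axis=1)

# Burn-in; in practice, monitor convergence rather than fixing a count.
for _ in range(1000):
    z = gibbs_sweep(docs, z, n_kt, n_mk, n_k, alpha, beta)
```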
Parameter Estimation
Once Gibbs sampling has produced samples of $\vec{z}$, we can use them to approximate the distribution of $\vec{z}$ and read off the statistics we need. Writing $\mathcal{M} = \{\vec{w}, \vec{z}\}$ for the sampled state, we have:
$$
p(\vec{\theta}_m \mid \mathcal{M}, \vec{\alpha}) = \frac{1}{Z_{\theta_m}} \prod_{n = 1}^{N_m} p(z_{m, n} \mid \vec{\theta}_m) \cdot p(\vec{\theta}_m \mid \vec{\alpha}) = \text{Dir}(\vec{\theta}_m \mid \vec{n}_m + \vec{\alpha})
$$

$$
p(\vec{\phi}_k \mid \mathcal{M}, \vec{\beta}) = \frac{1}{Z_{\phi_k}} \prod_{\{i: z_i = k\}} p(w_i \mid \vec{\phi}_k) \cdot p(\vec{\phi}_k \mid \vec{\beta}) = \text{Dir}(\vec{\phi}_k \mid \vec{n}_k + \vec{\beta})
$$
Using the Dirichlet expectation property once more, the estimates of $\vec{\phi}_k$ and $\vec{\theta}_m$ are:
$$
\phi_{k, t} = \frac{n_k^{(t)} + \beta_t}{\sum_{t = 1}^V n_k^{(t)} + \beta_t}, \qquad
\theta_{m, k} = \frac{n_m^{(k)} + \alpha_k}{\sum_{k = 1}^K n_m^{(k)} + \alpha_k}
$$
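In terms of the count matrices from the sampler sketch, these estimates are one line each (`n_kt`, `n_mk`, `alpha`, `beta` as above):

```python
# Dirichlet posterior means: phi_hat[k, t] and theta_hat[m, k].
phi_hat = (n_kt + beta) / (n_kt + beta).sum(axis=1, keepdims=True)
theta_hat = (n_mk + alpha) / (n_mk + alpha).sum(axis=1, keepdims=True)
```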
References
[1] Gregor Heinrich. Parameter estimation for text analysis.
[2] William M. Darling. A Theoretical and Practical Implementation Tutorial on Topic Modeling and Gibbs Sampling.