Learning Variational Inference and Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is a generative model that can be used for document clustering and information retrieval. It assumes that each document is generated by the process illustrated in the figure below.
(Figure: the LDA graphical model)
First, we assume that the topic distribution of a single document is drawn from a Dirichlet distribution with hyper-parameter α. Based on this distribution, we assign a topic to each slot in the document (assume there is one slot per word); this is how we obtain z. Then, for each slot, we draw a word from the word distribution φ of the assigned topic, where each topic's word distribution is itself drawn from a Dirichlet distribution with parameter β.
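
To make the generative story concrete, here is a minimal NumPy sketch of that process for a single document. The topic count, vocabulary size, document length, and hyper-parameter values are made-up illustration choices, not anything prescribed by the model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not part of the model definition)
K, V, N = 3, 8, 20          # topics, vocabulary size, words per document
alpha = np.full(K, 0.5)     # Dirichlet hyper-parameter for topic proportions
beta = np.full(V, 0.1)      # Dirichlet hyper-parameter for per-topic word distributions

# phi_k ~ Dirichlet(beta): word distribution of each topic
phi = rng.dirichlet(beta, size=K)                    # shape (K, V)

# theta ~ Dirichlet(alpha): topic proportions of one document
theta = rng.dirichlet(alpha)                         # shape (K,)

# For every word slot: draw a topic z_n, then a word w_n from that topic
z = rng.choice(K, size=N, p=theta)                   # latent topic assignments
w = np.array([rng.choice(V, p=phi[k]) for k in z])   # observed word ids

print("topic proportions:", np.round(theta, 2))
print("topic assignments:", z)
print("document (word ids):", w)
```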

Here w is a word in a document, which is observable once we are given a data set. What we want to do is infer the values of all the latent variables, which is quite difficult.

Here we introduce an interesting method called variational inference: we define a new distribution of our own and tune its parameters so that it approximates the target distribution. To make life easier, we assume the new distribution belongs to the exponential family and factorizes as q(Z) = q(z1, z2, ..., zn) = Π_i q(z_i) (the mean-field assumption). For an exponential family distribution, the pdf can be written as

P(z|x,β) = h(z) exp( η(x,β)^T T(z) - A(η(x,β)) ).
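
As a concrete instance of this form (an illustrative example, not part of the original derivation), take a Bernoulli distribution with mean μ:

P(z|μ) = μ^z (1-μ)^(1-z) = exp( z log(μ/(1-μ)) + log(1-μ) ),

so h(z) = 1, T(z) = z, the natural parameter is η = log(μ/(1-μ)), and A(η) = -log(1-μ) = log(1 + e^η).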

To explain how variational inference works, let us use a very simple example. Say we have a model with an observed variable x and latent variables z, β that we want to infer. Assume the posterior of the hidden variables, P(z,β|x), is intractable (we don't know its distribution type, and it is quite complex), so we can hardly work it out directly. But why not define a new distribution and try to make it approximate the posterior? So we write

P(x) = P(z,β,x) / P(z,β|x),

log P(x) = log P(z,β,x) - log( P(z,β|x) / q(z,β) ) - log q(z,β),
taking the expectation with respect to q(z,β) on both sides, we have
log P(x) = ∫∫ log P(z,β,x) q(z,β) dz dβ - ∫∫ log( P(z,β|x) / q(z,β) ) q(z,β) dz dβ - ∫∫ log q(z,β) q(z,β) dz dβ
= E_q(z,β)[ log P(z,β,x) ] - E_q(z,β)[ log q(z,β) ] + KL( q(z,β) || P(z,β|x) ).
The first two terms are called the Evidence Lower Bound (ELBO), and the third term is the KL divergence between the two distributions. If we find parameters for q such that the KL divergence reaches its minimum, we can say we have found a new distribution that approximates our objective. Since log P(x) does not depend on q, minimizing the KL divergence is equivalent to maximizing the ELBO, so our goal becomes an optimization problem: argmax_{λ,Φ} ELBO.
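
As a quick sanity check of this decomposition, here is a small NumPy sketch for a discrete toy model with a single latent variable (the joint probabilities and q below are made-up numbers). It verifies numerically that log P(x) = ELBO + KL(q || P(z|x)), which is exactly why minimizing the KL term is the same as maximizing the ELBO.

```python
import numpy as np

# Arbitrary toy joint p(z, x) over 3 latent states and a single observed x
p_joint = np.array([0.10, 0.25, 0.15])   # p(z, x=x_obs) for z = 0, 1, 2
p_x = p_joint.sum()                      # evidence p(x)
p_post = p_joint / p_x                   # true posterior p(z | x)

# An arbitrary variational distribution q(z)
q = np.array([0.2, 0.5, 0.3])

elbo = np.sum(q * (np.log(p_joint) - np.log(q)))   # E_q[log p(z,x)] - E_q[log q(z)]
kl = np.sum(q * (np.log(q) - np.log(p_post)))      # KL(q || p(z|x))

print(np.log(p_x), elbo + kl)   # the two numbers agree: log p(x) = ELBO + KL
```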

First, we write down the ELBO:

L(λ,Φ) = E_q(z,β)[ log P(z,β,x) ] - E_q(z,β)[ log q(z,β) ].
Then, assume that all the distributions of z and β belong to the exponential family, so
P(z|x,β) = h(z) exp( η(x,β)^T T(z) - A(η(x,β)) ),
P(β|x,z) = h(β) exp( η(x,z)^T T(β) - A(η(x,z)) ),
q(z|λ) = h(z) exp( η(λ)^T T(z) - A(η(λ)) ),
q(β|Φ) = h(β) exp( η(Φ)^T T(β) - A(η(Φ)) ).

Now look at the ELBO. The first part of the ELBO can be written as

E_q[ log P(z|x,β) ] + E_q[ log P(β|x) ]

(up to the additive constant log P(x)), and the second part as

E_q[ log q(z|λ) ] + E_q[ log q(β|Φ) ].

Then we optimize the ELBO by fixing one parameter and adjusting the other, iteratively. For example, fix λ and adjust Φ. Since λ is fixed, we can drop the terms that do not involve β and keep only E_q[ log P(β|x,z) ] - E_q[ log q(β|Φ) ]; substituting the exponential family pdfs and dropping the terms that do not depend on Φ, we obtain

L_Φ = E_q[ η(x,z)^T T(β) ] - η(Φ)^T E_q[ T(β) ] + A(η(Φ)),

where E_q[ T(β) ] = ∇A(η(Φ)) (a standard exponential family identity), so the expression becomes
L_Φ = E_q[ η(x,z) ]^T ∇A(η(Φ)) - η(Φ)^T ∇A(η(Φ)) + A(η(Φ)).

To find the optimum, set dL_Φ/dη(Φ) = 0. Since dL_Φ/dη(Φ) = ∇²A(η(Φ)) ( E_q[ η(x,z) ] - η(Φ) ), the result is

η(Φ) = E_q[ η(x,z) ],

where the expectation is taken under the other factor q(z|λ).
The same process applies to the other parameter. Repeat the updates alternately until convergence.
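
To make the alternating scheme concrete, here is a minimal coordinate-ascent sketch for a textbook toy model rather than LDA: a univariate Gaussian with unknown mean μ and precision τ, a conjugate prior, and the mean-field family q(μ)q(τ). The prior values and the data are made-up illustration choices; the point is only the structure of the alternating updates.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.5, size=200)   # made-up data
N, xbar, xsq = len(x), x.mean(), np.sum(x**2)

# Priors (illustrative values): mu ~ N(mu0, (lam0*tau)^-1), tau ~ Gamma(a0, b0)
mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0

# Variational factors: q(mu) = N(mu_n, 1/lam_n), q(tau) = Gamma(a_n, b_n)
mu_n, lam_n = xbar, 1.0
a_n = a0 + (N + 1) / 2      # this update does not depend on the other factor
b_n = b0

for _ in range(50):
    # Update q(tau) with q(mu) fixed: needs E[mu] and E[mu^2]
    e_mu, e_mu2 = mu_n, mu_n**2 + 1.0 / lam_n
    b_n = b0 + 0.5 * (lam0 * (e_mu2 - 2 * e_mu * mu0 + mu0**2)
                      + xsq - 2 * e_mu * N * xbar + N * e_mu2)
    # Update q(mu) with q(tau) fixed: needs E[tau] = a_n / b_n
    e_tau = a_n / b_n
    mu_n = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lam_n = (lam0 + N) * e_tau

print("variational mean of mu :", mu_n)
print("variational mean of tau:", a_n / b_n, "(true precision =", 1 / 1.5**2, ")")
```

After a few iterations the two factors stop changing, and their means land close to the values that generated the data; the same fix-one-update-the-other loop is what the LDA updates will follow.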

In the next blog post, I will discuss how to apply this technique to Latent Dirichlet Allocation.

Thanks to Dr Richard Xu's (http://www.uts.edu.au/staff/yida.xu) lecture notes and Prof David Blei's online lectures (https://www.youtube.com/watch?v=DDq3OVp9dNA&t=2425s).
