Learning Variational Inference and Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA) is a generative model that can be used for document clustering and information retrieval. It assumes a generative process for each document, as shown in the picture.
First of all, we assume that the topic distribution within a single document is generated by a Dirichlet distribution with hyper-parameter α. Based on this distribution, we assign each slot in the document a topic (assume there is one slot per word); that is how we obtain z. After that, for each slot we draw a word from the word distribution of its assigned topic, ϕ, which is itself generated by a Dirichlet distribution with parameter β. The w is a word in a document, which is observable once we are given a data set. What we want to do is find the actual values of all the latent variables, which is quite difficult.
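This generative process can be sketched in a few lines of numpy. The corpus sizes and hyper-parameter values below are illustrative assumptions, not anything specified in the post:

```python
# A minimal sketch of the LDA generative process described above.
import numpy as np

rng = np.random.default_rng(0)

n_topics, vocab_size = 3, 8
n_docs, doc_len = 5, 20

alpha = np.full(n_topics, 0.5)   # Dirichlet prior over per-document topic mixtures
beta = np.full(vocab_size, 0.1)  # Dirichlet prior over per-topic word distributions

# phi[k] is the word distribution of topic k, drawn from Dirichlet(beta)
phi = rng.dirichlet(beta, size=n_topics)

docs = []
for _ in range(n_docs):
    theta = rng.dirichlet(alpha)                     # topic mixture of this document
    z = rng.choice(n_topics, size=doc_len, p=theta)  # a topic for each word slot
    w = np.array([rng.choice(vocab_size, p=phi[k]) for k in z])  # a word per slot
    docs.append(w)

print(len(docs), docs[0].shape)  # 5 documents of 20 word ids each
```

Only θ (here `theta`), z, and ϕ (here `phi`) are latent; the word ids in `docs` are the observed w.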
Here we introduce an interesting method called variational inference: we try to find parameters such that a new distribution, defined by us, approximates the target distribution. To make life easier, we assume that the new distribution belongs to the exponential family and obeys the mean-field factorization q(Z) = q(z1, z2, ..., zn) = Π_i q(z_i). For an exponential-family distribution, the pdf can be written as p(x | η) = h(x) exp(ηᵀT(x) − A(η)), where η is the natural parameter, T(x) the sufficient statistic, and A(η) the log-partition function.
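As a quick sanity check of this form, here is a small example of my own: a Bernoulli(θ) distribution written as h(x) exp(η·T(x) − A(η)) with h(x) = 1, T(x) = x, natural parameter η = log(θ/(1−θ)), and A(η) = log(1 + e^η):

```python
# Verify numerically that Bernoulli(theta) matches its exponential-family form.
import math

theta = 0.3
eta = math.log(theta / (1 - theta))  # natural parameter
A = math.log(1 + math.exp(eta))      # log-partition function

for x in (0, 1):
    direct = theta**x * (1 - theta)**(1 - x)  # standard pmf
    expfam = math.exp(eta * x - A)            # exponential-family form, h(x) = 1
    print(x, direct, expfam)                  # the two columns agree
```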
In order to explain how variational inference works, let us use a very simple example. Say we have a model with an observed variable x and latent variables z and β, which we want to infer. Assume the posterior of the hidden variables, P(z, β | x), is quite intractable (we do not know its distribution type, and it is quite complex), so we can hardly work it out directly. But why not define a new distribution q(z, β) and try to make it approximate the posterior? So, we write
log P(x) = log P(x, z, β) − log P(z, β | x) = [log P(x, z, β) − log q(z, β)] + [log q(z, β) − log P(z, β | x)].
Taking the expectation of both sides with respect to q, we have
log P(x) = E_q[log P(x, z, β)] − E_q[log q(z, β)] + E_q[log q(z, β) − log P(z, β | x)].
The first two terms are called the Evidence Lower Bound (ELBO), and the third term is the KL divergence between the two distributions. If we find parameters for the distribution q such that the KL divergence reaches its minimum, we can say that we have found a new distribution approximating our objective. Minimizing the KL divergence is equivalent to maximizing the ELBO (their sum, log P(x), does not depend on q), so our goal is converted into an optimization problem: argmax_{λ,Φ} ELBO.
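This decomposition is easy to verify numerically on a toy discrete model; the joint probabilities below are made up purely for illustration:

```python
# Toy check that log p(x) = ELBO + KL(q || p(z|x)) for a single observed x
# and a latent z in {0, 1}. The numbers are arbitrary.
import math

p_joint = {0: 0.12, 1: 0.28}  # z -> p(x, z) for the observed x
p_x = sum(p_joint.values())   # evidence p(x)

q = {0: 0.5, 1: 0.5}          # some variational distribution over z

elbo = sum(q[z] * (math.log(p_joint[z]) - math.log(q[z])) for z in q)
kl = sum(q[z] * (math.log(q[z]) - math.log(p_joint[z] / p_x)) for z in q)

print(elbo + kl, math.log(p_x))  # the two values agree
```

Whatever q we pick, `elbo + kl` equals log p(x), and since KL ≥ 0 the ELBO is indeed a lower bound on the evidence.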
First, we write down the ELBO:
L = E_q[log P(x, z, β)] − E_q[log q(z, β)].
Then, assume the variational distributions of z and β take exponential-family forms q(z | Φ) and q(β | λ), with variational parameters Φ and λ. Then look at the ELBO. The first part of the ELBO can be written, by factorizing the joint, as
E_q[log P(β)] + E_q[log P(z | β)] + E_q[log P(x | z, β)].
Then we optimize the ELBO by fixing one parameter and adjusting the other, iteratively. For example, we fix Φ and adjust λ. Because Φ is fixed and λ only governs q(β), we can drop the terms that involve z alone and keep the β parts; substituting the exponential-family pdfs, we obtain the ELBO as a function of λ, L(λ). To find the optimal value, simply set dL/dλ = 0; the result is that the optimal λ equals the expectation, under q(z | Φ), of the natural parameter of the conditional P(β | x, z).
The same process applies to the other parameter. Repeat the updates alternately until convergence.
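The fix-one-adjust-the-other loop can be illustrated on the classic mean-field example of approximating a correlated bivariate Gaussian with a factorized q(z1)q(z2); this is a different model from LDA and the exponential-family setup above, chosen by me only because its two coordinate updates are one line each:

```python
# Coordinate-ascent sketch: approximate p(z1, z2) = N(mu, inv(Lambda))
# with a factorized q(z1) q(z2). The known mean-field updates fix one
# factor's mean and solve for the other:
#   m1 = mu1 - (Lambda12 / Lambda11) * (m2 - mu2)
#   m2 = mu2 - (Lambda21 / Lambda22) * (m1 - mu1)
# The model values below are illustrative.
mu = (1.0, -1.0)
Lam = ((2.0, 0.8),
       (0.8, 2.0))   # precision matrix (symmetric, positive definite)

m1, m2 = 0.0, 0.0    # initial variational means
for _ in range(50):  # alternate the two updates until convergence
    m1 = mu[0] - Lam[0][1] / Lam[0][0] * (m2 - mu[1])
    m2 = mu[1] - Lam[1][0] / Lam[1][1] * (m1 - mu[0])

print(m1, m2)        # converges to the true mean (1.0, -1.0)
```

Each update is exact given the other factor, so every iteration raises the ELBO, just as in the λ/Φ alternation above.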
In the next blog, I will discuss how to apply this technique to Latent Dirichlet Allocation.
Thanks to Dr Richard Xu's (http://www.uts.edu.au/staff/yida.xu) lecture notes and Prof David Blei's online lectures (https://www.youtube.com/watch?v=DDq3OVp9dNA&t=2425s).