Learning Variational Inference and Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA) is a generative model that can be used for document clustering and information retrieval. It assumes a generative process for each document, as shown in the picture.
First of all, we assume that the topic distribution within a single document is generated by a Dirichlet distribution with hyper-parameter α. Based on this distribution, we assign each slot in the document a topic (assume there is one slot per word); that is how we obtain z. After that, for each slot we draw a word from the word distribution of its assigned topic, ϕ, which is itself generated by a Dirichlet distribution with parameter β. The w is a word in a document, which is observable once we are given a data set. What we want to do is find the actual values of all the latent variables, which is quite difficult.
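This generative process can be sketched in a few lines of numpy. The corpus sizes and hyper-parameter values below are illustrative assumptions, not anything specified in the post:

```python
# A minimal sketch of the LDA generative process described above.
import numpy as np

rng = np.random.default_rng(0)

n_topics, vocab_size = 3, 8
n_docs, doc_len = 5, 20

alpha = np.full(n_topics, 0.5)   # Dirichlet prior over per-document topic mixtures
beta = np.full(vocab_size, 0.1)  # Dirichlet prior over per-topic word distributions

# phi[k] is the word distribution of topic k, drawn from Dirichlet(beta)
phi = rng.dirichlet(beta, size=n_topics)

docs = []
for _ in range(n_docs):
    theta = rng.dirichlet(alpha)                     # topic mixture of this document
    z = rng.choice(n_topics, size=doc_len, p=theta)  # a topic for each word slot
    w = np.array([rng.choice(vocab_size, p=phi[k]) for k in z])  # a word per slot
    docs.append(w)

print(len(docs), docs[0].shape)  # 5 documents of 20 word ids each
```

Only θ (here `theta`), z, and ϕ (here `phi`) are latent; the word ids in `docs` are the observed w.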
Here we introduce an interesting method called variational inference: we try to find parameters such that a new distribution, defined by us, approximates the target distribution. To make life easier, we assume that the new distribution belongs to the exponential family and obeys the mean-field factorization q(Z) = q(z1, z2, ..., zn) = Π_i q(z_i). For an exponential-family distribution, the pdf can be written as p(x | η) = h(x) exp(ηᵀT(x) − A(η)), where η is the natural parameter, T(x) the sufficient statistic, and A(η) the log-partition function.
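As a quick sanity check of this form, here is a small example of my own: a Bernoulli(θ) distribution written as h(x) exp(η·T(x) − A(η)) with h(x) = 1, T(x) = x, natural parameter η = log(θ/(1−θ)), and A(η) = log(1 + e^η):

```python
# Verify numerically that Bernoulli(theta) matches its exponential-family form.
import math

theta = 0.3
eta = math.log(theta / (1 - theta))  # natural parameter
A = math.log(1 + math.exp(eta))      # log-partition function

for x in (0, 1):
    direct = theta**x * (1 - theta)**(1 - x)  # standard pmf
    expfam = math.exp(eta * x - A)            # exponential-family form, h(x) = 1
    print(x, direct, expfam)                  # the two columns agree
```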
In order to explain how variational inference works, let us use a very simple example. Say we have a model with an observed variable x and latent variables z and β, which we want to infer. Assume the posterior of the hidden variables, P(z, β | x), is quite intractable (we do not know its distribution type, and it is quite complex), so we can hardly work it out directly. But why not define a new distribution q(z, β) and try to make it approximate the posterior? So, we write
log P(x) = log P(x, z, β) − log P(z, β | x) = [log P(x, z, β) − log q(z, β)] + [log q(z, β) − log P(z, β | x)].
Taking the expectation of both sides with respect to q, we have
log P(x) = E_q[log P(x, z, β)] − E_q[log q(z, β)] + E_q[log q(z, β) − log P(z, β | x)].
The first two terms are called the Evidence Lower Bound (ELBO), and the third term is the KL divergence between the two distributions. If we find parameters for the distribution q such that the KL divergence reaches its minimum, we can say that we have found a new distribution approximating our objective. Minimizing the KL divergence is equivalent to maximizing the ELBO (their sum, log P(x), does not depend on q), so our goal is converted into an optimization problem: argmax_{λ,Φ} ELBO.
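This decomposition is easy to verify numerically on a toy discrete model; the joint probabilities below are made up purely for illustration:

```python
# Toy check that log p(x) = ELBO + KL(q || p(z|x)) for a single observed x
# and a latent z in {0, 1}. The numbers are arbitrary.
import math

p_joint = {0: 0.12, 1: 0.28}  # z -> p(x, z) for the observed x
p_x = sum(p_joint.values())   # evidence p(x)

q = {0: 0.5, 1: 0.5}          # some variational distribution over z

elbo = sum(q[z] * (math.log(p_joint[z]) - math.log(q[z])) for z in q)
kl = sum(q[z] * (math.log(q[z]) - math.log(p_joint[z] / p_x)) for z in q)

print(elbo + kl, math.log(p_x))  # the two values agree
```

Whatever q we pick, `elbo + kl` equals log p(x), and since KL ≥ 0 the ELBO is indeed a lower bound on the evidence.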
First, we write down the ELBO:
L = E_q[log P(x, z, β)] − E_q[log q(z, β)].
Then, assume the variational distributions of z and β take exponential-family forms q(z | Φ) and q(β | λ), with variational parameters Φ and λ. Then look at the ELBO. The first part of the ELBO can be written, by factorizing the joint, as
E_q[log P(β)] + E_q[log P(z | β)] + E_q[log P(x | z, β)].
Then we optimize the ELBO by fixing one parameter and adjusting the other, iteratively. For example, we fix Φ and adjust λ. Because Φ is fixed and λ only governs q(β), we can drop the terms that involve z alone and keep the β parts; substituting the exponential-family pdfs, we obtain the ELBO as a function of λ, L(λ). To find the optimal value, simply set dL/dλ = 0; the result is that the optimal λ equals the expectation, under q(z | Φ), of the natural parameter of the conditional P(β | x, z).
The same process applies to the other parameter. Repeat the updates alternately until convergence.
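The fix-one-adjust-the-other loop can be illustrated on the classic mean-field example of approximating a correlated bivariate Gaussian with a factorized q(z1)q(z2); this is a different model from LDA and the exponential-family setup above, chosen by me only because its two coordinate updates are one line each:

```python
# Coordinate-ascent sketch: approximate p(z1, z2) = N(mu, inv(Lambda))
# with a factorized q(z1) q(z2). The known mean-field updates fix one
# factor's mean and solve for the other:
#   m1 = mu1 - (Lambda12 / Lambda11) * (m2 - mu2)
#   m2 = mu2 - (Lambda21 / Lambda22) * (m1 - mu1)
# The model values below are illustrative.
mu = (1.0, -1.0)
Lam = ((2.0, 0.8),
       (0.8, 2.0))   # precision matrix (symmetric, positive definite)

m1, m2 = 0.0, 0.0    # initial variational means
for _ in range(50):  # alternate the two updates until convergence
    m1 = mu[0] - Lam[0][1] / Lam[0][0] * (m2 - mu[1])
    m2 = mu[1] - Lam[1][0] / Lam[1][1] * (m1 - mu[0])

print(m1, m2)        # converges to the true mean (1.0, -1.0)
```

Each update is exact given the other factor, so every iteration raises the ELBO, just as in the λ/Φ alternation above.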
In the next blog, I will discuss how to apply this technique to Latent Dirichlet Allocation.
Thanks to Dr Richard Xu's (http://www.uts.edu.au/staff/yida.xu) lecture notes and Prof David Blei's online lectures (https://www.youtube.com/watch?v=DDq3OVp9dNA&t=2425s).