Bayesian Learning: ELBO, Variational Inference, and EM Algorithm

拉普拉斯的汪

Article link: https://blog.csdn.net/qq_39599295/article/details/119944087

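This article introduces key concepts in Bayesian learning: latent variable models; the definition, derivation, and meaning of the evidence lower bound (ELBO); and how variational inference and the EM algorithm work. By explaining the gap between the ELBO and the evidence, it describes the role of variational inference in approximating the posterior distribution. It also discusses how the EM algorithm can be viewed as coordinate ascent on the evidence lower bound, and uses the Gaussian Mixture Model (GMM) to illustrate maximum likelihood estimation.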

Abstract generated automatically by CSDN.

Reference:

https://mbernste.github.io/posts/elbo/

https://mbernste.github.io/posts/variational_inference/

https://mbernste.github.io/posts/em/

https://mbernste.github.io/posts/gmm_em/

Theodoridis S. Machine learning: a Bayesian and optimization perspective[M]. Academic press, 2015.

Content

Latent variable model

We posit that our observed data $x$ is a realization of some random variable $X$.
We posit the existence of another random variable $Z$, where $X$ and $Z$ follow a joint distribution $p(X,Z;\theta)$ parameterized by $\theta$.
Our data is a realization of $X$ only, not of $Z$, so $Z$ remains unobserved (i.e. latent).
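
As a concrete illustration of this setup, here is a minimal sketch that samples from one possible latent variable model. The choice of a two-component Gaussian mixture, the function name `sample_latent_variable_model`, and all parameter values are assumptions made for illustration only.

```python
import random

def sample_latent_variable_model(n, weights, means, stds, seed=0):
    """Draw n pairs (x, z): z ~ Categorical(weights), x | z ~ Normal(means[z], stds[z])."""
    rng = random.Random(seed)
    xs, zs = [], []
    for _ in range(n):
        z = rng.choices(range(len(weights)), weights=weights)[0]  # latent component
        x = rng.gauss(means[z], stds[z])                          # observed value
        xs.append(x)
        zs.append(z)
    return xs, zs

# theta = (weights, means, stds); the values are arbitrary illustrative choices
xs, zs = sample_latent_variable_model(1000, [0.3, 0.7], [-2.0, 3.0], [1.0, 0.5])
observed = xs  # in a real problem only xs is available; zs stays hidden
```

The simulator can see `zs`, but an analyst given only `observed` cannot, which is exactly what makes $Z$ latent.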

There are two predominant tasks that we may be interested in accomplishing:

Given some fixed value of $\theta$, compute the posterior distribution $p(Z|X;\theta)$. [Can be approximated by variational inference.]
Given that $\theta$ is unknown, find the maximum likelihood estimate of $\theta$. [Can be solved by EM.]

Both variational inference and EM rely on the ELBO.

The Evidence Lower Bound (ELBO)

What is the ELBO?

To understand the evidence lower bound, we must first understand what we mean by “evidence”: it is just a name given to the log-likelihood function evaluated at a fixed $\theta$:
$\text{evidence}:=\log p(x;\theta) \tag{ELBO.1}$

Why is this quantity called the “evidence”?

Intuitively, if we have chosen the right model $p$ and $\theta$, then we would expect the marginal probability of our observed data $x$ to be high. Thus, a higher value of $\log p(x;\theta)$ indicates, in some sense, that we may be on the right track with the model $p$ and parameters $\theta$ that we have chosen. That is, this quantity is “evidence” that we have chosen the right model for the data.

If we happen to know (or posit) that $Z$ follows some distribution denoted by $q$, such that
$p(x,z;\theta):=p(x|z;\theta)q(z) \tag{ELBO.2}$
then the evidence lower bound is just a lower bound on the evidence that makes use of the known $q$. Specifically,
$\log p(x;\theta)\ge E_{Z\sim q}\left[\log \frac{p(x,Z;\theta)}{q(Z)} \right] \tag{ELBO.3}$
where the ELBO is simply the right-hand side of the above inequality:
$\text{ELBO}:=E_{Z\sim q}\left[\log \frac{p(x,Z;\theta)}{q(Z)} \right] \tag{ELBO.4}$

Derivation

Jensen’s inequality: if $X$ is a random variable and $\varphi$ is a convex function, then
$\varphi(E[X])\le E[\varphi(X)]$
and the inequality is reversed when $\varphi$ is concave. Since $\log (\cdot)$ is a concave function,
$\begin{aligned} E_{Z\sim q}\left[\log \frac{p(x,Z;\theta)}{q(Z)} \right]&\le \log \left[E_{Z\sim q}\left( \frac{p(x,Z;\theta)}{q(Z)}\right) \right]\\ &=\log \left[\int q(z)\frac{p(x,z;\theta)}{q(z)}\,dz \right]\\ &=\log \left[\int p(x,z;\theta)\,dz \right]\\ &=\log p(x;\theta) \end{aligned}$
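
The Jensen step can be checked numerically. The sketch below is not from the referenced posts; it assumes a hypothetical toy model $z\sim N(0,1)$, $x|z\sim N(z,1)$, an arbitrary $q=N(0.5,1)$, and an arbitrary observation $x=1.5$. It draws samples from $q$, averages the log-weights $\log[p(x,z)/q(z)]$ to estimate the ELBO, and takes the log of the averaged weights to estimate the evidence.

```python
import math
import random

def log_normal_pdf(v, mean, std):
    """Log density of Normal(mean, std^2) at v."""
    return -0.5 * math.log(2 * math.pi * std**2) - (v - mean)**2 / (2 * std**2)

def log_weights(x, q_mean, q_std, n, seed=0):
    """log[p(x, z_i) / q(z_i)] for z_i ~ q, under z ~ N(0,1), x | z ~ N(z,1)."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        z = rng.gauss(q_mean, q_std)
        log_joint = log_normal_pdf(z, 0.0, 1.0) + log_normal_pdf(x, z, 1.0)
        out.append(log_joint - log_normal_pdf(z, q_mean, q_std))
    return out

x = 1.5
lw = log_weights(x, q_mean=0.5, q_std=1.0, n=20000)

elbo_hat = sum(lw) / len(lw)  # Monte Carlo estimate of E_q[log w]
top = max(lw)                 # log-mean-exp, shifted by the max for stability
log_evidence_hat = top + math.log(sum(math.exp(v - top) for v in lw) / len(lw))

# For this toy model the marginal of x is N(0, 2), so the evidence is known exactly.
exact_log_evidence = log_normal_pdf(x, 0.0, math.sqrt(2.0))
```

For any finite sample, `elbo_hat` cannot exceed `log_evidence_hat`, because Jensen’s inequality also holds for the empirical distribution of the weights.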

The gap between the evidence and the ELBO

It turns out that the gap between the evidence and the ELBO is precisely the Kullback-Leibler divergence between $q(Z)$ and $p(Z|x;\theta)$.
$\text{evidence}-\text{ELBO}=\log p(x;\theta)-E_{Z\sim q}\left[\log \frac{p(x,Z;\theta)}{q(Z)} \right] =KL(q(Z)\|p(Z|x;\theta)) \tag{ELBO.5}$
This fact forms the basis of the variational inference algorithm for approximate Bayesian inference.

Derivation
Using $\int q(z)\,dz=1$ in the first step and $p(x,z;\theta)=p(z|x;\theta)\,p(x;\theta)$ in the third,
$\begin{aligned} \log p(x;\theta)-E_{Z\sim q}\left[\log \frac{p(x,Z;\theta)}{q(Z)} \right] &= \int q(z)\log p(x;\theta)\,dz-\int q(z)\log \frac{p(x,z;\theta)}{q(z)}\,dz\\ &= \int q(z) \log \frac{p(x;\theta)\, q(z)}{p(x,z;\theta)}\,dz\\ &= \int q(z) \log \frac{q(z)}{p(z|x;\theta)}\,dz\\ &= KL(q(Z)\|p(Z|x;\theta)) \end{aligned}$
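
The identity (ELBO.5) can be verified exactly when $Z$ is discrete, since every integral becomes a finite sum. The following is a small sketch; the three joint probabilities and the distribution `q` are made-up numbers, not values from the original post.

```python
import math

# Hypothetical toy model: Z takes values in {0, 1, 2}, with one fixed observation x.
joint = [0.10, 0.25, 0.05]  # p(x, z; theta) for z = 0, 1, 2
q = [0.5, 0.3, 0.2]         # any distribution over z with full support

evidence = math.log(sum(joint))              # log p(x; theta)
posterior = [p / sum(joint) for p in joint]  # p(z | x; theta)

elbo = sum(qz * math.log(pz / qz) for qz, pz in zip(q, joint))
kl = sum(qz * math.log(qz / pz) for qz, pz in zip(q, posterior))

gap = evidence - elbo  # should match kl up to floating-point error

# Choosing q equal to the posterior closes the gap entirely.
elbo_at_posterior = sum(qz * math.log(pz / qz) for qz, pz in zip(posterior, joint))
```

Here `gap` and `kl` agree up to rounding, and `elbo_at_posterior` equals `evidence`, which is the case where the KL term vanishes.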

Variational Inference

Why variational inference?

Variational inference is a paradigm for estimating a posterior distribution when computing it explicitly is intractable.

Assume that we have a model that involves hidden random variables $Z$, observed random variables $X$, and some posited probabilistic model $p(X, Z)$ over the hidden and the observed random variables. The goal is to compute the posterior distribution $p(Z|X)$.

Ideally, we would do so by using Bayes’ theorem:
$p(z|x)=\frac{p(x|z)p(z)}{p(x)}$
In practice, it is often difficult to compute $p(z|x)$ via Bayes’ theorem because the denominator $p(x)$ does not have a closed form. Usually, the denominator $p(x)$ can only be expressed as an integral that marginalizes over $z$: $p(x)=\int p(x,z)\, dz$. In such scenarios, we’re often forced to approximate $p(z|x)$ rather than compute it directly. Variational inference is one such approximation technique.
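
For contrast, here is a sketch of a case where Bayes’ theorem can be applied directly: when $z$ takes only a few discrete values, the denominator is a short finite sum. The two-component Gaussian mixture, the helper names, and the parameter values below are illustrative assumptions.

```python
import math

def normal_pdf(v, mean, std):
    """Density of Normal(mean, std^2) at v."""
    return math.exp(-(v - mean)**2 / (2 * std**2)) / (std * math.sqrt(2 * math.pi))

def posterior_by_bayes(x, prior, means, stds):
    """p(z | x) = p(x | z) p(z) / p(x), with p(x) = sum_z p(x | z) p(z)."""
    joint = [pz * normal_pdf(x, m, s) for pz, m, s in zip(prior, means, stds)]
    evidence = sum(joint)  # tractable only because z takes few values
    return [j / evidence for j in joint], evidence

post, px = posterior_by_bayes(0.0, prior=[0.3, 0.7], means=[-2.0, 3.0], stds=[1.0, 0.5])
```

With $D$ discrete latent variables of $K$ states each, the same sum runs over $K^D$ joint configurations, and with continuous $z$ it becomes the integral above. That growth is what motivates approximation.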

Details

Instead of computing $p(z|x)$ exactly via Bayes’ theorem, variational inference attempts to find another distribution $q(z)$ that is ‘close’ to $p(z|x)$, where the ‘closeness’ is measured by the KL-divergence.
$KL(q(Z)\|p(Z|x))=\int q(z) \log \frac{q(z)}{p(z|x)}\,dz=E_{Z\sim q}\left[ \log \frac{q(Z)}{p(Z|x)}\right] \tag{VI.1}$
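
By (ELBO.5) the evidence does not depend on $q$, so minimizing this KL divergence over $q$ is the same as maximizing the ELBO. The sketch below illustrates the idea on a hypothetical conjugate model, $z\sim N(0,1)$ and $x|z\sim N(z,1)$, with the assumed family $q(z)=N(m,s^2)$. For this model the ELBO has a closed form and the exact posterior is $N(x/2, 1/2)$, so the result can be checked. Plain gradient ascent, the step size, and the iteration count are arbitrary choices, not a method prescribed by the source.

```python
import math

def elbo(x, m, s):
    """Closed-form ELBO for z ~ N(0,1), x | z ~ N(z,1), q(z) = N(m, s^2)."""
    expected_log_lik = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - m)**2 + s**2)
    expected_log_prior = -0.5 * math.log(2 * math.pi) - 0.5 * (m**2 + s**2)
    entropy = 0.5 * math.log(2 * math.pi * math.e * s**2)
    return expected_log_lik + expected_log_prior + entropy

def fit_q(x, steps=500, lr=0.1):
    """Gradient ascent on the ELBO over (m, rho), where rho = log s keeps s positive."""
    m, rho = 0.0, 0.0
    for _ in range(steps):
        s = math.exp(rho)
        grad_m = x - 2.0 * m         # d ELBO / d m
        grad_rho = 1.0 - 2.0 * s**2  # d ELBO / d rho, via the chain rule
        m += lr * grad_m
        rho += lr * grad_rho
    return m, math.exp(rho)

x = 1.5
m, s = fit_q(x)
log_evidence = -0.5 * math.log(4 * math.pi) - x**2 / 4  # log N(x; 0, 2)
```

Because the Gaussian family contains the exact posterior here, the fitted `m` and `s**2` approach $x/2$ and $1/2$, and `elbo(x, m, s)` approaches `log_evidence`. In a non-conjugate model the family would usually not contain the posterior, and a positive gap would remain.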