Derivation of the Variational Inference Formulas

who_N.s.N

Last modified: 2023-05-01 15:56:14

Tags: machine learning, algorithms, probability theory

First published: 2023-05-01 15:50:09

Article link: https://blog.csdn.net/qq_45733704/article/details/130456535

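This article introduces the variational lower bound as a way to optimize generative probabilistic models: maximizing the lower bound approximately maximizes the probability the model assigns to the given data. The approach relies on variational inference, using the KL divergence and expectations to handle probability distributions that are hard to compute directly. In practice, the lower bound is usually optimized by gradient ascent, which yields estimates of model parameters and distributions that cannot be computed directly.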

Variational Lower Bound

Derivation

Our goal is to maximize $\log p_\theta(\mathbf{x})$ , the log-probability that the model assigns to the observed data $\mathbf{x}$ . Because $p_\theta(\mathbf{x})$ is usually intractable to compute directly, we instead maximize the variational lower bound $\mathcal{L}(\theta, \phi; \mathbf{x})$ , which indirectly and approximately maximizes $\log p_\theta(\mathbf{x})$ .

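The variational lower bound $\mathcal{L}(\theta, \phi; \mathbf{x})$ is a lower bound on $\log p_\theta(\mathbf{x})$ , that is, $\log p_\theta(\mathbf{x})$ is never smaller than $\mathcal{L}(\theta, \phi; \mathbf{x})$ . To see this, define $\mathcal{L}(\theta, \phi ; \mathbf{x})=\mathbb{E}_{\mathbf{z} \sim q_\phi(\mathbf{z} \mid \mathbf{x})}\left[\log \frac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\phi(\mathbf{z} \mid \mathbf{x})}\right]$ and expand $\log p_\theta(\mathbf{x})$ (the first equality is justified in the formula notes at the end):
$\begin{aligned} \log p_\theta(\mathbf{x}) & =\mathbb{E}_{\mathbf{z} \sim q_\phi(\mathbf{z} \mid \mathbf{x})}\left[\log p_\theta(\mathbf{x})\right] \\ & =\mathbb{E}_{\mathbf{z} \sim q_\phi(\mathbf{z} \mid \mathbf{x})}\left[\log \frac{p_\theta(\mathbf{x}, \mathbf{z})}{p_\theta(\mathbf{z} \mid \mathbf{x})}\right] \\ & =\mathbb{E}_{\mathbf{z} \sim q_\phi(\mathbf{z} \mid \mathbf{x})}\left[\log \frac{p_\theta(\mathbf{x}, \mathbf{z}) q_\phi(\mathbf{z} \mid \mathbf{x})}{p_\theta(\mathbf{z} \mid \mathbf{x}) q_\phi(\mathbf{z} \mid \mathbf{x})}\right] \\ & =\mathbb{E}_{\mathbf{z} \sim q_\phi(\mathbf{z} \mid \mathbf{x})}\left[\log \frac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\phi(\mathbf{z} \mid \mathbf{x})}\right]+\mathbb{E}_{\mathbf{z} \sim q_\phi(\mathbf{z} \mid \mathbf{x})}\left[\log \frac{q_\phi(\mathbf{z} \mid \mathbf{x})}{p_\theta(\mathbf{z} \mid \mathbf{x})}\right] \\ & =\mathcal{L}(\theta, \phi ; \mathbf{x})+\mathrm{KL}\left(q_\phi(\mathbf{z} \mid \mathbf{x}) \| p_\theta(\mathbf{z} \mid \mathbf{x})\right) \\ & \geq \mathcal{L}(\theta, \phi ; \mathbf{x}) \end{aligned}$
Here we used the identity $p_\theta(\mathbf{x})=p_\theta(\mathbf{x}, \mathbf{z}) / p_\theta(\mathbf{z} \mid \mathbf{x})$ , the definition of the variational lower bound, and the non-negativity of the KL divergence, $\mathrm{KL}\left(q_\phi(\mathbf{z} \mid \mathbf{x}) \| p_\theta(\mathbf{z} \mid \mathbf{x})\right) \geq 0$ , which together give the relation between $\log p_\theta(\mathbf{x})$ and the variational lower bound $\mathcal{L}(\theta, \phi; \mathbf{x})$ . The gap between the two is exactly this KL divergence, so the bound is tight when $q_\phi(\mathbf{z} \mid \mathbf{x})$ matches the true posterior $p_\theta(\mathbf{z} \mid \mathbf{x})$ .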
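The decomposition $\log p_\theta(\mathbf{x})=\mathcal{L}+\mathrm{KL}$ can be checked numerically on a model small enough to enumerate. The sketch below is only an illustration: the three-state latent variable and all probability values (`prior`, `likelihood`, `q`) are made-up assumptions, not taken from the article.

```python
import numpy as np

# Hypothetical toy model: a latent z with 3 states and one fixed observation x.
prior = np.array([0.5, 0.3, 0.2])        # p(z)
likelihood = np.array([0.9, 0.2, 0.1])   # p(x | z) evaluated at the observed x

joint = prior * likelihood               # p(x, z) for each z
evidence = joint.sum()                   # p(x), tractable here by enumeration
posterior = joint / evidence             # p(z | x)

# An arbitrary approximate posterior q(z | x); any strictly positive choice works.
q = np.array([0.6, 0.3, 0.1])

elbo = np.sum(q * (np.log(joint) - np.log(q)))      # E_q[log p(x, z) / q(z | x)]
kl = np.sum(q * (np.log(q) - np.log(posterior)))    # KL(q(z | x) || p(z | x))
log_evidence = np.log(evidence)

print(f"log p(x)  = {log_evidence:.6f}")
print(f"ELBO + KL = {elbo + kl:.6f}")
print(f"ELBO      = {elbo:.6f}")
```

Replacing `q` with `posterior` drives the KL term to zero, at which point the lower bound coincides with $\log p_\theta(\mathbf{x})$ .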

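Maximizing $\mathcal{L}(\theta, \phi; \mathbf{x})$ yields an approximately optimal solution $(\hat{\theta}, \hat{\phi})$ , which can be used to estimate $p_\theta(\mathbf{z} \mid \mathbf{x})$ and $\log p_\theta(\mathbf{x})$ . Concretely:
$\begin{gathered} \log p_{\hat{\theta}}(\mathbf{x}) \approx \mathcal{L}(\hat{\theta}, \hat{\phi} ; \mathbf{x}) \approx \frac{1}{K} \sum_{k=1}^K \log \frac{p_{\hat{\theta}}\left(\mathbf{x}, \mathbf{z}^{(k)}\right)}{q_{\hat{\phi}}\left(\mathbf{z}^{(k)} \mid \mathbf{x}\right)} \\ p_{\hat{\theta}}(\mathbf{z} \mid \mathbf{x}) \approx q_{\hat{\phi}}(\mathbf{z} \mid \mathbf{x}) \end{gathered}$
where $\mathbf{z}^{(1)}, \mathbf{z}^{(2)}, \ldots, \mathbf{z}^{(K)}$ are random vectors sampled from $q_{\hat{\phi}}(\mathbf{z} \mid \mathbf{x})$ and $K$ is the number of samples. The sample average is a Monte Carlo estimate of the expectation that defines $\mathcal{L}$ ; the first approximation is in fact a lower bound, and its quality depends on how close $q_{\hat{\phi}}(\mathbf{z} \mid \mathbf{x})$ is to the true posterior.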
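As a rough illustration of estimating the lower bound from $K$ samples drawn from $q$ , the sketch below uses a one-dimensional Gaussian model chosen purely for convenience: $z \sim \mathcal{N}(0,1)$ and $x \mid z \sim \mathcal{N}(z, 1)$ , for which $p(x)=\mathcal{N}(x ; 0, 2)$ and $p(z \mid x)=\mathcal{N}(x/2, 1/2)$ are known exactly. The model, the helper names `log_normal` and `elbo_mc`, and the value of `x` are assumptions for this example and do not come from the article.

```python
import numpy as np

def log_normal(v, mean, var):
    """Log-density of a univariate normal N(mean, var) evaluated at v."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (v - mean) ** 2 / var)

def elbo_mc(x, m, s, K, rng):
    """Monte Carlo estimate of the lower bound with q(z | x) = N(m, s^2).

    Assumed toy model: z ~ N(0, 1), x | z ~ N(z, 1).
    """
    z = m + s * rng.standard_normal(K)                        # z^(k) ~ q(z | x)
    log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)  # log p(x, z^(k))
    log_q = log_normal(z, m, s ** 2)                          # log q(z^(k) | x)
    return np.mean(log_joint - log_q)

rng = np.random.default_rng(0)
x = 1.5
log_px = log_normal(x, 0.0, 2.0)   # exact log p(x) for this toy model

# q equal to the prior: a valid but loose choice.
loose = elbo_mc(x, m=0.0, s=1.0, K=10_000, rng=rng)
# q equal to the exact posterior N(x/2, 1/2): the bound becomes tight.
tight = elbo_mc(x, m=x / 2, s=np.sqrt(0.5), K=10_000, rng=rng)

print(f"log p(x)          = {log_px:.4f}")
print(f"estimate, q=prior = {loose:.4f}")
print(f"estimate, q=post. = {tight:.4f}")
```

When $q$ is the exact posterior, every sample's log-ratio equals $\log p(x)$ , so the estimate has no Monte Carlo noise; for any other $q$ the estimate fluctuates around a value strictly below $\log p(x)$ .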

Maximizing the variational lower bound therefore gives us approximations to $\log p_\theta(\mathbf{x})$ and $p_\theta(\mathbf{z} \mid \mathbf{x})$ , neither of which can be computed directly.

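In practice, the lower bound is usually computed in an equivalent form. Factoring the joint as $p_\theta(\mathbf{x}, \mathbf{z})=p_\theta(\mathbf{x} \mid \mathbf{z}) p_\theta(\mathbf{z})$ in the definition of $\mathcal{L}$ gives:
$\mathcal{L}(\theta, \phi ; \mathbf{x})=-\mathrm{KL}\left(q_\phi(\mathbf{z} \mid \mathbf{x}) \| p_\theta(\mathbf{z})\right)+\mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\left[\log p_\theta(\mathbf{x} \mid \mathbf{z})\right]$
The first term is a KL divergence between the approximate posterior and the prior $p_\theta(\mathbf{z})$ , which often has a closed form (for example, when both are Gaussian). The second term is an expectation that can be estimated by Monte Carlo sampling; the reparameterization trick makes the samples differentiable with respect to $\phi$ . We can then perform gradient ascent on $\mathcal{L}(\theta, \phi; \mathbf{x})$ to maximize the lower bound and thereby approximately maximize $\log p_\theta(\mathbf{x})$ .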
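A minimal sketch of gradient ascent on the lower bound, written as $\mathbb{E}_{q}\left[\log p(x \mid z)\right]-\mathrm{KL}\left(q(z \mid x) \| p(z)\right)$ , with a closed-form KL term and a reparameterized Monte Carlo estimate of the expectation. Everything specific here is an assumption for illustration: the toy model $z \sim \mathcal{N}(0,1)$ , $x \mid z \sim \mathcal{N}(z, 1)$ , the Gaussian family $q(z \mid x)=\mathcal{N}(m, s^2)$ with $s=e^{\rho}$ , the step size, and the hand-derived gradients. A real model would normally rely on automatic differentiation instead.

```python
import numpy as np

rng = np.random.default_rng(0)
x = 1.5                 # single observation (arbitrary value)
m, rho = 0.0, 0.0       # variational parameters: mean and log-std of q(z | x)
lr, steps, K = 0.02, 3000, 64

for _ in range(steps):
    s = np.exp(rho)
    eps = rng.standard_normal(K)   # noise that does not depend on (m, rho)
    z = m + s * eps                # reparameterized samples z ~ N(m, s^2)
    resid = x - z

    # Gradient of the expectation term E_q[log N(x; z, 1)]:
    #   d/dm   of -0.5 * (x - m - s*eps)^2  is  (x - z)
    #   d/drho of -0.5 * (x - m - s*eps)^2  is  (x - z) * eps * s
    # Gradient of KL(N(m, s^2) || N(0, 1)) = 0.5 * (m^2 + s^2 - 1) - rho:
    #   d/dm is m,  d/drho is s^2 - 1
    grad_m = resid.mean() - m
    grad_rho = (resid * eps).mean() * s - (s ** 2 - 1.0)

    # Gradient ascent: move the parameters uphill on the lower bound.
    m += lr * grad_m
    rho += lr * grad_rho

s = np.exp(rho)
print(f"learned q: mean = {m:.3f}, variance = {s ** 2:.3f}")
print(f"exact posterior: mean = {x / 2:.3f}, variance = 0.500")
```

For this conjugate toy model the exact posterior is $\mathcal{N}(x/2, 1/2)$ and lies inside the variational family, so the learned parameters should settle near it, up to the noise of the stochastic gradients.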

Formula Notes

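The first equality in the derivation writes $\log p_\theta(\mathbf{x})$ as an expectation over $q_\phi(\mathbf{z} \mid \mathbf{x})$ . By the definition of expectation:
$\mathbb{E}_{\mathbf{z} \sim q_\phi(\mathbf{z} \mid \mathbf{x})}\left[\log p_\theta(\mathbf{x})\right]=\int \log p_\theta(\mathbf{x}) q_\phi(\mathbf{z} \mid \mathbf{x}) d \mathbf{z}$

Since $\log p_\theta(\mathbf{x})$ does not depend on $\mathbf{z}$ , it can be pulled out of the integral:
$\mathbb{E}_{\mathbf{z} \sim q_\phi(\mathbf{z} \mid \mathbf{x})}\left[\log p_\theta(\mathbf{x})\right]=\log p_\theta(\mathbf{x}) \int q_\phi(\mathbf{z} \mid \mathbf{x}) d \mathbf{z}=\log p_\theta(\mathbf{x})$
This uses $\int q_\phi(\mathbf{z} \mid \mathbf{x}) d\mathbf{z} = 1$ , which holds because $q_\phi(\mathbf{z} \mid \mathbf{x})$ is a probability density function. The expectation therefore equals $\log p_\theta(\mathbf{x})$ , which is the first equality of the derivation.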