1. VDM
1.1 Introduction to VDM
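VDM (Variational Diffusion Models) builds on the [[MHVAE]] model, but differs from it in three ways:
- For every time step $t$, the latent variable $\boldsymbol{z}_t$ has the same dimension as the data $\boldsymbol{x}$, that is, $\boldsymbol{z}_t \in \mathbb{R}^d$ and $\boldsymbol{x} \in \mathbb{R}^d$;
- For every time step $t$, the latent variable $\boldsymbol{z}_t$ is not learned by a neural network. It follows a Gaussian distribution whose mean is a scaled copy of the previous latent $\boldsymbol{z}_{t-1}$. Diffusion models therefore do not need a neural network to learn an encoder;
- As the time step $t$ grows, the latent variable $\boldsymbol{z}_t$ gradually approaches a standard normal distribution, so that at the final step $T$ we have $\boldsymbol{z}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ (for sufficiently large $T$).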
Comparing with the joint distribution of [[MHVAE]] in Eq. (1), the joint distribution of VDM is given by Eq. (9):
$$\begin{align}
\underbrace{p\left(\boldsymbol{x}_{0:T}\right)}_{\text{Joint Distribution}} = \underbrace{p\left(\boldsymbol{x}_T\right)}_{\text{Prior}} \prod_{t=1}^{T} \underbrace{p_{\boldsymbol{\theta}}\left(\boldsymbol{x}_{t-1}\mid \boldsymbol{x}_t\right)}_{\text{Decoder}}
\end{align}$$
Comparing with the posterior of [[MHVAE]] in Eq. (2), the posterior of VDM is given by Eq. (10):
$$\begin{align}
\underbrace{q\left(\boldsymbol{x}_{1:T}\mid \boldsymbol{x}_0\right)}_{\text{Posterior Distribution}} = \prod_{t=1}^{T} \underbrace{q\left(\boldsymbol{x}_t\mid \boldsymbol{x}_{t-1}\right)}_{\text{Encoder}}
\end{align}$$
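Note the following differences between the [[MHVAE]] formulas and the VDM formulas:
- $q_{\phi}$ is replaced by $q$ everywhere, because the VDM encoder is not modeled by a neural network;
- $\boldsymbol{z}_t$ is replaced by $\boldsymbol{x}_t$ everywhere, because in VDM $\boldsymbol{z}_t$ has the same dimension as $\boldsymbol{x}_t$;
- $\boldsymbol{x}$ is replaced by $\boldsymbol{x}_0$ everywhere.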
1.2 How is the ELBo of VDM derived?
Starting from the ELBo of [[MHVAE]] in Eq. (8), replacing $q_{\phi}$ with $q$ and $\boldsymbol{z}_t$ with $\boldsymbol{x}_t$ yields the ELBo of VDM:
$$
\begin{align}
& \log \underbrace{p(\boldsymbol{x}_0)}_{\text{Evidence}} \\
\geq & \underbrace{\mathbb{E}_{q\left(\boldsymbol{x}_{1:T} \mid \boldsymbol{x}_0\right)}\left[\log \frac{p\left(\boldsymbol{x}_{0:T}\right)}{q\left(\boldsymbol{x}_{1:T} \mid \boldsymbol{x}_0\right)}\right]}_{\text{ELBo of VDM}} \\
= & \mathbb{E}_{q\left(\boldsymbol{x}_{1:T} \mid \boldsymbol{x}_0\right)}\left[\log \frac{p\left(\boldsymbol{x}_T\right) \prod_{t=1}^T p_{\boldsymbol{\theta}}\left(\boldsymbol{x}_{t-1} \mid \boldsymbol{x}_t\right)}{\prod_{t=1}^T q\left(\boldsymbol{x}_t \mid \boldsymbol{x}_{t-1}\right)}\right] \\
= & \mathbb{E}_{q\left(\boldsymbol{x}_{1:T} \mid \boldsymbol{x}_0\right)}\left[\log \frac{p\left(\boldsymbol{x}_T\right) p_{\boldsymbol{\theta}}\left(\boldsymbol{x}_0 \mid \boldsymbol{x}_1\right) \prod_{t=2}^T p_{\boldsymbol{\theta}}\left(\boldsymbol{x}_{t-1} \mid \boldsymbol{x}_t\right)}{q\left(\boldsymbol{x}_1 \mid \boldsymbol{x}_0\right) \prod_{t=2}^T q\left(\boldsymbol{x}_t \mid \boldsymbol{x}_{t-1}\right)}\right] \\
= & \mathbb{E}_{q\left(\boldsymbol{x}_{1:T} \mid \boldsymbol{x}_0\right)}\left[\log \frac{p\left(\boldsymbol{x}_T\right) p_{\boldsymbol{\theta}}\left(\boldsymbol{x}_0 \mid \boldsymbol{x}_1\right) \prod_{t=2}^T p_{\boldsymbol{\theta}}\left(\boldsymbol{x}_{t-1} \mid \boldsymbol{x}_t\right)}{q\left(\boldsymbol{x}_1 \mid \boldsymbol{x}_0\right) \prod_{t=2}^T q\left(\boldsymbol{x}_t \mid \boldsymbol{x}_{t-1}, \boldsymbol{x}_0\right)}\right] \\
= & \mathbb{E}_{q\left(\boldsymbol{x}_{1:T} \mid \boldsymbol{x}_0\right)}\left[\log \frac{p\left(\boldsymbol{x}_T\right) p_{\boldsymbol{\theta}}\left(\boldsymbol{x}_0 \mid \boldsymbol{x}_1\right)}{q\left(\boldsymbol{x}_1 \mid \boldsymbol{x}_0\right)}+\log \prod_{t=2}^T \frac{p_{\boldsymbol{\theta}}\left(\boldsymbol{x}_{t-1} \mid \boldsymbol{x}_t\right)}{q\left(\boldsymbol{x}_t \mid \boldsymbol{x}_{t-1}, \boldsymbol{x}_0\right)}\right] \\
= & \mathbb{E}_{q\left(\boldsymbol{x}_{1:T} \mid \boldsymbol{x}_0\right)}\left[\log \frac{p\left(\boldsymbol{x}_T\right) p_{\boldsymbol{\theta}}\left(\boldsymbol{x}_0 \mid \boldsymbol{x}_1\right)}{q\left(\boldsymbol{x}_1 \mid \boldsymbol{x}_0\right)}+\log \prod_{t=2}^T \frac{p_{\boldsymbol{\theta}}\left(\boldsymbol{x}_{t-1} \mid \boldsymbol{x}_t\right)}{\frac{q\left(\boldsymbol{x}_{t-1} \mid \boldsymbol{x}_t, \boldsymbol{x}_0\right) q\left(\boldsymbol{x}_t \mid \boldsymbol{x}_0\right)}{q\left(\boldsymbol{x}_{t-1} \mid \boldsymbol{x}_0\right)}}\right] \\
= & \mathbb{E}_{q\left(\boldsymbol{x}_{1:T} \mid \boldsymbol{x}_0\right)}\left[\log \frac{p\left(\boldsymbol{x}_T\right) p_{\boldsymbol{\theta}}\left(\boldsymbol{x}_0 \mid \boldsymbol{x}_1\right)}{q\left(\boldsymbol{x}_1 \mid \boldsymbol{x}_0\right)}+\log \prod_{t=2}^T \frac{q\left(\boldsymbol{x}_{t-1} \mid \boldsymbol{x}_0\right)}{q\left(\boldsymbol{x}_t \mid \boldsymbol{x}_0\right)}+\log \prod_{t=2}^T \frac{p_{\boldsymbol{\theta}}\left(\boldsymbol{x}_{t-1} \mid \boldsymbol{x}_t\right)}{q\left(\boldsymbol{x}_{t-1} \mid \boldsymbol{x}_t, \boldsymbol{x}_0\right)}\right] \\
= & \mathbb{E}_{q\left(\boldsymbol{x}_{1:T} \mid \boldsymbol{x}_0\right)}\left[\log \frac{p\left(\boldsymbol{x}_T\right) p_{\boldsymbol{\theta}}\left(\boldsymbol{x}_0 \mid \boldsymbol{x}_1\right)}{q\left(\boldsymbol{x}_1 \mid \boldsymbol{x}_0\right)}+\log \frac{q\left(\boldsymbol{x}_1 \mid \boldsymbol{x}_0\right)}{q\left(\boldsymbol{x}_T \mid \boldsymbol{x}_0\right)}+\log \prod_{t=2}^T \frac{p_{\boldsymbol{\theta}}\left(\boldsymbol{x}_{t-1} \mid \boldsymbol{x}_t\right)}{q\left(\boldsymbol{x}_{t-1} \mid \boldsymbol{x}_t, \boldsymbol{x}_0\right)}\right] \\
= & \mathbb{E}_{q\left(\boldsymbol{x}_{1:T} \mid \boldsymbol{x}_0\right)}\left[\log \frac{p\left(\boldsymbol{x}_T\right) p_{\boldsymbol{\theta}}\left(\boldsymbol{x}_0 \mid \boldsymbol{x}_1\right)}{q\left(\boldsymbol{x}_T \mid \boldsymbol{x}_0\right)}+\sum_{t=2}^T \log \frac{p_{\boldsymbol{\theta}}\left(\boldsymbol{x}_{t-1} \mid \boldsymbol{x}_t\right)}{q\left(\boldsymbol{x}_{t-1} \mid \boldsymbol{x}_t, \boldsymbol{x}_0\right)}\right] \\
= & \mathbb{E}_{q\left(\boldsymbol{x}_{1:T} \mid \boldsymbol{x}_0\right)}\left[\log p_{\boldsymbol{\theta}}\left(\boldsymbol{x}_0 \mid \boldsymbol{x}_1\right)\right]+\mathbb{E}_{q\left(\boldsymbol{x}_{1:T} \mid \boldsymbol{x}_0\right)}\left[\log \frac{p\left(\boldsymbol{x}_T\right)}{q\left(\boldsymbol{x}_T \mid \boldsymbol{x}_0\right)}\right]+\sum_{t=2}^T \mathbb{E}_{q\left(\boldsymbol{x}_{1:T} \mid \boldsymbol{x}_0\right)}\left[\log \frac{p_{\boldsymbol{\theta}}\left(\boldsymbol{x}_{t-1} \mid \boldsymbol{x}_t\right)}{q\left(\boldsymbol{x}_{t-1} \mid \boldsymbol{x}_t, \boldsymbol{x}_0\right)}\right] \\
= & \mathbb{E}_{q\left(\boldsymbol{x}_1 \mid \boldsymbol{x}_0\right)}\left[\log p_{\boldsymbol{\theta}}\left(\boldsymbol{x}_0 \mid \boldsymbol{x}_1\right)\right]+\mathbb{E}_{q\left(\boldsymbol{x}_T \mid \boldsymbol{x}_0\right)}\left[\log \frac{p\left(\boldsymbol{x}_T\right)}{q\left(\boldsymbol{x}_T \mid \boldsymbol{x}_0\right)}\right]+\sum_{t=2}^T \mathbb{E}_{q\left(\boldsymbol{x}_t, \boldsymbol{x}_{t-1} \mid \boldsymbol{x}_0\right)}\left[\log \frac{p_{\boldsymbol{\theta}}\left(\boldsymbol{x}_{t-1} \mid \boldsymbol{x}_t\right)}{q\left(\boldsymbol{x}_{t-1} \mid \boldsymbol{x}_t, \boldsymbol{x}_0\right)}\right] \\
= & \underbrace{\underbrace{\mathbb{E}_{q\left(\boldsymbol{x}_1 \mid \boldsymbol{x}_0\right)}\left[\log p_{\boldsymbol{\theta}}\left(\boldsymbol{x}_0 \mid \boldsymbol{x}_1\right)\right]}_{\boldsymbol{x}_0 \approx \boldsymbol{x}_1}}_{\text{reconstruction term} \color{red}{\approx 0}}-\underbrace{D_{\mathrm{KL}}\left(\underbrace{q\left(\boldsymbol{x}_T \mid \boldsymbol{x}_0\right)}_{\approx \mathcal{N}(\mathbf{0}, \mathbf{I})} \parallel \underbrace{p\left(\boldsymbol{x}_T\right)}_{=\mathcal{N}(\mathbf{0}, \mathbf{I})}\right)}_{\text{prior matching term} \color{red}{\approx 0}} -\underbrace{\sum_{t=2}^T \underbrace{\mathbb{E}_{q\left(\boldsymbol{x}_t \mid \boldsymbol{x}_0\right)}\left[
D_{\mathrm{KL}}\left(\underbrace{q\left(\boldsymbol{x}_{t-1} \mid \boldsymbol{x}_t, \boldsymbol{x}_0\right)}_{\color{red}{\text{complexity posterior}}} \parallel \underbrace{p_{\boldsymbol{\theta}}\left(\boldsymbol{x}_{t-1} \mid \boldsymbol{x}_t\right)}_{\textcolor{green}{\text{Decoder of VDM}}}\right)
\right]}_{\text{denoising matching term}}}_{\textbf{Objective function to optimize}}
\end{align}
$$
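The Bayes-rule substitution in the middle of this derivation, $q(\boldsymbol{x}_t \mid \boldsymbol{x}_{t-1}, \boldsymbol{x}_0) = q(\boldsymbol{x}_{t-1} \mid \boldsymbol{x}_t, \boldsymbol{x}_0)\, q(\boldsymbol{x}_t \mid \boldsymbol{x}_0) / q(\boldsymbol{x}_{t-1} \mid \boldsymbol{x}_0)$, can be sanity-checked numerically in one dimension. The sketch below is only an illustration: it assumes a constant schedule `ALPHA = 0.9` (a made-up value) and uses the Gaussian closed forms for $q(\boldsymbol{x}_t \mid \boldsymbol{x}_0)$ and $q(\boldsymbol{x}_{t-1} \mid \boldsymbol{x}_t, \boldsymbol{x}_0)$ that Sections 1.3.2 and 1.3.3 derive. All function names are hypothetical.

```python
import math

ALPHA = 0.9  # hypothetical constant alpha_t, chosen only for illustration


def normal_pdf(x, mean, var):
    """Density of a scalar Gaussian N(mean, var) evaluated at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def alpha_bar(t):
    """Cumulative product alpha_1 * ... * alpha_t for the constant schedule."""
    return ALPHA ** t


def q_step(x_t, x_prev):
    """Encoder q(x_t | x_{t-1}); by the Markov property it ignores x_0."""
    return normal_pdf(x_t, math.sqrt(ALPHA) * x_prev, 1.0 - ALPHA)


def q_marginal(x_t, x_0, t):
    """Closed-form q(x_t | x_0)."""
    ab = alpha_bar(t)
    return normal_pdf(x_t, math.sqrt(ab) * x_0, 1.0 - ab)


def q_posterior(x_prev, x_t, x_0, t):
    """Closed-form q(x_{t-1} | x_t, x_0)."""
    ab_t, ab_prev = alpha_bar(t), alpha_bar(t - 1)
    mean = (math.sqrt(ALPHA) * (1.0 - ab_prev) * x_t
            + math.sqrt(ab_prev) * (1.0 - ALPHA) * x_0) / (1.0 - ab_t)
    var = (1.0 - ALPHA) * (1.0 - ab_prev) / (1.0 - ab_t)
    return normal_pdf(x_prev, mean, var)


x_0, x_prev, x_t, t = 0.7, 0.4, -0.2, 5
lhs = q_step(x_t, x_prev)
rhs = q_posterior(x_prev, x_t, x_0, t) * q_marginal(x_t, x_0, t) / q_marginal(x_prev, x_0, t - 1)
bayes_gap = abs(lhs - rhs)  # should vanish up to floating-point error
```

Any other choice of `x_0`, `x_prev`, `x_t`, and `t >= 2` should give the same agreement, since the identity is exact for these Gaussians.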
1.3 How is the VDM objective derived from the ELBo?
1.3.1 Reconstruction term
Conclusion: this term can be estimated with Monte Carlo sampling, but in practice, when $T$ is large, $\boldsymbol{x}_0 \approx \boldsymbol{x}_1$ and the term is negligible.
$$\begin{align}
\underbrace{\underbrace{\mathbb{E}_{q\left(\boldsymbol{x}_1 \mid \boldsymbol{x}_0\right)}\left[\log p_{\boldsymbol{\theta}}\left(\boldsymbol{x}_0 \mid \boldsymbol{x}_1\right)\right]}_{\boldsymbol{x}_0 \approx \boldsymbol{x}_1}}_{\text{reconstruction term}} \approx 0
\end{align}$$
1.3.2 Prior matching term
In VDM we make the following assumption:
$$\begin{align}
\underbrace{q\left(\boldsymbol{x}_t \mid \boldsymbol{x}_{t-1} \right)}_{\textcolor{red}{\text{Encoder of VDM}}} & =\mathcal{N}\left(\boldsymbol{x}_t ; \sqrt{\alpha_t} \boldsymbol{x}_{t-1},\left(1-\alpha_t\right) \mathbf{I}\right) \quad \text{with } \alpha_t \in (0, 1)
\end{align}$$
Applying the reparameterization trick gives:
$$\begin{align}
\boldsymbol{x}_t & =\sqrt{\alpha_t} \boldsymbol{x}_{t-1}+\sqrt{1-\alpha_t} \boldsymbol{\epsilon} \quad \text{with } \boldsymbol{\epsilon} \sim \mathcal{N}(\boldsymbol{\epsilon} ; \mathbf{0}, \mathbf{I}) \text{, } \alpha_t \in (0, 1)
\end{align}$$
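A single encoder step can be sketched directly from the reparameterized form. The snippet below is a minimal illustration, assuming a made-up value `alpha_t = 0.9` and toy data in which every coordinate equals 1. The helper `forward_step` is a hypothetical name, not part of any library.

```python
import math
import random


def forward_step(x_prev, alpha_t, rng):
    """Sample x_t ~ q(x_t | x_{t-1}) as sqrt(alpha_t) * x_{t-1} + sqrt(1 - alpha_t) * eps."""
    scale, noise_std = math.sqrt(alpha_t), math.sqrt(1.0 - alpha_t)
    return [scale * v + noise_std * rng.gauss(0.0, 1.0) for v in x_prev]


rng = random.Random(0)      # fixed seed so the run is reproducible
alpha_t = 0.9               # hypothetical value, for illustration only
x_prev = [1.0] * 20000      # toy "data": many coordinates, all equal to 1

x_t = forward_step(x_prev, alpha_t, rng)

# Empirical moments over the coordinates of x_t
sample_mean = sum(x_t) / len(x_t)
sample_var = sum((v - sample_mean) ** 2 for v in x_t) / len(x_t)
```

With this many coordinates, `sample_mean` should land close to $\sqrt{\alpha_t}$ and `sample_var` close to $1 - \alpha_t$, matching the mean and variance of the encoder Gaussian.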
Unrolling the reparameterized form recursively, and writing $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$, gives:
$$
\begin{align}
\boldsymbol{x}_t & =\sqrt{\alpha_t} \boldsymbol{x}_{t-1}+\sqrt{1-\alpha_t} \boldsymbol{\epsilon}_{t-1}^* \\
& =\sqrt{\alpha_t}\left(\sqrt{\alpha_{t-1}} \boldsymbol{x}_{t-2}+\sqrt{1-\alpha_{t-1}} \boldsymbol{\epsilon}_{t-2}^*\right)+\sqrt{1-\alpha_t} \boldsymbol{\epsilon}_{t-1}^* \\
& =\sqrt{\alpha_t \alpha_{t-1}} \boldsymbol{x}_{t-2}+\sqrt{\alpha_t-\alpha_t \alpha_{t-1}} \boldsymbol{\epsilon}_{t-2}^*+\sqrt{1-\alpha_t} \boldsymbol{\epsilon}_{t-1}^* \\
& =\sqrt{\alpha_t \alpha_{t-1}} \boldsymbol{x}_{t-2}+\sqrt{\sqrt{\alpha_t-\alpha_t \alpha_{t-1}}^2+\sqrt{1-\alpha_t}^2} \boldsymbol{\epsilon}_{t-2} \quad \text{(apply } \lbrace \boldsymbol{\epsilon}_t^*, \boldsymbol{\epsilon}_t \rbrace_{t=0}^{T} \overset{iid}{\sim} \mathcal{N}(\boldsymbol{\epsilon}; \mathbf{0}, \mathbf{I}) \text{)} \\
& =\sqrt{\alpha_t \alpha_{t-1}} \boldsymbol{x}_{t-2}+\sqrt{\alpha_t-\alpha_t \alpha_{t-1}+1-\alpha_t} \boldsymbol{\epsilon}_{t-2} \\
& =\sqrt{\alpha_t \alpha_{t-1}} \boldsymbol{x}_{t-2}+\sqrt{1-\alpha_t \alpha_{t-1}} \boldsymbol{\epsilon}_{t-2} \\
& =\ldots \\
& =\sqrt{\prod_{i=1}^t \alpha_i} \boldsymbol{x}_0+\sqrt{1-\prod_{i=1}^t \alpha_i} \boldsymbol{\epsilon}_0 \quad \text{with } \boldsymbol{\epsilon}_0 \sim \mathcal{N}(\boldsymbol{\epsilon}; \mathbf{0}, \mathbf{I}) \\
& =\sqrt{\bar{\alpha}_t} \boldsymbol{x}_0+\sqrt{1-\bar{\alpha}_t} \boldsymbol{\epsilon}_0 \quad \text{with } \boldsymbol{\epsilon}_0 \sim \mathcal{N}(\boldsymbol{\epsilon}; \mathbf{0}, \mathbf{I}) \\
& \sim \mathcal{N}\left(\boldsymbol{x}_t ; \sqrt{\bar{\alpha}_t} \boldsymbol{x}_0,\left(1-\bar{\alpha}_t\right) \mathbf{I}\right)
\end{align}
$$
The fourth line uses the fact that the sum of two independent zero-mean Gaussians is again a zero-mean Gaussian whose variance is the sum of the two variances.
Every step above is conditioned on $\boldsymbol{x}_0$, so the result is the marginal of the forward process given $\boldsymbol{x}_0$, in closed form:
$$\begin{align}
q(\boldsymbol{x}_t \mid \boldsymbol{x}_0) = \mathcal{N}\left(\boldsymbol{x}_t ; \sqrt{\bar{\alpha}_t} \boldsymbol{x}_0,\left(1-\bar{\alpha}_t\right) \mathbf{I}\right)
\end{align}$$
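The closed form can be checked against the step-by-step recursion without any sampling: each encoder step multiplies the coefficient on $\boldsymbol{x}_0$ by $\sqrt{\alpha_t}$ and maps the noise variance $v$ to $\alpha_t v + (1 - \alpha_t)$. The sketch below tracks both quantities for a short, made-up schedule (the values in `alphas` are assumptions for illustration) and compares them with $\sqrt{\bar{\alpha}_t}$ and $1 - \bar{\alpha}_t$.

```python
import math

# Hypothetical schedule alpha_1 .. alpha_10, all inside (0, 1)
alphas = [0.99 - 0.005 * i for i in range(10)]

coef = 1.0       # coefficient on x_0 (x_0 itself has coefficient 1)
var = 0.0        # accumulated noise variance (x_0 itself carries no noise)
alpha_bar = 1.0  # running product alpha_1 * ... * alpha_t
max_gap = 0.0    # largest deviation between recursion and closed form

for alpha_t in alphas:
    # One encoder step: x_t = sqrt(alpha_t) * x_{t-1} + sqrt(1 - alpha_t) * eps
    coef = math.sqrt(alpha_t) * coef
    var = alpha_t * var + (1.0 - alpha_t)
    alpha_bar *= alpha_t
    # Closed form: mean coefficient sqrt(alpha_bar_t), variance 1 - alpha_bar_t
    max_gap = max(max_gap,
                  abs(coef - math.sqrt(alpha_bar)),
                  abs(var - (1.0 - alpha_bar)))
```

The practical consequence is that $\boldsymbol{x}_t$ can be sampled in one shot from $\boldsymbol{x}_0$ for any $t$, without simulating the intermediate steps.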
When $T$ is large enough, for example $T = 1000$:
$$
\alpha_t \in (0, 1) \implies \bar{\alpha}_T = \prod_{t=1}^{T} \alpha_t \approx 0
$$
Therefore:
$$
q(\boldsymbol{x}_T \mid \boldsymbol{x}_0) \approx \mathcal{N}\left(\mathbf{0}, \mathbf{I}\right)
$$
Conclusion: the prior matching term is negligible:
$$
\underbrace{D_{\mathrm{KL}}\left(\underbrace{q\left(\boldsymbol{x}_T \mid \boldsymbol{x}_0\right)}_{\approx \mathcal{N}(\mathbf{0}, \mathbf{I})} \parallel \underbrace{p\left(\boldsymbol{x}_T\right)}_{=\mathcal{N}(\mathbf{0}, \mathbf{I})}\right)}_{\text{prior matching term}} \approx 0
$$
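To see how small the prior matching term can get, the following sketch evaluates it for a scalar $x_0$. The noise schedule is an assumption, not something these notes specify: a linear schedule $\beta_t$ from $10^{-4}$ to $0.02$ with $\alpha_t = 1 - \beta_t$ and $T = 1000$. The helper `kl_to_standard_normal` is a hypothetical name implementing the standard KL divergence between $\mathcal{N}(m, v)$ and $\mathcal{N}(0, 1)$.

```python
import math

T = 1000
# Assumed linear schedule: beta_t from 1e-4 to 0.02, alpha_t = 1 - beta_t
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alpha_bar_T = math.prod(1.0 - b for b in betas)


def kl_to_standard_normal(mean, var):
    """KL( N(mean, var) || N(0, 1) ) for scalar Gaussians."""
    return 0.5 * (var + mean ** 2 - 1.0 - math.log(var))


x_0 = 1.0  # toy scalar data point
# q(x_T | x_0) = N( sqrt(alpha_bar_T) * x_0, 1 - alpha_bar_T )
prior_matching_kl = kl_to_standard_normal(math.sqrt(alpha_bar_T) * x_0, 1.0 - alpha_bar_T)
```

Under this assumed schedule both `alpha_bar_T` and `prior_matching_kl` come out far below $10^{-3}$. The term also contains no trainable parameters, so dropping it does not change the optimum.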
1.3.3 Denoising matching term
We now derive the components of the denoising matching term one by one:
$$
\begin{align}
& \underbrace{q\left(\boldsymbol{x}_{t-1} \mid \boldsymbol{x}_t, \boldsymbol{x}_0\right)}_{\text{complexity posterior}} \\
= & \frac{q\left(\boldsymbol{x}_t \mid \boldsymbol{x}_{t-1}, \boldsymbol{x}_0\right) q\left(\boldsymbol{x}_{t-1} \mid \boldsymbol{x}_0\right)}{q\left(\boldsymbol{x}_t \mid \boldsymbol{x}_0\right)} \\
= & \frac{\overbrace{\mathcal{N}\left(\boldsymbol{x}_t ; \sqrt{\alpha_t} \boldsymbol{x}_{t-1},\left(1-\alpha_t\right) \mathbf{I}\right)}^{\text{apply Markov Property in Eq.(25)}} \overbrace{\mathcal{N}\left(\boldsymbol{x}_{t-1} ; \sqrt{\bar{\alpha}_{t-1}} \boldsymbol{x}_0,\left(1-\bar{\alpha}_{t-1}\right) \mathbf{I}\right)}^{\text{apply Eq.(37)}}}{\underbrace{\mathcal{N}\left(\boldsymbol{x}_t ; \sqrt{\bar{\alpha}_t} \boldsymbol{x}_0,\left(1-\bar{\alpha}_t\right) \mathbf{I}\right)}_{\text{apply Eq.(37)}}} \\
\propto & \exp \left\{-\left[\frac{\left(\boldsymbol{x}_t-\sqrt{\alpha_t} \boldsymbol{x}_{t-1}\right)^2}{2\left(1-\alpha_t\right)}+\frac{\left(\boldsymbol{x}_{t-1}-\sqrt{\bar{\alpha}_{t-1}} \boldsymbol{x}_0\right)^2}{2\left(1-\bar{\alpha}_{t-1}\right)}-\frac{\left(\boldsymbol{x}_t-\sqrt{\bar{\alpha}_t} \boldsymbol{x}_0\right)^2}{2\left(1-\bar{\alpha}_t\right)}\right]\right\} \\
= & \exp \left\{-\frac{1}{2}\left[\frac{\left(\boldsymbol{x}_t-\sqrt{\alpha_t} \boldsymbol{x}_{t-1}\right)^2}{1-\alpha_t}+\frac{\left(\boldsymbol{x}_{t-1}-\sqrt{\bar{\alpha}_{t-1}} \boldsymbol{x}_0\right)^2}{1-\bar{\alpha}_{t-1}}-\frac{\left(\boldsymbol{x}_t-\sqrt{\bar{\alpha}_t} \boldsymbol{x}_0\right)^2}{1-\bar{\alpha}_t}\right]\right\} \\
= & \exp \left\{-\frac{1}{2}\left[\frac{\left(-2 \sqrt{\alpha_t} \boldsymbol{x}_t \boldsymbol{x}_{t-1}+\alpha_t \boldsymbol{x}_{t-1}^2\right)}{1-\alpha_t}+\frac{\left(\boldsymbol{x}_{t-1}^2-2 \sqrt{\bar{\alpha}_{t-1}} \boldsymbol{x}_{t-1} \boldsymbol{x}_0\right)}{1-\bar{\alpha}_{t-1}}+C\left(\boldsymbol{x}_t, \boldsymbol{x}_0\right)\right]\right\} \\
\propto & \exp \left\{-\frac{1}{2}\left[-\frac{2 \sqrt{\alpha_t} \boldsymbol{x}_t \boldsymbol{x}_{t-1}}{1-\alpha_t}+\frac{\alpha_t \boldsymbol{x}_{t-1}^2}{1-\alpha_t}+\frac{\boldsymbol{x}_{t-1}^2}{1-\bar{\alpha}_{t-1}}-\frac{2 \sqrt{\bar{\alpha}_{t-1}} \boldsymbol{x}_{t-1} \boldsymbol{x}_0}{1-\bar{\alpha}_{t-1}}\right]\right\} \\
= & \exp \left\{-\frac{1}{2}\left[\left(\frac{\alpha_t}{1-\alpha_t}+\frac{1}{1-\bar{\alpha}_{t-1}}\right) \boldsymbol{x}_{t-1}^2-2\left(\frac{\sqrt{\alpha_t} \boldsymbol{x}_t}{1-\alpha_t}+\frac{\sqrt{\bar{\alpha}_{t-1}} \boldsymbol{x}_0}{1-\bar{\alpha}_{t-1}}\right) \boldsymbol{x}_{t-1}\right]\right\} \\
= & \exp \left\{-\frac{1}{2}\left[\frac{\alpha_t\left(1-\bar{\alpha}_{t-1}\right)+1-\alpha_t}{\left(1-\alpha_t\right)\left(1-\bar{\alpha}_{t-1}\right)} \boldsymbol{x}_{t-1}^2-2\left(\frac{\sqrt{\alpha_t} \boldsymbol{x}_t}{1-\alpha_t}+\frac{\sqrt{\bar{\alpha}_{t-1}} \boldsymbol{x}_0}{1-\bar{\alpha}_{t-1}}\right) \boldsymbol{x}_{t-1}\right]\right\} \\
= & \exp \left\{-\frac{1}{2}\left[\frac{\alpha_t-\bar{\alpha}_t+1-\alpha_t}{\left(1-\alpha_t\right)\left(1-\bar{\alpha}_{t-1}\right)} \boldsymbol{x}_{t-1}^2-2\left(\frac{\sqrt{\alpha_t} \boldsymbol{x}_t}{1-\alpha_t}+\frac{\sqrt{\bar{\alpha}_{t-1}} \boldsymbol{x}_0}{1-\bar{\alpha}_{t-1}}\right) \boldsymbol{x}_{t-1}\right]\right\} \\
= & \exp \left\{-\frac{1}{2}\left[\frac{1-\bar{\alpha}_t}{\left(1-\alpha_t\right)\left(1-\bar{\alpha}_{t-1}\right)} \boldsymbol{x}_{t-1}^2-2\left(\frac{\sqrt{\alpha_t} \boldsymbol{x}_t}{1-\alpha_t}+\frac{\sqrt{\bar{\alpha}_{t-1}} \boldsymbol{x}_0}{1-\bar{\alpha}_{t-1}}\right) \boldsymbol{x}_{t-1}\right]\right\} \\
= & \exp \left\{-\frac{1}{2}\left(\frac{1-\bar{\alpha}_t}{\left(1-\alpha_t\right)\left(1-\bar{\alpha}_{t-1}\right)}\right)\left[\boldsymbol{x}_{t-1}^2-2 \frac{\left(\frac{\sqrt{\alpha_t} \boldsymbol{x}_t}{1-\alpha_t}+\frac{\sqrt{\bar{\alpha}_{t-1}} \boldsymbol{x}_0}{1-\bar{\alpha}_{t-1}}\right)}{\frac{1-\bar{\alpha}_t}{\left(1-\alpha_t\right)\left(1-\bar{\alpha}_{t-1}\right)}} \boldsymbol{x}_{t-1}\right]\right\} \\
= & \exp \left\{-\frac{1}{2}\left(\frac{1-\bar{\alpha}_t}{\left(1-\alpha_t\right)\left(1-\bar{\alpha}_{t-1}\right)}\right)\left[\boldsymbol{x}_{t-1}^2-2 \frac{\left(\frac{\sqrt{\alpha_t} \boldsymbol{x}_t}{1-\alpha_t}+\frac{\sqrt{\bar{\alpha}_{t-1}} \boldsymbol{x}_0}{1-\bar{\alpha}_{t-1}}\right)\left(1-\alpha_t\right)\left(1-\bar{\alpha}_{t-1}\right)}{1-\bar{\alpha}_t} \boldsymbol{x}_{t-1}\right]\right\} \\
= & \exp \left\{-\frac{1}{2}\left(\frac{1}{\frac{\left(1-\alpha_t\right)\left(1-\bar{\alpha}_{t-1}\right)}{1-\bar{\alpha}_t}}\right)\left[\boldsymbol{x}_{t-1}^2-2 \frac{\sqrt{\alpha_t}\left(1-\bar{\alpha}_{t-1}\right) \boldsymbol{x}_t+\sqrt{\bar{\alpha}_{t-1}}\left(1-\alpha_t\right) \boldsymbol{x}_0}{1-\bar{\alpha}_t} \boldsymbol{x}_{t-1}\right]\right\} \\
\propto & \mathcal{N}\left(\boldsymbol{x}_{t-1} ; \underbrace{\frac{\sqrt{\alpha_t}\left(1-\bar{\alpha}_{t-1}\right) \boldsymbol{x}_t+\sqrt{\bar{\alpha}_{t-1}}\left(1-\alpha_t\right) \boldsymbol{x}_0}{1-\bar{\alpha}_t}}_{\boldsymbol{\mu}_q\left(\boldsymbol{x}_t, \boldsymbol{x}_0\right)}, \underbrace{\frac{\left(1-\alpha_t\right)\left(1-\bar{\alpha}_{t-1}\right)}{1-\bar{\alpha}_t} \mathbf{I}}_{\boldsymbol{\Sigma}_q(t)}\right)
\end{align}
$$
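The completing-the-square argument can be cross-checked numerically in one dimension. In the exponent, the coefficient of $\boldsymbol{x}_{t-1}^2$ is the posterior precision and the coefficient of $\boldsymbol{x}_{t-1}$ is precision times mean, so both must agree with the closed-form $\boldsymbol{\mu}_q$ and $\boldsymbol{\Sigma}_q(t)$. The numbers below (`alpha_t`, `ab_prev`, `x_t`, `x_0`) are arbitrary assumptions, and `posterior_params` is a hypothetical helper name.

```python
import math


def posterior_params(x_t, x_0, alpha_t, ab_t, ab_prev):
    """Closed-form mean and variance of q(x_{t-1} | x_t, x_0) for scalars."""
    mean = (math.sqrt(alpha_t) * (1.0 - ab_prev) * x_t
            + math.sqrt(ab_prev) * (1.0 - alpha_t) * x_0) / (1.0 - ab_t)
    var = (1.0 - alpha_t) * (1.0 - ab_prev) / (1.0 - ab_t)
    return mean, var


# Arbitrary illustrative values; ab_prev plays the role of alpha_bar_{t-1}
alpha_t, ab_prev = 0.95, 0.6
ab_t = alpha_t * ab_prev          # alpha_bar_t = alpha_t * alpha_bar_{t-1}
x_t, x_0 = -0.3, 1.2

mean, var = posterior_params(x_t, x_0, alpha_t, ab_t, ab_prev)

# Coefficients read off the exponent before completing the square
precision = alpha_t / (1.0 - alpha_t) + 1.0 / (1.0 - ab_prev)
linear = (math.sqrt(alpha_t) * x_t / (1.0 - alpha_t)
          + math.sqrt(ab_prev) * x_0 / (1.0 - ab_prev))

var_gap = abs(var - 1.0 / precision)
mean_gap = abs(mean - linear / precision)
```

Both gaps should be at floating-point level. Note also that the variance depends only on the schedule, not on $\boldsymbol{x}_t$ or $\boldsymbol{x}_0$.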
Following Eq. (52), we also write $p_{\boldsymbol{\theta}}(\boldsymbol{x}_{t-1} \mid \boldsymbol{x}_t)$ as a Gaussian, with its covariance set equal to the one in Eq. (52):
$$\begin{align}
p_{\boldsymbol{\theta}}(\boldsymbol{x}_{t-1} \mid \boldsymbol{x}_t) = \mathcal{N}(\boldsymbol{x}_{t-1}; \underbrace{\boldsymbol{\mu}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)}_{\text{learned by model}}, \boldsymbol{\Sigma}_{q}(t))
\end{align}$$
Mirroring $\boldsymbol{\mu}_{q}(\boldsymbol{x}_t, \boldsymbol{x}_0)$ in Eq. (52), and replacing the unknown $\boldsymbol{x}_0$ with a model prediction $\hat{\boldsymbol{x}}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)$, we obtain:
$$\begin{align}
\boldsymbol{\mu}_{\boldsymbol{\theta}}\left(\boldsymbol{x}_t, t\right)
=\frac{\sqrt{\alpha_t}\left(1-\bar{\alpha}_{t-1}\right) \boldsymbol{x}_t+\sqrt{\bar{\alpha}_{t-1}}\left(1-\alpha_t\right) \overbrace{\hat{\boldsymbol{x}}_{\boldsymbol{\theta}}\left(\boldsymbol{x}_t, t\right)}^{\text{learned by model}}}{1-\bar{\alpha}_t}
\end{align}$$
1.3.4 Objective function of VDM
The reconstruction term and the prior matching term in Eq. (23) are small enough to ignore, which gives the objective:
$$\begin{align}
&\operatorname{arg}\max_{\boldsymbol{\theta}} \log{p(\boldsymbol{x}_0)} \\
\approx &\operatorname{arg}\min_{\boldsymbol{\theta}} \underbrace{\sum_{t=2}^T \underbrace{\mathbb{E}_{q\left(\boldsymbol{x}_t \mid \boldsymbol{x}_0\right)}\left[
D_{\mathrm{KL}}\left(\underbrace{q\left(\boldsymbol{x}_{t-1} \mid \boldsymbol{x}_t, \boldsymbol{x}_0\right)}_{\color{red}{\text{complexity posterior}}} \parallel \underbrace{p_{\boldsymbol{\theta}}\left(\boldsymbol{x}_{t-1} \mid \boldsymbol{x}_t\right)}_{\textcolor{red}{\text{Decoder of VDM}}}\right)
\right]}_{\text{denoising matching term}}}_{\textbf{Objective function to optimize}}
\end{align}$$
Starting from Eq. (56), the objective function of VDM becomes:
$$\begin{align}
&\operatorname{arg}\min_{\boldsymbol{\theta}} \sum_{t=2}^{T}\mathbb{E}_{q\left(\boldsymbol{x}_t \mid \boldsymbol{x}_0\right)}\left[ D_{\mathrm{KL}}\left(q(\boldsymbol{x}_{t-1} \mid \boldsymbol{x}_{t}, \boldsymbol{x}_{0})\parallel p_{\boldsymbol{\theta}}(\boldsymbol{x}_{t-1} \mid \boldsymbol{x}_{t})\right)\right] \\
=& \operatorname{arg}\min_{\boldsymbol{\theta}} \sum_{t=2}^{T}\mathbb{E}_{q\left(\boldsymbol{x}_t \mid \boldsymbol{x}_0\right)}\left[ D_{\mathrm{KL}}\left(\mathcal{N}(\boldsymbol{x}_{t-1}; \boldsymbol{\mu}_{q}(\boldsymbol{x}_t, \boldsymbol{x}_0), \boldsymbol{\Sigma}_{q}(t)) \parallel \mathcal{N}(\boldsymbol{x}_{t-1}; \boldsymbol{\mu}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t), \boldsymbol{\Sigma}_{q}(t))\right)\right] \\
=& \operatorname{arg}\min_{\boldsymbol{\theta}} \sum_{t=2}^{T}\mathbb{E}_{q\left(\boldsymbol{x}_t \mid \boldsymbol{x}_0\right)}\left[ \frac{1}{2}\left[\log\frac{|\boldsymbol{\Sigma}_q(t)|}{|\boldsymbol{\Sigma}_q(t)|}-d+\operatorname{tr}\left(\boldsymbol{\Sigma}_q(t)^{-1}\boldsymbol{\Sigma}_q(t)\right)+\left(\boldsymbol{\mu}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)-\boldsymbol{\mu}_{q}(\boldsymbol{x}_t, \boldsymbol{x}_0)\right)^{T}\boldsymbol{\Sigma}_q(t)^{-1}\left(\boldsymbol{\mu}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)-\boldsymbol{\mu}_{q}(\boldsymbol{x}_t, \boldsymbol{x}_0)\right)\right]\right] \quad \text{(apply the KL divergence between two Gaussians)} \\
=& \operatorname{arg}\min_{\boldsymbol{\theta}} \sum_{t=2}^{T}\mathbb{E}_{q\left(\boldsymbol{x}_t \mid \boldsymbol{x}_0\right)}\left[ \frac{1}{2}\left[\log 1-d+d+\left(\boldsymbol{\mu}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)-\boldsymbol{\mu}_{q}(\boldsymbol{x}_t, \boldsymbol{x}_0)\right)^{T}\boldsymbol{\Sigma}_q(t)^{-1}\left(\boldsymbol{\mu}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)-\boldsymbol{\mu}_{q}(\boldsymbol{x}_t, \boldsymbol{x}_0)\right)\right]\right] \\
=& \operatorname{arg}\min_{\boldsymbol{\theta}} \sum_{t=2}^{T}\mathbb{E}_{q\left(\boldsymbol{x}_t \mid \boldsymbol{x}_0\right)}\left[ \frac{1}{2}\left[\left(\boldsymbol{\mu}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)-\boldsymbol{\mu}_{q}(\boldsymbol{x}_t, \boldsymbol{x}_0)\right)^{T}\boldsymbol{\Sigma}_{q}(t)^{-1}\left(\boldsymbol{\mu}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)-\boldsymbol{\mu}_{q}(\boldsymbol{x}_t, \boldsymbol{x}_0)\right)\right]\right] \\
=& \operatorname{arg}\min_{\boldsymbol{\theta}} \sum_{t=2}^{T}\mathbb{E}_{q\left(\boldsymbol{x}_t \mid \boldsymbol{x}_0\right)}\left[ \frac{1}{2}\left[\left(\boldsymbol{\mu}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)-\boldsymbol{\mu}_{q}(\boldsymbol{x}_t, \boldsymbol{x}_0)\right)^{T}\left(\sigma_{q}^{2}(t)\mathbf{I}\right)^{-1}\left(\boldsymbol{\mu}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)-\boldsymbol{\mu}_{q}(\boldsymbol{x}_t, \boldsymbol{x}_0)\right)\right]\right] \quad \text{(write } \boldsymbol{\Sigma}_q(t) = \sigma_q^2(t)\mathbf{I} \text{)} \\
=& \operatorname{arg}\min_{\boldsymbol{\theta}} \sum_{t=2}^{T}\mathbb{E}_{q\left(\boldsymbol{x}_t \mid \boldsymbol{x}_0\right)}\left[ \frac{1}{2\sigma_{q}^{2}(t)}\left[\left\|\boldsymbol{\mu}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)-\boldsymbol{\mu}_{q}(\boldsymbol{x}_t, \boldsymbol{x}_0)\right\|_{2}^{2}\right] \right] \\
=& \operatorname{arg}\min_{\boldsymbol{\theta}} \sum_{t=2}^{T}\mathbb{E}_{q\left(\boldsymbol{x}_t \mid \boldsymbol{x}_0\right)}\left[ \frac{1}{2\sigma_{q}^{2}(t)}\left[\left\| \underbrace{\frac{\sqrt{\alpha_t}\left(1-\bar{\alpha}_{t-1}\right) \boldsymbol{x}_t+\sqrt{\bar{\alpha}_{t-1}}\left(1-\alpha_t\right) \overbrace{\hat{\boldsymbol{x}}_{\boldsymbol{\theta}}\left(\boldsymbol{x}_t, t\right)}^{\text{learned by model}}}{1-\bar{\alpha}_t}}_{\boldsymbol{\mu}_{\boldsymbol{\theta}} \text{ apply Eq.(54)}} - \underbrace{\frac{\sqrt{\alpha_t}\left(1-\bar{\alpha}_{t-1}\right) \boldsymbol{x}_t+\sqrt{\bar{\alpha}_{t-1}}\left(1-\alpha_t\right) \boldsymbol{x}_0}{1-\bar{\alpha}_t}}_{\boldsymbol{\mu}_{q} \text{ apply Eq.(52)}} \right\|_{2}^{2}\right] \right] \\
=& \operatorname{arg}\min_{\boldsymbol{\theta}} \sum_{t=2}^{T}\mathbb{E}_{q\left(\boldsymbol{x}_t \mid \boldsymbol{x}_0\right)} \left[ \frac{1}{2\sigma_{q}^{2}(t)} \left[ \left\| \frac{\sqrt{\bar{\alpha}_{t-1}}(1-\alpha_{t})\hat{\boldsymbol{x}}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t)}{1-\bar{\alpha}_t} - \frac{\sqrt{\bar{\alpha}_{t-1}}(1-\alpha_{t})\boldsymbol{x}_0}{1-\bar{\alpha}_t} \right\|_{2}^{2} \right] \right] \\
=& \operatorname{arg}\min_{\boldsymbol{\theta}} \sum_{t=2}^{T}\mathbb{E}_{q\left(\boldsymbol{x}_t \mid \boldsymbol{x}_0\right)} \left[ \frac{1}{2\sigma_{q}^{2}(t)} \left[ \left\| \frac{\sqrt{\bar{\alpha}_{t-1}}(1-\alpha_{t})}{1-\bar{\alpha}_t} \left(\hat{\boldsymbol{x}}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t) - \boldsymbol{x}_0\right) \right\|_{2}^{2} \right] \right] \\
=& \underbrace{\operatorname{arg}\min_{\boldsymbol{\theta}} \sum_{t=2}^{T}\mathbb{E}_{q\left(\boldsymbol{x}_t \mid \boldsymbol{x}_0\right)} \left[ \frac{1}{2\sigma_{q}^{2}(t)} \frac{\bar{\alpha}_{t-1}(1-\alpha_{t})^2}{(1-\bar{\alpha}_t)^2} \left[ \left\| \hat{\boldsymbol{x}}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t) - \boldsymbol{x}_0 \right\|_{2}^{2} \right] \right]}_{\text{Objective function of Diffusion Model}}
\end{align}$$
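The key simplification here is that the KL divergence between two Gaussians with the same variance collapses to a weighted squared distance between their means, which in turn becomes a weighted squared error on the $\boldsymbol{x}_0$ prediction. The one-dimensional sketch below checks this for arbitrary made-up numbers. `kl_gauss_1d` implements the standard scalar Gaussian KL formula, `mu` is a hypothetical helper for the shared mean formula, and `x0_hat` stands in for a model output.

```python
import math


def kl_gauss_1d(m1, v1, m2, v2):
    """KL( N(m1, v1) || N(m2, v2) ) for scalar Gaussians."""
    return 0.5 * (math.log(v2 / v1) - 1.0 + v1 / v2 + (m2 - m1) ** 2 / v2)


# Arbitrary illustrative values; ab_prev plays the role of alpha_bar_{t-1}
alpha_t, ab_prev = 0.95, 0.6
ab_t = alpha_t * ab_prev
sigma2 = (1.0 - alpha_t) * (1.0 - ab_prev) / (1.0 - ab_t)  # sigma_q^2(t)


def mu(x_t, x0_like):
    """Shared mean formula: mu_q when given x_0, mu_theta when given a prediction."""
    return (math.sqrt(alpha_t) * (1.0 - ab_prev) * x_t
            + math.sqrt(ab_prev) * (1.0 - alpha_t) * x0_like) / (1.0 - ab_t)


x_t, x_0, x0_hat = -0.3, 1.2, 0.9  # x0_hat: stand-in for the model prediction

# Left-hand side: KL between the true posterior and the decoder
kl = kl_gauss_1d(mu(x_t, x_0), sigma2, mu(x_t, x0_hat), sigma2)

# Right-hand side: weighted squared error on the x_0 prediction
weight = ab_prev * (1.0 - alpha_t) ** 2 / (2.0 * sigma2 * (1.0 - ab_t) ** 2)
weighted_error = weight * (x0_hat - x_0) ** 2
```

The two quantities `kl` and `weighted_error` should match up to rounding, because the $\boldsymbol{x}_t$ terms in the two means cancel exactly.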
Estimating the sum over $t$ in Eq. (67) by Monte Carlo sampling of the time step gives:
$$\begin{align}
\operatorname{arg}\min_{\boldsymbol{\theta}} \mathbb{E}_{t\sim \mathbf{U}(2,T)} \left[ \mathbb{E}_{q\left(\boldsymbol{x}_t \mid \boldsymbol{x}_0\right)} \left[ \frac{1}{2\sigma_{q}^{2}(t)} \frac{\bar{\alpha}_{t-1}(1-\alpha_{t})^2}{(1-\bar{\alpha}_t)^2} \left[ \left\| \hat{\boldsymbol{x}}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t) - \boldsymbol{x}_0 \right\|_{2}^{2} \right] \right] \right]
\end{align}$$
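A Monte Carlo estimate of this objective can be sketched in a few lines. Everything below is an illustrative assumption rather than a reference implementation: the short schedule in `alphas`, the toy data point `x_0`, and in particular `naive_predictor`, which is a hand-written stand-in for the neural network $\hat{\boldsymbol{x}}_{\boldsymbol{\theta}}$.

```python
import math
import random

T = 50
# Hypothetical schedule alpha_1 .. alpha_T (alphas[t - 1] holds alpha_t)
alphas = [1.0 - (1e-3 + 0.05 * i / (T - 1)) for i in range(T)]
alpha_bars, running = [], 1.0
for a in alphas:
    running *= a
    alpha_bars.append(running)  # alpha_bars[t - 1] holds alpha_bar_t


def loss_weight(t):
    """Weight 1 / (2 sigma_q^2(t)) * alpha_bar_{t-1} (1 - alpha_t)^2 / (1 - alpha_bar_t)^2."""
    a_t, ab_t, ab_prev = alphas[t - 1], alpha_bars[t - 1], alpha_bars[t - 2]
    sigma2 = (1.0 - a_t) * (1.0 - ab_prev) / (1.0 - ab_t)
    return ab_prev * (1.0 - a_t) ** 2 / (2.0 * sigma2 * (1.0 - ab_t) ** 2)


def mc_loss(x_0, predictor, num_samples, rng):
    """Average weighted squared error over sampled time steps and noise."""
    total = 0.0
    for _ in range(num_samples):
        t = rng.randint(2, T)  # t drawn uniformly from {2, ..., T}
        ab_t = alpha_bars[t - 1]
        # Sample x_t ~ q(x_t | x_0) in one shot using the closed form
        x_t = [math.sqrt(ab_t) * v + math.sqrt(1.0 - ab_t) * rng.gauss(0.0, 1.0) for v in x_0]
        x0_hat = predictor(x_t, t)
        total += loss_weight(t) * sum((p - v) ** 2 for p, v in zip(x0_hat, x_0))
    return total / num_samples


def naive_predictor(x_t, t):
    """Hypothetical stand-in for the network: rescale x_t and ignore the noise."""
    return [v / math.sqrt(alpha_bars[t - 1]) for v in x_t]


x_0 = [0.5, -1.0, 2.0]
naive_loss = mc_loss(x_0, naive_predictor, 2000, random.Random(0))
# An oracle that always returns x_0 exactly incurs zero loss
oracle_loss = mc_loss(x_0, lambda x_t, t: list(x_0), 2000, random.Random(0))
```

In actual training the predictor would be a network with parameters $\boldsymbol{\theta}$ and the average would be minimized by gradient descent. The oracle case only illustrates that the loss vanishes when the prediction equals $\boldsymbol{x}_0$.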
In summary, the VDM learns to predict the original image $\boldsymbol{x}_0$, as shown by the term $\|\hat{\boldsymbol{x}}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) - \boldsymbol{x}_0\|_{2}^{2}$ in the objective function.