系统理解扩散模型(Diffusion Models):从柏拉图洞穴之喻开始(中)

系统理解扩散模型(Diffusion Models):从柏拉图洞穴之喻开始(中)

本文参考"Understanding Diffusion Models: A Unified Perspective"

变分扩散模型(Variational Diffusion Models)

设想我们给HVAE模型增添三个限制条件:

  • 潜在变量的维度和数据的维度相同;
  • 每一层级的编码器不是通过学习得到的,而是事先预定好的线性高斯模型;
  • 最后一层(第 T T T层)的潜在变量分布是一个标准高斯分布。

加上上述条件的HVAE模型就是变分扩散模型(Variational Diffusion Models,VDM)。

由第一个条件,我们可以先统一符号:用 x 0 x_0 x0表示真实的数据样本,而用 x t , t ∈ [ 1 , T ] x_t, t \in [1, T] xt,t[1,T]表示对应第 t t t层的潜在变量。此时,后验分布可以重写为:
q ( x 1 : T ∣ x 0 ) = ∏ t = 1 T q ( x t ∣ x t − 1 ) \begin{equation} q(x_{1:T}|x_0) = \prod_{t=1}^T q(x_t|x_{t-1}) \end{equation} q(x1:Tx0)=t=1Tq(xtxt1)

根据第二个条件,我们将高斯编码器的均值设置为 μ t ( x t ) = α t x t − 1 \mu_t(x_t)=\sqrt \alpha_t x_{t-1} μt(xt)=α txt1,并将其方差设置为 Σ t ( x t ) = ( 1 − α t ) I \Sigma_t(x_t) = (1-\alpha_t) I Σt(xt)=(1αt)I。此时,编码器可以表示为:
q ( x t ∣ x t − 1 ) = N ( x t ; α t x t − 1 , ( 1 − α t ) I ) \begin{equation} q(x_t|x_{t-1}) = \mathcal{N}(x_t; \sqrt \alpha_t x_{t-1}, (1-\alpha_t) I) \end{equation} q(xtxt1)=N(xt;α txt1,(1αt)I)

根据第三个条件, α t \alpha_t αt的值需要遵循一定的规律,使得最后一层的潜在分布 p ( x T ) p(x_T) p(xT)是一个标准高斯分布。此时,VDM的联合分布可以重写为
p ( x 0 : T ) = p ( x T ) ∏ t = 1 T p θ ( x t − 1 ∣ x t ) where p ( x T ) = N ( x T ; 0 , I ) \begin{align} p(x_{0:T}) &= p(x_T)\prod_{t=1}^{T}p_\theta(x_{t-1}|x_t) \\ &\text{where} \\ &p(x_T) = \mathcal{N}(x_T; 0, I) \end{align} p(x0:T)=p(xT)t=1Tpθ(xt1xt)wherep(xT)=N(xT;0,I)

VDM

如果以图片为输入,这就相当于不断给这张图片加上一系列的噪声,只至输出为纯高斯噪声。值得注意的是,由于编码过程就是按照既定过程加高斯噪声,编码器分布 q ( x t ∣ x t − 1 ) q(x_t|x_{t-1}) q(xtxt1)不再有参数 ϕ \phi ϕ。也就是说,对于VDM模型,我们关注 p θ ( x t − 1 ∣ x t ) p_\theta(x_{t-1}|x_t) pθ(xt1xt)即可,并以此生成新数据。具体而言,训练完成后,我们从 p ( x T ) p(x_T) p(xT)采样出高斯噪声,然后逐步执行去噪过程 p θ ( x t − 1 ∣ x t ) p_\theta(x_{t-1}|x_t) pθ(xt1xt)生成新的 x 0 x_0 x0

相似地,我们可以通过最大化ELBO来优化VDM:
log ⁡ p ( x ) = log ⁡ ∫ p ( x 0 : T ) d x 1 : T = log ⁡ ∫ p ( x 0 : T ) q ( x 1 : T ∣ x 0 ) q ( x 1 : T ∣ x 0 ) d x 1 : T = log ⁡ E q ( x 1 : T ∣ x 0 ) [ p ( x 0 : T ) q ( x 1 : T ∣ x 0 ) ] ≥ E q ( x 1 : T ∣ x 0 ) [ log ⁡ p ( x 0 : T ) q ( x 1 : T ∣ x 0 ) ] = E q ( x 1 : T ∣ x 0 ) [ log ⁡ p ( x T ) ∏ t = 1 T p θ ( x t − 1 ∣ x t ) ∏ t = 1 T q ( x t ∣ x t − 1 ) ] = E q ( x 1 : T ∣ x 0 ) [ log ⁡ p ( x T ) p θ ( x 0 ∣ x 1 ) ∏ t = 2 T p θ ( x t − 1 ∣ x t ) q ( x T ∣ x T − 1 ) ∏ t = 1 T − 1 q ( x t ∣ x t − 1 ) ] = E q ( x 1 : T ∣ x 0 ) [ log ⁡ p ( x T ) p θ ( x 0 ∣ x 1 ) ∏ t = 1 T − 1 p θ ( x t ∣ x t + 1 ) q ( x T ∣ x T − 1 ) ∏ t = 1 T − 1 q ( x t ∣ x t − 1 ) ] = E q ( x 1 : T ∣ x 0 ) [ log ⁡ p ( x T ) p θ ( x 0 ∣ x 1 ) q ( x T ∣ x T − 1 ) ] + E q ( x 1 : T ∣ x 0 ) [ log ⁡ ∏ t = 1 T − 1 p θ ( x t ∣ x t + 1 ) q ( x t ∣ x t − 1 ) ] = E q ( x 1 : T ∣ x 0 ) [ log ⁡ p θ ( x 0 ∣ x 1 ) ] + E q ( x 1 : T ∣ x 0 ) [ log ⁡ p ( x T ) q ( x T ∣ x T − 1 ) ] + E q ( x 1 : T ∣ x 0 ) [ ∑ t = 1 T − 1 log ⁡ p θ ( x t ∣ x t + 1 ) q ( x t ∣ x t − 1 ) ] = E q ( x 1 : T ∣ x 0 ) [ log ⁡ p θ ( x 0 ∣ x 1 ) ] + E q ( x 1 : T ∣ x 0 ) [ log ⁡ p ( x T ) q ( x T ∣ x T − 1 ) ] + ∑ t = 1 T − 1 E q ( x 1 : T ∣ x 0 ) [ log ⁡ p θ ( x t ∣ x t + 1 ) q ( x t ∣ x t − 1 ) ] = E q ( x 1 ∣ x 0 ) [ log ⁡ p θ ( x 0 ∣ x 1 ) ] + E q ( x T − 1 , x T ∣ x 0 ) [ log ⁡ p ( x T ) q ( x T ∣ x T − 1 ) ] + ∑ t = 1 T − 1 E q ( x t − 1 , x t , x t + 1 ∣ x 0 ) [ log ⁡ p θ ( x t ∣ x t + 1 ) q ( x t ∣ x t − 1 ) ] = E q ( x 1 ∣ x 0 ) [ log ⁡ p θ ( x 0 ∣ x 1 ) ] − E q ( x T − 1 ∣ x 0 ) [ D K L ( q ( x T ∣ x T − 1 )   ∣ ∣   p ( x T ) ) ] − ∑ t = 1 T − 1 E q ( x t − 1 , x t + 1 ∣ x 0 ) [ D K L ( q ( x t ∣ x t − 1 )   ∣ ∣   p θ ( x t ∣ x t + 1 ) ) ] \begin{align} \log p(x) &= \log \int p(x_{0:T}) dx_{1:T} \\ &= \log \int \frac{p(x_{0:T})q(x_{1:T}|x_0)}{q(x_{1:T}|x_0)}dx_{1:T} \\ &= \log \mathbb{E}_{q(x_{1:T}|x_0)}\left[\frac{p(x_{0:T})}{q(x_{1:T}|x_0)}\right] \\ & \geq \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log\frac{p(x_{0:T})}{q(x_{1:T}|x_0)}\right] \\ &= \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log\frac{p(x_T)\prod_{t=1}^{T}p_\theta(x_{t-1}|x_t)}{\prod_{t=1}^T q(x_t|x_{t-1})}\right] \\ &= \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log\frac{p(x_T)p_\theta(x_0|x_1)\prod_{t=2}^{T}p_\theta(x_{t-1}|x_t)}{q(x_T|x_{T-1}) \prod_{t=1}^{T-1} q(x_t|x_{t-1})}\right] \\ &= \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log\frac{p(x_T)p_\theta(x_0|x_1)\prod_{t=1}^{T-1}p_\theta(x_{t}|x_{t+1})}{q(x_T|x_{T-1}) \prod_{t=1}^{T-1} q(x_t|x_{t-1})}\right] \\ &= \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log\frac{p(x_T)p_\theta(x_0|x_1)}{q(x_T|x_{T-1}) }\right] \\ &\quad+ \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log\prod_{t=1}^{T-1}\frac{p_\theta(x_{t}|x_{t+1})}{q(x_t|x_{t-1})}\right] \\ &= \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log p_\theta(x_0|x_1)\right] + \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log\frac{p(x_T)}{q(x_T|x_{T-1}) }\right]\\ &\quad+\mathbb{E}_{q(x_{1:T}|x_0)}\left[\sum_{t=1}^{T-1}\log\frac{p_\theta(x_{t}|x_{t+1})}{q(x_t|x_{t-1})}\right] \\ &=\mathbb{E}_{q(x_{1:T}|x_0)}\left[\log p_\theta(x_0|x_1)\right] + \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log\frac{p(x_T)}{q(x_T|x_{T-1}) }\right]\\ &\quad+\sum_{t=1}^{T-1}\mathbb{E}_{q(x_{1:T}|x_0)}\left[\log\frac{p_\theta(x_{t}|x_{t+1})}{q(x_t|x_{t-1})}\right] \\ &=\mathbb{E}_{q(x_{1}|x_0)}\left[\log p_\theta(x_0|x_1)\right] + \mathbb{E}_{q(x_{T-1}, x_T|x_0)}\left[\log\frac{p(x_T)}{q(x_T|x_{T-1}) }\right]\\ &\quad+\sum_{t=1}^{T-1}\mathbb{E}_{q(x_{t-1}, x_t, x_{t+1}|x_0)}\left[\log\frac{p_\theta(x_{t}|x_{t+1})}{q(x_t|x_{t-1})}\right] \\ &=\mathbb{E}_{q(x_{1}|x_0)}\left[\log p_\theta(x_0|x_1)\right] \\ &\quad- \mathbb{E}_{q(x_{T-1}|x_0)}\left[D_{KL}(q(x_T|x_{T-1}) \ ||\ p(x_T))\right] \\ &\quad-\sum_{t=1}^{T-1}\mathbb{E}_{q(x_{t-1}, x_{t+1}|x_0)}\left[D_{KL}(q(x_t|x_{t-1})\ ||\ p_\theta(x_{t}|x_{t+1}))\right] \end{align} logp(x)=logp(x0:T)dx1:T=logq(x1:Tx0)p(x0:T)q(x1:Tx0)dx1:T=logEq(x1:Tx0)[q(x1:Tx0)p(x0:T)]Eq(x1:Tx0)[logq(x1:Tx0)p(x0:T)]=Eq(x1:Tx0)[logt=1Tq(xtxt1)p(xT)t=1Tpθ(xt1xt)]=Eq(x1:Tx0)[logq(xTxT1)t=1T1q(xtxt1)p(xT)pθ(x0x1)t=2Tpθ(xt1xt)]=Eq(x1:Tx0)[logq(xTxT1)t=1T1q(xtxt1)p(xT)pθ(x0x1)t=1T1pθ(xtxt+1)]=Eq(x1:Tx0)[logq(xTxT1)p(xT)pθ(x0x1)]+Eq(x1:Tx0)[logt=1T1q(xtxt1)pθ(xtxt+1)]=Eq(x1:Tx0)[logpθ(x0x1)]+Eq(x1:Tx0)[logq(xTxT1)p(xT)]+Eq(x1:Tx0)[t=1T1logq(xtxt1)pθ(xtxt+1)]=Eq(x1:Tx0)[logpθ(x0x1)]+Eq(x1:Tx0)[logq(xTxT1)p(xT)]+t=1T1Eq(x1:Tx0)[logq(xtxt1)pθ(xtxt+1)]=Eq(x1x0)[logpθ(x0x1)]+Eq(xT1,xTx0)[logq(xTxT1)p(xT)]+t=1T1Eq(xt1,xt,xt+1x0)[logq(xtxt1)pθ(xtxt+1)]=Eq(x1x0)[logpθ(x0x1)]Eq(xT1x0)[DKL(q(xTxT1) ∣∣ p(xT))]t=1T1Eq(xt1,xt+1x0)[DKL(q(xtxt1) ∣∣ pθ(xtxt+1))]

  1. E q ( x 1 ∣ x 0 ) [ log ⁡ p θ ( x 0 ∣ x 1 ) ] \mathbb{E}_{q(x_{1}|x_0)}\left[\log p_\theta(x_0|x_1)\right] Eq(x1x0)[logpθ(x0x1)]是恢复项,表示了给定 x 1 x_1 x1预测原始数据样本 x 0 x_0 x0的对数概率;
  2. E q ( x T − 1 ∣ x 0 ) [ D K L ( q ( x T ∣ x T − 1 )   ∣ ∣   p ( x T ) ) ] \mathbb{E}_{q(x_{T-1}|x_0)}\left[D_{KL}(q(x_T|x_{T-1}) \ ||\ p(x_T))\right] Eq(xT1x0)[DKL(q(xTxT1) ∣∣ p(xT))]是先验匹配项,要求最后的潜在分布和高斯先验一致,注意,这一项没有可优化的参数;
  3. E q ( x t − 1 , x t + 1 ∣ x 0 ) [ D K L ( q ( x t ∣ x t − 1 )   ∣ ∣   p θ ( x t ∣ x t + 1 ) ) ] \mathbb{E}_{q(x_{t-1}, x_{t+1}|x_0)}\left[D_{KL}(q(x_t|x_{t-1})\ ||\ p_\theta(x_{t}|x_{t+1}))\right] Eq(xt1,xt+1x0)[DKL(q(xtxt1) ∣∣ pθ(xtxt+1))]是一致性项,该项旨在限定在 x t x_t xt处的分布一致,也就是说一个去噪步和对应的加噪步应该保持一致。我们可以通过训练 p θ ( x t ∣ x t + 1 ) p_\theta(x_{t}|x_{t+1}) pθ(xtxt+1),使其与高斯分布 q ( x t ∣ x t − 1 ) q(x_t|x_{t-1}) q(xtxt1)一致来最小化该项。

Optimizing VDM

优化VDM的花费主要在第三项,因为我们得对所有的中间步 t t t进行优化。

但是,上述推导中第三项要同时对两个变量 { x t − 1 , x t + 1 } \{x_{t-1}, x_{t+1}\} {xt1,xt+1}求期望,其对应的蒙特卡洛估计方差一般比对一个变量的要高。接下来,我们进一步推导出一个ELBO,使得最终每一项只需对一个变量求期望。关键的一步是利用 q ( x t ∣ x t − 1 ) = q ( x t ∣ x t − 1 , x 0 ) q(x_t|x_{t-1})=q(x_t|x_{t-1}, x_0) q(xtxt1)=q(xtxt1,x0),这是因为在马尔可夫过程中条件 x 0 x_0 x0是多余的。进一步,我们利用贝叶斯公式:
q ( x t ∣ x t − 1 , x 0 ) = q ( x t − 1 ∣ x t , x 0 ) q ( x t ∣ x 0 ) q ( x t − 1 ∣ x 0 ) \begin{equation} q(x_t|x_{t-1}, x_0) = \frac{q(x_{t-1}|x_{t}, x_0)q(x_t|x_0)}{q(x_{t-1}|x_0)} \end{equation} q(xtxt1,x0)=q(xt1x0)q(xt1xt,x0)q(xtx0)
代入上文中的推导,我们得到:
log ⁡ p ( x ) ≥ E q ( x 1 : T ∣ x 0 ) [ log ⁡ p ( x 0 : T ) q ( x 1 : T ∣ x 0 ) ] = E q ( x 1 : T ∣ x 0 ) [ log ⁡ p ( x T ) ∏ t = 1 T p θ ( x t − 1 ∣ x t ) ∏ t = 1 T q ( x t ∣ x t − 1 ) ] = E q ( x 1 : T ∣ x 0 ) [ log ⁡ p ( x T ) p θ ( x 0 ∣ x 1 ) ∏ t = 2 T p θ ( x t − 1 ∣ x t ) q ( x 1 ∣ x 0 ) ∏ t = 2 T q ( x t ∣ x t − 1 ) ] = E q ( x 1 : T ∣ x 0 ) [ log ⁡ p ( x T ) p θ ( x 0 ∣ x 1 ) ∏ t = 2 T p θ ( x t − 1 ∣ x t ) q ( x 1 ∣ x 0 ) ∏ t = 2 T q ( x t ∣ x t − 1 , x 0 ) ] = E q ( x 1 : T ∣ x 0 ) [ log ⁡ p ( x T ) p θ ( x 0 ∣ x 1 ) q ( x 1 ∣ x 0 ) + log ⁡ ∏ t = 2 T p θ ( x t − 1 ∣ x t ) q ( x t ∣ x t − 1 , x 0 ) ] = E q ( x 1 : T ∣ x 0 ) [ log ⁡ p ( x T ) p θ ( x 0 ∣ x 1 ) q ( x 1 ∣ x 0 ) + log ⁡ ∏ t = 2 T p θ ( x t − 1 ∣ x t ) q ( x t − 1 ∣ x t , x 0 ) q ( x t ∣ x 0 ) q ( x t − 1 ∣ x 0 ) ] = E q ( x 1 : T ∣ x 0 ) [ log ⁡ p ( x T ) p θ ( x 0 ∣ x 1 ) q ( x 1 ∣ x 0 ) + log ⁡ q ( x 1 ∣ x 0 ) q ( x T ∣ x 0 ) + log ⁡ ∏ t = 2 T p θ ( x t − 1 ∣ x t ) q ( x t − 1 ∣ x t , x 0 ) ] = E q ( x 1 : T ∣ x 0 ) [ log ⁡ p ( x T ) p θ ( x 0 ∣ x 1 ) q ( x T ∣ x 0 ) + log ⁡ ∏ t = 2 T p θ ( x t − 1 ∣ x t ) q ( x t − 1 ∣ x t , x 0 ) ] = E q ( x 1 : T ∣ x 0 ) [ log ⁡ p θ ( x 0 ∣ x 1 ) ] + E q ( x 1 : T ∣ x 0 ) [ log ⁡ p ( x T ) q ( x T ∣ x 0 ) ] + ∑ t = 2 T E q ( x 1 : T ∣ x 0 ) [ log ⁡ p θ ( x t − 1 ∣ x t ) q ( x t − 1 ∣ x t , x 0 ) ] = E q ( x 1 ∣ x 0 ) [ log ⁡ p θ ( x 0 ∣ x 1 ) ] + E q ( x T ∣ x 0 ) [ log ⁡ p ( x T ) q ( x T ∣ x 0 ) ] + ∑ t = 2 T E q ( x t − 1 , x t ∣ x 0 ) [ log ⁡ p θ ( x t − 1 ∣ x t ) q ( x t − 1 ∣ x t , x 0 ) ] = E q ( x 1 ∣ x 0 ) [ log ⁡ p θ ( x 0 ∣ x 1 ) ] − D K L ( q ( x T ∣ x 0 )   ∥   p ( x T ) ) − ∑ t = 2 T E q ( x t ∣ x 0 ) [ D K L ( q ( x t − 1 ∣ x t , x 0 )   ∥   p θ ( x t − 1 ∣ x t ) ) ] \begin{align*} \log p(x) &\geq \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log\frac{p(x_{0:T})}{q(x_{1:T}|x_0)}\right] \\ &= \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log\frac{p(x_T)\prod_{t=1}^{T}p_\theta(x_{t-1}|x_t)}{\prod_{t=1}^T q(x_t|x_{t-1})}\right] \\ &= \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log\frac{p(x_T)p_\theta(x_0|x_1)\prod_{t=2}^{T}p_\theta(x_{t-1}|x_t)}{q(x_1|x_{0}) \prod_{t=2}^{T} q(x_t|x_{t-1})}\right] \\ &= \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log\frac{p(x_T)p_\theta(x_0|x_1)\prod_{t=2}^{T}p_\theta(x_{t-1}|x_t)}{q(x_1|x_{0}) \prod_{t=2}^{T} q(x_t|x_{t-1}, x_0)}\right] \\ &= \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log\frac{p(x_T)p_\theta(x_0|x_1)}{q(x_1|x_{0})}+ \log \prod_{t=2}^{T}\frac{p_\theta(x_{t-1}|x_t)}{q(x_t|x_{t-1}, x_0)}\right] \\ &= \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log\frac{p(x_T)p_\theta(x_0|x_1)}{q(x_1|x_{0})}+ \log \prod_{t=2}^{T}\frac{p_\theta(x_{t-1}|x_t)}{ \frac{q(x_{t-1}|x_{t}, x_0)q(x_t|x_0)}{q(x_{t-1}|x_0)}}\right] \\ &= \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log\frac{p(x_T)p_\theta(x_0|x_1)}{q(x_1|x_{0})}+ \log \frac{q(x_1|x_0)}{q(x_T|x_0)}+ \log \prod_{t=2}^{T}\frac{p_\theta(x_{t-1}|x_t)}{q(x_{t-1}|x_{t}, x_0)}\right]\\ &=\mathbb{E}_{q(x_{1:T}|x_0)}\left[\log\frac{p(x_T)p_\theta(x_0|x_1)}{q(x_T|x_0)}+ \log \prod_{t=2}^{T}\frac{p_\theta(x_{t-1}|x_t)}{q(x_{t-1}|x_{t}, x_0)}\right] \\ &=\mathbb{E}_{q(x_{1:T}|x_0)}\left[\log p_\theta(x_0|x_1)\right] + \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \frac{p(x_T)}{q(x_T|x_0)}\right] + \sum_{t=2}^T\mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \frac{p_\theta(x_{t-1}|x_t)}{q(x_{t-1}|x_{t}, x_0)}\right] \\ &=\mathbb{E}_{q(x_{1}|x_0)}\left[\log p_\theta(x_0|x_1)\right] + \mathbb{E}_{q(x_{T}|x_0)}\left[\log \frac{p(x_T)}{q(x_T|x_0)}\right] + \sum_{t=2}^T\mathbb{E}_{q(x_{t-1}, x_t|x_0)}\left[\log \frac{p_\theta(x_{t-1}|x_t)}{q(x_{t-1}|x_{t}, x_0)}\right] \\ &=\mathbb{E}_{q(x_{1}|x_0)}\left[\log p_\theta(x_0|x_1)\right] -D_{KL}(q(x_{T}|x_0)\ \|\ p(x_T))- \sum_{t=2}^T\mathbb{E}_{q(x_t|x_0)}\left[D_{KL}(q(x_{t-1}|x_{t}, x_0)\ \|\ p_\theta(x_{t-1}|x_t))\right] \end{align*} logp(x)Eq(x1:Tx0)[logq(x1:Tx0)p(x0:T)]=Eq(x1:Tx0)[logt=1Tq(xtxt1)p(xT)t=1Tpθ(xt1xt)]=Eq(x1:Tx0)[logq(x1x0)t=2Tq(xtxt1)p(xT)pθ(x0x1)t=2Tpθ(xt1xt)]=Eq(x1:Tx0)[logq(x1x0)t=2Tq(xtxt1,x0)p(xT)pθ(x0x1)t=2Tpθ(xt1xt)]=Eq(x1:Tx0)[logq(x1x0)p(xT)pθ(x0x1)+logt=2Tq(xtxt1,x0)pθ(xt1xt)]=Eq(x1:Tx0) logq(x1x0)p(xT)pθ(x0x1)+logt=2Tq(xt1x0)q(xt1xt,x0)q(xtx0)pθ(xt1xt) =Eq(x1:Tx0)[logq(x1x0)p(xT)pθ(x0x1)+logq(xTx0)q(x1x0)+logt=2Tq(xt1xt,x0)pθ(xt1xt)]=Eq(x1:Tx0)[logq(xTx0)p(xT)pθ(x0x1)+logt=2Tq(xt1xt,x0)pθ(xt1xt)]=Eq(x1:Tx0)[logpθ(x0x1)]+Eq(x1:Tx0)[logq(xTx0)p(xT)]+t=2TEq(x1:Tx0)[logq(xt1xt,x0)pθ(xt1xt)]=Eq(x1x0)[logpθ(x0x1)]+Eq(xTx0)[logq(xTx0)p(xT)]+t=2TEq(xt1,xtx0)[logq(xt1xt,x0)pθ(xt1xt)]=Eq(x1x0)[logpθ(x0x1)]DKL(q(xTx0)  p(xT))t=2TEq(xtx0)[DKL(q(xt1xt,x0)  pθ(xt1xt))]

  1. E q ( x 1 ∣ x 0 ) [ log ⁡ p θ ( x 0 ∣ x 1 ) ] \mathbb{E}_{q(x_{1}|x_0)}\left[\log p_\theta(x_0|x_1)\right] Eq(x1x0)[logpθ(x0x1)]还是恢复项;
  2. D K L ( q ( x T ∣ x 0 )   ∥   p ( x T ) ) D_{KL}(q(x_{T}|x_0)\ \|\ p(x_T)) DKL(q(xTx0)  p(xT))依然是衡量了最终的分布和标准正态分布间的差距,根据假设,该项应该为0;
  3. E q ( x t ∣ x 0 ) [ D K L ( q ( x t − 1 ∣ x t , x 0 )   ∥   p θ ( x t − 1 ∣ x t ) ) ] \mathbb{E}_{q(x_t|x_0)}\left[D_{KL}(q(x_{t-1}|x_{t}, x_0)\ \|\ p_\theta(x_{t-1}|x_t))\right] Eq(xtx0)[DKL(q(xt1xt,x0)  pθ(xt1xt))]是去噪一致项,我们希望通过训练去噪过程 p θ ( x t − 1 ∣ x t ) p_\theta(x_{t-1}|x_t) pθ(xt1xt),使其符合真实的去噪过程 q ( x t − 1 ∣ x t , x 0 ) q(x_{t-1}|x_{t}, x_0) q(xt1xt,x0)

Optimizing VDM

优化VDM的重担依然落在最小化第三项上。

接下来,我们就讨论如何优化 D K L ( q ( x t − 1 ∣ x t , x 0 )   ∥   p θ ( x t − 1 ∣ x t ) ) D_{KL}(q(x_{t-1}|x_{t}, x_0)\ \|\ p_\theta(x_{t-1}|x_t)) DKL(q(xt1xt,x0)  pθ(xt1xt))

回忆一下VDM模型的第二个假设,我们假设编码器是线性高斯分布,也就是说
q ( x t ∣ x t − 1 , x 0 ) = q ( x t ∣ x t − 1 ) = N ( x t ; α t x t − 1 , ( 1 − α t ) I ) \begin{equation} q(x_t|x_{t-1}, x_0)=q(x_t|x_{t-1}) = \mathcal{N}(x_t; \sqrt \alpha_t x_{t-1}, (1-\alpha_t) I) \end{equation} q(xtxt1,x0)=q(xtxt1)=N(xt;α txt1,(1αt)I)
再利用之前提到的重新参数化技巧,我们可以得出:
x t = α t x t − 1 + 1 − α t ϵ ,  with  ϵ ∼ N ( ϵ ; 0 , I ) \begin{equation} x_t = \sqrt \alpha_t x_{t-1} + \sqrt{1-\alpha_t} \epsilon, \ \text{with} \ \epsilon \sim \mathcal{N}(\epsilon; 0, I) \end{equation} xt=α txt1+1αt ϵ, with ϵN(ϵ;0,I)
那么,假设我们有 2 T 2T 2T个随机噪声变量 { ϵ t ∗ , ϵ t } t = 0 T − 1 ∼ N ( ϵ ; 0 , I ) \{\epsilon^*_t, \epsilon_t\}^{T-1}_{t=0}\sim \mathcal{N}(\epsilon; 0, I) {ϵt,ϵt}t=0T1N(ϵ;0,I),我们可以得出以下递归等式:
x t = α t x t − 1 + 1 − α t ϵ t − 1 ∗ = α t ( α t − 1 x t − 2 + 1 − α t − 1 ϵ t − 2 ∗ ) + 1 − α t ϵ t − 1 ∗ = α t α t − 1 x t − 2 + α t − α t α t − 1 ϵ t − 2 ∗ + 1 − α t ϵ t − 1 ∗ = α t α t − 1 x t − 2 + α t − α t α t − 1 + 1 − α t ϵ t − 2 = α t α t − 1 x t − 2 + 1 − α t α t − 1 ϵ t − 2 = … = ∏ t = 1 T α t x 0 + 1 − ∏ t = 1 T α t ϵ 0 = α ˉ t x 0 + 1 − α ˉ t ϵ 0 ∼ N ( x t ; α ˉ t x 0 , ( 1 − α ˉ t ) I ) \begin{align} x_t &= \sqrt \alpha_t x_{t-1} + \sqrt{1-\alpha_t} \epsilon^*_{t-1} \\ &= \sqrt \alpha_t (\sqrt \alpha_{t-1} x_{t-2} + \sqrt{1-\alpha_{t-1}} \epsilon^*_{t-2}) + \sqrt{1-\alpha_t} \epsilon^*_{t-1} \\ &= \sqrt{\alpha_t\alpha_{t-1}}x_{t-2} + \sqrt{ \alpha_t- \alpha_t\alpha_{t-1}}\epsilon^*_{t-2} + \sqrt{1-\alpha_t} \epsilon^*_{t-1} \\ &= \sqrt{\alpha_t\alpha_{t-1}}x_{t-2} + \sqrt{\alpha_t- \alpha_t\alpha_{t-1} + 1-\alpha_t}\epsilon_{t-2} \\ &= \sqrt{\alpha_t\alpha_{t-1}}x_{t-2} + \sqrt{1- \alpha_t\alpha_{t-1}}\epsilon_{t-2}\\ &= \dots \\ &= \sqrt{\prod_{t=1}^T\alpha_t}x_{0} + \sqrt{1- \prod_{t=1}^T\alpha_t}\epsilon_{0} \\ &= \sqrt{\bar{\alpha}_t}x_{0} + \sqrt{1- \bar{\alpha}_t}\epsilon_{0} \\ &\sim \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}x_{0}, (1- \bar{\alpha}_t)I) \end{align} xt=α txt1+1αt ϵt1=α t(α t1xt2+1αt1 ϵt2)+1αt ϵt1=αtαt1 xt2+αtαtαt1 ϵt2+1αt ϵt1=αtαt1 xt2+αtαtαt1+1αt ϵt2=αtαt1 xt2+1αtαt1 ϵt2==t=1Tαt x0+1t=1Tαt ϵ0=αˉt x0+1αˉt ϵ0N(xt;αˉt x0,(1αˉt)I)
获得 q ( x t ∣ x 0 ) q(x_t|x_0) q(xtx0)的显示形式后,我们可以再次利用贝叶斯公式,写出 q ( x t ∣ x t − 1 , x 0 ) q(x_t|x_{t-1}, x_0) q(xtxt1,x0)的显示形式:
q ( x t − 1 ∣ x t , x 0 ) = q ( x t ∣ x t − 1 , x 0 ) q ( x t − 1 ∣ x 0 ) q ( x t ∣ x 0 ) = N ( x t ; α t x t − 1 , ( 1 − α t ) I ) N ( x t − 1 ; α ˉ t − 1 x 0 , ( 1 − α ˉ t − 1 ) I ) N ( x t ; α ˉ t x 0 , ( 1 − α ˉ t ) I ) ∝ exp ⁡ { − [ ( x t − α t x t − 1 ) 2 2 ( 1 − α t ) + ( x t − 1 − α ˉ t − 1 x 0 ) 2 2 ( 1 − α ˉ t − 1 ) − ( x t − α ˉ t x 0 ) 2 2 ( 1 − α ˉ t ) ] } ∝ exp ⁡ { − 1 2 ( 1 ( 1 − α t ) ( 1 − α ˉ t − 1 ) ( 1 − α ˉ t ) ) [ x t − 1 2 − 2 α t ( 1 − α ˉ t − 1 ) x t + α ˉ t − 1 ( 1 − α t ) x 0 1 − α ˉ t x t − 1 ] } ∝ N ( x t − 1 ; α t ( 1 − α ˉ t − 1 ) x t + α ˉ t − 1 ( 1 − α t ) x 0 1 − α ˉ t , ( 1 − α t ) ( 1 − α ˉ t − 1 ) ( 1 − α ˉ t ) I ) \begin{align*} q(x_{t-1}|x_{t}, x_0) &= \frac{q(x_{t}|x_{t-1}, x_0)q(x_{t-1}|x_0)}{q(x_{t}|x_0)} \\ &= \frac{\mathcal{N}(x_t; \sqrt{{\alpha}_{t}}x_{t-1}, (1- {\alpha}_{t})I)\mathcal{N}(x_{t-1}; \sqrt{\bar{\alpha}_{t-1}}x_{0}, (1- \bar{\alpha}_{t-1})I)}{\mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}x_{0}, (1- \bar{\alpha}_t)I)} \\ &\propto \exp \left\{ - \left[ \frac{(x_{t}-\sqrt{{\alpha}_{t}}x_{t-1})^2}{2(1-\alpha_t)} + \frac{(x_{t-1}-\sqrt{\bar{\alpha}_{t-1}}x_{0})^2}{2(1- \bar{\alpha}_{t-1})} - \frac{(x_t - \sqrt{\bar{\alpha}_t}x_{0})^2}{2(1- \bar{\alpha}_t)} \right] \right\} \\ &\propto \exp \left\{-\frac{1}{2}\left( \frac{1}{\frac{(1-\alpha_t)(1- \bar{\alpha}_{t-1})}{(1- \bar{\alpha}_{t})}}\right)\left[x_{t-1}^2-2\frac{\sqrt{{\alpha}_{t}}(1- \bar{\alpha}_{t-1})x_t+\sqrt{\bar{\alpha}_{t-1}}(1-\alpha_t)x_0}{1- \bar{\alpha}_{t}}x_{t-1} \right] \right\} \\ &\propto \mathcal{N}(x_{t-1}; \frac{\sqrt{{\alpha}_{t}}(1- \bar{\alpha}_{t-1})x_t+\sqrt{\bar{\alpha}_{t-1}}(1-\alpha_t)x_0}{1- \bar{\alpha}_{t}}, \frac{(1-\alpha_t)(1- \bar{\alpha}_{t-1})}{(1- \bar{\alpha}_{t})}I) \end{align*} q(xt1xt,x0)=q(xtx0)q(xtxt1,x0)q(xt1x0)=N(xt;αˉt x0,(1αˉt)I)N(xt;αt xt1,(1αt)I)N(xt1;αˉt1 x0,(1αˉt1)I)exp{[2(1αt)(xtαt xt1)2+2(1αˉt1)(xt1αˉt1 x0)22(1αˉt)(xtαˉt x0)2]}exp 21 (1αˉt)(1αt)(1αˉt1)1 [xt1221αˉtαt (1αˉt1)xt+αˉt1 (1αt)x0xt1] N(xt1;1αˉtαt (1αˉt1)xt+αˉt1 (1αt)x0,(1αˉt)(1αt)(1αˉt1)I)
根据以上推导,我们得出 q ( x t − 1 ∣ x t , x 0 ) q(x_{t-1}|x_{t}, x_0) q(xt1xt,x0)是一个正态分布,其中均值记做:
μ q ( x t , x 0 ) = α t ( 1 − α ˉ t − 1 ) x t + α ˉ t − 1 ( 1 − α t ) x 0 1 − α ˉ t \begin{equation} \mu_q(x_t, x_0) = \frac{\sqrt{{\alpha}_{t}}(1- \bar{\alpha}_{t-1})x_t+\sqrt{\bar{\alpha}_{t-1}}(1-\alpha_t)x_0}{1- \bar{\alpha}_{t}} \end{equation} μq(xt,x0)=1αˉtαt (1αˉt1)xt+αˉt1 (1αt)x0
方差记做:
Σ q ( t ) = σ q 2 ( t ) I = ( 1 − α t ) ( 1 − α ˉ t − 1 ) ( 1 − α ˉ t ) I \begin{equation} \Sigma_q(t) = \sigma_q^2(t)I = \frac{(1-\alpha_t)(1- \bar{\alpha}_{t-1})}{(1- \bar{\alpha}_{t})}I \end{equation} Σq(t)=σq2(t)I=(1αˉt)(1αt)(1αˉt1)I
为了使去噪过程 p θ ( x t − 1 ∣ x t ) p_\theta(x_{t-1}|x_t) pθ(xt1xt)和真实的去噪过程 q ( x t − 1 ∣ x t , x 0 ) q(x_{t-1}|x_{t}, x_0) q(xt1xt,x0)一致,我们可以将它建模成正态分布。具体而言,由于 α \alpha α是事先定好的,我们可以直接设置 p θ ( x t − 1 ∣ x t ) p_\theta(x_{t-1}|x_t) pθ(xt1xt)方差为 Σ q ( t ) = σ q 2 ( t ) I \Sigma_q(t) = \sigma_q^2(t)I Σq(t)=σq2(t)I;另外,我们设置其均值为 μ θ ( x t , t ) \mu_\theta(x_t,t) μθ(xt,t)。进而我们可以得到:
arg min ⁡ θ D K L ( q ( x t − 1 ∣ x t , x 0 )   ∥   p θ ( x t − 1 ∣ x t ) ) = arg min ⁡ θ D K L ( N ( x t − 1 ; μ q , Σ q ( t ) )   ∥   N ( x t − 1 ; μ θ , Σ q ( t ) ) ) = arg min ⁡ θ 1 2 [ log ⁡ ∣ Σ q ( t ) ∣ ∣ Σ q ( t ) ∣ − d + tr ( Σ q ( t ) − 1 Σ q ( t ) ) + ( μ θ − μ q ) ⊤ Σ q ( t ) − 1 ( μ θ − μ q ) ] = arg min ⁡ θ 1 2 σ q 2 ( t ) ( ∥ μ θ − μ q ) ∥ 2 2 ) \begin{align*} &\argmin_\theta D_{KL}(q(x_{t-1}|x_{t}, x_0)\ \|\ p_\theta(x_{t-1}|x_t)) \\ =&\argmin_\theta D_{KL}(\mathcal{N}(x_{t-1};\mu_q, \Sigma_q(t))\ \|\ \mathcal{N}(x_{t-1};\mu_\theta, \Sigma_q(t))) \\ =&\argmin_\theta\frac{1}{2}\left[\log \frac{ |\Sigma_q(t)|}{ |\Sigma_q(t)|} -d + \text{tr}( \Sigma_q(t)^{-1} \Sigma_q(t))+(\mu_{\theta}-\mu_{q})^\top \Sigma_q(t)^{-1} (\mu_{\theta}-\mu_{q})\right] \\ =&\argmin_\theta\frac{1}{2\sigma_q^2(t)}\left(\|\mu_{\theta}-\mu_{q})\|^2_2 \right) \end{align*} ===θargminDKL(q(xt1xt,x0)  pθ(xt1xt))θargminDKL(N(xt1;μq,Σq(t))  N(xt1;μθ,Σq(t)))θargmin21[logΣq(t)Σq(t)d+tr(Σq(t)1Σq(t))+(μθμq)Σq(t)1(μθμq)]θargmin2σq2(t)1(μθμq)22)

D K L ( N ( x ; μ x , Σ x )   ∥   N ( y ; μ y , Σ y ) ) = 1 2 [ log ⁡ ∣ Σ y ∣ ∣ Σ x ∣ − d + tr ( Σ y − 1 Σ x ) + ( μ y − μ x ) ⊤ Σ y − 1 ( μ y − μ x ) ] D_{KL}(\mathcal{N}(x;\mu_x, \Sigma_x)\ \|\ \mathcal{N}(y;\mu_y, \Sigma_y)) = \frac{1}{2}[\log \frac{|\Sigma_y|}{|\Sigma_x|}-d+\text{tr}(\Sigma_y^{-1}\Sigma_x)+(\mu_y-\mu_x)^\top\Sigma_y^{-1}(\mu_y-\mu_x)] DKL(N(x;μx,Σx)  N(y;μy,Σy))=21[logΣxΣyd+tr(Σy1Σx)+(μyμx)Σy1(μyμx)]

也就是说,我们的目标等价于优化 μ θ ( x t , t ) \mu_\theta(x_t, t) μθ(xt,t),使其和 μ q ( x t , x 0 ) \mu_q(x_t, x_0) μq(xt,x0)匹配。根据(36),我们可以将 μ θ ( x t , t ) \mu_\theta(x_t, t) μθ(xt,t)设为
μ θ ( x t , t ) = α t ( 1 − α ˉ t − 1 ) x t + α ˉ t − 1 ( 1 − α t ) x ^ θ ( x t , t ) 1 − α ˉ t \begin{equation} \mu_\theta(x_t, t) = \frac{\sqrt{{\alpha}_{t}}(1- \bar{\alpha}_{t-1})x_t+\sqrt{\bar{\alpha}_{t-1}}(1-\alpha_t)\hat{x}_\theta(x_t, t)}{1- \bar{\alpha}_{t}} \end{equation} μθ(xt,t)=1αˉtαt (1αˉt1)xt+αˉt1 (1αt)x^θ(xt,t)
有了这个显示形式,我们可以进一步改写我们的目标函数:
arg min ⁡ θ D K L ( q ( x t − 1 ∣ x t , x 0 )   ∥   p θ ( x t − 1 ∣ x t ) ) = arg min ⁡ θ D K L ( N ( x t − 1 ; μ q , Σ q ( t ) )   ∥   N ( x t − 1 ; μ θ , Σ q ( t ) ) ) = arg min ⁡ θ 1 2 σ q 2 ( t ) ( ∥ μ θ − μ q ) ∥ 2 2 ) = arg min ⁡ θ 1 2 σ q 2 ( t ) ( ∥ α t ( 1 − α ˉ t − 1 ) x t + α ˉ t − 1 ( 1 − α t ) x ^ θ ( x t , t ) 1 − α ˉ t − α t ( 1 − α ˉ t − 1 ) x t + α ˉ t − 1 ( 1 − α t ) x 0 1 − α ˉ t ∥ 2 2 ) = arg min ⁡ θ 1 2 σ q 2 ( t ) α ˉ t − 1 ( 1 − α t ) 2 ( 1 − α ˉ t ) 2 ( ∥ x ^ θ ( x t , t ) − x 0 ∥ 2 2 ) \begin{align*} &\argmin_\theta D_{KL}(q(x_{t-1}|x_{t}, x_0)\ \|\ p_\theta(x_{t-1}|x_t)) \\ =&\argmin_\theta D_{KL}(\mathcal{N}(x_{t-1};\mu_q, \Sigma_q(t))\ \|\ \mathcal{N}(x_{t-1};\mu_\theta, \Sigma_q(t))) \\ =&\argmin_\theta\frac{1}{2\sigma_q^2(t)}\left(\|\mu_{\theta}-\mu_{q})\|^2_2 \right) \\ =&\argmin_\theta\frac{1}{2\sigma_q^2(t)}\left(\|\frac{\sqrt{{\alpha}_{t}}(1- \bar{\alpha}_{t-1})x_t+\sqrt{\bar{\alpha}_{t-1}}(1-\alpha_t)\hat{x}_\theta(x_t, t)}{1- \bar{\alpha}_{t}}-\frac{\sqrt{{\alpha}_{t}}(1- \bar{\alpha}_{t-1})x_t+\sqrt{\bar{\alpha}_{t-1}}(1-\alpha_t)x_0}{1- \bar{\alpha}_{t}}\|^2_2\right) \\ =&\argmin_\theta\frac{1}{2\sigma_q^2(t)}\frac{\bar{\alpha}_{t-1}(1-\alpha_t)^2}{(1- \bar{\alpha}_{t})^2}(\|\hat{x}_\theta(x_t, t)-x_0\|_2^2) \end{align*} ====θargminDKL(q(xt1xt,x0)  pθ(xt1xt))θargminDKL(N(xt1;μq,Σq(t))  N(xt1;μθ,Σq(t)))θargmin2σq2(t)1(μθμq)22)θargmin2σq2(t)1(1αˉtαt (1αˉt1)xt+αˉt1 (1αt)x^θ(xt,t)1αˉtαt (1αˉt1)xt+αˉt1 (1αt)x022)θargmin2σq2(t)1(1αˉt)2αˉt1(1αt)2(x^θ(xt,t)x022)
因此,优化一个VDM本质上就是训练一个去噪神经网络,能够从加了任意噪声的图片还原出相应的原始版本。并且,对于优化ELBO目标中第三项,我们可以利用迭代步上的期望来近似该求和:
arg min ⁡ θ E t ∼ U { 2 , T } [ E q ( x t ∣ x 0 ) [ D K L ( q ( x t − 1 ∣ x t , x 0 )   ∥   p θ ( x t − 1 ∣ x t ) ) ] ] \begin{equation} \argmin_\theta \mathbb{E}_{t\sim U\{2, T\}}\left[\mathbb{E}_{q(x_t|x_0)}\left[D_{KL}(q(x_{t-1}|x_{t}, x_0)\ \|\ p_\theta(x_{t-1}|x_t))\right]\right] \end{equation} θargminEtU{2,T}[Eq(xtx0)[DKL(q(xt1xt,x0)  pθ(xt1xt))]]
上述优化过程可以通过对 t t t进行随机采样进行。

学习扩散噪声参数

上文中提到,每一步中的噪声参数 α t \alpha_t αt是事先预定好的,那么能否将其设置成可学习的呢?一个直接的办法就是使用一个神经网络 α ^ η ( t ) \hat{\alpha}_\eta(t) α^η(t)来对其进行建模。但是,这种方式很低效,因为在每一步中我们要多次调用这个网络来计算 α ˉ t \bar{\alpha}_t αˉt。尽管我们也可以提前计算完后将所需的值缓存下来,但是我们这里介绍一种基于对目标函数的进一步推导变形使得问题简化的方法。
1 2 σ q 2 ( t ) α ˉ t − 1 ( 1 − α t ) 2 ( 1 − α ˉ t ) 2 ( ∥ x ^ θ ( x t , t ) − x 0 ∥ 2 2 ) = 1 2 ( 1 − α t ) ( 1 − α ˉ t − 1 ) ( 1 − α ˉ t ) α ˉ t − 1 ( 1 − α t ) 2 ( 1 − α ˉ t ) 2 ( ∥ x ^ θ ( x t , t ) − x 0 ∥ 2 2 ) = 1 2 α ˉ t − 1 ( 1 − α t ) ( 1 − α ˉ t − 1 ) ( 1 − α ˉ t ) ( ∥ x ^ θ ( x t , t ) − x 0 ∥ 2 2 ) = 1 2 α ˉ t − 1 ( 1 − α ˉ t ) − α ˉ t ( 1 − α ˉ t − 1 ) ( 1 − α ˉ t − 1 ) ( 1 − α ˉ t ) ( ∥ x ^ θ ( x t , t ) − x 0 ∥ 2 2 ) = 1 2 ( α ˉ t − 1 1 − α ˉ t − 1 − α ˉ t 1 − α ˉ t ) ( ∥ x ^ θ ( x t , t ) − x 0 ∥ 2 2 ) \begin{align} &\frac{1}{2\sigma_q^2(t)}\frac{\bar{\alpha}_{t-1}(1-\alpha_t)^2}{(1- \bar{\alpha}_{t})^2}(\|\hat{x}_\theta(x_t, t)-x_0\|_2^2) \\ =&\frac{1}{2\frac{(1-\alpha_t)(1- \bar{\alpha}_{t-1})}{(1- \bar{\alpha}_{t})}}\frac{\bar{\alpha}_{t-1}(1-\alpha_t)^2}{(1- \bar{\alpha}_{t})^2}(\|\hat{x}_\theta(x_t, t)-x_0\|_2^2) \\ =&\frac{1}{2}\frac{\bar{\alpha}_{t-1}(1-\alpha_t)}{(1- \bar{\alpha}_{t-1})(1- \bar{\alpha}_{t})}(\|\hat{x}_\theta(x_t, t)-x_0\|_2^2) \\ =&\frac{1}{2}\frac{\bar{\alpha}_{t-1}(1-\bar\alpha_t)-\bar{\alpha}_{t}(1-\bar\alpha_{t-1})}{(1- \bar{\alpha}_{t-1})(1- \bar{\alpha}_{t})}(\|\hat{x}_\theta(x_t, t)-x_0\|_2^2)\\ =&\frac{1}{2}(\frac{\bar{\alpha}_{t-1}}{1- \bar{\alpha}_{t-1}}-\frac{\bar{\alpha}_{t}}{1-\bar{\alpha}_{t}})(\|\hat{x}_\theta(x_t, t)-x_0\|_2^2) \end{align} ====2σq2(t)1(1αˉt)2αˉt1(1αt)2(x^θ(xt,t)x022)2(1αˉt)(1αt)(1αˉt1)1(1αˉt)2αˉt1(1αt)2(x^θ(xt,t)x022)21(1αˉt1)(1αˉt)αˉt1(1αt)(x^θ(xt,t)x022)21(1αˉt1)(1αˉt)αˉt1(1αˉt)αˉt(1αˉt1)(x^θ(xt,t)x022)21(1αˉt1αˉt11αˉtαˉt)(x^θ(xt,t)x022)
由于, q ( x t ∣ x 0 ) q(x_t|x_0) q(xtx0)是高斯分布 N ( x t ; α ˉ t x 0 , ( 1 − α ˉ t ) I ) \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}x_{0}, (1- \bar{\alpha}_t)I) N(xt;αˉt x0,(1αˉt)I),根据信噪比(signal-to-noise ratio,SNR)的定义 SNR = μ 2 σ 2 \text{SNR}=\frac{\mu^2}{\sigma^2} SNR=σ2μ2,每一步的信噪比可以写作
SNR ( t ) = α ˉ t 1 − α ˉ t \begin{equation} \text{SNR}(t) = \frac{\bar{\alpha}_t}{1- \bar{\alpha}_t} \end{equation} SNR(t)=1αˉtαˉt
进而,优化目标可以简化为:
1 2 ( SNR ( t − 1 ) − SNR ( t ) ) ( ∥ x ^ θ ( x t , t ) − x 0 ∥ 2 2 ) \begin{equation} \frac{1}{2}(\text{SNR}(t-1)-\text{SNR}(t))(\|\hat{x}_\theta(x_t, t)-x_0\|_2^2) \end{equation} 21(SNR(t1)SNR(t))(x^θ(xt,t)x022)
在VDM中,我们要求SNR是随着步数 t t t单调递减的(逐渐加噪声使得SNR单调递减),直至第 T T T步时成为标准高斯噪声。因此,我们将SNR建模为:
SNR ( t ) = exp ⁡ ( − ω η ( t ) ) \begin{equation} \text{SNR}(t) = \exp(-\omega_\eta(t)) \end{equation} SNR(t)=exp(ωη(t))
其中,网络的参数是 η \eta η。那么,在优化(39)时,我们也要同时优化参数 η \eta η。注意到,(47)的形式使得我们可以得出
α ˉ t = sigmoid ( − ω η ( t ) ) 1 − α ˉ t = sigmoid ( ω η ( t ) ) \begin{align} \bar{\alpha}_t &= \text{sigmoid}(-\omega_\eta(t)) \\ 1-\bar{\alpha}_t &= \text{sigmoid}(\omega_\eta(t)) \end{align} αˉt1αˉt=sigmoid(ωη(t))=sigmoid(ωη(t))
上述两个式子使得我们可以很容易地依照(35)对在 x 0 x_0 x0上加上任意所需噪声得出 x t x_t xt

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 1
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值