[Theoretical Derivation] Diffusion Model

VAE and Multi-Layer VAE

Recall the following result from the earlier article, [Theoretical Derivation] Variational AutoEncoder (VAE):
$$\log p(x) = \mathbb E_{z\sim q(z|x)}\left[\log \frac{p(x,z)}{q(z|x)}\right] + \text{KL}(q(z|x)\,\|\,p(z|x)) \geq \mathbb E_{z\sim q(z|x)}\left[\log \frac{p(x,z)}{q(z|x)}\right]$$
The same inequality can also be derived as follows:
$$\log p(x) = \log \mathbb E_{z\sim q(z|x)}\left[\frac{p(x,z)}{q(z|x)}\right] \geq \mathbb E_{z\sim q(z|x)}\left[\log \frac{p(x,z)}{q(z|x)}\right]$$
where the inequality follows from Jensen's inequality, since $\log$ is concave.
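The bound can be checked numerically on a tiny discrete model. The sketch below is only an illustration: the binary latent, the probability values and the variable names are assumptions made up for the example. It computes the lower bound and the KL term, and confirms that their sum recovers $\log p(x)$.

```python
import math

# Hypothetical toy model: a binary latent z and one fixed observation x.
p_z = [0.3, 0.7]            # prior p(z)
p_x_given_z = [0.9, 0.2]    # likelihood p(x | z) of the observed x
q = [0.5, 0.5]              # an arbitrary variational posterior q(z | x)

p_joint = [p_z[k] * p_x_given_z[k] for k in range(2)]   # p(x, z)
p_x = sum(p_joint)                                       # marginal p(x)
posterior = [pj / p_x for pj in p_joint]                 # true posterior p(z | x)

# Lower bound E_q[log p(x, z) / q(z | x)] and KL(q(z | x) || p(z | x)).
elbo = sum(q[k] * math.log(p_joint[k] / q[k]) for k in range(2))
kl = sum(q[k] * math.log(q[k] / posterior[k]) for k in range(2))

log_px = math.log(p_x)
gap = log_px - elbo   # equals the KL term up to floating-point rounding
```

Choosing `q` equal to `posterior` would make the KL term vanish and the bound tight.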

Extending the single-layer VAE to a multi-layer VAE, as shown below, gives
[Figure: multi-layer VAE]
$$\begin{aligned} \log p(x) &= \log \int_{z_1}\int_{z_2} p(x, z_1,z_2)\, dz_1dz_2 \\ &= \log \int_{z_1}\int_{z_2} q(z_1, z_2|x) \frac{p(x, z_1,z_2)}{q(z_1, z_2|x)}\, dz_1dz_2 \\ &=\log \mathbb E_{z_1,z_2\sim q(z_1,z_2|x)}\left[ \frac{p(x, z_1,z_2)}{q(z_1, z_2|x)}\right] \\ &\geq \mathbb E_{z_1,z_2\sim q(z_1,z_2|x)}\left[ \log \frac{p(x, z_1,z_2)}{q(z_1, z_2|x)}\right] \\ &\overset{(i)}{=} \mathbb E_{z_1,z_2\sim q(z_1,z_2|x)}\left[ \log \frac{p(x|z_1)p(z_1|z_2)p(z_2)}{q(z_1|x)q(z_2|z_1)}\right] \end{aligned}$$
Step (i) requires the variables to satisfy the Markov assumption. If we extend the multi-layer VAE to even more layers, we obtain a graphical model that closely resembles that of a diffusion model, so we can look at diffusion models with the tools developed for VAEs.

DDPM

[Figure: DDPM]
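A diffusion model turns an image into noise by adding noise to it over many steps; this is called the forward diffusion process. Conversely, sampling a noise image from a prior noise distribution as the initial value and generating an image by repeatedly denoising it is called the reverse diffusion process. This is analogous to generating images with Langevin dynamics.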

Forward Diffusion Process

Suppose $x_0\sim q(x)$ is a sample drawn from the true data distribution $q$. We add Gaussian noise to it over $T$ steps according to
$$q(x_t|x_{t-1}) = \mathcal N(x_t; \sqrt{1-\beta_t}\,x_{t-1},\beta_t I)$$
where $\beta_t \in (0,1)$. The whole process satisfies the Markov assumption, so $q(x_{1:T}|x_0) = \prod_{t=1}^T q(x_t|x_{t-1})$. As $T\rightarrow \infty$, the distribution of $x_T$ approaches a Gaussian.
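A minimal sketch of the one-step transition above, for a scalar $x$ and using only Python's standard library. The function names `forward_step` and `forward_chain`, the constant schedule and the seed are illustrative assumptions, not something the derivation prescribes.

```python
import math
import random

def forward_step(x_prev, beta_t, rng):
    """Draw x_t ~ N(sqrt(1 - beta_t) * x_prev, beta_t) for a scalar x_prev."""
    noise = rng.gauss(0.0, 1.0)
    return math.sqrt(1.0 - beta_t) * x_prev + math.sqrt(beta_t) * noise

def forward_chain(x0, betas, rng):
    """Apply the Markov chain step by step and return [x_0, x_1, ..., x_T]."""
    xs = [x0]
    for beta_t in betas:
        xs.append(forward_step(xs[-1], beta_t, rng))
    return xs

rng = random.Random(0)
betas = [0.02] * 500   # hypothetical constant schedule, chosen only for illustration
trajectory = forward_chain(3.0, betas, rng)
```

Each step shrinks the signal by $\sqrt{1-\beta_t}$ and adds noise of variance $\beta_t$, so a unit-variance input keeps unit variance.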

If we want to obtain $x_t$ quickly, we can derive a closed-form expression instead of applying the recursion step by step. Let $\alpha_t = 1-\beta_t$ and $\overline{\alpha}_t = \prod_{i=1}^t\alpha_i$, and let $\{z_i, \overline{z}_i \sim \mathcal N(0,I)\}_{i=0}^T$ be i.i.d. random variables. From the recursion we have
$$\begin{aligned} x_t &= \sqrt{\alpha_t}\,x_{t-1} + \sqrt{1-\alpha_t}\,z_{t-1} \\ &= \sqrt{\alpha_t\alpha_{t-1}}\,x_{t-2} + \sqrt{\alpha_t}\sqrt{1-\alpha_{t-1}}\,z_{t-2}+ \sqrt{1-\alpha_t}\,z_{t-1} \\ &\overset{(i)}{=} \sqrt{\alpha_t\alpha_{t-1}}\,x_{t-2} + \sqrt{1-\alpha_t\alpha_{t-1}}\,\overline{z}_{t-2} \\ &= \dots \\ &=\sqrt{\overline{\alpha}_t}\,x_0+\sqrt{1-\overline{\alpha}_t}\,\overline{z}_0 \end{aligned} \tag{1}$$
Step (i) uses the fact that a linear combination of two independent Gaussians is again Gaussian: for independent $A\sim \mathcal{N}(\mu_a, \sigma_a^2)$ and $B\sim \mathcal{N}(\mu_b, \sigma_b^2)$, we have $mA+nB \sim \mathcal{N}(m\mu_a+n\mu_b, m^2\sigma_a^2+n^2\sigma_b^2)$. Here the two noise terms have variances $\alpha_t(1-\alpha_{t-1})$ and $1-\alpha_t$, which add up to $1-\alpha_t\alpha_{t-1}$. Therefore
$$x_t|x_0 \sim \mathcal N(\sqrt{\overline\alpha_t}\,x_0,(1-\overline\alpha_t)I) \tag{2}$$
For the diffusion process we want the noise strength to grow over time, i.e. $\beta_1 <\beta_2 < \dots <\beta_{T-1} < \beta_T$, and we have $1>\overline{\alpha}_1 > \dots > \overline{\alpha}_T>0$.
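The closed form can be sanity-checked without any sampling by pushing the mean coefficient and the variance of $x_t|x_0$ through the one-step recursion and comparing with $\sqrt{\overline\alpha_t}$ and $1-\overline\alpha_t$. This is a sketch under stated assumptions: the linear schedule, its endpoints, $T=1000$ and the helper names are illustrative choices, not fixed by the derivation.

```python
import math

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Increasing schedule beta_1 < ... < beta_T (the endpoints are illustrative assumptions)."""
    return [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]

def alpha_bars(betas):
    """Cumulative products alpha_bar_t = prod_{i<=t} (1 - beta_i), for t = 1..T."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

betas = linear_betas(1000)
abar = alpha_bars(betas)

# Propagate the mean coefficient and variance of x_t | x_0 through the one-step recursion:
# x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * z.
mean_coef, var = 1.0, 0.0
for beta in betas:
    mean_coef = math.sqrt(1.0 - beta) * mean_coef
    var = (1.0 - beta) * var + beta
```

After the loop, `mean_coef` and `var` describe $x_T|x_0$ and should agree with the closed form evaluated at `abar[-1]`.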

Reverse Diffusion Process / Sampling

We want to recover $x_0$ from $x_T$, which requires modeling the conditional probability $q(x_{t-1}|x_t)$. By Bayes' rule,
$$q(x_{t-1}|x_t) = q(x_t |x_{t-1})\frac{q(x_{t-1})}{q(x_t)}$$
The true marginal $q(x_{t-1})$ is not available, so we approximate $q(x_{t-1}|x_t)$ with the distribution that additionally conditions on $x_0$:
$$q(x_{t-1} |x_{t})\approx q(x_{t-1} |x_{t},x_0) = q(x_t |x_{t-1},x_0)\frac{q(x_{t-1}|x_0)}{q(x_t|x_0)}$$
Here $q(x_{t-1}|x_t,x_0)$ can be computed in closed form. By Bayes' rule,
$$\begin{aligned} q(x_{t-1}|x_t,x_0) &= q(x_t |x_{t-1}, x_0)\frac{q(x_{t-1}|x_0)}{q(x_t|x_0)} \\ &\propto \exp\left( -\frac{1}{2}\left(\frac{(x_t-\sqrt{\alpha_t}x_{t-1})^2}{\beta_t}+\frac{(x_{t-1}-\sqrt{\overline\alpha_{t-1}}x_{0})^2}{1-\overline\alpha_{t-1}}-\frac{(x_{t}-\sqrt{\overline\alpha_{t}}x_{0})^2}{1-\overline\alpha_{t}}\right)\right) \\ &=\exp\left( -\frac{1}{2}\left(\left(\frac{\alpha_t}{\beta_t}+\frac{1}{1-\overline\alpha_{t-1}}\right)x_{t-1}^2 - \left(\frac{2\sqrt{\alpha_t}}{\beta_t}x_t+\frac{2\sqrt{\overline \alpha_{t-1}}}{1-\overline \alpha_{t-1}}x_0\right)x_{t-1} + C(x_0,x_t) \right)\right) \end{aligned}$$
Matching this against the form of a Gaussian density shows that the conditional distribution $x_{t-1}|x_t,x_0$ is Gaussian with the following mean and variance:
$$\begin{aligned} \mu &= \left(\frac{\sqrt{\alpha_t}}{\beta_t}x_t+\frac{\sqrt{\overline \alpha_{t-1}}}{1-\overline \alpha_{t-1}}x_0\right) \Big/ \left(\frac{\alpha_t}{\beta_t}+\frac{1}{1-\overline\alpha_{t-1}}\right) = \frac{\sqrt{\alpha_t}(1-\overline\alpha_{t-1})}{1-\overline\alpha_t}x_t+\frac{\sqrt{\overline\alpha_{t-1}}\beta_{t}}{1-\overline\alpha_{t}}x_0 \\ \sigma^2 &= \tilde{\beta}_t = \frac{1}{\frac{\alpha_t}{\beta_t}+\frac{1}{1-\overline\alpha_{t-1}}} = \frac{1-\overline\alpha_{t-1}}{1-\overline \alpha_t}\beta_t \end{aligned} \tag{3}$$
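The posterior mean and variance can be checked numerically by reading them off the quadratic in $x_{t-1}$ (an exponent of the form $-\frac12(a x^2 - 2 b x)$ gives variance $1/a$ and mean $b/a$) and comparing with the simplified closed forms. The sketch below does this for scalars; the function names and the numeric values are hypothetical.

```python
import math

def posterior_params(x_t, x_0, alpha_t, abar_prev, abar_t):
    """Mean and variance of q(x_{t-1} | x_t, x_0) for scalars, using the simplified closed forms."""
    beta_t = 1.0 - alpha_t
    var = (1.0 - abar_prev) / (1.0 - abar_t) * beta_t
    mean = (math.sqrt(alpha_t) * (1.0 - abar_prev) * x_t
            + math.sqrt(abar_prev) * beta_t * x_0) / (1.0 - abar_t)
    return mean, var

def posterior_params_from_quadratic(x_t, x_0, alpha_t, abar_prev):
    """Same quantities read off the exponent: a*x^2 - 2*b*x gives var = 1/a and mean = b/a."""
    beta_t = 1.0 - alpha_t
    a = alpha_t / beta_t + 1.0 / (1.0 - abar_prev)
    b = math.sqrt(alpha_t) / beta_t * x_t + math.sqrt(abar_prev) / (1.0 - abar_prev) * x_0
    return b / a, 1.0 / a

# Hypothetical values for illustration only; abar_t = alpha_t * abar_prev by definition.
alpha_t, abar_prev = 0.98, 0.60
abar_t = alpha_t * abar_prev
closed = posterior_params(0.7, -1.2, alpha_t, abar_prev, abar_t)
quad = posterior_params_from_quadratic(0.7, -1.2, alpha_t, abar_prev)
```

Note that the variance $\tilde\beta_t$ is always smaller than $\beta_t$, because $1-\overline\alpha_{t-1} < 1-\overline\alpha_t$.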
We use a neural network to fit $\overline z_0$, i.e. $\epsilon_\theta(x_t,t)\approx\overline z_0$. Note that $x_t$ was obtained from $x_0$ by adding the noise $\overline z_0$, so the network is in essence fitting the noise that was added. Substituting (1), in the form $x_0 = (x_t-\sqrt{1-\overline\alpha_t}\,\overline z_0)/\sqrt{\overline\alpha_t}$, to eliminate $x_0$ gives
$$\mu = \tilde \mu_t = \frac{1}{\sqrt{\alpha_t}}\left(x_t-\frac{\beta_t}{\sqrt{1-\overline{\alpha}_t}}\overline z_0\right) \tag{4}$$
Therefore
$$x_{t-1}|x_t \sim \mathcal N\left(\frac{1}{\sqrt{\alpha_t}}\left(x_t-\frac{\beta_t}{\sqrt{1-\overline{\alpha}_t}}\overline z_0\right),\frac{1-\overline\alpha_{t-1}}{1-\overline \alpha_t}\beta_t I\right)$$
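As a quick check that eliminating $x_0$ does not change the mean, the sketch below builds $x_t$ from a known $x_0$ and a known noise draw, then evaluates the mean both in terms of $x_0$ and in terms of the noise. It also shows one reverse step drawn from the resulting Gaussian. All function names and numbers are illustrative assumptions, and the true noise stands in for the network output $\epsilon_\theta(x_t,t)$.

```python
import math
import random

def mean_from_noise(x_t, eps, alpha_t, abar_t):
    """Mean written in terms of the noise: (x_t - beta_t / sqrt(1 - abar_t) * eps) / sqrt(alpha_t)."""
    beta_t = 1.0 - alpha_t
    return (x_t - beta_t / math.sqrt(1.0 - abar_t) * eps) / math.sqrt(alpha_t)

def mean_from_x0(x_t, x_0, alpha_t, abar_prev, abar_t):
    """Mean written in terms of x_0 (the form before x_0 is eliminated)."""
    beta_t = 1.0 - alpha_t
    return (math.sqrt(alpha_t) * (1.0 - abar_prev) * x_t
            + math.sqrt(abar_prev) * beta_t * x_0) / (1.0 - abar_t)

def reverse_step(x_t, eps_hat, alpha_t, abar_prev, abar_t, rng):
    """Draw x_{t-1} from the Gaussian with the mean above and variance beta_tilde_t."""
    beta_tilde = (1.0 - abar_prev) / (1.0 - abar_t) * (1.0 - alpha_t)
    mean = mean_from_noise(x_t, eps_hat, alpha_t, abar_t)
    return mean + math.sqrt(beta_tilde) * rng.gauss(0.0, 1.0)

# Hypothetical values: build x_t from a known x_0 and a known noise draw.
alpha_t, abar_prev = 0.97, 0.5
abar_t = alpha_t * abar_prev
x_0, eps = 1.5, -0.4
x_t = math.sqrt(abar_t) * x_0 + math.sqrt(1.0 - abar_t) * eps
x_prev = reverse_step(x_t, eps, alpha_t, abar_prev, abar_t, random.Random(0))
```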

Loss Function

Now consider the design of the loss function. Suppose we use a probabilistic model $p_\theta$ with parameters $\theta$ to fit the true data distribution $q$. Since the KL divergence is non-negative,
$$\begin{aligned} -\log p_\theta(x_0) &\leq -\log p_\theta(x_0) +\text{KL}(q(x_{1:T}|x_0)\,\|\,p_\theta(x_{1:T}|x_0)) \\ &= -\log p_\theta(x_0) +\mathbb E_{q(x_{1:T}|x_0)}\left[\log \frac{q(x_{1:T}|x_0)}{p_\theta(x_{0:T})/p_\theta(x_0)}\right] \\ &= \mathbb E_{q(x_{1:T}|x_0)}\left[\log \frac{q(x_{1:T}|x_0)}{p_\theta(x_{0:T})}\right] \end{aligned}$$
Taking the expectation over $x_0\sim q(x_0)$ on both sides gives
$$\mathbb E_{q(x_0)}[-\log p_\theta(x_0)]\leq \mathbb E_{q(x_{0:T})}\left[\log \frac{q(x_{1:T}|x_0)}{p_\theta(x_{0:T})}\right] \overset{\triangle}{=} L_\text{VLB}$$
Simplifying $L_\text{VLB}$ gives
$$\begin{aligned} L_\text{VLB} &= \mathbb E_{q(x_{0:T})}\left[\log \frac{q(x_{1:T}|x_0)}{p_\theta(x_{0:T})}\right] \\ &=\mathbb E_{q(x_{0:T})}\left[-\log p(x_T)+\sum_{t=1}^T\log\frac{q(x_t|x_{t-1})}{p_\theta(x_{t-1}|x_t)}\right] \\ &\overset{(i)}{=}\mathbb E_{q(x_{0:T})}\left[-\log p(x_T)+\sum_{t=2}^T\log\left(\frac{q(x_{t-1}|x_t,x_0)}{p_\theta(x_{t-1}|x_t)}\frac{q(x_t|x_0)}{q(x_{t-1}|x_0)}\right) + \log\frac{q(x_1|x_{0})}{p_\theta(x_{0}|x_1)}\right] \\ &=\mathbb E_{q(x_{0:T})}\left[-\log p(x_T)+\sum_{t=2}^T\log\frac{q(x_{t-1}|x_t,x_0)}{p_\theta(x_{t-1}|x_t)}+\log\frac{q(x_T|x_0)}{q(x_{1}|x_0)} + \log\frac{q(x_1|x_{0})}{p_\theta(x_{0}|x_1)}\right] \\ &=\mathbb E_{q(x_{0:T})}\left[\sum_{t=2}^T\log\frac{q(x_{t-1}|x_t,x_0)}{p_\theta(x_{t-1}|x_t)}+\log\frac{q(x_T|x_0)}{p(x_T)} - \log p_\theta(x_{0}|x_1)\right] \\ &=\mathbb E_{q}\left[\text{KL}(q(x_T|x_0)\,\|\,p(x_T)) +\sum_{t=2}^T\text{KL}(q(x_{t-1}|x_t,x_0)\,\|\,p_\theta(x_{t-1}|x_t))-\log p_\theta(x_0|x_1)\right] \end{aligned} \tag{5}$$
Step (i) follows from the Markov property and Bayes' rule: for $t\geq 2$,
$$q(x_t|x_{t-1}) = q(x_t|x_{t-1},x_0)=\frac{q(x_{t-1}|x_t,x_0)\,q(x_t|x_0)}{q(x_{t-1}|x_0)}$$
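One step in the simplification of $L_\text{VLB}$ that is easy to miss is that the ratios $q(x_t|x_0)/q(x_{t-1}|x_0)$ inside the sum telescope. Written out as a worked step, using only quantities already defined above:

```latex
\begin{aligned}
\sum_{t=2}^T \log\frac{q(x_t|x_0)}{q(x_{t-1}|x_0)}
&= \sum_{t=2}^T \left[\log q(x_t|x_0) - \log q(x_{t-1}|x_0)\right] \\
&= \log q(x_T|x_0) - \log q(x_1|x_0) \\
&= \log\frac{q(x_T|x_0)}{q(x_1|x_0)}
\end{aligned}
```

The $\log q(x_1|x_0)$ left over here cancels against the numerator of the separate $t=1$ term, which is why only $-\log p_\theta(x_0|x_1)$ survives.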
We fix the variance $\beta_t$ as a hyperparameter, so the first term in (5) is a constant with no trainable parameters and can be ignored. For the last term, the authors suggest that training works better when it is simplified away. Since $p_\theta$ is the model we use to fit the distribution, we may assume $p_\theta(x_{t-1}|x_t) = \mathcal N(\mu_\theta(x_t,t),\sigma_t^2 I)$, so that only the mean of this distribution depends on the input. This gives
$$\text{KL}(q(x_{t-1}|x_t,x_0)\,\|\,p_\theta(x_{t-1}|x_t)) =\mathbb E_q\left[\frac{1}{2\sigma_t^2}\|\tilde\mu_t(x_t,x_0)-\mu_\theta(x_t,t)\|_2^2\right] +C$$
Let $\epsilon \sim \mathcal N(0,I)$. Using (1), we express $x_t$ in terms of $x_0$ and $\epsilon$, and using (4) we substitute for $\tilde\mu_t$:
$$\text{KL}(q(x_{t-1}|x_t,x_0)\,\|\,p_\theta(x_{t-1}|x_t)) =\mathbb E_q\left[\frac{1}{2\sigma_t^2}\left\|\frac{1}{\sqrt{\alpha_t}}\left(x_t(x_0,\epsilon)-\frac{\beta_t}{\sqrt{1-\overline\alpha_t}}\epsilon\right)-\mu_\theta(x_t,t)\right\|_2^2\right] +C$$
where, following (4), $\mu_\theta$ and $\epsilon_\theta$ are related by
$$\mu_\theta(x_t,t) = \frac{1}{\sqrt{\alpha_t}}\left( x_t(x_0,\epsilon)-\frac{\beta_t}{\sqrt{1-\overline\alpha_t}}\epsilon_\theta(x_t,t)\right)$$
Therefore
$$\text{KL}(q(x_{t-1}|x_t,x_0)\,\|\,p_\theta(x_{t-1}|x_t)) =\mathbb E_{x_0,\epsilon}\left[\frac{\beta_t^2}{2\sigma_t^2\alpha_t(1-\overline{\alpha}_t)}\left\|\epsilon-\epsilon_\theta(\sqrt{\overline\alpha_t}x_0+\sqrt{1-\overline\alpha_t}\epsilon, t)\right\|_2^2\right] +C$$
Dropping the weighting coefficient in front of the norm, the loss function becomes
$$\mathcal L(\theta) =\mathbb E_{t,x_0,\epsilon}\left[\left\|\epsilon-\epsilon_\theta(\sqrt{\overline\alpha_t}x_0+\sqrt{1-\overline\alpha_t}\epsilon, t)\right\|_2^2\right] \tag{6}$$
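The relation between the KL term and the noise-prediction error can be verified on scalars. The sketch below assumes both Gaussians share the same variance $\sigma_t^2$ (so the constant $C$ is zero) and uses made-up numbers; the helper names `gaussian_kl_same_var`, `kl_weight` and `simple_loss` are hypothetical.

```python
import math

def gaussian_kl_same_var(mu_q, mu_p, var):
    """KL(N(mu_q, var) || N(mu_p, var)) for scalars; the log-variance terms cancel."""
    return (mu_q - mu_p) ** 2 / (2.0 * var)

def kl_weight(beta_t, abar_t, sigma2_t):
    """Coefficient beta_t^2 / (2 sigma_t^2 alpha_t (1 - abar_t)) in front of the squared noise error."""
    alpha_t = 1.0 - beta_t
    return beta_t ** 2 / (2.0 * sigma2_t * alpha_t * (1.0 - abar_t))

def simple_loss(eps, eps_hat):
    """Unweighted squared error between true and predicted noise, summed over components."""
    return sum((e - h) ** 2 for e, h in zip(eps, eps_hat))

# Hypothetical numbers for a single timestep and a single scalar sample.
beta_t, abar_t, sigma2_t = 0.02, 0.4, 0.02
eps, eps_hat = 0.3, -0.1      # true noise and a stand-in for the network prediction
x_t = 0.9
scale = 1.0 / math.sqrt(1.0 - beta_t)
mu_tilde = scale * (x_t - beta_t / math.sqrt(1.0 - abar_t) * eps)
mu_theta = scale * (x_t - beta_t / math.sqrt(1.0 - abar_t) * eps_hat)

kl = gaussian_kl_same_var(mu_tilde, mu_theta, sigma2_t)
weighted = kl_weight(beta_t, abar_t, sigma2_t) * simple_loss([eps], [eps_hat])
```

Here `kl` and `weighted` agree, which is the content of the weighted expression; the final loss simply omits `kl_weight`.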

Algorithm

[Figure: algorithm flow]
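The training and sampling procedures implied by the derivation can be sketched as follows. This is a scalar toy version under several assumptions that are not in the text: a linear schedule with $T=200$, the choice $\sigma_t^2=\tilde\beta_t$, no noise at the final step, and a data distribution concentrated on a single point so that the exact noise is known and can stand in for a trained network (`oracle_eps`). All names are hypothetical.

```python
import math
import random

def make_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linear betas plus alpha_bar with alpha_bar[0] = 1, so index t matches timestep t."""
    betas = [0.0] + [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]
    abar = [1.0]
    for t in range(1, T + 1):
        abar.append(abar[-1] * (1.0 - betas[t]))
    return betas, abar

def training_example(x_0, betas, abar, rng):
    """One training draw: pick t and eps, form x_t, and return the regression target eps."""
    t = rng.randint(1, len(betas) - 1)
    eps = rng.gauss(0.0, 1.0)
    x_t = math.sqrt(abar[t]) * x_0 + math.sqrt(1.0 - abar[t]) * eps
    return x_t, t, eps   # a network would minimise (eps - eps_model(x_t, t)) ** 2

def sample(eps_model, betas, abar, rng):
    """Iterate from x_T ~ N(0, 1) down to x_0, using sigma_t^2 = beta_tilde_t."""
    T = len(betas) - 1
    x = rng.gauss(0.0, 1.0)
    for t in range(T, 0, -1):
        alpha_t = 1.0 - betas[t]
        mean = (x - betas[t] / math.sqrt(1.0 - abar[t]) * eps_model(x, t)) / math.sqrt(alpha_t)
        var = (1.0 - abar[t - 1]) / (1.0 - abar[t]) * betas[t]
        noise = rng.gauss(0.0, 1.0) if t > 1 else 0.0   # no noise is added at the final step
        x = mean + math.sqrt(var) * noise
    return x

betas, abar = make_schedule(200)
DATA_POINT = 2.0   # hypothetical data distribution: every sample equals 2.0

def oracle_eps(x_t, t):
    """Exact noise for the point-mass data distribution; stands in for a trained network."""
    return (x_t - math.sqrt(abar[t]) * DATA_POINT) / math.sqrt(1.0 - abar[t])

x_generated = sample(oracle_eps, betas, abar, random.Random(0))
```

With the oracle predictor the sampler lands on `DATA_POINT` up to rounding; with a real network, `eps_model` would be the trained $\epsilon_\theta$.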

DDIM

The core problem of DDPM is that sampling requires a large number of iterations. Moreover, the distribution used in the derivation is $q(x_{t-1}|x_t)$, so the index changes by exactly 1 in each iteration. We would like to lift this restriction: if we could compute $q(x_s|x_t)$ for any $s<t$, we could freely choose the number of iterations used to go from $t=T$ to $t=0$. This requires that sampling no longer depend on the $q(x_t |x_{t-1})$ appearing in the following expression:
$$q(x_{t-1} |x_{t})\approx q(x_{t-1} |x_{t},x_0) = q(x_t |x_{t-1},x_0)\frac{q(x_{t-1}|x_0)}{q(x_t|x_0)} = q(x_t |x_{t-1})\frac{q(x_{t-1}|x_0)}{q(x_t|x_0)}$$
Even without $q(x_t|x_{t-1},x_0)$, the posterior can still be determined from the marginalization constraint
$$\int p(x_{t-1}|x_t,x_0)\,p(x_t|x_0)\,dx_t=p(x_{t-1}|x_0)$$
Here $p(x_t|x_0)$ and $p(x_{t-1}|x_0)$ are both Gaussian, so we can assume that $p(x_{t-1}|x_t,x_0)$ is also Gaussian, with a mean that is a linear combination of $x_t$ and $x_0$.

More generally, consider any two indices $s<t$ and assume $x_s = m_{s|t} x_t+n_{s|t}x_0+\sigma_{s|t}\varepsilon_1$. Since $q(x_s|x_0)$ and $q(x_t|x_0)$ are known, we have the system
$$\left\{\begin{matrix} x_s = m_{s|t} x_t+n_{s|t}x_0+\sigma_{s|t}\varepsilon_1 \\ x_t = \sqrt{\overline\alpha_t} x_0+\sqrt{1-\overline\alpha_t}\varepsilon_2 \\ x_s = \sqrt{\overline\alpha_s} x_0+\sqrt{1-\overline\alpha_s}\varepsilon_3 \end{matrix}\right.$$
Substituting the second equation into the first and matching the mean and variance against the third gives two equations for $m_{s|t}$ and $n_{s|t}$:
$$\left\{\begin{matrix} m_{s|t}\sqrt{\overline\alpha_t} + n_{s|t} = \sqrt{\overline\alpha_s} \\ m_{s|t}^2(1-\overline\alpha_t) + \sigma^2_{s|t} = 1-\overline\alpha_s \end{matrix}\right.$$
Solving gives
$$\left\{\begin{matrix} m_{s|t} = \sqrt{\frac{1-\overline\alpha_s-\sigma_{s|t}^2}{1-\overline\alpha_t}} \\ n_{s|t} = \sqrt{\overline\alpha_{s}} - \sqrt{\frac{\overline\alpha_t}{1-\overline\alpha_t}(1-\overline\alpha_s-\sigma_{s|t}^2)} \end{matrix}\right.$$
Substituting back, we obtain the sampling formula for any two indices $s<t$:
$$x_s = \sqrt{\overline\alpha_s}x_0+\sqrt{1-\overline\alpha_{s}-\sigma_{s|t}^2}\,\frac{x_t-\sqrt{\overline\alpha_t}x_0}{\sqrt{1-\overline\alpha_t}}+\sigma_{s|t}\varepsilon$$
Note that DDIM does not use $q(x_t|x_s)$, so DDIM has a more general form than DDPM. Here $x_0$ is the estimate obtained from $x_t$, $\epsilon_\theta(x_t,t)$ and formula (1), while $\frac{x_t-\sqrt{\overline\alpha_t}x_0}{\sqrt{1-\overline\alpha_t}}$ corresponds to $\epsilon_\theta(x_t,t)$, i.e.
$$x_s = \sqrt{\frac{\overline\alpha_s}{\overline\alpha_t}}\left(x_t-\sqrt{1-\overline\alpha_t}\,\epsilon_\theta(x_t,t)\right)+\sqrt{1-\overline\alpha_{s}-\sigma_{s|t}^2}\,\epsilon_\theta(x_t,t)+\sigma_{s|t}\varepsilon$$
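A minimal scalar sketch of this update, assuming the true noise is available in place of $\epsilon_\theta(x_t,t)$ so that the result can be checked exactly. The function names, the three schedule values and the use of $\sigma_{s|t}=0$ for the deterministic jumps are illustrative assumptions.

```python
import math
import random

def ddim_coefficients(abar_s, abar_t, sigma):
    """Coefficients m, n of x_s = m * x_t + n * x_0 + sigma * noise."""
    m = math.sqrt((1.0 - abar_s - sigma ** 2) / (1.0 - abar_t))
    n = math.sqrt(abar_s) - math.sqrt(abar_t) * m
    return m, n

def ddim_step(x_t, eps_hat, abar_s, abar_t, sigma, rng):
    """Jump from index t to any earlier index s using the predicted noise eps_hat."""
    x0_hat = (x_t - math.sqrt(1.0 - abar_t) * eps_hat) / math.sqrt(abar_t)
    direction = math.sqrt(1.0 - abar_s - sigma ** 2) * eps_hat
    noise = sigma * rng.gauss(0.0, 1.0) if sigma > 0.0 else 0.0
    return math.sqrt(abar_s) * x0_hat + direction + noise

# Hypothetical schedule values at indices 0 < s < t, with alpha_bar_0 = 1.
abar_0, abar_s, abar_t = 1.0, 0.7, 0.1
x_0, eps = -0.8, 1.3
x_t = math.sqrt(abar_t) * x_0 + math.sqrt(1.0 - abar_t) * eps

rng = random.Random(0)
x_s = ddim_step(x_t, eps, abar_s, abar_t, 0.0, rng)       # deterministic jump t -> s
x_0_rec = ddim_step(x_s, eps, abar_0, abar_s, 0.0, rng)   # deterministic jump s -> 0
m, n = ddim_coefficients(abar_s, abar_t, 0.2)
```

With $\sigma_{s|t}=0$ the jump is deterministic and two steps already recover `x_0`; the index gap between `s` and `t` is arbitrary, which is the point of the generalization.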

References

Denoising Diffusion Probabilistic Models
Denoising Diffusion Implicit Models
Probabilistic Diffusion Model 概率扩散模型理论 (Theory of Probabilistic Diffusion Models)
扩散模型 Diffusion Model
