Denoising Diffusion Probabilistic Models (DDPM)

Ho J., Jain A. and Abbeel P. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

Page E. Approximations to the cumulative normal function and its inverse for use on a pocket calculator. Applied Statistics, vol. 26, pp. 75-76, 1977.

Yerukala R., Boiroju N. K. Approximations to standard normal distribution function. Journal of Scientific and Engineering Research, vol. 6, pp. 515-518, 2015.

DDPM combines diffusion models with a variational bound.
In adversarial robustness, several papers have already trained on DDPM-generated data, which speaks to how powerful the model is.

Main content

Diffusion models

reverse process

Starting from $p(x_T) = \mathcal{N}(x_T; 0, I)$:
$$p_{\theta}(x_{0:T}) := p(x_T) \prod_{t=1}^T p_{\theta}(x_{t-1}|x_t), \quad p_{\theta}(x_{t-1}|x_t) := \mathcal{N}(x_{t-1}; \mu_{\theta}(x_t, t), \Sigma_{\theta}(x_t, t)),$$
Note that in this process we fit the mean $\mu_{\theta}$ and the covariance matrix $\Sigma_{\theta}$.

This process step by step 'recovers' the noise into an image (signal) $x_0$.

forward process

$$q(x_{1:T}|x_0) := \prod_{t=1}^{T} q(x_t|x_{t-1}), \quad q(x_t|x_{t-1}) := \mathcal{N}(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I).$$
Here $\beta_t$ can be a trainable parameter or a hand-chosen hyperparameter.

This part gradually adds noise to the image (signal).

Variational bound

For the parameters $\theta$, it is natural to optimize by minimizing the negative log-likelihood:
$$
\begin{aligned}
\mathbb{E}_{p_{data}(x_0)}\big[-\log p_{\theta}(x_0)\big]
&= \mathbb{E}_{p_{data}(x_0)}\bigg[-\log \int p_{\theta}(x_{0:T})\,\mathrm{d}x_{1:T}\bigg] \\
&= \mathbb{E}_{p_{data}(x_0)}\bigg[-\log \int q(x_{1:T}|x_0)\,\frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)}\,\mathrm{d}x_{1:T}\bigg] \\
&= \mathbb{E}_{p_{data}(x_0)}\bigg[-\log \mathbb{E}_{q(x_{1:T}|x_0)}\,\frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)}\bigg] \\
&\le -\mathbb{E}_{p_{data}(x_0)}\,\mathbb{E}_{q(x_{1:T}|x_0)}\bigg[\log \frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)}\bigg] \\
&= -\mathbb{E}_q\bigg[\log \frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)}\bigg] \\
&= -\mathbb{E}_q\bigg[\log p(x_T) + \sum_{t=1}^T \log \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_t|x_{t-1})}\bigg] \\
&= -\mathbb{E}_q\bigg[\log p(x_T) + \sum_{t=2}^T \log \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_t|x_{t-1})} + \log \frac{p_{\theta}(x_0|x_1)}{q(x_1|x_0)}\bigg] \\
&= -\mathbb{E}_q\bigg[\log p(x_T) + \sum_{t=2}^T \log \bigg(\frac{p_{\theta}(x_{t-1}|x_t)}{q(x_{t-1}|x_t, x_0)} \cdot \frac{q(x_{t-1}|x_0)}{q(x_t|x_0)}\bigg) + \log \frac{p_{\theta}(x_0|x_1)}{q(x_1|x_0)}\bigg] \\
&= -\mathbb{E}_q\bigg[\log \frac{p(x_T)}{q(x_T|x_0)} + \sum_{t=2}^T \log \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_{t-1}|x_t, x_0)} + \log p_{\theta}(x_0|x_1)\bigg]
\end{aligned}
$$
The inequality is Jensen's. The second-to-last step uses Bayes' rule together with the Markov property, $q(x_t|x_{t-1}) = q(x_t|x_{t-1}, x_0) = \frac{q(x_{t-1}|x_t, x_0)\,q(x_t|x_0)}{q(x_{t-1}|x_0)}$, after which the ratios $\frac{q(x_{t-1}|x_0)}{q(x_t|x_0)}$ telescope.

Note: $q = q(x_{1:T}|x_0)\,p_{data}(x_0)$; in what follows we let $q(x_0) := p_{data}(x_0)$.


$$
\begin{aligned}
\mathbb{E}_q\Big[\log \frac{q(x_T|x_0)}{p(x_T)}\Big]
&= \int q(x_0, x_T) \log \frac{q(x_T|x_0)}{p(x_T)}\,\mathrm{d}x_0\,\mathrm{d}x_T \\
&= \int q(x_0)\,q(x_T|x_0) \log \frac{q(x_T|x_0)}{p(x_T)}\,\mathrm{d}x_0\,\mathrm{d}x_T \\
&= \int q(x_0)\,\mathrm{D_{KL}}(q(x_T|x_0)\,\|\,p(x_T))\,\mathrm{d}x_0 \\
&= \int q(x_{0:T})\,\mathrm{D_{KL}}(q(x'_T|x_0)\,\|\,p(x'_T))\,\mathrm{d}x_{0:T} \\
&= \mathbb{E}_q\big[\mathrm{D_{KL}}(q(x'_T|x_0)\,\|\,p(x'_T))\big].
\end{aligned}
$$


$$
\begin{aligned}
\mathbb{E}_q\Big[\log \frac{q(x_{t-1}|x_t, x_0)}{p_{\theta}(x_{t-1}|x_t)}\Big]
&= \int q(x_0, x_{t-1}, x_t) \log \frac{q(x_{t-1}|x_t, x_0)}{p_{\theta}(x_{t-1}|x_t)}\,\mathrm{d}x_0\,\mathrm{d}x_{t-1}\,\mathrm{d}x_t \\
&= \int q(x_0, x_t)\,\mathrm{D_{KL}}(q(x_{t-1}|x_t, x_0)\,\|\,p_{\theta}(x_{t-1}|x_t))\,\mathrm{d}x_0\,\mathrm{d}x_t \\
&= \mathbb{E}_q\big[\mathrm{D_{KL}}(q(x'_{t-1}|x_t, x_0)\,\|\,p_{\theta}(x'_{t-1}|x_t))\big].
\end{aligned}
$$

So finally:
$$\mathcal{L} := \mathbb{E}_q\bigg[\underbrace{\mathrm{D_{KL}}(q(x'_T|x_0)\,\|\,p(x'_T))}_{L_T} + \sum_{t=2}^T \underbrace{\mathrm{D_{KL}}(q(x'_{t-1}|x_t, x_0)\,\|\,p_{\theta}(x'_{t-1}|x_t))}_{L_{t-1}} \underbrace{-\log p_{\theta}(x_0|x_1)}_{L_0}\bigg].$$
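Both $L_T$ and the $L_{t-1}$ terms are KL divergences between Gaussians, which have a closed form. A minimal PyTorch sketch of that closed form (the function name is mine; the official code has a similar utility):

```python
import torch

def normal_kl(mean1, logvar1, mean2, logvar2):
    """Elementwise KL( N(mean1, e^{logvar1}) || N(mean2, e^{logvar2}) )."""
    return 0.5 * (
        logvar2 - logvar1
        + torch.exp(logvar1 - logvar2)
        + (mean1 - mean2) ** 2 * torch.exp(-logvar2)
        - 1.0
    )
```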

Computing the losses

Since both the forward and the reverse process are Gaussian, each of the loss terms above can be solved for explicitly:

First, for $x_t$ in the forward process (the fourth line merges the two independent Gaussian terms, whose variances add to $1 - (1-\beta_t)(1-\beta_{t-1})$):
$$
\begin{aligned}
x_t &= \sqrt{1-\beta_t}\,x_{t-1} + \sqrt{\beta_t}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I) \\
&= \sqrt{1-\beta_t}\big(\sqrt{1-\beta_{t-1}}\,x_{t-2} + \sqrt{\beta_{t-1}}\,\epsilon'\big) + \sqrt{\beta_t}\,\epsilon \\
&= \sqrt{1-\beta_t}\sqrt{1-\beta_{t-1}}\,x_{t-2} + \sqrt{1-\beta_t}\sqrt{\beta_{t-1}}\,\epsilon' + \sqrt{\beta_t}\,\epsilon \\
&= \sqrt{1-\beta_t}\sqrt{1-\beta_{t-1}}\,x_{t-2} + \sqrt{1-(1-\beta_t)(1-\beta_{t-1})}\,\bar{\epsilon} \\
&= \cdots \\
&= \Big(\prod_{s=1}^t \sqrt{1-\beta_s}\Big)x_0 + \sqrt{1 - \prod_{s=1}^t (1-\beta_s)}\,\epsilon,
\end{aligned}
$$

i.e.,
$$q(x_t|x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)I\big), \qquad \bar{\alpha}_t := \prod_{s=1}^t \alpha_s, \quad \alpha_s := 1-\beta_s.$$
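A minimal sketch of this closed form in PyTorch, using the paper's linear $\beta_t$ schedule (all names here are mine, not the official code's; the tensors defined here are reused by the later sketches):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # beta_1, ..., beta_T
alphas = 1.0 - betas                          # alpha_t = 1 - beta_t
alphas_bar = torch.cumprod(alphas, dim=0)     # alpha_bar_t = prod_{s<=t} alpha_s

def q_sample(x0, t, eps):
    """Draw x_t ~ q(x_t | x_0) in one step; t holds batch indices in [0, T)."""
    shape = (-1,) + (1,) * (x0.dim() - 1)     # broadcast over image dims
    a_bar = alphas_bar[t].reshape(shape)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
```

For a batch `x0`, one would call `q_sample(x0, torch.randint(0, T, (x0.shape[0],)), torch.randn_like(x0))`.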

For the posterior $q(x_{t-1}|x_t, x_0)$ we have
$$
\begin{aligned}
q(x_{t-1}|x_t, x_0)
&= \frac{q(x_t|x_{t-1})\,q(x_{t-1}|x_0)}{q(x_t|x_0)} \\
&\propto q(x_t|x_{t-1})\,q(x_{t-1}|x_0) \\
&\propto \exp\bigg\{-\frac{1}{2(1-\bar{\alpha}_{t-1})\beta_t}\Big[(1-\bar{\alpha}_{t-1})\,\|x_t - \sqrt{1-\beta_t}\,x_{t-1}\|^2 + \beta_t\,\|x_{t-1} - \sqrt{\bar{\alpha}_{t-1}}\,x_0\|^2\Big]\bigg\} \\
&\propto \exp\bigg\{-\frac{1}{2(1-\bar{\alpha}_{t-1})\beta_t}\Big[(1-\bar{\alpha}_t)\,\|x_{t-1}\|^2 - 2(1-\bar{\alpha}_{t-1})\sqrt{\alpha_t}\,x_t^T x_{t-1} - 2\sqrt{\bar{\alpha}_{t-1}}\,\beta_t\,x_0^T x_{t-1}\Big]\bigg\}
\end{aligned}
$$
(completing the square in $x_{t-1}$; the first line uses Bayes' rule together with the Markov property $q(x_t|x_{t-1}, x_0) = q(x_t|x_{t-1})$).

Hence
$$q(x_{t-1}|x_t, x_0) = \mathcal{N}\big(x_{t-1};\ \tilde{u}_t(x_t, x_0),\ \tilde{\beta}_t I\big),$$
where
$$\tilde{u}_t(x_t, x_0) := \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1 - \bar{\alpha}_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\,x_t, \qquad \tilde{\beta}_t := \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\,\beta_t.$$
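These posterior parameters translate directly into code. A sketch reusing the tensors from the forward-process snippet above (treating $\bar{\alpha}_0 := 1$ at $t = 0$ is an implementation convention, not something stated in the paper):

```python
def q_posterior(x0, xt, t):
    """Mean and variance of q(x_{t-1} | x_t, x_0)."""
    shape = (-1,) + (1,) * (x0.dim() - 1)
    a_bar_t = alphas_bar[t].reshape(shape)
    a_bar_prev = torch.where(                 # alpha_bar_{t-1}, with alpha_bar_0 := 1
        t > 0, alphas_bar[(t - 1).clamp(min=0)], torch.ones_like(alphas_bar[t])
    ).reshape(shape)
    beta_t = betas[t].reshape(shape)
    mean = (a_bar_prev.sqrt() * beta_t / (1 - a_bar_t)) * x0 \
         + (alphas[t].reshape(shape).sqrt() * (1 - a_bar_prev) / (1 - a_bar_t)) * xt
    var = (1 - a_bar_prev) / (1 - a_bar_t) * beta_t   # beta_tilde_t
    return mean, var
```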

$L_t$

L T L_T LT θ \theta θ无关, 舍去.

The authors set $\Sigma_{\theta}(x_t, t) = \sigma_t^2 I$ with $\sigma_t^2$ not trained, taking either
$$\sigma_t^2 = \beta_t \quad \text{or} \quad \sigma_t^2 = \tilde{\beta}_t,$$
which are the optimal choices (for the expected KL divergence) when $x_0 \sim \mathcal{N}(0, I)$ and when $x_0$ is a fixed point, respectively (the authors report that the two perform similarly in experiments).


$$L_{t-1} = \frac{1}{2\sigma_t^2}\,\|\mu_{\theta}(x_t, t) - \tilde{u}_t(x_t, x_0)\|^2 + C, \quad t = 2, \cdots, T,$$
where $C$ is a constant independent of $\theta$.

$$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon \ \Rightarrow\ x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\,x_t - \frac{\sqrt{1-\bar{\alpha}_t}}{\sqrt{\bar{\alpha}_t}}\,\epsilon.$$

Therefore
$$
\begin{aligned}
\mathbb{E}_q[L_{t-1} - C]
&= \mathbb{E}_{x_0, \epsilon}\bigg\{\frac{1}{2\sigma_t^2}\,\Big\|\mu_{\theta}(x_t, t) - \tilde{u}_t\Big(x_t,\ \frac{1}{\sqrt{\bar{\alpha}_t}}\,x_t - \frac{\sqrt{1-\bar{\alpha}_t}}{\sqrt{\bar{\alpha}_t}}\,\epsilon\Big)\Big\|^2\bigg\} \\
&= \mathbb{E}_{x_0, \epsilon}\bigg\{\frac{1}{2\sigma_t^2}\,\Big\|\mu_{\theta}(x_t, t) - \frac{1}{\sqrt{\alpha_t}}\Big(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon\Big)\Big\|^2\bigg\}
\end{aligned}
$$

Note: in the expression above, $x_t$ is determined by $x_0$ and $\epsilon$, i.e. $x_t = x_t(x_0, \epsilon)$, so the expectation is effectively an expectation over $x_t$.

Given this, we may as well parameterize $\mu_{\theta}$ directly as
$$\mu_{\theta}(x_t, t) := \frac{1}{\sqrt{\alpha_t}}\Big(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_{\theta}(x_t, t)\Big),$$
i.e., model the residual (noise) $\epsilon$ directly.

The loss then simplifies to:
$$\mathbb{E}_{x_0, \epsilon}\bigg\{\frac{\beta_t^2}{2\sigma_t^2\,\alpha_t\,(1-\bar{\alpha}_t)}\,\big\|\epsilon_{\theta}(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t) - \epsilon\big\|^2\bigg\}$$

This is in fact denoising score matching.
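In code, this objective (with the leading weight dropped, as discussed in the notes further below) is just an MSE between the true and predicted noise. A sketch, where `model` is any network predicting $\epsilon$ (an assumption of this note) and `q_sample` is the helper defined earlier:

```python
import torch.nn.functional as F

def simple_loss(model, x0, t):
    """|| eps - eps_theta(x_t, t) ||^2 (mean-reduced) with x_t ~ q(x_t | x_0)."""
    eps = torch.randn_like(x0)
    xt = q_sample(x0, t, eps)
    return F.mse_loss(model(xt, t), eps)
```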

Similarly, sampling from $p_{\theta}(x_{t-1}|x_t)$ takes the form
$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\Big(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_{\theta}(x_t, t)\Big) + \sigma_t z, \quad z \sim \mathcal{N}(0, I),$$
which has the form of Langevin dynamics (with a slightly different step size and weighting).

Note: see here for this part.
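Putting the update above into a loop gives ancestral sampling. A sketch with $\sigma_t^2 = \beta_t$, one of the two variance choices mentioned earlier (it reuses the schedule tensors from the forward-process snippet):

```python
@torch.no_grad()
def p_sample_loop(model, shape):
    x = torch.randn(shape)                              # x_T ~ N(0, I)
    for t in reversed(range(T)):
        tb = torch.full((shape[0],), t, dtype=torch.long)
        eps_hat = model(x, tb)
        mean = (x - betas[t] / (1 - alphas_bar[t]).sqrt() * eps_hat) / alphas[t].sqrt()
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)  # no noise at t = 0
        x = mean + betas[t].sqrt() * z                  # sigma_t = sqrt(beta_t)
    return x
```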

$L_0$

Finally we need to handle $L_0$. Here the authors model $x_0|x_1$ with a discrete distribution: image values lie in $\{0, 1, 2, \cdots, 255\}$ and are scaled to $[-1, 1]$. Assume
$$p_{\theta}(x_0|x_1) = \prod_{i=1}^D \int_{\delta_-(x_0^i)}^{\delta_+(x_0^i)} \mathcal{N}(x;\ \mu_{\theta}^i(x_1, 1),\ \sigma_1^2)\,\mathrm{d}x,$$
$$\delta_+(x) = \begin{cases} +\infty & \text{if } x = 1, \\ x + \frac{1}{255} & \text{if } x < 1, \end{cases} \qquad \delta_-(x) = \begin{cases} -\infty & \text{if } x = -1, \\ x - \frac{1}{255} & \text{if } x > -1. \end{cases}$$
In effect, this partitions the real line under an ordinary normal distribution into the bins
$$(-\infty, -1 + \tfrac{1}{255}],\ (-1 + \tfrac{1}{255}, -1 + \tfrac{3}{255}],\ \cdots,\ (1 - \tfrac{3}{255}, 1 - \tfrac{1}{255}],\ (1 - \tfrac{1}{255}, +\infty),$$
and each pixel value falls into exactly one of them.
In actual code one runs into the problem of evaluating the Gaussian CDF (there is no closed form), so the authors use the approximation
$$\Phi(x) \approx \frac{1}{2}\Big\{1 + \tanh\Big(\sqrt{2/\pi}\,\big(x + 0.044715\,x^3\big)\Big)\Big\},$$
which also lets gradients propagate through.

Note: this approximation is due to Page [1977].
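A sketch of how this discretized likelihood looks in code, with the tanh approximation above standing in for the exact CDF (the names and the 0.999 boundary thresholds are this note's choices; the official implementation has an equivalent routine):

```python
import math
import torch

def approx_std_normal_cdf(x):
    """Page-style tanh approximation of the standard normal CDF."""
    return 0.5 * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def discretized_gaussian_log_likelihood(x0, mean, log_std):
    """log p(x0 | x1) per pixel, for x0 on the 255-level grid scaled to [-1, 1]."""
    centered = x0 - mean
    inv_std = torch.exp(-log_std)
    cdf_plus = approx_std_normal_cdf(inv_std * (centered + 1.0 / 255.0))
    cdf_minus = approx_std_normal_cdf(inv_std * (centered - 1.0 / 255.0))
    # The edge bins are open-ended: delta_+(1) = +inf, delta_-(-1) = -inf.
    cdf_plus = torch.where(x0 > 0.999, torch.ones_like(cdf_plus), cdf_plus)
    cdf_minus = torch.where(x0 < -0.999, torch.zeros_like(cdf_minus), cdf_minus)
    return torch.log((cdf_plus - cdf_minus).clamp(min=1e-12))
```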

The final algorithm

Note: $t = 1$ corresponds to $L_0$, and $t = 2, \cdots, T$ correspond to $L_1, \cdots, L_{T-1}$.
Note: the authors drop the leading coefficient of $L_t$, which in effect acts as a reweighting.
In practice, the authors train on a sampled version of the loss: a timestep is drawn at random and only that term is optimized.
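A sketch of one training step under this scheme, reusing `simple_loss` from above; the uniform draw of $t$ is the "sampled loss" referred to here:

```python
def train_step(model, optimizer, x0):
    t = torch.randint(0, T, (x0.shape[0],))   # one uniformly sampled timestep per example
    loss = simple_loss(model, x0, t)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```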

Details

Note that the authors' $\epsilon_{\theta}(\cdot, t)$ explicitly depends on $t$. In the experiments this is implemented with the positional encoding used in attention; let the positional encoding be $P$:

  1. $t_{emb} = \text{Linear}(\text{ACT}(\text{Linear}(t * P)))$, i.e., a two-layer MLP converts $t$ into a time embedding;
  2. the backbone is a U-Net, and in each residual block the embedding is injected additively (see the sketch after this list):
    $x \mathrel{+}= \text{Linear}(\text{ACT}(t_{emb})).$
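A sketch of this conditioning path (sinusoidal encoding, two-layer MLP, additive injection). The dimensions, the SiLU activation standing in for ACT, and all names are illustrative assumptions, not the official code:

```python
import math
import torch
import torch.nn as nn

def timestep_embedding(t, dim):
    """Transformer-style sinusoidal encoding P of integer timesteps t."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half).float() / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

class TimeMLP(nn.Module):
    """Two-layer MLP mapping the encoding to the embedding fed to each block."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, t):
        return self.net(timestep_embedding(t, self.net[0].in_features))

# Inside each residual block (h: feature map of shape (B, C, H, W),
# t_emb: output of TimeMLP projected to C channels):
#   h = h + proj(act(t_emb))[:, :, None, None]   # broadcast over H and W
```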
Parameters

$T$: 1000
$\beta_t$: increases linearly from 0.0001 to 0.02 over $t = 1, 2, \cdots, T$
backbone: U-Net

Note: the implementation also uses tricks such as EMA (an exponential moving average of the weights).

Code

Original paper's code

lucidrains-denoising-diffusion-pytorch
