Denoising Diffusion Probabilistic Models
Jonathan Ho, Ajay Jain, Pieter Abbeel
NeurIPS 2020
1 Background
Diffusion models are latent variable models in which $\bm{x}_1, \dots, \bm{x}_T$ are latent variables with the same dimensionality as the data $\bm{x}_0 \sim q(\bm{x}_0)$, and the sequence of latents forms a Markov chain. Given $p(\bm{x}_T) = \mathcal{N}(\bm{x}_T; \bm{0}, \bm{I})$, the joint distribution $p_{\theta}(\bm{x}_{0:T})$ is called the reverse process:
$$
p_{\theta}(\bm{x}_{0:T}) = p(\bm{x}_T) \prod_{t=1}^T p_{\theta}(\bm{x}_{t-1} | \bm{x}_t), \quad p_{\theta}(\bm{x}_{t-1} | \bm{x}_t) = \mathcal{N}(\bm{x}_{t-1}; \bm{\mu}_{\theta}(\bm{x}_t, t), \bm{\Sigma}_{\theta}(\bm{x}_t, t)) \tag{1}
$$
where $\bm{\mu}_{\theta}(\bm{x}_t, t)$ and $\bm{\Sigma}_{\theta}(\bm{x}_t, t)$ are parameterized by neural networks.
The approximate posterior $q(\bm{x}_{1:T} | \bm{x}_0)$ is called the diffusion (forward) process:
$$
q(\bm{x}_{1:T} | \bm{x}_0) = \prod_{t=1}^T q(\bm{x}_t | \bm{x}_{t-1}), \quad q(\bm{x}_t | \bm{x}_{t-1}) = \mathcal{N}(\bm{x}_t; \sqrt{1 - \beta_t}\, \bm{x}_{t-1}, \beta_t \bm{I}) \tag{2}
$$
The forward process can thus be viewed as gradually adding Gaussian noise to the data according to a variance schedule set by the hyperparameters $\beta_t$.
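As a concrete illustration, here is a minimal PyTorch sketch of a single forward step of Eq. 2. The linear schedule from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$ with $T = 1000$ is the one used in the paper; the function name `forward_step` is purely illustrative.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)  # linear beta schedule from the paper

def forward_step(x_prev, t):
    """One step of Eq. 2: x_t ~ N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I).

    `t` is 0-indexed here, i.e. betas[t] corresponds to beta_{t+1} in the math.
    """
    noise = torch.randn_like(x_prev)
    return torch.sqrt(1.0 - betas[t]) * x_prev + torch.sqrt(betas[t]) * noise
```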
The forward process has a useful property. Define $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^t \alpha_s$. Unrolling Eq. 2 gives
$$
\begin{aligned}
\bm{x}_t &= \sqrt{\alpha_t}\, \bm{x}_{t-1} + \sqrt{\beta_t}\, \bm{\epsilon}_t \\
&= \sqrt{\alpha_t \alpha_{t-1}}\, \bm{x}_{t-2} + \sqrt{\alpha_t \beta_{t-1}}\, \bm{\epsilon}_{t-1} + \sqrt{\beta_t}\, \bm{\epsilon}_t \\
&= \sqrt{\alpha_t \alpha_{t-1} \alpha_{t-2}}\, \bm{x}_{t-3} + \sqrt{\alpha_t \alpha_{t-1} \beta_{t-2}}\, \bm{\epsilon}_{t-2} + \sqrt{\alpha_t \beta_{t-1}}\, \bm{\epsilon}_{t-1} + \sqrt{\beta_t}\, \bm{\epsilon}_t \\
&= \cdots \\
&= \sqrt{\bar{\alpha}_t}\, \bm{x}_0 + \sqrt{\alpha_t \alpha_{t-1} \cdots \alpha_2 \beta_1}\, \bm{\epsilon}_1 + \cdots + \sqrt{\alpha_t \beta_{t-1}}\, \bm{\epsilon}_{t-1} + \sqrt{\beta_t}\, \bm{\epsilon}_t
\end{aligned} \tag{3}
$$
where each $\bm{\epsilon}_t \sim \mathcal{N}(\bm{0}, \bm{I})$ is an independent Gaussian noise term. Since a sum of independent Gaussians is again Gaussian, with variances adding, all the noise terms merge into a single Gaussian:
$$
\begin{aligned}
q(\bm{x}_t | \bm{x}_0) &= \mathcal{N}(\bm{x}_t; \sqrt{\bar{\alpha}_t}\, \bm{x}_0, (\alpha_t \alpha_{t-1} \cdots \alpha_2 \beta_1 + \cdots + \alpha_t \beta_{t-1} + \beta_t)\bm{I}) \\
&= \mathcal{N}(\bm{x}_t; \sqrt{\bar{\alpha}_t}\, \bm{x}_0, (1 - \bar{\alpha}_t)\bm{I})
\end{aligned} \tag{4}
$$
This property lets us sample $\bm{x}_t$ at an arbitrary timestep $t$ directly from $\bm{x}_0$, with no intermediate steps.
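A minimal sketch of this one-shot sampling (Eq. 4), reusing the same assumed linear schedule; `q_sample` is an illustrative name, not from the paper:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # \bar{alpha}_t for t = 1..T

def q_sample(x0, t, noise=None):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x0, (1 - abar_t) * I) in one shot (t is 0-indexed)."""
    if noise is None:
        noise = torch.randn_like(x0)
    return torch.sqrt(alphas_bar[t]) * x0 + torch.sqrt(1.0 - alphas_bar[t]) * noise
```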
Our goal is to train the network so that the model marginal $p_{\theta}(\bm{x}_0)$ obtained from the reverse process is as close as possible to the true data distribution $q(\bm{x}_0)$, i.e., to minimize the KL divergence between them:
$$
\begin{aligned}
D_{KL}(q(\bm{x}_0) \,\|\, p_{\theta}(\bm{x}_0)) &= \underbrace{\int q(\bm{x}_0)\log q(\bm{x}_0)\, d\bm{x}_0}_{constant} - \int q(\bm{x}_0)\log p_{\theta}(\bm{x}_0)\, d\bm{x}_0 \\
&= const - \int q(\bm{x}_0)\log \left( \int p_{\theta}(\bm{x}_{0:T})\, d\bm{x}_{1:T} \right) d\bm{x}_0 \\
&= const - \int q(\bm{x}_0)\log \left( \int q(\bm{x}_{1:T} | \bm{x}_0)\, \frac{p_{\theta}(\bm{x}_{0:T})}{q(\bm{x}_{1:T} | \bm{x}_0)}\, d\bm{x}_{1:T} \right) d\bm{x}_0 \\
&\le const - \int q(\bm{x}_{0:T}) \log \frac{p_{\theta}(\bm{x}_{0:T})}{q(\bm{x}_{1:T} | \bm{x}_0)}\, d\bm{x}_{0:T} \\
&= const + \mathbb{E}_q\left[- \log \frac{p_{\theta}(\bm{x}_{0:T})}{q(\bm{x}_{1:T} | \bm{x}_0)}\right] \\
&= const + \underbrace{\mathbb{E}_q\left[- \log p(\bm{x}_T) - \sum_{t \ge 1} \log \frac{p_{\theta}(\bm{x}_{t-1} | \bm{x}_t)}{q(\bm{x}_t | \bm{x}_{t-1})}\right]}_{L}
\end{aligned} \tag{5}
$$

The inequality is Jensen's inequality applied to the concave $\log$.
Since $q(\bm{x}_0)$ is the fixed data distribution, the first term is a constant, so the training objective reduces to minimizing the variational bound $L$.
$$
\begin{aligned}
L &= \mathbb{E}_q\left[- \log p(\bm{x}_T) - \sum_{t > 1} \log \frac{p_{\theta}(\bm{x}_{t-1} | \bm{x}_t)}{q(\bm{x}_t | \bm{x}_{t-1})} - \log \frac{p_{\theta}(\bm{x}_0 | \bm{x}_1)}{q(\bm{x}_1 | \bm{x}_0)}\right] \\
&= \mathbb{E}_q\left[- \log p(\bm{x}_T) - \sum_{t > 1} \log \frac{p_{\theta}(\bm{x}_{t-1} | \bm{x}_t)}{q(\bm{x}_t | \bm{x}_{t-1}, \bm{x}_0)} - \log \frac{p_{\theta}(\bm{x}_0 | \bm{x}_1)}{q(\bm{x}_1 | \bm{x}_0)}\right] \\
&= \mathbb{E}_q\left[- \log p(\bm{x}_T) - \sum_{t > 1} \log \left( \frac{p_{\theta}(\bm{x}_{t-1} | \bm{x}_t)}{q(\bm{x}_{t-1} | \bm{x}_t, \bm{x}_0)} \cdot \frac{q(\bm{x}_{t-1} | \bm{x}_0)}{q(\bm{x}_t | \bm{x}_0)} \right) - \log \frac{p_{\theta}(\bm{x}_0 | \bm{x}_1)}{q(\bm{x}_1 | \bm{x}_0)}\right] \\
&= \mathbb{E}_q\left[- \log p(\bm{x}_T) - \sum_{t > 1} \log \frac{p_{\theta}(\bm{x}_{t-1} | \bm{x}_t)}{q(\bm{x}_{t-1} | \bm{x}_t, \bm{x}_0)} - \sum_{t > 1} \log \frac{q(\bm{x}_{t-1} | \bm{x}_0)}{q(\bm{x}_t | \bm{x}_0)} - \log \frac{p_{\theta}(\bm{x}_0 | \bm{x}_1)}{q(\bm{x}_1 | \bm{x}_0)}\right] \\
&= \mathbb{E}_q\left[- \log \frac{p(\bm{x}_T)}{q(\bm{x}_T | \bm{x}_0)} - \sum_{t > 1} \log \frac{p_{\theta}(\bm{x}_{t-1} | \bm{x}_t)}{q(\bm{x}_{t-1} | \bm{x}_t, \bm{x}_0)} - \log p_{\theta}(\bm{x}_0 | \bm{x}_1)\right] \\
&= \mathbb{E}_q\left[\underbrace{D_{KL}(q(\bm{x}_T | \bm{x}_0) \,\|\, p(\bm{x}_T))}_{L_T} + \sum_{t > 1} \underbrace{D_{KL}(q(\bm{x}_{t-1} | \bm{x}_t, \bm{x}_0) \,\|\, p_{\theta}(\bm{x}_{t-1} | \bm{x}_t))}_{L_{t-1}} \underbrace{- \log p_{\theta}(\bm{x}_0 | \bm{x}_1)}_{L_0}\right]
\end{aligned} \tag{6}
$$

Here the first line separates the $t = 1$ term from the sum in Eq. 5; the second line uses the Markov property of the forward process, $q(\bm{x}_t | \bm{x}_{t-1}) = q(\bm{x}_t | \bm{x}_{t-1}, \bm{x}_0)$; the third applies Bayes' rule, $q(\bm{x}_t | \bm{x}_{t-1}, \bm{x}_0) = q(\bm{x}_{t-1} | \bm{x}_t, \bm{x}_0)\, q(\bm{x}_t | \bm{x}_0) / q(\bm{x}_{t-1} | \bm{x}_0)$; and in the fifth, the telescoping sum $\sum_{t>1} \log \frac{q(\bm{x}_{t-1} | \bm{x}_0)}{q(\bm{x}_t | \bm{x}_0)}$ collapses to $\log \frac{q(\bm{x}_1 | \bm{x}_0)}{q(\bm{x}_T | \bm{x}_0)}$, whose numerator cancels the $q(\bm{x}_1 | \bm{x}_0)$ in the last term.
2 Diffusion models and denoising autoencoders
2.1 Forward process and $L_T$
Since the $\beta_t$ are held fixed, $q(\bm{x}_T | \bm{x}_0)$ contains no learnable parameters; with a suitable schedule it is essentially $\mathcal{N}(\bm{0}, \bm{I}) = p(\bm{x}_T)$, so $L_T$ is a constant (approximately zero) and can be ignored during training.
2.2 Reverse process and $L_{1:T-1}$
In $L_{t-1}$, the forward-process posterior is itself Gaussian, $q(\bm{x}_{t-1} | \bm{x}_t, \bm{x}_0) = \mathcal{N}(\bm{x}_{t-1}; \widetilde{\bm{\mu}}_t(\bm{x}_t, \bm{x}_0), \widetilde{\beta}_t \bm{I})$; a step-by-step proof can be found in the video "54、Diffusion Model扩散模型理论与完整PyTorch代码详细解读".
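Working that proof out (Bayes' rule on the Gaussians of Eq. 2 and Eq. 4, then completing the square) gives the closed-form coefficients, matching Eq. 7 of the DDPM paper:

$$
\widetilde{\bm{\mu}}_t(\bm{x}_t, \bm{x}_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t}\, \bm{x}_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\, \bm{x}_t, \qquad \widetilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\, \beta_t
$$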
The authors replace $\bm{\Sigma}_{\theta}(\bm{x}_t, t)$ in $p_{\theta}(\bm{x}_{t-1} | \bm{x}_t)$ with untrained time-dependent constants $\sigma_t^2 \bm{I}$; both choices $\sigma_t^2 = \beta_t$ and $\sigma_t^2 = \widetilde{\beta}_t$ gave similar results in their experiments.
For univariate Gaussians, the KL divergence has the closed form

$$
D_{KL}(\mathcal{N}(\mu_1, \sigma_1^2) \,\|\, \mathcal{N}(\mu_2, \sigma_2^2)) = \frac{(\mu_1 - \mu_2)^2 + \sigma_1^2}{2 \sigma_2^2} + \log \frac{\sigma_2}{\sigma_1} - \frac{1}{2}
$$
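This identity is easy to sanity-check numerically; a minimal sketch against PyTorch's built-in analytic KL (the sample values are arbitrary):

```python
import torch
from torch.distributions import Normal, kl_divergence

mu1, sigma1 = torch.tensor(0.3), torch.tensor(0.8)
mu2, sigma2 = torch.tensor(-0.1), torch.tensor(1.2)

# Closed form above.
closed = ((mu1 - mu2) ** 2 + sigma1 ** 2) / (2 * sigma2 ** 2) + torch.log(sigma2 / sigma1) - 0.5
# PyTorch's analytic KL between the same two Gaussians.
auto = kl_divergence(Normal(mu1, sigma1), Normal(mu2, sigma2))
assert torch.allclose(closed, auto)
```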
Applying this formula to the two Gaussians in $L_{t-1}$ (the constant $C$ absorbs the variance terms, which do not depend on $\theta$):

$$
\begin{aligned}
L_{t-1} &= \mathbb{E}_q\left[\frac{1}{2\sigma_t^2} \left\| \widetilde{\bm{\mu}}_t(\bm{x}_t, \bm{x}_0) - \bm{\mu}_{\theta}(\bm{x}_t, t) \right\|^2\right] + C \\
&= \mathbb{E}_q\left[\frac{1}{2\sigma_t^2} \left\| \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\, \bm{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t}\, \bm{x}_0 - \bm{\mu}_{\theta}(\bm{x}_t, t) \right\|^2\right] + C \\
&= \mathbb{E}_{\bm{x}_0, \bm{\epsilon}}\left[\frac{1}{2\sigma_t^2} \left\| \frac{1}{\sqrt{\alpha_t}} \left( \bm{x}_t(\bm{x}_0, \bm{\epsilon}) - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\,\bm{\epsilon} \right) - \bm{\mu}_{\theta}\left( \bm{x}_t(\bm{x}_0, \bm{\epsilon}), t \right) \right\|^2\right] + C
\end{aligned} \tag{7}
$$
where $\bm{x}_t(\bm{x}_0, \bm{\epsilon}) = \sqrt{\bar{\alpha}_t}\, \bm{x}_0 + \sqrt{1 - \bar{\alpha}_t}\, \bm{\epsilon}$ (Eq. 4), $\bm{\epsilon}$ is the Gaussian noise used to sample $\bm{x}_t$ from $\bm{x}_0$, and $C$ is a constant independent of $\theta$.
Eq. 7 suggests parameterizing the network $\bm{\mu}_{\theta}$ as
$$
\bm{\mu}_{\theta}(\bm{x}_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( \bm{x}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \bm{\epsilon}_{\theta}(\bm{x}_t, t) \right) \tag{8}
$$
so that the loss $L_{t-1}$ becomes
$$
\mathbb{E}_{\bm{x}_0, \bm{\epsilon}}\left[\frac{\beta_t^2}{2 \sigma_t^2 \alpha_t (1 - \bar{\alpha}_t)} \left\| \bm{\epsilon} - \bm{\epsilon}_{\theta}\left( \bm{x}_t(\bm{x}_0, \bm{\epsilon}), t \right) \right\|^2\right] + C \tag{9}
$$
2.3 Reverse process and $L_0$
In the original paper the authors make some modifications based on the properties of image data (a discrete decoder); we skip those details here. From Eq. 1,

$$
p_{\theta}(\bm{x}_0 | \bm{x}_1) = \mathcal{N}(\bm{x}_0; \bm{\mu}_{\theta}(\bm{x}_1, 1), \sigma_1^2 \bm{I}) \tag{10}
$$
Therefore,

$$
L_0 = \frac{1}{2 \sigma_1^2} \mathbb{E}_{\bm{x}_0, \bar{\bm{\epsilon}}_1} \left[\left\| \bm{x}_0 - \bm{\mu}_{\theta}\left(\bm{x}_1(\bm{x}_0, \bar{\bm{\epsilon}}_1), 1 \right) \right\|^2\right] + C' \tag{11}
$$
where $C'$ is a constant.
2.4 Simplified training objective
The authors found that simplifying the loss $L$ to the reweighted objective $L_{simple}$ improves sample quality:
$$
L_{simple}(\theta) = \mathbb{E}_{t, \bm{x}_0, \bm{\epsilon}}\left[ \left\| \bm{\epsilon} - \bm{\epsilon}_{\theta}\left( \bm{x}_t(\bm{x}_0, \bm{\epsilon}), t \right) \right\|^2\right]
$$
Compared with Eq. 9, $L_{simple}$ drops the time-dependent weight and samples $t$ uniformly from $\{1, \dots, T\}$. Training (Algorithm 1 in the paper) repeatedly samples $\bm{x}_0 \sim q(\bm{x}_0)$, $t$, and $\bm{\epsilon} \sim \mathcal{N}(\bm{0}, \bm{I})$, then takes a gradient step on $\| \bm{\epsilon} - \bm{\epsilon}_{\theta}(\sqrt{\bar{\alpha}_t}\,\bm{x}_0 + \sqrt{1 - \bar{\alpha}_t}\,\bm{\epsilon}, t) \|^2$, as in the sketch below.
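A minimal PyTorch sketch of one training step, assuming `model(x_t, t)` is a noise-prediction network (the paper uses a U-Net); all names here are illustrative, not from an official implementation:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def training_step(model, x0, optimizer):
    """One gradient step on L_simple; `model(x_t, t)` predicts the noise eps."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)        # t ~ Uniform{0..T-1}
    eps = torch.randn_like(x0)                                        # eps ~ N(0, I)
    abar = alphas_bar.to(x0.device)[t].view(-1, *([1] * (x0.dim() - 1)))
    x_t = abar.sqrt() * x0 + (1.0 - abar).sqrt() * eps                # Eq. 4 in one shot
    loss = torch.mean((eps - model(x_t, t)) ** 2)                     # L_simple
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```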
3 Sampling
Sampling draws $\bm{x}_T \sim \mathcal{N}(\bm{0}, \bm{I})$ and runs the reverse (denoising) process down to $\bm{x}_0$: for $t = T, \dots, 1$, sample $\bm{x}_{t-1} \sim p_{\theta}(\bm{x}_{t-1} | \bm{x}_t)$ with mean given by Eq. 8 and variance $\sigma_t^2$ (Algorithm 2 in the paper); a sketch follows below.
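A minimal PyTorch sketch of this ancestral sampling loop, under the same assumptions as the training snippet (`model` predicts the noise; the choice $\sigma_t^2 = \beta_t$ is one of the two options discussed in Section 2.2):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_bar = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def sample(model, shape):
    """Algorithm 2 sketch: start from x_T ~ N(0, I), iterate Eq. 8 with sigma_t^2 = beta_t."""
    x = torch.randn(shape)                                        # x_T
    for t in reversed(range(T)):
        eps_hat = model(x, torch.full((shape[0],), t))            # predicted noise eps_theta(x_t, t)
        mean = (x - betas[t] / torch.sqrt(1.0 - alphas_bar[t]) * eps_hat) / torch.sqrt(alphas[t])
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)  # no noise at the final step
        x = mean + torch.sqrt(betas[t]) * z                       # sigma_t = sqrt(beta_t)
    return x
```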