[Page, E. Approximating to the cumulative normal function and its inverse for use on a pocket calculator. Applied Statistics, vol. 26, pp. 75-76, 1977.]
Overview

A combination of diffusion models and the variational bound.

Several papers on adversarial robustness have already used DDPM-generated data for training, which speaks to its power.
Main content
Diffusion models
reverse process
Starting from $p(x_T) = \mathcal{N}(x_T; 0, I)$:
$$p_{\theta}(x_{0:T}) := p(x_T) \prod_{t=1}^T p_{\theta}(x_{t-1}|x_t), \quad p_{\theta}(x_{t-1}|x_t) := \mathcal{N}(x_{t-1}; \mu_{\theta}(x_{t}, t), \Sigma_{\theta}(x_t, t)).$$
Note that in this process we fit the mean $\mu_{\theta}$ and the covariance matrix $\Sigma_{\theta}$. This reverse process gradually "recovers" the noise back into an image (signal) $x_0$.
forward process
$$q(x_{1:T}|x_0) := \prod_{t=1}^{T}q(x_t|x_{t-1}), \quad q(x_t|x_{t-1}):= \mathcal{N}(x_t; \sqrt{1 - \beta_t} x_{t-1}, \beta_t I),$$
where $\beta_t$ is either a trainable parameter or a manually chosen hyperparameter. This forward process gradually adds noise to the image (signal).
Variational bound
For the parameters $\theta$, it is natural to optimize by minimizing the negative log-likelihood:
$$\begin{array}{ll} \mathbb{E}_{p_{data}(x_0)} \bigg[-\log p_{\theta}(x_0) \bigg] &=\mathbb{E}_{p_{data}(x_0)} \bigg[-\log \int p_{\theta}(x_{0:T}) \mathrm{d}x_{0:T} \bigg] \\ &=\mathbb{E}_{p_{data}(x_0)} \bigg[-\log \int q(x_{1:T}|x_0)\frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)} \mathrm{d}x_{0:T} \bigg] \\ &=\mathbb{E}_{p_{data}(x_0)} \bigg[-\log \mathbb{E}_{q(x_{1:T}|x_0)} \frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)} \bigg] \\ &\le -\mathbb{E}_{p_{data}(x_0)}\mathbb{E}_{q(x_{1:T}|x_0)} \bigg[\log \frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)} \bigg] \\ &= -\mathbb{E}_q \bigg[\log \frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)} \bigg] \\ &= -\mathbb{E}_q \bigg[\log p(x_T) + \sum_{t=1}^T \log \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_t|x_{t-1})} \bigg] \\ &= -\mathbb{E}_q \bigg[\log p(x_T) + \sum_{t=2}^T \log \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_t|x_{t-1})} + \log \frac{p_{\theta}(x_0|x_1)}{q(x_1|x_0)} \bigg] \\ &= -\mathbb{E}_q \bigg[\log p(x_T) + \sum_{t=2}^T \log \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_{t-1}|x_t, x_0)} \cdot \frac{q(x_{t-1}|x_0)}{q(x_t|x_0)} + \log \frac{p_{\theta}(x_0|x_1)}{q(x_1|x_0)} \bigg] \\ &= -\mathbb{E}_q \bigg[\log \frac{p(x_T)}{q(x_T|x_0)} + \sum_{t=2}^T \log \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_{t-1}|x_t, x_0)} + \log p_{\theta}(x_0|x_1) \bigg] \\ \end{array}$$
Note: $q = q(x_{1:T}|x_0)\, p_{data}(x_0)$; below we let $q(x_0) := p_{data}(x_0)$.
Furthermore,
$$\begin{array}{ll} \mathbb{E}_q [\log \frac{q(x_T|x_0)}{p(x_T)}] &= \int q(x_0, x_T) \log \frac{q(x_T|x_0)}{p(x_T)} \mathrm{d}x_0 \mathrm{d}x_T \\ &= \int q(x_0) q(x_T|x_0) \log \frac{q(x_T|x_0)}{p(x_T)} \mathrm{d}x_0 \mathrm{d}x_T \\ &= \int q(x_0) \mathrm{D_{KL}}(q(x_T|x_0) \| p(x_T)) \mathrm{d}x_0 \\ &= \int q(x_{0:T}) \mathrm{D_{KL}}(q(x'_T|x_0) \| p(x'_T)) \mathrm{d}x_{0:T} \\ &= \mathbb{E}_q \bigg[\mathrm{D_{KL}}(q(x'_T|x_0) \| p(x'_T)) \bigg]. \end{array}$$
Similarly,
$$\begin{array}{ll} \mathbb{E}_q [\log \frac{q(x_{t-1}|x_t, x_0)}{p_{\theta}(x_{t-1}|x_t)}] &=\int q(x_0, x_{t-1}, x_t) \log \frac{q(x_{t-1}|x_t, x_0)}{p_{\theta}(x_{t-1}|x_t)} \mathrm{d}x_0 \mathrm{d}x_{t-1}\mathrm{d}x_t\\ &=\int q(x_0, x_t) \mathrm{D_{KL}}(q(x_{t-1}|x_t, x_0)\| p_{\theta}(x_{t-1}|x_t)) \mathrm{d}x_0 \mathrm{d}x_t\\ &=\mathbb{E}_q\bigg[\mathrm{D_{KL}}(q(x'_{t-1}|x_t, x_0)\| p_{\theta}(x'_{t-1}|x_t)) \bigg]. \end{array}$$
Hence, finally:
$$\mathcal{L} := \mathbb{E}_q \bigg[ \underbrace{\mathrm{D_{KL}}(q(x'_T|x_0) \| p(x'_T))}_{L_T} + \sum_{t=2}^T \underbrace{\mathrm{D_{KL}}(q(x'_{t-1}|x_t, x_0)\| p_{\theta}(x'_{t-1}|x_t))}_{L_{t-1}} \underbrace{-\log p_{\theta}(x_0|x_1)}_{L_0} \bigg].$$
Computing the losses

Since both the forward and the reverse processes are Gaussian, the loss terms above can be computed in closed form.

First, for $x_t$ in the forward process:
$$\begin{array}{ll} x_t &= \sqrt{1 - \beta_t} x_{t-1} + \sqrt{\beta_t} \epsilon, \: \epsilon \sim \mathcal{N}(0, I) \\ &= \sqrt{1 - \beta_t} (\sqrt{1 - \beta_{t-1}} x_{t-2} + \sqrt{\beta_{t-1}} \epsilon') + \sqrt{\beta_t} \epsilon \\ &= \sqrt{1 - \beta_t}\sqrt{1 - \beta_{t-1}} x_{t-2} + \sqrt{1 - \beta_t}\sqrt{\beta_{t-1}} \epsilon' + \sqrt{\beta_t} \epsilon \\ &= \sqrt{1 - \beta_t}\sqrt{1 - \beta_{t-1}} x_{t-2} + \sqrt{1 - (1 - \beta_t)(1 - \beta_{t-1})} \epsilon \\ &= \cdots \\ &= \Big(\prod_{s=1}^t \sqrt{1 - \beta_s}\Big) x_0 + \sqrt{1 - \prod_{s=1}^t (1 - \beta_s)}\, \epsilon, \end{array}$$
hence
$$q(x_t|x_0) = \mathcal{N}(x_t|\sqrt{\bar{\alpha}_t}x_0, (1 - \bar{\alpha}_t)I), \quad \bar{\alpha}_t := \prod_{s=1}^t \alpha_s, \quad \alpha_s := 1 - \beta_s.$$
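As a quick illustrative sketch (not the authors' code), this closed-form forward sampling can be written as follows. The linear $\beta_t$ schedule matches the hyperparameters reported at the end of the post, and `q_sample` is a name chosen here:

```python
import numpy as np

# Linear beta schedule (as reported later in the post): 1e-4 -> 0.02 over T steps.
T = 1000
betas = np.linspace(1e-4, 0.02, T)       # beta_1, ..., beta_T
alphas = 1.0 - betas                     # alpha_t = 1 - beta_t
alpha_bars = np.cumprod(alphas)          # \bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0, t, eps):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(abar_t) x0, (1 - abar_t) I); t is 1-indexed."""
    a_bar = alpha_bars[t - 1]
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
```

Because $q(x_t|x_0)$ is available in closed form, training never needs to simulate the chain step by step.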
For the posterior $q(x_{t-1}|x_t, x_0)$ we have
$$\begin{array}{ll} q(x_{t-1}|x_t, x_0) &= \frac{q(x_t|x_{t-1})q(x_{t-1}|x_0)}{q(x_t|x_0)} \\ &\propto q(x_t|x_{t-1})q(x_{t-1}|x_0) \\ &\propto \exp\Bigg\{-\frac{1}{2 (1 - \bar{\alpha}_{t-1})\beta_t} \bigg[(1 - \bar{\alpha}_{t-1}) \|x_t - \sqrt{1 - \beta_t} x_{t-1}\|^2 + \beta_t \|x_{t-1} - \sqrt{\bar{\alpha}_{t-1}}x_0\|^2 \bigg]\Bigg\} \\ &\propto \exp\Bigg\{-\frac{1}{2 (1 - \bar{\alpha}_{t-1})\beta_t} \bigg[(1 - \bar{\alpha}_t)\|x_{t-1}\|^2 - 2(1 - \bar{\alpha}_{t-1}) \sqrt{\alpha_t}\, x_t^T x_{t-1} - 2 \sqrt{\bar{\alpha}_{t-1}} \beta_t\, x_0^T x_{t-1} \bigg]\Bigg\}. \\ \end{array}$$
Therefore
$$q(x_{t-1}|x_t, x_0) \sim \mathcal{N}(x_{t-1}|\tilde{u}_t(x_t, x_0), \tilde{\beta}_t I),$$
where
$$\tilde{u}_t(x_t,x_0) := \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} x_t,$$
$$\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t.$$
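A minimal numeric sketch of $\tilde{u}_t$ and $\tilde{\beta}_t$ (the schedule values and the function name `posterior_params` are assumptions for illustration):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def posterior_params(x0, xt, t):
    """Mean and variance of q(x_{t-1} | x_t, x_0) for t >= 2 (1-indexed)."""
    a_bar_t, a_bar_prev = alpha_bars[t - 1], alpha_bars[t - 2]
    beta_t, alpha_t = betas[t - 1], alphas[t - 1]
    mean = (np.sqrt(a_bar_prev) * beta_t / (1.0 - a_bar_t)) * x0 \
         + (np.sqrt(alpha_t) * (1.0 - a_bar_prev) / (1.0 - a_bar_t)) * xt
    var = (1.0 - a_bar_prev) / (1.0 - a_bar_t) * beta_t   # \tilde{beta}_t
    return mean, var
```

Note that $\tilde{\beta}_t < \beta_t$ always holds, since $1 - \bar{\alpha}_{t-1} < 1 - \bar{\alpha}_t$.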
$L_t$
$L_T$ does not depend on $\theta$, so it is dropped.
The authors take $\Sigma_{\theta}(x_t, t) = \sigma_t^2 I$ with $\sigma_t^2$ untrained, where
$$\sigma_t^2 = \beta_t \quad \text{or} \quad \tilde{\beta}_t,$$
these being the optimal choices (in expected KL divergence) when $x_0 \sim \mathcal{N}(0, I)$ and when $x_0$ is a fixed point, respectively (the authors report the two perform similarly in experiments).
Hence
$$L_{t} = \frac{1}{2 \sigma^2_t} \| \mu_{\theta}(x_t, t) - \tilde{u}_t(x_t, x_0)\|^2 + C, \quad t = 1,2,\cdots, T-1.$$
Moreover,
$$x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon \Rightarrow x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}x_t - \frac{\sqrt{1 - \bar{\alpha}_t} }{\sqrt{\bar{\alpha}_t}} \epsilon,$$
so
$$\begin{array}{ll} \mathbb{E}_q [L_{t-1} - C] &= \mathbb{E}_{x_0, \epsilon} \bigg\{ \frac{1}{2 \sigma_t^2} \Big\| \mu_{\theta}(x_t, t) - \tilde{u}_t\big( x_t, \frac{1}{\sqrt{\bar{\alpha}_t}}x_t - \frac{\sqrt{1 - \bar{\alpha}_t} }{\sqrt{\bar{\alpha}_t}} \epsilon \big)\Big\|^2 \bigg\} \\ &= \mathbb{E}_{x_0, \epsilon} \bigg\{ \frac{1}{2 \sigma^2_t} \Big\| \mu_{\theta}(x_t, t) - \frac{1}{\sqrt{\alpha_t}} \big( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon \big) \Big\|^2 \bigg\}. \\ \end{array}$$
Note: in the expression above $x_t$ is determined by $x_0$ and $\epsilon$, i.e. $x_t = x_t(x_0, \epsilon)$, so the expectation is effectively taken over $x_t$.
Given this, we may as well parameterize $\mu_{\theta}$ directly as
$$\mu_{\theta}(x_t, t):= \frac{1}{\sqrt{\alpha_t}} \big( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_{\theta}(x_t, t) \big),$$
that is, we model the noise (residual) $\epsilon$ directly. The loss then simplifies to:
$$\mathbb{E}_{x_0, \epsilon} \bigg\{ \frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1 - \bar{\alpha}_t)} \|\epsilon_{\theta}(\sqrt{\bar{\alpha}_t}x_0 + \sqrt{1 - \bar{\alpha}_t}\epsilon, t) - \epsilon\|^2 \bigg\}.$$
This is in fact denoising score matching.
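A sketch of one stochastic estimate of this objective; `eps_model` is a hypothetical stand-in for $\epsilon_{\theta}$, and the leading coefficient is dropped for simplicity:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def simple_loss(eps_model, x0, rng):
    """One Monte-Carlo term of the (unweighted) epsilon-prediction loss."""
    t = int(rng.integers(1, T + 1))                 # t uniform in {1, ..., T}
    eps = rng.standard_normal(x0.shape)             # target noise
    a_bar = alpha_bars[t - 1]
    xt = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return np.mean((eps_model(xt, t) - eps) ** 2)
```

Each training step draws a fresh $(t, \epsilon)$ pair, so the full sum over $t$ is never materialized.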
Similarly, sampling from $p_{\theta}(x_{t-1}|x_t)$ gives:
$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \big( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_{\theta}(x_t, t) \big) + \sigma_t z, \quad z \sim \mathcal{N}(0, I),$$
which has the form of Langevin dynamics (with slightly modified step size and weighting).
Note: see here for more on this connection.
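One reverse (ancestral) sampling step could be sketched as follows; `eps_model` again stands in for $\epsilon_{\theta}$, and $\sigma_t = \sqrt{\beta_t}$ is one of the two variance choices discussed above:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def p_sample_step(eps_model, xt, t, rng):
    """One step x_t -> x_{t-1} of p_theta (t is 1-indexed)."""
    beta_t, alpha_t, a_bar = betas[t - 1], alphas[t - 1], alpha_bars[t - 1]
    mean = (xt - beta_t / np.sqrt(1.0 - a_bar) * eps_model(xt, t)) / np.sqrt(alpha_t)
    if t == 1:
        return mean                                  # last step: no added noise
    return mean + np.sqrt(beta_t) * rng.standard_normal(xt.shape)
```

Generation starts from $x_T \sim \mathcal{N}(0, I)$ and applies this step for $t = T, \dots, 1$.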
$L_0$
Finally we need to handle $L_0$. Here the authors assume $x_0|x_1$ follows a discrete distribution: pixel values lie in $\{0, 1, 2, \cdots, 255\}$ and are normalized to $[-1, 1]$. Assume
$$p_{\theta}(x_0|x_1) = \prod_{i=1}^D \int_{\delta_{-}(x_0^i)}^{\delta_+(x_0^i)} \mathcal{N}(x; \mu_{\theta}^i(x_1, 1), \sigma_1^2) \mathrm{d}x, \\ \delta_+(x) = \left \{ \begin{array}{ll} +\infty & \text{if } x = 1, \\ x + \frac{1}{255} & \text{if } x < 1, \end{array} \right . \quad \delta_-(x) = \left \{ \begin{array}{ll} -\infty & \text{if } x = -1, \\ x - \frac{1}{255} & \text{if } x > -1. \end{array} \right .$$
In effect this partitions the real line into the bins
$$(-\infty, -1 + 1/255], \; (-1 + 1/255, -1 + 3/255], \; \cdots, \; (1 - 3/255, 1 - 1/255], \; (1 - 1/255, +\infty),$$
and each pixel value falls into exactly one of them.
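A sketch of this binned likelihood for a single pixel (function names are illustrative; the CDF here uses `math.erf` rather than the approximation discussed next):

```python
import math

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def discretized_gaussian_prob(pixel, mu, sigma):
    """P(pixel's bin) under N(mu, sigma^2); pixel lies on the 256-level grid in [-1, 1]."""
    upper = math.inf if pixel >= 1.0 else pixel + 1.0 / 255
    lower = -math.inf if pixel <= -1.0 else pixel - 1.0 / 255
    hi = 1.0 if upper == math.inf else normal_cdf((upper - mu) / sigma)
    lo = 0.0 if lower == -math.inf else normal_cdf((lower - mu) / sigma)
    return hi - lo
```

Summed over all 256 pixel levels, these bin probabilities total 1, so this defines a proper discrete likelihood.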
In actual code one runs into the problem of evaluating the Gaussian cumulative distribution function (there is no closed form); the authors use the following approximation:
$$\Phi(x) \approx \frac{1}{2} \Bigg\{1 + \tanh \bigg(\sqrt{2/\pi} \big(x + 0.044715 x^3\big) \bigg) \Bigg\}.$$
This way gradients can still be backpropagated.
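A tanh-based approximation of this kind is easy to check against the exact CDF (computed here with `math.erf`; the function name is chosen for illustration):

```python
import math

def approx_normal_cdf(x):
    """tanh-based approximation to the standard normal CDF Phi(x)."""
    return 0.5 * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))
```

The absolute error stays well below $10^{-3}$ over the real line, which is ample for this likelihood term.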
Note: this approximation is due to Page (see the reference at the top).
The final algorithm
Note: $t = 1$ corresponds to $L_0$, and $t = 2, \cdots, T$ correspond to $L_1, \cdots, L_{T-1}$.
Note: for $L_t$ the authors drop the leading coefficient; this effectively acts as a reweighting.
In practice, the authors train on a sampled loss term (a single random $t$ per step) rather than the full sum.
Details
Note that the authors' $\epsilon_{\theta}(\cdot, t)$ explicitly conditions on $t$; in the experiments this is implemented with the positional encoding used in attention. Let the positional encoding be $P$:
- $t = \text{Linear}(\text{ACT}(\text{Linear}(t * P)))$, i.e. the time-step embedding is produced by a two-layer MLP;
- the authors use a U-Net backbone, and in each residual block:
  $$x \mathrel{+}= \text{Linear}(\text{ACT}(t)).$$
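A sketch of such a sinusoidal time-step embedding (the exact frequencies and the downstream two-layer MLP are assumptions; `timestep_embedding` is a name chosen here):

```python
import numpy as np

def timestep_embedding(t, dim):
    """Sinusoidal embedding of scalar timestep t (dim assumed even),
    in the style of Transformer positional encodings."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)  # geometric frequencies
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])
```

The resulting vector would then be passed through the two-layer MLP described above before being added inside each residual block.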
| Parameter | Value |
|---|---|
| $T$ | 1000 |
| $\beta_t$ | linearly increasing over $[0.0001, 0.02]$ for $t = 1, 2, \cdots, T$ |
| backbone | U-Net |
Note: the implementation also uses tricks such as EMA (an exponential moving average of the weights).