VAE and Hierarchical VAE
Recall from the earlier article 【理论推导】变分自动编码器 Variational AutoEncoder (VAE) the conclusion
$$\log p(x) = \mathbb E_{z\sim q(z|x)}\left[\log \frac{p(x,z)}{q(z|x)}\right] + \text{KL}(q||p) \geq \mathbb E_{z\sim q(z|x)}\left[\log \frac{p(x,z)}{q(z|x)}\right]$$
An alternative derivation of this inequality is as follows:
$$\log p(x) = \log \mathbb E_{z\sim q(z|x)}\left[\frac{p(x,z)}{q(z|x)}\right] \geq \mathbb E_{z\sim q(z|x)}\left[\log \frac{p(x,z)}{q(z|x)}\right]$$
where the inequality follows from Jensen's inequality.
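This is easy to confirm numerically. The following check (my own illustration, not from the original article) draws samples of a positive random variable and verifies $\log \mathbb E[f] \geq \mathbb E[\log f]$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Positive samples of an arbitrary random variable f(z), z ~ q.
f = np.exp(rng.normal(size=100_000))  # log-normal, so f > 0

# Jensen's inequality for the concave log: log E[f] >= E[log f].
lhs = np.log(f.mean())
rhs = np.log(f).mean()
print(lhs, rhs)  # lhs should be the larger of the two
assert lhs >= rhs
```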
Extending the single-layer VAE to a two-layer (hierarchical) VAE gives
$$\begin{align} \log p(x) &= \log \int_{z_1}\int_{z_2} p(x, z_1,z_2)\, dz_1dz_2 \nonumber \\&= \log \int_{z_1}\int_{z_2} q(z_1, z_2|x) \frac{p(x, z_1,z_2)}{q(z_1, z_2|x)}\, dz_1dz_2 \nonumber \\&=\log \mathbb E_{z_1,z_2\sim q(z_1,z_2|x)}\left[ \frac{p(x, z_1,z_2)}{q(z_1, z_2|x)}\right] \nonumber \\&\geq \mathbb E_{z_1,z_2\sim q(z_1,z_2|x)}\left[ \log \frac{p(x, z_1,z_2)}{q(z_1, z_2|x)}\right] \nonumber \\&\overset{(i)}{=} \mathbb E_{z_1,z_2\sim q(z_1,z_2|x)}\left[ \log \frac{p(x|z_1)p(z_1|z_2)p(z_2)}{q(z_1|x)q(z_2|z_1)}\right] \nonumber \end{align}$$
Step (i) requires the variables to satisfy the Markov assumption, so that $q(z_1,z_2|x)=q(z_1|x)q(z_2|z_1)$ and $p(x,z_1,z_2)=p(x|z_1)p(z_1|z_2)p(z_2)$. If we extend the hierarchical VAE to more layers, the resulting graphical structure closely resembles that of a diffusion model, so we can view diffusion models through the lens of VAE techniques.
DDPM
A diffusion model turns an image into noise by repeatedly adding noise to it; this is called the forward diffusion process. Conversely, sampling an initial noise image from a prior noise distribution and generating an image by iterative denoising is called the reverse diffusion process, analogous in spirit to image generation with Langevin Dynamics.
Forward diffusion process
Suppose $x_0\sim q(x)$ is a sample from the true data distribution $q$. We add $T$ steps of Gaussian noise to it according to
$$q(x_t|x_{t-1}) = \mathcal N(x_t; \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I)$$
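As a minimal sketch (illustrative code, not from the article), one forward step can be implemented by sampling from this Gaussian directly, treating the image as a flat array:

```python
import numpy as np

def forward_step(x_prev, beta_t, rng):
    """Sample x_t ~ N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)  # toy "image" with 4 pixels
x1 = forward_step(x0, beta_t=0.02, rng=rng)
```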
where $\beta_t \in [0,1]$. The whole process satisfies the Markov assumption, so
$$q(x_{1:T}|x_0) = \prod_{t=1}^T q(x_t|x_{t-1})$$

As $T\rightarrow \infty$, $x_T$ approaches a Gaussian distribution.
To obtain $x_t$ quickly, we can avoid the recursion and derive a closed-form expression. Let $\alpha_t = 1-\beta_t$ and $\overline{\alpha}_t = \prod_{i=1}^t\alpha_i$, and let $\{z_i, \overline{z}_i \sim \mathcal N(0,I)\}_{i=0}^T$ be independent and identically distributed random variables. Unrolling the recursion gives
$$\begin{align} x_t &= \sqrt{\alpha_t}x_{t-1} + \sqrt{1-\alpha_t}\,z_{t-1} \nonumber \\&= \sqrt{\alpha_t\alpha_{t-1}}\,x_{t-2} + \sqrt{\alpha_t}\sqrt{1-\alpha_{t-1}}\,z_{t-2}+ \sqrt{1-\alpha_t}\,z_{t-1} \nonumber \\&\overset{(i)}{=} \sqrt{\alpha_t\alpha_{t-1}}\,x_{t-2} + \sqrt{1-\alpha_t\alpha_{t-1}}\,\overline{z}_{t-2} \nonumber \\&= \cdots \nonumber \\&=\sqrt{\overline{\alpha}_t}\,x_0+\sqrt{1-\overline{\alpha}_t}\,\overline{z}_0 \tag{1} \end{align}$$
Equality (i) uses the fact that a linear combination of two independent Gaussians is again Gaussian: for independent $A\sim \mathcal{N}(\mu_a, \sigma_a^2)$ and $B\sim \mathcal{N}(\mu_b, \sigma_b^2)$, the combination satisfies $mA+nB \sim \mathcal{N}(m\mu_a+n\mu_b,\ m^2\sigma_a^2+n^2\sigma_b^2)$. Therefore,
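This closure property is easy to confirm empirically (an illustrative check with arbitrarily chosen parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
mu_a, sig_a, mu_b, sig_b, m, n = 1.0, 2.0, -3.0, 0.5, 0.7, 1.5

A = rng.normal(mu_a, sig_a, size=500_000)
B = rng.normal(mu_b, sig_b, size=500_000)
S = m * A + n * B  # should be N(m*mu_a + n*mu_b, m^2*sig_a^2 + n^2*sig_b^2)

print(S.mean())  # close to m*mu_a + n*mu_b = -3.8
print(S.var())   # close to m^2*sig_a^2 + n^2*sig_b^2 = 2.5225
```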
$$\begin{align} x_t|x_0 \sim \mathcal N(\sqrt{\overline\alpha_t}\,x_0,\ (1-\overline\alpha_t)I) \tag{2} \end{align}$$
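The closed form can be verified against the step-by-step recursion by Monte Carlo (an illustrative check with an arbitrary schedule):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

x0 = 2.0  # a scalar "pixel"
N = 100_000

# Iterative: apply q(x_t | x_{t-1}) T times.
x = np.full(N, x0)
for t in range(T):
    x = np.sqrt(alphas[t]) * x + np.sqrt(betas[t]) * rng.standard_normal(N)

# Direct: x_T | x_0 ~ N(sqrt(alpha_bar_T) * x0, (1 - alpha_bar_T) * I)
mean_direct = np.sqrt(alpha_bar[-1]) * x0
var_direct = 1.0 - alpha_bar[-1]

print(x.mean(), mean_direct)  # should agree
print(x.var(), var_direct)    # should agree
```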
For the forward process we want the noise strength to grow over time, i.e. $\beta_1 <\beta_2 < \cdots <\beta_{T-1} < \beta_T$, which implies $1>\overline{\alpha}_1 > \cdots > \overline{\alpha}_T>0$.
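For instance, with the linear schedule used in the DDPM paper ($\beta$ from $10^{-4}$ to $0.02$ over $T=1000$ steps), $\overline\alpha_t$ decreases monotonically from near 1 toward 0:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # increasing noise strength
alpha_bar = np.cumprod(1.0 - betas)  # strictly decreasing in (0, 1)

assert np.all(np.diff(betas) > 0)
assert np.all(np.diff(alpha_bar) < 0)
print(alpha_bar[0], alpha_bar[-1])  # near 1 at t=1, near 0 at t=T
```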
Reverse diffusion process / sampling
We wish to recover $x_0$ from $x_T$, which requires modeling the conditional distribution $q(x_{t-1}|x_t)$. By Bayes' rule,
$$q(x_{t-1}|x_t) = q(x_t |x_{t-1})\frac{q(x_{t-1})}{q(x_t)}$$
The true marginal $q(x_{t-1})$ is unavailable, so we approximate $q(x_{t-1}|x_t)$ by additionally conditioning on $x_0$:
$$q(x_{t-1} |x_{t})\approx q(x_{t-1} |x_{t},x_0) = q(x_t |x_{t-1},x_0)\frac{q(x_{t-1}|x_0)}{q(x_t|x_0)}$$
where $q(x_{t-1}|x_t,x_0)$ can be computed in closed form. Applying Bayes' rule again,
$$\begin{align} q(x_{t-1}|x_t,x_0) &= q(x_t |x_{t-1}, x_0)\frac{q(x_{t-1}|x_0)}{q(x_t|x_0)} \nonumber \\&\propto \exp\left( -\frac{1}{2}\left(\frac{(x_t-\sqrt{\alpha_t}x_{t-1})^2}{\beta_t}+\frac{(x_{t-1}-\sqrt{\overline\alpha_{t-1}}x_{0})^2}{1-\overline\alpha_{t-1}}-\frac{(x_{t}-\sqrt{\overline\alpha_{t}}x_{0})^2}{1-\overline\alpha_{t}}\right)\right) \nonumber \\&=\exp\left( -\frac{1}{2}\left(\left(\frac{\alpha_t}{\beta_t}+\frac{1}{1-\overline\alpha_{t-1}}\right)x_{t-1}^2 - \left(\frac{2\sqrt{\alpha_t}}{\beta_t}x_t+\frac{2\sqrt{\overline \alpha_{t-1}}}{1-\overline \alpha_{t-1}}x_0\right)x_{t-1} + C(x_0,x_t) \right)\right)\nonumber \end{align}$$
Comparing with the functional form of a Gaussian density, the conditional distribution $x_{t-1}|x_t,x_0$ is Gaussian with the following mean and variance:
$$\begin{align} \mu &= \left(\frac{\sqrt{\alpha_t}}{\beta_t}x_t+\frac{\sqrt{\overline \alpha_{t-1}}}{1-\overline \alpha_{t-1}}x_0\right) \bigg/ \left(\frac{\alpha_t}{\beta_t}+\frac{1}{1-\overline\alpha_{t-1}}\right) = \frac{\sqrt{\alpha_t}(1-\overline\alpha_{t-1})}{1-\overline\alpha_t}x_t+\frac{\sqrt{\overline\alpha_{t-1}}\beta_{t}}{1-\overline\alpha_{t}}x_0 \nonumber \\ \sigma^2 &= \tilde{\beta}_t = \frac{1}{\frac{\alpha_t}{\beta_t}+\frac{1}{1-\overline\alpha_{t-1}}} = \frac{1-\overline\alpha_{t-1}}{1-\overline \alpha_t}\beta_t \tag{3} \end{align}$$
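These posterior parameters are straightforward to compute from a noise schedule. The helper below is my own sketch (the function name and the 1-indexed `t` convention are assumptions, not from the article):

```python
import numpy as np

def posterior_params(x_t, x0, t, betas):
    """Mean and variance of q(x_{t-1} | x_t, x_0), Eq. (3); t is 1-indexed."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    ab_t = alpha_bar[t - 1]
    ab_prev = alpha_bar[t - 2] if t > 1 else 1.0
    mean = (np.sqrt(alphas[t - 1]) * (1 - ab_prev) / (1 - ab_t) * x_t
            + np.sqrt(ab_prev) * betas[t - 1] / (1 - ab_t) * x0)
    var = (1 - ab_prev) / (1 - ab_t) * betas[t - 1]
    return mean, var
```

A useful sanity check is that these simplified right-hand sides agree with the unsimplified ratio forms on the left of Eq. (3).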
We use a neural network to fit $\overline z_0$, i.e. $\epsilon_\theta(x_t,t)\approx\overline z_0$. Since $x_t$ is obtained by adding the noise $\overline z_0$ to $x_0$, the network is essentially predicting the added noise. Substituting Eq. (1) to eliminate $x_0$ gives
$$\begin{align} \mu &= \tilde \mu_t = \frac{1}{\sqrt{\alpha_t}}\left(x_t-\frac{\beta_t}{\sqrt{1-\overline{\alpha}_t}}\overline z_0\right) \tag{4} \end{align}$$
Therefore,
$$\begin{align} x_{t-1}|x_t \sim \mathcal N\left(\frac{1}{\sqrt{\alpha_t}}\left(x_t-\frac{\beta_t}{\sqrt{1-\overline{\alpha}_t}}\overline z_0\right),\ \frac{1-\overline\alpha_{t-1}}{1-\overline \alpha_t}\beta_t \right) \tag{5} \end{align}$$
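A single reverse step per Eq. (5) might look as follows (a sketch; `eps_pred` stands in for the output of a trained network $\epsilon_\theta(x_t,t)$, here replaced by a zero placeholder):

```python
import numpy as np

def ddpm_step(x_t, eps_pred, t, betas, rng):
    """Sample x_{t-1} | x_t per Eq. (5); eps_pred plays the role of eps_theta(x_t, t)."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    ab_t = alpha_bar[t - 1]
    ab_prev = alpha_bar[t - 2] if t > 1 else 1.0
    mean = (x_t - betas[t - 1] / np.sqrt(1 - ab_t) * eps_pred) / np.sqrt(alphas[t - 1])
    var = (1 - ab_prev) / (1 - ab_t) * betas[t - 1]
    z = rng.standard_normal(np.shape(x_t)) if t > 1 else 0.0
    return mean + np.sqrt(var) * z

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 50)
x = rng.standard_normal(4)      # start from pure noise x_T
for t in range(50, 0, -1):
    eps_pred = np.zeros_like(x)  # placeholder; a real model goes here
    x = ddpm_step(x, eps_pred, t, betas, rng)
```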
Loss function
Now consider the design of the loss. Suppose we fit the true data distribution $q$ with a parametric probabilistic model $p_\theta$. By the non-negativity of the KL divergence,
$$\begin{align} -\log p_\theta(x_0) &\leq -\log p_\theta(x_0) +\text{KL}(q(x_{1:T}|x_0)||p_\theta(x_{1:T}|x_0)) \nonumber \\&= -\log p_\theta(x_0) +\mathbb E_{q(x_{1:T}|x_0)}\left[\log \frac{q(x_{1:T}|x_0)}{p_\theta(x_{0:T})/p_\theta(x_0)}\right] \nonumber \\&= \mathbb E_{q(x_{1:T}|x_0)}\left[\log \frac{q(x_{1:T}|x_0)}{p_\theta(x_{0:T})}\right] \nonumber \end{align}$$
Taking expectations over $q(x_0)$ on both sides,
$$\mathbb E_{q(x_0)}[-\log p_\theta(x_0)]\leq \mathbb E_{q(x_{0:T})}\left[\log \frac{q(x_{1:T}|x_0)}{p_\theta(x_{0:T})}\right] \overset{\triangle}{=} L_\text{VLB}$$
Simplifying $L_\text{VLB}$:
$$\begin{align} L_\text{VLB} &= \mathbb E_{q(x_{0:T})}\left[\log \frac{q(x_{1:T}|x_0)}{p_\theta(x_{0:T})}\right] \nonumber \\&=\mathbb E_{q(x_{0:T})}\left[-\log p(x_T)+\sum_{t=1}^T\log\frac{q(x_t|x_{t-1})}{p_\theta(x_{t-1}|x_t)}\right] \nonumber \\&\overset{(i)}{=}\mathbb E_{q(x_{0:T})}\left[-\log p(x_T)+\sum_{t=2}^T\log\left(\frac{q(x_{t-1}|x_t,x_0)}{p_\theta(x_{t-1}|x_t)}\frac{q(x_t|x_0)}{q(x_{t-1}|x_0)}\right) + \log\frac{q(x_1|x_{0})}{p_\theta(x_{0}|x_1)}\right] \nonumber \\&=\mathbb E_{q(x_{0:T})}\left[-\log p(x_T)+\sum_{t=2}^T\log\frac{q(x_{t-1}|x_t,x_0)}{p_\theta(x_{t-1}|x_t)}+\log\frac{q(x_T|x_0)}{q(x_{1}|x_0)} + \log\frac{q(x_1|x_{0})}{p_\theta(x_{0}|x_1)}\right] \nonumber \\&=\mathbb E_{q(x_{0:T})}\left[\sum_{t=2}^T\log\frac{q(x_{t-1}|x_t,x_0)}{p_\theta(x_{t-1}|x_t)}+\log\frac{q(x_T|x_0)}{p(x_T)} - \log p_\theta(x_{0}|x_1)\right] \nonumber \\&=\text{KL}(q(x_T|x_0)||p(x_T)) +\sum_{t=2}^T\text{KL}(q(x_{t-1}|x_t,x_0)||p_\theta(x_{t-1}|x_t))-\log p_\theta(x_0|x_1) \tag{6} \end{align}$$
Equality (i) follows from
$$q(x_t|x_{t-1}) = q(x_t|x_{t-1},x_0)=\frac{q(x_{t-1}|x_t,x_0)\,q(x_t|x_0)}{q(x_{t-1}|x_0)}$$
We fix the variances $\beta_t$ as hyperparameters, so the first term in Eq. (6) contains no learnable parameters and can be ignored as a constant; as for the last term, the authors found that simply dropping it gives better training. Since $p_\theta$ is the model we use to fit the reverse distribution, we may assume $p_\theta(x_{t-1}|x_t) = \mathcal N(\mu_\theta(x_t,t),\sigma_t^2 I)$, so only the mean depends on the input. This gives
$$\begin{align} \text{KL}(q(x_{t-1}|x_t,x_0)||p_\theta(x_{t-1}|x_t)) &=\mathbb E_q\left[\frac{1}{2\sigma_t^2}||\tilde\mu_t(x_t,x_0)-\mu_\theta(x_t,t)||_2^2\right] +C \nonumber \end{align}$$
Let $\epsilon \sim \mathcal N(0,I)$. Using Eq. (1), write $x_t$ in terms of $x_0$ and $\epsilon$, and substitute Eq. (4) for $\tilde\mu_t$:
$$\begin{align} \text{KL}(q(x_{t-1}|x_t,x_0)||p_\theta(x_{t-1}|x_t)) &=\mathbb E_q\left[\frac{1}{2\sigma_t^2}\left\|\frac{1}{\sqrt{\alpha_t}}\left(x_t(x_0,\epsilon)-\frac{\beta_t}{\sqrt{1-\overline\alpha_t}}\epsilon\right)-\mu_\theta(x_t,t)\right\|_2^2\right] +C \nonumber \end{align}$$
where, following Eq. (4), $\mu_\theta$ and $\epsilon_\theta$ are related by
$$\mu_\theta(x_t,t) = \frac{1}{\sqrt{\alpha_t}}\left( x_t(x_0,\epsilon)-\frac{\beta_t}{\sqrt{1-\overline\alpha_t}}\epsilon_\theta(x_t,t)\right)$$
Therefore,
$$\begin{align} \text{KL}(q(x_{t-1}|x_t,x_0)||p_\theta(x_{t-1}|x_t)) &=\mathbb E_{x_0,\epsilon}\left[\frac{\beta_t^2}{2\sigma_t^2\alpha_t(1-\overline{\alpha}_t)}\left\|\epsilon-\epsilon_\theta(\sqrt{\overline\alpha_t}x_0+\sqrt{1-\overline\alpha_t}\epsilon,\ t)\right\|_2^2\right] +C \nonumber \end{align}$$
Dropping the weighting coefficient, the (simplified) loss function is then
$$\begin{align} \mathcal L(\theta) &=\mathbb E_{t,x_0,\epsilon}\left[\left\|\epsilon-\epsilon_\theta(\sqrt{\overline\alpha_t}x_0+\sqrt{1-\overline\alpha_t}\epsilon,\ t)\right\|_2^2\right] \tag{7} \end{align}$$
Algorithm
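The original post shows the DDPM paper's Algorithm 1 (training) and Algorithm 2 (sampling) as figures, which are not reproduced here; the sketch below paraphrases them in code, with `dummy_model` a placeholder for the trained noise predictor $\epsilon_\theta$:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def training_loss(model, x0, rng):
    """Algorithm 1 (one step): sample t and eps, regress eps_theta onto eps."""
    t = rng.integers(1, T + 1)
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t - 1]) * x0 + np.sqrt(1 - alpha_bar[t - 1]) * eps
    return np.sum((eps - model(x_t, t)) ** 2)  # minimized w.r.t. model params

def sample(model, shape, rng):
    """Algorithm 2: start from pure noise and denoise step by step."""
    x = rng.standard_normal(shape)
    for t in range(T, 0, -1):
        ab_t = alpha_bar[t - 1]
        ab_prev = alpha_bar[t - 2] if t > 1 else 1.0
        mean = (x - betas[t - 1] / np.sqrt(1 - ab_t) * model(x, t)) / np.sqrt(alphas[t - 1])
        sigma = np.sqrt((1 - ab_prev) / (1 - ab_t) * betas[t - 1])
        x = mean + (sigma * rng.standard_normal(shape) if t > 1 else 0.0)
    return x

dummy_model = lambda x_t, t: np.zeros_like(x_t)  # stands in for eps_theta
loss = training_loss(dummy_model, np.ones(4), rng)
img = sample(dummy_model, (4,), rng)
```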
DDIM
The main drawback of DDPM is that sampling requires many iterations: the derivation is built on $q(x_{t-1}|x_t)$, so each step decreases the index by exactly 1. If we could instead compute $q(x_s|x_t)$ for any $s<t$, we could freely choose the number of iterations on the trajectory from $t=T$ down to $0$. This requires removing the dependence on $q(x_t |x_{t-1})$ in the following expression:
$$q(x_{t-1} |x_{t})\approx q(x_{t-1} |x_{t},x_0) = q(x_t |x_{t-1},x_0)\frac{q(x_{t-1}|x_0)}{q(x_t|x_0)} = q(x_t |x_{t-1})\frac{q(x_{t-1}|x_0)}{q(x_t|x_0)}$$
Even without $q(x_t|x_{t-1},x_0)$, we can still solve for the conditional via
$$\int p(x_{t-1}|x_t,x_0)\,p(x_t|x_0)\,dx_t=p(x_{t-1}|x_0)$$
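This marginalization identity can be checked by Monte Carlo (illustrative, using the Gaussians of the DDPM forward process): sampling $x_t\sim p(x_t|x_0)$ and then $x_{t-1}\sim p(x_{t-1}|x_t,x_0)$ with the mean and variance of Eq. (3) must reproduce the marginal $p(x_{t-1}|x_0)$.

```python
import numpy as np

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 50)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)
t, x0, N = 30, 1.5, 200_000
ab_t, ab_prev = alpha_bar[t - 1], alpha_bar[t - 2]

# x_t ~ p(x_t | x_0)
x_t = np.sqrt(ab_t) * x0 + np.sqrt(1 - ab_t) * rng.standard_normal(N)

# x_{t-1} ~ p(x_{t-1} | x_t, x_0), with the posterior mean/variance of Eq. (3)
mean = (np.sqrt(alphas[t - 1]) * (1 - ab_prev) / (1 - ab_t) * x_t
        + np.sqrt(ab_prev) * betas[t - 1] / (1 - ab_t) * x0)
var = (1 - ab_prev) / (1 - ab_t) * betas[t - 1]
x_prev = mean + np.sqrt(var) * rng.standard_normal(N)

# Must match the marginal p(x_{t-1} | x_0) = N(sqrt(ab_prev) * x0, 1 - ab_prev)
print(x_prev.mean(), np.sqrt(ab_prev) * x0)
print(x_prev.var(), 1 - ab_prev)
```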
where $p(x_t|x_0)$ and $p(x_{t-1}|x_0)$ are both Gaussian, so we may posit that $p(x_{t-1}|x_t,x_0)$ is also Gaussian, with mean a linear combination of $x_t$ and $x_0$.
More generally, consider any two indices $x_s, x_t$ with $s<t$, and posit $x_s = m_{s|t} x_t+n_{s|t}x_0+\sigma_{s|t}\varepsilon_1$. Since $q(x_s|x_0)$ and $q(x_t|x_0)$ are known, we obtain the system
$$\left\{\begin{matrix} x_s = m_{s|t} x_t+n_{s|t}x_0+\sigma_{s|t}\varepsilon_1 \\ x_t = \sqrt{\overline\alpha_t} x_0+\sqrt{1-\overline\alpha_t}\,\varepsilon_2 \\ x_s = \sqrt{\overline\alpha_s} x_0+\sqrt{1-\overline\alpha_s}\,\varepsilon_3 \end{matrix}\right.$$
Matching means and variances yields a pair of simultaneous equations in $m_{s|t}$ and $n_{s|t}$:
$$\left\{\begin{matrix} m_{s|t}\sqrt{\overline\alpha_t} + n_{s|t} = \sqrt{\overline\alpha_s} \\ m_{s|t}^2(1-\overline\alpha_t) + \sigma^2_{s|t} = 1-\overline\alpha_s \end{matrix}\right.$$
Solving gives
$$\left\{\begin{matrix} m_{s|t} = \sqrt{\dfrac{1-\overline\alpha_s-\sigma_{s|t}^2}{1-\overline\alpha_t}} \\ n_{s|t} = \sqrt{\overline\alpha_{s}} - \sqrt{\dfrac{\overline\alpha_t}{1-\overline\alpha_t}(1-\overline\alpha_s-\sigma_{s|t}^2)} \end{matrix}\right.$$
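A quick numeric check (arbitrary illustrative values) that this solution satisfies both equations of the system:

```python
import numpy as np

ab_t, ab_s, sigma2 = 0.3, 0.7, 0.01  # arbitrary alpha_bar_t, alpha_bar_s, sigma^2

m = np.sqrt((1 - ab_s - sigma2) / (1 - ab_t))
n = np.sqrt(ab_s) - np.sqrt(ab_t / (1 - ab_t) * (1 - ab_s - sigma2))

assert np.isclose(m * np.sqrt(ab_t) + n, np.sqrt(ab_s))    # mean equation
assert np.isclose(m**2 * (1 - ab_t) + sigma2, 1 - ab_s)    # variance equation
```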
Substituting back into the original expression gives the sampling formula for arbitrary indices $s<t$:
$$x_s = \sqrt{\overline\alpha_s}\,x_0+\sqrt{1-\overline\alpha_{s}-\sigma_{s|t}^2}\,\frac{x_t-\sqrt{\overline\alpha_t}x_0}{\sqrt{1-\overline\alpha_t}}+\sigma_{s|t}\,\varepsilon$$
Note that DDIM never uses $q(x_t|x_s)$, so it is a more general formulation than DDPM. Here $x_0$ is the estimate obtained from $x_t$, $\epsilon_\theta(x_t,t)$, and Eq. (1), while $\frac{x_t-\sqrt{\overline\alpha_t}x_0}{\sqrt{1-\overline\alpha_t}}$ corresponds to $\epsilon_\theta(x_t,t)$, i.e.
$$x_s = \sqrt{\frac{\overline\alpha_s}{\overline\alpha_t}}\left(x_t-\sqrt{1-\overline\alpha_t}\,\epsilon_\theta(x_t,t)\right)+\sqrt{1-\overline\alpha_{s}-\sigma_{s|t}^2}\,\epsilon_\theta(x_t,t)+\sigma_{s|t}\,\varepsilon$$
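One DDIM update from $x_t$ to $x_s$ can be sketched as follows (illustrative; `eps_pred` is a placeholder for $\epsilon_\theta(x_t,t)$, and setting $\sigma_{s|t}=0$ gives the deterministic DDIM sampler):

```python
import numpy as np

def ddim_step(x_t, eps_pred, ab_t, ab_s, sigma, rng):
    """Jump from x_t to x_s (s < t) given eps_pred ~ eps_theta(x_t, t)."""
    x0_hat = (x_t - np.sqrt(1 - ab_t) * eps_pred) / np.sqrt(ab_t)  # invert Eq. (1)
    x_s = np.sqrt(ab_s) * x0_hat + np.sqrt(1 - ab_s - sigma**2) * eps_pred
    if sigma > 0:
        x_s = x_s + sigma * rng.standard_normal(np.shape(x_t))
    return x_s

rng = np.random.default_rng(0)
x_t = rng.standard_normal(4)
eps_pred = np.zeros(4)  # placeholder for the trained network
x_s = ddim_step(x_t, eps_pred, ab_t=0.3, ab_s=0.7, sigma=0.0, rng=rng)
```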
References
Denoising Diffusion Probabilistic Models
Denoising Diffusion Implicit Models
Probabilistic Diffusion Model概率扩散模型理论
扩散模型 Diffusion Model