The basic idea of the Diffusion Model (DDPM)
A latent-variable model starts from the basic formulation:
$$p(x)=\int p(x|z)\, p(z)\, \mathrm{d}z$$
The improvement VAE makes is to raise the sampling efficiency over $p(z)$:
- The encoder uses a parameterized posterior $q_\theta(z|x)$ that directly predicts a mean $\mu$ and covariance $\Sigma$, so that $q_\theta(z|x) = \mathcal{N}(\mu,\Sigma)$.
- The decoder takes a sample $z$ drawn from $q_\theta(z|x)$ and passes it through the parameterized $p_\theta(x|z)$ to produce the final $\hat{x}$.
The overall idea of the diffusion model is similar, but in a diffusion model the process corresponding to $q_\theta(z|x)$ is not obtained in a single step; it is composed of $T$ successive transformations, i.e., in the encoder:
$$q_\theta(z,x_{1:T}|x)=q_\theta(z|x_{T-1})\cdot q_\theta(x_{T-1}|x_{T-2})\cdots q_\theta(x_{t}|x_{t-1}) \cdots q_\theta(x_2|x_1) \cdot q_\theta(x_1|x)$$
Similarly, in the decoder:
$$p_\theta(x,x_{1:T}|z)=p_\theta(x|x_1)\cdot p_\theta(x_1|x_2)\cdots p_\theta(x_{t-1}|x_t) \cdots p_\theta(x_{T-2}|x_{T-1})\cdot p_\theta(x_{T-1}|z)$$
(This Markov-chain-style process of gradually noising the distribution was, to some extent, inspired by non-equilibrium thermodynamics.)
The diagram below illustrates it (the notation differs slightly: $x$ becomes $x_0$ and $z$ becomes $x_T$):
- From right to left is the encoder process, a parameter-free $q(x_t|x_{t-1})$: a purely hand-designed procedure (for example, starting from the original clean image and applying a Gaussian perturbation at every step).
- From left to right is the decoder process, a parameterized $p_\theta(x_{t-1}|x_t)$. Unlike VAE, it does not predict $\hat{x}$ directly; it predicts the Gaussian noise and gradually subtracts it during decoding to recover a clean image.
- In a diffusion model the latent variable $z$ ($x_T$ in the figure) has the same dimensionality as the original image (yet it remains simple in theory, since it is essentially Gaussian noise).
The encoder process of the Diffusion Model
First, define an increasing sequence of constants $\{\beta_t\}$ satisfying
$$0<\beta_1 < \beta_2 < \cdots < \beta_T<1$$
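As a concrete illustration (a minimal sketch of my own, not from the original text), a common choice is a linear schedule; the endpoints $\beta_1=10^{-4}$, $\beta_T=0.02$ and $T=1000$ are typical DDPM values and are assumptions here:

```python
import numpy as np

# Assumed linear beta schedule; T and the endpoint values are typical DDPM choices.
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # beta_1 ... beta_T

# Sanity check: strictly increasing and inside (0, 1), as required above.
assert np.all((betas > 0) & (betas < 1)) and np.all(np.diff(betas) > 0)
```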
Define the observed original image as the random variable $x_0$, and define the distribution relating the random variable $x_{t-1}$ to the random variable $x_t$ as
$$q(x_t|x_{t-1})=\mathcal{N}(x_t;\sqrt{1-\beta_t}\,x_{t-1},\ \beta_t)\tag{1}$$
That is, $x_t$ is Gaussian with mean $\sqrt{1-\beta_t}\,x_{t-1}$ and variance $\beta_t$. Using a reparameterization-style decomposition, we can write
$$x_t=\sqrt{1-\beta_t}\,x_{t-1}+\sqrt{\beta_t}\,\epsilon_{t-1},\qquad \epsilon_{t-1}\sim\mathcal{N}(0,1)\tag{2}$$
Some remarks on equation (2):
- The two coefficients in front of $x_{t-1}$ and $\epsilon_{t-1}$ have squares that sum to 1.
- $\beta_t$ is monotonically increasing with $\beta_t \in (0, 1)$, which guarantees that the injected variance is nearly 0 when $t=0$ and nearly 1 when $t=T$.
If we define
- $\alpha_t=1-\beta_t$ (for notational convenience),
- $\bar{\alpha}_t=\prod_{i=1}^t \alpha_i$ (for notational convenience),
- $\bar{\epsilon}_k \sim \mathcal{N}(0, 1)$, denoting the single Gaussian obtained by merging $k$ Gaussian noise terms,
then unrolling the recursion gives
$$\begin{aligned} x_t&=\sqrt{\alpha_t}\, x_{t-1} + \sqrt{1-\alpha_t}\, \epsilon_{t-1} &&(3)\\ &=\sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{1-\alpha_t \alpha_{t-1}}\, \bar{\epsilon}_2 &&(4)\\ &= \cdots &&(5)\\ &=\sqrt{\bar{\alpha}_t}\,x_0 +\sqrt{1-\bar{\alpha}_t}\,\bar{\epsilon}_t &&(6)\end{aligned}$$
Note:
- This uses the property that the sum of two independent Gaussians is again Gaussian, and the new variance is the sum of the two variances.
- $\bar{\epsilon}_k$ is the single Gaussian obtained by merging $k$ Gaussian noise terms.
- The squares of the coefficients in front of $x_0$ and $\bar{\epsilon}_t$ still sum to 1.
- (These properties fit together remarkably neatly; it is not obvious how one would come up with this construction in the first place.)
Looking at $x_t$ in equation (6), we can observe:
- As $t \rightarrow T$, $\sqrt{\bar{\alpha}_t} \rightarrow 0$ and $\sqrt{1-\bar{\alpha}_t} \rightarrow 1$, so $x_t \rightarrow \mathcal{N}(0, 1)$, gradually becoming a standard Gaussian; in the extreme case $x_T \sim \mathcal{N}(0, 1)$.
- Not only can $q(x_t|x_{t-1})=\mathcal{N}(\sqrt{1-\beta_t}x_{t-1}, \beta_t)$ be computed directly, but $q(x_t|x_0)=\mathcal{N}(\sqrt{\bar{\alpha}_t}x_0, 1-\bar{\alpha}_t)$ can also be computed in closed form.
- The entire encoder process is completely transparent: any intermediate distribution $q(x_t|x_0)$ can be computed efficiently (see the sketch after this list).
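To make the closed form concrete, here is a minimal sketch (my own, not from the original post) of drawing $x_t$ directly from $x_0$ via $x_t=\sqrt{\bar{\alpha}_t}\,x_0+\sqrt{1-\bar{\alpha}_t}\,\bar{\epsilon}_t$; the schedule values and image shape are assumptions:

```python
import numpy as np

# Assumed linear schedule, as sketched earlier.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)            # \bar{alpha}_t for t = 1..T (index t-1)

def q_sample(x0, t, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot using the closed form (6)."""
    eps = rng.standard_normal(x0.shape)          # \bar{epsilon}_t ~ N(0, I)
    ab = alpha_bars[t - 1]
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps

rng = np.random.default_rng(0)
x0 = rng.uniform(-1, 1, size=(32, 32, 3))        # a fake "image" scaled to [-1, 1]
x_500 = q_sample(x0, t=500, rng=rng)             # a heavily noised version of x0
```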
The decoder process of the Diffusion Model
The diffusion process turns data into noise, so the reverse process is a denoising process. If we could compute the true reverse distribution of every step, $q(\mathbf{x}_{t-1} \vert \mathbf{x}_t)$, then starting from random noise $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and denoising step by step would produce a realistic sample; the reverse process is therefore exactly the generative process.
(As an analogy: the idea behind a diffusion model is like teaching a model to build a house. We take a finished building and, in front of the model, smash it into rubble one small hammer blow at a time; each blow is tiny and simple, so if the model learns to undo every single blow, it can restore the pile of rubble back into a building. Then, handed any pile of rubble, the model can build a house from it.)
The original paper argues that for each encoder step $q(x_t|x_{t-1})=\mathcal{N}(x_t;\sqrt{1-\beta_t}\,x_{t-1},\beta_t\mathbf{I})$, when $\beta_t$ is small enough the reverse distribution $q(\mathbf{x}_{t-1} \vert \mathbf{x}_t)$ can also be approximated as a Gaussian. (The rigorous proof appears to be fairly involved and is set aside here.)
We can therefore make the following assumption and use a neural network to compute its parameters:
$$p_\theta(\mathbf{x}_{0:T}) = p(\mathbf{x}_T) \prod^T_{t=1} p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t), \qquad p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t))$$
Here $p(\mathbf{x}_T)= \mathcal{N}(\mathbf{0}, \mathbf{I})$, and $p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t)$ is a parameterized Gaussian whose mean and covariance are given by trained networks $\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)$ and $\boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)$. Training a diffusion model is, in essence, training these networks. We will use these network-parameterized distributions to estimate $q(x_{t-1}|x_t)$.
The actual distribution $q(x_{t-1}|x_t)$ is intractable, but the posterior conditioned on $x_0$, namely $q(x_{t-1}|x_t,x_0)$, is tractable, so we may posit:
$$q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_{t-1}; \tilde{\boldsymbol{\mu}}(\mathbf{x}_t, \mathbf{x}_0), \tilde{\beta}_t \mathbf{I})$$
We now derive its parameters. First, by Bayes' rule:
$$q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0) = q(\mathbf{x}_t \vert \mathbf{x}_{t-1}, \mathbf{x}_0) \frac{ q(\mathbf{x}_{t-1} \vert \mathbf{x}_0) }{ q(\mathbf{x}_t \vert \mathbf{x}_0) }$$
Moreover, since the encoder process has the Markov property, we have
$$q(x_t|x_{t-1}, x_0) = q(x_t|x_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1 - \beta_t}\, \mathbf{x}_{t-1}, \beta_t\mathbf{I}),$$
and from the earlier derivation of the encoder process we know:
$$q(\mathbf{x}_{t-1} \vert \mathbf{x}_0)=\mathcal{N}(\mathbf{x}_{t-1}; \sqrt{\bar{\alpha}_{t-1}} \mathbf{x}_0, (1 - \bar{\alpha}_{t-1})\mathbf{I}), \qquad q(\mathbf{x}_t \vert \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t} \mathbf{x}_0, (1 - \bar{\alpha}_t)\mathbf{I})$$
(From this alone it is already clear that $q(x_{t-1}|x_t,x_0)$ really is a genuine Gaussian.) Substituting these into the expression above gives:
$$\begin{aligned} q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0) &= q(\mathbf{x}_t \vert \mathbf{x}_{t-1}, \mathbf{x}_0) \frac{ q(\mathbf{x}_{t-1} \vert \mathbf{x}_0) }{ q(\mathbf{x}_t \vert \mathbf{x}_0) } \\ &\propto \exp \Big(-\frac{1}{2} \big(\frac{(\mathbf{x}_t - \sqrt{\alpha_t} \mathbf{x}_{t-1})^2}{\beta_t} + \frac{(\mathbf{x}_{t-1} - \sqrt{\bar{\alpha}_{t-1}} \mathbf{x}_0)^2}{1-\bar{\alpha}_{t-1}} - \frac{(\mathbf{x}_t - \sqrt{\bar{\alpha}_t} \mathbf{x}_0)^2}{1-\bar{\alpha}_t} \big) \Big) \\ &= \exp \Big(-\frac{1}{2} \big(\frac{\mathbf{x}_t^2 - 2\sqrt{\alpha_t} \mathbf{x}_t \mathbf{x}_{t-1} + \alpha_t \mathbf{x}_{t-1}^2 }{\beta_t} + \frac{ \mathbf{x}_{t-1}^2 - 2 \sqrt{\bar{\alpha}_{t-1}} \mathbf{x}_0 \mathbf{x}_{t-1} + \bar{\alpha}_{t-1} \mathbf{x}_0^2 }{1-\bar{\alpha}_{t-1}} - \frac{(\mathbf{x}_t - \sqrt{\bar{\alpha}_t} \mathbf{x}_0)^2}{1-\bar{\alpha}_t} \big) \Big) \\ &= \exp\Big( -\frac{1}{2} \big( (\frac{\alpha_t}{\beta_t} + \frac{1}{1 - \bar{\alpha}_{t-1}}) \mathbf{x}_{t-1}^2 - (\frac{2\sqrt{\alpha_t}}{\beta_t} \mathbf{x}_t + \frac{2\sqrt{\bar{\alpha}_{t-1}}}{1 - \bar{\alpha}_{t-1}} \mathbf{x}_0) \mathbf{x}_{t-1} + C(\mathbf{x}_t, \mathbf{x}_0) \big) \Big) \end{aligned}$$
Here $C(\mathbf{x}_t, \mathbf{x}_0)$ is a term that does not involve $\mathbf{x}_{t-1}$, so it is omitted. Using the definition of the Gaussian probability density and the result above, we can obtain the mean and variance of the posterior $q(\mathbf{x}_{t-1} \vert \mathbf{x}_{t}, \mathbf{x}_0)$ by matching coefficients:
$$\begin{aligned} \tilde{\beta}_t &= 1/(\frac{\alpha_t}{\beta_t} + \frac{1}{1 - \bar{\alpha}_{t-1}}) = 1/(\frac{\alpha_t - \bar{\alpha}_t + \beta_t}{\beta_t(1 - \bar{\alpha}_{t-1})}) = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \cdot \beta_t \\ \tilde{\boldsymbol{\mu}}_t (\mathbf{x}_t, \mathbf{x}_0) &= (\frac{\sqrt{\alpha_t}}{\beta_t} \mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1} }}{1 - \bar{\alpha}_{t-1}} \mathbf{x}_0)/(\frac{\alpha_t}{\beta_t} + \frac{1}{1 - \bar{\alpha}_{t-1}}) \\ &= (\frac{\sqrt{\alpha_t}}{\beta_t} \mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1} }}{1 - \bar{\alpha}_{t-1}} \mathbf{x}_0) \cdot \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \cdot \beta_t \\ &= \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1 - \bar{\alpha}_t} \mathbf{x}_0 \end{aligned}$$
Note that the variance is a fixed quantity (the diffusion-process parameters are fixed), while the mean is a function that depends on $\mathbf{x}_0$ and $\mathbf{x}_t$. This distribution will be used to derive the diffusion model's optimization objective.
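As a quick illustration, here is a minimal sketch (my own; the schedule values are assumptions) that evaluates the posterior parameters $\tilde{\beta}_t$ and $\tilde{\boldsymbol{\mu}}_t$ derived above:

```python
import numpy as np

# Assumed linear schedule, as before.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def posterior_params(x0, xt, t):
    """Mean and variance of q(x_{t-1} | x_t, x_0) for a step t >= 2."""
    a_t, beta_t = alphas[t - 1], betas[t - 1]
    ab_t, ab_prev = alpha_bars[t - 1], alpha_bars[t - 2]
    beta_tilde = (1.0 - ab_prev) / (1.0 - ab_t) * beta_t
    mu_tilde = (np.sqrt(a_t) * (1.0 - ab_prev) / (1.0 - ab_t) * xt
                + np.sqrt(ab_prev) * beta_t / (1.0 - ab_t) * x0)
    return mu_tilde, beta_tilde
```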
The optimization objective of the Diffusion Model
The previous sections described the diffusion (forward) process and the reverse process. We can now look at the diffusion model from another angle: if the intermediate variables are treated as latents, the diffusion model is a latent variable model with $T$ latents, and it can be viewed as a special kind of hierarchical VAE:
Compared with a VAE, the latents of a diffusion model have the same dimensionality as the original data, and the encoder (the diffusion process) is fixed. Since the diffusion model is a latent variable model, we can use variational inference to obtain the variational lower bound (VLB, also called the ELBO) as the objective to maximize:
$$\begin{aligned} \log p_\theta(\mathbf{x}_0) &=\log\int p_\theta(\mathbf{x}_{0:T})\, d\mathbf{x}_{1:T}\\ &=\log\int \frac{p_\theta(\mathbf{x}_{0:T})\, q(\mathbf{x}_{1:T}\vert \mathbf{x}_{0})}{q(\mathbf{x}_{1:T}\vert \mathbf{x}_{0})}\, d\mathbf{x}_{1:T}\\ &\geq \mathbb{E}_{q(\mathbf{x}_{1:T}\vert \mathbf{x}_{0})}\Big[\log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T}\vert \mathbf{x}_{0})}\Big] \end{aligned}$$
(The last step uses Jensen's inequality.)
For network training, the training objective is the negative of the VLB:
$$L=-L_{\text{VLB}}=\mathbb{E}_{q(\mathbf{x}_{1:T}\vert \mathbf{x}_{0})}\Big[-\log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T}\vert \mathbf{x}_{0})}\Big]=\mathbb{E}_{q(\mathbf{x}_{1:T}\vert \mathbf{x}_{0})}\Big[\log \frac{q(\mathbf{x}_{1:T}\vert \mathbf{x}_{0})}{p_\theta(\mathbf{x}_{0:T})}\Big]$$
Decomposing this further, we obtain:
$$\begin{aligned} L &= \mathbb{E}_{q(\mathbf{x}_{1:T}\vert \mathbf{x}_{0})} \Big[ \log\frac{q(\mathbf{x}_{1:T}\vert\mathbf{x}_0)}{p_\theta(\mathbf{x}_{0:T})} \Big] \\ &= \mathbb{E}_{q(\mathbf{x}_{1:T}\vert \mathbf{x}_{0})} \Big[ \log\frac{\prod_{t=1}^T q(\mathbf{x}_t\vert\mathbf{x}_{t-1})}{ p_\theta(\mathbf{x}_T) \prod_{t=1}^T p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t) } \Big] \\ &= \mathbb{E}_{q(\mathbf{x}_{1:T}\vert \mathbf{x}_{0})} \Big[ -\log p_\theta(\mathbf{x}_T) + \sum_{t=1}^T \log \frac{q(\mathbf{x}_t\vert\mathbf{x}_{t-1})}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)} \Big] \\ &= \mathbb{E}_{q(\mathbf{x}_{1:T}\vert \mathbf{x}_{0})} \Big[ -\log p_\theta(\mathbf{x}_T) + \sum_{t=2}^T \log \frac{q(\mathbf{x}_t\vert\mathbf{x}_{t-1})}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)} + \log\frac{q(\mathbf{x}_1 \vert \mathbf{x}_0)}{p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1)} \Big] \\ &= \mathbb{E}_{q(\mathbf{x}_{1:T}\vert \mathbf{x}_{0})} \Big[ -\log p_\theta(\mathbf{x}_T) + \sum_{t=2}^T \log \frac{q(\mathbf{x}_t\vert\mathbf{x}_{t-1}, \mathbf{x}_{0})}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)} + \log\frac{q(\mathbf{x}_1 \vert \mathbf{x}_0)}{p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1)} \Big] & \text{ ;use } q(\mathbf{x}_t \vert \mathbf{x}_{t-1}, \mathbf{x}_0)=q(\mathbf{x}_t \vert \mathbf{x}_{t-1})\\ &= \mathbb{E}_{q(\mathbf{x}_{1:T}\vert \mathbf{x}_{0})} \Big[ -\log p_\theta(\mathbf{x}_T) + \sum_{t=2}^T \log \Big( \frac{q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)}\cdot \frac{q(\mathbf{x}_t \vert \mathbf{x}_0)}{q(\mathbf{x}_{t-1}\vert\mathbf{x}_0)} \Big) + \log \frac{q(\mathbf{x}_1 \vert \mathbf{x}_0)}{p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1)} \Big] & \text{ ;use Bayes' Rule }\\ &= \mathbb{E}_{q(\mathbf{x}_{1:T}\vert \mathbf{x}_{0})} \Big[ -\log p_\theta(\mathbf{x}_T) + \sum_{t=2}^T \log \frac{q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)} + \sum_{t=2}^T \log \frac{q(\mathbf{x}_t \vert \mathbf{x}_0)}{q(\mathbf{x}_{t-1} \vert \mathbf{x}_0)} + \log\frac{q(\mathbf{x}_1 \vert \mathbf{x}_0)}{p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1)} \Big] \\ &= \mathbb{E}_{q(\mathbf{x}_{1:T}\vert \mathbf{x}_{0})} \Big[ -\log p_\theta(\mathbf{x}_T) + \sum_{t=2}^T \log \frac{q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)} + \log\frac{q(\mathbf{x}_T \vert \mathbf{x}_0)}{q(\mathbf{x}_1 \vert \mathbf{x}_0)} + \log \frac{q(\mathbf{x}_1 \vert \mathbf{x}_0)}{p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1)} \Big]\\ &= \mathbb{E}_{q(\mathbf{x}_{1:T}\vert \mathbf{x}_{0})} \Big[ \log\frac{q(\mathbf{x}_T \vert \mathbf{x}_0)}{p_\theta(\mathbf{x}_T)} + \sum_{t=2}^T \log \frac{q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)} - \log p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1) \Big] \\ &= \mathbb{E}_{q(\mathbf{x}_{T}\vert \mathbf{x}_{0})}\Big[\log\frac{q(\mathbf{x}_T \vert \mathbf{x}_0)}{p_\theta(\mathbf{x}_T)}\Big]+\sum_{t=2}^T \mathbb{E}_{q(\mathbf{x}_{t}, \mathbf{x}_{t-1}\vert \mathbf{x}_{0})}\Big[\log \frac{q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)}\Big] - \mathbb{E}_{q(\mathbf{x}_{1}\vert \mathbf{x}_{0})}\Big[\log p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1)\Big] \\ &= \mathbb{E}_{q(\mathbf{x}_{T}\vert \mathbf{x}_{0})}\Big[\log\frac{q(\mathbf{x}_T \vert \mathbf{x}_0)}{p_\theta(\mathbf{x}_T)}\Big]+\sum_{t=2}^T \mathbb{E}_{q(\mathbf{x}_{t}\vert \mathbf{x}_{0})}\Big[\int q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)\log \frac{q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)}\,\mathrm{d}\mathbf{x}_{t-1}\Big] - \mathbb{E}_{q(\mathbf{x}_{1}\vert \mathbf{x}_{0})}\Big[\log p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1)\Big] \\ &= \underbrace{D_\text{KL}(q(\mathbf{x}_T \vert \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_T))}_{L_T} + \sum_{t=2}^T \underbrace{\mathbb{E}_{q(\mathbf{x}_{t}\vert \mathbf{x}_{0})}\Big[D_\text{KL}(q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t))\Big]}_{L_{t-1}} -\underbrace{\mathbb{E}_{q(\mathbf{x}_{1}\vert \mathbf{x}_{0})}\log p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1)}_{L_0} \end{aligned}$$
The final objective thus contains $T+1$ terms. $L_0$ can be viewed as reconstruction of the original data, optimizing a negative log-likelihood; it can be computed by building a dedicated discretized decoder from the estimated $\mathcal{N}(\mathbf{x}_0; \boldsymbol{\mu}_\theta(\mathbf{x}_1, 1), \boldsymbol{\Sigma}_\theta(\mathbf{x}_1, 1))$:
$$p_{\theta}(\mathbf{x}_0\vert\mathbf{x}_1)=\prod^D_{i=1}\int ^{\delta_+(x_0^i)}_{\delta_-(x_0^i)}\mathcal{N}(x; \mu^i_\theta(x_1, 1), \Sigma^i_\theta(x_1, 1))\,\mathrm{d}x$$
$$\delta_+(x)= \begin{cases} \infty& \text{ if } x=1 \\ x+\frac{1}{255}& \text{ if } x <1 \end{cases} \qquad \delta_-(x)= \begin{cases} -\infty& \text{ if } x=-1 \\ x-\frac{1}{255}& \text{ if } x >-1 \end{cases}$$
(Here $i$ indexes the pixels.) This amounts to computing the probability of roughly generating all the pixels of $\mathbf{x}_0$ from $\mathbf{x}_1$.
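For illustration only, here is a minimal sketch (my own; the per-pixel `mean` and `std` arrays stand in for whatever the network predicts from $\mathbf{x}_1$) of the per-pixel integral as a difference of Gaussian CDFs:

```python
import numpy as np
from math import erf, sqrt

# Standard-normal CDF applied elementwise (math.erf handles +/- infinity).
_cdf = np.vectorize(lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0))))

def discretized_log_likelihood(x0, mean, std):
    """Sketch of log p_theta(x0 | x1) for images scaled to [-1, 1] with 256 levels.
    `mean`/`std` are the per-pixel Gaussian parameters predicted from x1."""
    upper = np.where(x0 >= 1.0, np.inf, x0 + 1.0 / 255.0)    # delta_plus(x0)
    lower = np.where(x0 <= -1.0, -np.inf, x0 - 1.0 / 255.0)  # delta_minus(x0)
    probs = _cdf((upper - mean) / std) - _cdf((lower - mean) / std)
    return np.log(np.clip(probs, 1e-12, None)).sum()         # sum of logs over the D pixels
```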
$L_T$ is the KL divergence between the distribution of the final noise and the prior. It contains no trainable parameters and is approximately 0, because the prior is $p(\mathbf{x}_T)=\mathcal{N}(\mathbf{0}, \mathbf{I})$ and the random noise obtained at the end of the diffusion process, $q(\mathbf{x}_{T}\vert \mathbf{x}_{0})$, is also approximately $\mathcal{N}(\mathbf{0}, \mathbf{I})$. $L_{t-1}$ is the KL divergence between the estimated distribution $p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)$ and the true posterior $q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)$; here we want our estimated denoising step to match, as closely as possible, the denoising step that relies on the true data.
DDPM simplifies $p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)$ one step further by adopting a fixed covariance $\boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)= \sigma_t^2\mathbf{I}$, where $\sigma_t^2$ can be set to $\beta_t$ or $\tilde{\beta}_t$ (these are two extremes, an upper and a lower bound respectively; a learnable variance is also possible, see the papers Improved Denoising Diffusion Probabilistic Models and Analytic-DPM: an Analytic Estimate of the Optimal Reverse Variance in Diffusion Probabilistic Models). Assuming $\sigma_t^2=\tilde{\beta}_t$, we have:
$$q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)=\mathcal{N}(\mathbf{x}_{t-1}; {\tilde{\boldsymbol{\mu}}} (\mathbf{x}_t, \mathbf{x}_0), {\sigma_t^2} \mathbf{I}) \qquad p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), {\sigma_t^2} \mathbf{I})$$
For the KL divergence between two Gaussian distributions, the formula is (for the detailed derivation see the post 生成模型之VAE):
$$\text{KL}(p_1||p_2) = \frac{1}{2}\Big(\text{tr}(\boldsymbol{\Sigma}_2^{-1}\boldsymbol{\Sigma}_1)+(\boldsymbol{\mu}_2-\boldsymbol{\mu}_1)^{\top}\boldsymbol{\Sigma}_2^{-1}(\boldsymbol{\mu}_2-\boldsymbol{\mu}_1)-n+\log\frac{\det(\boldsymbol{\Sigma}_2)}{\det(\boldsymbol{\Sigma}_1)}\Big)$$
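A small numeric sketch of this formula (my own helper, not from the post):

```python
import numpy as np

def gaussian_kl(mu1, sigma1, mu2, sigma2):
    """KL(N(mu1, Sigma1) || N(mu2, Sigma2)) for full covariance matrices,
    following the formula above."""
    n = mu1.shape[0]
    sigma2_inv = np.linalg.inv(sigma2)
    diff = mu2 - mu1
    return 0.5 * (np.trace(sigma2_inv @ sigma1)
                  + diff @ sigma2_inv @ diff
                  - n
                  + np.log(np.linalg.det(sigma2) / np.linalg.det(sigma1)))

# With two equal isotropic covariances sigma_t^2 * I, this reduces to
# ||mu2 - mu1||^2 / (2 * sigma_t^2), which is exactly the derivation that follows.
```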
Applying this formula here gives:
$$\begin{aligned} D_\text{KL}(q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)\parallel p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t)) &=D_\text{KL}(\mathcal{N}(\mathbf{x}_{t-1}; {\tilde{\boldsymbol{\mu}}} (\mathbf{x}_t, \mathbf{x}_0), {\sigma_t^2} \mathbf{I}) \parallel \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), {\sigma_t^2} \mathbf{I})) \\ &=\frac{1}{2}(n+\frac{1}{{\sigma_t^2}}\|\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) - {\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)} \|^2 -n+\log1) \\ &=\frac{1}{2{\sigma_t^2}}\|\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) - {\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)} \|^2 \end{aligned}$$
The optimization objective $L_{t-1}$ is then:
$$L_{t-1}=\mathbb{E}_{q(\mathbf{x}_{t}\vert \mathbf{x}_{0})}\Big[ \frac{1}{2{\sigma_t^2}}\|\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) - {\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)} \|^2\Big]$$
From this formula, we are in effect asking the mean learned by the network, $\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)$, to agree as closely as possible with the posterior mean $\tilde{\boldsymbol{\mu}}(\mathbf{x}_t, \mathbf{x}_0)$. However, DDPM found that predicting the mean is not the best choice. From the property of the diffusion process derived earlier, we have:
$$\mathbf{x_t}(\mathbf{x_0},\mathbf{\epsilon})=\sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\mathbf{\epsilon} \quad \text{ where } \mathbf{\epsilon}\sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$
Substituting this into the objective above (note that the loss now also takes the expectation over $\mathbf{x}_0$), we obtain:
$$\begin{aligned} L_{t-1}&=\mathbb{E}_{\mathbf{x}_{0}}\Big(\mathbb{E}_{q(\mathbf{x}_{t}\vert \mathbf{x}_{0})}\Big[ \frac{1}{2{\sigma_t^2}}\|\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) - {\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)} \|^2\Big]\Big) \\ &=\mathbb{E}_{\mathbf{x}_{0},\mathbf{\epsilon}\sim \mathcal{N}(\mathbf{0}, \mathbf{I})}\Big[ \frac{1}{2{\sigma_t^2}}\|\tilde{\boldsymbol{\mu}}_t\Big(\mathbf{x_t}(\mathbf{x_0},\mathbf{\epsilon}), \frac{1}{\sqrt{\bar \alpha_t}} \big(\mathbf{x_t}(\mathbf{x_0},\mathbf{\epsilon}) - \sqrt{1 - \bar{\alpha}_t} \mathbf{\epsilon} \big)\Big ) - {\boldsymbol{\mu}_\theta(\mathbf{x_t}(\mathbf{x_0},\mathbf{\epsilon}), t)} \|^2\Big] \\ &=\mathbb{E}_{\mathbf{x}_{0},\mathbf{\epsilon}\sim \mathcal{N}(\mathbf{0}, \mathbf{I})}\Big[ \frac{1}{2{\sigma_t^2}}\|\Big (\frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \mathbf{x}_t(\mathbf{x_0},\mathbf{\epsilon}) + \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1 - \bar{\alpha}_t} \frac{1}{\sqrt{\bar \alpha_t}} \big(\mathbf{x_t}(\mathbf{x_0},\mathbf{\epsilon}) - \sqrt{1 - \bar{\alpha}_t} \mathbf{\epsilon} \big) \Big) - {\boldsymbol{\mu}_\theta(\mathbf{x_t}(\mathbf{x_0},\mathbf{\epsilon}), t)} \|^2\Big] \\ &=\mathbb{E}_{\mathbf{x}_{0},\mathbf{\epsilon}\sim \mathcal{N}(\mathbf{0}, \mathbf{I})}\Big[ \frac{1}{2{\sigma_t^2}}\|\frac{1}{\sqrt{\alpha_t}}\Big( \mathbf{x}_t(\mathbf{x_0},\mathbf{\epsilon}) - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\mathbf{\epsilon}\Big) - {\boldsymbol{\mu}_\theta(\mathbf{x_t}(\mathbf{x_0},\mathbf{\epsilon}), t)} \|^2\Big] \end{aligned}$$
Going one step further, we also reparameterize $\boldsymbol{\mu}_\theta(\mathbf{x_t}(\mathbf{x_0},\mathbf{\epsilon}), t)$ as:
$$\boldsymbol{\mu}_\theta(\mathbf{x_t}(\mathbf{x_0},\mathbf{\epsilon}), t)=\frac{1}{\sqrt{\alpha_t}}\Big( \mathbf{x}_t(\mathbf{x_0},\mathbf{\epsilon}) - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\mathbf{\epsilon}_\theta\big(\mathbf{x}_t(\mathbf{x_0},\mathbf{\epsilon}), t\big)\Big)$$
Here $\mathbf{\epsilon}_\theta$ is a neural-network-based function approximator; this means we switch from predicting the mean to predicting the noise $\mathbf{\epsilon}$. Substituting this equality into the optimization objective, we obtain:
$$\begin{aligned} L_{t-1}&=\mathbb{E}_{\mathbf{x}_{0},\mathbf{\epsilon}\sim \mathcal{N}(\mathbf{0}, \mathbf{I})}\Big[ \frac{1}{2{\sigma_t^2}}\|\frac{1}{\sqrt{\alpha_t}}\Big( \mathbf{x}_t(\mathbf{x_0},\mathbf{\epsilon}) - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\mathbf{\epsilon}\Big) - {\boldsymbol{\mu}_\theta(\mathbf{x_t}(\mathbf{x_0},\mathbf{\epsilon}), t)} \|^2\Big] \\ &= \mathbb{E}_{\mathbf{x}_{0},\mathbf{\epsilon}\sim \mathcal{N}(\mathbf{0}, \mathbf{I})}\Big[ \frac{\beta_t^2}{2{\sigma_t^2}\alpha_t(1-\bar{\alpha}_t)}\| \mathbf{\epsilon}- \mathbf{\epsilon}_\theta\big(\mathbf{x}_t(\mathbf{x_0},\mathbf{\epsilon}), t\big)\|^2\Big]\\ &=\mathbb{E}_{\mathbf{x}_{0},\mathbf{\epsilon}\sim \mathcal{N}(\mathbf{0}, \mathbf{I})}\Big[ \frac{\beta_t^2}{2{\sigma_t^2}\alpha_t(1-\bar{\alpha}_t)}\| \mathbf{\epsilon}- \mathbf{\epsilon}_\theta\big(\sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\mathbf{\epsilon}, t\big)\|^2\Big] \end{aligned}$$
DDPM simplifies this objective one step further by dropping the weighting coefficient, giving:
$$L_{t-1}^{\text{simple}}=\mathbb{E}_{\mathbf{x}_{0},\mathbf{\epsilon}\sim \mathcal{N}(\mathbf{0}, \mathbf{I})}\Big[ \| \mathbf{\epsilon}- \mathbf{\epsilon}_\theta\big(\sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\mathbf{\epsilon}, t\big)\|^2\Big]$$
Here $t$ takes values in $[1, T]$ (as noted earlier, $t=1$ corresponds to $L_0$). Because the $t$-dependent weighting coefficients are dropped, this simplified objective is in effect a reweighted version of the VLB objective. DDPM's comparison experiments show that predicting the noise works better than predicting the mean, and that the simplified objective works better than the VLB objective.
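To tie things together, here is a minimal end-to-end sketch (my own, not the official DDPM code) of the simplified training loss and one reverse sampling step; `eps_model` stands for any noise-prediction network $\epsilon_\theta(x_t, t)$, and the schedule and the choice $\sigma_t^2=\beta_t$ are assumptions:

```python
import torch

# Assumed linear schedule.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def simple_loss(eps_model, x0):
    """L_simple: MSE between the true noise and the predicted noise at a random t."""
    b = x0.shape[0]
    t = torch.randint(1, T + 1, (b,))                          # t uniform in [1, T]
    ab = alpha_bars[t - 1].view(b, *([1] * (x0.dim() - 1)))
    eps = torch.randn_like(x0)
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps              # closed-form forward sample
    return ((eps - eps_model(x_t, t)) ** 2).mean()

@torch.no_grad()
def reverse_step(eps_model, x_t, t):
    """One denoising step x_t -> x_{t-1} using the reparameterized mean, sigma_t^2 = beta_t."""
    beta_t, a_t, ab_t = betas[t - 1], alphas[t - 1], alpha_bars[t - 1]
    eps_hat = eps_model(x_t, torch.full((x_t.shape[0],), t))
    mean = (x_t - beta_t / (1.0 - ab_t).sqrt() * eps_hat) / a_t.sqrt()
    noise = torch.randn_like(x_t) if t > 1 else torch.zeros_like(x_t)
    return mean + beta_t.sqrt() * noise
```

Sampling then starts from $x_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and applies `reverse_step` for $t = T, T-1, \ldots, 1$.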