Diffusion Model
1. Basic Principles
Figure 1.
A diffusion model consists of two processes (Fig. 1):
- A fixed (or predefined) forward diffusion process $q$: Gaussian noise is gradually added to an image until only pure noise remains.
- A trainable reverse denoising process $p_\theta$: a neural network is trained to denoise step by step, starting from pure noise, until a realistic image is obtained.
Goal of the diffusion model, and its forward and reverse steps:
- Learn to generate images starting from pure noise.
- Forward process:
  - Gradually add noise to a real image until it becomes pure noise.
  - For every image in the training set, this yields a series of noisy versions at different noise levels.
  - During training, these pairs of noisy image + the noise used to produce it are the actual training samples.
- Reverse process:
  - Once the model is trained, sample from it to generate images.
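The forward process above can be sketched in a few lines of numpy. This is a toy sketch: the linear β schedule, the step count, and the 8-dimensional "image" are hypothetical choices for illustration, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear noise schedule: 0 < beta_1 < ... < beta_T < 1
T = 1000
betas = np.linspace(1e-4, 0.02, T)

def forward_step(x_prev, beta, rng):
    """One forward diffusion step: mix the previous state with Gaussian noise."""
    eps = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta) * x_prev + np.sqrt(beta) * eps

x = np.ones(8)                # stand-in for a real image x_0
for beta in betas:
    x = forward_step(x, beta, rng)
# After T steps the original signal is essentially gone and x is close to pure noise
```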
2. Mathematical Derivation
2.1. Forward Process (Adding Noise)
Figure 2.
$q(x_0)$ denotes the real data distribution (a large collection of images); sampling from it yields a real image $x_0\sim q(x_0)$. The forward process is a Markov process: each state depends only on the previous one, and at every step Gaussian noise $\epsilon\sim\mathcal{N}(0,\mathbf{I})$ is added to the previous state. The forward diffusion kernel is $q(x_t|x_{t-1})$. Define a variance schedule $0<\beta_1<\beta_2<...<\beta_T<1$; then at step $t$ the mean is $\mu_t=\sqrt{1-\beta_t}\,x_{t-1}$ and the covariance is $\sigma_t=\beta_t\mathbf{I}$, which gives:
$$q(\mathbf{x}_t|\mathbf{x}_{t-1})=\mathcal{N}(\mathbf{x}_t;\sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\ \beta_t\mathbf{I}),\qquad \mathbf{x}_t=\sqrt{1-\beta_t}\,\mathbf{x}_{t-1}+\sqrt{\beta_t}\,\epsilon_t \tag{1}$$
$\epsilon_t$ is the noise added at step $t$ (the $\epsilon_t$ are i.i.d.). Let $\alpha_t=1-\beta_t$ (equivalently $\beta_t=1-\alpha_t$); then:
$$\mathbf{x}_t=\sqrt{\alpha_t}\,\mathbf{x}_{t-1}+\sqrt{1-\alpha_t}\,\epsilon_t \tag{2}$$
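A quick numerical sanity check of Eq. (2), assuming a hypothetical value for $\alpha_t$: if $x_{t-1}$ is standard normal, one forward step preserves unit variance, since $\alpha_t\cdot 1+(1-\alpha_t)\cdot 1=1$.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha_t = 0.98                          # hypothetical alpha_t = 1 - beta_t

# If x_{t-1} ~ N(0, I), Eq. (2) is variance-preserving:
# Var[x_t] = alpha_t * 1 + (1 - alpha_t) * 1 = 1
x_prev = rng.standard_normal(200_000)
eps_t = rng.standard_normal(200_000)
x_t = np.sqrt(alpha_t) * x_prev + np.sqrt(1.0 - alpha_t) * eps_t
```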
The full trajectory is $x_0\to x_1\to x_2\to ...\to x_t\to ...\to x_T$, where $T$ is the total number of steps; computing it iteratively is slow. Because the noise terms are i.i.d., we can derive a closed form as follows:
$$q(x_t|x_{t-1})=\mathcal{N}(x_t;\sqrt{\alpha_t}\,x_{t-1},\ (1-\alpha_t)\mathbf{I}),\qquad x_t=\sqrt{\alpha_t}\,x_{t-1}+\sqrt{1-\alpha_t}\,\epsilon_t \tag{3}$$
Now express $x_{t-1}$ in terms of $x_{t-2}$:
$$x_t=\sqrt{\alpha_t}\left(\sqrt{\alpha_{t-1}}\,x_{t-2}+\sqrt{1-\alpha_{t-1}}\,\epsilon_{t-1}\right)+\sqrt{1-\alpha_t}\,\epsilon_t=\sqrt{\alpha_t\alpha_{t-1}}\,x_{t-2}+\sqrt{\alpha_t-\alpha_t\alpha_{t-1}}\,\epsilon_{t-1}+\sqrt{1-\alpha_t}\,\epsilon_t \tag{4}$$
Here we use the fact that the sum of independent Gaussian random variables is again Gaussian; for independent $X$ and $Y$ with $Z=X+Y$:
$$\mu_Z=\mu_X+\mu_Y,\qquad \sigma_Z^2=\sigma_X^2+\sigma_Y^2 \tag{5}$$
When the two Gaussians are not independent, the density of their sum is obtained by convolution; the result is still Gaussian, with mean and variance:
$$\mu_Z=\mu_X+\mu_Y,\qquad \sigma_Z^2=\sigma_X^2+\sigma_Y^2+2\rho\sigma_X\sigma_Y \tag{6}$$
where $\rho$ is the correlation coefficient of the two Gaussians. Scaling a random variable by a constant $c$ affects the mean and variance as in Eq. (7):
$$E[cX]=c\cdot E[X],\qquad \mathrm{Var}[cX]=c^2\cdot \mathrm{Var}[X] \tag{7}$$
In Eq. (4), $\sqrt{\alpha_t-\alpha_t\alpha_{t-1}}\,\epsilon_{t-1}$ and $\sqrt{1-\alpha_t}\,\epsilon_t$ are two Gaussian random variables:
$$\sqrt{\alpha_t-\alpha_t\alpha_{t-1}}\,\epsilon_{t-1}\sim\mathcal{N}(0,\ (\alpha_t-\alpha_t\alpha_{t-1})\mathbf{I}),\qquad \sqrt{1-\alpha_t}\,\epsilon_t\sim\mathcal{N}(0,\ (1-\alpha_t)\mathbf{I}) \tag{8}$$
Their sum is therefore distributed as:
$$\sqrt{\alpha_t-\alpha_t\alpha_{t-1}}\,\epsilon_{t-1}+\sqrt{1-\alpha_t}\,\epsilon_t\sim\mathcal{N}\!\left(0+0,\ (\alpha_t-\alpha_t\alpha_{t-1}+1-\alpha_t)\mathbf{I}\right)=\mathcal{N}(0,\ (1-\alpha_t\alpha_{t-1})\mathbf{I}) \tag{9}$$
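Eq. (9) can be checked numerically; the values of $\alpha_t$ and $\alpha_{t-1}$ below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
a_t, a_tm1 = 0.98, 0.97                 # hypothetical alpha_t, alpha_{t-1}

n = 500_000
eps_tm1 = rng.standard_normal(n)
eps_t = rng.standard_normal(n)

# The merged noise term from Eq. (4); by Eq. (9) its variance is 1 - a_t * a_tm1
merged = np.sqrt(a_t - a_t * a_tm1) * eps_tm1 + np.sqrt(1.0 - a_t) * eps_t
```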
Therefore $q(x_t|x_{t-1})\to q(x_t|x_0)$: iterating the substitution all the way down to $x_0$ gives the equivalent closed form:
$$\bar{\alpha}_{t}=\prod_{i=1}^{t}\alpha_{i},\qquad x_t=\sqrt{\alpha_t\alpha_{t-1}}\,x_{t-2}+\sqrt{1-\alpha_t\alpha_{t-1}}\,\epsilon=\ ...\ =\sqrt{\bar{\alpha}_t}\,x_0+\sqrt{1-\bar{\alpha}_t}\,\epsilon\\ q(x_t|x_0)=\mathcal{N}(x_t;\sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\mathbf{I}) \tag{10}$$
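The closed form of Eq. (10) can be verified deterministically by propagating the signal scale and noise variance through Eq. (2) one step at a time (the β schedule below is a hypothetical choice):

```python
import numpy as np

# Hypothetical schedule
betas = np.linspace(1e-4, 0.02, 100)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

# Per-step recursions implied by Eq. (2):
#   scale_t = sqrt(alpha_t) * scale_{t-1},  var_t = alpha_t * var_{t-1} + (1 - alpha_t)
scale, var = 1.0, 0.0
for a in alphas:
    scale = np.sqrt(a) * scale
    var = a * var + (1.0 - a)

# These match the closed form of Eq. (10): sqrt(alpha_bar_T) and 1 - alpha_bar_T
```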
By the chain rule of probability:
$$P(X_1,X_2,...,X_n)=P(X_1)\cdot P(X_2|X_1)\cdot P(X_3|X_1,X_2)\cdot...\cdot P(X_n|X_1,X_2,...,X_{n-1}) \tag{11}$$
Applying the chain rule to the conditional distribution of the trajectory:
$$q(x_{1:T}|x_0)=q(x_1|x_0)\cdot q(x_2|x_0,x_1)\cdot q(x_3|x_0,x_1,x_2)\cdot...\cdot q(x_T|x_0,x_1,...,x_{T-1}) \tag{12}$$
Since the diffusion process is defined by a Markov chain, the joint distribution simplifies to:
$$q(x_{1:T}|x_0)=q(x_1|x_0)\cdot q(x_2|x_1)\cdot q(x_3|x_2)\cdot...\cdot q(x_T|x_{T-1})=\prod_{t=1}^{T}q(x_t|x_{t-1}) \tag{13}$$
2.2. Reverse Process (Denoising)
Figure 3.
Given noisy data as input, we want a model that predicts the denoised data, written $p_\theta(x_{t-1}|x_t)$; the denoising process likewise follows a Markov process.
By Bayes' theorem:
$$P(A|B)=\frac{P(B|A)\,P(A)}{P(B)} \tag{14}$$
$$p(x_{t-1}|x_t)=\frac{p(x_t|x_{t-1})\,p(x_{t-1})}{p(x_t)} \tag{15}$$
Conditioning every term on $x_0$:
$$p(x_{t-1}|x_t,x_0)=\frac{p(x_t|x_{t-1},x_0)\,p(x_{t-1}|x_0)}{p(x_t|x_0)} \tag{16}$$
From Eq. (2) and Eq. (10), the three terms are Gaussian densities:
$$\mathcal{N}(\sqrt{\alpha_t}\,x_{t-1},\ 1-\alpha_t)\to P(x_t|x_{t-1},x_0)=\frac{1}{\sqrt{2\pi}\sqrt{1-\alpha_t}}\exp\left[-\frac{1}{2}\frac{(x_t-\sqrt{\alpha_t}\,x_{t-1})^2}{1-\alpha_t}\right]\\ \mathcal{N}(\sqrt{\bar{\alpha}_t}\,x_0,\ 1-\bar{\alpha}_t)\to P(x_t|x_0)=\frac{1}{\sqrt{2\pi}\sqrt{1-\bar{\alpha}_t}}\exp\left[-\frac{1}{2}\frac{(x_t-\sqrt{\bar{\alpha}_t}\,x_0)^2}{1-\bar{\alpha}_t}\right]\\ \mathcal{N}(\sqrt{\bar{\alpha}_{t-1}}\,x_0,\ 1-\bar{\alpha}_{t-1})\to P(x_{t-1}|x_0)=\frac{1}{\sqrt{2\pi}\sqrt{1-\bar{\alpha}_{t-1}}}\exp\left[-\frac{1}{2}\frac{(x_{t-1}-\sqrt{\bar{\alpha}_{t-1}}\,x_0)^2}{1-\bar{\alpha}_{t-1}}\right] \tag{17}$$
Substituting into Eq. (16) and simplifying gives:
$$P(x_{t-1}|x_t,x_0)\sim \mathcal{N}\left(\frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}x_t+\frac{\sqrt{\bar{\alpha}_{t-1}}(1-\alpha_t)}{1-\bar{\alpha}_t}x_0,\ \left(\frac{\sqrt{1-\alpha_t}\sqrt{1-\bar{\alpha}_{t-1}}}{\sqrt{1-\bar{\alpha}_t}}\right)^2\right) \tag{18}$$
In the reverse process we want to recover $x_0$, so we must eliminate it from this expression; rearranging the forward relation Eq. (10):
$$x_0=\frac{x_t-\sqrt{1-\bar{\alpha}_t}\,\epsilon}{\sqrt{\bar{\alpha}_t}} \tag{19}$$
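Eq. (19) is an exact inversion of Eq. (10) when the true noise is known, which a short numpy check confirms (the $\bar{\alpha}_t$ value is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
alpha_bar_t = 0.5                       # hypothetical cumulative product at step t

x0 = rng.standard_normal(16)            # toy "image"
eps = rng.standard_normal(16)

# Forward: Eq. (10)
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
# Inverse: Eq. (19) recovers x_0 exactly when the true noise eps is known
x0_rec = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)

print(np.allclose(x0_rec, x0))  # True
```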
Substituting into Eq. (18):
$$P(x_{t-1}|x_t)\sim \mathcal{N}\left(\frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}x_t+\frac{\sqrt{\bar{\alpha}_{t-1}}(1-\alpha_t)}{1-\bar{\alpha}_t}\cdot\frac{x_t-\sqrt{1-\bar{\alpha}_t}\,\epsilon}{\sqrt{\bar{\alpha}_t}},\ \frac{\beta_t(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\right) \tag{20}$$
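One reverse sampling step implementing Eq. (20) can be sketched as follows, assuming hypothetical schedule values and a random stand-in for the network's noise prediction:

```python
import numpy as np

def reverse_step(x_t, eps_pred, alpha_t, alpha_bar_tm1, rng):
    """Sample x_{t-1} from the Gaussian in Eq. (20).

    eps_pred stands in for the noise a trained network would predict;
    alpha_bar_tm1 is alpha_bar_{t-1}, so alpha_bar_t = alpha_t * alpha_bar_tm1.
    """
    beta_t = 1.0 - alpha_t
    alpha_bar_t = alpha_t * alpha_bar_tm1
    # Estimate x_0 from x_t and the predicted noise (Eq. (19))
    x0_hat = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    # Posterior mean (Eq. (18)) and variance (Eq. (20))
    mean = (np.sqrt(alpha_t) * (1.0 - alpha_bar_tm1) / (1.0 - alpha_bar_t) * x_t
            + np.sqrt(alpha_bar_tm1) * beta_t / (1.0 - alpha_bar_t) * x0_hat)
    var = beta_t * (1.0 - alpha_bar_tm1) / (1.0 - alpha_bar_t)
    return mean + np.sqrt(var) * rng.standard_normal(x_t.shape)

rng = np.random.default_rng(4)
x_t = rng.standard_normal(8)
x_tm1 = reverse_step(x_t, rng.standard_normal(8), 0.98, 0.51, rng)
```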
This relation lets us obtain the distribution of $x_{t-1}$ from $x_t$ (the denoising step), but the noise $\epsilon$ is unknown; fixing the value of $\epsilon$ amounts to fixing the value of $x_{t-1}$. Using Eq. (10), $x_t=\sqrt{\bar{\alpha}_t}\,x_0+\sqrt{1-\bar{\alpha}_t}\,\epsilon$, we obtain the training objective:
$$L_{simple}=E_{x_0\sim q(x_0),\ \epsilon\sim\mathcal{N}(0,\mathbf{I})}\left[\left\|\epsilon-\epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\,x_0+\sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\right)\right\|^2\right] \tag{21}$$
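A minimal sketch of estimating $L_{simple}$ for a single sample, with a zero-output stand-in for the network $\epsilon_\theta$ (a real implementation would use a trained U-Net conditioned on the timestep):

```python
import numpy as np

rng = np.random.default_rng(5)

def eps_theta(x_t, t):
    """Stand-in for the noise-prediction network (hypothetical)."""
    return np.zeros_like(x_t)

def simple_loss(x0, t, alpha_bar, rng):
    """Monte-Carlo estimate of L_simple from Eq. (21) for a single sample."""
    eps = rng.standard_normal(x0.shape)                        # true noise
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return float(np.sum((eps - eps_theta(x_t, t)) ** 2))

alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 100))  # hypothetical schedule
x0 = rng.standard_normal(8)
loss = simple_loss(x0, t=50, alpha_bar=alpha_bar, rng=rng)
```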
where $\epsilon_\theta$ is the noise predicted by the neural network and $\epsilon$ is the true Gaussian noise.