Taxonomy of Generative Models
Existing generative models fall roughly into two classes:
- Likelihood-based models: chiefly VAEs and normalizing flow models. They fit the data distribution directly, but usually place heavy restrictions on the network architecture.
- Implicit generative models: chiefly GANs. A discriminator indirectly judges whether generated outputs match the data distribution; training is difficult and prone to failure and mode collapse.
Score Matching & Non-Normalized Distribution
Score Matching was first proposed in 2005 in "Estimation of Non-Normalized Statistical Models by Score Matching", mainly to sidestep the normalization problem of probability distributions. In generative modeling, we want the density given by a parametric model $p_\theta(x)$ to fit the true density $p(x)$. As a probability density, $p_\theta(x)$ must be normalized, $\int_x p_\theta(x)\,dx = 1$. We therefore typically let the model output an unnormalized density $q_\theta(x)$ and introduce a normalizing constant $Z(\theta)$ to enforce this property:

$$p_\theta(x) = \frac{1}{Z(\theta)}q_\theta(x)$$
where $Z(\theta)$ is a constant independent of the sample. Because of $Z(\theta)$, both gradient-based optimization and forward inference become hard to compute. We therefore instead make the gradient of the model's log-density with respect to the input approximate that of the true log-density; since $\log p_\theta(x) = \log q_\theta(x) - \log Z(\theta)$ and $Z(\theta)$ does not depend on $x$,

$$\nabla_x \log q_\theta(x) = \nabla_x \log p_\theta(x) \approx \nabla_x \log p(x)$$
Formally, we define the score function $\psi : \mathbb R^n\rightarrow \mathbb R^n$ as

$$\psi(x) = \nabla_x \log p(x)$$
The process of finding the optimal parameters $\theta$ with an MSE loss is called Score Matching, with the loss defined as

$$\begin{aligned} J_\text{ESM}(\theta) &= \frac{1}{2} \int_{x} p(x)\, \|\psi_\theta(x) - \psi(x)\|_2^2\, dx \\ &= \mathbb E_{x\sim p(x)} \left[\frac{1}{2}\|\psi_\theta(x) - \psi(x)\|_2^2\right] \end{aligned} \tag{1}$$
This form of the loss is also known as Explicit Score Matching.
Expanding $J_\text{ESM}$ gives

$$\begin{aligned} J(\theta) &= \frac{1}{2} \int_{x} p(x)\, \|\psi_\theta(x)\|^2_2\,dx - \int_{x} p(x)\, \psi_\theta^T(x)\psi(x)\,dx + \frac{1}{2} \int_{x} p(x)\, \|\psi(x)\|^2_2\,dx \\ &= \frac{1}{2} \int_{x} p(x)\, \|\psi_\theta(x)\|^2_2\,dx - \int_{x} p(x)\, \psi_\theta^T(x)\psi(x)\,dx + C \end{aligned} \tag{2}$$
Expanding the second term over dimensions, we have

$$\begin{aligned} \int_{x} p(x)\, \psi_\theta^T(x)\psi(x)\,dx &= \sum_{i=1}^n \int_{x} p(x)\, \psi_\theta^{(i)}(x)\psi^{(i)}(x)\,dx \\&= \sum_{i=1}^n \int_{x} p(x)\, \psi_\theta^{(i)}(x)\frac{\partial \log p(x)}{\partial x^{(i)}}\,dx \\&= \sum_{i=1}^n \int_{x} \psi_\theta^{(i)}(x)\frac{\partial p(x)}{\partial x^{(i)}}\,dx \end{aligned} \tag{3}$$
Consider the case $i=1$. By the integration-by-parts formula,
$$\lim_{a\rightarrow \infty ,\, b\rightarrow -\infty} f(a, x^{(2)}, \dots , x^{(n)})\, g(a, x^{(2)}, \dots , x^{(n)}) - f(b, x^{(2)}, \dots , x^{(n)})\, g(b, x^{(2)}, \dots , x^{(n)}) \\ = \int_{-\infty}^{\infty} f(x)\frac{\partial g(x)}{\partial{x^{(1)}}}\,d x^{(1)} + \int_{-\infty}^{\infty} g(x)\frac{\partial f(x)}{\partial{x^{(1)}}}\,d x^{(1)}$$
We therefore have
$$\begin{aligned} \int_{x} \psi_\theta^{(1)}(x)\frac{\partial p(x)}{\partial x^{(1)}}dx &= \int_{x^{(2)}\dots x^{(n)}} \int_{x^{(1)}} \psi_\theta^{(1)}(x)\frac{\partial p(x)}{\partial x^{(1)}}dx^{(1)}\, d(x^{(2)}\dots x^{(n)}) \\&= \int_{x^{(2)}\dots x^{(n)}}\Big[\lim_{a\rightarrow \infty , b\rightarrow -\infty}\big(p(a,x^{(2)},\dots)\psi_\theta^{(1)}(a,x^{(2)},\dots)-p(b,x^{(2)},\dots)\psi_\theta^{(1)}(b,x^{(2)},\dots)\big) - \int_{x^{(1)}} p(x)\frac{\partial \psi_\theta^{(1)}(x)}{\partial x^{(1)}}dx^{(1)} \Big]d(x^{(2)}\dots x^{(n)}) \end{aligned}$$
If we assume $\lim_{\|x\|\rightarrow \infty} p(x)\psi_\theta(x) = 0$, the above simplifies to
$$\int_{x} \psi_\theta^{(1)}(x)\frac{\partial p(x)}{\partial x^{(1)}}\,dx = -\int_x \frac{\partial \psi_\theta^{(1)}(x)}{\partial x^{(1)}}\, p(x)\, dx \tag{4}$$
Substituting (3) and (4) back into (2) yields the equivalent implicit objective $J_\text{ISM}$:
$$\begin{aligned} J_\text{ISM}(\theta) &= \int_x p(x)\sum_{i=1}^n\left(\frac{\partial \psi_\theta^{(i)}(x)}{\partial x^{(i)}}+\frac{1}{2}\psi_\theta^{(i)}(x)^2 \right) dx \\ &= \int_x p(x)\left( \operatorname{tr}(\nabla_x\psi_\theta(x))+\frac{1}{2} \|\psi_\theta(x)\|_2^2 \right) dx \\ &=\mathbb{E}_{x\sim p(x)} \left[\operatorname{tr}(\nabla_x\psi_\theta(x))+\frac{1}{2} \|\psi_\theta(x)\|_2^2 \right] \end{aligned} \tag{5}$$
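As a sanity check on Eq. (5), the sketch below (an illustrative toy, not from the original paper) evaluates $J_\text{ISM}$ for a hand-written Gaussian score model $\psi_\theta(x) = -(x-\mu)/\sigma^2$ on 1-D data, where the Jacobian trace is simply $-1/\sigma^2$. Minimizing this objective should recover the data's mean and variance; all constants here are assumptions for the demo.

```python
import random

random.seed(0)
data = [random.gauss(2.0, 1.5) for _ in range(2000)]  # x ~ p(x) = N(2, 1.5^2)

def j_ism(mu, sigma2, xs):
    # Monte-Carlo estimate of Eq. (5) for psi_theta(x) = -(x - mu)/sigma2:
    # the Jacobian trace is -1/sigma2, the squared norm is ((x - mu)/sigma2)^2.
    return sum(-1.0 / sigma2 + 0.5 * ((x - mu) / sigma2) ** 2 for x in xs) / len(xs)

# Brute-force grid search over (mu, sigma^2); the minimizer should land
# near the sample mean and variance of the data.
grid = [(m / 10, v / 10) for m in range(10, 31) for v in range(10, 41)]
best = min(grid, key=lambda p: j_ism(p[0], p[1], data))
print(best)  # close to the true (mean, variance) = (2.0, 2.25)
```

Note that no access to the true score was needed: this is exactly the practical appeal of the implicit objective.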
Given an optimal score function $\psi_\theta(x)$, we can generate data via Langevin dynamics. Concretely, fix a step size $\epsilon>0$, draw an initial value $x_0 \sim \pi(x)$ from some prior distribution, and let $z_t \sim \mathcal N(0,I)$; then iterate
$$\begin{aligned} x_t &= x_{t-1} + \frac{\epsilon}{2}\nabla_x \log p(x_{t-1}) +\sqrt{\epsilon}\, z_t \\&\approx x_{t-1} + \frac{\epsilon}{2}\psi_\theta(x_{t-1}) +\sqrt{\epsilon}\, z_t \end{aligned}$$
As $t\rightarrow \infty$ and $\epsilon\rightarrow 0$, $x_t$ becomes equivalent to a sample drawn from $p(x)$.
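A minimal pure-Python sketch of the Langevin iteration above, substituting the exact score of a 1-D Gaussian target for a learned $\psi_\theta$ (the target $\mathcal N(2,1)$, the uniform prior, and all step counts are illustrative assumptions):

```python
import math, random

random.seed(0)

def score(x):
    # exact score of the assumed target p(x) = N(2, 1): d/dx log p(x) = -(x - 2)
    return -(x - 2.0)

eps = 0.01                            # fixed step size
samples = []
for _ in range(1000):                 # independent chains
    x = random.uniform(-10.0, 10.0)   # x_0 ~ pi(x), a broad uniform prior
    for _ in range(1000):
        # x_t = x_{t-1} + (eps/2) * score(x_{t-1}) + sqrt(eps) * z_t
        x += 0.5 * eps * score(x) + math.sqrt(eps) * random.gauss(0.0, 1.0)
    samples.append(x)

mean = sum(samples) / len(samples)
var = sum((y - mean) ** 2 for y in samples) / len(samples)
print(round(mean, 2), round(var, 2))  # approaches (2.0, 1.0)
```

With finite $\epsilon$ the chain only approximately targets $p(x)$, which is why the estimates land near, not exactly at, the true moments.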
This approach still has several problems:
- Training a neural network requires backpropagation, and Eq. (5) contains a first derivative of the network output with respect to the input, so backpropagation must compute second derivatives.
- The modeling does not work well for high-dimensional inputs.
An intuitive explanation of problem 2: the image distribution can be viewed as a low-dimensional manifold embedded in a high-dimensional space, so for most locations in that space there are no valid training samples. An initial value $x_0$ drawn from the prior will then be driven by an inaccurate score function toward some poor local optimum, producing bad samples.
Sliced Score Matching
The main problem with Score Matching is efficiency. To avoid computing the Jacobian of the score (i.e., the Hessian of the log-density) with respect to the input, the authors propose replacing $J_\text{ESM}(\theta)$ with
$$J_\text{ESSM}(\theta) = \mathbb E_{v\sim p_v(v)}\, \mathbb E_{x\sim p(x)} \left[\frac{1}{2}(v^T\psi_\theta(x) - v^T\psi(x))^2\right] \tag{6}$$
where $v$ is a random projection direction drawn from $p_v$, satisfying $\mathbb E_{p_v}[vv^T]\succ 0$ (the matrix is positive definite) and $\mathbb E_{p_v}[\|v\|_2^2]<\infty$; examples include the standard normal distribution, the multivariate Rademacher distribution, and the uniform distribution on the hypersphere.
Applying a derivation analogous to the one above to Eq. (6) yields $J_\text{ISSM}(\theta)$:
$$J_\text{ISSM}(\theta) = \mathbb E_{v\sim p_v(v)}\, \mathbb E_{x\sim p(x)} \left[v^T\nabla_x\psi_\theta(x)\,v + \frac{1}{2}(v^T\psi_\theta(x))^2\right]$$
The proof proceeds along the same lines as before.
Why this accelerates Score Matching: computing $\nabla_x \psi_\theta(x)$ originally required a full Hessian. Now the loss first computes the scalar $v^T\psi_\theta(x)$ and then differentiates that scalar, using $\nabla_x (v^T\psi_\theta(x)) = v^T\nabla_x \psi_\theta(x)$, so a single extra backward pass suffices.
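The identity behind this trick can be checked numerically. The toy sketch below uses a linear stand-in $\psi(x) = Wx$ for a score network (all names are illustrative assumptions) and compares $v^T (\nabla_x \psi)\, v$ formed from the explicit Jacobian against a single directional derivative of the scalar $v^T\psi(x)$:

```python
import random

random.seed(0)
n = 4
# toy linear "score network" psi(x) = W x; W is a stand-in for a real model
W = [[random.gauss(0, 1) for _ in range(n)] for _ in range(n)]

def psi(x):
    return [sum(W[i][j] * x[j] for j in range(n)) for i in range(n)]

x = [random.gauss(0, 1) for _ in range(n)]
v = [random.choice([-1.0, 1.0]) for _ in range(n)]  # multivariate Rademacher direction

# explicit route: form the full Jacobian (here just W) and contract v^T J v
explicit = sum(v[i] * W[i][j] * v[j] for i in range(n) for j in range(n))

# sliced route: one directional derivative of the scalar v^T psi(x) along v,
# approximated here by central finite differences (autograd would do this exactly)
h = 1e-5
def s(y):
    return sum(vi * pi for vi, pi in zip(v, psi(y)))
sliced = (s([xj + h * vj for xj, vj in zip(x, v)])
          - s([xj - h * vj for xj, vj in zip(x, v)])) / (2 * h)

print(abs(explicit - sliced))  # agree up to numerical precision
```

The explicit route costs $n$ backward passes (one per Jacobian row); the sliced route costs one, regardless of $n$.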
Denoising Score Matching
SSM provides an unbiased way to fit the score function, but image generation still demands heavy computation. Borrowing the idea behind VAEs, if each training sample represents not an isolated point but a local probability distribution, i.e., we perturb each training sample $x$ with noise to obtain $\tilde x\mid x$, problem 2 of Score Matching can be effectively mitigated. Let $q_\sigma(\tilde x|x)$ denote the noise distribution, with $\sigma$ its parameter; the marginal distribution of noisy samples is then

$$q_\sigma(\tilde x) = \int_{x} q_\sigma(\tilde x|x)\, p(x)\, dx$$
The learning goal then becomes fitting the score function of this noisy distribution with the model; for example, $J_\text{ESM}$ becomes
$$\begin{aligned} J_{\text{ESM},q_\sigma} &= \mathbb E_{\tilde x \sim q_\sigma(\tilde x)} \left[\frac{1}{2}\Big\|\psi_\theta(\tilde x) - \psi(\tilde x)\Big\|_2^2\right] \\&= \mathbb E_{\tilde x \sim q_\sigma(\tilde x)} \left[\frac{1}{2}\Big\|\psi_\theta(\tilde x) - \frac{\partial \log q_\sigma(\tilde x)}{\partial \tilde x}\Big\|_2^2\right] \end{aligned} \tag{7}$$
On the other hand, define $J_{\text{DSM},q_\sigma}(\theta)$ as
$$J_{\text{DSM},q_\sigma}(\theta) = \mathbb E_{x,\tilde x \sim q_\sigma(x,\tilde x)} \left[\frac{1}{2}\Big\|\psi_\theta(\tilde x) - \frac{\partial \log q_\sigma(\tilde x|x)}{\partial \tilde x}\Big\|_2^2\right]$$
We now show that, as Score Matching objectives for optimizing $\theta$, $J_{\text{DSM},q_\sigma}(\theta)$ and $J_{\text{ESM},q_\sigma}(\theta)$ are equivalent:
$$\begin{aligned} J_{\text{ESM}, q_\sigma} &= \mathbb E_{\tilde x \sim q_\sigma(\tilde x)} \left[\frac{1}{2}\Big\|\psi_\theta(\tilde x) - \frac{\partial \log q_\sigma(\tilde x)}{\partial \tilde x}\Big\|_2^2\right] \\&=\mathbb E_{x,\tilde x \sim q_\sigma(x,\tilde x)} \left[\frac{1}{2}\|\psi_\theta(\tilde x)\|_2^2\right] - S(\theta) + C \end{aligned}$$
where $C$ collects the terms that do not involve $\theta$, and $S(\theta)$ is derived as follows:
$$\begin{aligned} S(\theta) &= \mathbb E_{\tilde x \sim q_\sigma(\tilde x)} \left[\Big\langle\psi_\theta(\tilde x),\, \frac{\partial \log q_\sigma(\tilde x)}{\partial \tilde x}\Big\rangle\right] \\&=\int_{\tilde{x}} q_\sigma(\tilde x)\, \Big\langle\psi_\theta(\tilde x),\, \frac{1}{q_\sigma(\tilde x)}\cdot \frac{\partial q_\sigma(\tilde x)}{\partial \tilde x}\Big\rangle\, d\tilde x \\&=\int_{\tilde{x}} \Big\langle\psi_\theta(\tilde x),\, \frac{\partial q_\sigma(\tilde x)}{\partial \tilde x}\Big\rangle\, d\tilde x \\&=\int_{\tilde{x}} \Big\langle\psi_\theta(\tilde x),\, \frac{\partial }{\partial \tilde x} \int_x q_\sigma(\tilde x|x)\, p(x)\, dx\Big\rangle\, d\tilde x \\&=\int_{\tilde{x}} \Big\langle\psi_\theta(\tilde x),\, \int_x p(x)\, \frac{\partial q_\sigma(\tilde x|x)}{\partial \tilde x}\, dx\Big\rangle\, d\tilde x \\&=\int_{\tilde{x}} \Big\langle\psi_\theta(\tilde x),\, \int_x p(x)\, q_\sigma(\tilde x|x)\, \frac{\partial \log q_\sigma(\tilde x|x)}{\partial \tilde x}\, dx\Big\rangle\, d\tilde x \\&=\int_{\tilde{x}}\int_x p(x)\, q_\sigma(\tilde x|x)\, \Big\langle\psi_\theta(\tilde x),\, \frac{\partial \log q_\sigma(\tilde x|x)}{\partial \tilde x} \Big\rangle\, dx\, d\tilde x \\&=\mathbb E_{x,\tilde x \sim q_\sigma(x,\tilde x)}\left[\Big\langle\psi_\theta(\tilde x),\, \frac{\partial \log q_\sigma(\tilde x|x)}{\partial \tilde x} \Big\rangle\right] \end{aligned}$$
Expanding $J_{\text{DSM}, q_\sigma}$ in the same way completes the proof that $J_{\text{DSM}, q_\sigma}$ is equivalent to $J_{\text{ESM}, q_\sigma}$. Note that if the added noise is $\mathcal N(0, \sigma^2 I)$, then
$$\frac{\partial \log q_\sigma(\tilde x|x)}{\partial \tilde x} = \frac{1}{\sigma^2}(x-\tilde x)$$
so the $J_{\text{DSM}, q_\sigma}$ objective becomes
$$J_{\text{DSM},q_\sigma}(\theta) = \mathbb E_{x,\tilde x \sim q_\sigma(x,\tilde x)} \left[\frac{1}{2}\Big\|\psi_\theta(\tilde x) - \frac{1}{\sigma^2}(x-\tilde x)\Big\|_2^2\right] \tag{8}$$
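For a 1-D Gaussian toy case, Eq. (8) can be minimized in closed form. The sketch below (all parameters are illustrative assumptions, not taken from the papers) fits a linear score model $\psi_\theta(\tilde x) = a\tilde x + b$ by least squares, which is exactly minimizing $J_\text{DSM}$ over this model class; the optimum should match the score of the smoothed marginal $q_\sigma = \mathcal N(\mu, s^2+\sigma^2)$:

```python
import random

random.seed(0)
mu, s, sigma = 2.0, 1.0, 0.5
N = 100000
xs  = [random.gauss(mu, s) for _ in range(N)]           # clean samples x ~ p(x)
xts = [x + random.gauss(0.0, sigma) for x in xs]        # noisy samples ~ q_sigma(x~|x)
ts  = [(x - xt) / sigma ** 2 for x, xt in zip(xs, xts)] # DSM regression target, Eq. (8)

# Least-squares fit of psi_theta(x~) = a*x~ + b, the exact minimizer of J_DSM
# over linear models.
mx = sum(xts) / N
mt = sum(ts) / N
a = sum((xt - mx) * (t - mt) for xt, t in zip(xts, ts)) / sum((xt - mx) ** 2 for xt in xts)
b = mt - a * mx

# Optimum equals the score of the smoothed marginal q_sigma = N(mu, s^2 + sigma^2):
# a -> -1/(s^2 + sigma^2) = -0.8,  b -> mu/(s^2 + sigma^2) = 1.6
print(round(a, 3), round(b, 3))
```

This makes the first drawback below concrete: the recovered score is that of the noisy marginal (variance $s^2+\sigma^2$), not of the clean data distribution.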
Adding noise neatly sidesteps the second derivative and improves training efficiency. Inspecting the objective, DSM is essentially using the score function to fit a denoiser. This approach has two drawbacks:
- The fitted distribution is the noise-perturbed distribution, not the original one.
- The noise level $\sigma$ is hard to set: strong noise corrupts the fitted data distribution, while weak noise fails to cover a large enough region, so most initial values cannot be optimized to good positions.
NCSN's remedy is to apply noise levels from large to small across the different stages of inference (annealed sampling).
On hyperparameter choice (noise levels, step size, and number of iterations), see Improved Techniques for Training Score-Based Generative Models.
References
- Generative Modeling by Estimating Gradients of the Data Distribution
- Sliced Score Matching: A Scalable Approach to Density and Score Estimation
- Improved Techniques for Training Score-Based Generative Models