diffusion model(十七)Score视角下的生成模型（Score-Based Generative Models）

莫叶何竹

已于 2024-06-14 08:13:14 修改

阅读量898

点赞数 12

分类专栏： diffusion model 文章标签： scorebasedmodel SBM diffusion NCSN

于 2024-05-30 19:38:15 首次发布

本文链接：https://blog.csdn.net/weixin_40779727/article/details/139332740

版权

diffusion model 专栏收录该内容

21 篇文章 13 订阅

订阅专栏

	info
paper	Generative Modeling by Estimating Gradients of the Data Distribution
songyang 大佬博客	https://yang-song.net/blog/2021/score/
song yang 大佬视频讲解	https://www.youtube.com/watch?v=wMmqCMwuM2Q
Github	https://github.com/ermongroup/ncsn
个人博客主页	http://myhz0606.com/article/ncsn

1 背景

生成模型想要做的是这样一件事：假设我们有一批数据 $X=\{\mathrm{x}_i \in \mathbb{R}^D\}_{i=1}^{N}$ 是从数据分布 $p_{data}(\mathrm{x})$ 采样得到，但我们不知道 $p_{data}(\mathrm{x})$ 是什么。我们期望通过已知的数据 $X$ 来训练一个模型 $p_{\theta}(\mathrm{x})$ ，这个模型和真实数据分布近似 $p_{\theta}(\mathrm{x}) \simeq p_{data}(\mathrm{x})$ 。这样我们就可以通过 $p_{\theta}(\mathrm{x})$ 源源不断采样得到新的样本。

过去我们做图像生成任务主要有两类方法：

method	main idea	representative method	优点	缺点
likelihood-based model	以最大化对数似然为训练目标，直接对数据分布 $p_{data}(x)$ 进行拟合（模型输出的就是满足数据分布的图片）	1. 自回归模型(autoregressive models) 2. 流模型(normalizing flow models)3. 能量模型（energy-based models, EBMs）4. 变分自编码器(variational auto-encoders , VAEs))	收敛速度快。	1. 对网络结构限制多2. 生成效果一般
implicit generative models	没有对数据分布 $p_{data}(x)$ 进行直接优化，而是通过判别器间接优化	生成对抗模型	生成质量相对高	训练不稳定

上述两类方法都有各自的局限。本文提出了一个新的生成模型，通过score matching的方法估计数据分布对数的梯度（论文称该梯度为score），随后通过朗之万动力学（Langevin dynamic）的方法进行采样生成。下面来看具体是怎么做的吧！

2 Score-based Generative Model

基于score-based的生成模型不拟合原始数据分布 $p_{data}(\mathrm{x})$ ，而是拟合数据分布对数的梯度 $\nabla_{\mathrm{x}} \log p_{data}(\mathrm{x})$ 。作者将数据分布对数的梯度称为score。下面我们从score-based model的motivation出发，探究score-based model是如何工作的。

2.1 Motivation of Score-Based Model

前文提到生成模型的目标是: 通过采样数据 $\{\mathrm{x}_1, \mathrm{x}_2, \cdots, \mathrm{x}_N \} \stackrel {\mathrm{i.i.d.}} \sim p_{data}(\mathrm{x})$ 来训练一个模型 $p_{\theta}$ ( $\theta \in \Theta$ ， $\Theta$ 是模型的参数空间)，使得 $p_{\theta}(\mathrm{x}) \simeq p_{data}(\mathrm{x})$ 。

为了保证neural netowrk的输出符合标准数据分布的特征：

1）非负

2）概率分布的积分为1

我们可以将 $p_{\theta}(\mathrm{x})$ 定义为

$p_\theta(\mathrm{x}) = \frac{e^{-f_{\theta}(\mathrm{x})}}{Z_{\theta}} \tag{1}$

式中： $Z_{\theta}$ 是一个归一化常数，使概率分布满足 $\int p_{\theta}(\mathrm{x})d\mathrm{x}=1$ ，不难得出 $Z_{\theta} = \int e ^{-f_{\theta}(\mathrm{x})}d\mathrm{x}$ 。 $f_{\theta}(\mathrm{x})$ 称为非归一化概率模型（unnormalized probabilistic model）或能量模型（energy-based model）。

为了求解式(1)的概率分布，我们需要求解 $Z_{\theta}$ ，即使将 $\mathrm{x}$ 视作是离散变量，将积分用求和近似， $Z_{\theta}$ 的求解依然非常困难。

过去解决归一化常数 $Z_{\theta}$ 难求解的优化思路主要有3个

核心思路	代表方法	缺点
近似求解 $Z_{\theta}$	energy-based model [5]	概率密度估计不准确
通过限制网络结构来让 $Z_{\theta}$ 便于求解	Normalizing flow model [6]VAE [7]	对网络结构的约束限制了模型的表达能力
只建模数据生成过程，不对概率密度函数建模	GAN[3]	模型收敛难，无法提供准确的概率估计

通过以上分析，一个更好的生成模型架构应当具备以下几个特征：

1）不给模型架构添加过多先验，使模型的表达能力受限。

2）能够准确的估计概率密度。

生成质量高，生成过程可控。

2.2 Why Score-Based Model

为了绕过非归一化常数 $Z_{\theta}$ 计算麻烦的问题，score-based model 网络拟合的目标并非数据分布 $p_{\mathrm{data}}{(\mathrm{x})}$ 而是数据分布的梯度 $\nabla_{\mathrm{x}} \log p_{\mathrm{data}}{(\mathrm{x})}$ 。下面来看为什么要这么做。

2.2.1 $\nabla_{\mathrm{x}} \log p_{\mathrm{data}}{(\mathrm{x})}$ 是什么

为了方便理解，不妨假设原始数据分布 $p_{\mathrm{data}}{(\mathrm{x})}$ 由两个高斯分布组成，颜色越深的位置说明该位置的样本点的概率值越大。

image cropped from song yang’s tutorial[8]

假设样本点 $\mathrm{x}_0 \sim p_{\mathrm{data}}(\mathrm{x}), \mathrm{x}_0 \in \mathbb{R} ^{d}$ 。 $\nabla_{\mathrm{x}} \log p_{\mathrm{data}}{(\mathrm{x}_0)}$ 是概率密度函数在 $\mathrm{x}_0$ 处对数的梯度(实质上是一个向量场 $\nabla_{\mathrm{x}} \log p_{\mathrm{data}}{(\mathrm{x}_0)}\in \mathbb{R}^{d}$ )。

物理意义：当样本点 $\mathrm{x}_0$ 沿着 $\nabla_{\mathrm{x}} \log p_{\mathrm{data}}{(\mathrm{x}_0)}$ 的方向变化时，能够使得概率增长最快。

image cropped from song yang’s tutorial[8]

2.2.2 为什么估计 $\nabla_{\mathrm{x}} \log p_{\mathrm{data}}{(\mathrm{x})}$ 能绕开归一化常数 $Z_{\theta}$ 的计算

根据式1的定义： $p_\theta(\mathrm{x}) = \frac{e^{-f_{\theta}(\mathrm{x})}}{Z_{\theta}}$ ，对其两边取对数得到

$\begin{aligned} \log{p_\theta}(\mathrm{x}) &= \log{e^{-f_{\theta}(\mathrm{x})}} - \log{Z_{\theta}} \\ &= -f_{\theta}(\mathrm{x}) - \log{Z_{\theta}} \end{aligned} \tag{2}$

两边对 $\mathrm{x}$ 求导

$\begin{aligned} \nabla_{\mathrm{x}}\log{p_\theta}(\mathrm{x}) &= -\nabla_{\mathrm{x}}f_{\theta}(\mathrm{x}) - \underbrace{\nabla_{\mathrm{x}}\log{Z_{\theta}}}_{= 0} \\ &= -\nabla_{\mathrm{x}}f_{\theta}(\mathrm{x}) \end{aligned} \tag{3}$

作者将 $\nabla_{\mathrm{x}}\log{p_\theta}(\mathrm{x})$ 定义为score function，记为 $s_{\theta}(\mathrm{x})$ ，将以score function为建模目标的模型称为score-based model。score function与归一化常数 $Z_{\theta}$ 无关。

2.3 How to Train Score-Based Model

2.3.1 Score Matching

我们可以最小化fisher diverence来训练score based model，即估计的score和实际的score均方误差最小：

$\min_{\theta} \frac{1}{2} \mathbb{E}_{p_{\text {data}}}\left[\left\| \nabla_{\mathrm{x}} \log p_{\text {data }}(\mathrm{x})-\nabla_{\mathrm{x}} \log p_\theta(\mathrm{x})\right\|_2^2\right]\tag{4}$

上式还有一个问题，虽然 $\nabla_{\mathrm{x}} \log p_\theta(\mathrm{x})$ 可以通过自动微分工具很方便的计算（如torch），但数据分布 $p_{\text {data}}(x)$ 是未知的，这导致 $\nabla_{x} \log p_{\text {data }}(x)$ 无法求得，因此需要对上式进行变形。这个方法称之为score matching [1]。

最终score matching的优化目标如下(忽略常数项)（附录A给出了详细推导步骤）：

$\begin{align*} &\min_{\theta} \frac{1}{2}\mathbb{E}_{p_{\text{data}}} \big[2 \mathrm{tr} \big(\underbrace{ \nabla_{\mathrm{x}}^{\textcolor{red}{2}} \log p_\theta(\mathrm{x})}_{\nabla_{\mathrm{x}} s_{\theta}(\mathrm{x})} \big) + \| \underbrace{\nabla_{\mathrm{x}} \log p_\theta(\mathrm{x})}_{s_{\theta}(x)}\|_{2}^2 \big] \tag{5} \\ & = \min_{\theta} \frac{1}{2}\mathbb{E}_{p_{\text{data}}}[2 \mathrm{Tr} (\nabla_{\mathrm{x}} s_{\theta}(\mathrm{x})) + \| s_\theta({\mathrm{x}})\|] \tag{6} \end{align*}$

通过score matching，我们推导出式6，无需知道 $\nabla_{\mathrm{x}} \log p_{\text {data }}(\mathrm{x})$ 也能优化 $s_{\theta}(\mathrm{x})$ 。只需计算 $s_{\theta}(\mathrm{x})$ 和 $\nabla_{\mathrm{x}} s_{\theta}(\mathrm{x})$ 即可。 $s_{\theta}(\mathrm{x})$ 的计算很简单，只是一个前向计算，而 $\nabla_{\mathrm{x}} s_{\theta}(\mathrm{x})$ 的计算复杂度就高了，我们知道 $s_{\theta}(\mathrm{x})$ 和 $\mathrm{x}$ 都是向量 $s_{\theta},\mathrm{x} \in \mathbb{R}^{d}$ ，因此 $\nabla_{\mathrm{x}} s_{\theta}(\mathrm{x})$ 是向量对向量求导，相较标量对向量求导，计算量扩大了 $d$ 倍，即需要 $d$ 次反向传播才能得到 $\nabla_{\mathrm{x}} s_{\theta}(\mathrm{x})$ 。那么有没有办法能够对这个计算量进行优化呢？答案是肯定的

$\begin{aligned} &\boldsymbol {\mathrm{x}} = [x_1, x_2, \cdots, x_d]^T, \\ & s_{\theta}(\boldsymbol {\mathrm{x}}) = [s_1(\mathrm{x; \theta}), s_2(\mathrm{x; \theta}), \cdots, s_d(\mathrm{x; \theta})]^T \\ &\nabla_{\mathrm{x}} s_{\theta}(\boldsymbol {\mathrm{x}}) = \begin{bmatrix}&\frac{\partial }{\partial x_1}s_1(\mathrm{x; \theta}) &\frac{ \partial }{\partial x_2}s_1(\mathrm{x; \theta}) & \cdots & \frac{ \partial }{x_d}s_1(\mathrm{x; \theta}) \\&\frac{\partial }{\partial x_1}s_2(\mathrm{x; \theta}) &\frac{ \partial }{\partial x_2}s_2(\mathrm{x; \theta}) & \cdots & \frac{ \partial }{\partial x_d}s_2(\mathrm{x; \theta}) \\&\vdots & \vdots &\ddots &\vdots \\ &\frac{\partial }{\partial x_1}s_d(\mathrm{x; \theta}) &\frac{ \partial }{\partial x_2}s_d(\mathrm{x; \theta}) & \cdots & \frac{ \partial }{\partial x_d}s_d(\mathrm{x; \theta}) \end{bmatrix} \\ &\mathrm{Tr}(\nabla_{\mathrm{x}} s_{\theta}(\boldsymbol {\mathrm{x}})) = \sum^{d}_{i=1} \frac{\partial s_i(\mathrm{x; \theta})}{\partial x_i} \end{aligned} \tag{7}$

2.3.2 Sliced Score Matching (SSM)

sliced score matching是加速计算score matching的一个方法。我们知道score matching计算量大主要是由于向量对向量求导，需要计算雅可比矩阵（Jacobian matrix）导致，向量维度越高计算量越大。sliced score matching的思路很简单，如果我们将输出向量映射为一个标量，此时计算量不就显著降低了。最简单的方法，我们可以引入一个投影矩阵 $\mathrm{v}, \mathrm{v} \in \mathbb{R}^{d \times 1}$ 。投影矩阵 $\mathrm{v}$ 服从某一分布 $p_{\mathrm{v}}$ (可以为高斯分布或其它分布)，即 $\mathrm{v} \sim p_{\mathrm{v}}$ ， sliced score matching的fisher divergence为

$\min_{\theta} \frac{1}{2} \mathbb{E}_{p_{\mathrm{v}}}\mathbb{E}_{p_{\text {data}}} \left[\left\| \mathrm{v} ^ {\mathrm{T}} \nabla_{\mathrm{x}} \log p_{\text {data }}(\mathrm{x})-\mathrm{v} ^ {\mathrm{T}} s_{\theta}{(x)}\right\|_2^2\right]\tag{8}$

同样类似score matching方法对上式转为期望定义形式，利用分部积分消除未知数据分布 $p_{data}$ 的依赖，得到slice score matching的优化目标。（推导细节参考文献[9]）

$\min_{\theta} \mathbb{E}_{p_{\mathrm{v}}}\mathbb{E}_{p_{\text {data}}} \left[ \mathrm{v}^{\mathrm{T}}\nabla_{\mathrm{x}}(\mathrm{v}^{\mathrm{T}}s_{\theta}(\mathrm{x})) + \frac{1}{2}(\mathrm{v}^{\mathrm{T}}s_{\theta}(\mathrm{x}))^2 \right] \tag{9}$

上式中 $\mathrm{v}^{\mathrm{T}}s_{\theta}(\mathrm{x})$ 的结果是一个标量，只需一次反向传播即可，相比原生score matching计算量大大降低。

2.3.3 denoising score matching （`DSM`）

denoising score matching是解决原生score matching计算量大的另一个思路。 $\mathrm{x}$ 是从数据分布 $p_{data}(\mathrm{x})$ 采样的样本点，我们定义一个噪声分布 $q_{\sigma}(\tilde{\mathrm{x}}|\mathrm{x}) = \mathcal{N}(\mathrm{x}, \sigma^2 \mathrm{I})$

$\tilde{\mathrm{x}} = \mathrm{x} + \sigma z \tag{10}$

其中: $z\sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ ，我们得到加噪的 $\mathrm{x}$ 。

注意，这里我们并不能知道 $\tilde{\mathrm{x}}$ 是什么分布，不妨将 $\tilde{\mathrm{x}}$ 的分布记为 $q_{\sigma}({\tilde{\mathrm{x}}})$
$q_{\sigma}({\tilde{\mathrm{x}}}) = \int q_{\sigma}(\tilde{\mathrm{x}}|\mathrm{x}) p_{data}(\mathrm{x})d\mathrm{x}$

我们在扰动后的数据 $\tilde{\mathrm{x}}$ 上估计score，类似式（4），我们可以写出下式的优化目标

$\min_{\theta} \frac{1}{2} \mathbb{E}_{q_{\sigma}({\tilde{\mathrm{x}}})}\left[\left\| \nabla_{\tilde{\mathrm{x}}} \log q_{\sigma}({\tilde{\mathrm{x}}}) -s_{\theta}(\tilde{\mathrm{x}})\right\|_2^2\right]\tag{11}$

同样，由于 $q_{\sigma}({\tilde{\mathrm{x}}})$ 未知，我们需要对其进行转化，最后得到denoising score matching的优化目标，详细推导过程见文献[10]，或附录B。

$\min_{\theta} \frac{1}{2} \mathbb{E}_{\mathrm{x} \sim p_{\text {data}}(\mathrm{x}), \tilde{\mathrm{x}}\sim q_{\sigma}(\tilde{\mathrm{x}}|\mathrm{x})}\left[\left\| s_{\theta}(\tilde{\mathrm{x}})-\nabla_{\tilde{\mathrm{x}}} \log q_{\sigma}(\tilde{\mathrm{x}}|\mathrm{x})\right\|_2^2\right]\tag{12}$

其中 $\nabla_{\tilde{\mathrm{x}}} \log q_{\sigma}(\tilde{\mathrm{x}}|\mathrm{x})$ 可以很方便的计算

$\begin{aligned}q_{\sigma}(\tilde{\mathrm{x}}|\mathrm{x}) &= \frac{1}{\sqrt{2\pi} \sigma } \exp \left ({- \frac{(\tilde{\mathrm{x}} - \mathrm{x})^2}{2\sigma^2}} \right) \\ \log q_{\sigma}(\tilde{\mathrm{x}}|\mathrm{x}) &= \log {\frac{1}{\sqrt{2\pi} \sigma }} + \log \exp \left ({- \frac{(\tilde{\mathrm{x}} - \mathrm{x})^2}{2\sigma^2}} \right) \\ \log q_{\sigma}(\tilde{\mathrm{x}}|\mathrm{x}) &= \log {\frac{1}{\sqrt{2\pi} \sigma }} - { \frac{(\tilde{\mathrm{x}} - \mathrm{x})^2}{2\sigma^2}} \\ \nabla_{\tilde{\mathrm{x}}} \log q_{\sigma}(\tilde{\mathrm{x}}|\mathrm{x}) &= \nabla_{\tilde{\mathrm{x}}}\log {\frac{1}{\sqrt{2\pi} \sigma }} - \nabla_{\tilde{\mathrm{x}}}{ \frac{(\tilde{\mathrm{x}} - \mathrm{x})^2}{2\sigma^2}} \\ \nabla_{\tilde{\mathrm{x}}} \log q_{\sigma}(\tilde{\mathrm{x}}|\mathrm{x}) &= -{ \frac{\tilde{\mathrm{x}} - \mathrm{x}}{\sigma^2}} \end{aligned} \tag{13}$

带入式12，得

$\min_{\theta} \frac{1}{2} \mathbb{E}_{\mathrm{x} \sim p_{\text {data}}(\mathrm{x}), \tilde{\mathrm{x}}\sim q_{\sigma}(\tilde{\mathrm{x}}|\mathrm{x})}\big[\| s_{\theta}(\tilde{\mathrm{x}})+{ \frac{\tilde{\mathrm{x}} - \mathrm{x}}{\sigma^2}} \|_2^2\big]\tag{14}$

从式14可见，denoising score matching同样解决了原生score matching高维Jacobian matrix计算复杂度高的问题。但Denoising score matching还有个问题，通过式14的优化目标我们训练的是 $\tilde{\mathrm{x}}$ 的score based model， $\tilde{\mathrm{x}}$ 是有噪声的，用包含噪声的score来近似无噪声的score对生成的质量肯定会有影响。虽然我们可以降低所添加噪声的强度来让 $\tilde{\mathrm{x}} \simeq \mathrm{x}$ ，当 $\sigma \rightarrow 0$ 时 $\tilde{\mathrm{x}} \rightarrow \mathrm{x}$ 。但是当 $\sigma \rightarrow 0$ 时， $\nabla_{\tilde{\mathrm{x}}} \log q_{\sigma}(\tilde{\mathrm{x}}|\mathrm{x}) \rightarrow \infin \in \mathbb{R}^d$ 此时会出现梯度爆炸。实际中需要tradeoff生成质量和训练稳定性来决定 $\sigma$ 选取的大小。

2.3.4 score matching, slice score matching, denoising score matching效果对比

从slice score matching paper[9]中我们可以得出一下结论：

出当数据的维度增加时，score matching的Jacobian matrix 的计算开销增加，此时SSM和DSM的训练效率明显优于原生score matching。

在这里插入图片描述

从performance的角度看，原生score matching的效果是最好的，slice score matching次之，denoising score matching最弱。

在这里插入图片描述

通过SSM或DSM ****两个方法我们均可以相对高效的训练score based model。那么，如何利用训练好的score based model进行图片生成呢？答案就是朗之万动力学（langevin dynamics）。

2.4 How to Sample from trained Score-Based Model

通过前面的训练，我们得到训练好的score- based model $s_{\theta ^{\star}}(\mathrm{x}), \theta ^{\star} \in \Theta$ 是模型的参数( $\Theta$ 是模型的参数空间)。通过 $s_{\theta ^{\star}}(\mathrm{x})$ 我们可以得到当前数据在概率密度增长最快的方向。即： $p_{data}{(\mathrm{x})} < p_{data}{(\mathrm{x} + \lambda s_{\theta ^ {\star}}(\mathrm{x}) )}$ ， $\lambda$ 可以考虑为更新的步长。

这样我们可以得出最naive的采样方法（本质上就是梯度迭代算法）：

$\underbrace{\mathrm{x}}_{=\mathrm{x}^0} \rightarrow \underbrace{ \mathrm{x^{0}} + \lambda^{0} s_{\theta ^ {\star}}(\mathrm{x^{0}}) }_{=\mathrm{x}^{1}}\rightarrow \underbrace{ \mathrm{x^{1}} + \lambda^{1} s_{\theta ^ {\star}}(\mathrm{x^{1}})}_{=\mathrm{x}^{2}} \rightarrow \cdots \rightarrow \underbrace{ \mathrm{x^{T-1}} + \lambda^{T-1} s_{\theta ^ {\star}}(\mathrm{x^{T-1}})}_{=\mathrm{x}^{T}} \tag{15}$

理想情况下： $p_{data}(\mathrm{x}^{0}) < p_{data}(\mathrm{x}^{1}) < \cdots < p_{data}(\mathrm{x}^{T})$

$\mathrm{x}^{T}$ 就是我们采样生成的数据。初始数据 $\mathrm{x}^{0}$ 可以从某一个已知分布 $\mathrm{x}^{0} \sim \pi (\mathrm{x})$ 进行采样。

2.4.1 Langevin Dynamics Sampling

此处不介绍Langevin Dynamics Sampling 的提出背景。朗之万采样的算法流程为：

$\begin{aligned} \hline &\text{Initialize} \quad \mathrm{x}^{0} \sim \pi (\mathrm{x}) \\ & \text{Repeat for } t \leftarrow 1, 2, \cdots, T \\ & \quad \quad z^{t} \sim \mathcal{N} (\mathbf{0}, \mathbf{I}) \\ & \quad \quad \mathrm{x^t} = \mathrm{x^t} + \frac{\epsilon}{2} \underbrace{ \nabla _{\mathrm{x}} \log p_{data}(\mathrm{x}^{t-1}) } _{\simeq s_{\theta ^ {\star}} (\mathrm{x^{t-1}}) } + \sqrt{\epsilon}z^{t} \\ & \text{If }\epsilon \rightarrow 0 \text{ and } T \rightarrow \infin \text{, we are guaranteed to have }\mathrm{x}^{T} \sim p_{data}(\mathrm{x}) \\ \hline \end{aligned}$

Langevin Dynamics Sampling和naive的梯度迭代法很相似，只是每一次采样中引入了一个随机项。

2.5 Challenges of score-based generative modeling

通过score-matching + Langevin dynamics sampling我们已经完成了用score-based model做生成的完整理论推导。但将其进行实际应用时，效果并不理想。下图为作者在CIFAR-10数据上的实验效果。作者认为这主要是由于score估计不准确导致，有以下两个因素导致我们无法估计出准确的score

image from song yang’s tutorial

2.5.1 The manifold hypothesis（流形假设）

根据流形假设的观点，我们观察到的数据实际上是由一个低维流形映射到高维空间（ambient space）。从流形假设的观点看，score based generative model面临两个挑战：

1） $s_{\theta^\star}(\mathrm{x})$ 估计的是高维空间的梯度，当 $\mathrm{x}$ 被限制在低维流形上时，在高维空间（ambient space）中 $\nabla_{\mathrm{x}} \log p_{data}(\mathrm{x})$ 有很多方向对 $\mathrm{x}$ 是没有意义的。（直白来说就是 $\mathrm{x}$ 有很多冗余的维度，在冗余维度上估计的score是不准确的。）

2）只有当数据分布 $p_{data}(\mathrm{x})$ 能够支撑整个高位空间时，根据式4的优化才能提供一个一致的估计量。这意味着,如果数据仅分布在低维流形上,而不是充满整个高维空间,那么用分数匹配目标来估计分数将会有偏差,导致估计不准确。

在训练过程的训练loss波动大也从一个侧面表现了流形假设引起的score估计不准确。

image from paper[4]， slice score matching训练过程中的loss变化趋势

2.5.2 Low data density regions （数据低密度区域）

假定原始数据分布 $p_{data}({\mathrm{x}})$ 由两个高斯分布组成。若直接以 $\mathrm{x} \sim p_{data}({\mathrm{x}})$ 进行采样，低密度区域样本很难被采样到，这导致 $s_{\theta ^ \star}({\mathrm{x}})$ 没有充分拟合低密度区域的数据，难以估计出此区域数据准确的score。

image from paper[4]

2.5.3 如何估计准确的score

回顾一下，作者发现score based generate model效果不好主要由两个原因造成：

1）根据流形假设，原始数据是高维空间的低维流形，无效维度的score会导致训练不稳定。

2）根据数据分布的特点，低密度区域的数据没有得到充分训练，导致无法估计出准确的score。

作者发现，只需给原始数据添加随机扰动的高斯噪声 $\tilde{\mathrm{x}} = \mathrm{x}+\sigma z$ ，就能很好的缓解上述两个问题。针对第一个问题，由于高斯分布的噪声能够充满整个高维空间（是满秩的），因此当原始数据添加了高斯噪声后形成的样本空间也是满秩的，从而解决原始数据低维流形的限制。

其次，由于原始数据被高斯噪声扰动，使其在样本空间的分布更加均匀，低密度区域变小，从而让低密度区域的数据得到更多的采样，带来更多的训练信号。

细心的同学发现了，加噪的做法和2.3.3节Denoising score matching（DSM）的核心思路很像。对于DSM来说我们需要保证训练稳定的情形下，尽可能的降低添加噪声的强度，以此来保证 $s_{\theta}(\mathrm{\tilde{x}}) \simeq \nabla _{\mathrm{x}} \log p_{data}(\mathrm{x})$ 。但是，添加的噪声强度低了，流形假设和低密度数据分布的问题依旧无法很好被解决，这导致 $\nabla _{\mathrm{x}} \log p_{data}(\mathrm{x})$ 无法被很好的估计。对问题再次概括：

我们需要添加较大噪声，确保score model在低密度区域也能估计出正确的score，但是这估计的score 是 $\tilde{\mathrm{x}}$ 的，噪声较大我们无法保证 $s_{\theta}(\mathrm{\tilde{x}}) \simeq \nabla _{\mathrm{x}} \log p_{data}(\mathrm{x})$ 。如何解决这个问题呢？

2.6 Noise-conditional Score Network（`NCSN`）——Training and Sampling

通过之前的讨论我们知道，当 $\sigma_{1} > \sigma_{2} > \cdots > \sigma_{L} \simeq 0$ 时

$\sigma$ 越大score估计的越准确（意味着误差越小）

$\begin{aligned} &\mathbb{E}_{\tilde{\mathrm{x} } \sim q_{\sigma_{L}}(\mathrm{\tilde{x}}), \mathrm{x}\sim p_{data}(\mathrm{x})}\| s_{\theta_L}(\mathrm{x}, \sigma_L) - \nabla_{\tilde{\mathrm{x}}} \log q_{\sigma_{L}}(\mathrm{\tilde{x}})\|_{2}^{2} > \\ &\mathbb{E}_{\tilde{\mathrm{x} } \sim q_{\sigma_{L-1}}(\mathrm{\tilde{x}}), \mathrm{x}\sim p_{data}(\mathrm{x})}\| s_{\theta_{L-1}}(\mathrm{x}, \sigma_{L-1}) - \nabla_{\tilde{\mathrm{x}}}\log q_{\sigma_{L-1}}(\mathrm{\tilde{x}})\|_{2}^{2} > \\ &>\cdots> \\ &\mathbb{E}_{\tilde{\mathrm{x} } \sim q_{\sigma_{1}}(\mathrm{\tilde{x}}), \mathrm{x}\sim p_{data}(\mathrm{x})}\| s_{\theta_1}(\mathrm{x}, \sigma_1) - \nabla_{\tilde{\mathrm{x}}}\log q_{\sigma_{1}}(\mathrm{\tilde{x}})\|_{2}^{2} \end{aligned} \tag{16}$

$\sigma$ 越大，noise score离真实数据的score越远

$\begin{aligned} &\mathbb{E}_{\tilde{\mathrm{x} } \sim q_{\sigma_{L}}(\mathrm{\tilde{x}}), \mathrm{x}\sim p_{data}(\mathrm{x})}\| \underbrace{\nabla_{\mathrm{\tilde{x}}}\log q_{\sigma{L}}(\mathrm{\tilde{x}})}_{\simeq s_{\theta_{L}(\mathrm{x}, \sigma_{L}})} - \nabla_{\mathrm{x}}\log p_{data}(\mathrm{x})\|_{2}^{2} < \\ &\mathbb{E}_{\tilde{\mathrm{x} } \sim q_{\sigma_{L-1}}(\mathrm{\tilde{x}}), \mathrm{x}\sim p_{data}(\mathrm{x})}\| \underbrace{\nabla_{\mathrm{\tilde{x}}}\log q_{\sigma{L-1}}(\mathrm{\tilde{x}})}_{\simeq s_{\theta_{L-1}(\mathrm{x}, \sigma_{L-1}})} - \nabla_{\mathrm{x}}\log p_{data}(\mathrm{x})\|_{2}^{2} < \\ &< \cdots < \\ &\mathbb{E}_{\tilde{\mathrm{x} } \sim q_{\sigma_{1}}(\mathrm{\tilde{x}}), \mathrm{x}\sim p_{data}(\mathrm{x})}\| \underbrace{\nabla_{\mathrm{\tilde{x}}}\log q_{\sigma{1}}(\mathrm{\tilde{x}})}_{\simeq s_{\theta_{1}(\mathrm{x}, \sigma_{1}})} - \nabla_{\mathrm{x}}\log p_{data}(\mathrm{x})\|_{2}^{2} \end{aligned} \tag{17}$

既然用一种噪声强度我们无法tradeoff准确的score和生成质量，那么为什么不用多组噪声强度渐进式的估计真实数据的score呢。先通过强噪声训练的score-based model来引导大方向，走向高密度区域，随后逐级用更小强度的score-based model进行微调方向，最终逼近真实数据分布的score。

为了完成上述过程，首先需要训练不同强度下的score-based model。

2.6.1 Noise-conditional Score Network（`NCSN`）

定义:

$\{\sigma_i\}_{i=1}^{L}$ 为正几何序列，满足 $\frac{\sigma_{1}}{\sigma_{2}} = \frac{\sigma_{2}}{\sigma_{3}} = \cdots \frac{\sigma_{L-1}} {\sigma_{L}} = k > 1$ ， $\sigma_{L} \simeq 0$ . 不难得出

$\sigma_i = k^{L - i} \sigma_L \tag{18}$

$q_{\sigma_i}({\tilde{\mathrm{x}}}) = \int q_{\sigma_i}(\tilde{\mathrm{x}}|\mathrm{x}) p_{data}(\mathrm{x})d\mathrm{x} = \int \mathcal{N}(\mathbf{0}, \sigma_{i}^2\mathbf{I}) p_{data}(\mathrm{x})d\mathrm{x}$

参考2.3.3的式11，我们很容易写出噪声强度为 $\sigma_{i}$ 时的score matching优化目标

$\mathcal{l}(\theta; \sigma_i) = \frac{1}{2} \mathbb{E}_{\tilde{\mathrm{x}}\sim q_{\sigma_i}({\tilde{\mathrm{x}}}), \mathrm{x}\sim p_{data}({\mathrm{x})}} \left[\left\| \nabla_{\mathrm{x}} \log q_{\sigma_i}({\tilde{\mathrm{x}}}) -s_{\theta}(\mathrm{x}, \sigma_{i})\right\|_2^2\right]\tag{19}$

对于所有的 $\sigma \in \{\sigma_i\}_{i=1}^{L}$ 我们可以得到联合优化目标

$\begin{aligned} \mathcal{L}(\theta; \{\sigma_i\}_{i=1}^{L}) &= \sum_{i=1}^{L} \lambda (\sigma_{i}) \mathcal{l}(\theta; \sigma_i) \\ & = \frac{1}{2} \sum_{i=1}^{L} \lambda (\sigma_{i}) \mathbb{E}_{\tilde{\mathrm{x}}\sim q_{\sigma_i}({\tilde{\mathrm{x}}}), \mathrm{x}\sim p_{data}({\mathrm{x})}} \left[\left\| \nabla_{\tilde{\mathrm{x}}} \log q_{\sigma_i}({\tilde{\mathrm{x}}}) -s_{\theta}(\mathrm{x}, \sigma_{i})\right\|_2^2\right] \end{aligned} \tag{20}$

$\lambda (\sigma_{i}) > 0$ 是一个超参数，表示我们可以对不同的 $\sigma_i$ 设置不同的权重。作者实验发现 $\|s_{\theta}(\mathrm{x}, \sigma)\|_{2} \propto \frac{1}{\sigma}$ ，为了均衡不同噪声强度下模型参数更新的梯度，作者设置 $\lambda (\sigma_{i}) = \sigma_i^2$ ，此时

$\begin{aligned} \mathcal{L}(\theta; \{\sigma_i\}_{i=1}^{L}) & = \frac{1}{2} \sum_{i=1}^{L} \sigma_{i}^2 \mathbb{E}_{\tilde{\mathrm{x}}\sim q_{\sigma_i}({\tilde{\mathrm{x}}}), \mathrm{x}\sim p_{data}({\mathrm{x})}} \left[\left\| \nabla_{\tilde{\mathrm{x}}} \log q_{\sigma_i}({\tilde{\mathrm{x}}}) -s_{\theta}(\mathrm{x}, \sigma_{i})\right\|_2^2\right] \end{aligned} \tag{21}$

作者表明，我们可以用slice score matching或denoising score matching的方法来优化上述目标，兼顾训练速度和精度。

2.6.2 Annealed Langevin dynamics

参考模拟退火[11](simulated annealing)和退火重要性抽样(annealed importance sampling)[12]的算法思路。作者提出的退火朗之万采样算法流程如下：

$\begin{aligned} \hline &\text{Initialize} \quad \mathrm{\tilde{x}}_{1}^{0} \sim \pi (\mathrm{x}) \\ & \text{for } i \leftarrow 1, 2, \cdots, L \text{ do} \\ & \quad \quad \alpha_{i} \leftarrow \epsilon \left(\frac{\sigma_i}{\sigma_L}\right)^2 \\ & \quad \quad \text{for } t \leftarrow 1, 2, \cdots, T \text{ do}\\ & \quad \quad \quad \quad z^{t} \sim \mathcal{N} (\mathbf{0}, \mathbf{I}) \\ & \quad \quad \quad \quad \mathrm{\tilde{x}}_{i}^t \leftarrow \mathrm{\tilde{x}}_{i}^{t-1} + \underbrace{\frac{\alpha_i}{2} s_{\theta ^ {\star}} (\mathrm{\tilde{x}}_{i}^{t-1}, \sigma_i) }_{\text{signal item}} + \underbrace{\sqrt{\alpha_i}z^{t}}_{\text{noise item}} \\ & \quad \quad \text{end for } \\ & \quad \quad \mathrm{\tilde{x}}_{i+1}^{0} \leftarrow\mathrm{\tilde{x}}_{i}^{T} \\ & \text{end for} \\ & \text{Return }x_{L}^{T} \\ \hline \end{aligned}$

从某一先验分布（高斯分布、均匀分布等）采样得到初始值 $\mathrm{\tilde{x}}^{0}_{1}$ 。进入两层循环，第一层循环给定由大到小的噪声强度序列，第二层循环为原生的朗之万采样。噪声强度大时，低密度区域的score估计准确，并且此时离目标位置较远，采用相对较大的更新步长（ $\alpha_i / 2$ ），通过逐级降噪和朗之万采样让初始点 $\tilde{\mathrm{x}}^{0}_{1}$ 逐步向高密度区域靠近，直至完成最后的生成。

$\tilde{\mathrm{x}}^{0}_{1}\stackrel{朗之万采样}\rightarrow \tilde{\mathrm{x}}^{T}_{1} = \tilde{\mathrm{x}}^{0}_{2}\stackrel{朗之万采样}\rightarrow \tilde{\mathrm{x}}^{T}_{2}=\tilde{\mathrm{x}}^{0}_{3} \rightarrow \cdots \rightarrow \tilde{\mathrm{x}}^{T}_{L} \tag{22}$

更新步长 $\alpha_{i}$ 的选取是一个超参数。本文中，作者让 $\alpha_i \propto \sigma_i^2$ ，核心动机时为了让不同噪声强度，朗之万采样的信噪比(signal-to-noise ratio, SNR)强度维持一致。（有点困惑为什么要这样做，论文中并未给出详细说明，是否有相关的理论依据呢，希望有懂的大佬解惑）

$\mathrm{SNR} = \frac{\text{signal}}{\text{noise}} = \frac{\frac{\alpha_i}{2} s_{\theta ^ {\star}} (\mathrm{\tilde{x}}_{i}^{t-1}, \sigma_i) }{\sqrt{\alpha_i}z^{t}} \tag{23}$

噪声强度为 $\sigma_i$ 时，朗之万采样信噪比强度的期望

$\mathbb{E}\| \frac{\frac{\alpha_i}{2} s_{\theta ^ {\star}} (\mathrm{\tilde{x}}_{i}, \sigma_i) }{\sqrt{\alpha_i}z}\|_{2}^{2} = \mathbb{E}\| \frac{\alpha_i s_{\theta ^ {\star}} (\mathrm{\tilde{x}}_{i}, \sigma_i) }{4z}\|_{2}^{2} \stackrel{\| z\|_{2}^{2} =1} \simeq \frac{1}{4}\mathbb{E}\| \alpha_i s_{\theta ^ {\star}} (\mathrm{\tilde{x}}_{i}, \sigma_i) \|_{2}^{2}\tag{24}$

根据之前的实验结论，我们知道 $\|s_{\theta}(\mathrm{x}, \sigma)\|_{2} \propto \frac{1}{\sigma}$ ，因此当 $\alpha_i \propto \sigma_i^2$ 时，上式为常数与 $\sigma_i$ 无关，从而满足不同噪声强度下的朗之万采样的信噪比的强度相同。

以上就是score-based generative model的全部内容了。

通过训练好score-based model结合退火朗之万采样，在生成任务上Inception score首次打败了GAN。

在这里插入图片描述

3 conditional-based generate

在第二节中，通过训练好的score-based generative model 我们可以在任何先验分布进行采样，通过退火的朗之万采样算法生成符合原始数据分布的样本，即

$\mathrm{x}_{1}^{0} \sim \pi_{prior}(\mathrm{x}) \stackrel{\mathrm{SBGM}} \rightarrow \mathrm{x}^{T}_{L} \sim p_{data}(\mathrm{x}) \tag{25}$

但我们实际中，其实更想要 $p_{data}(\mathrm{x|y})$ 而不是 $p_{data}(\mathrm{x})$ 。下面来看如何用condition $y$ 对生成过程进行控制。

conditional control有两种形式：

一类是classifier-guided generate，此类方法无需重新训练score network，只需在原有数据分布上训练一个分类器（此时的condition就是类别），即可引导生成特定类别的图片。
另一类是classifier-free generate，此类方法需要额外训练一个conditional score network。

3.1 classifier-guided generate

详细内容可见我之前的文档：

http://myhz0606.com/article/guided

3.2 classifier-free generate

详细内容可见我之前的文档：

http://myhz0606.com/article/classifier_free

4 小结

score based model将对概率分布的估计转为对 $\nabla_{\mathrm{x}} \log p_{data}(\mathrm{x})$ (称之为score)的估计规避对归一化参数 $Z_{\theta}$ 的计算，以此摆脱网络架构的限制。

但是，由于**原始数据分布 $p_{data}(\mathrm{x})$ 不可知，无法直接用fisher divergence（式4）**训练score based model模型，需要转为score matching 的优化形式（式6）。

对于图片而言，数据维度过大，原生score matching需要计算**Jacobian matrix**，计算开销非常大，需要结合slice score matching（SSM）或denoising score matching（DSM）的方法减少计算开销。

虽然通过以上手段我们可以相对高效的训练score-based model 。但由于流形假设和数据低密度区域的影响，导致模型估计的score 不准确。作者通过对原始数据加噪的方式缓解上述两个问题，并提出退火的朗之万采样算法完成高质量的样本生成。

参考文献

[1] Estimation of non-normalized statistical models by score matching

[2] https://yang-song.net/blog/2021/score/

[3] Generative adversarial nets

[4] How to Train Your Energy-Based Models.

[5] A tutorial on energy-based learning

[6] NICE: Non-linear independent components estimation

[7] variational auto-encoders

[8] https://www.youtube.com/watch?v=wMmqCMwuM2Q

[9] sliced score matching

[10] denoising score matching

[11] Optimization by simulated annealing.

[12] Annealed importance sampling.

附录

文档字数过多发不出去，拆分到另一篇博客
https://blog.csdn.net/weixin_40779727/article/details/139332940?spm=1001.2014.3001.5502

莫叶何竹

关注

12
点赞
踩
9

收藏

觉得还不错? 一键收藏
0
评论
diffusion model(十七)Score视角下的生成模型（Score-Based Generative Models）

infopaperGenerative Modeling by Estimating Gradients of the Data Distributionsongyang 大佬博客https://yang-song.net/blog/2021/score/song yang 大佬视频讲解https://www.youtube.com/watch?v=wMmqCMwuM2QGithubhttps://github.com/ermongroup/ncsn个人博
复制链接

扫一扫