扩散模型相关论文阅读，扩散模型和知识蒸馏的结合提升预测速度：Progressive Distillation for Fast Sampling of Diffusion Models

CUHK-SZ-relu

已于 2023-04-04 14:22:41 修改

阅读量5.9k

点赞数 6

文章标签：论文阅读人工智能深度学习

于 2023-04-04 14:14:05 首次发布

本文链接：https://blog.csdn.net/qq_43210957/article/details/129948059

版权

文章介绍了扩散模型在生成模型中的出色表现，但指出其预测速度慢的问题。作者提出一种名为逐步蒸馏的技术，能够将预训练的扩散模型行为转换为需要更少步骤的新模型，从而显著减少采样时间。在实验中，他们能将采样步骤从8192降低到4步，而不牺牲感知质量。此外，整个渐进蒸馏过程并不增加额外的训练时间，为扩散模型的实用性和效率提供了改进方案。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

论文地址及代码

谷歌research的成果，ICLR 2022
https://arxiv.org/abs/2202.00512
tenserflow官方开源代码： https://github.com/google-research/google-research/tree/master/diffusion_distillation
pytorch非官方代码：https://github.com/lucidrains/imagen-pytorch

速览

主要解决的问题—扩散模型预测慢

1.扩散模型虽然取得了很好的效果，但是预测速度慢。
2.作者提出了一种逐步蒸馏的方式，如下图：

在这里插入图片描述

0.Abstruct

0.1 逐句翻译

Diffusion models have recently shown great promise for generative modeling, out- performing GANs on perceptual quality and autoregressive models at density es- timation. A remaining downside is their slow sampling time: generating high quality samples takes many hundreds or thousands of model evaluations. Here we make two contributions to help eliminate this downside: First, we present new parameterizations of diffusion models that provide increased stability when using few sampling steps. Second, we present a method to distill a trained deterministic diffusion sampler, using many steps, into a new diffusion model that takes half as many sampling steps. We then keep progressively applying this distillation proce- dure to our model, halving the number of required sampling steps each time. On standard image generation benchmarks like CIFAR-10, ImageNet, and LSUN, we start out with state-of-the-art samplers taking as many as 8192 steps, and are able to distill down to models taking as few as 4 steps without losing much perceptual quality; achieving, for example, a FID of 3.0 on CIFAR-10 in 4 steps. Finally, we show that the full progressive distillation procedure does not take more time than it takes to train the original model, thus representing an efficient solution for generative modeling using diffusion at both train and test time.

扩散模型最近在生成模型方面表现出了很大的优点，在感知质量和密度估计方面比 GAN 表现更好，而缺点是采样时间较长：生成高质量的样本需要数百或数千个模型评估。在这里，我们提出了两个贡献来帮助消除这个缺点：首先，我们提出了新的扩散模型参数化方式，在使用少量采样步骤时可以提供更高的稳定性。其次，我们提出了一种方法，可以将训练好的确定性扩散采样器通过许多步骤转换为新的扩散模型，只需要一半左右的采样步骤。然后我们逐步应用这个蒸馏过程到我们的模型中，每次将所需的采样步骤减半。在标准图像生成基准测试如 CIFAR-10、ImageNet 和 LSUN 中，我们开始使用高达 8192 个步骤的最新采样器，并能够将其蒸馏到只需要 4 个步骤的模型中，而不会丢失太多感知质量。例如，在 CIFAR-10 中，我们在 4 个步骤内实现了 FID=3.0。最后，我们展示了完整的逐步蒸馏过程并不比训练原始模型花费更多时间，因此它是一种在训练和测试时同时使用扩散模型的有效解决方案。

总结

主要是两个创新点，

1.提出了一种可以简化步骤的方法
2.提出了一种知识蒸馏的方法，可以把更高的迭代次数优化为更低的迭代次数。

1.INTRODUCTION

1.1逐句翻译

第一段（扩散模型在各个方面取得很好的成果）

Diffusion models (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020) are an emerg- ing class of generative models that has recently delivered impressive results on many standard gen- erative modeling benchmarks. These models have achieved ImageNet generation results outper- forming BigGAN-deep and VQ-VAE-2 in terms of FID score and classification accuracy score (Ho et al., 2021; Dhariwal & Nichol, 2021), and they have achieved likelihoods outperforming autore- gressive image models (Kingma et al., 2021; Song et al., 2021b). They have also succeeded in image super-resolution (Saharia et al., 2021; Li et al., 2021) and image inpainting (Song et al., 2021c), and there have been promising results in shape generation (Cai et al., 2020), graph generation (Niu et al., 2020), and text generation (Hoogeboom et al., 2021; Austin et al., 2021).
扩散模型 (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020) 是一种新兴的生成模型类群，最近在多个标准生成模型基准测试中取得了令人印象深刻的结果。这些模型在 FID 得分和分类准确率得分方面实现了 ImageNet 生成结果，远优于 BigGAN-deep 和 VQ-VAE-2(Ho et al., 2021; Dhariwal & Nichol, 2021),并且在生成概率方面比自回归图像模型表现更好 (Kingma et al., 2021; Song et al., 2021b)。它们还成功地实现了图像超分辨率 (Saharia et al., 2021; Li et al., 2021) 和图像修复 (Song et al., 2021c),并且在形状生成 (Cai et al., 2020)、图生成 (Niu et al., 2020) 和文本生成 (Hoogeboom et al., 2021; Austin et al., 2021) 等领域取得了有前途的结果。

第二段（提出扩散模型预测慢的问题）

A major barrier remains to practical adoption of diffusion models: sampling speed. While sam- pling can be accomplished in relatively few steps in strongly conditioned settings, such as text-to- speech (Chen et al., 2021) and image super-resolution (Saharia et al., 2021), or when guiding the sampler using an auxiliary classifier (Dhariwal & Nichol, 2021), the situation is substantially differ- ent in settings in which there is less conditioning information available. Examples of such settings are unconditional and standard class-conditional image generation, which currently require hundreds or thousands of steps using network evaluations that are not amenable to the caching optimizations of other types of generative models (Ramachandran et al., 2017).

然而，扩散模型的实际应用仍然存在一个主要障碍：采样速度。虽然在强条件设置下，如文本到语音 (Chen et al., 2021) 和图像超分辨率 (Saharia et al., 2021),采样可以在相对较短的时间内完成，或者通过辅助分类器指导采样 (Dhariwal & Nichol, 2021),但在条件较少的设置下，情况则完全不同。例如，无条件和标准条件的图像生成目前需要使用网络评估，这些评估不受其他类型生成模型的缓存优化 (Ramachandran et al., 2017)。

第三段（作者提出自己的想法）

In this paper, we reduce the sampling time of diffusion models by orders of magnitude in uncondi- tional and class-conditional image generation, which represent the setting in which diffusion models have been slowest in previous work. We present a procedure to distill the behavior of a N -step DDIM sampler (Song et al., 2021a) for a pretrained diffusion model into a new model with N/2 steps, with little degradation in sample quality. In what we call progressive distillation, we repeat this distilla- tion procedure to produce models that generate in as few as 4 steps, still maintaining sample quality competitive with state-of-the-art models using thousands of steps.

在本文中，我们显著提高了扩散模型在无条件和标准条件图像生成方面的采样速度，这是扩散模型在以前的工作中最慢的设置之一。我们提出了一种方法，将一个预训练的扩散模型的 N 步 DDIM 采样器的行为蒸馏到一个新的模型中，只需要 N/2 步，而样本质量几乎没有下降。我们称之为逐步蒸馏，我们将这种方法重复应用到我们的模型中，以产生只需 4 步即可生成图像的模型，并保持与使用数千步的最先进的模型相当的样本质量。

在这里插入图片描述

文字说明

Figure 1: A visualization of two iterations of our proposed progressive distillation algorithm. A sampler f(z;η), mapping random noise ε to samples x in 4 deterministic steps, is distilled into a new sampler f(z;θ) taking only a single step. The original sampler is derived by approximately integrating the probability flow ODE for a learned diffusion model, and distillation can thus be understood as learning to integrate in fewer steps, or amortizing this integration into the new sampler.

Figure 1: 一种可视化了我们提出的渐进式蒸馏算法的两个迭代。一个采样器 f(z;η),将随机噪声ε映射到 4 个确定的采样 x，被蒸馏成一个仅需要一个步骤的新采样器 f(z;θ)。原始的采样器是通过近似积分学习扩散模型的概率流常微分方程 (ODE) 得出的，因此蒸馏可以被理解为在更少的步骤中学习积分，或者将积分分摊到新采样器中。

1.2总结

1.扩散模型虽然取得了很好的效果，但是预测速度慢。
2.作者提出了一种逐步蒸馏的方式，如下图：

在这里插入图片描述

3 PROGRESSIVE DISTILLATION

第一段（简单介绍如何蒸馏减少步数）

To make diffusion models more efficient at sampling time, we propose progressive distillation: an algorithm that iteratively halves the number of required sampling steps by distilling a slow teacher diffusion model into a faster student model. Our implementation of progressive distillation stays very close to the implementation for training the original diffusion model, as described by e.g. Ho et al. (2020). Algorithm 1 and Algorithm 2 present diffusion model training and progressive distillation side-by-side, with the relative changes in progressive distillation highlighted in green.
为了在采样时提高扩散模型的效率，我们提出了渐进式蒸馏算法：一种迭代地减半所需采样步骤的算法，通过将慢速教师扩散模型蒸馏成更快的学生模型来实现。我们实现的渐进式蒸馏算法与训练原始扩散模型的实现非常相似，例如 Ho 等 (2020) 所述。算法 1 和算法 2 同时展示了扩散模型训练和渐进式蒸馏，其中渐进式蒸馏的相对变化被绿色突出显示。

在这里插入图片描述

第二段

We start the progressive distillation procedure with a teacher diffusion model that is obtained by training in the standard way. At every iteration of progressive distillation, we then initialize the student model with a copy of the teacher, using both the same parameters and same model definition. Like in standard training, we then sample data from the training set and add noise to it, before forming the training loss by applying the student denoising model to this noisy data zt. The main difference in progressive distillation is in how we set the target for the denoising model: instead of the original data x, we have the student model denoise towards a target x ̃ that makes a single student DDIM step match 2 teacher DDIM steps. We calculate this target value by running 2 DDIM sampling steps using the teacher, starting from zt and ending at zt−1/N , with N being the number of student sampling steps. By inverting a single step of DDIM, we then calculate the value the student model would need to predict in order to move from zt to zt−1/N in a single step, as we show in detail in Appendix G. The resulting target value x ̃(zt) is fully determined given the teacher model and starting point zt, which allows the student model to make a sharp prediction when evaluated at zt. In contrast, the original data point x is not fully determined given zt, since multiple different data points x can produce the same noisy data zt: this means that the original denoising model is predicting a weighted average of possible x values, which produces a blurry prediction. By making sharper predictions, the student model can make faster progress during sampling.

我们开始渐进式蒸馏过程，使用通过标准训练得到的教师扩散模型开始。在每次迭代中，然后我们初始化学生模型，使用与教师相同的参数和模型定义，从训练集中采样数据并添加噪声，然后在使用学生去噪模型对这个噪声数据 zt 应用时形成训练损失。渐进式蒸馏的主要区别在于我们如何设置去噪模型的目标：而不是原始数据 x，我们让学生模型去噪以目标 x ̃使单个学生 DDIM 步等于 2 个教师 DDIM 步。我们计算这个目标值，通过使用教师运行 2 个 DDIM 采样步骤，从 zt 开始并结束于 zt-1/N,其中 N 是学生采样步骤的数量。通过逆推一个 DDIM 步，我们然后计算学生模型需要在一步中从 zt 移动到 zt-1/N 所需的预测值，具体细节在附录 G 中展示。结果的目标值 x ̃(zt) 在教师模型和起始点 zt 给定的情况下是完全确定的，这允许学生模型在评估 zt 时做出清晰的预测。相比之下，给定 zt 的原始数据点 x 没有完全确定，因为多个不同的数据点 x 可以产生相同的噪声数据 zt:这意味着原始去噪模型预测加权平均可能 x 值，这产生模糊的预测。通过做出更清晰的预测，学生模型在采样时能够更快地进展。

第三段（继续描述这个迭代可以不断递归使用，学生变成新的老师）

After running distillation to learn a student model taking N sampling steps, we can repeat the pro- cedure with N/2 steps: The student model then becomes the new teacher, and a new student model is initialized by making a copy of this model.

运行蒸馏以学习采取 N 采样步骤的学生模型后，我们可以重复该过程以进行 N/2 步骤：学生模型将成为新的老师，并通过复制该模型初始化一个新的学生模型。

第四段（这里调整Alph1为0真的没看懂，得看看代码）

Unlike our procedure for training the original model, we always run progressive distillation in dis- crete time: we sample this discrete time such that the highest time index corresponds to a signal-to- noise ratio of zero, i.e. α1 = 0, which exactly matches the distribution of input noise z1 ∼ N (0, I) that is used at test time. We found this to work slightly better than starting from a non-zero signal- to-noise ratio as used by e.g. Ho et al. (2020), both for training the original model as well as when performing progressive distillation.

不同于我们训练原始模型的流程，我们总是在离散时间中进行渐进式蒸馏：我们采样离散时间，使得最高时间索引对应信号噪声比为零，即α1=0，这完全匹配测试时使用的输入噪声 z1∼N(0,I) 的分布。我们发现，这种方法相对于例如 Ho 等 (2020) 使用非零信号噪声比开始训练和进行渐进式蒸馏时，在训练原始模型和进行渐进式蒸馏时表现更好。