扩散模型相关论文阅读,扩散模型和知识蒸馏的结合提升预测速度:Progressive Distillation for Fast Sampling of Diffusion Models

文章介绍了扩散模型在生成模型中的出色表现,但指出其预测速度慢的问题。作者提出一种名为逐步蒸馏的技术,能够将预训练的扩散模型行为转换为需要更少步骤的新模型,从而显著减少采样时间。在实验中,他们能将采样步骤从8192降低到4步,而不牺牲感知质量。此外,整个渐进蒸馏过程并不增加额外的训练时间,为扩散模型的实用性和效率提供了改进方案。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

论文地址及代码

谷歌research的成果,ICLR 2022
https://arxiv.org/abs/2202.00512
tenserflow官方开源代码: https://github.com/google-research/google-research/tree/master/diffusion_distillation
pytorch非官方代码:https://github.com/lucidrains/imagen-pytorch

速览

主要解决的问题—扩散模型预测慢

  • 1.扩散模型虽然取得了很好的效果,但是预测速度慢。
  • 2.作者提出了一种逐步蒸馏的方式,如下图:

在这里插入图片描述

0.Abstruct

0.1 逐句翻译

Diffusion models have recently shown great promise for generative modeling, out- performing GANs on perceptual quality and autoregressive models at density es- timation. A remaining downside is their slow sampling time: generating high quality samples takes many hundreds or thousands of model evaluations. Here we make two contributions to help eliminate this downside: First, we present new parameterizations of diffusion models that provide increased stability when using few sampling steps. Second, we present a method to distill a trained deterministic diffusion sampler, using many steps, into a new diffusion model that takes half as many sampling steps. We then keep progressively applying this distillation proce- dure to our model, halving the number of required sampling steps each time. On standard image generation benchmarks like CIFAR-10, ImageNet, and LSUN, we start out with state-of-the-art samplers taking as many as 8192 steps, and are able to distill down to models taking as few as 4 steps without losing much perceptual quality; achieving, for example, a FID of 3.0 on CIFAR-10 in 4 steps. Finally, we show that the full progressive distillation procedure does not take more time than it takes to train the original model, thus representing an efficient solution for generative modeling using diffusion at both train and test time.

扩散模型最近在生成模型方面表现出了很大的优点,在感知质量和密度估计方面比 GAN 表现更好,而缺点是采样时间较长:生成高质量的样本需要数百或数千个模型评估。在这里,我们提出了两个贡献来帮助消除这个缺点:首先,我们提出了新的扩散模型参数化方式,在使用少量采样步骤时可以提供更高的稳定性。其次,我们提出了一种方法,可以将训练好的确定性扩散采样器通过许多步骤转换为新的扩散模型,只需要一半左右的采样步骤。然后我们逐步应用这个蒸馏过程到我们的模型中,每次将所需的采样步骤减半。在标准图像生成基准测试如 CIFAR-10、ImageNet 和 LSUN 中,我们开始使用高达 8192 个步骤的最新采样器,并能够将其蒸馏到只需要 4 个步骤的模型中,而不会丢失太多感知质量。例如,在 CIFAR-10 中,我们在 4 个步骤内实现了 FID=3.0。最后,我们展示了完整的逐步蒸馏过程并不比训练原始模型花费更多时间,因此它是一种在训练和测试时同时使用扩散模型的有效解决方案。

总结

主要是两个创新点,

  • 1.提出了一种可以简化步骤的方法
  • 2.提出了一种知识蒸馏的方法,可以把更高的迭代次数优化为更低的迭代次数。

1.INTRODUCTION

1.1逐句翻译

第一段(扩散模型在各个方面取得很好的成果)

Diffusion models (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020) are an emerg- ing class of generative models that has recently delivered impressive results on many standard gen- erative modeling benchmarks. These models have achieved ImageNet generation results outper- forming BigGAN-deep and VQ-VAE-2 in terms of FID score and classification accuracy score (Ho et al., 2021; Dhariwal & Nichol, 2021), and they have achieved likelihoods outperforming autore- gressive image models (Kingma et al., 2021; Song et al., 2021b). They have also succeeded in image super-resolution (Saharia et al., 2021; Li et al., 2021) and image inpainting (Song et al., 2021c), and there have been promising results in shape generation (Cai et al., 2020), graph generation (Niu et al., 2020), and text generation (Hoogeboom et al., 2021; Austin et al., 2021).
扩散模型 (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020) 是一种新兴的生成模型类群,最近在多个标准生成模型基准测试中取得了令人印象深刻的结果。这些模型在 FID 得分和分类准确率得分方面实现了 ImageNet 生成结果,远优于 BigGAN-deep 和 VQ-VAE-2(Ho et al., 2021; Dhariwal & Nichol, 2021),并且在生成概率方面比自回归图像模型表现更好 (Kingma et al., 2021; Song et al., 2021b)。它们还成功地实现了图像超分辨率 (Saharia et al., 2021; Li et al., 2021) 和图像修复 (Song et al., 2021c),并且在形状生成 (Cai et al., 2020)、图生成 (Niu et al., 2020) 和文本生成 (Hoogeboom et al., 2021; Austin et al., 2021) 等领域取得了有前途的结果。

第二段(提出扩散模型预测慢的问题)

A major barrier remains to practical adoption of diffusion models: sampling speed. While sam- pling can be accomplished in relatively few steps in strongly conditioned settings, such as text-to- speech (Chen et al., 2021) and image super-resolution (Saharia et al., 2021), or when guiding the sampler using an auxiliary classifier (Dhariwal & Nichol, 2021), the situation is substantially differ- ent in settings in which there is less conditioning information available. Examples of such settings are unconditional and standard class-conditional image generation, which currently require hundreds or thousands of steps using network evaluations that are not amenable to the caching optimizations of other types of generative models (Ramachandran et al., 2017).

然而,扩散模型的实际应用仍然存在一个主要障碍:采样速度。虽然在强条件设置下,如文本到语音 (Chen et al., 2021) 和图像超分辨率 (Saharia et al., 2021),采样可以在相对较短的时间内完成,或者通过辅助分类器指导采样 (Dhariwal & Nichol, 2021),但在条件较少的设置下,情况则完全不同。例如,无条件和标准条件的图像生成目前需要使用网络评估,这些评估不受其他类型生成模型的缓存优化 (Ramachandran et al., 2017)。

第三段(作者提出自己的想法)

In this paper, we reduce the sampling time of diffusion models by orders of magnitude in uncondi- tional and class-conditional image generation, which represent the setting in which diffusion models have been slowest in previous work. We present a procedure to distill the behavior of a N -step DDIM sampler (Song et al., 2021a) for a pretrained diffusion model into a new model with N/2 steps, with little degradation in sample quality. In what we call progressive distillation, we repeat this distilla- tion procedure to produce models that generate in as few as 4 steps, still maintaining sample quality competitive with state-of-the-art models using thousands of steps.

在本文中,我们显著提高了扩散模型在无条件和标准条件图像生成方面的采样速度,这是扩散模型在以前的工作中最慢的设置之一。我们提出了一种方法,将一个预训练的扩散模型的 N 步 DDIM 采样器的行为蒸馏到一个新的模型中,只需要 N/2 步,而样本质量几乎没有下降。我们称之为逐步蒸馏,我们将这种方法重复应用到我们的模型中,以产生只需 4 步即可生成图像的模型,并保持与使用数千步的最先进的模型相当的样本质量。

在这里插入图片描述

文字说明

Figure 1: A visualization of two iterations of our proposed progressive distillation algorithm. A sampler f(z;η), mapping random noise ε to samples x in 4 deterministic steps, is distilled into a new sampler f(z;θ) taking only a single step. The original sampler is derived by approximately integrating the probability flow ODE for a learned diffusion model, and distillation can thus be understood as learning to integrate in fewer steps, or amortizing this integration into the new sampler.

Figure 1: 一种可视化了我们提出的渐进式蒸馏算法的两个迭代。一个采样器 f(z;η),将随机噪声ε映射到 4 个确定的采样 x,被蒸馏成一个仅需要一个步骤的新采样器 f(z;θ)。原始的采样器是通过近似积分学习扩散模型的概率流常微分方程 (ODE) 得出的,因此蒸馏可以被理解为在更少的步骤中学习积分,或者将积分分摊到新采样器中。

1.2总结

  • 1.扩散模型虽然取得了很好的效果,但是预测速度慢。
  • 2.作者提出了一种逐步蒸馏的方式,如下图:

在这里插入图片描述

3 PROGRESSIVE DISTILLATION

第一段(简单介绍 如何蒸馏减少步数)

To make diffusion models more efficient at sampling time, we propose progressive distillation: an algorithm that iteratively halves the number of required sampling steps by distilling a slow teacher diffusion model into a faster student model. Our implementation of progressive distillation stays very close to the implementation for training the original diffusion model, as described by e.g. Ho et al. (2020). Algorithm 1 and Algorithm 2 present diffusion model training and progressive distillation side-by-side, with the relative changes in progressive distillation highlighted in green.
为了在采样时提高扩散模型的效率,我们提出了渐进式蒸馏算法:一种迭代地减半所需采样步骤的算法,通过将慢速教师扩散模型蒸馏成更快的学生模型来实现。我们实现的渐进式蒸馏算法与训练原始扩散模型的实现非常相似,例如 Ho 等 (2020) 所述。算法 1 和算法 2 同时展示了扩散模型训练和渐进式蒸馏,其中渐进式蒸馏的相对变化被绿色突出显示。

在这里插入图片描述

第二段

We start the progressive distillation procedure with a teacher diffusion model that is obtained by training in the standard way. At every iteration of progressive distillation, we then initialize the student model with a copy of the teacher, using both the same parameters and same model definition. Like in standard training, we then sample data from the training set and add noise to it, before forming the training loss by applying the student denoising model to this noisy data zt. The main difference in progressive distillation is in how we set the target for the denoising model: instead of the original data x, we have the student model denoise towards a target x ̃ that makes a single student DDIM step match 2 teacher DDIM steps. We calculate this target value by running 2 DDIM sampling steps using the teacher, starting from zt and ending at zt−1/N , with N being the number of student sampling steps. By inverting a single step of DDIM, we then calculate the value the student model would need to predict in order to move from zt to zt−1/N in a single step, as we show in detail in Appendix G. The resulting target value x ̃(zt) is fully determined given the teacher model and starting point zt, which allows the student model to make a sharp prediction when evaluated at zt. In contrast, the original data point x is not fully determined given zt, since multiple different data points x can produce the same noisy data zt: this means that the original denoising model is predicting a weighted average of possible x values, which produces a blurry prediction. By making sharper predictions, the student model can make faster progress during sampling.

我们开始渐进式蒸馏过程,使用通过标准训练得到的教师扩散模型开始。在每次迭代中,然后我们初始化学生模型,使用与教师相同的参数和模型定义,从训练集中采样数据并添加噪声,然后在使用学生去噪模型对这个噪声数据 zt 应用时形成训练损失。渐进式蒸馏的主要区别在于我们如何设置去噪模型的目标:而不是原始数据 x,我们让学生模型去噪以目标 x ̃使单个学生 DDIM 步等于 2 个教师 DDIM 步。我们计算这个目标值,通过使用教师运行 2 个 DDIM 采样步骤,从 zt 开始并结束于 zt-1/N,其中 N 是学生采样步骤的数量。通过逆推一个 DDIM 步,我们然后计算学生模型需要在一步中从 zt 移动到 zt-1/N 所需的预测值,具体细节在附录 G 中展示。结果的目标值 x ̃(zt) 在教师模型和起始点 zt 给定的情况下是完全确定的,这允许学生模型在评估 zt 时做出清晰的预测。相比之下,给定 zt 的原始数据点 x 没有完全确定,因为多个不同的数据点 x 可以产生相同的噪声数据 zt:这意味着原始去噪模型预测加权平均可能 x 值,这产生模糊的预测。通过做出更清晰的预测,学生模型在采样时能够更快地进展。

第三段(继续描述这个迭代可以不断递归使用,学生变成新的老师)

After running distillation to learn a student model taking N sampling steps, we can repeat the pro- cedure with N/2 steps: The student model then becomes the new teacher, and a new student model is initialized by making a copy of this model.

运行蒸馏以学习采取 N 采样步骤的学生模型后,我们可以重复该过程以进行 N/2 步骤:学生模型将成为新的老师,并通过复制该模型初始化一个新的学生模型。

第四段(这里调整Alph1为0真的没看懂,得看看代码)

Unlike our procedure for training the original model, we always run progressive distillation in dis- crete time: we sample this discrete time such that the highest time index corresponds to a signal-to- noise ratio of zero, i.e. α1 = 0, which exactly matches the distribution of input noise z1 ∼ N (0, I) that is used at test time. We found this to work slightly better than starting from a non-zero signal- to-noise ratio as used by e.g. Ho et al. (2020), both for training the original model as well as when performing progressive distillation.

不同于我们训练原始模型的流程,我们总是在离散时间中进行渐进式蒸馏:我们采样离散时间,使得最高时间索引对应信号噪声比为零,即α1=0,这完全匹配测试时使用的输入噪声 z1∼N(0,I) 的分布。我们发现,这种方法相对于例如 Ho 等 (2020) 使用非零信号噪声比开始训练和进行渐进式蒸馏时,在训练原始模型和进行渐进式蒸馏时表现更好。

### 扩散模型蒸馏技术的实现详解 扩散模型Diffusion Models, DM)因其强大的生成能力被广泛应用,但在实际部署中面临的主要挑战是其采样过程较为缓慢。为了加速这一过程,研究者提出了多种优化方案,其中一种有效的方法便是基于蒸馏的思想来减少所需的采样步骤。 #### 1. **逐步蒸馏的核心概念** 逐步蒸馏是一种通过学习简化版扩散模型行为的方式,从而大幅缩短采样时间的技术。具体而言,该方法的目标是从一个经过充分训练的标准扩散模型中提取知识,并将其转移到一个新的模型上,使得新模型能够在更少的时间步数下完成高质量样本的生成[^3]。 #### 2. **蒸馏的具体流程** 以下是逐步蒸馏的关键环节: - **初始模型准备**: 需要有一个已经过良好训练的基础扩散模型作为教师模型。此模型通常具有较高的精度较长的推理链路 (例如数百甚至上千个时间步)。 - **目标设定**: 定义学生模型应达到的效果——即在较少迭代次数的情况下接近原模型的质量水平。假设原始模型需要 \(N\) 步才能生成理想图片,则希望新的精简版本能在大约 \(\frac{N}{k}\)(\( k>1\)) 的时间内达成相似效果。 - **损失函数设计**: 构建适合当前任务需求的新形式化表达式用于指导参数调整方向;比如可以采用均方误差(MSE)衡量预测噪声分布之间的差异程度等指标来进行监督学习。 - **多轮次循环改进**: 不断重复上述操作直至满足既定性能阈值为止。每一次迭代都会进一步压缩所需计算量的同时维持较高水准的表现力。 最终结果表明,在仅需四步就能生成图像的前提下,仍能保持与传统耗时较长却精确度高的算法相媲美的视觉体验品质。 ```python def distill_diffusion_model(original_steps=1000, target_reduction_factor=4): distilled_steps = original_steps // target_reduction_factor # Define the teacher model and student model architectures here. for epoch in range(num_epochs): # Training loop over epochs # Forward pass with both models on a batch of data points. outputs_teacher = run_teacher_model(inputs) outputs_student = run_student_model(inputs) # Compute loss based on discrepancy between teachers' output at certain timesteps vs students'. loss = compute_loss(outputs_teacher, outputs_student) optimizer.zero_grad() loss.backward() optimizer.step() return refined_student_model_with_fewer_timesteps(distilled_steps) ``` 以上伪代码展示了如何利用梯度下降法不断修正小型网络权重直到它能够模仿大型复杂结构的行为模式的过程概览。 #### 3. **优势分析** 相比其他提速手段如直接修改架构或者单纯增加硬件资源投入等方式来说,这种基于知识迁移理念下的解决方案具备如下几个明显优点: - 显著降低运行成本; - 更易于移植至不同平台之上执行; - 对原有框架改动较小便于维护升级等工作开展。 然而值得注意的是虽然整体效率有所提升但仍可能存在一定局限性例如对于极端情况处理可能不够稳健等问题有待后续深入探讨解决办法。 ---
评论 4
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

CUHK-SZ-relu

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值