The paradox of diffusion distillation

Diffusion models split up the difficult task of generating data from a high-dimensional distribution into many denoising tasks, each of which is much easier. We train them to solve just one of these tasks at a time. To sample, we make many predictions in sequence. This iterative refinement is where their power comes from.

…or is it? A lot of recent papers about diffusion models focus on reducing the number of sampling steps required; some works even aim to enable single-step sampling. That seems counterintuitive, when splitting things up into many easier steps is supposedly why these models work so well in the first place!

In this blog post, let’s take a closer look at the various ways in which the number of sampling steps required to get good results from diffusion models can be reduced. We will focus on various forms of distillation in particular: this is the practice of training a new model (the student) by supervising it with the predictions of another model (the teacher). Various distillation methods for diffusion models have produced extremely compelling results.

I intended this to be relatively high-level when I started writing, but since distillation of diffusion models is a bit of a niche subject, I could not avoid explaining certain things in detail, so it turned into a deep dive. Below is a table of contents. Click to jump directly to a particular section of this post.

  1. Diffusion sampling: tread carefully!

  2. Moving through input space with purpose

  3. Diffusion distillation

    1. Distilling diffusion sampling into a single forward pass

    2. Progressive distillation

    3. Guidance distillation

    4. Rectified flow

    5. Consistency distillation & TRACT

    6. BOOT: data-free distillation

    7. Sampling with neural operators

    8. Score distillation sampling

    9. Adversarial distillation

  4. But what about “no free lunch”?

  5. Do we really need a teacher?

  6. Charting the maze between data and noise

  7. Closing thoughts

  8. Acknowledgements

  9. References

1. Diffusion sampling: tread carefully!

First of all, why does it take many steps to get good results from a diffusion model? It’s worth developing a deeper understanding of this, in order to appreciate how various methods are able to cut down on this without compromising the quality of the output – or at least, not too much.

A sampling step in a diffusion model consists of:

  • predicting the direction in input space in which we should move to remove noise, or equivalently, to make the input more likely under the data distribution;

  • taking a small step in that direction.

Depending on the sampling algorithm, you might add a bit of noise, or use a more advanced mechanism to compute the update direction.
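
To make the two-part recipe above concrete, here is a minimal sketch of such a sampling loop, assuming a variance-exploding parameterisation and a hypothetical `denoise_fn(x, sigma)` that returns the model’s estimate of the clean input (neither of which is prescribed by this post):

```python
import numpy as np

def sample(denoise_fn, sigmas, shape=(3, 64, 64), seed=0):
    """Minimal Euler-style sampler: predict a direction, then take a small step."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=shape) * sigmas[0]      # start from pure noise at the highest level
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        x0_hat = denoise_fn(x, sigma)           # estimate of the clean input (centroid)
        d = (x - x0_hat) / sigma                # direction to move in to remove noise
        x = x + d * (sigma_next - sigma)        # small step towards the next noise level
        # a stochastic sampler (e.g. DDPM-like) would also re-inject some noise here
    return x
```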

We only take a small step, because this predicted direction is only meaningful locally: it points towards a region of input space where the likelihood under the data distribution is high – not to any specific data point in particular. So if we were to take a big step, we would end up in the centroid of that high-likelihood region, which isn’t necessarily a representative sample of the data distribution. Think of it as a rough estimate. If you find this unintuitive, you are not alone! Probability distributions in high-dimensional spaces often behave unintuitively, something I’ve written an in-depth blog post about in the past.

Concretely, in the image domain, taking a big step in the predicted direction tends to yield a blurry image, if there is a lot of noise in the input. This is because it basically corresponds to the average of many plausible images. (For the sake of argument, I am intentionally ignoring any noise that might be added back in as part of the sampling algorithm.)

Another way of looking at it is that the noise obscures high-frequency information, which corresponds to sharp features and fine-grained details (something I’ve also written about before). The uncertainty about this high-frequency information yields a prediction where all the possibilities are blended together, which results in a lack of high-frequency information altogether.

The local validity of the predicted direction implies we should only be taking infinitesimal steps, and then reevaluating the model to determine a new direction. Of course, this is not practical, so we take finite but small steps instead. This is very similar to the way gradient-based optimisation of machine learning models works in parameter space, but here we are operating in the input space instead. Just as in model training, if the steps we take are too large, the quality of the end result will suffer.

Below is a diagram that represents the input space in two dimensions. (\mathbf{x}_t) represents the noisy input at time step (t), which we constructed here by adding noise to a clean image (\mathbf{x}_0) drawn from the data distribution. Also shown is the direction (predicted by a diffusion model) in which we should move to make the input more likely. This points to (\hat{\mathbf{x}}_0), the centroid of a region of high likelihood, which is shaded in pink.

Diagram showing a region of high likelihood in input space, as well as the direction predicted by a diffusion model, which points to the centroid of this region.

(Please see the first section of my previous blog post on the geometry of diffusion guidance for some words of caution about representing very high-dimensional spaces in 2D!)

If we proceed to take a step in this direction and add some noise (as we do in the DDPM(1) sampling algorithm, for example), we end up with (\mathbf{x}_{t-1}), which corresponds to a slightly less noisy input image. The predicted direction now points to a smaller, “more specific” region of high likelihood, because some uncertainty was resolved by the previous sampling step. This is shown in the diagram below.

Diagram showing the updated direction predicted by a diffusion model after a single sampling step, as well as the corresponding region of high likelihood which it points to.

The change in direction at every step means that the path we trace out through input space during sampling is curved. Actually, because we are making a finite approximation, that’s not entirely accurate: it is actually a piecewise linear path. But if we let the number of steps go to infinity, we would end up with a curve. The predicted direction at each point on this curve corresponds to the tangent direction. A stylised version of what this curve might look like is shown in the diagram below.

Diagram showing a stylised version of the curve we might trace through input space with an infinite number of sampling steps (dashed red curve).

2. Moving through input space with purpose

A plethora of diffusion sampling algorithms have been developed to move through input space more swiftly and reduce the number of sampling steps required to achieve a certain level of output quality. Trying to list all of them here would be a hopeless endeavour, but I want to highlight a few of these algorithms to demonstrate that a lot of the ideas behind them mimic techniques used in gradient-based optimisation.

A very common question about diffusion sampling is whether we should be injecting noise at each step, as in DDPM(1), and sampling algorithms based on stochastic differential equation (SDE) solvers(2). Karras et al.(3) study this question extensively (see sections 3 & 4 in their “instant classic” paper) and find that the main effect of introducing stochasticity is error correction: diffusion model predictions are approximate, and noise helps to prevent these approximation errors from accumulating across many sampling steps. In the context of optimisation, the regularising effect of noise in stochastic gradient descent (SGD) is well-studied, so perhaps this is unsurprising.

However, for some applications, injecting randomness at each sampling step is not acceptable, because a deterministic mapping between samples from the noise distribution and samples from the data distribution is necessary. Sampling algorithms such as DDIM(4) and ODE-based approaches(2) make this possible (I’ve previously written about this feat of magic, as well as how this links together diffusion models and flow-based models). An example of where this comes in handy is for teacher models in the context of distillation (see next section). In that case, other techniques can be used to reduce approximation error while avoiding an increase in the number of sampling steps.

One such technique is the use of higher order methods. Heun’s 2nd order method for solving differential equations results in an ODE-based sampler that requires two model evaluations per step, which it uses to obtain improved estimates of update directions(5). While this makes each sampling step approximately twice as expensive, the trade-off can still be favourable in terms of the total number of function evaluations(3).
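
As an illustration, here is a sketch of a single Heun step in the same notation as the sampler sketch above (two evaluations of the hypothetical `denoise_fn` per update); it follows the standard Heun update rule rather than any particular paper’s pseudocode:

```python
def heun_step(denoise_fn, x, sigma, sigma_next):
    """One Heun (2nd order) update: average the slopes at the start and end points."""
    d = (x - denoise_fn(x, sigma)) / sigma              # first model evaluation
    x_euler = x + d * (sigma_next - sigma)              # provisional Euler step
    if sigma_next == 0:                                 # final step: no correction possible
        return x_euler
    d_next = (x_euler - denoise_fn(x_euler, sigma_next)) / sigma_next  # second evaluation
    return x + 0.5 * (d + d_next) * (sigma_next - sigma)               # corrected step
```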

Another variant of this idea involves making the model predict higher-order score functions – think of this as the model estimating both the direction and the curvature, for example. These estimates can then be used to move faster in regions of low curvature, and slow down appropriately elsewhere. GENIE(6) is one such method, which involves distilling the expensive second order gradient calculation into a small neural network to reduce the additional cost to a practical level.

Finally, we can emulate the effect of higher-order information by aggregating information across sampling steps. This is very similar to the use of momentum in gradient-based optimisation, which also enables acceleration and deceleration depending on curvature, but without having to explicitly estimate second order quantities. In the context of differential equation solving, this approach is usually termed a multistep method, and this idea has inspired many diffusion sampling algorithms(7) (8) (9) (10).

In addition to the choice of sampling algorithm, we can also choose how to space the time steps at which we compute updates. These are spaced uniformly across the entire range by default (think np.linspace), but because noise schedules are often nonlinear (i.e. (\sigma_t) is a nonlinear function of (t)), the corresponding noise levels are spaced in a nonlinear fashion as a result. However, it can pay off to treat sampling step spacing as a hyperparameter to tune separately from the choice of noise schedule (or, equivalently, to change the noise schedule at sampling time). Judiciously spacing out the time steps can improve the quality of the result at a given step budget(3).
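
As a small illustration, here is the default uniform spacing next to a nonlinear spacing of the noise levels themselves; the rho-based schedule below is the one proposed by Karras et al.(3), with the specific constants being illustrative choices rather than anything prescribed here:

```python
import numpy as np

num_steps = 10

# Default: uniform spacing of the time steps (np.linspace), which becomes a
# nonlinear spacing of noise levels once mapped through the noise schedule.
t = np.linspace(1.0, 0.0, num_steps + 1)

# Alternative: space the noise levels directly, uniformly in sigma**(1/rho),
# as suggested by Karras et al. (rho = 7 concentrates steps at low noise levels).
sigma_min, sigma_max, rho = 0.002, 80.0, 7.0
ramp = np.linspace(0.0, 1.0, num_steps + 1)
sigmas = (sigma_max ** (1 / rho) + ramp * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho
```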

3. Diffusion distillation

Broadly speaking, in the context of neural networks, distillation refers to training a neural network to mimic the outputs of another neural network(11). The former is referred to as the student, while the latter is the teacher. Usually, the teacher has been trained previously, and its weights are frozen. When applied to diffusion models, something interesting happens: even if the student and teacher networks are identical in terms of architecture, the student will converge significantly faster than the teacher did when it was trained.

To understand why this happens, consider that diffusion model training involves supervising the network with examples (\mathbf{x}_0) from the dataset, to which we have added varying amounts of noise to create the network input (\mathbf{x}_t). But rather than expecting the network to be able to predict (\mathbf{x}_0) exactly, what we actually want is for it to predict (\mathbb{E}\left[\mathbf{x}_0 \mid \mathbf{x}_t \right]), that is, a conditional expectation over the data distribution. It’s worth revisiting the first diagram in section 1 of this post to grasp this: we supervise the model with (\mathbf{x}_0), but this is not what we want the model to predict – what we actually want is for it to predict a direction pointing to the centroid of a region of high likelihood, which (\mathbf{x}_0) is merely a representative sample of. I’ve previously mentioned this when discussing various perspectives on diffusion. This means that weight updates are constantly pulling the model weights in different directions as training progresses, slowing down convergence.

When we distill a diffusion model, rather than training it from scratch, the teacher provides an approximation of (\mathbb{E}\left[\mathbf{x}_0 \mid \mathbf{x}_t \right]), which the student learns to mimic. Unlike before, the target used to supervise the model is now already an (approximate) expectation, rather than a single representative sample. As a result, the variance of the distillation loss is significantly reduced compared to that of the standard diffusion training loss. Whereas the latter tends to produce training curves that are jumping all over the place, distillation provides a much smoother ride. This is especially obvious when you plot both training curves side by side. Note that this variance reduction does come at a cost: since the teacher is itself an imperfect model, we’re actually trading variance for bias.

Variance reduction alone does not explain why distillation of diffusion models is so popular, however. Distillation is also a very effective way to reduce the number of sampling steps required. It seems to be a lot more effective in this regard than simply changing up the sampling algorithm, but of course there is also a higher upfront cost, because it requires additional model training.

There are many variants of diffusion distillation, a few of which I will try to compactly summarise below. It goes without saying that this is not an exhaustive review of the literature. A relatively recent survey paper is Weijian Luo’s (from April 2023)(12), though a lot of work has appeared in this space since then, so I will try to cover some newer things as well. If you feel there is a particular method that’s worth mentioning but that I didn’t cover, let me know in the comments.

3.1 Distilling diffusion sampling into a single forward pass

A typical diffusion sampling procedure involves repeatedly applying a neural network on a canvas, and using the prediction to update that canvas. When we unroll the computational graph of this network, this can be reinterpreted as a much deeper neural network in its own right, where many layers share weights. I’ve previously discussed this perspective on diffusion in more detail.

Distillation is often used to compress larger networks into smaller ones, so Luhman & Luhman(13) set out to train a much smaller student network to reproduce the outputs of this much deeper teacher network corresponding to an unrolled sampling procedure. In fact, what they propose is to distill the entire sampling procedure into a network with the same architecture used for a single diffusion prediction step, by matching outputs in the least-squares sense (MSE loss). Depending on how many steps the sampling procedure has, this may correspond to quite an extreme form of model compression (in the sense of compute, that is – the number of parameters stays the same, of course).

This approach requires a deterministic sampling procedure, so they use DDIM(4) – a choice which many distillation methods that were developed later also follow. The result of their approach is a compact student network which transforms samples from the noise distribution into samples from the data distribution in a single forward pass.

Diagram showing distillation of the diffusion sampling procedure into a single forward pass.

Putting this into practice, one encounters a significant hurdle, though: to obtain a single training example for the student, we have to run the full diffusion sampling procedure using the teacher, which is usually too expensive to do on-the-fly during training. Therefore the dataset for the student has to be pre-generated offline. This is still expensive, but at least it only has to be done once, and the resulting training examples can be reused for multiple epochs.

To speed up the learning process, it also helps to initialise the student with the weights of the teacher (which we can do because their architectures are identical). This is a trick that most diffusion distillation methods make use of.
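
Putting these pieces together, a rough sketch might look as follows; `teacher_sample_fn` (the full deterministic DDIM sampler) and the `student` module are placeholders, and the student is assumed to be initialised from the teacher’s weights as described above:

```python
import torch

def distill_to_single_step(student, teacher_sample_fn, num_pairs, shape, num_updates):
    """Distill a full deterministic sampling procedure into one forward pass (sketch)."""
    # Offline phase: pre-generate (noise, teacher sample) pairs once, since running
    # the full multi-step sampler on the fly during training would be too expensive.
    dataset = []
    for _ in range(num_pairs):
        eps = torch.randn(shape)
        with torch.no_grad():
            x0 = teacher_sample_fn(eps)          # full multi-step DDIM sampling
        dataset.append((eps, x0))

    # Online phase: train the student to reproduce the teacher's endpoint in one pass.
    opt = torch.optim.Adam(student.parameters(), lr=1e-4)
    for i in range(num_updates):
        eps, x0 = dataset[i % num_pairs]
        loss = torch.mean((student(eps) - x0) ** 2)   # least-squares (MSE) matching
        opt.zero_grad()
        loss.backward()
        opt.step()
    return student
```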

This work served as a compelling proof-of-concept for diffusion distillation, but aside from the computational cost, the accumulation of errors in the deterministic sampling procedure, combined with the approximate nature of the student predictions, imposed significant limits on the achievable output quality.

3.2 Progressive distillation

Progressive distillation(14) is an iterative approach that halves the number of required sampling steps. This is achieved by distilling the output of two consecutive sampling steps into a single forward pass. As with the previous method, this requires a deterministic sampling method (the paper uses DDIM), as well as a predetermined number of sampling steps (N) to use for the teacher model.

Diagram showing progressive distillation. The student learns to match the result of two sampling steps in one forward pass.

To reduce the number of sampling steps further, it can be applied repeatedly. In theory, one can go all the way down to single-step sampling by applying the procedure (\log_2 N) times. This addresses several shortcomings of the previous approach:

  • At each distillation stage, only two consecutive sampling steps are required, which is significantly cheaper than running the whole sampling procedure end-to-end. Therefore it can be done on-the-fly during training, and pre-generating the training dataset is no longer required.

  • The original training dataset used for the teacher model can be reused, if it is available (or any other dataset!). This helps to focus learning on the part of input space that is relevant and interesting.

  • While we could go all the way down to 1 step, the iterative nature of the procedure enables a trade-off between quality and compute cost. Going down to 4 or 8 steps turns out to help a lot to keep the inevitable quality loss from distillation at bay, while still speeding up sampling very significantly. This also provides a much better trade-off than simply reducing the number of sampling steps for the teacher model, instead of distilling it (see Figure 4 in the paper).
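
To make the mechanism concrete, here is a sketch of one progressive distillation training step, assuming x0-predicting teacher and student networks and a `noise_schedule(t)` that returns ((\alpha_t, \sigma_t)); `ddim_step` is a small helper implementing a deterministic DDIM update, and the target follows the two-teacher-steps-into-one-student-step construction described above:

```python
import torch

def ddim_step(model, x_t, t, t_next, noise_schedule):
    """Deterministic DDIM update using an x0-predicting model (sketch)."""
    alpha, sigma = noise_schedule(t)
    alpha_n, sigma_n = noise_schedule(t_next)
    x0_hat = model(x_t, t)
    eps_hat = (x_t - alpha * x0_hat) / sigma
    return alpha_n * x0_hat + sigma_n * eps_hat

def progressive_distillation_loss(student, teacher, x0, t, dt, noise_schedule):
    """Match two consecutive teacher steps with a single student prediction."""
    alpha, sigma = noise_schedule(t)
    eps = torch.randn_like(x0)
    x_t = alpha * x0 + sigma * eps                                            # noisy input at level t
    with torch.no_grad():
        x_mid = ddim_step(teacher, x_t, t, t - dt, noise_schedule)            # teacher step 1
        x_tgt = ddim_step(teacher, x_mid, t - dt, t - 2 * dt, noise_schedule) # teacher step 2
        # the x0 the student must predict so that one DDIM step of size 2*dt
        # from x_t lands exactly on x_tgt
        alpha_2, sigma_2 = noise_schedule(t - 2 * dt)
        x0_target = (x_tgt - (sigma_2 / sigma) * x_t) / (alpha_2 - (sigma_2 / sigma) * alpha)
    return torch.mean((student(x_t, t) - x0_target) ** 2)
```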

Aside: v-prediction

The most common parameterisation for training diffusion models in the image domain, where the neural network predicts the standardised Gaussian noise variable (\varepsilon), causes problems for progressive distillation. The implicit relative weighting of noise levels in the MSE loss w.r.t. (\varepsilon) is particularly suitable for visual data, because it maps well to the human visual system’s varying sensitivity to low and high spatial frequencies. This is why it is so commonly used.

To obtain a prediction in input space (\hat{\mathbf{x}}_0) from a model that predicts (\varepsilon) from the noisy input (\mathbf{x}_t), we can use the following formula:

[\hat{\mathbf{x}}_0 = \alpha_t^{-1} \left( \mathbf{x}_t - \sigma_t \varepsilon (\mathbf{x}_t) \right) .]

Here, (\sigma_t) represents the standard deviation of the noise at time step (t). (For variance-preserving diffusion, the scale factor (\alpha_t = \sqrt{1 - \sigma_t^2}), for variance-exploding diffusion, (\alpha_t = 1).)

At high noise levels, (\mathbf{x}_t) is dominated by noise, so the difference between (\mathbf{x}_t) and the scaled noise prediction is potentially quite small – but this difference entirely determines the prediction in input space (\hat{\mathbf{x}}_0)! This means any prediction errors may get amplified. In standard diffusion models, this is not a problem in practice, because errors can be corrected over many steps of sampling. In progressive distillation, this becomes a problem in later iterations, where we mainly evaluate the model at high noise levels (in the limit of a single-step model, the model is only ever evaluated at the highest noise level).

It turns out this issue can be addressed simply by parameterising the model to predict (\mathbf{x}_0) instead, but the progressive distillation paper also introduces a new prediction target (\mathbf{v} = \alpha_t \varepsilon - \sigma_t \mathbf{x}_0) (“velocity”, see section 4 and appendix D). This has some really nice properties, and has also become quite popular beyond just distillation applications in recent times.
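
For reference, here are the conversions between these prediction targets in code form; the (\mathbf{v}) identities below assume a variance-preserving process, i.e. (\alpha_t^2 + \sigma_t^2 = 1):

```python
def eps_to_x0(x_t, eps_pred, alpha_t, sigma_t):
    """Clean-input estimate from an epsilon-prediction (the formula above)."""
    return (x_t - sigma_t * eps_pred) / alpha_t

def v_target(x0, eps, alpha_t, sigma_t):
    """The 'velocity' target from the progressive distillation paper."""
    return alpha_t * eps - sigma_t * x0

def v_to_x0(x_t, v_pred, alpha_t, sigma_t):
    """Clean-input estimate from a v-prediction (variance-preserving case)."""
    return alpha_t * x_t - sigma_t * v_pred
```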

3.3 Guidance distillation

Before moving on to more advanced diffusion distillation methods that reduce the number of sampling steps, it’s worth looking at guidance distillation. The goal of this method is not to achieve high-quality samples in fewer steps, but rather to make each step computationally cheaper when using classifier-free guidance(15). I have already dedicated two entire blog posts specifically to diffusion guidance, so I will not recap the concept here. Check them out first if you’re not familiar:

  • Guidance: a cheat code for diffusion models

  • The geometry of diffusion guidance

The use of classifier-free guidance requires two model evaluations per sampling step: one conditional, one unconditional. This makes sampling roughly twice as expensive, as the main cost is in the model evaluations. To avoid paying that cost, we can distill predictions that result from guidance into a model that predicts them directly in a single forward pass, conditioned on the chosen guidance scale(16).
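
A sketch of what this looks like as a training objective, assuming the teacher and student share the same prediction target and the student takes the guidance scale w as an extra input:

```python
import torch

def guidance_distillation_loss(student, teacher, x_t, t, cond, w):
    """Distill a guided prediction (two teacher passes) into one student pass (sketch)."""
    with torch.no_grad():
        pred_cond = teacher(x_t, t, cond)                          # conditional evaluation
        pred_uncond = teacher(x_t, t, None)                        # unconditional evaluation
        pred_guided = pred_uncond + w * (pred_cond - pred_uncond)  # classifier-free guidance
    # the student is conditioned on the guidance scale, so one model covers all scales
    return torch.mean((student(x_t, t, cond, w) - pred_guided) ** 2)
```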

While guidance distillation does not reduce the number of sampling steps, it roughly halves the required computation per step, so it still makes sampling roughly twice as fast. It can also be combined with other forms of distillation. This is useful, because reducing the number of sampling steps actually reduces the impact of guidance, which relies on repeated small adjustments to update directions to work. Applying guidance distillation before another distillation method can help ensure that the original effect is preserved as the number of steps is reduced.

Diagram showing guidance distillation. A single step of sampling with classifier-free guidance (requiring two forward passes through the diffusion model) is distilled into a single forward pass.

3.4 Rectified flow

One way to understand the requirement for diffusion sampling to take many small steps, is through the lens of curvature: we can only take steps in a straight line, so if the steps we take are too large, we end up “falling off” the curve, leading to noticeable approximation errors.

As mentioned before, some sampling algorithms compensate for this by using curvature information to determine the step size, or by injecting noise to reduce error accumulation. The rectified flow method(17) takes a more drastic approach: what if we just replace these curved paths between samples from the noise and data distributions with another set of paths that are significantly less curved?

This is possible using a procedure that resembles distillation, though it doesn’t quite have the same goal: whereas distillation tries to learn better/faster approximations of existing paths between samples from the noise and data distributions, the reflow procedure replaces the paths with a new set of paths altogether. We get a new model that gives rise to a set of paths with a lower cost in the “optimal transport” sense. Concretely, this means the paths are less curved. They will also typically connect different pairs of samples than before. In some sense, the mapping from noise to data is “rewired” to be more straight.
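
A sketch of the reflow training objective, assuming the ((\varepsilon, \mathbf{x}_0)) pairs have been pre-generated by sampling from the current model with a deterministic sampler, and a model parameterised to predict the displacement along the straight line connecting each pair:

```python
import torch

def reflow_loss(velocity_model, eps, x0):
    """Rectified flow / reflow objective (sketch): regress the constant displacement
    along the straight line between a paired noise sample and data sample."""
    t = torch.rand(eps.shape[0], 1, 1, 1)        # random position along each segment
    x_t = t * x0 + (1 - t) * eps                 # linear interpolation of the pair
    v_target = x0 - eps                          # straight-line "velocity" target
    return torch.mean((velocity_model(x_t, t) - v_target) ** 2)
```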

Diagram showing the old and new paths associated with data point x0 after applying the reflow procedure. The new path is significantly less curved (though not completely straight), and connects x0 to a different sample from the noise distribution than before.

Lower curvature means we can take fewer, larger steps when sampling from this new model using our favourite sampling algorithm, while still keeping the approximation error at bay. But aside from that, this also greatly increases the efficacy of distillation, presumably because it makes the task easier.

The procedure can be applied recursively, to yield an even straighter set of paths. After an infinite number of applications, the paths should be completely straight. In practice, this only works up to a certain point, because each application of the procedure yields a new model which approximates the previous one, so errors can quickly accumulate. Luckily, only one or two applications are needed to get paths that are mostly straight.

This method was successfully applied to a Stable Diffusion model(18) and followed by a distillation step using a perceptual loss(19). The resulting model produces reasonable samples in a single forward pass. One downside of the method is that each reflow step requires the generation of a dataset of sample pairs (data and corresponding noise) using a deterministic sampling algorithm, which usually needs to be done offline to be practical.

3.5 Consistency distillation & TRACT

As we covered before, diffusion sampling traces a curved path through input space, and at each point on this curve, the diffusion model predicts the tangent direction. What if we had a model that could predict the endpoint of the path on the side of the data distribution instead, allowing us to jump there from anywhere on the path in one step? Then the degree of curvature simply wouldn’t matter.

This is what consistency models(20) do. They look very similar to diffusion models, but they predict a different kind of quantity: an endpoint of the path, rather than a tangent direction. In a sense, diffusion models and consistency models are just two different ways to describe a mapping between noise and data. Perhaps it could be useful to think of consistency models as the “integral form” of diffusion models (or, equivalently, of diffusion models as the “derivative form” of consistency models).

Diagram showing the difference between the predictions from a diffusion model (grey) and a consistency model (blue). The former predicts a tangent direction to the path, the latter predicts the endpoint of the path on the data side.

While it is possible to train a consistency model from scratch (though not that straightforward, in my opinion – more on this later), a more practical route to obtaining a consistency model is to train a diffusion model first, and then distill it. This process is called consistency distillation.

It’s worth noting that the resulting model looks quite similar to what we get when distilling the diffusion sampling procedure into a single forward pass. However, that only lets us jump from one endpoint of a path (at the noise side) to the other (at the data side). Consistency models are able to jump to the endpoint on the data side from anywhere on the path.

Learning to map any point on a path to its endpoint requires paired data, so it would seem that we once again need to run the full sampling process to obtain training targets from the teacher model, which is expensive. However, this can be avoided using a bootstrapping mechanism where, in addition to learning from the teacher, the student also learns from itself.

This hinges on the following principle: the prediction of the consistency model along all points on the path should be the same. Therefore, if we take a step along the path using the teacher, the student’s prediction should be unchanged. Let (f(\mathbf{x}_t, t)) represent the student (a consistency model), then we have:

[f(\mathbf{x}_{t - \Delta t}, t - \Delta t) \equiv f(\mathbf{x}_t, t),]

where (\Delta t) is the step size and (\mathbf{x}_{t - \Delta t}) is the result of a sampling step starting from (\mathbf{x}_t), with the update direction given by the teacher. The prediction remains consistent along all points on the path, which is where the name comes from. Note that this is not at all true for diffusion models.
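
In code, one consistency distillation training step might look roughly like this, reusing the hypothetical `ddim_step` helper and `noise_schedule` from the progressive distillation sketch; following the consistency models paper, the target is produced by an EMA copy of the student evaluated at the point reached after one teacher step:

```python
import torch

def consistency_distillation_loss(student, student_ema, teacher, x_t, t, dt, noise_schedule):
    """Enforce f(x_t, t) == f(x_{t-dt}, t-dt) along a teacher-defined path (sketch)."""
    with torch.no_grad():
        x_prev = ddim_step(teacher, x_t, t, t - dt, noise_schedule)  # one teacher step along the path
        target = student_ema(x_prev, t - dt)                         # self-teacher (EMA weights)
    return torch.mean((student(x_t, t) - target) ** 2)               # predictions must agree
```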

Concurrently with the consistency models paper, transitive closure time-distillation (TRACT)(21) was proposed as an improvement over progressive distillation, using a very similar bootstrapping mechanism. The implementation details differ: rather than predicting the endpoint of a path from any point on the path, as consistency models do, TRACT divides the range of time steps into intervals, with the distilled model predicting points on paths at the boundaries of those intervals.

Diagram showing how TRACT divides the time step range into intervals. From any point on the path, the student is trained to predict the point corresponding to the left boundary of the interval the current point is in. This is the same target as for consistency models, but applied separately to non-overlapping segments of the path, rather than to the path as a whole.

Like progressive distillation, this is a procedure that can be repeated with fewer and fewer intervals, to eventually end up with something that looks pretty much the same as a consistency model (when using a single interval that encompasses the entire time step range). TRACT was proposed as an alternative to progressive distillation which requires fewer distillation stages, thus reducing the potential for error accumulation.

It is well-known that diffusion models benefit significantly from weight averaging(22) (23), so both TRACT and the original formulation of consistency models use an exponential moving average (EMA) of the student’s weights to construct a self-teacher model, which effectively acts as an additional teacher in the distillation process, alongside the diffusion model. That said, a more recent iteration of consistency models(24) does not use EMA.

Another strategy to improve consistency models is to use alternative loss functions for distillation, such as a perceptual loss like LPIPS(19), instead of the usual mean squared error (MSE), which we’ve also seen used before with rectified flow(17).

Recent work on distilling a Stable Diffusion model into a latent consistency model(25) has yielded compelling results, producing high-resolution images in 1 to 4 sampling steps.

Consistency trajectory models(26) are a generalisation of both diffusion models and consistency models, enabling prediction of any point along a path from any other point before it, as well as tangent directions. To achieve this, they are conditioned on two time steps, indicating the start and end positions. When both time steps are the same, the model predicts the tangent direction, like a diffusion model would.

3.6 BOOT: data-free distillation

Instead of predicting the endpoint of a path at the data side from any point on that path, as consistency models learn to do, we can try to predict any point on the path from its endpoint at the noise side. This is what BOOT(27) does, providing yet another way to describe a mapping between noise and data. Comparing this formulation to consistency models, one looks like the “transpose” of the other (see diagram below). For those of you who remember word2vec(28), it reminds me a lot of the relationship between the skip-gram and continuous bag-of-words (CBoW) methods!

Diagram showing the inputs and prediction targets for the student in consistency distillation (top) and BOOT (bottom), based on Figure 2 in Gu et al. 2023.

Just like consistency models, this formulation enables a form of bootstrapping to avoid having to run the full sampling procedure using the teacher (hence the name, I presume): predict (\mathbf{x}_t = f(\varepsilon, t)) using the student, run a teacher sampling step to obtain (\mathbf{x}_{t - \Delta t}), then train the student so that (f(\varepsilon, t - \Delta t) \equiv \mathbf{x}_{t - \Delta t}).

Because the student only ever takes the noise (\varepsilon) as input, we do not need any training data to perform distillation. This is also the case when we directly distill the diffusion sampling procedure into a single forward pass – though of course in that case, we can’t avoid running the full sampling procedure using the teacher.

There is one big caveat however: it turns out that predicting (\mathbf{x}_t) is actually quite hard to learn. But there is a neat workaround for this: instead of predicting (\mathbf{x}_t) directly, we first convert it into a different target using the identity (\mathbf{x}_t = \alpha_t \mathbf{x}_0 + \sigma_t \varepsilon). Since (\varepsilon) is given, we can rewrite this as (\mathbf{x}_0 = \frac{\mathbf{x}_t - \sigma_t \varepsilon}{\alpha_t}), which corresponds to an estimate of the clean input. Whereas (\mathbf{x}_t) looks like a noisy image, this single-step (\mathbf{x}_0) estimate looks like a blurry image instead, lacking high-frequency content. This is a lot easier for a neural network to predict.

If we see (\mathbf{x}_t) as a mixture of signal and noise, we are basically extracting the “signal” component and predicting that instead. We can easily convert such a prediction back to a prediction of (\mathbf{x}_t) using the same formula. Just like (\mathbf{x}_t) traces a path through input space which can be described by an ODE, this time-dependent (\mathbf{x}_0)-estimate does as well. The BOOT authors call the ODE describing this path the signal-ODE.

Unlike in the original consistency models formulation (as well as TRACT), no exponential moving average is used for the bootstrapping procedure. To reduce error accumulation, the authors suggest using a higher-order solver to run the teacher sampling step. Another requirement to make this method work well is an auxiliary “boundary loss”, ensuring the distilled model is well-behaved at (t = T) (i.e. at the highest noise level).
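
A rough sketch of one BOOT training step, under the same assumptions as before (an x0-predicting teacher, the `ddim_step` helper, and `noise_schedule(t)` returning ((\alpha_t, \sigma_t))); here the student takes only the noise and a time step as input, and the auxiliary boundary loss at (t = T) is left out:

```python
import torch

def boot_loss(student, teacher, t, dt, noise_schedule, shape):
    """Data-free bootstrapped distillation step (sketch), using the signal parameterisation."""
    eps = torch.randn(shape)                            # the only input: pure noise
    alpha, sigma = noise_schedule(t)
    alpha_p, sigma_p = noise_schedule(t - dt)
    with torch.no_grad():
        x0_hat = student(eps, t)                        # student's signal estimate at time t
        x_t = alpha * x0_hat + sigma * eps              # reassemble the noisy input x_t
        x_prev = ddim_step(teacher, x_t, t, t - dt, noise_schedule)  # one teacher step
        x0_target = (x_prev - sigma_p * eps) / alpha_p  # convert back to a signal estimate
    return torch.mean((student(eps, t - dt) - x0_target) ** 2)
```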

3.7 Sampling with neural operators

Diffusion sampling with neural operators (DSNO; also known as DFNO, the acronym seems to have changed at some point!)(29) works by training a model that can predict an entire path from noise to data given a noise sample in a single forward pass. While the inputs ((\varepsilon)) and targets ((\mathbf{x}_t) at various (t)) are the same as for a BOOT-distilled student model, the latter is only able to produce a single point on the path at a time.

This seems ambitious – how can a neural network predict an entire path at once, from noise all the way to data? The so-called Fourier neural operator (FNO)(30) is used to achieve this. By imposing certain architectural constraints, adding temporal convolution layers and making use of the Fourier transform to represent functions of time in frequency space, we obtain a model that can produce predictions for any number of time steps at once.

A natural question is then: why would we actually want to predict the entire path? When sampling, we only really care about the final outcome, i.e. the endpoint of the path at the data side ((t = 0)). For BOOT, the point of predicting the other points on the path is to enable the bootstrapping mechanism used for training. But DSNO does not involve any bootstrapping, so what is the point of doing this here?

The answer probably lies in the inductive bias of the temporal convolution layers, combined with the relative smoothness of the paths through input space learnt by diffusion models. Thanks to this architectural prior, training on other points on the path also helps to improve the quality of the predictions at the endpoint on the data side, that is, the only point on the path we actually care about when sampling in a single step. I have to admit I am not 100% confident that this is the only reason – if there is another compelling reason why this works, please let me know!

3.8 Score distillation sampling

Score distillation sampling (SDS)(31) is a bit different from the methods we’ve discussed so far: rather than accelerating sampling by producing a student model that needs fewer steps for high-quality output, this method is aimed at optimisation of parameterised representations of images. This means that it enables diffusion models to operate on other representations of images than pixel grids, even though that is what they were trained on – as long as those representations produce pixel space outputs that are differentiable w.r.t. their parameters(32).

As a concrete example of this, SDS was actually introduced to enable text-to-3D. This is achieved through optimisation of Neural Radiance Field (NeRF)(33) representations of 3D models, using a pretrained image diffusion model applied to random 2D projections to control the generated 3D models through text prompts (DreamFusion).

Naively, one could think that simply backpropagating the diffusion loss at various time steps through the pixel space output produced by the parameterised representation should do the trick. This yields gradient updates w.r.t. the representation parameters that minimise the diffusion loss, which should make the pixel space output look more like a plausible image. Unfortunately, this method doesn’t work very well, even when applied directly to pixel representations.

It turns out this is primarily caused by a particular factor in the gradient, which corresponds to the Jacobian of the diffusion model itself. This Jacobian is poorly conditioned for low noise levels. Simply omitting this factor altogether (i.e. replacing it with the identity matrix) makes things work much better. As an added bonus, it means we can avoid having to backpropagate through the diffusion model. All we need is forward passes, just like in regular diffusion sampling algorithms!
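
A sketch of what an SDS update then looks like in practice, with a differentiable `render_fn(params)` (e.g. a NeRF rendering a random view), a frozen epsilon-predicting teacher, and a noise-level weighting w; all of these names are placeholders:

```python
import torch

def sds_update(params, render_fn, teacher_eps_fn, noise_schedule, t, w=1.0):
    """One score distillation sampling update (sketch): forward passes only."""
    x = render_fn(params)                       # pixel-space output, differentiable w.r.t. params
    alpha, sigma = noise_schedule(t)
    eps = torch.randn_like(x)
    x_t = alpha * x.detach() + sigma * eps      # noise up the current render
    with torch.no_grad():
        eps_pred = teacher_eps_fn(x_t, t)       # frozen teacher: no backprop through it
    grad = w * (eps_pred - eps)                 # SDS "gradient" w.r.t. the pixels (Jacobian omitted)
    x.backward(gradient=grad)                   # chain through the renderer only
    # params.grad now holds the update direction; apply it with any optimiser
```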

After modifying the gradient in a fairly ad-hoc fashion, it’s worth asking what loss function this modified gradient corresponds to. This is actually the same loss function used in probability density distillation(34), which was originally developed to distill autoregressive models for audio waveform generation into feedforward models. I won’t elaborate on this connection here, except to mention that it provides an explanation for the mode-seeking behaviour that SDS seems to exhibit. This behaviour often results in pathologies, which require additional regularisation loss terms to mitigate. It was also found that using a high guidance scale for the teacher (a higher value than one would normally use to sample images) helps to improve results.

Noise-free score distillation (NFSD)(35) is a variant that modifies the gradient further to enable the use of lower guidance scales, which results in better sample quality and diversity. Variational score distillation sampling (VSD)(36) improves over SDS by optimising a distribution over parameterised representations, rather than a point estimate, which also eliminates the need for high guidance scales.

VSD has in turn been used as a component in more traditional diffusion distillation strategies, aimed at reducing the number of sampling steps. A single-step image generator can easily be reinterpreted as a distribution over parameterised representations, which makes VSD readily applicable to this setting, even if it was originally conceived to improve text-to-3D rather than speed up image generation.

Diff-Instruct(37) can be seen as such an application, although it was actually published concurrently with VSD. To distill the knowledge from a diffusion model into a single-step feed-forward generator, they suggest minimising the integral KL divergence (IKL), which is a weighted integral of the KL divergence along the diffusion process (w.r.t. time). Its gradient is estimated by contrasting the predictions of the teacher and those of an auxiliary diffusion model which is concurrently trained on generator outputs. This concurrent training gives it a bit of a GAN(38) flavour, but note that the generator and the auxiliary model are not adversaries in this case. As with SDS, the gradient of the IKL with respect to the generator parameters only requires evaluating the diffusion model teacher, but not backpropagating through it – though training the auxiliary diffusion model on generator outputs does of course require backpropagation.

Distribution matching distillation (DMD)(39) arrives at a very similar formulation from a different angle. Just like in Diff-Instruct, a concurrently trained diffusion model of the generator outputs is used, and its predictions are contrasted against those of the teacher to obtain gradients for the feed-forward generator. This is combined with a perceptual regression loss (LPIPS(19)) on paired data from the teacher, which is pre-generated offline. The latter is only applied on a small subset of training examples, making the computational cost of this pre-generation step less prohibitive.

3.9 Adversarial distillation

Before diffusion models completely took over in the space of image generation, generative adversarial networks (GANs)(38) offered the best visual fidelity, at the cost of mode-dropping: the diversity of model outputs usually does not reflect the diversity of the training data, but at least they look good. In other words, they trade off diversity for quality. On top of that, GANs generate images in a single forward pass, so they are very fast – much faster than diffusion model sampling.

It is therefore unsurprising that some works have sought to combine the benefits of adversarial models and diffusion models. There are many ways to do so: denoising diffusion GANs(40) and adversarial score matching(41) are just two examples.

A more recent example is UFOGen(42), which proposes an adversarial finetuning approach for diffusion models that looks a lot like distillation, but actually isn’t distillation, in the strict sense of the word. UFOGen combines the standard diffusion loss with an adversarial loss. Whereas the standard diffusion loss by itself would result in a model that tries to predict the conditional expectation (\mathbb{E}\left[\mathbf{x}_0 \mid \mathbf{x}_t \right]), the additional adversarial loss term allows the model to deviate from this and produce less blurry predictions at high noise levels. The result is a reduction in diversity, but it also enables faster sampling. Both the generator and the discriminator are initialised from the parameters of a pre-trained diffusion model, but this pre-trained model is not evaluated to produce training targets, as would be the case in a distillation approach. Nevertheless, it merits inclusion here, as it is intended to achieve the same goal as most of the distillation approaches that we’ve discussed.

Adversarial diffusion distillation(43), on the other hand, is a “true” distillation approach, combining score distillation sampling (SDS) with an adversarial loss. It makes use of a discriminator built on top of features from an image representation learning model, DINO(44), which was previously also used for a purely adversarial text-to-image model, StyleGAN-T(45). The resulting student model enables single-step sampling, but can also be sampled from with multiple steps to improve the quality of the results. This method was used for SDXL Turbo, a text-to-image system that enables realtime generation – the generated image is updated as you type.

4. But what about “no free lunch”?

Why is it that we can get these distilled models to produce compelling samples in just a few steps, when diffusion models take tens or hundreds of steps to achieve the same thing? What about “no such thing as a free lunch”?

At first glance, diffusion distillation certainly seems like a counterexample to what is widely considered a universal truth in machine learning, but there is more to it. Up to a point, diffusion model sampling can probably be made more efficient through distillation at no noticeable cost to model quality, but the regime targeted by most distillation methods (i.e. 1-4 sampling steps) goes far beyond that point, and trades off quality for speed. Distillation is almost always “lossy” in practice, and the student cannot be expected to perfectly mimic the teacher’s predictions. This results in errors which can accumulate across sampling steps, or for some methods, across different phases of the distillation process.

What does this trade-off look like? That depends on the distillation method. For most methods, the decrease in model quality directly affects the perceptual quality of the output: samples from distilled models can often look blurry, or the fine-grained details might look sharp but less realistic, which is especially noticeable in images of human faces. The use of adversarial losses based on discriminators, or perceptual loss functions such as LPIPS(19), is intended to mitigate some of this degradation, by further focusing model capacity on signal content that is perceptually relevant.
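
To illustrate the latter, this is roughly what swapping in an LPIPS term looks like in a distillation loss. A minimal sketch: it assumes `student_pred` and `teacher_target` are image batches scaled to [-1, 1], which is what the lpips package expects.

```python
import torch
import lpips  # pip install lpips -- the learned perceptual metric of Zhang et al.

# LPIPS with a VGG backbone; 'alex' is another commonly used option.
perceptual = lpips.LPIPS(net='vgg')

def perceptual_distillation_loss(student_pred, teacher_target, w_mse=1.0, w_lpips=1.0):
    """Combine a pixel-space MSE with a perceptual (LPIPS) term, so that capacity is
    spent on perceptually relevant differences rather than on exact pixel values."""
    mse = torch.nn.functional.mse_loss(student_pred, teacher_target)
    perc = perceptual(student_pred, teacher_target).mean()
    return w_mse * mse + w_lpips * perc
```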

Some methods preserve output quality and fidelity of high-frequency content to a remarkable degree, but this then usually comes at a cost to the diversity of the samples instead. The adversarial methods discussed earlier are a great example of this, as well as methods based on score distillation sampling, which implicitly optimise a mode-seeking loss function.

So if distillation implies a loss of model quality, is training a diffusion model and then distilling it even worthwhile? Why not train a different type of model instead, such as a GAN, which produces a single-step generator out of the box, without requiring distillation? The key here is that distillation provides us with some degree of control over this trade-off. We gain flexibility: we get to choose how many steps we can afford, and by choosing the right method, we can decide exactly how we’re going to cut corners. Do we care more about fidelity or diversity? It’s our choice!

5. Do we really need a teacher?

Once we have established that diffusion distillation gives us the kind of model that we are after, with the right trade-offs in terms of output quality, diversity and sampling speed, it’s worth asking whether we even needed distillation to arrive at this model to begin with. In a sense, once we’ve obtained a particular model through distillation, that’s an existence proof, showing that such a model is feasible in practice – but it does not prove that we arrived at that model in the most efficient way possible. Perhaps there is a shorter route? Could we train such a model from scratch, and skip the training of the teacher model entirely?

The answer depends on the distillation method. For certain types of models that can be obtained through diffusion distillation, there are indeed alternative training recipes that do not require distillation at all. However, these tend not to work quite as well as the distillation route. Perhaps this is not that surprising: it has long been known that when distilling a large neural network into a smaller one, we can often get better results than when we train that smaller network from scratch(11). The same phenomenon is at play here, because we are distilling a sampling procedure with many steps into one with considerably fewer steps. If we look at the computational graphs of these sampling procedures, the former is much “deeper” than the latter, so what we’re doing looks very similar to distilling a large model into a smaller one.

One instance where you have the choice between distillation and training from scratch is consistency models. The paper that introduced them(20) describes both consistency distillation and consistency training. The latter requires a few tricks to work well, including schedules for some of the hyperparameters to create a kind of “curriculum”, so it is arguably a bit more involved than diffusion model training.
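
To give a flavour of what that curriculum looks like: the number of discretisation steps of the noise schedule is gradually increased during training, so that the targets start out easy (coarse discretisation) and become progressively less biased (fine discretisation). The linear ramp below is a simplified stand-in, not the exact schedule from the paper.

```python
import math

def discretisation_curriculum(step, total_steps, n_min=2, n_max=150):
    """Illustrative curriculum for consistency training: start with very few
    discretisation steps and refine over time. The paper also anneals other
    hyperparameters (such as the EMA decay rate of the target network)."""
    frac = min(step / max(total_steps, 1), 1.0)
    return max(n_min, math.ceil(n_min + frac * (n_max - n_min)))

# Early on, adjacent noise levels are far apart and targets are easy to match;
# towards the end, they are close together, which reduces the bias of the targets.
print(discretisation_curriculum(0, 400_000))        # -> 2
print(discretisation_curriculum(400_000, 400_000))  # -> 150
```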

6. Charting the maze between data and noise

One interesting perspective on diffusion model training that is particularly relevant to distillation, is that it provides a way to uncover an optimal transport map between distributions(46). Through the probability flow ODE formulation(2), we can see that diffusion models learn a bijection between noise and data, and it turns out that this mapping is approximately optimal in some sense.
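
For reference, this is the probability flow ODE associated with a forward diffusion SDE \( \mathrm{d}\mathbf{x} = f(\mathbf{x}, t)\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w} \), in the notation of Song et al.:

\[
\frac{\mathrm{d}\mathbf{x}}{\mathrm{d}t} = f(\mathbf{x}, t) - \frac{1}{2}\, g(t)^2\, \nabla_{\mathbf{x}} \log p_t(\mathbf{x}) .
\]

Solving this ODE from \( t = T \) (noise) down to \( t = 0 \) (data), with the score \( \nabla_{\mathbf{x}} \log p_t(\mathbf{x}) \) replaced by the model’s estimate, deterministically maps every noise sample to a data sample: this is the bijection referred to above.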

This also explains the observation that different diffusion models trained on similar data tend to learn similar mappings: they are all trying to approximate the same optimum! I tweeted (X’ed?) about this a while back:

So far, it seems that diffusion model training is the simplest and most effective (i.e. scalable) way we know of to approximate this optimal mapping, but it is not the only way: consistency training represents a compelling alternative strategy. This makes me wonder what other approaches are yet to be discovered, and whether we might be able to find methods that are even simpler than diffusion model training, or more statistically efficient.

Another interesting link between some of these methods can be found by looking more closely at curvature. The paths connecting samples from the noise and data distributions uncovered by diffusion model training tend to be curved. This is why we need many discrete steps to approximate them accurately when sampling.

We discussed a few approaches to sidestep this issue: consistency models(20) (21) avoid it by changing the prediction target of the model, from the tangent direction at the current position to the endpoint of the curve at the data side. Rectified flow(17) instead replaces the curved paths altogether, with a set of paths that are much straighter. But for perfectly straight paths, the tangent direction will actually point to the endpoint! In other words: in the limiting case of perfectly straight paths, consistency models and diffusion models predict the same thing, and become indistinguishable from each other.
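
To see this concretely, consider a straight path that linearly interpolates between a data sample \( \mathbf{x}_0 \) and a noise sample \( \mathbf{x}_1 \):

\[
\mathbf{x}_t = (1 - t)\,\mathbf{x}_0 + t\,\mathbf{x}_1
\quad \Longrightarrow \quad
\frac{\mathrm{d}\mathbf{x}_t}{\mathrm{d}t} = \mathbf{x}_1 - \mathbf{x}_0 .
\]

The tangent direction is constant along the path, so a single Euler step recovers the endpoint exactly:

\[
\mathbf{x}_t - t\,\frac{\mathrm{d}\mathbf{x}_t}{\mathrm{d}t}
= (1 - t)\,\mathbf{x}_0 + t\,\mathbf{x}_1 - t\,(\mathbf{x}_1 - \mathbf{x}_0)
= \mathbf{x}_0 .
\]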

Is that observation practically relevant? Probably not – it’s just a neat connection. But I think it’s worthwhile to cultivate a deeper understanding of deterministic mappings between distributions and how to uncover them at scale, as well as the different ways to parameterise them and represent them. I think this is fertile ground for innovations in diffusion distillation, as well as generative modelling through iterative refinement in a broader sense.

7. Closing thoughts

As I mentioned at the beginning, this was supposed to be a fairly high-level treatment of diffusion distillation, and why there are so many different ways to do it. I ended up doing a bit of a deep dive, because it’s difficult to talk about the connections between all these methods without also explaining the methods themselves. In reading up on the subject and trying to explain things concisely, I actually learnt a lot. If you want to learn about a particular subject in machine learning research (or really anything else), I can heartily recommend writing a blog post about it.

To wrap things up, I wanted to take a step back and identify a few patterns and trends. Although there is a huge variety of diffusion distillation methods, there are clearly some common tricks and ideas that come back frequently:

  • Using deterministic sampling algorithms to obtain targets from the teacher is something that almost all methods rely on. DDIM(4) is popular, but more advanced methods (e.g. higher-order methods) are also an option. (For reference, the deterministic DDIM update is shown after this list.)

  • The parameters of the student network are usually initialised from those of the teacher. This doesn’t just accelerate convergence; for some methods, it is essential for them to work at all. We can do this because the architectures of the teacher and student are often identical, unlike in distillation of discriminative models.

  • Several methods make use of perceptual losses such as LPIPS(19) to reduce the negative impact of distillation on low-level perceptual quality.

  • Bootstrapping, i.e. having the student learn from itself, is a useful trick to avoid having to run the full sampling algorithm to obtain targets from the teacher. Sometimes using the exponential moving average of the student’s parameters is found to help for this, but this isn’t as clear-cut.

  • Distillation can interact with other modelling choices. One important example is classifier-free guidance(15), which implicitly relies on there being many sampling steps. Guidance operates by modifying the direction in input space predicted by the diffusion model, and the effect of this will inevitably be reduced if only a few sampling steps are taken. For some methods, applying guidance after distillation doesn’t actually make sense anymore, because the student no longer predicts a direction in input space. Luckily guidance distillation(16) can be used to mitigate the impact of this.

  • Another instance of this is latent diffusion(47): when applying distillation to a diffusion model trained in latent space, one important question to address is whether the loss should be applied to the latent representation or to pixels. As an example, the adversarial diffusion distillation (ADD) paper(43) explicitly suggests calculating the distillation loss in pixel space for improved stability.
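
For reference (as mentioned in the first item of the list above), the deterministic DDIM update maps a sample at noise level \( t \) to noise level \( t-1 \) using the model’s noise prediction \( \varepsilon_\theta(\mathbf{x}_t, t) \):

\[
\mathbf{x}_{t-1}
= \sqrt{\bar{\alpha}_{t-1}}\,
  \underbrace{\frac{\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t}\,\varepsilon_\theta(\mathbf{x}_t, t)}{\sqrt{\bar{\alpha}_t}}}_{\hat{\mathbf{x}}_0}
\;+\; \sqrt{1 - \bar{\alpha}_{t-1}}\,\varepsilon_\theta(\mathbf{x}_t, t) .
\]

Because no fresh noise is injected, repeatedly applying this update traces out a deterministic trajectory, which is what makes the teacher’s sampling path usable as a distillation target.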

The procedure of first solving a problem as well as possible, and then looking for shortcuts that yield acceptable trade-offs, is very effective in machine learning in general. Diffusion distillation is a quintessential example of this. There is still no such thing as a free lunch, but diffusion distillation enables us to cut corners with intention, and that’s worth a lot!

If you would like to cite this post in an academic context, you can use this BibTeX snippet:

8. Acknowledgements

9. References

  1. Ho, Jain, Abbeel, “Denoising Diffusion Probabilistic Models”, Neural Information Processing Systems, 2020.

  2. Song, Sohl-Dickstein, Kingma, Kumar, Ermon and Poole, “Score-Based Generative Modeling through Stochastic Differential Equations”, International Conference on Learning Representations, 2021.

  3. Karras, Aittala, Aila, Laine, “Elucidating the Design Space of Diffusion-Based Generative Models”, Neural Information Processing Systems, 2022.

  4. Song, Meng, Ermon, “Denoising Diffusion Implicit Models”, International Conference on Learning Representations, 2021.

  5. Jolicoeur-Martineau, Li, Piché-Taillefer, Kachman, Mitliagkas, “Gotta Go Fast When Generating Data with Score-Based Models”, arXiv, 2021.

  6. Dockhorn, Vahdat, Kreis, “GENIE: Higher-Order Denoising Diffusion Solvers”, Neural Information Processing Systems, 2022.

  7. Liu, Ren, Lin, Zhao, “Pseudo Numerical Methods for Diffusion Models on Manifolds”, International Conference on Learning Representations, 2022.

  8. Zhang, Chen, “Fast Sampling of Diffusion Models with Exponential Integrator”, International Conference on Learning Representations, 2023.

  9. Lu, Zhou, Bao, Chen, Li, Zhu, “DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps”, Neural Information Processing Systems, 2022.

  10. Lu, Zhou, Bao, Chen, Li, Zhu, “DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models”, arXiv, 2022.

  11. Hinton, Vinyals, Dean, “Distilling the Knowledge in a Neural Network”, NeurIPS Deep Learning Workshop, 2014.

  12. Luo, “A Comprehensive Survey on Knowledge Distillation of Diffusion Models”, arXiv, 2023.

  13. Luhman, Luhman, “Knowledge Distillation in Iterative Generative Models for Improved Sampling Speed”, arXiv, 2021.

  14. Salimans, Ho, “Progressive Distillation for Fast Sampling of Diffusion Models”, International Conference on Learning Representations, 2022.

  15. Ho, Salimans, “Classifier-Free Diffusion Guidance”, Neural Information Processing Systems, 2021.

  16. Meng, Rombach, Gao, Kingma, Ermon, Ho, Salimans, “On Distillation of Guided Diffusion Models”, Computer Vision and Pattern Recognition, 2023.

  17. Liu, Gong, Liu, “Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow”, International Conference on Learning Representations, 2023.

  18. Liu, Zhang, Ma, Peng, Liu, “InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation”, arXiv, 2023.

  19. Zhang, Isola, Efros, Shechtman, Wang, “The Unreasonable Effectiveness of Deep Features as a Perceptual Metric”, Computer Vision and Pattern Recognition, 2018.

  20. Song, Dhariwal, Chen, Sutskever, “Consistency Models”, International Conference on Machine Learning, 2023.

  21. Berthelot, Autef, Lin, Yap, Zhai, Hu, Zheng, Talbott, Gu, “TRACT: Denoising Diffusion Models with Transitive Closure Time-Distillation”, arXiv, 2023.

  22. Song, Ermon, “Improved Techniques for Training Score-Based Generative Models”, Neural Information Processing Systems, 2020.

  23. Karras, Aittala, Lehtinen, Hellsten, Aila, Laine, “Analyzing and Improving the Training Dynamics of Diffusion Models”, arXiv, 2023.

  24. Song, Dhariwal, “Improved Techniques for Training Consistency Models”, International Conference on Learning Representations, 2024.

  25. Luo, Tan, Huang, Li, Zhao, “Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference”, arXiv, 2023.

  26. Kim, Lai, Liao, Murata, Takida, Uesaka, He, Mitsufuji, Ermon, “Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion”, International Conference on Learning Representations, 2024.

  27. Gu, Zhai, Zhang, Liu, Susskind, “BOOT: Data-free Distillation of Denoising Diffusion Models with Bootstrapping”, arXiv, 2023.

  28. Mikolov, Chen, Corrado, Dean, “Efficient Estimation of Word Representations in Vector Space”, International Conference on Learning Representations, 2013.

  29. Zheng, Nie, Vahdat, Azizzadenesheli, Anandkumar, “Fast Sampling of Diffusion Models via Operator Learning”, International Conference on Machine Learning, 2023.

  30. Li, Kovachki, Azizzadenesheli, Liu, Bhattacharya, Stuart, Anandkumar, “Fourier neural operator for parametric partial differential equations”, International Conference on Learning Representations, 2021.

  31. Poole, Jain, Barron, Mildenhall, “DreamFusion: Text-to-3D using 2D Diffusion”, arXiv, 2022.

  32. Mordvintsev, Pezzotti, Schubert, Olah, “Differentiable Image Parameterizations”, Distill, 2018.

  33. Mildenhall, Srinivasan, Tancik, Barron, Ramamoorthi, Ng, “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis”, European Conference on Computer Vision, 2020.

  34. Van den Oord, Li, Babuschkin, Simonyan, Vinyals, Kavukcuoglu, van den Driessche, Lockhart, Cobo, Stimberg, Casagrande, Grewe, Noury, Dieleman, Elsen, Kalchbrenner, Zen, Graves, King, Walters, Belov and Hassabis, “Parallel WaveNet: Fast High-Fidelity Speech Synthesis”, International Conference on Machine Learning, 2018.

  35. Katzir, Patashnik, Cohen-Or, Lischinski, “Noise-free Score Distillation”, International Conference on Learning Representations, 2024.

  36. Wang, Lu, Wang, Bao, Li, Su, Zhu, “ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation”, Neural Information Processing Systems, 2023.

  37. Luo, Hu, Zhang, Sun, Li, Zhang, “Diff-Instruct: A Universal Approach for Transferring Knowledge From Pre-trained Diffusion Models”, Neural Information Processing Systems, 2023.

  38. Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville and Bengio, “Generative Adversarial Nets”, Neural Information Processing Systems, 2014.

  39. Yin, Gharbi, Zhang, Shechtman, Durand, Freeman, Park, “One-step Diffusion with Distribution Matching Distillation”, arXiv, 2023.

  40. Xiao, Kreis, Vahdat, “Tackling the Generative Learning Trilemma with Denoising Diffusion GANs”, International Conference on Learning Representations, 2022.

  41. Jolicoeur-Martineau, Piché-Taillefer, Tachet des Combes, Mitliagkas, “Adversarial score matching and improved sampling for image generation”, International Conference on Learning Representations, 2021.

  42. Xu, Zhao, Xiao, Hou, “UFOGen: You Forward Once Large Scale Text-to-Image Generation via Diffusion GANs”, arXiv, 2023.

  43. Sauer, Lorenz, Blattmann, Rombach, “Adversarial Diffusion Distillation”, arXiv, 2023.

  44. Caron, Touvron, Misra, Jégou, Mairal, Bojanowski, Joulin, “Emerging Properties in Self-Supervised Vision Transformers”, International Conference on Computer Vision, 2021.

  45. Sauer, Karras, Laine, Geiger, Aila, “StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis”, International Conference on Machine Learning, 2023.

  46. Khrulkov, Ryzhakov, Chertkov, Oseledets, “Understanding DDPM Latent Codes Through Optimal Transport”, International Conference on Learning Representations, 2023.

  47. Rombach, Blattmann, Lorenz, Esser, Ommer, “High-Resolution Image Synthesis with Latent Diffusion Models”, Computer Vision and Pattern Recognition, 2022.
