A Concise Overview of Diffusion Models (2024): A Short Survey, To Be Continued

Article link: https://blog.csdn.net/Yonggie/article/details/136055039

The development of diffusion models is traced below in chronological order.

References

https://spaces.ac.cn/archives/9119 DDPM
https://spaces.ac.cn/archives/9181 DDIM
https://www.bilibili.com/video/BV1SG411F7c9 10:22 DDPM
https://github.com/zoubohao/DenoisingDiffusionProbabilityModel-ddpm-
https://lilianweng.github.io/posts/2021-07-11-diffusion-models English summary
https://www.bilibili.com/video/BV17r4y1u77B

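Notation

The following notation is used throughout:

$x$, image
$\epsilon$, Gaussian noise
$\mu$, function that reconstructs the original image
$\theta$, neural network parameters
$gt$, short for ground truth
$c$, multi-class classifier
$f_\theta$, the trained diffusion model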

Diffusion Model

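Diffusion Probabilistic Model. Original paper: Deep Unsupervised Learning using Nonequilibrium Thermodynamics, ICML'15 (proceedings in PMLR).

Forward

$x_{t}=\alpha_{t} x_{t-1}+\beta_{t} \epsilon_t$
$\alpha_t^2+\beta_t^2=1$
Note that the forward process has no learnable parameters, only manually chosen hyperparameters.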
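The forward process can be sketched in a few lines of NumPy. This is a minimal illustration, not taken from any of the referenced implementations: the constant schedule `alphas`, the image size, and the helper name `forward_step` are all assumptions made for the example. It shows that with $\alpha^2+\beta^2=1$ the signal fades while the overall scale stays near 1.

```python
import numpy as np

def forward_step(x_prev, alpha_t, rng):
    """One forward step: x_t = alpha_t * x_{t-1} + beta_t * eps."""
    beta_t = np.sqrt(1.0 - alpha_t ** 2)      # enforces alpha_t^2 + beta_t^2 = 1
    eps = rng.standard_normal(x_prev.shape)   # fresh Gaussian noise
    return alpha_t * x_prev + beta_t * eps, eps

rng = np.random.default_rng(0)
x = np.ones((64, 64))              # stand-in for an image (hypothetical)
alphas = np.full(1000, 0.995)      # hypothetical constant schedule, T = 1000

for alpha_t in alphas:
    x, _ = forward_step(x, alpha_t, rng)

# After T steps the original signal is scaled by prod(alphas), which is tiny,
# so x is close to pure standard Gaussian noise.
print(np.prod(alphas), x.mean(), x.std())
```

Nothing here is trained: the only knobs are the schedule values, which matches the remark that the forward process contains hyperparameters but no parameters.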

Reverse

The reconstruction function $\mu$ is applied directly:
$x_{t-1}=\mu_\theta(x_t)$

Contribution

Proposed the initial model that applies diffusion to images.

DDPM

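Denoising Diffusion Probabilistic Models, NeurIPS'20

Contribution

The reconstruction function of the reverse process no longer reconstructs the whole image; it reconstructs only the noise.
A U-Net is used to predict the noise.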

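Since the forward process is
$x_{t}=\alpha_{t} x_{t-1}+\beta_{t} \epsilon_t$
solving for $x_{t-1}$ gives the reverse step
$x_{t-1}=\frac{1}{\alpha_t}(x_{t}-\beta_{t} \epsilon_t)$
Here $\epsilon_t$ is predicted directly by a neural network instead of reconstructing the whole image, which gives:
$x_{t-1}=\mu(x_t)=\frac{1}{\alpha_t}(x_{t}-\beta_{t} \epsilon_{\theta}(x_t,t))$
where $\epsilon_{\theta}(x_t,t)$ is the noise reconstruction function, a neural network.
The objective function is the MSE between the true reconstructed image and the predicted reconstructed image:
$||x_{t-1} - \mu(x_t)||^2$

Relationship between the loss functions

Let us look at the loss function more closely.
$||x_{t-1} - \mu(x_t)||^2$
Substituting $\frac{1}{\alpha_t}(x_{t}-\beta_{t} \epsilon_{\theta}(x_t,t))$ for $\mu(x_t)$ and $\frac{1}{\alpha_t}(x_{t}-\beta_{t} \epsilon_{gt})$ for $x_{t-1}$ gives:
$||x_{t-1} - \mu(x_t)||^2=\frac{\beta_t^2}{\alpha_t^2}||\epsilon_{gt}-\epsilon_{\theta}(x_t,t)||^2$
where $\epsilon_{gt}$ is the true noise, i.e. the noise added in the forward step.

The final objective function drops the constant factor and writes $x_t$ in terms of $x_0$, with $\bar{\alpha}_t=\alpha_1\cdots\alpha_t$, $\bar{\beta}_t=\sqrt{1-\bar{\alpha}_t^2}$, $\bar{\varepsilon}_{t-1}$ the accumulated noise of the first $t-1$ steps, and $\varepsilon_t$ the true noise of step $t$:
$||\varepsilon_t-\epsilon_\theta(\bar{\alpha}_t x_0+\alpha_t\bar{\beta}_{t-1}\bar{\varepsilon}_{t-1}+\beta_t\varepsilon_t,t)||^2$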
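To make the final objective concrete, here is a rough NumPy sketch of how one training sample and its loss could be built. It is an illustration under assumptions, not the reference implementation: `make_schedule`, `noise_prediction_loss`, the constant schedule, and the placeholder `zero_model` (standing in for the U-Net) are all hypothetical names, and the cumulative coefficients follow the convention $\bar{\alpha}_t=\alpha_1\cdots\alpha_t$, $\bar{\beta}_t=\sqrt{1-\bar{\alpha}_t^2}$.

```python
import numpy as np

def make_schedule(alphas):
    """Cumulative coefficients: bar_alpha_t = alpha_1 * ... * alpha_t,
    bar_beta_t = sqrt(1 - bar_alpha_t^2)."""
    bar_alphas = np.cumprod(alphas)
    bar_betas = np.sqrt(1.0 - bar_alphas ** 2)
    return bar_alphas, bar_betas

def noise_prediction_loss(eps_model, x0, t, alphas, rng):
    """Build x_t from x_0 in closed form, then score the predicted noise with MSE."""
    bar_alphas, bar_betas = make_schedule(alphas)
    alpha_t = alphas[t]
    beta_t = np.sqrt(1.0 - alpha_t ** 2)
    # t is a 0-based index here; at the first step there is no earlier noise.
    bar_beta_prev = bar_betas[t - 1] if t > 0 else 0.0
    eps_prev = rng.standard_normal(x0.shape)  # accumulated noise of earlier steps
    eps_t = rng.standard_normal(x0.shape)     # noise of step t: the regression target
    x_t = bar_alphas[t] * x0 + alpha_t * bar_beta_prev * eps_prev + beta_t * eps_t
    return np.mean((eps_t - eps_model(x_t, t)) ** 2)

rng = np.random.default_rng(0)
alphas = np.full(1000, 0.995)                    # hypothetical schedule
x0 = rng.standard_normal((32, 32))               # stand-in for a training image
zero_model = lambda x_t, t: np.zeros_like(x_t)   # placeholder for the U-Net
loss = noise_prediction_loss(zero_model, x0, t=500, alphas=alphas, rng=rng)
print(loss)  # a predictor that always outputs zero scores roughly the noise variance
```

In real training, `eps_model` would be the U-Net and `t` would be drawn at random for every sample; the closed form means no step-by-step forward simulation is needed to obtain $x_t$.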

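Only a small part is listed here; training involves quite a few more tricks, so please refer to Su Jianlin's blog for the details!

Drawbacks

The hyperparameter $T$ has to be very large (the paper uses $T=1000$), so sampling, which needs one network evaluation per step, is very slow.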
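The cost of a large $T$ shows up in the sampling loop. The sketch below is a simplified illustration: it uses the reverse mean $\frac{1}{\alpha_t}(x_t-\beta_t\epsilon_\theta(x_t,t))$, adds fresh noise with an assumed $\sigma_t=\beta_t$ (one common choice, not the only one), and replaces the U-Net with a hypothetical `counting_model` that only counts how often it is called.

```python
import numpy as np

def sample(eps_model, shape, alphas, rng):
    """Ancestral sampling sketch: start from noise and apply T reverse steps."""
    betas = np.sqrt(1.0 - alphas ** 2)
    x = rng.standard_normal(shape)              # x_T: pure Gaussian noise
    for t in reversed(range(len(alphas))):      # one network call per step
        x = (x - betas[t] * eps_model(x, t)) / alphas[t]
        if t > 0:
            # Assumption: sigma_t = beta_t; no noise is added at the last step.
            x = x + betas[t] * rng.standard_normal(shape)
    return x

calls = {"n": 0}

def counting_model(x_t, t):
    """Hypothetical stand-in for the U-Net; it only records the call count."""
    calls["n"] += 1
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
alphas = np.full(1000, 0.995)                   # hypothetical schedule, T = 1000
img = sample(counting_model, (8, 8), alphas, rng)
print(calls["n"])  # 1000
```

With a real U-Net, each of those calls is a full forward pass through the network, which is why generating a single image is slow.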

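Postscript

The original implementation is in TensorFlow. This implementation is a clear and concise reference: https://github.com/zoubohao/DenoisingDiffusionProbabilityModel-ddpm-
The U-Net serves as the noise predictor (the noise reconstruction function).

For the structure of the U-Net currently used in diffusion models, see https://zhuanlan.zhihu.com/p/642354007 (the first half is enough).

Note that most current implementations of DDPM and DDIM name things according to the Bayesian view taken in DDIM. Before reading DDIM, it can help to rename things slightly. I recommend my own rewrite on GitHub, https://github.com/Yonggie/easier_reading_ddpm_diffusion, which avoids the Bayesian naming and is relatively easy to follow.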

DDIM

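Denoising Diffusion Implicit Models, ICLR'21

idea

Sampling no longer walks through all $T$ steps of noise prediction and diffusion. Instead, the noise predicted at step $t$ is used to reconstruct the image directly, so intermediate steps can be skipped.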
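A sketch of the deterministic DDIM update (the $\eta=0$ case) may help here. It is written under assumptions: the coefficients follow the convention $\bar{\alpha}^2+\bar{\beta}^2=1$, the function name `ddim_step` and the 20-step subsequence are made up for the example, and the noise predictor is an oracle that knows $x_0$, used only to check the algebra.

```python
import numpy as np

def ddim_step(x_t, eps_pred, bar_alpha_t, bar_alpha_s):
    """Jump from step t to an earlier step s using the noise predicted at t."""
    bar_beta_t = np.sqrt(1.0 - bar_alpha_t ** 2)
    bar_beta_s = np.sqrt(1.0 - bar_alpha_s ** 2)
    x0_pred = (x_t - bar_beta_t * eps_pred) / bar_alpha_t  # image estimate from step t
    x_s = bar_alpha_s * x0_pred + bar_beta_s * eps_pred    # re-noise to level s
    return x_s, x0_pred

rng = np.random.default_rng(0)
T = 1000
bar_alphas = np.cumprod(np.full(T, 0.995))      # hypothetical schedule
bar_betas = np.sqrt(1.0 - bar_alphas ** 2)

x0 = rng.standard_normal((8, 8))                # "true" image, known only to the oracle
x = bar_alphas[-1] * x0 + bar_betas[-1] * rng.standard_normal(x0.shape)

def oracle_eps(x_t, t):
    """Hypothetical perfect noise predictor (a real model would be a U-Net)."""
    return (x_t - bar_alphas[t] * x0) / bar_betas[t]

steps = np.linspace(T - 1, 0, 20).astype(int)   # visit 20 of the 1000 steps
for t, s in zip(steps[:-1], steps[1:]):
    x, x0_pred = ddim_step(x, oracle_eps(x, t), bar_alphas[t], bar_alphas[s])

print(np.abs(x0_pred - x0).max())  # near zero with the oracle predictor
```

The loop makes 19 predictor calls instead of 1000. With a learned predictor the image estimate is only approximate, which is where the quality trade-off comes from.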

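Contribution

High efficiency: proposes a way to trade some sample quality for faster inference.

Drawbacks

Sacrifices diversity
Results are still not good enough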

Classifier Gradient Guidance

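Diffusion Models Beat GANs on Image Synthesis, NeurIPS'21

Idea

Classifier gradient guidance: during the reverse process, a pretrained multi-class classifier is added to judge the class of the image at the current step, and the gradient of the classifier guides the diffusion process toward more realistic images.
$c(x_t) \rightarrow \{a,b,c,d,\dots\}$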
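How a classifier gradient can steer the noise prediction is sketched below. The classifier is a toy linear softmax model so that the gradient of $\log p(y\mid x_t)$ can be written analytically; a real setup would use a neural classifier trained on noisy images and automatic differentiation. The update $\hat{\epsilon}=\epsilon_\theta-s\,\bar{\beta}_t\nabla_{x_t}\log p(y\mid x_t)$ is one common form of guidance for noise-prediction models; the names `log_softmax_grad`, `guided_eps`, and `scale` are assumptions for this example.

```python
import numpy as np

def log_softmax_grad(W, x, y):
    """Gradient of log p(y | x) w.r.t. x for a linear softmax classifier
    with logits W @ x (toy stand-in for the pretrained classifier c)."""
    logits = W @ x
    p = np.exp(logits - logits.max())   # numerically stable softmax
    p /= p.sum()
    onehot = np.zeros_like(p)
    onehot[y] = 1.0
    return W.T @ (onehot - p)

def guided_eps(eps_pred, grad_log_p, bar_beta_t, scale=1.0):
    """Shift the predicted noise so the denoised sample moves toward class y."""
    return eps_pred - scale * bar_beta_t * grad_log_p

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 16))        # 4 classes, 16-dimensional "image"
x_t = rng.standard_normal(16)
eps_pred = rng.standard_normal(16)      # placeholder for the U-Net output

grad = log_softmax_grad(W, x_t, y=2)
eps_hat = guided_eps(eps_pred, grad, bar_beta_t=0.8, scale=2.0)
print(np.linalg.norm(eps_hat - eps_pred))
```

A larger `scale` pushes samples harder toward the target class, typically at the cost of diversity.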

Contribution

Classifier gradient guidance
The first time a diffusion model surpassed GANs on metrics such as FID

conditioned classifier

idea

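Building on gradient guidance, the classifier is replaced by a text language model (or another model), so that the gradient carries text or other information and text-to-image generation becomes possible. There are many papers along this line.
A reverse process with such an added condition is called a conditioned reverse process.

Contribution

Achieved joint text-image generation, which greatly accelerated industrial adoption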

classifier free guidance

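GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
DALLE2
Imagen

idea

Building on guidance, this combines the idea of contrastive learning; similar to the masking in BERT, MAE, etc., positive and negative examples can be defined by hand.
Example:
$f_\theta(x_t,t,y)\ \text{contrastive}\ f_\theta(x_t,t,y_{\text{masked}})$
One branch gets the full condition $y$, the other gets $y$ partially or even fully masked. This yields at least two outputs, a regular one and an augmented one, and the model learns from the difference between them.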
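The two-branch idea can be sketched as follows. In the commonly used formulation, the condition is randomly replaced by a null token during training, and at sampling time the conditional and unconditional outputs are combined as $\epsilon_{\text{uncond}}+s\,(\epsilon_{\text{cond}}-\epsilon_{\text{uncond}})$. The sketch assumes that formulation; `drop_condition`, `cfg_eps`, `toy_model`, the null token `None`, and the parameter names are all hypothetical.

```python
import numpy as np

def drop_condition(y, null_token, p_uncond, rng):
    """Training time: mask the condition completely with probability p_uncond."""
    return null_token if rng.random() < p_uncond else y

def cfg_eps(eps_model, x_t, t, y, null_token, scale):
    """Sampling time: extrapolate from the masked-condition output
    toward the full-condition output."""
    eps_cond = eps_model(x_t, t, y)
    eps_uncond = eps_model(x_t, t, null_token)
    return eps_uncond + scale * (eps_cond - eps_uncond)

def toy_model(x_t, t, y):
    """Hypothetical stand-in for f_theta: the condition adds a constant offset."""
    return 0.1 * x_t + (0.0 if y is None else y)

rng = np.random.default_rng(0)
x_t = rng.standard_normal(4)
y_train = drop_condition(2.0, None, p_uncond=0.1, rng=rng)  # condition fed to the model in training
eps = cfg_eps(toy_model, x_t, t=10, y=2.0, null_token=None, scale=3.0)
print(eps - 0.1 * x_t)  # the condition's offset (2.0) amplified by the scale
```

With `scale=1` this reduces to the ordinary conditional prediction and with `scale=0` to the unconditional one; values above 1 strengthen the influence of the condition without any separate classifier.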

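Contribution

No separately pretrained classifier is needed any more (although large models such as CLIP can still be used)
No fake labels are needed any more
The first to do text-to-image generation well (GLIDE)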

Recent popular industry models

InstantID from Xiaohongshu (InstantX): InstantID: Zero-shot Identity-Preserving Generation in Seconds, https://huggingface.co/spaces/InstantX/InstantID
ControlNet: https://arxiv.org/pdf/2302.05543.pdf, https://github.com/lllyasviel/ControlNet
insightface swapper
DALLE3 (closed source)