Introduction to Stable Diffusion

1. CompVis

https://huggingface.co/CompVis

1.1 CompVis - Computer Vision and Learning LMU Munich

Computer Vision and Learning research group at Ludwig Maximilian University of Munich (formerly Computer Vision Group at Heidelberg University)

The Ludwig Maximilian University of Munich (University of Munich or LMU) is a public research university in Munich, the capital of Bavaria, Germany.

The Ruprecht Karl University of Heidelberg (Heidelberg University) is a public research university in Heidelberg, Baden-Württemberg, Germany.

2. Stable Diffusion Models

Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images given any text input.

stable ['steɪb(ə)l]: n. a stable, a building for keeping horses; adj. steady, firm, secure; v. to put (a horse) in a stable
diffusion [dɪˈfjuːʒ(ə)n]: n. spreading, dissemination, diffusion
latent ['leɪt(ə)nt]: adj. latent, dormant, hidden; n. a latent fingerprint
realistic [ˌrɪəˈlɪstɪk]: adj. realistic, practical, true to life

We recommend you use Stable Diffusion with the 🤗 Diffusers library. You can also use the original CompVis code.

huggingface / diffusers
https://github.com/huggingface/diffusers

CompVis / stable-diffusion
https://github.com/compvis/stable-diffusion

Stable Diffusion with 🧨 Diffusers
https://huggingface.co/blog/stable_diffusion
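
For the 🤗 Diffusers route, a minimal usage sketch looks roughly like the following (it assumes a CUDA GPU, an installed diffusers/transformers environment, and access to the model weights on the Hugging Face Hub):

    import torch
    from diffusers import StableDiffusionPipeline

    # Load the v1-4 weights in half precision and move the pipeline to the GPU.
    pipe = StableDiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
    )
    pipe = pipe.to("cuda")

    prompt = "a photograph of an astronaut riding a horse"
    image = pipe(prompt).images[0]  # a PIL.Image
    image.save("astronaut_rides_horse.png")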

There are variants of the weights depending on:

  1. The library they are intended for.
  2. The training regime. There are 4 training versions: v1-1 through v1-4. Each one was created from the checkpoint of the previous version, and was trained for additional steps on specific variants of the dataset.

regime [reɪˈʒiːm]: n. regime, system of government, management system

Model | Library | Details
stable-diffusion-v1-1 | 🤗 Diffusers | 237k steps at resolution 256x256 on laion2B-en. 194k steps at resolution 512x512 on laion-high-resolution.
stable-diffusion-v1-2 | 🤗 Diffusers | v1-1 plus: 515k steps at 512x512 on “laion-improved-aesthetics”.
stable-diffusion-v1-3 | 🤗 Diffusers | v1-2 plus: 195k steps at 512x512 on “laion-improved-aesthetics”, with 10% dropping of text-conditioning.
stable-diffusion-v1-4 | 🤗 Diffusers | v1-2 plus: 225k steps at 512x512 on “laion-aesthetics v2 5+”, with 10% dropping of text-conditioning.
stable-diffusion-v-1-1-original | CompVis | 237k steps at resolution 256x256 on laion2B-en. 194k steps at resolution 512x512 on laion-high-resolution.
stable-diffusion-v-1-2-original | CompVis | v1-1 plus: 515k steps at 512x512 on “laion-improved-aesthetics”.
stable-diffusion-v-1-3-original | CompVis | v1-2 plus: 195k steps at 512x512 on “laion-improved-aesthetics”, with 10% dropping of text-conditioning.
stable-diffusion-v-1-4-original | CompVis | v1-2 plus: 225k steps at 512x512 on “laion-aesthetics v2 5+”, with 10% dropping of text-conditioning.

2.1 Models
https://huggingface.co/CompVis

2.2 CompVis / stable-diffusion-v-1-4-original

https://huggingface.co/CompVis/stable-diffusion-v-1-4-original

The Stable-Diffusion-v-1-4 checkpoint was initialized with the weights of the Stable-Diffusion-v-1-2 checkpoint and subsequently fine-tuned for 225k steps at resolution 512x512 on “laion-aesthetics v2 5+”, with 10% dropping of the text-conditioning to improve classifier-free guidance sampling.
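
Dropping the text-conditioning for part of the training steps teaches the model to also predict noise unconditionally, which is what classifier-free guidance exploits at sampling time. A small illustrative sketch of the guidance combination (the tensors below are dummies standing in for two U-Net outputs, and 7.5 is a commonly used guidance scale, not a value stated above):

    import torch

    # Dummy stand-ins for the U-Net's noise predictions at one denoising step.
    noise_pred_uncond = torch.randn(1, 4, 64, 64)  # prediction for an empty ("") prompt
    noise_pred_text = torch.randn(1, 4, 64, 64)    # prediction for the actual text prompt
    guidance_scale = 7.5

    # Classifier-free guidance: move the prediction away from the unconditional one
    # and towards the text-conditioned one.
    noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)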

Download the weights

  • sd-v1-4.ckpt
  • sd-v1-4-full-ema.ckpt

These weights are intended to be used with the original CompVis Stable Diffusion codebase https://github.com/CompVis/stable-diffusion.

3. Stable Diffusion with 🧨 Diffusers

https://huggingface.co/blog/stable_diffusion

3.1 How does Stable Diffusion work?

Stable Diffusion is based on a particular type of diffusion model called Latent Diffusion, proposed in High-Resolution Image Synthesis with Latent Diffusion Models https://arxiv.org/abs/2112.10752.

Generally speaking, diffusion models are machine learning systems that are trained to denoise random Gaussian noise step by step, to get to a sample of interest, such as an image.
https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/diffusers_intro.ipynb

Diffusion models have been shown to achieve state-of-the-art results for generating image data. But one downside of diffusion models is that the reverse denoising process is slow because of its repeated, sequential nature. In addition, these models consume a lot of memory because they operate in pixel space, which becomes huge when generating high-resolution images. Therefore, it is challenging to train these models and also to use them for inference.

Latent Diffusion can reduce the memory and compute complexity by applying the diffusion process over a lower dimensional latent space, instead of using the actual pixel space. This is the key difference between standard diffusion and latent diffusion models: in latent diffusion the model is trained to generate latent (compressed) representations of the images.

There are three main components in latent diffusion.

  1. An autoencoder (VAE).
  2. A U-Net.
  3. A text-encoder, e.g. CLIP’s Text Encoder.
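
To make these components concrete, here is a sketch of how each can be loaded individually with 🤗 Diffusers and 🤗 Transformers (model identifiers as used for Stable Diffusion v1-4; this mirrors the pipeline internals rather than being the only way to do it). Each component is described in more detail below.

    from diffusers import AutoencoderKL, UNet2DConditionModel
    from transformers import CLIPTextModel, CLIPTokenizer

    # 1. The autoencoder (VAE): encodes images into latents and decodes latents back to images.
    vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")

    # 2. The U-Net: predicts the noise residual of the noisy latents, conditioned on text embeddings.
    unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet")

    # 3. The text encoder: CLIP's pretrained text encoder together with its tokenizer.
    tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
    text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
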

1. The autoencoder (VAE)
The VAE model has two parts, an encoder and a decoder. The encoder is used to convert the image into a low dimensional latent representation, which will serve as the input to the U-Net model. The decoder, conversely, transforms the latent representation back into an image.

During latent diffusion training, the encoder is used to get the latent representations (latents) of the images for the forward diffusion process, which applies more and more noise at each step. During inference, the denoised latents generated by the reverse diffusion process are converted back into images using the VAE decoder. As we will see during inference we only need the VAE decoder.
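
As an illustrative sketch of the encoder side (vae as loaded above; the image tensor here is a dummy standing in for a preprocessed training image, and 0.18215 is the latent scaling factor used by the Stable Diffusion VAE):

    import torch

    # Dummy preprocessed image: shape (1, 3, 512, 512), values in [-1, 1].
    image = torch.randn(1, 3, 512, 512)
    with torch.no_grad():
        latent_dist = vae.encode(image).latent_dist
        latents = latent_dist.sample() * 0.18215  # latents of shape (1, 4, 64, 64)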

2. The U-Net
The U-Net has an encoder part and a decoder part both comprised of ResNet blocks. The encoder compresses an image representation into a lower resolution image representation and the decoder decodes the lower resolution image representation back to the original higher resolution image representation that is supposedly less noisy. More specifically, the U-Net output predicts the noise residual which can be used to compute the predicted denoised image representation.

To prevent the U-Net from losing important information while downsampling, short-cut connections are usually added from the downsampling ResNets of the encoder to the upsampling ResNets of the decoder. Additionally, the Stable Diffusion U-Net is able to condition its output on text embeddings via cross-attention layers. The cross-attention layers are added to both the encoder and decoder parts of the U-Net, usually between ResNet blocks.
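
A sketch of a single U-Net call (unet as loaded above; latents, a scheduler timestep t, and text_embeddings from the text encoder are assumed to exist already):

    import torch

    with torch.no_grad():
        # Predict the noise residual for the current timestep t, conditioned on the
        # text embeddings through the cross-attention layers.
        noise_pred = unet(latents, t, encoder_hidden_states=text_embeddings).sample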

3. The text-encoder
The text-encoder is responsible for transforming the input prompt, e.g. “An astronaut riding a horse”, into an embedding space that can be understood by the U-Net. It is usually a simple transformer-based encoder that maps a sequence of input tokens to a sequence of latent text-embeddings.

Inspired by Imagen, Stable Diffusion does not train the text-encoder during training and simply uses CLIP's already-trained text encoder, CLIPTextModel.

astronaut [ˈæstrəˌnɔːt]: n. astronaut, spaceman
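
A sketch of this step with the tokenizer and text_encoder loaded earlier; for Stable Diffusion v1 the resulting embeddings have shape (1, 77, 768):

    import torch

    prompt = "An astronaut riding a horse"
    text_input = tokenizer(
        prompt,
        padding="max_length",
        max_length=tokenizer.model_max_length,  # 77 tokens for CLIP
        truncation=True,
        return_tensors="pt",
    )
    with torch.no_grad():
        text_embeddings = text_encoder(text_input.input_ids)[0]  # shape (1, 77, 768)
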
3.1.1 Why is latent diffusion fast and efficient?

Since latent diffusion operates on a low dimensional space, it greatly reduces the memory and compute requirements compared to pixel-space diffusion models. For example, the autoencoder used in Stable Diffusion has a reduction factor of 8. This means that an image of shape (3, 512, 512) becomes a latent of shape (4, 64, 64), so the spatial dimensions are 8 × 8 = 64 times smaller and far less memory is needed.

This is why it’s possible to generate 512 × 512 images so quickly, even on 16GB Colab GPUs!

3.1.2 Stable Diffusion during inference

Let’s now take a closer look at how the model works in inference by illustrating the logical flow.

(Figure: logical flow of Stable Diffusion during inference)

The stable diffusion model takes both a latent seed and a text prompt as input. The latent seed is used to generate random latent image representations of size 64×64, whereas the text prompt is transformed into text embeddings of size 77×768 via CLIP's text encoder.
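
A sketch of the latent-seed side of the input (unet as loaded earlier; for Stable Diffusion v1 the latents have 4 channels and 64×64 spatial size for a 512×512 output image):

    import torch

    generator = torch.manual_seed(0)  # the latent seed
    latents = torch.randn(
        (1, unet.config.in_channels, 64, 64),  # (batch, 4 latent channels, 64, 64)
        generator=generator,
    )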

Next the U-Net iteratively denoises the random latent image representations while being conditioned on the text embeddings. The output of the U-Net, being the noise residual, is used to compute a denoised latent image representation via a scheduler algorithm. Many different scheduler algorithms can be used for this computation, each having its pros and cons.

pros and cons: advantages and disadvantages, arguments for and against

For Stable Diffusion, we recommend using one of the schedulers implemented in 🧨 Diffusers, such as the PNDM scheduler (used by default), the DDIM scheduler, or the K-LMS scheduler.

Theory on how the scheduler algorithms function is out of scope for this notebook, but in short one should remember that they compute the predicted denoised image representation from the previous noise representation and the predicted noise residual.
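
Putting the pieces together, a minimal sketch of the denoising loop with the K-LMS scheduler (unet, latents, and text_embeddings as in the earlier sketches; the scheduler hyperparameters are the ones commonly used for Stable Diffusion v1):

    import torch
    from diffusers import LMSDiscreteScheduler

    scheduler = LMSDiscreteScheduler(
        beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", num_train_timesteps=1000
    )
    scheduler.set_timesteps(50)  # ca. 50 denoising steps
    latents = latents * scheduler.init_noise_sigma

    for t in scheduler.timesteps:
        latent_model_input = scheduler.scale_model_input(latents, t)
        with torch.no_grad():
            noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample
        # The scheduler turns the predicted noise residual into the next, less noisy latents.
        latents = scheduler.step(noise_pred, t, latents).prev_sample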

The denoising process is repeated ca. 50 times to step-by-step retrieve better latent image representations. Once complete, the latent image representation is decoded by the decoder part of the variational autoencoder.
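
A sketch of this final decoding and post-processing step (vae and latents as above; the division undoes the 0.18215 scaling applied when the latents were created):

    import torch
    from PIL import Image

    with torch.no_grad():
        image = vae.decode(latents / 0.18215).sample  # back to pixel space, values roughly in [-1, 1]

    image = (image / 2 + 0.5).clamp(0, 1)             # map to [0, 1]
    image = image.permute(0, 2, 3, 1).cpu().numpy()[0]
    Image.fromarray((image * 255).round().astype("uint8")).save("output.png")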

Circa is a word of Latin origin meaning approximately.
Circa (c., ca., ca or cca.): approximately, around
