AIGC-Stable Diffusion

爱吃肉的鹏

Last modified: 2024-05-11 01:14:06

Article tags: AIGC stable diffusion

First published: 2023-12-07 14:07:01

Article link: https://blog.csdn.net/z240626191s/article/details/134851961

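Stable Diffusion is a large generative model. It marks a milestone for AI and signals that the coming era belongs to AIGC. As traditional deep learning work gradually shifts toward AIGC, there is a lot of new material to learn.

If you are an AIGC beginner like me, learning the fundamentals of AIGC models matters a great deal. Stable Diffusion is a powerful and widely applicable model, especially for generative tasks. Studying its basic theory and usage makes it easier to understand how diffusion models generate images and to pick up generation techniques for different scenarios.

In short, Stable Diffusion is a model worth attention: it marks a new direction for AI, and AIGC models are likely to dominate future trends. If that interests you, studying AIGC in depth will pay off. [SD setup and usage are covered at the end of the article.]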

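Before studying Stable Diffusion, it is necessary to understand DDPM.

I gave a brief introduction to DDPM in an earlier article, which you can read if interested: AIGC-从代码角度去理解DDPM(扩散模型) (AIGC: understanding DDPM (diffusion models) from the code).

My local environment is limited (GPU memory and compute), so some of the analysis may be shallow. Please bear with me.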

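Stable Diffusion (SD) is a generative model developed by researchers from CompVis and Runway, with compute support from Stability AI and training data from LAION. It handles text-to-image and image-to-image tasks, and it also underpins later extensions for controllable generation such as ControlNet.

As the name suggests, SD contains the word "Diffusion", which means that, like DDPM, it has a denoising process. The image-to-image task additionally involves a noising process.

This article focuses on the text-to-image task and how SD is applied to it.

Text-to-image means that the user enters a piece of text and, after a number of iterations, the model outputs an image that matches the description.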

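Components of the SD model

The SD model mainly consists of the following parts:

1. CLIP Text Encoder

Role: encodes the text into a feature matrix that can be fed into the SD model.

2. VAE Encoder (the encoder of a variational autoencoder)

Role: produces the Latent Feature, which is fed to the model together with the text features. For image-to-image, the input image is encoded into a Latent Feature; for text-to-image, a randomly sampled Gaussian noise matrix is used as the Latent Feature instead. [In other words, the SD model takes two inputs: text features and latent features.]

3. U-Net

Role: repeatedly predicts the noise, with the text semantic features injected at every prediction step.

4. Scheduler

Role: defines the noise schedule and uses the noise predicted by the U-Net to update the latent at each step (it controls how strongly the predicted noise is applied).

5. VAE Decoder

Role: decodes the final Latent Feature into an image.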

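During the SD iterations (the denoising process), the noise keeps decreasing while the image information and text semantics keep increasing.

The overall flow is roughly: the prompt is encoded by the CLIP text encoder; a latent is initialized (Gaussian noise for text-to-image, or the VAE-encoded image for image-to-image); the U-Net and the scheduler iteratively denoise the latent conditioned on the text features; the VAE decoder turns the final latent into an image.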

This article uses Stable Diffusion 1.5. The weights are available on Baidu Netdisk:

Link: https://pan.baidu.com/s/1D8G9JQo8atakGEs0rZTXTg
Extraction code: yypn

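SD fundamentals

GAN, DDPM, and SD are like any other deep learning algorithm: during training they learn the data distribution of the training set.

Like DDPM, SD has a diffusion process (adding noise) and a generation process (removing noise).

In the forward diffusion process, noise is added step by step until the data becomes random Gaussian noise. In the generation process, a noisy image is denoised step by step to obtain the final image. Both the noising and the denoising process are Markov chains.

Forward diffusion (adding noise):

Forward diffusion is simply repeated noising: we keep adding noise to an image until it becomes a random noise matrix. The number of noising steps, and how much noise each step adds, is controlled by the Scheduler mentioned above.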
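
The noising step can be written in a few lines. This is a minimal sketch, assuming the DDPM closed form x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise and a linear beta schedule from 1e-4 to 0.02 over 1000 steps. The names make_alpha_bars and add_noise are made up for illustration, and the scheduler shipped with SD uses a different schedule.

```python
import math
import random

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed values)."""
    alpha_bars, running = [], 1.0
    for step in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * step / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, t, alpha_bars, rng):
    """Jump directly from x_0 to x_t; returns the noisy sample and the noise used."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    signal_scale = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    xt = [signal_scale * x + noise_scale * n for x, n in zip(x0, noise)]
    return xt, noise

alpha_bars = make_alpha_bars()
rng = random.Random(0)
x0 = [0.5, -0.2, 0.9, -1.0]  # a tiny stand-in for image pixels in [-1, 1]
x_early, _ = add_noise(x0, 10, alpha_bars, rng)   # still close to x0
x_late, _ = add_noise(x0, 999, alpha_bars, rng)   # almost pure Gaussian noise
print(alpha_bars[10], alpha_bars[999])
```

The printed values show how much of the original signal survives: close to 1 at an early step and close to 0 at the last step.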

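Reverse generation (denoising):

Reverse generation is the opposite of forward diffusion: given a noisy sample, the model predicts the noise it contains, and that predicted noise is then removed.

Training therefore builds a loss between the predicted noise and the noise that was actually added [I covered this part in my other article on DDPM].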
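
The training objective can be illustrated with a toy example. This is a sketch, assuming a mean squared error between the true and the predicted noise as in DDPM. noise_prediction_loss and the two stand-in "models" are hypothetical; in SD the predictor is the U-Net.

```python
import random

def noise_prediction_loss(true_noise, predicted_noise):
    """Mean squared error between the noise that was added and the model's prediction."""
    squared = [(p - n) ** 2 for p, n in zip(predicted_noise, true_noise)]
    return sum(squared) / len(squared)

rng = random.Random(0)
true_noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]

# Stand-in "models": one that always predicts zero noise, one that is perfect.
zero_prediction = [0.0] * len(true_noise)
perfect_prediction = list(true_noise)

loss_untrained = noise_prediction_loss(true_noise, zero_prediction)
loss_perfect = noise_prediction_loss(true_noise, perfect_prediction)
print(loss_untrained, loss_perfect)
```

A predictor that ignores its input ends up with a loss near the variance of the noise, while a perfect predictor reaches zero. Training moves the model from the first case toward the second.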

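The CLIP network

As mentioned above, the CLIP text encoder turns text into a feature matrix. CLIP is trained to match text with images, which is what lets the text guide image generation (in the text-to-image setting). A rough analogy: a CNN extracts image features, and the CLIP text encoder used in SD extracts text features.

The CLIP text side can be split into two parts: the Tokenizer and the embedding feature extraction. The Tokenizer converts text into a token sequence that the CLIP model can process: it splits the input natural-language text into suitable tokens so that CLIP can interpret it correctly.

What does tokenization mean? Here is an example. Input: "I love natural language processing."

During tokenization, this sentence might be split into the following word tokens: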

"I"
"love"
"natural"
"language"
"processing"

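Each token is then mapped to a unique integer ID, and later to a vector representation.

Here is a code example:

In this code the input prompt is "a girl, beautiful". padding="max_length" pads the sequence to the maximum length (77 here), and truncation controls whether longer inputs are cut off.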

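from transformers import CLIPTokenizer

# Path to the local SD 1.5 weights (the same directory used in the complete code later)
weights = "F:/Stable_webui/stable-diffusion-webui/models/Stable-diffusion/stable-diffusion-v1-5"
prompt = 'a girl, beautiful'
text_tokenizer = CLIPTokenizer.from_pretrained(weights, subfolder="tokenizer")
text_token_ids = text_tokenizer(
    prompt,
    padding="max_length",  # pad the token sequence up to max_length
    max_length=text_tokenizer.model_max_length,  # the tokenizer's maximum model length (77)
    truncation=True,  # truncate anything beyond the maximum length
    return_tensors="pt"
).input_ids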

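The output is as follows:

The tokenizer converts the prompt into a tensor of shape (1, 77). It contains many 49407 values: the first one marks the end of the text and the rest are padding token IDs. The values before them are the token IDs of the input words.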

tensor([[49406,   320,  1611,   267,  1215, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407]])

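These token IDs are predefined. Open the vocab.json file in the tokenizer folder of the SD model and check whether the token IDs for the prompt "a girl, beautiful" match the output above. First, what does ID 49406 stand for?

Looking it up in the JSON file shows that ID 49406 is the start-of-text token ("<|startoftext|>"), which marks the beginning of the sentence:

Then we look up the token ID of each word in turn. The results are below; ID 49407 marks the end of the sentence.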

"a</w>": 320

"girl</w>": 1611

",</w>": 267

"beautiful</w>": 1215

"<|endoftext|>": 49407

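The lookup above can be mimicked with a toy tokenizer. This is only a sketch: the real CLIPTokenizer uses byte-pair encoding over the full vocab.json, whereas the hypothetical toy_tokenize below splits on words and punctuation and knows only the IDs quoted in this section.

```python
import re

# Only the entries quoted above from vocab.json; the real file is far larger.
TOY_VOCAB = {
    "<|startoftext|>": 49406,
    "a</w>": 320,
    "girl</w>": 1611,
    ",</w>": 267,
    "beautiful</w>": 1215,
    "<|endoftext|>": 49407,
}

def toy_tokenize(prompt, max_length=77):
    """Split into words/punctuation, map to IDs, then pad with the end-of-text ID."""
    pieces = re.findall(r"[a-z]+|[^\sa-z]", prompt.lower())
    pieces = pieces[:max_length - 2]  # truncation, leaving room for start/end tokens
    ids = [TOY_VOCAB["<|startoftext|>"]]
    ids += [TOY_VOCAB[piece + "</w>"] for piece in pieces]
    ids.append(TOY_VOCAB["<|endoftext|>"])
    ids += [TOY_VOCAB["<|endoftext|>"]] * (max_length - len(ids))  # padding
    return ids

token_ids = toy_tokenize("a girl, beautiful")
print(token_ids[:8], len(token_ids))
# [49406, 320, 1611, 267, 1215, 49407, 49407, 49407] 77
```

The first six IDs match the tensor printed earlier, and the remaining positions are filled with 49407.
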
Next we feed the token IDs above into the network for feature extraction:

text_encoder = CLIPTextModel.from_pretrained(weights,subfolder="text_encoder").to('cuda')
text_embeddings = text_encoder(text_token_ids.to("cuda"))[0]

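This yields a tensor of shape (1, 77, 768): each of the 77 tokens is mapped to a 768-dimensional feature vector, and together they form a feature matrix.

Text features like these can be used to compute similarity (cosine similarity). In the full CLIP model, text is compared with images this way; here we compare two pieces of text. For instance, I entered two prompts:

1.prompt: 'a girl, beautiful'

2.prompt: "a woman,beautiful"

The computed similarity is: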

[[0.88568085]]

The complete code is as follows:

import torch
from transformers import CLIPTextModel, CLIPTokenizer
from sklearn.metrics.pairwise import cosine_similarity

'''
Convert input text into text features
'''
weights = "F:/Stable_webui/stable-diffusion-webui/models/Stable-diffusion/stable-diffusion-v1-5"
# Load the CLIP text encoder and the tokenizer
text_encoder = CLIPTextModel.from_pretrained(weights,subfolder="text_encoder").to('cuda')
text_tokenizer = CLIPTokenizer.from_pretrained(weights, subfolder="tokenizer")
# Tokenize the prompt to get the corresponding token ids
prompt = 'a girl, beautiful'
text_token_ids = text_tokenizer(
    prompt,
    padding="max_length",  # pad the token sequence up to max_length
    max_length=text_tokenizer.model_max_length,  # the tokenizer's maximum model length
    truncation=True,  # truncate anything beyond the maximum length
    return_tensors="pt"
).input_ids
# print("text_token_ids' shape is ", text_token_ids.shape)  # [1,77]
# print("text_token_ids ", text_token_ids)

text_embeddings = text_encoder(text_token_ids.to("cuda"))[0]
# print("text_embeddings' shape:",text_embeddings.shape) # [1,77,768], each token is mapped into a 768-dim space
# print(text_embeddings)

prompt2 = "a woman,beautiful"
text_token_ids2 = text_tokenizer(
    prompt2,
    padding="max_length",  # pad the token sequence up to max_length
    max_length=text_tokenizer.model_max_length,  # the tokenizer's maximum model length
    truncation=True,  # truncate anything beyond the maximum length
    return_tensors="pt"
).input_ids
text_embeddings2 = text_encoder(text_token_ids2.to("cuda"))[0]

# Compute the similarity of the two sentences
# Average the token embeddings of each sentence
mean_embedding1 = torch.mean(text_embeddings, dim=1).squeeze().cpu().detach().numpy()
mean_embedding2 = torch.mean(text_embeddings2, dim=1).squeeze().cpu().detach().numpy()
# Cosine similarity
similarity = cosine_similarity([mean_embedding1], [mean_embedding2])
print(similarity)
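
The similarity step above boils down to mean pooling followed by a cosine. Here is a minimal pure-Python sketch of that arithmetic. The helper names mean_pool and cosine are made up, and the tiny 2-token, 3-dimensional "embeddings" are invented numbers, not model outputs.

```python
import math

def mean_pool(token_vectors):
    """Average the per-token vectors into one sentence vector."""
    count = len(token_vectors)
    return [sum(column) / count for column in zip(*token_vectors)]

def cosine(u, v):
    """cos(u, v) = (u . v) / (|u| * |v|)"""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sentence_a = [[1.0, 0.0, 2.0], [3.0, 0.0, 0.0]]
sentence_b = [[2.0, 0.0, 1.0], [2.0, 0.0, 1.0]]
sentence_c = [[0.0, 1.0, 0.0], [0.0, 3.0, 0.0]]

vec_a = mean_pool(sentence_a)  # [2.0, 0.0, 1.0]
vec_b = mean_pool(sentence_b)  # [2.0, 0.0, 1.0]
vec_c = mean_pool(sentence_c)  # [0.0, 2.0, 0.0]

print(cosine(vec_a, vec_b))  # same direction, so close to 1
print(cosine(vec_a, vec_c))  # orthogonal, so 0
```

Note that the listing above averages over all 77 positions, padding included, so two short prompts share many identical positions and tend to score high.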

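The VAE network

As mentioned earlier, the VAE is one module of SD. It consists of an encoder and a decoder: the encoder produces the Latent Feature, and the decoder turns the final Latent Feature into an image.

Put differently, the VAE compresses an image (when the input is an image) into a Latent Feature and reconstructs it.

For example, as shown in the figure below, an original image of size 256x256 is fed to the VAE encoder, which produces a feature map of size batch x 4 x 32 x 32; the VAE decoder then reconstructs an image of size 256x256.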


Figure: original image | VAE encode | VAE decode
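
The shapes in this example follow from the VAE's downsampling. A small sketch, assuming the SD 1.5 VAE reduces height and width by a factor of 8 and outputs 4 latent channels (consistent with 256x256 becoming 4x32x32 above). latent_shape and compression_ratio are hypothetical helpers, not library functions.

```python
def latent_shape(height, width, batch=1, channels=4, downsample=8):
    """Shape of the VAE latent for an input image of the given size."""
    if height % downsample or width % downsample:
        raise ValueError("height and width should be multiples of the downsample factor")
    return (batch, channels, height // downsample, width // downsample)

def compression_ratio(height, width, channels=4, downsample=8):
    """How many times fewer values the latent holds than the RGB image."""
    image_values = 3 * height * width
    _, c, h, w = latent_shape(height, width, 1, channels, downsample)
    return image_values / (c * h * w)

print(latent_shape(256, 256))       # (1, 4, 32, 32)
print(latent_shape(512, 512))       # (1, 4, 64, 64)
print(compression_ratio(256, 256))  # 48.0
```

Working in this smaller latent space is what keeps the U-Net iterations affordable compared with denoising full-resolution pixels.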

The code is attached here for reference:

import cv2
import torch
import numpy as np
from diffusers import AutoencoderKL

'''
The VAE compresses and reconstructs images.
The VAE encoder turns an image into low-dimensional features that can serve as U-Net input; the decoder reconstructs the image.
'''
weights = "F:/Stable_webui/stable-diffusion-webui/models/Stable-diffusion/stable-diffusion-v1-5"
# Load the VAE model
VAE = AutoencoderKL.from_pretrained(weights, subfolder="vae").to("cuda", dtype=torch.float32)

raw_image = cv2.imread("SD_image.png")  # read image (BGR)
if raw_image is None:
    raise FileNotFoundError("SD_image.png not found")
raw_image = cv2.cvtColor(raw_image, cv2.COLOR_BGR2RGB)  # BGR -> RGB
raw_image = cv2.resize(raw_image, (256, 256))  # resize
image = raw_image.astype(np.float32) / 127.5 - 1.0  # normalize to [-1, 1]
image = image.transpose(2, 0, 1)  # hwc -> chw
image = image[None, ...]  # add batch dimension
image = torch.from_numpy(image).to("cuda", dtype=torch.float32)  # numpy -> tensor


with torch.inference_mode():
    # Compress and reconstruct with the VAE; latent is the feature map
    latent = VAE.encode(image).latent_dist.sample()  # encode, shape [1,4,32,32]
    rec_image = VAE.decode(latent).sample  # decode, reconstructed image of shape (batch_size,3,256,256)

    # Post-processing
    rec_image = (rec_image / 2 + 0.5).clamp(0, 1)  # map to 0~1
    rec_image = rec_image.cpu().permute(0, 2, 3, 1).numpy()

    # De-normalize to 0~255
    rec_image = (rec_image * 255).round().astype("uint8")
    rec_image = rec_image[0]

    # Save the reconstructed image
    cv2.imwrite("reconstructed_image.png", cv2.cvtColor(rec_image, cv2.COLOR_RGB2BGR))

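U-Net

The backbone of the SD model is a U-Net. It is a modified version, but it still has an encoder and a decoder. In CNN work, U-Net is known for image segmentation, where it extracts image features; in SD the "image features" are noise features, and the network repeatedly denoises its noisy input. Denoising simply means calling the U-Net over and over, so you can think of it as many U-Nets chained into one deep CNN: the shallow part (the early iterations) produces coarse features such as the rough outline of the image, and as the network gets deeper (more iterations) finer details are added. The result is a Latent Feature, which is then sent to the VAE decoder to reconstruct the image.

The U-Net injects the text features at every noise prediction step. The prediction process is visualized below (I used 25 steps, prompt="a photograph of an astronaut riding a horse"):

The corresponding code is attached so that you can reproduce it:

Here noise_pred is the feature map to visualize (the function is named for the predicted noise, but the call further down passes the latents), and save_name is the name of the saved image. I save one image per step.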

    def save_noise_pred_feat(self, noise_pred, save_name):
        # Requires `import numpy as np` and `from PIL import Image` at the top of the file
        noise_pred = (noise_pred / 2 + 0.5).clamp(0, 1)
        # Convert the Tensor to a NumPy array
        noise_pred_np = noise_pred.detach().cpu().numpy()
        # Move the channel axis from the second dimension to the last one
        # (assumes the channels are the second dimension)
        noise_pred_np = np.moveaxis(noise_pred_np, 1, -1)
        # De-normalize to 0~255
        noise_pred_np = (noise_pred_np * 255).round().astype(np.uint8)
        # Concatenate the feature maps of the batch side by side into one image
        # (for illustration only)
        combined_img = Image.fromarray(np.concatenate(noise_pred_np, axis=1))
        combined_img.save(f'{save_name}_combined_feature_map.png')

Then call it at the following place in the pipeline's denoising loop:

                if do_classifier_free_guidance and guidance_rescale > 0.0:
                    # Based on 3.4. in https://arxiv.org/pdf/2305.08891.pdf
                    noise_pred = rescale_noise_cfg(noise_pred, noise_pred_text, guidance_rescale=guidance_rescale)

                # compute the previous noisy sample x_t -> x_t-1
                latents = self.scheduler.step(noise_pred, t, latents, **extra_step_kwargs, return_dict=False)[0]
                self.save_noise_pred_feat(latents, i)
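
The do_classifier_free_guidance branch above runs right after the guidance step. As a sketch of that step, assuming the usual classifier-free guidance formula (the unconditional prediction plus guidance_scale times its difference to the text-conditioned prediction). apply_guidance and the numbers are illustrative only and are not taken from the pipeline source.

```python
def apply_guidance(noise_uncond, noise_text, guidance_scale=7.5):
    """Push the prediction away from the unconditional one, toward the text-conditioned one."""
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]

noise_uncond = [0.0, 1.0, -1.0]  # prediction with an empty prompt
noise_text = [1.0, 1.0, 0.0]     # prediction with the user's prompt

print(apply_guidance(noise_uncond, noise_text, guidance_scale=1.0))  # [1.0, 1.0, 0.0]
print(apply_guidance(noise_uncond, noise_text, guidance_scale=7.5))  # [7.5, 1.0, 6.5]
```

With a scale of 1 the result equals the text-conditioned prediction; larger scales exaggerate the direction in which the prompt changes the prediction.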

Finally, the last denoised latent (latents) is sent to the VAE decoder, which gives the final image:

Quickly setting up SD

There are many ways to set up SD. Here I use diffusers as an example (inference only).

Install the diffusers library and its dependencies:

pip install diffusers==0.18.0 -i https://pypi.tuna.tsinghua.edu.cn/simple

pip install transformers==4.27.0 accelerate==0.12.0 safetensors==0.2.7 invisible_watermark -i https://pypi.tuna.tsinghua.edu.cn/simple

Text-to-image

Now SD can be called directly:

import torch
from diffusers import StableDiffusionPipeline

model_path = "F:/BaiduNetdiskDownload/stable-diffusion-v1-5"

# Initialize the SD pipeline and load the pretrained weights
pipe = StableDiffusionPipeline.from_pretrained(model_path)

# If GPU memory is insufficient, load the weights in float16 instead:
# pipe = StableDiffusionPipeline.from_pretrained(model_path, revision="fp16", torch_dtype=torch.float16)

pipe.to("cuda")

# Input prompt
prompt = "a photograph of an astronaut riding a horse"
steps = 50
image = pipe(prompt, height=512, width=512, num_inference_steps=steps).images[0]
image.save('SD_image.png')

Here num_inference_steps is the number of denoising steps. More steps generally give better results, with diminishing returns, and take longer.
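
To make num_inference_steps concrete: the model is trained with many more timesteps than inference uses (1000 is assumed here), and inference visits only a subset of them. A sketch with a hypothetical helper, assuming evenly spaced timesteps; the real schedulers in diffusers differ in details such as offsets.

```python
def inference_timesteps(num_inference_steps, num_train_timesteps=1000):
    """Evenly spaced timesteps, ordered from the noisiest to the cleanest."""
    stride = num_train_timesteps // num_inference_steps
    return [step * stride for step in range(num_inference_steps)][::-1]

timesteps = inference_timesteps(50)
print(timesteps[:3], timesteps[-1], len(timesteps))  # [980, 960, 940] 0 50
```

Each entry corresponds to one U-Net call, which is why the runtime grows roughly linearly with the number of steps.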

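The model's native output size is 512x512; lower resolutions give noticeably worse results.

Low-compute setups or CPU inference also work, but the experience is not great because inference is very slow.

For example, I ran inference on the CPU of my machine [my 1650 GPU with 4 GB is far too weak], and the result is as follows: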

图生图

from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

device = "cpu"
model_id_or_path = "F:/Stable_webui/stable-diffusion-webui/models/Stable-diffusion/stable-diffusion-v1-5"
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(model_id_or_path)
pipe = pipe.to(device)

# Load the input image and resize it to (width, height) = (768, 512)
img_path = "sketch-mountains-input.jpg"
init_img = Image.open(img_path).convert("RGB")
init_img = init_img.resize((768,512))
prompt = "A fantasy landscape, trending on artstation"

# strength sets how far the result may deviate from the input image
images = pipe(prompt=prompt, image=init_img, strength=0.75).images
images[0].save("img2img.png")

The result is as follows:

Figure: original image | img2img

prompt: "A fantasy landscape, trending on artstation"

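The strength argument controls how much noise is added to the input image and therefore how many denoising steps actually run. A sketch of that arithmetic, assuming 50 inference steps. img2img_steps is a hypothetical helper that mirrors how the diffusers img2img pipeline trims its timestep list; it is not a library function.

```python
def img2img_steps(num_inference_steps, strength):
    """Return (index of the first step used, number of denoising steps actually run)."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    t_start = max(num_inference_steps - init_timestep, 0)
    return t_start, num_inference_steps - t_start

print(img2img_steps(50, 0.75))  # (13, 37)
print(img2img_steps(50, 1.0))   # (0, 50)
```

A lower strength skips more of the early, noisiest steps, so the output stays closer to the input image; a strength of 1.0 starts from almost pure noise and largely ignores it.
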
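Notes

1. Some generations come out as completely black images. This is because SD has a safety mechanism that checks whether an image contains unsafe content. The workaround is to disable that check, and there are two ways to do it. Method 1: modify the following in pipeline_stable_diffusion.py (the file is under stable_diffusion/):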

        if not output_type == "latent":
            image = self.vae.decode(latents / self.vae.config.scaling_factor, return_dict=False)[0]
            # image, has_nsfw_concept = self.run_safety_checker(image, device, prompt_embeds.dtype)
            has_nsfw_concept = None

Method 2:

    def run_safety_checker(self, image, device, dtype):
        # if self.safety_checker is None:
        #     has_nsfw_concept = None
        # else:
        #     if torch.is_tensor(image):
        #         feature_extractor_input = self.image_processor.postprocess(image, output_type="pil")
        #     else:
        #         feature_extractor_input = self.image_processor.numpy_to_pil(image)
        #     safety_checker_input = self.feature_extractor(feature_extractor_input, return_tensors="pt").to(device)
        #     image, has_nsfw_concept = self.safety_checker(
        #         images=image, clip_input=safety_checker_input.pixel_values.to(dtype)
        #     )
        return image, False