[Diffusers Library] Part 3: Taking a Pipeline Apart and Reassembling It

Robin C

Last modified: 2024-03-19 10:01:32

Tags: AIGC, stable diffusion, artificial intelligence

First published: 2024-03-14 14:35:15

Article link: https://blog.csdn.net/qq_38423499/article/details/136569615

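diffusers is designed as a user-friendly and flexible toolbox for building diffusion systems tailored to your use case. At the core of the library are models and schedulers. DiffusionPipeline bundles these components together for convenience, but you can also take a pipeline apart and use the models and schedulers separately to create new diffusion systems.
This tutorial shows how to use models and schedulers to assemble a diffusion system for inference, starting with a basic pipeline and then moving on to the Stable Diffusion pipeline.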

The assembled pipeline

A pipeline is a quick and easy way to run a model for inference; generating an image takes no more than four lines of code. For example, with DDPMPipeline:

from diffusers import DDPMPipeline

ddpm = DDPMPipeline.from_pretrained("google/ddpm-cat-256").to("cuda")
image = ddpm(num_inference_steps=25).images[0]
image

The deconstructed pipeline (unconditional generation)

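That was very easy, but how did the pipeline do it? Let's break the pipeline down and look at what happens under the hood.
In the example above, the pipeline contains a UNet model and a DDPM scheduler. The pipeline denoises an image repeatedly over a set number of steps until it produces the final picture.
Denoising starts from random noise the size of the desired output, which is passed through the model several times (original text: "The pipeline denoises an image by taking random noise the size of the desired output and passing it through the model several times."). At each timestep, the model predicts the noise residual, and the scheduler uses that residual to predict a less noisy image. The pipeline repeats this process until it reaches the end of the specified number of inference steps.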

Now create the model and the scheduler separately:

from diffusers import DDPMScheduler, UNet2DModel

scheduler = DDPMScheduler.from_pretrained("google/ddpm-cat-256")
model = UNet2DModel.from_pretrained("google/ddpm-cat-256").to("cuda")

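Next, set the number of denoising steps the scheduler should run. The official tutorial uses 50 (the image shown later was actually generated with 100 steps, but it still did not form a recognizable picture).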

scheduler.set_timesteps(50)

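After this call, the scheduler holds a one-dimensional tensor in which each element corresponds to one timestep. With 50 steps, the tensor has 50 elements. The denoising loop later iterates over this tensor.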

# Input
scheduler.timesteps
# Output
tensor([980, 960, 940, 920, 900, 880, 860, 840, 820, 800, 780, 760, 740, 720,
    700, 680, 660, 640, 620, 600, 580, 560, 540, 520, 500, 480, 460, 440,
    420, 400, 380, 360, 340, 320, 300, 280, 260, 240, 220, 200, 180, 160,
    140, 120, 100,  80,  60,  40,  20,   0])
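
The values printed above are evenly spaced. Here is a minimal sketch of where they come from, assuming the scheduler was trained with 1000 timesteps (consistent with the 980 to 0 range in the output) and that the inference steps are spaced by an integer stride. `spaced_timesteps` is a hypothetical helper written for illustration, not a diffusers function:

```python
def spaced_timesteps(num_train_timesteps, num_inference_steps):
    """Return evenly spaced timesteps in descending order (illustrative helper)."""
    # Assumed: 1000 training timesteps and 50 inference steps give a stride of 20.
    stride = num_train_timesteps // num_inference_steps
    return [i * stride for i in reversed(range(num_inference_steps))]


timesteps = spaced_timesteps(1000, 50)
print(timesteps[:3], timesteps[-3:])  # [980, 960, 940] [40, 20, 0]
```

Fewer inference steps simply means a larger stride between the timesteps the model actually visits.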

Create some random noise with the same shape as the desired output:

import torch

sample_size = model.config.sample_size
noise = torch.randn((1, 3, sample_size, sample_size)).to("cuda")

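Now write a loop that iterates over the timesteps. At each timestep, the model does a UNet2DModel.forward() pass and returns the noise residual. The scheduler's step() method takes the noise residual, the timestep, and the input, and predicts the image at the previous timestep. That output becomes the model's next input in the denoising loop, which repeats until it reaches the end of the timesteps array.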

input = noise

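for t in scheduler.timesteps:
    # predict the noise residual without tracking gradients
    with torch.no_grad():
        noisy_residual = model(input, t).sample
    # let the scheduler compute the slightly less noisy sample
    previous_noisy_sample = scheduler.step(noisy_residual, t, input).prev_sample
    input = previous_noisy_sample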

That is the entire denoising process, and you can use the same pattern to write any diffusion system.
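
To make the scheduler step less of a black box, here is a rough sketch of the mean of one reverse step from the original DDPM formulation, x_{t-1} = (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_t), applied to a single scalar "pixel". It assumes a linear beta schedule from 1e-4 to 0.02 over 1000 steps and consecutive timesteps, and it leaves out the random noise term that is added back during sampling. All function names are hypothetical, and the library's own step() also handles skipped timesteps, so treat this as an illustration rather than the diffusers implementation:

```python
import math


def linear_betas(num_train_timesteps=1000, beta_start=1e-4, beta_end=0.02):
    """Assumed linear noise schedule (illustrative defaults)."""
    step = (beta_end - beta_start) / (num_train_timesteps - 1)
    return [beta_start + i * step for i in range(num_train_timesteps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product of (1 - beta_s) for s <= t."""
    result, product = [], 1.0
    for beta in betas:
        product *= 1.0 - beta
        result.append(product)
    return result


def ddpm_step_mean(x_t, eps_pred, t, betas, alpha_bars):
    """Mean of x_{t-1} given x_t and the predicted noise residual eps_pred."""
    alpha_t = 1.0 - betas[t]
    return (x_t - betas[t] / math.sqrt(1.0 - alpha_bars[t]) * eps_pred) / math.sqrt(alpha_t)


betas = linear_betas()
alpha_bars = cumulative_alphas(betas)

# A clean value x0 scaled as it would be at timestep t with zero noise.
x0, t = 0.8, 500
x_t = math.sqrt(alpha_bars[t]) * x0
mean_prev = ddpm_step_mean(x_t, 0.0, t, betas, alpha_bars)
print(mean_prev, math.sqrt(alpha_bars[t - 1]) * x0)  # the two values agree
```

With a noise-free input, one step moves the sample from the scale used at timestep t to the scale used at timestep t-1, which is what "predicting the image at the previous timestep" means here.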

The last step is to convert the denoised output into an image:

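from PIL import Image
import numpy as np
import cv2

# map the output from [-1, 1] to [0, 1] and move channels last (H, W, C)
image = (input / 2 + 0.5).clamp(0, 1)
image = image.cpu().permute(0, 2, 3, 1).numpy()[0]
# The line below is what I took from the official tutorial; it raised an error for me,
# because .round() and .astype() end up being called on the PIL image, not on the array.
# image = Image.fromarray((image * 255)).round().astype("uint8")

# My version: convert to a uint8 array first
image_2 = (image * 255).round().astype("uint8")

# Saving with cv2: OpenCV expects BGR channel order, so convert from RGB first
cv2.imwrite("demo.png", cv2.cvtColor(image_2, cv2.COLOR_RGB2BGR))

# Saving with PIL: the uint8 RGB array can be used directly
image = Image.fromarray(image_2)
image.save("demo.png")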
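
The arithmetic in this conversion is easy to check by hand. The sketch below applies the same three operations (rescale from [-1, 1] to [0, 1], clamp, scale to 0..255) to individual values in plain Python; `to_uint8_pixel` is a hypothetical helper used only for illustration:

```python
def to_uint8_pixel(value):
    """Map a model output value in [-1, 1] to an 8-bit pixel in [0, 255]."""
    scaled = value / 2 + 0.5               # [-1, 1] -> [0, 1]
    clamped = min(max(scaled, 0.0), 1.0)   # same role as .clamp(0, 1)
    return int(round(clamped * 255))       # [0, 1] -> [0, 255]


pixels = [to_uint8_pixel(v) for v in (-1.0, 0.5, 1.0, 1.7)]
print(pixels)  # [0, 191, 255, 255]
```

The last input shows why the clamp matters: a value outside [-1, 1] would otherwise overflow the 8-bit range.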

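I suspect something is off with this model. The official tutorial does not show the resulting image after this code. Running the code as given raised an error for me; my modified version runs without errors, but the image quality is still poor. Since the goal here is only to understand the process, don't focus too much on how this image looks.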

The deconstructed pipeline (text-to-image)

Loading the components

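Stable Diffusion is called a latent diffusion model because it works on a lower-dimensional representation of the image instead of the actual pixel space, which is more memory-efficient. The encoder compresses the image into this lower-dimensional representation, and the decoder converts the compressed representation (a tensor of latents) back into an image.
For text-to-image models, you need a tokenizer and a text encoder to generate the text embeddings, in addition to the UNet model and the scheduler. Load all of these components with the from_pretrained() method. The example below uses the "CompVis/stable-diffusion-v1-4" checkpoint, where each component is stored in a separate subfolder: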

from PIL import Image
import torch
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, UNet2DConditionModel, PNDMScheduler, UniPCMultistepScheduler

# VAE (variational autoencoder)
vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")
# tokenizer
tokenizer = CLIPTokenizer.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="tokenizer")
# text encoder
text_encoder = CLIPTextModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="text_encoder")
# UNet model
unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet")
# load the scheduler: UniPCMultistepScheduler
scheduler = UniPCMultistepScheduler.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="scheduler")

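VAE: the variational autoencoder. It encodes images into latent features and, once generation has finished, decodes the latent features back into an image.
UNet: the model that predicts the noise residual.
Text-Encoder: encodes the tokens into a sequence of vectors that condition the diffusion model's generation.
Tokenizer: maps the input text to the tokens mentioned above using its vocabulary, and outputs a tensor of integer ids.
Scheduler: diffusion models support many sampling methods; the scheduler defines which one is used.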

To speed up inference, move the models to the GPU since, unlike the scheduler, they have trainable weights:

torch_device = "cuda"
vae.to(torch_device)
text_encoder.to(torch_device)
unet.to(torch_device)

Encoding the text

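The next step is to tokenize the text to generate the text embeddings. The text embeddings condition the UNet model and steer the diffusion process toward an image that matches the text description.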

# guidance_scale controls how strongly the prompt steers generation
prompt = ["a photograph of an astronaut riding a horse"]
height = 512  # default height of Stable Diffusion
width = 512  # default width of Stable Diffusion
num_inference_steps = 25  # Number of denoising steps
guidance_scale = 7.5  # Scale for classifier-free guidance
generator = torch.manual_seed(0)  # Seed generator to create the initial latent noise
batch_size = len(prompt)

text_input = tokenizer(
    prompt, padding="max_length", max_length=tokenizer.model_max_length, truncation=True, return_tensors="pt"
)
# Input
text_input["input_ids"]
# Output
tensor([[49406,   320,  8853,   539,   550, 18376,  6765,   320,  4558, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407]])
         
with torch.no_grad():
    text_embeddings = text_encoder(text_input.input_ids.to(torch_device))[0]
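
The printed input_ids follow a fixed layout: one start id, the prompt ids, one end id, and then the end id repeated up to a fixed length. The sketch below reproduces that layout in plain Python. It assumes 49406 is the start-of-text id, 49407 serves as both the end-of-text and padding id, and the maximum length is 77 (the length of the printed tensor); `pad_to_max_length` is a hypothetical helper, not part of the tokenizer API:

```python
BOS_ID = 49406   # first id in the printed tensor (assumed start-of-text)
EOS_ID = 49407   # assumed end-of-text id, also used as padding
MAX_LENGTH = 77  # length of the printed tensor


def pad_to_max_length(prompt_ids, max_length=MAX_LENGTH):
    """Wrap prompt ids with BOS/EOS, truncate if needed, then pad with EOS."""
    body = list(prompt_ids)[: max_length - 2]  # leave room for BOS and EOS
    ids = [BOS_ID] + body + [EOS_ID]
    return ids + [EOS_ID] * (max_length - len(ids))


# Prompt ids copied from the tensor printed above.
prompt_ids = [320, 8853, 539, 550, 18376, 6765, 320, 4558]
input_ids = pad_to_max_length(prompt_ids)
print(len(input_ids), input_ids[:10])
```

Because every prompt is padded or truncated to the same length, conditional and unconditional embeddings end up with matching shapes.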

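You also need to generate the unconditional text embeddings, which are the embeddings for the padding token. They must have the same shape (batch_size and seq_length) as the conditional text_embeddings: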

max_length = text_input.input_ids.shape[-1]
uncond_input = tokenizer([""] * batch_size, padding="max_length", max_length=max_length, return_tensors="pt")
uncond_embeddings = text_encoder(uncond_input.input_ids.to(torch_device))[0]

Concatenate the unconditional and conditional embeddings into a single batch to avoid doing two forward passes:

text_embeddings = torch.cat([uncond_embeddings, text_embeddings])

Creating random noise

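Next, generate some initial random noise as the starting point for the diffusion process. This is the latent representation of the image (picture the static on a TV with no signal), and it will be denoised gradually. At this point the latent noise is smaller than the final image, but that is fine because the model will later turn it into the final 512x512 image.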

# height and width are divided by 8 because the VAE downsamples 3 times (2**3 = 8)
latents = torch.randn(
    (batch_size, unet.config.in_channels, height // 8, width // 8),
    generator=generator,
)
latents = latents.to(torch_device)
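
A quick back-of-the-envelope check shows how much smaller the latent tensor is than the final image. This sketch assumes 4 latent channels (the value of in_channels for this UNet) and 3 downsampling stages in the VAE, as stated in the code comment; `latent_shape` is a hypothetical helper:

```python
def latent_shape(batch_size, height, width, latent_channels=4, num_downsamples=3):
    """Shape of the latent tensor for a given output image size (illustrative)."""
    factor = 2 ** num_downsamples  # 3 halvings of each spatial dimension -> 8
    if height % factor or width % factor:
        raise ValueError(f"height and width must be divisible by {factor}")
    return (batch_size, latent_channels, height // factor, width // factor)


shape = latent_shape(1, 512, 512)
pixel_values = 3 * 512 * 512                     # RGB image
latent_values = shape[1] * shape[2] * shape[3]   # latent tensor
print(shape, pixel_values // latent_values)  # (1, 4, 64, 64) 48
```

Under these assumptions the UNet works on 48 times fewer values than it would in pixel space, which is the memory saving mentioned earlier.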

Denoising the image

Start by scaling the input with the initial noise distribution, sigma (the noise scale value), which is required by improved schedulers such as UniPCMultistepScheduler:

latents = latents * scheduler.init_noise_sigma

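The last step is to create the denoising loop, which progressively turns the pure noise in latents into the image described by the prompt. The denoising loop needs to do three things: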

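Set the scheduler's timesteps to use during denoising.
Iterate over the timesteps.
At each timestep, call the UNet model to predict the noise residual and pass it to the scheduler to compute the previous noisy sample.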

from tqdm.auto import tqdm

scheduler.set_timesteps(num_inference_steps)

for t in tqdm(scheduler.timesteps):
    # expand the latents if we are doing classifier-free guidance to avoid doing two forward passes.
    latent_model_input = torch.cat([latents] * 2)

    latent_model_input = scheduler.scale_model_input(latent_model_input, timestep=t)

    # predict the noise residual
    with torch.no_grad():
        noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample

    # perform guidance
    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
    noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

    # compute the previous noisy sample x_t -> x_t-1
    latents = scheduler.step(noise_pred, t, latents).prev_sample
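
The guidance line in the loop is plain arithmetic: start from the unconditional prediction and move toward the text-conditioned one, scaled by guidance_scale. A small sketch with made-up numbers in place of tensors (`apply_guidance` is a hypothetical helper):

```python
def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Classifier-free guidance: uncond + scale * (text - uncond), element-wise."""
    return [u + guidance_scale * (c - u) for u, c in zip(noise_uncond, noise_text)]


uncond = [1.0, -2.0]  # made-up unconditional predictions
text = [2.0, -1.0]    # made-up text-conditioned predictions

print(apply_guidance(uncond, text, 7.5))  # [8.5, 5.5]
print(apply_guidance(uncond, text, 1.0))  # [2.0, -1.0], same as the text prediction
print(apply_guidance(uncond, text, 0.0))  # [1.0, -2.0], same as the unconditional one
```

A scale of 1 reproduces the text-conditioned prediction, and larger values push the result further in the direction the prompt suggests.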

Decoding the image

The final step is to use the vae to decode the latent representation into an image and get the decoded output with sample:

# scale and decode the image latents with vae
latents = 1 / 0.18215 * latents
with torch.no_grad():
    image = vae.decode(latents).sample

Finally, convert the image to a PIL.Image to view your generated image!

# map from [-1, 1] to [0, 1] and drop the batch dimension
image = (image / 2 + 0.5).clamp(0, 1).squeeze()
# move channels last, scale to [0, 255] once, and convert to a uint8 array
image = (image.permute(1, 2, 0) * 255).round().to(torch.uint8).cpu().numpy()
image = Image.fromarray(image)
image