[Diffusers Tutorial] Part 1, Introduction: Getting Started Quickly with Diffusion Networks

prinTao

Last modified on 2024-03-08 16:31:35

Category: Diffusion. Tags: diffuser

First published on 2024-03-06 20:53:02

Article link: https://blog.csdn.net/prinTao/article/details/136517555

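This article shows how to use Diffusers, one of the Hugging Face libraries, to download deep learning models and run inference. It covers the pipeline interface as well as the detailed Stable Diffusion steps: loading the models, working with the scheduler, and generating images.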

Original tutorial: https://huggingface.co/docs/diffusers/using-diffusers/write_own_pipeline

Downloading and loading models

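When Diffusers' from_pretrained() loads a model, it looks for a local copy first; if none is found, it downloads the model automatically from base_url = https://huggingface.co/ . The lookup address is base_url + the repository name.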

By default, downloaded models are cached in C:\Users\Administrator\.cache\huggingface\diffusers on Windows, or under ~/.cache on Ubuntu.

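There are two ways to choose where downloaded models are cached:

1. Pass cache_dir to from_pretrained
2. Set the HF_HOME or XDG_CACHE_HOME environment variable

An example of the first approach: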

pipeline = StableDiffusionUpscalePipeline.from_pretrained(model_id, torch_dtype=torch.float16, cache_dir="./models/")

This creates a models folder in the current directory and caches the downloaded model there.
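The second approach, setting an environment variable, can be sketched as follows. This is a minimal sketch that assumes the variable is set before any Hugging Face library is imported, since the cache location is typically resolved at import time; the ./models/hf_home path is a hypothetical example, not a path from the original tutorial.

```python
import os

# Hypothetical cache location, chosen only for illustration.
cache_root = os.path.abspath("./models/hf_home")

# Set HF_HOME before importing diffusers or transformers, so that the
# libraries pick it up when they resolve their cache directory.
os.environ["HF_HOME"] = cache_root

print(os.environ["HF_HOME"])
```

Setting the variable in the shell or system settings instead has the same effect and avoids depending on import order.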

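Running inference with Diffusers

Inference with a `pipeline`

A simple DDPM network

Import the pipeline and load the model with from_pretrained(). The model can be a local copy, or it is downloaded automatically from the Hugging Face Hub.

Load everything at once: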

from diffusers import DDPMPipeline
ddpm = DDPMPipeline.from_pretrained("google/ddpm-cat-256", use_safetensors=True).to("cuda")
image = ddpm(num_inference_steps=25).images[0]
image

Load the model and scheduler separately:

from diffusers import DDPMScheduler, UNet2DModel
scheduler = DDPMScheduler.from_pretrained("google/ddpm-cat-256")
model = UNet2DModel.from_pretrained("google/ddpm-cat-256", use_safetensors=True).to("cuda")

Set the timesteps:

scheduler.set_timesteps(50)
print(scheduler.timesteps)
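To get a feel for what set_timesteps(50) produces, the sketch below builds an evenly spaced schedule in plain Python. It assumes the scheduler was trained with 1000 timesteps and spaces the inference steps evenly starting from 0; both are assumptions about this checkpoint's configuration, and `evenly_spaced_timesteps` is a hypothetical helper, not a Diffusers function. The authoritative values are whatever scheduler.timesteps prints.

```python
def evenly_spaced_timesteps(num_train_timesteps: int, num_inference_steps: int) -> list[int]:
    """Return descending, evenly spaced timesteps (hypothetical helper)."""
    step_ratio = num_train_timesteps // num_inference_steps
    ascending = [i * step_ratio for i in range(num_inference_steps)]
    return ascending[::-1]  # denoising runs from the noisiest step down to 0


timesteps = evenly_spaced_timesteps(1000, 50)
print(timesteps[:3], "...", timesteps[-3:])  # [980, 960, 940] ... [40, 20, 0]
```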

Generate random noise:

import torch
sample_size = model.config.sample_size  # the noise size must match what the UNet expects
input = noise = torch.randn((1, 3, sample_size, sample_size), device="cuda")

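Run the iterative denoising loop

Calling **UNet2DModel.forward()** runs a forward pass and returns the noise residual. The scheduler's **step()** method takes the noise residual, the timestep, and the input, and predicts the image at the previous timestep.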

for t in scheduler.timesteps:
    with torch.no_grad():
        noisy_residual = model(input, t).sample
    previous_noisy_sample = scheduler.step(noisy_residual, t, input).prev_sample
    input = previous_noisy_sample

Output the image

from PIL import Image
import numpy as np
image = (input / 2 + 0.5).clamp(0, 1).squeeze()
image = (image.permute(1, 2, 0) * 255).round().to(torch.uint8).cpu().numpy()
image = Image.fromarray(image)
image.show()
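The post-processing arithmetic above maps model outputs from the range [-1, 1] to 8-bit pixel values. Here is a plain-Python sketch of the same mapping for single values, where `to_uint8` is a hypothetical helper used only for illustration:

```python
def to_uint8(value: float) -> int:
    """Map a model output in [-1, 1] to a pixel value in [0, 255]."""
    scaled = value / 2 + 0.5              # [-1, 1] -> [0, 1]
    clamped = min(max(scaled, 0.0), 1.0)  # clip out-of-range values
    return round(clamped * 255)


print([to_uint8(v) for v in (-1.5, -1.0, 0.0, 1.0)])  # [0, 0, 128, 255]
```

Values outside [-1, 1] are clipped rather than wrapped, which is why the clamp comes before the multiplication by 255.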

Stable Diffusion example

https://huggingface.co/blog/stable_diffusion#how-does-stable-diffusion-work

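Stable Diffusion works on a lower-dimensional representation of the image instead of the actual pixel space, which makes it more memory efficient. The encoder compresses the image into a smaller representation, and the decoder converts the compressed representation back into an image. For text-to-image models, you also need a tokenizer and an encoder to generate the text embeddings. Stable Diffusion therefore has three separate pretrained models: the VAE, the text encoder, and the UNet.

Load all of these components with the **from_pretrained()** method. You can find them in a pretrained checkpoint such as [runwayml/stable-diffusion-v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5), where each component is stored in a separate subfolder: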

from PIL import Image
import torch
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, UNet2DConditionModel, PNDMScheduler
vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae", use_safetensors=True)
tokenizer = CLIPTokenizer.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="text_encoder", use_safetensors=True
)
unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet", use_safetensors=True
)

Instead of the default **PNDMScheduler**, swap in **UniPCMultistepScheduler** to see how easily a different scheduler can be plugged in:

from diffusers import UniPCMultistepScheduler
scheduler = UniPCMultistepScheduler.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="scheduler")

To speed up inference, move the models to the GPU. Unlike the scheduler, they have trainable weights.

torch_device = "cuda"
vae.to(torch_device)
text_encoder.to(torch_device)
unet.to(torch_device)

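Create the text embeddings

The text conditions the UNet and steers the diffusion process toward something that resembles the input prompt.

💡 The guidance_scale parameter determines how much weight the prompt is given when generating an image.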

prompt = ["a photograph of an astronaut riding a horse"]
height = 512  # default height of Stable Diffusion
width = 512  # default width of Stable Diffusion
num_inference_steps = 25  # Number of denoising steps
guidance_scale = 7.5  # Scale for classifier-free guidance
generator = torch.manual_seed(0)  # Seed generator to create the initial latent noise
batch_size = len(prompt)

Tokenize the text and generate the embeddings, text_embeddings, from the prompt:

text_input = tokenizer(
    prompt, padding="max_length", max_length=tokenizer.model_max_length, truncation=True, return_tensors="pt"
)
with torch.no_grad():
    text_embeddings = text_encoder(text_input.input_ids.to(torch_device))[0]

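Generate the unconditional text embeddings, uncond_embeddings, which are the embeddings of the padding token. They need to have the same shape (batch_size and seq_length) as the conditional text_embeddings: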

max_length = text_input.input_ids.shape[-1]
uncond_input = tokenizer([""] * batch_size, padding="max_length", max_length=max_length, return_tensors="pt")
with torch.no_grad():
    uncond_embeddings = text_encoder(uncond_input.input_ids.to(torch_device))[0]

Concatenate the conditional and unconditional embeddings into a single batch to avoid running two forward passes:
```
text_embeddings = torch.cat([uncond_embeddings, text_embeddings])
```

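Create random noise

Generate some initial random noise as the starting point of the diffusion process. This is the latent representation of the image, and it will be gradually denoised. At this point the latent image is smaller than the final image, but that is fine because the model will later transform it into the final 512x512 image.

The VAE halves the height and width three times when encoding into latent space, so each side shrinks by a factor of 2 ** (len(vae.config.block_out_channels) - 1) == 8: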
```
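latents = torch.randn(
    (batch_size, unet.config.in_channels, height // 8, width // 8),
    generator=generator,  # torch.manual_seed returns a CPU generator, so sample on the CPU
).to(torch_device)  # then move the latents to the target device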
```
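The latent shape used above follows from the downsampling factor. The sketch below works through the arithmetic with values assumed for a Stable Diffusion v1 checkpoint (a four-entry block_out_channels list and 4 latent channels); in practice, read them from vae.config.block_out_channels and unet.config.in_channels.

```python
# Assumed config values for a Stable Diffusion v1 checkpoint (illustrative only).
block_out_channels = [128, 256, 512, 512]
in_channels = 4
batch_size, height, width = 1, 512, 512

# All but one of the VAE blocks halve the height and width.
scale_factor = 2 ** (len(block_out_channels) - 1)
latent_shape = (batch_size, in_channels, height // scale_factor, width // scale_factor)

print(scale_factor, latent_shape)  # 8 (1, 4, 64, 64)
```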

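Denoise the image

Start by scaling the input with the initial noise distribution, sigma (the noise scale value), which is required by improved schedulers such as UniPCMultistepScheduler: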
```
latents = latents * scheduler.init_noise_sigma
```

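The last step is to create the denoising loop, which progressively transforms the pure noise in latents into the image described by the prompt. Remember, the denoising loop needs to do three things:

1. Set the scheduler's timesteps to use during denoising.
2. Iterate over the timesteps.
3. At each timestep, call the UNet to predict the noise residual and pass it to the scheduler to compute the previous noisy sample.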

from tqdm.auto import tqdm
scheduler.set_timesteps(num_inference_steps)
for t in tqdm(scheduler.timesteps):
    # Duplicate the latents for classifier-free guidance to avoid two forward passes
    latent_model_input = torch.cat([latents] * 2)
    latent_model_input = scheduler.scale_model_input(latent_model_input, timestep=t)
    # Predict the noise residual
    with torch.no_grad():
        noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample
    # Perform guidance
    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
    noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)
    # Compute the previous, less noisy sample: x_t -> x_t-1
    latents = scheduler.step(noise_pred, t, latents).prev_sample
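
The guidance step in the loop is a linear extrapolation from the unconditional prediction toward the text-conditioned one. A toy sketch on plain lists shows the effect of guidance_scale; the numbers are made up, and `apply_guidance` is a hypothetical helper rather than part of Diffusers.

```python
def apply_guidance(uncond: list[float], text: list[float], guidance_scale: float) -> list[float]:
    """Classifier-free guidance: uncond + scale * (text - uncond), elementwise."""
    return [u + guidance_scale * (t - u) for u, t in zip(uncond, text)]


noise_pred_uncond = [0.0, 1.0]  # made-up values
noise_pred_text = [1.0, 3.0]    # made-up values

print(apply_guidance(noise_pred_uncond, noise_pred_text, 1.0))  # [1.0, 3.0]
print(apply_guidance(noise_pred_uncond, noise_pred_text, 7.5))  # [7.5, 16.0]
```

A scale of 1 reproduces the text-conditioned prediction, while larger values push further in the same direction.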

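Use the VAE to decode the latent representation into an image, and read the decoded output from sample:

The factor 1 / 0.18215 undoes the scaling that was applied to normalize the VAE latents.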

# Scale and decode the image latents with the VAE
latents = 1 / 0.18215 * latents
with torch.no_grad():
    image = vae.decode(latents).sample
# Convert to a PIL image: map [-1, 1] to [0, 255] once, then reorder to height x width x channels
image = (image / 2 + 0.5).clamp(0, 1).squeeze()
image = (image.permute(1, 2, 0) * 255).round().to(torch.uint8).cpu().numpy()
image = Image.fromarray(image)
image