[Diffusers] Fine-tuning Stable Diffusion with the diffusers library: a hands-on record (1)

qiushou2357

Article link: https://blog.csdn.net/qiushou2357/article/details/134125674

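Most introductions to the diffusers library found online stop at loading models, but the library can do much more. This blog records the process of fine-tuning a Stable Diffusion model with diffusers.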

Some background before we start

First, a brief introduction to SD. Generally speaking, a standard Stable Diffusion consists of three separately pretrained models: the VAE, the U-Net, and the text encoder.

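-VAE (variational autoencoder)
It has two parts, an encoder and a decoder. The encoder converts an image into a latent, and the decoder restores the latent to an image. During training the latent is used as the input of the U-Net. During inference we only need the decoder.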

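-U-Net (originally an image segmentation network)
Both the encoder and the decoder of the U-Net are built from residual blocks. The encoder downsamples the input to extract features, and the decoder upsamples the features back to the original resolution; the output is the predicted noise in latent space. Skip connections between the encoder and the decoder prevent information from being lost during downsampling, and cross-attention layers inside the blocks condition the output on the text embeddings.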

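-Text encoder
Converts the input text into embeddings that the U-Net can consume, usually implemented with a transformer. Following the idea from Imagen, this part does not have to be trained together with the diffusion model; Stable Diffusion simply uses a pretrained CLIP text encoder.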

[Figure: the inference process with all components combined]
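-Other small components
tokenizer: serves the text encoder by turning text into token IDs.
scheduler: adds noise step by step during training; during inference it computes the less noisy latents from the U-Net's prediction at each step.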
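
To make the data flow between these components concrete, here is a small dependency-free sketch that only tracks tensor shapes for one 512x512 generation. The helper name `pipeline_shapes` is made up for illustration. The constants (77 tokens, 768-dimensional embeddings, 4 latent channels, 8x downsampling) are the values for Stable Diffusion v1 with the CLIP ViT-L/14 text encoder; other checkpoints may differ.

```python
def pipeline_shapes(batch_size=1, height=512, width=512,
                    max_tokens=77, embed_dim=768,
                    latent_channels=4, vae_factor=8):
    """Return the shape of the tensor handed over at each stage (hypothetical helper)."""
    return {
        # tokenizer: prompt -> fixed-length token IDs
        "token_ids": (batch_size, max_tokens),
        # text encoder: token IDs -> one embedding per token
        "text_embeddings": (batch_size, max_tokens, embed_dim),
        # the U-Net works on latents that are vae_factor times smaller than the image
        "latents": (batch_size, latent_channels,
                    height // vae_factor, width // vae_factor),
        # VAE decoder: latents -> RGB image
        "image": (batch_size, 3, height, width),
    }


shapes = pipeline_shapes()
for name, shape in shapes.items():
    print(name, shape)
```

The latent shape here is the same [1, 4, 64, 64] that shows up later when the initial noise is sampled.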

Building the inference process step by step

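Suppose we no longer want to load a complete pipeline that someone else put together, and would rather pick the best parts from different models. How do we do that?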

import torch
from PIL import Image
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, UNet2DConditionModel, LMSDiscreteScheduler

# 1. Take the VAE from the original Stable Diffusion
vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")

# 2. Take the CLIP tokenizer and text encoder from OpenAI
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# 3. Take the U-Net from the original Stable Diffusion
unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet")

# 4. Create a K-LMS scheduler
scheduler = LMSDiscreteScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", num_train_timesteps=1000)

# 5. Move the models to the GPU
torch_device = "cuda"
vae.to(torch_device)
text_encoder.to(torch_device)
unet.to(torch_device)

All components are now loaded, so let's run a prompt.

prompt = ["a photograph of an astronaut riding a horse"]
height = 512                        # image height
width = 512                         # image width
num_inference_steps = 100           # number of denoising steps
guidance_scale = 7.5                # how strongly the image follows the prompt

generator = torch.manual_seed(0)    # random seed
batch_size = len(prompt)

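First, the prompt goes through the tokenizer, which turns it into token IDs; padding brings prompts of different lengths to the same length. The CLIP text encoder then encodes these IDs into text embeddings.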

text_input = tokenizer(prompt, padding="max_length", max_length=tokenizer.model_max_length, truncation=True, return_tensors="pt")

text_embeddings = text_encoder(text_input.input_ids.to(torch_device))[0]
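
What `padding="max_length"` and `truncation=True` do can be sketched without the tokenizer itself. The function `pad_or_truncate` and the token IDs below are invented for illustration (a real CLIP tokenizer also adds start and end tokens and uses its own padding ID); only the fixed-length behavior is the point.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Force a list of token IDs to exactly max_length entries (illustrative only)."""
    clipped = token_ids[:max_length]                          # truncation=True
    return clipped + [pad_id] * (max_length - len(clipped))   # padding="max_length"


PAD_ID = 0       # hypothetical padding ID, not CLIP's real one
MAX_LENGTH = 8   # tokenizer.model_max_length is 77 for CLIP; 8 keeps the demo short

short_prompt = [11, 12, 13]
long_prompt = list(range(100, 112))

batch = [pad_or_truncate(p, MAX_LENGTH, PAD_ID) for p in (short_prompt, long_prompt)]
for row in batch:
    print(len(row), row)
```

Because every row ends up with the same length, the prompts can be stacked into one tensor and encoded in a single batch.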

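We also compute the embeddings of an empty prompt, which classifier-free guidance needs as the unconditional input. Guidance makes the image follow the prompt more closely, at the cost of some diversity. The unconditional and text embeddings are concatenated into a single batch so that one U-Net forward pass handles both.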

max_length = text_input.input_ids.shape[-1]
uncond_input = tokenizer([""] * batch_size, padding="max_length", max_length=max_length, return_tensors="pt")
uncond_embeddings = text_encoder(uncond_input.input_ids.to(torch_device))[0]   
text_embeddings = torch.cat([uncond_embeddings, text_embeddings])

Next, generate the random noise.

latents = torch.randn(
    # read in_channels from the config; direct attribute access on the model is deprecated
    (batch_size, unet.config.in_channels, height // 8, width // 8),
    generator=generator,
)
latents = latents.to(torch_device)

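This produces noise of shape torch.Size([1, 4, 64, 64]), which will shortly be denoised into the latent of the image we want.
Next we initialize the scheduler with the number of inference steps. The K-LMS scheduler we are using also requires the initial noise to be scaled by its sigma, scheduler.init_noise_sigma.
Then we run the iterative denoising loop.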

scheduler.set_timesteps(num_inference_steps)
latents = latents * scheduler.init_noise_sigma
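
For intuition about where that sigma comes from, the sketch below rebuilds a "scaled_linear" beta schedule with the same `beta_start`, `beta_end`, and `num_train_timesteps` as above and derives a noise level sigma = sqrt((1 - alpha_bar) / alpha_bar) for each training timestep. This is a plain-Python reading of how K-LMS style schedulers define sigma, written as an assumption rather than copied from diffusers. With the scheduler's default settings, the largest value should be close to what `scheduler.init_noise_sigma` reports (around 14.6 for these parameters).

```python
import math


def scaled_linear_betas(beta_start, beta_end, num_train_timesteps):
    """Betas spaced linearly in sqrt-space, then squared ("scaled_linear")."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    step = (hi - lo) / (num_train_timesteps - 1)
    return [(lo + i * step) ** 2 for i in range(num_train_timesteps)]


def sigmas_from_betas(betas):
    """sigma_t = sqrt((1 - alpha_bar_t) / alpha_bar_t), alpha_bar_t = prod(1 - beta)."""
    sigmas, alpha_bar = [], 1.0
    for beta in betas:
        alpha_bar *= 1.0 - beta
        sigmas.append(math.sqrt((1.0 - alpha_bar) / alpha_bar))
    return sigmas


betas = scaled_linear_betas(0.00085, 0.012, 1000)
sigmas = sigmas_from_betas(betas)
sigma_max = max(sigmas)
print(f"smallest sigma: {min(sigmas):.4f}, largest sigma: {sigma_max:.4f}")
```

Multiplying the unit-variance noise by the largest sigma puts the starting latents at the noise level the sampler expects at its first timestep.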

from tqdm.auto import tqdm

for t in tqdm(scheduler.timesteps):
    # Duplicate the latents so one forward pass covers both halves of classifier-free guidance
    latent_model_input = torch.cat([latents] * 2)
    latent_model_input = scheduler.scale_model_input(latent_model_input, timestep=t)

    # Predict the noise with the U-Net
    with torch.no_grad():
        noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample

    # Apply classifier-free guidance
    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
    noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

    # Step to the next, less noisy latents
    latents = scheduler.step(noise_pred, t, latents).prev_sample
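
The guidance line in the loop is easier to see on toy numbers. The sketch below applies the same formula to plain Python lists; `apply_guidance` and the sample values are made up for illustration and stand in for the two halves of the U-Net output.

```python
def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Start at the unconditional prediction and move toward (and past) the text one."""
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]


noise_uncond = [0.0, 1.0, -2.0]   # toy stand-in for the empty-prompt prediction
noise_text = [1.0, 1.0, 0.0]      # toy stand-in for the text-conditioned prediction

only_uncond = apply_guidance(noise_uncond, noise_text, 0.0)
only_text = apply_guidance(noise_uncond, noise_text, 1.0)
amplified = apply_guidance(noise_uncond, noise_text, 7.5)
print(only_uncond, only_text, amplified)
```

A scale of 0 ignores the prompt, a scale of 1 reproduces the text-conditioned prediction, and larger values such as 7.5 exaggerate the difference between the two predictions.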

Use the VAE to decode the latents into an image

# scale and decode the image latents with vae
latents = 1 / 0.18215 * latents
with torch.no_grad():
    image = vae.decode(latents).sample

Save your result

image = (image / 2 + 0.5).clamp(0, 1)
image = image.detach().cpu().permute(0, 2, 3, 1).numpy()
images = (image * 255).round().astype("uint8")
pil_images = [Image.fromarray(image) for image in images]
pil_images[0]  # displays the image in a notebook; call .save(path) on it to write a file
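
The value conversion in this last step can be checked on single numbers. The sketch below mirrors it in plain Python, assuming the decoder output lies roughly in [-1, 1]; `to_uint8` is a hypothetical helper, not part of diffusers.

```python
def to_uint8(value):
    """Map a decoder output in [-1, 1] to an integer pixel value in [0, 255]."""
    scaled = value / 2 + 0.5                # [-1, 1] -> [0, 1]
    clamped = min(max(scaled, 0.0), 1.0)    # same role as .clamp(0, 1)
    return int(round(clamped * 255))


samples = [-1.3, -1.0, 0.5, 1.0, 1.2]
pixels = [to_uint8(v) for v in samples]
print(pixels)  # [0, 0, 191, 255, 255]
```

Values that overshoot the [-1, 1] range are clamped instead of wrapping around, which is why the clamp comes before the cast to uint8.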