前言
`完成了扩散模型打卡活动第二章的学习,简单记录下学习所得。
一、基本概念
微调:在新的数据集上重新训练已有的模型,以改变原有的输出类型。主要是为了针对高分辨率图像数据从头开始训练耗时太长的问题。
引导:在推理阶段引导现有模型的生成过程,以获取额外的控制。这个过程可以评估模型每个阶段的预测并进行修改,使最终生成图像符合我们的喜好。
条件生成:在训练过程中产生的额外信息,导入到模型中进行预测,通过输入相关信息作为条件来控制模型的生成。
条件调节:在训练时提供额外的信息(如类标签或者图像标题等),可以用于之后的推测。具体方法包括
1、将条件信息作为额外的通道输入UNet模型;2、将条件信息转换成embedding,然后将embedding通过投影层映射来改变其通道数,从而可以对齐模型中间层的输出通道,最后将embedding加到中间层的输出上;3、添加带有交叉注意力(cross-attention)机制的网络层。
下面将具体展开几部分的概念,并进行一个简单的实战。
二、微调和引导
1.环境准备
%pip install -qq diffusers datasets accelerate wandb open-clip-torch
import numpy as np
import torch
import torch.nn.functional as F
import torchvision
from datasets import load_dataset
from diffusers import DDIMScheduler, DDPMPipeline
from matplotlib import pyplot as plt
from PIL import Image
from torchvision import transforms
from tqdm.auto import tqdm
device = (
"mps"
if torch.backends.mps.is_available()
else "cuda"
if torch.cuda.is_available()
else "cpu"
)
加载预训练好的DDPMPiPeline测试下,生成一张图像:
image_pipe = DDPMPipeline.from_pretrained("google/ddpm-celebahq-256")
image_pipe.to(device);
images = image_pipe().images
images[0]
虽然生成的图像很清晰,但生成过程十分缓慢,这主要是因为采样器运用DDPM的时间步过长导致的,下面将尝试适用更好的采样器进行改进。
2.DDIM实现更快的采样
生成图像的过程其实就是一个采样去噪声的过程,在diffusers当中,我们从随机噪声出发,输入给模型后将模型预测结果传递给step函数,来更新图像信息,不断重复这个过程。如果这个过程时间步过长,就会导致生成图像过程缓慢。其实这么多时间步很没必要,因此可以采用DDIM来替代DDPM来实现以更少的步骤生成样本。
DDIM即去噪扩散隐式模型,定义的DDIMScheduler可以实现更少步骤的样本输出:
# Create new scheduler and set num inference steps
scheduler = DDIMScheduler.from_pretrained("google/ddpm-celebahq-256")
scheduler.set_timesteps(num_inference_steps=40)
模型总共执行40步,每一步相当于原始时间步的1000步。
创建四个随机图像并运行采样循环,观察当前x和预测的去噪版本:
# The random starting point
x = torch.randn(4, 3, 256, 256).to(device) # Batch of 4, 3-channel 256 x 256 px images
# Loop through the sampling timesteps
for i, t in tqdm(enumerate(scheduler.timesteps)):
# Prepare model input
model_input = scheduler.scale_model_input(x, t)
# Get the prediction
with torch.no_grad():
noise_pred = image_pipe.unet(model_input, t)["sample"]
# Calculate what the updated sample should look like with the scheduler
scheduler_output = scheduler.step(noise_pred, t, x)
# Update x
x = scheduler_output.prev_sample
# Occasionally display both x and the predicted denoised images
if i % 10 == 0 or i == len(scheduler.timesteps) - 1:
fig, axs = plt.subplots(1, 2, figsize=(12, 5))
grid = torchvision.utils.make_grid(x, nrow=4).permute(1, 2, 0)
axs[0].imshow(grid.cpu().clip(-1, 1) * 0.5 + 0.5)
axs[0].set_title(f"Current x (step {i})")
pred_x0 = (
scheduler_output.pred_original_sample
) # Not available for all schedulers
grid = torchvision.utils.make_grid(pred_x0, nrow=4).permute(1, 2, 0)
axs[1].imshow(grid.cpu().clip(-1, 1) * 0.5 + 0.5)
axs[1].set_title(f"Predicted denoised images (step {i})")
plt.show()
可以看到随着时间步的迭代,预测输出越来越精细。
3.微调
微调就是用新的数据集重新训练已有的模型,来更新输出。
通常情况下是要求新数据集与预训练模型的原始数据保持一致的,但这里为了有意思,使用了第一章的蝴蝶数据集。
下载蝴蝶数据集并创建一个加载器,并采样一批图像如下:
# @markdown load and prepare a dataset:
# Not on Colab? Comments with #@ enable UI tweaks like headings or user inputs
# but can safely be ignored if you're working on a different platform.
dataset_name = "huggan/smithsonian_butterflies_subset" # @param
dataset = load_dataset(dataset_name, split="train")
image_size = 256 # @param
batch_size = 4 # @param
preprocess = transforms.Compose(
[
transforms.Resize((image_size, image_size)),
transforms.RandomHorizontalFlip(),
transforms.ToTensor(),
transforms.Normalize([0.5], [0.5]),
]
)
def transform(examples):
images = [preprocess(image.convert("RGB")) for image in examples["image"]]
return {"images": images}
dataset.set_transform(transform)
train_dataloader = torch.utils.data.DataLoader(
dataset, batch_size=batch_size, shuffle=True
)
print("Previewing batch:")
batch = next(iter(train_dataloader))
grid = torchvision.utils.make_grid(batch["images"], nrow=4)
plt.imshow(grid.permute(1, 2, 0).cpu().clip(-1, 1) * 0.5 + 0.5);
之所以要将批量大小设置为4,是因为输入图像大小为256*256,太高会耗尽RAM.
下面是训练阶段。这里将优化目标设为image_pipe.unet.parameters()来只更新unet的权重,其余部分与第一章一样:
num_epochs = 2 # @param
lr = 1e-5 # 2param
grad_accumulation_steps = 2 # @param
optimizer = torch.optim.AdamW(image_pipe.unet.parameters(), lr=lr)
losses = []
for epoch in range(num_epochs):
for step, batch in tqdm(enumerate(train_dataloader), total=len(train_dataloader)):
clean_images = batch["images"].to(device)
# Sample noise to add to the images
noise = torch.randn(clean_images.shape).to(clean_images.device)
bs = clean_images.shape[0]
# Sample a random timestep for each image
timesteps = torch.randint(
0,
image_pipe.scheduler.num_train_timesteps,
(bs,),
device=clean_images.device,
).long()
# Add noise to the clean images according to the noise magnitude at each timestep
# (this is the forward diffusion process)
noisy_images = image_pipe.scheduler.add_noise(clean_images, noise, timesteps)
# Get the model prediction for the noise
noise_pred = image_pipe.unet(noisy_images, timesteps, return_dict=False)[0]
# Compare the prediction with the actual noise:
loss = F.mse_loss(
noise_pred, noise
) # NB - trying to predict noise (eps) not (noisy_ims-clean_ims) or just (clean_ims)
# Store for later plotting
losses.append(loss.item())
# Update the model parameters with the optimizer based on this loss
loss.backward(loss)
# Gradient accumulation:
if (step + 1) % grad_accumulation_steps == 0:
optimizer.step()
optimizer.zero_grad()
print(
f"Epoch {epoch} average loss: {sum(losses[-len(train_dataloader):])/len(train_dataloader)}"
)
# Plot the loss curve:
plt.plot(losses)
可以看到损失曲线非常不稳定,这主要是因为batch_size=4过小的原因。一种解决办法是适用极小的学习率来进行更新。但是还是希望可以找到一种与增加批量大小效果一样好,但又不会使内存上升的方法。
梯度累加 gradient accumulation就是这样一种方法,其原理就是不像通常训练一样,每训练一批数据,就将优化器的梯度归零,而是经过几轮这样的话梯度会进行累加内部默认会求均值,直到达到相应的累计次数,再进行梯度更新,这样就可以有效合并多个批次的信息进行单次更新。这样就可以使总更新次数变少类似于适用大批量数据进行训练。
用预训练好的模型生成一些图片:
# @markdown Generate and plot some images:
x = torch.randn(8, 3, 256, 256).to(device) # Batch of 8
for i, t in tqdm(enumerate(scheduler.timesteps)):
model_input = scheduler.scale_model_input(x, t)
with torch.no_grad():
noise_pred = image_pipe.unet(model_input, t)["sample"]
x = scheduler.step(noise_pred, t, x).prev_sample
grid = torchvision.utils.make_grid(x, nrow=4)
plt.imshow(grid.permute(1, 2, 0).cpu().clip(-1, 1) * 0.5 + 0.5);
可以看到输出非常奇怪,这主要是因为微调数据集与人脸模型原始数据集不匹配的原因。但如果训练更长时间,就会越来越像蝴蝶数据集靠拢。
4.引导
微调是在训练数据集上进行下手,重新训练新的数据集。而引导是对预测生成过程进行一些控制。先以简单的示例入手,我们希望生成图像偏向特定颜色,可以首先创建调节函数,将图像的像素与目标像素进行比较并返回平均误差:
def color_loss(images, target_color=(0.1, 0.9, 0.5)):
"""Given a target color (R, G, B) return a loss for how far away on average
the images' pixels are from that color. Defaults to a light teal: (0.1, 0.9, 0.5)"""
target = (
torch.tensor(target_color).to(images.device) * 2 - 1
) # Map target color to (-1, 1)
target = target[
None, :, None, None
] # Get shape right to work with the images (b, c, h, w)
error = torch.abs(
images - target
).mean() # Mean absolute difference between the image pixels and the target color
return error
制作采样循环步骤:
1、创建具有requires_grad = True 的新版本 x
2、计算降噪版本 (x0)
3、通过我们的损失函数馈送预测的 x0
4、求此损失函数相对于 x 的梯度
5、在我们使用调度程序之前,使用此条件梯度来修改 x,希望根据我们的指导函数将 x 推向一个会导致更低损失的方向。
这里有两个版本的尝试。
首先是在unet获得噪声预测后,再把x的requires_grad设置为True,这样更节省内存,因为我们不用从扩散模型去追溯梯度,但梯度精度不高:
# Variant 1: shortcut method
# The guidance scale determines the strength of the effect
guidance_loss_scale = 40 # Explore changing this to 5, or 100
x = torch.randn(8, 3, 256, 256).to(device)
for i, t in tqdm(enumerate(scheduler.timesteps)):
# Prepare the model input
model_input = scheduler.scale_model_input(x, t)
# predict the noise residual
with torch.no_grad():
noise_pred = image_pipe.unet(model_input, t)["sample"]
# Set x.requires_grad to True
x = x.detach().requires_grad_()
# Get the predicted x0
x0 = scheduler.step(noise_pred, t, x).pred_original_sample
# Calculate loss
loss = color_loss(x0) * guidance_loss_scale
if i % 10 == 0:
print(i, "loss:", loss.item())
# Get gradient
cond_grad = -torch.autograd.grad(loss, x)[0]
# Modify x based on this gradient
x = x.detach() + cond_grad
# Now step with scheduler
x = scheduler.step(noise_pred, t, x).prev_sample
# View the output
grid = torchvision.utils.make_grid(x, nrow=4)
im = grid.permute(1, 2, 0).cpu().clip(-1, 1) * 0.5 + 0.5
Image.fromarray(np.array(im * 255).astype(np.uint8))
第二个版本需要更大的RAM,即在一开始就设置require_grad:
# Variant 2: setting x.requires_grad before calculating the model predictions
guidance_loss_scale = 40
x = torch.randn(4, 3, 256, 256).to(device)
for i, t in tqdm(enumerate(scheduler.timesteps)):
# Set requires_grad before the model forward pass
x = x.detach().requires_grad_()
model_input = scheduler.scale_model_input(x, t)
# predict (with grad this time)
noise_pred = image_pipe.unet(model_input, t)["sample"]
# Get the predicted x0:
x0 = scheduler.step(noise_pred, t, x).pred_original_sample
# Calculate loss
loss = color_loss(x0) * guidance_loss_scale
if i % 10 == 0:
print(i, "loss:", loss.item())
# Get gradient
cond_grad = -torch.autograd.grad(loss, x)[0]
# Modify x based on this gradient
x = x.detach() + cond_grad
# Now step with scheduler
x = scheduler.step(noise_pred, t, x).prev_sample
grid = torchvision.utils.make_grid(x, nrow=4)
im = grid.permute(1, 2, 0).cpu().clip(-1, 1) * 0.5 + 0.5
Image.fromarray(np.array(im * 255).astype(np.uint8))
很明显第二种输出更接近模型训练的图像类型,并且可以增加引导比例以获得更强的效果。
这是基础的颜色引导,但如果我们是想输入文本去引导呢?这里就要用到clip引导了。
clip是由OpenAI 创建的模型,它允许我们将图像与文本标题进行比较。这是非常强大的,因为它使我们能够量化图像与提示的匹配程度。由于该过程是可微的,我们可以将其用作损失函数来指导我们的扩散模型!
基本方法是:1、嵌入文本提示以获取文本的 512 维 CLIP 嵌入
2、对于扩散模型过程中的每一步:
对预测的去噪图像进行多种变体(具有多个变体可提供更清晰的损失信号);对于每一个,使用 CLIP 嵌入图像,并将此嵌入与提示的文本嵌入进行比较(使用称为“大圆距离平方”的度量);3、计算此损耗相对于当前噪声 x 的梯度,并在使用调度程序更新之前使用此梯度修改 x。
# @markdown load a CLIP model and define the loss function
import open_clip
clip_model, _, preprocess = open_clip.create_model_and_transforms(
"ViT-B-32", pretrained="openai"
)
clip_model.to(device)
# Transforms to resize and augment an image + normalize to match CLIP's training data
tfms = torchvision.transforms.Compose(
[
torchvision.transforms.RandomResizedCrop(224), # Random CROP each time
torchvision.transforms.RandomAffine(
5
), # One possible random augmentation: skews the image
torchvision.transforms.RandomHorizontalFlip(), # You can add additional augmentations if you like
torchvision.transforms.Normalize(
mean=(0.48145466, 0.4578275, 0.40821073),
std=(0.26862954, 0.26130258, 0.27577711),
),
]
)
# And define a loss function that takes an image, embeds it and compares with
# the text features of the prompt
def clip_loss(image, text_features):
image_features = clip_model.encode_image(
tfms(image)
) # Note: applies the above transforms
input_normed = torch.nn.functional.normalize(image_features.unsqueeze(1), dim=2)
embed_normed = torch.nn.functional.normalize(text_features.unsqueeze(0), dim=2)
dists = (
input_normed.sub(embed_normed).norm(dim=2).div(2).arcsin().pow(2).mul(2)
) # Squared Great Circle Distance
return dists.mean()
定义好损失函数后,只用将前面的颜色损失替换即可开始训练。
# @markdown applying guidance using CLIP
prompt = "Red Rose (still life), red flower painting" # @param
# Explore changing this
guidance_scale = 8 # @param
n_cuts = 4 # @param
# More steps -> more time for the guidance to have an effect
scheduler.set_timesteps(50)
# We embed a prompt with CLIP as our target
text = open_clip.tokenize([prompt]).to(device)
with torch.no_grad(), torch.cuda.amp.autocast():
text_features = clip_model.encode_text(text)
x = torch.randn(4, 3, 256, 256).to(
device
) # RAM usage is high, you may want only 1 image at a time
for i, t in tqdm(enumerate(scheduler.timesteps)):
model_input = scheduler.scale_model_input(x, t)
# predict the noise residual
with torch.no_grad():
noise_pred = image_pipe.unet(model_input, t)["sample"]
cond_grad = 0
for cut in range(n_cuts):
# Set requires grad on x
x = x.detach().requires_grad_()
# Get the predicted x0:
x0 = scheduler.step(noise_pred, t, x).pred_original_sample
# Calculate loss
loss = clip_loss(x0, text_features) * guidance_scale
# Get gradient (scale by n_cuts since we want the average)
cond_grad -= torch.autograd.grad(loss, x)[0] / n_cuts
if i % 25 == 0:
print("Step:", i, ", Guidance loss:", loss.item())
# Modify x based on this gradient
alpha_bar = scheduler.alphas_cumprod[i]
x = (
x.detach() + cond_grad * alpha_bar.sqrt()
) # Note the additional scaling factor here!
# Now step with scheduler
x = scheduler.step(noise_pred, t, x).prev_sample
grid = torchvision.utils.make_grid(x.detach(), nrow=4)
im = grid.permute(1, 2, 0).cpu().clip(-1, 1) * 0.5 + 0.5
Image.fromarray(np.array(im * 255).astype(np.uint8))
可以看出预测图像已经接近提示词需要输出的玫瑰,可以通过更多调节来使得输出更加完美。
三、实战
最后创建一个以类条件信息作为添加输入的扩散模型,使用的是MNIST数据集,希望训练出一个类条件扩散模型,可以指定我们希望推理生成的数字。
首先导入模块,加载数据集:
import torch
import torchvision
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader
from diffusers import DDPMScheduler, UNet2DModel
from matplotlib import pyplot as plt
from tqdm.auto import tqdm
device = 'mps' if torch.backends.mps.is_available() else 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'Using device: {device}')
# Load the dataset
dataset = torchvision.datasets.MNIST(root="mnist/", train=True, download=True, transform=torchvision.transforms.ToTensor())
# Feed it into a dataloader (batch size 8 here just for demo)
train_dataloader = DataLoader(dataset, batch_size=8, shuffle=True)
# View some examples
x, y = next(iter(train_dataloader))
print('Input shape:', x.shape)
print('Labels:', y)
plt.imshow(torchvision.utils.make_grid(x)[0], cmap='Greys');
接着创建类条件的UNet模型,具体方法如下:
1、创建带有一些附加输入通道的标准 UNet2DModel
2、通过嵌入层将类标签映射到学习的形状 (class_emb_size) 向量
3、此信息连接为内部 UNet 输入 net_input = torch.cat((x, class_cond), 1) 的额外通道
4、将此 net_input (总共有(class_emb_size+1 )个通道 )输入 UNet 以获得最终预测
class ClassConditionedUnet(nn.Module):
def __init__(self, num_classes=10, class_emb_size=4):
super().__init__()
# The embedding layer will map the class label to a vector of size class_emb_size
self.class_emb = nn.Embedding(num_classes, class_emb_size)
# Self.model is an unconditional UNet with extra input channels to accept the conditioning information (the class embedding)
self.model = UNet2DModel(
sample_size=28, # the target image resolution
in_channels=1 + class_emb_size, # Additional input channels for class cond.
out_channels=1, # the number of output channels
layers_per_block=2, # how many ResNet layers to use per UNet block
block_out_channels=(32, 64, 64),
down_block_types=(
"DownBlock2D", # a regular ResNet downsampling block
"AttnDownBlock2D", # a ResNet downsampling block with spatial self-attention
"AttnDownBlock2D",
),
up_block_types=(
"AttnUpBlock2D",
"AttnUpBlock2D", # a ResNet upsampling block with spatial self-attention
"UpBlock2D", # a regular ResNet upsampling block
),
)
# Our forward method now takes the class labels as an additional argument
def forward(self, x, t, class_labels):
# Shape of x:
bs, ch, w, h = x.shape
# class conditioning in right shape to add as additional input channels
class_cond = self.class_emb(class_labels) # Map to embedding dimension
class_cond = class_cond.view(bs, class_cond.shape[1], 1, 1).expand(bs, class_cond.shape[1], w, h)
# x is shape (bs, 1, 28, 28) and class_cond is now (bs, 4, 28, 28)
# Net input is now x and class cond concatenated together along dimension 1
net_input = torch.cat((x, class_cond), 1) # (bs, 5, 28, 28)
# Feed this to the UNet alongside the timestep and return the prediction
return self.model(net_input, t).sample # (bs, 1, 28, 28)
训练和采样,以前我们会做一些类似 prediction = unet(x, t) 的事情,我们现在会在训练期间添加正确的标签作为第三个参数( prediction = unet(x, t, y) ),在推理时,我们可以传递我们想要的任何标签,如果一切顺利,模型应该生成匹配的图像。 y 在这种情况下是 MNIST 数字的标签,值从 0 到 9。
# Create a scheduler
noise_scheduler = DDPMScheduler(num_train_timesteps=1000, beta_schedule='squaredcos_cap_v2')
#@markdown Training loop (10 Epochs):
# Redefining the dataloader to set the batch size higher than the demo of 8
train_dataloader = DataLoader(dataset, batch_size=128, shuffle=True)
# How many runs through the data should we do?
n_epochs = 10
# Our network
net = ClassConditionedUnet().to(device)
# Our loss function
loss_fn = nn.MSELoss()
# The optimizer
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
# Keeping a record of the losses for later viewing
losses = []
# The training loop
for epoch in range(n_epochs):
for x, y in tqdm(train_dataloader):
# Get some data and prepare the corrupted version
x = x.to(device) * 2 - 1 # Data on the GPU (mapped to (-1, 1))
y = y.to(device)
noise = torch.randn_like(x)
timesteps = torch.randint(0, 999, (x.shape[0],)).long().to(device)
noisy_x = noise_scheduler.add_noise(x, noise, timesteps)
# Get the model prediction
pred = net(noisy_x, timesteps, y) # Note that we pass in the labels y
# Calculate the loss
loss = loss_fn(pred, noise) # How close is the output to the noise
# Backprop and update the params:
opt.zero_grad()
loss.backward()
opt.step()
# Store the loss for later
losses.append(loss.item())
# Print out the average of the last 100 loss values to get an idea of progress:
avg_loss = sum(losses[-100:])/100
print(f'Finished epoch {epoch}. Average of the last 100 loss values: {avg_loss:05f}')
# View the loss curve
plt.plot(losses)
训练完成后我们可以对一些不同标签中的图像进行采样作为我们的条件生成:
#@markdown Sampling some different digits:
# Prepare random x to start from, plus some desired labels y
x = torch.randn(80, 1, 28, 28).to(device)
y = torch.tensor([[i]*8 for i in range(10)]).flatten().to(device)
# Sampling loop
for i, t in tqdm(enumerate(noise_scheduler.timesteps)):
# Get model pred
with torch.no_grad():
residual = net(x, t, y) # Again, note that we pass in our labels y
# Update sample with step
x = noise_scheduler.step(residual, t, x).prev_sample
# Show the results
fig, ax = plt.subplots(1, 1, figsize=(12, 12))
ax.imshow(torchvision.utils.make_grid(x.detach().cpu().clip(-1, 1), nrow=8)[0], cmap='Greys')
可以看到生成效果还是可以的