Project page: Docs
Notebook (recommended: open it in Google Colab): https://github.com/huggingface/diffusion-models-class/tree/main/unit2
Recommended reading: 《扩散模型-从原理到实战》, Chapter 5
This post introduces three more advanced ways of working with diffusion models:
1. Fine-tuning
2. Guidance
3. Conditioning
Fine-tuning
Start from a pretrained model and continue training it on a small amount of new data; after training, the model has picked up knowledge from the new data.
Steps:
1. Prepare the fine-tuning dataset: ideally it should be similar or related to the data the model was pretrained on. Once you have the dataset, apply transforms (Resize, Flip, Normalize, ...), set the batch_size, and finally wrap it in a dataloader.
import numpy as np
import torch
import torch.nn.functional as F
import torchvision
from datasets import load_dataset
from diffusers import DDIMScheduler, DDPMPipeline
from matplotlib import pyplot as plt
from PIL import Image
from torchvision import transforms
from tqdm.auto import tqdm

# @markdown load and prepare a dataset:
# Not on Colab? Comments with #@ enable UI tweaks like headings or user inputs
# but can safely be ignored if you're working on a different platform.
dataset_name = "huggan/smithsonian_butterflies_subset"  # @param
dataset = load_dataset(dataset_name, split="train")
image_size = 256  # @param
batch_size = 4  # @param
preprocess = transforms.Compose(
    [
        transforms.Resize((image_size, image_size)),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize([0.5], [0.5]),
    ]
)

def transform(examples):
    images = [preprocess(image.convert("RGB")) for image in examples["image"]]
    return {"images": images}

dataset.set_transform(transform)
train_dataloader = torch.utils.data.DataLoader(
    dataset, batch_size=batch_size, shuffle=True
)

print("Previewing batch:")
batch = next(iter(train_dataloader))
grid = torchvision.utils.make_grid(batch["images"], nrow=4)
plt.imshow(grid.permute(1, 2, 0).cpu().clip(-1, 1) * 0.5 + 0.5);
2. The fine-tuning training code is essentially the same as an ordinary training loop. In the code below, the parameters handed to the optimizer are all of the pretrained model's parameters. To simulate a larger batch_size on limited memory, gradient accumulation (grad_accumulation) is used.
If you want to check how the generated images look from time to time during training, you can log samples with wandb.
For background on gradient accumulation, see https://kozodoi.me/python/deep%20learning/pytorch/tutorial/2021/02/19/gradient-accumulation.html
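The training loop below fine-tunes image_pipe.unet, but image_pipe itself is never defined in this post. A minimal sketch of the assumed setup, using the google/ddpm-celebahq-256 checkpoint that the course notebook fine-tunes (the checkpoint name is an assumption here; any DDPMPipeline checkpoint would do):

device = "cuda" if torch.cuda.is_available() else "cpu"

# Assumed setup (not shown in the original post): a pretrained DDPM pipeline
# whose UNet we will fine-tune on the new dataset.
image_pipe = DDPMPipeline.from_pretrained("google/ddpm-celebahq-256")
image_pipe.to(device)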
num_epochs = 2  # @param
lr = 1e-5  # @param
grad_accumulation_steps = 4  # @param

optimizer = torch.optim.AdamW(image_pipe.unet.parameters(), lr=lr)

losses = []

for epoch in range(num_epochs):
    for step, batch in tqdm(enumerate(train_dataloader), total=len(train_dataloader)):
        clean_images = batch["images"].to(device)
        # Sample noise to add to the images
        noise = torch.randn(clean_images.shape).to(clean_images.device)
        bs = clean_images.shape[0]

        # Sample a random timestep for each image
        timesteps = torch.randint(
            0,
            image_pipe.scheduler.num_train_timesteps,
            (bs,),
            device=clean_images.device,
        ).long()

        # Add noise to the clean images according to the noise magnitude at each timestep
        # (this is the forward diffusion process)
        noisy_images = image_pipe.scheduler.add_noise(clean_images, noise, timesteps)

        # Get the model prediction for the noise
        noise_pred = image_pipe.unet(noisy_images, timesteps, return_dict=False)[0]

        # Compare the prediction with the actual noise:
        loss = F.mse_loss(
            noise_pred, noise
        )  # NB - trying to predict noise (eps) not (noisy_ims-clean_ims) or just (clean_ims)

        # Store for later plotting
        losses.append(loss.item())

        # Update the model parameters with the optimizer based on this loss
        loss.backward()

        # Gradient accumulation:
        if (step + 1) % grad_accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()

    print(
        f"Epoch {epoch} average loss: {sum(losses[-len(train_dataloader):])/len(train_dataloader)}"
    )

# Plot the loss curve:
plt.plot(losses)
Guidance
Add constraints during sampling so that the model generates results that satisfy specific requirements.
One way to guide (i.e. add constraints) is to use a loss function that measures how far the generated image is from the condition we specify. For example, below we want the generated images to lean towards green, so we define a loss that measures the deviation of the generated pixels from a target green/teal color (0.1, 0.9, 0.5).
def color_loss(images, target_color=(0.1, 0.9, 0.5)):
    """Given a target color (R, G, B) return a loss for how far away on average
    the images' pixels are from that color. Defaults to a light teal: (0.1, 0.9, 0.5)"""
    target = (
        torch.tensor(target_color).to(images.device) * 2 - 1
    )  # Map target color to (-1, 1)
    target = target[
        None, :, None, None
    ]  # Get shape right to work with the images (b, c, h, w)
    error = torch.abs(
        images - target
    ).mean()  # Mean absolute difference between the image pixels and the target color
    return error
Once the loss function is in place, guidance proceeds as follows:
1. Set requires_grad of the input image x to True.
2. Feed x into the model to get noise_pred; then pass noise_pred and x to the scheduler and take the pred_original_sample field of the return value, which is the predicted x0.
3. Plug x0 into the loss function and compute the loss L.
4. Compute the gradient of L with respect to x, dL/dx, and use it to update x.
5. Pass the updated x together with noise_pred into scheduler.step() to complete one denoising step.
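The loop below (the "shortcut" variant) implements these steps. It also relies on a standalone scheduler that is not defined in this post; a minimal sketch of the assumed setup, switching to a DDIM scheduler with 40 inference steps for faster sampling, roughly as the course notebook does:

# Assumed setup: a DDIM scheduler for the guided sampling loop, created here
# from the pipeline's existing scheduler config (one possible way to do it).
scheduler = DDIMScheduler.from_config(image_pipe.scheduler.config)
scheduler.set_timesteps(num_inference_steps=40)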
# Variant 1: shortcut method

# The guidance scale determines the strength of the effect
guidance_loss_scale = 40  # Explore changing this to 5, or 100

x = torch.randn(8, 3, 256, 256).to(device)

for i, t in tqdm(enumerate(scheduler.timesteps)):

    # Prepare the model input
    model_input = scheduler.scale_model_input(x, t)

    # predict the noise residual
    with torch.no_grad():
        noise_pred = image_pipe.unet(model_input, t)["sample"]

    # Set x.requires_grad to True
    x = x.detach().requires_grad_()

    # Get the predicted x0
    x0 = scheduler.step(noise_pred, t, x).pred_original_sample

    # Calculate loss
    loss = color_loss(x0) * guidance_loss_scale
    if i % 10 == 0:
        print(i, "loss:", loss.item())

    # Get gradient
    cond_grad = -torch.autograd.grad(loss, x)[0]

    # Modify x based on this gradient
    x = x.detach() + cond_grad

    # Now step with scheduler
    x = scheduler.step(noise_pred, t, x).prev_sample

# View the output
grid = torchvision.utils.make_grid(x, nrow=4)
im = grid.permute(1, 2, 0).cpu().clip(-1, 1) * 0.5 + 0.5
Image.fromarray(np.array(im * 255).astype(np.uint8))
When computing the gradient with respect to x, there are two options:
The first is to not track gradients through the prediction model, as in the code above where the UNet forward pass is wrapped in with torch.no_grad(). The advantage is that computing dL/dx is fast; the drawback is that some accuracy is lost.
The other option is to remove with torch.no_grad() from the code above, i.e. let gradients flow through the model from the start. The advantage is higher accuracy; the drawback is that it is slower and uses more memory. A sketch of this variant follows below.
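The sketch below mirrors "Variant 2" of the course notebook: requires_grad is set on x before the UNet forward pass, so the guidance gradient is backpropagated through the model as well. Treat it as illustrative rather than verbatim.

# Variant 2: set x.requires_grad before the model prediction, so the gradient
# flows through the UNet too (slower and more memory-hungry, but more accurate).
guidance_loss_scale = 40
x = torch.randn(4, 3, 256, 256).to(device)

for i, t in tqdm(enumerate(scheduler.timesteps)):

    # Set requires_grad BEFORE the model forward pass
    x = x.detach().requires_grad_()
    model_input = scheduler.scale_model_input(x, t)

    # Predict the noise residual (note: no torch.no_grad() here)
    noise_pred = image_pipe.unet(model_input, t)["sample"]

    # Predicted x0 and guidance loss
    x0 = scheduler.step(noise_pred, t, x).pred_original_sample
    loss = color_loss(x0) * guidance_loss_scale

    # Gradient of the loss w.r.t. x, used to nudge x towards the condition
    cond_grad = -torch.autograd.grad(loss, x)[0]
    x = x.detach() + cond_grad

    # Scheduler step as usual
    x = scheduler.step(noise_pred, t, x).prev_sample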
CLIP Guidance (guidance with the help of a CLIP model)
This makes it possible to steer the content of the generated images with a text prompt.
CLIP can embed both text and images, which lets us compute a similarity between the two embeddings and define a loss from it. Once that loss is defined, the image can be guided in exactly the same way as above.
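A minimal sketch of what such a CLIP loss could look like, assuming the open_clip library with a ViT-B-32 checkpoint (the model choice, prompt, and the clip_loss helper are illustrative; the course notebook's version additionally uses random crops/augmentations and a slightly different distance):

# Sketch of a CLIP-based guidance loss (assumed setup, based on open_clip).
import open_clip
import torchvision.transforms as T

clip_model, _, _ = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
clip_model.to(device)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

# CLIP expects 224x224 inputs normalized with its own statistics
clip_preprocess = T.Compose(
    [
        T.Resize(224),
        T.CenterCrop(224),
        T.Normalize(
            mean=(0.48145466, 0.4578275, 0.40821073),
            std=(0.26862954, 0.26130258, 0.27577711),
        ),
    ]
)

def clip_loss(images, text_features):
    # Diffusion outputs live in (-1, 1); map to (0, 1) before CLIP preprocessing
    images = clip_preprocess((images + 1) / 2)
    image_features = clip_model.encode_image(images)
    # Cosine distance between image and prompt embeddings
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    return (1 - (image_features * text_features).sum(-1)).mean()

prompt = "A watercolor painting of a rose"  # example prompt
with torch.no_grad():
    text_features = clip_model.encode_text(tokenizer([prompt]).to(device))

With text_features computed once for a prompt, clip_loss(x0, text_features) can stand in for color_loss(x0) in the guided sampling loops above; the guidance scale typically needs to be retuned.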
Conditioning
By merging label information into the input and then training the network, the model learns to generate the content we specify.
Example:
Train a UNet2DModel on the MNIST dataset (together with its labels) so that conditioning information is added. After training, the model can generate an image of the digit corresponding to a specified label y (MNIST has 10 labels, the digits 0-9).
Step 0: Install dependencies
%pip install -q diffusers
import torch
import torchvision
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader
from diffusers import DDPMScheduler, UNet2DModel
from matplotlib import pyplot as plt
from tqdm.auto import tqdm
device = 'mps' if torch.backends.mps.is_available() else 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'Using device: {device}')
Step 1: Prepare the dataset
Download the MNIST dataset and put it in a dataloader.
# Load the dataset
dataset = torchvision.datasets.MNIST(root="mnist/", train=True, download=True, transform=torchvision.transforms.ToTensor())
# Feed it into a dataloader (batch size 8 here just for demo)
train_dataloader = DataLoader(dataset, batch_size=8, shuffle=True)
# View some examples
x, y = next(iter(train_dataloader))
print('Input shape:', x.shape)
print('Labels:', y)
plt.imshow(torchvision.utils.make_grid(x)[0], cmap='Greys');
Step 2: Build the network
1. In __init__():
(1) Add an embedding layer class_emb that maps the class label to an embedding; the embedding dimension class_emb_size is set manually.
(2) When calling UNet2DModel, set in_channels to 1 + class_emb_size, where 1 is the number of channels of the MNIST images (1 * 28 * 28) and class_emb_size is the number of extra channels added once the label y has been mapped to an embedding; the forward function shows exactly how this is done.
2. forward(): (the key part)
(1) class_cond: the label (a digit 0-9) is first mapped to an embedding, which is then expanded to the image's spatial size.
(2) The expanded class_cond is concatenated with the input image x along the channel dimension; this combined tensor is the new input fed to the model for training (and inference).
class ClassConditionedUnet(nn.Module):
    def __init__(self, num_classes=10, class_emb_size=4):
        super().__init__()

        # The embedding layer will map the class label to a vector of size class_emb_size
        self.class_emb = nn.Embedding(num_classes, class_emb_size)

        # Self.model is an unconditional UNet with extra input channels to accept the conditioning information (the class embedding)
        self.model = UNet2DModel(
            sample_size=28,  # the target image resolution
            in_channels=1 + class_emb_size,  # Additional input channels for class cond.
            out_channels=1,  # the number of output channels
            layers_per_block=2,  # how many ResNet layers to use per UNet block
            block_out_channels=(32, 64, 64),
            down_block_types=(
                "DownBlock2D",  # a regular ResNet downsampling block
                "AttnDownBlock2D",  # a ResNet downsampling block with spatial self-attention
                "AttnDownBlock2D",
            ),
            up_block_types=(
                "AttnUpBlock2D",
                "AttnUpBlock2D",  # a ResNet upsampling block with spatial self-attention
                "UpBlock2D",  # a regular ResNet upsampling block
            ),
        )

    # Our forward method now takes the class labels as an additional argument
    def forward(self, x, t, class_labels):
        # Shape of x:
        bs, ch, w, h = x.shape

        # class conditioning in right shape to add as additional input channels
        class_cond = self.class_emb(class_labels)  # Map to embedding dimension
        class_cond = class_cond.view(bs, class_cond.shape[1], 1, 1).expand(bs, class_cond.shape[1], w, h)
        # x is shape (bs, 1, 28, 28) and class_cond is now (bs, 4, 28, 28)

        # Net input is now x and class cond concatenated together along dimension 1
        net_input = torch.cat((x, class_cond), 1)  # (bs, 5, 28, 28)

        # Feed this to the UNet alongside the timestep and return the prediction
        return self.model(net_input, t).sample  # (bs, 1, 28, 28)
Step 3: Training
The training loop is much the same as a normal one; the only difference is that y is also passed to the model as an argument.
# Create a scheduler
noise_scheduler = DDPMScheduler(num_train_timesteps=1000, beta_schedule='squaredcos_cap_v2')

#@markdown Training loop (10 Epochs):

# Redefining the dataloader to set the batch size higher than the demo of 8
train_dataloader = DataLoader(dataset, batch_size=128, shuffle=True)

# How many runs through the data should we do?
n_epochs = 10

# Our network
net = ClassConditionedUnet().to(device)

# Our loss function
loss_fn = nn.MSELoss()

# The optimizer
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

# Keeping a record of the losses for later viewing
losses = []

# The training loop
for epoch in range(n_epochs):
    for x, y in tqdm(train_dataloader):

        # Get some data and prepare the corrupted version
        x = x.to(device) * 2 - 1  # Data on the GPU (mapped to (-1, 1))
        y = y.to(device)
        noise = torch.randn_like(x)
        timesteps = torch.randint(0, 999, (x.shape[0],)).long().to(device)
        noisy_x = noise_scheduler.add_noise(x, noise, timesteps)

        # Get the model prediction
        pred = net(noisy_x, timesteps, y)  # Note that we pass in the labels y

        # Calculate the loss
        loss = loss_fn(pred, noise)  # How close is the output to the noise

        # Backprop and update the params:
        opt.zero_grad()
        loss.backward()
        opt.step()

        # Store the loss for later
        losses.append(loss.item())

    # Print out the average of the last 100 loss values to get an idea of progress:
    avg_loss = sum(losses[-100:])/100
    print(f'Finished epoch {epoch}. Average of the last 100 loss values: {avg_loss:05f}')

# View the loss curve
plt.plot(losses)
Step 4: Sampling
Unlike an ordinary model (input x, predict y), here we feed both x and y into the model; the output is a generated image conditioned on y.
#@markdown Sampling some different digits:

# Prepare random x to start from, plus some desired labels y
x = torch.randn(80, 1, 28, 28).to(device)
y = torch.tensor([[i]*8 for i in range(10)]).flatten().to(device)

# Sampling loop
for i, t in tqdm(enumerate(noise_scheduler.timesteps)):

    # Get model pred
    with torch.no_grad():
        residual = net(x, t, y)  # Again, note that we pass in our labels y

    # Update sample with step
    x = noise_scheduler.step(residual, t, x).prev_sample

# Show the results
fig, ax = plt.subplots(1, 1, figsize=(12, 12))
ax.imshow(torchvision.utils.make_grid(x.detach().cpu().clip(-1, 1), nrow=8)[0], cmap='Greys')