Running ControlNet with diffusers (Stable Diffusion v1-5 plus controlnet_v11p_sd15); these notes record some details from the code.
from diffusers import StableDiffusionControlNetImg2ImgPipeline, ControlNetModel, UniPCMultistepScheduler
from diffusers.utils import load_image
import numpy as np
import torch
import cv2
from PIL import Image
CONTROLNET_PATH = ''
MODEL_PATH = ''
image = Image.open("squirrel.jpeg")
np_image = np.array(image)
# get canny image
np_image = cv2.Canny(np_image, 100, 200)
np_image = np_image[:, :, None]
np_image = np.concatenate([np_image, np_image, np_image], axis=2)
canny_image = Image.fromarray(np_image)
# load control net and stable diffusion v1-5
controlnet = ControlNetModel.from_pretrained(CONTROLNET_PATH, torch_dtype=torch.float16)
pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(MODEL_PATH, controlnet=controlnet, torch_dtype=torch.float16)
# speed up diffusion process with faster scheduler and memory optimization
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()
# generate image
generator = torch.manual_seed(0)
image = pipe(
    "a white squirrel",
    num_inference_steps=10,
    generator=generator,
    image=image,
    control_image=canny_image,
).images[0]
image.save("result.jpg")
The ControlNet model reuses the encoder and middle block of the Stable Diffusion UNet, with those weights retrained.
For details, see: a code walkthrough of UNet2DConditionModel in Stable Diffusion.
1. The strength parameter
strength controls how much the input image is altered. It is a number between 0 and 1 (default 0.8); the larger the value, the greater the change to the image.
It also affects how many of the num_inference_steps denoising steps are actually run: the effective step count is num_inference_steps * strength.
def get_timesteps(self, num_inference_steps, strength, device):
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    t_start = max(num_inference_steps - init_timestep, 0)
    timesteps = self.scheduler.timesteps[t_start * self.scheduler.order :]
    return timesteps, num_inference_steps - t_start
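The arithmetic above can be checked in isolation. Below is a standalone sketch (not the actual pipeline method; the helper name `effective_steps` is my own) showing that with num_inference_steps=10 and strength=0.8, only the last 8 scheduler steps are run:

```python
def effective_steps(num_inference_steps: int, strength: float) -> int:
    # Mirrors get_timesteps: cap init_timestep at num_inference_steps,
    # then skip the first t_start steps of the schedule.
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    t_start = max(num_inference_steps - init_timestep, 0)
    return num_inference_steps - t_start

print(effective_steps(10, 0.8))  # 8
print(effective_steps(10, 1.0))  # 10
```

With strength=1.0 the full schedule is used, which is why pure text-to-image behavior is recovered at that setting.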
2. Zero convolution
The first zero convolution that the canny image (the condition) passes through is ControlNetConditioningEmbedding. Through three stride-2 downsamplings and channel expansions it produces a tensor with the same shape as the latents (the input), (2, 320, h//8, w//8), and the two are then added together.
All subsequent zero convolutions are Conv2d layers with kernel_size=(1, 1).
ControlNetConditioningEmbedding(
  (conv_in): Conv2d(3, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (blocks): ModuleList(
    (0): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): Conv2d(16, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (2): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (3): Conv2d(32, 96, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (4): Conv2d(96, 96, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (5): Conv2d(96, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
  )
  (conv_out): Conv2d(256, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
)
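A quick sanity check on the spatial sizes (plain Python, with hypothetical helper names, using the standard Conv2d output-size formula): the three stride-2 blocks, (1), (3), and (5), each halve the resolution, so a 512x512 condition image lands at 64x64, matching the latent resolution h//8 x w//8:

```python
def conv_out_size(size: int, kernel: int = 3, stride: int = 1, padding: int = 1) -> int:
    # Standard Conv2d output-size formula: floor((size + 2p - k) / s) + 1
    return (size + 2 * padding - kernel) // stride + 1

def downsampled(size: int, num_stride2_blocks: int = 3) -> int:
    # Only the stride-2 blocks change the spatial size; stride-1 convs
    # with kernel 3 and padding 1 preserve it.
    for _ in range(num_stride2_blocks):
        size = conv_out_size(size, stride=2)
    return size

print(downsampled(512))  # 64, i.e. 512 // 8
```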
3. conditioning_scale
The outputs of the down blocks and the middle block are each scaled after their zero convolution: they are multiplied by conditioning_scale, which defaults to 0.8 in this pipeline.
down_block_res_samples = [sample * conditioning_scale for sample in down_block_res_samples]
mid_block_res_sample = mid_block_res_sample * conditioning_scale
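To make the scaling step concrete, here is a toy run of the two lines above with plain floats standing in for the residual tensors (the numeric values are hypothetical, not real activations):

```python
conditioning_scale = 0.8            # the pipeline default noted above
down_block_res_samples = [10.0, 20.0, 40.0]   # stand-ins for down-block residuals
mid_block_res_sample = 10.0                   # stand-in for the mid-block residual

# Same scaling as in the pipeline: every residual is multiplied uniformly.
down_block_res_samples = [sample * conditioning_scale for sample in down_block_res_samples]
mid_block_res_sample = mid_block_res_sample * conditioning_scale
```

Lowering conditioning_scale uniformly weakens the ControlNet's influence on every injection point; at 0.0 the UNet runs as if no ControlNet were attached.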
4. Adding the results
These scaled residuals are then added element-wise to the corresponding outputs of the UNet's own down blocks and middle block (the left half of the ControlNet architecture diagram), and the sums flow into the UNet decoder, which predicts the noise.
# down block
if is_controlnet:
    new_down_block_res_samples = ()
    for down_block_res_sample, down_block_additional_residual in zip(
        down_block_res_samples, down_block_additional_residuals
    ):
        down_block_res_sample = down_block_res_sample + down_block_additional_residual
        new_down_block_res_samples = new_down_block_res_samples + (down_block_res_sample,)
    down_block_res_samples = new_down_block_res_samples

# middle block
if is_controlnet:
    sample = sample + mid_block_additional_residual
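The addition logic can be traced with a toy example, again using floats in place of feature-map tensors (all values hypothetical):

```python
down_block_res_samples = [1.0, 2.0, 4.0]           # UNet's own skip connections
down_block_additional_residuals = [0.5, 0.5, 0.5]  # scaled ControlNet residuals
sample = 5.0                                       # UNet mid-block output
mid_block_additional_residual = 0.25               # scaled ControlNet mid residual

# Pairwise addition, rebuilt as a tuple exactly like the UNet code above.
new_down_block_res_samples = ()
for res, extra in zip(down_block_res_samples, down_block_additional_residuals):
    new_down_block_res_samples = new_down_block_res_samples + (res + extra,)
down_block_res_samples = new_down_block_res_samples

sample = sample + mid_block_additional_residual
```

Note the ControlNet residuals only modify the skip connections and the mid-block output; the decoder itself is the unmodified Stable Diffusion decoder.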