Some details in the ControlNet code

Running ControlNet with diffusers (Stable Diffusion v1-5 plus controlnet_v11p_sd15), noting down some details in the code.

from diffusers import StableDiffusionControlNetImg2ImgPipeline, ControlNetModel, UniPCMultistepScheduler
from diffusers.utils import load_image
import numpy as np
import torch
import cv2
from PIL import Image
CONTROLNET_PATH = ''  # path to the ControlNet weights, e.g. lllyasviel/control_v11p_sd15_canny
MODEL_PATH = ''       # path to the base model weights, e.g. runwayml/stable-diffusion-v1-5
image = Image.open("squirrel.jpeg")
np_image = np.array(image)
# get canny image
np_image = cv2.Canny(np_image, 100, 200)
np_image = np_image[:, :, None]
np_image = np.concatenate([np_image, np_image, np_image], axis=2)
canny_image = Image.fromarray(np_image)
# load control net and stable diffusion v1-5
controlnet = ControlNetModel.from_pretrained(CONTROLNET_PATH, torch_dtype=torch.float16)
pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(MODEL_PATH, controlnet=controlnet, torch_dtype=torch.float16)
# speed up diffusion process with faster scheduler and memory optimization
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()
# generate image
generator = torch.manual_seed(0)
image = pipe(
    "a white  squirrel",
    num_inference_steps=10,
    generator=generator,
    image=image,
    control_image=canny_image,
).images[0]
image.save("result.jpg")

The ControlNet model reuses the encoder (down blocks) and middle block of the Stable Diffusion UNet, retraining a copy of them.
For details, see: "A code walkthrough of UNet2DConditionModel in Stable Diffusion"
[Figure: ControlNet architecture diagram, with the Stable Diffusion UNet on the left and the ControlNet branch with zero convolutions on the right]

1. The strength parameter

strength controls how strongly the input image is changed; it is a number between 0 and 1 (default 0.8), and the larger the value, the bigger the change.
It also affects the number of inference steps: the effective number of denoising steps is int(num_inference_steps * strength).

    def get_timesteps(self, num_inference_steps, strength, device):
        # keep only the last int(num_inference_steps * strength) timesteps
        init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
        t_start = max(num_inference_steps - init_timestep, 0)
        # skip the first t_start entries of the scheduler's timestep sequence
        timesteps = self.scheduler.timesteps[t_start * self.scheduler.order :]
        return timesteps, num_inference_steps - t_start
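
As a quick sanity check of the arithmetic (a minimal sketch, reusing the numbers from the pipeline call above: num_inference_steps=10 and strength left at its 0.8 default):

num_inference_steps = 10
strength = 0.8
init_timestep = min(int(num_inference_steps * strength), num_inference_steps)  # 8
t_start = max(num_inference_steps - init_timestep, 0)                          # 2
# the scheduler's first 2 timesteps are skipped, so only 8 denoising steps run
print(init_timestep, t_start)  # -> 8 2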

2. Zero convolution

The first zero convolution that the canny image (the condition) passes through is ControlNetConditioningEmbedding: three stride-2 downsamplings plus channel expansions map it to the same shape as the latents (the input), (2, 320, h//8, w//8), and the two are then added together.
All of the later zero convolutions are Conv2d layers with kernel_size=(1, 1); "zero" refers to their weights being initialized to zero.

ControlNetConditioningEmbedding(
  (conv_in): Conv2d(3, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (blocks): ModuleList(
    (0): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): Conv2d(16, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (2): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (3): Conv2d(32, 96, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (4): Conv2d(96, 96, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (5): Conv2d(96, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
  )
  (conv_out): Conv2d(256, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
)
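
To see where the h//8, w//8 comes from, here is a minimal plain-PyTorch sketch of the same convolution stack (the SiLU activations between layers are omitted since they do not change shapes; in diffusers the final conv_out is additionally zero-initialized):

import torch
import torch.nn as nn

# same layer configuration as the module printout above, shapes only
embedding = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=1, padding=1),     # conv_in
    nn.Conv2d(16, 16, 3, stride=1, padding=1),
    nn.Conv2d(16, 32, 3, stride=2, padding=1),    # 1st downsampling: h/2
    nn.Conv2d(32, 32, 3, stride=1, padding=1),
    nn.Conv2d(32, 96, 3, stride=2, padding=1),    # 2nd downsampling: h/4
    nn.Conv2d(96, 96, 3, stride=1, padding=1),
    nn.Conv2d(96, 256, 3, stride=2, padding=1),   # 3rd downsampling: h/8
    nn.Conv2d(256, 320, 3, stride=1, padding=1),  # conv_out
)

cond = torch.randn(2, 3, 512, 512)  # batch of 2 comes from classifier-free guidance
print(embedding(cond).shape)  # torch.Size([2, 320, 64, 64]) == (2, 320, h//8, w//8)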

3. conditioning_scale

The outputs of the down blocks and the middle block are all scaled after their zero convolutions: each is multiplied by conditioning_scale, which defaults to 0.8 in this pipeline.

down_block_res_samples = [sample * conditioning_scale for sample in down_block_res_samples]
mid_block_res_sample = mid_block_res_sample * conditioning_scale
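
On the caller's side this scale is exposed as the pipeline's controlnet_conditioning_scale argument, so the control strength can be tuned per call; for example, reusing the setup from the script above:

image = pipe(
    "a white squirrel",
    num_inference_steps=10,
    generator=generator,
    image=image,
    control_image=canny_image,
    controlnet_conditioning_scale=0.5,  # weaken the canny guidance (default 0.8)
).images[0]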

4. Adding the residuals

These scaled residuals are then added element-wise to the corresponding outputs of the UNet's own down blocks and middle block (the left side of the figure above), and the result finally enters the UNet decoder to predict the noise.

# down block
if is_controlnet:
    new_down_block_res_samples = ()
    for down_block_res_sample, down_block_additional_residual in zip(
        down_block_res_samples, down_block_additional_residuals
    ):
        down_block_res_sample = down_block_res_sample + down_block_additional_residual
        new_down_block_res_samples = new_down_block_res_samples + (down_block_res_sample,)
    down_block_res_samples = new_down_block_res_samples

# middle block
if is_controlnet:
    sample = sample + mid_block_additional_residual
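
Putting sections 2-4 together, each denoising step of the pipeline roughly looks like the following (a simplified sketch of what StableDiffusionControlNetImg2ImgPipeline does internally; variable names such as control_image_tensor are illustrative):

# ControlNet forward pass: returns the scaled residuals of its
# down blocks and middle block (sections 2 and 3 above)
down_block_res_samples, mid_block_res_sample = pipe.controlnet(
    latent_model_input,
    t,
    encoder_hidden_states=prompt_embeds,
    controlnet_cond=control_image_tensor,  # preprocessed canny image (illustrative name)
    conditioning_scale=0.8,
    return_dict=False,
)

# UNet forward pass: the residuals are added to the UNet's own down/middle
# block outputs (section 4 above) before the decoder predicts the noise
noise_pred = pipe.unet(
    latent_model_input,
    t,
    encoder_hidden_states=prompt_embeds,
    down_block_additional_residuals=down_block_res_samples,
    mid_block_additional_residual=mid_block_res_sample,
    return_dict=False,
)[0]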