Introduction to Stable Diffusion XL

冰冰冰泠泠泠

Last modified: 2023-11-19 21:38:25

Category: Generative Models. Tags: Artificial Intelligence

First published: 2023-10-29 18:18:54

Article link: https://blog.csdn.net/icylling/article/details/132863316

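This article introduces the upgrades in Stable Diffusion XL, including an image-to-image Refiner model that improves visual fidelity and the use of two text encoders. It shows how to combine the base and refiner models to generate higher-quality images, and how the denoising steps are split between the two models.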

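Stable Diffusion XL (SDXL) is a text-to-image model and the successor to the original Stable Diffusion. Compared with earlier Stable Diffusion models, SDXL differs in three main ways:

It adds a refiner model (the Refiner in the figure below) that improves visual fidelity through an image-to-image pass.
It uses two text encoders, OpenCLIP ViT-bigG and CLIP ViT-L.
It takes the image size and aspect ratio as additional input conditions.

The figure below shows how the SDXL architecture differs from earlier SD versions: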

Code example

Load the base and refiner models, then generate an image:

from diffusers import DiffusionPipeline
import torch

# Load the base model from a local copy of the weights (fp16 to save GPU memory).
base = DiffusionPipeline.from_pretrained(
    r"D:\hg_models\stabilityai\stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
).to("cuda")
# Load the refiner, reusing the base model's second text encoder and VAE.
refiner = DiffusionPipeline.from_pretrained(
    r"D:\hg_models\stabilityai\stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base.text_encoder_2,
    vae=base.vae,
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
).to("cuda")

n_steps = 40           # total number of sampling steps
high_noise_frac = 0.8  # fraction of the steps run by the base model
prompt = "A girl with purple hair, a yellow headband, and red eyes"
generator = torch.Generator(device='cuda').manual_seed(100)
# Base model: handle the high-noise part and return latents instead of a decoded image.
image = base(
    prompt=prompt,
    generator=generator,
    num_inference_steps=n_steps,
    denoising_end=high_noise_frac,
    output_type="latent",
).images
# Refiner: continue from the base model's latents over the remaining low-noise steps.
image = refiner(
    prompt=prompt,
    generator=generator,
    num_inference_steps=n_steps,
    denoising_start=high_noise_frac,
    image=image,
).images[0]

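n_steps sets the total number of sampling steps, and high_noise_frac sets the fraction of those steps run by the base model. The SDXL base model is trained on timesteps 0-999, while the SDXL refiner is fine-tuned from the base model on the low-noise timesteps 0-199. We therefore use the base model for the first 800 timesteps (high noise) and the refiner for the last 200 timesteps (low noise). Setting high_noise_frac to 0.8 means that timesteps 200-999 (the first 80% of the denoising process) are handled by the base model and timesteps 0-199 (the last 20%) by the refiner.

Because only 40 steps are actually sampled, the base model runs 32 of them and the refiner runs the remaining 8.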
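
The 32/8 split can be checked with a few lines of plain Python. This is only a sketch: it assumes 1000 training timesteps and evenly spaced sampling timesteps, and the helper name split_steps is made up for illustration. The exact timestep values depend on the scheduler in use.

```python
def split_steps(n_steps, high_noise_frac, num_train_timesteps=1000):
    """Return (cutoff timestep, base step count, refiner step count).

    Assumes evenly spaced sampling timesteps; real schedulers may differ.
    """
    # Timesteps at or above the cutoff are "high noise" and go to the base model.
    cutoff = round(num_train_timesteps - high_noise_frac * num_train_timesteps)
    stride = num_train_timesteps // n_steps
    # Sampling runs from high noise (large t) down to low noise (small t).
    timesteps = [i * stride for i in range(n_steps)][::-1]
    base_steps = [t for t in timesteps if t >= cutoff]
    refiner_steps = [t for t in timesteps if t < cutoff]
    return cutoff, len(base_steps), len(refiner_steps)


print(split_steps(40, 0.8))  # (200, 32, 8)
```

With n_steps = 40 and high_noise_frac = 0.8, the cutoff lands at timestep 200, which matches the boundary between the base model's range and the refiner's 0-199 range.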

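The base model can also generate images on its own. If the base model alone runs all 40 steps, however, the resulting image (shown below) is noticeably lower in quality.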

n_steps = 40
prompt = "A girl with purple hair, a yellow headband, and red eyes"
generator = torch.Generator(device='cuda').manual_seed(100)
# Run all 40 steps on the base model: denoising_end and output_type="latent"
# are dropped, so the pipeline returns a decoded image and the refiner is not called.
image = base(
    prompt=prompt,
    generator=generator,
    num_inference_steps=n_steps,
).images[0]
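
The two variants above differ only in whether the refiner is called, so they can be folded into one helper for side-by-side comparison. This is a minimal sketch, assuming base and refiner are the pipelines loaded earlier. The function name generate and its use_refiner flag are hypothetical and not part of diffusers.

```python
def generate(base, refiner, prompt, n_steps=40, high_noise_frac=0.8,
             generator=None, use_refiner=True):
    """Run the base model alone, or the base model followed by the refiner."""
    if not use_refiner:
        # Base model runs every step and returns a decoded image.
        return base(
            prompt=prompt,
            generator=generator,
            num_inference_steps=n_steps,
        ).images[0]
    # Base model stops at the high-noise fraction and hands over latents.
    latents = base(
        prompt=prompt,
        generator=generator,
        num_inference_steps=n_steps,
        denoising_end=high_noise_frac,
        output_type="latent",
    ).images
    # Refiner picks up at the same fraction and finishes the low-noise steps.
    return refiner(
        prompt=prompt,
        generator=generator,
        num_inference_steps=n_steps,
        denoising_start=high_noise_frac,
        image=latents,
    ).images[0]


# Usage, with the pipelines and prompt from the example above:
#   refined = generate(base, refiner, prompt, generator=generator)
#   base_only = generate(base, refiner, prompt, generator=generator, use_refiner=False)
```

Re-create the generator with the same seed before each call if the two results should start from the same noise.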

If original_size is set to a small value such as (128, 128), the model generates a blurry image that looks like a (128, 128) picture that has been upscaled.

n_steps = 40
prompt = "A girl with purple hair, a yellow headband, and red eyes"
image = base(
    prompt=prompt,
    generator=torch.Generator(device='cuda').manual_seed(100),
    num_inference_steps=n_steps,
    original_size=(128, 128),
).images[0]

If crops_coords_top_left is set to (0, 512), the subject of the generated image is shifted to the left, as if the image had been cropped from a larger one.

prompt = "A girl with purple hair, a yellow headband, and red eyes"
image = base(
    prompt=prompt,
    generator=torch.Generator(device='cuda').manual_seed(100),
    num_inference_steps=40,
    crops_coords_top_left=(0, 512),
).images[0]
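
Both original_size and crops_coords_top_left belong to SDXL's size conditioning. As far as I understand the diffusers implementation, the base pipeline concatenates original_size, crops_coords_top_left and target_size into six integers that are passed to the UNet as extra conditioning. The sketch below mimics only that bookkeeping step in plain Python. The helper name build_size_conditioning and the 1024 by 1024 default resolution are assumptions for illustration.

```python
def build_size_conditioning(height=1024, width=1024, original_size=None,
                            crops_coords_top_left=(0, 0), target_size=None):
    """Return the six size-conditioning integers for one image (illustrative only)."""
    # When not given, both sizes fall back to the requested output resolution.
    original_size = original_size or (height, width)
    target_size = target_size or (height, width)
    # Order: original (h, w), crop (top, left), target (h, w).
    return list(original_size + crops_coords_top_left + target_size)


print(build_size_conditioning(original_size=(128, 128)))
# [128, 128, 0, 0, 1024, 1024]
print(build_size_conditioning(crops_coords_top_left=(0, 512)))
# [1024, 1024, 0, 512, 1024, 1024]
```

In the first case the model is told the source picture was only (128, 128), which is consistent with the blurry, upscaled look. In the second it is told that 512 pixels were cropped away on the left, which is consistent with the off-center subject.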