Textual Inversion on diffusers

Adenialzz

Tags: stable diffusion

Article link: https://blog.csdn.net/weixin_44966641/article/details/134901702

Based on the official documentation: https://huggingface.co/docs/diffusers/training/textual_inversion_inference and https://huggingface.co/docs/diffusers/training/text_inversion?installation=PyTorch

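Textual inversion lets Stable Diffusion learn a new visual concept from just a few images. It binds a new text token to a dedicated embedding, so that including that token in a prompt at generation time produces the associated concept. The StableDiffusionPipeline in diffusers supports textual inversion. The Stable Diffusion Conceptualizer hosts textual inversion concepts contributed by the community, and you can try some of them.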
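As a rough mental model (a toy sketch, not the diffusers implementation), textual inversion amounts to adding one entry to the tokenizer vocabulary and one row to the text encoder's embedding table. The vocabulary, the 3-dimensional vectors, and the helper names `add_concept` and `embed` below are all made up for illustration.

```python
# Toy model of textual inversion: a vocabulary plus an embedding table.
# All names and numbers are illustrative assumptions, not diffusers internals.

vocab = {"a": 0, "photo": 1, "of": 2, "toy": 3}
embeddings = [
    [0.1, 0.0, 0.2],  # "a"
    [0.5, 0.3, 0.1],  # "photo"
    [0.0, 0.2, 0.0],  # "of"
    [0.4, 0.4, 0.9],  # "toy"
]


def add_concept(token, vector):
    """Register a new token and append its learned embedding as a new row."""
    if token in vocab:
        raise ValueError(f"{token!r} already exists in the vocabulary")
    vocab[token] = len(embeddings)
    embeddings.append(list(vector))


def embed(prompt):
    """Map a whitespace-tokenized prompt to its list of embedding vectors."""
    return [embeddings[vocab[word]] for word in prompt.split()]


# A "learned" vector for the new concept (hypothetical values).
add_concept("<cat-toy>", [0.7, 0.1, 0.8])
prompt_vectors = embed("a photo of <cat-toy>")
```

The existing rows are untouched; the prompt simply picks up the extra row whenever the new token appears.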

This article first shows how to load an existing textual inversion concept and run inference with it, then how to train a textual inversion for a new concept of your own.

Textual Inversion Inference

Because SDXL contains two text encoders, textual inversion is used differently with it than with SD1/2, so the two cases are covered separately.

SD1/2

First, import the libraries:

import torch
from diffusers import StableDiffusionPipeline
from diffusers.utils import make_image_grid

Choose a pretrained SD base model and a textual inversion concept; here we use SD1.5 and cat-toy. Then load the base pipeline and the textual inversion:

pretrained_model_name_or_path = "runwayml/stable-diffusion-v1-5"
repo_id_embeds = "sd-concepts-library/cat-toy"

pipeline = StableDiffusionPipeline.from_pretrained(
    pretrained_model_name_or_path, torch_dtype=torch.float16, use_safetensors=True
).to("cuda")

pipeline.load_textual_inversion(repo_id_embeds)

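Next, write a prompt. The prompt is the key to using textual inversion: the loaded cat-toy concept has to appear in it as the special token <cat-toy>. Textual inversion concepts generally come in two types, object and style, representing a visual entity and a rendering style respectively. Either type is used with ordinary prompt syntax, for example a <xxx-object> in a <xxx-style>. Our cat-toy is an object concept. You can browse more object and style concepts published by the community and try generating a few yourself.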

prompt = "a grafitti in a favela wall with a <cat-toy> on it"

num_samples_per_row = 2
num_rows = 2
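A placeholder that was never loaded is, as far as I can tell, tokenized like ordinary text, so a typo in the token can silently produce images without the concept. The sketch below checks a prompt against the set of tokens you have loaded; `find_placeholders` and `check_prompt` are hypothetical helpers, not part of diffusers.

```python
import re

# Matches placeholder tokens written as <name>, e.g. <cat-toy>.
PLACEHOLDER_PATTERN = re.compile(r"<[^<>\s]+>")


def find_placeholders(prompt):
    """Return every <...> placeholder token that appears in the prompt."""
    return PLACEHOLDER_PATTERN.findall(prompt)


def check_prompt(prompt, loaded_tokens):
    """Raise if the prompt uses a placeholder that has not been loaded."""
    missing = [t for t in find_placeholders(prompt) if t not in loaded_tokens]
    if missing:
        raise ValueError(f"placeholders not loaded: {missing}")
    return prompt


# Tokens you registered via load_textual_inversion (maintained by hand here).
loaded = {"<cat-toy>"}
prompt = check_prompt("a grafitti in a favela wall with a <cat-toy> on it", loaded)
```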

Then run the pipeline and look at the results:

all_images = []
for _ in range(num_rows):
    images = pipeline(prompt, num_images_per_prompt=num_samples_per_row, num_inference_steps=50, guidance_scale=7.5).images
    all_images.extend(images)

grid = make_image_grid(all_images, num_rows, num_samples_per_row)
grid
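make_image_grid expects exactly rows times cols images, which is why the loop above collects num_rows batches of num_samples_per_row images each. The layout it produces can be sketched as follows; `grid_boxes` is a hypothetical helper that only computes paste positions, assuming all images share one size.

```python
def grid_boxes(num_images, rows, cols, width, height):
    """Top-left paste position of each image in a rows x cols grid, row-major."""
    if num_images != rows * cols:
        raise ValueError("number of images must equal rows * cols")
    return [((i % cols) * width, (i // cols) * height) for i in range(num_images)]


# 2 x 2 grid of 512 x 512 images, matching the example above.
boxes = grid_boxes(num_images=4, rows=2, cols=2, width=512, height=512)
# → [(0, 0), (512, 0), (0, 512), (512, 512)]
```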

SDXL

SDXL can use textual inversion too. Unlike SD1/2, however, SDXL has two text encoders, so it needs two textual inversion embeddings, one per text encoder.

Let's first download an SDXL textual inversion file and see what it actually contains:

from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

file = hf_hub_download("dn118/unaestheticXL", filename="unaestheticXLv31.safetensors")
state_dict = load_file(file)
state_dict

# Output
{'clip_g': tensor([[ 0.0077, -0.0112,  0.0065,  ...,  0.0195,  0.0159,  0.0275],
         ...,
         [-0.0170,  0.0213,  0.0143,  ..., -0.0302, -0.0240, -0.0362]],
 'clip_l': tensor([[ 0.0023,  0.0192,  0.0213,  ..., -0.0385,  0.0048, -0.0011],
         ...,
         [ 0.0475, -0.0508, -0.0145,  ...,  0.0070, -0.0089, -0.0163]],

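The file contains two embeddings, clip_g and clip_l, which correspond to SDXL's larger text encoder pipe.text_encoder_2 and its smaller one pipe.text_encoder respectively. Next, pass each embedding to its matching tokenizer and text_encoder, then generate an image to see the effect: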

from diffusers import AutoPipelineForText2Image
import torch

pipe = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", variant="fp16", torch_dtype=torch.float16)
pipe.to("cuda")

pipe.load_textual_inversion(state_dict["clip_g"], token="unaestheticXLv31", text_encoder=pipe.text_encoder_2, tokenizer=pipe.tokenizer_2)
pipe.load_textual_inversion(state_dict["clip_l"], token="unaestheticXLv31", text_encoder=pipe.text_encoder, tokenizer=pipe.tokenizer)

# This embedding is meant to be used as a negative embedding
generator = torch.Generator().manual_seed(33)
image = pipe("a woman standing in front of a mountain", negative_prompt="unaestheticXLv31", generator=generator).images[0]
image

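This embedding was trained on low-quality, unattractive images. Using it as a negative embedding steers generation away from those traits, so the resulting image should look better.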
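The pairing of file keys with encoders used above can be written down as a small lookup table, which makes it harder to swap the two by accident. This is only a sketch: `SDXL_ROUTING` and `plan_sdxl_loads` are hypothetical names, and placeholder objects stand in for the real tensors so that nothing has to be downloaded.

```python
# Which SDXL pipeline attributes each key of the embedding file belongs to.
# The key names come from the file inspected above; the helper is hypothetical.
SDXL_ROUTING = {
    "clip_l": ("text_encoder", "tokenizer"),
    "clip_g": ("text_encoder_2", "tokenizer_2"),
}


def plan_sdxl_loads(state_dict, token):
    """Return (key, text_encoder attr, tokenizer attr, token) for each embedding."""
    unknown = set(state_dict) - set(SDXL_ROUTING)
    if unknown:
        raise KeyError(f"unexpected keys in embedding file: {sorted(unknown)}")
    missing = set(SDXL_ROUTING) - set(state_dict)
    if missing:
        raise KeyError(f"embedding file lacks keys: {sorted(missing)}")
    return [(key, *SDXL_ROUTING[key], token) for key in SDXL_ROUTING]


# Stand-ins for the real tensors so the sketch runs without any download.
fake_state_dict = {"clip_g": object(), "clip_l": object()}
plan = plan_sdxl_loads(fake_state_dict, token="unaestheticXLv31")

# With a real pipeline `pipe` and the real `state_dict`, each entry maps to:
# pipe.load_textual_inversion(
#     state_dict[key], token=token,
#     text_encoder=getattr(pipe, encoder_attr),
#     tokenizer=getattr(pipe, tokenizer_attr),
# )
```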

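Textual Inversion Training

Textual inversion is a training method that teaches a model a new visual concept from a few to a few dozen images. Specifically, it learns and updates a text embedding that is bound to a special token. After training, including that special token in the prompt at inference time is enough to make the model generate the learned concept.

Preparation

The diffusers library provides an example training script for textual inversion, textual_inversion.py. This article gives a brief tour of it, focusing on the parts you need to change when training your own textual inversion.

To train textual inversion with diffusers, first install the diffusers library from source. Then change into the example script directory and install its dependencies: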

git clone https://github.com/huggingface/diffusers
cd diffusers
pip install .

cd examples/textual_inversion
pip install -r requirements.txt

We use the Accelerate package to handle distributed training, which needs some configuration first.

Configure it from the shell:

# Configure the Accelerate environment interactively
accelerate config

# Or write a default Accelerate configuration
accelerate config default

Or configure it from Python:

from accelerate.utils import write_basic_config

write_basic_config()

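Finally, prepare your own dataset in the layout the training script expects. For textual inversion, a folder containing a few images of the concept is enough. For details, see "Create a dataset for training".

Next, we focus on the parts of the training script that you need to change when training your own textual inversion. This does not cover every detail of the script; read through the rest of it if you are interested.

Script Parameters

The example training script exposes many launch parameters through argparse so that training can be tailored to your needs. For convenience, diffusers provides default values for common parameters such as batch size and learning rate, and you can override them when needed. For example, to increase the number of gradient accumulation steps: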

accelerate launch textual_inversion.py --gradient_accumulation_steps=4

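Some other important parameters are:

--pretrained_model_name_or_path: name or local path of the base model
--train_data_dir: folder of training data (a few to a few dozen example images)
--placeholder_token: the special token the embeddings are bound to, i.e. the token you use in prompts at inference time after training. It must not be a token that already exists in the base model's tokenizer vocabulary
--initializer_token: a word close to the new concept, used to initialize the embedding and help learning
--num_vectors: number of vectors used to learn the embeddings; a larger value helps the model learn the concept better but increases training cost
--learnable_property: whether the textual inversion learns a style (for example Van Gogh's painting style) or an object (for example a dog)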
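To see how these flags fit together, here is a minimal argparse sketch that mirrors only the parameters listed above. It is not the parser from textual_inversion.py: `build_parser` is a hypothetical helper, and the defaults and `required` settings are assumptions chosen for the illustration.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring only the flags listed above."""
    parser = argparse.ArgumentParser(description="textual inversion flags (sketch)")
    parser.add_argument("--pretrained_model_name_or_path", required=True)
    parser.add_argument("--train_data_dir", required=True)
    parser.add_argument("--placeholder_token", required=True)
    parser.add_argument("--initializer_token", required=True)
    # Default of 1 is an assumption made for this sketch.
    parser.add_argument("--num_vectors", type=int, default=1)
    parser.add_argument(
        "--learnable_property", choices=["object", "style"], default="object"
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is runnable.
args = build_parser().parse_args([
    "--pretrained_model_name_or_path", "runwayml/stable-diffusion-v1-5",
    "--train_data_dir", "./cat",
    "--placeholder_token", "<cat-toy>",
    "--initializer_token", "toy",
])
```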

Training Script

The script first loads the tokenizer, scheduler and models:

# Load tokenizer
if args.tokenizer_name:
    tokenizer = CLIPTokenizer.from_pretrained(args.tokenizer_name)
elif args.pretrained_model_name_or_path:
    tokenizer = CLIPTokenizer.from_pretrained(args.pretrained_model_name_or_path, subfolder="tokenizer")

# Load scheduler and models
noise_scheduler = DDPMScheduler.from_pretrained(args.pretrained_model_name_or_path, subfolder="scheduler")
text_encoder = CLIPTextModel.from_pretrained(
    args.pretrained_model_name_or_path, subfolder="text_encoder", revision=args.revision
)
vae = AutoencoderKL.from_pretrained(args.pretrained_model_name_or_path, subfolder="vae", revision=args.revision)
unet = UNet2DConditionModel.from_pretrained(
    args.pretrained_model_name_or_path, subfolder="unet", revision=args.revision
)

Next, a new placeholder_token has to be added, so the original tokenizer's vocabulary is extended:

# ...
tokenizer.add_tokens(placeholder_tokens)
# ...
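The elided lines around add_tokens deal with --num_vectors and --initializer_token. My reading of the example script is that extra vectors get suffixed token names and that every new row starts as a copy of the initializer token's embedding; treat both details as assumptions and check them against your version of textual_inversion.py. A toy sketch with a dict vocabulary and list embeddings (all helper names are hypothetical):

```python
def placeholder_token_names(placeholder_token, num_vectors):
    """One token name per learned vector: <tok>, <tok>_1, <tok>_2, ..."""
    if num_vectors < 1:
        raise ValueError("num_vectors must be at least 1")
    extra = [f"{placeholder_token}_{i}" for i in range(1, num_vectors)]
    return [placeholder_token] + extra


def add_placeholders(vocab, embeddings, tokens, initializer_token):
    """Append each new token to the toy vocab, starting from the initializer's vector."""
    clashes = [t for t in tokens if t in vocab]
    if clashes:
        raise ValueError(f"tokens already in the vocabulary: {clashes}")
    init_vector = embeddings[vocab[initializer_token]]
    ids = []
    for token in tokens:
        vocab[token] = len(embeddings)
        embeddings.append(list(init_vector))  # copy, so rows train independently
        ids.append(vocab[token])
    return ids


toy_vocab = {"a": 0, "toy": 1}
toy_embeddings = [[0.1, 0.2], [0.3, 0.4]]
names = placeholder_token_names("<cat-toy>", num_vectors=3)
new_ids = add_placeholders(toy_vocab, toy_embeddings, names, initializer_token="toy")
```

The clash check corresponds to the rule that the placeholder token must not already exist in the tokenizer's vocabulary.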

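Unlike other training scripts, textual_inversion.py has its own custom dataset class, TextualInversionDataset. You can adjust the image size, the placeholder token, the interpolation method and so on. If needed, you can also modify the implementation of this dataset class directly.

The dataset is then initialized: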

train_dataset = TextualInversionDataset(
    data_root=args.train_data_dir,
    tokenizer=tokenizer,
    size=args.resolution,
    placeholder_token=(" ".join(tokenizer.convert_ids_to_tokens(placeholder_token_ids))),
    repeats=args.repeats,
    learnable_property=args.learnable_property,
    center_crop=args.center_crop,
    set="train",
)
train_dataloader = torch.utils.data.DataLoader(
    train_dataset, batch_size=args.train_batch_size, shuffle=True, num_workers=args.dataloader_num_workers
)

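The training loop then starts: it predicts the noise residual and keeps updating the embedding weights of the newly added special token. To learn more about what happens inside the training loop, see "Deconstructing the diffusers pipeline: understanding pipelines, models and schedulers".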
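As I understand the example script, the optimizer steps over the text encoder's whole token embedding matrix, and the rows of all pre-existing tokens are then copied back from a saved snapshot, so only the placeholder rows actually change. The toy sketch below imitates that pattern with plain lists and made-up gradients; `train_step` is a hypothetical helper, and the frozen UNet and VAE are not modelled.

```python
def train_step(embeddings, placeholder_ids, grads, lr):
    """Apply one SGD step to the whole table, then restore every frozen row."""
    original = [row[:] for row in embeddings]  # snapshot before the update
    for row, grad in zip(embeddings, grads):
        for j, g in enumerate(grad):
            row[j] -= lr * g
    for i, saved in enumerate(original):
        if i not in placeholder_ids:
            embeddings[i] = saved  # undo the update for pre-existing tokens
    return embeddings


table = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]  # row 2 is the placeholder token
grads = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]  # made-up gradients
train_step(table, placeholder_ids={2}, grads=grads, lr=0.1)
# Rows 0 and 1 are unchanged; row 2 moved by -lr * grad.
```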

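Launching the Training Script

With the training script adjusted, we are ready to launch it and start training.

In this example we first download a few images of a cat toy. If you have already prepared your own dataset, use the images from it instead.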

from huggingface_hub import snapshot_download

local_dir = "./cat"
snapshot_download(
    "diffusers/cat_toy_example", local_dir=local_dir, repo_type="dataset", ignore_patterns=".gitattributes"
)

Then set a few variables and launch the script:

export MODEL_NAME="runwayml/stable-diffusion-v1-5"
export DATA_DIR="./cat"

accelerate launch textual_inversion.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir=$DATA_DIR \
  --learnable_property="object" \
  --placeholder_token="<cat-toy>" \
  --initializer_token="toy" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --max_train_steps=3000 \
  --learning_rate=5.0e-04 \
  --scale_lr \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --output_dir="textual_inversion_cat" \
  --push_to_hub
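A note on --scale_lr: in the versions of the script I have looked at, it multiplies the base learning rate by gradient accumulation steps, per-device batch size, and the number of processes. The arithmetic for the command above works out as follows, assuming a single process; `scaled_learning_rate` is a hypothetical helper written for this illustration.

```python
def scaled_learning_rate(base_lr, gradient_accumulation_steps, train_batch_size, num_processes):
    """Learning rate after --scale_lr: base LR times the effective batch size."""
    return base_lr * gradient_accumulation_steps * train_batch_size * num_processes


# Values from the command above; num_processes=1 assumes a single GPU.
effective_batch_size = 1 * 4 * 1  # train_batch_size * accumulation * processes
lr = scaled_learning_rate(
    5.0e-04, gradient_accumulation_steps=4, train_batch_size=1, num_processes=1
)
# lr is approximately 2e-3
```

If you train on several GPUs, the effective learning rate grows accordingly, so it may be worth lowering --learning_rate or dropping --scale_lr.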

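When training finishes, the following files are produced:

learned_embeds.bin: the embedding learned from the example images in the dataset
token_identifier.txt: stores the placeholder_token
type_of_concept.txt: records the type of concept that was trained (object or style)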
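A quick way to sanity-check a finished run is to look for these files and read the two small text files. The sketch below assumes exactly the file names listed above (your version of the script may write different ones) and demonstrates itself on a throwaway directory; `summarize_output` is a hypothetical helper.

```python
import tempfile
from pathlib import Path

# File names as listed above; adjust if your run produces different ones.
EXPECTED_FILES = ("learned_embeds.bin", "token_identifier.txt", "type_of_concept.txt")


def summarize_output(output_dir):
    """Report which expected files are missing and read the two text files."""
    root = Path(output_dir)
    summary = {
        "missing": [name for name in EXPECTED_FILES if not (root / name).exists()]
    }
    text_files = (("token", "token_identifier.txt"), ("concept_type", "type_of_concept.txt"))
    for key, name in text_files:
        path = root / name
        summary[key] = path.read_text().strip() if path.exists() else None
    return summary


# Demo on a temporary directory that imitates a finished run.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "learned_embeds.bin").write_bytes(b"")
    Path(tmp, "token_identifier.txt").write_text("<cat-toy>\n")
    Path(tmp, "type_of_concept.txt").write_text("object\n")
    summary = summarize_output(tmp)
```

For a real run you would call summarize_output("textual_inversion_cat"), the --output_dir used in the command above.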

To view sample images while training is in progress, add the following parameters to the launch command:

--validation_prompt="A <cat-toy> train"
--num_validation_images=4
--validation_steps=100

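Inference with the Trained Textual Inversion

This step was already covered in detail in the first section: load the trained textual inversion and generate a test image. The snippet below loads the published sd-concepts-library/cat-toy concept; to test your own run, pass your training output directory or Hub repository instead.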

from diffusers import StableDiffusionPipeline
import torch

pipeline = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
pipeline.load_textual_inversion("sd-concepts-library/cat-toy")
image = pipeline("A <cat-toy> train", num_inference_steps=50).images[0]
image.save("cat-train.png")