Training Your Own Stable Diffusion 3 Model with diffusers

Article link: https://blog.csdn.net/weixin_44072525/article/details/141195845

Stable Diffusion Training Code Based on diffusers

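This post introduces a set of training scripts, built on the diffusers library, for Stable Diffusion models. It covers LoRA, ControlNet, IP-Adapter, and AnimateDiff, plus LoRA training for the recent Stable Diffusion 3.

Existing trainers such as kohya-ss are convenient to use, but their source code is heavily wrapped and verbose, which makes it hard for a beginner like me to read. I therefore rewrote the training code on top of diffusers and removed many redundant parts. If you want to understand how training works at the code level, a star on the repository would be appreciated.

GitHub: https://github.com/SongwuJob/simple-SD-trainer
The code is mainly adapted from diffusers and also draws on several open-source projects. I am a beginner, so please forgive any mistakes.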

Image Caption

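Image captions are an essential part of training text-to-image models and are needed for LoRA, ControlNet, and others. Common captioning methods fall into two broad categories:

SDWebUI Tagger: a tagger used inside the WebUI interface. Under the hood it is a multi-label classification model that outputs tags.
VLM: a vision-language model (VLM) understands the dense semantics of an image better and can produce detailed tags. This is the method we recommend.

In our experiments we use GLM-4v-9b to caption the training images. Specifically, we prompt the VLM with query = "please describe this image into prompt words, and reply us with keywords like xxx, xxx, xxx, xxx" to produce the caption. For example, GLM-4v can caption a single image as follows: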

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"

tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4v-9b", trust_remote_code=True)

query = "please describe this image into prompt words, and reply us with keywords like xxx, xxx, xxx, xxx"
image = Image.open("your image").convert('RGB')
inputs = tokenizer.apply_chat_template([{"role": "user", "image": image, "content": query}],
                                       add_generation_prompt=True, tokenize=True, return_tensors="pt",
                                       return_dict=True)  # chat mode

inputs = inputs.to(device)
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/glm-4v-9b",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).to(device).eval()

# With top_k=1, sampling always picks the most likely token, so decoding is effectively greedy.
gen_kwargs = {"max_new_tokens": 77, "do_sample": True, "top_k": 1}
with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    caption = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(caption)
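
To caption a whole folder rather than a single image, the call above can be wrapped in a loop that produces the data.json file used by the training scripts later in this post. The following is only a sketch: build_records, save_records, and the caption_fn argument are hypothetical names that do not come from the repository, and caption_fn stands in for whatever function wraps the GLM-4v call and returns a caption for a given file name.

```python
import json
from typing import Callable, Dict, Iterable, List

IMAGE_SUFFIXES = (".jpg", ".jpeg", ".png", ".webp")


def build_records(image_names: Iterable[str],
                  caption_fn: Callable[[str], str]) -> List[Dict[str, str]]:
    """Pair each image file name with the caption returned by caption_fn."""
    records = []
    for name in sorted(image_names):
        if not name.lower().endswith(IMAGE_SUFFIXES):
            continue  # skip data.json and any other non-image file
        records.append({"image": name, "text": caption_fn(name).strip()})
    return records


def save_records(records: List[Dict[str, str]], json_path: str) -> None:
    """Write the records as a JSON list of {"image", "text"} objects."""
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=4)
```

A typical call would pass os.listdir(train_dir) as image_names and then save the result to train_dir/data.json. Keeping the captioning function as a parameter means the same loop works with a tagger instead of a VLM.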

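LoRA Training

Stable Diffusion XL (SDXL) is an advanced latent diffusion model (LDM) designed to generate high-quality images from text descriptions. Building on SD1.5 (2.1), SDXL offers stronger capabilities and better performance, which makes it a powerful tool for many generative AI applications.

Our LoRA training script, train_text_to_image_lora_sdxl.py, is adapted from diffusers and kohya-ss.

We rewrote the dataset class as BaseDataset.py so that the data loading logic is easy to read.
To simplify training, we removed some of the arguments in the diffusers script and adjusted a few settings.

To train your own LoRA model with diffusers, first caption the training data. You can use the SDWebUI tagger or a vision-language model. The resulting data.json looks like this: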

[
    {
        "image": "1.jpg",
        "text": "white hair, anime style, pink background, long hair, jacket, black and red top, earrings, rosy cheeks, large eyes, youthful, fashion, illustration, manga, character design, vibrant colors, hairstyle, clothing, accessories, earring design, artistic, contemporary, youthful fashion, graphic novel, digital drawing, pop art influence, soft shading, detailed rendering, feminine aesthetic"
    },
    {
        "image": "2.jpg",
        "text": "cute, anime-style, girl, long, wavy, hair, green, plaid, blazer, blush, big, expressive, eyes, hoop, earrings, soft, pastel, colors, youthful, innocent, charming, fashionable"
    }
]
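
Before launching a long training run it can help to check that data.json is well formed. The sketch below is one possible check, not part of the repository: validate_records is a hypothetical helper that works on the already parsed JSON list and reports entries with a missing image name, a duplicate image name, or an empty caption.

```python
from typing import Dict, List


def validate_records(records: List[Dict[str, str]]) -> List[str]:
    """Return a list of problems found; an empty list means the file looks fine."""
    problems = []
    seen = set()
    for i, rec in enumerate(records):
        image = rec.get("image")
        text = rec.get("text")
        if not isinstance(image, str) or not image:
            problems.append(f"entry {i}: missing 'image'")
        elif image in seen:
            problems.append(f"entry {i}: duplicate image '{image}'")
        else:
            seen.add(image)
        if not isinstance(text, str) or not text.strip():
            problems.append(f"entry {i}: missing or empty 'text'")
    return problems
```

The records can be loaded with json.load on the data.json file and passed in directly. Whether duplicate image names are actually a problem depends on the dataset, so that check is a design choice of this sketch.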

Once the images are captioned, run sh train_text_to_image_lora_sdxl.sh to train your LoRA model. See GitHub for the full code:

export MODEL_NAME="/path/to/your/model"
export OUTPUT_DIR="lora/rank32"
export TRAIN_DIR="/path/to/your/data"
export JSON_FILE="/path/to/your/data/data.json"

accelerate launch  ./stable_diffusion/train_text_to_image_lora_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir=$TRAIN_DIR \
  --output_dir=$OUTPUT_DIR \
  --json_file=$JSON_FILE \
  --height=1024 --width=1024  \
  --train_batch_size=2 \
  --random_flip \
  --rank=32 --text_encoder_rank=8 \
  --gradient_accumulation_steps=2 \
  --num_train_epochs=30 --repeats=5 \
  --checkpointing_steps=1000 \
  --learning_rate=1e-4 \
  --text_encoder_lr=1e-5 \
  --lr_scheduler="constant_with_warmup" \
  --lr_warmup_steps=500 \
  --mixed_precision="fp16" \
  --train_text_encoder \
  --seed=1337
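
To judge whether --lr_warmup_steps=500 and --checkpointing_steps=1000 are sensible for a given dataset, it helps to estimate how many optimizer steps the run will take. The sketch below assumes that --repeats repeats every image that many times per epoch (the kohya-ss convention) and that steps per epoch are rounded up the way the diffusers example scripts do it. Both are assumptions about this script, and total_optimizer_steps is a hypothetical helper.

```python
import math


def total_optimizer_steps(num_images: int, repeats: int, num_epochs: int,
                          train_batch_size: int,
                          gradient_accumulation_steps: int,
                          num_processes: int = 1) -> int:
    """Estimate the number of optimizer updates for a whole training run."""
    samples_per_epoch = num_images * repeats
    # Each process sees train_batch_size samples per forward/backward pass.
    batches_per_epoch = math.ceil(samples_per_epoch / (train_batch_size * num_processes))
    # One optimizer update happens every gradient_accumulation_steps batches.
    updates_per_epoch = math.ceil(batches_per_epoch / gradient_accumulation_steps)
    return updates_per_epoch * num_epochs


# Hypothetical dataset of 40 images with the flags from the script above.
steps = total_optimizer_steps(num_images=40, repeats=5, num_epochs=30,
                              train_batch_size=2, gradient_accumulation_steps=2)
print(steps)  # 1500
```

With these example numbers the 500 warmup steps cover a third of the run and only one checkpoint is written, so for small datasets both values may need to be lowered.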

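Stable Diffusion 3 LoRA Training

The SD3 LoRA training script, train_text_to_image_lora_sd3.py, is adapted from train_dreambooth_lora_sd3.py in diffusers. SD3 training still has many open problems. This is only a simple diffusers-based training script, and it appears to work well with max_sequence_length=77.

Data preprocessing (image captioning) and the data.json format are the same as for SDXL.
We rewrote the dataset as SD3BaseDataset.py so that the data loading logic is easy to read.
To simplify training, we removed some diffusers arguments and adjusted a few settings.
The DiT architecture needs a fairly large rank to give good results. We suggest 64 to 128 (64 for smaller training sets, 128 for larger ones).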
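
To get a feel for what a larger rank costs, recall that LoRA adds two small matrices to each adapted linear layer: A with shape (rank, in_features) and B with shape (out_features, rank). The sketch below counts the extra trainable parameters for a single square projection. The hidden width of 1536 is an assumed value used purely for illustration, and lora_param_count is a hypothetical helper.

```python
def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """Extra trainable parameters LoRA adds to one linear layer (A plus B)."""
    return rank * (in_features + out_features)


HIDDEN = 1536  # assumed hidden width of one attention projection, illustration only
full = HIDDEN * HIDDEN  # parameters in the frozen weight matrix

for rank in (8, 32, 64, 128):
    extra = lora_param_count(HIDDEN, HIDDEN, rank)
    print(f"rank={rank:>3}: {extra:>7} LoRA params ({extra / full:.1%} of the full layer)")
```

The count grows linearly with the rank, so going from 64 to 128 doubles the adapter size and the optimizer state that goes with it.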

Once the training images are captioned, run sh train_text_to_image_lora_sd3.sh to train your LoRA model. See GitHub for the full code:

export MODEL_NAME="/path/to/your/stable-diffusion-3-medium-diffusers"
export OUTPUT_DIR="lora/rank64"
export TRAIN_DIR="/path/to/your/data"
export JSON_FILE="/path/to/your/data/data.json"

accelerate launch ./stable_diffusion/train_text_to_image_lora_sd3.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir=$TRAIN_DIR \
  --output_dir=$OUTPUT_DIR \
  --json_file=$JSON_FILE \
  --mixed_precision="fp16" \
  --height=1024 --width=1024  \
  --random_flip \
  --train_batch_size=2 \
  --checkpointing_steps=1000 \
  --gradient_accumulation_steps=2 \
  --learning_rate=1e-4 \
  --text_encoder_lr=5e-6 \
  --rank=64 --text_encoder_rank=8 \
  --lr_scheduler="constant_with_warmup" --lr_warmup_steps=500 \
  --num_train_epochs=30 \
  --scale_lr --train_text_encoder \
  --seed=1337
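
Note that the script above passes --scale_lr. In the diffusers example scripts this flag multiplies the base learning rate by the batch size, the gradient accumulation steps, and the number of processes. Assuming this script keeps that behavior (not verified here), the effective learning rate can be estimated as follows. scaled_learning_rate is a hypothetical helper, and the sketch covers only --learning_rate, not --text_encoder_lr.

```python
def scaled_learning_rate(base_lr: float, train_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_processes: int = 1) -> float:
    """Learning rate after --scale_lr, following the diffusers example convention."""
    return base_lr * train_batch_size * gradient_accumulation_steps * num_processes


# Flags from the script above on a single GPU.
lr = scaled_learning_rate(1e-4, train_batch_size=2, gradient_accumulation_steps=2)
print(f"{lr:.1e}")  # 4.0e-04
```

If the larger effective rate turns out to be unstable, dropping --scale_lr or lowering --learning_rate are the two obvious knobs.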

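ControlNet Training

Our ControlNet training script, train_controlnet_sdxl.py, is adapted from diffusers.

We rewrote the dataset as ControlNetDataset.py so that the data loading logic is easy to read.
We rewrote the data loading process and removed some diffusers arguments to simplify training.

To try ControlNet training, you can download a suitable dataset from Hugging Face, such as controlnet_sdxl_animal. The training data then needs the following simple preprocessing: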

controlnet_data
  ├──images/  (image files)
  │  ├──0.png
  │  ├──1.png
  │  ├──......
  ├──conditioning_images/  (conditioning image files)
  │  ├──0.png
  │  ├──1.png
  │  ├──......
  ├──data.json

data.json format

[
    {
        "text": "a person walking a dog on a leash",
        "image": "images/1.png",
        "conditioning_image": "conditioning_images/1.png"
    },
    {
        "text": "a woman walking her dog in the park",
        "image": "images/2.png",
        "conditioning_image": "conditioning_images/2.png"
    }
]
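
A data.json in this layout can be generated from the two folders, as long as each image and its conditioning image share a file name. The sketch below is an illustration under that assumption. pair_controlnet_records is a hypothetical helper, and captions is assumed to be a dict from file name to caption produced by the captioning step.

```python
import json
from typing import Dict, Iterable, List


def pair_controlnet_records(image_names: Iterable[str],
                            conditioning_names: Iterable[str],
                            captions: Dict[str, str]) -> List[Dict[str, str]]:
    """Build records for files present in both folders that also have a caption."""
    conditioning = set(conditioning_names)
    records = []
    for name in sorted(image_names):
        if name not in conditioning or name not in captions:
            continue  # incomplete triples are skipped rather than guessed
        records.append({
            "text": captions[name],
            "image": f"images/{name}",
            "conditioning_image": f"conditioning_images/{name}",
        })
    return records


def save_records(records: List[Dict[str, str]], json_path: str) -> None:
    """Write the records to json_path, e.g. controlnet_data/data.json."""
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=4)
```

The file names would usually come from os.listdir on controlnet_data/images and controlnet_data/conditioning_images.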

Once the training images are fully prepared, run sh train_controlnet_sdxl.sh to train the ControlNet model. See GitHub for the full code:

export MODEL_DIR="/path/to/your/model"
export OUTPUT_DIR="controlnet"
export TRAIN_DIR="controlnet_data"
export JSON_FILE="controlnet_data/data.json"

accelerate launch ./stable_diffusion/train_controlnet_sdxl.py \
 --pretrained_model_name_or_path=$MODEL_DIR \
 --train_data_dir=$TRAIN_DIR \
 --output_dir=$OUTPUT_DIR \
 --json_file=$JSON_FILE \
 --mixed_precision="fp16" \
 --width=1024 --height=1024 \
 --learning_rate=1e-5 \
 --checkpointing_steps=1000 \
 --num_train_epochs=5 \
 --lr_scheduler="constant_with_warmup" \
 --lr_warmup_steps=500 \
 --train_batch_size=1 --dataloader_num_workers=4 \
 --gradient_accumulation_steps=2 \
 --seed=1337

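IP-Adapter Training

IP-Adapter is a personalized text-to-image generation method that is training-free at inference time (no per-subject fine-tuning is required). It comes in several versions, such as IP-Adapter-Plus and IP-Adapter-FaceID. Here we reproduce the training code for IP-Adapter-Plus so that you can fine-tune it on a small dataset. For example, you can fine-tune IP-Adapter-Plus on an anime dataset for personalized anime image generation.

Our training script, train_ip_adapter_plus_sdxl.py, is adapted from IP-Adapter.

We rewrote the dataset as IPAdapterDataset.py so that the data loading logic is easy to follow.
We train IP-Adapter-Plus-SDXL because it captures fine-grained image information better.

In essence, the IP-Adapter training objective is a reconstruction task, so the dataset format is similar to the one used for LoRA fine-tuning. Once all training images are captioned, run sh train_ip_adapter_plus_sdxl.sh to train the IP-Adapter. See GitHub for the full code: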

export MODEL_NAME="/path/to/your/stable-diffusion-xl-base-1.0"
export PRETRAIN_IP_ADAPTER_PATH="/path/to/your/.../sdxl_models/ip-adapter-plus_sdxl_vit-h.bin"
export IMAGE_ENCODER_PATH="/path/to/your/.../models/image_encoder"
export OUTPUT_DIR="ip-adapter"
export TRAIN_DIR="images"
export JSON_FILE="images/data.json"

accelerate launch ./stable_diffusion/train_ip_adapter_plus_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --image_encoder_path=$IMAGE_ENCODER_PATH \
  --pretrained_ip_adapter_path=$PRETRAIN_IP_ADAPTER_PATH \
  --data_json_file=$JSON_FILE \
  --data_root_path=$TRAIN_DIR \
  --mixed_precision="fp16" \
  --height=1024 --width=1024 \
  --train_batch_size=2 \
  --dataloader_num_workers=4 \
  --learning_rate=1e-05 \
  --weight_decay=0.01 \
  --output_dir=$OUTPUT_DIR \
  --save_steps=10000 \
  --seed=1337

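AnimateDiff Training

AnimateDiff is an open-source text-to-video (T2V) technique. It extends the original text-to-image model by integrating a motion module and learning reliable motion priors from large-scale video datasets. Here we rewrite the AnimateDiff training code with LoRA. Note that we reproduce the training code on the SD1.5 model using the latest diffusers library:

Our training code references AnimationDiff with train. To keep it simple, we use the latest diffusers.
We rewrote the dataset as AnimateDiffDataset.py so that it is easy to see how the videos and their captions are loaded.

Note that we fine-tune the pretrained AnimateDiff with LoRA, which greatly reduces CUDA memory requirements. If you want to fine-tune an AnimateDiff model, you can download video data from Hugging Face, such as webvid10M. The processed data.json looks like this: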

[
    {
        "video": "stock-footage-grilled-chicken-wings.mp4",
        "text": "Grilled chicken wings."
    },
    {
        "video": "stock-footage-waving-australian-flag-on-top-of-a-building.mp4",
        "text": "Waving Australian flag on top of a building."
    }
]
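
Each video in data.json has to be reduced to a fixed number of frames before it reaches the model. A common scheme in AnimateDiff-style training is to pick a random clip and take evenly spaced frames from it. The sketch below shows that idea. It is not taken from AnimateDiffDataset.py, and sample_frame_indices together with its defaults of 16 frames and a stride of 4 are assumptions made for illustration.

```python
import random
from typing import List, Optional


def sample_frame_indices(video_length: int, num_frames: int = 16, stride: int = 4,
                         rng: Optional[random.Random] = None) -> List[int]:
    """Pick num_frames indices from a random clip of a video with video_length frames."""
    rng = rng or random.Random()
    # Length of the clip covered by num_frames frames spaced stride apart,
    # capped at the video length for short videos.
    clip_length = min(video_length, (num_frames - 1) * stride + 1)
    start = rng.randint(0, video_length - clip_length)
    if num_frames == 1:
        return [start]
    # Spread the frames evenly over the clip; short videos repeat some frames.
    step = (clip_length - 1) / (num_frames - 1)
    return [start + round(i * step) for i in range(num_frames)]


print(sample_frame_indices(100, rng=random.Random(0)))
```

For a video that is long enough, consecutive indices are exactly stride frames apart. For a video shorter than the clip, the whole video is used and some frames are repeated so that the output length stays fixed.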

Specifically, we use PEFT to fine-tune the AnimateDiff model with LoRA:

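import torch
from diffusers import AutoencoderKL, DDPMScheduler, MotionAdapter, UNet2DConditionModel, UNetMotionModel
from peft import LoraConfig
from transformers import CLIPTextModel, CLIPTokenizer

# Load scheduler, tokenizer and models (args comes from the script's argument parser).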
noise_scheduler = DDPMScheduler.from_pretrained(args.pretrained_model_name_or_path, subfolder="scheduler")
tokenizer = CLIPTokenizer.from_pretrained(args.pretrained_model_name_or_path, subfolder="tokenizer", revision=args.revision)
text_encoder = CLIPTextModel.from_pretrained(args.pretrained_model_name_or_path, subfolder="text_encoder", revision=args.revision)
vae = AutoencoderKL.from_pretrained(args.pretrained_model_name_or_path, subfolder="vae", revision=args.revision, variant=args.variant)
unet = UNet2DConditionModel.from_pretrained(args.pretrained_model_name_or_path, subfolder="unet", revision=args.revision, variant=args.variant)

# Animatediff: UNet2DConditionModel -> UNetMotionModel
motion_adapter = MotionAdapter.from_pretrained(args.motion_module, torch_dtype=torch.float16)
unet = UNetMotionModel.from_unet2d(unet, motion_adapter)

# freeze parameters of models to save more memory
unet.requires_grad_(False)
vae.requires_grad_(False)
text_encoder.requires_grad_(False)

# Configure LoRA with PEFT. The target names match the attention projections,
# so adapters are added to both the SD layers and the motion modules.
unet_lora_config = LoraConfig(
    r=args.rank,
    lora_alpha=args.rank,
    init_lora_weights="gaussian",
    target_modules=["to_k", "to_q", "to_v", "to_out.0"],
)

# Attach the LoRA adapter to the UNet; only the LoRA weights are trainable.
unet.add_adapter(unet_lora_config)
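
Two settings in the LoraConfig above are worth spelling out. LoRA scales its update by lora_alpha / r, so lora_alpha=args.rank gives a scale of 1. With init_lora_weights="gaussian", PEFT draws the A matrix from a Gaussian and sets B to zero, so the adapter has no effect at the start of training. The sketch below reproduces both facts on a tiny layer in plain Python. It is a toy illustration rather than PEFT code, and matvec and lora_forward are hypothetical helpers.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, rank, lora_alpha):
    """Compute W x + (lora_alpha / rank) * B (A x)."""
    scale = lora_alpha / rank
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]


rng = random.Random(0)
in_dim, out_dim, rank = 6, 6, 4
W = [[rng.gauss(0, 1) for _ in range(in_dim)] for _ in range(out_dim)]  # frozen weight
A = [[rng.gauss(0, 1 / rank) for _ in range(in_dim)] for _ in range(rank)]  # Gaussian init
B = [[0.0] * rank for _ in range(out_dim)]  # zero init
x = [rng.gauss(0, 1) for _ in range(in_dim)]

y_base = matvec(W, x)
y_lora = lora_forward(W, A, B, x, rank=rank, lora_alpha=rank)
print(y_lora == y_base)  # the adapter is a no-op before any training step
```

Because B starts at zero, the first forward pass matches the pretrained model and the adapter only departs from it as B is updated.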

Once the training videos are fully prepared, run sh train_animatediff_with_lora.sh to train the AnimateDiff model. See GitHub for the full code:

export MODEL_NAME="/path/to/your/Realistic_Vision_V5.1_noVAE"
export MOTION_MODULE="/path/to/your/animatediff-motion-adapter-v1-5-2"
export OUTPUT_DIR="animatediff"
export TRAIN_DIR="webvid"
export JSON_FILE="webvid/data.json"

accelerate launch  ./stable_diffusion/train_animatediff_with_lora.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --motion_module=$MOTION_MODULE \
  --train_data_dir=$TRAIN_DIR \
  --output_dir=$OUTPUT_DIR \
  --json_file=$JSON_FILE \
  --resolution=512  \
  --train_batch_size=1 \
  --rank=8 \
  --gradient_accumulation_steps=2 \
  --num_train_epochs=10 \
  --checkpointing_steps=10000 \
  --learning_rate=1e-5 \
  --lr_scheduler="constant_with_warmup" \
  --lr_warmup_steps=500 \
  --mixed_precision="fp16" \
  --seed=1337