Fine-Tuning Stable Diffusion with LoRA (Workaround for ‘Unscale FP16 Gradients’ Error)

吴脑的键客

Category: AI炼丹. Tags: stable diffusion, artificial intelligence, machine learning

Article link: https://blog.csdn.net/weixin_41446370/article/details/144578455

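I am writing this at the end of March 2024, more than a year after the corresponding article was published on Hugging Face, and a few months after Julien Simon released a video explaining how to fine-tune Stable Diffusion for less than $1 using AWS EC2 spot instances. Start with the official page, which is the reference you should follow. Unfortunately, you will most likely run into the "Attempting to unscale FP16 gradients." error. Multiple users have reported it. If you hit it too, here is how to fix it.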

Here is the accelerate command used to start fine-tuning:

export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export DATASET_NAME="lambdalabs/pokemon-blip-captions"

accelerate launch --mixed_precision="fp16" --multi_gpu  train_text_to_image.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_NAME \
  --use_ema \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --max_train_steps=15000 \
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
  --lr_scheduler="constant" --lr_warmup_steps=0 \
  --output_dir="sd-pokemon-model"

Training then fails with a traceback like the following:

Resuming from checkpoint checkpoint-35
01/03/2024 21:55:36 - INFO - accelerate.accelerator - Loading states from sd-model-finetuned-lora/checkpoint-35
Loading unet.
01/03/2024 21:55:36 - INFO - peft.tuners.tuners_utils - Already found a peft_config attribute in the model. This will lead to having multiple adapters in the model. Make sure to know what you are doing!
01/03/2024 21:55:38 - INFO - accelerate.checkpointing - All model weights loaded successfully
01/03/2024 21:55:38 - INFO - accelerate.checkpointing - All optimizer states loaded successfully
01/03/2024 21:55:38 - INFO - accelerate.checkpointing - All scheduler states loaded successfully
01/03/2024 21:55:38 - INFO - accelerate.checkpointing - All dataloader sampler states loaded successfully
01/03/2024 21:55:38 - INFO - accelerate.checkpointing - GradScaler state loaded successfully
01/03/2024 21:55:38 - INFO - accelerate.checkpointing - All random states loaded successfully
01/03/2024 21:55:38 - INFO - accelerate.accelerator - Loading in 0 custom states
Steps: 70% 35/50 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/content/train_text_to_image_lora_sdxl.py", line 1261, in <module>
    main(args)
  File "/content/train_text_to_image_lora_sdxl.py", line 1077, in main
    accelerator.clip_grad_norm_(params_to_optimize, args.max_grad_norm)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2040, in clip_grad_norm_
    self.unscale_gradients()
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2003, in unscale_gradients
    self.scaler.unscale_(opt)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 307, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 229, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
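
The last frame of the traceback is PyTorch's GradScaler rejecting a gradient that is stored in FP16, which suggests that some of the parameters being optimized were themselves FP16. The following is a minimal diagnostic sketch, not code from the training script: it lists the trainable parameters that are stored in FP16 and upcasts only those to FP32. The helper names and the toy "base plus adapter" model are hypothetical and chosen purely for illustration.

```python
import torch
from torch import nn


def fp16_trainable_params(model: nn.Module) -> list[str]:
    """Return names of parameters that require grad but are stored in FP16."""
    return [
        name
        for name, p in model.named_parameters()
        if p.requires_grad and p.dtype == torch.float16
    ]


def upcast_trainable_params(model: nn.Module) -> None:
    """Cast only the trainable parameters to FP32; frozen weights keep their dtype."""
    for p in model.parameters():
        if p.requires_grad:
            p.data = p.data.to(torch.float32)


# Toy stand-in for a frozen FP16 base with a trainable adapter (hypothetical layout).
model = nn.ModuleDict({"base": nn.Linear(8, 8), "adapter": nn.Linear(8, 8)}).half()
model["base"].requires_grad_(False)

print(fp16_trainable_params(model))  # trainable FP16 parameters before the upcast
upcast_trainable_params(model)
print(fp16_trainable_params(model))  # empty once the trainable parameters are FP32
```

Running a check like `fp16_trainable_params` on the model right before the optimizer is created is one way to confirm whether this is the situation you are in.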

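The command fails because --mixed_precision="fp16" is passed only to accelerate launch and is missing from the arguments of the training script itself (train_text_to_image_lora.py). Here is the accelerate command that works: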

export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export DATASET_NAME="lambdalabs/pokemon-blip-captions"

accelerate launch --mixed_precision="fp16" --multi_gpu  train_text_to_image.py \
  --mixed_precision="fp16" \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_NAME \
  --use_ema \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --max_train_steps=15000 \
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
  --lr_scheduler="constant" --lr_warmup_steps=0 \
  --output_dir="sd-pokemon-model"
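
The only difference between the failing and the working command is that --mixed_precision="fp16" appears a second time, after the script name, so the training script receives it as well. A small pre-flight check along these lines could catch the omission before a run starts. This is a sketch under assumptions: the function name is hypothetical, and the rule "the first token ending in .py is the training script" is a simplification for illustration, not something accelerate provides.

```python
import shlex


def script_gets_mixed_precision(command: str) -> bool:
    """True if --mixed_precision appears among the training script's own arguments."""
    tokens = shlex.split(command)
    # Assumption: the first token ending in ".py" is the training script;
    # everything after it is passed to the script, not to `accelerate launch`.
    script_idx = next(i for i, t in enumerate(tokens) if t.endswith(".py"))
    script_args = tokens[script_idx + 1:]
    return any(t.split("=")[0] == "--mixed_precision" for t in script_args)


# Shortened versions of the two commands shown in this article.
failing = (
    'accelerate launch --mixed_precision="fp16" --multi_gpu '
    "train_text_to_image.py --max_grad_norm=1"
)
working = (
    'accelerate launch --mixed_precision="fp16" --multi_gpu '
    'train_text_to_image.py --mixed_precision="fp16" --max_grad_norm=1'
)

print(script_gets_mixed_precision(failing))  # False
print(script_gets_mixed_precision(working))  # True
```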