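I am writing this post at the end of March 2024, more than a year after this article was published on Hugging Face, and a few months after Julien Simon released a video explaining how to fine-tune Stable Diffusion for under $1 using AWS EC2 spot instances. First, here is the official page you should refer to. Unfortunately, you are very likely to run into the "Attempting to unscale FP16 gradients." error. Multiple users have reported this error here. If you hit it too, the fix is below.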
Here is the accelerate command used to start fine-tuning:
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export DATASET_NAME="lambdalabs/pokemon-blip-captions"
accelerate launch --mixed_precision="fp16" --multi_gpu train_text_to_image.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--dataset_name=$DATASET_NAME \
--use_ema \
--resolution=512 --center_crop --random_flip \
--train_batch_size=1 \
--gradient_accumulation_steps=4 \
--gradient_checkpointing \
--max_train_steps=15000 \
--learning_rate=1e-05 \
--max_grad_norm=1 \
--lr_scheduler="constant" --lr_warmup_steps=0 \
--output_dir="sd-pokemon-model"
A reported instance of the error, taken from a run of train_text_to_image_lora_sdxl.py that was resuming from a checkpoint, looks like this:
Resuming from checkpoint checkpoint-35
01/03/2024 21:55:36 - INFO - accelerate.accelerator - Loading states from sd-model-finetuned-lora/checkpoint-35
Loading unet.
01/03/2024 21:55:36 - INFO - peft.tuners.tuners_utils - Already found a `peft_config` attribute in the model. This will lead to having multiple adapters in the model. Make sure to know what you are doing!
01/03/2024 21:55:38 - INFO - accelerate.checkpointing - All model weights loaded successfully
01/03/2024 21:55:38 - INFO - accelerate.checkpointing - All optimizer states loaded successfully
01/03/2024 21:55:38 - INFO - accelerate.checkpointing - All scheduler states loaded successfully
01/03/2024 21:55:38 - INFO - accelerate.checkpointing - All dataloader sampler states loaded successfully
01/03/2024 21:55:38 - INFO - accelerate.checkpointing - GradScaler state loaded successfully
01/03/2024 21:55:38 - INFO - accelerate.checkpointing - All random states loaded successfully
01/03/2024 21:55:38 - INFO - accelerate.accelerator - Loading in 0 custom states
Steps: 70% 35/50 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/content/train_text_to_image_lora_sdxl.py", line 1261, in <module>
    main(args)
  File "/content/train_text_to_image_lora_sdxl.py", line 1077, in main
    accelerator.clip_grad_norm_(params_to_optimize, args.max_grad_norm)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2040, in clip_grad_norm_
    self.unscale_gradients()
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2003, in unscale_gradients
    self.scaler.unscale_(opt)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 307, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 229, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
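The --mixed_precision="fp16" argument is missing from the arguments of the train_text_to_image_lora.py script itself (it is passed only to accelerate launch), so training fails. Here is the accelerate command that works, with the flag repeated after the script name: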
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export DATASET_NAME="lambdalabs/pokemon-blip-captions"
accelerate launch --mixed_precision="fp16" --multi_gpu train_text_to_image.py \
--mixed_precision="fp16" \
--pretrained_model_name_or_path=$MODEL_NAME \
--dataset_name=$DATASET_NAME \
--use_ema \
--resolution=512 --center_crop --random_flip \
--train_batch_size=1 \
--gradient_accumulation_steps=4 \
--gradient_checkpointing \
--max_train_steps=15000 \
--learning_rate=1e-05 \
--max_grad_norm=1 \
--lr_scheduler="constant" --lr_warmup_steps=0 \
--output_dir="sd-pokemon-model"
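The only difference between the failing and the working command is where the flag appears: before the script name it is read by accelerate launch, and after the script name it is read by the training script. As a rough illustration, the Python sketch below checks a command string for the script-level flag. The helper names (split_launch_command, script_receives_mixed_precision) are hypothetical and are not part of accelerate or diffusers. The sketch also assumes the first token ending in ".py" is the training script, and it uses shortened single-line versions of the two commands above.

```python
import shlex


def split_launch_command(command: str):
    """Split a launch command into (launcher tokens, script name, script args).

    Assumption of this sketch: the first token ending in ".py" is the
    training script. Everything before it belongs to the launcher.
    """
    tokens = shlex.split(command)
    for i, token in enumerate(tokens):
        if token.endswith(".py"):
            return tokens[:i], token, tokens[i + 1:]
    raise ValueError("no .py training script found in command")


def script_receives_mixed_precision(command: str) -> bool:
    """Return True if --mixed_precision appears after the script name."""
    _, _, script_args = split_launch_command(command)
    return any(
        arg == "--mixed_precision" or arg.startswith("--mixed_precision=")
        for arg in script_args
    )


# Shortened, single-line versions of the two commands from this post.
failing = (
    'accelerate launch --mixed_precision="fp16" --multi_gpu train_text_to_image.py '
    "--pretrained_model_name_or_path=$MODEL_NAME --use_ema"
)
working = (
    'accelerate launch --mixed_precision="fp16" --multi_gpu train_text_to_image.py '
    '--mixed_precision="fp16" --pretrained_model_name_or_path=$MODEL_NAME --use_ema'
)

print(script_receives_mixed_precision(failing))  # False
print(script_receives_mixed_precision(working))  # True
```

A check like this could be run on a launch command before starting a long training job, but it only inspects the command text and does not verify how the script actually uses the flag.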