Notes on deep learning environment setup and bugs hit while reproducing code

钓猫的小笨鱼

Last modified: 2024-09-17 14:18:18

Tags: python, deep learning

First published: 2024-09-04 11:50:21

Article link: https://blog.csdn.net/m0_62185213/article/details/141888112

Contents

Preface

I. Basic virtual environment operations

II. Running stable-diffusion-3-medium on a server

1. Create the virtual environment

2. Install Diffusers and Transformers

3. Register a HuggingFace account and log in

4. Run the test code

5. Bugs while running the test code

5.1 ValueError: Cannot instantiate this tokenizer from a slow version. If it's based on sentencepiece, make sure you have sentencepiece installed.

5.2 ImportError: T5Converter requires the protobuf library but it was not found in your environment.

5.3 TypeError: expected np.ndarray (got numpy.ndarray)

5.4 The first GPU ran out of memory, so I wanted to switch to another one

6. Test result

III. Fine-tuning a Naruto-style text-to-image model on the Naruto dataset (without LoRA)

1. Download the dataset

2. Prepare the code

3. Start training

4. Bugs during training

4.1 NotImplementedError

5. Model inference

6. Inference result

Summary

Preface

These are my notes on setting up deep learning environments and on the bugs I ran into while reproducing code.

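I. Basic virtual environment operations

Delete a virtual environment: conda remove -n mmdet --all
Leave the current virtual environment: conda deactivate
Create a virtual environment: conda create -n mmdet python=3.9
Activate a virtual environment: conda activate mmdet
Install PyTorch: get the install command from the official PyTorch website
Run nvidia-smi to check the highest CUDA version the server's driver supports
Pick a PyTorch build that fits that CUDA version and install it, for example: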
# CUDA 11.8

conda install pytorch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 pytorch-cuda=11.8 -c pytorch -c nvidia
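
After the install, it can help to confirm that the build actually sees the GPU. The following is a minimal sketch and not part of my original notes; the helper name describe_torch_install is made up for illustration, and it only reads torch.__version__, torch.version.cuda, and torch.cuda.is_available().

```python
def describe_torch_install():
    """Return a small report about the local PyTorch install.

    Hypothetical helper for illustration; returns {"installed": False}
    when torch cannot be imported.
    """
    try:
        import torch
    except ImportError:
        return {"installed": False}
    return {
        "installed": True,
        "torch_version": torch.__version__,
        # CUDA version the build was compiled against (None for CPU-only builds)
        "built_for_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    print(describe_torch_install())
```

If cuda_available comes back False on a GPU server, the usual suspects are a CPU-only build or a CUDA version newer than what nvidia-smi reports as supported.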

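II. Running stable-diffusion-3-medium on a server

1. Create the virtual environment

I used python=3.9 and installed PyTorch with the same command as in section I: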

conda install pytorch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 pytorch-cuda=11.8 -c pytorch -c nvidia

2. Install Diffusers and Transformers

Make sure both Diffusers and Transformers are on their latest versions.

pip install --upgrade diffusers transformers

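3. Register a HuggingFace account and log in

Register a HuggingFace account, then open the SD3 model page https://huggingface.co/stabilityai/stable-diffusion-3-medium and accept its usage agreement. Next, set up an Access Token: click "settings" at the top right, then "Access Tokens" on the left, and create a new token. Copy the token and save it locally, then open the options at the top right of the token, click "Edit Permission", and enable the "... public gated repos ..." permission.

Reference: https://blog.csdn.net/sinat_37574187/article/details/140415406

Install the HuggingFace command-line tool on the server: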

pip install -U "huggingface_hub[cli]"

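Then run huggingface-cli login. The command line prompts for the token; paste the token saved earlier to complete the login.

huggingface-cli login

Enter your token (input will not be visible): <paste the token here>

In my case the server cannot reach the external internet, so the login failed.

A mirror site can be used instead:

pip install -U huggingface_hub  # not sure whether this one is needed

pip install -U hf-transfer 

export HF_ENDPOINT=https://hf-mirror.com  # mirror site

export HF_HUB_ENABLE_HF_TRANSFER=1  # enable accelerated downloads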
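
The same two settings can also be applied from inside a Python script. As far as I know, huggingface_hub reads these variables when it is imported, so they need to be set before importing huggingface_hub, diffusers, or transformers. A minimal sketch; the function name use_hf_mirror is hypothetical:

```python
import os


def use_hf_mirror(endpoint="https://hf-mirror.com", fast_transfer=True):
    """Point HuggingFace downloads at a mirror via environment variables."""
    os.environ["HF_ENDPOINT"] = endpoint
    if fast_transfer:
        # Only useful when the hf-transfer package is installed
        os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"


# Call this before importing diffusers / transformers / huggingface_hub
use_hf_mirror()
```

Unlike export in the shell, this only affects the current Python process and any child processes it starts.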

4. Run the test code

import torch
from diffusers import StableDiffusion3Pipeline
 
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16)
pipe = pipe.to("cuda")
 
image = pipe(
    "A cat holding a sign that says hello world",
    negative_prompt="",
    num_inference_steps=28,
    guidance_scale=7.0,
).images[0]
 
image.save('tmp.png')

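The call pipe = StableDiffusion3Pipeline.from_pretrained("stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16) downloads the model, its weights, and related files from HuggingFace (about 20 GB).

The download location can be set with cache_dir:

pipe = StableDiffusion3Pipeline.from_pretrained( "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16,cache_dir="./models/")  # downloads into ./models

The download often broke off because of network problems. After looking at what had been downloaded, I realized the files can just as well be downloaded on a local computer through a proxy and then copied over to the server.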

5. Bugs while running the test code

5.1 ValueError: Cannot instantiate this tokenizer from a slow version. If it's based on sentencepiece, make sure you have sentencepiece installed.

Fix: pip install sentencepiece

5.2 ImportError: T5Converter requires the protobuf library but it was not found in your environment. Checkout the instructions on the installation page of its repo: https://github.com/protocolbuffers/protobuf/tree/master/python#installation and follow the ones that match your environment. Please note that you may need to restart your runtime after installation.

Fix: pip install protobuf

5.3 TypeError: expected np.ndarray (got numpy.ndarray)

Fix:

This was caused by the numpy version.

What worked for me was numpy 1.26.0.

My numpy was 1.26.4, so I uninstalled it first and then installed 1.26.0:

pip uninstall numpy

pip install numpy==1.26.0
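
Before rerunning the test code, a quick look at what is actually installed can save a round trip. This is a minimal sketch using only the standard library (importlib.metadata, Python 3.8+); the helper name installed_version is made up for illustration:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Packages involved in bugs 5.1 to 5.3
    for name in ("sentencepiece", "protobuf", "numpy"):
        print(name, installed_version(name) or "NOT INSTALLED")
```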

5.4 The first GPU ran out of memory, so I wanted to switch to another one

import os

os.environ['CUDA_VISIBLE_DEVICES'] = '3'

But this raised torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1716905971093/work/aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch. device=, num_gpus=

Reference: "os.environ['CUDA_VISIBLE_DEVICES'] 无法生效原因" (why os.environ['CUDA_VISIBLE_DEVICES'] does not take effect), Zhihu (zhihu.com)

Fix: the two lines import os and os.environ['CUDA_VISIBLE_DEVICES'] = '3' must be placed before import torch.
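
A minimal sketch of the working order. The GPU index 3 is the one from my setup; the try/except is only there so the snippet also runs on a machine without PyTorch. One point that is easy to miss: once only one physical GPU is visible, the process sees it renumbered as cuda:0.

```python
import os

# Must run before `import torch`; physical GPU 3 is the example from the text
os.environ["CUDA_VISIBLE_DEVICES"] = "3"

try:
    import torch
except ImportError:  # keeps the sketch runnable without PyTorch
    torch = None

if torch is not None and torch.cuda.is_available():
    # The single visible GPU is renumbered, so it is addressed as cuda:0
    device = torch.device("cuda:0")
else:
    device = "cpu"
print(device)
```

Setting the variable in the shell instead (CUDA_VISIBLE_DEVICES=3 python script.py) avoids the ordering problem entirely.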

6. Test result

Well, the cat looks about as unhappy as I was while running the code (so much for "hello world").

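III. Fine-tuning a Naruto-style text-to-image model on the Naruto dataset (without LoRA)

This section records whether a Stable Diffusion model can be fine-tuned on a custom dataset, using the Naruto dataset as the example.

1. Download the dataset

The dataset is about 700 MB. There are two ways to download it:

If your connection to HuggingFace works, just run the code I give below; it downloads the data directly through the HF datasets library. (With the mirror site configured above, this direct download works.)
If the network is a problem, I have also put it on Baidu Netdisk (extraction code: gtk8). Download naruto-blip-captions.zip, unzip it locally, and place it in the same directory as the training script.
The two options above are quoted from "Stable Diffusion文生图模型训练入门实战（完整代码）" (jindouyun.cn).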

from datasets import load_dataset

# Set the cache path directly
ds = load_dataset("lambdalabs/naruto-blip-captions", cache_dir="/data/zq/diffusers/data")

2. Prepare the code

For setting up the diffusers code, refer directly to the official repository: diffusers/examples/text_to_image at main · huggingface/diffusers · GitHub

First clone the code locally:

git clone https://github.com/huggingface/diffusers
cd diffusers
pip install .

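Then go into the examples/text_to_image folder:

cd examples
cd text_to_image
pip install -r requirements.txt

Initialize the 🤗Accelerate environment (I skipped this step and hit an error later, which can still be fixed; see the training bugs section below):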

accelerate config

Install the PEFT library:

pip install peft==0.6.0

3. Start training

For training I chose the stabilityai/stable-diffusion-2 model. In the terminal, enter:

export MODEL_NAME="stabilityai/stable-diffusion-2" # can be skipped if the model name is written directly in the training command

# Note the resolution: 512 for v1 models, 768 for v2 models
accelerate launch --mixed_precision="fp16" train_text_to_image.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-2" \
  --dataset_name="/data/zq/diffusers/data" \
  --use_ema \
  --resolution=768 --center_crop --random_flip \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --max_train_steps=15000 \
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
  --lr_scheduler="constant" --lr_warmup_steps=0 \
  --output_dir="sd-naruto-model"

Training starts, as shown in the figure below.

Files produced during training (logs, weight files, and so on) are kept in the sd-naruto-model folder.
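
To make the batch settings in the training command concrete: with train_batch_size=1 and gradient_accumulation_steps=4, each optimizer step averages gradients over 4 samples per process. The sketch below is just this arithmetic; the function names are hypothetical, and num_processes (the number of GPUs used by accelerate) is an assumption, since it depends on the local accelerate configuration.

```python
def effective_batch_size(train_batch_size, gradient_accumulation_steps, num_processes=1):
    """Number of samples that contribute to one optimizer step."""
    return train_batch_size * gradient_accumulation_steps * num_processes


def samples_seen(max_train_steps, **batch_settings):
    """Total samples processed over the whole run."""
    return max_train_steps * effective_batch_size(**batch_settings)


# Values taken from the training command; num_processes=1 is an assumption
print(effective_batch_size(train_batch_size=1, gradient_accumulation_steps=4))  # 4
print(samples_seen(15000, train_batch_size=1, gradient_accumulation_steps=4))   # 60000
```

If memory allows a larger train_batch_size, gradient_accumulation_steps can be lowered by the same factor to keep the effective batch size unchanged.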

4. Bugs during training

Once the dataset was downloaded, things went mostly smoothly.

4.1 NotImplementedError

raise NotImplementedError(

NotImplementedError: Using RTX 4000 series doesn't support faster communication broadband via P2P or IB. Please set `NCCL_P2P_DISABLE="1"` and `NCCL_IB_DISABLE="1" or use `accelerate launch` which will do this automatically.

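Fix: this is probably because I had not initialized the 🤗Accelerate environment. Following the error message, set the two variables in the terminal before launching the training:

export NCCL_P2P_DISABLE="1"
export NCCL_IB_DISABLE="1"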
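
If the training is started from a Python wrapper rather than typed into the terminal, the same two variables can be passed to the child process. This is only a sketch under that assumption; the helper name nccl_safe_env is made up, and the launch line is left as a comment because the full list of training flags belongs to the command in section 3.

```python
import os


def nccl_safe_env(base=None):
    """Copy an environment mapping and disable NCCL P2P and IB in the copy."""
    env = dict(os.environ if base is None else base)
    env["NCCL_P2P_DISABLE"] = "1"
    env["NCCL_IB_DISABLE"] = "1"
    return env


env = nccl_safe_env()
# Then, for example (not executed here, training flags omitted):
# subprocess.run(["accelerate", "launch", "train_text_to_image.py"], env=env)
```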

5. Model inference

import torch
from diffusers import StableDiffusionPipeline

model_path = "/data/zq/diffusers/examples/text_to_image/sd-naruto-model"
pipe = StableDiffusionPipeline.from_pretrained(model_path, torch_dtype=torch.float16)
pipe.to("cuda")

image = pipe(prompt="Bill Gates with a hoodie").images[0]
image.save("yoda-naruto.png")

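Note that model_path is not the path to a weight file but the path to the sd-naruto-model folder (as a beginner who does not know the internals, I found this a bit odd).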
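
The reason, as far as I understand it: from_pretrained expects a pipeline folder whose model_index.json names each component (unet, vae, text_encoder, and so on), with every component saved in its own subfolder. A minimal sketch for inspecting such a folder; the helper list_pipeline_components is hypothetical, and the demo builds a small fake folder instead of reading real training output.

```python
import json
import tempfile
from pathlib import Path


def list_pipeline_components(model_dir):
    """Return the component names listed in model_index.json.

    Raises FileNotFoundError if model_dir is not a pipeline folder.
    """
    index_file = Path(model_dir) / "model_index.json"
    with open(index_file, encoding="utf-8") as f:
        index = json.load(f)
    # Components are stored as [library, class] pairs; other keys are metadata
    return sorted(k for k, v in index.items() if isinstance(v, list))


# Demo on a fake pipeline folder (a stand-in for sd-naruto-model)
with tempfile.TemporaryDirectory() as tmp:
    fake_index = {
        "_class_name": "StableDiffusionPipeline",
        "unet": ["diffusers", "UNet2DConditionModel"],
        "vae": ["diffusers", "AutoencoderKL"],
    }
    (Path(tmp) / "model_index.json").write_text(json.dumps(fake_index))
    demo_components = list_pipeline_components(tmp)
    print(demo_components)  # ['unet', 'vae']
```

Pointing the helper at a single weight file or a checkpoint subfolder raises FileNotFoundError, which matches the observation that only the top-level output folder can be loaded.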

The inference output is shown in the figure below.