Multi-Machine, Multi-GPU Distributed Training

1. Environment setup

  • Distributed training stack: accelerate + deepspeed + pdsh (pdsh is optional)
  • Base environment: CUDA, GPU driver, PyTorch

1.1 Install the required packages

# Install CUDA 11.8 and the GPU driver from NVIDIA's local repo (RHEL 7)
wget https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda-repo-rhel7-11-8-local-11.8.0_520.61.05-1.x86_64.rpm
sudo rpm -i cuda-repo-rhel7-11-8-local-11.8.0_520.61.05-1.x86_64.rpm
sudo yum clean all
sudo yum -y install nvidia-driver-latest-dkms
sudo yum -y install cuda

# Install the CUDA 11.8 build of PyTorch, plus accelerate and deepspeed
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install accelerate
pip install deepspeed

pdsh can be built from source; see, for example, the tutorial “并行分布式运维工具pdsh” on the Alibaba Cloud developer community. After downloading pdsh-2.29.tar.bz2:

tar jxvf pdsh-2.29.tar.bz2
cd pdsh-2.29
./configure --with-ssh --with-rsh --with-mrsh --with-dshgroups --with-machines=/etc/pdsh/machines
make
make install
pdsh -V

Note: every machine must have an identical environment: the same package versions, the same conda installation path, and a CUDA version that matches the installed PyTorch build. A quick consistency check is sketched below.
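
The following is a minimal sketch of such a check: run it in the same conda environment on every node and compare the output.

import sys
import torch

# Values that should be identical (or at least compatible) on every node
print("python      :", sys.version.split()[0])
print("conda prefix:", sys.prefix)          # conda install path should match across machines
print("torch       :", torch.__version__)
print("torch CUDA  :", torch.version.cuda)  # must correspond to the installed CUDA toolkit

try:
    import accelerate
    import deepspeed
    print("accelerate  :", accelerate.__version__)
    print("deepspeed   :", deepspeed.__version__)
except ImportError as exc:  # flags any node missing a package
    print("missing package:", exc)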

2. Launching distributed training

2.1 Launching with accelerate

2.1.1 Using accelerate

# 1. Generate the accelerate configuration file interactively from the command line
accelerate config

2.1.2 Launch the distributed training script

Option 1: use pdsh; the command only needs to be run on the main node.

# accelerate launch syntax
accelerate launch --config_file <accelerate config file> python_script.py <script arguments>

# Example
config_path=/data0/sdmt/mxm/kohya_ss/my_util/config/deepspeed_pdsh_config.yaml

accelerate launch --config_file $config_path \
  train_text_to_image_sdxl.py \
  --mixed_precision fp16 --enable_xformers_memory_efficient_attention --gradient_checkpointing \
  --noise_offset 0.05 --cache_dir "/data0/sdmt/mxm/datasets/" --num_train_epochs 20 \
  --resolution 1024 --proportion_empty_prompts 0.2 \
  --learning_rate 1e-06 --lr_scheduler "constant" --lr_warmup_steps 0 \
  --validation_prompt "a pair of casual leather shoes" --validation_epochs 5 \
  --pretrained_model_name_or_path "stabilityai/stable-diffusion-xl-base-1.0" \
  --pretrained_vae_model_name_or_path "madebyollin/sdxl-vae-fp16-fix" \
  --train_data_dir "/data0/sdmt/train_img/000/10_train_1024_hug"

In this example, training is launched on 2 servers with one GPU per server.

2.1.3 accelerate and deepspeed configuration files

  • accelerate: the default_config.yaml generated by accelerate config is written by default to:

/root/.cache/huggingface/accelerate/default_config.yaml

  • deepspeed: when launching with the deepspeed command, the following file in the current directory is read by default, with environment variables appended to it:

.deepspeed_env.yaml

Note: if you hit NCCL problems, check the settings in .deepspeed_env.yaml; the file can simply be deleted. Alternatively, the relevant NCCL variables can be set inside the training script, as sketched below.
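
This is a minimal sketch of the script-side approach. NCCL_DEBUG and NCCL_SOCKET_IFNAME are standard NCCL environment variables, but the interface name eth0 is only a placeholder for whatever NIC your nodes actually use.

import os

# Must run before torch.distributed / accelerate initializes the process group.
# NCCL_DEBUG=INFO prints detailed NCCL logs, which helps diagnose multi-node hangs.
os.environ.setdefault("NCCL_DEBUG", "INFO")
# Pin NCCL to the NIC used for inter-node traffic ("eth0" is a placeholder).
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")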

2.2 Launching with deepspeed

  • Selecting specific GPUs: set the --include argument, for example:
    deepspeed --include localhost:4,6,7 train_bash.py \
        --deepspeed ./deepspeed/ds_z3_config.json \
        ...

3. Troubleshooting notes

  • Issue 1: the GPU driver is installed, but torch.cuda is unavailable (note: the installed torch is the GPU build):

UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.) return torch._C._cuda_getDeviceCount() > 0

Solution: install nvidia-fabricmanager and start the service, as shown below. Error 802 ("system not yet initialized") typically occurs on NVSwitch-based servers (e.g. A100/A800 machines), where CUDA cannot initialize until the fabric manager is running. A quick verification snippet follows the commands.

# xxx is the major version of the GPU driver, e.g. 550
sudo apt-get install nvidia-fabricmanager-xxx
# start the service
systemctl start nvidia-fabricmanager
# enable it at boot
systemctl enable nvidia-fabricmanager
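
Once the service is running, a quick check like the sketch below should report the GPUs without the Error 802 warning.

import torch

# With nvidia-fabricmanager running, CUDA should now initialize normally
print(torch.cuda.is_available())   # expect True
print(torch.cuda.device_count())   # expect the number of GPUs on this node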

  • Issue 2: RuntimeError:
    127.0.0.1: Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again:
127.0.0.1: Configuration saved in sdxl-model-finetuned/vae/config.jsonit/s]
127.0.0.1: Model weights saved in sdxl-model-finetuned/vae/diffusion_pytorch_model.safetensors
127.0.0.1: [2023-11-02 11:16:49,463] [INFO] [launch.py:347:main] Process 43174 exits successfully.
127.0.0.1: Configuration saved in sdxl-model-finetuned/unet/config.json
127.0.0.1: Traceback (most recent call last):
127.0.0.1:   File "/data0/sdmt/mxm/kohya_ss/my_util/train_text_to_image_sdxl.py", line 1206, in <module>
127.0.0.1:     main(args)
127.0.0.1:   File "/data0/sdmt/mxm/kohya_ss/my_util/train_text_to_image_sdxl.py", line 1157, in main
127.0.0.1:     pipeline.save_pretrained(args.output_dir)
127.0.0.1:   File "/root/.conda/envs/sdxl/lib/python3.10/site-packages/diffusers/pipelines/pipeline_utils.py", line 661, in save_pretrained
127.0.0.1:     save_method(os.path.join(save_directory, pipeline_component_name), **save_kwargs)
127.0.0.1:   File "/root/.conda/envs/sdxl/lib/python3.10/site-packages/diffusers/models/modeling_utils.py", line 361, in save_pretrained
127.0.0.1:     safetensors.torch.save_file(
127.0.0.1:   File "/root/.conda/envs/sdxl/lib/python3.10/site-packages/safetensors/torch.py", line 232, in save_file
127.0.0.1:     serialize_file(_flatten(tensors), filename, metadata=metadata)
127.0.0.1:   File "/root/.conda/envs/sdxl/lib/python3.10/site-packages/safetensors/torch.py", line 394, in _flatten
127.0.0.1:     raise RuntimeError(
127.0.0.1: RuntimeError:
127.0.0.1:             Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'down_blocks.2.attentions.0.transformer_blocks.4.attn2.to_v.weight', 'up_blocks.0.attentions.2.transformer_blocks.

Solution: the traceback points at save_file() / serialize_file(). Simply change the code to pass safe_serialization=False, which saves the weights in pickle format instead of safetensors (safetensors refuses tensors that share memory, while PyTorch's pickle-based saving handles them).

pipeline.save_pretrained(args.output_dir, safe_serialization=False)

  • Issue 3: RuntimeError:

RuntimeError: Timed out initializing process group in store based barrier on
rank: 0, for key: store_based_barrier_key:2 (world_size=2, worker_count=1,
timeout=0:30:00)

The error is raised in _store_based_barrier() at
/root/.conda/envs/sdxl/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:469.

Cause: the dataset is large, so data preprocessing takes long enough to exceed the default 30-minute timeout on the process-group connection.

Solution: increase the timeout to 120 minutes by editing the source file constants.py at /root/.conda/envs/sdxl/lib/python3.10/site-packages/torch/distributed/constants.py, as shown below (the file after the change). Note that editing /root/.conda/envs/sdxl/lib/python3.10/site-packages/deepspeed/constants.py may have no effect. An alternative that avoids editing site-packages is sketched after the code.

from torch._C._distributed_c10d import _DEFAULT_PG_TIMEOUT
# Default process group wide timeout, if applicable.
# This only applies to the gloo and nccl backends
# (only if NCCL_BLOCKING_WAIT or NCCL_ASYNC_ERROR_HANDLING is set to 1).
# To make an attempt at backwards compatibility with THD, we use an
# extraordinarily high default timeout, given that THD did not have timeouts.
default_pg_timeout = _DEFAULT_PG_TIMEOUT

# Modification: override the default 30-minute timeout with 120 minutes
from datetime import timedelta
default_pg_timeout = timedelta(minutes=120)
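
If you would rather not edit files inside site-packages, an alternative is to pass a longer timeout through accelerate's InitProcessGroupKwargs when the Accelerator is created; the snippet below is a sketch assuming the training script constructs the Accelerator itself.

from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Raise the process-group timeout from the default 30 minutes to 120 minutes
init_kwargs = InitProcessGroupKwargs(timeout=timedelta(minutes=120))
accelerator = Accelerator(kwargs_handlers=[init_kwargs])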
  • Training speed is also affected by network bandwidth: gigabit vs. 10-gigabit networking makes a noticeable difference in training time between two A800 machines.
