Multi-node distributed training with DeepSpeed and the Hugging Face Trainer

So far my fine-tuning work has only used a single machine with multiple GPUs. To improve training efficiency, this post documents an experiment with multi-node, multi-GPU distributed training.

I. Environment Preparation

This experiment uses two machines (manager and worker), each running Ubuntu 22.04 with 4 GPUs.

To keep the installation and configuration identical on both machines, Docker containers are used; installing Docker itself is not covered here.

1. Network configuration: create a shared overlay network

Initialize the swarm by running the following on the manager machine:

docker swarm init

# Output:
Swarm initialized: current node (k4ehuhg4a2umpjoo7yovy1caf) is now a manager.

To add a worker to this swarm, run the following command:

    docker swarm join --token SWMTKN-1-686yitjn5p5twd3b3pzezqofd8dlk1wm6juqo3xb5bj4xzztvh-15obj4grc8p8mqul8qvmupkdi 192.168.11.11:2377

To add a manager to this swarm, run 'docker swarm join-token manager' and follow the instructions.

Join the swarm by running the following on the worker machine:

docker swarm join --token SWMTKN-1-686yitjn5p5twd3b3pzezqofd8dlk1wm6juqo3xb5bj4xzztvh-15obj4grc8p8mqul8qvmupkdi 192.168.11.11:2377

Create the overlay network on the manager:

docker network create --driver=overlay --attachable test-net

Run docker network ls to check the current networks; the last line shows that test-net has been created:

NETWORK ID     NAME                      DRIVER    SCOPE
ec8c853e521d   bridge                    bridge    local
72574615b63f   docker_gwbridge           bridge    local
9fbe2f6c3b22   freeaskinternet_default   bridge    local
b8273bdcc836   host                      host      local
ii71ul2agult   ingress                   overlay   swarm
eadcc6c24a81   none                      null      local
fxnzpd6r1hr0   sharednet                 overlay   swarm
wdoj2fcw29np   test-net                  overlay   swarm

2. Install docker-compose

sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/bin/docker-compose
sudo chmod +x /usr/bin/docker-compose
docker-compose --version

3. Create a working directory named work

mkdir work
cd work

4. Create a Dockerfile in work

# Dockerfile

FROM nvcr.io/nvidia/cuda:12.2.0-devel-ubuntu22.04

# Install system packages
RUN apt-get update && apt-get install -y git build-essential zlib1g-dev libncurses5-dev libgdbm-dev libnss3-dev libssl-dev libsqlite3-dev libreadline-dev libffi-dev liblzma-dev libbz2-dev curl wget net-tools iputils-ping pdsh

# Build and install Python 3.10.6 from source
WORKDIR /home/user

RUN wget https://www.python.org/ftp/python/3.10.6/Python-3.10.6.tgz && \
  tar -zvxf Python-3.10.6.tgz && cd Python-3.10.6 && \
  ./configure --enable-optimizations && make -j 4 && make install

5. Create docker-compose.yml in work

version: "3"
services:
  llmtrain:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: llmtrain
    tty: true
    restart: always
    ulimits:
      memlock: -1
      stack: 67108864
    shm_size: 40G
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    volumes:
      - ./code:/home/user/code:cached
    networks:
      - test-net

networks:
  test-net:
    external: true

6. Build and start the container

sudo docker-compose up -d --build

7. Enter the container

sudo docker exec -it <container ID> /bin/bash

8. Check the network interfaces inside the container

ifconfig -a


eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 10.0.1.14  netmask 255.255.255.0  broadcast 10.0.1.255
        ether 02:42:0a:00:01:0e  txqueuelen 0  (Ethernet)
        RX packets 2170444797  bytes 11730029590467 (11.7 TB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1371803017  bytes 11419623920546 (11.4 TB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.18.0.3  netmask 255.255.0.0  broadcast 172.18.255.255
        ether 02:42:ac:12:00:03  txqueuelen 0  (Ethernet)
        RX packets 74646  bytes 395241942 (395.2 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 44728  bytes 3336632 (3.3 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 161709  bytes 15509786 (15.5 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 161709  bytes 15509786 (15.5 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

9. Verify connectivity

Enter each of the two containers, check its IP address, and ping the other container's address to confirm that the overlay network works.

10. Install the libraries required by the project

pip3 install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple

pip3 install deepspeed

Note: steps 2-10 must also be performed on the other machine (worker).

11. Configure passwordless SSH login

Install the openssh-server service

First, install and start the openssh-server service inside the containers on both the manager and worker nodes:

# Install the ssh service
apt-get install openssh-server -y 
# Start the ssh service
/etc/init.d/ssh start

Set up passwordless login

Note: the following operations are all performed inside the manager and worker containers.

In each container (manager and worker), run ssh-keygen -t rsa and press Enter through all prompts.

ssh-keygen -t rsa
  • Append the contents of ~/.ssh/id_rsa.pub from the manager node to ~/.ssh/authorized_keys on both the manager and worker nodes.
  • Append the contents of ~/.ssh/id_rsa.pub from the worker node to ~/.ssh/authorized_keys on both the manager and worker nodes.

When copying the key contents, watch out for stray line breaks.

Next, add the following hostname mappings to /etc/hosts on both the manager and worker nodes:

10.0.1.14       worker
10.0.1.16       manager

Finally, test that the containers can SSH into each other; if the configuration is correct, no password should be required:

ssh manager

ssh worker

12. Configure NCCL environment variables

Add the following to ~/.bashrc:

# Pay attention to the NCCL settings: specify the network interface NCCL should use for communication according to your machine. Here it is eth0, which can be checked with ifconfig -a.

export NCCL_SOCKET_IFNAME=eth0
export NCCL_DEBUG=INFO

Then don't forget to run source ~/.bashrc so the changes take effect; do the same on the worker node.
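
Before kicking off a full training run, it can be worth confirming that NCCL really can talk across the two containers over the interface configured above. The original post does not include such a check; the sketch below is one minimal way to do it. The file name nccl_check.py is arbitrary, and it assumes PyTorch is installed and that the script is started with a distributed launcher (e.g. torchrun or the deepspeed launcher) that exports the usual RANK/WORLD_SIZE/LOCAL_RANK/MASTER_ADDR variables.

# nccl_check.py -- minimal NCCL sanity check; not part of the original post.
# Run it with a distributed launcher that sets RANK/WORLD_SIZE/LOCAL_RANK/MASTER_ADDR,
# e.g. torchrun or the deepspeed launcher.
import os

import torch
import torch.distributed as dist


def main():
    # env:// rendezvous: rank, world size and master address come from the launcher.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # Each rank contributes its own rank id; after all_reduce every rank
    # should print the sum 0 + 1 + ... + (world_size - 1).
    x = torch.tensor([float(dist.get_rank())], device="cuda")
    dist.all_reduce(x)
    print(f"rank {dist.get_rank()}/{dist.get_world_size()}: all_reduce -> {x.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()

Because NCCL_DEBUG=INFO is set, the output of this check also shows which interface and transport (socket vs. InfiniBand) NCCL actually picked.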

II. Distributed Training

This experiment fine-tunes the BLOOM-7B model.

The Hugging Face Trainer has built-in DeepSpeed support, so enabling it only requires a single configuration parameter (a deepspeed argument pointing at a DeepSpeed config file):

1. Prepare the configuration files: hostfile and ds_config_1.json

# slots is the number of GPUs available on the corresponding machine
manager slots=4
worker slots=4

The contents of ds_config_1.json (the file passed as the DeepSpeed config):

{
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "zero_allow_untested_optimizer": true,
    "bfloat16": {
        "enabled": false
    },
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "zero_optimization": {
        "stage": 1,
        "reduce_bucket_size": 5e8
    },
    "steps_per_print": 10,
    "wall_clock_breakdown": false,
    "checkpoint": {
        "use_node_local_storage": true
    }
}

This experiment uses ZeRO stage 1. In practice, choose the stage and related settings according to your model size and GPU resources. The "auto" entries in the config are filled in at runtime by the Hugging Face Trainer from the corresponding TrainingArguments.

For details, see the DeepSpeed documentation: Zero Redundancy Optimizer - DeepSpeed

2. Launch training

deepspeed --hostfile hostfile finetune.py --deepspeed ./ds_config_1.json
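
The post never shows finetune.py itself. The sketch below is only an illustration of how such a script might wire the DeepSpeed config in through TrainingArguments; the checkpoint name bigscience/bloom-7b1 stands in for the "bloom-7B" model mentioned above, and the data file train.json, its text column, and the sequence length are hypothetical placeholders.

# finetune.py -- a minimal sketch only; the actual training script is not shown in the post.
# The data file "train.json" and its "text" column are hypothetical placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    HfArgumentParser,
    Trainer,
    TrainingArguments,
)


def main():
    # The deepspeed launcher forwards "--deepspeed ./ds_config_1.json" (and --local_rank)
    # to this script; HfArgumentParser parses them into TrainingArguments, and the
    # Trainer then initializes DeepSpeed from that config on every rank.
    parser = HfArgumentParser(TrainingArguments)
    (training_args,) = parser.parse_args_into_dataclasses()

    tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-7b1")
    model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-7b1")

    dataset = load_dataset("json", data_files="train.json")["train"]

    def tokenize(example):
        return tokenizer(example["text"], truncation=True, max_length=1024)

    dataset = dataset.map(tokenize, remove_columns=dataset.column_names)

    trainer = Trainer(
        model=model,
        args=training_args,  # the deepspeed config path lives here
        train_dataset=dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()
    trainer.save_model()


if __name__ == "__main__":
    main()

In a real run the launch command would also pass the usual TrainingArguments flags (for example --output_dir, --per_device_train_batch_size, --gradient_accumulation_steps, --fp16) after finetune.py, on top of --deepspeed.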

3. Results

root@b50557cdc89c:/home/user/code# nvidia-smi
Wed May 29 02:08:43 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A800 80G...  Off  | 00000000:34:00.0 Off |                    0 |
| N/A   56C    P0    83W / 300W |  79577MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A800 80G...  Off  | 00000000:35:00.0 Off |                    0 |
| N/A   54C    P0    78W / 300W |  80555MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A800 80G...  Off  | 00000000:9D:00.0 Off |                    0 |
| N/A   62C    P0    99W / 300W |  80379MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A800 80G...  Off  | 00000000:9E:00.0 Off |                    0 |
| N/A   59C    P0    91W / 300W |  80763MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

III. Conclusion

When NCCL communicates over plain TCP sockets, multi-node DeepSpeed training is very slow, slower even than single-node multi-GPU training. InfiniBand (IB) is recommended instead, but it requires the corresponding hardware.

IV. nccl-tests

nccl-tests shows that multi-node NCCL over sockets reaches only about a tenth of the single-node bandwidth: roughly 0.3 GB/s across nodes versus roughly 4 GB/s on a single node.

# Install NCCL (it can now be installed from the package repositories)

apt install libnccl2 libnccl-dev

# Check that the installation succeeded
ldconfig -p | grep libnccl

# Install MPICH
apt-get install mpich


# Build nccl-tests
# Download from https://github.com/NVIDIA/nccl-tests or clone:
git clone https://github.com/NVIDIA/nccl-tests.git

cd nccl-tests
make MPI=1

# Single-node test
# Via MPI (one process per GPU)
mpirun -np 4 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
# Single process driving all 4 GPUs
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4

# Multi-node test via MPI (8 ranks across the two hosts, one GPU per rank)
mpirun -np 8 -hosts manager,worker -map-by slot -env NCCL_DEBUG INFO -env NCCL_SOCKET_IFNAME eth0 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1

V. References

Multi-node, multi-GPU distributed training of large models with DeepSpeed in Docker containers (Jianshu)

What to do when you get a GPU server (Zhihu)

[docker] List of NVIDIA CUDA images (CSDN blog)

https://github.com/NVIDIA/nccl/issues/318

DistributedDataParallel on multiple GPU nodes slower than one GPU node - #2 by mrshenli - PyTorch Forums

nccl-test usage guide (Tencent Cloud Developer Community)
