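So far I have only used single-node multi-GPU setups for fine-tuning at work. To improve training efficiency, this post experiments with multi-node multi-GPU distributed training.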
I. Environment preparation
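This experiment uses two machines (manager and worker) running Ubuntu 22.04, each with 4 GPUs.
To keep installation and configuration identical on both machines, everything runs in Docker containers. Installing Docker itself is not covered here.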
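Before going further, it can help to confirm that each machine really exposes the expected 4 GPUs. The following is a minimal sketch, not part of the original setup: it assumes the NVIDIA driver is installed on the host so that nvidia-smi is available, and the function names count_gpu_lines and gpu_count are hypothetical.

```shell
# Count GPU entries in the output of `nvidia-smi -L`, read from stdin.
# Each GPU is listed on its own line starting with "GPU <index>:".
count_gpu_lines() {
    grep -c '^GPU '
}

# Print the number of GPUs visible on this host.
# Assumes the NVIDIA driver (and therefore nvidia-smi) is installed.
gpu_count() {
    nvidia-smi -L | count_gpu_lines
}
```

Running gpu_count on the manager and on the worker should print the same number on both if the two machines are configured alike.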
1. Network configuration: create a shared overlay network
Initialize the swarm by running the following on the manager machine:
docker swarm init
# Output:
Swarm initialized: current node (k4ehuhg4a2umpjoo7yovy1caf) is now a manager.
To add a worker to this swarm, run the following command:
docker swarm join --token SWMTKN-1-686yitjn5p5twd3b3pzezqofd8dlk1wm6juqo3xb5bj4xzztvh-15obj4grc8p8mqul8qvmupkdi 192.168.11.11:2377
To add a manager to this swarm, run 'docker swarm join-token manager' and follow the instructions.
Join the swarm by running the following on the worker machine:
docker swarm join --token SWMTKN-1-686yitjn5p5twd3b3pzezqofd8dlk1wm6juqo3xb5bj4xzztvh-15obj4grc8p8mqul8qvmupkdi 192.168.11.11:2377
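Once the worker has joined, membership can be checked from the manager. The following is a minimal sketch rather than part of the original procedure; the function names are hypothetical wrappers around the standard docker swarm join-token and docker node ls commands.

```shell
# Reprint the full `docker swarm join ...` command for worker nodes.
# Useful if the output of `docker swarm init` was lost. Run on the manager.
show_worker_join_command() {
    docker swarm join-token worker
}

# Count swarm nodes whose status is "Ready". Run on the manager.
# `--format '{{.Status}}'` prints one status per node, one per line.
count_ready_nodes() {
    docker node ls --format '{{.Status}}' | grep -c '^Ready$'
}
```

For the two-machine setup described here, count_ready_nodes is expected to report both nodes once the worker has joined, assuming both are up and reachable.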
On the manager, create the overlay network:
docker network create --driver=overlay --attachable test-net
Run docker network ls to check the current networks. The last line of the output shows that test-net has been created:
NETWORK ID     NAME                      DRIVER    SCOPE
ec8c853e521d   bridge                    bridge    local
72574615b63f   docker_gwbridge           bridge    local
9fbe2f6c3b22   freeaskinternet_default   bridge    local
b8273bdcc836   host                      host      local
ii71ul2agult   ingress                   overlay   swarm
eadcc6c24a81   none                      null      local
fxnzpd6r1hr0   sharednet                 overlay   swarm
wdoj2fcw29np   test-net                  overlay   swarm
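To confirm that containers on the two machines can actually reach each other over test-net, a throwaway container can be started on each host and pinged by name. This is a sketch under assumptions: the alpine image, the container name probe-manager, and the function names are illustrative choices, not part of the original setup. Note that an attachable overlay network typically only appears in docker network ls on the worker after a container there attaches to it.

```shell
# On the manager: start a short-lived container attached to the overlay network.
# The optional first argument is the network name (defaults to test-net).
start_probe() {
    docker run -d --rm --name probe-manager --network "${1:-test-net}" alpine sleep 600
}

# On the worker: ping the manager-side container by name across the overlay network.
ping_probe() {
    docker run --rm --network "${1:-test-net}" alpine ping -c 3 probe-manager
}

# On the manager: stop the probe container (it is removed automatically via --rm).
stop_probe() {
    docker stop probe-manager
}
```

Run start_probe on the manager, then ping_probe on the worker; successful replies indicate that name resolution and routing over the overlay network work. Finish with stop_probe on the manager.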
2. Install docker-compose
sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/bin/docker-compose
sudo chmod +x /usr/bin/docker-compose
docker-compose --version
3. Create the working directory work
mkdir work
cd work
4. In work, create a file named Dockerfile
#Dockerfile
FROM nvcr.io/nvidia/cuda:12.2.0-devel-ubuntu22.04
# Update the package index and install build tools and utilities
RUN apt-get update && apt-get install -y git build-essential zlib1g-dev libncurses5-dev libgdbm-dev libnss3-dev libssl-dev libsqlite3-dev libreadline-dev libffi-dev liblzma-dev libbz2-dev curl wget net-tools iputils-ping pdsh
# Build and install Python from source
WORKDIR /home/user
RUN wget https://www.python.org/ftp/python/3.10.6/Python-3.10.6.tgz && \
tar -zvxf Python-3.10.6.tgz && cd Python-3.10.6 && \
./configure --enable-optimizations &&