Prerequisites:
1. Install Docker
2. Install nvidia-docker
Official guide: Installation Guide (NVIDIA Cloud Native Technologies documentation)
On Ubuntu (run the commands below one step at a time):
(1) Setting up NVIDIA Container Toolkit
Set up the stable repository and the GPG key:
$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
&& curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
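To see what the $distribution variable above expands to before adding the repository, the lookup can be sketched as a small function. The name distribution_id and its optional file argument are illustrative additions, not part of NVIDIA's guide.

```shell
# Print the "$ID$VERSION_ID" string used in the repository URL, e.g. ubuntu18.04.
# distribution_id is a hypothetical helper; the optional argument lets you
# point it at a copy of os-release instead of the real /etc/os-release.
distribution_id() {
    local os_release="${1:-/etc/os-release}"
    # Source the file in a subshell so its variables do not leak into this shell.
    ( . "$os_release" && printf '%s%s\n' "$ID" "$VERSION_ID" )
}

# Only query the real file when it exists.
if [ -r /etc/os-release ]; then
    echo "distribution=$(distribution_id)"
fi
```

If the printed value is not one of the distributions listed in the official guide, the repository URL built from it will not resolve.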
Note: To get access to experimental features such as CUDA on WSL or the new MIG capability on A100, you may want to add the experimental branch to the repository listing:
$ curl -s -L https://nvidia.github.io/nvidia-container-runtime/experimental/$distribution/nvidia-container-runtime.list | sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
Install the nvidia-docker2 package (and dependencies) after updating the package listing:
$ sudo apt-get update
$ sudo apt-get install -y nvidia-docker2
Restart the Docker daemon to complete the installation after setting the default runtime:
$ sudo systemctl restart docker
At this point, a working setup can be tested by running a base CUDA container:
$ sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
This should print the nvidia-smi table, listing the driver version, the CUDA version, and the detected GPUs.
Example 1: Using a pre-built container
1. Pull the pre-built Horovod Docker image
Official site: Docker Hub
docker pull horovod/horovod
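Running on a single machine
Start the container with nvidia-docker. Replace horovod/horovod:latest with your own image name and tag as needed.
(If the container cannot reach the internet, you need to pass proxy environment variables; this is covered further down.)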
nvidia-docker run -it horovod/horovod:latest
Once the command finishes, the prompt changes to a shell inside the container, and you can start running programs.
Go to the /examples/pytorch directory and run:
horovodrun -np 2 -H localhost:2 python pytorch_mnist.py
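-np is the number of training processes; -H localhost:2 means the local machine with 2 slots.
In short: run distributed training on two GPUs of the local machine.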
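The relationship between -np and -H can be made concrete with a short sketch: the total process count is the sum of the slots listed per host. The helper name total_slots is hypothetical, and the sketch only prints the resulting command rather than running it.

```shell
# Sum the slot counts in a horovodrun-style host list such as "localhost:2".
# total_slots is a hypothetical helper; it assumes every entry is host:slots.
total_slots() {
    local hosts="$1" total=0 entry slots
    # Split the list on commas; IFS is restored when the function returns.
    local IFS=','
    for entry in $hosts; do
        slots="${entry##*:}"
        total=$((total + slots))
    done
    echo "$total"
}

hosts="localhost:2"
# Print (not run) the command, with -np derived from the host list.
echo "horovodrun -np $(total_slots "$hosts") -H $hosts python pytorch_mnist.py"
```

For a single machine this reproduces the command above; with more than one host entry, the same sum gives the value to pass to -np.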
Note: if you hit "horovodrun: command not found", Horovod needs to be installed (though it should already be installed from the earlier steps).
(1) To run on CPUs:
pip install horovod
(2) To run on GPU with NCCL:
HOROVOD_GPU_OPERATIONS=NCCL pip install horovod
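Before reinstalling, it is worth confirming whether the command is really missing from PATH. A minimal check, assuming a POSIX-style shell inside the container; have_cmd is a hypothetical helper name.

```shell
# Return success if the named command is on PATH. have_cmd is a hypothetical helper.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd horovodrun; then
    echo "horovodrun found at: $(command -v horovodrun)"
else
    # Only print the suggestion; installing is left to the reader.
    echo "horovodrun not found; install with: HOROVOD_GPU_OPERATIONS=NCCL pip install horovod"
fi
```

Once it is installed, horovodrun --check-build should report which frameworks and tensor operations the build supports.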
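Error: a program run inside Docker reports "Failed to download" and cannot reach the internet
Cause: domain names cannot be resolved because the container has no access to the external network.
The fix is to pass proxy environment variables when starting the container. The relevant docker run options can be listed with: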
docker run --help | grep env
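Solution:
Add the environment variables when running docker. Once the container has started, type env inside it to list the environment variables; the proxy settings you added should now be present:
nvidia-docker run -it -e http_proxy="<address>" -e https_proxy="<address>" -e no_proxy="<address>" horovod/horovod:latest
Note: choose the addresses to suit your setup. The first two can be the current server's IP address and port, for example http://127.0.0.1:1234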
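If the proxy variables are already set in the host shell, they can be forwarded instead of typed out. The sketch below builds the -e flags from whichever of the three variables are set. proxy_flags is a hypothetical helper, and the command is printed for inspection rather than executed.

```shell
# Emit a "-e name=value" flag for each proxy variable set in the current
# environment. proxy_flags is a hypothetical helper name.
proxy_flags() {
    local name value flags=""
    for name in http_proxy https_proxy no_proxy; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -n "$value" ]; then
            flags="$flags -e $name=$value"
        fi
    done
    # Strip the leading space.
    echo "${flags# }"
}

# Print the command; values containing spaces would need extra quoting.
echo "nvidia-docker run -it $(proxy_flags) horovod/horovod:latest"
```

Docker also accepts -e with just a variable name (for example -e http_proxy), which copies that variable's value from the host environment.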
Run again:
horovodrun -np 2 python pytorch_mnist.py
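Multi-GPU training now starts. That concludes Example 1 for a single machine with multiple GPUs ❀
Running on multiple machines with multiple GPUs
There is a lot to cover here, so it is written up in the next post.
Example 2: Customizing the environment by modifying the Dockerfile (to be completed)
The source repository ships Dockerfiles so that the environment can be set up quickly with Docker. The resulting container includes the Horovod examples under /examples.
1. Building
First edit the Dockerfile to suit your needs, including the CUDA, TensorFlow, and PyTorch versions:
See README.md for what the three folders mean; here we use horovod, the CUDA-enabled one.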
$ mkdir horovod-docker-gpu
$ wget -O horovod-docker-gpu/Dockerfile https://raw.githubusercontent.com/horovod/horovod/master/Dockerfile.gpu
$ docker build -t horovod:latest horovod-docker-gpu
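To find which lines to edit, you can list the version settings in the downloaded Dockerfile before building. This is only a sketch: it assumes versions are declared on ARG or ENV lines whose names contain VERSION, which may not hold for every revision of the file. list_versions is a hypothetical helper.

```shell
# Print ARG/ENV lines that mention VERSION, with line numbers, from a Dockerfile.
# list_versions is a hypothetical helper; the naming pattern is an assumption.
list_versions() {
    grep -nE '^(ARG|ENV)[[:space:]]+[A-Za-z_]*VERSION' "$1"
}

dockerfile="horovod-docker-gpu/Dockerfile"
if [ -f "$dockerfile" ]; then
    list_versions "$dockerfile"
else
    echo "no Dockerfile at $dockerfile yet; run the wget step first"
fi
```

Edit the reported lines, then rerun the docker build command so the image picks up the new versions.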
2. Install NCCL
Official link: Installation Guide :: NVIDIA Deep Learning NCCL Documentation
In the following commands, please replace <architecture> with your CPU architecture: x86_64, ppc64le, or sbsa, and replace <distro> with the Ubuntu version, for example ubuntu1604, ubuntu1804, or ubuntu2004.
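The two placeholders can be derived from the running system. A sketch, assuming an Ubuntu host: nccl_distro and nccl_key_url are hypothetical helper names, and the URL pattern is the one used in the key commands that follow.

```shell
# Turn an Ubuntu VERSION_ID such as "18.04" into the <distro> form "ubuntu1804".
# nccl_distro is a hypothetical helper and assumes the host runs Ubuntu.
nccl_distro() {
    echo "ubuntu$(echo "$1" | tr -d '.')"
}

# Build the key URL for a given <distro> and <architecture> (hypothetical helper).
nccl_key_url() {
    echo "https://developer.download.nvidia.com/compute/cuda/repos/$1/$2/7fa2af80.pub"
}

# On a real host, fill in both values from the system.
# Caveat: ARM servers report aarch64 from uname -m, while the repository uses sbsa.
if [ -r /etc/os-release ]; then
    version_id=$( . /etc/os-release && echo "${VERSION_ID:-}" )
    nccl_key_url "$(nccl_distro "$version_id")" "$(uname -m)"
fi
```

The printed URL is only as valid as the inputs, so compare it with the repository listing before passing it to apt-key.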
- Install the keys. (Fetching the key kept failing on my intranet, so I downloaded 7fa2af80.pub separately and then ran sudo apt-key add 7fa2af80.pub.)
(1) When installing using the network repo for Ubuntu 20.04/18.04:
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/<distro>/<architecture>/7fa2af80.pub
(2) When installing using the network repo for Ubuntu 16.04:
sudo apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/<distro>/<architecture>/7fa2af80.pub
- Install the repository.
- For the local NCCL repository:
sudo dpkg -i nccl-repo-<version>.deb
- For the network repository:
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/<distro>/<architecture>/ /"
- Update the APT database:
sudo apt update
- Install the libnccl2 package with APT. Additionally, if you need to compile applications with NCCL, you can install the libnccl-dev package as well:
Note: If you are using the network repository, the following command will upgrade CUDA to the latest version.
sudo apt install libnccl2 libnccl-dev
If you prefer to keep an older version of CUDA, specify a specific version, for example:
sudo apt install libnccl2=2.4.8-1+cuda10.0 libnccl-dev=2.4.8-1+cuda10.0
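The pinned package names follow a fixed pattern, so they can be generated from the NCCL and CUDA versions. A sketch with a hypothetical helper, nccl_pins. It only prints the install command; apt-cache madison libnccl2 shows which versions your repository actually offers.

```shell
# Print version-pinned package specs for a given NCCL version and CUDA version.
# nccl_pins is a hypothetical helper name.
nccl_pins() {
    local nccl="$1" cuda="$2" pkg out=""
    for pkg in libnccl2 libnccl-dev; do
        out="$out $pkg=$nccl+cuda$cuda"
    done
    # Strip the leading space.
    echo "${out# }"
}

# Print (not run) the install command for NCCL 2.4.8-1 built against CUDA 10.0.
echo "sudo apt install $(nccl_pins 2.4.8-1 10.0)"
```

Pinning both packages to the same version string keeps the runtime library and the development headers in step.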