Distributed Deep Learning Training, Part 2: Horovod in Docker

mygugu

Last modified: 2022-03-17 17:34:44

Categories: Docker, kubernetes. Tags: docker, distributed, Horovod

First published: 2022-03-15 18:32:43

Article link: https://blog.csdn.net/mygugu/article/details/123503334

Prerequisites:

1. Install Docker

2. Install nvidia-docker

Official documentation: the "Installation Guide" in the NVIDIA Cloud Native Technologies documentation

On Ubuntu, run the following commands one step at a time:

(1) Setting up NVIDIA Container Toolkit

Set up the stable repository and the GPG key:

$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
   && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
   && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

Note

To get access to experimental features such as CUDA on WSL or the new MIG capability on A100, you may want to add the experimental branch to the repository listing:

$ curl -s -L https://nvidia.github.io/nvidia-container-runtime/experimental/$distribution/nvidia-container-runtime.list | sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list

Install the nvidia-docker2 package (and dependencies) after updating the package listing:

$ sudo apt-get update

$ sudo apt-get install -y nvidia-docker2

Restart the Docker daemon to complete the installation after setting the default runtime:

$ sudo systemctl restart docker

At this point, a working setup can be tested by running a base CUDA container:

$ sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

This should print the nvidia-smi GPU status table to the console.

Example 1: Using the pre-built container

1. Pull the pre-built Docker image that ships with Horovod
Official image: Docker Hub

docker pull horovod/horovod

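Running on a single machine

Run the container with nvidia-docker. Replace horovod/horovod:latest with your own image name and tag as needed.

(If the container cannot reach the external network, add proxy environment variables as described later in this post.)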

nvidia-docker run -it horovod/horovod:latest

Once the command completes, the shell prompt changes: you are now inside the container and can start running programs.

Go to the /examples/pytorch directory and run:

horovodrun -np 2 -H localhost:2 python pytorch_mnist.py

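-np sets the number of training processes; -H localhost:2 names the host to run on (here the local machine) and the number of slots on it.

Taken together: run distributed training on two GPUs of the local machine.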
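To make the relationship between -np and -H concrete, here is a small sketch of a sanity check that could be run before launching. total_slots is a hypothetical helper written for this post, not part of Horovod, and it assumes every entry in the host list has the host:slots form. It only illustrates the arithmetic and is not a description of how horovodrun validates its arguments.

```shell
# total_slots: hypothetical helper (not part of Horovod).
# Sums the slot counts in a host list such as "localhost:2"
# or "server1:4,server2:4". Assumes every entry is host:slots.
total_slots() (
    IFS=','                       # split the host list on commas
    total=0
    for entry in $1; do
        slots="${entry##*:}"      # text after the last ':' is the slot count
        total=$((total + slots))
    done
    echo "$total"
)

NP=2
HOSTS="localhost:2"
if [ "$NP" -le "$(total_slots "$HOSTS")" ]; then
    echo "ok: $NP processes fit into $(total_slots "$HOSTS") slots"
else
    echo "warning: -np $NP is larger than the slots listed in -H $HOSTS" >&2
fi
```

With the values above, the two processes match the two slots on localhost, which corresponds to one process per GPU in the command shown earlier.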

Note: if you see "horovodrun: command not found", Horovod needs to be installed (in the pre-built image it should already be present):

(1) To run on CPUs:

pip install horovod

(2) To run on GPU with NCCL:

HOROVOD_GPU_OPERATIONS=NCCL pip install horovod

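Error: a program running inside Docker reports "Failed to download" because it cannot reach the external network

Cause: domain names cannot be resolved, so the container has no access to the external network.

The fix is to pass the proxy environment variables when starting the container. The relevant options can be listed with: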

docker run --help | grep env

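Solution:

Add the environment variables when starting the container. Once inside the container, run env to list the environment variables; the proxy settings you passed in should now be present: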

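nvidia-docker run -it -e http_proxy="<address>" -e https_proxy="<address>" -e no_proxy="<address>" horovod/horovod:latest
Note: choose the addresses to suit your setup. The first two can be the IP address and port of the current server, for example http://127.0.0.1:1234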
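If the proxy variables are already set in the host shell, typing each address again is error-prone. The sketch below builds the -e flags from whatever is currently set. proxy_env_flags is a hypothetical helper written for this post, and it assumes the proxy values contain no spaces. The final line only prints the command so it can be checked before running it.

```shell
# proxy_env_flags: hypothetical helper (not a Docker feature).
# Prints one "-e NAME=value" flag for each proxy variable that is
# set and non-empty in the current shell.
proxy_env_flags() (
    for name in http_proxy https_proxy no_proxy; do
        eval "value=\${$name:-}"      # read the variable whose name is in $name
        if [ -n "$value" ]; then
            printf '%s %s=%s ' -e "$name" "$value"
        fi
    done
)

# Print the resulting command instead of executing it.
echo nvidia-docker run -it $(proxy_env_flags) horovod/horovod:latest
```

Variables that are unset or empty are skipped, so the printed command carries only the proxy settings that actually exist on the host.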

Run the example again:

horovodrun -np 2 python pytorch_mnist.py

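Multi-GPU training now starts. That concludes Example 1, the single-machine, multi-GPU case.

Running on multiple machines with multiple GPUs

This part is fairly long, so it is covered in the next post:

Multi-machine, multi-GPU distributed training

Example 2: Customizing the environment by modifying the Dockerfile (work in progress)

The Horovod source repository provides Dockerfiles so that the environment can be set up quickly with Docker. The resulting container includes the Horovod examples under /examples.

1. Building

First modify the Dockerfile to suit your needs, including the CUDA, TensorFlow and PyTorch versions:

Source: Dockerfile source code

See README.md for what each of the three folders means; here we use the horovod one, which supports CUDA.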

$ mkdir horovod-docker-gpu
$ wget -O horovod-docker-gpu/Dockerfile https://raw.githubusercontent.com/horovod/horovod/master/Dockerfile.gpu
$ docker build -t horovod:latest horovod-docker-gpu

2. Install NCCL

Official link: Installation Guide :: NVIDIA Deep Learning NCCL Documentation

In the following commands, please replace <architecture> with your CPU architecture: x86_64, ppc64le, or sbsa, and replace <distro> with the Ubuntu version, for example ubuntu1604, ubuntu1804, or ubuntu2004.
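As an illustration of the substitution, the sketch below assembles the repository URL from a distro tag and an architecture. distro_tag and cuda_repo_url are hypothetical helpers written for this post; the tag format (ID followed by VERSION_ID without the dot, as found in /etc/os-release) is inferred from the examples ubuntu1604, ubuntu1804 and ubuntu2004 above.

```shell
# distro_tag: hypothetical helper. $1 = ID and $2 = VERSION_ID as found in
# /etc/os-release, e.g. "ubuntu" and "18.04" give "ubuntu1804".
distro_tag() {
    printf '%s%s\n' "$1" "$(printf '%s' "$2" | tr -d '.')"
}

# cuda_repo_url: hypothetical helper. $1 = distro tag, $2 = architecture.
# Rejects architectures other than the three named in the text.
cuda_repo_url() {
    case "$2" in
        x86_64|ppc64le|sbsa) ;;
        *) echo "unknown architecture: $2" >&2; return 1 ;;
    esac
    echo "https://developer.download.nvidia.com/compute/cuda/repos/$1/$2/"
}

cuda_repo_url "$(distro_tag ubuntu 18.04)" x86_64
```

The printed URL is the value that takes the place of the <distro>/<architecture> path in the commands that follow.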

Install the keys. (On my intranet this step kept failing, so I downloaded the key file separately and then added it with sudo apt-key add 7fa2af80.pub.)
(1) When installing using the network repo for Ubuntu 20.04/18.04:
```
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/<distro>/<architecture>/7fa2af80.pub
```
(2) When installing using the network repo for Ubuntu 16.04:
```
sudo apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/<distro>/<architecture>/7fa2af80.pub
```

Install the repository.

For the local NCCL repository:
```
sudo dpkg -i nccl-repo-<version>.deb
```

For the network repository:

sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/<distro>/<architecture>/ /"

Update the APT database:
```
sudo apt update
```
Install the libnccl2 package with APT. Additionally, if you need to compile applications with NCCL, you can install the libnccl-dev package as well:
Note: If you are using the network repository, the following command will upgrade CUDA to the latest version.
```
sudo apt install libnccl2 libnccl-dev
```
If you prefer to keep an older version of CUDA, specify a specific version, for example:
```
sudo apt install libnccl2=2.4.8-1+cuda10.0 libnccl-dev=2.4.8-1+cuda10.0
```
