How to Set Up a Distributed Deep Learning Environment on a Single Machine: PyTorch + CUDA + MPI on LXC

Environment: Manjaro (Arch Linux)
Framework: PyTorch 1.0 with CUDA 10.0
NVIDIA driver: 410
CUDA: 10.0

Preface

To run distributed experiments on a single machine, I chose LXC as the container runtime (roughly comparable to Docker; Docker was originally built on top of LXC) so that one PC can host several experiment environments. LXC makes it easy to manage containers and to clone images.
LXC also lets multiple containers share a single GPU: bind-mounting the GPU device files into each LXC container is all it takes.
Before installing, make sure the host already has the NVIDIA driver installed, and use the same driver version inside the LXC containers!

The installation has three main difficulties:

  1. If you install the driver the usual way, the installer will complain that it cannot unload the kernel module, and the driver will not work. This is actually expected: the container shares the host's kernel. We don't need the kernel module inside the container anyway, only the userspace libraries, so passing --no-kernel-module to the driver installer solves the problem.
  2. A stock Open-MPI install does not support CUDA; it has to be rebuilt from source.
  3. A stock PyTorch install does not support MPI; it also has to be rebuilt.

Note: I have only recorded the key parts of the configuration process here; when installing, please also read the pages and documents I reference.

1. Configuring LXC

1.1 Installing LXC

# install LXC from the system package repositories
# start the services
sudo systemctl restart lxc.service
sudo systemctl restart lxc-net.service

# generate the nvidia-uvm device node
# this device file is not created automatically, so create it by hand (as root):
/sbin/modprobe nvidia-uvm
D=$(grep nvidia-uvm /proc/devices | awk '{print $1}')
mknod -m 666 /dev/nvidia-uvm c "$D" 0
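Before moving on, it is worth confirming that all the device nodes the containers will later bind-mount actually exist on the host:

```shell
# expect at least nvidia0, nvidiactl and nvidia-uvm
# (nvidia-modeset may also appear depending on the driver)
ls -l /dev/nvidia*
```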

1.2 Network configuration on the host

a. Edit /etc/default/lxc-net and use the following configuration:

# Leave USE_LXC_BRIDGE as "true" if you want to use lxcbr0 for your
# containers.  Set to "false" if you'll use virbr0 or another existing
# bridge, or macvlan to your host's NIC.
USE_LXC_BRIDGE="true"

# If you change the LXC_BRIDGE to something other than lxcbr0, then
# you will also need to update your /etc/lxc/default.conf as well as the
# configuration (/var/lib/lxc/<container>/config) for any containers
# already created using the default config to reflect the new bridge
# name.
# If you have the dnsmasq daemon installed, you'll also have to update
# /etc/dnsmasq.d/lxc and restart the system wide dnsmasq daemon.
LXC_BRIDGE="lxcbr0"
LXC_ADDR="10.0.3.1"
LXC_NETMASK="255.255.255.0"
LXC_NETWORK="10.0.3.0/24"
LXC_DHCP_RANGE="10.0.3.2,10.0.3.254"
LXC_DHCP_MAX="253"
# Uncomment the next line if you'd like to use a conf-file for the lxcbr0
# dnsmasq.  For instance, you can use 'dhcp-host=mail1,10.0.3.100' to have
# container 'mail1' always get ip address 10.0.3.100.
#LXC_DHCP_CONFILE=/etc/lxc/dnsmasq.conf

# Uncomment the next line if you want lxcbr0's dnsmasq to resolve the .lxc
# domain.  You can then add "server=/lxc/10.0.3.1' (or your actual $LXC_ADDR)
# to your system dnsmasq configuration file (normally /etc/dnsmasq.conf,
# or /etc/NetworkManager/dnsmasq.d/lxc.conf on systems that use NetworkManager).
# Once these changes are made, restart the lxc-net and network-manager services.
# 'container1.lxc' will then resolve on your host.
#LXC_DOMAIN="lxc"

b. Enable dnsmasq (use the host as the DNS server)

1. Install dnsmasq.
2. Add "server=/lxc/10.0.3.1" (or your actual $LXC_ADDR) to your system dnsmasq configuration file (normally /etc/dnsmasq.conf, or /etc/NetworkManager/dnsmasq.d/lxc.conf on systems that use NetworkManager).
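For example, on a NetworkManager-based host the snippet could look like this (using the default LXC_ADDR from the configuration above):

```shell
# /etc/NetworkManager/dnsmasq.d/lxc.conf
# forward lookups for the .lxc domain to the dnsmasq instance serving lxcbr0
server=/lxc/10.0.3.1
```

After adding it, restart the lxc-net and NetworkManager services; container names such as Pytorch_1.lxc will then resolve on the host.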

c. Configure LXC

Edit /etc/lxc/default.conf and replace the original network configuration with:

lxc.net.0.type = veth
lxc.net.0.link = lxcbr0
lxc.net.0.flags = up
lxc.net.0.hwaddr = 00:16:3e:17:18:19

d. Install an LXC template

Choose the distribution release you want to install:

sudo lxc-create -n Pytorch_1 -t download -- --server=mirrors.tuna.tsinghua.edu.cn/lxc-images
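To confirm the container was created (and, once started, that it picked up an address on lxcbr0), the standard LXC tools can be used:

```shell
# list all containers with their state and IPv4 address
sudo lxc-ls -f
# detailed info for a single container
sudo lxc-info -n Pytorch_1
```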

e. Mount the GPU devices

sudo vim /var/lib/lxc/Pytorch_1/config

Add:

# Nvidia
lxc.mount.entry = /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
lxc.mount.entry = /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
lxc.mount.entry = /dev/nvidia-modeset dev/nvidia-modeset none bind,optional,create=file
lxc.mount.entry = /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file
lxc.mount.entry = /share share none bind,create=dir

That completes the configuration; the container now has network access. (The /share entry bind-mounts a host directory into the container; it is used later to pass in the driver installer, the Open-MPI tarball, and the Anaconda installer, so create /share on the host first.)

2. Configuring the Environment Inside the LXC Container

sudo lxc-start -n Pytorch_1
sudo lxc-attach -n Pytorch_1

# from here on, all commands run inside the container

# optional: route traffic through the shadowsocks (SOCKS5) proxy running on the host
export ALL_PROXY=socks5://10.0.3.1:1081

# switch to faster mirrors
# add these at the top of /etc/pacman.d/mirrorlist:
Server = https://mirrors.tuna.tsinghua.edu.cn/archlinux/$repo/os/$arch
Server = http://mirrors.163.com/archlinux/$repo/os/$arch
Server = http://mirrors.ustc.edu.cn/archlinux/$repo/os/$arch
Server = http://mirrors.cqu.edu.cn/archlinux/$repo/os/$arch
Server = http://mirror.lzu.edu.cn/archlinux/$repo/os/$arch
Server = http://mirrors.neusoft.edu.cn/archlinux/$repo/os/$arch

# install base development tools
# (on an Ubuntu container, use apt-get instead; see its documentation for details)
pacman -S base-devel

# add a user
useradd -m -G wheel newuser
passwd newuser
passwd root

# run visudo and grant the user passwordless sudo
visudo
# after the line "root ALL=(ALL) ALL", add:
newuser ALL=(ALL) NOPASSWD: ALL

# install the NVIDIA driver (userspace libraries only)
# --no-kernel-module is required: the container shares the host's kernel
sh /NVIDIA-Linux-x86_64-410.66.run --no-kernel-module
# verify the driver installed correctly:
nvidia-smi

# install cuda cudnn 
pacman -S --force cuda
pacman -S cudnn

# test that CUDA works: build and run the deviceQuery sample;
# if it prints the GPU details and ends with "Result = PASS", CUDA is OK
cd /opt/cuda/samples/1_Utilities/deviceQuery
make
./deviceQuery

3. Building Open-MPI with CUDA Support

A stock Open-MPI build does not support CUDA, so it has to be rebuilt from source:

# cp mpi source files
# file source : https://www.open-mpi.org/software/ompi/v3.0/downloads/openmpi-3.0.0.tar.gz
cp /share/openmpi-3.0.0.tar.gz . -r
#install mpi with cuda
#guide : https://discuss.pytorch.org/t/segfault-using-cuda-with-openmpi/11140
gunzip -c openmpi-3.0.0.tar.gz | tar xf -
cd openmpi-3.0.0
./configure --prefix=/home/$USER/.openmpi --with-cuda=PATH_TO_CUDA # on Arch, CUDA lives in /opt/cuda
make
sudo make install
# add mpi bin to PATH
export PATH="$PATH:/home/$USER/.openmpi/bin"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/home/$USER/.openmpi/lib/"
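To verify that the resulting build is actually CUDA-aware, you can query its build parameters; `mpi_built_with_cuda_support` should report `true`:

```shell
# prints e.g. mca:mpi:base:param:mpi_built_with_cuda_support:value:true
ompi_info --parsable --all | grep mpi_built_with_cuda_support
```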

Install conda

sudo sh /share/Anaconda3-5.3.0-Linux-x86_64.sh
export CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" # [anaconda root directory]

Install basic dependencies

conda install numpy pyyaml mkl mkl-include setuptools cmake cffi typing
conda install -c mingfeima mkldnn

4. Rebuilding PyTorch with CUDA and MPI Support

# use makepkg to build pytorch-mpi-cuda in lxc
cp /PATH/TO/PKGBUILD ~
cd ~
makepkg
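makepkg only builds the package; it still has to be installed. Afterwards, a quick sanity check confirms both backends are present (the exact package file name depends on the pkgver makepkg generated):

```shell
# install the freshly built package
sudo pacman -U python-pytorch-magma-mkldnn-cudnn-git-*.pkg.tar.xz
# both commands should print True
python -c "import torch; print(torch.cuda.is_available())"
python -c "import torch.distributed as dist; print(dist.is_mpi_available())"
```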

The contents of the PKGBUILD:

pkgname=("python-pytorch-magma-mkldnn-cudnn-git")
_pkgname="pytorch"
pkgver=v1.0rc0.r122.gc2a57d082d
pkgrel=1
pkgdesc="Tensors and Dynamic neural networks in Python with strong GPU acceleration"
arch=('x86_64')
url="https://pytorch.org"
license=('BSD')
depends=('cuda' 'cudnn' 'magma')
makedepends=('cmake' 'git' 'nccl')
provides=('python-pytorch')
conflicts=('python-pytorch')
source=("git+https://github.com/pytorch/pytorch.git"
        "git+https://github.com/catchorg/Catch2"
        "git+https://github.com/pybind/pybind11"
        "git+https://github.com/NVlabs/cub"
        "git+https://github.com/eigenteam/eigen-git-mirror"
        "git+https://github.com/google/googletest"
        "git+https://github.com/NervanaSystems/nervanagpu"
        "git+https://github.com/google/benchmark"
        "git+https://github.com/google/protobuf"
        "git+https://github.com/Yangqing/ios-cmake"
        "git+https://github.com/Maratyszcza/NNPACK"
        "git+https://github.com/facebookincubator/gloo"
        "git+https://github.com/Maratyszcza/pthreadpool"
        "git+https://github.com/Maratyszcza/FXdiv"
        "git+https://github.com/Maratyszcza/FP16"
        "git+https://github.com/Maratyszcza/psimd"
        "git+https://github.com/facebook/zstd"
        "git+https://github.com/Maratyszcza/cpuinfo"
        "git+https://github.com/PeachPy/enum34"
        "git+https://github.com/Maratyszcza/PeachPy"
        "git+https://github.com/benjaminp/six"
        "git+https://github.com/ARM-software/ComputeLibrary"
        "git+https://github.com/onnx/onnx"
        "git+https://github.com/USCILab/cereal"
        "git+https://github.com/onnx/onnx-tensorrt"
        "git+https://github.com/shibatch/sleef"
        "git+https://github.com/intel/ideep"
        )
sha256sums=('SKIP'
            'SKIP'
            'SKIP'
            'SKIP'
            'SKIP'
            'SKIP'
            'SKIP'
            'SKIP'
            'SKIP'
            'SKIP'
            'SKIP'
            'SKIP'
            'SKIP'
            'SKIP'
            'SKIP'
            'SKIP'
            'SKIP'
            'SKIP'
            'SKIP'
            'SKIP'
            'SKIP'
            'SKIP'
            'SKIP'
            'SKIP'
            'SKIP'
            'SKIP'
            'SKIP'
            )
pkgver() {
  cd "${_pkgname}"
  git describe --long --tags | sed 's/\([^-]*-g\)/r\1/;s/-/./g'
}

prepare() {
  cd "${_pkgname}"

  git submodule init
  git config submodule."third_party/catch".url "${srcdir}"/Catch2
  git config submodule."third_party/pybind11".url "${srcdir}"/pybind11
  git config submodule."third_party/cub".url "${srcdir}"/cub
  git config submodule."third_party/eigen".url "${srcdir}"/eigen-git-mirror
  git config submodule."third_party/googletest".url "${srcdir}"/googletest
  git config submodule."third_party/nervanagpu".url "${srcdir}"/nervanagpu
  git config submodule."third_party/benchmark".url "${srcdir}"/benchmark
  git config submodule."third_party/protobuf".url "${srcdir}"/protobuf
  git config submodule."third_party/ios-cmake".url "${srcdir}"/ios-cmake
  git config submodule."third_party/NNPACK".url "${srcdir}"/NNPACK
  git config submodule."third_party/gloo".url "${srcdir}"/gloo
  git config submodule."third_party/NNPACK_deps/pthreadpool".url "${srcdir}"/pthreadpool
  git config submodule."third_party/NNPACK_deps/FXdiv".url "${srcdir}"/FXdiv
  git config submodule."third_party/NNPACK_deps/FP16".url "${srcdir}"/FP16
  git config submodule."third_party/NNPACK_deps/psimd".url "${srcdir}"/psimd
  git config submodule."third_party/zstd".url "${srcdir}"/zstd
  git config submodule."third_party/cpuinfo".url "${srcdir}"/cpuinfo
  git config submodule."third_party/python-enum".url "${srcdir}"/enum34
  git config submodule."third_party/python-peachpy".url "${srcdir}"/PeachPy
  git config submodule."third_party/python-six".url "${srcdir}"/six
  git config submodule."third_party/ComputeLibrary".url "${srcdir}"/ComputeLibrary
  git config submodule."third_party/onnx".url "${srcdir}"/onnx
  git config submodule."third_party/cereal".url "${srcdir}"/cereal
  git config submodule."third_party/onnx-tensorrt".url "${srcdir}"/onnx-tensorrt
  git config submodule."third_party/sleef".url "${srcdir}"/sleef
  git config submodule."third_party/ideep".url "${srcdir}"/ideep
  git submodule update
}

build() {
  export USE_OPENCV=OFF # Caffe2 is not compatible with OpenCV4: pending https://github.com/pytorch/pytorch/pull/9966
  export USE_FFMPEG=ON
  export USE_MKLDNN=ON
  export USE_NNPACK=ON # A bit redundant with MKLDNN hopefully PyTorch choose the best depending on op
  export USE_CUDA=ON
  export USE_CUDNN=ON
  export USE_NERVANAGPU=OFF # Hopefully CUDNN integrated those
  export USE_OPENCL=ON
  export USE_OPENMP=ON
  export USE_NUMPY=ON
  export USE_MAGMA=ON
  #export CMAKE_PREFIX_PATH=/home/yan/anaconda3/bin/../
  export CC=gcc-7
  export CXX=g++-7
  export CUDAHOSTCXX=g++-7
  export CUDA_HOME=/opt/cuda
  export CUDNN_LIB_DIR=/opt/cuda/lib64
  export CUDNN_INCLUDE_DIR=/opt/cuda/include
  export TORCH_CUDA_ARCH_LIST="6.1" # Consumer Pascal
  export MAGMA_HOME=/opt/magma 
  export OPENCV_INCLUDE_DIRS=/usr/include/opencv4
  export FFMPEG_INCLUDE_DIR=/usr/include # libavcodec, libavutils
  export FFMPEG_LIBRARIES=/usr/lib # libavcodec
  # export CUB_INCLUDE_DIRS # For system CUB, otherwise PyTorch picks it from thirdparty submodules

  # unfortunately PyTorch doesn't pick up Intel OpenMP
  # And Caffe2 doesn't pick up any OpenMP at all (because I didn't install LLVM OMP runtime)
  # https://github.com/pytorch/pytorch/issues/12535

  #source /opt/intel/mkl/bin/mklvars.sh intel64
  #source /opt/intel/pkg_bin/compilervars.sh intel64


  cd "$srcdir/${_pkgname}"
  python setup.py build
  # srcdir/${_pkgname} = /home/yan/build_pytorch/src/pytorch
}

package() {
  cd "$srcdir/${_pkgname}"
   # pkgdir = /home/yan/build_pytorch/pkg/python-pytorch-magma-mkldnn-cudnn-git
  python setup.py install --root="$pkgdir"/ --optimize=1 --skip-build
  install -Dm644 LICENSE "${pkgdir}/usr/share/licenses/${pkgname}/LICENSE.txt"
}
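As a final end-to-end check, here is a minimal two-rank MPI smoke test (a sketch, assuming the rebuilt PyTorch is installed and the CUDA-aware mpirun from ~/.openmpi is on PATH):

```shell
# launch two ranks on this machine; each initializes the MPI backend
# and reports its rank and the world size
mpirun -np 2 python -c "
import torch.distributed as dist
dist.init_process_group(backend='mpi')
print('rank', dist.get_rank(), 'of', dist.get_world_size())
"
```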



Author: Yanring_
Link: https://www.jianshu.com/p/a41ccde46087
Source: Jianshu (简书)
Copyright belongs to the author. For commercial reproduction, please contact the author for authorization; for non-commercial reproduction, please credit the source.
