Environment: Manjaro (Arch Linux)
Framework: PyTorch 1.0 with CUDA 10.0
Nvidia driver: 410
CUDA: 10.0
Preface
To run distributed experiments on a single machine, I chose LXC as the container technology (roughly comparable to Docker; Docker was originally built on top of LXC) so that one PC can host several experiment environments. LXC makes it easy to manage containers and to clone images.
LXC also lets multiple virtualized environments share a single GPU: bind-mounting the GPU device files into the LXC container is all that is needed.
Before installing, make sure the Nvidia driver is already installed on the host, and use the same driver version inside the LXC container!
The installation has three main sticking points:
- If you run the driver installer directly, it complains that the kernel module cannot be unloaded and the driver ends up broken. This is expected, because the container shares the kernel with the host. We do not need the kernel module inside the container anyway, only the user-space libraries, so passing --no-kernel-module to the driver installer solves the problem.
- Open MPI installed from the package repositories does not support CUDA and has to be rebuilt from source.
- PyTorch installed from the package repositories does not support MPI and has to be rebuilt from source.
Note: I only recorded the key parts of the setup here; please read the pages and documents I reference while installing.
1. Configure LXC
1.1 Install LXC
#install LXC
#use the distribution repositories (on Arch/Manjaro: sudo pacman -S lxc)
#start service
sudo systemctl restart lxc.service
sudo systemctl restart lxc-net.service
#generate the nvidia-uvm device node
#this device file is not created automatically, so create it by hand
#(e.g. save the following lines as a script such as ~/uvm.sh and run it with sudo)
/sbin/modprobe nvidia-uvm  # load the nvidia-uvm kernel module on the host
D=`grep nvidia-uvm /proc/devices | awk '{print $1}'`  # find its major device number
mknod -m 666 /dev/nvidia-uvm c $D 0  # create the character device node
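After the script runs, the device node should exist on the host; a quick check (my addition, not part of the original write-up):
ls -l /dev/nvidia-uvm /dev/nvidiactl /dev/nvidia0  # all three device files should be listed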
1.2 Network configuration on the host
a. Edit the /etc/default/lxc-net file and use the following configuration:
# Leave USE_LXC_BRIDGE as "true" if you want to use lxcbr0 for your
# containers. Set to "false" if you'll use virbr0 or another existing
# bridge, or macvlan to your host's NIC.
USE_LXC_BRIDGE="true"
# If you change the LXC_BRIDGE to something other than lxcbr0, then
# you will also need to update your /etc/lxc/default.conf as well as the
# configuration (/var/lib/lxc/<container>/config) for any containers
# already created using the default config to reflect the new bridge
# name.
# If you have the dnsmasq daemon installed, you'll also have to update
# /etc/dnsmasq.d/lxc and restart the system wide dnsmasq daemon.
LXC_BRIDGE="lxcbr0"
LXC_ADDR="10.0.3.1"
LXC_NETMASK="255.255.255.0"
LXC_NETWORK="10.0.3.0/24"
LXC_DHCP_RANGE="10.0.3.2,10.0.3.254"
LXC_DHCP_MAX="253"
# Uncomment the next line if you'd like to use a conf-file for the lxcbr0
# dnsmasq. For instance, you can use 'dhcp-host=mail1,10.0.3.100' to have
# container 'mail1' always get ip address 10.0.3.100.
#LXC_DHCP_CONFILE=/etc/lxc/dnsmasq.conf
# Uncomment the next line if you want lxcbr0's dnsmasq to resolve the .lxc
# domain. You can then add "server=/lxc/10.0.3.1' (or your actual $LXC_ADDR)
# to your system dnsmasq configuration file (normally /etc/dnsmasq.conf,
# or /etc/NetworkManager/dnsmasq.d/lxc.conf on systems that use NetworkManager).
# Once these changes are made, restart the lxc-net and network-manager services.
# 'container1.lxc' will then resolve on your host.
#LXC_DOMAIN="lxc"
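After restarting the lxc-net service (see 1.1) with this configuration in place, the lxcbr0 bridge should come up with the address configured above; a quick check (my addition):
ip addr show lxcbr0  # should show inet 10.0.3.1/24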
b. Enable dnsmasq (use the host as the DNS server)
1. Install dnsmasq
2. Add "server=/lxc/10.0.3.1" (or your actual $LXC_ADDR) to your system dnsmasq configuration file (normally /etc/dnsmasq.conf, or /etc/NetworkManager/dnsmasq.d/lxc.conf on systems that use NetworkManager)
c. Configure LXC
Edit /etc/lxc/default.conf
and replace the original network configuration with:
lxc.net.0.type = veth
lxc.net.0.link = lxcbr0
lxc.net.0.flags = up
lxc.net.0.hwaddr = 00:16:3e:17:18:19
d. Install an LXC template
Choose the distribution and release you want to install:
sudo lxc-create -n Pytorch_1 -t download -- --server=mirrors.tuna.tsinghua.edu.cn/lxc-images
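The new container should now appear in the container list; a quick check (my addition):
sudo lxc-ls -f  # shows each container's state and IP address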
e. Mount the GPU devices into the container
sudo vim /var/lib/lxc/Pytorch_1/config
Add:
# Nvidia
lxc.mount.entry = /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
lxc.mount.entry = /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
lxc.mount.entry = /dev/nvidia-modeset dev/nvidia-modeset none bind,optional,create=file
lxc.mount.entry = /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file
# shared directory between the host and the container (used later to copy installers in)
lxc.mount.entry = /share share none bind,create=dir
This completes the host-side setup: the container now has network access, and the GPU device files are bind-mounted into it.
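Before installing anything, it is worth checking that the container really sees the network and the GPU devices; a quick sanity check (my addition, assuming the container created above, Pytorch_1):
sudo lxc-start -n Pytorch_1
sudo lxc-attach -n Pytorch_1 -- ls -l /dev/nvidia0 /dev/nvidiactl /dev/nvidia-uvm
sudo lxc-attach -n Pytorch_1 -- ping -c 3 archlinux.org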
2. Configure the environment inside the LXC container
sudo lxc-start -n Pytorch_3
sudo lxc-attach -n Pytorch_3
#from here on, everything runs inside the LXC container
# optional: route traffic through a proxy on the host (here a socks5 proxy listening on port 1081)
export ALL_PROXY=socks5://10.0.3.1:1081
# change the pacman mirrorlist
# add the following at the top of /etc/pacman.d/mirrorlist:
Server = https://mirrors.tuna.tsinghua.edu.cn/archlinux/$repo/os/$arch
Server = http://mirrors.163.com/archlinux/$repo/os/$arch
Server = http://mirrors.ustc.edu.cn/archlinux/$repo/os/$arch
Server = http://mirrors.cqu.edu.cn/archlinux/$repo/os/$arch
Server = http://mirror.lzu.edu.cn/archlinux/$repo/os/$arch
Server = http://mirrors.neusoft.edu.cn/archlinux/$repo/os/$arch
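After editing the mirrorlist, refresh the package databases and update the container (my addition):
pacman -Syu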
# install basic build tools
# on an Ubuntu container use apt-get instead; search online for details
pacman -S base-devel
# add user
useradd -m -G wheel newuser
passwd newuser
passwd root
#run visudo and allow the new user to use sudo without a password
visudo
# after the line "root ALL=(ALL) ALL", add:
newuser ALL=(ALL) NOPASSWD: ALL
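The pacman and driver steps below can stay as root, but makepkg in section 4 refuses to run as root, so switch to the new user before building packages (my addition):
su - newuser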
# install the nvidia driver
# the --no-kernel-module flag is required, since the kernel (and its module) is shared with the host
sh /NVIDIA-Linux-x86_64-410.66.run --no-kernel-module
# run nvidia-smi to check that the driver was installed successfully
# install cuda cudnn
pacman -S --force cuda
pacman -S cudnn
# test that CUDA works from the CUDA samples directory; if the output looks normal, CUDA is installed correctly
# run a CUDA test as follows: go to the directory /opt/cuda/samples/1_Utilities/deviceQuery, type make to compile an executable, and run the executable ./deviceQuery
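Concretely, that test boils down to the following (my sketch, assuming the Arch cuda package put the samples under /opt/cuda/samples as described above):
cd /opt/cuda/samples/1_Utilities/deviceQuery
make
./deviceQuery  # should print the GPU name and end with "Result = PASS"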
3. Install Open MPI with CUDA support
Open MPI does not support CUDA by default, so it has to be rebuilt from source.
# copy the Open MPI source tarball into the container (here via the /share bind mount)
# source: https://www.open-mpi.org/software/ompi/v3.0/downloads/openmpi-3.0.0.tar.gz
cp /share/openmpi-3.0.0.tar.gz .
#build and install Open MPI with CUDA support
#guide: https://discuss.pytorch.org/t/segfault-using-cuda-with-openmpi/11140
gunzip -c openmpi-3.0.0.tar.gz | tar xf -
cd openmpi-3.0.0
./configure --prefix=/home/$USER/.openmpi --with-cuda=PATH_TO_CUDA  # on Arch, PATH_TO_CUDA is /opt/cuda
make
sudo make install
# add mpi bin to PATH
export PATH="$PATH:/home/$USER/.openmpi/bin"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/home/$USER/.openmpi/lib/"
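To confirm that this build of Open MPI really has CUDA support, query the corresponding build flag (my addition):
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
# expected output: mca:mpi:base:param:mpi_built_with_cuda_support:value:true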
# install conda
sudo sh /share/Anaconda3-5.3.0-Linux-x86_64.sh
export CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" # [anaconda root directory]
# install basic dependencies
conda install numpy pyyaml mkl mkl-include setuptools cmake cffi typing
conda install -c mingfeima mkldnn
4. Rebuild PyTorch with CUDA and MPI support
# use makepkg to build pytorch-mpi-cuda in lxc
cp /PATH/TO/PKGBUILD ~
cd ~
makepkg
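When makepkg finishes, install the resulting package (my addition; the exact file name depends on the generated pkgver):
sudo pacman -U python-pytorch-magma-mkldnn-cudnn-git-*.pkg.tar.xz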
Contents of the PKGBUILD:
pkgname=("python-pytorch-magma-mkldnn-cudnn-git")
_pkgname="pytorch"
pkgver=v1.0rc0.r122.gc2a57d082d
pkgrel=1
pkgdesc="Tensors and Dynamic neural networks in Python with strong GPU acceleration"
arch=('x86_64')
url="https://pytorch.org"
license=('BSD')
depends=('cuda' 'cudnn' 'magma')
makedepends=('cmake' 'git' 'nccl')
provides=('python-pytorch')
conflicts=('python-pytorch')
source=("git+https://github.com/pytorch/pytorch.git"
"git+https://github.com/catchorg/Catch2"
"git+https://github.com/pybind/pybind11"
"git+https://github.com/NVlabs/cub"
"git+https://github.com/eigenteam/eigen-git-mirror"
"git+https://github.com/google/googletest"
"git+https://github.com/NervanaSystems/nervanagpu"
"git+https://github.com/google/benchmark"
"git+https://github.com/google/protobuf"
"git+https://github.com/Yangqing/ios-cmake"
"git+https://github.com/Maratyszcza/NNPACK"
"git+https://github.com/facebookincubator/gloo"
"git+https://github.com/Maratyszcza/pthreadpool"
"git+https://github.com/Maratyszcza/FXdiv"
"git+https://github.com/Maratyszcza/FP16"
"git+https://github.com/Maratyszcza/psimd"
"git+https://github.com/facebook/zstd"
"git+https://github.com/Maratyszcza/cpuinfo"
"git+https://github.com/PeachPy/enum34"
"git+https://github.com/Maratyszcza/PeachPy"
"git+https://github.com/benjaminp/six"
"git+https://github.com/ARM-software/ComputeLibrary"
"git+https://github.com/onnx/onnx"
"git+https://github.com/USCILab/cereal"
"git+https://github.com/onnx/onnx-tensorrt"
"git+https://github.com/shibatch/sleef"
"git+https://github.com/intel/ideep"
)
sha256sums=('SKIP'
'SKIP'
'SKIP'
'SKIP'
'SKIP'
'SKIP'
'SKIP'
'SKIP'
'SKIP'
'SKIP'
'SKIP'
'SKIP'
'SKIP'
'SKIP'
'SKIP'
'SKIP'
'SKIP'
'SKIP'
'SKIP'
'SKIP'
'SKIP'
'SKIP'
'SKIP'
'SKIP'
'SKIP'
'SKIP'
'SKIP'
)
pkgver() {
  cd "${_pkgname}"
  git describe --long --tags | sed 's/\([^-]*-g\)/r\1/;s/-/./g'
}
prepare() {
  cd "${_pkgname}"
  git submodule init
  git config submodule."third_party/catch".url "${srcdir}"/Catch2
  git config submodule."third_party/pybind11".url "${srcdir}"/pybind11
  git config submodule."third_party/cub".url "${srcdir}"/cub
  git config submodule."third_party/eigen".url "${srcdir}"/eigen-git-mirror
  git config submodule."third_party/googletest".url "${srcdir}"/googletest
  git config submodule."third_party/nervanagpu".url "${srcdir}"/nervanagpu
  git config submodule."third_party/benchmark".url "${srcdir}"/benchmark
  git config submodule."third_party/protobuf".url "${srcdir}"/protobuf
  git config submodule."third_party/ios-cmake".url "${srcdir}"/ios-cmake
  git config submodule."third_party/NNPACK".url "${srcdir}"/NNPACK
  git config submodule."third_party/gloo".url "${srcdir}"/gloo
  git config submodule."third_party/NNPACK_deps/pthreadpool".url "${srcdir}"/pthreadpool
  git config submodule."third_party/NNPACK_deps/FXdiv".url "${srcdir}"/FXdiv
  git config submodule."third_party/NNPACK_deps/FP16".url "${srcdir}"/FP16
  git config submodule."third_party/NNPACK_deps/psimd".url "${srcdir}"/psimd
  git config submodule."third_party/zstd".url "${srcdir}"/zstd
  git config submodule."third_party/cpuinfo".url "${srcdir}"/cpuinfo
  git config submodule."third_party/python-enum".url "${srcdir}"/enum34
  git config submodule."third_party/python-peachpy".url "${srcdir}"/PeachPy
  git config submodule."third_party/python-six".url "${srcdir}"/six
  git config submodule."third_party/ComputeLibrary".url "${srcdir}"/ComputeLibrary
  git config submodule."third_party/onnx".url "${srcdir}"/onnx
  git config submodule."third_party/cereal".url "${srcdir}"/cereal
  git config submodule."third_party/onnx-tensorrt".url "${srcdir}"/onnx-tensorrt
  git config submodule."third_party/sleef".url "${srcdir}"/sleef
  git config submodule."third_party/ideep".url "${srcdir}"/ideep
  git submodule update
}
build() {
  export USE_OPENCV=OFF # Caffe2 is not compatible with OpenCV4: pending https://github.com/pytorch/pytorch/pull/9966
  export USE_FFMPEG=ON
  export USE_MKLDNN=ON
  export USE_NNPACK=ON # A bit redundant with MKLDNN; hopefully PyTorch chooses the best depending on the op
  export USE_CUDA=ON
  export USE_CUDNN=ON
  export USE_NERVANAGPU=OFF # Hopefully CUDNN integrated those
  export USE_OPENCL=ON
  export USE_OPENMP=ON
  export USE_NUMPY=ON
  export USE_MAGMA=ON
  #export CMAKE_PREFIX_PATH=/home/yan/anaconda3/bin/../
  export CC=gcc-7
  export CXX=g++-7
  export CUDAHOSTCXX=g++-7
  export CUDA_HOME=/opt/cuda
  export CUDNN_LIB_DIR=/opt/cuda/lib64
  export CUDNN_INCLUDE_DIR=/opt/cuda/include
  export TORCH_CUDA_ARCH_LIST="6.1" # Consumer Pascal
  export MAGMA_HOME=/opt/magma
  export OPENCV_INCLUDE_DIRS=/usr/include/opencv4
  export FFMPEG_INCLUDE_DIR=/usr/include # libavcodec, libavutils
  export FFMPEG_LIBRARIES=/usr/lib # libavcodec
  # export CUB_INCLUDE_DIRS # For system CUB, otherwise PyTorch picks it from the third-party submodules
  # unfortunately PyTorch doesn't pick up Intel OpenMP
  # and Caffe2 doesn't pick up any OpenMP at all (because I didn't install the LLVM OMP runtime)
  # https://github.com/pytorch/pytorch/issues/12535
  #source /opt/intel/mkl/bin/mklvars.sh intel64
  #source /opt/intel/pkg_bin/compilervars.sh intel64
  cd "$srcdir/${_pkgname}"
  python setup.py build
  # srcdir/${_pkgname} = /home/yan/build_pytorch/src/pytorch
}
package() {
  cd "$srcdir/${_pkgname}"
  # pkgdir = /home/yan/build_pytorch/pkg/python-pytorch-magma-mkldnn-cudnn-git
  python setup.py install --root="$pkgdir"/ --optimize=1 --skip-build
  install -Dm644 LICENSE "${pkgdir}/usr/share/licenses/${pkgname}/LICENSE.txt"
}
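Once the package is built and installed, a quick way to confirm that this PyTorch build really has both CUDA and the MPI backend enabled (a minimal sketch of my own, not part of the original guide):
# should print the version, True for CUDA, and True for MPI support
python -c "import torch, torch.distributed as dist; print(torch.__version__, torch.cuda.is_available(), dist.is_mpi_available())"
# launch two MPI ranks; each should print its rank and the world size (2)
mpirun -np 2 python -c "import torch.distributed as dist; dist.init_process_group('mpi'); print(dist.get_rank(), dist.get_world_size())"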
Author: Yanring_
Link: https://www.jianshu.com/p/a41ccde46087
Source: Jianshu (简书)
The copyright belongs to the author. For commercial reproduction, please contact the author for authorization; for non-commercial reproduction, please credit the source.