Overview
Thanks to the advantages of container technology, its adoption keeps growing, and traditional virtualization techniques are gradually being adapted to containers. For example, SR-IOV (Single-Root Input/Output Virtualization) has been applied to containers, and Intel's experiments [1] show that network and storage performance can come close to that of physical devices. Meanwhile, in recent years GPUs (Graphics Processing Units) have kept advancing in fields such as high-performance computing and cloud desktops. Developing, debugging, and running GPU-intensive applications involves diverse environments with heavy version dependencies. By leveraging the CI/CD strengths of container technology, containerized GPU applications bring the benefits listed below, and NVIDIA Docker simplifies this otherwise tedious work. This article gives a first look at, and a simple hands-on with, nvidia-docker [2].
Benefits of GPU containerization:
- Reproducible builds
- Ease of deployment
- Isolation of individual devices
- Run across heterogeneous driver/toolkit environments
- Requires only the NVIDIA driver to be installed
- Enables "fire and forget" GPU applications
- Facilitate collaboration

(Figure: Example of how CUDA integrates with Docker.)
Experiment environment
System configuration
The operating system is CentOS:
[root@localhost ~]# cat /etc/redhat-release
CentOS Linux release 7.2.1511 (Core)
[root@localhost ~]# uname -a
Linux localhost.localdomain 3.10.0-327.13.1.el7.x86_64 #1 SMP Thu Mar 31 16:04:38 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
[root@localhost ~]# yum update -y
[root@localhost ~]# yum install -y wget tmux vim git pciutils kernel-devel kernel-headers gcc make epel-release
GPU details
[root@localhost ~]# lspci | grep VGA
03:00.0 VGA compatible controller: NVIDIA Corporation GK104GL [Quadro K4200] (rev a1)
04:00.0 VGA compatible controller: NVIDIA Corporation GF110GL [Tesla C2050 / C2075] (rev a1)
[root@localhost ~]# lspci -v -s 03:00.0
03:00.0 VGA compatible controller: NVIDIA Corporation GK104GL [Quadro K4200] (rev a1) (prog-if 00 [VGA controller])
    Subsystem: NVIDIA Corporation Device 1096
    Physical Slot: 2
    Flags: bus master, fast devsel, latency 0, IRQ 11
    Memory at fa000000 (32-bit, non-prefetchable) [size=16M]
    Memory at d0000000 (64-bit, prefetchable) [size=256M]
    Memory at e0000000 (64-bit, prefetchable) [size=32M]
    I/O ports at d000 [size=128]
    Expansion ROM at fb000000 [disabled] [size=512K]
    Capabilities: [60] Power Management version 3
    Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
    Capabilities: [78] Express Endpoint, MSI 00
    Capabilities: [b4] Vendor Specific Information: Len=14
    Capabilities: [100] Virtual Channel
    Capabilities: [128] Power Budgeting
    Capabilities: [420] Advanced Error Reporting
    Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024
[root@localhost ~]# lspci -v -s 04:00.0
04:00.0 VGA compatible controller: NVIDIA Corporation GF110GL [Tesla C2050 / C2075] (rev a1) (prog-if 00 [VGA controller])
    Subsystem: NVIDIA Corporation Tesla C2075
    Physical Slot: 4
    Flags: fast devsel, IRQ 11
    Memory at f8000000 (32-bit, non-prefetchable) [disabled] [size=16M]
    Memory at e8000000 (64-bit, prefetchable) [disabled] [size=128M]
    Memory at f0000000 (64-bit, prefetchable) [disabled] [size=32M]
    I/O ports at c000 [disabled] [size=128]
    Expansion ROM at f9000000 [disabled] [size=512K]
    Capabilities: [60] Power Management version 3
    Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
    Capabilities: [78] Express Endpoint, MSI 00
    Capabilities: [b4] Vendor Specific Information: Len=14
    Capabilities: [100] Virtual Channel
    Capabilities: [128] Power Budgeting
    Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024
Installing Docker
[root@localhost ~]# sudo tee /etc/yum.repos.d/docker.repo <<-'EOF'
> [dockerrepo]
> name=Docker Repository
> baseurl=https://yum.dockerproject.org/repo/main/centos/$releasever/
> enabled=1
> gpgcheck=1
> gpgkey=https://yum.dockerproject.org/gpg
> EOF
[root@localhost ~]# yum install docker-engine
[root@localhost ~]# systemctl restart docker
[root@localhost ~]# systemctl enable docker
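Before moving on, it may be worth a quick sanity check that the daemon responds and can run a container (this assumes the hello-world image is reachable from Docker Hub):

```
# Confirm the daemon is up, then run a throwaway test container
systemctl status docker
docker run --rm hello-world
```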
Installing the NVIDIA driver
[root@localhost ~]# uname -r
3.10.0-327.13.1.el7.x86_64
[root@localhost ~]# ll /usr/src/kernels/3.10.0-327.13.1.el7.x86_64/
The kernel-devel version must match the running kernel; otherwise, check and fix the grub2 default entry and reboot.
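The match can also be checked with a short script (a sketch; it assumes kernel-devel installs its tree under /usr/src/kernels/<version>):

```shell
#!/bin/sh
# The NVIDIA .run installer builds a kernel module, so the kernel-devel
# tree must match the running kernel exactly.
running="$(uname -r)"
if [ -d "/usr/src/kernels/${running}" ]; then
    echo "headers match the running kernel: ${running}"
else
    echo "no headers for ${running}: fix the grub2 default entry and reboot"
fi
```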
For compatibility reasons, a lower driver version was chosen for installation:
[root@localhost ~]# sh ./NVIDIA-Linux-x86_64-352.79_Tesla_C2050.run
[root@localhost ~]# ll /dev/nvidia*
crw-rw-rw-. 1 root root 195,   0 Apr  6 18:19 /dev/nvidia0
crw-rw-rw-. 1 root root 195,   1 Apr  6 18:19 /dev/nvidia1
crw-rw-rw-. 1 root root 195, 255 Apr  6 18:19 /dev/nvidiactl
crw-rw-rw-. 1 root root 246,   0 Apr  6 18:19 /dev/nvidia-uvm
If /dev/nvidia-uvm is missing, load the module manually:
[root@localhost ~]# sudo modprobe nvidia_uvm
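To keep the module loaded across reboots, it can be listed in systemd's modules-load.d directory (standard on CentOS 7):

```
# Load nvidia_uvm automatically at boot
echo nvidia_uvm | sudo tee /etc/modules-load.d/nvidia-uvm.conf
```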
Installing and configuring the CUDA environment
Installation
[root@localhost ~]# rpm -ivh cuda-repo-rhel7-7.5-18.x86_64.rpm
[root@localhost ~]# yum clean expire-cache
[root@localhost ~]# yum install cuda -y
If a DKMS dependency error appears, check whether yum install -y epel-release has been executed (the dkms package comes from EPEL).
Configuring environment variables
[root@localhost ~]# find / -name nvcc
/usr/local/cuda-7.5/bin/nvcc
This shows the CUDA version is 7.5, installed under /usr/local/cuda-7.5/.
[root@localhost ~]# vim /etc/profile
……
export PATH=/usr/local/cuda-7.5/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-7.5/lib64:$LD_LIBRARY_PATH
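A new login shell (or sourcing the edited file) should then pick up the toolchain; a quick verification, assuming the CUDA 7.5 layout found above:

```
# Confirm nvcc is on PATH and reports the expected toolkit version
source /etc/profile
which nvcc        # expected: /usr/local/cuda-7.5/bin/nvcc
nvcc --version
```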
Installing nvidia-docker
# Install nvidia-docker and nvidia-docker-plugin
[root@localhost ~]# sudo tar --strip-components=1 -C /usr/bin -xvf /tmp/nvidia-docker_1.0.0.beta.3_amd64.tar.xz && rm /tmp/nvidia-docker*.tar.xz
# Run nvidia-docker-plugin
[root@localhost ~]# sudo -b nohup nvidia-docker-plugin > /tmp/nvidia-docker.log
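nvidia-docker-plugin exposes a small REST API that the nvidia-docker wrapper queries for device and volume arguments; tailing the log and probing the port is a quick way to confirm it came up. The default port 3476 and the endpoint path are taken from the nvidia-docker 1.0 documentation and may differ between beta releases:

```
# Confirm the plugin started and answers HTTP requests
tail /tmp/nvidia-docker.log
curl -s http://localhost:3476/v1.0/gpu/info
```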
[root@localhost ~]# docker images
REPOSITORY    TAG      IMAGE ID       CREATED       SIZE
nvidia/cuda   latest   22bde803e760   2 weeks ago   1.226 GB
Error:
[root@localhost ~]# nvidia-docker run --rm nvidia/cuda nvidia-smi
docker: Error response from daemon: create nvidia_driver_352.79: create nvidia_driver_352.79: Error looking up volume plugin nvidia-docker: plugin not found.
See 'docker run --help'.
Solution:
nvidia-docker volume setup
docker volume ls
DRIVER    VOLUME NAME
local     nvidia_driver_352.79
Testing
Start several containers and confirm that the nvidia devices are present in each container.
mkdir -p ~/docker/digits
nvidia-docker run -it -p 8080:8080 -v ~/docker/digits:/digits nvidia/cuda
nvidia-docker run -it -p 8081:8080 -v ~/docker/digits:/digits nvidia/cuda
nvidia-docker run -it -p 8082:8080 -v ~/docker/digits:/digits nvidia/cuda
nvidia-docker run -it -p 8083:8080 -v ~/docker/digits:/digits nvidia/cuda
docker ps -a
CONTAINER ID   IMAGE         COMMAND       CREATED              STATUS              PORTS                    NAMES
5d86dbc4047b   nvidia/cuda   "/bin/bash"   About a minute ago   Up About a minute   0.0.0.0:8083->8080/tcp   romantic_williams
0c8d3300140b   nvidia/cuda   "/bin/bash"   About a minute ago   Up About a minute   0.0.0.0:8082->8080/tcp   tiny_shaw
5c927720fa16   nvidia/cuda   "/bin/bash"   2 minutes ago        Up About a minute   0.0.0.0:8081->8080/tcp   drunk_brattain
5ab94e3a21d2   nvidia/cuda   "/bin/bash"   2 minutes ago        Up 2 minutes        0.0.0.0:8080->8080/tcp   evil_pare
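Instead of attaching to each container in turn, the device check can be scripted from the host (a sketch; the container IDs come from docker ps above):

```
# List the NVIDIA device nodes visible inside every running container
for c in $(docker ps -q); do
    echo "== container $c =="
    docker exec "$c" sh -c 'ls -l /dev/nvidia*'
done
```

Quoting the glob makes it expand inside each container rather than on the host.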
Compiling and running a CUDA program in a container
The CUDA source file:
root@5ab94e3a21d2:~# cd /digits/
root@5ab94e3a21d2:/digits# ls
hellocuda.cu
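The listing only shows the file name; the contents of hellocuda.cu are not given. A hypothetical minimal vector-add sketch consistent with the output printed below (a[i] = i and b[i] = i + 16, so each element is 2*i + 16) could be written into the shared /digits directory like this:

```
cat > /digits/hellocuda.cu <<'EOF'
#include <cstdio>

#define N 16

// Each thread adds one pair of elements.
__global__ void add(const int *a, const int *b, int *c) {
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}

int main() {
    int a[N], b[N], c[N];
    int *da, *db, *dc;
    for (int i = 0; i < N; ++i) { a[i] = i; b[i] = i + N; }
    cudaMalloc((void **)&da, sizeof(a));
    cudaMalloc((void **)&db, sizeof(b));
    cudaMalloc((void **)&dc, sizeof(c));
    cudaMemcpy(da, a, sizeof(a), cudaMemcpyHostToDevice);
    cudaMemcpy(db, b, sizeof(b), cudaMemcpyHostToDevice);
    add<<<1, N>>>(da, db, dc);       // one block of N threads
    cudaMemcpy(c, dc, sizeof(c), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; ++i) printf("%d ", c[i]);
    printf("\n");
    cudaFree(da); cudaFree(db); cudaFree(dc);
    return 0;
}
EOF
```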
Compile with nvcc inside the container:
root@5ab94e3a21d2:/digits# nvcc hellocuda.cu -o hellocuda
Run the program inside the container:
root@5ab94e3a21d2:/digits# ./hellocuda
16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46
Other tests, such as NVIDIA/DIGITS [4][5], can also be run in the container.
[1] Single-Root Input/Output Virtualization (SR-IOV) with Linux* Containers