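Recently, for business reasons, many of our services need to be deployed with Docker, so I started looking into how to use it.
Our code is Python and the deep learning framework is TensorFlow. According to the official instructions, Docker and nvidia-docker must be installed first. Installing Docker is straightforward; I mostly followed this guide: https://yeasy.gitbooks.io/docker_practice/install/centos.html
Installing nvidia-docker, however, had a few small pitfalls, which I am recording here.
For nvidia-docker I mostly followed the official instructions: https://github.com/NVIDIA/nvidia-docker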
# If you have nvidia-docker 1.0 installed: we need to remove it and all existing GPU containers
docker volume ls -q -f driver=nvidia-docker | xargs -r -I{} -n1 docker ps -q -a -f volume={} | xargs -r docker rm -f
sudo yum remove nvidia-docker
# Add the package repositories
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | \
sudo tee /etc/yum.repos.d/nvidia-docker.repo
# Install nvidia-docker2 and reload the Docker daemon configuration
sudo yum install -y nvidia-docker2
sudo pkill -SIGHUP dockerd
# Test nvidia-smi with the latest official CUDA image
docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
The first few steps went smoothly. The last step was to run:
docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
This command failed every time.
The error message was:
docker: Error response from daemon: OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:407: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --compute --utility --require=cuda>=9.0 --pid=15997 /var/lib/docker/overlay2/5e678ed1c028293c3a8d9edc227307b89239e8c41672174811378c40e2dbbec9/merged]\\\\nnvidia-container-cli: requirement error: unsatisfied condition: cuda >= 9.0\\\\n\\\"\"": unknown.
Looking closely at the last part, "requirement error: unsatisfied condition: cuda >= 9.0", I suspected a CUDA version problem.
I checked the CUDA version on the host:
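The useful part of that long error is easy to miss among the escaped quotes. As a minimal sketch (the function name and the regular expression are my own, not part of nvidia-docker), the unsatisfied requirement can be pulled out like this:

```python
import re
from typing import Optional, Tuple

# Pattern for the "unsatisfied condition" phrase seen in the error above.
# The set of operators is an assumption; only ">=" appears in this post.
_REQUIREMENT_RE = re.compile(
    r"unsatisfied condition:\s*([a-z_]+)\s*(>=|<=|!=|=|>|<)\s*([0-9.]+)"
)


def extract_unsatisfied_requirement(message: str) -> Optional[Tuple[str, str, str]]:
    """Return (name, operator, version) from the error text, or None if absent."""
    match = _REQUIREMENT_RE.search(message)
    if match is None:
        return None
    return match.group(1), match.group(2), match.group(3)


sample = "nvidia-container-cli: requirement error: unsatisfied condition: cuda >= 9.0"
print(extract_unsatisfied_requirement(sample))  # ('cuda', '>=', '9.0')
```

Passing the full daemon error works as well, since the pattern only looks for the "unsatisfied condition" fragment.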
cat /usr/local/cuda/version.txt
Output: CUDA Version 8.0.61
Sure enough, the host is on CUDA 8.0, below the required 9.0.
So I changed the command above to:
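The manual comparison above can be sketched in a few lines of Python. This mirrors the check as done in this post, reading the text of /usr/local/cuda/version.txt; the helper names are hypothetical, and the container runtime itself may consult the driver rather than this file.

```python
import re
from typing import Tuple


def parse_cuda_version(text: str) -> Tuple[int, ...]:
    """Parse text such as 'CUDA Version 8.0.61' into (8, 0, 61)."""
    match = re.search(r"CUDA Version\s+(\d+(?:\.\d+)*)", text)
    if match is None:
        raise ValueError(f"no CUDA version found in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def satisfies_minimum(installed: Tuple[int, ...], required: Tuple[int, ...]) -> bool:
    """True if installed >= required; shorter tuples are padded with zeros."""
    width = max(len(installed), len(required))
    pad = lambda v: v + (0,) * (width - len(v))
    return pad(installed) >= pad(required)


installed = parse_cuda_version("CUDA Version 8.0.61")
print(installed)                             # (8, 0, 61)
print(satisfies_minimum(installed, (9, 0)))  # False
```

In practice the text would come from reading the file, for example with open("/usr/local/cuda/version.txt").read().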
docker run --runtime=nvidia --rm nvidia/cuda:8.0-base nvidia-smi
It still failed:
Unable to find image 'nvidia/cuda:8.0-base' locally
docker: Error response from daemon: manifest for nvidia/cuda:8.0-base not found.
Apparently no image exists with that tag.
Eventually a search on Baidu turned up the command that works:
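Rather than guessing a tag, one option is to filter a list of known tags by CUDA version. The sketch below uses a short hand-written tag list purely for illustration; it is an assumption, not a dump of what Docker Hub actually publishes, and the function name is hypothetical.

```python
from typing import Iterable, List


def tags_for_version(tags: Iterable[str], cuda_version: str) -> List[str]:
    """Keep tags of the form '<cuda_version>-<flavor>', preserving input order."""
    prefix = cuda_version + "-"
    return [tag for tag in tags if tag.startswith(prefix)]


# Illustrative tag list only; check the nvidia/cuda repository for the real one.
known_tags = ["9.0-base", "9.0-runtime", "9.0-devel", "8.0-runtime", "8.0-devel"]

print(tags_for_version(known_tags, "8.0"))  # ['8.0-runtime', '8.0-devel']
```

With a real tag list, an empty result for a given version and flavor would explain a "manifest not found" error like the one above.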
docker run --runtime=nvidia --rm nvidia/cuda:8.0-devel nvidia-smi
This time it succeeded:
Thu Feb 14 07:56:19 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 0000:00:0C.0     Off |                    0 |
| N/A   42C    P0    28W / 250W |      0MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
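Summary: the CUDA version of the test image has to match what the host actually supports. Strictly speaking, the cuda>=9.0 requirement is checked against the host's NVIDIA driver rather than the toolkit under /usr/local/cuda; driver 375.66 supports CUDA 8.0 at most, which here happens to agree with the installed toolkit.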
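To make that rule of thumb concrete, here is a rough sketch that reads the driver version from nvidia-smi output and looks up the newest CUDA release it can run. The three-row table is copied by hand from NVIDIA's CUDA release notes (minimum Linux driver per release) and is deliberately incomplete, so treat it as an assumption and verify it; the function names are my own.

```python
import re
from typing import Optional, Tuple

Version = Tuple[int, int]

# (CUDA release, minimum Linux driver), newest first.
# Hand-copied and incomplete; verify against NVIDIA's release notes.
MIN_LINUX_DRIVER = [
    ((10, 0), (410, 48)),
    ((9, 0), (384, 81)),
    ((8, 0), (375, 26)),
]


def parse_driver_version(smi_output: str) -> Version:
    """Extract (major, minor) from the 'Driver Version: X.Y' field."""
    match = re.search(r"Driver Version:\s*(\d+)\.(\d+)", smi_output)
    if match is None:
        raise ValueError("no driver version found")
    return int(match.group(1)), int(match.group(2))


def newest_listed_cuda(driver: Version) -> Optional[Version]:
    """Newest CUDA release in the table that this driver satisfies, else None."""
    for cuda, min_driver in MIN_LINUX_DRIVER:
        if driver >= min_driver:
            return cuda
    return None


driver = parse_driver_version("| NVIDIA-SMI 375.66 Driver Version: 375.66 |")
print(driver)                      # (375, 66)
print(newest_listed_cuda(driver))  # (8, 0)
```

For the host in this post the lookup lands on CUDA 8.0, which is why an 8.0 image runs while the 9.0 image is rejected.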