Base environment: CentOS 7.3
1. Configure mirrors inside China:
1.1 Change the nameserver
vi /etc/resolv.conf
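Change the existing nameserver to 114.114.114.114 and add a second nameserver, 8.8.8.8.
114.114.114.114 is described as China's first cloud-security DNS address; 8.8.8.8 is the IP address of Google's free public DNS server.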
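The same edit can be made without opening an editor. The following is a minimal sketch, not part of the original steps: the function name write_nameservers and the .bak backup suffix are made up for illustration. It overwrites the given file with the two nameserver lines described above and keeps a copy of the previous contents.

```shell
# Hypothetical helper: overwrite a resolv.conf-style file with the two
# nameservers used in this guide, keeping a backup of the old file.
write_nameservers() {
    target="$1"
    # Keep a copy of the existing file, if there is one.
    if [ -f "$target" ]; then
        cp "$target" "$target.bak"
    fi
    # printf repeats the format once per argument, giving one line each.
    printf 'nameserver %s\n' 114.114.114.114 8.8.8.8 > "$target"
}

# On a real system this would be run as root:
#   write_nameservers /etc/resolv.conf
```

Taking the target path as an argument makes it possible to try the function on a scratch file before touching /etc/resolv.conf.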
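1.2 Switch to a yum source inside China (the Tsinghua mirror is used here)
The exact steps are given on the Tsinghua mirror site; finish by running yum makecache.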
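2. Install the NVIDIA driver (from the command line)
Reference 1: https://blog.csdn.net/w18750930043/article/details/80622783
Reference 2: https://serverfault.com/questions/870211/yum-install-kmod-nvidia-kernel-issue
After adding the ELRepo repository as described there, installing kmod-nvidia is quite likely to fail with a kernel symbol mismatch such as: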
Package: kmod-nvidia-410.73-2.el7_6.elrepo.x86_64 (elrepo)
    Requires: kernel(drm_atomic_helper_plane_reset) = 0x97498548
    Installed: kernel-3.10.0-514.el7.x86_64 (@anaconda)
        kernel(drm_atomic_helper_plane_reset) = 0xabd4c98d
    Installed: kernel-3.10.0-862.el7.x86_64 (@base)
        kernel(drm_atomic_helper_plane_reset) = 0xe7694b10
    Installed: kernel-3.10.0-862.14.4.el7.x86_64 (@updates)
        kernel(drm_atomic_helper_plane_reset) = 0xe7694b10
    Installed: kernel-ml-4.19.4-1.el7.elrepo.x86_64 (@elrepo-kernel)
        kernel(drm_atomic_helper_plane_reset) = 0x3663ba58
    Available: kernel-3.10.0-862.2.3.el7.x86_64 (updates)
        kernel(drm_atomic_helper_plane_reset) = 0xe7694b10
…………………………………………………………
    Available: kernel-debug-3.10.0-862.el7.x86_64 (base)
        kernel(drm_atomic_helper_plane_reset) = 0x1ccd0c71
    Available: kernel-debug-3.10.0-862.2.3.el7.x86_64 (updates)
        kernel(drm_atomic_helper_plane_reset) = 0x1ccd0c71
    Available: kernel-debug-3.10.0-862.3.2.el7.x86_64 (updates)
        kernel(drm_atomic_helper_plane_reset) = 0x1ccd0c71
    Available: kernel-debug-3.10.0-862.3.3.el7.x86_64 (updates)
……………………………………………………
Solution:
2.1 List the available drivers (they differ by kernel version)
sudo yum --enablerepo=elrepo --showduplicates list kmod-nvidia
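2.2 Pick and install a usable driver
The newest version is not necessarily the right one; the choice should depend on the kernel CentOS 7 is currently running (check with uname -r). The crude approach is to try the versions one by one. The one I picked was: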
sudo yum install kmod-nvidia-390.87-1.el7_5.elrepo -y
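If there are no warnings or errors and output like the following appears, the install succeeded (the original output was in a Chinese locale):
Installed:
kmod-nvidia.x86_64 0:390.87-1.el7_5.elrepo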
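Instead of trying versions one at a time, the dist tag in the package name (el7_5, el7_6, and so on) can be compared with the running kernel. The sketch below rests on an assumption that is not stated in the original: that ELRepo builds each kmod-nvidia package against the kernel of the CentOS 7 point release named in its tag. The error listing above is consistent with this, since the el7_6 package asks for a symbol version that none of the 514 or 862 kernels provide. The function name kernel_to_el7_tag is made up for illustration, and the table only covers the stock 3.10.0 kernel series of CentOS 7.3 through 7.9.

```shell
# Hypothetical helper: map a CentOS 7 kernel release string (as printed by
# `uname -r`) to the dist tag of the point release that shipped that series.
# Assumption: ELRepo kmod packages carrying this tag were built for it.
kernel_to_el7_tag() {
    case "$1" in
        3.10.0-514.*)  echo "el7_3" ;;   # CentOS 7.3
        3.10.0-693.*)  echo "el7_4" ;;   # CentOS 7.4
        3.10.0-862.*)  echo "el7_5" ;;   # CentOS 7.5
        3.10.0-957.*)  echo "el7_6" ;;   # CentOS 7.6
        3.10.0-1062.*) echo "el7_7" ;;   # CentOS 7.7
        3.10.0-1127.*) echo "el7_8" ;;   # CentOS 7.8
        3.10.0-1160.*) echo "el7_9" ;;   # CentOS 7.9
        *) echo "unknown"; return 1 ;;   # e.g. an elrepo kernel-ml build
    esac
}

# Possible use: narrow the 2.1 listing to builds for the running kernel.
#   tag=$(kernel_to_el7_tag "$(uname -r)") &&
#     sudo yum --enablerepo=elrepo --showduplicates list kmod-nvidia | grep "$tag"
```

Under this assumption, a machine booted into a 3.10.0-862 kernel would look for el7_5 builds, which agrees with the 390.87-1.el7_5 package chosen above.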
2.3 Test
Reboot CentOS 7 and test with:
nvidia-smi
Note: you may need to upgrade the kernel; see https://blog.csdn.net/kikajack/article/details/79396793
3. Install Docker (CentOS 7)
3.1 docker
Simply follow: https://mirrors.tuna.tsinghua.edu.cn/help/docker-ce/
3.2 nvidia-docker
Reference 1: https://github.com/NVIDIA/nvidia-docker
Reference 2: https://blog.csdn.net/lantuxin/article/details/83795159
Reference 3: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#pre-installation-actions
3.2.1 Install the kernel headers and development packages for CentOS 7
sudo yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r)
Reason: section 2.4 of reference 3 says: The CUDA Driver requires that the kernel headers and development packages for the running version of the kernel be installed at the time of the driver installation, as well whenever the driver is rebuilt. For example, if your system is running kernel version 3.17.4-301, the 3.17.4-301 kernel headers and development packages must also be installed.
3.2.2 Remove any previously installed nvidia-docker 1
sudo docker volume ls -q -f driver=nvidia-docker | xargs -r -I{} -n1 sudo docker ps -q -a -f volume={} | xargs -r sudo docker rm -f
sudo yum remove nvidia-docker
3.2.3 Add the package repository
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
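The first command builds the repository path from /etc/os-release. To show what it evaluates to, the sketch below runs the same expression against a sample file holding the ID and VERSION_ID values that CentOS 7 ships; the temporary file is only there for illustration.

```shell
# Sample os-release content with the ID and VERSION_ID values of CentOS 7.
sample=$(mktemp)
cat > "$sample" <<'EOF'
ID="centos"
VERSION_ID="7"
EOF

# Same expression as in the step above, pointed at the sample file.
# $( ... ) runs in a subshell, so ID and VERSION_ID do not leak into
# the calling shell.
distribution=$(. "$sample"; echo $ID$VERSION_ID)
echo "$distribution"   # prints: centos7
rm -f "$sample"
```

On CentOS 7 the curl command therefore requests https://nvidia.github.io/nvidia-docker/centos7/nvidia-docker.repo.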
3.2.4 Install nvidia-docker2
sudo yum install -y nvidia-docker2
sudo pkill -SIGHUP dockerd
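Before running a GPU container it can be worth checking that the Docker daemon picked up the new runtime after the SIGHUP. A minimal sketch, assuming that docker info prints a "Runtimes:" line listing the registered runtimes; the function name has_nvidia_runtime is made up for illustration.

```shell
# Hypothetical helper: reads `docker info` output on stdin and succeeds
# only if the "Runtimes:" line lists a runtime named nvidia.
has_nvidia_runtime() {
    # First grep keeps the Runtimes line (leading spaces allowed);
    # second grep looks for "nvidia" as a whole word on it.
    grep -i '^ *Runtimes:' | grep -qw nvidia
}

# Possible use on the host:
#   sudo docker info 2>/dev/null | has_nvidia_runtime && echo "nvidia runtime registered"
```

If the check fails, restarting the Docker service rather than only sending SIGHUP is a reasonable next thing to try.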
3.2.5 Test with a CUDA image
docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
You can of course also test CUDA inside the container:
sudo nvidia-docker run -it --rm registry.docker-cn.com/nvidia/cuda:latest bash
nvcc -V