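Installing the NVIDIA GPU Driver, CUDA, and a Docker GPU Environment on CentOS 7

Article link: https://blog.csdn.net/ITRugod/article/details/110310890

Note: this is what I learned at the second internet company I joined after graduating from university.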

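Background

My manager asked me to buy a GPU server on AWS and to install and deploy the integrated development environment for a 3D modeling project on it.

Installing the NVIDIA GPU driver and CUDA

I purchased a g3s.xlarge EC2 instance, chose CentOS 7, configured the network, disk, security group, and tags, and was then ready to boot it and start the installation.

Check the GPU model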

[root@localhost ~]# yum install pciutils -y
[root@localhost ~]# lspci | grep -i NVIDIA
00:1e.0 VGA compatible controller: NVIDIA Corporation GM204GL [Tesla M60] (rev a1)
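
To use the model name in a script, the bracketed name can be pulled out of the lspci line. The following is a minimal sketch: the function name gpu_model_from_lspci is hypothetical, and it assumes lspci prints the model in square brackets, as in the output above.

```shell
# Hypothetical helper: read lspci output on stdin and print the bracketed
# model name of every line that mentions NVIDIA (for example "Tesla M60").
gpu_model_from_lspci() {
    grep -i 'nvidia' | sed -n 's/.*\[\(.*\)\].*/\1/p'
}

# Usage on a real machine:
#   lspci | gpu_model_from_lspci
```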

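A Google search turned up the following:

NVIDIA® Tesla® GPUs are data center GPUs designed for servers.

They speed up the most demanding high-performance computing (HPC) and hyperscale data center workloads.

Installing the NVIDIA GRID driver on a Linux instance

Preparation

Install the dependencies for the NVIDIA GRID driver, along with the AWS CLI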

yum update -y
yum install -y lrzsz vim wget ntpdate yum-utils zip unzip tree  gcc gcc-c++  epel-release
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
reboot

Configuration

Configure the AWS CLI and make sure the IAM user has the permissions granted by the AmazonS3ReadOnlyAccess policy.

aws configure

Install the gcc compiler and the kernel headers package for the kernel version you are currently running.

yum install -y gcc kernel-devel-$(uname -r)
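
The NVIDIA installer compiles a kernel module, so the headers have to match the kernel that is actually running. A small check along these lines can confirm that before the installer is started. This is a sketch: the function name kernel_headers_present is hypothetical, and it assumes kernel-devel places its tree under /usr/src/kernels/<release>, which is where CentOS 7 puts it.

```shell
# Hypothetical helper: succeed if a kernel source tree for the given release
# exists under the given root directory.
# Defaults: the running kernel and /usr/src/kernels.
kernel_headers_present() {
    kh_release="${1:-$(uname -r)}"
    kh_root="${2:-/usr/src/kernels}"
    [ -d "${kh_root}/${kh_release}" ]
}

# Usage on a real machine:
#   kernel_headers_present || echo "kernel-devel for $(uname -r) is missing"
```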

Disable nouveau, the open-source driver for NVIDIA graphics cards.

Add nouveau to the /etc/modprobe.d/blacklist.conf blacklist file.
cat << EOF | sudo tee --append /etc/modprobe.d/blacklist.conf
blacklist vga16fb
blacklist nouveau
blacklist rivafb
blacklist nvidiafb
blacklist rivatv
EOF
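
Because tee --append writes the same lines again on every run, repeating the step leaves duplicate entries in the file. An idempotent variant could look like the sketch below; add_blacklist_entry is a hypothetical name, not part of any tool.

```shell
# Hypothetical helper: append "blacklist <module>" to a modprobe config file
# only if that exact line is not already present.
add_blacklist_entry() {
    bl_file="$1"
    bl_module="$2"
    touch "$bl_file"
    if ! grep -qxF "blacklist ${bl_module}" "$bl_file"; then
        echo "blacklist ${bl_module}" >> "$bl_file"
    fi
}

# Usage on a real machine (as root):
#   for m in vga16fb nouveau rivafb nvidiafb rivatv; do
#       add_blacklist_entry /etc/modprobe.d/blacklist.conf "$m"
#   done
```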

Edit the /etc/default/grub file and add the following line:
GRUB_CMDLINE_LINUX="rdblacklist=nouveau"

Rebuild the GRUB configuration
grub2-mkconfig -o /boot/grub2/grub.cfg
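
/etc/default/grub usually already has a GRUB_CMDLINE_LINUX line, and since the file is read as shell variable assignments, a second assignment replaces the earlier value along with any kernel arguments it carried. The sketch below appends to the existing value instead. It assumes GNU sed (for -i) and a double-quoted value on a single line; append_grub_arg is a hypothetical name.

```shell
# Hypothetical helper: append one kernel argument to the existing
# GRUB_CMDLINE_LINUX="..." line, unless the argument is already there.
# The argument must not contain "/" because it is used inside a sed expression.
append_grub_arg() {
    grub_file="$1"
    grub_arg="$2"
    if grep -q "^GRUB_CMDLINE_LINUX=.*${grub_arg}" "$grub_file"; then
        return 0
    fi
    sed -i "s/^GRUB_CMDLINE_LINUX=\"\(.*\)\"/GRUB_CMDLINE_LINUX=\"\1 ${grub_arg}\"/" "$grub_file"
}

# Usage on a real machine (as root), followed by the rebuild step:
#   append_grub_arg /etc/default/grub rdblacklist=nouveau
#   grub2-mkconfig -o /boot/grub2/grub.cfg
```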

Deployment

Download the GRID driver installation utility
aws s3 cp --recursive s3://ec2-linux-nvidia-drivers/latest/ .
chmod +x NVIDIA-Linux-x86_64*.run
sudo /bin/sh ./NVIDIA-Linux-x86_64*.run
reboot

When prompted, accept the license agreement and specify installation options as needed

Some of the prompts during installation

The distribution-provided pre-install script failed! Are you sure you want to continue?

Choose Yes to continue.

Would you like to register the kernel module sources with DKMS? This will allow DKMS to automatically build a new module, if you install a different kernel later.

Choose No to continue.

Install NVIDIA's 32-bit compatibility libraries?

Choose No to continue.

Would you like to run the nvidia-xconfig utility to automatically update your X configuration file so that the NVIDIA X driver will be used when you restart X? Any pre-existing X configuration file will be backed up.

Choose Yes to continue.

After deployment, run nvidia-smi. Output like the following means the installation succeeded

[root@ ~]# nvidia-smi 
Thu Aug 27 06:15:12 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla M60           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   33C    P8    15W / 150W |      0MiB /  7618MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
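
For scripted checks, the driver version can be read from the header line of this table. The following is a sketch that parses the text format shown above; driver_version_from_smi is a hypothetical name, and the parsing assumes the header keeps the "Driver Version: x.y.z" wording.

```shell
# Hypothetical helper: read nvidia-smi output on stdin and print the first
# driver version found in the header line.
driver_version_from_smi() {
    sed -n 's/.*Driver Version: \([0-9.]*\).*/\1/p' | head -n 1
}

# Usage on a real machine:
#   nvidia-smi | driver_version_from_smi
```

nvidia-smi can also print the value directly with nvidia-smi --query-gpu=driver_version --format=csv,noheader, which avoids parsing the table.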

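Running the 3D program

The company's 3D project is already deployed with Docker. According to the official Docker documentation, from version 19.03 onward it is enough to pass --gpus all to docker run. This uses all GPUs; to use two GPUs, pass --gpus 2; to select specific cards, pass --gpus '"device=1,2"'.

Before Docker 19.03, nvidia-docker1 or nvidia-docker2 had to be downloaded separately to start GPU containers.

The latest Docker version at the time of writing is 19.03.12, so nvidia-docker does not need to be installed separately (the nvidia-container-toolkit package is still required and is installed in a later step).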
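
Whether an installed Docker is new enough for --gpus can be checked from its version string. The sketch below compares the major and minor parts against 19.03; docker_supports_gpus is a hypothetical name, and it assumes a version string in the usual major.minor.patch form.

```shell
# Hypothetical helper: succeed if the given Docker version string
# (for example "19.03.12") is 19.03 or newer.
docker_supports_gpus() {
    dv_version="$1"
    dv_major="${dv_version%%.*}"
    dv_rest="${dv_version#*.}"
    dv_minor="${dv_rest%%.*}"
    [ "$dv_major" -gt 19 ] || { [ "$dv_major" -eq 19 ] && [ "$dv_minor" -ge 3 ]; }
}

# Usage on a real machine:
#   docker_supports_gpus "$(docker version --format '{{.Server.Version}}')" \
#       && echo "--gpus is available"
```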

Installing Docker

[root@ ~]# yum install -y yum-utils 
[root@ ~]# yum-config-manager \
    --add-repo \
    https://download.docker.com/linux/centos/docker-ce.repo
[root@ ~]# yum install -y docker-ce docker-ce-cli containerd.io
[root@ ~]# systemctl start docker
[root@ ~]# docker -v
Docker version 19.03.12, build 48a66213fe

Installing the GPU container runtime

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo

sudo yum install -y nvidia-container-toolkit
sudo systemctl restart docker
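
The distribution variable above is built by sourcing /etc/os-release and joining ID and VERSION_ID, which gives centos7 on this instance. The sketch below wraps the same idea in a function that takes the file path as a parameter, so the result can be inspected before it is put into the repository URL; nvidia_repo_distribution is a hypothetical name.

```shell
# Hypothetical helper: print ID and VERSION_ID from an os-release style file,
# joined without a separator (for example "centos7").
# The file is sourced in a subshell so the variables do not leak to the caller.
nvidia_repo_distribution() {
    os_release_file="${1:-/etc/os-release}"
    ( . "$os_release_file" && printf '%s%s\n' "$ID" "$VERSION_ID" )
}

# Usage on a real machine:
#   distribution=$(nvidia_repo_distribution)
```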

Verify that Docker can use the GPU

docker run --help | grep -i gpus
      --gpus gpu-request               GPU devices to add to the container ('all' to pass all GPUs)
      
docker run -it --rm --gpus all ubuntu nvidia-smi
Thu Aug 27 09:57:43 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla M60           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   32C    P8    15W / 150W |      0MiB /  7618MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

This shows that GPU containers can run

Error summary

Error when running sudo /bin/sh ./NVIDIA-Linux-x86_64*.run

Nouveau kernel driver

ERROR: The Nouveau kernel driver is currently in use by your system. This driver is incompatible with the NVIDIA driver, and must be disabled before proceeding. Please consult
the NVIDIA driver README and your Linux distribution’s documentation for details on how to correctly disable the Nouveau kernel driver.

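This means the Nouveau driver is in use by the system and conflicts with the NVIDIA driver, so it must be disabled before the installation can continue. Nouveau is the GPU driver that CentOS installs by default. It is a community project that develops a third-party open-source 3D driver for NVIDIA cards, it has been developed without any support from NVIDIA, and it counts as an X.Org Foundation project.

The fix is as follows:

Disable Nouveau using the blacklist.conf and /etc/default/grub steps described earlier, then reboot.

Check that the nouveau driver is not loaded; the following command should print nothing: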

lsmod | grep nouveau
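
grep matches nouveau anywhere on a line, including the "Used by" column of other modules. A stricter check compares only the module name in the first column. This is a sketch; module_loaded is a hypothetical name.

```shell
# Hypothetical helper: read lsmod output on stdin and succeed only if the
# named module appears in the first column.
module_loaded() {
    awk -v mod="$1" '$1 == mod { found = 1 } END { exit !found }'
}

# Usage on a real machine:
#   lsmod | module_loaded nouveau && echo "nouveau is still loaded"
```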

Error after installing the GPU container runtime

[root@ ~]# docker run -it --rm --gpus all ubuntu nvidia-smi
Unable to find image 'ubuntu:latest' locally
latest: Pulling from library/ubuntu
54ee1f796a1e: Pull complete 
f7bfea53ad12: Pull complete 
46d371e02073: Pull complete 
b66c17bbf772: Pull complete 
Digest: sha256:31dfb10d52ce76c5ca0aa19d10b3e6424b830729e32a89a7c6eee2cda2be67a5
Status: Downloaded newer image for ubuntu:latest
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

Fix: restart Docker

systemctl restart docker

If that still does not work, the GPU container runtime has not been installed
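
The two errors in this summary can be folded into a small lookup that maps an error message to the remedy described in this article. This is only a sketch covering these two messages; explain_gpu_error is a hypothetical name.

```shell
# Hypothetical helper: map an error message from this article to its remedy.
explain_gpu_error() {
    case "$1" in
        *"Nouveau kernel driver is currently in use"*)
            echo "disable nouveau, reboot, then rerun the NVIDIA installer" ;;
        *"could not select device driver"*)
            echo "restart docker; if it persists, install nvidia-container-toolkit" ;;
        *)
            echo "unrecognized error" ;;
    esac
}

# Usage:
#   explain_gpu_error "$(docker run --rm --gpus all ubuntu nvidia-smi 2>&1)"
```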