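Note: this is something I learned at the second internet company I joined after graduating from university.
Background
My manager asked me to buy a GPU server on AWS and to install and deploy the integrated development environment for a 3D modeling project on it.
Installing the NVIDIA GPU driver and CUDA
I bought a g3s.xlarge EC2 instance, chose CentOS 7, configured the network, disk, security group, and tags, and then booted it to start the installation.
Check the GPU model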
[root@localhost ~]# yum install pciutils -y
[root@localhost ~]# lspci | grep -i NVIDIA
00:1e.0 VGA compatible controller: NVIDIA Corporation GM204GL [Tesla M60] (rev a1)
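When this check has to run in a script, it can help to pull just the model name out of the lspci line. The following is a minimal sketch, assuming the model is the text inside the first pair of square brackets on an NVIDIA line, as in the output above; gpu_model_from_lspci is a hypothetical helper name, not part of pciutils.

```shell
# Hypothetical helper: read `lspci` output on stdin and print the text inside
# the first [...] of every line that mentions NVIDIA (case-insensitive).
gpu_model_from_lspci() {
    awk 'tolower($0) ~ /nvidia/ && match($0, /\[[^]]*\]/) {
        print substr($0, RSTART + 1, RLENGTH - 2)
    }'
}

# Only query the real PCI bus when lspci is installed.
if command -v lspci >/dev/null 2>&1; then
    lspci | gpu_model_from_lspci
fi
```

On the instance used in this post the helper would be fed the line shown above; on a machine without an NVIDIA device it prints nothing.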
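After some Googling I found the following:
NVIDIA® Tesla® GPUs are data center GPUs designed for servers.
They speed up the most demanding high-performance computing (HPC) and hyperscale data center workloads.
Installing the NVIDIA GRID driver on a Linux instance
Preparation
Install the dependencies of the NVIDIA GRID driver and the AWS CLI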
yum update -y
yum install -y lrzsz vim wget ntpdate yum-utils zip unzip tree gcc gcc-c++ epel-release
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
reboot
Configuration
Configure the AWS CLI and make sure the IAM user has the permissions granted by the AmazonS3ReadOnlyAccess policy.
aws configure
Install the gcc compiler and the kernel headers package for the kernel version you are currently running.
yum install -y gcc kernel-devel-$(uname -r)
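The driver installer builds a kernel module, so the headers have to match the running kernel exactly. This is why the reboot after yum update matters: without it, uname -r still reports the old kernel. Below is a minimal sketch of a check, assuming the kernel-devel package places its files under /usr/src/kernels/<release> as it does on CentOS; kernel_headers_present is a hypothetical helper name.

```shell
# Hypothetical helper: succeed when a header tree exists for a kernel release.
# $1 = kernel release (defaults to the running kernel)
# $2 = directory holding per-release header trees (defaults to the CentOS path)
kernel_headers_present() (
    release=${1:-$(uname -r)}
    src_root=${2:-/usr/src/kernels}
    [ -d "$src_root/$release" ]
)

if kernel_headers_present; then
    echo "headers found for $(uname -r)"
else
    echo "no headers found for $(uname -r)"
fi
```

If the check fails right after an update, rebooting into the new kernel and installing kernel-devel again is usually the first thing to try.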
Disable nouveau, the open-source driver for NVIDIA graphics cards.
Add nouveau to the /etc/modprobe.d/blacklist.conf blacklist file.
cat << EOF | sudo tee --append /etc/modprobe.d/blacklist.conf
blacklist vga16fb
blacklist nouveau
blacklist rivafb
blacklist nvidiafb
blacklist rivatv
EOF
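Because tee --append adds the lines every time it runs, repeating the step leaves duplicates in the file. A minimal sketch of an idempotent variant follows; blacklist_module is a hypothetical helper, and the demo writes to a scratch file rather than to /etc/modprobe.d/blacklist.conf, which would need root.

```shell
# Hypothetical helper: append "blacklist <module>" to a file only when that
# exact line is not already present.
# $1 = module name, $2 = blacklist file
blacklist_module() {
    touch "$2"
    grep -qxF "blacklist $1" "$2" || printf 'blacklist %s\n' "$1" >> "$2"
}

# Demo against a scratch file with the same five modules as above.
demo_conf=$(mktemp)
for module in vga16fb nouveau rivafb nvidiafb rivatv; do
    blacklist_module "$module" "$demo_conf"
done
blacklist_module nouveau "$demo_conf"   # repeated call adds nothing
cat "$demo_conf"
```

The grep options -x and -F make the comparison a whole-line, literal match, so a commented-out or partial entry does not count as present.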
Edit the /etc/default/grub file and add the following line:
GRUB_CMDLINE_LINUX="rdblacklist=nouveau"
Rebuild the Grub configuration
grub2-mkconfig -o /boot/grub2/grub.cfg
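One caveat with the Grub edit: /etc/default/grub is read as a list of shell assignments, so if the file already has a GRUB_CMDLINE_LINUX line, a second one added after it replaces the earlier value and drops whatever kernel arguments were there. Here is a minimal sketch that appends the argument to the existing value instead. grub_add_arg is a hypothetical helper, the sample file contents are invented for the demo, and the argument is assumed to contain no slashes or regex metacharacters.

```shell
# Hypothetical helper: add one kernel argument to GRUB_CMDLINE_LINUX in the
# given file, keeping existing arguments; does nothing if already present.
# $1 = kernel argument, $2 = path to a grub defaults file
grub_add_arg() {
    if grep -q "^GRUB_CMDLINE_LINUX=.*$1" "$2"; then
        :   # argument already there, nothing to do
    elif grep -q '^GRUB_CMDLINE_LINUX="' "$2"; then
        grub_tmp=$(mktemp)
        sed "s/^GRUB_CMDLINE_LINUX=\"\(.*\)\"/GRUB_CMDLINE_LINUX=\"\1 $1\"/" "$2" > "$grub_tmp" \
            && cat "$grub_tmp" > "$2"
        rm -f "$grub_tmp"
    else
        printf 'GRUB_CMDLINE_LINUX="%s"\n' "$1" >> "$2"
    fi
}

# Demo on a scratch file (contents invented for illustration).
demo_grub=$(mktemp)
printf '%s\n' 'GRUB_TIMEOUT=1' 'GRUB_CMDLINE_LINUX="console=tty0 crashkernel=auto"' > "$demo_grub"
grub_add_arg rdblacklist=nouveau "$demo_grub"
cat "$demo_grub"
rm -f "$demo_grub"
```

On the real host the target would be /etc/default/grub, followed by the grub2-mkconfig command above.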
Deployment
Download the GRID driver installation utility
aws s3 cp --recursive s3://ec2-linux-nvidia-drivers/latest/ .
chmod +x NVIDIA-Linux-x86_64*.run
sudo /bin/sh ./NVIDIA-Linux-x86_64*.run
reboot
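When prompted, accept the license agreement and specify the installation options as needed.
Some of the options that come up during installation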
- The distribution-provided pre-install script failed! Are you sure you want to continue?
Choose Yes to continue.
- Would you like to register the kernel module sources with DKMS? This will allow DKMS to automatically build a new module, if you install a different kernel later?
Choose No to continue.
- Nvidia’s 32-bit compatibility libraries?
Choose No to continue.
- Would you like to run the nvidia-xconfig utility to automatically update your X configuration so that the NVIDIA X driver will be used when you restart X? Any pre-existing X configuration file will be backed up.
Choose Yes to continue.
Once deployment is done, run the following command; output like this means the installation succeeded
[root@ ~]# nvidia-smi
Thu Aug 27 06:15:12 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla M60           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   33C    P8    15W / 150W |      0MiB /  7618MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
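The table is convenient to read but awkward to check from a script. nvidia-smi can also print selected fields as CSV through its --query-gpu and --format options. The sketch below turns that CSV into one summary line per GPU; summarize_gpus is a hypothetical helper, and the field list is just one reasonable choice.

```shell
# Hypothetical helper: read "name, driver_version, memory.total" CSV rows on
# stdin and print one summary line per GPU, numbered from 0.
summarize_gpus() {
    awk -F', ' '{ printf "GPU %d: %s (driver %s, %s)\n", NR - 1, $1, $2, $3 }'
}

# Only call the driver tool when it is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader | summarize_gpus
fi
```

An empty result on a GPU instance is a sign that the driver did not load, which is worth checking before moving on to Docker.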
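Running the 3D program
The company's 3D project is already deployed with Docker. According to the Docker documentation, from version 19.03 onward it is enough to pass --gpus all to docker run (this uses all GPUs). To use two GPUs, pass --gpus 2; specific cards can also be selected directly with --gpus '"device=1,2"'.
Before Docker 19.03, nvidia-docker1 or nvidia-docker2 had to be downloaded separately in order to start GPU containers.
The latest Docker version at the time of writing is 19.03.12, so nvidia-docker does not need to be installed; the GPU container runtime installed in a later step is still required.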
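The device form needs two layers of quoting because Docker parses the --gpus value as comma-separated fields, so the inner double quotes have to reach Docker intact. The sketch below builds the three forms mentioned above; gpus_value is a hypothetical helper, and the mode names all, count, and devices are made up for this example.

```shell
# Hypothetical helper: print the value to pass to `docker run --gpus`.
# Usage: gpus_value all | gpus_value count N | gpus_value devices 1,2
gpus_value() {
    case $1 in
        all)     printf '%s\n' all ;;
        count)   printf '%s\n' "$2" ;;
        devices) printf '"device=%s"\n' "$2" ;;
        *)       echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# Show the resulting command lines without starting any container.
echo "docker run --rm --gpus $(gpus_value all) ubuntu nvidia-smi"
echo "docker run --rm --gpus $(gpus_value count 2) ubuntu nvidia-smi"
echo "docker run --rm --gpus '$(gpus_value devices 1,2)' ubuntu nvidia-smi"
```

In a real script the value would be passed as a single quoted argument, for example --gpus "$(gpus_value devices 1,2)".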
Installing Docker
[root@ ~]# yum install -y yum-utils
[root@ ~]# yum-config-manager \
--add-repo \
https://download.docker.com/linux/centos/docker-ce.repo
[root@ ~]# yum install -y docker-ce docker-ce-cli containerd.io
[root@ ~]# systemctl start docker
[root@ ~]# docker -v
Docker version 19.03.12, build 48a66213fe
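Since the --gpus flag depends on the Docker version, a provisioning script may want to check the version before relying on it. This is a minimal sketch that parses a banner in the format printed above and compares it against 19.03; both function names are hypothetical, and the banner format is assumed to stay as shown.

```shell
# Hypothetical helper: extract "19.03.12" from a `docker -v` banner on stdin.
docker_version_from_banner() {
    sed -n 's/^Docker version \([0-9][0-9.]*\).*/\1/p'
}

# Hypothetical helper: succeed when the version ($1) is 19.03 or newer.
docker_has_gpus_flag() (
    major=${1%%.*}
    rest=${1#*.}
    minor=${rest%%.*}
    minor=${minor#0}            # "03" -> "3", "0" -> ""
    [ "$major" -gt 19 ] || { [ "$major" -eq 19 ] && [ "${minor:-0}" -ge 3 ]; }
)

# Demo with the banner from this post instead of calling docker.
version=$(echo 'Docker version 19.03.12, build 48a66213fe' | docker_version_from_banner)
if docker_has_gpus_flag "$version"; then
    echo "Docker $version supports --gpus"
else
    echo "Docker $version is too old for --gpus"
fi
```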
Installing the GPU container runtime
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
sudo yum install -y nvidia-container-toolkit
sudo systemctl restart docker
Verify that Docker can use the GPU
docker run --help | grep -i gpus
--gpus gpu-request GPU devices to add to the container ('all' to pass all GPUs)
docker run -it --rm --gpus all ubuntu nvidia-smi
Thu Aug 27 09:57:43 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla M60           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   32C    P8    15W / 150W |      0MiB /  7618MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
This shows that containers can use the GPU.
Error summary
Error when running sudo /bin/sh ./NVIDIA-Linux-x86_64*.run
Nouveau kernel driver
ERROR: The Nouveau kernel driver is currently in use by your system. This driver is incompatible with the NVIDIA driver, and must be disabled before proceeding. Please consult
the NVIDIA driver README and your Linux distribution’s documentation for details on how to correctly disable the Nouveau kernel driver.
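The Nouveau driver is in use by the system and conflicts with the NVIDIA driver, so it has to be disabled before the installation can continue. Nouveau is the graphics driver that CentOS installs by default. It is a community project, run by enthusiasts, that develops a third-party open-source 3D driver for NVIDIA cards, and it has been developed without any support from NVIDIA. Nouveau counts as a project of the X.Org Foundation.
The fix is as follows:
Disable Nouveau as described in the Configuration section above. The blacklist only takes effect after a reboot.
Then check that the nouveau driver is not loaded: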
lsmod | grep nouveau
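A plain grep matches the word anywhere on a line, including the "Used by" column of other modules. The sketch below matches only on the module name column; module_listed is a hypothetical helper that reads lsmod output on stdin.

```shell
# Hypothetical helper: read `lsmod` output on stdin and succeed only when the
# first column is exactly the given module name.
module_listed() {
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# Only inspect the running system when lsmod is available.
if command -v lsmod >/dev/null 2>&1; then
    if lsmod | module_listed nouveau; then
        echo "nouveau is still loaded"
    else
        echo "nouveau is not loaded"
    fi
fi
```

If nouveau is still listed after the blacklist and Grub changes, the instance most likely has not been rebooted since they were made.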
Error after installing the GPU container runtime
[root@ ~]# docker run -it --rm --gpus all ubuntu nvidia-smi
Unable to find image 'ubuntu:latest' locally
latest: Pulling from library/ubuntu
54ee1f796a1e: Pull complete
f7bfea53ad12: Pull complete
46d371e02073: Pull complete
b66c17bbf772: Pull complete
Digest: sha256:31dfb10d52ce76c5ca0aa19d10b3e6424b830729e32a89a7c6eee2cda2be67a5
Status: Downloaded newer image for ubuntu:latest
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
Fix: restart Docker
systemctl restart docker
If that still does not help, the GPU container runtime is not installed.
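For an automated setup, the same diagnosis can be attached to the error text itself. This is a minimal sketch that looks for the message shown above in Docker's error output and prints the two fixes from this section; gpu_error_hint is a hypothetical helper, and the wording of the hints is my own.

```shell
# Hypothetical helper: read Docker's error output on stdin and print a hint
# when it contains the "could not select device driver" GPU error.
gpu_error_hint() {
    if grep -q 'could not select device driver .* with capabilities: \[\[gpu\]\]'; then
        echo "restart docker; if the error persists, install nvidia-container-toolkit"
    else
        echo "no known GPU runtime error found"
    fi
}

# Demo with the error line from this post.
echo 'docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].' | gpu_error_hint
```

In practice the input would come from the failed command, for example by redirecting the stderr of docker run into the helper.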