Installing the NVIDIA Driver and CUDA
Preliminary environment setup
Identify the NVIDIA GPU
[root@localhost ~]# lspci -nn | grep NV
1a:00.0 3D controller [0302]: NVIDIA Corporation GP102GL [Tesla P40] [10de:1b38] (rev a1)
[root@localhost ~]# lshw -numeric -C display
*-display
description: VGA compatible controller
product: ASPEED Graphics Family [1A03:2000]
vendor: ASPEED Technology, Inc. [1A03]
physical id: 0
bus info: pci@0000:03:00.0
version: 41
width: 32 bits
clock: 33MHz
capabilities: pm msi vga_controller bus_master cap_list rom
configuration: driver=ast latency=0
resources: irq:17 memory:9c000000-9cffffff memory:9d000000-9d01ffff ioport:2000(size=128) memory:c0000-dffff
*-display
description: 3D controller
product: GP102GL [Tesla P40] [10DE:1B38]
vendor: NVIDIA Corporation [10DE]
physical id: 0
bus info: pci@0000:1a:00.0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress bus_master cap_list
configuration: driver=nvidia latency=0
resources: iomemory:38700-386ff iomemory:38780-3877f irq:392 memory:a9000000-a9ffffff memory:387000000000-3877ffffffff memory:387800000000-387801ffffff
Check whether an NVIDIA GPU (hardware) is installed
[root@localhost local]# lspci | grep -i nvidia
09:00.0 VGA compatible controller: NVIDIA Corporation GM204GL [Tesla M6] (rev a1)
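Install GCC, kernel components, and dkms
yum install -y gcc
yum install -y gcc-c++
yum install -y elfutils-libelf-devel
yum install -y kernel-devel
Check that the running kernel version matches the installed kernel-devel (source) version, which must be identical, and check whether the nouveau driver is loaded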
[root@localhost pkg]# ls /boot | grep vmlinu
vmlinuz-0-rescue-f89c734ac8c2471a948b2b8e7cea7df3
vmlinuz-3.10.0-957.el7.x86_64
[root@localhost pkg]# rpm -aq | grep kernel-devel
kernel-devel-3.10.0-957.el7.x86_64
[root@localhost pkg]# lsmod | grep nouveau
nouveau 1869689 0
video 24538 1 nouveau
mxm_wmi 13021 1 nouveau
i2c_algo_bit 13413 2 ast,nouveau
drm_kms_helper 179394 2 ast,nouveau
ttm 114635 2 ast,nouveau
drm 429744 6 ast,ttm,drm_kms_helper,nouveau
wmi 21636 2 mxm_wmi,nouveau
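The kernel comparison above can also be scripted. The following is a minimal sketch and not part of the original procedure: kernel_devel_matches is a hypothetical helper name, and the sample values are copied from the output above.

```shell
# Hypothetical helper (not from the original notes): succeeds when the
# package list contains a kernel-devel package for the given kernel release.
#   $1: kernel release, e.g. the output of `uname -r`
#   $2: installed packages, one per line, e.g. the output of `rpm -qa kernel-devel`
kernel_devel_matches() {
    printf '%s\n' "$2" | grep -qxF -- "kernel-devel-$1"
}

# Sample values taken from the output shown above.
if kernel_devel_matches "3.10.0-957.el7.x86_64" "kernel-devel-3.10.0-957.el7.x86_64"; then
    echo "kernel and kernel-devel versions match"
else
    echo "version mismatch: install the kernel-devel package for the running kernel"
fi
```

On the server itself the call would be kernel_devel_matches "$(uname -r)" "$(rpm -qa kernel-devel)".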
Blacklist the default nouveau driver
## vim /lib/modprobe.d/dist-blacklist.conf
# watchdog drivers
blacklist i8xx_tco
# framebuffer drivers
blacklist aty128fb
blacklist atyfb
blacklist radeonfb
blacklist i810fb
blacklist cirrusfb
blacklist intelfb
blacklist kyrofb
blacklist i2c-matroxfb
blacklist hgafb
#blacklist nvidiafb
blacklist rivafb
blacklist savagefb
blacklist sstfb
blacklist neofb
blacklist tridentfb
blacklist tdfxfb
blacklist virgefb
blacklist vga16fb
blacklist viafb
Add the following lines:
blacklist nouveau
options nouveau modeset=0
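If the edit is scripted rather than done in vim, it helps to make it repeatable. The sketch below assumes a hypothetical helper named ensure_line and demonstrates it on a scratch file. On the server the target would be the dist-blacklist.conf file edited above, and the script would need to run as root.

```shell
# Hypothetical helper: append a line to a file only when that exact line
# is not already present, so the edit can be repeated safely.
#   $1: file path   $2: line to add
ensure_line() {
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Demonstration on a scratch file instead of the real blacklist file.
demo_conf="$(mktemp)"
ensure_line "$demo_conf" "blacklist nouveau"
ensure_line "$demo_conf" "options nouveau modeset=0"
ensure_line "$demo_conf" "blacklist nouveau"   # already present: no change
cat "$demo_conf"
rm -f "$demo_conf"
```

The helper assumes the target file ends with a newline; otherwise the first appended line would be joined to the last existing one.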
Rebuild the initramfs image
[root@localhost pkg]# mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
[root@localhost pkg]# dracut /boot/initramfs-$(uname -r).img $(uname -r)
Change the default run level to text mode
[root@localhost pkg]# systemctl set-default multi-user.target
Removed symlink /etc/systemd/system/default.target.
Created symlink from /etc/systemd/system/default.target to /usr/lib/systemd/system/multi-user.target.
Reboot the server: reboot
Check that nouveau is disabled (no output means the module is not loaded)
[root@localhost ~]# lsmod | grep nouveau
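A plain grep also matches lines where nouveau appears only in the "used by" column. A stricter check compares the first column only. This is a sketch with a hypothetical helper name, is_module_loaded, fed here with two sample lines from the earlier lsmod output.

```shell
# Hypothetical helper: reads lsmod-style text on stdin and succeeds only
# when the first column of some line equals the module name in $1.
is_module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}

# Sample lines copied from the lsmod output taken before the reboot.
sample='nouveau 1869689 0
video 24538 1 nouveau'
if printf '%s\n' "$sample" | is_module_loaded nouveau; then
    echo "nouveau is still loaded"
else
    echo "nouveau is not loaded"
fi
```

On the server the check would be lsmod | is_module_loaded nouveau.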
Install the NVIDIA driver and CUDA
Make the .run file executable
chmod a+x cuda_10.0.130_410.48_linux.run
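Install CUDA
./cuda_10.0.130_410.48_linux.run --no-opengl-libs
The --no-opengl-libs option skips installation of the OpenGL libraries, which avoids the problem of being unable to enter the graphical desktop afterwards.
Install the NVIDIA driver
To install the NVIDIA driver separately, run the driver installer. The --no-opengl-files option skips installation of the OpenGL files, which avoids the problem of being unable to enter the graphical desktop afterwards.
sudo ./NVIDIA.run --no-x-check --no-nouveau-check --no-opengl-files
When the license agreement appears, type accept and press Enter. If the NVIDIA driver has already been installed, press Enter to deselect the NVIDIA driver item, leave the other items unchanged, and choose Install.
Set environment variables
Add the following to ~/.bashrc: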
export PATH="/usr/local/cuda-10.0/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-10.0/lib64:$LD_LIBRARY_PATH"
export CUDACXX="/usr/local/cuda-10.0/bin/nvcc"
Run source ~/.bashrc to apply the environment variables in the current shell.
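Sourcing ~/.bashrc repeatedly adds the same directories again, and when LD_LIBRARY_PATH is initially empty the export above leaves a trailing colon, which the dynamic loader treats as the current directory. One possible way to avoid both is sketched below. The helper name prepend_once is hypothetical and the paths are the ones used above.

```shell
# Hypothetical helper: print the colon-separated list $2 with directory $1
# placed in front, unless $1 is already an entry. An empty list yields
# just the directory, with no stray colon.
prepend_once() {
    case ":$2:" in
        *":$1:"*) printf '%s\n' "$2" ;;
        *) if [ -n "$2" ]; then printf '%s\n' "$1:$2"; else printf '%s\n' "$1"; fi ;;
    esac
}

export PATH="$(prepend_once /usr/local/cuda-10.0/bin "$PATH")"
export LD_LIBRARY_PATH="$(prepend_once /usr/local/cuda-10.0/lib64 "${LD_LIBRARY_PATH:-}")"
```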
Check that CUDA was installed successfully
[root@localhost local]# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
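For use in scripts, the release number can be extracted from this output. This is a small sketch. The helper name nvcc_release is hypothetical, and the pattern assumes the "release X.Y," wording shown above.

```shell
# Hypothetical helper: reads `nvcc -V` output on stdin and prints the
# toolkit release number from the "release X.Y," part of the last line.
nvcc_release() {
    sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p'
}

# Sample line copied from the output above.
echo "Cuda compilation tools, release 10.0, V10.0.130" | nvcc_release
```

On the server the call would be nvcc -V | nvcc_release.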
Verify that the NVIDIA driver was installed successfully
[root@localhost local]# nvidia-smi
Fri Jun 22 08:07:11 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla M6            Off  | 00000000:09:00.0 Off |                  Off |
| N/A   43C    P0    25W / 100W |      0MiB /  8129MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Verify the installation with the CUDA sample program
[root@localhost cuda-10.0]# cd /usr/local/cuda-10.0/samples/1_Utilities/deviceQuery
[root@localhost deviceQuery]# sudo make
[root@localhost deviceQuery]# ./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "Tesla M6"
CUDA Driver Version / Runtime Version 11.0 / 10.0
CUDA Capability Major/Minor version number: 5.2
Total amount of global memory: 8129 MBytes (8524136448 bytes)
(12) Multiprocessors, (128) CUDA Cores/MP: 1536 CUDA Cores
GPU Max Clock rate: 1050 MHz (1.05 GHz)
Memory Clock rate: 2300 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 2097152 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: No
Supports Cooperative Kernel Launch: No
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 9 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.0, CUDA Runtime Version = 10.0, NumDevs = 1
Result = PASS
If Result = PASS appears and every GPU is detected, the GPU driver and CUDA were installed successfully.
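The output above reports a CUDA driver version of 11.0 and a runtime version of 10.0. The driver-supported version has to be at least the runtime version. A scripted comparison might look like the following sketch, where version_ge is a hypothetical helper that compares dotted version numbers field by field.

```shell
# Hypothetical helper: succeeds when dotted version $1 >= dotted version $2.
# Fields are compared numerically, so 10.10 is newer than 10.2.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = (i <= na) ? x[i] + 0 : 0
            yi = (i <= nb) ? y[i] + 0 : 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}

# Values taken from the deviceQuery output above: driver 11.0, runtime 10.0.
if version_ge "11.0" "10.0"; then
    echo "driver supports this CUDA runtime"
else
    echo "driver is too old for this CUDA runtime"
fi
```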
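Troubleshooting
Conflict between vncserver and the NVIDIA driver
Symptoms
1. If vncserver is running while the NVIDIA driver is being installed, the installation fails.
2. If vncserver is started after the NVIDIA driver is installed, the remote desktop shows a black screen.
Cause
The OpenGL libraries bundled with the NVIDIA driver conflict with the system's OpenGL, which breaks the graphical interface.
Solution
Uninstall the NVIDIA driver, then reinstall it without the OpenGL files.
Commands:
systemctl set-default multi-user.target
reboot
./NVIDIA-Linux-x86_64-510.47.03.run --no-opengl-files
systemctl set-default graphical.target
reboot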
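Query the default target:
systemctl get-default
Set the default to the command-line interface:
systemctl set-default multi-user.target
Set the default to the graphical interface:
systemctl set-default graphical.target
Contents of the inittab file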
[root@localhost etc]# cat inittab
# inittab is no longer used when using systemd.
#
# ADDING CONFIGURATION HERE WILL HAVE NO EFFECT ON YOUR SYSTEM.
#
# Ctrl-Alt-Delete is handled by /usr/lib/systemd/system/ctrl-alt-del.target
#
# systemd uses 'targets' instead of runlevels. By default, there are two main targets:
#
# multi-user.target: analogous to runlevel 3
# graphical.target: analogous to runlevel 5
#
# To view current default target, run:
# systemctl get-default
#
# To set a default target, run:
# systemctl set-default TARGET.target
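CUDA 10.2 driver installation fails on CentOS 8
Symptoms
After running the CUDA 10.2 .run file, the log reports that both the NVIDIA 440.33 driver installation and the CUDA installation failed.
View the log: cat /var/log/nvidia-installer.log
Solution
In installation tests, installing the NVIDIA driver on its own also failed. CUDA 10.2 is confirmed to support CentOS 8, and the GPU is an NVIDIA P40, so a missing dependency was suspected.
Investigation showed that the elfutils-libelf-devel package was missing; installing it (yum install -y elfutils-libelf-devel) resolves the failure.
Check whether it is already installed with: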
rpm -qa | grep elfutils-libelf-devel
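To check all the build prerequisites from the setup section in one pass, something like the following sketch could be used. The helper name missing_packages is hypothetical, and the demonstration feeds it a made-up list of installed package names rather than real rpm output.

```shell
# Hypothetical helper: prints every required package name that is absent
# from the installed list.
#   $1: required package names, separated by spaces
#   stdin: installed package names, one per line
missing_packages() {
    mp_installed="$(cat)"
    for mp_pkg in $1; do
        printf '%s\n' "$mp_installed" | grep -qxF -- "$mp_pkg" || printf '%s\n' "$mp_pkg"
    done
}

# Made-up installed list for illustration; elfutils-libelf-devel is absent.
printf '%s\n' gcc gcc-c++ kernel-devel |
    missing_packages "gcc gcc-c++ kernel-devel elfutils-libelf-devel"
```

On the server the installed list could come from rpm -qa --qf '%{NAME}\n', which prints package names without version suffixes.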