NVIDIA Driver Installation and Error Handling

Downloading the GPU Driver Package

Driver download page: https://www.nvidia.com/Download/Find.aspx
Once you have copied the download link for your driver, fetch it with wget.

[root@node ~]# wget https://us.download.nvidia.com/tesla/450.191.01/NVIDIA-Linux-x86_64-450.191.01.run
[root@node ~]# ls
NVIDIA-Linux-x86_64-450.191.01.run
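
The download step can be scripted. The sketch below assumes that the URL layout seen in the wget command above (`tesla/<version>/NVIDIA-Linux-x86_64-<version>.run`) also applies to other data-center driver versions, which is not guaranteed. The function name `nvidia_run_url` is made up for illustration.

```shell
# Hypothetical helper: build the .run download URL for a given driver version.
# Assumption: the "tesla/<version>/" layout from the wget command above holds.
nvidia_run_url() {
    version="$1"
    printf 'https://us.download.nvidia.com/tesla/%s/NVIDIA-Linux-x86_64-%s.run\n' \
        "$version" "$version"
}

# Example (not executed here):
#   wget "$(nvidia_run_url 450.191.01)"
```

If the resulting URL returns a 404, copy the link from the download page instead.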

Installing the GPU Driver Package

[root@node ~]# sh NVIDIA-Linux-x86_64-450.191.01.run 
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 450.191.01.....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

Running the command above opens an interactive installer. Accepting the default at every prompt is sufficient.
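
The installer compiles a kernel module, so it generally needs a C compiler, make, and the headers for the running kernel. A minimal pre-flight check might look like the sketch below. `missing_cmds` is a hypothetical helper, and the tool list in the example is an assumption that may differ on your distribution.

```shell
# Hypothetical helper: print each command from the argument list that is
# not found on PATH, one per line. Prints nothing if all are present.
missing_cmds() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Example (not executed here):
#   missing_cmds gcc make
#   [ -d "/lib/modules/$(uname -r)/build" ] || echo "kernel headers not found"
```

For unattended installs, the installer also accepts a `--silent` option that answers the prompts with their defaults.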
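Once the installation finishes, verify it with nvidia-smi. Output like the following confirms that the driver installed successfully.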

[root@node ~]# nvidia-smi 
Thu Sep 22 18:02:27 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.191.01   Driver Version: 450.191.01   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:65:01.0 Off |                    0 |
| N/A   34C    P0    35W / 300W |      0MiB / 32510MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
[root@node ~]# 
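
To check the installed version from a script rather than by eye, the driver version can be pulled out of the nvidia-smi banner. This is only a sketch: `smi_driver_version` is a hypothetical helper, and it assumes the banner keeps the `Driver Version: X.Y.Z` wording shown above.

```shell
# Hypothetical helper: read nvidia-smi output on stdin and print the
# driver version found in the banner line.
smi_driver_version() {
    sed -n 's/.*Driver Version: \([0-9.]*\).*/\1/p' | head -n 1
}

# Example (not executed here):
#   [ "$(nvidia-smi | smi_driver_version)" = "450.191.01" ] && echo "driver OK"
```

nvidia-smi can also print the value directly with `nvidia-smi --query-gpu=driver_version --format=csv,noheader`, which avoids parsing the table.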

Uninstalling the GPU Driver Package

/usr/bin/nvidia-uninstall
reboot
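
If a reboot has to wait, it helps to at least check whether any NVIDIA kernel modules are still loaded, since a leftover module is what causes the error described in the next section. The sketch below reads text in the `/proc/modules` format; `loaded_nvidia_modules` is a hypothetical helper name.

```shell
# Hypothetical helper: read /proc/modules-style text on stdin and print the
# names of loaded modules that start with "nvidia", one per line.
loaded_nvidia_modules() {
    awk '$1 ~ /^nvidia/ { print $1 }'
}

# Example (not executed here):
#   loaded_nvidia_modules < /proc/modules
```

Empty output means no NVIDIA module is loaded. Any output means the driver is still resident, and a reboot (or unloading the modules) is needed before reinstalling.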

Troubleshooting the GPU Driver Installation

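The test machine already had an NVIDIA driver installed, so this was an uninstall followed by a reinstall. The uninstall did not remove everything, and the machine was not rebooted afterwards, so the installation described above initially failed with the errors shown below.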
Error text:
ERROR: An NVIDIA kernel module ‘nvidia’ appears to already be loaded in your kernel. This may be because it is in use (for example, by an X server, a CUDA program, or the NVIDIA
Persistence Daemon), but this may also happen if your kernel was configured without support for module unloading. Please be sure to exit any programs that may be using the GPU(s)
before attempting to upgrade your driver. If no GPU-based programs are running, you know that your kernel supports module unloading, and you still receive this message, then an
error may have occured that has corrupted an NVIDIA kernel module’s usage count, for which the simplest remedy is to reboot your computer.
Error text:
ERROR: Installation has failed. Please see the file ‘/var/log/nvidia-installer.log’ for details. You may find suggestions on fixing installation problems in the README available on the
Linux driver download page at www.nvidia.com.

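In short, the installer is saying that something is already using the NVIDIA kernel module, so it cannot proceed.
The message does not identify what that something is, so the first step was to check the installer log: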

[root@node ~]# tail -50 /var/log/nvidia-installer.log 
nvidia-installer log file '/var/log/nvidia-installer.log'
creation time: Thu Sep 22 17:46:57 2022
installer version: 450.191.01

PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin

nvidia-installer command line:
    ./nvidia-installer

Unable to load: nvidia-installer ncurses v6 user interface

Using: nvidia-installer ncurses user interface
-> Detected 8 CPUs online; setting concurrency level to 8.
ERROR: An NVIDIA kernel module 'nvidia' appears to already be loaded in your kernel.  This may be because it is in use (for example, by an X server, a CUDA program, or the NVIDIA Persistence Daemon), but this may also happen if your kernel was configured without support for module unloading.  Please be sure to exit any programs that may be using the GPU(s) before attempting to upgrade your driver.  If no GPU-based programs are running, you know that your kernel supports module unloading, and you still receive this message, then an error may have occured that has corrupted an NVIDIA kernel module's usage count, for which the simplest remedy is to reboot your computer.
ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
[root@node ~]#
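
On a longer log, filtering for the error and warning lines is quicker than reading the tail. A minimal sketch, assuming such lines begin with `ERROR:` or `WARNING:` as the ones above do; `installer_errors` is a hypothetical helper name.

```shell
# Hypothetical helper: print only the ERROR/WARNING lines from the installer
# log. Takes an optional log path; defaults to the path shown above.
installer_errors() {
    grep -E '^(ERROR|WARNING):' "${1:-/var/log/nvidia-installer.log}"
}

# Example (not executed here):
#   installer_errors
```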

The log only repeats the same error, but it does point in a clear direction: find whatever is still using the NVIDIA devices.

[root@node ~]# ps -ef | grep nvidia
root      1569     2  0 11:06 ?        00:00:01 [irq/86-nvidia]
root      1570     2  0 11:06 ?        00:00:00 [nvidia]
root     17161 17144  0 17:53 pts/0    00:00:00 grep --color=auto nvidia
[root@node ~]# lsof /dev/nvidia*
COMMAND   PID USER   FD   TYPE  DEVICE SIZE/OFF  NODE NAME
cloud-mon 676 root    8u   CHR 195,255      0t0 21014 /dev/nvidiactl
cloud-mon 676 root    9u   CHR   195,0      0t0 21018 /dev/nvidia0
cloud-mon 676 root   12u   CHR   195,0      0t0 21018 /dev/nvidia0
cloud-mon 676 root   13u   CHR   195,0      0t0 21018 /dev/nvidia0
[root@node ~]# ps -ef | grep 676
root       676     1  0 11:06 ?        00:00:06 /usr/local/xxxx-xxxx-agent/xxxx-xxxx-agent start
root     17167 17144  0 17:55 pts/0    00:00:00 grep --color=auto 676
[root@node ~]# kill -9 676

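That identified the culprit: an agent (cloud-mon, PID 676) was holding the NVIDIA device files open. After killing it, the reinstall completed without errors.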
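
`kill -9` works, but it gives the process no chance to clean up. A gentler variant is to collect the PIDs from the lsof output and send the default SIGTERM first. The sketch below does the PID extraction; `pids_from_lsof` is a hypothetical helper, and it assumes the default lsof column layout shown above.

```shell
# Hypothetical helper: read default lsof output on stdin and print the
# unique PIDs (second column), skipping the header row.
pids_from_lsof() {
    awk 'NR > 1 { print $2 }' | sort -un
}

# Example (not executed here): ask each holder to exit with SIGTERM.
#   for pid in $(lsof /dev/nvidia* | pids_from_lsof); do
#       kill "$pid"
#   done
```

`lsof -t /dev/nvidia*` prints the PIDs directly and can replace the helper. If the agent is run by a service manager, it may be restarted automatically, in which case stopping the service before installing is more reliable than killing the process.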

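### Fixing NVIDIA Driver Kernel Module Build Errors

Installing the NVIDIA driver compiles a kernel module, and that build can fail in several ways. Common problems and possible fixes are listed below.

#### 1. Stack frame size exceeds the limit

If the build fails with a stack frame size error, the compiler has found a function whose stack usage exceeds the configured maximum[^4]. Adjusting the compile options can resolve this:

- Raise the limit by increasing the `CONFIG_FRAME_WARN` parameter in the kernel configuration.
- Or pass `-Wno-error=frame-larger-than=` explicitly in the build command so that the warning is no longer treated as an error.

```bash
export CFLAGS="-Wno-error=frame-larger-than="
make clean && make
```

#### 2. Using a different compiler

When the Linux kernel was built with a compiler other than GCC (Clang, for example), the NVIDIA module must be built with the same toolchain[^2]. A mismatch can cause compatibility problems or link failures. Setting the `CC` environment variable selects a specific compiler:

```bash
export CC=/path/to/your/compiler/bin/clang
# --kernel-source-path points the installer at the kernel build tree
./NVIDIA-Linux-x86_64-*.run --kernel-source-path=/lib/modules/$(uname -r)/build/
```

#### 3. Installing dependencies

Missing development packages can also cause build errors. Confirm that the required components are installed and that the package index is current. The commands below are for Debian-based systems:

```bash
sudo apt-get update
sudo apt-get install build-essential dkms linux-headers-$(uname -r)
```

On Debian-based distributions, the driver can also be deployed from `.deb` packages[^3]:

```bash
cd /var/nvidia-driver-local-repo-ubuntu2404-570.124.06/
sudo dpkg -i *.deb
```

This last step usually takes care of most potential conflicts and of reloading the graphics services.

#### 4. Setting permissions and running the installer

Make the installer executable before launching it[^1]:

```bash
chmod +x NVIDIA-Linux-x86_64-*.run
sudo ./NVIDIA-Linux-x86_64-*.run
```

These steps address the most common driver build failures, particularly those caused by heavily customized kernels.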
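
For the compiler-mismatch case in section 2, it is useful to know which compiler built the running kernel before choosing `CC`. On Linux, `/proc/version` normally includes the compiler's version string. The sketch below only classifies that text; `kernel_compiler` is a hypothetical helper, and the substring matching is a rough heuristic, not an official interface.

```shell
# Hypothetical helper: guess which compiler built the kernel from the text
# of /proc/version, passed as the first argument.
kernel_compiler() {
    case "$1" in
        *clang*) echo clang ;;          # checked first: clang strings can vary
        *gcc*|*GCC*) echo gcc ;;
        *) echo unknown ;;
    esac
}

# Example (not executed here):
#   kernel_compiler "$(cat /proc/version)"
```

If the result differs from the compiler the installer would pick up by default, set `CC` to match the kernel's compiler before running the installer.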