NVIDIA Driver Installation and Error Handling

Download the GPU driver package

Driver download: https://www.nvidia.com/Download/Find.aspx
Once you have copied the download URL for your driver, fetch it with wget.

[root@node ~]# wget https://us.download.nvidia.com/tesla/450.191.01/NVIDIA-Linux-x86_64-450.191.01.run
[root@node ~]# ls
NVIDIA-Linux-x86_64-450.191.01.run
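
If the same download needs to be repeated for another driver version, the URL can be assembled from the version string. This is only a sketch: the helper name `nvidia_run_url` is hypothetical, and the path layout is an assumption inferred from the single URL used above, so confirm the real link on the download page before relying on it.

```shell
# Build the download URL for a given driver version.
# ASSUMPTION: the layout tesla/<version>/NVIDIA-Linux-x86_64-<version>.run is
# inferred from the one URL shown in this post; other versions or product
# lines may be published under a different path.
nvidia_run_url() {
    ver="$1"
    printf 'https://us.download.nvidia.com/tesla/%s/NVIDIA-Linux-x86_64-%s.run\n' "$ver" "$ver"
}

# Example (not executed here):
#   wget "$(nvidia_run_url 450.191.01)"
```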

Install the GPU driver package

[root@node ~]# sh NVIDIA-Linux-x86_64-450.191.01.run 
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 450.191.01..........

Running the command above opens an interactive installer; accepting the default at every prompt is sufficient.
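After the installation finishes, verify it. Output like the following indicates that the driver was installed successfully.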

[root@node ~]# nvidia-smi 
Thu Sep 22 18:02:27 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.191.01   Driver Version: 450.191.01   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:65:01.0 Off |                    0 |
| N/A   34C    P0    35W / 300W |      0MiB / 32510MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
[root@node ~]# 
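
For a scripted check, the driver version can be pulled out of the banner line and compared with the version of the package that was just installed. A minimal sketch, assuming the banner keeps the `Driver Version:` field shown above; the function name `smi_driver_version` is made up for illustration.

```shell
# Print the driver version found in nvidia-smi's banner (read from stdin).
# Prints nothing if no "Driver Version:" field is present.
smi_driver_version() {
    sed -n 's/.*Driver Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Example (not executed here):
#   [ "$(nvidia-smi | smi_driver_version)" = "450.191.01" ] && echo "driver OK"
```

nvidia-smi can also report the version directly with `nvidia-smi --query-gpu=driver_version --format=csv,noheader`, which avoids parsing the table.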

Uninstall the GPU driver package

/usr/bin/nvidia-uninstall
reboot
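
The reinstall failure described later in this post came from an nvidia kernel module that was still loaded. Before rerunning the installer, it may be worth confirming that no such module is listed. A rough sketch; the helper name `nvidia_module_listed` is hypothetical, and it only inspects the text that `lsmod` prints.

```shell
# Read `lsmod` output on stdin; exit 0 if any module whose name starts with
# "nvidia" (for example nvidia, nvidia_uvm, nvidia_drm) is listed, else exit 1.
nvidia_module_listed() {
    awk '$1 ~ /^nvidia/ { found = 1 } END { exit !found }'
}

# Example (not executed here):
#   if lsmod | nvidia_module_listed; then
#       echo "an nvidia module is still loaded; stop its users or reboot first"
#   fi
```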

Troubleshooting the GPU driver installation

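The test machine already had the driver packages installed, so this was an uninstall followed by a reinstall. The uninstall did not remove everything, and the machine was not rebooted afterward, so the installation steps above failed with the errors below.
Error text: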
ERROR: An NVIDIA kernel module ‘nvidia’ appears to already be loaded in your kernel. This may be because it is in use (for example, by an X server, a CUDA program, or the NVIDIA
Persistence Daemon), but this may also happen if your kernel was configured without support for module unloading. Please be sure to exit any programs that may be using the GPU(s)
before attempting to upgrade your driver. If no GPU-based programs are running, you know that your kernel supports module unloading, and you still receive this message, then an
error may have occured that has corrupted an NVIDIA kernel module’s usage count, for which the simplest remedy is to reboot your computer.
Error text:
ERROR: Installation has failed. Please see the file ‘/var/log/nvidia-installer.log’ for details. You may find suggestions on fixing installation problems in the README available on the
Linux driver download page at www.nvidia.com.

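In short, the message says that something is already using the nvidia kernel module, which makes the installer fail.
The exact cause was not obvious from the message alone, so the first step was to check the log: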

[root@node ~]# tail -50 /var/log/nvidia-installer.log 
nvidia-installer log file '/var/log/nvidia-installer.log'
creation time: Thu Sep 22 17:46:57 2022
installer version: 450.191.01

PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin

nvidia-installer command line:
    ./nvidia-installer

Unable to load: nvidia-installer ncurses v6 user interface

Using: nvidia-installer ncurses user interface
-> Detected 8 CPUs online; setting concurrency level to 8.
ERROR: An NVIDIA kernel module 'nvidia' appears to already be loaded in your kernel.  This may be because it is in use (for example, by an X server, a CUDA program, or the NVIDIA Persistence Daemon), but this may also happen if your kernel was configured without support for module unloading.  Please be sure to exit any programs that may be using the GPU(s) before attempting to upgrade your driver.  If no GPU-based programs are running, you know that your kernel supports module unloading, and you still receive this message, then an error may have occured that has corrupted an NVIDIA kernel module's usage count, for which the simplest remedy is to reboot your computer.
ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
[root@node ~]#
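
When the log is longer than this one, filtering it down to the error lines can be quicker than reading the tail. A small sketch, assuming error lines start with `ERROR:` as they do in the excerpt above; `installer_errors` is an illustrative name, not part of the installer.

```shell
# Print only the ERROR lines of an installer log.
# The default path is the one named in the installer's own error message.
installer_errors() {
    grep '^ERROR:' "${1:-/var/log/nvidia-installer.log}"
}

# Example (not executed here):
#   installer_errors
#   installer_errors /path/to/a/saved/copy.log
```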

The log added nothing beyond the on-screen error, but it pointed in a clear direction: find the processes that are using the nvidia devices.

[root@node ~]# ps -ef | grep nvidia
root      1569     2  0 11:06 ?        00:00:01 [irq/86-nvidia]
root      1570     2  0 11:06 ?        00:00:00 [nvidia]
root     17161 17144  0 17:53 pts/0    00:00:00 grep --color=auto nvidia
[root@node ~]# lsof /dev/nvidia*
COMMAND   PID USER   FD   TYPE  DEVICE SIZE/OFF  NODE NAME
cloud-mon 676 root    8u   CHR 195,255      0t0 21014 /dev/nvidiactl
cloud-mon 676 root    9u   CHR   195,0      0t0 21018 /dev/nvidia0
cloud-mon 676 root   12u   CHR   195,0      0t0 21018 /dev/nvidia0
cloud-mon 676 root   13u   CHR   195,0      0t0 21018 /dev/nvidia0
[root@node ~]# ps -ef | grep 676
root       676     1  0 11:06 ?        00:00:06 /usr/local/xxxx-xxxx-agent/xxxx-xxxx-agent start
root     17167 17144  0 17:55 pts/0    00:00:00 grep --color=auto 676
[root@node ~]# kill -9 676

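That identified the culprit: an agent process was holding the nvidia devices open. After killing it and rerunning the installer, the installation succeeded.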
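
The lookup above can be condensed into one step that lists each process holding an nvidia device exactly once. A minimal sketch that works on the text `lsof` prints; the function name `lsof_pids` is hypothetical.

```shell
# Read `lsof` output on stdin, skip the header line, and print each
# distinct PID once (column 2 of the listing).
lsof_pids() {
    awk 'NR > 1 { print $2 }' | sort -un
}

# Example (not executed here):
#   lsof /dev/nvidia* | lsof_pids
```

`lsof -t /dev/nvidia*` prints bare PIDs as well. It is usually gentler to try a plain `kill` (SIGTERM) first and fall back to `kill -9` only if the process does not exit.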
