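My code had been running fine until one day the GPU suddenly stopped working: running the model failed with RuntimeError: No CUDA GPUs are available
1. First, check whether the GPU is usable. In a Python console, enter the following in order: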
import torch
print(torch.cuda.device_count())  # or: print(torch.cuda.is_available())
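If the output is nonzero (or True), the GPU is usable; check whether the GPU index your code uses is correct and whether the CUDA environment is installed properly.
If the output is 0 (or False), or running nvidia-smi in a terminal reports an error, the GPU is not usable. In my case the cause was a Linux kernel upgrade that left the GPU driver mismatched with the kernel, and one fix is to roll the driver back to its previous version. Opening Additional Drivers in Software & Updates also showed that the driver had been upgraded automatically. I had originally installed version 530, so I tried downgrading the driver first.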
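The two checks above (PyTorch and nvidia-smi) can be bundled into one small script. This is only a sketch: the function name gpu_diagnostics and the keys of the returned dictionary are made up for illustration, and PyTorch is treated as optional so that the script still runs in an environment where it is missing.

```python
import shutil
import subprocess


def gpu_diagnostics():
    """Collect basic facts about GPU visibility without raising.

    The function name and report keys are illustrative, not from any library.
    """
    report = {
        "nvidia_smi_found": False,   # is the nvidia-smi binary on PATH?
        "nvidia_smi_ok": False,      # did nvidia-smi exit with status 0?
        "torch_device_count": None,  # None means PyTorch could not be imported
    }

    smi = shutil.which("nvidia-smi")
    if smi is not None:
        report["nvidia_smi_found"] = True
        try:
            result = subprocess.run(
                [smi], capture_output=True, text=True, timeout=30
            )
            report["nvidia_smi_ok"] = result.returncode == 0
        except (OSError, subprocess.TimeoutExpired):
            report["nvidia_smi_ok"] = False

    try:
        import torch  # optional dependency

        report["torch_device_count"] = torch.cuda.device_count()
    except Exception:
        # PyTorch is missing or failed to load; leave the count as None.
        pass

    return report


if __name__ == "__main__":
    for key, value in gpu_diagnostics().items():
        print(f"{key}: {value}")
```

If nvidia_smi_ok is False while the driver is supposedly installed, the problem is probably at the driver/kernel level rather than in the Python environment.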
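2. Look up the previous driver version on the NVIDIA website: under Drivers near the bottom of the page, search for available drivers in the beta drivers section.
I downloaded NVIDIA-Linux-x86_64-530.41.03.run (I remembered that this was the version I had before).
3. Uninstall the existing driver and reboot. Make sure it is completely removed; leftovers will cause errors during installation.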
sudo apt-get remove --purge 'nvidia*'
reboot
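To confirm that the removal really was clean, one option is to scan the output of dpkg -l for remaining packages. The sketch below only parses text you pass in; the function name leftover_nvidia_packages, the prefix parameter, and the sample rows (including their placeholder version numbers) are invented for illustration.

```python
import re

# A package row in `dpkg -l` starts with a 2-3 letter status code
# (for example "ii" = installed, "rc" = removed but config files remain),
# followed by the package name. Header rows do not match this pattern.
_ROW = re.compile(r"^([a-zA-Z]{2,3})\s+(\S+)")


def leftover_nvidia_packages(dpkg_l_output, prefix="nvidia"):
    """Return (status, name) pairs for packages whose name starts with prefix."""
    leftovers = []
    for line in dpkg_l_output.splitlines():
        match = _ROW.match(line)
        if match is None:
            continue
        status, name = match.groups()
        if name.startswith(prefix):
            leftovers.append((status, name))
    return leftovers


# Made-up rows in the shape of `dpkg -l` output, for illustration only.
sample = """\
||/ Name               Version  Architecture Description
+++-==================-========-============-===========
ii  nvidia-driver-530  0.0      amd64        placeholder description
rc  nvidia-settings    0.0      amd64        placeholder description
ii  python3            0.0      amd64        placeholder description
"""

print(leftover_nvidia_packages(sample))
```

On a real system the text could come from subprocess.run(["dpkg", "-l"], capture_output=True, text=True).stdout; an empty result suggests the purge left nothing behind.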
4. Open a terminal in the download folder and run:
sudo sh ./NVIDIA-Linux-x86_64-530.41.03.run
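Pitfalls:
If the driver installation (running the .run installer) fails with ERROR: An error occurred while performing the step: "Building kernel modules", the kernel version needs to be downgraded. Run uname -r in a terminal to see the running kernel version, then list the installed kernel images. On my machine linux-image-5.15.0-87-generic is marked install, so I downgraded the kernel to 5.15.0-87.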
# List the installed kernel images
dpkg --get-selections | grep linux-image
linux-image-5.15.0-1039-nvidia deinstall
linux-image-5.15.0-1043-intel-iotg deinstall
linux-image-5.15.0-1045-gcp deinstall
linux-image-5.15.0-1046-oracle deinstall
linux-image-5.15.0-1048-aws deinstall
linux-image-5.15.0-1050-azure deinstall
linux-image-5.15.0-43-generic deinstall
linux-image-5.15.0-87-generic install
linux-image-5.15.0-87-lowlatency deinstall
linux-image-5.17.0-1035-oem deinstall
linux-image-5.19.0-41-generic deinstall
linux-image-5.19.0-42-generic deinstall
linux-image-5.19.0-43-generic deinstall
linux-image-5.19.0-45-generic deinstall
linux-image-5.19.0-46-generic deinstall
linux-image-6.1.0-1024-oem deinstall
linux-image-6.2.0-1010-nvidia deinstall
linux-image-6.2.0-1014-aws deinstall
linux-image-6.2.0-1014-oracle deinstall
linux-image-6.2.0-1015-azure deinstall
linux-image-6.2.0-1015-lowlatency deinstall
linux-image-6.2.0-1017-gcp install
linux-image-6.2.0-34-generic deinstall
linux-image-6.2.0-35-generic install
linux-image-6.5.0-1004-oem install
linux-image-generic install
linux-image-generic-hwe-22.04 install
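With a listing this long it is easy to miss which kernels are actually still installed. As a rough sketch, the output can be filtered down to versioned images of one flavour that are marked install. The function name installed_kernel_images and the flavour parameter are hypothetical; the sample lines are copied from the listing above.

```python
def installed_kernel_images(selections_text, flavour="generic"):
    """Return kernel versions (e.g. '5.15.0-87-generic') marked 'install'.

    Expects text in the two-column format of `dpkg --get-selections`.
    Meta-packages such as linux-image-generic are skipped because the
    part after 'linux-image-' does not start with a digit.
    """
    prefix = "linux-image-"
    suffix = "-" + flavour
    versions = []
    for line in selections_text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        name, state = parts
        if state != "install":
            continue
        if name.startswith(prefix) and name.endswith(suffix):
            version = name[len(prefix):]
            if version[:1].isdigit():
                versions.append(version)
    return versions


# A few lines taken from the listing above.
sample = """\
linux-image-5.15.0-43-generic deinstall
linux-image-5.15.0-87-generic install
linux-image-6.2.0-1017-gcp install
linux-image-6.2.0-34-generic deinstall
linux-image-6.2.0-35-generic install
linux-image-6.5.0-1004-oem install
linux-image-generic install
linux-image-generic-hwe-22.04 install
"""

print(installed_kernel_images(sample))
# → ['5.15.0-87-generic', '6.2.0-35-generic']
```

For this listing that leaves 5.15.0-87-generic as the older generic kernel to fall back to.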
# List the packages available for the chosen kernel version
apt-cache search linux | grep 5.15.0-87-generic
linux-buildinfo-5.15.0-87-generic - Linux kernel buildinfo for version 5.15.0 on 64 bit x86 SMP
linux-cloud-tools-5.15.0-87-generic - Linux kernel version specific cloud tools for version 5.15.0-87
linux-headers-5.15.0-87-generic - Linux kernel headers for version 5.15.0 on 64 bit x86 SMP
linux-image-5.15.0-87-generic - Signed kernel image generic
linux-image-unsigned-5.15.0-87-generic - Linux kernel image for version 5.15.0 on 64 bit x86 SMP
linux-modules-5.15.0-87-generic - Linux kernel extra modules for version 5.15.0 on 64 bit x86 SMP
linux-modules-extra-5.15.0-87-generic - Linux kernel extra modules for version 5.15.0 on 64 bit x86 SMP
linux-modules-iwlwifi-5.15.0-87-generic - Linux kernel iwlwifi modules for version 5.15.0-87
linux-tools-5.15.0-87-generic - Linux kernel version specific tools for version 5.15.0-87
linux-modules-nvidia-390-5.15.0-87-generic - Linux kernel nvidia modules for version 5.15.0-87
linux-modules-nvidia-418-server-5.15.0-87-generic - Linux kernel nvidia modules for version 5.15.0-87
linux-modules-nvidia-450-server-5.15.0-87-generic - Linux kernel nvidia modules for version 5.15.0-87
linux-modules-nvidia-470-5.15.0-87-generic - Linux kernel nvidia modules for version 5.15.0-87
linux-modules-nvidia-470-server-5.15.0-87-generic - Linux kernel nvidia modules for version 5.15.0-87
linux-modules-nvidia-525-5.15.0-87-generic - Linux kernel nvidia modules for version 5.15.0-87
linux-modules-nvidia-525-open-5.15.0-87-generic - Linux kernel nvidia modules for version 5.15.0-87
linux-modules-nvidia-525-server-5.15.0-87-generic - Linux kernel nvidia modules for version 5.15.0-87
linux-modules-nvidia-535-5.15.0-87-generic - Linux kernel nvidia modules for version 5.15.0-87
linux-modules-nvidia-535-open-5.15.0-87-generic - Linux kernel nvidia modules for version 5.15.0-87
linux-modules-nvidia-535-server-5.15.0-87-generic - Linux kernel nvidia modules for version 5.15.0-87
linux-modules-nvidia-535-server-open-5.15.0-87-generic - Linux kernel nvidia modules for version 5.15.0-87
linux-objects-nvidia-390-5.15.0-87-generic - Linux kernel nvidia modules for version 5.15.0-87 (objects)
linux-objects-nvidia-418-server-5.15.0-87-generic - Linux kernel nvidia modules for version 5.15.0-87 (objects)
linux-objects-nvidia-450-server-5.15.0-87-generic - Linux kernel nvidia modules for version 5.15.0-87 (objects)
linux-objects-nvidia-470-5.15.0-87-generic - Linux kernel nvidia modules for version 5.15.0-87 (objects)
linux-objects-nvidia-470-server-5.15.0-87-generic - Linux kernel nvidia modules for version 5.15.0-87 (objects)
linux-objects-nvidia-525-5.15.0-87-generic - Linux kernel nvidia modules for version 5.15.0-87 (objects)
linux-objects-nvidia-525-open-5.15.0-87-generic - Linux kernel nvidia modules for version 5.15.0-87 (objects)
linux-objects-nvidia-525-server-5.15.0-87-generic - Linux kernel nvidia modules for version 5.15.0-87 (objects)
linux-objects-nvidia-535-5.15.0-87-generic - Linux kernel nvidia modules for version 5.15.0-87 (objects)
linux-objects-nvidia-535-open-5.15.0-87-generic - Linux kernel nvidia modules for version 5.15.0-87 (objects)
linux-objects-nvidia-535-server-5.15.0-87-generic - Linux kernel nvidia modules for version 5.15.0-87 (objects)
linux-objects-nvidia-535-server-open-5.15.0-87-generic - Linux kernel nvidia modules for version 5.15.0-87 (objects)
linux-signatures-nvidia-5.15.0-87-generic - Linux kernel signatures for nvidia modules for version 5.15.0-87-generic
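The apt-cache listing also shows which NVIDIA driver branches have prebuilt linux-modules-nvidia packages for this kernel. A small sketch for pulling out the branch numbers follows; the function name prebuilt_nvidia_branches is made up, and the sample lines are a subset of the listing above.

```python
def prebuilt_nvidia_branches(apt_cache_output, kernel):
    """Return sorted driver branch numbers that have a
    linux-modules-nvidia-<branch>[-variant]-<kernel> package in the text."""
    prefix = "linux-modules-nvidia-"
    suffix = "-" + kernel
    branches = set()
    for line in apt_cache_output.splitlines():
        # Each line looks like "<package name> - <description>".
        name = line.split(" - ", 1)[0].strip()
        if name.startswith(prefix) and name.endswith(suffix):
            variant = name[len(prefix):len(name) - len(suffix)]
            number = variant.split("-", 1)[0]  # "535-server-open" -> "535"
            if number.isdigit():
                branches.add(int(number))
    return sorted(branches)


# A subset of the listing above.
sample = """\
linux-headers-5.15.0-87-generic - Linux kernel headers for version 5.15.0 on 64 bit x86 SMP
linux-modules-nvidia-470-5.15.0-87-generic - Linux kernel nvidia modules for version 5.15.0-87
linux-modules-nvidia-525-open-5.15.0-87-generic - Linux kernel nvidia modules for version 5.15.0-87
linux-modules-nvidia-535-server-5.15.0-87-generic - Linux kernel nvidia modules for version 5.15.0-87
linux-objects-nvidia-535-5.15.0-87-generic - Linux kernel nvidia modules for version 5.15.0-87 (objects)
"""

print(prebuilt_nvidia_branches(sample, "5.15.0-87-generic"))
# → [470, 525, 535]
```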
# Install the chosen version
sudo apt-get install linux-headers-5.15.0-87-generic linux-image-5.15.0-87-generic
# Run the following to see the current kernel and the newly installed one
grep menuentry /boot/grub/grub.cfg
# Set which kernel the system boots by default
sudo vim /etc/default/grub
Change GRUB_DEFAULT=0 to GRUB_DEFAULT="Advanced options for Ubuntu>Ubuntu, with Linux 5.15.0-87-generic"
# Update the configuration and reboot
sudo update-grub
reboot
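The GRUB_DEFAULT edit above follows a fixed pattern: submenu title, then ">", then the menu entry title. As a sketch, the value can be generated from the kernel version and swapped into the text of the config. The helper names grub_default_for and set_grub_default are hypothetical, the code only transforms a string (it does not write /etc/default/grub), and the titles are assumed to match what grep menuentry /boot/grub/grub.cfg prints on your machine.

```python
import re


def grub_default_for(kernel, submenu="Advanced options for Ubuntu"):
    """Build the 'submenu>entry' value used for GRUB_DEFAULT.

    The submenu and entry titles are assumptions; they must match the
    titles in /boot/grub/grub.cfg exactly.
    """
    return f"{submenu}>Ubuntu, with Linux {kernel}"


def set_grub_default(config_text, kernel):
    """Return config_text with its uncommented GRUB_DEFAULT line replaced."""
    new_line = f'GRUB_DEFAULT="{grub_default_for(kernel)}"'
    updated, count = re.subn(
        r"^GRUB_DEFAULT=.*$",
        lambda _match: new_line,  # a callable avoids backslash escapes
        config_text,
        flags=re.MULTILINE,
    )
    if count == 0:
        raise ValueError("no GRUB_DEFAULT line found")
    return updated


# Illustrative config text, not a complete /etc/default/grub.
sample = "GRUB_DEFAULT=0\nGRUB_TIMEOUT_STYLE=hidden\nGRUB_TIMEOUT=0\n"
print(set_grub_default(sample, "5.15.0-87-generic"))
```

Printing the result before touching the real file makes it easy to compare the string against the menuentry titles character by character; a mismatch means GRUB will not find the entry.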
After rebooting, run uname -r to check the kernel version; it should now show the kernel you selected.
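The same check can be scripted, together with a look at whether the headers installed earlier are in place for the running kernel, since the driver installer needs them to build its kernel module. This is a sketch: the function name kernel_ready_for_driver_build and its parameters are invented, and the /lib/modules/<version>/build location is the usual layout on Ubuntu rather than something stated in this post.

```python
import os
import platform


def kernel_ready_for_driver_build(expected, running=None,
                                  modules_root="/lib/modules"):
    """Report whether the running kernel is the expected one and whether
    its build directory (normally provided by linux-headers) exists."""
    if running is None:
        running = platform.release()  # same string as `uname -r` on Linux
    build_dir = os.path.join(modules_root, running, "build")
    return {
        "running": running,
        "matches_expected": running == expected,
        "headers_present": os.path.isdir(build_dir),
    }


print(kernel_ready_for_driver_build("5.15.0-87-generic"))
```

If matches_expected is False, the GRUB change probably did not take effect; if headers_present is False, the module build is likely to fail again.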
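Once the kernel has been changed, rerun the driver installation (the .run installer). As long as the rest of the CUDA environment is unchanged, the GPU becomes usable again.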