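Background:
My guess is that automatic kernel upgrades had not been disabled, and one of those upgrades left the new kernel without a matching NVIDIA driver module.
Symptom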
NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
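One way to check the guess above is to look for kernels under /lib/modules that have no nvidia.ko at all. This is only a minimal sketch: the function name kernels_without_nvidia is hypothetical, and it assumes the module file is named nvidia.ko (possibly with a compression suffix).

```shell
#!/bin/sh
# Hypothetical helper: list kernel releases that contain no nvidia.ko.
# The modules directory defaults to /lib/modules and can be overridden.
kernels_without_nvidia() {
    modules_dir=${1:-/lib/modules}
    for d in "$modules_dir"/*/; do
        [ -d "$d" ] || continue
        # Matches nvidia.ko and compressed variants such as nvidia.ko.zst
        if [ -z "$(find "$d" -name 'nvidia.ko*' -print 2>/dev/null | head -n 1)" ]; then
            basename "$d"
        fi
    done
    return 0
}

# Usage:
#   kernels_without_nvidia
#   uname -r    # if the running kernel is in the list, the guess holds
```

If the release printed by uname -r shows up in the list, the running kernel simply has no driver module built for it.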
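Approach
Use DKMS (Dynamic Kernel Module Support) to rebuild and install the NVIDIA kernel module against the newly installed kernel.
Solution
Verify that the hardware is present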
lspci |grep -i nvidia
01:00.0 VGA compatible controller: NVIDIA Corporation Device 2684 (rev a1)
01:00.1 Audio device: NVIDIA Corporation Device 22ba (rev a1)
81:00.0 VGA compatible controller: NVIDIA Corporation Device 2684 (rev a1)
81:00.1 Audio device: NVIDIA Corporation Device 22ba (rev a1)
c1:00.0 VGA compatible controller: NVIDIA Corporation Device 2684 (rev a1)
c1:00.1 Audio device: NVIDIA Corporation Device 22ba (rev a1)
c2:00.0 VGA compatible controller: NVIDIA Corporation Device 2684 (rev a1)
c2:00.1 Audio device: NVIDIA Corporation Device 22ba (rev a1)
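To turn the listing above into a number that can be compared later, the display functions can be counted while skipping the audio function of each card. A rough sketch, assuming GPUs show up as either "VGA compatible controller" or "3D controller"; the function name count_nvidia_gpus is made up for this note.

```shell
#!/bin/sh
# Hypothetical helper: reads `lspci` text on stdin and prints the number of
# NVIDIA display-class functions. Audio functions are not counted.
count_nvidia_gpus() {
    grep -i 'nvidia' | grep -c -E 'VGA compatible controller|3D controller'
}

# Usage:
#   lspci | count_nvidia_gpus
```

On the machine in this note the result should equal the number of cards physically installed.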
Verify that no nvidia module is loaded in the running kernel. Expected output: empty. Actual output: empty.
lsmod |grep -i nvidia
An NVIDIA driver installed from the .run installer keeps its module source under /usr/src by default
ls /usr/src/
linux-headers-6.5.0-41-generic linux-headers-6.5.0-44-generic linux-hwe-6.5-headers-6.5.0-41 linux-hwe-6.5-headers-6.5.0-44 nvidia-535.183.01
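The version string needed by dkms later can be read from that directory name instead of being typed by hand. A small sketch, assuming the source tree is named nvidia-VERSION and contains the dkms.conf that dkms requires; nvidia_src_versions is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: print the version of every nvidia-* source tree that
# dkms could use, i.e. one that contains a dkms.conf file.
nvidia_src_versions() {
    src_dir=${1:-/usr/src}
    for d in "$src_dir"/nvidia-*/; do
        [ -f "${d}dkms.conf" ] || continue
        name=$(basename "$d")
        # Strip the leading "nvidia-" to leave only the version
        printf '%s\n' "${name#nvidia-}"
    done
    return 0
}

# Usage:
#   nvidia_src_versions
```

If nothing is printed, the source tree is missing or incomplete and dkms has nothing to build from.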
About the dkms package
apt-cache show dkms
Package: dkms
Architecture: all
Version: 2.8.7-2ubuntu2.2
Multi-Arch: foreign
Priority: optional
Section: admin
Origin: Ubuntu
Maintainer: Ubuntu Developers <ubuntu-devel-discuss@lists.ubuntu.com>
Original-Maintainer: Dynamic Kernel Modules Support Team <dkms@packages.debian.org>
Bugs: https://bugs.launchpad.net/ubuntu/+filebug
Installed-Size: 295
Provides: dh-sequence-dkms
Pre-Depends: lsb-release
Depends: kmod | kldutils, gcc, gcc-12, dpkg-dev, make | build-essential, coreutils (>= 7.4), patch, dctrl-tools
Recommends: fakeroot, sudo, linux-headers-686-pae | linux-headers-amd64 | linux-headers-generic | linux-headers
Suggests: menu, e2fsprogs
Breaks: shim-signed (<< 1.34~)
Filename: pool/main/d/dkms/dkms_2.8.7-2ubuntu2.2_all.deb
Size: 70130
MD5sum: 19e431299038b83a3146d8cca64af532
SHA1: 7f1909cc371e12d76b9dd24cd3ce813b7d7b89cc
SHA256: 0af1ca1ba81ba8680332300838273d43c449a4d549fd992c70c8f64a9462ad74
SHA512: 59e79360f3bb7fe1c7af779a84f6d67b1067ace2151013adfde2b206e1f3395090f9b8798d1fe41f15237f85bd9d9185621f4f33d57d63b1e40668e825a3bdfd
Homepage: https://github.com/dell-oss/dkms
Description-en: Dynamic Kernel Module Support Framework
DKMS is a framework designed to allow individual kernel modules to be upgraded
without changing the whole kernel. It is also very easy to rebuild modules as
you upgrade kernels.
Description-md5: b7b6bb6a6b083b2245e0648e7752a459
Task: ubuntustudio-video
Build and install the nvidia module with dkms
apt install dkms -y
dkms install -m nvidia -v 535.183.01
Output of the command, for reference:
Creating symlink /var/lib/dkms/nvidia/535.183.01/source -> /usr/src/nvidia-535.183.01
Kernel preparation unnecessary for this kernel. Skipping...
Building module:
cleaning build area...
'make' -j32 NV_EXCLUDE_BUILD_MODULES='' KERNEL_UNAME=6.5.0-44-generic modules...........
Signing module:
- /var/lib/dkms/nvidia/535.183.01/6.5.0-44-generic/x86_64/module/nvidia-drm.ko
- /var/lib/dkms/nvidia/535.183.01/6.5.0-44-generic/x86_64/module/nvidia-modeset.ko
- /var/lib/dkms/nvidia/535.183.01/6.5.0-44-generic/x86_64/module/nvidia-peermem.ko
- /var/lib/dkms/nvidia/535.183.01/6.5.0-44-generic/x86_64/module/nvidia-uvm.ko
- /var/lib/dkms/nvidia/535.183.01/6.5.0-44-generic/x86_64/module/nvidia.ko
Secure Boot not enabled on this system.
cleaning build area...
nvidia.ko:
Running module version sanity check.
- Original module
- No original module exists within this kernel
- Installation
- Installing to /lib/modules/6.5.0-44-generic/updates/dkms/
nvidia-uvm.ko:
Running module version sanity check.
- Original module
- No original module exists within this kernel
- Installation
- Installing to /lib/modules/6.5.0-44-generic/updates/dkms/
nvidia-modeset.ko:
Running module version sanity check.
- Original module
- No original module exists within this kernel
- Installation
- Installing to /lib/modules/6.5.0-44-generic/updates/dkms/
nvidia-drm.ko:
Running module version sanity check.
- Original module
- No original module exists within this kernel
- Installation
- Installing to /lib/modules/6.5.0-44-generic/updates/dkms/
nvidia-peermem.ko:
Running module version sanity check.
- Original module
- No original module exists within this kernel
- Installation
- Installing to /lib/modules/6.5.0-44-generic/updates/dkms/
depmod...
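Besides reading the log, the result can be checked from the output of dkms status. The line format differs between dkms releases (module and version separated by a comma in some, by a slash in others), so this sketch accepts both. The function name dkms_installed_for is hypothetical, and the sample line in the comment is illustrative, not taken from this machine.

```shell
#!/bin/sh
# Hypothetical helper: reads `dkms status` text on stdin and succeeds if the
# given module is reported as installed for the given kernel release.
# Illustrative input line:
#   nvidia/535.183.01, 6.5.0-44-generic, x86_64: installed
dkms_installed_for() {
    awk -v m="$1" -v k="$2" '
        index($0, m ",") == 1 || index($0, m "/") == 1 {
            if (index($0, k) && index($0, ": installed")) found = 1
        }
        END { exit !found }'
}

# Usage:
#   dkms status | dkms_installed_for nvidia "$(uname -r)" && echo ok
```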
Verify
lsmod | grep -i nvidia
nvidia_uvm 1785856 18
nvidia 56827904 830 nvidia_uvm
drm 765952 5 drm_kms_helper,ast,drm_shmem_helper,nvidia
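Note that grep also matches the drm line above, because nvidia appears in its "used by" column. For a scripted check it may be safer to compare the first column only. A minimal sketch; module_loaded is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: reads `lsmod` text on stdin and succeeds only if a
# module with exactly the given name is listed in the first column.
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}

# Usage:
#   lsmod | module_loaded nvidia && echo "nvidia is loaded"
```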
nvidia-smi
Mon Jul 29 16:46:02 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 Off | 00000000:01:00.0 Off | Off |
| 32% 48C P2 63W / 450W | 10960MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce RTX 4090 Off | 00000000:81:00.0 Off | Off |
| 31% 45C P2 58W / 450W | 10960MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA GeForce RTX 4090 Off | 00000000:C1:00.0 Off | Off |
| 35% 52C P2 71W / 450W | 10960MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA GeForce RTX 4090 Off | 00000000:C2:00.0 Off | Off |
| 31% 48C P2 54W / 450W | 10960MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
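The banner above also confirms that the running driver is the version that was just built. For a scripted comparison the version can be pulled out of the banner line. A rough sketch that assumes the banner keeps the "Driver Version:" label; smi_driver_version is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: reads `nvidia-smi` text on stdin and prints the
# driver version found after the "Driver Version:" label.
smi_driver_version() {
    sed -n 's/.*Driver Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Usage:
#   [ "$(nvidia-smi | smi_driver_version)" = "535.183.01" ] && echo match
```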
Disable automatic apt updates so the kernel is no longer upgraded automatically
systemctl disable --now apt-daily.timer
systemctl disable --now apt-daily-upgrade.timer
systemctl mask apt-daily-upgrade.timer
systemctl mask apt-daily.timer
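As a complement to masking the timers, the periodic apt jobs can also be switched off in the apt configuration. This was not part of the original fix; it is a sketch that assumes the upgrade was performed by the periodic apt and unattended-upgrades jobs, and the function name write_auto_upgrades_off is hypothetical. Writing to the default path needs root and overwrites the existing file.

```shell
#!/bin/sh
# Hypothetical helper: write an apt configuration file that turns off the
# periodic package list update and unattended upgrades.
write_auto_upgrades_off() {
    target=${1:-/etc/apt/apt.conf.d/20auto-upgrades}
    cat > "$target" <<'EOF'
APT::Periodic::Update-Package-Lists "0";
APT::Periodic::Unattended-Upgrade "0";
EOF
}

# Usage (as root):
#   write_auto_upgrades_off
```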
References
https://blog.csdn.net/wjinjie/article/details/108997692