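A PC with an RTX 3090 GPU no longer reached the desktop after I ran apt-get upgrade on Ubuntu and rebooted. I hit several pitfalls while fixing it, so I am writing them down for future reference.
My first idea was to go back to an older GPU driver, so I removed the installed versions: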
sudo apt-get --purge remove "cuda*"
sudo apt-get --purge remove "*nvidia*"
Then I installed an older CUDA 10 deb package. It did not work even after a reboot, and nvidia-smi always failed with:
Failed to initialize NVML: Driver/library version mismatch
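This was possibly a mismatch with the Linux kernel version currently in use: installing the deb package directly does not work, and the kernel modules (.ko) have to be compiled from source against the current kernel. So I switched to installing from the run file: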
wget https://developer.download.nvidia.com/compute/cuda/10.2/Prod/local_installers/cuda_10.2.89_440.33.01_linux.run
chmod +x cuda_10.2.89_440.33.01_linux.run
./cuda_10.2.89_440.33.01_linux.run
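As a side check on the "Driver/library version mismatch" error above: the message generally means that the NVIDIA kernel module currently loaded has a different version from the user-space libraries that nvidia-smi uses. Below is a minimal sketch for reading the version of the loaded module. The helper name nvrm_version is made up for illustration, and the parsing assumes the usual "NVRM version: ... Kernel Module  <version>" line in /proc/driver/nvidia/version.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the version of the loaded NVIDIA kernel module.
# Assumes the file contains a line of the form
#   NVRM version: NVIDIA UNIX x86_64 Kernel Module  450.80.02  <build date>
nvrm_version() {
    local file=${1:-/proc/driver/nvidia/version}
    [ -r "$file" ] || return 1    # module not loaded, or file missing
    sed -n 's/^NVRM version:.*Kernel Module *\([0-9][0-9.]*\).*/\1/p' "$file"
}
```

Comparing the output of nvrm_version with modinfo -F version nvidia (the module installed on disk for the running kernel) should show whether a stale module is still loaded.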
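The run-file installation succeeded, but after a reboot the desktop still would not come up, and switching to a text console showed an error.
Installing an older driver instead failed at the very end of every attempt with: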
ERROR: Unable to load the 'nvidia-drm' kernel module
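I tried the fixes other people suggest online, such as disabling Secure Boot in the BIOS, upgrading the kernel, or fixing a mismatch between the kernel and its source version. None of them helped. Finally I tried NVIDIA-Linux-x86_64-450.80.02.run, the run file for driver 450.80.02 that ships with CUDA 11.0, and it installed on the first attempt. This suggests that a relatively new GPU needs a relatively new driver: older drivers cannot even be installed, let alone run.
Since the driver has to correspond to CUDA 11 or later anyway, it is better to install CUDA 11.1.1 directly (RTX 30-series GPUs seem to need 11.1.1 or later to work properly). At the time of writing, though, it is best to avoid the newest CUDA 11.3 or 11.4, because tools such as PyTorch do not support them yet. Blindly installing the highest version is not a good idea; a version that is good enough will do.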
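The version requirement can be checked before spending time on an installer. The sketch below compares two dotted version strings with GNU sort -V. The function names are made up for illustration, and the minimum version passed in is whatever you have found to work; 450.80.02 is only the value that worked in this case.

```shell
#!/usr/bin/env bash
# Succeed if version $1 is greater than or equal to version $2.
# Relies on the -V (version sort) option of GNU sort.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical wrapper: report whether a driver version meets a chosen minimum.
check_driver() {
    local have=$1 need=$2
    if version_ge "$have" "$need"; then
        echo "driver $have is at least $need"
    else
        echo "driver $have is older than $need"
        return 1
    fi
}
```

On a machine with a single GPU and a working driver, the installed version could be fed in with check_driver "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader)" 450.80.02.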
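With the driver version sorted out, the gdm desktop still did not come up at boot. Some people online say that gdm3 does not work well with the latest NVIDIA drivers, so I installed the lightdm display manager and the Unity desktop: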
sudo apt-get install lightdm unity
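During installation, confirm lightdm as the default display manager instead of gdm3 (to switch later, use dpkg-reconfigure lightdm). After rebooting, the desktop still did not appear: the Ubuntu logo kept animating, but the desktop never came up.
Check the status: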
root@ubuntu-rtx3090:~# systemctl status lightdm
● lightdm.service - Light Display Manager
Loaded: loaded (/lib/systemd/system/lightdm.service; indirect; vendor preset: enabled)
Active: failed (Result: exit-code) since Wed 2021-08-25 19:15:17 CST; 7min ago
Docs: man:lightdm(1)
Process: 1246 ExecStart=/usr/sbin/lightdm (code=exited, status=1/FAILURE)
Process: 1243 ExecStartPre=/bin/sh -c [ "$(basename $(cat /etc/X11/default-display-manager 2>/dev/null))" = "lightdm" ] (code=exited, status=0/SUCCESS)
Main PID: 1246 (code=exited, status=1/FAILURE)
8月 25 19:15:17 ubuntu-rtx3090 systemd[1]: lightdm.service: Service hold-off time over, scheduling restart.
8月 25 19:15:17 ubuntu-rtx3090 systemd[1]: lightdm.service: Scheduled restart job, restart counter is at 5.
8月 25 19:15:17 ubuntu-rtx3090 systemd[1]: Stopped Light Display Manager.
8月 25 19:15:17 ubuntu-rtx3090 systemd[1]: lightdm.service: Start request repeated too quickly.
8月 25 19:15:17 ubuntu-rtx3090 systemd[1]: lightdm.service: Failed with result 'exit-code'.
8月 25 19:15:17 ubuntu-rtx3090 systemd[1]: Failed to start Light Display Manager.
apt policy lightdm
lightdm:
Installed: 1.26.0-0ubuntu1
Candidate: 1.26.0-0ubuntu1
Version table:
*** 1.26.0-0ubuntu1 500
500 http://mirrors.aliyun.com/ubuntu bionic/universe amd64 Packages
100 /var/lib/dpkg/status
root@ubuntu-rtx3090:~# lightdm --test-mode --debug
Failed to load configuration from /etc/lightdm/lightdm.conf: Key file does not start with a group
root@ubuntu-rtx3090:~# lightdm --show-config
Failed to load configuration from /etc/lightdm/lightdm.conf: Key file does not start with a group
The message "Failed to load configuration from /etc/lightdm/lightdm.conf: Key file does not start with a group" points to a problem in /etc/lightdm/lightdm.conf. Opening the file showed that it contained a single line:
greeter-session=unity-greeter
The file is only valid once the Seat group header is added:
[Seat:*]
greeter-session=unity-greeter
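This kind of mistake can be caught before restarting the display manager. The sketch below checks that the first line of a key file that is neither blank nor a '#' comment is a [group] header. The function name starts_with_group is made up for illustration, and the check is a rough approximation of the parser's rule, not a full validator.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check for a key file such as /etc/lightdm/lightdm.conf.
# Succeeds if the first line that is neither blank nor a '#' comment is a
# [group] header, or if the file has no such line at all.
# Returns 1 if a key appears before any group, 2 if the file is unreadable.
starts_with_group() {
    local file=$1 line
    [ -r "$file" ] || return 2
    # Drop comment lines and blank lines, keep the first remaining line.
    line=$(grep -v -E '^(#.*|[[:space:]]*)$' "$file" | head -n 1)
    case "$line" in
        ''|\[*\]*) return 0 ;;
        *)         return 1 ;;
    esac
}
```

For example, starts_with_group /etc/lightdm/lightdm.conf || echo "missing group header" would have flagged the one-line file shown above.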
After that, lightdm --show-config prints its output normally:
root@ubuntu-rtx3090:~# lightdm --show-config
[Seat:*]
A allow-guest=false
C greeter-wrapper=/usr/lib/lightdm/lightdm-greeter-session
D guest-wrapper=/usr/lib/lightdm/lightdm-guest-session
G user-session=unity
F greeter-show-manual-login=true
I greeter-session=unity-greeter
F all-guest=false
H xserver-command=X -core
[LightDM]
B backup-logs=false
Sources:
A /usr/share/lightdm/lightdm.conf.d/50-disable-guest.conf
B /usr/share/lightdm/lightdm.conf.d/50-disable-log-backup.conf
C /usr/share/lightdm/lightdm.conf.d/50-greeter-wrapper.conf
D /usr/share/lightdm/lightdm.conf.d/50-guest-wrapper.conf
E /usr/share/lightdm/lightdm.conf.d/50-ubuntu.conf
F /usr/share/lightdm/lightdm.conf.d/50-unity-greeter.conf
G /usr/share/lightdm/lightdm.conf.d/50-unity.conf
H /usr/share/lightdm/lightdm.conf.d/50-xserver-command.conf
I /etc/lightdm/lightdm.conf
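The output above also shows the precedence among lightdm's configuration files. /etc/lightdm/lightdm.conf clearly has the highest priority, and its settings override those of all the files before it, because lightdm reads the configuration files in the order A -> I.
Restart lightdm with sudo systemctl restart lightdm, and the service is now healthy: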
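The "last file read wins" rule can be illustrated with a small sketch that resolves one key across a list of files given in read order. The function name effective_value is made up for illustration, and the sketch ignores [group] headers, so it assumes the key lives in a single group. lightdm --show-config remains the authoritative view.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value a key ends up with when the given
# files are read in order and later assignments override earlier ones.
# Simplification: [group] headers are ignored, so the key is assumed to
# appear in one group only.
effective_value() {
    local key=$1 value="" f
    shift
    for f in "$@"; do
        [ -r "$f" ] || continue
        if grep -q "^${key}=" "$f"; then
            # Within one file, the last assignment wins as well.
            value=$(sed -n "s/^${key}=//p" "$f" | tail -n 1)
        fi
    done
    printf '%s\n' "$value"
}
```

A call such as effective_value greeter-session /usr/share/lightdm/lightdm.conf.d/*.conf /etc/lightdm/lightdm.conf passes the files in the same order as sources A to I above.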
root@ubuntu-rtx3090:~# systemctl status lightdm
● lightdm.service - Light Display Manager
Loaded: loaded (/lib/systemd/system/lightdm.service; indirect; vendor preset: enabled)
Active: active (running) since Wed 2021-08-25 19:50:22 CST; 3min 1s ago
Docs: man:lightdm(1)
Process: 1088 ExecStartPre=/bin/sh -c [ "$(basename $(cat /etc/X11/default-display-manager 2>/dev/null))" = "lightdm" ] (code=exited, status=0/SUCCESS)
Main PID: 1096 (lightdm)
Tasks: 6 (limit: 4915)
CGroup: /system.slice/lightdm.service
├─1096 /usr/sbin/lightdm
├─1115 /usr/lib/xorg/Xorg -core :0 -seat seat0 -auth /var/run/lightdm/root/:0 -nolisten tcp vt7 -novtswitch
└─1564 lightdm --session-child 12 19
8月 25 19:50:21 ubuntu-rtx3090 systemd[1]: Starting Light Display Manager...
8月 25 19:50:22 ubuntu-rtx3090 systemd[1]: Started Light Display Manager.
8月 25 19:50:23 ubuntu-rtx3090 lightdm[1220]: pam_kwallet(lightdm-greeter:setcred): (null): pam_sm_setcred
8月 25 19:50:23 ubuntu-rtx3090 lightdm[1220]: pam_kwallet5(lightdm-greeter:setcred): (null): pam_sm_setcred
8月 25 19:50:23 ubuntu-rtx3090 lightdm[1220]: pam_unix(lightdm-greeter:session): session opened for user lightdm by (uid=0)
8月 25 19:50:23 ubuntu-rtx3090 lightdm[1220]: pam_kwallet(lightdm-greeter:session): (null): pam_sm_open_session
8月 25 19:50:23 ubuntu-rtx3090 lightdm[1220]: pam_kwallet(lightdm-greeter:session): pam_kwallet: open_session called without kwallet_key
8月 25 19:50:23 ubuntu-rtx3090 lightdm[1220]: pam_kwallet5(lightdm-greeter:session): (null): pam_sm_open_session
8月 25 19:50:23 ubuntu-rtx3090 lightdm[1220]: pam_kwallet5(lightdm-greeter:session): pam_kwallet5: open_session called without kwallet5_key
root@ubuntu-rtx3090:~# lightdm --test-mode --debug
[+0.00s] DEBUG: Logging to /var/log/lightdm/lightdm.log
[+0.00s] DEBUG: Starting Light Display Manager 1.26.0, UID=0 PID=2573
[+0.00s] DEBUG: Loading configuration dirs from /var/lib/snapd/desktop/lightdm/lightdm.conf.d
[+0.00s] DEBUG: Loading configuration dirs from /usr/share/lightdm/lightdm.conf.d
[+0.00s] DEBUG: Loading configuration from /usr/share/lightdm/lightdm.conf.d/50-disable-guest.conf
[+0.00s] DEBUG: Loading configuration from /usr/share/lightdm/lightdm.conf.d/50-disable-log-backup.conf
[+0.00s] DEBUG: Loading configuration from /usr/share/lightdm/lightdm.conf.d/50-greeter-wrapper.conf
[+0.00s] DEBUG: Loading configuration from /usr/share/lightdm/lightdm.conf.d/50-guest-wrapper.conf
[+0.00s] DEBUG: Loading configuration from /usr/share/lightdm/lightdm.conf.d/50-ubuntu.conf
[+0.00s] DEBUG: Loading configuration from /usr/share/lightdm/lightdm.conf.d/50-unity-greeter.conf
[+0.00s] DEBUG: [Seat:*] contains unknown option all-guest
[+0.00s] DEBUG: Loading configuration from /usr/share/lightdm/lightdm.conf.d/50-unity.conf
[+0.00s] DEBUG: Loading configuration from /usr/share/lightdm/lightdm.conf.d/50-xserver-command.conf
[+0.00s] DEBUG: Loading configuration dirs from /usr/local/share/lightdm/lightdm.conf.d
[+0.00s] DEBUG: Loading configuration dirs from /etc/xdg/lightdm/lightdm.conf.d
[+0.00s] DEBUG: Loading configuration from /etc/lightdm/lightdm.conf
[+0.00s] DEBUG: Registered seat module local
[+0.00s] DEBUG: Registered seat module xremote
[+0.00s] DEBUG: Registered seat module unity
[+0.00s] DEBUG: Using D-Bus name org.freedesktop.DisplayManager
[+0.01s] DEBUG: Monitoring logind for seats
[+0.01s] DEBUG: New seat added from logind: seat0
[+0.01s] DEBUG: Seat seat0: Loading properties from config section Seat:*
[+0.01s] DEBUG: Seat seat0: Starting
[+0.01s] DEBUG: Seat seat0: Creating greeter session
[+0.01s] DEBUG: Seat seat0: Creating display server of type x
[+0.01s] DEBUG: Using VT 7
[+0.01s] DEBUG: Seat seat0: Starting local X display on VT 7
[+0.01s] DEBUG: XServer 1: Logging to /var/log/lightdm/x-1.log
[+0.01s] DEBUG: XServer 1: Writing X server authority to /var/run/lightdm/root/:1
[+0.01s] DEBUG: XServer 1: Launching X Server
[+0.01s] DEBUG: Launching process 2578: /usr/bin/X -core :1 -seat seat0 -auth /var/run/lightdm/root/:1 -nolisten tcp vt7 -novtswitch
[+0.01s] DEBUG: XServer 1: Waiting for ready signal from X server :1
[+0.01s] DEBUG: Acquired bus name org.freedesktop.DisplayManager
[+0.01s] DEBUG: Registering seat with bus path /org/freedesktop/DisplayManager/Seat0
[+0.01s] DEBUG: Loading users from org.freedesktop.Accounts
[+0.01s] DEBUG: User /org/freedesktop/Accounts/User1000 added
Failed to use bus name org.freedesktop.DisplayManager, do you have appropriate permissions?
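However, the unity-greeter login screen still did not appear. With gdm3 as the display manager, systemctl status gdm3 also showed the service starting normally, yet the greeter login window never appeared either. With lightdm, the screen simply stayed stuck at this point.
I struggled with this for a long time, including installing lightdm-gtk-greeter, configuring it in lightdm.conf, and forcing greeter-show-manual-login=true, but the login screen still did not appear: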
[Seat:*]
greeter-session=lightdm-gtk-greeter
greeter-show-manual-login=true
allow-guest=false
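My guess was that neither the gdm3 nor the lightdm greeter window can be displayed properly with the latest GPU driver and Linux kernel. So what happens if I skip the login screen and let the system log in to the desktop automatically? I added one line to /etc/lightdm/lightdm.conf (my login user name is ubuntu):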
autologin-user=ubuntu
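After another reboot, I could finally see the long-lost Unity desktop!
In my tests, it made no difference whether the following settings were present: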
autologin-guest=false
autologin-user-timeout=0
autologin-session=lightdm-autologin
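If the autologin-user line has to be applied on several machines, the edit can be scripted. The sketch below sets one key under the [Seat:*] group of a lightdm.conf, replacing the key if it already exists. The function name set_seat_option is made up for illustration, and it assumes GNU sed, an existing [Seat:*] header, and values that do not contain | or &.

```shell
#!/usr/bin/env bash
# Hypothetical helper: set KEY=VALUE under the [Seat:*] group of FILE.
# Assumes GNU sed (-i and one-line 'a'), an existing [Seat:*] header,
# and a VALUE without '|' or '&' characters.
set_seat_option() {
    local file=$1 key=$2 value=$3
    if grep -q "^${key}=" "$file"; then
        # Key already present: rewrite its value in place.
        sed -i "s|^${key}=.*|${key}=${value}|" "$file"
    else
        # Key absent: append it right after the [Seat:*] header.
        sed -i "/^\[Seat:\*]/a ${key}=${value}" "$file"
    fi
}
```

Run as root, set_seat_option /etc/lightdm/lightdm.conf autologin-user ubuntu would add the line used above; running it a second time leaves the file unchanged.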
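Since fixing this kind of problem may require upgrading the kernel, here is an appendix on how to install and remove a specific kernel version, along with related commands: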
uname -r
lsb_release -a
# List the kernel images that are currently installed
dpkg --get-selections |grep linux-image
# List the kernel image versions available from the configured repositories
apt-cache search linux | grep linux-image
# Install a specific version of the kernel image and kernel headers
apt-get install linux-headers-5.4.0-81-generic linux-image-5.4.0-81-generic
Building module:
cleaning build area...
'make' -j24 NV_EXCLUDE_BUILD_MODULES='' KERNEL_UNAME=5.4.0-81-generic IGNORE_CC_MISMATCH='' modules.....
Signing module:
- /var/lib/dkms/nvidia/450.80.02/5.4.0-81-generic/x86_64/module/nvidia-modeset.ko
- /var/lib/dkms/nvidia/450.80.02/5.4.0-81-generic/x86_64/module/nvidia-drm.ko
- /var/lib/dkms/nvidia/450.80.02/5.4.0-81-generic/x86_64/module/nvidia.ko
- /var/lib/dkms/nvidia/450.80.02/5.4.0-81-generic/x86_64/module/nvidia-uvm.ko
Secure Boot not enabled on this system.
cleaning build area...
DKMS: build completed.
nvidia.ko:
Running module version sanity check.
- Original module
- No original module exists within this kernel
- Installation
- Installing to /lib/modules/5.4.0-81-generic/updates/dkms/
nvidia-uvm.ko:
Running module version sanity check.
- Original module
- No original module exists within this kernel
- Installation
- Installing to /lib/modules/5.4.0-81-generic/updates/dkms/
nvidia-modeset.ko:
Running module version sanity check.
- Original module
- No original module exists within this kernel
- Installation
- Installing to /lib/modules/5.4.0-81-generic/updates/dkms/
nvidia-drm.ko:
Running module version sanity check.
- Original module
- No original module exists within this kernel
- Installation
- Installing to /lib/modules/5.4.0-81-generic/updates/dkms/
depmod...
DKMS: install completed.
...done.
Processing triggers for linux-image-5.4.0-81-generic (5.4.0-81.91~18.04.1) ...
/etc/kernel/postinst.d/dkms:
* dkms: running auto installation service for kernel 5.4.0-81-generic
...done.
/etc/kernel/postinst.d/initramfs-tools:
update-initramfs: Generating /boot/initrd.img-5.4.0-81-generic
/etc/kernel/postinst.d/zz-update-grub:
Sourcing file `/etc/default/grub' ### update-grub runs automatically
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-5.4.0-81-generic
Found initrd image: /boot/initrd.img-5.4.0-81-generic
Found linux image: /boot/vmlinuz-5.4.0-72-generic
Found initrd image: /boot/initrd.img-5.4.0-72-generic
Found linux image: /boot/vmlinuz-5.4.0-53-generic
Found initrd image: /boot/initrd.img-5.4.0-53-generic
Adding boot menu entry for EFI firmware configuration
done
List the kernels currently in the boot menu
grep menuentry /boot/grub/grub.cfg
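Change the kernel boot order: if you installed the newest version, it is already the default;
if you installed an older version, you need to edit the grub configuration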
vi /etc/default/grub
Apply the configuration
update-grub
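When choosing which entry to boot, it helps to see the exact menu titles. The sketch below extracts the titles of the menuentry and submenu lines from a grub.cfg. The function name list_menu_entries is made up for illustration, and it assumes titles are wrapped in single quotes, as in the grub.cfg files Ubuntu generates.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the titles of the menuentry and submenu lines
# in a grub.cfg, one per line, in file order.
# Assumes each title is enclosed in single quotes.
list_menu_entries() {
    local file=${1:-/boot/grub/grub.cfg}
    sed -n -E "s/^[[:space:]]*(menuentry|submenu) '([^']*)'.*/\2/p" "$file"
}
```

As far as I know, GRUB_DEFAULT in /etc/default/grub accepts a "submenu title>entry title" pair, for example GRUB_DEFAULT="Advanced options for Ubuntu>Ubuntu, with Linux 5.4.0-72-generic"; check the titles on your own system before relying on this.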