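The Docker container was created with the GPU devices attached, but after installing CUDA inside the container the GPU could not be used. Even the simplest command, nvidia-smi, failed with: Failed to initialize NVML: Driver/library version mismatch
Checking the NVIDIA driver and the NVIDIA libraries separately inside the container showed: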
cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 465.19.01 Fri Mar 19 07:44:41 UTC 2021
cat /var/log/dpkg.log|grep nvidia
2022-08-14 14:52:45 install libnvidia-cfg1-470:amd64 <none> 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:45 status half-installed libnvidia-cfg1-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:45 status unpacked libnvidia-cfg1-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:45 status unpacked libnvidia-cfg1-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:46 install libnvidia-common-470:all <none> 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:46 status half-installed libnvidia-common-470:all 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:46 status unpacked libnvidia-common-470:all 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:46 status unpacked libnvidia-common-470:all 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:46 install libnvidia-compute-470:amd64 <none> 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:46 status half-installed libnvidia-compute-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:47 status unpacked libnvidia-compute-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:47 status unpacked libnvidia-compute-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:47 install libnvidia-decode-470:amd64 <none> 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:47 status half-installed libnvidia-decode-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:47 status unpacked libnvidia-decode-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:47 status unpacked libnvidia-decode-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:47 install libnvidia-encode-470:amd64 <none> 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:47 status half-installed libnvidia-encode-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:47 status unpacked libnvidia-encode-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:48 status unpacked libnvidia-encode-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:48 install libnvidia-extra-470:amd64 <none> 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:48 status half-installed libnvidia-extra-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:48 status unpacked libnvidia-extra-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:48 status unpacked libnvidia-extra-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:48 install libnvidia-fbc1-470:amd64 <none> 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:48 status half-installed libnvidia-fbc1-470:amd64 470.141.03-0ubuntu0.18.04.1
...
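This problem usually occurs because the CUDA version installed in the container is newer and does not match the driver. The driver the container uses is the one installed on the host, not the one pulled in when CUDA was installed inside the container. Here the kernel module reported by /proc/driver/nvidia/version is 465.19.01, while the libraries installed in the container are 470.141.03.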
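The comparison described above can be scripted. The following is a minimal sketch, not a definitive tool: the function names kernel_module_version, library_version and version_ge are hypothetical helpers introduced for illustration. It assumes the version line format shown in the output above, an Ubuntu-style /var/log/dpkg.log that still contains the libnvidia-compute install entry, and GNU sort with the -V option.

```shell
#!/usr/bin/env bash
# Sketch: compare the loaded kernel module version with the user-space
# NVIDIA library version installed inside the container.

# Read the text of /proc/driver/nvidia/version on stdin and print the
# driver version, e.g. 465.19.01.
kernel_module_version() {
    sed -n 's/.*Kernel Module[[:space:]]*\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Read dpkg.log text on stdin and print the upstream version of the most
# recent libnvidia-compute-* "install" entry, e.g. 470.141.03.
library_version() {
    awk '$3 == "install" && $4 ~ /^libnvidia-compute-/ { v = $6 }
         END { if (v != "") print v }' | sed 's/-.*//'
}

# Succeed if version $1 is greater than or equal to version $2.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# Only run the check when both files are readable (i.e. inside the container).
if [ -r /proc/driver/nvidia/version ] && [ -r /var/log/dpkg.log ]; then
    kmod=$(kernel_module_version < /proc/driver/nvidia/version)
    lib=$(library_version < /var/log/dpkg.log)
    if [ -n "$kmod" ] && [ -n "$lib" ] && ! version_ge "$kmod" "$lib"; then
        echo "kernel module $kmod is older than container libraries $lib"
    fi
fi
```

With the values from this case (465.19.01 for the kernel module, 470.141.03 for the libraries), the check reports that the kernel module is older.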
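The fix is simple: upgrade the NVIDIA driver on the host so that it is no older than the NVIDIA libraries inside the container (ideally the same release), for example: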
sudo apt install nvidia-driver-470
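Then reboot. The reboot is required: without it, cat /proc/driver/nvidia/version still shows the 465 driver rather than the newly installed 470, because a newly installed driver only takes effect after a reboot.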
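To confirm on the host whether a reboot is still pending, the loaded module version can be compared with the version of the module installed on disk. This is a rough sketch under a few assumptions: the helper name reboot_pending is hypothetical, the kernel module is named nvidia, and modinfo is available on the host.

```shell
#!/usr/bin/env bash
# Sketch: detect whether the installed driver differs from the running one.

# Succeed when both versions are known and differ, meaning the new driver
# is installed on disk but the old kernel module is still loaded.
reboot_pending() {
    [ -n "$1" ] && [ -n "$2" ] && [ "$1" != "$2" ]
}

if command -v modinfo >/dev/null 2>&1 && [ -r /proc/driver/nvidia/version ]; then
    # Version of the kernel module that is currently loaded.
    loaded=$(sed -n 's/.*Kernel Module[[:space:]]*\([0-9][0-9.]*\).*/\1/p' \
        /proc/driver/nvidia/version | head -n 1)
    # Version of the nvidia module installed on disk for the running kernel.
    on_disk=$(modinfo -F version nvidia 2>/dev/null)
    if reboot_pending "$loaded" "$on_disk"; then
        echo "loaded $loaded, installed $on_disk: reboot required"
    fi
fi
```

If the script prints nothing after the reboot, the loaded and installed versions agree.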