How to fix the error "Failed to initialize NVML: Driver/library version mismatch" inside a Docker container

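The Docker container was created with the GPU devices attached, yet after installing CUDA inside the container the GPU could not be used. Even the simplest command, nvidia-smi, failed with: Failed to initialize NVML: Driver/library version mismatch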

Checking the NVIDIA driver and the NVIDIA libraries separately inside the container shows:

    cat /proc/driver/nvidia/version

NVRM version: NVIDIA UNIX x86_64 Kernel Module  465.19.01  Fri Mar 19 07:44:41 UTC 2021

    grep nvidia /var/log/dpkg.log

2022-08-14 14:52:45 install libnvidia-cfg1-470:amd64 <none> 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:45 status half-installed libnvidia-cfg1-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:45 status unpacked libnvidia-cfg1-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:45 status unpacked libnvidia-cfg1-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:46 install libnvidia-common-470:all <none> 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:46 status half-installed libnvidia-common-470:all 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:46 status unpacked libnvidia-common-470:all 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:46 status unpacked libnvidia-common-470:all 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:46 install libnvidia-compute-470:amd64 <none> 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:46 status half-installed libnvidia-compute-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:47 status unpacked libnvidia-compute-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:47 status unpacked libnvidia-compute-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:47 install libnvidia-decode-470:amd64 <none> 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:47 status half-installed libnvidia-decode-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:47 status unpacked libnvidia-decode-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:47 status unpacked libnvidia-decode-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:47 install libnvidia-encode-470:amd64 <none> 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:47 status half-installed libnvidia-encode-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:47 status unpacked libnvidia-encode-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:48 status unpacked libnvidia-encode-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:48 install libnvidia-extra-470:amd64 <none> 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:48 status half-installed libnvidia-extra-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:48 status unpacked libnvidia-extra-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:48 status unpacked libnvidia-extra-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:48 install libnvidia-fbc1-470:amd64 <none> 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:48 status half-installed libnvidia-fbc1-470:amd64 470.141.03-0ubuntu0.18.04.1
...

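This problem usually occurs because the CUDA version installed in the container is newer and does not match the driver version. The container uses the driver installed on the host, not the one pulled in when CUDA was installed inside the container. Here the kernel module is 465.19.01, while the libraries installed in the container are 470.141.03.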
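The comparison between the two outputs above can be scripted. The following is a minimal sketch, not a definitive tool: the function names (`kernel_module_version`, `library_version`, `compare_versions`) are hypothetical, and it assumes that /proc/driver/nvidia/version and /var/log/dpkg.log use exactly the line formats shown above and that the `libnvidia-compute-*` package is the one of interest.

```shell
#!/bin/sh
# Sketch: compare the NVIDIA kernel module version with the user-space
# library version recorded by dpkg. All function names are hypothetical.

# Read text in the format of /proc/driver/nvidia/version on stdin and
# print the kernel module version (e.g. 465.19.01).
kernel_module_version() {
    sed -n 's/^NVRM version:.*Kernel Module  *\([0-9][0-9.]*\).*/\1/p'
}

# Read dpkg.log lines on stdin and print the upstream version of the most
# recently installed libnvidia-compute-* package (e.g. 470.141.03).
library_version() {
    awk '$3 == "install" && $4 ~ /^libnvidia-compute-/ {
             v = $6
             sub(/-.*/, "", v)   # drop the Ubuntu packaging suffix
         }
         END { if (v != "") print v }'
}

# $1 = kernel module version, $2 = library version
compare_versions() {
    if [ "$1" = "$2" ]; then
        echo "match: $1"
    else
        echo "mismatch: kernel module $1, library $2"
    fi
}

# Only run the real check when both files are readable.
if [ -r /proc/driver/nvidia/version ] && [ -r /var/log/dpkg.log ]; then
    k=$(kernel_module_version < /proc/driver/nvidia/version)
    l=$(library_version < /var/log/dpkg.log)
    compare_versions "$k" "$l"
fi
```

Run inside the container, this should report a mismatch for the versions listed above. If the libraries were not installed through dpkg, the log will not contain the version and the second function prints nothing.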

The fix is simple: upgrade the NVIDIA driver on the host to a version no lower than that of the NVIDIA libraries in the container, for example:

    sudo apt install nvidia-driver-470

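Then reboot. The reboot is mandatory: without it, cat /proc/driver/nvidia/version still reports driver 465 rather than the newly installed 470, because a newly installed driver only takes effect after the machine restarts.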
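Whether the reboot is still pending can be checked on the host by comparing the version of the loaded kernel module with the version of the module installed on disk. This is a rough sketch under assumptions: the helper name `needs_reboot` is hypothetical, the module is assumed to be named `nvidia`, and it relies on /sys/module/nvidia/version and `modinfo -F version` being available on the host.

```shell
#!/bin/sh
# Sketch: detect whether a newly installed NVIDIA driver is waiting for a
# reboot. The helper name is hypothetical; run this on the host.

# $1 = version of the loaded module, $2 = version of the module on disk.
# Succeeds only when both are known and they differ.
needs_reboot() {
    [ -n "$1" ] && [ -n "$2" ] && [ "$1" != "$2" ]
}

# Version of the module currently loaded in the kernel (empty if not loaded).
loaded=$(cat /sys/module/nvidia/version 2>/dev/null)
# Version of the module file that would be loaded on the next boot.
ondisk=$(modinfo -F version nvidia 2>/dev/null)

if needs_reboot "$loaded" "$ondisk"; then
    echo "reboot required: loaded $loaded, installed $ondisk"
else
    echo "no pending driver change detected"
fi
```

After the reboot the two values should agree, and nvidia-smi inside the container can be retried.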
