Dissecting "Failed to initialize NVML: Driver/library version mismatch"

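    Running nvidia-smi produced this error. The fix is remarkably simple (a reboot is all it takes), but the cause is still worth analyzing and exploring a little further; there is always a chance of learning something unexpected along the way.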

What is NVML?

    The NVIDIA Management Library (NVML) is a C-based programmatic interface for monitoring and managing various states within NVIDIA Tesla™ GPUs. It is intended to be a platform for building 3rd party applications, and is also the underlying library for the NVIDIA-supported nvidia-smi tool. NVML is thread-safe so it is safe to make simultaneous NVML calls from multiple threads. In short, NVML is at once a programmable interface, a platform for third-party applications, and the library that tools such as nvidia-smi depend on.

Driver/Library version mismatch?

[zuosi@localhost]$cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  396.44  Wed Jul 11 16:51:49 PDT 2018
GCC version:  gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10)
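
The version number can be pulled out of that output with a small shell function. This is only a sketch: it assumes the "NVRM version" line has the layout shown above, and the function name nvrm_version is a hypothetical helper, not part of any NVIDIA tool.

```shell
#!/bin/sh
# Hypothetical helper: read the contents of /proc/driver/nvidia/version on
# stdin and print the driver version of the loaded kernel module.
# Assumes the "NVRM version: ... Kernel Module  <version>  <date>" layout
# shown in the output above; other layouts print nothing.
nvrm_version() {
    sed -n 's/^NVRM version:.*Kernel Module  *\([0-9][0-9.]*\).*/\1/p'
}

# Usage on a machine with the NVIDIA kernel module loaded:
#   nvrm_version < /proc/driver/nvidia/version
```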

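The kernel module currently loaded (NVRM) reports version 396.44. Now let's look at the version of the installed driver packages.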

[zuosi@localhost]$sudo dpkg --list | grep nvidia
rc  nvidia-384                                 384.130-0ubuntu0.16.04.1                      amd64        NVIDIA binary driver - version 384.130
ii  nvidia-396                                 396.82-0ubuntu1                               amd64        NVIDIA binary driver - version 396.82
ii  nvidia-cuda-dev                            7.5.18-0ubuntu1                               amd64        NVIDIA CUDA development files
ii  nvidia-cuda-doc                            7.5.18-0ubuntu1                               all          NVIDIA CUDA and OpenCL documentation
ii  nvidia-cuda-gdb                            7.5.18-0ubuntu1                               amd64        NVIDIA CUDA Debugger (GDB)
ii  nvidia-cuda-toolkit                        7.5.18-0ubuntu1                               amd64        NVIDIA CUDA development toolkit
ii  nvidia-opencl-dev:amd64                    7.5.18-0ubuntu1                               amd64        NVIDIA OpenCL development files
rc  nvidia-opencl-icd-384                      384.130-0ubuntu0.16.04.1                      amd64        NVIDIA OpenCL ICD
ii  nvidia-opencl-icd-396                      396.82-0ubuntu1                               amd64        NVIDIA OpenCL ICD
ii  nvidia-prime                               0.8.2                                         amd64        Tools to enable NVIDIA's Prime
ii  nvidia-profiler                            7.5.18-0ubuntu1                               amd64        NVIDIA Profiler for CUDA and OpenCL
ii  nvidia-settings                            418.40.04-0ubuntu1                            amd64        Tool for configuring the NVIDIA graphics driver
ii  nvidia-visual-profiler                     7.5.18-0ubuntu1                               amd64        NVIDIA Visual Profiler for CUDA and OpenCL

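Note: ii means the package is selected for installation and is installed; rc means the package has been removed but its configuration files are still present.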
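
To look only at the packages that are actually installed, the status column can be filtered. A minimal sketch, assuming the column layout of the dpkg --list output above; the function name installed_packages is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `dpkg --list` output on stdin and print
# "name version" for every package whose status column is "ii"
# (fully installed). Lines with other states, such as "rc", are skipped.
installed_packages() {
    awk '$1 == "ii" { print $2, $3 }'
}

# Usage:
#   dpkg --list | grep nvidia | installed_packages
```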

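The installed driver package is nvidia-396 at version 396.82, so the discrepancy is right here: 396.44 loaded in the kernel versus 396.82 on disk. nvidia-smi loads the user-space NVML library shipped with the 396.82 package, and that library refuses to talk to the older 396.44 kernel module. Why did it work before and suddenly stop matching? The version loaded in the kernel is clearly the one that has fallen behind, so let's check the dpkg log.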

[zuosi@localhost]$cat /var/log/dpkg.log| grep nvidia
2019-04-29 14:08:57 upgrade nvidia-396:amd64 396.44-0ubuntu1 396.82-0ubuntu1
2019-04-29 14:08:57 status half-configured nvidia-396:amd64 396.44-0ubuntu1
2019-04-29 14:09:04 status unpacked nvidia-396:amd64 396.44-0ubuntu1
2019-04-29 14:09:04 status half-installed nvidia-396:amd64 396.44-0ubuntu1
2019-04-29 14:09:13 status half-installed nvidia-396:amd64 396.44-0ubuntu1
2019-04-29 14:09:13 status unpacked nvidia-396:amd64 396.82-0ubuntu1
2019-04-29 14:09:13 status unpacked nvidia-396:amd64 396.82-0ubuntu1
2019-04-29 14:09:14 upgrade nvidia-opencl-icd-396:amd64 396.44-0ubuntu1 396.82-0ubuntu1
2019-04-29 14:09:14 status half-configured nvidia-opencl-icd-396:amd64 396.44-0ubuntu1
2019-04-29 14:09:14 status unpacked nvidia-opencl-icd-396:amd64 396.44-0ubuntu1
2019-04-29 14:09:14 status half-installed nvidia-opencl-icd-396:amd64 396.44-0ubuntu1
2019-04-29 14:09:14 status half-installed nvidia-opencl-icd-396:amd64 396.44-0ubuntu1
2019-04-29 14:09:14 status unpacked nvidia-opencl-icd-396:amd64 396.82-0ubuntu1
2019-04-29 14:09:14 status unpacked nvidia-opencl-icd-396:amd64 396.82-0ubuntu1
2019-04-29 14:09:14 upgrade nvidia-settings:amd64 410.72-0ubuntu1 418.40.04-0ubuntu1
2019-04-29 14:09:14 status half-configured nvidia-settings:amd64 410.72-0ubuntu1
2019-04-29 14:09:14 status unpacked nvidia-settings:amd64 410.72-0ubuntu1
2019-04-29 14:09:14 status half-installed nvidia-settings:amd64 410.72-0ubuntu1
2019-04-29 14:09:14 status half-installed nvidia-settings:amd64 410.72-0ubuntu1
2019-04-29 14:09:14 status unpacked nvidia-settings:amd64 418.40.04-0ubuntu1
2019-04-29 14:09:14 status unpacked nvidia-settings:amd64 418.40.04-0ubuntu1
2019-04-29 14:09:59 configure nvidia-396:amd64 396.82-0ubuntu1 <none>
2019-04-29 14:09:59 status unpacked nvidia-396:amd64 396.82-0ubuntu1
2019-04-29 14:09:59 status unpacked nvidia-396:amd64 396.82-0ubuntu1
2019-04-29 14:09:59 status half-configured nvidia-396:amd64 396.82-0ubuntu1
2019-04-29 14:10:54 status installed nvidia-396:amd64 396.82-0ubuntu1
2019-04-29 14:10:55 configure nvidia-opencl-icd-396:amd64 396.82-0ubuntu1 <none>
2019-04-29 14:10:55 status unpacked nvidia-opencl-icd-396:amd64 396.82-0ubuntu1
2019-04-29 14:10:55 status unpacked nvidia-opencl-icd-396:amd64 396.82-0ubuntu1
2019-04-29 14:10:55 status half-configured nvidia-opencl-icd-396:amd64 396.82-0ubuntu1
2019-04-29 14:10:55 status installed nvidia-opencl-icd-396:amd64 396.82-0ubuntu1
2019-04-29 14:10:55 configure nvidia-settings:amd64 418.40.04-0ubuntu1 <none>
2019-04-29 14:10:55 status unpacked nvidia-settings:amd64 418.40.04-0ubuntu1
2019-04-29 14:10:55 status unpacked nvidia-settings:amd64 418.40.04-0ubuntu1
2019-04-29 14:10:55 status half-configured nvidia-settings:amd64 418.40.04-0ubuntu1
2019-04-29 14:10:55 status installed nvidia-settings:amd64 418.40.04-0ubuntu1
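
The log is noisy because every state transition is recorded. The upgrade events alone can be isolated with a short filter. This is a sketch that assumes the dpkg.log field order seen above (date, time, action, package, old version, new version); the function name nvidia_upgrades is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read dpkg.log lines on stdin and print one summary
# line per "upgrade" action on a package whose name starts with "nvidia".
# Assumed fields: $1 date, $2 time, $3 action, $4 package, $5 old, $6 new.
nvidia_upgrades() {
    awk '$3 == "upgrade" && $4 ~ /^nvidia/ { print $1, $2, $4, $5, "->", $6 }'
}

# Usage:
#   nvidia_upgrades < /var/log/dpkg.log
```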

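Clearly the NVIDIA driver was upgraded from 396.44 to 396.82 (probably because I ran apt-get upgrade by hand), but the kernel module has not been reloaded since. In fact the new kernel module is already in place on disk, just waiting to be loaded into the kernel. See for yourself.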

[zuosi@localhost]$find /lib/modules/$(uname -r) -name "*nvidia*.ko" -ls
  8677356     64 -rw-r--r--   1 root     root        63846 Feb 13 04:31 /lib/modules/4.15.0-46-generic/kernel/drivers/video/fbdev/nvidia/nvidiafb.ko
  8650998     72 -rw-r--r--   1 root     root        69852 Apr 29 14:10 /lib/modules/4.15.0-46-generic/updates/dkms/nvidia_396_drm.ko
  8650995  18392 -rw-r--r--   1 root     root     18830596 Apr 29 14:10 /lib/modules/4.15.0-46-generic/updates/dkms/nvidia_396.ko
  8650997   1292 -rw-r--r--   1 root     root      1319556 Apr 29 14:10 /lib/modules/4.15.0-46-generic/updates/dkms/nvidia_396_modeset.ko
  8650999   1260 -rw-r--r--   1 root     root      1286612 Apr 29 14:10 /lib/modules/4.15.0-46-generic/updates/dkms/nvidia_396_uvm.ko
[zuosi@localhost]$modinfo /lib/modules/4.15.0-46-generic/updates/dkms/nvidia_396.ko
filename:       /lib/modules/4.15.0-46-generic/updates/dkms/nvidia_396.ko
alias:          char-major-195-*
version:        396.82
supported:      external
license:        NVIDIA
srcversion:     1972864AFC73362967DE403
alias:          pci:v000010DEd00000E00sv*sd*bc04sc80i00*
alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
depends:        ipmi_msghandler
retpoline:      Y
name:           nvidia
vermagic:       4.15.0-46-generic SMP mod_unload
parm:           NVreg_Mobile:int
parm:           NVreg_ResmanDebugLevel:int
parm:           NVreg_RmLogonRC:int
parm:           NVreg_ModifyDeviceFiles:int
parm:           NVreg_DeviceFileUID:int
parm:           NVreg_DeviceFileGID:int
parm:           NVreg_DeviceFileMode:int
parm:           NVreg_UpdateMemoryTypes:int
parm:           NVreg_InitializeSystemMemoryAllocations:int
parm:           NVreg_UsePageAttributeTable:int
parm:           NVreg_MapRegistersEarly:int
parm:           NVreg_RegisterForACPIEvents:int
parm:           NVreg_CheckPCIConfigSpace:int
parm:           NVreg_EnablePCIeGen3:int
parm:           NVreg_EnableMSI:int
parm:           NVreg_TCEBypassMode:int
parm:           NVreg_UseThreadedInterrupts:int
parm:           NVreg_EnableStreamMemOPs:int
parm:           NVreg_EnableBacklightHandler:int
parm:           NVreg_RestrictProfilingToAdminUsers:int
parm:           NVreg_EnableUserNUMAManagement:int
parm:           NVreg_MemoryPoolSize:int
parm:           NVreg_IgnoreMMIOCheck:int
parm:           NVreg_RegistryDwords:charp
parm:           NVreg_RegistryDwordsPerDevice:charp
parm:           NVreg_RmMsg:charp
parm:           NVreg_AssignGpus:charp
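
With both numbers available, the mismatch can be confirmed in one step by comparing the version of the loaded module with the version of the module file on disk. The function below is a minimal sketch; report_driver_state is a hypothetical name, and the usage lines assume the /proc layout and the module path shown earlier in this post.

```shell
#!/bin/sh
# Hypothetical helper: compare the version of the loaded kernel module ($1)
# with the version of the module file on disk ($2).
# Prints a one-line verdict; returns 0 on a match and 1 on a mismatch.
report_driver_state() {
    loaded=$1
    on_disk=$2
    if [ "$loaded" = "$on_disk" ]; then
        echo "match: $loaded"
    else
        echo "mismatch: loaded $loaded, on disk $on_disk"
        return 1
    fi
}

# Usage (path and /proc format taken from the output in this post):
#   loaded=$(sed -n 's/.*Kernel Module  *\([0-9][0-9.]*\).*/\1/p' /proc/driver/nvidia/version)
#   on_disk=$(modinfo -F version "/lib/modules/$(uname -r)/updates/dkms/nvidia_396.ko")
#   report_driver_state "$loaded" "$on_disk"
```

For the situation described here the call would be report_driver_state 396.44 396.82, which reports a mismatch and returns a non-zero status.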

I took the simplest route: a reboot, which loads the 396.82 driver kernel module.
