Problem analysis
When running nvidia-smi, I got the following error:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver.
Make sure that the latest NVIDIA driver is installed and running.
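Most of the resources I found online say to reinstall CUDA, upgrade the Linux headers, and so on. That is a lot of trouble, so I wanted to see whether there was another way.
Cause: the NVIDIA driver is not running properly, and nvidia-smi depends on the driver, so it prints this error.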
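Before reinstalling anything, it can be worth confirming that the driver really is the missing piece. The following is a minimal sketch, assuming a Linux system where /proc/modules is readable. The helper names are my own and do not come from any NVIDIA tool. It only checks whether the core nvidia kernel module is listed as loaded.

```python
from pathlib import Path


def loaded_module_names(proc_modules_text):
    """Return the set of module names found in /proc/modules-style text."""
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            names.add(fields[0])  # the first column is the module name
    return names


def nvidia_module_loaded(proc_modules_text):
    """True if the core 'nvidia' kernel module appears in the listing."""
    return "nvidia" in loaded_module_names(proc_modules_text)


if __name__ == "__main__":
    path = Path("/proc/modules")  # assumption: Linux with procfs mounted
    if path.exists():
        print("nvidia module loaded:", nvidia_module_loaded(path.read_text()))
    else:
        print("/proc/modules not found; this check only applies to Linux")
```

If the module is not listed, nvidia-smi has no driver to talk to, which is consistent with the error message. The sketch does not say why the module failed to load.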
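My first thought was to reinstall a suitable version of the driver, but nobody had changed anything on the machine recently, and this approach did not help.
Then I noticed that libstdc++, a package I had needed earlier, had been upgraded. I tried downgrading it to the previous version and rebooted, and to my surprise that fixed it!
Summary: if this error appears out of nowhere, first track down what caused it and then revert that change. This is much better than reinstalling CUDA and the like.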
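One way to track down what changed is to list recent package upgrades. This is a minimal sketch, assuming a Debian/Ubuntu system where dpkg records upgrades in /var/log/dpkg.log. The post does not say which distribution was used, so the log location is an assumption. The function name and keyword list are illustrative only.

```python
from pathlib import Path

# Illustrative keywords; adjust them to the packages you care about.
KEYWORDS = ("libstdc++", "nvidia", "linux-image", "linux-headers")


def recent_upgrades(log_text, keywords=KEYWORDS):
    """Return (timestamp, package, old_version, new_version) tuples for
    'upgrade' entries in dpkg.log text whose package name matches a keyword."""
    hits = []
    for line in log_text.splitlines():
        fields = line.split()
        # Expected shape: DATE TIME upgrade PACKAGE OLD_VERSION NEW_VERSION
        if len(fields) != 6 or fields[2] != "upgrade":
            continue
        date, time, _, package, old, new = fields
        if any(k in package for k in keywords):
            hits.append((f"{date} {time}", package, old, new))
    return hits


if __name__ == "__main__":
    log = Path("/var/log/dpkg.log")  # assumption: default dpkg log location
    if log.exists():
        text = log.read_text(errors="replace")
        for stamp, package, old, new in recent_upgrades(text):
            print(f"{stamp}  {package}: {old} -> {new}")
    else:
        print("no dpkg log found; this sketch only applies to dpkg-based systems")
```

Each line of output shows the old and new version of a suspect package, which is the information needed to put the previous version back and reboot.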
References:
- First check whether the GPU is detected, then check whether the kernel version has changed
- https://forums.developer.nvidia.com/t/nvidia-smi-has-failed-because-it-couldnt-communicate-with-the-nvidia-driver-ubuntu-16-04/48635/7
- Update the kernel, reinstall the driver
- https://deeptalk.lambdalabs.com/t/nvidia-smi-has-failed-because-it-couldnt-communicate-with-the-nvidia-driver/148/8
- https://stackoverflow.com/questions/42984743/nvidia-smi-has-failed-because-it-couldnt-communicate-with-the-nvidia-driver
- https://askubuntu.com/questions/927199/nvidia-smi-has-failed-because-it-couldnt-communicate-with-the-nvidia-driver-ma
- https://forums.developer.nvidia.com/t/nvidia-smi-has-failed-because-it-couldnt-communicate-with-the-nvidia-driver-make-sure-that-the-latest-nvidia-driver-is-installed-and-running/111008/5