RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found

今天用GPU跑的时候显示:RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!

pg = ProcessGroupNCCL(prefix_store, rank, world_size, pg_options)
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!

这个错误一开始让我怀疑起了这个GPU是不是没用了,-_-||,但是实验室里的小伙伴确信gpu没问题!然后就开始了bug排查之旅...

这时在命令行查看的时候终于现出了它的马脚,估计是pytorch出现了问题,害!

>>> import torch
>>> print(torch.cuda.is_available())
/home/xutianjiao/anaconda3/envs/py36/lib/python3.6/site-packages/torch/cuda/__init__.py:80: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 9020). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:112.)
  return torch._C._cuda_getDeviceCount() > 0
False
>>> print(torch.cuda.get_device_name(0))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/xutianjiao/anaconda3/envs/py36/lib/python3.6/site-packages/torch/cuda/__init__.py", line 326, in get_device_name
    return get_device_properties(device).name
  File "/home/xutianjiao/anaconda3/envs/py36/lib/python3.6/site-packages/torch/cuda/__init__.py", line 356, in get_device_properties
    _lazy_init()  # will define _get_device_properties
  File "/home/xutianjiao/anaconda3/envs/py36/lib/python3.6/site-packages/torch/cuda/__init__.py", line 214, in _lazy_init
    torch._C._cuda_init()
RuntimeError: The NVIDIA driver on your system is too old (found version 9020). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver.

查了下这个错误,显示cuda和torch的版本不匹配。其实不出现上面这个报错也可能是cuda版本不一致的问题,可以检查一下:

import torch
print(torch.version.cuda)

附Linux如何查看cuda版本:

第一种方法:
nvcc --version

第二种方法:
cat /usr/local/cuda/version.txt

再检查了pytorch的版本,1.10+,好的,那就安装低版本torch试试!注意了:这里cuda的版本要对应上,具体查看

# CUDA 9.2
conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=9.2 -c pytorch

# CUDA 10.1
conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=10.1 -c pytorch

# CUDA 10.2
conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=10.2 -c pytorch

# CUDA 11.0
conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=11.0 -c pytorch

# CPU Only
conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cpuonly -c pytorch

大功告成!

  • 7
    点赞
  • 8
    收藏
    觉得还不错? 一键收藏
  • 2
    评论
评论 2
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值