Fixing the "Unexpected error from cudaGetDeviceCount" error

engchina

Last modified: 2023-12-25 21:34:22

Category: LINUX. Tags: cuda, pytorch

First published: 2023-11-19 20:08:09

Article link: https://blog.csdn.net/engchina/article/details/134494382

0. Background
1. Solution 1
2. Solution 2
3. Solution 3

0. Background

I recently set up a new server with four 4090 GPUs.

Running python -c "import torch;print(torch.cuda.is_available());" inside wsl-ubuntu produces the following error.

/root/miniconda3/envs/chatglm3-demo/lib/python3.10/site-packages/torch/cuda/__init__.py:107: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 2: out of memory (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
False

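nvidia-smi, on the other hand, runs and prints its output normally.

Most of the articles I found online say either that a reboot fixes the problem, or that the CUDA and PyTorch versions do not match and a matching pair has to be installed.

Another machine of mine was set up in exactly the same way and does not show this error, so I concluded that neither of those explanations applies here.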

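1. Solution 1

Next I looked into whether a machine with four 4090 GPUs needs some special configuration. After reading many more articles, I finally got it working with the following settings: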

CUDA_DEVICE_ORDER="PCI_BUS_ID" CUDA_VISIBLE_DEVICES=0,4 python -c "import torch;print(torch.cuda.is_available());"

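With four GPUs numbered 0 to 3, index 4 does not exist. CUDA ignores an invalid index and every entry after it, so this setting most likely exposes only GPU 0. The command produced the correct output: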

True
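
The same settings can also be applied from inside a Python script rather than on the command line. The following is a minimal sketch, not part of the original fix: the helper name configure_cuda_env is hypothetical, and it assumes the variables are set before anything initializes CUDA (the safest place is before import torch).

```python
import os


def configure_cuda_env(visible_devices, device_order="PCI_BUS_ID"):
    """Set the CUDA enumeration variables for the current process.

    Hypothetical helper: call it before CUDA is initialized,
    ideally before `import torch`.
    """
    os.environ["CUDA_DEVICE_ORDER"] = device_order
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in visible_devices)
    # Return the two values so the caller can log or verify them.
    return {
        key: os.environ[key]
        for key in ("CUDA_DEVICE_ORDER", "CUDA_VISIBLE_DEVICES")
    }


if __name__ == "__main__":
    print(configure_cuda_env([0, 1, 2, 3]))
    try:
        import torch  # imported only after the environment is configured
    except ImportError:
        print("torch is not installed; only the environment was configured")
    else:
        print(torch.cuda.is_available())
```

Setting the variables in os.environ affects only the current process and its children, so the shell environment stays unchanged.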

2. Solution 2

Continuing along the same line of investigation, a second setting that worked was PyTorch's NVML-based availability check:

# Checks if `cuda` is available via an `nvml-based` check which won't trigger the drivers and leave cuda uninitialized.
CUDA_DEVICE_ORDER="PCI_BUS_ID" PYTORCH_NVML_BASED_CUDA_CHECK=1 CUDA_VISIBLE_DEVICES=0,1,2,3 python -c "import torch;print(torch.cuda.is_available());"

This also produced the correct output:

True
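
As the comment in the command says, the NVML-based check leaves CUDA uninitialized, so a True result here does not by itself show that CUDA initialization succeeds. A small follow-up check can confirm that a GPU is really usable. This is a sketch under assumptions: the function name cuda_smoke_test is hypothetical, and it simply treats any exception during allocation as a failure.

```python
def cuda_smoke_test():
    """Return True only if a tensor can actually be created on a GPU."""
    try:
        import torch
    except ImportError:
        return False  # torch is missing, so there is nothing to test

    try:
        # Allocating a tensor on the device forces real CUDA initialization.
        x = torch.zeros(1, device="cuda")
        return bool((x + 1).item() == 1.0)
    except Exception as exc:  # the exception type depends on the torch build
        print(f"CUDA initialization failed: {exc}")
        return False


if __name__ == "__main__":
    print(cuda_smoke_test())
```

Running this with the same environment variables as the command above shows whether the workaround only changes what is_available() reports or whether the GPUs can be used.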

3. Solution 3

A third variant that worked was importing accelerate before torch:

CUDA_DEVICE_ORDER="PCI_BUS_ID" CUDA_VISIBLE_DEVICES=0,1,2,3 python -c "from accelerate import Accelerator;import torch;print(torch.cuda.is_available());"

This also produced the correct output:

True
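
Since the commands above differ mainly in which devices they expose, it can help to rerun the check once per device selection and see which ones fail. The sketch below is an illustration only: build_env and probe are hypothetical helper names, and each check runs in a fresh subprocess so that a failed CUDA initialization in one run cannot affect the next.

```python
import os
import subprocess
import sys

# The same one-liner used throughout this article.
DEFAULT_CHECK = "import torch; print(torch.cuda.is_available())"


def build_env(devices, base=None):
    """Copy the environment and restrict CUDA to the given device indices."""
    env = dict(os.environ if base is None else base)
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(d) for d in devices)
    return env


def probe(devices, check=DEFAULT_CHECK):
    """Run `check` in a child interpreter; True if it printed 'True'."""
    result = subprocess.run(
        [sys.executable, "-c", check],
        env=build_env(devices),
        capture_output=True,  # warnings go to stderr, the answer to stdout
        text=True,
    )
    return result.stdout.strip() == "True"
```

For example, calling probe([i]) for each i in range(4) and then probe([0, 1, 2, 3]) indicates whether a single card or the combination triggers the error.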

The end!