A CUDA mystery? I don't get it

Latest recommended article published on 2024-05-21 10:11:28

三思为上策

Column: Deep Learning Beginner Notes (深度学习小白经验贴). Tags: python, pytorch

Article link: https://blog.csdn.net/qq_43522986/article/details/129643788

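While running Stable Diffusion-based code recently, I hit the same strange problem several times. The fix that worked felt like black magic, and I was confused the whole way through. I hope someone more experienced can explain it.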

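My goal:

I am running a program on a server with 8 GPUs. Until now it has always run on the first card, cuda:0, by default, and I want to run it on a different card.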
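For reference, one common way to steer a process onto a particular card, independent of any config file, is to set the CUDA_VISIBLE_DEVICES environment variable before CUDA is initialized. The following is only a minimal sketch of that idea, not what this post ends up doing; the helper name pin_gpu is hypothetical, and it assumes nothing else in the process changes the variable afterwards.

```python
import os


def pin_gpu(physical_index):
    """Expose a single physical GPU to this process (hypothetical helper).

    This only has an effect if it runs before CUDA is initialized,
    so call it before importing or using torch.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = str(physical_index)
    return os.environ["CUDA_VISIBLE_DEVICES"]


# Choose physical card 6 on the 8-GPU server.
pin_gpu(6)

# Only now import torch and build the model, for example:
#   import torch
#   device = torch.device("cuda:0")
# With a single visible card, the process addresses it as cuda:0,
# even though it is physical GPU 6.
```

The same effect can be had from the shell by prefixing the command with CUDA_VISIBLE_DEVICES=6, which avoids touching the script at all.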

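What I tried:

Stable Diffusion appears to take the GPU index from its config file, so I changed it there first. That had no effect: the program still ran on cuda:0.

Next I selected the card in the script itself with torch.cuda.set_device(6):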

if __name__ == "__main__":
    torch.cuda.set_device(6)
    main()

This raised an error saying that device ordinal 6 does not exist:

 File "scripts/inference.py", line 587, in <module>
    torch.cuda.set_device(6)
 File "anaconda3/envs/PbE/lib/python3.8/site-packages/torch/cuda/__init__.py", line 311, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

So I added print(torch.cuda.device_count()) right before torch.cuda.set_device(6), and it reported only 1 available CUDA device. But this is an 8-GPU machine!
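A count of 1 on an 8-GPU machine is what one would expect to see if CUDA_VISIBLE_DEVICES lists a single card at the moment CUDA initializes. Whether that is what happens in this script is an assumption, but it is cheap to check by printing the variable next to the device count. The helper below is a hypothetical sketch; it only counts comma-separated entries and does not validate them the way the CUDA runtime does.

```python
import os


def describe_visible_devices(env=None):
    """Summarize CUDA_VISIBLE_DEVICES as seen by this process (hypothetical helper)."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return "CUDA_VISIBLE_DEVICES is unset: all GPUs are visible"
    # Count the comma-separated entries; an empty string hides every GPU.
    ids = [part.strip() for part in raw.split(",") if part.strip()]
    return f"CUDA_VISIBLE_DEVICES={raw!r}: {len(ids)} GPU(s) listed"


# Place this line right next to print(torch.cuda.device_count()).
print(describe_visible_devices())
```

If the variable turns out to be set to a single index at that point, something that ran earlier in the process (or the shell that launched it) restricted the visible cards.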

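In my other .py files, print(torch.cuda.device_count()) correctly returns 8, and typing the following two lines into an interactive python session in the terminal also gives 8. Only the file I actually want to run misbehaves.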

>>> import torch
>>> print(torch.cuda.device_count())

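On a hunch, I found the import torch line among the script's many imports and inserted print(torch.cuda.device_count()) immediately after it. This time it printed the correct answer, 8, and the later torch.cuda.set_device(6) call succeeded, so the program ran on cuda:6.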

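I really can't understand the reason behind this, though. Simply placing print(torch.cuda.device_count()) right after import torch makes the error go away, while calling it after the other packages have been imported fails. Can an import really change the number of available CUDA devices? It feels like black magic.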
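One plausible explanation, offered as an assumption rather than something confirmed for this script: a module imported after torch sets CUDA_VISIBLE_DEVICES at import time. CUDA reads that variable when it is initialized in the process, and PyTorch initializes CUDA lazily, so an early torch.cuda.device_count() call may lock in all 8 cards before the variable is changed. The sketch below imports a list of modules one at a time and reports which of them alter the variable; the function name and the example module names are hypothetical.

```python
import importlib
import os


def find_env_changing_imports(module_names, var="CUDA_VISIBLE_DEVICES"):
    """Import modules in order and report which ones change `var`.

    Returns a list of (module_name, value_before, value_after) tuples.
    Modules that are already imported are not re-executed, so run this
    in a fresh interpreter, before the script's normal imports.
    """
    changes = []
    for name in module_names:
        before = os.environ.get(var)
        importlib.import_module(name)
        after = os.environ.get(var)
        if before != after:
            changes.append((name, before, after))
    return changes


# Standard-library modules leave the variable alone, so this prints [].
print(find_env_changing_imports(["json", "math"]))
```

To apply it, replace the example list with the module names from the top of the script being debugged, in their original order. If one of them shows up in the result, moving the GPU selection before that import, or removing the assignment inside that module, should make the behavior predictable.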