Fixing "RuntimeError: Unexpected error from cudaGetDeviceCount(). Error 802: system not yet initialized"

endNone

Last modified: 2024-06-21 15:04:25


Category: LLM debugging. Tags: nvidia, torch, cuda, python, pytorch

First published: 2024-06-21 15:03:56

Article link: https://blog.csdn.net/zwhszdx/article/details/139861534


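Reproducing the problem

I had just received a new server and installed the CUDA 12.1 driver and the CUDA Toolkit on it. After starting the vLLM service, the following error appeared: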

[root@localhost ~]#python3.9 /root/FastChat/fastchat/serve/vllm_worker.py --model-path /run/model/qwen-110b/   --num-gpus 8 --dtype bfloat16 
2024-06-21 00:50:37 | ERROR | stderr | Traceback (most recent call last):
2024-06-21 00:50:37 | ERROR | stderr |   File "/root/FastChat/fastchat/serve/vllm_worker.py", line 41, in <module>
2024-06-21 00:50:37 | ERROR | stderr |     seed = torch.cuda.current_device()
2024-06-21 00:50:37 | ERROR | stderr |   File "/usr/local/lib/python3.9/site-packages/torch/cuda/__init__.py", line 778, in current_device
2024-06-21 00:50:37 | ERROR | stderr |     _lazy_init()
2024-06-21 00:50:37 | ERROR | stderr |   File "/usr/local/lib/python3.9/site-packages/torch/cuda/__init__.py", line 293, in _lazy_init
2024-06-21 00:50:37 | ERROR | stderr |     torch._C._cuda_init()
2024-06-21 00:50:37 | ERROR | stderr | RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized
[root@localhost ~]# 

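Root cause

The nvidia-fabricmanager service was not running, so multi-GPU workloads could not start. Error 802 is the CUDA runtime's cudaErrorSystemNotReady, which on NVSwitch-based servers typically means Fabric Manager has not yet initialized the GPU fabric.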
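To tell this failure apart from other CUDA initialization errors, the error code can be pulled out of the exception message. The following is only a minimal sketch: the helper names `extract_cuda_error_code`, `looks_like_fabric_manager_issue`, and `probe_cuda` are hypothetical, and it assumes the message keeps the `Error <code>:` format seen in the traceback above. Error 802 may have other causes, so a match is a hint rather than a diagnosis.

```python
import re
from typing import Optional

# Assumption: PyTorch reports the CUDA error as "Error <code>: <text>",
# as in the traceback shown in this article.
_CUDA_ERROR_RE = re.compile(r"Error (\d+):")


def extract_cuda_error_code(message: str) -> Optional[int]:
    """Return the numeric CUDA error code found in the message, or None."""
    match = _CUDA_ERROR_RE.search(message)
    return int(match.group(1)) if match else None


def looks_like_fabric_manager_issue(message: str) -> bool:
    """True if the message carries CUDA error 802 (system not yet initialized)."""
    return extract_cuda_error_code(message) == 802


def probe_cuda() -> str:
    """Try the same call that failed in vllm_worker.py and describe the result."""
    try:
        import torch  # imported lazily so the helpers above work without PyTorch
    except ImportError:
        return "PyTorch is not installed in this environment."
    try:
        index = torch.cuda.current_device()
    except Exception as exc:  # broad on purpose: failures surface as several types
        if looks_like_fabric_manager_issue(str(exc)):
            return "CUDA error 802: check whether nvidia-fabricmanager is running."
        return f"CUDA initialization failed for another reason: {exc}"
    return f"CUDA initialized; current device index is {index}."


if __name__ == "__main__":
    print(probe_cuda())
```

Running the script on the affected server before launching the worker should show whether CUDA initialization fails with error 802 or with something unrelated.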

Solution

systemctl enable nvidia-fabricmanager
systemctl start nvidia-fabricmanager
systemctl status nvidia-fabricmanager
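The status check above can also be scripted, for example as a pre-flight step before starting the vLLM worker. This is a rough sketch under stated assumptions: the function names and the injectable `runner` parameter are hypothetical, and it relies only on `systemctl is-active`, which prints `active` when the unit is running.

```python
import subprocess
from typing import Callable, Sequence

SERVICE = "nvidia-fabricmanager"


def run_systemctl(args: Sequence[str]) -> str:
    """Run systemctl with the given arguments and return its stripped stdout.

    Returns an empty string if systemctl is not available on this machine.
    """
    try:
        result = subprocess.run(
            ["systemctl", *args], capture_output=True, text=True, check=False
        )
    except FileNotFoundError:
        return ""
    return result.stdout.strip()


def fabricmanager_is_active(
    runner: Callable[[Sequence[str]], str] = run_systemctl,
) -> bool:
    """True if `systemctl is-active nvidia-fabricmanager` reports 'active'."""
    return runner(["is-active", SERVICE]) == "active"


def diagnose(runner: Callable[[Sequence[str]], str] = run_systemctl) -> str:
    """Return a short hint based on the service state."""
    if fabricmanager_is_active(runner):
        return f"{SERVICE} is active."
    return (
        f"{SERVICE} is not active. Try: "
        f"systemctl enable {SERVICE} && systemctl start {SERVICE}"
    )


if __name__ == "__main__":
    print(diagnose())
```

The `runner` argument exists only so the logic can be exercised without a real systemd; in normal use the defaults call `systemctl` directly.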
