cuda runtime error (802) : system not yet initialized .../THCGeneral.cpp:50

Reference link

https://github.com/pytorch/pytorch/issues/35710
The GPU is a Tesla A100.
In short: install Fabric Manager.

If you hit the following error when using CUDA from PyTorch:

RuntimeError: cuda runtime error (802) : system not yet initialized at
/opt/conda/conda-bld/pytorch_1579022034529/work/aten/src/THC/THCGeneral.cpp:50

try building and running one of the CUDA samples:

https://github.com/NVIDIA/cuda-samples.git

git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/Samples/bandwidthTest
make
./bandwidthTest

Note: nvcc is located in /usr/local/cuda/bin.

If you get:

 ./bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...
cudaGetDeviceProperties returned 802
-> system not yet initialized
CUDA error at bandwidthTest.cu:256 code=802(cudaErrorSystemNotReady) "cudaSetDevice(currentDevice)" 
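When the sample is run from a script, the failure can be spotted by scanning its output. The following is a minimal Python sketch, assuming the `code=<number>(<name>)` format seen in the log above; `find_cuda_error` is a hypothetical helper name, not part of any CUDA tooling.

```python
import re

# Matches fragments such as "code=802(cudaErrorSystemNotReady)"
# printed by the sample's error check.
_CUDA_ERROR = re.compile(r"code=(\d+)\((\w+)\)")


def find_cuda_error(output):
    """Return (code, name) for the first CUDA error in the text, or None."""
    match = _CUDA_ERROR.search(output)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)
```

Passing the captured output of ./bandwidthTest to `find_cuda_error` gives the numeric code and the symbolic name, which is easier to search for than the full message.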

This 802 error means that the Data Center GPU Manager is not installed. What you need to do is install NVIDIA DCGM. First, get the repository key:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub

If you get:

sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub
Executing: /tmp/apt-key-gpghome.qjhmgicscb/gpg.1.sh --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub
gpg: requesting key from ‘https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub’
gpg: WARNING: unable to fetch URI https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub: Connection timed out

you may need to set the proxy manually:

sudo apt-key adv --keyserver-options http-proxy=<PROXY-ADDRESS:PORT>  --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub

Then you can add the repository and install the package:

sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /"
sudo apt-get update
sudo apt-get install -y datacenter-gpu-manager

Stop the host engine:

sudo nv-hostengine -t

and start Fabric Manager:

sudo systemctl enable nvidia-fabricmanager.service
sudo service nvidia-fabricmanager start
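Whether the service actually came up can be checked with `systemctl is-active`, which prints `active` for a running unit. A minimal Python sketch follows; `service_is_active` and its `run` parameter are hypothetical names introduced here, and the `run` hook exists only so the check can be exercised without systemd.

```python
import subprocess


def service_is_active(unit="nvidia-fabricmanager", run=subprocess.run):
    """Return True if `systemctl is-active <unit>` reports "active"."""
    try:
        result = run(
            ["systemctl", "is-active", unit],
            capture_output=True,
            text=True,
            check=False,  # a non-zero exit just means "not active"
        )
    except FileNotFoundError:
        # systemctl is not installed on this machine
        return False
    return result.stdout.strip() == "active"
```

Calling `service_is_active()` after the start command returns False both when the unit is stopped and when it does not exist.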

If you get:

sudo service nvidia-fabricmanager start
Failed to start nvidia-fabricmanager.service: Unit nvidia-fabricmanager.service not found.

install Fabric Manager (its <version> has to match the installed NVIDIA driver) and start it:

sudo apt-get install cuda-drivers-fabricmanager-<version>
sudo systemctl enable nvidia-fabricmanager.service
sudo service nvidia-fabricmanager start
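The `<version>` placeholder can be derived from the installed driver. The sketch below assumes that the package suffix is the driver's major branch (for example 470 for a driver reported as 470.57.02, an illustrative version that does not come from this post); `fabricmanager_package` and `installed_driver_version` are hypothetical helper names.

```python
import subprocess


def fabricmanager_package(driver_version):
    """Build the package name for a driver version such as "470.57.02".

    Assumption: the package suffix is the driver's major branch number.
    """
    branch = driver_version.strip().split(".")[0]
    if not branch.isdigit():
        raise ValueError(f"unexpected driver version: {driver_version!r}")
    return f"cuda-drivers-fabricmanager-{branch}"


def installed_driver_version():
    """Ask nvidia-smi for the driver version of the first GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return out.splitlines()[0].strip()
```

On the GPU host, `fabricmanager_package(installed_driver_version())` yields the name to pass to apt-get install. It is worth confirming with apt-cache search that the package exists before installing.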

The program should now run successfully:

 ./bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: NVIDIA A100-SXM4-40GB
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     26.1

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     25.6

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     1152.7

Result = PASS
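For repeated checks, the three bandwidth figures and the final verdict can be pulled out of this output. The sketch below is a rough parser that assumes each section header is followed by exactly one decimal number, as in the run above; `parse_bandwidth` is a hypothetical helper name.

```python
import re

# After each "<direction> Bandwidth" header, take the first number that
# contains a decimal point; the transfer size (32000000) has none.
_SECTION = re.compile(
    r"(Host to Device|Device to Host|Device to Device) Bandwidth.*?(\d+\.\d+)",
    re.DOTALL,
)


def parse_bandwidth(output):
    """Return ({direction: GB/s}, passed) parsed from bandwidthTest output."""
    rates = {
        direction: float(value)
        for direction, value in _SECTION.findall(output)
    }
    return rates, "Result = PASS" in output
```

The function returns a dictionary keyed by transfer direction plus a boolean for the PASS line, so a script can fail fast when the verdict is missing.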

Note: The CUDA samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
CUDA now works from Python as well:

ipython
Python 3.7.11 (default, Jul 27 2021, 14:32:16) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.26.0 -- An enhanced Interactive Python. Type '?' for help.


In [1]: import torch
In [2]: torch.cuda.is_available()
Out[2]: True
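The same check can go a little further and list the visible GPUs. This is a minimal sketch that takes the imported `torch` module as an argument; `describe_cuda` is a hypothetical helper name, and it relies only on `torch.cuda.is_available`, `torch.cuda.device_count` and `torch.cuda.get_device_name`.

```python
def describe_cuda(torch_module):
    """Return one line per visible GPU, using the given torch module."""
    cuda = torch_module.cuda
    if not cuda.is_available():
        return ["CUDA is not available"]
    return [
        f"GPU {index}: {cuda.get_device_name(index)}"
        for index in range(cuda.device_count())
    ]
```

In an interactive session, `import torch` followed by `print("\n".join(describe_cuda(torch)))` prints the device list.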

Additional reference link

https://github.com/aws/aws-parallelcluster/wiki/NVIDIA-Fabric-Manager-stops-running-on-Ubuntu-18.04-and-Ubuntu-20.04
