Reference
https://github.com/pytorch/pytorch/issues/35710
The GPU in question is a Tesla A100.
In short: install Fabric Manager.
If using CUDA from PyTorch produces the following error:
RuntimeError: cuda runtime error (802) : system not yet initialized at
/opt/conda/conda-bld/pytorch_1579022034529/work/aten/src/THC/THCGeneral.cpp:50
try building and running one of the CUDA samples:
https://github.com/NVIDIA/cuda-samples.git
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/Samples/bandwidthTest
make
./bandwidthTest
Note: nvcc is located in /usr/local/cuda/bin.
If you get:
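If make fails because nvcc cannot be found, a quick lookup along these lines can help. This is only a minimal sketch: find_nvcc is a hypothetical helper name, and the fallback directory is simply the location mentioned in the note above.

```python
import os
import shutil
from typing import Optional


def find_nvcc(cuda_bin: str = "/usr/local/cuda/bin") -> Optional[str]:
    """Return a path to nvcc, or None if it cannot be located.

    Looks on PATH first, then in cuda_bin (assumed default install location).
    """
    on_path = shutil.which("nvcc")
    if on_path:
        return on_path
    candidate = os.path.join(cuda_bin, "nvcc")
    if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return candidate
    return None


if __name__ == "__main__":
    path = find_nvcc()
    print(path if path else "nvcc not found; add /usr/local/cuda/bin to PATH")
```

If the fallback path is the one that matches, adding /usr/local/cuda/bin to PATH before running make is usually enough.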
./bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...
cudaGetDeviceProperties returned 802
-> system not yet initialized
CUDA error at bandwidthTest.cu:256 code=802(cudaErrorSystemNotReady) "cudaSetDevice(currentDevice)"
this means the Data Center GPU Manager is not installed. What you need to do is install NVIDIA DCGM. First, get the repository key:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub
If you get:
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub
Executing: /tmp/apt-key-gpghome.qjhmgicscb/gpg.1.sh --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub
gpg: requesting key from ‘https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub’
gpg: WARNING: unable to fetch URI https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub: Connection timed out
you may need to set the proxy manually:
sudo apt-key adv --keyserver-options http-proxy=<PROXY-ADDRESS:PORT> --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub
Then add the repository and install the package:
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /"
sudo apt-get update
sudo apt-get install -y datacenter-gpu-manager
Stop the host engine:
sudo nv-hostengine -t
and start Fabric Manager:
sudo systemctl enable nvidia-fabricmanager.service
sudo service nvidia-fabricmanager start
If you get:
sudo service nvidia-fabricmanager start
Failed to start nvidia-fabricmanager.service: Unit nvidia-fabricmanager.service not found.
install Fabric Manager and start it:
sudo apt-get install cuda-drivers-fabricmanager-<version>
sudo systemctl enable nvidia-fabricmanager.service
sudo service nvidia-fabricmanager start
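To confirm that the service actually came up, one option is to ask systemd for its state. The sketch below is an illustration rather than part of the original fix: fabricmanager_state is a hypothetical helper, and it assumes a systemd-based host where `systemctl is-active` is available.

```python
import shutil
import subprocess

SERVICE = "nvidia-fabricmanager.service"


def fabricmanager_state(service: str = SERVICE) -> str:
    """Return the state reported by `systemctl is-active`, e.g. 'active'.

    Returns 'systemctl-missing' when systemctl is not on PATH and
    'unknown' when no state could be read.
    """
    if shutil.which("systemctl") is None:
        return "systemctl-missing"
    try:
        proc = subprocess.run(
            ["systemctl", "is-active", service],
            capture_output=True,
            text=True,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return "unknown"
    # systemctl prints the state on stdout; an empty string means it could not tell.
    return proc.stdout.strip() or "unknown"


if __name__ == "__main__":
    print(f"{SERVICE}: {fabricmanager_state()}")
```

Anything other than "active" suggests the service still needs to be installed or started before retrying the sample.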
The program should now run successfully:
./bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...
 Device 0: NVIDIA A100-SXM4-40GB
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     26.1

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     25.6

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     1152.7

Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
CUDA now works from Python as well:
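When this check is repeated on several machines, it can be convenient to read the numbers out of the bandwidthTest output programmatically. The following is a rough sketch under assumptions: parse_bandwidth_test is a hypothetical helper, and it expects the usual layout in which each "... Bandwidth" header is followed by a row containing the transfer size and the GB/s figure.

```python
import re
from typing import Dict, Tuple

HEADER = re.compile(r"(Host to Device|Device to Host|Device to Device) Bandwidth")
ROW = re.compile(r"(\d+)\s+(\d+(?:\.\d+)?)")


def parse_bandwidth_test(output: str) -> Tuple[Dict[str, float], bool]:
    """Return ({direction: GB/s}, passed) parsed from bandwidthTest output."""
    bandwidths: Dict[str, float] = {}
    section = None
    for raw in output.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            # Remember which direction the next data row belongs to.
            section = header.group(1)
            continue
        row = ROW.fullmatch(line)
        if row and section:
            bandwidths[section] = float(row.group(2))
            section = None
    passed = "Result = PASS" in output
    return bandwidths, passed
```

A failed run, such as one that stops at error 802, yields an empty dictionary and passed set to False.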
ipython
Python 3.7.11 (default, Jul 27 2021, 14:32:16)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.26.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import torch
In [2]: torch.cuda.is_available()
Out[2]: True
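The same check can be put in a small script that also recognises the error from the beginning of this post. This is a hedged sketch: looks_like_error_802 and cuda_status are hypothetical helper names, the match is based only on the message text shown above, and depending on the PyTorch version the failure may surface as an exception or simply as is_available() returning False.

```python
def looks_like_error_802(message: str) -> bool:
    """Return True if the text carries the 'system not yet initialized' signature."""
    lowered = message.lower()
    return (
        "system not yet initialized" in lowered
        or "cudaerrorsystemnotready" in lowered
    )


def cuda_status() -> str:
    """Summarise whether PyTorch can use CUDA on this machine."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    try:
        if not torch.cuda.is_available():
            return "cuda unavailable"
        # Force real initialisation by placing a tiny tensor on the GPU.
        torch.zeros(1).cuda()
        return "ok"
    except RuntimeError as exc:
        if looks_like_error_802(str(exc)):
            return "error 802: check that nvidia-fabricmanager is running"
        return f"cuda error: {exc}"


if __name__ == "__main__":
    print(cuda_status())
```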
Additional reference
https://github.com/aws/aws-parallelcluster/wiki/NVIDIA-Fabric-Manager-stops-running-on-Ubuntu-18.04-and-Ubuntu-20.04