Problem
PyTorch cannot use the GPU and reports the following error:
Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False.
Analysis
First check whether CUDA itself is working:
# Go to the CUDA Samples directory ("~/NVIDIA_CUDA-11.0_Samples" is used here as an example)
cd ~/NVIDIA_CUDA-11.0_Samples/1_Utilities/deviceQuery
make
./deviceQuery
The output shows that CUDA is not working:
CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 802
-> system not yet initialized
Result = FAIL
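Error 802 corresponds to cudaErrorSystemNotReady in the CUDA runtime. Some searching showed the cause: NVIDIA Fabric Manager had not been started after the server rebooted.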
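As a rough sketch of the check above, the deviceQuery output can be saved to a file and classified by a small script. The function name classify_device_query and its three labels are made up for illustration and are not part of any NVIDIA tool. It matches only the strings shown in the output above, plus the "Result = PASS" line that deviceQuery prints on success.

```shell
# Hypothetical helper: classify saved deviceQuery output.
# $1 is the path of a file holding the captured output.
classify_device_query() {
    if grep -q 'Result = PASS' "$1"; then
        echo "ok"
    elif grep -q 'returned 802' "$1"; then
        # 802 is the "system not yet initialized" case seen above
        echo "not-initialized"
    else
        echo "other-failure"
    fi
}

# Usage (assumes ./deviceQuery has been built as shown earlier):
#   ./deviceQuery > /tmp/deviceQuery.log 2>&1
#   classify_device_query /tmp/deviceQuery.log
```

A result of "not-initialized" only says that the 802 error was seen. Whether Fabric Manager is the cause still has to be confirmed on the machine in question.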
Solution
Enable NVIDIA Fabric Manager at boot and start the service:
sudo systemctl enable nvidia-fabricmanager.service
sudo systemctl start nvidia-fabricmanager.service
CUDA then runs normally.
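To confirm that the fix will survive the next reboot, the service state can be queried. The following is a minimal sketch: the function name fabricmanager_state is hypothetical, and it relies only on the standard "systemctl is-enabled" and "systemctl is-active" subcommands with the unit name used above.

```shell
# Hypothetical helper: report whether Fabric Manager is enabled and running.
# Succeeds (exit status 0) only if the unit is both enabled and active.
fabricmanager_state() {
    unit=nvidia-fabricmanager.service
    # systemctl exits non-zero for disabled/inactive units, so tolerate that
    enabled=$(systemctl is-enabled "$unit" 2>/dev/null) || true
    active=$(systemctl is-active "$unit" 2>/dev/null) || true
    echo "enabled=${enabled:-unknown} active=${active:-unknown}"
    [ "$enabled" = "enabled" ] && [ "$active" = "active" ]
}

# Usage:
#   fabricmanager_state || echo "Fabric Manager needs attention"
```

A non-zero exit status means the service is either not enabled at boot or not currently running. After that, rerunning ./deviceQuery is still the direct test that CUDA works.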
References
cuda runtime error (802) : system not yet initialized …/THCGeneral.cpp:50