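The host runs Ubuntu 20.04 with a GTX 1060, and the NVIDIA driver version is 510.85.02.
The environment being installed is TensorRT 8.4.
After installation, any call into the CUDA environment fails with: Error 804: forward compatibility was attempted on non supported HW.
Checking the cause
Before using Docker (version >= 19.03) on a Linux host, make sure nvidia-container-runtime and nvidia-container-toolkit are installed: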
sudo apt-get install nvidia-container-runtime nvidia-container-toolkit
Also make sure nvidia-container-runtime-hook is on a path listed in the PATH environment variable:
:~$ which nvidia-container-runtime-hook
/usr/bin/nvidia-container-runtime-hook
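The PATH check above can also be scripted so that it is easy to repeat after a reinstall. This is a minimal sketch assuming a POSIX shell; check_on_path is a hypothetical helper name introduced here, not part of any NVIDIA tooling.

```shell
#!/bin/sh
# check_on_path is a hypothetical helper (an assumption of this sketch).
# It reports whether a command can be resolved through the current PATH.
check_on_path() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
        return 0
    fi
    echo "missing: $1"
    return 1
}

# Report on the hook binary named in the text; "|| true" keeps the script
# going even when the binary is absent.
check_on_path nvidia-container-runtime-hook || true
```

A "missing" line here suggests the packages are not installed or that their install directory is not on PATH.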
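A first look at CUDA
Since the error occurs as soon as CUDA is initialized, Gemfield might as well set PyTorch aside and write a minimal C program, in the current Docker environment, that initializes the CUDA device directly and shows whether it fails.
Code: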
#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    int device = 0;
    int gpuDeviceCount = 0;
    struct cudaDeviceProp properties;

    // cudaGetDeviceCount initializes the CUDA runtime, so a setup problem surfaces here.
    cudaError_t cudaResultCode = cudaGetDeviceCount(&gpuDeviceCount);
    if (cudaResultCode == cudaSuccess) {
        cudaGetDeviceProperties(&properties, device);
        printf("%d GPU CUDA device(s)(%d)\n", gpuDeviceCount, properties.major);
        printf("\t Product Name: %s\n", properties.name);
        // totalGlobalMem is a size_t in bytes; convert to MB.
        printf("\t TotalGlobalMem: %zu MB\n", properties.totalGlobalMem / (1024 * 1024));
        printf("\t Multiprocessor Count: %d\n", properties.multiProcessorCount);
        // concurrentKernels is a flag: 1 if the device can run kernels concurrently.
        printf("\t Concurrent Kernels: %d\n", properties.concurrentKernels);
        return 0;
    }
    // Initialization failed: print the numeric CUDA error code.
    printf("\t gemfield error: %d\n", (int)cudaResultCode);
    return 1;
}
Compile and run:
g++ -I/usr/local/cuda-11.2/targets/x86_64-linux/include/ gemfield.cpp -o gemfield -L/usr/local/cuda-11.2/targets/x86_64-linux/lib/ -lcudart
~# ./gemfield
gemfield error: 804
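The message "Error 804: forward compatibility was attempted on non supported HW" means that the hardware does not support forward compatibility. In the CUDA runtime, code 804 is cudaErrorCompatNotSupportedOnDevice. In other words, the CUDA libraries in the image do not match the driver installed on the host, so the runtime attempts the forward-compatibility path, which this GPU does not support.
Solution
The fix is simple: update the host's NVIDIA driver to the same version as the one in the image, then reinstall nvidia-container-runtime and nvidia-container-toolkit.
For installing the GPU driver, see: 环境搭建01——Ubuntu如何查看显卡信息及安装NVDIA显卡驱动 (Environment Setup 01: how to check GPU information and install the NVIDIA driver on Ubuntu), 命名无能's blog on CSDN.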
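To confirm that the host and image driver versions really match after the upgrade, the comparison can be sketched as follows. This assumes the image carries a user-space driver library named libcuda.so.<version>; the library path used below is only an example, since the real location varies by image. The helper names driver_version_from_lib and versions_match are invented for this sketch. The nvidia-smi query is the standard way to read the host driver version.

```shell
#!/bin/sh
# Hypothetical helpers for comparing host and image driver versions.

# Strip the "libcuda.so." prefix from a library file name,
# e.g. libcuda.so.510.85.02 -> 510.85.02
driver_version_from_lib() {
    name=$(basename "$1")
    printf '%s\n' "${name#libcuda.so.}"
}

# Succeed only when both version strings are identical.
versions_match() {
    [ "$1" = "$2" ]
}

# Host driver version; falls back to "unknown" when nvidia-smi is unavailable.
host_version=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -n 1) || true
host_version=${host_version:-unknown}

# Example path only (an assumption): substitute the file found in your image.
image_version=$(driver_version_from_lib /usr/lib/x86_64-linux-gnu/libcuda.so.510.85.02)

if versions_match "$host_version" "$image_version"; then
    echo "driver versions match: $host_version"
else
    echo "driver mismatch: host=$host_version image=$image_version"
fi
```

A plain string comparison is used on purpose: the fix described here asks for the same driver version on both sides, not merely a newer one.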
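References
PyTorch的CUDA错误:Error 804: forward compatibility was attempted on non supported HW (PyTorch CUDA error: Error 804), Zhihu
If this post infringes any rights, please contact me and it will be removed.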