Training LaneAF on an RTX 3090 server: notes on environment setup problems (DCNv2 build errors, CUDA version mismatch)

Xhlucky

Last modified: 2022-07-30 14:50:30


Tags: deep learning, pytorch

First published: 2022-07-21 15:14:03

Article link: https://blog.csdn.net/Xhlucky/article/details/125911651


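Project scenario:

The repository of the original paper (GitHub - sel118/LaneAF) requires:

torch=1.7.0

torchvision=0.8.1

cuda=10.1

torch=1.7.0 is used because of DCNv2: the code provided by the original author does not work well with PyTorch versions above 1.7.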

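I. Problem description

Building with this configuration produces a series of errors. The main ones are:

Error 1:

unable to execute 'usr/local/cuda-10.0/bin/nvcc': No such file or directory

This can be fixed by setting CUDA_HOME.

In the terminal, run: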

export CUDA_HOME=/usr/local/cuda-10.0
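
Before rebuilding, it can help to confirm that the directory named by CUDA_HOME really contains an nvcc binary. The following is a minimal sketch using only the Python standard library; the helper name `find_nvcc` is hypothetical and not part of LaneAF or DCNv2.

```python
import os
from typing import Optional


def find_nvcc(cuda_home: Optional[str] = None) -> Optional[str]:
    """Return the path to nvcc under cuda_home, or None if it is missing."""
    # Fall back to the CUDA_HOME environment variable when no path is given.
    cuda_home = cuda_home or os.environ.get("CUDA_HOME")
    if not cuda_home:
        return None
    nvcc = os.path.join(cuda_home, "bin", "nvcc")
    # The build needs an existing, executable nvcc at this location.
    if os.path.isfile(nvcc) and os.access(nvcc, os.X_OK):
        return nvcc
    return None


if __name__ == "__main__":
    path = find_nvcc()
    print(path if path else "nvcc not found; check CUDA_HOME")
```

If this prints the "not found" message, CUDA_HOME probably points at a toolkit directory that does not exist on the machine.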

Error 2:

nvcc fatal : Unsupported gpu architecture 'compute_86'

Solution:

Lower the target compute capability:

export TORCH_CUDA_ARCH_LIST="7.5"
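
The reasoning behind this value can be sketched as: pick the highest architecture that the installed nvcc knows about and that does not exceed the GPU's compute capability. The table below reflects the toolkit releases that, to my knowledge, first accepted each architecture; treat it as an assumption and check the nvcc documentation for your version. The name `pick_arch` is hypothetical.

```python
from typing import Tuple

# Assumed: first CUDA toolkit release whose nvcc accepts each architecture.
ARCH_MIN_CUDA = {
    (7, 0): (9, 0),
    (7, 5): (10, 0),
    (8, 0): (11, 0),
    (8, 6): (11, 1),
}


def pick_arch(device_cc: Tuple[int, int], cuda_version: Tuple[int, int]) -> str:
    """Highest listed architecture that is <= device_cc and known to this nvcc."""
    candidates = [
        arch for arch, min_cuda in ARCH_MIN_CUDA.items()
        if arch <= device_cc and min_cuda <= cuda_version
    ]
    if not candidates:
        raise ValueError("no listed architecture fits this device/toolkit pair")
    major, minor = max(candidates)
    return f"{major}.{minor}"


# A 3090 (compute capability 8.6) with a CUDA 10.1 toolkit:
print(pick_arch((8, 6), (10, 1)))  # 7.5
```

Lowering the list only gets the build past nvcc; whether the compiled kernels then run on the newer GPU is a separate question, as Section II shows.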

(Note: if any other errors appear, they can all be resolved with the method in Section III!)

II. With errors 1 and 2 fixed, training can start, but the 3090 only works with CUDA 11.0 or later, so an error is raised. Switch the CUDA and torch versions:

pip install torch==1.7.0+cu110 torchvision==0.8.1+cu110 torchaudio==0.7.0 -f https://download.pytorch.org/whl/torch_stable.html
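
After installing, one way to confirm which CUDA build was pulled in is to read the local version tag of the wheel (the `+cu110` part). Below is a minimal sketch that parses such a string; the helper name `split_torch_version` is hypothetical, and the rule "last digit is the minor version" is an assumption that holds for tags such as cu101 and cu110.

```python
from typing import Optional, Tuple


def split_torch_version(version: str) -> Tuple[str, Optional[str]]:
    """Split '1.7.0+cu110' into ('1.7.0', '11.0'); CUDA part is None if absent."""
    base, _, local = version.partition("+")
    digits = local[2:] if local.startswith("cu") else ""
    if len(digits) < 2 or not digits.isdigit():
        return base, None
    # Assumed convention: last digit is the minor version, the rest is the major.
    return base, f"{digits[:-1]}.{digits[-1]}"


print(split_torch_version("1.7.0+cu110"))  # ('1.7.0', '11.0')
```

In an environment where torch is installed, the same function could be applied to `torch.__version__`, and `torch.version.cuda` and `torch.cuda.get_device_capability(0)` report the CUDA build and the GPU's compute capability directly.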

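At this point an error is raised, roughly as follows (written from memory):

the GPU program failed to execute at /pytorch/aten/src/THC/THCBlas.cu:390

Neither of the two fixes in Section I resolves this error! CUDA 11.0 is too new for the original DCNv2 to work with, so the approach below was used instead.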

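III. The solution adopted:

Instead of the original DCNv2, use the DCN module from the mmcv library in place of the official DCNv2 library.

Details:

1. Install the mmcv library:

# Run on the command line: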
pip install mmcv-full -f https://download.openmmlab.com/mmcv/dist/{cu_version}/{torch_version}/index.html
# Replace {cu_version} with your CUDA version and {torch_version} with the PyTorch version you have installed.
# For example, with CUDA 11.0 and PyTorch 1.7.0:
pip install mmcv-full -f https://download.openmmlab.com/mmcv/dist/cu110/torch1.7.0/index.html
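
The placeholder substitution above can be sketched in a few lines of Python. The URL template is the one shown in the command; the helper name `mmcv_index_url` and its argument names are hypothetical.

```python
MMCV_INDEX = "https://download.openmmlab.com/mmcv/dist/{cu_version}/{torch_version}/index.html"


def mmcv_index_url(cuda: str, torch_ver: str) -> str:
    """Build the find-links URL, e.g. cuda='11.0', torch_ver='1.7.0'."""
    cu_version = "cu" + cuda.replace(".", "")  # '11.0' -> 'cu110'
    torch_version = "torch" + torch_ver        # '1.7.0' -> 'torch1.7.0'
    return MMCV_INDEX.format(cu_version=cu_version, torch_version=torch_version)


url = mmcv_index_url("11.0", "1.7.0")
print(f"pip install mmcv-full -f {url}")
```

For CUDA 11.0 and PyTorch 1.7.0 this reproduces the example command above.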

2. Change the places in the code that use DCNv2:

Change

from .DCNv2.dcn_v2 import DCN

to

from mmcv.ops import DeformConv2dPack as DCN

The arguments may need some adjustment:

Usage is the same as the official DCNv2, except that the parameter name deformable_groups becomes deform_groups, for example:
dconv2 = DCN(in_channel, out_channel, kernel_size=(3, 3), stride=(2, 2), padding=1, deform_groups=2)
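
If the code base has many call sites written for the original DCNv2, the rename can be applied in one place rather than by hand. The following is a minimal sketch of such a shim; the helper name `to_mmcv_kwargs` is hypothetical, and it covers only the one rename mentioned above.

```python
from typing import Any, Dict

# Keyword renames between the original DCNv2 call and the mmcv call.
RENAMED_KWARGS = {"deformable_groups": "deform_groups"}


def to_mmcv_kwargs(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of kwargs with DCNv2-style names mapped to mmcv-style names."""
    return {RENAMED_KWARGS.get(name, name): value for name, value in kwargs.items()}


old_style = dict(kernel_size=(3, 3), stride=(2, 2), padding=1, deformable_groups=2)
print(to_mmcv_kwargs(old_style))
```

With mmcv installed, the result could be passed on as `DCN(in_channel, out_channel, **to_mmcv_kwargs(old_style))`. Note that `mmcv.ops` also provides `ModulatedDeformConv2dPack`, the modulated (v2) variant; this post uses `DeformConv2dPack`, so whether the swap matters for a given checkpoint is something to verify.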

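Summary:

Using the original DCNv2 library requires not only torch 1.7.0 or below, but also a CUDA version that is not too new.

The 3090 needs CUDA 11.0 or later, which leads to the errors above, so the DCN module from the mmcv library is used in place of the official DCNv2 library.

Reference: "DCNv2 + PyTorch 1.7 and above: fixing build errors" - Zhihu (zhihu.com)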