insightFace Pitfall Log (2020.11.18)

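Project context:

The insightface project really took some effort. I spent two or three days on it, falling into one pit after another, and only just managed to get it running on my company machine. For now I am writing down the pitfalls I hit over those days.
First, my hardware and software setup:
GPU model: GeForce RTX 2080 SUPER
Distribution: CentOS Linux 7 (Core)
Initial CUDA version: 9.0 (**the root of all the trouble**)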

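Problem description:

With the setup above, I followed the steps from the official source repository and ran the following command as a first test: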

python recognition/ArcFace/verification.py

It failed with the following error:

raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [10:10:32] src/operator/fusion/fused_op.cu:604: Check failed: compileResult == NVRTC_SUCCESS (5 vs. 0) : NVRTC Compilation failed. 
Please set environment variable MXNET_USE_FUSION to 0.
nvrtc: error: invalid value for --gpu-architecture (-arch)

Seeing this hint, I assumed that setting the environment variable would be enough,
so I entered:

export MXNET_USE_FUSION=0  
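
The same switch can also be set from inside Python, which ties it to the script instead of the shell session. This is only a minimal sketch: the helper name `disable_mxnet_fusion` is made up for illustration, and setting the variable before `import mxnet` is simply the safe order, not something the MXNet documentation is quoted on here.

```python
import os


def disable_mxnet_fusion(env=os.environ):
    """Set MXNET_USE_FUSION=0 in the given environment mapping.

    Hypothetical helper for illustration; `env` defaults to the
    real process environment.
    """
    env["MXNET_USE_FUSION"] = "0"
    return env


disable_mxnet_fusion()
# import mxnet as mx  # import only after the variable has been set
```

As the next log shows, this only silences the fused-operator compilation error; it does not fix the underlying CUDA mismatch.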

After running the script again, the output was as follows:

loading /home/user/Desktop/insightface-master/models/model-r100-ii/model 0
[10:16:41] src/nnvm/legacy_json_util.cc:209: Loading symbol saved by previous version v1.2.0. Attempting to upgrade...
[10:16:41] src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
[10:16:41] src/base.cc:51: Upgrade advisory: this mxnet has been built against cuda library version 9000, which is older than the oldest version tested by CI (10000).  
Set MXNET_CUDA_LIB_CHECKING=0 to quiet this warning.
……
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [10:17:27] src/operator/nn/./cudnn/cudnn_convolution-inl.h:155: Check failed: e == CUDNN_STATUS_SUCCESS (8 vs. 0) : 
cuDNN: CUDNN_STATUS_EXECUTION_FAILED
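
The upgrade advisory in this log reports CUDA versions as integers (9000 for the installed library, 10000 for the oldest tested one). A small sketch for decoding them, assuming MXNet prints the usual CUDA encoding of major * 1000 + minor * 10; `decode_cuda_version` is a hypothetical helper, not part of MXNet:

```python
def decode_cuda_version(code):
    """Split an integer CUDA version code into (major, minor).

    Assumes the encoding major * 1000 + minor * 10.
    """
    major = code // 1000
    minor = (code % 1000) // 10
    return major, minor


print(decode_cuda_version(9000))   # (9, 0)
print(decode_cuda_version(10000))  # (10, 0)
```

Read this way, the advisory is saying that this mxnet build was linked against CUDA 9.0, while nothing older than CUDA 10.0 is tested.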

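From this point on, I was locked in a fight with the CUDNN_STATUS_EXECUTION_FAILED error.
I searched everywhere for related material and found that very few people had run into cuDNN: CUDNN_STATUS_EXECUTION_FAILED.
I no longer remember where, but one article said it might be a Python version problem: 3.6.2 would not work and 3.6.6 was needed. Since I was using a conda virtual environment, I simply updated it to Python 3.6.6 and ran the verification again, only to get the following output instead: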

UserWarning: NumPy 1.14.5 or above is required for this version of SciPy (detected version 1.12.1)
  UserWarning)
RuntimeError: module compiled against API version 0xc but this version of numpy is 0xa
ImportError: numpy.core.multiarray failed to import

Updating numpy did not help either.
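
The SciPy warning above comes down to a version comparison: it wants NumPy 1.14.5 or newer and found 1.12.1. A rough sketch of that check, with the two version strings taken from the warning and hard-coded; `parse_version` and `meets_minimum` are illustrative names, and the parser only handles plain dotted numeric versions:

```python
def parse_version(text):
    """Turn a dotted numeric version such as '1.14.5' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(installed, required):
    """Return True if the installed version is at least the required one."""
    return parse_version(installed) >= parse_version(required)


# Values from the warning; in a real environment the installed
# version would come from numpy.__version__.
print(meets_minimum("1.12.1", "1.14.5"))  # False
```

Running such a check inside the activated conda environment helps confirm which NumPy the interpreter actually imports.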

The error was as follows:

testing verification..
[11:10:36] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... 
(set the environment variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
Traceback (most recent call last):
  File "recognition/ArcFace/verification.py", line 668, in <module>
    ver_list[i], model, args.batch_size, args.nfolds)
  File "recognition/ArcFace/verification.py", line 289, in test
    _embeddings = net_out[0].asnumpy()
  File "/home/user/anaconda3/envs/gpu-insightFace/lib/python3.6/site-packages/mxnet/ndarray/ndarray.py", line 2535, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/home/user/anaconda3/envs/gpu-insightFace/lib/python3.6/site-packages/mxnet/base.py", line 255, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [11:10:38] src/operator/nn/./cudnn/cudnn_convolution-inl.h:155: Check failed: e == CUDNN_STATUS_SUCCESS (8 vs. 0) : 
cuDNN: CUDNN_STATUS_EXECUTION_FAILED

(Pseudo) solution:

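So, as described above, I went all the way around and ended up back at the original error. Frustrating.
Later I read somewhere (again, I cannot recall where) that mxnet does not need cuDNN and that it should be removed, so I followed along and ran these commands to uninstall cuDNN: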

sudo rm -rf /usr/local/cuda/include/cudnn.h
sudo rm -rf /usr/local/cuda/lib64/libcudnn*

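That did not help either.
After that I found people saying that this error shows up when CUDA is not up to date, so I downloaded the upgrade package from the NVIDIA website. After installing it, verification.py, to my surprise, ran successfully: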

testing verification..
(12000, 512)
infer time 97.54976199999976
[lfw]XNorm: 22.132483
[lfw]Accuracy: 0.00000+-0.00000
[lfw]Accuracy-Flip: 0.99767+-0.00281
Max of [lfw] is 0.99767
testing verification..
(14000, 512)
infer time 114.06679099999972
[cfp_fp]XNorm: 21.340036
[cfp_fp]Accuracy: 0.00000+-0.00000
[cfp_fp]Accuracy-Flip: 0.98271+-0.00559
Max of [cfp_fp] is 0.98271
testing verification..
(12000, 512)
infer time 98.13358699999996
[agedb_30]XNorm: 22.654597
[agedb_30]Accuracy: 0.00000+-0.00000
[agedb_30]Accuracy-Flip: 0.98250+-0.00712
Max of [agedb_30] is 0.98250

Solution:

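Just when I thought the trouble was over, it turned out that although verification now succeeded, running the training script still produced the same error: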

mxnet.base.MXNetError: [17:16:47] src/operator/nn/./cudnn/cudnn_convolution-inl.h:155: Check failed: e == CUDNN_STATUS_SUCCESS (8 vs. 0) :
 cuDNN: CUDNN_STATUS_EXECUTION_FAILED

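At this point I was really going crazy: after another long detour I was back at square one.
While I was at a loss, I found someone saying that the CUDA version was wrong: 20-series NVIDIA graphics cards must use CUDA 10 or later to run this project.
I had already tried plenty of things, so one more attempt would not hurt. I reinstalled with CUDA 10.1, and that finally fixed the root problem.
The training process could finally run: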

[13:09:05] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (set the environment variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
INFO:root:Epoch[0] Batch [0-20] Speed: 129.81 samples/sec acc=0.000000 lossvalue=48.763742
INFO:root:Epoch[0] Batch [20-40] Speed: 129.81 samples/sec acc=0.000000 lossvalue=55.789442
INFO:root:Epoch[0] Batch [40-60] Speed: 129.81 samples/sec acc=0.000000 lossvalue=61.417017
INFO:root:Epoch[0] Batch [60-80] Speed: 129.81 samples/sec acc=0.000000 lossvalue=60.413354
INFO:root:Epoch[0] Batch [80-100] Speed: 129.73 samples/sec acc=0.000000 lossvalue=62.855455
INFO:root:Epoch[0] Batch [100-120] Speed: 129.77 samples/sec acc=0.000000 lossvalue=60.250255
…………………………
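
Looking back, the whole saga comes down to one compatibility rule: the CUDA toolkit has to know the GPU's architecture. The RTX 2080 SUPER is a Turing card (compute capability 7.5), and support for that architecture first appeared in CUDA 10.0, which would also explain the earlier nvrtc complaint about an invalid --gpu-architecture value under CUDA 9.0. A minimal sketch of the check, using a small hand-written table that is deliberately incomplete; `MIN_CUDA_FOR_ARCH` and `cuda_supports` are names invented for this example:

```python
# Minimum CUDA toolkit version that can target each compute capability.
# Hand-written and deliberately incomplete; extend it as needed.
MIN_CUDA_FOR_ARCH = {
    (6, 0): (8, 0),   # Pascal
    (6, 1): (8, 0),   # Pascal
    (7, 0): (9, 0),   # Volta
    (7, 5): (10, 0),  # Turing, e.g. the RTX 20 series
}


def cuda_supports(compute_capability, cuda_version):
    """Return True if the CUDA version can target the given architecture."""
    required = MIN_CUDA_FOR_ARCH.get(compute_capability)
    if required is None:
        raise KeyError(f"unknown compute capability: {compute_capability}")
    return cuda_version >= required


print(cuda_supports((7, 5), (9, 0)))   # False: my original setup
print(cuda_supports((7, 5), (10, 1)))  # True: the setup that finally worked
```

Checking this before installing anything would have saved most of the detours described above.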

For hands-on advice on working with this project, see the following blog posts:
https://blog.csdn.net/weixin_43013761/article/details/99646731
https://blog.csdn.net/Danbinbo/article/details/99738785
https://blog.csdn.net/weixin_38192254/article/details/104002253
https://blog.csdn.net/xiaotuzigaga/article/details/89224594

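I also recommend a tutorial implementation of this project on GitHub (had I seen it earlier, I probably would not have taken so many wrong turns): https://github.com/Danbinabo/insighrface (it also includes video tutorials). It dates from 2019 and the project source differs a little, but it was still very enlightening. Thanks to all the authors, and to the insightFace team for sharing the official source code.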
