YOLOv3 + ASFF (Learning Spatial Fusion for Single-Shot Object Detection): Notes on Training Pitfalls

搞视觉的张小凡

Article link: https://blog.csdn.net/comway_li/article/details/104814946

Paper: https://arxiv.org/pdf/1911.09516v2.pdf

GitHub: https://github.com/ruinmessi/ASFF

Blog analysis: https://blog.csdn.net/weixin_42096202/article/details/103293579

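Introduction:

The results figure in the ASFF paper compares ASFF against YOLOv3: YOLOv3+ASFF at an input size of 320 runs at about the same speed as plain YOLOv3 at 416, while its accuracy is substantially higher.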

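1. Building DCN

Note: the original author's implementation only supports PyTorch 1.0 and above; with anything older the build fails (I tried other versions and kept getting errors).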
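
Before running the build, it can help to confirm which PyTorch version is installed. The following is a minimal sketch and not part of the ASFF repository: the helper names `version_tuple` and `meets_minimum` are made up for illustration, and the 1.0 threshold is simply the one stated in the note above.

```python
import re


def version_tuple(version):
    """Extract (major, minor) from strings such as '1.1.0' or '1.2.0+cu92'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognised version string: %r" % version)
    return int(match.group(1)), int(match.group(2))


def meets_minimum(version, minimum=(1, 0)):
    """Return True if `version` is at least `minimum` (hypothetical helper)."""
    return version_tuple(version) >= minimum


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        ok = meets_minimum(torch.__version__)
        print(torch.__version__, "OK" if ok else "too old for the DCN build")
```

Comparing integer tuples rather than raw strings avoids mistakes such as treating "1.10" as older than "1.2".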

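./make.sh

If you are using Python 3, change python to python3 in the sh file.

Error: if the build stops with an error at this point, it sometimes needs sudo privileges.

Fix: add sudo in front of python3 in the script.

If the build then completes without errors, the installation succeeded.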

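2. Download and installation:

After downloading and unpacking the code from GitHub, install the required packages:

apex, numpy, opencv, tqdm, pyyaml, matplotlib, scikit-image, pycocotools, ...

All of them except apex can be installed with pip3; apex has to be built from source.

Building and installing apex:

git clone https://github.com/NVIDIA/apex.git   # download
cd apex
python3 setup.py install --cpp_ext --cuda_ext  # build and install

Note: the git clone can be very slow, so be patient.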

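Error: Cuda extensions are being compiled with a version of Cuda that does not...

Fix: this is a PyTorch version problem: the CUDA version PyTorch was compiled with does not match the installed CUDA toolkit. If you do not want to change versions, you can bypass the check at the cost of some functionality:

cd apex
sudo nano setup.py

In the editor, comment out the raise statement at line 52 and add pass in its place, then rerun the build to get past the error: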

if (bare_metal_major != torch_binary_major) or (bare_metal_minor != torch_binary_minor):
        pass
        # raise RuntimeError("Cuda extensions are being compiled with a version of Cuda that does " +
        #                    "not match the version used to compile Pytorch binaries.  " +
        #                    "Pytorch binaries were compiled with Cuda {}.\n".format(torch.version.cuda) +
        #                    "In some cases, a minor-version mismatch will not cause later errors:  " +
        #                    "https://github.com/NVIDIA/apex/pull/323#discussion_r287021798.  "
        #                    "You can try commenting out this check (at your own risk).")
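
To see what this check compares, the sketch below parses the text printed by `nvcc --version` and compares its major and minor numbers with a `torch.version.cuda` style string. It is an illustration under assumptions, not apex's actual code: the function names are hypothetical and the sample nvcc line only imitates the usual output format.

```python
import re


def nvcc_release(nvcc_output):
    """Return (major, minor) as strings from the text of `nvcc --version`."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no 'release X.Y' found in nvcc output")
    return match.group(1), match.group(2)


def cuda_versions_match(nvcc_output, torch_cuda):
    """Compare the toolkit version with the one PyTorch was built against."""
    torch_major, torch_minor = torch_cuda.split(".")[:2]
    return nvcc_release(nvcc_output) == (torch_major, torch_minor)


# Illustrative text in the shape nvcc prints; the numbers are made up.
sample = "Cuda compilation tools, release 10.1, V10.1.105"
print(cuda_versions_match(sample, "10.0"))  # False
print(cuda_versions_match(sample, "10.1"))  # True
```

In practice the first argument would come from running `nvcc --version` and the second from `torch.version.cuda`. A False result corresponds to the situation in which the check above raises its error.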

If the build then runs to completion, apex has been installed successfully.

3. Running the demo

Download the weights: https://pan.baidu.com/s/1d9hOQBj20HCy51qWbonxMQ (these are the ones I used)

Other weights are also available:

yolov3 mobilenetv2 (ours): weights, baiduYun, training tfboard log
yolov3 mobilenetv2 + asff: weights, baiduYun, training tfboard log
yolov3_baseline (ours): weights, baiduYun, training tfboard log
yolov3_asff: weights, baiduYun, training tfboard log
yolov3_asff* (320-608): weights, baiduYun
yolov3_asff* (480-800): weights, baiduYun

python3 demo.py -i /path/to/your/image \
--cfg config/yolov3_baseline.cfg -d COCO \
--checkpoint /path/to/your/weights --half --asff --rfb -s 608

Parameters:
-i, --img: path to the input image.
--cfg: config file.
-d: dataset to use, COCO or VOC.
-c, --checkpoint: pretrained or fully trained model weights.
--half: FP16 testing.
-s: evaluation image size, from 320 to 608 as in YOLOv3.
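
To make the flag list concrete, here is a small argparse sketch that accepts the same flags as the demo command. It is not the repository's actual parser: the `dest` names `dataset` and `size` are invented for this example, and demo.py may define its options differently.

```python
import argparse


def build_parser():
    # Hypothetical parser mirroring the flags described above.
    parser = argparse.ArgumentParser(description="demo flag sketch")
    parser.add_argument("-i", "--img", help="path to the input image")
    parser.add_argument("--cfg", help="config file")
    parser.add_argument("-d", dest="dataset", choices=["COCO", "VOC"])
    parser.add_argument("-c", "--checkpoint", help="weights to load")
    parser.add_argument("--half", action="store_true", help="FP16 testing")
    parser.add_argument("--asff", action="store_true")
    parser.add_argument("--rfb", action="store_true")
    parser.add_argument("-s", dest="size", type=int, help="image size")
    return parser


args = build_parser().parse_args([
    "-i", "demo.jpg", "--cfg", "config/yolov3_baseline.cfg", "-d", "COCO",
    "--checkpoint", "weights.pth", "--half", "--asff", "--rfb", "-s", "608",
])
print(args.dataset, args.size, args.half)  # COCO 608 True
```

Switches such as --half, --asff and --rfb take no value: writing them turns the feature on, and omitting them leaves it off.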

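4. Preparing the dataset

Two data formats are supported, VOC and COCO; prepare your data following the standard layout of either format.

Then change the dataset paths in the corresponding places in main.py to your own.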
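
A quick layout check before training can catch path mistakes early. The sketch below assumes the usual sub-directories of the two formats (Annotations, ImageSets/Main and JPEGImages for VOC; annotations, train2017 and val2017 for COCO). Those names and the helper `missing_dirs` are assumptions for illustration; what main.py actually reads may differ.

```python
import os

# Assumed standard sub-directories; verify against the paths used in main.py.
EXPECTED = {
    "VOC": ["Annotations", "ImageSets/Main", "JPEGImages"],
    "COCO": ["annotations", "train2017", "val2017"],
}


def missing_dirs(root, fmt):
    """List the expected sub-directories that do not exist under `root`."""
    return [
        d for d in EXPECTED[fmt]
        if not os.path.isdir(os.path.join(root, *d.split("/")))
    ]
```

For example, `missing_dirs("/data/VOCdevkit/VOC2007", "VOC")` (a hypothetical path) returns an empty list when all three folders are present.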

5. Training

Download the pretrained models:

darknet53 pretrained model: https://pan.baidu.com/s/19PaXl6p9vXHG2ZuGqtfLOg

MobileNetV2 pretrained model: https://pan.baidu.com/s/12eScI6YNBvkVX0286cMEZA

python3 -m torch.distributed.launch --nproc_per_node=10 \
--master_port=$((RANDOM + 10000)) main.py \
--cfg config/yolov3_baseline.cfg -d COCO \
--tfboard --distributed --ngpu 10 \
--checkpoint weights/darknet53_feature_mx.pth \
--start_epoch 0 --half --log_dir log/COCO -s 608
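
The value passed to --master_port only has to be a port that nothing else is using. As an alternative to shell arithmetic, one option is to ask the operating system for a free port. This is a generic sketch rather than something the ASFF repository provides, and `find_free_port` is a name chosen here.

```python
import socket


def find_free_port():
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


print(find_free_port())
```

The port is released as soon as the function returns, so another process could in principle claim it before the launcher does; for a single-machine training run this is rarely a problem.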

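To train without distributed mode, drop the launcher and the --distributed flag (note that the repository authors only tested the code with distributed training):

python3 main.py \
--cfg config/yolov3_baseline.cfg -d COCO \
--tfboard --ngpu 10 \
--checkpoint weights/darknet53_feature_mx.pth \
--start_epoch 0 --half --log_dir log/COCO -s 608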

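Parameters:

--cfg: config file.
--tfboard: use TensorBoard; passing the flag enables it.
--distributed: use distributed training (the repository authors only tested the code with distributed training).
-d: dataset format, COCO or VOC.
--ngpu: number of GPUs.
-c, --checkpoint: pretrained weights.
--start_epoch: epoch from which to resume training.
--half: FP16 training.
--log_dir: directory where the TensorBoard files are written.
-s: evaluation image size, from 320 to 608 as in YOLOv3.

To train YOLOv3 with ASFF or ASFF*, use the following command: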

python3 -m torch.distributed.launch --nproc_per_node=10 --master_port=$((RANDOM + 10000)) main.py \
--cfg config/yolov3_baseline.cfg -d COCO --tfboard --distributed --ngpu 10 \
--checkpoint weights/darknet53_feature_mx.pth --start_epoch 0 --half --asff --rfb --dropblock \
--log_dir log/COCO_ASFF -s 608

Parameters:

--vis: Visualization of ASFF.
--testset: evaluate on COCO test-dev.
-s: evaluation image size.

During training the following error appeared: RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [2, 4, 76, 76, 25]], which is output 0 of CloneBackward

Fix: in my tests, torch 1.2 raises this error while 1.1 trains normally, so the only option was to switch to 1.1.
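
For readers unfamiliar with this class of error, the generic PyTorch sketch below shows how an in-place edit of a tensor that autograd still needs triggers the same RuntimeError, and how an out-of-place operation avoids it. It does not reproduce the failing line in the ASFF code, and the function names are chosen only for this illustration.

```python
import torch


def inplace_version():
    x = torch.ones(3, requires_grad=True)
    y = x.exp()          # exp() saves its output for the backward pass
    y += 1               # in-place edit changes that saved output
    y.sum().backward()   # autograd detects the change and raises RuntimeError


def out_of_place_version():
    x = torch.ones(3, requires_grad=True)
    y = x.exp()
    y = y + 1            # new tensor; the saved output stays untouched
    y.sum().backward()
    return x.grad        # gradient of sum(exp(x) + 1) is exp(x)
```

Calling `inplace_version()` fails with the "modified by an inplace operation" message, while `out_of_place_version()` returns the expected gradient.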