[YOLOF] You Only Look One-level Feature (CVPR 2021)

image-20210320184404824

Code: https://github.com/megvii-model/YOLOF

1. Motivation

FPN brings two benefits: (1) multi-scale feature fusion; (2) divide-and-conquer.

The most popular way to build feature pyramids is the feature pyramid networks (FPN) [22], which mainly brings two benefits:

(1) multi-scale feature fusion: fusing multiple low-resolution and high-resolution feature inputs to obtain better representations;

(2) divide-and-conquer: detecting objects on different levels regarding objects’ scales.

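A common belief attributes FPN's success to multi-scale feature fusion, while overlooking its divide-and-conquer function.

Figure 1 compares four kinds of encoders: MiMo, SiMo, MiSo, and SiSo. The SiMo encoder, which takes only C5 as its single input, trails the MiMo encoder (inputs C3, C4, C5) by less than 1 mAP, whereas the MiSo encoder, which outputs only P5, falls far behind the MiMo encoder with its P3~P7 outputs. The authors read this as evidence of two facts: (1) the C5 feature is important for detecting objects of different sizes, which is what makes SiMo comparable; (2) multi-scale fusion is far less important than divide-and-conquer.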

These phenomenons suggest two facts:

(1) the C5 feature carries sufficient context for detecting objects on various scales, which enables the SiMo encoder to achieve comparable results; (2) the multi-scale feature fusion benefit is far away less critical than the divide-and-conquer benefit, thus multi-scale feature fusion might not be the most significant benefit of FPN.

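Going one step deeper, the authors argue that divide-and-conquer is tied to the optimization problem in object detection: it splits the complex detection problem into several sub-problems by object scale, which facilitates the optimization process.

The essential factor behind FPN's success is therefore its solution to the optimization problem in object detection.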

The above analysis suggests that the essential factor for the success of FPN is its solution to the optimization problem in object detection.

image-20210320185650554

2. Contribution

This paper finds that, in dense object detection, the most important benefit of FPN comes from its divide-and-conquer solution to the optimization problem, not from multi-scale feature fusion.

We show that the most significant benefits of FPN is its divide-and-conquer solution to the optimization problem in dense object detection rather than the multi-scale feature fusion.

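The paper proposes YOLOF, a simple and efficient baseline that does not use FPN and relies only on the C5 feature map (downsampled 32x). It introduces two key components, the Dilated Encoder and Uniform Matching, to narrow the performance gap between the SiSo and MiMo encoders.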

We present YOLOF, which is a simple and efficient baseline without using FPN. In YOLOF, we propose two key components, Dilated Encoder and Uniform Matching, bridging the performance gap between the SiSo encoder and the MiMo encoder.

Experiments show comparable accuracy at a faster speed.

Extensive experiments on COCO benchmark indicates the importance of each component. Moreover, we conduct comparisons with RetinaNet [23], DETR [4] and YOLOv4 [1]. We can achieve comparable results with a faster speed on GPUs.

3. Cost Analysis of MiMo Encoders

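As shown in Figure 2, the detection pipeline is split into three parts: backbone, encoder, and decoder. The encoder receives input from the backbone and distributes the representations for detection, while the decoder handles the task-specific classification and regression.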

image-20210320185720249

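Figure 3 reports the cost of the backbone, encoder, and decoder, together with speed (GFLOPs and FPS), for the MiMo and SiSo encoders with channel widths of 512 and 256. The MiMo encoder brings an enormous memory burden (134G vs. 6G), and the MiMo detector also runs much slower than the SiSo one (13 FPS vs. 34 FPS).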

Given the above drawbacks of the MiMo encoder, we aim to find an alternative way to solve the optimization problem while keeping the detector simple, accurate, and fast simultaneously.
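One rough way to see where the multi-level cost comes from is to count the spatial positions that the encoder and decoder must process. The sketch below is an illustration only: it assumes an 800x1280 input and the usual P3~P7 strides of 8 to 128 (neither is stated in these notes), and it counts positions rather than FLOPs, so it does not reproduce the 134G vs. 6G figures.

```python
import math

def num_locations(height, width, stride):
    """Number of spatial positions on a feature map with the given stride."""
    return math.ceil(height / stride) * math.ceil(width / stride)

def count_locations(height, width, strides):
    """Map each stride to the number of positions on that level."""
    return {s: num_locations(height, width, s) for s in strides}

# Assumed setup: 800x1280 input, P3~P7 strides for the multi-level case,
# and a single stride-32 map (C5) for the single-level case.
multi_level = count_locations(800, 1280, [8, 16, 32, 64, 128])
single_level = count_locations(800, 1280, [32])

total_multi = sum(multi_level.values())
total_single = sum(single_level.values())

print(multi_level)
print(total_multi, total_single, round(total_multi / total_single, 1))
```

Under these assumptions the stride-8 level alone holds most of the positions, which is why dropping the high-resolution levels saves so much computation in the heads.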

image-20210320185735730

4. Method

4.1 Limited Scale Range

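Using the plain SiSo encoder alone gives unsatisfactory results. As Figure 4 illustrates, the reason is a mismatch between the range of object scales and the scale range that a single feature map covers. Panel (a) shows that the feature's receptive field covers only a limited scale range; panel (b) shows that enlarging the receptive field lets the feature cover large objects but miss small ones. The authors therefore propose the idea in panel (c): build features with multiple receptive fields so that all object scales are covered.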

image-20210321100723530

4.2 Dilated Encoder

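As shown in Figure 5, the authors propose the Dilated Encoder. The input first passes through a 1x1 and a 3x3 convolution, each followed by a BN layer; a stack of residual blocks with different dilation rates then produces features with multiple receptive fields.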
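To make the "multiple receptive fields" idea concrete, the following sketch computes how far a stack of stride-1 dilated 3x3 convolutions reaches on the C5 grid. The dilation rates 2, 4, 6, 8 are an assumption for illustration (these notes do not list them), and the helper names are hypothetical.

```python
def dilated_kernel_extent(kernel_size, dilation):
    """Spatial extent (in cells) covered by one dilated convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

def stacked_receptive_fields(dilations, kernel_size=3):
    """Cumulative receptive field (in cells) after each stride-1 dilated block."""
    rf = 1
    fields = []
    for d in dilations:
        # Each stride-1 dilated conv widens the receptive field by (k - 1) * d cells.
        rf += (kernel_size - 1) * d
        fields.append(rf)
    return fields

dilations = [2, 4, 6, 8]  # assumed rates, one per residual block
extents = [dilated_kernel_extent(3, d) for d in dilations]
fields = stacked_receptive_fields(dilations)

print(extents)  # [5, 9, 13, 17]
print(fields)   # [5, 13, 25, 41]
```

Because every block has a residual shortcut, the output mixes features from all of these intermediate receptive fields rather than keeping only the largest one, which is the situation sketched in panel (c) of Figure 4.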

image-20210321100707375

4.3 Imbalance Problem on Positive Anchors

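During training the detector attends more to large objects and neglects small ones, which causes an imbalance problem. Figure 6 shows how the positive anchors generated on a single-level feature are distributed under different matching methods.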

image-20210321111335666

4.4 Uniform Matching

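The paper adopts a Uniform Matching strategy: for each ground-truth box, the k nearest anchors are taken as positive anchors. Every ground-truth box is thus matched with the same number of positives regardless of its size.

Sparse anchors: anchors of several sizes are all placed on the single level, with five anchors of sizes {32, 64, 128, 256, 512} built at every location of the C5 feature map.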

we propose an Uniform Matching strategy: adopting the k nearest anchor as positive anchors for each ground-truth box which makes sure that all ground-truth boxes can be matched with the same number of positive anchors uniformly regardless of their sizes.
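The matching rule can be sketched in a few lines of plain Python. This is a simplified illustration, not the released implementation: the L1 distance over (cx, cy, w, h), the value k=4, and the function names `make_anchors` and `uniform_match` are all assumptions made for the example.

```python
def make_anchors(feat_h, feat_w, stride=32, sizes=(32, 64, 128, 256, 512)):
    """Square anchors (cx, cy, w, h) at every location of a single-level map."""
    anchors = []
    for i in range(feat_h):
        for j in range(feat_w):
            cx, cy = (j + 0.5) * stride, (i + 0.5) * stride
            for s in sizes:
                anchors.append((cx, cy, float(s), float(s)))
    return anchors

def uniform_match(anchors, gt_boxes, k=4):
    """For each ground-truth box, return the indices of its k nearest anchors."""
    matches = []
    for gt in gt_boxes:
        # Assumed distance: L1 between the (cx, cy, w, h) vectors.
        dists = [sum(abs(a - g) for a, g in zip(anchor, gt)) for anchor in anchors]
        order = sorted(range(len(anchors)), key=dists.__getitem__)
        matches.append(order[:k])
    return matches

# A 4x4 stride-32 feature map (128x128 input) with one small and one large box.
anchors = make_anchors(4, 4)
gt_boxes = [(40.0, 40.0, 30.0, 30.0), (64.0, 64.0, 400.0, 400.0)]
matches = uniform_match(anchors, gt_boxes, k=4)

for gt, idxs in zip(gt_boxes, matches):
    print(gt, [anchors[i] for i in idxs])
```

Both boxes receive exactly k positives, so the small box is not starved of positive anchors the way it can be when positives are selected by an IoU threshold.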

5. YOLOF

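The decoder borrows the FFN design from DETR and the objectness branch from AutoAssign, and the number of convolution layers in the classification head is reduced from 4 to 2.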

The final classification scores for all predictions are generated by multiplying the classification output with the corresponding implicit objectness.
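A minimal sketch of that score combination follows, assuming both heads emit raw logits that are squashed with a sigmoid before multiplying; the function names are hypothetical and the actual code may combine the two outputs differently for numerical reasons.

```python
import math

def sigmoid(x):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def final_scores(cls_logits, objectness_logit):
    """Multiply per-class probabilities by the anchor's objectness probability."""
    obj = sigmoid(objectness_logit)
    return [sigmoid(c) * obj for c in cls_logits]

# One anchor with three class logits and a neutral objectness logit.
scores = final_scores([2.0, -1.0, 0.0], objectness_logit=0.0)
print(scores)
```

With this formulation a low objectness suppresses every class score for that anchor at once, so the final score can never exceed the objectness probability.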

image-20210321114814970

6. Experiments

6.1 Comparison with RetinaNet

image-20210321113446217

6.2 Comparison with DETR

image-20210321113516091

6.3 Comparison with YOLOv4

image-20210321113522392

6.4 Ablation Experiments

6.4.1 Effect of Dilated Encoder and Uniform Matching

image-20210321113556132

6.4.2 Number of ResBlocks

image-20210321113610964

6.4.3 Different dilations

image-20210321113616371

6.4.4 Add shortcut or not

image-20210321113621448

6.4.5 Number of positive

image-20210321113627347

6.4.6 Uniform matching vs. other matchings

image-20210321113632386

6.5 Error Analysis

image-20210321115042714