[YOLOF] You Only Look One-level Feature (CVPR 2021)

Ah丶Weii

Article link: https://blog.csdn.net/weixin_43823854/article/details/115050963

Code: https://github.com/megvii-model/YOLOF

1. Motivation

FPN brings two benefits: (1) multi-scale feature fusion; (2) divide-and-conquer.

The most popular way to build feature pyramids is the feature pyramid networks (FPN) [22], which mainly brings two benefits:

(1) multi-scale feature fusion: fusing multiple low-resolution and high-resolution feature inputs to obtain better representations;

(2) divide-and-conquer: detecting objects on different levels regarding objects’ scales.

A common belief holds that the success of FPN relies on the fusion of multi-scale features, which overlooks the divide-and-conquer function of FPN.

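Figure 1 compares four kinds of encoders: MiMo, SiMo, MiSo, and SiSo. SiMo, which takes only C5 as its single input, trails MiMo (which takes C3, C4, and C5 as input) by less than 1 mAP. MiSo, which outputs only P5, is far worse than MiMo, which outputs P3~P7. The authors take this as evidence of two facts: (1) the C5 feature carries enough context to detect objects of different scales, which is why SiMo is comparable; (2) multi-scale feature fusion matters far less than divide-and-conquer.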

These phenomenons suggest two facts:

(1) the C5 feature carries sufficient context for detecting objects on various scales, which enables the SiMo encoder to achieve comparable results; (2) the multi-scale feature fusion benefit is far away less critical than the divide-and-conquer benefit, thus multi-scale feature fusion might not be the most significant benefit of FPN.

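Going one step deeper, the authors argue that divide-and-conquer is tied to the optimization problem in object detection: it splits the complex detection problem into several sub-problems according to object scale, facilitating the optimization process.

The success of FPN therefore lies in its solution to the optimization problem in object detection.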

The above analysis suggests that the essential factor for the success of FPN is its solution to the optimization problem in object detection.

2. Contribution

The paper finds that, in dense object detection, the most important benefit of FPN comes from its divide-and-conquer solution to the optimization problem rather than from multi-scale feature fusion.

We show that the most significant benefits of FPN is its divide-and-conquer solution to the optimization problem in dense object detection rather than the multi-scale feature fusion.

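The paper proposes YOLOF, a simple and efficient baseline that does not use FPN and relies only on the C5 feature map (32x downsampled). It introduces two key components, Dilated Encoder and Uniform Matching, to close the performance gap between the SiSo and MiMo encoders.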

We present YOLOF, which is a simple and efficient baseline without using FPN. In YOLOF, we propose two key components, Dilated Encoder and Uniform Matching, bridging the performance gap between the SiSo encoder and the MiMo encoder.

The experiments show comparable accuracy at a faster speed.

Extensive experiments on COCO benchmark indicates the importance of each component. Moreover, we conduct comparisons with RetinaNet [23], DETR [4] and YOLOv4 [1]. We can achieve comparable results with a faster speed on GPUs.

3. Cost Analysis of MiMo Encoders

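As shown in Figure 2, the paper splits the detection pipeline into three parts: backbone, encoder, and decoder. The encoder receives input from the backbone and distributes the representations for detection, while the decoder handles the task-specific classification and regression.

Figure 3 breaks down the cost of the backbone, encoder, and decoder, together with speed (GFLOPs and FPS), for MiMo and SiSo detectors at channel widths of 512 and 256. The MiMo encoder places a huge memory burden on the encoder and decoder (134G vs. 6G), and the MiMo detector runs much slower than the SiSo one (13 FPS vs. 34 FPS).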

Given the above drawbacks of the MiMo encoder, we aim to find an alternative way to solve the optimization problem while keeping the detector simple, accurate, and fast simultaneously.

4. Method

4.1 Limited Scale Range

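A plain SiSo encoder does not work well. As Figure 4 shows, the reason is that the range of object scales does not match the scale range covered by a single feature map. Panel (a) shows that the feature's receptive field covers only a limited part of the scale range. Panel (b) shows that enlarging the covered range lets the feature handle large objects but miss small ones. The authors therefore propose the idea in panel (c): build a feature map with multiple receptive fields so that all scales are covered.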

4.2 Dilated Encoder

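As shown in Figure 5, the authors propose the Dilated Encoder. The C5 feature first passes through a 1x1 and a 3x3 convolution, each followed by a BN layer; stacked residual blocks with different dilation rates then produce a feature map that covers multiple receptive fields.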
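To make the "multiple receptive fields" idea concrete, here is a minimal sketch that tracks how the receptive field (measured in C5 cells) grows as dilated residual blocks are stacked. The block layout (1x1, dilated 3x3, 1x1) and the dilation rates 2, 4, 6, 8 are illustrative assumptions, not values stated in this post. Because each block has a skip connection, the output mixes every intermediate receptive field, not just the largest one.

```python
def receptive_field(layers):
    """Receptive field of a conv stack; layers are (kernel, dilation, stride) tuples."""
    rf, jump = 1, 1
    for kernel, dilation, stride in layers:
        effective_kernel = dilation * (kernel - 1) + 1
        rf += (effective_kernel - 1) * jump
        jump *= stride
    return rf


# Projector: 1x1 conv followed by 3x3 conv (BN does not change the receptive field).
PROJECTOR = [(1, 1, 1), (3, 1, 1)]


def dilated_block(dilation):
    # Assumed bottleneck layout: 1x1 reduce, dilated 3x3, 1x1 expand.
    return [(1, 1, 1), (3, dilation, 1), (1, 1, 1)]


def cumulative_fields(dilations):
    """Receptive field after the projector and after each stacked block."""
    layers = list(PROJECTOR)
    fields = [receptive_field(layers)]
    for dilation in dilations:
        layers += dilated_block(dilation)
        fields.append(receptive_field(layers))
    return fields


# Assumed dilation rates, for illustration only.
fields = cumulative_fields([2, 4, 6, 8])
print(fields)  # [3, 7, 15, 27, 43]
```

Each entry is a receptive field that reaches the output through some combination of skip connections, which is how a single-level feature can cover several object scales at once.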

4.3 Imbalance Problem on Positive Anchors

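With anchors placed on a single feature map, the detector pays more attention to large objects during training and neglects small ones, which creates an imbalance problem. Figure 6 shows how the positive anchors generated on the single feature are distributed under different matching methods.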

4.4 Uniform Matching

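The paper adopts a Uniform Matching strategy: for each ground-truth box, the k nearest anchors are taken as positive anchors. Every ground-truth box is therefore matched with the same number of positive samples, regardless of its size.

Sparse anchors: anchors of several sizes are all placed on the single level. At each location of the C5 feature map, five anchors with sizes {32, 64, 128, 256, 512} are built.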
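A rough sketch of this anchor layout in plain Python follows. The stride of 32 comes from the 32x downsampling of C5; square anchors and the choice to center each anchor in the middle of its cell are assumptions made for illustration, and `make_anchors` is a hypothetical helper name.

```python
def make_anchors(feat_h, feat_w, stride=32, sizes=(32, 64, 128, 256, 512)):
    """Return (x1, y1, x2, y2) anchors for every cell of a feat_h x feat_w map."""
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            # Assumption: the anchor center sits in the middle of the cell.
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for size in sizes:
                half = size / 2
                anchors.append((cx - half, cy - half, cx + half, cy + half))
    return anchors


# A 2 x 3 feature map yields 2 * 3 * 5 = 30 anchors.
anchors = make_anchors(2, 3)
print(len(anchors))
```

With only one level and five sizes per location, the anchor set is much sparser than a multi-level pyramid would produce, which is what makes the matching strategy matter.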

we propose an Uniform Matching strategy: adopting the k nearest anchor as positive anchors for each ground-truth box which makes sure that all ground-truth boxes can be matched with the same number of positive anchors uniformly regardless of their sizes.
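The matching rule can be sketched in a few lines of plain Python. The distance used here (L1 distance between center/width/height vectors), the value k = 4, and the helper names are assumptions for illustration; the post only states that the k nearest anchors become positives.

```python
def to_cxcywh(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)


def l1_distance(box_a, box_b):
    # Assumed distance: L1 over (cx, cy, w, h).
    return sum(abs(a - b) for a, b in zip(to_cxcywh(box_a), to_cxcywh(box_b)))


def uniform_match(anchors, gt_boxes, k=4):
    """Map each ground-truth index to the indices of its k nearest anchors."""
    matches = {}
    for gt_index, gt in enumerate(gt_boxes):
        ranked = sorted(range(len(anchors)),
                        key=lambda i: l1_distance(anchors[i], gt))
        matches[gt_index] = ranked[:k]
    return matches


# Toy anchors: two sizes (32 and 128) on a 4 x 4 grid with stride 32.
anchors = [
    (cx - s / 2, cy - s / 2, cx + s / 2, cy + s / 2)
    for cy in (16, 48, 80, 112)
    for cx in (16, 48, 80, 112)
    for s in (32, 128)
]
gt_boxes = [(10, 10, 30, 30), (0, 0, 120, 120)]  # one small, one large object
matches = uniform_match(anchors, gt_boxes, k=4)
print({gt: len(pos) for gt, pos in matches.items()})
```

Both the small and the large box end up with exactly k positive anchors, which is the balance property described above.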

5. YOLOF

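The decoder borrows the FFN design from DETR and the objectness branch from AutoAssign, and the number of convolution layers in the classification head is reduced from 4 to 2.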

The final classification scores for all predictions are generated by multiplying the classification output with the corresponding implicit objectness.
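The score fusion can be illustrated with a small sketch. It assumes that both the classification output and the implicit objectness are logits that go through a sigmoid before being multiplied; the function names are hypothetical. The second function is an algebraically equivalent form that stays in logit space.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def final_score(cls_logit, obj_logit):
    # Assumption: both outputs are logits; the final score is the product
    # of the two probabilities.
    return sigmoid(cls_logit) * sigmoid(obj_logit)


def final_logit(cls_logit, obj_logit):
    # Same quantity expressed as one logit:
    # sigmoid(final_logit(a, b)) == sigmoid(a) * sigmoid(b)
    return cls_logit + obj_logit - math.log(
        1.0 + math.exp(cls_logit) + math.exp(obj_logit))


print(final_score(0.0, 0.0))  # 0.25
```

A prediction gets a high final score only when both the class probability and the objectness are high, so a low objectness suppresses an otherwise confident class prediction.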