Paper Reading: You Only Look One-level Feature (YOLOF)

1. Paper Overview

This paper takes a close look at what FPN actually contributes and identifies two functions: multi-scale feature fusion, and a divide-and-conquer optimization that handles objects of different scales on different levels. The authors find that the divide-and-conquer design, not the feature fusion, is the more important of the two — a conclusion I find questionable.

Based on this analysis, the paper makes two main contributions: (1) it removes FPN and instead uses residual connections to add feature maps with different receptive fields, so that a single map covers multiple receptive-field scales (see Figure 4 in the paper) — I am skeptical that this addition really achieves such coverage; (2) with only one feature map, it uses a top-k strategy to assign a fixed number of positive anchors to objects of every scale.

Update (April 20): the speed comparison reported in the paper actually benefits from dropping the 4 convs of the P3 head — in the original RetinaNet, the head accounts for too large a share of the computation.
[Figure]

The figure above claims that residual connections achieve coverage of multiple receptive fields. Note that all the backbones in the authors' experiments are ResNets, which already contain residual connections of their own. This raises a question: if the backbone were swapped for VGG, could YOLOF still match RetinaNet's performance?

This paper has already been covered in detail elsewhere, so I won't repeat that here:
我扔掉FPN来做目标检测,效果竟然这么强!YOLOF开源:你只需要看一层特征|CVPR2021

We propose You Only Look One-level Feature (YOLOF), which only uses one single C5 feature (with a downsample rate of 32) for detection. To bridge the performance gap between the SiSo encoder and the MiMo encoder, we first design the structure of the encoder properly to extract the multi-scale contexts for objects on various scales, compensating for the lack of multiple-level features; then, we apply a uniform matching mechanism to solve the imbalance problem of positive anchors raised by the sparse anchors in the single feature. Without bells and whistles, YOLOF achieves comparable results with its feature pyramids counterpart RetinaNet [23] but 2.5× faster. In a single-feature manner, YOLOF matches the performance of the recently proposed DETR [4] while converging much faster (7×). With an image size of 608 × 608 and other techniques [1, 47], YOLOF achieves 44.3 mAP running at 60 fps on 2080Ti, which is 13% faster than YOLOv4 [1].

2. The Two Roles of FPN

The most popular way to build feature pyramids is the feature pyramid network (FPN) [22], which mainly brings two benefits: (1) multi-scale feature fusion: fusing multiple low-resolution and high-resolution feature inputs to obtain better representations; (2) divide-and-conquer: detecting objects on different levels regarding objects' scales.
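To make the divide-and-conquer role concrete: FPN routes each ground-truth box to one pyramid level according to its scale. The sketch below is my own illustration of the level-assignment heuristic from the original FPN paper (not code from YOLOF); the level range P2–P5 and the canonical 224 size are the values used there.

```python
import math

def fpn_level(box_w, box_h, k0=4, canonical=224, k_min=2, k_max=5):
    """FPN's level-assignment heuristic: a box of canonical size
    (224, the ImageNet crop) maps to level k0; larger boxes go to
    coarser levels, smaller ones to finer levels, clamped to the
    available pyramid range (P2..P5 assumed here)."""
    k = k0 + math.floor(math.log2(math.sqrt(box_w * box_h) / canonical))
    return max(k_min, min(k_max, k))

print(fpn_level(224, 224))  # 4 — the canonical box lands on P4
print(fpn_level(32, 32))    # 2 — a small box is clamped to the finest level
print(fpn_level(896, 896))  # 5 — a large box is clamped to the coarsest
```

Each level thus only ever sees objects in a narrow scale range, which is the "divide-and-conquer" effect YOLOF argues is the real source of FPN's gains.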

3. Differences Between MiMo and SiMo Structures

In the authors' experiments, these two achieve similar performance!

[Figure]

MiMo refers to the FPN structure; the SiMo structure is shown below:

[Figure]

4. A Debatable Point

[Figure]
The authors enlarge the receptive field of the single feature map in two steps: first, the Projector shown below enlarges the receptive field from (a) to (b); then the residual connections shown below expand that single receptive field into multiple scales, i.e., from (b) to (c). I am skeptical of this step: after the residual additions there is still only one feature map — can its receptive field really cover such a wide range?

[Figure]
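The receptive-field coverage claim can at least be sanity-checked with simple arithmetic. The sketch below is my own illustration, not the paper's code; it assumes four 3x3 dilated convolution blocks with dilation rates 2, 4, 6, 8 (the rates used in YOLOF's dilated encoder) and stride 1 throughout. With residual connections, a feature can bypass any subset of blocks, so the output nominally aggregates many receptive-field sizes rather than only the largest one:

```python
from itertools import combinations

def rf_after_chain(dilations, k=3, base_rf=1):
    """Receptive field after stacking k-by-k convolutions with the given
    dilation rates (stride 1 throughout): each conv with dilation d
    grows the receptive field by (k - 1) * d."""
    rf = base_rf
    for d in dilations:
        rf += (k - 1) * d
    return rf

DILATIONS = (2, 4, 6, 8)  # assumed rates for the four dilated blocks

# Without skip connections, the chain yields one large receptive field:
print(rf_after_chain(DILATIONS))  # 1 + 2*(2+4+6+8) = 41

# With residual connections, a feature can bypass any subset of blocks,
# so the output mixes many receptive fields, not just the largest:
rfs = sorted({rf_after_chain(sub)
              for r in range(len(DILATIONS) + 1)
              for sub in combinations(DILATIONS, r)})
print(rfs)  # [1, 5, 9, 13, 17, 21, 25, 29, 33, 37, 41]
```

This only shows which nominal receptive fields can coexist in the summed output; whether a single feature map effectively exploits all of them is exactly the point questioned above.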

5. The Choice of K in topK

[Figures]

The authors verify experimentally that K = 4 gives the best results.
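The top-k assignment can be sketched as follows. This is a simplified version of YOLOF's Uniform Matching — the function name is mine, and for brevity the distance here is between box centers, while the real implementation also filters matches by IoU thresholds:

```python
import math

def uniform_match(anchors, gts, k=4):
    """Pick, for each ground-truth box, the k anchors whose centers are
    closest to the GT center as positive samples. Every object gets the
    same number of positives regardless of its scale (the actual
    Uniform Matching additionally ignores matches with too-low IoU)."""
    def center(box):  # box given as (x1, y1, x2, y2)
        return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)

    positives = {}
    for gi, gt in enumerate(gts):
        gc = center(gt)
        by_dist = sorted(range(len(anchors)),
                         key=lambda ai: math.dist(center(anchors[ai]), gc))
        positives[gi] = by_dist[:k]
    return positives

# every GT gets exactly k positives, however large or small it is
anchors = [(x, y, x + 32, y + 32) for x in range(0, 128, 32)
                                  for y in range(0, 128, 32)]
print(uniform_match(anchors, [(0, 0, 40, 40)], k=4)[0])  # [0, 1, 4, 5]
```

The fixed per-object quota is the point: with sparse anchors on one feature map, a max-IoU rule would give large objects far more positives than small ones, and top-k equalizes that.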

6. Other Details During Training

Other Details. (1) As mentioned in the previous section, the pre-defined anchors in YOLOF are sparse, decreasing the match quality between anchors and ground-truth boxes. We add a random shift operation on the image to circumvent this problem. The operation shifts the image randomly by a maximum of 32 pixels in the left, right, top, and bottom directions; it aims to inject noise into the object's position in the image, increasing the probability of ground-truth boxes matching with high-quality anchors. (2) Moreover, we found that a restriction on the anchors' center shift is also helpful to the final classification when using a single-level feature. We add a restriction that the center shift for all anchors should be smaller than 32 pixels.
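The random shift in detail (1) can be sketched as below. This is my own minimal illustration, not the authors' code: the function name and the (x1, y1, x2, y2) box format are my choices, and only the box bookkeeping is shown — the image pixels would be shifted by the same offset.

```python
import random

def random_shift(img_w, img_h, boxes, max_shift=32, rng=random):
    """Shift the ground-truth boxes by a random (dx, dy), each up to
    max_shift pixels in either direction, clipping them to the image
    and dropping any box pushed fully outside."""
    dx = rng.randint(-max_shift, max_shift)
    dy = rng.randint(-max_shift, max_shift)
    shifted = []
    for x1, y1, x2, y2 in boxes:
        nx1, ny1 = max(x1 + dx, 0), max(y1 + dy, 0)
        nx2, ny2 = min(x2 + dx, img_w), min(y2 + dy, img_h)
        if nx2 > nx1 and ny2 > ny1:  # keep boxes still inside the image
            shifted.append((nx1, ny1, nx2, ny2))
    return (dx, dy), shifted

(dx, dy), out = random_shift(640, 640, [(100, 100, 200, 200)],
                             rng=random.Random(0))
```

Detail (2) — restricting each anchor's center shift to under 32 pixels — is a separate constraint applied during matching and is not shown here.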

References

[1] 我扔掉FPN来做目标检测,效果竟然这么强!YOLOF开源:你只需要看一层特征|CVPR2021
[2] YOLOF解读
