1. Abstract Summary
This paper revisits the Feature Pyramid Network (FPN) in one-stage detectors and points out, with experimental verification, that FPN's success lies in its divide-and-conquer solution to the optimization problem in object detection, rather than in multi-scale feature fusion.
2. Introduction
Experiments are designed by regarding FPN as a MiMo (Multiple-in-Multiple-out) encoder.
A SiMo encoder, which takes only the single C5 feature as input and performs no feature fusion, can achieve performance comparable to the MiMo encoder (i.e., FPN); in contrast, the MiSo and SiSo encoders suffer a significant performance drop (≥12 mAP).
These results indicate two facts:
(1) the C5 feature carries sufficient context for detecting objects at various scales, which enables the SiMo encoder to achieve comparable results;
(2) the benefit of multi-scale feature fusion is far less important than that of divide-and-conquer, so multi-scale feature fusion may not be the most significant advantage of FPN. The above analysis suggests that the key to FPN's success is its divide-and-conquer solution to the optimization problem in object detection.
At the same time, the divide-and-conquer approach brings a memory burden; looking at only a single feature level improves computational efficiency.
3. Architecture
An illustration of the detection pipeline:
(1) the backbone;
(2) the encoder, which receives inputs from the backbone and distributes representations for detection;
(3) the decoder, which performs classification and regression tasks and generates the final predicted boxes.
Backbone:
The ResNet and ResNeXt series are adopted as the backbone. All models are pre-trained on ImageNet. The output of the backbone is the C5 feature map, which has 2048 channels and a downsampling rate of 32. For a fair comparison with other detectors, all batchnorm layers in the backbone are frozen by default.
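As a minimal sketch of this setup (assuming PyTorch and torchvision, which the post does not specify; the class name C5Backbone is hypothetical), extracting a frozen-batchnorm C5 feature map from ResNet-50 could look like:

```python
import torch
import torch.nn as nn
import torchvision

class C5Backbone(nn.Module):
    """ResNet-50 truncated to return only the C5 feature map
    (2048 channels, downsampling rate 32), with batchnorm frozen."""
    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet50(pretrained=True)  # ImageNet pre-trained weights
        # Keep the stem and the four residual stages; drop avgpool and fc.
        self.stem = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool)
        self.stages = nn.Sequential(resnet.layer1, resnet.layer2, resnet.layer3, resnet.layer4)
        # Freeze all batchnorm layers: no running-statistics updates, no gradients.
        for m in self.modules():
            if isinstance(m, nn.BatchNorm2d):
                m.eval()
                for p in m.parameters():
                    p.requires_grad = False

    def forward(self, x):
        return self.stages(self.stem(x))  # C5: stride-32, 2048-channel feature map

c5 = C5Backbone()(torch.randn(1, 3, 608, 608))  # -> shape (1, 2048, 19, 19)
```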
Encoder:
Two projector layers (one 1×1 and one 3×3 convolution) are added after the backbone, resulting in a feature map with 512 channels. To make the encoder's output features cover all objects at different scales, Residual Blocks consisting of three consecutive convolutions are added: the first 1×1 convolution applies channel reduction with a reduction rate of 4, then a 3×3 convolution with dilation is used to enlarge the receptive field, and at last a 1×1 convolution recovers the number of channels.
Decoder:
The main design of RetinaNet is adopted: the decoder consists of two parallel task-specific heads, the classification head and the regression head.
Two small modifications are added:
(1) Following the design of the FFN in DETR, the number of convolution layers in the two heads is made different: the regression head has four convolutions, each followed by a batch normalization layer and a ReLU layer, while the classification head has only two.
(2) Following AutoAssign, an implicit objectness prediction (without direct supervision) is added for each anchor on the regression head. The final classification scores for all predictions are generated by multiplying the classification output with the corresponding implicit objectness.
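A minimal sketch of such a decoder head, assuming PyTorch (the module name YOLOFHead, the anchor/class counts, and the sigmoid-based score fusion are illustrative assumptions, not the paper's exact implementation):

```python
import torch
import torch.nn as nn

def conv_bn_relu(channels):
    # 3x3 convolution followed by batch normalization and ReLU
    return nn.Sequential(
        nn.Conv2d(channels, channels, 3, padding=1),
        nn.BatchNorm2d(channels),
        nn.ReLU(inplace=True),
    )

class YOLOFHead(nn.Module):
    """Sketch of the decoder: two convolutions on the classification branch,
    four on the regression branch, plus an implicit objectness per anchor."""
    def __init__(self, channels=512, num_classes=80, num_anchors=5):
        super().__init__()
        self.cls_subnet = nn.Sequential(*[conv_bn_relu(channels) for _ in range(2)])
        self.reg_subnet = nn.Sequential(*[conv_bn_relu(channels) for _ in range(4)])
        self.cls_pred = nn.Conv2d(channels, num_anchors * num_classes, 3, padding=1)
        self.reg_pred = nn.Conv2d(channels, num_anchors * 4, 3, padding=1)
        # Implicit objectness: one score per anchor, trained without direct supervision.
        self.obj_pred = nn.Conv2d(channels, num_anchors, 3, padding=1)
        self.num_classes = num_classes
        self.num_anchors = num_anchors

    def forward(self, feat):
        cls_feat = self.cls_subnet(feat)
        reg_feat = self.reg_subnet(feat)
        cls_logits = self.cls_pred(cls_feat)    # (N, A*C, H, W)
        bbox_deltas = self.reg_pred(reg_feat)   # (N, A*4, H, W)
        objectness = self.obj_pred(reg_feat)    # (N, A, H, W)
        N, _, H, W = cls_logits.shape
        A, C = self.num_anchors, self.num_classes
        # Final classification score = classification output x implicit objectness.
        scores = cls_logits.view(N, A, C, H, W).sigmoid() * objectness.sigmoid().unsqueeze(2)
        return scores.view(N, A * C, H, W), bbox_deltas

scores, deltas = YOLOFHead()(torch.randn(1, 512, 19, 19))
```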
An illustration of the structure of the Dilated Encoder:
In the figure, 1×1 and 3×3 denote 1×1 and 3×3 convolution layers, and ×4 means four successive residual blocks.
All convolution layers in the Residual Blocks are followed by a batchnorm layer and a ReLU layer,
while in the Projector, only convolution layers and batchnorm layers are used.
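Putting the encoder paragraph and the figure description together, a minimal PyTorch sketch of the Dilated Encoder could look like the following (the dilation rates (2, 4, 6, 8) are an assumption for illustration; the post does not state them):

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Residual block: 1x1 channel reduction (rate 4) -> dilated 3x3 -> 1x1 restore.
    Every convolution is followed by a batchnorm layer and a ReLU layer."""
    def __init__(self, channels=512, dilation=2, reduction=4):
        super().__init__()
        mid = channels // reduction
        self.block = nn.Sequential(
            nn.Conv2d(channels, mid, 1),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=dilation, dilation=dilation),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return x + self.block(x)  # residual connection

class DilatedEncoder(nn.Module):
    """Projector (1x1 then 3x3 conv, each followed only by batchnorm)
    followed by four successive dilated residual blocks."""
    def __init__(self, in_channels=2048, channels=512, dilations=(2, 4, 6, 8)):
        super().__init__()
        self.projector = nn.Sequential(
            nn.Conv2d(in_channels, channels, 1),
            nn.BatchNorm2d(channels),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.blocks = nn.Sequential(*[Bottleneck(channels, d) for d in dilations])

    def forward(self, c5):
        return self.blocks(self.projector(c5))

p5 = DilatedEncoder()(torch.randn(1, 2048, 19, 19))  # -> (1, 512, 19, 19)
```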
Detailed Structures of MiMo, SiMo, MiSo, and SiSo encoders.
The sketch of YOLOF
4. Results and Comparison
YOLOF achieves performance on par with RetinaNet-FPN while being 2.5× faster. Without any transformer layers, YOLOF matches the performance of DETR using only a single-level feature, with 7× less training time. Taking images of size 608×608 as input, YOLOF achieves 44.3 mAP and runs at 60 fps on a 2080Ti, which is 13% faster than YOLOv4.