[Feature Pyramid] "Deep Feature Pyramid Reconfiguration for Object Detection"

bryant_meng

Original link: https://blog.csdn.net/bryant_meng/article/details/105009880

ECCV-2018

Contents

1 Background and Motivation
2 Advantages / Contributions
3 Method
- 3.1 Deep Feature Reconfiguration
  - 3.1.1 Global Attention for Feature Hierarchy
  - 3.1.2 Local Reconfiguration
4 Experiments
5 Conclusion (own)

1 Background and Motivation

Existing feature pyramid designs are still inefficient at integrating semantic information across different scales.

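Targeting the weaknesses of the SSD and FPN architectures, the authors build on SSD and design a new feature pyramid structure that gives the object detector stronger feature representation ability.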

SSD is one of the earliest methods to use feature pyramids for object detection.
(Figure from "SSD详解" / "SSD Explained".)

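Its drawback is that SSD detects small objects with shallow layers, and these low-level layers lack high-level semantic information, so small objects are not detected well.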

FPN's top-down feature fusion is linear, which is too simple to capture highly nonlinear patterns in more complicated and practical cases.

2 Advantages / Contributions

- Proposes a feature pyramid built from global-local reconfigurations that enhances multi-scale representations.
- All scales of the feature pyramid are processed simultaneously, which is more efficient than layer-by-layer transformation.

3 Method


1) ConvNet Feature Hierarchy

The feature set of an object detection backbone can be written as
$X = \{x_1, x_2, \dots, x_L\}$

$L$: the total number of layers in the backbone
$x_l$: the output of the $l^{th}$ layer

In SSD, the set of feature maps used for prediction can be written as
$X_{SSD} = \{x_P, x_{P+1}, \dots, x_L\}$
e.g., $P = 23$ for VGG.

$x_P$ has high resolution but limited semantic information; because deeper, more semantic features are not reused, it is poorly suited to detecting small objects.
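To make the 1-indexed notation concrete, here is a minimal sketch that represents the backbone hierarchy as a list of layer output shapes and selects the SSD prediction set $\{x_P, \dots, x_L\}$. The shapes and the helper name `ssd_prediction_set` are hypothetical, for illustration only.

```python
from typing import List, Tuple

# Shape of one layer output x_l as (channels, height, width).
Shape = Tuple[int, int, int]

def ssd_prediction_set(hierarchy: List[Shape], P: int) -> List[Shape]:
    """Return {x_P, x_{P+1}, ..., x_L}; layers are numbered from 1 as in the text."""
    L = len(hierarchy)
    if not 1 <= P <= L:
        raise ValueError(f"P must lie in [1, {L}]")
    return hierarchy[P - 1:]

# Hypothetical 5-layer hierarchy: resolution shrinks as depth grows.
toy = [(64, 76, 76), (128, 38, 38), (256, 19, 19), (512, 10, 10), (256, 5, 5)]
pred = ssd_prediction_set(toy, P=3)
```

Here `pred[0]` is $x_P$, the shallowest and highest-resolution map in the set, i.e. the semantically weak layer the text describes.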

2) Lateral Connection

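In FPN, features are fused as follows:
$x'_l = \alpha_l(x_l) + \beta_l(x'_{l+1})$
where $\alpha_l$ is a convolution (without an activation function) and $\beta_l$ is up-sampling (bilinear interpolation here; the original FPN uses nearest-neighbor up-sampling). Both can be regarded as linear operations. Unrolling the recursion, FPN in general applies a polynomial expansion of the following form to the features:
$x'_l = \alpha_l(x_l) + \beta_l \circ \alpha_{l+1}(x_{l+1}) + \beta_l \circ \beta_{l+1} \circ \alpha_{l+2}(x_{l+2}) + \cdots + \beta_l \circ \cdots \circ \beta_{L-1} \circ \alpha_L(x_L)$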

The new set of feature maps used for prediction can then be written as
$X' = \{x'_P, x'_{P+1}, \dots, x'_L\}$
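The claim that top-down fusion is linear can be checked numerically. The sketch below uses 1-D single-channel toy features, stands in a per-level affine scale for $\alpha_l$ (a 1x1 conv on one channel) and 2x nearest-neighbor repetition for $\beta_l$ (the same $\beta$ at every level), then verifies that the recursive fusion equals the unrolled sum in which $\beta$ is applied $k - l$ times to $\alpha_k(x_k)$ for every $k \ge l$. The weights, sizes, and function names are assumptions for illustration, not FPN's actual layers; levels are 0-indexed in the code.

```python
from typing import List

Vec = List[float]  # one single-channel 1-D "feature map"

def alpha(l: int, x: Vec) -> Vec:
    # Stand-in for the lateral 1x1 conv: per-level affine map (made-up weights).
    w, b = 0.5 + 0.1 * l, 0.01 * l
    return [w * v + b for v in x]

def beta(x: Vec) -> Vec:
    # Stand-in for 2x nearest-neighbor up-sampling: repeat every element twice.
    return [v for v in x for _ in range(2)]

def fpn_recursive(xs: List[Vec]) -> List[Vec]:
    """x'_l = alpha_l(x_l) + beta(x'_{l+1}), computed top-down."""
    out: List[Vec] = [[] for _ in xs]
    out[-1] = alpha(len(xs) - 1, xs[-1])  # top level has no top-down input
    for l in range(len(xs) - 2, -1, -1):
        lateral, top_down = alpha(l, xs[l]), beta(out[l + 1])
        out[l] = [a + u for a, u in zip(lateral, top_down)]
    return out

def fpn_unrolled(xs: List[Vec]) -> List[Vec]:
    """x'_l = sum over k >= l of beta applied (k - l) times to alpha_k(x_k)."""
    out: List[Vec] = []
    for l in range(len(xs)):
        total = [0.0] * len(xs[l])
        for k in range(l, len(xs)):
            term = alpha(k, xs[k])
            for _ in range(k - l):
                term = beta(term)
            total = [t + v for t, v in zip(total, term)]
        out.append(total)
    return out

# Three levels; each level is half the length of the one below it.
xs = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0], [7.0]]
rec, unr = fpn_recursive(xs), fpn_unrolled(xs)
```

The two results agree up to floating-point rounding because $\beta$ is linear: each output level is just a sum of independently transformed input levels, with no term gating or modulating another, which is the limited representation power referred to in the text.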

This kind of linear representation lacks the power needed for complex object detection tasks.

3.1 Deep Feature Reconfiguration

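The first step is to replace the linear fusion with a nonlinear one:
$x'_l = H_l(X)$
where the nonlinear function $H_l(X)$ consists of a global attention operation and a local reconfiguration operation.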

3.1.1 Global Attention for Feature Hierarchy

The global attention follows SENet, i.e., channel attention.
For an introduction to SENet, see [SENet] "Squeeze-and-Excitation Networks".

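Step 1 squeezes out the spatial dimensions, reducing each feature map to a vector with one entry per channel:
$s_l^c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_l^c(i,j)$

where $x_l^c(i,j)$ is the value at row $i$, column $j$ of the $c^{th}$ channel, and $H \times W$ is the spatial size of the feature map.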
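A minimal pure-Python sketch of this squeeze step (global average pooling), assuming a feature map stored as nested lists indexed `x[c][i][j]` (channel, row, column); the function name `squeeze` is hypothetical. For every channel $c$ it computes the mean $\frac{1}{H \times W}\sum_i\sum_j x^c(i,j)$.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # x[c][i][j]: channel c, row i, column j

def squeeze(x: FeatureMap) -> List[float]:
    """Global average pooling: one mean value per channel."""
    s: List[float] = []
    for channel in x:
        H, W = len(channel), len(channel[0])
        s.append(sum(sum(row) for row in channel) / (H * W))
    return s

x = [[[1.0, 2.0], [3.0, 4.0]],   # channel 0
     [[0.0, 0.0], [0.0, 8.0]]]   # channel 1
s = squeeze(x)                   # one value per channel
```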

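Step 2 passes this vector through two fully connected layers:
$e_l = \sigma(W_2^l \, \delta(W_1^l s_l))$
The equation in the paper has a small error here: the weights should be $W_1^l$ and $W_2^l$, as written above. $\delta$ is ReLU and $\sigma$ is the sigmoid function.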

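The first fully connected layer, $W_1^l \in \mathbb{R}^{\frac{C}{r} \times C}$, compresses the $C$-dimensional channel vector by a factor of $r = 16$, and the second, $W_2^l \in \mathbb{R}^{C \times \frac{C}{r}}$, restores it to $C$ dimensions.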
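A minimal sketch of the excitation step $e = \sigma(W_2\,\delta(W_1 s))$ in pure Python, with $\delta$ = ReLU, $\sigma$ = sigmoid, $W_1$ of shape $(C/r) \times C$ and $W_2$ of shape $C \times (C/r)$. Biases are omitted; the random initialization, $C = 32$, and the helper names are assumptions for illustration.

```python
import math
import random
from typing import List

Matrix = List[List[float]]

def matvec(W: Matrix, v: List[float]) -> List[float]:
    """Matrix-vector product W @ v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def excite(s: List[float], W1: Matrix, W2: Matrix) -> List[float]:
    """e = sigmoid(W2 @ relu(W1 @ s)): C -> C/r -> C channel weights."""
    hidden = [max(0.0, h) for h in matvec(W1, s)]    # delta = ReLU on C/r units
    return [sigmoid(z) for z in matvec(W2, hidden)]  # sigma = sigmoid on C units

C, r = 32, 16                                        # reduction ratio r = 16
rng = random.Random(0)
W1 = [[rng.gauss(0.0, 0.1) for _ in range(C)] for _ in range(C // r)]
W2 = [[rng.gauss(0.0, 0.1) for _ in range(C // r)] for _ in range(C)]
e = excite([rng.random() for _ in range(C)], W1, W2)
```

With $C = 32$ and $r = 16$ the hidden layer has only 2 units, which shows how narrow the bottleneck is; the sigmoid keeps every channel weight between 0 and 1.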

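Step 3 multiplies the learned channel weights with the original feature maps channel-wise.

This gives the set of feature maps finally used for prediction. In the paper's equation for this set, the subscript $l$ would be better dropped; adding it is redundant.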
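The channel-wise multiplication in step 3 amounts to scaling every value of channel $c$ by the $c$-th attention weight. A minimal sketch under the same nested-list layout `x[c][i][j]` (the layout and the name `scale_channels` are assumptions):

```python
from typing import List

FeatureMap = List[List[List[float]]]  # x[c][i][j]

def scale_channels(x: FeatureMap, e: List[float]) -> FeatureMap:
    """Multiply every value in channel c by the attention weight e[c]."""
    if len(x) != len(e):
        raise ValueError("expected one weight per channel")
    return [[[w * v for v in row] for row in channel] for channel, w in zip(x, e)]

x = [[[1.0, 2.0], [3.0, 4.0]],
     [[0.0, 0.0], [0.0, 8.0]]]
y = scale_channels(x, [0.5, 0.25])   # channel 0 halved, channel 1 quartered
```

The spatial layout is untouched; only the relative importance of the channels changes, which is why this acts as a global reweighting.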

3.1.2 Local Reconfiguration

This is a ResNet-style structure: the output of the channel attention is fed into a bottleneck, alongside a shortcut branch consisting of a 1x1 conv.
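A minimal pure-Python sketch of this residual unit, assuming a ResNet-style 1x1 -> 3x3 -> 1x1 bottleneck with ReLU between layers and a 1x1-conv shortcut. Batch normalization is omitted, no activation is applied after the sum, and all channel counts and helper names are assumptions; the paper's exact layer configuration may differ.

```python
from typing import List

FeatureMap = List[List[List[float]]]       # x[c][i][j]
Kernel1x1 = List[List[float]]              # w[out_c][in_c]
Kernel3x3 = List[List[List[List[float]]]]  # w[out_c][in_c][di][dj]

def conv1x1(x: FeatureMap, w: Kernel1x1) -> FeatureMap:
    """Pointwise conv: mixes channels independently at every pixel."""
    H, W = len(x[0]), len(x[0][0])
    return [[[sum(w_o[c] * x[c][i][j] for c in range(len(x))) for j in range(W)]
             for i in range(H)] for w_o in w]

def conv3x3(x: FeatureMap, w: Kernel3x3) -> FeatureMap:
    """3x3 conv (cross-correlation), stride 1, zero padding 1: keeps H x W."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    out: FeatureMap = []
    for w_o in w:
        plane = [[0.0] * W for _ in range(H)]
        for i in range(H):
            for j in range(W):
                for c in range(C):
                    for di in range(3):
                        for dj in range(3):
                            ii, jj = i + di - 1, j + dj - 1
                            if 0 <= ii < H and 0 <= jj < W:
                                plane[i][j] += w_o[c][di][dj] * x[c][ii][jj]
        out.append(plane)
    return out

def relu(x: FeatureMap) -> FeatureMap:
    return [[[max(0.0, v) for v in row] for row in ch] for ch in x]

def add(a: FeatureMap, b: FeatureMap) -> FeatureMap:
    return [[[u + v for u, v in zip(ra, rb)] for ra, rb in zip(ca, cb)]
            for ca, cb in zip(a, b)]

def local_reconfiguration(x: FeatureMap, w_reduce: Kernel1x1, w_mid: Kernel3x3,
                          w_expand: Kernel1x1, w_short: Kernel1x1) -> FeatureMap:
    """Bottleneck branch (1x1 -> 3x3 -> 1x1, ReLU in between) + 1x1-conv shortcut."""
    h = relu(conv1x1(x, w_reduce))
    h = relu(conv3x3(h, w_mid))
    h = conv1x1(h, w_expand)
    return add(h, conv1x1(x, w_short))
```

With the bottleneck weights set to zero and the shortcut set to the identity, the unit returns its input unchanged; this is the intuition behind the residual form, since the bottleneck branch only has to learn a correction on top of the shortcut.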

The benefit of the residual form is, as argued in ResNet, that it is easier to optimize the residual mapping than to optimize the desired underlying mapping.