Paper Reading: Scale-Aware Trident Networks for Object Detection

1. Paper Overview

The motivation of this paper is to improve multi-scale handling in object detection (small-object performance is hard to raise, large objects do relatively better, and medium objects are usually detected best). Through experiments the authors find that replacing standard convolutions with dilated convolutions of different dilation rates yields the best performance on objects of correspondingly different scales. They therefore propose the trident block: three branches, each with a different dilation rate, so that each branch is responsible for detecting objects of a particular size range. Two auxiliary training strategies, weight sharing and scale-aware training, improve generalization and reduce overfitting. Finally, at inference time the three branches can be reduced to a single branch, keeping only the major (middle) branch, which achieves nearly the same accuracy as the full three-branch model (only slightly lower). Note that the training settings for the single-branch approximation differ somewhat from the three-branch settings; see Table 6.
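To make the building block above concrete, here is a minimal sketch of my own (not the authors' code): a 3×3 convolution applied with different dilation rates keeps the same kernel shape and parameter count, but samples the input on a sparser grid, so its receptive field grows.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)  # a dummy feature map

# Three 3x3 convolutions that differ only in dilation rate.
# padding = dilation keeps the spatial size unchanged for a 3x3 kernel.
for d in (1, 2, 3):
    conv = nn.Conv2d(64, 64, kernel_size=3, padding=d, dilation=d, bias=False)
    y = conv(x)
    n_params = sum(p.numel() for p in conv.parameters())
    print(f"dilation={d}: output {tuple(y.shape)}, params {n_params}")
# All three print the same output shape and the same parameter count;
# only the sampling grid (and hence the receptive field) differs.
```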

Note: the baseline in this paper is the two-stage Faster R-CNN. The authors mention that the gain from FPN is limited, and that in a two-stage detector, given enough training epochs, the same accuracy as with FPN can be reached without it. I do not fully understand this point and see two possible explanations: (1) the training tricks differ, and with certain tricks added the FPN-level accuracy can be matched, so FPN is not strictly needed; (2) the baseline is the two-stage Faster R-CNN, whose RoI pooling already performs feature alignment and produces fixed-size features; for a one-stage detector such as RetinaNet or FCOS, dropping FPN would certainly not work.

My understanding: the scale invariance of a CNN is quite weak. Previous detection models typically rely on random crop, multi-scale training, or a larger training set to make the network learn (essentially memorize) scale-invariant features. TridentNet instead adds multi-branch trident blocks with controlled dilation rates to the conv4 stage, so that during training the network explicitly learns the same features for objects of different sizes (scale invariance). It effectively acts as a training trick, which is why keeping only the middle major branch at test time causes only a small performance drop.
Note: most of the paper is easy to follow, but the single-branch approximation part is a bit confusing; it seems to slightly contradict what was said earlier.


In this paper, instead of feeding in multi-scale inputs
like the image pyramid, we propose a novel network structure to adapt the network for different scales. In particular, we create multiple scale-specific feature maps with the
proposed trident blocks as shown in Figure 1(c). With the
help of dilated convolutions [41], different branches of trident blocks have the same network structure and share the
same parameters yet have different receptive fields. Furthermore, to avoid training objects with extreme scales, we
leverage a scale-aware training scheme to make each branch
specific to a given scale range matching its receptive field.
Finally, thanks to weight sharing through the whole multibranch network, we could approximate the full TridentNet
with only one major branch during inference. This approximation only brings marginal performance degradation. As
a result, it could achieve significant improvement over the
single-scale baseline without any compromise of inference
speed. This property makes TridentNet more desirable over
other methods for practical uses.

2. Comparison with the image pyramid and FPN

Both the image pyramid and the feature pyramid methods share the same motivation that models should have different receptive fields for objects of different scales. Despite
the inefficiency, the image pyramid fully utilizes the representational power of the model to transform objects of all
scales equally. In contrast, the feature pyramid generates
multi-level features thus sacrificing the feature consistency
across different scales. This leads to a decrease in effective
training data and a higher risk of overfitting for each scale.
The goal of this work is to get the best of two worlds by creating features with a uniform representational power for
all scales efficiently.

The authors say both the image pyramid and FPN are good ways to handle multi-scale objects: the image pyramid fully exploits the CNN's representational power to treat objects of all scales equally, whereas FPN produces multi-level features and therefore sacrifices feature consistency across scales (this should mean that features at different scales go through transformations of different depths and different parameters). If FPN shared its parameters across levels, would this problem still exist?

3. Investigation of Receptive Field

Note: once the backbone is deep, further increasing the dilation rate brings little benefit (with a shallow backbone, increasing the dilation rate can still improve performance).

Two conclusions the authors draw from their experiments:

  1. The performance on objects of different scales are influenced by the receptive field of a network. The most suitable receptive field is strongly correlated with the scale of objects.
  2. Although ResNet-101 has a large enough theoretical receptive field to cover large objects (greater than 96×96 resolution) in COCO, the performance of large objects could still be improved when enlarging the dilation rate. This finding shares the same spirit in [31] that the effective receptive field is smaller than the theoretical receptive field. We hypothesize that the effective receptive field of detection networks needs to balance between small and large objects. Increasing dilation rates enlarges the effective receptive field by emphasizing large objects, thus compromising the performance of small objects.

Note: the effective receptive field is smaller than the theoretical one; if the effective receptive field is too large, small-object performance suffers.
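To make the note above concrete, here is a small helper of my own (an assumed utility, not from the paper) that computes the theoretical receptive field of a chain of convolutions: a layer with kernel k, stride s and dilation d contributes (d·(k−1))·jump, where jump is the product of all previous strides. It only computes the theoretical receptive field; as the paper stresses, the effective one is smaller.

```python
def receptive_field(layers):
    """layers: list of (kernel, stride, dilation) tuples, in order.
    Returns the theoretical receptive field on the input."""
    rf, jump = 1, 1
    for k, s, d in layers:
        k_eff = d * (k - 1) + 1      # effective kernel size with dilation
        rf += (k_eff - 1) * jump     # growth contributed by this layer
        jump *= s                    # accumulated stride
    return rf

# Ten stacked 3x3 convolutions (stride 1) at different dilation rates:
for d in (1, 2, 3):
    print(d, receptive_field([(3, 1, d)] * 10))
# dilation 1 -> 21, dilation 2 -> 41, dilation 3 -> 61
```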

4. Where to add trident blocks


Stacking trident blocks allows us to control receptive fields of different branches in an efficient way
similar to the pilot experiment in Section 3. Typically, we
replace the blocks in the last stage of the backbone network
with trident blocks since larger strides lead to a larger difference in receptive fields as needed.
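A minimal sketch of this idea in PyTorch (my own simplification, not the released code; the real trident block is a ResNet bottleneck with normalization, while here a single shared 3×3 weight is applied at dilations 1, 2, 3 via F.conv2d so each branch sees a different receptive field):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TridentBlock(nn.Module):
    """Simplified trident block: one set of weights, three dilation rates."""
    def __init__(self, channels, dilations=(1, 2, 3)):
        super().__init__()
        self.dilations = dilations
        # single shared 3x3 weight, reused by every branch
        self.weight = nn.Parameter(torch.empty(channels, channels, 3, 3))
        nn.init.kaiming_normal_(self.weight)

    def forward(self, xs):
        # xs: list with one feature map per branch
        outs = []
        for x, d in zip(xs, self.dilations):
            y = F.conv2d(x, self.weight, padding=d, dilation=d)
            outs.append(F.relu(x + y))   # residual connection per branch
        return outs

# "Replace the last stage" usage: start from one conv4 feature map,
# duplicate it into three branches, then stack a few trident blocks.
feat = torch.randn(1, 256, 32, 32)
branches = [feat, feat, feat]
for block in [TridentBlock(256) for _ in range(3)]:
    branches = block(branches)
print([b.shape for b in branches])  # three scale-specific feature maps
```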

5. Advantages of weight sharing

In this work, we share the weights of all
branches and their associated RPN and R-CNN heads, and
only vary the dilation rate of each branch.
The advantages of weight sharing are three-fold. It reduces the number of parameters and makes TridentNet need
no extra parameters compared with the original detector.
It also echoes with our motivation that objects of different
scales should go through a uniform transformation with the
same representational power. A final point is that transformation parameters could be trained on more object samples
from all branches. In other words, the same parameters are
trained for different scale ranges under different receptive
fields.

Three advantages: (1) fewer parameters, which helps prevent overfitting; (2) objects of different scales should go through the same feature transformation; (3) the transformation parameters are trained on more samples, coming from all branches.
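A short check of advantage (1), under the same simplification as the sketch above: because every branch reuses the same weight tensor, adding branches adds no parameters compared with a single dilated convolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

conv = nn.Conv2d(256, 256, kernel_size=3, bias=False)   # the shared weights

x = torch.randn(1, 256, 32, 32)
branch_outputs = [
    F.conv2d(x, conv.weight, padding=d, dilation=d)      # same weights, 3 dilations
    for d in (1, 2, 3)
]

total_params = sum(p.numel() for p in conv.parameters())
print(total_params)   # 256*256*3*3 = 589824, regardless of the number of branches
```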

6. Scale-aware Training Scheme

Similar to SNIP [38], we define a valid range [l_i, u_i] for each branch i. During training, we only select the proposals and ground truth boxes whose scales fall in the corresponding valid range of each branch. Specifically, for a Region-of-Interest (RoI) with width w and height h on the input image (before resize), it is valid for branch i when:

$$l_i \leq \sqrt{wh} \leq u_i \qquad (1)$$

This scale-aware training scheme could be applied on both
RPN and R-CNN. For RPN, we select ground truth boxes
which are valid for each branch according to Eq. 1 during
anchor label assignment. Similarly, we remove all invalid
proposals for each branch during the training of R-CNN.

Note: l_i and u_i are hand-picked hyperparameters. This scale-aware training scheme is applied to the training of both the RPN and the R-CNN head.
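A minimal sketch of the selection rule in Eq. 1 (the valid ranges below are hypothetical hyperparameters, not the paper's exact settings): a box is kept for branch i only if sqrt(w·h), measured on the un-resized input image, falls inside [l_i, u_i].

```python
import math

# hypothetical valid ranges for the three branches (pixels on the original image)
VALID_RANGES = [(0, 90), (30, 160), (90, float("inf"))]

def valid_for_branch(box, branch_idx):
    """box = (x1, y1, x2, y2) on the input image before resizing."""
    w, h = box[2] - box[0], box[3] - box[1]
    scale = math.sqrt(w * h)
    lo, hi = VALID_RANGES[branch_idx]
    return lo <= scale <= hi

def filter_boxes(boxes, branch_idx):
    """Keep only the ground-truth boxes / proposals valid for this branch,
    used both for RPN anchor assignment and for R-CNN training."""
    return [b for b in boxes if valid_for_branch(b, branch_idx)]

boxes = [(0, 0, 20, 20), (0, 0, 100, 100), (0, 0, 300, 300)]
for i in range(3):
    print(i, filter_boxes(boxes, i))   # each branch keeps its own scale range
```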

7. Fast Inference Approximation

Here we propose TridentNet Fast, a fast approximation of TridentNet with only one branch during inference.
For a three-branch network as in Figure 2, we use the middle branch for inference since its valid range covers both
large and small objects. In this way, TridentNet Fast incurs
no additional time cost compared with a standard Faster R-CNN detector. Surprisingly, we find that this approximation
only exhibits a slight performance drop compared with the
original TridentNet. This may be due to our weight-sharing
strategy, through which multi-branch training is equivalent
to within-network scale augmentation. Detailed ablation of
TridentNet Fast could be found in Section 5.3.
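In code terms the fast approximation amounts to something like the sketch below (again my own illustration, reusing the simplified shared-weight convolution from earlier): at test time only the middle dilation is evaluated, so the cost matches a plain single-branch detector.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

conv = nn.Conv2d(256, 256, kernel_size=3, bias=False)   # shared trident weights
x = torch.randn(1, 256, 32, 32)

TRAIN_DILATIONS = (1, 2, 3)
INFER_DILATION = 2          # middle branch: its valid range covers most objects

def trident_forward(x, training):
    dilations = TRAIN_DILATIONS if training else (INFER_DILATION,)
    return [F.conv2d(x, conv.weight, padding=d, dilation=d) for d in dilations]

print(len(trident_forward(x, training=True)))   # 3 branches during training
print(len(trident_forward(x, training=False)))  # 1 branch at inference (TridentNet Fast)
```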

Note: (1) the authors attribute the fact that TridentNet Fast performs almost as well as the full TridentNet to weight sharing??
(2) TridentNet Fast performs best when the scale-aware scheme is not used during training??

8. Ablation Studies

Note: the gain from weight sharing is much larger than the gain from the scale-aware scheme.


9. Comparison with State-of-the-Arts

Note:

TridentNet, which is to directly apply our method on
Faster R-CNN with ResNet-101 backbone in the 2× training scheme, achieves 42.7 AP without bells and whistles.
To fairly compare with SNIP and SNIPER, we apply
multi-scale training, soft-NMS, deformable convolutions,
large-batch BN, and the 3× training scheme on TridentNet and get TridentNet*.

Note: the authors' experiments show that adding FPN to Faster R-CNN brings no improvement.
