Object Detection: RFB-Net (paper translation to aid reading)

Personal notes

After staring at this for too long, the words stop looking like words and I can't read them anymore.

Code link: code
Paper link: paper

RFB-NET

Abstract: Current top-performing object detectors depend on deep CNN backbones, such as ResNet-101 and Inception, benefiting from their powerful feature representations but suffering from high computational costs. Conversely, some lightweight model based detectors fulfil real-time processing, while their accuracies are often criticized. In this paper, we explore an alternative to build a fast and accurate detector by strengthening lightweight features using a hand-crafted mechanism. Inspired by the structure of Receptive Fields (RFs) in human visual systems, we propose a novel RF Block (RFB) module, which takes the relationship between the size and eccentricity of RFs into account, to enhance the feature discriminability and robustness. We further assemble RFB to the top of SSD, constructing the RFB Net detector. To evaluate its effectiveness, experiments are conducted on two major benchmarks and the results show that RFB Net is able to reach the performance of advanced very deep detectors while keeping the real-time speed. Code is available at https://github.com/ruinmessi/RFBNet.

Intro

In recent years, Region-based Convolutional Neural Networks (R-CNN) [8], along with its representative updated descendants, e.g. Fast R-CNN [7] and Faster R-CNN [26], have persistently promoted the performance of object detection on major challenges and benchmarks, such as Pascal VOC [5], MS COCO [21], and ILSVRC [27]. They formulate this issue as a two-stage problem and build a typical pipeline, where the first phase hypothesizes category-agnostic object proposals within the given image and the second phase classifies each proposal according to CNN based deep features. It is generally accepted that in these methods, CNN representation plays a crucial role, and the learned feature is expected to deliver a high discriminative power encoding object characteristics and a good robustness especially to moderate positional shifts (usually incurred by inaccurate boxes). A number of very recent efforts have confirmed such a fact. For instance, [11] and [15] extract features from deeper CNN backbones, like ResNet [11] and Inception [31]; [19] introduces a top-down architecture to construct feature pyramids, integrating low-level and high-level information; and the latest top-performing Mask R-CNN [9] produces an RoIAlign layer to generate more precise regional features. All these methods adopt improved features to reach better results; however, such features basically come from deeper neural networks with heavy computational costs, making them suffer from a low inference speed.

To accelerate detection, a single-stage framework is investigated, where the phase of object proposal generation is discarded. Although the pioneering attempts, namely You Look Only Once (YOLO) [24] and Single Shot Detector (SSD) [22], illustrate the ability of real-time processing, they tend to sacrifice accuracies, with a clear drop ranging from 10% to 40% relative to state-of-the-art two-stage solutions [20]. More recently, Deconvolutional SSD (DSSD) [6] and RetinaNet [20] substantially ameliorate the precision scores, which are comparable to the top ones reported by the two-stage detectors. Unfortunately, their performance gains are credited to the very deep ResNet-101 [11] model as well, which limits the efficiency.

According to the discussion above, to build a fast yet powerful detector, a reasonable alternative is to enhance the feature representation of the lightweight network by bringing in certain hand-crafted mechanisms rather than stubbornly deepening the model. On the other side, several discoveries in neuroscience reveal that in the human visual cortex, the size of the population Receptive Field (pRF) is a function of eccentricity in their retinotopic maps, and although varying between maps, it increases with eccentricity in each map [36], as illustrated in Fig. 1. It helps to highlight the importance of the region nearer to the center and elevate the insensitivity to small spatial shifts. A few shallow descriptors coincidentally make use of this mechanism to design [34,14,37] or learn [1,38,29] their pooling schemes, and show good performance in matching image patches.
Regarding current deep learning models, they commonly set RFs at the same size with a regular sampling grid on a feature map, which probably induces some loss in the feature discriminability as well as robustness. Inception [33] considers RFs of multiple sizes, and it implements this concept by launching multi-branch CNNs with different convolution kernels. Its variants [32,31,16] achieve competitive results in object detection (in the two-stage framework) and classification tasks. However, all kernels in Inception are sampled at the same center. A similar idea appears in [3], where an Atrous Spatial Pyramid Pooling (ASPP) is exploited to capture multi-scale information. It applies several parallel convolutions with different atrous rates on the top feature map to vary the sampling distance from the center, which proves effective in semantic segmentation. But the features only have a uniform resolution from previous convolution layers of the same kernel size, and compared to the daisy shaped ones, the resulting feature tends to be less distinctive. Deformable CNN [4] attempts to adaptively adjust the spatial distribution of RFs according to the scale and shape of the object. Although its sampling grid is flexible, the impact of eccentricity of RFs is not taken into account, where all pixels in an RF contribute equally to the output response and the most important information is not emphasized.

Inspired by the structure of RFs in the human visual system, this paper proposes a novel module, namely Receptive Field Block (RFB), to strengthen the deep features learned from lightweight CNN models so that they can contribute to fast and accurate detectors. Specifically, RFB makes use of multi-branch pooling with varying kernels corresponding to RFs of different sizes, applies dilated convolution layers to control their eccentricities, and reshapes them to generate the final representation, as in Fig. 2. We then assemble the RFB module to the top of SSD [22], a real-time approach with a lightweight backbone, and construct an advanced one-stage detector (RFB Net). Thanks to such a simple module, RFB Net delivers relatively decent scores that are comparable to the ones of up-to-date deeper backbone network based detectors [19,18,20] and retains the fast speed of the original lightweight detector. Additionally, the RFB module is generic and imposes few constraints on the network architecture. The main contributions of this work are summarized as follows:

  1. We propose the RFB module to simulate the configuration in terms of the size and eccentricity of RFs in human visual systems, aiming to enhance deep features of lightweight CNN networks.
  2. We present the RFB Net based detector, and by simply replacing the top convolution layers of SSD [22] with RFB, it shows significant performance gain while still keeping the computational cost under control.
  3. We show that RFB Net achieves state-of-the-art results on the Pascal VOC and MS COCO at a real-time processing speed, and demonstrate the generalization ability of RFB by linking it to MobileNet [12].

To summarize:

Existing two-stage detectors are accurate but slow, and even the improved variants still rely on fairly deep networks; one-stage detectors are fast, but not as accurate as the two-stage ones. Drawing on the relationship between the eccentricity and the size of RFs in the human visual system, the authors propose an RFB module. Plugging this module into a one-stage network brings a clear performance gain while keeping the computational cost under control.

Related work

Two-stage detector: R-CNN [8] straightforwardly combines the steps of cropping box proposals like Selective Search [35] and classifying them through a CNN model, yielding a significant accuracy gain compared to traditional methods, which opens the deep learning era in object detection. Its descendants (e.g., Fast R-CNN [7], Faster R-CNN [26]) update the two-stage framework and achieve dominant performance. Besides, a number of effective extensions are proposed to further improve the detection accuracy, such as R-FCN [17], FPN [19], and Mask R-CNN [9].

One-stage detector: The most representative one-stage detectors are YOLO [24,25] and SSD [22]. They predict confidences and locations for multiple objects based on the whole feature map. Both detectors adopt lightweight backbones for acceleration, while their accuracies apparently trail those of top two-stage methods. Recent more advanced single-stage detectors (e.g., DSSD [6] and RetinaNet [20]) update their original lightweight backbones by the deeper ResNet-101 and apply certain techniques, such as deconvolution [6] or Focal Loss [20], whose scores are comparable and even superior to the ones of state-of-the-art two-stage methods. However, such performance gains largely consume their advantage in speed.

Receptive field: Recall that in this study, we aim to improve the performance of high-speed single-stage detectors without incurring too much computational burden. Therefore, instead of applying very deep backbones, RFB, imitating the mechanism of RFs in the human visual system, is used to enhance lightweight model based feature representation. Actually, there exist several studies that discuss RFs in CNNs, and the most related ones are the Inception family [33,32,31], ASPP [3], and Deformable CNN [4]. The Inception block adopts multiple branches with different kernel sizes to capture multi-scale information. However, all the kernels are sampled at the same center, which requires much larger ones to reach the same sampling coverage and thus loses some crucial details. For ASPP, dilated convolution varies the sampling distance from the center, but the features have a uniform resolution from the previous convolution layers of the same kernel size, which treats the clues at all the positions equally, probably leading to confusion between object and context. Deformable CNN [4] learns distinctive resolutions of individual objects; unfortunately, it holds the same downside as ASPP. RFB is indeed different from them, and it highlights the relationship between RF size and eccentricity in a daisy-shape configuration, where bigger weights are assigned to the positions nearer to the center by smaller kernels, claiming that they are more important than the farther ones. See Fig. 3 for the differences of the four typical spatial RF structures. On the other side, Inception and ASPP have not been successfully adopted to improve one-stage detectors, while RFB shows an effective way to make use of their advantages in this issue.
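To make the distinction concrete, the sketch below compares an ASPP-style setting, where one kernel size is paired with several dilation rates, against an RFB-style setting, where larger kernels go with larger dilations, using the standard effective kernel size k + (k - 1)(d - 1) of a dilated convolution. The specific (kernel, dilation) pairs are assumptions chosen for illustration, not the paper's exact configuration.

```python
# A rough, illustrative comparison of branch settings; the (kernel, dilation)
# pairs below are assumptions for demonstration, not the paper's exact values.

def effective_kernel(kernel_size: int, dilation: int) -> int:
    """Effective spatial extent of a dilated convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

# ASPP-style: a single kernel size, sampled at several distances from the center.
aspp_branches = [(3, 1), (3, 2), (3, 4), (3, 8)]

# RFB-style: the kernel size grows together with the dilation, so branches with
# larger receptive fields also sample the region near their center more densely.
rfb_branches = [(1, 1), (3, 3), (5, 5)]

for name, branches in [("ASPP-like", aspp_branches), ("RFB-like", rfb_branches)]:
    sizes = [effective_kernel(k, d) for k, d in branches]
    print(name, "effective RF sizes per branch:", sizes)
```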

Method

In this section, we revisit the human visual cortex, introduce our RFB components and the way to simulate such a mechanism, and describe the architecture of the RFB Net detector as well as its training/testing schedule.

Visual Cortex Revisit

During the past few decades, it has come true that functional Magnetic Resonance Imaging (fMRI) non-invasively measures human brain activities at a resolution in millimeters, and RF modeling has become an important sensory science tool used to predict responses and clarify brain computations. Since human neuroscience instruments often observe the pooled responses of many neurons, these models are thus commonly called pRF models [36]. Based on fMRI and pRF modeling, it is possible to investigate the relation across many visual field maps in the cortex. At each cortical map, researchers find a positive correlation between pRF size and eccentricity [36], while the coefficient of correlation varies in visual field maps, as shown in Fig. 1.

Receptive Field Block

The proposed RFB is a multi-branch convolutional block. Its inner structure can be divided into two components: the multi-branch convolution layer with different kernels and the trailing dilated pooling or convolution layers. The former part is identical to that of Inception, responsible for simulating the pRFs of multiple sizes, and the latter part reproduces the relation between the pRF size and eccentricity in the human visual system. Fig. 2 illustrates RFB along with its corresponding spatial pooling region maps. We elaborate on the two parts and their functions in detail in the following.
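As a rough picture of how these two components stack inside a single branch, here is a minimal PyTorch sketch; the channel widths and the (kernel_size, dilation) pairing are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch of a single RFB branch, assuming PyTorch; channel widths and
# the (kernel_size, dilation) pairing are illustrative choices.
import torch
import torch.nn as nn

def rfb_branch(in_ch: int, mid_ch: int, out_ch: int,
               kernel_size: int, dilation: int) -> nn.Sequential:
    return nn.Sequential(
        # 1x1 bottleneck: shrink channels before the spatial convolution.
        nn.Conv2d(in_ch, mid_ch, kernel_size=1),
        nn.ReLU(inplace=True),
        # Multi-branch part: an n x n convolution simulating a pRF of one size.
        nn.Conv2d(mid_ch, out_ch, kernel_size, padding=kernel_size // 2),
        nn.ReLU(inplace=True),
        # Trailing dilated 3x3 convolution: controls the eccentricity, i.e. how
        # far from the center this branch samples, without adding parameters.
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=dilation, dilation=dilation),
    )

x = torch.randn(1, 256, 38, 38)               # e.g. a conv4_3-sized feature map
y = rfb_branch(256, 64, 128, kernel_size=3, dilation=3)(x)
print(y.shape)                                # torch.Size([1, 128, 38, 38])
```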

Multi-branch convolution layer: According to the definition of RF in CNNs, it is a simple and natural way to apply different kernels to achieve multi-size RFs, which is supposed to be superior to RFs that share a fixed size. We adopt the latest changes from the updated versions in the Inception family, i.e., Inception V4 and Inception-ResNet V2 [31]. To be specific, first, we employ the bottleneck structure in each branch, consisting of a 1 × 1 conv-layer to decrease the number of channels in the feature map, plus an n × n conv-layer. Second, we replace the 5 × 5 conv-layer by two stacked 3 × 3 conv-layers to reduce parameters while adding deeper non-linear layers. For the same reason, we use a 1 × n plus an n × 1 conv-layer to take the place of the original n × n conv-layer. Ultimately, we apply the shortcut design from ResNet [11] and Inception-ResNet V2 [31].
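The two kernel factorizations mentioned above can be sketched as follows; this is a hedged illustration in PyTorch, with placeholder channel counts, plus a quick parameter-count comparison showing why two stacked 3 × 3 layers are cheaper than one 5 × 5 layer.

```python
# Illustrative PyTorch sketches of the two factorization tricks; channel widths
# (128 in, 128 out) are placeholder values, not the paper's settings.
import torch.nn as nn

c = 128

# (a) Replace a 5x5 convolution by two stacked 3x3 convolutions: same 5x5
#     receptive field, fewer parameters, one extra non-linearity.
conv5x5         = nn.Conv2d(c, c, kernel_size=5, padding=2)
two_stacked_3x3 = nn.Sequential(
    nn.Conv2d(c, c, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(c, c, kernel_size=3, padding=1),
)

# (b) Replace an n x n convolution by a 1 x n followed by an n x 1 convolution.
n = 3
factorized_nxn = nn.Sequential(
    nn.Conv2d(c, c, kernel_size=(1, n), padding=(0, n // 2)),
    nn.Conv2d(c, c, kernel_size=(n, 1), padding=(n // 2, 0)),
)

def n_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

print("5x5:", n_params(conv5x5))              # 128*128*25 + 128   = 409,728
print("two 3x3:", n_params(two_stacked_3x3))  # 2*(128*128*9+128)  = 295,168
print("1xn + nx1:", n_params(factorized_nxn)) # 2*(128*128*3+128)  =  98,560
```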
Dilated pooling or convolution layer: This concept is originally introduced in DeepLab [2], where it is also named the atrous convolution layer. The basic intention of this structure is to generate feature maps of a higher resolution, capturing information at a larger area with more context while keeping the same number of parameters. This design has rapidly proved competent at semantic segmentation [3], and has also been adopted in some reputable object detectors, such as SSD [22] and R-FCN [17], to elevate speed or/and accuracy. In this paper, we exploit dilated convolution to simulate the impact of the eccentricities of pRFs in the human visual cortex. Fig. 4 illustrates two combinations of the multi-branch convolution layer and the dilated pooling or convolution layer. At each branch, the convolution layer of a particular kernel size is followed by a pooling or convolution layer with a corresponding dilation. The kernel size and dilation have a similar positive functional relation as that of the size and eccentricity of pRFs in the visual cortex. Eventually, the feature maps of all the branches are concatenated, merging into a spatial pooling or convolution array as in Fig. 1. The specific parameters of RFB, e.g., kernel size, dilation of each branch, and number of branches, are slightly different at each position within the detector, which are clarified in the next section.
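Putting the two components together, an end-to-end sketch of an RFB-like block might look like the following. This is only a hedged illustration in PyTorch: the number of branches, channel widths, and the exact (kernel, dilation) pairs are assumptions, not the configuration released at https://github.com/ruinmessi/RFBNet. Each branch pairs a larger kernel with a larger dilation, the branch outputs are concatenated, fused by a 1 × 1 convolution, and added back to a shortcut of the input.

```python
# A hedged sketch of an RFB-like block in PyTorch. Branch settings, channel
# widths, and the shortcut design are illustrative assumptions.
import torch
import torch.nn as nn

class BasicConv(nn.Module):
    """Conv + BN + optional ReLU, the usual building unit."""
    def __init__(self, in_ch, out_ch, k, stride=1, padding=0, dilation=1, relu=True):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, stride, padding, dilation, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True) if relu else None

    def forward(self, x):
        x = self.bn(self.conv(x))
        return self.relu(x) if self.relu is not None else x

class RFBLike(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        mid = in_ch // 8
        # Branch 0: small kernel, small dilation (samples close to the center).
        self.branch0 = nn.Sequential(
            BasicConv(in_ch, 2 * mid, k=1),
            BasicConv(2 * mid, 2 * mid, k=3, padding=1, dilation=1, relu=False),
        )
        # Branch 1: 3x3 kernel followed by a 3x3 conv with dilation 3.
        self.branch1 = nn.Sequential(
            BasicConv(in_ch, mid, k=1),
            BasicConv(mid, 2 * mid, k=3, padding=1),
            BasicConv(2 * mid, 2 * mid, k=3, padding=3, dilation=3, relu=False),
        )
        # Branch 2: 5x5-equivalent (two stacked 3x3) followed by dilation 5.
        self.branch2 = nn.Sequential(
            BasicConv(in_ch, mid, k=1),
            BasicConv(mid, 2 * mid, k=3, padding=1),
            BasicConv(2 * mid, 2 * mid, k=3, padding=1),
            BasicConv(2 * mid, 2 * mid, k=3, padding=5, dilation=5, relu=False),
        )
        # 1x1 linear fusion of the concatenated branches, plus a shortcut.
        self.fuse = BasicConv(6 * mid, out_ch, k=1, relu=False)
        self.shortcut = BasicConv(in_ch, out_ch, k=1, relu=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = torch.cat([self.branch0(x), self.branch1(x), self.branch2(x)], dim=1)
        return self.relu(self.fuse(out) + self.shortcut(x))

x = torch.randn(1, 512, 19, 19)      # e.g. an fc7-sized SSD feature map
print(RFBLike(512, 512)(x).shape)    # torch.Size([1, 512, 19, 19])
```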

RFB Net Detection Architecture

The proposed RFB Net detector reuses the multi-scale and one-stage framework of SSD [22], where the RFB module is embedded to ameliorate the feature extracted from the lightweight backbone so that the detector is more accurate and still fast enough. Thanks to the property of RFB for easily being integrated into CNNs, we can preserve the SSD architecture as much as possible. The main modification lies in replacing the top convolution layers with RFB, and some minor but active changes are given in Fig. 5.

Lightweight backbone: We use exactly the same backbone network as in SSD [22]. In brief, it is a VGG16 [30] architecture pre-trained on the ILSVRC CLS-LOC dataset [27], where its fc6 and fc7 layers are converted to convolutional layers with sub-sampling parameters, and its pool5 layer is changed from 2×2-s2 to 3×3-s1. The dilated convolution layer is used to fill the holes, and all the dropout layers and the fc8 layer are removed. Even though many accomplished lightweight networks have recently been proposed (e.g. DarkNet [25], MobileNet [12], and ShuffleNet [39]), we focus on this backbone to achieve direct comparison to the original SSD [22].

RFB on multi-scale feature maps: In the original SSD [22], the base network is followed by a cascade of convolutional layers to form a series of feature maps with consecutively decreasing spatial resolutions and increasing fields of view. In our implementation, we keep the same cascade structure of SSD, but the front convolutional layers with feature maps of relatively large resolutions are replaced by the RFB module. In the primary version of RFB, we use a single structure setting to imitate the impact of eccentricity. As the rate of the size and eccentricity of pRF differs between visual maps, we correspondingly adjust the parameters of RFB to form an RFB-s module, which mimics smaller pRFs in shallow human retinotopic maps, and put it behind the conv4_3 features, as illustrated in Fig. 4 and Fig. 5. The last few convolutional layers are preserved since the resolutions of their feature maps are too small to apply filters with large kernels like 5×5.
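The following is a rough, hedged outline of how the backbone modification and the RFB placement described above could be wired up in PyTorch. The fc6/fc7-to-conv conversion and the pool5 change follow the text; the dilation rate, channel widths, and the exact list of multi-scale feature maps are assumptions for illustration, not the released model.

```python
# A hedged outline of the SSD-style VGG16 tail and the RFB placement; details
# beyond what the text states (dilation rate, widths, extra-layer count) are
# assumptions, not the released RFBNet architecture.
import torch
import torch.nn as nn

def vgg16_ssd_tail(in_ch: int = 512) -> nn.Sequential:
    """VGG16 tail, SSD-style: pool5 becomes 3x3-s1, fc6/fc7 become conv layers
    (conv6 dilated to fill the holes left by the reduced stride); fc8 and the
    dropout layers are dropped."""
    return nn.Sequential(
        nn.MaxPool2d(kernel_size=3, stride=1, padding=1),              # pool5: 3x3-s1
        nn.Conv2d(in_ch, 1024, kernel_size=3, padding=6, dilation=6),  # conv6 (ex-fc6)
        nn.ReLU(inplace=True),
        nn.Conv2d(1024, 1024, kernel_size=1),                          # conv7 (ex-fc7)
        nn.ReLU(inplace=True),
    )

x = torch.randn(1, 512, 19, 19)          # conv5_3-sized map for a 300x300 input
print(vgg16_ssd_tail()(x).shape)         # torch.Size([1, 1024, 19, 19])

# Rough placement of RFB modules on the multi-scale maps (RFBLike and an RFB-s
# variant stand in for the modules sketched earlier):
#   conv4_3 features (large resolution)  -> RFB-s -> detection head
#   conv7 features                       -> RFB   -> detection head
#   subsequent extra layers              -> RFB-style blocks with stride 2
#   last few small feature maps          -> plain conv layers (too small for 5x5 kernels)
```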

The rest of the paper is the experiments section; I will not translate it for now.