SSD network structure: a PyTorch implementation of the SSD object detection network

weixin_39792393

These are notes on the SSD PyTorch source code.

Work in progress; updates to follow.

Figure 1. SSD architecture diagram

Overall architecture

The SSD approach is based on a feed-forward convolutional network that produces a fixed-size collection of bounding boxes and scores for the presence of object class instances in those boxes, followed by a non-maximum suppression step to produce the final detections. The early network layers are based on a standard architecture used for high quality image classification (truncated before any classification layers), which we will call the base network ^[1]. We then add auxiliary structure to the network to produce detections with the following key features:

Translator's notes: the non-maximum suppression step comes after the feed-forward convolutional network, and "detections" here means the detection results. On the base network, see also ^[2].

Multi-scale feature maps for detection

We add convolutional feature layers to the end of the truncated base network. These layers decrease in size progressively and allow predictions of detections at multiple scales. The convolutional model for predicting detections is different for each feature layer (cf. Overfeat ^[3] and YOLO ^[4] that operate on a single scale feature map).

Translator's note: for background on the two detectors compared here, see the notes on Overfeat ^[5] and YOLO ^[6].

Convolutional predictors for detection

Each added feature layer (or optionally an existing feature layer from the base network) can produce a fixed set of detection predictions using a set of convolutional filters. These are indicated on top of the SSD network architecture in Fig. 2. For a feature layer of size m × n with p channels, the basic element for predicting parameters of a potential detection is a 3 × 3 × p small kernel that produces either a score for a category, or a shape offset relative to the default box coordinates. At each of the m × n locations where the kernel is applied, it produces an output value. The bounding box offset output values are measured relative to a default box position relative to each feature map location (cf. the architecture of YOLO ^[6] that uses an intermediate fully connected layer instead of a convolutional filter for this step).

Translator's notes: "Fig. 2" refers to the figure in the paper, "these" refers to the feature layers, and "kernel" means a convolution kernel.
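The convolutional predictor described above can be sketched in PyTorch as follows. This is a minimal illustration rather than the reference code: the values of p, k, c, m and n are assumptions chosen to resemble the Conv4_3 branch of SSD300, and padding=1 is assumed so that the output keeps the m × n spatial size.

```python
import torch
import torch.nn as nn

# Illustrative values (assumptions): p input channels, k default boxes per
# location, c classes, and an m x n feature map.
p, k, c, m, n = 512, 4, 21, 38, 38

# One 3 x 3 x p filter per predicted value:
# 4 offsets per box for localization, c scores per box for classification.
loc_head = nn.Conv2d(p, k * 4, kernel_size=3, padding=1)
conf_head = nn.Conv2d(p, k * c, kernel_size=3, padding=1)

feature_map = torch.randn(1, p, m, n)
loc_out = loc_head(feature_map)    # one offset value per filter per location
conf_out = conf_head(feature_map)  # one class score per filter per location

print(tuple(loc_out.shape), tuple(conf_out.shape))
```

Each filter yields one value per feature-map location, so the channel dimension of the output enumerates the predicted values for all k boxes at that location.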

Default boxes and aspect ratios

We associate a set of default bounding boxes with each feature map cell, for multiple feature maps at the top of the network. The default boxes tile the feature map in a convolutional manner, so that the position of each box relative to its corresponding cell is fixed. At each feature map cell, we predict the offsets relative to the default box shapes in the cell, as well as the per-class scores that indicate the presence of a class instance in each of those boxes. Specifically, for each box out of k at a given location, we compute c class scores and the 4 offsets relative to the original default box shape. This results in a total of (c + 4)k filters that are applied around each location in the feature map, yielding (c + 4)kmn outputs for a m × n feature map. For an illustration of default boxes, please refer to Fig. 1. Our default boxes are similar to the anchor boxes used in Faster R-CNN ^[7], however we apply them to several feature maps of different resolutions. Allowing different default box shapes in several feature maps let us efficiently discretize the space of possible output box shapes.

Translator's notes: "Fig. 1" refers to the figure in the paper, and "filters" means convolution kernels.
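The filter and output counts quoted above follow directly from the formula. The short sketch below simply evaluates it; the function name prediction_counts and the sample values (21 classes, 4 boxes, a 38 × 38 map) are assumptions for illustration.

```python
def prediction_counts(c, k, m, n):
    """Return ((c + 4) * k filters per location, (c + 4) * k * m * n outputs)."""
    filters_per_location = (c + 4) * k
    total_outputs = filters_per_location * m * n
    return filters_per_location, total_outputs


# Assumed example: 21 classes, 4 default boxes per cell, 38 x 38 feature map.
filters, outputs = prediction_counts(c=21, k=4, m=38, n=38)
print(filters, outputs)
```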

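Code implementation

The main network architecture lives in ssd.py and consists of:

VGG convolutional layers
Auxiliary feature layers
Multi-scale box generation
Prediction and regression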

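VGG convolutional layers

SSD uses the convolutional part of VGG16 (configuration D: five convolutional blocks, the last usually labeled Conv5) as its base network. Two further convolutional layers, 1024 × 3 × 3 and 1024 × 1 × 1, are appended, and each of them is followed by a ReLU layer.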
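To make the layer numbering used later in forward concrete, the sketch below lists the base network's layers in order. The configuration list and the final pooling layer are assumptions modeled on the reference ssd.py rather than something stated in these notes, and the layer names are labels invented for this illustration.

```python
# Assumed configuration in the style of the reference ssd.py: numbers are the
# output channels of 3x3 convolutions, 'M' is max pooling, and 'C' is max
# pooling with ceil_mode=True.
VGG_CFG = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'C',
           512, 512, 512, 'M', 512, 512, 512]


def vgg_layer_names(cfg):
    """List layer labels in the order they would sit in the module list."""
    names = []
    block, conv_in_block = 1, 0
    for v in cfg:
        if v in ('M', 'C'):
            names.append(f"pool{block}" + ("(ceil)" if v == 'C' else ""))
            block += 1
            conv_in_block = 0
        else:
            conv_in_block += 1
            names.append(f"conv{block}_{conv_in_block}")
            names.append(f"relu{block}_{conv_in_block}")
    # Assumed tail: a final pooling layer, then the two extra convolutions
    # (1024 x 3 x 3 and 1024 x 1 x 1), each followed by ReLU.
    names += ["pool5", "conv6", "relu6", "conv7", "relu7"]
    return names


names = vgg_layer_names(VGG_CFG)
print(len(names), names.index("conv4_3"), names[22])
```

Under these assumptions the list has 35 entries, with conv4_3 at index 21 and its ReLU at index 22, which is consistent with the ranges [0, 23) and [23, 35) used in forward.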

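The table below shows the VGG part of the SSD 300 network, together with the configuration of each layer and the size of its output feature map.

Table 1. SSD base network configuration (CHW format, batch dimension omitted)

The part above the red line is the standard VGG-16 convolutional stack (marked "VGG-16 through Conv5_3 layer" in the SSD architecture diagram); the part below the red line consists of SSD's own convolutional layers.
The layers in red boxes, Conv2d-4_3 and Conv2d-7_1, are used to extract multi-scale features.
The pooling layer after the third convolutional block, MaxPool2d-3, uses ceiling-mode pooling (ceil_mode=True), so an odd input size is rounded up instead of down.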
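The effect of ceil_mode on MaxPool2d-3 can be checked with a little arithmetic. The sketch below is a simplified version of the pooling output-size formula (no padding or dilation); the helper name pooled_size is an assumption, and the 300 → 150 → 75 sizes assume the 300 × 300 input of SSD 300.

```python
import math


def pooled_size(size, kernel=2, stride=2, ceil_mode=False):
    """Output size of an unpadded pooling layer along one dimension."""
    span = (size - kernel) / stride
    rounded = math.ceil(span) if ceil_mode else math.floor(span)
    return rounded + 1


size = 300
size = pooled_size(size)                   # after the first pooling layer
size = pooled_size(size)                   # after the second pooling layer
floor_size = pooled_size(size)             # default rounding
ceil_size = pooled_size(size, ceil_mode=True)
print(size, floor_size, ceil_size)
```

With ceiling mode the 75 × 75 map becomes 38 × 38 rather than 37 × 37, which is what lets the next pooling layer produce a 19 × 19 map.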

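As a reminder, the VGG network structure is:

Table 2. VGG network configurations

Auxiliary feature layers

Eight auxiliary feature layers are appended to the base network, configured as in the following table:

Table 3. SSD auxiliary feature layer configuration (CHW format, batch dimension omitted)

The convolutional layers in red boxes (Conv2d-2_1, Conv2d-4_1, Conv2d-6_1, Conv2d-8_1) are used for multi-scale feature extraction.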

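Multi-scale box generation

In the localization and classification layers, 4 and 6 default boxes per feature-map location are generated from layer 21 of the VGG network (Conv2d-4_3) and from its second-to-last layer (Conv2d-7_1) respectively. Auxiliary feature layers 1, 3, 5 and 7 then contribute 6, 6, 4 and 4 default boxes per location. Summed over the six layers, that is 30 boxes per location (4 + 6 + 6 + 6 + 4 + 4); the overall number of boxes is much larger, because each count applies at every cell of its feature map.

Table 4. Output sizes of the multi-scale convolutional layers (BCHW format, b denotes batch)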
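The per-location counts above can be turned into a total number of default boxes once the feature-map sizes are known. The sketch below assumes the usual SSD 300 feature-map sizes (38, 19, 10, 5, 3, 1), which these notes give only in the missing Table 4.

```python
# Assumed SSD 300 feature-map side lengths for the six source layers.
feature_map_sizes = [38, 19, 10, 5, 3, 1]
# Default boxes per location, as listed in the text above.
boxes_per_location = [4, 6, 6, 6, 4, 4]

per_layer = [s * s * k for s, k in zip(feature_map_sizes, boxes_per_location)]
total_boxes = sum(per_layer)

print(sum(boxes_per_location), per_layer, total_boxes)
```

Under these assumptions the six layers contribute 5776, 2166, 600, 150, 36 and 4 boxes, for 8732 default boxes in total.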

Assembling SSD

__init__

# SSD network
self.vgg = nn.ModuleList(base)
# Layer learns to scale the l2 normalized features from conv4_3
self.L2Norm = L2Norm(512, 20)
self.extras = nn.ModuleList(extras)
self.loc = nn.ModuleList(head[0])
self.conf = nn.ModuleList(head[1])

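Because the localization and classification layers form two separate branches, the network cannot be expressed as a single feed-forward (sequential) model. The base network, auxiliary layers, localization layers and classification layers are therefore each wrapped in a ModuleList, and every layer has to be called by hand in forward.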
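The ModuleList pattern can be illustrated with a toy two-branch network. TwoBranchNet is a hypothetical class written for this note, not part of ssd.py; it only shows that ModuleList registers the parameters while leaving the call order to forward.

```python
import torch
import torch.nn as nn


class TwoBranchNet(nn.Module):
    """Hypothetical toy model: one shared trunk feeding two heads."""

    def __init__(self):
        super().__init__()
        # ModuleList registers the layers (so their parameters are trained)
        # but, unlike nn.Sequential, does not define how they are called.
        self.backbone = nn.ModuleList([
            nn.Conv2d(3, 8, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        ])
        self.loc = nn.ModuleList([nn.Conv2d(8, 4 * 4, kernel_size=3, padding=1)])
        self.conf = nn.ModuleList([nn.Conv2d(8, 4 * 21, kernel_size=3, padding=1)])

    def forward(self, x):
        for layer in self.backbone:  # layers are invoked manually
            x = layer(x)
        # The same feature map feeds both branches.
        return self.loc[0](x), self.conf[0](x)


net = TwoBranchNet()
loc, conf = net(torch.randn(2, 3, 16, 16))
num_param_tensors = len(list(net.parameters()))
print(tuple(loc.shape), tuple(conf.shape), num_param_tensors)
```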

forward

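# multi-scale feature maps and the per-scale head outputs
sources = list()
loc = list()
conf = list()

# apply vgg up to conv4_3 relu
for k in range(23):
    x = self.vgg[k](x)

# L2-normalize the conv4_3 features and keep them as the first source
s = self.L2Norm(x)
sources.append(s)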

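Layers [0, 23) of the VGG base network are called in order. The output of layer 22 (the ReLU after Conv4_3) is L2-normalized, and the normalized Conv4_3 feature map is appended to sources.

What sources is used for is explained below.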
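The L2 normalization step can be sketched as a small module. This is an illustration under assumptions: the class name L2NormSketch and the eps value are invented here, and the actual L2Norm layer in the repository may differ in detail. The arguments (512, 20) mirror the call L2Norm(512, 20) shown in __init__.

```python
import torch
import torch.nn as nn


class L2NormSketch(nn.Module):
    """Normalize each spatial position across channels, then rescale."""

    def __init__(self, n_channels, scale, eps=1e-10):
        super().__init__()
        self.eps = eps
        # One learnable scale per channel, initialized to `scale`.
        self.weight = nn.Parameter(torch.full((n_channels,), float(scale)))

    def forward(self, x):
        # L2 norm over the channel dimension, kept for broadcasting.
        norm = x.pow(2).sum(dim=1, keepdim=True).sqrt() + self.eps
        return self.weight.view(1, -1, 1, 1) * (x / norm)


l2norm = L2NormSketch(512, 20)
out = l2norm(torch.randn(1, 512, 38, 38))
channel_norms = out.norm(dim=1)  # per-position norm across channels
print(tuple(out.shape), float(channel_norms.mean()))
```

At initialization every position of the output has a channel-wise norm close to the scale value, 20; training can then adjust the per-channel weights.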

for k in range(23, len(self.vgg)):
    x = self.vgg[k](x)
sources.append(x)

Layers [23, 35) of the VGG base network are called in order, and the final output is appended to sources.

# apply extra layers and cache source layer outputs
for k, v in enumerate(self.extras):
    x = F.relu(v(x), inplace=True)
    if k % 2 == 1:
        sources.append(x)

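The auxiliary feature layers are called in order. The auxiliary convolutional layers carry no ReLU of their own, so the ReLU function is applied here after each call, and the post-ReLU outputs of auxiliary layers 1, 3, 5 and 7 are appended to sources.

At this point it is clear that sources holds the feature maps of Conv4_3, of the last layer of the base network, and of auxiliary feature layers 1, 3, 5 and 7, that is, feature maps at different scales.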
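The way the auxiliary layers shrink the feature map, and which outputs land in sources, can be traced with the convolution output-size formula. The (kernel, stride, padding) triples below are assumptions modeled on the reference implementation for SSD 300, as is the 19 × 19 input size.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output size of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# Assumed (kernel, stride, padding) for the eight auxiliary layers:
# 1x1 convolutions alternate with 3x3 convolutions that shrink the map.
extras = [(1, 1, 0), (3, 2, 1), (1, 1, 0), (3, 2, 1),
          (1, 1, 0), (3, 1, 0), (1, 1, 0), (3, 1, 0)]

size = 19  # assumed size of the base network's final feature map
source_sizes = []
for k, (kernel, stride, padding) in enumerate(extras):
    size = conv_out(size, kernel, stride, padding)
    if k % 2 == 1:  # same test as in forward: keep layers 1, 3, 5, 7
        source_sizes.append(size)

print(source_sizes)
```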

# apply multibox head to source layers
for (x, l, c) in zip(sources, self.loc, self.conf):
    loc.append(l(x).permute(0, 2, 3, 1).contiguous())
    conf.append(c(x).permute(0, 2, 3, 1).contiguous())

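Each feature map is zipped with its localization layer and classification layer into a (feature, localization, classification) triple. Every localization and classification layer is called on its feature map, permute reorders the result from BCHW to BHWC, and contiguous converts it to a contiguous memory layout.

Why reorder from BCHW to BHWC? The predictions for the boxes at one location are stored along the channel dimension. Moving channels to the last dimension keeps each box's 4 offsets (or num_classes scores) adjacent in memory, so the later view calls can reshape them to [batch, -1, 4] and [batch, -1, num_classes], and softmax can be applied over the last (class) dimension. Why call contiguous? permute returns a view whose memory is no longer contiguous, and view requires a compatible layout; see the article "PyTorch中的contiguous" by 栩风.
The input scales of the localization and classification layers are listed in Table 4 under "Multi-scale box generation"; their output sizes are given in Table 5 and Table 6 below.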
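A small experiment makes the role of permute and contiguous visible. The sizes below (2 images, 4 boxes, 21 classes, a 3 × 3 map) are assumptions picked to keep the tensors small; the tensor is filled with consecutive numbers so that element positions can be compared.

```python
import torch

B, k, c, H, W = 2, 4, 21, 3, 3

# Stand-in for the output of one classification head, in BCHW layout.
conf = torch.arange(B * k * c * H * W, dtype=torch.float32).view(B, k * c, H, W)

permuted = conf.permute(0, 2, 3, 1)       # BCHW -> BHWC (a strided view)
is_contiguous = permuted.is_contiguous()  # False after permute

# After contiguous(), flattening and regrouping yields one row per box.
flat = permuted.contiguous().view(B, -1)
per_box = flat.view(B, -1, c)

# Scores of the first box at location (0, 0) are the first c channels there.
expected = conf[0, :c, 0, 0]

# Reshaping the BCHW tensor directly mixes values from different locations.
wrong = conf.view(B, -1, c)[0, 0]

print(is_contiguous, tuple(per_box.shape), torch.equal(per_box[0, 0], expected))
```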

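Table 5. Output sizes of the localization layers on feature layers of different scales (input and output in CHW format)

Table 6. Output sizes of the classification layers on feature layers of different scales (input and output in CHW format)

The shapes before and after the dimension reordering compare as follows:

Table 7. Shapes before and after dimension reordering (BCHW -> BHWC)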

loc = torch.cat([o.view(o.size(0), -1) for o in loc], 1)
conf = torch.cat([o.view(o.size(0), -1) for o in conf], 1)

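Each localization and classification output is reshaped into a 2-D matrix of shape [batch, -1], and the matrices are concatenated along dimension 1 (column-wise) into one large matrix per branch.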
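The concatenation can be reproduced with placeholder tensors. The feature-map sizes are the assumed SSD 300 values, the per-location box counts are those listed under "Multi-scale box generation", and zero tensors stand in for the real head outputs.

```python
import torch

batch = 2
feature_map_sizes = [38, 19, 10, 5, 3, 1]  # assumed SSD 300 sizes
boxes_per_location = [4, 6, 6, 6, 4, 4]

# Placeholder localization outputs, already permuted to BHWC.
loc = [torch.zeros(batch, s, s, k * 4)
       for s, k in zip(feature_map_sizes, boxes_per_location)]

# Same expression as in forward: flatten each output, then join the columns.
loc = torch.cat([o.view(o.size(0), -1) for o in loc], 1)

print(tuple(loc.shape), tuple(loc.view(loc.size(0), -1, 4).shape))
```

Under these assumptions the concatenated matrix has 8732 × 4 = 34928 columns per image, which the later view call regroups into 8732 boxes of 4 offsets each.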

if self.phase == "test":
    output = self.detect(
        loc.view(loc.size(0), -1, 4),                   # loc preds
        self.softmax(conf.view(conf.size(0), -1,
                     self.num_classes)),                # conf preds
        self.priors.type(type(x.data))                  # default boxes
    )
else:
    output = (
        loc.view(loc.size(0), -1, 4),
        conf.view(conf.size(0), -1, self.num_classes),
        self.priors
    )

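In the training phase the output is the triple (localization, classification, prior box coordinates). Following the view calls above, the localization tensor has size [batch, num_priors, 4] and the classification tensor has size [batch, num_priors, num_classes]; the prior boxes hold four coordinates per box ([num_priors, 4] in the reference implementation). In the test phase the same reshaped tensors, with softmax applied to the class scores, are passed to self.detect together with the prior boxes.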
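The test-phase handling of the class scores can be sketched as follows. The sizes are assumptions (8732 priors and 21 classes, as for SSD 300 trained on VOC), random numbers stand in for the network output, and Softmax over the last dimension is assumed to match how self.softmax is configured.

```python
import torch
import torch.nn as nn

batch, num_priors, num_classes = 2, 8732, 21  # assumed sizes

# Placeholder for the concatenated classification output of forward.
conf = torch.randn(batch, num_priors * num_classes)

softmax = nn.Softmax(dim=-1)  # assumed: normalize over the class dimension
scores = softmax(conf.view(conf.size(0), -1, num_classes))

row_sums = scores.sum(dim=-1)  # one sum per prior box
print(tuple(scores.shape), float(row_sums.min()), float(row_sums.max()))
```

Each prior box ends up with a probability distribution over the classes, which is the form the detection step expects before non-maximum suppression.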

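Reference documents

A PyTorch Implementation of Single Shot MultiBox Detector

An analysis of SSD feature map selection (SSD feature map 选择解析)

SSD source code implementation, PyTorch (SSD 源码实现 (PyTorch))

PyTorch (2): ceil_mode in maxpool (Pytorch(2) maxpool的ceil_mode)

SSD: Single Shot MultiBox Detector

Very Deep Convolutional Networks for Large-Scale Image Recognition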

Notes

^ We use the VGG-16 network as a base, but other networks should also produce good results.
^ Translation of the note above: we use VGG-16 as the base network, but other networks should also produce good results.
^ OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks https://arxiv.org/abs/1312.6229
^ You Only Look Once: Unified, Real-Time Object Detection https://arxiv.org/abs/1506.02640
^ Winner of ILSVRC 2013.
^ a b A single-stage real-time object detection algorithm, first proposed before 2016.
^ The state-of-the-art two-stage detection algorithm in 2016 https://arxiv.org/abs/1506.01497