SSD Source Code Walkthrough

Paper: SSD: Single Shot MultiBox Detector
Unofficial code: pytorch

Introduction

SSD, short for Single Shot MultiBox Detector, is a one-stage method proposed by Wei Liu at ECCV 2016. Its main characteristics are:

  • It inherits from YOLO the idea of turning detection into a regression problem, completing network training in a single shot
  • Building on the anchors of Faster RCNN, it proposes the similar notion of prior boxes
  • It adds detection on a pyramidal feature hierarchy, which amounts to roughly half of the FPN idea

SSD Network Architecture

As the architecture figure shows, SSD is built on a VGG16 base network. After the VGG16 base, a 3×3 convolution (conv6) and a 1×1 convolution (conv7) are appended, followed by the Extra Feature Layers, which consist of eight convolutional layers. SSD predicts boxes from six feature maps: the ReLU output following conv4_3 in the base network, the ReLU output of conv7 (second to last in the layer list), plus the outputs of layers 1, 3, 5 and 7 (zero-indexed) of the Extra Feature Layers; a sketch of how these are collected follows below. To improve accuracy, one could presumably modify the base network, add feature fusion, and so on.
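
A simplified sketch of how the six source feature maps are collected at inference time (condensed from the repo's SSD.forward; it assumes `vgg_layers` and `extra_layers` are the lists built by the functions shown later, and note the repo additionally applies an L2Norm layer to the conv4_3 output before using it):

import torch.nn.functional as F

def collect_sources(x, vgg_layers, extra_layers):
    sources = []
    # apply VGG up to and including the ReLU after conv4_3
    for k in range(23):
        x = vgg_layers[k](x)
    sources.append(x)  # source 1: conv4_3 (L2-normalized in the repo)
    # apply the rest of the base network up to the ReLU after conv7
    for k in range(23, len(vgg_layers)):
        x = vgg_layers[k](x)
    sources.append(x)  # source 2: conv7 (the converted fc7)
    # the extra layers store no ReLUs; every second conv output is a source
    for k, layer in enumerate(extra_layers):
        x = F.relu(layer(x), inplace=True)
        if k % 2 == 1:
            sources.append(x)  # sources 3 to 6
    return sources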

Dilated Convolution (Dilation Conv)

The SSD network also uses dilated convolution (Dilation Conv). The VGG16 base is first pre-trained on the ILSVRC CLS-LOC dataset. Then, following DeepLab-LargeFOV, the fully connected layers fc6 and fc7 of VGG16 are converted into a 3×3 convolutional layer conv6 and a 1×1 convolutional layer conv7, and the pooling layer pool5 is changed from 2×2 with stride=2 to 3×3 with stride=1 (presumably to avoid shrinking the feature map). To compensate for this change, an atrous algorithm is adopted: conv6 becomes a dilated (atrous) convolution, which enlarges the receptive field exponentially without adding parameters or model complexity, using a dilation rate parameter to control the expansion. In the standard illustration, (a) is an ordinary 3×3 convolution with a 3×3 receptive field; (b) stacking a 3×3 convolution with dilation rate 2 on top enlarges the receptive field to 7×7; (c) stacking one with dilation rate 4 on top of that enlarges it to 15×15, though the sampled features become sparser. Conv6 uses a 3×3 kernel with dilation rate 6.
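
A quick illustrative check (not repo code) that conv6's dilated kernel preserves spatial size: a 3×3 kernel with dilation 6 spans 3 + 2×5 = 13 pixels, so padding=6 keeps a 19×19 input at 19×19.

import torch
import torch.nn as nn

conv6 = nn.Conv2d(512, 1024, kernel_size=3, padding=6, dilation=6)
x = torch.randn(1, 512, 19, 19)  # the fc7-stage feature map size in SSD300
print(conv6(x).shape)            # torch.Size([1, 1024, 19, 19])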

Prior Box

SSD has a Prior Box mechanism similar to anchors, used to generate prior boxes. These priors are later matched against the ground-truth (gt) boxes, and the predictions are regressed against the matches to recover the true object locations.
SSD generates its prior boxes according to the following rules:

  • Centered at the midpoint of each feature-map cell (offset = 0.5), a series of concentric prior boxes is generated (the center coordinates are then multiplied by step, which maps the feature-map location back to the original image)

  • Square prior boxes have a minimum side length of $min\_size$ and a maximum side length of $\sqrt{min\_size \times max\_size}$

  • For each aspect ratio, a different number of rectangular boxes is generated, with side lengths $\sqrt{aspect\_ratio} \times min\_size$ and $min\_size / \sqrt{aspect\_ratio}$

  • In total, the network generates a fixed number of prior boxes

  • The min_size and max_size of each feature map are determined by the formula below, where m is the number of feature maps used for prediction (m = 6). The first feature map gets min_size = S1 and max_size = S2; the second gets min_size = S2 and max_size = S3; and so on. In the paper, $S_{min} = 0.2$ and $S_{max} = 0.9$:

$$S_k = S_{min} + \frac{S_{max} - S_{min}}{m - 1}(k - 1), \quad k \in [1, m]$$

The resulting sizes for SSD300 are shown below (a sketch reproducing these values follows the table):
layer      min_size   max_size   # default boxes
conv4_3    30         60         4
fc7        60         111        6
conv6_2    111        162        6
conv7_2    162        213        6
conv8_2    213        264        4
conv9_2    264        315        4
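
Note that the values in the table come from the reference (Caffe) implementation's computation rather than the plain linear formula above; here is a short sketch (with a hypothetical helper name) that reproduces them:

import math

def ssd300_sizes(img_size=300, num_maps=6, min_ratio=20, max_ratio=90):
    # conv4_3 is special-cased with scales 0.1 (min) and 0.2 (max)
    min_sizes, max_sizes = [img_size * 10 // 100], [img_size * 20 // 100]
    step = int(math.floor((max_ratio - min_ratio) / (num_maps - 2)))
    for ratio in range(min_ratio, max_ratio + 1, step):
        min_sizes.append(img_size * ratio // 100)
        max_sizes.append(img_size * (ratio + step) // 100)
    return min_sizes, max_sizes

print(ssd300_sizes())
# ([30, 60, 111, 162, 213, 264], [60, 111, 162, 213, 264, 315])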

Training Strategy

Positive and Negative Samples

Given an input image and the ground truth for each object, the default box with the highest IoU against each ground-truth box is first taken as a positive sample. Then, among the remaining default boxes, those whose IoU with a ground-truth box exceeds 0.5 are also taken as positives, so one ground truth may correspond to several positive default boxes. All other default boxes are negatives. To keep the samples reasonably balanced, SSD applies hard negative mining: it subsamples the negatives by sorting them in descending order of confidence error (the smaller the predicted background confidence, the larger the error) and picking the top-k with the largest error as training negatives, keeping the positive:negative ratio close to 1:3 (a toy demo of the ranking trick follows).
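
The ranking is implemented in MultiBoxLoss (below) with a double argsort; here is a toy demo of the trick (not repo code) for one image with four priors:

import torch

loss_c = torch.tensor([[0.9, 0.1, 0.5, 0.7]])  # per-prior confidence loss
_, loss_idx = loss_c.sort(1, descending=True)  # priors ordered by loss: [0, 3, 2, 1]
_, idx_rank = loss_idx.sort(1)                 # rank of each prior within that order
print(idx_rank)                                # tensor([[0, 3, 2, 1]])
neg = idx_rank < 2                             # keep the 2 hardest negatives
print(neg)                                     # tensor([[ True, False, False,  True]])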

Objective Function

The objective function is the criterion optimized during training; it is also called the loss function. It is the weighted sum of the localization loss (localization loss, loc) and the confidence loss (confidence loss, conf, i.e. the classification loss), defined as:

$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)$$
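
For reference, the two terms as defined in the paper, with $x_{ij}^p \in \{0, 1\}$ indicating whether prior $i$ is matched to ground truth $j$ of class $p$, $N$ the number of matched priors, and the weight $\alpha$ set to 1 by cross-validation:

$$L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k} \, \mathrm{smooth}_{L1}\left(l_i^m - \hat{g}_j^m\right)$$

$$L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log\left(\hat{c}_i^p\right) - \sum_{i \in Neg} \log\left(\hat{c}_i^0\right)$$

where $\hat{g}$ are the encoded ground-truth offsets and $\hat{c}$ the softmax-normalized class confidences.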

Code Walkthrough

Defining the Base Model

import torch.nn as nn

# vgg([64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'C', 512, 512, 512, 'M',
#             512, 512, 512], 3)
# This function is derived from torchvision VGG make_layers()
# https://github.com/pytorch/vision/blob/master/torchvision/models/vgg.py
def vgg(cfg, i, batch_norm=False):
    layers = []
    in_channels = i
    for v in cfg:
        if v == 'M':
            layers += [nn.MaxPool2d(kernel_size=2, stride=2)]
        elif v == 'C':
            layers += [nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True)]
        else:
            conv2d = nn.Conv2d(in_channels, v, kernel_size=3, padding=1)
            if batch_norm:
                layers += [conv2d, nn.BatchNorm2d(v), nn.ReLU(inplace=True)]
            else:
                layers += [conv2d, nn.ReLU(inplace=True)]
            in_channels = v
    pool5 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
    conv6 = nn.Conv2d(512, 1024, kernel_size=3, padding=6, dilation=6)
    conv7 = nn.Conv2d(1024, 1024, kernel_size=1)
    layers += [pool5, conv6,
               nn.ReLU(inplace=True), conv7, nn.ReLU(inplace=True)]
    return layers
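
A quick sanity check of the layer indexing (using the `base` config defined just below): the returned list has 35 entries, index 21 is conv4_3 and index -2 is conv7, which is exactly what `multibox` later taps into via `vgg_source = [21, -2]`.

layers = vgg(base['300'], 3)
print(len(layers))   # 35
print(layers[21])    # Conv2d(512, 512, kernel_size=(3, 3), ...)   -> conv4_3
print(layers[-2])    # Conv2d(1024, 1024, kernel_size=(1, 1), ...) -> conv7
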
base = {
    '300': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'C', 512, 512, 512, 'M',
            512, 512, 512],
    '512': [],
}
extras = {
    '300': [256, 'S', 512, 128, 'S', 256, 128, 256, 128, 256],
    '512': [],
}
mbox = {
    '300': [4, 6, 6, 6, 4, 4],  # number of boxes per feature map location
    '512': [],
}

def add_extras(cfg, i, batch_norm=False):
    # Extra layers added to VGG for feature scaling
    layers = []
    in_channel = i
    flag = False
    for k, v in enumerate(cfg):
        if in_channel != 'S':
            if v == 'S':
                layers += [nn.Conv2d(in_channel, cfg[k + 1],
                                     kernel_size=(1, 3)[flag], stride=2, padding=1)]  # flag toggles the kernel size between 1 and 3
            else:
                layers += [nn.Conv2d(in_channel, v, kernel_size=(1, 3)[flag])]

            flag = not flag

        in_channel = v
    return layers
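
Walking through the '300' config, add_extras yields eight convolutions alternating 1×1 and 3×3 kernels (the ReLUs are applied in the forward pass, not stored in the list); the 3×3 convolutions at indices 1, 3, 5 and 7 produce the four extra source maps, which for a 300×300 input are 10×10×512, 5×5×256, 3×3×256 and 1×1×256:

ext = add_extras(extras['300'], 1024)  # conv7 outputs 1024 channels
for k, layer in enumerate(ext):
    print(k, layer.kernel_size, layer.out_channels)
# 0 (1, 1) 256
# 1 (3, 3) 512   <- source (stride 2: 19x19 -> 10x10)
# 2 (1, 1) 128
# 3 (3, 3) 256   <- source (stride 2: 10x10 -> 5x5)
# 4 (1, 1) 128
# 5 (3, 3) 256   <- source (no padding: 5x5 -> 3x3)
# 6 (1, 1) 128
# 7 (3, 3) 256   <- source (no padding: 3x3 -> 1x1)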

Generating the Model Head

# cfg  [4, 6, 6, 6, 4, 4],  # number of boxes per feature map location
def multibox(vgg, extra_layers, cfg, num_classes):
    loc_layers = []
    conf_layers = []
    vgg_source = [21, -2]  # layers[21] is conv4_3, layers[-2] is conv7
    for k, v in enumerate(vgg_source):
        loc_layers += [nn.Conv2d(vgg[v].out_channels,
                                 cfg[k] * 4, kernel_size=3, padding=1)]
        conf_layers += [nn.Conv2d(vgg[v].out_channels,
                        cfg[k] * num_classes, kernel_size=3, padding=1)]
    # extract loc/conf features by convolving the outputs of every second layer
    # in extra_layers (indices 1, 3, 5, 7, i.e. the four extra source maps)
    for k, v in enumerate(extra_layers[1::2], 2):
        loc_layers += [nn.Conv2d(v.out_channels, cfg[k]
                                 * 4, kernel_size=3, padding=1)]
        conf_layers += [nn.Conv2d(v.out_channels, cfg[k]
                                  * num_classes, kernel_size=3, padding=1)]
    return vgg, extra_layers, (loc_layers, conf_layers)
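
Putting the three builders together for SSD300 with 21 VOC classes (a usage sketch); the six head convolutions together predict 8732 default boxes, which follows from the rules in the Prior Box section:

vgg_layers, extra_layers, (loc, conf) = multibox(vgg(base['300'], 3),
                                                 add_extras(extras['300'], 1024),
                                                 mbox['300'], num_classes=21)
print(len(loc), len(conf))  # 6 6
# defaults per map: 38^2*4 + 19^2*6 + 10^2*6 + 5^2*6 + 3^2*4 + 1^2*4 = 8732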

Generating the Prior Boxes

import torch
from itertools import product
from math import sqrt


class PriorBox(object):
    """Compute priorbox coordinates in center-offset form for each source
    feature map.
    """
    def __init__(self, cfg):
        super(PriorBox, self).__init__()
        self.image_size = cfg['min_dim']
        # number of priors for feature map location (either 4 or 6)
        self.num_priors = len(cfg['aspect_ratios'])
        self.variance = cfg['variance'] or [0.1]
        self.feature_maps = cfg['feature_maps']
        self.min_sizes = cfg['min_sizes']
        self.max_sizes = cfg['max_sizes']
        self.steps = cfg['steps']
        self.aspect_ratios = cfg['aspect_ratios']
        self.clip = cfg['clip']
        self.version = cfg['name']
        for v in self.variance:
            if v <= 0:
                raise ValueError('Variances must be greater than 0')

    def forward(self):
        mean = []
        #   'steps': [8, 16, 32, 64, 100, 300],
        # 'feature_maps': [38, 19, 10, 5, 3, 1],
        #  'min_sizes': [21, 45, 99, 153, 207, 261],
        # 'max_sizes': [45, 99, 153, 207, 261, 315],
        for k, f in enumerate(self.feature_maps):
            for i, j in product(range(f), repeat=2):
                f_k = self.image_size / self.steps[k]
                # unit center x,y
                cx = (j + 0.5) / f_k
                cy = (i + 0.5) / f_k

                # aspect_ratio: 1
                # rel size: min_size
                s_k = self.min_sizes[k]/self.image_size
                mean += [cx, cy, s_k, s_k]

                # aspect_ratio: 1
                # rel size: sqrt(s_k * s_(k+1))
                s_k_prime = sqrt(s_k * (self.max_sizes[k]/self.image_size))
                mean += [cx, cy, s_k_prime, s_k_prime]

                # rest of aspect ratios
                for ar in self.aspect_ratios[k]:
                    mean += [cx, cy, s_k*sqrt(ar), s_k/sqrt(ar)]
                    mean += [cx, cy, s_k/sqrt(ar), s_k*sqrt(ar)]
        # back to torch land
        output = torch.Tensor(mean).view(-1, 4)
        if self.clip:
            output.clamp_(max=1, min=0)
        return output
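
A usage sketch, assembling a cfg dict from the values in the comments above; the aspect_ratios and the remaining keys follow the repo's config style (an assumption worth checking against the repo's data/config.py):

cfg = {
    'min_dim': 300,
    'feature_maps': [38, 19, 10, 5, 3, 1],
    'steps': [8, 16, 32, 64, 100, 300],
    'min_sizes': [21, 45, 99, 153, 207, 261],
    'max_sizes': [45, 99, 153, 207, 261, 315],
    'aspect_ratios': [[2], [2, 3], [2, 3], [2, 3], [2], [2]],
    'variance': [0.1, 0.2],
    'clip': True,
    'name': 'SSD300',
}
priors = PriorBox(cfg).forward()
print(priors.shape)  # torch.Size([8732, 4]), rows are (cx, cy, w, h)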

The MultiBox Loss Function

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
from data import coco as cfg           # supplies cfg['variance'] used below
from ..box_utils import match, log_sum_exp


class MultiBoxLoss(nn.Module):
    """SSD Weighted Loss Function
    Compute Targets:
        1) Produce Confidence Target Indices by matching  ground truth boxes
           with (default) 'priorboxes' that have jaccard index > threshold parameter
           (default threshold: 0.5).
        2) Produce localization target by 'encoding' variance into offsets of ground
           truth boxes and their matched  'priorboxes'.
        3) Hard negative mining to filter the excessive number of negative examples
           that comes with using a large number of default bounding boxes.
           (default negative:positive ratio 3:1)
    Objective Loss:
        L(x,c,l,g) = (Lconf(x, c) + αLloc(x,l,g)) / N
        Where, Lconf is the CrossEntropy Loss and Lloc is the SmoothL1 Loss
        weighted by α which is set to 1 by cross val.
        Args:
            c: class confidences,
            l: predicted boxes,
            g: ground truth boxes
            N: number of matched default boxes
        See: https://arxiv.org/pdf/1512.02325.pdf for more details.
    """
#  MultiBoxLoss(cfg['num_classes'], 0.5, True, 0, True, 3, 0.5, False, args.cuda)
    def __init__(self, num_classes, overlap_thresh, prior_for_matching,
                 bkg_label, neg_mining, neg_pos, neg_overlap, encode_target,
                 use_gpu=True):
        super(MultiBoxLoss, self).__init__()
        self.use_gpu = use_gpu
        self.num_classes = num_classes                    # number of classes, 21
        self.threshold = overlap_thresh                   # IoU matching threshold, 0.5
        self.background_label = bkg_label                 # background label, 0
        self.encode_target = encode_target                # False, unused
        self.use_prior_for_matching = prior_for_matching  # True, unused
        self.do_neg_mining = neg_mining                   # True, enables hard negative mining
        self.negpos_ratio = neg_pos                       # 3, the negative:positive ratio
        self.neg_overlap = neg_overlap                    # 0.5, overlap threshold for negatives
        self.variance = cfg['variance']

    def forward(self, predictions, targets):
        """Multibox Loss
        Args:
            predictions (tuple): A tuple containing loc preds, conf preds,
            and prior boxes from SSD net.
                conf shape: torch.size(batch_size,num_priors,num_classes)
                loc shape: torch.size(batch_size,num_priors,4)
                priors shape: torch.size(num_priors,4)

            targets (tensor): Ground truth boxes and labels for a batch,
                shape: [batch_size,num_objs,5] (last idx is the label).
        """
        loc_data, conf_data, priors = predictions
        # loc_data: [batch_size, 8732, 4]
        # conf_data: [batch_size, 8732, 21]
        # priors: [8732, 4]; the default boxes are identical for every image, so they carry no batch dimension
        num = loc_data.size(0)                 # num = batch_size
        priors = priors[:loc_data.size(1), :]  # loc_data.size(1) = 8732, so priors stays unchanged
        num_priors = (priors.size(0))          # num_priors = 8732
        num_classes = self.num_classes         # num_classes = 21 (VOC dataset by default)

        # match priors (default boxes) with the ground truth boxes
        loc_t = torch.Tensor(num, num_priors, 4)    # shape: [batch_size, 8732, 4]
        conf_t = torch.LongTensor(num, num_priors)  # shape: [batch_size, 8732]
        for idx in range(num):
            # targets is a list of length batch_size; each element is a tensor of shape
            # [num_objs, 5], where num_objs is the number of objects in that image; the
            # first 4 columns are box coordinates, the last is the class id (1~20)
            truths = targets[idx][:, :-1].data  # [num_objs, 4]
            labels = targets[idx][:, -1].data   # [num_objs]; indexing with -1 rather than -1: drops a dimension
            defaults = priors.data              # [8732, 4]
            # from ..box_utils import match
            # key step: match the default boxes (not the predicted boxes!) against the
            # ground-truth boxes; this fairly involved function is dissected in detail below
            match(self.threshold, truths, defaults, self.variance, labels,
                  loc_t, conf_t, idx)
        if self.use_gpu:
            loc_t = loc_t.cuda()
            conf_t = conf_t.cuda()
        # wrap targets
        # wrapping in Variable is unnecessary in newer PyTorch versions
        loc_t = Variable(loc_t, requires_grad=False)
        conf_t = Variable(conf_t, requires_grad=False)

        pos = conf_t > 0                        # mask of the positive (matched) boxes; most entries are 0
        num_pos = pos.sum(dim=1, keepdim=True)  # number of positives per image, [batch_size, 1]

        # Localization Loss (Smooth L1)
        # Shape: [batch, num_priors, 4]
        # loc_data: [batch, num_priors, 4]
        # pos: [batch, num_priors]
        # pos_idx: [batch, num_priors, 4], the mask expanded to coordinate shape
        pos_idx = pos.unsqueeze(pos.dim()).expand_as(loc_data)
        loc_p = loc_data[pos_idx].view(-1, 4)  # predicted offsets of the positives
        loc_t = loc_t[pos_idx].view(-1, 4)     # their encoded ground-truth offsets
        loss_l = F.smooth_l1_loss(loc_p, loc_t, size_average=False)

        # Compute max conf across batch for hard negative mining
        # conf_data: [batch, num_priors, num_classes]
        # batch_conf: [batch*num_priors, num_classes]
        batch_conf = conf_data.view(-1, self.num_classes)

        # conf_t: [batch, num_priors]
        # loss_c: [batch*num_priors, 1], the confidence loss of each prior box
        loss_c = log_sum_exp(batch_conf) - batch_conf.gather(1, conf_t.view(-1, 1))
        
        # Hard Negative Mining: sort the negatives by loss and keep only the hardest
        # first reshape loss_c from [batch*num_priors, 1] to [batch, num_priors];
        # this must happen before zeroing the positives, since indexing the
        # [batch*num_priors, 1] tensor with the [batch, num_priors] mask would fail
        loss_c = loss_c.view(num, -1)
        loss_c[pos] = 0  # filter out pos boxes for now (pos marks the positive samples)
        # sort in descending order and get the sorting indices
        _, loss_idx = loss_c.sort(1, descending=True)
        # sort those indices ascending to obtain the rank of each box
        _, idx_rank = loss_idx.sort(1)
        # num_pos: [batch, 1], number of matched objects per image
        num_pos = pos.long().sum(1, keepdim=True)
        # number of negatives: 3x the number of positives (clamped)
        num_neg = torch.clamp(self.negpos_ratio*num_pos, max=pos.size(1)-1)
        # mask of the selected negatives
        neg = idx_rank < num_neg.expand_as(idx_rank)
        
        # confidence loss over both the positives and the mined negatives
        # pos: [batch, num_priors]
        # pos_idx: [batch, num_priors, num_classes]
        pos_idx = pos.unsqueeze(2).expand_as(conf_data)

        # neg: [batch, num_priors]
        # neg_idx: [batch, num_priors, num_classes]
        neg_idx = neg.unsqueeze(2).expand_as(conf_data)
        # select the predictions indicated by pos_idx and neg_idx
        conf_p = conf_data[(pos_idx+neg_idx).gt(0)].view(-1, self.num_classes)
        # select the corresponding targets
        targets_weighted = conf_t[(pos+neg).gt(0)]
        # cross-entropy between the two
        loss_c = F.cross_entropy(conf_p, targets_weighted, size_average=False)

        # Sum of losses: L(x,c,l,g) = (Lconf(x, c) + αLloc(x,l,g)) / N
        # normalize by the number of matched priors before returning
        N = num_pos.data.sum()
        loss_l /= N
        loss_c /= N
        return loss_l, loss_c

Dissecting the match Function


def match(threshold, truths, priors, variances, labels, loc_t, conf_t, idx):
    # threshold: (float) IoU threshold for deciding a match
    # truths: (tensor: [num_obj, 4]) coordinates of the ground-truth boxes
    # priors: (tensor: [num_priors, 4], i.e. [8732, 4]) coordinates of the default
    #         boxes; note these are NOT the boxes predicted by the SSD network; the
    #         predictions live in loc_data, of shape [batch_size, 8732, 4]
    # variances: cfg['variance'], [0.1, 0.2]; used to transform the coordinates into
    #         a form easier to train on (cf. the box encoding of the R-CNN family)
    # labels: (tensor: [num_obj]) class id of each ground-truth box
    # loc_t: (tensor: [batches, 8732, 4])
    # conf_t: (tensor: [batches, 8732])
    # idx: index of the image currently being processed within the batch

    # jaccard index
    # [A, B]: IoU between every pair of boxes; overlaps[i][j] is the IoU between the
    # i-th box of box_a (the ground truths) and the j-th box of box_b (the priors)
    overlaps = jaccard(
        truths,
        point_form(priors)
    )

    # bipartite matching
    # [num_objs, 1]: for each gt box, its best-matching prior box; the first tensor
    # stores the IoU, the second the prior's index among the num_priors
    best_prior_overlap, best_prior_idx = overlaps.max(1, keepdim=True)  # keepdim=True, hence shape [num_objs, 1]
    # [1, num_priors], i.e. [1, 8732]: likewise, the best-matching gt box for each prior
    best_truth_overlap, best_truth_idx = overlaps.max(0, keepdim=True)
    best_truth_idx.squeeze_(0)      # the dimensions deliberately kept above (keepdim=True)
    best_truth_overlap.squeeze_(0)  # are all squeezed away again here; the default
    best_prior_idx.squeeze_(1)      # keepdim=False would have reduced them automatically
    best_prior_overlap.squeeze_(1)

    best_truth_overlap.index_fill_(0, best_prior_idx, 2)  # ensure best prior
    # after squeezing, best_truth_overlap has shape [num_priors] and best_prior_idx
    # has shape [num_objs]; this sets the IoU of each gt box's best prior to 2,
    # guaranteeing it is the maximum, so that no gt box is left without a matching prior.

    # Imagine an extreme case: every prior has IoU 1 with some gt box G, while each of
    # the other gt boxes has its own highest-IoU prior, necessarily with IoU below 1
    # (since those priors overlap G with IoU 1). Every prior would then match G. To
    # prevent this, the priors that are best for some gt box are force-matched to it,
    # i.e. best_truth_idx[best_prior_idx[j]] = j; see the for loop below.

    # TODO refactor: index  best_prior_idx with long tensor
    # ensure every gt matches with its prior of max overlap
    # NB: there are far fewer gt boxes than prior boxes, so one gt box will
    # typically be matched by several priors.
    for j in range(best_prior_idx.size(0)):
        best_truth_idx[best_prior_idx[j]] = j
        # best_prior_idx[j] is the index of the prior with the highest IoU against the
        # j-th gt box; setting that prior's best_truth_idx entry to j completes the
        # mutual match between this gt box and that prior.
        # The loop runs only num_obj times; the remaining matches keep the values
        # already in best_truth_idx.
        # The situation being handled: prior i has its highest IoU with gt box k,
        # i.e. best_truth_idx[i] = k,
        # but gt box k itself has its highest IoU with some other prior l,
        # i.e. best_prior_idx[k] = l,
        # while another gt box j has its highest IoU with prior i,
        # i.e. best_prior_idx[j] = i.
        # In that case best_truth_idx[i] = k is overwritten with best_truth_idx[i] = j,
        # pairing prior i with gt box j.
        # Rationale: prevent any gt box from ending up without a matched prior box.
    matches = truths[best_truth_idx]          # Shape: [num_priors, 4]
    # truths has shape [num_objs, 4], and best_truth_idx is an index list of length
    # 8732 with values in 0..num_objs-1, i.e. the gt box matched to each prior box;
    # the expression therefore returns a tensor of shape [num_priors, 4], i.e.
    # [8732, 4], holding the coordinates of the gt box matched to each prior.
    conf = labels[best_truth_idx] + 1         # similarly, the class id matched to each prior, shape [8732]
    conf[best_truth_overlap < threshold] = 0  # priors whose IoU with the gt is below the threshold are marked as background
    loc = encode(matches, priors, variances)  # encoded center coordinates and sizes
    loc_t[idx] = loc    # [num_priors, 4] encoded offsets to learn, for the idx-th image
    conf_t[idx] = conf  # [num_priors] top class label for each prior (> 0 means an object, 0 means background)
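
For completeness, the box_utils helpers that match relies on, as found in the reference repo's box_utils.py: point_form converts the priors from (cx, cy, w, h) to (xmin, ymin, xmax, ymax) so they can be intersected with the ground truth, and encode produces the variance-scaled regression targets.

import torch

def point_form(boxes):
    # (cx, cy, w, h) -> (xmin, ymin, xmax, ymax)
    return torch.cat((boxes[:, :2] - boxes[:, 2:]/2,
                      boxes[:, :2] + boxes[:, 2:]/2), 1)

def encode(matched, priors, variances):
    # offset of the matched gt center from the prior center, scaled by the prior size
    g_cxcy = (matched[:, :2] + matched[:, 2:])/2 - priors[:, :2]
    g_cxcy /= (variances[0] * priors[:, 2:])
    # log-ratio of the matched gt size to the prior size
    g_wh = (matched[:, 2:] - matched[:, :2]) / priors[:, 2:]
    g_wh = torch.log(g_wh) / variances[1]
    return torch.cat([g_cxcy, g_wh], 1)  # [num_priors, 4]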

References:
https://blog.csdn.net/happyday_d/article/details/86021993
https://hellozhaozheng.github.io/z_post/PyTorch-SSD
