Object Detection with SSD: Principles and Implementation

1. Design Principles


Like Yolo, SSD uses a single CNN network for detection, but it operates on multi-scale feature maps; its basic architecture is shown in the figure below. The core design principles of SSD can be summarized in the following three points:


1. Multi-scale feature maps for detection

By multi-scale we mean feature maps of different sizes. In a CNN, the earlier feature maps are relatively large, and later layers gradually shrink them with stride-2 convolutions or pooling. As shown in Figure 3, both a relatively large feature map and a relatively small one are used for detection. The benefit is that large feature maps can detect relatively small objects, while small feature maps are responsible for large objects: as Figure 4 shows, an 8x8 feature map can be divided into more cells, but the prior boxes of each of its cells have smaller scales.


2. Detection with convolutions

Unlike Yolo, which ends with fully connected layers, SSD applies convolutions directly to the different feature maps to extract detection results. For a feature map of shape m\times n \times p, relatively small 3\times 3 \times p convolution kernels suffice to produce the detection values.

3. Prior boxes

In Yolo, each cell predicts multiple bounding boxes, but they are all relative to the cell itself (a square), while real objects come in varied shapes, so Yolo must adapt to object shapes during training. SSD borrows the anchor concept from Faster R-CNN: each cell is given prior boxes with different scales and aspect ratios, and the predicted bounding boxes are regressed relative to these priors, which reduces training difficulty to some extent. In general, each cell has several prior boxes that differ in scale and aspect ratio. As shown in Figure 5, each cell uses 4 different prior boxes, and the cat and the dog in the image are trained with the priors that best fit their shapes; the prior box matching rules used during training are explained in detail later.


SSD's detection values also differ from Yolo's. For each prior box of each cell, SSD outputs an independent set of detection values corresponding to one bounding box, which splits into two parts. The first part is the confidence (score) of each class. Note that SSD treats the background as a special class: if there are c object categories, SSD actually predicts c+1 confidence values, the first being the score for containing no object, i.e. background. In what follows, when we say c class confidences, remember that they include this special background class, so there are really only c-1 real object classes. At prediction time, the class with the highest confidence is assigned to the bounding box; in particular, when the first confidence is highest, the bounding box contains no object. The second part is the bounding box location, consisting of 4 values (cx, cy, w, h) for the center coordinates, width, and height. The raw predictions, however, are really transformations of the bounding box relative to the prior box (the paper calls them offsets, but transformation feels more accurate; see R-CNN). Denoting the prior box position by d=(d^{cx}, d^{cy}, d^w, d^h) and the corresponding bounding box by b=(b^{cx}, b^{cy}, b^w, b^h), the predicted value l is the transformation of b relative to d:

l^{cx} = (b^{cx} - d^{cx})/d^{w}, \space l^{cy} = (b^{cy} - d^{cy})/d^{h}

l^{w} = \log(b^{w}/d^{w}), \space l^{h} = \log(b^{h}/d^{h})
Conventionally, we call this process encoding the bounding box; at prediction time the process is reversed, i.e. decoding, recovering the true bounding box position b from the prediction l:

b^{cx} = d^{w} l^{cx} + d^{cx}, \space b^{cy} = d^{h} l^{cy} + d^{cy}

b^{w} = d^{w}\exp(l^{w}), \space b^{h} = d^{h}\exp(l^{h})
However, the Caffe implementation of SSD adds a trick: a variance hyperparameter is used to rescale the detection values, controlled by the bool parameter variance_encoded_in_target. When it is True, the variance is considered included in the predicted values, which is the case above. When it is False (the most common setting; perhaps it makes training easier?), the variance hyperparameters must be set manually to rescale the 4 values of l, and the bounding box is then decoded as:

b^{cx} = d^{w}(variance[0] \cdot l^{cx}) + d^{cx}, \space b^{cy} = d^{h}(variance[1] \cdot l^{cy}) + d^{cy}

b^{w} = d^{w}\exp(variance[2] \cdot l^{w}), \space b^{h} = d^{h}\exp(variance[3] \cdot l^{h})
In summary, for an m\times n feature map there are mn cells. With k prior boxes per cell, each cell needs (c+4)k prediction values and the whole feature map needs (c+4)kmn; since SSD detects with convolutions, (c+4)k convolution kernels complete the detection on this feature map.
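To make the encode/decode transform concrete, here is a minimal NumPy sketch (my own illustration, not code from the SSD sources; it assumes the (cx, cy, w, h) box layout and the common variance defaults):

import numpy as np

variance = [0.1, 0.1, 0.2, 0.2]  # assumed Caffe-style defaults

def encode(b, d):
    """Encode box b against prior d (the variance_encoded_in_target=False mode)."""
    return np.array([(b[0] - d[0]) / d[2] / variance[0],
                     (b[1] - d[1]) / d[3] / variance[1],
                     np.log(b[2] / d[2]) / variance[2],
                     np.log(b[3] / d[3]) / variance[3]])

def decode(l, d):
    """Invert encode(): recover box b from prediction l."""
    return np.array([d[2] * variance[0] * l[0] + d[0],
                     d[3] * variance[1] * l[1] + d[1],
                     d[2] * np.exp(variance[2] * l[2]),
                     d[3] * np.exp(variance[3] * l[3])])

d = np.array([0.5, 0.5, 0.2, 0.2])     # a prior box (cx, cy, w, h)
b = np.array([0.55, 0.48, 0.25, 0.3])  # a ground-truth box
assert np.allclose(decode(encode(b, d), d), b)  # the round trip recovers b
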
2. Network Architecture

SSD uses VGG16 as the base model and adds convolution layers on top of it to obtain more feature maps for detection. The network structure of SSD is shown in Figure 5: the top is the SSD model, the bottom the Yolo model, and you can clearly see that SSD detects on multi-scale feature maps. The input image size is 300\times300 (it can also be 512\times512, whose network structure is identical except for one extra convolution layer at the end; we do not discuss it further).


VGG16 is used as the base network, pretrained on the ILSVRC CLS-LOC dataset. Then, following DeepLab-LargeFOV, the fully connected layers fc6 and fc7 of VGG16 are converted into a 3\times3 convolution layer conv6 and a 1\times1 convolution layer conv7, and the pooling layer pool5 is changed from the original 2\times2 with stride 2 to 3\times3 with stride 1 (presumably to avoid shrinking the feature map). To accommodate this change, an atrous algorithm is used, which simply means conv6 becomes a dilated (atrous) convolution: stacking dilated convolutions enlarges the receptive field exponentially without increasing the number of parameters or the model complexity, with a dilation rate parameter controlling the expansion. As shown in Figure 6, (a) is an ordinary 3\times3 convolution with a 3\times3 receptive field; (b) with dilation rate 2 the receptive field grows to 7\times7; and (c) with dilation rate 4 it expands to 15\times15, although the sampled features become sparser. Conv6 uses a 3\times3 dilated convolution with dilation rate = 6.
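As a small illustration, conv6 can be written with the same TF1 API used in the implementation below (a sketch; the 19x19x512 input shape is assumed to stand in for the pool5 output of SSD300):

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 19, 19, 512])  # stand-in for the pool5 output
conv6 = tf.layers.conv2d(x, filters=1024, kernel_size=3, dilation_rate=6,
                         padding='same', activation=tf.nn.relu, name='conv6')
print(conv6.shape)  # (?, 19, 19, 1024): the spatial size is unchanged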


Then the dropout layers and the fc8 layer are removed, a series of new convolution layers are added, and the model is fine-tuned on the detection dataset.

The Conv4_3 layer of VGG16 serves as the first feature map for detection. Its feature map size is 38\times38, but since this layer sits fairly early in the network, its norm is relatively large, so an L2 Normalization layer is added after it (see ParseNet) to keep its scale from differing too much from the later detection layers. This is not the same as Batch Normalization: it only normalizes each pixel over the channel dimension, whereas Batch Normalization normalizes over the [batch_size, width, height] dimensions. After normalization, a trainable scaling variable gamma is usually applied; in TensorFlow this can be implemented simply as:

def l2norm(x, scale, trainable=True, scope="L2Normalization"):
    n_channels = x.get_shape().as_list()[-1]  # number of channels
    # normalize each pixel over the channel axis only
    l2_norm = tf.nn.l2_normalize(x, [3], epsilon=1e-12)
    with tf.variable_scope(scope):
        # trainable per-channel scale gamma, initialized to `scale` (SSD uses 20)
        gamma = tf.get_variable("gamma", shape=[n_channels, ], dtype=tf.float32,
                                initializer=tf.constant_initializer(scale),
                                trainable=trainable)
        return l2_norm * gamma
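For example (a hypothetical usage of the function above; the placeholder merely stands in for the real Conv4_3 output), SSD applies this with an initial scale of 20:

conv4_3 = tf.placeholder(tf.float32, [None, 38, 38, 512])  # stand-in for Conv4_3
conv4_3_norm = l2norm(conv4_3, scale=20.)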

From the newly added convolution layers, Conv7, Conv8_2, Conv9_2, Conv10_2, and Conv11_2 are extracted as detection feature maps; together with Conv4_3, six feature maps are used, with sizes (38, 38), (19, 19), (10, 10), (5, 5), (3, 3), (1, 1). Different feature maps are given different numbers of prior boxes (within one feature map every cell has the same priors; the number refers to priors per cell). Setting up the prior boxes involves two aspects: scale (i.e. size) and aspect ratio. The prior box scales obey a linear rule: as the feature map size decreases, the prior box scale increases linearly:

s_k = s_{min} + \frac{s_{max} - s_{min}}{m-1}(k-1), \space k\in[1,m]
Here m is the number of feature maps, but it is 5 rather than 6 because the first layer (Conv4_3) is set separately; s_k is the ratio of the prior box size to the image size, and s_{min} and s_{max} are the minimum and maximum ratios, 0.2 and 0.9 in the paper. For the first feature map, the scale ratio is usually set to s_{min}/2=0.1, giving a size of 300\times 0.1=30. For the later feature maps, the prior box scales grow linearly by the formula above, but the ratios are first multiplied by 100, so the step becomes \lfloor \frac{\lfloor s_{max}\times 100\rfloor - \lfloor s_{min}\times 100\rfloor}{m-1}\rfloor=17, making the s_k of the feature maps 20, 37, 54, 71, 88; dividing these by 100 and multiplying by the image size yields the sizes 60, 111, 162, 213, 264. This computation follows SSD's Caffe source code. Altogether, the prior box sizes of the feature maps are 30, 60, 111, 162, 213, 264. For aspect ratios, a_r\in \{1,2,3,\frac{1}{2},\frac{1}{3}\} is commonly chosen, and for a given aspect ratio the prior box width and height are computed as follows (from here on s_k refers to the actual prior box size, not the scale ratio):

w^a_{k}=s_k\sqrt{a_r},\space h^a_{k}=s_k/\sqrt{a_r}

By default, every feature map has one prior box with a_r=1 and scale s_k, plus another prior box with a_r=1 and scale s'_{k}=\sqrt{s_k s_{k+1}}, so each feature map gets two square priors of the same aspect ratio but different sizes. Note that the last feature map uses a virtual s_{m+1}=300\times105/100=315 to compute s'_{m}. Each feature map therefore has 6 prior boxes \{1,2,3,\frac{1}{2},\frac{1}{3},1'\}, except that in the implementation the Conv4_3, Conv10_2, and Conv11_2 layers use only 4, dropping the aspect ratios 3 and \frac{1}{3}. The prior box centers sit at the centers of the cells, i.e. (\frac{i+0.5}{|f_k|},\frac{j+0.5}{|f_k|}),i,j\in[0, |f_k|), where |f_k| is the feature map size.
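The whole scale computation can be reproduced in a few lines (a sketch of the Caffe-style rule described above, using the paper's s_min=0.2 and s_max=0.9):

import math

img_size, s_min, s_max, m = 300, 0.2, 0.9, 5  # m excludes Conv4_3
step = int(math.floor((s_max * 100 - s_min * 100) / (m - 1)))  # 17
ratios = [int(s_min * 100) + step * k for k in range(m)]       # [20, 37, 54, 71, 88]
sizes = [img_size * s_min / 2] + [img_size * r / 100. for r in ratios]
print(sizes)  # [30.0, 60.0, 111.0, 162.0, 213.0, 264.0]

# width/height of a prior with aspect ratio a_r at size s_k:
s_k, a_r = sizes[1], 2.0
w, h = s_k * math.sqrt(a_r), s_k / math.sqrt(a_r)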

Once the feature maps are obtained, the detection results are computed by convolving them; Figure 7 shows the detection process on a 5\times5 feature map. The Priorbox step generates the prior boxes following the rules described above. The detection values have two parts, class confidences and bounding box locations, each produced by one 3\times3 convolution. With n_k denoting the number of prior boxes used on the feature map, the class confidences need n_k\times c convolution kernels and the box locations need n_k\times 4. Since every prior box predicts one bounding box, SSD300 predicts 38\times38\times4+19\times19\times6+10\times10\times6+5\times5\times6+3\times3\times4+1\times1\times4=8732 bounding boxes in total, a rather large number, which is why SSD is essentially dense sampling.
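A quick sanity check of the 8732 figure:

feat_sizes = [38, 19, 10, 5, 3, 1]  # detection feature map sizes
n_priors = [4, 6, 6, 6, 4, 4]       # prior boxes per cell on each map
print(sum(f * f * n for f, n in zip(feat_sizes, n_priors)))  # 8732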


3. Training

3.1 Prior box matching

During training, we first need to determine which prior box each ground truth (real object) in a training image matches; the bounding box corresponding to the matched prior is then responsible for predicting it. In Yolo, the bounding box with the largest IOU inside the cell containing the ground truth's center is responsible for it. SSD works quite differently, with two main matching rules. First, for every ground truth in the image, the prior box with the largest IOU is matched to it, which guarantees that each ground truth is matched to some prior box. A prior box matched to a ground truth is called a positive sample (strictly it is the predicted box corresponding to the prior, but since they are one-to-one we use the terms interchangeably); a prior box matched to no ground truth matches the background and is a negative sample. An image contains very few ground truths but a great many prior boxes, so with the first rule alone almost all priors would be negatives and the positives and negatives would be extremely imbalanced, hence the second rule: for the remaining unmatched prior boxes, if the \text{IOU} with some ground truth exceeds a threshold (usually 0.5), that prior is also matched to this ground truth. This means one ground truth may match several prior boxes, which is fine; the reverse is not allowed, because a prior box can match only one ground truth. If a prior box has \text{IOU} above the threshold with several ground truths, it matches the ground truth with the largest IOU. The second rule must be applied after the first. Consider carefully the case where a ground truth's largest \text{IOU} is below the threshold while its matched prior has \text{IOU} above the threshold with another ground truth: which should the prior match? The answer is the former, because we must first guarantee that every ground truth has some matched prior. In practice, though, I think this case essentially never occurs: with so many prior boxes, a ground truth's largest \text{IOU} is almost certainly above the threshold, so applying only the second rule may suffice; the TensorFlow implementation referenced here applies only the second rule, while the Pytorch one applies both. Figure 8 illustrates a matching, where the green GT is the ground truth, the red boxes are priors, FP marks negatives and TP marks positives; a simplified sketch of the two rules follows.
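The two rules can be sketched in NumPy as follows (my simplification, not the referenced implementations; it assumes a precomputed IOU matrix of shape [n_priors, n_gt]):

import numpy as np

def match_priors(iou, threshold=0.5):
    n_priors, n_gt = iou.shape
    matches = np.full(n_priors, -1)  # -1 = matched to background (negative)
    # rule 1: every ground truth claims the prior with the highest IOU
    best_prior_per_gt = iou.argmax(axis=0)
    matches[best_prior_per_gt] = np.arange(n_gt)
    # rule 2: each remaining prior takes its best ground truth if the IOU > threshold
    best_gt_per_prior = iou.argmax(axis=1)
    best_iou_per_prior = iou.max(axis=1)
    hit = (matches == -1) & (best_iou_per_prior > threshold)
    matches[hit] = best_gt_per_prior[hit]
    return matches  # per prior: index of the matched ground truth, or -1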


Although one ground truth can match multiple prior boxes, ground truths are still far fewer than priors, so negatives vastly outnumber positives. To keep positives and negatives reasonably balanced, SSD uses hard negative mining: the negatives are subsampled by sorting them in descending order of confidence loss (the smaller the predicted background confidence, the larger the loss) and keeping the top-k with the largest loss as training negatives, so that the negative-to-positive ratio stays close to 3:1; a sketch follows.
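A sketch of this selection step (my illustration, assuming a per-prior confidence loss array and a boolean mask of positives):

import numpy as np

def hard_negative_mining(conf_loss, positive_mask, neg_pos_ratio=3):
    num_neg = min(neg_pos_ratio * positive_mask.sum(), (~positive_mask).sum())
    neg_loss = np.where(positive_mask, -np.inf, conf_loss)  # exclude positives
    neg_idx = np.argsort(-neg_loss)[:num_neg]               # hardest negatives first
    neg_mask = np.zeros_like(positive_mask)
    neg_mask[neg_idx] = True
    return neg_mask  # negatives kept for the confidence loss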

3.2 Loss function

With the training samples determined, the next ingredient is the loss function, defined as the weighted sum of the localization loss (loc) and the confidence loss (conf):

L(x, c, l, g) = \frac{1}{N}(L_{conf}(x,c) + \alpha L_{loc}(x,l,g))

Here N is the number of positive prior boxes and x^p_{ij}\in \{ 1,0 \} is an indicator: x^p_{ij}= 1 means the i-th prior box is matched to the j-th ground truth, whose class is p. c is the predicted class confidence, l is the predicted location of the box corresponding to the prior, and g is the location of the ground truth. The localization loss is the Smooth L1 loss, defined as:

L_{loc}(x,l,g)=\sum_{i\in Pos}^{N}\sum_{m\in\{cx,cy,w,h\}} x^k_{ij}\,\text{smooth}_{L1}(l^m_i-\hat{g}^m_j)

\text{smooth}_{L1}(x)=\begin{cases} 0.5x^2 & \text{if } |x|<1 \\ |x|-0.5 & \text{otherwise} \end{cases}

Because of x^p_{ij}, the localization loss is computed over positive samples only. Note that the ground truth g must first be encoded into \hat{g}, since the prediction l is also an encoded value; with variance_encoded_in_target=False, the encoding also divides by the variance:

\hat{g}^{cx}_j = (g^{cx}_j - d^{cx}_i)/d^w_i/variance[0], \hat{g}^{cy}_j = (g^{cy}_j - d^{cy}_i)/d^h_i/variance[1]

\hat{g}^{w}_j = \log(g^{w}_j/d^w_i)/variance[2], \space \hat{g}^{h}_j = \log(g^{h}_j/d^h_i)/variance[3]

The confidence loss is the softmax loss:

L_{conf}(x,c)=-\sum_{i\in Pos}^{N} x^p_{ij}\log(\hat{c}^p_i)-\sum_{i\in Neg}\log(\hat{c}^0_i), \space \text{where} \space \hat{c}^p_i=\frac{\exp(c^p_i)}{\sum_p \exp(c^p_i)}

The weight coefficient \alpha is set to 1 by cross validation.
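Putting the two terms together, a NumPy sketch of the loss (my illustration; labels holds the matched class index per prior, 0 for background):

import numpy as np

def smooth_l1(x):
    return np.where(np.abs(x) < 1, 0.5 * x ** 2, np.abs(x) - 0.5)

def softmax_ce(logits, labels):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = e / e.sum(axis=-1, keepdims=True)
    return -np.log(probs[np.arange(len(labels)), labels] + 1e-12)

def ssd_loss(loc_pred, loc_target, cls_logits, labels, pos_mask, neg_mask, alpha=1.0):
    n = max(pos_mask.sum(), 1)  # number of positive priors
    l_loc = smooth_l1(loc_pred[pos_mask] - loc_target[pos_mask]).sum()
    l_conf = softmax_ce(cls_logits, labels)[pos_mask | neg_mask].sum()
    return (l_conf + alpha * l_loc) / n
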
3.3 Data augmentation

Data augmentation improves SSD's performance. The main techniques are horizontal flip, random crop with color distortion, and randomly sampling a patch (to obtain training samples for small objects), as shown in the figure below:



3.4 Prediction process

The prediction process is fairly simple. For each predicted box, first determine its class (the one with the highest confidence) and its confidence value, and discard boxes classified as background. Then filter out boxes whose confidence is below a threshold (e.g. 0.5). The remaining boxes are decoded, recovering their true location parameters from the prior boxes (after decoding, the boxes generally also need to be clipped so they stay inside the image). After decoding, the boxes are usually sorted by confidence in descending order and only the top-k (e.g. 400) are kept. Finally, NMS removes boxes with large overlaps, and whatever remains is the detection result (see process_bboxes in utils.py below).

4. Implementation

main.py

import cv2
import numpy as np
import tensorflow as tf
import matplotlib.image as mpimg

from ssd300_vgg import SSD
from utils import preprocess_image, process_bboxes
from drawbox import plt_bboxes


def main():
    # [1] build the network --> decode the network output --> set up the image placeholder
    ssd_net = SSD()  # build the ssd300_vgg network
    classes, scores, bboxes = ssd_net.detections()  # apply the score threshold and decode the output into class, score (probability), and box location/size
    images = ssd_net.images()  # the image placeholder: a tf.placeholder

    # [2] load the SSD model weights
    sess = tf.Session()
    ckpt_filename = 'D:/Python/SSD/model/ssd_vgg_300_weights.ckpt'
    sess.run(tf.global_variables_initializer())
    saver = tf.train.Saver()
    saver.restore(sess, ckpt_filename)

    # [3] preprocess the image --> post-process the predicted bboxes
    img = cv2.imread('D:/Python/SSD/SSD_data/car.jpg')
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    # Preprocessing:
    # 1. whiten (subtract the per-channel means);
    # 2. resize to 300x300;
    # 3. add the batch dimension.
    img_preprocessed = preprocess_image(img)
    # feed the preprocessed image to the image placeholder
    rclasses, rscores, rbboxes = sess.run([classes, scores, bboxes], feed_dict={images: img_preprocessed})
    # Post-process the predicted bounding boxes:
    # 1. clip boxes to the image region;
    # 2. sort boxes by class score in descending order and keep only top_k=400;
    # 3. compute IOU --> NMS;
    # 4. resize the boxes relative to the reference box.
    rclasses, rscores, rbboxes = process_bboxes(rclasses, rscores, rbboxes)

    # [4] visualize the final detection results
    plt_bboxes(img, rclasses, rscores, rbboxes)
    print('SSD detection done!')


if __name__ == '__main__':
    main()

drawbox.py

import cv2
import random

import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import matplotlib.cm as mpcm


# PASCAL VOC class names
CLASSES = ["aeroplane", "bicycle", "bird", "boat", "bottle",
                        "bus", "car", "cat", "chair", "cow", "diningtable",
                        "dog", "horse", "motorbike", "person", "pottedplant",
                        "sheep", "sofa", "train","tvmonitor"]
# =========================================================================== #
# Colormaps
# =========================================================================== #
def colors_subselect(colors, num_classes=21):
    dt = len(colors) // num_classes
    sub_colors = []
    for i in range(num_classes):
        color = colors[i*dt]
        if isinstance(color[0], float):
            sub_colors.append([int(c * 255) for c in color])
        else:
            sub_colors.append([c for c in color])
    return sub_colors

colors_plasma = colors_subselect(mpcm.plasma.colors, num_classes=21)
colors_tableau = [(255, 255, 255), (31, 119, 180), (174, 199, 232), (255, 127, 14), (255, 187, 120),
                  (44, 160, 44), (152, 223, 138), (214, 39, 40), (255, 152, 150),
                  (148, 103, 189), (197, 176, 213), (140, 86, 75), (196, 156, 148),
                  (227, 119, 194), (247, 182, 210), (127, 127, 127), (199, 199, 199),
                  (188, 189, 34), (219, 219, 141), (23, 190, 207), (158, 218, 229)]


# =========================================================================== #
# OpenCV drawing
# =========================================================================== #
def draw_lines(img, lines, color=(255, 0, 0), thickness=2):
    """Draw a collection of lines on an image.
    """
    for line in lines:
        for x1, y1, x2, y2 in line:
            cv2.line(img, (x1, y1), (x2, y2), color, thickness)


def draw_rectangle(img, p1, p2, color=(255, 0, 0), thickness=2):
    cv2.rectangle(img, p1[::-1], p2[::-1], color, thickness)


def draw_bbox(img, bbox, shape, label, color=(255, 0, 0), thickness=2):
    p1 = (int(bbox[0] * shape[0]), int(bbox[1] * shape[1]))
    p2 = (int(bbox[2] * shape[0]), int(bbox[3] * shape[1]))
    cv2.rectangle(img, p1[::-1], p2[::-1], color, thickness)
    p1 = (p1[0]+15, p1[1])
    cv2.putText(img, str(label), p1[::-1], cv2.FONT_HERSHEY_DUPLEX, 0.5, color, 1)


def bboxes_draw_on_img(img, classes, scores, bboxes, colors, thickness=2):
    shape = img.shape
    for i in range(bboxes.shape[0]):
        bbox = bboxes[i]
        color = colors[classes[i]]
        # Draw bounding box...
        p1 = (int(bbox[0] * shape[0]), int(bbox[1] * shape[1]))
        p2 = (int(bbox[2] * shape[0]), int(bbox[3] * shape[1]))
        cv2.rectangle(img, p1[::-1], p2[::-1], color, thickness)
        # Draw text...
        s = '%s/%.3f' % (classes[i], scores[i])
        p1 = (p1[0]-5, p1[1])
        cv2.putText(img, s, p1[::-1], cv2.FONT_HERSHEY_DUPLEX, 0.4, color, 1)


# =========================================================================== #
# Matplotlib visualization
# =========================================================================== #
def plt_bboxes(img, classes, scores, bboxes, figsize=(10,10), linewidth=1.5, show_class_name=True):
    """Visualize bounding boxes. Largely inspired by SSD-MXNET!"""
    fig = plt.figure(figsize=figsize)
    plt.imshow(img)
    height = img.shape[0]
    width = img.shape[1]
    colors = dict()
    for i in range(classes.shape[0]):
        cls_id = int(classes[i])
        if cls_id >= 0:
            score = scores[i]
            if cls_id not in colors:
                colors[cls_id] = (random.random(), random.random(), random.random())
            ymin = int(bboxes[i, 0] * height)
            xmin = int(bboxes[i, 1] * width)
            ymax = int(bboxes[i, 2] * height)
            xmax = int(bboxes[i, 3] * width)
            rect = plt.Rectangle((xmin, ymin), xmax - xmin,
                                 ymax - ymin, fill=False,
                                 edgecolor=colors[cls_id],
                                 linewidth=linewidth)
            plt.gca().add_patch(rect)
            class_name = CLASSES[cls_id-1] if show_class_name else str(cls_id)
            plt.gca().text(xmin, ymin - 2,
                           '{:s} | {:.3f}'.format(class_name, score),
                           bbox=dict(facecolor=colors[cls_id], alpha=0.5),
                           fontsize=12, color='white')
    plt.savefig('./SSD_data/detection.jpg') # save the detection result image

ssd300_vgg.py

from collections import namedtuple
import numpy as np
import tensorflow as tf

from ssd_layers import conv2d,max_pool2d,l2norm,dropout,pad2d,ssd_multibox_layer
from ssd_anchors import ssd_anchors_all_layers

# SSD parameters
SSDParams = namedtuple('SSDParameters', ['img_shape',  # input image size: 300x300
                                         'num_classes',  # number of classes: 20+1 (20 classes + 1 background)
                                         'no_annotation_label',
                                         'feature_layers', # names of the feature maps used as detection layers
                                         'feature_shapes', # sizes of the detection-layer feature maps
                                         'anchor_size_bounds', # the lower and upper bounds of anchor sizes
                                         'anchor_sizes',   # list of anchor sizes per detection layer
                                         'anchor_ratios',  # list of anchor aspect ratios per detection layer
                                         'anchor_steps',   # list of cell sizes (in pixels) per detection layer
                                         'anchor_offset',  # offset of each anchor center from its cell's top-left corner
                                         'normalizations', # list of normalizations of layers for detection
                                         'prior_scaling'   # variances used to rescale the encoded offsets
                                         ])

class SSD(object):
    # constructor
    def __init__(self, is_training=True):
        self.is_training = is_training
        self.threshold = 0.5 # class score threshold
        self.ssd_params = SSDParams(img_shape=(300,300),
                                    num_classes=21,
                                    no_annotation_label=21,
                                    feature_layers=['block4','block7','block8','block9','block10','block11'],
                                    feature_shapes=[(38,38),(19,19),(10,10),(5,5),(3,3),(1,1)],
                                    anchor_size_bounds=[0.15, 0.90],  # differs from the original paper
                                    anchor_sizes=[(21.,45.),(45.,99.),(99.,153.),
                                                  (153.,207.),(207.,261.),(261.,315.)],
                                    anchor_ratios=[[2, .5],[2, .5, 3, 1. / 3],[2, .5, 3, 1. / 3],
                                                   [2, .5, 3, 1. / 3],[2, .5],[2, .5]],
                                    anchor_steps=[8, 16, 32, 64, 100, 300],
                                    anchor_offset=0.5,
                                    normalizations=[20, -1, -1, -1, -1, -1],
                                    prior_scaling=[0.1, 0.1, 0.2, 0.2]
                                    )

        predictions, locations = self._built_net() # [1] build the SSD300 network (input image size 300)
        # self._update_feature_shapes_from_net(predictions)
        classes, scores, bboxes = self._bboxes_select(predictions, locations) # [2][3] decode the network output and select boxes
        self._classes = classes # classes
        self._scores = scores # scores (probabilities)
        self._bboxes = bboxes # predicted box locations and sizes

    ################################## [1] The SSD300 network (input image size 300) ################################################
    def _built_net(self):
        self.end_points = {}  # record the detection layer outputs
        # placeholder for the input image (fixed size)
        self._images = tf.placeholder(tf.float32,
                                      shape=[None,self.ssd_params.img_shape[0],self.ssd_params.img_shape[1],3])

        with tf.variable_scope('ssd_300_vgg'): # note: the scope "ssd_300_vgg" must not be changed, or the pretrained weights cannot be found
            # (1) the classic vgg layers
            # block 1
            net = conv2d(self._images, filters=64, kernel_size=3, scope='conv1_1')
            net = conv2d(net, 64, 3, scope='conv1_2')
            self.end_points['block1'] = net
            net = max_pool2d(net, pool_size=2, scope='pool1')
            # block 2
            net = conv2d(net, 128, 3, scope='conv2_1')
            net = conv2d(net, 128, 3, scope='conv2_2')
            self.end_points['block2'] = net
            net = max_pool2d(net, 2, scope='pool2')
            # block 3
            net = conv2d(net, 256, 3, scope="conv3_1")
            net = conv2d(net, 256, 3, scope="conv3_2")
            net = conv2d(net, 256, 3, scope="conv3_3")
            self.end_points["block3"] = net
            net = max_pool2d(net, 2, scope="pool3")
            # block 4
            net = conv2d(net, 512, 3, scope="conv4_1")
            net = conv2d(net, 512, 3, scope="conv4_2")
            net = conv2d(net, 512, 3, scope="conv4_3")
            self.end_points["block4"] = net
            net = max_pool2d(net, 2, scope="pool4")
            # block 5
            net = conv2d(net, 512, 3, scope="conv5_1")
            net = conv2d(net, 512, 3, scope="conv5_2")
            net = conv2d(net, 512, 3, scope="conv5_3")
            self.end_points["block5"] = net
            net = max_pool2d(net, pool_size=3, stride=1, scope="pool5")  # 3x3 pooling kernel with stride 1

            # (2) the extra SSD layers
            # block 6: dilated convolution (atrous conv with a dilation rate)
            net = conv2d(net, filters=1024, kernel_size=3, dilation_rate=6, scope='conv6')
            self.end_points['block6'] = net
            # net = dropout(net, is_training=self.is_training)
            # block 7
            net = conv2d(net, 1024, 1, scope='conv7')
            self.end_points['block7'] = net
            # block 8
            net = conv2d(net, 256, 1, scope='conv8_1x1')
            net = conv2d(pad2d(net,1), 512, 3, stride=2, scope='conv8_3x3', padding='valid')
            self.end_points['block8'] = net
            # block 9
            net = conv2d(net, 128, 1, scope="conv9_1x1")
            net = conv2d(pad2d(net, 1), 256, 3, stride=2, scope="conv9_3x3", padding="valid")
            self.end_points["block9"] = net
            # block 10
            net = conv2d(net, 128, 1, scope="conv10_1x1")
            net = conv2d(net, 256, 3, scope="conv10_3x3", padding="valid")
            self.end_points["block10"] = net
            # block 11
            net = conv2d(net, 128, 1, scope="conv11_1x1")
            net = conv2d(net, 256, 3, scope="conv11_3x3", padding="valid")
            self.end_points["block11"] = net

            # class and location predictions
            predictions = []
            locations = []
            for i, layer in enumerate(self.ssd_params.feature_layers):
                # layer iterates over feature_layers = ['block4','block7','block8','block9','block10','block11']
                cls, loc = ssd_multibox_layer(self.end_points[layer], self.ssd_params.num_classes,
                                              self.ssd_params.anchor_sizes[i],
                                              self.ssd_params.anchor_ratios[i],
                                              self.ssd_params.normalizations[i],
                                              scope=layer + '_box')
                predictions.append(tf.nn.softmax(cls))  # class scores through softmax
                locations.append(loc)  # encoded box locations (xywh)

            return predictions, locations

    # get the feature map shapes from the prediction layers
    def _update_feature_shapes_from_net(self, predictions):
        new_feature_shapes = []
        for l in predictions:
            new_feature_shapes.append(l.get_shape().as_list()[1:])
        self.ssd_params = self.ssd_params._replace(feature_shapes=new_feature_shapes)

    # get the SSD anchors
    def anchors(self):
        return ssd_anchors_all_layers(self.ssd_params.img_shape,
                                      self.ssd_params.feature_shapes,
                                      self.ssd_params.anchor_sizes,
                                      self.ssd_params.anchor_ratios,
                                      self.ssd_params.anchor_steps,
                                      self.ssd_params.anchor_offset,
                                      np.float32)
    ####################################################################################################################

    ################################### [2] Decode the network output into box locations and sizes ################################
    def _bboxes_decode_layer(self, feature_locations, anchor_bboxes, prior_scaling): # prior_scaling: the prior variances
        """
        Decode the feature location of one layer
        params:
          feature_locations: 5D Tensor, [batch_size, size, size, n_anchors, 4]
          anchor_bboxes: list of Tensors (y, x, h, w)
                         shape: [size,size,1], [size,size,1], [n_anchors], [n_anchors]
          prior_scaling: list of 4 floats
        """
        y_a, x_a, h_a, w_a = anchor_bboxes
        # decode: recover the real cx/cy/w/h from the anchors
        cx = feature_locations[:,:,:,:,0] * w_a * prior_scaling[0] + x_a
        cy = feature_locations[:,:,:,:,1] * h_a * prior_scaling[1] + y_a
        w = w_a * tf.exp(feature_locations[:,:,:,:,2] * prior_scaling[2])
        h = h_a * tf.exp(feature_locations[:,:,:,:,3] * prior_scaling[3])

        # cx/cy/w/h --> ymin/xmin/ymax/xmax
        bboxes = tf.stack([cy-h/2.0,cx-w/2.0,cy+h/2.0,cx+w/2.0], axis=-1)

        # shape: [batch_size, size, size, n_anchors, 4]
        return bboxes
    ####################################################################################################################

    ################################## [3] Drop decoded boxes whose score < threshold ##############################
    # select boxes from one feature layer (batch_size=1 only): keep boxes whose max class score > threshold
    def _bboxes_select_layer(self, feature_predictions, feature_locations, anchor_bboxes, prior_scaling):
        # number of bboxes = product of the output shape dims (excluding batch and classes)
        n_bboxes = np.product(feature_predictions.get_shape().as_list()[1:-1])

        # decode the box locations
        bboxes = self._bboxes_decode_layer(feature_locations, anchor_bboxes, prior_scaling)
        bboxes = tf.reshape(bboxes, [n_bboxes, 4]) # [number of bboxes, location and size of each bbox]
        predictions = tf.reshape(feature_predictions, [n_bboxes, self.ssd_params.num_classes]) # [number of bboxes, class scores of each bbox]
        # drop the background score
        sub_predictions = predictions[:, 1:]

        # take the largest class score
        classes = tf.argmax(sub_predictions, axis=1) + 1 # class labels: index of the max class score (+1 because background occupies index 0)
        scores = tf.reduce_max(sub_predictions, axis=1) # max class scores
        # select boxes whose max class score > threshold (only the second matching rule is used)
        filter_mask = scores > self.threshold # boolean vector: True keeps, False drops
        classes = tf.boolean_mask(classes, filter_mask)
        scores = tf.boolean_mask(scores, filter_mask)
        bboxes = tf.boolean_mask(bboxes, filter_mask)

        return classes,scores,bboxes

    # select all predicted boxes: loop the per-layer selection above over every feature layer
    def _bboxes_select(self,predictions,locations):
        anchor_bboxes_list = self.anchors()
        classes_list = []
        scores_list = []
        bboxes_list = []

        # select bboxes for each feature layer with the rule above
        for n in range(len(predictions)):
            anchor_bboxes = list(map(tf.convert_to_tensor,anchor_bboxes_list[n]))
            classes,scores,bboxes = self._bboxes_select_layer(predictions[n],locations[n],
                                                              anchor_bboxes,self.ssd_params.prior_scaling)
            classes_list.append(classes)
            scores_list.append(scores)
            bboxes_list.append(bboxes)
        # merge the boxes selected from all feature layers
        classes = tf.concat(classes_list, axis=0)
        scores = tf.concat(scores_list, axis=0)
        bboxes = tf.concat(bboxes_list, axis=0)

        return classes, scores, bboxes
    ####################################################################################################################

    def images(self):
        return self._images

    # detections: classes, scores (probabilities), and locations/sizes of the predicted boxes
    def detections(self):
        return self._classes, self._scores, self._bboxes

if __name__ == '__main__':
    ssd = SSD()
    sess = tf.Session()
    saver = tf.train.Saver()
    saver.restore(sess, 'D:/Python/SSD/model/ssd_vgg_300_weights.ckpt') # load the model
ssd_anchors.py

import math
import numpy as np

def ssd_size_bounds_to_values(size_bounds,
                              n_feat_layers,
                              img_shape=(300, 300)):
    """Compute the reference sizes of the anchor boxes from relative bounds.
    The absolute values are measured in pixels, based on the network
    default size (300 pixels).
    This function follows the computation performed in the original
    implementation of SSD in Caffe.
    Return:
      list of list containing the absolute sizes at each scale. For each scale,
      the ratios only apply to the first value.
    """
    assert img_shape[0] == img_shape[1]

    img_size = img_shape[0]
    min_ratio = int(size_bounds[0] * 100)
    max_ratio = int(size_bounds[1] * 100)
    step = int(math.floor((max_ratio - min_ratio) / (n_feat_layers - 2)))
    # Start with the following smallest sizes.
    sizes = [[img_size * size_bounds[0] / 2, img_size * size_bounds[0]]]
    for ratio in range(min_ratio, max_ratio + 1, step):
        sizes.append((img_size * ratio / 100.,
                      img_size * (ratio + step) / 100.))
    return sizes

def ssd_anchor_one_layer(img_shape,
                         feat_shape,
                         sizes,
                         ratios,
                         step,
                         offset=0.5,
                         dtype=np.float32):
    """Computer SSD default anchor boxes for one feature layer.
    Determine the relative position grid of the centers, and the relative
    width and height.
    Arguments:
      feat_shape: Feature shape, used for computing relative position grids;
      size: Absolute reference sizes;
      ratios: Ratios to use on these features;
      img_shape: Image shape, used for computing height, width relatively to the
        former;
      offset: Grid offset.
    Return:
      y, x, h, w: Relative x and y grids, and height and width.
    """
    # Compute the position grid: simple way.
    # y, x = np.mgrid[0:feat_shape[0], 0:feat_shape[1]]
    # y = (y.astype(dtype) + offset) / feat_shape[0]
    # x = (x.astype(dtype) + offset) / feat_shape[1]
    # Weird SSD-Caffe computation using steps values...
    y, x = np.mgrid[0:feat_shape[0], 0:feat_shape[1]]
    y = (y.astype(dtype) + offset) * step / img_shape[0]
    x = (x.astype(dtype) + offset) * step / img_shape[1]

    # Expand dims to support easy broadcasting.
    y = np.expand_dims(y, axis=-1)  # [size, size, 1]
    x = np.expand_dims(x, axis=-1)  # [size, size, 1]

    # Compute relative height and width.
    # Tries to follow the original implementation of SSD for the order.
    num_anchors = len(sizes) + len(ratios)
    h = np.zeros((num_anchors, ), dtype=dtype)  # [n_anchors]
    w = np.zeros((num_anchors, ), dtype=dtype)  # [n_anchors]
    # Add first anchor boxes with ratio=1.
    h[0] = sizes[0] / img_shape[0]
    w[0] = sizes[0] / img_shape[1]
    di = 1
    if len(sizes) > 1:
        h[1] = math.sqrt(sizes[0] * sizes[1]) / img_shape[0]
        w[1] = math.sqrt(sizes[0] * sizes[1]) / img_shape[1]
        di += 1
    for i, r in enumerate(ratios):
        h[i+di] = sizes[0] / img_shape[0] / math.sqrt(r)
        w[i+di] = sizes[0] / img_shape[1] * math.sqrt(r)
    return y, x, h, w


def ssd_anchors_all_layers(img_shape,
                           layers_shape,
                           anchor_sizes,
                           anchor_ratios,
                           anchor_steps,
                           offset=0.5,
                           dtype=np.float32):
    """Compute anchor boxes for all feature layers.
    """
    layers_anchors = []
    for i, s in enumerate(layers_shape):
        anchor_bboxes = ssd_anchor_one_layer(img_shape, s,
                                             anchor_sizes[i],
                                             anchor_ratios[i],
                                             anchor_steps[i],
                                             offset=offset, dtype=dtype)
        layers_anchors.append(anchor_bboxes)
    return layers_anchors

ssd_layers.py

import tensorflow as tf

# conv2d convolution layer (default stride = 1)
def conv2d(x,filters,kernel_size,stride=1,padding='same',
           dilation_rate=1,activation=tf.nn.relu,scope='conv2d'):
    kernel_sizes = [kernel_size] * 2 # --> [kernel_size, kernel_size]
    strides = [stride] * 2 # --> [stride, stride]
    dilation_rate = [dilation_rate] * 2 # dilation rate --> [dilation_rate, dilation_rate]
    return tf.layers.conv2d(inputs=x,filters=filters,kernel_size=kernel_sizes,
                            strides=strides,dilation_rate=dilation_rate,padding=padding,
                            name=scope,activation=activation)

# max_pool2d max pooling layer
def max_pool2d(x, pool_size, stride=None, scope='max_pool2d'):
    pool_sizes = [pool_size] * 2
    if stride is None:
        strides = [pool_size] * 2
    else:
        strides = [stride] * 2
    return tf.layers.max_pooling2d(inputs=x,pool_size=pool_sizes,strides=strides,name=scope,padding='same')

# pad2d zero padding: for conv2d layers with stride 1
def pad2d(x,pad):
    return tf.pad(x,paddings=[[0,0],[pad,pad],[pad,pad],[0,0]])

# dropout
def dropout(x,rate=0.5,is_training=True):
    return tf.layers.dropout(inputs=x,rate=rate,training=is_training)

# l2norm: Conv4_3 is the first feature map used for detection. Because it sits early
# in the network its norm is large, so an L2 Normalization layer follows it to keep
# its scale close to the later detection layers.
# Unlike Batch Normalization, it only normalizes each pixel over the channel dimension
# and then applies a trainable scaling variable gamma, whereas Batch Normalization
# normalizes over the [batch_size, width, height] dimensions.
def l2norm(x,scale,trainable=True,scope='L2Normalization'):
    n_channels = x.get_shape().as_list()[-1] # number of channels
    l2_norm = tf.nn.l2_normalize(x, dim=[3], epsilon=1e-12) # normalize each pixel over the channel axis only
    with tf.variable_scope(scope):
        gamma = tf.get_variable("gamma", shape=[n_channels, ], dtype=tf.float32,
                                initializer=tf.constant_initializer(scale),
                                trainable=trainable)
    return l2_norm * gamma

# multibox layer:
# convolve the detection feature maps (Conv4_3, Conv7, Conv8_2, Conv9_2, Conv10_2, Conv11_2)
# to obtain the class and location predictions of each detection layer.
def ssd_multibox_layer(x,num_classes,sizes,ratios,normalization=-1,scope='multibox'):
    pre_shape = x.get_shape().as_list()[1:-1] # drop the first and last dims of the shape
    pre_shape = [-1] + pre_shape
    with tf.variable_scope(scope):
        # l2 norm
        if normalization > 0:
            x = l2norm(x,normalization)

        # number of anchors
        n_anchors = len(sizes) + len(ratios)
        # location predictions
        loc_pred = conv2d(x, filters=n_anchors*4, kernel_size=3, activation=None, scope='conv_loc') # 4 location/size values per anchor
        loc_pred = tf.reshape(loc_pred, pre_shape + [n_anchors, 4]) # [..., n_anchors, location values per anchor]
        # class predictions
        cls_pred = conv2d(x, filters=n_anchors*num_classes, kernel_size=3, activation=None, scope='conv_cls')
        cls_pred = tf.reshape(cls_pred, pre_shape + [n_anchors, num_classes]) # [..., n_anchors, class scores per anchor]

        return cls_pred,loc_pred

utils.py

import cv2
import numpy as np

####################################### Image preprocessing ################################################################
# whiten the image
def whiten_image(image, means=(123., 117., 104.)):
    """Subtract the given per-channel means from the image."""
    if image.ndim != 3:
        raise ValueError('Input must be of size [height, width, C>0]')
    num_channels = image.shape[-1]
    if len(means) != num_channels:
        raise ValueError('len(means) must match the number of channels')

    mean = np.array(means, dtype=image.dtype)
    image = image - mean
    return image

def resize_image(image, size=(300, 300)):
    return cv2.resize(image, size)

# Summary: preprocess the image (preprocess_image)
def preprocess_image(image):
    image_cp = np.copy(image).astype(np.float32)
    # 1. whiten
    image_whitened = whiten_image(image_cp)
    # 2. resize
    image_resized = resize_image(image_whitened)
    # 3. add the batch_size dimension
    image_expanded = np.expand_dims(image_resized, axis=0) # [batch_size, height, width, channels]
    return image_expanded
##################################################################################################################

####################################### Bounding box post-processing ##############################################################
# (1) clip boxes: cut off the parts of the (normalized) boxes that fall outside the image region
def bboxes_clip(bbox_ref, bboxes):
    """Clip bounding boxes with respect to reference bbox."""
    bboxes = np.copy(bboxes)
    bboxes = np.transpose(bboxes)
    bbox_ref = np.transpose(bbox_ref)
    bboxes[0] = np.maximum(bboxes[0], bbox_ref[0]) # ymin
    bboxes[1] = np.maximum(bboxes[1], bbox_ref[1]) # xmin
    bboxes[2] = np.minimum(bboxes[2], bbox_ref[2]) # ymax
    bboxes[3] = np.minimum(bboxes[3], bbox_ref[3]) # xmax
    bboxes = np.transpose(bboxes)
    return bboxes

# (2) sort boxes by class score in descending order and keep only top_k=400
def bboxes_sort(classes, scores, bboxes, top_k=400):
    """Sort bounding boxes by decreasing order and keep only the top_k."""
    idxes = np.argsort(-scores)
    classes = classes[idxes][:top_k]
    scores = scores[idxes][:top_k]
    bboxes = bboxes[idxes][:top_k]
    return classes, scores, bboxes

# (3) IOU + NMS
# compute the IOU
def bboxes_iou(bboxes1, bboxes2):
    bboxes1 = np.transpose(bboxes1)
    bboxes2 = np.transpose(bboxes2)

    # intersection of the two boxes: the top-left corner is the max of the two, the bottom-right is the min
    int_ymin = np.maximum(bboxes1[0], bboxes2[0])
    int_xmin = np.maximum(bboxes1[1], bboxes2[1])
    int_ymax = np.minimum(bboxes1[2], bboxes2[2])
    int_xmax = np.minimum(bboxes1[3], bboxes2[3])

    # intersection width/height: 0 if the boxes do not overlap (the raw values would be negative, so clamp at 0)
    int_h = np.maximum(int_ymax - int_ymin, 0.)
    int_w = np.maximum(int_xmax - int_xmin, 0.)

    # compute the IOU
    int_vol = int_h * int_w # intersection area
    vol1 = (bboxes1[2] - bboxes1[0]) * (bboxes1[3] - bboxes1[1]) # area of bboxes1
    vol2 = (bboxes2[2] - bboxes2[0]) * (bboxes2[3] - bboxes2[1]) # area of bboxes2
    iou = int_vol / (vol1 + vol2 - int_vol) # IOU = intersection / union
    return iou

# NMS; alternatively use tf.image.non_max_suppression(boxes, scores, max_output_size, iou_threshold)
def bboxes_nms(classes, scores, bboxes, nms_threshold=0.5):
    """Apply non-maximum suppression to bounding boxes."""
    keep_bboxes = np.ones(scores.shape, dtype=bool)
    for i in range(scores.size-1):
        if keep_bboxes[i]:
            # Compute overlap with the bboxes that follow.
            overlap = bboxes_iou(bboxes[i], bboxes[(i+1):])
            # Overlap threshold for keeping + checking part of the same class
            keep_overlap = np.logical_or(overlap < nms_threshold, classes[(i+1):] != classes[i])
            keep_bboxes[(i+1):] = np.logical_and(keep_bboxes[(i+1):], keep_overlap)
    idxes = np.where(keep_bboxes)
    return classes[idxes], scores[idxes], bboxes[idxes]

# (4) resize the predicted boxes relative to the reference box
def bboxes_resize(bbox_ref, bboxes):
    """Resize bounding boxes based on a reference bounding box,
    assuming that the latter is [0, 0, 1, 1] after transform.
    """
    bboxes = np.copy(bboxes)
    # Translate.
    bboxes[:, 0] -= bbox_ref[0]
    bboxes[:, 1] -= bbox_ref[1]
    bboxes[:, 2] -= bbox_ref[0]
    bboxes[:, 3] -= bbox_ref[1]
    # Resize.
    resize = [bbox_ref[2] - bbox_ref[0], bbox_ref[3] - bbox_ref[1]]
    bboxes[:, 0] /= resize[0]
    bboxes[:, 1] /= resize[1]
    bboxes[:, 2] /= resize[0]
    bboxes[:, 3] /= resize[1]
    return bboxes

# Summary: post-process the predicted bounding boxes (process_bboxes)
def process_bboxes(rclasses, rscores, rbboxes, rbbox_img = (0.0, 0.0, 1.0, 1.0),
                   top_k=400, nms_threshold=0.5):
    # [1] clip boxes to the image region
    rbboxes = bboxes_clip(rbbox_img, rbboxes)
    # [2] sort boxes by class score in descending order and keep only top_k=400
    rclasses, rscores, rbboxes = bboxes_sort(rclasses, rscores, rbboxes, top_k)
    # [3] compute IOU --> NMS
    rclasses, rscores, rbboxes = bboxes_nms(rclasses, rscores, rbboxes, nms_threshold)
    # [4] resize the boxes relative to the reference box
    rbboxes = bboxes_resize(rbbox_img, rbboxes)

    return rclasses, rscores, rbboxes
###################################################

Experimental Results





That's all for this article.



