SSD: Single Shot MultiBox Detector

G5Lorenzo

Category: Paper Notes

Article link: https://blog.csdn.net/qq_36825778/article/details/103503146

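1. Main idea of SSD

SSD stands for Single Shot MultiBox Detector. "Single shot" indicates that SSD belongs to the one-stage family, and "MultiBox" indicates that it predicts from multiple prior boxes of different scales.
In SSD, the input image passes through successive convolutional layers that produce feature maps of different sizes. Earlier layers output larger feature maps and later layers output smaller ones, while the prior boxes on earlier feature maps have smaller scales and those on later feature maps have larger scales. In other words, from front to back the feature maps shrink and the prior boxes grow, so the small boxes on large feature maps detect small objects and the large boxes on small feature maps detect large objects.
In addition, SSD detects directly with convolutions: each of these feature maps is passed through a convolution that outputs the class confidences and the box locations.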

2. Model structure

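The SSD model consists of three parts:

Base feature extraction network: VGG-16 (only the first 5 convolutional blocks are kept)
SSD layers: the extra SSD Layers
Prediction layers: Classification and localization layer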

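3. Model characteristics

Both SSD and YOLO are one-stage algorithms: like YOLO, SSD obtains classification and localization results in a single pass through the model. Compared with YOLO, however, SSD has the following three characteristics:

Detection on multi-scale feature maps
Detection directly with convolutions
Prior boxes with different scales and aspect ratios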

3.1 Detection on multi-scale feature maps

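Earlier algorithms such as YOLO end with a single-scale feature map after the convolutional layers and detect on that feature map alone. SSD instead generates feature maps at several scales through its convolutional layers and runs detection on a subset of them.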
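Shallow feature maps have a small receptive field and deep feature maps have a large one, so the larger shallow feature maps are used to detect small objects and the smaller deep feature maps are used to detect large objects.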

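3.2 Detection directly with convolutions

YOLO detects after fully connected layers, whereas SSD feeds each of the multi-scale feature maps directly through a convolution to detect.
Detecting directly with convolutions means choosing the number of small convolution kernels (3x3 in the paper and in the code) according to what has to be predicted. The number of kernels determines the number of channels of the output feature map, so each group of output channels carries a specific meaning.
For example, suppose the feature map entering the prediction layer (Classification and Localization layer) has size m*n*p, the kernels are 3*3*p, there are 3 classes to predict, and every location of the feature map has 3 prior boxes.
For class prediction:
Set the number of kernels to 3*3 (3 prior boxes times 3 classes). The output feature map then has shape [m, n, 3*3]: each of the m*n locations gets a 9-dimensional vector, and every 3 components hold the class scores of one box.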
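For location prediction:
Set the number of kernels to 3*4 (3 prior boxes times 4 coordinates). The output feature map then has shape [m, n, 3*4]: each of the m*n locations gets a 12-dimensional vector, and every 4 components hold the four location values (offsets) of one box.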
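The channel bookkeeping in this example can be checked with a few lines of NumPy. This is only a sketch of the shapes: the arrays are zero-filled stand-ins for the convolution outputs, and the helper name `head_channels` and the 5x5 map size are made up for illustration.

```python
import numpy as np


def head_channels(num_anchors, num_classes):
    """Return (classification channels, localization channels) per location."""
    return num_anchors * num_classes, num_anchors * 4


m, n = 5, 5                      # hypothetical feature map height and width
num_anchors, num_classes = 3, 3  # values used in the example above
cls_ch, loc_ch = head_channels(num_anchors, num_classes)

# Zero-filled stand-ins for the two 3x3 convolution outputs.
cls_map = np.zeros((m, n, cls_ch))
loc_map = np.zeros((m, n, loc_ch))

# Split the channel axis into (box, value) so each box is addressable.
cls_per_box = cls_map.reshape(m, n, num_anchors, num_classes)
loc_per_box = loc_map.reshape(m, n, num_anchors, 4)

print(cls_ch, loc_ch)     # 9 12
print(cls_per_box.shape)  # (5, 5, 3, 3)
print(loc_per_box.shape)  # (5, 5, 3, 4)
```

Splitting the last axis this way is the same idea as the reshape in `ssd_multibox_layer` shown later in the post.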

3.3 Prior boxes with different scales and aspect ratios

In SSD, different feature maps are assigned different numbers of prior boxes, and the scales (the box sizes) and aspect ratios differ as well.

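3.3.1 Prior box scale (size):

$s_k=s_{min}+\frac{s_{max}-s_{min}}{m-1}(k-1),\quad k\in[1, m]$
where

m: the number of feature maps minus one (the scale of the first feature map is set separately),
$s_k$ : the size of the prior box as a fraction of the image size
$s_{min}$ : the minimum fraction, set to 0.2 in the paper
$s_{max}$ : the maximum fraction, set to 0.9 in the paper

For the first feature map, the scale fraction is usually set to $s_{min}/2=0.1$ , which gives a scale of 300*0.1=30.
For the remaining 5 feature maps, the scales are computed as follows: first multiply the fractions by 100, i.e. $s_{min}=20,s_{max}=90$ , then apply the formula above with the step rounded down to an integer, and finally divide the result by 100 to get the scale fraction: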

$k=1,\quad s_1=[20+\lfloor \frac{90-20}{5-1}\rfloor*(1-1)]/100=[20+0]/100=0.2$
$k=2,\quad s_2=[20+\lfloor \frac{90-20}{5-1}\rfloor*(2-1)]/100=[20+17]/100=0.37$
$k=3,\quad s_3=[20+\lfloor \frac{90-20}{5-1}\rfloor*(3-1)]/100=[20+34]/100=0.54$
$k=4,\quad s_4=[20+\lfloor \frac{90-20}{5-1}\rfloor*(4-1)]/100=[20+51]/100=0.71$
$k=5,\quad s_5=[20+\lfloor \frac{90-20}{5-1}\rfloor*(5-1)]/100=[20+68]/100=0.88$

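Multiplying these 5 scale fractions by the image size (300) gives the prior box scale on each feature map: 60, 111, 162, 213, 264.
Together with the scale 30 of the first feature map, the prior box scales on the 6 feature maps are 30, 60, 111, 162, 213, 264.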
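The arithmetic above is easy to reproduce. The sketch below follows the procedure as described in this section (percent values, floored step, first map handled separately); the function name `prior_box_sizes` is an illustrative choice, not taken from the paper or the repository.

```python
def prior_box_sizes(img_size=300, s_min=20, s_max=90, num_maps=6):
    """Prior box scales in pixels, one per feature map.

    s_min and s_max are percentages, as in the text. Integer arithmetic
    is used throughout to mirror the floored step.
    """
    m = num_maps - 1                   # first feature map is set separately
    step = (s_max - s_min) // (m - 1)  # floor(70 / 4) = 17
    percents = [s_min // 2]            # first map: s_min / 2 = 10 percent
    percents += [s_min + step * (k - 1) for k in range(1, m + 1)]
    return [img_size * p // 100 for p in percents]


print(prior_box_sizes())  # [30, 60, 111, 162, 213, 264]
```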

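3.3.2 Prior box width and height:

Given $s_k$ and $a_r$ , the formula below yields the width and height of each prior box on a feature map:
$w_k^a=s_k \sqrt {a_r}, \quad h_k^a=\frac{s_k}{\sqrt {a_r}}$
where $a_r$ is the aspect ratio (width over height) of the prior box, $a_r\in\{1,2,3,\frac{1}{2},\frac{1}{3}\}$

Each prior box scale $s_k$ is combined with 5 aspect ratios $a_r$ . For $a_r=1$ an additional scale $s'_k=\sqrt{s_ks_{k+1}}$ is used, which adds one more box. For the last feature map, a virtual scale $s_{m+1}=300*105/100=315$ is needed to compute $s'_{m}$
In other words, every location on a feature map has 6 boxes: 2 square boxes and 4 rectangular boxes.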

Worked example:
$s_k\in\{30,60,111,162,213,264\},\quad a_r\in\{1,2,3,\frac{1}{2},\frac{1}{3}\}$

$k = 1$ , $s_1=30$ :
$a_1=1$ : two square boxes are needed, and the second one uses $s'_1=\sqrt{s_1s_{2}}=\sqrt{30*60}\approx 42$
Side of the first square box: $w_1^a=s_1 \sqrt {a_1}=30*1=30, \quad h_1^a=\frac{s_1}{\sqrt {a_1}}=\frac{30}{1}=30$
Side of the second square box: $w_1'^a=s'_1 \sqrt {a_1}=42*1=42, \quad h_1'^a=\frac{s_1'}{\sqrt {a_1}}=\frac{42}{1}=42$
$a_2=2$ , rectangular box, width: $w_2^a=s_1 \sqrt {a_2}=30*\sqrt{2}=30\sqrt{2}$ height: $h_2^a=\frac{s_1}{\sqrt {a_2}}=\frac{30}{\sqrt{2}}$
$a_3=3$ , rectangular box, width: $w_3^a=s_1 \sqrt {a_3}=30*\sqrt{3}=30\sqrt{3}$ height: $h_3^a=\frac{s_1}{\sqrt {a_3}}=\frac{30}{\sqrt{3}}$
$a_4=\frac{1}{2}$ , rectangular box, width: $w_4^a=s_1 \sqrt {a_4}=30*\sqrt{\frac{1}{2}}=\frac{30}{\sqrt{2}}$ height: $h_4^a=\frac{s_1}{\sqrt {a_4}}=30\sqrt{2}$
$a_5=\frac{1}{3}$ , rectangular box, width: $w_5^a=s_1 \sqrt {a_5}=30*\sqrt{\frac{1}{3}}=\frac{30}{\sqrt{3}}$ height: $h_5^a=\frac{s_1}{\sqrt {a_5}}=30\sqrt{3}$
The remaining 5 feature maps are computed in the same way.
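The same example can be written as a short function. This is a sketch of the formulas in this section only; the name `prior_boxes` and the ordering of the returned boxes are illustrative choices.

```python
import math


def prior_boxes(s_k, s_next, ratios=(1, 2, 3, 1 / 2, 1 / 3)):
    """Return a list of (width, height) pairs for one feature map.

    s_k is the scale of this map and s_next the scale of the next one,
    which is only needed for the extra square box s'_k = sqrt(s_k * s_next).
    """
    boxes = []
    for a in ratios:
        boxes.append((s_k * math.sqrt(a), s_k / math.sqrt(a)))
        if a == 1:
            s_prime = math.sqrt(s_k * s_next)
            boxes.append((s_prime, s_prime))
    return boxes


boxes = prior_boxes(30, 60)
for w, h in boxes:
    print(round(w, 2), round(h, 2))
```

Note that every box built from $s_k$ has the same area $s_k^2$ whatever the aspect ratio, since $w \cdot h = s_k\sqrt{a_r}\cdot s_k/\sqrt{a_r}$ ; only the extra square box has a larger area.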

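That is the general procedure. In the paper and in the actual code, however, the Conv4_3, Conv10_2 and Conv11_2 layers (the first and the last two feature maps) do not use the aspect ratios $3,\frac{1}{3}$ , so every location on these three feature maps has only 4 prior boxes of different sizes.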
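With 4 boxes per location on three layers and 6 on the other three, the total number of prior boxes can be counted directly. The feature map side lengths below (38, 19, 10, 5, 3, 1) are not stated in this post; they are the values reported for SSD300 in the paper and are treated here as an assumption.

```python
# (layer name, feature map side, boxes per location).
# The side lengths are assumed from the SSD300 configuration in the paper.
LAYERS = [
    ("Conv4_3", 38, 4),
    ("Conv7", 19, 6),
    ("Conv8_2", 10, 6),
    ("Conv9_2", 5, 6),
    ("Conv10_2", 3, 4),
    ("Conv11_2", 1, 4),
]


def boxes_per_layer(layers):
    """Map each layer name to its number of prior boxes."""
    return {name: side * side * per_loc for name, side, per_loc in layers}


counts = boxes_per_layer(LAYERS)
total = sum(counts.values())
print(counts)
print(total)  # 8732
```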

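3.3.3 Prior box centers:

The steps above give the size of every prior box. What remains is to place the center of each prior box:
$(\frac{i+0.5}{|f_k|},\frac{j+0.5}{|f_k|}),\quad i,j\in[0, |f_k|)$
where $|f_k|$ is the size of the k-th feature map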
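A minimal NumPy sketch of this center formula, assuming a square feature map of side `f_k`; the function name `box_centers` and the (row, column, 2) output layout are illustrative choices.

```python
import numpy as np


def box_centers(f_k):
    """Relative (cx, cy) centers for a square feature map of side f_k.

    Returns an array of shape (f_k, f_k, 2) with values in (0, 1).
    """
    idx = (np.arange(f_k) + 0.5) / f_k
    cx, cy = np.meshgrid(idx, idx)  # cx varies along columns, cy along rows
    return np.stack([cx, cy], axis=-1)


centers = box_centers(3)
print(centers.shape)  # (3, 3, 2)
print(centers[1, 1])  # center of the middle cell: [0.5 0.5]
```

The centers are relative coordinates, so multiplying by the image size (300) gives pixel positions.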

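4. Loss function

The loss is defined as the weighted sum of the confidence loss (conf) and the localization loss (loc):
$L(x,c,l,g)=\frac{1}{N}(L_{conf}(x,c)+\alpha L_{loc}(x,l,g))$
where:

$N$ : the number of positive (matched) prior boxes
$c$ : the predicted class confidences
$l$ : the predicted location offsets of the boxes
$\alpha$ : the weight coefficient, set to 1 by cross-validation
$x_{ij}^p=1$ means that the i-th prior box is matched to the j-th ground truth of class p; otherwise $x_{ij}^p=0$ .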

Confidence loss (conf):
The confidence loss is the multi-class softmax loss:
$L_{conf}(x,c)=-\sum\limits_{i\in Pos}^N x_{ij}^p\log(\hat c_i^p)-\sum\limits_{i\in Neg}\log(\hat c_i^0)\quad \text{where} \quad \hat c_i^p=\frac{\exp(c_i^p)}{\sum_p\exp(c_i^p)}$
where:

p denotes the p-th class
0 denotes class 0, i.e. the background class
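A small NumPy sketch of this confidence loss. It assumes each box already carries its target class (the matched class p for a positive box, 0 for a negative box), so both sums collapse into one cross-entropy sum. The function names are illustrative, and the division by $N$ from the total loss is left out.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def confidence_loss(logits, labels):
    """Sum of -log(softmax probability of the target class) over all boxes.

    logits: (num_boxes, num_classes) raw class scores c_i^p.
    labels: (num_boxes,) target class per box, 0 meaning background.
    """
    probs = softmax(np.asarray(logits, dtype=float))
    picked = probs[np.arange(len(labels)), labels]
    return -np.log(picked).sum()


# Two boxes, three classes, all scores equal: every class has probability 1/3.
loss = confidence_loss(np.zeros((2, 3)), np.array([1, 0]))
print(loss)  # equals 2 * log(3)
```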

Localization loss (loc):
$L_{loc}(x,l,g)=\sum\limits_{i\in Pos}^N \sum\limits_{m\in\{cx,cy,w,h\}} x_{ij}^k\,\mathrm{smooth}_{L1}(l_i^m-\hat g_j^m)$
$\hat g_j^{cx}=(g_j^{cx}-d_i^{cx})/d_i^w\quad \quad \hat g_j^{cy}=(g_j^{cy}-d_i^{cy})/d_i^h$
$\hat g^w_j=\log(\frac{g_j^w}{d_i^w}) \quad \quad \hat g^h_j=\log(\frac{g_j^h}{d_i^h})$

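where:

$d$ is the default box and $g$ the ground truth box; $\hat g$ is the offset (transformation) from the default box to the ground truth
$l$ : the offset from the default box to the corrected position, i.e. the predicted offset
$\mathrm{smooth}_{L1}(x)=\begin{cases}0.5x^2, & \text{if } |x|<1 \\ |x|-0.5, & \text{otherwise} \end{cases}$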
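The encoding $\hat g$ and the smooth L1 function translate directly into NumPy. This sketch covers a single default box and ground truth pair in (cx, cy, w, h) form; the function names and the toy boxes are assumptions for illustration.

```python
import numpy as np


def encode(g, d):
    """Offsets g_hat from default box d to ground truth g, both (cx, cy, w, h)."""
    g_cx, g_cy, g_w, g_h = g
    d_cx, d_cy, d_w, d_h = d
    return np.array([
        (g_cx - d_cx) / d_w,
        (g_cy - d_cy) / d_h,
        np.log(g_w / d_w),
        np.log(g_h / d_h),
    ])


def smooth_l1(x):
    """Element-wise smooth L1: 0.5 * x^2 if |x| < 1, else |x| - 0.5."""
    x = np.asarray(x, dtype=float)
    absx = np.abs(x)
    return np.where(absx < 1, 0.5 * x ** 2, absx - 0.5)


d = (0.5, 0.5, 0.2, 0.2)  # default box
g = (0.6, 0.5, 0.4, 0.2)  # ground truth: shifted right and twice as wide
g_hat = encode(g, d)
pred = np.zeros(4)        # a prediction of "no correction"
loc_loss = smooth_l1(pred - g_hat).sum()
print(g_hat)
print(loc_loss)
```

A perfect prediction is `pred == g_hat`, which gives a localization loss of zero for that box.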

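5. Main code analysis

5.1 Training phase

5.1.1 Main function

Four main steps:

Build the network: produce the predicted class scores and location offsets (logits, i.e. before softmax, and localisations)
Compute the anchor boxes: from the size of each feature map and the prior box scales and aspect ratios
Encode: combine the anchor box positions with the ground truth class labels and boxes from the dataset
Build the loss and train iteratively: use the results of the previous steps, select the negative samples, then compute the loss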

#coding:utf-8

# Copyright 2016 Paul Balanca. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Generic training script that trains a SSD model using a given dataset."""
import tensorflow as tf
from tensorflow.python.ops import control_flow_ops

from datasets import dataset_factory
from deployment import model_deploy  # model deployment helpers
from nets import nets_factory
from preprocessing import preprocessing_factory
import tf_utils

slim = tf.contrib.slim

#DATA_FORMAT = 'NCHW'
DATA_FORMAT = 'NHWC'
# =========================================================================== #
# SSD Network flags.
# =========================================================================== #
tf.app.flags.DEFINE_float(
    'loss_alpha', 1., 'Alpha parameter in the loss function.')
tf.app.flags.DEFINE_float(
    'negative_ratio', 3., 'Negative ratio in the loss function.')  # negatives kept per positive sample
tf.app.flags.DEFINE_float(
    'match_threshold', 0.5, 'Matching threshold in the loss function.')

# =========================================================================== #
# General Flags.
# =========================================================================== #
tf.app.flags.DEFINE_string(
    'train_dir', 'logs',
    'Directory where checkpoints and event logs are written to.')
tf.app.flags.DEFINE_integer('num_clones', 1,
                            'Number of model clones to deploy.')  # 1 clone: a single worker, which is also the chief
tf.app.flags.DEFINE_boolean('clone_on_cpu', True,  # True places the clone on the CPU; set False to train on GPU
                            'Use CPUs to deploy clones.')
tf.app.flags.DEFINE_integer(
    'num_readers', 2,
    'The number of parallel readers that read data from the dataset.')  # 2 parallel readers
tf.app.flags.DEFINE_integer(
    'num_preprocessing_threads', 2,
    'The number of threads used to create the batches.')  # 2 threads build the batches

tf.app.flags.DEFINE_integer(
    'log_every_n_steps', 10,  # log every 10 training steps
    'The frequency with which logs are print.')
tf.app.flags.DEFINE_integer(
    'save_summaries_secs', 60,  # save summaries every 60 seconds
    'The frequency with which summaries are saved, in seconds.')
tf.app.flags.DEFINE_integer(
    'save_interval_secs', 60,  # save the model every 60 seconds
    'The frequency with which the model is saved, in seconds.')
tf.app.flags.DEFINE_float(
    'gpu_memory_fraction', 0.8, 'GPU memory fraction to use.')  # use at most 80% of GPU memory

# =========================================================================== #
# Optimization Flags.
# =========================================================================== #
tf.app.flags.DEFINE_float(
    'weight_decay', 0.00004, 'The weight decay on the model weights.')  # L2 regularization coefficient (lambda)
tf.app.flags.DEFINE_string(
    'optimizer', 'adam',
    'The name of the optimizer, one of "adadelta", "adagrad", "adam",'
    '"ftrl", "momentum", "sgd" or "rmsprop".')
tf.app.flags.DEFINE_float(
    'adadelta_rho', 0.95,
    'The decay rate for adadelta.')
tf.app.flags.DEFINE_float(
    'adagrad_initial_accumulator_value', 0.1,
    'Starting value for the AdaGrad accumulators.')
tf.app.flags.DEFINE_float(
    'adam_beta1', 0.9,
    'The exponential decay rate for the 1st moment estimates.')
tf.app.flags.DEFINE_float(
    'adam_beta2', 0.999,
    'The exponential decay rate for the 2nd moment estimates.')
tf.app.flags.DEFINE_float('opt_epsilon', 1.0, 'Epsilon term for the optimizer.')
tf.app.flags.DEFINE_float('ftrl_learning_rate_power', -0.5,
                          'The learning rate power.')
tf.app.flags.DEFINE_float(
    'ftrl_initial_accumulator_value', 0.1,
    'Starting value for the FTRL accumulators.')
tf.app.flags.DEFINE_float(
    'ftrl_l1', 0.0, 'The FTRL l1 regularization strength.')
tf.app.flags.DEFINE_float(
    'ftrl_l2', 0.0, 'The FTRL l2 regularization strength.')
tf.app.flags.DEFINE_float(
    'momentum', 0.9,
    'The momentum for the MomentumOptimizer and RMSPropOptimizer.')
tf.app.flags.DEFINE_float('rmsprop_momentum', 0.9, 'Momentum.')
tf.app.flags.DEFINE_float('rmsprop_decay', 0.9, 'Decay term for RMSProp.')

# =========================================================================== #
# Learning Rate Flags.
# =========================================================================== #
tf.app.flags.DEFINE_string(
    'learning_rate_decay_type',
    'exponential',
    'Specifies how the learning rate is decayed. One of "fixed", "exponential",'
    ' or "polynomial"')
tf.app.flags.DEFINE_float('learning_rate', 0.01, 'Initial learning rate.')
tf.app.flags.DEFINE_float(
    'end_learning_rate', 0.0001,
    'The minimal end learning rate used by a polynomial decay learning rate.')
tf.app.flags.DEFINE_float(
    'label_smoothing', 0.0, 'The amount of label smoothing.')
tf.app.flags.DEFINE_float(
    'learning_rate_decay_factor', 0.94, 'Learning rate decay factor.')
tf.app.flags.DEFINE_float(
    'num_epochs_per_decay', 2.0,
    'Number of epochs after which learning rate decays.')  # decay the learning rate every 2 epochs
tf.app.flags.DEFINE_float(
    'moving_average_decay', None,  # decay of the exponential moving average of the weights
    'The decay to use for the moving average.'
    'If left as None, then moving averages are not used.')

# =========================================================================== #
# Dataset Flags.
# =========================================================================== #
#dataset_name imagenet
tf.app.flags.DEFINE_string(
    'dataset_name', 'pascalvoc_2007', 'The name of the dataset to load.')
tf.app.flags.DEFINE_integer(
    'num_classes', 21, 'Number of classes to use in the dataset.')
tf.app.flags.DEFINE_string(
    'dataset_split_name', 'train', 'The name of the train/test split.')
tf.app.flags.DEFINE_string(
    'dataset_dir', 'tfrecords', 'The directory where the dataset files are stored.')
tf.app.flags.DEFINE_integer(
    'labels_offset', 0,  # label offset
    'An offset for the labels in the dataset. This flag is primarily used to '
    'evaluate the VGG and ResNet architectures which do not use a background '
    'class for the ImageNet dataset.')  # subtract 1 when no background class is used
tf.app.flags.DEFINE_string(
    'model_name', 'ssd_300_vgg', 'The name of the architecture to train.')
tf.app.flags.DEFINE_string(
    'preprocessing_name', None, 'The name of the preprocessing to use. If left '
    'as `None`, then the model_name flag is used.')  # None: preprocessing is chosen from model_name
tf.app.flags.DEFINE_integer(
    'batch_size', 4, 'The number of samples in each batch.')
tf.app.flags.DEFINE_integer(
    'train_image_size', None, 'Train image size')  # size of the training images
tf.app.flags.DEFINE_integer('max_number_of_steps', None,
                            'The maximum number of training steps.')  # upper bound on training steps, not epochs

# =========================================================================== #
# Fine-Tuning Flags.
# =========================================================================== #
tf.app.flags.DEFINE_string(
    'checkpoint_path', './checkpoints/ssd_300_vgg.ckpt',
    'The path to a checkpoint from which to fine-tune.')  # checkpoint to fine-tune from
tf.app.flags.DEFINE_string(
    'checkpoint_model_scope', None,
    'Model scope in the checkpoint. None if the same as the trained model.')
tf.app.flags.DEFINE_string(
    'checkpoint_exclude_scopes', None,
    'Comma-separated list of scopes of variables to exclude when restoring '
    'from a checkpoint.')
tf.app.flags.DEFINE_string(
    'trainable_scopes', None,
    'Comma-separated list of scopes to filter the set of variables to train.'
    'By default, None would train all the variables.')  # None: all variables are trained
tf.app.flags.DEFINE_boolean(
    'ignore_missing_vars', False,
    'When restoring a checkpoint would ignore missing variables.')  # whether to ignore variables missing from the checkpoint

FLAGS = tf.app.flags.FLAGS


# =========================================================================== #
# Main training routine.
# =========================================================================== #
def main(_):
    if not FLAGS.dataset_dir:
        raise ValueError('You must supply the dataset directory with --dataset_dir')

    tf.logging.set_verbosity(tf.logging.DEBUG)  # most verbose logging level
    with tf.Graph().as_default():  # build everything in a new graph
        # Config model_deploy. Keep TF Slim Models structure.
        # Useful if want to need multiple GPUs and/or servers in the future.
        # Deployment configuration: one replica, no parameter server tasks.
        deploy_config = model_deploy.DeploymentConfig(
            num_clones=FLAGS.num_clones,
            clone_on_cpu=FLAGS.clone_on_cpu,
            replica_id=0,
            num_replicas=1,
            num_ps_tasks=0)
        # Create global_step.
        with tf.device(deploy_config.variables_device()):
            # The global step lives on the device that holds the variables.
            global_step = slim.create_global_step() #tf.train.create_global_step

        # Select the dataset.
        dataset = dataset_factory.get_dataset(
            FLAGS.dataset_name, FLAGS.dataset_split_name, FLAGS.dataset_dir)

        # Get the SSD network and its anchors.
        ssd_class = nets_factory.get_network(FLAGS.model_name)  # SSD-300 model class
        ssd_params = ssd_class.default_params._replace(num_classes=FLAGS.num_classes)  # default parameters with num_classes overridden
        ssd_net = ssd_class(ssd_params)  # instantiate the SSD network with these parameters
        ssd_shape = ssd_net.params.img_shape  # 300*300
        ssd_anchors = ssd_net.anchors(ssd_shape)  # anchor boxes for every feature layer

        # Select the preprocessing function.
        preprocessing_name = FLAGS.preprocessing_name or FLAGS.model_name  # fall back to the model name
        image_preprocessing_fn = preprocessing_factory.get_preprocessing(  # takes image, labels, bboxes
            preprocessing_name, is_training=True)

        tf_utils.print_configuration(FLAGS.__flags, ssd_params,
                                     dataset.data_sources, FLAGS.train_dir)
        # =================================================================== #
        # Create a dataset provider and batches.
        # =================================================================== #
        with tf.device(deploy_config.inputs_device()):
            with tf.name_scope(FLAGS.dataset_name + '_data_provider'):
                provider = slim.dataset_data_provider.DatasetDataProvider(  # multi-threaded slim data reader
                    dataset,  # Pascal VOC 2007 images: 5011 in total, 20 classes
                    num_readers=FLAGS.num_readers,  # number of reader threads
                    common_queue_capacity=20 * FLAGS.batch_size,  # 80
                    common_queue_min=10 * FLAGS.batch_size,  # 40
                    shuffle=True)
            # Get for SSD network: image, class labels, and ground truth bboxes.
            [image, shape, glabels, gbboxes] = provider.get(['image', 'shape',
                                                             'object/label',
                                                             'object/bbox'])
            # Pre-processing image, labels and bboxes.
            image, glabels, gbboxes = \
                image_preprocessing_fn(image, glabels, gbboxes,
                                       out_shape=ssd_shape,
                                       data_format=DATA_FORMAT)


            # At this point we hold the training image with its ground truth
            # labels and bboxes (the network itself is a VGG-16 variant).
            # Encode groundtruth labels and bboxes.
            # Combine the anchor positions with the ground truth labels and boxes
            # to get, per anchor, a target class, target offsets and a matching score.
            gclasses, glocalisations, gscores = \
                ssd_net.bboxes_encode(glabels, gbboxes, ssd_anchors)
            batch_shape = [1] + [len(ssd_anchors)] * 3  # list lengths: 1 image, then 6 tensors each for gclasses, glocalisations, gscores


            # gscores holds the IoU score of the matched ground truth per anchor.
            # Training batches and queue.
            r = tf.train.batch(
                tf_utils.reshape_list([image, gclasses, glocalisations, gscores]),
                batch_size=FLAGS.batch_size,  # 4
                num_threads=FLAGS.num_preprocessing_threads,  # batching threads
                capacity=5 * FLAGS.batch_size)  # 20
            # r is a flat list of batched tensors; batch_shape = [1, 6, 6, 6]
            # regroups it into image, gclasses, glocalisations and gscores.
            b_image, b_gclasses, b_glocalisations, b_gscores = \
                tf_utils.reshape_list(r, batch_shape)
            # Intermediate queueing: unique batch computation pipeline for all
            # GPUs running the training.
            batch_queue = slim.prefetch_queue.prefetch_queue(
                tf_utils.reshape_list([b_image, b_gclasses, b_glocalisations, b_gscores]),
                capacity=2 * deploy_config.num_clones)
            # batch_queue is a 'FIFOQueue' object that prefetches batches.
        # =================================================================== #
        # Define the model running on every GPU.
        # =================================================================== #
        def clone_fn(batch_queue):
            """Allows data parallelism by creating multiple
            clones of network_fn (the model is built on every GPU)."""
            # Dequeue batch (first in, first out).
            b_image, b_gclasses, b_glocalisations, b_gscores = \
                tf_utils.reshape_list(batch_queue.dequeue(), batch_shape)

            # Construct SSD network.
            arg_scope = ssd_net.arg_scope(weight_decay=FLAGS.weight_decay,
                                          data_format=DATA_FORMAT)
            # The arg_scope sets the default layer arguments of the SSD model.
            with slim.arg_scope(arg_scope):
                predictions, localisations, logits, end_points = \
                    ssd_net.net(b_image, is_training=True)
            # b_image has shape [4, 300, 300, 3]: batch, height, width, channels.
            # Add loss function.
            ssd_net.losses(logits, localisations,  # logits: raw scores over the 21 classes
                           b_gclasses, b_glocalisations, b_gscores,
                           match_threshold=FLAGS.match_threshold,
                           negative_ratio=FLAGS.negative_ratio,
                           alpha=FLAGS.loss_alpha,  # weight of the localization loss, 1 here
                           label_smoothing=FLAGS.label_smoothing)
            return end_points  # dict of layer activations, up to 'block11'

        # Gather initial summaries.
        summaries = set(tf.get_collection(tf.GraphKeys.SUMMARIES))

        # =================================================================== #
        # Add summaries from first clone.
        # =================================================================== #
        clones = model_deploy.create_clones(deploy_config, clone_fn, [batch_queue])
        first_clone_scope = deploy_config.clone_scope(0)
        # Gather update_ops from the first clone. These contain, for example,
        # the updates for the batch_norm variables created by network_fn.
        update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS, first_clone_scope)

        # Add summaries for end_points.
        end_points = clones[0].outputs
        for end_point in end_points:
            x = end_points[end_point]
            summaries.add(tf.summary.histogram('activations/' + end_point, x))
            summaries.add(tf.summary.scalar('sparsity/' + end_point,
                                            tf.nn.zero_fraction(x)))
        # Add summaries for losses and extra losses.
        for loss in tf.get_collection(tf.GraphKeys.LOSSES, first_clone_scope):
            summaries.add(tf.summary.scalar(loss.op.name, loss))
        for loss in tf.get_collection('EXTRA_LOSSES', first_clone_scope):
            summaries.add(tf.summary.scalar(loss.op.name, loss))

        # Add summaries for variables.
        for variable in slim.get_model_variables():
            summaries.add(tf.summary.histogram(variable.op.name, variable))

        # =================================================================== #
        # Configure the moving averages.
        # =================================================================== #
        if FLAGS.moving_average_decay:
            moving_average_variables = slim.get_model_variables()
            variable_averages = tf.train.ExponentialMovingAverage(
                FLAGS.moving_average_decay, global_step)
        else:
            moving_average_variables, variable_averages = None, None

        # =================================================================== #
        # Configure the optimization procedure.
        # =================================================================== #
        with tf.device(deploy_config.optimizer_device()):
            learning_rate = tf_utils.configure_learning_rate(FLAGS,
                                                             dataset.num_samples,
                                                             global_step)
            optimizer = tf_utils.configure_optimizer(FLAGS, learning_rate)
            summaries.add(tf.summary.scalar('learning_rate', learning_rate))

        if FLAGS.moving_average_decay:
            # Update ops executed locally by trainer.
            update_ops.append(variable_averages.apply(moving_average_variables))

        # Variables to train.
        variables_to_train = tf_utils.get_variables_to_train(FLAGS)

        # and returns a train_tensor and summary_op
        total_loss, clones_gradients = model_deploy.optimize_clones(
            clones,
            optimizer,
            var_list=variables_to_train)
        # Add total_loss to summary.
        summaries.add(tf.summary.scalar('total_loss', total_loss))

        # Create gradient updates.
        grad_updates = optimizer.apply_gradients(clones_gradients,
                                                 global_step=global_step)
        update_ops.append(grad_updates)
        update_op = tf.group(*update_ops)
        train_tensor = control_flow_ops.with_dependencies([update_op], total_loss,
                                                          name='train_op')

        # Add the summaries from the first clone. These contain the summaries
        summaries |= set(tf.get_collection(tf.GraphKeys.SUMMARIES,
                                           first_clone_scope))
        # Merge all summaries together.
        summary_op = tf.summary.merge(list(summaries), name='summary_op')

        # =================================================================== #
        # Kicks off the training.
        # =================================================================== #
        gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=FLAGS.gpu_memory_fraction)
        config = tf.ConfigProto(log_device_placement=False,
                                gpu_options=gpu_options)
        saver = tf.train.Saver(max_to_keep=5,
                               keep_checkpoint_every_n_hours=1.0,
                               write_version=2,
                               pad_step_number=False)
        slim.learning.train(
            train_tensor,
            logdir=FLAGS.train_dir,
            master='',
            is_chief=True,
            init_fn=tf_utils.get_init_fn(FLAGS),
            summary_op=summary_op,
            number_of_steps=FLAGS.max_number_of_steps,
            log_every_n_steps=FLAGS.log_every_n_steps,
            save_summaries_secs=FLAGS.save_summaries_secs,
            saver=saver,
            save_interval_secs=FLAGS.save_interval_secs,
            session_config=config,
            sync_optimizer=None)


if __name__ == '__main__':
    tf.app.run()
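
The `negative_ratio` flag in the script above limits how many negative boxes enter the loss relative to the positives (3 negatives per positive). A minimal sketch of that selection step, following the hard negative mining idea of keeping the negatives with the highest confidence loss. The function name `select_hard_negatives` and the toy loss values are assumptions for illustration, and the repository's own implementation may differ in detail.

```python
import numpy as np


def select_hard_negatives(neg_loss, num_pos, negative_ratio=3):
    """Indices of the negatives to keep, hardest (highest loss) first.

    neg_loss: per-box confidence loss of the negative boxes.
    num_pos: number of positive (matched) boxes in the same image.
    """
    neg_loss = np.asarray(neg_loss, dtype=float)
    num_neg = min(int(negative_ratio * num_pos), len(neg_loss))
    order = np.argsort(-neg_loss)  # descending loss
    return order[:num_neg]


neg_loss = [0.1, 0.9, 0.3, 0.7, 0.5, 0.2, 0.8]
kept = select_hard_negatives(neg_loss, num_pos=1)
print(kept)  # [1 6 3]
```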

5.1.2 Building the network

def ssd_net(inputs,
            num_classes=SSDNet.default_params.num_classes,
            feat_layers=SSDNet.default_params.feat_layers,
            anchor_sizes=SSDNet.default_params.anchor_sizes,
            anchor_ratios=SSDNet.default_params.anchor_ratios,
            normalizations=SSDNet.default_params.normalizations,
            is_training=True,
            dropout_keep_prob=0.5,
            prediction_fn=slim.softmax,
            reuse=None,
            scope='ssd_512_vgg'):
    """SSD net definition.
    """
    # End_points collect relevant activations for external use.
    end_points = {}
    with tf.variable_scope(scope, 'ssd_512_vgg', [inputs], reuse=reuse):
        # Original VGG-16 blocks.
        net = slim.repeat(inputs, 2, slim.conv2d, 64, [3, 3], scope='conv1')
        end_points['block1'] = net
        net = slim.max_pool2d(net, [2, 2], scope='pool1')
        # Block 2.
        net = slim.repeat(net, 2, slim.conv2d, 128, [3, 3], scope='conv2')
        end_points['block2'] = net
        net = slim.max_pool2d(net, [2, 2], scope='pool2')
        # Block 3.
        net = slim.repeat(net, 3, slim.conv2d, 256, [3, 3], scope='conv3')
        end_points['block3'] = net
        net = slim.max_pool2d(net, [2, 2], scope='pool3')
        # Block 4.
        net = slim.repeat(net, 3, slim.conv2d, 512, [3, 3], scope='conv4')
        end_points['block4'] = net
        net = slim.max_pool2d(net, [2, 2], scope='pool4')
        # Block 5.
        net = slim.repeat(net, 3, slim.conv2d, 512, [3, 3], scope='conv5')
        end_points['block5'] = net
        net = slim.max_pool2d(net, [3, 3], 1, scope='pool5')

        # Additional SSD blocks.
        # Block 6: 3x3 dilated (atrous) convolution, rate 6.
        net = slim.conv2d(net, 1024, [3, 3], rate=6, scope='conv6')
        end_points['block6'] = net
        # Block 7: 1x1 conv.
        net = slim.conv2d(net, 1024, [1, 1], scope='conv7')
        end_points['block7'] = net

        # Block 8/9/10/11: 1x1 and 3x3 convolutions stride 2 (except lasts).
        end_point = 'block8'
        with tf.variable_scope(end_point):
            net = slim.conv2d(net, 256, [1, 1], scope='conv1x1')
            net = custom_layers.pad2d(net, pad=(1, 1))
            net = slim.conv2d(net, 512, [3, 3], stride=2, scope='conv3x3', padding='VALID')
        end_points[end_point] = net
        end_point = 'block9'
        with tf.variable_scope(end_point):
            net = slim.conv2d(net, 128, [1, 1], scope='conv1x1')
            net = custom_layers.pad2d(net, pad=(1, 1))
            net = slim.conv2d(net, 256, [3, 3], stride=2, scope='conv3x3', padding='VALID')
        end_points[end_point] = net
        end_point = 'block10'
        with tf.variable_scope(end_point):
            net = slim.conv2d(net, 128, [1, 1], scope='conv1x1')
            net = custom_layers.pad2d(net, pad=(1, 1))
            net = slim.conv2d(net, 256, [3, 3], stride=2, scope='conv3x3', padding='VALID')
        end_points[end_point] = net
        end_point = 'block11'
        with tf.variable_scope(end_point):
            net = slim.conv2d(net, 128, [1, 1], scope='conv1x1')
            net = custom_layers.pad2d(net, pad=(1, 1))
            net = slim.conv2d(net, 256, [3, 3], stride=2, scope='conv3x3', padding='VALID')
        end_points[end_point] = net
        end_point = 'block12'
        with tf.variable_scope(end_point):
            net = slim.conv2d(net, 128, [1, 1], scope='conv1x1')
            net = custom_layers.pad2d(net, pad=(1, 1))
            net = slim.conv2d(net, 256, [4, 4], scope='conv4x4', padding='VALID')
            # Fix padding to match Caffe version (pad=1).
            # pad_shape = [(i-j) for i, j in zip(layer_shape(net), [0, 1, 1, 0])]
            # net = tf.slice(net, [0, 0, 0, 0], pad_shape, name='caffe_pad')
        end_points[end_point] = net

        # Prediction and localisations layers.
        predictions = []
        logits = []
        localisations = []
        for i, layer in enumerate(feat_layers):
            with tf.variable_scope(layer + '_box'):
                p, l = ssd_vgg_300.ssd_multibox_layer(end_points[layer],
                                                      num_classes,
                                                      anchor_sizes[i],
                                                      anchor_ratios[i],
                                                      normalizations[i])
            predictions.append(prediction_fn(p))
            logits.append(p)
            localisations.append(l)

        return predictions, localisations, logits, end_points

def ssd_multibox_layer(inputs,  # feature map tensor of one feature layer
                       num_classes,  # 21
                       sizes,  # anchor sizes of this layer
                       ratios=[1],  # extra aspect ratios of this layer (2 to 4 values)
                       normalization=-1,  # L2-normalize the input when > 0
                       bn_normalization=False):
    """Construct a multibox layer, return a class and localization predictions.
    """
    net = inputs  # e.g. the last feature map of block 4
    if normalization > 0:
        net = custom_layers.l2_normalization(net, scaling=True)
    # Number of anchors.
    num_anchors = len(sizes) + len(ratios)  # e.g. 2 sizes + 2 ratios [2, 0.5] = 4 anchors on the first feature map

    # Location.
    num_loc_pred = num_anchors * 4  # 4 location values per anchor
    loc_pred = slim.conv2d(net, num_loc_pred, [3, 3], activation_fn=None,
                           scope='conv_loc')
    loc_pred = custom_layers.channel_to_last(loc_pred)  # make sure channels are the last dimension
    loc_pred = tf.reshape(loc_pred,
                          tensor_shape(loc_pred, 4)[:-1]+[num_anchors, 4])  # split the last dimension into (anchors, 4 location values)
    # Class prediction.
    num_cls_pred = num_anchors * num_classes  # 21 class scores per anchor
    cls_pred = slim.conv2d(net, num_cls_pred, [3, 3], activation_fn=None,
                           scope='conv_cls')
    cls_pred = custom_layers.channel_to_last(cls_pred)
    cls_pred = tf.reshape(cls_pred,
                          tensor_shape(cls_pred, 4)[:-1]+[num_anchors, num_classes])  # split the last dimension into (anchors, classes)
    return cls_pred, loc_pred

5.1.3 Anchor box computation

Main steps:

Compute the center coordinates
Compute the width and height from the aspect ratios and scales

def ssd_anchor_one_layer(img_shape,
                         feat_shape,
                         sizes,
                         ratios,
                         step,
                         offset=0.5,
                         dtype=np.float32):
    """Compute SSD default anchor boxes for one feature layer.

    Determine the relative position grid of the centers, and the relative
    width and height.

    Arguments:
      feat_shape: Feature shape, used for computing relative position grids;
      sizes: Absolute reference sizes;
      ratios: Ratios to use on these features;
      img_shape: Image shape, used for computing height, width relatively to the
        former;
      step: Cell size of this feature layer in input-image pixels;
      offset: Grid offset.

    Return:
      y, x, h, w: Relative x and y grids, and height and width.
      The centers of all cells in relative coordinates, and the anchor
      heights and widths as fractions of the image size.
    """
    # Compute the position grid: simple way.
    # y, x = np.mgrid[0:feat_shape[0], 0:feat_shape[1]]
    # y = (y.astype(dtype) + offset) / feat_shape[0]
    # x = (x.astype(dtype) + offset) / feat_shape[1]
    # Weird SSD-Caffe computation using steps values...
    y, x = np.mgrid[0:feat_shape[0], 0:feat_shape[1]]
    y = (y.astype(dtype) + offset) * step / img_shape[0]  # relative y of each cell center
    x = (x.astype(dtype) + offset) * step / img_shape[1]  # relative x of each cell center

    # Expand dims to support easy broadcasting.
    y = np.expand_dims(y, axis=-1)
    x = np.expand_dims(x, axis=-1)

    # Compute relative height and width.
    # Tries to follow the original implementation of SSD for the order.
    num_anchors = len(sizes) + len(ratios)
    h = np.zeros((num_anchors, ), dtype=dtype)
    w = np.zeros((num_anchors, ), dtype=dtype)
    # Add first anchor boxes with ratio=1.
    h[0] = sizes[0] / img_shape[0]
    w[0] = sizes[0] / img_shape[1]  # as a fraction of the original image
    di = 1
    if len(sizes) > 1:  # for ratio 1, add one extra box with w = h = s_k'
        h[1] = math.sqrt(sizes[0] * sizes[1]) / img_shape[0]  # s_k' = sqrt(s_k * s_{k+1}), relative to image height
        w[1] = math.sqrt(sizes[0] * sizes[1]) / img_shape[1]  # s_k' relative to image width
        di += 1
    for i, r in enumerate(ratios):
        h[i+di] = sizes[0] / img_shape[0] / math.sqrt(r)  # relative height
        w[i+di] = sizes[0] / img_shape[1] * math.sqrt(r)  # relative width
    return y, x, h, w
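
As a quick check of the function above, the following sketch recomputes the relative heights and widths in plain Python for what is assumed to be the first SSD-300 layer: a 300x300 image, sizes (21, 45), ratios [2, 0.5], a 38x38 feature map and a step of 8. The step value and the helper names `anchor_sizes` and `cell_centers` are assumptions made for illustration.

```python
import math


def anchor_sizes(img_shape, sizes, ratios):
    """Relative (h, w) lists for one layer, in the same order as above."""
    h = [sizes[0] / img_shape[0]]
    w = [sizes[0] / img_shape[1]]
    if len(sizes) > 1:
        # Extra square box with scale sqrt(s_k * s_{k+1}).
        extra = math.sqrt(sizes[0] * sizes[1])
        h.append(extra / img_shape[0])
        w.append(extra / img_shape[1])
    for r in ratios:
        h.append(sizes[0] / img_shape[0] / math.sqrt(r))
        w.append(sizes[0] / img_shape[1] * math.sqrt(r))
    return h, w


def cell_centers(feat_size, step, img_size, offset=0.5):
    """Relative center coordinate of each cell along one axis."""
    return [(i + offset) * step / img_size for i in range(feat_size)]


h, w = anchor_sizes((300, 300), sizes=(21., 45.), ratios=[2, .5])
centers = cell_centers(38, 8, 300)
```

Under these assumptions `h` comes out close to [0.07, 0.1025, 0.0495, 0.0990], and `w` has the last two entries swapped, which is the pair of arrays quoted in the encoding code of the next section.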

5.1.4 Encoding

Match the computed anchor boxes against the ground-truth class labels and boxes from the dataset, and encode the result as training targets.

def tf_ssd_bboxes_encode_layer(labels,
                               bboxes,
                               anchors_layer,
                               num_classes,
                               no_annotation_label,
                               ignore_threshold=0.5,
                               prior_scaling=[0.1, 0.1, 0.2, 0.2],
                               dtype=tf.float32):
    """Encode groundtruth labels and bounding boxes using SSD anchors from
    one layer.

    Arguments:
      labels: 1D Tensor(int64) containing groundtruth labels;
      bboxes: Nx4 Tensor(float) with bboxes relative coordinates;
      anchors_layer: Numpy array with layer anchors;
      matching_threshold: Threshold for positive match with groundtruth bboxes;
      prior_scaling: Scaling of encoded coordinates.

    Return:
      (target_labels, target_localizations, target_scores): Target Tensors.
    """
    # Anchors coordinates and volume.
    yref, xref, href, wref = anchors_layer  # cell centers, anchor heights and widths
    # href, e.g.:
    # array([ 0.07      ,  0.10246951,  0.04949747,  0.09899495], dtype=float32)
    # wref, e.g.:
    # array([ 0.07      ,  0.10246951,  0.09899495,  0.04949747], dtype=float32)
    ymin = yref - href / 2.  # top-left y
    xmin = xref - wref / 2.  # top-left x
    ymax = yref + href / 2.  # bottom-right y
    xmax = xref + wref / 2.  # bottom-right x
    vol_anchors = (xmax - xmin) * (ymax - ymin)  # anchor areas, shape (38, 38, 4)
    # All of the above are broadcast array operations.

    # Initialize tensors...
    shape = (yref.shape[0], yref.shape[1], href.size)#(38, 38, 4)
    feat_labels = tf.zeros(shape, dtype=tf.int64)
    feat_scores = tf.zeros(shape, dtype=dtype)

    feat_ymin = tf.zeros(shape, dtype=dtype)
    feat_xmin = tf.zeros(shape, dtype=dtype)
    feat_ymax = tf.ones(shape, dtype=dtype)
    feat_xmax = tf.ones(shape, dtype=dtype)
    # Per-anchor targets: top-left and bottom-right corners, labels and scores.

    def jaccard_with_anchors(bbox):  # bbox is one ground-truth box
        """Compute jaccard score (IoU) between a box and the anchors.
        """
        int_ymin = tf.maximum(ymin, bbox[0])
        int_xmin = tf.maximum(xmin, bbox[1])
        int_ymax = tf.minimum(ymax, bbox[2])
        int_xmax = tf.minimum(xmax, bbox[3])
        h = tf.maximum(int_ymax - int_ymin, 0.)
        w = tf.maximum(int_xmax - int_xmin, 0.)  # height and width of the intersection
        # Volumes.
        inter_vol = h * w  # intersection area
        union_vol = vol_anchors - inter_vol \
            + (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])  # anchor area - intersection + ground-truth box area
        jaccard = tf.div(inter_vol, union_vol)  # IoU
        return jaccard

    def intersection_with_anchors(bbox):
        """Compute intersection score between a box and the anchors.
           Intersection only, no union.
        """
        int_ymin = tf.maximum(ymin, bbox[0])
        int_xmin = tf.maximum(xmin, bbox[1])
        int_ymax = tf.minimum(ymax, bbox[2])
        int_xmax = tf.minimum(xmax, bbox[3])
        h = tf.maximum(int_ymax - int_ymin, 0.)
        w = tf.maximum(int_xmax - int_xmin, 0.)
        inter_vol = h * w
        scores = tf.div(inter_vol, vol_anchors)  # intersection area divided by anchor area, used as a confidence score
        return scores

    def condition(i, feat_labels, feat_scores,
                  feat_ymin, feat_xmin, feat_ymax, feat_xmax):
        """Condition: check label index.
           Loop over labels: continue while i < number of labels
           (i is incremented by 1 at the end of body).
        """
        r = tf.less(i, tf.shape(labels))
        return r[0]

    def body(i, feat_labels, feat_scores,
             feat_ymin, feat_xmin, feat_ymax, feat_xmax):
        """Body: update feature labels, scores and bboxes (the loop body).
        Follow the original SSD paper for that purpose:
          - assign values when jaccard > 0.5;
          - only update if beat the score of other bboxes.
        """
        # Jaccard score.
        label = labels[i]  # label of the i-th ground truth
        bbox = bboxes[i]   # box of the i-th ground truth
        # IoU between every anchor on this feature map and the i-th ground truth
        jaccard = jaccard_with_anchors(bbox)
        # Mask: check threshold + scores + no annotations + num_classes.
        # mask is True where this IoU beats the best score recorded so far
        mask = tf.greater(jaccard, feat_scores)
        # mask = tf.logical_and(mask, tf.greater(jaccard, matching_threshold))
        mask = tf.logical_and(mask, feat_scores > -0.5)
        mask = tf.logical_and(mask, label < num_classes)    # label must be a valid class (< 21)
        imask = tf.cast(mask, tf.int64)                     # mask as int
        fmask = tf.cast(mask, dtype)                        # mask as float
        # Update values using mask.
        feat_labels = imask * label + (1 - imask) * feat_labels # where mask is 1 take this label, otherwise keep the previous value (initially 0, i.e. background)
        feat_scores = tf.where(mask, jaccard, feat_scores)      # where mask is True take jaccard, otherwise keep feat_scores
        # Where mask is set, this ground-truth box becomes the anchor's regression target
        feat_ymin = fmask * bbox[0] + (1 - fmask) * feat_ymin
        feat_xmin = fmask * bbox[1] + (1 - fmask) * feat_xmin
        feat_ymax = fmask * bbox[2] + (1 - fmask) * feat_ymax
        feat_xmax = fmask * bbox[3] + (1 - fmask) * feat_xmax

        # Check no annotation label: ignore these anchors...
        # interscts = intersection_with_anchors(bbox)
        # mask = tf.logical_and(interscts > ignore_threshold,
        #                       label == no_annotation_label)
        # # Replace scores by -1.
        # feat_scores = tf.where(mask, -tf.cast(mask, dtype), feat_scores)

        return [i+1, feat_labels, feat_scores,
                feat_ymin, feat_xmin, feat_ymax, feat_xmax]

    # Main loop definition.
    i = 0
    [i, feat_labels, feat_scores,
     feat_ymin, feat_xmin,
     feat_ymax, feat_xmax] = tf.while_loop(condition, body,
                                           [i, feat_labels, feat_scores,
                                            feat_ymin, feat_xmin,
                                            feat_ymax, feat_xmax])

    # Transform to center / size.
    feat_cy = (feat_ymax + feat_ymin) / 2.
    feat_cx = (feat_xmax + feat_xmin) / 2.
    feat_h = feat_ymax - feat_ymin
    feat_w = feat_xmax - feat_xmin
    # Encode features.
    feat_cy = (feat_cy - yref) / href / prior_scaling[0]
    feat_cx = (feat_cx - xref) / wref / prior_scaling[1]
    feat_h = tf.log(feat_h / href) / prior_scaling[2]
    feat_w = tf.log(feat_w / wref) / prior_scaling[3]
    # Use SSD ordering: x / y / w / h instead of ours.
    feat_localizations = tf.stack([feat_cx, feat_cy, feat_w, feat_h], axis=-1)
    return feat_labels, feat_localizations, feat_scores
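
The matching loop above can be sketched without TensorFlow. The following plain-Python version is a simplification under stated assumptions: anchors are a flat list rather than a (38, 38, 4) grid, and the function names `iou` and `match_anchors` are hypothetical. As in the original, each anchor keeps the label of the ground truth with the highest IoU, and no threshold is applied at this stage (the `match_threshold` filter comes later, in the loss).

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (ymin, xmin, ymax, xmax)."""
    ih = max(min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]), 0.0)
    iw = max(min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]), 0.0)
    inter = ih * iw
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def match_anchors(anchors, gt_boxes, gt_labels):
    """For each anchor, keep the label and IoU of its best ground truth."""
    labels = [0] * len(anchors)    # 0 means background
    scores = [0.0] * len(anchors)
    for box, label in zip(gt_boxes, gt_labels):
        for k, anchor in enumerate(anchors):
            score = iou(anchor, box)
            if score > scores[k]:  # only update if it beats the current best
                labels[k], scores[k] = label, score
    return labels, scores


anchors = [(0.0, 0.0, 0.5, 0.5), (0.5, 0.5, 1.0, 1.0), (0.0, 0.5, 0.5, 1.0)]
gt_boxes = [(0.0, 0.0, 0.5, 0.4), (0.4, 0.4, 1.0, 1.0)]
gt_labels = [7, 12]
labels, scores = match_anchors(anchors, gt_boxes, gt_labels)
```

Here the third anchor still receives label 12 even though its IoU is below 0.5; whether it counts as a positive is decided only when the scores are compared with the matching threshold.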

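Box encoding:
Given the positions of the prior (anchor) box and the ground-truth box,
i.e. $P=(P_x, P_y,P_w,P_h)$ and $G=(G_x,G_y,G_w,G_h)$,
the regression targets $t_*$ are: $\begin{cases} t_x = (G_x-P_x)/P_w \\t_y=(G_y-P_y)/P_h\\t_w=\log(G_w/P_w) \\t_h=\log(G_h/P_h)\end{cases}$
The code above additionally divides each target by the corresponding entry of prior_scaling.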
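
A minimal numeric sketch of the encoding, including the division by `prior_scaling` that the code performs. The function name `encode_box` and the (cx, cy, w, h) argument order are choices made for this illustration, and the sample boxes are made up.

```python
import math


def encode_box(gt, prior, prior_scaling=(0.1, 0.1, 0.2, 0.2)):
    """Encode a ground-truth box against a prior; both are (cx, cy, w, h)."""
    gx, gy, gw, gh = gt
    px, py, pw, ph = prior
    tx = (gx - px) / pw / prior_scaling[0]
    ty = (gy - py) / ph / prior_scaling[1]
    tw = math.log(gw / pw) / prior_scaling[2]
    th = math.log(gh / ph) / prior_scaling[3]
    return tx, ty, tw, th


prior = (0.5, 0.5, 0.2, 0.2)
gt = (0.52, 0.49, 0.4, 0.1)
t = encode_box(gt, prior)
```

A ground truth twice as wide and half as tall as the prior gives width and height targets of equal magnitude and opposite sign, which is the point of using the logarithm.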

5.1.5 Building the loss function

def ssd_losses(logits, localisations,
               gclasses, glocalisations, gscores,
               match_threshold=0.5,                          # IoU threshold for a positive match
               negative_ratio=3.,                            # number of negatives kept per positive
               alpha=1.,
               label_smoothing=0.,
               scope=None):
    """Loss functions for training the SSD 300 VGG network.

    This function defines the different loss components of the SSD, and
    adds them to the TF loss collection.

    Arguments:
      logits: (list of) predictions logits Tensors;                # predicted logits, i.e. confidences before softmax
      localisations: (list of) localisations Tensors;              # predicted box offsets
      gclasses: (list of) groundtruth labels Tensors;
      glocalisations: (list of) groundtruth localisations Tensors;
      gscores: (list of) groundtruth score Tensors;
    """
    with tf.name_scope(scope, 'ssd_losses'):
        l_cross_pos = []   # classification losses of positive samples
        l_cross_neg = []   # classification losses of negative samples
        l_loc = []         # localization (offset) losses
        for i in range(len(logits)):
            dtype = logits[i].dtype
            with tf.name_scope('block_%i' % i):
                # Determine weights Tensor.
                pmask = gscores[i] > match_threshold     # anchors with IoU > 0.5 are positives
                fpmask = tf.cast(pmask, dtype)           # mask as float
                n_positives = tf.reduce_sum(fpmask)      # number of positives

                # Select some random negative entries.
                # n_entries = np.prod(gclasses[i].get_shape().as_list())
                # r_positive = n_positives / n_entries
                # r_negative = negative_ratio * n_positives / (n_entries - n_positives)

                # Negative mask.
                no_classes = tf.cast(pmask, tf.int32)         # 1 for positives, 0 (background) otherwise
                predictions = slim.softmax(logits[i])         # per-class probabilities
                nmask = tf.logical_and(tf.logical_not(pmask), # negatives: not positive and score > -0.5
                                       gscores[i] > -0.5)
                fnmask = tf.cast(nmask, dtype)                # mask as float
                nvalues = tf.where(nmask,                     # background probability of each negative
                                   predictions[:, :, :, :, 0],# class 0 is the background class
                                   1. - fnmask)               # elsewhere fnmask is 0, so the value is 1
                nvalues_flat = tf.reshape(nvalues, [-1])      # flatten nvalues

                # Number of negative entries to select (hard negative mining).
                n_neg = tf.cast(negative_ratio * n_positives, tf.int32) # negative_ratio times the number of positives
                n_neg = tf.maximum(n_neg, tf.size(nvalues_flat) // 8)   # but at least 1/8 of all entries
                n_neg = tf.maximum(n_neg, tf.shape(nvalues)[0] * 4)     # and at least 4 times the batch size
                max_neg_entries = 1 + tf.cast(tf.reduce_sum(fnmask), tf.int32)
                n_neg = tf.minimum(n_neg, max_neg_entries)

                val, idxes = tf.nn.top_k(-nvalues_flat, k=n_neg)  # the n_neg entries with the lowest background probability, i.e. the hardest negatives
                minval = val[-1]                                  # smallest selected value, used as the cutoff
                # Final negative mask.
                nmask = tf.logical_and(nmask, -nvalues > minval)  # final set of negatives
                fnmask = tf.cast(nmask, dtype)

                # Add cross-entropy loss.
                with tf.name_scope('cross_entropy_pos'):     # cross-entropy loss of positives
                    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits[i],
                                                                          labels=gclasses[i])
                    loss = tf.losses.compute_weighted_loss(loss, fpmask) # weight the loss by fpmask so that only positives contribute
                    l_cross_pos.append(loss)                             # fpmask is 1 for positives and 0 elsewhere

                with tf.name_scope('cross_entropy_neg'):     # cross-entropy loss of negatives
                    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits[i],
                                                                          labels=no_classes)
                    loss = tf.losses.compute_weighted_loss(loss, fnmask) # same idea, with the negative mask
                    l_cross_neg.append(loss)

                # Add localization loss: smooth L1, L2, ...
                with tf.name_scope('localization'):
                    # Weights Tensor: positive mask + random negative.
                    weights = tf.expand_dims(alpha * fpmask, axis=-1)
                    loss = custom_layers.abs_smooth(localisations[i] - glocalisations[i]) # smooth L1 loss on the box offsets
                    loss = tf.losses.compute_weighted_loss(loss, weights)
                    l_loc.append(loss)

        # Additional total losses...
        with tf.name_scope('total'):
            total_cross_pos = tf.add_n(l_cross_pos, 'cross_entropy_pos')
            total_cross_neg = tf.add_n(l_cross_neg, 'cross_entropy_neg')
            total_cross = tf.add(total_cross_pos, total_cross_neg, 'cross_entropy') # total classification loss over positives and negatives
            total_loc = tf.add_n(l_loc, 'localization')  # total localization loss over positives

            # Add to EXTRA LOSSES TF.collection
            tf.add_to_collection('EXTRA_LOSSES', total_cross_pos)
            tf.add_to_collection('EXTRA_LOSSES', total_cross_neg)
            tf.add_to_collection('EXTRA_LOSSES', total_cross)
            tf.add_to_collection('EXTRA_LOSSES', total_loc)
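
The negative selection in the loss is a form of hard negative mining. The sketch below shows the core idea in plain Python under simplifying assumptions: anchors are a flat list, and the extra lower bounds on `n_neg` used above (1/8 of all entries, 4 times the batch size) are left out. The function name `select_hard_negatives` and the sample values are hypothetical.

```python
def select_hard_negatives(bg_probs, is_positive, negative_ratio=3):
    """Indices of the negatives with the lowest background probability."""
    n_pos = sum(is_positive)
    negatives = [i for i, pos in enumerate(is_positive) if not pos]
    n_neg = min(int(negative_ratio * n_pos), len(negatives))
    # Lowest background probability first: the network is least sure
    # these are background, so they are the hardest negatives.
    negatives.sort(key=lambda i: bg_probs[i])
    return sorted(negatives[:n_neg])


bg_probs = [0.10, 0.95, 0.30, 0.99, 0.60, 0.20, 0.90, 0.85]
is_positive = [True, False, False, False, False, False, False, False]
hard = select_hard_negatives(bg_probs, is_positive)
```

With one positive and a ratio of 3, only the three negatives whose background probability is lowest contribute to the negative cross-entropy term; the easy negatives are ignored.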

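5.2 Test phase

Main steps:

Build the network: produce the predicted classes and box offsets (logits before softmax, and localisations)
Anchor box computation: based on each feature map's size and the prior box scales and aspect ratios
Encoding: process the computed anchor boxes together with the ground-truth class labels and boxes from the dataset
Build the loss function and compute its value
Decoding: combine the network predictions with the computed anchors to obtain the predicted box positions
Filtering: further filter the boxes obtained with predicted class and position
- filter by select_threshold
- keep the top k by confidence
- filter again with non-maximum suppression

The first four steps are the same as in the training phase, so the corresponding code is omitted below.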

5.2.1 Decoding

Box decoding:
Given the predicted offsets $d_*(P)$ and the position of the prior (anchor) box,
i.e. $P=(P_x, P_y,P_w,P_h)$ and $d_*(P)$: $\begin{cases} d_x(P) = (\hat G_x-P_x)/P_w \\d_y(P)=(\hat G_y-P_y)/P_h\\d_w(P)=\log(\hat G_w/P_w) \\d_h(P)=\log(\hat G_h/P_h)\end{cases}$

we obtain the position of the predicted box: $\begin{cases}\hat G_x=P_wd_x(P)+P_x\\ \hat G_y = P_hd_y(P)+P_y\\ \hat G_w = P_w\exp(d_w(P)) \\ \hat G_h = P_h\exp(d_h(P))\end{cases}$

def tf_ssd_bboxes_decode_layer(feat_localizations,
                               anchors_layer,
                               prior_scaling=[0.1, 0.1, 0.2, 0.2]):
    """Compute the relative bounding boxes from the layer features and
    reference anchor bounding boxes.

    Arguments:
      feat_localizations: Tensor containing localization features.
      anchors: List of numpy array containing anchor boxes.

    Return:
      Tensor Nx4: ymin, xmin, ymax, xmax
    """
    yref, xref, href, wref = anchors_layer

    # Compute center, height and width
    cx = feat_localizations[:, :, :, :, 0] * wref * prior_scaling[0] + xref
    cy = feat_localizations[:, :, :, :, 1] * href * prior_scaling[1] + yref
    w = wref * tf.exp(feat_localizations[:, :, :, :, 2] * prior_scaling[2])
    h = href * tf.exp(feat_localizations[:, :, :, :, 3] * prior_scaling[3])
    # Boxes coordinates.
    ymin = cy - h / 2.
    xmin = cx - w / 2.
    ymax = cy + h / 2.
    xmax = cx + w / 2.
    bboxes = tf.stack([ymin, xmin, ymax, xmax], axis=-1)
    return bboxes
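
To see the decoding on concrete numbers, here is a plain-Python sketch for a single box. The function name `decode_box`, the (cx, cy, w, h) prior layout and the sample values are assumptions for illustration; the arithmetic follows the function above, including the multiplication by `prior_scaling`.

```python
import math


def decode_box(t, prior, prior_scaling=(0.1, 0.1, 0.2, 0.2)):
    """Decode offsets t = (tx, ty, tw, th) against a prior (cx, cy, w, h).

    Returns the corners (ymin, xmin, ymax, xmax), as the function above does.
    """
    px, py, pw, ph = prior
    cx = t[0] * pw * prior_scaling[0] + px
    cy = t[1] * ph * prior_scaling[1] + py
    w = pw * math.exp(t[2] * prior_scaling[2])
    h = ph * math.exp(t[3] * prior_scaling[3])
    return cy - h / 2, cx - w / 2, cy + h / 2, cx + w / 2


prior = (0.5, 0.5, 0.2, 0.2)
offsets = (1.0, -0.5, math.log(2) / 0.2, -math.log(2) / 0.2)
ymin, xmin, ymax, xmax = decode_box(offsets, prior)
```

These offsets decode to a box centered at (0.52, 0.49) with width 0.4 and height 0.1, so decoding inverts the encoding as long as both use the same `prior_scaling`.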

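However, the Caffe source implementation of SSD has one more trick: a variance hyperparameter that rescales the predicted values. The boolean parameter variance_encoded_in_target switches between two modes. When it is True, the variance is already included in the predicted values, which is the case described above. When it is False (most implementations use this mode, possibly because training is easier), the variance hyperparameter has to be set manually and is used to scale the four values of $d_*(P)$. The boxes are then decoded as follows:
Position of the predicted box: $\begin{cases}\hat G_x=P_wd_x(P)variance[0]+P_x\\ \hat G_y = P_hd_y(P)variance[1]+P_y\\ \hat G_w = P_w\exp(d_w(P)variance[2]) \\ \hat G_h = P_h\exp(d_h(P)variance[3])\end{cases}$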

5.2.2 Filtering

Filtering: further filter the boxes obtained with predicted class and position
- filter by select_threshold
- keep the top k by confidence
- filter again with non-maximum suppression

def detected_bboxes(self, predictions, localisations,
                    select_threshold=None, nms_threshold=0.5,
                    clipping_bbox=None, top_k=400, keep_top_k=200):
    """Get the detected bounding boxes from the SSD network output.
    """
    # Select top_k bboxes from predictions, and clip
    # Filter by select_threshold
    rscores, rbboxes = \
        ssd_common.tf_ssd_bboxes_select(predictions, localisations,
                                        select_threshold=select_threshold,
                                        num_classes=self.params.num_classes)
    # Keep the top_k boxes with the highest confidence
    rscores, rbboxes = \
        tfe.bboxes_sort(rscores, rbboxes, top_k=top_k)

    # Filter again with non-maximum suppression
    # Apply NMS algorithm.
    rscores, rbboxes = \
        tfe.bboxes_nms_batch(rscores, rbboxes,
                             nms_threshold=nms_threshold,
                             keep_top_k=keep_top_k)
    # if clipping_bbox is not None:
    #     rbboxes = tfe.bboxes_clip(clipping_bbox, rbboxes)
    return rscores, rbboxes
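
The three filtering steps can be sketched in plain Python for a single image and a single class. This is a simplified illustration and not the `tfe` implementation: batching and per-class handling are ignored, and the names `iou` and `filter_detections` as well as the sample boxes are hypothetical.

```python
def iou(a, b):
    """IoU of two boxes given as (ymin, xmin, ymax, xmax)."""
    ih = max(min(a[2], b[2]) - max(a[0], b[0]), 0.0)
    iw = max(min(a[3], b[3]) - max(a[1], b[1]), 0.0)
    inter = ih * iw
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def filter_detections(scores, boxes, select_threshold=0.5,
                      top_k=400, nms_threshold=0.5):
    # 1. Drop boxes whose score does not exceed select_threshold.
    kept = [(s, b) for s, b in zip(scores, boxes) if s > select_threshold]
    # 2. Sort by score and keep the top_k.
    kept.sort(key=lambda sb: sb[0], reverse=True)
    kept = kept[:top_k]
    # 3. Greedy NMS: keep a box only if it does not overlap a kept box too much.
    result = []
    for s, b in kept:
        if all(iou(b, kb) <= nms_threshold for _, kb in result):
            result.append((s, b))
    return result


scores = [0.9, 0.8, 0.3, 0.7]
boxes = [(0.0, 0.0, 1.0, 1.0), (0.0, 0.0, 1.0, 0.9),
         (0.5, 0.5, 1.0, 1.0), (2.0, 2.0, 3.0, 3.0)]
detections = filter_detections(scores, boxes)
```

In this example the 0.3 box falls below the score threshold, the 0.8 box is suppressed because it overlaps the 0.9 box almost completely, and the disjoint 0.7 box survives.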