SSD阅读和解读

最新推荐文章于 2019-09-30 16:32:34 发布

置顶无名份的浪漫2018

最新推荐文章于 2019-09-30 16:32:34 发布

阅读量502

点赞数 1

本文链接：https://blog.csdn.net/pc9803/article/details/95359148

版权

GitHub上面的关于SSD的复现代码：

TensorFlow版：https://github.com/balancap/SSD-Tensorflow

ca的mobile版：https://github.com/chuanqi305/MobileNet-SSD

keras版：https://github.com/pierluigiferrari/ssd_keras

pytorch版：https://github.com/amdegroot/ssd.pytorch

caffe原版：https://github.com/weiliu89/caffe/tree/ssd

参考的博客：

主要是论文的解读：https://blog.csdn.net/u010167269/article/details/52563573

代码解读，优点比较细致的解释各行代码的功能，缺点是函数接口和关联比较少讲到：https://blog.csdn.net/c20081052/article/details/80391627

代码解读，优点是比较注重整体，及其各部分的关联，和上面的文章有点互补：https://blog.csdn.net/qq1483661204/article/details/79776065

这个博客我还没看，但感觉是8错的：https://www.jianshu.com/p/772316cf6f42

网络原理

原理方面，VGG基本结构、多尺度感知、anchor等方面的很多内容，目前已经有很多博客写过了，包括上述几个和博主之前写过的一篇，这里就不需要说太多了。

个人记录

1.模型结构来说，主体网络，实际上和分类的模型差不多，不同的是在于分类是输出N+1个数组成的tensor，而检测是输出很多很多个数的tensor（具体看先验框的多少）。在SSD中，这些数来自于几个层，最后经过一次卷积后合在一起。

2.对于SSD复现的代码，模型搭建部分还是比较常规的，不难懂，关键还是两个部分复杂些：bounding boxes的预处理、losses函数，这两个部分比较多tensor，要比较花时间才能弄清楚，这也是SSD的关键部分。

3.SSD多个特征层中的多个bounding boxes，每个会用4个参数描述，bounding boxes最本质的作用是用4*N长度的tensor记录住各个框的位置和形状信息，作为相对坐标系，而原图的坐标就是世界坐标系（对应的每个位置，就是真正的坐标）。相对坐标系+相对坐标=世界坐标，相对坐标就是训练和预测的时候要得到的预测值，和坐标系（就是bounding boxes的描述）结构有点像，不要搞混。

4.bounding boxes在训练阶段，较大的作用是在于在预处理时，提供很多框给GT选择，作为本batch的训练样本。实际上训练样本并不是直接的GT，而是利用GT选出一批很接近于GT的bounding boxes（具体选择原则看论文），然后根据负样本的选择原则，选取更多的负样本的bounding boxes（实际上并不是所有的bounding boxes都会在本batch训练中用到的），同时选出了这些boxes之后需要求取对应的预测值，训练的时候对这些预测值进行求losses。实际上训练的思想就是让模型在本个batch的时候，能够通过卷积系统生成出正样本的那些预测值。

5.bounding boxes初始化阶段，得到的feat_labels, feat_localizations, feat_scores等tensor，都是服务于后续的losses建立，而这些参数，在losses函数的计算过程中，将会被展平。

6.正样本的选取，是选择很GT相交（IoU大于一定值）的批anchor，过程是用逐个GT的位置值和anchor的tensor进行运算（功能上用点像遍历全部anchor）。而负样本的选取，是在anchor的tensor中，首先去掉正样本的，然后选取K个logist最小的框。所以两种样本，衡量的标准是不同的，一个是IoU，一个是logist的值，就是分类第0类的得分。

代码注释

下面代码是转载自https://blog.csdn.net/c20081052/article/details/80391627的代码及注释，另外加上了一些博主的注释笔记（##标志的）

ssd_vgg_300.py：

import math
from collections import namedtuple
 
import numpy as np
import tensorflow as tf
 
import tf_extended as tfe
from nets import custom_layers
from nets import ssd_common
 
slim = tf.contrib.slim
 
 
# =========================================================================== #
# SSD class definition.
# =========================================================================== #
#collections模块的namedtuple子类不仅可以使用item的index访问item，还可以通过item的name进行访问可以将namedtuple理解为c中的struct结构，其首先将各个item命名，然后对每个item赋予数据
SSDParams = namedtuple('SSDParameters', ['img_shape',           #输入图像大小
                                         'num_classes',         #分类类别数
                                         'no_annotation_label', #无标注标签
                                         'feat_layers',         #特征层
                                         'feat_shapes',         #特征层形状大小
                                         'anchor_size_bounds',  #锚点框大小上下边界，是与原图相比得到的小数值
                                         'anchor_sizes',        #初始锚点框尺寸
                                         'anchor_ratios',       #锚点框长宽比
                                         'anchor_steps',        #特征图相对原始图像的缩放
                                         'anchor_offset',       #锚点框中心的偏移
                                         'normalizations',      #是否正则化
                                         'prior_scaling'        #是对特征图参考框向gtbox做回归时用到的尺度缩放（0.1,0.1,0.2,0.2）
                                         ])
 
 
class SSDNet(object):
    """Implementation of the SSD VGG-based 300 network.
    The default features layers with 300x300 image input are:
      conv4 ==> 38 x 38
      conv7 ==> 19 x 19
      conv8 ==> 10 x 10
      conv9 ==> 5 x 5
      conv10 ==> 3 x 3
      conv11 ==> 1 x 1
    The default image size used to train this network is 300x300.    #训练输入图像尺寸默认为300x300
    """
    default_params = SSDParams(             #默认参数
        img_shape=(300, 300),
        num_classes=21,  #包含背景在内，共21类目标类别
        no_annotation_label=21,
        feat_layers=['block4', 'block7', 'block8', 'block9', 'block10', 'block11'], #特征层名字
        feat_shapes=[(38, 38), (19, 19), (10, 10), (5, 5), (3, 3), (1, 1)],       #特征层尺寸
        anchor_size_bounds=[0.15, 0.90],                                           
        # anchor_size_bounds=[0.20, 0.90],                                        #论文中初始预测框大小为0.2x300~0.9x300；实际代码是[45,270]
        anchor_sizes=[(21., 45.),   #直接给出的每个特征图上起初的锚点框大小；如第一个特征层框大小是h:21;w:45;  共6个特征图用于回归
                      (45., 99.),   #越小的框能够得到原图上更多的局部信息，反之得到更多的全局信息；
                      (99., 153.),
                      (153., 207.),
                      (207., 261.),
                      (261., 315.)],
        # anchor_sizes=[(30., 60.),
        #               (60., 111.),
        #               (111., 162.),
        #               (162., 213.),
        #               (213., 264.),
        #               (264., 315.)],
        anchor_ratios=[[2, .5],             #每个特征层上的每个特征点预测的box长宽比及数量；如：block4: def_boxes:4
                       [2, .5, 3, 1./3],    #block7: def_boxes:6   （ratios中的4个+默认的1:1+额外增加的一个=6）
                       [2, .5, 3, 1./3],    #block8: def_boxes:6
                       [2, .5, 3, 1./3],    #block9: def_boxes:6
                       [2, .5],             #block10: def_boxes:4
                       [2, .5]],            #block11: def_boxes:4   #备注：实际上略去了默认的ratio=1以及多加了一个sqrt(初始框宽*初始框高)，后面代码有
        anchor_steps=[8, 16, 32, 64, 100, 300],   #特征图锚点框放大到原始图的缩放比例；
        anchor_offset=0.5,                        #每个锚点框中心点在该特征图cell中心，因此offset=0.5
        normalizations=[20, -1, -1, -1, -1, -1],  #是否归一化，大于0则进行，否则不做归一化；目前看来只对block_4进行正则化，因为该层比较靠前，其norm较大，需做L2正则化（仅仅对每个像素在channel维度做归一化）以保证和后面检测层差异不是很大；
        prior_scaling=[0.1, 0.1, 0.2, 0.2]        #特征图上每个目标与参考框间的尺寸缩放（y,x,h,w）解码时用到
        )
 
    def __init__(self, params=None):      #网络参数的初始化
        """Init the SSD net with some parameters. Use the default ones
        if none provided.
        """
        if isinstance(params, SSDParams):  #是否有参数输入，是则用输入的，否则使用默认的
            self.params = params           #isinstance是python的內建函数，如果参数1与参数2的类型相同则返回true；
        else:
            self.params = SSDNet.default_params
 
    # ======================================================================= #
    def net(self, inputs,                            #定义网络模型
            is_training=True,                        #是否训练
            update_feat_shapes=True,                 #是否更新特征层的尺寸
            dropout_keep_prob=0.5,                   #dropout=0.5
            prediction_fn=slim.softmax,              #采用softmax预测结果
            reuse=None,
            scope='ssd_300_vgg'):                    #网络名：ssd_300_vgg   （基础网络时VGG，输入训练图像size是300x300）
        """SSD network definition.
        """
        r = ssd_net(inputs,                               #网络输入参数r
                    num_classes=self.params.num_classes, 
                    feat_layers=self.params.feat_layers,
                    anchor_sizes=self.params.anchor_sizes,
                    anchor_ratios=self.params.anchor_ratios,
                    normalizations=self.params.normalizations,
                    is_training=is_training,
                    dropout_keep_prob=dropout_keep_prob,
                    prediction_fn=prediction_fn,
                    reuse=reuse,
                    scope=scope)
        # Update feature shapes (try at least!)    #下面这步我的理解就是让读者自行更改特征层的输入，未必论文中介绍的那几个block
        if update_feat_shapes:                                               #是否更新特征层图像尺寸？
            shapes = ssd_feat_shapes_from_net(r[0], self.params.feat_shapes)  #输入特征层图像尺寸以及inputs（应该是预测的特征尺寸），输出更新后的特征图尺寸列表
            self.params = self.params._replace(feat_shapes=shapes)        #将更新的特征图尺寸shapes替换当前的特征图尺寸
        return r                                                      #更新网络输入参数r  
 
    def arg_scope(self, weight_decay=0.0005, data_format='NHWC'):  #定义权重衰减=0.0005，L2正则化项系数；数据类型是NHWC
        """Network arg_scope.
        """
        return ssd_arg_scope(weight_decay, data_format=data_format)
 
    def arg_scope_caffe(self, caffe_scope):
        """Caffe arg_scope used for weights importing.
        """
        return ssd_arg_scope_caffe(caffe_scope)      
 
    # ======================================================================= #
    def update_feature_shapes(self, predictions):    #更新特征形状尺寸（来自预测结果）
        """Update feature shapes from predictions collection (Tensor or Numpy
        array).
        """
        shapes = ssd_feat_shapes_from_net(predictions, self.params.feat_shapes)
        self.params = self.params._replace(feat_shapes=shapes)
 
    def anchors(self, img_shape, dtype=np.float32):             #输入原始图像尺寸；返回每个特征层每个参考锚点框的位置及尺寸信息（x,y,h,w）        
        """Compute the default anchor boxes, given an image shape.
        """
        return ssd_anchors_all_layers(img_shape,                #这是个关键函数；检测所有特征层中的参考锚点框位置和尺寸信息
                                      self.params.feat_shapes,
                                      self.params.anchor_sizes, 
                                      self.params.anchor_ratios,
                                      self.params.anchor_steps,
                                      self.params.anchor_offset,
                                      dtype)
 
    def bboxes_encode(self, labels, bboxes, anchors,  #编码，用于将标签信息，真实目标信息和锚点框信息编码在一起；得到预测真实框到参考框的转换值
                      scope=None):
        """Encode labels and bounding boxes.   
        """
        return ssd_common.tf_ssd_bboxes_encode(
            labels, bboxes, anchors,
            self.params.num_classes,
            self.params.no_annotation_label,      #未标注的标签（应该代表背景）
            ignore_threshold=0.5,                 #IOU筛选阈值
            prior_scaling=self.params.prior_scaling,  #特征图目标与参考框间的尺寸缩放（0.1,0.1,0.2,0.2）
            scope=scope)
 
    def bboxes_decode(self, feat_localizations, anchors,  #解码，用锚点框信息，锚点框与预测真实框间的转换值，得到真是的预测框（ymin,xmin,ymax,xmax）
                      scope='ssd_bboxes_decode'):
        """Encode labels and bounding boxes.
        """
        return ssd_common.tf_ssd_bboxes_decode(
            feat_localizations, anchors,
            prior_scaling=self.params.prior_scaling,
            scope=scope)
 
    def detected_bboxes(self, predictions, localisations,   #通过SSD网络，得到检测到的bbox
                        select_threshold=None, nms_threshold=0.5,
                        clipping_bbox=None, top_k=400, keep_top_k=200):
        """Get the detected bounding boxes from the SSD network output.    
        """
        # Select top_k bboxes from predictions, and clip    #选取top_k=400个框，并对框做修建（超出原图尺寸范围的切掉）
        rscores, rbboxes = \                                                       #得到对应某个类别的得分值以及bbox
            ssd_common.tf_ssd_bboxes_select(predictions, localisations,
                                            select_threshold=select_threshold,
                                            num_classes=self.params.num_classes)
        rscores, rbboxes = \                                     #按照得分高低，筛选出400个bbox和对应得分
            tfe.bboxes_sort(rscores, rbboxes, top_k=top_k)
        # Apply NMS algorithm.                                   #应用非极大值抑制，筛选掉与得分最高bbox重叠率大于0.5的，保留200个
        rscores, rbboxes = \
            tfe.bboxes_nms_batch(rscores, rbboxes,
                                 nms_threshold=nms_threshold,
                                 keep_top_k=keep_top_k)
        if clipping_bbox is not None:
            rbboxes = tfe.bboxes_clip(clipping_bbox, rbboxes)
        return rscores, rbboxes                              #返回裁剪好的bbox和对应得分
#尽管一个ground truth可以与多个先验框匹配，但是ground truth相对先验框还是太少了，
#所以负样本相对正样本会很多。为了保证正负样本尽量平衡，SSD采用了hard negative mining，
#就是对负样本进行抽样，抽样时按照置信度误差（预测背景的置信度越小，误差越大）进行降序排列，
#选取误差的较大的top-k作为训练的负样本，以保证正负样本比例接近1:3
    def losses(self, logits, localisations,
               gclasses, glocalisations, gscores,
               match_threshold=0.5,
               negative_ratio=3.,
               alpha=1.,
               label_smoothing=0.,
               scope='ssd_losses'):
        """Define the SSD network losses.
        """
        return ssd_losses(logits, localisations,
                          gclasses, glocalisations, gscores,
                          match_threshold=match_threshold,
                          negative_ratio=negative_ratio,
                          alpha=alpha,
                          label_smoothing=label_smoothing,
                          scope=scope)
 
 
# =========================================================================== #
# SSD tools...
# =========================================================================== #
def ssd_size_bounds_to_values(size_bounds,
                              n_feat_layers,
                              img_shape=(300, 300)):
    """Compute the reference sizes of the anchor boxes from relative bounds.
    The absolute values are measured in pixels, based on the network
    default size (300 pixels).
    This function follows the computation performed in the original
    implementation of SSD in Caffe.
    Return:
      list of list containing the absolute sizes at each scale. For each scale,
      the ratios only apply to the first value.
    """
    assert img_shape[0] == img_shape[1]
 
    img_size = img_shape[0]
    min_ratio = int(size_bounds[0] * 100)
    max_ratio = int(size_bounds[1] * 100)
    step = int(math.floor((max_ratio - min_ratio) / (n_feat_layers - 2)))
    # Start with the following smallest sizes.
    sizes = [[img_size * size_bounds[0] / 2, img_size * size_bounds[0]]]
    for ratio in range(min_ratio, max_ratio + 1, step):
        sizes.append((img_size * ratio / 100.,
                      img_size * (ratio + step) / 100.))
    return sizes
 
 
def ssd_feat_shapes_from_net(predictions, default_shapes=None):
    """Try to obtain the feature shapes from the prediction layers. The latter
    can be either a Tensor or Numpy ndarray.
    Return:
      list of feature shapes. Default values if predictions shape not fully
      determined.
    """
    feat_shapes = []
    for l in predictions:              #l:是预测的特征形状
        # Get the shape, from either a np array or a tensor.
        if isinstance(l, np.ndarray):  #如果l是np.ndarray类型，则将l的形状赋给shape；否则将shape作为list
            shape = l.shape
        else:
            shape = l.get_shape().as_list()
        shape = shape[1:4]
        # Problem: undetermined shape...     #如果预测的特征尺寸未定，则使用默认的形状；否则将shape中的值赋给特征形状列表中
        if None in shape:  
            return default_shapes
        else:
            feat_shapes.append(shape)
    return feat_shapes                        #返回更新后的特征尺寸list
 
 ##计算一个层layer中每个bounding boxes的xyhw信息，返回4个tensor
def ssd_anchor_one_layer(img_shape,         #检测单个特征图中所有锚点的坐标和尺寸信息（未与原图做除法）
                         feat_shape,
                         sizes,
                         ratios,
                         step,
                         offset=0.5,
                         dtype=np.float32):
    """Computer SSD default anchor boxes for one feature layer.
    Determine the relative position grid of the centers, and the relative
    width and height.
    Arguments:
      feat_shape: Feature shape, used for computing relative position grids;
      size: Absolute reference sizes;
      ratios: Ratios to use on these features;
      img_shape: Image shape, used for computing height, width relatively to the
        former;
      offset: Grid offset.
    Return:
      y, x, h, w: Relative x and y grids, and height and width.
    """
    # Compute the position grid: simple way.
    # y, x = np.mgrid[0:feat_shape[0], 0:feat_shape[1]]
    # y = (y.astype(dtype) + offset) / feat_shape[0]
    # x = (x.astype(dtype) + offset) / feat_shape[1]
    # Weird SSD-Caffe computation using steps values...    #归一化到原图的锚点中心坐标（x,y）;其坐标值域为(0,1)
    y, x = np.mgrid[0:feat_shape[0], 0:feat_shape[1]]      #对于第一个特征图（block4：38x38）；y=[[0,0,……0],[1,1,……1]，……[37,37,……，37]]；而x=[[0,1,2……，37]，[0,1,2……，37],……[0,1,2……，37]]
    y = (y.astype(dtype) + offset) * step / img_shape[0]   #将38个cell对应锚点框的y坐标偏移至每个cell中心，然后乘以相对原图缩放的比例，再除以原图
    x = (x.astype(dtype) + offset) * step / img_shape[1]   #可以得到在原图上，相对原图比例大小的每个锚点中心坐标x,y
 
    # Expand dims to support easy broadcasting.    #将锚点中心坐标扩大维度
    y = np.expand_dims(y, axis=-1)   #对于第一个特征图，y的shape=38x38x1；x的shape=38x38x1
    x = np.expand_dims(x, axis=-1)
 ## 生成y=[[[0],[0],……[0]],[[1],[1],……[1]]，……[[37],[37],……，[37]]]；x=[[[0],[1],[2]……，[37]]，[[0],[1],[2]……，[37]],……[[0],[1],[2]……，[37]]]
 ## 要是x-h，则[0]会扩展成[-h -h -h -h]，h有多少维就扩展为h同样的shape
 ## 参考：https://www.cnblogs.com/Time-LCJ/p/9233370.html
    # Compute relative height and width.
    # Tries to follow the original implementation of SSD for the order.
    num_anchors = len(sizes) + len(ratios)      #该特征图上每个点对应的锚点框数量;如：对于第一个特征图每个点预测4个锚点框（block4：38x38），2+2=4
    h = np.zeros((num_anchors, ), dtype=dtype)     #对于第一个特征图，h的shape=4x；w的shape=4x
    w = np.zeros((num_anchors, ), dtype=dtype)
    # Add first anchor boxes with ratio=1.
    h[0] = sizes[0] / img_shape[0]            #第一个锚点框的高h[0]=起始锚点的高/原图大小的高；例如：h[0]=21/300
    w[0] = sizes[0] / img_shape[1]            #第一个锚点框的宽w[0]=起始锚点的宽/原图大小的宽；例如：w[0]=21/300
    di = 1  #锚点宽个数偏移
    if len(sizes) > 1:                     
        h[1] = math.sqrt(sizes[0] * sizes[1]) / img_shape[0]    #第二个锚点框的高h[1]=sqrt（起始锚点的高*起始锚点的宽）/原图大小的高；例如：h[1]=sqrt(21*45)/300
        w[1] = math.sqrt(sizes[0] * sizes[1]) / img_shape[1]    #第二个锚点框的高w[1]=sqrt（起始锚点的高*起始锚点的宽）/原图大小的宽；例如：w[1]=sqrt(21*45)/300
        di += 1     #di=2
    for i, r in enumerate(ratios):                            #遍历长宽比例，第一个特征图，r只有两个，2和0.5；共四个锚点宽size（h[0]~h[3]）
        h[i+di] = sizes[0] / img_shape[0] / math.sqrt(r)      #例如：对于第一个特征图，h[0+2]=h[2]=21/300/sqrt(2);w[0+2]=w[2]=45/300*sqrt(2)
        w[i+di] = sizes[0] / img_shape[1] * math.sqrt(r)      #例如：对于第一个特征图，h[1+2]=h[3]=21/300/sqrt(0.5);w[1+2]=w[3]=45/300*sqrt(0.5)
    return y, x, h, w                                         #返回没有归一化前的锚点坐标和尺寸
 
##计算全部层layer中每个bounding boxes的xyhw信息，组成一个tensor
def ssd_anchors_all_layers(img_shape,                 #检测所有特征图中锚点框的四个坐标信息； 输入原始图大小
                           layers_shape,              #每个特征层形状尺寸
                           anchor_sizes,              #起始特征图中框的长宽size
                           anchor_ratios,             #锚点框长宽比列表
                           anchor_steps,              #锚点框相对原图缩放比例
                           offset=0.5,                #锚点中心在每个特征图cell中的偏移
                           dtype=np.float32):            
    """Compute anchor boxes for all feature layers.
    """
    layers_anchors = []           #用于存放所有特征图中锚点框位置尺寸信息
    for i, s in enumerate(layers_shape):                 #6个特征图尺寸；如：第0个是38x38
        anchor_bboxes = ssd_anchor_one_layer(img_shape, s,     #分别计算每个特征图中锚点框的位置尺寸信息；
                                             anchor_sizes[i],    #输入：第i个特征图中起始锚点框大小；如第0个是(21., 45.)
                                             anchor_ratios[i],   #输入：第i个特征图中锚点框长宽比列表；如第0个是[2, .5]
                                             anchor_steps[i],    #输入：第i个特征图中锚点框相对原始图的缩放比；如第0个是8
                                             offset=offset, dtype=dtype)  #输入：锚点中心在每个特征图cell中的偏移
        layers_anchors.append(anchor_bboxes)              #将6个特征图中每个特征图上的点对应的锚点框（6个或4个）保存
    return layers_anchors
 ## layers_anchors：全部特征层的bounding boxes的y, x, h, w信息组成的tensor，以第一个特征层为例：
 ## y维度：38x38x-1
 ## x维度：38x38x-1
 ## h维度：4x1
 ## w维度：4x1
 ## 后续求中心坐标等信息时，xy和hw相减，会产生38x38x4的tensor

# =========================================================================== #
# Functional definition of VGG-based SSD 300.
# =========================================================================== #
def tensor_shape(x, rank=3):
    """Returns the dimensions of a tensor.
    Args:
      image: A N-D Tensor of shape.
    Returns:
      A list of dimensions. Dimensions that are statically known are python
        integers,otherwise they are integer scalar tensors.
    """
    if x.get_shape().is_fully_defined():
        return x.get_shape().as_list()
    else:
        static_shape = x.get_shape().with_rank(rank).as_list()
        dynamic_shape = tf.unstack(tf.shape(x), rank)
        return [s if s is not None else d
                for s, d in zip(static_shape, dynamic_shape)]
 
 ##功能：输入特征层等信息，输出相应的分类和回归信息，就是一个预测的过程
def ssd_multibox_layer(inputs,                    #输入特征层
                       num_classes,               #类别数
                       sizes,                     #参考先验框的尺度
                       ratios=[1],  ssd_anchors_all_layers              #默认的先验框长宽比为1
                       normalization=-1,          #默认不做正则化
                       bn_normalization=False):
    """Construct a multibox layer, return a class and localization predictions.
    """
    net = inputs
    if normalization > 0:    #如果输入整数，则进行L2正则化
        net = custom_layers.l2_normalization(net, scaling=True)    #对通道所在维度进行正则化，随后乘以gamma缩放系数
    # Number of anchors.
    num_anchors = len(sizes) + len(ratios)  #每层特征图参考先验框的个数[4,6,6,6,4,4]
 
    # Location.     #每个先验框对应4个坐标信息
    num_loc_pred = num_anchors * 4    #特征图上每个单元预测的坐标所需维度=锚点框数*4
    loc_pred = slim.conv2d(net, num_loc_pred, [3, 3], activation_fn=None,   #通过对特征图进行3x3卷积得到位置信息和类别权重信息
                           scope='conv_loc')                                #该部分是定位信息，输出维度为[特征图h,特征图w,每个单元所有锚点框坐标]
    loc_pred = custom_layers.channel_to_last(loc_pred)                      ##这个操作实际上是把tensor的通道数放在最后，不过TensorFlow中本来就是这样的，可以不用这个操作
    loc_pred = tf.reshape(loc_pred,                                         #最后整个特征图所有锚点框预测目标位置 tensor为[h*w*每个cell先验框数，4]
                          tensor_shape(loc_pred, 4)[:-1]+[num_anchors, 4])  #zfj#特征图里面每一个先验框对应4个位置值信息
    # Class prediction.                                                #类别预测
    num_cls_pred = num_anchors * num_classes                            #特征图上每个单元预测的类别所需维度=锚点框数*种类数
    cls_pred = slim.conv2d(net, num_cls_pred, [3, 3], activation_fn=None, #该部分是类别信息，输出维度为[特征图h,特征图w,每个单元所有锚点框对应类别信息]
                           scope='conv_cls')
    cls_pred = custom_layers.channel_to_last(cls_pred)
    cls_pred = tf.reshape(cls_pred,
                          tensor_shape(cls_pred, 4)[:-1]+[num_anchors, num_classes]) #最后整个特征图所有锚点框预测类别 tensor为[h*w*每个cell先验框数，种类数]
                                                                            #zfj#特征图里面每一个先验框对应一个类别信息（num_classes中的一个）
    return cls_pred, loc_pred  #返回预测得到的类别和box位置 tensor
 
 
def ssd_net(inputs,                                             #定义ssd网络结构
            num_classes=SSDNet.default_params.num_classes,      #分类数
            feat_layers=SSDNet.default_params.feat_layers,      #特征层
            anchor_sizes=SSDNet.default_params.anchor_sizes,
            anchor_ratios=SSDNet.default_params.anchor_ratios,
            normalizations=SSDNet.default_params.normalizations, #正则化
            is_training=True,
            dropout_keep_prob=0.5,
            prediction_fn=slim.softmax,
            reuse=None,
            scope='ssd_300_vgg'):
    """SSD net definition.
    """
    # if data_format == 'NCHW':
    #     inputs = tf.transpose(inputs, perm=(0, 3, 1, 2))
 
    # End_points collect relevant activations for external use.
    end_points = {}   #用于收集每一层输出结果
    with tf.variable_scope(scope, 'ssd_300_vgg', [inputs], reuse=reuse):
        # Original VGG-16 blocks.
        net = slim.repeat(inputs, 2, slim.conv2d, 64, [3, 3], scope='conv1')    #VGG16网络的第一个conv，重复2次卷积，核为3x3,64个特征
        end_points['block1'] = net                                              #conv1_2结果存入end_points，name='block1'
        net = slim.max_pool2d(net, [2, 2], scope='pool1')
        # Block 2.
        net = slim.repeat(net, 2, slim.conv2d, 128, [3, 3], scope='conv2')      #重复2次卷积，核为3x3,128个特征
        end_points['block2'] = net                                              #conv2_2结果存入end_points，name='block2'
        net = slim.max_pool2d(net, [2, 2], scope='pool2')
        # Block 3.
        net = slim.repeat(net, 3, slim.conv2d, 256, [3, 3], scope='conv3')      #重复3次卷积，核为3x3,256个特征
        end_points['block3'] = net                                              #conv3_3结果存入end_points，name='block3'
        net = slim.max_pool2d(net, [2, 2], scope='pool3')
        # Block 4.
        net = slim.repeat(net, 3, slim.conv2d, 512, [3, 3], scope='conv4')       #重复3次卷积，核为3x3,512个特征
        end_points['block4'] = net                                               #conv4_3结果存入end_points，name='block4'
        net = slim.max_pool2d(net, [2, 2], scope='pool4')                      #zfj#检测所用的输出的图为：38*38*512；本块的pool在提取特征图之后
        # Block 5.
        net = slim.repeat(net, 3, slim.conv2d, 512, [3, 3], scope='conv5')       #重复3次卷积，核为3x3,512个特征
        end_points['block5'] = net                                               #conv5_3结果存入end_points，name='block5'
        net = slim.max_pool2d(net, [3, 3], stride=1, scope='pool5')                
 
        # Additional SSD blocks.                                                  #去掉了VGG的全连接层
        # Block 6: let's dilate the hell out of it!
        net = slim.conv2d(net, 1024, [3, 3], rate=6, scope='conv6')               #将VGG基础网络最后的池化层结果做扩展卷积（带孔卷积）；
        end_points['block6'] = net                                                #conv6结果存入end_points，name='block6' 
        net = tf.layers.dropout(net, rate=dropout_keep_prob, training=is_training) #dropout层
        # Block 7: 1x1 conv. Because the fuck.
        net = slim.conv2d(net, 1024, [1, 1], scope='conv7')                          #将dropout后的网络做1x1卷积，输出1024特征，name='block7'
        end_points['block7'] = net                                              #zfj# 19*19*1024
        net = tf.layers.dropout(net, rate=dropout_keep_prob, training=is_training)   #将卷积后的网络继续做dropout
 
        # Block 8/9/10/11: 1x1 and 3x3 convolutions stride 2 (except lasts).
        end_point = 'block8'                                                         
        with tf.variable_scope(end_point):
            net = slim.conv2d(net, 256, [1, 1], scope='conv1x1')           #对上述dropout的网络做1x1卷积，然后做3x3卷积，,输出512特征图，name=‘block8’
            net = custom_layers.pad2d(net, pad=(1, 1))      #zfj# x和y方向都向外扩张1个像素，填0
            net = slim.conv2d(net, 512, [3, 3], stride=2, scope='conv3x3', padding='VALID') #stride=2的conv来代替了pool的功能  #zfj#VALID说明不是same的conv，顺便压缩大小
        end_points[end_point] = net                                             #zfj# 10*10*512
        end_point = 'block9'
        with tf.variable_scope(end_point):
            net = slim.conv2d(net, 128, [1, 1], scope='conv1x1')                 #对上述网络做1x1卷积，然后做3x3卷积，输出256特征图，name=‘block9’
            net = custom_layers.pad2d(net, pad=(1, 1))          #zfj# 和下一句加起来，功能类似于same的conv
            net = slim.conv2d(net, 256, [3, 3], stride=2, scope='conv3x3', padding='VALID')     
        end_points[end_point] = net                                              #zfj# 5*5*256
        end_point = 'block10'
        with tf.variable_scope(end_point):
            net = slim.conv2d(net, 128, [1, 1], scope='conv1x1')               #对上述网络做1x1卷积，然后做3x3卷积，输出256特征图，name=‘block10’
            net = slim.conv2d(net, 256, [3, 3], scope='conv3x3', padding='VALID')
        end_points[end_point] = net                                            #zfj# 3*3*256
        end_point = 'block11'
        with tf.variable_scope(end_point):
            net = slim.conv2d(net, 128, [1, 1], scope='conv1x1')              #对上述网络做1x1卷积，然后做3x3卷积，输出256特征图，name=‘block11’
            net = slim.conv2d(net, 256, [3, 3], scope='conv3x3', padding='VALID')
        end_points[end_point] = net                                              #zfj# 1*1*256
 
        # Prediction and localisations layers. #预测和定位
        predictions = []
        logits = []
        localisations = []
        for i, layer in enumerate(feat_layers):               #遍历特征层
            with tf.variable_scope(layer + '_box'):                #起个命名范围   
                p, l = ssd_multibox_layer(end_points[layer],   #做多尺度大小box预测的特征层，返回每个cell中每个先验框预测的类别p和预测的位置l
                                          num_classes,         #种类数
                                          anchor_sizes[i],     #先验框尺度（同一特征图上的先验框尺度和长宽比一致）
                                          anchor_ratios[i],    #先验框长宽比
                                          normalizations[i])   #每个特征正则化信息，目前是只对第一个特征图做归一化操作；
             #把每一层的预测收集
			predictions.append(prediction_fn(p))  #prediction_fn为softmax，预测类别概率
            logits.append(p) #把每个cell每个先验框预测的类别的conv输出值存在logits中
            localisations.append(l)   #预测位置信息
 
        return predictions, localisations, logits, end_points  #返回类别预测概率值，位置预测结果，所属某个类别的conv输出值，以及特征层
ssd_net.default_image_size = 300
 
 
def ssd_arg_scope(weight_decay=0.0005, data_format='NHWC'):  #权重衰减系数=0.0005；其是L2正则化项的系数
    """Defines the VGG arg scope.
    Args:
      weight_decay: The l2 regularization coefficient.
    Returns:
      An arg_scope.
    """
    with slim.arg_scope([slim.conv2d, slim.fully_connected],
                        activation_fn=tf.nn.relu,
                        weights_regularizer=slim.l2_regularizer(weight_decay),
                        weights_initializer=tf.contrib.layers.xavier_initializer(),
                        biases_initializer=tf.zeros_initializer()):
        with slim.arg_scope([slim.conv2d, slim.max_pool2d],
                            padding='SAME',
                            data_format=data_format):
            with slim.arg_scope([custom_layers.pad2d,
                                 custom_layers.l2_normalization,
                                 custom_layers.channel_to_last],
                                data_format=data_format) as sc:
                return sc
 
 
# =========================================================================== #
# Caffe scope: importing weights at initialization.
# =========================================================================== #
def ssd_arg_scope_caffe(caffe_scope):
    """Caffe scope definition.
    Args:
      caffe_scope: Caffe scope object with loaded weights.
    Returns:
      An arg_scope.
    """
    # Default network arg scope.
    with slim.arg_scope([slim.conv2d],
                        activation_fn=tf.nn.relu,
                        weights_initializer=caffe_scope.conv_weights_init(),
                        biases_initializer=caffe_scope.conv_biases_init()):
        with slim.arg_scope([slim.fully_connected],
                            activation_fn=tf.nn.relu):
            with slim.arg_scope([custom_layers.l2_normalization],
                                scale_initializer=caffe_scope.l2_norm_scale_init()):
                with slim.arg_scope([slim.conv2d, slim.max_pool2d],
                                    padding='SAME') as sc:
                    return sc
 
 
# =========================================================================== #
# SSD loss function.
# =========================================================================== #
def ssd_losses(logits, localisations,    #损失函数定义为位置误差和置信度误差的加权和；
               gclasses, glocalisations, gscores,
               match_threshold=0.5,
               negative_ratio=3.,  
               alpha=1.,             #位置误差权重系数
               label_smoothing=0.,
               device='/cpu:0',
               scope=None):
    with tf.name_scope(scope, 'ssd_losses'):
        lshape = tfe.get_shape(logits[0], 5)
        num_classes = lshape[-1]
        batch_size = lshape[0]
 
        # Flatten out all vectors!      ##将所有向量展平
        flogits = []
        fgclasses = []
        fgscores = []
        flocalisations = []
        fglocalisations = []
        for i in range(len(logits)):
            flogits.append(tf.reshape(logits[i], [-1, num_classes]))    #将类别的概率值reshape成（-1,21）
            fgclasses.append(tf.reshape(gclasses[i], [-1]))             #真实类别
            fgscores.append(tf.reshape(gscores[i], [-1]))               #预测真实目标的得分
            flocalisations.append(tf.reshape(localisations[i], [-1, 4]))  #预测真实目标边框坐标（编码形式的值）
            fglocalisations.append(tf.reshape(glocalisations[i], [-1, 4])) #用于将真实目标gt的坐标进行编码存储
        # And concat the crap!
        logits = tf.concat(flogits, axis=0)
        gclasses = tf.concat(fgclasses, axis=0)
        gscores = tf.concat(fgscores, axis=0)
        localisations = tf.concat(flocalisations, axis=0)
        glocalisations = tf.concat(fglocalisations, axis=0)
        dtype = logits.dtype
 
        # Compute positive matching mask...
        pmask = gscores > match_threshold  #预测框与真实框IOU>0.5则将这个先验作为正样本
        fpmask = tf.cast(pmask, dtype)          ##记录正样本的位置，展平，像上面那样格式
        n_positives = tf.reduce_sum(fpmask)  #求正样本数量N
 
        # Hard negative mining... 为了保证正负样本尽量平衡，SSD采用了hard negative mining，就是对负样本进行抽样，抽样时按照置信度误差（预测背景的置信度越小，误差越大）进行降序排列，选取误差的较大的top-k作为训练的负样本，以保证正负样本比例接近1:3
        no_classes = tf.cast(pmask, tf.int32)
        predictions = slim.softmax(logits)   #类别预测
        nmask = tf.logical_and(tf.logical_not(pmask),
                               gscores > -0.5)
        fnmask = tf.cast(nmask, dtype)     
        nvalues = tf.where(nmask,
                           predictions[:, 0],
                           1. - fnmask)    ##得到的是：正样本的位置标1，负样本的位置，标类别0的得分
        nvalues_flat = tf.reshape(nvalues, [-1])
        # Number of negative entries to select.
        max_neg_entries = tf.cast(tf.reduce_sum(fnmask), tf.int32)  
        n_neg = tf.cast(negative_ratio * n_positives, tf.int32) + batch_size #负样本数量，保证是正样本3倍   #zfj#这里可能错了
        n_neg = tf.minimum(n_neg, max_neg_entries)
 
        val, idxes = tf.nn.top_k(-nvalues_flat, k=n_neg) #抽样时按照置信度误差（预测背景的置信度越小，误差越大）进行降序排列，选取误差的较大的top-k作为训练的负样本
        max_hard_pred = -val[-1]                            ##没有和GT相交的，且当前logits最小的K个。k取非正样本和正样本*3，两者中的最小值
        # Final negative mask.
        nmask = tf.logical_and(nmask, nvalues < max_hard_pred)
        fnmask = tf.cast(nmask, dtype)   #zfj#记录负样本的位置
 
        # Add cross-entropy loss.    #交叉熵
        with tf.name_scope('cross_entropy_pos'):
            loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits,   #类别置信度误差
                                                                  labels=gclasses)
            loss = tf.div(tf.reduce_sum(loss * fpmask), batch_size, name='value')  #将置信度误差除以正样本数后除以batch-size
            tf.losses.add_loss(loss)
 
        with tf.name_scope('cross_entropy_neg'):
            loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits,
                                                                  labels=no_classes)
            loss = tf.div(tf.reduce_sum(loss * fnmask), batch_size, name='value')
            tf.losses.add_loss(loss)
 
        # Add localization loss: smooth L1, L2, ...
        with tf.name_scope('localization'):         ## 这个localization只关注正样本的框
            # Weights Tensor: positive mask + random negative.
            weights = tf.expand_dims(alpha * fpmask, axis=-1)
            loss = custom_layers.abs_smooth(localisations - glocalisations)   #先验框对应边界的位置预测值-真实位置；然后做Smooth L1 loss
            loss = tf.div(tf.reduce_sum(loss * weights), batch_size, name='value')  #将上面的loss*权重（=alpha/正样本数）求和后除以batch-size  
            tf.losses.add_loss(loss)  #获得置信度误差和位置误差的加权和
 
 
def ssd_losses_old(logits, localisations,
                   gclasses, glocalisations, gscores,
                   match_threshold=0.5,
                   negative_ratio=3.,
                   alpha=1.,
                   label_smoothing=0.,
                   device='/cpu:0',
                   scope=None):
    """Loss functions for training the SSD 300 VGG network.
    This function defines the different loss components of the SSD, and
    adds them to the TF loss collection.
    Arguments:
      logits: (list of) predictions logits Tensors;
      localisations: (list of) localisations Tensors;
      gclasses: (list of) groundtruth labels Tensors;
      glocalisations: (list of) groundtruth localisations Tensors;
      gscores: (list of) groundtruth score Tensors;
    """
    with tf.device(device):
        with tf.name_scope(scope, 'ssd_losses'):
            l_cross_pos = []
            l_cross_neg = []
            l_loc = []
            for i in range(len(logits)):
                dtype = logits[i].dtype
                with tf.name_scope('block_%i' % i):
                    # Sizing weight...
                    wsize = tfe.get_shape(logits[i], rank=5)
                    wsize = wsize[1] * wsize[2] * wsize[3]
 
                    # Positive mask.
                    pmask = gscores[i] > match_threshold
                    fpmask = tf.cast(pmask, dtype)
                    n_positives = tf.reduce_sum(fpmask)
 
                    # Select some random negative entries.
                    # n_entries = np.prod(gclasses[i].get_shape().as_list())
                    # r_positive = n_positives / n_entries
                    # r_negative = negative_ratio * n_positives / (n_entries - n_positives)
 
                    # Negative mask.
                    no_classes = tf.cast(pmask, tf.int32)
                    predictions = slim.softmax(logits[i])
                    nmask = tf.logical_and(tf.logical_not(pmask),
                                           gscores[i] > -0.5)
                    fnmask = tf.cast(nmask, dtype)
                    nvalues = tf.where(nmask,
                                       predictions[:, :, :, :, 0],
                                       1. - fnmask)
                    nvalues_flat = tf.reshape(nvalues, [-1])
                    # Number of negative entries to select.
                    n_neg = tf.cast(negative_ratio * n_positives, tf.int32)
                    n_neg = tf.maximum(n_neg, tf.size(nvalues_flat) // 8)
                    n_neg = tf.maximum(n_neg, tf.shape(nvalues)[0] * 4)
                    max_neg_entries = 1 + tf.cast(tf.reduce_sum(fnmask), tf.int32)
                    n_neg = tf.minimum(n_neg, max_neg_entries)
 
                    val, idxes = tf.nn.top_k(-nvalues_flat, k=n_neg)
                    max_hard_pred = -val[-1]
                    # Final negative mask.
                    nmask = tf.logical_and(nmask, nvalues < max_hard_pred)
                    fnmask = tf.cast(nmask, dtype)
 
                    # Add cross-entropy loss.
                    with tf.name_scope('cross_entropy_pos'):
                        fpmask = wsize * fpmask
                        loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits[i],
                                                                              labels=gclasses[i])
                        loss = tf.losses.compute_weighted_loss(loss, fpmask)
                        l_cross_pos.append(loss)
 
                    with tf.name_scope('cross_entropy_neg'):
                        fnmask = wsize * fnmask
                        loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits[i],
                                                                              labels=no_classes)
                        loss = tf.losses.compute_weighted_loss(loss, fnmask)
                        l_cross_neg.append(loss)
 
                    # Add localization loss: smooth L1, L2, ...
                    with tf.name_scope('localization'):
                        # Weights Tensor: positive mask + random negative.
                        weights = tf.expand_dims(alpha * fpmask, axis=-1)
                        loss = custom_layers.abs_smooth(localisations[i] - glocalisations[i])
                        loss = tf.losses.compute_weighted_loss(loss, weights)
                        l_loc.append(loss)
 
            # Additional total losses...
            with tf.name_scope('total'):
                total_cross_pos = tf.add_n(l_cross_pos, 'cross_entropy_pos')
                total_cross_neg = tf.add_n(l_cross_neg, 'cross_entropy_neg')
                total_cross = tf.add(total_cross_pos, total_cross_neg, 'cross_entropy')
                total_loc = tf.add_n(l_loc, 'localization')
 
                # Add to EXTRA LOSSES TF.collection
                tf.add_to_collection('EXTRA_LOSSES', total_cross_pos)
                tf.add_to_collection('EXTRA_LOSSES', total_cross_neg)
                tf.add_to_collection('EXTRA_LOSSES', total_cross)
                tf.add_to_collection('EXTRA_LOSSES', total_loc)

custom_layers.py：

import tensorflow as tf
 
from tensorflow.contrib.framework.python.ops import add_arg_scope
from tensorflow.contrib.layers.python.layers import initializers
from tensorflow.contrib.framework.python.ops import variables
from tensorflow.contrib.layers.python.layers import utils
from tensorflow.python.ops import nn
from tensorflow.python.ops import init_ops
from tensorflow.python.ops import variable_scope
 
 
def abs_smooth(x):
    """Smoothed absolute function. Useful to compute an L1 smooth error.  当预测值与目标值相差很大时, 梯度容易爆炸,因此L1 loss对噪声（outliers）更鲁棒
    Define as:
        x^2 / 2         if abs(x) < 1
        abs(x) - 0.5    if abs(x) > 1
    We use here a differentiable definition using min(x) and abs(x). Clearly
    not optimal, but good enough for our purpose!
    """
    absx = tf.abs(x)
    minx = tf.minimum(absx, 1)
    r = 0.5 * ((absx - 1) * minx + absx)   #计算得到L1 smooth loss
    return r
 
 
@add_arg_scope
def l2_normalization(  #L2正则化：稀疏正则化操作
        inputs,   #输入特征层，[batch_size,h,w,c]
        scaling=False,  #默认归一化后是否设置缩放变量gamma
        scale_initializer=init_ops.ones_initializer(),   #scale初始化为1
        reuse=None,
        variables_collections=None,
        outputs_collections=None,
        data_format='NHWC', 
        trainable=True,
        scope=None):
    """Implement L2 normalization on every feature (i.e. spatial normalization).
    Should be extended in some near future to other dimensions, providing a more
    flexible normalization framework.
    Args:
      inputs: a 4-D tensor with dimensions [batch_size, height, width, channels].
      scaling: whether or not to add a post scaling operation along the dimensions
        which have been normalized.
      scale_initializer: An initializer for the weights.
      reuse: whether or not the layer and its variables should be reused. To be
        able to reuse the layer scope must be given.
      variables_collections: optional list of collections for all the variables or
        a dictionary containing a different list of collection per variable.
      outputs_collections: collection to add the outputs.
      data_format:  NHWC or NCHW data format.
      trainable: If `True` also add variables to the graph collection
        `GraphKeys.TRAINABLE_VARIABLES` (see tf.Variable).
      scope: Optional scope for `variable_scope`.
    Returns:
      A `Tensor` representing the output of the operation.
    """
 
    with variable_scope.variable_scope(
            scope, 'L2Normalization', [inputs], reuse=reuse) as sc:
        inputs_shape = inputs.get_shape()   #得到输入特征层的维度信息
        inputs_rank = inputs_shape.ndims    #维度数=4
        dtype = inputs.dtype.base_dtype     #数据类型
        if data_format == 'NHWC':
            # norm_dim = tf.range(1, inputs_rank-1)
            norm_dim = tf.range(inputs_rank-1, inputs_rank)  #需要正则化的维度是4-1=3即channel这个维度
            params_shape = inputs_shape[-1:]                #通道数
        elif data_format == 'NCHW':              
            # norm_dim = tf.range(2, inputs_rank)   
            norm_dim = tf.range(1, 2)                #需要正则化的维度是第1维，即channel这个维度
            params_shape = (inputs_shape[1])         #通道数
 
        # Normalize along spatial dimensions.
        outputs = nn.l2_normalize(inputs, norm_dim, epsilon=1e-12)   #对通道所在维度进行正则化，其中epsilon是避免除0风险
        # Additional scaling.
        if scaling:                   #判断是否对正则化后设置缩放变量
            scale_collections = utils.get_variable_collections(
                variables_collections, 'scale')
            scale = variables.model_variable('gamma',
                                             shape=params_shape,
                                             dtype=dtype,
                                             initializer=scale_initializer,
                                             collections=scale_collections,
                                             trainable=trainable)
            if data_format == 'NHWC':
                outputs = tf.multiply(outputs, scale)     
            elif data_format == 'NCHW':
                scale = tf.expand_dims(scale, axis=-1)
                scale = tf.expand_dims(scale, axis=-1)
                outputs = tf.multiply(outputs, scale)
                # outputs = tf.transpose(outputs, perm=(0, 2, 3, 1))
 
        return utils.collect_named_outputs(outputs_collections,      #即返回L2_norm*gamma
                                           sc.original_name_scope, outputs)
 
 
@add_arg_scope
def pad2d(inputs,
          pad=(0, 0),
          mode='CONSTANT',
          data_format='NHWC',
          trainable=True,
          scope=None):
    """2D Padding layer, adding a symmetric padding to H and W dimensions.
    Aims to mimic padding in Caffe and MXNet, helping the port of models to
    TensorFlow. Tries to follow the naming convention of `tf.contrib.layers`.
    Args:
      inputs: 4D input Tensor;
      pad: 2-Tuple with padding values for H and W dimensions;
      mode: Padding mode. C.f. `tf.pad`
      data_format:  NHWC or NCHW data format.
    """
    with tf.name_scope(scope, 'pad2d', [inputs]):
        # Padding shape.
        if data_format == 'NHWC':
            paddings = [[0, 0], [pad[0], pad[0]], [pad[1], pad[1]], [0, 0]]
        elif data_format == 'NCHW':
            paddings = [[0, 0], [0, 0], [pad[0], pad[0]], [pad[1], pad[1]]]
        net = tf.pad(inputs, paddings, mode=mode)
        return net
 
 
@add_arg_scope
def channel_to_last(inputs,    #作用，将输入的特征图网络的通道维度放在最后，返回变形后的网络
                    data_format='NHWC',
                    scope=None):
    """Move the channel axis to the last dimension. Allows to
    provide a single output format whatever the input data format.
    Args:
      inputs: Input Tensor;
      data_format: NHWC or NCHW.
    Return:
      Input in NHWC format.
    """
    with tf.name_scope(scope, 'channel_to_last', [inputs]):
        if data_format == 'NHWC':
            net = inputs
        elif data_format == 'NCHW':
            net = tf.transpose(inputs, perm=(0, 2, 3, 1))
        return net

ssd_common.py：

import numpy as np
import tensorflow as tf
import tf_extended as tfe
 
 
# =========================================================================== #
# TensorFlow implementation of boxes SSD encoding / decoding.
# =========================================================================== #
def tf_ssd_bboxes_encode_layer(labels,          #gt标签，1D的tensor
                               bboxes,          #Nx4的Tensor(float)，真实的bbox    ##一个GT对应一个bboxes
                               anchors_layer,      #参考锚点list
                               num_classes,        #分类类别数
                               no_annotation_label,
                               ignore_threshold=0.5,                #gt和锚点框间的匹配阈值，大于该值则为正样本
                               prior_scaling=[0.1, 0.1, 0.2, 0.2],  #真实值到预测值转换中用到的缩放
                               dtype=tf.float32):
    """Encode groundtruth labels and bounding boxes using SSD anchors from
    one layer.
    Arguments:
      labels: 1D Tensor(int64) containing groundtruth labels;
      bboxes: Nx4 Tensor(float) with bboxes relative coordinates;
      anchors_layer: Numpy array with layer anchors;
      matching_threshold: Threshold for positive match with groundtruth bboxes;
      prior_scaling: Scaling of encoded coordinates.
    Return:
      (target_labels, target_localizations, target_scores): Target Tensors.   返回：包含目标标签类别，目标位置，目标置信度的tesndor
    """
    # Anchors coordinates and volume.
    yref, xref, href, wref = anchors_layer   #此前每个特征图上点对应生成的锚点框作为参考框
    ymin = yref - href / 2.                  #求参考框的左上角点（xmin,ymin）和右下角点(xmax,ymax)
    xmin = xref - wref / 2.   #yref和xref的shape为（38,38,1）；href和wref的shape为（4，）
    ymax = yref + href / 2.
    xmax = xref + wref / 2.   ##这里左边的4个tensor维度都是: 38x38x4
    vol_anchors = (xmax - xmin) * (ymax - ymin) #求参考框面积vol_anchors    ## 38x38x4
 
    # Initialize tensors...                            #shape表示每个特征图上总锚点数
    shape = (yref.shape[0], yref.shape[1], href.size)  #对于第一个特征图，shape=(38,38,4)；第二个特征图的shape=(19,19,6)
    feat_labels = tf.zeros(shape, dtype=tf.int64)   #初始化每个特征图上的点对应的各个box所属标签维度 如：38x38x4
    feat_scores = tf.zeros(shape, dtype=dtype)      #初始化每个特征图上的点对应的各个box所属标目标的得分值维度 如：38x38x4   ##IoU得分
 
    feat_ymin = tf.zeros(shape, dtype=dtype)    #预测每个特征图每个点所属目标的坐标 ；如38x38x4;初始化为全0
    feat_xmin = tf.zeros(shape, dtype=dtype)
    feat_ymax = tf.ones(shape, dtype=dtype)
    feat_xmax = tf.ones(shape, dtype=dtype)
 
    def jaccard_with_anchors(bbox):                      #计算gt的框和参考锚点框的重合度   ##经过一个Bbox的计算，38x38x4的jaccard中只有和该Bbox有交集的一些anchor
                                                                                        ##才有IoU值，其他的都是0
        """Compute jaccard score between a box and the anchors.
        """
        int_ymin = tf.maximum(ymin, bbox[0])           #计算重叠区域的坐标
        int_xmin = tf.maximum(xmin, bbox[1])
        int_ymax = tf.minimum(ymax, bbox[2])
        int_xmax = tf.minimum(xmax, bbox[3])
        h = tf.maximum(int_ymax - int_ymin, 0.)        #计算重叠区域的长与宽
        w = tf.maximum(int_xmax - int_xmin, 0.)
        # Volumes.
        inter_vol = h * w                                 #重叠区域的面积
        union_vol = vol_anchors - inter_vol \             #计算bbox和参考框的并集区域
            + (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])
        jaccard = tf.div(inter_vol, union_vol)           #计算IOU并返回该值
        return jaccard                                 
 
    def intersection_with_anchors(bbox):                #计算某个参考框包含真实框的得分情况
        """Compute intersection between score a box and the anchors.
        """
        int_ymin = tf.maximum(ymin, bbox[0])            #计算bbox和锚点框重叠区域的坐标和长宽
        int_xmin = tf.maximum(xmin, bbox[1])
        int_ymax = tf.minimum(ymax, bbox[2])
        int_xmax = tf.minimum(xmax, bbox[3])
        h = tf.maximum(int_ymax - int_ymin, 0.)
        w = tf.maximum(int_xmax - int_xmin, 0.)
        inter_vol = h * w                                 #重叠区域面积
        scores = tf.div(inter_vol, vol_anchors)           #将重叠区域面积除以参考框面积作为该参考框得分值；
        return scores
 
    def condition(i, feat_labels, feat_scores,
                  feat_ymin, feat_xmin, feat_ymax, feat_xmax):
        """Condition: check label index.
        """
        r = tf.less(i, tf.shape(labels)) # 逐元素比较大小,遍历labels，因为i在body返回的时候加1了
        return r[0]
 
    def body(i, feat_labels, feat_scores,                 #该函数大致意思是选择与gt box IOU最大的锚点框负责回归任务，并预测对应的边界框，如此循环
             feat_ymin, feat_xmin, feat_ymax, feat_xmax):
        """Body: update feature labels, scores and bboxes.
        Follow the original SSD paper for that purpose:
          - assign values when jaccard > 0.5;
          - only update if beat the score of other bboxes.
        """
        # Jaccard score.                                         #计算bbox与参考框的IOU值
        label = labels[i]
        bbox = bboxes[i]
        jaccard = jaccard_with_anchors(bbox)      ##逐个GT更新IoU的数值表
        # Mask: check threshold + scores + no annotations + num_classes.
        mask = tf.greater(jaccard, feat_scores)                  #当IOU大于feat_scores时，对应的mask至1，做筛选       ## feat_scores一开始都为0
        # mask = tf.logical_and(mask, tf.greater(jaccard, matching_threshold))
        mask = tf.logical_and(mask, feat_scores > -0.5)       ## feat_scores足够大才mark
        mask = tf.logical_and(mask, label < num_classes)         #label满足<21
        imask = tf.cast(mask, tf.int64)                          #将mask转换数据类型int型
        fmask = tf.cast(mask, dtype)                             #将mask转换数据类型float型
        # Update values using mask.
        feat_labels = imask * label + (1 - imask) * feat_labels  #当mask=1，则feat_labels=1；否则为0，即背景   ##imask中，为1的地方就更新feat_labels，否则不更新
        feat_scores = tf.where(mask, jaccard, feat_scores)       #tf.where表示如果mask为真则jaccard，否则为feat_scores
		
        feat_ymin = fmask * bbox[0] + (1 - fmask) * feat_ymin    #选择与GT bbox IOU最大的框作为GT bbox，然后循环
        feat_xmin = fmask * bbox[1] + (1 - fmask) * feat_xmin     ## 应该是选取一批anchor作为正样本，取向于GT，附近没有GT的，就是保持初始值0
        feat_ymax = fmask * bbox[2] + (1 - fmask) * feat_ymax
        feat_xmax = fmask * bbox[3] + (1 - fmask) * feat_xmax
 
        # Check no annotation label: ignore these anchors...     #对没有标注标签的锚点框做忽视，应该是背景
        # interscts = intersection_with_anchors(bbox)
        # mask = tf.logical_and(interscts > ignore_threshold,
        #                       label == no_annotation_label)
        # # Replace scores by -1.
        # feat_scores = tf.where(mask, -tf.cast(mask, dtype), feat_scores)
 
        return [i+1, feat_labels, feat_scores,
                feat_ymin, feat_xmin, feat_ymax, feat_xmax]
    # Main loop definition.
    i = 0
    ### 逐元素比较大小,其实就是遍历label，因为i在body返回的时候加1了，直到遍历完
    [i, feat_labels, feat_scores,
     feat_ymin, feat_xmin,
     feat_ymax, feat_xmax] = tf.while_loop(condition, body,
                                           [i, feat_labels, feat_scores,
                                            feat_ymin, feat_xmin,
                                            feat_ymax, feat_xmax])
    # Transform to center / size.                               #转换为中心及长宽形式（计算补偿后的中心）
    feat_cy = (feat_ymax + feat_ymin) / 2.  #真实预测值其实是边界框相对于先验框的转换值，encode就是为了求这个转换值
    feat_cx = (feat_xmax + feat_xmin) / 2.  
    feat_h = feat_ymax - feat_ymin
    feat_w = feat_xmax - feat_xmin    ##这4个tensor，已经是向GT“靠拢”过的了
    # Encode features.
    feat_cy = (feat_cy - yref) / href / prior_scaling[0]   #(预测真实边界框中心y-参考框中心y)/参考框高/缩放尺度
    feat_cx = (feat_cx - xref) / wref / prior_scaling[1]   
    feat_h = tf.log(feat_h / href) / prior_scaling[2]      #log(预测真实边界框高h/参考框高h)/缩放尺度
    feat_w = tf.log(feat_w / wref) / prior_scaling[3]     ## 就是feat_**-anchors_layer的具体表现形式
    # Use SSD ordering: x / y / w / h instead of ours.
    feat_localizations = tf.stack([feat_cx, feat_cy, feat_w, feat_h], axis=-1)  #返回（cx转换值,cy转换值,w转换值,h转换值）形式的边界框的预测值（其实是预测框相对于参考框的转换）
   ##strack相当于拼接，把4个tensor最里面的维度的数值拼接成一个维度
    return feat_labels, feat_localizations, feat_scores                         #返回目标标签，目标预测值（位置转换值），目标置信度
	#经过我们回归得到的变换，经过变换得到真实框，所以这个地方损失函数其实是我们预测的是变换，我们实际的框和anchor之间的变换和我们预测的变换之间的loss。我们回归的是一种变换。并不是直接预测框，这个和YOLO是不一样的。和Faster RCNN是一样的
 ### feat_labels 表示返回每个anchor对应的类别，feat_localizations返回的是一种变换，
 ### feat_scores 每个anchor与gt对应的最大的交并比
 
def tf_ssd_bboxes_encode(labels,       #1D的tensor 包含gt标签
                         bboxes,       #Nx4的tensor包含真实框的相对坐标
                         anchors,      #参考锚点框信息（y,x,h,w) 其中y,x是中心坐标  ##就是layers_anchors
                         num_classes,
                         no_annotation_label,
                         ignore_threshold=0.5,
                         prior_scaling=[0.1, 0.1, 0.2, 0.2],
                         dtype=tf.float32,
                         scope='ssd_bboxes_encode'):
    """Encode groundtruth labels and bounding boxes using SSD net anchors.
    Encoding boxes for all feature layers.
    Arguments:
      labels: 1D Tensor(int64) containing groundtruth labels;
      bboxes: Nx4 Tensor(float) with bboxes relative coordinates;
      anchors: List of Numpy array with layer anchors;
      matching_threshold: Threshold for positive match with groundtruth bboxes;
      prior_scaling: Scaling of encoded coordinates.
    Return:
      (target_labels, target_localizations, target_scores):  #返回：目标标签，目标位置，目标得分值（都是list形式）
        Each element is a list of target Tensors.
    """
    with tf.name_scope(scope):  
        target_labels = []               #目标标签
        target_localizations = []        #目标位置
        target_scores = []               #目标得分
        for i, anchors_layer in enumerate(anchors):                #对所有特征图中的参考框做遍历
            with tf.name_scope('bboxes_encode_block_%i' % i):
                t_labels, t_loc, t_scores = \
                    tf_ssd_bboxes_encode_layer(labels, bboxes, anchors_layer,    #输入真实标签，gt位置大小，参考框位置大小……得到预测真实标签，参考框到真实框的转换以及得分
                                               num_classes, no_annotation_label,
                                               ignore_threshold,
                                               prior_scaling, dtype)
                target_labels.append(t_labels)
                target_localizations.append(t_loc)
                target_scores.append(t_scores)
        return target_labels, target_localizations, target_scores
### t_labels 表示返回每个anchor对应的类别，t_loc返回的是一种变换，
### t_scores 每个anchor与gt对应的最大的交并比
### target_labels是一个list，包含每层的每个anchor对应的gt类别，
### target_localizations对应的是包含每一层所有anchor对应的变换
### target_scores 返回的是每个anchor与gt对应的最大的交并比

 
def tf_ssd_bboxes_decode_layer(feat_localizations,   #解码，在预测时用到，根据之前得到的预测值相对于参考框的转换值后，反推出真实位置（该位置包括真实的x,y,w,h）
                               anchors_layer,         #需要输入：预测框和参考框的转换feat_localizations，参考框位置尺度信息anchors_layer，以及转换时用到的缩放
                               prior_scaling=[0.1, 0.1, 0.2, 0.2]):    #输出真实预测框的ymin,xmin，ymax,xmax
    """Compute the relative bounding boxes from the layer features and
    reference anchor bounding boxes.
    Arguments:
      feat_localizations: Tensor containing localization features.
      anchors: List of numpy array containing anchor boxes.
    Return:
      Tensor Nx4: ymin, xmin, ymax, xmax
    """
    yref, xref, href, wref = anchors_layer    #锚点框的参考中心点以及长宽
 
    # Compute center, height and width
    cx = feat_localizations[:, :, :, :, 0] * wref * prior_scaling[0] + xref
    cy = feat_localizations[:, :, :, :, 1] * href * prior_scaling[1] + yref
    w = wref * tf.exp(feat_localizations[:, :, :, :, 2] * prior_scaling[2])
    h = href * tf.exp(feat_localizations[:, :, :, :, 3] * prior_scaling[3])
    # Boxes coordinates.
    ymin = cy - h / 2.
    xmin = cx - w / 2.
    ymax = cy + h / 2.
    xmax = cx + w / 2.
    bboxes = tf.stack([ymin, xmin, ymax, xmax], axis=-1)
    return bboxes     #预测真实框的坐标信息（两点式的框）
 
 
def tf_ssd_bboxes_decode(feat_localizations,
                         anchors,
                         prior_scaling=[0.1, 0.1, 0.2, 0.2],
                         scope='ssd_bboxes_decode'):
    """Compute the relative bounding boxes from the SSD net features and
    reference anchors bounding boxes.
    Arguments:
      feat_localizations: List of Tensors containing localization features.
      anchors: List of numpy array containing anchor boxes.
    Return:
      List of Tensors Nx4: ymin, xmin, ymax, xmax
    """
    with tf.name_scope(scope):
        bboxes = []
        for i, anchors_layer in enumerate(anchors):
            bboxes.append(
                tf_ssd_bboxes_decode_layer(feat_localizations[i],
                                           anchors_layer,
                                           prior_scaling))
        return bboxes
 
 
# =========================================================================== #
# SSD boxes selection.
# =========================================================================== #
def tf_ssd_bboxes_select_layer(predictions_layer, localizations_layer,  #输入预测得到的类别和位置做筛选
                               select_threshold=None,
                               num_classes=21,
                               ignore_class=0,
                               scope=None):
    """Extract classes, scores and bounding boxes from features in one layer.
    Batch-compatible: inputs are supposed to have batch-type shapes.
    Args:
      predictions_layer: A SSD prediction layer;
      localizations_layer: A SSD localization layer;
      select_threshold: Classification threshold for selecting a box. All boxes
        under the threshold are set to 'zero'. If None, no threshold applied.
    Return:
      d_scores, d_bboxes: Dictionary of scores and bboxes Tensors of
        size Batches X N x 1 | 4. Each key corresponding to a class.
    """
    select_threshold = 0.0 if select_threshold is None else select_threshold
    with tf.name_scope(scope, 'ssd_bboxes_select_layer',
                       [predictions_layer, localizations_layer]):
        # Reshape features: Batches x N x N_labels | 4
        p_shape = tfe.get_shape(predictions_layer)
        predictions_layer = tf.reshape(predictions_layer,
                                       tf.stack([p_shape[0], -1, p_shape[-1]]))
        l_shape = tfe.get_shape(localizations_layer)
        localizations_layer = tf.reshape(localizations_layer,
                                         tf.stack([l_shape[0], -1, l_shape[-1]]))
 
        d_scores = {}
        d_bboxes = {}
        for c in range(0, num_classes):
            if c != ignore_class:    #如果不是背景类别
                # Remove boxes under the threshold.   #去掉低于阈值的box
                scores = predictions_layer[:, :, c]    #预测为第c类别的得分值
                fmask = tf.cast(tf.greater_equal(scores, select_threshold), scores.dtype)
                scores = scores * fmask  #保留得分值大于阈值的得分
                bboxes = localizations_layer * tf.expand_dims(fmask, axis=-1)
                # Append to dictionary.
                d_scores[c] = scores
                d_bboxes[c] = bboxes
 
        return d_scores, d_bboxes  #返回字典，每个字典里是对应某类的预测权重和框位置信息；
 
 
def tf_ssd_bboxes_select(predictions_net, localizations_net,  #输入：SSD网络输出的预测层list；定位层list；类别选择框阈值（None表示都选）
                         select_threshold=None,                #返回一个字典，key为类别，值为得分和bbox坐标
                         num_classes=21,   #包含了背景类别
                         ignore_class=0,   #第0类是背景
                         scope=None):
    """Extract classes, scores and bounding boxes from network output layers.
    Batch-compatible: inputs are supposed to have batch-type shapes.
    Args:
      predictions_net: List of SSD prediction layers;
      localizations_net: List of localization layers;
      select_threshold: Classification threshold for selecting a box. All boxes
        under the threshold are set to 'zero'. If None, no threshold applied.
    Return:
      d_scores, d_bboxes: Dictionary of scores and bboxes Tensors of      #返回一个字典，其中key是对应类别，值对应得分值和坐标信息
        size Batches X N x 1 | 4. Each key corresponding to a class.
    """
    with tf.name_scope(scope, 'ssd_bboxes_select',
                       [predictions_net, localizations_net]):
        l_scores = []
        l_bboxes = []
        for i in range(len(predictions_net)):
            scores, bboxes = tf_ssd_bboxes_select_layer(predictions_net[i],
                                                        localizations_net[i],
                                                        select_threshold,
                                                        num_classes,
                                                        ignore_class)
            l_scores.append(scores)  #对应某个类别的得分
            l_bboxes.append(bboxes)  #对应某个类别的box坐标信息
        # Concat results.
        d_scores = {}
        d_bboxes = {}
        for c in l_scores[0].keys():
            ls = [s[c] for s in l_scores]
            lb = [b[c] for b in l_bboxes]
            d_scores[c] = tf.concat(ls, axis=1)
            d_bboxes[c] = tf.concat(lb, axis=1)
        return d_scores, d_bboxes
 
 
def tf_ssd_bboxes_select_layer_all_classes(predictions_layer, localizations_layer,
                                           select_threshold=None):
    """Extract classes, scores and bounding boxes from features in one layer.
     Batch-compatible: inputs are supposed to have batch-type shapes.
     Args:
       predictions_layer: A SSD prediction layer;
       localizations_layer: A SSD localization layer;
      select_threshold: Classification threshold for selecting a box. If None,
        select boxes whose classification score is higher than 'no class'.
     Return:
      classes, scores, bboxes: Input Tensors.    #输出：类别，得分，框
     """
    # Reshape features: Batches x N x N_labels | 4
    p_shape = tfe.get_shape(predictions_layer)
    predictions_layer = tf.reshape(predictions_layer,
                                   tf.stack([p_shape[0], -1, p_shape[-1]]))
    l_shape = tfe.get_shape(localizations_layer)
    localizations_layer = tf.reshape(localizations_layer,
                                     tf.stack([l_shape[0], -1, l_shape[-1]]))
    # Boxes selection: use threshold or score > no-label criteria.
    if select_threshold is None or select_threshold == 0:
        # Class prediction and scores: assign 0. to 0-class
        classes = tf.argmax(predictions_layer, axis=2)
        scores = tf.reduce_max(predictions_layer, axis=2)
        scores = scores * tf.cast(classes > 0, scores.dtype)
    else:
        sub_predictions = predictions_layer[:, :, 1:]
        classes = tf.argmax(sub_predictions, axis=2) + 1
        scores = tf.reduce_max(sub_predictions, axis=2)
        # Only keep predictions higher than threshold.
        mask = tf.greater(scores, select_threshold)
        classes = classes * tf.cast(mask, classes.dtype)
        scores = scores * tf.cast(mask, scores.dtype)
    # Assume localization layer already decoded.
    bboxes = localizations_layer
    return classes, scores, bboxes  #寻找当前特征图中类别，得分，bbox
 
 
def tf_ssd_bboxes_select_all_classes(predictions_net, localizations_net,
                                     select_threshold=None,
                                     scope=None):
    """Extract classes, scores and bounding boxes from network output layers.
    Batch-compatible: inputs are supposed to have batch-type shapes.
    Args:
      predictions_net: List of SSD prediction layers;
      localizations_net: List of localization layers;
      select_threshold: Classification threshold for selecting a box. If None,
        select boxes whose classification score is higher than 'no class'.
    Return:
      classes, scores, bboxes: Tensors.
    """
    with tf.name_scope(scope, 'ssd_bboxes_select',
                       [predictions_net, localizations_net]):
        l_classes = []
        l_scores = []
        l_bboxes = []
        for i in range(len(predictions_net)):
            classes, scores, bboxes = \
                tf_ssd_bboxes_select_layer_all_classes(predictions_net[i],
                                                       localizations_net[i],
                                                       select_threshold)
            l_classes.append(classes)
            l_scores.append(scores)
            l_bboxes.append(bboxes)
 
        classes = tf.concat(l_classes, axis=1)
        scores = tf.concat(l_scores, axis=1)
        bboxes = tf.concat(l_bboxes, axis=1)
        return classes, scores, bboxes  #返回所有特征图综合得出的类别，得分，bbox

无名份的浪漫2018

关注

1
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
SSD阅读和解读

GitHub上面的关于SSD的复现代码：TensorFlow版：https://github.com/balancap/SSD-Tensorflowca的mobile版：https://github.com/chuanqi305/MobileNet-SSDkeras版：https://github.com/pierluigiferrari/ssd_keraspytorch版：htt...
复制链接

扫一扫