SSD: Single Shot MultiBox Detector


1. The Main Idea of SSD

  SSD stands for Single Shot MultiBox Detector. "Single shot" indicates that SSD belongs to the one-stage family, while "MultiBox" indicates that SSD uses boxes of multiple scales.
  In SSD, the input image passes through a series of convolutional layers that produce feature maps of different sizes: earlier layers output larger feature maps and later layers output smaller ones, while the prior boxes on the earlier feature maps have smaller scales and those on the later feature maps have larger scales. In other words, from front to back the feature maps shrink while the prior boxes grow, so the small boxes on the large feature maps can detect small objects and the large boxes on the small feature maps can detect large objects.
  In addition, SSD performs detection directly with convolutions, so the feature maps of different sizes obtained above can be passed through convolutions directly to produce class confidences and location information.

2. Model Structure

The SSD model structure consists of three main parts:

  • Base feature extraction network: VGG-16 (only the first 5 convolutional blocks are kept)
  • SSD layers: SSD Layers
  • Prediction layers: classification and localization layers

3. Model Characteristics

  SSD and YOLO both belong to the one-stage family, which means that, like YOLO, SSD obtains classification and localization results simultaneously in a single pass through the model. Compared with YOLO, however, SSD has the following three characteristics:

  • Detection on multi-scale feature maps
  • Detection directly via convolution
  • Prior boxes with different scales and aspect ratios

3.1 Detection on multi-scale feature maps

  Earlier algorithms such as YOLO pass the image through convolutional layers and end up with a single-scale feature map, which is then used for detection. SSD instead generates feature maps of different scales through its convolutional layers and uses several of these feature maps for detection.
  Because shallow feature maps have small receptive fields while deep feature maps have large receptive fields, the larger shallow feature maps can be used to detect small objects and the smaller deep feature maps can be used to detect large objects.

3.2 Detection directly via convolution

  YOLO performs detection after fully connected layers, whereas SSD applies convolutions directly to the feature maps of different scales.
  Detecting directly with convolutions means choosing the number of small convolution kernels (3x3 in the paper and in the code) according to what is to be predicted. The number of kernels determines the number of channels of the output feature map, and different channel counts carry different meanings.
  For example, suppose the feature map fed into the prediction layer (classification and localization layer) has size m*n*p, the kernels are 3*3*p, there are 3 classes to predict, and each location on the feature map has 3 prior boxes.
  For class prediction:
  Set the number of kernels to 3*3=9; the prediction layer then outputs a feature map of shape [m, n, 3*3]. For each of the m*n locations there is a 9-dimensional vector, and every 3 components give the class predictions of one box.
  For location prediction:
  Set the number of kernels to 3*4=12; the output feature map has shape [m, n, 3*4]. For each of the m*n locations there is a 12-dimensional vector, and every 4 components give the four location values (offsets) of one box.
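A minimal sketch of these two prediction branches (hypothetical values: a 10x10 feature map with 256 channels, 3 classes, 3 prior boxes per location; not the actual SSD layer definition):

import tensorflow as tf
slim = tf.contrib.slim

num_anchors, num_classes = 3, 3                       # the assumed values from the example above
feat = tf.placeholder(tf.float32, [1, 10, 10, 256])   # m = n = 10, p = 256 (hypothetical)

# classification branch: 3x3 conv with num_anchors*num_classes kernels -> [1, m, n, 9]
cls_pred = slim.conv2d(feat, num_anchors * num_classes, [3, 3], activation_fn=None, scope='conv_cls')
# localization branch: 3x3 conv with num_anchors*4 kernels -> [1, m, n, 12]
loc_pred = slim.conv2d(feat, num_anchors * 4, [3, 3], activation_fn=None, scope='conv_loc')

print(cls_pred.get_shape())  # (1, 10, 10, 9)
print(loc_pred.get_shape())  # (1, 10, 10, 12)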

3.3 Prior boxes with different scales and aspect ratios

  In SSD, the number of prior boxes placed on each feature map differs, and so do their scales (i.e. box sizes) and aspect ratios.

3.3.1 Prior box scales (sizes):

$s_k=s_{min}+\frac{s_{max}-s_{min}}{m-1}(k-1),\quad k\in[1, m]$
where:

  • $m$: the number of feature maps minus one (the scale of the first feature map is set separately)
  • $s_k$: the size of the prior box as a ratio of the image size
  • $s_{min}$: the minimum ratio, set to 0.2 in the paper
  • $s_{max}$: the maximum ratio, set to 0.9 in the paper

For the first feature map, the prior box scale ratio is usually set to $s_{min}/2=0.1$, giving a scale of 300*0.1=30.
For the remaining 5 feature maps, the prior box scales are computed as follows: first multiply the ratios by 100, i.e. $s_{min}=20$, $s_{max}=90$, then apply the formula above and divide the result by 100 to obtain the final scale ratio:

   $k=1,\quad s_1=[20+\lfloor \frac{90-20}{5-1}\rfloor*(1-1)]/100=[20+0]/100=0.2$
   $k=2,\quad s_2=[20+\lfloor \frac{90-20}{5-1}\rfloor*(2-1)]/100=[20+17]/100=0.37$
   $k=3,\quad s_3=[20+\lfloor \frac{90-20}{5-1}\rfloor*(3-1)]/100=[20+34]/100=0.54$
   $k=4,\quad s_4=[20+\lfloor \frac{90-20}{5-1}\rfloor*(4-1)]/100=[20+51]/100=0.71$
   $k=5,\quad s_5=[20+\lfloor \frac{90-20}{5-1}\rfloor*(5-1)]/100=[20+68]/100=0.88$

Multiplying these 5 scale ratios by the input image size gives the prior box scales on these feature maps: 60, 111, 162, 213, 264.
Together with the scale 30 of the first feature map, the prior box scales on the 6 feature maps are 30, 60, 111, 162, 213, 264.
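A minimal Python sketch of the scale computation above (values match the text):

s_min, s_max, m = 20, 90, 5            # percentages; the first feature map is handled separately
step = (s_max - s_min) // (m - 1)      # floor((90 - 20) / 4) = 17

ratios = [(s_min + step * (k - 1)) / 100. for k in range(1, m + 1)]
# ratios = [0.2, 0.37, 0.54, 0.71, 0.88]
scales = [300 * 0.1] + [300 * r for r in ratios]
# scales ~ [30, 60, 111, 162, 213, 264]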

3.3.2 Prior box widths and heights:

Using the formula below, the concrete widths and heights on each feature map can be computed from $s_k$ and $a_r$:
$w_k^a=s_k \sqrt{a_r}, \quad h_k^a=\frac{s_k}{\sqrt{a_r}}$
where $a_r$ is the aspect ratio of the prior box, $a_r\in\{1,2,3,\frac{1}{2},\frac{1}{3}\}$.

Each prior box scale $s_k$ corresponds to 5 aspect ratios $a_r$. When $a_r=1$, an additional scale $s'_k=\sqrt{s_k s_{k+1}}$ is used, i.e. one extra box is added. For the last feature map, a virtual scale $s_{m+1}=300*105/100=315$ is used to compute $s'_m$.
That is, each location on a feature map corresponds to 6 boxes: 2 squares and 4 rectangles.

Worked example:
$s_k\in\{30,60,111,162,213,264\},\quad a_r\in\{1,2,3,\frac{1}{2},\frac{1}{3}\}$

$k=1$, $s_1=30$:
$a_1=1$: two squares are computed; the second uses $s'_1=\sqrt{s_1 s_2}=\sqrt{30*60}\approx42$
Side of the first square: $w_1^a=s_1\sqrt{a_1}=30*1=30,\quad h_1^a=\frac{s_1}{\sqrt{a_1}}=\frac{30}{1}=30$
Side of the second square: $w_1'^a=s'_1\sqrt{a_1}=42*1=42,\quad h_1'^a=\frac{s'_1}{\sqrt{a_1}}=\frac{42}{1}=42$
$a_2=2$: rectangle, width $w_2^a=s_1\sqrt{a_2}=30\sqrt{2}$, height $h_2^a=\frac{s_1}{\sqrt{a_2}}=\frac{30}{\sqrt{2}}$
$a_3=3$: rectangle, width $w_3^a=s_1\sqrt{a_3}=30\sqrt{3}$, height $h_3^a=\frac{s_1}{\sqrt{a_3}}=\frac{30}{\sqrt{3}}$
$a_4=\frac{1}{2}$: rectangle, width $w_4^a=s_1\sqrt{a_4}=\frac{30}{\sqrt{2}}$, height $h_4^a=\frac{s_1}{\sqrt{a_4}}=30\sqrt{2}$
$a_5=\frac{1}{3}$: rectangle, width $w_5^a=s_1\sqrt{a_5}=\frac{30}{\sqrt{3}}$, height $h_5^a=\frac{s_1}{\sqrt{a_5}}=30\sqrt{3}$
The remaining 5 feature maps are computed in the same way.

The above is the standard computation. In the paper and in the actual code, however, the Conv4_3, Conv10_2 and Conv11_2 layers (i.e. the first feature map and the last two) do not use the aspect ratios $3$ and $\frac{1}{3}$, so each location on these three feature maps corresponds to only 4 prior boxes.
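A minimal sketch of the width/height computation for the first feature map (k=1), following the formulas and numbers above:

import math

scales = [30., 60., 111., 162., 213., 264.]
aspect_ratios = [2, 3, 0.5, 1. / 3]            # ratio 1 is handled separately (the two squares)

s_k, s_k1 = scales[0], scales[1]               # first feature map
s_k_prime = math.sqrt(s_k * s_k1)              # extra scale for ratio 1: sqrt(30*60) ~ 42.4
boxes = [(s_k, s_k), (s_k_prime, s_k_prime)]   # the two squares
for ar in aspect_ratios:                       # the four rectangles, as (width, height)
    boxes.append((s_k * math.sqrt(ar), s_k / math.sqrt(ar)))
# boxes ~ [(30, 30), (42.4, 42.4), (42.4, 21.2), (52.0, 17.3), (21.2, 42.4), (17.3, 52.0)]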

3.3.3 Prior box centers:

With the prior box sizes determined as above, the center of each prior box is:
$(\frac{i+0.5}{|f_k|},\frac{j+0.5}{|f_k|}),\quad i,j\in[0, |f_k|)$
where $|f_k|$ is the size of the k-th feature map.
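A minimal numpy sketch of the center computation for one feature map (assuming a hypothetical 8x8 feature map, i.e. $|f_k|=8$):

import numpy as np

f_k = 8
i, j = np.mgrid[0:f_k, 0:f_k]
cy = (i + 0.5) / f_k   # relative y centers: 0.0625, 0.1875, ..., 0.9375
cx = (j + 0.5) / f_k   # relative x centers
# multiply by the input size (300) to obtain absolute coordinates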

4. Loss Function

The loss function is defined as a weighted sum of the confidence loss (conf) and the localization loss (loc):
$L(x,c,l,g)=\frac{1}{N}(L_{conf}(x,c)+\alpha L_{loc}(x,l,g))$
where:

  • $N$: the number of positive (matched) prior boxes
  • $c$: the predicted class confidences
  • $l$: the predicted box location offsets
  • $\alpha$: the weight term, set to 1 by cross validation
  • $x_{ij}^p=1$ indicates that the i-th prior box is matched to the j-th ground truth box of class p; otherwise $x_{ij}^p=0$

Confidence loss (conf):
The confidence loss is the softmax loss over multiple classes (a small numerical sketch follows after the definitions below):
$L_{conf}(x,c)=-\sum\limits_{i\in Pos}^N x_{ij}^p\log(\hat c_i^p)-\sum\limits_{i\in Neg}^N \log(\hat c_i^0)\quad \text{where} \quad \hat c_i^p=\frac{\exp(c_i^p)}{\sum_p \exp(c_i^p)}$
where:

  • $p$ denotes the p-th class
  • $0$ denotes class 0, i.e. the background class
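A small numerical sketch of the confidence term for a single prior box (hypothetical raw scores; class 0 is the background class):

import numpy as np

c = np.array([0.3, 2.0, 0.1])                    # raw scores for classes 0, 1, 2
c_hat = np.exp(c) / np.exp(c).sum()              # softmax
loss_if_matched_to_class_1 = -np.log(c_hat[1])   # positive box matched to class p = 1
loss_if_unmatched = -np.log(c_hat[0])            # negative box: background class 0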

Localization loss (loc):
$L_{loc}(x,l,g)=\sum\limits_{i\in Pos}^N \sum\limits_{m\in\{cx,cy,w,h\}} x_{ij}^k \, smooth_{L1}(l_i^m-\hat g_j^m)$
$\hat g_j^{cx}=(g_j^{cx}-d_i^{cx})/d_i^w \qquad \hat g_j^{cy}=(g_j^{cy}-d_i^{cy})/d_i^h$
$\hat g^w_j=\log\left(\frac{g_j^w}{d_i^w}\right) \qquad \hat g^h_j=\log\left(\frac{g_j^h}{d_i^h}\right)$

where:

  • $g$: the transformation from the default box to the ground truth box
  • $l$: the predicted transformation from the default box to the corrected position, i.e. the predicted offsets
  • $smooth_{L_1}(x)=\begin{cases}0.5x^2, & \text{if } |x|<1 \\ |x|-0.5, & \text{otherwise}\end{cases}$
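A minimal numpy sketch of the smooth L1 function defined above:

import numpy as np

def smooth_l1(x):
    x = np.abs(x)
    return np.where(x < 1, 0.5 * x ** 2, x - 0.5)

print(smooth_l1(np.array([-2.0, -0.5, 0.0, 0.5, 2.0])))
# -> [1.5, 0.125, 0.0, 0.125, 1.5]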

5. Main Code Analysis

5.1 Training phase

5.1.1 Main function

Four main steps:

  • Build the network: produce the predicted class scores and location offsets (logits (before softmax), localisations)
  • Anchor box computation: computed from each feature map's size and the prior box scales and aspect ratios
  • Encoding: combine the anchor box locations with the ground truth class labels and locations from the dataset
  • Build the loss function and train iteratively: use the results of the previous steps, mine negative samples, then compute the loss
#coding:utf-8

# Copyright 2016 Paul Balanca. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Generic training script that trains a SSD model using a given dataset."""
import tensorflow as tf
from tensorflow.python.ops import control_flow_ops

from datasets import dataset_factory
from deployment import model_deploy  # model deployment
from nets import nets_factory
from preprocessing import preprocessing_factory
import tf_utils

slim = tf.contrib.slim

#DATA_FORMAT = 'NCHW'
DATA_FORMAT = 'NHWC'
# =========================================================================== #
# SSD Network flags.
# =========================================================================== #
tf.app.flags.DEFINE_float(
    'loss_alpha', 1., 'Alpha parameter in the loss function.')
tf.app.flags.DEFINE_float(
    'negative_ratio', 3., 'Negative ratio in the loss function.')  # negative-to-positive sample ratio
tf.app.flags.DEFINE_float(
    'match_threshold', 0.5, 'Matching threshold in the loss function.')

# =========================================================================== #
# General Flags.
# =========================================================================== #
tf.app.flags.DEFINE_string(
    'train_dir', 'logs',
    'Directory where checkpoints and event logs are written to.')
tf.app.flags.DEFINE_integer('num_clones', 1,
                            'Number of model clones to deploy.')  # num_clones=1 means one parameter server and one worker; that worker is the chief
tf.app.flags.DEFINE_boolean('clone_on_cpu', True,  # False
                            'Use CPUs to deploy clones.')  # whether to deploy the clones on CPU
tf.app.flags.DEFINE_integer(
    'num_readers', 2,
    'The number of parallel readers that read data from the dataset.')  # read data with two threads
tf.app.flags.DEFINE_integer(
    'num_preprocessing_threads', 2,
    'The number of threads used to create the batches.')  # create batches with two threads

tf.app.flags.DEFINE_integer(
    'log_every_n_steps', 10,  # print a log every 10 steps
    'The frequency with which logs are print.')
tf.app.flags.DEFINE_integer(
    'save_summaries_secs', 60,  # save summaries every 60 seconds
    'The frequency with which summaries are saved, in seconds.')
tf.app.flags.DEFINE_integer(
    'save_interval_secs', 60,  # save the model every 60 seconds
    'The frequency with which the model is saved, in seconds.')
tf.app.flags.DEFINE_float(
    'gpu_memory_fraction', 0.8, 'GPU memory fraction to use.')  # use 80% of GPU memory

# =========================================================================== #
# Optimization Flags.
# =========================================================================== #
tf.app.flags.DEFINE_float(
    'weight_decay', 0.00004, 'The weight decay on the model weights.')  # L2 regularization lambda
tf.app.flags.DEFINE_string(
    'optimizer', 'adam',
    'The name of the optimizer, one of "adadelta", "adagrad", "adam",'
    '"ftrl", "momentum", "sgd" or "rmsprop".')
tf.app.flags.DEFINE_float(
    'adadelta_rho', 0.95,
    'The decay rate for adadelta.')
tf.app.flags.DEFINE_float(
    'adagrad_initial_accumulator_value', 0.1,
    'Starting value for the AdaGrad accumulators.')
tf.app.flags.DEFINE_float(
    'adam_beta1', 0.9,
    'The exponential decay rate for the 1st moment estimates.')
tf.app.flags.DEFINE_float(
    'adam_beta2', 0.999,
    'The exponential decay rate for the 2nd moment estimates.')
tf.app.flags.DEFINE_float('opt_epsilon', 1.0, 'Epsilon term for the optimizer.')
tf.app.flags.DEFINE_float('ftrl_learning_rate_power', -0.5,
                          'The learning rate power.')
tf.app.flags.DEFINE_float(
    'ftrl_initial_accumulator_value', 0.1,
    'Starting value for the FTRL accumulators.')
tf.app.flags.DEFINE_float(
    'ftrl_l1', 0.0, 'The FTRL l1 regularization strength.')
tf.app.flags.DEFINE_float(
    'ftrl_l2', 0.0, 'The FTRL l2 regularization strength.')
tf.app.flags.DEFINE_float(
    'momentum', 0.9,
    'The momentum for the MomentumOptimizer and RMSPropOptimizer.')
tf.app.flags.DEFINE_float('rmsprop_momentum', 0.9, 'Momentum.')
tf.app.flags.DEFINE_float('rmsprop_decay', 0.9, 'Decay term for RMSProp.')

# =========================================================================== #
# Learning Rate Flags.
# =========================================================================== #
tf.app.flags.DEFINE_string(
    'learning_rate_decay_type',
    'exponential',
    'Specifies how the learning rate is decayed. One of "fixed", "exponential",'  # fixed or exponential
    ' or "polynomial"')  # or polynomial
tf.app.flags.DEFINE_float('learning_rate', 0.01, 'Initial learning rate.')
tf.app.flags.DEFINE_float(
    'end_learning_rate', 0.0001,
    'The minimal end learning rate used by a polynomial decay learning rate.')
tf.app.flags.DEFINE_float(
    'label_smoothing', 0.0, 'The amount of label smoothing.')  # amount of label smoothing
tf.app.flags.DEFINE_float(
    'learning_rate_decay_factor', 0.94, 'Learning rate decay factor.')  # learning rate decay factor
tf.app.flags.DEFINE_float(
    'num_epochs_per_decay', 2.0,
    'Number of epochs after which learning rate decays.')  # decay the learning rate every 2 epochs
tf.app.flags.DEFINE_float(
    'moving_average_decay', None,  # moving average decay
    'The decay to use for the moving average.'
    'If left as None, then moving averages are not used.')

# =========================================================================== #
# Dataset Flags.
# =========================================================================== #
#dataset_name imagenet
tf.app.flags.DEFINE_string(
    'dataset_name', 'pascalvoc_2007', 'The name of the dataset to load.')
tf.app.flags.DEFINE_integer(
    'num_classes', 21, 'Number of classes to use in the dataset.')
tf.app.flags.DEFINE_string(
    'dataset_split_name', 'train', 'The name of the train/test split.')  # name of the dataset split
tf.app.flags.DEFINE_string(
    'dataset_dir', 'tfrecords', 'The directory where the dataset files are stored.')
tf.app.flags.DEFINE_integer(
    'labels_offset', 0,  # label offset
    'An offset for the labels in the dataset. This flag is primarily used to '  # amount subtracted from the labels
    'evaluate the VGG and ResNet architectures which do not use a background '
    'class for the ImageNet dataset.')  # subtract 1 if no background class is used
tf.app.flags.DEFINE_string(
    'model_name', 'ssd_300_vgg', 'The name of the architecture to train.')
tf.app.flags.DEFINE_string(
    'preprocessing_name', None, 'The name of the preprocessing to use. If left '
    'as `None`, then the model_name flag is used.')  # if None, model_name is used as the preprocessing name
tf.app.flags.DEFINE_integer(
    'batch_size', 4, 'The number of samples in each batch.')
tf.app.flags.DEFINE_integer(
    'train_image_size', None, 'Train image size')  # training image size
tf.app.flags.DEFINE_integer('max_number_of_steps', None,
                            'The maximum number of training steps.')  # maximum number of training steps

# =========================================================================== #
# Fine-Tuning Flags.
# =========================================================================== #
tf.app.flags.DEFINE_string(
    'checkpoint_path', './checkpoints/ssd_300_vgg.ckpt',
    'The path to a checkpoint from which to fine-tune.')  # checkpoint to fine-tune from
tf.app.flags.DEFINE_string(
    'checkpoint_model_scope', None,
    'Model scope in the checkpoint. None if the same as the trained model.')  # model scope
tf.app.flags.DEFINE_string(
    'checkpoint_exclude_scopes', None,
    'Comma-separated list of scopes of variables to exclude when restoring '
    'from a checkpoint.')
tf.app.flags.DEFINE_string(
    'trainable_scopes', None,
    'Comma-separated list of scopes to filter the set of variables to train.'
    'By default, None would train all the variables.')  # if None, all variables are trained
tf.app.flags.DEFINE_boolean(
    'ignore_missing_vars', False,
    'When restoring a checkpoint would ignore missing variables.')  # ignore missing variables when restoring

FLAGS = tf.app.flags.FLAGS


# =========================================================================== #
# Main training routine.
# =========================================================================== #
def main(_):
    if not FLAGS.dataset_dir:
        raise ValueError('You must supply the dataset directory with --dataset_dir')

    tf.logging.set_verbosity(tf.logging.DEBUG)  # set the logging level (DEBUG, quite verbose)
    with tf.Graph().as_default():  # create the graph
        # Config model_deploy. Keep TF Slim Models structure.
        # Useful if want to need multiple GPUs and/or servers in the future.
        # configure (distributed) model deployment
        deploy_config = model_deploy.DeploymentConfig(
            num_clones=FLAGS.num_clones,
            clone_on_cpu=FLAGS.clone_on_cpu,
            replica_id=0,
            num_replicas=1,
            num_ps_tasks=0)
        # Create global_step.
        with tf.device(deploy_config.variables_device()):
            '''place variables on different servers; this takes a little time'''
            global_step = slim.create_global_step() #tf.train.create_global_step

        # Select the dataset.
        # load the dataset
        dataset = dataset_factory.get_dataset(
            FLAGS.dataset_name, FLAGS.dataset_split_name, FLAGS.dataset_dir)

        # Get the SSD network and its anchors.
        # get the network and its anchor boxes
        ssd_class = nets_factory.get_network(FLAGS.model_name)  # the SSD-300 model
        ssd_params = ssd_class.default_params._replace(num_classes=FLAGS.num_classes)  # load the default parameters, overriding num_classes
        ssd_net = ssd_class(ssd_params) # instantiate the SSD network with the given parameters
        ssd_shape = ssd_net.params.img_shape  # 300x300
        ssd_anchors = ssd_net.anchors(ssd_shape) # generate the anchor box information

        # Select the preprocessing function.
        preprocessing_name = FLAGS.preprocessing_name or FLAGS.model_name  # which preprocessing module to use
        image_preprocessing_fn = preprocessing_factory.get_preprocessing( #image_preprocessing_fn:image, labels, bboxes
            preprocessing_name, is_training=True)

        tf_utils.print_configuration(FLAGS.__flags, ssd_params,
                                     dataset.data_sources, FLAGS.train_dir)
        # =================================================================== #
        # Create a dataset provider and batches.
        # =================================================================== #
        with tf.device(deploy_config.inputs_device()):
            with tf.name_scope(FLAGS.dataset_name + '_data_provider'):
                provider = slim.dataset_data_provider.DatasetDataProvider(  # slim multi-threaded image reading
                    dataset,  # Pascal VOC 2007: 5011 images in total, 20 classes
                    num_readers=FLAGS.num_readers,  # number of reader threads
                    common_queue_capacity=20 * FLAGS.batch_size,#80
                    common_queue_min=10 * FLAGS.batch_size,#40
                    shuffle=True)
            # Get for SSD network: image, labels, bboxes (image, class labels and ground truth boxes).
            [image, shape, glabels, gbboxes] = provider.get(['image', 'shape',
                                                             'object/label',
                                                             'object/bbox'])
            '''fetch a sample from the Pascal VOC 2007 dataset'''
            # Pre-processing image, labels and bboxes.
            image, glabels, gbboxes = \
                image_preprocessing_fn(image, glabels, gbboxes,
                                       out_shape=ssd_shape,
                                       data_format=DATA_FORMAT)


            '''training image, GT labels and bboxes obtained; the backbone is a VGG-16 variant'''
            # Encode groundtruth labels and bboxes.
            # encode the anchor box locations with the ground truth class labels and locations, producing target classes, localizations and scores
            gclasses, glocalisations, gscores = \
                ssd_net.bboxes_encode(glabels, gbboxes, ssd_anchors)
            batch_shape = [1] + [len(ssd_anchors)] * 3 # the 4 parts are [image, gclasses, glocalisations, gscores]


            '''the IoU scores of the training bboxes have been computed above'''
            # Training batches and queue.
            r = tf.train.batch(
                tf_utils.reshape_list([image, gclasses, glocalisations, gscores]),
                batch_size=FLAGS.batch_size,#4
                num_threads=FLAGS.num_preprocessing_threads,  # preprocessing/reading threads
                capacity=5 * FLAGS.batch_size)#20
            '''batch_size = 4; reshaped according to batch_shape = [1, 6, 6, 6]'''
            b_image, b_gclasses, b_glocalisations, b_gscores = \
                tf_utils.reshape_list(r, batch_shape)
            '''split back into 4 parts: image, gclasses, glocalisations, gscores'''
            # Intermediate queueing: unique batch computation pipeline for all
            # GPUs running the training.
            batch_queue = slim.prefetch_queue.prefetch_queue(
                tf_utils.reshape_list([b_image, b_gclasses, b_glocalisations, b_gscores]),
                capacity=2 * deploy_config.num_clones)
            '''prefetch queue reading batches with two threads; batch_queue is a 'FIFOQueue' object'''
        # =================================================================== #
        # Define the model running on every GPU.
        # =================================================================== #
        def clone_fn(batch_queue):
            """Allows data parallelism by creating multiple
               定义模型运行在所有GPU上
            clones of network_fn."""
            # Dequeue batch.
            b_image, b_gclasses, b_glocalisations, b_gscores = \
                tf_utils.reshape_list(batch_queue.dequeue(), batch_shape)
            '''dequeue a batch (FIFO)'''

            # Construct SSD network.
            arg_scope = ssd_net.arg_scope(weight_decay=FLAGS.weight_decay,
                                          data_format=DATA_FORMAT)
            '''SSD model arg scope and parameter definition'''
            with slim.arg_scope(arg_scope):
                predictions, localisations, logits, end_points = \
                    ssd_net.net(b_image, is_training=True)
            '''feed in the image batch [4, 300, 300, 3]'''
            # Add loss function.
            '''build the loss function'''
            ssd_net.losses(logits, localisations,  # logits: 21-dimensional class scores per anchor
                           b_gclasses, b_glocalisations, b_gscores,
                           match_threshold=FLAGS.match_threshold,
                           negative_ratio=FLAGS.negative_ratio,
                           alpha=FLAGS.loss_alpha,  # loss weight alpha = 1
                           label_smoothing=FLAGS.label_smoothing)
            return end_points  # end_points ('block11' is the last layer)

        # Gather initial summaries.
        summaries = set(tf.get_collection(tf.GraphKeys.SUMMARIES))

        # =================================================================== #
        # Add summaries from first clone.
        # =================================================================== #
        clones = model_deploy.create_clones(deploy_config, clone_fn, [batch_queue])
        first_clone_scope = deploy_config.clone_scope(0)
        # Gather update_ops from the first clone. These contain, for example,
        # the updates for the batch_norm variables created by network_fn.
        update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS, first_clone_scope)

        # Add summaries for end_points.
        end_points = clones[0].outputs
        for end_point in end_points:
            x = end_points[end_point]
            summaries.add(tf.summary.histogram('activations/' + end_point, x))
            summaries.add(tf.summary.scalar('sparsity/' + end_point,
                                            tf.nn.zero_fraction(x)))
        # Add summaries for losses and extra losses.
        for loss in tf.get_collection(tf.GraphKeys.LOSSES, first_clone_scope):
            summaries.add(tf.summary.scalar(loss.op.name, loss))
        for loss in tf.get_collection('EXTRA_LOSSES', first_clone_scope):
            summaries.add(tf.summary.scalar(loss.op.name, loss))

        # Add summaries for variables.
        for variable in slim.get_model_variables():
            summaries.add(tf.summary.histogram(variable.op.name, variable))

        # =================================================================== #
        # Configure the moving averages.
        # =================================================================== #
        if FLAGS.moving_average_decay:
            moving_average_variables = slim.get_model_variables()
            variable_averages = tf.train.ExponentialMovingAverage(
                FLAGS.moving_average_decay, global_step)
        else:
            moving_average_variables, variable_averages = None, None

        # =================================================================== #
        # Configure the optimization procedure.
        # =================================================================== #
        with tf.device(deploy_config.optimizer_device()):
            learning_rate = tf_utils.configure_learning_rate(FLAGS,
                                                             dataset.num_samples,
                                                             global_step)
            optimizer = tf_utils.configure_optimizer(FLAGS, learning_rate)
            summaries.add(tf.summary.scalar('learning_rate', learning_rate))

        if FLAGS.moving_average_decay:
            # Update ops executed locally by trainer.
            update_ops.append(variable_averages.apply(moving_average_variables))

        # Variables to train.
        variables_to_train = tf_utils.get_variables_to_train(FLAGS)

        # and returns a train_tensor and summary_op
        total_loss, clones_gradients = model_deploy.optimize_clones(
            clones,
            optimizer,
            var_list=variables_to_train)
        # Add total_loss to summary.
        summaries.add(tf.summary.scalar('total_loss', total_loss))

        # Create gradient updates.
        grad_updates = optimizer.apply_gradients(clones_gradients,
                                                 global_step=global_step)
        update_ops.append(grad_updates)
        update_op = tf.group(*update_ops)
        train_tensor = control_flow_ops.with_dependencies([update_op], total_loss,
                                                          name='train_op')

        # Add the summaries from the first clone. These contain the summaries
        summaries |= set(tf.get_collection(tf.GraphKeys.SUMMARIES,
                                           first_clone_scope))
        # Merge all summaries together.
        summary_op = tf.summary.merge(list(summaries), name='summary_op')

        # =================================================================== #
        # Kicks off the training.
        # =================================================================== #
        gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=FLAGS.gpu_memory_fraction)
        config = tf.ConfigProto(log_device_placement=False,
                                gpu_options=gpu_options)
        saver = tf.train.Saver(max_to_keep=5,
                               keep_checkpoint_every_n_hours=1.0,
                               write_version=2,
                               pad_step_number=False)
        slim.learning.train(
            train_tensor,
            logdir=FLAGS.train_dir,
            master='',
            is_chief=True,
            init_fn=tf_utils.get_init_fn(FLAGS),
            summary_op=summary_op,
            number_of_steps=FLAGS.max_number_of_steps,
            log_every_n_steps=FLAGS.log_every_n_steps,
            save_summaries_secs=FLAGS.save_summaries_secs,
            saver=saver,
            save_interval_secs=FLAGS.save_interval_secs,
            session_config=config,
            sync_optimizer=None)


if __name__ == '__main__':
    tf.app.run()

  

5.1.2 Building the network
def ssd_net(inputs,
            num_classes=SSDNet.default_params.num_classes,
            feat_layers=SSDNet.default_params.feat_layers,
            anchor_sizes=SSDNet.default_params.anchor_sizes,
            anchor_ratios=SSDNet.default_params.anchor_ratios,
            normalizations=SSDNet.default_params.normalizations,
            is_training=True,
            dropout_keep_prob=0.5,
            prediction_fn=slim.softmax,
            reuse=None,
            scope='ssd_512_vgg'):
    """SSD net definition.
    """
    # End_points collect relevant activations for external use.
    end_points = {}
    with tf.variable_scope(scope, 'ssd_512_vgg', [inputs], reuse=reuse):
        # Original VGG-16 blocks.
        net = slim.repeat(inputs, 2, slim.conv2d, 64, [3, 3], scope='conv1')
        end_points['block1'] = net
        net = slim.max_pool2d(net, [2, 2], scope='pool1')
        # Block 2.
        net = slim.repeat(net, 2, slim.conv2d, 128, [3, 3], scope='conv2')
        end_points['block2'] = net
        net = slim.max_pool2d(net, [2, 2], scope='pool2')
        # Block 3.
        net = slim.repeat(net, 3, slim.conv2d, 256, [3, 3], scope='conv3')
        end_points['block3'] = net
        net = slim.max_pool2d(net, [2, 2], scope='pool3')
        # Block 4.
        net = slim.repeat(net, 3, slim.conv2d, 512, [3, 3], scope='conv4')
        end_points['block4'] = net
        net = slim.max_pool2d(net, [2, 2], scope='pool4')
        # Block 5.
        net = slim.repeat(net, 3, slim.conv2d, 512, [3, 3], scope='conv5')
        end_points['block5'] = net
        net = slim.max_pool2d(net, [3, 3], 1, scope='pool5')

        # Additional SSD blocks.
        # Block 6: let's dilate the hell out of it!
        net = slim.conv2d(net, 1024, [3, 3], rate=6, scope='conv6')
        end_points['block6'] = net
        # Block 7: 1x1 conv. Because the fuck.
        net = slim.conv2d(net, 1024, [1, 1], scope='conv7')
        end_points['block7'] = net

        # Block 8/9/10/11: 1x1 and 3x3 convolutions stride 2 (except lasts).
        end_point = 'block8'
        with tf.variable_scope(end_point):
            net = slim.conv2d(net, 256, [1, 1], scope='conv1x1')
            net = custom_layers.pad2d(net, pad=(1, 1))
            net = slim.conv2d(net, 512, [3, 3], stride=2, scope='conv3x3', padding='VALID')
        end_points[end_point] = net
        end_point = 'block9'
        with tf.variable_scope(end_point):
            net = slim.conv2d(net, 128, [1, 1], scope='conv1x1')
            net = custom_layers.pad2d(net, pad=(1, 1))
            net = slim.conv2d(net, 256, [3, 3], stride=2, scope='conv3x3', padding='VALID')
        end_points[end_point] = net
        end_point = 'block10'
        with tf.variable_scope(end_point):
            net = slim.conv2d(net, 128, [1, 1], scope='conv1x1')
            net = custom_layers.pad2d(net, pad=(1, 1))
            net = slim.conv2d(net, 256, [3, 3], stride=2, scope='conv3x3', padding='VALID')
        end_points[end_point] = net
        end_point = 'block11'
        with tf.variable_scope(end_point):
            net = slim.conv2d(net, 128, [1, 1], scope='conv1x1')
            net = custom_layers.pad2d(net, pad=(1, 1))
            net = slim.conv2d(net, 256, [3, 3], stride=2, scope='conv3x3', padding='VALID')
        end_points[end_point] = net
        end_point = 'block12'
        with tf.variable_scope(end_point):
            net = slim.conv2d(net, 128, [1, 1], scope='conv1x1')
            net = custom_layers.pad2d(net, pad=(1, 1))
            net = slim.conv2d(net, 256, [4, 4], scope='conv4x4', padding='VALID')
            # Fix padding to match Caffe version (pad=1).
            # pad_shape = [(i-j) for i, j in zip(layer_shape(net), [0, 1, 1, 0])]
            # net = tf.slice(net, [0, 0, 0, 0], pad_shape, name='caffe_pad')
        end_points[end_point] = net

        # Prediction and localisations layers.
        predictions = []
        logits = []
        localisations = []
        for i, layer in enumerate(feat_layers):
            with tf.variable_scope(layer + '_box'):
                p, l = ssd_vgg_300.ssd_multibox_layer(end_points[layer],
                                                      num_classes,
                                                      anchor_sizes[i],
                                                      anchor_ratios[i],
                                                      normalizations[i])
            predictions.append(prediction_fn(p))
            logits.append(p)
            localisations.append(l)

        return predictions, localisations, logits, end_points
def ssd_multibox_layer(inputs,  # one feature map (a value from the end_points dict)
                       num_classes,  # 21
                       sizes,  # prior box sizes for this layer (6 feature layers in total)
                       ratios=[1],  # 2~4 aspect ratios per layer
                       normalization=-1,  # normalization
                       bn_normalization=False):
    """Construct a multibox layer, return a class and localization predictions.
    """
    net = inputs # the input feature map (e.g. the last feature map of block 4)
    if normalization > 0:
        net = custom_layers.l2_normalization(net, scaling=True)
    # Number of anchors.
    num_anchors = len(sizes) + len(ratios) #  2 + 2 (sizes, ratios [2, 0.5]): 4 anchor boxes per location on the first feature map

    # Location.
    num_loc_pred = num_anchors * 4 # 4 location predictions per anchor
    loc_pred = slim.conv2d(net, num_loc_pred, [3, 3], activation_fn=None,
                           scope='conv_loc')
    loc_pred = custom_layers.channel_to_last(loc_pred) # move the channel dimension to the last axis
    loc_pred = tf.reshape(loc_pred,
                          tensor_shape(loc_pred, 4)[:-1]+[num_anchors, 4]) # split the last dim into (num_anchors, 4 location values)
    # Class prediction.
    num_cls_pred = num_anchors * num_classes # 21 class scores per anchor
    cls_pred = slim.conv2d(net, num_cls_pred, [3, 3], activation_fn=None,
                           scope='conv_cls')
    cls_pred = custom_layers.channel_to_last(cls_pred)
    cls_pred = tf.reshape(cls_pred,
                          tensor_shape(cls_pred, 4)[:-1]+[num_anchors, num_classes]) # split the last dim into (num_anchors, num_classes)
    return cls_pred, loc_pred

  

5.1.3 Anchor box computation

Main steps:

  • Compute the center coordinates
  • Compute the widths and heights from the aspect ratios and scales
def ssd_anchor_one_layer(img_shape,
                         feat_shape,
                         sizes,
                         ratios,
                         step,
                         offset=0.5,
                         dtype=np.float32):
    """Computer SSD default anchor boxes for one feature layer.

    Determine the relative position grid of the centers, and the relative
    width and height.

    Arguments:
      feat_shape: Feature shape, used for computing relative position grids;
      size: Absolute reference sizes;
      ratios: Ratios to use on these features;
      img_shape: Image shape, used for computing height, width relatively to the
        former;
      offset: Grid offset.

    Return:
      y, x, h, w: Relative x and y grids, and height and width.
      Returns the relative positions of the cell centers, and heights/widths relative to the original image.
    """
    # Compute the position grid: simple way.
    # y, x = np.mgrid[0:feat_shape[0], 0:feat_shape[1]]
    # y = (y.astype(dtype) + offset) / feat_shape[0]
    # x = (x.astype(dtype) + offset) / feat_shape[1]
    # Weird SSD-Caffe computation using steps values...
    y, x = np.mgrid[0:feat_shape[0], 0:feat_shape[1]]
    y = (y.astype(dtype) + offset) * step / img_shape[0]  # compute the center of each cell
    x = (x.astype(dtype) + offset) * step / img_shape[1]  # compute the center of each cell

    # Expand dims to support easy broadcasting.
    y = np.expand_dims(y, axis=-1)
    x = np.expand_dims(x, axis=-1)

    # Compute relative height and width.
    # Tries to follow the original implementation of SSD for the order.
    num_anchors = len(sizes) + len(ratios)
    h = np.zeros((num_anchors, ), dtype=dtype)
    w = np.zeros((num_anchors, ), dtype=dtype)
    # Add first anchor boxes with ratio=1.
    h[0] = sizes[0] / img_shape[0]
    w[0] = sizes[0] / img_shape[1]  # as a fraction of the original image
    di = 1
    if len(sizes) > 1: # for ratio=1, add an extra box with w = s'_k, h = s'_k
        h[1] = math.sqrt(sizes[0] * sizes[1]) / img_shape[0]  # s'_k as a ratio (the extra box)
        w[1] = math.sqrt(sizes[0] * sizes[1]) / img_shape[1]  # s'_k as a ratio (the extra box)
        di += 1
    for i, r in enumerate(ratios):
        h[i+di] = sizes[0] / img_shape[0] / math.sqrt(r)  # height ratio
        w[i+di] = sizes[0] / img_shape[1] * math.sqrt(r)  # width ratio
    return y, x, h, w

  

5.1.4 Encoding

Combine the computed anchor box locations with the ground truth class labels and locations from the dataset.

def tf_ssd_bboxes_encode_layer(labels,
                               bboxes,
                               anchors_layer,
                               num_classes,
                               no_annotation_label,
                               ignore_threshold=0.5,
                               prior_scaling=[0.1, 0.1, 0.2, 0.2],
                               dtype=tf.float32):
    """Encode groundtruth labels and bounding boxes using SSD anchors from
    one layer.

    Arguments:
      labels: 1D Tensor(int64) containing groundtruth labels;
      bboxes: Nx4 Tensor(float) with bboxes relative coordinates;
      anchors_layer: Numpy array with layer anchors;
      matching_threshold: Threshold for positive match with groundtruth bboxes;
      prior_scaling: Scaling of encoded coordinates.

    Return:
      (target_labels, target_localizations, target_scores): Target Tensors.
    """
    # Anchors coordinates and volume.
    yref, xref, href, wref = anchors_layer  # anchor centers, heights and widths
    #href
    #array([ 0.07      ,  0.10246951,  0.04949747,  0.09899495], dtype=float32)
    #wref
    #array([ 0.07      ,  0.10246951,  0.09899495,  0.04949747], dtype=float32)    
    ymin = yref - href / 2.  # top-left y coordinate
    xmin = xref - wref / 2.  # top-left x coordinate
    ymax = yref + href / 2.  # bottom-right y coordinate
    xmax = xref + wref / 2.  # bottom-right x coordinate
    vol_anchors = (xmax - xmin) * (ymax - ymin)  # anchor box areas (38, 38, 4)
    '''element-wise array operations'''

    # Initialize tensors...
    shape = (yref.shape[0], yref.shape[1], href.size)#(38, 38, 4)
    feat_labels = tf.zeros(shape, dtype=tf.int64)
    feat_scores = tf.zeros(shape, dtype=dtype)

    feat_ymin = tf.zeros(shape, dtype=dtype)
    feat_xmin = tf.zeros(shape, dtype=dtype)
    feat_ymax = tf.ones(shape, dtype=dtype)
    feat_xmax = tf.ones(shape, dtype=dtype)
    '''per-box top-left and bottom-right coordinates, labels and scores'''

    def jaccard_with_anchors(bbox):#groundtruth
        """Compute jaccard score between a box and the anchors.
           (IoU computation)
        """
        int_ymin = tf.maximum(ymin, bbox[0])
        int_xmin = tf.maximum(xmin, bbox[1])
        int_ymax = tf.minimum(ymax, bbox[2])
        int_xmax = tf.minimum(xmax, bbox[3])
        h = tf.maximum(int_ymax - int_ymin, 0.)
        w = tf.maximum(int_xmax - int_xmin, 0.)  # intersection height and width
        # Volumes.
        inter_vol = h * w  # intersection area
        union_vol = vol_anchors - inter_vol \
            + (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])  # anchor area - intersection + bbox area (bbox is the GT)
        jaccard = tf.div(inter_vol, union_vol)  # IoU value
        return jaccard

    def intersection_with_anchors(bbox):
        """Compute intersection between score a box and the anchors.
           (intersection only)
        """
        int_ymin = tf.maximum(ymin, bbox[0])
        int_xmin = tf.maximum(xmin, bbox[1])
        int_ymax = tf.minimum(ymax, bbox[2])
        int_xmax = tf.minimum(xmax, bbox[3])
        h = tf.maximum(int_ymax - int_ymin, 0.)
        w = tf.maximum(int_xmax - int_xmin, 0.)
        inter_vol = h * w
        scores = tf.div(inter_vol, vol_anchors) # intersection area over the anchor area, used as a score
        return scores

    def condition(i, feat_labels, feat_scores,
                  feat_ymin, feat_xmin, feat_ymax, feat_xmax):
        """Condition: check label index.
           Iterate over the labels; i is incremented when body returns.
        """
        r = tf.less(i, tf.shape(labels))
        return r[0]

    def body(i, feat_labels, feat_scores,
             feat_ymin, feat_xmin, feat_ymax, feat_xmax):
        """Body: update feature labels, scores and bboxes. 循环体
        Follow the original SSD paper for that purpose:
          - assign values when jaccard > 0.5;
          - only update if beat the score of other bboxes.
        """
        # Jaccard score.
        label = labels[i]  # label of the i-th ground truth
        bbox = bboxes[i]   # the i-th ground truth box
        # compute the IoU of all boxes on this feature map with the i-th ground truth
        jaccard = jaccard_with_anchors(bbox)
        # Mask: check threshold + scores + no annotations + num_classes.
        # mask is 1 where the IoU exceeds the current feat_scores
        mask = tf.greater(jaccard, feat_scores)
        # mask = tf.logical_and(mask, tf.greater(jaccard, matching_threshold))
        mask = tf.logical_and(mask, feat_scores > -0.5)
        mask = tf.logical_and(mask, label < num_classes)    # the label must be < num_classes (21)
        imask = tf.cast(mask, tf.int64)                     # cast mask to int
        fmask = tf.cast(mask, dtype)                        # cast mask to float
        # Update values using mask.
        feat_labels = imask * label + (1 - imask) * feat_labels # where mask=1 take the GT label, else keep the old value (0 = background)
        feat_scores = tf.where(mask, jaccard, feat_scores)      # tf.where: take jaccard where mask is true, else keep feat_scores
        # each anchor keeps the GT bbox with the highest IoU so far, then loop
        feat_ymin = fmask * bbox[0] + (1 - fmask) * feat_ymin
        feat_xmin = fmask * bbox[1] + (1 - fmask) * feat_xmin
        feat_ymax = fmask * bbox[2] + (1 - fmask) * feat_ymax
        feat_xmax = fmask * bbox[3] + (1 - fmask) * feat_xmax

        # Check no annotation label: ignore these anchors...
        # interscts = intersection_with_anchors(bbox)
        # mask = tf.logical_and(interscts > ignore_threshold,
        #                       label == no_annotation_label)
        # # Replace scores by -1.
        # feat_scores = tf.where(mask, -tf.cast(mask, dtype), feat_scores)

        return [i+1, feat_labels, feat_scores,
                feat_ymin, feat_xmin, feat_ymax, feat_xmax]

    # Main loop definition.
    i = 0
    [i, feat_labels, feat_scores,
     feat_ymin, feat_xmin,
     feat_ymax, feat_xmax] = tf.while_loop(condition, body,
                                           [i, feat_labels, feat_scores,
                                            feat_ymin, feat_xmin,
                                            feat_ymax, feat_xmax])

    # Transform to center / size.
    feat_cy = (feat_ymax + feat_ymin) / 2.
    feat_cx = (feat_xmax + feat_xmin) / 2.
    feat_h = feat_ymax - feat_ymin
    feat_w = feat_xmax - feat_xmin
    # Encode features.
    feat_cy = (feat_cy - yref) / href / prior_scaling[0]
    feat_cx = (feat_cx - xref) / wref / prior_scaling[1]
    feat_h = tf.log(feat_h / href) / prior_scaling[2]
    feat_w = tf.log(feat_w / wref) / prior_scaling[3]
    # Use SSD ordering: x / y / w / h instead of ours.
    feat_localizations = tf.stack([feat_cx, feat_cy, feat_w, feat_h], axis=-1)
    return feat_labels, feat_localizations, feat_scores

The box encoding process:
From the positions of the prior box (anchor box) and the ground truth box,
i.e. $P=(P_x, P_y, P_w, P_h)$ and $G=(G_x, G_y, G_w, G_h)$,
the transformation targets $t_*$ are obtained as: $\begin{cases} t_x = (G_x-P_x)/P_w \\ t_y=(G_y-P_y)/P_h \\ t_w=\log(G_w/P_w) \\ t_h=\log(G_h/P_h)\end{cases}$
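A minimal numpy sketch of this encoding step (boxes given as center x, center y, width, height; hypothetical values). Note that the code above additionally divides the encoded values by prior_scaling:

import numpy as np

def encode(P, G):
    px, py, pw, ph = P   # prior (anchor) box
    gx, gy, gw, gh = G   # ground truth box
    return np.array([(gx - px) / pw,
                     (gy - py) / ph,
                     np.log(gw / pw),
                     np.log(gh / ph)])

print(encode((0.5, 0.5, 0.2, 0.2), (0.55, 0.5, 0.25, 0.18)))
# approximately [0.25, 0.0, 0.223, -0.105]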

  

5.1.5 Building the loss function
def ssd_losses(logits, localisations,
               gclasses, glocalisations, gscores,
               match_threshold=0.5,                          # IoU matching threshold
               negative_ratio=3.,                            # negative-to-positive sample ratio
               alpha=1.,
               label_smoothing=0.,
               scope=None):
    """Loss functions for training the SSD 300 VGG network.

    This function defines the different loss components of the SSD, and
    adds them to the TF loss collection.

    Arguments:
      logits: (list of) predictions logits Tensors;                # predicted logits, i.e. confidences before softmax
      localisations: (list of) localisations Tensors;              # predicted location offsets
      gclasses: (list of) groundtruth labels Tensors;
      glocalisations: (list of) groundtruth localisations Tensors;
      gscores: (list of) groundtruth score Tensors;
    """
    with tf.name_scope(scope, 'ssd_losses'):
        l_cross_pos = []   # cross-entropy losses of the positive samples
        l_cross_neg = []   # cross-entropy losses of the negative samples
        l_loc = []         # localization (offset) losses
        for i in range(len(logits)):
            dtype = logits[i].dtype
            with tf.name_scope('block_%i' % i):
                # Determine weights Tensor.
                pmask = gscores[i] > match_threshold     # anchors with IoU > 0.5 are positives
                fpmask = tf.cast(pmask, dtype)           # cast to float
                n_positives = tf.reduce_sum(fpmask)      # number of positives

                # Select some random negative entries.
                # n_entries = np.prod(gclasses[i].get_shape().as_list())
                # r_positive = n_positives / n_entries
                # r_negative = negative_ratio * n_positives / (n_entries - n_positives)

                # Negative mask.
                no_classes = tf.cast(pmask, tf.int32)         # cast the positive mask to int (labels for the negatives)
                predictions = slim.softmax(logits[i])         # predicted probability of each class
                nmask = tf.logical_and(tf.logical_not(pmask), # negatives: not positive and score > -0.5
                                       gscores[i] > -0.5)
                fnmask = tf.cast(nmask, dtype)                # cast to float
                nvalues = tf.where(nmask,                     # take the values of the negative samples
                                   predictions[:, :, :, :, 0],# predictions[:, :, :, :, 0] is the first class, i.e. the background class
                                   1. - fnmask)               # for negatives take the predicted value, otherwise 1 - fnmask (= 1)
                nvalues_flat = tf.reshape(nvalues, [-1])      # flatten nvalues

                # Number of negative entries to select (hard negative mining).
                n_neg = tf.cast(negative_ratio * n_positives, tf.int32) # number of negatives = negative_ratio x number of positives
                n_neg = tf.maximum(n_neg, tf.size(nvalues_flat) // 8)   # at least 1/8 of all nvalues entries
                n_neg = tf.maximum(n_neg, tf.shape(nvalues)[0] * 4)
                max_neg_entries = 1 + tf.cast(tf.reduce_sum(fnmask), tf.int32)
                n_neg = tf.minimum(n_neg, max_neg_entries)

                val, idxes = tf.nn.top_k(-nvalues_flat, k=n_neg)  # pick the k hardest negatives (lowest background probability)
                minval = val[-1]                                  # the largest background probability among the selected negatives
                # Final negative mask.
                nmask = tf.logical_and(nmask, -nvalues > minval)  # the final negatives
                fnmask = tf.cast(nmask, dtype)

                # Add cross-entropy loss.
                with tf.name_scope('cross_entropy_pos'):     # cross-entropy loss of the positives
                    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits[i],
                                                                          labels=gclasses[i])
                    loss = tf.losses.compute_weighted_loss(loss, fpmask) # equivalent to loss x fpmask, filtering out the negatives
                    l_cross_pos.append(loss)                             # the mask is 0 for negatives and 1 for positives, so only positives are kept

                with tf.name_scope('cross_entropy_neg'):     # cross-entropy loss of the negatives
                    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits[i],
                                                                          labels=no_classes)
                    loss = tf.losses.compute_weighted_loss(loss, fnmask) # same as above
                    l_cross_neg.append(loss)

                # Add localization loss: smooth L1, L2, ...
                with tf.name_scope('localization'):
                    # Weights Tensor: positive mask + random negative.
                    weights = tf.expand_dims(alpha * fpmask, axis=-1)
                    loss = custom_layers.abs_smooth(localisations[i] - glocalisations[i]) # smooth L1 loss on the location offsets
                    loss = tf.losses.compute_weighted_loss(loss, weights)
                    l_loc.append(loss)

        # Additional total losses...
        with tf.name_scope('total'):
            total_cross_pos = tf.add_n(l_cross_pos, 'cross_entropy_pos')
            total_cross_neg = tf.add_n(l_cross_neg, 'cross_entropy_neg')
            total_cross = tf.add(total_cross_pos, total_cross_neg, 'cross_entropy') # total cross-entropy loss over positives and negatives
            total_loc = tf.add_n(l_loc, 'localization')  # total localization loss over the positives

            # Add to EXTRA LOSSES TF.collection
            tf.add_to_collection('EXTRA_LOSSES', total_cross_pos)
            tf.add_to_collection('EXTRA_LOSSES', total_cross_neg)
            tf.add_to_collection('EXTRA_LOSSES', total_cross)
            tf.add_to_collection('EXTRA_LOSSES', total_loc)

5.2 Testing phase

Main steps:

  • Build the network: produce the predicted class scores and location offsets (logits (before softmax), localisations)
  • Anchor box computation: computed from each feature map's size and the prior box scales and aspect ratios
  • Encoding: combine the anchor box locations with the ground truth class labels and locations from the dataset
  • Build the loss function and compute the loss value
  • Decoding: decode the network predictions together with the anchor boxes to obtain the predicted box positions
  • Filtering: further filter the anchor boxes with predicted classes and positions
    • filter by select_threshold
    • keep the top-k boxes with the highest confidence
    • apply non-maximum suppression (NMS)

The first four steps are the same as in the training phase, so the corresponding code is omitted below.

5.2.1 Decoding

The box decoding process:
From the predicted offsets $d_*(P)$ and the position of the prior box (anchor box),
i.e. $P=(P_x, P_y, P_w, P_h)$ and $d_*(P)$: $\begin{cases} d_x(P) = (\hat G_x-P_x)/P_w \\ d_y(P)=(\hat G_y-P_y)/P_h \\ d_w(P)=\log(\hat G_w/P_w) \\ d_h(P)=\log(\hat G_h/P_h)\end{cases}$

the position of the predicted box is obtained as: $\begin{cases}\hat G_x=P_w d_x(P)+P_x \\ \hat G_y = P_h d_y(P)+P_y \\ \hat G_w = P_w \exp(d_w(P)) \\ \hat G_h = P_h \exp(d_h(P))\end{cases}$

def tf_ssd_bboxes_decode_layer(feat_localizations,
                               anchors_layer,
                               prior_scaling=[0.1, 0.1, 0.2, 0.2]):
    """Compute the relative bounding boxes from the layer features and
    reference anchor bounding boxes.

    Arguments:
      feat_localizations: Tensor containing localization features.
      anchors: List of numpy array containing anchor boxes.

    Return:
      Tensor Nx4: ymin, xmin, ymax, xmax
    """
    yref, xref, href, wref = anchors_layer

    # Compute center, height and width
    cx = feat_localizations[:, :, :, :, 0] * wref * prior_scaling[0] + xref
    cy = feat_localizations[:, :, :, :, 1] * href * prior_scaling[1] + yref
    w = wref * tf.exp(feat_localizations[:, :, :, :, 2] * prior_scaling[2])
    h = href * tf.exp(feat_localizations[:, :, :, :, 3] * prior_scaling[3])
    # Boxes coordinates.
    ymin = cy - h / 2.
    xmin = cx - w / 2.
    ymax = cy + h / 2.
    xmax = cx + w / 2.
    bboxes = tf.stack([ymin, xmin, ymax, xmax], axis=-1)
    return bboxes

However, the SSD Caffe implementation has an additional trick: a set of variance hyperparameters used to rescale the regression targets. The boolean parameter variance_encoded_in_target switches between two modes. When it is True, the variance is already encoded in the predicted values, which is the case above. When it is False (the more common setting, arguably easier to train), the variance hyperparameters must be set manually and are used to scale the 4 components of $d_*(P)$, and the boxes are then decoded as follows.
Predicted box position: $\begin{cases}\hat G_x=P_w\, d_x(P)\, variance[0]+P_x \\ \hat G_y = P_h\, d_y(P)\, variance[1]+P_y \\ \hat G_w = P_w \exp(d_w(P)\, variance[2]) \\ \hat G_h = P_h \exp(d_h(P)\, variance[3])\end{cases}$

5.2.2 Filtering

  • Filtering: further filter the anchor boxes with predicted classes and positions
    • filter by select_threshold
    • keep the top-k boxes with the highest confidence
    • apply non-maximum suppression (NMS)
def detected_bboxes(self, predictions, localisations,
                    select_threshold=None, nms_threshold=0.5,
                    clipping_bbox=None, top_k=400, keep_top_k=200):
    """Get the detected bounding boxes from the SSD network output.
    """
    # Select top_k bboxes from predictions, and clip
    # filter by select_threshold
    rscores, rbboxes = \
        ssd_common.tf_ssd_bboxes_select(predictions, localisations,
                                        select_threshold=select_threshold,
                                        num_classes=self.params.num_classes)
    # keep the top_k boxes with the highest confidence
    rscores, rbboxes = \
        tfe.bboxes_sort(rscores, rbboxes, top_k=top_k)

    # further filter with non-maximum suppression
    # Apply NMS algorithm.
    rscores, rbboxes = \
        tfe.bboxes_nms_batch(rscores, rbboxes,
                             nms_threshold=nms_threshold,
                             keep_top_k=keep_top_k)
    # if clipping_bbox is not None:
    #     rbboxes = tfe.bboxes_clip(clipping_bbox, rbboxes)
    return rscores, rbboxes