SSD: Single Shot MultiBox Detector
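1. Main idea of SSD
SSD stands for Single Shot MultiBox Detector. "Single shot" marks SSD as a one-stage detector, and "MultiBox" indicates that SSD predicts boxes at multiple scales.
In SSD, the input image passes through successive convolutional layers that produce feature maps of different sizes. Earlier layers output larger feature maps and later layers output smaller ones. The prior boxes on the earlier feature maps have small scales and those on the later feature maps have large scales. In other words, from front to back the feature maps shrink while the prior boxes grow, so the small boxes on large feature maps detect small objects and the large boxes on small feature maps detect large objects.
In addition, SSD detects directly with convolutions: each of these feature maps is passed through a convolution that outputs the class confidences and the box locations.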
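2. Model structure
The SSD model has three main parts:
- Base feature extractor: VGG-16 (only the first 5 convolutional blocks are kept)
- SSD layers: the extra convolutional layers appended after the base network
- Prediction layers: the classification and localization layers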
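3. Model characteristics
SSD and YOLO are both one-stage algorithms: like YOLO, SSD produces the classification and localization results in a single pass through the model. Compared with YOLO, SSD has the following three characteristics:
- Detection on multi-scale feature maps
- Detection directly with convolutions
- Prior boxes with different scales and aspect ratios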
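3.1 Detection on multi-scale feature maps
Earlier algorithms such as YOLO end with a single-scale feature map and detect on that map alone. SSD instead generates feature maps at several scales and uses a subset of them for detection.
Shallow feature maps have small receptive fields and deep feature maps have large receptive fields, so the larger, shallower feature maps are suited to detecting small objects and the smaller, deeper feature maps are suited to detecting large objects.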
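3.2 Detection directly with convolutions
YOLO makes its predictions after fully connected layers, whereas SSD applies convolutions directly to the feature maps of each scale.
Detecting directly with convolutions means choosing the number of small convolution kernels (3x3 in both the paper and the code) according to what is being predicted. The number of kernels determines the number of channels of the output feature map, so each output channel carries a specific meaning.
For example, suppose the feature map entering the prediction layer (classification and localization layer) has size m*n*p, each kernel is 3*3*p, there are 3 classes to predict, and every pixel of the feature map has 3 prior boxes.
For class prediction:
Use 3*3 = 9 kernels (3 boxes times 3 classes). The prediction layer outputs a feature map of shape [m, n, 3*3]: each of the m*n pixels gets a 9-dimensional vector, and every 3 components hold the class scores of one box.
For location prediction:
Use 3*4 = 12 kernels (3 boxes times 4 coordinates). The output feature map has shape [m, n, 3*4]: each of the m*n pixels gets a 12-dimensional vector, and every 4 components hold the four location values (offsets) of one box.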
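The channel arithmetic above can be checked with a small NumPy sketch. It is only an illustration under assumptions: the sizes `m`, `n`, `p` are arbitrary, the weights are random, and the helper `conv3x3_same` is a hypothetical hand-written 3x3 convolution with zero padding, not code from the paper or the repository.

```python
import numpy as np

def conv3x3_same(feature_map, kernels):
    """Naive 3x3 convolution with zero padding (hypothetical helper).

    feature_map: [m, n, p]; kernels: [3, 3, p, num_kernels].
    Returns an array of shape [m, n, num_kernels].
    """
    m, n, _ = feature_map.shape
    padded = np.pad(feature_map, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((m, n, kernels.shape[-1]))
    for dy in range(3):
        for dx in range(3):
            patch = padded[dy:dy + m, dx:dx + n, :]   # [m, n, p]
            out += patch @ kernels[dy, dx]            # [m, n, num_kernels]
    return out

m, n, p = 5, 5, 8                 # arbitrary feature map size
num_boxes, num_classes = 3, 3     # values used in the example above
rng = np.random.default_rng(0)
feature_map = rng.normal(size=(m, n, p))

# 3*3 kernels for classification, 3*4 kernels for localization.
cls_kernels = rng.normal(size=(3, 3, p, num_boxes * num_classes))
loc_kernels = rng.normal(size=(3, 3, p, num_boxes * 4))

# Split the channel axis into (box, value) so each box has its own vector.
cls_pred = conv3x3_same(feature_map, cls_kernels).reshape(m, n, num_boxes, num_classes)
loc_pred = conv3x3_same(feature_map, loc_kernels).reshape(m, n, num_boxes, 4)
print(cls_pred.shape, loc_pred.shape)
```

The final reshape is the same idea as the reshape at the end of `ssd_multibox_layer` in section 5.1.2.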
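3.3 Prior boxes with different scales and aspect ratios
In SSD, the number of prior boxes, their scale (that is, the box size) and their aspect ratios differ from one feature map to another.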
3.3.1 Scale (size) of the prior boxes:
$$s_k=s_{min}+\frac{s_{max}-s_{min}}{m-1}(k-1),\quad k\in[1, m]$$
where:
- $m$: the number of feature maps minus one (the scale of the first feature map is set separately)
- $s_k$: the size of the prior box as a fraction of the image size
- $s_{min}$: the smallest fraction, set to 0.2 in the paper
- $s_{max}$: the largest fraction, set to 0.9 in the paper
For the first feature map, the scale fraction is usually set to $s_{min}/2=0.1$, so its scale is 300*0.1=30.
For the remaining 5 feature maps, the scales are computed as follows. First multiply the fractions by 100, i.e. $s_{min}=20,s_{max}=90$, then apply the formula above with the step rounded down to an integer, and finally divide the result by 100 to obtain the scale fraction:
$$k=1,\quad s_1=[20+\lfloor \frac{90-20}{5-1}\rfloor*(1-1)]/100=[20+0]/100=0.2$$
$$k=2,\quad s_2=[20+\lfloor \frac{90-20}{5-1}\rfloor*(2-1)]/100=[20+17]/100=0.37$$
$$k=3,\quad s_3=[20+\lfloor \frac{90-20}{5-1}\rfloor*(3-1)]/100=[20+34]/100=0.54$$
$$k=4,\quad s_4=[20+\lfloor \frac{90-20}{5-1}\rfloor*(4-1)]/100=[20+51]/100=0.71$$
$$k=5,\quad s_5=[20+\lfloor \frac{90-20}{5-1}\rfloor*(5-1)]/100=[20+68]/100=0.88$$
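Multiplying these 5 scale fractions by the input image size (300) gives the prior box scale on each feature map: 60, 111, 162, 213, 264.
Together with the scale 30 of the first feature map, the prior box scales on the 6 feature maps are 30, 60, 111, 162, 213, 264.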
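The scale computation above can be reproduced in a few lines of Python. This is a minimal sketch that follows the integer arithmetic described in this section; the function name `prior_box_scales` and its parameter names are made up for the example.

```python
import math

def prior_box_scales(img_size=300, s_min=20, s_max=90, num_maps=6):
    """Return the prior box scale in pixels for each feature map."""
    m = num_maps - 1                                     # first map is handled separately
    step = int(math.floor((s_max - s_min) / (m - 1)))    # floor(70 / 4) = 17
    percents = [s_min + step * (k - 1) for k in range(1, m + 1)]
    scales = [img_size * (s_min // 2) // 100]            # s_min / 2 for the first map
    scales += [img_size * pct // 100 for pct in percents]
    return scales

scales = prior_box_scales()
print(scales)  # [30, 60, 111, 162, 213, 264]
```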
3.3.2 Width and height of the prior boxes:
Given $s_k$ and $a_r$, the formulas below give the width and height of each prior box on a feature map:
$$w_k^a=s_k \sqrt {a_r}, \quad h_k^a=\frac{s_k}{\sqrt {a_r}}$$
where $a_r$ is the aspect ratio (width to height) of the prior box, $a_r\in\{1,2,3,\frac{1}{2},\frac{1}{3}\}$.
Each prior box scale $s_k$ is paired with the 5 aspect ratios $a_r$. For $a_r=1$ an additional scale $s'_k=\sqrt{s_ks_{k+1}}$ is used, which adds one more box. For the last feature map, $s'_{m}$ is computed from a virtual scale $s_{m+1}=300*105/100=315$.
In other words, every pixel of a feature map has 6 boxes: 2 squares and 4 rectangles.
Worked example:
$$s_k\in\{30,60,111,162,213,264\},\quad a_r\in\{1,2,3,\frac{1}{2},\frac{1}{3}\}$$
For $k=1$, $s_1=30$ (in this example the subscript of $a$, $w$ and $h$ indexes the aspect ratio):
$a_1=1$: two square side lengths are needed, and the second one uses $s'_1=\sqrt{s_1s_{2}}=\sqrt{30*60}\approx 42$
Side length of the first square box:
$$w_1^a=s_1 \sqrt {a_1}=30*1=30, \quad h_1^a=\frac{s_1}{\sqrt {a_1}}=\frac{30}{1}=30$$
Side length of the second square box:
$$w_1'^a=s'_1 \sqrt {a_1}=42*1=42, \quad h_1'^a=\frac{s_1'}{\sqrt {a_1}}=\frac{42}{1}=42$$
$a_2=2$, rectangular box, width: $w_2^a=s_1 \sqrt {a_2}=30*\sqrt{2}=30\sqrt{2}$, height: $h_2^a=\frac{s_1}{\sqrt {a_2}}=\frac{30}{\sqrt{2}}$
$a_3=3$, rectangular box, width: $w_3^a=s_1 \sqrt {a_3}=30*\sqrt{3}=30\sqrt{3}$, height: $h_3^a=\frac{s_1}{\sqrt {a_3}}=\frac{30}{\sqrt{3}}$
$a_4=\frac{1}{2}$, rectangular box, width: $w_4^a=s_1 \sqrt {a_4}=30*\sqrt{\frac{1}{2}}=\frac{30}{\sqrt{2}}$, height: $h_4^a=\frac{s_1}{\sqrt {a_4}}=30\sqrt{2}$
$a_5=\frac{1}{3}$, rectangular box, width: $w_5^a=s_1 \sqrt {a_5}=30*\sqrt{\frac{1}{3}}=\frac{30}{\sqrt{3}}$, height: $h_5^a=\frac{s_1}{\sqrt {a_5}}=30\sqrt{3}$
The computation for the remaining 5 feature maps is analogous.
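The width and height formulas can be applied to any scale with a short helper. The sketch below is illustrative: `prior_box_sizes` and its argument names are hypothetical, and it simply evaluates $w=s_k\sqrt{a_r}$ and $h=s_k/\sqrt{a_r}$ plus the extra square box with side $\sqrt{s_ks_{k+1}}$.

```python
import math

def prior_box_sizes(s_k, s_next, ratios=(2, 3, 1 / 2, 1 / 3)):
    """Return (width, height) pairs of the prior boxes for one scale s_k.

    s_next is s_{k+1}; it is only used for the extra square box.
    """
    boxes = [(s_k, s_k)]                      # a_r = 1
    s_prime = math.sqrt(s_k * s_next)         # extra square box
    boxes.append((s_prime, s_prime))
    for a_r in ratios:                        # the 4 rectangular boxes
        boxes.append((s_k * math.sqrt(a_r), s_k / math.sqrt(a_r)))
    return boxes

boxes = prior_box_sizes(30, 60)
for w, h in boxes:
    print(round(w, 2), round(h, 2))
```

Every rectangular box keeps the area $s_k^2$, which is why the scale alone controls the box size and the aspect ratio only changes its shape.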
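The above is the general procedure. In the paper and in the actual code, however, the Conv4_3, Conv10_2 and Conv11_2 layers, that is, the first and the last two feature maps, do not use the aspect ratios $3,\frac{1}{3}$, so every pixel on these three feature maps has 4 prior boxes of different sizes.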
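With 4 boxes per pixel on three layers and 6 on the other three, the total number of prior boxes follows directly. The sketch below assumes the usual SSD300 feature map sizes 38, 19, 10, 5, 3, 1, which are not listed in this section (only the 38x38 size shows up in the code comments later on).

```python
# Assumed SSD300 feature map sizes (side length in cells), front to back.
feature_map_sizes = [38, 19, 10, 5, 3, 1]
# Boxes per pixel: 4 on the first and the last two maps, 6 elsewhere.
boxes_per_cell = [4, 6, 6, 6, 4, 4]

per_layer = [f * f * b for f, b in zip(feature_map_sizes, boxes_per_cell)]
total = sum(per_layer)
print(per_layer, total)
```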
3.3.3 Centers of the prior boxes:
The steps above fix the size of each prior box. What remains is the center of each prior box:
$$(\frac{i+0.5}{|f_k|},\frac{j+0.5}{|f_k|}),\quad i,j\in[0, |f_k|)$$
where $|f_k|$ is the size of the k-th feature map.
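The center formula maps every cell of a feature map to a point in normalized image coordinates. A minimal NumPy sketch, with a hypothetical helper name `prior_box_centers`:

```python
import numpy as np

def prior_box_centers(f_k):
    """Return (cx, cy), each of shape [f_k, f_k], in normalized coordinates."""
    j, i = np.mgrid[0:f_k, 0:f_k]   # j: row index, i: column index
    cx = (i + 0.5) / f_k
    cy = (j + 0.5) / f_k
    return cx, cy

cx, cy = prior_box_centers(3)
print(cx[0])      # x centers along the first row
print(cy[:, 0])   # y centers along the first column
```

The `ssd_anchor_one_layer` listing in section 5.1.3 contains the same grid in its commented-out "simple way" and a variant based on a `step` value.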
4. Loss function
The loss is defined as the weighted sum of the confidence loss (conf) and the localization loss (loc):
$$L(x,c,l,g)=\frac{1}{N}(L_{conf}(x,c)+\alpha L_{loc}(x,l,g))$$
where:
- $N$: the number of positive prior boxes
- $c$: the predicted class confidences
- $l$: the predicted box offsets
- $\alpha$: the weight, set to 1 by cross-validation
- $x_{ij}^p=1$ means that the i-th prior box is matched to the j-th ground truth box of class p; otherwise $x_{ij}^p=0$.
Confidence loss (conf):
The confidence loss is the multi-class softmax loss:
$$L_{conf}(x,c)=-\sum \limits^N_{i\in Pos}x_{ij}^p\log(\hat c_i^p)-\sum \limits ^N_{i\in Neg}\log(\hat c_i^0)\quad \text{where} \quad \hat c_i^p=\frac{\exp(c_i^p)}{\sum_p\exp(c_i^p)}$$
where:
- p denotes the p-th class
- 0 denotes class 0, which is the background class
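The confidence loss can be written out in NumPy as a sanity check. This sketch makes a few assumptions: each positive box carries a single matched class label, the set of negatives is passed in as a mask (how it is chosen is a separate step), and the result is the plain sum, not yet divided by $N$. The function name `confidence_loss` is hypothetical.

```python
import numpy as np

def confidence_loss(logits, labels, neg_mask):
    """Softmax loss summed over positives and the selected negatives.

    logits:   [num_boxes, num_classes] raw scores c_i^p
    labels:   [num_boxes] integer class per box, 0 = background
    neg_mask: [num_boxes] boolean, True for negatives kept in the loss
    """
    shifted = logits - logits.max(axis=1, keepdims=True)        # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    pos_mask = labels > 0
    idx = np.arange(len(labels))
    pos_loss = -log_probs[idx, labels][pos_mask].sum()          # -log(c_hat_i^p)
    neg_loss = -log_probs[:, 0][neg_mask].sum()                 # -log(c_hat_i^0)
    return pos_loss + neg_loss

# With all-zero logits every class has probability 1/4.
logits = np.zeros((3, 4))
labels = np.array([2, 0, 0])
neg_mask = np.array([False, True, False])
print(confidence_loss(logits, labels, neg_mask))
```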
Localization loss (loc):
$$L_{loc}(x,l,g)=\sum\limits _{i\in Pos}^N \sum \limits _{m\in\{cx,cy,w,h\}} x_{ij}^k\,\mathrm{smooth}_{L1}(l_i^m-\hat g_j^m)$$
$$\hat g_j^{cx}=(g_j^{cx}-d_i^{cx})/d_i^w\quad \quad \hat g_j^{cy}=(g_j^{cy}-d_i^{cy})/d_i^h$$
$$\hat g^w_j=\log(\frac{g_j^w}{d_i^w}) \quad \quad \hat g^h_j=\log(\frac{g_j^h}{d_i^h})$$
where:
- $\hat g$: the offsets from the default box $d$ to the ground truth box $g$
- $l$: the offsets from the default box to the refined position, that is, the predicted offsets
- $\mathrm{smooth}_{L_1}(x)=\begin{cases}0.5x^2, & \text{if } |x|<1 \\ |x|-0.5, & \text{otherwise} \end{cases}$
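The encoding $\hat g$ and the smooth L1 penalty can be sketched as follows. The helper names are hypothetical, boxes are given as (cx, cy, w, h), and the `prior_scaling` factors that appear as a parameter in the encoding code of section 5.1.4 are deliberately left out, so this is the formula above and nothing more.

```python
import numpy as np

def smooth_l1(x):
    """Element-wise smooth L1: 0.5 x^2 if |x| < 1, else |x| - 0.5."""
    x = np.abs(x)
    return np.where(x < 1, 0.5 * x ** 2, x - 0.5)

def encode_box(g, d):
    """Offsets g_hat from default box d to ground truth g, both (cx, cy, w, h)."""
    g_cx, g_cy, g_w, g_h = g
    d_cx, d_cy, d_w, d_h = d
    return np.array([(g_cx - d_cx) / d_w,
                     (g_cy - d_cy) / d_h,
                     np.log(g_w / d_w),
                     np.log(g_h / d_h)])

def localization_loss(l, g_hat):
    """Smooth L1 summed over the four coordinates of one matched box."""
    return smooth_l1(l - g_hat).sum()

d = (0.5, 0.5, 0.2, 0.2)            # default box
g = (0.6, 0.5, 0.4, 0.2)            # ground truth box
g_hat = encode_box(g, d)
print(g_hat)                        # cx offset 0.5, w offset log(2)
print(localization_loss(np.zeros(4), g_hat))
```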
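5. Main code analysis
5.1 Training stage
5.1.1 Main function
Four main steps:
- Build the network: produce the predicted classes and box offsets (logits, i.e. scores before softmax, and localisations)
- Compute the anchor boxes: from the size of each feature map and the prior box scales and aspect ratios
- Encode: combine the anchor box positions with the ground truth class labels and box positions of the dataset
- Build the loss and train iteratively: take the results of the previous steps, select the negative samples, then compute the loss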
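The last step mentions selecting negative samples, and the script below exposes a `negative_ratio` flag set to 3. A common way to do this is hard negative mining: keep only the hardest negatives, at most `negative_ratio` per positive. The sketch below is an assumption about the general idea, ranking negatives by a per-box background loss, and is not a transcription of the repository's `losses` function; the name `select_hard_negatives` is made up.

```python
import numpy as np

def select_hard_negatives(bg_loss, pos_mask, negative_ratio=3):
    """Return a boolean mask of the negatives kept for the loss.

    bg_loss:  [num_boxes] per-box loss for the background class
    pos_mask: [num_boxes] boolean, True for boxes matched to an object
    """
    num_pos = int(pos_mask.sum())
    neg_candidates = np.flatnonzero(~pos_mask)
    num_neg = min(int(negative_ratio * num_pos), len(neg_candidates))
    order = np.argsort(-bg_loss[neg_candidates])     # hardest first
    keep = neg_candidates[order[:num_neg]]
    neg_mask = np.zeros(pos_mask.shape, dtype=bool)
    neg_mask[keep] = True
    return neg_mask

bg_loss = np.array([0.1, 0.9, 0.3, 0.8, 0.2, 0.7])
pos_mask = np.array([True, False, False, False, False, False])
print(select_hard_negatives(bg_loss, pos_mask))
```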
#coding:utf-8
# Copyright 2016 Paul Balanca. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Generic training script that trains a SSD model using a given dataset."""
import tensorflow as tf
from tensorflow.python.ops import control_flow_ops
from datasets import dataset_factory
from deployment import model_deploy  # model deployment utilities
from nets import nets_factory
from preprocessing import preprocessing_factory
import tf_utils
slim = tf.contrib.slim
#DATA_FORMAT = 'NCHW'
DATA_FORMAT = 'NHWC'
# =========================================================================== #
# SSD Network flags.
# =========================================================================== #
tf.app.flags.DEFINE_float(
    'loss_alpha', 1., 'Alpha parameter in the loss function.')
tf.app.flags.DEFINE_float(
    'negative_ratio', 3., 'Negative ratio in the loss function.')  # negatives kept per positive
tf.app.flags.DEFINE_float(
    'match_threshold', 0.5, 'Matching threshold in the loss function.')
# =========================================================================== #
# General Flags.
# =========================================================================== #
tf.app.flags.DEFINE_string(
    'train_dir', 'logs',
    'Directory where checkpoints and event logs are written to.')
tf.app.flags.DEFINE_integer('num_clones', 1,
                            'Number of model clones to deploy.')  # 1 clone: a single copy of the model
tf.app.flags.DEFINE_boolean('clone_on_cpu', True,  # set to False to place the clones on GPU
                            'Use CPUs to deploy clones.')
tf.app.flags.DEFINE_integer(
    'num_readers', 2,
    'The number of parallel readers that read data from the dataset.')  # two reader threads
tf.app.flags.DEFINE_integer(
    'num_preprocessing_threads', 2,
    'The number of threads used to create the batches.')  # two batching threads
tf.app.flags.DEFINE_integer(
    'log_every_n_steps', 10,  # log every 10 steps
    'The frequency with which logs are print.')
tf.app.flags.DEFINE_integer(
    'save_summaries_secs', 60,  # write summaries every 60 s
    'The frequency with which summaries are saved, in seconds.')
tf.app.flags.DEFINE_integer(
    'save_interval_secs', 60,  # save a checkpoint every 60 s
    'The frequency with which the model is saved, in seconds.')
tf.app.flags.DEFINE_float(
    'gpu_memory_fraction', 0.8, 'GPU memory fraction to use.')  # use up to 80% of GPU memory
# =========================================================================== #
# Optimization Flags.
# =========================================================================== #
tf.app.flags.DEFINE_float(
    'weight_decay', 0.00004, 'The weight decay on the model weights.')  # L2 regularization coefficient (lambda)
tf.app.flags.DEFINE_string(
    'optimizer', 'adam',
    'The name of the optimizer, one of "adadelta", "adagrad", "adam",'
    '"ftrl", "momentum", "sgd" or "rmsprop".')
tf.app.flags.DEFINE_float(
    'adadelta_rho', 0.95,
    'The decay rate for adadelta.')
tf.app.flags.DEFINE_float(
    'adagrad_initial_accumulator_value', 0.1,
    'Starting value for the AdaGrad accumulators.')
tf.app.flags.DEFINE_float(
    'adam_beta1', 0.9,
    'The exponential decay rate for the 1st moment estimates.')
tf.app.flags.DEFINE_float(
    'adam_beta2', 0.999,
    'The exponential decay rate for the 2nd moment estimates.')
tf.app.flags.DEFINE_float('opt_epsilon', 1.0, 'Epsilon term for the optimizer.')
tf.app.flags.DEFINE_float('ftrl_learning_rate_power', -0.5,
                          'The learning rate power.')
tf.app.flags.DEFINE_float(
    'ftrl_initial_accumulator_value', 0.1,
    'Starting value for the FTRL accumulators.')
tf.app.flags.DEFINE_float(
    'ftrl_l1', 0.0, 'The FTRL l1 regularization strength.')
tf.app.flags.DEFINE_float(
    'ftrl_l2', 0.0, 'The FTRL l2 regularization strength.')
tf.app.flags.DEFINE_float(
    'momentum', 0.9,
    'The momentum for the MomentumOptimizer and RMSPropOptimizer.')
tf.app.flags.DEFINE_float('rmsprop_momentum', 0.9, 'Momentum.')
tf.app.flags.DEFINE_float('rmsprop_decay', 0.9, 'Decay term for RMSProp.')
# =========================================================================== #
# Learning Rate Flags.
# =========================================================================== #
tf.app.flags.DEFINE_string(
    'learning_rate_decay_type',
    'exponential',
    'Specifies how the learning rate is decayed. One of "fixed", "exponential",'  # fixed or exponential schedule
    ' or "polynomial"')  # or polynomial schedule
tf.app.flags.DEFINE_float('learning_rate', 0.01, 'Initial learning rate.')
tf.app.flags.DEFINE_float(
    'end_learning_rate', 0.0001,
    'The minimal end learning rate used by a polynomial decay learning rate.')
tf.app.flags.DEFINE_float(
    'label_smoothing', 0.0, 'The amount of label smoothing.')
tf.app.flags.DEFINE_float(
    'learning_rate_decay_factor', 0.94, 'Learning rate decay factor.')
tf.app.flags.DEFINE_float(
    'num_epochs_per_decay', 2.0,
    'Number of epochs after which learning rate decays.')  # decay the learning rate every 2 epochs
tf.app.flags.DEFINE_float(
    'moving_average_decay', None,  # decay of the exponential moving average
    'The decay to use for the moving average.'
    'If left as None, then moving averages are not used.')
# =========================================================================== #
# Dataset Flags.
# =========================================================================== #
# dataset_name imagenet
tf.app.flags.DEFINE_string(
    'dataset_name', 'pascalvoc_2007', 'The name of the dataset to load.')
tf.app.flags.DEFINE_integer(
    'num_classes', 21, 'Number of classes to use in the dataset.')
tf.app.flags.DEFINE_string(
    'dataset_split_name', 'train', 'The name of the train/test split.')  # which split to load
tf.app.flags.DEFINE_string(
    'dataset_dir', 'tfrecords', 'The directory where the dataset files are stored.')
tf.app.flags.DEFINE_integer(
    'labels_offset', 0,  # label offset
    'An offset for the labels in the dataset. This flag is primarily used to '
    'evaluate the VGG and ResNet architectures which do not use a background '
    'class for the ImageNet dataset.')  # subtract 1 when the model has no background class
tf.app.flags.DEFINE_string(
    'model_name', 'ssd_300_vgg', 'The name of the architecture to train.')
tf.app.flags.DEFINE_string(
    'preprocessing_name', None, 'The name of the preprocessing to use. If left '
    'as `None`, then the model_name flag is used.')  # None: use the preprocessing named after the model
tf.app.flags.DEFINE_integer(
    'batch_size', 4, 'The number of samples in each batch.')
tf.app.flags.DEFINE_integer(
    'train_image_size', None, 'Train image size')  # size of the training images
tf.app.flags.DEFINE_integer('max_number_of_steps', None,
                            'The maximum number of training steps.')  # upper bound on training steps
# =========================================================================== #
# Fine-Tuning Flags.
# =========================================================================== #
tf.app.flags.DEFINE_string(
    'checkpoint_path', './checkpoints/ssd_300_vgg.ckpt',
    'The path to a checkpoint from which to fine-tune.')  # model to fine-tune from
tf.app.flags.DEFINE_string(
    'checkpoint_model_scope', None,
    'Model scope in the checkpoint. None if the same as the trained model.')
tf.app.flags.DEFINE_string(
    'checkpoint_exclude_scopes', None,
    'Comma-separated list of scopes of variables to exclude when restoring '
    'from a checkpoint.')
tf.app.flags.DEFINE_string(
    'trainable_scopes', None,
    'Comma-separated list of scopes to filter the set of variables to train.'
    'By default, None would train all the variables.')  # None: train all variables
tf.app.flags.DEFINE_boolean(
    'ignore_missing_vars', False,
    'When restoring a checkpoint would ignore missing variables.')  # whether to ignore variables missing from the checkpoint
FLAGS = tf.app.flags.FLAGS
# =========================================================================== #
# Main training routine.
# =========================================================================== #
def main(_):
    if not FLAGS.dataset_dir:
        raise ValueError('You must supply the dataset directory with --dataset_dir')
    tf.logging.set_verbosity(tf.logging.DEBUG)  # most verbose logging level
    with tf.Graph().as_default():  # create the graph
        # Config model_deploy. Keep TF Slim Models structure.
        # Useful if want to need multiple GPUs and/or servers in the future.
        # Deployment configuration (one replica, no parameter server tasks).
        deploy_config = model_deploy.DeploymentConfig(
            num_clones=FLAGS.num_clones,
            clone_on_cpu=FLAGS.clone_on_cpu,
            replica_id=0,
            num_replicas=1,
            num_ps_tasks=0)
        # Create global_step.
        with tf.device(deploy_config.variables_device()):
            # Variables are placed on the device chosen by the deployment config.
            global_step = slim.create_global_step()  # tf.train.create_global_step
        # Select the dataset.
        dataset = dataset_factory.get_dataset(
            FLAGS.dataset_name, FLAGS.dataset_split_name, FLAGS.dataset_dir)
        # Get the SSD network and its anchors.
        ssd_class = nets_factory.get_network(FLAGS.model_name)  # SSD-300 model class
        ssd_params = ssd_class.default_params._replace(num_classes=FLAGS.num_classes)  # default parameters with num_classes overridden
        ssd_net = ssd_class(ssd_params)  # instantiate the SSD network with these parameters
        ssd_shape = ssd_net.params.img_shape  # 300*300
        ssd_anchors = ssd_net.anchors(ssd_shape)  # anchor box information for every feature layer
        # Select the preprocessing function.
        preprocessing_name = FLAGS.preprocessing_name or FLAGS.model_name  # falls back to the model name
        image_preprocessing_fn = preprocessing_factory.get_preprocessing(  # called with image, labels, bboxes
            preprocessing_name, is_training=True)
        tf_utils.print_configuration(FLAGS.__flags, ssd_params,
                                     dataset.data_sources, FLAGS.train_dir)
        # =================================================================== #
        # Create a dataset provider and batches.
        # =================================================================== #
        with tf.device(deploy_config.inputs_device()):
            with tf.name_scope(FLAGS.dataset_name + '_data_provider'):
                provider = slim.dataset_data_provider.DatasetDataProvider(  # multi-threaded image reader from slim
                    dataset,  # Pascal VOC 2007 images (5011 images, 20 classes)
                    num_readers=FLAGS.num_readers,  # number of reader threads
                    common_queue_capacity=20 * FLAGS.batch_size,  # 80
                    common_queue_min=10 * FLAGS.batch_size,  # 40
                    shuffle=True)
            # Get for SSD network: image, labels, bboxes (ground truth).
            [image, shape, glabels, gbboxes] = provider.get(['image', 'shape',
                                                             'object/label',
                                                             'object/bbox'])
            # Pre-processing image, labels and bboxes.
            image, glabels, gbboxes = \
                image_preprocessing_fn(image, glabels, gbboxes,
                                       out_shape=ssd_shape,
                                       data_format=DATA_FORMAT)
            # Encode groundtruth labels and bboxes.
            # Combine the anchor positions with the ground truth labels and
            # boxes to get per-anchor classes, localisations and scores.
            gclasses, glocalisations, gscores = \
                ssd_net.bboxes_encode(glabels, gbboxes, ssd_anchors)
            # [1, 6, 6, 6]: one image tensor, then one tensor per feature layer
            # for each of gclasses, glocalisations and gscores.
            batch_shape = [1] + [len(ssd_anchors)] * 3
            # Training batches and queue.
            r = tf.train.batch(
                tf_utils.reshape_list([image, gclasses, glocalisations, gscores]),
                batch_size=FLAGS.batch_size,  # 4
                num_threads=FLAGS.num_preprocessing_threads,  # batching threads
                capacity=5 * FLAGS.batch_size)  # 20
            # r is a flat list; regroup it into the 4 lists
            # image, gclasses, glocalisations, gscores.
            b_image, b_gclasses, b_glocalisations, b_gscores = \
                tf_utils.reshape_list(r, batch_shape)
            # Intermediate queueing: unique batch computation pipeline for all
            # GPUs running the training.
            # batch_queue is a FIFOQueue object used to prefetch batches.
            batch_queue = slim.prefetch_queue.prefetch_queue(
                tf_utils.reshape_list([b_image, b_gclasses, b_glocalisations, b_gscores]),
                capacity=2 * deploy_config.num_clones)
        # =================================================================== #
        # Define the model running on every GPU.
        # =================================================================== #
        def clone_fn(batch_queue):
            """Allows data parallelism by creating multiple
            clones of network_fn.
            This function defines the model that runs on every clone."""
            # Dequeue batch (first in, first out).
            b_image, b_gclasses, b_glocalisations, b_gscores = \
                tf_utils.reshape_list(batch_queue.dequeue(), batch_shape)
            # Construct SSD network.
            arg_scope = ssd_net.arg_scope(weight_decay=FLAGS.weight_decay,
                                          data_format=DATA_FORMAT)
            with slim.arg_scope(arg_scope):
                # b_image is a batch of input images, [4, 300, 300, 3] here.
                predictions, localisations, logits, end_points = \
                    ssd_net.net(b_image, is_training=True)
            # Add loss function.
            ssd_net.losses(logits, localisations,  # logits: 21 raw class scores per anchor
                           b_gclasses, b_glocalisations, b_gscores,
                           match_threshold=FLAGS.match_threshold,
                           negative_ratio=FLAGS.negative_ratio,
                           alpha=FLAGS.loss_alpha,  # weight of the localization loss, 1 here
                           label_smoothing=FLAGS.label_smoothing)
            return end_points  # dict of intermediate activations, keyed by block name
        # Gather initial summaries.
        summaries = set(tf.get_collection(tf.GraphKeys.SUMMARIES))
        # =================================================================== #
        # Add summaries from first clone.
        # =================================================================== #
        clones = model_deploy.create_clones(deploy_config, clone_fn, [batch_queue])
        first_clone_scope = deploy_config.clone_scope(0)
        # Gather update_ops from the first clone. These contain, for example,
        # the updates for the batch_norm variables created by network_fn.
        update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS, first_clone_scope)
        # Add summaries for end_points.
        end_points = clones[0].outputs
        for end_point in end_points:
            x = end_points[end_point]
            summaries.add(tf.summary.histogram('activations/' + end_point, x))
            summaries.add(tf.summary.scalar('sparsity/' + end_point,
                                            tf.nn.zero_fraction(x)))
        # Add summaries for losses and extra losses.
        for loss in tf.get_collection(tf.GraphKeys.LOSSES, first_clone_scope):
            summaries.add(tf.summary.scalar(loss.op.name, loss))
        for loss in tf.get_collection('EXTRA_LOSSES', first_clone_scope):
            summaries.add(tf.summary.scalar(loss.op.name, loss))
        # Add summaries for variables.
        for variable in slim.get_model_variables():
            summaries.add(tf.summary.histogram(variable.op.name, variable))
        # =================================================================== #
        # Configure the moving averages.
        # =================================================================== #
        if FLAGS.moving_average_decay:
            moving_average_variables = slim.get_model_variables()
            variable_averages = tf.train.ExponentialMovingAverage(
                FLAGS.moving_average_decay, global_step)
        else:
            moving_average_variables, variable_averages = None, None
        # =================================================================== #
        # Configure the optimization procedure.
        # =================================================================== #
        with tf.device(deploy_config.optimizer_device()):
            learning_rate = tf_utils.configure_learning_rate(FLAGS,
                                                             dataset.num_samples,
                                                             global_step)
            optimizer = tf_utils.configure_optimizer(FLAGS, learning_rate)
            summaries.add(tf.summary.scalar('learning_rate', learning_rate))
        if FLAGS.moving_average_decay:
            # Update ops executed locally by trainer.
            update_ops.append(variable_averages.apply(moving_average_variables))
        # Variables to train.
        variables_to_train = tf_utils.get_variables_to_train(FLAGS)
        # and returns a train_tensor and summary_op
        total_loss, clones_gradients = model_deploy.optimize_clones(
            clones,
            optimizer,
            var_list=variables_to_train)
        # Add total_loss to summary.
        summaries.add(tf.summary.scalar('total_loss', total_loss))
        # Create gradient updates.
        grad_updates = optimizer.apply_gradients(clones_gradients,
                                                 global_step=global_step)
        update_ops.append(grad_updates)
        update_op = tf.group(*update_ops)
        train_tensor = control_flow_ops.with_dependencies([update_op], total_loss,
                                                          name='train_op')
        # Add the summaries from the first clone. These contain the summaries
        summaries |= set(tf.get_collection(tf.GraphKeys.SUMMARIES,
                                           first_clone_scope))
        # Merge all summaries together.
        summary_op = tf.summary.merge(list(summaries), name='summary_op')
        # =================================================================== #
        # Kicks off the training.
        # =================================================================== #
        gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=FLAGS.gpu_memory_fraction)
        config = tf.ConfigProto(log_device_placement=False,
                                gpu_options=gpu_options)
        saver = tf.train.Saver(max_to_keep=5,
                               keep_checkpoint_every_n_hours=1.0,
                               write_version=2,
                               pad_step_number=False)
        slim.learning.train(
            train_tensor,
            logdir=FLAGS.train_dir,
            master='',
            is_chief=True,
            init_fn=tf_utils.get_init_fn(FLAGS),
            summary_op=summary_op,
            number_of_steps=FLAGS.max_number_of_steps,
            log_every_n_steps=FLAGS.log_every_n_steps,
            save_summaries_secs=FLAGS.save_summaries_secs,
            saver=saver,
            save_interval_secs=FLAGS.save_interval_secs,
            session_config=config,
            sync_optimizer=None)


if __name__ == '__main__':
    tf.app.run()
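To get a feeling for the learning rate flags, the default `exponential` schedule can be written out in plain Python. This is a sketch under assumptions: it supposes that `tf_utils.configure_learning_rate` (not shown in this article) converts `num_epochs_per_decay` into a number of steps as `num_samples / batch_size * num_epochs_per_decay` and applies a staircase exponential decay. The sample count 5011 is the one quoted in the data provider comment above.

```python
def exponential_decay(step,
                      base_lr=0.01,              # --learning_rate
                      decay_factor=0.94,         # --learning_rate_decay_factor
                      num_samples=5011,          # images in the training split (assumed)
                      batch_size=4,              # --batch_size
                      num_epochs_per_decay=2.0): # --num_epochs_per_decay
    """Staircase exponential decay of the learning rate (assumed behaviour)."""
    decay_steps = int(num_samples / batch_size * num_epochs_per_decay)
    return base_lr * decay_factor ** (step // decay_steps)

for step in (0, 2504, 2505, 5010):
    print(step, exponential_decay(step))
```

With these defaults the rate drops by 6% roughly every 2505 steps, which is quite slow for a batch size of 4.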
5.1.2 Building the network
The listing below is the 512-input variant of the network (scope `ssd_512_vgg`, with an extra `block12`); the 300-input network follows the same pattern.
def ssd_net(inputs,
            num_classes=SSDNet.default_params.num_classes,
            feat_layers=SSDNet.default_params.feat_layers,
            anchor_sizes=SSDNet.default_params.anchor_sizes,
            anchor_ratios=SSDNet.default_params.anchor_ratios,
            normalizations=SSDNet.default_params.normalizations,
            is_training=True,
            dropout_keep_prob=0.5,
            prediction_fn=slim.softmax,
            reuse=None,
            scope='ssd_512_vgg'):
    """SSD net definition.
    """
    # End_points collect relevant activations for external use.
    end_points = {}
    with tf.variable_scope(scope, 'ssd_512_vgg', [inputs], reuse=reuse):
        # Original VGG-16 blocks.
        net = slim.repeat(inputs, 2, slim.conv2d, 64, [3, 3], scope='conv1')
        end_points['block1'] = net
        net = slim.max_pool2d(net, [2, 2], scope='pool1')
        # Block 2.
        net = slim.repeat(net, 2, slim.conv2d, 128, [3, 3], scope='conv2')
        end_points['block2'] = net
        net = slim.max_pool2d(net, [2, 2], scope='pool2')
        # Block 3.
        net = slim.repeat(net, 3, slim.conv2d, 256, [3, 3], scope='conv3')
        end_points['block3'] = net
        net = slim.max_pool2d(net, [2, 2], scope='pool3')
        # Block 4.
        net = slim.repeat(net, 3, slim.conv2d, 512, [3, 3], scope='conv4')
        end_points['block4'] = net
        net = slim.max_pool2d(net, [2, 2], scope='pool4')
        # Block 5.
        net = slim.repeat(net, 3, slim.conv2d, 512, [3, 3], scope='conv5')
        end_points['block5'] = net
        net = slim.max_pool2d(net, [3, 3], 1, scope='pool5')
        # Additional SSD blocks.
        # Block 6: let's dilate the hell out of it!
        net = slim.conv2d(net, 1024, [3, 3], rate=6, scope='conv6')
        end_points['block6'] = net
        # Block 7: 1x1 conv. Because the fuck.
        net = slim.conv2d(net, 1024, [1, 1], scope='conv7')
        end_points['block7'] = net
        # Block 8/9/10/11: 1x1 and 3x3 convolutions stride 2 (except lasts).
        end_point = 'block8'
        with tf.variable_scope(end_point):
            net = slim.conv2d(net, 256, [1, 1], scope='conv1x1')
            net = custom_layers.pad2d(net, pad=(1, 1))
            net = slim.conv2d(net, 512, [3, 3], stride=2, scope='conv3x3', padding='VALID')
        end_points[end_point] = net
        end_point = 'block9'
        with tf.variable_scope(end_point):
            net = slim.conv2d(net, 128, [1, 1], scope='conv1x1')
            net = custom_layers.pad2d(net, pad=(1, 1))
            net = slim.conv2d(net, 256, [3, 3], stride=2, scope='conv3x3', padding='VALID')
        end_points[end_point] = net
        end_point = 'block10'
        with tf.variable_scope(end_point):
            net = slim.conv2d(net, 128, [1, 1], scope='conv1x1')
            net = custom_layers.pad2d(net, pad=(1, 1))
            net = slim.conv2d(net, 256, [3, 3], stride=2, scope='conv3x3', padding='VALID')
        end_points[end_point] = net
        end_point = 'block11'
        with tf.variable_scope(end_point):
            net = slim.conv2d(net, 128, [1, 1], scope='conv1x1')
            net = custom_layers.pad2d(net, pad=(1, 1))
            net = slim.conv2d(net, 256, [3, 3], stride=2, scope='conv3x3', padding='VALID')
        end_points[end_point] = net
        end_point = 'block12'
        with tf.variable_scope(end_point):
            net = slim.conv2d(net, 128, [1, 1], scope='conv1x1')
            net = custom_layers.pad2d(net, pad=(1, 1))
            net = slim.conv2d(net, 256, [4, 4], scope='conv4x4', padding='VALID')
            # Fix padding to match Caffe version (pad=1).
            # pad_shape = [(i-j) for i, j in zip(layer_shape(net), [0, 1, 1, 0])]
            # net = tf.slice(net, [0, 0, 0, 0], pad_shape, name='caffe_pad')
        end_points[end_point] = net
        # Prediction and localisations layers.
        predictions = []
        logits = []
        localisations = []
        for i, layer in enumerate(feat_layers):
            with tf.variable_scope(layer + '_box'):
                p, l = ssd_vgg_300.ssd_multibox_layer(end_points[layer],
                                                      num_classes,
                                                      anchor_sizes[i],
                                                      anchor_ratios[i],
                                                      normalizations[i])
            predictions.append(prediction_fn(p))
            logits.append(p)
            localisations.append(l)
        return predictions, localisations, logits, end_points
def ssd_multibox_layer(inputs,  # feature map tensor of one detection layer
                       num_classes,  # 21
                       sizes,  # anchor sizes of this layer
                       ratios=[1],  # extra aspect ratios of this layer (2 to 4 values)
                       normalization=-1,  # > 0 enables L2 normalization
                       bn_normalization=False):
    """Construct a multibox layer, return a class and localization predictions.
    """
    net = inputs  # e.g. the last feature map of block 4 for the first detection layer
    if normalization > 0:
        net = custom_layers.l2_normalization(net, scaling=True)
    # Number of anchors.
    # E.g. sizes (21, 45) and ratios [2, 0.5] give 2 + 2 = 4 anchors
    # per pixel on the first feature map.
    num_anchors = len(sizes) + len(ratios)
    # Location.
    num_loc_pred = num_anchors * 4  # 4 location values per anchor
    loc_pred = slim.conv2d(net, num_loc_pred, [3, 3], activation_fn=None,
                           scope='conv_loc')
    loc_pred = custom_layers.channel_to_last(loc_pred)  # make sure channels are the last dimension
    # Split the last dimension into (num_anchors, 4).
    loc_pred = tf.reshape(loc_pred,
                          tensor_shape(loc_pred, 4)[:-1]+[num_anchors, 4])
    # Class prediction.
    num_cls_pred = num_anchors * num_classes  # 21 class scores per anchor
    cls_pred = slim.conv2d(net, num_cls_pred, [3, 3], activation_fn=None,
                           scope='conv_cls')
    cls_pred = custom_layers.channel_to_last(cls_pred)
    # Split the last dimension into (num_anchors, num_classes).
    cls_pred = tf.reshape(cls_pred,
                          tensor_shape(cls_pred, 4)[:-1]+[num_anchors, num_classes])
    return cls_pred, loc_pred
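The spatial size of every block in the `ssd_net` listing can be traced by hand. The sketch below does that arithmetic in plain Python. It assumes that the convolutions and poolings of the VGG part use 'SAME' padding (this is set in the network's `arg_scope`, which is not shown here), so a stride 2 pooling maps a size s to ceil(s / 2), and that `pad2d` followed by a 'VALID' convolution behaves like an explicit padding of 1. The helper names are made up for the example.

```python
import math

def pool_same(size, stride=2):
    """Output size of a 'SAME' pooling with the given stride."""
    return math.ceil(size / stride)

def conv_valid(size, kernel, stride=1, pad=0):
    """Output size of a 'VALID' convolution after explicit padding."""
    return (size + 2 * pad - kernel) // stride + 1

def feature_map_sizes(img_size=512):
    sizes = {}
    s = img_size
    for block in ('block1', 'block2', 'block3', 'block4'):
        sizes[block] = s          # 3x3 'SAME' convolutions keep the size
        s = pool_same(s)          # pool1..pool4 halve it
    sizes['block5'] = s
    sizes['block7'] = s           # pool5 has stride 1; conv6 and conv7 keep the size
    for block in ('block8', 'block9', 'block10', 'block11'):
        s = conv_valid(s, kernel=3, stride=2, pad=1)
        sizes[block] = s
    sizes['block12'] = conv_valid(s, kernel=4, stride=1, pad=1)
    return sizes

print(feature_map_sizes(512))
```

For a 512x512 input this gives 64, 32, 16, 8, 4, 2 and 1 for block4, block7 and block8 to block12.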
5.1.3 Anchor box computation
Main steps:
- Compute the center coordinates
- Compute the width and height from the aspect ratios and scales
import math
import numpy as np

def ssd_anchor_one_layer(img_shape,
                         feat_shape,
                         sizes,
                         ratios,
                         step,
                         offset=0.5,
                         dtype=np.float32):
    """Computer SSD default anchor boxes for one feature layer.
    Determine the relative position grid of the centers, and the relative
    width and height.
    Arguments:
      feat_shape: Feature shape, used for computing relative position grids;
      size: Absolute reference sizes;
      ratios: Ratios to use on these features;
      img_shape: Image shape, used for computing height, width relatively to the
        former;
      offset: Grid offset.
    Return:
      y, x, h, w: Relative x and y grids, and height and width.
      That is, the relative center position of each cell, and the anchor
      heights and widths as fractions of the image size.
    """
    # Compute the position grid: simple way.
    # y, x = np.mgrid[0:feat_shape[0], 0:feat_shape[1]]
    # y = (y.astype(dtype) + offset) / feat_shape[0]
    # x = (x.astype(dtype) + offset) / feat_shape[1]
    # Weird SSD-Caffe computation using steps values...
    y, x = np.mgrid[0:feat_shape[0], 0:feat_shape[1]]
    y = (y.astype(dtype) + offset) * step / img_shape[0]  # center y of each cell
    x = (x.astype(dtype) + offset) * step / img_shape[1]  # center x of each cell
    # Expand dims to support easy broadcasting.
    y = np.expand_dims(y, axis=-1)
    x = np.expand_dims(x, axis=-1)
    # Compute relative height and width.
    # Tries to follow the original implementation of SSD for the order.
    num_anchors = len(sizes) + len(ratios)
    h = np.zeros((num_anchors, ), dtype=dtype)
    w = np.zeros((num_anchors, ), dtype=dtype)
    # Add first anchor boxes with ratio=1.
    h[0] = sizes[0] / img_shape[0]
    w[0] = sizes[0] / img_shape[1]  # fraction of the image size
    di = 1
    if len(sizes) > 1:  # for ratio=1, add one more box with w = h = s_k'
        h[1] = math.sqrt(sizes[0] * sizes[1]) / img_shape[0]  # s_k' as a fraction of the image height
        w[1] = math.sqrt(sizes[0] * sizes[1]) / img_shape[1]  # s_k' as a fraction of the image width
        di += 1
    for i, r in enumerate(ratios):
        h[i+di] = sizes[0] / img_shape[0] / math.sqrt(r)  # relative height
        w[i+di] = sizes[0] / img_shape[1] * math.sqrt(r)  # relative width
    return y, x, h, w
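The height and width part of `ssd_anchor_one_layer` can be checked in isolation. The sketch below repeats only that part in a hypothetical helper `anchor_hw`. The inputs are assumptions: sizes (21, 45) and ratios [2, 0.5] are taken to be the settings of the first feature layer, because they reproduce the `href` and `wref` values quoted in the comments of the encoding code in section 5.1.4.

```python
import math
import numpy as np

def anchor_hw(sizes, ratios, img_shape=(300, 300)):
    """Relative heights and widths of the anchors of one layer."""
    h = [sizes[0] / img_shape[0]]
    w = [sizes[0] / img_shape[1]]
    if len(sizes) > 1:
        s_prime = math.sqrt(sizes[0] * sizes[1])   # extra box for ratio 1
        h.append(s_prime / img_shape[0])
        w.append(s_prime / img_shape[1])
    for r in ratios:
        h.append(sizes[0] / img_shape[0] / math.sqrt(r))
        w.append(sizes[0] / img_shape[1] * math.sqrt(r))
    return np.array(h, dtype=np.float32), np.array(w, dtype=np.float32)

h, w = anchor_hw(sizes=(21., 45.), ratios=[2, .5])
print(h)
print(w)
```

Note that these sizes differ from the scale 30 derived in section 3.3.1: the code takes its anchor sizes from the network's default parameters, so the numbers do not have to agree with the worked example.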
5.1.4 Encoding
This step combines the anchor box positions with the ground truth class labels and box positions of the dataset.
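The encoding relies on the overlap between each anchor and a ground truth box, measured by the Jaccard index (IoU). Before reading the TensorFlow version below, it may help to see the same computation in NumPy. This is a simplified sketch: anchors are passed as an [N, 4] array of corners instead of the (y, x, h, w) grids used in the code, and the function name is made up.

```python
import numpy as np

def iou_with_anchors(anchors, bbox):
    """IoU between every anchor and one ground truth box.

    anchors: [N, 4] array of (ymin, xmin, ymax, xmax)
    bbox:    (ymin, xmin, ymax, xmax) of the ground truth box
    """
    int_ymin = np.maximum(anchors[:, 0], bbox[0])
    int_xmin = np.maximum(anchors[:, 1], bbox[1])
    int_ymax = np.minimum(anchors[:, 2], bbox[2])
    int_xmax = np.minimum(anchors[:, 3], bbox[3])
    h = np.maximum(int_ymax - int_ymin, 0.)
    w = np.maximum(int_xmax - int_xmin, 0.)
    inter = h * w                                             # intersection area
    area_anchors = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_bbox = (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])
    return inter / (area_anchors + area_bbox - inter)         # intersection over union

anchors = np.array([[0.0, 0.0, 1.0, 1.0],
                    [0.0, 0.0, 0.5, 0.5],
                    [0.5, 0.0, 1.0, 1.0]])
bbox = (0.0, 0.0, 0.5, 1.0)
print(iou_with_anchors(anchors, bbox))
```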
def tf_ssd_bboxes_encode_layer(labels,
                               bboxes,
                               anchors_layer,
                               num_classes,
                               no_annotation_label,
                               ignore_threshold=0.5,
                               prior_scaling=[0.1, 0.1, 0.2, 0.2],
                               dtype=tf.float32):
    """Encode groundtruth labels and bounding boxes using SSD anchors from
    one layer.
    Arguments:
      labels: 1D Tensor(int64) containing groundtruth labels;
      bboxes: Nx4 Tensor(float) with bboxes relative coordinates;
      anchors_layer: Numpy array with layer anchors;
      matching_threshold: Threshold for positive match with groundtruth bboxes;
      prior_scaling: Scaling of encoded coordinates.
    Return:
      (target_labels, target_localizations, target_scores): Target Tensors.
    """
    # Anchors coordinates and volume.
    yref, xref, href, wref = anchors_layer  # cell centers, anchor heights and widths
    # Example values:
    # href: array([ 0.07 , 0.10246951, 0.04949747, 0.09899495], dtype=float32)
    # wref: array([ 0.07 , 0.10246951, 0.09899495, 0.04949747], dtype=float32)
    ymin = yref - href / 2.  # y of the top-left corner
    xmin = xref - wref / 2.  # x of the top-left corner
    ymax = yref + href / 2.  # y of the bottom-right corner
    xmax = xref + wref / 2.  # x of the bottom-right corner
    vol_anchors = (xmax - xmin) * (ymax - ymin)  # anchor areas, shape (38, 38, 4)
    # Everything below is computed with element-wise tensor operations.
    # Initialize tensors...
    shape = (yref.shape[0], yref.shape[1], href.size)  # (38, 38, 4)
    feat_labels = tf.zeros(shape, dtype=tf.int64)
    feat_scores = tf.zeros(shape, dtype=dtype)
    feat_ymin = tf.zeros(shape, dtype=dtype)
    feat_xmin = tf.zeros(shape, dtype=dtype)
    feat_ymax = tf.ones(shape, dtype=dtype)
    feat_xmax = tf.ones(shape, dtype=dtype)
    # For every anchor: matched top-left and bottom-right corners, label and score.

    def jaccard_with_anchors(bbox):  # bbox is one ground-truth box
        """Compute jaccard score (IoU) between a box and the anchors."""
        int_ymin = tf.maximum(ymin, bbox[0])
        int_xmin = tf.maximum(xmin, bbox[1])
        int_ymax = tf.minimum(ymax, bbox[2])
        int_xmax = tf.minimum(xmax, bbox[3])
        h = tf.maximum(int_ymax - int_ymin, 0.)
        w = tf.maximum(int_xmax - int_xmin, 0.)  # height and width of the intersection
        # Volumes.
        inter_vol = h * w  # intersection area
        # Union = anchor area - intersection + ground-truth box area.
        union_vol = vol_anchors - inter_vol \
            + (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])
        jaccard = tf.div(inter_vol, union_vol)  # IoU
        return jaccard

    def intersection_with_anchors(bbox):
        """Compute intersection score between a box and the anchors
        (intersection only, no union).
        """
        int_ymin = tf.maximum(ymin, bbox[0])
        int_xmin = tf.maximum(xmin, bbox[1])
        int_ymax = tf.minimum(ymax, bbox[2])
        int_xmax = tf.minimum(xmax, bbox[3])
        h = tf.maximum(int_ymax - int_ymin, 0.)
        w = tf.maximum(int_xmax - int_xmin, 0.)
        inter_vol = h * w
        # Intersection area divided by the anchor area, used as a score.
        scores = tf.div(inter_vol, vol_anchors)
        return scores

    def condition(i, feat_labels, feat_scores,
                  feat_ymin, feat_xmin, feat_ymax, feat_xmax):
        """Condition: check label index.
        Loop while i < number of labels; body returns i + 1 on every pass.
        """
        r = tf.less(i, tf.shape(labels))
        return r[0]

    def body(i, feat_labels, feat_scores,
             feat_ymin, feat_xmin, feat_ymax, feat_xmax):
        """Body: update feature labels, scores and bboxes.
        Follow the original SSD paper for that purpose:
          - assign values when jaccard > 0.5;
          - only update if beat the score of other bboxes.
        """
        # Jaccard score.
        label = labels[i]  # label of the i-th ground-truth box
        bbox = bboxes[i]   # the i-th ground-truth box
        # IoU between every anchor of this feature map and the i-th ground-truth box.
        jaccard = jaccard_with_anchors(bbox)
        # Mask: check threshold + scores + no annotations + num_classes.
        # True where this box's IoU beats the best IoU the anchor has seen so far.
        mask = tf.greater(jaccard, feat_scores)
        # mask = tf.logical_and(mask, tf.greater(jaccard, matching_threshold))
        mask = tf.logical_and(mask, feat_scores > -0.5)
        mask = tf.logical_and(mask, label < num_classes)  # label must be a valid class (e.g. < 21)
        imask = tf.cast(mask, tf.int64)  # mask as int
        fmask = tf.cast(mask, dtype)     # mask as float
        # Update values using mask.
        # Where the mask is set take this label, otherwise keep the previous
        # value (initially 0, i.e. background).
        feat_labels = imask * label + (1 - imask) * feat_labels
        # Where the mask is set take this IoU, otherwise keep the previous score.
        feat_scores = tf.where(mask, jaccard, feat_scores)
        # Where the mask is set, this ground-truth box becomes the anchor's target box.
        feat_ymin = fmask * bbox[0] + (1 - fmask) * feat_ymin
        feat_xmin = fmask * bbox[1] + (1 - fmask) * feat_xmin
        feat_ymax = fmask * bbox[2] + (1 - fmask) * feat_ymax
        feat_xmax = fmask * bbox[3] + (1 - fmask) * feat_xmax
        # Check no annotation label: ignore these anchors...
        # interscts = intersection_with_anchors(bbox)
        # mask = tf.logical_and(interscts > ignore_threshold,
        #                       label == no_annotation_label)
        # # Replace scores by -1.
        # feat_scores = tf.where(mask, -tf.cast(mask, dtype), feat_scores)
        return [i+1, feat_labels, feat_scores,
                feat_ymin, feat_xmin, feat_ymax, feat_xmax]

    # Main loop definition.
    i = 0
    [i, feat_labels, feat_scores,
     feat_ymin, feat_xmin,
     feat_ymax, feat_xmax] = tf.while_loop(condition, body,
                                           [i, feat_labels, feat_scores,
                                            feat_ymin, feat_xmin,
                                            feat_ymax, feat_xmax])
    # Transform to center / size.
    feat_cy = (feat_ymax + feat_ymin) / 2.
    feat_cx = (feat_xmax + feat_xmin) / 2.
    feat_h = feat_ymax - feat_ymin
    feat_w = feat_xmax - feat_xmin
    # Encode features.
    feat_cy = (feat_cy - yref) / href / prior_scaling[0]
    feat_cx = (feat_cx - xref) / wref / prior_scaling[1]
    feat_h = tf.log(feat_h / href) / prior_scaling[2]
    feat_w = tf.log(feat_w / wref) / prior_scaling[3]
    # Use SSD ordering: x / y / w / h instead of ours.
    feat_localizations = tf.stack([feat_cx, feat_cy, feat_w, feat_h], axis=-1)
    return feat_labels, feat_localizations, feat_scores
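The matching loop above can be hard to follow in graph form. The sketch below is a plain NumPy analogue that works on a flat list of anchors, not the TensorFlow implementation itself. The helper name `match_anchors`, the three anchors and the two ground-truth boxes are made up for illustration.

```python
import numpy as np

def jaccard_with_anchors(anchors, bbox):
    """IoU between anchors (N, 4) and one box (4,), all as (ymin, xmin, ymax, xmax)."""
    int_ymin = np.maximum(anchors[:, 0], bbox[0])
    int_xmin = np.maximum(anchors[:, 1], bbox[1])
    int_ymax = np.minimum(anchors[:, 2], bbox[2])
    int_xmax = np.minimum(anchors[:, 3], bbox[3])
    inter = np.maximum(int_ymax - int_ymin, 0.0) * np.maximum(int_xmax - int_xmin, 0.0)
    area_anchors = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_bbox = (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])
    return inter / (area_anchors + area_bbox - inter)

def match_anchors(anchors, labels, bboxes):
    """Hypothetical helper: give each anchor the ground-truth box with the highest IoU."""
    n = anchors.shape[0]
    feat_labels = np.zeros(n, dtype=np.int64)                  # 0 = background
    feat_scores = np.zeros(n)                                  # best IoU so far
    feat_boxes = np.tile(np.array([0.0, 0.0, 1.0, 1.0]), (n, 1))
    for label, bbox in zip(labels, bboxes):
        iou = jaccard_with_anchors(anchors, bbox)
        mask = iou > feat_scores                               # only update on improvement
        feat_labels[mask] = label
        feat_scores[mask] = iou[mask]
        feat_boxes[mask] = bbox
    return feat_labels, feat_scores, feat_boxes

anchors = np.array([[0.00, 0.00, 0.50, 0.50],
                    [0.25, 0.25, 0.75, 0.75],
                    [0.50, 0.50, 1.00, 1.00]])
labels = [7, 12]
bboxes = np.array([[0.0, 0.0, 0.5, 0.5],
                   [0.4, 0.4, 0.9, 0.9]])
feat_labels, feat_scores, feat_boxes = match_anchors(anchors, labels, bboxes)
print(feat_labels)  # label assigned to each anchor
print(feat_scores)  # IoU with the assigned ground-truth box
```

As in the code above, no IoU threshold is applied at this point: every anchor keeps its best-overlapping box, and the decision about which anchors count as positives is made later from the stored scores.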
Bounding-box encoding:
Given the position of the prior (anchor) box and of the ground-truth box, i.e. $P=(P_x, P_y, P_w, P_h)$ and $G=(G_x, G_y, G_w, G_h)$, the offsets $t_*^i$ are obtained as:
$$\begin{cases} t_x = (G_x-P_x)/P_w \\ t_y = (G_y-P_y)/P_h \\ t_w = \log(G_w/P_w) \\ t_h = \log(G_h/P_h) \end{cases}$$
Note that the code above additionally divides each of the four terms by the corresponding entry of prior_scaling.
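To make the encoding formulas concrete, here is a small NumPy sketch for a single prior box and a single ground-truth box. The function name `encode_box` and the box values are illustrative assumptions; the `prior_scaling` default mirrors the one in the TensorFlow function, and passing all ones gives the unscaled formulas.

```python
import numpy as np

def encode_box(gt, prior, prior_scaling=(0.1, 0.1, 0.2, 0.2)):
    """Encode one box. gt and prior are (cx, cy, w, h); returns (t_x, t_y, t_w, t_h)."""
    g_cx, g_cy, g_w, g_h = gt
    p_cx, p_cy, p_w, p_h = prior
    return np.array([
        (g_cx - p_cx) / p_w / prior_scaling[0],
        (g_cy - p_cy) / p_h / prior_scaling[1],
        np.log(g_w / p_w) / prior_scaling[2],
        np.log(g_h / p_h) / prior_scaling[3],
    ])

prior = (0.5, 0.5, 0.2, 0.2)    # made-up anchor: center (0.5, 0.5), size 0.2 x 0.2
gt = (0.52, 0.49, 0.3, 0.1)     # made-up ground-truth box

t_scaled = encode_box(gt, prior)                                # as in the code
t_plain = encode_box(gt, prior, prior_scaling=(1., 1., 1., 1.)) # as in the formulas
print(t_scaled)
print(t_plain)
```

The center offsets are measured in units of the prior's width and height, and the size offsets are log-ratios, so the targets do not depend on the absolute scale of the anchor.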
5.1.5 Building the loss function
def ssd_losses(logits, localisations,
               gclasses, glocalisations, gscores,
               match_threshold=0.5,  # IoU threshold for a positive match
               negative_ratio=3.,    # number of negatives kept per positive
               alpha=1.,
               label_smoothing=0.,
               scope=None):
    """Loss functions for training the SSD 300 VGG network.
    This function defines the different loss components of the SSD, and
    adds them to the TF loss collection.
    Arguments:
      logits: (list of) predictions logits Tensors (class scores before softmax);
      localisations: (list of) localisations Tensors (predicted box offsets);
      gclasses: (list of) groundtruth labels Tensors;
      glocalisations: (list of) groundtruth localisations Tensors;
      gscores: (list of) groundtruth score Tensors;
    """
    with tf.name_scope(scope, 'ssd_losses'):
        l_cross_pos = []  # classification loss on positive samples, per layer
        l_cross_neg = []  # classification loss on negative samples, per layer
        l_loc = []        # localization (offset) loss, per layer
        for i in range(len(logits)):
            dtype = logits[i].dtype
            with tf.name_scope('block_%i' % i):
                # Determine weights Tensor.
                pmask = gscores[i] > match_threshold  # anchors with IoU > 0.5 are positives
                fpmask = tf.cast(pmask, dtype)
                n_positives = tf.reduce_sum(fpmask)   # number of positives
                # Select some random negative entries.
                # n_entries = np.prod(gclasses[i].get_shape().as_list())
                # r_positive = n_positives / n_entries
                # r_negative = negative_ratio * n_positives / (n_entries - n_positives)
                # Negative mask.
                # Positive mask as int32; it is 0 (background) on every negative
                # and is used as the label for the negative loss below.
                no_classes = tf.cast(pmask, tf.int32)
                predictions = slim.softmax(logits[i])  # per-class probabilities
                # Negatives: not positive and score > -0.5.
                nmask = tf.logical_and(tf.logical_not(pmask),
                                       gscores[i] > -0.5)
                fnmask = tf.cast(nmask, dtype)
                # Background probability (class 0) on negatives; elsewhere
                # 1. - fnmask, which equals 1 there.
                nvalues = tf.where(nmask,
                                   predictions[:, :, :, :, 0],
                                   1. - fnmask)
                nvalues_flat = tf.reshape(nvalues, [-1])  # flatten
                # Number of negative entries to select.
                n_neg = tf.cast(negative_ratio * n_positives, tf.int32)  # 3 x number of positives
                n_neg = tf.maximum(n_neg, tf.size(nvalues_flat) // 8)    # at least 1/8 of all anchors
                n_neg = tf.maximum(n_neg, tf.shape(nvalues)[0] * 4)      # at least 4 per batch element
                max_neg_entries = 1 + tf.cast(tf.reduce_sum(fnmask), tf.int32)
                n_neg = tf.minimum(n_neg, max_neg_entries)
                # Pick the n_neg entries with the lowest background probability,
                # i.e. the hardest negatives.
                val, idxes = tf.nn.top_k(-nvalues_flat, k=n_neg)
                minval = val[-1]  # minus the largest background probability among those selected
                # Final negative mask.
                nmask = tf.logical_and(nmask, -nvalues > minval)
                fnmask = tf.cast(nmask, dtype)
                # Add cross-entropy loss.
                with tf.name_scope('cross_entropy_pos'):  # cross-entropy on positives
                    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits[i],
                                                                          labels=gclasses[i])
                    # Weight by fpmask (1 on positives, 0 elsewhere) so that only
                    # positives contribute, then reduce to a scalar.
                    loss = tf.losses.compute_weighted_loss(loss, fpmask)
                    l_cross_pos.append(loss)
                with tf.name_scope('cross_entropy_neg'):  # cross-entropy on selected negatives
                    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits[i],
                                                                          labels=no_classes)
                    loss = tf.losses.compute_weighted_loss(loss, fnmask)  # same idea, with the negative mask
                    l_cross_neg.append(loss)
                # Add localization loss: smooth L1, L2, ...
                with tf.name_scope('localization'):
                    # Weights Tensor: positive mask + random negative.
                    weights = tf.expand_dims(alpha * fpmask, axis=-1)
                    # Smooth L1 loss on the offsets.
                    loss = custom_layers.abs_smooth(localisations[i] - glocalisations[i])
                    loss = tf.losses.compute_weighted_loss(loss, weights)
                    l_loc.append(loss)
        # Additional total losses...
        with tf.name_scope('total'):
            total_cross_pos = tf.add_n(l_cross_pos, 'cross_entropy_pos')
            total_cross_neg = tf.add_n(l_cross_neg, 'cross_entropy_neg')
            # Total classification loss over positives and negatives.
            total_cross = tf.add(total_cross_pos, total_cross_neg, 'cross_entropy')
            # Total localization loss (positives only).
            total_loc = tf.add_n(l_loc, 'localization')
            # Add to EXTRA LOSSES TF.collection
            tf.add_to_collection('EXTRA_LOSSES', total_cross_pos)
            tf.add_to_collection('EXTRA_LOSSES', total_cross_neg)
            tf.add_to_collection('EXTRA_LOSSES', total_cross)
            tf.add_to_collection('EXTRA_LOSSES', total_loc)
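The negative-selection step in the loss above (hard negative mining) can be illustrated on a handful of anchors with NumPy. This is a simplified sketch rather than a port: the helper name `select_hard_negatives` and the sample probabilities are made up, and the extra lower bounds on `n_neg` used in the TensorFlow code (one eighth of all anchors, four per batch element) are deliberately left out.

```python
import numpy as np

def select_hard_negatives(bg_prob, pos_mask, negative_ratio=3.0):
    """Return a boolean mask of the negatives kept for the classification loss.

    bg_prob:  predicted background probability of every anchor, shape (N,)
    pos_mask: True where the anchor is a positive, shape (N,)
    """
    neg_mask = ~pos_mask
    n_pos = int(pos_mask.sum())
    n_neg = min(int(negative_ratio * n_pos), int(neg_mask.sum()))
    keep = np.zeros_like(pos_mask)
    if n_neg == 0:
        return keep
    # Give positives a background probability of 1 so they are never picked.
    values = np.where(neg_mask, bg_prob, 1.0)
    # Hardest negatives: the ones the network is least sure are background.
    hardest = np.argsort(values, kind="stable")[:n_neg]
    keep[hardest] = True
    return keep

bg_prob = np.array([0.90, 0.20, 0.60, 0.10, 0.95, 0.30, 0.05, 0.80])
pos_mask = np.array([False, False, False, False, False, False, True, False])
keep = select_hard_negatives(bg_prob, pos_mask)
print(np.flatnonzero(keep))  # indices of the selected negatives
```

With one positive and a ratio of 3, three negatives are kept, and they are the ones with the lowest predicted background probability. Easy negatives, which the network already classifies confidently as background, do not contribute to the loss.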
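5.2 Testing stage
Main steps:
- Build the network: produce the predicted class scores and position offsets (logits, i.e. before softmax, and localisations)
- Compute the anchor boxes: from the size of each feature map and the scales and aspect ratios of the prior boxes
- Encoding: combine the anchor box positions with the ground-truth class labels and box positions from the dataset
- Build the loss function and compute its value
- Decoding: decode the network predictions against the anchor boxes to obtain the positions of the predicted boxes
- Filtering: further filter the boxes for which classes and positions have been predicted
  - Filter by select_threshold
  - Keep the top k boxes with the highest confidence
  - Filter again with non-maximum suppression (NMS)
The first four steps are the same as in the training stage, so the corresponding code is omitted below.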
5.2.1 Decoding
Bounding-box decoding:
Given the predicted offsets $d_*(P)$ and the position of the prior (anchor) box, i.e. $P=(P_x, P_y, P_w, P_h)$ and $d_*(P)$:
$$\begin{cases} d_x(P) = (\hat G_x-P_x)/P_w \\ d_y(P) = (\hat G_y-P_y)/P_h \\ d_w(P) = \log(\hat G_w/P_w) \\ d_h(P) = \log(\hat G_h/P_h) \end{cases}$$
the position of the predicted box is obtained as:
$$\begin{cases} \hat G_x = P_w d_x(P) + P_x \\ \hat G_y = P_h d_y(P) + P_y \\ \hat G_w = P_w \exp(d_w(P)) \\ \hat G_h = P_h \exp(d_h(P)) \end{cases}$$
def tf_ssd_bboxes_decode_layer(feat_localizations,
                               anchors_layer,
                               prior_scaling=[0.1, 0.1, 0.2, 0.2]):
    """Compute the relative bounding boxes from the layer features and
    reference anchor bounding boxes.
    Arguments:
      feat_localizations: Tensor containing localization features.
      anchors: List of numpy array containing anchor boxes.
    Return:
      Tensor Nx4: ymin, xmin, ymax, xmax
    """
    yref, xref, href, wref = anchors_layer
    # Compute center, height and width
    cx = feat_localizations[:, :, :, :, 0] * wref * prior_scaling[0] + xref
    cy = feat_localizations[:, :, :, :, 1] * href * prior_scaling[1] + yref
    w = wref * tf.exp(feat_localizations[:, :, :, :, 2] * prior_scaling[2])
    h = href * tf.exp(feat_localizations[:, :, :, :, 3] * prior_scaling[3])
    # Boxes coordinates.
    ymin = cy - h / 2.
    xmin = cx - w / 2.
    ymax = cy + h / 2.
    xmax = cx + w / 2.
    bboxes = tf.stack([ymin, xmin, ymax, xmax], axis=-1)
    return bboxes
However, the Caffe implementation of SSD uses one more trick: a variance hyperparameter that rescales the predicted offsets. The boolean parameter variance_encoded_in_target selects between two modes. When it is True, the variance is already contained in the predicted values, which is the case of the formulas above. When it is False (the more common choice, possibly because it makes training easier), the variance hyperparameter has to be set manually and is used to scale the four values of $d_*(P)$; this is the role played by prior_scaling in the code above. The bounding box is then decoded as follows.
Position of the predicted box:
$$\begin{cases} \hat G_x = P_w\, d_x(P)\, variance[0] + P_x \\ \hat G_y = P_h\, d_y(P)\, variance[1] + P_y \\ \hat G_w = P_w \exp(d_w(P)\, variance[2]) \\ \hat G_h = P_h \exp(d_h(P)\, variance[3]) \end{cases}$$
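A short NumPy sketch of decoding with variance, for a single box, is given below. The function names, the `VARIANCE` constant (set to the same values as the `prior_scaling` default above) and the box values are assumptions for illustration. The sketch also checks that decoding inverts encoding when both use the same variance.

```python
import numpy as np

VARIANCE = np.array([0.1, 0.1, 0.2, 0.2])  # assumed, same as the prior_scaling default

def encode(gt, prior, variance=VARIANCE):
    """(cx, cy, w, h) ground truth and prior -> offsets scaled by 1 / variance."""
    g_cx, g_cy, g_w, g_h = gt
    p_cx, p_cy, p_w, p_h = prior
    return np.array([(g_cx - p_cx) / p_w, (g_cy - p_cy) / p_h,
                     np.log(g_w / p_w), np.log(g_h / p_h)]) / variance

def decode(d, prior, variance=VARIANCE):
    """Offsets d and prior (cx, cy, w, h) -> predicted box (cx, cy, w, h)."""
    p_cx, p_cy, p_w, p_h = prior
    cx = p_w * d[0] * variance[0] + p_cx
    cy = p_h * d[1] * variance[1] + p_cy
    w = p_w * np.exp(d[2] * variance[2])
    h = p_h * np.exp(d[3] * variance[3])
    return np.array([cx, cy, w, h])

def to_corners(box):
    """(cx, cy, w, h) -> (ymin, xmin, ymax, xmax), the order used by the code above."""
    cx, cy, w, h = box
    return np.array([cy - h / 2., cx - w / 2., cy + h / 2., cx + w / 2.])

prior = np.array([0.5, 0.5, 0.2, 0.2])
gt = np.array([0.52, 0.49, 0.3, 0.1])

recovered = decode(encode(gt, prior), prior)
print(recovered)                                  # should match gt
print(to_corners(decode(np.zeros(4), prior)))     # zero offsets give back the prior
```

Zero offsets always decode to the prior box itself, which is why the network only has to learn small corrections around each anchor.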
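5.2.2 Filtering
- Filtering: further filter the boxes for which classes and positions have been predicted
  - Filter by select_threshold
  - Keep the top k boxes with the highest confidence
  - Filter again with non-maximum suppression (NMS)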
def detected_bboxes(self, predictions, localisations,
                    select_threshold=None, nms_threshold=0.5,
                    clipping_bbox=None, top_k=400, keep_top_k=200):
    """Get the detected bounding boxes from the SSD network output.
    """
    # Select top_k bboxes from predictions, and clip
    # Step 1: filter by select_threshold.
    rscores, rbboxes = \
        ssd_common.tf_ssd_bboxes_select(predictions, localisations,
                                        select_threshold=select_threshold,
                                        num_classes=self.params.num_classes)
    # Step 2: keep the top_k boxes with the highest confidence.
    rscores, rbboxes = \
        tfe.bboxes_sort(rscores, rbboxes, top_k=top_k)
    # Step 3: filter again with non-maximum suppression.
    # Apply NMS algorithm.
    rscores, rbboxes = \
        tfe.bboxes_nms_batch(rscores, rbboxes,
                             nms_threshold=nms_threshold,
                             keep_top_k=keep_top_k)
    # if clipping_bbox is not None:
    #     rbboxes = tfe.bboxes_clip(clipping_bbox, rbboxes)
    return rscores, rbboxes
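The three filtering steps can be sketched for a single class in NumPy as follows. This is a simplified illustration under stated assumptions, not the code behind `tf_ssd_bboxes_select`, `bboxes_sort` or `bboxes_nms_batch`: the function names and sample boxes are made up, and the real pipeline works per class and per batch element.

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU between one box (4,) and boxes (M, 4), all as (ymin, xmin, ymax, xmax)."""
    ymin = np.maximum(box[0], boxes[:, 0])
    xmin = np.maximum(box[1], boxes[:, 1])
    ymax = np.minimum(box[2], boxes[:, 2])
    xmax = np.minimum(box[3], boxes[:, 3])
    inter = np.maximum(ymax - ymin, 0.0) * np.maximum(xmax - xmin, 0.0)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def select_sort_nms(scores, bboxes, select_threshold=0.5,
                    top_k=400, nms_threshold=0.5, keep_top_k=200):
    # Step 1: drop boxes whose confidence does not exceed select_threshold.
    keep = scores > select_threshold
    scores, bboxes = scores[keep], bboxes[keep]
    # Step 2: sort by confidence (descending) and keep at most top_k boxes.
    order = np.argsort(-scores, kind="stable")[:top_k]
    scores, bboxes = scores[order], bboxes[order]
    # Step 3: greedy NMS, visiting boxes from highest to lowest score.
    selected = []
    suppressed = np.zeros(len(scores), dtype=bool)
    for i in range(len(scores)):
        if suppressed[i]:
            continue
        selected.append(i)
        if len(selected) == keep_top_k:
            break
        overlaps = iou_one_to_many(bboxes[i], bboxes[i + 1:])
        suppressed[i + 1:] |= overlaps > nms_threshold
    selected = np.array(selected, dtype=np.int64)
    return scores[selected], bboxes[selected]

scores = np.array([0.8, 0.9, 0.3, 0.7])
bboxes = np.array([[0.12, 0.12, 0.52, 0.52],   # heavy overlap with the next box
                   [0.10, 0.10, 0.50, 0.50],
                   [0.00, 0.00, 0.20, 0.20],   # below the score threshold
                   [0.60, 0.60, 0.90, 0.90]])
out_scores, out_bboxes = select_sort_nms(scores, bboxes)
print(out_scores)
print(out_bboxes)
```

In this example the low-confidence box is removed by the threshold, and of the two heavily overlapping boxes only the one with the higher score survives NMS.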