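Notes on Object Detection Network Architectures (Part 2)
The previous post covered two-stage detection algorithms, of which Faster R-CNN is the classic example. Its RPN and ROI pooling make detection much faster than R-CNN while keeping accuracy very high, which in turn led to more and more applications and improvements of deep networks for object detection.
This time we look at the remaining one-stage detection algorithms, mainly the YOLO series and SSD. The YOLO series has kept improving: the first three generations were developed by the original author, and later versions were produced by other researchers, each refining the network structure and other components. SSD still serves as the base structure of some more complex networks. Both are well worth studying.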
YOLO
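YOLO was proposed in order to detect objects faster. Faster R-CNN had already reached a high level of both speed and accuracy. YOLO is more aggressive: it uses a single CNN and treats detection as a bounding-box regression problem over the whole input image, without a region-proposal stage such as the RPN. This greatly increases speed at the cost of some accuracy.
The structure of YOLO is shown in the figure. The input image is divided into a 7 × 7 grid. If the center of an object falls inside a grid cell, that cell is responsible for the object: the features of the region around the cell are used for class prediction and bounding-box regression.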
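The cell assignment described above can be sketched in a few lines. This is only an illustration: the helper name `responsible_cell` is hypothetical, and the 448-pixel image size and 7 × 7 grid are taken from the text.

```python
def responsible_cell(cx, cy, image_size=448, cell_size=7):
    """Return (row, col) of the grid cell that contains the point (cx, cy).

    cx, cy are pixel coordinates of an object's center.
    The helper name and signature are illustrative, not from the original code.
    """
    cell_px = image_size / cell_size  # side length of one cell in pixels
    # Clamp so that a center lying exactly on the right/bottom edge
    # still maps to the last cell.
    col = min(int(cx // cell_px), cell_size - 1)
    row = min(int(cy // cell_px), cell_size - 1)
    return row, col


print(responsible_cell(200.0, 100.0))
```

With a 448-pixel image each cell is 64 pixels wide, so a center at (200, 100) lands in column 3 and row 1.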
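The network structure is as follows. The code below uses 24 convolutional layers, 4 max-pooling layers and 3 fully connected layers (the YOLO paper describes 24 convolutional layers followed by 2 fully connected layers). The input size is 448 × 448 and the feature map after the convolutions is 7 × 7; besides the pooling layers, two of the convolutional layers have stride 2 and act as downsampling. After the fully connected layers the output has size 1470 = 7 × 7 × (2 × 5 + 20). Here 7 × 7 is the 49 grid cells obtained by splitting the width and the height into seven parts each, 2 is the number of boxes predicted per cell, 5 is the four coordinate values plus the confidence of each box, and 20 is the number of object classes (the paper uses 20 classes). In other words, if the center of an object falls into a cell, that cell is responsible for predicting the object: each cell produces 2 boxes, each with four coordinate values and a confidence, plus one set of class scores. The five values per box are x, y, w, h and the confidence. x and y are the offsets of the box center from the boundary of its grid cell, and w and h are the box width and height relative to the whole image, so x, y, w and h are all confined to the interval [0, 1].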
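As a quick check of the output size, here is a minimal NumPy sketch that splits a flat 1470-vector into class scores, confidences and box coordinates. The ordering (classes first, then confidences, then boxes) follows the `boundary1` and `boundary2` slicing in the code listing later in this post; the function name `split_output` is hypothetical.

```python
import numpy as np

S, B, C = 7, 2, 20                    # grid size, boxes per cell, classes

output_size = S * S * (C + B * 5)     # 7 * 7 * 30
boundary1 = S * S * C                 # end of the class-score block
boundary2 = boundary1 + S * S * B     # end of the confidence block


def split_output(flat):
    """Split one flat prediction vector into its three parts."""
    classes = flat[:boundary1].reshape(S, S, C)
    confidences = flat[boundary1:boundary2].reshape(S, S, B)
    boxes = flat[boundary2:].reshape(S, S, B, 4)   # (x, y, w, h) per box
    return classes, confidences, boxes


classes, confidences, boxes = split_output(
    np.zeros(output_size, dtype=np.float32))
print(output_size, classes.shape, confidences.shape, boxes.shape)
```

The three blocks have 980, 98 and 392 entries, which add up to 1470.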
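I think the key points here are understanding the output and the bounding-box regression. The coordinate prediction can be explained as follows:
I have borrowed a figure from another blogger here, with apologies. In it, x and y are the predicted coordinate offsets, and w and h are the width and height ratios. Inverting the formulas in the figure gives the coordinate operations that appear in the code below.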
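To make the inverse operation concrete, here is a small sketch that turns one predicted box back into pixel corner coordinates. It assumes, as the `tf.square` calls in the loss code of this post suggest, that the network outputs the square roots of w and h; the function name `decode_box` and its arguments are hypothetical.

```python
def decode_box(x, y, sqrt_w, sqrt_h, row, col, image_size=448, cell_size=7):
    """Convert a cell-relative prediction to (x1, y1, x2, y2) in pixels.

    x, y           : center offsets inside cell (row, col), each in [0, 1]
    sqrt_w, sqrt_h : square roots of width/height relative to the image
                     (an assumption based on the tf.square calls in the loss)
    """
    # Add the cell index to the offset, then rescale from grid units to pixels.
    cx = (x + col) * image_size / cell_size
    cy = (y + row) * image_size / cell_size
    # Undo the square root, then rescale from image fraction to pixels.
    w = sqrt_w ** 2 * image_size
    h = sqrt_h ** 2 * image_size
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2


print(decode_box(0.5, 0.5, 0.5, 0.5, row=1, col=3))
```

A prediction centered in cell (row 1, column 3) with square-root width and height of 0.5 decodes to a 112 × 112 pixel box centered at (224, 96).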
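The YOLO loss function has three parts: coordinate loss, classification loss and confidence loss, as shown in the figure below:
All of them use the sum-squared error. The confidence loss is computed separately depending on whether the predicted box contains an object. The key points are: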
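For reference, the loss as given in the YOLO paper can be written as below. Here S = 7 is the grid size, B = 2 is the number of boxes per cell, the indicator with superscript obj selects box j of cell i when it is responsible for an object (noobj selects the remaining boxes), C is the confidence, p(c) is the class probability, and hatted and unhatted symbols are the two sides (prediction and ground truth) of each squared difference. The paper sets the coordinate weight to 5 and the no-object weight to 0.5.

```latex
\begin{aligned}
L ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
        \mathbb{1}_{ij}^{\text{obj}}
        \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
     &+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
        \mathbb{1}_{ij}^{\text{obj}}
        \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2
             + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
     &+ \sum_{i=0}^{S^2} \sum_{j=0}^{B}
        \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
     &+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
        \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
     &+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}}
        \sum_{c \,\in\, \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2,
\qquad \lambda_{\text{coord}} = 5,\quad \lambda_{\text{noobj}} = 0.5 .
\end{aligned}
```

The first two lines are the coordinate loss, the next two are the confidence loss for boxes with and without an object, and the last line is the classification loss.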
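1. Only grid cells that actually contain an object in the ground truth contribute to the x, y, w, h loss and its backpropagation.
2. Although every part of the loss is a squared error (the w, h part is computed on square roots), the coordinate loss and the confidence loss of cells without an object are weighted differently, with weights 5 and 0.5 respectively.
The coordinate weight is set high because the coordinate values are generally small and would be hard to learn without a larger weight. Most grid cells contain no object, so if the no-object confidence loss had the same weight as the object one, the many empty cells would dominate the loss and push all confidences toward zero, overpowering the gradient from the few cells that do contain objects. To avoid this, the confidence loss of cells without an object is given a smaller weight.
3. The loss uses the square roots of w and h so that deviations in large boxes and in small boxes are not penalized equally: for the same absolute deviation, the penalty on a small box should be larger.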
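The effect of the square root in point 3 can be seen with a small numeric sketch. The widths below are made-up illustrative values (fractions of the image width), and the helper `wh_error` is hypothetical.

```python
import math


def wh_error(true_w, pred_w, use_sqrt):
    """Squared error on a width, with or without the square-root transform."""
    if use_sqrt:
        return (math.sqrt(true_w) - math.sqrt(pred_w)) ** 2
    return (true_w - pred_w) ** 2


# The same absolute deviation of 0.05 applied to a small and a large box.
small_plain = wh_error(0.10, 0.15, use_sqrt=False)
large_plain = wh_error(0.80, 0.85, use_sqrt=False)
small_sqrt = wh_error(0.10, 0.15, use_sqrt=True)
large_sqrt = wh_error(0.80, 0.85, use_sqrt=True)

print("plain:", small_plain, large_plain)
print("sqrt: ", small_sqrt, large_sqrt)
```

Without the square root both boxes receive essentially the same penalty; with it, the small box is penalized several times more than the large one for the same deviation.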
The code is shown below; it defines the network structure and the loss function:
import numpy as np
import tensorflow as tf
import yolo.config as cfg

slim = tf.contrib.slim


class YOLONet(object):

    def __init__(self, is_training=True):
        self.classes = cfg.CLASSES
        self.num_class = len(self.classes)
        self.image_size = cfg.IMAGE_SIZE
        self.cell_size = cfg.CELL_SIZE
        self.boxes_per_cell = cfg.BOXES_PER_CELL
        self.output_size = (self.cell_size * self.cell_size) *\
            (self.num_class + self.boxes_per_cell * 5)
        self.scale = 1.0 * self.image_size / self.cell_size
        self.boundary1 = self.cell_size * self.cell_size * self.num_class
        self.boundary2 = self.boundary1 +\
            self.cell_size * self.cell_size * self.boxes_per_cell

        self.object_scale = cfg.OBJECT_SCALE
        self.noobject_scale = cfg.NOOBJECT_SCALE
        self.class_scale = cfg.CLASS_SCALE
        self.coord_scale = cfg.COORD_SCALE

        self.learning_rate = cfg.LEARNING_RATE
        self.batch_size = cfg.BATCH_SIZE
        self.alpha = cfg.ALPHA

        self.offset = np.transpose(np.reshape(np.array(
            [np.arange(self.cell_size)] * self.cell_size * self.boxes_per_cell),
            (self.boxes_per_cell, self.cell_size, self.cell_size)), (1, 2, 0))

        self.images = tf.placeholder(
            tf.float32, [None, self.image_size, self.image_size, 3],
            name='images')
        self.logits = self.build_network(
            self.images, num_outputs=self.output_size, alpha=self.alpha,
            is_training=is_training)

        if is_training:
            self.labels = tf.placeholder(
                tf.float32,
                [None, self.cell_size, self.cell_size, 5 + self.num_class])
            self.loss_layer(self.logits, self.labels)
            self.total_loss = tf.losses.get_total_loss()
            tf.summary.scalar('total_loss', self.total_loss)

    def build_network(self,
                      images,
                      num_outputs,
                      alpha,
                      keep_prob=0.5,
                      is_training=True,
                      scope='yolo'):
        with tf.variable_scope(scope):
            with slim.arg_scope(
                [slim.conv2d, slim.fully_connected],
                activation_fn=leaky_relu(alpha),
                weights_regularizer=slim.l2_regularizer(0.0005),
                weights_initializer=tf.truncated_normal_initializer(0.0, 0.01)
            ):
                net = tf.pad(
                    images, np.array([[0, 0], [3, 3], [3, 3], [0, 0]]),
                    name='pad_1')
                net = slim.conv2d(
                    net, 64, 7, 2, padding='VALID', scope='conv_2')
                net = slim.max_pool2d(net, 2, padding='SAME', scope='pool_3')
                net = slim.conv2d(net, 192, 3, scope='conv_4')
                net = slim.max_pool2d(net, 2, padding='SAME', scope='pool_5')
                net = slim.conv2d(net, 128, 1, scope='conv_6')
                net = slim.conv2d(net, 256, 3, scope='conv_7')
                net = slim.conv2d(net, 256, 1, scope='conv_8')
                net = slim.conv2d(net, 512, 3, scope='conv_9')
                net = slim.max_pool2d(net, 2, padding='SAME', scope='pool_10')
                net = slim.conv2d(net, 256, 1, scope='conv_11')
                net = slim.conv2d(net, 512, 3, scope='conv_12')
                net = slim.conv2d(net, 256, 1, scope='conv_13')
                net = slim.conv2d(net, 512, 3, scope='conv_14')
                net = slim.conv2d(net, 256, 1, scope='conv_15')
                net = slim.conv2d(net, 512, 3, scope='conv_16')
                net = slim.conv2d(net, 256, 1, scope='conv_17')
                net = slim.conv2d(net, 512, 3, scope='conv_18')
                net = slim.conv2d(net, 512, 1, scope='conv_19')
                net = slim.conv2d(net, 1024, 3, scope='conv_20')
                net = slim.max_pool2d(net, 2, padding='SAME', scope='pool_21')
                net = slim.conv2d(net, 512, 1, scope='conv_22')
                net = slim.conv2d(net, 1024, 3, scope='conv_23')
                net = slim.conv2d(net, 512, 1, scope='conv_24')
                net = slim.conv2d(net, 1024, 3, scope='conv_25')
                net = slim.conv2d(net, 1024, 3, scope='conv_26')
                net = tf.pad(
                    net, np.array([[0, 0], [1, 1], [1, 1], [0, 0]]),
                    name='pad_27')
                net = slim.conv2d(
                    net, 1024, 3, 2, padding='VALID', scope='conv_28')
                net = slim.conv2d(net, 1024, 3, scope='conv_29')
                net = slim.conv2d(net, 1024, 3, scope='conv_30')
                net = tf.transpose(net, [0, 3, 1, 2], name='trans_31')
                net = slim.flatten(net, scope='flat_32')
                net = slim.fully_connected(net, 512, scope='fc_33')
                net = slim.fully_connected(net, 4096, scope='fc_34')
                net = slim.dropout(
                    net, keep_prob=keep_prob, is_training=is_training,
                    scope='dropout_35')
                net = slim.fully_connected(
                    net, num_outputs, activation_fn=None, scope='fc_36')
        return net

    def calc_iou(self, boxes1, boxes2, scope='iou'):
        """calculate ious
        Args:
          boxes1: 5-D tensor [BATCH_SIZE, CELL_SIZE, CELL_SIZE, BOXES_PER_CELL, 4] ====> (x_center, y_center, w, h)
          boxes2: 5-D tensor [BATCH_SIZE, CELL_SIZE, CELL_SIZE, BOXES_PER_CELL, 4] ===> (x_center, y_center, w, h)
        Return:
          iou: 4-D tensor [BATCH_SIZE, CELL_SIZE, CELL_SIZE, BOXES_PER_CELL]
        """
        with tf.variable_scope(scope):
            # transform (x_center, y_center, w, h) to (x1, y1, x2, y2)
            boxes1_t = tf.stack([boxes1[..., 0] - boxes1[..., 2] / 2.0,
                                 boxes1[..., 1] - boxes1[..., 3] / 2.0,
                                 boxes1[..., 0] + boxes1[..., 2] / 2.0,
                                 boxes1[..., 1] + boxes1[..., 3] / 2.0],
                                axis=-1)

            boxes2_t = tf.stack([boxes2[..., 0] - boxes2[..., 2] / 2.0,
                                 boxes2[..., 1] - boxes2[..., 3] / 2.0,
                                 boxes2[..., 0] + boxes2[..., 2] / 2.0,
                                 boxes2[..., 1] + boxes2[..., 3] / 2.0],
                                axis=-1)

            # calculate the left up point & right down point
            lu = tf.maximum(boxes1_t[..., :2], boxes2_t[..., :2])
            rd = tf.minimum(boxes1_t[..., 2:], boxes2_t[..., 2:])

            # intersection
            intersection = tf.maximum(0.0, rd - lu)
            inter_square = intersection[..., 0] * intersection[..., 1]

            # calculate the boxs1 square and boxs2 square
            square1 = boxes1[..., 2] * boxes1[..., 3]
            square2 = boxes2[..., 2] * boxes2[..., 3]

            union_square = tf.maximum(square1 + square2 - inter_square, 1e-10)

        return tf.clip_by_value(inter_square / union_square, 0.0, 1.0)

    def loss_layer(self, predicts, labels, scope='loss_layer'):
        with tf.variable_scope(scope):
            predict_classes = tf.reshape(
                predicts[:, :self.boundary1],
                [self.batch_size, self.cell_size, self.cell_size, self.num_class])
            predict_scales = tf.reshape(
                predicts[:, self.boundary1:self.boundary2],
                [self.batch_size, self.cell_size, self.cell_size, self.boxes_per_cell])
            predict_boxes = tf.reshape(
                predicts[:, self.boundary2:],
                [self.batch_size, self.cell_size, self.cell_size, self.boxes_per_cell, 4])

            response = tf.reshape(
                labels[..., 0],
                [self.batch_size, self.cell_size, self.cell_size, 1])
            boxes = tf.reshape(
                labels[..., 1:5],
                [self.batch_size, self.cell_size, self.cell_size, 1, 4])
            boxes = tf.tile(
                boxes, [1, 1, 1, self.boxes_per_cell, 1]) / self.image_size
            classes = labels[..., 5:]

            offset = tf.reshape(
                tf.constant(self.offset, dtype=tf.float32),
                [1, self.cell_size, self.cell_size, self.boxes_per_cell])
            offset = tf.tile(offset, [self.batch_size, 1, 1, 1])
            offset_tran = tf.transpose(offset, (0, 2, 1, 3))
            predict_boxes_tran = tf.stack(
                [(predict_boxes[..., 0] + offset) / self.cell_size,
                 (predict_boxes[..., 1] + offset_tran) / self.cell_size,
                 tf.square(predict_boxes[..., 2]),
                 tf.square(predict_boxes[..., 3])], axis=-1)

            iou_predict_truth = self.calc_iou(predict_boxes_tran, boxes)

            # calculate I tensor [BATCH_SIZE, CELL_SIZE, CELL_SIZE, BOXES_PER_CELL]
            object_mask = tf.reduce_max