Wise Object Detection 8: A Detailed Look at the Components of the YOLOv3 Loss

Latest recommended article published on 2022-04-22 14:55:01

Bubbliiiing


Article link: https://blog.csdn.net/weixin_44791964/article/details/102756832


Preface
Reference source code
Parameters needed to compute the loss
- 1. y_pre
- 2. y_true
How the loss is computed

Preface

Knowing how to run predictions is not enough. Only once we understand training can we train models of our own!

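Reference source code

This tutorial is based on the GitHub repository https://github.com/qqwweee/keras-yolo3.
The focus is on the training idea rather than a line-by-line code walkthrough: what exactly is the loss that YOLOv3 training minimizes?
Reading this alongside the code will of course make it easier to follow.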

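Parameters needed to compute the loss

Computing the loss is really a comparison between y_pre and y_true:
y_pre is the output of the network for an image, and contains the contents of three feature layers;
y_true takes the ground-truth boxes of a real image, together with the class of the object in each box, and converts them into the same format as the YOLOv3 network output.
For a 416x416 input with 3 anchor boxes per layer and 80 classes, y_pre (after reshaping) and y_true both have the shapes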
(batch_size,13,13,3,85)
(batch_size,26,26,3,85)
(batch_size,52,52,3,85)
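
As a quick check of where these shapes come from, here is a minimal sketch in plain Python. It assumes a square 416x416 input, strides of 32, 16 and 8 for the three feature layers, 3 anchor boxes per layer and 80 classes; the function name `yolo3_output_shapes` is made up for this illustration and is not part of keras-yolo3.

```python
def yolo3_output_shapes(batch_size, input_size=416, anchors_per_layer=3, num_classes=80):
    """Return the reshaped (batch, grid_h, grid_w, anchors, 5 + classes) shape of each layer."""
    strides = (32, 16, 8)  # downsampling factor of each of the three feature layers
    shapes = []
    for stride in strides:
        grid = input_size // stride  # 416 // 32 = 13, 416 // 16 = 26, 416 // 8 = 52
        # 5 = x, y, w, h, confidence; the remaining entries are the class scores
        shapes.append((batch_size, grid, grid, anchors_per_layer, 5 + num_classes))
    return shapes


shapes = yolo3_output_shapes(batch_size=2)
for shape in shapes:
    print(shape)
```

The last dimension is 85 because each anchor box carries 4 box values, 1 confidence value and 80 class scores.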

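1. y_pre

y_pre is the output of the network for an image, and contains the contents of three feature layers.
The layers an image passes through inside the model are shown below: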

def yolo_body(inputs, num_anchors, num_classes):
    """Create YOLO_V3 model CNN body in Keras."""
    # darknet53
    darknet = Model(inputs, darknet_body(inputs))
    # First feature layer
    # y1=(batch_size,13,13,255), reshaped later to (batch_size,13,13,3,85)
    x, y1 = make_last_layers(darknet.output, 512, num_anchors*(num_classes+5))

    x = compose(
            DarknetConv2D_BN_Leaky(256, (1,1)),
            UpSampling2D(2))(x)
    x = Concatenate()([x,darknet.layers[152].output])
    # Second feature layer
    # y2=(batch_size,26,26,255), reshaped later to (batch_size,26,26,3,85)
    x, y2 = make_last_layers(x, 256, num_anchors*(num_classes+5))

    x = compose(
            DarknetConv2D_BN_Leaky(128, (1,1)),
            UpSampling2D(2))(x)
    x = Concatenate()([x,darknet.layers[92].output])
    # Third feature layer
    # y3=(batch_size,52,52,255), reshaped later to (batch_size,52,52,3,85)
    x, y3 = make_last_layers(x, 128, num_anchors*(num_classes+5))

    return Model(inputs, [y1,y2,y3])

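For the YOLOv3 model, the final output is the contents of three feature layers. Each feature layer corresponds to the image divided into a grid of a different size, and holds the position, confidence and class of the three anchor boxes (prior boxes) at every grid point.
For the outputs y1, y2 and y3 (once reshaped), [..., :2] is the offset relative to each grid point, [..., 2:4] is the width and height, [..., 4:5] is the confidence of the box, and [..., 5:] is the predicted probability of each class.
At this point y_pre has not been decoded yet; only after decoding does it describe boxes on the actual image. The decoding process is fairly simple. If it is unclear, see my other post, "Wise Object Detection 7: YOLOv3 Explained, with a Reimplementation of Its Prediction Code".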
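
To make the channel layout concrete, the following sketch splits the flat channel vector of one grid cell into its per-anchor fields using plain Python lists. It assumes 3 anchors and 80 classes, so 255 channels per cell; `split_cell` and the dictionary keys are names invented for this illustration.

```python
NUM_CLASSES = 80
NUM_ANCHORS = 3
PER_ANCHOR = 5 + NUM_CLASSES  # 85 values per anchor box


def split_cell(raw_cell):
    """Split one grid cell's flat channel vector into per-anchor fields."""
    assert len(raw_cell) == NUM_ANCHORS * PER_ANCHOR
    fields = []
    for a in range(NUM_ANCHORS):
        # Same effect as reshaping the last axis from 255 to (3, 85)
        v = raw_cell[a * PER_ANCHOR:(a + 1) * PER_ANCHOR]
        fields.append({
            "xy": v[0:2],          # offset relative to the grid point
            "wh": v[2:4],          # width and height
            "confidence": v[4:5],  # objectness score
            "classes": v[5:],      # one score per class
        })
    return fields


# Dummy cell whose values equal their channel index, to make the slicing visible
raw_cell = [float(i) for i in range(NUM_ANCHORS * PER_ANCHOR)]
cell_fields = split_cell(raw_cell)
print(cell_fields[1]["xy"], cell_fields[2]["confidence"])
```

All of these values are still raw network outputs; the sigmoid and exponential are applied during decoding.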

def yolo_head(feats, anchors, num_classes, input_shape, calc_loss=False):
    """Convert final layer features to bounding box parameters."""
    num_anchors = len(anchors)
    # Reshape to batch, height, width, num_anchors, box_params.
    anchors_tensor = K.reshape(K.constant(anchors), [1, 1, 1, num_anchors, 2])

    # Build the x, y grid
    grid_shape = K.shape(feats)[1:3] # height, width
    grid_y = K.tile(K.reshape(K.arange(0, stop=grid_shape[0]), [-1, 1, 1, 1]),
        [1, grid_shape[1], 1, 1])
    grid_x = K.tile(K.reshape(K.arange(0, stop=grid_shape[1]), [1, -1, 1, 1]),
        [grid_shape[0], 1, 1, 1])
    grid = K.concatenate([grid_x, grid_y])
    grid = K.cast(grid, K.dtype(feats))
    
    # (batch_size,13,13,3,85)
    feats = K.reshape(
        feats, [-1, grid_shape[0], grid_shape[1], num_anchors, num_classes + 5])

    # Adjust predictions to each spatial grid point and anchor size.
    box_xy = (K.sigmoid(feats[..., :2]) + grid) / K.cast(grid_shape[::-1], K.dtype(feats))
    box_wh = K.exp(feats[..., 2:4]) * anchors_tensor / K.cast(input_shape[::-1], K.dtype(feats))
    box_confidence = K.sigmoid(feats[..., 4:5])
    box_class_probs = K.sigmoid(feats[..., 5:])

    # When computing the loss, return the following values instead
    if calc_loss:
        return grid, feats, box_xy, box_wh
    return box_xy, box_wh, box_confidence, box_class_probs
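
The decoding done by yolo_head can be followed by hand for a single anchor box in a single grid cell. The sketch below uses only the standard library; `decode_box` and its argument names are assumptions made for this example, and the numbers (cell (6, 6) of a 13x13 grid, anchor 116x90, input 416x416) are just sample values.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(raw, cell_xy, grid_wh, anchor_wh, input_wh):
    """Decode raw (tx, ty, tw, th) into a centre/size box normalized to [0, 1]."""
    tx, ty, tw, th = raw
    # sigmoid keeps the centre inside its own cell; adding the cell index and
    # dividing by the grid size gives a position relative to the whole image
    bx = (sigmoid(tx) + cell_xy[0]) / grid_wh[0]
    by = (sigmoid(ty) + cell_xy[1]) / grid_wh[1]
    # exp scales the anchor size; dividing by the input size normalizes it
    bw = math.exp(tw) * anchor_wh[0] / input_wh[0]
    bh = math.exp(th) * anchor_wh[1] / input_wh[1]
    return bx, by, bw, bh


box = decode_box(raw=(0.0, 0.0, 0.0, 0.0), cell_xy=(6, 6), grid_wh=(13, 13),
                 anchor_wh=(116, 90), input_wh=(416, 416))
print(box)
```

With all raw values at zero, the box sits at the centre of its cell and has exactly the anchor's size, which is why the anchors act as priors.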

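2. y_true

y_true takes the ground-truth boxes of a real image, together with the class of the object in each box, and converts them into the same format as the YOLOv3 network output.
keras-yolo3 uses a dedicated function to process the ground-truth boxes of the images that are read in.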

def preprocess_true_boxes(true_boxes, input_shape, anchors, num_classes):

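Its inputs are:
true_boxes: shape (m, T, 5), i.e. x_min, y_min, x_max, y_max, class_id for T boxes in each of m images.
input_shape: the input size, here 416, 416.
anchors: the sizes of the 9 anchor boxes.
num_classes: the number of classes.
Processing the ground-truth boxes amounts to converting them to x, y, w, h relative to the image and locating them on the grid. The steps are:
1. Take the ground-truth box values, compute the box centre and its width and height, and divide by input_shape to turn them into ratios.
2. Create an all-zero y_true. y_true is a list containing three feature layers, with shapes (m,13,13,3,85), (m,26,26,3,85), (m,52,52,3,85).
3. For each image, compare the wh of each ground-truth box with the wh of the anchor boxes, compute the IoU, and pick the anchor with the highest IoU. This gives the feature layer the box belongs to and its grid cell, and the box is stored at that location in the corresponding y_true.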
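
Step 1 above can be sketched for a single box as follows. This is an illustration only: `to_normalized_xywh` is a hypothetical helper, and it mirrors the floor division (`// 2`) that preprocess_true_boxes applies when computing the centre.

```python
def to_normalized_xywh(box, input_hw=(416, 416)):
    """Convert (x_min, y_min, x_max, y_max, class_id) in pixels to normalized centre/size."""
    x_min, y_min, x_max, y_max, class_id = box
    in_h, in_w = input_hw
    # Centre of the box (floor division, as in preprocess_true_boxes)
    cx = (x_min + x_max) // 2
    cy = (y_min + y_max) // 2
    # Width and height in pixels
    w = x_max - x_min
    h = y_max - y_min
    # Divide by the input size to obtain ratios in [0, 1]
    return (cx / in_w, cy / in_h, w / in_w, h / in_h, class_id)


normalized = to_normalized_xywh((100, 150, 300, 350, 0))
print(normalized)
```

For this sample box the centre is (200, 250) pixels and the size is 200x200 pixels, so all four ratios fall between 0 and 1.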

for t, n in enumerate(best_anchor):
    for l in range(num_layers):
        if n in anchor_mask[l]:

            # Grid cell of this target in feature layer l
            i = np.floor(true_boxes[b,t,0]*grid_shapes[l][1]).astype('int32')
            j = np.floor(true_boxes[b,t,1]*grid_shapes[l][0]).astype('int32')

            # Position of the best anchor's index within this layer's anchor mask
            k = anchor_mask[l].index(n)
            c = true_boxes[b,t, 4].astype('int32')

            # Store into y_true
            y_true[l][b, j, i, k, 0:4] = true_boxes[b,t, 0:4]
            y_true[l][b, j, i, k, 4] = 1
            y_true[l][b, j, i, k, 5+c] = 1

In the final y_true, only the positions that best match each box of each image hold data; everything else is 0.
The complete code of preprocess_true_boxes is as follows:

def preprocess_true_boxes(true_boxes, input_shape, anchors, num_classes):
    '''
    Preprocess the ground-truth boxes into the training input format

    Parameters
    ----------
    true_boxes: array, shape=(m, T, 5)
        Absolute x_min, y_min, x_max, y_max, class_id relative to input_shape.
    input_shape: array-like, hw, multiples of 32
    anchors: array, shape=(N, 2), wh
    num_classes: integer

    Returns
    -------
    y_true: list of array, shaped like the YOLOv3 outputs, xywh are relative values

    '''
    # true_boxes holds 5 values per box: x_min, y_min, x_max, y_max, class_id.
    assert (true_boxes[..., 4]<num_classes).all(), 'class id must be less than num_classes'
    num_layers = len(anchors)//3 # default setting
    anchor_mask = [[6,7,8], [3,4,5], [0,1,2]] if num_layers==3 else [[3,4,5], [1,2,3]]

    true_boxes = np.array(true_boxes, dtype='float32')
    input_shape = np.array(input_shape, dtype='int32')
    
    # Take the ground-truth values and compute each box's centre and width/height.
    boxes_xy = (true_boxes[..., 0:2] + true_boxes[..., 2:4]) // 2
    boxes_wh = true_boxes[..., 2:4] - true_boxes[..., 0:2]

    # Convert to ratios of the input size
    true_boxes[..., 0:2] = boxes_xy/input_shape[::-1]
    true_boxes[..., 2:4] = boxes_wh/input_shape[::-1]

    # There are m images in total
    m = true_boxes.shape[0]

    # (13,13),(26,26),(52,52)
    grid_shapes = [input_shape//{0:32, 1:16, 2:8}[l] for l in range(num_layers)]
    # (m,13,13,3,85),(m,26,26,3,85),(m,52,52,3,85)
    y_true = [np.zeros((m,grid_shapes[l][0],grid_shapes[l][1],len(anchor_mask[l]),5+num_classes),
        dtype='float32') for l in range(num_layers)]

    # (1,9,2)
    anchors = np.expand_dims(anchors, 0)

    # Anchor extents, centred on the origin
    anchor_maxes = anchors / 2.
    anchor_mins = -anchor_maxes

    # Keep only valid boxes (width > 0)
    valid_mask = boxes_wh[..., 0]>0
    
    # Process each image
    for b in range(m):
        # Skip to the next image if there are no boxes
        wh = boxes_wh[b, valid_mask[b]]
        if len(wh)==0: continue

        # (T, 1, 2)
        wh = np.expand_dims(wh, -2)

        # Compare the ground-truth boxes with the anchors
        # by computing the IoU of their widths and heights
        box_maxes = wh / 2.
        box_mins = -box_maxes

        intersect_mins = np.maximum(box_mins, anchor_mins)
        intersect_maxes = np.minimum(box_maxes, anchor_maxes)
        intersect_wh = np.maximum(intersect_maxes - intersect_mins, 0.)
        intersect_area = intersect_wh[..., 0] * intersect_wh[..., 1]
        box_area = wh[..., 0] * wh[..., 1]
        anchor_area = anchors[..., 0] * anchors[..., 1]
        iou = intersect_area / (box_area + anchor_area - intersect_area)

        # For each ground-truth box in the image, find the best-matching anchor
        # shape (T,), holding the index (0 to 8) of the best anchor for each box
        best_anchor = np.argmax(iou, axis=-1)

        for t, n in enumerate(best_anchor):
            for l in range(num_layers):
                if n in anchor_mask[l]:

                    # Grid cell of this target in feature layer l
                    i = np.floor(true_boxes[b,t,0]*grid_shapes[l][1]).astype('int32')
                    j = np.floor(true_boxes[b,t,1]*grid_shapes[l][0]).astype('int32')

                    # Position of the best anchor's index within this layer's anchor mask
                    k = anchor_mask[l].index(n)
                    c = true_boxes[b,t, 4].astype('int32')

                    # Store into y_true
                    y_true[l][b, j, i, k, 0:4] = true_boxes[b,t, 0:4]
                    y_true[l][b, j, i, k, 4] = 1
                    y_true[l][b, j, i, k, 5+c] = 1

    return y_true
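
The anchor matching inside preprocess_true_boxes compares only widths and heights: both the box and the anchor are centred on the origin, so the intersection is simply the smaller width times the smaller height. Here is a small sketch of that idea in plain Python. The anchor list is the default set of 9 YOLOv3 anchors (in pixels), and `wh_iou` and `best_anchor_index` are names invented for this example.

```python
ANCHORS = [(10, 13), (16, 30), (33, 23), (30, 61), (62, 45),
           (59, 119), (116, 90), (156, 198), (373, 326)]


def wh_iou(wh_a, wh_b):
    """IoU of two boxes that share the same centre, given only (width, height)."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union


def best_anchor_index(box_wh, anchors=ANCHORS):
    """Index of the anchor whose shape overlaps the box the most."""
    ious = [wh_iou(box_wh, anchor) for anchor in anchors]
    return max(range(len(anchors)), key=lambda n: ious[n])


print(best_anchor_index((200, 200)))  # a large box, in pixels
print(best_anchor_index((20, 25)))    # a small box, in pixels
```

A large box is matched to one of the large anchors (indices 6 to 8, the 13x13 layer), while a small box is matched to one of the small anchors (indices 0 to 2, the 52x52 layer).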

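How the loss is computed

Once we have y_pre and y_true, how are they compared? It is not as simple as subtracting one from the other.
1. Use y_true to extract the positions in this feature layer where a target actually exists, (m,13,13,3,1), and the corresponding classes, (m,13,13,3,80).
2. Process the feature layer outputs in yolo_outputs to get the reshaped prediction y_pre, with shapes (m,13,13,3,85), (m,26,26,3,85), (m,52,52,3,85), as well as the decoded xy and wh.
3. Encode the ground-truth boxes into the raw form the network predicts; these values are used later to compute the loss.
4. For each image, compute the IoU between every predicted box and all ground-truth boxes, and take the largest IoU for each predicted box. If this largest IoU is below ignore_thresh, the prediction is treated as a true negative and its confidence is penalized in step 6; a prediction that is not assigned a target but whose largest IoU reaches ignore_thresh is ignored.
5. Compute the loss on xy and wh. It is computed only where a target actually exists, by comparing the encoded ground-truth boxes from step 3 with the raw (undecoded) predictions.
6. Compute the confidence loss, which has two parts. The first part covers positions where a target actually exists: the predicted confidence is compared with 1. The second part covers positions where no target exists: the predicted confidence is compared with 0, but only for the predictions that step 4 did not ignore (largest IoU below ignore_thresh).
7. Compute the class loss. It is computed only where a target actually exists, and measures the gap between the predicted classes and the true classes.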
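
Steps 3 and 5 can be worked through for one positive anchor box. The sketch below is an assumption-laden illustration, not the repository code: `encode_true_box`, `box_loss` and `bce_with_logits` are hypothetical scalar helpers that follow the same formulas as raw_true_xy, raw_true_wh, box_loss_scale, xy_loss and wh_loss in yolo_loss.

```python
import math


def bce_with_logits(target, logit):
    """Numerically stable binary cross-entropy between a target and sigmoid(logit)."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def encode_true_box(true_xywh, cell_xy, grid_wh, anchor_wh, input_wh):
    """Turn a normalized ground-truth box into the raw values the network should predict."""
    x, y, w, h = true_xywh
    raw_xy = (x * grid_wh[0] - cell_xy[0], y * grid_wh[1] - cell_xy[1])  # offset inside the cell
    raw_wh = (math.log(w * input_wh[0] / anchor_wh[0]),
              math.log(h * input_wh[1] / anchor_wh[1]))                  # log scale vs. anchor
    scale = 2.0 - w * h  # box_loss_scale: small boxes get a larger weight
    return raw_xy, raw_wh, scale


def box_loss(true_xywh, raw_pred, cell_xy, grid_wh, anchor_wh, input_wh):
    raw_xy, raw_wh, scale = encode_true_box(true_xywh, cell_xy, grid_wh, anchor_wh, input_wh)
    xy_loss = scale * sum(bce_with_logits(t, p) for t, p in zip(raw_xy, raw_pred[0:2]))
    wh_loss = scale * 0.5 * sum((t - p) ** 2 for t, p in zip(raw_wh, raw_pred[2:4]))
    return xy_loss, wh_loss


true_box = (0.5, 0.5, 116 / 416, 90 / 416)  # centred in cell (6, 6), same size as the anchor
xy_loss, wh_loss = box_loss(true_box, raw_pred=(0.0, 0.0, 0.0, 0.0), cell_xy=(6, 6),
                            grid_wh=(13, 13), anchor_wh=(116, 90), input_wh=(416, 416))
print(xy_loss, wh_loss)
```

Here the raw prediction already matches the encoded target, so wh_loss is essentially zero. The xy term is not zero, because cross-entropy against a soft target of 0.5 has a minimum of ln 2 per coordinate; only its gradient vanishes.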
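The total loss actually computed is the sum of three losses:

For boxes that actually exist, the gap between the encoded ground truth and the prediction (the xy and wh terms).
For boxes that actually exist, the predicted confidence compared with 1; for positions without a box, the predicted confidence compared with 0, limited to the predictions that step 4 above did not ignore.
For boxes that actually exist, the predicted classes compared with the true classes.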
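
The role of the ignore mask in the confidence term is easy to get backwards, so here is a scalar sketch of it. `confidence_loss` and `bce_with_logits` are hypothetical helpers for this illustration; the formula follows the confidence_loss expression in yolo_loss, with the ignore mask equal to 1 when the best IoU is below ignore_thresh.

```python
import math


def bce_with_logits(target, logit):
    """Numerically stable binary cross-entropy between a target and sigmoid(logit)."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def confidence_loss(object_mask, conf_logit, best_iou, ignore_thresh=0.5):
    """Confidence loss of one anchor box; object_mask is 1 if a target is assigned, else 0."""
    # 1 keeps the box as a negative sample, 0 drops it from the no-object term
    ignore_mask = 1.0 if best_iou < ignore_thresh else 0.0
    bce = bce_with_logits(object_mask, conf_logit)
    return object_mask * bce + (1 - object_mask) * bce * ignore_mask


positive = confidence_loss(object_mask=1, conf_logit=0.0, best_iou=0.9)
negative = confidence_loss(object_mask=0, conf_logit=0.0, best_iou=0.1)
ignored = confidence_loss(object_mask=0, conf_logit=0.0, best_iou=0.7)
print(positive, negative, ignored)
```

A box with no assigned target but a high IoU with some ground-truth box contributes nothing: it is neither rewarded nor punished, since it is roughly in the right place even though another anchor is responsible for that target.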

The actual code is as follows:

def yolo_loss(args, anchors, num_classes, ignore_thresh=.5, print_loss=False):
    '''Return yolo_loss tensor

    Parameters
    ----------
    yolo_outputs: list of tensor, the output of yolo_body or tiny_yolo_body
    y_true: list of array, the output of preprocess_true_boxes
    anchors: array, shape=(N, 2), wh
    num_classes: integer
    ignore_thresh: float, the iou threshold whether to ignore object confidence loss

    Returns
    -------
    loss: tensor, shape=(1,)

    '''
    num_layers = len(anchors)//3 
    # Split the predictions from the ground truth
    yolo_outputs = args[:num_layers]
    y_true = args[num_layers:]
    # Anchor indices used by each feature layer
    anchor_mask = [[6,7,8], [3,4,5], [0,1,2]] if num_layers==3 else [[3,4,5], [1,2,3]]

    # input_shape is 416,416
    input_shape = K.cast(K.shape(yolo_outputs[0])[1:3] * 32, K.dtype(y_true[0]))

    # Grid shapes are 13,13; 26,26; 52,52
    grid_shapes = [K.cast(K.shape(yolo_outputs[l])[1:3], K.dtype(y_true[0])) for l in range(num_layers)]
    loss = 0

    # Batch size (number of images)
    m = K.shape(yolo_outputs[0])[0] # batch size, tensor
    mf = K.cast(m, K.dtype(yolo_outputs[0]))

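    # y_true is a list containing three feature layers, with shapes (m,13,13,3,85),(m,26,26,3,85),(m,52,52,3,85).
    # yolo_outputs is a list containing three feature layers.
    for l in range(num_layers):
        # Positions in this feature layer where a target exists. (m,13,13,3,1)
        object_mask = y_true[l][..., 4:5]
        # The corresponding classes (m,13,13,3,80)
        true_class_probs = y_true[l][..., 5:]

        # Process the feature layer output from yolo_outputs
        # to get the reshaped prediction, with shapes (m,13,13,3,85),(m,26,26,3,85),(m,52,52,3,85),
        # together with the decoded xy and wh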
        grid, raw_pred, pred_xy, pred_wh = yolo_head(yolo_outputs[l],
             anchors[anchor_mask[l]], num_classes, input_shape, calc_loss=True)
        
        # Decoded position of the predicted boxes
        pred_box = K.concatenate([pred_xy, pred_wh])

        # Encode the ground-truth boxes; used later to compute the loss
        raw_true_xy = y_true[l][..., :2]*grid_shapes[l][::-1] - grid
        raw_true_wh = K.log(y_true[l][..., 2:4] / anchors[anchor_mask[l]] * input_shape[::-1])

        # Keep raw_true_wh only where object_mask marks a real target (avoids log(0) = -inf elsewhere)
        raw_true_wh = K.switch(object_mask, raw_true_wh, K.zeros_like(raw_true_wh))
        box_loss_scale = 2 - y_true[l][...,2:3]*y_true[l][...,3:4]

        # Find the predictions to be ignored
        ignore_mask = tf.TensorArray(K.dtype(y_true[0]), size=1, dynamic_size=True)
        object_mask_bool = K.cast(object_mask, 'bool')
        
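        # Compute ignore_mask for each image
        def loop_body(b, ignore_mask):
            # All ground-truth boxes in image b
            # shape (n,4): n boxes in the image, 4 values each
            true_box = tf.boolean_mask(y_true[l][b,...,0:4], object_mask_bool[b,...,0])
            # IoU between the predictions and the ground truth
            # result has shape 13,13,3,n
            iou = box_iou(pred_box[b], true_box)

            # 13,13,3
            best_iou = K.max(iou, axis=-1)

            # 1 where image b's predictions count as negatives (best IoU below ignore_thresh), 0 where they are ignored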
            ignore_mask = ignore_mask.write(b, K.cast(best_iou<ignore_thresh, K.dtype(true_box)))
            return b+1, ignore_mask

        _, ignore_mask = K.control_flow_ops.while_loop(lambda b,*args: b<m, loop_body, [0, ignore_mask])

        # Stack the per-image masks into one tensor
        # ignore_mask (m,13,13,3,1)
        ignore_mask = ignore_mask.stack()
        ignore_mask = K.expand_dims(ignore_mask, -1)

        # K.binary_crossentropy is helpful to avoid exp overflow.
        xy_loss = object_mask * box_loss_scale * K.binary_crossentropy(raw_true_xy, raw_pred[...,0:2], from_logits=True)
        wh_loss = object_mask * box_loss_scale * 0.5 * K.square(raw_true_wh-raw_pred[...,2:4])
        
        confidence_loss = object_mask * K.binary_crossentropy(object_mask, raw_pred[...,4:5], from_logits=True)+ \
            (1-object_mask) * K.binary_crossentropy(object_mask, raw_pred[...,4:5], from_logits=True) * ignore_mask
        class_loss = object_mask * K.binary_crossentropy(true_class_probs, raw_pred[...,5:], from_logits=True)

        xy_loss = K.sum(xy_loss) / mf
        wh_loss = K.sum(wh_loss) / mf
        confidence_loss = K.sum(confidence_loss) / mf
        class_loss = K.sum(class_loss) / mf
        loss += xy_loss + wh_loss + confidence_loss + class_loss
        if print_loss:
            loss = tf.Print(loss, [loss, xy_loss, wh_loss, confidence_loss, class_loss, K.sum(ignore_mask)], message='loss: ')
    return loss
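
The listing above calls box_iou, which is not shown in this post. In the repository it is a vectorized helper that works on tensors; the scalar version below is only a sketch of the same idea for two boxes in normalized (centre x, centre y, width, height) form, and `iou_xywh` is a name made up for this example.

```python
def iou_xywh(box_a, box_b):
    """IoU of two boxes given as (centre_x, centre_y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Convert centre/size to corner coordinates
    a_min_x, a_max_x = ax - aw / 2, ax + aw / 2
    a_min_y, a_max_y = ay - ah / 2, ay + ah / 2
    b_min_x, b_max_x = bx - bw / 2, bx + bw / 2
    b_min_y, b_max_y = by - bh / 2, by + bh / 2
    # Overlap along each axis, clamped at zero for disjoint boxes
    inter_w = max(min(a_max_x, b_max_x) - max(a_min_x, b_min_x), 0.0)
    inter_h = max(min(a_max_y, b_max_y) - max(a_min_y, b_min_y), 0.0)
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union


print(iou_xywh((0.5, 0.5, 0.2, 0.2), (0.6, 0.5, 0.2, 0.2)))  # partial overlap
print(iou_xywh((0.1, 0.1, 0.1, 0.1), (0.9, 0.9, 0.1, 0.1)))  # no overlap
```

In yolo_loss this computation is carried out between every predicted box of a feature layer and every ground-truth box of the image, and the maximum over the ground-truth boxes is compared with ignore_thresh.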

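That is how the loss is computed. It is fairly involved, but still understandable. Hats off to the author of the code!

If anything is unclear, feel free to ask me.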
