A few notes on the YOLOv2 loss function

I was recently reading "Few-shot Object Detection via Feature Reweighting, ICCV 2019", which builds on the YOLOv2 framework, so I decided to work through the YOLOv2 loss function in order to implement the paper.

For reference, two links on the YOLOv1 loss:
ref1 - issue discussion
ref2 - principle walkthrough

YOLOV2

The YOLOv2 model summary is as follows:
[figure: YOLOv2 model summary]

Location prediction

YOLOv2 is built on darknet19 and borrows anchor boxes from the RPN, predicting bounding boxes as offsets from prior boxes. In Faster R-CNN, the actual box position (x, y) is computed from the predicted offsets (tx, ty), the prior box size (wi, hi) and its center (xi, yi), where (xi, yi) is the center of each position on the feature map. Because tx and ty are unconstrained in that parameterization, the predicted center can drift by an arbitrary amount, which makes training unstable. YOLOv2 instead predicts the box center as an offset relative to the top-left corner of the corresponding cell, which constrains the center to lie inside that cell, and passes the offset through a sigmoid so that it falls in (0, 1), avoiding the problem above.

Each box is parameterized by four predicted values tx, ty, tw, th; the figure below shows how the actual box center and size are computed from them.
[figure: computing the box center and size from tx, ty, tw, th with a prior box]
cx, cy are the coordinates of the cell's top-left corner, taken to be (1, 1) in the figure above. After applying the sigmoid, σ(tx), σ(ty) ∈ (0, 1), so the center is constrained to a bounded range (namely within one cell). pw, ph are the width and height of the prior box (for example one of the YOLOv2 anchors is 0.57273, 0.677385; these values are relative to the feature map, so multiply by 32 to get back to the original image). The box obtained this way is also expressed relative to the feature map, and is finally converted into four normalized parameters:
bx = (σ(tx) + cx) / W, by = (σ(ty) + cy) / H
bw = pw * exp(tw) / W, bh = ph * exp(th) / H
where W and H are the width and height of the feature map; for a (416, 416) input to darknet19, W = H = 13. Multiplying bx, by, bw, bh by the input image width and height gives the box position and size in the original image.
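The decoding can be sketched in a few lines of NumPy (a minimal sketch, assuming `preds` holds the raw (tx, ty, tw, th) values on a 13×13 grid and `anchors` is given in feature-map / grid-cell units):

import numpy as np

def decode_boxes(preds, anchors, grid_h=13, grid_w=13):
    # preds:   (grid_h, grid_w, num_anchors, 4) raw (tx, ty, tw, th)
    # anchors: (num_anchors, 2) prior (pw, ph) in grid-cell units
    # returns (bx, by, bw, bh) normalized to [0, 1] of the image
    cy, cx = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing='ij')
    cx = cx[..., None]                                 # cell left offsets
    cy = cy[..., None]                                 # cell top offsets
    tx, ty, tw, th = [preds[..., i] for i in range(4)]
    bx = (1.0 / (1.0 + np.exp(-tx)) + cx) / grid_w     # sigmoid keeps the center inside its cell
    by = (1.0 / (1.0 + np.exp(-ty)) + cy) / grid_h
    bw = anchors[:, 0] * np.exp(tw) / grid_w           # anchors broadcast over the grid
    bh = anchors[:, 1] * np.exp(th) / grid_h
    return np.stack([bx, by, bw, bh], axis=-1)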

According to this description, the box loss is actually computed on the tx, ty, tw, th terms; the ground truth can be converted into the same parameterization through the corresponding expressions, so that prediction and ground truth are compared in the same format.
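As a minimal sketch of that inverse transform (consistent with how `label` is consumed in the loss code further below; the ground-truth box (x, y, w, h) is assumed normalized to [0, 1] and the anchor (pw, ph) given in grid-cell units, edge clamping omitted):

import numpy as np

def encode_box(x, y, w, h, pw, ph, grid_w=13, grid_h=13):
    # map a normalized GT box to its cell index and (sigma(tx), sigma(ty), tw, th) targets
    cx, cy = int(x * grid_w), int(y * grid_h)    # cell that owns the box center
    sig_tx = x * grid_w - cx                     # offset inside the cell, in (0, 1)
    sig_ty = y * grid_h - cy
    tw = np.log(w * grid_w / pw)                 # so that pw * exp(tw) = w * grid_w
    th = np.log(h * grid_h / ph)
    return cx, cy, sig_tx, sig_ty, tw, th

In practice the responsible anchor is chosen as the prior with the highest IoU against the ground-truth box; that selection step is omitted here.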

Output

For a (416, 416, 3) input image, YOLOv2 finally outputs a (1, 13, 13, 125) tensor, where 125 = 5 * (20 + 4 + 1): 5 is the number of anchors, 20 is the number of VOC classes, 4 is the xywh offsets, and 1 is the confidence score that separates objects from background.
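As a quick sketch (with a dummy tensor), the 125 channels are regrouped per anchor and then sliced; the channel order assumed here matches the loss code further below:

import tensorflow as tf

# dummy stand-in for the raw (1, 13, 13, 125) network output
out = tf.zeros((1, 13, 13, 125))
out = tf.reshape(out, (1, 13, 13, 5, 25))   # 5 anchors x (4 box offsets + 1 confidence + 20 classes)
txywh   = out[..., 0:4]   # tx, ty, tw, th
to      = out[..., 4:5]   # objectness / confidence logit
classes = out[..., 5:]    # 20 VOC class logits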

loss function

[figure: the YOLOv2 loss function]
W and H have the same meaning as before, A is the number of anchors, and the various λ values are the weight coefficients of the individual terms.
The loss function consists of five terms, which I group here into three parts: no object, prior, and the remaining terms covering box coordinates, confidence, and class; for background, only the confidence error is computed. A commonly cited written-out form of the formula is given after this paragraph.
You can also refer to my other post, a reading of region_layer.c in the darknet source, for how the YOLOv2 formulas are actually implemented in darknet.
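For reference, the loss in the figure above is often written out as follows (this is the widely circulated restatement of the YOLOv2 loss, not an equation taken verbatim from the paper); b^o, b^r, b^c denote the predicted confidence, box parameters, and class scores of anchor k at cell (i, j):

L_t = \sum_{i=1}^{W} \sum_{j=1}^{H} \sum_{k=1}^{A} \Big[
      \lambda_{noobj} \, 1_{\mathrm{MaxIOU} < \mathrm{Thresh}} \, (0 - b^{o}_{ijk})^2
    + \lambda_{prior} \, 1_{t < 12800} \sum_{r \in (x,y,w,h)} (\mathrm{prior}^{r}_{k} - b^{r}_{ijk})^2
    + 1^{\mathrm{truth}}_{ijk} \Big( \lambda_{coord} \sum_{r \in (x,y,w,h)} (\mathrm{truth}^{r} - b^{r}_{ijk})^2
    + \lambda_{obj} \, (\mathrm{IOU}_{\mathrm{truth}} - b^{o}_{ijk})^2
    + \lambda_{class} \sum_{c=1}^{C} (\mathrm{truth}^{c} - b^{c}_{ijk})^2 \Big) \Big]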

no object

[figure: the no-object term of the loss]
The first term of this part of the formula is not clearly explained in many loss write-ups, so I will restate it here:
The indicator 1_{MaxIOU < Thresh} ends up as a boolean tensor of shape roughly [13, 13, 5, 1]. The MaxIOU < Thresh test is computed between:
① all boxes in the annotation file, and
② the predicted [13, 13, 5, 4] tensor of [tx, ty, tw, th], denoted coorT.
For every predicted box in coorT we compute the IoU against all ground-truth boxes, giving a list of IoUs, and take its maximum. If this maximum is below the threshold (commonly 0.6), the box is counted in the no-object loss, because it is regarded as a box containing no object (see the sketch after this paragraph).
Once the feature map is computed, the locations we mainly care about, those with objectness = 1 (i.e. containing an object), are very few, while the vast majority of locations contain no object. The no-object part therefore has to be included in the loss, otherwise the model ends up unable to distinguish foreground from background.
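A minimal sketch of that masking step, assuming `best_ious` already holds, for each of the 13×13×5 predicted boxes, its maximum IoU against all ground-truth boxes (the loss code below computes exactly this quantity as `best_ious`):

import tensorflow as tf

def noobj_mask_from_best_iou(best_ious, object_mask, iou_thresh=0.6):
    # best_ious:   (batch, 13, 13, 5, 1) max IoU of each predicted box vs. all GT boxes
    # object_mask: (batch, 13, 13, 5, 1) 1 where an anchor is responsible for a GT box
    below_thresh = tf.cast(best_ious < iou_thresh, tf.float32)  # 1_{MaxIOU < Thresh}
    return below_thresh * (1.0 - object_mask)                   # exclude anchors already assigned to objects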

prior

[figure: the prior (anchor-shape) term of the loss]
This is a coordinate error that is only computed for the first 12800 training samples, to encourage the network to learn the anchor shapes. I included this term in the loss code below and found it not particularly effective. It only covers the very first iterations: with a batch size of 32, 12800 / 32 = 400, so the term is switched off after 400 steps. The learning rate is usually still small during those steps, hence this dedicated term that pushes the predictions toward the anchor shapes.
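For reference, in darknet's region_layer.c this term drives every predictor toward its own anchor box centered in its cell: the targets are σ(tx) = σ(ty) = 0.5 and tw = th = 0 (so that pw * exp(tw) = pw). A minimal sketch of that darknet-style target, independent of my implementation below:

import tensorflow as tf

def prior_loss_darknet_style(conv_xy_logits, conv_twth, mask):
    # conv_xy_logits: raw tx, ty predictions; conv_twth: raw tw, th predictions
    # mask: (..., 1) selects the cells the term is applied to
    xy_term = tf.square(0.5 - tf.sigmoid(conv_xy_logits))   # pull the center to the middle of the cell
    wh_term = tf.square(0.0 - conv_twth)                     # pull the size to the anchor itself
    return tf.reduce_sum((tf.reduce_sum(xy_term, -1, keepdims=True) +
                          tf.reduce_sum(wh_term, -1, keepdims=True)) * mask)

My implementation below uses log(anchors / dims) as the w/h target instead; the two are not equivalent.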

Remaining terms

[figure: the coordinate, confidence and class terms of the loss]
The first term is the coordinate loss.
The second term is the confidence loss; its target is the actual IoU between the predicted box and its matched ground-truth box. Note that this IoU is not the same quantity as the MaxIOU used in the no-object term above.
The third term is the class error, as described in the formula above.

YOLOv2 model code

The basic darknet19 backbone code is as follows:

import tensorflow as tf
from tensorflow.keras.layers import (Conv2D, BatchNormalization, LeakyReLU,
                                     MaxPool2D, concatenate)


def space2depth(x):
    # passthrough layer: rearrange 2x2 spatial blocks into channels (e.g. 26x26x64 -> 13x13x256)
    return tf.nn.space_to_depth(x, block_size=2)

def darknet19_yolov2(input, class_num=20):
    # suppose input as [416, 416, 3]
    shape_in = input.shape
    grid_h = shape_in[1] // 32
    grid_w = shape_in[2] // 32
    anchors_num = 5
    # conv1 set
    conv = Conv2D(filters=32, kernel_size=3, padding='same', 
                   use_bias=False, name='conv1')(input)
    conv = BatchNormalization(name='bn1')(conv)
    conv = LeakyReLU(alpha=0.1)(conv)
    conv = MaxPool2D(pool_size=(2, 2))(conv)

    # conv2 set
    conv = Conv2D(filters=64, kernel_size=3, padding='same', 
                  use_bias=False, name='conv2')(conv)
    conv = BatchNormalization(name='bn2')(conv)
    conv = LeakyReLU(alpha=0.1)(conv)
    conv = MaxPool2D(pool_size=(2, 2))(conv)

    # conv3 set
    conv = Conv2D(filters=128, kernel_size=3, padding='same', 
                  use_bias=False, name='conv3_1')(conv)
    conv = BatchNormalization(name='bn3_1')(conv)
    conv = LeakyReLU(alpha=0.1)(conv)

    conv = Conv2D(filters=64, kernel_size=1, padding='same',
                  use_bias=False, name='conv3_2')(conv)
    conv = BatchNormalization(name='bn3_2')(conv)
    conv = LeakyReLU(alpha=0.1)(conv)

    conv = Conv2D(filters=128, kernel_size=3, padding='same', 
                  use_bias=False, name='conv3_3')(conv)
    conv = BatchNormalization(name='bn3_3')(conv)
    conv = LeakyReLU(alpha=0.1)(conv)
    conv = MaxPool2D(pool_size=(2, 2))(conv)

    # conv4 set
    conv = Conv2D(filters=256, kernel_size=3, padding='same', 
                  use_bias=False, name='conv4_1')(conv)
    conv = BatchNormalization(name='bn4_1')(conv)
    conv = LeakyReLU(alpha=0.1)(conv)

    conv = Conv2D(filters=128, kernel_size=1, padding='same',
                  use_bias=False, name='conv4_2')(conv)
    conv = BatchNormalization(name='bn4_2')(conv)
    conv = LeakyReLU(alpha=0.1)(conv)

    conv = Conv2D(filters=256, kernel_size=3, padding='same', 
                  use_bias=False, name='conv4_3')(conv)
    conv = BatchNormalization(name='bn4_3')(conv)
    conv = LeakyReLU(alpha=0.1)(conv)
    conv = MaxPool2D(pool_size=(2, 2))(conv)

    # conv5 set
    conv = Conv2D(filters=512, kernel_size=3, padding='same', 
                  use_bias=False, name='conv5_1')(conv)
    conv = BatchNormalization(name='bn5_1')(conv)
    conv = LeakyReLU(alpha=0.1)(conv)

    conv = Conv2D(filters=256, kernel_size=1, padding='same',
                  use_bias=False, name='conv5_2')(conv)
    conv = BatchNormalization(name='bn5_2')(conv)
    conv = LeakyReLU(alpha=0.1)(conv)

    conv = Conv2D(filters=512, kernel_size=3, padding='same', 
                  use_bias=False, name='conv5_3')(conv)
    conv = BatchNormalization(name='bn5_3')(conv)
    conv = LeakyReLU(alpha=0.1)(conv)

    conv = Conv2D(filters=256, kernel_size=1, padding='same',
                  use_bias=False, name='conv5_4')(conv)
    conv = BatchNormalization(name='bn5_4')(conv)
    conv = LeakyReLU(alpha=0.1)(conv)

    conv = Conv2D(filters=512, kernel_size=3, padding='same', 
                  use_bias=False, name='conv5_5')(conv)
    conv = BatchNormalization(name='bn5_5')(conv)
    conv = LeakyReLU(alpha=0.1)(conv)
    shortcut1 = conv
    conv = MaxPool2D(pool_size=(2, 2))(conv)

    # conv6 set
    conv = Conv2D(filters=1024, kernel_size=3, padding='same', 
                  use_bias=False, name='conv6_1')(conv)
    conv = BatchNormalization(name='bn6_1')(conv)
    conv = LeakyReLU(alpha=0.1)(conv)

    conv = Conv2D(filters=512, kernel_size=1, padding='same',
                  use_bias=False, name='conv6_2')(conv)
    conv = BatchNormalization(name='bn6_2')(conv)
    conv = LeakyReLU(alpha=0.1)(conv)

    conv = Conv2D(filters=1024, kernel_size=3, padding='same', 
                  use_bias=False, name='conv6_3')(conv)
    conv = BatchNormalization(name='bn6_3')(conv)
    conv = LeakyReLU(alpha=0.1)(conv)

    conv = Conv2D(filters=512, kernel_size=1, padding='same',
                  use_bias=False, name='conv6_4')(conv)
    conv = BatchNormalization(name='bn6_4')(conv)
    conv = LeakyReLU(alpha=0.1)(conv)

    conv = Conv2D(filters=1024, kernel_size=3, padding='same', 
                  use_bias=False, name='conv6_5')(conv)
    conv = BatchNormalization(name='bn6_5')(conv)
    conv = LeakyReLU(alpha=0.1)(conv)

    # conv7
    conv = Conv2D(filters=1024, kernel_size=3, padding='same',
                  use_bias=False, name='conv7_1')(conv)
    conv = BatchNormalization(name='bn7_1')(conv)
    conv = LeakyReLU(alpha=0.1)(conv)

    conv = Conv2D(filters=1024, kernel_size=3, padding='same',
                  use_bias=False, name='conv7_2')(conv)
    conv = BatchNormalization(name='bn7_2')(conv)
    conv = LeakyReLU(alpha=0.1)(conv)

    # conv8 conca
    shortcut1 = Conv2D(filters=64, kernel_size=1, padding='same', 
                  use_bias=False, name='conv8_1')(shortcut1)
    shortcut1 = BatchNormalization(name='bn8_1')(shortcut1)
    shortcut1 = LeakyReLU(alpha=0.1)(shortcut1)
    shortcut1 = tf.keras.layers.Lambda(space2depth)(shortcut1)

    conv = concatenate([shortcut1, conv])

    conv = Conv2D(filters=1024, kernel_size=3, padding='same',
                  use_bias=False, name='conv8_2')(conv)
    conv = BatchNormalization(name='bn8_2')(conv)
    conv = LeakyReLU(alpha=0.1)(conv)

    # detection head: (class_num + 5) channels per anchor, regrouped to (grid_h, grid_w, anchors, class_num + 5)
    conv = Conv2D(filters=(class_num + 5) * anchors_num, kernel_size=1, padding='same', kernel_initializer='he_normal')(conv)
    conv = tf.keras.layers.Reshape((grid_h, grid_w, anchors_num, (class_num + 5)))(conv)

    return conv

Loss function code

import tensorflow as tf
from tensorflow.keras import backend as K
# `cfg` is the project's configuration object (class count, IoU threshold, length of the prior phase)


def compute_loss_yolo(conv, label, true_boxes_grid, anchors, class_num=cfg.TRAIN.TRAIN_CLASS_NUM, iter=0):
    # conv:  network output, shape (batch, 13, 13, 5, class_num + 5)
    # label: encoded targets with the same layout (sigma(tx), sigma(ty), tw, th, conf, one-hot classes)
    # true_boxes_grid: all ground-truth boxes per image, used for the MaxIOU no-object test
    alpha1 = 1.0
    alpha2 = 1.0
    alpha3 = 5.0
    alpha4 = 1.0
    alpha5 = 1.0
    alpha6 = 0.01 # prior learning
    label = tf.convert_to_tensor(label)

    # darknet yolo info
    avg_iou = 0
    recall = 0
    avg_cat = 0
    avg_obj = 0
    avg_anyobj = 0
    count = 0
    class_count = 0

    output_shape = K.shape(conv).numpy()
    anchors = K.reshape(K.variable(anchors), [1, 1, 1, 5, 2]) # yolov2 anchor length is 5
    dims = K.cast(K.reshape(output_shape[1:3], (1, 1, 1, 1, 2)), K.dtype(conv))
    c1 = tf.tile(tf.range(output_shape[1]), [output_shape[2]])
    coord_x = tf.cast(tf.reshape(c1, (1, output_shape[2], output_shape[1], 1, 1)), tf.float32)
    coord_y = tf.transpose(coord_x, (0, 2, 1, 3, 4))
    coords = tf.tile(tf.concat([coord_x, coord_y], -1), [output_shape[0], 1, 1, 5, 1])

    object_mask = tf.cast(label[..., 4:5] > 0., dtype=tf.float32)
    nb_detector_mask = K.sum(object_mask)
    # first loss - coordinate loss
    # in feature map ratio
    bxy = tf.keras.backend.sigmoid(conv[:, :, :, :, :2]) + coords
    label_xy = label[..., :2] + coords

    # second loss - w, h loss
    # tw and th are compared directly (same log-space parameterization as the targets)
    bwh = conv[..., 2:4]
    label_wh = label[..., 2:4]
    # in feature map ratio
    bwh_exp = K.exp(bwh) * anchors
    label_wh_exp = K.exp(label_wh) * anchors
    ratio_obj = tf.repeat(tf.expand_dims(2 - (label_wh_exp[..., 0] / output_shape[1] * label_wh_exp[..., 1] / output_shape[2]), axis=-1), 2, axis=-1)

    bxy_loss = K.sum(K.square((bxy - label_xy) * ratio_obj) * object_mask)  # /(nb_detector_mask + 1e-6)
    bwh_loss = K.sum(K.square((bwh - label_wh) * ratio_obj) * object_mask) #/(nb_detector_mask + 1e-6)


    ## CONFIDENCE LOSS
    to = K.sigmoid(conv[..., 4:5])
    ### find iou between prediction and ground truth boxes
    x1, y1, w1, h1 = label_xy[..., 0], label_xy[..., 1], label_wh_exp[..., 0], label_wh_exp[
        ..., 1]
    x2, y2, w2, h2 = bxy[..., 0], bxy[..., 1], bwh_exp[..., 0], bwh_exp[..., 1]
    ious = iou(x1, y1, w1, h1, x2, y2, w2, h2)
    ious = K.expand_dims(ious, -1)

    avg_iou = K.sum(ious * object_mask)
    recall = K.sum(tf.cast(ious * object_mask > 0.5, dtype=tf.float32))
    avg_obj = K.sum(to * object_mask)
    avg_anyobj = K.sum(to)
    count = K.sum(object_mask)
    class_count = count

    ### for each detector: best ious between prediction and true boxes
    pred_xy = K.expand_dims(bxy / dims, 4)  # divide by dims: feature-map units -> 0~1 image fractions
    pred_wh = K.expand_dims(bwh_exp / dims, 4)
    pred_wh_half = pred_wh / 2.0
    pred_mins = pred_xy - pred_wh_half
    pred_maxes = pred_xy + pred_wh_half
    true_box_shape = K.int_shape(true_boxes_grid)
    true_boxes_grid = K.reshape(true_boxes_grid, [true_box_shape[0], 1, 1, 1, true_box_shape[1], true_box_shape[2]])
    true_xy, true_wh = true_boxes_grid[..., 0:2], true_boxes_grid[..., 2:4]
    true_wh_half = true_wh * 0.5
    true_mins = true_xy - true_wh_half
    true_maxes = true_xy + true_wh_half
    intersect_mins = K.maximum(pred_mins, true_mins)
    intersect_maxes = K.minimum(pred_maxes, true_maxes)

    intersect_wh = K.maximum(intersect_maxes - intersect_mins, 0.)
    intersect_areas = intersect_wh[..., 0] * intersect_wh[..., 1]

    pred_areas = pred_wh[..., 0] * pred_wh[..., 1]
    true_areas = true_wh[..., 0] * true_wh[..., 1]

    union_areas = pred_areas + true_areas - intersect_areas
    iou_scores = intersect_areas / union_areas
    best_ious = K.max(iou_scores, axis=4)
    best_ious = K.expand_dims(best_ious)

    # fourth term: no-object confidence loss. TODO: make it class-aware, since the class scores here are modulated by the reweighting vector
    no_object_detection = K.cast(best_ious < cfg.TRAIN.IOU_THRESH, dtype=K.dtype(best_ious))
    noobj_mask = no_object_detection * (1 - object_mask)
    nb_noobj_mask = K.sum(tf.cast(noobj_mask > 0.0, tf.float32))
    to_noobj_loss = K.sum(K.square(0-to) * noobj_mask) #* (nb_detector_mask) / (nb_detector_mask + nb_noobj_mask + 1e-6) #/ (nb_noobj_mask + 1e-6)

    # third term: object confidence loss, with the true IoU as the target
    to_obj_loss = K.sum(K.square(ious - to) * object_mask) #* (nb_noobj_mask) / (nb_detector_mask + nb_noobj_mask + 1e-6) # / (nb_detector_mask + 1e-6)

    # fifth term
    pre_class = K.sigmoid(conv[..., 5:])
    gt_class = label[..., 5:]
    class_loss = K.sum(K.square(pre_class - gt_class) * object_mask) #/ (nb_detector_mask + 1e-6)
    # print("nb_noobj_mask, ", nb_noobj_mask, " nb_detector_mask : ", nb_detector_mask)
    avg_cat = K.sum(pre_class * gt_class)

    # sixth term: the prior (anchor-shape) learning term
    # note: darknet's region_layer.c uses tw = th = 0 as the target (i.e. exactly the anchor shape);
    # this implementation uses log(anchors / dims) as the w/h target and applies it only on noobj cells
    if iter < cfg.TRAIN.PRIOR_NUM:
        prior_wh = K.log(anchors / dims)
        prior_loss = K.sum((K.square(prior_wh - bwh) + K.square(0.5 - K.sigmoid(conv[:, :, :, :, :2]))) * noobj_mask)
    else:
        prior_loss = 0

    loss = alpha1 * bxy_loss + alpha2 * bwh_loss + alpha3 * to_obj_loss + \
                 alpha5 * class_loss + alpha4 * to_noobj_loss + alpha6 * prior_loss
    return loss, alpha1 * bxy_loss,  alpha2 * bwh_loss, alpha3 * to_obj_loss, alpha4 * to_noobj_loss, \
           alpha5 * class_loss, alpha6 * prior_loss, [avg_iou / (count + 1e-6), avg_cat / (class_count + 1e-6), avg_obj / (count + 1e-6),
    avg_anyobj / (output_shape[1] * output_shape[2] * output_shape[3] * output_shape[0]), recall / (count + 1e-6), count]


def iou(x1, y1, w1, h1, x2, y2, w2, h2):
    # boxes given as center (x, y) and size (w, h); convert to corner coordinates
    xmin1, ymin1 = x1 - 0.5 * w1, y1 - 0.5 * h1
    xmax1, ymax1 = x1 + 0.5 * w1, y1 + 0.5 * h1
    xmin2, ymin2 = x2 - 0.5 * w2, y2 - 0.5 * h2
    xmax2, ymax2 = x2 + 0.5 * w2, y2 + 0.5 * h2

    # clamp at 0 so non-overlapping boxes do not yield a spurious positive intersection
    intersect_x = K.maximum(K.minimum(xmax1, xmax2) - K.maximum(xmin1, xmin2), 0.)
    intersect_y = K.maximum(K.minimum(ymax1, ymax2) - K.maximum(ymin1, ymin2), 0.)
    intersection = intersect_x * intersect_y
    union = w1 * h1 + w2 * h2 - intersection
    return intersection / (union + 1e-6)  # avoid division by zero

YOLOv2 training results

Here are some of the training curves from TensorBoard, kept as a personal record for reference.
The loss was not driven all the way down, mainly because I lacked the time and compute for thorough hyperparameter tuning.
[figures: TensorBoard training curves]
Some detection results are shown below:
[figures: sample detection results]
Detection on some difficult examples is still not very good; more training is needed.
