YOLO-v3边框预测与回归

最新推荐文章于 2024-09-19 08:53:06 发布

小 K 同学

最新推荐文章于 2024-09-19 08:53:06 发布

阅读量2.9k

点赞数 3

分类专栏： CNN 机器学习

本文链接：https://blog.csdn.net/lxk2017/article/details/90606973

版权

机器学习同时被 2 个专栏收录

6 篇文章 0 订阅

订阅专栏

CNN

5 篇文章 1 订阅

订阅专栏

YOLO-v3边框预测与回归

模型说明
yolo层前面的conv计算output参数说明
yolo层Inference和回归计算
yolo层计算和训练

我们读yolov3论文时都知道边框预测的公式，然而难以准确理解为何作者要这么做，这里我依据Darknet代码来总结解释一下个人的见解，总结串联一下学习时容易遇到的疑惑，期待对大家有所帮助，理解错误的地方还请大家批评指正，我只是个小白哦，发出来也是为了与大家多多交流，看看理解的对不对。

模型说明

假设模型如下所示，下面的分析在此模型基础上：

layer     filters    size/stride        input                output
0 conv     16    3 x 3 / 1      512 x 512 x   3   ->   512 x 512 x  16  0.226 BFLOPs
1 max           2 x 2 / 2      512 x 512 x  16   ->   256 x 256 x  16
2 conv     32    3 x 3 / 1      256 x 256 x  16   ->   256 x 256 x  32  0.604 BFLOPs
3 max           2 x 2 / 2      256 x 256 x  32   ->   128 x 128 x  32
4 conv     64    3 x 3 / 1     128 x 128 x  32   ->   128 x 128 x  64  0.604 BFLOPs
5 max           2 x 2 / 2      128 x 128 x  64   ->   64 x  64 x  64
6 conv     64   3 x 3 / 1       64 x  64 x  64   ->    64 x  64 x  64  0.302 BFLOPs
7 max          2 x 2 / 2       64 x  64 x  64   ->    32 x  32 x  64
8 conv     64   3 x 3 / 1       32 x  32 x  64   ->    32 x  32 x  64  0.075 BFLOPs
9 max          2 x 2 / 1       32 x  32 x  64   ->    32 x  32 x  64
10 conv    64   1 x 1 / 1       32 x  32 x  64   ->    32 x  32 x  64  0.008 BFLOPs
11 conv    30   1 x 1 / 1       32 x  32 x  64   ->    32 x  32 x  30  0.004 BFLOPs
12 yolo

cfg文件中的yolo层参数设置：

[yolo]
mask = 1,2,3 #使用anchor的索引, 1,2,3表示使用下面定义的anchors中的2,3,4个anchor
anchors = 10,14,  23,27,  37,58,  81,82,  135,169,  344,319
classes=5 #类别数目
num=6
jitter=.3 #数据增强手段，此处jitter为随机调整宽高比的范围
ignore_thresh = .7  #计算IOU阈值大小.当预测的检测框与ground true的IOU大于ignore_thresh的时候，参与loss的计算，否则，检测框的不参与loss计算。
truth_thresh = 1
random=1

yolo层前面的conv计算output参数说明

yolov3模型中，yolo层前面一层的卷积核个数为3*(4+1+classes)，人脸识别模型最后一层conv卷积核个数为3*(4+1+5)=30，五个类：yin,li,lei,xia,chen，之所以这样设置，会在下面说明。由上面模型可以看出，11 conv层计算结果为32x32x30个参数，这些计算出来的参数所代表的含义如下如所示：
在这里插入图片描述
前四个[1,2,3,4]为边界框预测，经过计算得到的b.x,b.y,b.w,b.h与数据集中的label标签对应，标签数据如下图，第5个为目标(object)预测，最后6,7,8,9,10是分类预测，每一组代表一类。

yolo层Inference和回归计算

流程图如下：
在这里插入图片描述

yolo层计算和训练

Logistics计算：

该部分只针对[1、2、5、6、7、8、9、10]做计算，[3、4]不做计算，后面会说明[3、4]不做计算：

在这里插入图片描述

对应代码：

    for (b = 0; b < l.batch; ++b){
        for(n = 0; n < l.n; ++n){
            int index = entry_index(l, b, n*l.w*l.h, 0);
            activate_array(l.output + index, 2*l.w*l.h, LOGISTIC);
            index = entry_index(l, b, n*l.w*l.h, 4);
            activate_array(l.output + index, (1+l.classes)*l.w*l.h, LOGISTIC);
        }
    }

检测框坐标计算：

检测框坐标(b_x,b_y,b_w,b_h)，分别指(中心点x,中心点y,宽w,高h)，坐标计算公式如下：
在这里插入图片描述
其中tx,ty,tw,th是一系列conv计算得到的值，cx,cy为feature map中的坐标，feature map大小为32x32，则(cx,cy)为(0,0)到(31,31)1024个点，p_w,p_h为cfg中anchors尺寸，上图cfg中的anchors 1,anchors 2,anchors 3分别为23,27, 37,58, 81,82 ，到此，已经计算出我们预测的框的坐标了。
在这里插入图片描述
代码：

box get_yolo_box(float *x, float *biases, int n, int index, int i, int j, int lw, int lh, int w, int h, int stride)
{
    box b;
    b.x = (i + x[index + 0*stride]) / lw;
    b.y = (j + x[index + 1*stride]) / lh;
    b.w = exp(x[index + 2*stride]) * biases[2*n]   / w;
    b.h = exp(x[index + 3*stride]) * biases[2*n+1] / h;
    return b;
}

IOU计算：

计算出每个feature map单元所对应的预测框位置和大小，就可以与ground truth 对比，计算IOU：
下面图示IOU=area/(area1+area2-area)
在这里插入图片描述

计算梯度Δ值：

PS：这里有一个问题，不管FasterRCNN还是YOLO，都不是直接回归bounding box的长宽（就像这样b_w=p_w t_w^{’），而是要做一个对数变换，实际预测的是log(⋅)。这是因为如果不做变换，直接预测相对形变t_w}’, 那么要求t_w^’>0，因为框的长宽不可能是负数。这样，是在做一个有不等式条件约束的优化问题，没法直接用SGD来做。所以先取一个对数变换，将其不等式约束去掉，就可以了。
梯度Δ值的计算与预测框坐标的计算相反，对于前四组x、y、w、h来说：
在这里插入图片描述
其中G_x,G_y, G_w, G_h是label标签里面的值，就是ground truth的值，p_w,p_h分别为23,27, 37,58, 81,82。

scale为(2-truth.w*truth.h)。

float delta_yolo_box(box truth, float *x, float *biases, int n, int index, int i, int j, int lw, int lh, int w, int h, float *delta, float scale, int stride)
{
    box pred = get_yolo_box(x, biases, n, index, i, j, lw, lh, w, h, stride);
    float iou = box_iou(pred, truth);
    float tx = (truth.x*lw - i);
    float ty = (truth.y*lh - j);
    float tw = log(truth.w*w / biases[2*n]);
    float th = log(truth.h*h / biases[2*n + 1]);
    delta[index + 0*stride] = scale * (tx - x[index + 0*stride]);
    delta[index + 1*stride] = scale * (ty - x[index + 1*stride]);
    delta[index + 2*stride] = scale * (tw - x[index + 2*stride]);
    delta[index + 3*stride] = scale * (th - x[index + 3*stride]);
    return iou;
}

对于第五组object回归来说：best_iou>l.ignore_thresh，Δobj=0，否则Δobj=0-l.output。另外，YOLO会对每个bounding box给出是否是object的置信度预测，用来区分objects和背景。这个值使用logistic回归。当某个bounding box与ground truth的IoU大于其他所有bounding box时，target给1；如果某个bounding box不是IoU最大的那个，但是IoU也大于了某个阈值（我们取0.7），那么我们忽略它（既不惩罚，也不奖励），这个做法是从Faster RCNN借鉴的。我们对每个ground truth只分配一个最好的bounding box与其对应（这与Faster RCNN不同）。如果某个bounding box没有被分配到任何一个ground truth对应，那么它对边框位置大小的回归和class的预测没有贡献，我们只惩罚它的objectness，即试图减小其confidence。
在这里插入图片描述
对于后面5个class组来说，如果预测框是该类，则Δc=1-l.output,否则Δc=0-l.output.
代码：

void delta_yolo_class(float *output, float *delta, int index, int class, int classes, int stride, float *avg_cat)
{
    int n;
    if (delta[index]){
        delta[index + stride*class] = 1 - output[index + stride*class];
        if(avg_cat) *avg_cat += output[index + stride*class];
        return;
    }
    for(n = 0; n < classes; ++n){
        delta[index + stride*n] = ((n == class)?1 : 0) - output[index + stride*n];
        if(n == class && avg_cat) *avg_cat += output[index + stride*n];
    }
}