bounding box回归的原理学习——yoloV1

最新推荐文章于 2024-09-25 00:32:42 发布

brightming

最新推荐文章于 2024-09-25 00:32:42 发布

阅读量1.6w

点赞数 7

分类专栏：机器学习文章标签： bbox-回归

本文链接：https://blog.csdn.net/brightming/article/details/78072045

版权

机器学习专栏收录该内容

17 篇文章 1 订阅

订阅专栏

参考：
https://zhuanlan.zhihu.com/p/25236464
http://blog.csdn.net/williamyi96/article/details/77530948
http://blog.csdn.net/zijin0802034/article/details/77685438
https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/object_localization_and_detection.html

一、逻辑理解

从逻辑上说明对bbox回归的原理的理解。
之前觉得bbox的回归是一个很难理解的地方：这些回归的坐标数字，依据在哪里？
其实回归的输入并不是这些预测的坐标数字，而是预测的坐标对应的feature map中的内容，这个内容与相对于ground truth进行对比，计算，是回归的根本依据。
通过不断的训练，得到了回归的参数，在预测时，网络产生了图像的feature map，对于任意一个预测框，背后对应了一个feature 区域，将学习到的参数与此feature区域进行运算，就会得到调整预测框的参数了。因为同类物体的特征应当相似。

二、yoloＶ1论文学习

(1) resizes the input image to 448 × 448,
(2) runs a single convolutional network on the image, and
(3) thresholds the resulting detections by the model’s confidence.

A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes.

对于输入的图片，resize到网络的输入所需要的大小，经过多重卷积处理，得到了7*7大小的feature map。里面的7x7的每个格子就是grid cell。
每个grid cell会负责预测B个bounding box（问题：怎么预测出来的？），以及对于每个类的条件概率。每个bounding box的预测内容包括有：box的信息：x,y,w,h，以及本bounding box包含有某类物体的置信度。含义如下：
—-x,y是相对于grid cell的左上角来说的box的中心位置，w,h是box的宽高，相对于整张图片的款高。
—-confidence的含义：如果这个box包含了任何一种物体，则定义为该box与ground truth中的任何一个box的iou的大小，如果该box不包含任何一个物体，则confidence为0。
—-每个grid cell还会预测C个不同类别的条件概率。即如果这个grid cell是包含有物体的，那么对于可能包含的不同物体的概率情况。
假设每个grid cell会预测2个box，需要检测的物体有20类，那么最后生成的向量size是7x7x(5*2+20)=1470。

loss function如下：
这里写图片描述

训练过程中的回归是如何进行的呢？
基本的逻辑就是对坐标+iou误差、分类误差进行修正。
在训练的时候，每个样本的ground truth的box是知道的，对应于相当于下采样后的的feature map，也能找到在feature map中对应的格子的位置。如果一个grid cell是该物体的中心位置，则这个grid cell负责预测这个物体;每个grid cell都预测了若干个box，其中会有一个box具有与ground truth的某个box有最大的iou，则由这个box负责调整box的位置（另外的预测的box就忽略了）。
可以想象，一开始的时候，权重是随机的，面对抽象出来的图像的feature信息，其预测的box的信息、confidence的信息、物体类别的条件概率都是随机的，经过了loss函数进行惩罚与优化，使得预测行为越来越好。取极端情况为例子加深理解：总是用同一张图片做训练，每次得到的图像的feature map都是一样的。一开始的时候，权重随机，权重与某个grid cell的运算的结果是随机，即每个grid cell对应的5*b+C的数据是随机的。网络对此情况进行优化，指导权重向某个方向优化。第二次看到还是一样的feature map，新的权重与某个grid cell对应的数据进行计算，就会得到更好的box位置和confidence，物体类别的信息。如此反复进行，直到权重参数调整到这样的情况，面对同样的grid cell，其计算出来的box的信息完全与ground box的位置符合，confidence=1，正确类别的条件概率为1。

问题：
可以看到修正box的时候，是针对训练的时候，权重与训练图片中的feature map的相应部分进行运算来进行调整的，即：如果在实际应用中，有一个物体，其与训练的图片中的物体很像，那么可以进行好的预测，如果是以一个没有见过的角度去看，那么由于训练时没有对这个角度的feature进行计算与优化，就不会预测好，跟遇到一个类别之外的物体没差别。所以，增加训练的样本的多角度很有必要。
同时一个grid cell虽然预测若干个box，但只会由那个与ground truth的任何box有最大交集的预测box来负责预测这个grid cell的情况，那么对于比较密集挨在一起物体的情况就会不能预测的现象出现：两只鸟靠的很近，它们的两个ground truth的box，都与某个grid cell有最大的iou，与其他的grid cell都没有相交，但因为这里有两只鸟，所以，这个gridcell预测的若干个box中，总会与两只鸟分别对应的两个ground truth的box的其中一个有更大的iou值，就会选到这个预测box，负责预测其中的一只鸟，对于另一只鸟，就会忽略。而由于鸟比较小，与其他的grid cell也没有相交，所以这另外的鸟就给忽略掉了，检测就会检测不出来。

yolov1源码学习

分析detection_layer.c


void forward_detection_layer(const detection_layer l, network net)
{
    int locations = l.side*l.side;//7x7,side 是detection 层的参数
    int i,j;
    memcpy(l.output, net.input, l.outputs*l.batch*sizeof(float));
    //if(l.reorg) reorg(l.output, l.w*l.h, size*l.n, l.batch, 1);
    int b;
    if (l.softmax){
        for(b = 0; b < l.batch; ++b){
            int index = b*l.inputs;
            for (i = 0; i < locations; ++i) {
                int offset = i*l.classes;
                softmax(l.output + index + offset, l.classes, 1, 1,
                        l.output + index + offset);
            }
        }
    }
    //l.output的组织方式是：
    //分成3块：第一块统一放置classes的信息，第二块统一放置每个box的confidence信息，
    //第三块统一放置box的坐标信息
    if(net.train){
        float avg_iou = 0;
        float avg_cat = 0;
        float avg_allcat = 0;
        float avg_obj = 0;
        float avg_anyobj = 0;
        int count = 0;
        *(l.cost) = 0;
        int size = l.inputs * l.batch;
        memset(l.delta, 0, size * sizeof(float));
        //一次训练同时处理batch张图片
        for (b = 0; b < l.batch; ++b){
            //某一张图片的开始,上面的make_detection_layer方法的assert方法
            //验证了inputs的情况
            int index = b*l.inputs;
            //每张图片到最后的feature map大小是locations=S*S,如（7×7个）
            for (i = 0; i < locations; ++i) {
                //每个grid cell对应的truth的信息：1个confidence，4个坐标信息，相应的类别信息
                int truth_index = (b*locations + i)*(1+l.coords+l.classes);
                int is_obj = net.truth[truth_index];
                for (j = 0; j < l.n; ++j) {
                    //预测的每个box中的confidence信息。参考l.output的组织方式
                    int p_index = index + locations*l.classes + i*l.n + j;
                    //先默认为该grid cell是no obj的
                    l.delta[p_index] = l.noobject_scale*(0 - l.output[p_index]);
                    *(l.cost) += l.noobject_scale*pow(l.output[p_index], 2);
                    avg_anyobj += l.output[p_index];
                }

                int best_index = -1;
                float best_iou = 0;
                float best_rmse = 20;

                if (!is_obj){
                    //该grid cell没有物体，直接返回
                    //其cost就是那个noobject相应的计算结果
                    continue;
                }

                //第i个grid cell的classes预测信息的开始位置,classes的信息统一排在前面
                int class_index = index + i*l.classes;
                //论文中lossfunction的最后一项：
                //计算该grid cell预测出来的每个class的概率信息与真实信息（对应的类别是1,其余为0.可以看ground truth信息组织的代码）的差值
                for(j = 0; j < l.classes; ++j) {
                    l.delta[class_index+j] = l.class_scale * (net.truth[truth_index+1+j] - l.output[class_index+j]);
                    *(l.cost) += l.class_scale * pow(net.truth[truth_index+1+j] - l.output[class_index+j], 2);
                    if(net.truth[truth_index + 1 + j]) avg_cat += l.output[class_index+j];
                    avg_allcat += l.output[class_index+j];
                }

                //box的信息排在confidence，classes信息后面
                box truth = float_to_box(net.truth + truth_index + 1 + l.classes, 1);
                truth.x /= l.side;
                truth.y /= l.side;

                //判断哪个pred box与ground truth中的任何一个box
                //的iou最大，则由该预测box负责对本grid cell
                //进行预测
                for(j = 0; j < l.n; ++j){
                    //参考l.output的组织方式
                    //跳过前面的classes和每个box的confidence的数据，到达box的坐标数据位置
                    int box_index = index + locations*(l.classes + l.n) + (i*l.n + j) * l.coords;
                    box out = float_to_box(l.output + box_index, 1);
                    //归一化
                    out.x /= l.side;
                    out.y /= l.side;

                    if (l.sqrt){
                        //表示里面的预测值w,h是经过sqrt操作的，
                        out.w = out.w*out.w;
                        out.h = out.h*out.h;
                    }

                    float iou  = box_iou(out, truth);
                    //iou = 0;
                    float rmse = box_rmse(out, truth);
                    if(best_iou > 0 || iou > 0){
                        if(iou > best_iou){
                            best_iou = iou;
                            best_index = j;
                        }
                    }else{
                        if(rmse < best_rmse){
                            best_rmse = rmse;
                            best_index = j;
                        }
                    }
                }

                if(l.forced){
                    if(truth.w*truth.h < .1){
                        best_index = 1;
                    }else{
                        best_index = 0;
                    }
                }
                if(l.random && *(net.seen) < 64000){
                    best_index = rand()%l.n;
                }

                //得到最佳box
                int box_index = index + locations*(l.classes + l.n) + (i*l.n + best_index) * l.coords;
                int tbox_index = truth_index + 1 + l.classes;

                box out = float_to_box(l.output + box_index, 1);
                out.x /= l.side;
                out.y /= l.side;
                if (l.sqrt) {
                    out.w = out.w*out.w;
                    out.h = out.h*out.h;
                }
                float iou  = box_iou(out, truth);

                //printf("%d,", best_index);
                int p_index = index + locations*l.classes + i*l.n + best_index;
                //上面加了默认的无物体的情况
                *(l.cost) -= l.noobject_scale * pow(l.output[p_index], 2);
                *(l.cost) += l.object_scale * pow(1-l.output[p_index], 2);
                //这个avg_obj没看到减去之前默认加的值
                avg_obj += l.output[p_index];
                //预测的每个box的confidence的值代表的是
                //有多肯定这个box包含了一个物体,按论文的定义是
                //期望的等于预测的box与ground truth box的iou.
                //那么走l.rescore的分支是比较合理的。
                l.delta[p_index] = l.object_scale * (1.-l.output[p_index]);

                if(l.rescore){
                    l.delta[p_index] = l.object_scale * (iou - l.output[p_index]);
                }

                l.delta[box_index+0] = l.coord_scale*(net.truth[tbox_index + 0] - l.output[box_index + 0]);
                l.delta[box_index+1] = l.coord_scale*(net.truth[tbox_index + 1] - l.output[box_index + 1]);
                l.delta[box_index+2] = l.coord_scale*(net.truth[tbox_index + 2] - l.output[box_index + 2]);
                l.delta[box_index+3] = l.coord_scale*(net.truth[tbox_index + 3] - l.output[box_index + 3]);
                if(l.sqrt){
                    l.delta[box_index+2] = l.coord_scale*(sqrt(net.truth[tbox_index + 2]) - l.output[box_index + 2]);
                    l.delta[box_index+3] = l.coord_scale*(sqrt(net.truth[tbox_index + 3]) - l.output[box_index + 3]);
                }

                *(l.cost) += pow(1-iou, 2);
                avg_iou += iou;
                ++count;
            }
        }

        if(0){
            float *costs = calloc(l.batch*locations*l.n, sizeof(float));
            for (b = 0; b < l.batch; ++b) {
                int index = b*l.inputs;
                for (i = 0; i < locations; ++i) {
                    for (j = 0; j < l.n; ++j) {
                        int p_index = index + locations*l.classes + i*l.n + j;
                        costs[b*locations*l.n + i*l.n + j] = l.delta[p_index]*l.delta[p_index];
                    }
                }
            }
            int indexes[100];
            top_k(costs, l.batch*locations*l.n, 100, indexes);
            float cutoff = costs[indexes[99]];
            for (b = 0; b < l.batch; ++b) {
                int index = b*l.inputs;
                for (i = 0; i < locations; ++i) {
                    for (j = 0; j < l.n; ++j) {
                        int p_index = index + locations*l.classes + i*l.n + j;
                        if (l.delta[p_index]*l.delta[p_index] < cutoff) l.delta[p_index] = 0;
                    }
                }
            }
            free(costs);
        }


        *(l.cost) = pow(mag_array(l.delta, l.outputs * l.batch), 2);


        printf("Detection Avg IOU: %f, Pos Cat: %f, All Cat: %f, Pos Obj: %f, Any Obj: %f, count: %d\n", avg_iou/count, avg_cat/count, avg_allcat/(count*l.classes), avg_obj/count, avg_anyobj/(l.batch*locations*l.n), count);
        //if(l.reorg) reorg(l.delta, l.w*l.h, size*l.n, l.batch, 0);
    }
}

这里需要对l.output的数据组织方式进行了解，否则阅读其取各种值的代码会无法理解。l.output是包含一批图片的输出结果，每一张的输出紧跟着前一张的输出的后面，对于一张图片的输出数据其组织方式如下：
1、第一部分是每个grid cell预测的指定class的条件概率：
一个grid cell如果预测20个class，则排列方式就是第一个grid cell预测的class_0,class_1..class_19的条件概率值，紧接着是第二个grid cell的预测值，直到所有的grid cell的对于各类别的条件概率值都排列完毕；
2、第二部分是每个grid cell预测的每个box的confidence的值：
同样，如果一个grid cell预测2个box，则就是confidence_box_0,confidence_box_1….
3、第三部分就是预测的各个box的坐标信息了：
按照x,y,w,h的方式排列。

整体的检测流程步骤很清晰：
1、循环处理一批中的每一张图片，对于每一张图片，进行以下处理：
2、先默认当做该grid cell 不包含物体，计算loss情况：lossfunction中的第4项；
3、取出对应的grid cell的truth box的数据
4、计算条件概率的loss情况，该数据只与grid cell有关，与预测的box无关，就是lossfunction中的最后一项
5、找出该grid cell预测的最好的那个box，叫做best_box
6、利用best_box计算box坐标的loss情况：loss function中的前两项，以及confidence的loss情况：loss function中的第3项；

如果某个grid cell不包含任何物体，其loss部分只有一个noobj的；否则会包含其他部分。

对着源码与论文，基本理解了其计算思路，但是对于里面如何进行优化的过程，还不是很清楚，主要是对于反向传播的代码级别的理解不够。