YOLOv5 Post-processing in CUDA

库达ZT

Last modified on 2023-07-03 19:15:42

Category: CUDA Study Notes (Cuda学习笔记). Tags: YOLO, c++

First published on 2023-07-02 15:01:15

Article link: https://blog.csdn.net/zhuangtu1999/article/details/131499750

Preface

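To study post-processing code, you can convert the PyTorch output to a numpy array, call tobytes, write the bytes to a file, and then read that file back in C++. This makes it quick to investigate and debug problems, because post-processing can be studied without running TensorRT inference at all. This is the controlled-variable method: the model output is held fixed and only the post-processing changes.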
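The C++ side of that workflow can be sketched as follows. This is a minimal sketch, not the article's actual helper: it assumes the Python side dumped float32 values with tobytes, and `to_floats` is a hypothetical addition that copies the bytes into a float vector instead of casting the pointer.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <fstream>
#include <string>
#include <vector>

// Read a whole file into a byte buffer (binary mode).
// Returns an empty vector if the file cannot be opened or read.
static std::vector<std::uint8_t> load_file(const std::string& path) {
    std::ifstream in(path, std::ios::in | std::ios::binary);
    if (!in.is_open()) return {};

    in.seekg(0, std::ios::end);
    const std::streamoff length = in.tellg();
    if (length <= 0) return {};

    std::vector<std::uint8_t> data(static_cast<std::size_t>(length));
    in.seekg(0, std::ios::beg);
    in.read(reinterpret_cast<char*>(data.data()), length);
    if (!in) return {};
    return data;
}

// Hypothetical helper: reinterpret the raw bytes as float32 values by copying.
// Assumes the file was written as float32 on a machine with the same byte order.
static std::vector<float> to_floats(const std::vector<std::uint8_t>& bytes) {
    assert(bytes.size() % sizeof(float) == 0);
    std::vector<float> values(bytes.size() / sizeof(float));
    if (!values.empty())
        std::memcpy(values.data(), bytes.data(), bytes.size());
    return values;
}
```

With an n x 85 dump, the number of rows is then `values.size() / 85`.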

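Note: detect.py in yolov5 runs inference on a single image, and the output has shape n x (num_classes + 5).
With 80 classes the yolov5 output tensor is n x 85, where n is the number of bounding boxes.
The 85 values are cx, cy, width, height, objness, followed by 80 class scores.
objectness is the probability that the current bounding box contains a target at all.
class_confidence is a conditional probability: given that the box contains an object, the probability that the object belongs to this class. Every bounding box carries a class_confidence for every class.
The confidence of a bounding box = objectness x class_confidence (the conditional probability).
The class_confidence used in that product is the largest one across all classes.
In short, CPU decoding and GPU decoding both take two steps: confidence filtering, then NMS, which removes the redundant boxes in an image. Before NMS, the box information (cx, cy, width, height) has to be restored to corner coordinates.
The GPU decode output is laid out as [count, box1, box2, box3, ...]. GPU decoding is multi-threaded, so count is needed to record how many bounding boxes have been written so far. Single-threaded CPU code does not need it, but the GPU must make sure no detection box is output twice or dropped.
Deep learning deployment usually stores data as single-precision floats (float). A float takes 4 bytes versus 8 for a double, which saves storage and computation time and makes better use of GPU compute resources. In some special cases double may still be needed to represent data more precisely. This is why float literals in the code carry an f suffix (for example 0.5f): without it, the literal is a double.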
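As a worked example of the notes above, the sketch below decodes one row: it picks the largest class score, multiplies it by objectness, and converts cx, cy, width, height to corner coordinates. `Detection` and `decode_row` are hypothetical names used only for illustration, and the example uses 3 classes instead of 80 to keep the numbers readable.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical result type for a single decoded row.
struct Detection {
    float left, top, right, bottom;
    float confidence;
    int label;
};

// row layout: cx, cy, width, height, objectness, then (cols - 5) class scores
static Detection decode_row(const float* row, int cols) {
    assert(cols > 5);
    const float* pclass = row + 5;
    const float* best = std::max_element(pclass, pclass + (cols - 5));

    Detection d;
    d.label      = static_cast<int>(best - pclass);
    d.confidence = row[4] * (*best);        // objectness x max class_confidence
    d.left       = row[0] - row[2] * 0.5f;  // cx - width / 2
    d.top        = row[1] - row[3] * 0.5f;  // cy - height / 2
    d.right      = row[0] + row[2] * 0.5f;
    d.bottom     = row[1] + row[3] * 0.5f;
    return d;
}
```

For the row {100, 50, 20, 10, 0.5, 0.25, 0.5, 0.125}, the best class is index 1 with score 0.5, so confidence = 0.5 x 0.5 = 0.25 and the box is (90, 45) to (110, 55).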

NMS flowchart

The main function

int main()
{
   // The yolov5 output tensor is n x 85,
   // where 85 = cx, cy, width, height, objness, classification * 80

    // Load the binary file
    auto data = load_file("predict.data");
    auto image = cv::imread("input-image.jpg");

    // The file holds raw bytes, so the data is accessed through a pointer
    // char * -> float *
    float *ptr = (float *)data.data();
    int nelem = data.size() / sizeof(float); // number of floats in data
    int ncols = 85;                          // cx, cy, width, height, objness, classification * 80
    int nrows = nelem / ncols;

    // Use gpu_decode to get the boxes
    // boxes is a vector
    auto boxes = gpu_decode(ptr, nrows, ncols);

    // Draw the boxes on the image
    // Roughly equivalent to: for (auto it = boxes.begin(); it != boxes.end(); ++it)
    for (auto &box : boxes)
    {

        // image, top-left corner, bottom-right corner, line color, line thickness
        cv::rectangle(image, cv::Point(box.left, box.top), cv::Point(box.right, box.bottom),
                      cv::Scalar(0, 255, 0), 2);
        cv::putText(image, cv::format("%.2f", box.confidence), cv::Point(box.left, box.top - 7),
                    0, 0.8, cv::Scalar(0, 0, 255), 2, 16);
    }

    cv::imwrite("image-draw.jpg", image);
    return 0;
}

CPU decode

It is common to write a CPU version before the GPU one. Let's look at the CPU version first:

vector<Box> cpu_decode(float* predict, int rows, int cols, float confidence_threshold = 0.25f, float nms_threshold = 0.45f){
    
    vector<Box> boxes;
    int num_classes = cols - 5;
    for(int i = 0; i < rows; ++i){
        float* pitem = predict + i * cols;
        float objness = pitem[4]; 
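        /*
        pitem[4] reads the 5th column of the prediction matrix, i.e. the objectness value.
        By default each row of the prediction matrix holds the detection information in this order:
        the box coordinates (x, y, width, height), the objectness, and then the per-class probabilities.
        If the objectness is below confidence_threshold, the row is skipped and never becomes a box.
        */
        if(objness < confidence_threshold)
            continue;
        /*
        Key point of CPU decoding:
        avoid unnecessary computation. Some math operations cost far more than a handful of ifs,
        so cutting down how often they run is the key to performance.
        That is why a row whose objectness is below the threshold is skipped right here.
        */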

        float* pclass = pitem + 5;
        int label     = std::max_element(pclass, pclass + num_classes) - pclass;
        float prob    = pclass[label];
        float confidence = prob * objness;
        if(confidence < confidence_threshold)
            continue;

        float cx     = pitem[0];
        float cy     = pitem[1];
        float width  = pitem[2];
        float height = pitem[3];
        float left   = cx - width * 0.5f;
        float top    = cy - height * 0.5f;
        float right  = cx + width * 0.5f;
        float bottom = cy + height * 0.5f;
        boxes.emplace_back(left, top, right, bottom, confidence, (float)label);
    }
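    // NMS: sort by confidence, highest first
    std::sort(boxes.begin(), boxes.end(), [](const Box& a, const Box& b){return a.confidence > b.confidence;});
    std::vector<bool> remove_flags(boxes.size()); // true = suppressed, false = keep
    std::vector<Box> box_result; // separate result vector, so nothing has to be erased and shifted inside boxes
    box_result.reserve(boxes.size()); // reserve allocates capacity without default-constructing elements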

    auto iou = [](const Box& a, const Box& b){
        float cross_left   = std::max(a.left, b.left);
        float cross_top    = std::max(a.top, b.top);
        float cross_right  = std::min(a.right, b.right);
        float cross_bottom = std::min(a.bottom, b.bottom);

        float cross_area = std::max(0.0f, cross_right - cross_left) * std::max(0.0f, cross_bottom - cross_top);
        float union_area = std::max(0.0f, a.right - a.left) * std::max(0.0f, a.bottom - a.top) 
                         + std::max(0.0f, b.right - b.left) * std::max(0.0f, b.bottom - b.top) - cross_area;
        if(cross_area == 0 || union_area == 0) return 0.0f;
        return cross_area / union_area;
    };
    // NMS: decide which boxes to keep
    for(int i = 0; i < boxes.size(); ++i){
        if(remove_flags[i]) continue;

        auto& ibox = boxes[i];
        box_result.emplace_back(ibox);
        for(int j = i + 1; j < boxes.size(); ++j){
            if(remove_flags[j]) continue;

            auto& jbox = boxes[j];
            if(ibox.label == jbox.label){
                // class matched
                if(iou(ibox, jbox) >= nms_threshold)
                    remove_flags[j] = true;
            }
        }
    }
    return box_result;
}

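Make good use of small tricks such as allocating memory up front, as in the lines below. Calling reserve avoids repeated reallocation while the vector grows, which can noticeably improve efficiency.

    std::vector<bool> remove_flags(boxes.size());
    std::vector<Box> box_result;
    box_result.reserve(boxes.size());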
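The effect of reserve can be checked with a small sketch. `Item` and `keep_above` are hypothetical names for illustration; the point is that reserve sets the capacity while the size stays 0, so the later emplace_back calls never reallocate.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical element type standing in for Box.
struct Item {
    float value;
};

// Collect the items whose value reaches the threshold, allocating only once.
static std::vector<Item> keep_above(const std::vector<Item>& items, float threshold) {
    std::vector<Item> kept;
    kept.reserve(items.size());          // capacity only; kept.size() is still 0
    assert(kept.empty());
    const Item* storage = kept.data();   // remember where the buffer lives

    for (const Item& item : items) {
        if (item.value >= threshold)
            kept.emplace_back(item);
    }
    assert(kept.data() == storage);      // no reallocation happened while filling
    return kept;
}
```

Constructing the vector with a size, as `remove_flags(boxes.size())` does, is different: that creates boxes.size() value-initialized elements, which is what the flag array needs.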

GPU decode kernel

The GPU decode is where it gets interesting.


static __global__ void decode_kernel(
    float* predict, int num_bboxes, int num_classes, float confidence_threshold, 
    float* invert_affine_matrix, float* parray, int max_objects, int NUM_BOX_ELEMENT
){  
    int position = blockDim.x * blockIdx.x + threadIdx.x;
    if (position >= num_bboxes) return;

    float* pitem     = predict + (5 + num_classes) * position;
    float objectness = pitem[4];
    if(objectness < confidence_threshold)
        return;

    float* class_confidence = pitem + 5;
    float confidence        = *class_confidence++; // start with the first class score as the maximum, then advance the pointer by one
    int label               = 0;
    for(int i = 1; i < num_classes; ++i, ++class_confidence){
        if(*class_confidence > confidence){
            confidence = *class_confidence;
            label      = i;
        }
    }

    confidence *= objectness;
    if(confidence < confidence_threshold)
        return;
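    // At this point the box has passed both thresholds and will be output.
    // Buffer layout: count, box1, box2, box3, ... parray points at count,
    // so atomicAdd increments count and returns its old value, which is this thread's slot index.
    int index = atomicAdd(parray, 1);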
    if(index >= max_objects)
        return;

    float cx         = *pitem++;
    float cy         = *pitem++;
    float width      = *pitem++;
    float height     = *pitem++;
    float left   = cx - width * 0.5f;
    float top    = cy - height * 0.5f;
    float right  = cx + width * 0.5f;
    float bottom = cy + height * 0.5f;
    // affine_project(invert_affine_matrix, left,  top,    &left,  &top);
    // affine_project(invert_affine_matrix, right, bottom, &right, &bottom);

    // left, top, right, bottom, confidence, class, keepflag
    float* pout_item = parray + 1 + index * NUM_BOX_ELEMENT;
    *pout_item++ = left;
    *pout_item++ = top;
    *pout_item++ = right;
    *pout_item++ = bottom;
    *pout_item++ = confidence;
    *pout_item++ = label;
    *pout_item++ = 1; // 1 = keep, 0 = ignore
}

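The original author passes the whole output buffer, float* output_device (laid out as count, box1, box2, box3, ...), into the kernel as parray. The first element of parray is count, the number of boxes, so atomicAdd is used to increment the box count safely across threads, and its return value is the slot this thread writes to.

The decoded box is then written through pout_item into that slot, and the buffer is passed on to NMS.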
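The buffer layout and the slot-claiming step can be sketched on the host. This is only an analogy under stated assumptions: `std::atomic<int>::fetch_add` stands in for CUDA's atomicAdd (both return the value before the add), and `output_floats`, `claim_slot` and `box_slot` are hypothetical helper names.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// left, top, right, bottom, confidence, class, keepflag
constexpr int kNumBoxElement = 7;

// Number of floats in the output buffer: one count slot plus max_objects boxes.
static std::size_t output_floats(int max_objects) {
    return 1 + static_cast<std::size_t>(max_objects) * kNumBoxElement;
}

// Host-side stand-in for `int index = atomicAdd(parray, 1)`.
// Returns the claimed slot, or -1 when the buffer is already full.
// The counter is incremented even when the buffer is full.
static int claim_slot(std::atomic<int>& count, int max_objects) {
    const int index = count.fetch_add(1);  // value before the increment
    return index < max_objects ? index : -1;
}

// Pointer to the first element (left) of box `index`, skipping the count slot.
static float* box_slot(float* parray, int index) {
    assert(index >= 0);
    return parray + 1 + index * kNumBoxElement;
}
```

Because the counter keeps growing after the buffer is full, the stored count can exceed max_objects, which is why the NMS kernel and the host code clamp it with min(count, max_objects).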

GPU NMS


#include <cuda_runtime.h>

static __device__ void affine_project(float* matrix, float x, float y, float* ox, float* oy){
    *ox = matrix[0] * x + matrix[1] * y + matrix[2];
    *oy = matrix[3] * x + matrix[4] * y + matrix[5];
}

static __device__ float box_iou(
    float aleft, float atop, float aright, float abottom, 
    float bleft, float btop, float bright, float bbottom
){

    float cleft 	= max(aleft, bleft);
    float ctop 		= max(atop, btop);
    float cright 	= min(aright, bright);
    float cbottom 	= min(abottom, bbottom);
    
    float c_area = max(cright - cleft, 0.0f) * max(cbottom - ctop, 0.0f);
    if(c_area == 0.0f)
        return 0.0f;
    
    float a_area = max(0.0f, aright - aleft) * max(0.0f, abottom - atop);
    float b_area = max(0.0f, bright - bleft) * max(0.0f, bbottom - btop);
    return c_area / (a_area + b_area - c_area);
}

static __global__ void fast_nms_kernel(float* bboxes, int max_objects, float threshold, int NUM_BOX_ELEMENT){

    int position = (blockDim.x * blockIdx.x + threadIdx.x);
    int count = min((int)*bboxes, max_objects);
    if (position >= count) 
        return;
    
    // left, top, right, bottom, confidence, class, keepflag
    // Buffer layout: count, box1, box2, box3, ... The +1 skips count so that pcurrent points at left.
    float* pcurrent = bboxes + 1 + position * NUM_BOX_ELEMENT;
    for(int i = 0; i < count; ++i){
        float* pitem = bboxes + 1 + i * NUM_BOX_ELEMENT;
        if(i == position || pcurrent[5] != pitem[5]) continue;

        if(pitem[4] >= pcurrent[4]){
            if(pitem[4] == pcurrent[4] && i < position)
                continue;

            float iou = box_iou(
                pcurrent[0], pcurrent[1], pcurrent[2], pcurrent[3],
                pitem[0],    pitem[1],    pitem[2],    pitem[3]
            );

            if(iou > threshold){
                pcurrent[6] = 0;  // 1=keep, 0=ignore
                return;
            }
        }
    }
}

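The NMS here follows the same idea as the CPU version, except that each thread handles one box. One difference is worth noting: a box is suppressed by any higher-confidence box of the same class that overlaps it beyond the threshold, even if that other box is itself suppressed, so the result can differ slightly from the sequential CPU NMS.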

float* pcurrent = bboxes + 1 + position * NUM_BOX_ELEMENT; gives a pointer to the first coordinate, left, of each thread's box. (parray is laid out as count, box1, box2, box3, ..., and each box as left, top, right, bottom, confidence, class, keepflag.)
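To experiment with this logic without a GPU, the kernel can be emulated on the host. The sketch below is an assumption-laden port, not the author's code: the loop over `position` plays the role of the threads, and `fast_nms_host` is a hypothetical name. Each "thread" writes only its own keepflag and reads only fields 0 to 5 of the other boxes, so running the positions one after another gives the same flags as running them in parallel.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

constexpr int kBoxElems = 7;  // left, top, right, bottom, confidence, class, keepflag

// IoU of two boxes given as pointers to left, top, right, bottom.
static float box_iou(const float* a, const float* b) {
    const float cleft   = std::max(a[0], b[0]);
    const float ctop    = std::max(a[1], b[1]);
    const float cright  = std::min(a[2], b[2]);
    const float cbottom = std::min(a[3], b[3]);
    const float c_area  = std::max(cright - cleft, 0.0f) * std::max(cbottom - ctop, 0.0f);
    if (c_area == 0.0f) return 0.0f;
    const float a_area = std::max(0.0f, a[2] - a[0]) * std::max(0.0f, a[3] - a[1]);
    const float b_area = std::max(0.0f, b[2] - b[0]) * std::max(0.0f, b[3] - b[1]);
    return c_area / (a_area + b_area - c_area);
}

// bboxes layout: [count, box0 (7 floats), box1 (7 floats), ...]
static void fast_nms_host(std::vector<float>& bboxes, int max_objects, float threshold) {
    assert(!bboxes.empty());
    const int count = std::min(static_cast<int>(bboxes[0]), max_objects);
    for (int position = 0; position < count; ++position) {  // one GPU thread each
        float* pcurrent = bboxes.data() + 1 + position * kBoxElems;
        for (int i = 0; i < count; ++i) {
            const float* pitem = bboxes.data() + 1 + i * kBoxElems;
            if (i == position || pcurrent[5] != pitem[5]) continue;  // self or other class
            if (pitem[4] < pcurrent[4]) continue;                    // only stronger boxes suppress
            if (pitem[4] == pcurrent[4] && i < position) continue;   // tie-break on index
            if (box_iou(pcurrent, pitem) > threshold) {
                pcurrent[6] = 0.0f;  // 1 = keep, 0 = ignore
                break;
            }
        }
    }
}
```

For two same-class boxes (0, 0, 10, 10) and (1, 1, 11, 11), the intersection is 9 x 9 = 81 and the union is 100 + 100 - 81 = 119, so the IoU is about 0.68 and the lower-confidence box is dropped at a 0.45 threshold.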

Then copy the buffer back to the host and fill the result according to the keep flag.

    int num_boxes = min((int)output_host[0], max_objects);
    for(int i = 0; i < num_boxes; ++i){
        float* ptr = output_host + 1 + NUM_BOX_ELEMENT * i;
        int keep_flag = ptr[6];
        if(keep_flag){
            box_result.emplace_back(
                ptr[0], ptr[1], ptr[2], ptr[3], ptr[4], (int)ptr[5]
            );
        }
    }

Finally, don't forget to destroy the stream (cudaStreamDestroy) and release the device memory (cudaFree).

Summary:

1. The main function:

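In main, load_file reads the tensor file (not an image). It opens the file in binary mode (ios::binary) and stores the data in a std::vector<uint8_t> (written as static std::vector<uint8_t> in the source).
The data YOLOv5 produces is n x (5 + classes). The number of rows and columns is computed from the data size, and then the pointer to the data, nrows and ncols are passed to the decoder. This example provides both CPU decoding and GPU decoding.
Decoding returns a vector<Box>. Box is a user-defined type; each Box is one bounding box that stores left, top, right, bottom, confidence, label.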
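The article never shows the definition of Box, so the following is only a guess at its shape, based on the fields listed above and on how emplace_back is called with six arguments. The choice of int for label is an assumption (the CPU path passes `(float)label` and the GPU path passes `(int)ptr[5]`), and `box_area` is a hypothetical helper.

```cpp
#include <cassert>
#include <vector>

// Assumed layout of the user-defined Box type.
struct Box {
    float left, top, right, bottom, confidence;
    int label;  // assumption: stored as an integer class index

    Box() = default;
    Box(float left, float top, float right, float bottom, float confidence, int label)
        : left(left), top(top), right(right), bottom(bottom),
          confidence(confidence), label(label) {}
};

// Hypothetical helper: area of a box whose corners are in the usual order.
static float box_area(const Box& b) {
    assert(b.right >= b.left && b.bottom >= b.top);
    return (b.right - b.left) * (b.bottom - b.top);
}
```

The six-argument constructor is what lets `boxes.emplace_back(left, top, right, bottom, confidence, label)` build the element in place.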

2. CPU decode

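Create a vector<bool> of flags that marks which boxes have been suppressed, and a vector<Box> that holds the boxes surviving the filter.
Filter on objectness first, then on confidence, to reduce the amount of computation.
When restoring the corner coordinates, remember that the image origin is the top-left corner.
Add elements with the vector's emplace_back(). Normally, push_back calls the element type's copy or move constructor to copy or move the element from a temporary object into the container. emplace_back lets us construct the element directly inside the container, avoiding that extra copy or move.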
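The difference between the two calls can be made visible by counting constructor calls. This is a small illustrative sketch; `TracedBox` and the two counting functions are hypothetical and exist only for this demonstration.

```cpp
#include <cassert>
#include <vector>

// Hypothetical box type that counts how often it is copied or moved.
struct TracedBox {
    static int copies_or_moves;
    float left, top, right, bottom;

    TracedBox(float l, float t, float r, float b) : left(l), top(t), right(r), bottom(b) {}
    TracedBox(const TracedBox& o)
        : left(o.left), top(o.top), right(o.right), bottom(o.bottom) { ++copies_or_moves; }
    TracedBox(TracedBox&& o) noexcept
        : left(o.left), top(o.top), right(o.right), bottom(o.bottom) { ++copies_or_moves; }
};
int TracedBox::copies_or_moves = 0;

// push_back with a temporary: the temporary is built first, then moved into the vector.
static int count_for_push_back() {
    TracedBox::copies_or_moves = 0;
    std::vector<TracedBox> v;
    v.reserve(1);
    assert(v.capacity() >= 1);  // no reallocation will disturb the count
    v.push_back(TracedBox(0.0f, 0.0f, 1.0f, 1.0f));
    return TracedBox::copies_or_moves;
}

// emplace_back with constructor arguments: the element is built in place.
static int count_for_emplace_back() {
    TracedBox::copies_or_moves = 0;
    std::vector<TracedBox> v;
    v.reserve(1);
    assert(v.capacity() >= 1);
    v.emplace_back(0.0f, 0.0f, 1.0f, 1.0f);
    return TracedBox::copies_or_moves;
}
```

The saving only applies when constructor arguments are passed. Calling emplace_back with an existing object, as the NMS loop does with `box_result.emplace_back(ibox)`, still invokes the copy constructor.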

3. GPU decode

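Allocate input memory and output memory on the GPU, and output memory on the CPU. Copy the YOLOv5 output data to the GPU first, and copy the results back to the CPU when the kernels have finished.
When the decode kernel finishes, the output holds the boxes that passed the confidence filter, each stored in the order left, top, right, bottom, confidence, class, keepflag.
For fast NMS, first take the pointer to the box's first coordinate, left. Then run a chain of cheap if checks, such as whether the labels match and whether pitem[4] >= pcurrent[4], and compute the IoU only at the end. This saves compute.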