Final Chapter: Running YOLOv5 Inference with Our Homemade Inference Framework

Latest recommended article published on 2024-07-25 16:59:12

qq_32901731

Tags: YOLO, opencv, computer vision

Article link: https://blog.csdn.net/qq_32901731/article/details/129710271

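The YOLOv5 preprocessing function

The preprocessing serves much the same purpose as the ResNet preprocessing function from the previous section. It consists of the following steps:

Image resizing
Image padding
Color space conversion
Image normalization
Layout conversion from RGBRGBRGB to RRRGGGBBB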

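Implementing the steps above in C++

Implementing this preprocessing requires the OpenCV image processing library; we go through the steps one by one below. The whole pipeline is wrapped in the function kuiper_infer::sftensor PreProcessImage(const cv::Mat &image, const int32_t input_h, const int32_t input_w), where image is the input image and the return value is the tensor produced by preprocessing.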

Image resizing and padding

Resizing and padding are implemented in the Letterbox function, whose parameters are defined as follows:

float Letterbox(
    const cv::Mat &image,
    cv::Mat &out_image,
    const cv::Size &new_shape = cv::Size(640, 640),
    int stride = 32,
    const cv::Scalar &color = cv::Scalar(114, 114, 114),
    bool fixed_shape = false,
    bool scale_up = false);

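The main parameters are: image, the input image; out_image, the processed image; new_shape, the target size, usually set to the YOLOv5 model input size of 640; and color, the padding color.

The remaining parameters are not covered in detail here, since they have little effect on the final preprocessing result.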

cv::Size shape = image.size();
  float r = std::min(
      (float) new_shape.height / (float) shape.height, (float) new_shape.width / (float) shape.width);
  if (!scale_up) {
    r = std::min(r, 1.0f);
  }

  int new_unpad[2]{
      (int) std::round((float) shape.width * r), (int) std::round((float) shape.height * r)};

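In Letterbox, r is the ratio of the new height and width to the old height and width, taken as the smaller of the two ratios. Using a single ratio for both axes keeps the aspect ratio in the subsequent resize, so the resized image is not distorted.

new_unpad is the new size (width, height) obtained by scaling with r, that is, the target size that still preserves the aspect ratio.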

  cv::Mat tmp;
  if (shape.width != new_unpad[0] || shape.height != new_unpad[1]) {
    cv::resize(image, tmp, cv::Size(new_unpad[0], new_unpad[1]));
  } else {
    tmp = image.clone();
  }

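If new_unpad differs from the size of the input image, the image is resized. Note, however, that new_unpad may not be the size we actually need (because it has to preserve the aspect ratio), which is why dw and dh are computed as follows.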

  float dw = new_shape.width - new_unpad[0];
  float dh = new_shape.height - new_unpad[1];

  if (!fixed_shape) {
    dw = (float) ((int) dw % stride);
    dh = (float) ((int) dh % stride);
  }

  dw /= 2.0f;
  dh /= 2.0f;

  int top = int(std::round(dh - 0.1f));
  int bottom = int(std::round(dh + 0.1f));
  int left = int(std::round(dw - 0.1f));
  int right = int(std::round(dw + 0.1f));
  cv::copyMakeBorder(tmp, out_image, top, bottom, left, right, cv::BORDER_CONSTANT, color);

dw and dh are the gaps between new_unpad and the size we actually need. cv::copyMakeBorder fills these gaps with the color parameter.

[Figure: the image after Letterbox]

The figure above shows the output of Letterbox: it is 640x640, preserves the aspect ratio of the original image, and the leftover area is padded with gray.
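The arithmetic inside Letterbox can be checked without OpenCV. The following is a minimal sketch that only reproduces the size and border computation shown above; LetterboxGeometry and ComputeLetterbox are hypothetical names introduced for illustration and are not part of KuiperInfer.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical helper (not part of KuiperInfer): the geometry that Letterbox
// computes, without touching any pixels.
struct LetterboxGeometry {
  int new_w, new_h;              // size after the aspect-preserving resize
  int top, bottom, left, right;  // border widths passed to copyMakeBorder
};

LetterboxGeometry ComputeLetterbox(int src_w, int src_h, int dst_w, int dst_h,
                                   int stride, bool fixed_shape, bool scale_up) {
  assert(src_w > 0 && src_h > 0 && stride > 0);
  // One ratio for both axes keeps the aspect ratio.
  float r = std::min((float)dst_h / (float)src_h, (float)dst_w / (float)src_w);
  if (!scale_up) r = std::min(r, 1.0f);

  LetterboxGeometry g{};
  g.new_w = (int)std::round((float)src_w * r);
  g.new_h = (int)std::round((float)src_h * r);

  // Remaining gap between the resized image and the requested shape.
  float dw = (float)(dst_w - g.new_w);
  float dh = (float)(dst_h - g.new_h);
  if (!fixed_shape) {  // pad only up to the next multiple of stride
    dw = (float)((int)dw % stride);
    dh = (float)((int)dh % stride);
  }
  dw /= 2.0f;
  dh /= 2.0f;

  // The +/- 0.1 trick sends the extra pixel of an odd gap to bottom/right.
  g.top = (int)std::round(dh - 0.1f);
  g.bottom = (int)std::round(dh + 0.1f);
  g.left = (int)std::round(dw - 0.1f);
  g.right = (int)std::round(dw + 0.1f);
  return g;
}
```

For a 1280x720 image and a 640x640 target with fixed_shape set to true, this gives a 640x360 resized image with 140 rows of padding above and below.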

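Letterbox is called from PreProcessImage. Note that cv::Size is constructed as (width, height), so passing {input_h, input_w} is only harmless because both values are 640 here: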

  cv::Mat out_image;
  Letterbox(image, out_image, {input_h, input_w}, stride, {114, 114, 114},
            true);

Color space conversion and normalization

  cv::Mat rgb_image;
  cv::cvtColor(out_image, rgb_image, cv::COLOR_BGR2RGB);

  cv::Mat normalize_image;
  rgb_image.convertTo(normalize_image, CV_32FC3, 1. / 255.);

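These two steps convert the image from BGR to RGB and normalize it by dividing each pixel by 255. Compared with ResNet, the normalization here is simpler. Both steps are also called inside PreProcessImage.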
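To make the two steps concrete, here is a minimal sketch on a raw interleaved byte buffer, assuming 8-bit BGR pixels stored as B,G,R,B,G,R,... It is a stand-in for cv::cvtColor plus convertTo, not the OpenCV implementation; BgrToNormalizedRgb is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: swap B and R in every pixel and scale to [0, 1].
std::vector<float> BgrToNormalizedRgb(const std::vector<uint8_t>& bgr) {
  assert(bgr.size() % 3 == 0);  // three bytes per pixel
  std::vector<float> rgb(bgr.size());
  for (std::size_t i = 0; i < bgr.size(); i += 3) {
    rgb[i + 0] = bgr[i + 2] / 255.0f;  // R comes from the third byte
    rgb[i + 1] = bgr[i + 1] / 255.0f;  // G stays in the middle
    rgb[i + 2] = bgr[i + 0] / 255.0f;  // B comes from the first byte
  }
  return rgb;
}
```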

Pixel layout: from RGBRGBRGB to RRRGGGBBB

  std::vector<cv::Mat> split_images;
  cv::split(normalize_image, split_images);
  assert(split_images.size() == input_c);

  std::shared_ptr<Tensor<float>> input =
      std::make_shared<Tensor<float>>(input_c, input_h, input_w);
  input->Fill(0.f);

  int index = 0;
  int offset = 0;
  for (const auto& split_image : split_images) {
    assert(split_image.total() == input_w * input_h);
    const cv::Mat& split_image_t = split_image.t();
    memcpy(input->slice(index).memptr(), split_image_t.data,
           sizeof(float) * split_image.total());
    index += 1;
    offset += split_image.total();
  }

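First, cv::split separates the three RGB channels and stores them in the split_images vector. We also prepare an input tensor to receive the data.

Next, a for loop processes split_images one channel at a time. In the first iteration, for example, the R channel split_image is first transposed with .t(). The transpose is needed because of the difference between row-major and column-major storage: OpenCV is row-major, while the Armadillo storage underlying the tensor is column-major.

We then copy the transposed channel directly into slice index of the input tensor and increment index so that the next copy lands in the next channel. (offset is accumulated in the loop, but the memcpy itself does not use it.)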
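The combined effect of cv::split, the transpose and the memcpy can be written as one index mapping. The sketch below assumes a row-major interleaved source (as in OpenCV) and column-major planes in the destination (as in Armadillo); InterleavedToPlanarColMajor is a hypothetical name used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: RGBRGB... (row-major, interleaved) to RRR...GGG...BBB...
// where each channel plane is stored column by column.
std::vector<float> InterleavedToPlanarColMajor(const std::vector<float>& hwc,
                                               int h, int w, int c) {
  assert(h > 0 && w > 0 && c > 0);
  assert(hwc.size() == static_cast<std::size_t>(h) * w * c);
  std::vector<float> planar(hwc.size());
  for (int row = 0; row < h; ++row) {
    for (int col = 0; col < w; ++col) {
      for (int ch = 0; ch < c; ++ch) {
        // Source: pixels row by row, channels interleaved.
        const std::size_t src =
            (static_cast<std::size_t>(row) * w + col) * c + ch;
        // Destination: one plane per channel, each plane column-major.
        const std::size_t dst = static_cast<std::size_t>(ch) * h * w +
                                static_cast<std::size_t>(col) * h + row;
        planar[dst] = hwc[src];
      }
    }
  }
  return planar;
}
```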

Reviewing the whole process

kuiper_infer::sftensor PreProcessImage(const cv::Mat &image, const int32_t input_h, const int32_t input_w) { 
  assert(!image.empty());
  using namespace kuiper_infer;
  const int32_t input_c = 3;
  // input_h and input_w come from the function parameters; redeclaring them
  // here would not compile.

  const int32_t origin_input_h = image.size().height;
  const int32_t origin_input_w = image.size().width;

  int stride = 32;
  cv::Mat out_image;
  Letterbox(image, out_image, {input_h, input_w}, stride, {114, 114, 114},
            true);

  cv::Mat rgb_image;
  cv::cvtColor(out_image, rgb_image, cv::COLOR_BGR2RGB);

  cv::Mat normalize_image;
  rgb_image.convertTo(normalize_image, CV_32FC3, 1. / 255.);

  std::vector<cv::Mat> split_images;
  cv::split(normalize_image, split_images);
  assert(split_images.size() == input_c);

  std::shared_ptr<Tensor<float>> input =
      std::make_shared<Tensor<float>>(input_c, input_h, input_w);
  input->Fill(0.f);

  int index = 0;
  int offset = 0;
  for (const auto& split_image : split_images) {
    assert(split_image.total() == input_w * input_h);
    const cv::Mat& split_image_t = split_image.t();
    memcpy(input->slice(index).memptr(), split_image_t.data,
           sizeof(float) * split_image.total());
    index += 1;
    offset += split_image.total();
  }
  return input;
}

Calling the preprocessing function

Preprocessing is invoked from the YoloDemo function, whose parameters are defined as follows:

void YoloDemo(const std::vector<std::string> &image_paths,
              const std::string &param_path,
              const std::string &bin_path,
              const uint32_t batch_size)

image_paths holds the image paths, and their number must equal batch_size. param_path is the path to the model parameter file, and bin_path is the path to the model weight file.

  using namespace kuiper_infer;
  const int32_t input_h = 640;
  const int32_t input_w = 640;

  assert(batch_size == image_paths.size());
  std::vector<sftensor> inputs;
  for (uint32_t i = 0; i < batch_size; ++i) {
    const auto &input_image = cv::imread(image_paths.at(i));
    sftensor input = PreProcessImage(input_image, input_h, input_w);
    assert(input->rows() == 640);
    assert(input->cols() == 640);
    inputs.push_back(input);
  }

inputs stores the input tensors. It is a vector whose length equals batch_size. Put another way, inputs is the YOLO model input for a batch of size batch_size.

Loading the YOLO model

The YOLO model is loaded as follows:

  RuntimeGraph graph(param_path, bin_path);
  graph.Build("pnnx_input_0", "pnnx_output_0");

However, if you have not implemented every operator the model needs, you get an error like this:

COULD NOT CREATE A LOGGINGFILE 20230321-131652.4249!F20230321 13:16:52.668184  4249 layer_factory.cpp:29] Can not find the layer type: nn.SiLU
*** Check failure stack trace: ***

The error shows that the nn.SiLU operator has not been implemented, so in what follows we add all the missing operators.

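Writing the SiLU operator

SiLU is computed as follows:
$\text{silu}(x) = \frac{x}{1 + e^{-x}} = x \cdot \text{sigmoid}(x)$
As you can see, this operator is just the sigmoid function multiplied by x, so it is not fundamentally different. Here is how Forward is implemented in SiLU: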

InferStatus SiLULayer::Forward(const std::vector<std::shared_ptr<Tensor<float>>> &inputs,
                               std::vector<std::shared_ptr<Tensor<float>>> &outputs) {
  if (inputs.empty()) {
    LOG(ERROR) << "The input feature map of silu layer is empty";
    return InferStatus::kInferFailedInputEmpty;
  }

  if (inputs.size() != outputs.size()) {
    LOG(ERROR) << "The input and output size of silu layer is not adapting";
    return InferStatus::kInferFailedInputOutSizeAdaptingError;
  }
 ....
}

The part above validates the inputs and outputs of the Forward function.

  const uint32_t batch_size = inputs.size();
#pragma omp parallel for num_threads(batch_size)
  for (uint32_t i = 0; i < batch_size; ++i) {
    const std::shared_ptr<Tensor<float>> &input = inputs.at(i);
    CHECK(input != nullptr && !input->empty()) << "The input feature map of silu layer is empty!";

    std::shared_ptr<Tensor<float>> output = outputs.at(i);
    if (output == nullptr || output->empty()) {
      output = std::make_shared<Tensor<float>>(input->shapes());
      outputs.at(i) = output;
    }

    CHECK(output->shapes() == input->shapes()) << "The output size of silu layer is error";
    output->set_data(input->data());
    output->Transform([](const float value) {
      return value / (1.f + expf(-value));
    });
  }
  return InferStatus::kInferSuccess;
}

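The function above processes batch_size batches of data. For each one it gets the current input, copies its data into output, and then transforms the data in output; the Transform callback applies the formula defined above. Once the operator is written, we register the SiLU implementation globally through the automatic registration mechanism.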

LayerRegistererWrapper kSiluGetInstance("nn.SiLU", SiLULayer::GetInstance);
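The element-wise computation itself is easy to try in isolation. Here is a minimal sketch on a plain float buffer instead of the framework's Tensor; Silu and SiluInPlace are hypothetical names that mirror what the Transform lambda does.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// silu(x) = x / (1 + exp(-x)) = x * sigmoid(x)
float Silu(float x) { return x / (1.0f + std::exp(-x)); }

// Hypothetical helper: apply SiLU to every element of a buffer in place.
void SiluInPlace(float* data, std::size_t size) {
  assert(data != nullptr || size == 0);
  for (std::size_t i = 0; i < size; ++i) {
    data[i] = Silu(data[i]);
  }
}
```

Large negative inputs are pushed towards zero, while large positive inputs pass through almost unchanged.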

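You may also have noticed this line in the Forward function above: #pragma omp parallel for num_threads(batch_size). This is an OpenMP directive that runs the for loop below it on multiple threads. For example, if the loop has to process 1000 items, OpenMP starts several threads and each handles part of the range: thread 1 might handle i = 0…10, thread 2 i = 11…20, and so on. We do not go into detail here, since plenty of material is available online.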
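How a loop range might be divided among threads can be illustrated without OpenMP. The sketch below assumes an even, contiguous split, which is what a static schedule typically looks like; the schedule OpenMP actually uses when none is specified is implementation-defined. Range and StaticChunk are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>

// Half-open iteration range [begin, end) owned by one thread.
struct Range {
  int begin;
  int end;
};

// Hypothetical helper: split n iterations into num_threads contiguous blocks
// whose sizes differ by at most one.
Range StaticChunk(int n, int num_threads, int tid) {
  assert(n >= 0 && num_threads > 0 && tid >= 0 && tid < num_threads);
  const int base = n / num_threads;   // iterations every thread gets
  const int extra = n % num_threads;  // the first `extra` threads get one more
  const int begin = tid * base + std::min(tid, extra);
  const int end = begin + base + (tid < extra ? 1 : 0);
  return {begin, end};
}
```

With num_threads(batch_size) and a loop over batch_size items, as in the Forward function, each thread ends up with a single batch element.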

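Writing the Concat operator

The concat operator is implemented in cat.cpp (an odd name). It concatenates multiple tensors along the channel dimension (channel dim). We explain it below with figures and code.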

InferStatus CatLayer::Forward(
    const std::vector<std::shared_ptr<Tensor<float>>>& inputs,
    std::vector<std::shared_ptr<Tensor<float>>>& outputs) {
  if (inputs.empty()) {
    LOG(ERROR) << "The input feature map of cat layer is empty";
    return InferStatus::kInferFailedInputEmpty;
  }

  if (inputs.size() == outputs.size()) {
    LOG(ERROR) << "The input and output size is not adapting";
    return InferStatus::kInferFailedInputOutSizeAdaptingError;
  }

  if (dim_ != 1 && dim_ != -3) {
    LOG(ERROR) << "The dimension of cat layer is error";
    return InferStatus::kInferFailedDimensionParameterError;
  }

  const uint32_t output_size = outputs.size();
  CHECK(inputs.size() % output_size == 0);
  const uint32_t packet_size = inputs.size() / output_size;
  ...
}

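The code above is from cat.cpp; Forward defines how the operator computes its result. inputs is the array of tensors to be concatenated, stored as follows:

[Figure: the three input tensors in the inputs array]

There are three inputs here, forming the inputs array, each drawn in a different color. So inputs.size() is 3 in the code above. If we concatenate them along the channel dimension, the number of outputs is 1, so packet_size in the code above equals 3.

Concatenating these 3 inputs of size 1x2x3 (where 1 is the number of channels) into one yields an output of size 3x2x3.

[Figure: the concatenated output tensor]

The output is shown in the figure above: it has 3 channels, a height of 2 and a width of 3. Let's look at the actual code: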

#pragma omp parallel for num_threads(outputs.size())
  for (uint32_t i = 0; i < outputs.size(); ++i) {
    std::shared_ptr<Tensor<float>> output = outputs.at(i);
    uint32_t start_channel = 0;
    uint32_t rows = inputs.front()->rows();
    uint32_t cols = inputs.front()->cols();

    for (uint32_t j = i; j < inputs.size(); j += output_size) {
      const std::shared_ptr<Tensor<float>>& input = inputs.at(j);
      CHECK(input != nullptr && !input->empty())
          << "The input feature map of cat layer is empty";
      const uint32_t in_channels = input->channels();
      CHECK(rows == input->rows() && cols == input->cols());

      if (output == nullptr || output->empty()) {
        output = std::make_shared<Tensor<float>>(in_channels * packet_size,
                                                 rows, cols);
        outputs.at(i) = output;
      }
      CHECK(output->channels() == in_channels * packet_size &&
            output->rows() == rows && output->cols() == cols);
      for (uint32_t c = 0; c < in_channels; ++c) {
        output->slice(start_channel + c) = input->slice(c);
      }
      start_channel += input->channels();
    }
  }

First we take one input from the inputs of the batch, and then obtain the prepared output space output.

CHECK(output->channels() == in_channels * packet_size &&
output->rows() == rows && output->cols() == cols);

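Here we check that the number of channels of output equals the number of inputs concatenated into it (packet_size) multiplied by the number of channels of each input. Expressed mathematically:
$\text{output channels} = \text{number of inputs} \times \text{input channels}$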

for (uint32_t c = 0; c < in_channels; ++c) {
   output->slice(start_channel + c) = input->slice(c);
}
start_channel += input->channels();

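We then concatenate the inputs one by one along the channel dimension of output. start_channel marks the current write position in the channel dimension; the channels of each input are copied one at a time into output starting at start_channel.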
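The 1x2x3 example can be reproduced with a toy tensor type. This is a minimal sketch, assuming each channel plane is stored contiguously so that concatenating along the channel dimension amounts to appending planes; Tensor3 and ConcatChannels are hypothetical names, not the framework's Tensor class.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy tensor: `channels` planes of rows x cols floats, stored plane after plane.
struct Tensor3 {
  int channels;
  int rows;
  int cols;
  std::vector<float> data;
};

// Hypothetical helper: concatenate all inputs along the channel dimension.
Tensor3 ConcatChannels(const std::vector<Tensor3>& inputs) {
  assert(!inputs.empty());
  Tensor3 out{0, inputs.front().rows, inputs.front().cols, {}};
  for (const Tensor3& in : inputs) {
    // Height and width must match; only the channel count may differ.
    assert(in.rows == out.rows && in.cols == out.cols);
    assert(in.data.size() ==
           static_cast<std::size_t>(in.channels) * in.rows * in.cols);
    out.data.insert(out.data.end(), in.data.begin(), in.data.end());
    out.channels += in.channels;  // plays the role of start_channel
  }
  return out;
}
```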

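Writing the UpSample operator

This is an upsampling operator: it enlarges the input's width and height by the given scale factor. The nearest method is used here, which upsamples by copying the value of the nearest source point. The implementation is fairly simple. For example, with scale equal to 4, every output pixel from (0,0) to (3,3) copies the value of the input pixel at (0,0), because:
$\lfloor x / scale \rfloor = 0,\ \lfloor y / scale \rfloor = 0, \quad x \in [0,3],\ y \in [0,3],\ scale = 4$

Likewise, every output point from (4,4) to (7,7) copies the input pixel at (1,1), because:
$\lfloor x / scale \rfloor = 1,\ \lfloor y / scale \rfloor = 1, \quad x \in [4,7],\ y \in [4,7],\ scale = 4$
It is implemented in upsample.cpp; the code is as follows: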

for (uint32_t i = 0; i < batch_size; ++i) {
    const arma::fcube &input_data = inputs.at(i)->data();
    std::shared_ptr<Tensor<float>> output = outputs.at(i);
    if (output == nullptr || output->empty()) {
      output = std::make_shared<Tensor<float>>(input_data.n_slices,
                                               uint32_t(input_data.n_rows * scale_h_),
                                               uint32_t(input_data.n_cols * scale_w_));
      outputs.at(i) = output;
    }
    auto &output_data = output->data();
    CHECK(output_data.n_rows == input_data.n_rows * scale_h_) << "The height of the feature map is not adapting!";
    CHECK(output_data.n_cols == input_data.n_cols * scale_w_) << "The width of the feature map is not adapting!";
    CHECK(input_data.n_slices == output_data.n_slices) << "The channel of the feature map is not adapting!";
...
  }

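The code above first obtains the input and output tensor storage, input and output. It then checks that output is large enough to hold the upsampled input, whose height and width are multiplied by the scale factors.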

    for (uint32_t c = 0; c < channels; ++c) {
      const arma::fmat &input_channel = input_data.slice(c);
      arma::fmat &output_channel = output_data.slice(c);
      const uint32_t output_w = output_channel.n_cols;
      const uint32_t output_h = output_channel.n_rows;

      for (uint32_t w = 0; w < output_w; ++w) {
        const uint32_t src_w = uint32_t((float) w / this->scale_w_);
        CHECK(src_w < input_channel.n_cols);
        float *output_channel_ptr = output_channel.colptr(w);
        const float *input_channel_ptr = input_channel.colptr(src_w);

        for (uint32_t h = 0; h < output_h; ++h) {
          const uint32_t src_h = uint32_t((float) h / this->scale_h_);
          CHECK(src_h < input_channel.n_rows);

          const float src_value = *(input_channel_ptr + src_h);
          *(output_channel_ptr + h) = src_value;
        }
      }
    }

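Next we take one channel of the output, output_channel, and iterate over it. Dividing a coordinate on output_channel by scale_h and scale_w gives the corresponding coordinate (src_h, src_w) on input_channel, and the value at that position is assigned to the output.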
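The coordinate mapping is easiest to see on a single channel. The sketch below assumes integer scale factors and a row-major buffer (unlike the column-major Armadillo matrices in upsample.cpp); UpsampleNearest is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: nearest-neighbor upsampling of one row-major channel.
std::vector<float> UpsampleNearest(const std::vector<float>& in, int in_h,
                                   int in_w, int scale_h, int scale_w) {
  assert(in_h > 0 && in_w > 0 && scale_h > 0 && scale_w > 0);
  assert(in.size() == static_cast<std::size_t>(in_h) * in_w);
  const int out_h = in_h * scale_h;
  const int out_w = in_w * scale_w;
  std::vector<float> out(static_cast<std::size_t>(out_h) * out_w);
  for (int h = 0; h < out_h; ++h) {
    const int src_h = h / scale_h;  // integer division acts as floor
    for (int w = 0; w < out_w; ++w) {
      const int src_w = w / scale_w;
      out[static_cast<std::size_t>(h) * out_w + w] =
          in[static_cast<std::size_t>(src_h) * in_w + src_w];
    }
  }
  return out;
}
```

Each input pixel becomes a scale_h x scale_w block of identical values in the output.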

Writing the YoloDetect operator

The Python definition of YoloDetect is shown below, taken directly from yolo.py in the YoloV5 project.

    def forward(self, x):
        z = []  # inference output
        for i in range(self.nl):
            x[i] = self.m[i](x[i])  # conv
            bs, _, ny, nx = x[i].shape  # x(bs,255,20,20) to x(bs,3,20,20,85)
            x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()

            if not self.training:  # inference
                # omitted ...
                else:  # Detect (boxes only)
                    xy, wh, conf = x[i].sigmoid().split((2, 2, self.nc + 1), 4)
                    xy = (xy * 2 + self.grid[i]) * self.stride[i]  # xy
                    wh = (wh * 2) ** 2 * self.anchor_grid[i]  # wh
                    y = torch.cat((xy, wh, conf), 4)
                z.append(y.view(bs, self.na * nx * ny, self.no))

We need to implement this Python operator in C++. The implementation lives in yolo_detect.cpp; let's analyze it step by step.

  for (uint32_t b = 0; b < batch_size; ++b) {
      const std::shared_ptr<Tensor<float>>& input = stage_output.at(b);
      CHECK(input != nullptr && !input->empty());
      const uint32_t nx = input->rows();
      const uint32_t ny = input->cols();
      input->Reshape({stages, uint32_t(classes_info), ny * nx}, true);
      const uint32_t size = input->size();

input->Reshape corresponds to x[i].view and reshapes the tensor to (stages, classes_info, ny x nx).

 input->Transform(
          [](const float value) { return 1.f / (1.f + expf(-value)); });

The code above corresponds to x[i].sigmoid() in the Python code.

      arma::fmat& x_stages = x_stages_tensor->slice(b);
      for (uint32_t s = 0; s < stages; ++s) {
        x_stages.submat(ny * nx * s, 0, ny * nx * (s + 1) - 1,
                        classes_info - 1) = input->slice(s).t();
      }

      const arma::fmat& xy = x_stages.submat(0, 0, x_stages.n_rows - 1, 1);
      const arma::fmat& wh = x_stages.submat(0, 2, x_stages.n_rows - 1, 3);

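The code above corresponds to .split((2, 2, self.nc + 1), 4) in the Python implementation: it extracts the xy and wh columns. The formulas xy = (xy * 2 + self.grid[i]) * self.stride[i] and wh = (wh * 2) ** 2 * self.anchor_grid[i] are applied in the next snippet.

As you can see, this code simply rewrites the Python (PyTorch) implementation with Armadillo. It only needs to be written carefully; there is nothing technically difficult about it.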

x_stages.submat(0, 0, x_stages.n_rows - 1, 1) =
          (xy * 2 + grids_[stage]) * strides_[stage];
x_stages.submat(0, 2, x_stages.n_rows - 1, 3) =
          arma::pow((wh * 2), 2) % anchor_grids_[stage];
zs.at(stage) = x_stages_tensor;

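This part applies the xy and wh formulas and writes the results back into x_stages, which plays the role of y = torch.cat((xy, wh, conf), 4): the processed xy and wh are stitched back together with the remaining data. In Armadillo, % is element-wise multiplication.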
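The two formulas can be checked by hand on a single prediction. The sketch below takes the sigmoid outputs and the corresponding grid, stride and anchor_grid entries as plain numbers; how those tables are built is not shown, and Box and DecodeBox are hypothetical names.

```cpp
#include <cassert>

// Decoded box: center and size in input-image pixels.
struct Box {
  float cx, cy, w, h;
};

// Hypothetical helper. sx, sy, sw, sh are values after the sigmoid;
// grid_x/grid_y, stride and anchor_w/anchor_h are the entries of grid,
// stride and anchor_grid that belong to this prediction.
Box DecodeBox(float sx, float sy, float sw, float sh, float grid_x,
              float grid_y, float stride, float anchor_w, float anchor_h) {
  assert(stride > 0.0f);
  Box b;
  b.cx = (sx * 2.0f + grid_x) * stride;        // xy = (xy * 2 + grid) * stride
  b.cy = (sy * 2.0f + grid_y) * stride;
  b.w = (sw * 2.0f) * (sw * 2.0f) * anchor_w;  // wh = (wh * 2) ** 2 * anchor
  b.h = (sh * 2.0f) * (sh * 2.0f) * anchor_h;
  return b;
}
```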

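One more question: what does the stage variable mean? I think of it as a detection head. YOLOv5 has three detection heads, for example, which handle objects of different sizes. After all batches of one detection head are processed, the results are stored in x_stages and at zs.at(stage).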

 uint32_t current_rows = 0;
  arma::fcube f1(concat_rows, classes_info, batch_size);
  for (const auto& z : zs) {
    f1.subcube(current_rows, 0, 0, current_rows + z->rows() - 1,
               classes_info - 1, batch_size - 1) = z->data();
    current_rows += z->rows();
  }

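We then concatenate the outputs of the three detection heads and store them in f1. In this example (640x640 input, batch size 1) the three stages contribute 19200, 4800 and 1200 rows of 85 values each, so f1 has size (1,25200,85) after concatenation, since 19200 + 4800 + 1200 = 25200. AI engineers will probably recognize this number.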
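The row counts follow from the feature-map sizes. The sketch below assumes the usual YOLOv5 setup of strides 8, 16 and 32 with 3 anchors per grid cell, which the text above does not state explicitly; CountPredictions is a hypothetical name.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: total number of prediction rows over all detection heads.
int CountPredictions(int input_h, int input_w, const std::vector<int>& strides,
                     int anchors_per_cell) {
  int total = 0;
  for (int s : strides) {
    assert(s > 0 && input_h % s == 0 && input_w % s == 0);
    // Each head has an (input_h / s) x (input_w / s) grid of cells.
    total += anchors_per_cell * (input_h / s) * (input_w / s);
  }
  return total;
}
```

For a 640x640 input this gives 3 x (80x80 + 40x40 + 20x20) = 25200 rows.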

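Verifying the results

At this point we have operator-level support for the YOLOv5 model, so let's verify it. For how to obtain the param and bin files below, please refer to the PNNX project yourself; I only outline the process here.

Choose torchscript export in YoloV5's export.py
pnnx yolov5s.pt inputshape=[1,3,640,640] moduleop=models.common.Focus,models.yolo.Detect

The PNNX project is located at https://github.com/Tencent/ncnn/tree/master/tools/pnnx; yolov5s.pt is the model file exported in the previous step.

You then get the param and bin files. Note that the shape in inputshape=[1,3,640,640] is in NCHW order, and it must match the input you later use for inference: preprocessing must resize to 640 and the batch size must be 1.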

Verification code

TEST(test_net, forward_yolo1) {
  using namespace kuiper_infer;
  RuntimeGraph graph("tmp/yolo/demo/yolov5n_small.pnnx.param",
                     "tmp/yolo/demo/yolov5n_small.pnnx.bin");

  graph.Build("pnnx_input_0", "pnnx_output_0");
  const uint32_t batch_size = 4;
  std::vector<std::shared_ptr<Tensor<float>>> inputs;

  for (int i = 0; i < batch_size; ++i) {
    std::shared_ptr<Tensor<float>> input = std::make_shared<Tensor<float>>(3, 320, 320);
    input->Fill(127.f);
    inputs.push_back(input);
  }

Note first that the input shape must match the input_shape specified when yolov5n_small.pnnx was exported, which here is [4,3,320,320].

  std::vector<std::shared_ptr<Tensor<float>>> outputs = graph.Forward(inputs, false);
  for (int i = 0; i < batch_size; ++i) {
    std::string file_path = "tmp/yolo/" + std::to_string(i + 1) + ".csv";
    const auto &output1 = CSVDataLoader::LoadData(file_path);
    const auto &output2 = outputs.at(i);

    ASSERT_EQ(output1.size(), output2->size());
    for (int r = 0; r < output1.n_rows; ++r) {
      for (int c = 0; c < output1.n_cols; ++c) {
        ASSERT_LE(std::abs(output1.at(r, c) - output2->at(0, r, c)), 0.05) << " row: " << r << " col: " << c;
      }
    }
  }
}

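We then call graph.Forward to get the inference results outputs. We also load the csv files produced by PyTorch and compare the two element by element.
[Figure: test result]

As you can see, there are no problems.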
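The comparison in the test boils down to an absolute-tolerance check. A minimal sketch on plain vectors, using the same kind of threshold as the 0.05 in the test above; AllClose is a hypothetical name and not part of the framework or of GoogleTest.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: true if both vectors have the same length and every
// pair of elements differs by at most tol.
bool AllClose(const std::vector<float>& expected,
              const std::vector<float>& actual, float tol) {
  assert(tol >= 0.0f);
  if (expected.size() != actual.size()) return false;
  for (std::size_t i = 0; i < expected.size(); ++i) {
    if (std::fabs(expected[i] - actual[i]) > tol) return false;
  }
  return true;
}
```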

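YOLOv5 post-processing and running the demo

Post-processing consists of the following steps:

Get the output of the YOLOv5 network. For a 640x640 input, the output size is 1,25200,85.
Filter out low-confidence entries. The 85 values start with x, y, w, h, confidence, so it is enough to check the confidence at index 4 (counting from 0).

The YoloDemo function contains the following code, which we walk through: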

  for (int i = 0; i < outputs.size(); ++i) {
    const auto &image = cv::imread(image_paths.at(i));
    const int32_t origin_input_h = image.size().height;
    const int32_t origin_input_w = image.size().width;

    const auto &output = outputs.at(i);
    assert(!output->empty());
    const auto &shapes = output->shapes();
    assert(shapes.size() == 3);

    const uint32_t elements = shapes.at(1);
    const uint32_t num_info = shapes.at(2);
    std::vector<Detection> detections;

    std::vector<cv::Rect> boxes;
    std::vector<float> confs;
    std::vector<int> class_ids;

elements equals 25200 and num_info equals 85. We then filter by confidence:

    const uint32_t b = 0;
    for (uint32_t e = 0; e < elements; ++e) {
      float cls_conf = output->at(b, e, 4);
      if (cls_conf >= conf_thresh) {
        int center_x = (int) (output->at(b, e, 0));
        int center_y = (int) (output->at(b, e, 1));
        int width = (int) (output->at(b, e, 2));
        int height = (int) (output->at(b, e, 3));
        int left = center_x - width / 2;
        int top = center_y - height / 2;

        int best_class_id = -1;
        float best_conf = -1.f;
        for (uint32_t j = 5; j < num_info; ++j) {
          if (output->at(b, e, j) > best_conf) {
            best_conf = output->at(b, e, j);
            best_class_id = int(j - 5);
          }
        }

        boxes.emplace_back(left, top, width, height);
        confs.emplace_back(best_conf * cls_conf);
        class_ids.emplace_back(best_class_id);
      }
    }

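We first read x, y, w, h from the 85 values and then iterate over the class scores at indices 5 to 84. Along the way, entries among the elements whose confidence is below conf_thresh are discarded, and the class with the highest score is selected; its class id, box and confidence go into class_ids, boxes and confs respectively.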
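The filtering of a single output row can be sketched without the framework's Tensor. The code below assumes the row layout described above (center x, center y, width, height, confidence, then one score per class); Candidate and ParseRow are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One detection candidate before NMS.
struct Candidate {
  int left, top, width, height;
  float score;
  int class_id;
};

// Hypothetical helper: returns false if the row is below conf_thresh,
// otherwise fills *out with the box and the best class.
bool ParseRow(const std::vector<float>& row, float conf_thresh,
              Candidate* out) {
  assert(row.size() > 5 && out != nullptr);
  const float box_conf = row[4];
  if (box_conf < conf_thresh) return false;

  int best_class = -1;
  float best_conf = -1.0f;
  for (std::size_t j = 5; j < row.size(); ++j) {  // class scores start at 5
    if (row[j] > best_conf) {
      best_conf = row[j];
      best_class = static_cast<int>(j - 5);
    }
  }

  const int cx = static_cast<int>(row[0]);
  const int cy = static_cast<int>(row[1]);
  out->width = static_cast<int>(row[2]);
  out->height = static_cast<int>(row[3]);
  out->left = cx - out->width / 2;  // center format to top-left format
  out->top = cy - out->height / 2;
  out->score = best_conf * box_conf;
  out->class_id = best_class;
  return true;
}
```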

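Next, NMS removes overlapping detection boxes (the principle of NMS is not covered here; please look it up). We use the implementation that ships with OpenCV directly. It differs slightly from the YOLOv5 implementation, but this does not matter much.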

std::vector<int> indices;
cv::dnn::NMSBoxes(boxes, confs, conf_thresh, iou_thresh, indices);
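For readers who want to see the idea behind NMS, here is a minimal greedy sketch. It is not OpenCV's implementation of cv::dnn::NMSBoxes and makes no claim to match it in every detail; Rect, IoU and GreedyNms are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Rect {
  int x, y, w, h;  // top-left corner plus size
};

// Intersection over union of two axis-aligned boxes.
float IoU(const Rect& a, const Rect& b) {
  const int x1 = std::max(a.x, b.x);
  const int y1 = std::max(a.y, b.y);
  const int x2 = std::min(a.x + a.w, b.x + b.w);
  const int y2 = std::min(a.y + a.h, b.y + b.h);
  const float inter = (float)std::max(0, x2 - x1) * (float)std::max(0, y2 - y1);
  const float uni = (float)a.w * (float)a.h + (float)b.w * (float)b.h - inter;
  return uni > 0.0f ? inter / uni : 0.0f;
}

// Hypothetical helper: visit boxes from highest to lowest score and keep a box
// only if it does not overlap an already kept box by more than iou_thresh.
std::vector<int> GreedyNms(const std::vector<Rect>& boxes,
                           const std::vector<float>& scores,
                           float score_thresh, float iou_thresh) {
  assert(boxes.size() == scores.size());
  std::vector<int> order;
  for (std::size_t i = 0; i < scores.size(); ++i) {
    if (scores[i] > score_thresh) order.push_back(static_cast<int>(i));
  }
  std::stable_sort(order.begin(), order.end(),
                   [&scores](int a, int b) { return scores[a] > scores[b]; });

  std::vector<int> keep;
  for (int idx : order) {
    bool suppressed = false;
    for (int k : keep) {
      if (IoU(boxes[idx], boxes[k]) > iou_thresh) {
        suppressed = true;
        break;
      }
    }
    if (!suppressed) keep.push_back(idx);
  }
  return keep;
}
```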

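The boxes that survive NMS are put into detections. ScaleCoords maps each detection back to the corresponding position in the input image: detection runs at a size of 640, but the actual input image has a different size, so the coordinates must be remapped.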

void ScaleCoords(const cv::Size &img_shape, cv::Rect &coords, const cv::Size &img_origin_shape) {
  float gain = std::min((float) img_shape.height / (float) img_origin_shape.height,
                        (float) img_shape.width / (float) img_origin_shape.width);

  int pad[2] = {(int) (((float) img_shape.width - (float) img_origin_shape.width * gain) / 2.0f),
                (int) (((float) img_shape.height - (float) img_origin_shape.height * gain) / 2.0f)};

  coords.x = (int) std::round(((float) (coords.x - pad[0]) / gain));
  coords.y = (int) std::round(((float) (coords.y - pad[1]) / gain));

  coords.width = (int) std::round(((float) coords.width / gain));
  coords.height = (int) std::round(((float) coords.height / gain));

  coords.x = clip(coords.x, 0, img_origin_shape.width);
  coords.y = clip(coords.y, 0, img_origin_shape.height);
  coords.width = clip(coords.width, 0, img_origin_shape.width);
  coords.height = clip(coords.height, 0, img_origin_shape.height);
}

Here coords is the detected box, img_shape is the size of the image after resizing, and img_origin_shape is the original image size. coords is mapped back into the coordinate system of img_origin_shape.
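A worked example makes the remapping easier to follow. The sketch below repeats the arithmetic of ScaleCoords on a plain struct, assuming the image was letterboxed to the full network size; Rect and MapToOriginal are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Rect {
  int x, y, w, h;
};

// Hypothetical helper: undo the letterbox padding and scaling for one box.
Rect MapToOriginal(Rect box, int net_w, int net_h, int orig_w, int orig_h) {
  assert(net_w > 0 && net_h > 0 && orig_w > 0 && orig_h > 0);
  const float gain = std::min((float)net_h / (float)orig_h,
                              (float)net_w / (float)orig_w);
  // Padding that the letterbox added on the left and on the top.
  const int pad_x = (int)(((float)net_w - (float)orig_w * gain) / 2.0f);
  const int pad_y = (int)(((float)net_h - (float)orig_h * gain) / 2.0f);

  Rect r;
  r.x = (int)std::round((float)(box.x - pad_x) / gain);
  r.y = (int)std::round((float)(box.y - pad_y) / gain);
  r.w = (int)std::round((float)box.w / gain);
  r.h = (int)std::round((float)box.h / gain);

  // Clip to the original image, as ScaleCoords does.
  r.x = std::min(std::max(r.x, 0), orig_w);
  r.y = std::min(std::max(r.y, 0), orig_h);
  r.w = std::min(std::max(r.w, 0), orig_w);
  r.h = std::min(std::max(r.h, 0), orig_h);
  return r;
}
```

For a 1280x720 image letterboxed to 640x640, gain is 0.5 and the vertical padding is 140, so a box at (100, 200) of size 50x60 maps back to (200, 120) with size 100x120.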

The last step is drawing the detection boxes:

    for (const auto &detection : detections) {
      cv::rectangle(image, detection.box, cv::Scalar(255, 255, 255), 4);
      cv::putText(image, std::to_string(detection.class_id),
                  cv::Point(detection.box.x, detection.box.y), font_face,
                  font_scale, cv::Scalar(255, 255, 0), 4);
    }
    cv::imwrite(std::string("output") + std::to_string(i) + ".jpg", image);

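Here is the result. Pretty impressive, isn't it?

[Figure: detection result]

Closing words

I hope you have learned something useful. This is the end of the course.

If you have not given the project a star yet, please star it on GitHub: https://github.com/zjhellofss/KuiperInfer
If you want more, you are welcome to join the development of the KuiperInfer project. We will also run the course a second time, in cooperation with a large open-source community. If you would like to serve as a teaching assistant (improving the course materials and answering questions, or adding lessons in an area you are good at), please contact me via lyrry1997. It would be a very meaningful thing to do.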