YOLO v1: From Principles to Code (Part 1)

算法恩仇录

Last modified 2022-02-06 00:15:30

First published 2021-06-18 21:26:06

Article link: https://blog.csdn.net/u010751974/article/details/118025270

Introduction

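Object detection is concerned with locating specific objects in an image. A detection task consists of two subtasks: classification, which outputs the category of the object, and localization, which outputs where the object is. It is like picking strawberries: you have to recognize a strawberry and also find where it is.
An object's position in the image is given by the coordinates of a rectangular box, $(X_{min}, Y_{min}, X_{max}, Y_{max})$, where $(X_{min}, Y_{min})$ is the top-left corner and $(X_{max}, Y_{max})$ is the bottom-right corner.
The object's center can be computed from these coordinates: $(X_{min} + (X_{max} - X_{min}) / 2,\ Y_{min} + (Y_{max} - Y_{min}) / 2)$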
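The conversion between the corner format and the center format can be sketched in a few lines of Python. The function names here are illustrative and do not come from any particular library.

```python
def corners_to_center(xmin, ymin, xmax, ymax):
    """Convert a corner-format box to (center_x, center_y, width, height)."""
    width = xmax - xmin
    height = ymax - ymin
    return xmin + width / 2, ymin + height / 2, width, height


def center_to_corners(cx, cy, width, height):
    """Inverse conversion: center format back to corner format."""
    return cx - width / 2, cy - height / 2, cx + width / 2, cy + height / 2
```

For a box with corners (10, 20) and (50, 100), the center is (30, 60) and the size is 40 by 80.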

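Early on, traditional object detection algorithms did not use deep learning and generally had three stages: region selection (finding where the objects are), feature extraction (describing an object's features; think of the color and shape of an apple versus a watermelon: an apple is a small red ellipse, a watermelon is a big green ellipse), and feature classification (on seeing a small red ellipse you know it is an apple; on seeing a big green ellipse you know it is a watermelon).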

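Region selection: a sliding-window algorithm (picture a window moving across the image from left to right and from top to bottom, framing the image content) selects the positions where an object might appear. This produces a large number of redundant boxes and has a high computational cost.
Feature extraction: hand-designed feature extractors (such as SIFT and HOG) extract the features.
Feature classification: a classifier (such as an SVM) classifies the features extracted in the previous step.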
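To get a feel for why sliding windows produce so many candidate boxes, here is a minimal sketch of the region selection step. The image size, window size, and stride are made-up values for illustration only.

```python
def sliding_windows(image_w, image_h, win_w, win_h, stride):
    """Yield (xmin, ymin, xmax, ymax) for every window position."""
    for ymin in range(0, image_h - win_h + 1, stride):      # top to bottom
        for xmin in range(0, image_w - win_w + 1, stride):  # left to right
            yield xmin, ymin, xmin + win_w, ymin + win_h


# Hypothetical setup: a 448x448 image, one 64x64 window, stride 16.
windows = list(sliding_windows(448, 448, 64, 64, 16))
print(len(windows))  # 25 positions per axis -> 625 windows
```

That is 625 candidate regions for a single window size; a real detector would repeat this for several window sizes and aspect ratios, and each region would still need feature extraction and classification.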

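The YOLO Algorithm Pipeline

The following assumes some background in deep learning.

YOLO v1 is a deep learning method. It can be viewed as a black box: given an image as input, it outputs the class and the position of the objects in the image.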

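Why can the network output both the class and the position of an object? To answer that, we need to know what the network's output looks like.
[Figure: structure of the network output]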

The network's output tensor has size $7 \times 7 \times 30$, that is,
$S \times S \times (5 \times B + C)$
with $S = 7$, $B = 2$, $C = 20$.

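$7 \times 7$ means the final feature map is divided into a $7 \times 7$ grid of cells.
Each grid cell has 2 bounding boxes (bbox for short), each predicting $x, y, w, h$: $x, y$ are the coordinates of the object's center, and $w, h$ are the width and height of the bbox.
Each bbox also has a $confidence$, which reflects the probability that it contains an object.
Each grid cell predicts $B$ ($B = 2$) bounding boxes and confidence scores for those boxes.
- The paper defines $confidence$ as $Pr(Object) \times IOU^{truth}_{pred}$
  - $IOU^{truth}_{pred}$ is the IOU (the overlap ratio of two boxes) between the bbox generated by the network and the annotated GT bbox.
  - No object: If no object exists in that cell, the $confidence$ scores should be zero. $\Rightarrow Pr(Object) = 0$
  - Object present: Otherwise we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth. $\Rightarrow Pr(Object) = 1$
$C$ is the class probabilities, from which the object's class label is obtained. The YOLO v1 paper uses the VOC dataset, which has 20 classes. In actual code there is effectively also a background case: when a cell contains no object, confidence = 0 and the cell counts as background.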
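The arithmetic behind the 30 channels, and one way to slice a single cell's vector, can be sketched as follows. The channel ordering (class probabilities first, then confidence and box for each bbox) is an assumption made for this example; the paper does not fix an ordering, and implementations differ. The helper names are hypothetical.

```python
S, B, C = 7, 2, 20


def output_depth(num_boxes=B, num_classes=C):
    """Values per grid cell: 5 per bbox (x, y, w, h, confidence) plus C classes."""
    return 5 * num_boxes + num_classes


def split_cell(cell, num_boxes=B, num_classes=C):
    """Split one cell's vector into class probabilities and per-box predictions.

    Assumed layout: [C class probs | conf, x, y, w, h | conf, x, y, w, h].
    """
    class_probs = cell[:num_classes]
    boxes = []
    for b in range(num_boxes):
        start = num_classes + 5 * b
        conf, x, y, w, h = cell[start:start + 5]
        boxes.append({"conf": conf, "x": x, "y": y, "w": w, "h": h})
    return class_probs, boxes


print(output_depth())          # 5 * 2 + 20 = 30
print(S * S * output_depth())  # 7 * 7 * 30 = 1470 values in total
```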

How a single network learns to predict both the object class and the object position comes down to the loss function.

Loss Function

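The loss function has three parts: (1) center, width, and height (2.1); (2) confidence (2.2); (3) the object's class label (2.3).

Let's go through the details of each part.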

2.1 Loss for center, width, and height
[Figure: loss formula for center, width, and height]

$x, y$ : predicted bbox center, $w, h$ : predicted bbox width & height
$\hat{x}, \hat{y}$ : labeled bbox center, $\hat{w}, \hat{h}$ : labeled bbox width & height
$\sqrt{w},\sqrt{h}$ : suppresses the effect of larger bboxes. Large objects have large variations in width and height and small objects have small ones; after taking the square root, the two differ much less.

$i$ : $0$ to $S^2 - 1$ [iterates over each grid cell ($0$ to $48$)]
$j$ : $0$ to $B - 1$ [iterates over each bbox ($0$ to $1$)]
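Written out with the symbols defined above, the localization term of the YOLO v1 loss looks like the following. This is a reconstruction from the paper rather than from this article's text: $\lambda_{coord}$ is the paper's weighting factor (set to 5 there), and $l_{ij}^{obj}$ is the indicator that bbox $j$ of cell $i$ is responsible for an object.

```latex
\lambda_{coord} \sum_{i=0}^{S^2-1} \sum_{j=0}^{B-1} l_{ij}^{obj}
  \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right]
+ \lambda_{coord} \sum_{i=0}^{S^2-1} \sum_{j=0}^{B-1} l_{ij}^{obj}
  \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2
       + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right]
```

As a quick check of the square-root effect: an error of 10 on a width of 100 gives $(\sqrt{110} - \sqrt{100})^2 \approx 0.24$, while the same error on a width of 20 gives $(\sqrt{30} - \sqrt{20})^2 \approx 1.01$, so the deviation on the small box is penalized more.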

2.2 Confidence loss
[Figure: confidence loss formula]

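$\hat{C_i}$ : confidence score [IoU] of the predicted and labeled bbox

In other words, $\hat{C_i}$ is the IOU between the $x, y, w, h$ generated by the network and the label's $\hat{x}, \hat{y}, \hat{w}, \hat{h}$.

$C_i$ : predicted confidence score [IoU] generated by the network

About "Test" in the figure: the formula comes from the paper; in code, the network outputs the confidence value directly.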
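Since the confidence target depends on the IOU between a predicted box and a labeled box, here is a minimal sketch of that computation for boxes in $(x, y, w, h)$ center format. The function name `iou` is illustrative.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (center_x, center_y, width, height)."""
    # Convert both boxes to corner coordinates.
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2

    # Width and height of the overlap; clamp to 0 when the boxes do not overlap.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0
```

Two 2 by 2 boxes centered at (0, 0) and (1, 1) overlap in a 1 by 1 square, so their IoU is 1 / (4 + 4 - 1) = 1/7.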
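- $l_{ij}^{obj}$ (written $\mathbb{1}_{ij}^{obj}$ in the paper) marks the positive samples and works like a mask. In the table below, for example, the positive-sample positions are 1, where $C_i == 1$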

0	0	0	0	0	0	0
0	0	0	0	0	1	0
0	0	0	0	0	0	0
0	0	1	0	0	0	0
0	1	0	0	0	0	0
0	0	0	0	0	0	0
0	0	0	0	0	0	0

$l_{ij}^{noobj}$ marks the negative samples and also works like a mask. In the table below, for example, the negative-sample positions are 1, where $C_i == 0$

1	1	1	1	1	1	1
1	1	1	1	1	0	1
1	1	1	1	1	1	1
1	1	0	1	1	1	1
1	0	1	1	1	1	1
1	1	1	1	1	1	1
1	1	1	1	1	1	1

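For $l_{ij}^{obj}$, we have $B$ predictions in each cell, and only the one with the largest IoU is labeled 1. The other bbox is normally treated as noobj; if its IOU with the GT is also high (for example 0.65), it can be left out of noobj so that it does not take part in the loss.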
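The rule above can be sketched for a single grid cell. In the paper the confidence term is $\sum_i \sum_j l_{ij}^{obj} (C_i - \hat{C}_i)^2 + \lambda_{noobj} \sum_i \sum_j l_{ij}^{noobj} (C_i - \hat{C}_i)^2$ with $\lambda_{noobj} = 0.5$. The function below is a hypothetical illustration of that term, not the code of any specific implementation; it uses the IoU as the positive target, whereas implementations that store 1 in the label would use 1 instead.

```python
LAMBDA_NOOBJ = 0.5  # weighting from the YOLO v1 paper


def confidence_loss_for_cell(pred_confs, ious, has_object, ignore_iou=None):
    """Confidence loss for one grid cell.

    pred_confs: predicted confidence C for each of the B boxes
    ious:       IoU of each predicted box with the ground-truth box
    has_object: True if a ground-truth center falls in this cell
    ignore_iou: optional threshold; a non-responsible box at or above it is skipped
    """
    if not has_object:
        # Every box is a negative sample with target confidence 0.
        return LAMBDA_NOOBJ * sum(c ** 2 for c in pred_confs)

    # The box with the largest IoU is the one responsible for the object.
    responsible = max(range(len(ious)), key=lambda j: ious[j])
    loss = 0.0
    for j, (c, u) in enumerate(zip(pred_confs, ious)):
        if j == responsible:
            loss += (c - u) ** 2           # positive sample, target is the IoU
        elif ignore_iou is None or u < ignore_iou:
            loss += LAMBDA_NOOBJ * c ** 2  # treated as noobj, target is 0
        # Otherwise: a high-IoU runner-up, left out of the loss entirely.
    return loss
```

With predicted confidences `[0.8, 0.6]` and IoUs `[0.7, 0.3]`, the first box is responsible and the loss is $(0.8 - 0.7)^2 + 0.5 \times 0.6^2 = 0.19$. With IoUs `[0.7, 0.66]` and `ignore_iou=0.65`, the second box is skipped and the loss drops to 0.01.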

2.3 Loss for the object's class label
[Figure: class label loss formula]

Each cell predicts only one object, which is decided by the bbox with the largest IoU.
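For reference, the classification term in the YOLO v1 paper can be written as below. The symbols $p_i(c)$ and $\hat{p}_i(c)$ (the two class-probability vectors of cell $i$, one predicted and one from the label) come from the paper, not from this article; $l_i^{obj}$ indicates that an object appears in cell $i$. The sum runs over cells only, not over bboxes, which is why each cell can carry a single class.

```latex
\sum_{i=0}^{S^2-1} l_i^{obj} \sum_{c \in \text{classes}}
  \left( p_i(c) - \hat{p}_i(c) \right)^2
```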
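Here is a question to think about: if two objects are very close together and their centers are also close, will the label contain two objects or one? This mainly depends on how the code is implemented; the code below labels one.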

        # Excerpt from a dataset method (requires `import torch`); `boxes`,
        # `image`, and `label_matrix` are created earlier in that method.
        for box in boxes:
            class_label, x, y, width, height = box.tolist()
            class_label = int(class_label)

            # i, j are the row and column of the cell containing the box center
            i, j = int(self.S * y), int(self.S * x)
            # Box center relative to the top-left corner of that cell
            x_cell, y_cell = self.S * x - j, self.S * y - i

            """
            Calculating the width and height of the bounding box
            relative to the cell is done by the following, with
            width as the example:

            width_pixels = (width*self.image_width)
            cell_pixels = (self.image_width/self.S)

            Then to find the width relative to the cell is simply:
            width_pixels/cell_pixels, simplification leads to the
            formulas below.
            """
            width_cell, height_cell = (
                width * self.S,
                height * self.S,
            )

            # If no object already found for specific cell i,j
            # Note: This means we restrict to ONE object
            # per cell!
            if label_matrix[i, j, 20] == 0:
                # Set that there exists an object
                label_matrix[i, j, 20] = 1

                # Box coordinates
                box_coordinates = torch.tensor(
                    [x_cell, y_cell, width_cell, height_cell]
                )

                label_matrix[i, j, 21:25] = box_coordinates

                # Set one hot encoding for class_label
                label_matrix[i, j, class_label] = 1

        return image, label_matrix

This code labels only one object per cell.
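To see the one-object-per-cell behavior in action, here is a self-contained pure-Python sketch of the same encoding logic. It is a simplified stand-in rather than the original code: it uses nested lists instead of a `torch` tensor, a depth of $C + 5$ per cell, and a hypothetical `skipped` list to report the boxes that were dropped.

```python
S, C = 7, 20


def encode_labels(boxes, S=S, C=C):
    """Build an S x S x (C + 5) label grid from (class, x, y, w, h) boxes.

    All box values are normalized to [0, 1] relative to the image.
    Returns the grid and the list of boxes dropped because their cell was taken.
    """
    grid = [[[0.0] * (C + 5) for _ in range(S)] for _ in range(S)]
    skipped = []
    for class_label, x, y, w, h in boxes:
        i, j = int(S * y), int(S * x)      # cell row, cell column
        if grid[i][j][C] == 1.0:           # this cell already owns an object
            skipped.append((class_label, x, y, w, h))
            continue
        grid[i][j][C] = 1.0                # objectness flag
        # Center relative to the cell, size in units of cells
        grid[i][j][C + 1:C + 5] = [S * x - j, S * y - i, w * S, h * S]
        grid[i][j][class_label] = 1.0      # one-hot class
    return grid, skipped


# Two made-up objects whose centers fall into the same cell (row 3, column 3).
boxes = [(11, 0.50, 0.50, 0.2, 0.3), (14, 0.52, 0.51, 0.4, 0.6)]
grid, skipped = encode_labels(boxes)
print(len(skipped))  # 1: the second object is not labeled
```

Only the first object ends up in the label; the second is silently dropped, which is exactly the limitation the question above points at.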

Code source
