[Object Detection] YOLO V3

aolaf

Last modified 2023-03-25 20:58:04


Category: Deep Learning 03 - Object Detection. Tags: object detection, YOLO, computer vision

First published 2020-06-22 16:32:08

Article link: https://blog.csdn.net/weixin_42454048/article/details/106902335


1. YOLO V3 Algorithm Analysis

1.1 Network Architecture Diagram

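[Figure: YOLO V3 network architecture] The three blue boxes at the bottom of the figure are the three modules used throughout the network:
CBL: the smallest building block in the YOLOv3 architecture, made up of Conv (convolution) + BN (batch normalization) + Leaky ReLU.
Res unit: the residual unit, borrowed from the residual structure in ResNet, which lets the network be built deeper.
ResN: N is a number (res1, res2, … , res8) giving how many Res units the res_block contains. Each ResN consists of one CBL followed by N residual units and is the large building block of YOLOv3. The CBL at the front of every Res module performs downsampling, so after the 5 Res modules the feature map shrinks as 608->304->152->76->38->19.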
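The size chain above can be checked with a few lines of Python. This is only a sketch: the helper name `downsample_sizes` is made up for illustration, and it assumes each of the five stages halves the side length exactly, as the text describes.

```python
def downsample_sizes(input_size, num_stages=5):
    """Return the feature-map side length before and after each stride-2 stage."""
    sizes = [input_size]
    for _ in range(num_stages):
        # A stride-2 CBL halves the side length of the feature map.
        sizes.append(sizes[-1] // 2)
    return sizes


print(downsample_sizes(608))
```

For a 416 input the same helper ends at 13, which matches the 13 X 13 grid used later in the article.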

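Tensor concatenation vs. tensor addition
Concat: tensor concatenation joins two tensors along the channel dimension, so the channel count grows. For example, concatenating 26×26×256 with 26×26×512 gives 26×26×768.
Add: element-wise tensor addition leaves the shape unchanged. For example, adding 104×104×128 and 104×104×128 still gives 104×104×128.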
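The difference is easiest to see on the shapes alone. The two helpers below are hypothetical and work on plain (H, W, C) tuples rather than real tensors; they only illustrate the shape rules stated above.

```python
def concat_shape(a, b):
    """Shape after concatenating two (H, W, C) tensors along the channel axis."""
    if a[:2] != b[:2]:
        raise ValueError("spatial sizes must match for concatenation")
    return (a[0], a[1], a[2] + b[2])


def add_shape(a, b):
    """Shape after element-wise addition: both shapes must be identical."""
    if a != b:
        raise ValueError("shapes must match for addition")
    return a


print(concat_shape((26, 26, 256), (26, 26, 512)))
print(add_shape((104, 104, 128), (104, 104, 128)))
```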

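Number of convolutional layers in the backbone:
Each ResN contains 1+2×N convolutional layers, so the whole backbone contains 1+(1+2×1)+(1+2×2)+(1+2×8)+(1+2×8)+(1+2×4)=52 convolutional layers. Adding one FC (fully connected) layer gives the Darknet53 classification network. For object detection, YOLOv3 drops the FC layer, but for convenience the backbone is still called Darknet53.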
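The count can be reproduced directly from the block configuration. The names `RES_BLOCKS` and `backbone_conv_layers` are illustrative, not taken from any YOLO codebase.

```python
# Number of residual units in each of the five ResN blocks.
RES_BLOCKS = [1, 2, 8, 8, 4]


def backbone_conv_layers(res_blocks):
    """Count conv layers: 1 stem conv, plus (1 + 2*N) per ResN block."""
    stem_conv = 1
    return stem_conv + sum(1 + 2 * n for n in res_blocks)


print(backbone_conv_layers(RES_BLOCKS))  # 52
```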

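1.2 Network Backbone

Fully convolutional network
In YOLO V2, five max-pooling layers shrink the feature map to $1/2^5$, i.e. 1/32, of the input size. YOLO V3 has no pooling layers and no fully connected layers at all. During the forward pass, tensor size changes (downsampling) are done by changing the convolution stride: stride=(2, 2), for example, halves each side of the feature map (so the area drops to 1/4). With a 608x608 input the output is 19x19 (608/32=19). The parts circled in red in the YOLO V3 figure below are the five downsampling steps.

Input size must be a multiple of 32
As in V2, the YOLO V3 backbone shrinks the output feature map to 1/32 of the input, so the input image side is normally required to be a multiple of 32.
Darknet-19 (YOLO V2) and Darknet-53 (YOLO V3) compare as follows:
[Figure: Darknet-19 vs. Darknet-53 comparison]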
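The multiple-of-32 requirement is simple to enforce before building the network input. The sketch below uses two hypothetical helpers: one rounds an arbitrary size up to the next valid input size, the other computes the final grid size and rejects invalid inputs.

```python
def round_up_to_stride(size, stride=32):
    """Smallest multiple of `stride` that is >= size."""
    return ((size + stride - 1) // stride) * stride


def output_grid(size, stride=32):
    """Side length of the coarsest feature map for a valid input size."""
    if size % stride != 0:
        raise ValueError(f"input size {size} is not a multiple of {stride}")
    return size // stride


print(round_up_to_stride(600), output_grid(608), output_grid(416))
```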

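1.3 Network Details

1.3.1 Network Input

The network takes a 416 X 416 input image. The original image is resized with its aspect ratio locked, using the scale factor min(w/img_w, h/img_h): the longer side is scaled to the required input size of 416 and the shorter side is scaled by the same factor, so the image is not distorted. Suppose the original image has img_w = 768 and img_h = 576. The scale factor is k = min(416/768, 416/576) = 0.54, and the resized image is new_w = 416, new_h = 312. The required input is w = 416, h = 416, so, as in the figure below, the remaining area is filled with gray (128,128,128) to build the 416*416 input. The original image has to go through this step in both training and testing.
[Figure: letterbox resize with gray padding]
The image resizing code is: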

import cv2
import numpy as np


def letterbox_image(img, inp_dim):
    """
    letterbox_image() resizes the image with its aspect ratio unchanged and
    fills the leftover area with (128,128,128).
    One side ends up exactly equal to the target length and the other is less
    than or equal to it. The resized data is copied to the center of the canvas.
    """
    img_w, img_h = img.shape[1], img.shape[0]
    w, h = inp_dim  # target (width, height), e.g. (416, 416)
    # Scale by min(w/img_w, h/img_h): the longer side lands on the target
    # length and the other side is left short of it.
    scale = min(w / img_w, h / img_h)
    new_w = int(img_w * scale)
    new_h = int(img_h * scale)
    # Bicubic interpolation; e.g. a 768 x 576 image becomes 416 x 312.
    resized_image = cv2.resize(img, (new_w, new_h), interpolation=cv2.INTER_CUBIC)
    # Gray h x w x 3 canvas (416 x 416 x 3 here). dtype=np.uint8 keeps it a
    # valid image array; np.full would otherwise pick a platform integer dtype.
    canvas = np.full((h, w, 3), 128, dtype=np.uint8)
    # Copy the resized image into the center of the canvas.
    top, left = (h - new_h) // 2, (w - new_w) // 2
    canvas[top:top + new_h, left:left + new_w, :] = resized_image

    return canvas
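Because the image is scaled and then shifted onto the canvas, any box predicted on the letterboxed image has to be mapped back before it is drawn on the original. The following is a minimal sketch of that geometry. The function names are hypothetical, and it assumes the same scale and centered padding as letterbox_image() above; the small error from truncating new_w and new_h to integers is ignored.

```python
def letterbox_params(img_w, img_h, w, h):
    """Scale factor, resized size and padding offsets used by the letterbox."""
    scale = min(w / img_w, h / img_h)
    new_w, new_h = int(img_w * scale), int(img_h * scale)
    pad_x, pad_y = (w - new_w) // 2, (h - new_h) // 2
    return scale, new_w, new_h, pad_x, pad_y


def to_letterbox(x, y, scale, pad_x, pad_y):
    """Map a point from original-image pixels to letterboxed-image pixels."""
    return x * scale + pad_x, y * scale + pad_y


def to_original(x, y, scale, pad_x, pad_y):
    """Inverse mapping: letterboxed-image pixels back to the original image."""
    return (x - pad_x) / scale, (y - pad_y) / scale


scale, new_w, new_h, pad_x, pad_y = letterbox_params(768, 576, 416, 416)
print(new_w, new_h, pad_x, pad_y)
```

For the 768 x 576 example the resized image is 416 x 312, so there is no horizontal padding and a 52-pixel gray band above and below.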

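1.3.2 Multi-Scale Prediction

To improve the detection rate for small objects, YOLO V3 outputs feature maps at 3 scales: 1/8, 1/16 and 1/32 of the input. The idea is borrowed from SSD and FPN: objects of different sizes are detected at different scales. The 1/8 map has the highest resolution and rich positional information, so it mainly detects small objects. The 1/32 map has the lowest resolution, rich semantic information and a large receptive field, so it mainly detects large objects.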

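1.3.3 255 Output Channels

All 3 output feature maps have 255 channels.
Each anchor box predicts five basic parameters (t_x, t_y, t_w, t_h, confidence) plus the probabilities of 80 classes. YOLO V3 assigns 3 prior boxes to every grid cell, so 3*(5 + 80) = 255.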
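The same arithmetic gives the channel count for any dataset. The helper below is illustrative only; the 20-class call is an assumed example of a smaller dataset, not something from the article.

```python
def output_channels(num_anchors=3, num_classes=80):
    """Channels per output map: anchors * (4 box terms + 1 confidence + classes)."""
    return num_anchors * (4 + 1 + num_classes)


print(output_channels())                 # 255
print(output_channels(num_classes=20))   # 75
```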

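1.4 Anchor Box Settings

Using k-means clustering, YOLO V3 assigns 3 anchor boxes of different sizes to each of the 3 output feature maps (13 X 13, 26 X 26, 52 X 52), so every grid cell has 3 anchor boxes.
The 1/8 feature map has the highest resolution and the smallest receptive field and suits small objects; its anchor boxes are (10, 13); (16, 30); (33, 23).
The 1/16 feature map suits medium-sized objects; its anchor boxes are (30, 61); (62, 45); (59, 119).
The 1/32 feature map has the lowest resolution and the largest receptive field and suits large objects; its anchor boxes are (116, 90); (156, 198); (373, 326).
So for a 416×416 input there are (52×52+26×26+13×13)×3=10647 prior boxes in total.
[Figure: anchor box sizes at each scale]
The prior box sizes in the figure above are relative to the 416 X 416 input image. Dividing by the downsampling factor gives the anchor box size on the feature map, e.g. (116 X 90) -> (116/32, 90/32).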
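Both the total count and the per-scale rescaling can be written down compactly. In this sketch the dictionary `ANCHORS` (keyed by stride) and the two function names are assumptions made for illustration; the anchor values themselves are the ones listed above.

```python
# Anchor boxes (width, height) in input-image pixels, keyed by stride.
ANCHORS = {
    8: [(10, 13), (16, 30), (33, 23)],
    16: [(30, 61), (62, 45), (59, 119)],
    32: [(116, 90), (156, 198), (373, 326)],
}


def total_priors(input_size, anchors):
    """Total number of prior boxes over all scales."""
    return sum((input_size // s) ** 2 * len(boxes) for s, boxes in anchors.items())


def anchors_on_feature_map(anchors):
    """Divide each anchor by its stride to express it in grid-cell units."""
    return {s: [(w / s, h / s) for w, h in boxes] for s, boxes in anchors.items()}


print(total_priors(416, ANCHORS))  # 10647
print(anchors_on_feature_map(ANCHORS)[32][0])
```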

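To get a feel for the 9 prior box sizes, see the figure below: the blue boxes are the clustered prior boxes, the yellow box is the ground truth, and the red box is the grid cell containing the object's center.
[Figure: the 9 prior boxes, the ground truth and the center grid cell]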

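1.5 Encoding and Decoding the Predicted Bounding Box

1.5.1 Encoding During Training:

In the figure, b_x, b_y, b_w, b_h are the predicted box parameters on the feature map, and p_w, p_h are the anchor box width and height on the feature map.

Computing t_x, t_y, t_w, t_h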
t_x = G_x - C_x
t_y = G_y - C_y
t_w = log(G_w / p_w)
t_h = log(G_h / p_h)

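Here G_x, G_y, G_w, G_h are the ground-truth box center and size on that feature map, and C_x, C_y are the integer coordinates of the grid cell containing the center.
During training, t_x, t_y, t_w, t_h are the target labels the network has to learn.

Taking the log of the width/height ratio
t_w and t_h come from the ratio of the GT box width and height to the anchor box width and height. Instead of regressing the bounding box size directly, the ratio is mapped into log space, because direct regression can produce unstable gradients during training. Without the transform, directly predicting the relative deformation would require the predicted value to be > 0 (a box width or height cannot be negative). That is an optimization problem with an inequality constraint, which plain SGD cannot handle directly. Taking the log first removes the inequality constraint.
Code implementation: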

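# Fragment: c_x, c_y, box_w, box_h (pixels), the stride s and anchor_size
# come from the caller; np is numpy.
# Map the ground-truth box onto the grid scale
c_x_s = c_x / s
c_y_s = c_y / s
box_ws = box_w / s
box_hs = box_h / s
# Coordinates of the grid cell that contains the center point
grid_x = int(c_x_s)
grid_y = int(c_y_s)
# Anchor box width and height (on the same feature-map scale as box_ws, box_hs)
p_w, p_h = anchor_size
# Regression targets the network has to learn
tx = c_x_s - grid_x
ty = c_y_s - grid_y
tw = np.log(box_ws / p_w)
th = np.log(box_hs / p_h)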
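A small numeric check makes the point about the log transform concrete: any real-valued t decodes to a positive size, and a box twice as wide as the anchor gets the same magnitude of target as one half as wide. The two helper names below are made up for this illustration.

```python
import math


def encode_size(box_size, anchor_size):
    """Log-space target for one dimension (width or height)."""
    return math.log(box_size / anchor_size)


def decode_size(t, anchor_size):
    """Inverse of encode_size: always positive, whatever the sign of t."""
    return anchor_size * math.exp(t)


# Twice and half the anchor size give targets of equal magnitude, opposite sign.
print(encode_size(20.0, 10.0), encode_size(5.0, 10.0))
# Even a strongly negative t decodes to a positive size.
print(decode_size(-5.0, 10.0))
```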

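1.5.2 Decoding During Inference:

[Figure: decoding a predicted box from an anchor box]

Computing b_x, b_y, b_w, b_h
b_x = sigmoid(t_x) + C_x
b_y = sigmoid(t_y) + C_y
b_w = p_w * exp(t_w)
b_h = p_h * exp(t_h)
During inference, t_x, t_y, t_w, t_h are the predictions output by the network.

Applying sigmoid to t_x and t_y:
t_x and t_y represent the quantization error between the downsampled GT center and the grid cell it belongs to, which lies in [0, 1]. During training the sigmoid function is therefore used to constrain the network output to the [0, 1] interval.

With the predicted center offsets (t_x, t_y) and scale factors (t_w, t_h), the anchor box can be fine-tuned until it coincides with the ground truth.
In the figure below, the red box is the anchor box and the green box is the ground truth. Translation plus scaling first moves the solid red box to the dashed red box and then scales it to the green box.
[Figure: translating and scaling an anchor box onto the ground truth]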
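Encoding and decoding are inverses of each other, which a round trip shows. The sketch below is written in plain Python under a few assumptions: boxes are (cx, cy, w, h) in input-image pixels, the raw network output for the center passes through a sigmoid, and `logit` is used only to construct a raw output that would reproduce the training target exactly. All function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    """Inverse of sigmoid, used here only to build a perfect raw prediction."""
    return math.log(p / (1.0 - p))


def encode(gt_box, anchor_wh, stride):
    """Return the grid cell and the targets (offset_x, offset_y, t_w, t_h)."""
    cx, cy, bw, bh = (v / stride for v in gt_box)
    pw, ph = (v / stride for v in anchor_wh)
    gx, gy = int(cx), int(cy)
    return (gx, gy), (cx - gx, cy - gy, math.log(bw / pw), math.log(bh / ph))


def decode(raw, cell, anchor_wh, stride):
    """Turn raw outputs (t_x, t_y, t_w, t_h) back into a pixel-space box."""
    tx, ty, tw, th = raw
    pw, ph = (v / stride for v in anchor_wh)
    bx = sigmoid(tx) + cell[0]
    by = sigmoid(ty) + cell[1]
    bw = pw * math.exp(tw)
    bh = ph * math.exp(th)
    return tuple(v * stride for v in (bx, by, bw, bh))


gt = (200.0, 150.0, 120.0, 100.0)
cell, (ox, oy, tw, th) = encode(gt, (116, 90), 32)
# A network that fits the targets perfectly would output these raw values.
raw = (logit(ox), logit(oy), tw, th)
print(cell, decode(raw, cell, (116, 90), 32))
```

In this example the box center falls in grid cell (6, 4) of the stride-32 map, and decoding the perfect raw output recovers the original box up to floating-point error.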

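1.6 Positive/Negative Sample Matching and Training Label Generation

①. Compute the IoU between every prior box and the GT box.
②. The prior box with the largest IoU with the GT is the positive sample. Among the rest, those with IoU below the negative-sample threshold are negative samples and the others are ignored samples.
③. The index of that prior box gives the feature map scale (stride) and the prior box width and height.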
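The three steps can be sketched without any tensor library. The code below is an illustration under stated assumptions: the IoU compares shapes only (both boxes share a center), the 0.5 threshold is an assumed value, and `shape_iou` and `match_anchors` are hypothetical names rather than functions from the article's codebase.

```python
ANCHOR_SIZES = [(10, 13), (16, 30), (33, 23), (30, 61), (62, 45), (59, 119),
                (116, 90), (156, 198), (373, 326)]
STRIDES = [8, 16, 32]


def shape_iou(wh_a, wh_b):
    """IoU of two boxes that share the same center (only width/height matter)."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union


def match_anchors(gt_wh, anchors, ignore_thresh=0.5):
    """Label every prior box as positive, ignore or negative for one GT box."""
    ious = [shape_iou(a, gt_wh) for a in anchors]
    best = max(range(len(ious)), key=ious.__getitem__)
    labels = []
    for i, iou in enumerate(ious):
        if i == best:
            labels.append("positive")
        elif iou > ignore_thresh:
            labels.append("ignore")
        else:
            labels.append("negative")
    return best, labels


best, labels = match_anchors((100, 100), ANCHOR_SIZES)
anchors_per_scale = len(ANCHOR_SIZES) // len(STRIDES)
# Step 3: the index of the best prior box gives the stride and the anchor slot.
print(best, STRIDES[best // anchors_per_scale], best % anchors_per_scale, labels)
```

For a 100 x 100 ground-truth box, the (116, 90) prior box is the positive sample on the stride-32 map, the (59, 119) prior box exceeds the threshold without being the best match and is ignored, and the remaining seven are negatives.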


import numpy as np
import torch


def multi_gt_creator(input_size, strides, label_lists, anchor_size, ignore_thresh=0.5):
    """Build the training targets for every scale.

    set_anchors() and compute_iou() are helpers from the surrounding project.
    ignore_thresh was an undefined name in the original listing; it is made a
    parameter here and the 0.5 default is an assumption.
    """
    batch_size = len(label_lists)
    h = w = input_size
    num_scale = len(strides)
    all_anchor_size = anchor_size
    anchor_number = len(all_anchor_size) // num_scale

    # One target array per scale. Last axis:
    # [objectness, class, tx, ty, tw, th, weight, xmin, ymin, xmax, ymax]
    gt_tensor = []
    for s in strides:
        gt_tensor.append(np.zeros([batch_size, h//s, w//s, anchor_number, 1+1+4+1+4]))

    # Generate the ground-truth data
    for batch_index in range(batch_size):
        for gt_label in label_lists[batch_index]:
            # Box coordinates (normalized to [0, 1]) and class id
            gt_class = int(gt_label[-1])
            xmin, ymin, xmax, ymax = gt_label[:-1]
            # Center and size of the ground-truth box in pixels
            c_x = (xmax + xmin) / 2 * w
            c_y = (ymax + ymin) / 2 * h
            box_w = (xmax - xmin) * w
            box_h = (ymax - ymin) * h

            # Skip dirty data (boxes narrower or shorter than one pixel)
            if box_w < 1. or box_h < 1.:
                continue

            # IoU between the prior boxes and the ground-truth box
            anchor_boxes = set_anchors(all_anchor_size)
            gt_box = np.array([[0, 0, box_w, box_h]])
            iou = compute_iou(anchor_boxes, gt_box)

            # Threshold filtering
            iou_mask = (iou > ignore_thresh)

            # The prior box with the largest IoU is always the positive sample,
            # even when every IoU is below ignore_thresh. Other prior boxes above
            # the threshold are ignored; the rest stay negative (all zeros).
            best_index = np.argmax(iou)
            for index, iou_m in enumerate(iou_mask):
                if index != best_index and not iou_m:
                    continue
                # Scale and anchor slot this prior box belongs to
                s_indx = index // anchor_number
                ab_ind = index - s_indx * anchor_number
                # Downsampling factor of that scale
                s = strides[s_indx]
                # Grid cell that contains the box center
                c_x_s = c_x / s
                c_y_s = c_y / s
                grid_x = int(c_x_s)
                grid_y = int(c_y_s)
                if grid_y >= gt_tensor[s_indx].shape[1] or grid_x >= gt_tensor[s_indx].shape[2]:
                    continue
                # View into the target array, so the writes below modify gt_tensor
                cell = gt_tensor[s_indx][batch_index, grid_y, grid_x, ab_ind]
                if index == best_index:
                    # Positive sample: build the learning targets
                    p_w, p_h = anchor_boxes[index, 2], anchor_boxes[index, 3]
                    tx = c_x_s - grid_x
                    ty = c_y_s - grid_y
                    tw = np.log(box_w / p_w)
                    th = np.log(box_h / p_h)
                    weight = 2.0 - (box_w / w) * (box_h / h)
                    cell[0] = 1.0
                    cell[1] = gt_class
                    cell[2:6] = np.array([tx, ty, tw, th])
                    cell[6] = weight
                    cell[7:] = np.array([xmin, ymin, xmax, ymax])
                else:
                    # IoU above the threshold but not the largest: ignored
                    cell[0] = -1.0
                    cell[6] = -1.0

    gt_tensor = [gt.reshape(batch_size, -1, 1+1+4+1+4) for gt in gt_tensor]
    gt_tensor = np.concatenate(gt_tensor, 1)

    return torch.from_numpy(gt_tensor).float()

1.7 Loss Functions

Confidence loss: BCE loss
Class loss: BCE loss
Center offsets t_x, t_y: BCE loss
Width/height scaling t_w, t_h: MSE loss
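For reference, the two loss types look like this on scalar values. This is a plain-Python sketch, not the training code: the clamping constant `eps`, the function names, and the way the box terms are summed and scaled by the per-box weight are all assumptions made for illustration.

```python
import math


def bce(p, y, eps=1e-7):
    """Binary cross-entropy between a probability p and a target y in [0, 1]."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def mse(pred, target):
    """Squared error for a single value."""
    return (pred - target) ** 2


def box_loss(pred_xy, pred_wh, target_xy, target_wh, weight=1.0):
    """BCE on the sigmoid center offsets plus MSE on t_w, t_h, scaled by weight."""
    loss_xy = sum(bce(p, t) for p, t in zip(pred_xy, target_xy))
    loss_wh = sum(mse(p, t) for p, t in zip(pred_wh, target_wh))
    return weight * (loss_xy + loss_wh)


print(bce(0.5, 1.0), mse(1.5, 1.0))
```

BCE accepts soft targets, which is why it can be applied to center offsets that are fractions in [0, 1] rather than 0/1 labels.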
