TensorFlow YOLOv3: Implementation and Detailed Explanation

Latest recommended article published on 2024-08-05 18:41:17

zhangg_chuan


Category: TensorFlow

Article link: https://blog.csdn.net/neo_qiye/article/details/84782199


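Tip: most of the material here comes from https://www.jianshu.com/p/3943be47fe84; treat this post as a set of study notes.
I also recommend a detailed English blog post: a PyTorch implementation of YOLO v3.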

1 Environment

TensorFlow-gpu 1.8
Keras 2.2.4 (to install: activate the tensorflow env, then run conda install keras)
OpenCV 3.4.4
Python 3.6.3
Code repository: git repository

2 Downloading the weights and adapting them for TensorFlow

Weights file download link: yolov3.weights
Baidu Netdisk download:
Link: https://pan.baidu.com/s/1QZTjhc6yAoqs2KJWj-Y2gA  Extraction code: 9ndy

python yad2k.py cfg/yolo.cfg yolov3.weights data/yolo.h5

If the script reports that it cannot draw the model plot, run conda install pydot.
While parsing ...

3 Demo test

python demo.py

It fails with: TypeError: float() argument must be a string or a number, not 'dict'
After I added the two statements at the end of the following snippet to yad2k.py, it ran:


    print(model.summary())
    model.save('{}'.format(output_path))
    print('Saved Keras model to {}'.format(output_path))

    # Check that the saved model file can be loaded (added by me)
    del model  # deletes the existing model
    model = keras.models.load_model(output_path)

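Test result:
[Figure: YOLO v3 detection result]

4 Tracking test with the laptop's camera

Add a check to the demo's video-reading code so that it falls back to the camera when the video file does not exist: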

def detect_video(video, yolo, all_classes):
    """Use yolo v3 to detect video.
    # Argument:
        video: video file.
        yolo: YOLO, yolo model.
        all_classes: all classes name.
    """
    video_path = os.path.join("videos", "test", video)
    if (os.path.exists(video_path) and video != ''):
        camera = cv2.VideoCapture(video_path)
    else:
        camera = cv2.VideoCapture(0)
        video = 'your_camera.mp4'
    res, frame = camera.read()
    if not res:
        print("Failed to open the video file or the camera")
        camera.release()
        return
    cv2.namedWindow("detection", cv2.WINDOW_AUTOSIZE)

    # Prepare for saving the detected video
    sz = (int(camera.get(cv2.CAP_PROP_FRAME_WIDTH)),
        int(camera.get(cv2.CAP_PROP_FRAME_HEIGHT)))
    fourcc = cv2.VideoWriter_fourcc(*'mpeg')

    vout = cv2.VideoWriter()
    vout.open(os.path.join("videos", "res", video), fourcc, 20, sz, True)

    while True:
        res, frame = camera.read()

        if not res:
            break

        image = detect_image(frame, yolo, all_classes)
        cv2.imshow("detection", image)

        # Save the video frame by frame
        vout.write(image)

        if cv2.waitKey(110) & 0xff == 27:  # press ESC to quit
            break

    vout.release()
    camera.release()

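Test image:
[Figure: detection result from the camera]

5 How it works in detail

5.1 Network architecture

[Figure: architecture of the feature-extraction network]

5.1.1 Feature-extraction network: Darknet-53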

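The basic building blocks of the network are the convolutional unit and the residual unit.
Convolutional unit:
Conv2D -> BatchNormalization -> LeakyReLU
All convolutions in this part use 3 x 3 kernels.
LeakyReLU: if x_i >= 0 then y_o = x_i, else y_o = x_i / a (a > 1).
Residual unit:
1 x 1 convolution with n filters, BN, LeakyReLU -> 3 x 3 convolution with 2n filters, BN, LeakyReLU -> add to the input -> linear activation (what the linear activation is good for is still an open question for me).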
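
To make the two units concrete, here is a minimal NumPy sketch. It rests on several assumptions: a = 10 (a slope of 0.1 for negative inputs), batch normalization left out, a naive convolution helper `conv2d_same` written only for this illustration, and the final linear activation read as the identity. It is meant to show the data flow and shapes, not to reproduce the repository's Keras code.

```python
import numpy as np

def leaky_relu(x, a=10.0):
    """y = x for x >= 0, otherwise y = x / a (a > 1); a = 10 is an assumption."""
    return np.where(x >= 0, x, x / a)

def conv2d_same(x, w):
    """Naive stride-1 convolution with zero padding (illustrative helper).
    x: (H, W, C_in), w: (k, k, C_in, C_out) with odd k."""
    k = w.shape[0]
    pad = k // 2
    h, wd = x.shape[:2]
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros((h, wd, w.shape[3]))
    for dy in range(k):
        for dx in range(k):
            patch = xp[dy:dy + h, dx:dx + wd, :]  # (H, W, C_in)
            out += patch @ w[dy, dx]              # (H, W, C_out)
    return out

def residual_unit(x, w1, w3):
    """1x1 conv (n filters) -> 3x3 conv (2n filters) -> add to the input.
    Batch normalization is omitted in this sketch."""
    y = leaky_relu(conv2d_same(x, w1))
    y = leaky_relu(conv2d_same(y, w3))
    return x + y  # linear activation read as identity: the sum passes through unchanged

rng = np.random.default_rng(0)
n = 4
x = rng.standard_normal((8, 8, 2 * n))            # input with 2n channels
w1 = rng.standard_normal((1, 1, 2 * n, n)) * 0.1  # 1x1 conv, n filters
w3 = rng.standard_normal((3, 3, n, 2 * n)) * 0.1  # 3x3 conv, 2n filters
out = residual_unit(x, w1, w3)
print(out.shape)
```

Because the 3 x 3 convolution restores the channel count to 2n, the output has the same shape as the input, which is what makes the element-wise addition possible.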

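The network structure as seen in the code:
[Figure: input]

[Figure: output]
In total there are 6 convolutional units and 23 residual units, each residual unit containing 2 convolutions, so the backbone has 6 + 23 * 2 = 52 convolutional layers; counting the final fully connected layer of the original classifier as well gives the 53 layers that Darknet-53 is named for.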
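
The layer count can be double-checked with a small tally. The per-stage residual block counts [1, 2, 8, 8, 4] and the single fully connected classifier layer are assumptions taken from the published Darknet-53 layout; they are not stated in the text above.

```python
# Assumed Darknet-53 layout: 1 stem convolution plus 5 downsampling convolutions,
# followed by residual stages with 1, 2, 8, 8 and 4 blocks respectively.
standalone_convs = 6
residual_blocks_per_stage = [1, 2, 8, 8, 4]
convs_per_residual_block = 2
fully_connected_layers = 1  # classifier layer of the original network (assumption)

residual_blocks = sum(residual_blocks_per_stage)
conv_layers = standalone_convs + residual_blocks * convs_per_residual_block
total_layers = conv_layers + fully_connected_layers

print(residual_blocks, conv_layers, total_layers)
```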

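5.1.2 Class prediction and box localization network

The detection part works like a feature pyramid: three feature maps are taken from the network as the feature tensors for classification and localization, namely the outputs of the 26th convolutional layer (52,52,256), the 43rd convolutional layer (26,26,512) and the 53rd layer (13,13,512).

[Figure: feature map selection]

The output of layer 53 goes through the series of convolutions shown in the figure below and becomes (13,13,255) -> reshape (13,13,3,85);
The output of layer 43 is concatenated with the layer-53 branch after its convolutions and upsampling, goes through the series of convolutions shown in the figure below and becomes (26,26,255) -> reshape (26,26,3,85);
The output of layer 26 is concatenated with the layer-43 and layer-53 branch after its convolutions and upsampling (the upsampling step still needs its own introduction), goes through the series of convolutions shown in the figure below and becomes (52,52,255) -> reshape (52,52,3,85);
Here 255 = 3 x 85: every cell carries 3 anchor boxes with 85 values each.
[Figure: architecture of the detection network]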
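
The channel count of each head and the reshape can be illustrated with NumPy. This is only a sketch: the zero-filled arrays stand in for real network outputs, and the variable names are made up for the example.

```python
import numpy as np

anchors_per_scale = 3
num_classes = 80
values_per_box = 4 + 1 + num_classes           # x, y, w, h, objectness, 80 classes
channels = anchors_per_scale * values_per_box  # 3 * 85

reshaped_shapes = []
for size in (13, 26, 52):
    raw = np.zeros((size, size, channels), dtype=np.float32)  # placeholder head output
    pred = raw.reshape(size, size, anchors_per_scale, values_per_box)
    reshaped_shapes.append(pred.shape)
    print(raw.shape, "->", pred.shape)
```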

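5.1.3 From predicted feature parameters to boxes and classes

The 9 best anchors are obtained from statistics on the dataset. This is similar to the 9 anchor boxes of different scales that the RPN in Faster R-CNN uses to generate proposals, except that YOLOv3 uses specific sizes obtained on the COCO dataset; they can be regarded as a set of hyperparameters.
anchors = [[10, 13], [16, 30], [33, 23], [30, 61], [62, 45],
[59, 119], [116, 90], [156, 198], [373, 326]]
The anchors are accompanied by 9 mask indices:
masks = [[6, 7, 8], [3, 4, 5], [0, 1, 2]]
The masks split the anchors into three groups. During prediction, conv53 uses mask 6,7,8, i.e. the last three anchors; conv43 uses mask 3,4,5, the middle three; conv26 uses mask 0,1,2, the first three. In other words, larger objects are predicted on the smaller feature maps and small objects on the larger feature maps.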
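
A short sketch of how the masks select anchors per scale. The dictionary keyed by grid size is a choice made for this example, not something taken from the repository.

```python
import numpy as np

anchors = np.array([[10, 13], [16, 30], [33, 23], [30, 61], [62, 45],
                    [59, 119], [116, 90], [156, 198], [373, 326]])
masks = [[6, 7, 8], [3, 4, 5], [0, 1, 2]]
grid_sizes = [13, 26, 52]  # conv53, conv43, conv26 outputs

# Index the anchor table with each mask to get the three anchors of that scale.
anchors_per_scale = {g: anchors[m] for g, m in zip(grid_sizes, masks)}

for g in grid_sizes:
    print(g, anchors_per_scale[g].tolist())
```

The coarsest 13 x 13 grid ends up with the three largest anchors, and the finest 52 x 52 grid with the three smallest.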

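Taking the output of conv53 as an example, prediction proceeds as follows:
input: (13,13,3,85): the dimension of size 3 corresponds to the three anchor boxes; the dimension of size 85 holds 4 position values, 1 value indicating whether an object is present, and the probabilities of the 80 classes;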

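box_xy = sigmoid(input[..., :2])
box_wh = exp(input[..., 2:4]); box_wh holds the scale factors for the anchor's W and H
box_wh = box_wh * anchors_tensor;
box_xy + grid: grid holds the coordinates of each cell of the feature map; for each coordinate it has shape (13,13,3), with identical values along the third dimension. The sigmoid output is an offset from the top-left corner of the cell, so adding the cell coordinates gives the predicted center of the object. The prediction is a relative position within the image; mapping it back to the size of the original image yields the object's bounding box.
box_xy += grid
box_xy /= (grid_w, grid_h)
box_wh /= (416, 416)
box_xy -= (box_wh / 2.); this moves box_xy from the box center to its top-left corner
[Figure: correspondence between the output and the parameters]

[Figure: conversion between box and output]

box_confidence = sigmoid(input[..., 4])
box_class_probs = sigmoid(input[..., 5:])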
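
The decoding steps above can be put together in a small NumPy function. This is a sketch under assumptions: `decode_output` is a name invented for the example, the grid is built with `np.meshgrid` so that the last axis holds (column, row), and the confidence keeps a trailing axis of size 1 so that it broadcasts against the class probabilities.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_output(out, anchors, input_size=416):
    """out: (grid_h, grid_w, 3, 85) raw head output; anchors: (3, 2) in pixels.
    Returns boxes as (x, y, w, h) relative to the image, with (x, y) the top-left corner."""
    grid_h, grid_w, num_anchors, _ = out.shape
    # col[i, j] = j (x index of the cell), row[i, j] = i (y index of the cell)
    col, row = np.meshgrid(np.arange(grid_w), np.arange(grid_h))
    grid = np.stack([col, row], axis=-1)[:, :, np.newaxis, :]  # (grid_h, grid_w, 1, 2)

    box_xy = sigmoid(out[..., :2])                             # offset inside the cell
    box_wh = np.exp(out[..., 2:4]) * anchors.reshape(1, 1, num_anchors, 2)

    box_xy = (box_xy + grid) / (grid_w, grid_h)                # center, relative to image
    box_wh = box_wh / (input_size, input_size)                 # size, relative to image
    box_xy = box_xy - box_wh / 2.0                             # center -> top-left corner

    boxes = np.concatenate([box_xy, box_wh], axis=-1)
    box_confidence = sigmoid(out[..., 4:5])                    # (grid_h, grid_w, 3, 1)
    box_class_probs = sigmoid(out[..., 5:])                    # (grid_h, grid_w, 3, 80)
    return boxes, box_confidence, box_class_probs

conv53_anchors = np.array([[116, 90], [156, 198], [373, 326]], dtype=float)
raw = np.zeros((13, 13, 3, 85))  # placeholder output: every logit is 0
boxes, conf, probs = decode_output(raw, conv53_anchors)
print(boxes.shape, conf.shape, probs.shape)
```

With all logits at zero, every box is centered in its cell and has exactly the size of its anchor, which makes the arithmetic easy to verify by hand.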

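5.1.4 Box filtering

Across the outputs of the 3 convolutional layers, the total is:
((52 x 52) + (26 x 26) + (13 x 13)) x 3 = 10647 bounding boxes
An image normally does not contain anywhere near that many objects, so the boxes have to go through threshold filtering and NMS (non-maximum suppression).
Boxes whose box_confidence falls below the input threshold are removed. Non-maximum suppression (controlled by the IoU parameter) handles the case where several boxes predict the same object, keeping the box with the highest probability as the object box.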
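
The box count follows directly from the input size. In this sketch the 416 input size comes from the decoding step, while the strides 8, 16 and 32 are an assumption chosen to reproduce the 52, 26 and 13 grids.

```python
input_size = 416
strides = (8, 16, 32)  # assumed downsampling factors of the three output scales
anchors_per_cell = 3

grid_sizes = [input_size // s for s in strides]
total_boxes = sum(g * g for g in grid_sizes) * anchors_per_cell

print(grid_sizes, total_boxes)
```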
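The filtering procedure in detail (applied to the feature maps of all three scales):

1 . box_scores (13,13,3,80) = box_confidence (13,13,3,1) * box_class_probs (13,13,3,80);
2 . box_classes (13,13,3) = argmax(box_scores, axis=-1): the position of the highest score for each box, i.e. the class index;
3 . box_class_scores (13,13,3) = max(box_scores, axis=-1): the probability of the best class;
4 . Boxes scoring below the object-score threshold are dropped, leaving a preliminary set of boxes with their classes and scores;
5 . After going through the feature maps of all three scales, this gives one set of boxes, classes and scores;
6 . Loop over the classes that remain after filtering and filter each class separately:

For all boxes of the current class, get their top-left corners and areas, and sort them by score;

Starting from the highest-scoring box, compute its IoU with all remaining boxes and drop the boxes whose IoU exceeds the threshold;

Among the remaining boxes, take the highest-scoring one and repeat until no boxes are left; the result is the NMS output for the current class.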
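
The filtering steps can be sketched in NumPy as follows. The function names `filter_boxes`, `nms` and `nms_per_class` are chosen for this example, boxes are assumed to be (x, y, w, h) with (x, y) the top-left corner and non-zero area, and the toy inputs at the end are made up purely to exercise the code.

```python
import numpy as np

def filter_boxes(boxes, box_confidence, box_class_probs, score_threshold):
    """Steps 1-4: score every box and keep those at or above the threshold."""
    box_scores = box_confidence * box_class_probs    # (..., 3, 80)
    box_classes = np.argmax(box_scores, axis=-1)     # class index per box
    box_class_scores = np.max(box_scores, axis=-1)   # score of that class
    keep = box_class_scores >= score_threshold
    return boxes[keep], box_classes[keep], box_class_scores[keep]

def nms(boxes, scores, iou_threshold):
    """Greedy NMS for one class. Returns the indices of the kept boxes."""
    x1, y1, w, h = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = w * h
    order = scores.argsort()[::-1]                   # highest score first
    keep = []
    while order.size > 0:
        i, rest = order[0], order[1:]
        keep.append(int(i))
        # Intersection of box i with every remaining box.
        iw = np.minimum(x1[i] + w[i], x1[rest] + w[rest]) - np.maximum(x1[i], x1[rest])
        ih = np.minimum(y1[i] + h[i], y1[rest] + h[rest]) - np.maximum(y1[i], y1[rest])
        inter = np.maximum(0.0, iw) * np.maximum(0.0, ih)
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_threshold]           # drop boxes overlapping too much
    return keep

def nms_per_class(boxes, classes, scores, iou_threshold):
    """Step 6: run NMS separately for every class that is still present."""
    kept = []
    for c in np.unique(classes):
        idx = np.where(classes == c)[0]
        kept.extend(idx[nms(boxes[idx], scores[idx], iou_threshold)].tolist())
    kept = np.array(kept, dtype=int)
    return boxes[kept], classes[kept], scores[kept]

# Toy data: two heavily overlapping boxes and one far away, all of class 0.
toy_boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [50, 50, 10, 10]], dtype=float)
toy_scores = np.array([0.9, 0.8, 0.7])
toy_classes = np.array([0, 0, 0])
kept_boxes, kept_classes, kept_scores = nms_per_class(toy_boxes, toy_classes, toy_scores, 0.5)
print(kept_boxes.tolist(), kept_scores.tolist())
```

To follow step 5, the outputs of `filter_boxes` for the three scales would be concatenated before calling `nms_per_class`.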

6 Computational cost evaluation
