Paper Notes (10) [yolov4] YOLOv4: Optimal Speed and Accuracy of Object Detection

Latest recommended article published on 2022-04-02 15:50:16

CSPhD-winston-杨帆

Category: yolo. Tags: deep learning, neural networks, machine learning

Article link: https://blog.csdn.net/WhiffeYF/article/details/111353737

Reference: YOLOv4 original paper translation, "v4 is finally here!" (YOLOv4原文翻译 - v4它终于来了！)

Abstract

There are a huge number of features which are said to improve Convolutional Neural Network (CNN) accuracy. Practical testing of combinations of such features on large datasets, and theoretical justification of the result, is required. Some features operate on certain models exclusively and for certain problems exclusively, or only for small-scale datasets; while some features, such as batch-normalization and residual-connections, are applicable to the majority of models, tasks, and datasets. We assume that such universal features include Weighted-Residual-Connections (WRC), Cross-Stage-Partial-connections (CSP), Cross mini-Batch Normalization (CmBN), Self-adversarial-training (SAT) and Mish-activation. We use new features: WRC, CSP, CmBN, SAT, Mish activation, Mosaic data augmentation, CmBN, DropBlock regularization, and CIoU loss, and combine some of them to achieve state-of-the-art results: 43.5% AP (65.7% AP50) for the MS COCO dataset at a realtime speed of ∼65 FPS on Tesla V100. Source code is at https://github.com/AlexeyAB/darknet.

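Abstract (translation)
There are many techniques that can improve CNN accuracy. Combinations of these techniques need to be tested on large datasets, and the results need theoretical justification. Some techniques work only on particular models and particular problems, or only on small-scale datasets, while others, such as batch-normalization and residual-connections, apply to most models, tasks, and datasets. We assume that such universal techniques include Weighted-Residual-Connections (WRC), Cross-Stage-Partial-connections (CSP), Cross mini-Batch Normalization (CmBN), Self-adversarial-training (SAT), and Mish-activation. We use new techniques, namely WRC, CSP, CmBN, SAT, Mish activation, Mosaic data augmentation, DropBlock regularization, and CIoU loss, and combine some of them to achieve state-of-the-art results: 43.5% AP (65.7% AP50) on the MS COCO dataset at a real-time speed of about 65 FPS on a Tesla V100.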

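Note: the abstract already tells most of the story. YOLOv4 keeps Darknet as the backbone, studies through extensive experiments how a large number of general-purpose techniques affect network performance, and then finds their best combination.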
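
Note (illustrative sketch, not from the paper): the AP and AP50 figures quoted in the abstract both rest on Intersection over Union (IoU) between a predicted box and a ground-truth box. Under the usual COCO convention, AP50 counts a prediction as a match when IoU is at least 0.5, while AP averages over IoU thresholds from 0.50 to 0.95. The function names and the `(x1, y1, x2, y2)` box layout below are assumptions chosen for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def matches_at_50(pred_box, gt_box):
    """AP50-style matching rule: the prediction counts when IoU >= 0.5."""
    return iou(pred_box, gt_box) >= 0.5
```

This only shows the matching rule. A full AP computation also sorts predictions by confidence and integrates the precision-recall curve, which is omitted here.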

1. Introduction

The majority of CNN-based object detectors are largely applicable only for recommendation systems. For example, searching for free parking spaces via urban video cameras is executed by slow accurate models, whereas car collision warning is related to fast inaccurate models. Improving the real-time object detector accuracy enables using them not only for hint generating recommendation systems, but also for stand-alone process management and human input reduction. Real-time object detector operation on conventional Graphics Processing Units (GPU) allows their mass usage at an affordable price. The most accurate modern neural networks do not operate in real time and require a large number of GPUs for training with a large mini-batch-size. We address such problems through creating a CNN that operates in real-time on a conventional GPU, and for which training requires only one conventional GPU.

1 Introduction (translation)
Most CNN-based object detectors are largely applicable only to recommendation systems. For example, searching for free parking spaces with urban video cameras relies on slow but accurate models, whereas car collision warning relies on fast but less accurate models. Improving the accuracy of real-time object detectors allows them to be used not only in hint-generating recommendation systems but also for stand-alone process management and for reducing human input. Most of today's high-accuracy neural networks do not run in real time and need many GPUs to train with a large mini-batch size. We address this by building a CNN that runs in real time on a single conventional GPU and that needs only one conventional GPU for training.

Figure 1: Comparison of the proposed YOLOv4 and other state-of-the-art object detectors. YOLOv4 runs twice faster than EfficientDet with comparable performance. Improves YOLOv3’s AP and FPS by 10% and 12%, respectively.

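Figure 1 (translation): Comparison of the proposed YOLOv4 with other state-of-the-art object detectors. YOLOv4 runs twice as fast as EfficientDet at comparable performance, and improves YOLOv3's AP and FPS by 10% and 12%, respectively.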

The main goal of this work is designing a fast operating speed of an object detector in production systems and optimization for parallel computations, rather than the low computation volume theoretical indicator (BFLOP). We hope that the designed object can be easily trained and used. For example, anyone who uses a conventional GPU to train and test can achieve real-time, high quality, and convincing object detection results, as the YOLOv4 results shown in Figure 1. Our contributions are summarized as follows:

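The main goal of our work is to design an object detector that runs fast in production systems and is optimized for parallel computation, rather than to minimize the theoretical low-computation-volume indicator (BFLOP). We hope the detector can be trained and used easily. For example, anyone who trains and tests on a conventional GPU can obtain real-time, high-quality, and convincing detection results, as the YOLOv4 results in Figure 1 show. Our contributions are summarized as follows: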

1. We develop an efficient and powerful object detection model. It lets anyone use a 1080 Ti or 2080 Ti GPU to train a super fast and accurate object detector.

(1) We propose an efficient and powerful object detection model. Anyone with a single 1080 Ti or 2080 Ti GPU can train a fast and accurate object detector.

2. We verify the influence of state-of-the-art Bag-of-Freebies and Bag-of-Specials methods of object detection during the detector training.

(2) During detector training, we verify the influence of state-of-the-art Bag-of-Freebies and Bag-of-Specials object detection methods.

3. We modify state-of-the-art methods and make them more efficient and suitable for single GPU training, including CBN [89], PAN [49], SAM [85], etc.

(3) We modify state-of-the-art methods, including CBN, PAN, and SAM, to make them more efficient and suitable for training on a single GPU.

2. Related work

2.1. Object detection models

A modern detector is usually composed of two parts, a backbone which is pre-trained on ImageNet and a head which is used to predict classes and bounding boxes of objects. For those detectors running on GPU platform, their backbone could be VGG [68], ResNet [26], ResNeXt [86], or DenseNet [30]. For those detectors running on CPU platform, their backbone could be SqueezeNet [31], MobileNet [28, 66, 27, 74], or ShuffleNet [97, 53]. As to the head part, it is usually categorized into two kinds, i.e., one-stage object detector and two-stage object detector. The most representative two-stage object detector is the R-CNN [19] series, including fast R-CNN [18], faster R-CNN [64], R-FCN [9], and Libra R-CNN [58]. It is also possible to make a two-stage object detector an anchor-free object detector, such as RepPoints [87]. As for one-stage object detector, the most representative models are YOLO [61, 62, 63], SSD [50], and RetinaNet [45]. In recent years, anchor-free one-stage object detectors are developed. The detectors of this sort are CenterNet [13], CornerNet [37, 38], FCOS [78], etc. Object detectors developed in recent years often insert some layers between backbone and head, and these layers are usually used to collect feature maps from different stages. We can call it the neck of an object detector. Usually, a neck is composed of several bottom-up paths and several top-down paths. Networks equipped with this mechanism include Feature Pyramid Network (FPN) [44], Path Aggregation Network (PAN) [49], BiFPN [77], and NAS-FPN [17].

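2 Related work (translation)
2.1 Object detection models
A detector usually consists of two parts: a backbone, pre-trained on ImageNet, and a head, which predicts the classes and bounding boxes of objects. For detectors running on GPU platforms, the backbone may be VGG, ResNet, ResNeXt, or DenseNet. For detectors running on CPU platforms, the backbone may be SqueezeNet, MobileNet, or ShuffleNet. The head is usually one of two kinds: one-stage or two-stage. The most representative two-stage detectors are the R-CNN series, including fast R-CNN, faster R-CNN, R-FCN, and Libra R-CNN. A two-stage detector can also be anchor-free, as in RepPoints. The most representative one-stage detectors are YOLO, SSD, and RetinaNet. In recent years, anchor-free one-stage detectors have appeared, such as CenterNet, CornerNet, and FCOS. Detectors developed in recent years often insert layers between the backbone and the head, and these layers usually collect feature maps from different stages. We call this part the neck of the detector. A neck usually consists of several bottom-up paths and several top-down paths. Networks that use this mechanism include Feature Pyramid Network (FPN), Path Aggregation Network (PAN), BiFPN, and NAS-FPN.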
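
Note (illustrative sketch, not from the paper): the idea of a top-down path in the neck can be shown with a toy example. Each coarser feature map is upsampled and added to the next finer one, so fine-resolution maps receive information from deeper stages. Everything below is an assumption made for illustration: feature maps are single-channel nested lists, each stage is half the size of the previous one, and the learned convolutions that FPN-style necks apply before and after merging are left out.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2D feature map (list of rows)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out


def top_down_merge(fine, coarse):
    """Add the upsampled coarse map to the fine map, element by element."""
    up = upsample2x(coarse)
    if len(up) != len(fine) or len(up[0]) != len(fine[0]):
        raise ValueError("coarse map must be exactly half the size of the fine map")
    return [[f + u for f, u in zip(frow, urow)] for frow, urow in zip(fine, up)]


def top_down_path(stage_maps):
    """stage_maps: backbone outputs ordered fine -> coarse.

    Returns the merged maps in the same fine -> coarse order.
    """
    merged = [stage_maps[-1]]                # start from the coarsest stage
    for fine in reversed(stage_maps[:-1]):   # walk towards finer stages
        merged.append(top_down_merge(fine, merged[-1]))
    return merged[::-1]
```

A bottom-up path, as added by PAN, would run the same kind of merge in the opposite direction, downsampling finer maps into coarser ones.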

In addition to the above models, some researchers put their emphasis on directly building a new backbone (DetNet [43], DetNAS [7]) or a new whole model (SpineNet [12], HitDetector [20]) for object detection.

In addition to the models above, some researchers focus on directly building a new backbone (DetNet, DetNAS) or a whole new model (SpineNet, HitDetector) for object detection.

To sum up, an ordinary object detector is composed of several parts:

To sum up (translation), an ordinary object detector consists of four parts: input, backbone, neck, and head.

2.2. Bag of freebies

Usually, a conventional object detector is trained offline. Therefore, researchers always like to take this advantage and develop better training methods which can make the object detector receive better accuracy without increasing the inference cost. We call these methods that only change the training strategy or only increase the training cost as "bag of freebies." What is often adopted by object detection methods and meets the definition of bag of freebies is data augmentation. The purpose of data augmentation is to increase the variability of the input images, so that the designed object detection model has higher robustness to the images obtained from different environments. For example, photometric distortions and geometric distortions are two commonly used data augmentation methods and they definitely benefit the object detection task. In dealing with photometric distortion, we adjust the brightness, contrast, hue, saturation, and noise of an image. For geometric distortion, we add random scaling, cropping, flipping, and rotating.

2.2 Bag of freebies (translation)
Object detectors are usually trained offline (so the number and type of GPUs used for training are not constrained). Researchers therefore like to take advantage of this and develop better training methods that improve detection accuracy without increasing inference cost. We call methods that only change the training strategy or only increase the training cost "bag of freebies". A technique that object detection methods often adopt and that meets this definition is data augmentation. Its purpose is to increase the variability of the input images so that the detection model is more robust to images obtained from different environments. For example, photometric distortions and geometric distortions are two commonly used data augmentation methods. For photometric distortion, we adjust the brightness, contrast, hue, saturation, and noise of an image. For geometric distortion, we add random scaling, cropping, flipping, and rotating.
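
Note (illustrative sketch, not from the paper): the practical difference between the two kinds of distortion is that a photometric change leaves the bounding boxes untouched, while a geometric change has to be applied to the boxes as well. The sketch below shows one example of each on a tiny grayscale image stored as nested lists with values in [0, 1]. The function names, the `(x1, y1, x2, y2)` pixel box layout, and the jitter ranges are assumptions chosen for illustration, not the settings used by YOLOv4.

```python
import random


def adjust_brightness_contrast(img, brightness=0.0, contrast=1.0):
    """Photometric: scale around mid-grey (contrast), shift (brightness), clamp to [0, 1]."""
    return [[min(1.0, max(0.0, (p - 0.5) * contrast + 0.5 + brightness)) for p in row]
            for row in img]


def hflip(img, boxes):
    """Geometric: mirror the image left-right and mirror the boxes with it.

    Boxes are (x1, y1, x2, y2) in pixel coordinates, with x measured from the left edge.
    """
    width = len(img[0])
    flipped = [row[::-1] for row in img]
    new_boxes = [(width - x2, y1, width - x1, y2) for (x1, y1, x2, y2) in boxes]
    return flipped, new_boxes


def random_augment(img, boxes, rng=random):
    """Apply a random photometric jitter, then flip with probability 0.5."""
    img = adjust_brightness_contrast(
        img,
        brightness=rng.uniform(-0.2, 0.2),  # assumed range, for illustration only
        contrast=rng.uniform(0.8, 1.2),     # assumed range, for illustration only
    )
    if rng.random() < 0.5:
        img, boxes = hflip(img, boxes)
    return img, boxes
```

Passing a seeded `random.Random` instance as `rng` makes the augmentation reproducible, which helps when debugging a training pipeline.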

The data augmentation methods mentioned above are all pixel-wise adjustments, and all original pixel information in the adjusted area is retained. In addition, some researchers engaged in data augmentation put their emphasis on simulating object occlusion issues. They have achieved good results in image classification and object detection. For example, random erase [100] and CutOut [11] can randomly select the rectangle region in an image and fill in a random or complementary value of zero. As for hide-and-seek [69] and grid mask [6], they randomly or evenly select multiple rectangle regions in an image and replace them to all zeros. If similar concepts are applied to feature maps, there are DropOut [71], DropConnect [80], and DropBlock [16] methods. In addition, some researchers have proposed the methods of using multiple images together to perform data augmentation. For example, MixUp [92] uses two images to multiply and superimpose with different coefficient ratios, and then adjusts the label with these superimposed ratios. As for CutMix [91], it is to cover the cropped image to rectangle region of other images, and adjusts the label according to the size of the mix area. In addition to the above mentioned methods, style transfer GAN [15] is also used for data augmentation, and such usage can effectively reduce the texture bias learned by CNN.

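(Translation) The data augmentation methods above are all pixel-wise adjustments, and all original pixel information in the adjusted area is retained. In addition, some researchers working on data augmentation focus on simulating object occlusion, with good results in image classification and object detection. For example, random erase and CutOut randomly select a rectangular region in an image and fill it with random values or with zeros. Hide-and-seek and grid mask select multiple rectangular regions in an image, randomly or evenly, and replace them with zeros. Applying similar concepts to feature maps gives DropOut, DropConnect, and DropBlock. Some researchers have also proposed using multiple images together for data augmentation. For example, MixUp multiplies two images by different coefficients and superimposes them, then adjusts the label according to the same ratios. CutMix covers a rectangular region of one image with a crop from another image and adjusts the label according to the size of the mixed area. Beyond these methods, style transfer GAN is also used for data augmentation, and this usage can effectively reduce the texture bias learned by the CNN.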
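
Note (illustrative sketch, not from the paper): the three ideas in this paragraph that are easiest to show in code are CutOut (zero out a rectangle), MixUp (blend two images and their labels with the same ratio), and CutMix (paste a patch from a second image and weight the labels by area). The sketch uses grayscale images as nested lists and one-hot classification labels. The function names and argument layout are assumptions for illustration, and the rectangle and the mixing ratio are passed in explicitly rather than sampled at random. For detection, the bounding boxes would also need handling, which is omitted here.

```python
def cutout(img, top, left, h, w, fill=0.0):
    """Overwrite an h x w rectangle with a constant (zero by default)."""
    out = [row[:] for row in img]  # copy so the input image is not modified
    for y in range(top, min(top + h, len(out))):
        for x in range(left, min(left + w, len(out[0]))):
            out[y][x] = fill
    return out


def mixup(img_a, label_a, img_b, label_b, lam):
    """Blend two images with weights lam and 1 - lam; blend the labels the same way."""
    img = [[lam * a + (1 - lam) * b for a, b in zip(row_a, row_b)]
           for row_a, row_b in zip(img_a, img_b)]
    label = [lam * a + (1 - lam) * b for a, b in zip(label_a, label_b)]
    return img, label


def cutmix(img_a, label_a, img_b, label_b, top, left, h, w):
    """Paste an h x w patch of img_b onto img_a; weight the labels by area share."""
    height, width = len(img_a), len(img_a[0])
    bottom, right = min(top + h, height), min(left + w, width)
    out = [row[:] for row in img_a]
    for y in range(top, bottom):
        out[y][left:right] = img_b[y][left:right]
    patch_share = (bottom - top) * (right - left) / (height * width)
    label = [(1 - patch_share) * a + patch_share * b for a, b in zip(label_a, label_b)]
    return out, label
```

For example, pasting a 2 x 2 patch into a 4 x 4 image covers a quarter of the area, so the mixed label gives weight 0.75 to the first image's class and 0.25 to the second's.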