YOLOv4 Paper Reading Notes (with a Translation of the Original Text)
Paper reading
https://arxiv.org/abs/2004.10934
https://github.com/AlexeyAB/darknet
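YOLOv4 has been out for more than a year. Until now I had only read other people's translations and summaries, and I finally found the time to read the original paper carefully.
My goals are to reproduce the authors' work to see which parts are usable in my day-to-day work, and to strengthen my fundamentals through further reading on each topic the paper touches.
The PDF has 17 pages.
Page 1: Abstract and Introduction.
Pages 2-4: Related Work. This is a well-written survey in its own right. It covers the basic architecture of object detection networks and the common choices of backbone, neck, and head.
It also introduces techniques for improving performance, split into the Bag of freebies (a free gift bag) and the Bag of specials (a discounted gift bag). The analogy is vivid and easy to grasp: anything that affects inference has a cost. The Bag of freebies affects only training and not inference, while the Bag of specials adds a small inference cost.
This chapter covers a lot of ground and mentions each topic only in passing, so the corresponding references deserve a careful read.
Pages 5-7: Methodology. This describes the main work behind YOLOv4. Besides selecting among the architectures, Bag of freebies, and Bag of specials introduced in the previous chapter, the authors add innovations of their own.
Pages 7-10: Experiments and Results. Experimental setup and results.
Page 10: Conclusions and Acknowledgements.
Pages 11-13: Result tables.
Pages 14-17: References.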
Paper translation
Abstract
There are a huge number of features which are said to improve Convolutional Neural Network (CNN) accuracy. Practical testing of combinations of such features on large datasets, and theoretical justification of the result, is required. Some features operate on certain models exclusively and for certain problems exclusively, or only for small-scale datasets; while some features, such as batch-normalization and residual-connections, are applicable to the majority of models, tasks, and datasets. We assume that such universal features include Weighted-Residual-Connections (WRC), Cross-Stage-Partial-connections (CSP), Cross mini-Batch Normalization (CmBN), Self-adversarial-training (SAT) and Mish-activation. We use new features: WRC, CSP, CmBN, SAT, Mish activation, Mosaic data augmentation, CmBN, DropBlock regularization, and CIoU loss, and combine some of them to achieve state-of-the-art results: 43.5% AP (65.7% AP50) for the MS COCO dataset at a realtime speed of ~65 FPS on Tesla V100. Source code is at https://github.com/AlexeyAB/darknet.
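Many features are claimed to improve CNN accuracy, but such claims hold up only after combinations of these features have been tested in practice on large datasets and the results justified theoretically. Some features work only for particular models, particular problems, or small datasets; others, such as batch normalization and residual connections, work for most models, tasks, and datasets. We assume these universal features include Weighted-Residual-Connections (WRC), Cross-Stage-Partial-connections (CSP), Cross mini-Batch Normalization (CmBN), Self-adversarial-training (SAT), and the Mish activation. We use the new features WRC, CSP, CmBN, SAT, Mish activation, Mosaic data augmentation, DropBlock regularization, and CIoU loss, and combine some of them to reach state-of-the-art results on MS COCO: 43.5% AP (65.7% $AP_{50}$) at a real-time speed of about 65 FPS on a Tesla V100. Source code is at https://github.com/AlexeyAB/darknet.
1. Introduction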
The majority of CNN-based object detectors are largely applicable only for recommendation systems. For example, searching for free parking spaces via urban video cameras is executed by slow accurate models, whereas car collision warning is related to fast inaccurate models. Improving the real-time object detector accuracy enables using them not only for hint generating recommendation systems, but also for stand-alone process management and human input reduction. Real-time object detector operation on conventional Graphics Processing Units (GPU) allows their mass usage at an affordable price. The most accurate modern neural networks do not operate in real time and require large number of GPUs for training with a large mini-batch-size. We address such problems through creating a CNN that operates in real-time on a conventional GPU, and for which training requires only one conventional GPU.
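Most CNN-based object detectors are suited only to recommendation systems. For example, searching for vacant parking spaces through city cameras is handled by slow but accurate models, whereas car collision warning relies on fast but less accurate models. Improving the accuracy of real-time detectors lets them serve not only hint-generating recommendation systems but also stand-alone process management with less human input. Real-time detectors that run on a conventional GPU can be deployed at scale at an affordable price. The most accurate modern networks do not run in real time and need many GPUs to train with a large mini-batch size. We address these problems by building a CNN that runs in real time on a conventional GPU and needs only one conventional GPU to train.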
The main goal of this work is designing a fast operating speed of an object detector in production systems and optimization for parallel computations, rather than the low computation volume theoretical indicator (BFLOP). We hope that the designed object can be easily trained and used. For example, anyone who uses a conventional GPU to train and test can achieve real-time, high quality, and convincing object detection results, as the YOLOv4 results shown in Figure 1. Our contributions are summarized as follows:
1. We develop an efficient and powerful object detection model. It makes everyone can use a 1080 Ti or 2080 Ti GPU to train a super fast and accurate object detector.
2. We verify the influence of state-of-the-art Bag-of-Freebies and Bag-of-Specials methods of object detection during the detector training.
3. We modify state-of-the-art methods and make them more effecient and suitable for single GPU training, including CBN [89], PAN [49], SAM [85], etc.
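The main goal of this work is to design an object detector that runs fast in production systems and to optimize it for parallel computation, rather than to minimize the theoretical low-computation indicator (BFLOP). We hope the resulting detector is easy to train and use. For example, anyone training and testing on a conventional GPU can obtain real-time, high-quality, convincing detection results, as the YOLOv4 results in Figure 1 show. Our contributions are summarized as follows:
1. We develop an efficient and powerful object detection model. Anyone can use a 1080 Ti or 2080 Ti GPU to train a very fast and accurate object detector.
2. We verify the influence of state-of-the-art Bag-of-Freebies and Bag-of-Specials methods during detector training.
3. We modify state-of-the-art methods, including CBN [89], PAN [49], and SAM [85], to make them more efficient and better suited to single-GPU training.
2. Related work
2.1. Object detection models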
A modern detector is usually composed of two parts, a backbone which is pre-trained on ImageNet and a head which is used to predict classes and bounding boxes of objects. For those detectors running on GPU platform, their backbone could be VGG [68], ResNet [26], ResNeXt [86], or DenseNet [30]. For those detectors running on CPU platform, their backbone could be SqueezeNet [31], MobileNet [28, 66, 27, 74], or ShuffleNet [97, 53]. As to the head part, it is usually categorized into two kinds, i.e., one-stage object detector and two-stage object detector. The most representative two-stage object detector is the R-CNN [19] series, including fast R-CNN [18], faster R-CNN [64], R-FCN [9], and Libra R-CNN [58]. It is also possible to make a two-stage object detector an anchor-free object detector, such as RepPoints [87]. As for one-stage object detector, the most representative models are YOLO [61, 62, 63], SSD [50], and RetinaNet [45]. In recent years, anchor-free one-stage object detectors are developed. The detectors of this sort are CenterNet [13], CornerNet [37, 38], FCOS [78], etc. Object detectors developed in recent years often insert some layers between backbone and head, and these layers are usually used to collect feature maps from different stages. We can call it the neck of an object detector. Usually, a neck is composed of several bottom-up paths and several top-down paths. Networks equipped with this mechanism include Feature Pyramid Network (FPN) [44], Path Aggregation Network (PAN) [49], BiFPN [77], and NAS-FPN [17]. In addition to the above models, some researchers put their emphasis on directly building a new backbone (DetNet [43], DetNAS [7]) or a new whole model (SpineNet [12], HitDetector [20]) for object detection.
To sum up, an ordinary object detector is composed of several parts:
A modern detector usually has two parts: a backbone pre-trained on ImageNet and a head that predicts object classes and bounding boxes. Detectors that run on GPU platforms typically use VGG [68], ResNet [26], ResNeXt [86], or DenseNet [30] as the backbone; those that run on CPU platforms typically use SqueezeNet [31], MobileNet [28, 66, 27, 74], or ShuffleNet [97, 53]. Heads fall into two kinds: one-stage and two-stage object detectors. The most representative two-stage detectors are the R-CNN [19] series, including fast R-CNN [18], faster R-CNN [64], R-FCN [9], and Libra R-CNN [58]. A two-stage detector can also be made anchor-free, as in RepPoints [87]. The most representative one-stage detectors are YOLO [61, 62, 63], SSD [50], and RetinaNet [45]. Anchor-free one-stage detectors have appeared in recent years, such as CenterNet [13], CornerNet [37, 38], and FCOS [78]. Recent detectors also often insert layers between the backbone and the head to collect feature maps from different stages; we call this the neck of the detector. A neck usually consists of several bottom-up paths and several top-down paths. Networks with this mechanism include Feature Pyramid Network (FPN) [44], Path Aggregation Network (PAN) [49], BiFPN [77], and NAS-FPN [17]. Beyond these models, some researchers focus on directly building a new backbone (DetNet [43], DetNAS [7]) or a whole new model (SpineNet [12], HitDetector [20]) for object detection.
In summary, a typical object detector consists of the following parts:
Input: Image, Patches, Image Pyramid
Backbones: VGG16 [68], ResNet-50 [26], SpineNet [12], EfficientNet-B0/B7 [75], CSPResNeXt50 [81], CSPDarknet53 [81]
Neck:
  Additional blocks: SPP [25], ASPP [5], RFB [47], SAM [85]
  Path-aggregation blocks: FPN [44], PAN [49], NAS-FPN [17], Fully-connected FPN, BiFPN [77], ASFF [48], SFAM [98]
Heads:
  Dense Prediction (one-stage):
    anchor based: RPN [64], SSD [50], YOLO [61], RetinaNet [45]
    anchor free: CornerNet [37], CenterNet [13], MatrixNet [60], FCOS [78]
  Sparse Prediction (two-stage):
    anchor based: Faster R-CNN [64], R-FCN [9], Mask R-CNN [23]
    anchor free: RepPoints [87]
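The backbone, neck, and head decomposition above can be pictured as a small configuration record. The sketch below is only an illustration: `DetectorSpec` and its fields are hypothetical names, not part of Darknet or any library. The example instance uses the combination the paper later settles on for YOLOv4 (CSPDarknet53 backbone, SPP and PAN in the neck, YOLOv3 head).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DetectorSpec:
    """Hypothetical record describing how a detector is assembled."""
    backbone: str                                   # feature extractor, usually ImageNet pre-trained
    neck: List[str] = field(default_factory=list)   # additional / path-aggregation blocks
    head: str = ""                                  # dense (one-stage) or sparse (two-stage) predictor

    def describe(self) -> str:
        # Join the neck blocks, or say "none" when the detector has no neck.
        neck = " + ".join(self.neck) if self.neck else "none"
        return f"{self.backbone} -> {neck} -> {self.head}"


# The component choice reported for YOLOv4 in the paper's Methodology chapter.
yolov4 = DetectorSpec(backbone="CSPDarknet53", neck=["SPP", "PAN"], head="YOLOv3 head")
print(yolov4.describe())
```

Swapping one field at a time (for example, a different backbone with the same neck and head) mirrors how the paper compares alternatives.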
2.2. Bag of freebies
Usually, a conventional object detector is trained offline. Therefore, researchers always like to take this advantage and develop better training methods which can make the object detector receive better accuracy without increasing the inference cost. We call these methods that only change the training strategy or only increase the training cost as “bag of freebies.” What is often adopted by object detection methods and meets the definition of bag of freebies is data augmentation. The purpose of data augmentation is to increase the variability of the input images, so that the designed object detection model has higher robustness to the images obtained from different environments. For examples, photometric distortions and geometric distortions are two commonly used data augmentation method and they definitely benefit the object detection task. In dealing with photometric distortion, we adjust the brightness, contrast, hue, saturation, and noise of an image. For geometric distortion, we add random scaling, cropping, flipping, and rotating.
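A conventional object detector is usually trained offline. Researchers therefore like to exploit this and develop better training methods that raise the detector's accuracy without increasing inference cost. We call methods that change only the training strategy, or increase only the training cost, the "bag of freebies." The freebie most often adopted in object detection is data augmentation. Its purpose is to increase the variability of the input images so that the resulting model is more robust to images taken in different environments. For example, photometric distortion and geometric distortion are two commonly used augmentation methods, and both clearly help object detection. Photometric distortion adjusts the brightness, contrast, hue, saturation, and noise of an image. Geometric distortion applies random scaling, cropping, flipping, and rotation.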
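To make the two families concrete, here is a minimal sketch of one photometric and one geometric distortion. It is a toy under stated assumptions: the image is a grayscale list of rows with pixel values in [0, 255], boxes are `(x_min, y_min, x_max, y_max)` in pixel coordinates, and the function names are made up for this note rather than taken from Darknet. The point to notice is that a geometric distortion must also transform the bounding boxes, while a photometric one leaves them alone.

```python
def adjust_brightness_contrast(image, brightness=0.0, contrast=1.0):
    """Photometric distortion: scale pixels around mid-gray (128), shift, clamp to [0, 255]."""
    return [
        [min(255.0, max(0.0, (p - 128.0) * contrast + 128.0 + brightness)) for p in row]
        for row in image
    ]


def horizontal_flip(image, boxes):
    """Geometric distortion: mirror the image left-right and mirror the boxes with it."""
    width = len(image[0])
    flipped = [row[::-1] for row in image]
    # After mirroring, the old right edge becomes the new left edge.
    flipped_boxes = [
        (width - x_max, y_min, width - x_min, y_max)
        for (x_min, y_min, x_max, y_max) in boxes
    ]
    return flipped, flipped_boxes


image = [[0, 100, 200, 255]]
boxes = [(0, 0, 1, 1)]
brighter = adjust_brightness_contrast(image, brightness=20.0)
flipped, flipped_boxes = horizontal_flip(image, boxes)
```

In a real pipeline the parameters would be sampled at random for each training image; they are fixed here so the behavior is easy to check by hand.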
The data augmentation methods mentioned above are all pixel-wise adjustments, and all original pixel information in the adjusted area is retained. In addition, some researchers engaged in data augmentation put their emphasis on simulating object occlusion issues. They have achieved good results in image classification and object detection. For example, random erase [100] and CutOut [11] can randomly select the rectangle region in an image and fill in a random or complementary value of zero. As for hide-and-seek [69] and grid mask [6], they randomly or evenly select multiple rectangle regions in an image and replace them to all zeros. If similar concepts are applied to feature maps, there are DropOut [71], DropConnect [80], and DropBlock [16] methods. In addition, some researchers have proposed the methods of using multiple images together to perform data augmentation. For example, MixUp [92] uses two images to multiply and superimpose with different coefficient ratios, and then adjusts the label with these superimposed ratios. As for CutMix [91], it is to cover the cropped image to rectangle region of other images, and adjusts the label according to the size of the mix area. In addition to the above mentioned methods, style transfer GAN [15] is also used for data augmentation, and such usage can effectively reduce the texture bias learned by CNN.
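The augmentation methods above are all pixel-wise adjustments, and all original pixel information in the adjusted area is retained. Other researchers working on data augmentation focus on simulating object occlusion, with good results in image classification and object detection. For example, random erase [100] and CutOut [11] randomly select a rectangular region of the image and fill it with a random value or a complementary value of zero. Hide-and-seek [69] and grid mask [6] randomly or evenly select multiple rectangular regions and replace them with zeros. The analogous operations on feature maps are DropOut [71], DropConnect [80], and DropBlock [16]. Some researchers have also proposed augmenting with several images at once. MixUp [92] multiplies two images by different coefficients and superimposes them, then adjusts the label by the same ratios. CutMix [91] pastes a crop from one image onto a rectangular region of another and adjusts the label according to the size of the mixed area. Beyond these methods, style transfer GAN [15] is also used for data augmentation, and it effectively reduces the texture bias learned by a CNN.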
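The three ideas differ mainly in what happens to the label, which a small sketch can show. The code below is a simplified illustration for the classification setting, assuming grayscale images stored as equal-sized lists of rows and one-hot labels stored as lists; the function names and the explicit rectangle arguments are choices made for this note. In practice the rectangle and the mixing coefficient are sampled at random, and for detection the labels are boxes rather than one-hot vectors.

```python
def cutout(image, top, left, height, width, fill=0.0):
    """CutOut-style occlusion: overwrite one rectangle with a constant; the label is unchanged."""
    out = [row[:] for row in image]
    for y in range(top, min(top + height, len(out))):
        for x in range(left, min(left + width, len(out[0]))):
            out[y][x] = fill
    return out


def mixup(image_a, label_a, image_b, label_b, lam):
    """MixUp: blend two images with weights lam and 1 - lam, and blend the labels the same way."""
    image = [
        [lam * pa + (1.0 - lam) * pb for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(image_a, image_b)
    ]
    label = [lam * la + (1.0 - lam) * lb for la, lb in zip(label_a, label_b)]
    return image, label


def cutmix(image_a, label_a, image_b, label_b, top, left, height, width):
    """CutMix: paste a rectangle of image_b into image_a; label weights follow the area ratio.

    Assumes 0 <= top < image height and 0 <= left < image width.
    """
    out = [row[:] for row in image_a]
    h, w = len(out), len(out[0])
    bottom, right = min(top + height, h), min(left + width, w)
    for y in range(top, bottom):
        out[y][left:right] = image_b[y][left:right]
    # lam is the fraction of the result that still comes from image_a.
    lam = 1.0 - ((bottom - top) * (right - left)) / (h * w)
    label = [lam * la + (1.0 - lam) * lb for la, lb in zip(label_a, label_b)]
    return out, label


ones = [[1.0, 1.0], [1.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
mixed_image, mixed_label = mixup(ones, [1.0, 0.0], zeros, [0.0, 1.0], lam=0.25)
pasted_image, pasted_label = cutmix(ones, [1.0, 0.0], zeros, [0.0, 1.0], 0, 0, 1, 1)
```

With a 2x2 image and a 1x1 pasted patch, three quarters of the pixels still come from the first image, so its class keeps a weight of 0.75 in the CutMix label.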
Different from the various approaches proposed above, some other bag of freebies methods are dedicated to solving the problem that the semantic distribution in the dataset may have bias. In dealing with the problem of semantic distribution bias, a very important issue is that there is a problem of data imbalance between different classes, and this problem is often solved by hard negative example mining [72] or online hard example mining [67] in two-stage object detector. But the example mining method is not applicable to one-stage object detector, because this kind of detector belongs to the dense prediction architecture. Therefore Lin et al. [45] proposed focal loss to deal with the problem of data imbalance existing between various classes. Another very important issue is that it is difficult to express the relationship of the degree of association between different categories with the one-hot hard representation. This representation scheme is often used when executing labeling. The label smoothing proposed in [73] is to convert hard label into soft label for training, which can make model more robust. In order to obtain a better soft label, Islam et al. [33] introduced the concept of knowledge distillation to design the label refinement network.
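Unlike the approaches above, some other bag-of-freebies methods address possible bias in the semantic distribution of the dataset. An important issue here is data imbalance between classes, which two-stage detectors usually handle with hard negative example mining [72] or online hard example mining [67]. Example mining does not apply to one-stage detectors, because they belong to the dense prediction architecture. Lin et al. [45] therefore proposed focal loss to deal with imbalance between classes. Another important issue is that a one-hot hard representation, which is commonly used when labeling, can hardly express the degree of association between categories. The label smoothing proposed in [73] converts hard labels into soft labels for training, which makes the model more robust. To obtain better soft labels, Islam et al. [33] introduced the concept of knowledge distillation to design a label refinement network.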
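Both remedies are short formulas, so a sketch may help. Focal loss for a binary prediction is $FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$, where $p_t$ is the predicted probability of the true class; label smoothing replaces a one-hot vector $y$ over $K$ classes with $(1 - \epsilon)\,y + \epsilon / K$. The Python below is a minimal single-example version. The function names are made up for this note, and the defaults `gamma=2.0`, `alpha=0.25`, and `epsilon=0.1` are commonly quoted values used here as assumptions, not settings taken from the YOLOv4 paper.

```python
import math


def focal_loss(p, target, gamma=2.0, alpha=0.25):
    """Binary focal loss for one prediction; p is the predicted probability of class 1."""
    p_t = p if target == 1 else 1.0 - p              # probability assigned to the true class
    alpha_t = alpha if target == 1 else 1.0 - alpha  # class-balancing weight
    # (1 - p_t) ** gamma shrinks the loss of well-classified (easy) examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def smooth_labels(one_hot, epsilon=0.1):
    """Label smoothing: move epsilon of the probability mass to a uniform distribution."""
    k = len(one_hot)
    return [(1.0 - epsilon) * y + epsilon / k for y in one_hot]


easy = focal_loss(0.9, 1)   # confident and correct: heavily down-weighted
hard = focal_loss(0.1, 1)   # confident and wrong: keeps a large loss
soft = smooth_labels([1.0, 0.0, 0.0, 0.0])
```

Setting `gamma=0` removes the modulating factor and leaves an alpha-weighted cross-entropy, which is a quick sanity check on the implementation.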
The last bag of freebies is the objective function of Bounding Box (BBox) regression. The traditional object detector usually uses Mean Square Error (MSE) to directly perform regression on the center point coordinates and height and width of the BBox, i.e., $\{x_{center}, y_{center}, w, h\}$, or the upper left point and the lower right point, i.e., $\{x_{top\_left}, y_{top\_left}, x_{bottom\_right}, y_{bottom\_right}\}$