YOLOv4: Optimal Speed and Accuracy of Object Detection
These are reading notes on the YOLOv4 paper from a first, coarse pass; a second, closer reading will follow. Feedback and corrections are welcome.
Abstract
There are said to be many features that can improve the accuracy of a convolutional neural network (CNN). Practical testing of combinations of these features on large datasets, and theoretical justification of the results, are required. Some features work only on certain models, only on certain problems, or only on small datasets; while some features, such as batch normalization and residual connections, are applicable to the majority of models, tasks, and datasets. We assume that such universal features include Weighted Residual Connections (WRC), Cross-Stage Partial connections (CSP), Cross mini-Batch Normalization (CmBN), Self-Adversarial Training (SAT), and Mish activation. We use the following new features: WRC, CSP, CmBN, SAT, Mish activation, Mosaic data augmentation, DropBlock regularization, and CIoU loss, and combine some of them to achieve state-of-the-art results: 43.5% AP (65.7% AP50) on the MS COCO dataset, at a real-time speed of about 65 FPS on a Tesla V100. Source code is at this https URL
1. Introduction
The contributions of this paper are summarized as follows:
- 1. We develop an efficient and powerful object detection model. It makes it possible for everyone to train a super-fast and accurate object detector with a 1080 Ti or 2080 Ti GPU.
- 2. We verify the influence of state-of-the-art Bag-of-Freebies and Bag-of-Specials object detection methods during detector training.
- 3. We modify state-of-the-art methods to make them more efficient and better suited to single-GPU training, including CBN [89], PAN [49], SAM [85], etc.
2. Related work
2.1. Object detection models
An ordinary object detector is composed of the following parts:
- Input: Image, Patches, Image Pyramid
- Backbones: VGG16 [68], ResNet-50 [26], SpineNet [12], EfficientNet-B0/B7 [75], CSPResNeXt50 [81], CSPDarknet53 [81]
- Neck:
- Additional blocks: SPP [25], ASPP [5], RFB [47], SAM [85]
- Path-aggregation blocks: FPN [44], PAN [49], NAS-FPN [17], Fully-connected FPN, BiFPN [77], ASFF [48], SFAM [98]
- Heads:
- Dense Prediction (one-stage):
- RPN [64], SSD [50], YOLO [61], RetinaNet [45] (anchor based)
- CornerNet [37], CenterNet [13], MatrixNet [60], FCOS [78] (anchor free)
- Sparse Prediction (two-stage):
- Faster R-CNN [64], R-FCN [9], Mask RCNN [23] (anchor based)
- RepPoints [87] (anchor free)
2.2. Bag of freebies
We call methods that only change the training strategy or only increase the training cost the "bag of freebies".
- data augmentation(pixel-wise)
- photometric distortions: brightness, contrast, hue, saturation, and noise of an image
- geometric distortions: random scaling, cropping, flipping, and rotating
- object occlusion
- pixel level: random erase [100] and CutOut [11], hide-and-seek [69] and grid mask [6]
- feature level: DropOut [71], DropConnect [80], and DropBlock [16]
- multiple images together
- MixUp [92], CutMix [91]
- reduce texture bias
- style transfer GAN [15]
- semantic distribution bias(data imbalance)
- hard negative example mining [72] or online hard example mining[67](two-stage)
- focal loss [45](one-stage, dense prediction)
- degree of association between different categories
- one-hot hard representation
- label smoothing [73]
- knowledge distillation [33]
- objective function of BBox regression
- Mean Square Error (MSE)
- IoU loss [90]
- GIoU loss [65]
- DIoU loss [99], CIoU loss [99]
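To make the relation between the IoU-based regression losses above concrete, here is a minimal NumPy-free sketch of IoU and GIoU for two axis-aligned boxes. This is purely illustrative, not the paper's implementation; DIoU and CIoU add center-distance and aspect-ratio penalty terms on top of the same quantities.

```python
def iou_and_giou(box_a, box_b):
    """Compute IoU and GIoU for two axis-aligned boxes in (x1, y1, x2, y2) format."""
    # Intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    iou = inter / union

    # Smallest enclosing box: GIoU uses it to penalize non-overlapping boxes,
    # giving a useful gradient even when IoU is zero
    cx1, cy1 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    cx2, cy2 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    c_area = (cx2 - cx1) * (cy2 - cy1)
    giou = iou - (c_area - union) / c_area
    return iou, giou
```

The corresponding losses are 1 - IoU and 1 - GIoU; for identical boxes both losses are zero, and GIoU drops below IoU as the boxes separate.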
2.3. Bag of specials
Plugin modules and post-processing methods that add only a small inference cost but can significantly improve object detection accuracy are what we call the "bag of specials". In general, these plugin modules are used to enhance certain attributes of a model, such as enlarging the receptive field, introducing an attention mechanism, or strengthening feature integration capability, while post-processing is a method for screening the model's prediction results.
- enhance receptive field
- SPP [25], improved SPP [63], ASPP [5], and RFB [47]
- attention mechanism
- channel-wise attention
- Squeeze-and-Excitation (SE)
- pointwise attention
- Spatial Attention Module (SAM)
- feature integration
- skip connection [51] or hyper-column [22]
- FPN
- SFAM [98]: channel-wise level re-weighting
- ASFF [48]: point-wise level reweighting
- BiFPN [77]: scale-wise level re-weighting
- activation function
- ReLU [56]: substantially solves the vanishing-gradient problem
- LReLU [54], PReLU [24], ReLU6 [28], Scaled Exponential Linear Unit (SELU) [35]
- Swish [59], hard-Swish [27], and Mish [55]: continuously differentiable
- post-processing
- NMS [19]: greedy NMS
- soft NMS [1]
- DIoU NMS [99]
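As a small illustration of the activation functions listed above, here is a sketch of Mish, assuming the standard formulation Mish(x) = x · tanh(softplus(x)):

```python
import numpy as np

def mish(x):
    # Mish: x * tanh(softplus(x)), where softplus(x) = ln(1 + e^x).
    # Smooth and continuously differentiable everywhere, unlike ReLU,
    # which has a kink at zero; approaches identity for large positive x
    # and saturates toward 0 for large negative x.
    return x * np.tanh(np.log1p(np.exp(x)))
```

The continuous differentiability is what the list above refers to: Swish, hard-Swish, and Mish all avoid the non-smooth point that ReLU-family activations have at the origin.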
3. Methodology
3.1. Selection of architecture
Our objective is to find the optimal balance among the input network resolution, the number of convolutional layers, the number of parameters (filter_size^2 * filters * channels / groups), and the number of layer outputs (filters).
The next objective is to select additional blocks for increasing the receptive field and the best method of parameter aggregation from different backbone levels for different detector levels: e.g. FPN, PAN, ASFF, BiFPN.
A reference model that is optimal for classification is not always optimal for a detector. In contrast to the classifier, the detector requires the following:
- Higher input network size (resolution) - for detecting multiple small-sized objects
- More layers - for a larger receptive field to cover the increased size of the network input
- More parameters - for greater capacity of the model to detect multiple objects of different sizes in a single image
The influence of receptive fields of different sizes is summarized as follows:
- Up to the object size - allows viewing the entire object
- Up to the network input size - allows viewing the context around the object
- Exceeding the network input size - increases the number of connections between an image point and the final activation
We add the SPP block over CSPDarknet53, since it significantly increases the receptive field, separates out the most significant context features, and causes almost no reduction in network operation speed. We use PANet as the method of parameter aggregation from different backbone levels for different detector levels, instead of the FPN used in YOLOv3.
Finally, we choose the CSPDarknet53 backbone, the SPP additional module, the PANet path-aggregation neck, and the YOLOv3 (anchor-based) head as the architecture of YOLOv4.
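The SPP block chosen here can be sketched as follows. This is an illustrative NumPy version assuming the YOLOv3-SPP kernel sizes (5, 9, 13) with stride 1 and same padding, not the Darknet implementation:

```python
import numpy as np

def spp_block(feature_map, kernel_sizes=(5, 9, 13)):
    """SPP as used in YOLO: concatenate the input with max-poolings of
    several kernel sizes (stride 1, same padding) along the channel axis,
    so spatial resolution is preserved while the receptive field grows.
    feature_map: (C, H, W) array."""
    outputs = [feature_map]
    c, h, w = feature_map.shape
    for k in kernel_sizes:
        pad = k // 2
        # Pad with -inf so border maxima come only from real values
        padded = np.pad(feature_map, ((0, 0), (pad, pad), (pad, pad)),
                        constant_values=-np.inf)
        pooled = np.empty_like(feature_map)
        for i in range(h):
            for j in range(w):
                pooled[:, i, j] = padded[:, i:i + k, j:j + k].max(axis=(1, 2))
        outputs.append(pooled)
    # Output has C * (1 + len(kernel_sizes)) channels
    return np.concatenate(outputs, axis=0)
```

Because all branches keep the spatial size, the block drops into the neck without changing downstream shapes except for the channel count.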
3.2. Selection of BoF and BoS
To improve object detection training, a CNN usually uses the following:
- Activations: ReLU, leaky-ReLU, parametric-ReLU, ReLU6, SELU, Swish, or Mish
- Bounding box regression loss: MSE, IoU, GIoU, CIoU, DIoU
- Data augmentation: CutOut, MixUp, CutMix
- Regularization method: DropOut, DropPath [36], Spatial DropOut [79], or DropBlock
- Normalization of the network activations by their mean and variance: Batch Normalization (BN) [32], Cross-GPU Batch Normalization (CGBN or SyncBN) [93], Filter Response Normalization (FRN) [70], or Cross-Iteration Batch Normalization (CBN) [89]
- Skip-connections: Residual connections, Weighted residual connections, Multi-input weighted residual connections, or Cross stage partial connections (CSP)
Among the regularization methods, the authors of DropBlock compared their method with others in detail, and their regularization method won by a clear margin. Therefore, we chose DropBlock as our regularization method without hesitation.
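A minimal sketch of the DropBlock idea, dropping contiguous squares of a feature map rather than independent units as in DropOut. The seed-rate formula below is a simplification of the paper's gamma, and the function names are illustrative:

```python
import numpy as np

def dropblock(x, block_size=3, drop_prob=0.1, rng=None):
    """Simplified DropBlock for a single (H, W) feature map: sample seed
    positions, expand each seed into a block_size x block_size square of
    zeros, then rescale so the expected activation is preserved."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = x.shape
    # Rough seed rate: each seed zeroes ~block_size^2 units
    gamma = drop_prob / (block_size ** 2)
    seeds = rng.random((h, w)) < gamma
    mask = np.ones((h, w))
    half = block_size // 2
    for i, j in zip(*np.nonzero(seeds)):
        mask[max(0, i - half):i + half + 1, max(0, j - half):j + half + 1] = 0.0
    keep_frac = mask.mean()
    return x * mask / max(keep_frac, 1e-6)
```

Dropping whole regions matters for convolutional features because neighboring units are strongly correlated: with plain DropOut the network can trivially recover a dropped unit from its neighbors.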
3.3. Additional improvements
To make the designed detector more suitable for training on a single GPU, we made the following additional designs and improvements:
- We introduce a new data augmentation method, Mosaic, and Self-Adversarial Training (SAT)
- We select optimal hyperparameters by applying genetic algorithms
- We modify some existing methods to make our design suitable for efficient training and detection - modified SAM, modified PAN, and Cross mini-Batch Normalization (CmBN)
Mosaic is a new data augmentation method that mixes 4 training images. Thus 4 different contexts are mixed, whereas CutMix mixes only 2 input images. This allows detection of objects outside their normal context. In addition, batch normalization computes activation statistics from 4 different images on each layer, which significantly reduces the need for a large mini-batch size.
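The Mosaic idea can be sketched roughly as follows. This illustrative version only tiles pixels around a random split point and omits the bounding-box remapping and per-image resizing a real implementation needs:

```python
import numpy as np

def mosaic(images, out_size=416, rng=None):
    """Minimal Mosaic sketch: tile 4 images into one canvas around a random
    split point, so each training sample carries 4 different contexts.
    images: list of 4 arrays of shape (H, W, 3) with H, W >= out_size."""
    rng = np.random.default_rng() if rng is None else rng
    # Random split point, kept away from the borders
    cx = rng.integers(out_size // 4, 3 * out_size // 4)
    cy = rng.integers(out_size // 4, 3 * out_size // 4)
    canvas = np.zeros((out_size, out_size, 3), dtype=images[0].dtype)
    regions = [(0, cy, 0, cx), (0, cy, cx, out_size),
               (cy, out_size, 0, cx), (cy, out_size, cx, out_size)]
    for img, (y1, y2, x1, x2) in zip(images, regions):
        # Crop each source image to its quadrant (a real pipeline resizes
        # and also remaps that image's ground-truth boxes)
        canvas[y1:y2, x1:x2] = img[:y2 - y1, :x2 - x1]
    return canvas
```

Since each canvas draws pixels from 4 images, batch normalization sees statistics from 4 contexts per sample, which is the mechanism behind the reduced need for large mini-batches mentioned above.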
Self-Adversarial Training (SAT) also represents a new data augmentation technique that operates in 2 forward-backward stages. In the first stage, the neural network alters the original image instead of the network weights. In this way, the network executes an adversarial attack on itself, modifying the original image to create the deception that there is no desired object in the image. In the second stage, the network is trained to detect objects on this modified image in the normal way.
CmBN denotes a modified version of CBN, shown in Figure 4 and defined as Cross mini-Batch Normalization (CmBN). It collects statistics only between mini-batches within a single batch.
We modify SAM from spatial-wise attention to point-wise attention, and replace the shortcut connection of PAN with concatenation, as shown in Figures 5 and 6, respectively.
3.4. YOLOv4
In this section, we shall elaborate the details of YOLOv4.
YOLOv4 consists of:
- Backbone: CSPDarknet53 [81]
- Neck: SPP [25], PAN [49]
- Head: YOLOv3 [63]
YOLO v4 uses:
- Bag of Freebies (BoF) for backbone:
- CutMix and Mosaic data augmentation,
- DropBlock regularization,
- Class label smoothing
- Bag of Specials (BoS) for backbone:
- Mish activation,
- Cross-stage partial connections (CSP),
- Multi-input weighted residual connections (MiWRC)
- Bag of Freebies (BoF) for detector:
- CIoU-loss,
- CmBN,
- DropBlock regularization,
- Mosaic data augmentation,
- Self-Adversarial Training,
- Eliminate grid sensitivity,
- Using multiple anchors for a single ground truth,
- Cosine annealing scheduler [52],
- Optimal hyperparameters,
- Random training shapes
- Bag of Specials (BoS) for detector:
- Mish activation,
- SPP-block,
- SAM-block,
- PAN path-aggregation block,
- DIoU-NMS
4. Experiments
We tested the influence of different training improvement techniques on classifier accuracy on the ImageNet (ILSVRC 2012 val) dataset, and then on detector accuracy on the MS COCO (test-dev 2017) dataset.
4.1. Experimental setup
4.2. Influence of different features on Classifier training
4.3. Influence of different features on Detector training
- S: Eliminate grid sensitivity - the equation b_x = σ(t_x) + c_x, b_y = σ(t_y) + c_y, where c_x and c_y are always whole numbers, is used in YOLOv3 for evaluating object coordinates; therefore, extremely high absolute values of t_x are required for b_x values approaching c_x or c_x + 1. We solve this problem by multiplying the sigmoid by a factor exceeding 1.0, eliminating the effect of grid cells on which the object is undetectable.
- M: Mosaic data augmentation - using 4-image mosaics during training instead of a single image
- IT: IoU threshold - using multiple anchors for a single ground truth: IoU(truth, anchor) > IoU threshold
- GA: Genetic algorithms - using genetic algorithms to select the optimal hyperparameters during the first 10% of training periods
- LS: Class label smoothing - using class label smoothing for sigmoid activation
- CBN: CmBN - using Cross mini-Batch Normalization for collecting statistics inside the entire batch, instead of collecting statistics inside a single mini-batch
- CA: Cosine annealing scheduler - altering the learning rate during sinusoid training
- DM: Dynamic mini-batch size - automatic increase of mini-batch size during small resolution training by using Random training shapes
- OA: Optimized Anchors - using the optimized anchors for training with the 512x512 network resolution
- GIoU, CIoU, DIoU, MSE - using different loss algorithms for bounding box regression
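The grid-sensitivity fix (row S above) amounts to a change in box-center decoding; here is a sketch assuming a scale factor of 2.0, whereas the text only specifies a factor exceeding 1.0:

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def decode_x(t_x, c_x, scale=2.0):
    """YOLOv3 decodes b_x = sigmoid(t_x) + c_x, so reaching the cell
    borders c_x and c_x + 1 requires |t_x| -> infinity. Scaling the
    sigmoid by `scale` and re-centering stretches the output range to
    (c_x - (scale-1)/2, c_x + 1 + (scale-1)/2), so the borders are
    reachable with finite t_x."""
    return scale * sigmoid(t_x) - 0.5 * (scale - 1.0) + c_x
```

With scale=2.0 the center output at t_x = 0 is still c_x + 0.5, but the decoded range now extends half a cell past each border.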
4.4. Influence of different backbones and pretrained weightings on Detector training
4.5. Influence of different minibatch size on Detector training
5. Results
- Table 8: Comparison of the speed and accuracy of different object detectors on the MS COCO dataset (test-dev 2017). (Real-time detectors with FPS 30 or higher are highlighted here. We compare the results with batch=1 without using tensorRT.)
- Table 9: Comparison of the speed and accuracy of different object detectors on the MS COCO dataset (test-dev 2017). (Real-time detectors with FPS 30 or higher are highlighted here. We compare the results with batch=1 without using tensorRT.)
- Table 10: Comparison of the speed and accuracy of different object detectors on the MS COCO dataset (test-dev 2017). (Real-time detectors with FPS 30 or higher are highlighted here. We compare the results with batch=1 without using tensorRT.)
Summary
It is recommended to first study the techniques used in YOLOv4; with that background, this paper becomes much easier to follow:
- 深度眸: To understand YOLOv4, you first need to know these techniques (Part 1)
- 深度眸: To understand YOLOv4, you first need to know these techniques (Part 2)
References
- YOLOv4: Optimal Speed and Accuracy of Object Detection