MMDetection Paper Notes

Paper Translation

Abstract

We present MMDetection, an object detection toolbox that contains a rich set of object detection and instance segmentation methods as well as related components and modules. The toolbox started from a codebase of MMDet team who won the detection track of COCO Challenge 2018. It gradually evolves into a unified platform that covers many popular detection methods and contemporary modules. It not only includes training and inference codes, but also provides weights for more than 200 network models. We believe this toolbox is by far the most complete detection toolbox. In this paper, we introduce the various features of this toolbox. In addition, we also conduct a benchmarking study on different methods, components, and their hyper-parameters. We wish that the toolbox and benchmark could serve the growing research community by providing a flexible toolkit to reimplement existing methods and develop their own new detectors. Code and models are available at https://github.com/open-mmlab/mmdetection. The project is under active development and we will keep this document updated.

We present MMDetection, an object detection toolbox containing a rich set of object detection and instance segmentation methods together with related components and modules. The toolbox grew out of the codebase of the MMDet team, which won the detection track of the COCO Challenge 2018, and gradually evolved into a unified platform covering many popular detection methods and modules. It includes not only training and inference code but also weights for more than 200 network models. We believe it is by far the most complete detection toolbox. This paper introduces the toolbox's main features and also presents a benchmarking study on different methods, components, and their hyper-parameters. We hope the toolbox and benchmark can serve the growing research community by providing a flexible toolkit for reimplementing existing methods and developing new detectors. The project is under active development, and the documentation will be kept up to date.

Vocabulary

by far: by far, up to now
conduct: v. to organize, carry out, conduct; to conduct (music); to lead, guide; to behave; to conduct (heat or electricity)
benchmarking: benchmarking, benchmark testing
be under active development: to be under active development

1. Introduction

Object detection and instance segmentation are both fundamental computer vision tasks. The pipeline of detection frameworks is usually more complicated than classification-like tasks, and different implementation settings can lead to very different results. Towards the goal of providing a high-quality codebase and unified benchmark, we build MMDetection, an object detection and instance segmentation codebase with PyTorch [24].

Object detection and instance segmentation are both fundamental computer vision tasks. The pipeline of a detection framework is usually more complicated than that of an image classification task, and different implementation settings can lead to very different results. With the goal of providing a high-quality codebase and a unified benchmark, we build MMDetection, an object detection and instance segmentation codebase based on PyTorch.

Major features of MMDetection are: (1) Modular design. We decompose the detection framework into different components and one can easily construct a customized object detection framework by combining different modules. (2) Support of multiple frameworks out of box. The toolbox supports popular and contemporary detection frameworks, see Section 2 for the full list. (3) High efficiency. All basic bbox and mask operations run on GPUs. The training speed is faster than or comparable to other codebases, including Detectron [10], maskrcnn-benchmark [21] and SimpleDet [6]. (4) State of the art. The toolbox stems from the codebase developed by the MMDet team, who won COCO Detection Challenge in 2018, and we keep pushing it forward.

The major features of MMDetection are: (1) Modular design. The detection framework is decomposed into different components, and a customized object detection framework can easily be built by combining different modules. (2) Support for multiple frameworks out of the box. The toolbox supports popular and contemporary detection frameworks; see Section 2 for the full list. (3) High efficiency. All basic bbox and mask operations run on GPUs (so, once again, you really do need GPUs for deep learning). The training speed is faster than or comparable to other codebases, including Detectron [10], maskrcnn-benchmark [21] and SimpleDet [6]. (4) State of the art. The toolbox stems from the codebase developed by the MMDet team, who won the COCO Detection Challenge in 2018, and it keeps being pushed forward.

Apart from introducing the codebase and benchmarking results, we also report our experience and best practice for training object detectors. Ablation experiments on hyperparameters, architectures, and training strategies are performed and discussed. We hope that the study can benefit future research and facilitate comparisons between different methods.

Besides introducing the codebase and benchmark results, we also report our experience and best practices for training object detectors. Ablation experiments on hyper-parameters, architectures, and training strategies are performed and discussed. We hope this study can benefit future research and facilitate comparisons between different methods.

​ The remaining sections are organized as follows. We first introduce various supported methods and highlight important features of MMDetection, and then present the benchmark results. Lastly, we show some ablation studies on some chosen baselines.

The remaining sections are organized as follows. We first introduce the supported methods and highlight the important features of MMDetection, then present the benchmark results. Finally, we show ablation studies on some chosen baselines.

Vocabulary

implementation: n. implementation
decompose: vt. to decompose, break down
customized: adj. customized
out of box: out of the box, ready to use without extra configuration
ablation experiment: ablation study. A detection pipeline may use A, B and C and work well, but you do not know how much each of A, B and C actually contributes; B might be inefficient yet accurate, or A and B might reinforce each other. An ablation experiment tells you (or the reader) how much each part of the pipeline really matters, e.g., Ross Girshick swapping RPN for Selective Search as a comparison, or comparing against a variant that does not share the backbone, to give readers more direct evidence of the algorithm's effectiveness.
facilitate: vt. to facilitate, help, make easier

2. Supported Frameworks

MMDetection contains high-quality implementations of popular object detection and instance segmentation methods. A summary of supported frameworks and features compared with other codebases is provided in Table 1. MMDetection supports more methods and features than other codebases, especially for recent ones. A list is given as follows.

MMDetection contains high-quality implementations of popular object detection and instance segmentation methods. Table 1 summarizes the supported frameworks and features compared with other codebases. MMDetection supports more methods and features than the other codebases, especially recent ones. The list is as follows.

2.1. Single-stage Methods
• SSD [19]: a classic and widely used single-stage detector with simple model architecture, proposed in 2015.

SSD: a classic and widely used single-stage detector with a simple model architecture, proposed in 2015.

• RetinaNet [18]: a high-performance single-stage detector with Focal Loss, proposed in 2017.

RetinaNet: a high-performance single-stage detector using Focal Loss, proposed in 2017.

• GHM [16]: a gradient harmonizing mechanism to improve single-stage detectors, proposed in 2019.

GHM: a gradient harmonizing mechanism for improving single-stage detectors, proposed in 2019.

• FCOS [32]: a fully convolutional anchor-free single-stage detector, proposed in 2019.

FCOS: a fully convolutional anchor-free single-stage detector, proposed in 2019. (Anchor-free methods have arrived in droves this year, in contrast to anchor-based methods such as Faster R-CNN.)

• FSAF [39]: a feature selective anchor-free module for single-stage detectors, proposed in 2019.

FSAF: a feature-selective anchor-free module for single-stage detectors, proposed in 2019.

2.2. Two-stage Methods
• Fast R-CNN [9]: a classic object detector which requires pre-computed proposals, proposed in 2015.

Fast R-CNN: a classic object detector that requires pre-computed proposals, proposed in 2015.

• Faster R-CNN [27]: a classic and widely used two-stage object detector which can be trained end-to-end, proposed in 2015.

Faster R-CNN: a classic and widely used two-stage object detector that can be trained end to end, proposed in 2015.

• R-FCN [7]: a fully convolutional object detector with faster speed than Faster R-CNN, proposed in 2016.

R-FCN: a fully convolutional object detector that is faster than Faster R-CNN, proposed in 2016.

• Mask R-CNN [13]: a classic and widely used object detection and instance segmentation method, proposed in 2017.

Mask R-CNN: a classic and widely used object detection and instance segmentation method, proposed in 2017.

• Grid R-CNN [20]: a grid guided localization mechanism as an alternative to bounding box regression, proposed in 2018.

Grid R-CNN: a grid-guided localization mechanism used as an alternative to bounding box regression, proposed in 2018.

• Mask Scoring R-CNN [15]: an improvement over Mask R-CNN by predicting the mask IoU, proposed in 2019.
• Double-Head R-CNN [35]: different heads for classification and localization, proposed in 2019.
2.3. Multi-stage Methods
• Cascade R-CNN [2]: a powerful multi-stage object detection method, proposed in 2017.

Cascade R-CNN: a powerful multi-stage object detection method, proposed in 2017.

• Hybrid Task Cascade [4]: a multi-stage multi-branch object detection and instance segmentation method, proposed in 2019.

Hybrid Task Cascade: a multi-stage, multi-branch object detection and instance segmentation method, proposed in 2019.

2.4. General Modules and Methods
• Mixed Precision Training [22]: train deep neural networks using half precision floating point (FP16) numbers, proposed in 2018.

Mixed Precision Training: mixed precision training speeds up training with half-precision floating point numbers while keeping the loss of accuracy as small as possible. It stores weights and gradients in FP16 (half-precision floats), which reduces memory usage while accelerating training.

• Soft NMS [1]: an alternative to NMS, proposed in 2017.

Soft NMS: an alternative to NMS, proposed in 2017.

• OHEM [29]: an online sampling method that mines hard samples for training, proposed in 2016.

OHEM: an online sampling method that mines hard samples for training, proposed in 2016. (Hard samples: samples the model handles poorly.)

• DCN [8]: deformable convolution and deformable RoI pooling, proposed in 2017.

DCN: deformable convolution and deformable RoI pooling, proposed in 2017. (RoI: region of interest.)

• DCNv2 [42]: modulated deformable operators, proposed in 2018.

DCNv2: modulated deformable operators, proposed in 2018.

• Train from Scratch [12]: training from random initialization instead of ImageNet pretraining, proposed in 2018.

Train from Scratch: training from random initialization instead of ImageNet pretraining, proposed in 2018.

• ScratchDet [40]: another exploration on training from scratch, proposed in 2018.

ScratchDet: another exploration of training from scratch, proposed in 2018.

• M2Det [38]: a new feature pyramid network to construct more effective feature pyramids, proposed in 2018.
• GCNet [3]: global context block that can efficiently model the global context, proposed in 2019.
• Generalized Attention [41]: a generalized attention formulation, proposed in 2019.
• SyncBN [25]: synchronized batch normalization across GPUs, we adopt the official implementation by PyTorch.
• Group Normalization [36]: a simple alternative to BN, proposed in 2018.
• Weight Standardization [26]: standardizing the weights in the convolutional layers for micro-batch training, proposed in 2019.
• HRNet [30, 31]: a new backbone with a focus on learning reliable high-resolution representations, proposed in 2019.
• Guided Anchoring [34]: a new anchoring scheme that predicts sparse and arbitrary-shaped anchors, proposed in 2019.
• Libra R-CNN [23]: a new framework towards balanced learning for object detection, proposed in 2019.


Vocabulary

harmonizing: harmonizing, coordinating
hybrid: hybrid, mixed
Cascade: cascade
mine: v. to mine, dig, excavate; n. a mine
deformable: adj. deformable

3. Architecture

3.1. Model Representation

​ Although the model architectures of different detectors are different, they have common components, which can be roughly summarized into the following classes.

Although the model architectures of different detectors differ, they share common components, which can roughly be summarized (abstracted) into the following classes.

Backbone Backbone is the part that transforms an image to feature maps, such as a ResNet-50 without the last fully connected layer.

Backbone: the part that converts an image into feature maps, e.g., a ResNet-50 without the last fully connected layer.

Neck Neck is the part that connects the backbone and heads. It performs some refinements or reconfigurations on the raw feature maps produced by the backbone. An example is Feature Pyramid Network (FPN).

Neck: the part that connects the backbone and the heads. It refines or reconfigures the raw feature maps produced by the backbone. An example is the Feature Pyramid Network (FPN).

DenseHead (AnchorHead/AnchorFreeHead) DenseHead is the part that operates on dense locations of feature maps, including AnchorHead and AnchorFreeHead, e.g., RPNHead, RetinaHead, FCOSHead.

DenseHead: the part that operates on dense locations of the feature maps, including AnchorHead and AnchorFreeHead, e.g., RPNHead, RetinaHead, FCOSHead. In practice this is where all the candidate boxes, i.e., the proposals, are produced.

RoIExtractor RoIExtractor is the part that extracts RoI-wise features from a single or multiple feature maps with RoIPooling-like operators. An example that extracts RoI features from the corresponding level of feature pyramids is SingleRoIExtractor.

RoIExtractor: the part that extracts RoI-wise features from one or more feature maps with RoIPooling-like operators. An example that extracts RoI features from the corresponding level of the feature pyramid is SingleRoIExtractor.

RoIHead (BBoxHead/MaskHead) RoIHead is the part that takes RoI features as input and makes RoI-wise task-specific predictions, such as bounding box classification/regression and mask prediction.

RoIHead: the part that takes RoI features as input and makes RoI-wise, task-specific predictions, such as bounding box classification/regression and mask prediction.

With the above abstractions, the framework of single-stage and two-stage detectors is illustrated in Figure 1. We can develop our own methods by simply creating some new components and assembling existing ones.

In summary, both single-stage and two-stage detectors are assembled from these components, as illustrated in Figure 1 of the paper.

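To make the modular design concrete for myself, here is a rough, MMDetection-style config sketch that composes a two-stage detector from the components above. The field names follow the general dict-config convention, but exact keys differ between versions, so treat this as an illustrative sketch rather than a drop-in config file.

```python
# Illustrative MMDetection-style config for a two-stage detector (a sketch;
# exact keys vary across MMDetection versions).
model = dict(
    type='FasterRCNN',
    backbone=dict(            # Backbone: image -> feature maps
        type='ResNet',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        frozen_stages=1),
    neck=dict(                # Neck: refine/reconfigure backbone features
        type='FPN',
        in_channels=[256, 512, 1024, 2048],
        out_channels=256,
        num_outs=5),
    rpn_head=dict(            # DenseHead: dense, anchor-based predictions
        type='RPNHead',
        in_channels=256,
        feat_channels=256),
    roi_head=dict(            # RoIExtractor + RoIHead: RoI-wise predictions
        type='StandardRoIHead',
        bbox_roi_extractor=dict(
            type='SingleRoIExtractor',
            roi_layer=dict(type='RoIAlign', output_size=7),
            out_channels=256,
            featmap_strides=[4, 8, 16, 32]),
        bbox_head=dict(
            type='Shared2FCBBoxHead',
            in_channels=256,
            num_classes=80)))
```

Swapping the backbone, neck, or head dicts for other registered components is how new detectors are assembled from existing parts.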

3.2. Training Pipeline

​ We design a unified training pipeline with hooking mechanism. This training pipeline can not only be used for object detection, but also other computer vision tasks such as image classification and semantic segmentation.

We design a unified training pipeline with a hooking mechanism. This training pipeline can be used not only for object detection but also for other computer vision tasks such as image classification and semantic segmentation.

The training processes of many tasks share a similar workflow, where training epochs and validation epochs run iteratively and validation epochs are optional. In each epoch, we forward and backward the model by many iterations. To make the pipeline more flexible and easy to customize, we define a minimum pipeline which just forwards the model repeatedly. Other behaviors are defined by a hooking mechanism. In order to run a custom training process, we may want to perform some self-defined operations before or after some specific steps. We define some timepoints where users may register any executable methods (hooks), including before_run, before_train_epoch, after_train_epoch, before_train_iter, after_train_iter, before_val_epoch, after_val_epoch, before_val_iter, after_val_iter, after_run. Registered hooks are triggered at specified timepoints following the priority level. A typical training pipeline in MMDetection is shown in Figure 2. The validation epoch is not shown in the figure since we use evaluation hooks to test the performance after each epoch. If specified, it has the same pipeline as the training epoch.

The training processes of many tasks share a similar workflow: training epochs and validation epochs run alternately, and validation epochs are optional. In each epoch we run many iterations of forward and backward passes through the model. To make the pipeline more flexible and easier to customize, we define a minimal pipeline that just repeatedly forwards the model; all other behavior is defined through a hooking mechanism. To run a custom training process, we may want to perform some self-defined operations before or after specific steps, so we define timepoints at which users can register executable methods (hooks, essentially callbacks), such as before_run and so on. Registered hooks are triggered at the specified timepoints in order of priority. A typical MMDetection training pipeline is shown in Figure 2. The validation epoch is not shown in the figure because evaluation hooks are used to test performance after each epoch; when enabled, it follows the same pipeline as a training epoch.
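As a note on how such a hooking mechanism fits together, here is a toy sketch of a hook registry and a minimal runner loop. The names (Hook, Runner, register_hook, the timepoint methods) mirror the description above, but this is not the actual mmcv implementation, which also handles priorities, validation epochs, and many more timepoints.

```python
# Toy sketch of a hook-based training loop (illustrative only).
class Hook:
    def before_run(self, runner): pass
    def before_train_epoch(self, runner): pass
    def after_train_iter(self, runner): pass
    def after_train_epoch(self, runner): pass
    def after_run(self, runner): pass

class Runner:
    def __init__(self, model, optimizer, data_loader):
        # assumes model(batch) returns the loss tensor
        self.model, self.optimizer, self.data_loader = model, optimizer, data_loader
        self._hooks = []

    def register_hook(self, hook):
        self._hooks.append(hook)          # priorities omitted in this sketch

    def call(self, name):
        for hook in self._hooks:
            getattr(hook, name)(self)     # trigger every registered hook

    def run(self, max_epochs):
        self.call('before_run')
        for epoch in range(max_epochs):
            self.call('before_train_epoch')
            for batch in self.data_loader:
                loss = self.model(batch)  # minimal pipeline: just forward
                loss.backward()
                self.optimizer.step()
                self.optimizer.zero_grad()
                self.call('after_train_iter')
            self.call('after_train_epoch')
        self.call('after_run')
```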


Vocabulary

hooking: hooks, a hook mechanism
semantic: semantic
custom: custom, self-defined; (n.) custom, convention

4. Benchmarks

4.1. Experimental Setting

Dataset. MMDetection supports both VOC-style and COCO-style datasets. We adopt MS COCO 2017 as the primary benchmark for all experiments since it is more challenging and widely used. We use the train split for training and report the performance on the val split.

MMDetection supports both VOC-style and COCO-style datasets. We adopt MS COCO 2017 as the primary benchmark for all experiments (i.e., the standard mAP protocol) because it is more challenging and widely used. Training uses the train split, and performance is reported on the val split.

Implementation details. If not otherwise specified, we adopt the following settings. (1) Images are resized to a maximum scale of 1333 × 800, without changing the aspect ratio. (2) We use 8 V100 GPUs for training with a total batch size of 16 (2 images per GPU) and a single V100 GPU for inference. (3) The training schedule is the same as Detectron [10]. “1x” and “2x” mean 12 epochs and 24 epochs respectively. “20e” is adopted in cascade models, which denotes 20 epochs.

Implementation details. Unless otherwise specified, the following default settings are adopted. (1) Images are resized to a maximum scale of 1333 × 800 without changing the aspect ratio. (2) Training uses 8 V100 GPUs with a total batch size of 16 (2 images per GPU, 16 = 2 × 8), and inference uses a single V100 GPU. (3) The training schedule is the same as Detectron: “1x” and “2x” mean 12 and 24 epochs respectively, and cascade models adopt “20e”, which denotes 20 epochs.

Evaluation metrics. We adopt standard evaluation metrics for COCO dataset, where multiple IoU thresholds from 0.5 to 0.95 are applied. The results of region proposal network (RPN) are measured with Average Recall (AR) and detection results are evaluated with mAP.

Evaluation metrics. We adopt the standard COCO evaluation metrics, applying multiple IoU thresholds from 0.5 to 0.95 (for computing mAP). RPN results are measured with Average Recall (AR), and detection results are evaluated with mAP.
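For reference, the standard COCO evaluation described here can be reproduced with pycocotools roughly as follows; the file paths are placeholders, and the detections file must be in COCO results format.

```python
# Minimal sketch of COCO-style evaluation with pycocotools.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO('annotations/instances_val2017.json')    # ground truth (placeholder path)
coco_dt = coco_gt.loadRes('detections.json')            # model detections (placeholder path)

coco_eval = COCOeval(coco_gt, coco_dt, iouType='bbox')  # use 'segm' for mask AP
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()   # prints AP averaged over IoU=0.50:0.95, AP50, AP75, ...
```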

4.2. Benchmarking Results

Main results. We benchmark different methods on COCO 2017 val, including SSD [19], RetinaNet [18], Faster R-CNN [27], Mask R-CNN [13], Cascade R-CNN [2], Hybrid Task Cascade [4] and FCOS [32]. We evaluate all results with four widely used backbones, i.e., ResNet-50 [14], ResNet-101 [14], ResNeXt-101-32x4d [37] and ResNeXt-101-64x4d [37]. We report the inference speed of these methods and bbox/mask AP in Figure 3. The inference time is tested on a single Tesla V100 GPU.

We benchmark different methods on COCO 2017 val, including SSD, RetinaNet, Faster R-CNN, Mask R-CNN, Cascade R-CNN, Hybrid Task Cascade and FCOS. All results are evaluated with four widely used backbones (ResNet-50, ResNet-101, ResNeXt-101-32x4d and ResNeXt-101-64x4d). The inference speed and bbox/mask AP of these methods are reported in Figure 3.

Comparison with other codebases. Besides MMDetection, there are also other popular codebases like Detectron [10], maskrcnn-benchmark [21] and SimpleDet [6]. They are built on the deep learning frameworks of caffe2, PyTorch [24] and MXNet [5], respectively. We compare MMDetection with Detectron (@a6a835f), maskrcnn-benchmark (@c8eff2c) and SimpleDet (@cf4fce4) from three aspects: performance, speed and memory. Mask R-CNN and RetinaNet are taken as representatives of two-stage and single-stage detectors. Since these codebases are also under development, the reported results in their model zoos may be outdated, and those results are tested on different hardware. For fair comparison, we pull the latest codes and test them in the same environment. Results are shown in Table 2. The memory reported by different frameworks is measured in different ways. MMDetection reports the maximum memory of all GPUs, maskrcnn-benchmark reports the memory of GPU 0, and these two adopt the PyTorch API “torch.cuda.max_memory_allocated()”.

Besides MMDetection, there are other popular codebases such as Detectron, maskrcnn-benchmark and SimpleDet, built on the deep learning frameworks caffe2, PyTorch and MXNet respectively. We compare MMDetection with them in three aspects: performance, speed and memory. Since these codebases are also under development, the results reported in their model zoos may be outdated and were tested on different hardware. For a fair comparison, we pull the latest code and test everything in the same environment; the results are shown in Table 2. The memory reported by different frameworks is measured in different ways: MMDetection reports the maximum memory over all GPUs, maskrcnn-benchmark reports the memory of GPU 0, and both use the PyTorch API “torch.cuda.max_memory_allocated()”.


Detectron reports the GPU memory with the caffe2 API “caffe2.python.utils.GetGPUMemoryUsageStats()”, and SimpleDet reports the memory shown by “nvidia-smi”, a command line utility provided by NVIDIA.
Generally, the actual memory usage of MMDetection and maskrcnn-benchmark is similar and lower than the others.

Detectron reports GPU memory with the caffe2 API “caffe2.python.utils.GetGPUMemoryUsageStats()”, and SimpleDet reports the memory shown by nvidia-smi. Overall, MMDetection and maskrcnn-benchmark use similar and lower memory than the others. (Honestly, looking at the numbers, MMDetection's advantage here does not seem that large.)
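For reference, peak GPU memory can be queried in PyTorch as follows; this is a small sketch around the API named above, with a placeholder workload standing in for a real detector.

```python
# Sketch: measuring peak GPU memory in PyTorch (assumes a CUDA device is available).
import torch

torch.cuda.reset_peak_memory_stats()        # clear previous peak statistics

model = torch.nn.Conv2d(3, 64, 3).cuda()    # placeholder workload
x = torch.randn(2, 3, 800, 1333, device='cuda')
y = model(x)

peak_bytes = torch.cuda.max_memory_allocated()   # peak on the current device
print(f'peak allocated: {peak_bytes / 1024**2:.1f} MB')
```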

Inference speed on different GPUs. Different researchers may use various GPUs; here we show the speed benchmark on common GPUs, e.g., TITAN X, TITAN Xp, TITAN V, GTX 1080 Ti, RTX 2080 Ti and V100. We evaluate three models on each type of GPU and report the inference speed in Figure 4. It is noted that other hardware of these servers is not exactly the same, such as CPUs and hard disks, but the results can provide a basic impression for the speed benchmark.

Inference speed comparison across different GPUs.

Mixed precision training. MMDetection supports mixed precision training to reduce GPU memory and to speed up the training, while the performance remains almost the same. The maskrcnn-benchmark supports mixed precision training with apex and SimpleDet also has its own implementation. Detectron does not support it yet. We report the results and compare with the other two codebases in Table 3. We test all codebases on the same V100 node.

Additionally, we investigate more models to figure out the effectiveness of mixed precision training. As shown in Table 4, we can learn that a larger batch size is more memory saving. When the batch size is increased to 12, the memory of FP16 training is reduced to nearly half of FP32 training. Moreover, mixed precision training is more memory efficient when applied to simpler frameworks like RetinaNet.

Mixed precision training. MMDetection supports mixed precision training to reduce GPU memory and speed up training while keeping performance almost unchanged. maskrcnn-benchmark supports mixed precision training through apex, and SimpleDet has its own implementation; Detectron does not support it yet. We compare MMDetection with the other two codebases in Table 3; all codebases are tested on the same V100 node.

In addition, we investigate more models to understand the effectiveness of mixed precision training. The larger the batch size, the more memory is saved: when the batch size is increased to 12, the memory of FP16 training drops to nearly half of FP32 training. Moreover, mixed precision training is more memory-efficient when applied to simpler frameworks such as RetinaNet.
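As a note on what FP16 mixed precision training looks like in practice, here is a generic PyTorch sketch using torch.cuda.amp. MMDetection wires FP16 support into its own runner and hooks rather than a bare loop like this, so this is only illustrative.

```python
# Generic mixed precision training loop with torch.cuda.amp (a sketch).
import torch

model = torch.nn.Linear(1024, 10).cuda()          # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.02, momentum=0.9)
scaler = torch.cuda.amp.GradScaler()               # scales the loss to avoid FP16 underflow

for step in range(100):
    x = torch.randn(16, 1024, device='cuda')       # placeholder batch
    target = torch.randint(0, 10, (16,), device='cuda')

    optimizer.zero_grad()
    with torch.cuda.amp.autocast():                 # run the forward pass in FP16 where safe
        loss = torch.nn.functional.cross_entropy(model(x), target)
    scaler.scale(loss).backward()                   # backward on the scaled loss
    scaler.step(optimizer)                          # unscales gradients, then steps
    scaler.update()
```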

Multi-node scalability. Since MMDetection supports distributed training on multiple nodes, we test its scalability on 8, 16, 32, 64 GPUs, respectively. We adopt Mask R-CNN as the benchmarking method and conduct experiments on another V100 cluster. Following [11], the base learning rate is adjusted linearly when adopting different batch sizes. Experimental results in Figure 5 show that MMDetection achieves nearly linear acceleration for multiple nodes.

Multi-node scalability. Since MMDetection supports distributed training on multiple nodes, its scalability is tested on 8, 16, 32 and 64 GPUs respectively, using Mask R-CNN as the benchmark method on another V100 cluster. Following [11], the base learning rate is adjusted linearly with the batch size. MMDetection achieves nearly linear acceleration across multiple nodes.

Vocabulary

aspect ratio: aspect ratio
be under development: to be under development
node: node
scalability: scalability
respectively: respectively, separately

5. Extensive Studies

With MMDetection, we conducted an extensive study on some important components and hyper-parameters. We wish that the study can shed light on better practices in making fair comparisons across different methods and settings.

With MMDetection, we conducted an extensive study of some important components and hyper-parameters. We hope this study can shed light on better practices for making fair comparisons across different methods and settings.


5.1. Regression Losses

A multi-task loss is usually adopted for training an object detector, which consists of the classification and regression branch. The most widely adopted regression loss is Smooth L1 loss. Recently, there are more regression losses proposed, e.g., Bounded IoU Loss [33], IoU Loss [32], GIoU Loss [28], Balanced L1 Loss [23]. L1 Loss is also a straightforward variant. However, these losses are usually implemented in different methods and settings. Here we evaluate all the losses under the same environment. It is noted that the final performance varies with different loss weights assigned to the regression loss, hence, we perform coarse grid search to find the best loss weight for each loss.

A multi-task loss, consisting of a classification branch and a regression branch, is usually adopted for training an object detector. The most widely adopted regression loss is Smooth L1 loss. Recently more regression losses have been proposed, e.g., Bounded IoU Loss, IoU Loss, GIoU Loss [28] and Balanced L1 Loss [23]; L1 Loss is also a straightforward variant (essentially Smooth L1 without the quadratic region near zero). However, these losses are usually implemented with different methods and settings, so here we evaluate all of them under the same environment. Note that the final performance varies with the loss weight assigned to the regression loss, so we perform a coarse grid search to find the best loss weight for each loss.
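As a reminder of what an IoU-based regression loss computes, here is a plain-PyTorch sketch of GIoU loss for axis-aligned boxes; this is not the exact MMDetection implementation, and per-sample loss weighting is omitted.

```python
# Sketch of a GIoU loss for boxes in (x1, y1, x2, y2) format.
import torch

def giou_loss(pred, target, eps=1e-7):
    # areas of predicted and target boxes
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])

    # intersection and IoU
    lt = torch.max(pred[:, :2], target[:, :2])       # top-left of overlap
    rb = torch.min(pred[:, 2:], target[:, 2:])       # bottom-right of overlap
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    union = area_p + area_t - inter
    iou = inter / (union + eps)

    # smallest enclosing box
    lt_c = torch.min(pred[:, :2], target[:, :2])
    rb_c = torch.max(pred[:, 2:], target[:, 2:])
    wh_c = (rb_c - lt_c).clamp(min=0)
    enclose = wh_c[:, 0] * wh_c[:, 1]

    giou = iou - (enclose - union) / (enclose + eps)
    return (1.0 - giou).mean()
```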

Results in Table 5 show that by simply increasing the loss weight of Smooth L1 Loss, the final performance can improve by 0.5%. Without tuning the loss weight, L1 Loss is 0.6% higher than Smooth L1, while increasing the loss weight will not bring further gain. L1 loss has larger loss values than Smooth L1, especially for bounding boxes that are relatively accurate. According to the analysis in [23], boosting the gradients of better located bounding boxes will benefit the localization. The loss values of L1 loss are already quite large, therefore, increasing loss weight does not work better. Balanced L1 Loss achieves 0.3% higher mAP than L1 Loss for end-to-end Faster R-CNN, which is a little different from experiments in [23] that adopts pre-computed proposals. However, we find that Balanced L1 loss can lead to a higher gain on the baseline of the proposed IoU-balanced sampling or balanced FPN. IoU-based losses perform slightly better than L1-based losses with optimal loss weights except for Bounded IoU Loss. GIoU Loss is 0.1% higher than IoU Loss, and Bounded IoU Loss has similar performance to Smooth L1 Loss, but requires a larger loss weight.

The results in Table 5 show that simply increasing the loss weight of Smooth L1 Loss improves the final performance by 0.5%. Without tuning the loss weight, L1 Loss is 0.6% higher than Smooth L1, while increasing its loss weight brings no further gain. L1 loss has larger loss values than Smooth L1, especially for bounding boxes that are already relatively accurate. According to the analysis in [23], boosting the gradients of better-located bounding boxes benefits localization; since the loss values of L1 loss are already quite large, increasing the loss weight does not help further. Balanced L1 Loss achieves 0.3% higher mAP than L1 Loss for end-to-end Faster R-CNN, which differs slightly from the experiments in [23] that use pre-computed proposals. However, we find that Balanced L1 loss can lead to a higher gain on top of the proposed IoU-balanced sampling or balanced FPN baselines. IoU-based losses perform slightly better than L1-based losses with optimal loss weights, except for Bounded IoU Loss. GIoU Loss is 0.1% higher than IoU Loss, and Bounded IoU Loss performs similarly to Smooth L1 Loss but requires a larger loss weight.

5.2. Normalization Layers

The batch size used when training detectors is usually small (1 or 2) due to limited GPU memory, and thus BN layers are usually frozen as a typical convention. There are two options for configuring BN layers: (1) whether to update the statistics E(x) and Var(x), and (2) whether to optimize the affine weights γ and β. Following the argument names of PyTorch, we denote (1) and (2) as eval and requires_grad. eval = True means statistics are not updated, and requires_grad = True means γ and β are also optimized during training. Apart from freezing BN layers, there are also other normalization layers which tackle the problem of small batch size, such as Synchronized BN (SyncBN) [25] and Group Normalization (GN) [36]. We first evaluate different settings for BN layers in backbones, and then compare BN with SyncBN and GN.

Because GPU memory is limited, the batch size used to train detectors is usually small (typically 1 or 2), so BN layers are usually frozen as a typical convention. There are two options when configuring BN layers: (1) whether to update the statistics E(x) and Var(x), and (2) whether to optimize the affine weights γ and β. Following PyTorch's argument names, we denote (1) and (2) as eval and requires_grad: eval = True means the statistics are not updated, and requires_grad = True means γ and β are still optimized during training. Apart from freezing BN layers, there are other normalization layers that tackle the small-batch-size problem, such as Synchronized BN (SyncBN) [25] and Group Normalization (GN) [36]. We first evaluate different settings of the BN layers in backbones, and then compare BN with SyncBN and GN.

BN settings. We evaluate different combinations of eval and requires_grad on Mask R-CNN, under 1x and 2x training schedules. Results in Table 6 show that updating statistics with a small batch size severely harms the performance, when we recompute statistics (eval is false) and fix the affine weights (requires_grad is false), respectively. Compared with eval = True, requires_grad = True, it is 3.1% lower in terms of bbox AP and 3.0% lower in terms of mask AP. Under the 1x learning rate (lr) schedule, fixing the affine weights or not only makes slight differences, i.e., 0.1%. When a longer lr schedule is adopted, making affine weights trainable outperforms fixing these weights by about 0.5%. In MMDetection, eval = True, requires_grad = True is adopted as the default setting.

BN settings. We evaluate different combinations of eval and requires_grad on Mask R-CNN under the 1x and 2x training (learning-rate) schedules. The results in Table 6 show that updating the statistics with a small batch size severely harms performance: compared with eval = True, requires_grad = True, recomputing the statistics (eval = False) is 3.1% lower in bbox AP and 3.0% lower in mask AP. Under the 1x lr schedule, whether the affine weights are fixed or not makes only a slight difference of about 0.1%; with a longer lr schedule, making the affine weights trainable outperforms fixing them by about 0.5%. In MMDetection, eval = True, requires_grad = True is adopted as the default setting.
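Here is a plain-PyTorch sketch of the eval/requires_grad setting described above. MMDetection expresses this through config options (frozen BN in the backbone) rather than an explicit helper like this, so it is only illustrative.

```python
# Sketch: freezing BN statistics in a backbone while optionally keeping the
# affine weights (gamma/beta) trainable, mirroring eval/requires_grad above.
import torch
import torchvision

backbone = torchvision.models.resnet50()   # placeholder backbone

def freeze_bn(module, train_affine=True):
    for m in module.modules():
        if isinstance(m, torch.nn.BatchNorm2d):
            m.eval()                        # eval=True: E(x), Var(x) not updated
            for p in m.parameters():        # gamma and beta
                p.requires_grad = train_affine

freeze_bn(backbone, train_affine=True)      # default MMDetection-style setting

# Note: calling model.train() puts BN layers back into training mode, so the
# freezing has to be re-applied (or train() overridden) after every such call.
backbone.train()
freeze_bn(backbone, train_affine=True)
```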

Different normalization layers. Batch Normalization (BN) is widely adopted in modern CNNs. However, it heavily depends on the large batch size to precisely estimate the statistics E(x) and Var(x). In object detection, the batch size is usually much smaller than in classification, and the typical solution is to use the statistics of pretrained backbones and not to update them during training, denoted as FrozenBN. More recently, SyncBN and GN are proposed and have proved their effectiveness [36, 25]. SyncBN computes mean and variance across multi-GPUs and GN divides channels of features into groups and computes mean and variance within each group, which help to combat against the issue of small batch sizes. FrozenBN, SyncBN and GN can be specified in MMDetection with only simple modifications in config files.

Different normalization layers. Batch Normalization (BN) is widely adopted in modern CNNs, but it depends heavily on a large batch size to estimate the statistics E(x) and Var(x) precisely. In object detection, the batch size is usually much smaller than in classification, and the typical solution is to use the statistics of the pretrained backbone without updating them during training, denoted FrozenBN. More recently, SyncBN and GN have been proposed and proved effective [36, 25]. SyncBN computes the mean and variance across multiple GPUs, and GN divides the feature channels into groups and computes the mean and variance within each group, both of which help combat the small-batch-size issue. FrozenBN, SyncBN and GN can be selected in MMDetection with simple modifications of the config files.

Here we study two questions. (1) How do different normalization layers compare with each other? (2) Where to add normalization layers to detectors? To answer these two questions, we run three experiments of Mask R-CNN with ResNet-50-FPN and replace the BN layers in backbones with FrozenBN, SyncBN and GN, respectively. Group number is set to 32 following [36]. Other settings and model architectures are kept the same. In [36], the 2fc bbox head is replaced with 4conv1fc and GN layers are also added to FPN and bbox/mask heads. We perform another two sets of experiments to study these two changes. Furthermore, we explore different number of convolution layers for bbox head.

Here we study two questions: (1) How do different normalization layers compare with each other? (2) Where should normalization layers be added to detectors? To answer them, we run three experiments of Mask R-CNN with ResNet-50-FPN, replacing the BN layers in the backbone with FrozenBN, SyncBN and GN respectively. The group number is set to 32 following [36]; other settings and model architectures are kept the same. In [36], the 2fc bbox head is replaced with 4conv1fc, and GN layers are also added to the FPN and the bbox/mask heads; we run another two sets of experiments to study these two changes. Furthermore, we explore different numbers of convolution layers for the bbox head.

Results in Table 7 show that (1) FrozenBN, SyncBN and GN achieve similar performance if we just replace BN layers in backbones with corresponding ones. (2) Adding SyncBN or GN to FPN and bbox/mask head will not bring further gain. (3) Replacing the 2fc bbox head with 4conv1fc as well as adding normalization layers to FPN and bbox/mask head improves the performance by around 1.5%. (4) More convolution layers in bbox head will lead to higher performance.

The results in Table 7 show that: (1) FrozenBN, SyncBN and GN achieve similar performance if we only replace the BN layers in the backbone with the corresponding layers. (2) Adding SyncBN or GN to the FPN and bbox/mask heads brings no further gain. (3) Replacing the 2fc bbox head with 4conv1fc and adding normalization layers to the FPN and bbox/mask heads improves performance by about 1.5%. (4) More convolution layers in the bbox head lead to higher performance.


Vocabulary

assign to: to assign to, allocate to
affine: affine (as in an affine transformation)
argument: argument, point of debate; (in PyTorch) a function argument
in terms of: in terms of, with respect to
variance: variance

5.3. Training Scales

As a typical convention, training images are resized to a predefined scale without changing the aspect ratio. Previous studies typically prefer a scale of 1000 × 600, and now 1333 × 800 is typically adopted. In MMDetection, we adopt 1333 × 800 as the default training scale. As a simple data augmentation method, multi-scale training is also commonly used. No systematic study exists to examine the way to select an appropriate training scale. Knowing this is crucial to facilitate more effective and efficient training. When multi-scale training is adopted, a scale is randomly selected in each iteration, and the image will be resized to the selected scale. There are mainly two random selection methods, one is to predefine a set of scales and randomly pick a scale from them, the other is to define a scale range, and randomly generate a scale between the minimum and maximum scale. We denote the first method as “value” mode and the second one as “range” mode. Specifically, “range” mode can be seen as a special case of “value” mode where the interval of predefined scales is 1.

As a typical convention, training images are resized to a predefined scale without changing the aspect ratio. Previous studies typically preferred 1000 × 600, and 1333 × 800 is now typically adopted; MMDetection uses 1333 × 800 as the default training scale. Multi-scale training is also commonly used as a simple data augmentation method, but there is no systematic study of how to select appropriate training scales, and knowing this is crucial for more effective and efficient training. When multi-scale training is adopted, a scale is randomly selected in each iteration and the image is resized to it. There are two main random selection methods: one predefines a set of scales and randomly picks one of them; the other defines a scale range and randomly generates a scale between the minimum and maximum. We denote the first as "value" mode and the second as "range" mode. "Range" mode can be seen as a special case of "value" mode where the interval between predefined scales is 1.
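A small sketch of the two selection modes, using the 1333 × [640:800] numbers from the experiments below as placeholders:

```python
# Sketch of the "value" and "range" multi-scale selection modes.
import random

def sample_scale_value(scales=((1333, 640), (1333, 672), (1333, 704),
                               (1333, 736), (1333, 768), (1333, 800))):
    # "value" mode: pick one of the predefined (long_edge, short_edge) scales
    return random.choice(scales)

def sample_scale_range(long_edge=1333, short_min=640, short_max=800):
    # "range" mode: sample the shorter edge uniformly between min and max
    return long_edge, random.randint(short_min, short_max)

# One scale is drawn per training iteration and the image is resized to it,
# keeping the aspect ratio (the longer edge is capped at 1333).
print(sample_scale_value())
print(sample_scale_range())
```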

We train Mask R-CNN with different scales and random modes, and adopt the 2x lr schedule because more training augmentation usually requires longer lr schedules. The results are shown in Table 8, in which 1333 × [640:800:32] indicates that the longer edge is fixed to 1333 and the shorter edge is randomly selected from the pool of {640, 672, 704, 736, 768, 800}, corresponding to the “value” mode. The setting 1333 × [640:800] indicates that the shorter edge is randomly selected between 640 and 800, which corresponds to the “range” mode. From the results we can learn that the “range” mode performs similarly to or slightly better than the “value” mode with the same minimum and maximum scales. Usually a wider range brings more improvement, especially for larger maximum scales. Specifically, [640:960] is 0.4% and 0.5% higher than [640:800] in terms of bbox and mask AP. However, a smaller minimum scale like 480 will not achieve better performance.

We train Mask R-CNN with different scales and random modes, adopting the 2x lr schedule because stronger training augmentation usually requires a longer schedule. The results are shown in Table 8, where 1333 × [640:800:32] means the longer edge is fixed to 1333 and the shorter edge is randomly selected from the pool {640, 672, 704, 736, 768, 800}, corresponding to "value" mode, while 1333 × [640:800] means the shorter edge is randomly chosen between 640 and 800, corresponding to "range" mode. The results show that "range" mode performs similarly to or slightly better than "value" mode with the same minimum and maximum scales. A wider range usually brings more improvement, especially for larger maximum scales: [640:960] is 0.4% and 0.5% higher than [640:800] in bbox and mask AP. However, a smaller minimum scale such as 480 does not achieve better performance.


5.4. Other Hyper-parameters.
MMDetection mainly follows the hyper-parameter settings in Detectron and also explores our own implementations. Empirically, we found that some of the hyperparameters of Detectron are not optimal, especially for RPN. In Table 9, we list those that can further improve the performance of RPN. Although the tuning may benefit the performance, in MMDetection we adopt the same setting as Detectron by default and just leave this study for reference.

MMDetection mainly follows the hyper-parameter settings of Detectron and also explores our own implementation. Empirically, we found that some of Detectron's hyper-parameters are not optimal, especially for RPN. Table 9 lists those that can further improve RPN performance. Although tuning them may benefit performance, MMDetection adopts the same settings as Detectron by default and leaves this study for reference.

smoothl1_beta. Most detection methods adopt Smooth L1 Loss as the regression loss, implemented as torch.where(x < beta, 0.5 * x ** 2 / beta, x - 0.5 * beta).

Most detection methods adopt Smooth L1 loss as the regression loss; in PyTorch it is implemented as torch.where(x < beta, 0.5 * x ** 2 / beta, x - 0.5 * beta).

The parameter beta is the threshold between the L1 term and the MSELoss term. It is set to 1/9 in RPN by default, according to the empirical standard deviation of the regression errors.

The parameter beta is the threshold between the L1 term and the MSE term. Following the empirical standard deviation of the regression errors, it is set to 1/9 in RPN by default.
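A short sketch of Smooth L1 with a configurable beta, matching the formula above:

```python
# Sketch of Smooth L1 loss with a configurable beta.
import torch

def smooth_l1_loss(pred, target, beta=1.0 / 9.0):
    diff = torch.abs(pred - target)
    # quadratic (MSE-like) below beta, linear (L1-like) above it
    loss = torch.where(diff < beta,
                       0.5 * diff ** 2 / beta,
                       diff - 0.5 * beta)
    return loss.mean()

# As beta -> 0 the quadratic region shrinks and the loss approaches plain L1,
# which is why a smaller beta behaves like L1 Loss with a larger effective weight.
```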

Experimental results show that a smaller beta may improve average recall (AR) of RPN slightly. In the study of Section 5.1, we found that L1 Loss performs better than Smooth L1 when the loss weight is 1. When we set beta to a smaller value, Smooth L1 Loss will get closer to L1 Loss and the equivalent loss weight is larger, resulting in better performance.

Experimental results show that a smaller beta slightly improves the average recall (AR) of RPN. In Section 5.1 we found that L1 Loss performs better than Smooth L1 when the loss weight is 1. When beta is set to a smaller value, Smooth L1 Loss gets closer to L1 Loss and the equivalent loss weight becomes larger, resulting in better performance.

allowed_border. In RPN, pre-defined anchors are generated on each location of a feature map. Anchors exceeding the boundaries of the image by more than allowed_border will be ignored during training. It is set to 0 by default, which means any anchors exceeding the image boundary will be ignored. However, we find that relaxing this rule will be beneficial. If we set it to infinity, which means none of the anchors are ignored, AR will be improved from 57.1% to 57.7%. In this way, ground truth objects near boundaries will have more matching positive samples during training.

allowed_border. In RPN, pre-defined anchors are generated at every location of the feature map, and anchors exceeding the image boundary by more than the allowed border are ignored during training. The default value is 0, meaning any anchor that exceeds the image boundary is ignored. However, we find that relaxing this rule is beneficial: setting it to infinity, so that no anchors are ignored, improves AR from 57.1% to 57.7%. In this way, ground-truth objects near the boundary have more matching positive samples during training.
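A sketch of how such an anchor validity check might look (illustrative only; treating a negative value as "disable the check" is my own shorthand here, not necessarily the library's convention):

```python
# Sketch of the allowed_border rule: mark anchors that stick out of the image
# by more than `allowed_border` pixels as invalid (ignored during training).
import torch

def valid_anchor_flags(anchors, img_h, img_w, allowed_border=0):
    # anchors: (N, 4) tensor in (x1, y1, x2, y2) format
    if allowed_border < 0:                      # shorthand: keep every anchor
        return torch.ones(anchors.size(0), dtype=torch.bool)
    return ((anchors[:, 0] >= -allowed_border) &
            (anchors[:, 1] >= -allowed_border) &
            (anchors[:, 2] < img_w + allowed_border) &
            (anchors[:, 3] < img_h + allowed_border))

# allowed_border=0 ignores every anchor crossing the border; a very large value
# (effectively infinity) keeps all anchors, which raised AR from 57.1% to 57.7%.
```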

neg_pos_ub. We add this new hyper-parameter for sampling positive and negative anchors. When training the RPN, in the case when insufficient positive anchors are present, one typically samples more negative samples to guarantee a fixed number of training samples. Here we explore neg_pos_ub to control the upper bound of the ratio of negative samples to positive samples. Setting neg_pos_ub to infinity leads to the aforementioned sampling behavior. This default practice will sometimes cause an imbalanced distribution of negative and positive samples. By setting it to a reasonable value, e.g., 3 or 5, which means we sample negative samples at most 3 or 5 times the number of positive ones, a gain of 1.2% or 1.1% is observed.

neg_pos_ub. We add this new hyper-parameter for sampling positive and negative anchors. When training the RPN, if there are not enough positive anchors, more negative samples are typically drawn to guarantee a fixed number of training samples. Here neg_pos_ub controls the upper bound of the ratio of negative to positive samples; setting it to infinity gives the default behavior above, which sometimes causes an imbalanced distribution of negative and positive samples. Setting it to a reasonable value such as 3 or 5, i.e., sampling at most 3 or 5 negatives per positive, gives a gain of 1.2% or 1.1%.
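A toy sketch of capping the negative-to-positive ratio when sampling anchors; the parameter names mirror the discussion above, but this is not the exact MMDetection sampler.

```python
# Sketch of anchor sampling with an upper bound on the negative:positive ratio.
import random

def sample_anchors(pos_inds, neg_inds, num_samples=256, pos_fraction=0.5,
                   neg_pos_ub=3):
    num_pos = min(len(pos_inds), int(num_samples * pos_fraction))
    pos = random.sample(pos_inds, num_pos) if num_pos > 0 else []

    # fill the rest with negatives, but never exceed neg_pos_ub negatives per
    # positive (neg_pos_ub = float('inf') reproduces the default behavior)
    num_neg = num_samples - len(pos)
    if neg_pos_ub != float('inf') and len(pos) > 0:
        num_neg = min(num_neg, int(neg_pos_ub * len(pos)))
    neg = random.sample(neg_inds, min(num_neg, len(neg_inds)))
    return pos, neg

pos, neg = sample_anchors(list(range(20)), list(range(1000)), neg_pos_ub=3)
print(len(pos), len(neg))   # e.g. 20 positives and at most 60 negatives
```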

A. Detailed Results

We present detailed benchmarking results for some methods in Table 10. R-50 and R-50 (c) denote pytorch-style and caffe-style ResNet-50 backbones, respectively. In the bottleneck residual block, pytorch-style ResNet uses a 1x1 stride-1 convolutional layer followed by a 3x3 stride-2 convolutional layer, while caffe-style ResNet uses a 1x1 stride-2 convolutional layer followed by a 3x3 stride-1 convolutional layer.

Table 10 presents detailed benchmark results for some methods. R-50 and R-50 (c) denote the pytorch-style and caffe-style ResNet-50 backbones respectively. In the bottleneck residual block, pytorch-style ResNet uses a 1x1 stride-1 convolution followed by a 3x3 stride-2 convolution, while caffe-style ResNet uses a 1x1 stride-2 convolution followed by a 3x3 stride-1 convolution.
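A minimal sketch of where the stride-2 convolution sits in the two styles; BN, ReLU, and the residual shortcut are omitted for brevity.

```python
# Sketch: stride placement in a downsampling bottleneck block.
# pytorch-style strides in the 3x3 conv, caffe-style strides in the first 1x1.
import torch.nn as nn

def bottleneck_convs(in_ch, mid_ch, out_ch, style='pytorch'):
    s1, s2 = (1, 2) if style == 'pytorch' else (2, 1)
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, kernel_size=1, stride=s1, bias=False),   # 1x1
        nn.Conv2d(mid_ch, mid_ch, kernel_size=3, stride=s2, padding=1,
                  bias=False),                                             # 3x3
        nn.Conv2d(mid_ch, out_ch, kernel_size=1, stride=1, bias=False),   # 1x1
    )

# Both variants downsample by 2 overall; only the position of the stride differs.
pytorch_block = bottleneck_convs(256, 128, 512, style='pytorch')
caffe_block = bottleneck_convs(256, 128, 512, style='caffe')
```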

Refer to https://github.com/open-mmlab/mmdetection/blob/master/MODEL_ZOO.md for more settings and components.

Code Notes
