Abstract
We introduce YOLO9000, a state-of-the-art, real-time object detection system that can detect over 9000 object categories. First we propose various improvements to the YOLO detection method, both novel and drawn from prior work. The improved model, YOLOv2, is state-of-the-art on standard detection tasks like PASCAL VOC and COCO. Using a novel, multi-scale training method the same YOLOv2 model can run at varying sizes, offering an easy tradeoff between speed and accuracy. At 67 FPS, YOLOv2 gets 76.8 mAP on VOC 2007. At 40 FPS, YOLOv2 gets 78.6 mAP, outperforming state-of-the-art methods like Faster RCNN with ResNet and SSD while still running significantly faster. Finally we propose a method to jointly train on object detection and classification. Using this method we train YOLO9000 simultaneously on the COCO detection dataset and the ImageNet classification dataset. Our joint training allows YOLO9000 to predict detections for object classes that don't have labelled detection data. We validate our approach on the ImageNet detection task. YOLO9000 gets 19.7 mAP on the ImageNet detection validation set despite only having detection data for 44 of the 200 classes. On the 156 classes not in COCO, YOLO9000 gets 16.0 mAP. But YOLO can detect more than just 200 classes; it predicts detections for more than 9000 different object categories. And it still runs in real-time.
1 Introduction
General purpose object detection should be fast, accurate, and able to recognize a wide variety of objects. Since the introduction of neural networks, detection frameworks have become increasingly fast and accurate. However, most detection methods are still constrained to a small set of objects.
Current object detection datasets are limited compared to datasets for other tasks like classification and tagging. The most common detection datasets contain thousands to hundreds of thousands of images with dozens to hundreds of tags [3] [10] [2]. Classification datasets have millions of images with tens or hundreds of thousands of categories [20] [2].
We would like detection to scale to the level of object classification. However, labelling images for detection is far more expensive than labelling for classification or tagging (tags are often user-supplied for free). Thus we are unlikely to see detection datasets on the same scale as classification datasets in the near future.
We propose a new method to harness the large amount of classification data we already have and use it to expand the scope of current detection systems. Our method uses a hierarchical view of object classification that allows us to combine distinct datasets together. We also propose a joint training algorithm that allows us to train object detectors on both detection and classification data. Our method leverages labeled detection images to learn to precisely localize objects while it uses classification images to increase its vocabulary and robustness. Using this method we train YOLO9000, a real-time object detector that can detect over 9000 different object categories. First we improve upon the base YOLO detection system to produce YOLOv2, a state-of-the-art, real-time detector. Then we use our dataset combination method and joint training algorithm to train a model on more than 9000 classes from ImageNet as well as detection data from COCO. All of our code and pre-trained models are available online at http://pjreddie.com/yolo9000/.
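To make the joint training idea concrete, here is a hedged sketch of the control flow it implies (not the released code; `model`, `det_loss_fn`, and `cls_loss_fn` are placeholder stand-ins): batches drawn from the detection set backpropagate the full detection loss, while batches drawn from the classification set backpropagate only the classification terms.

```python
import random
import torch
import torch.nn as nn

model = nn.Conv2d(3, 8, 3, padding=1)                  # placeholder network, not Darknet
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
det_loss_fn = lambda out, target: ((out - target) ** 2).mean()  # stand-in for the full detection loss
cls_loss_fn = lambda out, target: ((out - target) ** 2).mean()  # stand-in for the classification-only loss

for step in range(4):
    images = torch.randn(2, 3, 64, 64)                 # pretend batch of images
    target = torch.zeros(2, 8, 64, 64)                 # pretend labels for either task
    optimizer.zero_grad()
    out = model(images)
    if random.random() < 0.5:                          # batch came from the detection dataset
        loss = det_loss_fn(out, target)                # backpropagate localization + classification terms
    else:                                              # batch came from the classification dataset
        loss = cls_loss_fn(out, target)                # backpropagate only the classification terms
    loss.backward()
    optimizer.step()
```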
2 Better
YOLO suffers from a variety of shortcomings relative to state-of-the-art detection systems. Error analysis of YOLO compared to Fast R-CNN shows that YOLO makes a significant number of localization errors. Furthermore, YOLO has relatively low recall compared to region proposal-based methods. Thus we focus mainly on improving recall and localization while maintaining classification accuracy. Computer vision generally trends towards larger, deeper networks [6] [18] [17]. Better performance often hinges on training larger networks or ensembling multiple models together. However, with YOLOv2 we want a more accurate detector that is still fast. Instead of scaling up our network, we simplify the network and then make the representation easier to learn. We pool a variety of ideas from past work with our own novel concepts to improve YOLO’s performance. A summary of results can be found in Table 2.
Batch Normalization. Batch normalization leads to significant improvements in convergence while eliminating the need for other forms of regularization [7]. By adding batch normalization on all of the convolutional layers in YOLO we get more than 2% improvement in mAP. Batch normalization also helps regularize the model. With batch normalization we can remove dropout from the model without overfitting.
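As a concrete illustration (a minimal PyTorch sketch, not the authors' Darknet code), a convolutional layer in this style is a bias-free convolution followed by batch normalization and a leaky ReLU, with no dropout layer at all; the 0.1 slope is just a commonly used choice.

```python
import torch
import torch.nn as nn

def conv_bn_leaky(in_ch, out_ch, kernel_size=3):
    """Convolution + batch normalization + leaky ReLU: batch norm follows every
    convolution, so no dropout layer is needed for regularization."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size,
                  padding=kernel_size // 2, bias=False),  # BN provides the bias term
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1, inplace=True),
    )

# Example: a small stack of such blocks.
features = nn.Sequential(
    conv_bn_leaky(3, 32),
    nn.MaxPool2d(2, 2),
    conv_bn_leaky(32, 64),
)
x = torch.randn(1, 3, 224, 224)
print(features(x).shape)  # torch.Size([1, 64, 112, 112])
```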
High Resolution Classifier. All state-of-the-art detection methods use classifiers pre-trained on ImageNet [16]. Starting with AlexNet, most classifiers operate on input images smaller than 256 × 256 [8]. The original YOLO trains the classifier network at 224 × 224 and increases the resolution to 448 for detection. This means the network has to simultaneously switch to learning object detection and adjust to the new input resolution.
For YOLOv2 we first fine tune the classification network at the full 448×448 resolution for 10 epochs on ImageNet. This gives the network time to adjust its filters to work better on higher resolution input. We then fine tune the resulting network on detection. This high resolution classification network gives us an increase of almost 4% mAP.
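One reason the same network can simply be fine-tuned at 448×448 is that a fully convolutional classifier ending in global average pooling is agnostic to the input resolution. The sketch below uses illustrative layer sizes, not the actual Darknet-19 architecture.

```python
import torch
import torch.nn as nn

# A fully convolutional classifier with global average pooling accepts both
# 224x224 and 448x448 inputs, so the network pretrained at the lower
# resolution can be fine-tuned at 448x448 for a few epochs without changes.
backbone = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1, bias=False), nn.BatchNorm2d(32), nn.LeakyReLU(0.1),
    nn.MaxPool2d(2, 2),
    nn.Conv2d(32, 64, 3, padding=1, bias=False), nn.BatchNorm2d(64), nn.LeakyReLU(0.1),
)
head = nn.Sequential(
    nn.Conv2d(64, 1000, 1),           # 1x1 convolution as the classification layer
    nn.AdaptiveAvgPool2d(1),          # global average pooling over the feature map
    nn.Flatten(),
)
classifier = nn.Sequential(backbone, head)

for size in (224, 448):               # pretrain resolution, then fine-tune resolution
    logits = classifier(torch.randn(1, 3, size, size))
    print(size, tuple(logits.shape))  # (1, 1000) in both cases
```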
Convolutional With Anchor Boxes. YOLO predicts the coordinates of bounding boxes directly using fully connected layers on top of the convolutional feature extractor. Instead of predicting coordinates directly, Faster R-CNN predicts bounding boxes using hand-picked priors [15]. Using only convolutional layers, the region proposal network (RPN) in Faster R-CNN predicts offsets and confidences for anchor boxes. Since the prediction layer is convolutional, the RPN predicts these offsets at every location in the feature map. Predicting offsets instead of coordinates simplifies the problem and makes it easier for the network to learn.
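For reference, the sketch below shows the standard Faster R-CNN-style decoding of predicted offsets relative to anchor priors (the parameterization referenced above, not necessarily YOLOv2's own scheme described later in the paper).

```python
import torch

def decode_offsets(anchors, offsets):
    """Decode anchor-relative offsets into boxes. Both inputs are (N, 4) tensors:
    anchors hold (cx, cy, w, h) priors, offsets hold predicted (tx, ty, tw, th)."""
    cx = anchors[:, 0] + offsets[:, 0] * anchors[:, 2]   # shift center by tx * anchor width
    cy = anchors[:, 1] + offsets[:, 1] * anchors[:, 3]   # shift center by ty * anchor height
    w = anchors[:, 2] * torch.exp(offsets[:, 2])         # scale width by exp(tw)
    h = anchors[:, 3] * torch.exp(offsets[:, 3])         # scale height by exp(th)
    return torch.stack([cx, cy, w, h], dim=1)

# One anchor centered at (100, 100) with size 50x50, nudged slightly right and wider.
anchors = torch.tensor([[100.0, 100.0, 50.0, 50.0]])
offsets = torch.tensor([[0.1, 0.0, 0.2, 0.0]])
print(decode_offsets(anchors, offsets))  # tensor([[105.0, 100.0, ~61.1, 50.0]])
```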