Big Data Benchmarks: Modern Benchmark Datasets

In Depth Analysis

Modern Benchmark Datasets

As the performance of deep learning models trained on massive datasets continues to advance, large-scale dataset competitions have become the proving ground for the latest and greatest computer vision models. We’ve come a long way as a community from the days when MNIST — a dataset with only 70,000 28x28 pixel images — was the de facto standard. New, larger datasets have arisen out of a desire to train more complex models to solve more challenging tasks: ImageNet, COCO, and Google’s Open Images are among the most popular.

But even on these huge datasets, the differences in performance among top models are becoming narrow. The 2019 Open Images Detection Challenge shows the top five teams fighting over a margin of less than 0.06 in mean average precision (mAP). The margin is even smaller for COCO.

There’s no doubt that our research community is delivering when it comes to developing innovative new techniques to improve model performance, but the model is only half of the picture. Recent findings have made it increasingly clear that the other half — the data — plays at least as critical a role, perhaps an even greater one.

Just this year…

  • …researchers at Google and DeepMind reassessed ImageNet, and their findings suggest that recent developments may not even be finding meaningful generalizations, instead just overfitting to the idiosyncrasies of the ImageNet labeling procedure.

  • …MIT has withdrawn the Tiny Images dataset after a paper brought to light that a portion of the 80 million images contained racist and misogynistic slurs.

  • …Jo and Gebru, from Stanford and Google respectively, argued that more attention needs to be paid to data collection and annotation procedures, drawing an analogy to more mature data archival practices.

  • …researchers from UC Berkeley and Microsoft performed a study showing that when using self-supervised pre-training, one could achieve gains on downstream tasks by focusing not on the network architecture or task/loss selection, but on a third axis, the data itself. To paraphrase: focusing on the data is not only a good idea, it’s a novel idea in 2020!

And here’s what two leaders of the field are saying about this:

  • “In building practical systems, often there’s more manual error analysis and more human insight that goes into these systems than sometimes deep learning researchers like to acknowledge.” — Andrew Ng

  • “Become one with the data” — Andrej Karpathy, in his popular essay on training neural networks

How many times have you found yourself spending hours, days, or weeks poring over samples in your data? Have you been surprised by how much manual inspection was necessary? Or can you think of a time when you trusted macro statistics more than you should have?

The computer vision community is starting to wake up to the idea that we need to be close to the data. If we want accurate models that behave as expected, it’s not enough to have a large dataset; it needs to have the right data and it needs to be accurately labeled.

Every year, researchers battle it out to climb to the top of a leaderboard, with razor-thin margins determining fates. But do we really know what’s going on with these datasets? Is a 0.01 margin in mAP even meaningful?

Error Analysis on the Open Images Dataset

With another Open Images Challenge just wrapping up, it seemed only appropriate to investigate this popular benchmark dataset and try to better understand what it means to have an object detection model with high mAP. So, I took it upon myself to do some basic error analysis of a pre-trained model, with the goal of observing patterns of errors in the context of the dataset, not the model. To my surprise, I found that a significant portion of these errors were in fact not errors; instead, the dataset annotations were incorrect!

What is error analysis?

Error analysis is the process of manually inspecting a model’s prediction errors, identified during evaluation, and making note of the causes of the errors. You don’t need to look at the whole dataset, but at least enough examples to know that you are correctly approximating a trend; let’s say 100 samples as a bare minimum. Open up a spreadsheet, or grab a piece of paper, and start jotting down notes.

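If pen and paper feels too loose, even a few lines of Python will do the same job. Here is a minimal sketch of tallying hand-assigned causes of error; the note strings are hypothetical placeholders, not categories from this study:

from collections import Counter

# One entry per inspected error, transcribed from your notes
# (these causes are illustrative placeholders)
notes = [
    "missing ground truth", "low resolution", "duplicate box",
    "missing ground truth", "poor lighting", "low resolution",
    # ...
]

counts = Counter(notes)
total = sum(counts.values())
for cause, n in counts.most_common():
    print(f"{cause:>22}: {n:3d} ({100 * n / total:.1f}%)")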

Why do this? Perhaps the majority of images your model is struggling on are low resolution or have poor lighting. If this is the case, adding more high resolution well-lit images to the training set is unlikely to manifest as significant improvement in model accuracy. Any number of other qualitative characteristics of your dataset may be at play; the only way to find out is to analyze your data!

Preparing for the analysis

I generated predictions on the Open Images V4 test set using this FasterRCNN+InceptionResNetV2 network. This network seemed like an ideal choice as it is trained and evaluated on Open Images V4, has a relatively high mAP of 0.58, and is readily available to the public via TensorFlow Hub. I then needed to evaluate each image individually.

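For reference, generating predictions with that module looks roughly like the following. This is a sketch based on the module’s public TensorFlow Hub listing; the image path is a placeholder:

import tensorflow as tf
import tensorflow_hub as hub

# Load the FasterRCNN+InceptionResNetV2 detector trained on Open Images V4
detector = hub.load(
    "https://tfhub.dev/google/faster_rcnn/openimages_v4/inception_resnet_v2/1"
).signatures["default"]

# The module expects a float32 image batch with values in [0, 1]
img = tf.io.decode_jpeg(tf.io.read_file("/path/to/image.jpg"), channels=3)
img = tf.image.convert_image_dtype(img, tf.float32)[tf.newaxis, ...]

result = detector(img)
# result contains "detection_boxes", "detection_class_entities",
# and "detection_scores", sorted by descending confidence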

Open Images uses a sophisticated evaluation protocol that considers hierarchy and groups, and even specifies known-present and known-absent classes. Despite the availability of the TensorFlow Object Detection API, which specifically supports evaluation on Open Images, it took some non-trivial code to get per-image evaluation results. Why isn’t this supported natively? At any rate, I was eventually able to determine exactly which detection was a true positive or false positive for each image.

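To make the per-image bookkeeping concrete, below is a simplified greedy matcher in the spirit of that evaluation. It ignores the label hierarchy, group-of boxes, and known-absent classes that the real protocol handles, so treat it as an illustration rather than a reimplementation:

import numpy as np

def iou(a, b):
    # Boxes as [ymin, xmin, ymax, xmax]
    inter_h = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_w = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_h * inter_w
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match_detections(dets, gts, iou_thresh=0.5):
    """Greedily flag each detection as a TP (True) or FP (False).
    Detections are matched in order of descending confidence, and each
    ground truth box can satisfy at most one detection."""
    is_tp = [False] * len(dets)
    used = set()
    for i in np.argsort([-d["score"] for d in dets]):
        candidates = [
            (iou(dets[i]["box"], g["box"]), j)
            for j, g in enumerate(gts)
            if j not in used and g["label"] == dets[i]["label"]
        ]
        if candidates:
            best_iou, best_j = max(candidates)
            if best_iou >= iou_thresh:
                used.add(best_j)
                is_tp[i] = True
    return is_tp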

I decided to filter the detections, only looking at those with confidence > 0.4. This threshold turned out to be roughly the point where the number of true positives surpasses the number of false positives.

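Finding that crossover point is straightforward once you have per-detection TP/FP flags (is_tp here would come from the matching step sketched above):

import numpy as np

def tp_fp_crossover(scores, is_tp, step=0.01):
    """Return the lowest confidence threshold at which the surviving
    true positives outnumber the surviving false positives."""
    scores = np.asarray(scores)
    is_tp = np.asarray(is_tp, dtype=bool)
    for t in np.arange(0.0, 1.0, step):
        keep = scores > t
        if (is_tp & keep).sum() > (~is_tp & keep).sum():
            return t
    return None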

Types of Errors

The structure of this analysis is inspired by a 2012 study which took two state-of-the-art (at the time) detectors and performed manual error analysis. Hoiem’s group created categories such as localization error, confusion with semantically similar objects, and false positives on background, but interestingly nothing related to ground truth error!

I broke down the causes of error into three groups: model errors, ground truth errors, and other errors, each consisting of a few specific causes of error. The next sections define these specific errors and provide examples, in order to provide context before we look at the aggregate results of the error analysis.

Model Errors

Model errors are the familiar set of errors that Hoiem’s paper established and that many researchers have subsequently used in their own publications. A small modification I made here was to omit “other” errors and to add “duplicate” errors, which are split out from the “localization” errors.

  • loc: localization error, i.e. IoU below threshold of 0.5

  • sim: confusion with semantically similar objects

  • bg: confusion with background

  • dup: duplicate box, meaning both a localization error and a true positive exist; this was made a separate category from loc because it was observed to be a very common type of error (a heuristic for this triage is sketched just after this list)

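As a rough illustration of how a false positive gets assigned to one of these buckets, the heuristic below follows the spirit of Hoiem et al.’s definitions. The overlap thresholds and the similar mapping are assumptions on my part; in this study, the final call on each error was made by eye:

def categorize_fp(det, gts, gt_is_matched, similar, iou_fn):
    """Triage one false positive into dup/loc/sim/bg.
    `gt_is_matched[j]` says whether ground truth j is already covered by
    a true positive; `similar` maps a label to the set of labels
    considered semantically similar to it."""
    best_same, best_same_j, best_sim = 0.0, None, 0.0
    for j, g in enumerate(gts):
        v = iou_fn(det["box"], g["box"])
        if g["label"] == det["label"] and v > best_same:
            best_same, best_same_j = v, j
        elif g["label"] in similar.get(det["label"], set()) and v > best_sim:
            best_sim = v
    if best_same >= 0.1:
        # Overlaps the right class: a duplicate if that object already
        # has a true positive, otherwise a plain localization failure
        return "dup" if gt_is_matched[best_same_j] else "loc"
    if best_sim >= 0.1:
        return "sim"  # confused with a semantically similar object
    return "bg"       # fired on background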

[Image: Dog. Bottom left: confusion with background, which was mistaken for a Boat. Bottom right: two duplicate Dog boxes around the same dog with bad localization.]

Ground Truth Errors

Ground truth errors are causes of false positives whose “fault” is in the annotation, not the model prediction. Were these to be corrected, they would be reassigned as true positives.

  • missing: the ground truth box should exist but does not

  • incorrect: the ground truth box exists but the label is incorrect or it is not as specific in the label hierarchy as it could be

  • group: the ground truth box should be marked as a group but is not

[Image: An example of missing ground truth. The back wheel of the car is clearly visible and the model detects it; however, the detection is incorrectly marked as a false positive because the ground truth box is missing.]
[Image: An example of an incorrect ground truth label; in this case, the ground truth label is not adequately specific. The ground truth label for these meerkats is Animal, while the prediction for the meerkat standing on the right is Carnivore. Carnivore is technically correct; the ground truth was not specific enough.]
[Image: An example of a group error. Three of the four predictions on the corn are correctly localized and labeled as Vegetable. (The fourth is a dup.) However, the ground truth is a single box around all three pieces. To fix this annotation, one would need to flag the ground truth bounding box as a group or replace it with three individual boxes, as was done for the zucchini to the right.]

Other Errors

Lastly we have unclear errors, meaning edge cases where it isn’t apparent whether the prediction is correct or not.

  • unclear: model error? ground truth error? ask ten different people and you’ll get ten different answers

Fortunately, this category only accounted for roughly 6.5% of errors, but it is still important to note that when trying to create a label hierarchy that can categorize everything in the world, there are always going to be edge cases like this circuit lady, which the model predicted as a Toy.

[Image: Would you consider this to be a toy? An example of an error with unclear “blame”.]

Aggregate Results

The following table shows the results of analyzing a subset of 178 of the total 125,436 test images.

[Image: Results from error analysis of 275 false positive bounding boxes detected by a FasterRCNN object detection model on the Open Images V4 test set.]

This is crazy! 36% of the false positives should actually be true positives! It’s not immediately clear what impact this would have on mAP, given that it is a rather complex metric; however, it’s safe to say that the officially reported mAP of 0.58 underestimates the true performance of the model.

The single most common cause of error was missing ground truth annotation, accounting for over ¼ of all errors. This is a challenging problem. It’s unrealistic to ask for a dataset that is not missing boxes. Many of these missing annotations are peripheral objects, not the central focus of the image. But this only emphasizes the need for easy, possibly automated, identification of annotations that will go through an additional round of review. There are other implications as well. Peripheral objects are generally smaller; how do these missing annotations affect accuracy metrics when split into small/medium/large bounding box sizes?

A few of these other causes of error — duplicate bounding boxes, incorrect ground truth labels and group errors, in particular — signal the importance of labeling ontology and annotation protocols. Complex label hierarchies can lead to incorrect ground truth labels, though this study indicates that this is not the case for Open Images. Handling groups is another complication that needs to be carefully defined and reviewed; while not as prevalent as other causes of error, 7.6% being due to boxes that should have been flagged as a group is certainly not insignificant. Finally, duplicate bounding boxes could be, at least in part, a byproduct of expanded hierarchy. In the Open Images object detection challenge a model is tasked with generating a bounding box for each label in the hierarchy. For example, for an image containing a jaguar, the Open Images challenge expects boxes to be generated for not just Jaguar, but also Carnivore, Mammal and Animal. Could this unintentionally lead to a model generating multiple Jaguar boxes for the same animal? Faster RCNN applies classification as a post-processing step after region proposal. So if the model is trained to generate four boxes for every jaguar it sees, it shouldn’t be surprising that these four boxes sometimes get the same classification label.

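To make the expanded-hierarchy point concrete, here is a toy fragment of a parent map (illustrative; not the actual Open Images ontology file) and the label expansion the challenge expects:

# Hypothetical slice of the class hierarchy
PARENT = {"Jaguar": "Carnivore", "Carnivore": "Mammal", "Mammal": "Animal"}

def expand_labels(label):
    """Return a label together with all of its ancestors."""
    labels = [label]
    while label in PARENT:
        label = PARENT[label]
        labels.append(label)
    return labels

print(expand_labels("Jaguar"))
# ['Jaguar', 'Carnivore', 'Mammal', 'Animal'] -> four boxes per jaguar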

What’s next?

What would happen to the Open Images Leaderboard if these ground truth errors were corrected? How could this affect our understanding of what strategies work best?

It should be noted that these errors aren’t just statistical noise. Similar to the findings of the DeepMind team that analyzed ImageNet, there are patterns to the annotation errors in Open Images. For example, missing face-annotations are a very common cause for false positives and bounding boxes around trees should often be labeled as groups but are not.

The purpose of this article is not to criticize the creators of Open Images — to the contrary, this dataset and its corresponding challenges have instigated great achievements — but rather to shed light on a blind spot that could be holding back progress. The impact of these popular open datasets is far-reaching, as they are often used as the starting point for fine-tuning/transfer learning. Furthermore, if popular datasets suffer from annotation correctness issues, then most likely we will encounter the same while inspecting our own datasets. But I’m certainly not speaking from experience…*ahem*

We’re at the forefront of a shift in focus, where the data itself is being rightfully acknowledged as every bit as important as the model trained on it, if not more! Perhaps we will see smaller, more carefully curated datasets rising in popularity, or more demand for methods like active or semi-supervised learning, that allow us to automate and scale annotation work. Either way, a key challenge will be creating the infrastructure to manage dynamic datasets that grow in size and evolve based on feedback from humans and machine learning models. There’s a lot of potential in this nascent topic!

Visualizing object detection datasets

To perform this error analysis study, I used Voxel51’s data visualization tool, FiftyOne, a Python package that makes it really easy to load your datasets and interactively search and explore them both through code and a visualization app. Here’s the FiftyOne code I ran to perform the error analysis in this study:

import fiftyone as fo
from fiftyone import ViewField as F

# Load the dataset
dataset = fo.Dataset.from_dir(
    "/path/to/open-images-v4-test-500",
    fo.types.FiftyOneDataset,
)

# Launch the app and start browsing
session = fo.launch_app(dataset=dataset)

# Filter the visible detections by confidence, restrict the samples to
# only those with at least one false positive, and sort by image ID
session.view = (
    dataset
    .filter_detections("true_positives", F("confidence") > 0.4)
    .filter_detections("false_positives", F("confidence") > 0.4)
    .match(F("false_positives.detections").length() > 0)
    .sort_by("open_images_id")
)

Want to explore this data for yourself? Download it here!

Want to evaluate your own model on Open Images? Try this tutorial!

Want to learn more about best practices for inspecting visual datasets? Check out this post!

Original article: https://towardsdatascience.com/i-performed-error-analysis-on-open-images-and-now-i-have-trust-issues-89080e03ba09
