ECCV-2014
Lin T Y, Maire M, Belongie S, et al. Microsoft coco: Common objects in context[C]//Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer International Publishing, 2014: 740-755.
https://cocodataset.org/#home
1、Background and Motivation
Good public datasets have driven progress in computer vision algorithms.
In computer vision, object detection and segmentation are important tasks, but existing datasets such as PASCAL VOC and ImageNet are limited in the richness of their object categories and contextual information.
what datasets will best continue our advance towards our ultimate goal of scene understanding?
The authors release a dataset better suited to scene understanding: Microsoft Common Objects in Context, COCO for short.
- Build a large-scale dataset with rich contextual information for object detection, segmentation, and labeling.
- Focus on precisely localizing object instances, deliberately including "thing" categories rather than "stuff" categories, while acknowledging that "stuff" provides important contextual information and may be annotated in the future.
- Improve the dataset's generalization power by collecting non-iconic images, which typically contain more categories and more complex contextual relationships.
The comparisons are against the PASCAL VOC, ImageNet, and SUN datasets.
COCO collects more non-iconic (non-canonical perspective) images, which can be understood as images where the target is not always at the center of the frame.
Each image contains more categories and more instances per category.
Context is emphasized; in the paper this means "detailed spatial understanding of object layout will be a core component of scene analysis", i.e. many kinds of objects, distributed widely across the frame.
Most importantly, it provides segmentation labels.
2,500,000 labeled instances in 328,000 images
2、Related Work
- Image Classification
- Object detection
- Semantic scene labeling
This enables the labeling of objects for which individual instances are hard to define, such as grass, streets, or walls.
- Other vision datasets
3、Advantages / Contributions
Released COCO, a large-scale scene-understanding dataset
- Large-scale dataset: the paper introduces a new large-scale dataset targeting three core research problems in scene understanding: detecting objects in non-iconic (non-canonical) views, contextual reasoning between objects, and precise 2D localization of objects.
- Non-iconic view detection: unlike previous object recognition datasets that focus on image classification, bounding-box localization, or semantic pixel-level segmentation, this dataset focuses on segmenting individual object instances and includes images of objects in non-iconic views, which helps models generalize better.
- Contextual reasoning: by including images with rich contextual relationships, the dataset promotes research on contextual reasoning between objects.
- Precise 2D localization: the dataset focuses on precisely localizing object instances, which helps improve the accuracy of object detection and segmentation.
- Rich annotations: category labels, instance localization, and object segmentation provide comprehensive information for training and evaluating models.
- Multiple collection strategies: to collect non-iconic images, two strategies were used: collecting images from Flickr, and searching for pairwise combinations of object categories such as "dog + car", which yields more non-iconic images.
- High-quality annotation pipeline: an efficient, high-quality annotation pipeline was designed, using crowdsourced tasks on Amazon's Mechanical Turk (AMT) to ensure annotation accuracy and efficiency.
4、Method
4.1、Image Collection
(1)Common Object Categories
“Thing” categories include objects for which individual instances may be easily labeled (person, chair, car) where “stuff” categories include materials and objects with no clear boundaries (sky, street, grass).
COCO's object categories focus only on "things", not "stuff".
The categories are common, everyday entry-level categories rather than overly fine-grained ones, e.g. dog rather than West Highland White Terrier; the authors even asked children aged 4-8 to name every object they see in indoor and outdoor environments.
They also consulted R. Sitton, Spelling Sourcebook. Egger Publishing, 1996.
The categories fully cover PASCAL VOC.
frequent object categories taken from WordNet, LabelMe, SUN and other sources as well as categories derived from a free recall experiment with young children.
A candidate list of 272 categories was collected, as shown in Table 2; 91 categories were kept in the end.
(2)Non-iconic Image Collection
The figure illustrates the difference between iconic and non-iconic images very well.
datasets containing more non-iconic images are better at generalizing
Images were collected from Flickr.
we searched for pairwise combinations of object categories
“dog + car”
scene/object category pairs
40 scene categories
The result: a collection of 328,000 images with rich contextual relationships between objects
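The pairwise-query strategy above can be sketched in a few lines of Python; the category list here is a small hypothetical subset of COCO's 91 categories, used only for illustration:

```python
from itertools import combinations

# Hypothetical subset of COCO's object categories.
categories = ["dog", "car", "person", "chair"]

# Build Flickr-style search queries from pairwise combinations,
# e.g. "dog + car", which tends to surface non-iconic images.
queries = [f"{a} + {b}" for a, b in combinations(categories, 2)]
print(queries)  # 6 queries, including 'dog + car'
```

Searching for two categories together biases results toward cluttered, in-context scenes rather than single-object iconic shots.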
4.2、Image Annotation
Annotation was crowdsourced to workers on Amazon's Mechanical Turk (AMT).
The annotation pipeline is well worth learning from.
(1)Category Labeling
hierarchical approach
11 super-categories
only a single instance of each category needs to be annotated in this stage
8 workers were asked to label each image
took ∼20k worker hours, which is staggering: equivalent to one person working around the clock for roughly 27-28 months
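A minimal sketch of the hierarchical labeling idea (the super-category groupings below are assumed for illustration, not the paper's exact taxonomy): a worker first decides whether any instance of a super-category is present, and only then checks its leaf categories.

```python
# Hypothetical super-category -> category mapping, for illustration only.
TAXONOMY = {
    "animal": ["dog", "cat", "horse"],
    "vehicle": ["car", "bus", "bicycle"],
}

def candidate_categories(present_supercategories):
    """Only leaf categories under super-categories flagged as
    present need to be checked individually, cutting labeling cost."""
    cats = []
    for sup in present_supercategories:
        cats.extend(TAXONOMY.get(sup, []))
    return cats

print(candidate_categories(["vehicle"]))  # ['car', 'bus', 'bicycle']
```

Since only one instance per category needs marking at this stage, the hierarchy mostly saves work on images where whole super-categories are absent.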
(2)Instance Spotting
place a cross
at most 10 instances of a given category per image
∼10k worker hours, equivalent to one person working around the clock for roughly 13-14 months
(3)Instance Segmentation
required all workers to complete a training task for each object category: essentially pre-job training
we define a single task for segmenting a single object instance labeled from the previous annotation stage
Quality is assessed along the way; for workers whose annotations are poor, all of their labels are rejected.
Each segmentation is initially shown to 3 annotators. If any of the annotators indicates the segmentation is bad, it is shown to 2 additional workers.
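The escalation rule can be sketched as follows. The function name is mine, and how the final five votes are combined is an assumption (simple majority); the paper only specifies when the two extra reviewers are added.

```python
def review(first_three, escalate):
    """Sketch of the verification flow: a segmentation is first shown
    to 3 annotators; if any of them flags it as bad, it is shown to 2
    additional workers supplied by `escalate`. Combining the final
    votes by simple majority is an assumption, not from the paper."""
    votes = list(first_three)       # True = looks good, False = bad
    if not all(votes):              # any "bad" vote triggers escalation
        votes += list(escalate())   # 2 more reviewers weigh in
    return sum(votes) > len(votes) / 2

# All three initial reviewers approve -> accepted without escalation.
print(review([True, True, True], escalate=lambda: [False, False]))  # True
```

The design keeps the common case cheap (3 reviews) and only spends extra reviews on contested segmentations.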
After 10-15 instances of a category were segmented in an image, the remaining instances were marked as “crowds” using a single (possibly multipart) segment.
crowd labeling is only necessary for images containing more than ten object instances of a given category
At test time, areas marked as crowds will be ignored and not affect a detector's score.
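In the released annotations this shows up as an `iscrowd` flag, and evaluation skips crowd regions when counting errors. A toy sketch of that idea (my own data structures, not the actual COCO evaluation code):

```python
def iou(b1, b2):
    """IoU for axis-aligned boxes given as [x, y, w, h]."""
    x1, y1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    x2 = min(b1[0] + b1[2], b2[0] + b2[2])
    y2 = min(b1[1] + b1[3], b2[1] + b2[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = b1[2] * b1[3] + b2[2] * b2[3] - inter
    return inter / union if union else 0.0

def score_detections(detections, annotations, iou_fn, thr=0.5):
    """Toy scoring loop: a detection whose only matches are crowd
    regions is neither a true positive nor a false positive."""
    tp = fp = 0
    for det in detections:
        matches = [a for a in annotations if iou_fn(det, a["bbox"]) > thr]
        if any(not a["iscrowd"] for a in matches):
            tp += 1          # matched a real (non-crowd) instance
        elif matches:
            continue         # matched only crowd regions: ignored
        else:
            fp += 1          # matched nothing: false positive
    return tp, fp
```

This is why under-labeling inside crowd regions does not unfairly penalize a detector that fires on unannotated instances there.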
(4)Annotation Performance Analysis
recall is of primary importance as false positives could be removed in later stages.
Assuming each annotator misses a given object with probability 50%, the probability that all 8 annotators miss it is at most 0.5^8 ≈ 0.004, a very unlikely event.
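The miss-probability arithmetic above can be checked directly, and it also explains why quality saturates around 8 workers:

```python
# Probability that an object is missed by all annotators, assuming
# each independent annotator misses it with probability 0.5.
for n in (1, 2, 4, 8):
    print(n, 0.5 ** n)
# With 8 annotators the joint miss probability is 0.5**8 = 0.00390625,
# i.e. about 0.004 -- adding more workers buys very little.
```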
all jobs from workers below the black line were rejected.
Panel (a) shows that quality is close to its ceiling with 8 workers; worker recall can be higher than the experts', since object categories themselves are often ambiguous.
Where does the GT in the figure above come from?
Ground truth was computed using majority vote of the experts
5、Dataset Statistics
On average our dataset contains 3.5 categories and 7.7 instances per image.
This figure was formative for me: the first dataset I ever used for deep-learning training was summarized in exactly the format of Figure 5.
(b) shows that images containing 1-6 categories make up most of COCO.
(c) shows that images containing 1-6 instances make up most of COCO.
(d) shows that COCO strikes a good balance between the number of categories and the number of instances per category.
(e) shows that a large majority of COCO objects occupy less than 6% of the image area, i.e. they are small objects.
Category frequencies exhibit the long tail phenomenon; COCO fares relatively better in this respect.
6、Dataset Splits
The first batch of collected data was released in two parts, in 2014 and 2015.
The 2014 release contains 82,783 training, 40,504 validation, and 40,775 testing images (approximately 1/2 train, 1/4 val, and 1/4 test).
The cumulative 2015 release will contain a total of 165,482 train, 81,208 val, and 81,434 test images
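The stated 2014 split proportions can be verified directly:

```python
train, val, test = 82_783, 40_504, 40_775  # 2014 release counts
total = train + val + test
print(total)                    # 164062
print(round(train / total, 3))  # 0.505  (~1/2 train)
print(round(val / total, 3))    # 0.247  (~1/4 val)
print(round(test / total, 3))   # 0.249  (~1/4 test)
```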
Only 80 categories were released in 2014, for the reasons given below.
7、Algorithmic Analysis
DPMv5-C: the same implementation trained on COCO (5000 positive and 10000 negative images)
As the results show, DPMv5-C does not beat DPMv5-P on the VOC dataset; the authors offer their own rationalization:
difficult (non-iconic) images during training may not always help. Such examples may act as noise and pollute the learned model if the model is not rich enough to capture such appearance variability
Next, the segmentation results.
To decouple segmentation evaluation from detection correctness, we benchmark segmentation quality using only correct detections
It looks like DPM is paired with templates to produce the segmentation results; the templates are shown in Figure 7.
8、Conclusion(own) / Future work
- non-iconic images, varied viewpoints
- labels only "things" for now; labeling "stuff" is acknowledged as valuable future work
- context
- searched for pairwise combinations of object categories
- areas marked as crowds will be ignored and not affect a detector's score
- because of the crowd category, many instances go unlabeled (they are ignored at test time); since the authors prioritized recall during annotation, detectors trained on this dataset may show more false positives