ECCV-2014
Lin T Y, Maire M, Belongie S, et al. Microsoft coco: Common objects in context[C]//Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer International Publishing, 2014: 740-755.
https://cocodataset.org/#home
1、Background and Motivation
Good public datasets have driven progress in computer vision algorithms.
In computer vision, object detection and segmentation are important tasks, but existing datasets such as PASCAL VOC and ImageNet are limited in the richness of their object categories and contextual information.
what datasets will best continue our advance towards our ultimate goal of scene understanding?
The authors release a dataset better suited to scene understanding: Microsoft Common Objects in Context, COCO for short.
- Build a large-scale dataset with rich contextual information for object detection, segmentation, and labeling.
- Focus on precisely localizing object instances, deliberately including "thing" categories rather than "stuff" categories, while acknowledging that "stuff" provides important contextual information and may be annotated in the future.
- Improve the dataset's generalization power by collecting non-iconic images, which typically contain more categories and more complex contextual relationships.
The comparisons are against the PASCAL VOC, ImageNet, and SUN datasets.
COCO collects more non-iconic (non-canonical perspective) images, which can be understood as images where the target is not always at the center of the frame.
Each image contains more categories and more instances per category.
Context is emphasized; in the paper this means "detailed spatial understanding of object layout will be a core component of scene analysis", i.e. many kinds of objects, distributed widely across the frame.
Most importantly, it provides segmentation labels.
2,500,000 labeled instances in 328,000 images
2、Related Work
- Image Classification
- Object detection
- Semantic scene labeling
This enables the labeling of objects for which individual instances are hard to define, such as grass, streets, or walls.
- Other vision datasets
3、Advantages / Contributions
Released COCO, a large-scale scene-understanding dataset
- Large-scale dataset: the paper introduces a new large-scale dataset targeting three core research problems in scene understanding: detecting objects in non-iconic (non-canonical) views, contextual reasoning between objects, and precise 2D localization of objects.
- Non-iconic view detection: unlike previous object recognition datasets that focus on image classification, bounding-box localization, or semantic pixel-level segmentation, this dataset focuses on segmenting individual object instances and includes images of objects in non-iconic views, which helps models generalize better.
- Contextual reasoning: by including images with rich contextual relationships, the dataset promotes research on contextual reasoning between objects.
- Precise 2D localization: the dataset focuses on precisely localizing object instances, which helps improve the accuracy of object detection and segmentation.
- Rich annotations: category labels, instance localization, and object segmentation provide comprehensive information for training and evaluating models.
- Multiple collection strategies: to collect non-iconic images, two strategies were used: collecting images from Flickr, and searching for pairwise combinations of object categories such as "dog + car", which yields more non-iconic images.
- High-quality annotation pipeline: an efficient, high-quality annotation pipeline was designed, using crowdsourced tasks on Amazon's Mechanical Turk (AMT) to ensure annotation accuracy and efficiency.
4、Method
4.1、Image Collection
(1)Common Object Categories
“Thing” categories include objects for which individual instances may be easily labeled (person, chair, car) where “stuff” categories include materials and objects with no clear boundaries (sky, street, grass).
COCO's object categories focus only on "things", not "stuff".
The categories are common, everyday entry-level categories rather than overly fine-grained ones, e.g. dog rather than West Highland White Terrier; the authors even asked children aged 4-8 to name every object they see in indoor and outdoor environments.
They also consulted R. Sitton, Spelling Sourcebook. Egger Publishing, 1996.
The categories fully cover PASCAL VOC.
frequent object categories taken from WordNet, LabelMe, SUN and other sources as well as categories derived from a free recall experiment with young children.
A candidate list of 272 categories was collected, as shown in Table 2; 91 categories were kept in the end.
(2)Non-iconic Image Collection
The figure illustrates the difference between iconic and non-iconic images very well.
datasets containing more non-iconic images are better at generalizing
Images were collected from Flickr.
we searched for pairwise combinations of object categories
“dog + car”
scene/object category pairs
40 scene categories
The result: a collection of 328,000 images with rich contextual relationships between objects
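The pairwise-query strategy above can be sketched in a few lines of Python; the category list here is a small hypothetical subset of COCO's 91 categories, used only for illustration:

```python
from itertools import combinations

# Hypothetical subset of COCO's object categories.
categories = ["dog", "car", "person", "chair"]

# Build Flickr-style search queries from pairwise combinations,
# e.g. "dog + car", which tends to surface non-iconic images.
queries = [f"{a} + {b}" for a, b in combinations(categories, 2)]
print(queries)  # 6 queries, including 'dog + car'
```

Searching for two categories together biases results toward cluttered, in-context scenes rather than single-object iconic shots.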
4.2、Image Annotation
Annotation was crowdsourced to workers on Amazon's Mechanical Turk (AMT).
The annotation pipeline is well worth learning from.
(1)Category Labeling
hierarchical approach
11 super-categories
only a single instance of each category needs to be annotated in this stage
8 workers were asked to label each image
took ∼20k worker hours, which is staggering: equivalent to one person working around the clock for roughly 27-28 months
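A minimal sketch of the hierarchical labeling idea (the super-category groupings below are assumed for illustration, not the paper's exact taxonomy): a worker first decides whether any instance of a super-category is present, and only then checks its leaf categories.

```python
# Hypothetical super-category -> category mapping, for illustration only.
TAXONOMY = {
    "animal": ["dog", "cat", "horse"],
    "vehicle": ["car", "bus", "bicycle"],
}

def candidate_categories(present_supercategories):
    """Only leaf categories under super-categories flagged as
    present need to be checked individually, cutting labeling cost."""
    cats = []
    for sup in present_supercategories:
        cats.extend(TAXONOMY.get(sup, []))
    return cats

print(candidate_categories(["vehicle"]))  # ['car', 'bus', 'bicycle']
```

Since only one instance per category needs marking at this stage, the hierarchy mostly saves work on images where whole super-categories are absent.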
(2)Instance Spotting
place a cross
at most 10 instances of a given category per image
∼10k worker hours, equivalent to one person working around the clock for roughly 13-14 months
(3)Instance Segmentation
required all workers to complete a training task for each object category: essentially pre-job training
we define a single task for segmenting a single object instance labeled from the previous annotation stage
Quality is assessed along the way; for workers whose annotations are poor, all of their labels are rejected.
Each segmentation is initially shown to 3 annotators. If any of the annotators indicates the segmentation is bad, it is shown to 2 additional workers.
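The escalation rule can be sketched as follows. The function name is mine, and how the final five votes are combined is an assumption (simple majority); the paper only specifies when the two extra reviewers are added.

```python
def review(first_three, escalate):
    """Sketch of the verification flow: a segmentation is first shown
    to 3 annotators; if any of them flags it as bad, it is shown to 2
    additional workers supplied by `escalate`. Combining the final
    votes by simple majority is an assumption, not from the paper."""
    votes = list(first_three)       # True = looks good, False = bad
    if not all(votes):              # any "bad" vote triggers escalation
        votes += list(escalate())   # 2 more reviewers weigh in
    return sum(votes) > len(votes) / 2

# All three initial reviewers approve -> accepted without escalation.
print(review([True, True, True], escalate=lambda: [False, False]))  # True
```

The design keeps the common case cheap (3 reviews) and only spends extra reviews on contested segmentations.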
After 10-15 instances of a category were segmented in an image, the remaining instances were marked as “crowds” using a single (possibly multipart) segment.
crowd labeling is only necessary for images containing more than ten object instances of a given category
At test time, areas marked as crowds will be ignored and not affect a detector's score.
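In the released annotations this shows up as an `iscrowd` flag, and evaluation skips crowd regions when counting errors. A toy sketch of that idea (my own data structures, not the actual COCO evaluation code):

```python
def iou(b1, b2):
    """IoU for axis-aligned boxes given as [x, y, w, h]."""
    x1, y1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    x2 = min(b1[0] + b1[2], b2[0] + b2[2])
    y2 = min(b1[1] + b1[3], b2[1] + b2[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = b1[2] * b1[3] + b2[2] * b2[3] - inter
    return inter / union if union else 0.0

def score_detections(detections, annotations, iou_fn, thr=0.5):
    """Toy scoring loop: a detection whose only matches are crowd
    regions is neither a true positive nor a false positive."""
    tp = fp = 0
    for det in detections:
        matches = [a for a in annotations if iou_fn(det, a["bbox"]) > thr]
        if any(not a["iscrowd"] for a in matches):
            tp += 1          # matched a real (non-crowd) instance
        elif matches:
            continue         # matched only crowd regions: ignored
        else:
            fp += 1          # matched nothing: false positive
    return tp, fp
```

This is why under-labeling inside crowd regions does not unfairly penalize a detector that fires on unannotated instances there.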
(4)Annotation Performance Analysis
recall is of primary importance as false positives could be removed in later stages.
Assuming each annotator misses a given object with probability 50%, the probability that all 8 annotators miss it is at most 0.5^8 ≈ 0.004, a very unlikely event.
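The miss-probability arithmetic above can be checked directly, and it also explains why quality saturates around 8 workers:

```python
# Probability that an object is missed by all annotators, assuming
# each independent annotator misses it with probability 0.5.
for n in (1, 2, 4, 8):
    print(n, 0.5 ** n)
# With 8 annotators the joint miss probability is 0.5**8 = 0.00390625,
# i.e. about 0.004 -- adding more workers buys very little.
```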
all jobs from workers below the black line were rejected.
Panel (a) shows that quality is close to its ceiling with 8 workers; worker recall can be higher than the experts', since object categories themselves are often ambiguous.
Where does the GT in the figure above come from?
Ground truth was computed using majority vote of the experts
5、Dataset Statistics
On average our dataset contains 3.5 categories and 7.7 instances per image.
This figure was formative for me: the first dataset I ever used for deep-learning training was summarized in exactly the format of Figure 5.
(b) shows that images containing 1-6 categories make up most of COCO.
(c) shows that images containing 1-6 instances make up most of COCO.
(d) shows that COCO strikes a good balance between the number of categories and the number of instances per category.
(e) shows that a large majority of COCO objects occupy less than 6% of the image area, i.e. they are small objects.
Category frequencies exhibit the long tail phenomenon; COCO fares relatively better in this respect.
6、Dataset Splits
The first batch of collected data was released in two parts, in 2014 and 2015.
The 2014 release contains 82,783 training, 40,504 validation, and 40,775 testing images (approximately 1/2 train, 1/4 val, and 1/4 test).
The cumulative 2015 release will contain a total of 165,482 train, 81,208 val, and 81,434 test images
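The stated 2014 split proportions can be verified directly:

```python
train, val, test = 82_783, 40_504, 40_775  # 2014 release counts
total = train + val + test
print(total)                    # 164062
print(round(train / total, 3))  # 0.505  (~1/2 train)
print(round(val / total, 3))    # 0.247  (~1/4 val)
print(round(test / total, 3))   # 0.249  (~1/4 test)
```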
Only 80 categories were released in 2014, for the reasons given below.
7、Algorithmic Analysis
DPMv5-C: the same implementation trained on COCO (5000 positive and 10000 negative images)
As the results show, DPMv5-C does not beat DPMv5-P on the VOC dataset; the authors offer their own rationalization:
difficult (non-iconic) images during training may not always help. Such examples may act as noise and pollute the learned model if the model is not rich enough to capture such appearance variability
Next, the segmentation results.
To decouple segmentation evaluation from detection correctness, we benchmark segmentation quality using only correct detections
It looks like DPM is paired with templates to produce the segmentation results; the templates are shown in Figure 7.
8、Conclusion(own) / Future work
- non-iconic images, varied viewpoints
- labels only "things" for now; labeling "stuff" is acknowledged as valuable future work
- context
- searched for pairwise combinations of object categories
- areas marked as crowds will be ignored and not affect a detector's score
- because of the crowd category, many instances go unlabeled (they are ignored at test time); since the authors prioritized recall during annotation, detectors trained on this dataset may show more false positives