【Cut, Paste and Learn】《Cut, Paste and Learn: Surprisingly Easy Synthesis for Instance Detection》

bryant_meng

Article link: https://blog.csdn.net/bryant_meng/article/details/127966177

ICCV-2017

1 Background and Motivation

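Instance detection is to object detection what instance segmentation is to semantic segmentation.

The detector must not only find objects of different categories, but also tell apart individual instances within the same category.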

Instance detection occurs commonly in robotics, AR/VR etc., and can also be viewed as fine-grained recognition.

Clearly, this task places higher demands on annotation.

collecting such annotations is a major impediment for rapid deployment of detection systems in robotics or other personalized applications.

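To address the time and labor cost of collecting large-scale labeled data, the authors propose Cut, Paste and Learn, a data synthesis method that ensures only patch-level realism in the generated data (it does not care about global consistency, such as a cup having to sit on a table). Even though the generated images still look visually flawed, detectors trained on them perform well.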

The underlying theme is to ‘paste’ real object masks in real images, thus reducing the dependence on graphics renderings.

2 Related Work

instance detection

local features（SIFT, SURF, MSER）
shape-based methods
Modern detection methods（one stage, two stage）

Synthesizing data

There is a wide spectrum of work where rendered datasets are used for computer vision tasks (ranging from real single objects on real random backgrounds to fully rendered scenes).

3 Advantages / Contributions

Proposes the Cut, Paste and Learn data generation method, which gives clear gains on instance detection datasets and also generalizes well across datasets.

4 Method

Traditional dataset collection: a data curation step + an annotation step

Training data for a good instance detection model should have good coverage of viewpoints and scales of the object.

The overall data generation pipeline is as follows:

Collect object instance images
Collect scene images
Predict foreground mask for the object
Paste object instances in scenes
Invariance to local artifacts: the training algorithm should not focus on subpixel discrepancies at the boundaries.

Note the negatives: the pipeline does not only paste the target objects, it also introduces negative samples as distractors.

4.1 Collecting images

（1）Images of objects from different viewpoints

Sampled from the BigBIRD Dataset; see Section 5.1 of this post for details.

（2）Background images of indoor scenes
Sampled from the UW Scenes dataset.

There are 1548 images in the backgrounds dataset.

（3）Foreground/Background segmentation

An FCN segmentation network is used, pretrained on PASCAL VOC, with a VGG backbone.

The object masks from the depth sensor are used as ground truth for training this model.（BigBIRD Dataset）

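There is also a post-processing step: the method from "The Fast Bilateral Solver" (ECCV-2016) is used to make the segmentation edges smoother.

The authors' results show that the method also produces good segmentation masks for transparent objects.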

4.2 Adding Objects to Images

we present steps to generate data that forces the training algorithm to ignore these artifacts and focus only on the object appearance

（1）Detection Model

Faster R-CNN, pretrained on COCO, with a VGG backbone

（2）Benchmarking Dataset

use the GMU Kitchen dataset for evaluation

4.2.1 Blending

Poisson blending smooths edges and adds lighting variations

Although these blending methods do not yield visually ‘perfect’ results, they improve performance of the trained detectors.
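
To make the idea of softening paste boundaries concrete, here is a minimal sketch. It is not Poisson blending and not the authors' code: it assumes grayscale images stored as nested lists of floats, and it feathers the object mask with a simple box blur before alpha-compositing. The function names `box_blur` and `paste` are hypothetical.

```python
def box_blur(mask, radius=1):
    """Blur a 2D float mask with a (2*radius+1)^2 box filter.

    Pixels outside the mask are skipped, so border values are averaged
    over the in-bounds neighbours only.
    """
    h, w = len(mask), len(mask[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, count = 0.0, 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        total += mask[yy][xx]
                        count += 1
            out[y][x] = total / count
    return out


def paste(background, patch, mask, top, left, feather=0):
    """Alpha-composite `patch` onto a copy of `background` at (top, left).

    `mask` holds per-pixel opacity in [0, 1]; feather > 0 blurs it first
    so the object boundary fades into the scene instead of ending abruptly.
    """
    alpha = box_blur(mask, feather) if feather > 0 else mask
    out = [row[:] for row in background]  # leave the input untouched
    for y, row in enumerate(patch):
        for x, value in enumerate(row):
            by, bx = top + y, left + x
            if 0 <= by < len(out) and 0 <= bx < len(out[0]):
                a = alpha[y][x]
                out[by][bx] = a * value + (1.0 - a) * out[by][bx]
    return out
```

Generating the same scene with several blending settings (for example `feather=0` and `feather=1`) gives the detector variation at the boundary while the object appearance stays fixed, which is the kind of invariance to local artifacts the method aims for.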

4.2.2 Data Augmentation

（1）2D Rotation

（2）3D Rotation
Examples of missed detections when the synthetic data is not used.

（3）Occlusion and Truncation

Truncation: ensuring at least 0.25 of the object box is in the image.

Occlusion: paste the objects with partial overlap with each other (max IOU of 0.75).
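
The two thresholds above can be read as a rejection-sampling rule when choosing paste locations. The sketch below is one possible reading, not the authors' implementation: boxes are `(x1, y1, x2, y2)` tuples, and the helper names, the `max_tries` parameter, and the uniform sampling range are assumptions made for illustration.

```python
import random


def box_area(box):
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def intersection(a, b):
    # May return an "inverted" box when a and b are disjoint; box_area maps it to 0.
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


def iou(a, b):
    inter = box_area(intersection(a, b))
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def visible_fraction(box, img_w, img_h):
    """Fraction of the box area that lies inside the image."""
    area = box_area(box)
    clipped = intersection(box, (0, 0, img_w, img_h))
    return box_area(clipped) / area if area > 0 else 0.0


def sample_placements(sizes, img_w, img_h, min_visible=0.25, max_iou=0.75,
                      max_tries=100, rng=None):
    """Pick a box for each (w, h) in `sizes`, rejecting invalid candidates.

    A candidate is rejected if less than `min_visible` of it is inside the
    image (truncation rule) or if it overlaps an already placed box with
    IoU above `max_iou` (occlusion rule). Objects that cannot be placed
    within `max_tries` attempts are skipped.
    """
    rng = rng or random.Random()
    placed = []
    for w, h in sizes:
        for _ in range(max_tries):
            x1 = rng.uniform(-w, img_w)  # allow boxes to hang off the edges
            y1 = rng.uniform(-h, img_h)
            box = (x1, y1, x1 + w, y1 + h)
            if visible_fraction(box, img_w, img_h) < min_visible:
                continue
            if any(iou(box, other) > max_iou for other in placed):
                continue
            placed.append(box)
            break
    return placed
```

For example, `sample_placements([(20, 20)] * 5, 100, 100, rng=random.Random(0))` returns up to five boxes that all satisfy both rules; the ground-truth box written to the annotation would then be the part of each box that lies inside the image.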

（4）Distractor Objects

Additional objects from the BigBIRD dataset are pasted as distractors.
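
A small sketch of how distractors could enter scene composition, under the assumption that distractors are pasted like any other object but receive no bounding-box label, so the detector sees them as background. The function `compose_scene` and the object names in the usage note are hypothetical.

```python
import random


def compose_scene(targets, distractors, n_targets=3, n_distractors=2, rng=None):
    """Choose which objects to paste into one synthetic scene.

    Returns (paste_order, labels): every name in `paste_order` is pasted
    into the image, but only the target instances appear in `labels`.
    """
    rng = rng or random.Random()
    chosen = [(name, True) for name in rng.sample(targets, n_targets)]
    chosen += [(name, False) for name in rng.sample(distractors, n_distractors)]
    rng.shuffle(chosen)  # mix the paste order so either kind can occlude the other
    paste_order = [name for name, _ in chosen]
    labels = [name for name, is_target in chosen if is_target]
    return paste_order, labels
```

Calling `compose_scene(["target_a", "target_b", "target_c"], ["distractor_x", "distractor_y"])` yields five objects to paste and three labels.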

5 Experiments

We generate a synthetic dataset with approximately 6000 images using all modes of data augmentation.

5.1 Datasets

1）UW Scenes dataset

Used as the source of background images.

2）BigBIRD Dataset

each object has 600 images, captured by five cameras with different viewpoints
The authors select 33 object instances from it as the source of objects.

3）GMU Kitchen Dataset

9 kitchen scenes with 6,728 images; used for training and testing.
11 of its instances overlap with the 33 instances the authors selected from BigBIRD.

4）Active Vision Dataset

9 scenes and 17,556 images
33 objects in total and 6 objects in overlap with the GMU Kitchen Scenes.
6 of its instances overlap with the 33 instances the authors selected from BigBIRD (6 of the 11 above).

5.2 Training and Evaluation on the GMU Dataset

5.3 Evaluation on the Active Vision Dataset

To test generalization across datasets: train on GMU Kitchen, test on the Active Vision Dataset.

Varying Real Data

With only 10% of the real data plus the synthetic data, the detector matches training on 100% of the real data, which is impressive.
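
The "10% Real + Syn" setting can be pictured as building a mixed training list. This is a minimal sketch under assumptions: the function name `mix_training_set`, the fixed seed, and the file-name strings in the usage note are illustrative, and how the authors actually chose the real subset is not stated here.

```python
import random


def mix_training_set(real, synthetic, real_fraction=0.1, seed=0):
    """Combine a random subset of the real images with all synthetic images."""
    rng = random.Random(seed)
    # Keep at least one real image whenever any real data is available.
    k = max(1, round(len(real) * real_fraction)) if real else 0
    subset = rng.sample(real, k)
    combined = subset + list(synthetic)
    rng.shuffle(combined)
    return combined
```

With 100 real images and 50 synthetic ones, `mix_training_set(real, synthetic)` returns 60 items, 10 of which are real.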

6 Conclusion（own） / Future work

code：https://github.com/debidatta/syndata-generation
"The Fast Bilateral Solver" (ECCV-2016)

a novel algorithm for edge-aware smoothing

Image blending (part 1): Poisson Blending, code: https://github.com/yskmt/pb
Patch-level realism: the method does not respect global consistency or even obey scene factors such as lighting.
We showed that patch-based realism is sufficient for training region-proposal based object detectors. Future work should also account for global consistency when placing objects, i.e. realistic geometric layout and global coherence (for example, a cup sitting on a table).
Cut, Paste and Learn paper reading notes
Cut Paste and Learn Surprisingly Easy Synthesis for Instance Detection - QuickPeek