Paper Reading Notes (18): Fully Convolutional Networks for Semantic Segmentation (FCN)

Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic segmentation. Our key insight is to build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet [19], the VGG net [31], and GoogLeNet [32]) into fully convolutional networks and transfer their learned representations by fine-tuning [4] to the segmentation task. We then define a novel architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations.

Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks trained end-to-end, pixels-to-pixels, exceed the state of the art in semantic segmentation. Our key insight is to build “fully convolutional” networks that take input of arbitrary size and produce correspondingly sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks (predicting the class to which each pixel belongs), and draw connections to prior models. We adapt contemporary classification networks (AlexNet [19], the VGG net [31], and GoogLeNet [32]) into fully convolutional networks and transfer their learned representations to the segmentation task by fine-tuning [4]. We then define a skip architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations.
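
As a quick illustration of the central claim, here is a minimal sketch (a toy network of my own, not the paper's architecture) showing that a net built only from convolution and pooling layers accepts inputs of arbitrary size and returns a correspondingly sized grid of class scores:

```python
import torch
import torch.nn as nn

num_classes = 21  # e.g. PASCAL VOC's 20 object classes + background

# toy fully convolutional net: no fully connected layers, so no fixed input size
tiny_fcn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                               # downsample by 2
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                               # downsample by 2 again
    nn.Conv2d(32, num_classes, kernel_size=1),     # per-location class scores
)

for size in [(64, 64), (96, 128)]:                 # two arbitrary input sizes
    x = torch.randn(1, 3, *size)
    print(size, "->", tuple(tiny_fcn(x).shape))    # spatial output scales with the input
```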

Convolutional networks are driving advances in recognition. Convnets are not only improving for whole-image classification [19, 31, 32], but also making progress on local tasks with structured output. These include advances in bounding box object detection [29, 12, 17], part and keypoint prediction [39, 24], and local correspondence [24, 9].

Convolutional networks are driving advances in recognition. Convnets are improving not only whole-image classification [19, 31, 32], but are also making progress on local tasks with structured output, including advances in bounding-box object detection [29, 12, 17], part and keypoint prediction [39, 24], and local correspondence [24, 9].

The natural next step in the progression from coarse to fine inference is to make a prediction at every pixel. Prior approaches have used convnets for semantic segmentation [27, 2, 8, 28, 16, 14, 11], in which each pixel is labeled with the class of its enclosing object or region, but with shortcomings that this work addresses.

In the progression from coarse to fine inference, the natural next step is to make a prediction at every pixel. Earlier approaches have used convnets for semantic segmentation [27, 2, 8, 28, 16, 14, 11], in which each pixel is labeled with the class of its enclosing object or region, but they have shortcomings that this work addresses.

We show that a fully convolutional network (FCN), trained end-to-end, pixels-to-pixels on semantic segmentation exceeds the state-of-the-art without further machinery. To our knowledge, this is the first work to train FCNs end-to-end (1) for pixelwise prediction and (2) from supervised pre-training. Fully convolutional versions of existing networks predict dense outputs from arbitrary-sized inputs. Both learning and inference are performed whole-image-at-a-time by dense feedforward computation and backpropagation. In-network upsampling layers enable pixelwise prediction and learning in nets with subsampled pooling.

We show that a fully convolutional network (FCN), trained end-to-end, pixels-to-pixels, exceeds the state of the art in semantic segmentation without further machinery. To our knowledge, this is the first work to train FCNs end-to-end (1) for pixelwise prediction and (2) from supervised pre-training. Fully convolutional versions of existing networks predict dense outputs from inputs of arbitrary size. Both learning and inference are performed on the whole image at a time by dense feedforward computation and backpropagation. In-network upsampling layers enable pixelwise prediction and learning in nets with subsampled pooling.
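
The in-network upsampling mentioned above is commonly realized as a transposed convolution, often initialized with bilinear interpolation weights and then learned. The sketch below uses sizes of my own choosing (32× upsampling of a 7×7 score map), not the released FCN configuration:

```python
import torch
import torch.nn as nn

def bilinear_kernel(channels, kernel_size):
    """Bilinear upsampling weights, one filter per channel (no cross-channel mixing)."""
    factor = (kernel_size + 1) // 2
    center = factor - 1 if kernel_size % 2 == 1 else factor - 0.5
    og = torch.arange(kernel_size, dtype=torch.float32)
    filt = 1 - (og - center).abs() / factor
    weight = torch.zeros(channels, channels, kernel_size, kernel_size)
    for c in range(channels):
        weight[c, c] = filt[:, None] * filt[None, :]
    return weight

num_classes = 21
# upsample 32x-subsampled scores back to input resolution
upsample = nn.ConvTranspose2d(num_classes, num_classes, kernel_size=64,
                              stride=32, padding=16, bias=False)
with torch.no_grad():
    upsample.weight.copy_(bilinear_kernel(num_classes, 64))

coarse = torch.randn(1, num_classes, 7, 7)  # coarse scores, e.g. from a 224x224 image
print(upsample(coarse).shape)               # torch.Size([1, 21, 224, 224]): pixelwise scores
```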

This method is efficient, both asymptotically and absolutely, and precludes the need for the complications in other works. Patchwise training is common [27, 2, 8, 28, 11], but lacks the efficiency of fully convolutional training. Our approach does not make use of pre- and post-processing complications, including superpixels [8, 16], proposals [16, 14], or post-hoc refinement by random fields or local classifiers [8, 16]. Our model transfers recent success in classification [19, 31, 32] to dense prediction by reinterpreting classification nets as fully convolutional and fine-tuning from their learned representations. In contrast, previous works have applied small convnets without supervised pre-training [8, 28, 27].

This method is efficient, both asymptotically and absolutely, and removes the need for the complications found in other methods. Patchwise training is common [27, 2, 8, 28, 11], but it lacks the efficiency of fully convolutional training. Our approach does not rely on pre- or post-processing complications, including superpixels [8, 16], proposals [16, 14], or post-hoc refinement by random fields or local classifiers [8, 16]. Our model transfers the recent success in classification [19, 31, 32] to dense prediction by reinterpreting classification nets as fully convolutional networks and fine-tuning from their learned representations. In contrast, previous works applied small convnets without supervised pre-training [8, 28, 27].
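
"Reinterpreting classification nets as fully convolutional" amounts to viewing their fully connected layers as convolutions applied over the whole feature map, so the same weights produce a spatial grid of scores instead of a single vector. Below is a minimal sketch of that conversion for a hypothetical classifier head (the layer sizes are toy choices, not those of AlexNet or VGG):

```python
import torch
import torch.nn as nn

# hypothetical classifier head: flatten a 256 x 7 x 7 feature map, then two fc layers
fc1 = nn.Linear(256 * 7 * 7, 4096)
fc2 = nn.Linear(4096, 1000)

# the same weights viewed as convolutions: fc1 becomes a 7x7 conv, fc2 a 1x1 conv
conv1 = nn.Conv2d(256, 4096, kernel_size=7)
conv2 = nn.Conv2d(4096, 1000, kernel_size=1)
with torch.no_grad():
    conv1.weight.copy_(fc1.weight.view(4096, 256, 7, 7))
    conv1.bias.copy_(fc1.bias)
    conv2.weight.copy_(fc2.weight.view(1000, 4096, 1, 1))
    conv2.bias.copy_(fc2.bias)

features = torch.randn(1, 256, 12, 12)        # features from a larger-than-training image
scores = conv2(torch.relu(conv1(features)))   # (1, 1000, 6, 6): a coarse spatial score map
print(scores.shape)
```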

Semantic segmentation faces an inherent tension between semantics and location: global information resolves what while local information resolves where. Deep feature hierarchies jointly encode location and semantics in a local-to-global pyramid. We define a novel “skip” architecture to combine deep, coarse, semantic information and shallow, fine, appearance information in Section 4.2 (see Figure 3).

Semantic segmentation faces an inherent tension between semantics and location: global information resolves the “what,” while local information resolves the “where.” Deep feature hierarchies jointly encode location and semantics in a local-to-global pyramid. In Section 4.2 (see Figure 3) we define a skip architecture that combines semantic information from deep, coarse layers with appearance information from shallow, fine layers.
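
The skip combination can be sketched as "score the shallow layer with a 1×1 convolution, upsample the deep scores 2×, and sum". The tensors below (pool4_feats, deep_scores) are placeholders of my own with made-up sizes, standing in for the actual intermediate layers:

```python
import torch
import torch.nn as nn

num_classes = 21
pool4_feats = torch.randn(1, 512, 14, 14)        # shallow, fine features (placeholder)
deep_scores = torch.randn(1, num_classes, 7, 7)  # deep, coarse class scores (placeholder)

score_pool4 = nn.Conv2d(512, num_classes, kernel_size=1)  # 1x1 conv scores the shallow layer
upsample2x = nn.ConvTranspose2d(num_classes, num_classes, # learned 2x upsampling
                                kernel_size=4, stride=2, padding=1, bias=False)

fused = score_pool4(pool4_feats) + upsample2x(deep_scores)  # combine "what" with "where"
print(fused.shape)  # torch.Size([1, 21, 14, 14]): finer predictions than the deep stream alone
```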

In the next section, we review related work on deep classification nets, FCNs, and recent approaches to semantic segmentation using convnets. The following sections explain FCN design and dense prediction tradeoffs, introduce our architecture with in-network upsampling and multilayer combinations, and describe our experimental framework. Finally, we demonstrate state-of-the-art results on PASCAL VOC 2011-2, NYUDv2, and SIFT Flow.

In the next section, we review related work on deep classification nets, FCNs, and recent approaches that use convnets for semantic segmentation. The following sections explain FCN design and dense prediction trade-offs, introduce our architecture with in-network upsampling and multi-layer combinations, and describe our experimental framework. Finally, we demonstrate state-of-the-art results on PASCAL VOC 2011-2, NYUDv2, and SIFT Flow.

Our approach draws on recent successes of deep nets for image classification [19, 31, 32] and transfer learning [4, 38]. Transfer was first demonstrated on various visual recognition tasks [4, 38], then on detection, and on both instance and semantic segmentation in hybrid proposal-classifier models [12, 16, 14]. We now re-architect and fine-tune classification nets to direct, dense prediction of semantic segmentation. We chart the space of FCNs and situate prior models, both historical and recent, in this framework.

Our approach builds on the recent success of deep networks for image classification [19, 31, 32] and on transfer learning [4, 38]. Transfer was first demonstrated on various visual recognition tasks [4, 38], then on detection, and on both instance and semantic segmentation in hybrid proposal-classifier models [12, 16, 14]. We now re-architect and fine-tune classification nets for direct, dense prediction of semantic segmentation. We chart the space of FCNs and situate prior models, both historical and recent, within this framework.

Fully convolutional networks To our knowledge, the idea of extending a convnet to arbitrary-sized inputs first appeared in Matan et al. [25], which extended the classic LeNet [21] to recognize strings of digits. Because their net was limited to one-dimensional input strings, Matan et al. used Viterbi decoding to obtain their outputs. Wolf and Platt [37] expand convnet outputs to 2-dimensional maps of detection scores for the four corners of postal address blocks. Both of these historical works do inference and learning fully convolutionally for detection. Ning et al. [27] define a convnet for coarse multiclass segmentation of C. elegans tissues with fully convolutional inference.

Fully convolutional networks. To our knowledge, the idea of extending a convnet to inputs of arbitrary size first appeared in Matan et al. [25], which extended the classic LeNet [21] to recognize strings of digits. Because their net was limited to one-dimensional input strings, Matan et al. used Viterbi decoding to obtain their outputs. Wolf and Platt [37] expanded convnet outputs to two-dimensional maps of detection scores for the four corners of postal address blocks. Both of these earlier works perform inference and learning fully convolutionally for detection. Ning et al. [27] defined a convnet for coarse, multiclass segmentation of C. elegans tissues with fully convolutional inference.

Fully convolutional computation has also been exploited in the present era of many-layered nets. Sliding window detection by Sermanet et al. [29], semantic segmentation by Pinheiro and Collobert [28], and image restoration by Eigen et al. [5] do fully convolutional inference. Fully convolutional training is rare, but used effectively by Tompson et al. [35] to learn an end-to-end part detector and spatial model for pose estimation, although they do not exposit on or analyze this method.

Fully convolutional computation has also been exploited in today's many-layered networks. Sliding-window detection by Sermanet et al. [29], semantic segmentation by Pinheiro and Collobert [28], and image restoration by Eigen et al. [5] all perform fully convolutional inference. Fully convolutional training is rare, but it was used effectively by Tompson et al. [35] to learn an end-to-end part detector and spatial model for pose estimation, although they do not expound on or analyze this method.

Alternatively, He et al. [17] discard the nonconvolutional portion of classification nets to make a feature extractor. They combine proposals and spatial pyramid pooling to yield a localized, fixed-length feature for classification. While fast and effective, this hybrid model cannot be learned end-to-end.

Alternatively, He et al. [17] discard the non-convolutional portion of classification nets to build a feature extractor. They combine proposals and spatial pyramid pooling to yield a localized, fixed-length feature for classification. While fast and effective, this hybrid model cannot be learned end-to-end.

Dense prediction with convnets Several recent works have applied convnets to dense prediction problems, including semantic segmentation by Ning et al. [27], Farabet et al. [8], and Pinheiro and Collobert [28]; boundary prediction for electron microscopy by Ciresan et al. [2] and for natural images by a hybrid neural net/nearest neighbor model by Ganin and Lempitsky [11]; and image restoration and depth estimation by Eigen et al. [5, 6]. Common elements of these approaches include:

• small models restricting capacity and receptive fields;
• patchwise training [27, 2, 8, 28, 11];
• post-processing by superpixel projection, random field regularization, filtering, or local classification [8, 2, 11];
• input shifting and output interlacing for dense output [28, 11] as introduced by OverFeat [29];
• multi-scale pyramid processing [8, 28, 11];
• saturating tanh nonlinearities [8, 5, 28]; and
• ensembles [2, 11],

Dense prediction with convnets. Several recent works have applied convnets to dense prediction problems, including semantic segmentation by Ning et al. [27], Farabet et al. [8], and Pinheiro and Collobert [28]; boundary prediction for electron microscopy by Ciresan et al. [2] and for natural images by the hybrid neural net / nearest-neighbor model of Ganin and Lempitsky [11]; and image restoration and depth estimation by Eigen et al. [5, 6]. Common elements of these approaches include:

• small models restricting capacity and receptive fields;
• patchwise training [27, 2, 8, 28, 11];
• post-processing by superpixel projection, random field regularization, filtering, or local classification [8, 2, 11];
• input shifting and output interlacing for dense output [28, 11], as introduced by OverFeat [29];
• multi-scale pyramid processing [8, 28, 11];
• saturating tanh nonlinearities [8, 5, 28]; and
• ensembles [2, 11],

whereas our method does without this machinery. However, we do study patchwise training (Section 3.4) and “shift-and-stitch” dense output (Section 3.2) from the perspective of FCNs. We also discuss in-network upsampling (Section 3.3), of which the fully connected prediction by Eigen et al. [6] is a special case.

whereas our method does without this machinery. However, we do study patchwise training (Section 3.4) and “shift-and-stitch” dense output (Section 3.2) from the perspective of FCNs. We also discuss in-network upsampling (Section 3.3), of which the fully connected prediction of Eigen et al. [6] is a special case.
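
For reference, here is a rough sketch of the "shift-and-stitch" trick: run the net on f × f shifted copies of the input and interlace the coarse outputs into a full-resolution map. It assumes H and W are divisible by f and that `net` maps a (1, C, H, W) tensor to (1, K, H/f, W/f); the exact correspondence between shift offsets and interlace positions is simplified here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def shift_and_stitch(net, image, f):
    """Stitch dense predictions from a net whose output is f times coarser than its input."""
    _, _, H, W = image.shape
    out = None
    for dy in range(f):
        for dx in range(f):
            # shift the input by (dx, dy) with zero padding, keeping its original size
            shifted = F.pad(image, (dx, 0, dy, 0))[:, :, :H, :W]
            coarse = net(shifted)                                # (1, K, H // f, W // f)
            if out is None:
                out = image.new_zeros((1, coarse.shape[1], H, W))
            # each shifted run fills one position of the f x f interlace pattern
            out[:, :, dy::f, dx::f] = coarse
    return out

# toy usage: a conv followed by 4x pooling, stitched back to full resolution
net = nn.Sequential(nn.Conv2d(3, 21, kernel_size=3, padding=1), nn.MaxPool2d(4))
with torch.no_grad():
    dense = shift_and_stitch(net, torch.randn(1, 3, 64, 64), f=4)
print(dense.shape)  # torch.Size([1, 21, 64, 64])
```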

Unlike these existing methods, we adapt and extend deep classification architectures, using image classification as supervised pre-training, and fine-tune fully convolutionally to learn simply and efficiently from whole image inputs and whole image ground truths.

Unlike these existing methods, we adapt and extend deep classification architectures, use image classification as supervised pre-training, and fine-tune fully convolutionally to learn simply and efficiently from whole-image inputs and whole-image ground truths (the reference labels used for supervised training).
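
Training fully convolutionally from whole images and their whole-image ground truths reduces to a standard loop with a per-pixel cross-entropy loss over the dense score map. The sketch below uses a stand-in toy net and plain bilinear interpolation to bring the scores to label resolution, rather than the paper's learned upsampling:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes = 21
net = nn.Sequential(                                  # stand-in fully convolutional net
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, num_classes, kernel_size=1),
)
optimizer = torch.optim.SGD(net.parameters(), lr=1e-3, momentum=0.9)

image = torch.randn(1, 3, 64, 64)                     # a whole image
target = torch.randint(0, num_classes, (1, 64, 64))   # its per-pixel ground-truth labels

scores = net(image)                                   # (1, 21, 32, 32) coarse scores
scores = F.interpolate(scores, size=target.shape[-2:],
                       mode="bilinear", align_corners=False)  # back to label resolution
loss = F.cross_entropy(scores, target)                # averaged per-pixel cross entropy
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(float(loss))
```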

Hariharan et al. [16] and Gupta et al. [14] likewise adapt deep classification nets to semantic segmentation, but do so in hybrid proposal-classifier models. These approaches fine-tune an R-CNN system [12] by sampling bounding boxes and/or region proposals for detection, semantic segmentation, and instance segmentation. Neither method is learned end-to-end.

Hariharan et al. [16] and Gupta et al. [14] likewise adapt deep classification nets to semantic segmentation, but they do so in hybrid proposal-classifier models. These approaches fine-tune an R-CNN system [12] by sampling bounding boxes and/or region proposals for detection, semantic segmentation, and instance segmentation. Neither method is learned end-to-end.

They achieve state-of-the-art results on PASCAL VOC segmentation and NYUDv2 segmentation respectively, so we directly compare our standalone, end-to-end FCN to their semantic segmentation results in Section 5.

They achieve state-of-the-art results on PASCAL VOC segmentation and NYUDv2 segmentation respectively, so in Section 5 we directly compare our standalone, end-to-end FCN to their semantic segmentation results.

Each layer of data in a convnet is a three-dimensional array of size h × w × d, where h and w are spatial dimensions, and d is the feature or channel dimension. The first layer is the image, with pixel size h × w, and d color channels. Locations in higher layers correspond to the locations in the image they are path-connected to, which are called their receptive fields.

Each layer of data in a convnet is a three-dimensional array of size h × w × d, where h and w are spatial dimensions and d is the feature or channel dimension. The first layer is the image, with pixel size h × w and d color channels. Locations in higher layers correspond to the locations in the image to which they are path-connected, which are called their receptive fields.
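
The relation between a layer's spatial size, its effective stride relative to the input, and its receptive field can be traced with simple arithmetic; the layer stack below is a made-up example, not a specific network from the paper:

```python
# each layer is (name, kernel_size, stride, padding)
layers = [
    ("conv3x3", 3, 1, 1),
    ("pool2x2", 2, 2, 0),
    ("conv3x3", 3, 1, 1),
    ("pool2x2", 2, 2, 0),
]

def trace(h, w, layers):
    rf, jump = 1, 1           # receptive field size and input-pixel spacing of output locations
    for name, k, s, p in layers:
        h = (h + 2 * p - k) // s + 1
        w = (w + 2 * p - k) // s + 1
        rf += (k - 1) * jump  # each extra tap widens the receptive field by `jump` input pixels
        jump *= s             # striding compounds the spacing between neighboring outputs
        print(f"{name}: output {h}x{w}, receptive field {rf}x{rf}, effective stride {jump}")

trace(64, 64, layers)
```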

Fully convolutional networks are a rich class of models, of which modern classification convnets are a special case. Recognizing this, extending these classification nets to segmentation, and improving the architecture with multi-resolution layer combinations dramatically improves the state-of-the-art, while simultaneously simplifying and speeding up learning and inference.

Fully convolutional networks are a rich class of models, of which modern classification convnets are a special case. Recognizing this, extending these classification nets to segmentation, and improving the architecture with multi-resolution layer combinations dramatically improves the state of the art, while simultaneously simplifying and speeding up learning and inference.
