FCN paper: a close reading

weixin_39945445

Tags: fully convolutional networks, semantic segmentation, end-to-end training, transfer learning, skip connections

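1.Research contributions and significance

1.1 Research contributions

Converts classification networks into fully convolutional networks: fully connected layers become convolutional layers, and deconvolution performs the upsampling
Fine-tunes the converted networks through transfer learning
Uses a skip architecture to combine semantic information with appearance information, producing accurate and detailed segmentations
Shows that convolutional networks trained end-to-end, pixels-to-pixels, exceed the previous state of the art in semantic segmentation
Became the best-performing segmentation method on PASCAL VOC, with a relative improvement in mean IU of nearly 20% over the 2011 and 2012 state-of-the-art methods, as shown in Figure 1

Figure 1 Metric improvement achieved by FCN

1.2 Historical significance

The pioneering work in deep-learning-based semantic segmentation
End-to-end training paved the way for later semantic segmentation algorithms

Figure 2 Segmentation results of FCN compared with traditional algorithms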

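2.Paper structure

Abstract

Background, core idea, approach, and final results

Introduction

State of semantic segmentation research, contributions of the paper, and overall structure of the article

Related Work

Where the ideas come from, characteristics of earlier methods, and how this paper differs

Prior knowledge

Basic definition of convolutional networks, their relationship to and differences from classification networks, shift-and-stitch, deconvolution, patchwise training

Details of learning

Architecture, innovations, and design details

Results

Metric definitions and experimental analysis on several datasets

Conclusion

Conclusions drawn from the experiments

References

Bibliography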

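3.Abstract

Convolutional networks are powerful visual models that yield hierarchies of features.

Convolutional networks are powerful visual models because they learn features at several levels of abstraction.

We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic segmentation.

A convolutional network alone, trained end-to-end and pixels-to-pixels, outperforms the previous state of the art in semantic segmentation.

Our key insight is to build "fully convolutional" networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning.

The core idea is a fully convolutional network that accepts input of any size and, with efficient inference and learning, produces output of corresponding size.

We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models.

The authors define and describe the space of fully convolutional networks, explain how they apply to spatially dense (pixel-level) prediction tasks, and relate them to earlier models.

We adapt contemporary classification networks (AlexNet, the VGG net, and GoogLeNet) into fully convolutional networks and transfer their learned representations by fine-tuning to the segmentation task.

Contemporary classification networks (AlexNet, VGG, GoogLeNet) are adapted into fully convolutional networks, and their learned representations are transferred to segmentation by fine-tuning.

We then define a skip architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations.

A skip architecture combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce more accurate and detailed segmentations.

Our fully convolutional network achieves state-of-the-art segmentation of PASCAL VOC (20% relative improvement to 62.2% mean IU on 2012), NYUDv2, and SIFT Flow, while inference takes less than one fifth of a second for a typical image.

The network achieves state-of-the-art segmentation on PASCAL VOC, NYUDv2, and SIFT Flow, and inference takes less than 0.2 s for a typical image.

How this abstract is organized (worth borrowing): general background, core idea, what the proposed network achieves, the steps taken to get there, and the results reached on each dataset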

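Abstract summary:

(1)Main achievement

A convolutional network trained end-to-end, pixels-to-pixels, exceeds the previous state of the art in semantic segmentation.

(2)Core idea

Build a fully convolutional network that takes an image of any size and, through efficient inference and learning, produces an output of corresponding size.

(3)Main methods

Adapt current classification networks (AlexNet, VGG, and GoogLeNet) into fully convolutional networks and fine-tune them; design skip connections that link global and local information so that each compensates for the other.

(4)Experimental results

State-of-the-art results on the PASCAL VOC, NYUDv2, and SIFT Flow datasets.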

4.Introduction

Paragraph 1 (current state of CNNs)

Convolutional networks are driving advances in recognition.

Convnets are the main force behind recent progress in recognition.

Convnets are not only improving for whole-image classification, but also making progress on local tasks with structured output.

They have improved whole-image classification and are also advancing local tasks with structured output.

These include advances in bounding box object detection, part and key-point prediction, and local correspondence.

These local tasks include bounding-box object detection, part and keypoint prediction, and local correspondence.

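Paragraph 2 (coarse, fine, and pixel-level prediction)

The natural next step in the progression from coarse to fine inference is to make a prediction at every pixel.

After the move from coarse to fine inference, the natural next step is a prediction at every pixel.

Prior approaches have used convnets for semantic segmentation, in which each pixel is labeled with the class of its enclosing object or region, but with shortcomings that this work addresses.

Earlier approaches already used convnets for semantic segmentation, labeling each pixel with the class of the object or region that encloses it, but they have shortcomings that this paper addresses.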

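Paragraph 3 (the paper's contributions)

We show that a fully convolutional network (FCN) trained end-to-end, pixels-to-pixels on semantic segmentation exceeds the state-of-the-art without further machinery.

A fully convolutional network (FCN) trained end-to-end, pixels-to-pixels, exceeds the state of the art without any additional machinery.

To our knowledge, this is the first work to train FCNs end-to-end for pixelwise prediction and from supervised pre-training.

To the authors' knowledge, this is the first work to train FCNs end-to-end both for pixelwise prediction and from supervised pre-training.

Fully convolutional versions of existing networks predict dense outputs from arbitrary-size inputs.

Fully convolutional versions of existing networks predict dense outputs from inputs of arbitrary size.

Both learning and inference are performed whole-image-at-a-time by dense feedforward computation and backpropagation.

Learning and inference both run on the whole image at once, through dense feedforward computation and backpropagation, as shown in Figure 3.

In-network upsampling layers enable pixelwise prediction and learning in nets with subsampled pooling.

Upsampling layers inside the network make pixelwise prediction and learning possible in nets that subsample through pooling.

Figure 3 FCN learning and prediction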

Paragraph 4 (advantages of FCN)

This method is efficient, both asymptotically and absolutely, and precludes the need for the complications in other works.

The method is efficient, both asymptotically and in absolute terms, and it avoids the complications that other works rely on.

Patchwise training is common [references], but lacks the efficiency of fully convolutional training.

Patchwise training is common but less efficient than fully convolutional training.

Our approach does not make use of pre- and post-processing complications, including superpixels, proposals, or post-hoc refinement by random fields or local classifiers.

The approach needs no pre- or post-processing such as superpixels, proposals, or post-hoc refinement by random fields or local classifiers.

Our model transfers recent success in classification to dense prediction by reinterpreting classification nets as fully convolutional and fine-tuning from their learned representations.

The model carries recent success in classification over to dense prediction by reinterpreting classification nets as fully convolutional and fine-tuning from their learned representations.

In contrast, previous works have applied small convnets without supervised pre-training.

Earlier works, by contrast, applied small convnets without supervised pre-training.

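Paragraph 5 (the problem in semantic segmentation and its solution)

Semantic segmentation faces an inherent tension between semantics and location: global information resolves what while local information resolves where.

Semantic segmentation faces an inherent tension between semantics and location: global information answers "what", while local information answers "where".

Deep feature hierarchies encode location and semantics in a nonlinear local-to-global pyramid.

Deep feature hierarchies encode location and semantics together in a nonlinear local-to-global pyramid.

We define a skip architecture to take advantage of this feature spectrum that combines deep, coarse, semantic information and shallow, fine, appearance information in Section 4.2.

A skip architecture exploits this feature spectrum by combining deep, coarse, semantic information with shallow, fine, appearance information.

Local information

Where it is extracted: in the shallow layers of the network, as in the upper part of Figure 4
Characteristics: rich in geometric detail about the object, with a small receptive field, as in Figure 5-(b)
Purpose: helps segment small objects and improves the precision of the segmentation

Figure 4 Global and local information

Global information

Where it is extracted: in the deep layers of the network, as in the lower part of Figure 4
Characteristics: rich in spatial (contextual) information about the object, with a large receptive field, as in Figure 5-(c)
Purpose: helps segment large objects and improves the accuracy of the segmentation

Figure 5 Receptive fields of different algorithms

The tension between global and local information

Feature maps carrying global information are small, while feature maps carrying local information are large, which makes fusing the two through a skip connection difficult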

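Paragraph 6 (overall structure of the article)

In the next section, we review related work on deep classification nets, FCNs, and recent approaches to semantic segmentation using convnets.

The next section reviews related work on deep classification nets, FCNs, and recent convnet-based approaches to semantic segmentation.

The following sections explain FCN design and dense prediction trade-offs, introduce our architecture with in-network upsampling and multilayer combinations, and describe our experimental framework.

The sections after that explain FCN design and the trade-offs of dense prediction, introduce the architecture with in-network upsampling and multilayer combinations, and describe the experimental framework.

Finally, we demonstrate state-of-the-art results on PASCAL VOC 2011-2, NYUDv2, and SIFT Flow.

Finally, the paper presents state-of-the-art results on PASCAL VOC 2011-2, NYUDv2, and SIFT Flow.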

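5.Related work (where the paper's ideas come from)

The related-work citations add little for this reading, so the section is only outlined

Paragraph 1 (transfer learning)

Paragraph 2 (earlier algorithms and the origin of the authors' ideas)

Paragraph 3 (CNN-based dense prediction tasks)

Paragraph 4 (lead-in to what this paper studies)

Paragraph 5 (pre-training with classification models)

Paragraph 6 (comparison of experimental results)

Introduction & Related work summary:

Earlier segmentation methods have two main kinds of shortcoming:

(1)Patch-based segmentation is common but inefficient, and it usually needs pre- or post-processing (for example superpixels, proposals, or local classifiers)

(2)Semantic segmentation struggles to get semantics and location at the same time. Global information answers "what", while local information answers "where"

To address these two problems, the paper makes three main contributions:

(1)It converts classification networks into fully convolutional networks: fully connected layers become convolutional layers, and deconvolution performs the upsampling

(2)It fine-tunes through transfer learning

(3)It uses a skip architecture so that semantic information can be combined with appearance information, producing accurate and detailed segmentations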

6.Fully convolutional networks (prior knowledge)

Paragraph 1:

Each layer of data in a convnet is a three-dimensional array of size h × w × d, where h and w are spatial dimensions, and d is the feature or channel dimension.

Every layer of data in a convnet is a three-dimensional array of size h × w × d.

The first layer is the image, with pixel size h × w, and d color channels.

The first layer is the image itself, h × w pixels with d color channels.

Locations in higher layers correspond to the locations in the image they are path-connected to, which are called their receptive fields.

A location in a higher layer corresponds to the image locations it is path-connected to; those locations form its receptive field.

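Receptive field

In a convolutional network, the region of the input that determines one element of a layer's output is called the receptive field of that element, as shown in Figure 6. A large receptive field generally gives better results than a small one.

Note: the receptive field size here is measured on the original image, not on the previous layer. In the receptive-field formula, l+1 denotes the current layer and l the previous layer.
The formula shows that the larger the strides of the earlier layers, the larger the receptive field. Too large a stride, however, leaves less information in the feature map. How to enlarge the receptive field, or at least keep it unchanged, while reducing the stride is therefore a major problem in segmentation.

Figure 6 Receptive field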
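
The note above refers to a receptive-field formula. A minimal sketch, assuming the commonly used recurrence RF(l+1) = RF(l) + (k(l+1) - 1) × (product of the strides of layers 1 to l), where k is the kernel size. The function name `receptive_fields` and the toy layer list are illustrative and not taken from the paper.

```python
def receptive_fields(layers):
    """Return (receptive_field, cumulative_stride) after each layer.

    layers: list of (kernel_size, stride) pairs, ordered from input to output.
    Both values are measured in pixels of the original image.
    """
    rf, jump = 1, 1  # one input pixel sees itself; jump = product of strides so far
    result = []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # growth is scaled by the strides of earlier layers
        jump *= stride
        result.append((rf, jump))
    return result


# Hypothetical toy stack: conv 3x3, conv 3x3, 2x2 pooling with stride 2, conv 3x3.
toy_layers = [(3, 1), (3, 1), (2, 2), (3, 1)]
for (kernel, stride), (rf, jump) in zip(toy_layers, receptive_fields(toy_layers)):
    print(f"kernel={kernel} stride={stride} -> receptive field {rf}, output stride {jump}")
```

The cumulative stride returned for the last layer is the factor by which the output map is coarser than the input, which is the trade-off described in the note: raising an early stride enlarges every later receptive field but also coarsens the output.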

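Paragraph 2:

Convnets are built on translation invariance.

Convolutional networks are built on translation invariance.

Their basic components (convolution, pooling, and activation functions) operate on local input regions, and depend only on relative spatial coordinates.

These basic components act on local input regions and depend only on relative spatial coordinates.

Translation invariance

Overall effect: wherever the object is moved within the image, the classification result should stay the same, as shown in Figure 7
Mechanism: convolution
Translation equivariance (often loosely called invariance): when the object in the image moves, the resulting feature map moves by the same amount

Figure 7 Translation invariance

"Why do deep convolutional networks generalize so poorly to small image transformations?"

Both the input image size and the network depth affect translation invariance. The main reason it breaks down is downsampling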
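
The two claims above (convolution shifts its output when the input shifts, and downsampling breaks this) can be checked on a toy 1-D signal. This is a sketch under stated assumptions: the signal, the kernel, and the helper names `conv_same` and `shift_right` are made up for illustration, and the signal has zero margins so that nothing is lost at the borders.

```python
def conv_same(signal, kernel):
    """1-D cross-correlation with zero padding; output has the input's length."""
    half = len(kernel) // 2
    n = len(signal)
    return [
        sum(kernel[t] * signal[i + t - half]
            for t in range(len(kernel))
            if 0 <= i + t - half < n)
        for i in range(n)
    ]


def shift_right(signal, steps):
    """Shift by `steps` positions, filling with zeros on the left."""
    return [0] * steps + signal[:len(signal) - steps]


signal = [0, 0, 1, 3, 2, 0, 0, 0]
kernel = [1, 2, 1]

features = conv_same(signal, kernel)
features_of_shifted = conv_same(shift_right(signal, 1), kernel)

# Stride-1 convolution: shifting the input shifts the feature map by the same amount.
print(features_of_shifted == shift_right(features, 1))

# After stride-2 subsampling, a 1-pixel input shift is no longer a shift of the output.
coarse = features[::2]
coarse_of_shifted = features_of_shifted[::2]
print(coarse, coarse_of_shifted)
```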

The remaining paragraphs are skipped.

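6.1 Adapting classifiers for dense prediction

Paragraph 1

Typical recognition nets, including LeNet, AlexNet, and its deeper successors, ostensibly take fixed-sized inputs and produce non-spatial outputs.

Typical recognition nets, including LeNet, AlexNet, and its deeper successors, appear to take fixed-size inputs and produce non-spatial outputs.

The fully connected layers of these nets have fixed dimensions and throw away spatial coordinates.

The fully connected layers of these nets have fixed dimensions and discard spatial coordinates.

However, these fully connected layers can also be viewed as convolutions with kernels that cover their entire input regions.

These fully connected layers can, however, also be viewed as convolutions whose kernels cover their entire input region.

Doing so casts them into fully convolutional networks that take input of any size and output classification maps.

Viewed this way, the nets become fully convolutional networks that take input of any size and output classification maps.

This transformation is illustrated in Figure 8.

The transformation is shown in Figure 8.

Figure 8 Converting fully connected layers to convolutions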
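
The equivalence described above can be sketched on a single channel and a single output unit. This is a toy illustration, not the paper's code: the function names and the numbers are made up, and a real layer would have many input and output channels.

```python
def fully_connected(feature_map, weights):
    """One fully connected output unit: flatten the map, take a dot product."""
    flat = [value for row in feature_map for value in row]
    return sum(w * v for w, v in zip(weights, flat))


def conv_valid(feature_map, kernel):
    """2-D cross-correlation without padding (every position where the kernel fits)."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(feature_map), len(feature_map[0])
    return [
        [sum(feature_map[i + a][j + b] * kernel[a][b]
             for a in range(kh) for b in range(kw))
         for j in range(w - kw + 1)]
        for i in range(h - kh + 1)
    ]


fc_weights = [1, 0, -1, 2]          # weights of the fully connected unit
kernel = [[1, 0], [-1, 2]]          # the same weights reshaped to 2x2

small = [[1, 2], [3, 4]]            # the fixed input size the classifier expects
large = [[1, 2, 0], [3, 4, 1], [0, 1, 2]]

print(fully_connected(small, fc_weights), conv_valid(small, kernel))
print(conv_valid(large, kernel))    # a map of scores, one per 2x2 patch
```

On the fixed-size input the convolution returns a 1×1 map holding the fully connected output; on the larger input it returns one score per patch, which is the classification map mentioned above.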

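Paragraph 2

Furthermore, while the resulting maps are equivalent to the evaluation of the original net on particular input patches, the computation is highly amortized over the overlapping regions of those patches.

The resulting maps are equivalent to evaluating the original net on particular input patches, but the computation is heavily amortized over the regions where those patches overlap.

For example, while AlexNet takes 1.2 ms (on a typical GPU) to infer the classification scores of a 227 × 227 image, the fully convolutional net takes 22 ms to produce a 10 × 10 grid of outputs from a 500 × 500 image, which is more than 5 times faster than the naïve approach.

For example, AlexNet needs 1.2 ms (on a typical GPU) to infer the classification scores of a 227 × 227 image, whereas the fully convolutional net needs 22 ms to produce a 10 × 10 grid of outputs from a 500 × 500 image, more than 5 times faster than the naïve approach.

Paragraph 3

The spatial output maps of these convolutionalized models make them a natural choice for dense problems like semantic segmentation.

The spatial output maps of these convolutionalized models make them a natural fit for dense prediction tasks such as semantic segmentation.

With ground truth available at every output cell, both the forward and backward passes are straightforward, and both take advantage of the inherent computational efficiency (and aggressive optimization) of convolution.

Because ground truth is available at every output cell, both the forward and backward passes are straightforward, and both benefit from the inherent computational efficiency (and aggressive optimization) of convolution.

The corresponding backward times for the AlexNet example are 2.4 ms for a single image and 37 ms for a fully convolutional 10 × 10 output map, resulting in a speedup similar to that of the forward pass.

For the AlexNet example, the backward pass takes 2.4 ms for a single image and 37 ms for a fully convolutional 10 × 10 output map, a speedup similar to that of the forward pass.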
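
The speedups quoted above follow from simple arithmetic. The sketch below assumes that the naïve approach runs the patch classifier once per cell of the 10 × 10 output grid; the timings are the ones quoted from the paper, and the variable names are illustrative.

```python
# Timings quoted in the text (milliseconds, on the GPU used in the paper).
patch_forward_ms, patch_backward_ms = 1.2, 2.4   # one 227 x 227 patch
fcn_forward_ms, fcn_backward_ms = 22.0, 37.0     # one 500 x 500 image, 10 x 10 outputs

grid_cells = 10 * 10  # assumption: the naive approach evaluates one patch per output cell

naive_forward_ms = patch_forward_ms * grid_cells
naive_backward_ms = patch_backward_ms * grid_cells

forward_speedup = naive_forward_ms / fcn_forward_ms
backward_speedup = naive_backward_ms / fcn_backward_ms

print(f"forward:  {naive_forward_ms:.0f} ms naive vs {fcn_forward_ms:.0f} ms, x{forward_speedup:.2f}")
print(f"backward: {naive_backward_ms:.0f} ms naive vs {fcn_backward_ms:.0f} ms, x{backward_speedup:.2f}")
```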

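Paragraph 4

While our reinterpretation of classification nets as fully convolutional yields output maps for inputs of any size, the output dimensions are typically reduced by subsampling.

Reinterpreting classification nets as fully convolutional yields output maps for inputs of any size, but the output dimensions are usually reduced by subsampling.

The classification nets subsample to keep filters small and computational requirements reasonable.

Classification nets subsample in order to keep filters small and the computational cost reasonable.

This coarsens the output of a fully convolutional version of these nets, reducing it from the size of the input by a factor equal to the pixel stride of the receptive fields of the output units.

This makes the output of the fully convolutional version coarse: it is smaller than the input by a factor equal to the pixel stride of the output units' receptive fields.

Classic algorithm vs. the paper's algorithm

Figure 9 VGG classification model

Figure 10 VGG converted to FCN

In FCN, the last three layers of the CNN (the fully connected layers) are all converted into multi-channel convolutional layers whose channel counts equal the lengths of the original output vectors. For VGG16 the first of them becomes a 7 × 7 convolution over the final pooled map and the other two become 1 × 1 convolutions. The whole model then consists only of convolutional layers, and no fully connected layer produces a vector. A CNN performs image-level recognition, going from an image to a single result, whereas FCN performs pixel-level recognition and labels every pixel of the input image with the class it most likely belongs to.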

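6.2 Shift-and-stitch is filter rarefaction

If the output is downsampled by a factor of f, shift the input x pixels to the right and y pixels down, once for every (x, y) s.t. 0 ≤ x, y < f. Process each of these f^2 inputs, and interlace the outputs so that the predictions correspond to the pixels at the centers of their receptive fields.

If the output is downsampled by a factor of f, the input is shifted x pixels to the right and y pixels down, once for every pair (x, y) with 0 ≤ x, y < f. Each of these f^2 inputs is processed, and the outputs are interlaced so that each prediction lines up with the pixel at the center of its receptive field.

Shift-and-stitch

Zero-pad and shift the original image to obtain four versions of the input (the case f = 2)
Apply max pooling to obtain the four corresponding output feature maps
Stitch the four output maps into one dense prediction map

Figure 11 Shift-and-stitch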
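
The three steps above can be sketched on a toy, single-channel example. Several assumptions are made for the sake of a short, checkable sketch: the "network" is a hypothetical 3 × 3 box sum followed by stride-f subsampling (standing in for the pooled classifier), and the input is shifted up and left rather than right and down so that output indices stay non-negative. All function names are made up.

```python
def box_sum(img, i, j):
    """3x3 sum centred on (i, j); pixels outside the image count as zero."""
    h, w = len(img), len(img[0])
    return sum(img[a][b]
               for a in range(i - 1, i + 2)
               for b in range(j - 1, j + 2)
               if 0 <= a < h and 0 <= b < w)


def dense_net(img):
    """Reference: the local operator evaluated at every pixel (stride 1)."""
    return [[box_sum(img, i, j) for j in range(len(img[0]))] for i in range(len(img))]


def strided_net(img, f):
    """The same operator evaluated only every f-th pixel (a coarse output)."""
    return [[box_sum(img, i, j) for j in range(0, len(img[0]), f)]
            for i in range(0, len(img), f)]


def zero_pad(img, p):
    width = len(img[0]) + 2 * p
    border = [[0] * width for _ in range(p)]
    return border + [[0] * p + row + [0] * p for row in img] + [[0] * width for _ in range(p)]


def shift_up_left(img, y, x):
    h, w = len(img), len(img[0])
    return [[img[i + y][j + x] if i + y < h and j + x < w else 0 for j in range(w)]
            for i in range(h)]


def shift_and_stitch(img, f):
    h, w = len(img), len(img[0])
    stitched = [[None] * w for _ in range(h)]
    for y in range(f):
        for x in range(f):                      # f * f shifted copies of the input
            coarse = strided_net(shift_up_left(img, y, x), f)
            for a, row in enumerate(coarse):
                for b, value in enumerate(row):
                    i, j = f * a + y, f * b + x  # interlace back to dense positions
                    if i < h and j < w:
                        stitched[i][j] = value
    return stitched


image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
f = 2
canvas = zero_pad(image, f)                     # padding keeps shifted content in frame
stitched = shift_and_stitch(canvas, f)
print(stitched == dense_net(canvas))
```

With f = 2 the loop processes four shifted inputs, and the interlaced result matches the stride-1 output, at f^2 times the cost of a single coarse pass.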

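6.3 Upsampling

Rather than relying on fixed interpolation for upsampling, the paper upsamples inside the network with deconvolution, whose filters can be learned. Deconvolution is often described as the reverse of convolution, but it does not recover the values lost in the forward convolution; it only runs the steps of the convolution in the reverse direction, which is why it is also called transposed convolution.

Standard convolution formula:

Deconvolution formula (the convolution steps run in reverse):

Figure 12 Standard convolution

Figure 13 Deconvolution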
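
A minimal 1-D sketch of the relationship described above. It assumes the usual size relations without padding, namely output length floor((n - k) / s) + 1 for a convolution with kernel size k and stride s, and s × (m - 1) + k for the transposed convolution of a length-m input. The function names and the numbers are illustrative.

```python
def conv1d(x, kernel, stride):
    """Strided 1-D cross-correlation without padding."""
    n_out = (len(x) - len(kernel)) // stride + 1
    return [sum(x[i * stride + t] * kernel[t] for t in range(len(kernel)))
            for i in range(n_out)]


def conv_transpose1d(y, kernel, stride):
    """Transposed convolution: each input value scatters a scaled copy of the kernel."""
    out = [0] * (stride * (len(y) - 1) + len(kernel))
    for i, value in enumerate(y):
        for t, k in enumerate(kernel):
            out[i * stride + t] += value * k
    return out


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


x = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 2, 1]
stride = 2

y = conv1d(x, kernel, stride)                 # length 3: downsampled
x_up = conv_transpose1d(y, kernel, stride)    # length 7: back to the input size

print(len(x), len(y), len(x_up))
print(x_up == x)                              # sizes match, values are not recovered

# "Transposed" in the linear-algebra sense: <conv(x), v> equals <x, conv_transpose(v)>.
v = [1, -1, 2]
print(dot(conv1d(x, kernel, stride), v) == dot(x, conv_transpose1d(v, kernel, stride)))
```

The transposed convolution restores the spatial size but not the original values, which is the distinction drawn in the text between upsampling by deconvolution and a true inverse.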

6.4 Patchwise training

Skipped

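7.Segmentation architecture (the model in detail)

Paragraph 1

We cast ILSVRC classifiers into FCNs and augment them for dense prediction with in-network upsampling and a pixelwise loss.

ILSVRC classifiers are converted into FCNs and extended for dense prediction with in-network upsampling and a pixelwise loss.

We train for segmentation by fine-tuning.

Training for segmentation is done by fine-tuning.

Next, we add skips between layers to fuse coarse, semantic and local, appearance information.

Skip connections are then added between layers to fuse coarse semantic information with local appearance information.

This skip architecture is learned end-to-end to refine the semantics and spatial precision of the output.

The skip architecture is learned end-to-end and refines both the semantics and the spatial precision of the output.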
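
The fusion step of a skip connection can be sketched as "upsample the coarse score map, then add it to the finer one". This is a simplified illustration under stated assumptions: it uses nearest-neighbour 2× upsampling for brevity where the paper uses a deconvolution layer, it handles a single class channel, and the names `upsample2x` and `fuse` and all numbers are made up.

```python
def upsample2x(score_map):
    """Nearest-neighbour 2x upsampling (a stand-in for the learned deconvolution)."""
    return [[value for value in row for _ in range(2)]
            for row in score_map for _ in range(2)]


def fuse(coarse_scores, fine_scores):
    """Bring the coarse map to the fine resolution and sum the two element-wise."""
    up = upsample2x(coarse_scores)
    if len(up) != len(fine_scores) or len(up[0]) != len(fine_scores[0]):
        raise ValueError("fine map must be exactly twice the size of the coarse map")
    return [[c + f for c, f in zip(coarse_row, fine_row)]
            for coarse_row, fine_row in zip(up, fine_scores)]


# Hypothetical scores for one class: a deep, coarse map and a shallower, finer map.
coarse = [[1, 2],
          [3, 4]]
fine = [[0, 1, 0, 0],
        [0, 0, 0, -1],
        [2, 0, 0, 0],
        [0, 0, 1, 0]]

fused = fuse(coarse, fine)
for row in fused:
    print(row)
```

The coarse map contributes the overall "what" at each location, and the fine map adjusts individual pixels, which is how the fused prediction gains spatial detail.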

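Paragraph 2

For this investigation, we train and validate on the PASCAL VOC 2011 segmentation challenge.

For this investigation, training and validation use the PASCAL VOC 2011 segmentation challenge.

We train with a per-pixel multinomial logistic loss and validate with the standard metric of mean pixel intersection over union, with the mean taken over all classes, including background.

Training uses a per-pixel multinomial logistic loss, and validation uses the standard metric of mean pixel intersection over union, averaged over all classes including background.

The training ignores pixels that are masked out (as ambiguous or difficult) in the ground truth.

Training ignores pixels that are masked out in the ground truth (as ambiguous or difficult).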
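
A minimal sketch of the validation metric described above, working on flattened label lists. The value 255 for masked-out pixels follows the PASCAL VOC convention and is an assumption here, as are the function name and the example labels; classes that appear in neither the prediction nor the ground truth are left out of the mean.

```python
def mean_iu(predicted, truth, num_classes, ignore_label=255):
    """Mean intersection over union across classes, skipping masked-out pixels."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(predicted, truth):
        if t == ignore_label:
            continue                 # ambiguous or difficult pixel: not evaluated
        if p == t:
            intersection[t] += 1
            union[t] += 1
        else:
            union[p] += 1            # false positive for the predicted class
            union[t] += 1            # false negative for the true class
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious)


# Six pixels, three classes (0 = background); the fifth pixel is masked out.
predicted = [0, 0, 1, 1, 2, 2]
truth = [0, 1, 1, 1, 255, 2]
print(mean_iu(predicted, truth, num_classes=3))
```

Here the per-class values are 1/2, 2/3 and 1, so one wrongly labelled pixel lowers the score of two classes at once, unlike plain pixel accuracy.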

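7.1 From classifier to dense FCN

Paragraph 1

We begin by convolutionalizing proven classification architectures as in Section 3.

The first step is to convolutionalize proven classification architectures, as described in Section 3 of the paper.

We consider the AlexNet architecture that won ILSVRC12, as well as the VGG nets and the GoogLeNet which did exceptionally well in ILSVRC14.

The candidates are AlexNet, which won ILSVRC12, and the VGG nets and GoogLeNet, which did exceptionally well in ILSVRC14.

We pick the VGG 16-layer net, which we found to be equivalent to the 19-layer net on this task.

The authors pick VGG-16, which they found to perform on par with VGG-19 on this task.

For GoogLeNet, we use only the final loss layer, and improve performance by discarding the final average pooling layer.

For GoogLeNet, only the final loss layer is used, and performance improves when the final average pooling layer is discarded.

We decapitate each net by discarding the final classifier layer, and convert all fully connected layers to convolutions.

Each net loses its head (the final classifier layer), and all fully connected layers are converted to convolutions.

We append a 1x1 convolution with channel dimension 21 to predict scores for each of the PASCAL classes (including background) at each of the coarse output locations, followed by a deconvolution layer to bi-linearly upsample the coarse outputs to pixel-dense outputs as described in Section 3.3.

A 1x1 convolution with 21 channels is appended to predict a score for each PASCAL class (including background) at each coarse output location, followed by a deconvolution layer that bilinearly upsamples the coarse outputs to pixel-dense outputs, as described in Section 3.3 of the paper.

Table 1 compares the preliminary validation results along with the basic characteristics of each net.

Table 1 compares the preliminary validation results together with the basic characteristics of each net.

We report the best results achieved after convergence at a fixed learning rate (at least 175 epochs).

The reported numbers are the best results achieved after convergence at a fixed learning rate (at least 175 epochs).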
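
The deconvolution layer mentioned above upsamples bilinearly. A minimal sketch of a bilinear kernel for such a layer, assuming the widely used construction in which the kernel size is 2 × factor - factor mod 2 and the weights fall off linearly from the centre. The function name and this exact construction are assumptions for illustration, not taken from the paper's code.

```python
def bilinear_kernel(factor):
    """2-D bilinear interpolation weights for upsampling by an integer factor."""
    size = 2 * factor - factor % 2
    # Centre of the kernel: a pixel for odd sizes, between two pixels for even sizes.
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    profile = [1 - abs(i - center) / factor for i in range(size)]
    return [[a * b for b in profile] for a in profile]


kernel_2x = bilinear_kernel(2)
for row in kernel_2x:
    print(row)

# A 32x upsampling layer would need a 64 x 64 kernel under this construction.
print(len(bilinear_kernel(32)))
```

Each input pixel spreads its score over neighbouring output pixels with these weights; because the weights sum to factor squared, overlapping contributions average out to a smooth interpolation rather than a blocky one.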

Paragraph 2

Fine-tuning from classification to segmentation gave reasonable predictions for each net.

Fine-tuning from classification to segmentation gave reasonable predictions for every net.

Even the worst model achieved ~75% of state-of-the-art performance.

Even the worst model reached about 75% of state-of-the-art performance.

The segmentation-equipped VGG net (FCN-VGG16) already appears to be state-of-the-art at 56.0 mean IU on val, compared to 52.6 on test.

The segmentation-equipped VGG net (FCN-VGG16) already appears to be state of the art, with 56.0 mean IU on val against the 52.6 on test that it is compared with.

Training on extra data raises FCN-VGG16 to 59.4 mean IU and FCN-AlexNet to 48.0 mean IU on a subset of val.

Training on extra data raises FCN-VGG16 to 59.4 mean IU and FCN-AlexNet to 48.0 mean IU on a subset of val.

Despite similar classification accuracy, our implementation of GoogLeNet did not match the VGG16 segmentation result.

Although its classification accuracy is similar, the authors' GoogLeNet implementation did not match the segmentation result of VGG16.

7.2 Combining what and where

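Skipped

(1)Architecture

Figure 14 FCN computation

Figure 15 Segmentation Architecture

(2)Model details

Load the pretrained model
Initialize the deconvolution parameters (bilinear interpolation)
The algorithm needs at least 175 epochs before it performs well
The learning rate is adjusted after 100 rounds (how to adjust it varies from person to person; experiment)
Feature maps before pool3 do not need to be fused (they have very little effect)

These tricks are scattered throughout the paper, so it has to be read carefully.

As Figure 16 shows, VGG performs best on PASCAL VOC.

Figure 16 Comparison of different backbones

Figure 17 Comparison of different FCN outputs

Figure 18 Output examples of different FCN variants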

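(3)Training tricks

Hardware: NVIDIA Tesla K40c
Deep learning framework: Caffe
Minibatch: 20
Optimizer: SGD + momentum 0.9
Learning rate: set separately for FCN-AlexNet, FCN-VGG, and FCN-GoogLeNet
Dropout: kept as in the classification models
More training data
Data preprocessing: random mirroring
Class balancing is not required

Figure 19 Segmentation results of different models

Figure 20 FCN results on different datasets

Figure 21 FCN compared with traditional models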

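8.Experimental analysis (results and analysis)

See the original paper

9.Problem discussion

Is class balancing really unnecessary?

Take classification as an example: suppose the training set contains 99 dog images and 5 cat images, and the test set contains 99 dog images and 1 cat image. What test accuracy do you expect the algorithm to reach?

In segmentation, different classes are given different weights.

Conclusion:

Class balance must be ensured; a concrete method is covered in the notes on the LinkNet paper.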
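
The question above can be made concrete with a short calculation. This sketch assumes a degenerate classifier that always predicts the majority class and uses inverse-frequency class weights as one possible weighting scheme; the weighting formula is an assumption chosen for illustration and is not the method from the LinkNet notes.

```python
# Test set from the example: 99 dogs and 1 cat; the model always answers "dog".
test_labels = ["dog"] * 99 + ["cat"] * 1
predictions = ["dog"] * len(test_labels)

accuracy = sum(p == t for p, t in zip(predictions, test_labels)) / len(test_labels)


def recall(label):
    """Fraction of the images of one class that were predicted correctly."""
    relevant = [p for p, t in zip(predictions, test_labels) if t == label]
    return sum(p == label for p in relevant) / len(relevant)


mean_class_recall = (recall("dog") + recall("cat")) / 2

# One way to weight classes: inversely to their frequency in the training set.
train_counts = {"dog": 99, "cat": 5}
total = sum(train_counts.values())
class_weights = {name: total / (len(train_counts) * count)
                 for name, count in train_counts.items()}

print(f"accuracy={accuracy:.2f}, cat recall={recall('cat'):.2f}, "
      f"mean class recall={mean_class_recall:.2f}")
print(class_weights)
```

The accuracy looks excellent even though no cat is ever found, which is why a per-class metric such as mean IU, or a weighted loss, gives a more honest picture on imbalanced data.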

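10.Conclusion

(1)Key points and innovations

Change to classic networks: convolution replaces the fully connected layers
Compensation between earlier and later feature maps: skip connections
Recovery of feature-map size: deconvolution

(2)Points to improve

Size recovery
Class imbalance
Data preprocessing (apply according to the actual task)
Resource utilization (single GPU, multiple GPUs)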
