How Does GCN Replace Small Convolutions with Large Ones? (Paper Walkthrough with Detailed Annotations and Figures)

Large Kernel Matters ——Improve Semantic Segmentation by Global Convolutional Network


Abstract


One of recent trends [31, 32, 14] in network architecture design is stacking small filters (e.g., 1x1 or 3x3) in the entire network because the stacked small filters are more efficient than a large kernel, given the same computational complexity. However, in the field of semantic segmentation, where we need to perform dense per-pixel prediction, we find that the large kernel (and effective receptive field) plays an important role when we have to perform the classification and localization tasks simultaneously. Following our design principle, we propose a Global Convolutional Network to address both the classification and localization issues for semantic segmentation. We also suggest a residual-based boundary refinement to further refine the object boundaries. Our approach achieves state-of-the-art performance on two public benchmarks and significantly outperforms previous results, 82.2% (vs 80.2%) on the PASCAL VOC 2012 dataset and 76.9% (vs 71.8%) on the Cityscapes dataset.


Notes on the Abstract

  1. Background: a recent trend in network architecture design is to stack small filters (e.g., 1x1 or 3x3) throughout the entire network, on the grounds that stacked small filters are more efficient than a large kernel at the same computational complexity (see the parameter-count sketch below).
  2. Contribution: when classification and localization must be performed simultaneously, large kernels play an important role. The authors propose a Global Convolutional Network to address both issues in semantic segmentation, along with a residual-based Boundary Refinement block that further refines object boundaries.
  3. Results: 82.2% mIoU on PASCAL VOC 2012 and 76.9% mIoU on Cityscapes.
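To make the efficiency claim concrete, here is a back-of-the-envelope parameter count comparing one dense k x k kernel, a stack of 3x3 convolutions with the same receptive field, and the separable pair GCN relies on. This is only a sketch; the channel count C and kernel size k below are arbitrary assumptions, not values fixed by the paper.

```python
# Weight counts for a C -> C convolutional layer (biases ignored).
C, k = 256, 15          # assumed channel count and kernel size (k odd)

full_kernel = k * k * C * C          # one dense k x k convolution
separable_pair = 2 * k * C * C       # 1 x k followed by k x 1 (GCN-style)
num_3x3 = (k - 1) // 2               # 3 x 3 stack matching the k x k receptive field
stacked_3x3 = num_3x3 * 3 * 3 * C * C

print(f"dense {k}x{k} conv:  {full_kernel:,}")       # 14,745,600
print(f"GCN separable pair: {separable_pair:,}")     # 1,966,080
print(f"{num_3x3} stacked 3x3:      {stacked_3x3:,}")  # 4,128,768
```

The stacked 3x3 design beats a dense large kernel, which is why classification networks favor it; the separable pair is cheaper still, which is what makes GCN's large effective kernel affordable.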

  1. Introduction


Semantic segmentation can be considered as a per-pixel classification problem. There are two challenges in this task: 1) classification: an object associated with a specific semantic concept should be marked correctly; 2) localization: the classification label for a pixel must be aligned to the appropriate coordinates in the output score map. A well-designed segmentation model should deal with the two issues simultaneously.


However, these two tasks are naturally contradictory. For the classification task, the models are required to be invariant to various transformations like translation and rotation. But for the localization task, models should be transformation-sensitive, i.e., precisely locate every pixel for each semantic category. The conventional semantic segmentation algorithms mainly target the localization issue, as shown in Figure 1 B.


In this paper, we propose an improved net architecture, called Global Convolutional Network (GCN), to deal with the above two challenges simultaneously. We follow two design principles: 1) from the localization view, the model structure should be fully convolutional to retain the localization performance, and no fully-connected or global pooling layers should be used as these layers will discard the localization information; 2) from the classification view, a large kernel size should be adopted in the network architecture to enable dense connections between feature maps and per-pixel classifiers, which enhances the capability to handle different transformations. These two principles lead to our GCN, as in Figure 2 A. The FCN [25]-like structure is employed as our basic framework and our GCN is used to generate semantic score maps. To make global convolution practical, we adopt symmetric, separable large filters to reduce the model parameters and computation cost. To further improve the localization ability near the object boundaries, we introduce a boundary refinement block to model the boundary alignment as a residual structure, shown in Figure 2 C. Unlike CRF-like post-processing [6], our boundary refinement block is integrated into the network and trained end-to-end.


Our contributions are summarized as follows: 1) we propose Global Convolutional Network for semantic segmentation, which explicitly addresses the classification and localization problems simultaneously; 2) a Boundary Refinement block is introduced which can further improve the localization performance near the object boundaries; 3) we achieve state-of-the-art results on two standard benchmarks, with 82.2% on PASCAL VOC 2012 and 76.9% on Cityscapes.


Notes on the Introduction

  1. Semantic segmentation can be viewed as a per-pixel classification task with two challenges, classification and localization; a good segmentation model should handle both simultaneously.
  2. The two aspects are naturally contradictory. For classification, the model must be invariant to transformations such as translation and rotation so it can recognize an object in its various forms; for localization, the model should be transformation-sensitive, precisely locating every pixel of each semantic category. Conventional methods that target localization alone may therefore sacrifice classification accuracy.
  3. From these two aspects, two design principles follow: first, from the localization view, adopt a fully convolutional structure and drop fully-connected and global pooling layers; second, from the classification view, adopt large convolution kernels so that feature maps are densely connected to the per-pixel classifiers, strengthening robustness to different transformations. Moreover, if the kernel is too small, the receptive field is too small to cover large objects, which hurts classification (see the receptive-field sketch below).
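The receptive-field concern in point 3 can be made concrete with the standard recurrence: each layer adds (kernel - 1) * jump to the receptive field, where jump is the product of the strides so far. A minimal sketch (the layer configurations are made-up examples, not a specific network):

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs from input to output.
    Returns the theoretical receptive field of one output unit."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Five stride-1 3x3 convolutions only see an 11x11 window...
print(receptive_field([(3, 1)] * 5))   # 11
# ...whereas a single 15x15 kernel sees 15x15 directly.
print(receptive_field([(15, 1)]))      # 15
```

The paper's point is stronger still: even when the theoretical receptive field is large, the valid receptive field (the region that actually influences the output) is much smaller, which is why explicitly large kernels help.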

  2. Related Work


In this section we quickly review the literature on semantic segmentation. One of the most popular CNN-based works is the Fully Convolutional Network (FCN) [25]. By converting the fully-connected layers into convolutional layers and concatenating the intermediate score maps, FCN has outperformed a lot of traditional methods on semantic segmentation. Following the structure of FCN, there are several works trying to improve the semantic segmentation task based on the following three aspects.


Context Embedding in semantic segmentation is a hot topic. Among the first, Zoom-out [26] proposes handcrafted hierarchical context features, while ParseNet [23] adds a global pooling branch to extract context information. Further, Dilated-Net [37] appends several layers after the score map to embed the multi-scale context, and DeeplabV2 [7] uses the Atrous Spatial Pyramid Pooling, which is a combination of convolutions, to embed the context directly from the feature map.
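As a rough illustration of the ASPP idea (a sketch only, not DeepLab's exact configuration; the dilation rates, channel count, and class count below are assumptions), parallel dilated convolutions over the same feature map can be fused by summation:

```python
import torch
import torch.nn as nn

class ASPPSketch(nn.Module):
    """Parallel 3x3 convolutions with different dilation rates, summed.
    A simplified take on Atrous Spatial Pyramid Pooling."""
    def __init__(self, in_ch, num_classes, rates=(6, 12, 18, 24)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, num_classes, 3, padding=r, dilation=r) for r in rates]
        )

    def forward(self, x):
        return sum(branch(x) for branch in self.branches)

score = ASPPSketch(2048, 21)(torch.randn(1, 2048, 32, 32))
print(score.shape)  # torch.Size([1, 21, 32, 32])
```

Each branch sees the same feature map at a different effective scale, so the summed score map embeds multi-scale context without shrinking the spatial resolution.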


Resolution Enlarging is another research direction in semantic segmentation. Initially, FCN [25] proposes the deconvolution (i.e. inverse of convolution) operation to increase the resolution of the small score map. Further, DeconvNet [27] and SegNet [3] introduce the unpooling operation (i.e. inverse of pooling) and a glass-like network to learn the upsampling process. More recently, LRR [12] argues that upsampling a feature map is better than upsampling a score map. Instead of learning the upsampling process, Deeplab [24] and Dilated-Net [37] propose a special dilated convolution to directly increase the spatial size of small feature maps, resulting in a larger score map.
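The three resolution-enlarging techniques above map onto standard operators. A minimal sketch (shapes and channel counts are arbitrary assumptions):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 16, 16)

# Deconvolution (transposed convolution), as in FCN: a learned upsampling.
deconv = nn.ConvTranspose2d(64, 64, kernel_size=4, stride=2, padding=1)
print(deconv(x).shape)       # torch.Size([1, 64, 32, 32])

# Unpooling, as in DeconvNet/SegNet: inverts max-pooling via stored indices.
pool = nn.MaxPool2d(2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(2, stride=2)
y, idx = pool(x)
print(unpool(y, idx).shape)  # torch.Size([1, 64, 16, 16])

# Dilated convolution, as in DeepLab/Dilated-Net: keeps the feature map large
# (when paired with removed strides) while enlarging the receptive field.
dilated = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)
print(dilated(x).shape)      # torch.Size([1, 64, 16, 16])
```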


Boundary Alignment tries to refine the predictions near the object boundaries. Among the many methods, the Conditional Random Field (CRF) is often employed here because of its good mathematical formulation. Deeplab [6] directly employs denseCRF [18], which is a CRF variant built on a fully-connected graph, as a post-processing method after the CNN. Then CRFAsRNN [38] models the denseCRF into an RNN-style operator and proposes an end-to-end pipeline, yet it involves too much CPU computation on the Permutohedral Lattice [1].


Furthermore, Adelaide [21] deeply incorporates CRF and CNN, where hand-crafted potentials are replaced by convolutions and nonlinearities. Besides, there are also some alternatives to CRF. [4] presents a model similar to CRF, called the Bilateral Solver, yet achieves 10x speed and comparable performance. [16] introduces the bilateral filter to learn the specific pairwise potentials within CNN.


In contrast to previous works, we argue that semantic segmentation is a classification task on a large feature map, and our Global Convolutional Network can simultaneously fulfill the demands of classification and localization.


Notes on Related Work

Context Embedding

  1. Zoom-out proposes handcrafted hierarchical context features.
  2. ParseNet adds a global pooling branch to extract context information.
  3. Dilated-Net appends several layers after the score map to embed multi-scale context.
  4. DeepLab V2 uses ASPP, which combines the outputs of several convolutions to embed context information directly from the feature map.

Resolution Enlarging

  1. FCN uses deconvolution to increase the resolution of the small score map.
  2. DeconvNet and SegNet introduce the unpooling operation and a glass-like (roughly, hourglass-shaped) network to learn the upsampling process.
  3. LRR argues that upsampling the feature map works better than upsampling the score map.
  4. Rather than learning the upsampling process, DeepLab and Dilated-Net use dilated convolutions to directly enlarge the spatial size of small feature maps, yielding larger score maps.

Boundary Alignment

  1. Among the many methods, the Conditional Random Field (CRF) is widely adopted for its sound mathematical formulation.
  2. DeepLab directly uses denseCRF as a post-processing step after the CNN.
  3. CRFasRNN models denseCRF as an RNN-style operator and proposes an end-to-end pipeline, but it incurs heavy CPU computation on the Permutohedral Lattice.
  4. DPN approximates denseCRF and moves the entire pipeline onto the GPU.
  5. Adelaide deeply couples CRF and CNN, replacing the CRF's hand-crafted potential functions with convolutions and nonlinearities.

  3. Approach


In this section, we first propose a novel Global Convolutional Network (GCN) to address the contradictory aspects of classification and localization in semantic segmentation. Then, using GCN, we design a fully-convolutional framework for the semantic segmentation task.


3.1. Global Convolutional Network


The task of semantic segmentation, or pixel-wise classification, requires outputting a score map that assigns each pixel of the input image a semantic label. As mentioned in the Introduction section, this task implies two challenges: classification and localization. However, we find that the requirements of the classification and localization problems are naturally contradictory: (1) For the classification task, models are required to be invariant to transformations of the input: objects may be shifted, rotated or rescaled, but the classification results are expected to be unchanged. (2) For the localization task, in contrast, models should be transformation-sensitive because the localization results depend on the positions of the inputs.


In deep learning, the differences between classification and localization lead to different styles of models. For classification, most modern frameworks such as AlexNet [20], VGG Net [31], GoogleNet [32, 33] or ResNet [14] employ the Cone-shaped networks shown in Figure 1 A: features are extracted from a relatively small hidden layer, which is coarse on spatial dimensions, and classifiers are densely connected to the entire feature map via a fully-connected layer [20, 31] or a global pooling layer [32, 33, 14], which makes features robust to local disturbances and allows classifiers to handle different types of input transformations. For localization, in contrast, we need relatively large feature maps to encode more spatial information. That is why most semantic segmentation frameworks, such as FCN [25, 30], U-Net [28], DeepLab [6, 7], DeconvNet [27], adopt the Barrel-shaped networks shown in Figure 1 B. Techniques such as Deconvolution [25], Unpooling [27, 3] and Dilated-Convolution [6, 37] are used to generate high-resolution feature maps, then classifiers are connected locally to each spatial location on the feature map to generate pixel-wise semantic labels.


We notice that current state-of-the-art semantic segmentation models [25, 6, 27] mainly follow the design principles for localization, which, however, may be suboptimal for classification. As classifiers are connected locally rather than globally to the feature map, it is difficult for classifiers to handle different variations of transformations on the input. For example, consider the situations in Figure 3: a classifier is aligned to the center of an input object, so it is expected to give the semantic label for the object. At first, the valid receptive field (VRF) is large enough to hold the entire object. However, if the input object is resized to a large scale, then the VRF can only cover a part of the object, which may be harmful for classification. It will be even worse if larger feature maps are used, because the gap between classification and localization becomes larger.


Based on the above observation, we try to design a new architecture to overcome the drawbacks. First, from the localization view, the structure must be fully-convolutional without any of the fully-connected layers or global pooling layers used by many classification networks, since the latter will discard localization information. Second, from the classification view, motivated by the densely-connected structure of classification models, the kernel size of the convolutional structure should be as large as possible. Specially, if the kernel size increases to the spatial size of the feature map (named global convolution), the network will share the same benefit with pure classification models. Based on these two principles, we propose a novel Global Convolutional Network (GCN) in Figure 2 B. Instead of directly using a larger kernel or global convolution, our GCN module employs a combination of 1×k + k×1 and k×1 + 1×k convolutions, which enables dense connections within a large k×k region in the feature map. Different from the separable kernels used by [33], we do not use any nonlinearity after the convolution layers. Compared with the trivial k×k convolution, our GCN structure involves only O(2/k) of the computation cost and number of parameters, which is more practical for large kernel sizes.
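Following this description, here is a minimal PyTorch sketch of the GCN module: two separable branches (1×k then k×1, and k×1 then 1×k) summed, with no nonlinearity in between. The channel counts in the usage line are placeholders, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class GCNModule(nn.Module):
    """Global Convolutional Network block (Figure 2 B): approximates a dense
    k x k connection at O(2/k) of its parameter and computation cost."""
    def __init__(self, in_ch, out_ch, k=15):
        super().__init__()
        p = k // 2  # 'same' padding for odd k
        self.branch_a = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, (1, k), padding=(0, p)),
            nn.Conv2d(out_ch, out_ch, (k, 1), padding=(p, 0)),
        )
        self.branch_b = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, (k, 1), padding=(p, 0)),
            nn.Conv2d(out_ch, out_ch, (1, k), padding=(0, p)),
        )

    def forward(self, x):
        # Per the paper, no nonlinearity follows the convolutions.
        return self.branch_a(x) + self.branch_b(x)

score = GCNModule(2048, 21, k=15)(torch.randn(1, 2048, 16, 16))
print(score.shape)  # torch.Size([1, 21, 16, 16])
```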


3.2. Overall Framework


Our overall segmentation model is shown in Figure 2. We use pretrained ResNet [14] as the feature network and FCN4 [25, 36] as the segmentation framework. Multi-scale feature maps are extracted from different stages in the feature network. Global Convolutional Network structures are used to generate multi-scale semantic score maps for each class. Similar to [25, 36], score maps of lower resolution will be upsampled with a deconvolution layer, then added up with higher ones to generate new score maps. The final semantic score map will be generated after the last upsampling, which is used to output the prediction results.


In addition, we propose a Boundary Refinement (BR) block, shown in Figure 2 C. Here, we model the boundary alignment as a residual structure. More specifically, we define S̃ as the refined score map: S̃ = S + R(S), where S is the coarse score map and R(·) is the residual branch. The details can be referred to in Figure 2.
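A minimal sketch of the BR block: following Figure 2 C, the residual branch R is two 3x3 convolutions with a ReLU in between, operating on a score map with one channel per class.

```python
import torch
import torch.nn as nn

class BoundaryRefinement(nn.Module):
    """Boundary Refinement block (Figure 2 C): S_refined = S + R(S)."""
    def __init__(self, num_classes):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Conv2d(num_classes, num_classes, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(num_classes, num_classes, 3, padding=1),
        )

    def forward(self, s):
        return s + self.residual(s)
```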


Notes on the Approach

  1. The backbone uses pretrained ResNet as the feature extractor and FCN as the segmentation framework.
  2. Feature maps from different ResNet stages are used, so the architecture is multi-scale.
  3. GCN modules produce low-resolution score maps, which are upsampled and added to higher-resolution score maps to form new score maps; after the final upsampling, the prediction is output (see the framework sketch below).
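Putting the pieces together, here is a schematic sketch of the pipeline described above, reusing the GCNModule and BoundaryRefinement sketches from earlier. It is simplified relative to Figure 2 A: the paper inserts additional BR blocks after every fusion and uses separate deconvolution layers per stage, whereas this sketch shares one; the stage channel counts are ResNet-50/101 defaults, assumed for illustration.

```python
import torch
import torch.nn as nn

class GCNSegmentationSketch(nn.Module):
    """FCN-style fusion of multi-scale GCN score maps (simplified Figure 2 A)."""
    def __init__(self, num_classes, stage_channels=(256, 512, 1024, 2048), k=15):
        super().__init__()
        self.gcns = nn.ModuleList([GCNModule(c, num_classes, k) for c in stage_channels])
        self.brs = nn.ModuleList([BoundaryRefinement(num_classes) for _ in stage_channels])
        # One shared 2x learned upsampling (the paper uses separate deconvolutions).
        self.up = nn.ConvTranspose2d(num_classes, num_classes, 4, stride=2, padding=1)

    def forward(self, feats):
        # feats: ResNet stage outputs res2..res5, highest resolution first.
        scores = [br(gcn(f)) for gcn, br, f in zip(self.gcns, self.brs, feats)]
        s = scores[-1]                    # start from the coarsest score map
        for finer in reversed(scores[:-1]):
            s = self.up(s) + finer        # upsample, then fuse with the finer map
        return self.up(self.up(s))        # res2 is at 1/4 scale: two more 2x steps

model = GCNSegmentationSketch(num_classes=21)
feats = [torch.randn(1, c, 256 // s, 256 // s)
         for c, s in zip((256, 512, 1024, 2048), (4, 8, 16, 32))]
print(model(feats).shape)  # torch.Size([1, 21, 256, 256])
```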