Image Segmentation Survey: Translation Notes (To Be Continued)

苏锌雨

Original paper: arxiv.org/pdf/1907.06119

Abstract

The machine learning community has been overwhelmed by a plethora of deep learning based approaches.
Many challenging computer vision tasks such as detection, localization, recognition and segmentation of objects in unconstrained environments are being efficiently addressed by various types of deep neural networks such as convolutional neural networks, recurrent networks, adversarial networks, autoencoders and so on.
While there have been plenty of analytical studies regarding the object detection or recognition domain, many new deep learning techniques have surfaced with respect to image segmentation.
This paper approaches these various deep learning techniques of image segmentation from an analytical perspective.
The main goal of this work is to provide an intuitive understanding of the major techniques that have made significant contributions to the image segmentation domain.
Starting from some of the traditional image segmentation approaches, the paper progresses to describe the effect deep learning has had on the image segmentation domain.
Thereafter, most of the major segmentation algorithms are logically categorized, with paragraphs dedicated to their unique contributions.
With an ample amount of intuitive explanations, the reader is expected to have an improved ability to visualize the internal dynamics of these processes.

1 Introduction

paragraph 1

Image segmentation can be defined as a specific image processing technique that is used to divide an image into two or more meaningful regions.
Image segmentation can also be seen as a process of defining boundaries between separate semantic entities in an image.
From a more technical perspective, image segmentation is a process of assigning a label to each pixel in the image such that pixels with the same label are connected with respect to some visual or semantic property (Fig. 1).

paragraph 2

Image segmentation subsumes a large class of closely related problems in computer vision. The most classic version is semantic segmentation [66].
In semantic segmentation, each pixel is classified into one of a predefined set of classes such that pixels belonging to the same class belong to a unique semantic entity in the image.
It is also worth noting that the semantics in question depend not only on the data but also on the problem that needs to be addressed.
For example, for a pedestrian detection system, the whole body of a person should belong to the same segment, whereas for an action recognition system, it might be necessary to segment different body parts into different classes.
Other forms of image segmentation can focus on the most important object in a scene. A particular class of problem called saliency detection [19] is born from this.
Other variants in this domain include foreground-background separation problems. In many systems, such as image retrieval or visual question answering, it is often necessary to count the number of objects. Instance-specific segmentation addresses that issue. It is often coupled with object detection systems to detect and segment multiple instances of the same object [43] in a scene.
Segmentation in the temporal space is also a challenging domain and has various applications. In object tracking scenarios, pixel level classification is performed not only in the spatial domain but also across time.
Other applications in traffic analysis or surveillance need to perform motion segmentation to analyze the paths of moving objects. In the field of segmentation with lower semantic level, over-segmentation is also a common approach, where images are divided into extremely small regions to ensure boundary adherence, at the cost of creating a lot of spurious edges.
Over-segmentation algorithms are often combined with region merging techniques to perform image segmentation. Even simple color or texture segmentation finds its use in various scenarios. Another important distinction between segmentation algorithms is the need for interaction from the user. While it is desirable to have fully automated systems, a little interaction from the user can improve the quality of segmentation to a large extent. This is especially applicable when dealing with complex scenes or when we do not possess an ample amount of data to train the system.

paragraph 3

Segmentation algorithms have several applications in the real world. In medical image processing [123], we need to localize various abnormalities like aneurysms [48], tumors [145], cancerous elements as in melanoma detection [189], or specific organs during surgeries [206].
Another domain where segmentation is important is surveillance. Many problems such as pedestrian detection [113] and traffic surveillance [60] require the segmentation of specific objects, e.g. persons or cars.
Other domains include satellite imagery [11, 17], guidance systems in defense [119], and forensics such as face [5], iris [51] and fingerprint [144] recognition.
Generally, traditional methods such as histogram thresholding [195], hybridization [193, 87], feature space clustering [40], region-based approaches [59], edge detection approaches [184], fuzzy approaches [39], entropy-based approaches [47], neural networks (Hopfield neural network [35], self-organizing maps [27]), physics-based approaches [158] etc. are popularly used for this purpose.
However, such feature-based approaches have a common bottleneck: they are dependent on the quality of the features extracted by the domain experts. Generally, humans are bound to miss latent or abstract features for image segmentation.
On the other hand, deep learning in general addresses this issue through automated feature learning. In this regard, one of the most common techniques in computer vision was introduced under the name of convolutional neural networks [110], which learn a cascaded set of convolutional kernels through backpropagation [182]. Since then, it has been improved significantly with features like layer-wise training [13], rectified linear activations [153], batch normalization [84], auxiliary classifiers [52], atrous convolutions [211], skip connections [78], better optimization techniques [97] and so on.
With all these came a large number of new types of image segmentation techniques as well. Various such techniques drew inspiration from popular networks such as AlexNet [104], convolutional autoencoders [141], recurrent neural networks [143], residual networks [78] and so on.

2 Motivation

There have been many reviews and surveys regarding the traditional technologies associated with image segmentation [61, 160]. Some of them specialized in application areas [107, 123, 185], while others focused on specific types of algorithms [20, 19, 59].
With the arrival of deep learning techniques, many new classes of image segmentation algorithms have surfaced. Earlier studies [219] have shown the potential of deep learning based approaches. There have been more recent studies [68] which cover a number of methods and compare them on the basis of their reported performance.
The work of Garcia et al. [66] lists a variety of deep learning based segmentation techniques. They have tabulated the performance of various state-of-the-art networks on several modern challenges. These resources are incredibly useful for understanding the current state of the art in this domain. While knowing the available methods is quite useful for developing products, contributing to this domain as a researcher requires a confident understanding of the underlying mechanics of the methods.
In the present work, our main motivation is to answer the question of why the methods are designed the way they are. Understanding the mechanics of modern techniques would make it easier to tackle new challenges and develop better algorithms.
Our approach carefully analyses each method to understand why it succeeds at what it does and also why it fails for certain problems. Being aware of the pros and cons of such methods, new designs can be initiated that reap the benefits of the pros and overcome the cons.
We recommend the work of Alberto Garcia-Garcia [66] to get an overview of some of the best image segmentation techniques using deep learning, while our focus is to understand why, when and how these techniques perform on various challenges.

fig. 2

2.1 Contribution

The paper has been designed so that new researchers reap the most benefit. Initially, some of the traditional techniques are discussed to present the frameworks used before the deep learning era. Gradually, the various factors governing the onset of deep learning are discussed so that readers have a good idea of the current direction in which machine learning is progressing.
In the subsequent sections the major deep learning algorithms are briefly described in a generic way to establish a clearer concept of the procedures in the mind of the readers. The image segmentation algorithms discussed thereafter are categorized into the major families of algorithms that have dominated this domain over the last few years.
The concepts behind all the major approaches are explained in very simple language with a minimum of complicated mathematics. Almost all the diagrams corresponding to major networks have been drawn using a common representational format as shown in fig. 2.
The various approaches that are discussed come with different representations for their architectures. The unified representation scheme allows the reader to understand the fundamental similarities and differences between networks. Finally, the major application areas are discussed to help new researchers pursue a field of their choice.

3 Impact of Deep Learning on Image Segmentation

The development of deep learning algorithms like convolutional neural networks or deep autoencoders has not only affected typical tasks like object classification but has also proved efficient in other related tasks like object detection, localization, tracking, or, as in this case, image segmentation.

3.1 Effectiveness of convolutions for segmentation

As an operation, convolution can be simply defined as a function that performs a sum of products between kernel weights and input values while sliding a smaller kernel over a larger image.
For a typical image with k channels, we can convolve a smaller kernel with k channels along the x and y directions to obtain an output in the form of a 2-dimensional matrix. It has been observed that after training a typical CNN, the convolutional kernels tend to generate activation maps with respect to certain features of the objects [214].
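The sum-of-products definition above can be sketched in a few lines of dependency-free Python. This is only an illustration: the function name `conv2d_multichannel`, the toy image and the all-ones kernel are assumptions made here, and, as in most CNN libraries, the kernel is not flipped (strictly speaking this is cross-correlation).

```python
def conv2d_multichannel(image, kernel):
    """Valid (no padding, stride 1) convolution as used in CNNs.

    image:  list of k channels, each an H x W list of lists
    kernel: list of k channels, each an h x w list of lists
    Returns one (H - h + 1) x (W - w + 1) activation map.
    """
    k = len(image)
    assert k == len(kernel), "kernel needs as many channels as the image"
    H, W = len(image[0]), len(image[0][0])
    h, w = len(kernel[0]), len(kernel[0][0])
    out = []
    for y in range(H - h + 1):
        row = []
        for x in range(W - w + 1):
            # Sum of products over every channel and every kernel position.
            acc = 0.0
            for c in range(k):
                for dy in range(h):
                    for dx in range(w):
                        acc += image[c][y + dy][x + dx] * kernel[c][dy][dx]
            row.append(acc)
        out.append(row)
    return out


# Hypothetical 2-channel 3x3 image and a 2-channel 2x2 kernel of ones.
image = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 0, 1], [0, 1, 0], [1, 0, 1]],
]
kernel = [[[1, 1], [1, 1]], [[1, 1], [1, 1]]]
activation = conv2d_multichannel(image, kernel)
print(activation)
```

Note that the k input channels collapse into a single 2-dimensional map; a convolutional layer obtains many maps by applying many such kernels.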
Given the nature of these activations, they can be seen as segmentation masks of object-specific features. Hence the key to generating requirement-specific segmentation is already embedded within these output activation matrices.
Most image segmentation algorithms use this property of CNNs to generate the segmentation masks required to solve the problem. As shown in fig. 3, the earlier layers capture local features like the contour or a small part of an object.
In the later layers more global features are activated, such as field, people or sky. It can also be noted from this figure that the earlier layers show sharper activations than the later ones.

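fig. 3 Input image and sample activation maps from a typical CNN. Top row: the input image and two activation maps from an earlier layer, showing features such as object parts (a T-shirt) and contours. Bottom row: more meaningful activations from later layers, such as field, people and sky.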

3.2 Impact of larger and more complex datasets

The second impact that deep learning brought to the world of image segmentation is the plethora of datasets, challenges and competitions. These factors encouraged researchers across the world to come up with various state-of-the-art technologies to implement segmentation across various domains. A list of many such datasets is provided in table 1.

4 Image Segmentation using Deep Learning

As explained before, convolutions are quite effective in generating semantic activation maps whose components inherently constitute various semantic segments. Various methods have been implemented to make use of these internal activations to segment the images. A summary of the major deep learning based segmentation algorithms is provided in table 2, along with a brief description of their major contributions.

4.1 Convolutional Neural Networks


Convolutional neural networks, being one of the most commonly used methods in computer vision, have adopted many simple modifications to perform well in segmentation tasks as well.

4.1.1 Fully convolutional layers


Classification tasks generally require a linear output in the form of a probability distribution over the number of classes. To convert volumes of 2-dimensional activation maps into linear layers, they were often flattened.
The flattened shape allowed the execution of fully connected networks to obtain the probability distribution. However, this kind of reshaping loses the spatial relations among the pixels in the image. In a fully convolutional network (FCN) [130], the output of the last convolutional block is directly used for a pixel level classification. FCNs were first implemented on the PASCAL VOC 2011 segmentation dataset [54] and achieved a pixel accuracy of 90.3% and a mean IoU of 62.7%.
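The two metrics quoted for FCN, pixel accuracy and mean IoU (intersection over union), can be computed from a predicted label map and a ground-truth label map. The sketch below uses common definitions of both; the function names and the tiny 3×3 label maps are made up for illustration and are not taken from the paper or from any benchmark toolkit.

```python
def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted label equals the ground truth."""
    total = correct = 0
    for prow, trow in zip(pred, target):
        for p, t in zip(prow, trow):
            total += 1
            correct += (p == t)
    return correct / total


def mean_iou(pred, target, num_classes):
    """Average over classes of |pred ∩ target| / |pred ∪ target|."""
    ious = []
    for c in range(num_classes):
        inter = union = 0
        for prow, trow in zip(pred, target):
            for p, t in zip(prow, trow):
                inter += (p == c and t == c)
                union += (p == c or t == c)
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious)


# Hypothetical 3x3 label maps with three classes (0, 1, 2).
pred = [[0, 0, 1], [0, 1, 1], [2, 2, 1]]
target = [[0, 0, 1], [0, 0, 1], [2, 2, 2]]
print(pixel_accuracy(pred, target), mean_iou(pred, target, 3))
```

Mean IoU penalizes errors on small classes more than pixel accuracy does, which is one reason a model can report a high pixel accuracy alongside a noticeably lower mean IoU.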
Another way to avoid fully connected linear layers is to use a full-size average pooling to convert a set of 2-dimensional activation maps into a set of scalar values. As these pooled scalars are connected to the output layer, the weights corresponding to each class may be used to perform a weighted summation of the corresponding activation maps in the previous layer. This process, called Global Average Pooling (GAP) [121], can be directly used on various trained networks, such as residual networks, to find object-specific activation zones that can be used for pixel level segmentation.
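The weighted summation described above can be sketched as follows. This is a minimal illustration under assumptions made here: two hypothetical 2×2 activation maps, one made-up weight vector for a single class, and no bias term. It shows that the class score obtained from the pooled scalars equals the spatial mean of the weighted map, which is why that map localizes the evidence for the class.

```python
def global_average_pool(feature_maps):
    """Reduce each 2-D activation map to a single scalar (its mean)."""
    return [sum(map(sum, fmap)) / (len(fmap) * len(fmap[0]))
            for fmap in feature_maps]


def class_activation_map(feature_maps, class_weights):
    """Weighted sum of activation maps using one class's output weights."""
    H, W = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * W for _ in range(H)]
    for fmap, wgt in zip(feature_maps, class_weights):
        for y in range(H):
            for x in range(W):
                cam[y][x] += wgt * fmap[y][x]
    return cam


# Hypothetical activations from the last convolutional block.
feature_maps = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
]
weights_for_class = [0.5, -1.0]  # made-up output-layer weights for one class

pooled = global_average_pool(feature_maps)
score = sum(w * p for w, p in zip(weights_for_class, pooled))
cam = class_activation_map(feature_maps, weights_for_class)
cam_mean = sum(map(sum, cam)) / 4
print(pooled, score, cam)
```

Thresholding such a map (after up-sampling it to the input size) gives a coarse object mask without any pixel-level supervision.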
The major issue with algorithms such as this is the loss of sharpness due to the intermediate sub-sampling operations. Sub-sampling is a common operation in convolutional neural networks to increase the sensory area of kernels.
What this means is that as the activation maps shrink in the subsequent layers, the kernels convolving over them actually correspond to a larger area in the original image. However, this reduces the image size in the process, and when the result is up-sampled to the original size it loses sharpness. Many approaches have been implemented to handle this issue.
For fully convolutional models, skip connections from preceding layers can be used to obtain sharper versions of the activations, from which finer segments can be chalked out (refer to fig. 4). Another work showed how using high dimensional kernels to capture global information with FCN models created better segmentation masks [165].

fig. 4

Segmentation algorithms can also be treated as boundary detection techniques. Convolutional features are also very useful from that perspective [139]. While earlier layers can provide fine details, later layers focus more on the coarser boundaries.

DeepMask and SharpMask

DeepMask [166] was the name given to a project at Facebook AI Research (FAIR) related to image segmentation. It exhibited the same school of thought as FCN models, except that the model was capable of multi-tasking (refer to fig. 5).

fig. 5

It had two main branches coming out of a shared feature representation. One of them created a pixel level classification, or a probabilistic mask, for the central object, and the second branch generated a score corresponding to the object recognition accuracy.
The network was coupled with sliding windows with a stride of 16 to create segments of objects at various locations of the image, whereas the score helped in identifying which of the segments were good.
The network was further upgraded in SharpMask [167], where probabilistic masks from each layer were combined in a top-down fashion, using convolutional refinements at every step, to generate high resolution masks (refer to fig. 6). SharpMask scored an average recall of 39.3 on the MS COCO Segmentation Dataset, beating DeepMask, which scored 36.6.

fig. 6

4.1.2 Region proposal networks


Another similar wing that started developing alongside image segmentation was object localization. Tasks such as this involve locating specific objects in images. The expected output for such problems is normally a set of bounding boxes corresponding to the queried objects. Strictly speaking, some of these algorithms do not address image segmentation problems, but their approaches are relevant to this domain.

RCNN (Region-based Convolutional Neural Networks)

The introduction of CNNs raised many new questions in the domain of computer vision. One of the primary ones was whether a network like AlexNet can be extended to detect the presence of more than one object. Region-based CNN [70], more commonly known as R-CNN, used the selective search technique to propose probable object regions and performed classification on the cropped windows to verify sensible localization based on the output probability distribution.
The selective search technique [198, 200] analyses various aspects like texture, color, or intensities to cluster the pixels into objects. The bounding boxes corresponding to these segments are passed through classifying networks to short-list some of the most sensible boxes. Finally, with a simple linear regression network, tighter coordinates can be obtained.
The main downside of the technique is its computational cost. The network needs to compute a forward pass for every bounding box proposal. The problem with sharing computation across all boxes was that the boxes were of different sizes and hence uniformly sized features were not achievable. In the upgraded Fast R-CNN [69], ROI (Region of Interest) Pooling was proposed, in which regions of interest were dynamically pooled to obtain a fixed size feature output.
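To make the idea of pooling differently sized regions to a fixed size concrete, here is a simplified sketch of ROI max pooling on a single-channel feature map. The bin-splitting rule (floor for the start of a bin, ceiling for its end), the function names and the toy feature map are assumptions for illustration; real implementations differ in how they quantize box coordinates.

```python
def ceil_div(a, b):
    """Integer ceiling division for positive b."""
    return -(-a // b)


def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool a region of any size into a fixed out_h x out_w grid.

    roi = (y0, x0, y1, x1) in feature-map cells, end-exclusive.
    """
    y0, x0, y1, x1 = roi
    h, w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        # Bin edges adapt to the region size, so every bin is non-empty.
        ys, ye = y0 + (i * h) // out_h, y0 + ceil_div((i + 1) * h, out_h)
        row = []
        for j in range(out_w):
            xs, xe = x0 + (j * w) // out_w, x0 + ceil_div((j + 1) * w, out_w)
            row.append(max(feature_map[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled


# Hypothetical 4x6 feature map; both regions end up as 2x2 outputs.
feature_map = [[r * 6 + c for c in range(6)] for r in range(4)]
whole = roi_max_pool(feature_map, (0, 0, 4, 6), 2, 2)
small = roi_max_pool(feature_map, (1, 1, 4, 4), 2, 2)
print(whole, small)
```

Because every proposal yields the same output shape, the convolutional features can be computed once per image and shared by all boxes, which is the saving Fast R-CNN relies on.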
From then on, the network was mainly bottlenecked by the selective search technique used for candidate region proposal. In Faster R-CNN [175], instead of depending on external features, the intermediate activation maps were used to propose bounding boxes, thus speeding up the feature extraction process. Bounding boxes represent the location of the object, but they do not provide pixel-level segments.
The Faster R-CNN network was extended as Mask R-CNN [76] with a parallel branch that performed pixel level, object-specific binary classification to provide accurate segments. With Mask R-CNN, an average precision of 35.7 was attained on the COCO [122] test images. The family of R-CNN algorithms is depicted in fig. 7.

fig. 7

Region proposal networks have often been combined with other networks [118, 44] to give instance level segmentations. R-CNN was further improved under the name of HyperNet [99] by using features from multiple layers of the feature extractor. Region proposal networks have also been implemented for instance-specific segmentation. As mentioned before, the object detection capabilities of approaches like R-CNN are often coupled with segmentation models to generate different masks for different instances of the same object [43].

4.1.3 DeepLab

While pixel level segmentation was effective, two complementary issues were still affecting performance. Firstly, smaller kernel sizes failed to capture contextual information. In classification problems, this is handled using pooling layers that increase the sensory area of the kernels with respect to the original image. But in segmentation, that reduces the sharpness of the segmented output. The alternative of using larger kernels tends to be slower due to the significantly larger number of trainable parameters.
To handle this issue, the DeepLab [30, 32] family of algorithms demonstrated the use of various methodologies like atrous convolutions [211], spatial pooling pyramids [77] and fully connected conditional random fields [100] to perform image segmentation with great efficiency. The DeepLab algorithm was able to attain a mean IoU of 79.7 on the PASCAL VOC 2012 dataset [54].

Atrous/Dilated Convolution

The size of the convolution kernels in any layer determines the sensory response area of the network. While smaller kernels extract local information, larger kernels try to focus on more contextual information. However, larger kernels normally come with a larger number of parameters.
For example, to have a sensory region of 6 × 6, one must have 36 neurons. To reduce the number of parameters in the CNN, the sensory area is increased in higher layers through techniques like pooling. Pooling layers reduce the size of the image. When an image is pooled by a 2 × 2 kernel with a stride of two, each side is halved, so the image shrinks to 25% of its original area. A kernel with an area of 3 × 3 then corresponds to a larger sensory area of 6 × 6 in the original image.
However, unlike before, now only 18 neurons (9 for each layer) are needed in the convolution kernels. In the case of segmentation, pooling creates new problems. The reduction in image size results in a loss of sharpness in the generated segments as the reduced maps are scaled up to image size.
To deal with these two issues simultaneously, dilated or atrous convolutions play a key role. Atrous/dilated convolutions increase the field of view without increasing the number of parameters. As shown in fig. 8, a 3×3 kernel with a dilation factor of 1 (one skipped pixel between adjacent kernel taps) can act upon an area of 5×5 in the image.

fig. 8 Normal convolution (red) versus atrous/dilated convolution (green)

Each row and column of the kernel has three neurons, which are multiplied with intensity values in the image that are separated by the dilation factor of 1. In this way the kernels can span larger areas while keeping the number of neurons low and also preserving the sharpness of the image. Besides the DeepLab algorithms, atrous convolutions [34] have also been used with autoencoder based architectures.
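The following sketch illustrates how a dilated kernel covers a larger area with the same nine weights. It follows the convention used in the text, where the dilation factor counts the pixels skipped between adjacent kernel taps (so a factor of 1 here corresponds to what many deep learning libraries call a dilation rate of 2). The function names, the parameter name `gap` and the toy image are illustrative assumptions.

```python
def effective_kernel_size(k, gap):
    """Side length spanned by a k x k kernel with `gap` skipped pixels
    between adjacent taps (gap = 0 is an ordinary convolution)."""
    return k + (k - 1) * gap


def dilated_conv2d(image, kernel, gap):
    """Single-channel valid convolution with dilated kernel taps."""
    step = gap + 1
    k = len(kernel)
    span = effective_kernel_size(k, gap)
    H, W = len(image), len(image[0])
    out = []
    for y in range(H - span + 1):
        row = []
        for x in range(W - span + 1):
            acc = 0
            for i in range(k):
                for j in range(k):
                    # Taps are spread out; the weight count stays k * k.
                    acc += kernel[i][j] * image[y + i * step][x + j * step]
            row.append(acc)
        out.append(row)
    return out


# Hypothetical 5x5 image and a 3x3 kernel of ones (9 weights either way).
image = [[r * 5 + c for c in range(5)] for r in range(5)]
kernel = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
plain = dilated_conv2d(image, kernel, gap=0)    # each output sees 3x3 pixels
dilated = dilated_conv2d(image, kernel, gap=1)  # each output sees 5x5 pixels
print(effective_kernel_size(3, 1), dilated)
```

With `gap=1` the kernel reads rows and columns 0, 2 and 4 of the 5×5 patch, so the field of view grows from 3×3 to 5×5 without pooling and without extra parameters.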

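fig. 9 DeepLab network architecture compared with a standard VGG network (top), shown with cascaded atrous convolutions (middle) and atrous spatial pyramid pooling (bottom)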

Spatial Pyramid Pooling

Spatial pyramid pooling [77] was introduced in R-CNN, where ROI pooling showed the benefit of using multi-scale regions for object localization. However, in DeepLab, atrous convolutions were preferred over pooling layers for changing the field of view or sensory area. To imitate the effect of ROI pooling, multiple branches with atrous convolutions of different dilations were combined to utilize multi-scale properties for image segmentation.

Fully connected conditional random field

A conditional random field is an undirected discriminative probabilistic graphical model that is often used for various sequence learning problems. Unlike discrete classifiers, while classifying a sample it takes into account the labels of other neighboring samples.
Image segmentation can be treated as a sequence of pixel classifications. The label of a pixel depends not only on its own intensity values but also on the values of neighboring pixels. Such probabilistic graphical models are often used in the field of image segmentation and hence deserve a dedicated section (section 4.1.4).

4.1.4 Using inter pixel correlation to improve CNN based segmentation


The use of probabilistic graphical models such as Markov random fields (MRF) or conditional random fields (CRF) for image segmentation thrived on its own even without the inclusion of CNN based feature extractors. The CRF or MRF is mainly characterized by an energy function with a unary and a pairwise component.

$$E(x)=\sum_i{\theta_i(x_i)}+\sum_{ij}{\theta_{ij}(x_i,x_j)}\tag{1}$$

While non-deep learning approaches focused on building efficient pairwise potentials, such as exploiting long-range dependencies, designing higher-order potentials and exploring contexts of semantic labels, deep learning based approaches focused on generating strong unary potentials and using simple pairwise components to boost the performance.
CRFs have usually been coupled with deep learning based methods in two ways. One is as a separate post-processing module and the other is as a trainable module in an end-to-end network like deep parsing networks [128] or spatial propagation networks [126].
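As a small worked example of the energy with a unary and a pairwise component, the sketch below scores labelings of a four-pixel chain. Everything concrete in it is an assumption for illustration: the unary costs, the edge list, and the simple Potts penalty used as the pairwise term.

```python
def crf_energy(labels, unary, pairwise, edges):
    """E(x) = sum_i theta_i(x_i) + sum_(i,j) theta_ij(x_i, x_j)."""
    energy = sum(unary[i][labels[i]] for i in range(len(labels)))
    energy += sum(pairwise(labels[i], labels[j]) for i, j in edges)
    return energy


def potts(xi, xj, penalty=1.0):
    """Simple pairwise term: pay a fixed penalty when two labels differ."""
    return penalty if xi != xj else 0.0


# Hypothetical unary costs: unary[i][label], lower means more likely.
unary = [[0.1, 2.0], [0.2, 1.5], [1.0, 0.9], [2.0, 0.1]]
edges = [(0, 1), (1, 2), (2, 3)]  # a 4-pixel chain

smooth = crf_energy([0, 0, 1, 1], unary, potts, edges)
noisy = crf_energy([0, 1, 0, 1], unary, potts, edges)
print(smooth, noisy)
```

The labeling with a single label change has lower energy than the alternating one, which is the sense in which the pairwise term rewards spatially coherent segments. In the deep learning setting the unary costs would come from the network's per-pixel predictions.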

Using CRFs to improve Fully convolutional networks

One of the earliest implementations that kick-started this paradigm of boundary refinement was the work of [101]. With the introduction of fully convolutional networks for image segmentation, it became quite possible to draw coarse segments for objects in images.
However, getting sharper segments was still a problem. In the work of [29], the output pixel level prediction was used as a unary potential for a fully connected CRF. For each pair of pixels i and j in the image, the pairwise potential was defined as

$$\theta_{ij}(x_i,x_j)=\mu(x_i,x_j)\left[\omega_1\exp\left(-\frac{\|p_i-p_j\|^2}{2\sigma_{\alpha}^2}-\frac{\|I_i-I_j\|^2}{2\sigma_{\beta}^2}\right)+\omega_2\exp\left(-\frac{\|p_i-p_j\|^2}{2\sigma_{\gamma}^2}\right)\right]\tag{2}$$

Here, $\mu(x_i,x_j)=1$ if $x_i\neq x_j,0$ otherwise and $\omega_1$ , $\omega_2$ are the weights given to the kernels. The expression uses two gaussian kernels. The first one is a bilateral kernel that depends on both pixel positions $p_i, p_j)$ and their corresponding intensities in the RGB channels. The second kernel is only dependent on the the pixel positions. $\sigma_{\alpha}$ , $\sigma_{\beta}$ and $\sigma_{\gamma}$ controls the scale of the Gaussian kernels.
此处，如果 $x_i\neq x_j,\mu(x_i,x_j)=1$ ，否则 $\mu(x_i,x_j)=0$ 。 $\omega_1$ , $\omega_2$ 为内核的权重。这个表达式使用了两个高斯核。第一个是一个双边内核，其依赖于两个像素的位置 $p_i, p_j)$ 以及他们在RGB通道对应的强度。另一个高斯核只依赖于像素的位置。 $\sigma_{\alpha}$ , $\sigma_{\beta}$ 和 $\sigma_{\gamma}$ 控制高斯核的比例
The intuition behind the design of such a pairwise potential energy function is to ensure that nearby pixels of similar intensities in the RGB channels are classified under the same class. This model was later included in the popular network DeepLab (refer to section 4.1.3). In the various versions of the DeepLab algorithm, the use of a CRF boosted the mean IoU on the Pascal 2012 dataset by a significant amount (up to 4% in some cases).
设计这样一个成对势能函数背后的直觉是为了确保相邻且强度相似的像素能够被分到同一个种类中。这个模型后来也被包含到了流行的网络（DeepLab）中。在不同的DeepLab算法中，CRF的使用能够大量地提升Pascal2012数据集的平均IOU（同个类别中达到了4%）
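
As a minimal sketch of Eq. (2), the pairwise potential for a single pair of pixels can be evaluated directly. The function name `pairwise_potential` and all default weights and bandwidths below are illustrative assumptions, not values taken from [29].

```python
import math


def pairwise_potential(label_i, label_j, pos_i, pos_j, rgb_i, rgb_j,
                       w1=1.0, w2=1.0,
                       sigma_alpha=60.0, sigma_beta=10.0, sigma_gamma=3.0):
    """Evaluate Eq. (2) for one pixel pair (defaults are hypothetical)."""
    # mu(x_i, x_j) is 0 when both pixels carry the same label.
    if label_i == label_j:
        return 0.0
    # Squared Euclidean distances in position and in RGB intensity.
    d_pos = sum((a - b) ** 2 for a, b in zip(pos_i, pos_j))
    d_rgb = sum((a - b) ** 2 for a, b in zip(rgb_i, rgb_j))
    # Bilateral kernel: depends on positions and intensities.
    bilateral = math.exp(-d_pos / (2 * sigma_alpha ** 2)
                         - d_rgb / (2 * sigma_beta ** 2))
    # Spatial kernel: depends on positions only.
    spatial = math.exp(-d_pos / (2 * sigma_gamma ** 2))
    return w1 * bilateral + w2 * spatial


# Two adjacent pixels with different labels: the penalty is larger when
# their colours are similar than when they differ strongly.
similar = pairwise_potential(0, 1, (0, 0), (0, 1), (100, 100, 100), (102, 100, 100))
different = pairwise_potential(0, 1, (0, 0), (0, 1), (100, 100, 100), (200, 100, 100))
print(similar > different)
```

Under these assumptions, assigning different labels to nearby pixels of similar colour costs more than assigning different labels across a strong colour edge, which matches the intuition stated above.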

CRF as RNN

While a CRF is a useful post-processing module [101] for any deep learning based semantic image segmentation architecture, one of its main drawbacks was that it could not be used as part of an end-to-end architecture. In the standard CRF model, the pairwise potentials can be represented as a sum of weighted Gaussians. However, since exact minimization is intractable, a mean-field approximation is used: the CRF distribution is represented by a simpler one that is a product of independent marginal distributions.
In its native form, this mean-field approximation isn't suitable for back-propagation. In the work of [221], this step was replaced by a set of convolutional operations that are iterated over a recurrent pipeline until convergence is reached. As reported in their work, the proposed approach obtained an mIoU of 74.7, compared to 71.0 by BoxSup and 72.7 by DeepLab. The sequence of operations can be most easily explained as follows.
1. Initialization : A softmax operation over the unary potentials gives the initial distribution to work with.
  初始化：对一元势的SoftMax操作能够为我们提供初始的分布
2. Message Passing : Convolving with two Gaussian kernels, one spatial and one bilateral. As in the actual implementation of the CRF, splatting and slicing also occur while building the permutohedral lattice for efficient computation of the fully connected CRF.
  信息传递：使用两个高斯核进行卷积，一个是空间核，另一个是双边核。与CRF的实际实现类似，在构建为了高效计算全连接CRF的四面体晶格时，也会发生溅射和切片
3. Weighting Filter Outputs : By convolving with 1 × 1 kernels with the required number of channels, the filter outputs can be weighted and summed. The weights can be easily learnt through backpropagation.
  权值滤波输出：使用带有需求数量通道 1 × 1 内核进行卷积，滤波器的输出能够进行加权求和。权值能够容易地通过反向传播学习得到
4. Compatibility Transform : Given a compatibility function that keeps track of the uncertainty between various labels, a simple 1 × 1 convolution with the same number of input and output channels is enough to simulate it. Unlike the Potts model, which assigns the same penalty to every pair of different labels, here the compatibility function can be learnt and is hence a much better alternative.
  相容性变换：考虑相容性函数来保持对不同标签之间不确定性的追踪，一个简单的输入输出通道相同的 1 × 1 卷积就足以模拟它了。与分配相同惩罚的potts模型不同，这里的相容性函数能够通过学习得到，因此是一个更好的选择
5. Adding the unary potentials : This can be performed by a simple element-wise subtraction of the penalty produced by the compatibility transform from the unary potentials.
  添加一元势：可以通过一个简单的元素从一元势的相容性变换中减去惩罚实现
6. Normalization : The outputs can be normalized with another simple softmax function.
  归一化：使用另一个简单的softmax函数对输出进行归一化
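
The six steps above can be sketched on a tiny 1-D "image". This is a simplified illustration, not the implementation of [221]: the kernels are computed by brute force instead of a permutohedral lattice, the kernel weights and the Potts compatibility matrix are fixed rather than learnt, and every name and default value (`mean_field`, `s_alpha`, `s_beta`, `s_gamma`) is an assumption.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def mean_field(unary, positions, intensities, n_iters=5,
               w_bilateral=1.0, w_spatial=1.0,
               s_alpha=3.0, s_beta=0.1, s_gamma=1.0):
    """Mean-field updates; unary[i][l] is the score of label l at pixel i."""
    n, n_labels = len(unary), len(unary[0])
    # Dense kernel matrix, O(n^2); the two Gaussian kernels are already
    # combined with their (fixed, hypothetical) weights here.
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dp = (positions[i] - positions[j]) ** 2
            di = (intensities[i] - intensities[j]) ** 2
            bilateral = math.exp(-dp / (2 * s_alpha ** 2) - di / (2 * s_beta ** 2))
            spatial = math.exp(-dp / (2 * s_gamma ** 2))
            kernel[i][j] = w_bilateral * bilateral + w_spatial * spatial
    # Potts compatibility: penalty 1 for any pair of different labels.
    compat = [[0.0 if a == b else 1.0 for b in range(n_labels)]
              for a in range(n_labels)]
    q = [softmax(row) for row in unary]                      # step 1
    for _ in range(n_iters):
        new_q = []
        for i in range(n):
            # steps 2 and 3: message passing with weighted kernels
            msg = [sum(kernel[i][j] * q[j][l] for j in range(n))
                   for l in range(n_labels)]
            # step 4: compatibility transform
            penalty = [sum(compat[l][k] * msg[k] for k in range(n_labels))
                       for l in range(n_labels)]
            # steps 5 and 6: subtract from the unary scores, then normalize
            new_q.append(softmax([unary[i][l] - penalty[l]
                                  for l in range(n_labels)]))
        q = new_q
    return q


# Pixel 1 weakly prefers the wrong label; its neighbours share its
# intensity, so the refinement pulls it back to label 0.
unary = [[2.0, 0.0], [0.0, 0.5], [2.0, 0.0], [0.0, 2.0], [0.0, 2.0], [0.0, 2.0]]
q = mean_field(unary, positions=[0, 1, 2, 3, 4, 5],
               intensities=[0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
labels = [row.index(max(row)) for row in q]
print(labels)
```

In the recurrent formulation, each pass through the loop body corresponds to one step of the recurrent pipeline, with the kernel weights and the compatibility matrix as trainable parameters.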

Incorporating higher order dependencies
合并高阶依赖性

Another end-to-end network inspired by CRFs incorporates higher-order relations into a deep network. In the deep parsing network [128], the pixel-wise prediction from a standard VGG-like feature extractor (but with fewer pooling operations) is boosted using a sequence of special convolution and pooling operations.
另一个从CRF获得灵感的端到端网络是合并更高阶的关系到一个深度网络中。通过一个深度解析网络[128]，标准VGG特征提取器输出的像素预测被通过使用特殊的卷积和池化操作序列而提升
First, local convolutions, which apply large unshared convolutional kernels across the different positions of the feature map, are used to obtain translation-dependent features that model long-distance dependencies. Then, similar to standard CRFs, a spatial convolution penalizes probabilistic maps based on local label contexts.
首先，通过使用局部卷积来实现较大的非共享卷积核，跨越了特征图中不同的位置以获得模型远程依赖的翻译相关特征。与标准CRF类似，空间卷积惩罚概率映射基于局部标签的上下文关系
Finally, block min pooling performs a pixel-wise min-pooling across the depth to accept the prediction with the lowest penalty. Similarly, in the work of [126], a row/column-wise propagation model was proposed that calculated the global pairwise relationships across an image. With a dense affinity matrix drawn from a sparse transformation matrix, coarsely predicted labels were reclassified based on the affinity of pixels.
最后，通过块最小池化进行跨越深度的像素最小值池化，来接受惩罚最低的预测。相似地，在[126]的工作中，提出了计算跨越图片的全局成对关系的行/列传播模型。通过稀疏变换矩阵得到的稠密关系矩阵，粗略地预测了基于像素关系分类的标签

4.1.5 Multi-scale networks

One of the main problems with image segmentation for natural scene images is that the size of the object of interest is very unpredictable: real-world objects come in different sizes, and an object may look bigger or smaller depending on its position relative to the camera. The nature of a CNN dictates that delicate small-scale features are captured in early layers, whereas deeper in the network the features become more specific to larger objects.
自然场景图像分割的一个主要问题是感兴趣目标的大小非常难以预测，正如真实世界中物体可能大小不同，并且可能会因为物体和相机之间的位置而看起来更大或更小。CNN的特性决定了小尺寸的特征在更早的层提取，然而随着网络的加深，大物体的特征变得更加明显
For example, a tiny car in a scene has a much lower chance of being captured in the higher layers due to operations like pooling or down-sampling. It is often beneficial to extract information from feature maps of various scales to create segmentations that are agnostic to the size of the object in the image. Multi-scale auto-encoder models [33] consider activations of different resolutions to provide the image segmentation output.
例如，由于池化或降采样之类的操作，场景中的一辆小汽车在高层被捕获的几率更低。从不同尺寸的特征图提取信息来生成图像中物体大小不可知的分割通常比较方便。多尺度自动编码器模型[33]考虑了不同分辨率的激活来提供图像分割的输出
PSPNet
The pyramid scene parsing network [220] was built upon the FCN based pixel level classification network. The feature maps from a ResNet-101 network are converted to activations of different resolutions through multi-scale pooling layers, which are later upsampled and concatenated with the original feature map to perform segmentation (refer to fig. 10). The learning process in deep networks like ResNet was further optimized by using auxiliary classifiers.
金字塔场景解析网络[220]基于FCN的像素级别分类网络构建而成。ResNet-101网络的特征图通过多尺度池化层被转化为不同分辨率的激活，这些池化层随后被上采用并串联到原来的特征图中来进行分割（如图10）。ResNet之类深度网络学习的过程将被通过辅助分类器进一步优化

fig. 10 Schematic diagram of PSPNet

The different types of pooling modules focus on different areas of the activation map. Pooling kernels of various sizes like 1 × 1, 2 × 2, 3 × 3, 6 × 6 look into different areas of the activation map to create the spatial pooling pyramid. On the ImageNet scene parsing challenge, PSPNet scored a mean IoU of 57.21, compared to 44.80 for FCN and 40.79 for SegNet.
不同类型的池化模块关注激活映射的不同区域。不同尺寸的池化内核，如 1 × 1, 2 × 2, 3 × 3, 6 × 6 搜索激活映射的不同区域来生成空间池化金字塔。在ImageNet场景解析挑战上，PSPNet达到了57.21的平均IoU，对比FCN的44.80和SegNet的40.79
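
A rough sketch of the spatial pooling pyramid follows, for a single-channel feature map stored as nested lists. It reads the sizes 1, 2, 3 and 6 as output grid sizes, uses average pooling and nearest-neighbour upsampling, and omits the convolutions applied to each pyramid level; these are simplifying assumptions, and the function names are hypothetical.

```python
def adaptive_avg_pool(fmap, bins):
    """Average-pool a 2-D map down to a bins x bins grid."""
    h, w = len(fmap), len(fmap[0])
    pooled = []
    for by in range(bins):
        y0, y1 = by * h // bins, (by + 1) * h // bins
        row = []
        for bx in range(bins):
            x0, x1 = bx * w // bins, (bx + 1) * w // bins
            cells = [fmap[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


def upsample_nearest(grid, h, w):
    """Resize a bins x bins grid back to h x w by repeating values."""
    bins = len(grid)
    return [[grid[y * bins // h][x * bins // w] for x in range(w)]
            for y in range(h)]


def pyramid_pool(fmap, bin_sizes=(1, 2, 3, 6)):
    """Return the original map plus one upsampled map per pyramid level."""
    h, w = len(fmap), len(fmap[0])
    channels = [fmap]
    for bins in bin_sizes:
        channels.append(upsample_nearest(adaptive_avg_pool(fmap, bins), h, w))
    return channels


fmap = [[y * 6 + x for x in range(6)] for y in range(6)]
out = pyramid_pool(fmap)
print(len(out), out[1][0][0], out[2][0][0])
```

The 1 × 1 level carries a single global average to every position, while finer levels keep progressively more local context; concatenating them gives each pixel access to all of these scales at once.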

RefineNet

Working with features from the last layer of a CNN produces soft boundaries for the object segments. This issue was avoided in the DeepLab algorithms with atrous convolutions. RefineNet [120] takes an alternative approach: it refines intermediate activation maps and hierarchically concatenates them to combine multi-scale activations while preventing loss of sharpness. The network consists of a separate RefineNet module for each block of the ResNet. Each RefineNet module is made up of three main blocks, namely the residual convolution unit (RCU), multi-resolution fusion (MRF) and chained residual pooling (CRP) (refer to fig. 11).
通过处理先前层的特征CNN产生柔和的物体分割边界。在DeepLab算法中通过多孔卷积避免了这个问题。RefineNet[120]使用替代的方法，通过提炼中间层的激活映射并分层地连接到多尺度激活并同时防止锐度的损失。网络由分开的RefineNet模块组成，每个模块由ResNet块构成。每个RefineNet模块由3个主要的块组成，即残差卷积单元（RCU），多分辨率模糊（MRF）和连接残差池化（CRP）（如图11）

fig. 11 Schematic diagram of RefineNet

The RCU block consists of an adaptive convolution set that fine-tunes the pre-trained ResNet weights for the segmentation problem. The MRF layer fuses activations of different resolutions using convolution and upsampling layers to create a higher-resolution map. Finally, in the CRP layer, pooling kernels of multiple sizes are applied to the activations to capture background context from large image areas. RefineNet was tested on the Person-Part dataset, where it obtained an IoU of 68.6 compared to 64.9 by DeepLab-v2; both used ResNet-101 as a feature extractor.
RCU块由一系列微调用于分割问题的ResNet预训练权值的自适应卷积组成。MRF层使用卷积核上采样层融合了不同分辨率的激活，来生成更高分辨率的映射。最后在CRP层中，激活中使用了多种尺寸的池化核来从大的图像区域提取背景的上下文信息。RefineNet在Person-Part数据集中测试，获得了68.6的IoU，与同样使用了ResNet-101作为特征提取器的DeepLab-v2相比

4.2 Convolutional autoencoders

The previous subsection dealt with discriminative models that perform pixel-level classification to solve image segmentation problems. Another line of thought gets its inspiration from autoencoders. Autoencoders have traditionally been used for feature extraction from input samples while trying to retain most of the original information.
最后一个部分介绍了用于进行像素级别分类来处理图像分割问题的判别模型。从自动编码器获得了另一条思路。自动编码器传统上用于从输入样本上提取特征，并试图保留大部分原始信息
An autoencoder is basically composed of an encoder that encodes the input representations from a raw input to a possibly lower dimensional intermediate representation and a decoder that attempts to reconstruct the original input from the intermediate representation. The loss is computed in terms of the difference between the raw input images and the reconstructed output image.
自动编码器基本上由将输入表示从原始输入编码为可能更低维的中间表示的编码器，以及尝试从中间表示重建原始输入的编码器组成。损失将被以原始输入图片和重建的输入图片的差异的形式所表示
The generative nature of the decoder part has often been modified and used for image segmentation purposes. Unlike the traditional autoencoders, during segmentation the loss is computed in terms of the difference between the reconstructed pixel level class distribution and the desired pixel level class distribution. This kind of segmentation approach is more of a generative procedure as compared to the classification approach of RCNN or DeepLab algorithms.
解码器生成部分的特性经常被修改并用于图像分割用途。不同于传统的自动编码器，在分割时损失将被以重建的像素级别类别分布和目标像素类别分布的差异来计算。这种分割方法与RCNN或DeepLab算法这种分类方法更加具有生成的能力
The challenge with such approaches is to prevent over-abstraction of images during the encoding process. Their primary benefit is the ability to generate sharper boundaries with much less complication. Unlike the classification approaches, the generative nature of the decoder can learn to create delicate boundaries based on extracted features.
这种方法的问题在于防止在编码的过程中图像的过度抽象。这种方法主要的好处在于更低的复杂度下生成更加锐利的边缘的能力。不同于分类方法，解码器的生成特性能够学习如何基于提取的特征生成精致的边缘
The major issue that affects these algorithms is the level of abstraction. It has been seen that, without proper modification, the reduction in the size of the feature map creates inconsistencies during reconstruction. In the paradigm of convolutional neural networks, the encoding is basically a series of convolution and pooling layers or strided convolutions. The reconstruction, however, can be tricky. The commonly used techniques for decoding from a lower-dimensional feature are transposed convolutions and unpooling layers.
影响这些算法的主要问题是抽象的程度。我们已经看到如果没有适当的修改，特征图尺寸的减小会导致重建过程的矛盾。在卷积神经网络的范例中，编码基本上由一系列的卷积和池化层或有一定步长的卷积来进行。重建的过程也十分棘手。通常用于从低维特征解码的技术为转置卷积或上池化层
One of the main advantages of using an autoencoder based approach over a normal convolutional feature extractor is the freedom of choosing the input size. With a clever use of down-sampling and up-sampling operations, it is possible to output a pixel-level probability map of the same resolution as the input image. This benefit has made encoder-decoder architectures with multi-scale feature forwarding ubiquitous for networks where the input size is not predetermined and an output of the same size as the input is needed.
与普通的卷积特征提取器相比，使用基于自动编码器方法的主要优势在于自由选择输入的大小。通过降采样和上采用操作巧妙的使用，使得输出与输入图像分辨率相同的像素级的概率成为可能。这个好处使得带有多尺度特征传递的编码器-解码器结构在输入大小无法预先得知，并且需要输出与输入大小一致的网络中无处不在

Transposed Convolution

Transposed convolution, also known as convolution with fractional strides, was introduced to reverse the effects of a traditional convolution operation [156, 53]. It is often referred to as deconvolution. However, deconvolution as defined in signal processing differs from transposed convolution in its basic formulation, although they effectively address the same problem.
转置卷积，也称为小数步长卷积，用于反转传统卷积操作的作用[156]，[53]。它通常被称为反卷积。反卷积在信号处理中的定义，与转置卷积在基本公式上是不同的，尽管他们都方便地解决了相同的问题

fig. 12 Ordinary convolution with integer stride (left) and transposed convolution with fractional stride (right)

In a convolution operation, the size of the output differs from that of the input depending on the amount of padding and the stride of the kernels. As shown in fig. 12, a stride of 2 creates half the number of activations of a stride of 1. For a transposed convolution to work, padding and stride should be controlled such that the size change is reversed. This is achieved by dilating the input space. Note that unlike atrous convolutions, where the kernels were dilated, here the input spaces are dilated.
在卷积操作中，输入的大小会因为填充的数量和卷积核的步长而改变。如图12所示，步长为2的卷积会产生与步长为1的卷积相比一半的激活。对于转置矩阵，填充和步长需要以大小相反的改变来控制。这通过扩大输入空间来实现。需要注意的是，不同于多孔卷积，因为内核扩大了，输入空间也随之扩大
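
The input-dilation view can be illustrated in one dimension. The sketch below is an assumption-laden toy (no learnt weights, hypothetical function names): zeros are inserted between the input samples, the result is zero-padded, and an ordinary stride-1 convolution with the flipped kernel is applied.

```python
def conv1d(x, kernel, stride=1):
    """Plain 'valid' convolution (cross-correlation form, as in CNNs)."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k))
            for i in range(0, len(x) - k + 1, stride)]


def transposed_conv1d(x, kernel, stride=2):
    """Transposed convolution implemented by dilating the input."""
    k = len(kernel)
    # Insert (stride - 1) zeros between neighbouring input samples.
    dilated = []
    for idx, value in enumerate(x):
        dilated.append(value)
        if idx < len(x) - 1:
            dilated.extend([0] * (stride - 1))
    # Pad so that every input sample meets every kernel tap.
    padded = [0] * (k - 1) + dilated + [0] * (k - 1)
    # Flipping the kernel makes this equal to scattering x[i] * kernel
    # into the output at offset i * stride.
    return conv1d(padded, kernel[::-1], stride=1)


# A stride-2 convolution maps 7 samples to 3 ...
print(len(conv1d([0] * 7, [1, 2, 3], stride=2)))    # 3
# ... and the stride-2 transposed convolution maps 3 samples back to 7.
print(transposed_conv1d([1, 2, 3], [1, 2, 3], stride=2))
```

The output length is (n − 1) · stride + k for n input samples and a kernel of size k, which is the size change of the forward convolution in reverse. Only the shape is restored; the original values are not recovered.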

Unpooling

Another approach to reduce the size of the activations is through pooling layers. A 2×2 pooling layer with a stride of two reduces the height and width of the image by a factor of 2. In such a pooling layer, a 2×2 neighborhood of pixels is compressed to a single pixel. Different types of pooling perform the compression in different ways. Max-pooling takes the maximum activation value among the 4 pixels, while average pooling takes their average. A corresponding unpooling layer decompresses a single pixel to a neighborhood of 2 × 2 pixels to double the height and width of the image.
另一个减小激活的尺寸的方法是池化层。一个步长为2的 2×2 池化层将图像的长和宽各减小一半。在这种池化层中， 2×2 的像素邻域将被压缩为一个单独的像素。不同类型的池化以不同的方式进行压缩。最大值池化取4个像素中的最大激活值，平均值池化则取他们的平均值。一个相应的反池化层将一个像素解压为 2 × 2 的像素邻域，将图像的长和宽增倍
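
A small sketch of 2×2 pooling with stride 2 and a matching unpooling step, on nested lists with even height and width. The unpooling here simply repeats each value over its 2×2 neighbourhood, which is only one possible choice and is an assumption of this example.

```python
def pool2x2(img, mode="max"):
    """2x2 pooling with stride 2; mode is 'max' or 'avg'."""
    def reduce_window(values):
        return max(values) if mode == "max" else sum(values) / len(values)
    return [[reduce_window([img[y][x], img[y][x + 1],
                            img[y + 1][x], img[y + 1][x + 1]])
             for x in range(0, len(img[0]), 2)]
            for y in range(0, len(img), 2)]


def unpool2x2(img):
    """Expand every pixel to a 2x2 block by repeating its value."""
    out = []
    for row in img:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


img = [[1, 2, 5, 6],
       [3, 4, 7, 8],
       [9, 10, 13, 14],
       [11, 12, 15, 16]]
print(pool2x2(img, "max"))    # [[4, 8], [12, 16]]
print(pool2x2(img, "avg"))    # [[2.5, 6.5], [10.5, 14.5]]
print(unpool2x2(pool2x2(img, "max")))
```

Pooling followed by unpooling restores the spatial size but not the original values, which is why the decoder needs further convolutions (or extra information from the encoder) to recover detail.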

4.2.1 Skip Connections

跳连接

Linear skip connections have often been used in convolutional neural networks to improve gradient flow across a large number of layers [78]. As depth increases in a network, the activation maps tend to focus on more and more abstract concepts. Skip connections have proved to be very effective for combining different levels of abstraction from different layers to generate crisp segmentation maps.
线性跳连接通常在卷积神经网络中用于提高大量层之间的梯度流动。随着网络深度的增加，激活映射倾向于关注越来越抽象的概念。跳连接被证明能够非常有效地结合来自不同层的不同级别的抽象来生成清晰的分割图

U-NET

The U-Net architecture, proposed in 2015, proved to be quite efficient for a variety of problems such as segmentation of neuronal structures, radiography, and cell tracking challenges [177]. The network is characterized by an encoder with a series of convolution and max pooling layers. The decoder contains a mirrored sequence of convolutions and transposed convolutions. As described so far, it behaves as a traditional auto-encoder. It was mentioned previously how the level of abstraction plays an important role in the quality of image segmentation.
U-Net的架构提出于2015年，被证明在各种如神经元结构、X光片分割和细胞追踪挑战中效率很高[177]。网络的特点是带有一系列卷积和最大值池化层的编码器。解码层含有一个卷积和转置卷积的镜像序列。如此所述，其表现得如同传统的自动编码器。先前其曾在介绍抽象级别如何在图像分割的质量中起重要作用时提到过
To consider various levels of abstraction, U-Net implements skip connections that copy the uncompressed activations from the encoding blocks to their mirrored counterparts among the decoding blocks, as shown in fig. 13. The feature extractor of the U-Net can also be upgraded to provide better segmentation maps. The network nicknamed "The One Hundred Layers Tiramisu" [88] applied the concept of U-Net using a DenseNet-based feature extractor. Other modern variations involve the use of capsule networks [183] along with locally constrained routing [108].
考虑到不同级别的抽象U-Net使用了跳连接来从编码模块中复制未压缩的激活到其在解码模块的镜像部分中，如图13所示。U-Net的特征提取器也能够升级以提供更好的分割图。网络的昵称为“一百层提拉米苏”[88]，这代表了U-Net中使用的基于稠密网络的特征提取器的概念。其他现代的变种包括胶囊网络的使用[183]以及局部约束路由[108]

fig. 13 Architecture of U-Net

U-Net was selected as a winner for an ISBI cell tracking challenge. In the PhC-U373 dataset it scored a mean IoU of 0.9203 whereas the second best was at 0.83. In the DIC-HeLa dataset, it scored a mean IoU of 0.7756 which was significantly better than the second best approach which scored only 0.46.
U-Net当选为ISBI细胞追踪挑战的冠军。在PhC-U373数据集中，其平均IoU的得分为0.9203，而第二名最好的结果仅为0.83。在DIC-HeLa数据集中，其平均IoU的得分为0.7756，远远高于第二名的最好结果0.46

Forwarding pooling indices

Max-pooling has been the most commonly used technique for reducing the size of the activation maps for various reasons. The activations represent the response of a region of an image to a specific kernel. In max pooling, a region of pixels is compressed to a single value by considering only the maximum response obtained within that region. If a typical autoencoder compresses a 2×2 neighborhood of pixels to a single pixel in the encoding phase, the decoder must decompress the pixel to a similar dimension of 2 × 2.
最大值池化是减小激活映射尺寸最常使用的技术，这有几个原因。激活代表了图像区域对一个特定内核的响应。在最大值池化中，像素邻域通过只考虑从区域中获取响应的最大值被压缩为单个值。如果一个典型的自动编码器在编码阶段把一个 2×2 像素邻域压缩为单个像素，解码器需要将这个像素解压为 2×2 的维度
By forwarding pooling indices, the network basically remembers the location of the maximum value among the 4 pixels while performing max-pooling. The index corresponding to the maximum value is forwarded to the decoder (refer to fig. 14) so that, during the unpooling operation, the value from the single pixel can be copied to the corresponding location in the 2 × 2 region in the next layer [215]. The values in the remaining three positions are computed in the subsequent convolutional layers. If the value were copied to a random location without knowledge of the pooling indices, there would be inconsistencies in classification, especially in the boundary regions.
通过前向池化索引，网络在最大值池化时基本上记住了最大值所在位置的4个像素。对应于最大值的索引被传递到解码器（如图14），因此在反池化时，单个像素值能够被复制到下一层的 2 × 2 区域的对应最大值的位置中[215]。剩余3个位置的值将通过序列卷积层计算得到。如果这个值没有池化索引的信息而被复制到随机的位置上，在分类时特别是在边界区域上将产生矛盾

fig. 14 Forwarding pooling indices to preserve spatial relationships during unpooling
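
The index-forwarding idea can be sketched as a pair of functions: the encoder side returns the pooled values together with the offset of each maximum inside its 2×2 window, and the decoder side writes each value back to that offset. Filling the other three positions with zeros (to be refined by later convolutions) and the function names are assumptions of this sketch.

```python
def max_pool_with_indices(img):
    """2x2 max-pooling that also records where each maximum was found."""
    pooled, indices = [], []
    for y in range(0, len(img), 2):
        pooled_row, index_row = [], []
        for x in range(0, len(img[0]), 2):
            window = [(img[y + dy][x + dx], (dy, dx))
                      for dy in (0, 1) for dx in (0, 1)]
            value, offset = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            index_row.append(offset)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def max_unpool(pooled, indices):
    """Place each pooled value at its remembered location; rest are 0."""
    h, w = 2 * len(pooled), 2 * len(pooled[0])
    out = [[0] * w for _ in range(h)]
    for py, row in enumerate(pooled):
        for px, value in enumerate(row):
            dy, dx = indices[py][px]
            out[2 * py + dy][2 * px + dx] = value
    return out


img = [[1, 9, 2, 3],
       [4, 5, 8, 6],
       [7, 1, 0, 2],
       [3, 2, 4, 1]]
pooled, indices = max_pool_with_indices(img)
print(pooled)                       # [[9, 8], [7, 4]]
print(max_unpool(pooled, indices))
```

Each maximum returns to the exact position it came from, so strong responses near object boundaries are not shifted by up to one pixel as they could be with a fixed or random placement.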

SegNet

The SegNet algorithm [9] was launched in 2015 to compete with the FCN network on complex indoor and outdoor images. The architecture is composed of 5 encoding blocks and 5 decoding blocks. The encoding blocks follow the architecture of the feature extractor in the VGG-16 network. Each block is a sequence of multiple convolution, batch normalization and ReLU layers. Each encoding block ends with a max-pooling layer where the indices are stored.
SegNet算法[9]推出于2015年，成为在复杂的室内和室外图像上与FCN网络抗衡的算法。其结构由5个编码器块和5个解码器块组成。在VGG-16网络中，编码器块位于特征提取器的结构之后。每一个块由包含多个卷积，批归一化，ReLU激活函数的序列组成。每一个编码器块末端为一个最大值池化，其索引被储存下来
Each decoding block begins with an unpooling layer where the saved pooling indices are used (refer to fig. 15). The indices from the max-pooling layer of the i-th block in the encoder are forwarded to the max-unpooling layer in the (L−i+1)-th block in the decoder, where L is the total number of blocks in each of the encoder and decoder. The SegNet architecture scored an mIoU of 60.10 on the CamVid dataset, compared to 53.88 by DeepLab-LargeFOV [31], 49.83 by FCN [130] and 59.77 by Deconvnet [156].
每一个解码器块从一个反池化层开始，其使用了池化索引（如图15）。这个来自编码器第i个块的最大值池化索引被传递到解码器第(L-i+1)块的最大值反池化层中，其中L是编码器和解码器块的总数量。SegNet架构在CamVid数据集上取得了60.10的平均IoU，与DeepLab-LargeFOV的53.88和Devonvnet的59.77对比

fig. 15 Architecture of SegNet

4.3 Adversarial Models

Until now, we have seen purely discriminative models like FCN, DeepMask and DeepLab, which primarily generate a probability distribution for every pixel across the number of classes. Autoencoders, in turn, treat segmentation as a generative process, although the last layer is generally connected to a pixel-wise softmax classifier. The adversarial learning framework approaches the optimization problem from a different perspective.
目前为止，我们看到纯识别模型如FCN，DeepMask，DeepLab主要生成每个像素在类别数目上的概率分布。此外，自动编码器将分割视为生成性的过程，但是最后一层往往连接到一个像素的soft-max分类器上。对抗学习架构在另一个角度上实现了优化问题
Generative Adversarial Networks (GANs) gained a lot of popularity due to their remarkable performance as generative networks. The adversarial learning framework mainly consists of two networks: a generator network and a discriminator network. The generator G tries to generate images like the ones from the training dataset using a noisy input drawn from a prior distribution $p_z(z)$. The network $G(z;\theta_g)$ is a differentiable function represented by a neural network with weights $\theta_g$.
生成对抗网络（GANs）因为作为生成网络值得一提的表现获得了大量的人气。对抗学习架构主要由两个网络组成，包括生成网络和判别器网络。生成器G试图通过噪声输入优先级分布pz(z)生成与训练数据集类似的图片。网络G(z; θg)代表由以θg为权值的神经网络表示的可微函数
A discriminator network tries to correctly guess whether an input is from the training data distribution $p_{data}(x)$ or generated by the generator G. The goal of the discriminator is to get better at catching a fake image, while the generator tries to get better at fooling the discriminator, thus generating better outputs in the process. The entire optimization process can be written as a min-max problem as follows:
判别器网络试图正确地判断输入数据是训练数据分布（pdata(x)）还是由生成器G所生成的。判别器的目标在于更好地识别出假图片，生成器则希望更好地欺骗判别器，在这个过程中产生更好的输出。整个优化过程可以被写为如下的最大-最小问题：

$\min_{G}\max_{D}V(D,G)=\mathbb{E}_{x\sim p_{data}(x)}[\log D(x)]+\mathbb{E}_{z\sim p_z(z)}[\log(1-D(G(z)))]\tag{3}$
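
As a numerical illustration of Eq. (3), the value function can be estimated from a batch of discriminator outputs. The helper `gan_value` and the sample probabilities below are made up for illustration; no network is trained here.

```python
import math


def gan_value(d_real, d_fake):
    """Empirical V(D, G): d_real holds D(x), d_fake holds D(G(z))."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that cannot tell real from fake outputs 0.5 everywhere.
confused = gan_value([0.5, 0.5], [0.5, 0.5])
# A sharper discriminator raises V (this is what D maximizes).
sharp = gan_value([0.9, 0.9], [0.1, 0.1])
# A generator that fools D lowers V (this is what G minimizes).
fooled = gan_value([0.9, 0.9], [0.9, 0.9])
print(confused, sharp, fooled)
```

With these made-up numbers, the confused discriminator gives V = −log 4, the sharper one gives a larger value, and the fooled one gives a smaller value, which shows the two opposing directions of the min-max game.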

The segmentation problem has also been approached from an adversarial learning perspective. The segmentation network is treated as a generator that generates the segmentation masks for each class, whereas a discriminator network tries to predict whether a set of masks is from the ground truth or from the output of the generator [133]. A schematic diagram of the process is shown in fig. 20. Furthermore, conditional GANs have been used to perform image-to-image translation [86]. This framework can be used for image segmentation problems where the semantic boundaries of the image and the output segmentation map do not necessarily coincide, for example when creating a schematic diagram of the façade of a building.
分割问题也能通过对抗学习的角度来解决。分割网络被视为对每一个类别生成分割掩膜的生成器，而判别器网络则尝试预测一系列掩膜是来自实际的标签还是生成器的输出[133]。这个过程的原理图如图20所示。此外，条件GANs也被用于进行图像-图像翻译[86]。这个架构能被用于图像语义边界和输出分割图不需要一致的图像分割问题，例如，在生成建筑外观示意图的情况下

fig. 16 Adversarial learning model for image segmentation

fig. 20 Schematic diagram of a GAN

4.4 Sequential Models

Until now, almost all the techniques discussed deal with semantic image segmentation. Another class of segmentation problem, namely instance level segmentation, needs a slightly different approach. Unlike semantic image segmentation, here all instances of the same object are segmented into different classes. This type of segmentation problem is mostly handled as learning to output a sequence of object segments. Hence sequential models come into play in such problems. Some of the main architectures commonly used are convolutional LSTMs, recurrent networks, attention-based models and so on.

4.4.1 Recurrent Models

Traditional LSTM networks employ fully connected weights to model long and short term memories across sequential inputs, but they fail to capture the spatial information of images. Moreover, fully connected weights for images increase the cost of computation to a great extent. In convolutional LSTMs [176], these weights are replaced by convolutional layers (Refer fig. ??). Convolutional LSTMs have been used in several works to perform instance level segmentation. Normally they are used as a suffix to an object segmentation network. The purpose of a recurrent model like the LSTM is to select each instance of the object at a different timestamp of the sequential output. The approach has been implemented with object segmentation frameworks like FCN and U-Net [28].

4.4.2 Attention Models

While convolutional LSTMs can select different instances of objects at different timestamps, attention models are designed to have more control over this process of localizing individual instances. One simple method to control attention is spatial inhibition [176]. A spatial inhibition network is designed to learn a bias parameter that cuts off previously detected segments from future activations. Attention models have been further developed with the introduction of a dedicated attention module and an external memory to keep track of segments. In the work of [174], the instance segmentation network was divided into 4 modules. First, an external memory provides object boundary details from all previous steps. Second, a box network attempts to predict the location of the next instance of the object and outputs a sub-region of the image for the third module, the segmentation module. The segmentation module is similar to the convolutional auto-encoder models discussed previously. The fourth module scores the predicted segments based on whether they qualify as a proper instance of the object. The network terminates when the score goes below a user-defined threshold.

4.5 Weakly Supervised or Unsupervised Models

Neural networks in general are trained with algorithms like back-propagation, where the parameters w are updated based on their local partial derivative with respect to an error value E obtained using a loss function f.
The loss function is generally expressed in terms of a distance between a target value and the predicted value. But in many scenarios image segmentation requires the use of data without ground-truth annotations. This led to the development of unsupervised image segmentation techniques. One of the straightforward ways to achieve this is to use networks pre-trained on other, larger datasets with similar kinds of samples and ground truths, and to run clustering algorithms like K-means on the feature maps. However, this kind of semi-supervised technique is inefficient for data samples that have a unique distribution of sample space. Another drawback is that the network is trained on an input distribution that still differs from the test data, which prevents it from performing to its full potential. The key problem in fully unsupervised segmentation algorithms is the development of a loss function capable of measuring the quality of segments or clusters of pixels. With all these limitations, the literature on weakly supervised or unsupervised approaches is comparatively much lighter.
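
The clustering route mentioned above can be sketched as follows. The per-pixel feature vectors are hand-written stand-ins for the activations of a pre-trained network, and the K-means routine uses a naive deterministic initialization (evenly spaced pixels); both are assumptions made to keep the example self-contained.

```python
def kmeans(points, k, n_iters=10):
    """Plain K-means on a list of feature vectors."""
    last = len(points) - 1
    # Naive deterministic initialization: evenly spaced points.
    centroids = [list(points[i * last // max(k - 1, 1)]) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(n_iters):
        # Assignment step: nearest centroid in squared Euclidean distance.
        for idx, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            labels[idx] = dists.index(min(dists))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def segment_by_clustering(feature_map, k):
    """Cluster the per-pixel features and reshape labels to the image grid."""
    h, w = len(feature_map), len(feature_map[0])
    flat = [feature_map[y][x] for y in range(h) for x in range(w)]
    labels, _ = kmeans(flat, k)
    return [labels[y * w:(y + 1) * w] for y in range(h)]


# Hypothetical 2 x 4 feature map with a 2-D feature vector per pixel.
feature_map = [
    [(0.1, 0.0), (0.0, 0.2), (9.8, 10.1), (10.0, 9.9)],
    [(0.2, 0.1), (0.1, 0.1), (10.2, 10.0), (9.9, 10.2)],
]
print(segment_by_clustering(feature_map, k=2))
```

The cluster ids carry no semantic meaning by themselves, which reflects the difficulty noted above: without annotations there is no direct measure of whether a cluster corresponds to a meaningful segment.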

4.5.1 Weakly supervised algorithms

Even in the absence of proper pixel level annotations, segmentation algorithms can exploit coarser annotations like bounding boxes or even image level labels [161, 116] for performing pixel level segmentation.

Exploiting bounding boxes

From the angle of data annotation, defining bounding boxes is a much less expensive task than pixel level segmentation. Datasets with bounding boxes are also much more widely available than those with pixel level segmentations. The bounding box can be used as weak supervision to generate pixel level segmentation maps. In the work of [42], titled BoxSup, segmentation proposals were generated using region proposal methods like selective search. After that, multi-scale combinatorial grouping is used to combine candidate masks, and the objective is to select the optimal combination that has the highest IoU with the box. This segmentation map is used to tune a traditional image segmentation network like FCN. BoxSup attained an mIoU of 75.1 on the Pascal VOC 2012 test set, compared to 62.2 for FCN and 66.4 for DeepLab-CRF.
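
The selection criterion can be sketched with masks represented as sets of pixel coordinates. This simplified example picks a single candidate mask rather than a combination of masks, and the candidate regions are invented for illustration; it is not the BoxSup procedure itself.

```python
def box_pixels(y0, x0, y1, x1):
    """Pixels covered by a box with inclusive start and exclusive end."""
    return {(y, x) for y in range(y0, y1) for x in range(x0, x1)}


def iou(mask_a, mask_b):
    """Intersection over union of two pixel sets."""
    union = len(mask_a | mask_b)
    return len(mask_a & mask_b) / union if union else 0.0


def best_candidate(candidates, box):
    """Return the candidate mask with the highest IoU against the box."""
    target = box_pixels(*box)
    return max(candidates, key=lambda mask: iou(mask, target))


box = (0, 0, 4, 4)                      # annotated bounding box, 16 pixels
candidates = [
    box_pixels(0, 0, 2, 2),             # too small: IoU 0.25
    box_pixels(0, 0, 4, 3),             # close fit: IoU 0.75
    box_pixels(2, 2, 6, 6),             # shifted: IoU 4 / 28
]
chosen = best_candidate(candidates, box)
print(iou(chosen, box_pixels(*box)))    # 0.75
```

The chosen mask would then act as a pseudo ground truth for tuning the segmentation network, in the spirit of the weak supervision described above.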
