Image Segmentation Using Deep Learning: A Survey

Xiao Jiang's Study Notes

Last modified: 2023-12-10 18:57:51


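First published: 2023-12-10 18:03:15

Jiang Xiyu (蒋熹煜), School of Computer Science and Information Security, Guilin University of Electronic Technology

Article link: https://blog.csdn.net/qq_61735602/article/details/134909781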


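Contents
1 Introduction
2 Overview of Deep Neural Networks
3 Segmentation Models
4 Image Segmentation Datasets
5 Performance Evaluation
- 5.1 Metrics for Segmentation Models
- 5.2 Quantitative Performance of DL-Based Models
6 Challenges and Opportunities
7 Conclusion
References
Authors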

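Introduction
Presents the background and purpose of the survey.
Overview of Deep Neural Networks
2.1 Convolutional Neural Networks (CNNs)
2.2 Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM)
2.3 Encoder-Decoder and Auto-Encoder Models
2.4 Generative Adversarial Networks (GANs)
Describes the main types of deep neural networks in detail.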
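DL-Based Image Segmentation Models
3.1 Fully Convolutional Networks
3.2 Convolutional Models with Graphical Models
3.3 Encoder-Decoder Based Models
3.3.1 Encoder-Decoder Models for General Segmentation
3.3.2 Encoder-Decoder Models for Medical and Biomedical Image Segmentation
3.4 Multi-Scale and Pyramid Network Based Models
3.5 R-CNN Based Models (for Instance Segmentation)
3.6 Dilated Convolutional Models and the DeepLab Family
3.7 Recurrent Neural Network Based Models
3.8 Attention-Based Models
3.9 Generative Models and Adversarial Training
3.10 CNN Models with Active Contour Models
3.11 Other Models
Explains the different deep learning approaches to image segmentation in detail.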
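Image Segmentation Datasets
4.1 2D Datasets
4.2 2.5D Datasets
4.3 3D Datasets
Introduces the different types of image segmentation datasets used to train and evaluate models.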

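Performance Evaluation
5.1 Metrics for Image Segmentation Models
5.2 Quantitative Performance of DL-Based Models
Evaluates and compares model performance.
Challenges and Opportunities
6.1 More Challenging Datasets
6.2 Interpretable Deep Models
6.3 Weakly-Supervised and Unsupervised Learning
6.4 Real-Time Models for Various Applications
6.5 Memory-Efficient Models
6.6 3D Point-Cloud Segmentation
6.7 Application Scenarios
Discusses the challenges and future opportunities in image segmentation.
Conclusion
Summarizes the main findings and viewpoints of the survey.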

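References
Lists the cited literature.

Author biographies
Gives a brief introduction to the main authors.

Overall, the survey gives a comprehensive and detailed account of deep learning for image segmentation: the models, the datasets, performance evaluation, and future challenges and opportunities.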

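1 Introduction

Background
Image segmentation is a key component of many visual understanding systems. It involves partitioning an image (or video frame) into multiple segments or objects, and it plays a central role in applications such as medical image analysis, autonomous driving, video surveillance, and augmented reality. The literature describes early segmentation methods such as thresholding, histogram-based segmentation, and region growing, as well as more advanced methods such as active contours, graph cuts, conditional and Markov random fields, and sparsity-based methods.
Deep learning and image segmentation
Over the past few years, deep learning has produced a new generation of image segmentation models with remarkable performance gains, often achieving the highest accuracy on popular benchmarks. The survey uses DeepLabv3 [12] as an example to show the segmentation output of a deep learning model.
The segmentation problem and deep learning approaches
Image segmentation can be formulated as classifying pixels with semantic labels (semantic segmentation) or as partitioning individual objects (instance segmentation). Semantic segmentation labels every image pixel with a class, whereas instance segmentation extends this by detecting and delineating each object of interest in the image.
Categories of deep learning models
The survey groups deep learning models into the following categories: fully convolutional networks, convolutional models with graphical models, encoder-decoder based models, multi-scale and pyramid network based models, R-CNN based models (for instance segmentation), dilated convolutional models and the DeepLab family, recurrent neural network based models, attention-based models, generative models and adversarial training, and convolutional models with active contour models.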
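Main contributions
The main contributions of the survey are:

It reviews more than 100 image segmentation algorithms proposed up to 2019.
It gives a comprehensive review and in-depth analysis of different aspects of the deep learning algorithms, including training data, choice of network architecture, loss functions, and training strategies.
It provides an overview of around 20 popular image segmentation datasets, covering 2D, 2.5D (RGB-D), and 3D images.
It gives a comparative summary of the performance of the reviewed methods on popular benchmarks.
It discusses the main challenges and future directions for deep learning-based image segmentation.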
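Structure of the survey
The survey is organized as follows:

Section 2 gives an overview of the popular deep neural network architectures that underlie many modern segmentation algorithms.
Section 3 gives a comprehensive overview of the most significant deep learning-based image segmentation models proposed until 2020, covering their strengths and contributions.
Section 4 reviews some popular image segmentation datasets and their characteristics.
Section 5.1 reviews popular metrics for evaluating deep learning-based segmentation models.
Section 5.2 reports the quantitative results and experimental performance of these models.
Section 6 discusses the main challenges and future directions for deep learning-based image segmentation.
Finally, Section 7 presents the conclusions.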

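2 Overview of Deep Neural Networks

This section gives an overview of some of the most prominent deep learning architectures in computer vision, including convolutional neural networks (CNNs) [13], recurrent neural networks (RNNs) and long short-term memory (LSTM) [14], encoder-decoders [15], and generative adversarial networks (GANs) [16]. With the popularity of deep learning in recent years, several other deep neural architectures have been proposed, such as transformers, capsule networks, gated recurrent units, and spatial transformer networks; they are not covered in detail here.

It is worth mentioning that in some cases a deep learning model can be trained from scratch on a new application or dataset (assuming enough labeled training data), but in many cases there is not enough labeled data to do so. Transfer learning addresses this problem: a model trained on one task is re-purposed for another (related) task, usually through some adaptation process. For example, an image classification model trained on ImageNet can be adapted to a different task such as texture classification or face recognition. For image segmentation, many people use a model trained on ImageNet (a larger dataset than most image segmentation datasets) as the encoder part of the network and re-train the model from those initial weights. The assumption is that such pre-trained models already capture the semantic information needed for segmentation, so the model can be trained with fewer labeled samples.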

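2.1 Convolutional Neural Networks (CNNs)

Convolutional neural networks (CNNs) are among the most successful and widely used architectures in deep learning, especially for computer vision tasks. CNNs were initially proposed by Fukushima in his seminal paper on the "Neocognitron" [17], which was based on the hierarchical receptive field model of the visual cortex proposed by Hubel and Wiesel. Subsequently, Waibel et al. [18] introduced CNNs with weights shared among temporal receptive fields and trained by backpropagation for phoneme recognition, and LeCun et al. [13] developed a CNN architecture for document recognition (Figure 2).

Figure 2: Architecture of a convolutional neural network. From [13].

CNNs mainly consist of three types of layers:

Convolutional layers: a kernel (or filter) of weights is convolved with the input to extract features.
Nonlinear layers: an activation function is applied to the feature maps (usually element-wise) so that the network can model nonlinear functions.
Pooling layers: a small neighborhood of a feature map is replaced with some statistic (mean, max, etc.) of that neighborhood, which reduces the spatial resolution.
The units in these layers are locally connected: each unit receives weighted inputs from a small neighborhood, known as the receptive field, of units in the previous layer. By stacking layers to form multi-resolution pyramids, the higher-level layers learn features from increasingly wider receptive fields. The main computational advantage of CNNs is that all the receptive fields in a layer share weights, which results in far fewer parameters than a fully connected neural network. Some of the best-known CNN architectures are AlexNet [19], VGGNet [20], ResNet [21], GoogLeNet [22], MobileNet [23], and DenseNet [24].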
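
The three layer types described above can be sketched in a few lines of plain Python. This is only an illustrative toy, not code from the survey: the function names (`conv2d`, `relu`, `max_pool`), the 6x6 test image, and the hand-picked vertical-edge kernel are all assumptions made for the example.

```python
def conv2d(image, kernel):
    """Valid 2D cross-correlation: slide one shared kernel over the image."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(fmap):
    """Element-wise nonlinearity."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool(fmap, size=2):
    """Replace each non-overlapping size x size neighborhood by its maximum."""
    return [
        [max(fmap[i + a][j + b] for a in range(size) for b in range(size))
         for j in range(0, len(fmap[0]) - size + 1, size)]
        for i in range(0, len(fmap) - size + 1, size)
    ]


# Hypothetical 6x6 image with a vertical edge: bright left half, dark right half.
image = [[1, 1, 1, 0, 0, 0] for _ in range(6)]
# The same 9 weights are reused at every position (weight sharing).
edge_kernel = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]

features = max_pool(relu(conv2d(image, edge_kernel)))
print(features)  # 2x2 map after convolution (4x4), ReLU, and 2x2 pooling
```

Note how the spatial size shrinks from 6x6 to 4x4 after the convolution and to 2x2 after pooling, while the layer itself holds only nine weights regardless of image size.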

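2.2 Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM)

Recurrent neural networks (RNNs) [25] are widely used to process sequential data such as speech, text, video, and time series, where the data at any given time or position depends on previously encountered data. At each time step $t$, the model takes the current input $x_t$ and the hidden state from the previous step, $h_{t-1}$, and outputs a target value and a new hidden state (Figure 3).

Figure 3: Architecture of a simple recurrent neural network.

RNNs typically struggle with long sequences: in many real-world applications they fail to capture long-term dependencies (although they have no theoretical limitation in this regard), and they often suffer from vanishing or exploding gradients. A type of RNN called long short-term memory (LSTM) [14] is designed to avoid these problems. The LSTM architecture (Figure 4) includes three gates (input gate, output gate, forget gate) that regulate the flow of information into and out of a memory cell, which can store values over arbitrary time intervals.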
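
The recurrence described above can be sketched as follows. This is a minimal illustration under assumptions: it uses the common simple-RNN update $h_t = \tanh(W_x x_t + W_h h_{t-1} + b)$, the names `rnn_step` and `run_rnn` are hypothetical, and the readout that maps each hidden state to a target value is omitted. It is not an LSTM; the gates are left out for brevity.

```python
import math


def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One step of a simple RNN: h_t = tanh(W_x x_t + W_h h_prev + b)."""
    h_t = []
    for i in range(len(b)):
        pre = b[i]
        pre += sum(W_x[i][j] * x_t[j] for j in range(len(x_t)))
        pre += sum(W_h[i][j] * h_prev[j] for j in range(len(h_prev)))
        h_t.append(math.tanh(pre))
    return h_t


def run_rnn(sequence, W_x, W_h, b):
    """Process a sequence step by step, carrying the hidden state forward."""
    h = [0.0] * len(b)          # initial hidden state
    states = []
    for x_t in sequence:
        h = rnn_step(x_t, h, W_x, W_h, b)
        states.append(h)
    return states


# Hypothetical 1-dimensional input and 2-dimensional hidden state.
W_x = [[0.5], [-0.5]]
W_h = [[0.1, 0.0], [0.0, 0.1]]
b = [0.0, 0.0]
states = run_rnn([[1.0], [0.0], [2.0]], W_x, W_h, b)
print(states[-1])
```

Because each step needs the previous hidden state, the loop in `run_rnn` cannot be parallelized across time steps, which is the speed limitation of RNN-based models noted later in Section 3.7.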

Figure 4: Architecture of a standard LSTM module. Courtesy of Karpathy.

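2.3 Encoder-Decoder and Auto-Encoder Models

Encoder-decoder models are a family of models that learn to map data points from an input domain to an output domain through a two-stage network: an encoder, represented by an encoding function $z = f(x)$, compresses the input $x$ into a latent-space representation $z$; a decoder, $y = g(z)$, aims to predict the output from the latent-space representation [15], [26]. The latent representation here essentially refers to a feature (vector) representation that captures the underlying semantic information of the input and is useful for predicting the output. These models are very popular for image-to-image translation problems and for sequence-to-sequence models in NLP. Figure 5 shows the block diagram of a simple encoder-decoder model. These models are usually trained by minimizing the reconstruction loss $L(y, \hat{y})$, which measures the difference between the ground-truth output $y$ and the subsequent reconstruction $\hat{y}$. The output can be an enhanced version of the image (as in image deblurring or super-resolution) or a segmentation map. Auto-encoders are a special case of encoder-decoder models in which the input and output are the same.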
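
As a rough illustration of the $z = f(x)$, $\hat{y} = g(z)$ pipeline and the reconstruction loss, here is a toy auto-encoder with fixed, hand-chosen weights. Everything in it is an assumption for the example: the functions `encode`, `decode`, and `reconstruction_loss` are hypothetical, the latent code has a single dimension, and mean squared error is used as one common choice of $L$. A real model would learn `encode` and `decode` by minimizing this loss.

```python
def encode(x):
    """Toy encoder f: compress a 2-vector into a 1-dimensional latent code z."""
    return [(x[0] + x[1]) / 2.0]


def decode(z):
    """Toy decoder g: expand the latent code back into a 2-vector."""
    return [z[0], z[0]]


def reconstruction_loss(y, y_hat):
    """Mean squared error between the target y and the prediction y_hat."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)


# Auto-encoder setting: the target is the input itself.
for x in ([2.0, 2.0], [1.0, 3.0]):
    y_hat = decode(encode(x))
    print(x, "->", y_hat, "loss =", reconstruction_loss(x, y_hat))
```

The first input lies in the subspace the latent code can represent and is reconstructed exactly; the second loses information in the bottleneck, which shows up as a nonzero loss.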

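2.4 Generative Adversarial Networks (GANs)

Generative adversarial networks (GANs) [16] are a newer family of deep learning models. A GAN consists of two networks, a generator and a discriminator, that are trained through a game-like process. The generator learns a mapping from noise $z$, drawn from a prior distribution, to a target distribution $y$ that resembles the "real" samples. The discriminator tries to distinguish generated ("fake") samples from real ones. The GAN loss combines the log of the discriminator's output on real samples with the log of one minus its output on generated samples; training is a minimax game in which the generator minimizes this objective while the discriminator maximizes it. In practice, this objective may not provide enough gradient to train the generator effectively, especially in the early stages when the discriminator can easily tell fake samples from real ones. A common remedy is to train the generator with an alternative objective: maximizing the log of the discriminator's output on generated samples.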
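
The gradient problem mentioned above can be made concrete with a small numerical sketch. It assumes the standard GAN objective, with the discriminator output written as $D = \sigma(a)$ for a logit $a$; the function names are hypothetical. The two gradient helpers use the closed forms $\frac{d}{da}\log(1-\sigma(a)) = -\sigma(a)$ and $\frac{d}{da}\left(-\log\sigma(a)\right) = -(1-\sigma(a))$.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def discriminator_loss(d_real, d_fake):
    """Negative of log D(x) + log(1 - D(G(z))); the discriminator minimizes this."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))


def generator_loss_minimax(d_fake):
    """Original generator objective: minimize log(1 - D(G(z)))."""
    return math.log(1.0 - d_fake)


def generator_loss_alternative(d_fake):
    """Alternative objective: maximize log D(G(z)), i.e. minimize -log D(G(z))."""
    return -math.log(d_fake)


def grad_minimax(logit):
    """Gradient of log(1 - sigmoid(a)) with respect to the logit a."""
    return -sigmoid(logit)


def grad_alternative(logit):
    """Gradient of -log(sigmoid(a)) with respect to the logit a."""
    return -(1.0 - sigmoid(logit))


# Early in training the discriminator confidently rejects fakes: D(G(z)) is near 0.
logit = -6.0
print("D(G(z))              =", sigmoid(logit))
print("minimax gradient     =", grad_minimax(logit))      # close to 0: little signal
print("alternative gradient =", grad_alternative(logit))  # close to -1: strong signal
```

When the discriminator is confident that a sample is fake, the original objective gives the generator almost no gradient, whereas the alternative objective still provides a strong learning signal.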

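3 Segmentation Models

This section discusses more than a hundred deep learning-based image segmentation methods proposed up to 2019. They are grouped into 10 categories according to their model architecture. Components shared by many of them include encoder and decoder parts, skip connections, multi-scale analysis, and, more recently, dilated convolution. Besides architecture, the models can also be grouped by segmentation goal into semantic, instance, panoptic, and depth segmentation.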

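3.1 Fully Convolutional Networks (FCN)

Long et al. [31] proposed the fully convolutional network (FCN) for semantic image segmentation. By replacing all fully connected layers with convolutional layers, the model can take an image of arbitrary size and output a segmentation map of the same size. Skip connections combine feature maps from deep and shallow layers to produce accurate and detailed segmentations. FCNs have been widely applied to segmentation problems in many domains.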
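
One way to see why replacing fully connected layers with convolutions removes the fixed-input-size constraint: a classifier applied as a 1x1 convolution scores every spatial location independently, so the output map always has the same height and width as its input. The sketch below illustrates only that idea under simplifying assumptions; `conv1x1` and `argmax_labels` are hypothetical names, and the downsampling, upsampling, and skip connections of a real FCN are omitted.

```python
def conv1x1(feature_map, weights):
    """Apply the same linear classifier at every pixel.

    feature_map: H x W x C_in nested lists; weights: C_out x C_in.
    Returns an H x W x C_out score map.
    """
    return [
        [
            [sum(w * f for w, f in zip(w_row, pixel)) for w_row in weights]
            for pixel in row
        ]
        for row in feature_map
    ]


def argmax_labels(score_map):
    """Pick the highest-scoring class at each pixel."""
    return [[max(range(len(p)), key=p.__getitem__) for p in row]
            for row in score_map]


weights = [[1.0, 0.0], [0.0, 1.0]]          # 2 classes, 2 input channels
small = [[[1.0, 0.0], [0.0, 2.0]]]          # 1 x 2 feature map
large = [[[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]],
         [[0.0, 1.0], [5.0, 4.0], [0.5, 0.6]]]  # 2 x 3 feature map

# The same weights work on both sizes; the label map matches the input size.
print(argmax_labels(conv1x1(small, weights)))
print(argmax_labels(conv1x1(large, weights)))
```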

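3.2 Convolutional Models with Graphical Models

Some models integrate probabilistic graphical models, such as conditional random fields (CRFs) and Markov random fields (MRFs), into deep networks to give fully convolutional networks a better grasp of global context. Such models include the CNN and fully connected CRF combination of Chen et al. [37], Fully-Connected Deep Structured Networks [38], and CRF-as-RNN [39]. (ParseNet [32], by contrast, adds global context to an FCN through a pooled global feature rather than a graphical model.)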

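3.3 Encoder-Decoder Based Models

Models built on a convolutional encoder-decoder structure are another popular family for image segmentation. They include the deconvolution-based semantic segmentation of Noh et al. [42], SegNet [15], and HRNet. These models have been successful in both general segmentation and medical image segmentation.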

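3.4 Multi-Scale and Pyramid Network Based Models

Models such as the Feature Pyramid Network (FPN) [55] and the Pyramid Scene Parsing Network (PSPNet) [56] use multi-scale analysis, building feature pyramids to better learn a global context representation. The Path Aggregation Network (PANet) [65] is also built on a pyramid structure.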

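3.5 R-CNN Based Models (for Instance Segmentation)

R-CNN and its extensions (Fast R-CNN, Faster R-CNN [63], Mask R-CNN [64]) have been successful in object detection and have also been applied to instance segmentation. Mask R-CNN performs well on instance segmentation by detecting objects and simultaneously generating a high-quality segmentation mask for each instance. Extensions such as the Path Aggregation Network (PANet) [65] and MaskLab [68] further improve instance segmentation performance.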

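3.6 Dilated Convolutional Models and the DeepLab Family

Dilated convolution (also known as "atrous" convolution) introduces another parameter to convolutional layers, the dilation rate. The dilated convolution (Figure 21) of a signal $x(i)$ is defined as $y_i = \sum_{k=1}^{K} x[i + rk]\, w[k]$, where $r$ is the dilation rate that defines the spacing between the weights of the kernel $w$. For example, a 3x3 kernel with a dilation rate of 2 has the same receptive field size as a 5x5 kernel while using only 9 parameters, so the receptive field is enlarged without increasing the computational cost. Dilated convolutions are popular in real-time segmentation, and many recent publications report using the technique. Some of the most important are the DeepLab family [78], multi-scale context aggregation [79], dense upsampling convolution and hybrid dilated convolution (DUC-HDC) [80], densely connected atrous spatial pyramid pooling (DenseASPP) [81], and the efficient neural network (ENet) [82].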
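
The definition of dilated convolution, $y_i = \sum_k x[i + rk]\, w[k]$ with dilation rate $r$, can be sketched in one dimension as follows. The helper names are hypothetical, the kernel index is 0-based here (k = 0 .. K-1) for convenience, and only positions where the kernel fits entirely inside the signal are computed.

```python
def dilated_conv1d(x, w, rate):
    """y[i] = sum_k x[i + rate * k] * w[k], for k = 0 .. K-1 (valid positions only)."""
    span = rate * (len(w) - 1)          # distance between first and last tap
    return [
        sum(x[i + rate * k] * w[k] for k in range(len(w)))
        for i in range(len(x) - span)
    ]


def effective_kernel_size(kernel_size, rate):
    """Extent covered by a kernel whose taps are spaced `rate` samples apart."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


x = [1, 2, 3, 4, 5, 6]
w = [1, 1, 1]                             # three weights in both cases below

print(dilated_conv1d(x, w, rate=1))       # ordinary convolution: taps at i, i+1, i+2
print(dilated_conv1d(x, w, rate=2))       # dilated: taps at i, i+2, i+4
print(effective_kernel_size(3, 2))        # a 3-tap kernel at rate 2 spans 5 samples
```

With rate 2, the three weights cover a window of five samples, which is the one-dimensional analogue of a 3x3 kernel with dilation rate 2 covering a 5x5 area with only 9 parameters.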

Figure 21: Dilated convolution, showing a 3x3 kernel at different dilation rates.

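DeepLabv1 [37] and DeepLabv2 [78], developed by Chen et al., are among the most popular image segmentation approaches. DeepLabv2 has three key features. The first is the use of dilated convolution to address the decreasing resolution in the network (caused by max-pooling and striding). The second is atrous spatial pyramid pooling (ASPP), which probes an incoming convolutional feature layer with filters at multiple sampling rates, capturing objects and image context at multiple scales so that objects are segmented robustly across scales. The third is improved localization of object boundaries by combining deep CNNs with probabilistic graphical models. The best DeepLab (using ResNet-101 as the backbone) reached an mIoU of 79.7% on the 2012 PASCAL VOC challenge, 45.7% on the PASCAL-Context challenge, and 70.4% on the Cityscapes challenge. Figure 22 illustrates the DeepLab model, which is similar to [37]; the main differences are the use of dilated convolution and ASPP.

Figure 22: The DeepLab model. A CNN such as VGG-16 or ResNet-101 is used in fully convolutional fashion with dilated convolution; a bilinear interpolation stage enlarges the feature maps to the original image resolution; finally, a fully connected conditional random field (CRF) refines the segmentation result to better capture object boundaries. From [78].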

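Subsequently, Chen et al. [12] proposed DeepLabv3, which combines cascaded and parallel modules of dilated convolutions. The parallel convolution modules are grouped in the ASPP, to which a 1x1 convolution and batch normalization are added. All the outputs are concatenated and processed by another 1x1 convolution to create the final output with logits for each pixel.

In 2018, Chen et al. [83] released DeepLabv3+, which uses an encoder-decoder architecture (Figure 23) with atrous separable convolution, composed of a depthwise convolution (a spatial convolution for each channel of the input) and a pointwise convolution (a 1x1 convolution that takes the depthwise convolution as input). They used the DeepLabv3 framework as the encoder. The most relevant model has a modified Xception backbone with more layers, using dilated depthwise separable convolutions instead of max pooling and batch normalization. The best DeepLabv3+, pretrained on the COCO and JFT datasets, obtained an mIoU of 89.0% on the 2012 PASCAL VOC challenge.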
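
To give a feel for why the depthwise-plus-pointwise factorization is attractive, the following sketch compares weight counts for a standard convolution and a depthwise separable one. The function names and the 256-channel example are assumptions for illustration, and biases are ignored; the numbers are not taken from the DeepLabv3+ paper.

```python
def standard_conv_params(kernel_size, c_in, c_out):
    """Every output channel has its own kernel_size x kernel_size x c_in filter."""
    return kernel_size * kernel_size * c_in * c_out


def separable_conv_params(kernel_size, c_in, c_out):
    """Depthwise: one spatial filter per input channel. Pointwise: a 1x1 convolution."""
    depthwise = kernel_size * kernel_size * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


k, c_in, c_out = 3, 256, 256   # hypothetical layer sizes
standard = standard_conv_params(k, c_in, c_out)
separable = separable_conv_params(k, c_in, c_out)
print(standard, separable, round(standard / separable, 2))
```

For this hypothetical layer the separable version needs roughly one ninth of the weights, since the 3x3 spatial filtering is no longer repeated for every output channel.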

Figure 23: The DeepLabv3+ model. From [83].

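3.7 Recurrent Neural Network Based Models

While CNNs are a natural fit for computer vision problems, they are not the only option. RNNs are useful for modeling the short- and long-term dependencies among pixels and can thereby (potentially) improve the estimation of the segmentation map. With RNNs, pixels can be linked together and processed sequentially to model global context and improve semantic segmentation. One challenge, though, is the natural 2D structure of images.
Visin et al. [84] proposed an RNN-based model for semantic segmentation called ReSeg. It is mainly based on an earlier work, ReNet [85], which was developed for image classification. Each ReNet layer is composed of four RNNs that sweep the image horizontally and vertically in both directions, encoding patches/activations and providing relevant global information. To perform image segmentation with the ReSeg model (Figure 24), ReNet layers are stacked on top of pre-trained VGG-16 convolutional layers, which extract generic local features. The ReNet layers are then followed by up-sampling layers that recover the original image resolution in the final predictions. Gated recurrent units (GRUs) are used because they offer a good balance between memory usage and computational power.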

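Figure 24. The ReSeg model. The pre-trained VGG-16 feature extractor network is not shown. From [84].
In another work, Byeon et al. [86] used long short-term memory (LSTM) networks for pixel-level segmentation and classification of scene images. They investigated two-dimensional (2D) LSTM networks for natural scene images, taking into account the complex spatial dependencies of labels. In this work, classification, segmentation, and context integration are all carried out by 2D LSTM networks, so that texture and spatial model parameters can be learned within a single model.
Liang et al. [87] proposed a semantic segmentation model based on the Graph Long Short-Term Memory (Graph LSTM) network, a generalization of LSTM from sequential or multi-dimensional data to general graph-structured data. Instead of evenly dividing an image into pixels or patches as in existing multi-dimensional LSTM structures (e.g., row, grid, and diagonal LSTMs), they take each arbitrary-shaped superpixel as a semantically consistent node and adaptively construct an undirected graph for the image, in which the spatial relations of the superpixels naturally serve as edges. Figure 25 gives a visual comparison of the traditional pixel-wise RNN model and the Graph LSTM model. To adapt the Graph LSTM model to semantic segmentation (Figure 26), LSTM layers built on a superpixel map are appended to the convolutional layers to enhance the visual features with global structure context. The convolutional features pass through 1x1 convolutional filters to generate initial confidence maps for all labels. The node updating sequence for the subsequent Graph LSTM layers is determined by a confidence-driven scheme based on the initial confidence maps, after which the Graph LSTM layers sequentially update the hidden states of all superpixel nodes.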

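Xiang and Fox [88] proposed Data Associated Recurrent Neural Networks (DA-RNNs) for joint 3D scene mapping and semantic labeling. DA-RNNs use a new recurrent neural network architecture for semantic labeling of RGB-D videos. The output of the network is integrated with mapping techniques such as Kinect-Fusion to inject semantic information into the reconstructed 3D scene.
Hu et al. [89] developed a semantic segmentation algorithm driven by natural language expressions, using a CNN to encode the image and an LSTM to encode its natural language description. This differs from traditional semantic segmentation over a predefined set of semantic classes: for example, the phrase "two men sitting on the right bench" requires segmenting only the two people on the right bench and nobody standing or sitting on another bench. To produce pixel-wise segmentation from a language expression, they proposed an end-to-end trainable recurrent and convolutional model that jointly learns to process visual and linguistic information (Figure 27). In the model considered, a recurrent LSTM network encodes the referring expression into a vector representation, and an FCN extracts a spatial feature map from the image and outputs a spatial response map for the target object. An example segmentation result of this model (for the query "people in blue coat") is shown in Figure 28.

Figure 27. The CNN+LSTM architecture for segmentation from natural language expressions. From [89].
Figure 28. Segmentation mask generated for the query "people in blue coat". From [89].
One limitation of RNN-based models is that, because of their sequential nature, they are slower than their CNN counterparts: the sequential computation cannot be parallelized easily.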

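3.8 Attention-Based Models

Attention mechanisms have been studied persistently in computer vision over the years, so it is no surprise to find them applied to semantic segmentation.
Chen et al. [90] proposed an attention mechanism that learns to softly weight multi-scale features at each pixel location. They adapted a powerful semantic segmentation model and trained it jointly with multi-scale images and the attention model (Figure 29). The attention mechanism outperforms average pooling and max pooling, and it lets the model assess the importance of features at different positions and scales.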
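
The idea of softly weighting multi-scale features at each pixel can be sketched as a per-pixel softmax over scales followed by a weighted sum. This is only an illustration under assumptions: the functions `softmax` and `fuse_scales` are hypothetical, the score maps are flattened to one value per pixel for a single class, and in the actual model the attention logits would be produced by a learned network rather than given by hand.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of numbers."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scales(scores_per_scale, attention_logits):
    """Fuse per-scale score maps with per-pixel soft weights.

    scores_per_scale[s][i]  : score at pixel i predicted from scale s
    attention_logits[s][i]  : unnormalized importance of scale s at pixel i
    """
    n_scales = len(scores_per_scale)
    n_pixels = len(scores_per_scale[0])
    fused = []
    for i in range(n_pixels):
        weights = softmax([attention_logits[s][i] for s in range(n_scales)])
        fused.append(sum(weights[s] * scores_per_scale[s][i]
                         for s in range(n_scales)))
    return fused


scores = [[1.0, 0.0],      # scores from scale 1.0
          [3.0, 4.0]]      # scores from scale 0.5

# Equal logits reduce to average pooling over scales.
print(fuse_scales(scores, [[0.0, 0.0], [0.0, 0.0]]))
# Strongly peaked logits approach picking one scale per pixel.
print(fuse_scales(scores, [[10.0, -10.0], [-10.0, 10.0]]))
```

Average pooling and (approximately) hard per-pixel selection are both special cases of this weighting, which is why a learned attention model can do at least as well as either.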

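Figure 29. Attention-based semantic segmentation model. The attention model learns to assign different weights to objects of different scales; for example, it assigns large weights to the small person (green dashed circle) for features from scale 1.0, and large weights to the large child (magenta dashed circle) for features from scale 0.5. From [90].
In contrast to other works, in which convolutional classifiers are trained to learn the representative semantic features of labeled objects, Huang et al. [91] proposed a semantic segmentation approach using reverse attention mechanisms. Their Reverse Attention Network (RAN) architecture (Figure 30) is a three-branch network that performs the direct and the reverse-attention learning processes simultaneously.
Figure 30. The reverse attention network for segmentation. From [91].
Li et al. [92] developed a Pyramid Attention Network for semantic segmentation. The model exploits the impact of global contextual information in semantic segmentation. Instead of complicated dilated convolutions and hand-designed decoder networks, they combined attention mechanisms with spatial pyramids to extract precise dense features for pixel labeling.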
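More recently, Fu et al. [93] proposed a dual attention network for scene segmentation that captures rich contextual dependencies based on the self-attention mechanism. Specifically, they append two types of attention modules on top of a dilated FCN, which model the semantic interdependencies in the spatial and channel dimensions, respectively. The position attention module selectively aggregates the feature at each position by a weighted sum of the features at all positions.
Various other works explore attention mechanisms for semantic segmentation, such as OCNet [94], which proposes object context pooling inspired by the self-attention mechanism; Expectation-Maximization Attention (EMANet) [95]; the Criss-Cross Attention Network (CCNet) [96]; end-to-end instance segmentation with recurrent attention [97]; a point-wise spatial attention network for scene parsing [98]; and a Discriminative Feature Network (DFN) [99] comprising two sub-networks: a Smooth Network (containing a Channel Attention Block and global average pooling to select the more discriminative features) and a Border Network (to make the bilateral features of the boundary distinguishable).

3.9 Generative Models and Adversarial Training

Since their introduction, generative adversarial networks (GANs) have been applied to a wide range of computer vision tasks, and image segmentation is no exception.
Luc et al. [100] proposed an adversarial training approach for semantic segmentation. They trained a convolutional semantic segmentation network (Figure 31) together with an adversarial network that discriminates ground-truth segmentation maps from those generated by the segmentation network. They showed that adversarial training improves accuracy on the Stanford Background and PASCAL VOC 2012 datasets.

Figure 31. The GAN for semantic segmentation. From [100].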
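Souly et al. [101] proposed semi-weakly supervised semantic segmentation using GANs. The approach consists of a generator network and a multi-class classifier that acts as the discriminator in the GAN framework, assigning each sample a label y from the possible classes or marking it as a fake sample (an extra class).
In another work, Hung et al. [102] developed a framework for semi-supervised semantic segmentation using an adversarial network. They designed an FCN discriminator that differentiates the predicted probability maps from the ground-truth segmentation distribution, taking the spatial resolution into account. The loss function of this model has three terms: a cross-entropy loss on the segmentation ground truth, the adversarial loss of the discriminator network, and a semi-supervised loss based on the confidence map, i.e., the output of the discriminator.
Xue et al. [103] proposed an adversarial network with a multi-scale L1 loss for medical image segmentation. They used an FCN as the segmentor to generate segmentation label maps and proposed a novel adversarial critic network with a multi-scale L1 loss function, which forces both the critic and the segmentor to learn global and local features that capture long- and short-range spatial relationships between pixels.
Several other publications report segmentation models based on adversarial training, such as cell image segmentation using GANs [104] and the segmentation and generation of the invisible parts of objects [105].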

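3.10 CNN Models with Active Contour Models

Synergy between FCNs and ACMs: researchers have explored the synergies between FCNs and active contour models (ACMs). One approach is to formulate new loss functions inspired by ACM principles. Chen et al. proposed a supervised loss layer that incorporates area and size information of the predicted masks during FCN training, and applied it to ventricle segmentation in cardiac MRI.

Other ways of combining FCNs and ACMs: some studies use the ACM merely as a post-processor of the FCN output, while other efforts attempt modest co-learning by pre-training the FCN. For example, in the work of Le et al., level-set ACMs are implemented as RNNs. For medical image segmentation, Hatamizadeh et al. proposed the integrated Deep Active Lesion Segmentation (DALS) model, which trains the FCN backbone to predict the parameter functions of a novel, locally parameterized level-set energy functional. Related work includes the Deep Structured Active Contours (DSAC) of Marcos et al. and the Deep Active Ray Network (DarNet) of Cheng et al.

Fully integrated FCN-ACM combination: Hatamizadeh et al. recently introduced a fully integrated FCN-ACM combination that is trainable end to end by backpropagation, called Deep Convolutional Active Contours (DCAC).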

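3.11 Other Models

Other popular segmentation models: this part covers several other popular deep learning architectures for semantic segmentation, including the Context Encoding Network (EncNet), RefineNet, Seednet, Object-Contextual Representations (OCR), BoxSup, graph convolutional networks, Wide ResNet, Exfuse, Feedforward-Net, and others. These models take different approaches and structures to the semantic segmentation problem.

Panoptic segmentation: the section closes by noting the rising popularity of panoptic segmentation and introducing some related work, such as the Panoptic Feature Pyramid Network, the attention-guided network for panoptic segmentation, and Seamless Scene Segmentation.

Timeline: Figure 32 shows a timeline of popular DL-based semantic segmentation and instance segmentation works since 2014, illustrating how the field has developed.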

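4 Image Segmentation Datasets

This section summarizes some of the most widely used image segmentation datasets. The datasets are grouped into three categories, 2D images, 2.5D RGB-D (color + depth) images, and 3D images, and the key characteristics of each dataset are given. The listed datasets have pixel-wise labels, which can be used to evaluate model performance.

It is worth mentioning that some works use data augmentation to increase the number of labeled samples, especially when dealing with small datasets (e.g., in the medical domain). Data augmentation increases the number of training samples by applying a set of transformations (in the data space, in the feature space, or sometimes both) to the images, that is, to both the input image and the segmentation map. Typical transformations include translation, reflection, rotation, warping, scaling, color space shifting, cropping, and projection onto principal components. Data augmentation has been shown to improve model performance, especially when learning from limited datasets such as those in medical image analysis. It also helps models converge faster, reduces the chance of over-fitting, and improves generalization. For some small datasets, data augmentation has been shown to boost model performance by more than 20%.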
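
A point that is easy to get wrong in practice is that a geometric transformation must be applied identically to the input image and to its segmentation map, otherwise the labels no longer line up with the pixels. The sketch below illustrates this with random flips on nested lists; the helper names are hypothetical and only two of the transformations listed above are shown.

```python
import random


def hflip(grid):
    """Mirror a 2D grid left to right."""
    return [list(reversed(row)) for row in grid]


def vflip(grid):
    """Mirror a 2D grid top to bottom."""
    return [list(row) for row in reversed(grid)]


def augment_pair(image, mask, rng):
    """Apply the same randomly chosen flips to an image and its segmentation map."""
    if rng.random() < 0.5:
        image, mask = hflip(image), hflip(mask)
    if rng.random() < 0.5:
        image, mask = vflip(image), vflip(mask)
    return image, mask


# Hypothetical 3x3 image; the mask marks pixels brighter than 5.
image = [[1, 9, 2],
         [8, 3, 7],
         [4, 6, 0]]
mask = [[1 if v > 5 else 0 for v in row] for row in image]

rng = random.Random(0)   # seeded for reproducibility
aug_image, aug_mask = augment_pair(image, mask, rng)
print(aug_image)
print(aug_mask)
```

Whatever flips are drawn, the augmented mask still marks exactly the bright pixels of the augmented image, because both went through the same transformation.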

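4.1 2D Datasets

PASCAL Visual Object Classes (VOC) [145]
Tasks: classification, segmentation, detection, action recognition, and person layout.
Classes: 21 labels for the segmentation task (20 object classes such as vehicles, household objects, and animals, e.g., aeroplane and bicycle, plus background).
The dataset is split into training and validation sets, with a private test set used for the challenge.
PASCAL Context [147]
An extension of the PASCAL VOC 2010 detection challenge.
Contains more than 400 classes, divided into objects, stuff, and hybrids.
A subset of 59 frequent classes is commonly used.
Microsoft Common Objects in Context (MS COCO) [148]
A large-scale object detection, segmentation, and captioning dataset.
Contains images of complex everyday scenes with 91 object types.
More than 82k training images, 40.5k validation images, and 80k test images.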
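Cityscapes [149]
Focuses on semantic understanding of urban street scenes.
Contains stereo video sequences recorded in 50 cities, with pixel-level annotation of 5k frames.
30 classes, grouped into categories such as flat surfaces, humans, vehicles, and constructions.
ADE20K / MIT Scene Parsing [132]
Provides a standard training and evaluation platform for scene parsing algorithms.
Built on the ADE20K dataset, with more than 20K scene images.
150 semantic categories.
SiftFlow [150]
2,688 images taken from a subset of the LabelMe database.
256x256-pixel images covering 8 different outdoor scenes, including streets, mountains, fields, beaches, and buildings.
33 semantic classes in total.
KITTI [155]
A popular dataset for mobile robotics and autonomous driving.
Contains videos of traffic scenarios. The original dataset does not provide ground truth for semantic segmentation, but part of the images have been manually annotated.
Other datasets
These include the Semantic Boundaries Dataset (SBD) [157], PASCAL Part [158], SYNTHIA [159], and Adobe's Portrait Segmentation [160].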

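4.2 2.5D Datasets

NYU-D V2 [161]
RGB-D images of indoor scenes recorded with the Microsoft Kinect.
Includes 1,449 densely labeled pairs of RGB and depth images.
Objects are labeled with a class and an instance number.
SUN-3D [162]
A large-scale RGB-D video dataset with 415 sequences.
Annotated frames come with the semantic segmentation of the objects in the scene and camera pose information.
SUN RGB-D [163]
Provides an RGB-D benchmark for a variety of scene understanding tasks.
Contains 10,000 densely annotated RGB-D images, including 2D polygons and 3D bounding boxes.
UW RGB-D Object Dataset [164]
300 common household objects recorded with a Kinect-style 3D camera.
Organized into 51 categories using WordNet hypernym-hyponym relationships.
ScanNet [165]
An RGB-D video dataset containing more than 1,500 scans.
Annotated with 3D camera poses, surface reconstructions, and instance-level semantic segmentations.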

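4.3 3D Datasets

Stanford 2D-3D
Provides mutually registered modalities from the 2D, 2.5D, and 3D domains.
Includes more than 70,000 RGB images together with depth, surface normals, semantic annotations, and more.
Collected in 6 indoor areas.
ShapeNet Core
A subset of the ShapeNet dataset covering 55 common object categories.
Includes about 51,300 unique 3D models.
Sydney Urban Objects Dataset
Contains common urban road objects scanned in the Sydney central business district.
631 individual scans of objects across classes such as vehicles, pedestrians, signs, and trees.
Together, these datasets cover the 2D, 2.5D, and 3D domains and support tasks such as image segmentation, object detection, and scene understanding, providing a foundation for computer vision and deep learning research.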

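5 Performance Evaluation

This section first summarizes some popular metrics for evaluating the performance of segmentation models and then reports the quantitative performance of promising DL-based segmentation models on popular datasets.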

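5.1 Metrics for Segmentation Models

Ideally, a model should be evaluated in several respects, such as quantitative accuracy, speed (inference time), and storage requirements (memory footprint). So far, however, most research has focused on metrics for model accuracy. The following are popular metrics for assessing the accuracy of segmentation algorithms:

Pixel Accuracy (PA): the ratio of correctly classified pixels to the total number of pixels.

Mean Pixel Accuracy (MPA): the ratio of correct pixels is computed per class and then averaged over all classes.

Intersection over Union (IoU): the area of intersection between the predicted segmentation map and the ground truth, divided by the area of their union.

Mean IoU (mIoU): the average IoU over all classes.

Precision / Recall / F1 score: popular metrics for reporting the accuracy of many classical image segmentation models.

Dice coefficient: another popular metric for image segmentation, defined as twice the overlap area of the predicted and ground-truth maps divided by the total number of pixels in both maps.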
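
The accuracy metrics listed above can all be computed from a class confusion matrix. The following is a minimal sketch; the function names are hypothetical, labels are given as flat lists of class indices, and classes that never occur are simply skipped when averaging (other conventions exist).

```python
def confusion_matrix(pred, truth, num_classes):
    """cm[i][j] counts pixels of ground-truth class i that were predicted as class j."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, truth):
        cm[t][p] += 1
    return cm


def pixel_accuracy(cm):
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def mean_pixel_accuracy(cm):
    per_class = [cm[i][i] / sum(cm[i]) for i in range(len(cm)) if sum(cm[i]) > 0]
    return sum(per_class) / len(per_class)


def iou_per_class(cm):
    ious = []
    for i in range(len(cm)):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp                      # missed pixels of class i
        fp = sum(row[i] for row in cm) - tp       # pixels wrongly assigned to class i
        union = tp + fp + fn
        if union > 0:
            ious.append(tp / union)
    return ious


def mean_iou(cm):
    ious = iou_per_class(cm)
    return sum(ious) / len(ious)


def dice(cm, cls):
    """Twice the overlap divided by the total size of prediction and ground truth."""
    tp = cm[cls][cls]
    fn = sum(cm[cls]) - tp
    fp = sum(row[cls] for row in cm) - tp
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical 6-pixel example with two classes.
pred = [0, 0, 1, 1, 1, 0]
truth = [0, 1, 1, 1, 0, 0]
cm = confusion_matrix(pred, truth, num_classes=2)
print(cm)
print(pixel_accuracy(cm), mean_pixel_accuracy(cm), mean_iou(cm), dice(cm, 1))
```

For a single class, Dice and IoU are related by Dice = 2 IoU / (1 + IoU), so they rank predictions identically even though their values differ.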

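5.2 Quantitative Performance of DL-Based Models

This section summarizes the performance of several of the previously discussed algorithms on popular segmentation benchmarks. Although most models report their performance on standard datasets using standard metrics, some fail to do so, which makes an across-the-board comparison difficult. In addition, only a small number of papers provide extra information, such as execution time and memory footprint, in a reproducible way. This information matters for industrial applications of segmentation models that run with limited computational power and storage.

The performance of several prominent DL-based segmentation models on different datasets is summarized as follows:

PASCAL VOC test set: Table 1 of the survey summarizes performance on this test set and shows that accuracy has improved markedly since the introduction of FCN, the first DL-based image segmentation model.

Cityscapes test set: Table 2 focuses on this dataset, where the latest models achieve about a 23% gain over the initial FCN model.

MS COCO stuff test set: Table 3 focuses on this dataset, which is more challenging than PASCAL VOC and Cityscapes.

ADE20k validation set: Table 4 summarizes performance on this validation set, which is also more challenging than PASCAL VOC and Cityscapes.

These summaries show that the performance of deep segmentation models has improved significantly over the past 5-6 years, but some works lack reproducibility, and there is still room for improvement.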

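6 Challenges and Opportunities

There is no doubt that image segmentation has made great progress thanks to deep learning, but several challenges remain. Below are some promising research directions that may help advance image segmentation algorithms further.

6.1 More Challenging Datasets

Several large-scale image datasets have been created for semantic and instance segmentation. However, there remains a need for more challenging datasets, as well as datasets for different kinds of images. For still images, datasets with a large number of objects and overlapping objects would be very valuable. They would enable training models that better handle dense object scenarios and the large overlaps among objects that are common in real-world scenes.

With the rising popularity of 3D image segmentation, especially in medical image analysis, there is also a strong need for large-scale 3D image datasets. These datasets are more difficult to create than their lower-dimensional counterparts. Existing datasets for 3D image segmentation are typically not large enough, and some are synthetic, so larger and more challenging 3D image datasets would be very valuable.

6.2 Interpretable Deep Models

Although DL-based models have achieved promising performance on challenging benchmarks, open questions about these models remain. For example, what exactly are deep models learning? How should we interpret the features they learn? What is the minimal neural architecture that can achieve a given segmentation accuracy on a given dataset? Some techniques are available for visualizing the convolutional kernels these models learn, but an in-depth study of their underlying behavior and dynamics is still lacking. A better theoretical understanding of these models could lead to models that are better suited to various segmentation scenarios.

6.3 Weakly-Supervised and Unsupervised Learning

Weakly-supervised learning (a.k.a. few-shot learning) [182] and unsupervised learning [183] are becoming very active research areas. These techniques are especially valuable for image segmentation, because collecting labeled samples for segmentation problems is difficult in many application domains, particularly in medical image analysis. The transfer learning approach is to train a generic image segmentation model on a large set of labeled samples (perhaps from a public benchmark) and then fine-tune it on a few samples from a specific target application. Self-supervised learning is another promising direction that is attracting much attention in various fields. With its help, the many details present in images can be exploited to train segmentation models with far fewer training samples. Models based on reinforcement learning are another potential future direction, as they have received scarcely any attention for image segmentation. For example, MOREL [184] introduced a deep reinforcement learning approach for moving object segmentation in videos.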

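6.4 Real-Time Models for Various Applications

In many applications accuracy is the most important factor. In others, however, it is also critical to have segmentation models that run in near real time, or at least near common camera frame rates (at least 25 frames per second). This is useful, for example, for computer vision systems deployed in autonomous vehicles. Most current models are far from this frame rate; for example, FCN-8 takes roughly 100 ms to process a low-resolution image. Models based on dilated convolution help to increase the speed of segmentation models to some extent, but there is still plenty of room for improvement.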
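
The gap between the quoted latency and the target frame rate follows from simple arithmetic, sketched below with two hypothetical helper functions. The 100 ms and 25 frames-per-second figures are the ones stated above; everything else is illustrative.

```python
def frames_per_second(latency_ms):
    """Throughput if each frame takes latency_ms and frames are processed one at a time."""
    return 1000.0 / latency_ms


def latency_budget_ms(target_fps):
    """Maximum per-frame processing time that still sustains target_fps."""
    return 1000.0 / target_fps


print(frames_per_second(100))    # a model that needs about 100 ms per image
print(latency_budget_ms(25))     # per-frame budget for 25 frames per second
```

A model that needs about 100 ms per image therefore reaches about 10 frames per second, whereas sustaining 25 frames per second leaves a budget of 40 ms per frame.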

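6.5 Memory-Efficient Models

Many modern segmentation models require a significant amount of memory, even at inference time. So far, much effort has gone into improving the accuracy of these models, but to fit them into specific devices such as mobile phones, the networks must be simplified. This can be done by using simpler models, by applying model compression techniques, or even by training a complex model and then using knowledge distillation to compress it into a smaller, memory-efficient network.

6.6 3D Point-Cloud Segmentation

Many works focus on 2D image segmentation, but far fewer address 3D point-cloud segmentation. Interest in point-cloud segmentation is growing, however, and it has a wide range of applications in 3D modeling, self-driving cars, robotics, building modeling, and more. Dealing with 3D unordered and unstructured data such as point clouds poses several challenges. For example, the best way to apply CNNs and other classical deep learning architectures to point clouds is unclear. Graph-based deep models are a potential research direction for point-cloud segmentation and could enable additional industrial applications of these data.

6.7 Application Scenarios

This part briefly looks at some application scenarios of recent DL-based segmentation methods and the challenges that lie ahead. Most notably, these methods have been applied successfully to remote sensing [185], including techniques for urban planning [186] and precision agriculture [187]. Remote sensing images collected by airborne platforms [188] and drones [189] have also been segmented using DL-based techniques, which offers an opportunity to address important environmental problems such as those involving climate change. DL-based segmentation is also very important in medical imaging [190]. Here, one opportunity is to design standardized image databases that can be used to evaluate fast-spreading new diseases and pandemics. Last but not least, DL-based segmentation techniques in biology [191] and in the evaluation of construction materials [192] offer the opportunity to tackle highly challenging applications that involve large amounts of related image data but limited reference information.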

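7 Conclusion

The survey reviews more than 100 image segmentation algorithms based on deep learning models, which have achieved impressive performance on various image segmentation tasks and benchmarks. The algorithms are grouped into ten categories, such as CNN and FCN, RNN, R-CNN, dilated CNN, attention-based models, and generative and adversarial models. It summarizes the quantitative performance of these models on popular benchmarks such as the PASCAL VOC, MS COCO, Cityscapes, and ADE20k datasets. Finally, it discusses some of the open challenges and promising future research directions for image segmentation.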

References

[1] R. Szeliski, Computer vision: algorithms and applications. Springer Science & Business Media, 2010.
[2] D. Forsyth and J. Ponce, Computer vision: a modern approach. Prentice Hall Professional Technical Reference, 2002.
[3] N. Otsu, “A threshold selection method from gray-level histograms,” IEEE transactions on systems, man, and cybernetics, vol. 9, no. 1, pp. 62–66, 1979.
[4] R. Nock and F. Nielsen, “Statistical region merging,” IEEE Transactions on pattern analysis and machine intelligence, vol. 26, no. 11, pp. 1452–1458, 2004.
[5] N. Dhanachandra, K. Manglem, and Y. J. Chanu, “Image segmentation using k-means clustering algorithm and subtractive clustering algorithm,” Procedia Computer Science, vol. 54, pp. 764–771, 2015.
[6] L. Najman and M. Schmitt, “Watershed of a continuous function,” Signal Processing, vol. 38, no. 1, pp. 99–112, 1994.
[7] M. Kass, A. Witkin, and D. Terzopoulos, “Snakes: Active contour models,” International journal of computer vision, vol. 1, no. 4, pp. 321–331, 1988.
[8] Y. Boykov, O. Veksler, and R. Zabih, “Fast approximate energy minimization via graph cuts,” IEEE Transactions on pattern analysis and machine intelligence, vol. 23, no. 11, pp. 1222–1239, 2001.
[9] N. Plath, M. Toussaint, and S. Nakajima, “Multi-class image segmentation using conditional random fields and global classification,” in Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009, pp. 817–824.
[10] J.-L. Starck, M. Elad, and D. L. Donoho, “Image decomposition via the combination of sparse representations and a variational approach,” IEEE transactions on image processing, vol. 14, no. 10, pp. 1570–1582, 2005.
[11] S. Minaee and Y. Wang, “An admm approach to masked signal decomposition using subspace representation,” IEEE Transactions on Image Processing, vol. 28, no. 7, pp. 3192–3204, 2019.
[12] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking atrous convolution for semantic image segmentation,” arXiv preprint arXiv:1706.05587, 2017.
[13] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner et al., “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[14] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[15] V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 12, pp. 2481–2495, 2017.
[16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
[17] K. Fukushima, “Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position,” Biological cybernetics, vol. 36, no. 4, pp. 193–202, 1980.
[18] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang, “Phoneme recognition using time-delay neural networks,” IEEE transactions on acoustics, speech, and signal processing, vol. 37, no. 3, pp. 328–339, 1989.
[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
[20] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[21] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[22] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
[23] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.

[24] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4700–4708.
[25] D. E. Rumelhart, G. E. Hinton, R. J. Williams et al., “Learning representations by back-propagating errors,” Cognitive modeling, vol. 5, no. 3, p. 1, 1988.
[26] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT press, 2016.
[27] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, 2015.
[28] M. Mirza and S. Osindero, “Conditional generative adversarial nets,” arXiv preprint arXiv:1411.1784, 2014.
[29] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein gan,” arXiv preprint arXiv:1701.07875, 2017.
[30] https://github.com/hindupuravinash/the-gan-zoo.
[31] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
[32] W. Liu, A. Rabinovich, and A. C. Berg, “Parsenet: Looking wider to see better,” arXiv preprint arXiv:1506.04579, 2015.
[33] G. Wang, W. Li, S. Ourselin, and T. Vercauteren, “Automatic brain tumor segmentation using cascaded anisotropic convolutional neural networks,” in International MICCAI Brainlesion Workshop. Springer, 2017, pp. 178–190.
[34] Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei, “Fully convolutional instance-aware semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2359–2367.
[35] Y. Yuan, M. Chao, and Y.-C. Lo, “Automatic skin lesion segmentation using deep fully convolutional networks with jaccard distance,” IEEE transactions on medical imaging, vol. 36, no. 9, pp. 1876–1886, 2017.
[36] N. Liu, H. Li, M. Zhang, J. Liu, Z. Sun, and T. Tan, “Accurate iris segmentation in non-cooperative environments using fully convolutional networks,” in 2016 International Conference on Biometrics (ICB). IEEE, 2016, pp. 1–8.
[37] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Semantic image segmentation with deep convolutional nets and fully connected crfs,” arXiv preprint arXiv:1412.7062, 2014.
[38] A. G. Schwing and R. Urtasun, “Fully connected deep structured networks,” arXiv preprint arXiv:1503.02351, 2015.
[39] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr, “Conditional random fields as recurrent neural networks,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1529–1537.
[40] G. Lin, C. Shen, A. Van Den Hengel, and I. Reid, “Efficient piecewise training of deep structured models for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 3194–3203.
[41] Z. Liu, X. Li, P. Luo, C.-C. Loy, and X. Tang, “Semantic image segmentation via deep parsing network,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1377–1385.
[42] H. Noh, S. Hong, and B. Han, “Learning deconvolution network for semantic segmentation,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1520–1528.
[43] A. Kendall, V. Badrinarayanan, and R. Cipolla, “Bayesian segnet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding,” arXiv preprint arXiv:1511.02680, 2015.
[44] Y. Yuan, X. Chen, and J. Wang, “Object-contextual representations for semantic segmentation,” arXiv preprint arXiv:1909.11065, 2019.
[45] J. Fu, J. Liu, Y. Wang, J. Zhou, C. Wang, and H. Lu, “Stacked deconvolutional network for semantic segmentation,” IEEE Transactions on Image Processing, 2019.
[46] A. Chaurasia and E. Culurciello, “Linknet: Exploiting encoder representations for efficient semantic segmentation,” in 2017 IEEE Visual Communications and Image Processing (VCIP). IEEE, 2017, pp. 1–4.
[47] X. Xia and B. Kulis, “W-net: A deep model for fully unsupervised image segmentation,” arXiv preprint arXiv:1711.08506, 2017.
[48] Y. Cheng, R. Cai, Z. Li, X. Zhao, and K. Huang, “Locality-sensitive deconvolution networks with gated fusion for rgb-d indoor semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3029–3037.

[49] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241.
[50] F. Milletari, N. Navab, and S.-A. Ahmadi, “V-net: Fully convolutional neural networks for volumetric medical image segmentation,” in 2016 Fourth International Conference on 3D Vision (3DV). IEEE, 2016, pp. 565–571.
[51] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger, “3d u-net: learning dense volumetric segmentation from sparse annotation,” in International conference on medical image computing and computer-assisted intervention. Springer, 2016, pp. 424–432.
[52] Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, “Unet++: A nested u-net architecture for medical image segmentation,” in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Springer, 2018, pp. 3–11.
[53] Z. Zhang, Q. Liu, and Y. Wang, “Road extraction by deep residual u-net,” IEEE Geoscience and Remote Sensing Letters, vol. 15, no. 5, pp. 749–753, 2018.
[54] T. Brosch, L. Y. Tang, Y. Yoo, D. K. Li, A. Traboulsee, and R. Tam, “Deep 3d convolutional encoder networks with shortcuts for multiscale feature integration applied to multiple sclerosis lesion segmentation,” IEEE transactions on medical imaging, vol. 35, no. 5, pp. 1229–1239, 2016.
[55] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2117–2125.
[56] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2881–2890.
[57] G. Ghiasi and C. C. Fowlkes, “Laplacian pyramid reconstruction and refinement for semantic segmentation,” in European Conference on Computer Vision. Springer, 2016, pp. 519–534.
[58] J. He, Z. Deng, and Y. Qiao, “Dynamic multi-scale filters for semantic segmentation,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 3562–3572.
[59] H. Ding, X. Jiang, B. Shuai, A. Qun Liu, and G. Wang, “Context contrasted feature and gated multi-scale aggregation for scene segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2393–2402.
[60] J. He, Z. Deng, L. Zhou, Y. Wang, and Y. Qiao, “Adaptive pyramid context network for semantic segmentation,” in Conference on Computer Vision and Pattern Recognition, 2019, pp. 7519–7528.
[61] D. Lin, Y. Ji, D. Lischinski, D. Cohen-Or, and H. Huang, “Multi-scale context intertwining for semantic segmentation,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 603–619.
[62] G. Li, Y. Xie, L. Lin, and Y. Yu, “Instance-level salient object segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2386–2395.
[63] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
[64] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961–2969.
[65] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, “Path aggregation network for instance segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8759–8768.
[66] J. Dai, K. He, and J. Sun, “Instance-aware semantic segmentation via multi-task network cascades,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3150–3158.
[67] R. Hu, P. Dollár, K. He, T. Darrell, and R. Girshick, “Learning to segment every thing,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4233–4241.
[68] L.-C. Chen, A. Hermans, G. Papandreou, F. Schroff, P. Wang, and H. Adam, “Masklab: Instance segmentation by refining object detection with semantic and direction features,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4013–4022.
[69] X. Chen, R. Girshick, K. He, and P. Dollár, “Tensormask: A foundation for dense object segmentation,” arXiv preprint arXiv:1903.12174, 2019.

[70] J. Dai, Y. Li, K. He, and J. Sun, “R-fcn: Object detection via region-based fully convolutional networks,” in Advances in neural information processing systems, 2016, pp. 379–387.
[71] P. O. Pinheiro, R. Collobert, and P. Dollár, “Learning to segment object candidates,” in Advances in Neural Information Processing Systems, 2015, pp. 1990–1998.
[72] E. Xie, P. Sun, X. Song, W. Wang, X. Liu, D. Liang, C. Shen, and P. Luo, “Polarmask: Single shot instance segmentation with polar representation,” arXiv preprint arXiv:1909.13226, 2019.
[73] Z. Hayder, X. He, and M. Salzmann, “Boundary-aware instance segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5696–5704.
[74] Y. Lee and J. Park, “Centermask: Real-time anchor-free instance segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 13 906–13 915.
[75] M. Bai and R. Urtasun, “Deep watershed transform for instance segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5221–5229.
[76] D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee, “Yolact: Real-time instance segmentation,” in Proceedings of the IEEE international conference on computer vision, 2019, pp. 9157–9166.
[77] A. Fathi, Z. Wojna, V. Rathod, P. Wang, H. O. Song, S. Guadarrama, and K. P. Murphy, “Semantic instance segmentation via deep metric learning,” arXiv preprint arXiv:1703.10277, 2017.
[78] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 4, pp. 834–848, 2017.
[79] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” arXiv preprint arXiv:1511.07122, 2015.
[80] P. Wang, P. Chen, Y. Yuan, D. Liu, Z. Huang, X. Hou, and G. Cottrell, “Understanding convolution for semantic segmentation,” in winter conference on applications of computer vision. IEEE, 2018, pp. 1451–1460.
[81] M. Yang, K. Yu, C. Zhang, Z. Li, and K. Yang, “Denseaspp for semantic segmentation in street scenes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3684–3692.
[82] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello, “Enet: A deep neural network architecture for real-time semantic segmentation,” arXiv preprint arXiv:1606.02147, 2016.
[83] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 801–818.
[84] F. Visin, M. Ciccone, A. Romero, K. Kastner, K. Cho, Y. Bengio, M. Matteucci, and A. Courville, “Reseg: A recurrent neural network-based model for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2016, pp. 41–48.
[85] F. Visin, K. Kastner, K. Cho, M. Matteucci, A. Courville, and Y. Bengio, “Renet: A recurrent neural network based alternative to convolutional networks,” arXiv preprint arXiv:1505.00393, 2015.
[86] W. Byeon, T. M. Breuel, F. Raue, and M. Liwicki, “Scene labeling with lstm recurrent neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3547–3555.
[87] X. Liang, X. Shen, J. Feng, L. Lin, and S. Yan, “Semantic object parsing with graph lstm,” in European Conference on Computer Vision. Springer, 2016, pp. 125–143.
[88] Y. Xiang and D. Fox, “Da-rnn: Semantic mapping with data associated recurrent neural networks,” arXiv:1703.03098, 2017.
[89] R. Hu, M. Rohrbach, and T. Darrell, “Segmentation from natural language expressions,” in European Conference on Computer Vision. Springer, 2016, pp. 108–124.
[90] L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille, “Attention to scale: Scale-aware semantic image segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 3640–3649.
[91] Q. Huang, C. Xia, C. Wu, S. Li, Y. Wang, Y. Song, and C.-C. J. Kuo, “Semantic segmentation with reverse attention,” arXiv preprint arXiv:1707.06426, 2017.
[92] H. Li, P. Xiong, J. An, and L. Wang, “Pyramid attention network for semantic segmentation,” arXiv preprint arXiv:1805.10180, 2018.
[93] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu, “Dual attention network for scene segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3146–3154.
[94] Y. Yuan and J. Wang, “Ocnet: Object context network for scene parsing,” arXiv preprint arXiv:1809.00916, 2018.
[95] X. Li, Z. Zhong, J. Wu, Y. Yang, Z. Lin, and H. Liu, “Expectation-maximization attention networks for semantic segmentation,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 9167–9176.
[96] Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, and W. Liu, “Ccnet: Criss-cross attention for semantic segmentation,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 603–612.
[97] M. Ren and R. S. Zemel, “End-to-end instance segmentation with recurrent attention,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6656–6664.
[98] H. Zhao, Y. Zhang, S. Liu, J. Shi, C. Change Loy, D. Lin, and J. Jia, “Psanet: Point-wise spatial attention network for scene parsing,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 267–283.
[99] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, “Learning a discriminative feature network for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1857–1866.
[100] P. Luc, C. Couprie, S. Chintala, and J. Verbeek, “Semantic segmentation using adversarial networks,” arXiv preprint arXiv:1611.08408, 2016.
[101] N. Souly, C. Spampinato, and M. Shah, “Semi supervised semantic segmentation using generative adversarial network,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5688–5696.
[102] W.-C. Hung, Y.-H. Tsai, Y.-T. Liou, Y.-Y. Lin, and M.-H. Yang, “Adversarial learning for semi-supervised semantic segmentation,” arXiv preprint arXiv:1802.07934, 2018.
[103] Y. Xue, T. Xu, H. Zhang, L. R. Long, and X. Huang, “Segan: Adversarial network with multi-scale l1 loss for medical image segmentation,” Neuroinformatics, vol. 16, no. 3-4, pp. 383–392, 2018.
[104] M. Majurski, P. Manescu, S. Padi, N. Schaub, N. Hotaling, C. Simon Jr, and P. Bajcsy, “Cell image segmentation using generative adversarial networks, transfer learning, and augmentations,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019, pp. 0–0.
[105] K. Ehsani, R. Mottaghi, and A. Farhadi, “Segan: Segmenting and generating the invisible,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6144–6153.
[106] T. F. Chan and L. A. Vese, “Active contours without edges,” IEEE Transactions on Image Processing, vol. 10, no. 2, pp. 266–277, 2001.
[107] X. Chen, B. M. Williams, S. R. Vallabhaneni, G. Czanner, R. Williams, and Y. Zheng, “Learning active contour models for medical image segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 11632–11640.
[108] T. H. N. Le, K. G. Quach, K. Luu, C. N. Duong, and M. Savvides, “Reformulating level sets as deep recurrent neural network approach to semantic segmentation,” IEEE Transactions on Image Processing, vol. 27, no. 5, pp. 2393–2407, 2018.
[109] C. Rupprecht, E. Huaroc, M. Baust, and N. Navab, “Deep active contours,” arXiv preprint arXiv:1607.05074, 2016.
[110] A. Hatamizadeh, A. Hoogi, D. Sengupta, W. Lu, B. Wilcox, D. Rubin, and D. Terzopoulos, “Deep active lesion segmentation,” in Proc. International Workshop on Machine Learning in Medical Imaging, ser. Lecture Notes in Computer Science, vol. 11861. Springer, 2019, pp. 98–105.
[111] D. Marcos, D. Tuia, B. Kellenberger, L. Zhang, M. Bai, R. Liao, and R. Urtasun, “Learning deep structured active contours end-to-end,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 8877–8885.
[112] D. Cheng, R. Liao, S. Fidler, and R. Urtasun, “Darnet: Deep active ray network for building segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 7431–7439.
[113] A. Hatamizadeh, D. Sengupta, and D. Terzopoulos, “End-to-end deep convolutional active contours for image segmentation,” arXiv preprint arXiv:1909.13359, 2019.
[114] H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and A. Agrawal, “Context encoding for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7151–7160.

[115] G. Lin, A. Milan, C. Shen, and I. Reid, “Refinenet: Multi-path refinement networks for high-resolution semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1925–1934.
[116] G. Song, H. Myeong, and K. Mu Lee, “Seednet: Automatic seed generation with deep reinforcement learning for robust interactive segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1760–1768.
[117] J. Dai, K. He, and J. Sun, “Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1635–1643.
[118] C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun, “Large kernel matters–improve semantic segmentation by global convolutional network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4353–4361.
[119] Z. Wu, C. Shen, and A. Van Den Hengel, “Wider or deeper: Revisiting the resnet model for visual recognition,” Pattern Recognition, vol. 90, pp. 119–133, 2019.
[120] Z. Zhang, X. Zhang, C. Peng, X. Xue, and J. Sun, “Exfuse: Enhancing feature fusion for semantic segmentation,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 269–284.
[121] M. Mostajabi, P. Yadollahpour, and G. Shakhnarovich, “Feed-forward semantic segmentation with zoom-out features,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3376–3385.
[122] W. Wang, J. Shen, and F. Porikli, “Saliency-aware geodesic video object segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3395–3402.
[123] P. Luo, G. Wang, L. Lin, and X. Wang, “Deep dual learning for semantic image segmentation,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2718–2726.
[124] X. Li, Z. Jie, W. Wang, C. Liu, J. Yang, X. Shen, Z. Lin, Q. Chen, S. Yan, and J. Feng, “Foveanet: Perspective-aware urban scene parsing,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 784–792.
[125] I. Kreso, S. Segvic, and J. Krapac, “Ladder-style densenets for semantic segmentation of large natural images,” in IEEE International Conference on Computer Vision, 2017, pp. 238–245.
[126] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, “Bisenet: Bilateral segmentation network for real-time semantic segmentation,” in European Conference on Computer Vision, 2018, pp. 325–341.
[127] B. Cheng, L.-C. Chen, Y. Wei, Y. Zhu, Z. Huang, J. Xiong, T. S. Huang, W.-M. Hwu, and H. Shi, “Spgnet: Semantic prediction guidance for scene parsing,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 5218–5228.
[128] T. Takikawa, D. Acuna, V. Jampani, and S. Fidler, “Gated-scnn: Gated shape cnns for semantic segmentation,” in IEEE International Conference on Computer Vision, 2019, pp. 5229–5238.
[129] J. Fu, J. Liu, Y. Wang, Y. Li, Y. Bao, J. Tang, and H. Lu, “Adaptive context network for scene parsing,” in Proceedings of the IEEE international conference on computer vision, 2019, pp. 6748–6757.
[130] X. Liang, H. Zhou, and E. Xing, “Dynamic-structured semantic propagation network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 752–761.
[131] X. Liang, Z. Hu, H. Zhang, L. Lin, and E. P. Xing, “Symbolic graph reasoning meets convolutions,” in Advances in Neural Information Processing Systems, 2018, pp. 1853–1863.
[132] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, “Scene parsing through ade20k dataset,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[133] R. Zhang, S. Tang, Y. Zhang, J. Li, and S. Yan, “Scale-adaptive convolutions for scene parsing,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2031–2039.
[134] T. Xiao, Y. Liu, B. Zhou, Y. Jiang, and J. Sun, “Unified perceptual parsing for scene understanding,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 418–434.
[135] B. Zoph, G. Ghiasi, T.-Y. Lin, Y. Cui, H. Liu, E. D. Cubuk, and Q. V. Le, “Rethinking pre-training and self-training,” arXiv preprint arXiv:2006.06882, 2020.
[136] X. Zhang, H. Xu, H. Mo, J. Tan, C. Yang, and W. Ren, “Dcnas: Densely connected neural architecture search for semantic image segmentation,” arXiv preprint arXiv:2003.11883, 2020.
[137] A. Tao, K. Sapra, and B. Catanzaro, “Hierarchical multi-scale attention for semantic segmentation,” arXiv preprint arXiv:2005.10821, 2020.

[138] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár, “Panoptic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 9404–9413.
[139] A. Kirillov, R. Girshick, K. He, and P. Dollár, “Panoptic feature pyramid networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 6399–6408.
[140] Y. Li, X. Chen, Z. Zhu, L. Xie, G. Huang, D. Du, and X. Wang, “Attention-guided unified network for panoptic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[141] L. Porzi, S. R. Bulo, A. Colovic, and P. Kontschieder, “Seamless scene segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 8277–8286.
[142] B. Cheng, M. D. Collins, Y. Zhu, T. Liu, T. S. Huang, H. Adam, and L.-C. Chen, “Panoptic-deeplab,” arXiv preprint arXiv:1910.04751, 2019.
[143] Y. Xiong, R. Liao, H. Zhao, R. Hu, M. Bai, E. Yumer, and R. Urtasun, “Upsnet: A unified panoptic segmentation network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 8818–8826.
[144] R. Mohan and A. Valada, “Efficientps: Efficient panoptic segmentation,” arXiv preprint arXiv:2004.02307, 2020.
[145] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” International journal of computer vision, vol. 88, pp. 303–338, 2010.
[146] http://host.robots.ox.ac.uk/pascal/VOC/voc2012/.
[147] R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille, “The role of context for object detection and semantic segmentation in the wild,” in IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 891–898.
[148] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision. Springer, 2014.
[149] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3213–3223.
[150] C. Liu, J. Yuen, and A. Torralba, “Nonparametric scene parsing: Label transfer via dense scene alignment,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[151] S. Gould, R. Fulton, and D. Koller, “Decomposing a scene into geometric and semantically consistent regions,” in 2009 IEEE 12th international conference on computer vision. IEEE, 2009, pp. 1–8.
[152] D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in Proc. 8th Int’l Conf. Computer Vision, vol. 2, July 2001, pp. 416–423.
[153] A. Prest, C. Leistner, J. Civera, C. Schmid, and V. Ferrari, “Learning object class detectors from weakly annotated video,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012, pp. 3282–3289.
[154] S. D. Jain and K. Grauman, “Supervoxel-consistent foreground propagation in video,” in European conference on computer vision. Springer, 2014, pp. 656–671.
[155] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.
[156] J. M. Alvarez, T. Gevers, Y. LeCun, and A. M. Lopez, “Road scene segmentation from a single image,” in European Conference on Computer Vision. Springer, 2012, pp. 376–389.
[157] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik, “Semantic contours from inverse detectors,” in 2011 International Conference on Computer Vision. IEEE, 2011, pp. 991–998.
[158] X. Chen, R. Mottaghi, X. Liu, S. Fidler, R. Urtasun, and A. Yuille, “Detect what you can: Detecting and representing objects using holistic models and body parts,” in IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1971–1978.
[159] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez, “The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes,” in IEEE conference on computer vision and pattern recognition, 2016, pp. 3234–3243.
[160] X. Shen, A. Hertzmann, J. Jia, S. Paris, B. Price, E. Shechtman, and I. Sachs, “Automatic portrait segmentation for image stylization,” in Computer Graphics Forum, vol. 35, no. 2. Wiley Online Library, 2016, pp. 93–102.
[161] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from rgbd images,” in European Conference on Computer Vision. Springer, 2012, pp. 746–760.

[162] J. Xiao, A. Owens, and A. Torralba, “Sun3d: A database of big spaces reconstructed using sfm and object labels,” in IEEE International Conference on Computer Vision, 2013, pp. 1625–1632.
[163] S. Song, S. P. Lichtenberg, and J. Xiao, “Sun rgb-d: A rgb-d scene understanding benchmark suite,” in IEEE conference on computer vision and pattern recognition, 2015, pp. 567–576.
[164] K. Lai, L. Bo, X. Ren, and D. Fox, “A large-scale hierarchical multi-view rgb-d object dataset,” in 2011 IEEE international conference on robotics and automation. IEEE, 2011, pp. 1817–1824.
[165] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, “Scannet: Richly-annotated 3d reconstructions of indoor scenes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5828–5839.
[166] I. Armeni, A. Sax, A. R. Zamir, and S. Savarese, “Joint 2D-3D-Semantic Data for Indoor Scene Understanding,” ArXiv e-prints, Feb. 2017.
[167] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su et al., “Shapenet: An information-rich 3d model repository,” arXiv preprint arXiv:1512.03012, 2015.
[168] L. Yi, L. Shao, M. Savva, H. Huang, Y. Zhou, Q. Wang, B. Graham, M. Engelcke, R. Klokov, V. Lempitsky et al., “Large-scale 3d shape reconstruction and segmentation from shapenet core55,” arXiv preprint arXiv:1710.06104, 2017.
[169] M. De Deuge, A. Quadros, C. Hung, and B. Douillard, “Unsupervised feature learning for classification of outdoor 3d scans,” in Australasian Conference on Robotics and Automation, vol. 2, 2013, p. 1.
[170] C.-Y. Fu, M. Shvets, and A. C. Berg, “Retinamask: Learning to predict masks improves state-of-the-art single-shot detection for free,” arXiv preprint arXiv:1901.03353, 2019.
[171] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár, “Learning to refine object segments,” in European Conference on Computer Vision. Springer, 2016, pp. 75–91.
[172] H. Liu, C. Peng, C. Yu, J. Wang, X. Liu, G. Yu, and W. Jiang, “An end-to-end network for panoptic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 6172–6181.
[173] K. Sofiiuk, O. Barinova, and A. Konushin, “Adaptis: Adaptive instance selection network,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 7355–7363.
[174] J. Lazarow, K. Lee, K. Shi, and Z. Tu, “Learning instance occlusion for panoptic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10720–10729.
[175] Z. Deng, S. Todorovic, and L. Jan Latecki, “Semantic segmentation of rgbd images with mutex constraints,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1733–1741.
[176] D. Eigen and R. Fergus, “Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture,” in IEEE international conference on computer vision, 2015, pp. 2650–2658.
[177] A. Mousavian, H. Pirsiavash, and J. Kosecka, “Joint semantic segmentation and depth estimation with deep convolutional networks,” in International Conference on 3D Vision. IEEE, 2016.
[178] X. Qi, R. Liao, J. Jia, S. Fidler, and R. Urtasun, “3d graph neural networks for rgbd semantic segmentation,” in IEEE International Conference on Computer Vision, 2017, pp. 5199–5208.
[179] W. Wang and U. Neumann, “Depth-aware cnn for rgb-d segmentation,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 135–150.
[180] S.-J. Park, K.-S. Hong, and S. Lee, “Rdfnet: Rgb-d multi-level residual feature fusion for indoor semantic segmentation,” in IEEE International Conference on Computer Vision, 2017, pp. 4980–4989.
[181] J. Jiao, Y. Wei, Z. Jie, H. Shi, R. W. Lau, and T. S. Huang, “Geometry-aware distillation for indoor semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 2869–2878.
[182] Z.-H. Zhou, “A brief introduction to weakly supervised learning,” National Science Review, vol. 5, no. 1, pp. 44–53, 2018.
[183] L. Jing and Y. Tian, “Self-supervised visual feature learning with deep neural networks: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[184] V. Goel, J. Weng, and P. Poupart, “Unsupervised video object segmentation for deep reinforcement learning,” in Advances in Neural Information Processing Systems, 2018, pp. 5683–5694.
[185] L. Ma, Y. Liu, X. Zhang, Y. Ye, G. Yin, and B. A. Johnson, “Deep learning in remote sensing applications: A meta-analysis and review,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 152, pp. 166–177, 2019.
[186] L. Gao, Y. Zhang, F. Zou, J. Shao, and J. Lai, “Unsupervised urban scene segmentation via domain adaptation,” Neurocomputing, vol. 406, pp. 295–301, 2020.
[187] M. Paoletti, J. Haut, J. Plaza, and A. Plaza, “Deep learning classifiers for hyperspectral imaging: A review,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 158, pp. 279–317, 2019.
[188] J. F. Abrams, A. Vashishtha, S. T. Wong, A. Nguyen, A. Mohamed, S. Wieser, A. Kuijper, A. Wilting, and A. Mukhopadhyay, “Habitat-net: Segmentation of habitat images using deep learning,” Ecological Informatics, vol. 51, pp. 121–128, 2019.
[189] M. Kerkech, A. Hafiane, and R. Canals, “Vine disease detection in uav multispectral images using optimized image registration and deep learning segmentation approach,” Computers and Electronics in Agriculture, vol. 174, p. 105446, 2020.
[190] N. Tajbakhsh, L. Jeyaseelan, Q. Li, J. N. Chiang, Z. Wu, and X. Ding, “Embracing imperfect datasets: A review of deep learning solutions for medical image segmentation,” Medical Image Analysis, vol. 63, p. 101693, 2020.
[191] A. Amyar, R. Modzelewski, H. Li, and S. Ruan, “Multi-task deep learning based ct imaging analysis for covid-19 pneumonia: Classification and segmentation,” Computers in Biology and Medicine, vol. 126, p. 104037, 2020.
[192] Y. Song, Z. Huang, C. Shen, H. Shi, and D. A. Lange, “Deep learning-based automated image segmentation for concrete petrographic analysis,” Cement and Concrete Research, vol. 135, p. 106118, 2020.

Authors

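Shervin Minaee is a machine learning lead on the computer vision team at Snapchat. He received his PhD in Electrical Engineering and Computer Science from New York University in 2018. His research interests include computer vision, image segmentation, biometric recognition, and applied deep learning. He published more than 40 papers and patents during his PhD. He previously worked as a research scientist at Samsung Research, AT&T Labs, and Huawei Labs, and as a data scientist at Expedia Group.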

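Yuri Boykov is a professor at the Cheriton School of Computer Science, University of Waterloo. His research is in computer vision and biomedical image analysis, with a focus on modeling and optimization for structured segmentation, restoration, registration, stereo, motion, model fitting, recognition, photo and video editing, and other data analysis problems. He is an editor of the International Journal of Computer Vision (IJCV). His work was listed among the 10 most influential papers in IEEE TPAMI (top picks over 30 years). In 2017, Google Scholar listed his work on segmentation as a "classic paper" in computer vision and pattern recognition (from 2006). In 2011, he received the Helmholtz Prize from the IEEE and the Test of Time Award from the International Conference on Computer Vision.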

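Fatih Porikli is a Senior Director at Qualcomm, San Diego, and an IEEE Fellow. He was recently a full professor in the Research School of Engineering at the Australian National University and a Vice President at Huawei CBG Device, Hardware. He led the computer vision research group at NICTA, Australia, and was a Distinguished Research Scientist at Mitsubishi Electric Research Laboratories, Cambridge. He received his PhD from New York University in 2002. He received the R&D 100 Scientist of the Year Award in 2006, has won six best paper awards, has published more than 250 papers, has co-edited two books, and holds more than 100 patents. Over the past 15 years, he has served as general chair and technical program chair of many IEEE conferences and as an associate editor of IEEE and Springer journals.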

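Antonio Plaza is an IEEE Fellow and a professor in the Department of Technology of Computers and Communications, University of Extremadura, Spain. He received his MSc and PhD degrees in Computer Engineering in 1999 and 2002, respectively. He has published more than 600 works, including more than 300 JCR journal papers (more than 170 of them in IEEE journals), 24 book chapters, and more than 300 peer-reviewed conference papers. He received the Best Column Award of the IEEE Signal Processing Magazine in 2015, the 2013 Best Paper Award of the JSTARS journal, and the award for the most highly cited paper (2005-2010) in the Journal of Parallel and Distributed Computing. He was included in the 2018 and 2019 Highly Cited Researchers lists.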

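Nasser Kehtarnavaz is a Distinguished Professor in the Department of Electrical and Computer Engineering at the University of Texas at Dallas. His research interests include signal and image processing, machine learning, and real-time implementation on embedded processors. He has authored or co-authored ten books and more than 390 journal papers, conference papers, patents, manuals, and editorials. He is a Fellow of SPIE, a licensed Professional Engineer, and Editor-in-Chief of the Journal of Real-Time Image Processing.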

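Demetri Terzopoulos is a Distinguished Professor of Computer Science at the University of California, Los Angeles, where he directs the UCLA Computer Graphics & Vision Laboratory. He is also co-founder and chief scientist of VoxelCloud, Inc. He received his PhD from the Massachusetts Institute of Technology in 1984. He is a Fellow of the ACM, the IEEE, the Royal Society of Canada, and the Royal Society of London, and a member of the European Academy of Sciences, the New York Academy of Sciences, and Sigma Xi. He received an Academy Award for his pioneering work on physics-based computer animation, and the IEEE Computer Vision Distinguished Researcher Award for his pioneering and sustained research on deformable models and their applications. He has published more than 400 research papers and several volumes. Before becoming an academic in 1989, he was a program leader at Schlumberger corporate research centers in California and Texas.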