Semantic Segmentation

Introduction

Semantic segmentation assigns every pixel in an image to an object category. Three challenges stand out: (1) reduced feature resolution, (2) the existence of objects at multiple scales, and (3) reduced localization accuracy caused by the built-in invariance of deep convolutional neural networks (DCNNs).

The first challenge is caused by the repeated max-pooling and downsampling (striding) operations of classification-oriented CNNs. When a DCNN is run in a fully convolutional fashion, these operations significantly reduce the spatial resolution of the feature maps.
The second challenge is that objects appear in images at multiple scales.
The third challenge is that object classifiers require invariance to spatial transformations, which inherently limits the spatial precision of a DCNN.

Before deep learning reached computer vision, most successful semantic segmentation systems relied on hand-crafted features combined with flat classifiers, such as boosting [Textonboost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context, IJCV, 2009], random forests [Semantic texton forests for image categorization and segmentation, in CVPR, 2008], or support vector machines [Class segmentation and object localization with superpixel neighborhoods, in ICCV, 2009]. Substantial progress was made by incorporating richer information from context [Semantic segmentation with second-order pooling, in ECCV, 2012] and from structured prediction techniques [Efficient inference in fully connected crfs with gaussian edge potentials, in NIPS, 2011; Multiscale conditional random fields for image labeling, in CVPR, 2004; Associative hierarchical crfs for object class image segmentation, in ICCV, 2009; CPMC: Automatic object segmentation using constrained parametric min-cuts, PAMI, vol. 34, no. 7, pp. 1312–1328, 2012], but the performance of these systems remained limited by the expressive power of the features. Over the past few years the breakthroughs that deep learning achieved in image classification were quickly transferred to semantic segmentation. Since this task involves both segmentation and classification, a central question is how to combine the two.

The first family of DCNN-based semantic segmentation approaches cascades bottom-up image segmentation with DCNN-based region classification. For example, [Rich feature hierarchies for accurate object detection and semantic segmentation, in CVPR, 2014; Simultaneous detection and segmentation, in ECCV, 2014] feed the bounding box proposals and masked regions of [Multiscale combinatorial grouping, in CVPR, 2014; Selective search for object recognition, IJCV, 2013] into a DCNN, injecting shape information into the classification pipeline. Similarly, [Feedforward semantic segmentation with zoom-out features, in CVPR, 2015] uses a superpixel representation. Although these approaches can benefit from the sharp boundaries delivered by a good segmentation, they cannot recover from its errors.

The second family uses convolutionally computed DCNN features for dense image labeling and couples them with independently obtained segmentations. Among the first, [Learning hierarchical features for scene labeling, PAMI, 2013] applies a DCNN at multiple image resolutions and then smooths the predictions with a segmentation tree. More recently, [Hypercolumns for object segmentation and fine-grained localization, in CVPR, 2015] proposes skip layers and concatenates intermediate feature maps computed within the DCNN for pixel classification, and [Convolutional feature masking for joint object and stuff segmentation, arXiv:1412.1283, 2014] proposes pooling the intermediate feature maps by region proposals. These works still employ segmentation algorithms that are decoupled from the DCNN classifier's results, and thus risk committing to premature decisions.

The third family uses a DCNN to directly provide category-level pixel labels, making it possible to discard segmentation altogether. The segmentation-free approaches of [Fully convolutional networks for semantic segmentation, in CVPR, 2015; Combining the best of graphical models and convnets for semantic segmentation, arXiv:1412.4313, 2014] apply a DCNN to the whole image in a fully convolutional fashion, converting the last fully connected layers into convolutional layers. To handle the spatial localization problem raised above, [Fully convolutional networks for semantic segmentation, in CVPR, 2015] upsamples and concatenates the scores from intermediate feature maps, while [Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture, arXiv:1411.4734, 2014] feeds the coarse result into another DCNN that refines the prediction from coarse to fine. Building on these works, one can control feature resolution, introduce multi-scale pooling techniques, and integrate a densely connected conditional random field (CRF) [Efficient inference in fully connected crfs with gaussian edge potentials, in NIPS, 2011] on top of the DCNN; this yields markedly better segmentations, especially along object boundaries. Combining DCNNs with CRFs is of course not new, but previous works only tried locally connected CRF models. Specifically, [Combining the best of graphical models and convnets for semantic segmentation, arXiv:1412.4313, 2014] uses a CRF as a reranking system on top of a DCNN, while [Learning hierarchical features for scene labeling, PAMI, 2013] treats superpixels as nodes of a local pairwise CRF and uses graph cuts for discrete inference; such models are limited by errors in the superpixel computation or ignore long-range dependencies. In contrast, one can treat every pixel as a CRF node that receives unary potentials from the CNN. Crucially, the Gaussian potentials of the fully connected CRF of [Efficient inference in fully connected crfs with gaussian edge potentials, in NIPS, 2011] capture long-range dependencies while still admitting fast mean-field inference. Mean-field inference had been used for traditional segmentation tasks [Parallel and deterministic algorithms from mrfs: Surface reconstruction, PAMI, vol. 13, no. 5, pp. 401–412, 1991; A common framework for image segmentation, IJCV, vol. 6, no. 3, pp. 227–243, 1991; Computational analysis and learning for a biologically motivated model of boundary detection, Neurocomputing, vol. 71, no. 10, pp. 1798–1812, 2008], but those older models are limited to short-range connections. In independent work, [Material recognition in the wild with the materials in context database, arXiv:1412.0623, 2014] uses a very similar densely connected CRF to refine DCNN results for material classification; however, its DCNN module is trained with sparse point supervision rather than dense per-pixel supervision.

The rest of this post traces the development of deep learning for semantic segmentation. I compiled it as a learning exercise; corrections and discussion are welcome.

Paper01 [CVPR2015]: 《Fully Convolutional Networks for Semantic Segmentation》

CVPR15 Best Paper; arXiv:1411; extended version in TPAMI 2016; UC Berkeley.

Key points of this architecture:

(1) The first end-to-end training of fully convolutional networks (FCN) for the semantic segmentation task. Base networks: AlexNet, VGG, GoogLeNet; the experiments show FCN-VGG16 outperforms FCN-AlexNet and FCN-GoogLeNet.
(2) Fully connected layers are converted into convolutional layers, with the kernel size equal to the spatial size of the input feature map.
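
To make this concrete, here is a minimal PyTorch sketch (my own illustration, not the authors' Caffe code) of the conversion; the shapes assume a VGG-style classifier whose first fully connected layer sees a 7×7×512 feature map:

```python
import torch
import torch.nn as nn

# A VGG-style classifier whose first FC layer sees a 7x7x512 feature map.
fc6 = nn.Linear(512 * 7 * 7, 4096)
fc7 = nn.Linear(4096, 4096)

# Equivalent convolutions: fc6 becomes a 7x7 conv (kernel = input feature
# map size), fc7 becomes a 1x1 conv; weights are copied over by reshaping.
conv6 = nn.Conv2d(512, 4096, kernel_size=7)
conv7 = nn.Conv2d(4096, 4096, kernel_size=1)
conv6.weight.data.copy_(fc6.weight.data.view(4096, 512, 7, 7))
conv6.bias.data.copy_(fc6.bias.data)
conv7.weight.data.copy_(fc7.weight.data.view(4096, 4096, 1, 1))
conv7.bias.data.copy_(fc7.bias.data)

# On a 7x7 input the two parametrizations agree; on larger inputs the conv
# version slides over the image and outputs a coarse score map instead.
x = torch.randn(1, 512, 7, 7)
assert torch.allclose(fc6(x.flatten(1)), conv6(x).flatten(1), atol=1e-4)
```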
(3) Input images can be of arbitrary size.

(4) Defines skip connections that fuse semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations; experiments show this fusion improves segmentation quality. Low-resolution features (high-level feature maps) are upsampled and fused with high-resolution features (low-level feature maps), and the upsampling itself is learnable (deconvolution plus activation functions implement learnable nonlinear upsampling).
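
Below is an illustrative PyTorch sketch of this skip fusion in the spirit of FCN-16s (module names and channel sizes are my assumptions, not the reference implementation):

```python
import torch
import torch.nn as nn

class SkipFusion(nn.Module):
    """Sketch of the FCN-16s idea: upsample coarse pool5 scores 2x with a
    learnable deconvolution and add 1x1-conv scores computed from the
    higher-resolution pool4 features."""
    def __init__(self, num_classes=21, pool4_ch=512):
        super().__init__()
        self.score_pool4 = nn.Conv2d(pool4_ch, num_classes, kernel_size=1)
        # Learnable 2x upsampling (often initialized to bilinear weights).
        self.up2 = nn.ConvTranspose2d(num_classes, num_classes,
                                      kernel_size=4, stride=2, padding=1)

    def forward(self, score_pool5, feat_pool4):
        up = self.up2(score_pool5)            # deep, coarse semantic scores
        skip = self.score_pool4(feat_pool4)   # shallow, fine appearance scores
        return up + skip                      # fuse; a final upsample follows

fuse = SkipFusion()
out = fuse(torch.randn(1, 21, 16, 16), torch.randn(1, 512, 32, 32))
print(out.shape)  # torch.Size([1, 21, 32, 32])
```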

(Figure: comparison of the FCN variants.)
(5) Fully convolutional network achieves state-of-the-art segmentation of PASCAL VOC (20% relative improvement to 62.2% mean IoU on 2012), NYUDv2, and SIFT Flow, while inference takes less than one fifth of a second for a typical image.
(6) Fully convolutional networks are a rich class of models, of which modern classification convnets are a special case. Recognizing this, extending these classification nets to segmentation, and improving the architecture with multi-resolution layer combinations dramatically improves the state-of-the-art, while simultaneously simplifying and speeding up learning and inference.


Paper02 [CVPR2015]: 《Hypercolumns for Object Segmentation and Fine-grained Localization》

CVPR15; arXiv v1: 1411, v2: 1504.

Key points of this architecture:

(1) The feature map output by the last CNN layer is too coarse spatially to allow precise localization, while features from early layers are spatially precise but semantically weak: the last layer of the CNN is the most sensitive to category-level semantic information and the most invariant to "nuisance" variables such as pose, illumination, articulation, precise location and so on. The authors therefore propose hypercolumns:

define the hypercolumn at a pixel as the vector of activations of all CNN units above that pixel
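
A minimal sketch of this definition in PyTorch (my own illustration): upsample feature maps from several layers to a common resolution and stack them along the channel axis, so every pixel gets one long activation vector:

```python
import torch
import torch.nn.functional as F

def hypercolumn(feature_maps, out_size):
    """Build per-pixel hypercolumns by upsampling CNN feature maps from
    several layers to a common resolution and concatenating them.
    `feature_maps` is a list of (1, C_i, H_i, W_i) tensors."""
    ups = [F.interpolate(f, size=out_size, mode='bilinear', align_corners=False)
           for f in feature_maps]
    return torch.cat(ups, dim=1)  # (1, sum(C_i), H, W): one vector per pixel

feats = [torch.randn(1, 64, 112, 112),   # shallow: precise localization
         torch.randn(1, 256, 28, 28),    # middle
         torch.randn(1, 512, 14, 14)]    # deep: strong semantics
hc = hypercolumn(feats, (224, 224))
print(hc.shape)  # torch.Size([1, 832, 224, 224])
```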


(2) Outperforms SDS, which uses only the last-layer features.

(3) Sets a new state of the art on keypoint localization and part labeling.

Paper03 [ICCV2015]: 《Conditional Random Fields as Recurrent Neural Networks》

ICCV15; arXiv v1: 1502, v2: 1504, v3: 1604.
Key points of this architecture:
(1) Proposes a new network that combines a CNN with a conditional random field (CRF, a probabilistic graphical model) and inherits the desirable properties of both.
(2) Integrates CRF modeling with the CNN, making it possible to train the whole deep network end-to-end with the usual back-propagation algorithm and avoiding offline post-processing for object delineation.

(3) Set a new state of the art on VOC12 val at the time: 8.3 points above FCN-8s and 6.1 points above FCN-8s+CRF (post-processing), trained without COCO. On VOC12 test it outperforms FCN-8s, Hypercolumn and DeepLab-Msc (72.0% vs 62.2%, 62.6%, 71.6%).

Paper04 [ICLR2015]: 《Semantic image segmentation with deep convolutional nets and fully connected CRFs》

ICLR15; arXiv v1: 1412, v4: 1607.
【DeepLab_v1】
The authors later continued to improve on this line of work, proposing ASPP (which exploits multiple scales); see DeepLab_v2 below (《DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution and Fully Connected CRFs》, published in TPAMI) and DeepLab_v3 (《Rethinking Atrous Convolution for Semantic Image Segmentation》).


Key points of this architecture:
(1) Problem: responses at the final layer of DCNNs are not sufficiently localized for accurate object segmentation. This is due to the very invariance properties that make DCNNs good at high-level tasks (classification), and it hampers low-level tasks such as pose estimation and semantic segmentation. The paper proposes to overcome this poor localization property by **combining the responses at the final DCNN layer with a fully connected conditional random field (CRF)**.

Even though some works (such as FCN and Hypercolumns) still employ segmentation algorithms that are decoupled from the DCNN classifier's results, we believe it is advantageous that segmentation is only used at a later stage, avoiding the commitment to premature decisions.

(2) Proposes the 'hole' algorithm (also called the atrous algorithm, i.e., dilated convolution): careful network re-purposing and a novel application of the 'hole' algorithm from the wavelet community allow dense computation of neural net responses at 8 frames per second on a modern GPU.
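
An illustrative PyTorch sketch of the effect (not the paper's code): a dilated 3×3 convolution enlarges the receptive field at constant resolution and parameter count:

```python
import torch
import torch.nn as nn

# With dilation rate r, the taps of a 3x3 kernel are spaced r pixels apart,
# giving an effective kernel size of 2r+1 with no extra weights.
x = torch.randn(1, 64, 32, 32)
dense  = nn.Conv2d(64, 64, kernel_size=3, padding=1)              # RF 3x3
atrous = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)  # RF 5x5
print(dense(x).shape, atrous(x).shape)  # both (1, 64, 32, 32): same resolution
print(sum(p.numel() for p in dense.parameters()) ==
      sum(p.numel() for p in atrous.parameters()))  # True: no extra weights
```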
(3) The proposed method, called "DeepLab", reaches 71.6% IoU on the VOC12 semantic segmentation test set, a new state of the art at the time (the DeepLab-MSc-CRF-LargeFOV variant, with multi-scale inputs and CRF post-processing).
(4) Features are extracted with VGG-16; the later DeepLab_v2 (published in TPAMI) uses ResNet.

The three main advantages of our “DeepLab” system are
(i) speed: by virtue of the ‘atrous’ algorithm, our dense DCNN operates at 8 fps, while Mean Field Inference for the fully-connected CRF requires 0.5 second.
(ii) accuracy: we obtain state-of-the-art results on the PASCAL semantic segmentation challenge, outperforming the second-best approach of Mostajabi et al. (2014) by a margin of 7.2%
(iii) simplicity: our system is composed of a cascade of two fairly well-established modules, DCNNs and CRFs.

(5) Detailed network structure: the fully connected layers of VGG-16 are converted into convolutional ones and the network is run convolutionally on the image at its original resolution. This alone is not enough, as it yields very sparsely computed detection scores (with a stride of 32 pixels), so subsampling is skipped after the last two max-pooling layers, and the convolutional filters in the layers that follow them are modified by introducing zeros to increase their length (2× in the last three convolutional layers and 4× in the first fully connected layer).
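
A sketch of this re-purposing on torchvision's VGG-16 (illustrative; the layer indices are those of torchvision's `vgg16().features`, and I substitute DeepLab's 3×3/stride-1 pooling so spatial sizes are preserved):

```python
import torch.nn as nn
from torchvision.models import vgg16

net = vgg16().features
# Remove the subsampling of the last two max-pool layers (pool4 at index 23,
# pool5 at index 30), keeping resolution; DeepLab uses 3x3/stride-1 pooling.
net[23] = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
net[30] = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
# Dilate the convolutions that follow pool4 (conv5_1..conv5_3) so their
# receptive fields stay unchanged; output stride drops from 32 to 8.
for i in (24, 26, 28):
    net[i].dilation = (2, 2)
    net[i].padding = (2, 2)
```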

Paper05 [CVPR2016]: 《Efficient Piecewise Training of Deep Structured Models for Semantic Segmentation》

CVPR16; arXiv v1: 1504, v2: 1606.


Note: FeatMap-Net is a convolutional network (e.g., VGG) that generates feature maps.

Key points of this architecture:
(1) Proposes to exploit context information (both "patch-patch" context and "patch-background" context) to improve segmentation quality, and designs a contextual deep CRF model.

(2) For learning from the patch-patch context, we formulate Conditional Random Fields (CRFs) with CNN-based pairwise potential functions to capture semantic correlations between neighboring patches.

(3) For capturing the patch-background context, we show that a network design with traditional multi-scale image input and sliding pyramid pooling is effective for improving performance.
(Figure: the sliding pyramid pooling structure.)

(4) Joint CNN-CRF training is very challenging; piecewise training is used to make it tractable.
(5) achieving new state-of-the-art performance on a number of popular semantic segmentation datasets, including NYUDv2, PASCAL VOC 2012, PASCAL-Context, and SIFT-flow. In particular, we achieve an intersection-over-union score of 78.0 on the challenging PASCAL VOC 2012 dataset.

Paper06 [ICCV2015]: 《Learning Deconvolution Network for Semantic Segmentation》

ICCV15; arXiv:1505.


Key points of the DeconvNet architecture:
(1) Proposes a deconvolution network (composed of deconvolution and unpooling layers, built on VGG-16). Integrating it with FCN mitigates the limitations of existing FCN-based methods, reaching the best accuracy at the time on PASCAL VOC 2012 test, 72.5% (10.3 points above FCN-8s and 0.9 points above DeepLab+CRF (v1) at 71.6%).
**Limitations of FCN: a fixed-size receptive field; and FCN upsamples directly back to the input size in a single step, which is overly crude.**

(2) In the deconvolution network, lower layers capture the coarse structure of an object (location, shape, region), while higher layers capture more complex, class-specific patterns.

(3) Coarse-to-fine object structures are reconstructed through a sequence of deconvolution operations.

(4) Visualization experiments show that coarse-to-fine object structures are reconstructed through propagation in the deconvolutional layers: lower layers tend to capture the overall coarse configuration of an object (e.g., location, shape and region), while more complex patterns are discovered in higher layers. Note that unpooling and deconvolution play different roles in constructing segmentation masks; unpooling captures example-specific structures by tracing the original locations of strong activations back to image space.

(5) (Figure: the detailed network configuration.)

Paper07 [ICLR2016]: 《ParseNet: Looking Wider to See Better》

ICLR16; arXiv:1506.


(1) Adds global context to fully convolutional networks for semantic segmentation, using the average feature of a layer to augment the feature at each location.

(2) Global context can resolve local confusions: per-pixel classification is often ambiguous when only local information is available, but the task becomes much simpler if contextual information from the whole image is available.

(3) A simple, effective, end-to-end convolutional network, setting a new state of the art on SIFT Flow and PASCAL-Context. On VOC12 test it reaches 69.8% mean IoU without any post-processing, close to DeepLab-CRF-LargeFOV (v1) with post-processing (70.3%), while being faster and structurally simpler than DeepLab.
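
A minimal sketch of the global-context augmentation from (1) (my own illustration; ParseNet additionally learns per-branch scaling, which I replace here with plain L2 normalization):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalContext(nn.Module):
    """Global average pooling, L2 normalization of both branches, then
    "unpooling" the global vector back to the spatial grid and
    concatenating it with the local features."""
    def forward(self, feat):
        g = feat.mean(dim=(2, 3), keepdim=True)   # global average feature
        g = F.normalize(g, dim=1)                 # L2 norm over channels
        f = F.normalize(feat, dim=1)
        g = g.expand_as(f)                        # tile to every location
        return torch.cat([f, g], dim=1)           # augment each pixel

x = torch.randn(1, 512, 32, 32)
print(GlobalContext()(x).shape)  # torch.Size([1, 1024, 32, 32])
```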

Paper08 [CVPR2016]: 《Attention to Scale: Scale-aware Semantic Image Segmentation》

CVPR16; arXiv:1511.


Key points of this architecture:

Built on DeepLab-LargeFOV (itself a fully convolutional adaptation of VGG-16).

(1) Proposes an attention mechanism that learns to softly weight the multi-scale features at each pixel location, making the model scale-aware;
(2) Fusing multi-scale features with attention not only improves performance (over average- or max-pooling baselines) but also makes it possible to visualize the importance of features at different positions and scales.
(3) With DeepLab-LargeFOV as the baseline, performance improves by 4.48%, 6.8%, 3.84%, 6.4% and 4.56% on PASCAL-Person-Part, VOC12 val, VOC12 test and subsets of COCO, respectively.
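
An illustrative sketch of the scale attention from (1) (module and tensor names are my assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleAttention(nn.Module):
    """A small head predicts per-pixel softmax weights over S scales; the
    final score map is the weighted sum of the per-scale score maps."""
    def __init__(self, in_ch, num_scales):
        super().__init__()
        self.head = nn.Conv2d(in_ch, num_scales, kernel_size=1)

    def forward(self, feats, scores):
        # feats: (N, in_ch, H, W) features used to predict the weights
        # scores: list of S score maps, each already resized to (N, C, H, W)
        w = F.softmax(self.head(feats), dim=1)   # (N, S, H, W)
        return sum(w[:, s:s+1] * scores[s] for s in range(len(scores)))

att = ScaleAttention(in_ch=512, num_scales=2)
feats = torch.randn(1, 512, 32, 32)
scores = [torch.randn(1, 21, 32, 32) for _ in range(2)]
print(att(feats, scores).shape)  # torch.Size([1, 21, 32, 32])
```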

Paper09 [ICLR2016]: 《Multi-Scale Context Aggregation by Dilated Convolutions》

ICLR16; arXiv:1511.

Key points of this architecture:
(1) Proposes a context module that uses dilated convolutions to systematically aggregate multi-scale contextual information without losing resolution.
(2) The network is based on VGG-16, removing the last two pooling and striding layers as well as the remaining fully connected layers used for classification; FCN keeps all of them, while DeepLab v1 replaces striding with dilation but keeps the pooling layers.
(3) Using only the front-end network, it outperforms FCN-8s and DeepLab-Msc (v1) on VOC12 test (67.6% vs 62.2% and 62.9%) and also DeepLab-CRF (66.4%); Context+CRF-RNN reaches 75.3% mean IoU on VOC12 test.
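
An illustrative sketch of such a context module (the dilation schedule follows the paper's basic pattern of exponentially increasing rates; channel widths are my assumptions):

```python
import torch
import torch.nn as nn

# A stack of 3x3 convolutions with exponentially increasing dilation rates
# aggregates multi-scale context while keeping the map at full resolution.
def context_module(ch, rates=(1, 1, 2, 4, 8, 16, 1)):
    layers = []
    for r in rates:
        layers += [nn.Conv2d(ch, ch, 3, padding=r, dilation=r),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

m = context_module(64)
x = torch.randn(1, 64, 64, 64)
print(m(x).shape)  # torch.Size([1, 64, 64, 64]): resolution preserved
```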

Paper10 [ECCV2016]: 《Higher Order Conditional Random Fields in Deep Neural Networks》

ECCV16; arXiv:1511.

Key points of this architecture:
(1) By the same authors as CRF-RNN. On top of CRF-RNN they add higher-order potentials based on object detections and superpixels, and demonstrate that these can be included in a CRF embedded within a deep network.

(2) The superpixel-based prior enforces consistency of the semantic segmentation output within each superpixel.
(3) The object-detection prior: intuitively, an object detector with high recall can help the segmentation algorithm by finding the objects appearing in an image.
(4) Scores 3.2 points above CRF-RNN on VOC12 test. Starting from CRF-RNN, adding the detection potential helps more than adding the superpixel potential, and end-to-end training beats piecewise training. Trained on COCO+VOC, it reached a new state of the art of 77.9% on VOC12 test.

Paper11 [ECCV2016]: 《Laplacian Pyramid Reconstruction and Refinement for Semantic Segmentation》

ECCV16; arXiv:1605.

Key points of this architecture:

(1) Shows that although high-level feature maps have low spatial resolution, their high-dimensional features still encode sub-pixel localization information.

(2) Describes a multi-resolution reconstruction of feature maps with multiplicative gating that successively refines the segment boundaries reconstructed from lower-resolution maps. The boundary masking mechanism can be viewed as attention-style weighted summation when fusing high- and low-level features.
(3) The upsampling from low to high resolution is learned (unlike standard upsampling); the process is called reconstruction (bilinear interpolation can be seen as a special case of it), and it increases spatial accuracy.

(4) With a VGG-16 front end, trained on VOC data only, it outperforms FCN-8s, Hypercolumn, DeepLab-MSc+CRF (v1), CRF-RNN and DeconvNet on VOC12 test. With a ResNet-101 front end trained on VOC and COCO, it is on par with DeepLab v2 on VOC (LRR-CRF 79.3% vs DeepLab-CRF (v2) 79.7%).

Paper12 [PAMI2016]: 《DeepLab_v2: Semantic Image Segmentation》

《DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution and Fully Connected CRFs》

TPAMI16; arXiv:1606.

Key points of this architecture:
(1) Built on ResNet. Uses "atrous convolution" (dilated convolution), which enlarges the filter's field of view without increasing the number of parameters or the amount of computation; DeepLab_v1 also used atrous convolution.
(2) Proposes ASPP (atrous spatial pyramid pooling) for multi-scale segmentation:

proposing atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales. ASPP probes an incoming convolutional feature layer with filters at multiple sampling rates and effective fields-of-view, thus capturing objects as well as image context at multiple scales.
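
A minimal sketch of ASPP in PyTorch (illustrative; DeepLab v2 uses rates {6, 12, 18, 24} and sums the per-branch score maps, which is what this mimics):

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Parallel 3x3 atrous convolutions with different rates probe the same
    feature map at several effective fields-of-view; the branch outputs are
    fused by summation."""
    def __init__(self, in_ch, out_ch, rates=(6, 12, 18, 24)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates)

    def forward(self, x):
        return sum(branch(x) for branch in self.branches)

aspp = ASPP(512, 21)  # e.g., scores for the 21 VOC classes
print(aspp(torch.randn(1, 512, 32, 32)).shape)  # torch.Size([1, 21, 32, 32])
```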


(3) Combines the DCNN with a CRF to improve accuracy along object boundaries.

Traditionally, conditional random fields have been used to smooth noisy segmentation maps [GrabCut: Interactive foreground extraction using iterated graph cuts, in SIGGRAPH, 2004; Robust higher order potentials for enforcing label consistency, IJCV, vol. 82, no. 3, pp. 302–324, 2009]. These models typically couple neighboring nodes, favoring same-label assignments for spatially close pixels. Qualitatively, such short-range CRFs clean up the spurious predictions of weak classifiers built on local hand-engineered features. In contrast, modern DCNN architectures like the one used here produce score maps that are qualitatively different: they are usually quite smooth and yield homogeneous predictions. In this regime a short-range CRF can be harmful, because the goal is to recover detailed local structure rather than smooth it further. Using contrast-sensitive potentials in a local CRF can improve localization, but thin structures are still missed and an expensive discrete optimization problem typically has to be solved. To overcome these limitations of short-range CRFs, the system integrates the fully connected CRF model of [Efficient inference in fully connected crfs with gaussian edge potentials, in NIPS, 2011].
The pairwise potential uses two Gaussian kernels:

$$k(\mathbf{f}_i,\mathbf{f}_j)=w_1\exp\!\left(-\frac{\lVert p_i-p_j\rVert^2}{2\sigma_\alpha^2}-\frac{\lVert I_i-I_j\rVert^2}{2\sigma_\beta^2}\right)+w_2\exp\!\left(-\frac{\lVert p_i-p_j\rVert^2}{2\sigma_\gamma^2}\right)$$

The first is a bilateral kernel over pixel positions (denoted p) and RGB colors (denoted I); the second depends only on pixel positions. The hyperparameters σ_α, σ_β and σ_γ control the scale of the Gaussian kernels. The first kernel encourages pixels with similar positions and colors to take the same label, while the second enforces smoothness based on spatial proximity alone.

(4) The DeepLab system reaches 79.7% mIoU on VOC12.

Paper13 [CVPR2017]: 《Full-Resolution Residual Networks for Semantic Segmentation in Street Scenes》

CVPR17; arXiv:1611 (v1).

Key points of this architecture:
(1) One network with two processing streams: a residual stream and a pooling stream.

One stream carries information at the full image resolution, enabling precise adherence to segment boundaries.The other stream undergoes a sequence of pooling operations to obtain robust features for recognition. The two streams are coupled at the full image resolution using residuals.

(2) Addresses a limitation of current pretraining-based approaches, whose backbone networks cannot easily incorporate new structural elements such as batch normalization or new activation functions.
(3) Among all current models trained without pretraining, achieves a new state-of-the-art IoU score of 71.8% on the Cityscapes dataset.

(4) (Figure: the detailed network configuration.)

Paper14 [CVPR2017]: 《RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation》

CVPR17; arXiv:1611.

Key points of this architecture:
(1) Repeated subsampling operations such as pooling or strided convolution in deep CNNs significantly decrease the initial image resolution, typically reducing the final prediction by a factor of 32 or 16 and losing much of the finer image structure. To tackle this limitation, the paper proposes RefineNet, a generic multi-path refinement network.

RefineNet explicitly exploits all the information available along the down-sampling process to enable high-resolution prediction using long-range residual connections. It exploits features at multiple levels of abstraction, refining the low-resolution (coarse) semantic features with fine-grained low-level features in a recursive manner to generate high-resolution semantic feature maps. The model is flexible: it can be cascaded and modified easily.

(2) Proposes the Residual Convolution Unit (RCU), multi-resolution fusion, and chained residual pooling.

Chained residual pooling is able to capture background context from a large image region. It does so by efficiently pooling features with multiple window sizes and fusing them together with residual connections and learnable weights.

(3) Achieves an intersection-over-union score of 83.4 on the challenging PASCAL VOC 2012 dataset, the best reported result at the time.

It achieves new state-of-the-art performance on 7 public datasets, including PASCAL VOC 2012, PASCAL-Context, NYUDv2, SUN-RGBD, Cityscapes, ADE20K, and the Person-Parts object parsing dataset. In particular, the IoU score of 83.4 on PASCAL VOC 2012 outperforms the previously best approach, DeepLab, by a large margin.


Paper15 [CVPR2017]: 《Pyramid Scene Parsing Network》

CVPR17; arXiv:1612.


Key points of this architecture:
(1) Built on ResNet, proposes PSPNet (pyramid scene parsing network), trained with an auxiliary loss, setting new records of 85.4% mIoU on PASCAL VOC 2012 and 80.2% on Cityscapes.
(2) Won first place in the ImageNet Scene Parsing Challenge 2016.

Paper16 [arXiv2017]: 《Understanding Convolution for Semantic Segmentation》

arXiv:1702.
Key points of this architecture [TuSimple]:

(1) Built on ResNet, designs dense upsampling convolution (DUC) to generate pixel-level predictions, capturing and decoding detailed information that is generally missing in bilinear upsampling; bilinear upsampling is not learnable and may lose fine details.
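
A minimal sketch of DUC (my own illustration, using PixelShuffle for the channel-to-space rearrangement the paper describes):

```python
import torch
import torch.nn as nn

class DUC(nn.Module):
    """Predict r*r*C channels at the low resolution, then rearrange them
    into a full-resolution C-channel score map. Unlike bilinear upsampling,
    every output pixel comes from learned weights."""
    def __init__(self, in_ch, num_classes, r):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, num_classes * r * r,
                              kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(r)  # (N,C*r*r,H,W) -> (N,C,H*r,W*r)

    def forward(self, x):
        return self.shuffle(self.conv(x))

duc = DUC(in_ch=2048, num_classes=19, r=8)   # e.g., ResNet, output stride 8
print(duc(torch.randn(1, 2048, 16, 16)).shape)  # torch.Size([1, 19, 128, 128])
```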

(2) Limitations of dilated convolution:
the "gridding issue": a standard dilated convolution loses some neighboring information, and the problem worsens as the dilation rate increases (typically in higher layers, where the receptive field is large): the kernel becomes too sparse to cover any local information, since its non-zero taps are too far apart.
To address this limitation, the paper proposes Hybrid Dilated Convolution (HDC):
the goal of HDC is to let the final receptive field of a series of convolutional operations fully cover a square region without any holes or missing edges.
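
An illustrative sketch of the rate schedule (not the authors' code): cycling the rates as HDC suggests, versus repeating a single rate, which produces gridding:

```python
import torch.nn as nn

# Within a group of layers, cycle the dilation rates (e.g., 1, 2, 3) instead
# of repeating one rate (e.g., 2, 2, 2). The repeated rate leaves periodic
# holes in the receptive field ("gridding"); cycled rates tile it gap-free.
def conv_stack(ch, rates):
    return nn.Sequential(*[
        nn.Conv2d(ch, ch, 3, padding=r, dilation=r) for r in rates])

gridding = conv_stack(64, rates=[2, 2, 2])  # RF has holes: taps 2 px apart
hdc      = conv_stack(64, rates=[1, 2, 3])  # RF fully covered, no holes
```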

(3) Achieves a state-of-the-art result of 83.1% mIoU on VOC12 test, along with state-of-the-art overall results on the KITTI road estimation benchmark and the Cityscapes dataset.

Paper17 [CVPR2017]: 《Large Kernel Matters -- Improve Semantic Segmentation by Global Convolutional Network》

CVPR17; arXiv:1703.

Key points of this architecture:
(1) Problem: a well-designed segmentation model must handle both classification and localization.

For the classification task, models are required to be invariant to various transformations such as translation and rotation. But for the localization task, models should be transformation-sensitive, i.e., precisely locate every pixel of each semantic category.

(2) Proposes a Global Convolutional Network (GCN) to address both the classification and localization issues in semantic segmentation.
(3) Proposes a Boundary Refinement (BR) block to further improve localization near object boundaries.
(4) The authors further design ResNet50-GCN, replacing ResNet's original bottleneck with a GCN bottleneck; ImageNet classification accuracy drops slightly, but segmentation performance improves.
(5) Achieves state-of-the-art performance on two public benchmarks, significantly outperforming previous results: 82.2% (vs 80.2%) on PASCAL VOC 2012 and 76.9% (vs 71.8%) on Cityscapes.

Paper18 [ICCV2017]: 《Deformable Convolutional Networks》

ICCV17; arXiv:1703.

Key points of this architecture:
(1) Proposes deformable convolution and deformable RoI pooling, which augment the spatial sampling locations with additional offsets learned from the task.
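
An illustrative sketch using torchvision's DeformConv2d (my wiring, not the paper's released code): a regular convolution predicts the per-tap offsets, and the deformable convolution samples the input at the shifted positions:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableBlock(nn.Module):
    """A regular conv predicts a 2D offset for each of the 3x3 = 9 kernel
    taps at every output location; DeformConv2d samples the input at those
    learned, shifted positions instead of the fixed grid."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.offset = nn.Conv2d(in_ch, 2 * 3 * 3, kernel_size=3, padding=1)
        nn.init.zeros_(self.offset.weight)  # start from the regular grid
        nn.init.zeros_(self.offset.bias)
        self.deform = DeformConv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        return self.deform(x, self.offset(x))

block = DeformableBlock(64, 64)
print(block(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```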
(2) Addresses limitations of current mainstream designs: convolution units sample at fixed locations and RoI pooling produces fixed spatial bins, so neither copes well with geometric transformations of objects.

Paper19 [ICCV2017]: 《Scale-adaptive Convolutions for Scene Parsing》

ICCV17 Paper.

Key points of this architecture:
(1) CNNs with standard convolutions can only handle a single scale because of their fixed-size receptive fields, whereas scene images usually contain stuff (e.g., sky, wall) and objects (e.g., people, cars) of various sizes, leading to two critical drawbacks:

I) objects much larger than the receptive field often get inconsistent parsing predictions, since the receptive field may cover only a small part of a large object;
II) small objects are often ignored and mislabeled as background, because the receptive field covers too much background instead of focusing on the small objects.

To address these two problems, the authors propose scale-adaptive convolutions, which automatically learn flexible-size receptive fields and can deal with objects of various sizes.

(2) Scale-adaptive convolutions can be viewed as generalized convolutions with adaptive dilation parameters.
(3) The overall architecture is built on DeepLab_v2; the method's effectiveness is validated on ADE20K and Cityscapes.

Paper20 [arXiv2017]: 《Rethinking Atrous Convolution for Semantic Image Segmentation》

arXiv:1706.
【DeepLab_v3】
Key points of this architecture:
(1) Improves atrous spatial pyramid pooling (ASPP);
(2) proposes a cascaded module of multiple atrous convolutions.

In more detail:
(1) As in DeepLab v2 and the dilated-convolutions work, atrous convolution is used to adapt the ResNet model.
(2) The paper improves ASPP in three ways: concatenating image-level (pooled) features, adding a 1×1 convolution alongside three 3×3 atrous convolutions at different rates, and applying batch normalization after each parallel branch.
(3) The cascaded module is a residual (ResNet) block whose atrous convolutions are built with different rates. It resembles the context module of the dilated-convolutions paper, but is applied directly to intermediate feature maps rather than to belief maps (belief maps are the top CNN feature maps whose channel count equals the number of classes).
(4) The two proposed models are evaluated independently; combining them did not improve performance. They perform similarly on the validation set, with the ASPP variant slightly better, and no CRF is used.
(5) Both models outperform the best DeepLab v2 results; the paper attributes the gains to batch normalization and a better way of encoding multi-scale context.

(3) The proposed 'DeepLabv3' system significantly improves over the previous DeepLab versions without DenseCRF post-processing and attains performance comparable to other state-of-the-art models on the PASCAL VOC 2012 semantic segmentation benchmark.

Paper21 [ICCV2017]: 《FoveaNet: Perspective-aware Urban Scene Parsing》

ICCV17; arXiv:1708.

Key points of this architecture:
(1) Proposes to consider perspective geometry in urban scene parsing and introduces a perspective estimation network that learns the global perspective geometry of urban scene images.
Problems addressed:
a) Most current solutions employ generic parsing models that treat all scales and locations in an image equally and ignore the perspective geometry of car-captured urban scenes. They therefore suffer from the heterogeneous object scales caused by the perspective projection of cameras onto actual scenes, and inevitably fail on distant objects, besides other boundary and recognition errors. Large objects in the peripheral regions of the image often suffer "broken-down" errors: parts of a large object are segmented off and assigned to a different but visually similar class.

The paper presents a new perspective-aware CRF model that reduces these typical "broken-down" errors when parsing the peripheral regions of a scene image.

b) Dense small objects near the vanishing point of the image are segmented poorly.

The paper develops a perspective-aware parsing network that handles the scale heterogeneity of urban scene images well and parses accurately the small objects crowded around the vanishing point.

(2) The proposed FoveaNet model fully exploits perspective projection to solve problems a) and b), which generic parsing models cannot.
(3) The designed Perspective Estimation Network (PEN) produces a fovea region, within which dense groups of small objects are segmented very well.
(4) FoveaNet delivers new state-of-the-art performance on Cityscapes, outperforming DeepLab v2 and LRR.

Paper22 [CVPR2017]: 《Difficulty-Aware Semantic Segmentation via Deep Layer Cascade》

CVPR17 Paper.

Key points of this architecture [LC]:
(1) Proposes a novel deep layer cascade (LC) method to improve both the accuracy and the speed of semantic segmentation. Unlike the conventional model cascade (MC), which is composed of multiple independent models, LC treats a single deep model as a cascade of several sub-models: earlier sub-models are trained to handle easy, confident regions and progressively feed forward the harder regions to the next sub-model for processing.

(2) The backbone is Inception-ResNet-v2 (with atrous convolution).
(3) Without multi-scale inputs or post-processing, it outperforms DeepLab v2 on VOC12 test (73.91% vs 70.42%); with multi-scale inputs and post-processing it reaches 82.7% mean IoU on VOC12 test (DeepLabv2_COCO_Msc_ASPP: 79.7%).
(Figure: results on the Cityscapes test set.)

Paper23 [BMVC2017]: 《Semantic Segmentation with Reverse Attention》

BMVC17; arXiv:1707.

Key points of this architecture [RAN]:
(1) Proposes a reverse attention network (RAN) architecture that also trains the network to capture the opposite concept (i.e., what is not associated with a target class). The RAN is a three-branch network that performs the direct, reverse and reverse-attention learning processes simultaneously.

Traditionally, convolutional classifiers are taught to learn the representative semantic features of the labeled semantic objects.

(2) Built upon DeepLabv2-LargeFOV, RAN achieves a state-of-the-art mean IoU score (48.1%) on the challenging PASCAL-Context dataset, with significant improvements also on PASCAL VOC, Person-Part, NYUDv2 and ADE20K. On VOC12 test it reaches 80.5% mean IoU (DeepLabv2-LargeFOV: 79.1%, DeepLabv2-ASPP: 79.7%).

Paper24 [arXiv2016]: 《Mixed Context Networks for Semantic Segmentation》

arXiv:1610.
Key points of this architecture:
(1) Proposes the mixed context network (MCN).

FCN-based systems brought great improvements to this area. Unlike classification networks, these dense prediction models rely heavily on combining features from different layers, since those features contain information at different levels. A number of models have been proposed for using such features, but which architecture best exploits features from different layers remains an open question.

(2) Proposes the message passing network (MPN):

CRF-RNN is memory-intensive, especially when the number of classes is large (e.g., 150 in ADE20K).

(3) Built on VGG-16, achieves 81.4% mean IoU on VOC12 test.

Paper25 [ECCV2016]: 《Fast, Exact and Multi-Scale Inference for Semantic Segmentation with Deep G-CRFs》

《Fast, Exact and Multi-Scale Inference for Semantic Image Segmentation with Deep Gaussian CRFs》
ECCV2016; arXiv:1603.

Key points of this architecture:
(1) Combines the virtues of Gaussian conditional random fields (G-CRF) with deep learning.
(2) The structured prediction task has a unique global optimum that is obtained exactly from the solution of a linear system. The G-CRF energy takes the form

$$E(\mathbf{x})=\tfrac{1}{2}\,\mathbf{x}^{\top}(A+\lambda I)\,\mathbf{x}-B^{\top}\mathbf{x}$$

where A denotes the symmetric N × N matrix of pairwise terms and B denotes the N × 1 vector of unary terms; both A and B are learned from the data using a fully convolutional network.

[Inference] Given A and B, inference amounts to finding the x that minimizes the energy function. If (A + λI) is symmetric positive definite, then E(x) has a unique global minimum at

$$(A+\lambda I)\,\mathbf{x}=B\quad\Longrightarrow\quad\mathbf{x}=(A+\lambda I)^{-1}B.$$

(3) The pairwise terms do not have to be simple hand-crafted expressions, as in the line of works building on DenseCRF (such as CRF-as-RNN).

(4) Proposes multi-resolution architectures that couple information across scales in a joint optimization framework, yielding systematic improvements (0.3 points over simple cross-layer feature fusion).

(5) Built on DeepLab-v2 ResNet-101, Deep G-CRF achieves 80.2% mean IoU on VOC12 test.

Paper26 [ICCV2017]: 《Dense and Low-Rank Gaussian CRFs Using Deep Embeddings》

ICCV2017.
By the same authors as 《Fast, Exact and Multi-Scale Inference for Semantic Image Segmentation with Deep Gaussian CRFs》.

Key points of this architecture [Dense G-CRF]:
(1) Building on the sparse G-CRF above, the authors address its limitation:

While the Deep G-CRF model described above allows for efficient and exact inference, in practice it only captures interactions in small (4-, 8- and 12-connected) neighborhoods. The model may thereby lose some of its power by ignoring a richer set of long-range interactions. The extension to fully-connected graphs is technically challenging because of the non-sparse matrix A it involves. Assuming an image size of 800 × 800 pixels, 21 labels (PASCAL VOC benchmark), and a network with a spatial down-sampling factor of 8, the number of variables is N = (100 × 100) × 21 and the number of elements in A would be N² ≈ 10¹⁰. This is prohibitively large due to both memory and computational requirements.

To overcome this challenge, we advocate forcing A to be low-rank. In particular, the N × N matrix A is decomposed into a product of low-dimensional embeddings; from the description below, this is a product of the form $A = E^{\top}E$ with $E \in \mathbb{R}^{D \times N}$ and $D \ll N$ (my notation; the exact factorization appears in a figure in the original post).

(2) Introduces a structured prediction model that endows the deep Gaussian conditional random field (G-CRF) with a densely connected graph structure. Memory and computational complexity are kept under control by expressing the pairwise interactions as inner products of low-dimensional, learnable embeddings. The G-CRF system matrix is therefore low-rank, allowing the resulting system to be solved in a few milliseconds on the GPU using conjugate gradients. As in G-CRF, inference is exact; the unary and pairwise terms are trained jointly end-to-end using analytic expressions for the gradients, and even faster, Potts-type variants of the embeddings are developed.
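
A sketch of the inference step under these assumptions (my own illustration): conjugate gradients solve (A + λI)x = B while A = EᵀE is applied only through matrix-vector products, so it is never materialized:

```python
import torch

def gcrf_solve(E, B, lam, iters=100):
    """Solve (A + lam*I) x = B by conjugate gradient, where A = E^T E is
    low-rank and never formed explicitly; only products E^T (E v) + lam*v
    are needed. E: (D, N) embeddings with D << N;  B: (N,) unary terms."""
    matvec = lambda v: E.t() @ (E @ v) + lam * v
    x = torch.zeros_like(B)
    r = B - matvec(x)          # residual
    p = r.clone()              # search direction
    rs = r @ r
    for _ in range(iters):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new.sqrt() < 1e-6:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

E = torch.randn(32, 1000)      # D = 32 embedding dims, N = 1000 variables
B = torch.randn(1000)
x = gcrf_solve(E, B, lam=1.0)
print(torch.allclose(E.t() @ (E @ x) + x, B, atol=1e-2))  # approximately True
```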
(3) Achieves 80.4% mean IoU on VOC12 test.

Paper27 [ICCV2017]: 《Segmentation-Aware Convolutional Networks Using Local Attention Masks》

ICCV2017.

Key points of this architecture [Segaware]:
(1) Introduces an approach that integrates segmentation information within a convolutional neural network (CNN), counteracting the tendency of CNNs to smooth information across regions and increasing their spatial precision. A CNN is set up to provide an embedding space in which region co-membership can be estimated from Euclidean distance; these embeddings are used to compute a local attention mask relative to every neuron position. Such masks are incorporated into CNNs, replacing the convolution operation with a "segmentation-aware" variant that allows a neuron to selectively attend to inputs coming from its own region.

(2) Built on DeepLab v2, achieves 79.8% mean IoU on VOC12 test.

Appendix 1: Semantic Segmentation Datasets

1. The PASCAL VOC 2011 and SBD

The PASCAL VOC 2011 segmentation challenge training set labels 1,112 images. [Semantic contours from inverse detectors, in ICCV 2011] collected labels for a much larger set of 8,498 PASCAL training images.
The train/val/test splits of PASCAL VOC segmentation challenge and SBD diverge. Most notably VOC 2011 segval intersects with SBD train. Care must be taken for proper evaluation by excluding images from the train or val splits.
We train on the 8,498 images of SBD train and validate on the non-intersecting set defined in the included seg11valid.txt (736 images), since some of the images in VOC11 val.txt are contained in the SBD training set (8,498 images).

2. The PASCAL VOC 2012

The PASCAL VOC 2012 segmentation benchmark [34] involves 20 foreground object classes and one background class. The original dataset contains 1,464 (train), 1,449 (val), and 1,456 (test) pixel-level labeled images for training, validation, and testing, respectively. The dataset is augmented by the extra annotations provided by [Semantic contours from inverse detectors, in ICCV, 2011], resulting in 10,582 (trainaug) training images.

3. VOC2012 vs. VOC2011

For VOC2012 the majority of the annotation effort was put into increasing the size of the segmentation and action classification datasets, and no additional annotation was performed for the classification/detection tasks. The list below summarizes the differences in the data between VOC2012 and VOC2011.

Classification/Detection: The 2012 dataset is the same as that used in 2011. No additional data has been annotated. For this reason, participants are not allowed to run evaluation on the VOC2011 dataset, and this option on the evaluation server has been disabled.

Segmentation: The 2012 dataset contains images from 2008-2011 for which additional segmentations have been prepared. As in previous years the assignment to training/test sets has been maintained. The total number of images with segmentation has been increased from 7062 to 9993.

4. The PASCAL-Context

PASCAL-Context provides whole-scene annotations of PASCAL VOC 2010. While there are over 400 distinct classes, we follow the 59-class task defined by [26], which picks the most frequent classes.

The PASCAL-Context dataset [35] provides detailed semantic labels for the whole scene, including both objects (e.g., person) and stuff (e.g., sky). Following [The role of context for object detection and semantic segmentation in the wild, in CVPR, 2014], the proposed models are evaluated on the 59 most frequent classes along with one background category. The training set and validation set contain 4,998 and 5,105 images, respectively.
The folder /data/pascal-context/trainval/ provides annotations for the trainval split (10,103 images); at training time the original .jpg images are read from /data/pascal/VOC2010/JPEGImages/.

5. NYUDv2

NYUDv2 [30] is an RGB-D dataset collected using the Microsoft Kinect. It has 1449 RGB-D images, with pixel-wise labels that have been coalesced into a 40 class semantic segmentation task by Gupta et al. [13]. We report results on the standard split of 795 training images and 654 testing images.

6. SIFT Flow

SIFT Flow is a dataset of 2,688 images with pixel labels for 33 semantic categories ("bridge", "mountain", "sun"), as well as three geometric categories ("horizontal", "vertical", and "sky"). The standard split is 2,488 training and 200 test images.

7. Cityscapes Dataset

The Cityscapes dataset [5] contains 5,000 images: 2,975 in the training set, 500 in the validation set and 1,525 in the test set. The images are collected from street scenes in 50 different cities, with high-quality pixel-level annotations of 19 semantic classes at a high resolution of 2048×1024. Intersection over Union (IoU) averaged over all categories is adopted for evaluation.
Part I

gtFine_trainvaltest.zip (241MB)
fine annotations for train and val sets (3475 annotated images[train:2975 val:500]) and dummy annotations (ignore regions) for the test set (1525 images)

gtCoarse.zip (1.3GB)
coarse annotations for train and val set (3475 annotated images) and train_extra (19998 annotated images)

Others

leftImg8bit_trainvaltest.zip (11GB) [md5]
left 8-bit images - train, val, and test sets (5000 images)

leftImg8bit_trainextra.zip (44GB) [md5]
left 8-bit images - trainextra set (19998 images)

camera_trainvaltest.zip (2MB) [md5]
intrinsic and extrinsic camera parameters for train, val, and test sets

camera_trainextra.zip (8MB) [md5]
intrinsic and extrinsic camera parameters for trainextra set

vehicle_trainvaltest.zip (2MB) [md5]
vehicle odometry + GPS coordinates + temperature for train, val, and test sets

vehicle_trainextra.zip (7MB) [md5]
vehicle odometry + GPS coordinates + temperature for trainextra set

leftImg8bit_demoVideo.zip (6.6GB) [md5]
video sequences for qualitative evaluation, left 8-bit images only

8. ADE20K Dataset

The ADE20K dataset [34] is a large scale dataset recently released by ImageNet Large Scale Visual Recognition Challenge 2016 (ILSVRC2016). This dataset contains 150 semantic classes for scene parsing, with 20,210 images for training, 2,000 images for validation and 3,351 images for testing. Pixel-level annotations are provided for entire images. This dataset is more scene-centric with a diverse range of object categories. The performance is evaluated based on both pixel-wise accuracy and the Intersection over Union (IoU) averaged over all the semantic categories.
