PSPNet Paper Translation and Commentary (Chinese-English Parallel Text)

Pyramid Scene Parsing Network
Abstract

Scene parsing is challenging for unrestricted open vocabulary and diverse scenes. In this paper, we exploit the capability of global context information by different-region-based context aggregation through our pyramid pooling module together with the proposed pyramid scene parsing network (PSPNet). Our global prior representation is effective to produce good quality results on the scene parsing task, while PSPNet provides a superior framework for pixel-level prediction. The proposed approach achieves state-of-the-art performance on various datasets. It came first in ImageNet scene parsing challenge 2016, PASCAL VOC 2012 benchmark and Cityscapes benchmark. A single PSPNet yields the new record of mIoU accuracy 85.4% on PASCAL VOC 2012 and accuracy 80.2% on Cityscapes.

摘要

场景解析对于无限制开放词汇库和不同的场景来说是一个挑战。本文通过所提出的金字塔场景分析网络(PSPNet),对不同区域的语境进行聚合,使模型拥有了理解全局语境信息的能力。我们的全局信息可以有效地在场景分析任务中产生高质量的结果,而PSPNet则为像素级预测提供了一个优越的框架。该方法在各种数据集上展现出了最高水平的性能,在2016年ImageNet场景分析挑战赛、Pascal VOC 2012数据集和Cityscapes数据集中排名第一。本文所提出这种PSPNet在Pascal VOC 2012上的mIoU准确率达到85.4%,在Cityscapes数据集上的mIoU准确率达到80.2%。
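
The abstract reports results as mIoU (mean intersection over union). As a rough illustration of what that metric measures, here is a minimal sketch in plain Python. The function name `mean_iou` and the list-of-lists label maps are assumptions made for this example. Benchmark evaluation scripts typically accumulate counts over a whole dataset and handle ignored labels, which this per-image sketch does not.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union for one pair of label maps.

    pred and target are H x W lists of integer class ids (hypothetical
    layout chosen for this sketch). Classes that appear in neither map
    are skipped so they do not distort the average.
    """
    ious = []
    for c in range(num_classes):
        inter = 0  # pixels labelled c in both maps
        union = 0  # pixels labelled c in at least one map
        for p_row, t_row in zip(pred, target):
            for p, t in zip(p_row, t_row):
                in_pred = p == c
                in_target = t == c
                if in_pred and in_target:
                    inter += 1
                if in_pred or in_target:
                    union += 1
        if union > 0:
            ious.append(inter / union)
    # Guard against an empty list (no class present at all).
    return sum(ious) / len(ious) if ious else 0.0


pred = [[0, 0],
        [1, 1]]
target = [[0, 1],
          [1, 1]]
# Class 0: intersection 1, union 2. Class 1: intersection 2, union 3.
print(mean_iou(pred, target, num_classes=2))
```

In this toy case the per-class IoUs are 1/2 and 2/3, so the mean is 7/12.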

1. Introduction

Scene parsing, based on semantic segmentation, is a fundamental topic in computer vision. The goal is to assign each pixel in the image a category label. Scene parsing provides complete understanding of the scene. It predicts the label, location, as well as shape for each element. This topic is of broad interest for potential applications of automatic driving, robot sensing, to name a few.

1. 引言

基于语义分割的场景分析是计算机视觉中的一个基本课题,其目的是为图像中的每个像素指定一个类别标签。场景分析提供了对场景的完整理解,预测了每个元素的类别、位置和形状。这一课题对于自动驾驶、机器人传感技术等潜在的应用领域来说具有广泛的研究意义。

Difficulty of scene parsing is closely related to scene and label variety. The pioneer scene parsing task [23] is to classify 33 scenes for 2,688 images on LMO dataset [22]. More recent PASCAL VOC semantic segmentation and PASCAL context datasets [8, 29] include more labels with similar context, such as chair and sofa, horse and cow, etc. The new ADE20K dataset [43] is the most challenging one with a large and unrestricted open vocabulary and more scene classes. A few representative images are shown in Fig. 1. To develop an effective algorithm for these datasets needs to conquer a few difficulties.

场景解析的难度与场景和类别标签的多样性密切相关。“先驱者”场景解析任务[23]是将LMO数据集[22]上的33个场景共2688张图像进行分类。最新的Pascal VOC语义分割和Pascal语境数据集[8,29]中包括了更多具有类似语境的标签,如椅子和沙发、马和牛等。新的ADE20K数据集[43]是最具挑战性的,它具有庞大且无限制的开放词汇库和更多的场景类别,其中一些有代表性的图像如图1所示。要为这些数据集开发一种有效的算法,仍需要克服较大的困难。

Figure 1. Examples of complex scenes in the ADE20K dataset

State-of-the-art scene parsing frameworks are mostly based on the fully convolutional network (FCN) [26]. The deep convolutional neural network (CNN) based methods boost dynamic object understanding, and yet still face challenges considering diverse scenes and unrestricted vocabulary. One example is shown in the first row of Fig. 2, where a boat is mistaken as a car. These errors are due to similar appearance of objects. But when viewing the image regarding the context prior that the scene is described as boathouse near a river, correct prediction should be yielded.

最先进的场景分析框架主要是基于全卷积网络(FCN)[26]。基于深度卷积神经网络(CNN)的方法提高了对动态对象的理解,但考虑到场景的多样性和词汇库的不受限制,其仍然面临着较大的挑战。图2的第一行展示了一个例子,图中的一艘船被误认为是一辆汽车。这些错误是由于对象的外观相似造成的。但是,基于上下文信息,我们发现图像中的场景被描述为靠近河流的船坞,利用这一点,当要判别图像中元素的类别时,模型应要做出正确的预测。

Towards accurate scene perception, the knowledge graph relies on prior information of scene context. We found that the major issue for current FCN based models is lack of suitable strategy to utilize global scene category clues. For typical complex scene understanding, previously to get a global image-level feature, spatial pyramid pooling [18] was widely employed where spatial statistics provide a good descriptor for overall scene interpretation. Spatial pyramid pooling network [12] further enhances the ability.

准确的感知场景依赖于事先理解的场景语境信息。我们发现,当前基于FCN的模型的主要问题是缺乏合适的策略来利用全局场景类别线索。对于典型的复杂场景理解,以前为了获得全局图像级特征,广泛使用了空间金字塔池化技术[18],该技术中的空间统计数据为整体场景解释提供了良好的描述词。后来,在空间金字塔池化网络[12]中进一步增强了这种能力。

Different from these methods, to incorporate suitable global features, we propose pyramid scene parsing network (PSPNet). In addition to traditional dilated FCN [3, 40] for pixel prediction, we extend the pixel-level feature to the specially designed global pyramid pooling one. The local and global clues together make the final prediction more reliable.
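
To make the idea of combining a pixel-level feature with pyramid-pooled global context more concrete, here is a minimal sketch in plain Python. It rests on several assumptions that are not stated in this excerpt: the bin sizes (1, 2, 3, 6) come from the full paper, the feature map has a single channel, upsampling is nearest-neighbor rather than bilinear, and the convolution layers used for dimension reduction are left out. All function names are hypothetical.

```python
def adaptive_avg_pool(fmap, bins):
    """Average-pool an H x W map (list of lists) into a bins x bins grid."""
    h, w = len(fmap), len(fmap[0])
    pooled = []
    for i in range(bins):
        # Region boundaries: floor for the start, ceiling for the end.
        r0, r1 = (i * h) // bins, -((-(i + 1) * h) // bins)
        row = []
        for j in range(bins):
            c0, c1 = (j * w) // bins, -((-(j + 1) * w) // bins)
            cells = [fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


def upsample_nearest(grid, h, w):
    """Stretch a bins x bins grid back to H x W (nearest neighbor)."""
    bins = len(grid)
    return [[grid[(r * bins) // h][(c * bins) // w] for c in range(w)]
            for r in range(h)]


def pyramid_pool(fmap, bin_sizes=(1, 2, 3, 6)):
    """Return the original map plus one context map per pyramid level.

    The returned list plays the role of channel-wise concatenation:
    index 0 is the local feature, the rest are progressively finer
    summaries of the whole map.
    """
    h, w = len(fmap), len(fmap[0])
    channels = [fmap]
    for b in bin_sizes:
        channels.append(upsample_nearest(adaptive_avg_pool(fmap, b), h, w))
    return channels


# Toy 6 x 6 feature map with values 0..35.
fmap = [[r * 6 + c for c in range(6)] for r in range(6)]
stacked = pyramid_pool(fmap)
print(len(stacked))      # original map plus four pyramid levels
print(stacked[1][0][0])  # coarsest level: global average of the map
```

The coarsest level gives every pixel the same global summary, while finer levels keep some spatial layout. Stacking them next to the original map is what lets a later classifier see local and global clues together.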
