Translation of "Panoptic Segmentation with a Joint Semantic and Instance Segmentation Network"

Latest recommended article published on 2022-09-02 09:50:46

二佳呀

Panoptic Segmentation with a Joint Semantic and Instance Segmentation Network

Abstract
1. Introduction
2. Method
- 2.1. Network architecture

Abstract

We present an end-to-end method for the task of panoptic segmentation. The method makes instance segmentation and semantic segmentation predictions in a single network, and combines these outputs using heuristics to create a single panoptic segmentation output. The architecture consists of a ResNet-50 feature extractor shared by the semantic segmentation and instance segmentation branches. For instance segmentation, a Mask R-CNN type of architecture is used, while the semantic segmentation branch is augmented with a Pyramid Pooling Module. Results for this method are submitted to the COCO and Mapillary Joint Recognition Challenge 2018. Our approach achieves a PQ score of 17.6 on the Mapillary Vistas validation set and 27.2 on the COCO test-dev set.

1. Introduction

A key task in computer vision is image recognition, for which the ultimate goal is to recognize all elements in an image. At a high level these elements can be divided into two categories: things and stuff [4]. Things are usually countable objects, such as vehicles, persons and furniture items. On the other hand, stuff is the set of remaining elements, usually not countable, such as sky, road and water. Within image recognition, many tasks have been introduced to identify these elements in images. Instance segmentation and semantic segmentation are two such tasks that have become very prominent, with state-of-the-art methods [6, 10] and [2, 14], respectively. The first task, instance segmentation, focuses on the detection and segmentation of things. If an object is detected, a pixel mask is predicted for this object, and the output of such a method is a set of pixel masks. By design, this method does not account for all elements in an image, as it does not consider stuff classes. The second task, semantic segmentation, does consider all elements, as the aim is to make a class prediction for each pixel in an image, for both things and stuff classes. However, the semantic segmentation output does not differentiate between different instances of things classes. As a result, both methods lack the ability to fully describe the contents of an image.

Figure 1: Predictions by the network for an image from the Mapillary Vistas validation set. Top left: original image. Top right: panoptic segmentation. Bottom left: semantic segmentation. Bottom right: instance segmentation.

To fill this gap, the task of panoptic segmentation is introduced in [8]. For this task, each pixel of an image must be assigned a class label and an instance id. For things classes, the instance id is used to distinguish between different objects. On the other hand, for the stuff classes, it is not necessary, and sometimes not even possible, to distinguish between different instances. Therefore, pixels in these classes always get the same instance id. In [8], a baseline method for this task is presented, for which they take the outputs of the best scoring instance segmentation and semantic segmentation networks, and combine them using basic heuristics to generate an output in the panoptic format. Before the task of panoptic segmentation was formally introduced, there were already some publications that focused on exactly this task. In [13], depth layering and direction predictions are used to detect different instances of objects in a semantic segmentation map. In [1], a Dynamically Instantiated Network is used to combine the outputs from an external object detector and an internal semantic segmentation network to form a panoptic-like output.
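The panoptic format described above, one class label and one instance id per pixel, can be illustrated with a small sketch. The class ids, the class names, and the choice of 0 as the shared instance id for stuff are assumptions made for this example only; the text does not specify any particular encoding.

```python
# Hypothetical class ids, chosen only for illustration.
STUFF_CLASSES = {0: "sky", 1: "road"}
THING_CLASSES = {2: "car", 3: "person"}


def make_panoptic_pixel(class_id, instance_id=0):
    """Return the (class label, instance id) pair for one pixel.

    Stuff pixels always share instance id 0 (an assumed convention);
    thing pixels keep the id that separates one object from another.
    """
    if class_id in STUFF_CLASSES:
        return (class_id, 0)
    if class_id in THING_CLASSES:
        return (class_id, instance_id)
    raise ValueError(f"unknown class id: {class_id}")


def count_segments(panoptic_map):
    """Count distinct (class, instance) pairs, i.e. panoptic segments."""
    return len({pixel for row in panoptic_map for pixel in row})


# A 2x4 toy image: sky on top, road below, with two distinct cars.
sky, road = make_panoptic_pixel(0), make_panoptic_pixel(1)
car_a, car_b = make_panoptic_pixel(2, 1), make_panoptic_pixel(2, 2)
panoptic = [
    [sky, sky, sky, sky],
    [car_a, road, road, car_b],
]
num_segments = count_segments(panoptic)
```

In this toy map the two cars share a class label but remain separate segments, while all road pixels collapse into one segment regardless of where they lie.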

In this report, we present a single end-to-end network that makes both instance segmentation and semantic segmentation predictions, using a shared feature extractor. These predictions are combined to form panoptic segmentation outputs using heuristics, following [8]. The main contribution of our approach is that we apply end-to-end learning to jointly make semantic segmentation and instance segmentation predictions, from which a panoptic segmentation output is finally predicted.

2. Method

We propose a Joint Semantic and Instance Segmentation Network (JSIS-Net) for panoptic segmentation. This method consists of two main parts: 1) a Convolutional Neural Network (CNN) that jointly predicts semantic segmentation and instance segmentation outputs (Section 2.1) and 2) heuristics that are used to merge these outputs to generate panoptic segmentation predictions (Section 2.2).
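To make the second part more concrete, the following is a minimal sketch of how two such outputs could be merged. It is not the procedure of Section 2.2, which is not reproduced here: the rules used (higher-scoring instances claim overlapping pixels first, remaining pixels take the semantic label only if it is a stuff class, anything else becomes void) are assumptions for illustration, as are the function name, the dictionary keys, and the `VOID` marker.

```python
VOID = (-1, 0)  # hypothetical marker for pixels left unassigned


def merge_to_panoptic(semantic, instances, stuff_classes):
    """Merge a semantic label map and instance masks into one panoptic map.

    semantic:      2D list of class ids, one per pixel.
    instances:     list of dicts with keys "class_id", "score" and "mask"
                   (a 2D list of 0/1 values shaped like `semantic`).
    stuff_classes: set of class ids that count as stuff.
    """
    height, width = len(semantic), len(semantic[0])
    panoptic = [[None] * width for _ in range(height)]

    # Assumed rule: higher-scoring instances claim contested pixels first.
    ranked = sorted(instances, key=lambda inst: inst["score"], reverse=True)
    for instance_id, inst in enumerate(ranked, start=1):
        for y in range(height):
            for x in range(width):
                if inst["mask"][y][x] and panoptic[y][x] is None:
                    panoptic[y][x] = (inst["class_id"], instance_id)

    # Assumed rule: remaining pixels take the semantic label if it is stuff.
    for y in range(height):
        for x in range(width):
            if panoptic[y][x] is None:
                label = semantic[y][x]
                panoptic[y][x] = (label, 0) if label in stuff_classes else VOID
    return panoptic


# Toy 2x3 example: classes 0 and 1 are stuff, class 2 is a thing.
semantic = [[0, 0, 0], [1, 2, 2]]
instances = [
    {"class_id": 2, "score": 0.9, "mask": [[0, 0, 0], [0, 1, 1]]},
    {"class_id": 2, "score": 0.6, "mask": [[0, 0, 0], [0, 0, 1]]},
]
merged = merge_to_panoptic(semantic, instances, stuff_classes={0, 1})
```

In the toy example the lower-scoring instance overlaps entirely with the higher-scoring one, so it contributes no pixels to the merged map.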

2.1. Network architecture

We propose a CNN that jointly predicts semantic segmentation and instance segmentation outputs. The base of the network is a ResNet-50 feature extractor [7], which is shared by the semantic segmentation and instance segmentation branches. This is depicted in Figure 2.

The semantic segmentation branch first applies a Pyramid Pooling Module to the generated feature map, as presented in [14], and uses hybrid upsampling to reshape the predictions to the size of the input image [11]. This hybrid upsampling first applies a deconvolution operation and then bilinearly resizes the predictions to the dimensions of the input image. The output of this branch is a pixel map where each entry corresponds to the predicted class label for that pixel in the input image.
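The last two steps of this branch, bilinear resizing followed by a per-pixel class decision, can be sketched in plain Python. This is an illustrative sketch only: the deconvolution step and the Pyramid Pooling Module are omitted, the corner-aligned sampling convention is an assumption, and the function names are hypothetical.

```python
def bilinear_resize(grid, out_h, out_w):
    """Bilinearly resize a 2D list of floats to out_h x out_w.

    Corner pixels of input and output are aligned (an assumed convention).
    """
    in_h, in_w = len(grid), len(grid[0])
    result = []
    for y in range(out_h):
        # Map the output row back to a fractional input row.
        src_y = y * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(src_y)
        y1 = min(y0 + 1, in_h - 1)
        wy = src_y - y0
        row = []
        for x in range(out_w):
            src_x = x * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(src_x)
            x1 = min(x0 + 1, in_w - 1)
            wx = src_x - x0
            # Interpolate along x on the two neighbouring rows, then along y.
            top = grid[y0][x0] * (1 - wx) + grid[y0][x1] * wx
            bottom = grid[y1][x0] * (1 - wx) + grid[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        result.append(row)
    return result


def label_map(score_maps, out_h, out_w):
    """Resize one score map per class, then pick the best class per pixel."""
    resized = [bilinear_resize(m, out_h, out_w) for m in score_maps]
    return [
        [max(range(len(resized)), key=lambda c: resized[c][y][x])
         for x in range(out_w)]
        for y in range(out_h)
    ]


# Two classes with 1x2 score maps, resized to a 1x4 label map.
scores = [[[1.0, 0.0]], [[0.0, 1.0]]]
labels = label_map(scores, 1, 4)
```

Resizing the class scores before taking the per-pixel maximum, rather than resizing the label map itself, avoids interpolating between class ids, which have no numeric meaning.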

The instance segmentation branch is based on Mask R-CNN [6]. First, a Region Proposal Network (RPN) is used to generate region proposals for potential objects in the image. The features corresponding to these proposals are then extracted from the feature map and passed through the final layers of ResNet-50. Finally, these features are used to make three different parallel predictions: a classification score, bounding box coordinates, and an instance mask. After applying non-maximum suppression, the output of this branch is a set of pixel clusters with class labels predicted to correspond to the location of different objects in the image. With post-processing, these pixel clusters are transformed to form per-object normalized instance masks with the dimensions of the input image.
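The non-maximum suppression step mentioned above can be illustrated with the standard greedy formulation. This is a generic sketch rather than the exact variant used in the network: the box format `(x1, y1, x2, y2)`, the IoU threshold of 0.5, and the function names are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_maximum_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of kept boxes, highest score first.

    A box is dropped when it overlaps an already kept box by more than
    `iou_threshold` (0.5 is an assumed default, not taken from the paper).
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


# Two heavily overlapping detections and one separate detection.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = non_maximum_suppression(boxes, scores)
```

Here the second box overlaps the first with an IoU of about 0.68, so only the first and third detections survive.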

Figure 2: The JSIS-Net architecture.