Paper Reading: Hierarchical multi-scale attention for semantic segmentation

1. Paper Overview

Overall the paper is fairly incremental. It first points out a problem with current practice in semantic segmentation: during multi-scale inference, the per-scale predictions are simply averaged or max-pooled, which mixes the best predictions with the worst ones and produces sub-optimal results. (The underlying prior: large objects are segmented better at small input resolutions, while small objects are segmented better at large input resolutions; since a typical image contains both small and large objects, inference needs to be multi-scale.) The method: during training, two versions of the image at different sizes (one is the other downsampled by 2x) are fed through the network at the same time, and the two outputs are combined with an attention mechanism, which is really just a relative per-pixel weight: one scale gets α and the other gets 1 − α. Precisely because this attention is relative between adjacent scales, inference can chain together any number of scales (not just the two used in training), which is where the word "Hierarchical" in the title comes from.

The paper's second contribution: the trained model is used to generate pseudo-labels on the Cityscapes coarse-annotation set, refining the coarse labels into much more detailed ones; this relabelled large dataset is then trained on together with the fine-annotation set, which gives a further mIoU gain. The two tricks together put the method at the top of the Cityscapes leaderboard.


To address this problem, we adopt an attention mechanism to predict how to combine multi-scale predictions together at a pixel level, similar to the method proposed by Chen et al. [1].
(1) We propose a hierarchical attention mechanism by which the network learns to predict a relative weighting between adjacent scales. In our method, because of its hierarchical nature, we only need to augment the training pipeline with one extra scale, whereas other methods such as [1] require each additional inference scale to be explicitly added during the training phase. For example, when the target inference scales for multi-scale evaluation are {0.5, 1.0 and 2.0}, other attention methods require the network to first be trained with all of those scales, resulting in 4.25x (0.5² + 2.0²) extra training cost. Our method only requires adding an extra 0.5x scale during training, which only adds 0.25x (0.5²) cost. Furthermore, our proposed hierarchical mechanism also provides the flexibility of choosing extra scales at inference time, as compared to previously proposed methods that are limited to only using training scales during inference.
(2) To achieve state-of-the-art results in Cityscapes, we also adopt an auto-labelling strategy of coarse images in order to
increase the variance in the dataset, thereby improving generalization. Our strategy is motivated by multiple recent
works, including [2, 3, 4]. As opposed to the typical soft-labelling strategy, we adopt hard labelling in order to manage
label storage size, which helps to improve training throughput by lowering the disk IO cost.

2. The Prior Behind the Method


The task of semantic segmentation is to label all pixels within an image as belonging to one of N classes. There is a trade-off in this task: certain types of predictions are best handled at lower inference resolution, while others are better handled at higher inference resolution. Fine detail, such as the edges of objects or thin structures, is often better predicted with scaled-up image sizes. At the same time, predictions of large structures, which require more global context, are often done better at scaled-down image sizes, because the network's receptive field can observe more of the necessary context. We refer to this latter issue as class confusion. Examples of both of these cases are presented in Figure 1.

3. Drawbacks of Averaging or Max-Pooling Across Scales

Using multi-scale inference is a common practice to address this trade-off. Predictions are done at a range of scales, and the results are combined with averaging or max pooling. Using averaging to combine multiple scales generally improves results, but it suffers from the problem of combining the best predictions with poorer ones. For example, if, for a given pixel, the best prediction comes from the 2x scale and a much worse prediction comes from the 0.5x scale, then averaging will combine these predictions, resulting in sub-par output. Max pooling, on the other hand, selects only one of the N scales to use for a given pixel, while the optimal answer may be a weighted combination across the different scales of predictions.
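
To make the trade-off concrete, here is a minimal sketch (not from the paper) of how averaging and max pooling would fuse per-scale class probability maps; the `fuse_multiscale` helper and the tensor shapes are illustrative assumptions.

```python
import torch

def fuse_multiscale(preds, mode="avg"):
    """Fuse per-scale class probability maps, each of shape (N, C, H, W)
    and already resized to a common resolution.

    'avg' blends every scale equally, so a strong 2x prediction can be
    dragged down by a poor 0.5x one; 'max' keeps only the most confident
    scale per pixel and class, discarding potentially useful evidence."""
    stacked = torch.stack(preds, dim=0)  # (num_scales, N, C, H, W)
    if mode == "avg":
        return stacked.mean(dim=0)
    if mode == "max":
        return stacked.max(dim=0).values
    raise ValueError(f"unknown fusion mode: {mode}")

# Dummy predictions for three inference scales, 19 Cityscapes classes.
preds = [torch.softmax(torch.randn(1, 19, 256, 512), dim=1) for _ in range(3)]
fused_avg = fuse_multiscale(preds, "avg")
fused_max = fuse_multiscale(preds, "max")
```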


4. Relational Context Methods Improve Results on Elongated, Non-Square Objects!

Relational context methods. In practice, pyramid pooling techniques attend to fixed, square context regions because
pooling and dilation are typically employed in a symmetric fashion. Furthermore, such techniques tend to be static
and not learned. However, relational context methods build context by attending to the relationship between pixels
and are not bound to square regions. The learned nature of relational context methods allows context to be built based on image composition. Such techniques can build more appropriate context for non-square semantic regions, such as a long train or a tall thin lamp post. OCRNet [9], DANet [10], CFNet [11], OCNet [12] and other related work [13, 14, 15, 16, 17, 18, 19, 20] use such relationships to build better context.

5. Auto-Labelling (Hard Labels) vs. Soft Labels


While most image classification auto-labelling works use continuous or soft labels, we generate hard thresholded labels for storage efficiency and training speed. With soft labels, a teacher network provides a continuous probability for each of N classes for each pixel of an image, whereas for hard labels a threshold is used to pick a single top class per pixel. Similar to [37, 4], we generate hard dense labels for the coarse Cityscapes images. Examples are shown in Figure 4.

A common technique for auto-labelling in image classification is to use soft or continuous labels, whereby a teacher network provides a target (soft) probability for each of N classes for every pixel of every image. A challenge of this approach is disk space and training speed: it costs roughly 3.2 TB in disk space to store the labels: 20000 images * 2048 w * 1024 h * 19 classes * 4 B = 3.2 TB. Even if we chose to store such labels, reading such a volume of labels during training would likely slow training considerably.
Instead, we adopt a hard labelling strategy, whereby for a given pixel we select the top class prediction of the teacher network. We threshold the label based on the teacher network's output probability: teacher predictions that exceed the threshold become true labels, otherwise the pixel is labelled with the ignore class. In practice we use a threshold of 0.9.

Note: a soft label stores, for each pixel, the probability of belonging to every class (e.g., 19 probabilities for 19 classes, which is very costly to store); a hard label stores only the single class with the highest score.
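
A minimal sketch of how such hard pseudo-labels could be produced from teacher outputs: the `hard_pseudo_label` helper, the tensor shapes, and the ignore index of 255 (the usual Cityscapes convention) are assumptions for illustration; only the 0.9 threshold comes from the paper.

```python
import torch

IGNORE_INDEX = 255    # assumption: standard Cityscapes ignore label
CONF_THRESHOLD = 0.9  # threshold reported in the paper

def hard_pseudo_label(teacher_logits):
    """Turn teacher logits (N, C, H, W) into hard pseudo-labels (N, H, W).

    Each pixel keeps only the argmax class; pixels whose top softmax
    probability falls below the threshold are marked as ignore, so they
    contribute nothing to the training loss. Storing one uint8 class id
    per pixel instead of C floats is what keeps disk usage small."""
    probs = torch.softmax(teacher_logits, dim=1)
    top_prob, top_class = probs.max(dim=1)
    labels = top_class.to(torch.uint8)
    labels[top_prob < CONF_THRESHOLD] = IGNORE_INDEX
    return labels

# Example usage on a dummy coarse-image crop with 19 classes.
logits = torch.randn(1, 19, 128, 256)
pseudo = hard_pseudo_label(logits)
```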

6. How the Attention Mechanism Is Implemented


Note: the predicted attention map is single-channel (one relative weight per pixel).
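
Below is a minimal sketch of the pairwise, relative weighting described above: a single-channel attention map (here assumed to be predicted at the lower scale and squashed with a sigmoid) gives one scale a weight α and the adjacent scale 1 − α. Function and variable names are illustrative, not the authors' code; because the rule is pairwise and relative, it can be applied repeatedly to chain any number of inference scales.

```python
import torch
import torch.nn.functional as F

def fuse_two_scales(logits_low, logits_high, attn_low):
    """Combine predictions from a 0.5x ('low') and a 1.0x ('high') input.

    attn_low is the single-channel attention map (N, 1, h, w) predicted
    alongside the low-scale logits; after a sigmoid it acts as the relative
    weight alpha, and the high-scale prediction gets (1 - alpha)."""
    alpha = torch.sigmoid(attn_low)
    # Upsample the low-scale outputs to the high-scale resolution.
    size = logits_high.shape[-2:]
    logits_low_up = F.interpolate(logits_low, size=size, mode="bilinear",
                                  align_corners=False)
    alpha_up = F.interpolate(alpha, size=size, mode="bilinear",
                             align_corners=False)
    return alpha_up * logits_low_up + (1.0 - alpha_up) * logits_high

# Example usage with dummy tensors (19 classes).
low = torch.randn(1, 19, 64, 128)    # prediction on the 0.5x input
high = torch.randn(1, 19, 128, 256)  # prediction on the 1.0x input
attn = torch.randn(1, 1, 64, 128)    # single-channel attention head output
fused = fuse_two_scales(low, high, attn)
```

At inference time the same pairwise fusion can be repeated from the lowest scale upward (e.g., 0.5x into 1.0x, then the result into 2.0x), which is what lets the model use scales it never saw paired during training.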

7. Ablation Study on Cityscapes


8. Loss Choices

We apply the “polynomial” learning rate policy [41]. We use RMI [42] as the primary loss function under default
settings, and we use cross-entropy for the auxiliary loss function.
For Cityscapes, we use a poly exponent of 2.0, an
initial learning rate of 0.01, and train for 175 epochs across 2 DGX nodes. For Mapillary, we use a poly exponent of
1.0, an initial learning rate of 0.02, and train for 200 epochs across 4 DGX nodes. As in [29],
we use class uniform
sampling in the data loader to equally sample from each class, which helps improve results when there is unequal data
distribution.

Note: the two losses are different (RMI for the primary loss, cross-entropy for the auxiliary loss), and class uniform sampling is used in the data loader so that every class is sampled with equal probability.
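
For reference, a minimal sketch of the standard "polynomial" learning-rate policy mentioned above; the iteration counts in the usage example are made up for illustration, since the paper only specifies the base learning rate and poly exponent per dataset.

```python
def poly_lr(base_lr, cur_iter, max_iter, power=2.0):
    """Polynomial learning-rate policy: the rate decays from base_lr to 0
    over training, with the poly exponent controlling the curve
    (2.0 for Cityscapes, 1.0 for Mapillary in the paper's settings)."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power

# Example: Cityscapes-style settings (base lr 0.01, exponent 2.0).
for it in (0, 5000, 10000):
    print(it, poly_lr(0.01, it, max_iter=10000))
```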

