【EmbedMask】EmbedMask: Embedding Coupling for One-stage Instance Segmentation


bryant_meng


Article link: https://blog.csdn.net/bryant_meng/article/details/108691211


arXiv-2019
code: https://github.com/yinghdb/EmbedMask
An improvement built on top of FCOS

1 Background and Motivation

With the rapid progress of deep learning, CNN applications in computer vision have expanded from image-level to pixel-level tasks. Instance segmentation, for example, extends object detection by refining detected objects from instance-level to pixel-level.

Current CNN-based instance segmentation methods fall into two categories:

Proposal-based methods: first detect bounding boxes with an object detector, then classify the pixels inside each box. A representative method is Mask R-CNN (see the post 【Mask RCNN】 Mask R-CNN). One-stage detectors are generally less accurate than two-stage ones, while the RoI pooling operation in two-stage methods has two drawbacks:
- results in the loss of features and the distortion to the aspect ratios
- complex to adjust too many parameters
Segmentation-based methods: segment first, then cluster. These methods have no re-pooling operation (RoI pooling). The difficulty lies in the clustering step, where it is hard to determine the number of clusters or the positions of the cluster centers.

The authors combine the strengths of both families (It preserves strong detection capabilities as the proposal-based methods, and meanwhile keeps the details of images as the segmentation-based methods) and propose EmbedMask, an instance segmentation method that uses embeddings to simplify the clustering procedure in segmentation-based methods and to avoid the re-pooling procedure in Mask R-CNN.

2 Related Work

Proposal-based methods: detection and segmentation
- Two-stage methods, e.g. Mask R-CNN
- One-stage methods, e.g. YOLACT, TensorMask
Segmentation-based methods (bottom-up methods): first segment, then cluster (e.g. pull the pixels of the same instance close together in the embedding space)

3 Advantages / Contributions

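Propose a framework that unites the proposal-based and segmentation-based methods through pixel embeddings and proposal embeddings.
A one-stage instance segmentation method that claims higher mask quality (shown on hand-picked images) and higher speed than the two-stage Mask R-CNN (although its AP is still lower than that of some other methods).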

4 Method

EmbedMask uses pixel embeddings, proposal embeddings, and proposal margins to extract the instance masks.

[Image not preserved]
$d = 32$

At location $x_j$, the full set of parameters of $proposal_j$ is $\{class_j, box_j, center_j, q_j, \sigma_j\}$,

where $q_j$ is the proposal embedding, which is regarded as the cluster center, and

$\sigma_j$ is the proposal margin.

The mask of each proposal is generated from the similarity between the pixel embeddings and the proposal embedding. The proposal margin $\sigma_j$ acts as a threshold on that similarity and decides the final mask.

4.1 Embedding Definition

[Image not preserved]
The authors introduce two new kinds of embedding:

proposal embedding, which is a good representation of the entire instance
pixel embedding, which learns the relation between each pixel and its corresponding instance

In the embedding space the proposal embedding plays the role of a cluster center, and the pixel embeddings of the same instance lie close to that center.

Compared with other methods, this design avoids having to find the number and the positions of the cluster centers.

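The conventional formulation is

[Equation (1): image not preserved]

$p_i$ is the pixel embedding
$q_i$ is the proposal embedding
$\delta$ is the proposal margin (a fixed threshold in this formulation)
$Q_k$ is the mean of the $q_i$ over the positive region of the instance proposal $S_k$ (the GT mask), i.e. the cluster center

During training $S_k$ is the GT mask and $Q_k$ is the mean of all positive proposal embeddings. The objective is to pull the pixel embeddings of an instance as close as possible to its proposal embedding, and to push the embeddings of background pixels as far away from it as possible.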
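The image of equation (1) is not available here, so the following is only a minimal sketch of the idea described above, assuming the mask is obtained by thresholding the Euclidean distance between each pixel embedding $p_i$ and the cluster center $Q_k$ with the fixed margin $\delta$. The function name `fixed_margin_mask` and the toy data are hypothetical.

```python
import numpy as np

def fixed_margin_mask(pixel_emb, center, delta):
    """Return a binary mask: 1 where ||p_i - Q_k|| <= delta, else 0.

    pixel_emb: (H, W, d) array of pixel embeddings p_i.
    center:    (d,) cluster center Q_k.
    delta:     scalar fixed margin (assumed to threshold Euclidean distance).
    """
    dist = np.linalg.norm(pixel_emb - center, axis=-1)
    return (dist <= delta).astype(np.uint8)

# Toy example with d = 2 for readability.
pixel_emb = np.array([[[0.0, 0.0], [0.3, 0.4]],
                      [[3.0, 4.0], [1.0, 0.0]]])
center = np.array([0.0, 0.0])
mask = fixed_margin_mask(pixel_emb, center, delta=1.0)
print(mask)
```

Only the pixel whose embedding is far from the center (distance 5) is excluded from the mask.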

Building on equation (1), the embeddings can be trained with a hinge loss:

[Equation (2): image not preserved]

$K$: the number of GT instances
$B_k$: represents the set of pixel embeddings that need to be supervised for the instance $S_k$, i.e. the region covered by the bbox of the GT mask
$N_k$: the number of pixel embeddings in $B_k$
$\mathbb{I}(i \in S_k)$: indicator function
$S_k$: the mask of the GT instance
$Q_k$: the mean of all positive proposal embeddings, where the positive locations are those whose predicted bbox has IoU greater than 0.5 with the bbox of $S_k$
$[x]_+$: denotes $\max(0, x)$
$\delta_a$ and $\delta_b$ are two margins designed for the pull and push strategies, respectively

The first term pulls the embeddings to within margin $\delta_a$; the second term pushes them beyond margin $\delta_b$.
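To make the pull and push terms concrete, here is a rough per-instance sketch. It assumes squared hinge terms on the Euclidean distance, averaged over the $N_k$ supervised pixels of one instance; the average over the $K$ instances is left out. The function name and the toy values are hypothetical, and the margins are taken from the fixed-margin ablation setting ($\delta_a = 0.5$, $\delta_b = 0.8$).

```python
import numpy as np

def hinge_embedding_loss(pixel_emb, center, in_mask, delta_a, delta_b):
    """Pull/push hinge loss for a single instance (sketch).

    pixel_emb: (N, d) embeddings of the supervised pixels B_k.
    center:    (d,) cluster center Q_k.
    in_mask:   (N,) bool, True where the pixel lies inside the GT mask S_k.
    """
    dist = np.linalg.norm(pixel_emb - center, axis=-1)
    pull = np.maximum(0.0, dist - delta_a) ** 2  # penalise mask pixels beyond delta_a
    push = np.maximum(0.0, delta_b - dist) ** 2  # penalise other pixels closer than delta_b
    return float(np.where(in_mask, pull, push).mean())

pixel_emb = np.array([[0.0, 0.0], [2.0, 0.0], [0.5, 0.0], [3.0, 0.0]])
in_mask = np.array([True, True, False, False])
loss = hinge_embedding_loss(pixel_emb, np.zeros(2), in_mask, delta_a=0.5, delta_b=0.8)
print(loss)
```

In this toy case only two pixels contribute: the mask pixel at distance 2 (pull term) and the non-mask pixel at distance 0.5 (push term).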

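Plotting the two terms makes the relationship clear: the horizontal axis is the distance between $p$ and $q$, and the vertical axis shows each of the two loss terms.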

import numpy as np
import matplotlib.pyplot as plt

# x plays the role of the distance between p and q; both margins are set to 1 here
x = np.linspace(-2, 5, 100)
y1 = np.maximum(0, x - 1) ** 2  # pull term: grows once x exceeds the margin
y2 = np.maximum(0, 1 - x) ** 2  # push term: grows once x falls below the margin

plt.plot(x, y1)
plt.plot(x, y2)
plt.legend(['y1', 'y2'], loc="upper center")

# gca = get current axes
ax = plt.gca()

# spines = the four border lines around the plot
ax.spines['right'].set_color('none')  # hide the right border
ax.spines['top'].set_color('none')    # hide the top border

ax.xaxis.set_ticks_position('bottom')  # draw the x ticks on the bottom spine
ax.yaxis.set_ticks_position('left')    # draw the y ticks on the left spine

ax.spines['bottom'].set_position(('data', 0))  # move the x-axis to y = 0
ax.spines['left'].set_position(('data', 0))    # move the y-axis to x = 0

plt.show()

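[Image not preserved]
The first term keeps growing once x exceeds its margin; the second term keeps growing once x drops below its margin.

The authors observe that with fixed margins it is difficult to find the optimal values,

[Image not preserved]

so they learn the margin instead.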

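4.2 Learnable Margin

A Gaussian function is used to decide whether a pixel belongs to an instance, replacing equation (1) of Section 4.1:

[Equation (3): image not preserved]

It maps the distance between the pixel embedding $p_i$ of the pixel $x_i$ and the proposal embedding $Q_k$ of the instance $S_k$ into a value in $(0, 1]$.

$\Sigma_k$ is the mean of $\sigma_j$ over the positive region, analogous to the relation between $q_j$ and $Q_k$; a location is positive when its predicted bbox has IoU greater than 0.5 with the bbox of $S_k$
$\phi(x_i,S_k)$ is the probability that pixel $x_i$ belongs to the GT mask $S_k$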
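As a rough illustration of the Gaussian mapping, the sketch below assumes the unnormalised form $\phi = \exp(-\lVert p_i - Q_k \rVert^2 / (2\Sigma_k^2))$, which is consistent with the $\frac{1}{2\sigma^2}$ term mentioned in this post; the exact equation image is not available here, and the function name is hypothetical.

```python
import numpy as np

def gaussian_membership(pixel_emb, center, sigma):
    """phi(x_i, S_k): map the embedding distance to a value in (0, 1].

    pixel_emb: (N, d) pixel embeddings p_i.
    center:    (d,) cluster center Q_k.
    sigma:     scalar margin Sigma_k (assumed to act as the Gaussian width).
    """
    sq_dist = np.sum((pixel_emb - center) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

pixel_emb = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]])
phi = gaussian_membership(pixel_emb, np.zeros(2), sigma=1.0)
print(phi)
```

A pixel whose embedding coincides with the center gets probability 1, and the probability decays towards 0 as the distance grows. A larger `sigma` makes the decay slower, so more pixels are accepted.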

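The overall mask loss therefore no longer uses the hinge loss of equation (2); it takes the following form instead:

[Equation (4): image not preserved]

$L (\cdot)$ is a binary classification loss function
$\phi(x_i,S_k)$ is the probability that pixel $x_i$ belongs to the GT mask $S_k$
$\mathbb{G}(x_i,S_k)$ represents the ground truth label for pixel $x_i$ to judge whether it is in the mask of the proposal $S_k$, which is a binary value

In other words, the network has to learn $\Sigma$ rather than use a fixed $\sigma$. In practice it learns $\frac{1}{2\sigma^2}$ (an exponential function keeps the predicted values positive).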
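The following sketch shows one way these pieces could fit together. It assumes the network outputs an unconstrained value (called `raw_margin` here, a hypothetical name) that is passed through `exp` to obtain $\frac{1}{2\sigma^2}$, and it uses binary cross-entropy as a stand-in for the binary classification loss $L(\cdot)$; the post does not say which binary loss the authors actually use.

```python
import numpy as np

def mask_probability(pixel_emb, center, raw_margin):
    """phi with the margin parameterised as 1 / (2 sigma^2) = exp(raw_margin) > 0."""
    inv_two_sigma_sq = np.exp(raw_margin)  # exp keeps the predicted value positive
    sq_dist = np.sum((pixel_emb - center) ** 2, axis=-1)
    return np.exp(-sq_dist * inv_two_sigma_sq)

def binary_cross_entropy(prob, target, eps=1e-7):
    """Stand-in for L(.): mean BCE between phi and the binary label G (assumption)."""
    prob = np.clip(prob, eps, 1.0 - eps)
    return float(np.mean(-(target * np.log(prob) + (1.0 - target) * np.log(1.0 - prob))))

pixel_emb = np.array([[0.1, 0.0], [2.0, 0.0]])
target = np.array([1.0, 0.0])  # first pixel is inside the GT mask, second is not
prob = mask_probability(pixel_emb, np.zeros(2), raw_margin=0.0)
loss = binary_cross_entropy(prob, target)
print(prob, loss)
```

With `raw_margin=0.0` the factor $\frac{1}{2\sigma^2}$ equals 1, so the near pixel gets a probability close to 1 and the far pixel a probability close to 0, which gives a small loss.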

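4.3 Smooth Loss

$Q_k$ and $\Sigma_k$ are computed as follows:

[Equation (5): image not preserved]

$M_k$ is the set of positive locations, i.e. the locations whose predicted bbox has IoU > 0.5 with the GT bbox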
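The averaging over $M_k$ can be sketched as below, assuming the per-location proposal embeddings and margins are stored as flat arrays and the positive set is given as a boolean mask. All names and toy values are hypothetical.

```python
import numpy as np

def instance_center_and_margin(proposal_emb, proposal_sigma, positive):
    """Average q_j and sigma_j over the positive set M_k.

    proposal_emb:   (N, d) proposal embeddings q_j, one per location.
    proposal_sigma: (N,) proposal margins sigma_j.
    positive:       (N,) bool, True for locations in M_k.
    """
    Q_k = proposal_emb[positive].mean(axis=0)
    Sigma_k = proposal_sigma[positive].mean()
    return Q_k, Sigma_k

proposal_emb = np.array([[1.0, 0.0], [3.0, 2.0], [10.0, 10.0], [2.0, 1.0]])
proposal_sigma = np.array([0.5, 1.5, 9.0, 1.0])
positive = np.array([True, True, False, True])
Q_k, Sigma_k = instance_center_and_margin(proposal_emb, proposal_sigma, positive)
print(Q_k, Sigma_k)
```

The third location is not positive, so its outlier embedding and margin do not affect the cluster center or the averaged margin.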

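Note that during training $S_k$ is the ground truth, and it is used in two places:

when selecting the positive region for $Q_k$ and $\Sigma_k$ (locations inside the GT mask with IoU > 0.5)
when computing the binary loss of equation (4)

At test time $S_k$ is unknown, so $Q_k$ and $\Sigma_k$ cannot be computed. The authors therefore replace $Q_k$ and $\Sigma_k$ in equation (3) with $q_j$ and $\sigma_j$ at the current location.

As a result, $Q_k$ and $\Sigma_k$ differ between training and testing (the mean over a region during training, the value at a single location during testing). The authors mitigate this gap with the following loss:

[Equation (6): image not preserved]

It encourages the embedding at each location to stay close to its cluster center.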
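A minimal sketch of such a smoothing term is given below. The exact equation is not available here, so the squared-distance form and the equal weighting of the embedding and margin terms are assumptions, as are all names.

```python
import numpy as np

def smooth_loss(proposal_emb, proposal_sigma, positive):
    """Penalise the spread of q_j and sigma_j around their means Q_k and Sigma_k.

    Assumed form: mean squared distance over the positive locations.
    """
    q = proposal_emb[positive]
    s = proposal_sigma[positive]
    Q_k = q.mean(axis=0)
    Sigma_k = s.mean()
    emb_term = np.mean(np.sum((q - Q_k) ** 2, axis=-1))
    margin_term = np.mean((s - Sigma_k) ** 2)
    return float(emb_term + margin_term)

proposal_emb = np.array([[0.0, 0.0], [2.0, 0.0]])
proposal_sigma = np.array([1.0, 3.0])
positive = np.array([True, True])
loss = smooth_loss(proposal_emb, proposal_sigma, positive)
print(loss)
```

When every positive location already predicts the same embedding and margin, the loss is zero, which is exactly the situation where swapping $Q_k$ for $q_j$ at test time changes nothing.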

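4.4 Training

When computing the loss, the feature maps and the embeddings are all resized to 1/4 of the original image height and width.
[Equation: image not preserved]

where

[Equation: image not preserved]

$\lambda_1 = 0.5$, $\lambda_2 = 0.1$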

1) Training Samples for Box and Classification

FCOS

For $\{box_j, class_j, center_j\}$, a location is a positive sample if it lies in the center region of the ground-truth bounding box and inside the GT mask.

2) Training Samples for Proposal Embedding and Margin

A location is a positive sample if the bbox predicted at that location has IoU > 0.5 with the GT bbox (and the location is inside the GT mask).
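The positive-sample rule for the proposal embedding and margin can be sketched as follows, assuming boxes in `(x1, y1, x2, y2)` format and a precomputed boolean flag telling whether each location lies inside the GT mask. The helper names are hypothetical.

```python
import numpy as np

def box_iou(boxes, gt_box):
    """IoU between each box in `boxes` (N, 4) and a single `gt_box` (4,)."""
    x1 = np.maximum(boxes[:, 0], gt_box[0])
    y1 = np.maximum(boxes[:, 1], gt_box[1])
    x2 = np.minimum(boxes[:, 2], gt_box[2])
    y2 = np.minimum(boxes[:, 3], gt_box[3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    gt_area = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
    return inter / (area + gt_area - inter)

def positive_locations(pred_boxes, inside_gt_mask, gt_box, iou_thresh=0.5):
    """A location is positive if its predicted box has IoU > thresh and it lies in the GT mask."""
    return (box_iou(pred_boxes, gt_box) > iou_thresh) & inside_gt_mask

gt_box = np.array([0.0, 0.0, 10.0, 10.0])
pred_boxes = np.array([[0.0, 0.0, 10.0, 10.0],   # IoU 1.0
                       [0.0, 0.0, 10.0, 6.0],    # IoU 0.6
                       [5.0, 5.0, 15.0, 15.0],   # IoU about 0.14
                       [1.0, 1.0, 9.0, 9.0]])    # IoU 0.64, but outside the mask
inside_gt_mask = np.array([True, True, True, False])
positive = positive_locations(pred_boxes, inside_gt_mask, gt_box)
print(positive)
```

The last location shows why both conditions matter: its box overlaps the GT box well, but it is rejected because the location itself is not inside the GT mask.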

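3) Training Samples for Pixel Embedding

The training samples are the pixels that fall inside the GT bbox (those inside the GT mask are the positives). The experiments show that expanding the bbox, which adds more training samples (negatives), works better.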

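4.5 Inference

For each bbox kept after NMS (all treated as positives by default), $q_j$ and $\sigma_j$ at the corresponding location replace $Q$ and $\Sigma$, and the following formula gives the probability that $x_i$ belongs to $S_k$:

[Equation: image not preserved]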
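A rough sketch of this inference step is shown below. It assumes the same Gaussian form as in training, with $q_j$ and $\sigma_j$ of each surviving proposal in place of $Q_k$ and $\Sigma_k$, and it binarises the probability with a threshold of 0.5, which is an assumption not stated in this post. The function name and the toy data are hypothetical.

```python
import numpy as np

def infer_masks(pixel_emb, proposal_emb, proposal_sigma, prob_thresh=0.5):
    """Compute one binary mask per proposal kept after NMS.

    pixel_emb:      (H, W, d) pixel embeddings p_i.
    proposal_emb:   (N, d) proposal embeddings q_j of the kept proposals.
    proposal_sigma: (N,) proposal margins sigma_j.
    Returns a bool array of shape (N, H, W).
    """
    diff = pixel_emb[None, :, :, :] - proposal_emb[:, None, None, :]
    sq_dist = np.sum(diff ** 2, axis=-1)                        # (N, H, W)
    prob = np.exp(-sq_dist / (2.0 * proposal_sigma[:, None, None] ** 2))
    return prob > prob_thresh

pixel_emb = np.array([[[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]])   # H=1, W=3, d=2
proposal_emb = np.array([[0.0, 0.0], [5.0, 5.0]])
proposal_sigma = np.array([1.0, 1.0])
masks = infer_masks(pixel_emb, proposal_emb, proposal_sigma)
print(masks)
```

No RoI pooling is involved: every proposal is compared against the full-resolution pixel embedding map, which is the property the post credits for the sharper mask edges.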

5 Experiments

5.1 Datasets

MS COCO
- trainval35k split (115K images) for training,
- minival split (5K images) for ablation study
- test-dev (20K images) for reporting the main results

5.2 Main Results

1) Quantitative Results

[Image not preserved]
EmbedMask is the best among the one-stage methods compared.

2) Qualitative Results
[Image not preserved]
Left: Mask R-CNN; right: EmbedMask, which can provide more detailed masks than Mask R-CNN with sharper edges (no detail is lost to re-pooling)

5.3 Ablation Study

1) Fixed vs. Learnable Margin

[Image not preserved]

$\delta_a = 0.5$, $\delta_b = 0.8$, $\delta = 1.5$

The learned margin performs better.

2) The Choice of Cluster Centers

The results when $p_j$ is used as the cluster center instead of $Q_k$ are as follows:

[Image not preserved]

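3) Sampling Strategy

That is, the strategy for sampling positive locations.

$\{box_j, class_j, center_j\}$: whether the location falls in the center region

$\{q_j,\sigma_j\}$: IoU > 0.5
[Image not preserved]
Combining the two conditions, inside the mask and IoU > 0.5, works better.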

4) Training Samples for Pixel Embedding
[Image not preserved]
Positive samples are the pixels inside the GT mask. Enlarging the bbox by 1.2× adds training samples (negatives) and works better.
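The box enlargement could look like the sketch below. It assumes the box is scaled about its center by the same factor in both directions and optionally clipped to the image; the post only gives the factor 1.2, so the rest is an assumption, and `expand_box` is a hypothetical helper.

```python
def expand_box(box, scale=1.2, image_size=None):
    """Scale an (x1, y1, x2, y2) box about its center, optionally clipping to (width, height)."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w = (x2 - x1) * scale / 2.0
    half_h = (y2 - y1) * scale / 2.0
    new = [cx - half_w, cy - half_h, cx + half_w, cy + half_h]
    if image_size is not None:
        width, height = image_size
        new = [max(0.0, new[0]), max(0.0, new[1]),
               min(float(width), new[2]), min(float(height), new[3])]
    return tuple(new)

expanded = expand_box((10.0, 20.0, 30.0, 60.0))
clipped = expand_box((10.0, 20.0, 30.0, 60.0), image_size=(31, 100))
print(expanded, clipped)
```

Pixels that fall in the extra border are outside the GT mask, so they act as additional negatives for the pixel embedding.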

5) Embedding Dimension
[Image not preserved]

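6 Conclusion (own)

Proposal margin: in practice the network learns $\frac{1}{2\sigma^2}$, and an exponential function keeps the predicted values positive.
Lowercase $q$ vs. uppercase $Q$, and lowercase $\sigma$ vs. uppercase $\Sigma$: the lowercase symbols are the embeddings at individual locations, while the uppercase symbols are their means over the positive region (bbox IoU > 0.5).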