Notes on "Self-Supervised Difference Detection for Weakly-Supervised Semantic Segmentation"

This paper explores how self-supervision can improve weakly-supervised semantic segmentation. By performing difference detection on pre-generated coarse segmentation masks, the authors propose a network called DD-Net that identifies and corrects wrongly segmented regions. During training, the method repeatedly selects the predictable part of the difference information, which improves segmentation accuracy. Experiments show that the approach effectively improves segmentation quality.

Introduction

The paper describes how to train a semantic segmentation model with weak supervision. Weakly-supervised methods have no strong supervision such as pixel-level ground truth; in this paper, the only labels available are image-level classification labels. Many existing methods can already generate segmentation masks from such classification labels. Building on that work, this paper refines the generated segmentation masks into more accurate ones.

The function that takes a coarse mask as input and outputs a finer mask is called the mapping function. Prior work has shown that repeatedly feeding the output mask back into the mapping function can give better results, but this iteration does not guarantee that the output mask is actually better than the input mask. The authors target exactly this problem and propose a method that keeps the mask improving, and their explanation for why it works is convincing.
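As a rough illustration of this iterative use of a mapping function, here is a minimal sketch (the names `refine_mask`, `mapping_fn`, and `num_iters` are hypothetical, not the authors' code):

```python
# Minimal sketch of iterative mask refinement with a mapping function.
# The point of the paper: nothing in this plain loop guarantees that each
# iteration actually improves the mask, which is the problem the authors address.
def refine_mask(initial_mask, mapping_fn, num_iters=3):
    mask = initial_mask
    for _ in range(num_iters):
        mask = mapping_fn(mask)  # coarse mask in, refined mask out (e.g. a CRF step)
    return mask
```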

Method

The information fed into the mapping function is defined as knowledge, and its output as advice. The advice is assumed to provide supervision, but this supervision contains noise; the method extracts the useful part of the advice. The regions where knowledge and advice disagree are called the difference (Figure 1(a)), and a network, DD-Net (the self-supervised difference detection module), is trained to predict this difference. DD-Net takes either the knowledge or the advice as input, and during training the target difference can be computed directly from the knowledge and the advice. Since DD-Net's supervision (its ground truth) is produced from the data itself, DD-Net is trained in a self-supervised manner.

(Figure 1: the concept of the proposed approach)

In practice, some of the advice is predictable and some is not. Advice that is easy to infer is advice for which the training set contains many similar examples. The authors assume that the advice contains enough good information that its predictable part can be treated as the useful part, and they propose a way to select exactly this information: the part of the advice that is genuinely correct and that can be inferred through difference detection. As shown in Figure 1(b)(c), the knowledge is the input mask and the advice is the output mask; wherever the advice differs from the knowledge, the knowledge is likely misclassified in those regions. DD-Net detects the regions where the knowledge is wrong (the difference), and the regions it can predict are called the predictable difference. Because DD-Net is trained on the dataset's samples, the differences it can predict are indeed places where the knowledge is misclassified. The advice contains noise and can be split into true advice and false advice. True advice corresponds to correct suggestions; this useful information recurs across the dataset's samples, so DD-Net can learn it, and the true advice is therefore equated with the predictable difference. In short, what DD-Net learns during training is useful information that can be used to correct errors in the existing mask.

Difference detection network

First, how the difference is predicted. The knowledge is a semantic segmentation mask produced by some other weakly-supervised method, or an output mask of the mapping function. Several mapping functions have been proposed in the literature; CRF is a common choice. The advice is the output of the mapping function. Given the knowledge and the advice, their difference can be computed. Define the knowledge as $m^K$, the advice as $m^A$, and the difference as $M^{K,A} \in \mathbb{R}^{H \times W}$:
$$
M^{K,A}_u = \begin{cases} 1 & \text{if } m^K_u = m^A_u \\ 0 & \text{if } m^K_u \neq m^A_u \end{cases} \tag{1}
$$
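A minimal NumPy sketch of Eq. (1), assuming the knowledge $m^K$ and advice $m^A$ are integer label maps of shape $H \times W$ (the function and variable names are mine, not the paper's):

```python
import numpy as np

def difference_mask(m_K: np.ndarray, m_A: np.ndarray) -> np.ndarray:
    """Eq. (1): 1 where knowledge and advice agree, 0 where they differ."""
    return (m_K == m_A).astype(np.float32)

# Example: two 2x2 label maps that disagree at one pixel.
m_K = np.array([[0, 1], [1, 2]])
m_A = np.array([[0, 1], [2, 2]])
print(difference_mask(m_K, m_A))  # [[1. 1.] [0. 1.]]
```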

Next, the network structure of DD-Net.
(Figure: DD-Net architecture)
DD-Net takes as input the backbone network's high-level features $e^h(x;\theta_e)$ and low-level features $e^l(x;\theta_e)$, together with a mask $\hat{m}$, and outputs a confidence map $d$ of the difference mask. The training loss is
$$
\mathcal{L}_{\text{diff}} = \frac{1}{|S|} \sum_{u \in S} \Big( J(M^{K,A}, d^K, u; \theta_d) + J(M^{K,A}, d^A, u; \theta_d) \Big) \tag{2}
$$
where
$$
J(M, d, u) = M_u \log d_u + (1 - M_u) \log (1 - d_u)
$$
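A PyTorch-style sketch of Eq. (2), assuming $d^K$ and $d^A$ are DD-Net's confidence maps when conditioned on the knowledge mask and the advice mask respectively, and that $S$ covers all pixel locations (the function name, tensor shapes, and the clamping constant are my assumptions, not the authors' implementation):

```python
import torch

def diff_detection_loss(M_KA: torch.Tensor,
                        d_K: torch.Tensor,
                        d_A: torch.Tensor,
                        eps: float = 1e-7) -> torch.Tensor:
    """Eq. (2): per-pixel average of J(M^{K,A}, d^K, u) + J(M^{K,A}, d^A, u),
    where J is the binary cross-entropy log-likelihood of the difference mask."""
    def J(M, d):
        d = d.clamp(eps, 1 - eps)  # avoid log(0)
        return M * torch.log(d) + (1 - M) * torch.log(1 - d)
    # As written, this is a log-likelihood; in practice one would minimize
    # its negative (i.e. a standard binary cross-entropy loss).
    return (J(M_KA, d_K) + J(M_KA, d_A)).mean()

# Example with random tensors in place of real DD-Net outputs.
M_KA = (torch.rand(1, 1, 8, 8) > 0.5).float()
d_K, d_A = torch.rand(1, 1, 8, 8), torch.rand(1, 1, 8, 8)
print(diff_detection_loss(M_KA, d_K, d_A))
```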
