Paper reading notes: "Spatial Group-wise Enhance: Improving Semantic Feature Learning in Convolutional Networks"

gwpscut

Article link: https://blog.csdn.net/gwplovekimi/article/details/94595234

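I have recently been studying attention mechanisms, and one of their drawbacks seems to be the considerable extra computation they add. Then I came across a WeChat article claiming that classification and detection performance can be improved substantially with almost no increase in parameter count or computation.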

Paper link: https://arxiv.org/pdf/1905.09646.pdf

Code link: https://github.com/implus/PytorchInsight (the repository also contains implementations of various other attention modules)

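The idea of the paper is simple and resembles SENet (attention over channels) combined with spatial attention: the channels are divided into groups, and spatial attention is applied within each group. The authors note that a complete feature is composed of many sub-features, and that these sub-features are distributed in groups within the features of each layer. However, the sub-features are all processed in the same way and are all affected by background noise, which can lead to incorrect recognition and localization results. The authors therefore propose the SGE module, which generates an attention factor within each group. This yields the importance of each sub-feature and lets each group learn in a targeted way and suppress noise. The attention factor is determined solely by the similarity between the global and local features within each group, so SGE is very lightweight. After training, SGE turns out to be particularly effective for some high-level semantics. According to the authors' experiments, it can significantly improve performance on image recognition tasks.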

The Convolutional Neural Networks (CNNs) generate the feature representation of complex objects by collecting hierarchical and different parts of semantic subfeatures. These sub-features can usually be distributed in grouped form in the feature vector of each layer, representing various semantic entities. We propose a Spatial Group-wise Enhance (SGE) module that can adjust the importance of each sub-feature by generating an attention factor for each spatial location in each semantic group, so that every individual group can autonomously enhance its learnt expression and suppress possible noise. The attention factors are only guided by the similarities between the global and local feature descriptors inside each group, thus the design of SGE module is extremely lightweight with almost no extra parameters and calculations. (The authors obtain the attention factor directly from the similarity between the global and local feature descriptors, so neither the computation nor the parameter count is large. Conventional methods, by contrast, may not add many parameters, but they do increase the computation considerably.)

Unlike other attention mechanisms, "we use the similarity between the global statistical feature and the local ones of each location as the source of generation for the attention masks".

However, due to the unavoidable noise and the existence of similar patterns, it is usually difficult for CNNs to obtain the well-distributed feature responses. To address this issue, we propose to utilize the overall information of the entire group space to further enhance the learning of semantic features in critical regions, given the fact that the features of the entire space are not dominated by noise (otherwise the model learns nothing from this group). The authors therefore approximate the semantic vector of each group with a global statistical feature obtained by spatial averaging.

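Using this global feature, an importance coefficient can be generated for each local feature: a simple dot product measures the similarity between the global semantic feature and the local feature.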
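As a minimal sketch of this step, the snippet below computes the global descriptor of one group by spatial averaging and then the dot-product score of every position. The function names `global_descriptor` and `similarity_scores` and the toy values are hypothetical, chosen only for illustration; this is not the authors' code.

```python
def global_descriptor(features):
    """Spatial average of the local feature vectors of one group."""
    m = len(features)
    dim = len(features[0])
    return [sum(x[d] for x in features) / m for d in range(dim)]


def similarity_scores(features):
    """Dot product between each local feature and the group's global descriptor."""
    g = global_descriptor(features)
    return [sum(gd * xd for gd, xd in zip(g, x)) for x in features]


# Hypothetical toy group: 4 spatial positions, each a 3-dimensional local feature.
group = [
    [2.0, 0.0, 1.0],
    [1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],   # a position with no response
    [-1.0, 0.0, 2.0],
]

g = global_descriptor(group)        # [0.5, 0.0, 1.0]
scores = similarity_scores(group)   # [2.0, 1.5, 0.0, 1.5]
```

The position with no response gets a score of 0, while positions whose features are large and point roughly in the direction of `g` get the highest scores.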

The concrete procedure is as follows:

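First, the features are divided into groups. Within each group, the feature at every spatial position is dot-multiplied with the group's globally pooled feature (a similarity measure) to obtain an initial attention mask. This mask is then normalized by subtracting its mean and dividing by its standard deviation, and each group learns two parameters, a scale and a shift, so that the normalization can be undone. Finally, a sigmoid produces the final attention mask, which scales the feature at every position of the original feature group.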
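The steps above can be sketched in plain Python for a single sample. This is an illustrative sketch under stated assumptions, not the reference implementation from the linked repository: the name `sge_forward`, the list-of-channels layout, the use of the population standard deviation, the `eps` value, and the `gamma`/`beta` values are all assumptions made here.

```python
import math


def sge_forward(channels, num_groups, gamma, beta, eps=1e-5):
    """Apply the group-wise spatial enhancement to one sample.

    channels: list of C channels, each a flat list of m = H*W activations.
    gamma, beta: one scale and one shift per group (learnable in a real model).
    Returns (enhanced channels, per-group attention masks).
    """
    num_channels = len(channels)
    assert num_channels % num_groups == 0, "channels must split evenly into groups"
    per_group = num_channels // num_groups
    m = len(channels[0])

    enhanced, masks = [], []
    for k in range(num_groups):
        group = channels[k * per_group:(k + 1) * per_group]
        # Global descriptor: spatial mean of every channel in the group.
        g = [sum(ch) / m for ch in group]
        # Initial mask: dot product of each position's feature vector with g.
        c = [sum(g[j] * group[j][i] for j in range(per_group)) for i in range(m)]
        # Normalize over the spatial positions (population std is an assumption).
        mu = sum(c) / m
        std = math.sqrt(sum((v - mu) ** 2 for v in c) / m)
        c_hat = [(v - mu) / (std + eps) for v in c]
        # Per-group scale and shift, then sigmoid to map the mask into (0, 1).
        mask = [1.0 / (1.0 + math.exp(-(gamma[k] * v + beta[k]))) for v in c_hat]
        masks.append(mask)
        # Scale every channel of the group by the shared spatial mask.
        for ch in group:
            enhanced.append([ch[i] * mask[i] for i in range(m)])
    return enhanced, masks


# Hypothetical 4-channel feature map with 4 spatial positions, split into 2 groups.
feature_map = [
    [2.0, 1.0, 0.0, -1.0],
    [1.0, 1.0, 0.0, 2.0],
    [0.0, 3.0, 0.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
]
enhanced, masks = sge_forward(feature_map, num_groups=2,
                              gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

Note that the only learnable quantities in this sketch are the two numbers `gamma[k]` and `beta[k]` per group, which is consistent with the claim that the module adds almost no parameters.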

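The goal is to strengthen the semantic distribution of the features learned by a CNN, so that features stand out in regions with the correct semantics, while feature vectors in semantically irrelevant regions are pushed as close to 0 as possible. We first divide the features into groups and assume that each group captures one specific semantic during learning. It is then natural to take the global average feature as the semantic vector learned by that group (it should at least be close; otherwise the group is dominated by noise features, and applying the operation or not makes little difference). Next, we take the dot product of the feature at each position with this global feature. By the definition of the dot product, features with a large norm and features whose direction is close to that of the global feature receive a large initial attention mask value, which is exactly what we want. Because the distribution of the attention mask computed on the same group varies greatly across samples, we need to normalize it to a common range to produce an accurate attention. Finally, the feature at each location is scaled by the final value between 0 and 1. The name of the method accurately reflects the core operation: we enhance the distribution of semantic features spatially, in a group-wise manner.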
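The dot-product argument can be written out explicitly. The notation below is an assumption introduced for this note: $x_i$ is the local feature at position $i$ of a group with $m$ positions, $g$ is the group's global average feature, $\theta_i$ is the angle between $g$ and $x_i$, $\mu_c$ and $\sigma_c$ are the mean and standard deviation of the scores $c_i$ over the positions, $\epsilon$ is a small constant for numerical stability, and $\gamma$, $\beta$ are the per-group scale and shift.

```latex
\begin{aligned}
g &= \frac{1}{m}\sum_{i=1}^{m} x_i, \\
c_i &= g \cdot x_i = \lVert g \rVert \, \lVert x_i \rVert \cos\theta_i, \\
\hat{c}_i &= \frac{c_i - \mu_c}{\sigma_c + \epsilon}, \qquad
a_i = \gamma\,\hat{c}_i + \beta, \\
\hat{x}_i &= x_i \cdot \operatorname{sigmoid}(a_i).
\end{aligned}
```

The second line shows why both a large norm $\lVert x_i \rVert$ and a small angle $\theta_i$ raise the initial mask value, and the third line is the normalization that brings masks from different samples to a common range.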

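Even though supervision comes only from labels, the CNN does learn some semantic features very precisely, such as a dog's nose, tongue, ears, and eyes. Could the ability to learn such precise semantic features also help recover details, or guide restoration by semantics? (Any such conjecture of course still needs to be verified experimentally.) Moreover, the feature maps enhanced by SGE highlight these semantic regions more precisely, fully achieving the effect the model was designed for. Strikingly, in rows 4 and 7, SGE even captures closed eyes well.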

References

https://mp.weixin.qq.com/s?__biz=MzI5MDUyMDIxNA==&mid=2247489447&idx=1&sn=b0571e1de700d15cacc213157a7f20cc&chksm=ec1ffa5edb6873481f7775aadd365aff45abcf9a8568f9e33975305531d3f60fabe48c799815&mpshare=1&scene=24&srcid=0703FNEC4bqsRAYTl6tut3No#rd

https://blog.csdn.net/py184473894/article/details/90603513

https://blog.csdn.net/qq_28778507/article/details/91129734