【PSA】《Polarized Self-Attention: Towards High-quality Pixel-wise Regression》

[figure]

arXiv-2021



1 Background and Motivation

Origin of the paper's name (quoted from *Polarized Self-Attention: Towards High-quality Pixel-wise Regression*):

In photography, there are always random lights in transverse directions that produce glares/reflections. Polarized filtering, by only allowing light to pass orthogonal to the transverse direction, can potentially improve the contrast of the photo. Due to the loss of total intensity, the light after filtering usually has a small dynamic range, and therefore needs an additional boost, e.g. by High Dynamic Range (HDR), to recover the details of the original scene.

[figures]

To prevent glare/reflections, a polarizing filter is applied; the total intensity drops, so an additional boost (e.g. HDR) is needed to recover the details of the original scene.

This is very much like attention.

When the spatial and channel dimensions are modeled jointly without dimension reduction, computation and GPU memory explode. To address this, the authors adopt a polarized filtering mechanism in PSA.

(1) Filtering: completely collapse the features along one dimension (e.g. the channel dimension) while keeping high resolution along the orthogonal dimension (e.g. the spatial dimension).

(2) High Dynamic Range (HDR): first apply Softmax on the smallest tensor in the attention block to enlarge the dynamic range of the attention, then use Sigmoid for a dynamic mapping.
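As a rough illustration of the filtering + HDR idea (my own toy sketch, not code from the paper), the snippet below collapses the channel-only query to the smallest tensor, boosts it with Softmax, aggregates, and then maps the result through Sigmoid:

import torch

# toy channel-only pipeline on x of shape (B, C, H, W)
x = torch.randn(2, 8, 4, 4)
B, C, H, W = x.shape

q = x.mean(dim=1).reshape(B, H * W, 1)      # smallest tensor in the branch: (B, HW, 1)
q = torch.softmax(q, dim=1)                  # Softmax enlarges the dynamic range
v = x.reshape(B, C, H * W)                   # (B, C, HW), spatial dimension kept at full resolution
attn = torch.sigmoid(torch.matmul(v, q))     # (B, C, 1): Sigmoid does the dynamic mapping
out = x * attn.reshape(B, C, 1, 1)           # re-weight the input channel-wise
print(out.shape)                             # torch.Size([2, 8, 4, 4])

In the real PSA block the query/value projections are 1×1 convolutions rather than a plain mean, as shown in the code later in this post.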


Deep learning has moved from coarse-grained tasks (classification / detection) to fine-grained computer vision tasks (keypoints / segmentation).

The pixel-wise regression problem has higher complexity, on the order of the number of output elements. Its main difficulties are:

  • Keeping high internal resolution at a reasonable cost
  • Fitting output distribution such as that of the key-point heatmaps or segmentation masks

Facing the above difficulties, the authors take a plug-and-play approach to improving the accuracy of pixel-wise regression and propose Polarized Self-Attention.

In the attention design, the channel-only branch preserves as much channel information as possible, and the spatial-only branch preserves as much spatial information as possible.

At the output, Softmax (matching Gaussian-like distributions) is composed with Sigmoid (matching binomial-like distributions).

[figure]

2 Related Work

  • Pixel-wise Regression Tasks
    • keypoint estimation(heatmaps)
    • semantic segmentation
  • Self-attention and its Variants
  • Full-tensor and simplified attention blocks
    optimizations of the Non-local block

3 Advantages / Contributions

Borrowing the idea of polarized filtering from optics, the paper proposes Polarized Self-Attention, which brings clear gains on public datasets for keypoint estimation and semantic segmentation.

4 Method

2D Gaussian distribution (keypoint heatmaps)

2D Binomial distribution (binary segmentation masks)
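For reference, a standard way of writing these regression targets (my own formulation, not quoted from the paper): a keypoint heatmap centered at $(x_k, y_k)$ is

$$H_k(x, y) = \exp\!\left(-\frac{(x - x_k)^2 + (y - y_k)^2}{2\sigma^2}\right),$$

and a binary segmentation mask treats each pixel as a Bernoulli/binomial variable with foreground probability $p(x, y)$:

$$P\big(M(x, y) = 1\big) = p(x, y), \qquad M(x, y) \in \{0, 1\}.$$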

Fuse a Softmax-Sigmoid composition in both the channel-only and spatial-only attention branches.
[figure]

The channel-only attention produces a $C\times1\times1$ map.

The $C\times1\times1$ map can also be obtained as $(C\times HW) \times (HW\times1\times1)$.

The spatial-only attention produces a $1\times H\times W$ map.

The $1\times H\times W$ map can also be obtained from $(1\times C) \times (C\times HW)$, followed by a reshape.
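A minimal shape check of these two matrix products (toy tensors of my own choosing, with plain means standing in for the 1×1 convolutions of the actual block):

import torch

B, C, H, W = 1, 16, 8, 8
x = torch.randn(B, C, H, W)

# channel branch: (C x HW) @ (HW x 1) -> (C x 1), reshaped to C x 1 x 1
v_ch = x.reshape(B, C, H * W)                 # (B, C, HW)
q_ch = x.mean(dim=1).reshape(B, H * W, 1)     # (B, HW, 1)
ch_map = torch.matmul(v_ch, q_ch).reshape(B, C, 1, 1)
print(ch_map.shape)                           # torch.Size([1, 16, 1, 1])

# spatial branch: (1 x C) @ (C x HW) -> (1 x HW), reshaped to 1 x H x W
q_sp = x.mean(dim=(2, 3)).reshape(B, 1, C)    # (B, 1, C)
v_sp = x.reshape(B, C, H * W)                 # (B, C, HW)
sp_map = torch.matmul(q_sp, v_sp).reshape(B, 1, H, W)
print(sp_map.shape)                           # torch.Size([1, 1, 8, 8])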

Compared with CBAM (【Attention】《CBAM: Convolutional Block Attention Module》):

the channel-only branch exploits more spatial information (not just global pooling), and the spatial-only branch makes fuller use of channel information (not just a mean).

Let's look at the formulas.

(1) Channel-only attention

[figures]
$\sigma$ denotes a reshape operation, and $F_{SM}$ denotes Softmax.
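Putting the pieces together, the channel-only branch can be transcribed as follows (a hedged reconstruction from the notation above and the code below; see the paper for the exact equation):

$$A^{ch}(X) = F_{SG}\Big[ W_{z}\big( \sigma_1(W_v(X)) \times F_{SM}\big(\sigma_2(W_q(X))\big) \big) \Big], \qquad Z^{ch} = A^{ch}(X) \odot^{ch} X$$

where $W_v$, $W_q$, $W_z$ are $1\times1$ convolutions, $F_{SG}$ is Sigmoid, and $\odot^{ch}$ is channel-wise multiplication (the implementation also applies a LayerNorm before the Sigmoid).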

(2) Spatial-only attention

[figures]
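The spatial-only branch can be transcribed similarly (again a hedged reconstruction based on the code below):

$$A^{sp}(X) = F_{SG}\Big[ \sigma_3\big( F_{SM}\big(\sigma_1(F_{GP}(W_q(X)))\big) \times \sigma_2(W_v(X)) \big) \Big], \qquad Z^{sp} = A^{sp}(X) \odot^{sp} X$$

where $F_{GP}$ is global average pooling and $\odot^{sp}$ is spatial-wise multiplication.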
Channel and spatial attention composed in parallel:

[figure]

Channel and spatial attention composed sequentially:

[figure]
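In summary form (my own notation for the two layouts):

$$\mathrm{PSA}_p(X) = Z^{ch} + Z^{sp}, \qquad \mathrm{PSA}_s(X) = Z^{sp}\big(Z^{ch}(X)\big)$$

i.e. the parallel layout feeds the same input $X$ to both branches and sums their outputs, while the sequential layout lets the spatial branch operate on the channel-reweighted features.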

Relation of PSA to other Self-Attentions

  • Internal Resolution vs Complexity: a higher-resolution squeeze-and-excitation (contrast with the SE sketch below)
  • Output Distribution / Non-linearity: both the PSA channel-only and spatial-only branches use a Softmax-Sigmoid composition
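For contrast, a minimal squeeze-and-excitation block (my own sketch of SE for comparison, not code from this paper) squeezes the feature map down to $C\times1\times1$ by global pooling and reduces channels by a factor $r$, whereas PSA keeps a $C/2 \times HW$ internal tensor:

import torch
from torch import nn

class SEBlock(nn.Module):
    # Minimal squeeze-and-excitation block, shown only to contrast its internal resolution with PSA's.
    def __init__(self, channel, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                # squeeze: (B, C, H, W) -> (B, C, 1, 1)
        self.fc = nn.Sequential(
            nn.Linear(channel, channel // reduction),      # channel reduction by r
            nn.ReLU(inplace=True),
            nn.Linear(channel // reduction, channel),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.size()
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                       # excitation: channel-wise re-weighting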

Let's look at the code.

import torch
from torch import nn

# Parallel PSA: the channel-only and spatial-only branches both read the input x, and their outputs are summed.
class ParallelPolarizedSelfAttention(nn.Module):
    def __init__(self, channel=512):
        super().__init__()
        self.ch_wv=nn.Conv2d(channel,channel//2,kernel_size=(1,1))
        self.ch_wq=nn.Conv2d(channel,1,kernel_size=(1,1))
        self.softmax_channel=nn.Softmax(1)
        self.softmax_spatial=nn.Softmax(-1)
        self.ch_wz=nn.Conv2d(channel//2,channel,kernel_size=(1,1))
        self.ln=nn.LayerNorm(channel)
        self.sigmoid=nn.Sigmoid()
        self.sp_wv=nn.Conv2d(channel,channel//2,kernel_size=(1,1))
        self.sp_wq=nn.Conv2d(channel,channel//2,kernel_size=(1,1))
        self.agp=nn.AdaptiveAvgPool2d((1,1))

    def forward(self, x):
        b, c, h, w = x.size()

        #Channel-only Self-Attention
        channel_wv=self.ch_wv(x) #bs,c//2,h,w
        channel_wq=self.ch_wq(x) #bs,1,h,w
        channel_wv=channel_wv.reshape(b,c//2,-1) #bs,c//2,h*w
        channel_wq=channel_wq.reshape(b,-1,1) #bs,h*w,1
        channel_wq=self.softmax_channel(channel_wq)
        channel_wz=torch.matmul(channel_wv,channel_wq).unsqueeze(-1) #bs,c//2,1,1
        channel_weight=self.sigmoid(self.ln(self.ch_wz(channel_wz).reshape(b,c,1).permute(0,2,1))).permute(0,2,1).reshape(b,c,1,1) #bs,c,1,1
        channel_out=channel_weight*x

        #Spatial-only Self-Attention
        spatial_wv=self.sp_wv(x) #bs,c//2,h,w
        spatial_wq=self.sp_wq(x) #bs,c//2,h,w
        spatial_wq=self.agp(spatial_wq) #bs,c//2,1,1
        spatial_wv=spatial_wv.reshape(b,c//2,-1) #bs,c//2,h*w
        spatial_wq=spatial_wq.permute(0,2,3,1).reshape(b,1,c//2) #bs,1,c//2
        spatial_wq=self.softmax_spatial(spatial_wq) #bs,1,c//2
        spatial_wz=torch.matmul(spatial_wq,spatial_wv) #bs,1,h*w
        spatial_weight=self.sigmoid(spatial_wz.reshape(b,1,h,w)) #bs,1,h,w
        spatial_out=spatial_weight*x
        out=spatial_out+channel_out
        return out

# Sequential PSA: the spatial-only branch operates on the output of the channel-only branch.
class SequentialPolarizedSelfAttention(nn.Module):
    def __init__(self, channel=512):
        super().__init__()
        self.ch_wv=nn.Conv2d(channel,channel//2,kernel_size=(1,1))
        self.ch_wq=nn.Conv2d(channel,1,kernel_size=(1,1))
        self.softmax_channel=nn.Softmax(1)
        self.softmax_spatial=nn.Softmax(-1)
        self.ch_wz=nn.Conv2d(channel//2,channel,kernel_size=(1,1))
        self.ln=nn.LayerNorm(channel)
        self.sigmoid=nn.Sigmoid()
        self.sp_wv=nn.Conv2d(channel,channel//2,kernel_size=(1,1))
        self.sp_wq=nn.Conv2d(channel,channel//2,kernel_size=(1,1))
        self.agp=nn.AdaptiveAvgPool2d((1,1))

    def forward(self, x):
        b, c, h, w = x.size()

        #Channel-only Self-Attention
        channel_wv=self.ch_wv(x) #bs,c//2,h,w
        channel_wq=self.ch_wq(x) #bs,1,h,w
        channel_wv=channel_wv.reshape(b,c//2,-1) # bs,c//2,h*w
        channel_wq=channel_wq.reshape(b,-1,1) # bs,h*w,1
        channel_wq=self.softmax_channel(channel_wq) # bs,h*w,1
        channel_wz=torch.matmul(channel_wv,channel_wq).unsqueeze(-1) #bs,c//2,1,1
        channel_weight=self.sigmoid(self.ln(self.ch_wz(channel_wz).reshape(b,c,1).permute(0,2,1))).permute(0,2,1).reshape(b,c,1,1) #bs,c,1,1
        channel_out=channel_weight*x

        #Spatial-only Self-Attention
        spatial_wv=self.sp_wv(channel_out) #bs,c//2,h,w
        spatial_wq=self.sp_wq(channel_out) #bs,c//2,h,w
        spatial_wq=self.agp(spatial_wq) #bs,c//2,1,1
        spatial_wv=spatial_wv.reshape(b,c//2,-1) #bs,c//2,h*w
        spatial_wq=spatial_wq.permute(0,2,3,1).reshape(b,1,c//2) #bs,1,c//2
        spatial_wq=self.softmax_spatial(spatial_wq)
        spatial_wz=torch.matmul(spatial_wq,spatial_wv) #bs,1,h*w
        spatial_weight=self.sigmoid(spatial_wz.reshape(b,1,h,w)) #bs,1,h,w
        spatial_out=spatial_weight*channel_out
        return spatial_out

if __name__ == '__main__':
    input=torch.randn(1,512,7,7)
    psa = SequentialPolarizedSelfAttention(channel=512)
    output=psa(input)
    print(output.shape)

The code is fairly straightforward.
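For completeness, the parallel variant can be sanity-checked the same way (same assumed 1×512×7×7 input); PSA preserves the input shape in both cases:

input = torch.randn(1, 512, 7, 7)
psa_p = ParallelPolarizedSelfAttention(channel=512)
print(psa_p(input).shape)  # torch.Size([1, 512, 7, 7])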

5 Experiments

We add PSAs after the first 3 × 3 convolution in every residual block.
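A hedged sketch of what this placement could look like in a ResNet-style basic block, reusing the SequentialPolarizedSelfAttention class defined above (the block structure here is my assumption for illustration, not the authors' exact code):

from torch import nn

class BasicBlockWithPSA(nn.Module):
    # Residual block with PSA inserted after the first 3x3 convolution (illustrative only).
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.psa = SequentialPolarizedSelfAttention(channel=channels)   # PSA after the first 3x3 conv
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.psa(out)                  # re-weight the intermediate features with PSA
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)            # residual connection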

[figure]

5.1 Datasets and Metrics

  • MS-COCO 2017 human pose estimation (AP)

  • Pascal VOC 2012 semantic segmentation (mIoU)

5.2 PSA vs. Baselines

(1) Top-Down 2D Human Pose Estimation

[figure]

The output heatmap size is 96 × 72 (1/4 of the input resolution).

(2) Semantic Segmentation

[figure]

The improvement is not as pronounced as on keypoint estimation.

5.3 Semantic Segmentation

[figure]

Parallel (p) and sequential (s) compositions of the channel and spatial attention show no major difference, only marginal metric differences.

[figure]

All configurations show gains.

5.4 Ablation Study

[figure]

Using both channel and spatial attention outperforms using either alone; sequential and parallel compositions perform similarly.

[figure]

6 Conclusion (own)

For more paper notes, see 【Paper Reading】.

  • Channel-only attention blocks put the same weights on different spatial locations, such that the classification task still benefits since its spatial information eventually collapses by pooling, and the anchor displacement regression in object detection benefits since the channel-only attention unanimously highlights all foreground pixels

  • How PSA performs in complex DCNN heads has not yet been evaluated by the authors.

  • The optics backstory is well told and the gains are solid, but the paper does not keep reinforcing this theme throughout; the argument feels diffuse rather than concentrated into one precise strike.
