[CCNet] CCNet: Criss-Cross Attention for Semantic Segmentation

bryant_meng

Last modified 2024-01-09 15:54:31

Category: CNN / Transformer. Tags: artificial intelligence, deep learning, CCNet, Criss-Cross

First published 2024-01-09 13:53:21

Article link: https://blog.csdn.net/bryant_meng/article/details/135192231

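This article introduces Criss-Cross Attention, a method that keeps global context information while reducing the computational complexity of non-local attention. Experiments show strong results on Cityscapes and ADE20K semantic segmentation and on COCO instance segmentation, improving on non-local networks in both efficiency and accuracy.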

ICCV-2019

1 Background and Motivation

Global context is crucial for segmentation. How can it be captured efficiently and with a light footprint?

Thus, is there an alternative solution to achieve such a target in a more efficient way?

The authors propose Criss-Cross Attention.

Compared with Non-local ([NL] "Non-local Neural Networks"),

the complexity drops from $O((H \times W) \times (H \times W))$ to $O((H \times W) \times (H + W - 1))$.
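To make the gap concrete, the short sketch below counts query-key pairs for both schemes. The 96 x 96 feature map is a hypothetical size picked for illustration, and `attention_pairs` is a helper name invented here, not something from the paper.

```python
def attention_pairs(h, w):
    """Return (non_local, criss_cross) counts of query-key pairs for an h x w map."""
    n = h * w
    non_local = n * n              # every position attends to every position
    criss_cross = n * (h + w - 1)  # every position attends to its own row and column
    return non_local, criss_cross


# Hypothetical 96 x 96 feature map, chosen only for illustration.
nl, cc = attention_pairs(96, 96)
print(nl, cc, round(nl / cc, 2))
```

For a square map the ratio is $HW / (H + W - 1)$, so the saving grows roughly linearly with the side length.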

2 Related Work

semantic segmentation
contextual information aggregation
Attention model

3 Advantages / Contributions

Proposes Criss-Cross attention to "capture contextual information from full-image dependencies in a more efficient and effective way"
Improves results on the semantic segmentation datasets Cityscapes and ADE20K and on the instance segmentation dataset COCO

4 Method

The overall pipeline is as follows:
[Figure: overall CCNet pipeline]

The Criss-Cross Attention Module is applied twice; the result is called the recurrent Criss-Cross attention (RCCA) module.

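Below is a comparison with the non-local block.
[Figure: comparison of the non-local block and criss-cross attention]
For example, in (b), attention is computed for the blue block, and the shades of green show how strongly each block correlates with the blue block. The first criss-cross attention pass produces the yellow block, and a second criss-cross attention pass produces the red block.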

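Why two passes? A single pass cannot capture context from the whole image, but two passes can, as the figure shows.

[Figure 4: information propagation over two criss-cross passes]

First pass: the criss-cross attention for the dark green block only gathers information from the light green blocks, not from the blue block. The light green blocks, however, do gather information from the blue block.
Second pass: the criss-cross attention for the dark green block is computed again. Because the light green blocks already carry the blue block's information from the first pass, the dark green block now receives it too.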
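The two-pass argument can be checked by counting which positions can contribute to a given one. This is a minimal sketch on a hypothetical 5 x 7 grid; `criss_cross` and `reachable` are illustrative helper names, and only connectivity is modelled, not the attention weights.

```python
def criss_cross(pos, h, w):
    """Positions sharing a row or a column with pos (pos itself included)."""
    r, c = pos
    return {(r, j) for j in range(w)} | {(i, c) for i in range(h)}


def reachable(pos, h, w, loops):
    """Positions whose information can reach pos after `loops` passes."""
    seen = {pos}
    for _ in range(loops):
        seen = set().union(*(criss_cross(p, h, w) for p in seen))
    return seen


h, w = 5, 7  # hypothetical small feature map
u = (1, 2)
print(len(reachable(u, h, w, 1)), len(reachable(u, h, w, 2)))
```

After one pass the position sees only its $H + W - 1$ criss-cross neighbours; after two passes it sees all $H \times W$ positions.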

A more detailed diagram of criss-cross attention:
[Figure 3: details of the criss-cross attention module]

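The formulas below follow Figure 3.

Input: $H \in \mathbb{R}^{C \times W \times H}$

Query and key: $\{Q, K\} \in \mathbb{R}^{C' \times W \times H}$, where $C'$ is $C / 8$

$Q_u \in \mathbb{R}^{C'}$, where $u$ indexes a spatial position in $H \times W$; $Q_u$ is the subset of feature map $Q$ at that position

$\Omega_{u} \in \mathbb{R}^{(H + W - 1) \times C'}$, the subset of feature map $K$ on the criss-cross path (same row and same column) of $u$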

The Affinity operation is defined as

$d_{i,u} = Q_u \Omega_{i, u}^T$

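For each spatial position $Q_u$ in $Q$, take the criss-cross path $\Omega_{u}$ in $K$ that lies in the same row and column; $i$ indexes the positions along that path. The affinities $d_{i,u} \in D$ form $D \in \mathbb{R}^{(H + W - 1) \times W \times H}$, and a softmax over the $H + W - 1$ dimension of $D$ gives the attention map $A \in \mathbb{R}^{(H + W - 1) \times W \times H}$.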
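The affinity and softmax steps for a single position $u$ can be sketched in plain Python. Feature maps are stored as dictionaries from position to vector, the sizes and random values are hypothetical, and the ordering of positions along the path (row first, then the rest of the column) is an assumption made here for illustration.

```python
import math
import random


def criss_cross_positions(u, h, w):
    """The H + W - 1 positions in the same row or column as u (u counted once)."""
    r, c = u
    row = [(r, j) for j in range(w)]
    col = [(i, c) for i in range(h) if i != r]
    return row + col


def attention_weights(Q, K, u, h, w):
    """Softmax over the affinities d_{i,u} = Q_u . Omega_{i,u} for one position u."""
    q_u = Q[u]
    omega_u = [K[p] for p in criss_cross_positions(u, h, w)]
    d = [sum(a * b for a, b in zip(q_u, k)) for k in omega_u]
    m = max(d)  # subtract the max for numerical stability
    e = [math.exp(x - m) for x in d]
    s = sum(e)
    return [x / s for x in e]


# Hypothetical sizes and random features, for illustration only.
h, w, c_prime = 4, 6, 8
rng = random.Random(0)
Q = {(i, j): [rng.gauss(0, 1) for _ in range(c_prime)] for i in range(h) for j in range(w)}
K = {(i, j): [rng.gauss(0, 1) for _ in range(c_prime)] for i in range(h) for j in range(w)}
A_u = attention_weights(Q, K, (2, 3), h, w)
print(len(A_u), sum(A_u))
```

Each position ends up with $H + W - 1$ non-negative weights that sum to one.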

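The weights $A$ computed from $Q$ and $K$ are then applied to $V$:

$H'_u = \sum_{i \in |\Phi_u|} A_{i,u} \Phi_{i,u} + H_u$

$\Phi_{i,u}$ is analogous to $\Omega_{i,u}$: the former is a subset of feature map $V$, the latter a subset of feature map $K$. $H$ is the input, $H'$ is the output, $i$ indexes positions on the criss-cross path, and $u$ indexes spatial positions in $H \times W$.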
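A worked instance of the aggregation formula for one position may help. The numbers are toy values (a 2 x 2 map has $H + W - 1 = 3$ positions on each path), and `aggregate` is a helper name introduced for this sketch. The learnable scale `gamma` that appears in the code implementation is left out here.

```python
def aggregate(A_u, Phi_u, H_u):
    """H'_u = sum_i A_{i,u} * Phi_{i,u} + H_u for a single position u."""
    out = list(H_u)
    for a, phi in zip(A_u, Phi_u):
        for ch, value in enumerate(phi):
            out[ch] += a * value
    return out


# Hypothetical toy values: 3 positions on the path, 2 channels.
A_u = [0.5, 0.25, 0.25]
Phi_u = [[2.0, 0.0], [4.0, 4.0], [0.0, 8.0]]
H_u = [1.0, 1.0]
print(aggregate(A_u, Phi_u, H_u))  # [3.0, 4.0]
```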

To connect every position $u$ with every other position, the authors compute criss-cross attention twice: applying criss-cross attention again to $H'$ yields $H''$, which gives the following two cases.

$u$ and $\theta$ in the same row or column
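[Equation: contribution of $H_\theta$ to $H''_u$ when $u$ and $\theta$ share a row or column]
$A$ denotes the attention weights at loop = 1, and $A'$ denotes the weights at loop = 2.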

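$u$ and $\theta$ not in the same row or column, e.g. in Figure 4 the dark green position is $u$ and the blue position is $\theta$
[Equations: the two terms giving the contribution of $H_\theta$ to $H''_u$, added together]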
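The two cases can be illustrated by composing two passes numerically. This is a simplified sketch, not the paper's equations: it assumes uniform attention weights $1 / (H + W - 1)$ in both loops and ignores the residual connection, and `on_cross` and `two_loop_weight` are hypothetical helper names.

```python
from fractions import Fraction


def on_cross(a, b):
    """True when positions a and b share a row or a column."""
    return a[0] == b[0] or a[1] == b[1]


def two_loop_weight(u, theta, h, w):
    """Weight of H_theta in H''_u for uniform attention, ignoring residual terms."""
    a = Fraction(1, h + w - 1)  # uniform stand-in for both A and A'
    positions = [(i, j) for i in range(h) for j in range(w)]
    # Sum over intermediate positions v: theta -> v in loop 1, v -> u in loop 2.
    return sum(a * a for v in positions if on_cross(u, v) and on_cross(v, theta))


h, w = 3, 4  # hypothetical small map
print(two_loop_weight((0, 0), (2, 3), h, w))  # 1/18, via (0, 3) and (2, 0)
print(two_loop_weight((0, 0), (0, 3), h, w))  # 1/9, via the four positions in row 0
```

When $u$ and $\theta$ share neither row nor column, exactly two intermediate positions carry the information, which matches the two-term structure of that case.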

Now let's look at the code.

import torch
import torch.nn as nn


def INF(B, H, W, device=None):
    # [B*W, H, H] mask with -inf on the diagonal, so that a position is not
    # counted twice (it lies on both its own column and its own row).
    diag = torch.full((H,), float("inf"), device=device)
    return -torch.diag(diag, 0).unsqueeze(0).repeat(B * W, 1, 1)


class CrissCrossAttention(nn.Module):
    def __init__(self, in_channels):
        super(CrissCrossAttention, self).__init__()
        self.in_channels = in_channels
        self.channels = in_channels // 8
        self.ConvQuery = nn.Conv2d(self.in_channels, self.channels, kernel_size=1)
        self.ConvKey = nn.Conv2d(self.in_channels, self.channels, kernel_size=1)
        self.ConvValue = nn.Conv2d(self.in_channels, self.in_channels, kernel_size=1)

        self.SoftMax = nn.Softmax(dim=3)
        self.INF = INF
        # Learnable scale, initialised to 0 so the module starts as an identity mapping
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        b, _, h, w = x.size()

        # [b, c', h, w]
        query = self.ConvQuery(x)
        # [b, w, c', h] -> [b*w, c', h] -> [b*w, h, c']
        query_H = query.permute(0, 3, 1, 2).contiguous().view(b*w, -1, h).permute(0, 2, 1)
        # [b, h, c', w] -> [b*h, c', w] -> [b*h, w, c']
        query_W = query.permute(0, 2, 1, 3).contiguous().view(b*h, -1, w).permute(0, 2, 1)

        # [b, c', h, w]
        key = self.ConvKey(x)
        # [b, w, c', h] -> [b*w, c', h]
        key_H = key.permute(0, 3, 1, 2).contiguous().view(b*w, -1, h)
        # [b, h, c', w] -> [b*h, c', w]
        key_W = key.permute(0, 2, 1, 3).contiguous().view(b*h, -1, w)

        # [b, c, h, w]
        value = self.ConvValue(x)
        # [b, w, c, h] -> [b*w, c, h]
        value_H = value.permute(0, 3, 1, 2).contiguous().view(b*w, -1, h)
        # [b, h, c, w] -> [b*h, c, w]
        value_W = value.permute(0, 2, 1, 3).contiguous().view(b*h, -1, w)

        # column affinities: [b*w, h, c'] * [b*w, c', h] -> [b*w, h, h] -> [b, h, w, h]
        energy_H = (torch.bmm(query_H, key_H) + self.INF(b, h, w, x.device)).view(b, w, h, h).permute(0, 2, 1, 3)
        # row affinities: [b*h, w, c'] * [b*h, c', w] -> [b*h, w, w] -> [b, h, w, w]
        energy_W = torch.bmm(query_W, key_W).view(b, h, w, w)

        # concatenate along axis 3 -> [b, h, w, h+w], then softmax over the criss-cross path
        concate = self.SoftMax(torch.cat([energy_H, energy_W], 3))
        # [b, h, w, h] -> [b, w, h, h] -> [b*w, h, h]
        attention_H = concate[:, :, :, 0:h].permute(0, 2, 1, 3).contiguous().view(b*w, h, h)
        # [b, h, w, w] -> [b*h, w, w]
        attention_W = concate[:, :, :, h:h+w].contiguous().view(b*h, w, w)

        # [b*w, c, h] * [b*w, h, h] -> [b*w, c, h] -> [b, w, c, h] -> [b, c, h, w]
        out_H = torch.bmm(value_H, attention_H.permute(0, 2, 1)).view(b, w, -1, h).permute(0, 2, 3, 1)
        # [b*h, c, w] * [b*h, w, w] -> [b*h, c, w] -> [b, h, c, w] -> [b, c, h, w]
        out_W = torch.bmm(value_W, attention_W.permute(0, 2, 1)).view(b, h, -1, w).permute(0, 2, 1, 3)

        return self.gamma*(out_H + out_W) + x


if __name__ == "__main__":
    # Run on the GPU when one is available, otherwise fall back to the CPU
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = CrissCrossAttention(512).to(device)
    x = torch.randn(2, 512, 28, 28, device=device)
    out = model(x)
    print(out.shape)

The Q, K, A, and V computations are all fairly straightforward.

Reference

5 Experiments

5.1 Datasets and Metrics

Cityscapes
ADE20K
COCO

Mean IoU (mIoU, mean of class-wise intersection over union) for Cityscapes and ADE20K, and the standard COCO metric Average Precision (AP) for COCO
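As a reminder of how mIoU is computed, here is a minimal sketch from a confusion matrix. The matrix values are made up, `mean_iou` is an illustrative helper, and skipping classes that never occur is one common convention assumed here, not necessarily the one used by the official benchmark scripts.

```python
def mean_iou(confusion):
    """Mean of class-wise IoU from a square confusion matrix.

    confusion[t][p] counts pixels of true class t predicted as class p.
    Classes that never occur (no true and no predicted pixels) are skipped.
    """
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(confusion[t][c] for t in range(n)) - tp
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious)


# Hypothetical 2-class confusion matrix.
print(mean_iou([[3, 1], [1, 3]]))  # 0.6
```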