Bidirectional Attention Network

Article link: https://blog.csdn.net/qq_26697045/article/details/121585717


BANet

1. Sub-Pixel Convolutional
2. The BANet Network
- 2.1 Global context aggregation: D2S modules
- 2.2 Bidirectional attention model
3. Summary
4. References

The paper comes from Huawei's Noah's Ark Lab in Canada.

1. Sub-Pixel Convolutional

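Twitter proposed Sub-Pixel Convolutional in 2016 for image super-resolution; it is essentially an upsampling method. In 2018, the University of Saskatchewan in Canada built on Sub-Pixel Convolutional to propose the D2S structure for semantic binary segmentation. BANet calls that D2S the D2S op and, based on the D2S op, proposes D2S modules for depth prediction.
The D2S op is simply a decoder (upsampler) built with Sub-Pixel Convolutional, so there is no need to distinguish the two. The figure below is from the Sub-Pixel Convolutional paper.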

Figure 1. Sub-Pixel Convolutional

In PyTorch, Sub-Pixel Convolutional is implemented by torch.nn.PixelShuffle, which converts a tensor of shape $(c \times r^2, h, w)$ into one of shape $(c, h \times r, w \times r)$. In TensorFlow, it is implemented by tf.nn.depth_to_space.

import numpy as np
import torch
import torch.nn.functional as F

if __name__ == '__main__':
    # Batch of 8 samples, 27 = 3 * 3^2 channels, 32x32 spatial size (NCHW).
    x = torch.tensor(np.random.random(size=(8, 27, 32, 32)), dtype=torch.float32)
    D2S = torch.nn.PixelShuffle(upscale_factor=3)
    output1 = D2S(x)
    output2 = F.pixel_shuffle(x, upscale_factor=3)
    print(output1.shape)  # torch.Size([8, 3, 96, 96])
    print(output2.shape)  # torch.Size([8, 3, 96, 96])

import numpy as np
import tensorflow as tf

if __name__ == '__main__':
    # TensorFlow uses the channels-last (NHWC) layout by default.
    x = tf.convert_to_tensor(np.random.random(size=(8, 32, 32, 27)), dtype=tf.float32)
    output = tf.nn.depth_to_space(x, block_size=3)
    print(output.shape)  # (8, 96, 96, 3)

Sub-Pixel Convolutional is simple to implement. Below is an implementation in PyTorch:

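def subPixel_conv(data, upscale_factor=(3, 3)):
    # Accept either a single int or an (h, w) pair of upscale factors.
    if isinstance(upscale_factor, int):
        upscale_factor = (upscale_factor, upscale_factor)
    upscale_h = upscale_factor[0]
    upscale_w = upscale_factor[1]

    b, c, h, w = data.shape
    out_c = c // (upscale_h * upscale_w)
    # Split the channel axis into (out_c, upscale_h, upscale_w).
    data = data.reshape(b, out_c, upscale_h, upscale_w, h, w)
    # Reorder to (b, out_c, h, upscale_h, w, upscale_w).
    data = data.permute((0, 1, 4, 2, 5, 3))
    # Merge (h, upscale_h) and (w, upscale_w) into the output height and width.
    data = data.reshape(b, out_c, h * upscale_h, w * upscale_w)
    return data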

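As the code shows, Sub-Pixel Convolutional differs from interpolation-based upsampling: it reduces the number of channels and increases the width and height, so the total amount of data does not change. Figure 2 is an example.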

Figure 2. An example of Sub-Pixel Convolutional
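The rearrangement can also be checked by hand on a tiny input. The sketch below repeats the reshape and transpose steps with NumPy on a 4-channel 2x2 input with r = 2. The helper name `depth_to_space_nchw` and the input values are illustrative assumptions, and the channel ordering is assumed to follow the PixelShuffle convention used above.

```python
import numpy as np

def depth_to_space_nchw(x, r):
    """Rearrange (b, c*r*r, h, w) into (b, c, h*r, w*r), PixelShuffle-style."""
    b, c, h, w = x.shape
    out_c = c // (r * r)
    x = x.reshape(b, out_c, r, r, h, w)       # split channels into (out_c, r, r)
    x = x.transpose(0, 1, 4, 2, 5, 3)         # -> (b, out_c, h, r, w, r)
    return x.reshape(b, out_c, h * r, w * r)  # merge (h, r) and (w, r)

x = np.arange(16).reshape(1, 4, 2, 2)  # 4 channels, each 2x2
y = depth_to_space_nchw(x, r=2)
print(y.shape)           # (1, 1, 4, 4)
print(y[0, 0])
# [[ 0  4  1  5]
#  [ 8 12  9 13]
#  [ 2  6  3  7]
#  [10 14 11 15]]
print(x.size == y.size)  # True
```

Each 2x2 block of the output gathers the four channel values of one input pixel, and the element count (16) is the same before and after.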

Unless otherwise stated, D2S below refers to the D2S modules rather than the D2S op.

2. The BANet Network

Figure 3 shows the overall network structure of BANet.

Figure 3. BANet network structure

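The backbone has 5 stages whose downsampling strides relative to the input image are 2, 4, 8, 16 and 32, and the channel count D ranges from 256 to 2208. To compensate for the detail lost to downsampling in the encoder, BANet uses D2S in place of a decoder for upsampling (global context aggregation), and injects the single-channel global features produced by D2S into every stage of the backbone (bidirectional attention).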

2.1 Global context aggregation: D2S modules

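The structure of the D2S modules is shown on the right side of Figure 3. BANet first uses D2S to convert the features extracted by the backbone, of shape (D, L/stride), to full size (1, L), and then refines them with the bidirectional attention mechanism. The D2S modules address the problem that weakly textured objects tend to come out blurred in overly bright or overly dark scenes.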
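One possible reading of a D2S module is sketched below in PyTorch, following the 1x1 convolution plus D2S op mentioned in section 2.2. The assumption is that the 1x1 convolution outputs stride^2 channels, so that PixelShuffle with an upscale factor of stride yields one full-size channel. The class name `D2SModule`, the channel counts and the input sizes are hypothetical, and any normalization or activation layers the paper may use are left out.

```python
import torch
import torch.nn as nn

class D2SModule(nn.Module):
    """Hypothetical D2S module: 1x1 conv to stride^2 channels, then the D2S op."""

    def __init__(self, in_channels, stride):
        super().__init__()
        # One output channel per sub-pixel position of a stride x stride block.
        self.reduce = nn.Conv2d(in_channels, stride * stride, kernel_size=1)
        self.d2s = nn.PixelShuffle(upscale_factor=stride)

    def forward(self, x):
        return self.d2s(self.reduce(x))

# Stage feature with D = 256 channels at stride 4 of a 64x64 input.
feat = torch.randn(2, 256, 16, 16)
full = D2SModule(in_channels=256, stride=4)(feat)
print(full.shape)  # torch.Size([2, 1, 64, 64])
```

With this layout every stage, whatever its stride, ends up as a single-channel map at the input resolution, which is what lets the later attention layers stay small.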

2.2 Bidirectional attention model

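A 1x1 convolution followed by D2S first upsamples each backbone stage output (D, L/stride) to full size, and the bidirectional attention model then processes these full-size features. The formulas below express the network structure in Figure 3.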

The forward attention and backward attention of the bidirectional attention model are implemented as in formula 1:

[Formula 1]

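The superscript f denotes forward attention and b denotes backward attention. The subscript i refers to one of the 5 backbone stages. D denotes the D2S modules. A with a superscript and subscript denotes a 9x9 convolution, while A without them denotes the pixel-level attention weights for each stage.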

[Formulas 2 and 3]

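Formula 2 states that the outputs of the D2S modules in the bottom row of Figure 3 are stacked along the channel dimension to obtain F. Formula 3 states that A and F are multiplied pixel by pixel and passed through a Sigmoid function to obtain $\hat{D}$.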
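The last step described by formulas 2 and 3 can be sketched numerically. This is only an illustration under assumptions that the text above does not state: the attention weights are normalized with a softmax across the stage axis, and the weighted maps are summed into one channel before the Sigmoid. The function name `fuse` and the random inputs are hypothetical.

```python
import numpy as np

def fuse(attn_logits, feats):
    """attn_logits and feats both have shape (b, n_stages, h, w)."""
    # Assumption: a softmax over the stage axis gives the per-pixel weights A.
    shifted = attn_logits - attn_logits.max(axis=1, keepdims=True)
    weights = np.exp(shifted)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # Pixel-wise product of A and F, summed over the stages (assumption).
    mixed = (weights * feats).sum(axis=1, keepdims=True)
    # The Sigmoid maps the result into (0, 1).
    return 1.0 / (1.0 + np.exp(-mixed))

rng = np.random.default_rng(0)
F = rng.standard_normal((2, 5, 8, 8))       # five single-channel D2S outputs, stacked
logits = rng.standard_normal((2, 5, 8, 8))  # stand-in for the attention branch output
D_hat = fuse(logits, F)
print(D_hat.shape)  # (2, 1, 8, 8)
```

Because every operand is a single-channel map per stage, this step involves only element-wise arithmetic over five maps.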

3. Summary

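Unlike other attention-based methods, BANet places its attention mechanism outside the backbone and applies it to the single-channel, full-size features output by D2S. This works because depth estimation is not multi-class classification but the estimation of a single output plane, and it also reduces the parameter count and the amount of computation.
Although BANet has more connections (the lines in Figure 3) than state-of-the-art methods, most of its operations work on the single-channel features output by D2S, so its parameter count and computation are lower.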
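A rough parameter count illustrates the saving. The sketch below compares one 9x9 convolution applied to stacked single-channel maps with the same convolution applied directly to backbone features. Only the 9x9 kernel size and the channel range 256 to 2208 come from the text; the layer configurations themselves are hypothetical and are not the actual BANet layers.

```python
def conv2d_params(in_channels, out_channels, kernel_size, bias=True):
    """Number of learnable parameters in a standard 2D convolution."""
    weights = in_channels * out_channels * kernel_size * kernel_size
    return weights + (out_channels if bias else 0)

# 9x9 conv over five stacked single-channel full-size maps, producing one map.
on_d2s_maps = conv2d_params(5, 1, 9)
# The same kernel applied directly to backbone features with D channels.
on_features_small = conv2d_params(256, 1, 9)
on_features_large = conv2d_params(2208, 1, 9)

print(on_d2s_maps)        # 406
print(on_features_small)  # 20737
print(on_features_large)  # 178849
```

Under these assumed configurations, the layer that works on D2S outputs needs well under a tenth of the parameters of the same layer applied to even the narrowest backbone stage.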