[Point Cloud Processing: Intensive Paper Reading, Extended Edition 3] Non-local Neural Networks

LingbinBu

Last modified: 2022-05-27 09:34:10

Column: Point Cloud Processing: Intensive Paper Reading, Extended Edition. Tags: machine learning, deep learning, computer vision

First published: 2022-05-19 10:15:26

Article link: https://blog.csdn.net/yuanmiyu6522/article/details/124838023

Non-local Neural Networks

Abstract
1. Introduction
2. Related Work
3. Non-local Neural Networks
Experiments
Vocabulary

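Abstract

Problem: both convolutional and recurrent operations process one local neighborhood at a time
Method: a non-local operation is proposed to capture long-range dependencies
Technical detail: the non-local operation computes the response at a position as a weighted sum of the features at all positions
Advantage: plug and play
Applications: video classification, static image recognition
Code: Caffe2 framework, PyTorch framework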

1. Introduction

A non-local operation computes the response at a position as a weighted sum of the features at all positions in the input feature maps (Figure 1).

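Basic idea: the response at a position $x_i$ is computed as a weighted sum of the features at all positions.

Advantages:

Non-local operations capture long-range dependencies directly by computing the interaction between any two positions
Good results can be achieved with only a few layers
Good compatibility: the input size may vary, and the operation combines easily with other operations
Computationally economical

Basic unit: non-local blocks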

2. Related Work

Non-local image processing

Non-local ideas were previously used in image denoising, texture synthesis, super-resolution, and inpainting algorithms.

Graphical models

Conditional random fields (CRF)
Graph neural networks

Feedforward modeling for sequences

For modeling sequences in speech and language

Self-attention

Self-attention -> sequences
Non-local operation -> images and videos

Interaction networks

Video classification architectures

CNNs + RNNs
3D convolutions
optical flow
trajectories

3. Non-local Neural Networks

3.1. Formulation

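A non-local operation in a deep neural network can be defined as:
$\mathbf{y}_{i}=\frac{1}{\mathcal{C}(\mathbf{x})} \sum_{\forall j} f\left(\mathbf{x}_{i}, \mathbf{x}_{j}\right) g\left(\mathbf{x}_{j}\right) .\tag{1}$
Here $i$ is the index of the output position whose response is to be computed, and $j$ is the index that enumerates all possible positions. $\mathbf{x}$ is the input signal, and $\mathbf{y}$ is the output signal, which has the same size as $\mathbf{x}$. The pairwise function $f$ computes a scalar similarity between positions $i$ and $j$. The unary function $g$ computes a representation of the input signal at position $j$. The response is normalized by the factor $\mathcal{C}(\mathbf{x})$.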
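Equation (1) can be written almost literally as a loop. The following is a minimal NumPy sketch, not the authors' implementation. It assumes the normalizer $\mathcal{C}(\mathbf{x})=\sum_{\forall j} f(\mathbf{x}_i,\mathbf{x}_j)$, and the toy choices of `f` (exponentiated dot product) and `g` (identity) are illustrative assumptions only.

```python
import numpy as np

def non_local(x, f, g):
    """Generic non-local operation following Eq. (1).

    x: array of shape (N, C), one feature vector per position.
    f: pairwise function, f(x_i, x_j) -> scalar.
    g: unary function, g(x_j) -> vector.
    Assumption: the normalizer C(x) is sum_j f(x_i, x_j).
    """
    n = x.shape[0]
    gx = np.stack([g(x[j]) for j in range(n)])            # (N, C')
    y = np.zeros_like(gx)
    for i in range(n):
        w = np.array([f(x[i], x[j]) for j in range(n)])   # weights over ALL positions j
        y[i] = (w / w.sum()) @ gx                         # normalized weighted sum
    return y

# Hypothetical toy choices, for illustration only.
f = lambda a, b: np.exp(a @ b)
g = lambda v: v

rng = np.random.default_rng(0)
y5 = non_local(rng.normal(size=(5, 4)), f, g)   # 5 positions
y9 = non_local(rng.normal(size=(9, 4)), f, g)   # 9 positions, same f and g
print(y5.shape, y9.shape)
```

The same `f` and `g` are applied to inputs with 5 and 9 positions, and the output has one response per input position in both cases.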

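A non-local operation differs from a fully connected (fc) layer. Equation (1) computes responses based on the relationships between different positions, whereas an fc layer uses learned weights. Equation (1) accepts inputs of variable size and produces an output of the corresponding size. An fc layer requires fixed-size input and output and loses the positional correspondence. A non-local operation can be inserted anywhere in a network, including the early layers, whereas fc layers are usually used only at the end.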

3.2. Instantiations

Consider $g$ in the form of a linear embedding:
$g\left(\mathbf{x}_{j}\right)=W_{g} \mathbf{x}_{j}$
where $W_{g}$ is a weight matrix to be learned.

Next come the choices for the function $f$:

Gaussian.

Following non-local means and bilateral filtering, $f$ can be chosen as a Gaussian function:
$f\left(\mathbf{x}_{i}, \mathbf{x}_{j}\right)=e^{\mathbf{x}_{i}^{T} \mathbf{x}_{j}} . \tag{2}$
where $\mathbf{x}_{i}^{T} \mathbf{x}_{j}$ is the dot-product similarity. The normalization factor is $\mathcal{C}(\mathbf{x})=\sum_{\forall j} f\left(\mathbf{x}_{i}, \mathbf{x}_{j}\right)$.
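As a rough illustration of the Gaussian form, the sketch below computes all pairwise weights at once. It takes $g$ to be the identity purely for brevity (an assumption, the text uses a learned $W_g$). The row-wise maximum is subtracted before exponentiation for numerical stability, which cancels out after normalization.

```python
import numpy as np

def gaussian_non_local(x):
    """Gaussian non-local op with g = identity (an assumption for brevity).

    x: (N, C). Returns y of shape (N, C) and the normalized weights of shape (N, N).
    """
    logits = x @ x.T                                      # x_i^T x_j for all pairs
    logits = logits - logits.max(axis=1, keepdims=True)   # stabilize exp; cancels on normalization
    f = np.exp(logits)
    weights = f / f.sum(axis=1, keepdims=True)            # divide by C(x) = sum_j f(x_i, x_j)
    return weights @ x, weights

x = np.random.default_rng(0).normal(size=(6, 3))
y, weights = gaussian_non_local(x)
print(weights.sum(axis=1))   # each row sums to 1 up to floating-point error
```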

Embedded Gaussian.

An extension of the Gaussian function computes similarity in an embedding space:
$f\left(\mathbf{x}_{i}, \mathbf{x}_{j}\right)=e^{\theta\left(\mathbf{x}_{i}\right)^{T} \phi\left(\mathbf{x}_{j}\right)} . \tag{3}$
where $\theta\left(\mathbf{x}_{i}\right)=W_{\theta} \mathbf{x}_{i}$ and $\phi\left(\mathbf{x}_{j}\right)=W_{\phi} \mathbf{x}_{j}$ are two embeddings. As before, set $\mathcal{C}(\mathbf{x})=\sum_{\forall j} f\left(\mathbf{x}_{i}, \mathbf{x}_{j}\right)$.

Note that, with the embedded Gaussian version, the recently proposed self-attention is a special case of the non-local operation. For a given $i$, $\frac{1}{\mathcal{C}(\mathbf{x})} f\left(\mathbf{x}_{i}, \mathbf{x}_{j}\right)$ becomes the softmax computation along dimension $j$. Therefore $\mathbf{y}=\operatorname{softmax}\left(\mathbf{x}^{T} W_{\theta}^{T} W_{\phi} \mathbf{x}\right) g(\mathbf{x})$, which is exactly the self-attention form.
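The equivalence between the embedded Gaussian form and softmax self-attention can be checked numerically. The sketch below is illustrative only: the sizes and random weights are assumptions, and positions are stored as rows, so $\theta(\mathbf{x})$ is computed as `x @ W_theta.T`.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
N, C, Ce = 5, 8, 4                        # positions, channels, embedding channels (toy sizes)
x = rng.normal(size=(N, C))               # rows are x_i
W_theta = rng.normal(size=(Ce, C)) * 0.1
W_phi = rng.normal(size=(Ce, C)) * 0.1
W_g = rng.normal(size=(Ce, C)) * 0.1

theta, phi, g = x @ W_theta.T, x @ W_phi.T, x @ W_g.T   # each (N, Ce)

# Matrix form: softmax over j of theta(x_i)^T phi(x_j), then weight g(x_j).
y_matrix = softmax(theta @ phi.T, axis=1) @ g

# Explicit form of Eq. (1) with the embedded Gaussian f.
y_loop = np.zeros((N, Ce))
for i in range(N):
    f = np.exp(phi @ theta[i])            # f(x_i, x_j) for all j
    y_loop[i] = (f / f.sum()) @ g         # C(x) = sum_j f(x_i, x_j)

print(np.allclose(y_matrix, y_loop))
```

The two results should agree up to floating-point error, since dividing $e^{\theta^T\phi}$ by its sum over $j$ is the definition of softmax.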

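Next, we show that this attentional behavior (due to softmax) is not essential in the applications studied: the two alternative forms of $f$ below do not use softmax.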

Dot product.

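The function $f$ can be defined as a dot-product similarity:
$f\left(\mathbf{x}_{i}, \mathbf{x}_{j}\right)=\theta\left(\mathbf{x}_{i}\right)^{T} \phi\left(\mathbf{x}_{j}\right) . \tag{4}$
The embedded form is used here. The normalization factor is set to $\mathcal{C}(\mathbf{x})=N$, where $N$ is the number of positions in $\mathbf{x}$. The main difference from the embedded Gaussian version is the absence of softmax, which plays the role of an activation function there.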
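A minimal sketch of the dot-product variant follows, with toy sizes and random weights as assumptions. Without softmax the pairwise weights are no longer guaranteed to be positive or to sum to one; they are only scaled by $1/N$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, Ce = 5, 8, 4                        # toy sizes (assumed)
x = rng.normal(size=(N, C))
W_theta, W_phi, W_g = (rng.normal(size=(Ce, C)) * 0.1 for _ in range(3))

theta, phi, g = x @ W_theta.T, x @ W_phi.T, x @ W_g.T   # each (N, Ce)

f = theta @ phi.T          # f(x_i, x_j) = theta(x_i)^T phi(x_j); entries may be negative
y = (f / N) @ g            # C(x) = N, no softmax
print(y.shape)
```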

Concatenation.

The function $f$ can also take a concatenation form:
$f\left(\mathbf{x}_{i}, \mathbf{x}_{j}\right)=\operatorname{ReLU}\left(\mathbf{w}_{f}^{T}\left[\theta\left(\mathbf{x}_{i}\right), \phi\left(\mathbf{x}_{j}\right)\right]\right) . \tag{5}$
where $[\cdot, \cdot]$ denotes concatenation, $\mathbf{w}_{f}$ is a weight vector that projects the concatenated vector to a scalar, and $\mathcal{C}(\mathbf{x})=N$.
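The concatenation form looks as if it needs a double loop over all pairs, but $\mathbf{w}_f^T[\theta_i, \phi_j]$ splits into a $\theta$ part plus a $\phi$ part, so broadcasting is enough. The sketch below uses assumed toy sizes and random weights and checks one pair against the literal definition.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, Ce = 5, 8, 4                        # toy sizes (assumed)
x = rng.normal(size=(N, C))
W_theta, W_phi, W_g = (rng.normal(size=(Ce, C)) * 0.1 for _ in range(3))
w_f = rng.normal(size=2 * Ce)             # projects the concatenated pair to a scalar

theta, phi, g = x @ W_theta.T, x @ W_phi.T, x @ W_g.T

a = theta @ w_f[:Ce]                      # (N,) contribution of theta(x_i)
b = phi @ w_f[Ce:]                        # (N,) contribution of phi(x_j)
f = np.maximum(a[:, None] + b[None, :], 0.0)   # ReLU over all pairs, shape (N, N)
y = (f / N) @ g                           # C(x) = N

# Literal definition for a single pair (i, j), for comparison.
i, j = 1, 3
f_ij = max(w_f @ np.concatenate([theta[i], phi[j]]), 0.0)
print(np.isclose(f[i, j], f_ij))
```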

3.3. Non-local Block

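A non-local block is defined as:
$\mathbf{z}_{i}=W_{z} \mathbf{y}_{i}+\mathbf{x}_{i} \tag{6}$
where $\mathbf{y}_{i}$ is given by Equation (1) and "$+\mathbf{x}_{i}$" denotes a residual connection. The residual connection allows a new non-local block to be inserted into any pre-trained model without breaking its initial behavior (for example, when $W_{z}$ is initialized to zero). Figure 2 shows an example block (the embedded Gaussian version).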
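The role of the residual connection in Eq. (6) can be seen in a small sketch: if $W_z$ starts at zero, the block is an identity mapping, so inserting it leaves the surrounding network's output unchanged. The function `non_local_block`, the sizes, and the random weights are illustrative assumptions, with the embedded Gaussian form used for $f$.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def non_local_block(x, W_theta, W_phi, W_g, W_z):
    """Hypothetical helper: x is (N, C), rows are positions."""
    theta, phi, g = x @ W_theta.T, x @ W_phi.T, x @ W_g.T
    y = softmax(theta @ phi.T, axis=1) @ g     # Eq. (1), embedded Gaussian f
    return y @ W_z.T + x                       # Eq. (6): z_i = W_z y_i + x_i

rng = np.random.default_rng(0)
N, C, Ce = 6, 8, 4                             # toy sizes (assumed)
x = rng.normal(size=(N, C))
W_theta, W_phi, W_g = (rng.normal(size=(Ce, C)) * 0.1 for _ in range(3))

W_z = np.zeros((C, Ce))                        # zero initialization
z = non_local_block(x, W_theta, W_phi, W_g, W_z)
print(np.array_equal(z, x))                    # block acts as identity at initialization
```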

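The pairwise computation of a non-local block is lightweight when the block is applied to high-level, subsampled feature maps. Implemented as matrix multiplication, the pairwise computation costs about as much as a typical convolutional layer in a standard network.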
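A back-of-the-envelope count makes the cost claim concrete. The sizes below ($T=4$, $H=W=14$, 512 bottleneck channels) are recalled from the paper's Figure 2 rather than stated in this post, and the 3x3 convolution used as a reference point is a hypothetical choice, so treat the numbers as a rough sketch.

```python
# Assumed sizes, recalled from the paper's Figure 2 (not stated in this post).
T, H, W = 4, 14, 14
Ce = 512                            # bottleneck channels

N = T * H * W                       # number of spacetime positions
affinity_macs = N * N * Ce          # theta^T phi over all pairs
aggregate_macs = N * N * Ce         # pairwise weights times g
pairwise_macs = affinity_macs + aggregate_macs

# Hypothetical reference point: one 3x3 spatial convolution, Ce -> Ce channels,
# evaluated at every position.
conv_macs = N * (3 * 3) * Ce * Ce

print(N, pairwise_macs, conv_macs)
```

Under these assumptions the two matrix multiplications need roughly a third of the multiply-accumulates of the reference convolution, which is the same order of magnitude.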

3.3.1. Implementation of Non-local Blocks

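Inside the block, the learnable weights $W_{g}$, $W_{\theta}$, and $W_{\phi}$ reduce the number of channels of the input $\mathbf{x}$ to half, which forms a bottleneck structure. $W_{z}$ in Equation (6) then maps $\mathbf{y}_{i}$ back to the same dimension as the input.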
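The shape flow through the bottleneck can be traced on a toy spacetime feature map. In this sketch each embedding is a per-position linear map (equivalent to a 1x1x1 convolution); the sizes and random weights are assumptions, and the embedded Gaussian form is used for $f$.

```python
import numpy as np

T, H, W, C = 2, 4, 4, 16          # toy spacetime feature map (assumed sizes)
Ce = C // 2                       # bottleneck: half the input channels

rng = np.random.default_rng(0)
x = rng.normal(size=(T, H, W, C))
W_theta, W_phi, W_g = (rng.normal(size=(Ce, C)) * 0.1 for _ in range(3))
W_z = rng.normal(size=(C, Ce)) * 0.1

flat = x.reshape(-1, C)                                          # (THW, C)
theta, phi, g = flat @ W_theta.T, flat @ W_phi.T, flat @ W_g.T   # each (THW, C/2)

logits = theta @ phi.T                                           # (THW, THW)
logits -= logits.max(axis=1, keepdims=True)                      # numerical stability
attn = np.exp(logits)
attn /= attn.sum(axis=1, keepdims=True)                          # softmax over j

y = attn @ g                                                     # (THW, C/2)
z = (y @ W_z.T + flat).reshape(T, H, W, C)                       # W_z restores C channels; add residual
print(theta.shape, attn.shape, y.shape, z.shape)
```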

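Subsampling can also be used to reduce computation, modifying Equation (1) to:
$\mathbf{y}_{i}=\frac{1}{\mathcal{C}(\hat{\mathbf{x}})} \sum_{\forall j} f\left(\mathbf{x}_{i}, \hat{\mathbf{x}}_{j}\right) g\left(\hat{\mathbf{x}}_{j}\right) \tag{7}$
where $\hat{\mathbf{x}}$ is a subsampled version of $\mathbf{x}$. Subsampling is usually performed in the spatial domain and does not alter the non-local behavior; it only makes the computation sparser. In practice it amounts to adding a max pooling layer after $\phi$ and $g$.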
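The subsampling trick can be sketched as follows. A 2x2 spatial max pooling is applied to the outputs of $\phi$ and $g$ only, so every query position $i$ is kept while the set of positions $j$ shrinks to a quarter. The helper `max_pool_2x2`, the pooling size, and the toy shapes are assumptions for illustration.

```python
import numpy as np

def max_pool_2x2(a):
    """2x2 spatial max pooling on an array of shape (T, H, W, C); H and W must be even."""
    t, h, w, c = a.shape
    return a.reshape(t, h // 2, 2, w // 2, 2, c).max(axis=(2, 4))

T, H, W, C = 2, 4, 4, 16                               # toy sizes (assumed)
Ce = C // 2
rng = np.random.default_rng(0)
x = rng.normal(size=(T, H, W, C))
W_theta, W_phi, W_g = (rng.normal(size=(Ce, C)) * 0.1 for _ in range(3))

theta = (x @ W_theta.T).reshape(-1, Ce)                # (THW, Ce): every query position kept
phi = max_pool_2x2(x @ W_phi.T).reshape(-1, Ce)        # (THW/4, Ce)
g = max_pool_2x2(x @ W_g.T).reshape(-1, Ce)            # (THW/4, Ce)

logits = theta @ phi.T                                 # (THW, THW/4): a quarter of the pairs
logits -= logits.max(axis=1, keepdims=True)
weights = np.exp(logits)
weights /= weights.sum(axis=1, keepdims=True)          # normalize over the pooled positions

y = weights @ g                                        # (THW, Ce): output size unchanged
print(weights.shape, y.shape)
```

The output still has one response per original position; only the number of pairwise terms drops.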