Recurrent Models of Visual Attention (RAM): Paper Reading Notes

不知道叫啥好一点

Article link: https://blog.csdn.net/A_A666/article/details/108575990

(0) Terminology

retina: the retina of the eye (Chinese: 视网膜)

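(1) Abstract

The computation of a convolutional neural network scales linearly with the number of image pixels, which makes large images expensive to process. The model proposed in this paper performs an amount of computation that can be controlled independently of the input image size and, like a convolutional network, has a degree of translation invariance built in. The authors report that it outperforms a convolutional baseline on cluttered images.
The proposed network is capable of extracting information from an image or video by adaptively selecting a sequence of regions or locations and only processing the selected regions at high resolution.
The model is non-differentiable, however, so it has to be trained with reinforcement learning.
Its capabilities are evaluated on image classification and on a dynamic tracking task.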

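(2) Introduction

Humans selectively focus their attention on parts of the visual space and, over time, combine the information from different fixations. Borrowing this idea, the authors design their model: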

The model is a recurrent neural network (RNN) which processes inputs sequentially, attending to different locations within the images (or video frames) one at a time, and incrementally combines information from these fixations to build up a dynamic internal representation of the scene or environment. Instead of processing an entire image or even bounding box at once, at each step, the model selects the next location to attend to based on past information and the demands of the task. Both the number of parameters in our model and the amount of computation it performs can be controlled independently of the size of the input image, which is in contrast to convolutional networks whose computational demands scale linearly with the number of image pixels. We describe an end-to-end optimization procedure that allows the model to be trained directly with respect to a given task and to maximize a performance measure which may depend on the entire sequence of decisions made by the model. This procedure uses backpropagation to train the neural-network components and policy gradient to address the non-differentiabilities due to the control problem.

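(3) The Recurrent Attention Model (RAM)

[Figure: overview of the RAM architecture]

The hidden state $h_t$ determines $l_t$, the location where the Glimpse Sensor will be centered at the next step. The model keeps integrating information from different locations over time (retina-like means multi-scale information around the same center) and, at the same time, predicts the action/classification.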

At each time step, it processes the sensor data, integrates information over time, and chooses how to act and how to deploy its sensor at next time step:

3.1 Model

3.1.1 sensor

The bandwidth-limited sensor extracts a retina-like representation $ρ(x_t, l_{t−1})$ around location $l_{t−1}$ from image $x_t$. The glimpse network built on top of the glimpse sensor then produces the feature vector $g_t = f_g(x_t,l_{t-1};\theta_{g})$

3.1.2 Internal state

The internal (hidden) state combines the history $h_{t-1}$ with the new information about the environment $g_t$ to give the new hidden state: $h_t = f_h(g_t,h_{t-1};\theta_h)$

3.1.3 Actions

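The hidden state $h_t$ is used to produce two actions:

how to deploy its sensor via the sensor control $l_t$
an environment action $a_t$ which might affect the state of the environment

The location $l_t$ is sampled stochastically from the distribution $l_{t} \sim p\left(\cdot \mid f_{l}\left(h_{t} ; \theta_{l}\right)\right)$

The action $a_t$ is sampled in the same way from the distribution $a_{t} \sim p\left(\cdot \mid f_{a}\left(h_{t} ; \theta_{a}\right)\right)$

Finally, there can be an additional action that decides when the model will stop taking glimpses. Giving a penalty (a negative reward) for each glimpse taken forces the network to learn a classifier that trades off the number of glimpses against classification accuracy.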

3.1.4 Reward

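After executing an action, the agent receives a new visual observation of the environment $x_{t+1}$ (this does not appear in the architecture figure) and a reward signal $r_{t+1}$. The training objective is to maximize the total reward $R=\sum_{t=1}^{T} r_{t}$. For classification, $r_T = 1$ if the object is classified correctly after $T$ steps and $r_T = 0$ otherwise.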
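
A minimal sketch of the reward signal described above, assuming the classification setting in which only the last step can earn a reward. The function names and the `glimpse_cost` argument are illustrative and not from the paper; `glimpse_cost` stands in for the per-glimpse penalty mentioned in 3.1.3.

```python
def episode_rewards(predicted, label, num_glimpses, glimpse_cost=0.0):
    """Return the per-step rewards r_1..r_T for one classification episode."""
    # Each glimpse may incur a (hypothetical) cost; with the default of 0.0
    # the intermediate steps simply receive no reward.
    rewards = [0.0 - glimpse_cost] * num_glimpses
    # The classification decision is made after the last glimpse.
    rewards[-1] += 1.0 if predicted == label else 0.0
    return rewards


def episode_return(rewards):
    """R = sum_t r_t, the quantity the agent is trained to maximise."""
    return sum(rewards)
```

With a non-zero `glimpse_cost`, a correct answer reached with fewer glimpses yields a higher return, which is the trade-off the stopping action is meant to learn.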

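3.1.5 Summary

The setup above is an instance of what the reinforcement learning literature calls a Partially Observable Markov Decision Process (POMDP). The policy $\pi$ is implemented by the RNN, and the history $s_t$ is summarized by $h_t$.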

3.2 Training

The training objective is:
$J(\theta)=\mathbb{E}_{p\left(s_{1: T} ; \theta\right)}\left[\sum_{t=1}^{T} r_{t}\right]=\mathbb{E}_{p\left(s_{1: T} ; \theta\right)}[R]$
Maximizing this objective exactly is non-trivial, because it involves an expectation over high-dimensional interaction sequences. Fortunately, viewing the problem as a POMDP gives a sample approximation to the gradient of $J(\theta)$:
$\nabla_{\theta} J=\sum_{t=1}^{T} \mathbb{E}_{p\left(s_{1: T} ; \theta\right)}\left[\nabla_{\theta} \log \pi\left(u_{t} \mid s_{1: t} ; \theta\right) R\right] \approx \frac{1}{M} \sum_{i=1}^{M} \sum_{t=1}^{T} \nabla_{\theta} \log \pi\left(u_{t}^{i} \mid s_{1: t}^{i} ; \theta\right) R^{i}$
Here the $s^{i}$ are the interaction sequences of $M$ episodes obtained by running the current policy. The formula requires $\nabla_{\theta} \log \pi\left(u_{t}^{i} \mid s_{1: t}^{i} ; \theta\right)$, which is simply the gradient of the RNN and can be computed by standard backpropagation.
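
The sample approximation above can be sketched in a few lines of NumPy. This is an illustration only: it assumes the per-step gradients $\nabla_{\theta} \log \pi$ have already been computed (for example by backpropagation) and stacked into an array; the function name and array layout are not from the paper.

```python
import numpy as np


def reinforce_gradient(grad_log_pi, returns):
    """Monte Carlo estimate of grad J from M sampled episodes.

    grad_log_pi: shape (M, T, P), the gradient of log pi(u_t | s_1:t) with
                 respect to P parameters, for episode i and step t.
    returns:     shape (M,), the total reward R^i of each episode.
    """
    grad_log_pi = np.asarray(grad_log_pi, dtype=float)
    returns = np.asarray(returns, dtype=float)
    # Sum the score function over time steps, weight it by the episode
    # return, then average over the M episodes.
    per_episode = grad_log_pi.sum(axis=1) * returns[:, None]
    return per_episode.mean(axis=0)
```

Episodes with a return of 0 contribute nothing to the estimate, so only the action sequences that led to a reward are made more likely.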

3.2.1 Variance Reduction

The estimate $\frac{1}{M} \sum_{i=1}^{M} \sum_{t=1}^{T} \nabla_{\theta} \log \pi\left(u_{t}^{i} \mid s_{1: t}^{i} ; \theta\right) R^{i}$ of $\nabla_{\theta} J$ is unbiased but can have high variance, so the following form is used instead:
$\frac{1}{M} \sum_{i=1}^{M} \sum_{t=1}^{T} \nabla_{\theta} \log \pi\left(u_{t}^{i} \mid s_{1: t}^{i} ; \theta\right)\left(R_{t}^{i}-b_{t}\right)$
Here $R_{t}^{i}$ is the cumulative reward of episode $i$ and $b_t$ is a baseline that may depend on the history $s_{1:t}^{i}$ but not on the action $u_{t}^{i}$ itself.
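
Why subtracting $b_t$ leaves the estimate unbiased can be checked in one line, assuming the baseline does not depend on the sampled action $u_t$ and that the action is continuous (for a discrete action the integral becomes a sum):
$\mathbb{E}_{u_{t} \sim \pi}\left[\nabla_{\theta} \log \pi\left(u_{t} \mid s_{1: t} ; \theta\right) b_{t}\right]=b_{t} \int \nabla_{\theta} \pi\left(u_{t} \mid s_{1: t} ; \theta\right) d u_{t}=b_{t} \nabla_{\theta} \int \pi\left(u_{t} \mid s_{1: t} ; \theta\right) d u_{t}=b_{t} \nabla_{\theta} 1=0$
The baseline term therefore has zero expectation and only changes the variance of the estimate.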

3.2.2 Using a Hybrid Supervised loss

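The network produces its output at the end of the observation sequence. For classification, a cross-entropy loss is used.
The action network $f_a$ is trained with this loss, and its gradients are backpropagated through the core and glimpse networks, while the location network $f_l$ is trained with reinforcement learning.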
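
As a small illustration of the supervised part, the cross-entropy of a softmax output against the true class can be computed as below. This is a generic sketch; the function name is illustrative and the logits would come from the action network at the last step.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy of a softmax over `logits` against the true class index."""
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    # -log p(label) = log_norm - logits[label]
    return log_norm - logits[label]
```

For ten classes with identical logits the loss is $\log 10$, the value expected from an untrained classifier.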

(4) Experiments

4.0 Network details

4.0.1 Retina and location encodings

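The retina encoding $\rho(x, l)$ extracts $k$ square patches centered at $l$. The first patch is $g_w \times g_w$ pixels and each successive patch is twice the width of the previous one ($2g_w \times 2g_w$, and so on); all patches are then resized to $g_w \times g_w$. The location $l$ is given as $(x, y)$ coordinates, where $(0,0)$ is the center of the image.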
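
A minimal NumPy sketch of such a retina encoding follows. Several details are assumptions made for the example and are not stated in these notes: coordinates are taken to lie in $[-1, 1]$, regions outside the image are zero-padded, and resizing is done by average pooling. The function name and default values are illustrative.

```python
import numpy as np


def retina_encoding(image, loc, g_w=8, k=3):
    """Extract k square patches centred at `loc`, each resized to g_w x g_w.

    image: 2-D array (H, W). loc: (x, y) in [-1, 1], with (0, 0) the centre.
    """
    height, width = image.shape
    # Assumed mapping from normalised coordinates to pixel indices.
    x = min(max(float(loc[0]), -1.0), 1.0)
    y = min(max(float(loc[1]), -1.0), 1.0)
    cx = int(round((x + 1.0) / 2.0 * (width - 1)))
    cy = int(round((y + 1.0) / 2.0 * (height - 1)))
    # Zero-pad so that the largest patch never leaves the array.
    pad = (g_w * 2 ** (k - 1)) // 2 + 1
    padded = np.pad(image.astype(float), pad, mode="constant")
    patches = []
    for scale in range(k):
        factor = 2 ** scale
        size = g_w * factor
        top = cy + pad - size // 2
        left = cx + pad - size // 2
        patch = padded[top:top + size, left:left + size]
        # Resize to g_w x g_w by average pooling (an assumed resize method).
        patch = patch.reshape(g_w, factor, g_w, factor).mean(axis=(1, 3))
        patches.append(patch)
    return np.stack(patches)
```

The output has shape `(k, g_w, g_w)` whatever the size of `image`, which is why the cost of processing a glimpse does not grow with the input resolution.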

4.0.2 Glimpse network

The glimpse network $f_{g}(x, l)$ has two fully connected layers:
$g_t=\operatorname{Rect}\left(\operatorname{Linear}\left(h_t^{g}\right)+\operatorname{Linear}\left(h_t^{l}\right)\right)$
where $h_t^{g}=\operatorname{Rect}(\operatorname{Linear}(\rho(x, l_{t-1})))$ and $h_t^{l}=\operatorname{Rect}(\operatorname{Linear}(l_{t-1}))$. Both $h_t^{g}$ and $h_t^{l}$ are 128-dimensional and the resulting $g_t$ is 256-dimensional. I am not certain, but presumably the two Linear layers map $h_t^{g}$ and $h_t^{l}$ to 256 dimensions each, the results are added, and ReLU is applied.
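
The reading above (two 128-dimensional branches, each mapped to 256 dimensions, summed, then rectified) can be sketched as follows. The parameter layout, the random initialisation and all names are assumptions made for the example.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def init_glimpse_params(retina_dim, rng, loc_dim=2, hidden=128, out=256):
    """Small random weights; the initialisation scheme is an assumption."""
    def linear(n_in, n_out):
        return rng.normal(0.0, 0.01, size=(n_out, n_in)), np.zeros(n_out)
    return {
        "retina": linear(retina_dim, hidden),  # rho(x, l) -> h_g
        "loc": linear(loc_dim, hidden),        # l -> h_l
        "g_from_hg": linear(hidden, out),      # h_g -> 256 dimensions
        "g_from_hl": linear(hidden, out),      # h_l -> 256 dimensions
    }


def glimpse_network(rho, loc, params):
    """g = Rect(Linear(h_g) + Linear(h_l))."""
    def apply(name, x):
        weight, bias = params[name]
        return weight @ x + bias
    h_g = relu(apply("retina", np.ravel(rho)))
    h_l = relu(apply("loc", np.asarray(loc, dtype=float)))
    return relu(apply("g_from_hg", h_g) + apply("g_from_hl", h_l))
```

The location enters through its own branch, so the glimpse feature encodes both what was seen and where it was seen.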

4.0.3 Location network

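The policy for the locations $l$ is defined by a two-component Gaussian with a fixed variance, and $f_{l}(h)=\operatorname{Linear}(h)$. The connection to the Gaussian, which is easy to miss, is that the location network outputs the mean of this policy and $l$ is then sampled around that mean.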
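
A sketch of how a linear location network and a fixed-variance Gaussian can fit together, assuming the linear layer outputs the mean of the Gaussian. The names and the value of `sigma` are illustrative; the second function is the score term that a policy gradient update needs for this policy.

```python
import numpy as np


def sample_location(h, weight, bias, sigma, rng):
    """Sample l ~ N(mean, sigma^2 I) with mean = Linear(h)."""
    mean = weight @ h + bias  # output of the location network
    loc = mean + sigma * rng.standard_normal(mean.shape)
    return loc, mean


def grad_log_prob_wrt_mean(loc, mean, sigma):
    """Gradient of log N(loc; mean, sigma^2 I) with respect to the mean."""
    return (loc - mean) / sigma ** 2
```

Because `sigma` is fixed, only the mean is learned; a larger `sigma` means more exploration of glimpse locations during training.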

4.0.4 core network

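The hidden state is $h_{t}=f_{h}\left(g_{t}, h_{t-1}\right)=\operatorname{Rect}\left(\operatorname{Linear}\left(h_{t-1}\right)+\operatorname{Linear}\left(g_{t}\right)\right)$, which is not the LSTM formula. This rectifier core is the one used in the classification experiments; the phrase "used a core of LSTM units" refers to the dynamic environment experiment (section 4.2).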
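
For reference, the rectifier form of the core network, $h_t = \operatorname{Rect}(\operatorname{Linear}(h_{t-1}) + \operatorname{Linear}(g_t))$, can be sketched as a plain recurrent step. The function names and the explicit weight arguments are illustrative, and this is the non-LSTM variant only.

```python
import numpy as np


def core_step(h_prev, g_t, w_h, b_h, w_g, b_g):
    """One step of the rectifier core: Rect(Linear(h_prev) + Linear(g_t))."""
    return np.maximum(w_h @ h_prev + b_h + w_g @ g_t + b_g, 0.0)


def run_core(glimpses, h0, w_h, b_h, w_g, b_g):
    """Fold a sequence of glimpse features into the final hidden state."""
    h = h0
    for g_t in glimpses:
        h = core_step(h, g_t, w_h, b_h, w_g, b_g)
    return h
```

The same weights are reused at every step, so the number of parameters does not depend on how many glimpses are taken.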

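4.1 Image classification experiments

The action network is a linear softmax classifier. Training uses stochastic gradient descent with mini-batches of size 20 and momentum 0.9.
The hyperparameters (initial learning rate and variance of the location policy) were selected by random search.
The reward is 1 for a correct classification and 0 for an incorrect one.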
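
The momentum update mentioned above can be written in a few lines. This is the classical momentum form, given as a generic sketch: the learning rate used in the test values is made up, and the notes do not say which momentum variant the authors used.

```python
def momentum_step(theta, velocity, grad, lr, momentum=0.9):
    """One SGD-with-momentum update on lists of floats (descent on a loss)."""
    # The velocity accumulates a decaying sum of past gradient steps.
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grad)]
    new_theta = [t + v for t, v in zip(theta, new_velocity)]
    return new_theta, new_velocity
```

To ascend the expected reward instead of descending a loss, pass the negated policy gradient estimate as `grad`.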

4.1.1 Centered Digits (digit at the center of the image)

[Image: results for centered digits]

This demonstrates the model can successfully learn to combine information from multiple glimpses.

4.1.2 Non-Centered Digits (digit randomly translated, no longer at the center of the image)

[Image: results for non-centered digits]

This is possible because the attention model can focus its retina on the digit and hence learn a translation invariant policy. This experiment also shows that the attention model is able to successfully search for an object in a big image when the object is not centered.

4.1.3 Cluttered Non-Centered Digits:

One possible advantage of an attention mechanism is that it may make it easier to learn in the presence of clutter by focusing on the relevant part of the image and ignoring the irrelevant part.

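[Image: results for cluttered non-centered digits]

The results show that RAM performs well on digit recognition in cluttered images. They also show that choosing glimpses from the task and the history works: it is a large improvement over random glimpses.

[Image: glimpse paths and glimpses on cluttered digits]

The first column shows the glimpse path, that is, how the center point moves. Columns 2-7 show the regions seen at successive time steps, which gradually approach the target region.

Finally, the $60 \times 60$ and $100 \times 100$ experiments are compared to show RAM's effectiveness: the overall capacity of the model and its amount of computation do not change. As for why performance drops on the larger images, my guess is that the final classification is made at time $t = N$ and $N$ stays the same while the image grows, so the glimpses found may be less accurate than before, which lowers performance.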

4.2 Dynamic environment

The network had a 6 by 6 retina at three scales as its input, which means that the agent had to capture the ball in the 6 by 6 highest resolution region in order to know its precise position
actions (left, right, and do nothing)
a core network of 256 LSTM units

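(5) Conclusion

The model is non-differentiable but can be trained end to end with a policy gradient method.
Properties of the model:
- The number of parameters and the amount of computation are independent of the input image size.
- The model can focus on the relevant regions and ignore clutter.
- RAM is flexible and extends to many variants, for example decision-based tasks.
- The network can learn when to stop (once it has gathered enough information to classify).
In future work, the authors hope to apply the model to large-scale object recognition and video classification.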
