Paper reading: Look-ahead before you leap: end-to-end active

xizero00

Article link: https://blog.csdn.net/xizero00/article/details/51386629

Authors: Dinesh Jayaraman (UT Austin) and Kristen Grauman (UT Austin)

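1. The problem the paper addresses

This is an interesting paper. Its main goal is to learn how to move a camera so that scenes can be recognized effectively, which closely resembles how people behave.

An RNN learns how to steer the sensor to achieve reliable recognition (active vision).

The paper identifies three components of active vision: control, per-view recognition, and fusion of evidence over time (integrate evidence over time).

Control uses a stochastic neural network, per-view recognition uses a neural network, and fusion of evidence over time uses an RNN.

This paper follows Mnih, V., Heess, N., Graves, A., Kavukcuoglu, K.: Recurrent models of visual attention. In: NIPS. (2014)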

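2. What is active vision

Traditional recognition is passive. It is called passive because the inputs are static images, and the algorithms look for features that stay invariant across those images.

Active vision, by contrast, can adjust the camera's angle and direction, which changes factors such as lighting and viewpoint and therefore changes what is observed. It does not look for invariant features in an image. Instead, given the current scene and a chosen camera motion (an adjustment of the camera's angle and direction), it predicts how its knowledge will change ("we jointly train our active vision system to have the ability to predict how its knowledge will evolve conditioned on its present state and its choice of motion.").

Where the idea comes from: when people cannot see something clearly, they turn their head or change their viewing angle and look again. Traditional vision does not take this into account.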

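3. Differences from visual saliency and attention, and from predicting related features (in other words, how this paper differs from earlier methods)

(1) Visual saliency aims to filter out the distracting parts of an image; its core goal is to replace sliding windows.

(2) Visual attention assumes a snapshot of the entire environment is available. The difference from this paper is that attention is discrete, meaning the focus can move in large jumps, whereas in this paper the sensor moves only in small steps and is, comparatively, continuous.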

Attention systems thus sometimes take a “foveated”approach [29], [24]. In contrast, in our setting, the system never holds a snapshot of the entire environment at once. Rather, its input at each time step is one portion of its complete physical 3D environment, and it must choose motions leading to more informative—possibly non-overlapping—viewpoints. Another difference between the two settings is that the focus of attention may move in arbitrary jumps (saccades) without continuity, whereas active vision agents may only move continuously.

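(3) Predicting related features means predicting the features of the next frame (or the image itself, or the optical flow) from the current frame.

For example, one architecture uses a CNN directly to predict a future frame.

From: Anticipating the future by watching unlabeled video

Another combines a CNN with an RNN into a recurrent CNN (rCNN) and uses spatial and temporal information to predict the content of the next frame. The input image patches are quantized; I did not follow exactly how this is done.

From: VIDEO (LANGUAGE) MODELING: A BASELINE FOR GENERATIVE MODELS OF NATURAL VIDEOS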

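4. The paper's solution

(0) Overview of the architecture

The system consists of three parts, ACTOR, SENSOR and AGGREGATOR, followed by a CLASSIFIER. ACTOR is a stochastic neural network, SENSOR is a feed-forward neural network, AGGREGATOR is an RNN, and CLASSIFIER is a softmax.

At each time step t:

(1) ACTOR issues a camera motion m_t. Here p_t is the camera position at time t and p_{t-1} is the position at time t-1:

p_t = p_{t-1} + m_t

(2) An image x_t is captured at position p_t, and both m_t and x_t are passed to SENSOR:

s_t = SENSOR(x_t, m_t)

This yields s_1, ..., s_t.

(3) All of s_1, ..., s_t are then fed into AGGREGATOR,

giving a_t = AGGREGATOR(s_1, ..., s_t), an aggregated feature vector.

a_t then serves as the input to ACTOR at the next time step: m_{t+1} = ACTOR(a_t).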

The paper's own English description of the above:

Our basic active recognition system is modeled on the recurrent architecture first proposed in [24] for visual attention. Our system is composed of four basic modules: ACTOR, SENSOR, AGGREGATOR and CLASSIFIER, with weights W_a, W_s, W_r, W_c respectively. At each step t, ACTOR issues a motor command m_t, which updates the camera pose vector to p_t = p_{t-1} + m_t. Next, a 2-D image x_t captured from this pose is fed into SENSOR together with the motor command m_t. SENSOR produces a view-specific feature vector s_t = SENSOR(x_t, m_t), which is then fed into AGGREGATOR to produce aggregate feature vector a_t = AGGREGATOR(s_1, ..., s_t). The cycle is completed when, at the next step t + 1, ACTOR processes the aggregate feature from the previous time step to issue m_{t+1} = ACTOR(a_t). Finally, after T steps, the category label beliefs are predicted as ŷ(W, X) = CLASSIFIER(a_t),
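
The ACTOR / SENSOR / AGGREGATOR / CLASSIFIER cycle described above can be sketched in a few lines of NumPy. This is only a minimal illustration under assumed details, not the authors' implementation: the layer sizes, the tanh activations, the Gaussian noise in ACTOR, the zero initial aggregate feature, and the toy capture_view function (a stand-in for a real camera) are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
POSE_DIM, VIEW_DIM, FEAT_DIM, NUM_CLASSES, T = 2, 16, 8, 4, 3

# Randomly initialised weights standing in for W_a, W_s, W_r, W_c.
W_a = rng.normal(scale=0.1, size=(FEAT_DIM, POSE_DIM))
W_s = rng.normal(scale=0.1, size=(VIEW_DIM + POSE_DIM, FEAT_DIM))
W_r = rng.normal(scale=0.1, size=(2 * FEAT_DIM, FEAT_DIM))
W_c = rng.normal(scale=0.1, size=(FEAT_DIM, NUM_CLASSES))


def capture_view(pose):
    """Toy stand-in for the camera: a fake 'image' that depends on the pose."""
    return np.cos(np.arange(VIEW_DIM) * 0.3 + pose.sum())


def actor(a_prev):
    """Stochastic policy: sample a motion around a mean computed from a_{t-1}."""
    mean = np.tanh(a_prev @ W_a)
    return mean + rng.normal(scale=0.1, size=POSE_DIM)


def sensor(x, m):
    """Feed-forward layer producing the per-view feature s_t from (x_t, m_t)."""
    return np.tanh(np.concatenate([x, m]) @ W_s)


def aggregator(a_prev, s):
    """Recurrent update: fold s_t into a_{t-1} so a_t summarises s_1..s_t."""
    return np.tanh(np.concatenate([a_prev, s]) @ W_r)


def classifier(a):
    """Softmax over class logits."""
    logits = a @ W_c
    e = np.exp(logits - logits.max())
    return e / e.sum()


def run_episode(steps=T):
    pose = np.zeros(POSE_DIM)
    a = np.zeros(FEAT_DIM)  # assumed initial aggregate feature
    poses = []
    for _ in range(steps):
        m = actor(a)            # m_t = ACTOR(a_{t-1})
        pose = pose + m         # p_t = p_{t-1} + m_t
        x = capture_view(pose)  # x_t observed from pose p_t
        s = sensor(x, m)        # s_t = SENSOR(x_t, m_t)
        a = aggregator(a, s)    # a_t = AGGREGATOR(s_1, ..., s_t)
        poses.append(pose.copy())
    return classifier(a), poses


beliefs, poses = run_episode()
```

Here beliefs is a probability vector over the hypothetical classes after T steps, and poses records the camera pose reached at each step.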

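How it is trained:

ACTOR is a stochastic neural network and is trained with reinforcement learning.

SENSOR (a feed-forward neural network), AGGREGATOR (an RNN) and CLASSIFIER (a softmax) are trained with backpropagation (BP).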
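
To make the reinforcement-learning part concrete, here is a toy sketch of how a stochastic ACTOR could be updated. The note above only says "reinforcement learning"; the sketch assumes the REINFORCE score-function estimator with a Gaussian policy and a running-average baseline, which is a common choice for stochastic policies but is an assumption here. The reward is also a toy stand-in (negative squared distance to a hypothetical best motion), not the recognition-based reward the paper would use, and the names reinforce_update and train_toy_actor are made up for this example.

```python
import numpy as np


def reinforce_update(W, a, m, reward, baseline, sigma, lr):
    """One REINFORCE step for a Gaussian policy m ~ N(a @ W, sigma^2 I)."""
    mean = a @ W
    # Score function: d log pi(m | a) / dW = outer(a, (m - mean) / sigma^2)
    score = np.outer(a, (m - mean) / sigma**2)
    # Gradient ascent on expected reward; the baseline only reduces variance.
    return W + lr * (reward - baseline) * score


def train_toy_actor(steps=2000, sigma=0.5, lr=0.01, seed=0):
    rng = np.random.default_rng(seed)
    a = np.array([1.0])             # fixed aggregate feature (toy assumption)
    target = np.array([1.0, -1.0])  # hypothetical most informative motion
    W = np.zeros((1, 2))
    baseline = 0.0
    for _ in range(steps):
        m = a @ W + sigma * rng.normal(size=2)    # sample a motion
        reward = -np.sum((m - target) ** 2)       # toy reward, not the paper's
        W = reinforce_update(W, a, m, reward, baseline, sigma, lr)
        baseline = 0.9 * baseline + 0.1 * reward  # running-average baseline
    return a @ W, target


learned_mean, target = train_toy_actor()
final_error = np.linalg.norm(learned_mean - target)
```

The policy mean starts at the origin and drifts toward the target motion using only sampled rewards, with no gradient flowing through the sampling step. The deterministic modules need no such trick and can be trained with ordinary backpropagation.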

In addition, the authors introduce a LOOKAHEAD unit.