Abstract
We present FollowNet, an end-to-end differentiable neural architecture for learning multi-modal navigation policies. FollowNet maps natural language instructions, together with visual and depth inputs, to locomotion primitives. (Language instructions and visual inputs are mapped to action primitives.)
FollowNet processes instructions using an attention mechanism conditioned on its visual and depth input to focus on the relevant parts of the command while performing the navigation task. Deep reinforcement learning (RL) with a sparse reward simultaneously learns the state representation, the attention function, and the control policies. (While learning from the sparse reward, the network jointly learns the state representation, the attention function, and the control policy.)
Introduction
The novel aspect of the FollowNet architecture is a language instruction attention mechanism that is conditioned on the agent’s sensory observations. This allows the agent to do two things.
- First, it keeps track of the instruction command and focuses on different parts as it explores the environment. (It can keep track of the instruction and attend to different parts of it during exploration.)
- Second, it associates motion primitives, sensory observations, and sections of the instruction with the reward received, which enables the agent to generalize to new instructions. (FollowNet associates motion primitives, sensor observations, and instruction segments with the received reward, which lets the network generalize to new instructions.)
Related Work
In this work, we provide natural language instructions instead of the explicit goal, and the agent must learn to interpret the instructions to complete the task. (Typical end-to-end DRL navigation methods address navigation to an explicitly given goal point; here the agent must infer the implicit goal from the instruction.)
Methods
Problem formulation
We assume the robot to be a point-mass with 3-DOF $(x, y, \theta)$, navigating in a 2D grid overlaid on a 3D indoor house environment. To train a DQN agent, we formulate the task as a POMDP (partially observable Markov decision process): a tuple $(O, A, D, R)$ with observations $o = [o_{NL}, o_{V}] \in O$, where $o_{NL} = [\omega_{1}, \omega_{2}, \cdots, \omega_{i}]$ is a natural language instruction sampled from a set of user-provided directions for reaching a goal. $o_{V}$ is the visual input available to the agent, which consists of the image that the robot sees at a time-step $i$. The set of actions is $A = (\mathrm{turn}\ \tfrac{\pi}{2},\ \mathrm{go\ straight},\ \mathrm{turn}\ \tfrac{3\pi}{2})$. The system dynamics $D: O \times A \rightarrow O$ are deterministic and apply the action to the robot. The reward $R: O \rightarrow \mathbb{R}$ rewards an agent reaching a landmark (waypoint) mentioned in the instruction.
Fig. 2 provides an example task, where the robot starts at the position and orientation specified by the blue triangle, and must reach the goal location specified by the red circle.
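To make the formulation concrete, here is a minimal Python sketch of the $(O, A, D, R)$ tuple on the 2D grid. The names (`Action`, `Observation`, `RobotState`, `dynamics`, `reward`) and the discrete heading convention are illustrative assumptions, not code from the paper.

```python
# Minimal sketch of the POMDP (O, A, D, R) described above; names and conventions
# are illustrative assumptions, not the authors' implementation.
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple

import numpy as np


class Action(Enum):
    TURN_LEFT = 0    # turn pi/2
    GO_STRAIGHT = 1  # go straight
    TURN_RIGHT = 2   # turn 3*pi/2


@dataclass
class Observation:
    o_nl: List[str]   # instruction tokens [w_1, ..., w_i]
    o_v: np.ndarray   # image the robot sees at the current time step


@dataclass
class RobotState:
    x: int
    y: int
    heading: int      # theta discretized to multiples of pi/2 (0..3)


def dynamics(state: RobotState, action: Action) -> RobotState:
    """Deterministic dynamics D: apply the chosen motion primitive on the grid."""
    if action is Action.TURN_LEFT:
        return RobotState(state.x, state.y, (state.heading + 1) % 4)
    if action is Action.TURN_RIGHT:
        return RobotState(state.x, state.y, (state.heading - 1) % 4)
    dx, dy = [(1, 0), (0, 1), (-1, 0), (0, -1)][state.heading]
    return RobotState(state.x + dx, state.y + dy, state.heading)


def reward(state: RobotState, landmarks: List[Tuple[int, int]]) -> float:
    """Sparse reward R: nonzero only when the robot reaches a landmark named in the instruction."""
    return 1.0 if (state.x, state.y) in landmarks else 0.0
```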
FollowNet
We present FollowNet, a neural architecture for approximating the action-value function directly from the language and visual inputs. (The action-value function is approximated directly from language and visual inputs, which places the algorithm in the DQN framework.)
- To simplify the image processing task, we assume a separate preprocessing step parses the visual input $o_{V} \in \mathbb{R}^{n \times m}$ to obtain a semantic segmentation $o_{S}$, which assigns a one-hot semantic class id to each pixel, and a depth map $o_{D}$, which assigns to each pixel a real number corresponding to the distance from the robot. (The input image is preprocessed into a semantic segmentation and a depth map; these are then encoded by a series of convolutional (CNN) and fully connected (FC) layers.)
- We use a single-layer bi-directional GRU network to encode the natural language instruction. To enable the agent to focus on different parts of the instruction depending on the context, we add a feed-forward attention layer.
We use a feed-forward attention layer $FF_{A}$ conditioned on $v_{C}$, the concatenated embeddings of the visual and language inputs, to obtain unnormalized scores $e_{i}$ for each token $\omega_{i}$ (the visual and language inputs together yield an unnormalized score per word). The $e_{i}$ are normalized with the softmax function to obtain the attention scores $\alpha_{i}$, which correspond to the relative importance of each token of the instruction for the current time step (the normalized scores express the relative importance of each token of the instruction). We take the attention-weighted mean of the output vectors $o_{i}$ and pass it through another feed-forward layer to obtain $v_{L} \in \mathbb{R}^{d_{L}}$, the final encoding of the natural language instruction (weighting all tokens by their attention scores yields the encoding of the original instruction). (A PyTorch sketch of this encoder and the Q head appears after this list.)
- The Q function is then estimated from the concatenation $[v_{S}; v_{D}; v_{L}]$ passed through a final feed-forward layer. During training, we sample actions from the Q function using an $\epsilon$-greedy policy to collect experience, and update the Q-network to minimize the Bellman error over batches of transitions using gradient descent. After the Q function is trained, we use the greedy policy $\pi(o): O \rightarrow A$ with respect to the learned $\hat{Q}$, $\pi(o) = \pi^{Q}(o) = \arg\max_{a \in A} \hat{Q}(o, a)$, to take the robot to the goal presented in the instruction $o_{NL}$. (A sketch of the $\epsilon$-greedy action selection and Bellman update also follows the list.)
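Putting the pieces above together, the following PyTorch sketch shows one plausible forward pass: CNN encoders for $o_{S}$ and $o_{D}$, a single-layer bi-directional GRU over the instruction tokens, a feed-forward attention layer conditioned on the visual context, and a final feed-forward layer producing the Q-values. All layer sizes, module names, and the exact conditioning of the attention scores are assumptions for illustration, not the authors' reference implementation.

```python
# A minimal sketch of a FollowNet-style forward pass; sizes and conditioning
# details are assumptions, not the paper's published architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FollowNetSketch(nn.Module):
    def __init__(self, vocab_size, embed_dim=32, gru_dim=64, vis_dim=64, n_actions=3):
        super().__init__()
        # CNN + FC encoders for the semantic segmentation o_S and depth map o_D
        # (both assumed here to be single-channel maps for brevity).
        self.seg_cnn = nn.Sequential(
            nn.Conv2d(1, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, vis_dim))
        self.depth_cnn = nn.Sequential(
            nn.Conv2d(1, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, vis_dim))
        # Single-layer bi-directional GRU over the instruction tokens.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, gru_dim, batch_first=True, bidirectional=True)
        # Feed-forward attention layer FF_A; here conditioned on each GRU output o_i
        # concatenated with the visual context (a simplification of v_C).
        self.attn = nn.Linear(2 * gru_dim + 2 * vis_dim, 1)
        self.lang_fc = nn.Linear(2 * gru_dim, vis_dim)
        # Final feed-forward layer producing Q(o, a) from [v_S; v_D; v_L].
        self.q_head = nn.Linear(3 * vis_dim, n_actions)

    def forward(self, o_s, o_d, tokens):
        v_s = self.seg_cnn(o_s)                    # (B, vis_dim)
        v_d = self.depth_cnn(o_d)                  # (B, vis_dim)
        o_i, _ = self.gru(self.embed(tokens))      # (B, T, 2*gru_dim)
        # Unnormalized scores e_i for each token, conditioned on the visual context.
        ctx = torch.cat([v_s, v_d], dim=-1).unsqueeze(1).expand(-1, o_i.size(1), -1)
        e_i = self.attn(torch.cat([o_i, ctx], dim=-1)).squeeze(-1)   # (B, T)
        alpha = F.softmax(e_i, dim=-1)             # attention scores alpha_i
        # Attention-weighted mean of the GRU outputs, then a feed-forward layer -> v_L.
        v_l = self.lang_fc((alpha.unsqueeze(-1) * o_i).sum(dim=1))   # (B, vis_dim)
        return self.q_head(torch.cat([v_s, v_d, v_l], dim=-1))       # Q-values per action
```

A forward pass returns one Q-value per motion primitive, so the greedy action is simply the argmax over the last dimension.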
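And a short sketch of the training-time action selection, the Bellman error minimized by gradient descent, and the greedy policy $\pi(o) = \arg\max_{a} \hat{Q}(o, a)$ used after training. The replay batch format, discount factor, and use of a separate target network are standard-DQN assumptions rather than details stated in this section.

```python
# Sketch of epsilon-greedy experience collection and the Bellman-error objective;
# the target network and batch layout are assumed standard-DQN details.
import random

import torch


def epsilon_greedy(q_net, obs, epsilon=0.1):
    """Sample an action for experience collection: random with prob. epsilon, else greedy."""
    q_values = q_net(*obs)                                  # (1, n_actions)
    if random.random() < epsilon:
        return random.randrange(q_values.size(-1))
    return int(q_values.argmax(dim=-1))


def bellman_error(q_net, target_net, batch, gamma=0.99):
    """Mean squared Bellman error over a batch of transitions (obs, a, r, next_obs, done)."""
    obs, actions, rewards, next_obs, done = batch           # done is a 0/1 float tensor
    q_sa = q_net(*obs).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = rewards + gamma * (1 - done) * target_net(*next_obs).max(dim=1).values
    return torch.mean((q_sa - target) ** 2)


def greedy_policy(q_net, obs):
    """pi(o) = argmax_a Q_hat(o, a), used after training to reach the instructed goal."""
    return int(q_net(*obs).argmax(dim=-1))
```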