Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation

Research Topic

Learning goal-directed behavior in environments with sparse feedback is a major challenge for reinforcement learning algorithms.
Two terms to note here: goal-directed behavior and sparse feedback.

The paper proposes hierarchical-DQN (h-DQN), a framework that integrates hierarchical value functions, operating at different temporal scales, with intrinsically motivated deep reinforcement learning.

The model makes decisions over two levels of hierarchy (a minimal sketch of the resulting interaction loop follows the list):

  1. the top-level module (meta-controller)
    takes in the state and picks a new goal
  2. the lower-level module (controller)
    uses both the state and the chosen goal to select actions, either until the goal is reached or the episode terminates.
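
A minimal sketch of this interaction loop is below; `env`, `meta_controller`, `controller`, and `goal_reached` are hypothetical stand-ins for the environment, the two learned modules, and the internal critic, not the authors' implementation:

```python
def run_episode(env, meta_controller, controller, goal_reached, max_steps=500):
    state = env.reset()
    extrinsic_return = 0.0
    done, step = False, 0
    while not done and step < max_steps:
        # Top level: the meta-controller observes the state and commits to a goal.
        goal = meta_controller.act(state)
        reached = False
        # Lower level: the controller acts on (state, goal) until the goal is
        # reached or the episode terminates.
        while not (done or reached) and step < max_steps:
            action = controller.act(state, goal)
            state, reward, done = env.step(action)
            extrinsic_return += reward
            reached = goal_reached(state, goal)  # the internal critic's check
            step += 1
    return extrinsic_return
```

The outer loop runs at the slower goal time-scale, while the inner loop runs at the raw action time-scale.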

The authors propose a scheme for temporal abstraction that involves simultaneously learning options and a control policy to compose options in a deep reinforcement learning setting.


It is worth explaining intrinsic motivation and extrinsic motivation here; both are originally terms from psychology:

  • intrinsic motivation
    People driven by an internal evaluation system care little about how others judge them; their drive comes from within. Intrinsic motivation provides a natural force for learning and development: it can spark behavior even in the absence of external rewards or pressure.
  • extrinsic motivation
    People driven by an external evaluation system care a great deal about how others judge them, and may even internalize those judgments as their self-image. When doing something, they first consider how others will see it; their drive is usually to win approval, money, or other external rewards.

Current reinforcement learning research on agents concentrates almost entirely on extrinsic motivation: external reinforcement is generally regarded as the necessary condition for eliciting it, and under reinforcement the agent develops an expectation of the next reinforcement, so obtaining external reinforcement becomes the goal of its behavior.


Model

Agents

Existing exploration methods (e.g. $\epsilon$-greedy) are only useful for local exploration, but fail to provide the impetus for the agent to explore different areas of the state space.
To address this, the paper introduces an important concept: goals.
Goals provide intrinsic motivation for the agent. The agent focuses on setting and achieving sequences of goals in order to maximize cumulative extrinsic reward.

The temporal abstraction of options is used to define a policy $\pi_g$ for each goal $g$.

The paper therefore has two learning objectives:

  • learning the option policy $\pi_g$ for each goal (see the tabular sketch after this list)
  • learning the optimal sequence of goals to follow
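
As a simplified picture of the first objective, here is a tabular sketch of goal-conditioned option policies, one greedy policy $\pi_g$ per goal; the array sizes are illustrative, and in h-DQN the table is replaced by the controller's Q-network:

```python
import numpy as np

# Tabular stand-in for the controller: one Q-slice per goal, so the option
# policy is pi_g(s) = argmax_a Q[g, s, a]. Sizes are illustrative.
N_GOALS, N_STATES, N_ACTIONS = 4, 16, 3
Q = np.zeros((N_GOALS, N_STATES, N_ACTIONS))

def pi_g(state, goal, epsilon=0.1):
    """Epsilon-greedy option policy for goal `goal`."""
    if np.random.rand() < epsilon:
        return np.random.randint(N_ACTIONS)
    return int(np.argmax(Q[goal, state]))

def q_update(state, action, intrinsic_reward, next_state, goal,
             alpha=0.1, gamma=0.99):
    """One Q-learning step driven by the intrinsic reward r_t(g)."""
    target = intrinsic_reward + gamma * np.max(Q[goal, next_state])
    Q[goal, state, action] += alpha * (target - Q[goal, state, action])
```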

Temporal Abstraction

The overall architecture is shown below:
[Figure: h-DQN architecture, with the meta-controller selecting goals, the controller selecting actions, and the internal critic providing intrinsic rewards]
The role of the critic:
The internal critic is responsible for evaluating whether a goal has been reached and providing an appropriate reward $r_t(g)$ to the controller.
The intrinsic reward functions are dynamic and temporally dependent on the sequential history of goals.
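
A minimal sketch of such a critic, assuming each goal is identified with a predicate over the state (the `goal_predicates` mapping is hypothetical); the binary 1/0 intrinsic reward is the choice used in the paper's experiments:

```python
def internal_critic(next_state, goal, goal_predicates):
    """Check whether `goal` holds in `next_state` and emit r_t(g).

    `goal_predicates` is a hypothetical mapping goal -> predicate(state),
    e.g. "the agent overlaps the key".
    """
    reached = goal_predicates[goal](next_state)
    intrinsic_reward = 1.0 if reached else 0.0  # binary reward on success
    return reached, intrinsic_reward
```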

Deep Reinforcement Learning with Temporal Abstraction

The paper uses the deep Q-learning framework to learn policies for both the controller and the meta-controller.

  • the controller estimates the following Q-value function:
    $$Q_1^*(s, a; g) = \max_{\pi_{ag}} \mathbb{E}\big[\, r_t + \gamma \max_{a_{t+1}} Q_1^*(s_{t+1}, a_{t+1}; g) \,\big|\, s_t = s,\ a_t = a,\ g_t = g,\ \pi_{ag} \big]$$
    where $g$ is the agent's goal in state $s$, $a$ the chosen action, and $r$ the intrinsic reward from the internal critic.
  • the meta-controller estimates the following Q-value function:
    $$Q_2^*(s, g) = \max_{\pi_{g}} \mathbb{E}\Big[\, \textstyle\sum_{t'=t}^{t+N} f_{t'} + \gamma \max_{g'} Q_2^*(s_{t+N}, g') \,\Big|\, s_t = s,\ g_t = g,\ \pi_{g} \Big]$$
    where $N$ is the number of time steps until the controller halts for the current goal, $g'$ the goal chosen in state $s_{t+N}$, and $f_t$ the extrinsic reward from the environment.
    It is important to note that the transitions $(s_t, g_t, f_t, s_{t+N})$ generated by $Q_2$ run at a slower time-scale than the transitions $(s_t, a_t, g_t, r_t, s_{t+1})$ generated by $Q_1$.
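
One simple way to realize the two time-scales, sketched below, is to store each kind of transition in its own replay memory, as the paper does with disjoint memories $\mathcal{D}_1$ and $\mathcal{D}_2$; the capacities here are illustrative:

```python
from collections import deque
import random

# Controller transitions (s_t, g_t, a_t, r_t, s_{t+1}) arrive every step;
# meta-controller transitions (s_t, g_t, f_t, s_{t+N}) arrive only when a goal
# terminates, hence the slower time-scale. Capacities are illustrative.
D1 = deque(maxlen=1_000_000)  # controller experience
D2 = deque(maxlen=50_000)     # meta-controller experience

def store_controller(s, g, a, r, s_next):
    D1.append((s, g, a, r, s_next))

def store_meta(s0, g, f, s_after_goal):
    D2.append((s0, g, f, s_after_goal))

def sample(memory, batch_size=32):
    # Copy to a list for uniform random sampling.
    return random.sample(list(memory), min(batch_size, len(memory)))
```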

Learning Algorithm

Parameters of h-DQN are learned using stochastic gradient descent at different time-scales.
[Equations: the loss functions for the controller and the meta-controller, each minimized by stochastic gradient descent at its own time-scale]
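
For concreteness, a sketch of the two Bellman targets behind those losses; `q1_target` and `q2_target` are hypothetical callables wrapping the frozen target networks:

```python
GAMMA = 0.99

def controller_target(r, s_next, g, q1_target, n_actions):
    """y1 = r + gamma * max_a' Q1(s_{t+1}, a'; g), built from the intrinsic reward r."""
    return r + GAMMA * max(q1_target(s_next, a, g) for a in range(n_actions))

def meta_controller_target(f, s_next, q2_target, goals):
    """y2 = f + gamma * max_g' Q2(s_{t+N}, g'), built from the accumulated extrinsic reward f."""
    return f + GAMMA * max(q2_target(s_next, g) for g in goals)
```

Each network is then regressed toward its own target by SGD on the squared error, with $Q_1$ updated at every environment step and $Q_2$ only when a goal terminates.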

Experiments

ATARI game with delayed rewards

Model Architecture

[Figure: model architecture used for the ATARI experiment]
The internal critic is defined in the space of $\langle entity_1, relation, entity_2 \rangle$ triples, where relation is a function over configurations of the entities.
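
For illustration, a sketch of what such a goal space can look like; the entity names and the single `reaches` relation are examples, not the paper's exact object set:

```python
# Example goal triples <entity1, relation, entity2>.
GOALS = [
    ("agent", "reaches", "key"),
    ("agent", "reaches", "middle-ladder"),
    ("agent", "reaches", "right-door"),
]

def relation_holds(goal, state):
    """Evaluate whether relation(entity1, entity2) holds in `state`.

    `state` is assumed to expose entity positions, e.g. from an object detector.
    """
    entity1, relation, entity2 = goal
    if relation == "reaches":
        return state["positions"][entity1] == state["positions"][entity2]
    raise ValueError(f"unknown relation: {relation}")
```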

Training Procedure
  • First Phase
    set the exploration parameter $\epsilon_2$ of the meta-controller to 1 and train the controller on actions. This effectively pre-trains the controller so that it can learn to solve a subset of the goals.
  • Second Phase
    jointly train the controller and the meta-controller (a sketch of the exploration schedule follows)
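
A rough sketch of the two phases expressed through the exploration parameters; the annealing constants are illustrative, and in the paper the controller's $\epsilon_1$ is annealed per goal according to its success rate on that goal:

```python
def exploration_params(phase, progress):
    """Return (eps1, eps2) for the controller and the meta-controller.

    `progress` is the fraction of the current phase completed, in [0, 1].
    The annealing constants here are illustrative.
    """
    if phase == 1:
        eps2 = 1.0                   # meta-controller picks goals at random
        eps1 = 1.0 - 0.9 * progress  # controller pre-trains, anneal toward 0.1
    else:
        eps2 = 1.0 - 0.9 * progress  # now also anneal the meta-controller
        eps1 = 0.1
    return eps1, eps2
```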