Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation

Research Topic

Learning goal-directed behavior in environments with sparse feedback is a major challenge for reinforcement learning algorithms.
Two terms to note here: goal-directed behavior and sparse feedback.

The paper proposes hierarchical-DQN (h-DQN), a framework that integrates hierarchical value functions, operating at different temporal scales, with intrinsically motivated deep reinforcement learning.

The model makes decisions over two levels of hierarchy (a minimal sketch of the resulting interaction loop follows the list):

  1. the top-level module (meta-controller)
    takes in the state and picks a new goal
  2. the lower-level module (controller)
    uses both the state and the chosen goal to select actions, either until the goal is reached or the episode terminates.
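
A minimal sketch of this interaction loop is below; `env`, `meta_controller`, `controller`, and `goal_reached` are hypothetical stand-ins for the environment, the two learned modules, and the internal critic, not the authors' implementation:

```python
def run_episode(env, meta_controller, controller, goal_reached, max_steps=500):
    state = env.reset()
    extrinsic_return = 0.0
    done, step = False, 0
    while not done and step < max_steps:
        # Top level: the meta-controller observes the state and commits to a goal.
        goal = meta_controller.act(state)
        reached = False
        # Lower level: the controller acts on (state, goal) until the goal is
        # reached or the episode terminates.
        while not (done or reached) and step < max_steps:
            action = controller.act(state, goal)
            state, reward, done = env.step(action)
            extrinsic_return += reward
            reached = goal_reached(state, goal)  # the internal critic's check
            step += 1
    return extrinsic_return
```

The outer loop runs at the slower goal time-scale, while the inner loop runs at the raw action time-scale.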

The authors propose a scheme for temporal abstraction that involves simultaneously learning options and a control policy to compose options in a deep reinforcement learning setting.


It is worth explaining intrinsic motivation and extrinsic motivation here; both are originally terms from psychology:

  • intrinsic motivation
    People driven by an internal evaluation system care little about how others judge them; their drive comes from within. Intrinsic motivation provides a natural force for learning and development: it can spark behavior even in the absence of external rewards or pressure.
  • extrinsic motivation
    People driven by an external evaluation system care a great deal about how others judge them, and may even internalize those judgments as their self-image. When doing something, they first consider how others will see it; their drive is usually to win approval, money, or other external rewards.

Current reinforcement learning research on agents concentrates almost entirely on extrinsic motivation: external reinforcement is generally regarded as the necessary condition for eliciting it, and under reinforcement the agent develops an expectation of the next reinforcement, so obtaining external reinforcement becomes the goal of its behavior.


Model

Agents

Existing exploration methods (e.g. $\epsilon$-greedy) are only useful for local exploration, but fail to provide the impetus for the agent to explore different areas of the state space.
To address this, the paper introduces an important concept: goals.
Goals provide intrinsic motivation for the agent. The agent focuses on setting and achieving sequences of goals in order to maximize cumulative extrinsic reward.

The temporal abstraction of options is used to define a policy $\pi_g$ for each goal $g$.

The paper therefore has two learning objectives:

  • learning the option policy $\pi_g$ for each goal (see the tabular sketch after this list)
  • learning the optimal sequence of goals to follow
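
As a simplified picture of the first objective, here is a tabular sketch of goal-conditioned option policies, one greedy policy $\pi_g$ per goal; the array sizes are illustrative, and in h-DQN the table is replaced by the controller's Q-network:

```python
import numpy as np

# Tabular stand-in for the controller: one Q-slice per goal, so the option
# policy is pi_g(s) = argmax_a Q[g, s, a]. Sizes are illustrative.
N_GOALS, N_STATES, N_ACTIONS = 4, 16, 3
Q = np.zeros((N_GOALS, N_STATES, N_ACTIONS))

def pi_g(state, goal, epsilon=0.1):
    """Epsilon-greedy option policy for goal `goal`."""
    if np.random.rand() < epsilon:
        return np.random.randint(N_ACTIONS)
    return int(np.argmax(Q[goal, state]))

def q_update(state, action, intrinsic_reward, next_state, goal,
             alpha=0.1, gamma=0.99):
    """One Q-learning step driven by the intrinsic reward r_t(g)."""
    target = intrinsic_reward + gamma * np.max(Q[goal, next_state])
    Q[goal, state, action] += alpha * (target - Q[goal, state, action])
```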

Temporal Abstraction

The overall architecture is shown below:
[Figure: h-DQN architecture, with the meta-controller selecting goals, the controller selecting actions, and the internal critic providing intrinsic rewards]
The role of the critic:
The internal critic is responsible for evaluating whether a goal has been reached and providing an appropriate reward $r_t(g)$ to the controller.
The intrinsic reward functions are dynamic and temporally dependent on the sequential history of goals.
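
A minimal sketch of such a critic, assuming each goal is identified with a predicate over the state (the `goal_predicates` mapping is hypothetical); the binary 1/0 intrinsic reward is the choice used in the paper's experiments:

```python
def internal_critic(next_state, goal, goal_predicates):
    """Check whether `goal` holds in `next_state` and emit r_t(g).

    `goal_predicates` is a hypothetical mapping goal -> predicate(state),
    e.g. "the agent overlaps the key".
    """
    reached = goal_predicates[goal](next_state)
    intrinsic_reward = 1.0 if reached else 0.0  # binary reward on success
    return reached, intrinsic_reward
```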

Deep Reinforcement Learning with Temporal Abstraction

The paper uses the deep Q-learning framework to learn policies for both the controller and the meta-controller.

  • the controller estimates the following Q-value function:
    $$Q_1^*(s, a; g) = \max_{\pi_{ag}} \mathbb{E}\big[\, r_t + \gamma \max_{a_{t+1}} Q_1^*(s_{t+1}, a_{t+1}; g) \,\big|\, s_t = s,\ a_t = a,\ g_t = g,\ \pi_{ag} \big]$$
    where $g$ is the agent's goal in state $s$, $a$ the chosen action, and $r$ the intrinsic reward from the internal critic.
  • the meta-controller estimates the following Q-value function:
    $$Q_2^*(s, g) = \max_{\pi_{g}} \mathbb{E}\Big[\, \textstyle\sum_{t'=t}^{t+N} f_{t'} + \gamma \max_{g'} Q_2^*(s_{t+N}, g') \,\Big|\, s_t = s,\ g_t = g,\ \pi_{g} \Big]$$
    where $N$ is the number of time steps until the controller halts for the current goal, $g'$ the goal chosen in state $s_{t+N}$, and $f_t$ the extrinsic reward from the environment.
    It is important to note that the transitions $(s_t, g_t, f_t, s_{t+N})$ generated by $Q_2$ run at a slower time-scale than the transitions $(s_t, a_t, g_t, r_t, s_{t+1})$ generated by $Q_1$.
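
One simple way to realize the two time-scales, sketched below, is to store each kind of transition in its own replay memory, as the paper does with disjoint memories $\mathcal{D}_1$ and $\mathcal{D}_2$; the capacities here are illustrative:

```python
from collections import deque
import random

# Controller transitions (s_t, g_t, a_t, r_t, s_{t+1}) arrive every step;
# meta-controller transitions (s_t, g_t, f_t, s_{t+N}) arrive only when a goal
# terminates, hence the slower time-scale. Capacities are illustrative.
D1 = deque(maxlen=1_000_000)  # controller experience
D2 = deque(maxlen=50_000)     # meta-controller experience

def store_controller(s, g, a, r, s_next):
    D1.append((s, g, a, r, s_next))

def store_meta(s0, g, f, s_after_goal):
    D2.append((s0, g, f, s_after_goal))

def sample(memory, batch_size=32):
    # Copy to a list for uniform random sampling.
    return random.sample(list(memory), min(batch_size, len(memory)))
```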

Learning Algorithm

Parameters of h-DQN are learned using stochastic gradient descent at different time-scales.
[Equations: the loss functions for the controller and the meta-controller, each minimized by stochastic gradient descent at its own time-scale]
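
For concreteness, a sketch of the two Bellman targets behind those losses; `q1_target` and `q2_target` are hypothetical callables wrapping the frozen target networks:

```python
GAMMA = 0.99

def controller_target(r, s_next, g, q1_target, n_actions):
    """y1 = r + gamma * max_a' Q1(s_{t+1}, a'; g), built from the intrinsic reward r."""
    return r + GAMMA * max(q1_target(s_next, a, g) for a in range(n_actions))

def meta_controller_target(f, s_next, q2_target, goals):
    """y2 = f + gamma * max_g' Q2(s_{t+N}, g'), built from the accumulated extrinsic reward f."""
    return f + GAMMA * max(q2_target(s_next, g) for g in goals)
```

Each network is then regressed toward its own target by SGD on the squared error, with $Q_1$ updated at every environment step and $Q_2$ only when a goal terminates.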

Experiments

ATARI game with delayed rewards

Model Architecture

[Figure: model architecture used for the ATARI experiment]
The internal critic is defined in the space of $\langle entity_1, relation, entity_2 \rangle$ triples, where relation is a function over configurations of the entities.
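
For illustration, a sketch of what such a goal space can look like; the entity names and the single `reaches` relation are examples, not the paper's exact object set:

```python
# Example goal triples <entity1, relation, entity2>.
GOALS = [
    ("agent", "reaches", "key"),
    ("agent", "reaches", "middle-ladder"),
    ("agent", "reaches", "right-door"),
]

def relation_holds(goal, state):
    """Evaluate whether relation(entity1, entity2) holds in `state`.

    `state` is assumed to expose entity positions, e.g. from an object detector.
    """
    entity1, relation, entity2 = goal
    if relation == "reaches":
        return state["positions"][entity1] == state["positions"][entity2]
    raise ValueError(f"unknown relation: {relation}")
```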

Training Procedure
  • First Phase
    set the exploration parameter $\epsilon_2$ of the meta-controller to 1 and train the controller on actions. This effectively pre-trains the controller so that it can learn to solve a subset of the goals.
  • Second Phase
    jointly train the controller and the meta-controller (a sketch of the exploration schedule follows)
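
A rough sketch of the two phases expressed through the exploration parameters; the annealing constants are illustrative, and in the paper the controller's $\epsilon_1$ is annealed per goal according to its success rate on that goal:

```python
def exploration_params(phase, progress):
    """Return (eps1, eps2) for the controller and the meta-controller.

    `progress` is the fraction of the current phase completed, in [0, 1].
    The annealing constants here are illustrative.
    """
    if phase == 1:
        eps2 = 1.0                   # meta-controller picks goals at random
        eps1 = 1.0 - 0.9 * progress  # controller pre-trains, anneal toward 0.1
    else:
        eps2 = 1.0 - 0.9 * progress  # now also anneal the meta-controller
        eps1 = 0.1
    return eps1, eps2
```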