Goal
This paper proposes apprenticeship learning, in which a teacher demonstration of the task is available. Given the
initial demonstration, no explicit exploration is necessary, and we can attain near-optimal performance (compared to the teacher) simply by repeatedly executing “exploitation policies” that try to maximize rewards.
Related Work
- $E^3$ algorithms learn near-optimal policies by using “exploration policies” to drive the system towards poorly modeled states, so as to encourage exploration. These algorithms are hard to apply to many real systems, because aggressive exploration can cause crashes. The algorithm explicitly uses an exploration policy until the model is considered accurate enough, after which it switches to an exploitation policy.
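A minimal sketch of this explore/exploit switch, assuming a tabular MDP with per-state visit counts; `visit_counts`, `m_known`, `explore_policy`, and `exploit_policy` are hypothetical names, not the paper's API:

```python
import numpy as np

def e3_step(state, visit_counts, m_known, explore_policy, exploit_policy):
    """Pick an action E^3-style: explore while the current state is
    poorly modeled, exploit once it is considered 'known'."""
    # visit_counts[state] holds per-action counts; a state becomes
    # "known" once it has been visited at least m_known times.
    if visit_counts[state].sum() < m_known:
        # Exploration policy: planned to drive the system toward
        # poorly modeled (unknown) states.
        return explore_policy(state)
    # Model is accurate enough here: switch to exploitation.
    return exploit_policy(state)
```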
Contribution
Proposes the following algorithm (a minimal sketch follows the list):
1. Have a teacher demonstrate the task to be learned, and record the state-action trajectories of the teacher’s demonstration.
2. Use all state-action trajectories seen so far to learn a dynamics model (the MDP’s state-transition probabilities) for the system. For this model, find a (near) optimal policy using any reinforcement learning (RL) algorithm.
3. Test that policy by running it on the real system. If the performance is as good as the teacher’s, stop. Otherwise, add the state-action trajectories from the (unsuccessful) test to the training set, and go back to step 2.
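A minimal sketch of this loop, assuming a tabular MDP; `plan`, `rollout`, and `utility` are hypothetical callables standing in for the paper's components (any RL planner, a real-system rollout, and a performance measure):

```python
import numpy as np

def fit_transition_model(trajs, n_states, n_actions, smoothing=1e-3):
    """Step 2: estimate the MDP's transition probabilities P(s'|s,a)
    from all (s, a, s') triples seen so far (smoothed empirical counts)."""
    counts = np.full((n_states, n_actions, n_states), smoothing)
    for traj in trajs:
        for s, a, s_next in traj:
            counts[s, a, s_next] += 1.0
    return counts / counts.sum(axis=2, keepdims=True)

def apprenticeship_learning(env, teacher_traj, teacher_utility,
                            plan, rollout, utility, n_states, n_actions):
    trajs = [teacher_traj]                      # step 1: teacher demonstration
    while True:
        model = fit_transition_model(trajs, n_states, n_actions)  # step 2
        policy = plan(model)                    # (near) optimal for the model
        test_traj = rollout(env, policy)        # step 3: test on real system
        if utility(test_traj) >= teacher_utility:
            return policy                       # as good as the teacher: stop
        trajs.append(test_traj)                 # else: add test data, repeat
```

Note that every iteration runs an exploitation policy (optimal for the current model); the only "exploration" comes from the data that failed tests happen to generate.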
Simulation Lemma
The simulation lemma shows that not all state-action pairs’ transition probabilities need to be accurately modeled: to evaluate a policy’s utility accurately, it suffices to model accurately the transitions along the state-action pairs that the policy actually visits.
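A hedged sketch of the lemma in its standard form (the bound below is the usual order of magnitude, not necessarily the paper’s exact constants); the refinement above is that the accuracy condition is only needed on the state-action pairs the policies of interest actually visit:

$$
\big\|\hat{P}(\cdot \mid s,a) - P(\cdot \mid s,a)\big\|_1 \le \epsilon \ \ \forall (s,a)
\;\Longrightarrow\;
\big|U_{\hat{M}}(\pi) - U_{M}(\pi)\big| \le \epsilon H^2 R_{\max} \ \text{ for every policy } \pi,
$$

where $H$ is the horizon, $R_{\max}$ bounds the per-step reward, and $U_M(\pi)$ is the utility of $\pi$ in MDP $M$.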