Exploration and Apprenticeship Learning in Reinforcement Learning

Goal

This paper proposes apprenticeship learning, in which a teacher demonstration of the task is available. Given the initial demonstration, no explicit exploration is necessary, and we can attain near-optimal performance (compared to the teacher) simply by repeatedly executing “exploitation policies” that try to maximize rewards.

Related Work

  • $E^3$ algorithms
    $E^3$ algorithms learn near-optimal policies by using “exploration policies” to drive the system towards poorly modeled states, so as to encourage exploration. These algorithms are hard to apply to many real systems, because such exploration can crash the system.
    The algorithm would explicitly use an exploration policy until the model was considered accurate enough, after which it switched to an exploitation policy (see the sketch below).
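    As a rough illustration of this explore-then-exploit switch, here is a minimal Python sketch. The visit-count threshold and the planner callables are assumptions made for illustration, not notation from the paper or from $E^3$ itself.

    ```python
    # Minimal sketch of the explore-then-exploit switch described above.
    # KNOWN_THRESHOLD, n_visits, plan_exploration, and plan_exploitation
    # are illustrative placeholders, not the paper's notation.
    KNOWN_THRESHOLD = 50  # visits after which a state counts as "well modeled"

    def choose_policy(n_visits, plan_exploration, plan_exploitation):
        """Use an exploration policy while some states are poorly modeled,
        then switch to an exploitation policy."""
        poorly_modeled = {s for s, count in n_visits.items()
                          if count < KNOWN_THRESHOLD}
        if poorly_modeled:
            # Drive the system toward poorly modeled states to gather data.
            return plan_exploration(poorly_modeled)
        # The model is considered accurate enough everywhere: maximize reward.
        return plan_exploitation()
    ```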

Contribution

The paper proposes the following algorithm:

  1. Have a teacher demonstrate the task to be learned, and record the state-action trajectories of the teacher’s demonstration.
  2. Use all state-action trajectories seen so far to learn a dynamics model (the MDP’s state-transition probabilities) for the system. For this model, find a (near) optimal policy using any reinforcement learning (RL) algorithm.
  3. Test that policy by running it on the real system. If the performance is as good as the teacher’s performance, stop. Otherwise, add the state-action trajectories from the (unsuccessful) test to the training set, and go back to step 2 (a minimal sketch of this loop follows the list).
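
As a rough illustration, here is a minimal Python sketch of this three-step loop. The callables passed in (collect_teacher_demo, fit_dynamics_model, solve_mdp, run_policy, teacher_return) and the tolerance are placeholders assumed for illustration; they are not interfaces defined in the paper.

```python
# Minimal sketch of the demonstrate -> model -> plan -> test loop above.
# Every callable passed in is an assumed placeholder, not an API from the paper.
def apprenticeship_learning(collect_teacher_demo, fit_dynamics_model, solve_mdp,
                            run_policy, teacher_return,
                            max_iters=20, tolerance=0.05):
    # Step 1: record the teacher's state-action trajectories.
    trajectories = [collect_teacher_demo()]
    target = teacher_return(trajectories[0])

    policy = None
    for _ in range(max_iters):
        # Step 2: fit a dynamics model (the MDP's transition probabilities)
        # to all trajectories seen so far, then find a (near) optimal policy
        # for that model with any RL algorithm.
        model = fit_dynamics_model(trajectories)
        policy = solve_mdp(model)

        # Step 3: test the policy on the real system.
        test_trajectory, test_return = run_policy(policy)
        if test_return >= (1 - tolerance) * target:
            return policy  # performance is as good as the teacher's: stop
        # Otherwise, add the unsuccessful test to the training set and repeat.
        trajectories.append(test_trajectory)

    return policy
```

Note that the only "exploration" here comes from the teacher's demonstration and from the failed exploitation policies themselves; no explicit exploration policy is ever executed.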

Simulation Lemma

The simulation lemma shows that not all state-action pairs’ transition probabilities need to be accurately modeled: it suffices for the model to be accurate on the state-action pairs that a policy actually visits with non-negligible probability, since only those pairs affect that policy’s value.
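
A hedged paraphrase of one common form of this lemma, stated here for the discounted case (the paper’s exact statement, norm, and constants may differ): if the estimated model $\hat{P}$ satisfies

$$\big\| \hat{P}(\cdot \mid s,a) - P(\cdot \mid s,a) \big\|_1 \le \epsilon$$

for every state-action pair $(s,a)$ that a policy $\pi$ visits with non-negligible probability, then the value of $\pi$ under the model is close to its value in the real MDP, roughly

$$\big| \hat{V}^{\pi} - V^{\pi} \big| = O\!\left(\frac{\epsilon\, R_{\max}}{(1-\gamma)^{2}}\right),$$

where $\gamma$ is the discount factor and $R_{\max}$ bounds the rewards. So accuracy is only needed where the relevant policies (the teacher’s and the exploitation policies) actually go.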
