Goal
This paper proposes apprenticeship learning, in which a teacher demonstration of the task is available. Given the
initial demonstration, no explicit exploration is necessary, and we can attain near-optimal performance (compared to the teacher) simply by repeatedly executing “exploitation policies” that try to maximize rewards.
Related Work
- $E^3$ algorithms learn near-optimal policies by using “exploration policies” to drive the system towards poorly modeled states, so as to encourage exploration. These algorithms are hard to apply to many real systems, because aggressive exploration can cause crashes. The algorithm explicitly uses an exploration policy until the model is considered accurate enough, after which it switches to an exploitation policy.
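A minimal sketch of this explore/exploit switch, assuming a tabular MDP with per-state visit counts; `visit_counts`, `m_known`, `explore_policy`, and `exploit_policy` are hypothetical names, not the paper's API:

```python
import numpy as np

def e3_step(state, visit_counts, m_known, explore_policy, exploit_policy):
    """Pick an action E^3-style: explore while the current state is
    poorly modeled, exploit once it is considered 'known'."""
    # visit_counts[state] holds per-action counts; a state becomes
    # "known" once it has been visited at least m_known times.
    if visit_counts[state].sum() < m_known:
        # Exploration policy: planned to drive the system toward
        # poorly modeled (unknown) states.
        return explore_policy(state)
    # Model is accurate enough here: switch to exploitation.
    return exploit_policy(state)
```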
Contribution
Proposes the following algorithm (a minimal sketch follows the list):
1. Have a teacher demonstrate the task to be learned, and record the state-action trajectories of the teacher’s demonstration.
2. Use all state-action trajectories seen so far to learn a dynamics model (the MDP’s state-transition probabilities) for the system. For this model, find a (near) optimal policy using any reinforcement learning (RL) algorithm.
3. Test that policy by running it on the real system. If the performance is as good as the teacher’s, stop. Otherwise, add the state-action trajectories from the (unsuccessful) test to the training set, and go back to step 2.
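A minimal sketch of this loop, assuming a tabular MDP; `plan`, `rollout`, and `utility` are hypothetical callables standing in for the paper's components (any RL planner, a real-system rollout, and a performance measure):

```python
import numpy as np

def fit_transition_model(trajs, n_states, n_actions, smoothing=1e-3):
    """Step 2: estimate the MDP's transition probabilities P(s'|s,a)
    from all (s, a, s') triples seen so far (smoothed empirical counts)."""
    counts = np.full((n_states, n_actions, n_states), smoothing)
    for traj in trajs:
        for s, a, s_next in traj:
            counts[s, a, s_next] += 1.0
    return counts / counts.sum(axis=2, keepdims=True)

def apprenticeship_learning(env, teacher_traj, teacher_utility,
                            plan, rollout, utility, n_states, n_actions):
    trajs = [teacher_traj]                      # step 1: teacher demonstration
    while True:
        model = fit_transition_model(trajs, n_states, n_actions)  # step 2
        policy = plan(model)                    # (near) optimal for the model
        test_traj = rollout(env, policy)        # step 3: test on real system
        if utility(test_traj) >= teacher_utility:
            return policy                       # as good as the teacher: stop
        trajs.append(test_traj)                 # else: add test data, repeat
```

Note that every iteration runs an exploitation policy (optimal for the current model); the only "exploration" comes from the data that failed tests happen to generate.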
Simulation Lemma
The simulation lemma shows that not all state-action pairs’ transition probabilities need to be accurately modeled: to evaluate a policy’s utility accurately, it suffices to model accurately the transitions along the state-action pairs that the policy actually visits.
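A hedged sketch of the lemma in its standard form (the bound below is the usual order of magnitude, not necessarily the paper’s exact constants); the refinement above is that the accuracy condition is only needed on the state-action pairs the policies of interest actually visit:

$$
\big\|\hat{P}(\cdot \mid s,a) - P(\cdot \mid s,a)\big\|_1 \le \epsilon \ \ \forall (s,a)
\;\Longrightarrow\;
\big|U_{\hat{M}}(\pi) - U_{M}(\pi)\big| \le \epsilon H^2 R_{\max} \ \text{ for every policy } \pi,
$$

where $H$ is the horizon, $R_{\max}$ bounds the per-step reward, and $U_M(\pi)$ is the utility of $\pi$ in MDP $M$.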