A new student has joined our group and I need to help him get started with RL, so we are beginning with Silver's course.
For myself, I am adding the extra requirement of carefully reading "Reinforcement Learning: An Introduction".
I did not read it very carefully the first time; this time I hope to be more thorough, and to also write a short summary of the corresponding knowledge points.
Note that everything covered in this chapter is model-based: classical DP assumes a perfect model of the environment as an MDP.
Know policy evaluation (the prediction problem; itself an iterative computation), (greedy) policy improvement, and policy iteration:
policy evaluation is obtained simply by turning the Bellman expectation equation into an update rule (the update rule and a small sketch are written out after these notes)
the backups used in DP algorithms are called full backups because they are based on all possible next states rather than on a sample next state
In some undiscounted episodic tasks there may be policies for which eventual termination is not guaranteed. For example, in a gridworld it is possible to go back and forth between two states forever.
During policy improvement, if several actions all achieve the maximal value, any apportionment of probability among these actions is permitted; there is no need to put probability 1.0 on a single action.
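To make the update rule concrete, this is the iterative policy evaluation backup, i.e. the Bellman expectation equation turned into an assignment:

```latex
v_{k+1}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_k(s')\bigr]
```

And a minimal policy iteration sketch in Python. The transition interface P[s][a] (a list of (prob, next_state, reward) tuples) and all names here are my own illustrative assumptions, not from the book:

```python
import numpy as np

GAMMA = 0.9
THETA = 1e-8  # stopping threshold for evaluation sweeps

def policy_evaluation(P, policy, V):
    """Iterative policy evaluation: sweep all states until values stop moving.
    Updates V in place (new values are used as soon as they are computed)."""
    while True:
        delta = 0.0
        for s in range(len(P)):
            v_new = sum(policy[s][a] * sum(p * (r + GAMMA * V[s2])
                                           for p, s2, r in P[s][a])
                        for a in range(len(P[s])))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < THETA:
            return V

def policy_improvement(P, V):
    """Greedy improvement over the one-step lookahead action values."""
    policy = []
    for s in range(len(P)):
        q = [sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])
             for a in range(len(P[s]))]
        best = np.flatnonzero(np.isclose(q, max(q)))
        # Any apportionment among the maximizing actions is permitted;
        # here the tied actions share the probability uniformly.
        policy.append([1.0 / len(best) if a in best else 0.0
                       for a in range(len(P[s]))])
    return policy

def policy_iteration(P):
    V = [0.0] * len(P)
    policy = [[1.0 / len(acts)] * len(acts) for acts in P]  # uniform start
    while True:
        V = policy_evaluation(P, policy, V)
        new_policy = policy_improvement(P, V)
        if new_policy == policy:  # policy stable -> stop (for this sketch)
            return policy, V
        policy = new_policy
```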
Know value iteration:
It can be written as a particularly simple backup operation that combines the policy improvement and truncated policy evaluation steps (written out below):
value iteration is obtained simply by turning the Bellman optimality equation into an update rule. It requires the maximum to be taken over all actions, and it is exactly this maximum operation that makes the policy improvement step implicit, so no separate improvement step is needed.
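The backup operation the quotes refer to is the Bellman optimality equation turned into an assignment:

```latex
v_{k+1}(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_k(s')\bigr]
```

A minimal Python sketch, reusing the illustrative P[s][a] interface from the policy iteration sketch above:

```python
def value_iteration(P, gamma=0.9, theta=1e-8):
    """Value iteration: each backup combines truncated policy evaluation
    with policy improvement via the max over actions."""
    V = [0.0] * len(P)
    while True:
        delta = 0.0
        for s in range(len(P)):
            v_new = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                        for a in range(len(P[s])))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```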
Asynchronous DP: in-place updates; no constraint on the order in which states are updated.
asynchronous DP methods are often preferred
To converge correctly, however, an asynchronous algorithm must continue to back up the values of all the states (every state must keep being visited): it can't ignore any state after some point in the computation
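A minimal sketch of the asynchronous idea, reusing the illustrative P[s][a] interface from above: one state is backed up at a time, in place, in arbitrary order (here random, which in the limit keeps visiting every state):

```python
import random

def async_value_iteration(P, n_updates, gamma=0.9):
    """Asynchronous in-place value iteration: back up one state at a time.
    Any order is allowed, but every state must keep being selected."""
    V = [0.0] * len(P)
    for _ in range(n_updates):
        s = random.randrange(len(P))  # arbitrary update order
        V[s] = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                   for a in range(len(P[s])))
    return V
```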
Compared with other methods, DP is actually reasonably efficient:
DP may not be practical for very large problems, but compared with other methods for solving MDPs (direct search in policy space, linear programming), DP methods are actually quite efficient
Large state sets do create difficulties, but these are inherent difficulties of the problem, not of DP as a solution method.
Full backups are closely related to Bellman equations: they are little more than these equations turned into assignment statements. Just as there are four primary value functions (vπ, v∗, qπ, and q∗), there are four corresponding Bellman equations and four corresponding full backups.
All of them update estimates of the values of states based on estimates of the values of successor states. That is, they update estimates on the basis of other estimates. We call this general idea bootstrapping.
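Spelled out, the four full backups are just the four Bellman equations turned into assignment statements:

```latex
\begin{aligned}
v_\pi(s)   &\leftarrow \textstyle\sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma\, v_\pi(s')\bigr] \\
v_*(s)     &\leftarrow \max_{a} \textstyle\sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma\, v_*(s')\bigr] \\
q_\pi(s,a) &\leftarrow \textstyle\sum_{s', r} p(s', r \mid s, a)\Bigl[r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a')\Bigr] \\
q_*(s,a)   &\leftarrow \textstyle\sum_{s', r} p(s', r \mid s, a)\Bigl[r + \gamma \max_{a'} q_*(s', a')\Bigr]
\end{aligned}
```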
Below is what I think one should know from Silver's course, Lecture 3: Planning by Dynamic Programming: