Summary of Chapter 4, "Dynamic Programming", in Reinforcement Learning: An Introduction

A new student has joined our group and I need to walk him through the basics of RL, so we are starting from Silver's course.

For myself, I am adding the requirement of carefully reading Reinforcement Learning: An Introduction.

I did not read it very carefully before; this time I hope to be more thorough and also write a short summary of the corresponding knowledge points.





Note that everything in this chapter is model-based, i.e. we are assumed to know π(a|s), P(s'|s,a) and R(s'|s,a). The model-based setting differs somewhat from the problem that full RL really has to solve. Model-based methods are usually referred to as planning, in contrast to (reinforcement) learning.


Know policy evaluation (the prediction problem; itself an iterative computation), (greedy) policy improvement, and policy iteration:

policy evaluation is obtained simply by turning the Bellman expectation equation into an update rule
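As a concrete illustration, here is a minimal sketch of iterative policy evaluation on a small made-up tabular MDP. The random P and R, the uniform policy pi, and all names below are placeholders of my own rather than anything from the book; the point is just the Bellman expectation backup used as an update rule.

```python
import numpy as np

# A small hypothetical tabular MDP, for illustration only.
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.random((n_states, n_actions, n_states))   # P[s, a, s'] = P(s'|s,a)
P /= P.sum(axis=2, keepdims=True)                 # normalize into a distribution over s'
R = rng.random((n_states, n_actions, n_states))   # R[s, a, s'] = R(s'|s,a)
pi = np.full((n_states, n_actions), 1.0 / n_actions)  # a fixed policy to evaluate

def policy_evaluation(pi, P, R, gamma, theta=1e-8):
    """Iterative policy evaluation: the Bellman expectation equation turned
    into an update rule, swept over all states until V stops changing."""
    V = np.zeros(n_states)
    while True:
        # Full backup: expectation over all actions and all possible next states.
        V_new = np.einsum('sa,sat,sat->s', pi, P, R + gamma * V[None, None, :])
        if np.max(np.abs(V_new - V)) < theta:
            return V_new
        V = V_new

print(policy_evaluation(pi, P, R, gamma))   # approximates V_pi for the toy MDP
```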

DP algorithms are called full backups because they are based on all possible next states rather than on a sample next state

In some undiscounted episodic tasks there may be policies for which eventual termination is not guaranteed. For example, in some grid problems it is possible to go back and forth between two states forever.

During policy improvement, if several actions all achieve the maximum value, any apportionment of probability among these actions is permitted; there is no need to put probability 1.0 on a single action.
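Building on the same kind of toy setup, here is a hedged sketch of policy iteration that keeps the tie-splitting behaviour just described: when several actions tie for the maximum action value, the improved policy spreads probability evenly over all of them. The random MDP and the names (policy_iteration and so on) are again my own placeholders.

```python
import numpy as np

# The same kind of hypothetical random MDP as in the previous sketch.
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions, n_states))

def policy_iteration(P, R, gamma, theta=1e-8):
    pi = np.full((n_states, n_actions), 1.0 / n_actions)
    V = np.zeros(n_states)
    while True:
        # Policy evaluation: iterate the Bellman expectation backup to convergence.
        while True:
            V_new = np.einsum('sa,sat,sat->s', pi, P, R + gamma * V[None, None, :])
            if np.max(np.abs(V_new - V)) < theta:
                break
            V = V_new
        # Greedy policy improvement: Q(s,a) = sum_s' P(s'|s,a)[R(s'|s,a) + gamma V(s')].
        Q = np.einsum('sat,sat->sa', P, R + gamma * V[None, None, :])
        # Split probability evenly over all maximizing actions (ties allowed).
        best = np.isclose(Q, Q.max(axis=1, keepdims=True))
        pi_new = best / best.sum(axis=1, keepdims=True)
        if np.allclose(pi_new, pi):
            return pi, V
        pi = pi_new

pi_star, V_star = policy_iteration(P, R, gamma)
print(pi_star, V_star)
```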

Know value iteration:

It can be written as a particularly simple backup operation that combines the policy improvement and truncated policy evaluation steps:

value iteration is obtained simply by turning the Bellman optimality equation into an update rule. It requires the maximum to be taken over all actions, and it is precisely this maximum operation that lets the policy improvement step be folded in implicitly.
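For comparison, a minimal value iteration sketch on the same kind of random toy MDP: the Bellman optimality equation used directly as the update rule, with the max over actions standing in for the explicit policy improvement step.

```python
import numpy as np

# The same kind of hypothetical random MDP as in the earlier sketches.
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions, n_states))

def value_iteration(P, R, gamma, theta=1e-8):
    """Bellman optimality backup as the update rule: the max over actions
    makes the policy improvement step implicit (no explicit policy is stored)."""
    V = np.zeros(n_states)
    while True:
        Q = np.einsum('sat,sat->sa', P, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < theta:
            # A greedy (deterministic) policy is read off only at the very end.
            return V_new, Q.argmax(axis=1)
        V = V_new

V_star, greedy_actions = value_iteration(P, R, gamma)
print(V_star, greedy_actions)
```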

Asynchronous DP: in-place updates, and no constraint on the order in which states are updated.

asynchronous DP methods are often preferred

To converge correctly, however, an asynchronous algorithm must continue to back up the values of all the states (every state must still be visited): it can't ignore any state after some point in the computation
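A rough sketch of an in-place, asynchronous variant under the same toy assumptions: states are backed up one at a time, in an arbitrary (here randomly shuffled) order, and each backup immediately sees the latest values of the other states. The sweep count is an arbitrary choice of mine; the essential requirement from the text is that every state keeps being backed up.

```python
import numpy as np

# The same kind of hypothetical random MDP as in the earlier sketches.
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions, n_states))

def in_place_value_iteration(P, R, gamma, sweeps=500):
    """Asynchronous/in-place DP: back up one state at a time, in any order,
    as long as no state is ignored forever."""
    V = np.zeros(n_states)
    for _ in range(sweeps):
        for s in rng.permutation(n_states):   # arbitrary update order each sweep
            # Bellman optimality backup for this single state, written in place.
            V[s] = np.max(np.sum(P[s] * (R[s] + gamma * V), axis=1))
    return V

print(in_place_value_iteration(P, R, gamma))
```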


Compared with other methods, DP is actually reasonably efficient:

DP may not be practical for very large problems, but compared with other methods (direct search in policy space, linear programming) for solving MDPs, DP methods are actually quite efficient

Large state sets do create difficulties, but these are inherent difficulties of the problem, not of DP as a solution method. 


Full backups are closely related to Bellman equations: they are little more than these equations turned into assignment statements. Just as there are four primary value functions (vπ, v∗, qπ, and q∗), there are four corresponding Bellman equations and four corresponding full backups.
All of them update estimates of the values of states based on estimates of the values of successor states. That is, they update estimates on the basis of other estimates. We call this general idea bootstrapping.
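Written out in the notation this post already uses (V, Q, π(a|s), P(s'|s,a), R(s'|s,a), and discount factor γ), the four full backups that the quote refers to look roughly as follows:

```latex
\begin{aligned}
V(s)   &\leftarrow \sum_a \pi(a|s) \sum_{s'} P(s'|s,a)\big[R(s'|s,a) + \gamma V(s')\big]
        &&\text{(backup for } v_\pi\text{)}\\
V(s)   &\leftarrow \max_a \sum_{s'} P(s'|s,a)\big[R(s'|s,a) + \gamma V(s')\big]
        &&\text{(backup for } v_*\text{)}\\
Q(s,a) &\leftarrow \sum_{s'} P(s'|s,a)\Big[R(s'|s,a) + \gamma \sum_{a'} \pi(a'|s')\,Q(s',a')\Big]
        &&\text{(backup for } q_\pi\text{)}\\
Q(s,a) &\leftarrow \sum_{s'} P(s'|s,a)\Big[R(s'|s,a) + \gamma \max_{a'} Q(s',a')\Big]
        &&\text{(backup for } q_*\text{)}
\end{aligned}
```

Each backup replaces one estimate with a quantity computed from other estimates, which is exactly the bootstrapping idea above.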




Below are the points from Silver's Lecture 3, "Planning by Dynamic Programming", that I think one should know:


As with the book chapter, everything in this lecture is model-based, i.e. π(a|s), P(s'|s,a) and R(s'|s,a) are assumed to be known; model-based methods are usually called planning, as opposed to (reinforcement) learning.

Slide 5: prediction and control, which were mentioned before.
Slides 7-8: know that policy evaluation is used to compute Vπ, based on the Bellman expectation equation.
Slides 9-11: the policy evaluation and policy improvement procedures.
Slides 12-13: know that policy iteration is used to compute V*; policy iteration = iterative policy evaluation + greedy policy improvement. Think about what "iterative" means here (it is simple, do not overthink it; you can connect it with the question raised on slide 18 below).
Slides 16-17: why policy improvement keeps getting closer to V*; a general understanding is enough.
Slide 18: Modified Policy Iteration. Think about whether policy evaluation needs to converge to Vπ before each policy improvement step. The answer is no. K=3 is easy to understand, but then what does "ε-convergence of the value function" mean? Combining this with the value iteration material later, think about why K=1 gives exactly value iteration (a toy sketch follows below).
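A minimal sketch of the K-sweep idea, again on a made-up random MDP (all names and the fixed iteration budget are my own choices): run only K evaluation backups before each greedy improvement. With K=1 the evaluation-plus-improvement pair essentially collapses into a single Bellman optimality backup, i.e. value iteration.

```python
import numpy as np

# The same kind of hypothetical random MDP as in the sketches above.
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions, n_states))

def modified_policy_iteration(P, R, gamma, K=3, iters=200):
    """Only K evaluation sweeps before each greedy improvement:
    K=1 behaves like value iteration, large K like full policy iteration."""
    pi = np.full((n_states, n_actions), 1.0 / n_actions)
    V = np.zeros(n_states)
    for _ in range(iters):
        # Truncated policy evaluation: K Bellman expectation backups.
        for _ in range(K):
            V = np.einsum('sa,sat,sat->s', pi, P, R + gamma * V[None, None, :])
        # Greedy policy improvement (deterministic, one-hot policy).
        Q = np.einsum('sat,sat->sa', P, R + gamma * V[None, None, :])
        pi = np.eye(n_actions)[Q.argmax(axis=1)]
    return V, pi

print(modified_policy_iteration(P, R, gamma, K=3))
```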

Slides 23-24: know that value iteration is used to compute π*, based on the Bellman optimality equation.
Slide 22: the value iteration procedure (see whether you can understand that value iteration is policy iteration with K=1, with the policy improvement step performed "implicitly").
Slide 26: summary of synchronous DP.

Slides 27-31: three kinds of asynchronous DP, mainly aimed at making the updates more efficient; as long as every state continues to be visited, convergence to V* is still guaranteed. The method on slide 31 only updates V(s) for whichever state s is currently visited; note the difference from the method mentioned on slide 33 (that one is model-free, which is the problem full RL considers). Prioritised sweeping (prioritised state selection) on slide 30, driven by the Bellman error (TD error), is very useful and can speed up convergence; you will probably run into it again after you get to school (a rough sketch follows below).
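The following is an assumption-heavy toy rendering of prioritised state selection driven by the Bellman error, not Silver's exact algorithm: always back up the state whose current Bellman error is largest. A real implementation would maintain a priority queue rather than recomputing an argmax over all states each step.

```python
import numpy as np

# The same kind of hypothetical random MDP as in the sketches above.
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions, n_states))

def bellman_error(V, s):
    """|max_a sum_s' P(s'|s,a)[R(s'|s,a) + gamma V(s')] - V(s)| for one state."""
    return abs(np.max(np.sum(P[s] * (R[s] + gamma * V), axis=1)) - V[s])

def prioritised_sweeping(P, R, gamma, max_backups=1000, tol=1e-8):
    V = np.zeros(n_states)
    for _ in range(max_backups):
        errors = [bellman_error(V, s) for s in range(n_states)]
        s = int(np.argmax(errors))   # pick the state with the largest Bellman error
        if errors[s] < tol:          # all states approximately satisfy Bellman optimality
            break
        # In-place Bellman optimality backup of just that state.
        V[s] = np.max(np.sum(P[s] * (R[s] + gamma * V), axis=1))
    return V

print(prioritised_sweeping(P, R, gamma))
```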

Slide 35: it is enough to know that convergence is guaranteed by the contraction mapping theorem, i.e. just know that such a theorem exists.
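Not a substitute for the theorem, but a quick numerical illustration of the property it relies on: under the same toy MDP assumptions as above, the Bellman optimality backup shrinks the max-norm distance between any two value functions by at least a factor of γ.

```python
import numpy as np

# The same kind of hypothetical random MDP as in the sketches above.
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions, n_states))

def bellman_optimality_backup(V):
    """T*(V): one synchronous Bellman optimality backup of all states."""
    return np.max(np.einsum('sat,sat->sa', P, R + gamma * V[None, None, :]), axis=1)

# Check the gamma-contraction property in the infinity norm on random pairs.
for _ in range(5):
    V1, V2 = rng.normal(size=n_states), rng.normal(size=n_states)
    lhs = np.max(np.abs(bellman_optimality_backup(V1) - bellman_optimality_backup(V2)))
    rhs = gamma * np.max(np.abs(V1 - V2))
    print(f'||T*V1 - T*V2||_inf = {lhs:.4f} <= gamma * ||V1 - V2||_inf = {rhs:.4f}')
```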
