Summary of Chapter 4, "Dynamic Programming", in Reinforcement Learning: An Introduction

A new student has joined our group and I need to walk him through the basics of RL, so we are starting from Silver's course.

For myself, I am adding the requirement of carefully reading Reinforcement Learning: An Introduction.

I did not read it very carefully before; this time I hope to be more thorough and also write a short summary of the corresponding knowledge points.





Note that everything in this chapter is model-based, i.e. we are assumed to know π(a|s), P(s'|s,a) and R(s'|s,a). The model-based setting differs somewhat from the problem that full RL really has to solve. Model-based methods are usually referred to as planning, in contrast to (reinforcement) learning.


Know policy evaluation (the prediction problem; itself an iterative computation), (greedy) policy improvement, and policy iteration:

policy evaluation is obtained simply by turning the Bellman expectation equation into an update rule
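As a concrete illustration, here is a minimal sketch of iterative policy evaluation on a small made-up tabular MDP. The random P and R, the uniform policy pi, and all names below are placeholders of my own rather than anything from the book; the point is just the Bellman expectation backup used as an update rule.

```python
import numpy as np

# A small hypothetical tabular MDP, for illustration only.
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.random((n_states, n_actions, n_states))   # P[s, a, s'] = P(s'|s,a)
P /= P.sum(axis=2, keepdims=True)                 # normalize into a distribution over s'
R = rng.random((n_states, n_actions, n_states))   # R[s, a, s'] = R(s'|s,a)
pi = np.full((n_states, n_actions), 1.0 / n_actions)  # a fixed policy to evaluate

def policy_evaluation(pi, P, R, gamma, theta=1e-8):
    """Iterative policy evaluation: the Bellman expectation equation turned
    into an update rule, swept over all states until V stops changing."""
    V = np.zeros(n_states)
    while True:
        # Full backup: expectation over all actions and all possible next states.
        V_new = np.einsum('sa,sat,sat->s', pi, P, R + gamma * V[None, None, :])
        if np.max(np.abs(V_new - V)) < theta:
            return V_new
        V = V_new

print(policy_evaluation(pi, P, R, gamma))   # approximates V_pi for the toy MDP
```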

DP algorithms are called full backups because they are based on all possible next states rather than on a sample next state

In some undiscounted episodic tasks there may be policies for which eventual termination is not guaranteed. For example, in some grid problems it is possible to go back and forth between two states forever.

During policy improvement, if several actions all achieve the maximum value, any apportionment of probability among these actions is permitted; there is no need to put probability 1.0 on a single action.
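Building on the same kind of toy setup, here is a hedged sketch of policy iteration that keeps the tie-splitting behaviour just described: when several actions tie for the maximum action value, the improved policy spreads probability evenly over all of them. The random MDP and the names (policy_iteration and so on) are again my own placeholders.

```python
import numpy as np

# The same kind of hypothetical random MDP as in the previous sketch.
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions, n_states))

def policy_iteration(P, R, gamma, theta=1e-8):
    pi = np.full((n_states, n_actions), 1.0 / n_actions)
    V = np.zeros(n_states)
    while True:
        # Policy evaluation: iterate the Bellman expectation backup to convergence.
        while True:
            V_new = np.einsum('sa,sat,sat->s', pi, P, R + gamma * V[None, None, :])
            if np.max(np.abs(V_new - V)) < theta:
                break
            V = V_new
        # Greedy policy improvement: Q(s,a) = sum_s' P(s'|s,a)[R(s'|s,a) + gamma V(s')].
        Q = np.einsum('sat,sat->sa', P, R + gamma * V[None, None, :])
        # Split probability evenly over all maximizing actions (ties allowed).
        best = np.isclose(Q, Q.max(axis=1, keepdims=True))
        pi_new = best / best.sum(axis=1, keepdims=True)
        if np.allclose(pi_new, pi):
            return pi, V
        pi = pi_new

pi_star, V_star = policy_iteration(P, R, gamma)
print(pi_star, V_star)
```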

Know value iteration:

It can be written as a particularly simple backup operation that combines the policy improvement and truncated policy evaluation steps:

value iteration is obtained simply by turning the Bellman optimality equation into an update rule. It requires the maximum to be taken over all actions, and it is precisely this maximum operation that lets the policy improvement step be folded in implicitly.
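For comparison, a minimal value iteration sketch on the same kind of random toy MDP: the Bellman optimality equation used directly as the update rule, with the max over actions standing in for the explicit policy improvement step.

```python
import numpy as np

# The same kind of hypothetical random MDP as in the earlier sketches.
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions, n_states))

def value_iteration(P, R, gamma, theta=1e-8):
    """Bellman optimality backup as the update rule: the max over actions
    makes the policy improvement step implicit (no explicit policy is stored)."""
    V = np.zeros(n_states)
    while True:
        Q = np.einsum('sat,sat->sa', P, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < theta:
            # A greedy (deterministic) policy is read off only at the very end.
            return V_new, Q.argmax(axis=1)
        V = V_new

V_star, greedy_actions = value_iteration(P, R, gamma)
print(V_star, greedy_actions)
```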

Asynchronous DP: in-place updates, and no constraint on the order in which states are updated.

asynchronous DP methods are often preferred

To converge correctly, however, an asynchronous algorithm must continue to back up the values of all the states (every state must still be visited): it can't ignore any state after some point in the computation
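A rough sketch of an in-place, asynchronous variant under the same toy assumptions: states are backed up one at a time, in an arbitrary (here randomly shuffled) order, and each backup immediately sees the latest values of the other states. The sweep count is an arbitrary choice of mine; the essential requirement from the text is that every state keeps being backed up.

```python
import numpy as np

# The same kind of hypothetical random MDP as in the earlier sketches.
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions, n_states))

def in_place_value_iteration(P, R, gamma, sweeps=500):
    """Asynchronous/in-place DP: back up one state at a time, in any order,
    as long as no state is ignored forever."""
    V = np.zeros(n_states)
    for _ in range(sweeps):
        for s in rng.permutation(n_states):   # arbitrary update order each sweep
            # Bellman optimality backup for this single state, written in place.
            V[s] = np.max(np.sum(P[s] * (R[s] + gamma * V), axis=1))
    return V

print(in_place_value_iteration(P, R, gamma))
```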


Compared with other methods, DP is actually reasonably efficient:

DP may not be practical for very large problems, but compared with other methods (direct search in policy space, linear programming) for solving MDPs, DP methods are actually quite efficient

Large state sets do create difficulties, but these are inherent difficulties of the problem, not of DP as a solution method. 


Full backups are closely related to Bellman equations: they are little more than these equations turned into assignment statements. Just as there are four primary value functions (vπ, v∗, qπ, and q∗), there are four corresponding Bellman equations and four corresponding full backups.
All of them update estimates of the values of states based on estimates of the values of successor states. That is, they update estimates on the basis of other estimates. We call this general idea bootstrapping.
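Written out in the notation this post already uses (V, Q, π(a|s), P(s'|s,a), R(s'|s,a), and discount factor γ), the four full backups that the quote refers to look roughly as follows:

```latex
\begin{aligned}
V(s)   &\leftarrow \sum_a \pi(a|s) \sum_{s'} P(s'|s,a)\big[R(s'|s,a) + \gamma V(s')\big]
        &&\text{(backup for } v_\pi\text{)}\\
V(s)   &\leftarrow \max_a \sum_{s'} P(s'|s,a)\big[R(s'|s,a) + \gamma V(s')\big]
        &&\text{(backup for } v_*\text{)}\\
Q(s,a) &\leftarrow \sum_{s'} P(s'|s,a)\Big[R(s'|s,a) + \gamma \sum_{a'} \pi(a'|s')\,Q(s',a')\Big]
        &&\text{(backup for } q_\pi\text{)}\\
Q(s,a) &\leftarrow \sum_{s'} P(s'|s,a)\Big[R(s'|s,a) + \gamma \max_{a'} Q(s',a')\Big]
        &&\text{(backup for } q_*\text{)}
\end{aligned}
```

Each backup replaces one estimate with a quantity computed from other estimates, which is exactly the bootstrapping idea above.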




Below are the points from Silver's Lecture 3, "Planning by Dynamic Programming", that I think one should know:


As with the book chapter, everything in this lecture is model-based, i.e. π(a|s), P(s'|s,a) and R(s'|s,a) are assumed to be known; model-based methods are usually called planning, as opposed to (reinforcement) learning.

Slide 5: prediction and control, which were mentioned before.
Slides 7-8: know that policy evaluation is used to compute Vπ, based on the Bellman expectation equation.
Slides 9-11: the policy evaluation and policy improvement procedures.
Slides 12-13: know that policy iteration is used to compute V*; policy iteration = iterative policy evaluation + greedy policy improvement. Think about what "iterative" means here (it is simple, do not overthink it; you can connect it with the question raised on slide 18 below).
Slides 16-17: why policy improvement keeps getting closer to V*; a general understanding is enough.
Slide 18: Modified Policy Iteration. Think about whether policy evaluation needs to converge to Vπ before each policy improvement step. The answer is no. K=3 is easy to understand, but then what does "ε-convergence of the value function" mean? Combining this with the value iteration material later, think about why K=1 gives exactly value iteration (a toy sketch follows below).
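A minimal sketch of the K-sweep idea, again on a made-up random MDP (all names and the fixed iteration budget are my own choices): run only K evaluation backups before each greedy improvement. With K=1 the evaluation-plus-improvement pair essentially collapses into a single Bellman optimality backup, i.e. value iteration.

```python
import numpy as np

# The same kind of hypothetical random MDP as in the sketches above.
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions, n_states))

def modified_policy_iteration(P, R, gamma, K=3, iters=200):
    """Only K evaluation sweeps before each greedy improvement:
    K=1 behaves like value iteration, large K like full policy iteration."""
    pi = np.full((n_states, n_actions), 1.0 / n_actions)
    V = np.zeros(n_states)
    for _ in range(iters):
        # Truncated policy evaluation: K Bellman expectation backups.
        for _ in range(K):
            V = np.einsum('sa,sat,sat->s', pi, P, R + gamma * V[None, None, :])
        # Greedy policy improvement (deterministic, one-hot policy).
        Q = np.einsum('sat,sat->sa', P, R + gamma * V[None, None, :])
        pi = np.eye(n_actions)[Q.argmax(axis=1)]
    return V, pi

print(modified_policy_iteration(P, R, gamma, K=3))
```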

Slides 23-24: know that value iteration is used to compute π*, based on the Bellman optimality equation.
Slide 22: the value iteration procedure (see whether you can understand that value iteration is policy iteration with K=1, with the policy improvement step performed "implicitly").
Slide 26: summary of synchronous DP.

Slides 27-31: three kinds of asynchronous DP, mainly aimed at making the updates more efficient; as long as every state continues to be visited, convergence to V* is still guaranteed. The method on slide 31 only updates V(s) for whichever state s is currently visited; note the difference from the method mentioned on slide 33 (that one is model-free, which is the problem full RL considers). Prioritised sweeping (prioritised state selection) on slide 30, driven by the Bellman error (TD error), is very useful and can speed up convergence; you will probably run into it again after you get to school (a rough sketch follows below).
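The following is an assumption-heavy toy rendering of prioritised state selection driven by the Bellman error, not Silver's exact algorithm: always back up the state whose current Bellman error is largest. A real implementation would maintain a priority queue rather than recomputing an argmax over all states each step.

```python
import numpy as np

# The same kind of hypothetical random MDP as in the sketches above.
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions, n_states))

def bellman_error(V, s):
    """|max_a sum_s' P(s'|s,a)[R(s'|s,a) + gamma V(s')] - V(s)| for one state."""
    return abs(np.max(np.sum(P[s] * (R[s] + gamma * V), axis=1)) - V[s])

def prioritised_sweeping(P, R, gamma, max_backups=1000, tol=1e-8):
    V = np.zeros(n_states)
    for _ in range(max_backups):
        errors = [bellman_error(V, s) for s in range(n_states)]
        s = int(np.argmax(errors))   # pick the state with the largest Bellman error
        if errors[s] < tol:          # all states approximately satisfy Bellman optimality
            break
        # In-place Bellman optimality backup of just that state.
        V[s] = np.max(np.sum(P[s] * (R[s] + gamma * V), axis=1))
    return V

print(prioritised_sweeping(P, R, gamma))
```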

Slide 35: it is enough to know that convergence is guaranteed by the contraction mapping theorem, i.e. just know that such a theorem exists.
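Not a substitute for the theorem, but a quick numerical illustration of the property it relies on: under the same toy MDP assumptions as above, the Bellman optimality backup shrinks the max-norm distance between any two value functions by at least a factor of γ.

```python
import numpy as np

# The same kind of hypothetical random MDP as in the sketches above.
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions, n_states))

def bellman_optimality_backup(V):
    """T*(V): one synchronous Bellman optimality backup of all states."""
    return np.max(np.einsum('sat,sat->sa', P, R + gamma * V[None, None, :]), axis=1)

# Check the gamma-contraction property in the infinity norm on random pairs.
for _ in range(5):
    V1, V2 = rng.normal(size=n_states), rng.normal(size=n_states)
    lhs = np.max(np.abs(bellman_optimality_backup(V1) - bellman_optimality_backup(V2)))
    rhs = gamma * np.max(np.abs(V1 - V2))
    print(f'||T*V1 - T*V2||_inf = {lhs:.4f} <= gamma * ||V1 - V2||_inf = {rhs:.4f}')
```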
