Chapter 4. Dynamic Programming

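Dynamic programming (DP), as introduced in this chapter, refers to a collection of algorithms that compute optimal policies given a perfect model of the environment as a Markov decision process (MDP). Classical DP algorithms demand a perfect model and are computationally expensive. Other reinforcement learning methods can be viewed as attempts to achieve the same effect as DP with less computation and without assuming a perfect model.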

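Assume the environment is a finite MDP. Concretely:
The state and action sets are finite, i.e. $S$ and $A(s)$ for $s\in S$ are finite.
The dynamics of the environment are given by the transition probabilities and the expected immediate rewards, i.e. $P_{ss'}^a=\Pr\{s_{t+1}=s'| s_t=s, a_t=a\}$ and $R_{ss'}^a = E\{r_{t+1}|a_t=a, s_t=s, s_{t+1}=s'\}$ are known for all $s\in S$, $a\in A(s)$ and $s'\in S^+$ (where $S^+$ is $S$ plus a terminal state).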
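
To make these two ingredients concrete, here is a minimal sketch of how such a model could be stored in Python. The toy MDP, the name `P` and the helper `check_model` are illustrative assumptions, not part of the chapter: `P[s][a]` lists `(probability, next_state, reward)` triples, so one table carries both $P_{ss'}^a$ and $R_{ss'}^a$.

```python
# Hypothetical toy MDP (not from the text): states 0 and 1 are non-terminal,
# state 2 is terminal and absorbing. Both actions 0 and 1 are available everywhere.
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 2, 1.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},
}


def check_model(P, tol=1e-9):
    """Return True if the transition probabilities of every (s, a) pair sum to 1."""
    return all(
        abs(sum(p for p, _, _ in outcomes) - 1.0) <= tol
        for actions in P.values()
        for outcomes in actions.values()
    )
```

A nested dict keeps the sketch readable; for large state spaces, dense arrays indexed by state and action would probably be a better fit.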

The key idea of DP, and of reinforcement learning generally, is the use of value functions to organize and structure the search for good policies.

DP algorithms are obtained by turning Bellman equations into update rules.

4.1 Policy Evaluation

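Problem statement
How do we measure how good a policy is, i.e. how do we compute its state-value function?

Policy evaluation: how to compute the state-value function $V^\pi$ for a given policy $\pi$. We treat policy evaluation as the prediction problem.

For the Bellman equation for $V^\pi$, the existence and uniqueness of $V^\pi$ are guaranteed as long as either $\gamma < 1$ or eventual termination is guaranteed from all states under the policy $\pi$.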

Iterative policy evaluation: starting from an arbitrary initial approximation $V_0$, each successive approximation is obtained by using the Bellman equation for $V^\pi$ as an update rule:
$\begin{aligned} V_{k+1}(s) &= E_\pi\{r_{t+1} + \gamma V_k(s_{t+1})|s_t=s\}\\ &= \sum_a\pi(s,a)\sum_{s'}P_{ss'}^a[R_{ss'}^a + \gamma V_k(s')] \end{aligned}$
The sequence $\{V_k\}$ can be shown in general to converge to $V^\pi$ as $k \rightarrow \infty$ under the same conditions that guarantee the existence of $V^\pi$.

A full backup is an operation that replaces the old value of $s$ with a new value obtained from the old values of the successor states of $s$ and the expected immediate rewards, along all the one-step transitions possible under the policy being evaluated.

It is based on all possible successor states rather than on a sample of them.
All backups in DP are full backups.

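How to implement policy evaluation in a program:
Method 1: use two arrays.
One array holds the old values $V_k(s)$ and the other holds the new values $V_{k+1}(s)$. The new-value array is computed entirely from the old-value array, which does not change during the sweep.
Method 2: use one array.
Each state's estimate is updated in place, with the new value immediately overwriting the old one.
Comparison: both methods converge to $V^\pi$. Method 2 usually converges faster, because it uses new values as soon as they are available.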

Sweep: one pass through the state space in which every state is backed up once.

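Termination conditions for the program:
① stop when the newly updated state values differ only slightly from the previous ones;
② cap the number of iterations.
The pseudocode is as follows:

[Figure: Iterative policy evaluation]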
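
The procedure described above can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the book's pseudocode: the toy MDP `P`, the dict-based policy `pi[s][a]`, and the names `policy_evaluation`, `theta`, `max_sweeps` and `in_place` are all introduced here for illustration. The `in_place` flag switches between the one-array and two-array variants, and both termination conditions are present.

```python
# Hypothetical toy MDP: P[s][a] lists (probability, next_state, reward) triples.
# State 2 is terminal and absorbing.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 2, 1.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},
}


def policy_evaluation(P, pi, gamma=0.9, theta=1e-8, max_sweeps=10_000, in_place=True):
    """Estimate V^pi for a stochastic policy pi[s][a] by repeated full backups."""
    V = {s: 0.0 for s in P}
    for _ in range(max_sweeps):                 # termination condition 2: sweep limit
        # One-array variant reads from V itself; two-array variant reads a frozen copy.
        source = V if in_place else dict(V)
        delta = 0.0
        for s in P:
            v_new = sum(
                pi[s][a] * sum(p * (r + gamma * source[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:                       # termination condition 1: small change
            break
    return V


# Equiprobable random policy over the two actions.
uniform = {s: {a: 0.5 for a in P[s]} for s in P}
V_in_place = policy_evaluation(P, uniform, in_place=True)
V_two_array = policy_evaluation(P, uniform, in_place=False)
```

Both variants should land on the same values up to the tolerance `theta`; only the number of sweeps they need differs.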

4.2 Policy Improvement

Problem statement
Given an existing policy, how do we find a better one?

Consider selecting $a$ in $s$ and thereafter following the existing policy $\pi$. The value of this way of behaving is
$\begin{aligned} Q^\pi(s,a) &= E_\pi\{ r_{t+1} + \gamma V^\pi(s_{t+1}) | s_t=s, a_t=a\} \\ &= \sum_{s'}P_{ss'}^a[R_{ss'}^a + \gamma V^\pi(s')] \end{aligned}$

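The key criterion is how this value compares with $V^\pi(s)$. If it is greater than $V^\pi(s)$, then selecting $a$ every time $s$ is encountered would be expected to give a policy that is better than the old one.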

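Policy improvement theorem
Let $\pi$ and $\pi'$ be any pair of deterministic policies such that, for all states $s$,
$Q^\pi(s,\pi'(s)) \ge V^\pi(s)$
Then $\pi'$ must be as good as, or better than, $\pi$; that is, for all states $s$,
$V^{\pi'}(s) \ge V^\pi(s)$ .

If the first inequality is strict at any state, then the second inequality is strict at at least one state.

The second inequality follows from the first by repeatedly expanding $Q^\pi$ and reapplying the first inequality: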

$\begin{aligned} V^\pi(s) &\le Q^\pi(s,\pi'(s))\\ &= E_{\pi'}\{ r_{t+1} + \gamma V^\pi(s_{t+1}) |s_t=s \}\\ &\le E_{\pi'}\{ r_{t+1} + \gamma Q^\pi(s_{t+1}, \pi'(s_{t+1})) | s_t=s\}\\ &= E_{\pi'}\{ r_{t+1} + \gamma r_{t+2} + \gamma^2 V^\pi(s_{t+2}) | s_t=s\}\\ &\le E_{\pi'}\{ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 V^\pi(s_{t+3}) | s_t=s\}\\ &\ \ \vdots\\ &\le E_{\pi'}\{r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots | s_t=s\}\\ &= V^{\pi'}(s) \end{aligned}$

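Given a policy and its value function, we can easily evaluate a change in the policy at a single state to a particular action. This extends naturally to changes at all states and to all possible actions.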

policy improvement: The process of making a new policy that improves on an original policy, by making it greedy or nearly greedy with respect to the value function of the original policy.

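Note: when several actions in a state tie for the maximum value, the new policy need not pick just one of them. Each maximizing action can be given a portion of the probability of being selected in the new greedy policy, as long as all submaximal actions are given zero probability.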
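
As a rough illustration of the improvement step, the sketch below computes $Q^\pi(s,a)$ by a one-step lookahead and builds a greedy policy that splits probability evenly among tied actions. The toy MDP `P`, the helper names `q_values` and `greedy_policy`, the tie tolerance `tol`, and the even split are assumptions made for this example; any other apportioning among the maximizing actions would serve equally well.

```python
# Hypothetical toy MDP: P[s][a] lists (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 2, 1.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},
}


def q_values(P, V, s, gamma=0.9):
    """One-step lookahead: Q(s, a) = sum_s' P [R + gamma * V(s')] for every action a."""
    return {a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in P[s]}


def greedy_policy(P, V, gamma=0.9, tol=1e-9):
    """Greedy policy w.r.t. V; actions tied for the maximum share the probability."""
    pi = {}
    for s in P:
        q = q_values(P, V, s, gamma)
        best = max(q.values())
        winners = [a for a in q if q[a] >= best - tol]
        pi[s] = {a: (1.0 / len(winners) if a in winners else 0.0) for a in q}
    return pi


# Values of the equiprobable random policy on this toy MDP (rounded).
V_random = {0: 0.647482, 1: 0.791367, 2: 0.0}
improved = greedy_policy(P, V_random)
```

On this toy problem the greedy policy picks action 1 in both non-terminal states, while the two actions in the terminal state tie and each receive probability 0.5.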

4.3 Policy Iteration

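Problem statement
Policy iteration combines policy evaluation and policy improvement. How is it carried out?

$\pi_0 \xrightarrow{E} V^{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} V^{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \cdots \xrightarrow{I} \pi^* \xrightarrow{E} V^*$
In this sequence, E denotes a policy evaluation and I denotes a policy improvement.
Because a finite MDP has only a finite number of policies, this process must converge to an optimal policy in a finite number of iterations.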

[Figure: pseudocode for policy iteration]
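
One way the evaluate-then-improve loop could look in Python is sketched below. It is a minimal sketch, assuming deterministic policies stored as `policy[s]`, a toy MDP `P`, and the hypothetical helper names `evaluate` and `policy_iteration`; it is not a transcription of the book's pseudocode.

```python
# Hypothetical toy MDP: P[s][a] lists (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 2, 1.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},
}


def evaluate(P, policy, gamma, theta=1e-10):
    """In-place iterative policy evaluation for a deterministic policy (E step)."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V


def policy_iteration(P, gamma=0.9):
    """Alternate E and I steps until the policy no longer changes."""
    policy = {s: next(iter(P[s])) for s in P}   # arbitrary initial policy
    while True:
        V = evaluate(P, policy, gamma)
        stable = True
        for s in P:                             # I step: greedy improvement
            q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in P[s]}
            best = max(q, key=q.get)
            # Switch only on a strict improvement so ties cannot cause cycling.
            if q[best] > q[policy[s]] + 1e-12:
                policy[s] = best
                stable = False
        if stable:
            return policy, V


optimal_policy, optimal_V = policy_iteration(P)
```

Changing the action only on strict improvement is a design choice of this sketch: it keeps the loop from flipping forever between equally good actions.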

4.4 Value Iteration

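Problem statement
An optimization of policy iteration that simplifies the policy evaluation step.

Drawback of policy iteration:
every iteration contains a full policy evaluation, which may itself be a long iterative computation requiring multiple sweeps through the state set.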

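Characteristics of value iteration:
it combines policy improvement and truncated policy evaluation (one backup of each state per sweep) in a single update.
$\begin{aligned} V_{k+1}(s) &= \max_a E\{ r_{t+1} + \gamma V_k(s_{t+1}) |s_t=s,a_t=a\}\\ &= \max_a \sum_{s'} P_{ss'}^a [R_{ss'}^a + \gamma V_k(s')] \end{aligned}$
The backup is the same as the policy evaluation backup except that it takes the maximum over all possible actions instead of the policy-weighted average over actions; the expectation over successor states is unchanged.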

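Termination conditions for the program:
① when every change in $V(s)$ within one sweep is smaller than a user-chosen threshold $\theta$, the greedy policy can be regarded as approximately optimal;
② cap the number of iterations.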

[Figure: pseudocode for value iteration]
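
The update rule and both stopping conditions can be sketched in Python as below. The toy MDP `P` and the names `value_iteration`, `theta` and `max_sweeps` are assumptions for illustration; the greedy policy is read off from the final values at the end.

```python
# Hypothetical toy MDP: P[s][a] lists (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 2, 1.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},
}


def value_iteration(P, gamma=0.9, theta=1e-8, max_sweeps=10_000):
    """In-place value iteration followed by extraction of a greedy policy."""
    V = {s: 0.0 for s in P}
    for _ in range(max_sweeps):                 # condition 2: iteration limit
        delta = 0.0
        for s in P:
            # Max over actions of the expected one-step return.
            v_new = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in P[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:                       # condition 1: all changes below theta
            break
    policy = {
        s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
        for s in P
    }
    return V, policy


V_star, pi_star = value_iteration(P)
```

On this toy problem the values settle after a few sweeps, since no policy evaluation loop has to finish before each improvement.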

4.5 Asynchronous Dynamic Programming

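Problem statement
Asynchronous dynamic programming.

Synchronous DP updates the values of all states in each sweep. Its drawback is that every state must be visited; when the state space is huge, even a single sweep can be too expensive for the available computing power.
Asynchronous DP avoids long, unproductive sweeps: states are backed up in place, in any order, and states that are irrelevant to optimal behavior can be backed up rarely or skipped.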
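
A minimal sketch of the asynchronous idea is given below: instead of sweeping, it backs up one randomly chosen state at a time, in place. The uniform random selection rule, the toy MDP `P`, and the names `async_value_iteration`, `n_backups` and `seed` are assumptions for this example; a practical implementation would more likely choose states that matter for the agent's behavior.

```python
import random

# Hypothetical toy MDP: P[s][a] lists (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 2, 1.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},
}


def async_value_iteration(P, gamma=0.9, n_backups=2000, seed=0):
    """Value iteration without sweeps: back up one state at a time, in any order."""
    rng = random.Random(seed)
    states = list(P)
    V = {s: 0.0 for s in states}
    for _ in range(n_backups):
        # Any selection rule can be used here, provided every state that matters
        # keeps being selected; uniform sampling is just the simplest choice.
        s = rng.choice(states)
        V[s] = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in P[s])
    return V


V_async = async_value_iteration(P)
```

The backup itself is the value iteration backup; only the order and frequency with which states are visited change.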

4.6 Generalized Policy Iteration

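Problem statement
How can the interaction between policy evaluation and policy improvement be described in general terms?

Policy iteration, value iteration and asynchronous DP all contain both policy evaluation and policy improvement; they differ only in how the two are interleaved.

Generalized policy iteration (GPI): the general idea of interacting policy evaluation and policy improvement processes, independent of the granularity and other details of the two processes.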
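
The role of granularity can be illustrated with a sketch in which the number of evaluation sweeps between improvements is a parameter. With one sweep it behaves much like value iteration, and with many sweeps it approaches policy iteration. The toy MDP `P`, the function name `generalized_policy_iteration`, and the parameters `eval_sweeps` and `rounds` are hypothetical, and the fixed number of rounds is a simplification that is adequate for this small problem rather than a general stopping rule.

```python
# Hypothetical toy MDP: P[s][a] lists (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 2, 1.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},
}


def generalized_policy_iteration(P, gamma=0.9, eval_sweeps=1, rounds=200):
    """Interleave truncated evaluation (eval_sweeps sweeps) with greedy improvement."""
    V = {s: 0.0 for s in P}
    policy = {s: next(iter(P[s])) for s in P}   # arbitrary initial policy

    def backup(s, a):
        # Expected one-step return of taking a in s under the current V.
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

    for _ in range(rounds):
        for _ in range(eval_sweeps):            # evaluation, possibly truncated
            for s in P:
                V[s] = backup(s, policy[s])
        for s in P:                             # improvement: greedy w.r.t. V
            policy[s] = max(P[s], key=lambda a: backup(s, a))
    return policy, V


coarse_policy, coarse_V = generalized_policy_iteration(P, eval_sweeps=50)
fine_policy, fine_V = generalized_policy_iteration(P, eval_sweeps=1)
```

On this toy problem both settings end at the same policy and values, which is the point of GPI: the interaction matters, not the granularity at which it happens.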

Figure 4.7. GPI: value and policy functions interact until they are optimal and thus consistent with each other.

4.7 Efficiency of Dynamic Programming

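Problem statement
Efficiency analysis of dynamic programming.

DP is not well suited to problems with very large state spaces, but compared with other methods for solving MDPs, DP methods are actually quite efficient. The worst-case time DP needs to find an optimal policy is polynomial in the number of states and actions. DP does run into the curse of dimensionality, but that is an inherent difficulty of the problem, not of DP as a solution method.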

4.8 Summary

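Intuitively, once you understand the two pieces of pseudocode for policy iteration and value iteration, you have the essence of this chapter.
How the two processes, policy evaluation and policy improvement, interact.
GPI as the generalization of that interaction.
DP comes in synchronous and asynchronous forms. I have not yet worked with a concrete asynchronous DP algorithm; later chapters may cover one.
Efficiency analysis of DP methods.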