Reinforcement Learning Outline, Summary 2: Markov Decision Processes and Dynamic Programming

This post reviews the basic reinforcement learning framework (policy, value function, and model) and then works through Markov chains, Markov reward processes, and Markov decision processes (MDPs). Using examples and derivations, it defines the horizon, return, and value function, and discusses policy iteration, value iteration, and how they apply to the prediction and control problems. Dynamic programming is the main tool, with Monte Carlo methods and temporal-difference learning also touched on.


Column:

强化学习日积月累 (www.zhihu.com)

The previous post covered the three components of a reinforcement learning agent:

  • Policy: the agent's behavior function; it produces the action
  • Value function: how good each state or action is; it quantifies the value of states and actions
  • Model: the agent's representation of the environment; it determines the next state

[Figure: the agent-environment interaction loop]

This figure shows the interaction between the agent and the environment. That interaction loop is exactly what a Markov Decision Process (MDP) describes, which is why the MDP is the basic framework of reinforcement learning.

MDPs assume the environment is fully observable.

Before going into MDPs in detail, let's first introduce Markov chains and Markov reward processes.

Markov Chains


For more background on Markov chains, see:

科技猛兽: 你一定从未看过如此通俗易懂的马尔科夫链蒙特卡罗方法(MCMC)解读(上) ("An unusually accessible explanation of Markov Chain Monte Carlo (MCMC), Part 1"), zhuanlan.zhihu.com

First, the Markov property: a sequence of state transitions is Markovian if the next state depends only on the current state and is independent of all earlier states.

Suppose we have a history of states:

$$h_{t}=\left\{s_{1}, s_{2}, s_{3}, \ldots, s_{t}\right\}$$

A state $s_t$ is Markovian if and only if

$$\begin{aligned} p\left(s_{t+1} \mid s_{t}\right) &=p\left(s_{t+1} \mid h_{t}\right) \\ p\left(s_{t+1} \mid s_{t}, a_{t}\right) &=p\left(s_{t+1} \mid h_{t}, a_{t}\right) \end{aligned}$$

that is, the next state depends only on the current state.

Example

[Figure: an example Markov chain over states $s_1, \ldots, s_7$ with its transition probabilities]

In the example above, the figure gives the probability of moving from each state to every other state (or of staying put), so we can describe the chain with a state transition matrix:

$$P=\left[\begin{array}{cccc} P\left(s_{1} \mid s_{1}\right) & P\left(s_{2} \mid s_{1}\right) & \ldots & P\left(s_{N} \mid s_{1}\right) \\ P\left(s_{1} \mid s_{2}\right) & P\left(s_{2} \mid s_{2}\right) & \ldots & P\left(s_{N} \mid s_{2}\right) \\ \vdots & \vdots & \ddots & \vdots \\ P\left(s_{1} \mid s_{N}\right) & P\left(s_{2} \mid s_{N}\right) & \ldots & P\left(s_{N} \mid s_{N}\right) \end{array}\right]$$
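
As a small aside (not part of the original lecture notes), a transition matrix like this is easy to store and sample from; the three-state chain and its probabilities below are made up purely for illustration.

import numpy as np

# Hypothetical 3-state Markov chain; each row of P sums to 1.
P = np.array([
    [0.7, 0.2, 0.1],   # transitions out of s1
    [0.3, 0.4, 0.3],   # transitions out of s2
    [0.0, 0.5, 0.5],   # transitions out of s3
])

rng = np.random.default_rng(0)

def sample_chain(P, s0, steps):
    """Sample a trajectory; the next state depends only on the current one."""
    states = [s0]
    for _ in range(steps):
        states.append(rng.choice(len(P), p=P[states[-1]]))
    return states

print(sample_chain(P, s0=0, steps=10))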

Markov Reward Process (MRP)

A Markov reward process is simply a Markov chain plus a reward function.

[Figure: definition of a Markov reward process]

Put plainly: each step taken earns a reward or a penalty.

[Figure: the example chain with rewards attached to its states]

Returning to the example: suppose reaching $s_1$ gives a reward of +5 and reaching $s_7$ gives +10, while every other state gives 0. The reward function can then be written as

$$R=[5, 0, 0, 0, 0, 0, 10]$$

Let's define a few concepts:

Horizon

The maximum number of steps in an episode; in other words, how many steps a single run of the task takes before it terminates.

Return

The cumulative discounted reward:

$$G_{t}=R_{t+1}+\gamma R_{t+2}+\gamma^{2} R_{t+3}+\gamma^{3} R_{t+4}+\ldots+\gamma^{T-t-1} R_{T}$$

Here $\gamma \in[0,1)$ (some formulations allow $\gamma = 1$). Overall, discounting places more weight on near-term rewards than on distant ones.

Value Function: the formal definition of a state's value (the present value of future rewards)

$V_t(s)$ is the expectation of $G_t$: it measures how much return we can expect to collect starting from the current state.

$$\begin{aligned} V_{t}(s) &=\mathbb{E}\left[G_{t} \mid s_{t}=s\right] \\ &=\mathbb{E}\left[R_{t+1}+\gamma R_{t+2}+\gamma^{2} R_{t+3}+\ldots+\gamma^{T-t-1} R_{T} \mid s_{t}=s\right] \end{aligned}$$

Back to the example above, let's compute a few returns (the arithmetic below uses $\gamma = \frac{1}{2}$):

For the trajectory $s_{4}, s_{5}, s_{6}, s_{7}$:  $0+\frac{1}{2} \times 0+\frac{1}{4} \times 0+\frac{1}{8} \times 10=1.25$

For the trajectory $s_{4}, s_{3}, s_{2}, s_{1}$:  $0+\frac{1}{2} \times 0+\frac{1}{4} \times 0+\frac{1}{8} \times 5=0.625$

For the trajectory $s_{4}, s_{5}, s_{6}, s_{6}$:  the return is $0$.
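
These three numbers are easy to check in code. A minimal sketch, assuming $\gamma = 1/2$ and the same reward layout as the example (+5 at $s_1$, +10 at $s_7$, 0 elsewhere):

def discounted_return(rewards, gamma):
    """G = r_0 + gamma*r_1 + gamma^2*r_2 + ... for the listed rewards."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

gamma = 0.5
# Rewards collected along each trajectory, starting with the (zero) reward at s4.
print(discounted_return([0, 0, 0, 10], gamma))  # s4 -> s5 -> s6 -> s7 : 1.25
print(discounted_return([0, 0, 0, 5], gamma))   # s4 -> s3 -> s2 -> s1 : 0.625
print(discounted_return([0, 0, 0, 0], gamma))   # s4 -> s5 -> s6 -> s6 : 0.0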

So how do we compute $V_t(s)$? One way is to start from $s$, roll the chain forward many times, and average the returns; that is the Monte Carlo method, which we will come back to later.

Here we instead derive it analytically, using the Bellman equation:

$$V(s)=\underbrace{R(s)}_{\text{immediate reward}}+\underbrace{\gamma \sum_{s^{\prime} \in S} P\left(s^{\prime} \mid s\right) V\left(s^{\prime}\right)}_{\text{discounted sum of future rewards}}$$

The derivation uses the following intermediate step:

$$V(s)=\mathbb{E}\left[R_{t+1}+\gamma \mathbb{E}\left[R_{t+2}+\gamma R_{t+3}+\gamma^{2} R_{t+4}+\ldots\right] \mid s_{t}=s\right]$$

In matrix form:

$$\left[\begin{array}{c} V\left(s_{1}\right) \\ V\left(s_{2}\right) \\ \vdots \\ V\left(s_{N}\right) \end{array}\right]=\left[\begin{array}{c} R\left(s_{1}\right) \\ R\left(s_{2}\right) \\ \vdots \\ R\left(s_{N}\right) \end{array}\right]+\gamma\left[\begin{array}{cccc} P\left(s_{1} \mid s_{1}\right) & P\left(s_{2} \mid s_{1}\right) & \ldots & P\left(s_{N} \mid s_{1}\right) \\ P\left(s_{1} \mid s_{2}\right) & P\left(s_{2} \mid s_{2}\right) & \ldots & P\left(s_{N} \mid s_{2}\right) \\ \vdots & \vdots & \ddots & \vdots \\ P\left(s_{1} \mid s_{N}\right) & P\left(s_{2} \mid s_{N}\right) & \ldots & P\left(s_{N} \mid s_{N}\right) \end{array}\right]\left[\begin{array}{c} V\left(s_{1}\right) \\ V\left(s_{2}\right) \\ \vdots \\ V\left(s_{N}\right) \end{array}\right]$$

which gives:

$$V=R+\gamma P V \quad \Longrightarrow \quad V=(I-\gamma P)^{-1} R$$

so $V$ can be obtained directly by solving a linear system, but the $O(N^3)$ cost of that solve is prohibitive for large state spaces.
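
For a small chain the closed form can be evaluated directly; a sketch with a hypothetical 3-state MRP (the cubic cost of this solve is exactly why it does not scale):

import numpy as np

gamma = 0.5
P = np.array([[0.7, 0.2, 0.1],     # hypothetical transition matrix
              [0.3, 0.4, 0.3],
              [0.0, 0.5, 0.5]])
R = np.array([5.0, 0.0, 10.0])     # hypothetical reward per state

# V = (I - gamma * P)^{-1} R, solved as a linear system
V = np.linalg.solve(np.eye(len(P)) - gamma * P, R)
print(V)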

We therefore use iterative methods instead:

  1. Dynamic programming
  2. Monte Carlo methods
  3. Temporal-difference learning

Let's first look at the Monte Carlo method:

[Figure: Monte Carlo evaluation of an MRP]

The core idea is to roll the process forward, generate a large number of trajectories, and average their returns; in effect, this estimates the expectation by sampling.
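
A minimal Monte Carlo sketch for an MRP: sample many trajectories from a state, compute each discounted return, and average them. The chain, rewards, and horizon below are hypothetical; with enough episodes the estimate approaches the value given by the Bellman equation.

import numpy as np

rng = np.random.default_rng(0)
gamma = 0.5
P = np.array([[0.7, 0.2, 0.1],     # hypothetical MRP, as in the earlier sketch
              [0.3, 0.4, 0.3],
              [0.0, 0.5, 0.5]])
R = np.array([5.0, 0.0, 10.0])

def mc_value(s, episodes=5000, horizon=50):
    """Estimate V(s) by averaging sampled discounted returns."""
    total = 0.0
    for _ in range(episodes):
        state, g, discount = s, 0.0, 1.0
        for _ in range(horizon):
            g += discount * R[state]       # reward R(s) at the current state
            discount *= gamma
            state = rng.choice(len(P), p=P[state])
        total += g
    return total / episodes

print(mc_value(0))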

Iterative method (dynamic programming)

[Figure: iterative algorithm for computing the value of an MRP]

This is essentially the fixed-point iteration familiar from numerical analysis: we turn the Bellman equation into a Bellman update and keep applying it until $V$ stops changing, at which point it has converged to the unique solution. This presumes that the Markov chain has a stationary distribution, which holds in most reinforcement learning settings.
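
A sketch of that Bellman update as a fixed-point iteration (again with made-up P and R; the result should agree with the direct linear solve shown earlier):

import numpy as np

def mrp_value_dp(P, R, gamma, tol=1e-8):
    """Sweep V <- R + gamma * P @ V until the largest change falls below tol."""
    V = np.zeros(len(R))
    while True:
        V_new = R + gamma * P @ V
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

P = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.4, 0.3],
              [0.0, 0.5, 0.5]])
R = np.array([5.0, 0.0, 10.0])
print(mrp_value_dp(P, R, gamma=0.5))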

Markov Decision Process (MDP)

Now for the main event: the Markov decision process. Compared with an MRP, an MDP adds decisions, i.e., actions, giving the five-tuple $(S, A, P, R, \gamma)$, defined as follows:

[Figure: formal definition of the MDP tuple $(S, A, P, R, \gamma)$]

In an MDP, a policy specifies which action to take in a given state $s_t$:

$$\pi(a \mid s)=P\left(a_{t}=a \mid s_{t}=s\right)$$

Given a policy, an MDP reduces to an MRP: the policy tells us the probability of taking action $a$ in state $s$, so the action can be marginalized out of the transition and reward functions:

$$P^{\pi}\left(s^{\prime} \mid s\right)=\sum_{a \in A} \pi(a \mid s) P\left(s^{\prime} \mid s, a\right)$$

$$R^{\pi}(s)=\sum_{a \in A} \pi(a \mid s) R(s, a)$$
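
In tabular form this reduction is just two weighted sums. A sketch with hypothetical array shapes: P[a, s, s'] for $P(s'\mid s,a)$, R[s, a] for $R(s,a)$, and pi[s, a] for $\pi(a\mid s)$.

import numpy as np

nS, nA = 3, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(nS), size=(nA, nS))   # P[a, s, :] is a distribution over s'
R = rng.normal(size=(nS, nA))                   # arbitrary rewards for illustration
pi = np.full((nS, nA), 1.0 / nA)                # uniform random policy

# P^pi(s'|s) = sum_a pi(a|s) P(s'|s,a)   and   R^pi(s) = sum_a pi(a|s) R(s,a)
P_pi = np.einsum('sa,asn->sn', pi, P)
R_pi = (pi * R).sum(axis=1)
print(P_pi.sum(axis=1))   # each row of the induced MRP still sums to 1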


Here is a comparison of MRPs and MDPs:

[Figure: comparison of MRP and MDP transition structure]

An MRP is just the environment evolving on its own; an MDP inserts a decision step in the middle, and that decision process is exactly what reinforcement learning has to learn.

Value function for MDP

The state-value function of an MDP is defined just as for an MRP, except that the expectation is taken under the policy $\pi$:

$$v^{\pi}(s)=\mathbb{E}_{\pi}\left[G_{t} \mid s_{t}=s\right]$$

We also introduce the action-value function: the expected return obtained from a state when a particular action is taken:

$$q^{\pi}(s, a)=\mathbb{E}_{\pi}\left[G_{t} \mid s_{t}=s, A_{t}=a\right]$$

The two are clearly related by

$$v^{\pi}(s)=\sum_{a \in A} \pi(a \mid s) q^{\pi}(s, a)$$

From the Bellman equation we get:

$$v^{\pi}(s)=\mathbb{E}_{\pi}\left[R_{t+1}+\gamma v^{\pi}\left(s_{t+1}\right) \mid s_{t}=s\right]$$

$$q^{\pi}(s, a)=\mathbb{E}_{\pi}\left[R_{t+1}+\gamma q^{\pi}\left(s_{t+1}, A_{t+1}\right) \mid s_{t}=s, A_{t}=a\right]$$

These link the value of the current state to that of its successor states.

We also have the relations between the state-value and action-value functions:

$$v^{\pi}(s)=\sum_{a \in A} \pi(a \mid s) q^{\pi}(s, a)$$

$$q^{\pi}(s, a)=R(s, a)+\gamma \sum_{s^{\prime} \in S} P\left(s^{\prime} \mid s, a\right) v^{\pi}\left(s^{\prime}\right)$$

Substituting one into the other removes the expectation and relates the current state directly to its successors:

$$v^{\pi}(s)=\sum_{a \in A} \pi(a \mid s)\left(R(s, a)+\gamma \sum_{s^{\prime} \in S} P\left(s^{\prime} \mid s, a\right) v^{\pi}\left(s^{\prime}\right)\right)$$

$$q^{\pi}(s, a)=R(s, a)+\gamma \sum_{s^{\prime} \in S} P\left(s^{\prime} \mid s, a\right) \sum_{a^{\prime} \in A} \pi\left(a^{\prime} \mid s^{\prime}\right) q^{\pi}\left(s^{\prime}, a^{\prime}\right)$$
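
These relations are easy to verify numerically on a small random MDP: solve the induced MRP exactly for $v^{\pi}$, recover $q^{\pi}$ from it, and check that $v^{\pi}(s)=\sum_a \pi(a\mid s)\, q^{\pi}(s,a)$. The shapes and numbers below are hypothetical, following the layout of the previous sketch.

import numpy as np

nS, nA, gamma = 3, 2, 0.9
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(nS), size=(nA, nS))   # P[a, s, s'] = P(s'|s,a)
R = rng.normal(size=(nS, nA))                   # R[s, a]
pi = rng.dirichlet(np.ones(nA), size=nS)        # pi[s, a] = pi(a|s)

# Exact v^pi via the induced MRP:  v = (I - gamma P^pi)^{-1} R^pi
P_pi = np.einsum('sa,asn->sn', pi, P)
R_pi = (pi * R).sum(axis=1)
v = np.linalg.solve(np.eye(nS) - gamma * P_pi, R_pi)

# q^pi(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) v^pi(s')
q = R + gamma * np.einsum('asn,n->sa', P, v)

print(np.allclose(v, (pi * q).sum(axis=1)))     # True: v^pi(s) = sum_a pi(a|s) q^pi(s,a)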

The following two backup diagrams make these equations easier to picture:

[Figure: backup diagram for $v^{\pi}(s)$]

There are two levels of summation: the inner sum backs up the leaf nodes to the black action nodes $a$ (i.e., the $q$ values), and the outer sum backs those up to the root node.

The action-value function works the same way:

[Figure: backup diagram for $q^{\pi}(s,a)$]

Prediction and control are the core problems of an MDP.

1 Prediction

Given an MDP $\langle\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma\rangle$ and a policy $\pi$ (or, equivalently, the induced MRP $\langle\mathcal{S}, \mathcal{P}^{\pi}, \mathcal{R}^{\pi}, \gamma\rangle$), compute its value function $v^{\pi}$.

2 Control

Find the best policy. The input is the MDP $\langle\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma\rangle$; the output is the optimal value function $v^{*}$ and the optimal policy $\pi^{*}$.

Both problems can be solved with dynamic programming.

Dynamic programming first solves subproblems optimally and then combines them, step by step, into an optimal solution of the original problem. The Bellman equations above show that an MDP decomposes into exactly such subproblems.

Policy Iteration

1 Policy evaluation on MDP

Given a policy $\pi$, compute the value of every state under that policy.

The Bellman expectation equation can be written as:

$$\begin{aligned} v^{\pi}(s) &=\mathbb{E}_{\pi}\left[R_{t+1}+\gamma v^{\pi}\left(s_{t+1}\right) \mid s_{t}=s\right] \\ &=\sum_{a}\pi(a \mid s)\sum_{s^{\prime}, r} p\left(s^{\prime}, r \mid s, a\right)\left[r+\gamma v^{\pi}\left(s^{\prime}\right)\right] \end{aligned}$$

Turning the Bellman equation into an update rule for $v_{\pi}(s)$, we sweep the state values with the following update until they converge:

$$\begin{aligned} v_{k+1}(s) &=\mathbb{E}_{\pi}\left[R_{t+1}+\gamma v_{k}\left(s_{t+1}\right) \mid s_{t}=s\right] \\ &=\sum_{a}\pi(a \mid s)\sum_{s^{\prime}, r} p\left(s^{\prime}, r \mid s, a\right)\left[r+\gamma v_{k}\left(s^{\prime}\right)\right] \end{aligned}$$

Here is the classic Grid World example. We have a 4x4 grid of 16 cells; only the top-left and bottom-right cells are terminal. Their value is fixed at 0, and once the agent reaches either of them it stops moving and receives reward 0 from then on. Every move from any other cell yields an immediate reward of $-1$. The agent moves one cell at a time and can only go up, down, left, or right (no diagonals); if a move would take it off the grid, it stays where it is. The discount factor is $\gamma = 1$. Because each move leads to a fixed next cell, every feasible transition has probability $P = 1$. The policy we evaluate is the uniform random policy: from each cell the agent moves to each of its four neighbours with probability 25%.

[Figure: the 4x4 grid world with terminal cells in the top-left and bottom-right corners]

[Figure: grid-world state values after k = 0, 1, 2, 3, ... evaluation sweeps]

We initialize the value of every cell to 0, as shown in the k = 0 panel above, and start the iterative evaluation. Since the terminal cells have their value fixed at 0, we can leave them out of the updates. At k = 1, using the Bellman update above, the value of the first cell in the second row is:

$$v_1^{(21)} = \frac{1}{4}\left[(-1+0)+(-1+0)+(-1+0)+(-1+0)\right] = -1$$

The value of the second cell in the second row is:

$$v_1^{(22)} = \frac{1}{4}\left[(-1+0)+(-1+0)+(-1+0)+(-1+0)\right] = -1$$

The other cells are analogous, and the values after the first sweep are shown in the k = 1 panel above. With the first sweep done, we start the second round of dynamic programming. Again look at the first cell of the second row:

$$v_2^{(21)} = \frac{1}{4}\left[(-1+0)+(-1-1)+(-1-1)+(-1-1)\right] = -1.75$$

And the second cell of the second row:

$$v_2^{(22)} = \frac{1}{4}\left[(-1-1)+(-1-1)+(-1-1)+(-1-1)\right] = -2$$

The results are shown in the k = 2 panel above. The third sweep (using the values rounded to one decimal, e.g. $-1.7$ for $-1.75$) gives:

$$v_3^{(21)} = \frac{1}{4}\left[(-1-1.7)+(-1-2)+(-1-2)+(-1+0)\right] = -2.425$$

$$v_3^{(22)} = \frac{1}{4}\left[(-1-1.7)+(-1-1.7)+(-1-2)+(-1-2)\right] = -2.85$$

That gives the k = 3 panel above. We keep iterating like this until the value of every cell changes only very slightly, at which point we have the state values of all cells under the random policy.

As you can see, the computation in dynamic programming policy evaluation is not complicated, but for a problem with a very large model the total amount of computation is still substantial.
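
A short sketch of this evaluation, written directly against the 4x4 grid world described above (random policy, $\gamma = 1$, reward $-1$ per move, two terminal corners); after a few sweeps the printed values reproduce the $-1$, $-1.75$, $-2.4\ldots$ pattern computed by hand, and eventually they converge.

import numpy as np

N = 4                                           # 4x4 grid
terminals = {(0, 0), (N - 1, N - 1)}            # top-left and bottom-right cells
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # up, down, left, right

def step(s, a):
    """Deterministic move; walking into a wall leaves the state unchanged."""
    ni, nj = s[0] + a[0], s[1] + a[1]
    return (ni, nj) if 0 <= ni < N and 0 <= nj < N else s

def policy_evaluation(gamma=1.0, tol=1e-4):
    """Evaluate the uniform random policy with synchronous Bellman sweeps."""
    V = np.zeros((N, N))
    while True:
        V_new = np.zeros((N, N))
        for i in range(N):
            for j in range(N):
                if (i, j) in terminals:
                    continue                    # terminal values stay at 0
                V_new[i, j] = sum(0.25 * (-1 + gamma * V[step((i, j), a)])
                                  for a in actions)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

print(np.round(policy_evaluation(), 1))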

The figure below visualizes the successive iterations for another example:

Initially all state values are 0; after the first few iterations some values become nonzero, and as the iterations continue more and more states are updated, until by the sixth panel everything has settled and all state values are stable.

[Figure: value estimates over six successive iterations, from all zeros to convergence]

2 Policy Improvement on MDP

The first step, policy evaluation, gave us the value of every state by iterating the Bellman equation to convergence. In this step we use those values to find a better policy. Concretely:

First, from the value function we compute the action values:

$$q^{\pi_{i}}(s, a)=R(s, a)+\gamma \sum_{s^{\prime} \in S} P\left(s^{\prime} \mid s, a\right) v^{\pi_{i}}\left(s^{\prime}\right)$$

The new policy then picks, in every state, the action that maximizes the $q$ value:

$$\pi_{i+1}(s)=\arg \max _{a} q^{\pi_{i}}(s, a)$$

After re-selecting actions with this argmax, the value function of the new policy $\pi_{i+1}$ is greater than or equal to that of the old one.
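
The improvement step itself is only a $q$ computation followed by an argmax. A tabular sketch, reusing the hypothetical P[a, s, s'] and R[s, a] layout from the earlier snippets:

import numpy as np

def policy_improvement(P, R, v, gamma):
    """Greedy improvement: pi_{i+1}(s) = argmax_a [ R(s,a) + gamma * sum_s' P(s'|s,a) v(s') ]."""
    q = R + gamma * np.einsum('asn,n->sa', P, v)   # q^{pi_i}(s, a) from v^{pi_i}
    return q.argmax(axis=1)                        # one greedy action per state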

In choosing the next policy $\pi_{i+1}$, all we require is that

$$q^{\pi_{i}}\left(s, \pi_{i+1}(s)\right) \geq v^{\pi_{i}}(s)$$


The benefit of this choice is that it guarantees

$$v_{\pi_{i+1}}(s) \geq v_{\pi_{i}}(s)$$

In other words: if the policy is chosen greedily at every step, each new policy is at least as good as the previous one, $v_{\pi_{i+1}}(s) \geq v_{\pi_{i}}(s)$ for every state.

Proof (writing $\pi$ for $\pi_i$ and $\pi^{\prime}$ for $\pi_{i+1}$):

$$\begin{aligned} v^{\pi}(s) & \leq q^{\pi}\left(s, \pi^{\prime}(s)\right)\\ &=\mathbb{E}_{\pi^{\prime}}\left[R_{t+1}+\gamma v^{\pi}\left(S_{t+1}\right) \mid S_{t}=s\right] \\ & \leq \mathbb{E}_{\pi^{\prime}}\left[R_{t+1}+\gamma q^{\pi}\left(S_{t+1}, \pi^{\prime}\left(S_{t+1}\right)\right) \mid S_{t}=s\right] \\ &=\mathbb{E}_{\pi^{\prime}}\left[R_{t+1}+\gamma \mathbb{E}_{\pi^{\prime}}\left[R_{t+2}+\gamma v^{\pi}\left(S_{t+2}\right) \mid S_{t+1}\right] \mid S_{t}=s\right] \\ &=\mathbb{E}_{\pi^{\prime}}\left[R_{t+1}+\gamma R_{t+2}+\gamma^{2} v^{\pi}\left(S_{t+2}\right) \mid S_{t}=s\right] \\ & \leq \cdots \\ & \leq \mathbb{E}_{\pi^{\prime}}\left[R_{t+1}+\gamma R_{t+2}+\gamma^{2} R_{t+3}+\ldots \mid S_{t}=s\right]=v^{\pi^{\prime}}(s) \end{aligned}$$

So policy iteration alternates between two steps:

  1. Policy Evaluation on MDP: evaluate the current policy $\pi$, i.e., compute $v^{\pi}$.
  2. Policy Improvement on MDP: using the $v^{\pi}$ obtained in step 1, act greedily to get a new policy, $\pi^{\prime}=\operatorname{greedy}\left(v^{\pi}\right)$.

The figure below gives an intuitive picture of this alternating loop:

[Figure: the alternating policy evaluation / policy improvement loop]

When the policy stops improving, the following equality holds:

$$q^{\pi}\left(s, \pi^{\prime}(s)\right)=\max _{a \in \mathcal{A}} q^{\pi}(s, a)=q^{\pi}(s, \pi(s))=v^{\pi}(s)$$

At this point the MDP has reached its optimum and the Bellman optimality equation is satisfied:

$$v^{\pi}(s)=\max _{a \in \mathcal{A}} q^{\pi}(s, a)$$

and therefore

$$v^{\pi}(s)=v^{*}(s) \quad \text { for all } s \in \mathcal{S}$$

The Bellman optimality equations are then:

$$v^{*}(s)=\max _{a} q^{*}(s, a)$$

$$q^{*}(s, a)=R(s, a)+\gamma \sum_{s^{\prime} \in S} P\left(s^{\prime} \mid s, a\right) v^{*}\left(s^{\prime}\right)$$

Substituting one into the other gives:

$$v^{*}(s)=\max _{a}\left[R(s, a)+\gamma \sum_{s^{\prime} \in S} P\left(s^{\prime} \mid s, a\right) v^{*}\left(s^{\prime}\right)\right]$$

$$q^{*}(s, a)=R(s, a)+\gamma \sum_{s^{\prime} \in S} P\left(s^{\prime} \mid s, a\right) \max _{a^{\prime}} q^{*}\left(s^{\prime}, a^{\prime}\right)$$

These equations give an iterative way of solving for $v^{*}$ and $q^{*}$, and they are also the principle behind Q-learning.
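
Iterating the second equation as an update rule gives Q-value iteration, the model-based counterpart of the Q-learning update. A tabular sketch with the same hypothetical array layout as before:

import numpy as np

def q_value_iteration(P, R, gamma, tol=1e-8):
    """Iterate q(s,a) <- R(s,a) + gamma * sum_s' P(s'|s,a) * max_a' q(s',a')."""
    nA, nS, _ = P.shape
    q = np.zeros((nS, nA))
    while True:
        q_new = R + gamma * np.einsum('asn,n->sa', P, q.max(axis=1))
        if np.max(np.abs(q_new - q)) < tol:
            return q_new
        q = q_new

# At the fixed point:  v*(s) = q*.max(axis=1),  pi*(s) = q*.argmax(axis=1)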

Value Iteration

Policy iteration requires a policy evaluation step, and that step must itself iterate until the state-value function converges. Value iteration skips this nested procedure: a single loop finds the optimal value function, and the optimal policy is then extracted from it in one pass.

[Figure: the value iteration algorithm]

Policy Iteration vs. Value Iteration

The prediction problem asks for the value function of every state; it is solved by the policy evaluation step alone.

The control problem asks for the optimal policy; both policy iteration and value iteration solve it.

Policy iteration consists of policy evaluation plus policy improvement, alternating between the two steps; it is the more intuitive of the two.

Value iteration instead iterates the Bellman optimality equation directly and finishes with a single policy extraction step that reads the final policy off the action values.

[Figure: summary of the prediction and control problems, the Bellman equations they use, and the corresponding algorithms]

For a prediction problem, iterate the Bellman expectation equation directly.

For a control problem, either combine the Bellman expectation equation with policy improvement (policy iteration), or run value iteration on the Bellman optimality equation directly.

Visualizing the first method, policy iteration, on the example above:

  • At the start, every state's value is initialized to 0 and the policy is random (the arrows in the figure point in all four directions).

[Figure: initial state values (all zero) and the random policy]
  • After one round of Policy Evaluation and Policy Improvement we get the left and right figures respectively; the policy has already been updated.

[Figure: state values and policy after the first round]
  • After another round of Policy Evaluation and Policy Improvement (left and right figures), the policy is updated further.

[Figure: state values and policy after the second round]
  • Repeating this process eventually yields the optimal policy, shown by the arrows in the right-hand figure below:

[Figure: converged state values and the optimal policy]

Now let's look at the code for policy iteration:

import gym
import numpy as np

def policy_iteration(env, gamma = 1.0):
    """ Policy-Iteration algorithm """
    policy = np.random.choice(env.env.nA, size=(env.env.nS))  # initialize a random policy
    max_iterations = 200000
    gamma = 1.0
    for i in range(max_iterations):
        old_policy_v = compute_policy_v(env, policy, gamma)
        new_policy = extract_policy(old_policy_v, gamma)
        if (np.all(policy == new_policy)):
            print ('Policy-Iteration converged at step %d.' %(i+1))
            break
        policy = new_policy
    return policy

As expected, this matches the algorithm described above: policy iteration consists of two steps.

  • Policy Evaluation: old_policy_v = compute_policy_v(env, policy, gamma)
  • Policy Improvement: new_policy = extract_policy(old_policy_v, gamma)
def compute_policy_v(env, policy, gamma=1.0):
    """ Iteratively evaluate the value-function under policy.
    Alternatively, we could formulate a set of linear equations in terms of v[s]
    and solve them to find the value function.
    """
    v = np.zeros(env.env.nS)
    eps = 1e-10
    while True:
        prev_v = np.copy(v)
        for s in range(env.env.nS):
            policy_a = policy[s]
            v[s] = sum([p * (r + gamma * prev_v[s_]) for p, s_, r, _ in env.env.P[s][policy_a]])
        if (np.sum((np.fabs(prev_v - v))) <= eps):
            # value converged
            break
    return v

This function implements Policy Evaluation: it iterates until convergence and returns $v_{\pi}(s)$.
def extract_policy(v, gamma = 1.0):
    """ Extract the policy given a value-function """
    policy = np.zeros(env.env.nS)
    for s in range(env.env.nS):
        q_sa = np.zeros(env.env.nA)
        for a in range(env.env.nA):
            q_sa[a] = sum([p * (r + gamma * v[s_]) for p, s_, r, _ in  env.env.P[s][a]])
        policy[s] = np.argmax(q_sa)
    return policy

This function implements Policy Improvement: it performs one greedy update of the policy.

Next, the source code for value iteration:


A single loop finds the optimal value function, and the optimal policy is then extracted directly from it.

The main function:

if __name__ == '__main__':

    env_name  = 'FrozenLake-v0' # 'FrozenLake8x8-v0'
    env = gym.make(env_name)
    gamma = 1.0
    optimal_v = value_iteration(env, gamma)
    policy = extract_policy(optimal_v, gamma)
    policy_score = evaluate_policy(env, policy, gamma, n=1000)
    print('Policy average score = ', policy_score)
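
evaluate_policy is called here but not shown in the excerpt. Below is a plausible sketch consistent with the call site above; it is an assumption, not the original author's implementation. It runs n episodes under the given tabular policy and averages the discounted returns, using the classic gym step API (obs, reward, done, info).

import numpy as np

def run_episode(env, policy, gamma=1.0):
    """Roll out one episode following a tabular policy; return its discounted reward."""
    obs = env.reset()
    total_reward, discount, done = 0.0, 1.0, False
    while not done:
        obs, reward, done, _ = env.step(int(policy[obs]))
        total_reward += discount * reward
        discount *= gamma
    return total_reward

def evaluate_policy(env, policy, gamma=1.0, n=100):
    """Average discounted return over n episodes."""
    return np.mean([run_episode(env, policy, gamma) for _ in range(n)])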

value_iteration has just one iterative loop; the function is:

def value_iteration(env, gamma = 1.0):
    """ Value-iteration algorithm """
    v = np.zeros(env.env.nS)  # initialize value-function
    max_iterations = 100000
    eps = 1e-20
    for i in range(max_iterations):
        prev_v = np.copy(v)
        for s in range(env.env.nS):
            q_sa = [sum([p*(r + gamma * prev_v[s_]) for p, s_, r, _ in env.env.P[s][a]]) for a in range(env.env.nA)] 
            v[s] = max(q_sa)
        if (np.sum(np.fabs(prev_v - v)) <= eps):
            print ('Value-iteration converged at iteration# %d.' %(i+1))
            break
    return v

Note these two lines:

q_sa = [sum([p*(r + gamma * prev_v[s_]) for p, s_, r, _ in env.env.P[s][a]]) for a in range(env.env.nA)] 
v[s] = max(q_sa)

They are really a double loop, corresponding to the update

$$V(s)\leftarrow\max_a\sum_{s^{\prime}, r} p\left(s^{\prime}, r \mid s, a\right)\left[r+\gamma V\left(s^{\prime}\right)\right]$$

So value_iteration yields the optimal value function.

Then the same extract_policy function extracts the optimal policy; in value iteration this step is performed only once.

def extract_policy(v, gamma = 1.0):
    """ Extract the policy given a value-function """
    policy = np.zeros(env.env.nS)
    for s in range(env.env.nS):
        q_sa = np.zeros(env.action_space.n)
        for a in range(env.action_space.n):
            for next_sr in env.env.P[s][a]:
                # next_sr is a tuple of (probability, next state, reward, done)
                p, s_, r, _ = next_sr
                q_sa[a] += (p * (r + gamma * v[s_]))
        policy[s] = np.argmax(q_sa)
    return policy

next_sr enumerates all possible outcomes of taking action $a$ in state $s$; summing over them gives a score for action $a$, and sweeping over all actions yields the best action in state $s$, hence the optimal policy.