[RL] Value Iteration and Policy Iteration (solving the Bellman optimality equation with iterative algorithms)

晴空~

Last modified 2024-02-17 16:53:42


First published 2024-02-17 16:42:48

Article link: https://blog.csdn.net/qq_44733706/article/details/136139484


Lecture 4: Value Iteration and Policy Iteration

Value Iteration Algorithm

For the Bellman optimality equation:
$\mathbf{v} = f(\mathbf{v}) = \max_{\pi}(\mathbf{r}_{\pi} + \gamma \mathbf{P}_{\pi} \mathbf{v})$
Lecture 3 showed that the contraction mapping theorem yields an iterative algorithm:
$\mathbf{v}_{k+1} = f(\mathbf{v}_k) = \max_{\pi}(\mathbf{r}_{\pi} + \gamma \mathbf{P}_{\pi} \mathbf{v}_k), \;\;\; k=0, 1, 2, \ldots$
where $v_0$ can be arbitrary.

This algorithm is called value iteration.

It can be split into two steps:

step 1: policy update.
$\pi_{k+1}=\text{argmax}_{\pi}(r_{\pi} + \gamma P_{\pi}v_k)$
where $v_k$ is given.
step 2: value update.
$v_{k+1} = r_{\pi_{k+1}} + \gamma P_{\pi_{k+1}}v_k$

Note: $v_k$ is not a state value, because it is not guaranteed to satisfy a Bellman equation.
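The two steps can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the tabular `model` format (a dict mapping each state and action to a list of `(probability, next_state, reward)` outcomes), the function name `value_iteration`, and the two-state `toy_model` are all hypothetical and do not come from the lecture.

```python
# Hypothetical tabular model: model[s][a] is a list of
# (probability, next_state, reward) outcomes for taking action a in state s.
def value_iteration(model, gamma=0.9, tol=1e-8, max_iters=10_000):
    v = {s: 0.0 for s in model}  # v_0 is arbitrary; zeros are one choice
    policy = {}
    for _ in range(max_iters):
        # q_k(s, a) computed from the current v_k
        q = {
            s: {
                a: sum(p * (r + gamma * v[s2]) for p, s2, r in outcomes)
                for a, outcomes in actions.items()
            }
            for s, actions in model.items()
        }
        # Step 1, policy update: greedy action with respect to q_k
        policy = {s: max(q[s], key=q[s].get) for s in model}
        # Step 2, value update: v_{k+1}(s) = q_k(s, pi_{k+1}(s))
        v_next = {s: q[s][policy[s]] for s in model}
        gap = max(abs(v_next[s] - v[s]) for s in model)
        v = v_next
        if gap < tol:
            break
    return v, policy


# Toy two-state chain, assumed for illustration only:
# staying in B pays 1 per step, every other move pays 0.
toy_model = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(1.0, "B", 0.0)]},
    "B": {"stay": [(1.0, "B", 1.0)], "go": [(1.0, "A", 0.0)]},
}
v_star, pi_star = value_iteration(toy_model)
print(v_star, pi_star)
```

For this toy model the values approach $1/(1-\gamma) = 10$ in B and $\gamma \cdot 10 = 9$ in A, with the policy "go" in A and "stay" in B.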

Analysis of the value iteration algorithm:

step 1: policy update
$\pi_{k+1} = \text{argmax}_{\pi}(r_{\pi} + \gamma P_{\pi}v_k)$
has the elementwise form:
$\begin{align*} \pi_{k+1}(s) &= \text{argmax}_{\pi} \sum_a \pi(a|s) \left( \sum_r p(r | s, a)r + \gamma \sum_{s'} p(s' | s, a) v_k(s') \right), \;\;\; s \in \mathcal{S} \\ &= \text{argmax}_{\pi} \sum_a \pi(a|s) q_k(s, a) \end{align*}$
The optimal policy that solves this optimization problem is:
$\pi_{k+1}(a | s) = \left\{\begin{matrix} 1 & a = a^*_k(s)\\ 0 & a \ne a^*_{k}(s) \end{matrix}\right.$
where $a^*_{k}(s)=\text{argmax}_a q_k(s, a)$. $\pi_{k+1}$ is called a greedy policy, because it simply selects the action with the largest q-value.
step 2: value update
$v_{k+1} = r_{\pi_{k+1}} + \gamma P_{\pi_{k+1}}v_k$
has the elementwise form:
$\begin{align*} v_{k+1}(s) &= \sum_a \pi_{k+1}(a | s)\left( \sum_r p(r | s, a)r + \gamma \sum_{s'} p(s' | s, a) v_k(s') \right), \;\;\; s \in \mathcal{S}\\ &=\sum_a \pi_{k+1}(a | s)q_k(s, a) \end{align*}$
Because $\pi_{k+1}$ is greedy, the equation above simplifies to:
$v_{k+1}(s)=\max_a q_k(s, a)$
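The claim in step 1 that a greedy policy solves the maximization can be checked with one line of algebra. For any policy, the weights $\pi(a|s)$ are nonnegative and sum to 1, so the weighted average of the q-values can never exceed the largest one:
$\sum_a \pi(a|s) q_k(s, a) \le \sum_a \pi(a|s) \max_{a'} q_k(s, a') = \max_{a'} q_k(s, a')$
Equality holds when $\pi$ puts all of its probability on $a^*_k(s)$, which is exactly the greedy policy. When several actions share the maximum, any of them can be chosen.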

Procedure summary:
$v_k(s) \rightarrow q_k(s, a) \rightarrow \text{greedy policy } \pi_{k+1}(a | s) \rightarrow \text{new value } v_{k+1}(s) = \max_a q_k(s, a)$

Example:

Setup: rewards $r_{\text{boundary}} = r_{\text{forbidden}} = -1$, $r_{\text{target}} = 1$, discount rate $\gamma=0.9$

[Figure]

Its q-table is:

[Figure: q-table]

At $k=0$, set $v_0(s_1) = v_0(s_2) = v_0(s_3) = v_0(s_4) = 0$

[Figure]

step 1: policy update
$\pi_1(a_5|s_1) = 1, \pi_1(a_3|s_2) = 1, \pi_1(a_2|s_3) = 1, \pi_1(a_5|s_4) = 1$
See panel (b) of the figure above.

step 2: value update
$v_1(s_1) = 0, v_1(s_2) = 1, v_1(s_3) = 1, v_1(s_4)=1$

At $k=1$, since $v_1(s_1) = 0, v_1(s_2) = 1, v_1(s_3) = 1, v_1(s_4)=1$, we get

step 1: policy update
$\pi_2(a_3|s_1) = 1, \pi_2(a_3|s_2) = 1, \pi_2(a_2|s_3) = 1, \pi_2(a_5|s_4) = 1$
step 2: value update
$v_2(s_1) = \gamma \cdot 1, v_2(s_2) = 1 + \gamma \cdot 1, v_2(s_3) = 1 + \gamma \cdot 1, v_2(s_4)=1 + \gamma \cdot 1$
See panel (c) of the figure above.
Continue iterating until $\| v_k - v_{k+1} \|$ falls below a preset threshold.
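The two iterations above can be reproduced with a short script. The figures are not available in this copy, so the grid layout and action meanings below are assumptions, chosen because they are consistent with the policies and values stated in the text: a 2x2 grid with $s_1$ top-left, $s_2$ top-right (forbidden), $s_3$ bottom-left, $s_4$ bottom-right (target), and actions $a_1$ to $a_5$ meaning up, right, down, left, stay. The helper names `step` and `sweep` are hypothetical.

```python
GAMMA = 0.9
# Assumed layout (the original figure is not available):
#   s1 s2
#   s3 s4     with s2 forbidden and s4 the target.
POS = {"s1": (0, 0), "s2": (0, 1), "s3": (1, 0), "s4": (1, 1)}
STATE_AT = {pos: s for s, pos in POS.items()}
# Assumed action meanings: a1 up, a2 right, a3 down, a4 left, a5 stay.
MOVES = {"a1": (-1, 0), "a2": (0, 1), "a3": (1, 0), "a4": (0, -1), "a5": (0, 0)}
FORBIDDEN, TARGET = "s2", "s4"


def step(s, a):
    """Return (reward, next_state) for the deterministic grid world."""
    row, col = POS[s]
    d_row, d_col = MOVES[a]
    nxt = STATE_AT.get((row + d_row, col + d_col))
    if nxt is None:  # bumped into the boundary: stay in place
        return -1.0, s
    if nxt == FORBIDDEN:
        return -1.0, nxt
    if nxt == TARGET:
        return 1.0, nxt
    return 0.0, nxt


def sweep(v):
    """One value iteration step: returns (greedy policy, v_{k+1})."""
    policy, v_next = {}, {}
    for s in POS:
        q = {}
        for a in MOVES:
            r, s2 = step(s, a)
            q[a] = r + GAMMA * v[s2]
        # max() keeps the first maximizer, so ties go to the lower-numbered action
        policy[s] = max(q, key=q.get)
        v_next[s] = q[policy[s]]
    return policy, v_next


v0 = {s: 0.0 for s in POS}
pi1, v1 = sweep(v0)
pi2, v2 = sweep(v1)
print(v1)
print(pi2, v2)
```

Under these assumptions, `v1` and `v2` match the values in the text and `pi2` matches $\pi_2$. At $k=0$ the actions $a_3$ and $a_5$ tie in $s_1$ (both have q-value 0): the text picks $a_5$, while this sketch happens to pick $a_3$. Either choice gives the same $v_1$.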

Policy Iteration Algorithm

Algorithm description:

Given an arbitrary initial policy $\pi_0$:

step 1: policy evaluation (PE)

This step computes the state value of $\pi_k$:
$\mathbf{v}_{\pi_k} = \mathbf{r}_{\pi_k} + \gamma \mathbf{P}_{\pi_k}\mathbf{v}_{\pi_k}$
Note that $v_{\pi_k}$ is a state value function.
step 2: policy improvement (PI)
$\pi_{k+1} = \text{argmax}_{\pi} (r_\pi + \gamma P_\pi v_{\pi_k})$
The maximization is performed elementwise.

The algorithm generates the following sequence:
$\pi_0 \xrightarrow[]{PE} v_{\pi_0} \xrightarrow[]{PI} \pi_1 \xrightarrow[]{PE} v_{\pi_1} \xrightarrow[]{PI} \pi_2 \xrightarrow[]{PE} v_{\pi_2} \xrightarrow[]{PI} \cdots$
where PE = policy evaluation and PI = policy improvement.

Three questions:

In the policy evaluation step, how do we get the state value $v_{\pi_k}$ by solving the Bellman equation?
$\mathbf{v}_{\pi_k} = \mathbf{r}_{\pi_k} + \gamma \mathbf{P}_{\pi_k}\mathbf{v}_{\pi_k}$

Closed-form solution:
$\mathbf{v}_{\pi_k} = (I - \gamma \mathbf{P}_{\pi_k})^{-1} \mathbf{r}_{\pi_k}$
Iterative solution:
$\mathbf{v}_{\pi_k}^{(j+1)} = \mathbf{r}_{\pi_k} + \gamma \mathbf{P}_{\pi_k}\mathbf{v}_{\pi_k}^{(j)}, \;\;\; j=0, 1, 2, \ldots$
Policy iteration is therefore an iterative algorithm with another iterative algorithm embedded in its policy evaluation step.

In the policy improvement step, why is the new policy $\pi_{k+1}$ better than $\pi_k$?

Lemma (Policy Improvement)

If
$\pi_{k+1} = \text{argmax}_{\pi}(r_{\pi} + \gamma P_{\pi} v_{\pi_k})$
then $v_{\pi_{k+1}} \ge v_{\pi_k}$ holds for every $k$.

Why can such an iterative algorithm finally reach an optimal policy?

We know that:
$v_{\pi_0} \le v_{\pi_1} \le v_{\pi_2} \le \cdots \le v_{\pi_k} \le \cdots \le v^*$
So every iteration improves $v_{\pi_k}$ or leaves it unchanged. Because the sequence is nondecreasing and bounded above by $v^*$, it converges. The theorem below states that its limit is $v^*$.

Theorem (Convergence of Policy Iteration)

The state value sequence $\left\{ v_{\pi_k} \right\}^{\infty}_{k=0}$ generated by the policy iteration algorithm converges to the optimal state value $v^*$. Consequently, the policy sequence $\left \{ \pi_k \right \}^{\infty}_{k=0}$ converges to an optimal policy.

Analysis of the policy iteration algorithm:

step 1: policy evaluation

matrix-vector form:
$\mathbf{v}_{\pi_k}^{(j+1)} = \mathbf{r}_{\pi_k} + \gamma \mathbf{P}_{\pi_k}\mathbf{v}_{\pi_k}^{(j)}, \;\;\; j=0, 1, 2, \ldots$
elementwise form:
$v_{\pi_k}^{(j+1)}(s) = \sum_a \pi_k(a|s) \left( \sum_rp(r| s, a)r + \gamma \sum_{s'} p(s' | s, a)v_{\pi_k}^{(j)} (s')\right), \;\;\; s \in \mathcal{S}$
The iteration stops when $j \rightarrow \infty$, when $j$ is sufficiently large, or when $\| v^{(j+1)}_{\pi_k} - v^{(j)}_{\pi_k} \|$ is sufficiently small.
step 2: policy improvement

matrix-vector form:
$\pi_{k+1} = \text{argmax}_{\pi}(\mathbf{r}_{\pi} + \gamma \mathbf{P}_\pi \mathbf{v}_{\pi_k})$
elementwise form:
$\begin{align*} \pi_{k+1}(s) &= \text{argmax}_\pi \sum_a \pi(a | s)\left( \sum_r p(r | s, a)r + \gamma \sum_{s'} p(s' | s, a) v_{\pi_k}(s') \right), \;\;\; s \in \mathcal{S}\\ &= \text{argmax}_\pi \sum_a \pi(a | s) q_{\pi_k}(s, a) \end{align*}$
where $q_{\pi_k}(s, a)$ is the action value under policy $\pi_k$. Let
$a^*_k(s) = \text{argmax}_a q_{\pi_k}(s, a)$
Then the greedy policy is:
$\pi_{k+1}(a | s) = \begin{cases} 1 & a = a^*_k(s), \\ 0 & a \ne a^*_k(s). \end{cases}$
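The evaluation and improvement steps can be put together as in the sketch below. It is a minimal illustration, not the lecture's own code: the `model` format (each state and action maps to a list of `(probability, next_state, reward)` outcomes), the function names, the restriction to deterministic policies, and the two-state `toy_model` are all assumptions made for the example. Policy evaluation uses the iterative form and stops once successive estimates differ by less than `tol`.

```python
def policy_evaluation(model, policy, gamma, tol=1e-10, max_sweeps=100_000):
    """Iteratively solve v = r_pi + gamma * P_pi v for a deterministic policy."""
    v = {s: 0.0 for s in model}
    for _ in range(max_sweeps):
        v_next = {
            s: sum(p * (r + gamma * v[s2]) for p, s2, r in model[s][policy[s]])
            for s in model
        }
        gap = max(abs(v_next[s] - v[s]) for s in model)
        v = v_next
        if gap < tol:
            break
    return v


def policy_improvement(model, v, gamma):
    """Greedy policy with respect to the action values computed from v."""
    policy = {}
    for s, actions in model.items():
        q = {
            a: sum(p * (r + gamma * v[s2]) for p, s2, r in outcomes)
            for a, outcomes in actions.items()
        }
        policy[s] = max(q, key=q.get)
    return policy


def policy_iteration(model, policy, gamma=0.9, max_rounds=100):
    for _ in range(max_rounds):
        v = policy_evaluation(model, policy, gamma)          # PE
        new_policy = policy_improvement(model, v, gamma)     # PI
        if new_policy == policy:  # no change: the policy is greedy w.r.t. its own value
            break
        policy = new_policy
    return policy, v


# Toy two-state chain, assumed for illustration only.
toy_model = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(1.0, "B", 0.0)]},
    "B": {"stay": [(1.0, "B", 1.0)], "go": [(1.0, "A", 0.0)]},
}
pi_star, v_star = policy_iteration(toy_model, {"A": "stay", "B": "go"})
print(pi_star, v_star)
```

Stopping when the policy no longer changes is one common termination test; stopping on a small change in $v_{\pi_k}$ would work as well.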


Simple example:

[Figure]

Reward setting: $r_{\text{boundary}} = -1$, $r_{\text{target}} = 1$, discount rate $\gamma=0.9$.

Actions: $a_\ell$, $a_0$, $a_{r}$ denote moving left, staying in place, and moving right.

Iteration $k = 0$:

step 1: policy evaluation

$\pi_0$ is the policy shown in panel (a) of the figure above. The Bellman equation is:
$\begin{align*} &v_{\pi_0}(s_1) = -1 + \gamma v_{\pi_0}(s_1) \\ &v_{\pi_0}(s_2) = 0 + \gamma v_{\pi_0}(s_1) \end{align*}$
Solving the equations directly:
$\begin{align*} &v_{\pi_0}(s_1) = -10\\ &v_{\pi_0}(s_2) = -9 \end{align*}$
Solving the equations iteratively:

Assume the initial guess $v^{(0)}_{\pi_0}(s_1) = v^{(0)}_{\pi_0}(s_2) = 0$; then
$\begin{align*} &\begin{cases} v^{(1)}_{\pi_0}(s_1) = -1 + \gamma v^{(0)}_{\pi_0}(s_1) = -1\\ v^{(1)}_{\pi_0}(s_2) = 0 + \gamma v^{(0)}_{\pi_0}(s_1) = 0 \end{cases} \\ &\begin{cases} v^{(2)}_{\pi_0}(s_1) = -1 + \gamma v^{(1)}_{\pi_0}(s_1) = -1.9\\ v^{(2)}_{\pi_0}(s_2) = 0 + \gamma v^{(1)}_{\pi_0}(s_1) = -0.9 \end{cases} \\ &\begin{cases} v^{(3)}_{\pi_0}(s_1) = -1 + \gamma v^{(2)}_{\pi_0}(s_1) = -2.71\\ v^{(3)}_{\pi_0}(s_2) = 0 + \gamma v^{(2)}_{\pi_0}(s_1) = -1.71 \end{cases} \\ \end{align*}\\ \cdots$
step 2: policy improvement

$q_{\pi_k}(s, a)$ is:

[Figure: table of $q_{\pi_k}(s, a)$]

Substituting $v_{\pi_0}(s_1) = -10$, $v_{\pi_0}(s_2) = -9$ and $\gamma=0.9$ gives:

[Figure: table of $q_{\pi_0}(s, a)$ values]

Taking the largest $q_{\pi_0}$ in each state, the improved policy is:
$\pi_1(a_r | s_1) = 1, \;\;\; \pi_1(a_0 | s_2) = 1$
After a single iteration, the policy is already optimal.
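The numbers in this example can be checked with a few lines of Python. The figures are not available in this copy, so the model below is an assumption that is consistent with the Bellman equations and the improved policy in the text: two cells in a row, $s_1$ on the left and $s_2$ (the target) on the right, with $\pi_0$ moving left in both states. The names `MODEL`, `evaluate`, and `improve` are hypothetical.

```python
GAMMA = 0.9
# Assumed model; each entry is (reward, next_state).
# Moving into the boundary gives -1 and leaves the state unchanged;
# entering or staying in the target s2 gives +1.
MODEL = {
    "s1": {"a_l": (-1.0, "s1"), "a_0": (0.0, "s1"), "a_r": (1.0, "s2")},
    "s2": {"a_l": (0.0, "s1"), "a_0": (1.0, "s2"), "a_r": (-1.0, "s2")},
}
pi0 = {"s1": "a_l", "s2": "a_l"}  # initial policy implied by the Bellman equations


def evaluate(policy, sweeps):
    """Run `sweeps` iterations of v <- r_pi + gamma * P_pi v, starting from v = 0."""
    v = {s: 0.0 for s in MODEL}
    history = []
    for _ in range(sweeps):
        v = {
            s: MODEL[s][policy[s]][0] + GAMMA * v[MODEL[s][policy[s]][1]]
            for s in MODEL
        }
        history.append(dict(v))
    return v, history


def improve(v):
    """Greedy policy with respect to q(s, a) = r + gamma * v(s')."""
    policy = {}
    for s, actions in MODEL.items():
        q = {a: r + GAMMA * v[s2] for a, (r, s2) in actions.items()}
        policy[s] = max(q, key=q.get)
    return policy


v_pi0, history = evaluate(pi0, sweeps=500)
pi1 = improve(v_pi0)
print(history[:3])   # first three iterates of the policy evaluation
print(v_pi0, pi1)
```

The first three iterates reproduce $(-1, 0)$, $(-1.9, -0.9)$ and $(-2.71, -1.71)$, the values approach $(-10, -9)$, and the greedy step returns $a_r$ in $s_1$ and $a_0$ in $s_2$.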

Truncated Policy Iteration Algorithm

Compare value iteration and policy iteration

Policy iteration: start from $\pi_0$

policy evaluation (PE):
$\mathbf{v}_{\pi_k} = \mathbf{r}_{\pi_k} + \gamma \mathbf{P}_{\pi_k}\mathbf{v}_{\pi_k}$
policy improvement (PI):
$\pi_{k+1} = \text{argmax}_{\pi}(\mathbf{r}_{\pi} + \gamma \mathbf{P}_{\pi} \mathbf{v}_{\pi_k})$

Value iteration: start from $v_0$

policy update (PU):
$\pi_{k+1} = \text{argmax}_{\pi}(\mathbf{r}_{\pi} + \gamma \mathbf{P}_{\pi} \mathbf{v}_{k})$
value update (VU):
$\mathbf{v}_{k+1} = \mathbf{r}_{\pi_{k+1}} + \gamma \mathbf{P}_{\pi_{k+1}}\mathbf{v}_k$

The two algorithms are very similar:

policy iteration: $\pi_0 \xrightarrow[]{PE} v_{\pi_0} \xrightarrow[]{PI} \pi_1 \xrightarrow[]{PE} v_{\pi_1} \xrightarrow[]{PI} \pi_2 \xrightarrow[]{PE} v_{\pi_2} \xrightarrow[]{PI} \cdots$

value iteration: $u_0 \xrightarrow[]{PU} \pi'_1 \xrightarrow[]{VU} u_1 \xrightarrow[]{PU} \pi_2' \xrightarrow[]{VU} u_2 \xrightarrow[]{PU} \cdots$

Here $u_k$ and $\pi'_k$ denote the iterates of value iteration, written with different symbols to distinguish them from those of policy iteration.

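A detailed comparison of the two algorithms:

[Figure: comparison of the two algorithms]

The two algorithms start from the same initial condition.
The first three steps of the two algorithms are the same.
The fourth step is different:

In policy iteration, computing $\mathbf{v}_{\pi_1} = \mathbf{r}_{\pi_1} + \gamma \mathbf{P}_{\pi_1}\mathbf{v}_{\pi_1}$ requires an iterative algorithm.

In value iteration, computing $\mathbf{v}_1 = \mathbf{r}_{\pi_{1}} + \gamma \mathbf{P}_{\pi_{1}}\mathbf{v}_0$ is a one-step computation.

Consider the step that solves $\mathbf{v}_{\pi_1} = \mathbf{r}_{\pi_1} + \gamma \mathbf{P}_{\pi_1}\mathbf{v}_{\pi_1}$:

[Figure]

The value iteration algorithm computes only one iteration.
The policy iteration algorithm computes "infinitely" many iterations.
The truncated policy iteration algorithm computes a finite number of iterations, say $j$. The remaining iterations from $j$ to $\infty$ are omitted.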

Does truncated policy iteration converge?

Consider the iterative algorithm for solving the policy evaluation step:
$\mathbf{v}_{\pi_k}^{(j+1)} = \mathbf{r}_{\pi_k} + \gamma \mathbf{P}_{\pi_k}\mathbf{v}_{\pi_k}^{(j)}, \;\;\; j=0, 1, 2, \cdots$
If the initial guess is $v_{\pi_k}^{(0)}=v_{\pi_{k-1}}$, then
$v_{\pi_k}^{(j+1)} \ge v_{\pi_k}^{(j)}$
holds for every $j$.
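A short sketch of why this inequality holds, using only the definition of the policy improvement step and the fact that $P_{\pi_k}$ has nonnegative entries. For $j = 0$, since $\pi_k$ maximizes $r_{\pi} + \gamma P_{\pi} v_{\pi_{k-1}}$ over all policies, and $v_{\pi_{k-1}}$ satisfies its own Bellman equation:
$\begin{align*} v_{\pi_k}^{(1)} - v_{\pi_k}^{(0)} &= r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_{k-1}} - v_{\pi_{k-1}} \\ &\ge r_{\pi_{k-1}} + \gamma P_{\pi_{k-1}} v_{\pi_{k-1}} - v_{\pi_{k-1}} = 0 \end{align*}$
For $j \ge 1$, subtracting two consecutive iterations gives
$v_{\pi_k}^{(j+1)} - v_{\pi_k}^{(j)} = \gamma P_{\pi_k} \left( v_{\pi_k}^{(j)} - v_{\pi_k}^{(j-1)} \right) \ge 0$
by induction, because multiplying a nonnegative vector by $\gamma P_{\pi_k}$ keeps it nonnegative.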

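[Figure]

As the figure above shows, both policy iteration and value iteration converge to the optimal state value, and truncated policy iteration is sandwiched between the two. By the squeeze theorem, it must also converge to the optimal state value.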
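The relationship between the three algorithms can be sketched in code with a single parameter for the number of evaluation sweeps. This is an illustrative variant under stated assumptions, not the lecture's algorithm verbatim: it starts from $v_0 = 0$ and takes the greedy policy first, so that one sweep per round coincides with value iteration, and each evaluation is warm-started from the previous estimate. The `model` format, the names `q_values`, `truncated_policy_iteration`, `eval_sweeps`, and the two-state `toy_model` are hypothetical.

```python
def q_values(model, v, s, gamma):
    """q(s, a) for every action a, computed from the value estimate v."""
    return {
        a: sum(p * (r + gamma * v[s2]) for p, s2, r in outcomes)
        for a, outcomes in model[s].items()
    }


def truncated_policy_iteration(model, gamma=0.9, eval_sweeps=5, tol=1e-8,
                               max_rounds=10_000):
    v = {s: 0.0 for s in model}
    policy, rounds = {}, 0
    for rounds in range(1, max_rounds + 1):
        # Policy improvement: greedy with respect to the current estimate v
        policy = {}
        for s in model:
            q = q_values(model, v, s, gamma)
            policy[s] = max(q, key=q.get)
        # Truncated policy evaluation: only `eval_sweeps` sweeps,
        # warm-started from the previous estimate
        v_old = v
        for _ in range(eval_sweeps):
            v = {s: q_values(model, v, s, gamma)[policy[s]] for s in model}
        if max(abs(v[s] - v_old[s]) for s in model) < tol:
            break
    return v, policy, rounds


# Toy two-state chain, assumed for illustration only.
toy_model = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(1.0, "B", 0.0)]},
    "B": {"stay": [(1.0, "B", 1.0)], "go": [(1.0, "A", 0.0)]},
}
v1, pi1, rounds1 = truncated_policy_iteration(toy_model, eval_sweeps=1)
v5, pi5, rounds5 = truncated_policy_iteration(toy_model, eval_sweeps=5)
print(rounds1, rounds5)
```

With `eval_sweeps=1` each round is exactly one value iteration step; letting `eval_sweeps` grow approaches policy iteration. On this toy model both settings reach the same policy, and the run with five sweeps needs fewer outer rounds.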

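Example:

Take the setting in the figure above as the initial state, and define $\| v_k - v^* \|$ as the state value error at step $k$. The algorithm stops when $\| v_k - v^* \| < 0.01$.

[Figure]

"Truncated policy iteration-$x$" denotes truncated policy iteration with $x$ iterations in the policy evaluation step.
The larger $x$ is, the faster the value estimate converges.
As $x$ keeps increasing, its contribution to the convergence speed becomes smaller and smaller.
So in practice, a small number of policy evaluation iterations is enough.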

Summary

Value iteration:

An iterative algorithm for solving the Bellman optimality equation, given an initial value $v_0$:
$v_{k+1} = \max_{\pi}(r_{\pi} + \gamma P_{\pi}v_k)\\ \Updownarrow \\ \begin{cases} \text{Policy update}: \pi_{k+1} = \text{argmax}_{\pi}(r_{\pi} + \gamma P_{\pi} v_k) \\ \text{Value update}: v_{k+1} = r_{\pi_{k+1}} + \gamma P_{\pi_{k+1}}v_k \end{cases}$
Policy iteration: given an initial policy $\pi_0$
$\begin{cases} \text{Policy evaluation}: v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k}v_{\pi_k}\\ \text{Policy improvement}: \pi_{k+1} = \text{argmax}_{\pi}(r_{\pi} + \gamma P_{\pi}v_{\pi_k}) \end{cases}$
Truncated policy iteration

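These are notes on the open course "Mathematical Principles of Reinforcement Learning" (强化学习的数学原理) by Westlake University Intelligent Unmanned Systems, available on Bilibili.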