【RL】Value Iteration and Policy Iteration(利用迭代算法求解贝尔曼最优等式)

Lecture 4: Value Iteration and Policy Iteration

Value Iteration Algorithm

对于Bellman最优公式:
v = f ( v ) = m a x π ( r + γ P π v ) \mathbf{v} = f(\mathbf{v}) = max_{\pi}(\mathbf{r} + \gamma \mathbf{P}_{\pi} \mathbf{v}) v=f(v)=maxπ(r+γPπv)
在Lecture 3中,已知可以通过contraction mapping原理来提出迭代算法:
v k + 1 = f ( v k ) = m a x π ( r + γ P π v k )        k = 1 , 2 , 3 \mathbf{v}_{k+1} = f(\mathbf{v}_k) = max_{\pi}(\mathbf{r} + \gamma \mathbf{P}_{\pi} \mathbf{v}_k) \;\;\; k=1, 2, 3 vk+1=f(vk)=maxπ(r+γPπvk)k=1,2,3
其中, v 0 v_0 v0可以是任意的。

上述算法就是所谓的 value iteration(值迭代)。

其可以分为两步:

  • step 1: policy update (策略更新)。
    π k + 1 = argmax π ( r π + γ P π v k ) \pi_{k+1}=\text{argmax}_{\pi}(r_{\pi} + \gamma P_{\pi}v_k) πk+1=argmaxπ(rπ+γPπvk)
    其中, v k v_k vk是给定的。

  • step 2: value update(值更新)。
    v k + 1 = r π k + 1 + γ P π k + 1 v k v_{k+1} = r_{\pi_{k+1}} + \gamma P_{\pi_{k+1}}v_k vk+1=rπk+1+γPπk+1vk

注意: v k v_k vk不是state value,因为其不满足Bellman等式。

Value iteration algorithm分析

  • step 1: policy update
    π k + 1 = argmax π ( r π + γ P π v k ) \pi_{k+1} = \text{argmax}_{\pi}(r_{\pi} + \gamma P_{\pi}v_k) πk+1=argmaxπ(rπ+γPπvk)
    的元素形式为:
    π k + 1 ( s ) = argmax π ∑ a π ( a ∣ s ) ( ∑ r p ( r ∣ s , a ) r + γ ∑ s ′ p ( s ′ ∣ s , a ) v k ( s ′ ) )        s ∈ S = argmax π ∑ a π ( a ∣ s ) q k ( s , a ) \begin{align*} \pi_{k+1}(s) &= \text{argmax}_{\pi} \sum_a \pi(a|s) \left( \sum_r p(r | s, a)r + \gamma \sum_{s'} p(s' | s, a) v_k(s') \right) \;\;\; s \in \mathcal{S} \\ &= \text{argmax}_{\pi} \sum_a \pi(a|s) q_k(s, a) \end{align*} πk+1(s)=argmaxπaπ(as)(rp(rs,a)r+γsp(ss,a)vk(s))sS=argmaxπaπ(as)qk(s,a)
    解决上述优化问题的最优策略为:
    π k + 1 ( a ∣ s ) = { 1 a = a k ∗ ( s ) 0 a ≠ a k ∗ ( s ) \pi_{k+1}(a | s) = \left\{\begin{matrix} 1 & a = a^*_k(s)\\ 0 & a \ne a^*_{k}(s) \end{matrix}\right. πk+1(as)={10a=ak(s)a=ak(s)
    其中, a k ∗ ( s ) = argmax a q k ( a , s ) a^*_{k}(s)=\text{argmax}_aq_k(a, s) ak(s)=argmaxaqk(a,s) π k + 1 \pi_{k+1} πk+1是greedy policy(贪心策略),因为其只是简单的选择最大的q-value。

  • step 2: value update
    v k + 1 = r π k + 1 + γ P π k + 1 v k v_{k+1} = r_{\pi_{k+1}} + \gamma P_{\pi_{k+1}}v_k vk+1=rπk+1+γPπk+1vk
    的元素形式为:
    v k + 1 ( s ) = ∑ a π k + 1 ( a ∣ s ) ( ∑ r p ( r ∣ s , a ) r + γ ∑ s ′ p ( s ′ ∣ s , a ) v k ( s ′ ) )        s ∈ S = ∑ a π k + 1 ( a ∣ s ) q k ( s , a ) \begin{align*} v_{k+1}(s) &= \sum_a \pi_{k+1}(a | s)\left( \sum_r p(r | s, a)r + \gamma \sum_{s'} p(s' | s, a) v_k(s') \right) \;\;\; s \in \mathcal{S}\\ &=\sum_a \pi_{k+1}(a | s)q_k(s, a) \end{align*} vk+1(s)=aπk+1(as)(rp(rs,a)r+γsp(ss,a)vk(s))sS=aπk+1(as)qk(s,a)
    因为 π k + 1 \pi_{k+1} πk+1是贪心的,上述等式可以简化为:
    v k + 1 ( s ) = max a q k ( a , s ) v_{k+1}(s)=\text{max}_a q_k(a, s) vk+1(s)=maxaqk(a,s)

Procedure Summary
v k ( s ) → q k ( s , a ) → greedy policy  π k + 1 ( a ∣ s ) → new value  v k + 1 → max a q k ( s , a ) v_k(s) \rightarrow q_k(s, a) \rightarrow \text{greedy policy } \pi_{k+1}(a | s) \rightarrow \text{new value } v_{k+1} \rightarrow \text{max}_a q_k(s, a) vk(s)qk(s,a)greedy policy πk+1(as)new value vk+1maxaqk(s,a)
在这里插入图片描述

Example:

设置:reward: r boundary = r forbidden = − 1 r_{\text{boundary}} = r_{\text{forbidden}} = -1 rboundary=rforbidden=1 r target = 1 r_{\text{target}} = 1 rtarget=1,discount rate γ = 0.9 \gamma=0.9 γ=0.9

在这里插入图片描述

其q-table为:

在这里插入图片描述

  • k=0时,使 v 0 ( s 1 ) = v 0 ( s 2 ) = v 0 ( s 3 ) = v 0 ( s 4 ) = 0 v_0(s_1) = v_0(s_2) = v_0(s_3) = v_0(s_4) = 0 v0(s1)=v0(s2)=v0(s3)=v0(s4)=0

在这里插入图片描述

step1: policy update
π 1 ( a 5 ∣ s 1 ) = 1 , π 1 ( a 3 ∣ s 2 ) = 1 , π 1 ( a 2 ∣ s 3 ) = 1 , π 1 ( a 5 ∣ s 4 ) = 1 \pi_1(a_5|s_1) = 1, \pi_1(a_3|s_2) = 1, \pi_1(a_2|s_3) = 1, \pi_1(a_5|s_4) = 1 π1(a5s1)=1,π1(a3s2)=1,π1(a2s3)=1,π1(a5s4)=1
如上图(b)

step2: value update
v 1 ( s 1 ) = 0 , v 2 ( s 2 ) = 1 , v 1 ( s 3 ) = 1 , v 1 ( s 4 ) = 1 v_1(s_1) = 0, v_2(s_2) = 1, v_1(s_3) = 1, v_1(s_4)=1 v1(s1)=0,v2(s2)=1,v1(s3)=1,v1(s4)=1

  • k=1时,因为 v 1 ( s 1 ) = 0 , v 2 ( s 2 ) = 1 , v 1 ( s 3 ) = 1 , v 1 ( s 4 ) = 1 v_1(s_1) = 0, v_2(s_2) = 1, v_1(s_3) = 1, v_1(s_4)=1 v1(s1)=0,v2(s2)=1,v1(s3)=1,v1(s4)=1,可得

    step1: policy update
    π 2 ( a 3 ∣ s 1 ) = 1 , π 2 ( a 3 ∣ s 2 ) = 1 , π 2 ( a 2 ∣ s 3 ) = 1 , π 2 ( a 5 ∣ s 4 ) = 1 \pi_2(a_3|s_1) = 1, \pi_2(a_3|s_2) = 1, \pi_2(a_2|s_3) = 1, \pi_2(a_5|s_4) = 1 π2(a3s1)=1,π2(a3s2)=1,π2(a2s3)=1,π2(a5s4)=1
    step2: value update
    v 1 ( s 1 ) = γ 1 , v 2 ( s 2 ) = 1 + γ 1 , v 1 ( s 3 ) = 1 + γ 1 , v 1 ( s 4 ) = 1 + γ 1 v_1(s_1) = \gamma 1, v_2(s_2) = 1 + \gamma 1, v_1(s_3) = 1 + \gamma 1, v_1(s_4)=1 + \gamma 1 v1(s1)=γ1,v2(s2)=1+γ1,v1(s3)=1+γ1,v1(s4)=1+γ1
    如上图(c)

  • 继续迭代,直到 ∥ v k − v k + 1 ∥ \| v_k - v_{k+1} \| vkvk+1小于预设的阈值。

Policy Iteration Algorithm

Algorithm description

对于给定的初始随机policy π 0 \pi_0 π0

  • step 1: policy evaluation (PE)

    这一步是计算 π k \pi_k πk的state value
    v π k = r π k + γ P π k v π k \mathbf{v}_{\pi_k} = \mathbf{r}_{\pi_k} + \gamma \mathbf{P}_{\pi_k}\mathbf{v}_{\pi_k} vπk=rπk+γPπkvπk
    注意 v π k v_{{\pi}_k} vπk是state value函数

  • step 2: policy improvement (PI)
    π k + 1 = argmax π ( r π + γ P π v π k ) \pi_{k+1} = \text{argmax}_{\pi} (r_\pi + \gamma P_\pi v_{\pi_k}) πk+1=argmaxπ(rπ+γPπvπk)
    最大化按元素计算

算法会产生如下序列:
π 0 → P E v v π 0 → P I π 1 → P E v π 1 → P I π 2 → P E v π 2 → P I ⋯ \pi_0 \xrightarrow[]{PE} v_{v_{\pi_0}} \xrightarrow[]{PI} \pi_1 \xrightarrow[]{PE} v_{\pi_1} \xrightarrow[]{PI} \pi_2 \xrightarrow[]{PE} v_{\pi_2} \xrightarrow[]{PI} \cdots π0PE vvπ0PI π1PE vπ1PI π2PE vπ2PI
其中,PE=policy evaluation,PI=policy improvement

三个问题

In the policy evaluation step, how to get the state value v π k v_{\pi_k} vπkby solving the Bellman equation?
v π k = r π k + γ P π k v π k \mathbf{v}_{\pi_k} = \mathbf{r}_{\pi_k} + \gamma \mathbf{P}_{\pi_k}\mathbf{v}_{\pi_k} vπk=rπk+γPπkvπk

  • Close-form解:
    v π k = ( I − γ P π k ) − 1 r π k \mathbf{v}_{\pi_k} = (I - \gamma \mathbf{P}_{\pi_k})^{-1} \mathbf{r}_{\pi_k} vπk=(IγPπk)1rπk

  • 迭代求解:
    v π k ( j + 1 ) = r π k + γ P π k v π k ( j ) \mathbf{v}_{\pi_k}^{(j+1)} = \mathbf{r}_{\pi_k} + \gamma \mathbf{P}_{\pi_k}\mathbf{v}_{\pi_k}^{(j)} vπk(j+1)=rπk+γPπkvπk(j)
    policy iteration是一种迭代算法,在policy评估步骤中嵌入了另一种迭代算法!

In the policy improvement step, why is the new policy π k + 1 \pi_{k+1} πk+1 better than π k \pi_k πk?

Lemma (Policy Improvemnent)

如果:
π k + 1 = argmax π ( r π + γ P π v π k ) \pi_{k+1} = \text{argmax}_{\pi}(r_{\pi} + \gamma P_{\pi} v_{\pi_k}) πk+1=argmaxπ(rπ+γPπvπk)
那么,对任意 k k k,都有 v π k + 1 ≥ v π k v_{\pi_{k+1}} \ge v_{\pi_k} vπk+1vπk成立。

Why such an iterative algorithm can finally reach an optimal policy?

已知:
v π 0 ≤ v π 1 ≤ v π 2 ⋯ ≤ v π k ≤ ⋯ v ∗ v_{\pi_0} \le v_{\pi_1} \le v_{\pi_2} \cdots \le v_{\pi_k} \le \cdots v^* vπ0vπ1vπ2vπkv
故,每一次迭代都会提高 v π k v_{\pi_k} vπk而且其会收敛,接下来证明其会收敛到 v ∗ v^* v

Theorem (Convergence of Policy Iteration)

由policy iteration算法产生的state value序列 { v π k } k = 0 ∞ \left\{ v_{\pi_k} \right\}^{\infty}_{k=0} {vπk}k=0会收敛到最优的state value v ∗ v^* v,因此,policy序列 { π k } k = 0 ∞ \left \{ \pi_k \right \}^{\infty}_{k=0} {πk}k=0也会收敛到最优的policy。

Policy iteration algorithm分析:

  • step 1: policy evaluation

    maxtrix-vector form:
    v π k ( j + 1 ) = r π k + γ P π k v π k ( j )        j = 0 , 1 , 2 \mathbf{v}_{\pi_k}^{(j+1)} = \mathbf{r}_{\pi_k} + \gamma \mathbf{P}_{\pi_k}\mathbf{v}_{\pi_k}^{(j)} \;\;\; j=0, 1, 2 vπk(j+1)=rπk+γPπkvπk(j)j=0,1,2
    elementwise form:
    v π k ( j + 1 ) = ∑ a π k ( a ∣ s ) ( ∑ r p ( r ∣ s , a ) r + γ ∑ s ′ p ( s ′ ∣ s , a ) v π k ( j ) ( s ′ ) ) ,        s ∈ S v_{\pi_k}^{(j+1)} = \sum_a \pi_k(a|s) \left( \sum_rp(r| s, a)r + \gamma \sum_{s'} p(s' | s, a)v_{\pi_k}^{(j)} (s')\right), \;\;\; s \in \mathcal{S} vπk(j+1)=aπk(as)(rp(rs,a)r+γsp(ss,a)vπk(j)(s)),sS
    j → ∞ j \rightarrow \infty j j j j 足够大或 ∥ v π k ( j + 1 ) − v π k ( j ) ∥ \| v^{(j+1)}_{\pi_k} - v^{(j)}_{\pi_k} \| vπk(j+1)vπk(j)足够小时,迭代停止。

  • step 2: policy improvement

    matrix-vector form:
    π k + 1 = argmax π ( r π + γ P π v π k ) \mathbf{\pi}_{k+1} = \text{argmax}_{\pi}(\mathbf{r}_{\pi} + \gamma \mathbf{P}_\pi \mathbf{v}_{\pi_k}) πk+1=argmaxπ(rπ+γPπvπk)
    elementwise form:
    π k + 1 ( s ) = argmax π ∑ a π ( a ∣ s ) ( ∑ r p ( r ∣ s , a ) r + γ ∑ s ′ p ( s ′ ∣ s , a ) v π k ( s ′ ) )        s ∈ S = argmax π ∑ a π ( a ∣ s ) q π k ( s , a ) \begin{align*} \pi_{k+1}(s) &= \text{argmax}_\pi \sum_a \pi(a | s)\left( \sum_r p(r | s, a)r + \gamma \sum_{s'} p(s' | s, a) v_{\pi_k}(s') \right) \;\;\; s \in \mathcal{S}\\ &= \text{argmax}_\pi \sum_a \pi(a | s) q_{\pi_k}(s, a) \end{align*} πk+1(s)=argmaxπaπ(as)(rp(rs,a)r+γsp(ss,a)vπk(s))sS=argmaxπaπ(as)qπk(s,a)
    其中, q π k q_{\pi_k} qπk是policy π k \pi_k πk 下的action value:
    a k ∗ ( s ) = argmax a q π k ( a , s ) a^*_k(s) = \text{argmax}_a q_{\pi_k}(a, s) ak(s)=argmaxaqπk(a,s)
    greedy policy为:
    π k + 1 ( a ∣ s ) = { 1 a = a k ∗ ( s ) , 0 a ≠ a k ∗ ( s ) . \pi_{k+1}(a | s) = \begin{cases} 1 & a = a^*_k(s), \\ 0 & a \ne a^*_k(s). \end{cases} πk+1(as)={10a=ak(s),a=ak(s).

在这里插入图片描述

Simple example:

在这里插入图片描述

reward设置: r boundary = − 1 r_{\text{boundary}} = -1 rboundary=1 r target = 1 r_{\text{target}} = 1 rtarget=1,discount rate γ = 0.9 \gamma=0.9 γ=0.9

action: a ℓ a_\ell a a 0 a_0 a0 a r a_{r} ar代表向左、保持不变和向右。

迭代: k = 0 k=0 k=0:

  • step1: policy evaluation

    π 0 \pi_0 π0为上图(a),Bellman公式为:
    v π 0 ( s 1 ) = − 1 + γ v π 0 ( s 1 ) v π 0 ( s 2 ) = 0 + γ v π 0 ( s 1 ) \begin{align*} &v_{\pi_0}(s_1) = -1 + \gamma v_{\pi_0}(s_1) \\ &v_{\pi_0}(s_2) = 0 + \gamma v_{\pi_0}(s_1) \end{align*} vπ0(s1)=1+γvπ0(s1)vπ0(s2)=0+γvπ0(s1)
    直接计算等式:
    v π 0 ( s 1 ) = − 10 v π 0 ( s 2 ) = − 9 \begin{align*} &v_{\pi_0}(s_1) = -10\\ &v_{\pi_0}(s_2) = -9 \end{align*} vπ0(s1)=10vπ0(s2)=9
    迭代计算等式:

    假定初始 v π 0 ( 0 ) ( s 1 ) = v π 0 ( 0 ) ( s 2 ) = 0 v^{(0)}_{\pi_0}(s_1) = v^{(0)}_{\pi_0}(s_2) = 0 vπ0(0)(s1)=vπ0(0)(s2)=0,则
    { v π 0 ( 1 ) ( s 1 ) = − 1 + γ v π 0 ( 0 ) ( s 1 ) = − 1 v π 0 ( 1 ) ( s 2 ) = 0 + γ v π 0 ( 0 ) ( s 1 ) = 0 { v π 0 ( 2 ) ( s 1 ) = − 1 + γ v π 0 ( 1 ) ( s 1 ) = − 1.9 v π 0 ( 2 ) ( s 2 ) = 0 + γ v π 0 ( 1 ) ( s 1 ) = − 0.9 { v π 0 ( 3 ) ( s 1 ) = − 1 + γ v π 0 ( 2 ) ( s 1 ) = − 2.71 v π 0 ( 3 ) ( s 2 ) = 0 + γ v π 0 ( 2 ) ( s 1 ) = − 1.71 ⋯ \begin{align*} &\begin{cases} v^{(1)}_{\pi_0}(s_1) = -1 + \gamma v^{(0)}_{\pi_0}(s_1) = -1\\ v^{(1)}_{\pi_0}(s_2) = 0 + \gamma v^{(0)}_{\pi_0}(s_1) = 0 \end{cases} \\ &\begin{cases} v^{(2)}_{\pi_0}(s_1) = -1 + \gamma v^{(1)}_{\pi_0}(s_1) = -1.9\\ v^{(2)}_{\pi_0}(s_2) = 0 + \gamma v^{(1)}_{\pi_0}(s_1) = -0.9 \end{cases} \\ &\begin{cases} v^{(3)}_{\pi_0}(s_1) = -1 + \gamma v^{(2)}_{\pi_0}(s_1) = -2.71\\ v^{(3)}_{\pi_0}(s_2) = 0 + \gamma v^{(2)}_{\pi_0}(s_1) = -1.71 \end{cases} \\ \end{align*}\\ \cdots {vπ0(1)(s1)=1+γvπ0(0)(s1)=1vπ0(1)(s2)=0+γvπ0(0)(s1)=0{vπ0(2)(s1)=1+γvπ0(1)(s1)=1.9vπ0(2)(s2)=0+γvπ0(1)(s1)=0.9{vπ0(3)(s1)=1+γvπ0(2)(s1)=2.71vπ0(3)(s2)=0+γvπ0(2)(s1)=1.71

  • step 2: policy improvement

    q π k ( s , a ) q_{\pi_k}(s, a) qπk(s,a)为:

在这里插入图片描述

替换 v π 0 ( s 1 ) = − 10 v_{\pi_0}(s_1) = -10 vπ0(s1)=10 v π 0 ( s 2 ) = − 9 v_{\pi_0}(s_2) = -9 vπ0(s2)=9 γ = 0.9 \gamma=0.9 γ=0.9,得:

在这里插入图片描述

通过寻找 q π 0 q_{\pi_0} qπ0的最大值,提高的policy为:
π 1 ( a r ∣ s 1 ) = 1 π 1 ( a 0 ∣ s 2 ) = 1 \pi_1(a_r | s_1) = 1\\ \pi_1(a_0 | s_2) = 1 π1(ars1)=1π1(a0s2)=1
迭代一次之后,policy达到最优

Truncated Policy Iteration Algorithm

Compare value iteration and policy iteration

Policy iteration: 从 π 0 \pi_0 π0开始

  • policy evaluation (PE):
    v π k = r π k + γ P π k v π k \mathbf{v}_{\pi_k} = \mathbf{r}_{\pi_k} + \gamma \mathbf{P}_{\pi_k}\mathbf{v}_{\pi_k} vπk=rπk+γPπkvπk

  • policy imporovement (PI):
    π k + 1 = argmax π ( r π + γ P π v π k ) \pi_{k+1} = \text{argmax}_{\pi}(\mathbf{r}_{\pi} + \gamma \mathbf{P}_{\pi} \mathbf{v}_{\pi_k}) πk+1=argmaxπ(rπ+γPπvπk)

value iteration: 从 v 0 v_0 v0开始

  • policy update (PU):
    π k + 1 = argmax π ( r π + γ P π v π k ) \pi_{k+1} = \text{argmax}_{\pi}(\mathbf{r}_{\pi} + \gamma \mathbf{P}_{\pi} \mathbf{v}_{\pi_k}) πk+1=argmaxπ(rπ+γPπvπk)

  • value update (VU):
    v k + 1 = r π k + 1 + γ P π k + 1 v k \mathbf{v}_{k+1} = \mathbf{r}_{\pi_{k+1}} + \gamma \mathbf{P}_{\pi_{k+1}}\mathbf{v}_k vk+1=rπk+1+γPπk+1vk

两个算法十分相似:

policy iteration: π 0 → P E v π 0 → P I π 1 → P E v π 1 → P I π 2 → P E v π 2 → P I ⋯ \pi_0 \xrightarrow[]{PE} v_{\pi_0} \xrightarrow[]{PI} \pi_1 \xrightarrow[]{PE} v_{\pi_1} \xrightarrow[]{PI} \pi_2 \xrightarrow[]{PE} v_{\pi_2} \xrightarrow[]{PI} \cdots π0PE vπ0PI π1PE vπ1PI π2PE vπ2PI

value iteraton: u 0 → P U π 1 ′ → V U u 1 → P U π 2 ′ → V U u 2 → P U ⋯ u_0 \xrightarrow[]{PU} \pi'_1 \xrightarrow[]{VU} u_1 \xrightarrow[]{PU} \pi_2' \xrightarrow[]{VU} u_2 \xrightarrow[]{PU} \cdots u0PU π1VU u1PU π2VU u2PU

对两个算法详细比较:

在这里插入图片描述

  • 两个算法的初始条件是相同的

  • 两个算法的前三步是相同的

  • 两个算法的第四步是不同的:

    在policy iteration中,计算 v π 1 = r π 1 + γ P π 1 v π 1 \mathbf{v}_{\pi_1} = \mathbf{r}_{\pi_1} + \gamma \mathbf{P}_{\pi_1}\mathbf{v}_{\pi_1} vπ1=rπ1+γPπ1vπ1需要一个迭代算法

    在value iteration中,计算 v 1 = r π 1 + γ P π 1 v 0 \mathbf{v}_1 = \mathbf{r}_{\pi_{1}} + \gamma \mathbf{P}_{\pi_{1}}\mathbf{v}_0 v1=rπ1+γPπ1v0是一个one-step算法

考虑计算 v π 1 = r π 1 + γ P π 1 v π 1 \mathbf{v}_{\pi_1} = \mathbf{r}_{\pi_1} + \gamma \mathbf{P}_{\pi_1}\mathbf{v}_{\pi_1} vπ1=rπ1+γPπ1vπ1这一步:

在这里插入图片描述

  • value iteration算法只计算一次
  • policy iteraton算法计算“无穷”次
  • truncated policy iteration算法计算有限次。剩下的从 j j j ∞ \infty 次的迭代被省略。
    在这里插入图片描述

truncted policy iteration是否会收敛:

考虑解决policy evaluaion的迭代算法:
v π k ( j + 1 ) = r π k + γ P π k v π k ( j )        j = 0 , 1 , 2 , ⋯ \mathbf{v}_{\pi_k}^{(j+1)} = \mathbf{r}_{\pi_k} + \gamma \mathbf{P}_{\pi_k}\mathbf{v}_{\pi_k}^{(j)} \;\;\; j=0, 1, 2, \cdots vπk(j+1)=rπk+γPπkvπk(j)j=0,1,2,
如果初始状态 v π k ( 0 ) = v π k − 1 v_{\pi_k}^{(0)}=v_{\pi_{k-1}} vπk(0)=vπk1,那么:
v π k ( j + 1 ) ≥ v π k ( j ) v_{\pi_k}^{(j+1)} \ge v_{\pi_k}^{(j)} vπk(j+1)vπk(j)
对所有 j j j 成立。

在这里插入图片描述

由上图可知,因为policy iteration和value iteration都会收敛到optimal state value,并且truncated policy iteration夹在两者之间,则由夹逼准则可知,其一定会收敛到最优。

例子

如上图为初始状态,定义 ∥ v k − v ∗ ∥ \| v_k - v^* \| vkv为在步骤 k k k时的state error。算法停止的标准为 ∥ v k − v ∗ ∥ < 0.01 \| v_k - v^* \| < 0.01 vkv<0.01

在这里插入图片描述

  • truncated policy iteration- x x x代表truncated policy iteration算法中policy evaluation的迭代次数。
  • x x x越大代表值估计收敛的越快。
  • x x x不断增加时,其对收敛速度的贡献越来越小。
  • 因此,实际上,仅迭代少数几次就已足够。

Summary

Value iteration:

求解Bellman最优等式的迭代算法,给定初始value v 0 v_0 v0
v k + 1 = m a x π ( r π + γ P π v k ) ⇕ { Policy update : π k + 1 = argmax π ( r π + γ P π v k ) Value update : v k + 1 = r π k + 1 + γ P π k + 1 v k v_{k+1} = max_{\pi}(r_{\pi} + \gamma P_{\pi}v_k)\\ \Updownarrow \\ \begin{cases} \text{Policy update}: \pi_{k+1} = \text{argmax}_{\pi}(r_{\pi} + \gamma P_{\pi} v_k) \\ \text{Value update}: v_{k+1} = r_{\pi_{k+1}} + \gamma P_{\pi_{k+1}}v_k \end{cases} vk+1=maxπ(rπ+γPπvk){Policy update:πk+1=argmaxπ(rπ+γPπvk)Value update:vk+1=rπk+1+γPπk+1vk
Policy iteration: 给定初始 policy π 0 \pi_0 π0
{ Policy evaluation : v π k = r π k + γ P π k v π k Policy improvement : π k + 1 = argmax π ( r π + γ P π v π k ) \begin{cases} \text{Policy evaluation}: v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k}v_{\pi_k}\\ \text{Policy improvement}: \pi_{k+1} = \text{argmax}_{\pi}(r_{\pi} + \gamma P_{\pi}v_{\pi_k}) \end{cases} {Policy evaluation:vπk=rπk+γPπkvπkPolicy improvement:πk+1=argmaxπ(rπ+γPπvπk)
Truncated policy iteration



以上内容为B站西湖大学智能无人系统 强化学习的数学原理 公开课笔记。

  • 19
    点赞
  • 15
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值