【Mathematical Foundations of Reinforcement Learning】Lecture 4: Value Iteration and Policy Iteration

【Value iteration algorithm】

$$v_{k+1}=f(v_k)=\max_\pi\left(r_\pi+\gamma P_\pi v_k\right), \quad k=1,2,3,\ldots$$

  • Step 1 (policy update): given $v_k$, solve for $\pi_{k+1}$:
    $\pi_{k+1}=\arg\max_\pi\left(r_\pi+\gamma P_\pi v_k\right)$

  • Step 2 (value update): use the new $\pi_{k+1}$ to compute $v_{k+1}$:
    $v_{k+1}=r_{\pi_{k+1}}+\gamma P_{\pi_{k+1}} v_k$

Question: is $v_k$ a state value?

Answer: no. A state value must satisfy the Bellman equation $v_{\pi}=r_{\pi}+\gamma P_{\pi} v_{\pi}$, with the same $v_\pi$ on both sides. In the value update $v_{k+1}=r_{\pi_{k+1}}+\gamma P_{\pi_{k+1}} v_k$ the two sides carry different iterates, so $v_k$ is only an intermediate quantity used to approach the optimal state value.
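Written side by side, the two formulas above make the difference explicit:

$$\underbrace{v_{k+1}=r_{\pi_{k+1}}+\gamma P_{\pi_{k+1}} v_k}_{\text{value update: different vectors on the two sides}} \qquad\text{vs.}\qquad \underbrace{v_{\pi_{k+1}}=r_{\pi_{k+1}}+\gamma P_{\pi_{k+1}} v_{\pi_{k+1}}}_{\text{Bellman equation: its solution is the state value}}$$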

✨policy update:

$$\pi_{k+1}=\arg\max_\pi\left(r_\pi+\gamma P_\pi v_k\right)$$

$$\pi_{k+1}(s)=\arg\max_\pi \sum_a \pi(a \mid s) \underbrace{\left(\sum_r p(r \mid s, a)\, r+\gamma \sum_{s'} p(s' \mid s, a)\, v_k(s')\right)}_{q_k(s, a)}, \quad s \in \mathcal{S}$$

With $a_k^*(s)=\arg\max_a q_k(s, a)$, the optimal solution of the above maximization is
$$\pi_{k+1}(a \mid s)= \begin{cases} 1 & a=a_k^*(s) \\ 0 & a \neq a_k^*(s) \end{cases}$$
Because it always picks the action with the largest q-value, $\pi_{k+1}$ is called a greedy policy.

✨value update:

$$v_{k+1}=r_{\pi_{k+1}}+\gamma P_{\pi_{k+1}} v_k$$

$$v_{k+1}(s)=\sum_a \pi_{k+1}(a \mid s) \underbrace{\left(\sum_r p(r \mid s, a)\, r+\gamma \sum_{s'} p(s' \mid s, a)\, v_k(s')\right)}_{q_k(s, a)}, \quad s \in \mathcal{S}$$

Since $\pi_{k+1}$ is greedy, this reduces to $v_{k+1}(s)=\max_a q_k(s, a)$; a small numpy sketch of both steps on a q-table follows.
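As a quick illustration of these two steps, here is a minimal numpy sketch that assumes the q-values $q_k(s,a)$ have already been collected into a 2-D array `q` (rows are states, columns are actions); the array contents are made up purely for illustration.

```python
import numpy as np

# q[s, a] = q_k(s, a) for a small hypothetical MDP with 3 states and 2 actions
q = np.array([[0.5, 1.0],
              [2.0, 0.3],
              [0.1, 0.1]])

# policy update: greedy (one-hot) policy pi_{k+1}(a|s)
a_star = q.argmax(axis=1)                      # a_k^*(s) = argmax_a q_k(s, a)
pi_new = np.zeros_like(q)
pi_new[np.arange(q.shape[0]), a_star] = 1.0    # pi_{k+1}(a|s) = 1 iff a = a_k^*(s)

# value update: because pi_{k+1} is greedy, v_{k+1}(s) = max_a q_k(s, a)
v_new = q.max(axis=1)
```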

✨Value iteration algorithm pseudocode:

$$v_k(s) \rightarrow q_k(s, a) \rightarrow \text{greedy policy } \pi_{k+1}(a \mid s) \rightarrow \text{new value } v_{k+1}(s)=\max_a q_k(s, a)$$

Pseudocode:

  • Initialization: the model, i.e., $p(r \mid s, a)$ and $p(s' \mid s, a)$ for all $(s,a)$, is known; an initial guess $v_0$
  • Goal: find the optimal state value and an optimal policy that solve the Bellman optimality equation
  • While $\|v_k-v_{k-1}\|$ is still larger than a given threshold, for the $k$-th iteration:
    • For every state $s \in \mathcal{S}$:
      • For every action $a \in \mathcal{A}(s)$, compute the q-value: $q_k(s, a)=\sum_r p(r \mid s, a)\, r+\gamma \sum_{s'} p(s' \mid s, a)\, v_k(s')$
      • Pick the action with the largest q-value: $a_k^*(s)=\arg\max_a q_k(s, a)$
      • Policy update: $\pi_{k+1}(a \mid s)=1$ if $a=a_k^*(s)$, and $\pi_{k+1}(a \mid s)=0$ otherwise
      • Value update: $v_{k+1}(s)=\max_a q_k(s, a)$

Intuition: for each state, first compute the q-values; picking the largest one tells us how to act in that state. Then update the policy, then update the value, and repeat until the values converge to the optimum. A runnable sketch of the whole loop is given below.
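Below is a minimal tabular implementation of the loop above. It assumes the model is available as numpy arrays `P[s, a, s']` (transition probabilities) and `R[s, a]` (expected immediate reward $\sum_r p(r\mid s,a)\,r$); these array names and the stopping threshold are illustrative choices, not anything prescribed by the lecture.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Tabular value iteration.

    P: (S, A, S) array, P[s, a, s'] = p(s' | s, a)
    R: (S, A) array,   R[s, a]     = expected immediate reward
    Returns the optimal state values and a greedy (deterministic) policy.
    """
    S, A = R.shape
    v = np.zeros(S)                      # initial guess v_0
    while True:
        # q_k(s, a) = R[s, a] + gamma * sum_{s'} P[s, a, s'] * v_k(s')
        q = R + gamma * P @ v            # shape (S, A)
        v_new = q.max(axis=1)            # value update: v_{k+1}(s) = max_a q_k(s, a)
        if np.max(np.abs(v_new - v)) < tol:
            v = v_new
            break
        v = v_new
    policy = q.argmax(axis=1)            # policy update: greedy action per state
    return v, policy
```

Calling `value_iteration(P, R, gamma=0.9)` on a small grid-world model returns the converged values together with the greedy policy (one action index per state).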

✨Example:

$r_{\text{boundary}}=r_{\text{forbidden}}=-1,\ r_{\text{target}}=1,\ \gamma=0.9$

(figures omitted)

【Policy iteration algorithm】

$$\pi_0 \xrightarrow{\,PE\,} v_{\pi_0} \xrightarrow{\,PI\,} \pi_1 \xrightarrow{\,PE\,} v_{\pi_1} \xrightarrow{\,PI\,} \pi_2 \xrightarrow{\,PE\,} v_{\pi_2} \xrightarrow{\,PI\,} \ldots$$

  • Initialization: start from an arbitrary policy $\pi_0$

  • Step 1 (policy evaluation): solve the Bellman equation to get the state value of the current policy, which tells how good it is:
    $v_{\pi_k}=r_{\pi_k}+\gamma P_{\pi_k} v_{\pi_k}$

  • Step 2 (policy improvement): improve the policy to $\pi_{k+1}$ by the maximization
    $\pi_{k+1}=\arg\max_\pi\left(r_\pi+\gamma P_\pi v_{\pi_k}\right)$

✨policy evaluation:

$$v_{\pi_k}^{(j+1)}=r_{\pi_k}+\gamma P_{\pi_k} v_{\pi_k}^{(j)}, \quad j=0,1,2,\ldots$$

$$v_{\pi_k}^{(j+1)}(s)=\sum_a \pi_k(a \mid s)\left(\sum_r p(r \mid s, a)\, r+\gamma \sum_{s'} p(s' \mid s, a)\, v_{\pi_k}^{(j)}(s')\right), \quad s \in \mathcal{S}$$
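A minimal sketch of this inner iteration, using the same illustrative model arrays `P[s, a, s']` and `R[s, a]` as before and representing the policy as an array `pi[s, a]` $= \pi_k(a\mid s)$:

```python
import numpy as np

def policy_evaluation(pi, P, R, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Iteratively solve v_pi = r_pi + gamma * P_pi * v_pi.

    pi: (S, A) array, pi[s, a] = pi(a | s)
    P:  (S, A, S) array, P[s, a, s'] = p(s' | s, a)
    R:  (S, A) array, expected immediate rewards
    """
    r_pi = (pi * R).sum(axis=1)                 # r_pi(s) = sum_a pi(a|s) * R[s, a]
    P_pi = np.einsum('sa,sat->st', pi, P)       # P_pi[s, s'] = sum_a pi(a|s) * p(s'|s, a)
    v = np.zeros(R.shape[0])                    # initial guess v^{(0)}
    for _ in range(max_iter):
        v_new = r_pi + gamma * P_pi @ v         # v^{(j+1)} = r_pi + gamma * P_pi v^{(j)}
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v
```

As a side note, since the Bellman equation is linear, the same $v_{\pi_k}$ could also be obtained in closed form via `np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)`; the iterative form above is the one this lecture builds on, because it is what gets truncated later.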

✨policy improvement:

$$\pi_{k+1}=\arg\max_\pi\left(r_\pi+\gamma P_\pi v_{\pi_k}\right)$$

$$\pi_{k+1}(s)=\arg\max_\pi \sum_a \pi(a \mid s) \underbrace{\left(\sum_r p(r \mid s, a)\, r+\gamma \sum_{s'} p(s' \mid s, a)\, v_{\pi_k}(s')\right)}_{q_{\pi_k}(s, a)}, \quad s \in \mathcal{S}$$

$q_{\pi_k}(s, a)$ is the action value under policy $\pi_k$. With
$$a_k^*(s)=\arg\max_a q_{\pi_k}(s, a)$$
the greedy policy is
$$\pi_{k+1}(a \mid s)= \begin{cases} 1 & a=a_k^*(s) \\ 0 & a \neq a_k^*(s) \end{cases}$$

✨Policy iteration algorithm pseudocode:

  • Initialization: the model, i.e., $p(r \mid s, a)$ and $p(s' \mid s, a)$ for all $(s,a)$, is known; an initial policy $\pi_0$
  • Goal: find the optimal state value and an optimal policy
  • While the policy has not converged, for the $k$-th iteration:
    • Policy evaluation:
      • Set an initial guess $v_{\pi_k}^{(0)}$
      • For every state $s \in \mathcal{S}$, iterate over $j$ until convergence: $v_{\pi_k}^{(j+1)}(s)=\sum_a \pi_k(a \mid s)\left[\sum_r p(r \mid s, a)\, r+\gamma \sum_{s'} p(s' \mid s, a)\, v_{\pi_k}^{(j)}(s')\right]$
    • Policy improvement:
      • For every state $s \in \mathcal{S}$:
        • For every action $a \in \mathcal{A}(s)$: $q_{\pi_k}(s, a)=\sum_r p(r \mid s, a)\, r+\gamma \sum_{s'} p(s' \mid s, a)\, v_{\pi_k}(s')$
        • Pick the greedy action: $a_k^*(s)=\arg\max_a q_{\pi_k}(s, a)$
        • $\pi_{k+1}(a \mid s)=1$ if $a=a_k^*(s)$, and $\pi_{k+1}(a \mid s)=0$ otherwise
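A minimal sketch of the outer loop, reusing the `policy_evaluation` helper sketched above (same illustrative `P` and `R` arrays; a deterministic policy is stored as one greedy action index per state):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, max_outer=1000):
    """Tabular policy iteration: alternate policy evaluation and greedy improvement."""
    S, A = R.shape
    policy = np.zeros(S, dtype=int)              # arbitrary initial policy pi_0
    for _ in range(max_outer):
        # policy evaluation: v_{pi_k} from the iterative solver sketched earlier
        pi = np.zeros((S, A))
        pi[np.arange(S), policy] = 1.0           # one-hot representation of pi_k
        v = policy_evaluation(pi, P, R, gamma)
        # policy improvement: q_{pi_k}(s, a) and the greedy policy
        q = R + gamma * P @ v
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):   # no change => policy is optimal
            return v, policy
        policy = new_policy
    return v, policy
```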

✨Example:

$r_{\text{boundary}}=-1,\ r_{\text{target}}=1,\ \gamma=0.9$


  • Actions: $a_\ell, a_0, a_r$ mean move left, stay, and move right, respectively

  • Goal: find the optimal policy

  • (figures omitted)

【Truncated policy iteration algorithm】

✨Comparison of value iteration and policy iteration:

policy iteration

Policy iteration: $\pi_0 \xrightarrow{\,PE\,} v_{\pi_0} \xrightarrow{\,PI\,} \pi_1 \xrightarrow{\,PE\,} v_{\pi_1} \xrightarrow{\,PI\,} \pi_2 \xrightarrow{\,PE\,} v_{\pi_2} \xrightarrow{\,PI\,} \ldots$

  • Policy evaluation (PE):
    $v_{\pi_k}=r_{\pi_k}+\gamma P_{\pi_k} v_{\pi_k}$

  • Policy improvement (PI):
    $\pi_{k+1}=\arg\max_\pi\left(r_\pi+\gamma P_\pi v_{\pi_k}\right)$

value iteration

Value iteration: $u_0 \xrightarrow{\,PU\,} \pi_1' \xrightarrow{\,VU\,} u_1 \xrightarrow{\,PU\,} \pi_2' \xrightarrow{\,VU\,} u_2 \xrightarrow{\,PU\,} \ldots$

  • Policy update (PU):
    $\pi_{k+1}=\arg\max_\pi\left(r_\pi+\gamma P_\pi v_k\right)$

  • Value update (VU):
    $v_{k+1}=r_{\pi_{k+1}}+\gamma P_{\pi_{k+1}} v_k$

(figure omitted)

  • As the comparison shows, the first three steps of the two algorithms are identical.

  • The fourth step differs:


    • Policy iteration: $v_{\pi_1}=r_{\pi_1}+\gamma P_{\pi_1} v_{\pi_1}$ requires an inner iteration to solve (the iterative algorithm for the Bellman equation):
      $$\begin{aligned} v_{\pi_1}^{(0)}&=v_0 \\ v_{\pi_1}^{(1)}&=r_{\pi_1}+\gamma P_{\pi_1} v_{\pi_1}^{(0)} \\ v_{\pi_1}^{(2)}&=r_{\pi_1}+\gamma P_{\pi_1} v_{\pi_1}^{(1)} \\ &\ \ \vdots \\ v_{\pi_1}^{(j)}&=r_{\pi_1}+\gamma P_{\pi_1} v_{\pi_1}^{(j-1)} \\ &\ \ \vdots \\ v_{\pi_1}^{(\infty)}&=r_{\pi_1}+\gamma P_{\pi_1} v_{\pi_1}^{(\infty)} \end{aligned}$$

    • Value iteration: $v_1=r_{\pi_1}+\gamma P_{\pi_1} v_0$ is just a single one-step computation.

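In other words (restating the contrast above), the two algorithms differ only in how far this inner iteration is carried:

$$v_{\pi_1}^{(j)}=r_{\pi_1}+\gamma P_{\pi_1} v_{\pi_1}^{(j-1)}, \qquad \text{value iteration stops at } j=1, \quad \text{policy iteration runs } j \to \infty,$$

and the truncated policy iteration introduced next stops at some finite $j_{\text{truncate}}$ in between.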

✨Truncated policy iteration pseudocode:

  • Initialization: the model, i.e., $p(r \mid s, a)$ and $p(s' \mid s, a)$ for all $(s,a)$, is known; an initial policy $\pi_0$
  • Goal: find the optimal state value and an optimal policy
  • While the policy has not converged, for the $k$-th iteration:
    • Policy evaluation (truncated):
      • Initialization: $v_k^{(0)}=v_{k-1}$; the maximum number of inner iterations is $j_{\text{truncate}}$
      • While $j < j_{\text{truncate}}$:
        • For every state $s \in \mathcal{S}$: $v_k^{(j+1)}(s)=\sum_a \pi_k(a \mid s)\left[\sum_r p(r \mid s, a)\, r+\gamma \sum_{s'} p(s' \mid s, a)\, v_k^{(j)}(s')\right]$
      • Set $v_k=v_k^{(j_{\text{truncate}})}$
    • Policy improvement:
      • For every state $s \in \mathcal{S}$:
        • For every action $a \in \mathcal{A}(s)$: $q_k(s, a)=\sum_r p(r \mid s, a)\, r+\gamma \sum_{s'} p(s' \mid s, a)\, v_k(s')$
        • $a_k^*(s)=\arg\max_a q_k(s, a)$
        • $\pi_{k+1}(a \mid s)=1$ if $a=a_k^*(s)$, and $\pi_{k+1}(a \mid s)=0$ otherwise
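A minimal sketch of the whole algorithm, using the same illustrative model arrays `P[s, a, s']` and `R[s, a]` as in the earlier snippets; `j_truncate` is the number of inner evaluation sweeps.

```python
import numpy as np

def truncated_policy_iteration(P, R, gamma=0.9, j_truncate=5, max_outer=1000):
    """Truncated policy iteration on a tabular MDP.

    P: (S, A, S) array, P[s, a, s'] = p(s' | s, a)
    R: (S, A) array, expected immediate rewards
    """
    S, A = R.shape
    v = np.zeros(S)                                  # v_0
    policy = np.zeros(S, dtype=int)                  # arbitrary initial policy pi_0
    for _ in range(max_outer):
        # --- truncated policy evaluation: j_truncate sweeps starting from v_{k-1}
        pi = np.zeros((S, A))
        pi[np.arange(S), policy] = 1.0               # one-hot representation of pi_k
        r_pi = (pi * R).sum(axis=1)
        P_pi = np.einsum('sa,sat->st', pi, P)
        for _ in range(j_truncate):
            v = r_pi + gamma * P_pi @ v              # v^{(j+1)} = r_pi + gamma * P_pi v^{(j)}
        # --- policy improvement: greedy with respect to q_k(s, a)
        q = R + gamma * P @ v
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):       # policy has converged
            break
        policy = new_policy
    return v, policy
```

With `j_truncate=1` this reduces to value iteration, and for a very large `j_truncate` it behaves like policy iteration.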

(figure omitted)

The figure compares the behavior of the three algorithms (value iteration, policy iteration, and truncated policy iteration).
