Reinforcement Learning with Code 【Chapter 4. Value Iteration and Policy Iteration】


This note records how the author began to learn RL. Both theoretical understanding and code practice are presented. Many materials are referenced, such as Zhao Shiyu's Mathematical Foundations of Reinforcement Learning.

Chapter 4. Value Iteration and Policy Iteration

Value iteration and policy iteration share a common name: dynamic programming. Dynamic programming is model-based and is the simplest class of RL algorithms. Understanding it helps us understand model-free algorithms.

4.1 Value iteration

Value iteration solves the Bellman optimality equation directly.

Matrix-vector form:

Value iteration is exactly the algorithm suggested by the contraction mapping theorem in Chapter 3. Each iteration of value iteration consists of two steps. The first step is the policy update:

$$\pi_{k+1} = \arg \textcolor{red}{\max_{\pi}(r_{\pi} + \gamma P_{\pi} v_k)}$$
The second step is the value update. Mathematically, it substitutes $\pi_{k+1}$ and performs the following operation:

$$v_{k+1} = r_{\pi_{k+1}} + \gamma P_{\pi_{k+1}} v_k$$

The above algorithm is in matrix-vector form, which is useful for understanding the core idea. The equation is iterative: computing $v_{k+1}$ from $v_k$ is a single-step calculation, not the solution of an equation.

A common misunderstanding should be avoided: we do not solve the Bellman equation to obtain the state value $v_{k+1}$ directly; $v_{k+1}$ is just an intermediate value, not the state value of $\pi_{k+1}$. Instead, with the greedy policy update strategy, the new value $v_{k+1}(s)$ is simply the maximum action value $\max_a q_k(s,a)$.

Value iteration thus embeds the solution of the Bellman optimality equation, the part shown in red above.
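
To make the one-step nature of the value update concrete, here is a minimal NumPy sketch of a single matrix-vector update $v_{k+1} = r_{\pi_{k+1}} + \gamma P_{\pi_{k+1}} v_k$; the 3-state rewards and transition matrix (`r_pi`, `P_pi`) are made-up values for illustration, not from the original note.

```python
import numpy as np

gamma = 0.9
r_pi = np.array([0.0, 1.0, 1.0])        # r_{pi_{k+1}}: expected immediate reward per state (made up)
P_pi = np.array([[0.0, 1.0, 0.0],       # P_{pi_{k+1}}: row s gives p(s'|s) under pi_{k+1} (made up)
                 [0.0, 0.0, 1.0],
                 [0.0, 0.0, 1.0]])
v_k = np.zeros(3)                       # current value vector v_k

v_k1 = r_pi + gamma * P_pi @ v_k        # one value-update step; no equation is solved here
print(v_k1)
```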

Elementwise form:

First, the elementwise form of the policy update is

$$
\begin{aligned}
\pi_{k+1} & = \arg \max_{\pi}(r_{\pi} + \gamma P_{\pi} v_k) \quad \text{(matrix-vector form)}\\
\to \pi_{k+1}(s) & = \arg \max_{\pi} \sum_a \pi(a|s) \underbrace{ \Big(\sum_r p(r|s,a)r + \gamma \sum_{s^\prime} p(s^\prime|s,a) v_k(s^\prime) \Big)}_{q_k(s,a)} \quad \text{(elementwise form)}\\
\to \pi_{k+1}(s) & = \arg \max_{\pi} \sum_a \pi(a|s)\, q_k(s,a)
\end{aligned}
$$

Then use the greedy policy update algorithm

$$\pi_{k+1}(a|s) = \begin{cases} 1 & a = a_k^*(s) \\ 0 & a \ne a_k^*(s) \end{cases}$$

where $a_k^*(s) = \arg\max_a q_k(s,a)$.
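
As a small sketch, assuming the action values of the current iteration are stored in a hypothetical NumPy array `q_k[s, a]`, the greedy policy update can be written as follows.

```python
import numpy as np

# Greedy policy update from a table of action values (made-up numbers).
q_k = np.array([[0.0, 1.5, 0.3],
                [2.0, 0.1, 0.1]])           # 2 states, 3 actions

a_star = np.argmax(q_k, axis=1)             # a_k^*(s) = argmax_a q_k(s, a)
pi_k1 = np.zeros_like(q_k)                  # pi_{k+1}(a|s), all zeros
pi_k1[np.arange(q_k.shape[0]), a_star] = 1  # put probability 1 on the greedy action of each state
```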

Second, the elementwise form of the value update is

$$
\begin{aligned}
v_{k+1} & = r_{\pi_{k+1}} + \gamma P_{\pi_{k+1}} v_k \quad \text{(matrix-vector form)}\\
\to v_{k+1}(s) & = \sum_a \pi_{k+1}(a|s) \underbrace{ \Big(\sum_r p(r|s,a)r + \gamma \sum_{s^\prime} p(s^\prime|s,a) v_k(s^\prime) \Big)}_{q_k(s,a)} \quad \text{(elementwise form)}
\end{aligned}
$$

Since $\pi_{k+1}$ is a greedy policy, the above equation is simply

$$v_{k+1}(s) = \max_a q_k(s,a)$$

Pseudocode:

[Image: pseudocode of the value iteration algorithm]
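
Below is a minimal Python sketch of the elementwise value iteration, assuming the model is given as hypothetical NumPy arrays `P[s, a, s']` (transition probabilities $p(s'|s,a)$) and `R[s, a]` (expected immediate rewards $\sum_r p(r|s,a)\,r$). It is one way to implement the pseudocode above, not the author's original code.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6, max_iter=1000):
    """Value iteration for a finite MDP.

    P: transition probabilities, shape (n_states, n_actions, n_states), P[s, a, s'] = p(s'|s, a)
    R: expected immediate rewards, shape (n_states, n_actions), R[s, a] = sum_r p(r|s, a) * r
    Returns the (approximately) optimal values v and a deterministic greedy policy (action indices).
    """
    n_states, n_actions, _ = P.shape
    v = np.zeros(n_states)                       # v_0
    for _ in range(max_iter):
        q = R + gamma * P @ v                    # q_k(s, a) = R[s, a] + gamma * sum_{s'} P[s, a, s'] v_k(s')
        v_new = q.max(axis=1)                    # value update: v_{k+1}(s) = max_a q_k(s, a)
        if np.max(np.abs(v_new - v)) < tol:      # stop when the update is small enough
            v = v_new
            break
        v = v_new
    policy = q.argmax(axis=1)                    # policy update: greedy with respect to the final q
    return v, policy
```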

4.2 Policy iteration

Policy iteration is not an algorithm that solves the Bellman optimality equation directly. However, it is closely related to value iteration. Policy iteration consists of two steps: policy evaluation and policy improvement.

Policy evaluation:

The first step is policy evaluation. Mathematically, it solves the Bellman equation of $\pi_k$:

$$v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}$$

This is the matrix-vector form of the Bellman equation, where $r_{\pi_k}$ and $P_{\pi_k}$ are known. Here, $v_{\pi_k}$ is the state value to be solved.

How to calculate the state value $v_{\pi_k}$ is important. In Section 2.4 we already introduced how to solve the Bellman equation iteratively to obtain the state value $v_{\pi_k}$:

$$\textcolor{blue}{v_{\pi_k}^{(j+1)} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}^{(j)}}, \quad j = 0, 1, 2, \dots$$

Clearly, policy iteration calculates the state value with the above iteration, which in principle is an infinite-step calculation; in practice it is truncated once the change between iterations is small enough.
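
Here is a sketch of this truncated policy-evaluation loop, assuming $r_{\pi_k}$ and $P_{\pi_k}$ are available as hypothetical NumPy arrays `r_pi` and `P_pi`; the closed-form solution $v_{\pi_k} = (I - \gamma P_{\pi_k})^{-1} r_{\pi_k}$ is included in a comment as a sanity check.

```python
import numpy as np

def policy_evaluation(r_pi, P_pi, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Iteratively solve v_pi = r_pi + gamma * P_pi @ v_pi (truncated, not truly infinite)."""
    v = np.zeros(len(r_pi))                 # v_{pi_k}^{(0)}
    for _ in range(max_iter):
        v_next = r_pi + gamma * P_pi @ v    # v^{(j+1)} = r_pi + gamma * P_pi v^{(j)}
        if np.max(np.abs(v_next - v)) < tol:
            return v_next
        v = v_next
    return v

# Closed-form check: v_pi = (I - gamma * P_pi)^{-1} r_pi
# v_exact = np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)
```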

Policy improvement:

The second step is to improve the policy. How? Once $v_{\pi_k}$ has been calculated in the first step, a new and improved policy can be obtained as

$$\pi_{k+1} = \arg \max_{\pi} (r_{\pi} + \gamma P_{\pi} v_{\pi_k})$$

Elementwise form:

The policy evaluation step is to solve $v_{\pi_k}$ from the Bellman equation:

$$
\begin{aligned}
v_{\pi_k}^{(j+1)} & = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}^{(j)} \quad \text{(matrix-vector form)} \\
\to v_{\pi_k}^{(j+1)}(s) & = \sum_a \pi_k(a|s) \Big(\sum_r p(r|s,a)r + \gamma \sum_{s^\prime} p(s^\prime|s,a) v_{\pi_k}^{(j)}(s^\prime) \Big) \quad \text{(elementwise form)}
\end{aligned}
$$

where $j = 0, 1, 2, \dots$

The policy improvement step is to solve $\pi_{k+1} = \arg\max_{\pi} (r_{\pi} + \gamma P_{\pi} v_{\pi_k})$.

$$
\begin{aligned}
\pi_{k+1} & = \arg \max_{\pi} (r_{\pi} + \gamma P_{\pi} v_{\pi_k}) \quad \text{(matrix-vector form)} \\
\to \pi_{k+1}(s) & = \arg \max_{\pi} \sum_a \pi(a|s) \underbrace{ \Big(\sum_r p(r|s,a)r + \gamma \sum_{s^\prime} p(s^\prime|s,a) v_{\pi_k}(s^\prime) \Big)}_{q_{\pi_k}(s,a)} \quad \text{(elementwise form)}
\end{aligned}
$$

Let $a_k^*(s) = \arg\max_a q_{\pi_k}(s,a)$. The greedy policy is then

$$\pi_{k+1}(a|s) = \begin{cases} 1 & a = a_k^*(s) \\ 0 & a \ne a_k^*(s) \end{cases}$$

Pseudocode:

[Image: pseudocode of the policy iteration algorithm]
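
Below is a minimal policy iteration sketch under the same assumed model arrays `P[s, a, s']` and `R[s, a]` as in the value iteration example (hypothetical names); the inner policy evaluation is truncated at a tolerance rather than run for infinitely many steps.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, eval_tol=1e-8, max_iter=1000):
    """Policy iteration for a finite MDP with model P[s, a, s'] and expected rewards R[s, a]."""
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)             # start from an arbitrary deterministic policy
    for _ in range(max_iter):
        # --- policy evaluation: solve v_{pi_k} = r_{pi_k} + gamma * P_{pi_k} v_{pi_k} iteratively ---
        r_pi = R[np.arange(n_states), policy]           # r_{pi_k}(s) = R[s, pi_k(s)]
        P_pi = P[np.arange(n_states), policy]           # P_{pi_k}[s, s'] = P[s, pi_k(s), s']
        v = np.zeros(n_states)
        while True:
            v_next = r_pi + gamma * P_pi @ v
            if np.max(np.abs(v_next - v)) < eval_tol:
                v = v_next
                break
            v = v_next
        # --- policy improvement: pi_{k+1}(s) = argmax_a q_{pi_k}(s, a) ---
        q = R + gamma * P @ v                           # q_{pi_k}(s, a)
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):          # stop when the policy no longer changes
            break
        policy = new_policy
    return v, policy
```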

Reference

Zhao Shiyu, Mathematical Foundations of Reinforcement Learning (course lectures).
