Chapter 4: Dynamic Programming

1 Introduction

The term dynamic programming (DP) refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as a Markov decision process (MDP).

Assumption:

Perfect model of the environment dynamics $p(s',r|s,a)$.

Advantages:

Simple, rigorous, relatively efficient.
A DP method is guaranteed to find an optimal policy (i.e., to converge) in polynomial time, even though the total number of deterministic policies is $k^n$ (where $k$ is the number of actions and $n$ the number of states).

Disadvantages:

  • Computation time is polynomial in the number of states and actions, which becomes costly when both are large;

  • DP is sometimes thought to be of limited applicability because of the curse of dimensionality: when there are too many states, it is impractical to update each state based on the estimates of all the other states;

  • Characteristic: DP methods bootstrap, i.e., they update estimates on the basis of other estimates.

Convergence

  • For a finite MDP, DP is guaranteed to converge in polynomial time as long as all states are continually visited.

Applications in practice

In practice, DP methods can be used with today’s computers to solve MDPs with millions of states.

2 Policy evaluation (prediction)

Policy evaluation considers how to compute the state-value function $v_{\pi}$ for an arbitrary policy $\pi$. The Bellman equation gives:
$$
v_{\pi}(s)=\sum_{a}\pi(a|s)\sum_{s'}\sum_{r}p(s',r|s,a)\bigl[r+\gamma v_{\pi}(s')\bigr], \quad \text{for all } s\in \mathcal{S} \tag{4.4}
$$

$$
v_{k+1}(s)=\sum_{a}\pi(a|s)\sum_{s'}\sum_{r}p(s',r|s,a)\bigl[r+\gamma v_{k}(s')\bigr], \quad \text{for all } s\in \mathcal{S} \tag{4.5}
$$

If the model (the environment dynamics) is fully known, Eq. (4.4) is a system of $|\mathcal{S}|$ simultaneous linear equations in $|\mathcal{S}|$ unknowns. In practice we usually use the iterative form, Eq. (4.5), to compute approximate values, as shown below.
The existence and uniqueness of $v_{\pi}$ are guaranteed as long as either $\gamma < 1$ or eventual termination is guaranteed from all states under the policy $\pi$.
In the in-place version, values are updated immediately, so the updated values are used in the remaining steps of the same sweep; the order of updates within a sweep then matters.
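As a concrete illustration of Eq. (4.5), here is a minimal in-place sketch in Python (not the book's pseudocode). It assumes a hypothetical tabular model `P`, where `P[s][a]` is a list of `(prob, next_state, reward)` tuples, and a policy given as a matrix `pi[s][a]` of action probabilities; terminal states are modeled as absorbing states with reward 0.

```python
import numpy as np

def policy_evaluation(P, pi, gamma=1.0, theta=1e-8):
    """In-place iterative policy evaluation, Eq. (4.5).

    P[s][a]  : list of (prob, next_state, reward) tuples -- assumed model format
    pi[s][a] : probability of taking action a in state s
    """
    V = np.zeros(len(P))                 # v_0 is arbitrary; terminal values stay at 0
    while True:
        delta = 0.0
        for s in range(len(P)):          # one sweep over the state set
            v_new = 0.0
            for a, action_prob in enumerate(pi[s]):
                for prob, s_next, reward in P[s][a]:
                    # pi(a|s) * p(s',r|s,a) * [r + gamma * v_k(s')]
                    v_new += action_prob * prob * (reward + gamma * V[s_next])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new                 # in place: later updates in this sweep see it
        if delta < theta:                # largest change in the sweep is below threshold
            return V
```

Run on the 4x4 gridworld of Example 4.1 with the equiprobable random policy, this should reproduce the values of Figure 4.1 (e.g. $v_\pi(11) = -14$).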

3 Policy improvement

Policy improvement theorem:
Let $\pi$ and $\pi'$ be any pair of deterministic policies such that, for all $s \in \mathcal{S}$,
$$
q_{\pi}(s,\pi'(s))\geq v_{\pi}(s).
$$
Then the policy $\pi'$ must be as good as, or better than, $\pi$: $v_{\pi'}(s)\geq v_{\pi}(s)$ for all $s \in \mathcal{S}$.
Once we have the value function, we can improve the policy by acting greedily with respect to it:
$$
\begin{aligned}
\pi'(s)&=\mathop{\arg\max}\limits_a q_{\pi}(s,a)\\
&=\mathop{\arg\max}\limits_a \mathbb{E}\bigl[R_{t+1}+\gamma v_{\pi}(S_{t+1}) \mid S_t=s, A_t=a\bigr]\\
&=\mathop{\arg\max}\limits_a \sum_{s',r}p(s',r|s,a)\bigl[r+\gamma v_{\pi}(s')\bigr]
\end{aligned} \tag{4.9}
$$
Note that the greedy operation is defined in terms of the action-value function, but we can use the state-value function together with the MDP dynamics $p$ to compute $q_{\pi}$.
If the greedy policy $\pi'$ is no better than $\pi$, i.e. $v_{\pi'}=v_{\pi}$, then $v_{\pi}$ satisfies the Bellman optimality equation:
$$
v_{\pi}(s)=\max_{a}\sum_{s',r}p(s',r|s,a)\bigl[r+\gamma v_{\pi}(s')\bigr],
$$
so $v_{\pi}=v_{\pi'}=v_*$ and both $\pi$ and $\pi'$ are optimal policies.
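A matching sketch of the greedy step in Eq. (4.9), using the same assumed `P[s][a]` model format as the policy evaluation sketch above: $q_\pi(s,a)$ is computed from $v_\pi$ and the dynamics $p$, and the maximizing action is selected.

```python
import numpy as np

def greedy_policy(P, V, gamma=1.0):
    """Greedy policy improvement, Eq. (4.9): pi'(s) = argmax_a q_pi(s, a)."""
    policy = np.zeros(len(P), dtype=int)
    for s in range(len(P)):
        q = np.zeros(len(P[s]))
        for a in range(len(P[s])):
            for prob, s_next, reward in P[s][a]:
                # q_pi(s,a) = sum_{s',r} p(s',r|s,a) [r + gamma * v_pi(s')]
                q[a] += prob * (reward + gamma * V[s_next])
        # np.argmax breaks ties toward the lowest-indexed action (deterministic)
        policy[s] = int(np.argmax(q))
    return policy
```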

4 Generalized Policy Iteration (GPI)

Policy iteration consists of two simultaneous, interacting processes, one making the value function consistent with the current policy (policy evaluation), and the other making the policy greedy with respect to the current value function (policy improvement). In policy iteration, these two processes alternate, each completing before the other begins, but this is not really necessary. In value iteration, for example, only a single iteration of policy evaluation is performed in between each policy improvement. In asynchronous DP methods, the evaluation and improvement processes are interleaved at an even finer grain. In some cases a single state is updated in one process before returning to the other. As long as both processes continue to update all states, the ultimate result is typically the same—convergence to the optimal value function and an optimal policy.

4.1 Policy iteration

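In place of the book's pseudocode figure, here is a minimal sketch of the loop, reusing the `policy_evaluation` and `greedy_policy` helpers from the sketches above (so the same assumed `P[s][a]` model format, with every state assumed to have the same `n_actions` actions). The deterministic tie-breaking inside `greedy_policy` is one way to sidestep the oscillation issue raised in Exercise 4.4 below.

```python
import numpy as np

def policy_iteration(P, n_actions, gamma=1.0, theta=1e-8):
    """Alternate full policy evaluation and greedy improvement until the policy is stable."""
    n_states = len(P)
    policy = np.zeros(n_states, dtype=int)        # arbitrary initial deterministic policy
    while True:
        # 1. Policy evaluation: encode the deterministic policy as pi(a|s) probabilities.
        pi = np.zeros((n_states, n_actions))
        pi[np.arange(n_states), policy] = 1.0
        V = policy_evaluation(P, pi, gamma, theta)
        # 2. Policy improvement: act greedily with respect to V.
        new_policy = greedy_policy(P, V, gamma)
        if np.array_equal(new_policy, policy):    # policy stable -> optimal for a finite MDP
            return policy, V
        policy = new_policy
```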

  • Convergence:
    Because a finite MDP has only a finite number of deterministic policies, this process must converge to an optimal policy and optimal value function in a finite number of iterations. For an MDP that is not finite, convergence may require an infinite number of iterations.

4.2 Value iteration

When policy evaluation is stopped after just one update of each state, policy improvement follows immediately. Value iteration is obtained simply by turning the Bellman optimality equation into an update rule:
$$
v_{k+1}(s)=\max_a \sum_{s',r}p(s',r|s,a)\bigl[r+\gamma v_k(s')\bigr] \tag{4.10}
$$

$$
q_{k+1}(s,a)=\sum_{s',r}p(s',r|s,a)\bigl[r+\gamma \max_{a'} q_{k}(s',a')\bigr]
$$
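A minimal sketch of the state-value update in Eq. (4.10), again with the assumed `P[s][a]` model format and reusing the `greedy_policy` helper from the improvement sketch above; each sweep combines one truncated evaluation step with a greedy maximization.

```python
import numpy as np

def value_iteration(P, gamma=1.0, theta=1e-8):
    """Value iteration, Eq. (4.10)."""
    V = np.zeros(len(P))
    while True:
        delta = 0.0
        for s in range(len(P)):
            # v_{k+1}(s) = max_a sum_{s',r} p(s',r|s,a) [r + gamma * v_k(s')]
            best = max(sum(prob * (reward + gamma * V[s_next])
                           for prob, s_next, reward in P[s][a])
                       for a in range(len(P[s])))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:                 # stop once changes in a sweep are tiny
            break
    return greedy_policy(P, V, gamma), V  # extract a greedy policy from the final V
```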

  • Convergence:
    Like policy evaluation, value iteration formally requires an infinite number of iterations to converge exactly to $v_*$; in practice, we stop once the value function changes by only a small amount in a sweep.
  • Difference between policy iteration and value iteration:
    Policy iteration runs a complete policy evaluation to convergence inside its loop, while value iteration performs only a single policy-evaluation update before each policy improvement.

4.3 Asynchronous Dynamic Programming

A major drawback of the DP methods discussed so far:
they involve operations over the entire state set of the MDP, that is, they require sweeps of the state set.

Asynchronous DP:
Asynchronous DP algorithms are in-place iterative DP algorithms that are not organized in terms of systematic sweeps of the state set. They update the values of states in any order whatsoever, using whatever values of other states happen to be available.

The values of some states may be updated several times before the values of others are updated once. To converge correctly, however, an asynchronous algorithm must continue to update the values of all the states.

Advantages:
Asynchronous algorithms also make it easier to intermix computation with real-time interaction.
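A minimal sketch of the idea, with the same assumed `P[s][a]` model format: rather than sweeping, one randomly chosen state is updated in place at a time, using whatever values happen to be current. Any selection order works as long as every state keeps being chosen.

```python
import numpy as np

def async_value_iteration(P, gamma=1.0, n_updates=100_000, seed=0):
    """Asynchronous (in-place) value iteration: one state update at a time, in arbitrary order."""
    rng = np.random.default_rng(seed)
    V = np.zeros(len(P))
    for _ in range(n_updates):
        s = int(rng.integers(len(P)))    # pick any state; all states must keep being visited
        V[s] = max(sum(prob * (reward + gamma * V[s_next])
                       for prob, s_next, reward in P[s][a])
                   for a in range(len(P[s])))
    return V
```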

5 Exercises

Exercise 4.1

  • In Example 4.1, if $\pi$ is the equiprobable random policy, what is $q_{\pi}(11,\mathtt{down})$? What is $q_{\pi}(7,\mathtt{down})$?
  • Referring to $q_\pi(s,a)=\sum_{s'}\sum_r p(s',r|s,a)\bigl[r+\gamma v_\pi(s')\bigr]$: because the environment is deterministic, each state-action pair has a single outcome with $p(s',r|s,a)=1$.
  • $q_{\pi}(11,\mathtt{down})=-1+v_\pi(T)=-1$
  • $q_{\pi}(7,\mathtt{down})=-1+v_\pi(11)=-1+(-14)=-15$

Exercise 4.2

  • Use Eq. (4.4) to solve for $v_{\pi}$: $v_{\pi}(15)=-1+0.25\bigl(v_\pi(12)+v_\pi(13)+v_\pi(14)+v_{\pi}(15)\bigr)=-1+0.25(-22-20-14+v_{\pi}(15))$, so $v_{\pi}(15)=-20$. For the second part, after the change state 15 effectively shares the same dynamics as state 13, so the two states should have the same value (again $-20$).
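Spelling out the algebra (using $v_\pi(12)=-22$, $v_\pi(13)=-20$, $v_\pi(14)=-14$ from Figure 4.1):

$$
\begin{aligned}
v_{\pi}(15) &= -1 + 0.25\bigl(v_{\pi}(12)+v_{\pi}(13)+v_{\pi}(14)+v_{\pi}(15)\bigr)\\
&= -1 + 0.25(-22-20-14) + 0.25\,v_{\pi}(15)\\
\Rightarrow\quad 0.75\,v_{\pi}(15) &= -15
\quad\Rightarrow\quad v_{\pi}(15) = -20.
\end{aligned}
$$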

Exercise 4.3

  • What are the equations analogous to (4.3), (4.4), and (4.5) for the action-value function $q_{\pi}$ and its successive approximation by a sequence of functions $q_0, q_1, q_2, \ldots$?
  • $q_{k+1}(s,a)=\sum_{s'}\sum_r p(s',r|s,a)\bigl[r+\gamma \sum_{a'}\pi(a'|s')\,q_{k}(s',a')\bigr]$
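For completeness, the fixed-point form analogous to Eq. (4.4) is the same expression with $q_\pi$ in place of $q_k$ and $q_{k+1}$:

$$
q_{\pi}(s,a)=\sum_{s'}\sum_{r}p(s',r|s,a)\Bigl[r+\gamma \sum_{a'}\pi(a'|s')\,q_{\pi}(s',a')\Bigr],\quad \text{for all } s\in\mathcal{S},\ a\in\mathcal{A}(s).
$$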

Exercise 4.4

  • The policy iteration algorithm on page 80 has a subtle bug in that it may never terminate if the policy continually switches between two or more policies that are equally good. This is ok for pedagogy, but not for actual use. Modify the pseudocode so that convergence is guaranteed.

Exercise 4.8

Because when we have a capital of 50 there is a probability $p_h$ of winning directly, the best policy bets everything at a capital of 50, and likewise at capitals that can double up into it, such as 25. Think of a capital of 51 as 50 plus 1. Of course we could bet everything at 51, but the better policy is to see whether we can earn more from the extra 1 dollar; otherwise, if we bet it all at 51 first, our chance of winning is only $p_h$ and we lose the chance to reach 75.
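This argument can be checked numerically with value iteration on the gambler's problem. Below is a minimal sketch (not the book's pseudocode), assuming $p_h=0.4$ as in Figure 4.3, capitals $1,\dots,99$ as states, stakes $a\in\{1,\dots,\min(s,100-s)\}$, $\gamma=1$, and reaching 100 treated as a terminal win worth 1.

```python
import numpy as np

def gambler_value_iteration(p_h=0.4, goal=100, theta=1e-12):
    """Value iteration for the gambler's problem (Example 4.3), gamma = 1."""
    V = np.zeros(goal + 1)
    V[goal] = 1.0                        # reaching the goal is worth the +1 reward
    while True:
        delta = 0.0
        for s in range(1, goal):
            # stake a wins with prob p_h (capital s+a) and loses with prob 1-p_h (capital s-a)
            best = max(p_h * V[s + a] + (1 - p_h) * V[s - a]
                       for a in range(1, min(s, goal - s) + 1))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            return V

V = gambler_value_iteration()
stakes_at_50 = [0.4 * V[50 + a] + 0.6 * V[50 - a] for a in range(1, 51)]
print(1 + int(np.argmax(stakes_at_50)))  # greedy stake at a capital of 50 (betting it all)
```

Computing the greedy stake for every capital in the same way should reproduce the spiky policy of Figure 4.3, including the small stake at a capital of 51 that the argument above explains.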
