Chapter 4: Dynamic Programming

1 Introduction

The term dynamic programming (DP) refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as a Markov decision process (MDP).

Assumption:

Perfect model of the environment dynamics $p(s',r|s,a)$.

Advantages:

Simple, rigorous, relatively efficient.
A DP method is guaranteed to find an optimal policy (i.e., to converge) in polynomial time, even though the total number of deterministic policies is $k^n$ (where $k$ is the number of actions and $n$ the number of states).

Disadvantages:

  • Computation time is polynomial in the number of states and actions, which becomes costly when both are large;

  • DP is sometimes thought to be of limited applicability because of the curse of dimensionality: when there are too many states, it is impractical to update each state based on the estimates of all the other states;

  • Characteristic: DP methods bootstrap, i.e., they update estimates on the basis of other estimates.

Convergence

  • For a finite MDP, DP is guaranteed to converge in polynomial time as long as all states are continually visited.

Applications in practice

In practice, DP methods can be used with today’s computers to solve MDPs with millions of states.

2 Policy evaluation (prediction)

Policy evaluation considers how to compute the state-value function $v_{\pi}$ for an arbitrary policy $\pi$. The Bellman equation gives:
$$
v_{\pi}(s)=\sum_{a}\pi(a|s)\sum_{s'}\sum_{r}p(s',r|s,a)\bigl[r+\gamma v_{\pi}(s')\bigr], \quad \text{for all } s\in \mathcal{S} \tag{4.4}
$$

$$
v_{k+1}(s)=\sum_{a}\pi(a|s)\sum_{s'}\sum_{r}p(s',r|s,a)\bigl[r+\gamma v_{k}(s')\bigr], \quad \text{for all } s\in \mathcal{S} \tag{4.5}
$$

If the model (the environment dynamics) is fully known, Eq. (4.4) is a system of $|\mathcal{S}|$ simultaneous linear equations in $|\mathcal{S}|$ unknowns. In practice we usually use the iterative form, Eq. (4.5), to compute approximate values, as shown below.
The existence and uniqueness of $v_{\pi}$ are guaranteed as long as either $\gamma < 1$ or eventual termination is guaranteed from all states under the policy $\pi$.
In the in-place version, values are updated immediately, so the updated values are used in the remaining steps of the same sweep; the order of updates within a sweep then matters.
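As a concrete illustration of Eq. (4.5), here is a minimal in-place sketch in Python (not the book's pseudocode). It assumes a hypothetical tabular model `P`, where `P[s][a]` is a list of `(prob, next_state, reward)` tuples, and a policy given as a matrix `pi[s][a]` of action probabilities; terminal states are modeled as absorbing states with reward 0.

```python
import numpy as np

def policy_evaluation(P, pi, gamma=1.0, theta=1e-8):
    """In-place iterative policy evaluation, Eq. (4.5).

    P[s][a]  : list of (prob, next_state, reward) tuples -- assumed model format
    pi[s][a] : probability of taking action a in state s
    """
    V = np.zeros(len(P))                 # v_0 is arbitrary; terminal values stay at 0
    while True:
        delta = 0.0
        for s in range(len(P)):          # one sweep over the state set
            v_new = 0.0
            for a, action_prob in enumerate(pi[s]):
                for prob, s_next, reward in P[s][a]:
                    # pi(a|s) * p(s',r|s,a) * [r + gamma * v_k(s')]
                    v_new += action_prob * prob * (reward + gamma * V[s_next])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new                 # in place: later updates in this sweep see it
        if delta < theta:                # largest change in the sweep is below threshold
            return V
```

Run on the 4x4 gridworld of Example 4.1 with the equiprobable random policy, this should reproduce the values of Figure 4.1 (e.g. $v_\pi(11) = -14$).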

3 Policy improvement

Policy improvement theorem:
Let $\pi$ and $\pi'$ be any pair of deterministic policies such that, for all $s \in \mathcal{S}$,
$$
q_{\pi}(s,\pi'(s))\geq v_{\pi}(s).
$$
Then the policy $\pi'$ must be as good as, or better than, $\pi$: $v_{\pi'}(s)\geq v_{\pi}(s)$ for all $s \in \mathcal{S}$.
Once we have the value function, we can improve the policy by acting greedily with respect to it:
$$
\begin{aligned}
\pi'(s)&=\mathop{\arg\max}\limits_a q_{\pi}(s,a)\\
&=\mathop{\arg\max}\limits_a \mathbb{E}\bigl[R_{t+1}+\gamma v_{\pi}(S_{t+1}) \mid S_t=s, A_t=a\bigr]\\
&=\mathop{\arg\max}\limits_a \sum_{s',r}p(s',r|s,a)\bigl[r+\gamma v_{\pi}(s')\bigr]
\end{aligned} \tag{4.9}
$$
Note that the greedy operation is defined in terms of the action-value function, but we can use the state-value function together with the MDP dynamics $p$ to compute $q_{\pi}$.
If the greedy policy $\pi'$ is no better than $\pi$, i.e. $v_{\pi'}=v_{\pi}$, then $v_{\pi}$ satisfies the Bellman optimality equation:
$$
v_{\pi}(s)=\max_{a}\sum_{s',r}p(s',r|s,a)\bigl[r+\gamma v_{\pi}(s')\bigr],
$$
so $v_{\pi}=v_{\pi'}=v_*$ and both $\pi$ and $\pi'$ are optimal policies.
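A matching sketch of the greedy step in Eq. (4.9), using the same assumed `P[s][a]` model format as the policy evaluation sketch above: $q_\pi(s,a)$ is computed from $v_\pi$ and the dynamics $p$, and the maximizing action is selected.

```python
import numpy as np

def greedy_policy(P, V, gamma=1.0):
    """Greedy policy improvement, Eq. (4.9): pi'(s) = argmax_a q_pi(s, a)."""
    policy = np.zeros(len(P), dtype=int)
    for s in range(len(P)):
        q = np.zeros(len(P[s]))
        for a in range(len(P[s])):
            for prob, s_next, reward in P[s][a]:
                # q_pi(s,a) = sum_{s',r} p(s',r|s,a) [r + gamma * v_pi(s')]
                q[a] += prob * (reward + gamma * V[s_next])
        # np.argmax breaks ties toward the lowest-indexed action (deterministic)
        policy[s] = int(np.argmax(q))
    return policy
```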

4 Generalized Policy Iteration (GPI)

Policy iteration consists of two simultaneous, interacting processes, one making the value function consistent with the current policy (policy evaluation), and the other making the policy greedy with respect to the current value function (policy improvement). In policy iteration, these two processes alternate, each completing before the other begins, but this is not really necessary. In value iteration, for example, only a single iteration of policy evaluation is performed in between each policy improvement. In asynchronous DP methods, the evaluation and improvement processes are interleaved at an even finer grain. In some cases a single state is updated in one process before returning to the other. As long as both processes continue to update all states, the ultimate result is typically the same—convergence to the optimal value function and an optimal policy.

4.1 Policy iteration

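In place of the book's pseudocode figure, here is a minimal sketch of the loop, reusing the `policy_evaluation` and `greedy_policy` helpers from the sketches above (so the same assumed `P[s][a]` model format, with every state assumed to have the same `n_actions` actions). The deterministic tie-breaking inside `greedy_policy` is one way to sidestep the oscillation issue raised in Exercise 4.4 below.

```python
import numpy as np

def policy_iteration(P, n_actions, gamma=1.0, theta=1e-8):
    """Alternate full policy evaluation and greedy improvement until the policy is stable."""
    n_states = len(P)
    policy = np.zeros(n_states, dtype=int)        # arbitrary initial deterministic policy
    while True:
        # 1. Policy evaluation: encode the deterministic policy as pi(a|s) probabilities.
        pi = np.zeros((n_states, n_actions))
        pi[np.arange(n_states), policy] = 1.0
        V = policy_evaluation(P, pi, gamma, theta)
        # 2. Policy improvement: act greedily with respect to V.
        new_policy = greedy_policy(P, V, gamma)
        if np.array_equal(new_policy, policy):    # policy stable -> optimal for a finite MDP
            return policy, V
        policy = new_policy
```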

  • Convergence:
    Because a finite MDP has only a finite number of deterministic policies, this process must converge to an optimal policy and optimal value function in a finite number of iterations. For an MDP that is not finite, convergence may require an infinite number of iterations.

4.2 Value iteration

When policy evaluation is stopped after just one update of each state, policy improvement follows immediately. Value iteration is obtained simply by turning the Bellman optimality equation into an update rule:
$$
v_{k+1}(s)=\max_a \sum_{s',r}p(s',r|s,a)\bigl[r+\gamma v_k(s')\bigr] \tag{4.10}
$$

$$
q_{k+1}(s,a)=\sum_{s',r}p(s',r|s,a)\bigl[r+\gamma \max_{a'} q_{k}(s',a')\bigr]
$$
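A minimal sketch of the state-value update in Eq. (4.10), again with the assumed `P[s][a]` model format and reusing the `greedy_policy` helper from the improvement sketch above; each sweep combines one truncated evaluation step with a greedy maximization.

```python
import numpy as np

def value_iteration(P, gamma=1.0, theta=1e-8):
    """Value iteration, Eq. (4.10)."""
    V = np.zeros(len(P))
    while True:
        delta = 0.0
        for s in range(len(P)):
            # v_{k+1}(s) = max_a sum_{s',r} p(s',r|s,a) [r + gamma * v_k(s')]
            best = max(sum(prob * (reward + gamma * V[s_next])
                           for prob, s_next, reward in P[s][a])
                       for a in range(len(P[s])))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:                 # stop once changes in a sweep are tiny
            break
    return greedy_policy(P, V, gamma), V  # extract a greedy policy from the final V
```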

  • Convergence:
    Like policy evaluation, value iteration formally requires an infinite number of iterations to converge exactly to $v_*$; in practice, we stop once the value function changes by only a small amount in a sweep.
  • Difference between policy iteration and value iteration:
    Policy iteration runs a complete policy evaluation to convergence inside its loop, while value iteration performs only a single policy-evaluation update before each policy improvement.

4.3 Asynchronous Dynamic Programming

A major drawback of the DP methods discussed so far:
they involve operations over the entire state set of the MDP, that is, they require sweeps of the state set.

Asynchronous DP:
Asynchronous DP algorithms are in-place iterative DP algorithms that are not organized in terms of systematic sweeps of the state set. They update the values of states in any order whatsoever, using whatever values of other states happen to be available.

The values of some states may be updated several times before the values of others are updated once. To converge correctly, however, an asynchronous algorithm must continue to update the values of all the states.

Advantages:
Asynchronous algorithms also make it easier to intermix computation with real-time interaction.
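A minimal sketch of the idea, with the same assumed `P[s][a]` model format: rather than sweeping, one randomly chosen state is updated in place at a time, using whatever values happen to be current. Any selection order works as long as every state keeps being chosen.

```python
import numpy as np

def async_value_iteration(P, gamma=1.0, n_updates=100_000, seed=0):
    """Asynchronous (in-place) value iteration: one state update at a time, in arbitrary order."""
    rng = np.random.default_rng(seed)
    V = np.zeros(len(P))
    for _ in range(n_updates):
        s = int(rng.integers(len(P)))    # pick any state; all states must keep being visited
        V[s] = max(sum(prob * (reward + gamma * V[s_next])
                       for prob, s_next, reward in P[s][a])
                   for a in range(len(P[s])))
    return V
```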

5 Exercises

Exercise 4.1

  • In Example 4.1, if $\pi$ is the equiprobable random policy, what is $q_{\pi}(11,\mathtt{down})$? What is $q_{\pi}(7,\mathtt{down})$?
  • Referring to $q_\pi(s,a)=\sum_{s'}\sum_r p(s',r|s,a)\bigl[r+\gamma v_\pi(s')\bigr]$: because the environment is deterministic, each state-action pair has a single outcome with $p(s',r|s,a)=1$.
  • $q_{\pi}(11,\mathtt{down})=-1+v_\pi(T)=-1$
  • $q_{\pi}(7,\mathtt{down})=-1+v_\pi(11)=-1+(-14)=-15$

Exercise 4.2

  • Use Eq. (4.4) to solve for $v_{\pi}$: $v_{\pi}(15)=-1+0.25\bigl(v_\pi(12)+v_\pi(13)+v_\pi(14)+v_{\pi}(15)\bigr)=-1+0.25(-22-20-14+v_{\pi}(15))$, so $v_{\pi}(15)=-20$. For the second part, after the change state 15 effectively shares the same dynamics as state 13, so the two states should have the same value (again $-20$).
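Spelling out the algebra (using $v_\pi(12)=-22$, $v_\pi(13)=-20$, $v_\pi(14)=-14$ from Figure 4.1):

$$
\begin{aligned}
v_{\pi}(15) &= -1 + 0.25\bigl(v_{\pi}(12)+v_{\pi}(13)+v_{\pi}(14)+v_{\pi}(15)\bigr)\\
&= -1 + 0.25(-22-20-14) + 0.25\,v_{\pi}(15)\\
\Rightarrow\quad 0.75\,v_{\pi}(15) &= -15
\quad\Rightarrow\quad v_{\pi}(15) = -20.
\end{aligned}
$$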

Exercise 4.3

  • What are the equations analogous to (4.3), (4.4), and (4.5) for the action-value function $q_{\pi}$ and its successive approximation by a sequence of functions $q_0, q_1, q_2, \ldots$?
  • $q_{k+1}(s,a)=\sum_{s'}\sum_r p(s',r|s,a)\bigl[r+\gamma \sum_{a'}\pi(a'|s')\,q_{k}(s',a')\bigr]$
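For completeness, the fixed-point form analogous to Eq. (4.4) is the same expression with $q_\pi$ in place of $q_k$ and $q_{k+1}$:

$$
q_{\pi}(s,a)=\sum_{s'}\sum_{r}p(s',r|s,a)\Bigl[r+\gamma \sum_{a'}\pi(a'|s')\,q_{\pi}(s',a')\Bigr],\quad \text{for all } s\in\mathcal{S},\ a\in\mathcal{A}(s).
$$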

Exercise 4.4

  • The policy iteration algorithm on page 80 has a subtle bug in that it may never terminate if the policy continually switches between two or more policies that are equally good. This is ok for pedagogy, but not for actual use. Modify the pseudocode so that convergence is guaranteed.

Exercise 4.8

Because when we have a capital of 50 there is a probability $p_h$ of winning directly, the best policy bets everything at a capital of 50, and likewise at capitals that can double up into it, such as 25. Think of a capital of 51 as 50 plus 1. Of course we could bet everything at 51, but the better policy is to see whether we can earn more from the extra 1 dollar; otherwise, if we bet it all at 51 first, our chance of winning is only $p_h$ and we lose the chance to reach 75.
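This argument can be checked numerically with value iteration on the gambler's problem. Below is a minimal sketch (not the book's pseudocode), assuming $p_h=0.4$ as in Figure 4.3, capitals $1,\dots,99$ as states, stakes $a\in\{1,\dots,\min(s,100-s)\}$, $\gamma=1$, and reaching 100 treated as a terminal win worth 1.

```python
import numpy as np

def gambler_value_iteration(p_h=0.4, goal=100, theta=1e-12):
    """Value iteration for the gambler's problem (Example 4.3), gamma = 1."""
    V = np.zeros(goal + 1)
    V[goal] = 1.0                        # reaching the goal is worth the +1 reward
    while True:
        delta = 0.0
        for s in range(1, goal):
            # stake a wins with prob p_h (capital s+a) and loses with prob 1-p_h (capital s-a)
            best = max(p_h * V[s + a] + (1 - p_h) * V[s - a]
                       for a in range(1, min(s, goal - s) + 1))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            return V

V = gambler_value_iteration()
stakes_at_50 = [0.4 * V[50 + a] + 0.6 * V[50 - a] for a in range(1, 51)]
print(1 + int(np.argmax(stakes_at_50)))  # greedy stake at a capital of 50 (betting it all)
```

Computing the greedy stake for every capital in the same way should reproduce the spiky policy of Figure 4.3, including the small stake at a capital of 51 that the argument above explains.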
