These are reinforcement learning notes, mainly based on the following materials:
- Reinforcement Learning: An Introduction
- All code comes from GitHub
- Exercise solutions are referenced from GitHub
Starting with this chapter, we usually assume that the environment is a finite MDP.
Dynamic Programming
- DP refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as a Markov decision process (MDP).
- limited utility: assumption of a perfect model, great computational expense
- but they are still important theoretically.
- The key idea of DP is the use of value functions to organize and structure the search for good policies.
- As we shall see, DP algorithms are obtained by turning Bellman equations into update rules for improving approximations of the desired value functions.
Policy Evaluation (Prediction)
*Policy evaluation* (*prediction problem*):
- Compute the state-value function $v_\pi$ for an arbitrary policy $\pi$.
- If the environment's dynamics are completely known, then (4.4) is a system of $|\mathcal S|$ simultaneous linear equations in $|\mathcal S|$ unknowns. In principle, its solution is straightforward but requires tedious computation. For our purposes, iterative solution methods are most suitable.
**Iterative policy evaluation**:
- Consider a sequence of approximate value functions $v_0, v_1, v_2, \dots$ The initial approximation, $v_0$, is chosen arbitrarily (except that the terminal state, if any, must be given value 0), and each successive approximation is obtained by using the Bellman equation for $v_\pi$ (4.4) as an update rule:
$$v_{k+1}(s) \doteq \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a)\big[r + \gamma v_k(s')\big] \qquad (4.5)$$
Each iteration of iterative policy evaluation updates the value of every state once to produce the new approximate value function $v_{k+1}$. Indeed, the sequence $\{v_k\}$ can be shown in general to converge to $v_\pi$ as $k\rightarrow\infty$ under the same conditions that guarantee the existence of $v_\pi$.
*Standard implementation*:
- To write a sequential computer program to implement iterative policy evaluation as given by (4.5), you would have to use two arrays, one for the old values, $v_k(s)$, and one for the new values, $v_{k+1}(s)$. With two arrays, the new values can be computed one by one from the old values without the old values being changed.
*In-place version*:
- Alternatively, you could use one array and update the values "in place", that is, with each new value immediately overwriting the old one. Then, depending on the order in which the states are updated, sometimes new values are used instead of old ones on the right-hand side of (4.5).
- This in-place algorithm usually converges to $v_\pi$ faster than the two-array version, as you might expect, because it uses new data as soon as they are available.
- We think of the updates as being done in a *sweep* through the state space. For the in-place algorithm, the order in which states have their values updated during the sweep has a significant influence on the rate of convergence.
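The sketch below shows in-place iterative policy evaluation for a tabular MDP. It is a minimal illustration, not the GitHub code referenced above; it assumes a hypothetical model format `p[s][a] = [(prob, next_state, reward), ...]` and a policy given as action probabilities `policy[s] = {a: pi(a|s)}`.

```python
def policy_evaluation(p, policy, gamma=0.9, theta=1e-6):
    """In-place iterative policy evaluation (update 4.5).

    p      : dict, p[s][a] -> list of (prob, next_state, reward) triples
    policy : dict, policy[s] -> {action: pi(a|s)}
    """
    V = {s: 0.0 for s in p}                  # v_0 is arbitrary; here all zeros
    while True:
        delta = 0.0
        for s in p:                          # one in-place sweep over the state space
            v_old = V[s]
            V[s] = sum(pi_a * sum(prob * (r + gamma * V[s2])
                                  for prob, s2, r in p[s][a])
                       for a, pi_a in policy[s].items())
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:                    # the sweep barely changed V: stop
            return V

# Tiny made-up MDP: 's0' reaches the absorbing state 'end' with reward +1.
p = {'s0':  {'go': [(1.0, 'end', 1.0)]},
     'end': {'go': [(1.0, 'end', 0.0)]}}
policy = {'s0': {'go': 1.0}, 'end': {'go': 1.0}}
print(policy_evaluation(p, policy))          # {'s0': 1.0, 'end': 0.0}
```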
Policy Improvement
- Our reason for computing the value function for a policy is to help find better policies. Suppose we have determined the value function $v_\pi$ for an arbitrary deterministic policy $\pi$. For some state $s$ we would like to know whether or not we should change the policy to deterministically choose an action $a\neq\pi(s)$. We know how good it is to follow the current policy from $s$—that is $v_\pi(s)$—but would it be better or worse to change to the new policy?
- One way to answer this question is to consider selecting $a$ in $s$ and thereafter following the existing policy, $\pi$. The value of this way of behaving is
$$q_\pi(s,a) \doteq \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a] = \sum_{s',r} p(s',r|s,a)\big[r + \gamma v_\pi(s')\big] \qquad (4.6)$$
The key criterion is whether this is greater than or less than $v_\pi(s)$. If it is greater—that is, if it is better to select $a$ once in $s$ and thereafter follow $\pi$ than it would be to follow $\pi$ all the time—then one would expect it to be better still to select $a$ every time $s$ is encountered, and that the new policy would in fact be a better one overall.
- That is a special case of a general result called the *policy improvement theorem*.
**Policy improvement theorem**:
- Let $\pi$ and $\pi'$ be any pair of deterministic policies such that, for all $s\in\mathcal S$,
$$q_\pi(s,\pi'(s)) \geq v_\pi(s) \qquad (4.7)$$
Then the policy $\pi'$ must obtain greater or equal expected return from all states $s\in\mathcal S$:
$$v_{\pi'}(s) \geq v_\pi(s) \qquad (4.8)$$
- The idea behind the proof of the policy improvement theorem is easy to understand. Starting from (4.7), we keep expanding the $q_\pi$ side with (4.6) and reapplying (4.7) until we get $v_{\pi'}(s)$:
$$\begin{aligned}
v_\pi(s) &\leq q_\pi(s,\pi'(s)) = \mathbb{E}_{\pi'}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s] \\
&\leq \mathbb{E}_{\pi'}[R_{t+1} + \gamma q_\pi(S_{t+1},\pi'(S_{t+1})) \mid S_t = s] = \mathbb{E}_{\pi'}[R_{t+1} + \gamma R_{t+2} + \gamma^2 v_\pi(S_{t+2}) \mid S_t = s] \\
&\leq \cdots \leq \mathbb{E}_{\pi'}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s] = v_{\pi'}(s)
\end{aligned}$$
**Policy improvement**:
- So far we have seen how, given a policy and its value function, we can easily evaluate a change in the policy at a single state. It is a natural extension to consider changes at *all states*, selecting at each state the action that appears best according to $q_\pi(s, a)$. In other words, to consider the new **greedy policy**, $\pi'$, given by
$$\pi'(s) \doteq \arg\max_a q_\pi(s,a) = \arg\max_a \sum_{s',r} p(s',r|s,a)\big[r + \gamma v_\pi(s')\big] \qquad (4.9)$$
By construction, the greedy policy meets the conditions of the policy improvement theorem (4.7), so we know that it is as good as, or better than, the original policy.
- Suppose the new greedy policy, $\pi'$, is as good as, but not better than, the old policy $\pi$. Then $v_\pi = v_{\pi'}$, and from (4.9) it follows that for all $s\in\mathcal S$:
$$v_{\pi'}(s) = \max_a \sum_{s',r} p(s',r|s,a)\big[r + \gamma v_{\pi'}(s')\big]$$
But this is the same as the Bellman optimality equation (4.1), and therefore $v_{\pi'}$ must be $v_*$, and both $\pi$ and $\pi'$ must be optimal policies.
*Stochastic policy*:
- So far in this section we have considered the special case of deterministic policies.
- In the general case, a stochastic policy $\pi$ specifies probabilities, $\pi(a|s)$. We will not go through the details, but in fact all the ideas of this section extend easily to stochastic policies.
- In particular, the policy improvement theorem carries through as stated for the stochastic case. In addition, if there are ties in the policy improvement step (4.9), that is, if there are several actions at which the maximum is achieved, then in the stochastic case we need not select a single action from among them. Instead, each maximizing action can be given a portion of the probability of being selected in the new greedy policy. Any apportioning scheme is allowed as long as all submaximal actions are given zero probability.
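The following sketch shows the greedy improvement step (4.9) under the same hypothetical `p[s][a] = [(prob, next_state, reward), ...]` model format as the policy evaluation sketch above; the function name and the even apportioning over tied actions are illustrative choices, not the book's reference code.

```python
def greedy_improvement(p, V, gamma=0.9, tol=1e-9):
    """Return a policy that is greedy with respect to V (equation 4.9).

    Ties are kept: probability is split evenly over all maximizing actions,
    and every submaximal action implicitly gets probability zero.
    """
    new_policy = {}
    for s in p:
        # q_pi(s, a) for each available action, computed from the model and V
        q = {a: sum(prob * (r + gamma * V[s2]) for prob, s2, r in p[s][a])
             for a in p[s]}
        best = max(q.values())
        ties = [a for a, qa in q.items() if qa >= best - tol]
        new_policy[s] = {a: 1.0 / len(ties) for a in ties}
    return new_policy
```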
Policy Iteration
- Once a policy $\pi$ has been improved using $v_\pi$ to yield a better policy $\pi'$, we can then compute $v_{\pi'}$ and improve it again, yielding a sequence of monotonically improving policies and value functions:
$$\pi_0 \stackrel{E}{\longrightarrow} v_{\pi_0} \stackrel{I}{\longrightarrow} \pi_1 \stackrel{E}{\longrightarrow} v_{\pi_1} \stackrel{I}{\longrightarrow} \pi_2 \stackrel{E}{\longrightarrow} \cdots \stackrel{I}{\longrightarrow} \pi_* \stackrel{E}{\longrightarrow} v_*$$
where $\stackrel{E}{\longrightarrow}$ denotes a policy evaluation and $\stackrel{I}{\longrightarrow}$ denotes a policy improvement.
- Because a finite MDP has only a finite number of deterministic policies, this process must converge to an optimal policy and the optimal value function in a finite number of iterations.
- The pseudocode above has a subtle bug: it may never terminate if the policy continually switches between two or more policies that are equally good. To fix it, substitute "If $old\text{-}action \notin \{a_i\}$, the set of all equi-best actions at $s$, ..." for "If $old\text{-}action \neq \pi(s)$".
- Note that each policy evaluation, itself an iterative computation, is started with the value function for the previous policy. This typically results in a great increase in the speed of convergence of policy evaluation (presumably because the value function changes little from one policy to the next).
- Analogously, we can give a complete algorithm for computing $q_*$.
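As a sketch of how the two pieces combine, the routine below alternates policy evaluation and greedy improvement until the policy is stable, reusing the hypothetical `policy_evaluation` and `greedy_improvement` functions from the earlier sketches. The stability test checks that every action of the old policy is still among the equi-best actions, which is the tie-handling fix described above.

```python
def policy_iteration(p, gamma=0.9, theta=1e-6):
    """Policy iteration: alternate evaluation (E) and greedy improvement (I)."""
    # arbitrary initial deterministic policy: the first listed action in each state
    policy = {s: {next(iter(p[s])): 1.0} for s in p}
    while True:
        V = policy_evaluation(p, policy, gamma, theta)   # E: evaluate current policy
        new_policy = greedy_improvement(p, V, gamma)     # I: greedy w.r.t. V
        # stable if every previously chosen action is still a maximizing action
        stable = all(set(policy[s]) <= set(new_policy[s]) for s in p)
        policy = new_policy
        if stable:
            return V, policy
```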
**Bootstrapping**:
- To update estimates of the values of states based on estimates of the values of successor states. That is, to update estimates on the basis of other estimates.
- Many reinforcement learning methods perform bootstrapping, even those that do not require a complete and accurate model of the environment.
Value Iteration
- The policy evaluation step of policy iteration can be truncated in several ways without losing the convergence guarantees of policy iteration.
**Value iteration**:
- Policy evaluation is stopped after just one sweep (one update of each state).
- It can be written as a particularly simple update operation that combines the policy improvement and truncated policy evaluation steps:
$$v_{k+1}(s) \doteq \max_a \sum_{s',r} p(s',r|s,a)\big[r + \gamma v_k(s')\big] \qquad (4.10)$$
For arbitrary $v_0$, the sequence $\{v_k\}$ can be shown to converge to $v_*$ under the same conditions that guarantee the existence of $v_*$.
- Another way of understanding value iteration is by reference to the Bellman optimality equation (4.1). Note that value iteration is obtained simply by turning the Bellman optimality equation into an update rule.
- In practice, we stop value iteration once the value function changes by only a small amount in a sweep.
- Faster convergence is often achieved by interposing multiple policy evaluation sweeps between each policy improvement sweep.
- In general, the entire class of truncated policy iteration algorithms can be thought of as sequences of sweeps, some of which use policy evaluation updates and some of which use value iteration updates. Because the max operation in (4.10) is the only difference between these updates, this just means that the max operation is added to some sweeps of policy evaluation. All of these algorithms converge to an optimal policy for discounted finite MDPs.
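Below is a minimal value iteration sketch under the same hypothetical model format as above; the final greedy policy extraction and the stopping threshold are illustrative choices.

```python
def value_iteration(p, gamma=0.9, theta=1e-6):
    """Value iteration: sweep with the max-update (4.10) until values settle."""
    V = {s: 0.0 for s in p}
    while True:
        delta = 0.0
        for s in p:                              # one in-place sweep
            v_old = V[s]
            V[s] = max(sum(prob * (r + gamma * V[s2])
                           for prob, s2, r in p[s][a])
                       for a in p[s])            # max over actions: update (4.10)
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:                        # values changed only a little: stop
            break
    # extract a deterministic policy that is greedy with respect to the final V
    policy = {s: max(p[s], key=lambda a, s=s: sum(prob * (r + gamma * V[s2])
                                                  for prob, s2, r in p[s][a]))
              for s in p}
    return V, policy
```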
Asynchronous Dynamic Programming
- A major drawback to the DP methods that we have discussed so far is that they involve operations over the entire state set of the MDP. If the state set is very large, then even a single sweep can be prohibitively expensive.
*Asynchronous DP*:
- *Asynchronous* DP algorithms are in-place iterative DP algorithms that are not organized in terms of systematic sweeps of the state set. These algorithms update the values of states in any order whatsoever, using whatever values of other states happen to be available. The values of some states may be updated several times before the values of others are updated once.
- To converge correctly, however, an asynchronous algorithm must continue to update the values of all the states: it can’t ignore any state after some point in the computation.
- For example, one version of asynchronous value iteration updates the value, in place, of only one state, $s_k$, on each step, $k$, using the value iteration update (4.10). If $0\leq\gamma < 1$, asymptotic convergence to $v_*$ is guaranteed given only that all states occur in the sequence $\{s_k\}$ an infinite number of times (the sequence could even be random).
- Similarly, it is possible to intermix policy evaluation and value iteration updates to produce a kind of asynchronous truncated policy iteration.
A few different updates form building blocks that can be used flexibly in a wide variety of sweepless DP algorithms.
- Of course, avoiding sweeps does not necessarily mean that we can get away with less computation. It just means that an algorithm does not need to get locked into any hopelessly long sweep before it can make progress improving a policy. We can try to take advantage of this flexibility by selecting the states to which we apply updates so as to improve the algorithm’s rate of progress. We can try to order the updates to let value information propagate from state to state in an efficient way. Some states may not need their values updated as often as others. We might even try to skip updating some states entirely if they are not relevant to optimal behavior.
- Asynchronous algorithms also make it easier to intermix computation with real-time interaction. To solve a given MDP, we can run an iterative DP algorithm at the same time that an agent is actually experiencing the MDP. The agent’s experience can be used to determine the states to which the DP algorithm applies its updates. At the same time, the latest value and policy information from the DP algorithm can guide the agent’s decision making.
- For example, we can apply updates to states as the agent visits them. This makes it possible to focus the DP algorithm’s updates onto parts of the state set that are most relevant to the agent.
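A sketch of one such variant: in-place value iteration that updates a single, randomly chosen state per step, under the same hypothetical model format as the earlier sketches. The fixed update budget is an illustrative stopping rule; in the agent-interaction setting described above, the chosen state could instead be the state the agent has just visited.

```python
import random

def async_value_iteration(p, gamma=0.9, num_updates=10_000, seed=0):
    """Asynchronous value iteration: one randomly selected state per step."""
    rng = random.Random(seed)
    states = list(p)
    V = {s: 0.0 for s in p}
    for _ in range(num_updates):
        s = rng.choice(states)               # any schedule that keeps visiting all states
        V[s] = max(sum(prob * (r + gamma * V[s2])
                       for prob, s2, r in p[s][a])
                   for a in p[s])            # single-state value iteration update (4.10)
    return V
```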
Generalized Policy Iteration (GPI)
- In policy iteration, two processes (policy evaluation, policy improvement) alternate, each completing before the other begins, but this is not really necessary. As long as both processes continue to update all states, the ultimate result is typically the same—convergence to the optimal value function and an optimal policy.
- In value iteration, for example, only a single iteration of policy evaluation is performed in between each policy improvement.
- In asynchronous DP methods, the evaluation and improvement processes are interleaved at an even finer grain. In some cases a single state is updated in one process before returning to the other.
*Generalized policy iteration* (*GPI*):
- We use the term *generalized policy iteration* (GPI) to refer to the general idea of letting policy-evaluation and policy-improvement processes interact, independent of the granularity and other details of the two processes.
- Almost all reinforcement learning methods are well described as GPI. That is, all have identifiable policies and value functions, with the policy always being improved with respect to the value function and the value function always being driven toward the value function for the policy, as suggested by the GPI diagram. The value function stabilizes only when it is consistent with the current policy, and the policy stabilizes only when it is greedy with respect to the current value function. Thus, once both processes stabilize, the value function and policy must be optimal.
Efficiency of DP
- DP may not be practical for very large problems, but compared with other methods for solving MDPs, DP methods are actually quite efficient. If we ignore a few technical details, then, in the worst case, the time that DP methods take to find an optimal policy is polynomial in the number of states and actions. If $n$ and $k$ denote the number of states and actions, this means that a DP method takes a number of computational operations that is less than some polynomial function of $n$ and $k$.
- Linear programming methods can also be used to solve MDPs, and in some cases their worst-case convergence guarantees are better than those of DP methods. But linear programming methods become impractical at a much smaller number of states than do DP methods (by a factor of about 100). For the largest problems, only DP methods are feasible.
- DP is sometimes thought to be of limited applicability because of the *curse of dimensionality*, the fact that the number of states often grows exponentially with the number of state variables.
- Large state sets do create difficulties, but these are inherent difficulties of the problem, not of DP as a solution method. In fact, DP is comparatively better suited to handling large state spaces than competing methods such as direct search and linear programming. In practice, DP methods can be used with today’s computers to solve MDPs with millions of states. Both policy iteration and value iteration are widely used, and it is not clear which, if either, is better in general. In practice, these methods usually converge much faster than their theoretical worst-case run times, particularly if they are started with good initial value functions or policies. On problems with large state spaces, asynchronous methods and other variations of GPI can be applied and may find good or optimal policies much faster than synchronous methods can.