RL (Chapter 7): n-step Bootstrapping

These are my reinforcement learning notes; the material in this chapter follows Sutton & Barto, *Reinforcement Learning: An Introduction* (2nd ed.), Chapter 7.

$n$-step Bootstrapping

  • In this chapter we present $n$-step TD methods that generalize the MC methods and the one-step TD methods so that one can shift from one to the other smoothly as needed to meet the demands of a particular task. $n$-step methods span a spectrum with MC methods at one end and one-step TD methods at the other. The best methods are often intermediate between the two extremes.
  • Another way of looking at the benefits of $n$-step methods is that they free you from the tyranny of the time step (removing the inflexibility in when updates occur). With one-step TD methods the same time step determines how often the action can be changed and the time interval over which bootstrapping is done. In many applications one wants to be able to update the action very fast to take into account anything that has changed, but bootstrapping works best if it is over a length of time in which a significant and recognizable state change has occurred. With one-step TD methods, these time intervals are the same, and so a compromise must be made. $n$-step methods enable bootstrapping to occur over multiple steps, freeing us from the tyranny of the single time step.

$n$-step TD Prediction

[Figure 7.1: the backup diagrams of $n$-step methods, spanning a spectrum from one-step TD methods to Monte Carlo methods]
Consider the update of the estimated value of state $S_t$ as a result of the state–reward sequence, $S_t, R_{t+1}, S_{t+1}, R_{t+2}, \ldots, R_T, S_T$ (omitting the actions).

  • Whereas in Monte Carlo updates the target is the return:
    $$G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T$$
  • In one-step updates the target is the one-step return:
    $$G_{t:t+1} \doteq R_{t+1} + \gamma V_t(S_{t+1})$$

The subscripts on $G_{t:t+1}$ indicate that it is a truncated return for time $t$ using rewards up until time $t+1$, with the discounted estimate $\gamma V_t(S_{t+1})$ taking the place of the other terms $\gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T$ of the full return $G_t$.

  • The target for a two-step update is the two-step return:
    $$G_{t:t+2} \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 V_{t+1}(S_{t+2})$$
  • Similarly, the target for an arbitrary $n$-step update is the $n$-step return:
    $$G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V_{t+n-1}(S_{t+n}), \quad n \ge 1,\ 0 \le t < T-n \tag{7.1}$$

All $n$-step returns can be considered approximations to the full return, truncated after $n$ steps and then corrected for the remaining missing terms by $V_{t+n-1}(S_{t+n})$. If $t + n \ge T$ (if the $n$-step return extends to or beyond termination), then all the missing terms are taken as zero and the $n$-step return is defined to equal the ordinary full return ($G_{t:t+n} = G_t$ if $t + n \ge T$).

  • $n$-step TD: the natural state-value learning algorithm for using $n$-step returns is thus
    $$V_{t+n}(S_t) \doteq V_{t+n-1}(S_t) + \alpha \left[ G_{t:t+n} - V_{t+n-1}(S_t) \right], \quad 0 \le t < T, \tag{7.2}$$
    while the values of all other states remain unchanged ($V_{t+n}(s) = V_{t+n-1}(s)$ for all $s \ne S_t$).

Note that no changes at all are made during the first $n-1$ steps of each episode. To make up for that, an equal number of additional updates are made at the end of the episode, after termination and before starting the next episode.

[Pseudocode box: $n$-step TD for estimating $V \approx v_\pi$]
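The delayed-update bookkeeping described above (no updates during the first $n-1$ steps, catch-up updates after termination) is easiest to see in code. Below is a minimal Python sketch of $n$-step TD prediction; the environment interface (`env.reset()` returning a state, `env.step(action)` returning `(next_state, reward, done)`) and the `policy` callable are assumptions for illustration, not part of the book's pseudocode.

```python
from collections import defaultdict

def n_step_td_prediction(env, policy, n, alpha=0.1, gamma=1.0, num_episodes=1000):
    """n-step TD prediction of v_pi (a sketch; env/policy interfaces are assumed).

    Implements V(S_tau) <- V(S_tau) + alpha * [G_{tau:tau+n} - V(S_tau)]  (7.2),
    where G_{tau:tau+n} is the n-step return (7.1). States are assumed hashable.
    """
    V = defaultdict(float)  # values of unseen (and terminal) states default to 0
    for _ in range(num_episodes):
        states, rewards = [env.reset()], [0.0]  # rewards[0] is a dummy entry for R_0
        T = float('inf')
        t = 0
        while True:
            if t < T:
                next_state, reward, done = env.step(policy(states[t]))
                states.append(next_state)
                rewards.append(reward)
                if done:
                    T = t + 1
            tau = t - n + 1          # time whose state value is being updated
            if tau >= 0:
                # truncated sum of rewards ...
                G = sum(gamma ** (k - tau - 1) * rewards[k]
                        for k in range(tau + 1, min(tau + n, T) + 1))
                # ... plus the bootstrap correction, unless the return hits termination
                if tau + n < T:
                    G += gamma ** n * V[states[tau + n]]
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return V
```

With `n = 1` this reduces to ordinary TD(0), and with `n` larger than the episode length it behaves like a Monte Carlo update, matching the spectrum described above.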
Exercise 7.1
In Chapter 6 we noted that the Monte Carlo error can be written as the sum of TD errors (6.6) if the value estimates don't change from step to step. Show that the $n$-step error used in (7.2) can also be written as a sum of TD errors (again if the value estimates don't change), generalizing the earlier result.
ANSWER
With the value estimates held fixed at $V$, define $\delta_t \doteq R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$. Since $G_{t:t+n} = R_{t+1} + \gamma G_{t+1:t+n}$,

$$
\begin{aligned}
G_{t:t+n} - V(S_t) &= R_{t+1} + \gamma V(S_{t+1}) - V(S_t) + \gamma\left( G_{t+1:t+n} - V(S_{t+1}) \right)\\
&= \delta_t + \gamma\left( G_{t+1:t+n} - V(S_{t+1}) \right)\\
&= \sum_{k=t}^{t+n-1} \gamma^{k-t}\,\delta_k,
\end{aligned}
$$

where the last line unrolls the recursion $n$ times and uses $G_{t+n:t+n} - V(S_{t+n}) = 0$. If the return runs into termination ($t+n \ge T$), the sum stops at $k = T-1$ and we recover the Monte Carlo result (6.6).


Error reduction property of $n$-step returns

  • The $n$-step return uses the value function $V_{t+n-1}$ to correct for the missing rewards beyond $R_{t+n}$.
  • An important property of $n$-step returns is that their expectation is guaranteed to be a better estimate of $v_\pi$ than $V_{t+n-1}$ is, in a worst-state sense. That is, the worst error of the expected $n$-step return is guaranteed to be less than or equal to $\gamma^n$ times the worst error under $V_{t+n-1}$:

$$\max_s \left| \mathbb{E}_\pi\!\left[ G_{t:t+n} \mid S_t = s \right] - v_\pi(s) \right| \le \gamma^n \max_s \left| V_{t+n-1}(s) - v_\pi(s) \right|, \quad \text{for all } n \ge 1. \tag{7.3}$$

  • Because of the error reduction property, one can show formally that all $n$-step TD methods converge to the correct predictions under appropriate technical conditions.
  • The $n$-step TD methods thus form a family of sound methods, with one-step TD methods and Monte Carlo methods as extreme members.

$n$-step Sarsa

The $n$-step version of Sarsa we call $n$-step Sarsa, and the original version presented in the previous chapter we henceforth call one-step Sarsa, or Sarsa(0).

  • The main idea is to simply switch states for actions (state–action pairs) and then use an $\varepsilon$-greedy policy. We redefine $n$-step returns (update targets) in terms of estimated action values:

$$G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n Q_{t+n-1}(S_{t+n}, A_{t+n}), \quad n \ge 1,\ 0 \le t < T-n, \tag{7.4}$$
with $G_{t:t+n} \doteq G_t$ if $t + n \ge T$.

  • The $n$-step Sarsa update is then

$$Q_{t+n}(S_t, A_t) \doteq Q_{t+n-1}(S_t, A_t) + \alpha \left[ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) \right], \quad 0 \le t < T, \tag{7.5}$$
while the values of all other states and state–action pairs remain unchanged.

[Figures: backup diagrams of the $n$-step Sarsa family and the gridworld speedup example; pseudocode box: $n$-step Sarsa for estimating $Q \approx q_*$ or $q_\pi$]
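As a companion to the pseudocode box, here is a minimal sketch of on-policy $n$-step Sarsa with an $\varepsilon$-greedy policy, using the return (7.4) and update (7.5). The environment interface (`env.reset()`, `env.step(action)` returning `(next_state, reward, done)`, and an `env.action_space_n` attribute) is an assumption for illustration.

```python
import numpy as np
from collections import defaultdict

def n_step_sarsa(env, n, alpha=0.1, gamma=1.0, epsilon=0.1, num_episodes=1000):
    """On-policy n-step Sarsa (a sketch; the env interface is assumed)."""
    num_actions = env.action_space_n              # assumed attribute
    Q = defaultdict(lambda: np.zeros(num_actions))

    def epsilon_greedy(state):
        if np.random.rand() < epsilon:
            return np.random.randint(num_actions)
        return int(np.argmax(Q[state]))

    for _ in range(num_episodes):
        state = env.reset()
        states, actions, rewards = [state], [epsilon_greedy(state)], [0.0]
        T, t = float('inf'), 0
        while True:
            if t < T:
                next_state, reward, done = env.step(actions[t])
                states.append(next_state)
                rewards.append(reward)
                if done:
                    T = t + 1
                else:
                    actions.append(epsilon_greedy(next_state))
            tau = t - n + 1
            if tau >= 0:
                # n-step return G_{tau:tau+n} from (7.4)
                G = sum(gamma ** (k - tau - 1) * rewards[k]
                        for k in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:
                    G += gamma ** n * Q[states[tau + n]][actions[tau + n]]
                # update (7.5)
                Q[states[tau]][actions[tau]] += alpha * (G - Q[states[tau]][actions[tau]])
            if tau == T - 1:
                break
            t += 1
    return Q
```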

Exercise 7.4
Prove that the $n$-step return of Sarsa (7.4) can be written exactly in terms of a novel TD error, as

$$G_{t:t+n} = Q_{t-1}(S_t, A_t) + \sum_{k=t}^{\min(t+n,\,T)-1} \gamma^{k-t}\left[ R_{k+1} + \gamma Q_k(S_{k+1}, A_{k+1}) - Q_{k-1}(S_k, A_k) \right] \tag{7.6}$$
ANSWER

Write $\tau = \min(t+n, T)$ and split the sum in (7.6) into a reward part and a telescoping part:

$$\sum_{k=t}^{\tau-1} \gamma^{k-t} R_{k+1} + \sum_{k=t}^{\tau-1} \left[ \gamma^{k-t+1} Q_k(S_{k+1}, A_{k+1}) - \gamma^{k-t} Q_{k-1}(S_k, A_k) \right].$$

The second sum telescopes to $\gamma^{\tau-t} Q_{\tau-1}(S_\tau, A_\tau) - Q_{t-1}(S_t, A_t)$. Adding the leading $Q_{t-1}(S_t, A_t)$ of (7.6) therefore gives

$$G_{t:t+n} = \sum_{k=t}^{\tau-1} \gamma^{k-t} R_{k+1} + \gamma^{\tau-t} Q_{\tau-1}(S_\tau, A_\tau).$$

If $t+n < T$ this is exactly the $n$-step return (7.4); if $t+n \ge T$ then $\tau = T$ and the terminal value $Q_{T-1}(S_T, \cdot) = 0$, so it reduces to the full return $G_t$, matching the convention $G_{t:t+n} = G_t$.


$n$-step Expected Sarsa

The backup diagram for the $n$-step version of Expected Sarsa is shown on the far right in Figure 7.3. It consists of a linear string of sample actions and states, just as in $n$-step Sarsa, except that its last element is a branch over all action possibilities weighted by their probability under $\pi$. This algorithm can be described by the same equation as $n$-step Sarsa (above) except with the $n$-step return redefined as

$$G_{t:t+n} \doteq R_{t+1} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n \bar{V}_{t+n-1}(S_{t+n}), \quad t+n < T, \tag{7.7}$$
(with $G_{t:t+n} \doteq G_t$ if $t + n \ge T$) where $\bar{V}_t(s)$ is the expected approximate value of state $s$, using the estimated action values at time $t$ under the target policy:

$$\bar{V}_t(s) \doteq \sum_a \pi(a \mid s)\, Q_t(s, a), \quad \text{for all } s \in \mathcal{S}. \tag{7.8}$$

$n$-step Off-policy Learning

In $n$-step methods, returns are constructed over $n$ steps, so we are interested in the relative probability of just those $n$ actions. For example, to make a simple off-policy version of $n$-step TD, the update for time $t$ (actually made at time $t+n$) can simply be weighted by $\rho_{t:t+n-1}$:

$$V_{t+n}(S_t) \doteq V_{t+n-1}(S_t) + \alpha \rho_{t:t+n-1} \left[ G_{t:t+n} - V_{t+n-1}(S_t) \right], \quad 0 \le t < T, \tag{7.9}$$
where $\rho_{t:t+n-1}$, called the importance sampling ratio, is the relative probability under the two policies of taking the $n$ actions from $A_t$ to $A_{t+n-1}$:

$$\rho_{t:h} \doteq \prod_{k=t}^{\min(h,\,T-1)} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}. \tag{7.10}$$
Similarly, our previous $n$-step Sarsa update can be completely replaced by a simple off-policy form:

$$Q_{t+n}(S_t, A_t) \doteq Q_{t+n-1}(S_t, A_t) + \alpha \rho_{t+1:t+n} \left[ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) \right], \quad 0 \le t < T. \tag{7.11}$$
Note that the importance sampling ratio here starts and ends one step later than for $n$-step TD (7.9). This is because here we are updating a state–action pair. We do not have to care how likely we were to select the action; now that we have selected it we want to learn fully from what happens, with importance sampling only for subsequent actions.

[Pseudocode box: Off-policy $n$-step Sarsa for estimating $Q \approx q_*$ or $q_\pi$]


The off-policy version of $n$-step Expected Sarsa would use the same update as above for $n$-step Sarsa except that the importance sampling ratio would have one less factor in it. That is, the above equation would use $\rho_{t+1:t+n-1}$ instead of $\rho_{t+1:t+n}$, and of course it would use the Expected Sarsa version of the $n$-step return (7.7). This is because in Expected Sarsa all possible actions are taken into account in the last state; the one actually taken has no effect and does not have to be corrected for.
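To make the indexing conventions concrete, here is a small helper (a sketch, not from the book) that computes the importance sampling ratio $\rho_{t:h}$ of (7.10) from recorded states and actions; `target_prob(a, s)` and `behavior_prob(a, s)` are assumed callables returning $\pi(a \mid s)$ and $b(a \mid s)$.

```python
def importance_sampling_ratio(states, actions, t, h, T, target_prob, behavior_prob):
    """Compute rho_{t:h} = prod_{k=t}^{min(h, T-1)} pi(A_k|S_k) / b(A_k|S_k)  (7.10).

    For off-policy n-step TD (7.9) use rho_{t:t+n-1}; for off-policy n-step Sarsa (7.11)
    use rho_{t+1:t+n}, i.e. start and end one step later, because the action A_t being
    updated needs no correction.
    """
    rho = 1.0
    for k in range(t, min(h, T - 1) + 1):
        rho *= target_prob(actions[k], states[k]) / behavior_prob(actions[k], states[k])
    return rho
```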

Off-policy Learning Without Importance Sampling: The $n$-step Tree Backup Algorithm


The idea of the algorithm is suggested by the 3-step tree-backup backup diagram:

  • Down the central spine and labeled in the diagram are three sample states and rewards, and two sample actions. These are the random variables representing the events occurring after the initial state–action pair $S_t, A_t$. Hanging off to the sides of each state are the actions that were not selected. (For the last state, all the actions are considered to have not (yet) been selected.)

[Figure: the 3-step tree-backup backup diagram]

  • So far we have always updated the estimated value of the node at the top of the diagram toward a target combining the rewards along the way (appropriately discounted) and the estimated values of the nodes at the bottom.

  • In the tree-backup update, the target includes all these things plus the estimated values of the dangling action nodes hanging off the sides, at all levels.

    • This is why it is called a tree-backup update; it is an update from the entire tree of estimated action values.

Because we have no sample data for the unselected actions, we bootstrap and use the estimates of their values in forming the target for the update. This slightly extends the idea of a backup diagram.

  • More precisely, the update is from the estimated action values of the leaf nodes of the tree. The action nodes in the interior, corresponding to the actual actions taken, do not participate. Each leaf node contributes to the target with a weight proportional to its probability of occurring under the target policy $\pi$.
    • Thus each first-level action $a$ contributes with a weight of $\pi(a \mid S_{t+1})$, except that the action actually taken, $A_{t+1}$, does not contribute at all. Its probability, $\pi(A_{t+1} \mid S_{t+1})$, is used to weight all the second-level action values.
    • Thus, each non-selected second-level action $a'$ contributes with weight $\pi(A_{t+1} \mid S_{t+1})\,\pi(a' \mid S_{t+2})$.
    • Each third-level action contributes with weight $\pi(A_{t+1} \mid S_{t+1})\,\pi(A_{t+2} \mid S_{t+2})\,\pi(a'' \mid S_{t+3})$, and so on.
  • It is as if each arrow to an action node in the diagram is weighted by the action's probability of being selected under the target policy and, if there is a tree below the action, then that weight applies to all the leaf nodes in the tree.

We can think of the 3-step tree-backup update as consisting of 6 half-steps, alternating between sample half-steps from an action to a subsequent state, and expected half-steps considering from that state all possible actions with their probabilities of occurring under the policy.

Now let us develop the detailed equations for the $n$-step tree-backup algorithm. The one-step return (target) is the same as that of Expected Sarsa,

$$G_{t:t+1} \doteq R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1})\, Q_t(S_{t+1}, a) \tag{7.15}$$
for $t < T - 1$, and the two-step tree-backup return is

$$
\begin{aligned}
G_{t:t+2} &\doteq R_{t+1} + \gamma \sum_{a \ne A_{t+1}} \pi(a \mid S_{t+1})\, Q_{t+1}(S_{t+1}, a) + \gamma \pi(A_{t+1} \mid S_{t+1}) \left( R_{t+2} + \gamma \sum_a \pi(a \mid S_{t+2})\, Q_{t+1}(S_{t+2}, a) \right)\\
&= R_{t+1} + \gamma \sum_{a \ne A_{t+1}} \pi(a \mid S_{t+1})\, Q_{t+1}(S_{t+1}, a) + \gamma \pi(A_{t+1} \mid S_{t+1})\, G_{t+1:t+2}
\end{aligned}
$$
for $t < T - 2$. The latter form suggests the general recursive definition of the tree-backup $n$-step return:

$$G_{t:t+n} \doteq R_{t+1} + \gamma \sum_{a \ne A_{t+1}} \pi(a \mid S_{t+1})\, Q_{t+n-1}(S_{t+1}, a) + \gamma \pi(A_{t+1} \mid S_{t+1})\, G_{t+1:t+n} \tag{7.16}$$
for $t < T - 1$, $n \ge 2$, with the $n = 1$ case handled by (7.15) except for $G_{T-1:t+n} \doteq R_T$. This target is then used with the usual action-value update rule from $n$-step Sarsa:

$$Q_{t+n}(S_t, A_t) \doteq Q_{t+n-1}(S_t, A_t) + \alpha \left[ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) \right], \quad 0 \le t < T,$$
while the values of all other state–action pairs remain unchanged.

[Pseudocode box: $n$-step Tree Backup for estimating $Q \approx q_*$ or $q_\pi$]
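The recursion (7.16) translates directly into code. The following sketch (an illustration under assumed interfaces, not the book's pseudocode) computes the $n$-step tree-backup return from stored transitions, given a fixed action-value table `Q` (mapping a state to an array of action values) and a callable `pi_prob(a, s)` for $\pi(a \mid s)$.

```python
def tree_backup_return(rewards, states, actions, t, n, T, Q, pi_prob, gamma=1.0):
    """n-step tree-backup return G_{t:t+n} via the recursion (7.16).

    rewards[k] is R_k (rewards[0] unused), states[k] is S_k, actions[k] is A_k.
    """
    h = min(t + n, T)                 # horizon of the return
    if h == T:
        G = rewards[T]                # G_{T-1:t+n} = R_T
        k_start = T - 1
    else:
        # innermost return G_{h-1:h}: expectation over all actions at the horizon state (7.15)
        G = rewards[h] + gamma * sum(pi_prob(a, states[h]) * Q[states[h]][a]
                                     for a in range(len(Q[states[h]])))
        k_start = h - 1
    # unroll the recursion backwards from the horizon down to time t
    for k in range(k_start, t, -1):
        expected_others = sum(pi_prob(a, states[k]) * Q[states[k]][a]
                              for a in range(len(Q[states[k]]))
                              if a != actions[k])
        G = rewards[k] + gamma * expected_others + gamma * pi_prob(actions[k], states[k]) * G
    return G
```

The value `G` returned here would then be plugged into the action-value update above, exactly as in $n$-step Sarsa.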

Exercise 7.11
Show that if the approximate action values are unchanging, then the tree-backup return (7.16) can be written as a sum of expectation-based TD errors:

$$G_{t:t+n} = Q(S_t, A_t) + \sum_{k=t}^{\min(t+n-1,\,T-1)} \delta_k \prod_{i=t+1}^{k} \gamma \pi(A_i \mid S_i)$$
where $\delta_t \doteq R_{t+1} + \gamma \bar{V}_t(S_{t+1}) - Q(S_t, A_t)$ and $\bar{V}_t$ is given by (7.8).

With $Q$ unchanging, add and subtract $\gamma\pi(A_{t+1} \mid S_{t+1})\, Q(S_{t+1}, A_{t+1})$ in the recursion (7.16):

$$
\begin{aligned}
G_{t:t+n} &= R_{t+1} + \gamma \bar{V}(S_{t+1}) - \gamma\pi(A_{t+1} \mid S_{t+1})\, Q(S_{t+1}, A_{t+1}) + \gamma\pi(A_{t+1} \mid S_{t+1})\, G_{t+1:t+n}\\
&= Q(S_t, A_t) + \delta_t + \gamma\pi(A_{t+1} \mid S_{t+1}) \left[ G_{t+1:t+n} - Q(S_{t+1}, A_{t+1}) \right].
\end{aligned}
$$

Applying the same identity to $G_{t+1:t+n}$ and so on, and noting that at termination $G_{T-1:t+n} = R_T = Q(S_{T-1}, A_{T-1}) + \delta_{T-1}$ (since the terminal values are zero), the recursion unrolls to

$$G_{t:t+n} = Q(S_t, A_t) + \sum_{k=t}^{\min(t+n-1,\,T-1)} \delta_k \prod_{i=t+1}^{k} \gamma \pi(A_i \mid S_i),$$

where the empty product (the $k = t$ term) is taken to be 1.

A Unifying Algorithm: $n$-step $Q(\sigma)$

So far in this chapter we have considered three different kinds of action-value algorithms, corresponding to the first three backup diagrams shown in Figure 7.5.

[Figure 7.5: the backup diagrams of the three kinds of $n$-step action-value updates considered so far ($n$-step Sarsa, $n$-step Tree Backup, $n$-step Expected Sarsa), together with a fourth diagram for the unifying algorithm, $n$-step $Q(\sigma)$]

  • $n$-step Sarsa has all sample transitions
  • the tree-backup algorithm has all state-to-action transitions fully branched without sampling
  • $n$-step Expected Sarsa has all sample transitions except for the last state-to-action one, which is fully branched with an expected value

To what extent can these algorithms be unified?

  • One idea for unification is suggested by the fourth backup diagram in Figure 7.5. This is the idea that one might decide on a step-by-step basis whether one wanted to take the action as a sample, as in Sarsa, or consider the expectation over all actions instead, as in the tree-backup update.
  • Then, if one chose always to sample, one would obtain Sarsa, whereas if one chose never to sample, one would get the tree-backup algorithm. Expected Sarsa would be the case where one chose to sample for all steps except for the last one.

To increase the possibilities even further we can consider a continuous variation between sampling and expectation.

  • Let $\sigma_t \in [0, 1]$ denote the degree of sampling on step $t$, with $\sigma = 1$ denoting full sampling and $\sigma = 0$ denoting a pure expectation with no sampling.
  • The random variable $\sigma_t$ might be set as a function of the state, action, or state–action pair at time $t$. We call this proposed new algorithm $n$-step $Q(\sigma)$.

Now let us develop the equations of $n$-step $Q(\sigma)$.

  • First we write the tree-backup $n$-step return (7.16) in terms of the horizon $h = t + n$ and then in terms of the expected approximate value $\bar{V}$ (7.8):

$$
\begin{aligned}
G_{t:h} &= R_{t+1} + \gamma \sum_{a \ne A_{t+1}} \pi(a \mid S_{t+1})\, Q_{h-1}(S_{t+1}, a) + \gamma \pi(A_{t+1} \mid S_{t+1})\, G_{t+1:h}\\
&= R_{t+1} + \gamma \bar{V}_{h-1}(S_{t+1}) - \gamma \pi(A_{t+1} \mid S_{t+1})\, Q_{h-1}(S_{t+1}, A_{t+1}) + \gamma \pi(A_{t+1} \mid S_{t+1})\, G_{t+1:h}\\
&= R_{t+1} + \gamma \pi(A_{t+1} \mid S_{t+1}) \left( G_{t+1:h} - Q_{h-1}(S_{t+1}, A_{t+1}) \right) + \gamma \bar{V}_{h-1}(S_{t+1}),
\end{aligned}
$$
after which it is exactly like the $n$-step return for Sarsa with control variates (7.14) except with the action probability $\pi(A_{t+1} \mid S_{t+1})$ substituted for the importance-sampling ratio $\rho_{t+1}$. For $Q(\sigma)$, we slide linearly between these two cases:

$$G_{t:h} \doteq R_{t+1} + \gamma \left( \sigma_{t+1} \rho_{t+1} + (1 - \sigma_{t+1}) \pi(A_{t+1} \mid S_{t+1}) \right) \left( G_{t+1:h} - Q_{h-1}(S_{t+1}, A_{t+1}) \right) + \gamma \bar{V}_{h-1}(S_{t+1}) \tag{7.17}$$
for $t < h \le T$. The recursion ends with $G_{h:h} \doteq Q_{h-1}(S_h, A_h)$ if $h < T$, or with $G_{T-1:T} \doteq R_T$ if $h = T$.

Then we use the earlier update for $n$-step Sarsa without importance-sampling ratios (7.5) instead of (7.11), because now the ratios are incorporated in the $n$-step return.

[Pseudocode box: Off-policy $n$-step $Q(\sigma)$ for estimating $Q \approx q_*$ or $q_\pi$]
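To tie the recursion (7.17) to something executable, here is a minimal sketch that computes the $Q(\sigma)$ return from stored experience. The interfaces (`pi_prob(a, s)` and `b_prob(a, s)` for $\pi(a \mid s)$ and $b(a \mid s)$, `sigmas[k]` giving $\sigma_k$, and `Q[s]` as an array of action values) are assumptions for illustration.

```python
def q_sigma_return(rewards, states, actions, sigmas, t, h, T,
                   Q, pi_prob, b_prob, gamma=1.0):
    """Q(sigma) return G_{t:h} via the recursion (7.17), for t < h <= T.

    rewards[k] is R_k, states[k] is S_k, actions[k] is A_k, sigmas[k] is sigma_k.
    """
    # base case of the recursion
    if h == T:
        G = rewards[T]                      # G_{T-1:T} = R_T
        k_start = T - 1
    else:
        G = Q[states[h]][actions[h]]        # G_{h:h} = Q_{h-1}(S_h, A_h)
        k_start = h
    # unroll (7.17) backwards from the horizon down to time t
    for k in range(k_start, t, -1):
        s, a = states[k], actions[k]
        v_bar = sum(pi_prob(ap, s) * Q[s][ap] for ap in range(len(Q[s])))
        rho = pi_prob(a, s) / b_prob(a, s)  # single-step importance ratio rho_k
        weight = sigmas[k] * rho + (1 - sigmas[k]) * pi_prob(a, s)
        G = rewards[k] + gamma * weight * (G - Q[s][a]) + gamma * v_bar
    return G
```

Setting all `sigmas` to 0 recovers the tree-backup return, while setting them to 1 yields the fully sampled, Sarsa-like return, matching the unification described above.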

In Chapter 12, we will see how multi-step TD methods can be implemented with minimal memory and computational complexity using eligibility traces.
