These are reinforcement learning notes, based mainly on the following:
- Reinforcement Learning: An Introduction
- All code comes from GitHub
- Exercise answers are based on GitHub
$n$-step Bootstrapping
- In this chapter we present $n$-step TD methods that generalize the MC methods and the one-step TD methods so that one can shift from one to the other smoothly as needed to meet the demands of a particular task. $n$-step methods span a spectrum with MC methods at one end and one-step TD methods at the other. The best methods are often intermediate between the two extremes.
- Another way of looking at the benefits of $n$-step methods is that they free you from the tyranny of the time step (i.e., they remove the inflexibility of having a single update interval). With one-step TD methods the same time step determines how often the action can be changed and the time interval over which bootstrapping is done. In many applications one wants to be able to update the action very fast to take into account anything that has changed, but bootstrapping works best if it is over a length of time in which a significant and recognizable state change has occurred. With one-step TD methods, these time intervals are the same, and so a compromise must be made. $n$-step methods enable bootstrapping to occur over multiple steps, freeing us from the tyranny of the single time step.
$n$-step TD Prediction
Consider the update of the estimated value of state $S_t$ as a result of the state–reward sequence $S_t, R_{t+1}, S_{t+1}, R_{t+2}, \ldots, R_T, S_T$ (omitting the actions).
- Whereas in Monte Carlo updates the target is the return:
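$$G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T$$
(the full return as defined in the book, with $T$ the final time step of the episode).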
- In one-step updates the target is the one-step return:
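$$G_{t:t+1} \doteq R_{t+1} + \gamma V_t(S_{t+1})$$
where $V_t$ is the estimate at time $t$ of $v_\pi$.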
The subscripts on $G_{t:t+1}$ indicate that it is a truncated return for time $t$ using rewards up until time $t+1$, with the discounted estimate $\gamma V_t(S_{t+1})$ taking the place of the other terms $\gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T$ of the full return $G_t$.
- The target for a two-step update is the two-step return:
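$$G_{t:t+2} \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 V_{t+1}(S_{t+2})$$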
- Similarly, the target for an arbitrary $n$-step update is the $n$-step return:
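$$G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V_{t+n-1}(S_{t+n}), \qquad n \ge 1,\ 0 \le t < T - n \tag{7.1}$$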
All $n$-step returns can be considered approximations to the full return, truncated after $n$ steps and then corrected for the remaining missing terms by $V_{t+n-1}(S_{t+n})$. If $t + n \ge T$ (if the $n$-step return extends to or beyond termination), then the $n$-step return is defined to be equal to the ordinary full return ($G_{t:t+n} = G_t$ if $t + n \ge T$).
- $n$-step TD: the natural state-value learning algorithm for using $n$-step returns is thus
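$$V_{t+n}(S_t) \doteq V_{t+n-1}(S_t) + \alpha \left[ G_{t:t+n} - V_{t+n-1}(S_t) \right], \qquad 0 \le t < T \tag{7.2}$$
while the values of all other states remain unchanged: $V_{t+n}(s) = V_{t+n-1}(s)$ for all $s \ne S_t$.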
Note that no changes at all are made during the first $n-1$ steps of each episode. To make up for that, an equal number of additional updates are made at the end of the episode, after termination and before starting the next episode.
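As a concrete illustration, the following is a minimal sketch of tabular $n$-step TD prediction (this is not the repository code mentioned at the top; the `env.reset()`/`env.step()` interface, the `policy` callable, and hashable states are assumptions):

```python
from collections import defaultdict


def n_step_td_prediction(env, policy, n=4, alpha=0.1, gamma=1.0, num_episodes=1000):
    """Tabular n-step TD prediction of v_pi (a sketch of the algorithm above).

    Assumes a hypothetical env with reset() -> state and
    step(action) -> (next_state, reward, done), and policy(state) -> action.
    """
    V = defaultdict(float)  # state-value estimates, initialized to 0
    for _ in range(num_episodes):
        states = [env.reset()]   # S_0
        rewards = [0.0]          # dummy R_0 so that rewards[t] lines up with R_t
        T = float('inf')         # episode length, unknown until termination
        t = 0
        while True:
            if t < T:
                action = policy(states[t])
                next_state, reward, done = env.step(action)
                states.append(next_state)   # S_{t+1}
                rewards.append(reward)      # R_{t+1}
                if done:
                    T = t + 1
            tau = t - n + 1  # time whose state's estimate is being updated
            if tau >= 0:
                # n-step return: discounted rewards R_{tau+1} .. R_{min(tau+n, T)} ...
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, int(min(tau + n, T)) + 1))
                # ... plus the bootstrapped estimate if the return does not reach T
                if tau + n < T:
                    G += gamma ** n * V[states[tau + n]]
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return V
```

Note how the updates continue for $n-1$ extra iterations after termination, matching the remark above about the end of each episode.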
Exercise 7.1
In Chapter 6 we noted that the Monte Carlo error can be written as the sum of TD errors (6.6) if the value estimates don't change from step to step. Show that the $n$-step error used in (7.2) can also be written as a sum of TD errors (again if the value estimates don't change), generalizing the earlier result.
ANSWER
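A sketch of the argument, assuming the value estimates are held fixed at $V$ and writing $\delta_k \doteq R_{k+1} + \gamma V(S_{k+1}) - V(S_k)$:
$$\begin{aligned} G_{t:t+n} - V(S_t) &= R_{t+1} + \gamma G_{t+1:t+n} - V(S_t) \\ &= \delta_t + \gamma \left( G_{t+1:t+n} - V(S_{t+1}) \right) \\ &= \delta_t + \gamma \delta_{t+1} + \gamma^2 \left( G_{t+2:t+n} - V(S_{t+2}) \right) \\ &= \sum_{k=t}^{t+n-1} \gamma^{k-t} \delta_k, \end{aligned}$$
since $G_{t+n:t+n} = V(S_{t+n})$ makes the final correction term vanish. If $t + n \ge T$ the sum simply truncates at $k = T-1$, recovering the Monte Carlo result of Chapter 6.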
Error reduction property of $n$-step returns
- The $n$-step return uses the value function $V_{t+n-1}$ to correct for the missing rewards beyond $R_{t+n}$.
- An important property of $n$-step returns is that their expectation is guaranteed to be a better estimate of $v_\pi$ than $V_{t+n-1}$ is, in a worst-state sense. That is, the worst error of the expected $n$-step return is guaranteed to be less than or equal to $\gamma^n$ times the worst error under $V_{t+n-1}$:
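$$\max_s \left| \mathbb{E}_\pi\!\left[ G_{t:t+n} \mid S_t = s \right] - v_\pi(s) \right| \le \gamma^n \max_s \left| V_{t+n-1}(s) - v_\pi(s) \right|, \qquad \text{for all } n \ge 1 \tag{7.3}$$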
- Because of the error reduction property, one can show formally that all $n$-step TD methods converge to the correct predictions under appropriate technical conditions.
- The $n$-step TD methods thus form a family of sound methods, with one-step TD methods and Monte Carlo methods as extreme members.
$n$-step Sarsa
The $n$-step version of Sarsa we call $n$-step Sarsa, and the original version presented in the previous chapter we henceforth call one-step Sarsa, or Sarsa(0).
- The main idea is to simply switch states for actions (state–action pairs) and then use an $\varepsilon$-greedy policy. We redefine $n$-step returns (update targets) in terms of estimated action values:
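$$G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n Q_{t+n-1}(S_{t+n}, A_{t+n}), \qquad n \ge 1,\ 0 \le t < T - n \tag{7.4}$$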
with $G_{t:t+n} = G_t$ if $t + n \ge T$.
- The $n$-step Sarsa update is then
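$$Q_{t+n}(S_t, A_t) \doteq Q_{t+n-1}(S_t, A_t) + \alpha \left[ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) \right], \qquad 0 \le t < T \tag{7.5}$$
while the values of all other state–action pairs remain unchanged.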
Exercise 7.4
Prove that the $n$-step return of Sarsa (7.4) can be written exactly in terms of a novel TD error, as
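$$G_{t:t+n} = Q_{t-1}(S_t, A_t) + \sum_{k=t}^{\min(t+n, T)-1} \gamma^{k-t} \left[ R_{k+1} + \gamma Q_k(S_{k+1}, A_{k+1}) - Q_{k-1}(S_k, A_k) \right] \tag{7.6}$$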
ANSWER
$n$-step Expected Sarsa
The backup diagram for the $n$-step version of Expected Sarsa is shown on the far right in Figure 7.3. It consists of a linear string of sample actions and states, just as in $n$-step Sarsa, except that its last element is a branch over all action possibilities weighted by their probability under $\pi$. This algorithm can be described by the same equation as $n$-step Sarsa (above) except with the $n$-step return redefined as
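$$G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n \overline{V}_{t+n-1}(S_{t+n}), \qquad t + n < T \tag{7.7}$$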
(with $G_{t:t+n} = G_t$ if $t + n \ge T$), where
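$\overline{V}_t(s)$ is the expected approximate value of state $s$, using the estimated action values at time $t$ under the target policy:
$$\overline{V}_t(s) \doteq \sum_a \pi(a \mid s)\, Q_t(s, a) \tag{7.8}$$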
$n$-step Off-policy Learning
In $n$-step methods, returns are constructed over $n$ steps, so we are interested in the relative probability of just those $n$ actions. For example, to make a simple off-policy version of $n$-step TD, the update for time $t$ (actually made at time $t + n$) can simply be weighted by $\rho_{t:t+n-1}$:
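$$V_{t+n}(S_t) \doteq V_{t+n-1}(S_t) + \alpha \rho_{t:t+n-1} \left[ G_{t:t+n} - V_{t+n-1}(S_t) \right], \qquad 0 \le t < T \tag{7.9}$$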
where $\rho_{t:t+n-1}$, called the importance sampling ratio, is the relative probability under the two policies of taking the $n$ actions from $A_t$ to $A_{t+n-1}$:
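$$\rho_{t:h} \doteq \prod_{k=t}^{\min(h, T-1)} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)} \tag{7.10}$$
where $\pi$ is the target policy and $b$ the behavior policy.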
Similarly, our previous $n$-step Sarsa update can be completely replaced by a simple off-policy form:
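$$Q_{t+n}(S_t, A_t) \doteq Q_{t+n-1}(S_t, A_t) + \alpha \rho_{t+1:t+n} \left[ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) \right], \qquad 0 \le t < T \tag{7.11}$$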
Note that the importance sampling ratio here starts and ends one step later than for $n$-step TD (7.9). This is because here we are updating a state–action pair. We do not have to care how likely we were to select the action; now that we have selected it we want to learn fully from what happens, with importance sampling only for subsequent actions.
The off-policy version of $n$-step Expected Sarsa would use the same update as above for $n$-step Sarsa except that the importance sampling ratio would have one less factor in it. That is, the above equation would use $\rho_{t+1:t+n-1}$ instead of $\rho_{t+1:t+n}$, and of course it would use the Expected Sarsa version of the $n$-step return (7.7). This is because in Expected Sarsa all possible actions are taken into account in the last state; the one actually taken has no effect and does not have to be corrected for.
Off-policy Learning Without Importance Sampling: The $n$-step Tree Backup Algorithm
The idea of the algorithm is suggested by the 3-step tree-backup backup diagram:
- Down the central spine and labeled in the diagram are three sample states and rewards, and two sample actions. These are the random variables representing the events occurring after the initial state–action pair $S_t, A_t$. Hanging off to the sides of each state are the actions that were not selected. (For the last state, all the actions are considered to have not (yet) been selected.)
- So far we have always updated the estimated value of the node at the top of the diagram toward a target combining the rewards along the way (appropriately discounted) and the estimated values of the nodes at the bottom.
- In the tree-backup update, the target includes all these things plus the estimated values of the dangling action nodes hanging off the sides, at all levels.
- This is why it is called a tree-backup update; it is an update from the entire tree of estimated action values.
Because we have no sample data for the unselected actions, we bootstrap and use the estimates of their values in forming the target for the update. This slightly extends the idea of a backup diagram.
- More precisely, the update is from the estimated action values of the leaf nodes of the tree. The action nodes in the interior, corresponding to the actual actions taken, do not participate. Each leaf node contributes to the target with a weight proportional to its probability of occurring under the target policy $\pi$.
- Thus each first-level action $a$ contributes with a weight of $\pi(a \mid S_{t+1})$, except that the action actually taken, $A_{t+1}$, does not contribute at all. Its probability, $\pi(A_{t+1} \mid S_{t+1})$, is used to weight all the second-level action values.
- Thus, each non-selected second-level action $a'$ contributes with weight $\pi(A_{t+1} \mid S_{t+1})\,\pi(a' \mid S_{t+2})$.
- Each third-level action contributes with weight $\pi(A_{t+1} \mid S_{t+1})\,\pi(A_{t+2} \mid S_{t+2})\,\pi(a'' \mid S_{t+3})$, and so on.
- It is as if each arrow to an action node in the diagram is weighted by the action’s probability of being selected under the target policy and, if there is a tree below the action, then that weight applies to all the leaf nodes in the tree.
We can think of the 3-step tree-backup update as consisting of 6 half-steps, alternating between sample half-steps from an action to a subsequent state, and expected half-steps considering from that state all possible actions with their probabilities of occurring under the policy.
Now let us develop the detailed equations for the $n$-step tree-backup algorithm. The one-step return (target) is the same as that of Expected Sarsa,
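$$G_{t:t+1} \doteq R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1})\, Q_t(S_{t+1}, a) \tag{7.15}$$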
for $t < T - 1$, and the two-step tree-backup return is
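$$\begin{aligned} G_{t:t+2} &\doteq R_{t+1} + \gamma \sum_{a \ne A_{t+1}} \pi(a \mid S_{t+1})\, Q_{t+1}(S_{t+1}, a) \\ &\quad + \gamma \pi(A_{t+1} \mid S_{t+1}) \Big( R_{t+2} + \gamma \sum_a \pi(a \mid S_{t+2})\, Q_{t+1}(S_{t+2}, a) \Big) \\ &= R_{t+1} + \gamma \sum_{a \ne A_{t+1}} \pi(a \mid S_{t+1})\, Q_{t+1}(S_{t+1}, a) + \gamma \pi(A_{t+1} \mid S_{t+1})\, G_{t+1:t+2} \end{aligned}$$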
for $t < T - 2$. The latter form suggests the general recursive definition of the tree-backup $n$-step return:
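$$G_{t:t+n} \doteq R_{t+1} + \gamma \sum_{a \ne A_{t+1}} \pi(a \mid S_{t+1})\, Q_{t+n-1}(S_{t+1}, a) + \gamma \pi(A_{t+1} \mid S_{t+1})\, G_{t+1:t+n} \tag{7.16}$$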
for $t < T - 1$, $n \ge 2$, with the $n = 1$ case handled by (7.15) except for $G_{T-1:t+n} = R_T$. This target is then used with the usual action-value update rule from $n$-step Sarsa:
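$$Q_{t+n}(S_t, A_t) \doteq Q_{t+n-1}(S_t, A_t) + \alpha \left[ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) \right]$$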
for $0 \le t < T$.
Exercise 7.11
Show that if the approximate action values are unchanging, then the tree-backup return (7.16) can be written as a sum of expectation-based TD errors:
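$$G_{t:t+n} = Q(S_t, A_t) + \sum_{k=t}^{\min(t+n, T)-1} \delta_k \prod_{i=t+1}^{k} \gamma\, \pi(A_i \mid S_i)$$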
where $\delta_t = R_{t+1} + \gamma \overline{V}_t(S_{t+1}) - Q(S_t, A_t)$ and $\overline{V}_t$ is given by (7.8).
A Unifying Algorithm: $n$-step $Q(\sigma)$
So far in this chapter we have considered three different kinds of action-value algorithms, corresponding to the first three backup diagrams shown in Figure 7.5.
- $n$-step Sarsa has all sample transitions
- the tree-backup algorithm has all state-to-action transitions fully branched without sampling
- $n$-step Expected Sarsa has all sample transitions except for the last state-to-action one, which is fully branched with an expected value
To what extent can these algorithms be unified?
- One idea for unification is suggested by the fourth backup diagram in Figure 7.5. This is the idea that one might decide on a step-by-step basis whether one wanted to take the action as a sample, as in Sarsa, or consider the expectation over all actions instead, as in the tree-backup update.
- Then, if one chose always to sample, one would obtain Sarsa, whereas if one chose never to sample, one would get the tree-backup algorithm. Expected Sarsa would be the case where one chose to sample for all steps except for the last one.
To increase the possibilities even further we can consider a continuous variation between sampling and expectation.
- Let $\sigma_t \in [0, 1]$ denote the degree of sampling on step $t$, with $\sigma = 1$ denoting full sampling and $\sigma = 0$ denoting a pure expectation with no sampling.
- The random variable $\sigma_t$ might be set as a function of the state, action, or state–action pair at time $t$. We call this proposed new algorithm $n$-step $Q(\sigma)$.
Now let us develop the equations of $n$-step $Q(\sigma)$.
- First we write the tree-backup $n$-step return (7.16) in terms of the horizon $h = t + n$ and then in terms of the expected approximate value $\overline{V}$ (7.8):
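$$\begin{aligned} G_{t:h} &= R_{t+1} + \gamma \sum_{a \ne A_{t+1}} \pi(a \mid S_{t+1})\, Q_{h-1}(S_{t+1}, a) + \gamma \pi(A_{t+1} \mid S_{t+1})\, G_{t+1:h} \\ &= R_{t+1} + \gamma \overline{V}_{h-1}(S_{t+1}) - \gamma \pi(A_{t+1} \mid S_{t+1})\, Q_{h-1}(S_{t+1}, A_{t+1}) + \gamma \pi(A_{t+1} \mid S_{t+1})\, G_{t+1:h} \\ &= R_{t+1} + \gamma \pi(A_{t+1} \mid S_{t+1}) \big( G_{t+1:h} - Q_{h-1}(S_{t+1}, A_{t+1}) \big) + \gamma \overline{V}_{h-1}(S_{t+1}), \end{aligned}$$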
after which it is exactly like the $n$-step return for Sarsa with control variates (7.14) except with the action probability $\pi(A_{t+1} \mid S_{t+1})$ substituted for the importance-sampling ratio $\rho_{t+1}$. For $Q(\sigma)$, we slide linearly between these two cases:
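$$G_{t:h} \doteq R_{t+1} + \gamma \big( \sigma_{t+1} \rho_{t+1} + (1 - \sigma_{t+1}) \pi(A_{t+1} \mid S_{t+1}) \big) \big( G_{t+1:h} - Q_{h-1}(S_{t+1}, A_{t+1}) \big) + \gamma \overline{V}_{h-1}(S_{t+1}) \tag{7.17}$$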
for $t < h \le T$. The recursion ends with $G_{h:h} \doteq Q_{h-1}(S_h, A_h)$ if $h < T$, or with $G_{T-1:T} \doteq R_T$ if $h = T$.
Then we use the earlier update for $n$-step Sarsa without importance-sampling ratios (7.5), instead of (7.11), because now the ratios are incorporated in the $n$-step return.
In Chapter 12, we will see how multi-step TD methods can be implemented with minimal memory and computational complexity using eligibility traces.