n-step Bootstrapping: Part 1

Prediction

The n-step TD methods lie between MC and TD(0). They perform an update based on an intermediate number of rewards: more than one (as in TD(0)), but fewer than all of them up to termination (as in MC). Both MC and TD(0) are thus extreme cases of n-step TD.

In MC, the update occurs at the end of an episode, while in TD(0), the update occurs at the next time step.

Some backup diagrams of specific n-step methods are shown in the following figure:
[Figure: backup diagrams of n-step methods, from one-step TD up to Monte Carlo]
Notice that all the diagrams start and end with a state, because we estimate the state value $v_{\pi}(S)$. In the control part, we estimate the action value $q_{\pi}(S, A)$, whose diagrams all start and end with an action.

In Monte Carlo updates the target is the full return, whereas in one-step updates the target is the first reward plus the discounted estimated value of the next state.

n-step TD

The returns used as update targets:

MC uses the complete return:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots + \gamma^{T-t-1} R_T$$

one-step return:
$$G_{t:t+1} = R_{t+1} + \gamma V_t(S_{t+1})$$
where $V_t$ is the estimate of $v_{\pi}$ at time $t$.

two-step return:
$$G_{t:t+2} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V_{t+1}(S_{t+2})$$

n-step return:
$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} V_{t+n-1}(S_{t+n})$$
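To make the definition concrete, here is a minimal Python sketch (my own illustration, not something from the book) of computing the n-step return target from a window of observed rewards and a bootstrap value:

```python
def n_step_return(rewards, gamma, v_bootstrap=0.0):
    """G_{t:t+n} = R_{t+1} + gamma*R_{t+2} + ... + gamma^{n-1}*R_{t+n}
                   + gamma^n * V(S_{t+n}).

    rewards     -- [R_{t+1}, ..., R_{t+n}] observed after time t
    v_bootstrap -- V_{t+n-1}(S_{t+n}); pass 0.0 if the episode has already
                   terminated, which recovers the complete MC return
    """
    n = len(rewards)
    G = sum(gamma ** i * r for i, r in enumerate(rewards))   # discounted rewards
    return G + gamma ** n * v_bootstrap                      # bootstrap term

# e.g. a 3-step return with gamma = 0.9 and V(S_{t+3}) = 5.0:
print(n_step_return([1.0, 0.0, 2.0], gamma=0.9, v_bootstrap=5.0))
```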

Note that n-step returns for $n > 1$ involve future rewards and states that are not available at the time of transition from $t$ to $t+1$.

No real algorithm can use the n-step return until after it has seen $R_{t+n}$ and computed $V_{t+n-1}$. The first time these are available is $t+n$.

This also means that no updates at all are made during the first $n-1$ steps of each episode.
The natural state-value learning algorithm for using the n-step return is:
$$V_{t+n}(S_t) = V_{t+n-1}(S_t) + \alpha \bigl[ G_{t:t+n} - V_{t+n-1}(S_t) \bigr], \qquad 0 \leq t < T$$
This is the n-step TD algorithm. Note that it changes only the value of $S_t$; the values of all other states remain unchanged: $V_{t+n}(s) = V_{t+n-1}(s)$ for all $s \neq S_t$.

The complete pseudocode is given as:
[Figure: pseudocode for n-step TD for estimating $V \approx v_{\pi}$]
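In case the pseudocode image does not render, the following is a rough Python sketch of the same algorithm. The environment interface (`env.reset()`, `env.step(action)` returning `(next_state, reward, done)`) and the `policy(state)` function are placeholder assumptions for illustration, not part of the book's pseudocode:

```python
import numpy as np

def n_step_td_prediction(env, policy, n_states, n=4, alpha=0.1,
                         gamma=1.0, n_episodes=1000):
    """n-step TD for estimating V ~= v_pi (tabular sketch).

    env    -- hypothetical episodic environment: reset() -> state,
              step(action) -> (next_state, reward, done); states are ints
    policy -- function state -> action for the policy being evaluated
    """
    V = np.zeros(n_states)
    for _ in range(n_episodes):
        state = env.reset()
        states, rewards = [state], [0.0]          # S_0 stored; R_0 is unused
        T = float('inf')                          # episode length (unknown yet)
        t = 0
        while True:
            if t < T:
                action = policy(state)
                state, reward, done = env.step(action)
                states.append(state)              # S_{t+1}
                rewards.append(reward)            # R_{t+1}
                if done:
                    T = t + 1
            tau = t - n + 1                       # time whose value is updated
            if tau >= 0:
                # G_{tau:tau+n}: discounted rewards, then bootstrap if needed
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:
                    G += gamma ** n * V[states[tau + n]]
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return V
```

The update made at time $t$ targets the state visited at time $\tau = t - n + 1$, which is exactly why no updates happen during the first $n-1$ steps of an episode.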

The expectation of the n-step return is guaranteed to be a better estimate of $v_{\pi}$ than $V_{t+n-1}$ is, in a worst-state sense.
That is, the worst error of the expected n-step return is guaranteed to be less than or equal to $\gamma^n$ times the worst error under $V_{t+n-1}$:
$$\max_{s} \bigl| E_{\pi}[G_{t:t+n} \mid S_t = s] - v_{\pi}(s) \bigr| \leq \gamma^n \max_{s} \bigl| V_{t+n-1}(s) - v_{\pi}(s) \bigr|$$

The choice of n

[Figure: performance of n-step TD methods for a range of n, measured by average RMS error over the 19 states]
The figure above shows how the estimates behave for different values of $n$. The performance measure is the square root of the average squared error between the predictions at the end of the episode and the true values of the 19 states (the average RMS error); lower is better. From the figure, methods with an intermediate value of $n$ worked best, which implies that neither MC nor TD(0) is the best method.
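For reference, this error measure can be computed as below (a trivial sketch; `V` and `v_true` stand for hypothetical arrays of the 19 estimated and true state values):

```python
import numpy as np

def rms_error(V, v_true):
    """Root-mean-square error between estimated and true state values."""
    return float(np.sqrt(np.mean((np.asarray(V) - np.asarray(v_true)) ** 2)))
```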

Control

n-step Sarsa

The idea is to apply the n-step methods to control. As previously mentioned, the updates now focus on state-action pairs, and the action-value function $Q(s, a)$ takes the place of the state-value function $V(s)$.

The n-step return is redefined as:
$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} Q_{t+n-1}(S_{t+n}, A_{t+n}), \qquad n \geq 1,\ 0 \leq t < T-n$$
with $G_{t:t+n} = G_t$ if $t+n \geq T$.

And the update algorithm is then:
$$Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha \bigl[ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) \bigr], \qquad 0 \leq t < T$$

As in prediction, the values of all other state-action pairs remain unchanged: $Q_{t+n}(s,a) = Q_{t+n-1}(s, a)$ for all $s, a$ such that $s \neq S_t$ or $a \neq A_t$. This algorithm is called n-step Sarsa.
[Figure: pseudocode for n-step Sarsa]
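A rough Python sketch of n-step Sarsa with an ε-greedy policy is given below; as before, the `env` interface and the integer state/action encoding are assumptions for illustration, not part of the book's pseudocode:

```python
import numpy as np

def n_step_sarsa(env, n_states, n_actions, n=4, alpha=0.1, gamma=1.0,
                 epsilon=0.1, n_episodes=1000):
    """On-policy n-step Sarsa with an epsilon-greedy policy (tabular sketch).

    env -- hypothetical episodic environment: reset() -> state,
           step(action) -> (next_state, reward, done); states/actions are ints
    """
    Q = np.zeros((n_states, n_actions))

    def eps_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[s]))

    for _ in range(n_episodes):
        state = env.reset()
        states, actions, rewards = [state], [eps_greedy(state)], [0.0]
        T = float('inf')
        t = 0
        while True:
            if t < T:
                state, reward, done = env.step(actions[t])    # take A_t
                states.append(state)
                rewards.append(reward)
                if done:
                    T = t + 1
                else:
                    actions.append(eps_greedy(state))         # select A_{t+1}
            tau = t - n + 1
            if tau >= 0:
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:        # bootstrap on Q(S_{tau+n}, A_{tau+n})
                    G += gamma ** n * Q[states[tau + n], actions[tau + n]]
                s_tau, a_tau = states[tau], actions[tau]
                Q[s_tau, a_tau] += alpha * (G - Q[s_tau, a_tau])
            if tau == T - 1:
                break
            t += 1
    return Q
```

The only differences from the prediction sketch are that actions are stored as well and the bootstrap uses $Q(S_{\tau+n}, A_{\tau+n})$ instead of $V(S_{\tau+n})$.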
And the backup diagram is:
[Figure: backup diagrams for the spectrum of n-step Sarsa methods, including n-step Expected Sarsa]
The following figure shows how n-step Sarsa can speed up learning compared with one-step methods.
[Figure: gridworld example comparing the action values strengthened by one-step and 10-step Sarsa after one episode]
The first panel shows the complete path taken by the agent; G is the terminal position, and only there does the agent receive a positive reward. The arrows in the other two panels show which action values were strengthened as a result of this path by the one-step and 10-step Sarsa methods.

The one-step method strengthens only the last action of the sequence of actions that led to the high reward, whereas the n-step method strengthens the last $n$ actions of the sequence, so much more is learned from a single episode.

n-step Expected Sarsa

According to the backup diagram above, the n-step version of Expected Sarsa has the same form as n-step Sarsa, except that its last element is a branch over all action possibilities. The n-step return is therefore redefined as:
$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} \overline{V}_{t+n-1}(S_{t+n}), \qquad t < T-n$$
with $G_{t:t+n} = G_t$ for $t+n \geq T$, where $\overline{V}_{t+n-1}$ is the expected approximate value of state $s$, using the estimated action values at time $t$, under the target policy:

$$\overline{V}_t(s) = \sum_{a} \pi(a \mid s) Q_t(s, a), \qquad \text{for all } s \in \mathcal{S}$$

If $s$ is terminal, then its expected approximate value is defined to be 0.

$\overline{V}_t(s)$ weights all the possible actions of the state, which is why it is written in the form of a “V”.
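A small sketch of how $\overline{V}$ and the n-step Expected Sarsa target could be computed, assuming `Q` is a (states × actions) array and `pi` is an array of action probabilities (both hypothetical names used only for illustration):

```python
import numpy as np

def v_bar(Q, pi, s):
    """Expected approximate value of state s under the target policy:
    V_bar(s) = sum_a pi(a | s) * Q(s, a)."""
    return float(np.dot(pi[s], Q[s]))

def expected_sarsa_return(rewards, Q, pi, s_tpn, gamma):
    """n-step Expected Sarsa return
    G_{t:t+n} = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n V_bar(S_{t+n}).

    rewards -- [R_{t+1}, ..., R_{t+n}]
    s_tpn   -- S_{t+n}, or None if the episode terminated inside the window
    """
    n = len(rewards)
    G = sum(gamma ** i * r for i, r in enumerate(rewards))
    if s_tpn is not None:        # the expected value of a terminal state is 0
        G += gamma ** n * v_bar(Q, pi, s_tpn)
    return G
```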

n-step Off-policy Learning

Here a new concept appears: the importance sampling ratio, denoted $\rho_{t:t+n-1}$.

It is the relative probability under the two policies of taking the $n$ actions from $A_t$ to $A_{t+n-1}$:

$$\rho_{t:h} = \prod_{k=t}^{\min(h, T-1)} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$$

$\pi$ and $b$ in the formula above denote two different policies: the target policy and the behavior policy.

The off-policy version of n-step TD is:
$$V_{t+n}(S_t) = V_{t+n-1}(S_t) + \alpha \rho_{t:t+n-1} \bigl[ G_{t:t+n} - V_{t+n-1}(S_t) \bigr], \qquad 0 \leq t < T$$

To see this: if the two policies are actually the same (the on-policy case), then $\rho$ is always 1, so the update above generalizes, and can completely replace, the earlier n-step TD update. Similarly, the previous n-step Sarsa update can be completely replaced by an off-policy form:
$$Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha \rho_{t+1:t+n} \bigl[ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) \bigr], \qquad 0 \leq t < T$$

The importance sampling ratio here starts and ends one step later than for n-step TD. This is because here we are updating a state-action pair: the first action $A_t$ has already been taken, so importance sampling is only needed for the subsequent actions.
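A sketch of the importance sampling ratio and the two off-policy updates, assuming `pi` and `b` are arrays of action probabilities and `states`/`actions` are the stored trajectory of one episode (illustrative names only, not the book's notation):

```python
def importance_ratio(pi, b, states, actions, t, h, T):
    """rho_{t:h} = prod_{k=t}^{min(h, T-1)} pi(A_k | S_k) / b(A_k | S_k).

    pi, b -- target and behavior policies, here as (n_states x n_actions)
             arrays of action probabilities; T is the (integer) episode length
    """
    rho = 1.0
    for k in range(t, min(h, T - 1) + 1):
        rho *= pi[states[k], actions[k]] / b[states[k], actions[k]]
    return rho

def off_policy_td_update(V, tau, n, G, states, actions, pi, b, T, alpha):
    """Off-policy n-step TD: the ratio covers A_tau .. A_{tau+n-1}."""
    rho = importance_ratio(pi, b, states, actions, tau, tau + n - 1, T)
    V[states[tau]] += alpha * rho * (G - V[states[tau]])

def off_policy_sarsa_update(Q, tau, n, G, states, actions, pi, b, T, alpha):
    """Off-policy n-step Sarsa: the ratio starts and ends one step later
    (A_{tau+1} .. A_{tau+n}), since A_tau has already been taken."""
    rho = importance_ratio(pi, b, states, actions, tau + 1, tau + n, T)
    s, a = states[tau], actions[tau]
    Q[s, a] += alpha * rho * (G - Q[s, a])
```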

The pseudocode for the off-policy version of n-step Sarsa is given below.
[Figure: pseudocode for off-policy n-step Sarsa]

Off-policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm

[Figure: the 3-step tree-backup diagram]
Down the central spine of the diagram are three sample states and rewards, and two sample actions; these are the events that occur after the initial state-action pair $(S_t, A_t)$. Hanging off to the sides are the actions that were not selected.

Because we have no sample data for the unselected actions, we bootstrap and use the estimates of their values in forming the target for the update.

In the tree-backup update, the target includes the rewards along the way, the estimated values of the nodes at the bottom, and the estimated values of the dangling action nodes hanging off the sides at all levels.

The action nodes in the interior, corresponding to the actions actually taken, do not contribute their estimated values to the target, because the rewards that actually followed them are already included.

Each leaf node contributes to the target with a weight proportional to its probability of occurring under the target policy $\pi$.

So, each first-level action $a$ contributes with a weight of $\pi(a \mid S_{t+1})$, except that the action actually taken, $A_{t+1}$, does not contribute at all; instead its probability, $\pi(A_{t+1} \mid S_{t+1})$, is used to weight all the second-level action values.
Each non-selected second-level action $a'$ contributes with weight $\pi(A_{t+1} \mid S_{t+1})\pi(a' \mid S_{t+2})$.
Each non-selected third-level action $a''$ contributes with weight $\pi(A_{t+1} \mid S_{t+1})\pi(A_{t+2} \mid S_{t+2})\pi(a'' \mid S_{t+3})$.

The following are the detailed equations for the n-step tree-backup algorithm:

The one-step return is the same as that of Expected Sarsa:
$$G_{t:t+1} = R_{t+1} + \gamma \sum_{a} \pi(a \mid S_{t+1}) Q_t(S_{t+1}, a), \qquad t < T-1$$

The two-step tree-backup return (for $t < T-1$) is:
$$
\begin{aligned}
G_{t:t+2} &= R_{t+1} + \gamma \sum_{a \neq A_{t+1}} \pi(a \mid S_{t+1}) Q_{t+1}(S_{t+1}, a) + \gamma \pi(A_{t+1} \mid S_{t+1}) \Bigl( R_{t+2} + \gamma \sum_{a} \pi(a \mid S_{t+2}) Q_{t+1}(S_{t+2}, a) \Bigr) \\
&= R_{t+1} + \gamma \sum_{a \neq A_{t+1}} \pi(a \mid S_{t+1}) Q_{t+1}(S_{t+1}, a) + \gamma \pi(A_{t+1} \mid S_{t+1}) G_{t+1:t+2}
\end{aligned}
$$

The latter form suggests the general recursive definition of the n-step tree-backup return:
$$G_{t:t+n} = R_{t+1} + \gamma \sum_{a \neq A_{t+1}} \pi(a \mid S_{t+1}) Q_{t+n-1}(S_{t+1}, a) + \gamma \pi(A_{t+1} \mid S_{t+1}) G_{t+1:t+n}, \qquad t < T-1,\ n \geq 2$$

This target is then used with the usual action-value update rule from n-step Sarsa:
$$Q_{t+n}(S_t, A_t) \doteq Q_{t+n-1}(S_t, A_t) + \alpha \bigl[ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) \bigr], \qquad 0 \leq t < T$$
while the values of all other state-action pairs remain unchanged.
Its pseudocode is:
[Figure: pseudocode for n-step Tree Backup]
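A recursive Python sketch of the tree-backup return, following the recursive definition above (the trajectory arrays and the `pi` probability table are hypothetical illustration names):

```python
def tree_backup_return(t, n, T, states, actions, rewards, Q, pi, gamma):
    """Recursive n-step tree-backup return G_{t:t+n} (sketch).

    states[k], actions[k], rewards[k] hold S_k, A_k, R_k from one episode;
    Q is a (n_states x n_actions) array and pi[s, a] = pi(a | s).
    """
    if t + 1 >= T:
        return rewards[t + 1]                # G_{T-1:t+n} = R_T
    s_next, a_taken = states[t + 1], actions[t + 1]
    # Expected value of the actions NOT taken at S_{t+1}
    not_taken = sum(pi[s_next, a] * Q[s_next, a]
                    for a in range(Q.shape[1]) if a != a_taken)
    if n == 1:
        # One-step (Expected Sarsa) target: all actions contribute
        return rewards[t + 1] + gamma * (
            not_taken + pi[s_next, a_taken] * Q[s_next, a_taken])
    # The taken action's probability weights the deeper return instead
    return (rewards[t + 1]
            + gamma * not_taken
            + gamma * pi[s_next, a_taken]
            * tree_backup_return(t + 1, n - 1, T, states, actions,
                                 rewards, Q, pi, gamma))
```

At each level only the actions not actually taken contribute their current estimates; the action actually taken passes its probability weight down to the next level of the recursion.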

References

[1] Sutton, R. S., and Barto, A. G. Reinforcement Learning: An Introduction.
