n-step Bootstrapping: Part 1

Prediction

The n-step TD methods lie between MC and TD(0). They perform an update based on an intermediate number of rewards: more than one (as in TD(0)), but fewer than all of them up to termination (as in MC). Both MC and TD(0) are thus extreme cases of n-step TD.

In MC, the update occurs at the end of an episode, while in TD(0), the update occurs at the next time step.

Some backup diagrams of specific n-step methods are shown in the following figure:
[Figure: backup diagrams of n-step methods, from one-step TD up to Monte Carlo]
Notice that all the diagrams start and end with a state, because we estimate the state value $v_{\pi}(S)$. In the control part, we estimate the action value $q_{\pi}(S, A)$, whose diagrams all start and end with an action.

In Monte Carlo updates the target is the full return, whereas in one-step updates the target is the first reward plus the discounted estimated value of the next state.

n-step TD

The returns used as update targets:

MC uses the complete return:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots + \gamma^{T-t-1} R_T$$

one-step return:
$$G_{t:t+1} = R_{t+1} + \gamma V_t(S_{t+1})$$
where $V_t$ is the estimate of $v_{\pi}$ at time $t$.

two-step return:
$$G_{t:t+2} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V_{t+1}(S_{t+2})$$

n-step return:
$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} V_{t+n-1}(S_{t+n})$$
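To make the definition concrete, here is a minimal Python sketch (my own illustration, not something from the book) of computing the n-step return target from a window of observed rewards and a bootstrap value:

```python
def n_step_return(rewards, gamma, v_bootstrap=0.0):
    """G_{t:t+n} = R_{t+1} + gamma*R_{t+2} + ... + gamma^{n-1}*R_{t+n}
                   + gamma^n * V(S_{t+n}).

    rewards     -- [R_{t+1}, ..., R_{t+n}] observed after time t
    v_bootstrap -- V_{t+n-1}(S_{t+n}); pass 0.0 if the episode has already
                   terminated, which recovers the complete MC return
    """
    n = len(rewards)
    G = sum(gamma ** i * r for i, r in enumerate(rewards))   # discounted rewards
    return G + gamma ** n * v_bootstrap                      # bootstrap term

# e.g. a 3-step return with gamma = 0.9 and V(S_{t+3}) = 5.0:
print(n_step_return([1.0, 0.0, 2.0], gamma=0.9, v_bootstrap=5.0))
```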

Note that n-step returns for $n > 1$ involve future rewards and states that are not available at the time of transition from $t$ to $t+1$.

No real algorithm can use the n-step return until after it has seen $R_{t+n}$ and computed $V_{t+n-1}$. The first time these are available is $t+n$.

This also means that no updates at all are made during the first $n-1$ steps of each episode.
The natural state-value learning algorithm for using the n-step return is:
$$V_{t+n}(S_t) = V_{t+n-1}(S_t) + \alpha \bigl[ G_{t:t+n} - V_{t+n-1}(S_t) \bigr], \qquad 0 \leq t < T$$
This is the n-step TD algorithm. Note that it changes only the value of $S_t$; the values of all other states remain unchanged: $V_{t+n}(s) = V_{t+n-1}(s)$ for all $s \neq S_t$.

The complete pseudocode is given as:
[Figure: pseudocode for n-step TD for estimating $V \approx v_{\pi}$]
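In case the pseudocode image does not render, the following is a rough Python sketch of the same algorithm. The environment interface (`env.reset()`, `env.step(action)` returning `(next_state, reward, done)`) and the `policy(state)` function are placeholder assumptions for illustration, not part of the book's pseudocode:

```python
import numpy as np

def n_step_td_prediction(env, policy, n_states, n=4, alpha=0.1,
                         gamma=1.0, n_episodes=1000):
    """n-step TD for estimating V ~= v_pi (tabular sketch).

    env    -- hypothetical episodic environment: reset() -> state,
              step(action) -> (next_state, reward, done); states are ints
    policy -- function state -> action for the policy being evaluated
    """
    V = np.zeros(n_states)
    for _ in range(n_episodes):
        state = env.reset()
        states, rewards = [state], [0.0]          # S_0 stored; R_0 is unused
        T = float('inf')                          # episode length (unknown yet)
        t = 0
        while True:
            if t < T:
                action = policy(state)
                state, reward, done = env.step(action)
                states.append(state)              # S_{t+1}
                rewards.append(reward)            # R_{t+1}
                if done:
                    T = t + 1
            tau = t - n + 1                       # time whose value is updated
            if tau >= 0:
                # G_{tau:tau+n}: discounted rewards, then bootstrap if needed
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:
                    G += gamma ** n * V[states[tau + n]]
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return V
```

The update made at time $t$ targets the state visited at time $\tau = t - n + 1$, which is exactly why no updates happen during the first $n-1$ steps of an episode.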

The expectation of the n-step return is guaranteed to be a better estimate of $v_{\pi}$ than $V_{t+n-1}$ is, in a worst-state sense.
That is, the worst error of the expected n-step return is guaranteed to be less than or equal to $\gamma^n$ times the worst error under $V_{t+n-1}$:
$$\max_{s} \bigl| E_{\pi}[G_{t:t+n} \mid S_t = s] - v_{\pi}(s) \bigr| \leq \gamma^n \max_{s} \bigl| V_{t+n-1}(s) - v_{\pi}(s) \bigr|$$

The choice of n

[Figure: performance of n-step TD methods for a range of n, measured by average RMS error over the 19 states]
The figure above shows how the estimates behave for different values of $n$. The performance measure is the square root of the average squared error between the predictions at the end of the episode and the true values of the 19 states (the average RMS error); lower is better. From the figure, methods with an intermediate value of $n$ worked best, which implies that neither MC nor TD(0) is the best method.
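For reference, this error measure can be computed as below (a trivial sketch; `V` and `v_true` stand for hypothetical arrays of the 19 estimated and true state values):

```python
import numpy as np

def rms_error(V, v_true):
    """Root-mean-square error between estimated and true state values."""
    return float(np.sqrt(np.mean((np.asarray(V) - np.asarray(v_true)) ** 2)))
```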

Control

n-step Sarsa

The idea is to apply the n-step methods to control. As previously mentioned, the updates now focus on state-action pairs, and the action-value function $Q(s, a)$ takes the place of the state-value function $V(s)$.

The n-step return is redefined as:
$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} Q_{t+n-1}(S_{t+n}, A_{t+n}), \qquad n \geq 1,\ 0 \leq t < T-n$$
with $G_{t:t+n} = G_t$ if $t+n \geq T$.

And the update algorithm is then:
$$Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha \bigl[ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) \bigr], \qquad 0 \leq t < T$$

As in prediction, the values of all other state-action pairs remain unchanged: $Q_{t+n}(s,a) = Q_{t+n-1}(s, a)$ for all $s, a$ such that $s \neq S_t$ or $a \neq A_t$. This algorithm is called n-step Sarsa.
[Figure: pseudocode for n-step Sarsa]
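A rough Python sketch of n-step Sarsa with an ε-greedy policy is given below; as before, the `env` interface and the integer state/action encoding are assumptions for illustration, not part of the book's pseudocode:

```python
import numpy as np

def n_step_sarsa(env, n_states, n_actions, n=4, alpha=0.1, gamma=1.0,
                 epsilon=0.1, n_episodes=1000):
    """On-policy n-step Sarsa with an epsilon-greedy policy (tabular sketch).

    env -- hypothetical episodic environment: reset() -> state,
           step(action) -> (next_state, reward, done); states/actions are ints
    """
    Q = np.zeros((n_states, n_actions))

    def eps_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[s]))

    for _ in range(n_episodes):
        state = env.reset()
        states, actions, rewards = [state], [eps_greedy(state)], [0.0]
        T = float('inf')
        t = 0
        while True:
            if t < T:
                state, reward, done = env.step(actions[t])    # take A_t
                states.append(state)
                rewards.append(reward)
                if done:
                    T = t + 1
                else:
                    actions.append(eps_greedy(state))         # select A_{t+1}
            tau = t - n + 1
            if tau >= 0:
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:        # bootstrap on Q(S_{tau+n}, A_{tau+n})
                    G += gamma ** n * Q[states[tau + n], actions[tau + n]]
                s_tau, a_tau = states[tau], actions[tau]
                Q[s_tau, a_tau] += alpha * (G - Q[s_tau, a_tau])
            if tau == T - 1:
                break
            t += 1
    return Q
```

The only differences from the prediction sketch are that actions are stored as well and the bootstrap uses $Q(S_{\tau+n}, A_{\tau+n})$ instead of $V(S_{\tau+n})$.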
And the backup diagram is:
[Figure: backup diagrams for the spectrum of n-step Sarsa methods, including n-step Expected Sarsa]
The following figure shows how n-step Sarsa can speed up learning compared with one-step methods.
[Figure: gridworld example comparing the action values strengthened by one-step and 10-step Sarsa after one episode]
The first panel shows the complete path taken by the agent; G is the terminal position, and only there does the agent receive a positive reward. The arrows in the other two panels show which action values were strengthened as a result of this path by the one-step and 10-step Sarsa methods.

The one-step method strengthens only the last action of the sequence of actions that led to the high reward, whereas the n-step method strengthens the last $n$ actions of the sequence, so much more is learned from a single episode.

n-step Expected Sarsa

According to the backup diagram above, the n-step version of Expected Sarsa has the same form as n-step Sarsa, except that its last element is a branch over all action possibilities. The n-step return is therefore redefined as:
$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} \overline{V}_{t+n-1}(S_{t+n}), \qquad t < T-n$$
with $G_{t:t+n} = G_t$ for $t+n \geq T$, where $\overline{V}_{t+n-1}$ is the expected approximate value of state $s$, using the estimated action values at time $t$, under the target policy:

$$\overline{V}_t(s) = \sum_{a} \pi(a \mid s) Q_t(s, a), \qquad \text{for all } s \in \mathcal{S}$$

If $s$ is terminal, then its expected approximate value is defined to be 0.

$\overline{V}_t(s)$ weights all the possible actions of the state, which is why it is written in the form of a “V”.
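A small sketch of how $\overline{V}$ and the n-step Expected Sarsa target could be computed, assuming `Q` is a (states × actions) array and `pi` is an array of action probabilities (both hypothetical names used only for illustration):

```python
import numpy as np

def v_bar(Q, pi, s):
    """Expected approximate value of state s under the target policy:
    V_bar(s) = sum_a pi(a | s) * Q(s, a)."""
    return float(np.dot(pi[s], Q[s]))

def expected_sarsa_return(rewards, Q, pi, s_tpn, gamma):
    """n-step Expected Sarsa return
    G_{t:t+n} = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n V_bar(S_{t+n}).

    rewards -- [R_{t+1}, ..., R_{t+n}]
    s_tpn   -- S_{t+n}, or None if the episode terminated inside the window
    """
    n = len(rewards)
    G = sum(gamma ** i * r for i, r in enumerate(rewards))
    if s_tpn is not None:        # the expected value of a terminal state is 0
        G += gamma ** n * v_bar(Q, pi, s_tpn)
    return G
```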

n-step Off-policy Learning

Here a new concept appears: the importance sampling ratio, denoted $\rho_{t:t+n-1}$.

It is the relative probability under the two policies of taking the $n$ actions from $A_t$ to $A_{t+n-1}$:

$$\rho_{t:h} = \prod_{k=t}^{\min(h, T-1)} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$$

$\pi$ and $b$ in the formula above denote two different policies: the target policy and the behavior policy.

The off-policy version of n-step TD is:
$$V_{t+n}(S_t) = V_{t+n-1}(S_t) + \alpha \rho_{t:t+n-1} \bigl[ G_{t:t+n} - V_{t+n-1}(S_t) \bigr], \qquad 0 \leq t < T$$

To see this: if the two policies are actually the same (the on-policy case), then $\rho$ is always 1, so the update above generalizes, and can completely replace, the earlier n-step TD update. Similarly, the previous n-step Sarsa update can be completely replaced by an off-policy form:
$$Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha \rho_{t+1:t+n} \bigl[ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) \bigr], \qquad 0 \leq t < T$$

The importance sampling ratio here starts and ends one step later than for n-step TD. This is because here we are updating a state-action pair: the first action $A_t$ has already been taken, so importance sampling is only needed for the subsequent actions.
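A sketch of the importance sampling ratio and the two off-policy updates, assuming `pi` and `b` are arrays of action probabilities and `states`/`actions` are the stored trajectory of one episode (illustrative names only, not the book's notation):

```python
def importance_ratio(pi, b, states, actions, t, h, T):
    """rho_{t:h} = prod_{k=t}^{min(h, T-1)} pi(A_k | S_k) / b(A_k | S_k).

    pi, b -- target and behavior policies, here as (n_states x n_actions)
             arrays of action probabilities; T is the (integer) episode length
    """
    rho = 1.0
    for k in range(t, min(h, T - 1) + 1):
        rho *= pi[states[k], actions[k]] / b[states[k], actions[k]]
    return rho

def off_policy_td_update(V, tau, n, G, states, actions, pi, b, T, alpha):
    """Off-policy n-step TD: the ratio covers A_tau .. A_{tau+n-1}."""
    rho = importance_ratio(pi, b, states, actions, tau, tau + n - 1, T)
    V[states[tau]] += alpha * rho * (G - V[states[tau]])

def off_policy_sarsa_update(Q, tau, n, G, states, actions, pi, b, T, alpha):
    """Off-policy n-step Sarsa: the ratio starts and ends one step later
    (A_{tau+1} .. A_{tau+n}), since A_tau has already been taken."""
    rho = importance_ratio(pi, b, states, actions, tau + 1, tau + n, T)
    s, a = states[tau], actions[tau]
    Q[s, a] += alpha * rho * (G - Q[s, a])
```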

The pseudocode for the off-policy version of n-step Sarsa is given below.
[Figure: pseudocode for off-policy n-step Sarsa]

Off-policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm

[Figure: the 3-step tree-backup diagram]
Down the central spine of the diagram are three sample states and rewards, and two sample actions; these are the events that occur after the initial state-action pair $(S_t, A_t)$. Hanging off to the sides are the actions that were not selected.

Because we have no sample data for the unselected actions, we bootstrap and use the estimates of their values in forming the target for the update.

In the tree-backup update, the target includes the rewards along the way, the estimated values of the nodes at the bottom, and the estimated values of the dangling action nodes hanging off the sides at all levels.

The action nodes in the interior, corresponding to the actions actually taken, do not contribute their estimated values to the target, because the rewards that actually followed them are already included.

Each leaf node contributes to the target with a weight proportional to its probability of occurring under the target policy $\pi$.

So, each first-level action $a$ contributes with a weight of $\pi(a \mid S_{t+1})$, except that the action actually taken, $A_{t+1}$, does not contribute at all; instead its probability, $\pi(A_{t+1} \mid S_{t+1})$, is used to weight all the second-level action values.
Each non-selected second-level action $a'$ contributes with weight $\pi(A_{t+1} \mid S_{t+1})\pi(a' \mid S_{t+2})$.
Each non-selected third-level action $a''$ contributes with weight $\pi(A_{t+1} \mid S_{t+1})\pi(A_{t+2} \mid S_{t+2})\pi(a'' \mid S_{t+3})$.

The following are the detailed equations for the n-step tree-backup algorithm:

The one-step return is the same as that of Expected Sarsa:
$$G_{t:t+1} = R_{t+1} + \gamma \sum_{a} \pi(a \mid S_{t+1}) Q_t(S_{t+1}, a), \qquad t < T-1$$

The two-step tree-backup return (for $t < T-1$) is:
$$
\begin{aligned}
G_{t:t+2} &= R_{t+1} + \gamma \sum_{a \neq A_{t+1}} \pi(a \mid S_{t+1}) Q_{t+1}(S_{t+1}, a) + \gamma \pi(A_{t+1} \mid S_{t+1}) \Bigl( R_{t+2} + \gamma \sum_{a} \pi(a \mid S_{t+2}) Q_{t+1}(S_{t+2}, a) \Bigr) \\
&= R_{t+1} + \gamma \sum_{a \neq A_{t+1}} \pi(a \mid S_{t+1}) Q_{t+1}(S_{t+1}, a) + \gamma \pi(A_{t+1} \mid S_{t+1}) G_{t+1:t+2}
\end{aligned}
$$

The latter form suggests the general recursive definition of the n-step tree-backup return:
$$G_{t:t+n} = R_{t+1} + \gamma \sum_{a \neq A_{t+1}} \pi(a \mid S_{t+1}) Q_{t+n-1}(S_{t+1}, a) + \gamma \pi(A_{t+1} \mid S_{t+1}) G_{t+1:t+n}, \qquad t < T-1,\ n \geq 2$$

This target is then used with the usual action-value update rule from n-step Sarsa:
$$Q_{t+n}(S_t, A_t) \doteq Q_{t+n-1}(S_t, A_t) + \alpha \bigl[ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) \bigr], \qquad 0 \leq t < T$$
while the values of all other state-action pairs remain unchanged.
Its pseudocode is:
[Figure: pseudocode for n-step Tree Backup]
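A recursive Python sketch of the tree-backup return, following the recursive definition above (the trajectory arrays and the `pi` probability table are hypothetical illustration names):

```python
def tree_backup_return(t, n, T, states, actions, rewards, Q, pi, gamma):
    """Recursive n-step tree-backup return G_{t:t+n} (sketch).

    states[k], actions[k], rewards[k] hold S_k, A_k, R_k from one episode;
    Q is a (n_states x n_actions) array and pi[s, a] = pi(a | s).
    """
    if t + 1 >= T:
        return rewards[t + 1]                # G_{T-1:t+n} = R_T
    s_next, a_taken = states[t + 1], actions[t + 1]
    # Expected value of the actions NOT taken at S_{t+1}
    not_taken = sum(pi[s_next, a] * Q[s_next, a]
                    for a in range(Q.shape[1]) if a != a_taken)
    if n == 1:
        # One-step (Expected Sarsa) target: all actions contribute
        return rewards[t + 1] + gamma * (
            not_taken + pi[s_next, a_taken] * Q[s_next, a_taken])
    # The taken action's probability weights the deeper return instead
    return (rewards[t + 1]
            + gamma * not_taken
            + gamma * pi[s_next, a_taken]
            * tree_backup_return(t + 1, n - 1, T, states, actions,
                                 rewards, Q, pi, gamma))
```

At each level only the actions not actually taken contribute their current estimates; the action actually taken passes its probability weight down to the next level of the recursion.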

References

[1] Sutton, R. S., and Barto, A. G. Reinforcement Learning: An Introduction.
