Chapters 6 & 7: Temporal-Difference Learning

1 Introduction

Temporal-difference (TD) learning is the central and novel idea of reinforcement learning. It combines Monte Carlo ideas with dynamic programming (DP) ideas: like Monte Carlo methods, TD can learn directly from raw experience without a model of the environment; like DP, it bootstraps, updating estimates in part from other learned estimates.

Compared with DP methods:

TD does not require a model of the environment, i.e., of its reward and next-state probability distributions.

Compared with Monte Carlo methods:

  • TD methods are naturally implemented in an online, fully incremental fashion. With Monte Carlo methods one must wait until the end of an episode, because only then is the return known, whereas with n-step TD methods one need wait only n time steps.
  • Monte Carlo methods must ignore or discount episodes on which experimental actions are taken, which can greatly slow learning. TD methods are much less susceptible to these problems because they learn from each transition regardless of what subsequent actions are taken.
  • Under batch updating, TD(0) converges deterministically to a single answer independent of the step-size parameter $\alpha$, as long as $\alpha$ is chosen to be sufficiently small. The constant-$\alpha$ MC method also converges deterministically under the same conditions, but to a different answer.
  • Batch Monte Carlo methods always find the estimates that minimize mean-squared error on the training set (the sample averages of the actual returns), whereas batch TD(0) always finds the estimates that would be exactly correct for the maximum-likelihood model of the Markov process. In general, the maximum-likelihood estimate of a parameter is the parameter value whose probability of generating the data is greatest; the value estimate computed from that model is called the certainty-equivalence estimate. Although the nonbatch methods achieve neither the certainty-equivalence nor the minimum squared-error estimates, they can be understood as moving roughly in those directions. TD methods may be the only feasible way of approximating the certainty-equivalence solution.
  • A larger n yields an update target with higher variance but less bias.

Convergence

For any fixed policy $\pi$, TD(0) has been proved to converge to $v_\pi$: in the mean for a constant step-size parameter if it is sufficiently small, and with probability 1 if the step-size parameter decreases according to the usual stochastic approximation conditions (2.7).

At the current time this is an open question in the sense that no one has been able to prove mathematically that one method converges faster than the other. In fact, it is not even clear what is the most appropriate formal way to phrase this question! In practice, however, TD methods have usually been found to converge faster than constant-$\alpha$ MC methods on stochastic tasks.
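
To make the prediction setting concrete, here is a minimal tabular TD(0) sketch (my own illustration rather than the book's boxed pseudocode); the Gym-style `env.reset()` / `env.step()` interface and the fixed `policy` callable are assumptions.

```python
import collections

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0): estimate v_pi for a fixed policy pi.

    Assumes env.reset() -> state and env.step(action) -> (next_state, reward, done).
    """
    V = collections.defaultdict(float)          # value estimates, initialized to 0
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)                       # action sampled from the fixed policy
            s_next, r, done = env.step(a)
            target = r if done else r + gamma * V[s_next]
            V[s] += alpha * (target - V[s])     # move V(s) toward the bootstrapped target
            s = s_next
    return V
```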

2 n-step TD return and update

n-step TD changes an earlier estimate based on how it differs from a later estimate. Now the later estimate is not one step later, but n steps later.

  • Sequence:
    $S_0, A_0, R_1, S_1, A_1, R_2, S_2, \dots, S_{T-1}, A_{T-1}, R_T, S_T$

  • n-step return:
    $$G_{t:t+n}=R_{t+1}+\gamma R_{t+2}+\dots+\gamma^{n-1} R_{t+n}+\gamma^n V_{t+n-1}(S_{t+n}) \tag{7.1}$$
    $$G_{t:t+n}=R_{t+1}+\gamma R_{t+2}+\dots+\gamma^{n-1} R_{t+n}+\gamma^n Q_{t+n-1}(S_{t+n},A_{t+n}) \tag{7.4}$$

  • Update value function:
    $$V_{t+n}(S_t)=V_{t+n-1}(S_t)+\alpha\left[G_{t:t+n}-V_{t+n-1}(S_t)\right], \quad 0 \leq t<T \tag{7.2}$$
    $$Q_{t+n}(S_t,A_t)=Q_{t+n-1}(S_t,A_t)+\alpha\left[G_{t:t+n}-Q_{t+n-1}(S_t,A_t)\right], \quad 0 \leq t<T \tag{7.5}$$

  • Off-policy update:
    $$V_{t+n}(S_t)=V_{t+n-1}(S_t)+\alpha\, \rho_{t:t+n-1}\left[G_{t:t+n}-V_{t+n-1}(S_t)\right], \quad 0 \leq t<T \tag{7.9}$$
    $$Q_{t+n}(S_t,A_t)=Q_{t+n-1}(S_t,A_t)+\alpha\, \rho_{t+1:t+n}\left[G_{t:t+n}-Q_{t+n-1}(S_t,A_t)\right], \quad 0 \leq t<T \tag{7.11}$$
    $$\rho_{t:h}=\prod_{k=t}^{\min(h,T-1)} \frac{\pi(A_k|S_k)}{b(A_k|S_k)} \tag{7.10}$$

The state visited at time step $t$ (i.e., $S_t$) is updated at time step $t+n$ (producing $V_{t+n}$).

Note that no changes at all are made during the first n-1 steps of each episode. To make up for that, an equal number of additional updates are made at the end of the episode, after termination and before starting the next episode.
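
The bookkeeping described above (no updates for the first n−1 steps, extra updates after termination) can be seen in a condensed sketch of tabular n-step TD prediction following (7.1) and (7.2); the environment and policy interface is the same assumption as in the TD(0) sketch.

```python
import collections

def n_step_td_prediction(env, policy, num_episodes, n=4, alpha=0.1, gamma=1.0):
    """Tabular n-step TD prediction (cf. eqs. 7.1 and 7.2)."""
    V = collections.defaultdict(float)
    for _ in range(num_episodes):
        states, rewards = [env.reset()], [0.0]   # rewards[k] holds R_k; R_0 is a dummy entry
        T, t = float('inf'), 0
        while True:
            if t < T:
                s_next, r, done = env.step(policy(states[t]))
                states.append(s_next)
                rewards.append(r)
                if done:
                    T = t + 1                    # episode length now known
            tau = t - n + 1                      # time whose state estimate is updated now
            if tau >= 0:
                # n-step return G_{tau:tau+n}, truncated at the end of the episode
                G = sum(gamma ** (k - tau - 1) * rewards[k]
                        for k in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:
                    G += gamma ** n * V[states[tau + n]]
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:                     # the extra post-termination updates end here
                break
            t += 1
    return V
```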

Error reduction property of n-step returns
The n-step return uses the value function $V_{t+n-1}$ to correct for the missing rewards beyond $R_{t+n}$. An important property of n-step returns is that their expectation is guaranteed to be a better estimate of $v_\pi$ than $V_{t+n-1}$ is, in a worst-state sense. All n-step TD methods converge to the correct predictions under appropriate technical conditions.
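
Written out, the error reduction property is the bound (reconstructed from memory of the book's statement, so treat the exact form as a paraphrase), for all $n \ge 1$:

$$\max_s \Big| \mathbb{E}_\pi\big[G_{t:t+n} \mid S_t = s\big] - v_\pi(s) \Big| \;\le\; \gamma^n \max_s \big| V_{t+n-1}(s) - v_\pi(s) \big|$$

The factor $\gamma^n$ is what makes the expected n-step return a strictly better estimate than the current value function whenever $\gamma < 1$.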

3 One-step Control

3.1 (One-step on-policy) Sarsa

$$Q(S_t,A_t) \leftarrow Q(S_t,A_t)+\alpha\left[R_{t+1}+\gamma Q(S_{t+1},A_{t+1})-Q(S_t,A_t)\right]$$
Sarsa estimates the action-value function of the ($\varepsilon$-)greedy policy that it itself follows.
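
A minimal tabular Sarsa sketch under the same assumed environment interface; `epsilon_greedy` is a hypothetical helper that returns a random action with probability ε and a greedy one otherwise.

```python
import random
import collections

def epsilon_greedy(Q, s, actions, eps=0.1):
    # Hypothetical helper: exploratory action with probability eps, otherwise greedy w.r.t. Q
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa(env, actions, num_episodes, alpha=0.5, gamma=1.0, eps=0.1):
    Q = collections.defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, actions, eps)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(Q, s_next, actions, eps)  # next action chosen before the update
            target = r if done else r + gamma * Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```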

3.2 (One-step off-policy Sarsa) Q-learning

$$Q(S_t,A_t) \leftarrow Q(S_t,A_t)+\alpha\left[R_{t+1}+\gamma \max_a Q(S_{t+1},a)-Q(S_t,A_t)\right]$$
Actions are taken with an $\varepsilon$-greedy policy, but $Q$ is updated as if the greedy policy were followed.
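
The corresponding Q-learning sketch differs only in the target and in when the next action is chosen; it reuses the hypothetical `epsilon_greedy` helper from the Sarsa sketch.

```python
import collections

def q_learning(env, actions, num_episodes, alpha=0.5, gamma=1.0, eps=0.1):
    Q = collections.defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(Q, s, actions, eps)            # behave eps-greedily
            s_next, r, done = env.step(a)
            best_next = max(Q[(s_next, b)] for b in actions)  # evaluate with the greedy policy
            target = r if done else r + gamma * best_next
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```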


Q-learning vs Sarsa?

  • Suppose action selection is greedy. Is Q-learning then exactly the same algorithm as Sarsa? Will they make exactly the same action selections and weight updates?
  • No. Sarsa selects the next action $A_{t+1}$ before updating $Q(S_t,A_t)$, whereas Q-learning performs the update first and selects its next action afterwards, from the updated $Q$. When $S_{t+1}=S_t$, the update can change which action is greedy, so the action selections and weight updates can differ (see the sketch below).
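
A stripped-down sketch of a single step of each inner loop, with greedy action selection and a hypothetical `greedy` helper, shows where the selection happens relative to the update.

```python
def greedy(Q, s, actions):
    # Hypothetical helper: argmax_a Q(s, a)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa_step(Q, env, s, a, actions, alpha, gamma):
    s_next, r, done = env.step(a)
    a_next = greedy(Q, s_next, actions)        # next action chosen BEFORE the update
    target = r if done else r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return s_next, a_next, done

def q_learning_step(Q, env, s, actions, alpha, gamma):
    a = greedy(Q, s, actions)                  # action for the current state
    s_next, r, done = env.step(a)
    target = r if done else r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])  # next action is chosen only on the next call,
    return s_next, done                        # i.e. AFTER this update (matters if s_next == s)
```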

3.3 (one-step, on or off policy) Expected Sarsa

$$Q(S_t,A_t) \leftarrow Q(S_t,A_t)+\alpha\left[R_{t+1}+\gamma \sum_a \pi(a|S_{t+1})\, Q(S_{t+1},a)-Q(S_t,A_t)\right]$$
Suppose $\pi$ is the greedy policy while the behavior policy is more exploratory; then Expected Sarsa is exactly Q-learning.
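
A minimal sketch of the Expected Sarsa target for an ε-greedy target policy; the tabular `Q` dictionary and the `eps` parameter are illustrative assumptions.

```python
def expected_sarsa_target(Q, s_next, r, actions, gamma=1.0, eps=0.1):
    """Expected Sarsa target r + gamma * E_pi[Q(s_next, .)] for an eps-greedy target policy pi."""
    q_vals = [Q[(s_next, a)] for a in actions]
    greedy_q = max(q_vals)
    # eps-greedy probabilities: eps/|A| on every action, plus an extra (1 - eps) on the greedy one
    expected_q = (eps / len(actions)) * sum(q_vals) + (1.0 - eps) * greedy_q
    return r + gamma * expected_q
```

With `eps=0` the expectation puts all weight on the greedy action and the target coincides with the Q-learning target, which is the subsumption noted below.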

Expected Sarsa subsumes and generalizes Q-learning while reliably improving over Sarsa.


3.4 Issues of Maximization Bias & Afterstates

3.4.1 Maximization Bias and Double Learning

In these algorithms, a maximum over estimated values is used implicitly as an estimate of the maximum value, which can lead to a significant positive bias.

To see why, consider a single state s where there are many actions a whose true values, q(s, a), are all zero but whose estimated values, Q(s, a), are uncertain and thus distributed some above and some below zero. The maximum of the true values is zero, but the maximum of the estimates is positive, a positive bias. We call this maximization bias.

One way to view the problem is that it is due to using the same samples (plays) both to determine the maximizing action and to estimate its value.

Double learning
$$Q_1(S_t,A_t) \leftarrow Q_1(S_t,A_t)+\alpha\left[R_{t+1}+\gamma\, Q_2\!\left(S_{t+1}, \operatorname{argmax}_a Q_1(S_{t+1},a)\right)-Q_1(S_t,A_t)\right]$$
On each step one of the two estimates is chosen at random (e.g., with probability 0.5) to be updated, with the other estimate used to evaluate the maximizing action; the symmetric update swaps the roles of $Q_1$ and $Q_2$.
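
A minimal tabular double Q-learning sketch (my own illustration, reusing the assumed environment interface; behaving ε-greedily with respect to $Q_1+Q_2$ is one common choice, not the only one).

```python
import random
import collections

def double_q_learning(env, actions, num_episodes, alpha=0.5, gamma=1.0, eps=0.1):
    Q1 = collections.defaultdict(float)
    Q2 = collections.defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # Behave eps-greedily with respect to Q1 + Q2
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q1[(s, x)] + Q2[(s, x)])
            s_next, r, done = env.step(a)
            # Randomly pick which estimate to update; the other one evaluates the argmax action
            A, B = (Q1, Q2) if random.random() < 0.5 else (Q2, Q1)
            a_star = max(actions, key=lambda x: A[(s_next, x)])
            target = r if done else r + gamma * B[(s_next, a_star)]
            A[(s, a)] += alpha * (target - A[(s, a)])
            s = s_next
    return Q1, Q2
```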

3.4.2 Afterstates

In this book we try to present a uniform approach to solving tasks, but sometimes more specific methods can do much better.

We introduce the idea of afterstates. Afterstates are relevant when the agent can deterministically change some aspect of the environment. In these cases it is better to value the resulting state of the environment, after the agent has acted and before any stochasticity, as this can reduce computation and speed convergence.

Take chess as an example. One should choose as states the board positions after the agent has taken a move, rather than before. This is because multiple states at time t can lead, via deterministic actions of the agent, to the same board position that the opponent sees at t+1 (assuming we move second).
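
A minimal sketch of move selection with afterstate values rather than action values; the deterministic board transition `make_move` and the afterstate-value table `V` are hypothetical.

```python
import random

def choose_move(board, legal_moves, V, make_move, eps=0.1):
    """Pick a move by valuing the afterstate it deterministically produces."""
    if random.random() < eps:                       # occasional exploration
        return random.choice(legal_moves)
    # Different (board, move) pairs that produce the same afterstate share one value,
    # which is what makes learning over afterstates more data-efficient here.
    return max(legal_moves, key=lambda m: V[make_move(board, m)])
```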

4 n-step Control

Off-policy n-step Sarsa

Update of the state-value function:
$$V_{t+n}(S_t)=V_{t+n-1}(S_t)+\alpha\, \rho_{t:t+n-1}\left[G_{t:t+n}-V_{t+n-1}(S_t)\right], \quad 0\leq t<T \tag{7.9}$$
Importance sampling ratio:
$$\rho_{t:h}=\prod_{k=t}^{\min(h,T-1)}\frac{\pi(A_k|S_k)}{b(A_k|S_k)} \tag{7.10}$$
n-step Sarsa update:
$$Q_{t+n}(S_t,A_t)=Q_{t+n-1}(S_t,A_t)+\alpha\, \rho_{t+1:t+n}\left[G_{t:t+n}-Q_{t+n-1}(S_t,A_t)\right], \quad 0\leq t<T \tag{7.11}$$
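
A condensed sketch of off-policy n-step Sarsa along the lines of (7.4), (7.10), and (7.11); `behavior(Q, s)` returning `(action, b_prob)` and `pi_prob(Q, s, a)` are hypothetical interfaces for the behavior and target policies.

```python
import collections

def off_policy_n_step_sarsa(env, behavior, pi_prob, num_episodes,
                            n=4, alpha=0.1, gamma=1.0):
    """Sketch of off-policy n-step Sarsa with per-decision probabilities supplied by the caller."""
    Q = collections.defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        a, b_prob = behavior(Q, s)
        states, acts, bprobs, rewards = [s], [a], [b_prob], [0.0]
        T, t = float('inf'), 0
        while True:
            if t < T:
                s_next, r, done = env.step(acts[t])
                states.append(s_next)
                rewards.append(r)
                if done:
                    T = t + 1
                else:
                    a_next, b_prob = behavior(Q, s_next)
                    acts.append(a_next)
                    bprobs.append(b_prob)
            tau = t - n + 1
            if tau >= 0:
                # Importance ratio rho_{tau+1 : tau+n}  (cf. 7.10, 7.11)
                rho = 1.0
                for k in range(tau + 1, min(tau + n, T - 1) + 1):
                    rho *= pi_prob(Q, states[k], acts[k]) / bprobs[k]
                # n-step return G_{tau:tau+n} built from Q  (cf. 7.4)
                G = sum(gamma ** (k - tau - 1) * rewards[k]
                        for k in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:
                    G += gamma ** n * Q[(states[tau + n], acts[tau + n])]
                sa = (states[tau], acts[tau])
                Q[sa] += alpha * rho * (G - Q[sa])
            if tau == T - 1:
                break
            t += 1
    return Q
```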


5 Off-policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm

Unlike Monte Carlo methods, the tree-backup update does not sample the untaken actions. Each leaf node contributes to the target with a weight proportional to its probability of occurring under the target policy $\pi$. Thus each first-level action $a$ contributes with a weight of $\pi(a|S_{t+1})$, except that the action actually taken, $A_{t+1}$, does not contribute at all; its probability, $\pi(A_{t+1}|S_{t+1})$, is instead used to weight all the second-level action values.
Thus, each non-selected second-level action $a'$ contributes with weight $\pi(A_{t+1}|S_{t+1})\,\pi(a'|S_{t+2})$, and each third-level action contributes with weight $\pi(A_{t+1}|S_{t+1})\,\pi(A_{t+2}|S_{t+2})\,\pi(a''|S_{t+3})$.
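
Written recursively, the n-step tree-backup return that serves as the update target takes the form (reconstructed from the book's definition; for $t < T-1$, with the one-step case reducing to the Expected Sarsa target):

$$G_{t:t+n} = R_{t+1} + \gamma \sum_{a \neq A_{t+1}} \pi(a|S_{t+1})\, Q_{t+n-1}(S_{t+1}, a) + \gamma\, \pi(A_{t+1}|S_{t+1})\, G_{t+1:t+n}$$

Because only target-policy probabilities appear in the weights, no importance sampling ratio is needed.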

6 Question

For the tree-backup algorithm and the generalized n-step Q algorithms, only one action is actually taken at each step. Can we take multiple actions at a time? Would that be more accurate?
