Chapters 6 & 7: Temporal-Difference Learning

1 Introduction

Temporal-difference (TD) learning is the central and novel idea of reinforcement learning. It combines Monte Carlo ideas with dynamic programming (DP) ideas: like Monte Carlo methods, TD can learn directly from raw experience without a model of the environment; like DP, it bootstraps, updating estimates in part from other learned estimates.

Compared with DP methods:

TD does not require a model of the environment, i.e., of its reward and next-state probability distributions.

Compared with Monte Carlo methods:

  • TD methods are naturally implemented in an online, fully incremental fashion. With Monte Carlo methods one must wait until the end of an episode, because only then is the return known, whereas with n-step TD methods one need wait only n time steps.
  • Monte Carlo methods must ignore or discount episodes on which experimental actions are taken, which can greatly slow learning. TD methods are much less susceptible to these problems because they learn from each transition regardless of what subsequent actions are taken.
  • Under batch updating, TD(0) converges deterministically to a single answer independent of the step-size parameter $\alpha$, as long as $\alpha$ is chosen to be sufficiently small. The constant-$\alpha$ MC method also converges deterministically under the same conditions, but to a different answer.
  • Batch Monte Carlo methods always find the estimates that minimize mean-squared error on the training set (the sample averages of the actual returns), whereas batch TD(0) always finds the estimates that would be exactly correct for the maximum-likelihood model of the Markov process. In general, the maximum-likelihood estimate of a parameter is the parameter value whose probability of generating the data is greatest; the value estimate computed from that model is called the certainty-equivalence estimate. Although the nonbatch methods achieve neither the certainty-equivalence nor the minimum squared-error estimates, they can be understood as moving roughly in those directions. TD methods may be the only feasible way of approximating the certainty-equivalence solution.
  • A larger n yields an update target with higher variance but less bias.

Convergence

For any fixed policy $\pi$, TD(0) has been proved to converge to $v_\pi$: in the mean for a constant step-size parameter if it is sufficiently small, and with probability 1 if the step-size parameter decreases according to the usual stochastic approximation conditions (2.7).

At the current time this is an open question in the sense that no one has been able to prove mathematically that one method converges faster than the other. In fact, it is not even clear what is the most appropriate formal way to phrase this question! In practice, however, TD methods have usually been found to converge faster than constant-$\alpha$ MC methods on stochastic tasks.
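
To make the prediction setting concrete, here is a minimal tabular TD(0) sketch (my own illustration rather than the book's boxed pseudocode); the Gym-style `env.reset()` / `env.step()` interface and the fixed `policy` callable are assumptions.

```python
import collections

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0): estimate v_pi for a fixed policy pi.

    Assumes env.reset() -> state and env.step(action) -> (next_state, reward, done).
    """
    V = collections.defaultdict(float)          # value estimates, initialized to 0
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)                       # action sampled from the fixed policy
            s_next, r, done = env.step(a)
            target = r if done else r + gamma * V[s_next]
            V[s] += alpha * (target - V[s])     # move V(s) toward the bootstrapped target
            s = s_next
    return V
```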

2 n-step TD return and update

n-step TD changes an earlier estimate based on how it differs from a later estimate. Now the later estimate is not one step later, but n steps later.

  • Sequence:
    $S_0, A_0, R_1, S_1, A_1, R_2, S_2, \dots, S_{T-1}, A_{T-1}, R_T, S_T$

  • n-step return:
    $$G_{t:t+n}=R_{t+1}+\gamma R_{t+2}+\dots+\gamma^{n-1} R_{t+n}+\gamma^n V_{t+n-1}(S_{t+n}) \tag{7.1}$$
    $$G_{t:t+n}=R_{t+1}+\gamma R_{t+2}+\dots+\gamma^{n-1} R_{t+n}+\gamma^n Q_{t+n-1}(S_{t+n},A_{t+n}) \tag{7.4}$$

  • Update value function:
    $$V_{t+n}(S_t)=V_{t+n-1}(S_t)+\alpha\left[G_{t:t+n}-V_{t+n-1}(S_t)\right], \quad 0 \leq t<T \tag{7.2}$$
    $$Q_{t+n}(S_t,A_t)=Q_{t+n-1}(S_t,A_t)+\alpha\left[G_{t:t+n}-Q_{t+n-1}(S_t,A_t)\right], \quad 0 \leq t<T \tag{7.5}$$

  • Off-policy update:
    $$V_{t+n}(S_t)=V_{t+n-1}(S_t)+\alpha\, \rho_{t:t+n-1}\left[G_{t:t+n}-V_{t+n-1}(S_t)\right], \quad 0 \leq t<T \tag{7.9}$$
    $$Q_{t+n}(S_t,A_t)=Q_{t+n-1}(S_t,A_t)+\alpha\, \rho_{t+1:t+n}\left[G_{t:t+n}-Q_{t+n-1}(S_t,A_t)\right], \quad 0 \leq t<T \tag{7.11}$$
    $$\rho_{t:h}=\prod_{k=t}^{\min(h,T-1)} \frac{\pi(A_k|S_k)}{b(A_k|S_k)} \tag{7.10}$$

The state visited at time step $t$ (i.e., $S_t$) is updated at time step $t+n$ (producing $V_{t+n}$).

Note that no changes at all are made during the first n-1 steps of each episode. To make up for that, an equal number of additional updates are made at the end of the episode, after termination and before starting the next episode.
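
The bookkeeping described above (no updates for the first n−1 steps, extra updates after termination) can be seen in a condensed sketch of tabular n-step TD prediction following (7.1) and (7.2); the environment and policy interface is the same assumption as in the TD(0) sketch.

```python
import collections

def n_step_td_prediction(env, policy, num_episodes, n=4, alpha=0.1, gamma=1.0):
    """Tabular n-step TD prediction (cf. eqs. 7.1 and 7.2)."""
    V = collections.defaultdict(float)
    for _ in range(num_episodes):
        states, rewards = [env.reset()], [0.0]   # rewards[k] holds R_k; R_0 is a dummy entry
        T, t = float('inf'), 0
        while True:
            if t < T:
                s_next, r, done = env.step(policy(states[t]))
                states.append(s_next)
                rewards.append(r)
                if done:
                    T = t + 1                    # episode length now known
            tau = t - n + 1                      # time whose state estimate is updated now
            if tau >= 0:
                # n-step return G_{tau:tau+n}, truncated at the end of the episode
                G = sum(gamma ** (k - tau - 1) * rewards[k]
                        for k in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:
                    G += gamma ** n * V[states[tau + n]]
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:                     # the extra post-termination updates end here
                break
            t += 1
    return V
```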

Error reduction property of n-step returns
The n-step return uses the value function $V_{t+n-1}$ to correct for the missing rewards beyond $R_{t+n}$. An important property of n-step returns is that their expectation is guaranteed to be a better estimate of $v_\pi$ than $V_{t+n-1}$ is, in a worst-state sense. All n-step TD methods converge to the correct predictions under appropriate technical conditions.
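
Written out, the error reduction property is the bound (reconstructed from memory of the book's statement, so treat the exact form as a paraphrase), for all $n \ge 1$:

$$\max_s \Big| \mathbb{E}_\pi\big[G_{t:t+n} \mid S_t = s\big] - v_\pi(s) \Big| \;\le\; \gamma^n \max_s \big| V_{t+n-1}(s) - v_\pi(s) \big|$$

The factor $\gamma^n$ is what makes the expected n-step return a strictly better estimate than the current value function whenever $\gamma < 1$.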

3 One-step Control

3.1 (One-step on-policy) Sarsa

$$Q(S_t,A_t) \leftarrow Q(S_t,A_t)+\alpha\left[R_{t+1}+\gamma Q(S_{t+1},A_{t+1})-Q(S_t,A_t)\right]$$
Sarsa estimates the action-value function of the ($\varepsilon$-)greedy policy that it itself follows.
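
A minimal tabular Sarsa sketch under the same assumed environment interface; `epsilon_greedy` is a hypothetical helper that returns a random action with probability ε and a greedy one otherwise.

```python
import random
import collections

def epsilon_greedy(Q, s, actions, eps=0.1):
    # Hypothetical helper: exploratory action with probability eps, otherwise greedy w.r.t. Q
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa(env, actions, num_episodes, alpha=0.5, gamma=1.0, eps=0.1):
    Q = collections.defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, actions, eps)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(Q, s_next, actions, eps)  # next action chosen before the update
            target = r if done else r + gamma * Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```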

3.2 (One-step off-policy Sarsa) Q-learning

$$Q(S_t,A_t) \leftarrow Q(S_t,A_t)+\alpha\left[R_{t+1}+\gamma \max_a Q(S_{t+1},a)-Q(S_t,A_t)\right]$$
Actions are taken with an $\varepsilon$-greedy policy, but $Q$ is updated as if the greedy policy were followed.
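
The corresponding Q-learning sketch differs only in the target and in when the next action is chosen; it reuses the hypothetical `epsilon_greedy` helper from the Sarsa sketch.

```python
import collections

def q_learning(env, actions, num_episodes, alpha=0.5, gamma=1.0, eps=0.1):
    Q = collections.defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(Q, s, actions, eps)            # behave eps-greedily
            s_next, r, done = env.step(a)
            best_next = max(Q[(s_next, b)] for b in actions)  # evaluate with the greedy policy
            target = r if done else r + gamma * best_next
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```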


Q-learning vs Sarsa?

  • Suppose action selection is greedy. Is Q-learning then exactly the same algorithm as Sarsa? Will they make exactly the same action selections and weight updates?
  • No. Sarsa selects the next action $A_{t+1}$ before updating $Q(S_t,A_t)$, whereas Q-learning performs the update first and selects its next action afterwards, from the updated $Q$. When $S_{t+1}=S_t$, the update can change which action is greedy, so the action selections and weight updates can differ (see the sketch below).
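
A stripped-down sketch of a single step of each inner loop, with greedy action selection and a hypothetical `greedy` helper, shows where the selection happens relative to the update.

```python
def greedy(Q, s, actions):
    # Hypothetical helper: argmax_a Q(s, a)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa_step(Q, env, s, a, actions, alpha, gamma):
    s_next, r, done = env.step(a)
    a_next = greedy(Q, s_next, actions)        # next action chosen BEFORE the update
    target = r if done else r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return s_next, a_next, done

def q_learning_step(Q, env, s, actions, alpha, gamma):
    a = greedy(Q, s, actions)                  # action for the current state
    s_next, r, done = env.step(a)
    target = r if done else r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])  # next action is chosen only on the next call,
    return s_next, done                        # i.e. AFTER this update (matters if s_next == s)
```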

3.3 (one-step, on or off policy) Expected Sarsa

$$Q(S_t,A_t) \leftarrow Q(S_t,A_t)+\alpha\left[R_{t+1}+\gamma \sum_a \pi(a|S_{t+1})\, Q(S_{t+1},a)-Q(S_t,A_t)\right]$$
Suppose $\pi$ is the greedy policy while the behavior policy is more exploratory; then Expected Sarsa is exactly Q-learning.
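
A minimal sketch of the Expected Sarsa target for an ε-greedy target policy; the tabular `Q` dictionary and the `eps` parameter are illustrative assumptions.

```python
def expected_sarsa_target(Q, s_next, r, actions, gamma=1.0, eps=0.1):
    """Expected Sarsa target r + gamma * E_pi[Q(s_next, .)] for an eps-greedy target policy pi."""
    q_vals = [Q[(s_next, a)] for a in actions]
    greedy_q = max(q_vals)
    # eps-greedy probabilities: eps/|A| on every action, plus an extra (1 - eps) on the greedy one
    expected_q = (eps / len(actions)) * sum(q_vals) + (1.0 - eps) * greedy_q
    return r + gamma * expected_q
```

With `eps=0` the expectation puts all weight on the greedy action and the target coincides with the Q-learning target, which is the subsumption noted below.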

Expected Sarsa subsumes and generalizes Q-learning while reliably improving over Sarsa.


3.4 Issues of Maximization Bias & Afterstates

3.4.1 Maximization Bias and Double Learning

In these algorithms, a maximum over estimated values is used implicitly as an estimate of the maximum value, which can lead to a significant positive bias.

To see why, consider a single state s where there are many actions a whose true values, q(s, a), are all zero but whose estimated values, Q(s, a), are uncertain and thus distributed some above and some below zero. The maximum of the true values is zero, but the maximum of the estimates is positive, a positive bias. We call this maximization bias.

One way to view the problem is that it is due to using the same samples (plays) both to determine the maximizing action and to estimate its value.

Double learning
$$Q_1(S_t,A_t) \leftarrow Q_1(S_t,A_t)+\alpha\left[R_{t+1}+\gamma\, Q_2\!\left(S_{t+1}, \operatorname{argmax}_a Q_1(S_{t+1},a)\right)-Q_1(S_t,A_t)\right]$$
On each step one of the two estimates is chosen at random (e.g., with probability 0.5) to be updated, with the other estimate used to evaluate the maximizing action; the symmetric update swaps the roles of $Q_1$ and $Q_2$.
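
A minimal tabular double Q-learning sketch (my own illustration, reusing the assumed environment interface; behaving ε-greedily with respect to $Q_1+Q_2$ is one common choice, not the only one).

```python
import random
import collections

def double_q_learning(env, actions, num_episodes, alpha=0.5, gamma=1.0, eps=0.1):
    Q1 = collections.defaultdict(float)
    Q2 = collections.defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # Behave eps-greedily with respect to Q1 + Q2
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q1[(s, x)] + Q2[(s, x)])
            s_next, r, done = env.step(a)
            # Randomly pick which estimate to update; the other one evaluates the argmax action
            A, B = (Q1, Q2) if random.random() < 0.5 else (Q2, Q1)
            a_star = max(actions, key=lambda x: A[(s_next, x)])
            target = r if done else r + gamma * B[(s_next, a_star)]
            A[(s, a)] += alpha * (target - A[(s, a)])
            s = s_next
    return Q1, Q2
```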

3.4.2 Afterstates

In this book we try to present a uniform approach to solving tasks, but sometimes more specific methods can do much better.

We introduce the idea of afterstates. Afterstates are relevant when the agent can deterministically change some aspect of the environment. In these cases it is better to value the resulting state of the environment, after the agent has acted and before any stochasticity, as this can reduce computation and speed convergence.

Take chess as an example. One should choose as states the board positions after the agent has taken a move, rather than before. This is because multiple states at time t can lead, via deterministic actions of the agent, to the same board position that the opponent sees at t+1 (assuming we move second).
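
A minimal sketch of move selection with afterstate values rather than action values; the deterministic board transition `make_move` and the afterstate-value table `V` are hypothetical.

```python
import random

def choose_move(board, legal_moves, V, make_move, eps=0.1):
    """Pick a move by valuing the afterstate it deterministically produces."""
    if random.random() < eps:                       # occasional exploration
        return random.choice(legal_moves)
    # Different (board, move) pairs that produce the same afterstate share one value,
    # which is what makes learning over afterstates more data-efficient here.
    return max(legal_moves, key=lambda m: V[make_move(board, m)])
```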

4 n-step Control

Off-policy n-step Sarsa

Update of the state-value function:
$$V_{t+n}(S_t)=V_{t+n-1}(S_t)+\alpha\, \rho_{t:t+n-1}\left[G_{t:t+n}-V_{t+n-1}(S_t)\right], \quad 0\leq t<T \tag{7.9}$$
Importance sampling ratio:
$$\rho_{t:h}=\prod_{k=t}^{\min(h,T-1)}\frac{\pi(A_k|S_k)}{b(A_k|S_k)} \tag{7.10}$$
n-step Sarsa update:
$$Q_{t+n}(S_t,A_t)=Q_{t+n-1}(S_t,A_t)+\alpha\, \rho_{t+1:t+n}\left[G_{t:t+n}-Q_{t+n-1}(S_t,A_t)\right], \quad 0\leq t<T \tag{7.11}$$
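
A condensed sketch of off-policy n-step Sarsa along the lines of (7.4), (7.10), and (7.11); `behavior(Q, s)` returning `(action, b_prob)` and `pi_prob(Q, s, a)` are hypothetical interfaces for the behavior and target policies.

```python
import collections

def off_policy_n_step_sarsa(env, behavior, pi_prob, num_episodes,
                            n=4, alpha=0.1, gamma=1.0):
    """Sketch of off-policy n-step Sarsa with per-decision probabilities supplied by the caller."""
    Q = collections.defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        a, b_prob = behavior(Q, s)
        states, acts, bprobs, rewards = [s], [a], [b_prob], [0.0]
        T, t = float('inf'), 0
        while True:
            if t < T:
                s_next, r, done = env.step(acts[t])
                states.append(s_next)
                rewards.append(r)
                if done:
                    T = t + 1
                else:
                    a_next, b_prob = behavior(Q, s_next)
                    acts.append(a_next)
                    bprobs.append(b_prob)
            tau = t - n + 1
            if tau >= 0:
                # Importance ratio rho_{tau+1 : tau+n}  (cf. 7.10, 7.11)
                rho = 1.0
                for k in range(tau + 1, min(tau + n, T - 1) + 1):
                    rho *= pi_prob(Q, states[k], acts[k]) / bprobs[k]
                # n-step return G_{tau:tau+n} built from Q  (cf. 7.4)
                G = sum(gamma ** (k - tau - 1) * rewards[k]
                        for k in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:
                    G += gamma ** n * Q[(states[tau + n], acts[tau + n])]
                sa = (states[tau], acts[tau])
                Q[sa] += alpha * rho * (G - Q[sa])
            if tau == T - 1:
                break
            t += 1
    return Q
```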


5 Off-policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm

Unlike Monte Carlo methods, the tree-backup update does not sample the untaken actions. Each leaf node contributes to the target with a weight proportional to its probability of occurring under the target policy $\pi$. Thus each first-level action $a$ contributes with a weight of $\pi(a|S_{t+1})$, except that the action actually taken, $A_{t+1}$, does not contribute at all; its probability, $\pi(A_{t+1}|S_{t+1})$, is instead used to weight all the second-level action values.
Thus, each non-selected second-level action $a'$ contributes with weight $\pi(A_{t+1}|S_{t+1})\,\pi(a'|S_{t+2})$, and each third-level action contributes with weight $\pi(A_{t+1}|S_{t+1})\,\pi(A_{t+2}|S_{t+2})\,\pi(a''|S_{t+3})$.
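
Written recursively, the n-step tree-backup return that serves as the update target takes the form (reconstructed from the book's definition; for $t < T-1$, with the one-step case reducing to the Expected Sarsa target):

$$G_{t:t+n} = R_{t+1} + \gamma \sum_{a \neq A_{t+1}} \pi(a|S_{t+1})\, Q_{t+n-1}(S_{t+1}, a) + \gamma\, \pi(A_{t+1}|S_{t+1})\, G_{t+1:t+n}$$

Because only target-policy probabilities appear in the weights, no importance sampling ratio is needed.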

6 Question

For the tree-backup algorithm and the generalized n-step Q algorithms, only one action is actually taken at each step. Can we take multiple actions at a time? Would that be more accurate?
