Summary of "Reinforcement Learning: An Introduction", Chapter 6, "Temporal-Difference Learning"

A new student has joined our group and I need to walk him through the basics of RL, so we chose to start from Silver's course.

For myself, I am adding the requirement of carefully re-reading "Reinforcement Learning: An Introduction".

I did not read it very attentively before; this time I hope to do better, and to write a short summary of the corresponding material.




Note: this chapter covers model-free prediction and control. There are still two approaches, policy iteration and value iteration (the evaluation phase uses a model-free method, and the improvement phase uses a greedy method). This chapter mainly covers value-iteration-style methods based on TD learning.

In the model-free setting it is more common to estimate Q(S, A) directly, because even if V(S) has been estimated, without a model we still do not know how to select actions (how to derive a policy).




TD idea: sample one timestep, then estimate the expected return with [immediate reward + a bootstrap from the successor state's value]. (MC, by contrast, samples a whole episode and uses the complete return Gt.)




MC method: V(St) ← V(St) + α[Gt - V(St)]  => use one sample Gt to estimate the expectation.

TD(0) method: V(St) ← V(St) + α[Rt+1 + γV(St+1) - V(St)]  => use one sample Rt+1 + γV(St+1) to estimate the expectation; note that V(St+1) is itself an estimate.

TD target = Rt+1 + γV(St+1); TD error δt = Rt+1 + γV(St+1) - V(St); the MC error Gt - V(St) can be rewritten as a sum of TD errors:
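Written out (this is equation 6.6 in the book), assuming V is held fixed during the episode and using G_T = 0 and V(S_T) = 0 at the terminal time T:

```latex
\begin{aligned}
G_t - V(S_t) &= R_{t+1} + \gamma G_{t+1} - V(S_t)
                + \gamma V(S_{t+1}) - \gamma V(S_{t+1}) \\
             &= \delta_t + \gamma\bigl(G_{t+1} - V(S_{t+1})\bigr) \\
             &= \delta_t + \gamma\delta_{t+1}
                + \gamma^2\bigl(G_{t+2} - V(S_{t+2})\bigr) \\
             &= \sum_{k=t}^{T-1} \gamma^{\,k-t}\,\delta_k
\end{aligned}
```

where δk = Rk+1 + γV(Sk+1) - V(Sk) is the TD error at step k. If V changes during the episode (as it does under TD(0)), the identity holds only approximately.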




TD Prediction

    Generate St, At, Rt+1, St+1 under the current π.

    Then estimate with the TD(0) update [V(St) ← V(St) + α[Rt+1 + γV(St+1) - V(St)]].

    TD prediction is guaranteed to converge to Vπ and Qπ. Which of TD and MC converges faster has not been settled theoretically, but in practice TD methods often converge faster than constant-α MC methods on stochastic tasks.
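As a concrete illustration, here is a minimal tabular TD(0) evaluation loop. The environment interface (`reset`, `step`) and the `Walk` toy are invented for this sketch; they are not from the book or the lectures:

```python
import random
from collections import defaultdict

def td0_prediction(env, policy, episodes, alpha=0.1, gamma=1.0, seed=0):
    """Tabular TD(0) policy evaluation.  env is assumed to expose
    reset() -> state and step(action) -> (next_state, reward, done);
    this interface is a made-up convention for the sketch."""
    rng = random.Random(seed)
    V = defaultdict(float)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = policy(s, rng)
            s2, r, done = env.step(a)
            # TD target: immediate reward + bootstrapped value of the successor
            target = r + (0.0 if done else gamma * V[s2])
            V[s] += alpha * (target - V[s])   # move V(s) toward the TD target
            s = s2
    return V

class Walk:
    """Toy deterministic 3-state corridor; reward 1 on reaching the end."""
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        self.s += 1
        done = self.s == 3
        return self.s, (1.0 if done else 0.0), done

# With gamma = 1 and a single terminal reward of 1, the true value of every
# state is 1, and the estimates V[0], V[1], V[2] all approach 1.
V = td0_prediction(Walk(), policy=lambda s, rng: 0, episodes=200)
```

Note how the value of state 2 is learned first and then propagates backwards to states 1 and 0 through the bootstrap, one step per visit.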



Section 6.3 analyzes batch TD and batch MC and points out: Batch Monte Carlo methods always find the estimates that minimize mean-squared error on the training set, whereas batch TD(0) always finds the estimates that would be exactly correct for the maximum-likelihood model of the Markov process.

Given this model, we can compute the estimate of the value function
that would be exactly correct if the model were exactly correct. This is called the
certainty-equivalence estimate because it is equivalent to assuming that the estimate
of the underlying process was known with certainty rather than being approximated.
In general, batch TD(0) converges to the certainty-equivalence estimate.
 

This helps explain why TD methods converge more quickly than Monte Carlo
methods. In batch form, TD(0) is faster than Monte Carlo methods because it computes the true certainty-equivalence estimate.





Sarsa: On-Policy TD Control
    Q(St, At) ← Q(St, At) + α[Rt+1 + γQ(St+1, At+1) - Q(St, At)]

    As in all on-policy methods, we continually estimate qπ for the behavior policy π, and at the same time change π toward greediness with respect to qπ. The general form of the Sarsa control algorithm is given in the box in the book.


According to Satinder Singh (personal communication), Sarsa converges with probability 1 to an optimal policy and action-value function as long as all state-action pairs are visited an infinite number of times and the policy converges in the limit to the greedy policy (which can be arranged, for example, with ε-greedy policies by setting ε = 1/t), but this result has not yet been published in the literature.

    SARSA is on-policy because both the behavior policy and the target policy are the ε-greedy policy with respect to the current Q(S, A).
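A runnable sketch of tabular Sarsa on a toy problem. The `Chain` environment and its `reset`/`step`/`actions` interface are invented for illustration:

```python
import random
from collections import defaultdict

class Chain:
    """Toy corridor over states 0..3 (made up for this sketch): action +1
    moves right, -1 moves left, reward 1 on reaching state 3."""
    actions = [-1, 1]
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        self.s = max(0, min(3, self.s + a))
        done = self.s == 3
        return self.s, (1.0 if done else 0.0), done

def sarsa(env, episodes, alpha=0.1, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)

    def policy(s):
        # behavior policy == target policy: e-greedy w.r.t. the current Q
        if rng.random() < eps:
            return rng.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, done = env.reset(), False
        a = policy(s)
        while not done:
            s2, r, done = env.step(a)
            a2 = policy(s2)  # the action actually taken next: on-policy
            target = r + (0.0 if done else gamma * Q[(s2, a2)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
    return Q

Q = sarsa(Chain(), episodes=500)
```

The defining line is `a2 = policy(s2)`: the bootstrap uses the action the agent will actually take, which is what makes Sarsa on-policy.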





Q-learning: Off-Policy TD Control

    Q(St, At) ← Q(St, At) + α[Rt+1 + γ max_a Q(St+1, a) - Q(St, At)]

    Q-learning is off-policy because the target policy is π* (the update rule estimates Q*; more precisely, at each step the target policy is the greedy policy given the current action values, and note that the greedy policy, iterated to the end, is the optimal policy π*), while the behavior policy can be any policy (most commonly the ε-greedy policy with respect to the current Q(S, A)).

    Provided that 1) all state-action pairs continue to be updated, and 2) the sequence of step-size parameters satisfies a variant of the usual stochastic approximation conditions, Q has been shown to converge with probability 1 to q*.
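For contrast with the Sarsa loop, a tabular Q-learning sketch on the same kind of toy corridor (the environment interface is again invented for illustration):

```python
import random
from collections import defaultdict

class Chain:
    """Toy corridor over states 0..3, a made-up stand-in environment:
    +1 moves right, -1 moves left, reward 1 on reaching state 3."""
    actions = [-1, 1]
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        self.s = max(0, min(3, self.s + a))
        done = self.s == 3
        return self.s, (1.0 if done else 0.0), done

def q_learning(env, episodes, alpha=0.1, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # behavior policy: e-greedy w.r.t. the current Q
            if rng.random() < eps:
                a = rng.choice(env.actions)
            else:
                a = max(env.actions, key=lambda x: Q[(s, x)])
            s2, r, done = env.step(a)
            # target policy: greedy -- bootstrap from max_a Q(s', a),
            # regardless of which action will actually be taken next
            best_next = 0.0 if done else max(Q[(s2, x)] for x in env.actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q

Q = q_learning(Chain(), episodes=500)
```

The only structural difference from Sarsa is the target: `max_a Q(s', a)` instead of `Q(s', a')` for the action actually taken, which is exactly the off-policy part.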






Expected Sarsa: an improvement over SARSA.
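Expected Sarsa replaces the sampled next-action value in the Sarsa target with its expectation under the current policy, which removes the variance caused by the random selection of A_{t+1}:

```latex
Q(S_t, A_t) \leftarrow Q(S_t, A_t)
  + \alpha\Bigl[R_{t+1}
  + \gamma \sum_a \pi(a \mid S_{t+1})\, Q(S_{t+1}, a)
  - Q(S_t, A_t)\Bigr]
```

It is slightly more expensive per step (a sum over actions) but generally dominates Sarsa in performance.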


Maximization Bias and Double Learning:

When selecting actions, Q-learning and Sarsa commonly use the ε-greedy method, which involves a max_a operation, and this can lead to a significant positive bias. To see why, consider a single state s where there are many actions a whose true values, q(s, a), are all zero but whose estimated values, Q(s, a), are uncertain and thus distributed some above and some below zero. The maximum of the true values is zero, but the maximum of the estimates is positive, a positive bias. We call this maximization bias. Double learning addresses it by keeping two independent estimates: one selects the maximizing action and the other evaluates it.
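The bias is easy to see numerically. This minimal simulation (all numbers made up, stdlib only) draws noisy zero-mean estimates and averages their maximum:

```python
import random

def average_max_estimate(n_actions=10, n_samples=3, n_trials=2000, seed=0):
    """All true action values are 0; each Q(s, a) is the mean of a few noisy
    reward samples.  The max over the true values is 0, but the average of
    max_a Q(s, a) comes out clearly positive."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        estimates = [
            sum(rng.gauss(0.0, 1.0) for _ in range(n_samples)) / n_samples
            for _ in range(n_actions)
        ]
        total += max(estimates)   # the max operator picks the luckiest noise
    return total / n_trials

bias = average_max_estimate()   # clearly positive, even though every q(s, a) = 0
```

Fewer samples per estimate (larger noise) or more actions both make the bias worse, which matches the intuition that the max operator rewards estimation noise.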






The special cases of TD methods introduced in the present chapter should rightly be called one-step, tabular, model-free TD methods.

In the next two chapters we extend them to multistep forms (a link to Monte Carlo methods) and forms that include a model of the environment (a
link to planning and dynamic programming). Then, in the second part of the book we extend them to various forms of function approximation rather than tables (a link to deep learning and artificial neural networks).







Below is what I think you should know from Silver's lectures "Lecture 4, Model-Free Prediction" and "Lecture 5, Model-Free Control":

These two lectures cover the material that will actually come up again and again later, but there is not that much that has to be mastered; for most of it a passing familiarity is enough.
Specifically, you are required to master:
lecture 4: the difference between MC and TD
lecture 5: SARSA and Q-learning

lecture 4: mainly understand the ideas behind MC and TD
3: know model-free prediction => estimate the value function of an unknown MDP
Know the two families of methods: Monte-Carlo Learning (MC-Learning) and Temporal-Difference Learning (TD-Learning)
4: MC-Learning: run to the end of the episode, then estimate the expected return by the empirical mean return; => model-free, no bootstrapping (requires MDPs whose episodes clearly terminate)
6-7: First- and Every-Visit MC policy evaluation; by the law of large numbers, the empirical mean return approaches the expected return.
10-11: know that MC at its core weights every sample equally, but in the incremental update form the step size / learning rate (lr) on (Gt - V(St)) keeps shrinking, specifically lr_t = 1/N(St).
For non-stationary problems, lr_t = constant is actually more suitable, because a constant lr essentially means that rewards closer to the present contribute more to V(s) (you will come to appreciate this more deeply later). Real problems are more often non-stationary, so setting lr_t = constant is very common.
12: TD-Learning: take one step, then use the estimated value of the next state to estimate the value of the current state; => model-free, with bootstrapping.
13-20: comparison of MC and TD, including the update formulas, pros and cons, and the bias-variance trade-off (a very important concept in machine learning). Know what the TD target and TD error are.
MC has high variance, zero bias
Good convergence properties
(even with function approximation)
Not very sensitive to initial value
Very simple to understand and use
TD has low variance, some bias
Usually more efficient than MC
TD(0) converges to vπ(s)
(but not always with function approximation)
More sensitive to initial value
21-25: the Batch MC/TD example; slide 24 is important, it lets you understand the essential difference between MC and TD.
TD exploits Markov property
Usually more efficient in Markov environments
MC does not exploit Markov property
Usually more effective in non-Markov environments
26-30: just get a feel for these Unified Views
The material after slide 30 does not need to be read for now
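The point about the two step-size schedules in slides 10-11 can be seen in a few lines; the stream of returns below is invented for illustration:

```python
def incremental_mean(returns):
    # lr_t = 1/N(s): every sampled return gets equal weight (plain MC average)
    v, n = 0.0, 0
    for g in returns:
        n += 1
        v += (g - v) / n
    return v

def constant_alpha(returns, alpha=0.5):
    # lr_t = constant: recent returns dominate (exponential recency weighting),
    # which is what you want for non-stationary problems
    v = 0.0
    for g in returns:
        v += alpha * (g - v)
    return v

returns = [0.0] * 10 + [1.0] * 2     # made-up stream: the target jumps late
incremental_mean(returns)   # ≈ 0.167: the equal-weight average is slow to track
constant_alpha(returns)     # → 0.75: tracks the recent returns much faster
```

Unrolling the constant-α update shows why: each past return g_k is weighted by α(1-α)^(n-k), so older returns decay geometrically.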


lecture 5: mainly understand SARSA and Q-learning
3: know model-free control => optimise the value function of an unknown MDP
MDP model is unknown, but experience can be sampled
MDP model is known, but is too big to use, except by samples
5: know the difference between on-policy and off-policy
8: this slide explains why model-free problems usually compute the action-value Q(s,a) rather than the state-value V(s): our ultimate goal is to find the policy π*, and in the model-free case π* simply cannot be computed from V(s).
10-11: ε-greedy exploration
12: ε-greedy policy improvement; applying ε-greedy policy improvement to an ε-greedy policy π yields a π' that still satisfies π' >= π
14-16: On-Policy Monte-Carlo Control and GLIE Monte-Carlo Control; a passing familiarity is enough
20-22: On-Policy Learning - SARSA; this algorithm is fairly well known and must be mastered
23: the convergence conditions. Intuitively, the first (the step sizes sum to infinity) says the steps must stay large enough that the final result is not affected by the initial values of V/Q; the second (the squared step sizes sum to something finite) says the steps must eventually become small enough for the process to converge.
In terms of the lr from lecture 4, lr_t = 1/N(St) satisfies both conditions (1 + 1/2 + 1/3 + 1/4 + ... ≈ ln(n) + C, where C is the Euler-Mascheroni constant, about 0.5772, so the sum diverges to +infinity; while 1 + (1/2)^2 + (1/3)^2 + ... < +infinity; I do not know the detailed proofs either). A constant lr_t = c satisfies the first condition but fails the second (the sum of c^2 diverges); even so, that does not stop lr_t = constant from being widely used in practice.
26-30: can be skipped
31: Off-Policy Learning
32-34: importance sampling; a passing familiarity is enough. The point is that the behavior policy and the target policy assign different probabilities to the same sample, so the return / TD target should be reweighted accordingly.
35-38: Off-Policy Learning - Q-learning; this is the most commonly used algorithm and must be mastered
40: think about why SARSA obtains a larger reward than Q-learning here.
41-42: digest this summary. Note that "TD-Learning" in Silver's slides here seems to refer specifically to the method applied to V(s), while more generally it refers to the family of ideas covered in lecture 4.
One final question: why is SARSA on-policy while Q-learning is off-policy? Try to answer it yourself and check whether you really understand the difference between the two. See slides 22 and 36.
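The ε-greedy selection rule that both algorithms rely on is tiny. This sketch assumes Q is a dict keyed by (state, action), which is just one possible representation:

```python
import random

def epsilon_greedy(Q, s, actions, eps, rng=random):
    """With probability eps act uniformly at random, otherwise act greedily
    with respect to Q."""
    if rng.random() < eps:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

Q = {("s", "left"): 0.2, ("s", "right"): 0.7}
epsilon_greedy(Q, "s", ["left", "right"], eps=0.0)   # → "right" (pure greedy)
```

Note that even the "random" branch can pick the greedy action, so the greedy action's total probability is 1 - eps + eps/|A|, which is the form used in the ε-greedy policy-improvement proof on slide 12.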


An update on what needs to be mastered:
For lecture 4 I said the material after slide 30 does not need to be read for now, and for lecture 5 that slides 26-30 can be ignored. If you have time, do take a quick look and try to understand the idea behind eligibility traces; this material really is not required, so just do your best to understand it.

An update on lecture 5.
The points 8, 10-11, 12, and 14-16 mentioned in the previous email were each treated as an isolated knowledge point. In fact, slides 7-16 are better read as a single arc about on-policy Monte-Carlo control. Specifically:
7: poses the problem => what is wrong with Monte-Carlo-based policy iteration. First, in the model-free case, π* simply cannot be computed from V(s). Second, greedy policy improvement leaves no exploration.
8: for the first problem, explains why in the model-free case we usually compute the action-value Q(s,a) rather than the state-value V(s): our ultimate goal is to find the policy π*, and in the model-free case π* cannot be computed from V(s).
9-11: for the second problem, since greedy policy improvement leaves no exploration, greedy policy improvement is replaced by ε-greedy policy improvement (the content of slide 11).
12: applying ε-greedy policy improvement to an ε-greedy policy π yields a π' that still satisfies π' >= π.
13-14: Monte-Carlo Policy Iteration and Monte-Carlo Control; in actual control, the evaluation step only needs to reach Q ≈ Qπ.
15-16: GLIE Monte-Carlo Control, a Monte-Carlo control method guaranteed to converge to Q*, requires the GLIE conditions => (1) every (s,a) is visited infinitely many times; (2) the final behavior policy (the ε-greedy policy obtained in the improvement phase) must gradually approach the greedy policy (only the greedy policy can be optimal in the limit). The simplest way to satisfy GLIE is to set ε = 1/k (the content of slide 16).
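Putting the arc of slides 7-16 together, here is a GLIE Monte-Carlo control sketch: every-visit MC evaluation with lr = 1/N(s,a), ε-greedy improvement with ε_k = 1/k. The `Chain` environment and its interface are invented for illustration:

```python
import random
from collections import defaultdict

class Chain:
    """Toy corridor over states 0..3 (made up): action +1 moves right,
    -1 moves left, reward 1 on reaching state 3."""
    actions = [-1, 1]
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        self.s = max(0, min(3, self.s + a))
        done = self.s == 3
        return self.s, (1.0 if done else 0.0), done

def glie_mc_control(env, episodes, gamma=0.9, seed=0):
    rng = random.Random(seed)
    Q, N = defaultdict(float), defaultdict(int)
    for k in range(1, episodes + 1):
        eps = 1.0 / k                 # GLIE schedule: eps_k = 1/k -> 0
        # evaluation sample: run one episode under the current e-greedy policy
        episode, s, done = [], env.reset(), False
        while not done:
            if rng.random() < eps:
                a = rng.choice(env.actions)
            else:
                a = max(env.actions, key=lambda x: Q[(s, x)])
            s2, r, done = env.step(a)
            episode.append((s, a, r))
            s = s2
        # every-visit MC update with lr = 1/N(s, a), working backwards
        g = 0.0
        for s, a, r in reversed(episode):
            g = r + gamma * g
            N[(s, a)] += 1
            Q[(s, a)] += (g - Q[(s, a)]) / N[(s, a)]
    return Q

Q = glie_mc_control(Chain(), episodes=200)
```

The first episode (ε = 1) is pure exploration, and as ε decays the behavior policy approaches the greedy policy, which is exactly the two GLIE conditions in action.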





