Off-policy vs on-policy (a master-level explanation, recommended)

Simply put, if the behavior policy and the target policy are the same, it is on-policy.

If the two are different, it is called off-policy.

In fact, seen this way, on-policy can be regarded as a special case of off-policy.
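To make the distinction concrete, here is a minimal tabular sketch (my own illustration, not from the original post): in Sarsa the bootstrapping target uses the action actually chosen by the same ε-greedy policy that generates the data, so behavior and target policy coincide; in Q-learning the target uses the greedy action, so the target policy differs from the behavior policy.

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, n_actions, rng):
    """Behavior policy: with probability epsilon pick a random action, else act greedily."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

def sarsa_target(Q, r, s_next, a_next, gamma):
    # On-policy: a_next is the action the behavior policy itself will take next,
    # so the policy being evaluated and improved is the one generating the data.
    return r + gamma * Q[s_next, a_next]

def q_learning_target(Q, r, s_next, gamma):
    # Off-policy: the target policy is the greedy policy (max over actions),
    # regardless of which behavior policy produced (s, a, r, s_next).
    return r + gamma * np.max(Q[s_next])
```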


Because the learning rate is so small, exploring every state requires a very large number of samples, which is inefficient. A representative exploratory policy is the uniform (equal-probability) policy, shown in the figure below:

[Figure: a uniform (equal-probability) exploratory policy]
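For concreteness, a uniform exploratory policy assigns probability 1/|A| to every action in every state; the sketch below (with an assumed 25-state, 5-action grid world) just builds that table.

```python
import numpy as np

def uniform_exploratory_policy(n_states, n_actions):
    """Assign probability 1/|A| to every action in every state (pure exploration)."""
    return np.full((n_states, n_actions), 1.0 / n_actions)

# Assumed example: a 5x5 grid world with 5 actions (up, down, left, right, stay);
# every entry of the resulting policy table is 0.2.
pi_explore = uniform_exploratory_policy(25, 5)
```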

Next, let us distinguish online learning from offline learning.

Another concept that may be confused with on-policy/off-policy is online/offline learning.

  • Online learning refers to the case where the value and policy can be updated once an experience sample is obtained.
  • Offline learning refers to the case where the update can only be done after all experience samples have been collected. For example, TD learning is online, whereas MC learning is offline (see the sketch after this list).
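A rough sketch of the timing difference (illustrative, assuming tabular state values stored in a NumPy array): TD(0) updates V immediately after each step, whereas MC waits for a complete episode and then updates toward the observed returns.

```python
import numpy as np

def td0_update(V, s, r, s_next, alpha, gamma):
    """Online: update V(s) as soon as a single (s, r, s') sample arrives."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

def mc_update(V, episode, alpha, gamma):
    """Offline: update only after the whole episode [(s0, r1), (s1, r2), ...] is collected."""
    G = 0.0
    for s, r in reversed(episode):   # accumulate the return from the end of the episode
        G = r + gamma * G
        V[s] += alpha * (G - V[s])
```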

An on-policy learning algorithm such as Sarsa must work online because the updated policy must be used to generate new experience samples.

An off-policy learning algorithm such as Q-learning can work either online or offline. It can either update the value and policy upon receiving an experience sample or update after collecting all experience samples. (Note: although the online version of Q-learning updates its policy in real time, the new policy is not used to generate samples; Q-learning merely uses historical samples to update the action values.)
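As an illustration of the offline usage (a sketch under my own assumptions: tabular Q stored as a 2-D array, samples given as (s, a, r, s') tuples), Q-learning can simply replay previously collected samples; the behavior policy that produced them never has to be executed again.

```python
import numpy as np

def q_learning_from_batch(Q, samples, alpha, gamma):
    """Replay historical (s, a, r, s_next) samples; only the greedy target policy
    is used for the update, the behavior policy is never re-run."""
    for s, a, r, s_next in samples:
        td_target = r + gamma * np.max(Q[s_next])   # greedy target policy
        Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```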
