Simply put, if the behavior policy and the target policy are the same, the algorithm is on-policy.
If the two are different, it is called off-policy.
In fact, this means on-policy learning can be viewed as a special case of off-policy learning.
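To make the distinction concrete, here is a minimal sketch (the action names, values, and epsilon are illustrative assumptions, not from the text): the behavior policy generates experience, while the target policy is the one being improved. When the two coincide, we are in the on-policy special case.

```python
import random

# Hypothetical action values for a single state (illustrative only).
q = {"left": 0.1, "right": 0.9}
EPS = 0.1  # exploration rate for the behavior policy

def behavior_action():
    """Epsilon-greedy: the policy that actually generates experience."""
    if random.random() < EPS:
        return random.choice(list(q))
    return max(q, key=q.get)

def target_action():
    """Greedy: the policy being learned (as in Q-learning)."""
    return max(q, key=q.get)

# If the target policy were also epsilon-greedy (identical to
# behavior_action), the algorithm would be on-policy -- the special
# case mentioned above.
```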
Because the learning rate is so small, many samples are needed to explore every state, which is rather inefficient. A representative exploratory policy is the uniform policy, which selects every action with equal probability 1/|A(s)|.
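A minimal sketch of such a uniform exploratory policy (the 5-action set below is a hypothetical grid-world example, not from the text):

```python
import random

# Hypothetical grid-world action set: four moves plus staying put.
ACTIONS = ["up", "down", "left", "right", "stay"]

def uniform_policy(state):
    """Sample an action uniformly: each has probability 1/|A(s)| = 0.2,
    independent of the current state."""
    return random.choice(ACTIONS)
```

Such a policy guarantees every state-action pair keeps being visited, at the cost of needing many samples, which is exactly the inefficiency noted above.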
Next, let us distinguish online learning from offline learning.
Another concept that may be confused with on-policy/off-policy is online/offline learning.
- Online learning refers to the case where the value and policy can be updated once an experience sample is obtained.
- Offline learning refers to the case where the update can only be done after all experience samples have been collected. For example, TD learning is online, whereas MC learning is offline.
An on-policy learning algorithm such as Sarsa must work online because the updated policy must be used to generate new experience samples.
An off-policy learning algorithm such as Q-learning can work either online or offline. It can either update the value and policy upon receiving an experience sample or update after collecting all experience samples. (Note: although the online version of Q-learning can update its policy in real time, the new policy is not used to generate samples. Q-learning merely wants to use the historical samples to update the action values.)
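The point above can be sketched in code: the same Q-learning update rule works whether we update after each sample (online) or after collecting a whole batch (offline), because the TD target uses the greedy max rather than the behavior policy. The tiny 2-state chain, rewards, and hyperparameters below are all illustrative assumptions.

```python
import random

ALPHA, GAMMA = 0.1, 0.9
STATES, ACTIONS = [0, 1], [0, 1]

def step(s, a):
    """Hypothetical dynamics: action a moves to state a; state 1 pays reward 1."""
    s2 = a
    r = 1.0 if s2 == 1 else 0.0
    return r, s2

def q_update(Q, s, a, r, s2):
    """Q-learning update: the target uses max over actions, not the
    behavior policy, so the samples' origin does not matter."""
    Q[(s, a)] += ALPHA * (r + GAMMA * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])

def run_online(steps=200):
    """Update the values immediately upon receiving each sample."""
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    s = 0
    for _ in range(steps):
        a = random.choice(ACTIONS)   # behavior: uniform exploration
        r, s2 = step(s, a)
        q_update(Q, s, a, r, s2)     # update right away
        s = s2
    return Q

def run_offline(steps=200):
    """First collect all samples, then replay the batch to update."""
    batch, s = [], 0
    for _ in range(steps):
        a = random.choice(ACTIONS)
        r, s2 = step(s, a)
        batch.append((s, a, r, s2))
        s = s2
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for s, a, r, s2 in batch:        # update only after collection ends
        q_update(Q, s, a, r, s2)
    return Q
```

In both variants the behavior policy stays uniform throughout; the learned (greedy) policy never generates samples, which is exactly why Q-learning has this freedom while Sarsa does not.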