Simply put, if the behavior policy and the target policy are the same, the algorithm is on-policy.
If the two are different, it is called off-policy.
In fact, this means on-policy learning can be viewed as a special case of off-policy learning.
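To make the distinction concrete, here is a minimal sketch (the action names, values, and epsilon are illustrative assumptions, not from the text): the behavior policy generates experience, while the target policy is the one being improved. When the two coincide, we are in the on-policy special case.

```python
import random

# Hypothetical action values for a single state (illustrative only).
q = {"left": 0.1, "right": 0.9}
EPS = 0.1  # exploration rate for the behavior policy

def behavior_action():
    """Epsilon-greedy: the policy that actually generates experience."""
    if random.random() < EPS:
        return random.choice(list(q))
    return max(q, key=q.get)

def target_action():
    """Greedy: the policy being learned (as in Q-learning)."""
    return max(q, key=q.get)

# If the target policy were also epsilon-greedy (identical to
# behavior_action), the algorithm would be on-policy -- the special
# case mentioned above.
```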
Because the learning rate is so small, many samples are needed to explore every state, which is rather inefficient. A representative exploratory policy is the uniform policy, which selects every action with equal probability 1/|A(s)|.
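A minimal sketch of such a uniform exploratory policy (the 5-action set below is a hypothetical grid-world example, not from the text):

```python
import random

# Hypothetical grid-world action set: four moves plus staying put.
ACTIONS = ["up", "down", "left", "right", "stay"]

def uniform_policy(state):
    """Sample an action uniformly: each has probability 1/|A(s)| = 0.2,
    independent of the current state."""
    return random.choice(ACTIONS)
```

Such a policy guarantees every state-action pair keeps being visited, at the cost of needing many samples, which is exactly the inefficiency noted above.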
Next, let us distinguish online learning from offline learning.
Another concept that may be confused with on-policy/off-policy is online/offline learning.
- Online learning refers to the case where the value and policy can be updated once an experience sample is obtained.
- Offline learning refers to the case where the update can only be done after all experience samples have been collected. For example, TD learning is online, whereas MC learning is offline.
An on-policy learning algorithm such as Sarsa must work online because the updated policy must be used to generate new experience samples.
An off-policy learning algorithm such as Q-learning can work either online or offline. It can either update the value and policy upon receiving an experience sample or update after collecting all experience samples. (Note: although the online version of Q-learning can update its policy in real time, the new policy is not used to generate samples. Q-learning merely wants to use the historical samples to update the action values.)
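The point above can be sketched in code: the same Q-learning update rule works whether we update after each sample (online) or after collecting a whole batch (offline), because the TD target uses the greedy max rather than the behavior policy. The tiny 2-state chain, rewards, and hyperparameters below are all illustrative assumptions.

```python
import random

ALPHA, GAMMA = 0.1, 0.9
STATES, ACTIONS = [0, 1], [0, 1]

def step(s, a):
    """Hypothetical dynamics: action a moves to state a; state 1 pays reward 1."""
    s2 = a
    r = 1.0 if s2 == 1 else 0.0
    return r, s2

def q_update(Q, s, a, r, s2):
    """Q-learning update: the target uses max over actions, not the
    behavior policy, so the samples' origin does not matter."""
    Q[(s, a)] += ALPHA * (r + GAMMA * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])

def run_online(steps=200):
    """Update the values immediately upon receiving each sample."""
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    s = 0
    for _ in range(steps):
        a = random.choice(ACTIONS)   # behavior: uniform exploration
        r, s2 = step(s, a)
        q_update(Q, s, a, r, s2)     # update right away
        s = s2
    return Q

def run_offline(steps=200):
    """First collect all samples, then replay the batch to update."""
    batch, s = [], 0
    for _ in range(steps):
        a = random.choice(ACTIONS)
        r, s2 = step(s, a)
        batch.append((s, a, r, s2))
        s = s2
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for s, a, r, s2 in batch:        # update only after collection ends
        q_update(Q, s, a, r, s2)
    return Q
```

In both variants the behavior policy stays uniform throughout; the learned (greedy) policy never generates samples, which is exactly why Q-learning has this freedom while Sarsa does not.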