Source paper: Empirical Study of Off-Policy Policy Evaluation for Reinforcement Learning
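1. Inverse propensity scoring (IPS)
Core idea of IPS: reweight the rewards in the historical data by the importance sampling ratio between the evaluation policy and the behavior policy, so that rewards collected under the behavior policy are weighted by how likely they would be under the evaluation policy.
IPS: not suitable for long horizons (the trajectory weight is a product of per-step ratios, so its variance grows quickly with the horizon)
IS: importance sampling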
- unbiased
PDIS: per-decision importance sampling
WIS: weighted importance sampling
- biased
- more accurate and data-efficient than IS
PDWIS: per-decision weighted importance sampling
- performs best among these four methods
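
The four estimators above differ only in where the importance weights are applied and whether they are normalized. The following is a minimal sketch, not the paper's implementation. It assumes N trajectories of equal length T, with the per-step ratios pi_e(a_t|s_t) / pi_b(a_t|s_t) already computed. The function name `ips_estimates` and the array layout are illustrative assumptions.

```python
import numpy as np

def ips_estimates(rho, r, gamma=1.0):
    """Compute IS, PDIS, WIS and PDWIS value estimates.

    rho[i, t]: pi_e(a_t | s_t) / pi_b(a_t | s_t) for trajectory i (assumed given).
    r[i, t]:   reward at step t of trajectory i.
    Both arrays have shape (N, T).
    """
    rho = np.asarray(rho, dtype=float)
    r = np.asarray(r, dtype=float)
    horizon = r.shape[1]

    disc = gamma ** np.arange(horizon)   # gamma^t for each step
    cum_rho = np.cumprod(rho, axis=1)    # cumulative weight rho_{0:t}
    full_rho = cum_rho[:, -1]            # whole-trajectory weight
    returns = (disc * r).sum(axis=1)     # discounted return per trajectory

    # IS: weight the whole return by the whole-trajectory ratio.
    is_est = np.mean(full_rho * returns)
    # WIS: same, but normalize by the sum of weights (biased, lower variance).
    wis_est = np.sum(full_rho * returns) / np.sum(full_rho)
    # PDIS: weight each reward only by the ratios up to its own step.
    pdis_est = np.mean((disc * cum_rho * r).sum(axis=1))
    # PDWIS: per-step weights, normalized across trajectories at each step.
    step_norm = cum_rho.sum(axis=0)
    pdwis_est = np.sum(disc * (cum_rho * r).sum(axis=0) / step_norm)

    return {"IS": is_est, "PDIS": pdis_est, "WIS": wis_est, "PDWIS": pdwis_est}
```

When the two policies coincide (all ratios equal 1), all four estimates reduce to the average discounted return. The weighted variants divide by a sum of weights, so this sketch assumes that sum is nonzero.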
2. Direct methods (DM)
Generally,
FQE,
Q_pi(lambda),
IH: infinite-horizon setting
Minimax-style estimators
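
As an illustration of a direct method, here is a minimal tabular sketch of FQE (fitted Q evaluation). It assumes small discrete state and action spaces and a dataset of (s, a, r, s_next, done) tuples. In practice FQE is usually run with a function approximator. The name `tabular_fqe` and its parameters are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def tabular_fqe(transitions, pi_e, n_states, n_actions, gamma=0.9, n_iters=100):
    """Estimate Q^{pi_e} from logged transitions by repeated regression.

    transitions: iterable of (s, a, r, s_next, done) with integer s, a, s_next.
    pi_e: array of shape (n_states, n_actions), evaluation-policy probabilities.
    """
    s, a, r, s_next, done = map(np.asarray, zip(*transitions))
    pi_e = np.asarray(pi_e, dtype=float)
    q = np.zeros((n_states, n_actions))

    for _ in range(n_iters):
        # Bootstrapped target: r + gamma * E_{a' ~ pi_e}[Q(s', a')].
        v_next = (pi_e[s_next] * q[s_next]).sum(axis=1)
        target = r + gamma * (1.0 - done) * v_next

        # In the tabular case the least-squares fit is the mean target
        # for each visited (s, a) pair; unvisited pairs keep their old value.
        sums = np.zeros_like(q)
        counts = np.zeros_like(q)
        np.add.at(sums, (s, a), target)
        np.add.at(counts, (s, a), 1.0)
        q = np.where(counts > 0, sums / np.maximum(counts, 1.0), q)

    return q
```

The policy value is then read off as the average over initial states s0 of sum_a pi_e(a|s0) * Q(s0, a).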
3. Hybrid methods (HM)
Except for IH, each DM has three corresponding HMs: standard doubly robust (DR), weighted doubly robust (WDR), and MAGIC.
For each DM: MAGIC > WDR >
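
To show how a hybrid method combines a DM with importance weights, below is a minimal sketch of the standard DR estimator only (WDR and MAGIC are not covered). It uses the usual backward recursion V_DR = V_hat(s_t) + rho_t * (r_t + gamma * V_DR_next - Q_hat(s_t, a_t)). It assumes a tabular `q_hat` produced by some direct method and precomputed per-step ratios. The name `doubly_robust` and the data layout are illustrative assumptions.

```python
import numpy as np

def doubly_robust(trajectories, pi_e, q_hat, gamma=1.0):
    """Standard doubly robust (DR) value estimate.

    trajectories: list of trajectories, each a list of (s, a, r, rho) where
                  rho = pi_e(a | s) / pi_b(a | s) for that step (assumed given).
    pi_e, q_hat:  arrays of shape (n_states, n_actions); q_hat is the
                  direct-method estimate of Q^{pi_e}.
    """
    pi_e = np.asarray(pi_e, dtype=float)
    q_hat = np.asarray(q_hat, dtype=float)
    v_hat = (pi_e * q_hat).sum(axis=1)   # V_hat(s) = E_{a ~ pi_e}[Q_hat(s, a)]

    estimates = []
    for traj in trajectories:
        v_dr = 0.0
        # Walk the trajectory backwards, correcting the model's prediction
        # with the importance-weighted one-step error at each step.
        for s, a, r, rho in reversed(traj):
            v_dr = v_hat[s] + rho * (r + gamma * v_dr - q_hat[s, a])
        estimates.append(v_dr)
    return float(np.mean(estimates))
```

With `q_hat` set to zero, the recursion reduces to per-decision importance sampling. With all ratios equal to 1 and a single action per state, it returns the observed return regardless of `q_hat`.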