OPE (off-policy evaluation) | Importance sampling methods: IS, PDIS, NPDIS, WIS, WPDIS, CWPDIS

Source: Safe Reinforcement Learning by Philip S. Thomas

1. Importance Sampling (IS)

  • batch case (all trajectories in D): the IS estimator is the mean of the individual IS estimators of the trajectories.

  • properties: unbiased
  • Upper and Lower Bounds on the IS Estimator
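
The batch IS estimator described above can be sketched as follows. This is a minimal illustration, not code from the source: the function name `importance_sampling`, the `gamma` parameter, and the input layout are assumptions. Here `pi_b[i][t]` and `pi_e[i][t]` are taken to be the probabilities of the action logged at step t of history i under the behavior and evaluation policies, and `reward[i][t]` is the reward at that step.

```python
from math import prod


def importance_sampling(pi_b, pi_e, reward, gamma=1.0):
    """Ordinary IS estimate, averaged over a batch of histories (sketch)."""
    estimates = []
    for history_b, history_e, history_reward in zip(pi_b, pi_e, reward):
        # One importance weight for the entire history:
        # the product of the per-step probability ratios.
        weight = prod(p_e / p_b for p_e, p_b in zip(history_e, history_b))
        # Discounted return of the history (gamma = 1 means undiscounted).
        ret = sum(gamma ** t * r for t, r in enumerate(history_reward))
        estimates.append(weight * ret)
    # Batch estimate: mean of the per-history IS estimates.
    return sum(estimates) / len(estimates)


if __name__ == "__main__":
    pi_b = [[0.5, 0.5], [0.5, 0.5]]
    pi_e = [[1.0, 0.5], [0.25, 0.5]]
    reward = [[1.0, 0.0], [0.0, 1.0]]
    print(importance_sampling(pi_b, pi_e, reward))
```

Because each history gets a single weight, one very unlikely action anywhere in the history rescales the whole return, which is the main source of the estimator's variance.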

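2. Per-Decision Importance Sampling (PDIS)

  • idea: use a different importance weight for each reward rather than one importance weight for the entire return.
  • properties: unbiased; usually lower variance than IS
  • note: all rewards are normalized
  • batch case: mean of the individual PDIS estimators of the trajectories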
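
The per-reward weighting can be sketched as below. This is a minimal illustration under assumptions: the name `per_decision_is`, the `gamma` parameter, and the input layout (per-history lists of action probabilities under each policy, plus per-step rewards) are hypothetical and not taken from the source. The reward at step t is weighted only by the probability ratios of the actions up to and including step t.

```python
def per_decision_is(pi_b, pi_e, reward, gamma=1.0):
    """Per-decision IS estimate, averaged over a batch of histories (sketch)."""
    estimates = []
    for history_b, history_e, history_reward in zip(pi_b, pi_e, reward):
        weight = 1.0    # running product of ratios for steps 0..t
        estimate = 0.0
        steps = zip(history_b, history_e, history_reward)
        for t, (p_b, p_e, r) in enumerate(steps):
            weight *= p_e / p_b
            # Each reward is weighted only by the actions that preceded it.
            estimate += gamma ** t * weight * r
        estimates.append(estimate)
    # Batch estimate: mean of the per-history PDIS estimates.
    return sum(estimates) / len(estimates)


if __name__ == "__main__":
    print(per_decision_is([[0.5, 0.5]], [[1.0, 0.25]], [[1.0, 1.0]]))
```

Early rewards are no longer affected by the ratios of later actions, which is where the variance reduction relative to IS comes from.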

3. Normalized Per-Decision Importance Sampling (NPDIS)

4. Weighted Importance Sampling (WIS)

def weighted_is(pi_b, pi_e, reward):
    '''
    Weighted Importance Sampling (WIS). Works in a batch setting.

    pi_b   : batch of histories; each history lists the probabilities of the
             logged actions under the behavioral policy
    pi_e   : batch of histories; each history lists the probabilities of the
             same actions under the evaluation policy
    reward : batch of lists of the rewards obtained per time step

    Returns the normalized estimate of performance under the evaluation policy.
    '''
    weighted_return_sum = 0.0
    weight_sum = 0.0
    for history_b, history_e, history_reward in zip(pi_b, pi_e, reward):
        # Importance weight of the whole history: product of per-step ratios.
        history_weight = 1.0
        for prob_b, prob_e in zip(history_b, history_e):
            history_weight *= prob_e / prob_b
        # Return of the history: undiscounted sum of its per-step rewards
        # (assumes gamma = 1).
        weighted_return_sum += history_weight * sum(history_reward)
        weight_sum += history_weight
    # Normalize by the sum of the weights rather than the number of histories.
    return weighted_return_sum / weight_sum

5. Weighted Per-Decision Importance Sampling (WPDIS)
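
A minimal sketch of WPDIS, assuming the form in which the per-decision weighted rewards are summed over all histories and time steps and then divided by the sum of all discounted per-decision weights. That form is an assumption here and should be checked against the source before use. The name `weighted_per_decision_is`, the `gamma` parameter, and the input layout (per-history action probabilities under each policy, plus per-step rewards) are also hypothetical.

```python
def weighted_per_decision_is(pi_b, pi_e, reward, gamma=1.0):
    """Weighted per-decision IS over a batch of histories (sketch)."""
    numerator = 0.0     # sum of gamma^t * weight_t * reward_t
    denominator = 0.0   # sum of gamma^t * weight_t
    for history_b, history_e, history_reward in zip(pi_b, pi_e, reward):
        weight = 1.0    # running product of ratios for steps 0..t
        steps = zip(history_b, history_e, history_reward)
        for t, (p_b, p_e, r) in enumerate(steps):
            weight *= p_e / p_b
            numerator += gamma ** t * weight * r
            denominator += gamma ** t * weight
    return numerator / denominator


if __name__ == "__main__":
    pi_b = [[0.5, 0.5], [0.5, 0.5]]
    pi_e = [[1.0, 0.25], [0.5, 0.5]]
    reward = [[1.0, 1.0], [0.0, 1.0]]
    print(weighted_per_decision_is(pi_b, pi_e, reward))
```

With this normalization, a batch in which every reward equals 1 always evaluates to 1, regardless of the horizon, so the result behaves like a weighted average of rewards rather than a sum of them.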

6. Consistent Weighted Per-Decision Importance Sampling (CWPDIS)
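
A minimal sketch of CWPDIS, assuming the form in which the normalization is done separately at each time step: the weighted rewards at step t are divided by the sum of the step-t weights over all histories, and the per-step results are then discounted and summed. The name `consistent_weighted_pdis`, the `gamma` parameter, and the input layout are hypothetical, and the sketch assumes all histories have the same length.

```python
def consistent_weighted_pdis(pi_b, pi_e, reward, gamma=1.0):
    """Consistent weighted per-decision IS over a batch (sketch).

    Assumes every history in the batch has the same number of steps.
    """
    horizon = len(reward[0])
    weighted_reward = [0.0] * horizon   # per step: sum of weight_t * reward_t
    weight_total = [0.0] * horizon      # per step: sum of weight_t
    for history_b, history_e, history_reward in zip(pi_b, pi_e, reward):
        weight = 1.0    # running product of ratios for steps 0..t
        steps = zip(history_b, history_e, history_reward)
        for t, (p_b, p_e, r) in enumerate(steps):
            weight *= p_e / p_b
            weighted_reward[t] += weight * r
            weight_total[t] += weight
    estimate = 0.0
    for t, (num, den) in enumerate(zip(weighted_reward, weight_total)):
        if den > 0:  # skip steps where every history has zero weight
            estimate += gamma ** t * num / den
    return estimate


if __name__ == "__main__":
    pi_b = [[0.5, 0.5], [0.5, 0.5]]
    pi_e = [[1.0, 0.25], [0.5, 0.5]]
    reward = [[1.0, 1.0], [0.0, 1.0]]
    print(consistent_weighted_pdis(pi_b, pi_e, reward))
```

Normalizing per time step means a batch in which every reward equals 1 evaluates to the sum of the discount factors over the horizon, which matches the return of such a history.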
