source:SAFE REINFORCEMENT LEARNING by PHILIP S. THOMAS
1. Importance Sampling (IS)
- For each trajectory in the batch D, the per-trajectory IS estimate is the trajectory's return multiplied by its importance weight: the product over time steps of pi_e(a_t|s_t) / pi_b(a_t|s_t).
- The IS estimator is the mean of the individual per-trajectory IS estimates.
- Properties: unbiased, but can have very high variance (the weights are products of ratios, so they can blow up).
- Upper and Lower Bounds on the IS Estimator
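The basic estimator above can be sketched in a few lines. This is a minimal illustration, assuming the same batch format used by the weighted_is code later in these notes: each history is a list of per-time-step probabilities of the taken actions under each policy, plus a list of per-time-step rewards.

```python
def importance_sampling(pi_b, pi_e, reward):
    """Ordinary (unweighted) IS estimate of the evaluation policy's performance."""
    estimates = []
    for history_b, history_e, history_reward in zip(pi_b, pi_e, reward):
        # Importance weight of the whole trajectory:
        # product of the per-step probability ratios pi_e / pi_b.
        weight = 1.0
        for prob_b, prob_e in zip(history_b, history_e):
            weight *= prob_e / prob_b
        # Weight the (undiscounted) return of this trajectory.
        estimates.append(weight * sum(history_reward))
    # Mean of the per-trajectory estimates -- this is what makes it unbiased.
    return sum(estimates) / len(estimates)
```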
2. PDIS (Per-Decision Importance Sampling)
- Lower variance than ordinary IS, and still unbiased.
- Assumes all rewards are normalized.
- Uses a different importance weight for each reward rather than one importance weight for the entire return: the reward at time t is weighted only by the probability ratios of the actions taken up to and including time t.
- Batch case: average the per-trajectory PDIS estimates over all trajectories in the batch.
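A sketch of the batch PDIS estimator, under the same assumed input format as the other snippets in these notes and with no discounting (gamma = 1). The only change from ordinary IS is that the running product of ratios is applied to each reward as it is accumulated, instead of one full-trajectory weight applied to the whole return.

```python
def per_decision_is(pi_b, pi_e, reward):
    """Per-decision IS estimate: each reward uses only the ratios up to its time step."""
    estimates = []
    for history_b, history_e, history_reward in zip(pi_b, pi_e, reward):
        weight = 1.0     # running product of ratios pi_e / pi_b up to time t
        estimate = 0.0
        for prob_b, prob_e, r in zip(history_b, history_e, history_reward):
            weight *= prob_e / prob_b
            estimate += weight * r  # reward at t weighted by ratios for steps <= t
        estimates.append(estimate)
    # Batch case: mean of the per-trajectory PDIS estimates.
    return sum(estimates) / len(estimates)
```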
3. NPDIS (Normalized Per-Decision Importance Sampling)
4. WIS (Weighted Importance Sampling)
'''
Weighted Importance Sampling (WIS)
* Works in a batch setting.
pi_b : batch of histories; each history is the list of per-time-step probabilities of the taken actions under the behavior policy
pi_e : batch of histories; each history is the list of per-time-step probabilities of the same actions under the evaluation policy
reward : batch of lists of rewards obtained per time step
Returns a normalized estimate of performance under the evaluation policy
(biased, but consistent, and typically lower variance than ordinary IS).
'''
def weighted_is(pi_b, pi_e, reward):
    total_weighted_return = 0.0
    total_weight = 0.0
    for history_b, history_e, history_reward in zip(pi_b, pi_e, reward):
        # Importance weight of the whole trajectory: product of the
        # per-step action-probability ratios pi_e / pi_b.
        history_weight = 1.0
        for prob_b, prob_e in zip(history_b, history_e):
            history_weight *= prob_e / prob_b
        # Undiscounted return of the trajectory.
        history_return = sum(history_reward)
        total_weighted_return += history_weight * history_return
        total_weight += history_weight
    # Normalize by the sum of the importance weights rather than by the
    # number of trajectories -- this is what makes the estimator "weighted".
    return total_weighted_return / total_weight