Paper Reading 3: Reinforcement-Learning-Based Recommender Systems: Top-K Off-Policy Correction for a REINFORCE Recommender System

界限消除者

Tags: deep learning, reinforcement learning, recommender systems, data mining

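Copyright notice: This is an original article by the author, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.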

Article link: https://blog.csdn.net/qq_37227782/article/details/112779130

abstract

problems in recommendation: a complex user state space (fortunately, a large amount of implicit feedback data is available)

the problem of previous versions of the recommender: only observing feedback on recommendations selected by the previous versions of the recommender (only positive feedback is considered; negative feedback or ignored items could also be taken into account)

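So the authors propose a recommender system based on reinforcement learning (which can take negative feedback and other kinds of feedback into account).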

contribution

(1) scaling REINFORCE to a production recommender system with an action space on the order of millions (i.e., it can handle a huge action space)

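(2) applying off-policy correction to address data biases in learning from logged feedback collected from multiple behavior policies (importance sampling is used so that the policy can be trained offline on logged data; if this is unclear, see Hung-yi Lee's reinforcement learning lectures, in particular the PPO part)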

(3) proposing a novel top-k off-policy correction to account f
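
To make the importance-sampling idea in contribution (2) more concrete, here is a minimal sketch of an off-policy-corrected REINFORCE gradient for a single logged sample, assuming a softmax policy over a tiny action set. The function names, the weight cap of 10.0, and the toy numbers are my own assumptions for illustration, not the paper's code. The top-K multiplier `K * (1 - pi)^(K-1)` is taken from the published paper rather than from the notes above, and setting `k=1` switches it off.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def importance_weight(pi_prob, beta_prob, cap=10.0):
    """Ratio pi(a|s) / beta(a|s), capped to limit variance.

    beta is the behavior (logging) policy; cap=10.0 is an assumed value.
    """
    return min(pi_prob / beta_prob, cap)


def top_k_multiplier(pi_prob, k):
    """K * (1 - pi(a|s))^(K-1): derivative of 1 - (1 - pi)^K w.r.t. pi."""
    return k * (1.0 - pi_prob) ** (k - 1)


def corrected_logit_gradient(logits, action, reward, beta_prob, k=1, cap=10.0):
    """Gradient w.r.t. the logits of the corrected REINFORCE objective
    for one logged (state, action, reward) sample.

    For a softmax policy, d log pi(a|s) / d logit_j = 1[j == a] - pi(j|s).
    The importance weight is treated as a constant scaling factor.
    """
    pi = softmax(logits)
    w = importance_weight(pi[action], beta_prob, cap)
    lam = top_k_multiplier(pi[action], k)
    scale = w * lam * reward
    return [scale * ((1.0 if j == action else 0.0) - p) for j, p in enumerate(pi)]


# Toy usage: three items, the logged action was item 0 with reward 1.0,
# and the behavior policy chose it with probability 1/3.
grad = corrected_logit_gradient([0.0, 0.0, 0.0], action=0, reward=1.0, beta_prob=1 / 3)
```

In a real system the action space is far too large for a plain list, and the behavior-policy probability `beta_prob` would itself have to be estimated from the logs; the sketch only shows how the correction terms scale the usual REINFORCE update.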

