Large-Scale Interactive Recommendation with Tree-Structured Policy Gradient (AAAI 2019): Reading Notes

strawberry47

Last modified: 2022-03-21 09:09:13

First published: 2021-05-12 10:52:41

Article link: https://blog.csdn.net/strawberry47/article/details/116697266

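Method 1 (agglomerative, bottom-up):
Treat every sample point as its own cluster.
Compute the distances between clusters and merge the two closest clusters into a new one.
Repeat until only one cluster remains.
Divisive method (top-down, used in this paper):
First split the data points into c clusters.
Then split each cluster further into smaller clusters.
Continue until each sub-cluster is associated with exactly one point.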
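
To make the contrast with the divisive method concrete, here is a minimal sketch of the bottom-up procedure (Method 1). The function names and the choice of single-linkage distance are my own assumptions; the notes above do not say how the distance between clusters is measured.

```python
from itertools import combinations

def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def cluster_distance(c1, c2):
    # Assumption: single linkage, i.e. the closest pair of points across two clusters.
    return min(euclidean(p, q) for p in c1 for q in c2)

def agglomerative(points):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [[p] for p in points]  # every point starts as its own cluster
    history = []
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        history.append(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history

merges = agglomerative([(0.0,), (0.1,), (5.0,), (5.2,)])
```

Nothing in this procedure controls cluster sizes, so the resulting tree can be very unbalanced. That is presumably one reason the paper goes top-down and splits into c clusters of equal size instead.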

Introduction

We propose a Tree-structured Policy Gradient Recommendation (TPGR) framework which achieves high efficiency and high effectiveness at the same time.
A balanced hierarchical clustering tree is built over the items, and picking an item is thus formulated as seeking a path from the root to a certain leaf of the tree, which dramatically reduces the time complexity in both the training and the decision-making stages.
We utilize policy gradient technique to learn how to make recommendation decisions so as to maximize long-run rewards.
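Existing algorithms for IRS (interactive recommender systems) and their drawbacks:

MAB (multi-armed bandit): assumes that the user's interests stay fixed during the recommendation process.
RL: existing methods cannot handle large discrete action spaces.
Wolpertinger Architecture [1]: a method whose complexity is sub-linear in the size of the action space and that generalizes well over actions (built on the actor-critic framework, with parameters trained by DDPG); however, it suffers from an inconsistency between the learned continuous action and the discrete action that is actually desired.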

[1] Dulac-Arnold G, Evans R, van Hasselt H, et al. Deep reinforcement learning in large discrete action spaces. arXiv preprint arXiv:1512.07679, 2015.

Methods

State. A state s is defined as the historical interactions between a user and the recommender system, which can be encoded as a low-dimensional vector via a recurrent neural network (RNN).
Action. An action a is to pick an item for recommendation.
Reward. All users interacting with the recommender system form the environment, which returns a reward r after receiving an action a at the state s; the reward reflects the user’s feedback to the recommended item.
Transition. As the state is the historical interactions, once a new item is recommended and the corresponding user’s feedback is given, the state transition is determined.

Tree-structured Policy Gradient Recommendation

Intuition for TPGR

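Each leaf node maps to an item, and each non-leaf node is associated with a policy network.
Given a state, the policy networks guide a top-down walk from the root to a leaf, and the item at that leaf is recommended to the user.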

Balanced Hierarchical Clustering over Items

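Balanced tree: for every node, the heights of its subtrees differ by at most 1.
Every non-leaf node has the same number of children, denoted c (except for the parents of leaf nodes).
A clustering algorithm takes a set of vectors (as I understand it, the vectors of all items) and an integer c as input and splits the vectors into c balanced clusters; applying it repeatedly until every sub-cluster is associated with exactly one item yields a balanced clustering tree.
Clustering methods used:

PCA-based (better)
K-means-based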
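
The recursive construction can be sketched as follows. This is my own reading of a PCA-based balanced split, not necessarily the paper's exact algorithm: project the item vectors onto the first principal component, sort by that projection, and cut the sorted list into c parts of near-equal size. The names `first_principal_component`, `balanced_split` and `build_tree` are hypothetical, and power iteration is used only to keep the sketch free of dependencies.

```python
import math
import random

def first_principal_component(vectors, iters=100, seed=0):
    """Approximate the top principal direction with power iteration."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[k] for v in vectors) / n for k in range(dim)]
    centered = [[v[k] - mean[k] for k in range(dim)] for v in vectors]
    rng = random.Random(seed)
    w = [rng.gauss(0, 1) for _ in range(dim)]
    for _ in range(iters):
        # Multiply w by the (unnormalised) covariance matrix: X^T (X w).
        proj = [sum(x[k] * w[k] for k in range(dim)) for x in centered]
        w = [sum(proj[i] * centered[i][k] for i in range(n)) for k in range(dim)]
        norm = math.sqrt(sum(x * x for x in w)) or 1.0
        w = [x / norm for x in w]
    return mean, w

def balanced_split(item_ids, vectors, c):
    """Sort items by their projection and cut into c parts of near-equal size."""
    mean, w = first_principal_component([vectors[i] for i in item_ids])
    score = lambda i: sum((vectors[i][k] - mean[k]) * w[k] for k in range(len(w)))
    ordered = sorted(item_ids, key=score)
    base, extra = divmod(len(ordered), c)
    parts, start = [], 0
    for p in range(c):
        size = base + (1 if p < extra else 0)
        parts.append(ordered[start:start + size])
        start += size
    return [p for p in parts if p]

def build_tree(item_ids, vectors, c):
    """Return a nested list whose leaves are single item ids."""
    if len(item_ids) == 1:
        return item_ids[0]
    if len(item_ids) <= c:
        return list(item_ids)  # a parent of leaves may have fewer than c children
    return [build_tree(part, vectors, c) for part in balanced_split(item_ids, vectors, c)]

# Toy data: 9 items whose vectors differ only along the first axis.
vectors = [[float(i), 0.0] for i in range(9)]
tree = build_tree(list(range(9)), vectors, c=3)
```

With 9 items and c = 3 this gives a depth-2 tree with three groups of three neighbouring items. The order of the groups depends on the sign of the principal direction, which is arbitrary.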

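Item representations used:

Rating-based: each item is represented by its column of the rating matrix (later experiments show this works best).
VAE-based: dimensionality reduction with a variational auto-encoder.
MF-based: items are represented via matrix factorization.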

Architecture of TPGR

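A status point indicates which node is currently occupied; choosing an item thus becomes moving the status point from the root to some leaf.
Each non-leaf node is associated with a policy network (fully connected layers plus activation units).
The policy network of the node v where the status point sits takes the current state as input and outputs a probability distribution over the children of v, i.e. the probability of moving to each child.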
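
The walk of the status point can be sketched in a few lines. This is a simplified illustration under my own assumptions: a full c-ary tree of depth d, and a single linear layer with softmax per non-leaf node in place of the paper's fully connected network. The class names `NodePolicy` and `TreePolicy` are hypothetical.

```python
import math
import random
from itertools import product

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

class NodePolicy:
    """Assumed stand-in for a policy network: one linear layer + softmax over c children."""
    def __init__(self, state_dim, c, rng):
        self.weights = [[rng.gauss(0, 0.1) for _ in range(state_dim)] for _ in range(c)]

    def probs(self, state):
        return softmax([sum(w * s for w, s in zip(row, state)) for row in self.weights])

class TreePolicy:
    """Full c-ary tree of the given depth; non-leaf nodes are keyed by their path from the root."""
    def __init__(self, state_dim, c, depth, seed=0):
        rng = random.Random(seed)
        self.c, self.depth = c, depth
        self.nodes = {
            path: NodePolicy(state_dim, c, rng)
            for level in range(depth)
            for path in product(range(c), repeat=level)
        }

    def path_prob(self, state, path):
        """Probability of a leaf = product of the child probabilities along its path."""
        p = 1.0
        for level in range(self.depth):
            p *= self.nodes[path[:level]].probs(state)[path[level]]
        return p

    def sample(self, state, rng):
        """Move the status point from the root to a leaf; return (item index, probability)."""
        path, prob = (), 1.0
        for _ in range(self.depth):
            probs = self.nodes[path].probs(state)
            child = rng.choices(range(self.c), weights=probs)[0]
            prob *= probs[child]
            path += (child,)
        item = 0
        for child in path:
            item = item * self.c + child  # leaf index in left-to-right order
        return item, prob

policy = TreePolicy(state_dim=4, c=3, depth=2)
state = [0.5, -1.0, 0.2, 0.0]
item, prob = policy.sample(state, random.Random(1))
```

Each decision evaluates only d softmaxes of size c rather than one softmax over all c^d items, which is where the time saving comes from. The leaf probabilities still form a valid distribution over items because each is a product of conditional probabilities along the path.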

State Representation

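The input consists of the item ids recommended before time step t and the corresponding rewards; each item id is mapped to an embedding vector (learned end-to-end or pre-trained with MF), and each reward is mapped to a one-hot vector.
user_status holds some statistics, such as the numbers of positive rewards, negative rewards, consecutive positive rewards and consecutive negative rewards before time step t.
An SRU (simple recurrent unit) encodes these inputs to produce the state.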
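
As a rough illustration of the hand-crafted part of the input, the statistics in user_status could be computed as below. The +1/-1 reward encoding, the two-way one-hot vector and the reading of "consecutive" as the streak at the end of the history are my assumptions; the SRU encoder itself is not shown.

```python
def user_status(rewards):
    """rewards: +1 / -1 feedback values observed before time step t (assumed encoding)."""
    positives = sum(1 for r in rewards if r > 0)
    negatives = sum(1 for r in rewards if r <= 0)
    # Length of the streak at the end of the history; at most one of the two is non-zero.
    streak_pos = streak_neg = 0
    for r in reversed(rewards):
        if r > 0 and streak_neg == 0:
            streak_pos += 1
        elif r <= 0 and streak_pos == 0:
            streak_neg += 1
        else:
            break
    return [positives, negatives, streak_pos, streak_neg]

def one_hot_reward(r):
    """Map a reward to a one-hot vector: [positive, negative]."""
    return [1, 0] if r > 0 else [0, 1]

status = user_status([1, -1, 1, 1])
```

Here `status` holds three positives, one negative, and a current positive streak of two. These features would be combined with the SRU encoding of the interaction history to form the state.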

Experiments and Results

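Datasets:
Ratings above 3 are treated as positive rewards and all others as negative rewards.
The vertical axis is the average rating after consecutive positive (negative) rewards.
This shows that the more satisfying (disappointing) items a user has consumed before, the more pleasure (displeasure) she feels, and so she tends to rate the current item higher (lower).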
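
The statistic on the vertical axis can be reproduced for a single user with a few lines. This is a sketch under my own assumptions: rewards are encoded as +1/-1, and only the positive streak is shown (the negative case is symmetric). The function names are hypothetical.

```python
from collections import defaultdict

def reward_from_rating(rating, threshold=3):
    """Ratings above the threshold count as positive feedback, the rest as negative."""
    return 1 if rating > threshold else -1

def avg_rating_by_positive_streak(ratings):
    """Group each rating by the number of consecutive positive ratings right before it."""
    buckets = defaultdict(list)
    streak = 0
    for rating in ratings:
        buckets[streak].append(rating)
        # A positive rating extends the streak; anything else resets it.
        streak = streak + 1 if reward_from_rating(rating) > 0 else 0
    return {k: sum(v) / len(v) for k, v in sorted(buckets.items())}

averages = avg_rating_by_positive_streak([5, 4, 2, 5, 5, 5])
```

On a real dataset the same grouping would be pooled over all users; the toy sequence above only shows the bookkeeping and says nothing about the trend reported in the paper.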

Results:

[Figure: overall results; image not preserved]

Time Comparison

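[Figure: time comparison; image not preserved]
Although DDPG-KNN (k=1) has low time complexity, its recommendation performance is very poor.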

Influence of Clustering Approach & Tree Depth

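[Figure: influence of clustering approach and tree depth; image not preserved]
The results show that PCA-based clustering with rating-based item representations and a tree depth of 2 performs best.
Analysis of the reasons:

The rating-based representation keeps all the interaction information between users and items, whereas the VAE and MF representations are low-dimensional and retain less information after dimensionality reduction.
K-means is sensitive to the choice of initial points and of the distance metric, which makes it unstable.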
