SlateQ: A Scalable Algorithm for Slate Recommendation Problems

I was recently introduced to the wonderful world of Reinforcement Learning (RL) and wanted to explore its applications in recommender systems. Craig Boutilier’s talk on the challenges of RL in recommender systems piqued my interest. In this article, I explore Google’s SlateQ algorithm and use it to train an agent in a simulated environment built with Google's RecSim [Github] library.

Key challenges in slate recommender problems

Slate recommender problems are very common in online services like YouTube, news, and e-commerce, where a user chooses from a ranked list of recommended items. These recommended items are selected from an often large collection of items and ranked based on user behaviour and interests. The objective is to maximise user engagement by offering personalised content that best aligns with a user’s interests and likes.

A recommender agent, in this case, would select an ordered slate of size k from a corpus of items I such that user engagement is maximised. This problem presents three key challenges:

  1. For |I| = 20 and k = 10, an agent has \binom{20}{10} × 10! ≈ 670 billion possible slates to choose from (i.e. the agent’s action space). Since this action space grows factorially in k, it is referred to as the problem of combinatorial action space. A quick sanity check of this count is sketched after this list.

  2. An agent that maximises user engagement in the long run (e.g. total session time, long-term value, etc.) is desirable. Myopic agents that optimise only for the short term can end up hurting a user’s long-term engagement.

  3. We require an efficient and scalable algorithm that enables real-time recommendation for a large number of users.

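To make the size of this action space concrete, the count from point 1 can be reproduced with a few lines of Python (standard library only; the numbers match the |I| = 20, k = 10 example above):

```python
from math import comb, factorial, perm

num_items, slate_size = 20, 10  # |I| and k from the example above

# Number of ordered slates: choose k items out of |I|, then order them.
ordered_slates = comb(num_items, slate_size) * factorial(slate_size)
assert ordered_slates == perm(num_items, slate_size)  # equivalently |I|! / (|I| - k)!

print(f"{ordered_slates:,}")  # 670,442,572,800 -> roughly 670 billion possible slates
```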

Algorithms like collaborative filtering and matrix factorisation are widely used in practical recommender systems. However, they fail to address one or more of the above challenges. SlateQ uses the power of Reinforcement Learning to overcome all three.

MDPs to the rescue

[Figure: MDP formulation in SlateQ]

SlateQ models the slate recommendation problem as a Markov Decision Process (MDP) where:

  1. The state s represents the state of the user. The user state captures information such as the user’s history, demographics, and context (e.g. time of day).

  2. The action space A is simply the set of all possible recommendation slates.

  3. The transition probability P(s’|s, A) reflects the probability that the user transitions to state s’ when slate A is presented to the user in state s.

  4. The reward R(s, A) measures the expected degree of user engagement with items on the slate A.

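To keep the formulation concrete, here is a minimal, purely illustrative sketch of this slate MDP in Python. All names here (UserState, SlateMDP, and their fields) are hypothetical; they are not RecSim or SlateQ classes:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence

@dataclass
class UserState:
    """State s: everything the system knows or infers about the user."""
    history: List[int] = field(default_factory=list)        # previously consumed item ids
    interests: List[float] = field(default_factory=list)    # latent/observed interest vector
    context: Dict[str, str] = field(default_factory=dict)   # e.g. {"time_of_day": "evening"}

Slate = Sequence[int]  # action A: an ordered list of k item ids

class SlateMDP:
    """Interface mirroring the (S, A, P, R) components listed above."""

    def sample_transition(self, s: UserState, A: Slate) -> UserState:
        """Sample s' ~ P(.|s, A): how the user state evolves after seeing slate A."""
        raise NotImplementedError

    def expected_reward(self, s: UserState, A: Slate) -> float:
        """R(s, A): expected user engagement with the items on slate A."""
        raise NotImplementedError
```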

Assumptions that simplify the problem

  1. For a given slate of items, SlateQ assumes that a user can select at most one item i at a time. This is a reasonable assumption in settings like YouTube. Note that this assumption also allows the user to walk away without choosing any item.

  2. Returning to the slate for a second item is modelled and logged as a separate event. The user’s state is updated with each consumed item, which in turn is used to recommend a new slate to the returning user.

  3. State transitions and rewards depend only on the selected item, i.e. they are independent of the other items in the slate recommended to the user. This is a reasonable assumption since non-consumed items have significantly less impact on the user’s behaviour than consumed items.

  4. The user choice model is known, i.e. P(i|s, A) is known. Conditional choice models like the multinomial logit can be used to model user choice, as sketched after this list.

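As promised above, here is a minimal sketch of a conditional (multinomial logit) choice model producing P(i|s, A), with an optional "no click" outcome. The scoring function v(s, i) is left abstract and is an assumption of this sketch, not the scorer RecSim or the SlateQ paper actually uses:

```python
import math
from typing import Callable, Dict, Sequence, Union

def choice_probabilities(
    user_state: object,                      # state s (opaque here)
    slate: Sequence[int],                    # slate A: item ids shown to the user
    score: Callable[[object, int], float],   # v(s, i): the user's affinity for item i
    no_click_score: float = 0.0,             # score of selecting nothing
) -> Dict[Union[int, str], float]:
    """Multinomial logit: P(i|s, A) = exp(v(s, i)) / (exp(v_0) + sum_j exp(v(s, j)))."""
    exp_scores = {i: math.exp(score(user_state, i)) for i in slate}
    denominator = math.exp(no_click_score) + sum(exp_scores.values())
    probs: Dict[Union[int, str], float] = {i: v / denominator for i, v in exp_scores.items()}
    probs["no_click"] = math.exp(no_click_score) / denominator
    return probs
```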

These assumptions enable us to break down the rewards and transition probabilities in terms of their component items.

[Figure: #1 Single Choice Assumption]
[Figure: #3 Reward/transition dependence on selection (RTDS) assumption]
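Written out (this is a reconstruction following the SlateQ paper's formulation; ⊥ denotes the "no click" option), the single-choice and RTDS assumptions let the slate-level reward and transition probability decompose over the items on the slate:

```latex
R(s, A) = \sum_{i \in A \cup \{\perp\}} P(i \mid s, A)\, R(s, i),
\qquad
P(s' \mid s, A) = \sum_{i \in A \cup \{\perp\}} P(i \mid s, A)\, P(s' \mid s, i)
```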

Q-learning breakdown

The Q in SlateQ refers to Q-learning, an RL algorithm that finds the optimal slate by assigning values (called Q-values) to all state-action pairs. These Q-values represent the Long Term Value (LTV) of a user consuming an item i from slate A.

SlateQ breaks down the LTV of a slate of items as the expected sum of the LTVs of its component items, i.e. Q_bar(s, i). The probabilities P(i|s, A) are generated using the user choice models discussed earlier.

[Figure: Item-wise breakdown of Q-values for a given slate A]
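In symbols, this decomposition (again reconstructed from the SlateQ paper; the figure above may use slightly different notation) reads:

```latex
Q(s, A) = \sum_{i \in A} P(i \mid s, A)\, \bar{Q}(s, i)
```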

Temporal Difference (TD) learning is then used to learn the item-wise LTVs Q_bar(s, i). Given a consumed item i at state s with observed reward r, the agent transitions to new state s’ and recommends a new slate A’ such that the user’s LTV is maximised. The equation below lays out the update to the Q-values. The discount factor gamma controls the weight placed on future rewards, while the learning rate alpha controls the step size of each update.

[Figure: TD update for item-wise Q-values]
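A reconstruction of that update, consistent with the description above (r is the reward observed for consuming item i in state s, and A' is the slate recommended at the next state s' to maximise the user's LTV):

```latex
\bar{Q}(s, i) \leftarrow \bar{Q}(s, i)
+ \alpha \Big[\, r + \gamma \max_{A'} \sum_{j \in A'} P(j \mid s', A')\, \bar{Q}(s', j)
- \bar{Q}(s, i) \Big]
```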

SlateQ uses a Deep Q Network (DQN) with experience replay to learn these Q-values.

Slate Optimisation

After learning the item-wise LTVs, the algorithm selects a slate A of k items from corpus I such that the total expected LTV of the user is maximised. More formally, this can be written as:

[Figure: LTV Optimisation Problem]
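In the notation used so far, the slate optimisation problem can be written as:

```latex
\max_{A \subseteq I,\ |A| = k} \;\; \sum_{i \in A} P(i \mid s, A)\, \bar{Q}(s, i)
```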

SlateQ uses mixed-integer programming methods to find the exact solution of the above optimisation problem. Exact optimisation takes time polynomial in the number of items in corpus I and can be used for offline training. When serving real-time recommendations, less computationally expensive approximation methods are used to find a (near-)optimal slate.

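As a rough illustration of the cheaper serving-time approximations, below is a sketch of a top-k style heuristic that ranks items by a choice-score-weighted LTV and takes the best k. This is a simplification of the heuristics discussed in the SlateQ paper, not RecSim's actual implementation; `score` and `q_bar` are assumed to come from the choice model and the trained Q-network respectively:

```python
from typing import Callable, List, Sequence

def top_k_slate(
    user_state: object,                       # current user state s
    candidates: Sequence[int],                # corpus I of candidate item ids
    k: int,                                   # slate size
    score: Callable[[object, int], float],    # unnormalised choice score v(s, i)
    q_bar: Callable[[object, int], float],    # learned item-wise LTV Q_bar(s, i)
) -> List[int]:
    """Rank items by v(s, i) * Q_bar(s, i) and return the top k as the slate."""
    ranked = sorted(
        candidates,
        key=lambda i: score(user_state, i) * q_bar(user_state, i),
        reverse=True,
    )
    return list(ranked[:k])
```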

Now that we have a good understanding of the SlateQ algorithm, let’s see how well it works by running a small experiment. The RecSim library provides an implementation of the algorithm along with various simulated environments to train agents and compare algorithmic performance.

RecSim’s Interest Evolution Environment

The Interest Evolution environment is a collection of models that simulate sequential user interaction in a slate recommender setting. It has three key components:

  1. The document model samples items from a prior over document features, which incorporates latent features such as document quality, and observable features such as document topic.

  2. The user model samples users from a prior distribution over user features. These features include latent features such as interests and behavioural features such as time budget.

  3. For a recommended slate, the user choice model records the user’s response by selecting at most one item. The user’s choice of item depends on observable document features (e.g. topic, perceived appeal) and all user features (e.g. interests).

Each user starts with a fixed time budget. Each consumed item reduces the time budget by a fixed amount. In addition, the time budget is partially replenished depending on how appealing the consumed item is to the user. Such a configuration places importance on the long-term behaviour of the user and hence enables us to train non-myopic agents.

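A purely illustrative sketch of this time-budget dynamic is shown below; the constants and the appeal bonus are invented for clarity and do not correspond to RecSim's internal parameters:

```python
def update_time_budget(
    budget: float,
    consumption_cost: float = 1.0,   # fixed budget spent per consumed item (assumed)
    appeal: float = 0.0,             # in [0, 1]: how appealing the consumed item was
    max_bonus: float = 0.5,          # maximum budget refunded by a highly appealing item
) -> float:
    """Consuming an item costs a fixed amount of budget; appealing items refund part of it."""
    return max(0.0, budget - consumption_cost + appeal * max_bonus)

# The session ends once the budget hits zero, so consistently appealing recommendations
# extend the session -- exactly the long-term behaviour a non-myopic agent targets.
budget = 10.0
for appeal in (0.9, 0.1, 0.8):
    budget = update_time_budget(budget, appeal=appeal)
print(round(budget, 2))  # 7.9 with the assumed constants
```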

[Figure: Single User Control flow in RecSim]

Check out this notebook for more details on environment and generated data.

Experiment Setup and Results

In our experiment, the agent chooses 2 documents to recommend from a corpus of 10 candidate documents. The candidate documents are selected by the document model from 20 different topics. The size of the slate is intentionally kept small for faster training and evaluation.

RecSim comes with a host of algorithms to train an agent. The slate_decomp_q_agent function implements the SlateQ algorithm. I am using the full_slate_q_agent function as the baseline agent, which is a DQN without SlateQ’s decomposition. Check out this notebook for more details.

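For reference, here is a condensed training sketch adapted from RecSim's overview tutorial, shown for the full_slate_q_agent baseline; the SlateQ agent from recsim.agents.slate_decomp_q_agent is wired up analogously (see the linked notebook for its exact constructor arguments). The base directory and iteration counts are illustrative:

```python
from recsim.agents import full_slate_q_agent
from recsim.environments import interest_evolution
from recsim.simulator import runner_lib

def create_baseline_agent(sess, environment, eval_mode, summary_writer=None):
    # Full-slate DQN baseline: treats every whole slate as a single action.
    return full_slate_q_agent.FullSlateQAgent(
        sess,
        observation_space=environment.observation_space,
        action_space=environment.action_space,
        summary_writer=summary_writer,
        eval_mode=eval_mode,
    )

env_config = {
    "num_candidates": 10,       # corpus of 10 candidate documents per step
    "slate_size": 2,            # recommend slates of 2 documents
    "resample_documents": True,
    "seed": 0,
}

runner = runner_lib.TrainRunner(
    base_dir="/tmp/recsim_full_slate_q",   # illustrative output directory
    create_agent_fn=create_baseline_agent,
    env=interest_evolution.create_environment(env_config),
    episode_log_file="",
    max_training_steps=50,
    num_iterations=10,
)
runner.run_experiment()
```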

[Figure: Click-Through Rate (CTR) at different time steps]

We see that the click-through rate (CTR) is 0.5 for the SlateQ agent vs 0.48 for the full-slate Q-learning agent. A similar trend is seen when comparing the average reward (i.e. clicks) per episode. This shows that the SlateQ agent outperforms the baseline agent in our setup.

[Figure: Average Rewards per Episode at different time steps]

Note: Please use the latest version of the RecSim library. I encountered this issue, which has recently been fixed by the maintainers. Also, you can use this requirements file to create an environment with the required dependencies.

Takeaways

SlateQ breaks down the LTV of a slate into its component-wise LTVs and maximises long term user engagement. Such a decomposition uses items as action inputs rather than slates, making it more generalisable and data-efficient. SlateQ also enables real-time recommendations by using approximation methods to solve the slate optimisation problem.

Moreover, SlateQ can be implemented on top of an existing myopic recommender system already in place. YouTube ran a live experiment comparing the performance of SlateQ against its highly optimised production recommender system (i.e. the control). The chart below shows that SlateQ consistently and significantly outperformed the existing system by improving user engagement.

[Figure: Increase in user engagement for SlateQ over the baseline algorithm]

Next Steps

  1. Explore slate optimisation methods in more detail.

  2. Explore the implementation details of the DQN in the SlateQ algorithm.

Code and other relevant materials can be found in this Github repo. A recording of our class presentation can be found here.

Special shoutout to my USF MSDS classmate Collin Prather for collaborating with me on this project. I also thank Prof. Brian Spiering for his awesome introductory course on RL at USF.

Source: https://medium.com/analytics-vidhya/slateq-a-scalable-algorithm-for-slate-recommendation-problems-735a1c24458c
