SlateQ: A Scalable Algorithm for Slate Recommendation Problems

I was recently introduced to the wonderful world of Reinforcement Learning (RL) and wanted to explore its applications in recommender systems. Craig Boutilier’s talk on the challenges of RL in recommender systems piqued my interest. In this article, I explore Google’s SlateQ algorithm and use it to train an agent in a simulated environment built with Google's RecSim [Github] library.

Key challenges in slate recommender problems

Slate recommender problems are very common in online services like YouTube, news, and e-commerce, where a user chooses from a ranked list of recommended items. These recommended items are selected from an often large collection of items and ranked based on user behaviour and interests. The objective is to maximise user engagement by offering personalised content that best aligns with a user’s interests and likes.

A recommender agent, in this case, would select an ordered slate of size k from a corpus of items I such that user engagement is maximised. This problem presents three key challenges:

  1. For |I| = 20 and k = 10, an agent has \binom{20}{10} × 10! ≈ 670 billion possible slates to choose from (i.e. the agent’s action space). Since this action space grows factorially in k, it is referred to as the problem of combinatorial action space. A quick sanity check of this count is sketched after this list.

  2. An agent that maximises user engagement in the long run (e.g. total session time, long-term value, etc.) is desirable. Myopic agents that optimise only for the short term can end up hurting a user’s long-term engagement.

  3. We require an efficient and scalable algorithm that enables real-time recommendation for a large number of users.

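To make the size of this action space concrete, the count from point 1 can be reproduced with a few lines of Python (standard library only; the numbers match the |I| = 20, k = 10 example above):

```python
from math import comb, factorial, perm

num_items, slate_size = 20, 10  # |I| and k from the example above

# Number of ordered slates: choose k items out of |I|, then order them.
ordered_slates = comb(num_items, slate_size) * factorial(slate_size)
assert ordered_slates == perm(num_items, slate_size)  # equivalently |I|! / (|I| - k)!

print(f"{ordered_slates:,}")  # 670,442,572,800 -> roughly 670 billion possible slates
```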

Algorithms like collaborative filtering and matrix factorisation are widely used in practical recommender systems. However, they fail to address one or more of the above challenges. SlateQ uses the power of Reinforcement Learning to overcome all three.

MDPs to the rescue

[Figure: MDP formulation in SlateQ]

SlateQ models the slate recommendation problem as a Markov Decision Process (MDP) where:

  1. The state s represents the state of the user. The user state captures information such as the user’s history, demographics, and context (e.g. time of day).

  2. The action space A is simply the set of all possible recommendation slates.

  3. The transition probability P(s’|s, A) reflects the probability that the user transitions to state s’ when slate A is presented to the user in state s.

  4. The reward R(s, A) measures the expected degree of user engagement with items on the slate A.

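To keep the formulation concrete, here is a minimal, purely illustrative sketch of this slate MDP in Python. All names here (UserState, SlateMDP, and their fields) are hypothetical; they are not RecSim or SlateQ classes:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence

@dataclass
class UserState:
    """State s: everything the system knows or infers about the user."""
    history: List[int] = field(default_factory=list)        # previously consumed item ids
    interests: List[float] = field(default_factory=list)    # latent/observed interest vector
    context: Dict[str, str] = field(default_factory=dict)   # e.g. {"time_of_day": "evening"}

Slate = Sequence[int]  # action A: an ordered list of k item ids

class SlateMDP:
    """Interface mirroring the (S, A, P, R) components listed above."""

    def sample_transition(self, s: UserState, A: Slate) -> UserState:
        """Sample s' ~ P(.|s, A): how the user state evolves after seeing slate A."""
        raise NotImplementedError

    def expected_reward(self, s: UserState, A: Slate) -> float:
        """R(s, A): expected user engagement with the items on slate A."""
        raise NotImplementedError
```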

Assumptions that simplify the problem

  1. For a given slate of items, SlateQ assumes that a user can select at most one item i at a time. This is a reasonable assumption in settings like YouTube. Note that this assumption also allows the user to walk away without choosing any item.

  2. Returning to the slate for a second item is modelled and logged as a separate event. The user’s state is updated with each consumed item, which in turn is used to recommend a new slate to the returning user.

  3. State transitions and rewards depend only on the selected item, i.e. they are independent of the other items in the slate recommended to the user. This is a reasonable assumption since non-consumed items have significantly less impact on the user’s behaviour than consumed items.

  4. The user choice model is known, i.e. P(i|s, A) is known. Conditional choice models like the multinomial logit can be used to model user choice, as sketched after this list.

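As promised above, here is a minimal sketch of a conditional (multinomial logit) choice model producing P(i|s, A), with an optional "no click" outcome. The scoring function v(s, i) is left abstract and is an assumption of this sketch, not the scorer RecSim or the SlateQ paper actually uses:

```python
import math
from typing import Callable, Dict, Sequence, Union

def choice_probabilities(
    user_state: object,                      # state s (opaque here)
    slate: Sequence[int],                    # slate A: item ids shown to the user
    score: Callable[[object, int], float],   # v(s, i): the user's affinity for item i
    no_click_score: float = 0.0,             # score of selecting nothing
) -> Dict[Union[int, str], float]:
    """Multinomial logit: P(i|s, A) = exp(v(s, i)) / (exp(v_0) + sum_j exp(v(s, j)))."""
    exp_scores = {i: math.exp(score(user_state, i)) for i in slate}
    denominator = math.exp(no_click_score) + sum(exp_scores.values())
    probs: Dict[Union[int, str], float] = {i: v / denominator for i, v in exp_scores.items()}
    probs["no_click"] = math.exp(no_click_score) / denominator
    return probs
```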

These assumptions enable us to break down the rewards and transition probabilities in terms of their component items.

[Figure: #1 Single Choice Assumption]
[Figure: #3 Reward/transition dependence on selection (RTDS) assumption]
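Written out (this is a reconstruction following the SlateQ paper's formulation; ⊥ denotes the "no click" option), the single-choice and RTDS assumptions let the slate-level reward and transition probability decompose over the items on the slate:

```latex
R(s, A) = \sum_{i \in A \cup \{\perp\}} P(i \mid s, A)\, R(s, i),
\qquad
P(s' \mid s, A) = \sum_{i \in A \cup \{\perp\}} P(i \mid s, A)\, P(s' \mid s, i)
```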

Q-learning breakdown

The Q in SlateQ refers to Q-learning, an RL algorithm that finds the optimal slate by assigning values (called Q-values) to all state-action pairs. These Q-values represent the Long Term Value (LTV) of a user consuming an item i from slate A.

SlateQ breaks down the LTV of a slate of items as the expected sum of the LTVs of its component items, i.e. Q_bar(s, i). The probabilities P(i|s, A) are generated using the user choice models discussed earlier.

[Figure: Item-wise breakdown of Q-values for a given slate A]
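In symbols, this decomposition (again reconstructed from the SlateQ paper; the figure above may use slightly different notation) reads:

```latex
Q(s, A) = \sum_{i \in A} P(i \mid s, A)\, \bar{Q}(s, i)
```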

Temporal Difference (TD) learning is then used to learn the item-wise LTVs Q_bar(s, i). Given a consumed item i at state s with observed reward r, the agent transitions to new state s’ and recommends a new slate A’ such that the user’s LTV is maximised. The equation below lays out the update to the Q-values. The discount factor gamma controls the weight placed on future rewards, while the learning rate alpha controls the step size of each update.

[Figure: TD update for item-wise Q-values]
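A reconstruction of that update, consistent with the description above (r is the reward observed for consuming item i in state s, and A' is the slate recommended at the next state s' to maximise the user's LTV):

```latex
\bar{Q}(s, i) \leftarrow \bar{Q}(s, i)
+ \alpha \Big[\, r + \gamma \max_{A'} \sum_{j \in A'} P(j \mid s', A')\, \bar{Q}(s', j)
- \bar{Q}(s, i) \Big]
```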

SlateQ uses a Deep Q Network (DQN) with experience replay to learn these Q-values.

Slate Optimisation

After learning the item-wise LTVs, the algorithm selects a slate A of k items from corpus I such that the total expected LTV of the user is maximised. More formally, this can be written as:

[Figure: LTV Optimisation Problem]
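In the notation used so far, the slate optimisation problem can be written as:

```latex
\max_{A \subseteq I,\ |A| = k} \;\; \sum_{i \in A} P(i \mid s, A)\, \bar{Q}(s, i)
```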

SlateQ uses mixed-integer programming methods to find the exact solution of the above optimisation problem. Exact optimisation takes time polynomial in the number of items in corpus I and can be used for offline training. When serving real-time recommendations, less computationally expensive approximation methods are used to find a (near-)optimal slate.

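As a rough illustration of the cheaper serving-time approximations, below is a sketch of a top-k style heuristic that ranks items by a choice-score-weighted LTV and takes the best k. This is a simplification of the heuristics discussed in the SlateQ paper, not RecSim's actual implementation; `score` and `q_bar` are assumed to come from the choice model and the trained Q-network respectively:

```python
from typing import Callable, List, Sequence

def top_k_slate(
    user_state: object,                       # current user state s
    candidates: Sequence[int],                # corpus I of candidate item ids
    k: int,                                   # slate size
    score: Callable[[object, int], float],    # unnormalised choice score v(s, i)
    q_bar: Callable[[object, int], float],    # learned item-wise LTV Q_bar(s, i)
) -> List[int]:
    """Rank items by v(s, i) * Q_bar(s, i) and return the top k as the slate."""
    ranked = sorted(
        candidates,
        key=lambda i: score(user_state, i) * q_bar(user_state, i),
        reverse=True,
    )
    return list(ranked[:k])
```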

Now that we have a good understanding of the SlateQ algorithm, let’s see how well it works by running a small experiment. The RecSim library provides an implementation of the algorithm along with various simulated environments to train agents and compare algorithmic performance.

RecSim’s Interest Evolution Environment

The Interest Evolution environment is a collection of models that simulate sequential user interaction in a slate recommender setting. It has three key components:

  1. The document model samples items from a prior over document features, which incorporates latent features such as document quality, and observable features such as document topic.

  2. The user model samples users from a prior distribution over user features. These features include latent features such as interests and behavioural features such as time budget.

  3. For a recommended slate, the user choice model records the user’s response by selecting at most one item. The user’s choice of item depends on observable document features (e.g. topic, perceived appeal) and all user features (e.g. interests).

Each user starts with a fixed time budget. Each consumed item reduces the time budget by a fixed amount. In addition, the time budget is partially replenished depending on how appealing the consumed item is to the user. Such a configuration places importance on the long-term behaviour of the user and hence enables us to train non-myopic agents.

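A purely illustrative sketch of this time-budget dynamic is shown below; the constants and the appeal bonus are invented for clarity and do not correspond to RecSim's internal parameters:

```python
def update_time_budget(
    budget: float,
    consumption_cost: float = 1.0,   # fixed budget spent per consumed item (assumed)
    appeal: float = 0.0,             # in [0, 1]: how appealing the consumed item was
    max_bonus: float = 0.5,          # maximum budget refunded by a highly appealing item
) -> float:
    """Consuming an item costs a fixed amount of budget; appealing items refund part of it."""
    return max(0.0, budget - consumption_cost + appeal * max_bonus)

# The session ends once the budget hits zero, so consistently appealing recommendations
# extend the session -- exactly the long-term behaviour a non-myopic agent targets.
budget = 10.0
for appeal in (0.9, 0.1, 0.8):
    budget = update_time_budget(budget, appeal=appeal)
print(round(budget, 2))  # 7.9 with the assumed constants
```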

[Figure: Single User Control flow in RecSim]

Check out this notebook for more details on environment and generated data.

Experiment Setup and Results

In our experiment, the agent chooses 2 documents to recommend from a corpus of 10 candidate documents. The candidate documents are selected by the document model from 20 different topics. The size of the slate is intentionally kept small for faster training and evaluation.

RecSim comes with a host of algorithms to train an agent. The slate_decomp_q_agent function implements the SlateQ algorithm. I am using the full_slate_q_agent function as the baseline agent, which is a DQN without SlateQ’s decomposition. Check out this notebook for more details.

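For reference, here is a condensed training sketch adapted from RecSim's overview tutorial, shown for the full_slate_q_agent baseline; the SlateQ agent from recsim.agents.slate_decomp_q_agent is wired up analogously (see the linked notebook for its exact constructor arguments). The base directory and iteration counts are illustrative:

```python
from recsim.agents import full_slate_q_agent
from recsim.environments import interest_evolution
from recsim.simulator import runner_lib

def create_baseline_agent(sess, environment, eval_mode, summary_writer=None):
    # Full-slate DQN baseline: treats every whole slate as a single action.
    return full_slate_q_agent.FullSlateQAgent(
        sess,
        observation_space=environment.observation_space,
        action_space=environment.action_space,
        summary_writer=summary_writer,
        eval_mode=eval_mode,
    )

env_config = {
    "num_candidates": 10,       # corpus of 10 candidate documents per step
    "slate_size": 2,            # recommend slates of 2 documents
    "resample_documents": True,
    "seed": 0,
}

runner = runner_lib.TrainRunner(
    base_dir="/tmp/recsim_full_slate_q",   # illustrative output directory
    create_agent_fn=create_baseline_agent,
    env=interest_evolution.create_environment(env_config),
    episode_log_file="",
    max_training_steps=50,
    num_iterations=10,
)
runner.run_experiment()
```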

[Figure: Click-Through Rate (CTR) at different time steps]

We see that the click-through rate (CTR) is 0.5 for the SlateQ agent vs 0.48 for the full-slate Q-learning agent. A similar trend is seen when comparing the average reward (i.e. clicks) per episode. This shows that the SlateQ agent outperforms the baseline agent in our setup.

[Figure: Average Rewards per Episode at different time steps]

Note: Please use the latest version of the RecSim library. I encountered this issue, which has recently been fixed by the maintainers. Also, you can use this requirements file to create an environment with the required dependencies.

Takeaways

SlateQ breaks down the LTV of a slate into its component-wise LTVs and maximises long term user engagement. Such a decomposition uses items as action inputs rather than slates, making it more generalisable and data-efficient. SlateQ also enables real-time recommendations by using approximation methods to solve the slate optimisation problem.

Moreover, SlateQ can be implemented on top of an existing myopic recommender system already in place. YouTube ran a live experiment comparing the performance of SlateQ against its highly optimised production recommender system (i.e. the control). The chart below shows that SlateQ consistently and significantly outperformed the existing system by improving user engagement.

[Figure: Increase in user engagement for SlateQ over the baseline algorithm]

Next Steps

  1. Explore slate optimisation methods in more detail.

  2. Explore the implementation details of the DQN in the SlateQ algorithm.

Code and other relevant materials can be found in this Github repo. A recording of our class presentation can be found here.

Special shoutout to my USF MSDS classmate Collin Prather for collaborating with me on this project. I also thank Prof. Brian Spiering for his awesome introductory course on RL at USF.

Source: https://medium.com/analytics-vidhya/slateq-a-scalable-algorithm-for-slate-recommendation-problems-735a1c24458c
