Multi-task learning and calibration for utility-based home feed ranking

Ekrem Kocaguneli | Software Engineer, Homefeed Ranking; Dhruvil Deven Badani | Software Engineer, Homefeed Ranking; Sangmin Shin | Engineering Manager, Homefeed Ranking

Home feed is one of the most important surfaces at Pinterest, driving a significant portion of engagement from the 400+ million people who visit each month. From a business standpoint, home feed is also a revenue driver, since most ads are shown to Pinners there. Therefore, the way we surface personalized, engaging and inspiring recommendations in home feed is critical.

In this post we will cover how we switched from a single output node deep neural net (DNN) to a multi-task learning (MTL) based DNN. We will also cover how we calibrated each output node’s probability prediction to be combined into a utility value as well as the benefits of this new architecture.

Background and motivation

We used to score each user-Pin pair with a single output from a logistic-loss DNN model (Figure 1) and rank Pins using these scores. This is a common setup for ranking models: an action such as click is chosen as the optimization target, and the ranker learns the click probability from historical engagement data, which is then used for ranking.

However, at Pinterest, engagement is multi-objective: there are long-clicks, close-ups and repins. The single-output DNN incorporated these objectives into the ranking score by augmenting the logistic loss in the cost function with action weights. In Equation 1, y(i) ∈ {0,1} is the actual label and ŷ(i) ∈ [0,1] is the predicted score of the i-th instance out of m instances. The output of such a model is not a probability, but rather a ranking score that captures which Pin will be more engaging (a combination of click, long-click, close-up and repin). We call this combined ranking score the pinnability score.

[Image: Equations 1 and 2 — the action-weighted logistic loss and the pinnability score]
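As a concrete illustration of the action-weighted logistic loss described above, here is a minimal sketch (the weight values and function shape are our own illustration, not the production implementation):

```python
import math

def weighted_logistic_loss(y_true, y_pred, weights):
    """Logistic loss (Equation 1) where instance i carries an action weight w_i
    encoding the relative business value of the action behind it."""
    total = 0.0
    for y, p, w in zip(y_true, y_pred, weights):
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# A repin (weight 2.0, hypothetical) contributes more loss than a click (1.0).
loss = weighted_logistic_loss([1, 0], [0.9, 0.2], [2.0, 1.0])
```

Because the weights are folded directly into the loss, the resulting score ranks well but is not a probability, which is the shortcoming the post describes next.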

The pinnability score is effective in ranking user-Pin pairs, but it has a few shortcomings: the business value of different actions is baked into the training data and the weighted loss. Hence, the pinnability score is simply a floating-point number used to rank; it is not a probability and is not comparable from one model to another. This makes debugging and interpreting the model quite challenging.

So, we switched to a more flexible and interpretable ranking method based on the utility of a Pin to a Pinner. At a high level, the utility of a Pin is a combination of the probability values of different actions as in Equation 3. The actual utility function used in our ranking is more involved than Equation 3, the details of which are beyond the scope of this post.

[Image: Equation 3 — utility as a weighted combination of action probabilities]

To get action-specific probability values P(action), we employed MTL, where each output node predicts an action-specific probability. However, once we decide on the action weights W(action), we need the probability values to be stable and accurate across different models. To enable this, we used calibration models.

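Equation 3's utility combination can be sketched in a few lines (the action set and the weight values here are hypothetical, not Pinterest's production values):

```python
# Hypothetical action weights W(action); production values differ.
ACTION_WEIGHTS = {"click": 1.0, "long_click": 2.0, "close_up": 0.5, "repin": 3.0}

def utility(probs):
    """Combine calibrated per-action probabilities P(action) into one utility score."""
    return sum(ACTION_WEIGHTS[a] * p for a, p in probs.items())

score = utility({"click": 0.05, "long_click": 0.02, "close_up": 0.1, "repin": 0.03})
```

Because the weights live outside the model, shifting surface characteristics becomes a configuration change rather than a retraining cycle.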
Utility-based ranking via MTL and calibration enabled us to do several things better:

  • Quickly change home feed characteristics: As in Equation 3, we weigh each action differently. At any time, we can decide to shift surface characteristics based on business needs (e.g., to promote more videos or to promote more repinnable/clickable content). These types of shifts previously required a cycle of training data augmentation, model parameter updates and multiple A/B experiments spanning multiple weeks. Now, we can simply adjust the utility weights of treatment groups and observe the effects within a few hours.

  • Ability to compare different Pin types: It used to be difficult to compare different Pin types such as organic and video. For example, we mainly look at view time on video Pins to measure engagement. The current setup enabled us to have different utility functions for different Pin types.

  • Improved model interpretability: Since we can monitor the calibration per each action type, we can better interpret changes, e.g., if we see a candidate generator get increased distribution, we can check its calibration and determine if the increase is justified or not.

Multi-task learning model

MTL enables us to have multiple output nodes with representation sharing [1]. Each output node (head) optimizes for a certain action type, such as repin or click.

Each head predicts a binary label: the action happened or not. The predicted value from each head is designed to be a probability score. Our current MTL model is shown in Figure 2 with additional calibration models.

Figure 2. MTL-based ranking model with multiple output nodes for separate action types and corresponding calibration models.

We also need to redefine the cost function. Unlike the previous pinnability model, where we had a single binary label, in this model the label is a vector of n actions (n = 4 in the first version). Hence, the loss for each round of predictions is summed over all four actions as in Equation 4, where y(i) ∈ ℝ⁴ and ℒ is the logistic loss given in Equation 1. Thanks to the parameter sharing that MTL provides among different objectives, switching to MTL alone, without utility-based ranking, provided engagement metric improvements.

[Image: Equation 4 — loss summed over the four action heads]
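The summed multi-head loss of Equation 4 can be sketched as follows (a pure-Python illustration, assuming per-head logistic loss as in Equation 1):

```python
import math

def logistic_loss(y, p):
    """Per-label logistic loss, as in Equation 1 (unweighted)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def mtl_loss(labels, preds):
    """Equation 4 sketch: labels[i] and preds[i] are length-4 vectors, one entry
    per action head; the loss is summed over heads and averaged over m instances."""
    m = len(labels)
    return sum(logistic_loss(y, p)
               for ys, ps in zip(labels, preds)
               for y, p in zip(ys, ps)) / m
```

In a real framework the four heads would share hidden layers, so gradients from every action flow into the same representation.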

The DNN in Figure 2 is only the last part (called fully-connected layer) of a larger AutoML model we use for ranking. The larger model is composed of 4 components, where we learn feature representations from raw features. While we will not go into the first 3 components in this blog post, it suffices to say that they are responsible for learning representations and crosses among features without requiring engineers to worry about feature engineering.

Calibration of output node predictions via a calibration model

Calibration is a post-processing technique used to improve a learner's probability estimates; it tells us whether we are over- or under-predicting. A number of techniques can be used, such as Platt scaling, isotonic regression [6] or downsampling correction [3]. For binary classification, calibration measures the relationship between observed behavior (e.g., empirical click-through rate) and predicted behavior (e.g., predicted CTR), as given in Equation 5. We need a calibration model on each DNN output node to make sure the predicted probability aligns well with the empirical rates.

[Image: Equation 5 — calibration, relating predicted to observed rates]

Initially, we tried to simply incorporate the positive downsampling rate 𝛼 and the negative downsampling rate 𝛽 as in Equation 6, where p is the probability estimate from the DNN and q is the adjusted probability.

[Image: Equation 6 — downsampling-rate correction]
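Equation 6's correction can be sketched as below. We assume the standard odds-scaling form for positives kept at rate 𝛼 and negatives at rate 𝛽; the exact expression in the original equation may differ:

```python
def downsampling_correction(p, alpha, beta):
    """Map a probability p learned on downsampled data back toward the full
    distribution, assuming positives were kept at rate alpha and negatives at
    rate beta (odds scaling: sampled odds = true odds * alpha / beta)."""
    return beta * p / (beta * p + alpha * (1 - p))
```

With equal sampling rates the probability is unchanged; when negatives are downsampled more aggressively than positives (beta < alpha), the raw score over-predicts and is corrected downward. As described next, this simple correction turned out to be insufficient under stratified sampling.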

This method did not work well because our training data generation pipeline not only downsamples positives and negatives, but also enforces stratified sampling around geography, user-state and positive/negative distribution (which helps us rank better, but makes calibration harder).

We realized that the calibration had to act as a transfer-learning layer that maps ranking-optimized probabilities to empirical rates. For this, we opted for a logistic regression (LR) model, which can be viewed as a highly extended Platt scaling technique. Instead of just a probability-score weight and a bias term (b) to learn the calibrated probability (p*), as in Equations 7 and 8, we used an LR model with 80+ features.

[Image: Equations 7 and 8 — Platt scaling]
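A sketch of the featurized Platt-style calibration (feature names and learned parameters are made up; the production LR uses 80+ features):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    return math.log(p / (1 - p))

def calibrated_probability(p_uncal, features, weights, bias):
    """Platt-style calibration generalized with extra features. Classic Platt
    scaling (Equations 7-8) is the special case with only the probability
    weight and the bias."""
    z = bias + weights["logit_p"] * logit(p_uncal)
    z += sum(weights[name] * value for name, value in features.items())
    return sigmoid(z)

# Hypothetical learned parameters and features.
w = {"logit_p": 1.1, "country_ctr_30m": 0.8, "is_mobile": -0.2}
p_star = calibrated_probability(0.3, {"country_ctr_30m": 0.04, "is_mobile": 1.0}, w, -0.5)
```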

Training data generation and featurization

In order to train calibration models that can learn empirical rates, we created a new training data generation pipeline without any stratified sampling.

The pipeline comprised two parts:

  1. Raw data logs coming from application servers, where we have the Pin IDs, context information as well as raw logs required for featurization (shown in purple in Figure 3)

  2. Label information that we obtain after Pinners view and act on the impressed Pins, which is stored in feedview logs (shown in orange in Figure 3).

Combining 1) and 2) provided us with label and raw feature information to create training data for each calibration model.

The data for each calibration model is the same except for the labels. The repin calibration data marks only repins as positive labels, click data marks only clicks as positive, and so on.

Figure 3. Calibration model training data generation pipeline.

We uniformly sampled 10% of the user logs to reduce training data size. We collected 7 days of logs as training data to get rid of any day-of-week effects and used the following day as test data. The performance on test data matched the model’s online calibration performance.

While we mainly relied on total calibration error (Equation 5) and reliability diagrams, we also used the following performance measures:

  • Log loss

  • Expected calibration error [4,5]

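Expected calibration error [4,5] bins predictions by confidence and averages the per-bin gap between the empirical positive rate and the mean predicted probability; a minimal sketch:

```python
def expected_calibration_error(y_true, y_pred, n_bins=10):
    """ECE: |empirical rate - mean predicted probability| per equal-width
    probability bin, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_pred):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into last bin
        bins[idx].append((y, p))
    n = len(y_true)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        emp = sum(y for y, _ in b) / len(b)   # observed positive rate
        conf = sum(p for _, p in b) / len(b)  # mean predicted probability
        ece += (len(b) / n) * abs(emp - conf)
    return ece
```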
Features and model training

To learn a good mapping of probabilities coming from a DNN trained on stratified-sampled data, we needed to provide the model a number of features capturing different aspects:

  • Bias and position features: Binary and categorical features such as app type, device type, gender, bucketized positions and cluster IDs. The cluster IDs are generated by an internal algorithm mapping Pins to a pool of clusters.

  • User and Pin performance features: The performance (repin rate, click rate, close-up rate) of Pins and Pinners at different time granularities such as the past 3 hours, 1 day, 3 days, 30 days and 90 days.

  • Feedback-loop features: The empirical action rates on the platform in the last 30 minutes, aggregated overall, by country, and by gender crossed with country. These features were included to capture fluctuations during the day.

As for the model, we chose logistic regression (LR) trained with cross-entropy (CE) loss. CE loss helps LR achieve good calibration: based on the class and the predicted probability, the loss rewards correct classifications and heavily penalizes wrong classifications made with high certainty (Figure 4).

Figure 4. Cross-entropy loss and why LR calibrates well.

The final model is akin to Platt scaling with a large number of features:

[Image: final calibration model equation]

Replaying a new DNN on calibration training data

One problem we haven't addressed so far is training a calibration model for a newly trained DNN (DNNnew). Because DNNnew is not yet serving traffic, all of the calibration training data is generated by the production model, DNNprod. This means different calibration features, since DNNnew and DNNprod may produce different predicted scores.

We solved this by a three-step simulation process (Figure 5):

我们通过三步模拟过程解决了这个问题(图5):

  1. We make sure to keep DNN logs and calibration logs for the same 10% of users. Let's call these "common logs" (although the calibration and DNN logs used for training are different, they are created for the same user-Pin pairs).

  2. Get predictions from DNNnew against DNN evaluation data generated from the common logs.

  3. Replace the uncalibrated probability feature values in calibration training logs with the predictions from step 2.

Figure 5. Featurization step with simulation.
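The three replay steps can be sketched as a keyed join over the common logs (field and variable names are made up for illustration):

```python
def replay_calibration_features(calibration_rows, new_dnn_scores):
    """Step 3: swap the production model's uncalibrated probability for the new
    model's prediction, keyed by (user_id, pin_id) from the common logs."""
    replayed = []
    for row in calibration_rows:
        key = (row["user_id"], row["pin_id"])
        if key in new_dnn_scores:  # scored by DNN_new in step 2
            replayed.append(dict(row, uncalibrated_p=new_dnn_scores[key]))
    return replayed

rows = [{"user_id": 1, "pin_id": 7, "uncalibrated_p": 0.30, "label": 1}]
replayed = replay_calibration_features(rows, {(1, 7): 0.42})
```

The calibration model for DNNnew is then trained on the replayed rows, as if DNNnew had been serving all along.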

Monitoring and alerting

We monitor and alert in real time on the calibrated probabilities of the production model.

We also have a daily report to monitor the production and experimental models' calibration errors. If any monitored action is over- or under-calibrated beyond a certain threshold, we alert the oncall engineer.

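A sketch of such a check. We take Equation 5's calibration to be the ratio of summed predicted probabilities to observed positives (an assumption about its exact form), and the alert threshold is hypothetical:

```python
def calibration_ratio(y_true, y_pred):
    """Sum of predicted probabilities over sum of observed labels.
    1.0 is perfectly calibrated; >1 over-predicts, <1 under-predicts."""
    return sum(y_pred) / sum(y_true)

def should_alert(y_true, y_pred, tolerance=0.1):
    """Page the oncall if an action's calibration drifts past the tolerance."""
    return abs(calibration_ratio(y_true, y_pred) - 1.0) > tolerance
```

The same check runs per action head, so drift in a single objective is caught even when the others remain well calibrated.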
The calibration error and the calibrated probabilities are highly sensitive to changes in features. The changes can be either in DNN features (which affects the uncalibrated probability values, which in turn will affect the calibrated probabilities) or in calibration features. We were able to capture incidents in our system via calibration monitoring and alerting before they significantly affected topline metrics.

Additional use cases

In addition to utility-based ranking, the MTL / calibration framework unlocked multiple use cases.

Video distribution

Video distribution was one of our main objectives in 2019, and achieving it via the old framework was difficult. In the MTL framework, we first defined a positive label for videos: was the video viewed for more than 10 seconds? We then added a new output node to MTL to predict this label. We calibrated this node and added it into a utility that includes only video-specific actions: repins, close-ups and 10-second views. This increased our video distribution by 40% with increased engagement rates.

Hide modeling

We were also able to model negative engagement using the MTL framework, with a small twist in the utility function.

Similar to videos, we first defined a label for negative engagement: Pin hides. Then we added an MTL node and calibration model for hides. Lastly, we added a high negative weight on hide probabilities (Equation 11).

[Image: Equation 11 — utility with a high negative weight on P(hide)]
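A toy sketch of Equation 11's effect: a large negative weight on P(hide) pushes likely-to-be-hidden Pins toward the end of the ranking (all weights here are illustrative):

```python
HIDE_WEIGHT = -20.0  # hypothetical; large and negative

def utility_with_hides(p_repin, p_hide, repin_weight=3.0):
    """Utility with a strong penalty on predicted hide probability."""
    return repin_weight * p_repin + HIDE_WEIGHT * p_hide

pins = {"a": (0.30, 0.01), "b": (0.35, 0.10)}  # (p_repin, p_hide)
ranked = sorted(pins, key=lambda k: utility_with_hides(*pins[k]), reverse=True)
```

Pin "b" has the higher repin probability, but its high hide probability drops it below Pin "a".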

Learnings and pitfalls

Although MTL is a powerful tool for handling multiple objectives, it has its pitfalls. For example, in the hard parameter sharing approach [1] we used, the hidden layers are shared by multiple objectives. Hence, each newly added objective affects the other objectives, so it is important that tasks are complementary: merely adding the hide head increased hides. We were able to reverse this in the utility function by effectively pushing Pins with high hide predictions to the end of the list.

MTL is also not a silver bullet. The task being modeled still needs a decent amount of training data to affect the shared hidden layers. For example, besides hides, we also tried to model Pin reports. However, report volume can be very small: adding an output node for reports had no effect, and we were unable to calibrate the action.

Conclusion

This work led us to several wins on Pinner engagement, business goals and developer velocity:

  • We were able to show more relevant pins to users by improving the accuracy of our predictions.

  • We improved engineering velocity by separating the model predictions from the ranking layer. We can now iterate on ranking functions by modifying utility terms while model iterations proceed in parallel.

  • We helped the business by enabling stakeholders to quickly adjust ranking based on business needs.

Acknowledgments

This was a large project involving contributions from a number of engineers and managers. Specifically we would like to thank Utku Irmak, Crystal Lee, Xin Liu, Chenjin Liang, Cosmin Negruseri, Yaron Greif, Derek Zhiyuan Cheng, Tao Cheng, Randall Keller, Mukund Narasimhan and Vijay Narayanan.

Originally published at: https://medium.com/pinterest-engineering/multi-task-learning-and-calibration-for-utility-based-home-feed-ranking-64087a7bcbad
