Deep Neural Networks for YouTube Recommendations (YouTube 深度神经网络在推荐系统上的应用)

ABSTRACT

YouTube represents one of the largest scale and most sophisticated industrial recommendation systems in existence. In this paper, we describe the system at a high level and focus on the dramatic performance improvements brought by deep learning. The paper is split according to the classic two-stage information retrieval dichotomy: first, we detail a deep candidate generation model and then describe a separate deep ranking model. We also provide practical lessons and insights derived from designing, iterating and maintaining a massive recommendation system with enormous user facing impact.

Recommending YouTube videos is extremely challenging from three major perspectives:


• Scale: Many existing recommendation algorithms proven to work well on small problems fail to operate on our scale. Highly specialized distributed learning algorithms and efficient serving systems are essential for handling YouTube's massive user base and corpus.


• Freshness: YouTube has a very dynamic corpus with many hours of video uploaded per second. The recommendation system should be responsive enough to model newly uploaded content as well as the latest actions taken by the user. Balancing new content with well-established videos can be understood from an exploration/exploitation perspective.


• Noise: Historical user behavior on YouTube is inherently difficult to predict due to sparsity and a variety of unobservable external factors. We rarely obtain the ground truth of user satisfaction and instead model noisy implicit feedback signals. Furthermore, metadata associated with content is poorly structured without a well defined ontology. Our algorithms need to be robust to these particular characteristics of our training data.


In conjunction with other product areas across Google, YouTube has undergone a fundamental paradigm shift towards using deep learning as a general-purpose solution for nearly all learning problems. Our system is built on Google Brain [4] which was recently open sourced as TensorFlow [1]. TensorFlow provides a flexible framework for experimenting with various deep neural network architectures using large-scale distributed training. Our models learn approximately one billion parameters and are trained on hundreds of billions of examples.

In contrast to the vast amount of research in matrix factorization methods [19], there is relatively little work using deep neural networks for recommendation systems. Neural networks are used for recommending news in [17], citations in [8] and review ratings in [20]. Collaborative filtering is formulated as a deep neural network in [22] and as autoencoders in [18]. Elkahky et al. used deep learning for cross domain user modeling [5]. In a content-based setting, van den Oord et al. used deep neural networks for music recommendation [21].


The paper is organized as follows: A brief system overview is presented in Section 2. Section 3 describes the candidate generation model in more detail, including how it is trained and used to serve recommendations. Experimental results will show how the model benefits from deep layers of hidden units and additional heterogeneous signals. Section 4 details the ranking model, including how classic logistic regression is modified to train a model predicting expected watch time (rather than click probability). Experimental results will show that hidden layer depth is helpful as well in this situation. Finally, Section 5 presents our conclusions and lessons learned.


2. SYSTEM OVERVIEW

Figure 2: Recommendation system architecture demonstrating the “funnel” where candidate videos are retrieved and ranked before presenting only a few to the user.

The overall structure of our recommendation system is illustrated in Figure 2. The system is comprised of two neural networks: one for candidate generation and one for ranking.


The candidate generation network takes events from the user’s YouTube activity history as input and retrieves a small subset (hundreds) of videos from a large corpus. These candidates are intended to be generally relevant to the user with high precision. The candidate generation network only provides broad personalization via collaborative filtering. The similarity between users is expressed in terms of coarse features such as IDs of video watches, search query tokens and demographics.


 Presenting a few “best” recommendations in a list requires a fine-level representation to distinguish relative importance among candidates with high recall. The ranking network accomplishes this task by assigning a score to each video according to a desired objective function using a rich set of features describing the video and user. The highest scoring videos are presented to the user, ranked by their score. The two-stage approach to recommendation allows us to make recommendations from a very large corpus (millions) of videos while still being certain that the small number of videos appearing on the device are personalized and engaging for the user. Furthermore, this design enables blending candidates generated by other sources, such as those described in an earlier work [3]. During development, we make extensive use of offline metrics (precision, recall, ranking loss, etc.) to guide iterative improvements to our system. However for the final determination of the effectiveness of an algorithm or model, we rely on A/B testing via live experiments. In a live experiment, we can measure subtle changes in click-through rate, watch time, and many other metrics that measure user engagement. This is important because live A/B results are not always correlated with offline experiments.


3. CANDIDATE GENERATION

During candidate generation, the enormous YouTube corpus is winnowed down to hundreds of videos that may be relevant to the user. The predecessor to the recommender described here was a matrix factorization approach trained under rank loss [23]. Early iterations of our neural network model mimicked this factorization behavior with shallow networks that only embedded the user’s previous watches. From this perspective, our approach can be viewed as a nonlinear generalization of factorization techniques.


3.1 Recommendation as Classification

We pose recommendation as extreme multiclass classification where the prediction problem becomes accurately classifying a specific video watch w_t at time t among millions of videos i (classes) from a corpus V based on a user U and context C,

P(w_t = i | U, C) = e^(v_i · u) / Σ_{j ∈ V} e^(v_j · u)

where u ∈ R^N represents a high-dimensional "embedding" of the (user, context) pair and the v_j ∈ R^N represent embeddings of each candidate video. In this setting, an embedding is simply a mapping of sparse entities (individual videos, users etc.) into a dense vector in R^N. The task of the deep neural network is to learn user embeddings u as a function of the user's history and context that are useful for discriminating among videos with a softmax classifier.


Although explicit feedback mechanisms exist on YouTube (thumbs up/down, in-product surveys, etc.) we use the implicit feedback [16] of watches to train the model, where a user completing a video is a positive example. This choice is based on the orders of magnitude more implicit user history available, allowing us to produce recommendations deep in the tail where explicit feedback is extremely sparse.


Efficient Extreme Multiclass

To efficiently train such a model with millions of classes, we rely on a technique to sample negative classes from the background distribution (“candidate sampling”) and then correct for this sampling via importance weighting [10]. For each example the cross-entropy loss is minimized for the true label and the sampled negative classes. In practice several thousand negatives are sampled, corresponding to more than 100 times speedup over traditional softmax. A popular alternative approach is hierarchical softmax [15], but we weren’t able to achieve comparable accuracy. In hierarchical softmax, traversing each node in the tree involves discriminating between sets of classes that are often unrelated, making the classification problem much more difficult and degrading performance.

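As a concrete illustration, here is a minimal sketch of this candidate-sampling setup using TensorFlow's built-in sampled softmax; the corpus size, embedding width and function names are illustrative assumptions, not the paper's exact configuration:

```python
import tensorflow as tf

NUM_VIDEOS = 1_000_000   # corpus size (illustrative assumption)
EMBED_DIM = 256          # matches the output embedding width in Section 3.5
NUM_SAMPLED = 5_000      # "several thousand negatives" per example

# Output-side video embeddings v_j and biases for the softmax layer.
video_embeddings = tf.Variable(tf.random.normal([NUM_VIDEOS, EMBED_DIM]))
video_biases = tf.Variable(tf.zeros([NUM_VIDEOS]))

def candidate_sampling_loss(user_embeddings, watched_video_ids):
    """Cross-entropy over the true label plus sampled negative classes.

    user_embeddings:   [batch, EMBED_DIM] output of the user tower.
    watched_video_ids: [batch, 1] int64 ids of the positive (watched) videos.
    """
    return tf.reduce_mean(
        tf.nn.sampled_softmax_loss(
            weights=video_embeddings,
            biases=video_biases,
            labels=watched_video_ids,
            inputs=user_embeddings,
            num_sampled=NUM_SAMPLED,  # negatives from the background distribution
            num_classes=NUM_VIDEOS))  # importance weighting handled internally
```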

At serving time we need to compute the most likely N classes (videos) in order to choose the top N to present to the user. Scoring millions of items under a strict serving latency of tens of milliseconds requires an approximate scoring scheme sublinear in the number of classes. Previous systems at YouTube relied on hashing [24] and the classifier described here uses a similar approach. Since calibrated likelihoods from the softmax output layer are not needed at serving time, the scoring problem reduces to a nearest neighbor search in the dot product space for which general purpose libraries can be used [12]. We found that A/B results were not particularly sensitive to the choice of nearest neighbor search algorithm.

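In the dot product space this reduces to the following scoring step; a brute-force version is shown here only for clarity, where production would substitute an approximate, sublinear nearest-neighbor library:

```python
import numpy as np

def top_n_videos(user_embedding, video_embeddings, n=200):
    """Serving-time scoring: since calibrated softmax likelihoods are not
    needed, ranking by dot product alone selects the top N candidates."""
    scores = video_embeddings @ user_embedding  # one score per video
    top = np.argpartition(-scores, n)[:n]       # unordered top-n, O(num_videos)
    return top[np.argsort(-scores[top])]        # sorted best-first
```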

3.2 Model Architecture

Figure 3: Deep candidate generation model architecture showing embedded sparse features concatenated with dense features. Embeddings are averaged before concatenation to transform variable sized bags of sparse IDs into fixed-width vectors suitable for input to the hidden layers. All hidden layers are fully connected. In training, a cross-entropy loss is minimized with gradient descent on the output of the sampled softmax. At serving, an approximate nearest neighbor lookup is performed to generate hundreds of candidate video recommendations.

Inspired by continuous bag of words language models [14], we learn high dimensional embeddings for each video in a fixed vocabulary and feed these embeddings into a feedforward neural network. A user’s watch history is represented by a variable-length sequence of sparse video IDs which is mapped to a dense vector representation via the embeddings. The network requires fixed-sized dense inputs and simply averaging the embeddings performed best among several strategies (sum, component-wise max, etc.). Importantly, the embeddings are learned jointly with all other model parameters through normal gradient descent backpropagation updates. Features are concatenated into a wide first layer, followed by several layers of fully connected Rectified Linear Units (ReLU) [6]. Figure 3 shows the general network architecture with additional non-video watch features described below.

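A minimal Keras sketch of this tower follows; the bag sizes, vocabularies and layer widths mirror Section 3.5, while details such as masking of padded IDs are simplified assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_VIDEOS = VOCAB_TOKENS = 1_000_000
EMBED_DIM = 256

# Variable-sized bags of sparse IDs, padded to a fixed length.
watch_ids = tf.keras.Input(shape=(50,), dtype=tf.int64)   # recent watches
search_ids = tf.keras.Input(shape=(50,), dtype=tf.int64)  # recent search tokens
example_age = tf.keras.Input(shape=(1,))                  # dense feature

# Averaging the embeddings turns each bag into a fixed-width vector.
avg = layers.GlobalAveragePooling1D()
watch_vec = avg(layers.Embedding(VOCAB_VIDEOS, EMBED_DIM)(watch_ids))
search_vec = avg(layers.Embedding(VOCAB_TOKENS, EMBED_DIM)(search_ids))

# Wide first layer over the concatenated features, then a ReLU "tower".
x = layers.Concatenate()([watch_vec, search_vec, example_age])
for units in (1024, 512, 256):  # the depth-3 configuration of Section 3.5
    x = layers.Dense(units, activation="relu")(x)

user_tower = tf.keras.Model([watch_ids, search_ids, example_age], x)
```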

3.3 Heterogeneous Signals

A key advantage of using deep neural networks as a generalization of matrix factorization is that arbitrary continuous and categorical features can be easily added to the model. Search history is treated similarly to watch history - each query is tokenized into unigrams and bigrams and each token is embedded. Once averaged, the user’s tokenized, embedded queries represent a summarized dense search history. Demographic features are important for providing priors so that the recommendations behave reasonably for new users. The user’s geographic region and device are embedded and concatenated. Simple binary and continuous features such as the user’s gender, logged-in state and age are input directly into the network as real values normalized to [0, 1].


“Example Age” Feature

Many hours worth of videos are uploaded each second to YouTube. Recommending this recently uploaded (“fresh”) content is extremely important for YouTube as a product. We consistently observe that users prefer fresh content, though not at the expense of relevance. In addition to the first-order effect of simply recommending new videos that users want to watch, there is a critical secondary phenomenon of bootstrapping and propagating viral content [11].


Machine learning systems often exhibit an implicit bias towards the past because they are trained to predict future behavior from historical examples. The distribution of video popularity is highly non-stationary but the multinomial distribution over the corpus produced by our recommender will reflect the average watch likelihood in the training window of several weeks. To correct for this, we feed the age of the training example as a feature during training. At serving time, this feature is set to zero (or slightly negative) to reflect that the model is making predictions at the very end of the training window. Figure 4 demonstrates the efficacy of this approach on an arbitrarily chosen video [26].

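A sketch of how such a feature could be computed, assuming per-example log timestamps; the paper does not spell out the exact implementation:

```python
import numpy as np

def example_age(log_timestamps, serving=False):
    """Age of each training example relative to the end of the training
    window; set to zero (or slightly negative) at serving time."""
    if serving:
        return np.zeros_like(log_timestamps)
    t_max = log_timestamps.max()   # latest observed time in the window
    return t_max - log_timestamps
```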

Figure 4: For a given video [26], the model trained with example age as a feature is able to accurately represent the upload time and time-dependent popularity observed in the data. Without the feature, the model would predict approximately the average likelihood over the training window.

3.4 Label and Context Selection

It is important to emphasize that recommendation often involves solving a surrogate problem and transferring the result to a particular context. A classic example is the assumption that accurately predicting ratings leads to effective movie recommendations [2]. We have found that the choice of this surrogate learning problem has an outsized importance on performance in A/B testing but is very difficult to measure with offline experiments.


Training examples are generated from all YouTube watches (even those embedded on other sites) rather than just watches on the recommendations we produce. Otherwise, it would be very difficult for new content to surface and the recommender would be overly biased towards exploitation. If users are discovering videos through means other than our recommendations, we want to be able to quickly propagate this discovery to others via collaborative filtering. Another key insight that improved live metrics was to generate a fixed number of training examples per user, effectively weighting our users equally in the loss function. This prevented a small cohort of highly active users from dominating the loss.


Somewhat counter-intuitively, great care must be taken to withhold information from the classifier in order to prevent the model from exploiting the structure of the site and overfitting the surrogate problem. Consider as an example a case in which the user has just issued a search query for “taylor swift”. Since our problem is posed as predicting the next watched video, a classifier given this information will predict that the most likely videos to be watched are those which appear on the corresponding search results page for “taylor swift”. Unsurpisingly, reproducing the user’s last search page as homepage recommendations performs very poorly. By discarding sequence information and representing search queries with an unordered bag of tokens, the classifier is no longer directly aware of the origin of the label.

在某种程度上与直觉相反,必须非常小心地从分类器中保留信息,以防止模型利用站点的结构和过度拟合代理问题。举个例子,用户刚刚发布了一个搜索泰勒·斯威夫特的查询。由于我们的问题是预测下一个观看的视频,给出这个信息的分类器将预测最有可能观看的视频是那些出现在相应的搜索结果页面上的泰勒·斯威夫特。不出所料的是,将用户的最后一个搜索页面复制为主页推荐效果非常差。通过丢弃序列信息和用无序的令牌包表示搜索查询,分类器不再直接知道标签的来源。

Natural consumption patterns of videos typically lead to very asymmetric co-watch probabilities. Episodic series are usually watched sequentially and users often discover artists in a genre beginning with the most broadly popular before focusing on smaller niches. We therefore found much better performance predicting the user’s next watch, rather than predicting a randomly held-out watch (Figure 5). Many collaborative filtering systems implicitly choose the labels and context by holding out a random item and predicting it from other items in the user’s history (5a). This leaks future information and ignores any asymmetric consumption patterns. In contrast, we “rollback” a user’s history by choosing a random watch and only input actions the user took before the held-out label watch (5b).


3.5 Experiments with Features and Depth

Adding features and depth significantly improves precision on holdout data as shown in Figure 6. In these experiments, a vocabulary of 1M videos and 1M search tokens were embedded with 256 floats each in a maximum bag size of 50 recent watches and 50 recent searches. The softmax layer outputs a multinomial distribution over the same 1M video classes with a dimension of 256 (which can be thought of as a separate output video embedding). These models were trained until convergence over all YouTube users, corresponding to several epochs over the data. Network structure followed a common “tower” pattern in which the bottom of the network is widest and each successive hidden layer halves the number of units (similar to Figure 3). The depth zero network is effectively a linear factorization scheme which performed very similarly to the predecessor system. Width and depth were added until the incremental benefit diminished and convergence became difficult:


• Depth 0: A linear layer simply transforms the concatenation layer to match the softmax dimension of 256

• Depth 1: 256 ReLU

• Depth 2: 512 ReLU → 256 ReLU

• Depth 3: 1024 ReLU → 512 ReLU → 256 ReLU

• Depth 4: 2048 ReLU → 1024 ReLU → 512 ReLU → 256 ReLU

 

Figure 5: Choosing labels and input context to the model is challenging to evaluate offline but has a large impact on live performance. Here, solid events • are input features to the network while hollow events ◦ are excluded. We found predicting a future watch (5b) performed better in A/B testing. In (5b), the example age is expressed as t_max − t_N, where t_max is the maximum observed time in the training data.

 

Figure 6: Features beyond video embeddings improve holdout Mean Average Precision (MAP) and layers of depth add expressiveness so that the model can effectively use these additional features by modeling their interaction.

4. RANKING

The primary role of ranking is to use impression data to specialize and calibrate candidate predictions for the particular user interface. For example, a user may watch a given video with high probability generally but is unlikely to click on the specific homepage impression due to the choice of thumbnail image. During ranking, we have access to many more features describing the video and the user’s relationship to the video because only a few hundred videos are being scored rather than the millions scored in candidate generation. Ranking is also crucial for ensembling different candidate sources whose scores are not directly comparable. We use a deep neural network with similar architecture as candidate generation to assign an independent score to each video impression using logistic regression (Figure 7). The list of videos is then sorted by this score and returned to the user. Our final ranking objective is constantly being tuned based on live A/B testing results but is generally a simple function of expected watch time per impression. Ranking by click-through rate often promotes deceptive videos that the user does not complete (“clickbait”) whereas watch time better captures engagement [13, 25].

4.1 Feature Representation

Our features are segregated with the traditional taxonomy of categorical and continuous/ordinal features. The categorical features we use vary widely in their cardinality - some are binary (e.g. whether the user is logged-in) while others have millions of possible values (e.g. the user’s last search query). Features are further split according to whether they contribute only a single value (“univalent”) or a set of values (“multivalent”). An example of a univalent categorical feature is the video ID of the impression being scored, while a corresponding multivalent feature might be a bag of the last N video IDs the user has watched. We also classify features according to whether they describe properties of the item (“impression”) or properties of the user/context (“query”). Query features are computed once per request while impression features are computed for each item scored.


Feature Engineering

We typically use hundreds of features in our ranking models, roughly split evenly between categorical and continuous. Despite the promise of deep learning to alleviate the burden of engineering features by hand, the nature of our raw data does not easily lend itself to be input directly into feedforward neural networks. We still expend considerable engineering resources transforming user and video data into useful features. The main challenge is in representing a temporal sequence of user actions and how these actions relate to the video impression being scored.


We observe that the most important signals are those that describe a user’s previous interaction with the item itself and other similar items, matching others’ experience in ranking ads [7]. As an example, consider the user’s past history with the channel that uploaded the video being scored - how many videos has the user watched from this channel? When was the last time the user watched a video on this topic? These continuous features describing past user actions on related items are particularly powerful because they generalize well across disparate items. We have also found it crucial to propagate information from candidate generation into ranking in the form of features, e.g. which sources nominated this video candidate? What scores did they assign?


Features describing the frequency of past video impressions are also critical for introducing “churn” in recommendations (successive requests do not return identical lists). If a user was recently recommended a video but did not watch it then the model will naturally demote this impression on the next page load. Serving up-to-the-second impression and watch history is an engineering feat in itself, outside the scope of this paper, but is vital for producing responsive recommendations.


Embedding Categorical Features

Similar to candidate generation, we use embeddings to map sparse categorical features to dense representations suitable for neural networks. Each unique ID space (“vocabulary”) has a separate learned embedding with dimension that increases approximately proportional to the logarithm of the number of unique values. These vocabularies are simple look-up tables built by passing over the data once before training. Very large cardinality ID spaces (e.g. video IDs or search query terms) are truncated by including only the top N after sorting based on their frequency in clicked impressions. Out-of-vocabulary values are simply mapped to the zero embedding. As in candidate generation, multivalent categorical feature embeddings are averaged before being fed in to the network.

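For illustration, a hedged sketch of these vocabulary conventions; the `scale` constant in the dimension heuristic is an assumption, since the paper only says the dimension grows roughly with the logarithm of the cardinality:

```python
import math

def embedding_dim(vocab_size, scale=8):
    # Dimension increases approximately with log(number of unique values);
    # the multiplier is an illustrative assumption.
    return int(scale * math.ceil(math.log2(max(vocab_size, 2))))

def build_vocab(id_counts, top_n):
    """Keep the top-N IDs by frequency in clicked impressions; index 0 is
    reserved so out-of-vocabulary values map to the zero embedding."""
    top = sorted(id_counts, key=id_counts.get, reverse=True)[:top_n]
    return {id_: i + 1 for i, id_ in enumerate(top)}
```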

Importantly, categorical features in the same ID space also share underlying embeddings. For example, there exists a single global embedding of video IDs that many distinct features use (video ID of the impression, last video ID watched by the user, video ID that “seeded” the recommendation, etc.). Despite the shared embedding, each feature is fed separately into the network so that the layers above can learn specialized representations per feature. Sharing embeddings is important for improving generalization, speeding up training and reducing memory requirements. The overwhelming majority of model parameters are in these high-cardinality embedding spaces - for example, one million IDs embedded in a 32 dimensional space have 7 times more parameters than fully connected layers 2048 units wide.


Normalizing Continuous Features

Neural networks are notoriously sensitive to the scaling and distribution of their inputs [9] whereas alternative approaches such as ensembles of decision trees are invariant to scaling of individual features. We found that proper normalization of continuous features was critical for convergence. A continuous feature x with distribution f is transformed to x̃ by scaling the values such that the feature is equally distributed in [0, 1) using the cumulative distribution, x̃ = ∫_{−∞}^{x} df. This integral is approximated with linear interpolation on the quantiles of the feature values computed in a single pass over the data before training begins.


In addition to the raw normalized feature x̃, we also input powers x̃² and √x̃, giving the network more expressive power by allowing it to easily form super- and sub-linear functions of the feature. Feeding powers of continuous features was found to improve offline accuracy.

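A sketch of this quantile-based transform follows; the number of quantiles is an assumption, and `np.interp` plays the role of the linear interpolation the paper describes:

```python
import numpy as np

def fit_cdf_transform(train_values, num_quantiles=1000):
    """Single pass over the data: store quantiles of the raw feature, then
    map any x to x_tilde in [0, 1) via its interpolated empirical CDF."""
    grid = np.linspace(0.0, 1.0, num_quantiles)
    qs = np.quantile(train_values, grid)

    def transform(x):
        x_tilde = np.interp(x, qs, grid)  # approximate CDF of x
        # Also feed powers so the network can form super-/sub-linear functions.
        return np.stack([x_tilde, x_tilde ** 2, np.sqrt(x_tilde)], axis=-1)

    return transform
```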

4.2 Modeling Expected Watch Time

Our goal is to predict expected watch time given training examples that are either positive (the video impression was clicked) or negative (the impression was not clicked). Positive examples are annotated with the amount of time the user spent watching the video. To predict expected watch time we use the technique of weighted logistic regression, which was developed for this purpose.


The model is trained with logistic regression under cross-entropy loss (Figure 7). However, the positive (clicked) impressions are weighted by the observed watch time on the video. Negative (unclicked) impressions all receive unit weight. In this way, the odds learned by the logistic regression are (Σ_i T_i)/(N − k), where N is the number of training examples, k is the number of positive impressions, and T_i is the watch time of the ith impression. Assuming the fraction of positive impressions is small (which is true in our case), the learned odds are approximately E[T](1 + P), where P is the click probability and E[T] is the expected watch time of the impression. Since P is small, this product is close to E[T]. For inference we use the exponential function e^x as the final activation function to produce these odds that closely estimate expected watch time.

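A minimal sketch of this weighted logistic regression objective and the serving-time activation, with tensor names as assumptions:

```python
import tensorflow as tf

def weighted_lr_loss(logits, clicked, watch_time):
    """Positive (clicked) impressions are weighted by observed watch time;
    negatives get unit weight, so the learned odds approximate E[T]."""
    weights = tf.where(clicked > 0.5, watch_time, tf.ones_like(watch_time))
    per_example = tf.nn.sigmoid_cross_entropy_with_logits(
        labels=clicked, logits=logits)
    return tf.reduce_mean(weights * per_example)

def expected_watch_time(logits):
    # Serving: e^x on the final logit produces the odds, which closely
    # estimate expected watch time when the click probability is small.
    return tf.exp(logits)
```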

4.3 Experiments with Hidden Layers

Table 1 shows the results we obtained on next-day holdout data with different hidden layer configurations. The value shown for each configuration (“weighted, per-user loss”) was obtained by considering both positive (clicked) and negative (unclicked) impressions shown to a user on a single page. We first score these two impressions with our model. If the negative impression receives a higher score than the positive impression, then we consider the positive impression’s watch time to be mispredicted watch time. Weighted, per-user loss is then the total amount of mispredicted watch time as a fraction of total watch time over heldout impression pairs.

These results show that increasing the width of hidden layers improves results, as does increasing their depth. The trade-off, however, is server CPU time needed for inference. The configuration of a 1024-wide ReLU followed by a 512-wide ReLU followed by a 256-wide ReLU gave us the best results while enabling us to stay within our serving CPU budget.

For the 1024 → 512 → 256 model we tried only feeding the normalized continuous features without their powers, which increased loss by 0.2%. With the same hidden layer configuration, we also trained a model where positive and negative examples are weighted equally. Unsurprisingly, this increased the watch time-weighted loss by a dramatic 4.1%.


 

Table 1: Effects of wider and deeper hidden ReLU layers on watch time-weighted pairwise loss computed on next-day holdout data.

 

 

5. CONCLUSIONS

We have described our deep neural network architecture for recommending YouTube videos, split into two distinct problems: candidate generation and ranking.


Our deep collaborative filtering model is able to effectively assimilate many signals and model their interaction with layers of depth, outperforming previous matrix factorization approaches used at YouTube [23]. There is more art than science in selecting the surrogate problem for recommendations and we found classifying a future watch to perform well on live metrics by capturing asymmetric co-watch behavior and preventing leakage of future information. Withholding discriminative signals from the classifier was also essential to achieving good results - otherwise the model would overfit the surrogate problem and not transfer well to the homepage.

We demonstrated that using the age of the training example as an input feature removes an inherent bias towards the past and allows the model to represent the time-dependent behavior of popular videos. This improved offline holdout precision results and increased the watch time dramatically on recently uploaded videos in A/B testing.

Ranking is a more classical machine learning problem yet our deep learning approach outperformed previous linear and tree-based methods for watch time prediction. Recommendation systems in particular benefit from specialized features describing past user behavior with items. Deep neural networks require special representations of categorical and continuous features which we transform with embeddings and quantile normalization, respectively. Layers of depth were shown to effectively model non-linear interactions between hundreds of features.


Logistic regression was modified by weighting training examples with watch time for positive examples and unity for negative examples, allowing us to learn odds that closely model expected watch time. This approach performed much better on watch-time weighted ranking evaluation metrics compared to predicting click-through rate directly.


6. ACKNOWLEDGMENTS

The authors would like to thank Jim McFadden and Pranav Khaitan for valuable guidance and support. Sujeet Bansal, Shripad Thite and Radek Vingralek implemented key components of the training and serving infrastructure. Chris Berg and Trevor Walker contributed thoughtful discussion and detailed feedback.


7. REFERENCES

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[2] X. Amatriain. Building industrial-scale real-world recommender systems. In Proceedings of the Sixth ACM Conference on Recommender Systems, RecSys ’12, pages 7–8, New York, NY, USA, 2012. ACM.

[3] J. Davidson, B. Liebald, J. Liu, P. Nandy, T. Van Vleet, U. Gargi, S. Gupta, Y. He, M. Lambert, B. Livingston, and D. Sampath. The youtube video recommendation system. In Proceedings of the Fourth ACM Conference on Recommender Systems, RecSys ’10, pages 293–296, New York, NY, USA, 2010. ACM.

[4] J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng. Large scale distributed deep networks. In NIPS, 2012.

[5] A. M. Elkahky, Y. Song, and X. He. A multi-view deep learning approach for cross domain user modeling in recommendation systems. In Proceedings of the 24th International Conference on World Wide Web, WWW ’15, pages 278–288, New York, NY, USA, 2015. ACM.

[6] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In G. J. Gordon and D. B. Dunson, editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS-11), volume 15, pages 315–323. Journal of Machine Learning Research - Workshop and Conference Proceedings, 2011.

[7] X. He, J. Pan, O. Jin, T. Xu, B. Liu, T. Xu, Y. Shi, A. Atallah, R. Herbrich, S. Bowers, and J. Quiñonero Candela. Practical lessons from predicting clicks on ads at facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, ADKDD’14, pages 5:1–5:9, New York, NY, USA, 2014. ACM.

[8] W. Huang, Z. Wu, L. Chen, P. Mitra, and C. L. Giles. A neural probabilistic model for context based citation recommendation. In AAAI, pages 2404–2410, 2015.

[9] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015.

[10] S. Jean, K. Cho, R. Memisevic, and Y. Bengio. On using very large target vocabulary for neural machine translation. CoRR, abs/1412.2007, 2014.

[11] L. Jiang, Y. Miao, Y. Yang, Z. Lan, and A. G. Hauptmann. Viral video style: A closer look at viral videos on youtube. In Proceedings of International Conference on Multimedia Retrieval, ICMR ’14, pages 193:193–193:200, New York, NY, USA, 2014. ACM.

[12] T. Liu, A. W. Moore, A. Gray, and K. Yang. An investigation of practical approximate nearest neighbor algorithms. In Advances in Neural Information Processing Systems, pages 825–832. MIT Press, 2004.

[13] E. Meyerson. Youtube now: Why we focus on watch time. http://youtubecreator.blogspot.com/2012/08/ youtube-now-why-we-focus-on-watch-time.html. Accessed: 2016-04-20.

[14] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. CoRR, abs/1310.4546, 2013.

[15] F. Morin and Y. Bengio. Hierarchical probabilistic neural network language model. In AISTATS’05, pages 246–252, 2005.

[16] D. Oard and J. Kim. Implicit feedback for recommender systems. In Proceedings of the AAAI Workshop on Recommender Systems, pages 81–83, 1998.

[17] K. J. Oh, W. J. Lee, C. G. Lim, and H. J. Choi. Personalized news recommendation using classified keywords to capture user preference. In 16th International Conference on Advanced Communication Technology, pages 1283–1287, Feb 2014.

[18] S. Sedhain, A. K. Menon, S. Sanner, and L. Xie. Autorec: Autoencoders meet collaborative filtering. In Proceedings of the 24th International Conference on World Wide Web, WWW ’15 Companion, pages 111–112, New York, NY, USA, 2015. ACM.

[19] X. Su and T. M. Khoshgoftaar. A survey of collaborative filtering techniques. Advances in artificial intelligence, 2009:4, 2009.

[20] D. Tang, B. Qin, T. Liu, and Y. Yang. User modeling with neural network for review rating prediction. In Proc. IJCAI, pages 1340–1346, 2015.

[21] A. van den Oord, S. Dieleman, and B. Schrauwen. Deep content-based music recommendation. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2643–2651. Curran Associates, Inc., 2013.

[22] H. Wang, N. Wang, and D.-Y. Yeung. Collaborative deep learning for recommender systems. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’15, pages 1235–1244, New York, NY, USA, 2015. ACM.

[23] J. Weston, S. Bengio, and N. Usunier. Wsabie: Scaling up to large vocabulary image annotation. In Proceedings of the International Joint Conference on Artificial Intelligence, IJCAI, 2011.

[24] J. Weston, A. Makadia, and H. Yee. Label partitioning for sublinear ranking. In S. Dasgupta and D. Mcallester, editors, Proceedings of the 30th International Conference on Machine Learning (ICML-13), volume 28, pages 181–189. JMLR Workshop and Conference Proceedings, May 2013.

[25] X. Yi, L. Hong, E. Zhong, N. N. Liu, and S. Rajan. Beyond clicks: Dwell time for personalization. In Proceedings of the 8th ACM Conference on Recommender Systems, RecSys ’14, pages 113–120, New York, NY, USA, 2014. ACM.

[26] Zayn. Pillowtalk. https://www.youtube.com/watch?v=C_3d6GntKbk.

1. Background

Recommending YouTube videos faces three major challenges:

(1) Scale: highly specialized distributed learning algorithms and efficient serving systems are essential for handling YouTube's massive user base and corpus.

(2) Freshness: YouTube has a very dynamic corpus, with many hours of video uploaded every second. The model should respond promptly to newly uploaded content and to the user's most recent actions; balancing new videos against established ones is naturally understood as an exploration/exploitation problem.

(3) Noise: because of data sparsity and many external factors, we only observe the user's implicit feedback rather than true satisfaction, so the signal is very noisy.

2. System Architecture

(Figure: YouTube recommendation system architecture)

As the figure shows, the overall system contains two neural networks: a candidate generation (retrieval) network and a ranking network.

Retrieval network: based on the user's historical behavior, it picks out a few hundred videos of interest from a corpus of millions.

The retrieved videos cover the user's generalized interests; similarity between users is expressed through coarse features such as IDs of watched videos, search queries, and user profile attributes.

Ranking network: it scores the retrieved videos using richer, more fine-grained user and video features, producing the best possible final ordering.

3. Retrieval Network

YouTube casts recommendation as an extreme multiclass classification problem: at time t, for user U with context C, predict the probability that the watched video w_t is video i from corpus V:

P(w_t = i | U, C) = e^(v_i · u) / Σ_{j ∈ V} e^(v_j · u)

where u ∈ R^N is the high-dimensional embedding of the (user, context) pair and v_j ∈ R^N is the embedding vector of each candidate video.

The deep network's goal is therefore to learn, from the user's history and context, the user embedding vector u that feeds a softmax classifier used to retrieve videos.

Every evaluation of the softmax denominator must sweep all videos in corpus V, and the dot products and exponentials between the user-context vector and the video vectors are expensive, so training efficiently becomes a problem in its own right.

Training the Model Efficiently

To train the model efficiently, the paper borrows the sampled softmax method from machine translation. The key idea is to use importance sampling to draw a subset of videos V' that approximates the full corpus V as closely as possible. A brief outline follows; for details see http://www.aclweb.org/anthology/P15-1001

- Candidate sampling: in practice, several thousand negative samples are drawn via importance sampling
- The sampling bias is then corrected through importance weighting

Besides sampled softmax, the main existing alternatives are:

- Approximating the softmax probabilities with other methods, e.g. noise-contrastive estimation (NCE), similar to negative sampling in word2vec's skip-gram
- Hierarchical softmax

 

Online Serving

Sampled softmax only speeds up training; at prediction time the denominator would still have to be computed in full. However, for a given user U the softmax denominator is identical for every candidate video, so at retrieval time only the numerators need to be compared, which greatly reduces computation. Retrieval therefore reduces to a nearest-neighbor search, which can be accelerated by building an index over the video vectors.

The model's input is a vector concatenating the user's video watch history, search history, demographics and other context; its output serves the online (serving) and offline (training) paths described above.

 

Feature Engineering

  • Watch history: average the embedding vectors of the videos the user watched to completion
  • Search history: tokenize past search queries and average the tokens' embedding vectors
  • Demographics: gender, age, region, etc.
  • Other context: device, logged-in state, etc.

The Example Age Feature

Besides the main features above, YouTube also feeds the example age into training, because beyond relevance users prefer fresh videos (fresh videos they are interested in, of course). Here example age = t_max − t_N, i.e. the maximum observed timestamp in the training data minus the example's own timestamp.

At serving time, the example age feature is set to zero (or a small negative value), which favors retrieving recently uploaded videos whose content matches the user's interests.

The rationale: videos uploaded earlier have had more opportunity to become popular and relevant to most users, so including example age during training lets this popularity drift be absorbed by that feature's weight. Suppressing the feature (setting it to zero) at serving time then lets the other relevance features dominate, which helps recently uploaded videos get retrieved.

Label and Context Selection

  • Generating training examples.

First, training examples come not only from recommendations shown inside YouTube but also from watches on other embedded pages, such as related-video pages. When a user discovers a video through one of those pages, collaborative filtering can then quickly surface it for other users.

Second, a fixed number of training examples is generated per user, which weights users equally in the loss function and keeps a small group of highly active users from dominating the loss.

  • Discarding sequence information.

The embedding vectors of past watches and past search queries are simply averaged without order.

Could the history vectors instead be combined with a time-decayed weighted average? The paper's answer is no. This is somewhat counter-intuitive; a possible reason is that the model does not capture negative feedback well.

  • Asymmetric co-watch.

Episodic series are usually watched in order, sequentially, and users generally get interested in popular videos first and only then focus on niche ones.

Predicting the user's next watch therefore works better than predicting a randomly held-out watch. Traditional collaborative filtering and FM-style methods use the held-out setup of figure (a), ignoring these asymmetric viewing patterns.

 

4. Ranking Network

The ranking network's architecture is essentially similar to the retrieval network's. The biggest difference is that its last layer is a weighted LR while the retrieval network's last layer is a softmax; in addition, the ranking network uses more fine-grained features. The model structure follows Figure 7 of the paper.

 

Feature Engineering

The most important features describe the user's past interactions with the video itself and with similar videos.

Embedding Categorical Features

Neural networks are better suited to dense continuous inputs, so ID-type features must all be embedded into dense vectors.

Importantly, embedding vectors within the same ID space can be shared; for example, the video-ID embedding can serve several distinct features (the video ID being scored, the last watched video ID, the video ID that seeded the recommendation). Sharing embeddings helps generalization, speeds up training and reduces memory usage.

Normalizing Continuous Features

Neural networks are sensitive to feature scaling and to the distribution of their inputs, so proper normalization of continuous features is critical for convergence (apart from tree-based methods, most approaches need continuous features normalized).

Predicting Watch Time

Our goal is to predict the expected watch time of each training example, positive or negative, so a weighted LR serves as the ranking network's output layer.

A positive example's weight is the watch time of the video; every negative example gets weight 1.

The odds learned by the LR are then

odds = (Σ_i T_i) / (N − k)

where N is the number of training examples, k is the number of positives and T_i is the watch time of the i-th example. Since k is generally very small relative to N,

(Σ_i T_i) / (N − k) ≈ E[T](1 + P) ≈ E[T]

where E[T] is the expected watch time and P is the click probability; because P is small, the product is close to E[T]. At serving time, e^(Wx+b) is used as the final activation function to approximate these odds, i.e. the expected watch time (roughly, the average watch time).

5. Summary

My main takeaways:

  • Retrieval is cast as an extreme multiclass problem with a softmax, trained with the sampled softmax method
  • The example age feature is set to zero or a small negative value at serving time, to favor retrieving recently uploaded videos
  • The ranking network is trained as a weighted LR, turning the problem into predicting average watch time

If anything in this post is wrong, please point it out in a message or comment.

References:

https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/45530.pdf

http://www.aclweb.org/anthology/P15-1001

石塔西: How YouTube does recommendation with deep learning (看Youtube怎么利用深度学习做推荐)

清凇: Building a recommender system with deep learning (DNN): a close reading of "Deep Neural Networks for YouTube Recommendations" (用深度学习(DNN)构建推荐系统 - Deep Neural Networks for YouTube Recommendations论文精读)

The notes above draw on these Zhihu authors' readings of the paper.

In my view, with massive data, rich contexts and continuously iterated algorithms, the user experience of recommender systems will keep improving. The problems recommender systems currently face are roughly:

  1. With massive data, user analysis, profiling, feature extraction and modeling need to be refined by category, region and scenario, and made more personalized.
  2. Refreshing offline data-mining results over massive data demands large numbers of machines.
  3. Cold start: a new user's location, registration information and the like become the key signals cold start can exploit.
  4. Evaluating user satisfaction.

Meituan's recommendation framework is shown below:

https://i-blog.csdnimg.cn/blog_migrate/3a59ae85e75d09b8859f6c8541bc6352.jpeg

Candidate generation there is triggered by traditional recommendation methods: collaborative filtering, location-based, query-based, heterogeneous-graph-based and fallback strategies; the ranking layer then re-ranks the triggered candidates with machine-learned models. YouTube's retrieval, by contrast, treats the whole corpus as an extreme multiclass problem handled by the deep network described above, where the number of classes equals the number of videos in the library, on the order of millions.

Meituan's recommender is described at https://cloud.tencent.com/developer/article/1041925

Architecturally it divides into a data layer, a trigger layer, a fusion/filtering layer and a ranking layer. The data layer covers data generation and storage: various data-processing tools clean the raw logs into structured data, which lands in different storage systems for downstream algorithms and models. The candidate trigger layer uses a range of trigger strategies over the user's historical behavior, real-time behavior, location and so on to produce candidate sets. The fusion and filtering layer has two jobs: it merges the candidate sets produced by the different triggers, improving coverage and precision, and it enforces manual rules set by product and operations teams, filtering out unqualified items. The ranking layer re-ranks the candidates passed up by the trigger layer with machine-learned models.

Because the candidate trigger and re-ranking layers are the two modified most often during effect iteration, both need A/B-test support. To iterate efficiently the two layers are decoupled; their results are orthogonal, so comparison experiments can run on each independently without interference. Within each layer, traffic is further split by user so that several strategies can be compared online simultaneously.

CF is the most widely applied algorithm in recommendation. It is simple, but using it well depends on the specific scenario:

Remove noisy data from fraud, spam orders and proxy purchases. Such data seriously distorts the results, so it must be culled in the very first data-cleaning step.

Choose the training data sensibly. The time window should be neither too long nor too short, and the right window has to be determined through repeated experiments. Time decay is also worth introducing, since recent behavior better predicts what the user will do next.

Combine user-based and item-based CF.

https://i-blog.csdnimg.cn/blog_migrate/17c38a0065b9f79e2a034a8001b1f517.jpeg

LB (location-based triggers)

For mobile devices, one of the biggest differences from the PC is that the device's location keeps changing, and different locations reflect different user scenarios that the business can fully exploit. Candidate triggering therefore fires strategies based on the user's real-time location, workplace, home and other locations.

From users' historical purchases and browsing, mine hot consumption and hot purchase lists for regions of some granularity (business districts, for example).

When a new online request arrives, weight the hot lists of the regions matching the user's several locations into a final recommendation list.

Beyond that, user similarity for collaborative filtering can also be computed from the locations where users appear.

QB (query-based triggers)

Search expresses strong, fairly explicit user intent, yet in many cases, for all kinds of reasons, it never leads to a conversion. It still represents real intent that can be exploited. Concretely:

Mine the user's unconverted searches over a recent period and compute each user's weight for each query, plus each deal's weight under each query. On the user's next request, combine the two sets of weights and recommend the top-N deals by the combined weight.

GB (graph-based triggers)

For collaborative filtering, the graph distance between users or between deals is two hops; relations any further away cannot be taken into account. Graph algorithms break this limit: viewing user-deal relations as a bipartite graph, relatedness can propagate across the graph. SimRank ("a measure of structural-context similarity") is such an algorithm for measuring the similarity of peer entities. Its basic idea is that two entities are similar if they relate to similar entities, i.e. similarity propagates (a toy sketch follows).
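A toy SimRank iteration over an adjacency matrix, just to make the propagation idea concrete; real deployments use decomposed or approximate variants, and the damping factor here is the usual textbook assumption:

```python
import numpy as np

def simrank(adj, c=0.8, iters=10):
    """Naive SimRank: two nodes are similar if their neighbors are similar.
    adj is a symmetric 0/1 adjacency matrix (e.g. of the user-deal bipartite
    graph); the complexity is far too high for production use."""
    n = adj.shape[0]
    neighbors = [np.flatnonzero(adj[i]) for i in range(n)]
    sim = np.eye(n)
    for _ in range(iters):
        new = np.eye(n)
        for a in range(n):
            for b in range(a + 1, n):
                if len(neighbors[a]) and len(neighbors[b]):
                    s = sim[np.ix_(neighbors[a], neighbors[b])].sum()
                    new[a, b] = new[b, a] = c * s / (
                        len(neighbors[a]) * len(neighbors[b]))
        sim = new
    return sim
```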

Real-time user behavior

Meituan's products generate rich user behavior, including searching, filtering, favoriting, browsing and ordering, and these behaviors are the key raw material for effect optimization. Ideally every behavior stream would reach the conversion step, but reality falls far short of that.

When a user takes actions upstream of ordering, a good share of those streams never convert for one reason or another. Those upstream actions are nonetheless important prior knowledge: in many cases, failing to convert at the time does not mean the user is uninterested in the item. When the user next reaches a recommendation slot, the system uses these prior behaviors to understand and identify the user's true intent, shows the matching deals again, and guides the behavior stream downstream toward the ultimate goal of placing an order.

The real-time behaviors used so far are real-time browsing and real-time favoriting.

Fallback strategies

Although a battery of trigger algorithms based on user history exists, for some new users, or users whose history is thin, the candidate sets those algorithms trigger are too small, so fallback strategies are needed as filler:

Best sellers: items with the most sales within a time window; time decay can be taken into account.

Top rated: items rated highly in user reviews.

City lists: items in the user's requesting city that satisfy the basic constraints.

Fusing the sub-strategies

To combine the strengths of the different trigger algorithms and to improve the diversity and coverage of the candidate set, the triggers' outputs have to be fused. Common fusion methods include the following (a simple blending sketch follows the list):

Weighted: the simplest method assigns each algorithm an empirical weight, weights each algorithm's candidates accordingly, and sorts by the weighted score.

Tiered: use the best-performing algorithm first; when its candidates fall short of the target size, fall back to the next best, and so on.

Modulated: each algorithm contributes a fixed proportion of the candidates, and the shares are stacked into the final set.

Filtered: each algorithm filters the candidate set produced by the one before it, so the candidates are filtered stage by stage into a small, precise final set.

Meituan currently combines the modulated and tiered methods: each algorithm receives a share of the candidate set according to its historical performance, the better-performing algorithms trigger first, and if the candidate set is still too small the next-best algorithm triggers, and so on.
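A compact sketch of the combined modulated-plus-tiered blending described above; the strategy ordering, quotas and backfill rule are assumptions about one reasonable realization:

```python
def blend_candidates(strategies, quotas, target_size):
    """strategies: candidate lists ordered by historical performance;
    quotas: per-strategy share of the target (modulated fusion);
    if quotas leave the set short, better strategies backfill (tiered)."""
    out, seen = [], set()
    for items, quota in zip(strategies, quotas):
        taken = 0
        for item in items:
            if taken >= quota or len(out) >= target_size:
                break
            if item not in seen:
                out.append(item); seen.add(item); taken += 1
    for items in strategies:  # tiered backfill in performance order
        for item in items:
            if len(out) >= target_size:
                return out
            if item not in seen:
                out.append(item); seen.add(item)
    return out
```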

Re-ranking the candidate set

As noted above, positioning items purely by each trigger algorithm's historical performance is rather crude, and within each algorithm the ordering of items is decided by only one or a few factors. Such orderings suffice only for the initial selection step; the final ordering needs machine-learned ranking models that weigh many factors together.

Models

Non-linear models capture non-linear relations among features well, but training and prediction cost more than for linear models, which also stretches their update cycle. Linear models, conversely, are demanding about feature preparation, relying on domain knowledge and experience to preprocess features by hand, but their simplicity makes training and prediction efficient, so their update cycle can be short and online-learning experiments tied to the business become possible. In practice both kinds of model are applied.

Non-linear model

The main non-linear model used is the tree model Additive Groves (AG). Relative to linear models, it handles non-linear relations in the features without the heavy investment in feature processing and feature crossing that linear models need. AG is an additive model made up of many groves; bagging across groves produces the final prediction and curbs overfitting.

https://i-blog.csdnimg.cn/blog_migrate/4806a77c1f16eb2c102da4bbb7b229a2.jpeg

Each grove consists of several trees. During training, each tree fits the residual between the true value and the sum of the other trees' predictions; once the given number of trees is reached, newly retrained trees replace the old ones one by one, and after many iterations the model converges.

https://i-blog.csdnimg.cn/blog_migrate/2895cdffe707d3501714f8692cc2d160.jpeg

https://i-blog.csdnimg.cn/blog_migrate/9a1dc4eb76b9f122cf620945aa474803.jpeg

Linear model

The most widely applied linear model is, of course, logistic regression. To capture shifts in the data distribution in real time, online learning is introduced: a real-time data stream feeds the model, which is updated online with the FTRL method proposed by Google.

https://i-blog.csdnimg.cn/blog_migrate/7538f2cb16e8586f86f4a1d317fc5d66.jpeg

The main steps are (an FTRL sketch follows the list):

Write feature vectors to HBase online

Storm parses the real-time click and order log streams and rewrites the labels of the matching feature vectors in HBase

Update the model weights via FTRL

Apply the new model parameters online
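For reference, a per-coordinate FTRL-Proximal update in the style of McMahan et al.; the hyperparameters are illustrative and this is a sketch of the kind of online LR update meant above, not Meituan's actual code:

```python
import numpy as np

class FTRLProximal:
    def __init__(self, dim, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = np.zeros(dim)  # accumulated adjusted gradients
        self.n = np.zeros(dim)  # accumulated squared gradients

    def weights(self):
        w = -(self.z - np.sign(self.z) * self.l1) / (
            (self.beta + np.sqrt(self.n)) / self.alpha + self.l2)
        w[np.abs(self.z) <= self.l1] = 0.0  # L1 keeps the model sparse
        return w

    def update(self, x, label):
        """One online step on a feature vector x and a 0/1 click label."""
        w = self.weights()
        p = 1.0 / (1.0 + np.exp(-x @ w))   # predicted click probability
        g = (p - label) * x                 # logistic-loss gradient
        sigma = (np.sqrt(self.n + g * g) - np.sqrt(self.n)) / self.alpha
        self.z += g - sigma * w
        self.n += g * g
```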

Training

Sampling: for click-through-rate estimation the positive and negative classes are badly imbalanced, so the negatives need downsampling.

Negatives: positives are generally the samples where the user converted (clicked, ordered, and so on), but is every sample without a conversion necessarily a negative? Not really: the user never actually saw many of the impressions, so counting them as negatives is unreasonable and hurts the model. A common remedy is skip-above: only impressions shown above the position the user clicked may be treated as negatives (see the sketch after this section). These are implicit negative feedback; beyond them, items the user actively deleted are explicit negative feedback and make high-quality negatives.

Denoising: data mixed in from fraud-like behavior such as spam orders must be excluded from the training data, or it will directly hurt the model.
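A sketch of the skip-above rule, assuming positions are ordered top to bottom; the exact variant used in production may differ:

```python
def skip_above_negatives(impressions, clicked_positions):
    """Only impressions shown above the lowest clicked position count as
    implicit negatives; items below the click may never have been seen."""
    if not clicked_positions:
        return []                    # no click: possibly nothing was seen
    cutoff = max(clicked_positions)  # deepest position the user reached
    return [item for pos, item in enumerate(impressions)
            if pos < cutoff and pos not in clicked_positions]
```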

Features

The current re-ranking model uses roughly the following classes of features:

Deal features (a deal is a group-buy listing): mainly the deal's own attributes, including price, discount, sales, rating, category, click-through rate, etc.

User features: user level, demographics, client type, etc. User-deal cross features: the user's clicks, favorites and purchases of the deal, etc.

Distance features: the distances from the POI to the user's real-time location, frequently visited locations, workplace, home, etc. Non-linear models can consume the features above directly; for linear models the feature values must be bucketed, normalized and so on, into continuous values in [0, 1] or into 0/1 binaries.

Conclusion

Data is the foundation and algorithms are the chisel; only by combining the two organically do results improve. Two milestones stand out in the optimization so far:

Fusing the candidate sets: this raised the coverage, diversity and precision of the recommendations.

Introducing the re-ranking model: this solved the ordering of deals once the candidate set had grown. Both lessons are highly representative for recommendation in O2O scenarios.

 


 

 
