Deep Neural Networks for YouTube Recommendations: Translation and Notes

sjz_hahalala479

Article link: https://blog.csdn.net/sjz_hahalala479/article/details/106242426

Deep Neural Networks for YouTube Recommendations

Abstract

YouTube represents one of the largest scale and most sophisticated industrial recommendation systems in existence. In this paper, we describe the system at a high level and focus on the dramatic performance improvements brought by deep learning. The paper is split according to the classic two-stage information retrieval dichotomy: first, we detail a deep candidate generation model and then describe a separate deep ranking model. We also provide practical lessons and insights derived from designing, iterating and maintaining a massive recommendation system with enormous user-facing impact.

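YouTube's recommender is one of the largest and most complex recommendation systems in industry. The authors describe it at a high level and concentrate on the performance gains that deep learning brought. The system follows the classic two-stage design of candidate generation (recall) followed by ranking: a deep candidate generation model and a separate deep ranking model, independent of each other. The authors also share practical lessons from building and maintaining the system.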

1 INTRODUCTION

YouTube is the world’s largest platform for creating, sharing and discovering video content. YouTube recommendations are responsible for helping more than a billion users discover personalized content from an ever-growing corpus of videos. In this paper we will focus on the immense impact deep learning has recently had on the YouTube video recommendations system. Figure 1 illustrates the recommendations on the YouTube mobile app home.
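In other words, YouTube is the largest platform for creating, sharing and discovering video, and its recommendations help more than a billion users find personalized content in a constantly growing corpus. The paper concentrates on how deep learning has recently changed the recommendation system; Figure 1 shows recommendations on the home screen of the mobile app.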

Recommending YouTube videos is extremely challenging from three major perspectives:

Scale: Many existing recommendation algorithms proven to work well on small problems fail to operate on our scale. Highly specialized distributed learning algorithms and efficient serving systems are essential for handling YouTube’s massive user base and corpus.
Freshness: YouTube has a very dynamic corpus with many hours of video uploaded per second. The recommendation system should be responsive enough to model newly uploaded content as well as the latest actions taken by the user. Balancing new content with well-established videos can be understood from an exploration/exploitation perspective.
Noise: Historical user behavior on YouTube is inherently difficult to predict due to sparsity and a variety of unobservable external factors. We rarely obtain the ground truth of user satisfaction and instead model noisy implicit feedback signals. Furthermore, metadata associated with content is poorly structured without a well defined ontology. Our algorithms need to be robust to these particular characteristics of our training data.

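The three main challenges for YouTube recommendations:

Scale: Many existing recommendation algorithms that work well on small problems do not work at YouTube's scale. Highly specialized distributed learning algorithms and efficient serving systems are needed to handle the user base and corpus.
Freshness: Many hours of video are uploaded every second, so the system has to balance newly uploaded videos against established ones, and it also has to track each user's latest actions.
Noise: A user's historical behavior is sparse and incomplete, and the ground truth of user satisfaction is rarely available, so the model is trained on noisy implicit feedback signals.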

In conjunction with other product areas across Google, YouTube has undergone a fundamental paradigm shift towards using deep learning as a general-purpose solution for nearly all learning problems. Our system is built on Google Brain [4] which was recently open sourced as TensorFlow [1]. TensorFlow provides a flexible framework for experimenting with various deep neural network architectures using large-scale distributed training. Our models learn approximately one billion parameters and are trained on hundreds of billions of examples.

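YouTube has moved to deep learning as a general-purpose solution for nearly all of its learning problems. The system is built on TensorFlow, which provides a flexible framework for experimenting with deep neural network architectures using large-scale distributed training. The models learn roughly one billion parameters from hundreds of billions of training examples.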

In contrast to vast amount of research in matrix factorization methods [19], there is relatively little work using deep neural networks for recommendation systems. Neural networks are used for recommending news in [17], citations in [8] and review ratings in [20]. Collaborative filtering is formulated as a deep neural network in [22] and autoencoders in [18]. Elkahky et al. used deep learning for cross domain user modeling [5]. In a content-based setting, Burges et al. used deep neural networks for music recommendation [21].

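Compared with the large body of research on matrix factorization [19], relatively little work applies deep neural networks to recommendation. Neural networks have been used to recommend news [17], to recommend citations [8] and to predict review ratings [20]. Collaborative filtering has been formulated as a deep neural network [22] and with autoencoders [18]. Elkahky et al. used deep learning for cross-domain user modeling [5], and in a content-based setting Burges et al. used deep neural networks for music recommendation [21].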

The paper is organized as follows: A brief system overview is presented in Section 2. Section 3 describes the candidate generation model in more detail, including how it is trained and used to serve recommendations. Experimental results will show how the model benefits from deep layers of hidden units and additional heterogeneous signals. Section 4 details the ranking model, including how classic logistic regression is modified to train a model predicting expected watch time (rather than click probability). Experimental results will show that hidden layer depth is helpful as well in this situation. Finally, Section 5 presents our conclusions and lessons learned.

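The paper is organized as follows:

Section 2 gives a brief overview of the system.
Section 3 describes the candidate generation (recall) model in more detail, including how it is trained and how it serves recommendations. The experiments show how the model benefits from deeper hidden layers and from additional heterogeneous signals.
Section 4 details the ranking model, including how classic logistic regression is modified to predict expected watch time rather than click probability. The experiments show that hidden layer depth helps here as well.
Section 5 presents the conclusions and lessons learned.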

2 SYSTEM OVERVIEW

The overall structure of our recommendation system is illustrated in Figure 2. The system is comprised of two neural networks: one for candidate generation and one for ranking.

The system consists of two neural networks, one for candidate generation (recall) and one for ranking.
[Figure 2: overall structure of the recommendation system]
The candidate generation network takes events from the user’s YouTube activity history as input and retrieves a small subset (hundreds) of videos from a large corpus. These candidates are intended to be generally relevant to the user with high precision. The candidate generation network only provides broad personalization via collaborative filtering. The similarity between users is expressed in terms of coarse features such as IDs of video watches, search query tokens and demographics.

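The candidate generation network takes events from the user's YouTube activity history as input and retrieves a few hundred videos from the large corpus. These candidates are meant to be relevant to the user with high precision. This stage provides only broad personalization through collaborative filtering, where similarity between users is expressed through coarse features such as the IDs of watched videos, search query tokens and demographics.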

Presenting a few “best” recommendations in a list requires a fine-level representation to distinguish relative importance among candidates with high recall. The ranking network accomplishes this task by assigning a score to each video according to a desired objective function using a rich set of features describing the video and user. The highest scoring videos are presented to the user, ranked by their score.

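Picking the few “best” recommendations for a list requires a fine-grained representation that can separate the relative importance of candidates retrieved with high recall. The ranking network does this by scoring each video against a chosen objective function, using a rich set of features that describe the video and the user. The highest-scoring videos are shown to the user in score order.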

The two-stage approach to recommendation allows us to make recommendations from a very large corpus (millions) of videos while still being certain that the small number of videos appearing on the device are personalized and engaging for the user. Furthermore, this design enables blending candidates generated by other sources, such as those described in an earlier work [3].

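The two-stage approach makes it possible to recommend from a corpus of millions of videos while still ensuring that the handful of videos shown on the device are personalized and engaging. The design also allows candidates from other sources to be blended in, such as those described in earlier work [3].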

During development, we make extensive use of offline metrics (precision, recall, ranking loss, etc.) to guide iterative improvements to our system. However for the final determination of the effectiveness of an algorithm or model, we rely on A/B testing via live experiments. In a live experiment, we can measure subtle changes in click-through rate, watch time, and many other metrics that measure user engagement. This is important because live A/B results are not always correlated with offline experiments.

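During development, offline metrics (precision, recall, ranking loss, etc.) guide iterative improvements. The final judgment on an algorithm or model, however, comes from A/B testing in live experiments, which can measure subtle changes in click-through rate, watch time and many other engagement metrics. This matters because live A/B results do not always correlate with offline experiments.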

3 CANDIDATE GENERATION

During candidate generation, the enormous YouTube corpus is winnowed down to hundreds of videos that may be relevant to the user. The predecessor to the recommender described here was a matrix factorization approach trained under rank loss [23]. Early iterations of our neural network model mimicked this factorization behavior with shallow networks that only embedded the user’s previous watches. From this perspective, our approach can be viewed as a non-linear generalization of factorization techniques.

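During candidate generation, the enormous YouTube corpus is filtered down to the few hundred videos that may be relevant to the user. The predecessor of this recommender was a matrix factorization approach trained under rank loss [23]. Early versions of the neural network model imitated that factorization behavior with shallow networks that embedded only the user's previous watches. Seen this way, the approach is a non-linear generalization of factorization techniques.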

3.1 Recommendation as Classification

We pose recommendation as extreme multi-class classification where the prediction problem becomes accurately classifying a specific video watch $w_t$ at time $t$ among millions of videos $i$ (classes) from a corpus $V$ based on a user $U$ and context $C$ ,

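We treat recommendation as an extreme multi-class classification problem. Let $w_t$ denote the event that user $U$ in context $C$ watches a particular video at time $t$. When the video watched in $w_t$ is video $i$ from the corpus $V$, we say that $w_t$ belongs to class $i$. There are as many classes as there are videos in $V$ ($i = 1, 2, 3, \ldots, |V|$).
[That is, we need to predict which video user $U$ will watch in context $C$ at time $t$.]
Mathematically:
$P(w_t=i|U,C) = \frac{e^{{v_i}^T \cdot u}}{\sum_j e^{{v_j}^T \cdot u}}$ (1)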
where $u∈R^N$ represents a high-dimensional “embedding” of the user, context pair and the $v_j ∈R^N$ represent embeddings of each candidate video.

In this setting, an embedding is simply a mapping of sparse entities (individual videos, users, etc.) into a dense vector in $R^N$. The task of the deep neural network is to learn user embeddings $u$ as a function of the user’s history and context that are useful for discriminating among videos with a softmax classifier.

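An embedding is a mapping from a sparse, high-dimensional space to a dense, low-dimensional space of dimension $N$. [Typically a high-dimensional one-hot vector (for a video, a <user, context> pair, etc.) is turned into a low-dimensional vector that captures similarity.] In Eq. (1), both $v$ and $u$ are low-dimensional vectors: $v$ represents a candidate video and $u$ represents a <user, context> pair.
The task of the deep neural network is to learn the user embedding $u$ as a function of the user's history and context, such that it is useful for discriminating among videos with a softmax classifier.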
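
As a minimal sketch of Eq. (1), the snippet below computes the softmax over dot products in plain Python. The function name `watch_probabilities` and the toy three-dimensional vectors are illustrative assumptions, not taken from the paper; real embeddings would be learned by the network.

```python
import math


def watch_probabilities(user_vec, video_vecs):
    """Softmax over the dot products v_j . u, following Eq. (1)."""
    logits = [sum(v_k * u_k for v_k, u_k in zip(v, user_vec)) for v in video_vecs]
    # Subtracting the max logit avoids overflow and leaves the softmax unchanged.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy data (assumed): one user vector u and three candidate video vectors v_j.
u = [0.5, -1.0, 0.25]
videos = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, -1.0, 0.25]]
probs = watch_probabilities(u, videos)
```

The video whose embedding has the largest dot product with `u` receives the highest probability, which is why only the dot products matter when ranking candidates.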

Although explicit feedback mechanisms exist on YouTube (thumbs up/down, in-product surveys, etc.) we use the implicit feedback [16] of watches to train the model, where a user completing a video is a positive example. This choice is based on the orders of magnitude more implicit user history available, allowing us to produce recommendations deep in the tail where explicit feedback is extremely sparse.

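Although YouTube has explicit feedback mechanisms (thumbs up/down, in-product surveys, etc.), the model is trained on the implicit feedback [16] of watches, where a user completing a video counts as a positive example. The reason is that implicit history is orders of magnitude more plentiful than explicit feedback, which makes it possible to recommend deep in the tail where explicit feedback is extremely sparse.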

Efficient Extreme Multi-class

To efficiently train such a model with millions of classes, we rely on a technique to sample negative classes from the background distribution (“candidate sampling”) and then correct for this sampling via importance weighting [10]. For each example the cross-entropy loss is minimized for the true label and the sampled negative classes. In practice several thousand negatives are sampled, corresponding to more than 100 times speedup over traditional softmax. A popular alternative approach is hierarchical softmax [15], but we weren’t able to achieve comparable accuracy. In hierarchical soft-max, traversing each node in the tree involves discriminating between sets of classes that are often unrelated, making the classification problem much more difficult and degrading performance.

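To train a model with millions of classes efficiently, negative classes are sampled from the background distribution (“candidate sampling”) and the sampling is then corrected through importance weighting [10]. For each example, the cross-entropy loss is minimized over the true label and the sampled negative classes. In practice several thousand negatives are sampled, which gives a speedup of more than 100 times over the traditional softmax. A popular alternative is hierarchical softmax [15], but it did not reach comparable accuracy: traversing each node of the tree means discriminating between sets of classes that are often unrelated, which makes the classification problem much harder and degrades performance.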
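
The candidate sampling idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's exact procedure: the correction used here subtracts log Q(j) from each logit, which is one common form of sampled softmax, and the names `sampled_softmax_loss`, `sampling_probs` and the toy data are all hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sampled_softmax_loss(user_vec, video_vecs, true_id, sampling_probs, num_sampled, rng):
    """Cross-entropy over the true class plus sampled negatives (hypothetical helper)."""
    ids = list(range(len(video_vecs)))
    negatives = []
    # Draw negatives from the background distribution Q, skipping the true class.
    while len(negatives) < num_sampled:
        j = rng.choices(ids, weights=sampling_probs, k=1)[0]
        if j != true_id:
            negatives.append(j)
    candidates = [true_id] + negatives
    # Assumed correction: subtract log Q(j) so frequently sampled classes are not over-penalized.
    corrected = [dot(video_vecs[j], user_vec) - math.log(sampling_probs[j]) for j in candidates]
    # Numerically stable log-sum-exp over the small candidate set only.
    m = max(corrected)
    log_z = m + math.log(sum(math.exp(z - m) for z in corrected))
    return log_z - corrected[0]


rng = random.Random(0)
videos = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
q = [0.4, 0.3, 0.2, 0.1]  # assumed background sampling distribution
loss = sampled_softmax_loss([0.2, 0.9], videos, true_id=1, sampling_probs=q, num_sampled=3, rng=rng)
```

The cost per example depends on the number of sampled negatives rather than on the size of the corpus, which is where the speedup over a full softmax comes from.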

At serving time we need to compute the most likely N classes (videos) in order to choose the top N to present to the user. Scoring millions of items under a strict serving latency of tens of milliseconds requires an approximate scoring scheme sublinear in the number of classes. Previous systems at YouTube relied on hashing [24] and the classifier described here uses a similar approach.
Since calibrated likelihoods from the softmax output layer are not needed at serving time, the scoring problem reduces to a nearest neighbor search in the dot product space for which general purpose libraries can be used [12]. We found that A/B results were not particularly sensitive to the choice of nearest neighbor search algorithm.

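At serving time, the N most likely classes (videos) have to be computed so that the top N can be presented to the user. Scoring millions of items within a strict serving latency of tens of milliseconds requires an approximate scoring scheme that is sublinear in the number of classes. Earlier systems at YouTube relied on hashing [24], and the classifier described here uses a similar approach.

Because calibrated likelihoods from the softmax output layer are not needed at serving time, scoring reduces to a nearest neighbor search in the dot product space, for which general-purpose libraries can be used [12]. The A/B results were not particularly sensitive to the choice of nearest neighbor search algorithm.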
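
To make the serving-time reduction concrete, here is a small sketch that picks the top N videos by dot product alone, with no softmax normalization. It is an exhaustive linear scan for illustration only; it is not the sublinear approximate search the paper refers to, and `top_n_videos` is a hypothetical name.

```python
import heapq


def top_n_videos(user_vec, video_vecs, n):
    """Return the indices of the n videos with the largest dot product with user_vec."""
    scored = (
        (sum(a * b for a, b in zip(v, user_vec)), vid)
        for vid, v in enumerate(video_vecs)
    )
    # nlargest returns (score, index) pairs sorted from highest to lowest score.
    return [vid for _, vid in heapq.nlargest(n, scored)]


u = [1.0, 2.0]
videos = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
top = top_n_videos(u, videos, 2)  # dot products are 1, 2, 3 and -3
```

Since the softmax is monotonic in the dot product, ranking by dot product gives the same order as ranking by the probabilities of Eq. (1).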

3.2 Model Architecture

Inspired by continuous bag of words language models [14], we learn high dimensional embeddings for each video in a fixed vocabulary and feed these embeddings into a feedforward neural network. A user’s watch history is represented by a variable-length sequence of sparse video IDs which is mapped to a dense vector representation via the embeddings. The network requires fixed-sized dense inputs and simply averaging the embeddings performed best among several strategies (sum, component-wise max, etc.). Importantly, the embeddings are learned jointly with all other model parameters through normal gradient descent backpropagation updates. Features are concatenated into a wide first layer, followed by several layers of fully connected Rectified Linear Units (ReLU) [6]. Figure 3 shows the general network architecture with additional non-video watch features described below.
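Inspired by the continuous bag of words (CBOW) model [14], the videos form a fixed vocabulary, a high-dimensional embedding is learned for each video, and these embeddings are fed into a feedforward neural network. A user's watch history is a variable-length sequence of sparse video IDs, which the embeddings map to a dense vector representation.
The network needs fixed-size dense inputs, and among several strategies (sum, component-wise max, etc.) simply averaging the embeddings performed best. Importantly, the embeddings are learned jointly with all other model parameters through ordinary gradient descent backpropagation updates. The features are concatenated into a wide first layer, followed by several fully connected layers of Rectified Linear Units (ReLU) [6]. Figure 3 shows this architecture, including the additional features that are not video watches.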
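
The averaging step that turns a variable-length watch history into a fixed-size input can be sketched like this. The embedding table, the video IDs and the choice to return a zero vector for an empty history are all assumptions made for the example.

```python
def average_embedding(watch_ids, embedding_table):
    """Average the embeddings of the watched videos into one fixed-size vector."""
    dim = len(next(iter(embedding_table.values())))
    if not watch_ids:
        # Assumption: an empty history maps to the zero vector.
        return [0.0] * dim
    total = [0.0] * dim
    for vid in watch_ids:
        for k, x in enumerate(embedding_table[vid]):
            total[k] += x
    return [t / len(watch_ids) for t in total]


# Toy 2-dimensional embeddings (assumed); the paper's experiments use 256 dimensions.
table = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
short_history = average_embedding(["a", "b"], table)
long_history = average_embedding(["a", "b", "c"], table)
```

Both histories produce a vector of the same length, so the network input size does not depend on how many videos a user has watched.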


3.3 Heterogeneous Signals

A key advantage of using deep neural networks as a generalization of matrix factorization is that arbitrary continuous and categorical features can be easily added to the model. Search history is treated similarly to watch history - each query is tokenized into unigrams and bigrams and each token is embedded. Once averaged, the user’s tokenized, embedded queries represent a summarized dense search history. Demographic features are important for providing priors so that the recommendations behave reasonably for new users. The user’s geographic region and device are embedded and concatenated. Simple binary and continuous features such as the user’s gender, logged-in state and age are input directly into the network as real values normalized to [0, 1].

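The main advantage of using deep neural networks as a generalization of matrix factorization is that arbitrary continuous and categorical features are easy to add to the model. Search history is handled like watch history: each query is tokenized into unigrams and bigrams, and each token is embedded. Once averaged, the user's tokenized, embedded queries form a summarized dense search history.

Demographic features matter for providing priors, so that recommendations behave reasonably for new users (cold start). The user's geographic region and device are embedded and concatenated. Simple binary and continuous features such as the user's gender, logged-in state and age are fed directly into the network as real values normalized to [0, 1].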
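
As a small illustration of the query tokenization described above, the sketch below splits a query into unigrams and bigrams. Lowercasing and whitespace splitting are assumptions for the example; the paper does not specify the tokenizer.

```python
def query_tokens(query):
    """Split a search query into unigram and bigram tokens."""
    words = query.lower().split()
    # Each bigram joins two adjacent words.
    bigrams = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + bigrams


tokens = query_tokens("Taylor Swift live")
```

Each resulting token would then be looked up in a token embedding table and averaged, in the same way as the watch history.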

“Example Age” Feature

Many hours worth of videos are uploaded each second to YouTube. Recommending this recently uploaded (“fresh”) content is extremely important for YouTube as a product. We consistently observe that users prefer fresh content, though not at the expense of relevance. In addition to the first-order effect of simply recommending new videos that users want to watch, there is a critical secondary phenomenon of bootstrapping and propagating viral content [11].

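Many hours of video are uploaded to YouTube every second, and recommending this recently uploaded (“fresh”) content is very important for the product. Users consistently prefer fresh content, though not at the expense of relevance. Beyond the first-order effect of recommending new videos that users want to watch, there is a critical secondary effect of bootstrapping and propagating viral content [11].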

Machine learning systems often exhibit an implicit bias towards the past because they are trained to predict future behavior from historical examples. The distribution of video popularity is highly non-stationary but the multinomial distribution over the corpus produced by our recommender will reflect the average watch likelihood in the training window of several weeks. To correct for this, we feed the age of the training example as a feature during training. At serving time, this feature is set to zero (or slightly negative) to reflect that the model is making predictions at the very end of the training window.

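Machine learning systems often show an implicit bias towards the past, because they are trained to predict future behavior from historical examples. The distribution of video popularity is highly non-stationary, yet the multinomial distribution over the corpus produced by the recommender reflects the average watch likelihood over the training window of several weeks. To correct for this, the age of the training example is fed in as a feature during training.
At serving time this feature is set to zero (or slightly negative) to reflect that the model is predicting at the very end of the training window.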
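
A minimal sketch of how such a feature might be attached to an input row, assuming timestamps in seconds and an age measured in days (both are assumptions; the paper does not state units). The helper names are hypothetical.

```python
SECONDS_PER_DAY = 86400.0


def example_age_days(example_ts, window_end_ts):
    """Age of a training example relative to the end of the training window."""
    return (window_end_ts - example_ts) / SECONDS_PER_DAY


def with_example_age(features, example_ts=None, window_end_ts=None):
    """Append example age during training; append 0.0 when serving."""
    if example_ts is None:
        age = 0.0  # serving: predict as if at the very end of the training window
    else:
        age = example_age_days(example_ts, window_end_ts)
    return features + [age]


train_row = with_example_age([0.3, 0.7], example_ts=1_000_000, window_end_ts=1_000_000 + 3 * 86400)
serve_row = with_example_age([0.3, 0.7])
```

During training the model can attribute time-dependent popularity to this input; fixing it at zero when serving asks for predictions that reflect the most recent state it has seen.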

3.4 Label and Context Selection

It is important to emphasize that recommendation often involves solving a surrogate problem and transferring the result to a particular context. A classic example is the assumption that accurately predicting ratings leads to effective movie recommendations [2]. We have found that the choice of this surrogate learning problem has an outsized importance on performance in A/B testing but is very difficult to measure with offline experiments.

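It is worth stressing that recommendation usually means solving a surrogate problem and transferring the result to a particular context. A classic example is the assumption that accurately predicting ratings leads to effective movie recommendations [2]. The choice of surrogate learning problem turns out to matter a great deal for A/B test performance, but it is very hard to measure with offline experiments.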

Training examples are generated from all YouTube watches (even those embedded on other sites) rather than just watches on the recommendations we produce. Otherwise, it would be very difficult for new content to surface and the recommender would be overly biased towards exploitation. If users are discovering videos through means other than our recommendations, we want to be able to quickly propagate this discovery to others via collaborative filtering. Another key insight that improved live metrics was to generate a fixed number of training examples per user, effectively weighting our users equally in the loss function. This prevented a small cohort of highly active users from dominating the loss.

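Training examples are generated from all YouTube watches (including those embedded on other sites), not only from watches on the recommendations the system produces. Otherwise new content would struggle to surface and the recommender would be overly biased towards exploitation rather than exploration. If users discover videos by means other than the recommendations, that discovery should propagate quickly to other users through collaborative filtering.
Another key insight that improved live metrics was to generate a fixed number of training examples per user, which effectively weights all users equally in the loss function. This keeps a small cohort of highly active users from dominating the loss.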
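
One way to picture the per-user limit is the sketch below, which keeps at most `per_user` randomly chosen examples for each user. Treating the fixed number as an upper bound, and sampling uniformly within a user, are assumptions made here; the paper does not describe the exact mechanism.

```python
import random
from collections import defaultdict


def cap_examples_per_user(examples, per_user, rng):
    """Keep at most per_user examples for each user (hypothetical helper)."""
    by_user = defaultdict(list)
    for user_id, example in examples:
        by_user[user_id].append(example)
    capped = []
    for user_id, items in by_user.items():
        # Users with few examples keep all of them; heavy users are subsampled.
        chosen = items if len(items) <= per_user else rng.sample(items, per_user)
        capped.extend((user_id, ex) for ex in chosen)
    return capped


examples = [("heavy", i) for i in range(100)] + [("light", 0), ("light", 1)]
capped = cap_examples_per_user(examples, per_user=2, rng=random.Random(0))
```

After capping, the heavy user contributes as many examples as the light user, so neither dominates the loss.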

Somewhat counter-intuitively, great care must be taken to withhold information from the classifier in order to prevent the model from exploiting the structure of the site and overfitting the surrogate problem. Consider as an example a case in which the user has just issued a search query for “taylor swift”. Since our problem is posed as predicting the next watched video, a classifier given this information will predict that the most likely videos to be watched are those which appear on the corresponding search results page for “taylor swift”. Unsurprisingly, reproducing the user’s last search page as homepage recommendations performs very poorly. By discarding sequence information and representing search queries with an unordered bag of tokens, the classifier is no longer directly aware of the origin of the label.

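Somewhat counter-intuitively, great care is needed to withhold information from the classifier, so that the model does not exploit the structure of the site and overfit the surrogate problem. Suppose the user has just searched for “taylor swift”. Because the problem is posed as predicting the next watched video, a classifier given this information will predict that the most likely next watches are the videos on the search results page for “taylor swift”. Unsurprisingly, reproducing the user's last search page as homepage recommendations performs very poorly. Once sequence information is discarded and search queries are represented as an unordered bag of tokens, the classifier is no longer directly aware of where the label came from.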

Natural consumption patterns of videos typically lead to very asymmetric co-watch probabilities. Episodic series are usually watched sequentially and users often discover artists in a genre beginning with the most broadly popular before focusing on smaller niches. We therefore found much better performance predicting the user’s next watch, rather than predicting a randomly held-out watch (Figure 5). Many collaborative filtering systems implicitly choose the labels and context by holding out a random item and predicting it from other items in the user’s history (5a). This leaks future information and ignores any asymmetric consumption patterns. In contrast, we “rollback” a user’s history by choosing a random watch and only input actions the user took before the held-out label watch (5b).

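Natural viewing patterns usually lead to very asymmetric co-watch probabilities. Episodic series are normally watched in order, and users often discover the most broadly popular artists in a genre first before moving on to smaller niches. Predicting the user's next watch therefore performs much better than predicting a randomly held-out watch (Figure 5). Many collaborative filtering systems choose the labels and context implicitly by holding out a random item and predicting it from the other items in the user's history (5a). This leaks future information and ignores any asymmetric consumption patterns.

In contrast, the paper “rolls back” a user's history by choosing a random watch as the label and feeding in only the actions the user took before that held-out label watch (5b).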
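
The difference between the two labeling schemes can be sketched as follows. Both function names are hypothetical, and the history is assumed to be ordered from oldest to newest with at least two watches.

```python
import random


def next_watch_example(watch_history, rng):
    """Rollback scheme (5b): the context contains only watches before the label."""
    label_pos = rng.randrange(1, len(watch_history))
    return watch_history[:label_pos], watch_history[label_pos]


def held_out_example(watch_history, rng):
    """Held-out scheme (5a): the context contains watches before and after the label."""
    label_pos = rng.randrange(len(watch_history))
    context = watch_history[:label_pos] + watch_history[label_pos + 1:]
    return context, watch_history[label_pos]


history = ["ep1", "ep2", "ep3", "ep4"]
context, label = next_watch_example(history, random.Random(0))
leaky_context, leaky_label = held_out_example(history, random.Random(0))
```

With the rollback scheme, later episodes never appear in the input when an earlier episode is the label, so no future information leaks into the context.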

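[Figure 5: choosing labels and input context. (5a) a randomly held-out watch predicted from the rest of the history; (5b) the next watch predicted from only the preceding actions]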

3.5 Experiments with Features and Depth

Adding features and depth significantly improves precision on holdout data as shown in Figure 6. In these experiments, a vocabulary of 1M videos and 1M search tokens were embedded with 256 floats each in a maximum bag size of 50 recent watches and 50 recent searches. The softmax layer outputs a multinomial distribution over the same 1M video classes with a dimension of 256 (which can be thought of as a separate output video embedding). These models were trained until convergence over all YouTube users, corresponding to several epochs over the data. Network structure followed a common “tower” pattern in which the bottom of the network is widest and each successive hidden layer halves the number of units (similar to Figure 3).

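As Figure 6 shows, adding features and depth significantly improves precision on holdout data. In these experiments a vocabulary of 1M videos and 1M search tokens was embedded with 256 floats each (256 dimensions per entry), with a maximum bag size of 50 recent watches and 50 recent searches.

The softmax layer outputs a multinomial distribution over the same 1M video classes with a dimension of 256 (which can be thought of as a separate output video embedding). The models were trained until convergence over all YouTube users, which corresponds to several epochs over the data. The network follows the common “tower” pattern, in which the bottom of the network is widest and each successive hidden layer halves the number of units (similar to Figure 3).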

The depth zero network is effectively a linear factorization scheme which performed very similarly to the predecessor system. Width and depth were added until the incremental benefit diminished and convergence became difficult:

Depth 0: A linear layer simply transforms the concatenation layer to match the softmax dimension of 256
Depth 1: 256 ReLU
Depth 2: 512 ReLU → 256 ReLU
Depth 3: 1024 ReLU → 512 ReLU → 256 ReLU
Depth 4: 2048 ReLU → 1024 ReLU → 512 ReLU → 256 ReLU

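The depth zero network is effectively a linear factorization scheme, and it performed very similarly to the predecessor system. Width and depth were then added until the incremental benefit diminished and convergence became difficult.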
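
The tower configurations listed above follow a simple rule: the last hidden layer has 256 units and each layer below it is twice as wide. A short sketch that reproduces the listed widths (the function name `tower_layer_widths` is an assumption):

```python
def tower_layer_widths(depth, output_width=256):
    """Hidden layer widths, widest first, halving down to output_width."""
    return [output_width * 2 ** k for k in range(depth - 1, -1, -1)]


# Depth 0 has no ReLU layers; only the linear projection to 256 remains.
configs = {depth: tower_layer_widths(depth) for depth in range(5)}
```

For example, `configs[3]` is `[1024, 512, 256]`, matching the depth 3 entry in the list.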

4 Ranking

References

https://blog.csdn.net/a819825294/article/details/71215538
https://zhuanlan.zhihu.com/p/25343518
https://zhuanlan.zhihu.com/p/25936140
https://blog.csdn.net/yujianmin1990/article/details/80640964
https://www.jianshu.com/p/f9d2abc486c9
https://www.cnblogs.com/hellojamest/p/11739865.html
https://blog.csdn.net/xiongjiezk/article/details/73445835

Candidate generation, in my own words

In the overall framework, candidate generation is the recall step. Its inputs are the video corpus (on the order of millions of videos) and the user history and context; its output is a few hundred candidate videos.

embedded video watches

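Every user has a sequence of watched videos, made up of video IDs in chronological order. The video IDs can be turned into vectors with a method similar to Word2Vec.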

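The method in detail:

Suppose the corpus contains $M$ videos, each denoted $i_m \in \{1, 2, 3, \ldots, M\}$. Initially each video $i_m$ can be represented as an $M$-dimensional one-hot vector.
Using a CBOW-style method, the sparse $M$-dimensional vectors are mapped to dense low-dimensional vectors (256 dimensions in the paper) that capture similarity.
For each user, the dense vectors of the videos in the watch sequence are combined by a weighted average, giving the final 256-dimensional feature that represents the user's watch history.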
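
The first two steps above rest on a simple fact: multiplying a one-hot vector by an embedding matrix is the same as looking up one row of that matrix. The sketch below checks this on a toy 3-by-2 matrix; the matrix values and helper names are made up for illustration.

```python
def one_hot(index, size):
    """One-hot vector of the given size with a 1.0 at position index."""
    return [1.0 if k == index else 0.0 for k in range(size)]


def embed_via_matmul(one_hot_vec, embedding_matrix):
    """Multiply a one-hot row vector by an M x N embedding matrix."""
    dim = len(embedding_matrix[0])
    return [
        sum(one_hot_vec[m] * embedding_matrix[m][k] for m in range(len(embedding_matrix)))
        for k in range(dim)
    ]


# Toy embedding matrix (assumed): M = 3 videos, N = 2 dimensions.
embedding_matrix = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
via_matmul = embed_via_matmul(one_hot(1, 3), embedding_matrix)
via_lookup = embedding_matrix[1]
```

In practice the lookup form is used, since building an $M$-dimensional one-hot vector for millions of videos would be wasteful.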

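Notes:

Why a weighted average rather than a plain average, and how would the weighting be done?

The paper mentions that the vectors in the sequence can also be combined by summing or by taking the component-wise max, among other strategies, and reports that plain averaging performed best among those it tried.

If weighting is used, the weights can follow rules such as how recent the watch is or how popular the video is.

Why CBOW rather than Skip-gram?

This follows from the recommendation setting: the center word is inferred from its context, rather than the context from the center word.

Are the embedding vectors fixed or trainable?

Whether embedding training should be part of training the final deep network is still debated. In the paper, the embeddings are trained as part of the overall DNN, with the same gradient descent. In engineering practice, however, it has reportedly been found that training the embeddings first and then feeding the frozen vectors into the DNN as raw features works better. This needs to be tested experimentally.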

embedded search tokens

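Users also search for the videos they want with keywords, which gives a sequence of search keywords. Handling the keywords involves some NLP, but the overall embedding process is the same as for embedded video watches.

The result is a 256-dimensional search vector that represents the user's search history.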

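example age

"example age" is read in these notes as a video upload time feature; the paper text itself describes it as the age of the training example.

The paper motivates the feature with "We consistently observe that users prefer fresh content, though not at the expense of relevance." In other words, users lean towards the newest videos, but not at the cost of relevance. It is computed as $t_{max} - t_N$, where $t_{max}$ is the largest publish timestamp among the videos in the training data and $t_N$ is the publish timestamp of the sample.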

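Other categorical features

Categorical features with many distinct values, such as geographic location, can be embedded and mapped to low-dimensional vectors.
Categorical features with few distinct values, such as gender and other demographic attributes, can use ordinary one-hot encoding.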

Other continuous features
Continuous features such as age are normalized to the range [0, 1].
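
One simple way to map a continuous feature into [0, 1] is min-max scaling, sketched below. This is an assumption for illustration: the notes do not say which normalization is used, and the bounds `lo` and `hi` would have to come from the training data.

```python
def min_max_normalize(value, lo, hi):
    """Scale value into [0, 1] given assumed bounds lo < hi, clipping outliers."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    clipped = min(max(value, lo), hi)
    return (clipped - lo) / (hi - lo)


age_feature = min_max_normalize(30.0, 0.0, 100.0)
```

Values outside the assumed range are clipped, so the network input always stays within [0, 1].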

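Modeling
All features are concatenated to form the input.
The deep model has 3 hidden layers with ReLU activations and 1024, 512 and 256 units respectively.
Offline training
The last layer outputs a 256-dimensional vector that serves as the user vector $u$. During training, classification is done with a softmax over the dot products:

$P(w_t=i|U,C)=\frac{e^{{v_i}^T \cdot u}}{\sum_j e^{{v_j}^T \cdot u}}$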