SVD++

MAE: mean absolute error, the average magnitude of the prediction errors.

RMSE: root mean squared error, akin to the standard deviation of the prediction errors; it penalizes large errors more heavily.
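A minimal sketch of the two error metrics (pure Python, hypothetical ratings):

```python
import math

def mae(actual, predicted):
    """Mean Absolute Error: the average magnitude of prediction errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root Mean Squared Error: penalizes large errors more heavily than MAE."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical ratings on a 1-5 star scale.
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]
print(mae(actual, predicted))   # 0.625
print(rmse(actual, predicted))  # 0.75
```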

Neighborhood methods are centered on computing the relationships between items or, alternatively, between users. An item-oriented approach evaluates the preference of a user to an item based on ratings of similar items by the same user.

These systems typically rely on Collaborative Filtering (CF), where past transactions are analyzed in order to establish connections between users and products. The two most successful approaches to CF are latent factor models, which directly profile both users and products, and neighborhood models, which analyze similarities between products or users.

Latent factor models, such as Singular Value Decomposition (SVD), comprise an alternative approach by transforming both items and users to the same latent factor space, thus making them directly comparable. The latent space tries to explain ratings by characterizing both products and users on factors automatically inferred from user feedback. For example, when the products are movies, factors might measure obvious dimensions such as comedy vs. drama.

In this work we suggest a combined model that improves prediction accuracy by capitalizing on the advantages of both neighborhood and latent factor approaches. To our best knowledge, this is the first time that a single model has integrated the two approaches.

In fact, some past works (e.g., [2, 4]) recognized the utility of combining those approaches. However, they suggested post-processing the factorization results, rather than a unified model where neighborhood and factor information are considered symmetrically.
Another lesson learnt from the Netflix Prize competition is the importance of integrating different forms of user input into the models [3]. Recommender systems rely on different types of input. Most convenient is the high quality explicit feedback, which includes explicit input by users regarding their interest in products.
For example, Netflix collects star ratings for movies and TiVo users indicate their preferences for TV shows by hitting thumbs-up/down buttons. However, explicit feedback is not always available. Thus, recommenders can infer user preferences from the more abundant implicit feedback, which indirectly reflect opinion through observing user behavior [16]. Types of implicit feedback include purchase history, browsing history, search patterns, or even mouse movements. For example, a user that purchased many books by the same author probably likes that author. Our main focus is on cases where explicit feedback is available. Nonetheless, we recognize the importance of implicit feedback, which can illuminate users that did not provide enough explicit feedback. Hence, our models integrate explicit and implicit feedback.

The structure of the rest of the paper is as follows. We start with preliminaries and related work in Sec. 2. Then, we describe a new, more accurate neighborhood model in Sec. 3. The new model is based on an optimization framework that allows smooth integration with latent factor models, and also inclusion of implicit user feedback. Section 4 revisits SVD-based latent factor models while introducing useful extensions. These extensions include a factor model that allows explaining the reasoning behind recommendations. Such explainability is important for practical systems [11, 23] and known to be problematic with latent factor models.
The methods introduced in Sec. 3-4 are linked together in Sec. 5, through a model that integrates neighborhood and factor models within a single framework. Relevant experimental results are brought within each section. In addition, we suggest a new methodology to evaluate effectiveness of the models, as described in Sec. 6, with encouraging results.

We reserve special indexing letters for distinguishing users from items: for users u, v, and for items i, j. A rating rui indicates the preference by user u of item i, where high values mean stronger preference. For example, values can be integers ranging from 1 (star) indicating no interest to 5 (stars) indicating a strong interest.
We distinguish predicted ratings from known ones, by using the notation r̂ui for the predicted value of rui. The (u, i) pairs for which rui is known are stored in the set K = {(u, i) | rui is known}.
Usually the vast majority of ratings are unknown. For example, in the Netflix data 99% of the possible ratings are missing. In order to combat overfitting the sparse rating data, models are regularized so estimates are shrunk towards baseline defaults. Regularization is controlled by constants which are denoted as: λ1, λ2, . . . Exact values of these constants are determined by cross validation. As they grow, regularization becomes heavier.

Baseline estimates
Typical CF data exhibit large user and item effects – i.e., systematic tendencies for some users to give higher ratings than others,
and for some items to receive higher ratings than others. It is customary to adjust the data by accounting for these effects, which we encapsulate within the baseline estimates. Denote by µ the overall average rating. A baseline estimate for an unknown rating rui is denoted by bui and accounts for the user and item effects:

$$b_{ui} = \mu + b_u + b_i \tag{1}$$

The parameters bu and bi indicate the observed deviations of user u and item i, respectively, from the average. For example, suppose that we want a baseline estimate for the rating of the movie Titanic by user Joe. Now, say that the average rating over all movies, µ, is 3.7 stars. Furthermore, Titanic is better than an average movie, so it tends to be rated 0.5 stars above the average. On the other hand, Joe
is a critical user, who tends to rate 0.3 stars lower than the average. Thus, the baseline estimate for Titanic’s rating by Joe would be 3.9 stars by calculating 3.7 − 0.3 + 0.5. In order to estimate bu and bi one can solve the least squares problem:

$$\min_{b_*} \sum_{(u,i) \in K} \left(r_{ui} - \mu - b_u - b_i\right)^2 + \lambda_1 \left(\sum_u b_u^2 + \sum_i b_i^2\right) \tag{2}$$
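The worked Titanic/Joe example above can be checked directly; the values µ = 3.7, b_Joe = −0.3 and b_Titanic = 0.5 all come from the text:

```python
def baseline_estimate(mu, b_u, b_i):
    """Baseline estimate b_ui = mu + b_u + b_i (equation (1))."""
    return mu + b_u + b_i

# Joe's baseline estimate for Titanic: 3.7 - 0.3 + 0.5 = 3.9 stars.
print(round(baseline_estimate(3.7, -0.3, 0.5), 1))  # 3.9
```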

The most common approach to CF is based on neighborhood models. Its original form, which was shared by virtually all earlier
CF systems, is user-oriented; see [12] for a good analysis. Such
user-oriented methods estimate unknown ratings based on recorded ratings of like minded users. Later, an analogous item-oriented approach [15, 21] became popular. In those methods, a rating is estimated using known ratings made by the same user on similar items. Better scalability and improved accuracy make the item-oriented approach more favorable in many cases [2, 21, 22]. In addition, item-oriented methods are more amenable to explaining the reasoning behind predictions. This is because users are familiar with items previously preferred by them, but do not know those allegedly like minded users. Thus, our focus is on item-oriented approaches, but parallel techniques can be developed in a user-oriented fashion, by switching the roles of users and items.
Central to most item-oriented approaches is a similarity measure between items. Frequently, it is based on the Pearson correlation coefficient, ρij , which measures the tendency of users to rate items i and j similarly. Since many ratings are unknown, it is expected that some items share only a handful of common raters. Computation of the correlation coefficient is based only on the common user
support. Accordingly, similarities based on a greater user support are more reliable. An appropriate similarity measure, denoted by sij , would be a shrunk correlation coefficient:
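Consistent with the definitions above (nij denotes the number of common raters, and λ2 is a regularization constant), the shrunk correlation coefficient takes the form:

$$s_{ij} = \frac{n_{ij}}{n_{ij} + \lambda_2}\,\rho_{ij}$$

so that similarities supported by only a handful of common raters are shrunk towards zero.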

The variable nij denotes the number of users that rated both i and j. A typical value for λ2 is 100. Notice that the literature suggests additional alternatives for a similarity measure [21, 22]. Our goal is to predict rui – the unobserved rating by user u for item i. Using the similarity measure, we identify the k items rated by u which are most similar to i. This set of k neighbors is denoted by Sk(i; u). The predicted value of rui is taken as a weighted average of the ratings of neighboring items, while adjusting for user and item effects through the baseline estimates:
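The weighted-average rule just described (equation (3) of the paper) can be written as:

$$\hat{r}_{ui} = b_{ui} + \frac{\sum_{j \in S^k(i;u)} s_{ij}\,(r_{uj} - b_{uj})}{\sum_{j \in S^k(i;u)} s_{ij}} \tag{3}$$

A minimal sketch of this rule, using hypothetical similarities and ratings:

```python
def predict_knn(b_ui, neighbors):
    """k-NN prediction: baseline estimate plus a weighted average of the
    baseline-adjusted ratings of neighboring items (equation (3)).

    neighbors: list of (s_ij, r_uj, b_uj) tuples for the k items
    most similar to i that user u has rated.
    """
    numerator = sum(s * (r - b) for s, r, b in neighbors)
    denominator = sum(s for s, _, _ in neighbors)
    return b_ui + numerator / denominator

# Hypothetical example: baseline estimate 3.9 stars, two rated neighbors.
print(predict_knn(3.9, [(0.4, 5.0, 4.2), (0.2, 3.0, 3.5)]))
```

Note that the interpolation weights here sum to one by construction, which is exactly the limitation raised below.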

Neighborhood-based methods of this form became very popular because they are intuitive and relatively simple to implement.
However, in a recent work [2], we raised a few concerns about such neighborhood schemes. Most notably, these methods are not justified by a formal model. We also questioned the suitability of a similarity measure that isolates the relations between two items, without analyzing the interactions within the full set of neighbors. In addition, the fact that interpolation weights in (3) sum to one forces the
method to fully rely on the neighbors even in cases where neighborhood information is absent (i.e., user u did not rate items similar to i), and it would be preferable to rely on baseline estimates.
This led us to propose a more accurate neighborhood model, which overcomes these difficulties. Given a set of neighbors Sk(i; u), we need to compute interpolation weights {θuij | j ∈ Sk(i; u)} that enable the best prediction rule of the form:
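Consistent with the notation {θuij | j ∈ Sk(i; u)}, this prediction rule is:

$$\hat{r}_{ui} = b_{ui} + \sum_{j \in S^k(i;u)} \theta^{u}_{ij}\,(r_{uj} - b_{uj}) \tag{4}$$

Here the weights are learned by solving an optimization problem rather than taken directly from the similarity measure, and they need not sum to one, so the model can fall back towards the baseline estimate when neighborhood information is weak.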
