Spark MLlib, Part 3: Collaborative Filtering

岸芷汀兰whu

Categories: Big Data, spark. Tags: spark, MLlib

Article link: https://blog.csdn.net/u012432611/article/details/50506380

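Collaborative filtering is commonly used in recommender systems. It aims to fill in the missing entries of a user-item association matrix. Spark currently supports model-based collaborative filtering, in which users and products are described by a small set of latent factors that can be used to predict the missing entries. spark.mllib learns these latent factors with the alternating least squares (ALS) algorithm, which has the following parameters: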

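numBlocks is the number of blocks used to parallelize computation (set to -1 to auto-configure).
rank is the number of latent factors in the model.
iterations is the number of ALS iterations to run.
lambda specifies the regularization parameter in ALS.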
implicitPrefs specifies whether to use the explicit feedback ALS variant or one adapted for implicit feedback data.
alpha is a parameter applicable to the implicit feedback variant of ALS that governs the baseline confidence in preference observations.
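
To make the role of rank and the latent factors more concrete, here is a minimal sketch in plain Scala (no Spark required). It assumes the usual matrix factorization view, in which a missing entry is predicted as the dot product of a user's factor vector and a product's factor vector, both of length rank. The object name LatentFactorSketch, the function predict, and the factor values are hypothetical and only for illustration; they are not learned by ALS and are not part of the Spark API.

```scala
object LatentFactorSketch {
  // Predict a user-item entry as the dot product of the two latent factor vectors.
  // Both vectors must have the same length, which corresponds to `rank`.
  def predict(userFactors: Array[Double], productFactors: Array[Double]): Double = {
    require(userFactors.length == productFactors.length,
      "user and product factor vectors must both have length rank")
    userFactors.zip(productFactors).map { case (u, p) => u * p }.sum
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical factors for rank = 2 (made-up numbers, not learned by ALS).
    val userFactors = Array(1.0, 0.5)
    val productFactors = Array(4.0, 2.0)
    // 1.0 * 4.0 + 0.5 * 2.0
    println("Predicted rating = " + predict(userFactors, productFactors))
  }
}
```

A larger rank gives each user and product more numbers to describe them, at the cost of more computation and a higher risk of overfitting.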

Explicit vs. implicit feedback

The standard approach to matrix factorization based collaborative filtering treats the entries in the user-item matrix as explicit preferences given by the user to the item.

It is common in many real-world use cases to only have access to implicit feedback (e.g. views, clicks, purchases, likes, shares etc.). The approach used in spark.mllib to deal with such data is taken from Collaborative Filtering for Implicit Feedback Datasets. Essentially instead of trying to model the matrix of ratings directly, this approach treats the data as a combination of binary preferences and confidence values. The ratings are then related to the level of confidence in observed user preferences, rather than explicit ratings given to items. The model then tries to find latent factors that can be used to predict the expected preference of a user for an item.
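
The split into binary preferences and confidence values can be sketched in plain Scala as follows. This assumes the linear form proposed in the cited paper, where the preference is 1 if the raw observation is positive and 0 otherwise, and the confidence is 1 + alpha * r. The names ImplicitFeedbackSketch, Observation, and toObservation are hypothetical, and the exact treatment inside spark.mllib may differ in detail (for example, for negative values).

```scala
object ImplicitFeedbackSketch {
  // A raw implicit signal split into "does the user like it" and "how sure are we".
  final case class Observation(preference: Double, confidence: Double)

  // Assumed mapping: preference = 1 if r > 0 else 0, confidence = 1 + alpha * r.
  def toObservation(r: Double, alpha: Double): Observation = {
    val preference = if (r > 0) 1.0 else 0.0
    val confidence = 1.0 + alpha * r
    Observation(preference, confidence)
  }

  def main(args: Array[String]): Unit = {
    val alpha = 2.0 // hypothetical value, chosen only to keep the arithmetic simple
    // Three clicks: preference 1, with higher confidence than a single click.
    println(toObservation(3.0, alpha))
    // No interaction: preference 0, with only the baseline confidence of 1.
    println(toObservation(0.0, alpha))
  }
}
```

Under this assumption, alpha controls how quickly confidence grows with the number of observed interactions, while unobserved pairs keep a small baseline confidence instead of being ignored.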

Scaling of the regularization parameter

Since v1.1, we scale the regularization parameter lambda in solving each least squares problem by the number of ratings the user generated in updating user factors, or the number of ratings the product received in updating product factors. This approach is named “ALS-WR” and discussed in the paper “Large-Scale Parallel Collaborative Filtering for the Netflix Prize”. It makes lambda less dependent on the scale of the dataset. So we can apply the best parameter learned from a sampled subset to the full dataset and expect similar performance.
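
The effect of this scaling can be seen in a deliberately simplified sketch: a single user and rank 1, so every factor is one number. Minimizing sum_i (r_i - u * v_i)^2 + lambda * n * u^2 over u, where n is the number of ratings the user generated, gives u = sum_i(v_i * r_i) / (sum_i(v_i^2) + lambda * n). The names AlsWrSketch and updateUserFactor and the sample numbers are hypothetical; this is not the spark.mllib implementation, which solves the same kind of problem for full rank-sized vectors.

```scala
object AlsWrSketch {
  // Rank-1 closed-form update of one user's factor with lambda scaled by the rating count.
  def updateUserFactor(ratings: Seq[Double], itemFactors: Seq[Double], lambda: Double): Double = {
    require(ratings.nonEmpty && ratings.length == itemFactors.length,
      "need one item factor per rating")
    val n = ratings.length // number of ratings this user generated
    val numerator = itemFactors.zip(ratings).map { case (v, r) => v * r }.sum
    val denominator = itemFactors.map(v => v * v).sum + lambda * n
    numerator / denominator
  }

  def main(args: Array[String]): Unit = {
    val ratings = Seq(4.0, 2.0)     // made-up ratings
    val itemFactors = Seq(2.0, 1.0) // made-up item factors
    val small = updateUserFactor(ratings, itemFactors, 0.5)
    // Doubling the data leaves the solution unchanged, because lambda * n doubles too.
    val large = updateUserFactor(ratings ++ ratings, itemFactors ++ itemFactors, 0.5)
    println("factor on small data = " + small)
    println("factor on doubled data = " + large)
  }
}
```

With an unscaled penalty (lambda instead of lambda * n), the doubled dataset would yield a different factor, which is why a lambda tuned on a sample would not transfer as well to the full data.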

Example

import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.mllib.recommendation.Rating

// Load and parse the data
val data = sc.textFile("data/mllib/als/test.data")
val ratings = data.map(_.split(',') match { case Array(user, item, rate) =>
  Rating(user.toInt, item.toInt, rate.toDouble)
})

// Build the recommendation model using ALS
val rank = 10
val numIterations = 10
val model = ALS.train(ratings, rank, numIterations, 0.01)

// Evaluate the model on rating data
val usersProducts = ratings.map { case Rating(user, product, rate) =>
  (user, product)
}
val predictions =
  model.predict(usersProducts).map { case Rating(user, product, rate) =>
    ((user, product), rate)
  }
val ratesAndPreds = ratings.map { case Rating(user, product, rate) =>
  ((user, product), rate)
}.join(predictions)
val MSE = ratesAndPreds.map { case ((user, product), (r1, r2)) =>
  val err = (r1 - r2)
  err * err
}.mean()
println("Mean Squared Error = " + MSE)

// Save and load model
model.save(sc, "target/tmp/myCollaborativeFilter")
val sameModel = MatrixFactorizationModel.load(sc, "target/tmp/myCollaborativeFilter")

Find full example code at “examples/src/main/scala/org/apache/spark/examples/mllib/RecommendationExample.scala” in the Spark repo.

If the rating matrix is derived from another source of information (e.g., it is inferred from other signals), you can use the trainImplicit method to get better results.

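// alpha governs the baseline confidence in implicit preference observations
val alpha = 0.01
val lambda = 0.01
// Use a new name so this does not clash with the explicit-feedback model defined earlier
val implicitModel = ALS.trainImplicit(ratings, rank, numIterations, lambda, alpha)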
