交替最小二乘矩阵分解_使用交替最小二乘矩阵分解与pyspark建立推荐系统

交替最小二乘矩阵分解

pyspark上的动手推荐系统 (Hands-on recommender system on pyspark)

Recommender System is an information filtering tool that seeks to predict which product a user will like, and based on that, recommends a few products to the users. For example, Amazon can recommend new shopping items to buy, Netflix can recommend new movies to watch, and Google can recommend news that a user might be interested in. The two widely used approaches for building a recommender system are the content-based filtering (CBF) and collaborative filtering (CF).

推荐系统是一种信息过滤工具,旨在预测用户喜欢的产品,并在此基础上向用户推荐一些产品。 例如,Amazon可以推荐要购买的新购物商品,Netflix可以推荐要观看的新电影,而Google可以推荐用户可能感兴趣的新闻。构建推荐系统的两种广泛使用的方法是基于内容的过滤( CBF)和协作过滤(CF)。

To understand the concept of recommender systems, let us look at an example. The below table shows the user-item utility matrix Y where the value Rui denotes how item i has been rated by user u on a scale of 1–5. The missing entries (shown by ? in Table) are the items that have not been rated by the respective user.

为了理解推荐系统的概念,让我们看一个例子。 下表显示了用户项效用矩阵 Y,其中Rui值表示用户i如何以1-5的等级对项i进行评分。 缺少的条目(在表中用显示)是尚未由相应用户评分的项目。

Image for post
utility matrix 效用矩阵

The objective of the recommender system is to predict the ratings for these items. Then the highest rated items can be recommended to the respective users. In real world problems, the utility matrix is expected to be very sparse, as each user only encounters a small fraction of items among the vast pool of options available. The code for this project can be found here.

推荐系统的目的是预测这些项目的评级。 然后可以向各个用户推荐评分最高的项目。 在现实世界中,效用矩阵非常稀疏,因为每个用户在大量可用选项中仅​​会遇到一小部分项目。 该项目的代码可以在这里找到。

显式与隐式评级 (Explicit v.s. Implicit ratings)

There are two ways to gather user preference data to recommend items, the first method is to ask for explicit ratings from a user, typically on a concrete rating scale (such as rating a movie from one to five stars) making it easier to make extrapolations from data to predict future ratings. However, the drawback with explicit data is that it puts the responsibility of data collection on the user, who may not want to take time to enter ratings. On the other hand, implicit data is easy to collect in large quantities without any extra effort on the part of the user. Unfortunately, it is much more difficult to work with.

有两种收集用户偏好数据以推荐项目的方法,第一种方法是要求用户提供明确的评分 ,通常以具体的评分标准(例如,将电影从一星评为五星),使推断更容易从数据中预测未来的收视率。 但是,显式数据的缺点是将数据收集的责任交给了用户,而用户可能不想花时间输入评分。 另一方面, 隐式数据易于大量收集,而无需用户付出任何额外的努力。 不幸的是,要处理它要困难得多。

数据稀疏和冷启动 (Data Sparsity and Cold Start)

In real world problems, the utility matrix is expected to be very sparse, as each user only encounters a small fraction of items among the vast pool of options available. Cold-Start problem can arise during addition of a new user or a new item where both do not have history in terms of ratings. Sparsity can be calculated using the below function.

在现实世界中,效用矩阵非常稀疏,因为每个用户在大量可用选项中仅​​会遇到一小部分项目。 在添加新用户或新项目时,如果两者都没有评级历史记录,则会出现冷启动问题。 稀疏度可以使用以下函数计算。

def get_mat_sparsity(ratings):
    # Count the total number of ratings in the dataset
    count_nonzero = ratings.select("rating").count()


    # Count the number of distinct userIds and distinct movieIds
    total_elements = ratings.select("userId").distinct().count() * ratings.select("movieId").distinct().count()


    # Divide the numerator by the denominator
    sparsity = (1.0 - (count_nonzero *1.0)/total_elements)*100
    print("The ratings dataframe is ", "%.2f" % sparsity + "% sparse.")
    
get_mat_sparsity(ratings)

1.具有显式评分的数据集(MovieLens) (1. Dataset with Explicit Ratings (MovieLens))

MovieLens is a recommender system and virtual community website that recommends movies for its users to watch, based on their film preferences using collaborative filtering. MovieLens 100M datatset is taken from the MovieLens website, which customizes user recommendation based on the ratings given by the user. To understand the concept of recommendation system better, we will work with this dataset. This dataset can be downloaded from here.

MovieLens是一个推荐器系统和虚拟社区网站,它基于用户使用协作筛选的电影偏好来推荐电影供用户观看。 MovieLens 100M数据集来自MovieLens网站,该网站根据用户给出的等级来自定义用户推荐。 为了更好地理解推荐系统的概念,我们将使用此数据集。 可以从此处下载该数据集。

There are 2 tuples, movies and ratings which contains variables such as MovieID::Genre::Title and UserID::MovieID::Rating::Timestamp respectively.

有2个元组,电影和等级,其中分别包含MovieID :: Genre :: Title和UserID :: MovieID :: Rating :: Timestamp等变量。

Let’s

  • 0
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值