Alternating Least Squares Matrix Factorization
Hands-on Recommender System on PySpark
A recommender system is an information filtering tool that seeks to predict which products a user will like and, based on that prediction, recommends a few products to the user. For example, Amazon can recommend new items to buy, Netflix can recommend new movies to watch, and Google can recommend news that a user might be interested in. Two widely used approaches for building a recommender system are content-based filtering (CBF) and collaborative filtering (CF).
To understand the concept of recommender systems, let us look at an example. The table below shows the user-item utility matrix Y, where the value R_ui denotes the rating that user u gave to item i on a scale of 1–5. The missing entries (shown as ? in the table) are items that the respective user has not rated.
![Image for post](https://i-blog.csdnimg.cn/blog_migrate/221ba941a4064ec3f011c849812f1d3f.png)
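To make the utility matrix concrete, here is a minimal sketch in plain Python rather than PySpark. The user names, item names, and ratings are invented for illustration and are not the values from the table above; `None` stands in for the ? cells.

```python
# Toy utility matrix: one row per user, one column per item.
# None marks an item the user has not rated (the "?" cells).
# All names and values are made up for illustration.
items = ["item_a", "item_b", "item_c", "item_d"]
utility = {
    "user_1": [5, None, 3, None],
    "user_2": [None, 4, None, 1],
    "user_3": [2, None, None, 5],
}


def unrated_items(user):
    """Return the items this user has not rated yet."""
    return [item for item, r in zip(items, utility[user]) if r is None]


def observed_ratings():
    """Flatten the matrix into (user, item, rating) triples, skipping missing cells."""
    return [
        (user, item, r)
        for user, row in utility.items()
        for item, r in zip(items, row)
        if r is not None
    ]


print(unrated_items("user_1"))  # ['item_b', 'item_d']
print(len(observed_ratings()))  # 6
```

The flattened (user, item, rating) form is the shape a ratings table usually takes in practice, since storing only the observed cells avoids materializing the mostly empty matrix.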
The objective of the recommender system is to predict the ratings for these unrated items. The items with the highest predicted ratings can then be recommended to the respective users. In real-world problems, the utility matrix is expected to be very sparse, as each user encounters only a small fraction of the items in the vast pool of available options. The code for this project can be found here.
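The "predict, then recommend the highest rated items" step can be sketched as follows. This assumes some model has already produced predicted ratings for one user; the `predicted` scores and the helper name `top_n` are hypothetical and only illustrate the selection logic.

```python
import heapq


def top_n(predicted, already_rated, n=2):
    """Pick the n items with the highest predicted rating that the user has not rated."""
    candidates = {
        item: score for item, score in predicted.items() if item not in already_rated
    }
    return heapq.nlargest(n, candidates, key=candidates.get)


# Hypothetical model output for one user (item -> predicted rating).
predicted = {"item_a": 4.1, "item_b": 3.2, "item_c": 4.8, "item_d": 2.5}
already_rated = {"item_a"}

print(top_n(predicted, already_rated))  # ['item_c', 'item_b']
```

Items the user has already rated are filtered out first, so the recommendation list contains only new items.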
Explicit vs. Implicit Ratings
There are two ways to gather user preference data for recommending items. The first is to ask the user for explicit ratings, typically on a concrete rating scale (such as rating a movie from one to five stars), which makes it easier to extrapolate from the data to predict future ratings. The drawback of explicit data is that it puts the responsibility for data collection on the user, who may not want to take the time to enter ratings. Implicit data, on the other hand, is easy to collect in large quantities without any extra effort on the part of the user. Unfortunately, it is much more difficult to work with.
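The difference between the two kinds of data can be illustrated with a small sketch. The records below are invented: explicit data arrives as ratings on a fixed scale, whereas implicit data is assumed here to arrive as a raw event log (for example, song plays) that has to be aggregated into interaction counts before it can be used.

```python
from collections import Counter

# Explicit feedback: the user states a preference on a fixed scale.
explicit_ratings = [
    ("user_1", "movie_a", 4.0),
    ("user_2", "movie_a", 2.0),
]

# Implicit feedback: a log of (user, item) events with no stated preference.
play_events = [
    ("user_1", "song_x"),
    ("user_1", "song_x"),
    ("user_1", "song_y"),
    ("user_2", "song_x"),
]

# Aggregate the log into interaction counts per (user, item) pair.
play_counts = Counter(play_events)

print(play_counts[("user_1", "song_x")])  # 2
print(play_counts[("user_2", "song_y")])  # 0
```

A count of 0 only means that no interaction was observed. It does not say whether the user dislikes the item or has simply never seen it, which is one reason implicit data is harder to work with.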
Data Sparsity and Cold Start
Because each user rates only a small fraction of the available items, most entries of the utility matrix are missing; this is the data sparsity problem. The cold-start problem arises when a new user or a new item is added, since neither has any rating history. Sparsity, the percentage of user-item pairs that have no rating, can be calculated using the function below.
def get_mat_sparsity(ratings):
    # Count the total number of ratings in the dataset
    count_nonzero = ratings.select("rating").count()
    # Multiply the number of distinct userIds by the number of distinct movieIds
    total_elements = ratings.select("userId").distinct().count() * ratings.select("movieId").distinct().count()
    # Sparsity is the percentage of user-movie pairs that have no rating
    sparsity = (1.0 - (count_nonzero * 1.0) / total_elements) * 100
    print("The ratings dataframe is %.2f%% sparse." % sparsity)

get_mat_sparsity(ratings)
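To check the arithmetic without a Spark session, the same calculation can be reproduced on a toy list of ratings in plain Python. This is a sketch under stated assumptions: the triples are invented, there is at most one rating per (user, item) pair, and the helper names `sparsity_percent` and `cold_start_users` are hypothetical.

```python
def sparsity_percent(ratings):
    """Percentage of (user, item) cells with no rating.

    `ratings` is a list of (user_id, item_id, rating) triples, assumed to
    contain at most one rating per (user, item) pair.
    """
    users = {user for user, _, _ in ratings}
    items = {item for _, item, _ in ratings}
    total_cells = len(users) * len(items)
    return (1.0 - len(ratings) / total_cells) * 100


def cold_start_users(ratings, candidate_users):
    """Return the candidate users that have no rating history at all."""
    known_users = {user for user, _, _ in ratings}
    return [user for user in candidate_users if user not in known_users]


toy_ratings = [(1, 10, 4.0), (1, 20, 3.0), (2, 10, 5.0), (3, 30, 2.0)]

# 3 users x 3 items = 9 cells, 4 of them filled: (1 - 4/9) * 100
print("%.2f%% sparse" % sparsity_percent(toy_ratings))  # 55.56% sparse
print(cold_start_users(toy_ratings, [2, 3, 4]))          # [4]
```

Like the Spark function, this counts only users and items that appear at least once in the ratings, so a user or item with no history at all (the cold-start case) does not contribute to the sparsity figure.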
1. Dataset with Explicit Ratings (MovieLens)
MovieLens is a recommender system and virtual community website that uses collaborative filtering to recommend movies to its users based on their film preferences. The MovieLens 100M dataset is taken from the MovieLens website, which customizes user recommendations based on the ratings given by each user. To understand recommender systems better, we will work with this dataset. This dataset can be downloaded from here.
There are two tables, movies and ratings, which contain the fields MovieID::Genre::Title and UserID::MovieID::Rating::Timestamp respectively.
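As a rough illustration of the `::`-delimited layout described above, a ratings record can be split into typed fields as shown below. The sample lines are invented, and the `Rating` type and `parse_rating` helper are hypothetical names used only for this sketch; it uses plain Python rather than PySpark.

```python
from typing import NamedTuple


class Rating(NamedTuple):
    user_id: int
    movie_id: int
    rating: float
    timestamp: int


def parse_rating(line):
    """Parse one 'UserID::MovieID::Rating::Timestamp' line into a Rating."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return Rating(int(user_id), int(movie_id), float(rating), int(timestamp))


# Made-up sample lines in the UserID::MovieID::Rating::Timestamp layout.
sample_lines = [
    "1::10::5::1000000000\n",
    "2::10::3::1000000500\n",
]

ratings = [parse_rating(line) for line in sample_lines]
print(ratings[0].movie_id, ratings[0].rating)  # 10 5.0
```

Converting the fields to numeric types at parse time keeps later steps, such as counting distinct users or averaging ratings, free of string handling.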
Let’s