Machine Learning for Building a Recommender System in Python

This article shows how to build a model-based collaborative filtering recommender system in Python using the scikit Surprise library and the Kaggle Netflix prize dataset. It moves from an overview of collaborative filtering to implementing a recommender system in Python, including data loading, model training and evaluation, and product recommendation, and demonstrates using an SVD model to predict ratings and recommend movies to a user.

Recommender systems are widely used in product recommendations such as recommendations of music, movies, books, news, research articles, restaurants, etc. [1][5][9][10].

There are two popular methods for building recommender systems:

The collaborative filtering method [5][10] predicts (filters) a user's interest in a product by collecting preference information from many other users (collaborating). The assumption behind collaborative filtering is that if a person P1 has the same opinion as another person P2 on an issue, P1 is more likely to share P2's opinion on a different issue than to share that of a randomly chosen person [5].

The content-based filtering method [6][9] utilizes product features/attributes to recommend other products similar to what the user likes, based on the user's previous actions or explicit feedback such as ratings on products.

A recommender system may use either or both of these two methods.

In this article, I use the Kaggle Netflix prize data [2] to demonstrate how to use the model-based collaborative filtering method to build a recommender system in Python.

The rest of the article is arranged as follows:

  • Overview of collaborative filtering

  • Build recommender system in Python

  • Summary

1. Overview of Collaborative Filtering

As described in [5], the main idea behind collaborative filtering is that one person often gets the best recommendations from another with similar interests. Collaborative filtering uses various techniques to match people with similar interests and make recommendations based on shared interests.

The high-level workflow of a collaborative filtering system can be described as follows:

  • A user rates items (e.g., movies, books) to express his or her preferences on the items

  • The system treats the ratings as an approximate representation of the user’s interest in items

  • The system matches this user’s ratings with other users’ ratings and finds the people with the most similar ratings

  • The system recommends items that similar users have rated highly but that this user has not yet rated

Typically a collaborative filtering system recommends products to a given user in two steps [5]:

  • Step 1: Look for people who share the same rating patterns with the given user

  • Step 2: Use the ratings from the people found in step 1 to calculate a prediction of a rating by the given user on a product

This is called user-based collaborative filtering. One specific implementation of this method is the user-based Nearest Neighbor algorithm.

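To make this concrete, below is a minimal, illustrative sketch of user-based nearest-neighbour prediction on a toy ratings matrix using cosine similarity. It is not the method used later in this article, and the matrix values are made up for illustration:

import numpy as np

# Toy ratings matrix: rows = users, columns = items, 0 = not yet rated (illustrative values).
R = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
])

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def predict_user_based(R, user, item):
    # Weight every other user's rating of `item` by that user's similarity to `user`.
    sims, ratings = [], []
    for other in range(R.shape[0]):
        if other != user and R[other, item] > 0:
            sims.append(cosine_sim(R[user], R[other]))
            ratings.append(R[other, item])
    return np.dot(sims, ratings) / np.sum(sims)

print(predict_user_based(R, user=0, item=2))  # predicted rating of user 0 on item 2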

As an alternative, item-based collaborative filtering (e.g., users who are interested in x are also interested in y) works in an item-centric manner (a toy sketch follows the two steps below):

  • Step 1: Build an item-item matrix of the rating relationships between pairs of items

  • Step 2: Predict the rating of the current user on a product by examining the matrix and matching that user’s rating data

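The toy sketch below illustrates step 1 with pandas: pivot a (user, item, rating) table into a user-item matrix and compute an item-item similarity matrix via pairwise correlation between item columns. All values and names are made up for illustration, and this is not the model used later in the article:

import pandas as pd

# Toy ratings table (illustrative values): one row per (user, item, rating).
ratings = pd.DataFrame({
    'user':   [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'item':   ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
    'rating': [5, 4, 1, 4, 5, 2, 1, 2, 5],
})

# Step 1: pivot to a user-item matrix, then compute item-item similarities
# (here, Pearson correlation between item columns).
user_item = ratings.pivot_table(index='user', columns='item', values='rating')
item_item = user_item.corr()
print(item_item)

# Step 2 would look up, for a target item, its similarities to the items the current
# user has already rated and combine those similarities with the user's ratings.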

There are two types of collaborative filtering systems:

  • Model-based

  • Memory-based

In a model-based system, we develop models using different machine learning algorithms to predict users' ratings of unrated items [5]. There are many model-based collaborative filtering algorithms, such as matrix factorization algorithms (e.g., singular value decomposition (SVD) and the Alternating Least Squares (ALS) algorithm [8]), Bayesian networks, clustering models, etc. [5].

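For intuition, matrix factorization models such as the SVD algorithm used later in this article represent each user and each item by a small vector of latent factors and predict a rating as a dot product plus bias terms. A minimal sketch of that prediction rule (all values and names below are illustrative):

import numpy as np

def predict_rating(global_mean, user_bias, item_bias, user_factors, item_factors):
    # r_hat(u, i) = mu + b_u + b_i + q_i . p_u  -- the SVD-style prediction rule
    return global_mean + user_bias + item_bias + np.dot(item_factors, user_factors)

# A 3-factor example with made-up values.
print(predict_rating(3.6, 0.2, -0.1, np.array([0.5, 1.0, -0.2]), np.array([0.8, 0.3, 0.1])))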

A memory-based system uses users' rating data to compute the similarity between users or items. Typical examples of this type of system are the neighbourhood-based method and item-based/user-based top-N recommendations [5].

This article describes how to build a model-based collaborative filtering system using the SVD model.

2. Build Recommender System in Python

This section describes how to build a recommender system in Python.

2.1 Installing the Library

There are multiple Python libraries available (e.g., Python scikit Surprise [7], the Spark RDD-based API for collaborative filtering [8]) for building recommender systems. I use the Python scikit Surprise library in this article for demonstration purposes.

The Surprise library can be installed as follows:

pip install scikit-surprise

2.2 Loading Data

As described before, I use the Kaggle Netflix prize data [2] in this article. There are multiple data files for different purposes; the following files are used here:

Training data files:

  • combined_data_1.txt

  • combined_data_2.txt

  • combined_data_3.txt

  • combined_data_4.txt

Movie titles data file:

  • movie_titles.csv

The training dataset is too big to be handled on a laptop. Thus I only load the first 100,000 records from each of the training data files for demonstration purposes.

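For context, each combined_data_*.txt file groups ratings by movie: a line such as 1: introduces a movie id, and the lines that follow contain CustomerID,Rating,Date records for that movie until the next movie-id line. The values below are made up purely to illustrate the layout:

1:
30878,4,2005-12-26
2647871,4,2005-12-30
2:
1952305,3,2004-07-29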

Once the training data files have been downloaded onto a local machine, the first 100,000 records from each file can be loaded into memory as Pandas DataFrames as follows:

import pandas as pd
import numpy as np

def readFile(file_path, rows=100000):
    data_dict = {'Cust_Id': [], 'Movie_Id': [], 'Rating': [], 'Date': []}
    f = open(file_path, "r")
    count = 0
    for line in f:
        count += 1
        if count > rows:
            break

        if ':' in line:
            # A line such as "1:" introduces the movie id for the ratings that follow
            movieId = int(line[:-2])  # strip the trailing ':' and newline
        else:
            customerID, rating, date = line.split(',')
            data_dict['Cust_Id'].append(int(customerID))  # cast to int so it matches the ids used for prediction later
            data_dict['Movie_Id'].append(movieId)
            data_dict['Rating'].append(rating)
            data_dict['Date'].append(date.rstrip("\n"))
    f.close()

    return pd.DataFrame(data_dict)

df1 = readFile('./data/netflix/combined_data_1.txt', rows=100000)
df2 = readFile('./data/netflix/combined_data_2.txt', rows=100000)
df3 = readFile('./data/netflix/combined_data_3.txt', rows=100000)
df4 = readFile('./data/netflix/combined_data_4.txt', rows=100000)

df1['Rating'] = df1['Rating'].astype(float)
df2['Rating'] = df2['Rating'].astype(float)
df3['Rating'] = df3['Rating'].astype(float)
df4['Rating'] = df4['Rating'].astype(float)

The resulting DataFrames for the different portions of the training data are combined into one as follows:

df = pd.concat([df1, df2, df3, df4])  # DataFrame.append has been removed in recent pandas versions
df.index = np.arange(0, len(df))
df.head(10)
[Output of df.head(10): the first 10 rows of the combined ratings DataFrame]

The movie titles file can be loaded into memory as a Pandas DataFrame:

df_title = pd.read_csv('./data/netflix/movie_titles.csv', encoding = "ISO-8859-1", header = None, names = ['Movie_Id', 'Year', 'Name'])
df_title.head(10)
[Output of df_title.head(10): the first 10 rows of the movie titles DataFrame]

2.3 Training and Evaluating the Model

The Dataset module in Surprise provides different methods for loading data from files, Pandas DataFrames, or built-in datasets such as ml-100k (MovieLens 100k) [4]:

  • Dataset.load_builtin()

  • Dataset.load_from_file()

  • Dataset.load_from_df()

I use the load_from_df() method to load data from a Pandas DataFrame in this article.

The Reader class in Surprise is used to parse a file containing users, items, and users' ratings on items. The default format stores each rating on a separate line, with fields separated by spaces, in the following order: user item rating

This order and the separator are configurable using the following parameters:

  • line_format is a string like “item user rating” to indicate the order of the data with field names separated by a space

  • sep is used to specify separator between fields, such as space, ‘,’, etc.

  • rating_scale is to specify the rating scale. The default value is (1, 5)

  • skip_lines is to indicate the number of lines to skip at the beginning of the file and the default is 0

I use the default settings in this article. The user, item, and rating fields correspond to the Cust_Id, Movie_Id, and Rating columns of the DataFrame, respectively.

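For reference, a Reader configured explicitly rather than through the defaults might look like the sketch below. Note that when the data is loaded from a Pandas DataFrame with load_from_df(), only the rating_scale parameter is actually used:

from surprise import Reader

# Explicit configuration matching the defaults described above; shown for illustration only,
# since load_from_df() takes only rating_scale into account.
reader = Reader(line_format='user item rating', sep=',', rating_scale=(1, 5), skip_lines=0)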

The Surprise library [7] contains the implementation of multiple models/algorithms for building recommender systems such as SVD, Probabilistic Matrix Factorization (PMF), Non-negative Matrix Factorization (NMF), etc. The SVD model is used in this article.

The following code loads the data from the Pandas DataFrame and creates an SVD model instance:

from surprise import Reader, Dataset, SVD
from surprise.model_selection.validation import cross_validate

reader = Reader()
data = Dataset.load_from_df(df[['Cust_Id', 'Movie_Id', 'Rating']], reader)
svd = SVD()

Once the data and model for product recommendation are ready, the model can be evaluated using cross-validation as follows:

# Run 5-fold cross-validation and print results
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

The following are the results of the cross validation of the SVD model:

[Cross-validation output: RMSE and MAE for each of the 5 folds, with their means and standard deviations]

Once the model has been evaluated to our satisfaction, we can re-train it using the entire training dataset:

trainset = data.build_full_trainset()
svd.fit(trainset)

2.4 Recommending Products

After a recommendation model has been trained appropriately, it can be used for prediction.

For example, given a user (e.g., Customer Id 785314), we can use the trained model to predict the ratings given by the user on different products (i.e., Movie titles):

titles = df_title.copy()
titles['Estimate_Score'] = titles['Movie_Id'].apply(lambda x: svd.predict(785314, x).est)
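
For a single title, svd.predict() can also be called directly; it returns a Prediction object whose est attribute holds the estimated rating (the movie id below is just an illustrative value):

pred = svd.predict(785314, 1)  # (user id, movie id); movie id 1 is illustrative
print(pred.est)                # estimated rating on the (1, 5) scale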

To recommend products (i.e., movies) to the given user, we can sort the list of movies in decreasing order of predicted ratings and take the top N movies as recommendations:

titles = titles.sort_values(by=['Estimate_Score'], ascending=False)
titles.head(10)

The following are the top 10 movies to be recommended to the user with Customer Id 785314:

[Top 10 movie titles recommended for Customer Id 785314, sorted by Estimate_Score]

3. Summary

In this article, I used the scikit Surprise library [7] and the Kaggle Netflix prize data [2] to demonstrate how to use model-based collaborative filtering method to build a recommender system in Python.

As described at the beginning of this article, the dataset is too big to be handled on a laptop or any typical single personal computer. Thus I only loaded the first 100,000 records from each of the training dataset files for demonstration purpose.

In real-world applications, I would recommend using Surprise with Koalas, or using the ALS algorithm in Spark MLlib, to implement a collaborative filtering system and run it on a Spark cluster [8].

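A rough sketch of that alternative is shown below, assuming a Spark environment and the same (Cust_Id, Movie_Id, Rating) DataFrame df built in section 2.2; the ALS parameter values are illustrative, not tuned:

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("netflix-als").getOrCreate()

# ALS in Spark MLlib expects numeric user/item id columns and a numeric rating column.
ratings = spark.createDataFrame(df[['Cust_Id', 'Movie_Id', 'Rating']])

als = ALS(userCol='Cust_Id', itemCol='Movie_Id', ratingCol='Rating',
          rank=10, maxIter=10, regParam=0.1, coldStartStrategy='drop')
model = als.fit(ratings)

top10_per_user = model.recommendForAllUsers(10)  # top-10 movie recommendations per user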

The Jupyter notebook with all of the source code used in this article is available on GitHub [11].

Original article: https://towardsdatascience.com/machine-learning-for-building-recommender-system-in-python-9e4922dd7e97
