Movie Recommendation System

Creating a content-based movie recommendation system.

  1. Introduction: Movies have always been a substantial part of entertainment, and especially now, in the middle of a pandemic, they have played a large role in helping people relax. But once you finish a movie, the next big question is “What next?” Almost everyone struggles with it, and I created this project to solve that problem. In this project we build a movie recommendation system based on the content the user watches. The model uses content-based filtering to recommend movies that are similar to the ones matching the user's preferences.

  2. Types of filtering techniques: When building a recommendation engine you have to choose a filtering technique that determines how the data is used to make predictions. Two main filtering techniques are used in recommendation engines:

2.1 Collaborative Filtering: Suppose there are two similar users, U1 and U2. U1 buys an iPhone and, along with it, a pair of earphones. When U2 then buys an iPhone, U2 is recommended the same earphones. To sum up, collaborative filtering is a technique that picks out items a user might like on the basis of the reactions of similar users.

2.2 Content-Based Filtering: Suppose there are two users, U1 and U2. U1 has watched movies M1 (Action), M2 (Adventure) and M3 (Action) and rated them 5 stars, 4 stars and 3 stars respectively. Now suppose U2 has watched M4 (Action) and M2 (Adventure). U2 will be recommended M1, the highest-rated action movie, because it shares a genre with M4, which U2 has already watched. To sum up, content-based filtering recommends items whose attributes (here, the genre) resemble those of items the user has already watched, and in doing so seeks to predict the “rating” or “preference” the user would give to an item.

3. Importing the libraries and dataset: We import the libraries pandas and NumPy. pandas is used for data manipulation and analysis, and NumPy provides support for large multidimensional arrays.

Dataset: I downloaded the dataset from Kaggle: https://www.kaggle.com/tmdb/tmdb-movie-metadata . It contains two datasets: 1. Credits and 2. Movies.
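
The import and loading step might look like the sketch below. To keep it runnable without a download, two tiny in-memory CSV strings stand in for the Kaggle files; their rows are made up, and the file names mentioned in the comment are an assumption about what the download contains.

```python
import io

import numpy as np  # imported as described in the text; not needed by this snippet
import pandas as pd

# Stand-ins for the two Kaggle CSV files so the sketch runs offline.
# With the real data you would pass file paths to pd.read_csv instead
# (assumed names: tmdb_5000_credits.csv and tmdb_5000_movies.csv).
credits_csv = io.StringIO(
    "movie_id,title,cast,crew\n"
    "1,Movie A,[],[]\n"
    "2,Movie B,[],[]\n"
)
movies_csv = io.StringIO(
    "id,original_title,overview,genres,original_language\n"
    "1,Movie A,A hero saves the city.,[],en\n"
    "2,Movie B,,[],en\n"
)

credits = pd.read_csv(credits_csv)
movies = pd.read_csv(movies_csv)

# Inspect the size of each table (rows, columns).
print(credits.shape)
print(movies.shape)
```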

4. Merging the datasets:

[Img. 2: Merging the Credits and Movies datasets]

If we look at both datasets, we can see that the “movie_id” column of the Credits dataset holds the same values as the “id” column of the Movies dataset. We can therefore rename “movie_id” to “id” in the Credits dataset and merge the two datasets on the “id” column, as shown in image 2.
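
The rename-and-merge step can be sketched as follows. The two small frames are toy stand-ins with made-up values; only the column names movie_id and id come from the text.

```python
import pandas as pd

# Toy stand-ins for the two datasets (values are made up for illustration).
credits = pd.DataFrame({"movie_id": [1, 2, 3], "cast": ["a", "b", "c"]})
movies = pd.DataFrame({"id": [1, 2, 3], "original_title": ["A", "B", "C"]})

# Rename the key column so both frames share the name "id", then merge on it.
credits_renamed = credits.rename(columns={"movie_id": "id"})
merged = movies.merge(credits_renamed, on="id")

print(merged.columns.tolist())
```

If both real tables also carry a column with the same name (for example a title column), pandas would suffix the duplicates during the merge, so it may be simpler to drop one copy first.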

5. Data cleaning: Drop all the columns that are not useful for prediction, which is almost every column except “overview”, “original_title”, “id”, “genres” and “original_language”.
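
A minimal sketch of this cleaning step, assuming a merged frame that contains the five columns named above plus a few others. The extra columns (budget, homepage) and all values are placeholders.

```python
import pandas as pd

# Toy merged frame; "budget" and "homepage" stand in for the unused columns.
merged = pd.DataFrame({
    "id": [1, 2],
    "original_title": ["A", "B"],
    "overview": ["x", "y"],
    "genres": ["[]", "[]"],
    "original_language": ["en", "en"],
    "budget": [100, 200],
    "homepage": ["", ""],
})

# Columns to keep, as listed in the text.
keep = ["id", "original_title", "overview", "genres", "original_language"]

# Drop everything that is not in the keep list.
cleaned = merged.drop(columns=[c for c in merged.columns if c not in keep])

print(cleaned.columns.tolist())
```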

6. Creating the matrix of vectors: We now use the overview column (a summary of the plot) to pick up keywords, so that we can recommend movies with similar plots. The overview column holds the content from which the recommendations are derived. A recommendation engine of this kind needs a numeric vector for each movie; stacked together, these vectors form a matrix. Here we build it with TfidfVectorizer.

6.1 TfidfVectorizer: TF-IDF is an NLP technique for converting text to vectors, and TfidfVectorizer is the scikit-learn class that implements it. We import it from sklearn.feature_extraction.text and use it to build a document-term matrix from the overview column. Its main parameters for our purposes are:

  1. ngram_range: controls the length of the word sequences (n-grams) used as features. With a range of 1–3, every sequence of one, two or three consecutive words in the overview column becomes a feature.
  2. stop_words = “english”: removes very common English words such as pronouns, articles and conjunctions.
  3. strip_accents, token_pattern, analyzer: these control text normalization and tokenization. strip_accents removes accents from characters, token_pattern is the regular expression that defines what counts as a token (and so can leave out punctuation marks), and analyzer selects whether features are built from words or characters.
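
The effect of these parameters can be seen on a single sentence. The parameter values below are assumptions chosen for illustration (the text names the parameters but not their exact settings), and the sample sentence is made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumed settings for illustration; the exact values are not given in the text.
tfv = TfidfVectorizer(
    ngram_range=(1, 3),       # unigrams, bigrams and trigrams
    stop_words="english",     # drop common English words
    strip_accents="unicode",  # e.g. "café" becomes "cafe"
    analyzer="word",          # build features from words, not characters
    token_pattern=r"\w{1,}",  # a token is a run of one or more word characters
)

# build_analyzer() returns the preprocessing + tokenizing + n-gram function,
# so we can inspect the features a sentence produces without fitting anything.
analyze = tfv.build_analyzer()
grams = analyze("The café owner hires a detective")
print(grams)
```

Stop words are removed before the n-grams are formed, so a bigram such as "cafe owner" can appear even though "The" preceded it in the original sentence.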

6.2 Treating NaN values: Some movies have no overview, so the overview column contains NaN values. TfidfVectorizer cannot process NaN, so we replace these values with empty strings using .fillna('') before vectorizing.
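
A small sketch of the replacement, using a made-up three-row overview column with one missing entry:

```python
import numpy as np
import pandas as pd

# Toy overview column with one missing value.
movies = pd.DataFrame({"overview": ["A spy story.", np.nan, "A space opera."]})

# Replace missing overviews with an empty string so every entry is text.
movies["overview"] = movies["overview"].fillna("")

# Count the remaining missing values.
print(movies["overview"].isna().sum())
```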

7. Converting it into a sparse matrix: We now convert this column into a sparse matrix using the fit_transform function. A sparse matrix is a matrix in which most values are zero and only a few are non-zero. The non-zero values are the TF-IDF weights computed by TfidfVectorizer, and each of them lies between 0 and 1.
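
The equation itself is not reproduced in the text. For reference, this is the weighting scikit-learn applies under its default settings (smooth_idf=True, norm='l2', sublinear_tf=False), where n is the number of documents, df(t) is the number of documents containing term t, and tf(t, d) is the raw count of t in document d. Treating the defaults as the settings used here is an assumption.

```latex
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1, \qquad
w_{t,d} = \mathrm{tf}(t, d)\,\mathrm{idf}(t), \qquad
\hat{w}_{t,d} = \frac{w_{t,d}}{\sqrt{\sum_{t'} w_{t',d}^{2}}}
```

Because each row is scaled to unit length in the last step, every entry of the resulting matrix lies between 0 and 1.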

Looking at the shape of the matrix, we see more than 4,500 rows (one per movie) and more than 10,000 columns, each of which is a word or word combination created using ngram_range.
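
The conversion and the shape check can be sketched on three made-up overviews. With the real data the matrix would have one row per movie instead of three; the vectorizer settings are assumptions for illustration.

```python
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up overviews standing in for the real column.
overviews = pd.Series([
    "A retired spy returns for one last mission.",
    "A young wizard attends a school of magic.",
    "A spy uncovers a plot inside the agency.",
])

tfv = TfidfVectorizer(ngram_range=(1, 3), stop_words="english")

# fit_transform learns the vocabulary and returns the TF-IDF weights
# as a SciPy sparse matrix: one row per overview, one column per n-gram.
tfv_matrix = tfv.fit_transform(overviews)

print(sparse.issparse(tfv_matrix))
print(tfv_matrix.shape)
```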

8. Finding similarity between different movies: We import the sigmoid_kernel function from sklearn.metrics.pairwise. It takes the dot product of two vectors and passes it through a sigmoid (S-shaped) function, which squashes any input into a bounded range. The logistic sigmoid takes values between 0 and 1 and is given by the simple formula in Fig. 1; scikit-learn's sigmoid_kernel uses the closely related hyperbolic tangent, tanh(gamma * <x, y> + coef0), whose values lie between -1 and 1.

[Fig. 1: Formula of the sigmoid function]

8.1 Applying the sigmoid kernel: The kernel measures how similar the overview of one movie is to the overview of another. Passing the result through the sigmoid gives a bounded score, and the higher the score, the greater the similarity. In the code, we pass the same TF-IDF matrix as both arguments of the kernel so that every movie is compared with every other movie. The similarity is calculated from the vector values. As shown in image 6, sig[0] holds the similarity of the first overview to the overviews of all the movies.
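
A sketch of this step on three made-up overviews, with default kernel parameters (an assumption, since the text does not state them). The first and third overviews share the word "spy", so they should score as more similar to each other than to the second.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import sigmoid_kernel

# Three made-up overviews; the first and third share the word "spy".
overviews = [
    "A retired spy returns for one last mission.",
    "A young wizard attends a school of magic.",
    "A spy uncovers a plot inside the agency.",
]

tfv_matrix = TfidfVectorizer(stop_words="english").fit_transform(overviews)

# Passing the same matrix twice compares every overview with every other one.
sig = sigmoid_kernel(tfv_matrix, tfv_matrix)

print(sig.shape)  # one row and one column per overview
print(sig[0])     # similarity of the first overview to all overviews
```

Each overview is most similar to itself, which is why the recommendation step later has to skip the query movie.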

[Img. 6: sig[0], the similarity scores for the first overview]

9. Creating indices: We build a lookup from movie title to row index for every movie in the dataset and drop duplicate titles. This gives a unique index value for every title, which is used in the next part of the code.
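
One way to build such a lookup is sketched below with made-up titles, one of which is repeated. Keeping the first occurrence of a repeated title is an assumption; the text does not say which duplicate is kept.

```python
import pandas as pd

# Toy title column with one duplicate title.
movies = pd.DataFrame({"original_title": ["Avatar", "Spectre", "Avatar", "Tangled"]})

# Map each title to its row position.
indices = pd.Series(movies.index, index=movies["original_title"])

# Keep only the first row for each title so that a lookup returns one value.
indices = indices[~indices.index.duplicated(keep="first")]

print(indices)
```

Note that calling .drop_duplicates() directly on this Series would compare its values (the row positions, which are already unique) rather than the titles, so filtering on the index is used here instead.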

10. Getting recommendations: We now have a function that finds the movies most similar to a given one. Here is how it works. The function takes a movie title and looks up its index value using indices. It then reads the row of similarity scores for that index from the sig matrix and converts it into a list of (index, score) pairs using list(enumerate(...)). The list is sorted in descending order of score using sorted(). Finally, we take the top 7 similarity scores and use their positions (movie_indices) to look up the original titles of those movies.
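
The steps above can be sketched end to end on a four-movie toy dataset. The function name give_recommendations, the top_n parameter, and all titles and overviews are hypothetical; only the sequence of steps follows the text.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import sigmoid_kernel

# Made-up movies for illustration.
movies = pd.DataFrame({
    "original_title": ["Spy One", "Wizard School", "Spy Two", "Space Race"],
    "overview": [
        "A retired spy returns for one last mission.",
        "A young wizard attends a school of magic.",
        "A spy uncovers a plot inside the agency.",
        "Two astronauts race to repair a failing station.",
    ],
})

tfv_matrix = TfidfVectorizer(stop_words="english").fit_transform(
    movies["overview"].fillna("")
)
sig = sigmoid_kernel(tfv_matrix, tfv_matrix)
indices = pd.Series(movies.index, index=movies["original_title"])


def give_recommendations(title, top_n=7):
    """Return the titles of the top_n movies most similar to `title`."""
    idx = indices[title]                     # row position of the query movie
    sig_scores = list(enumerate(sig[idx]))   # (movie position, score) pairs
    sig_scores = sorted(sig_scores, key=lambda pair: pair[1], reverse=True)
    # Skip the query movie itself, then keep the top_n best matches.
    sig_scores = [pair for pair in sig_scores if pair[0] != idx][:top_n]
    movie_indices = [position for position, _ in sig_scores]
    return movies["original_title"].iloc[movie_indices]


print(give_recommendations("Spy One", top_n=2))
```

The query movie is filtered out explicitly because it is always the best match for itself.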

11. Result: The 7 movies with the highest similarity according to our model are displayed.

Source code: https://github.com/vishnavchhabra/content-based-movie-reccomendation

Original article: https://medium.com/swlh/movie-recommendation-system-dc00430af6ec
