使用python构建电影推荐器

最新推荐文章于 2024-04-29 21:50:55 发布

weixin_26748251

最新推荐文章于 2024-04-29 21:50:55 发布

阅读量442

点赞数

文章标签： python 人工智能机器学习推荐系统

原文链接：https://towardsdatascience.com/building-a-movie-recommender-using-python-277959b07dae

版权

In this post, I will show you how to build a movie recommender program using Python. This will be a simple project where we will be able to see how machine learning can be used in our daily life. If you check my other articles, you will see that I like to demonstrate hands-on projects. I think this is the best way to practice our coding skills and improve ourselves. Building a movie recommender program is a great way to get started with machine recommenders. After sharing the contents table, I would like to introduce you to recommendation systems.

在本文中，我将向您展示如何使用Python构建电影推荐器程序。这将是一个简单的项目，我们将能够看到如何在日常生活中使用机器学习。如果查看其他文章，您会发现我喜欢演示动手项目。我认为这是练习我们的编码技能和提高自我的最好方法。建立电影推荐器程序是开始使用机器推荐器的好方法。共享内容表后，我想向您介绍推荐系统。

内容 (Contents)

Introduction
介绍
The Movie Data
电影数据
Data Preparation
资料准备
Content-Based Recommender
基于内容的推荐人
Show Time
开演时间
Video Demonstration
视频示范

介绍(Introduction)

Recommendation systems have been around with us for a while now, and they are so powerful. They do have a strong influence on our decisions these days. From movie streaming services to online shopping stores, they are almost everywhere we look. If you are wondering how do they know what you might buy after adding an “x” item to your cart, the answer is simple: Power of Data.

推荐系统已经存在了一段时间，而且功能如此强大。这些天确实对我们的决定产生了重大影响。从电影流媒体服务到在线购物商店，它们几乎遍布我们所有的地方。如果您想知道他们在将“ x”商品添加到购物车后如何知道您会买什么，答案很简单：数据的力量。

We may look very different from each other, but our habits can be very similar. And the companies love to find similar habits of their customers. Since they know that many people who bought “x” item also bought “y” item, they recommend you to add “y” item to your cart. And guess what, you are training your own recommender the more you buy, which means the machine will know more about you.

我们看起来彼此可能非常不同，但是我们的习惯可能非常相似。而且，公司喜欢发现客户的类似习惯。由于他们知道很多购买“ x”商品的人也购买了“ y”商品，因此建议您将“ y”商品添加到购物车中。猜猜是什么，您购买的越多，您就在训练自己的推荐器，这意味着机器将对您了解更多。

Recommendation systems are a very interesting field of machine learning, and the cool part about them is that they are all around us. There is a lot to learn about this topic, to keep things simple I will stop here. Let’s begin building our own movie recommender system!

推荐系统是机器学习中非常有趣的领域，而关于它们的最酷的部分是它们遍布我们。关于这个主题，有很多东西要学习，为了简单起见，我将在这里停止。让我们开始构建自己的电影推荐系统！

电影数据 (The Movie Data)

I’ve found great movie data on Kaggle. If you haven’t heard about Kaggle, Kaggle is the world’s largest data science community with powerful tools and resources to help you achieve your data science goals.

我在Kaggle上找到了很棒的电影数据。如果您还没有听说过Kaggle，Kaggle是世界上最大的数据科学社区，其功能强大的工具和资源可帮助您实现数据科学目标。

Here is the link to download the dataset.

这是下载数据集的链接。

语境 (Context)

The data folder contains the metadata for all 45,000 movies listed in the Full MovieLens Dataset.
数据文件夹包含完整电影镜头数据集中列出的所有45,000电影的元数据。
The dataset consists of movies released on or before July 2017. Data points include cast, crew, plot keywords, budget, revenue, posters, release dates, languages, production companies, countries, TMDB vote counts, and vote averages.
数据集包含2017年7月或之前发行的电影。数据点包括演员，剧组，剧情关键字，预算，收入，海报，发行日期，语言，制作公司，国家/地区，TMDB投票数和平均票数。
This data folder also contains a rating file with 26 million ratings from 270,000 users for all 45,000 movies.
该数据文件夹还包含一个分级文件，该分级文件包含来自270,000个用户的2,600万个分级，适用于所有45,000部电影。

资料准备 (Data Preparation)

First things first, let’s start by importing and exploring our data. After you download the data folder, you will multiple dataset files. For this project, we will be using the movies_metadata.csv dataset. This dataset has all we need to create a movie recommender.

首先，让我们从导入和浏览数据开始。下载数据文件夹后，将有多个数据集文件。对于此项目，我们将使用movies_metadata.csv数据集。该数据集包含创建电影推荐器所需的全部内容。

import pandas as pd#load the data
movie_data = pd.read_csv('data/movie_data/movies_metadata.csv', low_memory=False)movie_data.head()

情节概述(Plot Overviews)

movie_data['overview'].head(10)

Perfect! Our data is all set. Ready to be trained by our model. It’s time to move to the next step, where we will start building our content-based recommender.

完善！我们的数据都准备好了。准备由我们的模型训练。现在是时候进行下一步了，我们将开始构建基于内容的推荐器。

基于内容的推荐人 (Content Based Recommender)

Content based recommender is a recommendation model that returns a list of items based on a specific item. A nice example of this recommenders are Netflix, YouTube, Disney+ and more. For example, Netflix recommends similar shows that you watched before and liked more. With this project, you will have a better understanding of how these online streaming services’ algorithms work.

基于内容的推荐器是一种推荐模型，可基于特定项目返回项目列表。 Netflix，YouTube，迪士尼+等都是此类推荐者的一个很好的例子。例如，Netflix推荐您之前看过并且喜欢更多的类似节目。通过此项目，您将更好地了解这些在线流服务的算法如何工作。

Back to the project, as an input to train our model we will use the overview of the movies that we checked earlier. And then we will use some sci-kit learn ready functions to build our model. Our recommender will be ready in four simple steps. Let’s begin!

回到项目，作为训练模型的输入，我们将使用我们之前检查过的电影的概述。然后，我们将使用一些sci-kit学习就绪函数来构建模型。我们的推荐人将通过四个简单步骤准备就绪。让我们开始！

1.定义矢量化器 (1. Define Vectorizer)

from sklearn.feature_extraction.text import TfidfVectorizertfidf_vector = TfidfVectorizer(stop_words='english')movie_data['overview'] = movie_data['overview'].fillna('')tfidf_matrix = tfidf_vector.fit_transform(movie_data['overview'])

Understanding the above code

了解上面的代码

Importing the vectorizer from sci-kit learn module. Learn more here.
从sci-kit学习模块导入矢量化器。在这里了解更多。
Tf-idf Vectorizer Object removes all English stop words such as ‘the’, ‘a’ etc.
Tf-idf Vectorizer对象会删除所有英语停用词，例如“ the”，“ a”等。
We are replacing the Null(empty) values with an empty string so that it doesn’t return an error message when training them.
我们将Null(empty)值替换为空字符串，以便在训练它们时不返回错误消息。
Lastly, we are constructing the required Tf-idf matrix by fitting and transforming the data
最后，我们通过拟合和转换数据来构造所需的Tf-idf矩阵

2.线性内核(2. Linear Kernel)

We will start by importing the linear kernel function from sci-kit learn module. The linear kernel will help us to create a similarity matrix. These lines take a bit longer to execute, don’t worry it’s normal. Calculating the dot product of two huge matrixes is not easy, even for machines :)

我们将从sci-kit学习模块中导入线性内核函数开始。线性核将帮助我们创建相似度矩阵。这些行需要更长的时间来执行，请不要担心这是正常的。即使对于机器，也很难计算两个巨大矩阵的点积：)

from sklearn.metrics.pairwise import linear_kernelsim_matrix = linear_kernel(tfidf_matrix, tfidf_matrix)

3.指数 (3. Indices)

Now, we have to construct a reverse map of the indices and movie titles. And in the second part of the Series function, we are cleaning the movie titles that are repeating with a simple function called drop_duplicates.

现在，我们必须构建索引和电影标题的反向映射。在Series函数的第二部分中，我们使用名为drop_duplicates的简单函数来清理正在重复的电影标题。

indices = pd.Series(movie_data.index, index=movie_data['title']).drop_duplicates()indices[:10]

4.最后-推荐功能(4. Finally — Recommender Function)

def content_based_recommender(title, sim_scores=sim_matrix):
    idx = indices[title]    sim_scores = list(enumerate(sim_matrix[idx]))    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)    sim_scores = sim_scores[1:11]    movie_indices = [i[0] for i in sim_scores]    return movie_data['title'].iloc[movie_indices]

开演时间 (Show Time)

Well done! It’s time to test our recommender. Let’s see in action and how powerful it really is. We will run the function by adding the movie name as a string as a parameter.

做得好！现在该测试我们的推荐器了。让我们来看看实际操作以及它到底有多强大。我们将通过将电影名称作为字符串添加为参数来运行该函数。

与“玩具总动员”相似的电影 (Similar Movies to “Toy Story”)

content_based_recommender('Toy Story')

视频示范 (Video Demonstration)

演示地址

video demonstration of project

项目视频演示

Congrats!! You have created a program that recommends movies to you. Now, you have a program to run when you want to choose your next Netflix show. Hoping that you enjoyed reading this hands-on project. I would be glad if you learned something new today. Working on hands-on programming projects like this one is the best way to sharpen your coding skills.

恭喜！！您已经创建了一个向您推荐电影的程序。现在，当您要选择下一个Netflix节目时，就有一个要运行的程序。希望您喜欢阅读本动手项目。如果您今天学到一些新知识，我将很高兴。从事这样的动手编程项目是提高您的编码技能的最佳方法。

Feel free to contact me if you have any questions while implementing the code.

实施代码时如有任何疑问，请随时与我联系。