Recommendation System for Anime Data

A simple recommendation system built with TfidfVectorizer and CountVectorizer, for beginners.

The Goal

Recommendation systems are widely used in many industries to suggest items to customers. For example, a radio station may use a recommendation system to build its top 100 songs of the month, or to identify songs in the same genre as one a listener has requested. Given how widely these systems are used, we are going to build one for anime data. It would be nice if anime fans could see an updated top 100 chart every time they walk into an anime store, or receive an email suggesting anime in the genres they like.

We will apply two different recommendation models to the anime data: a simple recommendation system and a content-based recommendation system.

Overview

For the simple recommendation system, we calculate a weighted rating so that the same average score carries a different weight depending on how many votes it is based on. For example, an average rating of 9.0 from 10 people carries less weight than an average rating of 9.0 from 1,000 people. Once the weighted rating is calculated, we can sort by it to get a top-chart list of anime.

For the content-based recommendation system, we need to decide which features to use in the analysis. We will use sklearn to measure the similarity between the text of those features and suggest anime accordingly.

Data Overview

The anime data contains 12,294 anime described by 7 columns: anime_id, name, genre, type, episodes, rating, and members.

Implementation

1. Import Data

We need to import pandas, as this will let us load the data neatly into a DataFrame.

import pandas as pd
anime = pd.read_csv('…/anime.csv')
anime.head(5)
anime.info()
anime.describe()

We can see that the minimum rating is 1.67 and the maximum is 10. The number of members ranges from 5 to 1,013,917.

anime_dup = anime[anime.duplicated()]
print(anime_dup)

There are no duplicated rows that need to be cleaned.

type_values = anime['type'].value_counts()
print(type_values)

Most anime are broadcast on TV, followed by OVA.

2. Simple Recommendation System

First, we need to know how the weighted rating (WR) is calculated.

WR = (v / (v + m)) * R + (m / (v + m)) * C

v is the number of votes for the anime (here, its members count); m is the minimum number of votes required to be listed in the chart; R is the average rating of the anime; C is the mean rating across the whole dataset.
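To make the effect of the formula concrete, here is a minimal sketch that applies it to the 9.0-from-10-votes versus 9.0-from-1,000-votes case mentioned in the overview. The values of m and C below are hypothetical, chosen only for illustration, and the function name weighted_rating is not from the original article.

```python
def weighted_rating(v, R, m, C):
    """Pull the average rating R toward the global mean C when votes v are few."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical values, not computed from the anime dataset
m = 100   # minimum votes required to be listed
C = 6.5   # mean rating across the whole dataset

few_votes = weighted_rating(v=10, R=9.0, m=m, C=C)
many_votes = weighted_rating(v=1000, R=9.0, m=m, C=C)

print(round(few_votes, 2), round(many_votes, 2))  # 6.73 8.77
```

With few votes the score stays close to the global mean C; with many votes it approaches the anime's own average R.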

We need to determine which data will be used in this calculation.

m = anime['members'].quantile(0.75)
print(m)

Based on this result, we will build the recommendation system only from anime that have more than 9,437 members.
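As a rough illustration of how a 0.75 quantile cutoff behaves, the sketch below applies the same idea to a tiny made-up table. The names and member counts are hypothetical and are not taken from the anime dataset.

```python
import pandas as pd

# Hypothetical member counts, for illustration only
toy = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D', 'E'],
    'members': [10, 200, 3000, 40000, 500000],
})

# 75% of the rows have a members count at or below this value
cutoff = toy['members'].quantile(0.75)

# A strict ">" also drops the row that sits exactly on the cutoff
popular = toy[toy['members'] > cutoff]

print(cutoff)                    # 40000.0
print(popular['name'].tolist())  # ['E']
```

Because the comparison is strict, an anime whose members count equals the cutoff is excluded; using >= instead would keep it.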

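# Keep only anime above the members threshold (copy to avoid modifying a view)
qualified_anime = anime.loc[anime['members'] > m].copy()
C = anime['rating'].mean()

def WR(x, C=C, m=m):
    v = x['members']
    R = x['rating']
    return (v / (v + m) * R) + (m / (v + m) * C)

qualified_anime['score'] = WR(qualified_anime)
# sort_values returns a new DataFrame, so assign the result back
qualified_anime = qualified_anime.sort_values('score', ascending=False)
qualified_anime.head(15)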

This is the list of the top 15 anime based on the weighted rating calculation.

3. Genre-Based Recommendation System

For genre-based recommendation, we will use the sklearn package to help us analyse the genre text. We need to compute the similarity between genres, and the two methods we are going to use are TfidfVectorizer and CountVectorizer.

TfidfVectorizer weights how often a word appears in a document against how often it occurs across all documents, so words that appear everywhere count for less. CountVectorizer is simpler: it only counts how many times each word occurs.
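The difference is easier to see on a toy example. The sketch below runs both vectorizers, with default settings, on three made-up genre strings in which "Action" appears in every row. The strings are hypothetical and are not taken from the anime dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical genre strings; "Action" occurs in every document
genres = [
    'Action, Comedy',
    'Action, Drama',
    'Action, Romance',
]

count = CountVectorizer()
counts = count.fit_transform(genres).toarray()

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(genres).toarray()

# vocabulary_ maps each word to its column index
count_action = counts[0, count.vocabulary_['action']]
count_comedy = counts[0, count.vocabulary_['comedy']]
tfidf_action = weights[0, tfidf.vocabulary_['action']]
tfidf_comedy = weights[0, tfidf.vocabulary_['comedy']]

print(count_action, count_comedy)   # both words are counted once
print(tfidf_action < tfidf_comedy)  # the common word gets the lower weight
```

In the first row both words have the same count, but TF-IDF gives "action" a lower weight than "comedy" because "action" occurs in every document.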

from sklearn.feature_extraction.text import TfidfVectorizer

tf_idf = TfidfVectorizer(lowercase=True, stop_words='english')
# Replace missing genres with an empty string so the vectorizer can process them
anime['genre'] = anime['genre'].fillna('')
tf_idf_matrix = tf_idf.fit_transform(anime['genre'])
tf_idf_matrix.shape

The shape of the matrix shows 46 distinct words across the 12,294 anime.

from sklearn.metrics.pairwise import linear_kernel

# TF-IDF rows are L2-normalised by default, so the dot product equals cosine similarity
cosine_sim = linear_kernel(tf_idf_matrix, tf_idf_matrix)
indices = pd.Series(anime.index, index=anime['name'])
# Keep the first row for each name so that a lookup returns a single index
indices = indices[~indices.index.duplicated()]

def recommendations(name, cosine_sim=cosine_sim):
    similarity_scores = list(enumerate(cosine_sim[indices[name]]))
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    # Skip the first entry (normally the anime itself) and keep the next 20
    similarity_scores = similarity_scores[1:21]
    anime_indices = [i[0] for i in similarity_scores]
    return anime['name'].iloc[anime_indices]

recommendations('Kimi no Na wa.')

Based on the TF-IDF calculation, these are the top 20 anime recommendations most similar to Kimi no Na wa.

Next, we are going to look at another model, CountVectorizer, and compare the results obtained with cosine_similarity and linear_kernel.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(anime['genre'])

# Cosine similarity on raw counts
cosine_sim2 = cosine_similarity(count_matrix, count_matrix)
recommendations('Kimi no Na wa.', cosine_sim2)

# Plain dot product on raw counts
cosine_sim2 = linear_kernel(count_matrix, count_matrix)
recommendations('Kimi no Na wa.', cosine_sim2)
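One likely reason the two calls can rank anime differently on the count matrix is that linear_kernel is a plain dot product, while cosine_similarity also divides by the length of each vector. The sketch below checks this on three made-up genre strings; the data is hypothetical and the variable names are not from the original article.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical genre strings with different numbers of genres
genres = ['Action, Comedy', 'Action, Comedy, Drama, Romance', 'Drama']

count_matrix = CountVectorizer().fit_transform(genres)
tfidf_matrix = TfidfVectorizer().fit_transform(genres)

# TF-IDF rows are L2-normalised by default, so both functions agree
same_on_tfidf = np.allclose(
    linear_kernel(tfidf_matrix, tfidf_matrix),
    cosine_similarity(tfidf_matrix, tfidf_matrix),
)

# Raw counts are not normalised, so the dot product grows with the number of genres
dot = linear_kernel(count_matrix, count_matrix)
cos = cosine_similarity(count_matrix, count_matrix)
same_on_counts = np.allclose(dot, cos)

print(same_on_tfidf, same_on_counts)  # True False
print(dot[0, 0], dot[1, 1])           # self-similarity grows with genre count
print(cos[0, 0], cos[1, 1])           # cosine self-similarity is about 1 for both
```

On raw counts, the dot product favours anime that list many genres, whereas cosine similarity measures only how much the genre sets overlap in direction.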

Summary

In this article, we looked at the anime data and built two types of recommendation systems. The simple recommendation system gives us a top-chart list of anime, ranked by a weighted rating calculated from each anime's rating and number of members. We then built a recommendation system based on each anime's genre, applying both TfidfVectorizer and CountVectorizer to see how their recommendations differ.

I hope you enjoyed this article!

Translated from: https://medium.com/analytics-vidhya/recommendation-system-for-anime-data-784c78952ba5
