Recommender Systems
"""
Data download links:
https://www.kaggle.com/tmdb/tmdb-movie-metadata
https://www.kaggle.com/rounakbanik/the-movies-dataset
"""
import pandas as pd
import numpy as np
df1=pd.read_csv('./movie/tmdb/tmdb_5000_credits.csv')
df2=pd.read_csv('./movie/tmdb/tmdb_5000_movies.csv')
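tmdb_5000_movies.csv contains 20 fields, described below:
- budget: production budget
- genres: genres
- homepage: official homepage (many missing values, but not important)
- id: movie ID
- keywords: keyword tags
- original_language: original language
- original_title: original title
- overview: plot overview
- popularity: popularity score
- production_companies: production companies
- production_countries: production countries
- release_date: release date
- revenue: revenue
- runtime: running time
- spoken_languages: languages spoken in the film
- status: release status
- tagline: one-sentence tagline
- title: title
- vote_average: average rating
- vote_count: number of votes
tmdb_5000_credits.csv contains 4 fields, described below:
- movie_id: movie ID
- title: movie title
- cast: cast
- crew: crew members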
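# Rename the credits columns so that both frames share the key 'id';
# the credits title becomes 'tittle', so it does not collide with df2['title']
df1.columns = ['id','tittle','cast','crew']
df2 = df2.merge(df1, on='id')  # inner join on 'id'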
df2.head(5)
| | budget | genres | homepage | id | keywords | original_language | original_title | overview | popularity | production_companies | ... | runtime | spoken_languages | status | tagline | title | vote_average | vote_count | tittle | cast | crew |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 237000000 | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | http://www.avatarmovie.com/ | 19995 | [{"id": 1463, "name": "culture clash"}, {"id":... | en | Avatar | In the 22nd century, a paraplegic Marine is di... | 150.437577 | [{"name": "Ingenious Film Partners", "id": 289... | ... | 162.0 | [{"iso_639_1": "en", "name": "English"}, {"iso... | Released | Enter the World of Pandora. | Avatar | 7.2 | 11800 | Avatar | [{"cast_id": 242, "character": "Jake Sully", "... | [{"credit_id": "52fe48009251416c750aca23", "de... |
1 | 300000000 | [{"id": 12, "name": "Adventure"}, {"id": 14, "... | http://disney.go.com/disneypictures/pirates/ | 285 | [{"id": 270, "name": "ocean"}, {"id": 726, "na... | en | Pirates of the Caribbean: At World's End | Captain Barbossa, long believed to be dead, ha... | 139.082615 | [{"name": "Walt Disney Pictures", "id": 2}, {"... | ... | 169.0 | [{"iso_639_1": "en", "name": "English"}] | Released | At the end of the world, the adventure begins. | Pirates of the Caribbean: At World's End | 6.9 | 4500 | Pirates of the Caribbean: At World's End | [{"cast_id": 4, "character": "Captain Jack Spa... | [{"credit_id": "52fe4232c3a36847f800b579", "de... |
2 | 245000000 | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | http://www.sonypictures.com/movies/spectre/ | 206647 | [{"id": 470, "name": "spy"}, {"id": 818, "name... | en | Spectre | A cryptic message from Bond’s past sends him o... | 107.376788 | [{"name": "Columbia Pictures", "id": 5}, {"nam... | ... | 148.0 | [{"iso_639_1": "fr", "name": "Fran\u00e7ais"},... | Released | A Plan No One Escapes | Spectre | 6.3 | 4466 | Spectre | [{"cast_id": 1, "character": "James Bond", "cr... | [{"credit_id": "54805967c3a36829b5002c41", "de... |
3 | 250000000 | [{"id": 28, "name": "Action"}, {"id": 80, "nam... | http://www.thedarkknightrises.com/ | 49026 | [{"id": 849, "name": "dc comics"}, {"id": 853,... | en | The Dark Knight Rises | Following the death of District Attorney Harve... | 112.312950 | [{"name": "Legendary Pictures", "id": 923}, {"... | ... | 165.0 | [{"iso_639_1": "en", "name": "English"}] | Released | The Legend Ends | The Dark Knight Rises | 7.6 | 9106 | The Dark Knight Rises | [{"cast_id": 2, "character": "Bruce Wayne / Ba... | [{"credit_id": "52fe4781c3a36847f81398c3", "de... |
4 | 260000000 | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | http://movies.disney.com/john-carter | 49529 | [{"id": 818, "name": "based on novel"}, {"id":... | en | John Carter | John Carter is a war-weary, former military ca... | 43.926995 | [{"name": "Walt Disney Pictures", "id": 2}] | ... | 132.0 | [{"iso_639_1": "en", "name": "English"}] | Released | Lost in our world, found in another. | John Carter | 6.1 | 2124 | John Carter | [{"cast_id": 5, "character": "John Carter", "c... | [{"credit_id": "52fe479ac3a36847f813eaa3", "de... |
5 rows × 23 columns
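Movie Scoring: Coarse Ranking
To build this ranking we need to:
- choose a metric for scoring movies;
- compute the score of every movie;
- sort by score and recommend the highest-scoring movies to users.
The plain average rating could serve as the score, but that would be unfair: a movie with an 8.9 average from only 3 votes cannot be considered better than one with a 7.8 average from 40 votes.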
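We therefore use a weighted rating based on the IMDB formula as the score:
score = (v/(v+m) * R) + (m/(m+v) * C)
where
- v is the number of votes for the movie;
- m is the minimum number of votes required to be listed in the chart;
- R is the average rating of the movie;
- C is the mean rating across all movies.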
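To see how the weighting corrects the unfairness described above, here is a minimal sketch that scores the two example movies (8.9 from 3 votes, 7.8 from 40 votes). The values of `m` and `C` are assumptions chosen to match the ones printed in this walkthrough, and the standalone function signature is illustrative rather than the one used on the DataFrame.

```python
def weighted_rating(v, R, m, C):
    """Blend the movie's own average R with the global mean C.

    The more votes v a movie has relative to the threshold m,
    the more its own average counts.
    """
    return (v / (v + m)) * R + (m / (v + m)) * C

# Assumed illustrative values for the vote threshold and the global mean.
m, C = 1838.4, 6.09

few_votes = weighted_rating(v=3, R=8.9, m=m, C=C)    # 8.9 average, only 3 votes
more_votes = weighted_rating(v=40, R=7.8, m=m, C=C)  # 7.8 average, 40 votes

print(round(few_votes, 3), round(more_votes, 3))
print(more_votes > few_votes)  # True
```

Both scores are pulled strongly toward `C` because neither movie comes close to `m` votes, but the movie with 40 votes now ranks above the one with 3.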
### Mean rating across all movies (C)
C= round(df2['vote_average'].mean() ,2)
print(C)
# 90th percentile of the vote counts, used as the minimum number of votes (m)
m= round(df2['vote_count'].quantile(0.9),2)
print(m)
6.09
1838.4
Filter the data, keeping only the movies with at least m votes
q_movies = df2.copy().loc[df2['vote_count'] >= m]
q_movies.shape
(481, 23)
# Compute the weighted rating of a movie
def weighted_rating(x, m=m, C=C):
    v = x['vote_count']
    R = x['vote_average']
    # Calculation based on the IMDB formula
    return (v/(v+m) * R) + (m/(m+v) * C)
# Score every qualifying movie
q_movies['score'] = q_movies.apply(weighted_rating, axis=1)
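Finally, let us sort the DataFrame by the score feature and output the title, vote count, vote average, and weighted rating (score) of the top 10 movies.
# Sort by score, highest first
q_movies = q_movies.sort_values('score', ascending=False)
# Top 10 movies
q_movies[['title', 'vote_count', 'vote_average', 'score']].head(10)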
| | title | vote_count | vote_average | score |
---|---|---|---|---|
1881 | The Shawshank Redemption | 8205 | 8.5 | 8.058860 |
662 | Fight Club | 9413 | 8.3 | 7.938901 |
65 | The Dark Knight | 12002 | 8.2 | 7.919732 |
3232 | Pulp Fiction | 8428 | 8.3 | 7.904256 |
96 | Inception | 13752 | 8.1 | 7.862983 |
3337 | The Godfather | 5893 | 8.4 | 7.850720 |
95 | Interstellar | 10867 | 8.1 | 7.809164 |
809 | Forrest Gump | 7927 | 8.2 | 7.802779 |
329 | The Lord of the Rings: The Return of the King | 8064 | 8.1 | 7.726840 |
1990 | The Empire Strikes Back | 5879 | 8.2 | 7.697366 |
# Popularity ranking
pop= df2.sort_values('popularity', ascending=False)
import matplotlib.pyplot as plt
plt.figure(figsize=(12,4))
plt.barh(pop['title'].head(6),pop['popularity'].head(6), align='center',
color='skyblue')
plt.gca().invert_yaxis()
plt.xlabel("Popularity")
plt.title("Popular Movies")
# Top 6 by popularity
pop[['title','popularity']].head(6)
| | title | popularity |
---|---|---|
546 | Minions | 875.581305 |
95 | Interstellar | 724.247784 |
788 | Deadpool | 514.569956 |
94 | Guardians of the Galaxy | 481.098624 |
127 | Mad Max: Fury Road | 434.278564 |
28 | Jurassic World | 418.708552 |
[Figure: horizontal bar chart "Popular Movies" showing the six most popular titles; x-axis: Popularity]
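Content-Based Filtering
In this recommender, the content of a movie (overview, cast, crew, keywords, tagline, etc.) is used to measure its similarity to other movies. The movies most similar to a given one are then recommended.
Plot-Description-Based Recommender
df2['overview'].head(5) # plot overviews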
0 In the 22nd century, a paraplegic Marine is di...
1 Captain Barbossa, long believed to be dead, ha...
2 A cryptic message from Bond’s past sends him o...
3 Following the death of District Attorney Harve...
4 John Carter is a war-weary, former military ca...
Name: overview, dtype: object
# TF-IDF vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# Remove English stop words such as 'the' and 'a'
tfidf = TfidfVectorizer(stop_words='english')
# Replace NaN with an empty string
df2['overview'] = df2['overview'].fillna('')
# Build the sparse TF-IDF matrix by fitting and transforming the overviews
tfidf_matrix = tfidf.fit_transform(df2['overview'])
# Output the shape of tfidf_matrix
tfidf_matrix.shape
(4803, 20978)
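Cosine similarity between two vectors x and y is computed as:
similarity(x, y) = (x · y) / (||x|| * ||y||)
Because TfidfVectorizer L2-normalizes each row by default, the plain dot product computed by linear_kernel already equals the cosine similarity.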
from sklearn.metrics.pairwise import linear_kernel
# Compute the pairwise similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
cosine_sim[:1]
array([[1., 0., 0., ..., 0., 0., 0.]])
Each row contains the similarity of one movie to every movie, including itself (hence the leading 1.).
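As a small illustration of why a dot product can stand in for cosine similarity, the sketch below compares the two on a pair of made-up term-weight vectors. It uses only the standard library, and the vectors and helper names are hypothetical, not taken from the dataset.

```python
import math

def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):
    """Cosine similarity: dot product divided by the product of the norms."""
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

def l2_normalize(x):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(x, x))
    return [a / norm for a in x]

# Hypothetical term-weight vectors for two short documents.
doc_a = [1.0, 2.0, 0.0, 3.0]
doc_b = [2.0, 0.0, 1.0, 1.0]

cos_ab = cosine(doc_a, doc_b)
dot_ab = dot(l2_normalize(doc_a), l2_normalize(doc_b))
print(cos_ab, dot_ab)  # the two values agree up to rounding
```

Once every row has unit length, the denominator of the cosine formula is 1, so only the dot product remains to be computed.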
Next we define a function that takes a movie title as input and outputs a list of the 10 most similar movies.
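# Map each movie title to its row number, keeping the first row for repeated titles
# (Series.drop_duplicates would compare the row numbers, not the titles)
indices = pd.Series(df2.index, index=df2['title'])
indices = indices[~indices.index.duplicated(keep='first')]
indices.head()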
title
Avatar 0
Pirates of the Caribbean: At World's End 1
Spectre 2
The Dark Knight Rises 3
John Carter 4
dtype: int64
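We can now define the recommendation function. It follows these steps:
- Get the index of the movie from its title.
- Get the list of cosine similarity scores between that movie and every movie, and convert it into a list of tuples whose first element is the position and whose second element is the similarity score.
- Sort the list of tuples by similarity score, in descending order.
- Take the top 10 elements of the list, skipping the first one, since the movie most similar to a given movie is the movie itself.
- Return the titles corresponding to the indices of those elements.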
# Get recommendations for a title
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the row index of the movie with this title
    idx = indices[title]
    # Pair each row position with its similarity score
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort by similarity, descending
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Keep the 10 most similar movies, skipping the movie itself
    sim_scores = sim_scores[1:11]
    movie_indices = [i[0] for i in sim_scores]
    return df2['title'].iloc[movie_indices]  # return the titles
seasons = ['Spring', 'Summer', 'Fall', 'Winter']
list(enumerate(seasons))
[(0, 'Spring'), (1, 'Summer'), (2, 'Fall'), (3, 'Winter')]
get_recommendations('The Dark Knight Rises')
65 The Dark Knight
299 Batman Forever
428 Batman Returns
1359 Batman
3854 Batman: The Dark Knight Returns, Part 2
119 Batman Begins
2507 Slow Burn
9 Batman v Superman: Dawn of Justice
1181 JFK
210 Batman & Robin
Name: title, dtype: object
get_recommendations('The Avengers')
7 Avengers: Age of Ultron
3144 Plastic
1715 Timecop
4124 This Thing of Ours
3311 Thank You for Smoking
3033 The Corruptor
588 Wall Street: Money Never Sleeps
2136 Team America: World Police
1468 The Fountain
1286 Snowpiercer
Name: title, dtype: object
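Credits, Genres and Keywords Based Recommender
Clearly, using better metadata will improve the quality of our recommender, and that is what this section does. We build a recommender based on the following metadata: the 3 top actors, the director, the related genres, and the movie plot keywords. From the cast, crew, and keywords fields we need to extract the three most important actors, the director, and the keywords associated with the movie.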
# Parse the stringified features into their corresponding Python objects
from ast import literal_eval
features = ['cast', 'crew', 'keywords', 'genres']
for feature in features:
    df2[feature] = df2[feature].apply(literal_eval)
df2[feature].head()  # feature is 'genres' here, the last one processed
0 [{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...
1 [{'id': 12, 'name': 'Adventure'}, {'id': 14, '...
2 [{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...
3 [{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...
4 [{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...
Name: genres, dtype: object
# Return the director's name from the crew list, or NaN if no director is listed
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan
# Return the first 3 names of the list, or the whole list if it has fewer
def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        # If more than 3 elements exist, keep only the first three; otherwise keep the entire list.
        if len(names) > 3:
            names = names[:3]
        return names
    return []
# Define the new director, cast, keywords and genres features in a suitable form
df2['director'] = df2['crew'].apply(get_director)
features = ['cast', 'keywords', 'genres']
for feature in features:
    df2[feature] = df2[feature].apply(get_list)
# Print the new features of the first 3 films
df2[['title', 'cast', 'director', 'keywords', 'genres']].head(3)
| | title | cast | director | keywords | genres |
---|---|---|---|---|---|
0 | Avatar | [Sam Worthington, Zoe Saldana, Sigourney Weaver] | James Cameron | [culture clash, future, space war] | [Action, Adventure, Fantasy] |
1 | Pirates of the Caribbean: At World's End | [Johnny Depp, Orlando Bloom, Keira Knightley] | Gore Verbinski | [ocean, drug abuse, exotic island] | [Adventure, Fantasy, Action] |
2 | Spectre | [Daniel Craig, Christoph Waltz, Léa Seydoux] | Sam Mendes | [spy, based on novel, secret agent] | [Action, Adventure, Crime] |
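The next step is to convert the names and keyword instances to lowercase and strip all the spaces between them, so that the vectorizer does not count the "Johnny" of "Johnny Depp" and the "Johnny" of "Johnny Galecki" as the same token.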
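The effect of this cleaning step can be seen on the two names from the text. The sketch below uses a simplified whitespace tokenizer as a stand-in for a word-level vectorizer; the `tokens` and `squash` helpers are hypothetical and only illustrate the idea.

```python
def tokens(text):
    """Lowercase and split on whitespace, roughly what a word-level vectorizer does."""
    return set(text.lower().split())

def squash(name):
    """Lowercase a name and remove its spaces so it becomes a single token."""
    return name.replace(" ", "").lower()

a, b = "Johnny Depp", "Johnny Galecki"

shared_raw = tokens(a) & tokens(b)                    # tokens shared before cleaning
shared_clean = tokens(squash(a)) & tokens(squash(b))  # tokens shared after cleaning

print(shared_raw)    # {'johnny'}
print(shared_clean)  # set()
```

Before cleaning, the two actors overlap on the token "johnny" and would contribute a spurious similarity; after cleaning, each actor is a single distinct token.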
# Lowercase all strings and strip the spaces from names
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        # Check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''
print(clean_data('Johnny Depp'))
johnnydepp
# Apply clean_data to the features
features = ['cast', 'keywords', 'director', 'genres']
for feature in features:
    df2[feature] = df2[feature].apply(clean_data)
df2[features].head() # inspect the result
| | cast | keywords | director | genres |
---|---|---|---|---|
0 | [samworthington, zoesaldana, sigourneyweaver] | [cultureclash, future, spacewar] | jamescameron | [action, adventure, fantasy] |
1 | [johnnydepp, orlandobloom, keiraknightley] | [ocean, drugabuse, exoticisland] | goreverbinski | [adventure, fantasy, action] |
2 | [danielcraig, christophwaltz, léaseydoux] | [spy, basedonnovel, secretagent] | sammendes | [action, adventure, crime] |
3 | [christianbale, michaelcaine, garyoldman] | [dccomics, crimefighter, terrorist] | christophernolan | [action, crime, drama] |
4 | [taylorkitsch, lynncollins, samanthamorton] | [basedonnovel, mars, medallion] | andrewstanton | [action, adventure, sciencefiction] |
# Join keywords, cast, director and genres into one space-separated string per movie
def create_soup(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])
df2['soup'] = df2.apply(create_soup, axis=1)
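The next steps are the same as for the plot-description-based recommender. One important difference is that we use CountVectorizer instead of TF-IDF, because we do not want to down-weight an actor or director simply because they have acted in or directed relatively many movies.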
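To make the down-weighting concrete, the sketch below evaluates the smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1, which is the form scikit-learn documents for its default TF-IDF settings. The document frequencies of the two directors are made-up numbers, and the helper name is hypothetical.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_movies = 4803          # number of rows in the merged table
prolific_director = 25   # hypothetical: token appears in 25 movies
rare_director = 1        # hypothetical: token appears in 1 movie

idf_prolific = smoothed_idf(n_movies, prolific_director)
idf_rare = smoothed_idf(n_movies, rare_director)

print(round(idf_prolific, 3), round(idf_rare, 3))
print(idf_prolific < idf_rare)  # True
```

Under TF-IDF the prolific director's token would carry less weight than the rare one, whereas with raw counts both tokens contribute 1 per occurrence, which is the behaviour wanted here.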
# Import CountVectorizer and create the count matrix
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(df2['soup'])
# Compute the cosine similarity matrix from count_matrix
from sklearn.metrics.pairwise import cosine_similarity
cosine_sim2 = cosine_similarity(count_matrix, count_matrix)
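# Reset the index of the main DataFrame and build the reverse mapping as before,
# keeping the first row for repeated titles
df2 = df2.reset_index()
indices = pd.Series(df2.index, index=df2['title'])
indices = indices[~indices.index.duplicated(keep='first')]
indices.head()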
title
Avatar 0
Pirates of the Caribbean: At World's End 1
Spectre 2
The Dark Knight Rises 3
John Carter 4
dtype: int64
get_recommendations('The Dark Knight Rises', cosine_sim2)
65 The Dark Knight
119 Batman Begins
4638 Amidst the Devil's Wings
1196 The Prestige
3073 Romeo Is Bleeding
3326 Black November
1503 Takers
1986 Faster
303 Catwoman
747 Gangster Squad
Name: title, dtype: object
get_recommendations('The Godfather', cosine_sim2)
867 The Godfather: Part III
2731 The Godfather: Part II
4638 Amidst the Devil's Wings
2649 The Son of No One
1525 Apocalypse Now
1018 The Cotton Club
1170 The Talented Mr. Ripley
1209 The Rainmaker
1394 Donnie Brasco
1850 Scarface
Name: title, dtype: object
Collaborative Filtering
Singular Value Decomposition
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate
reader = Reader()
ratings = pd.read_csv('./movie/ratings_small.csv')
ratings.head()
| | userId | movieId | rating | timestamp |
---|---|---|---|---|
0 | 1 | 31 | 2.5 | 1260759144 |
1 | 1 | 1029 | 3.0 | 1260759179 |
2 | 1 | 1061 | 3.0 | 1260759182 |
3 | 1 | 1129 | 2.0 | 1260759185 |
4 | 1 | 1172 | 4.0 | 1260759205 |
Note that, unlike the previous dataset, the movies in this dataset are rated on a scale of 5.
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)
svd = SVD()
cross_validate(svd, data, measures=['RMSE', 'MAE'],cv = 5)
{'test_rmse': array([0.89845754, 0.87851544, 0.90687626, 0.88875854, 0.90646164]),
'test_mae': array([0.69207854, 0.67794004, 0.69724481, 0.68496575, 0.69589722]),
'fit_time': (4.268375635147095,
4.251009225845337,
4.2722389698028564,
3.6795027256011963,
3.63496994972229),
'test_time': (0.12253570556640625,
0.12330198287963867,
0.12207889556884766,
0.09984803199768066,
0.10153317451477051)}
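We get a mean root-mean-square error (RMSE) of about 0.90 across the five folds, which is good enough for our case. Now let us train on the full dataset and make predictions.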
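For readers unfamiliar with the two reported measures, here is a minimal sketch of how RMSE and MAE are computed, followed by the mean of the per-fold RMSE values shown in the cross-validation output above. The four rating pairs are made up for illustration.

```python
import math

def rmse(actual, predicted):
    """Root-mean-square error: square root of the mean squared difference."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error: mean of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical true ratings and model estimates for four (user, movie) pairs.
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]

print(rmse(actual, predicted))  # 0.75
print(mae(actual, predicted))   # 0.625

# Per-fold RMSE values copied from the cross-validation output.
fold_rmse = [0.89845754, 0.87851544, 0.90687626, 0.88875854, 0.90646164]
mean_rmse = sum(fold_rmse) / len(fold_rmse)
print(round(mean_rmse, 3))  # 0.896
```

RMSE penalizes large errors more heavily than MAE, which is why it is the larger of the two on the same predictions.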
trainset = data.build_full_trainset()
svd.fit(trainset)
<surprise.prediction_algorithms.matrix_factorization.SVD at 0x14c51df10>
ratings[ratings['userId'] == 2]
| | userId | movieId | rating | timestamp |
---|---|---|---|---|
20 | 2 | 10 | 4.0 | 835355493 |
21 | 2 | 17 | 5.0 | 835355681 |
22 | 2 | 39 | 5.0 | 835355604 |
23 | 2 | 47 | 4.0 | 835355552 |
24 | 2 | 50 | 4.0 | 835355586 |
... | ... | ... | ... | ... |
91 | 2 | 592 | 5.0 | 835355395 |
92 | 2 | 593 | 3.0 | 835355511 |
93 | 2 | 616 | 3.0 | 835355932 |
94 | 2 | 661 | 4.0 | 835356141 |
95 | 2 | 720 | 4.0 | 835355978 |
76 rows × 4 columns
svd.predict(1, 302, 3)
Prediction(uid=1, iid=302, r_ui=3, est=2.7593729618356786, details={'was_impossible': False})
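A single estimate like this one can be turned into a recommendation list by scoring every movie the user has not rated yet and keeping the highest estimates. The sketch below assumes a hypothetical `estimate(user_id, movie_id)` callable standing in for the trained model (with the fitted model it could wrap `svd.predict(user_id, movie_id).est`). The `top_n` helper and the lookup table of estimates are illustrative only.

```python
def top_n(user_id, all_movie_ids, rated_movie_ids, estimate, n=3):
    """Rank the movies the user has not rated by estimated rating, highest first."""
    candidates = [m for m in all_movie_ids if m not in rated_movie_ids]
    scored = [(m, estimate(user_id, m)) for m in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

# Toy stand-in for a trained model: a lookup table of made-up estimates.
fake_estimates = {10: 3.9, 17: 4.4, 31: 2.5, 302: 2.8, 1029: 4.1}

def estimate(user_id, movie_id):
    """Hypothetical estimator; ignores the user and returns a fixed value per movie."""
    return fake_estimates[movie_id]

recommendations = top_n(
    user_id=1,
    all_movie_ids=[10, 17, 31, 302, 1029],
    rated_movie_ids={31, 1029},  # movies the user has already rated
    estimate=estimate,
    n=2,
)
print(recommendations)  # [(17, 4.4), (10, 3.9)]
```

Passing the estimator in as a callable keeps the ranking logic independent of any particular model, so the same helper works with whatever produces the estimates.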