movie-recommendation-system

the uzi

Published 2023-03-08 19:05:52

Tags: python, programming language

Article link: https://blog.csdn.net/Albert__Einstein/article/details/129409424


Recommender Systems

"""
Data download links:
https://www.kaggle.com/tmdb/tmdb-movie-metadata
https://www.kaggle.com/rounakbanik/the-movies-dataset
"""
import pandas as pd 
import numpy as np 
df1=pd.read_csv('./movie/tmdb/tmdb_5000_credits.csv')
df2=pd.read_csv('./movie/tmdb/tmdb_5000_movies.csv')

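tmdb_5000_movies.csv has 20 fields, with the following meanings:

budget: budget
genres: genres
homepage: home page (many missing values, but not important)
id: movie ID
keywords: keyword tags
original_language: original language
original_title: original title
overview: plot overview
popularity: popularity
production_companies: production companies
production_countries: production countries
release_date: release date
revenue: revenue
runtime: running time
spoken_languages: spoken languages
status: release status
tagline: one-sentence tagline
title: title
vote_average: average rating
vote_count: number of votes

tmdb_5000_credits.csv has 4 fields, with the following meanings:

movie_id: movie ID
title: movie title
cast: cast
crew: full crew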

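# Rename the credits columns so that movie_id becomes 'id', the merge key.
# The credits title is stored as 'tittle' (sic), which keeps it from clashing with the 'title' column of df2.
df1.columns = ['id','tittle','cast','crew']
df2= df2.merge(df1, on='id') # join the two tables on 'id'
df2.head(5)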

	budget	genres	homepage	id	keywords	original_language	original_title	overview	popularity	production_companies	...	runtime	spoken_languages	status	tagline	title	vote_average	vote_count	tittle	cast	crew
0	237000000	[{"id": 28, "name": "Action"}, {"id": 12, "nam...	http://www.avatarmovie.com/	19995	[{"id": 1463, "name": "culture clash"}, {"id":...	en	Avatar	In the 22nd century, a paraplegic Marine is di...	150.437577	[{"name": "Ingenious Film Partners", "id": 289...	...	162.0	[{"iso_639_1": "en", "name": "English"}, {"iso...	Released	Enter the World of Pandora.	Avatar	7.2	11800	Avatar	[{"cast_id": 242, "character": "Jake Sully", "...	[{"credit_id": "52fe48009251416c750aca23", "de...
1	300000000	[{"id": 12, "name": "Adventure"}, {"id": 14, "...	http://disney.go.com/disneypictures/pirates/	285	[{"id": 270, "name": "ocean"}, {"id": 726, "na...	en	Pirates of the Caribbean: At World's End	Captain Barbossa, long believed to be dead, ha...	139.082615	[{"name": "Walt Disney Pictures", "id": 2}, {"...	...	169.0	[{"iso_639_1": "en", "name": "English"}]	Released	At the end of the world, the adventure begins.	Pirates of the Caribbean: At World's End	6.9	4500	Pirates of the Caribbean: At World's End	[{"cast_id": 4, "character": "Captain Jack Spa...	[{"credit_id": "52fe4232c3a36847f800b579", "de...
2	245000000	[{"id": 28, "name": "Action"}, {"id": 12, "nam...	http://www.sonypictures.com/movies/spectre/	206647	[{"id": 470, "name": "spy"}, {"id": 818, "name...	en	Spectre	A cryptic message from Bond’s past sends him o...	107.376788	[{"name": "Columbia Pictures", "id": 5}, {"nam...	...	148.0	[{"iso_639_1": "fr", "name": "Fran\u00e7ais"},...	Released	A Plan No One Escapes	Spectre	6.3	4466	Spectre	[{"cast_id": 1, "character": "James Bond", "cr...	[{"credit_id": "54805967c3a36829b5002c41", "de...
3	250000000	[{"id": 28, "name": "Action"}, {"id": 80, "nam...	http://www.thedarkknightrises.com/	49026	[{"id": 849, "name": "dc comics"}, {"id": 853,...	en	The Dark Knight Rises	Following the death of District Attorney Harve...	112.312950	[{"name": "Legendary Pictures", "id": 923}, {"...	...	165.0	[{"iso_639_1": "en", "name": "English"}]	Released	The Legend Ends	The Dark Knight Rises	7.6	9106	The Dark Knight Rises	[{"cast_id": 2, "character": "Bruce Wayne / Ba...	[{"credit_id": "52fe4781c3a36847f81398c3", "de...
4	260000000	[{"id": 28, "name": "Action"}, {"id": 12, "nam...	http://movies.disney.com/john-carter	49529	[{"id": 818, "name": "based on novel"}, {"id":...	en	John Carter	John Carter is a war-weary, former military ca...	43.926995	[{"name": "Walt Disney Pictures", "id": 2}]	...	132.0	[{"iso_639_1": "en", "name": "English"}]	Released	Lost in our world, found in another.	John Carter	6.1	2124	John Carter	[{"cast_id": 5, "character": "John Carter", "c...	[{"credit_id": "52fe479ac3a36847f813eaa3", "de...

5 rows × 23 columns

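Movie scoring: coarse ranking

We need a metric to score each movie.
Compute the score for every movie.
Sort by score and recommend the best-rated movies to users.

We could use a movie's average rating as its score, but that would not be fair enough: a movie with an average rating of 8.9 from only 3 votes cannot be considered better than one with an average rating of 7.8 from 40 votes.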

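We therefore use IMDB's weighted rating formula:

score = (v/(v+m) * R) + (m/(v+m) * C)

where

v is the number of votes for the movie;
m is the minimum number of votes required to be listed in the chart;
R is the average rating of the movie;
C is the mean rating across all movies.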

# Mean rating across all movies
C= round(df2['vote_average'].mean() ,2)
print(C)

# 90th percentile of the vote counts, used as the cutoff m
m= round(df2['vote_count'].quantile(0.9),2)
print(m)

6.09
1838.4
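As a quick check on the formula, the sketch below scores the two hypothetical movies from the example above (8.9 with 3 votes, 7.8 with 40 votes), plugging in the printed values C = 6.09 and m = 1838.4. The function name `weighted_score` and the two movies are illustrative and do not come from the dataset.

```python
def weighted_score(v, R, m, C):
    """Weighted rating: blend the movie's own mean R with the global mean C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

C = 6.09     # mean rating across all movies (printed above)
m = 1838.4   # 90th-percentile vote count (printed above)

few_votes = weighted_score(v=3, R=8.9, m=m, C=C)
more_votes = weighted_score(v=40, R=7.8, m=m, C=C)

print(round(few_votes, 2))   # 6.09
print(round(more_votes, 2))  # 6.13
```

With so few votes, both scores are pulled almost entirely toward C, yet the 40-vote movie now ranks above the 3-vote one.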

Filter the data

q_movies = df2.copy().loc[df2['vote_count'] >= m]
q_movies.shape

(481, 23)

# Compute the weighted rating of a movie
def weighted_rating(x, m=m, C=C):
    v = x['vote_count']
    R = x['vote_average']
    # Calculation based on the IMDB formula
    return (v/(v+m) * R) + (m/(m+v) * C)

# Score every qualifying movie
q_movies['score'] = q_movies.apply(weighted_rating, axis=1)

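Finally, let's sort the DataFrame by the score feature and output the title, vote count, vote average, and weighted rating (score) of the top 10 movies.

# Sort by score, highest first
q_movies = q_movies.sort_values('score', ascending=False)

# Top 10 movies
q_movies[['title', 'vote_count', 'vote_average', 'score']].head(10)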

	title	vote_count	vote_average	score
1881	The Shawshank Redemption	8205	8.5	8.058860
662	Fight Club	9413	8.3	7.938901
65	The Dark Knight	12002	8.2	7.919732
3232	Pulp Fiction	8428	8.3	7.904256
96	Inception	13752	8.1	7.862983
3337	The Godfather	5893	8.4	7.850720
95	Interstellar	10867	8.1	7.809164
809	Forrest Gump	7927	8.2	7.802779
329	The Lord of the Rings: The Return of the King	8064	8.1	7.726840
1990	The Empire Strikes Back	5879	8.2	7.697366

# Popularity ranking
pop= df2.sort_values('popularity', ascending=False)

import matplotlib.pyplot as plt
plt.figure(figsize=(12,4))

plt.barh(pop['title'].head(6),pop['popularity'].head(6), align='center',
        color='skyblue')

plt.gca().invert_yaxis()
plt.xlabel("Popularity")
plt.title("Popular Movies")

# Top 6 movies by popularity
pop[['title','popularity']].head(6)

	title	popularity
546	Minions	875.581305
95	Interstellar	724.247784
788	Deadpool	514.569956
94	Guardians of the Galaxy	481.098624
127	Mad Max: Fury Road	434.278564
28	Jurassic World	418.708552

[Figure: horizontal bar chart "Popular Movies", x-axis "Popularity", showing the six most popular movies; image not available]

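Content-based filtering

In this recommender, the content of a movie (overview, cast, crew, keywords, tagline, etc.) is used to find its similarity with other movies. The movies most likely to be similar are then recommended.

Plot-description-based recommender

df2['overview'].head(5) # movie overviews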

0    In the 22nd century, a paraplegic Marine is di...
1    Captain Barbossa, long believed to be dead, ha...
2    A cryptic message from Bond’s past sends him o...
3    Following the death of District Attorney Harve...
4    John Carter is a war-weary, former military ca...
Name: overview, dtype: object

# TF-IDF vectors
from sklearn.feature_extraction.text import TfidfVectorizer

# Remove all English stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

# Replace NaN with an empty string
df2['overview'] = df2['overview'].fillna('')

# Build the sparse TF-IDF matrix
tfidf_matrix = tfidf.fit_transform(df2['overview'])

# Output the shape of tfidf_matrix
tfidf_matrix.shape

(4803, 20978)

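Cosine similarity formula:

cos(x, y) = (x · y) / (||x|| * ||y||)

TfidfVectorizer L2-normalizes each row by default, so the plain dot product computed by linear_kernel already equals the cosine similarity.

from sklearn.metrics.pairwise import linear_kernel

# Compute the pairwise similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
cosine_sim[:1]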

array([[1., 0., 0., ..., 0., 0., 0.]])

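Each row holds one movie's similarity to every movie in the dataset.

We will define a function that takes a movie title as input and outputs a list of the 10 most similar movies.

# Map each movie title to its row number
# Note: drop_duplicates() here acts on the values (row numbers), which are already unique, so duplicate titles are not removed
indices = pd.Series(df2.index, index=df2['title']).drop_duplicates()

indices.head()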
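A plain dot product (linear_kernel) can stand in for cosine similarity because, when every row has unit length, the denominator of the cosine formula is 1. The following is a minimal pure-Python sketch with made-up vectors; the helper names are illustrative and are not scikit-learn APIs.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed directly from its definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Two made-up term-weight vectors over a 4-word vocabulary
doc_a = [1.0, 2.0, 0.0, 1.0]
doc_b = [0.0, 1.0, 3.0, 1.0]

by_definition = cosine(doc_a, doc_b)
by_dot_product = dot(l2_normalize(doc_a), l2_normalize(doc_b))

print(math.isclose(by_definition, by_dot_product))  # True
```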

title
Avatar                                      0
Pirates of the Caribbean: At World's End    1
Spectre                                     2
The Dark Knight Rises                       3
John Carter                                 4
dtype: int64

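We can now define the recommendation function. It follows these steps:

1. Get the index of the movie from its title.
2. Get the list of cosine similarity scores between that movie and all movies. Convert it into a list of tuples, where the first element is the position and the second is the similarity score.
3. Sort this list of tuples by similarity score.
4. Take the top 10 elements of the list, skipping the first one (the movie most similar to a given movie is the movie itself).
5. Return the titles corresponding to the indices of the top elements.

# Get recommendations
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the row index of the movie with this title
    idx = indices[title]

    # Pair each movie's position with its similarity score
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort in descending order of similarity
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the 10 most similar movies, skipping the movie itself
    sim_scores = sim_scores[1:11]
    movie_indices = [i[0] for i in sim_scores]

    return df2['title'].iloc[movie_indices] # return the titles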

seasons = ['Spring', 'Summer', 'Fall', 'Winter']
list(enumerate(seasons))

[(0, 'Spring'), (1, 'Summer'), (2, 'Fall'), (3, 'Winter')]
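The same enumerate-then-sort pattern can be followed end to end on a small example. This sketch uses a made-up 4x4 similarity matrix and hypothetical titles 'A' to 'D'; `top_similar` is an illustrative stand-in for get_recommendations, not the function itself.

```python
# Made-up similarity matrix for four hypothetical movies (row i vs. column j)
titles = ['A', 'B', 'C', 'D']
toy_sim = [
    [1.0, 0.2, 0.7, 0.0],
    [0.2, 1.0, 0.1, 0.5],
    [0.7, 0.1, 1.0, 0.3],
    [0.0, 0.5, 0.3, 1.0],
]

def top_similar(title, k=2):
    idx = titles.index(title)                                   # step 1: row of the movie
    scores = list(enumerate(toy_sim[idx]))                      # step 2: (position, score) pairs
    scores = sorted(scores, key=lambda p: p[1], reverse=True)   # step 3: sort descending
    scores = scores[1:k + 1]                                    # step 4: skip the movie itself
    return [titles[i] for i, _ in scores]                       # step 5: map back to titles

print(top_similar('A'))  # ['C', 'B']
```

Skipping the first element relies on the movie's self-similarity (1.0) sorting to the front, which holds here because no other entry in its row reaches 1.0.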

get_recommendations('The Dark Knight Rises')

65                              The Dark Knight
299                              Batman Forever
428                              Batman Returns
1359                                     Batman
3854    Batman: The Dark Knight Returns, Part 2
119                               Batman Begins
2507                                  Slow Burn
9            Batman v Superman: Dawn of Justice
1181                                        JFK
210                              Batman & Robin
Name: title, dtype: object

get_recommendations('The Avengers')

7               Avengers: Age of Ultron
3144                            Plastic
1715                            Timecop
4124                 This Thing of Ours
3311              Thank You for Smoking
3033                      The Corruptor
588     Wall Street: Money Never Sleeps
2136         Team America: World Police
1468                       The Fountain
1286                        Snowpiercer
Name: title, dtype: object

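Credits, genres, and keywords based recommender

Clearly, better metadata will improve the quality of our recommender. That is exactly what we do in this section. We will build a recommender based on the following metadata: the 3 top actors, the director, related genres, and the movie plot keywords. From the cast, crew, and keywords features, we need to extract the three most important actors, the director, and the keywords associated with the movie.

# Parse the stringified features into their corresponding Python objects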
from ast import literal_eval

features = ['cast', 'crew', 'keywords', 'genres']
for feature in features:
    df2[feature] = df2[feature].apply(literal_eval)

df2[feature].head()

0    [{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...
1    [{'id': 12, 'name': 'Adventure'}, {'id': 14, '...
2    [{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...
3    [{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...
4    [{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...
Name: genres, dtype: object
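To make the parsing step concrete, here is a small sketch of what literal_eval does to a single cell. The string below is a shortened, hand-written value in the same shape as the raw genres column, not a full row from the dataset.

```python
from ast import literal_eval

# A stringified list of dicts, in the same shape as the raw 'genres' column
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

print(type(raw).__name__)        # str
parsed = literal_eval(raw)       # safely evaluate the literal into Python objects
print(type(parsed).__name__)     # list

names = [entry['name'] for entry in parsed]
print(names)                     # ['Action', 'Adventure']
```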

# Return the director's name, or NaN if no director is listed
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

# Return the first 3 names in the list, or the entire list if it has 3 or fewer
def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        #Check if more than 3 elements exist. If yes, return only first three. If no, return entire list.
        if len(names) > 3:
            names = names[:3]
        return names

    return []

# Define the new director, cast, genres, and keywords features in a suitable form
df2['director'] = df2['crew'].apply(get_director)

features = ['cast', 'keywords', 'genres']
for feature in features:
    df2[feature] = df2[feature].apply(get_list)

# Print the new features of the first 3 films
df2[['title', 'cast', 'director', 'keywords', 'genres']].head(3)

	title	cast	director	keywords	genres
0	Avatar	[Sam Worthington, Zoe Saldana, Sigourney Weaver]	James Cameron	[culture clash, future, space war]	[Action, Adventure, Fantasy]
1	Pirates of the Caribbean: At World's End	[Johnny Depp, Orlando Bloom, Keira Knightley]	Gore Verbinski	[ocean, drug abuse, exotic island]	[Adventure, Fantasy, Action]
2	Spectre	[Daniel Craig, Christoph Waltz, Léa Seydoux]	Sam Mendes	[spy, based on novel, secret agent]	[Action, Adventure, Crime]

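The next step is to convert the names and keyword instances to lowercase and strip all the spaces between them. This is done so that the vectorizer does not count the Johnny of "Johnny Depp" and "Johnny Galecki" as the same token.

# Lowercase and strip spaces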
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        #Check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''

print(clean_data('Johnny Depp'))

johnnydepp
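The effect of stripping spaces can be seen by comparing the tokens two cast lists would share before and after cleaning. This is a rough sketch: whitespace splitting is a simplification of how a vectorizer tokenizes text, and the `tokens` helper is hypothetical.

```python
def tokens(names, strip_spaces):
    """Split a list of names into lowercase, whitespace-separated tokens."""
    result = set()
    for name in names:
        name = name.lower()
        if strip_spaces:
            name = name.replace(" ", "")
        result.update(name.split())
    return result

movie_1 = ["Johnny Depp"]
movie_2 = ["Johnny Galecki"]

shared_raw = tokens(movie_1, False) & tokens(movie_2, False)
shared_clean = tokens(movie_1, True) & tokens(movie_2, True)

print(shared_raw)    # {'johnny'}
print(shared_clean)  # set()
```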

# Clean the data
features = ['cast', 'keywords', 'director', 'genres']

for feature in features:
    df2[feature] = df2[feature].apply(clean_data)

df2[features].head() # inspect the result

	cast	keywords	director	genres
0	[samworthington, zoesaldana, sigourneyweaver]	[cultureclash, future, spacewar]	jamescameron	[action, adventure, fantasy]
1	[johnnydepp, orlandobloom, keiraknightley]	[ocean, drugabuse, exoticisland]	goreverbinski	[adventure, fantasy, action]
2	[danielcraig, christophwaltz, léaseydoux]	[spy, basedonnovel, secretagent]	sammendes	[action, adventure, crime]
3	[christianbale, michaelcaine, garyoldman]	[dccomics, crimefighter, terrorist]	christophernolan	[action, crime, drama]
4	[taylorkitsch, lynncollins, samanthamorton]	[basedonnovel, mars, medallion]	andrewstanton	[action, adventure, sciencefiction]

# Join keywords, cast, director, and genres into one space-separated string (the "metadata soup")
def create_soup(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])
df2['soup'] = df2.apply(create_soup, axis=1)
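For illustration, this is the string create_soup would produce for the first row of the cleaned features shown above. The row is written out as a plain dict and the joining logic is repeated under the name `make_soup`, so the sketch runs without the dataset.

```python
# First row of the cleaned features shown above, written out as a plain dict
row = {
    'keywords': ['cultureclash', 'future', 'spacewar'],
    'cast': ['samworthington', 'zoesaldana', 'sigourneyweaver'],
    'director': 'jamescameron',
    'genres': ['action', 'adventure', 'fantasy'],
}

def make_soup(x):
    # Same joining logic as create_soup, repeated so the sketch is self-contained
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])

soup = make_soup(row)
print(soup)
# cultureclash future spacewar samworthington zoesaldana sigourneyweaver jamescameron action adventure fantasy
```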

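The next steps are the same as for the plot-description-based recommender. One important difference is that we use CountVectorizer instead of TF-IDF, because we do not want to down-weight an actor or director just because they have acted in or directed relatively many movies.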
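To see the down-weighting that TF-IDF would introduce, the sketch below computes a smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1, which as far as I know is the variant scikit-learn applies by default. The corpus size and document frequencies are made up, and `smoothed_idf` is an illustrative helper rather than a library function.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """IDF with add-one smoothing: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_movies = 100                           # hypothetical corpus size
prolific = smoothed_idf(n_movies, 50)    # token for a director credited on 50 movies
rare = smoothed_idf(n_movies, 1)         # token for a director credited on 1 movie

print(round(prolific, 2))  # 1.68
print(round(rare, 2))      # 4.92
```

Under raw counts both tokens would contribute 1 per occurrence, so the prolific director's name carries as much weight as the rare one.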

# Import CountVectorizer and create the count matrix
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(df2['soup'])

# Compute the cosine similarity matrix based on count_matrix
from sklearn.metrics.pairwise import cosine_similarity
cosine_sim2 = cosine_similarity(count_matrix, count_matrix)

# Reset the index of the main DataFrame and construct the reverse mapping as before
df2 = df2.reset_index()
indices = pd.Series(df2.index, index=df2['title'])

indices.head()

title
Avatar                                      0
Pirates of the Caribbean: At World's End    1
Spectre                                     2
The Dark Knight Rises                       3
John Carter                                 4
dtype: int64

get_recommendations('The Dark Knight Rises', cosine_sim2)

65               The Dark Knight
119                Batman Begins
4638    Amidst the Devil's Wings
1196                The Prestige
3073           Romeo Is Bleeding
3326              Black November
1503                      Takers
1986                      Faster
303                     Catwoman
747               Gangster Squad
Name: title, dtype: object

get_recommendations('The Godfather', cosine_sim2)

867      The Godfather: Part III
2731      The Godfather: Part II
4638    Amidst the Devil's Wings
2649           The Son of No One
1525              Apocalypse Now
1018             The Cotton Club
1170     The Talented Mr. Ripley
1209               The Rainmaker
1394               Donnie Brasco
1850                    Scarface
Name: title, dtype: object

Collaborative filtering

Singular value decomposition (SVD)

from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate

reader = Reader()
ratings = pd.read_csv('./movie/ratings_small.csv')
ratings.head()

	userId	movieId	rating	timestamp
0	1	31	2.5	1260759144
1	1	1029	3.0	1260759179
2	1	1061	3.0	1260759182
3	1	1129	2.0	1260759185
4	1	1172	4.0	1260759205

Note that, unlike the previous dataset, movies in this dataset are rated on a scale of 5.

data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

svd = SVD()
cross_validate(svd, data, measures=['RMSE', 'MAE'],cv = 5)

{'test_rmse': array([0.89845754, 0.87851544, 0.90687626, 0.88875854, 0.90646164]),
 'test_mae': array([0.69207854, 0.67794004, 0.69724481, 0.68496575, 0.69589722]),
 'fit_time': (4.268375635147095,
  4.251009225845337,
  4.2722389698028564,
  3.6795027256011963,
  3.63496994972229),
 'test_time': (0.12253570556640625,
  0.12330198287963867,
  0.12207889556884766,
  0.09984803199768066,
  0.10153317451477051)}

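The mean root mean square error (RMSE) across the five folds is roughly 0.90, which is good enough for our case. Now let's train on the full dataset and make predictions.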
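The sketch below averages the per-fold RMSE values from the output above, then shows how RMSE and MAE are computed on a handful of made-up ratings. The `rmse` and `mae` helpers and the toy ratings are illustrative; they are not part of the surprise library.

```python
import math

# Per-fold RMSE values copied from the cross_validate output above
fold_rmse = [0.89845754, 0.87851544, 0.90687626, 0.88875854, 0.90646164]
mean_rmse = sum(fold_rmse) / len(fold_rmse)
print(round(mean_rmse, 4))  # 0.8958

def rmse(actual, predicted):
    """Root mean square error: penalizes large misses more heavily."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error: average size of the miss."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Made-up ratings and predictions, only to show how the two metrics are computed
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]

toy_rmse = rmse(actual, predicted)
toy_mae = mae(actual, predicted)
print(toy_rmse)  # 0.75
print(toy_mae)   # 0.625
```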

trainset = data.build_full_trainset()
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x14c51df10>

ratings[ratings['userId'] == 2]

	userId	movieId	rating	timestamp
20	2	10	4.0	835355493
21	2	17	5.0	835355681
22	2	39	5.0	835355604
23	2	47	4.0	835355552
24	2	50	4.0	835355586
...	...	...	...	...
91	2	592	5.0	835355395
92	2	593	3.0	835355511
93	2	616	3.0	835355932
94	2	661	4.0	835356141
95	2	720	4.0	835355978

76 rows × 4 columns

svd.predict(1, 302, 3)

Prediction(uid=1, iid=302, r_ui=3, est=2.7593729618356786, details={'was_impossible': False})
