movie-recommendation-system

Recommendation System

"""
Dataset download links:
https://www.kaggle.com/tmdb/tmdb-movie-metadata
https://www.kaggle.com/rounakbanik/the-movies-dataset
"""
import pandas as pd 
import numpy as np 
df1=pd.read_csv('./movie/tmdb/tmdb_5000_credits.csv')
df2=pd.read_csv('./movie/tmdb/tmdb_5000_movies.csv')

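tmdb_5000_movies.csv has 20 fields, described below:

  • budget: production budget
  • genres: genres
  • homepage: official homepage (many missing values, but not important here)
  • id: movie ID
  • keywords: keyword tags
  • original_language: original language
  • original_title: original title
  • overview: plot summary
  • popularity: popularity score
  • production_companies: production companies
  • production_countries: production countries
  • release_date: release date
  • revenue: revenue
  • runtime: running time
  • spoken_languages: spoken languages
  • status: release status
  • tagline: one-line tagline
  • title: title
  • vote_average: average rating
  • vote_count: number of votes

tmdb_5000_credits.csv has 4 fields, described below:

  • movie_id: movie ID
  • title: movie title
  • cast: cast
  • crew: full crew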
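# Rename movie_id to 'id' so both frames share a join key.
# Note: the misspelled 'tittle' keeps this column from colliding with df2's own 'title' column in the merge.
df1.columns = ['id','tittle','cast','crew']
df2= df2.merge(df1, on='id') # join the two frames on 'id'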
df2.head(5)
  | budget | genres | homepage | id | keywords | original_language | original_title | overview | popularity | production_companies | ... | runtime | spoken_languages | status | tagline | title | vote_average | vote_count | tittle | cast | crew
0 | 237000000 | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | http://www.avatarmovie.com/ | 19995 | [{"id": 1463, "name": "culture clash"}, {"id":... | en | Avatar | In the 22nd century, a paraplegic Marine is di... | 150.437577 | [{"name": "Ingenious Film Partners", "id": 289... | ... | 162.0 | [{"iso_639_1": "en", "name": "English"}, {"iso... | Released | Enter the World of Pandora. | Avatar | 7.2 | 11800 | Avatar | [{"cast_id": 242, "character": "Jake Sully", "... | [{"credit_id": "52fe48009251416c750aca23", "de...
1 | 300000000 | [{"id": 12, "name": "Adventure"}, {"id": 14, "... | http://disney.go.com/disneypictures/pirates/ | 285 | [{"id": 270, "name": "ocean"}, {"id": 726, "na... | en | Pirates of the Caribbean: At World's End | Captain Barbossa, long believed to be dead, ha... | 139.082615 | [{"name": "Walt Disney Pictures", "id": 2}, {"... | ... | 169.0 | [{"iso_639_1": "en", "name": "English"}] | Released | At the end of the world, the adventure begins. | Pirates of the Caribbean: At World's End | 6.9 | 4500 | Pirates of the Caribbean: At World's End | [{"cast_id": 4, "character": "Captain Jack Spa... | [{"credit_id": "52fe4232c3a36847f800b579", "de...
2 | 245000000 | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | http://www.sonypictures.com/movies/spectre/ | 206647 | [{"id": 470, "name": "spy"}, {"id": 818, "name... | en | Spectre | A cryptic message from Bond’s past sends him o... | 107.376788 | [{"name": "Columbia Pictures", "id": 5}, {"nam... | ... | 148.0 | [{"iso_639_1": "fr", "name": "Fran\u00e7ais"},... | Released | A Plan No One Escapes | Spectre | 6.3 | 4466 | Spectre | [{"cast_id": 1, "character": "James Bond", "cr... | [{"credit_id": "54805967c3a36829b5002c41", "de...
3 | 250000000 | [{"id": 28, "name": "Action"}, {"id": 80, "nam... | http://www.thedarkknightrises.com/ | 49026 | [{"id": 849, "name": "dc comics"}, {"id": 853,... | en | The Dark Knight Rises | Following the death of District Attorney Harve... | 112.312950 | [{"name": "Legendary Pictures", "id": 923}, {"... | ... | 165.0 | [{"iso_639_1": "en", "name": "English"}] | Released | The Legend Ends | The Dark Knight Rises | 7.6 | 9106 | The Dark Knight Rises | [{"cast_id": 2, "character": "Bruce Wayne / Ba... | [{"credit_id": "52fe4781c3a36847f81398c3", "de...
4 | 260000000 | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | http://movies.disney.com/john-carter | 49529 | [{"id": 818, "name": "based on novel"}, {"id":... | en | John Carter | John Carter is a war-weary, former military ca... | 43.926995 | [{"name": "Walt Disney Pictures", "id": 2}] | ... | 132.0 | [{"iso_639_1": "en", "name": "English"}] | Released | Lost in our world, found in another. | John Carter | 6.1 | 2124 | John Carter | [{"cast_id": 5, "character": "John Carter", "c... | [{"credit_id": "52fe479ac3a36847f813eaa3", "de...

5 rows × 23 columns

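Movie Scoring: Coarse Ranking

  • We need a metric to score each movie.
  • Compute the score for every movie.
  • Sort by score and recommend the best-rated movies to users.

We could use a movie's average rating as its score, but that would not be fair enough: a movie with an 8.9 average from only 3 votes cannot be considered better than a movie with a 7.8 average from 40 votes.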

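We therefore use a weighted rating in the style of the IMDB formula:

score = (v/(v+m) * R) + (m/(v+m) * C)

where

  • v is the number of votes for the movie
  • m is the minimum number of votes required to be listed in the chart
  • R is the average rating of the movie
  • C is the mean rating across all movies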
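To see how the weighting treats the two movies from the example above, here is a minimal sketch. It plugs in the values of m and C that this post computes from the dataset (1838.4 and 6.09); the two movies themselves are hypothetical, and the helper takes plain numbers rather than a DataFrame row.

```python
def weighted_rating(v, R, m, C):
    """Blend a movie's own mean rating R with the global mean C.

    The fewer votes v a movie has relative to the cutoff m,
    the more its score is pulled toward C.
    """
    return (v / (v + m)) * R + (m / (v + m)) * C

# m and C as reported in this post; both movies are hypothetical.
m, C = 1838.4, 6.09
few_votes = weighted_rating(v=3, R=8.9, m=m, C=C)    # high average, 3 votes
more_votes = weighted_rating(v=40, R=7.8, m=m, C=C)  # lower average, 40 votes

print(round(few_votes, 3), round(more_votes, 3))
```

With so few votes both scores stay close to C, and the movie with 40 votes ends up ahead of the one with 3 votes despite its lower raw average.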

# Mean vote average across all movies
C= round(df2['vote_average'].mean() ,2)
print(C)

# 90th percentile of vote_count, used as the minimum-votes cutoff m
m= round(df2['vote_count'].quantile(0.9),2)
print(m)
6.09
1838.4
Filter the data, keeping only movies with at least m votes:
q_movies = df2.copy().loc[df2['vote_count'] >= m]
q_movies.shape
(481, 23)
# Compute the weighted rating of one movie (one DataFrame row)
def weighted_rating(x, m=m, C=C):
    v = x['vote_count']
    R = x['vote_average']
    # Calculation based on the IMDB formula
    return (v/(v+m) * R) + (m/(m+v) * C)
# Score every qualifying movie
q_movies['score'] = q_movies.apply(weighted_rating, axis=1)

Finally, let us sort the DataFrame by the score feature and output the title, vote count, vote average, and weighted rating (score) of the top 10 movies.

# Sort movies by score, highest first
q_movies = q_movies.sort_values('score', ascending=False)

# Top 10 movies
q_movies[['title', 'vote_count', 'vote_average', 'score']].head(10)
                                              title  vote_count  vote_average     score
1881                       The Shawshank Redemption        8205           8.5  8.058860
 662                                     Fight Club        9413           8.3  7.938901
  65                                The Dark Knight       12002           8.2  7.919732
3232                                   Pulp Fiction        8428           8.3  7.904256
  96                                      Inception       13752           8.1  7.862983
3337                                  The Godfather        5893           8.4  7.850720
  95                                   Interstellar       10867           8.1  7.809164
 809                                   Forrest Gump        7927           8.2  7.802779
 329  The Lord of the Rings: The Return of the King        8064           8.1  7.726840
1990                        The Empire Strikes Back        5879           8.2  7.697366
# Popularity ranking
pop= df2.sort_values('popularity', ascending=False)

import matplotlib.pyplot as plt
plt.figure(figsize=(12,4))

plt.barh(pop['title'].head(6),pop['popularity'].head(6), align='center',
        color='skyblue')

plt.gca().invert_yaxis()
plt.xlabel("Popularity")
plt.title("Popular Movies")

# Top 6 by popularity
pop[['title','popularity']].head(6)
                       title  popularity
546                  Minions  875.581305
 95             Interstellar  724.247784
788                 Deadpool  514.569956
 94  Guardians of the Galaxy  481.098624
127       Mad Max: Fury Road  434.278564
 28           Jurassic World  418.708552

[Figure: horizontal bar chart "Popular Movies" showing the six most popular titles; x-axis: Popularity]

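Content-Based Filtering

In this recommender, the content of a movie (overview, cast, crew, keywords, tagline, etc.) is used to find its similarity to other movies. The movies most likely to be similar are then recommended.

Plot-Description-Based Recommender

df2['overview'].head(5) # plot overviews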
0    In the 22nd century, a paraplegic Marine is di...
1    Captain Barbossa, long believed to be dead, ha...
2    A cryptic message from Bond’s past sends him o...
3    Following the death of District Attorney Harve...
4    John Carter is a war-weary, former military ca...
Name: overview, dtype: object
# TF-IDF vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Remove all English stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

# Replace NaN with an empty string
df2['overview'] = df2['overview'].fillna('')

# Build the sparse TF-IDF matrix (one row per movie, one column per term)
tfidf_matrix = tfidf.fit_transform(df2['overview'])

# Output the shape of tfidf_matrix
tfidf_matrix.shape
(4803, 20978)

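Cosine similarity formula:

cosine(x, y) = (x · y) / (||x|| * ||y||)

Because TfidfVectorizer L2-normalizes each row by default, the plain dot product computed by linear_kernel already equals the cosine similarity here.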
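As a small illustration of cosine similarity, the sketch below computes it for two made-up vectors in plain Python and checks that the dot product of their L2-normalized versions gives the same number. The vectors and helper names are illustrative only and are not taken from the dataset.

```python
import math

def cosine(x, y):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def l2_normalize(x):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in x))
    return [a / norm for a in x]

# Two made-up term-weight vectors
x = [1.0, 2.0, 0.0]
y = [2.0, 1.0, 1.0]

cos_xy = cosine(x, y)
# For unit-length vectors the plain dot product is already the cosine
dot_unit = sum(a * b for a, b in zip(l2_normalize(x), l2_normalize(y)))

print(cos_xy, dot_unit)
```

This is why a plain dot product is enough once the rows are unit length: the division by the norms has already been done.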

from sklearn.metrics.pairwise import linear_kernel

# Compute the pairwise similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
cosine_sim[:1]
array([[1., 0., 0., ..., 0., 0., 0.]])

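Each row holds one movie's similarity to every movie in the dataset.

We will define a function that takes a movie title as input and outputs a list of the 10 most similar movies.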

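# Reverse map from movie title to row position.
# Series.drop_duplicates() would compare the values (row numbers), not the titles,
# so duplicated titles are dropped via the index instead, keeping the first row.
indices = pd.Series(df2.index, index=df2['title'])
indices = indices[~indices.index.duplicated(keep='first')]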

indices.head()
title
Avatar                                      0
Pirates of the Caribbean: At World's End    1
Spectre                                     2
The Dark Knight Rises                       3
John Carter                                 4
dtype: int64

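We are now in a good position to define the recommendation function. It follows these steps:

    1. Get the index of the movie given its title.
    2. Get the list of cosine similarity scores between that movie and all movies, converted to a list of tuples where the first element is the position and the second is the similarity score.
    3. Sort that list of tuples by similarity score.
    4. Take the top 10 elements of the list, skipping the first one (the movie most similar to a given movie is the movie itself).
    5. Return the titles corresponding to the indices of the top elements.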
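# Get recommendations for a title
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the row index of the movie that matches the title
    idx = indices[title]

    # Pair each movie's position with its similarity to this movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort by similarity score, descending
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the 10 most similar movies, skipping position 0 (the movie itself)
    sim_scores = sim_scores[1:11]
    movie_indices = [i[0] for i in sim_scores]

    return df2['title'].iloc[movie_indices] # return the titles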
# Aside: what enumerate() produces, as used inside get_recommendations
seasons = ['Spring', 'Summer', 'Fall', 'Winter']
list(enumerate(seasons))
[(0, 'Spring'), (1, 'Summer'), (2, 'Fall'), (3, 'Winter')]
get_recommendations('The Dark Knight Rises')
65                              The Dark Knight
299                              Batman Forever
428                              Batman Returns
1359                                     Batman
3854    Batman: The Dark Knight Returns, Part 2
119                               Batman Begins
2507                                  Slow Burn
9            Batman v Superman: Dawn of Justice
1181                                        JFK
210                              Batman & Robin
Name: title, dtype: object
get_recommendations('The Avengers')
7               Avengers: Age of Ultron
3144                            Plastic
1715                            Timecop
4124                 This Thing of Ours
3311              Thank You for Smoking
3033                      The Corruptor
588     Wall Street: Money Never Sleeps
2136         Team America: World Police
1468                       The Fountain
1286                        Snowpiercer
Name: title, dtype: object

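Credits, Genres and Keywords Based Recommender

Clearly, better metadata would improve the quality of our recommender, and that is what we do in this section. We build a recommender on the following metadata: the 3 top actors, the director, the related genres, and the movie plot keywords. From the cast, crew, and keywords features we need to extract the three most important actors, the director, and the keywords associated with each movie.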
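Before extracting anything, the stringified cells have to become Python objects. The sketch below shows the idea on a single shortened, hypothetical cell in the style of the cast column (real cells carry more keys per person).

```python
from ast import literal_eval

# A shortened, hypothetical cell in the style of the 'cast' column:
# the CSV stores a list of dicts as one long string.
raw = ('[{"cast_id": 242, "character": "Jake Sully", "name": "Sam Worthington"}, '
       '{"cast_id": 3, "character": "Neytiri", "name": "Zoe Saldana"}]')

# literal_eval safely evaluates the string into a real list of dicts
parsed = literal_eval(raw)

# Keep at most the first three names, mirroring the "top 3 actors" idea
names = [person["name"] for person in parsed][:3]

print(type(parsed).__name__, names)
```

literal_eval only accepts Python literals, so it is a safer choice than eval for data read from a file.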

# Parse the stringified features into their corresponding Python objects
from ast import literal_eval

features = ['cast', 'crew', 'keywords', 'genres']
for feature in features:
    df2[feature] = df2[feature].apply(literal_eval)

df2[feature].head()
0    [{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...
1    [{'id': 12, 'name': 'Adventure'}, {'id': 14, '...
2    [{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...
3    [{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...
4    [{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...
Name: genres, dtype: object
# Return the director's name from the crew list, or NaN if no director is listed
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan
# Return the first 3 names in the list, or the entire list if it has fewer
def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        #Check if more than 3 elements exist. If yes, return only first three. If no, return entire list.
        if len(names) > 3:
            names = names[:3]
        return names

    return []
# Define the new director, cast, genres and keywords features in a suitable form
df2['director'] = df2['crew'].apply(get_director)

features = ['cast', 'keywords', 'genres']
for feature in features:
    df2[feature] = df2[feature].apply(get_list)
# Print the new features of the first 3 films
df2[['title', 'cast', 'director', 'keywords', 'genres']].head(3)
  | title | cast | director | keywords | genres
0 | Avatar | [Sam Worthington, Zoe Saldana, Sigourney Weaver] | James Cameron | [culture clash, future, space war] | [Action, Adventure, Fantasy]
1 | Pirates of the Caribbean: At World's End | [Johnny Depp, Orlando Bloom, Keira Knightley] | Gore Verbinski | [ocean, drug abuse, exotic island] | [Adventure, Fantasy, Action]
2 | Spectre | [Daniel Craig, Christoph Waltz, Léa Seydoux] | Sam Mendes | [spy, based on novel, secret agent] | [Action, Adventure, Crime]

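The next step is to convert the names and keyword instances to lowercase and strip all the spaces between them. This way the vectorizer does not count the Johnny of "Johnny Depp" and the Johnny of "Johnny Galecki" as the same token.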
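The effect of stripping spaces can be seen with a rough stand-in for the vectorizer's tokenization. The sketch below assumes a simple whitespace split, which is a simplification of what a bag-of-words vectorizer does; the helper name is hypothetical.

```python
def tokens(names, strip_spaces):
    """Return the set of word tokens a whitespace-splitting vectorizer would see."""
    out = set()
    for name in names:
        name = name.lower()
        if strip_spaces:
            name = name.replace(" ", "")
        out.update(name.split())
    return out

a, b = ["Johnny Depp"], ["Johnny Galecki"]

# Without cleaning, both movies share the token 'johnny'
shared_raw = tokens(a, strip_spaces=False) & tokens(b, strip_spaces=False)

# After cleaning, 'johnnydepp' and 'johnnygalecki' have nothing in common
shared_clean = tokens(a, strip_spaces=True) & tokens(b, strip_spaces=True)

print(shared_raw, shared_clean)
```

Without the cleaning step, two movies would look slightly similar only because their lead actors share a first name.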

# Lowercase and strip spaces
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        #Check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''

print(clean_data('Johnny Depp'))
johnnydepp
# Apply the cleaning function to the features
features = ['cast', 'keywords', 'director', 'genres']

for feature in features:
    df2[feature] = df2[feature].apply(clean_data)
df2[features].head() # inspect the result
  | cast | keywords | director | genres
0 | [samworthington, zoesaldana, sigourneyweaver] | [cultureclash, future, spacewar] | jamescameron | [action, adventure, fantasy]
1 | [johnnydepp, orlandobloom, keiraknightley] | [ocean, drugabuse, exoticisland] | goreverbinski | [adventure, fantasy, action]
2 | [danielcraig, christophwaltz, léaseydoux] | [spy, basedonnovel, secretagent] | sammendes | [action, adventure, crime]
3 | [christianbale, michaelcaine, garyoldman] | [dccomics, crimefighter, terrorist] | christophernolan | [action, crime, drama]
4 | [taylorkitsch, lynncollins, samanthamorton] | [basedonnovel, mars, medallion] | andrewstanton | [action, adventure, sciencefiction]
def create_soup(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])
df2['soup'] = df2.apply(create_soup, axis=1)

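The next steps are the same as for the plot-description-based recommender. One important difference is that we use CountVectorizer instead of TF-IDF, because we do not want to down-weight an actor or director just because they have acted in or directed relatively many movies.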
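To make the down-weighting concrete, here is a minimal sketch of an inverse document frequency, assuming the smoothed form idf(t) = ln((1 + n) / (1 + df(t))) + 1 that scikit-learn documents as the TfidfVectorizer default. The corpus size matches the 4803 movies in this post; the two document frequencies are hypothetical.

```python
import math

def smooth_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 4803  # number of movies in the dataset

# Hypothetical document frequencies for two name tokens
idf_prolific = smooth_idf(n_docs, doc_freq=40)  # a name appearing in 40 movies
idf_rare = smooth_idf(n_docs, doc_freq=2)       # a name appearing in 2 movies

print(round(idf_prolific, 3), round(idf_rare, 3))
```

Under TF-IDF the prolific name would carry less weight than the rare one, whereas a raw count treats every occurrence as 1 regardless of how many movies the person appears in.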

# Import CountVectorizer and create the count matrix
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(df2['soup'])
# Compute the cosine similarity matrix from count_matrix
from sklearn.metrics.pairwise import cosine_similarity
cosine_sim2 = cosine_similarity(count_matrix, count_matrix)
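# Reset the index of the main DataFrame and rebuild the title-to-row reverse mapping,
# keeping the first row for any duplicated title
df2 = df2.reset_index()
indices = pd.Series(df2.index, index=df2['title'])
indices = indices[~indices.index.duplicated(keep='first')]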

indices.head()
title
Avatar                                      0
Pirates of the Caribbean: At World's End    1
Spectre                                     2
The Dark Knight Rises                       3
John Carter                                 4
dtype: int64
get_recommendations('The Dark Knight Rises', cosine_sim2)
65               The Dark Knight
119                Batman Begins
4638    Amidst the Devil's Wings
1196                The Prestige
3073           Romeo Is Bleeding
3326              Black November
1503                      Takers
1986                      Faster
303                     Catwoman
747               Gangster Squad
Name: title, dtype: object
get_recommendations('The Godfather', cosine_sim2)
867      The Godfather: Part III
2731      The Godfather: Part II
4638    Amidst the Devil's Wings
2649           The Son of No One
1525              Apocalypse Now
1018             The Cotton Club
1170     The Talented Mr. Ripley
1209               The Rainmaker
1394               Donnie Brasco
1850                    Scarface
Name: title, dtype: object

Collaborative Filtering

Singular Value Decomposition

from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate

reader = Reader()
ratings = pd.read_csv('./movie/ratings_small.csv')
ratings.head()
   userId  movieId  rating   timestamp
0       1       31     2.5  1260759144
1       1     1029     3.0  1260759179
2       1     1061     3.0  1260759182
3       1     1129     2.0  1260759185
4       1     1172     4.0  1260759205

Note that in this dataset movies are rated on a scale of 5, unlike the previous dataset, where vote_average is on a 10-point scale.

data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)
svd = SVD()
cross_validate(svd, data, measures=['RMSE', 'MAE'],cv = 5)
{'test_rmse': array([0.89845754, 0.87851544, 0.90687626, 0.88875854, 0.90646164]),
 'test_mae': array([0.69207854, 0.67794004, 0.69724481, 0.68496575, 0.69589722]),
 'fit_time': (4.268375635147095,
  4.251009225845337,
  4.2722389698028564,
  3.6795027256011963,
  3.63496994972229),
 'test_time': (0.12253570556640625,
  0.12330198287963867,
  0.12207889556884766,
  0.09984803199768066,
  0.10153317451477051)}

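We get a mean root mean squared error (RMSE) of about 0.90 across the five folds, which is good enough for our case. Let us now train on the full dataset and make predictions.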
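For reference, the two reported metrics can be written out in a few lines. The sketch below averages the five test_rmse values printed above and defines RMSE and MAE on a tiny made-up pair of rating lists; the helper names and toy ratings are illustrative only.

```python
import math

# Per-fold RMSE values as printed by cross_validate above
test_rmse = [0.89845754, 0.87851544, 0.90687626, 0.88875854, 0.90646164]
mean_rmse = sum(test_rmse) / len(test_rmse)

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more heavily."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error: the average size of the error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up true and predicted ratings
toy_true, toy_pred = [1.0, 3.0], [2.0, 5.0]

print(round(mean_rmse, 4), rmse(toy_true, toy_pred), mae(toy_true, toy_pred))
```

An RMSE a little under 0.9 means that predictions are typically off by somewhat less than one star on the 5-point scale.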

trainset = data.build_full_trainset()
svd.fit(trainset)
<surprise.prediction_algorithms.matrix_factorization.SVD at 0x14c51df10>
ratings[ratings['userId'] == 2]
    userId  movieId  rating  timestamp
20       2       10     4.0  835355493
21       2       17     5.0  835355681
22       2       39     5.0  835355604
23       2       47     4.0  835355552
24       2       50     4.0  835355586
..     ...      ...     ...        ...
91       2      592     5.0  835355395
92       2      593     3.0  835355511
93       2      616     3.0  835355932
94       2      661     4.0  835356141
95       2      720     4.0  835355978

76 rows × 4 columns

svd.predict(1, 302, 3)
Prediction(uid=1, iid=302, r_ui=3, est=2.7593729618356786, details={'was_impossible': False})
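In the call above, the third argument (r_ui=3) is the known true rating, passed along for reference; est is the model's estimate. As the Surprise documentation describes it, SVD estimates a rating as the global mean plus a user bias, an item bias, and the dot product of the user and item factor vectors, and predict clips the result to the rating scale. The sketch below reproduces that arithmetic with entirely hypothetical numbers; the function name and all parameter values are assumptions made for illustration, not values from the trained model.

```python
def predict_rating(mu, b_u, b_i, p_u, q_i, low=1.0, high=5.0):
    """Biased matrix-factorization estimate: mu + b_u + b_i + q_i . p_u,
    clipped to the rating scale [low, high]."""
    est = mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))
    return min(high, max(low, est))

# Hypothetical learned parameters for one user and one movie
est = predict_rating(
    mu=3.5,            # global mean rating
    b_u=-0.4,          # this user rates lower than average
    b_i=0.1,           # this movie is rated slightly above average
    p_u=[0.2, -0.1],   # user latent factors
    q_i=[0.5, 0.3],    # item latent factors
)

print(round(est, 2))
```

The bias terms capture how harsh a user is and how well liked a movie is overall, while the factor dot product captures how well this particular user's taste matches this particular movie.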
