Dataset
Basic information on roughly 5,000 popular English-language movies, downloaded from Kaggle. The attributes are:
Text fields:
title, cast, crew, genres, keywords, original_language, original_title, overview, production_companies, tagline, production_countries, spoken_languages
Numeric fields:
popularity, release_date, revenue, runtime, vote_average, vote_count, budget
Goal
Content-based filtering
Find similar movies from the information the dataset provides, chiefly the text fields. This is purely content-based matching; user preferences, tastes, and ratings are not considered here.
For a user-based recommendation approach, see: 肖月: Mining the Douban movie dataset · SVD recommender system (zhuanlan.zhihu.com)
Input: word vectors
One option is to define a movie by its plot synopsis: take the movie's overview text, tokenize it, count term frequencies, and build a word vector from them.
The other option is to define a movie by human-curated labels, i.e. its metadata: keywords, genres, director, cast, and so on.
TF-IDF
TF: term instances / total instances (occurrences of the term in the document divided by the document's total term count)
IDF: log(number of documents / documents with term)
The TF × IDF product highlights terms that are frequent within one document but rare across the corpus; words that appear in almost every document have their weight suppressed. This suits the first option.
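As a sanity check, the formulas above can be computed by hand on a toy corpus (note that sklearn's TfidfVectorizer uses a smoothed IDF and L2 normalization, so its exact numbers differ):

```python
import math

# Toy corpus: three "documents", already tokenized
docs = [["the", "cat", "sat"],
        ["the", "dog", "ran"],
        ["the", "cat", "ran"]]

def tf(term, doc):
    # term instances / total instances in this document
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log(number of documents / documents containing the term)
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "the" appears in every document, so its IDF (hence TF-IDF) is 0
print(tfidf("the", docs[0], docs))  # 0.0
# "cat" appears in only 2 of 3 documents and keeps a positive weight
print(tfidf("cat", docs[0], docs))
```

This makes the common-word suppression concrete: the ubiquitous "the" scores exactly zero, while the rarer "cat" retains weight.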
Key metric: similarity
Turning texts into word vectors embeds the movies in a vector space, so comparing two movies reduces to measuring the distance between their word vectors. Several metrics are available, such as Euclidean distance, Pearson correlation, and cosine; here we use cosine similarity, which measures the angle between two word vectors.
For the first option, sklearn's linear_kernel() computes the dot product of the TF-IDF matrix with its own transpose; since TfidfVectorizer L2-normalizes its rows by default, this dot product is exactly the cosine similarity. The metadata labels need no TF-IDF common-word damping, so plain term counts from sklearn's CountVectorizer() are enough, and the matching similarity is computed with sklearn's cosine_similarity.
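The reason the plain dot product works here can be shown with numpy alone: once two vectors are scaled to unit length, their dot product equals the cosine of the angle between them.

```python
import numpy as np

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 1.0, 1.0])

# Cosine similarity from the definition: dot(a, b) / (|a| * |b|)
cos_ab = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# After L2 normalization, the bare dot product gives the same value
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot_unit = a_unit @ b_unit

print(cos_ab, dot_unit)  # identical up to floating-point error
```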
Data loading and cleaning
import pandas as pd
import numpy as np

# Credits file: rename its columns so the id column matches the movies file
df_credits = pd.read_csv('../movie_filter/input/tmdb_5000_credits.csv')
df_credits.columns = ['id','title','cast','crew']

df_movies = pd.read_csv('../movie_filter/input/tmdb_5000_movies.csv')
df_movies = df_movies.merge(df_credits[['id','cast','crew']], on='id', how='outer')

# Numeric columns are not used for content-based matching; drop them
df_movies.drop(columns=['budget','homepage','popularity', 'release_date', 'revenue', 'runtime', 'vote_average', 'vote_count'], inplace=True)
print('Dataset Shape: ',df_movies.shape)
df_movies.head(3)
Dataset Shape: (4803, 14)
df_movies[['cast','crew','keywords','genres']].info()
Data columns (total 4 columns):
cast 4803 non-null object
crew 4803 non-null object
keywords 4803 non-null object
genres 4803 non-null object
Movie text --> word vectors
Following the two ways of defining a movie, prepare two sets of data:
overview: the plot synopsis
metadata: the movie's labels
For the metadata we take the genre tags, plot keywords, director, screenwriters, and the top three billed actors, and merge them into one long text whose term counts form the word vector. When cleaning the data, remember to strip the spaces inside keywords and people's names.
Plot-synopsis word vectors:
from sklearn.feature_extraction.text import TfidfVectorizer
overview = df_movies['overview'].fillna('')
overview_tfidf = TfidfVectorizer(stop_words='english')
overview_tfidf_matrix = overview_tfidf.fit_transform(overview)
overview_tfidf_matrix.shape
(4803, 20978)
Metadata word vectors:
First pull out the needed columns and convert their values from stringified representations into real Python objects:
from sklearn.metrics.pairwise import linear_kernel
from ast import literal_eval
features = ['cast','crew','keywords','genres']
for feature in features:
df_movies[feature] = df_movies[feature].apply(literal_eval)
df_meta = df_movies[features].copy()
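The reason literal_eval is needed: pandas reads these columns as plain strings that merely look like lists of dicts. A minimal illustration (the raw value below is a hand-made example of the stored format):

```python
from ast import literal_eval

# The CSV stores nested data as a string, e.g. one movie's genres cell:
raw = "[{'id': 28, 'name': 'Action'}, {'id': 12, 'name': 'Adventure'}]"

genres = literal_eval(raw)          # now a real Python list of dicts
print(type(genres).__name__)        # list
print([g['name'] for g in genres])  # ['Action', 'Adventure']
```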
Helpers to extract the director, screenwriters, and lead actors:
def get_crew(crews, job='Director'):
    # collect the names of crew members holding the given job title
    names = []
    for i in crews:
        if i['job'] == job:
            names.append(i['name'])
    return names

def get_list(casts, limit=3):
    # keep at most `limit` names; fall back to [] for missing values
    if isinstance(casts, list):
        names = [i['name'] for i in casts]
        return names[:limit]
    return []
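A quick sanity check of the two helpers on a hand-made record (restated here in condensed form so the snippet runs on its own; the dicts only mimic the TMDB structure):

```python
# Condensed restatement of the helpers above
def get_crew(crews, job='Director'):
    return [i['name'] for i in crews if i['job'] == job]

def get_list(casts, limit=3):
    if isinstance(casts, list):
        return [i['name'] for i in casts][:limit]
    return []  # non-list values (e.g. NaN) fall back to an empty list

crew = [{'job': 'Director', 'name': 'James Cameron'},
        {'job': 'Screenplay', 'name': 'James Cameron'},
        {'job': 'Producer', 'name': 'Jon Landau'}]
cast = [{'name': 'Sam Worthington'}, {'name': 'Zoe Saldana'},
        {'name': 'Sigourney Weaver'}, {'name': 'Stephen Lang'}]

print(get_crew(crew))                    # ['James Cameron']
print(get_crew(crew, job='Screenplay'))  # ['James Cameron']
print(get_list(cast))                    # first three names only
```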
df_meta['director'] = df_meta['crew'].apply(get_crew)
df_meta['writers'] = df_meta['crew'].apply(get_crew,job='Screenplay')
for feature in ['keywords','genres']:
df_meta[feature] = df_meta[feature].apply(get_list, limit=4)
df_meta['cast'] = df_meta['cast'].apply(get_list)
df_meta.head()
def clean_data(x):
    # lowercase and strip internal spaces so multi-word names and keywords
    # become single tokens (e.g. 'James Cameron' -> 'jamescameron')
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    elif isinstance(x, str):
        return str.lower(x.replace(" ", ""))
    return ''
for feature in ['cast','keywords','director','writers','genres']:
df_meta[feature] = df_meta[feature].apply(clean_data)
def create_soup(x):
    # merge all cleaned tags into one space-separated string per movie
    return ' '.join(x['keywords'] + x['cast'] + x['writers'] + x['director'] + x['genres'])
df_meta['soup'] = df_meta.apply(create_soup,axis=1)
df_meta['soup'][:5]
0 cultureclash future spacewar spacecolony samwo...
1 ocean drugabuse exoticisland eastindiatradingc...
2 spy basedonnovel secretagent sequel danielcrai...
3 dccomics crimefighter terrorist secretidentity...
4 basedonnovel mars medallion spacetravel taylor...
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer(stop_words='english')
meta = df_meta['soup']
meta_matrix = count.fit_transform(meta)
print(meta_matrix.shape)
(4803, 14767)
Preview the cleaned tags as a word cloud
doc_list = list(df_meta['soup'])
words = []
for doc in doc_list:
words+=doc.strip().split(' ')
words_dict = pd.Series([word for word in words if word!='']).value_counts()
import matplotlib.pyplot as plt
from wordcloud import WordCloud

tone = 48.0  # base hue, defined before the color function that reads it

def random_color_func(word=None, font_size=None, position=None,
                      orientation=None, font_path=None, random_state=None):
    # fixed hue and saturation, randomized lightness per word
    h = int(360.0 * tone / 255.0)
    s = 100
    l = int(100.0 * random_state.randint(70, 120) / 255.0)
    return "hsl({}, {}%, {}%)".format(h, s, l)

wordcloud = WordCloud(width=1000, height=300, background_color='black',
                      max_words=500, relative_scaling=1,
                      color_func=random_color_func,
                      normalize_plurals=False)
wordcloud.generate_from_frequencies(words_dict.to_dict())
fig = plt.figure(1, figsize=(18,14))
ax1 = fig.add_subplot(2,1,1)
ax1.imshow(wordcloud, interpolation="bilinear")
ax1.axis('off')
ax2 = fig.add_subplot(2,1,2)
y_axis = list(words_dict[:60])
x_axis = list(words_dict[:60].index)
plt.xticks(rotation=85, fontsize = 12)
plt.yticks(fontsize = 12)
plt.ylabel("Frequency", fontsize = 15, labelpad = 10)
ax2.bar(x_axis, y_axis, align = 'center', color='g')
plt.show()
The genre tags carry a lot of weight and dominate the matches; it is worth considering ways to weaken their influence.
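One way to act on this, sketched here as an assumption rather than something done in this article: vectorize the genre tokens separately from the other tags, then blend the two similarity matrices with an explicit, tunable weight. The matrices below are hypothetical hand-made values for three movies; in the real pipeline they would come from cosine_similarity on two separate CountVectorizer matrices.

```python
import numpy as np

# Hypothetical per-feature similarity matrices for 3 movies:
# movies 0 and 1 share only their genre; movies 0 and 2 share other tags
genre_sim = np.array([[1.0, 1.0, 0.0],
                      [1.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])
other_sim = np.array([[1.0, 0.1, 0.6],
                      [0.1, 1.0, 0.2],
                      [0.6, 0.2, 1.0]])

genre_weight = 0.2  # tunable: smaller values weaken the genre signal
blended = genre_weight * genre_sim + (1 - genre_weight) * other_sim

# Movie 0's nearest neighbour flips from 1 (same genre) to 2 (shared tags)
print(np.argsort(-blended[0])[1])  # 2
```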
Define a method that fetches similar movies:
input a movie title and look up its index
extract that movie's row from the similarity matrix and sort it
take the ten highest-scoring movies; note the top one is always the movie itself and must be removed
return the titles of the ten similar movies
# Map movie titles to row indices in the dataframe / similarity matrices
indices = pd.Series(df_movies.index, index=df_movies['title'])

from sklearn.metrics.pairwise import cosine_similarity
overview_cosine = linear_kernel(overview_tfidf_matrix, overview_tfidf_matrix)
meta_cosine = cosine_similarity(meta_matrix, meta_matrix)

def get_recommendations(title, cosine_sim):
    idx = indices[title]
    # pair each movie index with its similarity score to this movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # skip position 0: the most similar movie is always the movie itself
    sim_scores = sim_scores[1:11]
    movie_indices = [i[0] for i in sim_scores]
    df_recommended = df_movies[['title']].iloc[movie_indices]
    df_recommended.index = range(1, 11)
    df_recommended.rename(columns={'title': 'Similar Movies'}, inplace=True)
    return df_recommended
Results
By plot synopsis:
get_recommendations("Avatar",overview_cosine)
By metadata labels:
get_recommendations("Avatar",meta_cosine)
Judging from the results, I find the matches based on keywords and genre tags more accurate.