Movie Recommendation System

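Common algorithms for movie recommendation systems include collaborative filtering and matrix factorization. Collaborative filtering comes in two forms: item-based and user-based.
The data comes from:
here.
A sample of the data is shown below: movies.csv (shape: 10329 × 3) and ratings.csv (shape: 105339 × 4)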

movies.csv
ratings.csv

Simple data analysis

Most viewed movies visualization: we explore the most viewed movies in the dataset. We first count how many times each film was rated (used here as its number of views), then sort the films by that count in descending order.


Next, we visualize the distribution of the average rating per user.
Figure: distribution of the average ratings per user
The per-user average ratings have mean = 3.66 and std = 0.456.
The code for both analyses is as follows:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import rcParams

# Data pre-processing
movies = pd.read_csv(r"movies.csv")
ratings = pd.read_csv(r"ratings.csv")
# create the maps movieId -> title and movieId -> genres
rawidtotitle = {m: n for m, n, _ in movies.values.tolist()}
rawidtogenre = {m: n for m, _, n in movies.values.tolist()}
print(ratings.shape)

###################################################
# Most viewed movies visualization
# count how many ratings each movie received
moviesid = ratings["movieId"].tolist()
movies_num = dict()
for key in moviesid:
    movies_num[key] = movies_num.get(key, 0) + 1

# sort the movies by number of ratings, most rated first
movies_sort = sorted(movies_num.items(), key=lambda kv: kv[1], reverse=True)
name = [rawidtotitle[m] for m, _ in movies_sort]
num = [m for _, m in movies_sort]

# Plot the first 10 movies (x: title, y: number of ratings)
# set font size and figure size
config = {
    "font.family": "serif",
    "font.size": 10,
    "mathtext.fontset": 'stix',
}
rcParams.update(config)
plt.figure(dpi=160, figsize=(10, 4))

bar_width = 0.3
plt.bar(np.arange(1, 11), num[:10], bar_width, color="g")

# write the count above each bar
for x, count in zip(np.arange(1, 11), num[:10]):
    plt.text(x, count, str(count), ha='center', va='bottom', fontsize=10)

# use only the first 10 titles so the labels match the 10 ticks
plt.xticks(np.arange(1, 11), name[:10], rotation=20, horizontalalignment='right')
plt.ylabel('Viewed Times')
plt.show()


# Visualize the distribution of the average ratings per user.
from collections import defaultdict
import seaborn as sns

# collect the ratings given by each user
score_dic = defaultdict(list)
score = ratings[['userId', 'rating']].values.tolist()
for user, rating in score:
    score_dic[user].append(rating)
ave_score = [np.mean(s) for s in score_dic.values()]
print('average:', np.mean(ave_score))
print('std:', np.std(ave_score))

config = {
    "font.family": "serif",
    "font.size": 10,
    "mathtext.fontset": 'stix',
}
rcParams.update(config)
plt.figure(dpi=160, figsize=(5, 4))
# histplot with kde=True replaces the deprecated sns.distplot
sns.histplot(ave_score, bins=20, color='g', kde=True, stat='density')
plt.xlabel("Average Ratings")
plt.show()
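The two statistics above (ratings per movie and average rating per user) can also be computed directly with pandas. The sketch below is only an illustration: it uses a tiny made-up table named `toy` with the same `userId`, `movieId`, and `rating` columns as ratings.csv, and its values are not taken from the dataset.

```python
import pandas as pd

# Toy ratings table; the rows are made up for illustration only.
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 2, 3],
    "movieId": [10, 20, 10, 20, 30, 10],
    "rating":  [4.0, 3.0, 5.0, 4.0, 3.0, 2.0],
})

# Number of ratings per movie, most rated first.
views = toy["movieId"].value_counts()

# Average rating given by each user.
user_mean = toy.groupby("userId")["rating"].mean()

print(views.head(10))
print("average:", user_mean.mean())
print("std:", user_mean.std(ddof=0))  # ddof=0 matches np.std
```

On the real data, replacing `toy` with `ratings` should give the same counts and averages as the dictionary-based loops, in fewer lines.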

Selecting useful data: to keep only informative data, we set a threshold of 50 on both the minimum number of ratings a user must have given and the minimum number of ratings a film must have received.

processed data
The code is as follows:

# Data preparation
# Selecting useful data
'''
To keep only informative data, we require at least 50 ratings per user
and at least 50 ratings per film. This separates the frequently watched
films from the least-watched ones.
After filtering, 420 users and 447 films remain, as opposed to the
original 668 users and 10325 films.
'''
from collections import Counter
userid_dic = Counter(ratings['userId'].tolist())
movieid_dic = Counter(ratings['movieId'].tolist())

print('original number of users:', len(userid_dic))
print('original number of movies:', len(movieid_dic))

# collect the users and the movies that have fewer than 50 ratings
rm_user = {uid for uid, num in userid_dic.items() if num < 50}
rm_movie = {mid for mid, num in movieid_dic.items() if num < 50}

print('remaining number of users:', len(userid_dic) - len(rm_user))
print('remaining number of movies:', len(movieid_dic) - len(rm_movie))

# drop every rating that involves a removed user or a removed movie
keep = ~ratings["userId"].isin(rm_user) & ~ratings["movieId"].isin(rm_movie)
rm_ratings = ratings[keep]
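To make the effect of the threshold easier to see, here is a minimal sketch of the same filtering idea on a toy table. The helper `filter_min_ratings` and the table `toy` are hypothetical names introduced for this illustration, and the threshold is lowered to 2 so that the toy data is enough to trigger it.

```python
import pandas as pd

def filter_min_ratings(df, min_count):
    """Keep ratings whose user and movie each have at least min_count ratings in df."""
    # Map every row to the total number of ratings of its user / its movie.
    user_counts = df["userId"].map(df["userId"].value_counts())
    movie_counts = df["movieId"].map(df["movieId"].value_counts())
    return df[(user_counts >= min_count) & (movie_counts >= min_count)]

# Toy ratings table; the rows are made up for illustration only.
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 20, 30],
    "rating":  [4.0, 3.0, 5.0, 4.0, 2.0],
})

filtered = filter_min_ratings(toy, min_count=2)
print(filtered)
```

User 3 and movie 30 each have a single rating, so their row is dropped. As in the code above, the counts are computed once on the unfiltered data, so a user who passes the threshold may end up with fewer ratings after sparse movies are removed.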

User-based KNNBasic collaborative filtering

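User-based KNNBasic: a basic collaborative filtering algorithm. The prediction $\hat{r}_{ui}$ is set as:

$$\hat{r}_{ui} = \frac{\sum_{v \in N_i^k(u)} \text{sim}(u, v) \cdot r_{vi}}{\sum_{v \in N_i^k(u)} \text{sim}(u, v)}$$

where $N_i^k(u)$ is the set of the $k$ nearest neighbors of user $u$ that have rated item $i$.
The similarity is calculated with MSD (Mean Squared Difference). Only the items rated by both users are taken into account. The Mean Squared Difference is defined as:

$$\text{msd}(u, v) = \frac{1}{|I_{uv}|} \sum_{i \in I_{uv}} (r_{ui} - r_{vi})^2$$

where $I_{uv}$ is the set of items rated by both $u$ and $v$.
The MSD-similarity is then defined as:

$$\text{msd\_sim}(u, v) = \frac{1}{\text{msd}(u, v) + 1}$$

The +1 term is just here to avoid dividing by zero.
Of course, you can also use cosine or other similarity measures.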
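As a worked example of the MSD similarity and the similarity-weighted prediction, here is a minimal pure-Python sketch. It is not the Surprise implementation: the functions `msd_similarity` and `predict` and the users in `toy` are hypothetical, and returning a similarity of 0 when two users share no item is an assumption of this sketch.

```python
def msd_similarity(ratings_u, ratings_v):
    """MSD similarity between two users, each given as a dict item -> rating."""
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0  # assumption: no common item means no similarity
    msd = sum((ratings_u[i] - ratings_v[i]) ** 2 for i in common) / len(common)
    return 1.0 / (msd + 1.0)

def predict(user, item, ratings, k=40):
    """Similarity-weighted average of the ratings of item by the k most similar users."""
    neighbors = [
        (msd_similarity(ratings[user], r), r[item])
        for other, r in ratings.items()
        if other != user and item in r
    ]
    # Keep the k neighbors with the highest positive similarity.
    neighbors = [(s, r) for s, r in neighbors if s > 0]
    top_k = sorted(neighbors, key=lambda t: t[0], reverse=True)[:k]
    total_sim = sum(s for s, _ in top_k)
    if total_sim == 0:
        return None  # no usable neighbor
    return sum(s * r for s, r in top_k) / total_sim

# Toy data; the users and ratings are made up for illustration only.
toy = {
    "alice": {"A": 4.0, "B": 3.0},
    "bob":   {"A": 4.0, "B": 3.0, "C": 5.0},
    "carol": {"A": 2.0, "B": 5.0, "C": 1.0},
}

print(msd_similarity(toy["alice"], toy["carol"]))
print(predict("alice", "C", toy))
```

Here alice and bob agree on every common item (msd = 0, similarity 1), while alice and carol differ by 2 on both common items (msd = 4, similarity 0.2), so the prediction for item C is pulled strongly toward bob's rating.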

Building the model

We use the Python Surprise library.

The code is as follows:

# load the ratings into Surprise datasets
from surprise import Dataset
from surprise import Reader
# rating_scale should match the dataset; adjust it if the ratings use another range
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings[["userId", "movieId", "rating"]], reader)
rm_data = Dataset.load_from_df(rm_ratings[["userId", "movieId", "rating"]], reader)
# Compare the accuracy between the original data and the filtered data
# create the model
from surprise import KNNBasic
from surprise.model_selection import cross_validate
sim_options = {
    "name": "MSD",
    'user_based': True
}
knnb = KNNBasic(k=40, min_k=1, sim_options=sim_options, verbose=False)

# using the KNNBasic model, compute accuracy metrics on data and rm_data
result = cross_validate(knnb, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
rm_result = cross_validate(knnb, rm_data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
# Visualize the RMSE of each fold
fig = plt.figure(dpi=160, figsize=(5, 4))
config = {
    "font.family": "serif",
    "font.size": 10,
    "mathtext.fontset": 'stix',
}
rcParams.update(config)
plt.plot(np.arange(1, 6), result['test_rmse'], color="y", lw=0.8, ls='-', marker='o', ms=8)
plt.plot(np.arange(1, 6), rm_result['test_rmse'], color='green', lw=0.8, marker='^', ms=8)
plt.xticks([1, 2, 3, 4, 5])
# legend settings
plt.legend(['unprocessed', 'processed'], loc='best', frameon=False)
plt.ylabel('RMSE')
plt.show()

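Figure: RMSE of each of the 5 folds, unprocessed vs. processed data
The figure above shows the results of 5-fold cross-validation. The RMSE on the processed data is lower, so the recommendations are more accurate. This is because we removed the users and movies that provide very little information.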
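For reference, the two accuracy metrics reported by the cross-validation can be written out in a few lines. This is a minimal sketch of the standard definitions, not Surprise's own code; the functions `rmse` and `mae` and the pairs in `toy_pairs` are hypothetical names and made-up values.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true rating, predicted rating) pairs."""
    return math.sqrt(sum((t - p) ** 2 for t, p in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true rating, predicted rating) pairs."""
    return sum(abs(t - p) for t, p in pairs) / len(pairs)

# Toy (true, predicted) pairs; made up for illustration only.
toy_pairs = [(4.0, 3.5), (3.0, 3.0), (5.0, 4.0), (2.0, 3.0)]

print("RMSE:", rmse(toy_pairs))
print("MAE:", mae(toy_pairs))
```

Because RMSE squares each error before averaging, it penalizes large mistakes more heavily than MAE, which is why it is never smaller than MAE on the same predictions.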

Tuning hyperparameters

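We tune the hyperparameters to optimize the model by varying the similarity measure (msd or cosine) and the value of k in KNN. For each parameter combination, the RMSE is the average over a 5-fold cross-validation.
Figure: RMSE versus k for the msd and cosine similarities
As the figure shows, the model performs somewhat better with k = 20 and the msd similarity. The RMSE of the model is then 0.846.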

# Tune the hyperparameters to optimize the model
from surprise.model_selection import GridSearchCV
k_values = list(range(10, 150, 10))
param_grid = {'k': k_values,
              'sim_options': {'name': ['msd', 'cosine'],
                              'min_support': [1],
                              'user_based': [True]},
              }

gs = GridSearchCV(KNNBasic, param_grid, measures=['rmse', 'mae'], cv=5)
gs.fit(rm_data)
print('best_params:', gs.best_params['rmse'])
print('best score of RMSE:', gs.best_score['rmse'])
grid_result = gs.cv_results['mean_test_rmse']
# the results alternate between msd (even indices) and cosine (odd indices)
msd_r = grid_result[0::2]
cosin_r = grid_result[1::2]
# Visualize the RMSE for the two similarity measures
fig = plt.figure(dpi=160, figsize=(5, 4))
config = {
    "font.family": "serif",
    "font.size": 10,
    "mathtext.fontset": 'stix',
}
rcParams.update(config)
plt.plot(k_values, msd_r, color="y", lw=0.8, ls='-', marker='o', ms=8)
plt.plot(k_values, cosin_r, color='green', lw=0.8, marker='^', ms=8)
# legend settings
plt.legend(['msd', 'cosine'], loc='best', frameon=False)
plt.ylabel('RMSE')
plt.xlabel('K')
plt.show()
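The code above separates the msd and cosine results by index parity, which depends on the order in which the grid search enumerates the parameter combinations. A sketch of an order-independent alternative is shown below. It assumes that the results dictionary exposes a `params` list of dicts shaped like the parameter grid alongside `mean_test_rmse`, which is worth checking against the installed Surprise version; the helper `rmse_curves` and the dictionary `mock` are hypothetical, and the scores in `mock` are made up.

```python
from collections import defaultdict

def rmse_curves(cv_results):
    """Return {similarity name: [(k, mean RMSE), ...]}, each list sorted by k."""
    curves = defaultdict(list)
    for params, score in zip(cv_results["params"], cv_results["mean_test_rmse"]):
        curves[params["sim_options"]["name"]].append((params["k"], score))
    return {name: sorted(points) for name, points in curves.items()}

# Mock results in the assumed shape; the scores are made up for illustration only.
mock = {
    "params": [
        {"k": 10, "sim_options": {"name": "msd"}},
        {"k": 10, "sim_options": {"name": "cosine"}},
        {"k": 20, "sim_options": {"name": "msd"}},
        {"k": 20, "sim_options": {"name": "cosine"}},
    ],
    "mean_test_rmse": [0.90, 0.95, 0.88, 0.94],
}

curves = rmse_curves(mock)
for name, points in curves.items():
    print(name, points)
```

With the real search, `rmse_curves(gs.cv_results)` would give one list of (k, RMSE) points per similarity measure, ready to be plotted without relying on the enumeration order.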