Movie Recommendation System Based on Collaborative Filtering

Latest recommended article published on 2023-07-01 21:35:00

WiFi下的365


Tags: python, data analysis, machine learning, recommender systems

Article link: https://blog.csdn.net/Eddy364/article/details/105987254


Data Processing

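import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
plt.style.use('seaborn')  # matplotlib 3.6+ renamed this style to 'seaborn-v0_8'

# Load the data. Column names stay in Chinese because the rest of the code uses them:
#   电影ID = movie ID, 上市年份 = release year, 种类 = genres
#   (assuming these are the MovieLens 1M files, the second field of movies.dat
#   actually holds the title, with the year in parentheses)
#   用户ID = user ID, 评分 = rating, 时间戳 = timestamp
#   性别 = gender, 年龄 = age, 职业 = occupation, 邮编 = zip code
# engine='python' is needed for the multi-character separator '::'
# and avoids a ParserWarning.
d1 = pd.read_csv(r"movies.dat", sep='::', engine='python',
                 names=['电影ID', '上市年份', '种类'])
d2 = pd.read_csv(r"ratings.dat", sep='::', engine='python',
                 names=['用户ID', '电影ID', '评分', '时间戳'])
d3 = pd.read_csv(r"users.dat", sep='::', engine='python',
                 names=['用户ID', '性别', '年龄', '职业', '邮编'])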

When engine='python' is not specified, each of the three read_csv calls emits this warning:

ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
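To make the separator handling concrete, here is a minimal sketch that parses a few '::'-separated lines from memory. The three sample lines and the English column names are made up for illustration; they are not taken from the dataset.

```python
import io
import pandas as pd

# Hypothetical sample in the same '::'-separated layout as ratings.dat.
sample = (
    "1::1193::5::978300760\n"
    "1::661::3::978302109\n"
    "2::1193::4::978298413\n"
)

ratings = pd.read_csv(
    io.StringIO(sample),
    sep="::",
    engine="python",  # multi-character separators need the Python parser
    header=None,      # the file has no header row
    names=["user_id", "movie_id", "rating", "timestamp"],  # illustrative names
)
print(ratings.shape)
```

Passing engine="python" explicitly selects the parser that pandas would otherwise fall back to, so the result is the same and the warning goes away.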

# Quick look at the data
print("Rows: ", d1.shape[0], d2.shape[0], d3.shape[0], sep='\n')
print("Columns: ", d1.shape[1], d2.shape[1], d3.shape[1], sep='\n')
print("\nColumn names: \n",
      d1.columns.tolist(),
      d2.columns.tolist(),
      d3.columns.tolist(),
      sep='\n')
print("\nMissing values: \n",
      d1.isnull().sum(),
      d2.isnull().sum(),
      d3.isnull().sum(),
      sep='\n\n')
print("\nUnique values: \n", d1.nunique(), d2.nunique(), d3.nunique(), sep='\n\n')

Rows: 
3883
1000209
6040
Columns: 
3
4
5

Column names: 

['电影ID', '上市年份', '种类']
['用户ID', '电影ID', '评分', '时间戳']
['用户ID', '性别', '年龄', '职业', '邮编']

Missing values: 


电影ID    0
上市年份    0
种类      0
dtype: int64

用户ID    0
电影ID    0
评分      0
时间戳     0
dtype: int64

用户ID    0
性别      0
年龄      0
职业      0
邮编      0
dtype: int64

Unique values: 


电影ID    3883
上市年份    3883
种类       301
dtype: int64

用户ID      6040
电影ID      3706
评分           5
时间戳     458455
dtype: int64

用户ID    6040
性别         2
年龄         7
职业        21
邮编      3439
dtype: int64

Data Preparation for Demographic Filtering

# Build a per-movie table: average rating (平均评分) and number of ratings (评分次数)
d2_g = (d2.groupby('电影ID')['评分']
          .agg(['mean', 'count'])
          .rename(columns={'mean': '平均评分', 'count': '评分次数'}))
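The same per-movie summary can be tried on a toy table. This sketch uses pandas named aggregation, which sets the output column names in one step; the data and the English column names are invented for illustration.

```python
import pandas as pd

# Hypothetical ratings: movie 1 has three ratings, movie 2 has two.
toy = pd.DataFrame({
    "movie_id": [1, 1, 1, 2, 2],
    "rating":   [5, 4, 3, 2, 4],
})

# One row per movie, with the mean rating and the number of ratings.
stats = toy.groupby("movie_id")["rating"].agg(
    mean_rating="mean",
    n_ratings="count",
)
print(stats)
```

Here movie 1 averages 4.0 over 3 ratings and movie 2 averages 3.0 over 2 ratings.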

Data Preparation for Collaborative Filtering

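# Pivot table: one row per user, one column per movie
d2_piv = d2.pivot_table(index=['用户ID'], columns=['电影ID'], values='评分')

# Fill missing ratings with 0, transpose to movies x users,
# and drop users (columns) that have no ratings at all
d2_piv_fill = d2_piv.fillna(0)
d2_piv_fill = d2_piv_fill.T
d2_piv_fill = d2_piv_fill.loc[:, (d2_piv_fill != 0).any(axis=0)]

# Normalize each movie (row): subtract the row mean and divide by the row range
d2_piv_norm = d2_piv_fill.apply(lambda x: (x - np.mean(x)) /
                                (np.max(x) - np.min(x)),
                                axis=1)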

# Cosine similarity. The rows of d2_piv_norm are movies, so the first matrix
# is movie-to-movie; the transpose gives the user-to-user matrix.
mov_similarity = cosine_similarity(d2_piv_norm)
user_similarity = cosine_similarity(d2_piv_norm.T)

# Wrap both matrices in DataFrames labelled by movie ID and by user ID
mov_sim_df = pd.DataFrame(mov_similarity,
                          index=d2_piv_norm.index,
                          columns=d2_piv_norm.index)
user_sim_df = pd.DataFrame(user_similarity,
                           index=d2_piv_norm.columns,
                           columns=d2_piv_norm.columns)
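To see what the normalization and the two similarity matrices look like, the sketch below runs the same steps on a 3 x 3 toy matrix. It computes cosine similarity directly with NumPy (a normalized dot product) rather than calling scikit-learn, and the ratings are invented for illustration.

```python
import numpy as np
import pandas as pd

def cosine_matrix(X):
    """Pairwise cosine similarity between the rows of X (rows must be non-zero)."""
    X = np.asarray(X, dtype=float)
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    return unit @ unit.T

# Hypothetical ratings: rows are movies, columns are users, 0 means "not rated".
toy = pd.DataFrame(
    [[5, 4, 0],
     [4, 5, 0],
     [0, 1, 5]],
    index=["m1", "m2", "m3"],
    columns=["u1", "u2", "u3"],
    dtype=float,
)

# Row-wise mean-centering and range scaling, as in the article's normalization step.
centered = toy.sub(toy.mean(axis=1), axis=0)
scaled = centered.div(toy.max(axis=1) - toy.min(axis=1), axis=0)

movie_sim = pd.DataFrame(cosine_matrix(scaled), index=toy.index, columns=toy.index)
user_sim = pd.DataFrame(cosine_matrix(scaled.T), index=toy.columns, columns=toy.columns)
print(movie_sim.round(2))
print(user_sim.round(2))
```

Movies m1 and m2 are rated alike by the same users and come out highly similar, while m3 is rated by the opposite audience and comes out strongly dissimilar to m1.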

Demographic Filtering

Goal: recommend the 10 highest-scoring movies by weighted rating.

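We need a metric for scoring movies.
Compute the score of every movie.
Sort by score and recommend the top-rated movies to users.
We could use each movie's average rating as the score, but that is not reasonable on its own, because an average is easily distorted by the number of votes: a movie rated 5 from only 2 votes should not rank above a movie averaging 4.8 from 40 votes.
For this reason I use IMDB's weighted rating.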

WR = (v ÷ (v+m)) × R + (m ÷ (v+m)) × c

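WR: the weighted rating
R: the average rating users gave the movie
v: the number of votes for the movie
m: the minimum number of votes required for a movie to be listed (here, the 80th percentile of vote counts)
c: the mean of the average ratings across all movies (3.24 here)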
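A quick arithmetic check of the formula, using the two movies from the example above. The threshold m = 20 is an assumed value chosen only to keep the numbers easy to follow; it is not the 80th percentile of the real data.

```python
def weighted_rating(v, R, m, c):
    """IMDB-style weighted rating: pulls R toward c when v is small relative to m."""
    return v / (v + m) * R + m / (v + m) * c

m_assumed = 20  # hypothetical vote threshold, for illustration only
c = 3.24        # mean rating across all movies, as stated above

few_votes = weighted_rating(v=2, R=5.0, m=m_assumed, c=c)
many_votes = weighted_rating(v=40, R=4.8, m=m_assumed, c=c)
print(round(few_votes, 2), round(many_votes, 2))  # 3.4 4.28
```

With only 2 votes, the 5.0 average is pulled most of the way toward c, so the movie with 40 votes ranks higher, which is the behaviour the plain average could not provide.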

c = d2_g.平均评分.mean()
m = d2_g.评分次数.quantile(0.8)

# Keep only movies with at least m ratings (743 movies).
# .copy() makes p_movies an independent DataFrame, so adding a column below
# does not trigger SettingWithCopyWarning.
p_movies = d2_g.loc[d2_g['评分次数'] >= m].copy()
p_movies.shape

(743, 2)

# Weighted rating (WR) for one row of p_movies
def WR(x, m=m, c=c):
    v = x['评分次数']  # number of ratings
    R = x['平均评分']  # average rating
    return (v / (v + m) * R) + (m / (m + v) * c)

# 加权分数 = weighted score
p_movies['加权分数'] = p_movies.apply(WR, axis=1)

# Recommend the top 10 movies by weighted score
p_movies.sort_values('加权分数', ascending=False).head(10)

	平均评分	评分次数	加权分数
电影ID
318	4.554558	2227	4.342050
858	4.524966	2223	4.316925
527	4.510417	2304	4.310825
260	4.453694	2991	4.301311
1198	4.477725	2514	4.297141
50	4.517106	1783	4.269206
2762	4.406263	2459	4.232855
2858	4.317386	3428	4.197429
593	4.351823	2578	4.193044
2028	4.337354	2653	4.184453

Collaborative Filtering

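Next we use the movie rating data to build a recommender with the following features:

Given a movie ID, list the 10 most similar movies
Given a user ID, list the 10 most similar users and their similarity scores
Given a user ID, list the movies rated highest by similar users, with how often each appears
Predict a user's rating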

# Print the k movies most similar to movie_ID
def top10_movie(movie_ID, k=10):
    print('Movies similar to movie ID {}:\n'.format(movie_ID))
    # Position 0 is the movie itself (similarity 1), so skip it
    similar = mov_sim_df.sort_values(by=movie_ID,
                                     ascending=False).index[1:k + 1]
    for rank, i in enumerate(similar, start=1):
        print('NO. {}: {}'.format(rank, i))

# Print the k users most similar to user_ID (despite the name, k defaults to 10)
def top5_users(user_ID, k=10):
    if user_ID not in d2_piv_norm.columns:
        return 'No data for user ID {}'.format(user_ID)

    print('Most similar users:\n')
    # Sort once; position 0 is the user themselves, so skip it
    ranked = user_sim_df.sort_values(by=user_ID,
                                     ascending=False)[user_ID].iloc[1:k + 1]
    for user, sim in ranked.items():
        print('user ID: {0}, similarity: {1:.2f}'.format(user, sim))

# Return the w movies that appear most often among the
# top-rated movies of the k most similar users
def mov_reccomend(user_ID, k=100, w=10):
    if user_ID not in d2_piv_norm.columns:
        return 'No data for user ID {}'.format(user_ID)
    sim_users = user_sim_df.sort_values(by=user_ID,
                                        ascending=False).index[1:k + 1]
    mov_recomm = {}
    for i in sim_users:
        col = d2_piv_norm.loc[:, i]
        # Movies this similar user scored highest (ties included)
        for j in col[col == col.max()].index.tolist():
            mov_recomm[j] = mov_recomm.get(j, 0) + 1
    mov_recomm_list = sorted(mov_recomm.items(),
                             key=lambda item: item[1],
                             reverse=True)
    return mov_recomm_list[0:w]

top10_movie(2)

Movies similar to movie ID 2:

NO. 1: 3489
NO. 2: 653
NO. 3: 60
NO. 4: 317
NO. 5: 2161
NO. 6: 2054
NO. 7: 673
NO. 8: 3438
NO. 9: 2193
NO. 10: 367

top5_users(5)

Most similar users:

user ID: 1227, similarity: 0.26
user ID: 281, similarity: 0.26
user ID: 3240, similarity: 0.25
user ID: 1484, similarity: 0.24
user ID: 590, similarity: 0.24
user ID: 1407, similarity: 0.24
user ID: 5749, similarity: 0.24
user ID: 3538, similarity: 0.24
user ID: 5496, similarity: 0.24
user ID: 1104, similarity: 0.24

mov_reccomend(3)

[(1015, 2),
 (314, 2),
 (290, 2),
 (1257, 2),
 (1959, 2),
 (3671, 1),
 (1246, 1),
 (3006, 1),
 (1661, 1),
 (1017, 1)]
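The feature list includes predicting a user's rating, but the code above stops at recommendation lists. One common way to do it, shown here as a rough sketch and not as the author's method, is a similarity-weighted average of the ratings that the most similar users gave the movie. The function name predict_rating, its parameters, and the toy data are all hypothetical, and it assumes a users x movies table with NaN for unrated entries rather than the zero-filled, normalized matrix used above.

```python
import numpy as np
import pandas as pd

def predict_rating(ratings, user_sim, user, movie, k=2):
    """Similarity-weighted mean of the ratings the k most similar users gave `movie`.

    ratings:  users x movies DataFrame, NaN where a user has not rated a movie
    user_sim: users x users similarity DataFrame
    Returns NaN when no positively similar user has rated the movie.
    """
    # Ratings for this movie from everyone except the target user
    rated = ratings[movie].dropna().drop(labels=user, errors="ignore")
    sims = user_sim.loc[user, rated.index]
    sims = sims[sims > 0].nlargest(k)  # keep the k closest neighbours
    if sims.empty:
        return np.nan
    return float((sims * rated[sims.index]).sum() / sims.sum())

# Hypothetical toy data: u1 has not rated m2.
toy_ratings = pd.DataFrame(
    {"m1": [5, 4, 1], "m2": [np.nan, 4, 2]},
    index=["u1", "u2", "u3"],
)
toy_sim = pd.DataFrame(
    [[1.0, 0.9, 0.1],
     [0.9, 1.0, 0.2],
     [0.1, 0.2, 1.0]],
    index=toy_ratings.index,
    columns=toy_ratings.index,
)

pred = predict_rating(toy_ratings, toy_sim, "u1", "m2")
print(round(pred, 2))  # 3.8
```

With the objects built earlier, a call along the lines of predict_rating(d2_piv, user_sim_df, 5, 2) might serve as a starting point, since d2_piv keeps NaN for unrated movies; whether the resulting predictions are any good would need to be checked against held-out ratings.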

Data source: Baidu Netdisk share, extraction code: 8hkb
