Introduction
Surprise (Simple Python Recommendation System Engine) is a recommendation-system library in the scikit family. It is simple to use and supports a wide range of recommendation algorithms (baselines, collaborative filtering, matrix factorization, and so on).
Surprise was designed with the following goals in mind:
- Give users perfect control over their experiments. To this end, strong emphasis is placed on documentation, which tries to be as clear and precise as possible by pointing out every detail of the algorithms.
- Alleviate the pain of dataset handling. Users can rely on built-in datasets (Movielens, Jester) as well as their own custom datasets.
- Provide various ready-to-use prediction algorithms, such as baseline algorithms, neighborhood methods, and matrix-factorization-based methods (SVD, PMF, SVD++, NMF).
- In addition, various similarity measures (cosine, MSD, Pearson...) are built in, and new algorithm ideas can be implemented easily.
- Provide tools to evaluate, analyze, and compare algorithm performance.
- Cross-validation procedures can be run very easily using powerful CV iterators (inspired by scikit-learn's excellent tools), together with exhaustive search over a set of parameters.
Neighborhood-based methods (collaborative filtering) can be configured with different similarity measures.
Evaluation metrics
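The accuracy metrics used throughout this post (RMSE, MAE) have simple definitions; a minimal sketch of how they are computed from predicted vs. true ratings (the rating values below are made up for illustration):

```python
from math import sqrt

def rmse(predictions, truths):
    """Root mean squared error between predicted and true ratings."""
    return sqrt(sum((p - t) ** 2 for p, t in zip(predictions, truths)) / len(truths))

def mae(predictions, truths):
    """Mean absolute error between predicted and true ratings."""
    return sum(abs(p - t) for p, t in zip(predictions, truths)) / len(truths)

preds = [3.5, 4.0, 2.0]
truth = [4.0, 4.0, 3.0]
print(rmse(preds, truth))  # ≈ 0.6455
print(mae(preds, truth))   # 0.5
```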
1. Loading a dataset locally
from surprise import KNNBasic, SVD
from surprise import Dataset, Reader
from surprise import evaluate, print_perf  # removed in Surprise >= 1.1; use model_selection instead
from surprise.model_selection import cross_validate
import os
# http://surprise.readthedocs.io/en/stable/index.html
# http://files.grouplens.org/datasets/movielens/ml-100k-README.txt
# Load the movielens-100k dataset (download it if needed),
# and split it into 3 folds for cross-validation.
# data = Dataset.load_builtin('ml-100k')  # built-in download
file_path = os.path.expanduser('ml-100k/u.data')  # local copy
reader = Reader(line_format='user item rating timestamp', sep='\t')
data = Dataset.load_from_file(file_path, reader=reader)
# k-fold cross-validation (k=3)
data.split(n_folds=3)
# basic neighborhood collaborative filtering, to test the setup
algo = KNNBasic()
perf = evaluate(algo, data, measures=['RMSE', 'MAE'])  # root mean squared error, mean absolute error
print_perf(perf)
Output:
Evaluating RMSE, MAE of algorithm KNNBasic.
------------
Fold 1
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9898
MAE: 0.7822
------------
Fold 2
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9905
MAE: 0.7821
------------
Fold 3
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9857
MAE: 0.7797
------------
------------
Mean RMSE: 0.9887
Mean MAE : 0.7813
------------
------------
Fold 1 Fold 2 Fold 3 Mean
RMSE 0.9898 0.9905 0.9857 0.9887
MAE 0.7822 0.7821 0.7797 0.7813
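For intuition, KNNBasic's underlying prediction rule is a similarity-weighted average of the neighbors' ratings for the target item. A toy sketch of that rule (the similarity values below are made up for illustration):

```python
def knn_basic_predict(neighbor_sims, neighbor_ratings):
    """Similarity-weighted average of the neighbors' ratings for one item."""
    num = sum(s * r for s, r in zip(neighbor_sims, neighbor_ratings))
    den = sum(neighbor_sims)
    return num / den

# two neighbors rated the item 4 and 2, with similarities 0.9 and 0.1:
# the prediction leans heavily toward the more similar neighbor
print(knn_basic_predict([0.9, 0.1], [4.0, 2.0]))  # ≈ 3.8
```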
2. Finding the best parameters with GridSearch
from surprise import GridSearch
# define the parameter grid to search over
param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}
# use SVD as the algorithm
grid_search = GridSearch(SVD, param_grid, measures=['RMSE', 'FCP'])
# data = Dataset.load_builtin('ml-100k')
file_path = os.path.expanduser('ml-100k/u.data')  # local file
# tell the reader what format the file is in
reader = Reader(line_format='user item rating timestamp', sep='\t')
# load the data
data = Dataset.load_from_file(file_path, reader=reader)
data.split(n_folds=3)
# evaluate every parameter combination
grid_search.evaluate(data)
Output:
------------
Parameters combination 1 of 8
params: {'lr_all': 0.002, 'n_epochs': 5, 'reg_all': 0.4}
------------
Mean RMSE: 0.9972
Mean FCP : 0.6843
------------
------------
Parameters combination 2 of 8
params: {'lr_all': 0.005, 'n_epochs': 5, 'reg_all': 0.4}
------------
Mean RMSE: 0.9734
Mean FCP : 0.6946
------------
------------
Parameters combination 3 of 8
params: {'lr_all': 0.002, 'n_epochs': 10, 'reg_all': 0.4}
------------
Mean RMSE: 0.9777
Mean FCP : 0.6926
------------
------------
Parameters combination 4 of 8
params: {'lr_all': 0.005, 'n_epochs': 10, 'reg_all': 0.4}
------------
Mean RMSE: 0.9635
Mean FCP : 0.6987
------------
------------
Parameters combination 5 of 8
params: {'lr_all': 0.002, 'n_epochs': 5, 'reg_all': 0.6}
------------
Mean RMSE: 1.0029
Mean FCP : 0.6875
------------
------------
Parameters combination 6 of 8
params: {'lr_all': 0.005, 'n_epochs': 5, 'reg_all': 0.6}
------------
Mean RMSE: 0.9820
Mean FCP : 0.6953
------------
------------
Parameters combination 7 of 8
params: {'lr_all': 0.002, 'n_epochs': 10, 'reg_all': 0.6}
------------
Mean RMSE: 0.9860
Mean FCP : 0.6943
------------
------------
Parameters combination 8 of 8
params: {'lr_all': 0.005, 'n_epochs': 10, 'reg_all': 0.6}
------------
Mean RMSE: 0.9733
Mean FCP : 0.6991
------------
- Printing the best parameters
# best RMSE score
print(grid_search.best_score['RMSE'])
# combination of parameters that gave the best RMSE score
print(grid_search.best_params['RMSE'])
# best FCP score
print(grid_search.best_score['FCP'])
# combination of parameters that gave the best FCP score
print(grid_search.best_params['FCP'])
Output:
0.963501988854
{'lr_all': 0.005, 'n_epochs': 10, 'reg_all': 0.4}
0.699084153002
{'lr_all': 0.005, 'n_epochs': 10, 'reg_all': 0.6}
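The grid search simply enumerates the Cartesian product of the parameter lists (2 × 2 × 2 = 8 combinations above) and cross-validates each one. The enumeration itself can be sketched with itertools:

```python
from itertools import product

param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005], 'reg_all': [0.4, 0.6]}

# every combination of one value per parameter
keys = sorted(param_grid)
combos = [dict(zip(keys, values))
          for values in product(*(param_grid[k] for k in keys))]
print(len(combos))  # 8
print(combos[0])    # {'lr_all': 0.002, 'n_epochs': 5, 'reg_all': 0.4}
```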
- The GridSearchCV approach:
from surprise.model_selection import GridSearchCV
# define the parameter grid to search over
param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}
# grid search with cross-validation
grid_search = GridSearchCV(SVD, param_grid, measures=['RMSE', 'FCP'], cv=3)
# find the best parameters on the dataset
data = Dataset.load_builtin('ml-100k')
grid_search.fit(data)
# print the best scores (one per measure)
print(grid_search.best_score)
Common methods and attributes:
grid_search.fit(): runs the grid search
cv_results: evaluation results for every parameter combination
best_params: the parameter combination that achieved the best score
best_score: the best score observed during the search
- The dict keys become DataFrame columns:
import pandas as pd
results_df = pd.DataFrame.from_dict(grid_search.cv_results)
results_df
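The same pattern works for any dict of equal-length lists; a toy illustration of how a cv_results-style dict becomes columns (the keys and values below are made up for illustration):

```python
import pandas as pd

# hypothetical excerpt of a cv_results-style dict: each key becomes a column,
# each list position becomes a row
cv_results = {
    'params': [{'n_epochs': 5}, {'n_epochs': 10}],
    'mean_test_rmse': [0.9972, 0.9635],
}
df = pd.DataFrame.from_dict(cv_results)
print(list(df.columns))  # ['params', 'mean_test_rmse']
```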
3. Comparing different recommendation algorithms
import os
from surprise import Dataset, Reader, print_perf
from surprise.model_selection import cross_validate
file_path = os.path.expanduser('ml-100k/u.data')  # local file
reader = Reader(line_format='user item rating timestamp', sep='\t')
data = Dataset.load_from_file(file_path, reader=reader)
### NormalPredictor
from surprise import NormalPredictor
algo = NormalPredictor()
perf = cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=3)
print('Using NormalPredictor\n')
print_perf(perf)
### BaselineOnly
from surprise import BaselineOnly
algo = BaselineOnly()
perf = cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=3)
print('Using BaselineOnly\n')
print_perf(perf)
### Basic collaborative filtering
from surprise import KNNBasic
algo = KNNBasic()
perf = cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=3)
print('Using basic collaborative filtering\n')
print_perf(perf)
### Mean-centered collaborative filtering
from surprise import KNNWithMeans
algo = KNNWithMeans()
perf = cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=3)
print('Using mean-centered collaborative filtering\n')
print_perf(perf)
print()
### Collaborative filtering with baselines
from surprise import KNNBaseline
algo = KNNBaseline()
perf = cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=3)
print('Using collaborative filtering with baselines')
print_perf(perf)
print()
### SVD
from surprise import SVD
algo = SVD()
perf = cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=3)
print('Using SVD\n')
print_perf(perf)
print()
### SVD++
from surprise import SVDpp
algo = SVDpp()
perf = cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=3)
print('Using SVD++\n')
print_perf(perf)
print()
### NMF
from surprise import NMF
algo = NMF()
perf = cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=3)
print('Using NMF\n')
print_perf(perf)
Output:
Using NormalPredictor
            Fold 1   Fold 2   Fold 3   Mean
TEST_RMSE   1.5211   1.5267   1.5147   1.5208
TEST_MAE    1.2210   1.2279   1.2186   1.2225
FIT_TIME    0.0800   0.0960   0.0980   0.0913
TEST_TIME   0.2160   0.1820   0.1920   0.1967
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Using BaselineOnly
            Fold 1   Fold 2   Fold 3   Mean
TEST_RMSE   0.9501   0.9420   0.9507   0.9476
TEST_MAE    0.7505   0.7491   0.7542   0.7513
FIT_TIME    0.1340   0.1420   0.1610   0.1457
TEST_TIME   0.1540   0.1580   0.1650   0.1590
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Using basic collaborative filtering
            Fold 1   Fold 2   Fold 3   Mean
TEST_RMSE   0.9848   0.9930   0.9822   0.9867
TEST_MAE    0.7785   0.7853   0.7761   0.7800
FIT_TIME    0.2940   0.3080   0.3320   0.3113
TEST_TIME   3.9830   4.2000   4.1950   4.1260
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Using mean-centered collaborative filtering
            Fold 1   Fold 2   Fold 3   Mean
TEST_RMSE   0.9533   0.9585   0.9591   0.9569
TEST_MAE    0.7502   0.7538   0.7571   0.7537
FIT_TIME    0.3170   0.3400   0.3310   0.3293
TEST_TIME   4.4160   4.3900   4.3870   4.3977
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Using collaborative filtering with baselines
            Fold 1   Fold 2   Fold 3   Mean
TEST_RMSE   0.9344   0.9319   0.9397   0.9353
TEST_MAE    0.7372   0.7332   0.7413   0.7372
FIT_TIME    0.4720   0.4480   0.4880   0.4693
TEST_TIME   5.1810   5.2340   5.1500   5.1883
Using SVD
            Fold 1   Fold 2   Fold 3   Mean
TEST_RMSE   0.9430   0.9488   0.9481   0.9467
TEST_MAE    0.7441   0.7492   0.7475   0.7470
FIT_TIME    3.2910   3.2740   3.2060   3.2570
TEST_TIME   0.2330   0.2120   0.2180   0.2210
Using SVD++
            Fold 1    Fold 2    Fold 3    Mean
TEST_RMSE   0.9310    0.9256    0.9244    0.9270
TEST_MAE    0.7305    0.7275    0.7274    0.7285
FIT_TIME    96.0670   93.5950   94.2860   94.6493
TEST_TIME   3.9380    3.9330    3.9100    3.9270
Using NMF
            Fold 1   Fold 2   Fold 3   Mean
TEST_RMSE   0.9704   0.9731   0.9804   0.9746
TEST_MAE    0.7629   0.7633   0.7701   0.7654
FIT_TIME    3.4200   3.4150   3.4240   3.4197
TEST_TIME   0.1840   0.1850   0.1830   0.1840
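For reference, the matrix-factorization models compared above (SVD and friends) predict a rating as the global mean plus user and item biases plus a dot product of latent factors. A minimal sketch of that prediction rule, with made-up bias and factor values:

```python
def svd_predict(mu, b_u, b_i, p_u, q_i):
    """Biased matrix-factorization prediction: global mean + biases + dot product."""
    return mu + b_u + b_i + sum(pf * qf for pf, qf in zip(p_u, q_i))

# made-up values: global mean 3.5, small biases, 3-dimensional latent factors
print(svd_predict(3.5, 0.1, -0.2, [0.5, 0.1, 0.0], [0.4, 0.2, 0.3]))  # ≈ 3.62
```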
4. Training a KNNBaseline model
from __future__ import (absolute_import, division, print_function,
                        unicode_literals)
import os
import io
from surprise import KNNBaseline
from surprise import Dataset

def read_item_names():
    """Read u.item and return two mappings: raw id -> name and name -> raw id."""
    file_name = './ml-100k/u.item'
    rid_to_name = {}
    name_to_rid = {}
    with io.open(file_name, 'r', encoding='ISO-8859-1') as f:
        for line in f:
            line = line.split('|')
            rid_to_name[line[0]] = line[1]
            name_to_rid[line[1]] = line[0]
    return rid_to_name, name_to_rid
data = Dataset.load_builtin('ml-100k')
# build the full trainset (no hold-out split)
trainset = data.build_full_trainset()
# pearson_baseline similarity; user_based=False computes similarities
# between items, i.e. between movies in this example
sim_options = {'name': 'pearson_baseline', 'user_based': False}
algo = KNNBaseline(sim_options=sim_options)
algo.train(trainset)  # train() was renamed fit() in Surprise >= 1.1
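For intuition, plain Pearson similarity between two items is the correlation of the ratings they received from the users who rated both (pearson_baseline additionally shrinks the estimate using baseline ratings). A sketch of plain Pearson, with made-up common ratings:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length rating lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs)) * sqrt(sum((y - my) ** 2 for y in ys))
    return num / den if den else 0.0

# ratings two items received from the same three users: perfectly correlated
print(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # ≈ 1.0
```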
5. Mapping between raw ids and names
- Raw item id
rid_to_name, name_to_rid = read_item_names()
print(rid_to_name)
toy_story_raw_id = name_to_rid['Now and Then (1995)']
toy_story_raw_id  # raw item id
{'1': 'Toy Story (1995)', '2': 'GoldenEye (1995)', '3': 'Four Rooms (1995)', '4': 'Get Shorty (1995)', '5': 'Copycat (1995)', '6': 'Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)', '7': 'Twelve Monkeys (1995)', '8': 'Babe (1995)', '9': 'Dead Man Walking (1995)', '10': 'Richard III (1995)', '11': 'Seven (Se7en) (1995)', '12': 'Usual Suspects, The (1995)', '13': 'Mighty Aphrodite (1995)', '14': 'Postino, Il (1994)', '15': "Mr. Holland's Opus (1995)", '16': 'French Twist (Gazon maudit) (1995)', '17': 'From Dusk Till Dawn (1996)', '18': 'White Balloon, The (1995)', '19': "Antonia's Line (1995)", '20': 'Angels and Insects (1995)', '21': 'Muppet Treasure Island (1996)', '22': 'Braveheart (1995)', '23': 'Taxi Driver (1976)', '24': 'Rumble in the Bronx (1995)', '25': 'Birdcage, The (1996)', '26': 'Brothers McMullen, The (1995)', '27': 'Bad Boys (1995)', '28': 'Apollo 13 (1995)', '29': 'Batman Forever (1995)', '30': 'Belle de jour (1967)', '31': 'Crimson Tide (1995)', '32': 'Crumb (1994)', '33': 'Desperado (1995)', '34': 'Doom Generation, The (1995)', '35': 'Free Willy 2: The Adventure Home (1995)', '36': 'Mad Love (1995)', '37': 'Nadja (1994)', '38': 'Net, The (1995)', '39': 'Strange Days (1995)', '40': 'To Wong Foo, Thanks for Everything! Julie Newmar (1995)', '41': 'Billy Madison (1995)', '42': 'Clerks (1994)', '43': 'Disclosure (1994)', '44': 'Dolores Claiborne (1994)', '45': 'Eat Drink Man Woman (1994)', '46': 'Exotica (1994)', '47': 'Ed Woo........
'1053'
- Raw id to inner (matrix) id
toy_story_inner_id = algo.trainset.to_inner_iid(toy_story_raw_id)
toy_story_inner_id  # inner id
961
- Finding the nearest inner ids
toy_story_neighbors = algo.get_neighbors(toy_story_inner_id, k=10)
toy_story_neighbors
[291, 82, 366, 528, 179, 101, 556, 310, 431, 543]
- Inner ids back to item names
raw_iids = (algo.trainset.to_raw_iid(inner_id)
            for inner_id in toy_story_neighbors)
toy_story_neighbors = (rid_to_name[rid] for rid in raw_iids)
print()
print('The 10 nearest neighbors of Toy Story are:')
for movie in toy_story_neighbors:
    print(movie)
The 10 nearest neighbors of Toy Story are:
While You Were Sleeping (1995)
Batman (1989)
Dave (1993)
Mrs. Doubtfire (1993)
Groundhog Day (1993)
Raiders of the Lost Ark (1981)
Maverick (1994)
French Kiss (1995)
Stand by Me (1986)
Net, The (1995)
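What get_neighbors does under the hood is essentially a top-k selection on one row of the similarity matrix. A sketch of that selection with heapq, using a made-up similarity row:

```python
import heapq

def top_k_neighbors(sim_row, item, k):
    """Inner ids of the k items most similar to `item`, excluding itself."""
    candidates = [(s, j) for j, s in enumerate(sim_row) if j != item]
    return [j for s, j in heapq.nlargest(k, candidates)]

sims = [1.0, 0.2, 0.9, 0.4, 0.7]  # made-up similarity row for item 0
print(top_k_neighbors(sims, item=0, k=3))  # [2, 4, 3]
```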