python推荐系统库Surprise




# coding: utf-8

# ## python推荐系统库Surprise

# ![](./Surprise.png)

# 在推荐系统的建模过程中,我们将用到python库 [Surprise(Simple Python RecommendatIon System Engine)](https://github.com/NicolasHug/Surprise),是scikit系列中的一个(很多同学用过scikit-learn和scikit-image等库)。
#
# ### 简单易用,同时支持多种推荐算法:
# * [基础算法/baseline algorithms](http://surprise.readthedocs.io/en/stable/basic_algorithms.html)
# * [基于近邻方法(协同过滤)/neighborhood methods](http://surprise.readthedocs.io/en/stable/knn_inspired.html)
# * [矩阵分解方法/matrix factorization-based (SVD, PMF, SVD++, NMF)](http://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.SVD)
#
# | 算法类名        | 说明  |
# | ------------- |:-----|
# |[random_pred.NormalPredictor](http://surprise.readthedocs.io/en/stable/basic_algorithms.html#surprise.prediction_algorithms.random_pred.NormalPredictor)|Algorithm predicting a random rating based on the distribution of the training set, which is assumed to be normal.|
# |[baseline_only.BaselineOnly](http://surprise.readthedocs.io/en/stable/basic_algorithms.html#surprise.prediction_algorithms.baseline_only.BaselineOnly)|Algorithm predicting the baseline estimate for given user and item.|
# |[knns.KNNBasic](http://surprise.readthedocs.io/en/stable/knn_inspired.html#surprise.prediction_algorithms.knns.KNNBasic)|A basic collaborative filtering algorithm.|
# |[knns.KNNWithMeans](http://surprise.readthedocs.io/en/stable/knn_inspired.html#surprise.prediction_algorithms.knns.KNNWithMeans)|A basic collaborative filtering algorithm, taking into account the mean ratings of each user.|
# |[knns.KNNBaseline](http://surprise.readthedocs.io/en/stable/knn_inspired.html#surprise.prediction_algorithms.knns.KNNBaseline)|A basic collaborative filtering algorithm taking into account a baseline rating.| 
# |[matrix_factorization.SVD](http://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.SVD)|The famous SVD algorithm, as popularized by Simon Funk during the Netflix Prize.|
# |[matrix_factorization.SVDpp](http://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.SVDpp)|The SVD++ algorithm, an extension of SVD taking into account implicit ratings.|
# |[matrix_factorization.NMF](http://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.NMF)|A collaborative filtering algorithm based on Non-negative Matrix Factorization.|
# |[slope_one.SlopeOne](http://surprise.readthedocs.io/en/stable/slope_one.html#surprise.prediction_algorithms.slope_one.SlopeOne)|A simple yet accurate collaborative filtering algorithm.|
# |[co_clustering.CoClustering](http://surprise.readthedocs.io/en/stable/co_clustering.html#surprise.prediction_algorithms.co_clustering.CoClustering)|A collaborative filtering algorithm based on co-clustering.|

# ### 其中基于近邻的方法(协同过滤)可以设定不同的度量准则。
#
# | 相似度度量标准 | 度量标准说明  |
# | ------------- |:-----|
# |[cosine](http://surprise.readthedocs.io/en/stable/similarities.html#surprise.similarities.cosine)|Compute the cosine similarity between all pairs of users (or items).|
# |[msd](http://surprise.readthedocs.io/en/stable/similarities.html#surprise.similarities.msd)|Compute the Mean Squared Difference similarity between all pairs of users (or items).|
# |[pearson](http://surprise.readthedocs.io/en/stable/similarities.html#surprise.similarities.pearson)|Compute the Pearson correlation coefficient between all pairs of users (or items).|
# |[pearson_baseline](http://surprise.readthedocs.io/en/stable/similarities.html#surprise.similarities.pearson_baseline)|Compute the (shrunk) Pearson correlation coefficient between all pairs of users (or items) using baselines for centering instead of means.|

# ### Jaccard similarity
# 交集元素个数/并集元素个数

# ### 支持不同的评估准则
# | 评估准则 | 准则说明  |
# | ------------- |:-----|
# |[rmse](http://surprise.readthedocs.io/en/stable/accuracy.html#surprise.accuracy.rmse)|Compute RMSE (Root Mean Squared Error).|
# |[msd](http://surprise.readthedocs.io/en/stable/similarities.html#surprise.similarities.msd)|Compute MAE (Mean Absolute Error).|
# |[fcp](http://surprise.readthedocs.io/en/stable/accuracy.html#surprise.accuracy.fcp)|Compute FCP (Fraction of Concordant Pairs).|

# ### 使用示例

# #### 基本使用方法如下

# ```python
# # 可以使用上面提到的各种推荐系统算法
# from surprise import SVD
# from surprise import Dataset
# from surprise import evaluate, print_perf
#
# # 默认载入movielens数据集
# data = Dataset.load_builtin('ml-100k')
# # k折交叉验证(k=3)
# data.split(n_folds=3)
# # 试一把SVD矩阵分解
# algo = SVD()
# # 在数据集上测试一下效果
# perf = evaluate(algo, data, measures=['RMSE', 'MAE'])
# #输出结果
# print_perf(perf)
# ```

# #### 载入自己的数据集方法

# ```python
# # 指定文件所在路径
# file_path = os.path.expanduser('~/.surprise_data/ml-100k/ml-100k/u.data')
# # 告诉文本阅读器,文本的格式是怎么样的
# reader = Reader(line_format='user item rating timestamp', sep='\t')
# # 加载数据
# data = Dataset.load_from_file(file_path, reader=reader)
# # 手动切分成5折(方便交叉验证)
# data.split(n_folds=5)
# ```

# #### 算法调参(让推荐系统有更好的效果)

# 这里实现的算法用到的算法无外乎也是SGD等,因此也有一些超参数会影响最后的结果,我们同样可以用sklearn中常用到的网格搜索交叉验证(GridSearchCV)来选择最优的参数。简单的例子如下所示:

# ```python
# # 定义好需要优选的参数网格
# param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
#               'reg_all': [0.4, 0.6]}
# # 使用网格搜索交叉验证
# grid_search = GridSearch(SVD, param_grid, measures=['RMSE', 'FCP'])
# # 在数据集上找到最好的参数
# data = Dataset.load_builtin('ml-100k')
# data.split(n_folds=3)
# grid_search.evaluate(data)
# # 输出调优的参数组
# # 输出最好的RMSE结果
# print(grid_search.best_score['RMSE'])
# # >>> 0.96117566386
#
# # 输出对应最好的RMSE结果的参数
# print(grid_search.best_params['RMSE'])
# # >>> {'reg_all': 0.4, 'lr_all': 0.005, 'n_epochs': 10}
#
# # 最好的FCP得分
# print(grid_search.best_score['FCP'])
# # >>> 0.702279736531
#
# # 对应最高FCP得分的参数
# print(grid_search.best_params['FCP'])
# # >>> {'reg_all': 0.6, 'lr_all': 0.005, 'n_epochs': 10}
# ```

# ## 在我们的数据集上训练模型

# ## 建模和存储模型

# ### 1.用协同过滤构建模型并进行预测

# #### 1.1 movielens的例子

# In[4]:

# 可以使用上面提到的各种推荐系统算法
from surprise import KNNWithMeans
from surprise import Dataset
from surprise import evaluate, print_perf

# 默认载入movielens数据集
data = Dataset.load_builtin('ml-100k')
# k折交叉验证(k=3)
data.split(n_folds=3)
# 试一把SVD矩阵分解
algo = KNNWithMeans()
# 在数据集上测试一下效果
perf = evaluate(algo, data, measures=['RMSE', 'MAE'])
#输出结果
print_perf(perf)


# In[6]:

data.raw_ratings[1]


# In[7]:

"""
以下的程序段告诉大家如何在协同过滤算法建模以后,根据一个item取回相似度最高的item,主要是用到algo.get_neighbors()这个函数
"""

from __future__ import (absolute_import, division, print_function,
                        unicode_literals)
import os
import io

from surprise import KNNBaseline
from surprise import Dataset


def read_item_names():
    """
    获取电影名到电影id 和 电影id到电影名的映射
    """

    file_name = (os.path.expanduser('~') +
                 '/.surprise_data/ml-100k/ml-100k/u.item')
    rid_to_name = {}
    name_to_rid = {}
    with io.open(file_name, 'r', encoding='ISO-8859-1') as f:
        for line in f:
            line = line.split('|')
            rid_to_name[line[0]] = line[1]
            name_to_rid[line[1]] = line[0]

    return rid_to_name, name_to_rid


# 首先,用算法计算相互间的相似度
data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()
sim_options = {'name': 'pearson_baseline', 'user_based': False}
algo = KNNBaseline(sim_options=sim_options)
algo.train(trainset)


# In[8]:

# 获取电影名到电影id 和 电影id到电影名的映射
rid_to_name, name_to_rid = read_item_names()


# In[9]:

# 拿出来Toy Story这部电影对应的item id
toy_story_raw_id = name_to_rid['Toy Story (1995)']
toy_story_raw_id


# In[10]:

toy_story_inner_id = algo.trainset.to_inner_iid(toy_story_raw_id)
toy_story_inner_id


# In[11]:

# 找到最近的10个邻居
toy_story_neighbors = algo.get_neighbors(toy_story_inner_id, k=10)
toy_story_neighbors


# In[12]:

# 从近邻的id映射回电影名称
toy_story_neighbors = (algo.trainset.to_raw_iid(inner_id)
                       for inner_id in toy_story_neighbors)
toy_story_neighbors = (rid_to_name[rid]
                       for rid in toy_story_neighbors)

print()
print('The 10 nearest neighbors of Toy Story are:')
for movie in toy_story_neighbors:
    print(movie)


# In[ ]:

# 拿出来Toy Story这部电影对应的item id
toy_story_raw_id = name_to_rid['Toy Story (1995)']
toy_story_inner_id = algo.trainset.to_inner_iid(toy_story_raw_id)

# 找到最近的10个邻居
toy_story_neighbors = algo.get_neighbors(toy_story_inner_id, k=10)

# 从近邻的id映射回电影名称
toy_story_neighbors = (algo.trainset.to_raw_iid(inner_id)
                       for inner_id in toy_story_neighbors)
toy_story_neighbors = (rid_to_name[rid]
                       for rid in toy_story_neighbors)

print()
print('The 10 nearest neighbors of Toy Story are:')
for movie in toy_story_neighbors:
    print(movie)


# #### 1.2 音乐预测的例子

# In[13]:

from __future__ import (absolute_import, division, print_function, unicode_literals)
import os
import io

from surprise import KNNBaseline, Reader
from surprise import Dataset

import cPickle as pickle
# 重建歌单id到歌单名的映射字典
id_name_dic = pickle.load(open("popular_playlist.pkl","rb"))
print("加载歌单id到歌单名的映射字典完成...")
# 重建歌单名到歌单id的映射字典
name_id_dic = {}
for playlist_id in id_name_dic:
    name_id_dic[id_name_dic[playlist_id]] = playlist_id
print("加载歌单名到歌单id的映射字典完成...")


file_path = os.path.expanduser('./popular_music_suprise_format.txt')
# 指定文件格式
reader = Reader(line_format='user item rating timestamp', sep=',')
# 从文件读取数据
music_data = Dataset.load_from_file(file_path, reader=reader)
# 计算歌曲和歌曲之间的相似度
print("构建数据集...")
trainset = music_data.build_full_trainset()
#sim_options = {'name': 'pearson_baseline', 'user_based': False}


# In[15]:

id_name_dic.keys()[2]


# In[18]:

print(id_name_dic[id_name_dic.keys()[2]])


# In[19]:

trainset.n_items


# In[20]:

trainset.n_users


# #### 1.2.1 模板之查找最近的user(在这里是歌单)

# In[22]:

print("开始训练模型...")
#sim_options = {'user_based': False}
#algo = KNNBaseline(sim_options=sim_options)
algo = KNNBaseline()
algo.train(trainset)

current_playlist = name_id_dic.keys()[39]
print("歌单名称", current_playlist)

# 取出近邻
# 映射名字到id
playlist_id = name_id_dic[current_playlist]
print("歌单id", playlist_id)
# 取出来对应的内部user id => to_inner_uid
playlist_inner_id = algo.trainset.to_inner_uid(playlist_id)
print("内部id", playlist_inner_id)

playlist_neighbors = algo.get_neighbors(playlist_inner_id, k=10)

# 把歌曲id转成歌曲名字
# to_raw_uid映射回去
playlist_neighbors = (algo.trainset.to_raw_uid(inner_id)
                       for inner_id in playlist_neighbors)
playlist_neighbors = (id_name_dic[playlist_id]
                       for playlist_id in playlist_neighbors)

print()
print("和歌单 《", current_playlist, "》 最接近的10个歌单为:\n")
for playlist in playlist_neighbors:
    print(playlist, algo.trainset.to_inner_uid(name_id_dic[playlist]))


# #### 1.2.2 模板之针对用户进行预测

# In[7]:

import cPickle as pickle
# 重建歌曲id到歌曲名的映射字典
song_id_name_dic = pickle.load(open("popular_song.pkl","rb"))
print("加载歌曲id到歌曲名的映射字典完成...")
# 重建歌曲名到歌曲id的映射字典
song_name_id_dic = {}
for song_id in song_id_name_dic:
    song_name_id_dic[song_id_name_dic[song_id]] = song_id
print("加载歌曲名到歌曲id的映射字典完成...")


# In[9]:

#内部编码的4号用户
user_inner_id = 4
user_rating = trainset.ur[user_inner_id]
items = map(lambda x:x[0], user_rating)
for song in items:
    print(algo.predict(user_inner_id, song, r_ui=1), song_id_name_dic[algo.trainset.to_raw_iid(song)])


# ### 2.用矩阵分解进行预测

# In[10]:

### 使用NMF
from surprise import NMF, evaluate
from surprise import Dataset

file_path = os.path.expanduser('./popular_music_suprise_format.txt')
# 指定文件格式
reader = Reader(line_format='user item rating timestamp', sep=',')
# 从文件读取数据
music_data = Dataset.load_from_file(file_path, reader=reader)
# 构建数据集和建模
algo = NMF()
trainset = music_data.build_full_trainset()
algo.train(trainset)


# In[17]:

user_inner_id = 4
user_rating = trainset.ur[user_inner_id]
items = map(lambda x:x[0], user_rating)
for song in items:
    print(algo.predict(algo.trainset.to_raw_uid(user_inner_id), algo.trainset.to_raw_iid(song), r_ui=1), song_id_name_dic[algo.trainset.to_raw_iid(song)])


# ## 模型存储

# In[ ]:

import surprise
surprise.dump.dump('./recommendation.model', algo=algo)
# 可以用下面的方式载入
algo = surprise.dump.load('./recommendation.model')


# ## 不同的推荐系统算法评估

# ### 首先载入数据

# In[23]:

import os
from surprise import Reader, Dataset
# 指定文件路径
file_path = os.path.expanduser('./popular_music_suprise_format.txt')
# 指定文件格式
reader = Reader(line_format='user item rating timestamp', sep=',')
# 从文件读取数据
music_data = Dataset.load_from_file(file_path, reader=reader)
# 分成5折
music_data.split(n_folds=5)


# In[ ]:

music_data


# In[ ]:

music_data.raw_ratings[:20]


# In[ ]:

### 使用NormalPredictor
from surprise import NormalPredictor, evaluate
algo = NormalPredictor()
perf = evaluate(algo, music_data, measures=['RMSE', 'MAE'])


# In[ ]:

### 使用BaselineOnly
from surprise import BaselineOnly, evaluate
algo = BaselineOnly()
perf = evaluate(algo, music_data, measures=['RMSE', 'MAE'])


# In[ ]:

### 使用基础版协同过滤
from surprise import KNNBasic, evaluate
algo = KNNBasic()
perf = evaluate(algo, music_data, measures=['RMSE', 'MAE'])


# In[ ]:

### 使用均值协同过滤
from surprise import KNNWithMeans, evaluate
algo = KNNWithMeans()
perf = evaluate(algo, music_data, measures=['RMSE', 'MAE'])


# In[ ]:

### 使用协同过滤baseline
from surprise import KNNBaseline, evaluate
algo = KNNBaseline()
perf = evaluate(algo, music_data, measures=['RMSE', 'MAE'])


# In[ ]:

### 使用SVD
from surprise import SVD, evaluate
algo = SVD()
perf = evaluate(algo, music_data, measures=['RMSE', 'MAE'])


# In[ ]:

### 使用SVD++
from surprise import SVDpp, evaluate
algo = SVDpp()
perf = evaluate(algo, music_data, measures=['RMSE', 'MAE'])


# In[ ]:

### 使用NMF
from surprise import NMF
algo = NMF()
perf = evaluate(algo, music_data, measures=['RMSE', 'MAE'])
print_perf(perf)


  • 1
    点赞
  • 6
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值