# The surprise library

### Installation

With pip (you’ll need numpy and a C compiler; Windows users might prefer using conda):

    $ pip install numpy
    $ pip install scikit-surprise

With conda:

    $ conda install -c conda-forge scikit-surprise

For the latest version, you can also clone the repo and build the source (you’ll first need Cython and numpy):

    $ pip install numpy cython
    $ git clone https://github.com/NicolasHug/surprise.git
    $ cd surprise
    $ python setup.py install

### Prediction algorithms

1. All algorithms derive from the `AlgoBase` base class, which implements some key methods (e.g. `predict`, `fit` and `test`).
2. Every algorithm is part of the global Surprise namespace, so you only need to import its name from the `surprise` package.

### The prediction_algorithms package

| Class | Description |
| --- | --- |
| `random_pred.NormalPredictor` | Predicts a random rating based on the distribution of the training set, which is assumed to be normal. |
| `baseline_only.BaselineOnly` | Predicts the baseline estimate for a given user and item. |
| `knns.KNNBasic` | A basic collaborative filtering algorithm. |
| `knns.KNNWithMeans` | A basic collaborative filtering algorithm, taking into account the mean ratings of each user. |
| `knns.KNNBaseline` | A basic collaborative filtering algorithm taking into account a baseline rating. |
| `matrix_factorization.SVD` | The famous SVD algorithm, as popularized by Simon Funk during the Netflix Prize. |
| `matrix_factorization.SVDpp` | The SVD++ algorithm, an extension of SVD taking into account implicit ratings. |
| `matrix_factorization.NMF` | A collaborative filtering algorithm based on non-negative matrix factorization. |
| `slope_one.SlopeOne` | A simple yet accurate collaborative filtering algorithm. |
| `co_clustering.CoClustering` | A collaborative filtering algorithm based on co-clustering. |
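As an illustration of the simplest entry above, the idea behind `NormalPredictor` can be sketched in plain Python (a toy re-implementation for intuition, not the library's code): fit a normal distribution to the training ratings, then sample predictions from it, clipped to the rating scale.

```python
import random
import statistics

def fit_normal(ratings):
    """Estimate (mu, sigma) of the training ratings, which
    NormalPredictor assumes come from a normal distribution."""
    mu = statistics.fmean(ratings)
    sigma = statistics.pstdev(ratings)
    return mu, sigma

def predict_random(mu, sigma, rating_scale=(1, 5)):
    """Draw a rating from N(mu, sigma), clipped to the rating scale."""
    lo, hi = rating_scale
    return min(hi, max(lo, random.gauss(mu, sigma)))

mu, sigma = fit_normal([1, 2, 3, 4, 5, 3, 3])
est = predict_random(mu, sigma)  # always within [1, 5]
```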

### The algorithm base class

The surprise.prediction_algorithms.algo_base module defines the base class AlgoBase from which every single prediction algorithm has to inherit.

1. `fit(trainset)`

Train an algorithm on a given training set.

This method is called by every derived class as the first basic step for training an algorithm. It basically just initializes some internal structures and sets the `self.trainset` attribute.

Parameters: `trainset` (`Trainset`) – a training set, as returned by the folds method.
Returns: `self`
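The base-class contract described above (store the trainset, return `self` so calls can be chained) can be sketched in plain Python, independent of the library; `MyAlgo` and `MeanAlgo` are hypothetical names for illustration.

```python
class MyAlgo:
    """Sketch of the AlgoBase.fit() contract: store the trainset
    and return self so calls can be chained, e.g. algo.fit(ts).test(...)."""

    def fit(self, trainset):
        self.trainset = trainset   # the only thing the base step does
        return self                # enables chaining

class MeanAlgo(MyAlgo):
    """A toy derived algorithm: always predicts the global mean rating."""

    def fit(self, trainset):
        super().fit(trainset)                    # base step first
        ratings = [r for (_, _, r) in trainset]  # toy trainset: (u, i, r) triples
        self.global_mean = sum(ratings) / len(ratings)
        return self

algo = MeanAlgo().fit([(1, 'a', 4.0), (1, 'b', 2.0), (2, 'a', 3.0)])
print(algo.global_mean)  # 3.0
```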

### Baseline estimates configuration

For the theory behind baseline estimates, see Koren, "Factor in the Neighbors: Scalable and Accurate Collaborative Filtering".

### Tuning algorithm parameters with GridSearchCV

The `cross_validate()` function reports accuracy metrics from a cross-validation procedure for a given set of parameters. If you want to know which parameter combination yields the best results, the `GridSearchCV` class does the job. Given a dict of parameters, this class exhaustively tries every combination and reports the best parameters for any accuracy measure (averaged over the different splits). It is inspired by scikit-learn's `GridSearchCV`.
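The exhaustive search described above can be sketched generically in plain Python: enumerate every combination of the parameter dict (Cartesian product) and keep the best. Here a made-up score function stands in for the cross-validated accuracy measure.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Try every combination of the parameter grid and return the
    best score and the parameters that achieved it. score_fn stands
    in for a cross-validated measure where lower is better (e.g. RMSE
    averaged over the splits)."""
    keys = list(param_grid)
    best_score, best_params = float('inf'), None
    for values in product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = score_fn(params)
        if score < best_score:
            best_score, best_params = score, params
    return best_score, best_params

# Toy "RMSE", minimized at n_epochs=10, lr_all=0.005:
grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005]}
score, params = grid_search(
    grid, lambda p: abs(p['n_epochs'] - 10) + abs(p['lr_all'] - 0.005))
print(score, params)  # 0 {'n_epochs': 10, 'lr_all': 0.005}
```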

### Similarity measure configuration

Many algorithms use a similarity measure to estimate a rating. The way they can be configured is done in a similar fashion as for baseline ratings: you just need to pass a sim_options argument at the creation of an algorithm. This argument is a dictionary with the following (all optional) keys:

1. `name`: the name of the similarity measure to use. Defaults to `'MSD'`; the available measures are defined in the similarities module.
2. `user_based`: whether similarities are computed between users or between items. This choice has a huge impact on the performance of a prediction algorithm. Default: `True`.
3. `min_support`: the minimum number of common users (or common items) required for the similarity not to be zero.
4. `shrinkage`: shrinkage parameter to apply; only relevant for the `pearson_baseline` similarity.
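Putting the four keys together, a `sim_options` dict for item-based similarities might look like this (the values are illustrative, not recommendations):

```python
# Passed at algorithm creation, e.g. KNNBaseline(sim_options=sim_options).
sim_options = {
    'name': 'pearson_baseline',  # similarity measure to use
    'user_based': False,         # compute similarities between items
    'min_support': 3,            # min common items for a non-zero similarity
    'shrinkage': 100,            # only relevant for pearson_baseline
}
```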

### Trainset class

It is used by the `fit()` method of every prediction algorithm. You should not try to build such an object on your own, but rather use the `Dataset.folds()` method or the `DatasetAutoFolds.build_full_trainset()` method.

## Source code notes

Notes on the `Reader` class:

    def __init__(self, name=None, line_format='user item rating',
                 sep=None, rating_scale=(1, 5), skip_lines=0):
        ...
        # Ratings are shifted so that they are all strictly positive:
        self.offset = -lower_bound + 1 if lower_bound <= 0 else 0

    def parse_line(self, line):
        '''Parse a line.

        Ratings are translated so that they are all strictly positive.

        Args:
            line(str): The line to parse.

        Returns:
            tuple: User id, item id, rating and timestamp. The timestamp
            is set to None if it does not exist.
        '''
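The offset logic can be sketched in plain Python (a toy re-implementation for intuition: fixed field order, no timestamp handling): a scale whose lower bound is 0 or below is shifted so every rating becomes strictly positive.

```python
def rating_offset(rating_scale):
    """Offset added to every rating so they are strictly positive."""
    lower_bound, _ = rating_scale
    return -lower_bound + 1 if lower_bound <= 0 else 0

def parse_line(line, sep=None, offset=0):
    """Parse a 'user item rating' line; returns (uid, iid, rating)."""
    uid, iid, r = line.strip().split(sep)[:3]
    return uid, iid, float(r) + offset

offset = rating_offset((0, 5))                 # scale starts at 0 -> offset 1
print(parse_line('196 242 3', offset=offset))  # ('196', '242', 4.0)
```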


### The Dataset class

    def build_full_trainset(self):
        """Do not split the dataset into folds and just return a trainset as
        is, built from the whole dataset.

        User can then query for predictions, as shown in the :ref:`User Guide
        <train_on_whole_trainset>`.

        Returns:
            The :class:`Trainset <surprise.Trainset>`.
        """

    def construct_trainset(self, raw_trainset):

Internal structures it builds:

- `ur` – user-ratings dict: user → list of (item, rating)
- `ir` – item-ratings dict: item → list of (user, rating)
- `n_users` – number of users
- `n_items` – number of items
- `n_ratings` – number of rating records
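Those structures can be sketched by building them from raw (user, item, rating) triples. This toy version uses the raw ids directly, whereas the real method first maps them to inner ids.

```python
from collections import defaultdict

def construct_structures(raw_ratings):
    """Build the ur/ir dicts and the counters from raw triples."""
    ur, ir = defaultdict(list), defaultdict(list)
    for u, i, r in raw_ratings:
        ur[u].append((i, r))   # ratings given by each user
        ir[i].append((u, r))   # ratings received by each item
    return {'ur': dict(ur), 'ir': dict(ir),
            'n_users': len(ur), 'n_items': len(ir),
            'n_ratings': len(raw_ratings)}

stats = construct_structures([(1, 'a', 4.0), (1, 'b', 2.0), (2, 'a', 5.0)])
print(stats['n_users'], stats['n_items'], stats['n_ratings'])  # 2 2 3
```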

### The KNNBaseline class

1. KNNBaseline

Collaborative filtering that takes a baseline rating into account.
It works best with the `pearson_baseline` similarity measure.
For background on the baseline approach, see the article "推荐系统的协同过滤算法实现和浅析" (pdf).

### The KNNBasic class

    def fit(self, trainset):
        """Compute the similarity matrix.

        See the related literature for the derivation of the formula.
        """

    def estimate(self, u, i):
        """Estimate user u's rating of item i.

        Find the k nearest neighbors of u among the users who rated
        item i, then compute the prediction with the corresponding
        weighted-rating formula.

        Returns: est, details({'actual_k': actual_k})
        """