python surprise

The surprise library



References

Surprise documentation
GitHub — Surprise

Installation

With pip (you’ll need numpy, and a C compiler. Windows users might prefer using conda):

$ pip install numpy
$ pip install scikit-surprise

With conda:

$ conda install -c conda-forge scikit-surprise

For the latest version, you can also clone the repo and build the source (you’ll first need Cython and numpy):

$ pip install numpy cython
$ git clone https://github.com/NicolasHug/surprise.git
$ cd surprise
$ python setup.py install

Prediction algorithms

  1. All algorithms derive from the AlgoBase base class, which implements the key methods (e.g. predict, fit and test).
  2. Every algorithm is part of the global Surprise namespace, so you only need to import their names from the surprise package.

prediction_algorithms

Class name: description

random_pred.NormalPredictor: Algorithm predicting a random rating based on the distribution of the training set, which is assumed to be normal.
baseline_only.BaselineOnly: Algorithm predicting the baseline estimate for a given user and item.
knns.KNNBasic: A basic collaborative filtering algorithm.
knns.KNNWithMeans: A basic collaborative filtering algorithm, taking into account the mean ratings of each user.
knns.KNNBaseline: A basic collaborative filtering algorithm taking into account a baseline rating.
matrix_factorization.SVD: The famous SVD algorithm, as popularized by Simon Funk during the Netflix Prize.
matrix_factorization.SVDpp: The SVD++ algorithm, an extension of SVD taking into account implicit ratings.
matrix_factorization.NMF: A collaborative filtering algorithm based on non-negative matrix factorization.
slope_one.SlopeOne: A simple yet accurate collaborative filtering algorithm.
co_clustering.CoClustering: A collaborative filtering algorithm based on co-clustering.

The algorithm base class

The surprise.prediction_algorithms.algo_base module defines the base class AlgoBase from which every single prediction algorithm has to inherit.

  1. fit(trainset)

    Train an algorithm on a given training set.
    This method is called by every derived class as the first basic step for training an algorithm. It basically just initializes some internal structures and sets the self.trainset attribute.
    Parameters: trainset (Trainset) – A training set, as returned by the folds method.
    Returns: self

Baseline estimates configuration

Algorithms that minimize a squared-error objective (including the baseline method and the similarity measures) need their parameters configured. Different parameters lead to different performance, and different algorithms use different baselines, so the configurations differ as well.
The default baseline parameters already give reasonable performance.
Note that some similarity measures rely on baselines: whether or not the actual prediction algorithm uses a baseline, the related parameters must then be configured.
For the details of the parameters, see the paper:
Factor in the Neighbors: Scalable and Accurate Collaborative Filtering

Tuning algorithm parameters with GridSearchCV

The cross_validate() function reports the accuracy measures of a cross-validation procedure for a given set of parameters. If you want to know which parameter combination yields the best results, the GridSearchCV class solves the problem. Given a dict of parameters, this class exhaustively tries all combinations and reports the best parameters for any accuracy measure (averaged over the different splits). It is inspired by scikit-learn's GridSearchCV.

Similarity measure configuration

Many algorithms use a similarity measure to estimate a rating. The way they can be configured is done in a similar fashion as for baseline ratings: you just need to pass a sim_options argument at the creation of an algorithm. This argument is a dictionary with the following (all optional) keys:

The similarity measure is configured through the sim_options dict:
1. name: the name of the similarity measure; the options are defined in the similarities module, with MSD as the default.
2. user_based: whether similarities are computed between users or between items. This choice has a huge impact on the performance of a prediction algorithm. Default: True.
3. min_support: the minimum number of common users (or common items) required for the similarity not to be zero.
4. shrinkage: the shrinkage parameter, only relevant for the pearson_baseline similarity.

Trainset class

It is used by the fit() method of every prediction algorithm. You should not try to build such an object on your own but rather use the Dataset.folds() method or the DatasetAutoFolds.build_full_trainset() method.

Source code notes

The Reader class

def __init__(self, name=None, line_format='user item rating',
             sep=None, rating_scale=(1, 5), skip_lines=0):

Sets up the reader's format; ratings are automatically translated into the rating_scale range via self.offset:
self.offset = -lower_bound + 1 if lower_bound <= 0 else 0

def parse_line(self, line):
    '''Parse a line.

        Ratings are translated so that they are all strictly positive.

        Args:
            line(str): The line to parse

        Returns:
            tuple: User id, item id, rating and timestamp. The timestamp is set to ``None`` if it does not exist.
    '''

Parses one line and returns the fields in the required format.

The Dataset class

def build_full_trainset(self):
    """Do not split the dataset into folds and just return a trainset as
    is, built from the whole dataset.

    User can then query for predictions, as shown in the :ref:`User Guide
    <train_on_whole_trainset>`.

    Returns:
        The :class:`Trainset <surprise.Trainset>`.
    """

Builds a trainset from the whole dataset, without splitting it into folds.

def construct_trainset(self, raw_trainset):

Builds the mapping from `raw_id` to `inner_id`, and computes:
    ur: the `user -> ratings` dict
    ir: the `item -> ratings` dict
    n_users: the number of users
    n_items: the number of items
    n_ratings: the number of ratings

It then builds the trainset.

The Trainset class

The KNNBaseline class

  1. KNNBaseline
    [figure: KNNBaseline.png]
    Collaborative filtering that takes a baseline rating into account.
    Works best with the pearson_baseline similarity.
    For background on what the baseline method does, see the article
    "Implementation and Analysis of Collaborative Filtering Algorithms in Recommender Systems" (PDF).

The KNNBasic class

[figure: KNNBasic.png]

def fit(self, trainset):
    """
    Computes the similarity matrix.
    See the referenced literature for the derivation of the formula.
    """

def estimate(self, u, i):
    """
    Estimates user u's rating of item i:
    find the k nearest neighbors of u that have rated item i,
    then compute the prediction with the corresponding rating formula.
    Returns: est, details({'actual_k': actual_k})
    """

Example

In this library, ratings are predicted for all the items a user has not yet rated.
It is not well suited for Top-N recommendation.

split: data splitting

The splitting operations in this module all work on the whole set of ratings; they suit rating-prediction applications rather than Top-N recommendation (to be verified).

The ShuffleSplit class

Randomly shuffles all the training data and yields a number of train/test splits.

train_test_split

Splits the data into a trainset and a testset.

Copyright notice: this is the author's original post and may not be reproduced without permission. https://blog.csdn.net/makingLJ/article/details/80331246