Cross-Validation: Python Applications

维格堂406小队

Article link: https://blog.csdn.net/wendaomudong_l2d4/article/details/105079574

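This post introduces the cross-validation functions in sklearn:

Data-splitting functions
Cross-validation functions

Some cross-validation functions are used together with tuning methods such as grid search; those are not covered here. Package versions used:

sklearn 0.22
pandas 0.23.4
numpy 1.17.4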

Data splitting

train_test_split

Randomly splits the data into two parts. Start with the help output:

import pandas as pd
import numpy as np  
from sklearn import svm
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

help(train_test_split)

Help on function train_test_split in module sklearn.model_selection._split:

train_test_split(*arrays, **options)
    Split arrays or matrices into random train and test subsets
    
    Quick utility that wraps input validation and
    ``next(ShuffleSplit().split(X, y))`` and application to input data
    into a single call for splitting (and optionally subsampling) data in a
    oneliner.
    
    Read more in the :ref:`User Guide <cross_validation>`.
    
    Parameters
    ----------
    *arrays : sequence of indexables with same length / shape[0]
        Allowed inputs are lists, numpy arrays, scipy-sparse
        matrices or pandas dataframes.
    
    test_size : float, int or None, optional (default=None)
        If float, should be between 0.0 and 1.0 and represent the proportion
        of the dataset to include in the test split. If int, represents the
        absolute number of test samples. If None, the value is set to the
        complement of the train size. If ``train_size`` is also None, it will
        be set to 0.25.
    
    train_size : float, int, or None, (default=None)
        If float, should be between 0.0 and 1.0 and represent the
        proportion of the dataset to include in the train split. If
        int, represents the absolute number of train samples. If None,
        the value is automatically set to the complement of the test size.
    
    random_state : int, RandomState instance or None, optional (default=None)
        If int, random_state is the seed used by the random number generator;
        If RandomState instance, random_state is the random number generator;
        If None, the random number generator is the RandomState instance used
        by `np.random`.
    
    shuffle : boolean, optional (default=True)
        Whether or not to shuffle the data before splitting. If shuffle=False
        then stratify must be None.
    
    stratify : array-like or None (default=None)
        If not None, data is split in a stratified fashion, using this as
        the class labels.
    
    Returns
    -------
    splitting : list, length=2 * len(arrays)
        List containing train-test split of inputs.
    
        .. versionadded:: 0.16
            If the input is sparse, the output will be a
            ``scipy.sparse.csr_matrix``. Else, output type is the same as the
            input type.
    
    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.model_selection import train_test_split
    >>> X, y = np.arange(10).reshape((5, 2)), range(5)
    >>> X
    array([[0, 1],
           [2, 3],
           [4, 5],
           [6, 7],
           [8, 9]])
    >>> list(y)
    [0, 1, 2, 3, 4]
    
    >>> X_train, X_test, y_train, y_test = train_test_split(
    ...     X, y, test_size=0.33, random_state=42)
    ...
    >>> X_train
    array([[4, 5],
           [0, 1],
           [6, 7]])
    >>> y_train
    [2, 0, 3]
    >>> X_test
    array([[2, 3],
           [8, 9]])
    >>> y_test
    [1, 4]
    
    >>> train_test_split(y, shuffle=False)
    [[0, 1, 2], [3, 4]]

Build the test data

data = datasets.load_iris()
irisDf = pd.DataFrame(data=data.data, columns=data.feature_names)
# Add the target column
irisDf["target"] = data.target
# Map the numeric target to species names
irisDf["species"] = irisDf.target.map({0: "Setosa", 1: "Versicolour", 2: "Virginica"})
# Rename the columns
irisDf.columns = ["sepalLength", "sepalWidth", "petalLength", "petalWidth", "target", "species"]
irisDf.head()

   sepalLength  sepalWidth  petalLength  petalWidth  target species
0          5.1         3.5          1.4         0.2       0  Setosa
1          4.9         3.0          1.4         0.2       0  Setosa
2          4.7         3.2          1.3         0.2       0  Setosa
3          4.6         3.1          1.5         0.2       0  Setosa
4          5.0         3.6          1.4         0.2       0  Setosa

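Brief notes on the parameters:

test_size: a float is the proportion of samples placed in the test split; an int is the absolute number of test samples
random_state: sets the random seed so that the split is reproducible
stratify: array-like; the split is stratified according to this array, usually to keep the class ratio (e.g. positives to negatives) the same in both splits. From the documentation: "If not None, data is split in a stratified fashion, using this as the class labels."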
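As a quick illustration of the test_size and random_state notes above, here is a minimal sketch on a toy array. The names X_demo and y_demo are made up for this example and are not part of the original post.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 12 samples, 2 features
X_demo = np.arange(24).reshape((12, 2))
y_demo = np.arange(12)

# Float test_size: proportion of samples that go to the test split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

# Int test_size: absolute number of test samples
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=4, random_state=0)

# Same random_state, same split
_, _, _, y_te_again = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

print(len(X_tr), len(X_te))    # sizes for test_size=0.25
print(len(X_tr2), len(X_te2))  # sizes for test_size=4
print(np.array_equal(y_te, y_te_again))
```

With 12 samples, test_size=0.25 gives 3 test samples and test_size=4 gives 4, and repeating the call with the same seed returns the same test rows.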

X = irisDf.iloc[:, [0, 1, 2, 3]]
y = irisDf.target
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.3,
                                                    random_state=123)
print(" X_train.shape: ", X_train.shape, "\n", "X_test.shape: ", X_test.shape)
print("Class distribution in the training set:")
print(y_train.value_counts())
print("Class distribution in the test set:")
print(y_test.value_counts())

 X_train.shape:  (105, 4) 
 X_test.shape:  (45, 4)
Class distribution in the training set:
1    40
2    33
0    32
Name: target, dtype: int64
Class distribution in the test set:
0    18
2    17
1    10
Name: target, dtype: int64

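Here the class distribution in the two splits does not match that of the full dataset, where the three classes are equally frequent. Passing stratify corrects this: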

X = irisDf.iloc[:, [0, 1, 2, 3]]
y = irisDf.target
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.3,
                                                    random_state=123,
                                                    stratify=y)
print(" X_train.shape: ", X_train.shape, "\n", "X_test.shape: ", X_test.shape)
print("Class distribution in the training set:")
print(y_train.value_counts())
print("Class distribution in the test set:")
print(y_test.value_counts())

 X_train.shape:  (105, 4) 
 X_test.shape:  (45, 4)
Class distribution in the training set:
2    35
1    35
0    35
Name: target, dtype: int64
Class distribution in the test set:
2    15
1    15
0    15
Name: target, dtype: int64
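One way to check the effect of stratify is to compare class proportions rather than raw counts. The sketch below reloads iris so that it runs on its own; the variable names (X_all, y_all, proportions) are illustrative and not from the original post.

```python
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_all = pd.DataFrame(iris.data, columns=iris.feature_names)
y_all = pd.Series(iris.target, name="target")

# Stratified split: class proportions follow y_all
_, _, y_tr_s, y_te_s = train_test_split(
    X_all, y_all, test_size=0.3, random_state=123, stratify=y_all)

# Side-by-side class proportions for the full data and both splits
proportions = pd.DataFrame({
    "full": y_all.value_counts(normalize=True).sort_index(),
    "train": y_tr_s.value_counts(normalize=True).sort_index(),
    "test": y_te_s.value_counts(normalize=True).sort_index(),
})
print(proportions)
```

Each column should show one third per class, matching the 35/35/35 and 15/15/15 counts above.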

Cross-validation evaluation

cross_val_score

Again, start with the documentation:

help(cross_val_score)

Help on function cross_val_score in module sklearn.model_selection._validation:

cross_val_score(estimator, X, y=None, groups=None, scoring=None, cv=None, n_jobs=None, verbose=0, fit_params=None, pre_dispatch='2*n_jobs', error_score=nan)
    Evaluate a score by cross-validation
    
    Read more in the :ref:`User Guide <cross_validation>`.
    
    Parameters
    ----------
    estimator : estimator object implementing 'fit'
        The object to use to fit the data.
    
    X : array-like
        The data to fit. Can be for example a list, or an array.
    
    y : array-like, optional, default: None
        The target variable to try to predict in the case of
        supervised learning.
    
    groups : array-like, with shape (n_samples,), optional
        Group labels for the samples used while splitting the dataset into
        train/test set. Only used in conjunction with a "Group" :term:`cv`
        instance (e.g., :class:`GroupKFold`).
    
    scoring : string, callable or None, optional, default: None
        A string (see model evaluation documentation) or
        a scorer callable object / function with signature
        ``scorer(estimator, X, y)`` which should return only
        a single value.
    
        Similar to :func:`cross_validate`
        but only a single metric is permitted.
    
        If None, the estimator's default scorer (if available) is used.
    
    cv : int, cross-validation generator or an iterable, optional
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:
    
        - None, to use the default 5-fold cross validation,
        - integer, to specify the number of folds in a `(Stratified)KFold`,
        - :term:`CV splitter`,
        - An iterable yielding (train, test) splits as arrays of indices.
    
        For integer/None inputs, if the estimator is a classifier and ``y`` is
        either binary or multiclass, :class:`StratifiedKFold` is used. In all
        other cases, :class:`KFold` is used.
    
        Refer :ref:`User Guide <cross_validation>` for the various
        cross-validation strategies that can be used here.
    
        .. versionchanged:: 0.22
            ``cv`` default value if None changed from 3-fold to 5-fold.
    
    n_jobs : int or None, optional (default=None)
        The number of CPUs to use to do the computation.
        ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
        ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
        for more details.
    
    verbose : integer, optional
        The verbosity level.
    
    fit_params : dict, optional
        Parameters to pass to the fit method of the estimator.
    
    pre_dispatch : int, or string, optional
        Controls the number of jobs that get dispatched during parallel
        execution. Reducing this number can be useful to avoid an
        explosion of memory consumption when more jobs get dispatched
        than CPUs can process. This parameter can be:
    
            - None, in which case all the jobs are immediately
              created and spawned. Use this for lightweight and
              fast-running jobs, to avoid delays due to on-demand
              spawning of the jobs
    
            - An int, giving the exact number of total jobs that are
              spawned
    
            - A string, giving an expression as a function of n_jobs,
              as in '2*n_jobs'
    
    error_score : 'raise' or numeric
        Value to assign to the score if an error occurs in estimator fitting.
        If set to 'raise', the error is raised.
        If a numeric value is given, FitFailedWarning is raised. This parameter
        does not affect the refit step, which will always raise the error.
    
    Returns
    -------
    scores : array of float, shape=(len(list(cv)),)
        Array of scores of the estimator for each run of the cross validation.
    
    Examples
    --------
    >>> from sklearn import datasets, linear_model
    >>> from sklearn.model_selection import cross_val_score
    >>> diabetes = datasets.load_diabetes()
    >>> X = diabetes.data[:150]
    >>> y = diabetes.target[:150]
    >>> lasso = linear_model.Lasso()
    >>> print(cross_val_score(lasso, X, y, cv=3))
    [0.33150734 0.08022311 0.03531764]
    
    See Also
    ---------
    :func:`sklearn.model_selection.cross_validate`:
        To run cross-validation on multiple metrics and also to return
        train scores, fit times and score times.
    
    :func:`sklearn.model_selection.cross_val_predict`:
        Get predictions from each split of cross-validation for diagnostic
        purposes.
    
    :func:`sklearn.metrics.make_scorer`:
        Make a scorer from a performance metric or loss function.

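Main parameters:

X, y: the feature data and the target variable
scoring: the evaluation metric, e.g. "roc_auc", "accuracy", "f1"
- see https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter for the full list
cv: defaults to 5-fold cross-validation; an int gives k folds; a CV splitter can also be passed
- for a classifier with a binary or multiclass y, an int (or None) cv splits the data with StratifiedKFold
- in all other cases KFold is used
groups: group labels for the samples; only used together with a "Group" CV splitter such as GroupKFold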
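The groups parameter is easiest to see with a small example. The sketch below is only illustrative: iris has no natural grouping, so the group labels are invented (15 artificial groups of 10 samples each) purely to show how groups and GroupKFold fit together.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import GroupKFold, cross_val_score

X_g, y_g = datasets.load_iris(return_X_y=True)

# Hypothetical group labels: sample i belongs to group i % 15
groups = np.arange(len(y_g)) % 15

group_cv = GroupKFold(n_splits=5)
clf_g = svm.SVC(kernel="linear", C=1)

# groups is forwarded to the splitter, so samples from the same group
# never end up on both sides of a train/test split
scores_g = cross_val_score(clf_g, X_g, y_g, groups=groups, cv=group_cv)
print(scores_g)

# Verify the no-overlap property directly on the splits
for train_idx, test_idx in group_cv.split(X_g, y_g, groups):
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```

In real data the groups would be something like a patient or user ID, where leaking one group's samples into both train and test would inflate the score.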

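One thing looked unclear at first:

How is reproducibility guaranteed? cross_val_score has no random-seed parameter.
From the source: when cv is an int, StratifiedKFold or KFold is created internally with its defaults, and the default is shuffle=False, so the folds are deterministic and no seed is needed.
To get shuffled folds that are still reproducible, pass a CV splitter built with shuffle=True and a fixed random_state as cv.
A global seed such as np.random.seed, which I have seen in other people's code, only matters where random_state is left as None; setting random_state explicitly is the more dependable option.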
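A minimal sketch of the reproducibility question, assuming a linear SVC on iris as in the rest of this post (the variable names are illustrative): an integer cv uses unshuffled folds and is therefore repeatable, and a splitter built with shuffle=True and a fixed random_state is repeatable as well.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import StratifiedKFold, cross_val_score

X_r, y_r = datasets.load_iris(return_X_y=True)
clf_r = svm.SVC(kernel="linear", C=1)

# Integer cv: folds are built without shuffling, so two runs agree
run_a = cross_val_score(clf_r, X_r, y_r, cv=5)
run_b = cross_val_score(clf_r, X_r, y_r, cv=5)

# Shuffled folds with a fixed seed: also repeatable across runs
cv_shuffled = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
run_c = cross_val_score(clf_r, X_r, y_r, cv=cv_shuffled)
run_d = cross_val_score(clf_r, X_r, y_r, cv=cv_shuffled)

print(np.allclose(run_a, run_b), np.allclose(run_c, run_d))
```

Both comparisons should print True. The shuffled scores will generally differ from the unshuffled ones, because the folds contain different samples.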

# Linear-kernel SVM, scored with macro-averaged F1 over 5 folds
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, X, y, cv=5, scoring="f1_macro")
scores

array([0.96658312, 1.        , 0.96658312, 0.96658312, 1.        ])
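The per-fold scores are usually condensed into a mean and a spread. The sketch below reloads iris so it runs on its own; reporting the standard deviation next to the mean is a common convention, not something the original post prescribes.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

X_s, y_s = datasets.load_iris(return_X_y=True)
clf_s = svm.SVC(kernel="linear", C=1)

# One score per fold
scores_s = cross_val_score(clf_s, X_s, y_s, cv=5, scoring="f1_macro")

# Summarize the folds as mean and standard deviation
print("f1_macro: %0.3f (+/- %0.3f)" % (scores_s.mean(), scores_s.std()))
```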

cross_validate

How this function differs from cross_val_score:

It allows specifying multiple metrics for evaluation.
It returns a dict containing fit-times, score-times (and optionally training scores as well as fitted estimators) in addition to the test score.

from sklearn.model_selection import cross_validate
scoring = ['precision_macro', 'recall_macro']
clf = svm.SVC(kernel='linear', C=1, random_state=0)
scores = cross_validate(clf, X, y, scoring=scoring)
scores

{'fit_time': array([0.00199509, 0.0009973 , 0.00200558, 0.00099754, 0.0009973 ]),
 'score_time': array([0.00099778, 0.00099826, 0.0019834 , 0.00199461, 0.00199437]),
 'test_precision_macro': array([0.96969697, 1.        , 0.96969697, 0.96969697, 1.        ]),
 'test_recall_macro': array([0.96666667, 1.        , 0.96666667, 0.96666667, 1.        ])}
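The "optionally training scores as well as fitted estimators" part can be sketched as follows, using the return_train_score and return_estimator arguments of cross_validate. The data loading and variable names are illustrative and chosen so the block runs on its own.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_validate

X_c, y_c = datasets.load_iris(return_X_y=True)
clf_c = svm.SVC(kernel="linear", C=1)

results = cross_validate(
    clf_c, X_c, y_c,
    scoring=["precision_macro", "recall_macro"],
    cv=5,
    return_train_score=True,   # adds train_<metric> entries
    return_estimator=True,     # adds the estimator fitted on each fold
)

print(sorted(results.keys()))
print(len(results["estimator"]))
```

Comparing the train_ and test_ entries for the same metric is a simple way to spot overfitting; the fitted estimators can be inspected fold by fold if needed.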


2020-03-01, Qixia District, Nanjing
