Cross-Validation: Python Applications

  This post mainly covers the cross-validation functions in Sklearn:

  • Data splitting functions
  • Cross-validation functions

  Some cross-validation functions are meant to be used together with hyperparameter-tuning methods such as grid search; those are not covered here. Package versions:

  • Sklearn 0.22
  • Pandas 0.23.4
  • Numpy 1.17.4

Data Splitting

train_test_split

Randomly splits the data into two parts. First, look at the help:

import pandas as pd
import numpy as np  
from sklearn import svm
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
help(train_test_split)
Help on function train_test_split in module sklearn.model_selection._split:

train_test_split(*arrays, **options)
    Split arrays or matrices into random train and test subsets
    
    Quick utility that wraps input validation and
    ``next(ShuffleSplit().split(X, y))`` and application to input data
    into a single call for splitting (and optionally subsampling) data in a
    oneliner.
    
    Read more in the :ref:`User Guide <cross_validation>`.
    
    Parameters
    ----------
    *arrays : sequence of indexables with same length / shape[0]
        Allowed inputs are lists, numpy arrays, scipy-sparse
        matrices or pandas dataframes.
    
    test_size : float, int or None, optional (default=None)
        If float, should be between 0.0 and 1.0 and represent the proportion
        of the dataset to include in the test split. If int, represents the
        absolute number of test samples. If None, the value is set to the
        complement of the train size. If ``train_size`` is also None, it will
        be set to 0.25.
    
    train_size : float, int, or None, (default=None)
        If float, should be between 0.0 and 1.0 and represent the
        proportion of the dataset to include in the train split. If
        int, represents the absolute number of train samples. If None,
        the value is automatically set to the complement of the test size.
    
    random_state : int, RandomState instance or None, optional (default=None)
        If int, random_state is the seed used by the random number generator;
        If RandomState instance, random_state is the random number generator;
        If None, the random number generator is the RandomState instance used
        by `np.random`.
    
    shuffle : boolean, optional (default=True)
        Whether or not to shuffle the data before splitting. If shuffle=False
        then stratify must be None.
    
    stratify : array-like or None (default=None)
        If not None, data is split in a stratified fashion, using this as
        the class labels.
    
    Returns
    -------
    splitting : list, length=2 * len(arrays)
        List containing train-test split of inputs.
    
        .. versionadded:: 0.16
            If the input is sparse, the output will be a
            ``scipy.sparse.csr_matrix``. Else, output type is the same as the
            input type.
    
    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.model_selection import train_test_split
    >>> X, y = np.arange(10).reshape((5, 2)), range(5)
    >>> X
    array([[0, 1],
           [2, 3],
           [4, 5],
           [6, 7],
           [8, 9]])
    >>> list(y)
    [0, 1, 2, 3, 4]
    
    >>> X_train, X_test, y_train, y_test = train_test_split(
    ...     X, y, test_size=0.33, random_state=42)
    ...
    >>> X_train
    array([[4, 5],
           [0, 1],
           [6, 7]])
    >>> y_train
    [2, 0, 3]
    >>> X_test
    array([[2, 3],
           [8, 9]])
    >>> y_test
    [1, 4]
    
    >>> train_test_split(y, shuffle=False)
    [[0, 1, 2], [3, 4]]

Constructing Test Data

data = datasets.load_iris()
irisDf = pd.DataFrame(data=data.data, columns=data.feature_names)
# add the target labels
irisDf["target"] = data.target
# recode the numeric target as species names
irisDf["species"] = irisDf.target.map({0: "Setosa", 1: "Versicolour", 2: "Virginica"})
# rename the columns
irisDf.columns = ["sepalLength", "sepalWidth", "petalLength", "petalWidth", "target", "species"]
irisDf.head()
   sepalLength  sepalWidth  petalLength  petalWidth  target species
0          5.1         3.5          1.4         0.2       0  Setosa
1          4.9         3.0          1.4         0.2       0  Setosa
2          4.7         3.2          1.3         0.2       0  Setosa
3          4.6         3.1          1.5         0.2       0  Setosa
4          5.0         3.6          1.4         0.2       0  Setosa

Brief parameter notes:

  • test_size: a float gives the proportion of the split; an int gives the absolute number of samples
  • random_state: sets the random seed so that the result is reproducible
  • stratify: array-like; the split is stratified by this array, usually to keep the class ratio consistent after splitting
        If not None, data is split in a stratified fashion, using this as the class labels.
X = irisDf.iloc[:, [0, 1, 2, 3]]
y = irisDf.target
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.3,
                                                    random_state=123)
print(" X_train.shape: ", X_train.shape, "\n", "X_test.shape: ", X_test.shape)
print("trainset的样本分布:")
print(y_train.value_counts())
print("testset的样本分布:")
print(y_test.value_counts())
 X_train.shape:  (105, 4) 
 X_test.shape:  (45, 4)
trainset的样本分布:
1    40
2    33
0    32
Name: target, dtype: int64
testset的样本分布:
0    18
2    17
1    10
Name: target, dtype: int64

Here the class distribution of the splits is inconsistent with the overall data; this can be fixed with stratify:

X = irisDf.iloc[:, [0, 1, 2, 3]]
y = irisDf.target
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.3,
                                                    random_state=123,
                                                    stratify=y)
print(" X_train.shape: ", X_train.shape, "\n", "X_test.shape: ", X_test.shape)
print("trainset的样本分布:")
print(y_train.value_counts())
print("testset的样本分布:")
print(y_test.value_counts())
 X_train.shape:  (105, 4) 
 X_test.shape:  (45, 4)
trainset的样本分布:
2    35
1    35
0    35
Name: target, dtype: int64
testset的样本分布:
2    15
1    15
0    15
Name: target, dtype: int64
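
As a quick check that random_state makes the split reproducible, a minimal sketch (reusing the X and y defined above): splitting twice with the same seed selects exactly the same rows.

# Two splits with the same random_state select identical rows
a1, b1, c1, d1 = train_test_split(X, y, test_size=0.3, random_state=123)
a2, b2, c2, d2 = train_test_split(X, y, test_size=0.3, random_state=123)
print(a1.index.equals(a2.index))  # True: identical train rows
print(b1.index.equals(b2.index))  # True: identical test rows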

Cross-Validation Evaluation

cross_val_score

Again, start with the documentation:

help(cross_val_score)
Help on function cross_val_score in module sklearn.model_selection._validation:

cross_val_score(estimator, X, y=None, groups=None, scoring=None, cv=None, n_jobs=None, verbose=0, fit_params=None, pre_dispatch='2*n_jobs', error_score=nan)
    Evaluate a score by cross-validation
    
    Read more in the :ref:`User Guide <cross_validation>`.
    
    Parameters
    ----------
    estimator : estimator object implementing 'fit'
        The object to use to fit the data.
    
    X : array-like
        The data to fit. Can be for example a list, or an array.
    
    y : array-like, optional, default: None
        The target variable to try to predict in the case of
        supervised learning.
    
    groups : array-like, with shape (n_samples,), optional
        Group labels for the samples used while splitting the dataset into
        train/test set. Only used in conjunction with a "Group" :term:`cv`
        instance (e.g., :class:`GroupKFold`).
    
    scoring : string, callable or None, optional, default: None
        A string (see model evaluation documentation) or
        a scorer callable object / function with signature
        ``scorer(estimator, X, y)`` which should return only
        a single value.
    
        Similar to :func:`cross_validate`
        but only a single metric is permitted.
    
        If None, the estimator's default scorer (if available) is used.
    
    cv : int, cross-validation generator or an iterable, optional
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:
    
        - None, to use the default 5-fold cross validation,
        - integer, to specify the number of folds in a `(Stratified)KFold`,
        - :term:`CV splitter`,
        - An iterable yielding (train, test) splits as arrays of indices.
    
        For integer/None inputs, if the estimator is a classifier and ``y`` is
        either binary or multiclass, :class:`StratifiedKFold` is used. In all
        other cases, :class:`KFold` is used.
    
        Refer :ref:`User Guide <cross_validation>` for the various
        cross-validation strategies that can be used here.
    
        .. versionchanged:: 0.22
            ``cv`` default value if None changed from 3-fold to 5-fold.
    
    n_jobs : int or None, optional (default=None)
        The number of CPUs to use to do the computation.
        ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
        ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
        for more details.
    
    verbose : integer, optional
        The verbosity level.
    
    fit_params : dict, optional
        Parameters to pass to the fit method of the estimator.
    
    pre_dispatch : int, or string, optional
        Controls the number of jobs that get dispatched during parallel
        execution. Reducing this number can be useful to avoid an
        explosion of memory consumption when more jobs get dispatched
        than CPUs can process. This parameter can be:
    
            - None, in which case all the jobs are immediately
              created and spawned. Use this for lightweight and
              fast-running jobs, to avoid delays due to on-demand
              spawning of the jobs
    
            - An int, giving the exact number of total jobs that are
              spawned
    
            - A string, giving an expression as a function of n_jobs,
              as in '2*n_jobs'
    
    error_score : 'raise' or numeric
        Value to assign to the score if an error occurs in estimator fitting.
        If set to 'raise', the error is raised.
        If a numeric value is given, FitFailedWarning is raised. This parameter
        does not affect the refit step, which will always raise the error.
    
    Returns
    -------
    scores : array of float, shape=(len(list(cv)),)
        Array of scores of the estimator for each run of the cross validation.
    
    Examples
    --------
    >>> from sklearn import datasets, linear_model
    >>> from sklearn.model_selection import cross_val_score
    >>> diabetes = datasets.load_diabetes()
    >>> X = diabetes.data[:150]
    >>> y = diabetes.target[:150]
    >>> lasso = linear_model.Lasso()
    >>> print(cross_val_score(lasso, X, y, cv=3))
    [0.33150734 0.08022311 0.03531764]
    
    See Also
    ---------
    :func:`sklearn.model_selection.cross_validate`:
        To run cross-validation on multiple metrics and also to return
        train scores, fit times and score times.
    
    :func:`sklearn.model_selection.cross_val_predict`:
        Get predictions from each split of cross-validation for diagnostic
        purposes.
    
    :func:`sklearn.metrics.make_scorer`:
        Make a scorer from a performance metric or loss function.

Main parameters: see the docstring above. A few things here seem problematic:

  1. How do you guarantee reproducibility? There does not seem to be a random-seed parameter.
  2. Can the seed be specified by passing in a cv generator? (See the sketch after the example below.)
  3. Looking at the source: StratifiedKFold and KFold are instantiated internally, so no random seed is specified there.
  4. I have seen other people's code set a global random seed; maybe that would work??
  5. Focus on the big picture for now; I will review this question when I have time.
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, X, y, cv=5, scoring="f1_macro")
scores
array([0.96658312, 1.        , 0.96658312, 0.96658312, 1.        ])
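
Regarding question 2 above: cross_val_score accepts a CV splitter object as cv, so the shuffling can be seeded on the splitter itself. A minimal sketch (reusing the clf, X, and y defined above):

from sklearn.model_selection import StratifiedKFold

# Seeding the splitter makes the fold assignment reproducible across runs
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="f1_macro")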

cross_validate

Differences between this function and cross_val_score:

  • It allows specifying multiple metrics for evaluation.
  • It returns a dict containing fit-times, score-times (and optionally training scores as well as fitted estimators) in addition to the test score.
from sklearn.model_selection import cross_validate
scoring = ['precision_macro', 'recall_macro']
clf = svm.SVC(kernel='linear', C=1, random_state=0)
scores = cross_validate(clf, X, y, scoring=scoring)
scores
{'fit_time': array([0.00199509, 0.0009973 , 0.00200558, 0.00099754, 0.0009973 ]),
 'score_time': array([0.00099778, 0.00099826, 0.0019834 , 0.00199461, 0.00199437]),
 'test_precision_macro': array([0.96969697, 1.        , 0.96969697, 0.96969697, 1.        ]),
 'test_recall_macro': array([0.96666667, 1.        , 0.96666667, 0.96666667, 1.        ])}
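
The optional training scores and fitted estimators mentioned above are requested with the return_train_score and return_estimator flags. A minimal sketch (reusing the clf, X, y, and scoring defined above):

res = cross_validate(clf, X, y, scoring=scoring,
                     return_train_score=True,  # adds train_precision_macro / train_recall_macro
                     return_estimator=True)    # adds the fitted estimator from each fold
print(sorted(res.keys()))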


                                2020-03-01, Qixia District, Nanjing
