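This post covers the cross-validation functions in Sklearn:
- data splitting functions
- cross-validation functions
Some cross-validation functions are used together with hyperparameter tuning methods such as grid search; those are not covered here. Package versions:
- Sklearn 0.22
- Pandas 0.23.4
- Numpy 1.17.4
Data splitting
train_test_split
Randomly splits the data into two parts. First, look at the help: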
import pandas as pd
import numpy as np
from sklearn import svm
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
help(train_test_split)
Help on function train_test_split in module sklearn.model_selection._split:
train_test_split(*arrays, **options)
Split arrays or matrices into random train and test subsets
Quick utility that wraps input validation and
``next(ShuffleSplit().split(X, y))`` and application to input data
into a single call for splitting (and optionally subsampling) data in a
oneliner.
Read more in the :ref:`User Guide <cross_validation>`.
Parameters
----------
*arrays : sequence of indexables with same length / shape[0]
Allowed inputs are lists, numpy arrays, scipy-sparse
matrices or pandas dataframes.
test_size : float, int or None, optional (default=None)
If float, should be between 0.0 and 1.0 and represent the proportion
of the dataset to include in the test split. If int, represents the
absolute number of test samples. If None, the value is set to the
complement of the train size. If ``train_size`` is also None, it will
be set to 0.25.
train_size : float, int, or None, (default=None)
If float, should be between 0.0 and 1.0 and represent the
proportion of the dataset to include in the train split. If
int, represents the absolute number of train samples. If None,
the value is automatically set to the complement of the test size.
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator;
If RandomState instance, random_state is the random number generator;
If None, the random number generator is the RandomState instance used
by `np.random`.
shuffle : boolean, optional (default=True)
Whether or not to shuffle the data before splitting. If shuffle=False
then stratify must be None.
stratify : array-like or None (default=None)
If not None, data is split in a stratified fashion, using this as
the class labels.
Returns
-------
splitting : list, length=2 * len(arrays)
List containing train-test split of inputs.
.. versionadded:: 0.16
If the input is sparse, the output will be a
``scipy.sparse.csr_matrix``. Else, output type is the same as the
input type.
Examples
--------
>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X
array([[0, 1],
[2, 3],
[4, 5],
[6, 7],
[8, 9]])
>>> list(y)
[0, 1, 2, 3, 4]
>>> X_train, X_test, y_train, y_test = train_test_split(
... X, y, test_size=0.33, random_state=42)
...
>>> X_train
array([[4, 5],
[0, 1],
[6, 7]])
>>> y_train
[2, 0, 3]
>>> X_test
array([[2, 3],
[8, 9]])
>>> y_test
[1, 4]
>>> train_test_split(y, shuffle=False)
[[0, 1, 2], [3, 4]]
Build the test data
data = datasets.load_iris()
irisDf = pd.DataFrame(data=data.data, columns=data.feature_names)
# add the target column
irisDf["target"] = data.target
# recode the numeric target as species names
irisDf["species"] = irisDf.target.map({0:"Setosa",1:"Versicolour",2:"Virginica"})
# rename the columns
irisDf.columns = ["sepalLength","sepalWidth","petalLength","petalWidth","target","species"]
irisDf.head()
| | sepalLength | sepalWidth | petalLength | petalWidth | target | species |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 | Setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 | Setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 | Setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 | Setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 | Setosa |
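Brief notes on the parameters:
- test_size: a float is the proportion of the data to put in the test set; an int is the absolute number of test samples
- random_state: sets the random seed so that the split is reproducible
- stratify: array-like; when given, the split is stratified by these labels, usually so that the class proportions (for example the positive/negative ratio) stay the same in the train and test sets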
If not None, data is split in a stratified fashion, using this as the class labels.
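To make the test_size and random_state notes above concrete, here is a minimal sketch on made-up toy arrays (`X_demo` and `y_demo` are illustrative names, not part of the iris example). It only checks the size of the test part and that a fixed seed repeats the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, invented for illustration: 20 samples, 2 features.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# A float test_size is a proportion; an int test_size is an absolute count.
_, X_te_frac, _, _ = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)
_, X_te_abs, _, _ = train_test_split(X_demo, y_demo, test_size=3, random_state=0)
print(len(X_te_frac), len(X_te_abs))

# The same random_state yields the same split on every call.
_, _, _, y_te_a = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)
_, _, _, y_te_b = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)
print(np.array_equal(y_te_a, y_te_b))
```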
X = irisDf.iloc[:, [0, 1, 2, 3]]
y = irisDf.target
X_train, X_test, y_train, y_test = train_test_split(X,
y,
test_size=0.3,
random_state=123)
print(" X_train.shape: ", X_train.shape, "\n", "X_test.shape: ", X_test.shape)
print("Class distribution in the train set:")
print(y_train.value_counts())
print("Class distribution in the test set:")
print(y_test.value_counts())
X_train.shape: (105, 4)
X_test.shape: (45, 4)
Class distribution in the train set:
1 40
2 33
0 32
Name: target, dtype: int64
Class distribution in the test set:
0 18
2 17
1 10
Name: target, dtype: int64
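Here the class distribution of the split no longer matches the full data set (50 samples per class); this can be corrected with stratify: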
X = irisDf.iloc[:, [0, 1, 2, 3]]
y = irisDf.target
X_train, X_test, y_train, y_test = train_test_split(X,
y,
test_size=0.3,
random_state=123,
stratify=y)
print(" X_train.shape: ", X_train.shape, "\n", "X_test.shape: ", X_test.shape)
print("Class distribution in the train set:")
print(y_train.value_counts())
print("Class distribution in the test set:")
print(y_test.value_counts())
X_train.shape: (105, 4)
X_test.shape: (45, 4)
Class distribution in the train set:
2 35
1 35
0 35
Name: target, dtype: int64
Class distribution in the test set:
2 15
1 15
0 15
Name: target, dtype: int64
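Iris is perfectly balanced, so the effect of stratify is easier to see on skewed labels. The sketch below assumes a hypothetical label vector with 90 negatives and 10 positives (`X_imb`, `y_imb` are made up for illustration) and checks that both parts keep the 10% positive rate.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 negatives and 10 positives.
y_imb = np.array([0] * 90 + [1] * 10)
X_imb = np.arange(100).reshape(100, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_imb, y_imb, test_size=0.2, random_state=0, stratify=y_imb)

# With stratify, the positive counts are split in the same 80/20 ratio
# as the data itself, so both parts keep the overall positive rate.
print("positives in train:", y_tr.sum(), "of", len(y_tr))
print("positives in test: ", y_te.sum(), "of", len(y_te))
```

Without stratify the number of positives in the test part would vary with the seed, which matters most when the minority class is small.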
Cross-validation evaluation
cross_val_score
Again, start with the documentation:
help(cross_val_score)
Help on function cross_val_score in module sklearn.model_selection._validation:
cross_val_score(estimator, X, y=None, groups=None, scoring=None, cv=None, n_jobs=None, verbose=0, fit_params=None, pre_dispatch='2*n_jobs', error_score=nan)
Evaluate a score by cross-validation
Read more in the :ref:`User Guide <cross_validation>`.
Parameters
----------
estimator : estimator object implementing 'fit'
The object to use to fit the data.
X : array-like
The data to fit. Can be for example a list, or an array.
y : array-like, optional, default: None
The target variable to try to predict in the case of
supervised learning.
groups : array-like, with shape (n_samples,), optional
Group labels for the samples used while splitting the dataset into
train/test set. Only used in conjunction with a "Group" :term:`cv`
instance (e.g., :class:`GroupKFold`).
scoring : string, callable or None, optional, default: None
A string (see model evaluation documentation) or
a scorer callable object / function with signature
``scorer(estimator, X, y)`` which should return only
a single value.
Similar to :func:`cross_validate`
but only a single metric is permitted.
If None, the estimator's default scorer (if available) is used.
cv : int, cross-validation generator or an iterable, optional
Determines the cross-validation splitting strategy.
Possible inputs for cv are:
- None, to use the default 5-fold cross validation,
- integer, to specify the number of folds in a `(Stratified)KFold`,
- :term:`CV splitter`,
- An iterable yielding (train, test) splits as arrays of indices.
For integer/None inputs, if the estimator is a classifier and ``y`` is
either binary or multiclass, :class:`StratifiedKFold` is used. In all
other cases, :class:`KFold` is used.
Refer :ref:`User Guide <cross_validation>` for the various
cross-validation strategies that can be used here.
.. versionchanged:: 0.22
``cv`` default value if None changed from 3-fold to 5-fold.
n_jobs : int or None, optional (default=None)
The number of CPUs to use to do the computation.
``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
``-1`` means using all processors. See :term:`Glossary <n_jobs>`
for more details.
verbose : integer, optional
The verbosity level.
fit_params : dict, optional
Parameters to pass to the fit method of the estimator.
pre_dispatch : int, or string, optional
Controls the number of jobs that get dispatched during parallel
execution. Reducing this number can be useful to avoid an
explosion of memory consumption when more jobs get dispatched
than CPUs can process. This parameter can be:
- None, in which case all the jobs are immediately
created and spawned. Use this for lightweight and
fast-running jobs, to avoid delays due to on-demand
spawning of the jobs
- An int, giving the exact number of total jobs that are
spawned
- A string, giving an expression as a function of n_jobs,
as in '2*n_jobs'
error_score : 'raise' or numeric
Value to assign to the score if an error occurs in estimator fitting.
If set to 'raise', the error is raised.
If a numeric value is given, FitFailedWarning is raised. This parameter
does not affect the refit step, which will always raise the error.
Returns
-------
scores : array of float, shape=(len(list(cv)),)
Array of scores of the estimator for each run of the cross validation.
Examples
--------
>>> from sklearn import datasets, linear_model
>>> from sklearn.model_selection import cross_val_score
>>> diabetes = datasets.load_diabetes()
>>> X = diabetes.data[:150]
>>> y = diabetes.target[:150]
>>> lasso = linear_model.Lasso()
>>> print(cross_val_score(lasso, X, y, cv=3))
[0.33150734 0.08022311 0.03531764]
See Also
---------
:func:`sklearn.model_selection.cross_validate`:
To run cross-validation on multiple metrics and also to return
train scores, fit times and score times.
:func:`sklearn.model_selection.cross_val_predict`:
Get predictions from each split of cross-validation for diagnostic
purposes.
:func:`sklearn.metrics.make_scorer`:
Make a scorer from a performance metric or loss function.
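Main parameters:
- X, y: the feature data and the target variable
- scoring: the model evaluation metric, e.g. "roc_auc", "accuracy", "f1"
- cv: 5-fold cross-validation by default; an int means k-fold; a cross-validation generator can also be passed
  - for a classifier with cv given as an int (or None), the data is split with StratifiedKFold
  - in all other cases KFold is used
- groups: group labels for the samples, only used together with a "Group" splitter such as GroupKFold
A few open questions here:
- How is reproducibility guaranteed? The function does not seem to take a random-seed parameter
- Could the seed be set by passing in a cross-validation generator?
- From the source code: StratifiedKFold and KFold are instantiated internally, so no random seed is specified there; both default to shuffle=False, so an integer cv does not shuffle the data in the first place
- Other people's code sometimes sets a global random seed, which might work too??
- First things first; I'll revisit this question when there is time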
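One way to probe the reproducibility question is sketched below; it is a small check under stated assumptions, not a full answer. It loads iris again under the illustrative names `X_iris`, `y_iris` and `clf_demo`, compares two runs with an integer cv, and then two runs where shuffling is switched on and the seed is fixed on the StratifiedKFold splitter itself.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import StratifiedKFold, cross_val_score

X_iris, y_iris = datasets.load_iris(return_X_y=True)
clf_demo = svm.SVC(kernel="linear", C=1)

# Integer cv: the folds are built without shuffling, so two runs agree.
s1 = cross_val_score(clf_demo, X_iris, y_iris, cv=5)
s2 = cross_val_score(clf_demo, X_iris, y_iris, cv=5)
print(np.allclose(s1, s2))

# Shuffled folds: the seed goes on the splitter, not on cross_val_score.
cv_a = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_b = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
s3 = cross_val_score(clf_demo, X_iris, y_iris, cv=cv_a)
s4 = cross_val_score(clf_demo, X_iris, y_iris, cv=cv_b)
print(np.allclose(s3, s4))
```

Estimators that are themselves random (a random forest, for instance) would additionally need their own random_state fixed.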
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, X, y, cv=5,scoring="f1_macro")
scores
array([0.96658312, 1. , 0.96658312, 0.96658312, 1. ])
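For intuition about what the call above does, the following sketch repeats it by hand: split with StratifiedKFold, fit a fresh clone on each training part, and score the held-out part with macro F1. The names `X_iris`, `y_iris`, `clf_demo` and `manual_scores` are illustrative, and the loop is a simplified picture of the procedure rather than the library's actual implementation.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.base import clone
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

X_iris, y_iris = datasets.load_iris(return_X_y=True)
clf_demo = svm.SVC(kernel="linear", C=1)

manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X_iris, y_iris):
    # Fit an unfitted copy on the training folds only.
    model = clone(clf_demo).fit(X_iris[train_idx], y_iris[train_idx])
    pred = model.predict(X_iris[test_idx])
    # Score on the held-out fold.
    manual_scores.append(f1_score(y_iris[test_idx], pred, average="macro"))

auto_scores = cross_val_score(clf_demo, X_iris, y_iris, cv=5, scoring="f1_macro")
print(np.allclose(manual_scores, auto_scores))
```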
cross_validate
How this function differs from cross_val_score:
- It allows specifying multiple metrics for evaluation.
- It returns a dict containing fit-times, score-times (and optionally training scores as well as fitted estimators) in addition to the test score.
from sklearn.model_selection import cross_validate
scoring = ['precision_macro', 'recall_macro']
clf = svm.SVC(kernel='linear', C=1, random_state=0)
scores = cross_validate(clf, X, y, scoring=scoring)
scores
{'fit_time': array([0.00199509, 0.0009973 , 0.00200558, 0.00099754, 0.0009973 ]),
'score_time': array([0.00099778, 0.00099826, 0.0019834 , 0.00199461, 0.00199437]),
'test_precision_macro': array([0.96969697, 1. , 0.96969697, 0.96969697, 1. ]),
'test_recall_macro': array([0.96666667, 1. , 0.96666667, 0.96666667, 1. ])}
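The optional training scores and fitted estimators mentioned above are switched on with return_train_score and return_estimator. A minimal sketch, again on iris with the illustrative names `X_iris`, `y_iris` and `clf_demo`:

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_validate

X_iris, y_iris = datasets.load_iris(return_X_y=True)
clf_demo = svm.SVC(kernel="linear", C=1)

res = cross_validate(clf_demo, X_iris, y_iris, cv=5,
                     scoring=["precision_macro", "recall_macro"],
                     return_train_score=True,   # adds train_<metric> entries
                     return_estimator=True)     # adds the fitted models

# One array (or list) per key, with one entry per fold.
print(sorted(res.keys()))
print(len(res["estimator"]))
```

Comparing the train_ and test_ entries of the same metric is a quick way to spot overfitting on individual folds.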
2020-03-01, Qixia District, Nanjing