Reference: http://scikit-learn.org/stable/modules/cross_validation.html
Overfitting is common, so a held-out test set is used to check the model's performance on data it has not seen. A concrete example:
>>> import numpy as np
>>> from sklearn import model_selection  # named cross_validation before scikit-learn 0.18
>>> from sklearn import datasets
>>> from sklearn import svm
>>> iris = datasets.load_iris()
>>> iris.data.shape, iris.target.shape
((150, 4), (150,))
>>> X_train, X_test, y_train, y_test = model_selection.train_test_split(
... iris.data, iris.target, test_size=0.4, random_state=0)  # holding out 40% of the data for testing
>>> X_train.shape, y_train.shape
((90, 4), (90,))
>>> X_test.shape, y_test.shape
((60, 4), (60,))
>>> clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
>>> clf.score(X_test, y_test)
0.96...
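A further problem: hyperparameters such as C=1 are chosen by hand, and if they are tweaked until the score on the test set looks best, knowledge of the test set leaks into the model and the reported score is again over-optimistic (overfitting). Hence the three-way split into training set, validation set, and test set: training proceeds on the training set, after which evaluation is done on the validation set, and when the experiment seems to be successful, final evaluation can be done on the test set.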
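A minimal sketch of the three-way split, assuming a roughly 60/20/20 partition of iris and a small hand-picked list of candidate C values (both are illustrative choices, not from the scikit-learn guide). C is selected on the validation set, and the test set is scored only once at the end:

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

# Hold out 20% as the final test set, then carve a validation set
# out of the remainder (25% of the remaining 80% = 20% of the total).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

# Candidate values are an assumption made for this illustration.
candidate_C = [0.01, 0.1, 1, 10]

# Choose C by looking at the validation score only.
val_scores = {}
for C in candidate_C:
    clf = svm.SVC(kernel='linear', C=C).fit(X_train, y_train)
    val_scores[C] = clf.score(X_val, y_val)
best_C = max(val_scores, key=val_scores.get)

# The test set is used exactly once, after C has been fixed.
final_clf = svm.SVC(kernel='linear', C=best_C).fit(X_train, y_train)
test_score = final_clf.score(X_test, y_test)
print(best_C, test_score)
```

Because the test set plays no part in choosing C, its score remains an honest estimate of generalization.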
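The three-way split has its own drawback: when data is scarce, setting aside a validation set leaves even less data for training. This motivates cross-validation (CV for short). In k-fold CV the training set is split into k smaller sets (folds), and for each of the k folds:
- A model is trained using k-1 of the folds as training data;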
- the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).
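The two steps above can be written out by hand. This is a sketch, assuming k=5 and shuffled folds (iris is ordered by class, so unshuffled folds would be unbalanced); it uses `KFold` from `sklearn.model_selection` and reports the mean of the per-fold accuracies as the overall performance measure:

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import KFold

X, y = datasets.load_iris(return_X_y=True)

k = 5  # number of folds, chosen for illustration
kf = KFold(n_splits=k, shuffle=True, random_state=0)

fold_scores = []
for train_idx, val_idx in kf.split(X):
    # Train on k-1 folds ...
    clf = svm.SVC(kernel='linear', C=1).fit(X[train_idx], y[train_idx])
    # ... and validate on the one fold that was left out.
    fold_scores.append(clf.score(X[val_idx], y[val_idx]))

# The reported performance is the average over the k folds.
mean_score = np.mean(fold_scores)
print(fold_scores, mean_score)
```

Every sample is used for validation exactly once and for training k-1 times, so no data is permanently set aside for validation.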
1. Computing cross-validated metrics
The simplest way to use CV is to call the cross_val_score helper function on the estimator and the dataset: