Python Machine Learning reading notes
Notes (6): Model Evaluation and Hyperparameter Tuning
一、Pipelines: chaining transformers and an estimator
(Figure: pipeline workflow diagram)
The code is as follows:
# The intermediate steps in a pipeline must be scikit-learn transformers;
# the last step is an estimator.
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

pipe_lr = make_pipeline(StandardScaler(),
                        PCA(n_components=2),
                        LogisticRegression(random_state=1, solver='lbfgs'))
# fit() calls StandardScaler's fit_transform on the training data,
# passes the standardized data on to PCA (same pattern), and finally
# fits LogisticRegression on the transformed data.
pipe_lr.fit(X_train, y_train)
# predict() pushes X_test through the fitted transformers,
# then calls LogisticRegression's predict.
y_pred = pipe_lr.predict(X_test)
print('Test Accuracy: %.3f' % pipe_lr.score(X_test, y_test))
二、Holdout cross-validation and K-fold cross-validation
1. Holdout cross-validation
Method:
Split the data into three parts: a training set, a validation set, and a test set. The training set is used to fit different models, the validation set is used for model selection, and the test set is used to obtain a less biased estimate of the model's ability to generalize to new data.
Drawback:
The performance estimate is sensitive to how the dataset is partitioned into the three subsets; it is not robust enough (compared with the k-fold cross-validation introduced below).
(Figure: holdout cross-validation diagram)
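A minimal sketch of the three-way split using two calls to scikit-learn's train_test_split (the toy data, split ratios, and variable names here are illustrative assumptions, not the book's code):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 4 features, balanced binary labels (hypothetical).
X = np.random.RandomState(0).randn(100, 4)
y = np.array([0, 1] * 50)

# First split off the test set (20% of the data)...
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)
# ...then split the remainder into training (60% overall)
# and validation (20% overall) sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=1)

print(X_train.shape[0], X_val.shape[0], X_test.shape[0])  # 60 20 20
```

Because the second split works on the remaining 80%, test_size=0.25 there yields 20% of the original data as validation.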
2. K-fold cross-validation
Method:
Randomly split the dataset into k folds without replacement, use k-1 folds for training and the remaining fold for testing, and repeat this procedure k times to obtain k model estimates. Averaging them gives a performance estimate that depends less on any particular partitioning of the data than holdout cross-validation does.
(Figure: k-fold cross-validation diagram)
Advantage:
Every sample appears in the test fold exactly once (and in the training folds k-1 times). Compared with the holdout method, this reduces the risk of an unlucky split and yields a lower-variance performance estimate.
Notes:
1. K-fold CV is typically used for hyperparameter tuning; once tuning is done, the model is refit on the complete training set to obtain the final performance estimate.
2. A common default is k=10. For smaller datasets, it can help to increase k: more training data is used in each iteration, which lowers the bias of the estimate (at the cost of more computation and possibly higher variance). For larger datasets, a smaller k (e.g. k=5) still gives an accurate estimate while reducing the computational cost.
# For extremely small datasets, the extreme case k=n is called leave-one-out (LOO) cross-validation: each test fold contains exactly one sample.
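The workflow above can be sketched with cross_val_score, which runs the k fit/score iterations in one call; the breast-cancer dataset bundled with scikit-learn is used here as stand-in data, and the pipeline settings are illustrative assumptions. For classifiers, passing an integer cv uses stratified k-fold splitting automatically, and swapping in LeaveOneOut gives the k = n extreme:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, LeaveOneOut

X, y = load_breast_cancer(return_X_y=True)
pipe = make_pipeline(StandardScaler(),
                     LogisticRegression(random_state=1, solver='lbfgs'))

# cv=10: stratified 10-fold CV is used automatically for classifiers.
scores = cross_val_score(pipe, X, y, cv=10)
print('10-fold CV accuracy: %.3f +/- %.3f' % (scores.mean(), scores.std()))

# The k = n extreme (leave-one-out); restricted to 100 samples to keep it fast.
loo_scores = cross_val_score(pipe, X[:100], y[:100], cv=LeaveOneOut())
print('LOO iterations: %d, mean accuracy: %.3f'
      % (len(loo_scores), loo_scores.mean()))
```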
3. Stratified k-fold
Method:
In stratified cross-validation, the class proportions are preserved in each fold to ensure that each fold is representative of the class proportions in the training dataset. (Each fold keeps the same class ratio as the whole dataset, hence "stratified"; this is reminiscent of the stratify parameter of train_test_split().)
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Note: random_state has no effect unless shuffle=True is also set;
# recent scikit-learn versions raise an error for random_state without shuffle.
kfold = StratifiedKFold(n_splits=10).split(X_train, y_train)
scores = []
for k, (train, test) in enumerate(kfold):
    pipe_lr.fit(X_train[train], y_train[train])
    score = pipe_lr.score(X_train[test], y_train[test])
    scores.append(score)
    print('Fold: %2d, Class dist.: %s, Acc: %.3f' %
          (k+1, np.bincount(y_train[train]), score))
print('\nCV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))
The output is as follows:
Fold: 1, Class dist.: [256 153], Acc: 0.935
Fold: 2, Class dist.: [256 153], Acc: 0.935
Fold: 3, Class dist.: [256 153], Acc: 0.957
Fold: 4, Class dist.: [256 153], Acc: 0.957
Fold: 5, Class dist.: [256 153], Acc: 0.935
Fold: 6, Class dist.: [257 153], Acc: 0.956
Fold: 7, Class dist.: [257 153], Acc: 0.978
Fold: 8, Class dist.: [