Model selection: choosing estimators and their parameters
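I needed to help a friend with an assignment and had never studied this before, so I worked through the official tutorial and took notes for later reference.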
Score, and cross-validated scores
- Every estimator exposes a score method that rates how well the model fits or predicts on the data you pass in. Bigger is better.
from sklearn import datasets, svm
X_digits, y_digits = datasets.load_digits(return_X_y=True)
svc = svm.SVC(C=1, kernel='linear')
svc.fit(X_digits[:-100], y_digits[:-100]).score(X_digits[-100:], y_digits[-100:])
>>>0.98
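To make the meaning of that number concrete: for classifiers, score reports the mean accuracy of predict on the data you pass in. The sketch below checks this against accuracy_score on the same split; the variable names score and manual are only illustrative.

```python
from sklearn import datasets, svm
from sklearn.metrics import accuracy_score

X_digits, y_digits = datasets.load_digits(return_X_y=True)

# Same split as above: train on everything except the last 100 samples.
svc = svm.SVC(C=1, kernel='linear')
svc.fit(X_digits[:-100], y_digits[:-100])

# For a classifier, score() is the fraction of correct predictions.
score = svc.score(X_digits[-100:], y_digits[-100:])
manual = accuracy_score(y_digits[-100:], svc.predict(X_digits[-100:]))
print(score, manual)
```

Regressors behave differently: their score method returns the R² coefficient instead of accuracy.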
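- A score from one hand-picked training/validation split does not necessarily tell you how good the model is: it may simply suit that particular split and look worse with a different one. So we cut the data into several folds and score the model on each of them in turn.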
import numpy as np
X_folds = np.array_split(X_digits, 3)
y_folds = np.array_split(y_digits, 3)
scores = list()
for k in range(3):
    # We use 'list' to copy, in order to 'pop' later on
    X_train = list(X_folds)
    X_test = X_train.pop(k)
    X_train = np.concatenate(X_train)
    y_train = list(y_folds)
    y_test = y_train.pop(k)
    y_train = np.concatenate(y_train)
    scores.append(svc.fit(X_train, y_train).score(X_test, y_test))
print(scores)
>>>[0.934..., 0.956..., 0.939...]
This is called KFold cross-validation: the data is cut into k folds and each fold serves once as the test set.
Cross-validation generators
- The cross-validation generators in sklearn all expose a split method, which generates the indices of the training and test samples for each iteration.
from sklearn.model_selection import KFold, cross_val_score
X = ["a", "a", "a", "b", "b", "c", "c", "c", "c", "c"]
k_fold = KFold(n_splits=5)
for train_indices, test_indices in k_fold.split(X):
    print('Train: %s | test: %s' % (train_indices, test_indices))
Train: [2 3 4 5 6 7 8 9] | test: [0 1]
Train: [0 1 4 5 6 7 8 9] | test: [2 3]
Train: [0 1 2 3 6 7 8 9] | test: [4 5]
Train: [0 1 2 3 4 5 8 9] | test: [6 7]
Train: [0 1 2 3 4 5 6 7] | test: [8 9]
With those indices, the cross-validated scores are easy to compute:
[svc.fit(X_digits[train], y_digits[train]).score(X_digits[test], y_digits[test])
 for train, test in k_fold.split(X_digits)]
[0.963..., 0.922..., 0.963..., 0.963..., 0.930...]
Of course, you can also compute them directly with cross_val_score:
cross_val_score(svc, X_digits, y_digits, cv=k_fold, n_jobs=-1)
>>> array([0.96388889, 0.92222222, 0.9637883 , 0.9637883 , 0.93036212])
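Here n_jobs=-1 means the computation is dispatched on all available CPUs.
For more model evaluation tools, look in the metrics module.
The scoring method can also be picked by name through the scoring parameter, since the scorers come prepackaged: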
cross_val_score(svc, X_digits, y_digits, cv=k_fold,
scoring='precision_macro')
>>> array([0.96578289, 0.92708922, 0.96681476, 0.96362897, 0.93192644])
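If more than one metric is wanted from the same folds, cross_validate (a sibling of cross_val_score in sklearn.model_selection) accepts a list of scorer names. A minimal sketch, reusing the estimator and folds from above; the choice of the two metrics is only an example.

```python
from sklearn import datasets, svm
from sklearn.model_selection import KFold, cross_validate

X_digits, y_digits = datasets.load_digits(return_X_y=True)
svc = svm.SVC(C=1, kernel='linear')
k_fold = KFold(n_splits=5)

# cross_validate returns a dict of arrays, one per metric,
# keyed as 'test_<scorer name>' (plus fit and score timings).
results = cross_validate(svc, X_digits, y_digits, cv=k_fold,
                         scoring=['accuracy', 'precision_macro'])
for name in ('test_accuracy', 'test_precision_macro'):
    print(name, results[name].mean())
```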
The figure below shows that many more cross-validation generators are available to play with.
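As a rough illustration of how some of these generators differ, the sketch below runs KFold, StratifiedKFold and ShuffleSplit on a made-up toy dataset (12 samples sorted by label, an assumption chosen only to make the difference visible) and prints the labels that end up in each test fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, ShuffleSplit

# Hypothetical toy data: 12 samples, sorted by class (8 of class 0, 4 of class 1).
X = np.arange(12).reshape(-1, 1)
y = np.array([0] * 8 + [1] * 4)

splitters = {
    'KFold': KFold(n_splits=3),
    'StratifiedKFold': StratifiedKFold(n_splits=3),
    'ShuffleSplit': ShuffleSplit(n_splits=3, test_size=0.25, random_state=0),
}

# Collect the test indices of every fold so they can be inspected afterwards.
test_folds = {}
for name, cv in splitters.items():
    test_folds[name] = [test for _, test in cv.split(X, y)]
    print(name)
    for test in test_folds[name]:
        print('  test indices:', test, '| test labels:', y[test])
```

On sorted data, plain KFold puts a single class in each test fold, while StratifiedKFold keeps both classes in every fold. That is why stratification is the usual choice for classification.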
There is also a small exercise script to play with:
print(__doc__)
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn import datasets, svm
X, y = datasets.load_digits(return_X_y=True)
svc = svm.SVC(kernel='linear')
C_s = np.logspace(-10, 0, 10)
scores = list()
scores_std = list()
for C in C_s:
    svc.C = C
    this_scores = cross_val_score(svc, X, y, n_jobs=1)
    scores.append(np.mean(this_scores))
    scores_std.append(np.std(this_scores))
# Do the plotting
import matplotlib.pyplot as plt
plt.figure()
plt.semilogx(C_s, scores)
plt.semilogx(C_s, np.array(scores) + np.array(scores_std), 'b--')
plt.semilogx(C_s, np.array(scores) - np.array(scores_std), 'b--')
locs, labels = plt.yticks()
plt.yticks(locs, list(map(lambda x: "%g" % x, locs)))
plt.ylabel('CV score')
plt.xlabel('Parameter C')
plt.ylim(0, 1.1)
plt.show()
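The resulting figure looks roughly like this:
Grid-search and cross-validated estimators
Grid-search
- Grid-search fits the estimator over a grid of parameter values and picks the hyperparameters with the highest cross-validated score. Conveniently, all you provide is the data, the estimator object and the parameter grid.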
>>> from sklearn.model_selection import GridSearchCV, cross_val_score
>>> Cs = np.logspace(-6, -1, 10)
>>> clf = GridSearchCV(estimator=svc, param_grid=dict(C=Cs),
... n_jobs=-1)
>>> clf.fit(X_digits[:1000], y_digits[:1000])
GridSearchCV(cv=None,...
>>> clf.best_score_
0.925...
>>> clf.best_estimator_.C
0.0077...
>>> # Prediction performance on test set is not as good as on train set
>>> clf.score(X_digits[1000:], y_digits[1000:])
0.943...
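By default GridSearchCV uses 3-fold cross-validation in older releases and 5-fold since scikit-learn 0.22. When a classifier rather than a regressor is passed in, it uses stratified k-fold.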
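Because the default number of folds depends on the installed version, one option is to pass cv explicitly. The following is a minimal sketch under that assumption (3 stratified folds, picked here only to match the older default); it also prints the per-candidate mean scores stored in cv_results_.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X_digits, y_digits = datasets.load_digits(return_X_y=True)
svc = svm.SVC(kernel='linear')
Cs = np.logspace(-6, -1, 10)

# Passing cv explicitly pins the fold count regardless of the library default.
cv = StratifiedKFold(n_splits=3)
clf = GridSearchCV(estimator=svc, param_grid=dict(C=Cs), cv=cv)
clf.fit(X_digits[:1000], y_digits[:1000])

# cv_results_ holds one mean test score per candidate value of C.
for C, mean in zip(Cs, clf.cv_results_['mean_test_score']):
    print('C=%.2e  mean CV score=%.3f' % (C, mean))
print('best C:', clf.best_params_['C'])
```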
Nested cross-validation
cross_val_score(clf, X_digits, y_digits)
array([0.938..., 0.963..., 0.944...])
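In essence there are two loops: an inner one, run by GridSearchCV, that tries each parameter value and keeps the one with the best cross-validated score, and an outer one, run by cross_val_score, that measures how well the tuned estimator predicts. The resulting scores are unbiased estimates of the prediction score on new data.
At first I read that sentence as "the model fit the training set perfectly", but that is not what it says: the outer test folds play no part in choosing the parameters, so the scores are not inflated by the search.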
Warning
You cannot nest objects with parallel computing (n_jobs different than 1).
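The two loops of nested cross-validation can be written out by hand, which may make the idea easier to see. This is only a sketch, not what cross_val_score does internally line for line: it assumes plain KFold for the outer loop (cross_val_score would use stratified folds for a classifier), and the names outer_cv, search and outer_scores are illustrative.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, KFold

X, y = datasets.load_digits(return_X_y=True)
Cs = np.logspace(-6, -1, 10)

outer_cv = KFold(n_splits=3)
outer_scores = []
for train, test in outer_cv.split(X):
    # Inner loop: GridSearchCV picks C using the outer training part only.
    search = GridSearchCV(svm.SVC(kernel='linear'),
                          param_grid=dict(C=Cs), cv=3)
    search.fit(X[train], y[train])
    # Outer loop: score the refitted best model on data the search never saw.
    outer_scores.append(search.score(X[test], y[test]))
    print('chosen C: %.2e, outer score: %.3f'
          % (search.best_params_['C'], outer_scores[-1]))
```

The chosen C may differ from one outer fold to the next. The outer scores estimate how the whole tune-then-fit procedure performs, not how one fixed value of C performs.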
Cross-validated estimators
- Tuning a parameter by cross-validation can be done more efficiently on an algorithm-by-algorithm basis. That is why, for certain estimators, scikit-learn exposes cross-validated estimators that set their parameter automatically by cross-validation.
>>> from sklearn import linear_model, datasets
>>> lasso = linear_model.LassoCV()
>>> X_diabetes, y_diabetes = datasets.load_diabetes(return_X_y=True)
>>> lasso.fit(X_diabetes, y_diabetes)
LassoCV()
>>> # The estimator automatically chose its regularization strength (alpha_):
>>> lasso.alpha_
0.00375...
These cross-validated estimators are named after the corresponding estimator with "CV" appended to the name.
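To see what LassoCV does with its internal cross-validation, the sketch below passes it an explicit grid of candidate alphas and then reads back the per-fold errors from mse_path_. The grid and the fold count are illustrative choices, not library defaults.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LassoCV

X_diabetes, y_diabetes = datasets.load_diabetes(return_X_y=True)

# Candidate values and fold count are illustrative, not defaults.
alphas = np.logspace(-4, -0.5, 30)
lasso = LassoCV(alphas=alphas, cv=5, max_iter=10000)
lasso.fit(X_diabetes, y_diabetes)

# mse_path_ has one row per candidate alpha (in the order of alphas_)
# and one column per fold; alpha_ minimises the mean error across folds.
mean_mse = lasso.mse_path_.mean(axis=1)
best = lasso.alphas_[np.argmin(mean_mse)]
print('alpha_ chosen by LassoCV:', lasso.alpha_)
print('alpha with lowest mean CV error:', best)
```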
Below is one of the official exercise scripts:
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.linear_model import LassoCV
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
X, y = datasets.load_diabetes(return_X_y=True)
X = X[:150]
y = y[:150]
lasso = Lasso(random_state=0, max_iter=10000)
alphas = np.logspace(-4, -0.5, 30)
tuned_parameters = [{'alpha': alphas}]
n_folds = 5
clf = GridSearchCV(lasso, tuned_parameters, cv=n_folds, refit=False)
clf.fit(X, y)
scores = clf.cv_results_['mean_test_score']
scores_std = clf.cv_results_['std_test_score']
plt.figure().set_size_inches(8, 6)
plt.semilogx(alphas, scores)
# plot error lines showing +/- std. errors of the scores
std_error = scores_std / np.sqrt(n_folds)
plt.semilogx(alphas, scores + std_error, 'b--')
plt.semilogx(alphas, scores - std_error, 'b--')
# alpha=0.2 controls the translucency of the fill color
plt.fill_between(alphas, scores + std_error, scores - std_error, alpha=0.2)
plt.ylabel('CV score +/- std error')
plt.xlabel('alpha')
plt.axhline(np.max(scores), linestyle='--', color='.5')
plt.xlim([alphas[0], alphas[-1]])
# #############################################################################
# Bonus: how much can you trust the selection of alpha?
# To answer this question we use the LassoCV object that sets its alpha
# parameter automatically from the data by internal cross-validation (i.e. it
# performs cross-validation on the training data it receives).
# We use external cross-validation to see how much the automatically obtained
# alphas differ across different cross-validation folds.
lasso_cv = LassoCV(alphas=alphas, random_state=0, max_iter=10000)
k_fold = KFold(3)
print("Answer to the bonus question:",
"how much can you trust the selection of alpha?")
print()
print("Alpha parameters maximising the generalization score on different")
print("subsets of the data:")
for k, (train, test) in enumerate(k_fold.split(X, y)):
    lasso_cv.fit(X[train], y[train])
    print("[fold {0}] alpha: {1:.5f}, score: {2:.5f}".
          format(k, lasso_cv.alpha_, lasso_cv.score(X[test], y[test])))
print()
print("Answer: Not very much since we obtained different alphas for different")
print("subsets of the data and moreover, the scores for these alphas differ")
print("quite substantially.")
plt.show()
The resulting figure looks like this: