Cross-validation
A solution to this problem is a procedure called cross-validation (CV for short). A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV. In the basic approach, called k-fold CV, the training set is split into k smaller sets (other approaches are described below, but generally follow the same principles). The following procedure is followed for each of the k "folds":
1. A model is trained using k-1 of the folds as training data;
2. the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).
The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop. This approach can be computationally expensive, but does not waste too much data (as is the case when fixing an arbitrary test set), which is a major advantage in problems such as inverse inference where the number of samples is very small.
The passage above is quoted from the sklearn documentation's description of CV. It describes a rule that always holds in cross-validation: when solving a real problem, we split the whole dataset into a train_set (e.g., 70%) and a test_set (30%), run cross-validation on the train_set, average the results, and only then use the test_set to measure the model's accuracy. We do not run cross-validation directly on the whole dataset (this was one of my misconceptions about CV).
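A minimal sketch of that workflow (assuming a modern scikit-learn, where these utilities live in sklearn.model_selection; the SVC model, 70/30 split, and 5 folds are arbitrary illustrative choices):

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
# Hold out 30% as the final test set; cross-validation never touches it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
# 5-fold CV on the training set only, then average the fold scores.
scores = cross_val_score(SVC(), X_train, y_train, cv=5)
print("mean CV accuracy: %0.3f" % scores.mean())
# The held-out test set is used exactly once, at the very end.
print("test accuracy: %0.3f" % SVC().fit(X_train, y_train).score(X_test, y_test))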
k-fold
I originally did not plan to write anything about cross-validation, but I realized that my own misconceptions about it were plentiful, so I am writing this down; if anyone reads it and spots a mistake, please point it out.
1. A model is trained using k-1 of the folds as training data;
2. the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).
The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop.
Premise: the whole dataset has been split into a training set D (70%) and a test set T (30%).
The quoted passage above is the entire k-fold process (only the training set D is involved at this point):
1. Split the whole training set D into k equally sized subsets (folds), pick k-1 of them as the training data, and train a model, model1.
2. Use the remaining fold D_i as the validation set (it plays exactly the same role as the so-called test set) to measure the accuracy of model1. For model evaluation methods, see the ones implemented in sklearn.
3. Repeat the above process k times, making sure each fold serves as the validation set exactly once, then average the k accuracies; this average is the accuracy of the method, as in the sketch below.
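A minimal sketch of that loop, spelled out by hand with sklearn's KFold (the SVC model, k=5, and the 70/30 split are arbitrary choices for illustration):

import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import KFold, train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
# The premise above: D is the training set, T the held-out test set.
X_D, X_T, y_D, y_T = train_test_split(X, y, test_size=0.3, random_state=0)

accuracies = []
# Each round trains on k-1 folds of D and validates on the remaining fold D_i.
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X_D):
    model1 = SVC().fit(X_D[train_idx], y_D[train_idx])
    accuracies.append(model1.score(X_D[val_idx], y_D[val_idx]))
# The reported accuracy is the average over the k rounds.
print("k-fold accuracy: %0.3f" % np.mean(accuracies))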
Some may ask: which model weights θ does the averaged accuracy correspond to? To answer this you need to be clear about what machine learning is for. The goal is not to find the particular weight values of some model, but to select, for the actual problem, a suitable model (say, a support vector machine) and suitable hyperparameters (say, the kernel, C, and so on). The averaged accuracy above corresponds to a model plus its hyperparameters, as the comparison below illustrates.
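To make "model + hyperparameters" concrete (a hedged sketch; the two C values are arbitrary), the averaged CV score is attached to a combination such as "SVC with C=1", not to any single set of fitted weights:

from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
for C in (1, 100):
    # One averaged accuracy per model + hyperparameter combination.
    mean_acc = cross_val_score(SVC(C=C), X, y, cv=5).mean()
    print("SVC(C=%d): mean CV accuracy %0.3f" % (C, mean_acc))
# Whichever combination wins is then refit on the full training set.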
GridSearch
With k-fold understood, we can talk about GridSearch, because GridSearch does 3-fold cross-validation by default; without understanding cross-validation it is hard to understand GridSearch.
What it is for
GridSearch exists to solve the hyperparameter tuning problem. An SVM, for example, has common parameters such as kernel, gamma, and C. Tuning them by hand is far too slow, and a hand-written loop can only run sequentially, not in parallel. Hence GridSearch: with it you can find the best parameters directly.
How it tunes parameters
The parameters are passed as a list of dicts; GridSearch feeds every combination of the fields inside each dict into the classifier and runs it:
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]
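To see exactly which combinations that produces, sklearn's ParameterGrid can enumerate the grid (a small illustrative check, not part of the original example): the first dict yields 2 x 4 = 8 combinations and the second yields 4, i.e. 12 in total.

from sklearn.model_selection import ParameterGrid

tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]
combos = list(ParameterGrid(tuned_parameters))
print(len(combos))   # 12
print(combos[0])     # {'C': 1, 'gamma': 0.001, 'kernel': 'rbf'}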
How it evaluates
Once the parameters are in, the predictive ability of the model under each parameter combination has to be evaluated. GridSearch runs k-fold on the data, computes the average accuracy of the model for each parameter combination, selects the best parameters, and returns them.
In general GridSearch runs k-fold only on the training set and never uses the test set. The test set is kept for the very end: once GridSearch has picked the best model, the test set is used to measure that model's generalization ability.
Here is an example from sklearn:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.svm import SVC
# Loading the Digits dataset
digits = datasets.load_digits()
# To apply a classifier on this data, we need to flatten the images, to
# turn the data into a (samples, features) matrix:
n_samples = len(digits.images)
X = digits.images.reshape((n_samples, -1))
y = digits.target
# Split the dataset into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)
# Set up the parameter grid for GridSearch
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]
# Choose the model evaluation methods; if unclear, see the k-fold section above
scores = ['precision', 'recall']
for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()
    # Build the GridSearch classifier, with 5-fold CV
    clf = GridSearchCV(SVC(), tuned_parameters, cv=5,
                       scoring='%s_weighted' % score)
    # k-fold is done on the training set only; this finds the best parameters
    clf.fit(X_train, y_train)
    print("Best parameters set found on development set:")
    print()
    # Print the best parameter combination
    print(clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
    print()
    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    # Test the generalization ability of the best model on the test set
    y_true, y_pred = y_test, clf.predict(X_test)
    print(classification_report(y_true, y_pred))
    print()
The example above follows the usual pattern. The SVC in the example supports multi-class classification, using the one-vs-one (ovo) scheme by default. If you need to change that, set the parameter decision_function_shape='ovr'; see the SVC API documentation for details. A quick check of the difference follows.
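A small hedged sketch of what that parameter changes (the shapes below assume the 10-class digits data): ovo produces one decision column per pair of classes, ovr one column per class.

from sklearn.datasets import load_digits
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
# 10 classes: ovo yields 10*9/2 = 45 pairwise columns, ovr yields 10.
print(SVC(decision_function_shape='ovo').fit(X, y).decision_function(X).shape)  # (1797, 45)
print(SVC(decision_function_shape='ovr').fit(X, y).decision_function(X).shape)  # (1797, 10)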
A few points worth noting
1. Does GridSearch support multi-class classification?
GridSearch merely assembles the parameter combinations and feeds the data into the model in k-fold fashion, then evaluates each model's accuracy. It is not itself a new classification method, so as long as the estimator you choose can handle multi-class classification, GridSearch can too; the handwritten-digit example above is exactly such a multi-class problem. The model evaluation method you choose must also suit a multi-class problem: when you evaluate the model with roc_auc, you need to pay attention to the data format, as sketched below.
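A hedged sketch of that data-format point (assuming a modern scikit-learn where roc_auc_score accepts multi_class='ovr'): ROC-AUC needs per-class scores rather than bare label predictions, so the classifier must expose probabilities.

from sklearn.datasets import load_digits
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# probability=True is needed so SVC exposes predict_proba.
clf = SVC(probability=True).fit(X_train, y_train)
proba = clf.predict_proba(X_test)  # shape (n_samples, n_classes)
print(roc_auc_score(y_test, proba, multi_class='ovr'))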
2. The estimator passed to GridSearch is sometimes nested, as in AdaBoost ensemble learning, so GridSearch needs to support nested parameters. A double underscore __ marks a parameter as nested, i.e., a parameter of the inner estimator. (I have not tried this myself; I have only seen others say so...) GridSearch also has APIs aimed specifically at ensemble learning. A minimal sketch of the __ syntax follows.
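As a minimal sketch of the double-underscore syntax (using a Pipeline rather than AdaBoost, since that is the case I can vouch for), each grid key is '<step name>__<parameter name>' and GridSearchCV routes the value to the inner estimator:

from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])
# 'svc__C' is a nested parameter: the C of the inner SVC step.
param_grid = {'svc__C': [1, 10, 100], 'svc__kernel': ['rbf', 'linear']}
grid = GridSearchCV(pipe, param_grid, cv=5).fit(X, y)
print(grid.best_params_)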
This blog post on nested parameters has another example:
———2017.4.18