Benefits of testing
- With separate training and testing datasets, we can see how our learning model performs on data it has never seen.
- Serves as a check on overfitting.
Train/test split in sklearn
train_test_split is a function commonly used for cross-validation.
It randomly splits the samples into a training set and a test set according to a given proportion.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
Parameters:
X: the sample features to be split
y: the corresponding sample labels
test_size: the proportion of samples held out for testing (if an integer, the absolute number of test samples)
random_state: the seed for the random number generator; fixing it guarantees the same split when an experiment needs to be repeated
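As a quick sketch of the call above (the toy arrays here are illustrative, not from the original notes):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y = np.arange(10)                 # 10 labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# test_size=0.33 puts ceil(10 * 0.33) = 4 samples in the test set
print(X_train.shape, X_test.shape)  # (6, 2) (4, 2)
```

Because random_state is fixed, rerunning the split yields the exact same partition.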
Where to use training and testing data
Training: fit() is called on the training set only.
Testing: predict()/score() are evaluated on the held-out test set.
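A minimal sketch of this workflow: fit() sees only the training split, score() only the test split (iris and DecisionTreeClassifier are illustrative choices, not from the original notes):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)             # training: only the training data
accuracy = clf.score(X_test, y_test)  # testing: evaluation on unseen data
print(accuracy)
```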
K折交叉验证
Another method is K-fold cross-validation, where you split the dataset into K equal-sized bins. One bin serves as the test set, and the remaining K-1 bins serve as the training set.
You then run K iterations, using a different test bin in each iteration, which yields K test results.
Finally, you average the results.
This gives a more reliable estimate of accuracy, but at the cost of longer training time than a single train/test split, since the model is trained K times.
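The steps above can be sketched with sklearn's KFold (iris and a decision tree are stand-in choices, not from the original notes):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)  # K = 5 bins

scores = []
for train_idx, test_idx in kf.split(X):
    # K-1 bins train the model, the remaining bin tests it
    clf = DecisionTreeClassifier(random_state=42)
    clf.fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))

print(np.mean(scores))  # average of the K test results
```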
k-fold cross-validation in sklearn
The call marked in red, cv = KFold( len(authors), 2 ), splits the data in order.
It needs to be randomized: cv = KFold( len(authors), 2, shuffle=True ). (With the current sklearn.model_selection API, this is cv = KFold(n_splits=2, shuffle=True).)
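A small sketch of why shuffle=True matters when the samples are ordered, e.g. sorted by class (the toy labels are illustrative):

```python
import numpy as np
from sklearn.model_selection import KFold

y = np.array([0] * 5 + [1] * 5)  # labels sorted by class

# Without shuffling, KFold takes contiguous slices, so each test fold
# here ends up containing only a single class.
for _, test_idx in KFold(n_splits=2).split(y):
    print(y[test_idx])

# With shuffle=True, the fold indices are drawn from across the dataset.
for _, test_idx in KFold(n_splits=2, shuffle=True, random_state=42).split(y):
    print(y[test_idx])
```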
Cross-validation for parameter tuning
GridSearchCV systematically works through multiple parameter combinations and uses cross-validation to determine which combination performs best. Its benefit is that only a few extra lines of code are needed to sweep many combinations.
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV
iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svr = svm.SVC()
clf = GridSearchCV(svr, parameters)
clf.fit(iris.data, iris.target)
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
Each combination of kernel and C is used to train an SVM, and cross-validation is used to assess its performance.
svr = svm.SVC(): the "classifier" in this case is not just an algorithm but an algorithm plus a set of parameter values.
clf = GridSearchCV(svr, parameters): we pass in the algorithm (svr) and the dictionary of parameters to try; it generates a grid of parameter combinations to attempt.
clf.fit(iris.data, iris.target): the fit function now tries all the parameter combinations and returns a fitted classifier, automatically tuned to the optimal parameter combination.
The winning parameter values are available via clf.best_params_.
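A sketch of retrieving the tuned result (iris as example data; best_params_, best_score_, and best_estimator_ are standard GridSearchCV attributes):

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
clf = GridSearchCV(svm.SVC(), parameters)
clf.fit(iris.data, iris.target)

print(clf.best_params_)      # the winning kernel/C combination
print(clf.best_score_)       # its mean cross-validated accuracy
tuned = clf.best_estimator_  # an SVC refit on the full data with those parameters
print(tuned.score(iris.data, iris.target))
```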