1. Using learning curves to diagnose bias and variance problems
By plotting the model's training accuracy and cross-validation accuracy as functions of the number of training samples, we can tell whether the model suffers from high variance (overfitting) or high bias (underfitting).
1) As the number of training samples grows, the training score falls gradually from its maximum while the cross-validation score rises gradually from its minimum, with the training score always staying above the validation score;
2) If both training and validation accuracy are low, the model has high bias, i.e. it is underfitting;
3) If training accuracy is markedly higher than validation accuracy, the model has high variance, i.e. it is overfitting;
4) If both training and validation accuracy are high and the gap between them is small, the model is neither underfitting nor overfitting.
Plotting a learning curve with sklearn's built-in breast_cancer dataset
>>>import numpy as np
>>>import matplotlib.pyplot as plt
>>>from sklearn.datasets import load_breast_cancer
>>>from sklearn.preprocessing import StandardScaler
>>>from sklearn.linear_model import LogisticRegression
>>>from sklearn.pipeline import Pipeline
>>>from sklearn.model_selection import train_test_split,learning_curve,cross_val_score
>>>X = load_breast_cancer().data
>>>y = load_breast_cancer().target
>>>X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.20,random_state=1)
>>>pipe_lr = Pipeline([('scl', StandardScaler()), ('clf', LogisticRegression(penalty='l2', random_state=0))])
>>>scores=cross_val_score(pipe_lr,X=X_train, y=y_train,cv=10, n_jobs=-1).mean()
>>>print('Scores: %.3f'%scores)
>>>train_sizes, train_scores, test_scores = learning_curve(estimator=pipe_lr, X=X_train, y=y_train, train_sizes=np.linspace(0.1, 1.0, 5), cv=10, n_jobs=-1)
>>>train_mean = np.mean(train_scores, axis=1)
>>>train_std = np.std(train_scores, axis=1)
>>>test_mean = np.mean(test_scores, axis=1)
>>>test_std = np.std(test_scores, axis=1)
>>>plt.plot(train_sizes, train_mean, color='r', marker='o', markersize=5, label='Training accuracy')
>>>plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std,alpha=0.25, color='r')
>>>plt.plot(train_sizes, test_mean, color='g', linestyle='--', marker='o', markersize=5, label='Validation accuracy')
>>>plt.fill_between(train_sizes,test_mean - test_std,test_mean + test_std, alpha=0.25, color='g')
>>>plt.grid()
>>>plt.xlabel('Number of training samples')
>>>plt.ylabel('Accuracy')
>>>plt.legend(loc='best')
>>>plt.ylim([0.8, 1.0])
>>>plt.show()
Scores: 0.976
The learning curve shows that the model generalizes well, but a small gap remains between the training-accuracy curve and the cross-validation-accuracy curve, which indicates slight overfitting to the training set.
2. Using validation curves to examine the relationship between accuracy and a model parameter
#Plot the validation curve of model accuracy against different values of C in the logistic regression model
>>>from sklearn.model_selection import validation_curve
>>>param_range = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
>>>train_scores, test_scores = validation_curve(estimator=pipe_lr, X=X_train, y=y_train, param_name='clf__C', param_range=param_range, cv=10)
>>>train_mean = np.mean(train_scores, axis=1)
>>>train_std = np.std(train_scores, axis=1)
>>>test_mean = np.mean(test_scores, axis=1)
>>>test_std = np.std(test_scores, axis=1)
>>>plt.plot(param_range, train_mean, color='r', marker='o', markersize=5, label='Training accuracy')
>>>plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, alpha=0.15, color='r')
>>>plt.plot(param_range, test_mean, color='g', linestyle='--', marker='s', markersize=5, label='Validation accuracy')
>>>plt.fill_between(param_range, test_mean - test_std, test_mean + test_std, alpha=0.15, color='g')
>>>plt.grid()
>>>plt.xscale('log')
>>>plt.legend(loc='best')
>>>plt.xlabel('Parameter C')
>>>plt.ylabel('Accuracy')
>>>plt.ylim([0.8, 1.0])
>>>plt.show()
As the figure shows, the model underfits slightly when C is small and overfits noticeably when C is large; the optimal value of C lies around 0.1.
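Rather than reading the best C off the plot by eye, the same choice can be made programmatically by taking the C with the highest mean cross-validation accuracy. A minimal sketch (it rebuilds the pipeline and runs the validation curve on the full dataset rather than X_train, an assumption made to keep it self-contained):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import validation_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
pipe_lr = Pipeline([('scl', StandardScaler()),
                    ('clf', LogisticRegression(penalty='l2', random_state=0))])
param_range = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
train_scores, test_scores = validation_curve(
    pipe_lr, X, y, param_name='clf__C', param_range=param_range, cv=10)

# Select the C whose mean cross-validation accuracy is highest.
best_C = param_range[int(np.argmax(test_scores.mean(axis=1)))]
print('Best C:', best_C)
```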
3. Tuning a machine learning model with grid search
There are two kinds of parameters in machine learning: parameters learned from the training set, such as the regression coefficients in logistic regression, and parameters of the algorithm that must be optimized separately, also called tuning parameters or hyperparameters, such as the regularization parameter in logistic regression or the depth parameter of a decision tree.
Grid search performs a brute-force exhaustive search over the specified lists of hyperparameter values and evaluates the model's performance for each combination, in order to find the optimal combination of parameters.
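The distinction can be illustrated with a small sketch (not part of the original example): the hyperparameter C is set before training, while the regression coefficients are only available after fitting.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

# C is a hyperparameter: chosen before training, not learned from data.
clf = LogisticRegression(C=0.1)
clf.fit(X, y)

# coef_ holds the learned parameters: one regression coefficient per feature.
print(clf.coef_.shape)  # (1, 30): 30 features in breast_cancer
```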
>>>from sklearn.datasets import load_breast_cancer
>>>from sklearn.preprocessing import StandardScaler
>>>from sklearn.svm import SVC
>>>from sklearn.pipeline import Pipeline
>>>from sklearn.model_selection import GridSearchCV,train_test_split
>>>X = load_breast_cancer().data
>>>y = load_breast_cancer().target
>>>X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.20,random_state=1)
>>>pipe_svc=Pipeline([('scl',StandardScaler()),('clf',SVC(random_state=1))])
>>>param_range=[0.0001,0.001,0.01,0.1,1.0,10.0,100.0,1000.0]
#param_grid defines the parameters to tune as a list of dictionaries: for a linear SVM only the regularization parameter C needs tuning, while for an RBF-kernel SVM both C and gamma must be tuned
>>>param_grid=[{'clf__C':param_range,'clf__kernel':['linear']},{'clf__C':param_range,'clf__gamma':param_range,'clf__kernel':['rbf']}]
>>>gs=GridSearchCV(estimator=pipe_svc,param_grid=param_grid,scoring='accuracy',cv=10,n_jobs=-1)
>>>gs=gs.fit(X_train,y_train)
#The best_score_ attribute gives the performance score of the tuned model, and the best_params_ attribute gives the corresponding parameter values
>>>print('Best_score: %.3f'%gs.best_score_)
>>>print(gs.best_params_)
#best_estimator_ returns the tuned optimal model; its performance is then evaluated on the independent test set
>>>clf=gs.best_estimator_
>>>clf.fit(X_train,y_train)
>>>print('Test_accuracy: %.3f'%clf.score(X_test,y_test))
Best_score: 0.978
{'clf__C': 0.1, 'clf__kernel': 'linear'}
Test_accuracy: 0.965
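Worth noting: because GridSearchCV refits best_estimator_ on the whole training set by default (refit=True), the explicit clf.fit(X_train, y_train) step above is optional, and the fitted search object can score the test set directly. A condensed sketch (using a smaller grid than the text, an assumption made only to keep it fast):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=1)
pipe_svc = Pipeline([('scl', StandardScaler()), ('clf', SVC(random_state=1))])

# A smaller grid than in the text, just to keep this sketch quick to run.
param_grid = {'clf__C': [0.1, 1.0, 10.0], 'clf__kernel': ['linear']}
gs = GridSearchCV(pipe_svc, param_grid, scoring='accuracy', cv=5)
gs.fit(X_train, y_train)

# With refit=True (the default), gs already holds the best model retrained
# on all of X_train, so it can predict and score without a separate fit.
test_acc = gs.score(X_test, y_test)
print('Test_accuracy: %.3f' % test_acc)
```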