通过测试数据集判断模型好坏,由于测试数据集也是已知的,可能导致针对特定测试数据集过拟合这一问题。
解决办法
将数据集分为训练数据(训练)、验证数据(评判)、测试数据
验证数据集——调整超参数使用的数据集
测试数据集——作为衡量最终模型性能的数据集
为了解决只有一个验证数据集导致的训练出的模型不准确的问题——
交叉验证
交叉验证
In [127]: import numpy as np
...: from sklearn import datasets
In [128]: digits = datasets.load_digits()
...: X = digits.data
...: y = digits.target
1. 测试train_test_split
In [129]: from sklearn.model_selection import train_test_split
...: X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.4,random_state=666)
#k:KNN中每次选几个相邻元素作为邻居
#p:KNN中计算距离相应的参数p
In [132]: from sklearn.neighbors import KNeighborsClassifier
...:
...: best_score,best_p,best_k = 0,0,0
...: for k in range(2,11):
...: for p in range(1,6):
...: knn_clf = KNeighborsClassifier(weights="distance",n_neighbors=k)
...: knn_clf.fit(X_train,y_train)
...: score = knn_clf.score(X_test,y_test)
...: if score > best_score:
...: best_score,best_p,best_k = score,p,k
...:
...: print("Best K = ",best_k)
...: print("Best P = ",best_p)
...: print("Best Score = ",best_score)
Best K = 5
Best P = 1
Best Score = 0.9860917941585535
2. 交叉验证
cross_val_score()
自动进行交叉验证。并返回生成的k个模型每个模型的准确率
In [133]: from sklearn.model_selection import cross_val_score
...:
...: knn_clf = KNeighborsClassifier()
...: cross_val_score(knn_clf,X_train,y_train)
Out[133]: array([0.98895028, 0.97777778, 0.96629213])
使用交叉验证的方式进行调参
In [134]: best_score,best_p,best_k = 0,0,0
...: for k in range(2,11):
...: for p in range(1,6):
...: knn_clf = KNeighborsClassifier(weights="distance",n_neighbors=k)
...: scores = cross_val_score(knn_clf,X_train,y_train)
...: score = np.mean(scores)
...: if score > best_score:
...: best_score,best_p,best_k = score,p,k
...:
...: print("Best K = ",best_k)
...: print("Best P = ",best_p)
...: print("Best Score = ",best_score)
Best K = 2
Best P = 1
Best Score = 0.9823599874006478
通过上述交叉验证获得最佳的k、p后,就可以用这组参数获得最佳的一个KNN的模型;
由于测试用的X_test,y_test在模型训练过程中是完全没有的,所以得到的0.98的分数是可靠的。
In [135]: best_knn_clf = KNeighborsClassifier(weights="distance",n_neighbors=2,p=2)
In [136]: best_knn_clf.fit(X_train,y_train)
...: best_knn_clf.score(X_test,y_test)
Out[136]: 0.980528511821975
网格搜索中的GridSearchCV这个类就是通过交叉验证来实现调参
用cv这个参数可以设置模型个数
In [137]: cross_val_score(knn_clf,X_train,y_train,cv=5)
Out[137]: array([0.99543379, 0.97716895, 0.97685185, 0.97196262, 0.97142857])