Grid Search [6]_sklearn
1. What is grid search?
- Grid Search is a hyperparameter tuning technique based on exhaustive search: loop over every candidate parameter combination, try each one, and the combination that performs best is the final answer. In principle it is like finding the maximum value in an array. (Why is it called a "grid" search? Take a model with two parameters: if parameter a has 3 candidate values and parameter b has 4, listing every combination gives a 3*4 table, where each cell is one grid point. The loop walks through and searches every cell, hence "grid search".)
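The 3*4 table described above can be sketched directly with `itertools.product`; the parameter names and values here are made up purely for illustration:

```python
# A minimal sketch of the "grid" idea: with 3 candidates for parameter a and
# 4 for parameter b, exhaustive search visits every cell of a 3x4 table.
from itertools import product

a_values = [0.1, 1, 10]      # 3 candidates (hypothetical)
b_values = [1, 2, 3, 4]      # 4 candidates (hypothetical)

grid = list(product(a_values, b_values))
print(len(grid))  # 12 combinations = 3 * 4 cells
```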
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
'''------------------------------------------
1 Read the data
----------------------------------------------'''
data = pd.read_csv('data_processed.csv', encoding='gbk')
'''-------------------------------------------
1.1 Split into training and test sets
----------------------------------------------'''
train, test = train_test_split(data, test_size=0.3, random_state=666)
'''----------------------------------------
1.2 Extract the labels
-------------------------------------------'''
y_train = train.status
train.drop(['status'], axis=1, inplace=True)
y_test = test.status
test.drop(['status'], axis=1, inplace=True)
'''---------------------------------------------
1.3 Standardize the data
-----------------------------------------------'''
scaler = StandardScaler()
train = pd.DataFrame(scaler.fit_transform(train), index=train.index, columns=train.columns)
# transform (not fit_transform): scale the test set with the training statistics
test = pd.DataFrame(scaler.transform(test), index=test.index, columns=test.columns)
'''----------------------------------------
1.4 Train the model
------------------------------------------'''
best_score = 0
for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        model = SVC(gamma=gamma, C=C)
        model.fit(train, y_train)
        score = model.score(test, y_test)
        if score > best_score:  # keep the best-performing parameters
            best_score = score
            best_parameters = {'gamma': gamma, 'C': C}
'''--------------------------------------
1.5 Predict with the model
---------------------------------------------'''
model = SVC(**best_parameters)  # refit: the loop leaves `model` at the last combination, not the best one
model.fit(train, y_train)
y_test_pre = model.predict(test)
'''------------------------------------------
1.6 Scoring
---------------------------------------------'''
print('-------------------------------')
print('best score: {}'.format(best_score))
print('best parameters: {}'.format(best_parameters))
1.1 Problems with this approach:
- After splitting the original data into a training set and a test set, the test set is used both for tuning the parameters and for measuring how good the model is, which makes the final score look better than the true performance.
Solution:
- Split the training set once more, into a training set and a validation set:
- The training set is used to train the model, the validation set to tune the parameters, and the test set to measure model quality.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
'''------------------------------------------
1 Read the data
----------------------------------------------'''
data = pd.read_csv('data_processed.csv', encoding='gbk')
'''-------------------------------------------
1.1 Split into training, validation, and test sets
----------------------------------------------'''
train_val, test = train_test_split(data, test_size=0.1, random_state=666)
train, val = train_test_split(train_val, test_size=0.1, random_state=666)
'''----------------------------------------
1.2 Extract the labels
-------------------------------------------'''
y_train_val = train_val.status  # training + validation set
train_val.drop(['status'], axis=1, inplace=True)
y_train = train.status
train.drop(['status'], axis=1, inplace=True)
y_test = test.status
test.drop(['status'], axis=1, inplace=True)
y_val = val.status
val.drop(['status'], axis=1, inplace=True)
'''---------------------------------------------
1.3 Standardize the data
-----------------------------------------------'''
# Fit the scaler on training data only, then reuse its statistics, so the
# validation and test sets never influence the scaling
scaler = StandardScaler().fit(train)
train = pd.DataFrame(scaler.transform(train), index=train.index, columns=train.columns)
val = pd.DataFrame(scaler.transform(val), index=val.index, columns=val.columns)
scaler_tv = StandardScaler().fit(train_val)  # a second scaler for the final refit on train+val
train_val = pd.DataFrame(scaler_tv.transform(train_val), index=train_val.index, columns=train_val.columns)
test = pd.DataFrame(scaler_tv.transform(test), index=test.index, columns=test.columns)
'''----------------------------------------
1.4 Train the model
------------------------------------------'''
best_score = 0
for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        model = SVC(gamma=gamma, C=C)
        model.fit(train, y_train)
        score = model.score(val, y_val)  # tune on the validation set, not the test set
        if score > best_score:  # keep the best-performing parameters
            best_score = score
            best_parameters = {'gamma': gamma, 'C': C}
'''--------------------------------------
1.5 Predict with the model
---------------------------------------------'''
model = SVC(**best_parameters)  # build a new model with the best parameters
model.fit(train_val, y_train_val)  # train on training + validation data
test_score = model.score(test, y_test)
print('-------------------------------')
print('best validation score: {}'.format(best_score))
print('best parameters: {}'.format(best_parameters))
print('test score: {}'.format(test_score))
- Note the standardization step: after standardizing the data, accuracy improves markedly.
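As a toy illustration of that note (the numbers are invented), the fit-on-train / transform-on-test pattern looks like:

```python
# Fit the scaler on the training data only, then apply the same learned
# statistics (mean and std) to the test data.
import numpy as np
from sklearn.preprocessing import StandardScaler

train_X = np.array([[1.0], [2.0], [3.0]])
test_X = np.array([[2.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_X)  # learns mean=2.0 and std from train
test_scaled = scaler.transform(test_X)        # reuses the training statistics

print(train_scaled.ravel())
print(test_scaled.ravel())  # [0.] because 2.0 is the training mean
```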
2. Cross-validation

Cross-validation:
- Partition the dataset D into k mutually exclusive subsets of similar size, i.e. D = D_1 ∪ D_2 ∪ ... ∪ D_k, with D_i ∩ D_j = ∅ (i ≠ j)
- Each round, use the union of k-1 subsets as the training set and the remaining subset as the test set
- This yields k training/test pairs, and the final result is the mean of the k test scores
- For this reason the method is usually called "k-fold cross-validation"
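The splitting scheme above can be observed directly with sklearn's `KFold` on a tiny made-up dataset:

```python
# k-fold splitting with sklearn's KFold: 6 samples, k=3, so each round holds
# out a different third of the data and trains on the other two thirds.
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(6).reshape(6, 1)
kf = KFold(n_splits=3)
for train_idx, test_idx in kf.split(X):
    print(train_idx, test_idx)
# every sample appears in exactly one of the 3 test folds
```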
Leave-one-out:
- Suppose the dataset D contains m samples; setting k = m gives the leave-one-out method, where each test fold is a single sample
- Leave-one-out is not affected by the randomness of the sample split
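sklearn also ships this as `LeaveOneOut`; a small sketch on a made-up 4-sample dataset:

```python
# Leave-one-out as a special case of k-fold: with m samples, LeaveOneOut
# behaves like KFold(n_splits=m), so every test fold is a single sample.
import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.arange(4).reshape(4, 1)
loo = LeaveOneOut()
print(loo.get_n_splits(X))  # 4 splits for m = 4 samples
for train_idx, test_idx in loo.split(X):
    print(train_idx, test_idx)  # each test fold holds exactly one sample
```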
Grid Search with cross validation
from sklearn.model_selection import cross_val_score
'''-----------------------------------------
1.4 Train the model
------------------------------------------'''
best_score = 0
for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        model = SVC(gamma=gamma, C=C)
        scores = cross_val_score(model, train, y_train, cv=5)  # 5-fold cross-validation on the training set
        score = scores.mean()
        if score > best_score:  # keep the best-performing parameters
            best_score = score
            best_parameters = {'gamma': gamma, 'C': C}
'''--------------------------------------
1.5 Predict with the model
---------------------------------------------'''
model = SVC(**best_parameters)  # build the final model with the best parameters
model.fit(train, y_train)
test_score = model.score(test, y_test)
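The manual double loop plus `cross_val_score` above is the pattern that sklearn packages as `GridSearchCV`, which cross-validates every combination and then refits the best one on the full training data. A minimal sketch on the built-in iris dataset (the grid mirrors the loops above; the dataset is swapped in only so the example is self-contained):

```python
# GridSearchCV automates the nested parameter loops + cross_val_score pattern.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)

param_grid = {'gamma': [0.001, 0.01, 0.1, 1, 10, 100],
              'C': [0.001, 0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)          # tries all 36 combinations with 5-fold CV

print(search.best_params_)            # best combination found during CV
print(search.score(X_test, y_test))   # best model, refit on all of X_train
```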