Grid Search [6]_sklearn

1. What is grid search?

  • Grid Search: a parameter-tuning technique based on exhaustive search: loop over every candidate parameter setting, try each combination, and the combination that scores best is the final result. The principle is the same as finding the maximum value in an array. (Why is it called grid search? Take a model with two parameters: if parameter a has 3 candidate values and parameter b has 4, all combinations can be laid out as a 3*4 table, where each cell is one point of the grid; the loop walks through every cell, hence "grid search". A small sketch of this enumeration follows below.)
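A minimal sketch of the enumeration idea (the parameter names a and b and their candidate values are purely illustrative, not part of the original example):

from itertools import product

a_candidates = [1, 2, 3]               # 3 candidate values for parameter a
b_candidates = [0.1, 0.2, 0.3, 0.4]    # 4 candidate values for parameter b

grid = list(product(a_candidates, b_candidates))  # the 3*4 = 12 cells of the grid
print(len(grid))  # 12
# grid search simply loops over all 12 cells and keeps the best-scoring one

The full worked example below applies the same idea to an SVC on a real dataset.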
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
'''------------------------------------------
1 Read the data
----------------------------------------------'''
data = pd.read_csv('data_processed.csv',encoding='gbk') 
'''-------------------------------------------
1.1 Split into training and test sets
----------------------------------------------'''
train, test = train_test_split(data, test_size=0.3, random_state=666)
'''----------------------------------------
1.2 Extract the labels
-------------------------------------------'''
y_train = train.status
train.drop(['status'], axis=1, inplace=True)
y_test = test.status
test.drop(['status'], axis=1, inplace=True)
'''---------------------------------------------
1.3 Standardize the data
-----------------------------------------------'''
scaler = StandardScaler()
# fit the scaler on the training set only, then apply the same transform to the test set
train = pd.DataFrame(scaler.fit_transform(train), index=train.index, columns=train.columns)
test = pd.DataFrame(scaler.transform(test), index=test.index, columns=test.columns)
'''----------------------------------------
1.4 Train the model
------------------------------------------'''
best_score = 0
for gamma in [0.001,0.01,0.1,1,10,100]:
    for C in [0.001,0.01,0.1,1,10,100]:
        model = SVC(gamma=gamma,C=C)
        model.fit(train, y_train)
        score = model.score(test, y_test)
        if score > best_score:  # keep the best-performing parameters
            best_score = score
            best_parameters = {'gamma': gamma, 'C': C}
'''--------------------------------------
1.5 Predict
---------------------------------------------'''
y_test_pre = model.predict(test)  # note: model is the one from the last loop iteration, not necessarily the best
'''------------------------------------------
1.6 Score
---------------------------------------------'''
print('-------------------------------')
print('best score: {}'.format(best_score))
print("best parameters:{}".format(best_parameters))

1.1 The problem:

  • After the original data is split into a training set and a test set, the test set is used both for tuning the parameters and for judging how good the model is; as a result, the reported score is better than the true performance

The fix

  • Split the training portion once more, into a training set and a validation set:
    • the training set is used to train the model, the validation set to tune the parameters, and the test set to measure the final model quality
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
'''------------------------------------------
1 Read the data
----------------------------------------------'''
data = pd.read_csv('data_processed.csv',encoding='gbk') 
'''-------------------------------------------
1.1 Split into training, validation, and test sets
----------------------------------------------'''
train_val, test = train_test_split(data, test_size=0.1, random_state=666)
train, val = train_test_split(train_val,test_size=0.1, random_state=666)
'''----------------------------------------
1.2 Extract the labels
-------------------------------------------'''
y_train_val = train_val.status                  # training + validation set
train_val.drop(['status'], axis=1,inplace=True)
y_train = train.status
train.drop(['status'], axis=1, inplace=True)
y_test = test.status
test.drop(['status'], axis=1, inplace=True)
y_val = val.status
val.drop(['status'], axis=1, inplace = True)
'''---------------------------------------------
1.3 Standardize the data
-----------------------------------------------'''
scaler = StandardScaler()
# fit the scaler on the training set only, then reuse the same transform on the validation set
train = pd.DataFrame(scaler.fit_transform(train), index=train.index, columns=train.columns)
val = pd.DataFrame(scaler.transform(val), index=val.index, columns=val.columns)
# for the final refit, fit a scaler on training + validation data and transform the test set with it
scaler_full = StandardScaler()
train_val = pd.DataFrame(scaler_full.fit_transform(train_val), index=train_val.index, columns=train_val.columns)
test = pd.DataFrame(scaler_full.transform(test), index=test.index, columns=test.columns)
'''----------------------------------------
1.4 Train the model
------------------------------------------'''
best_score = 0
for gamma in [0.001,0.01,0.1,1,10,100]:
    for C in [0.001,0.01,0.1,1,10,100]:
        model = SVC(gamma=gamma,C=C)
        model.fit(train, y_train)
        score = model.score(val, y_val)  # tune on the validation set, not the test set
        if score > best_score:  # keep the best-performing parameters
            best_score = score
            best_parameters = {'gamma':gamma,'C':C}
'''--------------------------------------
1.5 Refit with the best parameters and evaluate on the test set
---------------------------------------------'''
model = SVC(**best_parameters)     # build a new model with the best parameters
model.fit(train_val, y_train_val)  # train on the training + validation data
test_score = model.score(test, y_test)
print('-------------------------------')
print('best validation score: {}'.format(best_score))
print('test score: {}'.format(test_score))
print('best parameters: {}'.format(best_parameters))
  • Note the data standardization: after standardizing, the accuracy improves noticeably (see the pipeline sketch below)
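A minimal, standalone sketch (on toy data, not the post's dataset) of the same "fit the scaler on the training data only" pattern, wrapped in an sklearn Pipeline so the scaling and the model are handled as one estimator; the parameter values are illustrative only:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=666)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=666)

pipe = make_pipeline(StandardScaler(), SVC(gamma=0.1, C=10))  # illustrative parameters
pipe.fit(X_tr, y_tr)           # the scaler is fit on the training fold only
print(pipe.score(X_te, y_te))  # the same fitted scaler then transforms the test fold

Bundling the scaler into the pipeline makes it hard to accidentally fit it on the test data.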

2. Cross-validation

  • Cross-validation

    1. Partition the dataset D into k mutually exclusive subsets of similar size, i.e. $D = D_1 \cup D_2 \cup \ldots \cup D_k$, $D_i \cap D_j = \emptyset\ (i \ne j)$
    2. Each round, use the union of k-1 subsets as the training set and the remaining subset as the test set
      • This yields k train/test pairs, and the result returned is the mean of the k test scores
      • This procedure is commonly called "k-fold cross-validation" (a small KFold sketch follows after this list)
  • Leave-one-out

    • Suppose the dataset D contains m samples; letting k = m gives the leave-one-out method, where each fold holds exactly one sample
    • Leave-one-out is not affected by how the samples are randomly partitioned
  • Grid Search with cross validation
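Before the full tuning loop, a minimal sketch (on a toy array, not the post's data) of how KFold produces the k mutually exclusive folds described above; cross_val_score performs this splitting internally:

import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(10).reshape(10, 1)   # 10 samples, illustrative only
kf = KFold(n_splits=5)                  # k = 5
for train_idx, test_idx in kf.split(X_toy):
    print(train_idx, test_idx)          # every sample lands in exactly one test fold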

from sklearn.model_selection import cross_val_score
'''-----------------------------------------
1.4 Train the model (grid search with cross-validation)
------------------------------------------'''
best_score = 0
for gamma in [0.001,0.01,0.1,1,10,100]:
    for C in [0.001,0.01,0.1,1,10,100]:
        model = SVC(gamma=gamma,C=C)
        scores = cross_val_score(model, train, y_train, cv=5)
        score = scores.mean()
        if score > best_score:  # keep the best-performing parameters
            best_score = score
            best_parameters = {'gamma':gamma,'C':C}
'''--------------------------------------
1.5 Refit with the best parameters and evaluate
---------------------------------------------'''
model = SVC(**best_parameters)  # rebuild the model with the best parameters
model.fit(train,y_train)
test_score = model.score(test,y_test)
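The manual double loop above can also be written with sklearn's built-in GridSearchCV, which combines the grid enumeration, the cross-validation, and the final refit. A minimal sketch, assuming the train/y_train and test/y_test frames prepared earlier:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {'gamma': [0.001, 0.01, 0.1, 1, 10, 100],
              'C':     [0.001, 0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(SVC(), param_grid, cv=5)   # 5-fold CV over all 36 combinations
grid.fit(train, y_train)                       # search on the training set only
print(grid.best_params_, grid.best_score_)     # best combination and its mean CV score
print(grid.score(test, y_test))                # best model refit on all training data, scored on test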

References

调参必备–Girdsearch网格搜索
