Data Mining: Grid Search Parameter Tuning with GridSearchCV

马╮(╯▽╰)╭霜

Category: Data Mining. Tags: machine learning, python

Article link: https://blog.csdn.net/weixin_43212941/article/details/109447510

1. Parameters
2. Attributes
3. Common methods
4. Example

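Grid search with cross-validation is a commonly used parameter tuning method, so scikit-learn provides the GridSearchCV class, which implements it in the form of an estimator. To use GridSearchCV, you first specify the parameters to search with a dictionary; GridSearchCV then performs all the necessary model fits. The keys of the dictionary are the names of the parameters to tune (as given when constructing the model; in this example, C and gamma), and the values are the parameter settings to try.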

sklearn.model_selection.GridSearchCV(estimator, param_grid, *, scoring=None, n_jobs=None, iid='deprecated', refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score=nan, return_train_score=False)

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV

1. Parameters

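estimator: estimator object
The estimator to tune, constructed with every parameter other than the ones being searched. Either the estimator must provide a score method, or the scoring parameter must be passed to GridSearchCV.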

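param_grid: dict or list of dictionaries
The values to try for the parameters being optimized. Either a dictionary mapping parameter names to lists of candidate values, or a list of such dictionaries, in which case the grid spanned by each dictionary is explored.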
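To make the two accepted forms concrete, here is a minimal sketch that uses ParameterGrid from sklearn.model_selection to count the candidates each form expands to. The kernel-specific grid and all values are illustrative assumptions, not part of the original example.

```python
from sklearn.model_selection import ParameterGrid

# Form 1: a single dict. Every combination of C and gamma is a candidate.
single_grid = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1]}

# Form 2: a list of dicts. Each dict is expanded on its own
# (illustrative values: gamma is only searched for the RBF kernel).
list_grid = [
    {'kernel': ['linear'], 'C': [0.1, 1, 10]},
    {'kernel': ['rbf'], 'C': [0.1, 1, 10], 'gamma': [0.01, 0.1]},
]

n_single = len(ParameterGrid(single_grid))  # 3 * 2 candidates
n_list = len(ParameterGrid(list_grid))      # 3 + 3 * 2 candidates
print(n_single, n_list)
```

The list form is useful when some parameters only make sense together, since it avoids fitting combinations that would be redundant.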

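scoring: str, callable, list/tuple or dict, default=None
The model evaluation criterion. The default is None, in which case the estimator's own score method is used. Otherwise pass a string naming a predefined scorer (for example scoring='roc_auc'; the appropriate metric depends on the model), or a callable with the signature scorer(estimator, X, y).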
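The string and callable forms of scoring can be compared side by side. The following is a sketch under assumptions: the macro-averaged F1 metric, the iris data, and the small grid are chosen here only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {'C': [0.1, 1, 10]}  # illustrative grid

# scoring given as a string that names a predefined scorer
by_name = GridSearchCV(SVC(), param_grid, scoring='f1_macro', cv=5).fit(X, y)

# scoring given as a callable with signature scorer(estimator, X, y),
# built here from a metric function with make_scorer
macro_f1 = make_scorer(f1_score, average='macro')
by_callable = GridSearchCV(SVC(), param_grid, scoring=macro_f1, cv=5).fit(X, y)

print(by_name.best_params_, by_name.best_score_)
print(by_callable.best_params_, by_callable.best_score_)
```

Both searches use the same metric and the same deterministic folds, so they should select the same parameters.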

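n_jobs: int, default=None
Number of jobs to run in parallel. -1 uses all CPU cores; the default None means 1 job (unless inside a joblib.parallel_backend context).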

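pre_dispatch: int, or str, default='2*n_jobs'
Controls the number of jobs that get dispatched during parallel execution. When n_jobs is greater than 1, the data is copied for every dispatched job, which can exhaust memory (OOM). Setting pre_dispatch limits how many jobs are dispatched up front, so the data is copied at most pre_dispatch times.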

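iid: bool, default=False (deprecated)
If True, the data is assumed to be identically distributed across folds, and the reported score is the average across folds weighted by the number of samples in each test set, rather than the plain mean over folds. The signature above marks this parameter as deprecated (iid='deprecated'): it was deprecated in scikit-learn 0.22 and removed in 0.24.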

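cv: the cross-validation splitting strategy, default=None. None uses the default 5-fold cross-validation (the default was 3-fold before scikit-learn 0.22). An integer specifies the number of folds; a CV splitter object or an iterable yielding (train, test) index splits can also be passed.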
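As a brief sketch of the integer and splitter forms of cv, the example below passes a fold count and then a StratifiedKFold object. The fold counts, the shuffling, and the grid are illustrative choices, not values from the original post.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {'C': [0.1, 1, 10]}  # illustrative grid

# cv as an integer: the number of folds
int_cv = GridSearchCV(SVC(), param_grid, cv=3).fit(X, y)

# cv as a splitter object: a shuffled, stratified 4-fold split
splitter = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
obj_cv = GridSearchCV(SVC(), param_grid, cv=splitter).fit(X, y)

# n_splits_ reports how many splits were actually used
print(int_cv.n_splits_, obj_cv.n_splits_)
```

A splitter object is the usual route when you need control over shuffling or the random seed, which a plain integer does not offer.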

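refit: bool, str, or callable, default=True
With the default True, once the search has finished, an estimator is refit with the best parameters found on the whole dataset passed to fit. The refit estimator is stored as best_estimator_ and is the model used for the final performance evaluation and for prediction.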
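The effect of refit can be checked directly on a fitted search object. This is a minimal sketch, assuming single-metric scoring and an illustrative grid on the iris data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {'C': [0.1, 1, 10]}  # illustrative grid

# Default: the best candidate is refit on all of X, y after the search
refit_search = GridSearchCV(SVC(), param_grid, cv=5, refit=True).fit(X, y)

# refit=False: the search still runs, but no final model is trained
no_refit_search = GridSearchCV(SVC(), param_grid, cv=5, refit=False).fit(X, y)

print(hasattr(refit_search, 'best_estimator_'))
print(hasattr(no_refit_search, 'best_estimator_'))
# With a single metric, best_params_ is available either way
print(no_refit_search.best_params_)
```

Skipping the refit saves one model fit, at the cost of not being able to call predict on the search object.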

verbose: integer
Verbosity of the log output. 0: no output during training; 1: occasional output; >1: output for every sub-model.

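error_score: 'raise' or numeric, default=np.nan
Value to assign to the score if an error occurs while fitting the estimator. If set to 'raise', the error is raised. If a numeric value is given, a FitFailedWarning is issued. This parameter does not affect the refit step, which always raises the error.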

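return_train_score: bool, default=False
If False, the cv_results_ attribute will not include training scores. Training scores are used to see how different parameter settings affect the overfitting/underfitting trade-off. However, computing scores on the training set can be computationally expensive, and it is not strictly required in order to select the parameters that yield the best generalization performance.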
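To illustrate what training scores add, the sketch below requests them and prints the mean train and test scores per candidate. The gamma values are illustrative assumptions, picked to span a wide range.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# return_train_score=True adds *_train_score columns to cv_results_
search = GridSearchCV(SVC(), {'gamma': [0.01, 1, 100]}, cv=5,
                      return_train_score=True).fit(X, y)
results = search.cv_results_

# A large gap between train and test score suggests overfitting
for params, train, test in zip(results['params'],
                               results['mean_train_score'],
                               results['mean_test_score']):
    print(params, round(train, 3), round(test, 3))
```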

2. Attributes

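(1) cv_results_: dict of numpy (masked) ndarrays
A dict with column headers as keys and columns as values, which can be imported into a pandas DataFrame. Note that the 'params' key stores a list of the parameter settings of all candidates.

(2) best_estimator_: estimator
The model corresponding to the best parameters, trained on the whole training set. Not available if refit=False.

(3) best_score_: float. The mean cross-validated score of best_estimator_.
(4) best_params_: dict. The parameters of the best model.
(5) best_index_: int. The index (into the cv_results_ arrays) of the best candidate parameter setting.
The dict at search.cv_results_['params'][search.best_index_] gives the parameter setting of the best model, the one with the highest mean score (search.best_score_).
(6) scorer_: scorer function used on the held-out data to choose the best parameters for the model.
(7) n_splits_: int. The number of cross-validation splits (folds/iterations).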
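The relationship between cv_results_, best_index_, best_params_ and best_score_ can be verified with a short sketch. It assumes pandas is installed and uses an illustrative grid on the iris data.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1]}  # illustrative grid
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)

# cv_results_ is a dict of arrays, one row per candidate once tabulated
results = pd.DataFrame(search.cv_results_)
print(results[['param_C', 'param_gamma', 'mean_test_score', 'rank_test_score']])

# best_index_ points at the row of the winning candidate
best_row = results.loc[search.best_index_]
print(best_row['params'] == search.best_params_)
print(best_row['mean_test_score'] == search.best_score_)
```

Sorting the DataFrame by rank_test_score is a convenient way to inspect the runners-up as well as the winner.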

3. Common methods

fit(X[, y, groups])
Run fit with all sets of parameters.

get_params([deep])
Get parameters for this estimator.

predict(X)
Call predict on the estimator with the best found parameters.

predict_log_proba(X)
Call predict_log_proba on the estimator with the best found parameters.

predict_proba(X)
Call predict_proba on the estimator with the best found parameters.

score(X[, y])
Returns the score on the given data, if the estimator has been refit.

set_params(**params)
Set the parameters of this estimator.

transform(X)
Call transform on the estimator with the best found parameters.
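As the descriptions above indicate, predict and score on the search object are delegated to the refit best estimator. The sketch below checks this, assuming the default refit=True and scoring=None; the grid is an illustrative choice.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

search = GridSearchCV(SVC(), {'C': [0.1, 1, 10]}, cv=5).fit(X_train, y_train)
best = search.best_estimator_

# Predictions from the search object match those of best_estimator_
same_pred = np.array_equal(search.predict(X_test), best.predict(X_test))

# With scoring=None, score() falls back to the estimator's own score method
score_gap = abs(search.score(X_test, y_test) - best.score(X_test, y_test))
print(same_pred, score_gap)
```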

4. Example

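The grid_search object we create behaves just like a classifier: we can call the standard fit, predict and score methods on it. When we call fit, however, it runs cross-validation for every parameter combination specified in param_grid: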

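from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()
# Candidate values for the SVC parameters C and gamma (6 x 6 = 36 combinations)
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100],
              'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}
grid_search = GridSearchCV(SVC(), param_grid, cv=5)
# Hold out a test set that the grid search never sees
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)
# Run 5-fold cross-validation for every combination, then refit on X_train
grid_search.fit(X_train, y_train)
# Accuracy of the refit best model on the held-out test set
print(grid_search.score(X_test, y_test))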

0.9736842105263158

grid_search.cv_results_.keys()

dict_keys(['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time', 'param_C', 'param_gamma', 'params', 'split0_test_score', 'split1_test_score', 'split2_test_score', 'split3_test_score', 'split4_test_score', 'mean_test_score', 'std_test_score', 'rank_test_score', 'split0_train_score', 'split1_train_score', 'split2_train_score', 'split3_train_score', 'split4_train_score', 'mean_train_score', 'std_train_score'])

Note: in scikit-learn versions where return_train_score defaults to False, the *_train_score keys are only present if the search is created with return_train_score=True.

print(grid_search.best_estimator_)
print(grid_search.best_params_)
print(grid_search.best_score_)

SVC(C=100, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma=0.01, kernel='rbf', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)

{'C': 100, 'gamma': 0.01}

0.9732142857142857

print(grid_search.best_index_)
print(grid_search.n_splits_)

31
5

grid_search.scorer_

grid_search.get_params()
