Hyperparameter Tuning with GridSearchCV for a Custom Function


erdaidai

Article link: https://blog.csdn.net/erdaidai/article/details/112305154

GridSearchCV and RandomizedSearchCV

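1) GridSearchCV is sklearn's tool for automatic parameter tuning: you supply the candidate values of the parameters you want to tune, and it returns the combination that works best for the model. It works by grid search, that is, exhaustive search: it loops over every combination of the candidate values and keeps the one that performs best.

In principle this is like finding the maximum of an array. (Why is it called grid search? Take a model with two parameters, where parameter a has 3 candidate values and parameter b has 4. Listing every combination gives a $3 \times 4$ table in which each cell is one grid point, and the loop visits and evaluates every cell in turn, hence the name grid search.)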
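To make the "maximum of an array" picture concrete, here is a minimal sketch of the 3 x 4 grid from the paragraph above. The candidate values and the toy_score function are made up for illustration; in a real search the score would come from cross-validation.

```python
from itertools import product

# Hypothetical candidate values: 3 for parameter a, 4 for parameter b.
a_values = [0.1, 1.0, 10.0]
b_values = [1, 2, 3, 4]


def toy_score(a, b):
    """Stand-in for a cross-validated score; it peaks at a=1.0, b=3."""
    return -((a - 1.0) ** 2) - (b - 3) ** 2


# Every (a, b) pair is one cell of the 3 x 4 grid.
grid = list(product(a_values, b_values))

# Exhaustive search is just "find the maximum" over all cells.
best_a, best_b = max(grid, key=lambda cell: toy_score(*cell))

print(len(grid), "cells; best cell:", best_a, best_b)
```

The cost is one evaluation per cell, so it grows with the product of the number of candidates per parameter.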

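CV stands for cross-validation: each candidate is scored on several train/validation splits and the scores are averaged, which guards against a result that is good only by chance. It is a standard technique in experiments generally.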

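GridSearchCV has one problem: it is only practical for small datasets or for models with few parameters, because the number of combinations is the product of the number of candidates per parameter. With many parameters or a lot of data you need another approach, and one quick option is coordinate descent. It is a greedy algorithm: tune the parameter that currently affects the model most until it is optimal, then tune the next most influential parameter, and so on until every parameter has been tuned. The drawback is that it may end up in a local optimum rather than the global one, but the savings in time and effort are large enough that it is worth a try, and the result can be improved further afterwards, for example with bagging.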
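The greedy one-parameter-at-a-time idea can be sketched as follows. This is only an illustration under assumptions: coordinate_search and toy_score are hypothetical names, the score is a toy function rather than a cross-validated model score, and the order of the dict keys stands in for "most influential parameter first".

```python
def coordinate_search(score, grid, start):
    """Greedy search that tunes one parameter at a time (hypothetical helper).

    score: callable taking a dict of parameters and returning a number
    grid:  dict mapping parameter name -> list of candidate values
    start: dict holding an initial value for every parameter
    """
    current = dict(start)
    n_evals = 0
    for name, candidates in grid.items():
        scored = []
        for value in candidates:
            # Vary only `name`; every other parameter keeps its current value.
            trial = {**current, name: value}
            scored.append((score(trial), value))
            n_evals += 1
        # Freeze this parameter at its best value before moving on.
        current[name] = max(scored, key=lambda pair: pair[0])[1]
    return current, n_evals


def toy_score(p):
    """Toy objective with its maximum at a=2, b=3, c=1."""
    return -((p["a"] - 2) ** 2) - (p["b"] - 3) ** 2 - (p["c"] - 1) ** 2


grid = {"a": list(range(5)), "b": list(range(5)), "c": list(range(5))}
best, n_evals = coordinate_search(toy_score, grid, {"a": 0, "b": 0, "c": 0})

print(best, "found with", n_evals, "evaluations instead of", 5 ** 3)
```

On this toy objective the parameters do not interact, so the greedy pass reaches the global optimum; when parameters interact, it can stop at a local optimum, as noted above.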

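2) Another option is RandomizedSearchCV, which samples a few dozen to a few hundred random points in the hyperparameter space, some of which are likely to give a low loss. This is faster than searching a sparser grid, and experiments suggest that random search gives slightly better results than a sparse grid.

RandomizedSearchCV is used in much the same way as the GridSearchCV class, but instead of trying every possible combination it evaluates a fixed number of random combinations, drawing a random value for each hyperparameter every time. This approach has two advantages:

If you let the random search run for, say, 1000 iterations, it explores 1000 different values of each hyperparameter (rather than only the few values per hyperparameter that a grid search covers).
You can control the computational cost of the search simply by setting the number of iterations.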

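RandomizedSearchCV has the same interface as GridSearchCV, but it replaces the sweep over a grid with random sampling from the parameter space. For a parameter that is continuous, RandomizedSearchCV samples from a distribution, which grid search cannot do. How thoroughly it searches depends on the n_iter setting.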
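As a rough illustration of sampling continuous parameters from a distribution, the sketch below tunes an SVC on the iris data. The ranges, n_iter=20 and the use of scipy.stats.loguniform are my own choices for the example, not values from the original experiments.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Continuous hyperparameters are given as distributions, not as lists.
param_distributions = {
    "C": loguniform(1e-2, 1e2),
    "gamma": loguniform(1e-4, 1e1),
}

search = RandomizedSearchCV(
    SVC(kernel="rbf"),
    param_distributions,
    n_iter=20,       # number of parameter settings that are sampled
    cv=3,
    random_state=0,  # makes the sampling reproducible
)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

Raising n_iter widens the search at a proportional cost, since each sampled setting is fitted once per fold.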

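Using GridSearchCV in a Custom Function

I wanted automatic parameter search for my own method, so that I could write a few fewer nested loops. But after a long search online, almost everything I found applies it to ready-made estimators such as xgboost or the SVM in sklearn. Those are already packaged and convenient to use. For example, to tune the kernel-width parameter $\sigma$ (set through gamma in sklearn) and $C$ of an SVM, we can simply do this: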

from sklearn.datasets import load_iris
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, accuracy_score

iris_data = load_iris()

X_trainval, X_test, y_trainval, y_test = train_test_split(iris_data.data, iris_data.target, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, random_state=1)

clf = svm.SVC(kernel='rbf', C=1)

# The two hyperparameters gamma and C each have 3 candidate values, so there are 9 combinations in total
# numpy.linspace creates an arithmetic sequence, numpy.logspace a geometric one
# In logspace, the start and end points are powers of 10
# e.g. logspace(-2, 1, 4) is a 4-element geometric sequence from 10^-2 to 10^1, i.e. 10
# Keys need a prefix only when the estimator is a Pipeline: the prefix is the step name
# defined in the Pipeline, joined to the parameter name by <two> underscores, e.g. 'svc__C'
# clf here is a plain SVC, so the keys are the bare parameter names
parameters = {'gamma': np.logspace(-5, 5, 3), 'C': np.logspace(-1, 1, 3)}

scoring_fnc = make_scorer(accuracy_score)
# GridSearchCV arguments:
# 1. estimator: the estimator object
# 2. param_grid: dict or list of dictionaries
# 3. verbose: controls the verbosity; the higher, the more messages
# 4. refit: default=True, refit the best estimator with the entire dataset
# 5. cv: int or cross-validation generator; here it means 3-fold cross-validation
# scoring must be passed by keyword in current scikit-learn releases
gs = GridSearchCV(clf, parameters, scoring=scoring_fnc, verbose=2, refit=True, cv=3)

# Run the grid search in a single process
gs.fit(X_train, y_train)

print(gs.best_params_, gs.best_score_)

# Finally, print the accuracy of the best model on the test set
print('the accuracy of best model in test set is', gs.score(X_test, y_test))
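The comments in the example above mention a step-name prefix joined by two underscores. That naming applies when the estimator handed to GridSearchCV is a Pipeline. A minimal sketch, assuming a two-step pipeline whose step names "scaler" and "svc" are chosen here purely for illustration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# The step name "svc" becomes the prefix for that step's parameters.
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC(kernel="rbf"))])

param_grid = {
    "svc__C": np.logspace(-1, 1, 3),
    "svc__gamma": np.logspace(-3, 1, 3),
}

gs = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=3)
gs.fit(X, y)

print(gs.best_params_, gs.best_score_)
```

With a bare SVC the keys would simply be "C" and "gamma".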

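GridSearchCV Parameters

So how do we use GridSearchCV with our own method? Let's first look at the GridSearchCV code above and go through its parameters in detail. The signature below comes from an older scikit-learn release (around 0.20); later releases removed fit_params and iid, and cv=None now means 5-fold cross-validation.

class sklearn.model_selection.GridSearchCV(estimator, param_grid, scoring=None,
fit_params=None, n_jobs=None, iid='warn', refit=True, cv='warn', verbose=0,
pre_dispatch='2*n_jobs', error_score='raise-deprecating', return_train_score='warn')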

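estimator: the estimator to tune, constructed with every parameter except the ones being searched. The estimator must either provide a score method or be used together with the scoring argument. Example: estimator = RandomForestClassifier(min_samples_split=100, min_samples_leaf=20, max_depth=8, max_features='sqrt', random_state=10)
param_grid: the candidate values of the parameters to optimize, as a dict or a list of dicts. Example: param_test1 = {'n_estimators': range(10, 71, 10)}, then param_grid = param_test1
scoring = None: the evaluation metric. With the default None, the estimator's own score method is used. Otherwise pass a metric name as a string, such as scoring = 'roc_auc' (which metrics make sense depends on the model), or a callable with the signature scorer(estimator, X, y).
fit_params = None: extra arguments passed on to the estimator's fit method.
n_jobs = None: the number of parallel jobs. An int gives the number of jobs, -1 uses as many jobs as there are CPU cores, and the default None means 1.
iid = True: when True, the data are assumed to be identically distributed across folds, and the reported score is the average over all samples (each fold weighted by its size) rather than the plain mean of the fold scores.
refit = True: with the default True, once the search has finished the estimator is refit with the best parameters on the whole dataset passed to fit (training and validation folds together), and that refit model is the one used for final evaluation and prediction.
cv = None: the cross-validation strategy. The default None meant 3-fold cross-validation in older releases (5-fold since scikit-learn 0.22). It can be an int giving the number of folds, or a generator that yields train/test splits.
verbose = 0: the logging verbosity, an int. 0 prints nothing during training, 1 prints occasionally, and values above 1 print for every sub-model.
pre_dispatch = '2*n_jobs': the total number of jobs dispatched for parallel execution. When n_jobs is greater than 1 the data are copied for every dispatched job, which can cause out-of-memory errors; setting pre_dispatch caps the number of jobs dispatched in advance, so that the data are copied at most pre_dispatch times.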
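The scoring entry above says a callable with the signature scorer(estimator, X, y) is accepted. A small sketch of that option follows; accuracy_scorer is a name made up for this example, and it simply recomputes plain accuracy so the result can be compared with the built-in "accuracy" string.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC


def accuracy_scorer(estimator, X, y):
    """Callable scorer: takes a fitted estimator and a fold, returns a number
    where larger means better."""
    predictions = estimator.predict(X)
    return float((predictions == y).mean())


X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1, 10]}

gs = GridSearchCV(SVC(), param_grid, scoring=accuracy_scorer, cv=3, n_jobs=1)
gs.fit(X, y)

# Same search with the built-in metric name, for comparison.
gs_builtin = GridSearchCV(SVC(), param_grid, scoring="accuracy", cv=3, n_jobs=1)
gs_builtin.fit(X, y)

print(gs.best_params_, gs.best_score_, gs_builtin.best_score_)
```

Any custom metric works the same way as long as it returns a single number and larger values are better.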

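Commonly Used GridSearchCV Methods

grid.fit(): runs the grid search
best_params_: the parameter combination that gave the best result
best_score_: the best score observed during the search
cv_results_: the cross-validation results for each parameter combination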

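GridSearchCV Attributes

cv_results_ : dict of numpy (masked) ndarrays
A dict with column headers as keys and columns as values, which can be imported into a DataFrame. Note that the "params" key stores the list of parameter settings for all candidates.
best_estimator_ : estimator
The estimator chosen by the search, i.e. the one that gave the highest score (or the smallest loss, if specified) on the left-out data. Not available if refit = False.
best_score_ : float
The mean cross-validated score of best_estimator_.
best_params_ : dict
The parameter setting that gave the best results on the hold-out data.
best_index_ : int
The index (into the cv_results_ arrays) of the best candidate parameter setting.
The dict at search.cv_results_['params'][search.best_index_] gives the parameter setting of the best model, the one with the highest mean score (search.best_score_).
scorer_ : function
Scorer function used on the held out data to choose the best parameters for the model.
n_splits_ : int
The number of cross-validation splits (folds/iterations).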
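The relationship between best_index_, cv_results_, best_params_ and best_score_ described above can be checked directly. A short sketch on the iris data, with a small grid picked only for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

search = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1]}, cv=3)
search.fit(X, y)

results = search.cv_results_
i = search.best_index_

# Look up the winning candidate by its index in the results arrays.
best_params_by_index = results["params"][i]
best_score_by_index = results["mean_test_score"][i]

print(best_params_by_index, best_score_by_index)
print(search.best_params_, search.best_score_, search.n_splits_)
```

Both print lines report the same parameters and score, once looked up by index and once read from the convenience attributes.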

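Using GridSearchCV with a Custom Function

First, there are some rules to know:

All arguments of __init__ must have default values, so that the classifier can be created simply by typing MyClassifier().
Do not validate input parameters in __init__; validation belongs in fit().
Every argument of __init__ must be stored as an attribute with the same name on the created object.
Do not take data as an argument here! Data belongs in fit().
Every estimator must have get_params and set_params functions. Subclasses of BaseEstimator inherit them, and in that case it is best not to override them, to avoid mistakes.
fit() is where all of the classification work is done. It first checks the parameters (this is where the parameters being tuned are used) and then processes the input data. Any new attribute created by fit() must have a name ending in an underscore "_", for example self.fitted_. For compatibility with the common scikit-learn interface, fit() returns self, so its last statement is return self.
For GridSearch to work we must provide a score() method (unless a scoring argument is passed to the search). Why? GridSearch has to decide which candidate model is better; it looks directly at the result of score() and treats larger as better, so the evaluation metric must be a number.

Here is an example; see the references at the end for more detail.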

from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import GridSearchCV

# Mixins go first so that BaseEstimator comes last in the inheritance order
class MeanClassifier(ClassifierMixin, BaseEstimator):
    """An example of classifier"""

    def __init__(self, intValue=0, stringParam="defaultValue", otherParam=None):
        """
        Called when initializing the classifier
        """
        self.intValue = intValue
        self.stringParam = stringParam

        # Parameters must have the same name as attributes. Storing this one as
        # self.differentParam would be WRONG: get_params() could not find it.
        self.otherParam = otherParam

    def fit(self, X, y=None):
        """
        This should fit classifier. All the "work" should be done here.

        Note: assert is not a good choice here and you should rather
        use try/except block with exceptions. This is just for short syntax.
        """
        self.threshold_ = (sum(X)/len(X)) + self.intValue  # mean + intValue

        return self

    def _meaning(self, x):
        # returns True/False according to fitted classifier
        # notice underscore on the beginning
        return x >= self.threshold_

    def predict(self, X, y=None):
        # threshold_ only exists after fit() has been called
        if not hasattr(self, "threshold_"):
            raise RuntimeError("You must train classifier before predicting data!")

        return [self._meaning(x) for x in X]

    def score(self, X, y=None):
        # counts number of values bigger than mean
        return sum(self.predict(X))

X_train = [i for i in range(0, 100, 5)]
X_test = [i + 3 for i in range(20)]
tuned_params = {"intValue" : [-10, -1, 0, 1, 10]}

gs = GridSearchCV(MeanClassifier(), tuned_params)

# for some reason I have to pass y with same shape
# otherwise gridsearch throws an error. Not sure why.
grid_result = gs.fit(X_train, y=[1 for i in range(20)])

print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

means = grid_result.cv_results_['mean_test_score']
params = grid_result.cv_results_['params']
for mean, param in zip(means, params):
    print("%f  with:   %r" % (mean, param))

In short: put your own classifier logic in fit() and return self at the end. As for the other functions, if you do not inherit from BaseEstimator you must provide score(), get_params() and set_params() yourself. Here is the fixed pattern for writing get_params() and set_params() (for using GridSearchCV without inheritance):

    def get_params(self, deep=False):
        params = {'alpha': self.alpha, 'num_iters': self.num_iters}
        return params

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self

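You can copy this as is and only change the contents of params so that they match your own __init__ arguments; nothing else needs to change. Define score() according to the evaluation metric of your own function, and remember that it must return a number.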
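To see why the keys of params have to match the __init__ arguments, here is a small sketch without any sklearn dependency. TinyEstimator and rebuild are hypothetical names; rebuild imitates, in simplified form, how a search makes a fresh copy of the estimator from get_params() before applying each candidate with set_params().

```python
class TinyEstimator:
    """Hypothetical estimator that does not inherit from BaseEstimator."""

    def __init__(self, alpha=0.01, num_iters=100):
        self.alpha = alpha
        self.num_iters = num_iters

    def get_params(self, deep=True):
        # Keys must be exactly the __init__ argument names.
        return {"alpha": self.alpha, "num_iters": self.num_iters}

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self


def rebuild(estimator):
    """Simplified copy step: construct a new object from get_params()."""
    return type(estimator)(**estimator.get_params())


base = TinyEstimator()
candidate = rebuild(base).set_params(alpha=0.5)

print(base.get_params())
print(candidate.get_params())
```

If get_params() returned a key that __init__ does not accept, the rebuild step would fail with a TypeError, which is the kind of mistake the naming rule prevents.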
Finally, here is the code as used in my own function:

import sys
sys.path.append("..")  # extend the search path before the project-local imports

import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.pyplot import MultipleLocator
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

# project-local modules
from src.parameters_optimization_svdd import SVDD
#from src.single_svdd import SVDD
from src.visualize import Visualization as draw
import src.tool as to
import cross_validation as cr

warnings.filterwarnings("ignore")

class GC_SVDD():
    """
    Wrapper that gives SVDD the fit/score/get_params/set_params
    interface that GridSearchCV expects.
    """
    def __init__(self, alpha=0.01, positive_penalty=0.1, negative_penalty=0.1):

        self.alpha = alpha
        self.positive_penalty = positive_penalty
        self.negative_penalty = negative_penalty

    def fit(self, train_data, train_label):
        """
        SVDD based on granular computing
        :return: self
        """

        # hold out 10% of the data passed to fit() for measuring accuracy
        t_data, te_data, t_label, te_label = train_test_split(train_data, train_label, test_size=0.1,
                                                              random_state=1)

        # gauss width
        temp_gauss_width = self.alpha
        # set SVDD parameters
        parameters = {"positive penalty": self.positive_penalty,
                      "negative penalty": self.negative_penalty,
                      "kernel": {"type": 'gauss', "width": temp_gauss_width},
                      "option": {"display": 'on'}}

        # construct an SVDD model
        svdd = SVDD(parameters)

        # train SVDD model
        para_list, _ = svdd.train(t_data, t_label)
        _, self.accuracy_ = svdd.test(te_data, te_label, para_list)

        return self

    def score(self, test_data, test_label):
        # returns the accuracy measured on the hold-out split inside fit();
        # the data passed in here is not used
        return self.accuracy_

    def get_params(self, deep=False):

        params = {'alpha': self.alpha,
                  'positive_penalty': self.positive_penalty,
                  'negative_penalty': self.negative_penalty
                  }

        return params

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self
        
if __name__ == '__main__':

    data_mat_list = ['iris.mat.csv']

    for i in range(len(data_mat_list)):

        np.random.seed(1)
        temp_path = r".\\data\\sortData\\"
        temp_data_path = temp_path + data_mat_list[i]
        data = pd.read_csv(temp_data_path, header=None)

        trainData, trainLabel = to.load_datasets_excel(data)

        temp_k = 10

        temp_train_index, temp_test_index = cr.cross_validation(trainData.shape[0], temp_k)
        clf = GC_SVDD()

        parameters = {'alpha': np.linspace(1, 10, 3),
                      'negative_penalty': np.linspace(0.1, 1, 2),
                      'positive_penalty': np.linspace(0.1, 1, 2)
                      }

        gs = GridSearchCV(clf, parameters, cv=10)

        grid_result = gs.fit(trainData, trainLabel)

        print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

        # means = grid_result.cv_results_['mean_test_score']
        # params = grid_result.cv_results_['params']
        # for mean, param in zip(means, params):
        #     print("%f  with:   %r" % (mean, param))
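The wrapper above stores an accuracy measured inside fit() and returns it from score(), so the fold that GridSearchCV passes to score() is not used. A possible variation, sketched here with a hypothetical CentroidClassifier instead of SVDD (whose API I do not want to guess at), keeps the fitted state in attributes ending in an underscore and evaluates on whatever data score() receives:

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV


class CentroidClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical nearest-centroid classifier with a tunable distance order p."""

    def __init__(self, p=2.0):
        self.p = p

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        # One centroid per class, kept for predict() and score().
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # Minkowski distance from every sample to every class centroid.
        diffs = np.abs(X[:, None, :] - self.centroids_[None, :, :])
        distances = (diffs ** self.p).sum(axis=2) ** (1.0 / self.p)
        return self.classes_[distances.argmin(axis=1)]

    def score(self, X, y):
        # Accuracy on exactly the data passed in (the held-out fold during search).
        return float(np.mean(self.predict(X) == np.asarray(y)))


X, y = load_iris(return_X_y=True)
gs = GridSearchCV(CentroidClassifier(), {"p": [1.0, 2.0, 3.0]}, cv=5)
gs.fit(X, y)

print(gs.best_params_, gs.best_score_)
```

With this layout the cross-validation folds chosen by GridSearchCV decide the score, so no extra train_test_split is needed inside fit().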

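[References]
[1] https://www.cnblogs.com/wj-1314/p/10422159.html
[2] Fixing a custom class that cannot be used with GridSearchCV
[3] How to use GridSearchCV() to select parameters for a custom model
[4] The reference most worth reading
[5] Developing scikit-learn estimators
[6] What return self does in Python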