Tuning Methods: Hyperparameter Optimization


thinker_1120


Category: Algorithm Implementation

Article link: https://blog.csdn.net/cymy001/article/details/99682645


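Model parameters fall into two kinds: (1) parameters learned from data during training, and (2) parameters fixed before training starts that cannot be learned during training, such as the learning rate or the number of hidden layers.
Parameters of the second kind are called hyperparameters. Tuning them can improve model performance on an independent dataset; cross-validation is the usual way to estimate generalization performance under different hyperparameter settings.
The sklearn package provides two methods for searching over candidate parameters: (1) GridSearchCV and (2) RandomizedSearchCV.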
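To make the cross-validation idea concrete, here is a minimal sketch that scores a few values of a single hyperparameter with `cross_val_score`. The candidate values, the fold count, and the names `candidate_leaf_sizes` and `cv_auc` are assumptions chosen for illustration; they do not come from the original post.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Hypothetical candidates for one hyperparameter (illustrative values only).
candidate_leaf_sizes = [1, 5, 20]

cv_auc = {}
for leaf in candidate_leaf_sizes:
    model = RandomForestClassifier(
        n_estimators=20, min_samples_leaf=leaf, random_state=0
    )
    # 5-fold cross-validation: each fold is scored on data
    # the model did not see while it was being fitted.
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    cv_auc[leaf] = scores.mean()

# Keep the value with the highest mean validation AUC.
best_leaf = max(cv_auc, key=cv_auc.get)
print(cv_auc, best_leaf)
```

GridSearchCV and RandomizedSearchCV automate this loop over several hyperparameters at once.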

GridSearchCV

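GridSearchCV tunes the hyperparameters of an estimator (a classifier here), scoring each candidate on validation data that was not used during fitting.
This method suits small datasets. For large datasets with many parameter combinations, coordinate descent is worth trying: greedily pick the parameter that affects overall model performance the most, tune it until the model is optimal along that parameter, then move on to the next most influential parameter, and so on until every parameter has been tuned. In each round of coordinate selection, a line search is performed along each coordinate direction. When choosing a parameter, make sure its effect on overall model performance is monotonic or nearly monotonic.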
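The coordinate-descent procedure can be sketched as follows. This is only an illustration under stated assumptions: the search space, the tuning order (assumed to run from most to least influential parameter), and the helper `cv_score` are hypothetical and not taken from the original post.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Hypothetical search space; dict order is the assumed tuning order.
search_space = {
    "min_samples_leaf": [1, 4, 10],
    "min_samples_split": [2, 8],
    "n_estimators": [10, 30],
}
# Start from the first candidate of every parameter.
current = {name: values[0] for name, values in search_space.items()}


def cv_score(params):
    """Mean 3-fold cross-validated AUC for one parameter setting."""
    model = RandomForestClassifier(random_state=0, **params)
    return cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean()


best_score = cv_score(current)
for name, values in search_space.items():
    # Line search along one coordinate; all other parameters stay fixed.
    for value in values:
        if value == current[name]:
            continue  # already evaluated
        trial = {**current, name: value}
        score = cv_score(trial)
        if score > best_score:
            best_score, current = score, trial

print(current, best_score)
```

With this space the loop evaluates 5 settings instead of the 12 a full grid would need, at the cost of possibly missing interactions between parameters.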

import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

cancer=datasets.load_breast_cancer()
X=cancer.data
y=cancer.target
print(X.shape, y.shape)
# O:(569, 30) (569,)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

from sklearn.model_selection import GridSearchCV
param_grid={'n_estimators':[40,45,50],
            'min_samples_split':[2,3,5,8,11],
            'min_samples_leaf':[2,4,6,8,10],
            'criterion':['gini','entropy']}
# model
clf = RandomForestClassifier()
# GridSearchCV initializer
grid_search = GridSearchCV(clf, param_grid = param_grid, scoring = 'roc_auc')

grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
# O:{'criterion': 'gini', 'min_samples_leaf': 4, 'min_samples_split': 8, 'n_estimators': 40}

grid_pred = grid_search.predict_proba(X_test)
# calculate AUC
grid_pred_score = pd.DataFrame(grid_pred, columns=grid_search.classes_.tolist())[1].values
grid_auc = roc_auc_score(y_test, grid_pred_score)
print('AUC score of GridSearchCV: %f' % grid_auc)
# O:AUC score of GridSearchCV: 0.987059

RandomizedSearchCV

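RandomizedSearchCV draws parameter values from probability distributions that describe how likely each value of a parameter is. The search requires the parameter n_iter, the total number of parameter combinations to sample (that is, the number of iterations). Each parameter can be given either as a distribution over values or as a discrete list of values; a discrete list is sampled uniformly.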
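To see what such draws look like before running a full search, `ParameterSampler` from scikit-learn can sample from the same kind of search space. This is a small sketch; the two-parameter space and the choice of `n_iter=5` are assumptions for illustration.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# Distributions and discrete lists can be mixed in one search space.
param_dist = {
    # Integers from 2 to 14; scipy's randint excludes the upper bound.
    "min_samples_split": randint(2, 15),
    # Entries of a list are sampled uniformly.
    "criterion": ["gini", "entropy"],
}

# Each draw is independent of the previous ones.
samples = list(ParameterSampler(param_dist, n_iter=5, random_state=0))
for params in samples:
    print(params)
```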

from scipy.stats import randint as sp_randint
from sklearn.model_selection import RandomizedSearchCV
param_dist={'n_estimators':[40,45,50],
            'min_samples_split':sp_randint(2,15),
            'min_samples_leaf':sp_randint(2,15),
            'criterion':['gini','entropy']}
# iter loops for RandomizedSearchCV
search_n_iter = 10
# model
clf = RandomForestClassifier()
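# RandomizedSearchCV initializer
# Note: no scoring is given, so candidates are ranked by the estimator's default score (accuracy);
# pass scoring='roc_auc' to select on the same metric as the grid search above.
random_search = RandomizedSearchCV(clf, param_distributions = param_dist, n_iter= search_n_iter)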

random_search.fit(X_train, y_train)
print(random_search.best_params_)
# O:{'criterion': 'gini', 'min_samples_leaf': 10, 'min_samples_split': 5, 'n_estimators': 40}

random_pred = random_search.predict_proba(X_test)
# calculate AUC
random_pred_score = pd.DataFrame(random_pred, columns=random_search.classes_.tolist())[1].values
random_auc = roc_auc_score(y_test, random_pred_score)
print('AUC score of RandomizedSearchCV: %f' % random_auc)
# O:AUC score of RandomizedSearchCV: 0.984126

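(1) GridSearchCV takes manually specified parameter ranges, exhaustively evaluates every parameter combination within them, and returns the parameters of the best-performing model as the optimum. In practice, to save time, one tunes coarsely first and then fine-tunes. When the objective function is non-convex, this easily gets stuck in a local optimum. (2) RandomizedSearchCV saves computation time and runs more efficiently, but each sampled parameter combination is independent of the others, so no prior knowledge from earlier samples informs the next one. Bayesian optimization addresses this problem.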
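The cost difference can be checked by counting candidates. The sketch below uses the grid from the GridSearchCV example and the n_iter value from the RandomizedSearchCV example; the fold count of 5 is an assumption (it is the default in recent scikit-learn releases, while older releases used 3), and the final refit on the full training set is not counted.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "n_estimators": [40, 45, 50],
    "min_samples_split": [2, 3, 5, 8, 11],
    "min_samples_leaf": [2, 4, 6, 8, 10],
    "criterion": ["gini", "entropy"],
}

n_candidates = len(ParameterGrid(param_grid))  # 3 * 5 * 5 * 2
cv_folds = 5  # assumed fold count
n_iter = 10   # candidates sampled by the random search

print(n_candidates)             # 150
print(n_candidates * cv_folds)  # 750 model fits for the grid search
print(n_iter * cv_folds)        # 50 model fits for the random search
```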

Bayesian Optimization

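Bayesian optimization: (1) first assume a prior distribution model for the objective function $f(x)$, usually a Gaussian process; (2) use an acquisition function to choose the next point to sample, then use the result to update the model. The acquisition function works on the E&E (exploration & exploitation) principle. Exploration samples in regions that have not been sampled yet; exploitation samples where the current posterior distribution suggests the global optimum (the best parameter combination) lies.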

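Algorithm:
For $t = 1, 2, \ldots, T$:

Find $x_t = \arg\max_x U(x; D_{t-1})$ by optimizing the acquisition function over the Gaussian process

Query the objective function at $x_t$ to obtain $y_t$

Add the data to the dataset: $D_t = \{D_{t-1}, (x_t, y_t)\}$
Update the Gaussian process (GP)

End loop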
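The loop above can be sketched with scikit-learn's GaussianProcessRegressor. Several pieces here are assumptions, since the post does not fix them: the 1-D toy function `objective` stands in for a cross-validated score, expected improvement plays the role of the acquisition function $U$, the kernel is a Matern kernel, and the acquisition function is maximized over a fixed grid of candidates rather than with a numerical optimizer.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern


def objective(x):
    # Hypothetical 1-D objective to maximize (assumption for illustration).
    return -(x - 2.0) ** 2 + np.sin(5.0 * x)


def expected_improvement(candidates, gp, best_y, xi=0.01):
    # Large where the posterior mean is high (exploitation)
    # or where the posterior std is high (exploration).
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)  # avoid division by zero
    z = (mu - best_y - xi) / sigma
    return (mu - best_y - xi) * norm.cdf(z) + sigma * norm.pdf(z)


rng = np.random.default_rng(0)
candidates = np.linspace(0.0, 4.0, 401).reshape(-1, 1)  # search domain

# D_0: a few random evaluations to initialize the surrogate model.
X_obs = rng.uniform(0.0, 4.0, size=(3, 1))
y_obs = objective(X_obs).ravel()

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6, normalize_y=True)
T = 15
for t in range(T):
    gp.fit(X_obs, y_obs)  # update the GP with D_{t-1}
    ei = expected_improvement(candidates, gp, y_obs.max())
    x_next = candidates[np.argmax(ei)].reshape(1, -1)  # x_t = argmax U(x; D_{t-1})
    y_next = objective(x_next).ravel()                 # query the objective for y_t
    X_obs = np.vstack([X_obs, x_next])                 # D_t adds (x_t, y_t)
    y_obs = np.concatenate([y_obs, y_next])

best_x = X_obs[np.argmax(y_obs), 0]
print(best_x, y_obs.max())
```

Unlike random search, each new point depends on all earlier evaluations through the fitted GP. For real hyperparameter tuning, `objective` would be replaced by a function that trains a model and returns its cross-validated score.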
