Grid Search
When choosing hyperparameters we have two routes: (1) rely on experience; (2) plug parameter values of different sizes into the model and pick whichever performs best. Tuning by hand along route 2 costs far too much attention to be worthwhile, and a for loop (or anything like one) is rigidly nested, not concise or flexible, and error-prone. GridSearchCV, grid search with cross-validation, iterates over every combination of the parameter values passed in and, via cross-validation, returns the evaluation score for each combination.
GridSearchCV sounds sophisticated, but it is really just brute-force search. Note that it is useful on small datasets; once the dataset grows large it is no longer practical.
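The "every combination" behavior can be seen directly with sklearn's ParameterGrid, which GridSearchCV uses internally to enumerate candidates. A minimal sketch, using the same parameter dictionary as the example below:

```python
from sklearn.model_selection import ParameterGrid

# same parameter dictionary as in the example below
parameters = {'n_estimators': [20, 50, 100], 'max_depth': [1, 2, 3]}
candidates = list(ParameterGrid(parameters))
print(len(candidates))  # 3 * 3 = 9 combinations; with cv=3 that means 27 fits
print(candidates[0])    # one combination, as a plain dict
```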
from sklearn.metrics import roc_auc_score
import warnings
warnings.filterwarnings("ignore")
from sklearn.datasets import load_iris # built-in sample dataset
iris = load_iris()
X = iris.data # 150 samples, 4 features
y = iris.target # 150 class labels
# Using a random forest to demonstrate the basic usage
# Exhaustive grid search
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split # data splitting
# Split the data: 80% training, 20% validation
train_data, test_data, train_target, test_target = train_test_split(
X, y, test_size=0.2, random_state=0)
model = RandomForestClassifier()
parameters = {'n_estimators': [20, 50, 100], 'max_depth': [1, 2, 3]}
clf = GridSearchCV(model, parameters, cv=3, verbose=2)
clf.fit(train_data, train_target)
print("Best parameters:")
print(clf.best_params_)
print("Best score:")
print(clf.best_score_)
sorted(clf.cv_results_.keys())
score_test = roc_auc_score(test_target, clf.predict_proba(test_data), multi_class='ovr')
print("RandomForestClassifier GridSearchCV test AUC: ", score_test)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
Fitting 3 folds for each of 9 candidates, totalling 27 fits
[CV] max_depth=1, n_estimators=20 ....................................
[CV] ..................... max_depth=1, n_estimators=20, total= 0.0s
[CV] max_depth=1, n_estimators=20 ....................................
[CV] ..................... max_depth=1, n_estimators=20, total= 0.0s
...
[CV] max_depth=3, n_estimators=100 ...................................
[CV] .................... max_depth=3, n_estimators=100, total= 0.1s
[Parallel(n_jobs=1)]: Done 27 out of 27 | elapsed: 1.7s finished
Best parameters:
{'max_depth': 2, 'n_estimators': 50}
Best score:
0.9583333333333334
RandomForestClassifier GridSearchCV test AUC:  1.0
Process finished with exit code 0
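Besides best_params_ and best_score_, the score of every combination is kept in cv_results_, the dict whose keys are listed by the sorted(clf.cv_results_.keys()) call above. A self-contained sketch (refitting a deliberately smaller grid on the full iris data, just to keep it short) that loads it into a pandas DataFrame for inspection:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
# a 2x2 grid instead of the 3x3 one above, so the run stays quick
clf = GridSearchCV(RandomForestClassifier(random_state=0),
                   {'n_estimators': [20, 50], 'max_depth': [1, 2]},
                   cv=3)
clf.fit(X, y)
# cv_results_ is a dict of arrays, one row per parameter combination
results = pd.DataFrame(clf.cv_results_)
print(results[['param_max_depth', 'param_n_estimators',
               'mean_test_score', 'rank_test_score']])
```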
Random Search
When searching over hyperparameters, grid search, an exhaustive method, works fine if there are only a few of them (three or four, or fewer).
With more hyperparameters, however, the time grid search needs grows exponentially. Hence random search was proposed: sample a few dozen or a few hundred points at random in the hyperparameter space, and some of them are likely to score well. This is faster than thinning out the grid, and experiments show that random search tends to do slightly better than a sparse grid.
RandomizedSearchCV is used much like GridSearchCV, but instead of trying all possible combinations it evaluates a fixed number of random combinations, drawing one random value per hyperparameter each time. This has two advantages: you can evaluate far fewer combinations than the full parameter space contains, while the search, left to run, still explores different values of each hyperparameter; and you can control the computational budget simply by setting the number of search iterations, so adding parameter values does not hurt efficiency. Unlike grid search, for a continuous parameter RandomizedSearchCV can treat it as a distribution and sample from it, something grid search cannot do; its search capacity is governed by the n_iter parameter.
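The distribution-sampling behavior can be sketched as follows; scipy.stats.randint stands in here for any distribution object with an rvs method, and the parameter ranges are illustrative, not taken from the example below:

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)
# distributions instead of fixed lists: each candidate draws fresh values
param_dist = {'n_estimators': randint(10, 100), 'max_depth': randint(1, 4)}
clf = RandomizedSearchCV(RandomForestClassifier(random_state=0), param_dist,
                         n_iter=8, cv=3, random_state=0)  # n_iter caps the budget
clf.fit(X, y)
print(clf.best_params_)  # values sampled from the distributions above
```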
from sklearn.metrics import roc_auc_score
import warnings
warnings.filterwarnings("ignore")
from sklearn.datasets import load_iris # built-in sample dataset
iris = load_iris()
X = iris.data # 150 samples, 4 features
y = iris.target # 150 class labels
# Using a random forest to demonstrate the basic usage
# Randomized parameter optimization
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split # data splitting
# Split the data: 80% training, 20% validation
train_data, test_data, train_target, test_target = train_test_split(
X, y, test_size=0.2, random_state=0)
model = RandomForestClassifier()
parameters = {'n_estimators': [10, 20, 30, 50], 'max_depth': [1, 2, 3]}
clf = RandomizedSearchCV(model, parameters, cv=3, verbose=2)
clf.fit(train_data, train_target)
score_test = roc_auc_score(test_target, clf.predict_proba(test_data), multi_class='ovr')
print("RandomForestClassifier RandomizedSearchCV test AUC: ", score_test)
print("Best parameters:")
print(clf.best_params_)
sorted(clf.cv_results_.keys())
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV] n_estimators=10, max_depth=3 ....................................
[CV] ..................... n_estimators=10, max_depth=3, total= 0.0s
[CV] n_estimators=10, max_depth=3 ....................................
[CV] ..................... n_estimators=10, max_depth=3, total= 0.0s
...
[CV] n_estimators=50, max_depth=1 ....................................
[CV] ..................... n_estimators=50, max_depth=1, total= 0.1s
[Parallel(n_jobs=1)]: Done 30 out of 30 | elapsed: 0.9s finished
RandomForestClassifier RandomizedSearchCV test AUC:  1.0
Best parameters:
{'n_estimators': 30, 'max_depth': 3}
Process finished with exit code 0