Hyperparameter Tuning for Machine Learning with Python Scikit-learn

一骑代码走天涯

Article link: https://blog.csdn.net/m0_48922254/article/details/118698064


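Hi everyone, I'm a young enthusiast who enjoys studying algorithms, machine learning, and computational biology. My CSDN blog is 一骑代码走天涯.
If you like my notes, please follow, like, and bookmark. If anything is wrong or could be improved, let me know in the comments. 😄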

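When doing machine learning in Python, we usually reach for the scikit-learn package. Besides a wide range of machine learning models, it ships many tools for evaluating and improving model performance, including tools for hyperparameter tuning. This post briefly records how I use those tools when training models.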

1. RandomizedSearchCV and GridSearchCV

Scikit-learn offers two main search classes for tuning: RandomizedSearchCV and GridSearchCV.

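How do they differ?
RandomizedSearchCV:
Within a specified hyperparameter search space, it randomly samples n_iter combinations to test, where n_iter is set by the user. The search space, param_distributions, is given as a dict. Use it when you need to find a good combination inside a very large search space.

GridSearchCV:
Given a specified hyperparameter search space, it enumerates every combination and tests them one by one. The search space, param_grid, is given as a dict. Use it when the search space is already small or narrowly focused.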
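To make the difference concrete, here is a minimal sketch that only generates the candidates each strategy would try, without training anything. It uses scikit-learn's ParameterGrid and ParameterSampler helpers; the search space itself is a made-up example, not one from this post.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Hypothetical search space, chosen only for illustration.
space = {
    "max_depth": [3, 6, 9],
    "n_estimators": [50, 100],
}

# Grid search: every combination is a candidate (3 x 2 of them here).
grid_candidates = list(ParameterGrid(space))
print(len(grid_candidates))

# Random search: only n_iter candidates are drawn from the same space.
random_candidates = list(ParameterSampler(space, n_iter=4, random_state=0))
print(len(random_candidates))
for candidate in random_candidates:
    print(candidate)
```

The grid size grows multiplicatively with every list you add, while the random search cost stays fixed at n_iter candidates.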

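Note: according to the version 0.24.2 documentation, two more classes are available, HalvingGridSearchCV and HalvingRandomSearchCV. In theory they are faster than the two above, but they are still marked experimental at this stage.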
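As a rough sketch of how the halving variant is used: it has to be switched on through the sklearn.experimental import before the class can be imported. The synthetic data from make_regression, the small grid, and the n_estimators value are my own assumptions to keep the example quick, and since the API is experimental it may change between releases.

```python
# The experimental flag must be imported before the halving classes.
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

# Synthetic regression data, used here only so the example runs on its own.
X, y = make_regression(n_samples=200, n_features=8, noise=0.1, random_state=0)

# Hypothetical small grid (3 x 3 = 9 candidates).
param_grid = {
    "max_depth": [3, 6, None],
    "min_samples_leaf": [1, 2, 4],
}

# Each round keeps roughly the best 1/factor of the candidates and
# gives the survivors more training samples.
halving_search = HalvingGridSearchCV(
    estimator=RandomForestRegressor(n_estimators=20, random_state=0),
    param_grid=param_grid,
    factor=3,
    cv=3,
    random_state=0,
)
halving_search.fit(X, y)
print(halving_search.best_params_)
```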

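2. Model training and tuning

Let's take random forest as an example:

RandomizedSearchCV:
The search space below contains (2x10x3x5x4x10) = 12000 combinations, but because n_iter is set to 100, only 100 of them are randomly sampled and tested, and the best of those 100 is returned. Here RandomizedSearchCV is the better choice over GridSearchCV because it saves time: it can find a good, often near-optimal, combination in a relatively short time.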

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Features and Labels stand for your own feature matrix and target values.
X_train, X_test, y_train, y_test = train_test_split(Features, Labels, test_size=0.3)
# Create the parameter search space
random_grid = {
    'bootstrap': [True, False],
    'max_depth': [20, 40, 60, 80, 100, 200, 400, 600, 800, None],
    # "auto" was removed in newer scikit-learn releases; for
    # RandomForestRegressor it meant "use all features", i.e. 1.0.
    'max_features': [1.0, "sqrt", "log2"],
    'min_samples_leaf': [1, 2, 4, 8, 12],
    # An integer min_samples_split must be at least 2.
    'min_samples_split': [2, 5, 10, 20],
    'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]
}
# Define the random forest model
rf = RandomForestRegressor()
# Instantiate the random search: 100 sampled candidates, 3-fold CV, all CPU cores
rand_search = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                                 n_iter=100, cv=3, verbose=2, n_jobs=-1)
rand_search.fit(X_train, y_train)
# Retrieve the best-performing model (refit on the whole training set)
best_model = rand_search.best_estimator_
# Print the model and the hyperparameters that were selected
print(best_model)
print(rand_search.best_params_)
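The listing above depends on your own Features and Labels. For a version you can run as-is, here is a small sketch on synthetic data that also shows two things the listing does not: param_distributions accepts scipy.stats distributions as well as lists, and the held-out test split can be scored with the tuned model. The dataset, the ranges, and the small n_iter are assumptions chosen to keep it fast.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic data standing in for Features and Labels.
X, y = make_regression(n_samples=300, n_features=10, noise=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Lists are sampled uniformly; scipy distributions are sampled via rvs().
param_distributions = {
    "max_depth": randint(3, 20),          # integers from 3 to 19
    "min_samples_split": randint(2, 11),  # integers from 2 to 10
    "n_estimators": [25, 50],
}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    random_state=0,
)
search.fit(X_train, y_train)
print(search.best_params_)

# score() uses the refitted best estimator; for a regressor this is R^2.
test_r2 = search.score(X_test, y_test)
print(test_r2)
```

Scoring on X_test gives an estimate that was not used to pick the hyperparameters, which is why the split is made before the search.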

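GridSearchCV:
When the search space is narrowly focused, or there are only a few options per parameter, GridSearchCV can do an exhaustive search. The example below has only (4x4x3x3) = 144 combinations.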

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Features and Labels stand for your own feature matrix and target values.
X_train, X_test, y_train, y_test = train_test_split(Features, Labels, test_size=0.3)
# Create the parameter grid
param_grid = {
    'max_depth': [3, 6, 9, 12],
    # Integer max_features values require at least that many feature columns.
    'max_features': [2, 3, 4, 5],
    'min_samples_leaf': [3, 4, 5],
    'n_estimators': [50, 100, 150]
}
# Define the random forest model
rf = RandomForestRegressor()
# Instantiate the grid search: all 144 candidates, 3-fold CV, all CPU cores
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
                           cv=3, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)
# Retrieve the best-performing model (refit on the whole training set)
best_model = grid_search.best_estimator_
# Print the model and the hyperparameters that were selected
print(best_model)
print(grid_search.best_params_)
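Beyond the single best model, the fitted search object keeps the cross-validation results for every candidate, which helps to see how close the runners-up were. A minimal sketch, again on synthetic data with a deliberately tiny grid (both are my assumptions, not the author's setup):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data so the example is self-contained.
X, y = make_regression(n_samples=200, n_features=6, noise=0.5, random_state=0)

# Hypothetical 2 x 2 grid.
param_grid = {"max_depth": [3, 6], "max_features": [2, 4]}
grid_search = GridSearchCV(
    RandomForestRegressor(n_estimators=30, random_state=0), param_grid, cv=3
)
grid_search.fit(X, y)

# Best combination and its mean cross-validated score (R^2 for a regressor).
print(grid_search.best_params_)
print(grid_search.best_score_)

# cv_results_ is a dict of arrays with one entry per candidate.
results = grid_search.cv_results_
order = np.argsort(results["rank_test_score"])
for i in order:
    print(
        results["rank_test_score"][i],
        results["params"][i],
        round(results["mean_test_score"][i], 3),
    )
```

The same attributes (best_params_, best_score_, cv_results_) are available on a fitted RandomizedSearchCV.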