mlmachine - Hyperparameter Tuning with Bayesian Optimization


mlmachine

TL;DR

mlmachine is a Python library that organizes and accelerates notebook-based machine learning experiments.

In this article, we use mlmachine to accomplish actions that would otherwise take considerable coding and effort, including:

  • Bayesian Optimization for Multiple Estimators in One Shot
  • Results Analysis
  • Model Reinstantiation

Check out the Jupyter Notebook for this article.

Check out the project on GitHub.

And check out past mlmachine articles.

Bayesian Optimization for Multiple Estimators in One Shot

Bayesian optimization is typically described as an advancement beyond exhaustive grid searches, and rightfully so. This hyperparameter tuning strategy succeeds by using prior information to inform future parameter selection for a given estimator. Check out Will Koehrsen’s article on Medium for an excellent overview of the approach.

mlmachine uses hyperopt as a foundation for performing Bayesian optimization, and takes the functionality of hyperopt a step further through a simplified workflow that allows for optimization of multiple models in a single process execution. In this article, we are going to optimize four classifiers:

  • LogisticRegression()
  • XGBClassifier()
  • RandomForestClassifier()
  • KNeighborsClassifier()
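mlmachine abstracts this multi-estimator search behind its API. To make the underlying mechanics concrete, here is a minimal, self-contained hyperopt sketch (not mlmachine's actual code) in which a single fmin() run searches across subspaces for two of these estimators; the dataset and parameter ranges are illustrative assumptions:

```python
# Illustrative sketch only -- not mlmachine's internal implementation.
# It shows the core trick: nest one subspace per estimator under a
# single hp.choice so that one fmin() run tunes multiple estimators.
import numpy as np
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

space = hp.choice("estimator", [
    {
        "model": LogisticRegression,
        "params": {
            "C": hp.loguniform("lr_C", np.log(1e-3), np.log(1e2)),
            "max_iter": 1000,  # fixed value, passed through unchanged
        },
    },
    {
        "model": RandomForestClassifier,
        "params": {
            "n_estimators": hp.choice("rf_n_estimators", [100, 200, 400]),
            "max_depth": hp.choice("rf_max_depth", [3, 5, 10, None]),
        },
    },
])

def objective(spec):
    # hyperopt minimizes, so report 1 - mean CV accuracy as the loss.
    model = spec["model"](**spec["params"])
    score = cross_val_score(model, X, y, scoring="accuracy", cv=5).mean()
    return {"loss": 1 - score, "status": STATUS_OK}

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=50, trials=trials)
```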

Prepare Data

First, we apply data preprocessing techniques to clean up our data. We’ll start by creating two Machine() objects — one for the training data and a second for the validation data:
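The original post embeds a code gist at this point. As a rough sketch of the instantiation (the constructor arguments below are assumptions about the Machine() API, not verified signatures):

```python
# Hypothetical sketch -- the argument names are assumptions about the
# Machine() constructor, not verified against mlmachine's API.
import pandas as pd
import mlmachine as mlm

df_train = pd.read_csv("train.csv")   # placeholder file names
df_valid = pd.read_csv("valid.csv")

# One Machine() object for the training data, one for the validation data.
mlmachine_train = mlm.Machine(data=df_train, target="target")
mlmachine_valid = mlm.Machine(data=df_valid, target="target")
```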

Now we process the data by imputing nulls and applying various binning, feature engineering and encoding techniques:
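mlmachine performs these steps through its own pipeline utilities. As a stand-in, here is an equivalent scikit-learn sketch of the kind of preprocessing described, with illustrative column names:

```python
# Stand-in scikit-learn sketch of null imputation, binning, and
# encoding; mlmachine wraps comparable transformers behind its own
# API. Column names here are illustrative placeholders.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("bin", KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")),
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["age", "fare"]),          # illustrative numeric columns
    ("cat", categorical, ["sex", "embarked"]),  # illustrative categorical columns
])
```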

Here is the output, still in a DataFrame:

[Image: the processed data, still in a DataFrame]

Feature Importance Summary

As a second preparatory step, we want to perform feature selection for each of our classifiers:

Exhaustively Iterative Feature Selection

For our final preparatory step, we use this feature selection summary to perform iterative cross-validation on smaller and smaller subsets of features for each of our estimators:
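Here is a stand-in sketch of the idea (not mlmachine's implementation): rank features by importance, then cross-validate on progressively smaller top-k subsets and keep the best-scoring subset size:

```python
# Stand-in sketch of exhaustive iterative feature selection: rank
# features by random forest importance, then cross-validate each
# progressively smaller top-k subset. Not mlmachine's actual code.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
ranked = np.argsort(rf.feature_importances_)[::-1]  # best features first

scores = {}
for k in range(X.shape[1], 0, -5):      # shrink the subset by 5 each step
    top_k = ranked[:k]
    scores[k] = cross_val_score(rf, X[:, top_k], y, cv=5).mean()

best_k = max(scores, key=scores.get)    # subset size with best mean CV score
```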

From this result, we extract our dictionary of optimum feature sets for each estimator:

The keys are estimator names, and the associated values are lists containing the column names of the best performing feature subset for each estimator. Here are the key/value pairs for XGBClassifier(), which used only 10 of the available 43 features to achieve the best average cross-validation accuracy on the validation dataset:

[Image: the optimum feature subset for XGBClassifier()]
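Since the screenshot is not reproduced here, a purely illustrative sketch of the dictionary's shape (these column names are placeholders, not the article's actual features):

```python
# Purely illustrative -- keys are estimator class names and values
# are that estimator's best-performing feature subset. The column
# names below are placeholders, not the article's actual features.
optimum_feature_sets = {
    "XGBClassifier": [
        "feature_01", "feature_02", "feature_03",  # ...10 columns in total
    ],
    "KNeighborsClassifier": [
        "feature_01", "feature_05", "feature_09",
    ],
}
```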

With our processed dataset and optimum feature subsets in hand, it’s time to use Bayesian optimization to tune the hyperparameters of our 4 estimators.

Outline Our Feature Space

First, we need to establish our feature space for each parameter for each estimator:

The outermost keys of the dictionary are names of classifiers, represented by strings. The associated values are also dictionaries, where the keys are parameter names, represented as strings, and the values are hyperopt sampling distributions from which parameter values will be chosen.
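A hedged reconstruction of such a space, using real hyperopt sampling distributions; the specific parameters and ranges below are illustrative assumptions, not the article's actual settings:

```python
# Illustrative parameter space -- ranges are assumptions, not the
# article's actual settings. Outer keys are estimator class names;
# inner keys map parameter names to hyperopt sampling distributions.
# Note: quniform samples floats, so integer-valued parameters need
# casting to int before being passed to an estimator.
import numpy as np
from hyperopt import hp

estimator_parameter_space = {
    "LogisticRegression": {
        "C": hp.loguniform("lr_C", np.log(1e-3), np.log(1e2)),
    },
    "XGBClassifier": {
        "n_estimators": hp.quniform("xgb_n_estimators", 100, 1000, 50),
        "learning_rate": hp.loguniform("xgb_learning_rate",
                                       np.log(0.01), np.log(0.3)),
        "max_depth": hp.quniform("xgb_max_depth", 2, 10, 1),
    },
    "RandomForestClassifier": {
        "n_estimators": hp.quniform("rf_n_estimators", 100, 1000, 50),
        "max_features": hp.choice("rf_max_features", ["sqrt", "log2"]),
    },
    "KNeighborsClassifier": {
        "n_neighbors": hp.quniform("knn_n_neighbors", 3, 30, 1),
        "algorithm": hp.choice("knn_algorithm",
                               ["ball_tree", "kd_tree", "brute", "auto"]),
    },
}
```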

Run the Bayesian Optimization Job

Now we’re ready to run our Bayesian optimization hyperparameter tuning job. We will use a built-in method belonging to our Machine() object called exec_bayes_optim_search(). Let’s see mlmachine in action:
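The original gist is not reproduced here; below is a hedged reconstruction of the call, built only from the parameters enumerated next (the scoring metric and fold count shown are assumptions):

```python
# Hedged reconstruction -- the parameter names come from the article's
# own description below; the scoring metric and n_folds values are
# assumptions. iters=200 matches the article's 200-iteration run.
mlmachine_train.exec_bayes_optim_search(
    estimator_parameter_space=estimator_parameter_space,
    data=mlmachine_train.data,      # attribute names are assumptions
    target=mlmachine_train.target,
    columns=optimum_feature_sets,   # per-estimator feature subsets
    scoring="accuracy",
    n_folds=5,
    iters=200,
    show_progressbar=True,
)
```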

Let’s review the parameters:

  • estimator_parameter_space: The dictionary-based feature space we set up above.

  • data: Our observations.

  • target: Our target data.

  • columns: An optional parameter that allows us to subset the input dataset features. Accepts a list of feature names, which will apply equally to all estimators. Also accepts a dictionary, where the keys represent estimator class names and the values are lists of feature names to be used with the associated estimator. In this example, we use the latter by passing in the dictionary returned by cross_val_feature_dict() in the FeatureSelector() workflow above.

  • scoring: The scoring metric to be evaluated.

  • n_folds: Number of folds to use in the cross-validation procedure.

  • iters: Total number of iterations to run the hyperparameter tuning process. In this example, we run the experiment for 200 iterations.

  • show_progressbar: Controls whether a progress bar displays and actively updates during the process.

Anyone familiar with hyperopt will be wondering where the objective function is. mlmachine abstracts away this complexity.

The process runtime depends on several factors, including hardware, the number and type of estimators used, the number of folds, feature selection, and the number of sampling iterations. Runtimes can be quite lengthy. For this reason, exec_bayes_optim_search() automatically saves the result of each iteration to a CSV.

Results Analysis

Results Summary

Let’s start by loading and reviewing the results:
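The original gist is not shown; a hedged sketch of loading the saved log (the file name and the "params" column name are assumptions):

```python
# Loading the per-iteration results saved by exec_bayes_optim_search().
# The file name and the "params" column name are assumptions.
import ast
import pandas as pd

bayes_optim_summary = pd.read_csv("bayes_optimization_summary.csv")

# The parameter dictionary is stored as a string in the CSV; parse it
# back into a Python dict for downstream analysis.
bayes_optim_summary["params"] = bayes_optim_summary["params"].apply(
    ast.literal_eval
)
bayes_optim_summary.head()
```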

[Image: the first several rows of the Bayesian optimization results log]

Our Bayesian optimization log maintains key information about each iteration:

  • Iteration number, estimator and scoring metric
  • Cross-validation summary statistics
  • Iteration training time
  • Dictionary of parameters used

This log provides an immense amount of data for us to analyze and evaluate the effectiveness of the Bayesian optimization process.

Model Optimization Assessment

First and foremost, we want to see whether performance improved over the iterations.

Let’s visualize the XGBClassifier() loss by iteration:
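mlmachine ships a built-in visual for this; as a stand-in, here is a matplotlib sketch that produces the same kind of plot (the log's column names are assumptions):

```python
# Stand-in matplotlib sketch of the loss-by-iteration plot with a
# line of best fit; column names are assumptions about the log schema.
import matplotlib.pyplot as plt
import numpy as np

xgb = bayes_optim_summary[
    bayes_optim_summary["estimator"] == "XGBClassifier"
]
x, loss = xgb["iteration"].to_numpy(), xgb["loss"].to_numpy()

fig, ax = plt.subplots()
ax.scatter(x, loss, alpha=0.5)
slope, intercept = np.polyfit(x, loss, 1)   # first-degree line of best fit
ax.plot(x, slope * x + intercept, color="red")
ax.set_xlabel("iteration")
ax.set_ylabel("loss")
plt.show()
```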

[Image: XGBClassifier() loss by iteration, with line of best fit]

Each dot represents the performance of one of our 200 experiments. The key detail to notice is that the line of best fit has a clear downward slope - exactly what we want. This means that with each iteration, model performance tends to improve compared to the previous iterations.

Parameter Selection Assessment

One of the coolest parts of Bayesian optimization is seeing how parameter selection is optimized.

For each model and for each model’s parameters, we can generate a two-panel visual.

For numeric parameters, such as n_estimators or learning_rate, the two-panel visual includes:

  • Parameter selection KDE, overlaid on a theoretical distribution KDE
  • Parameter selection by iteration scatter plot, with line of best fit

For categorical parameters, such as the loss function, the two-panel visual includes:

  • Parameter selection and theoretical distribution bar chart
  • Parameter selection by iteration scatter plot, faceted by parameter category

Let’s review the parameter selection panels for KNeighborsClassifier():
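A hedged call sketch (the method name is given in the text just below; the argument names are assumptions):

```python
# Hedged sketch -- model_param_plot() is named in the article; the
# argument names here are assumptions.
mlmachine_train.model_param_plot(
    bayes_optim_summary=bayes_optim_summary,
    estimator_class="KNeighborsClassifier",
)
```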

[Image: parameter selection panels for KNeighborsClassifier()]

The built-in method model_param_plot() cycles through the estimator’s parameters and presents the appropriate panel given each parameter’s type. Let’s look at a numeric parameter and a categorical parameter separately.

First, we’ll review the panel for the numeric parameter n_neighbors:

[Image: parameter selection panel for n_neighbors]

On the left, we can see two overlapping kernel density plots summarizing the actual parameter selections and the theoretical parameter distribution. The purple line corresponds to the theoretical distribution, and, as expected, this curve is smooth and evenly distributed. The teal line corresponds to the actual parameter selections, and it’s clearly evident that hyperopt prefers values between 5 and 10.

On the right, the scatter plot visualizes the n_neighbors value selections over the iterations. There is a slight downward slope to the line of best fit, as the Bayesian optimization process homes in on values around 7.

Next, we’ll review the panel for the categorical parameter algorithm:

[Image: parameter selection panel for algorithm]

On the left, we see a bar chart displaying the counts of parameter selections, faceted by actual parameter selections and selections from the theoretical distribution. The purple bars, representing selections from the theoretical distribution, are more even than the teal bars, which represent the actual selections.

On the right, the scatter plot again visualizes the algorithm value selections over the iterations. There is a clear decrease in the selection of “ball_tree” and “auto” in favor of “kd_tree” and “brute” over the iterations.

Model Reinstantiation

Top Model Identification

Our Machine() object has a built-in method called top_bayes_optim_models(), which identifies the best model for each estimator type based on the results in our Bayesian optimization log.

With this method, we can identify the top N models for each estimator based on mean cross-validation score. In this experiment, top_bayes_optim_models() returns the dictionary below, which tells us that LogisticRegression() identified its top model on iteration 30, XGBClassifier() on iteration 61, RandomForestClassifier() on iteration 46, and KNeighborsClassifier() on iteration 109.
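A hedged sketch of the call and its return value; the argument names are assumptions, while the iteration numbers come straight from the text above:

```python
# Hedged sketch -- argument names are assumptions; the iteration
# numbers in the comment are the ones reported in the article.
top_models = mlmachine_train.top_bayes_optim_models(
    bayes_optim_summary=bayes_optim_summary,
    num_models=1,
)
# Expected shape of the result:
# {"LogisticRegression": [30], "XGBClassifier": [61],
#  "RandomForestClassifier": [46], "KNeighborsClassifier": [109]}
```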

[Image: dictionary of top model iterations for each estimator]

Putting the Models to Use

To reinstantiate a model, we leverage our Machine() object’s built-in method BayesOptimClassifierBuilder(). To use this method, we pass in our results log and specify an estimator class and iteration number. This will instantiate a model object with the parameters stored on that record of the log:
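A hedged sketch of the reinstantiation call (the method name and its inputs come from the article; the argument names are assumptions). We rebuild the top RandomForestClassifier() from iteration 46, identified above:

```python
# Hedged sketch -- BayesOptimClassifierBuilder() is named in the
# article; the argument names are assumptions. Iteration 46 is the
# top RandomForestClassifier() iteration identified above.
model = mlmachine_train.BayesOptimClassifierBuilder(
    bayes_optim_summary=bayes_optim_summary,
    estimator_class="RandomForestClassifier",
    model_iter=46,
)
```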

Here we see the model parameters:

[Image: the reinstantiated model's parameters]

The models instantiated with BayesOptimClassifierBuilder() use .fit() and .predict() in a way that should feel quite familiar.

Let’s finish this article with a very basic model performance evaluation. We will fit this RandomForestClassifier() on the training data and labels, generate predictions with the training data, and evaluate the model’s performance by comparing these predictions to the ground truth:
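A sketch of that evaluation, using standard scikit-learn metrics (X_train and y_train are stand-ins for the processed training data and labels):

```python
# Basic evaluation: fit on the training data, predict on the same
# data, and compare predictions to the ground truth. X_train and
# y_train are stand-ins for the processed training data and labels.
from sklearn.metrics import accuracy_score, classification_report

model.fit(X_train, y_train)
y_pred = model.predict(X_train)

print(accuracy_score(y_train, y_pred))
print(classification_report(y_train, y_pred))
```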

[Image: basic model performance evaluation output]

Anyone familiar with Scikit-learn should feel right at home.

In Closing

mlmachine makes it easy to efficiently optimize the hyperparameters for multiple estimators in one shot, and facilitates the visual inspection of model improvement and parameter selection.

Check out the GitHub repository, and stay tuned for additional column entries.

Translated from: https://towardsdatascience.com/mlmachine-hyperparameter-tuning-with-bayesian-optimization-2de81472e6d
