End-to-End OptimalFlow Automated Machine Learning Tutorial with Real Projects: Formula E Laps

In Part 1 of this tutorial, we discussed how to implement data engineering to prepare suitable datasets for the modeling steps that follow. Now we will focus on how to use the OptimalFlow library (Documentation | GitHub) to implement Omni-ensemble automated machine learning.

Why do we use OptimalFlow? You can read another story introducing it: An Omni-ensemble Automated Machine Learning — OptimalFlow.

Step 1: Install OptimalFlow

Set up your working environment with Python 3.7+, and install OptimalFlow with the pip command below. At the time of writing, the most recent version is 0.1.7. More package information can be found on PyPI.

pip install OptimalFlow

Step 2: Double-check missing values

After the data preparation in Part 1 of this tutorial, most of the features are ready to feed the modeling process. Missing values in categorical features are not welcome when the data flow reaches autoPP (OptimalFlow's auto feature preprocessing module). So we need to double-check the cleaned data and apply data cleaning to any features with missing values. For this problem, I only found that the 'GROUP' feature has missing values, and I used the following code to convert it.

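The original article shows this step as a screenshot. Here is a minimal sketch of what the conversion might look like, assuming the cleaned data from Part 1 sits in a pandas DataFrame and that filling a simple placeholder category is acceptable; the file name and fill value are illustrative, not the author's exact code:

import pandas as pd

# Load the cleaned dataset prepared in Part 1 (illustrative file name).
df = pd.read_csv("data_cleaned.csv")

# Double-check which features still contain missing values.
print(df.isnull().sum())

# 'GROUP' is the only categorical feature with missing values here,
# so convert the missing entries into an explicit category.
df["GROUP"] = df["GROUP"].fillna("Unknown")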

Step 3: Custom settings

OptimalFlow provides open interfaces for users to apply custom settings in every module. Even in autoCV (OptimalFlow's model selection & evaluation module), you can customize the specific models or the hyperparameter search space. You can find more details in the Documentation.

In the code below, we set up the scaler and encoder algorithms for autoPP (OptimalFlow's auto feature preprocessing module), the selectors for autoFS (OptimalFlow's auto feature selection module), and the estimators for the autoCV module.

For feature selection and model selection with evaluation, we set up the selectors' and estimators' search space.

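The custom settings appear as a screenshot in the original article. The sketch below only illustrates the general shape of such settings, loosely following the parameter-dictionary layout in the OptimalFlow documentation; every key, algorithm keyword, and value here is an illustrative assumption rather than the author's actual configuration, so check the Documentation for the exact supported options:

# Custom settings for autoPP (illustrative keys and values).
custom_pp = {
    "scaler": ["None", "standard"],        # scaler candidates for autoPP
    "encode_band": [10],                   # unique-value threshold splitting low/high-cardinality features
    "low_encode": ["onehot", "label"],     # encoder candidates for low-cardinality categorical features
    "high_encode": ["frequency", "mean"],  # encoder candidates for high-cardinality categorical features
    "winsorizer": [(0.1, 0.1)],            # winsorization bounds for outliers
    "sparsity": [0.5],                     # minimum sparsity a dataset combination must reach
    "cols": [100],                         # maximum number of columns a dataset combination may have
}

# Search space for the autoFS selectors and autoCV estimators (illustrative keywords).
custom_fs = ["kbest_f", "rfe_svm", "rfecv_tree"]
custom_cv = ["lr", "knn", "tree", "svm", "rf"]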

PLEASE NOTE: "sparsity" and "cols" are two limits you can set to narrow down the number of dataset combinations. Usually, when the sparsity of a dataset is too low, the variance within its features will be low, which means the information value will be low. You can try different values for these two parameters, based on the number of dataset combinations you are willing to accept; all of them will go through the Pipeline Cluster Traversal Experiments (PCTE) process to find the optimal model with its pipeline workflow. Of course, the more dataset combinations the autoPP module generates, the more time the later steps of OptimalFlow's automated machine learning will need. Conversely, when no dataset combination can meet the sparsity and column-number restrictions, the following steps cannot continue. So be careful and experiment with a few values for these settings.

Step 4: Pipeline Cluster Traversal Experiments (PCTE)

The core concept/improvement in OptimalFlow is Pipeline Cluster Traversal Experiments (PCTE), a framework theory first proposed by Tony Dong at Genpact's 2020 GVector Conference to optimize and automate machine learning workflows using an ensemble-pipelines algorithm.

Compared with the repetitive experiments of other automated or classic machine learning workflows that use a single pipeline, Pipeline Cluster Traversal Experiments is more powerful, since it expands the workflow from one dimension to two by ensembling all possible pipelines (the Pipeline Cluster) and automating the experiments. With larger coverage, it can find the best model without manual intervention, and thanks to the ensemble design of each component it is also more flexible and elastic in coping with unseen data. Pipeline Cluster Traversal Experiments thus give data scientists a more convenient, "Omni-automated" alternative approach to machine learning.

To implement the PCTE process, OptimalFlow provides the autoPipe module. More examples and function details can be found in the documentation.

Here are the attributes we've set in the autoPipe module (a sketch of the setup follows the list below):

  • for the autoPP module: give it the custom parameters we set above, set the prediction column to "Total_Lap_Num", and set the model_type to "regression" (to prevent the dummy variable trap);

  • for the splitting rule: set 20% validation, 20% test, and 60% train data;

  • for the autoFS module: select the top 10 features, with 5 cross-validation folds;

  • for the autoCV module: use the fastRegressor class, with 5 cross-validation folds.

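The autoPipe setup is shown as a screenshot in the original article. The sketch below shows how such a pipeline could be assembled, adapted to regression from the autoPipe example in the OptimalFlow documentation; the import paths and constructor arguments are assumptions to verify against the docs (in particular, the location of pipeline_splitting_rule and the exact fastRegressor signature should be double-checked):

from optimalflow.autoPP import dynaPreprocessing
from optimalflow.autoFS import dynaFS_reg
from optimalflow.autoCV import fastRegressor, evaluate_model
from optimalflow.autoPipe import autoPipe
from optimalflow.utilis_func import pipeline_splitting_rule  # assumed location of the splitting helper

pipe = autoPipe([
    # autoPP: apply our custom parameters; "Total_Lap_Num" is the prediction column,
    # and the regression model type prevents the dummy variable trap.
    ("autoPP", dynaPreprocessing(custom_parameters=custom_pp, label_col="Total_Lap_Num", model_type="reg")),
    # Splitting rule: 20% validation, 20% test, 60% train.
    ("datasets_splitting", pipeline_splitting_rule(val_size=0.2, test_size=0.2, random_state=13)),
    # autoFS: keep the top 10 features, using 5 cross-validation folds.
    ("autoFS", dynaFS_reg(fs_num=10, random_state=13, cv=5, in_pipeline=True, input_from_file=False)),
    # autoCV: the fastRegressor class with 5 cross-validation folds.
    ("autoCV", fastRegressor(random_state=13, cv_num=5, in_pipeline=True, input_from_file=False)),
    # Evaluate every candidate model in the pipeline cluster.
    ("model_evaluate", evaluate_model(model_type="reg"))
])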

Here's a brief description of this automated process:

  • Based on our previous custom settings, the autoPP module will generate 256 dataset combinations in total (subject to our custom sparsity and cols restrictions). The PCTE process will go through all of them, automatically select the top 10 features with the autoFS module, and search for the best model with tuned hyperparameters, finally finding the optimal model with its pipeline workflow within the pipeline cluster.

You will find all the log information of the PCTE process in the auto-generated log files, which are created by OptimalFlow's autoFlow module.

(Figure: modules' auto-generated log files)

The PCTE process covers almost all of the machine learning steps data scientists need, and automatically searches for the optimal model, along with its pipeline workflow information, for easy evaluation and implementation.

Although PCTE will not save time on each individual machine learning pipeline operation, data scientists can move on to other tasks while OptimalFlow takes over the tedious model experimentation and tuning work.

This is what I believe a REAL automated machine learning process should be: OptimalFlow finishes all of these tasks automatically.

The outputs of the Pipeline Cluster Traversal Experiments (PCTE) process include the preprocessing algorithms applied to each prepared dataset combination (DICT_PREP_INFO), the top features selected for each dataset combination (DICT_FEATURE_SELECTION_INFO), the model evaluation results (DICT_MODELS_EVALUATION), the split dataset combinations (DICT_DATA), and the model selection ranking table (models_summary).

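These outputs come from fitting the autoPipe object built in Step 4. As a sketch (assuming pipe and the prepared DataFrame df from the earlier steps, and that pipe.fit returns the five objects in this order, which should be confirmed in the autoPipe documentation):

# Run the full PCTE process on the prepared dataset and collect the five outputs.
DICT_PREP_INFO, DICT_FEATURE_SELECTION_INFO, DICT_MODELS_EVALUATION, DICT_DATA, models_summary = pipe.fit(df)

# Quick sanity check of the ranking table produced by the model selection step.
print(models_summary)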

This is a useful feature for data scientists, since retrieving a previous machine learning workflow is painful when they want to reuse its outputs.

Step 5: Save pipeline cluster with optimal models

Since the PCTE process can take a very long time when there is a large number of dataset combinations as input, we'd better save the outputs of the previous step (the pipeline cluster with optimal models) as pickles for the results interpretation and visualization steps.

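The saving step appears as a screenshot in the original article; a minimal sketch using Python's standard pickle module (the file names are illustrative) could look like this:

import pickle

# Persist each PCTE output for the later interpretation and visualization steps.
outputs = {
    "DICT_PREP_INFO.pkl": DICT_PREP_INFO,
    "DICT_FEATURE_SELECTION_INFO.pkl": DICT_FEATURE_SELECTION_INFO,
    "DICT_MODELS_EVALUATION.pkl": DICT_MODELS_EVALUATION,
    "DICT_DATA.pkl": DICT_DATA,
    "models_summary.pkl": models_summary,
}
for file_name, obj in outputs.items():
    with open(file_name, "wb") as f:
        pickle.dump(obj, f)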

Step 6: Interpret the modeling results

In the next step we will look at our modeling results by importing the pickles saved in the previous step. We can use the following code to find the top 3 models with their optimal workflows after the automated PCTE process:

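The original code is shown as a screenshot; one possible way to do it is sketched below, assuming the models_summary table saved in Step 5 can be loaded into a pandas DataFrame and that its R-squared column is named "R2" (the column name is a guess, not taken from the article):

import pickle
import pandas as pd

# Reload the model ranking table saved in Step 5.
with open("models_summary.pkl", "rb") as f:
    models_summary = pd.DataFrame(pickle.load(f))

# Rank all candidate pipelines by R-squared and keep the top 3.
top_3 = models_summary.sort_values(by="R2", ascending=False).head(3)
print(top_3)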

It's very clear that the KNN algorithm with tuned hyperparameters performs the best, and we can retrieve its whole pipeline workflow from PCTE's outputs.

Specifically:

The optimal pipeline consists of the KNN algorithm applied to Dataset_214 and Dataset_230 among the 256 dataset combinations, with the best parameters [('weights': 'distance'), ('n_neighbors': '5'), ('algorithm': 'kd_tree')]. The R-squared is 0.971, MAE is 1.157, MSE is 5.928, RMSE is 5.928, and the latency score is 3.0.

The pipeline performance assessment results for all 256 datasets can be generated by the autoViz module's dynamic table function (more details and other visualization examples can be found here), and you can find the output at ./temp-plot.html.

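As a sketch, the dynamic table report might be produced roughly as follows; the autoViz constructor arguments and the report method name are assumptions based on the autoViz documentation and should be verified there:

import pickle
from optimalflow.autoViz import autoViz

# Reload the PCTE outputs needed for visualization.
with open("DICT_PREP_INFO.pkl", "rb") as f:
    DICT_PREP_INFO = pickle.load(f)
with open("DICT_MODELS_EVALUATION.pkl", "rb") as f:
    DICT_MODELS_EVALUATION = pickle.load(f)

# Build the dynamic model evaluation table; the report is written to ./temp-plot.html.
viz = autoViz(preprocess_dict=DICT_PREP_INFO, report=DICT_MODELS_EVALUATION)
viz.reg_table_report()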

The top 10 features selected by the autoFS module are:

(Figure: top 10 features selected by autoFS)

The feature preprocessing details for Dataset_214 and Dataset_230: winsorization of outliers at the top 10% and bottom 10%; encoding of the 'match_name' and 'DATE_ONLY' features with the mean encoding approach; encoding of the 'GROUP' feature with the OneHot encoding approach; no scaler is involved in the preprocessing step.

That's all. We've built our first OptimalFlow automated machine learning project. Simple and easy, right? 😎

*More things to consider:

Our top pipeline has a very high R-squared value, over 0.9. For most physical processes this value might not be surprising; however, if we are predicting human behavior, it is somewhat too high. So we also need to consider other metrics like MSE.

Within this tutorial, we simplified the real project to a case more suitable for OptimalFlow beginners. As a starting point, this result is acceptable as the output of your first OptimalFlow automated machine learning project.

Here are some suggestions if you want to keep improving our sample script and go deeper with a more practical optimal-model approach.

  • A high R-squared value usually means overfitting has happened, so drop more features to prevent that;

  • Aggregation is a good way to assemble the data, but it also loses the lap-by-lap and timing-by-timing variance information;

  • The scaling approach is also essential to prevent overfitting; we could move "None" out of our custom_pp and add other scaler approaches (e.g. minmax, robust) in Step 3;

In Summary:

OptimalFlow is an easy-to-use API tool to achieve Omni-ensemble automated machine learning with simple code, and it is also a best-practice library demonstrating the Pipeline Cluster Traversal Experiments (PCTE) theory.

Its 6 modules can not only be connected to implement the PCTE process, but can also be used individually to optimize components of a traditional machine learning workflow. You can find their individual use cases in the Documentation.

“An algorithmicist looks at no free lunch.” — Culberson

Last but not least, as data scientists we should always keep in mind that no matter what kind of automated machine learning algorithm we use, the 'no free lunch' theorem always applies.

About me:

I am a healthcare & pharmaceutical data scientist and a big data analytics & AI enthusiast. I developed the OptimalFlow library to help data scientists build optimal models in an easy way and automate machine learning workflows with simple code.

As a big data insights seeker, process optimizer, and AI professional with years of analytics experience, I use machine learning and problem-solving skills in data science to turn data into actionable insights while providing strategic and quantitative products as solutions for optimal outcomes.

You can connect with me on LinkedIn or GitHub.

Originally published at: https://towardsdatascience.com/end-to-end-optimalflow-automated-machine-learning-tutorial-with-real-projects-formula-e-laps-31d810539102
