Automatic Hyperparameter Optimization: A Quick Start with AutoGluon

_刘文凯_

Last modified 2023-02-28 14:09:42

Categories: Deep Learning, Machine Learning. Tags: deep learning, computer vision, machine learning

First published 2022-01-01 18:53:52

Article link: https://blog.csdn.net/qq_24211837/article/details/122269561

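Today I came across a very easy-to-use hyperparameter optimization package (from the renowned Mu Li and his colleagues). I gave it a quick try and the results were good.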

Overview

Supported models:
Machine learning models;
Deep learning models;
Model ensembles;
Deep learning model ensembles;
and so on

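A simple application

AutoGluon website: https://auto.gluon.ai/stable/index.html
Dataset: the public Kaggle Titanic dataset: https://www.kaggle.com/c/titanic/
I put the code and data on GitHub: https://github.com/yinboliu-git/MachineLearning-sample-demos/tree/main/automl

Installation (a conda virtual environment is recommended):
Python 3.7 or 3.8
Linux only; Windows is not supported yet (as of 2021.12.30)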

pip install "mxnet<2.0.0"
pip install autogluon

Code (run.py):

#! /usr/bin/python3
import pandas as pd
from autogluon.tabular import TabularDataset, TabularPredictor

# Training is slow, so only the first 10 rows are used here as a quick test
train_data = TabularDataset('train.csv').iloc[0:10, :]
id_col, label = 'PassengerId', 'Survived'  # id_col avoids shadowing the built-in id()
# Drop the ID column so it is not used as a feature
predictor = TabularPredictor(label=label).fit(train_data.drop(columns=[id_col]))

test_data = TabularDataset('test.csv')
pred = predictor.predict(test_data.drop(columns=[id_col]))
# Write the predictions in the Kaggle submission format
sub = pd.DataFrame({id_col: test_data[id_col], label: pred})
sub.to_csv('submission.csv', index=False)

The trained models are saved automatically under ./AutogluonModels
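
A saved predictor can be reloaded later instead of being refit. The helper below is only a sketch and is not part of the original run.py. It assumes each run lands in a subfolder named like ag-<timestamp> under AutogluonModels, and that those zero-padded timestamps sort chronologically as strings. The function names latest_run_dir and load_latest_predictor are made up for illustration; TabularPredictor.load is the library call that does the actual loading.

```python
from pathlib import Path
from typing import Optional


def latest_run_dir(root: str = "AutogluonModels") -> Optional[Path]:
    """Return the newest 'ag-*' subfolder of root, or None if there is none."""
    root_path = Path(root)
    if not root_path.is_dir():
        return None
    # Assumption: names such as ag-20220101_185352 sort chronologically as text.
    runs = sorted(
        p for p in root_path.iterdir() if p.is_dir() and p.name.startswith("ag-")
    )
    return runs[-1] if runs else None


def load_latest_predictor(root: str = "AutogluonModels"):
    """Reload the newest saved predictor (needs autogluon installed)."""
    # Imported lazily so latest_run_dir stays usable without autogluon.
    from autogluon.tabular import TabularPredictor

    run_dir = latest_run_dir(root)
    if run_dir is None:
        raise FileNotFoundError(f"no saved AutoGluon run found under {root!r}")
    return TabularPredictor.load(str(run_dir))
```

If a path was passed to TabularPredictor explicitly when training, loading from that path directly is simpler than searching for the newest folder.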

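Hyperparameter tuning

Yes, you read that right: hyperparameter tuning applies here too. Ha, the model that tunes models has parameters of its own:

Usage: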

time_limit = 60  # for quick demonstration only; set this to the longest time you are willing to wait (in seconds)
metric = 'roc_auc'  # specify your evaluation metric here
predictor = TabularPredictor(label, eval_metric=metric).fit(train_data, time_limit=time_limit, presets='best_quality')
# leaderboard(data) needs the label column; Kaggle's test.csv has no 'Survived', so show validation scores instead
predictor.leaderboard(silent=True)

Parameter notes (quoted from the AutoGluon documentation):

This command implements the following strategy to maximize accuracy:

Specify the argument presets='best_quality', which allows AutoGluon to automatically construct powerful model ensembles based on stacking/bagging, and will greatly improve the resulting predictions if granted sufficient training time. The default value of presets is 'medium_quality_faster_train', which produces less accurate models but facilitates faster prototyping. With presets, you can flexibly prioritize predictive accuracy vs. training/inference speed. For example, if you care less about predictive performance and want to quickly deploy a basic model, consider using: presets=['good_quality_faster_inference_only_refit', 'optimize_for_deployment'].
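
To give a feel for what "bagging" means in the paragraph above, here is a minimal pure-Python sketch. It is not AutoGluon's implementation. It trains several tiny models (one-feature threshold rules, an assumption chosen for brevity) on bootstrap resamples of the data and lets them vote. All function names are illustrative.

```python
import random
from collections import Counter


def fit_stump(xs, ys):
    """Find the threshold t minimizing errors of the rule 'predict 1 if x >= t'."""
    best_t, best_err = xs[0], len(xs) + 1
    for t in sorted(set(xs)):
        err = sum((x >= t) != bool(y) for x, y in zip(xs, ys))
        if err < best_err:
            best_t, best_err = t, err
    return best_t


def fit_bagged(xs, ys, n_models=25, seed=0):
    """Fit n_models stumps, each on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    thresholds = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        thresholds.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return thresholds


def predict_bagged(thresholds, x):
    """Majority vote over the individual stump predictions."""
    votes = Counter(int(x >= t) for t in thresholds)
    return votes.most_common(1)[0][0]
```

Stacking goes one step further and trains another model on the base models' predictions; that part is left out here to keep the sketch short.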

Provide the eval_metric if you know what metric will be used to evaluate predictions in your application. Some other non-default metrics you might use include things like: 'f1' (for binary classification), 'roc_auc' (for binary classification), 'log_loss' (for classification), 'mean_absolute_error' (for regression), 'median_absolute_error' (for regression). You can also define your own custom metric function, see examples in the folder: autogluon/core/metrics/
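
The metric names listed above are easier to choose between once it is clear what each one measures. The following pure-Python sketch shows the textbook definitions, assuming binary labels encoded as 0/1. It is meant for intuition only and does not follow AutoGluon's custom-metric interface (for that, see the folder mentioned above).

```python
import math
from statistics import median


def f1(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def roc_auc(y_true, scores):
    """Chance that a random positive outscores a random negative (ties count 0.5).

    Needs at least one positive and one negative example.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-likelihood of the true labels under predicted P(y=1)."""
    total = 0.0
    for t, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= math.log(p) if t == 1 else math.log(1 - p)
    return total / len(y_true)


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def median_absolute_error(y_true, y_pred):
    return median(abs(t - p) for t, p in zip(y_true, y_pred))
```

For example, roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]) evaluates to 0.75, because three of the four positive/negative pairs are ranked correctly. Note that 'f1' needs hard class predictions while 'roc_auc' and 'log_loss' need scores or probabilities.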

Include all your data in train_data and do not provide tuning_data (AutoGluon will split the data more intelligently to fit its needs).
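
One common way to validate on all of the data without setting aside a fixed hold-out set is k-fold splitting, where every row is used for validation exactly once. The quoted text only says that AutoGluon splits the data itself, so treating k-fold as the scheme is an assumption made here for illustration. The function name k_fold_indices is hypothetical.

```python
def k_fold_indices(n_rows, k=5):
    """Yield (train_idx, val_idx) pairs; each row appears in exactly one val fold."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, val
        start += size
```

With a manually supplied tuning_data, the rows in it would never be used for training, which is one reason handing everything over as train_data can be the better deal.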

Do not specify the hyperparameter_tune_kwargs argument (counterintuitively, hyperparameter tuning is not the best way to spend a limited training time budget, as model ensembling is often superior). We recommend you only use hyperparameter_tune_kwargs if your goal is to deploy a single model rather than an ensemble.

Do not specify hyperparameters argument (allow AutoGluon to adaptively select which models/hyperparameters to use).

Set time_limit to the longest amount of time (in seconds) that you are willing to wait. AutoGluon’s predictive performance improves the longer fit() is allowed to run.
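
Why a larger time_limit tends to help can be seen with a generic "anytime" search loop: it keeps trying candidates until the budget runs out and always remembers the best one so far. This is a sketch of that general idea, not of what fit() does internally. The function name search_with_budget and the injectable clock parameter are assumptions made so the loop is easy to test.

```python
import time


def search_with_budget(score_fn, sample_fn, time_limit, clock=time.monotonic):
    """Sample candidates until time_limit seconds elapse; return the best found.

    Returns (best_candidate, best_score, number_of_candidates_tried).
    """
    deadline = clock() + time_limit
    best, best_score = None, float("-inf")
    tried = 0
    # Always evaluate at least one candidate, even with a tiny budget.
    while tried == 0 or clock() < deadline:
        candidate = sample_fn()
        score = score_fn(candidate)
        tried += 1
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score, tried
```

Because the best score is never discarded, a longer budget over the same candidate stream can only match or improve the result, at the cost of more waiting.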

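Overview, continued

Supported tasks
Classification and regression
Image recognition
Image prediction
Object detection
Text prediction
Multi-task prediction

In addition
Custom models are supported
Neural architecture search is supported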

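Background

Automated machine learning (AutoML) is the process of automating the end-to-end workflow of applying machine learning to real-world problems.

A traditional machine learning workflow can be roughly divided into four parts: data collection, data preprocessing, optimization, and application.
Of these, data preprocessing and model optimization usually have to be done by data scientists with specialist knowledge, who build the bridge from data to computation.

Yet even data scientists have to spend a great deal of effort on choosing algorithms and models.

The success of machine learning across many applications has driven a growing demand for machine learning practitioners. The goal is therefore machine learning in the true sense: automating as much of the work as possible and further lowering the barrier to entry, so that people without expertise in the field can also use machine learning to get their work done.
Starting from the traditional machine learning workflow, AutoML automates three aspects: feature engineering, model building, and hyperparameter optimization.
Source: https://zhuanlan.zhihu.com/p/93109455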