Hyper-parameter optimization frameworks have been a hot topic over the past couple of months. With several packages released and more still in development, picking one has become a tough choice. Such frameworks not only help fit an accurate model but can boost data scientists' efficiency to the next level. Here I show how Optuna, a recently popular framework, can be used to find the best parameters for any Scikit-learn model. I have only implemented Random Forest and Logistic Regression as examples, but other algorithms can be implemented in the same way shown here.
Why Optuna?
Optuna can become one of your workhorse tools if integrated into everyday experimentation. I was deeply impressed by how little effort it took to implement Logistic Regression tuning with Optuna. Here are a few reasons why I like Optuna:
- Easy-to-use API
- Great documentation
- Flexibility to accommodate any algorithm
- Features like pruning and great built-in visualization modules
Documentation: https://optuna.readthedocs.io/en/stable/index.html
Github: https://github.com/optuna/optuna
Before we start looking at the functionality, we need to make sure the prerequisite packages are installed:
- Optuna
- Plotly
- Pandas
- Scikit-learn
Basic parameters and definitions:
Setting up the basic framework is straightforward. It can be divided broadly into four steps:
- Define an objective function (Step 1)
- Define a set of hyperparameters to try (Step 2)
- Define the variable/metric you want to optimize (Step 3)
- Finally, run the study (Step 4). Here you need to specify:
- whether the scoring function/variable you are trying to optimize should be maximized or minimized
- the number of trials you want to run. The more hyper-parameters and the more trials you define, the more computationally expensive the search becomes (unless you have a beefy machine or a GPU!)
In the Optuna world, a Trial is a single call of the objective function, and multiple such Trials together are called a Study.
Following is a basic implementation of Random Forest and Logistic Regression from the scikit-learn package:
# Importing the packages:
import optuna
import pandas as pd
from sklearn import linear_model
from sklearn import ensemble
from sklearn import datasets
from sklearn import model_selection

# Grabbing a sklearn classification dataset:
X, y = datasets.load_breast_cancer(return_X_y=True, as_frame=True)

# Step 1. Define an objective function to be maximized.
def objective(trial):
    classifier_name = trial.suggest_categorical("classifier", ["LogReg", "RandomForest"])

    # Step 2. Set up values for the hyperparameters:
    if classifier_name == "LogReg":
        logreg_c = trial.suggest_float("logreg_c", 1e-10, 1e10, log=True)
        classifier_obj = linear_model.LogisticRegression(C=logreg_c)
    else:
        rf_n_estimators = trial.suggest_int("rf_n_estimators", 10, 1000)
        rf_max_depth = trial.suggest_int("rf_max_depth", 2, 32, log=True)
        classifier_obj = ensemble.RandomForestClassifier(
            max_depth=rf_max_depth, n_estimators=rf_n_estimators
        )

    # Step 3. Scoring method:
    score = model_selection.cross_val_score(classifier_obj, X, y, n_jobs=-1, cv=3)
    accuracy = score.mean()
    return accuracy

# Step 4. Running it
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
When you run the above code, the output will look something like this: