Hyper-parameter optimization frameworks have been a hot topic over the past couple of months. With several packages released and more still in development, picking one has become a tough choice. Such frameworks not only help fit an accurate model but can boost data scientists' efficiency to the next level. Here I show how Optuna, a recently popular framework, can be used to find the best parameters for any Scikit-learn model. I have only implemented Random Forest and Logistic Regression as examples, but other algorithms can be implemented in the same way shown here.
Why Optuna?
Optuna can become one of your workhorse tools if integrated into everyday experimentation. I was deeply impressed by how little effort it took to implement Logistic Regression tuning with Optuna. Here are a few reasons why I like Optuna:
- Easy-to-use API
- Great documentation
- Flexibility to accommodate any algorithm
- Features like pruning and great built-in visualization modules
Documentation: https://optuna.readthedocs.io/en/stable/index.html
Github: https://github.com/optuna/optuna
Before we start looking at the functionality, we need to make sure the prerequisite packages are installed:
- Optuna
- Plotly
- Pandas
- Scikit-learn
Basic parameters and definitions:
Setting up the basic framework is straightforward. It can be divided broadly into four steps:
- Define an objective function (Step 1)
- Define a set of hyperparameters to try (Step 2)
- Define the variable/metric you want to optimize (Step 3)
- Finally, run the study (Step 4). Here you need to specify:
- whether the scoring function/variable you are trying to optimize should be maximized or minimized
- the number of trials you want to run. The more hyper-parameters and the more trials you define, the more computationally expensive the search becomes (unless you have a beefy machine or a GPU!)
In the Optuna world, a Trial is a single call of the objective function, and multiple such Trials together are called a Study.
Following is a basic implementation of Random Forest and Logistic Regression from the scikit-learn package:
# Importing the packages:
import optuna
import pandas as pd
from sklearn import linear_model
from sklearn import ensemble
from sklearn import datasets
from sklearn import model_selection

# Grabbing a sklearn classification dataset:
X, y = datasets.load_breast_cancer(return_X_y=True, as_frame=True)

# Step 1. Define an objective function to be maximized.
def objective(trial):
    classifier_name = trial.suggest_categorical("classifier", ["LogReg", "RandomForest"])

    # Step 2. Set up values for the hyperparameters:
    if classifier_name == "LogReg":
        logreg_c = trial.suggest_float("logreg_c", 1e-10, 1e10, log=True)
        classifier_obj = linear_model.LogisticRegression(C=logreg_c)
    else:
        rf_n_estimators = trial.suggest_int("rf_n_estimators", 10, 1000)
        rf_max_depth = trial.suggest_int("rf_max_depth", 2, 32, log=True)
        classifier_obj = ensemble.RandomForestClassifier(
            max_depth=rf_max_depth, n_estimators=rf_n_estimators
        )

    # Step 3. Scoring method:
    score = model_selection.cross_val_score(classifier_obj, X, y, n_jobs=-1, cv=3)
    accuracy = score.mean()
    return accuracy

# Step 4. Running it
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
When you run the above code, the output will look something like this: