1. Summary
This article explains how to use NNI, Microsoft's automated machine learning (AutoML) framework, to tune the hyperparameters of LightGBM.
The main steps:
- Define the parameters to be tuned and save them as a JSON file (the search space)
- Write a yml file that configures the tuning algorithm and the relevant files
- Write a Python file that uses nni to fetch parameters and report tuning results (see the minimal trial sketch after this list)
- Run the yml file with the nnictl create command to start tuning
- Put the tuned parameters back into the default parameters and run the Python file again to train the final model with the optimized settings
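Every step above revolves around NNI's trial API: a trial asks the tuner for a parameter sample, trains with it, and reports a metric back. Below is a minimal sketch of that pattern, where train_and_evaluate is a hypothetical placeholder for real training code (the full main.py appears in section 4):

import nni

def train_and_evaluate(params):
    # Hypothetical stand-in for real training code; returns a dummy metric.
    return 0.42

# Receive one sampled configuration from the tuner (a plain dict).
params = nni.get_next_parameter()

# Train with the sampled parameters and compute a validation metric.
metric = train_and_evaluate(params)

# Report the metric back; the tuner uses it to propose the next sample.
nni.report_final_result(metric)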
2. The Data
Please download the data from GitHub yourself; see the learning link at the end of this article.
regression.train is the regression training data; its first column is the label column.
regression.test is the regression test data; its first column is the label column.
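A quick sanity check of that layout (a sketch; the path and tab separator follow the load_data function in section 4):

import pandas as pd

# Both files are tab-separated with no header row; column 0 is the label.
df = pd.read_csv('./data/regression.train', header=None, sep='\t')
print(df.shape)      # (number of rows, 1 label column + feature columns)
print(df[0].head())  # label values from the first column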
3. Related Techniques
NNI (Neural Network Intelligence) is a lightweight but powerful toolkit that helps users automate feature engineering, neural architecture search, hyperparameter tuning, and model compression.
NNI manages AutoML experiments and schedules the trial jobs generated by tuning algorithms to find the best neural architecture and/or hyperparameters.
GBDT (Gradient Boosting Decision Tree) is an enduring machine learning model. Its core idea is to iteratively train weak learners (decision trees) to obtain an optimal ensemble; it trains well and is relatively resistant to overfitting. GBDT is widely used in industry, typically for tasks such as multi-class classification, click-through-rate prediction, and search ranking; it is also a formidable weapon in data mining competitions, where reportedly more than half of the winning solutions on Kaggle have been based on GBDT. LightGBM (Light Gradient Boosting Machine) is a framework that implements the GBDT algorithm with efficient parallel training, offering faster training speed, lower memory consumption, better accuracy, and distributed support for quickly processing massive datasets.
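As a minimal, self-contained illustration of the LightGBM training API used later (synthetic data, purely for demonstration):

import lightgbm as lgb
import numpy as np

# Synthetic regression data, for illustration only.
rng = np.random.RandomState(0)
X, y = rng.rand(500, 10), rng.rand(500)

train_set = lgb.Dataset(X[:400], y[:400])
valid_set = lgb.Dataset(X[400:], y[400:], reference=train_set)

params = {'objective': 'regression', 'metric': 'l2', 'verbose': -1}
booster = lgb.train(params, train_set, num_boost_round=10, valid_sets=[valid_set])
print(booster.predict(X[400:])[:5])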
(Source: https://blog.csdn.net/weixin_44023658/article/details/106732861)
4. Complete Code and Steps
First, install nni and lightgbm:
pip install nni lightgbm
The search space JSON file (search_space.json, referenced by the yml below) is as follows:
{
    "num_leaves": {"_type": "randint", "_value": [20, 31]},
    "learning_rate": {"_type": "choice", "_value": [0.01, 0.05, 0.1, 0.2]},
    "bagging_fraction": {"_type": "uniform", "_value": [0.7, 1.0]},
    "feature_fraction": {"_type": "uniform", "_value": [0.7, 1.0]},
    "bagging_freq": {"_type": "choice", "_value": [1, 2, 4, 8, 10]}
}
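Here randint samples an integer from [20, 31) (upper bound exclusive), choice picks one of the listed values, and uniform draws a float from the given range. A configuration handed to a trial is just a plain dict; for example, it might look like this (values taken from the tuned parameters shown later in main.py):

# One possible sample from the search space above
# (values taken from the tuned parameters shown later):
received = {
    'num_leaves': 28,
    'learning_rate': 0.2,
    'bagging_fraction': 0.7656,
    'feature_fraction': 0.93,
    'bagging_freq': 4
}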
The yml file (config.yml) is as follows:
authorName: default
experimentName: example_auto-gbdt
trialConcurrency: 1
maxExecDuration: 10h
maxTrialNum: 20
#choice: local, remote, pai
trainingServicePlatform: local
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
  #choice: TPE, Random, Anneal, Evolution, BatchTuner, MetisTuner, GPTuner
  #SMAC (SMAC should be installed through nnictl)
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: minimize
trial:
  command: python main.py
  codeDir: .
  gpuNum: 0
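A few fields are worth noting: trialConcurrency controls how many trials run in parallel, maxTrialNum caps the experiment at 20 parameter samples, searchSpacePath points at the JSON file above, and the TPE tuner uses optimize_mode: minimize because the trial reports RMSE, which should be as small as possible.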
Running the program produces output like the following:
You can set force_col_wise=true to remove the overhead.
[1] valid_0's l2: 0.249547 valid_0's auc: 0.717097
Training until validation scores don't improve for 5 rounds
[2] valid_0's l2: 0.248911 valid_0's auc: 0.748828
[3] valid_0's l2: 0.24806 valid_0's auc: 0.755783
[4] valid_0's l2: 0.247232 valid_0's auc: 0.767229
[5] valid_0's l2: 0.246511 valid_0's auc: 0.767581
[6] valid_0's l2: 0.245987 valid_0's auc: 0.76611
[7] valid_0's l2: 0.245194 valid_0's auc: 0.767887
[8] valid_0's l2: 0.244439 valid_0's auc: 0.769296
[9] valid_0's l2: 0.243661 valid_0's auc: 0.770051
[10] valid_0's l2: 0.242782 valid_0's auc: 0.772751
[11] valid_0's l2: 0.242057 valid_0's auc: 0.772154
[12] valid_0's l2: 0.241234 valid_0's auc: 0.773714
[13] valid_0's l2: 0.240547 valid_0's auc: 0.774013
[14] valid_0's l2: 0.239834 valid_0's auc: 0.773551
[15] valid_0's auc: 0.775091 valid_0's l2: 0.194493
Start predicting...
The rmse of prediction is: 0.4195613807319431
[2021-06-20 12:27:06] INFO (nni/MainThread) Final result: 0.4195613807319431
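The last line is the value the trial passed to nni.report_final_result(); since optimize_mode is minimize, TPE steers subsequent trials toward parameters that lower this RMSE.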
The main program (main.py):
'''
This project is for automatically tuning parameters for GBDT.
A Trial is one independent attempt at applying a set of parameters (e.g., hyperparameters) to the model.
To define an NNI Trial, first define the parameter set (i.e., the search space), then update the model code.
Run with: nnictl create --config config.yml -p 8888
'''
import logging
import lightgbm as lgb
import nni
import pandas as pd
from sklearn.metrics import mean_squared_error
LOG = logging.getLogger('auto-gbdt')
# specify your configurations as a dict
def get_default_parameters():
    # Default parameters before tuning:
    # params = {
    #     'boosting_type': 'gbdt',
    #     'objective': 'regression',
    #     'metric': {'l2', 'auc'},
    #     'num_leaves': 31,
    #     'learning_rate': 0.05,
    #     'feature_fraction': 0.9,
    #     'bagging_fraction': 0.8,
    #     'bagging_freq': 5,
    #     'verbose': 0
    # }
    # The rmse of prediction is: 0.450357
    params = {
        'boosting_type': 'gbdt',
        'objective': 'regression',
        'metric': {'l2', 'auc'},
        'num_leaves': 28,
        'learning_rate': 0.2,
        'feature_fraction': 0.930,
        'bagging_fraction': 0.7656,
        'bagging_freq': 4,
        'verbose': 0
    }
    # After tuning: The rmse of prediction is: 0.419561
    return params
def load_data(train_path='./data/regression.train', test_path='./data/regression.test'):
    '''
    Load or create dataset
    '''
    print('Load data...')
    df_train = pd.read_csv(train_path, header=None, sep='\t')
    df_test = pd.read_csv(test_path, header=None, sep='\t')
    num = len(df_train)
    split_num = int(0.9 * num)

    y_train = df_train[0].values
    y_test = df_test[0].values
    y_eval = y_train[split_num:]
    y_train = y_train[:split_num]
    X_train = df_train.drop(0, axis=1).values
    X_test = df_test.drop(0, axis=1).values
    X_eval = X_train[split_num:, :]
    X_train = X_train[:split_num, :]

    # create dataset for lightgbm
    lgb_train = lgb.Dataset(X_train, y_train)
    lgb_eval = lgb.Dataset(X_eval, y_eval, reference=lgb_train)
    return lgb_train, lgb_eval, X_test, y_test
def run(lgb_train, lgb_eval, params, X_test, y_test):
    print('Start training...')
    # num_leaves may arrive from the tuner as a float, so cast it
    params['num_leaves'] = int(params['num_leaves'])

    # train
    # (on LightGBM >= 4.0, pass callbacks=[lgb.early_stopping(5)]
    # instead of the early_stopping_rounds keyword)
    gbm = lgb.train(params,
                    lgb_train,
                    num_boost_round=20,
                    valid_sets=lgb_eval,
                    early_stopping_rounds=5)

    print('Start predicting...')
    # predict
    y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)

    # eval
    rmse = mean_squared_error(y_test, y_pred) ** 0.5
    print('The rmse of prediction is:', rmse)

    # report the final metric back to NNI
    nni.report_final_result(rmse)
if __name__ == '__main__':
    lgb_train, lgb_eval, X_test, y_test = load_data()
    try:
        # get parameters from tuner
        RECEIVED_PARAMS = nni.get_next_parameter()
        LOG.debug(RECEIVED_PARAMS)
        PARAMS = get_default_parameters()
        PARAMS.update(RECEIVED_PARAMS)
        LOG.debug(PARAMS)
        # train
        run(lgb_train, lgb_eval, PARAMS, X_test, y_test)
    except Exception as exception:
        LOG.exception(exception)
        raise
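Once the experiment finishes, the best trial's parameters can be read from the web UI and pasted into get_default_parameters(); that is exactly where the tuned values above (num_leaves=28, learning_rate=0.2, and so on) came from. Running main.py on its own afterwards trains the final model with those parameters.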
How to run:
nnictl create --config config.yml -p 8888 --debug
Then open the NNI web UI in a browser (at the port given by -p, here 8888) to follow the experiment.
After tuning, the RMSE dropped by about 0.03 (from 0.450357 with the default parameters to 0.419561).
5. Learning Links
A Usage Guide to Microsoft's New Tool NNI: Analysis of the Mnist-annotation Example