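Using a dataset from a Kaggle competition, this post records the steps for finding the best parameters with Bayesian global optimization and Gaussian processes.
1. Install the Bayesian global optimization library
Install the latest version with pip: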
pip install bayesian-optimization
2. Load the dataset
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
from scipy.stats import rankdata
from sklearn import metrics
import lightgbm as lgb
import warnings
import gc
pd.set_option('display.max_columns', 200)
train_df = pd.read_csv('../input/train.csv')
test_df = pd.read_csv('../input/test.csv')
Distribution of the target variable
target = 'target'
predictors = train_df.columns.values.tolist()[2:]
train_df.target.value_counts()
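Raw counts can be easier to judge as proportions. The sketch below uses a small made-up frame `toy_df` (not the competition data) to show `value_counts` with `normalize=True`, which returns class shares instead of counts.

```python
import pandas as pd

# Hypothetical stand-in for train_df: 9 negatives and 1 positive.
toy_df = pd.DataFrame({"target": [0] * 9 + [1]})

counts = toy_df["target"].value_counts()                # rows per class
ratios = toy_df["target"].value_counts(normalize=True)  # share per class

print(counts.to_dict())
print(ratios.to_dict())
```

On the real data, the same call on `train_df.target` would show how small the minority class is.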
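The classes are imbalanced. Here a stratified 50% of the rows is held out as a validation set for selecting the best parameters. 5-fold cross-validation is used later when fitting the final model.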
bayesian_tr_index, bayesian_val_index = list(StratifiedKFold(n_splits=2,
shuffle=True, random_state=1).split(train_df, train_df.target.values))[0]
The bayesian_tr_index and bayesian_val_index arrays are used in the Bayesian optimization as the row indices of the training and validation sets.
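To see what the stratified split buys, here is a minimal sketch on made-up labels (90 negatives, 10 positives; the arrays `X` and `y` are illustrative, not the competition data). Taking the first fold of a 2-fold `StratifiedKFold` gives two halves with the same class ratio.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 10% positives.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # features are irrelevant to the split itself

skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=1)
# First fold only: one half for training, the other for validation.
tr_idx, val_idx = next(skf.split(X, y))

print(len(tr_idx), len(val_idx))
print(y[tr_idx].mean(), y[val_idx].mean())  # positive rate in each half
```

Both halves keep the positive rate of the full set, so a score computed on the validation half is not distorted by an unlucky split of the rare class.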
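3. Black-box function to optimize (LightGBM)
With the data loaded, we create a black-box function for LightGBM whose parameters the optimizer will search over.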
def LGB_bayesian(
    num_leaves,  # int
    min_data_in_leaf,  # int
    learning_rate,
    min_sum_hessian_in_leaf,  # float
    feature_fraction,
    lambda_l1,
    lambda_l2,
    min_gain_to_split,
    max_depth):  # int
    # LightGBM expects the next three parameters to be integers, so cast them.
    num_leaves = int(num_leaves)
    min_data_in_leaf = int(min_data_in_leaf)
    max_depth = int(max_depth)
    assert type(num_leaves) == int
    assert type(min_data_in_leaf) == int
    assert type(max_depth) == int
    param = {
        'num_leaves': num_leaves,
        'max_bin': 63,
        'min_data_in_leaf': min_data_in_leaf,
        'learning_rate': learning_rate,
        'min_sum_hessian_in_leaf': min_sum_hessian_in_leaf,