Background
LightGBM offers two main interfaces: the native interface and the scikit-learn interface.
Besides differences in parameter passing and call conventions, the latter's save/load is done through sklearn tooling.
API reference: https://lightgbm.readthedocs.io/en/latest/Python-API.html
Training
Native interface: call lgb.train(); hyperparameters are passed in as a dict.
import lightgbm as lgb

lgb_train = lgb.Dataset(data=train_x, label=train_y)
lgb_valid = lgb.Dataset(data=valid_x, label=valid_y)
params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': {'l2', 'auc', 'binary_logloss'},  # 'l2' (lowercase L), not '12'
    'num_leaves': 31,
    'num_trees': 100,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
}
gbm = lgb.train(params=params,
                train_set=lgb_train,
                num_boost_round=10,
                valid_sets=lgb_valid,
                early_stopping_rounds=50)  # in LightGBM >= 4, use callbacks=[lgb.early_stopping(50)] instead
sklearn interface: first instantiate an estimator object, then train it with fit().
gbm = lgb.LGBMRegressor(
    boosting_type='gbdt', objective='regression', metric='rmse',
    learning_rate=0.05, num_leaves=31, max_depth=-1, n_estimators=1000,
    subsample=0.7, subsample_freq=1, colsample_bytree=0.7)
gbm.fit(train_x, train_y,
        early_stopping_rounds=None)
pred_test = gbm.predict(test_x)  # a regressor has no predict_proba; that method belongs to LGBMClassifier
Cross-validation
lgb.cv
# lgb.cv is the cross-validated counterpart of train. The plain train call:
num_round = 10
bst = lgb.train(param, train_data, num_round, valid_sets=[test_data])
# upgraded to 5-fold cross-validation:
num_round = 10
lgb.cv(param, train_data, num_round, nfold=5)
Custom loss and eval functions
loss func: define a function that takes the target values and the predictions (both array-like) and returns two arrays: the gradient and the hessian for each observation. We derive the gradient and hessian with calculus, then implement them in Python.
eval func: customizing the validation loss in LightGBM requires a function that takes the same two arrays but returns three values: a string naming the metric (for printing), the loss value itself, and a boolean indicating whether higher is better.
import numpy as np

def custom_asymmetric_train(y_true, y_pred):
    residual = (y_true - y_pred).astype("float")
    grad = np.where(residual < 0, -2 * 10.0 * residual, -2 * residual)
    hess = np.where(residual < 0, 2 * 10.0, 2.0)
    return grad, hess

def custom_asymmetric_valid(y_true, y_pred):
    residual = (y_true - y_pred).astype("float")
    loss = np.where(residual < 0, (residual ** 2) * 10.0, residual ** 2)
    return "custom_asymmetric_eval", np.mean(loss), False
#https://cloud.tencent.com/developer/article/1357671
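A useful habit before plugging a hand-derived objective into LightGBM is to sanity-check the gradient against finite differences of the loss. The helpers below restate the asymmetric loss above (they are illustrative, not part of the original article):

```python
import numpy as np

def asymmetric_loss(y_true, y_pred):
    residual = (y_true - y_pred).astype("float")
    return np.where(residual < 0, (residual ** 2) * 10.0, residual ** 2)

def asymmetric_grad_hess(y_true, y_pred):
    residual = (y_true - y_pred).astype("float")
    grad = np.where(residual < 0, -2 * 10.0 * residual, -2 * residual)
    hess = np.where(residual < 0, 2 * 10.0, 2.0)
    return grad, hess

y_true = np.array([1.0, 1.0])
y_pred = np.array([2.0, 0.5])  # one over-prediction, one under-prediction
grad, hess = asymmetric_grad_hess(y_true, y_pred)

# central finite difference of the loss w.r.t. y_pred
eps = 1e-6
fd = (asymmetric_loss(y_true, y_pred + eps)
      - asymmetric_loss(y_true, y_pred - eps)) / (2 * eps)
print(np.allclose(grad, fd, atol=1e-4))  # True
```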
With the sklearn interface
********* Sklearn API **********
# default lightgbm model with sklearn api
gbm = lightgbm.LGBMRegressor()
# set our custom loss function as the objective (this can also be done when instantiating gbm)
gbm.set_params(**{'objective': custom_asymmetric_train}, metrics=["mse", 'mae'])
# pass the custom valid function into fit()
gbm.fit(
    X_train,
    y_train,
    eval_set=[(X_valid, y_valid)],
    eval_metric=custom_asymmetric_valid,
    verbose=False,
)
y_pred = gbm.predict(X_valid)
With the native interface
********* Python API **********
# create dataset for lightgbm
# if you want to re-use data, remember to set free_raw_data=False
lgb_train = lgb.Dataset(X_train, y_train, free_raw_data=False)
lgb_eval = lgb.Dataset(X_valid, y_valid, reference=lgb_train, free_raw_data=False)
# specify your configurations as a dict
params = {
'objective': 'regression',
'verbose': 0
}
# the native fobj/feval hooks receive (preds, train_data), so wrap the
# sklearn-style (y_true, y_pred) functions defined above
def asymmetric_train_native(preds, train_data):
    return custom_asymmetric_train(train_data.get_label(), preds)

def asymmetric_valid_native(preds, train_data):
    return custom_asymmetric_valid(train_data.get_label(), preds)

gbm = lgb.train(params,
                lgb_train,
                num_boost_round=10,
                init_model=gbm,  # continue training from the earlier model
                fobj=asymmetric_train_native,
                feval=asymmetric_valid_native,
                valid_sets=lgb_eval)
y_pred = gbm.predict(X_valid)
Predicting with the best iteration
ypred = bst.predict(data, num_iteration=bst.best_iteration)
Saving the model
gbm.save_model('model.txt', num_iteration=gbm.best_iteration)  # native Booster attribute is best_iteration (no trailing underscore)
bst.save_model('model.txt')
bst = lgb.Booster(model_file='model.txt')  # init model from file
https://github.com/Microsoft/LightGBM/issues/1217#issuecomment-360352312
I see, for the sklearn model save/load, you can use joblib.
example:
import joblib  # `from sklearn.externals import joblib` is removed in modern scikit-learn
# save model
joblib.dump(lgbmodel, 'lgb.pkl')
# load model
gbm_pickle = joblib.load('lgb.pkl')
Inspecting attributes
For the sklearn API
model.best_iteration
# a wrapper around model.best_iteration_; the underlying attribute can also be accessed directly.
model.