1. Using categorical_feature (categorical features)
One improvement of LightGBM over XGBoost is its handling of categorical features: they no longer need to be converted to one-hot form (see here for details).
When using the Python API (see the official documentation), there are two ways to declare categorical features:
1.1 Store the features X in a pd.DataFrame, one column per feature, and mark the categorical columns with X[cat_cols] = X[cat_cols].astype('category'). The model will then recognize them automatically during fit.
1.2 Pass the categorical_feature parameter to the model's fit method, naming which columns are categorical.
1.3 Categorical feature values should be non-negative integers, ideally consecutive integers starting from 0 (0, 1, 2, ...); negative values are treated as missing (a minimal sketch of both options follows the documentation excerpt below).
The official documentation describes the categorical_feature parameter of fit as follows:
categorical_feature (list of strings or int, or 'auto', optional (default='auto')) – Categorical features. If list of int, interpreted as indices. If list of strings, interpreted as feature names (need to specify feature_name as well). If 'auto' and data is pandas DataFrame, pandas unordered categorical columns are used. All values in categorical features should be less than int32 max value (2147483647). Large values could be memory consuming. Consider using consecutive integers starting from zero. All negative values in categorical features will be treated as missing values. The output cannot be monotonically constrained with respect to a categorical feature.
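A minimal sketch of the two declaration options on toy data (the feature names num_feat/cat_feat and the data sizes are made up for illustration):

import numpy as np
import pandas as pd
from lightgbm import LGBMRegressor

X = pd.DataFrame({
    'num_feat': np.random.rand(100),
    'cat_feat': np.random.randint(0, 3, 100),  # consecutive integers from 0, per 1.3
})
y = np.random.rand(100)

# Option 1.1: mark the column as pandas category; with the default
# categorical_feature='auto', fit detects it automatically
X['cat_feat'] = X['cat_feat'].astype('category')
LGBMRegressor(n_estimators=10).fit(X, y)

# Option 1.2: keep the column as int and name it explicitly in fit
# (with a DataFrame, the column names double as feature names)
X2 = X.copy()
X2['cat_feat'] = X2['cat_feat'].astype(int)
LGBMRegressor(n_estimators=10).fit(X2, y, categorical_feature=['cat_feat'])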
I previously ran into a problem when using sklearn's GridSearchCV: the cat_cols columns in x_train and x_test had already been set to the category dtype, but after concat they reverted to int:
x_search = pd.concat([x_train, x_test], axis=0)
y_search = np.concatenate([train_y, test_y], axis=0)
gsearch.fit(x_search, y_search)  # search for the best parameters; gsearch is a sklearn GridSearchCV instance
So the category dtype needs to be set again after concat:
x_search = pd.concat([x_train, x_test], axis=0)
x_search[cat_cols] = x_search[cat_cols].astype('category')
y_search = np.concatenate([train_y, test_y], axis=0)
gsearch.fit(x_search, y_search)  # search for the best parameters
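A self-contained sketch of the pitfall (the exact fallback dtype depends on the pandas version and on whether the two frames share identical category sets; the column name is made up):

import pandas as pd

a = pd.DataFrame({'c': pd.Series([0, 1]).astype('category')})
b = pd.DataFrame({'c': pd.Series([0, 1, 2]).astype('category')})
merged = pd.concat([a, b], axis=0)
print(merged.dtypes)  # 'c' is no longer category: the two category sets differ
merged['c'] = merged['c'].astype('category')  # re-mark before fitting / searching
print(merged.dtypes)  # category again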
2. Using init_score
init_score is the estimator's initial score. In a regression task, passing an init_score can help the model converge faster, since the boosted trees then only need to fit the residual relative to that baseline.
To use it, simply pass the init_score parameter to fit; at predict time, this init_score must be added back manually:
model.fit(
    # pd.concat([x_train, x_val], axis=0),       # alternative: train on train+val combined
    # np.concatenate([train_y, val_y], axis=0),
    x_train,
    train_y,
    init_score=y_train_base_avg1,
    eval_metric=['mape'],
    eval_set=[(x_val, val_y)],
    early_stopping_rounds=20,
    eval_init_score=[y_val_base_avg1],
    verbose=True
)
...
y_train_pre = model.predict(x_train) + y_train_base_avg1  # add the init_score back
The above is the regression case; what about classification? And how does this combine with GridSearchCV (given that predict must manually add the init_score back)?
See here and here ("init_score is the raw score, before any transformation.").
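For binary classification this means: pass the baseline in log-odds space, predict with raw_score=True, add the baseline back, and apply the sigmoid yourself. A hedged sketch (y_train_raw_base/y_val_raw_base are assumed per-sample baselines derived from the class prior; they are not from the original post):

import numpy as np
from lightgbm import LGBMClassifier

# assumed baseline: class prior converted to log-odds
p = np.clip(train_y.mean(), 1e-6, 1 - 1e-6)
y_train_raw_base = np.full(len(train_y), np.log(p / (1 - p)))
y_val_raw_base = np.full(len(val_y), np.log(p / (1 - p)))

clf = LGBMClassifier(n_estimators=50)
clf.fit(
    x_train,
    train_y,
    init_score=y_train_raw_base,
    eval_set=[(x_val, val_y)],
    eval_init_score=[y_val_raw_base],
)

# predictions do NOT include the init_score, so add it back in raw space,
# then apply the sigmoid to recover probabilities
raw = clf.predict(x_train, raw_score=True) + y_train_raw_base
prob = 1.0 / (1.0 + np.exp(-raw))
y_train_pre = (prob > 0.5).astype(int)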
▲▲▲ Use ParameterGrid instead of GridSearchCV; more custom operations can be inserted between steps:
1. early_stopping can be used, which GridSearchCV does not support
2. LightGBM's init_score is supported
...
import numpy as np
from lightgbm import LGBMRegressor
from sklearn.model_selection import ParameterGrid
...
parameters = {
    'objective': ['regression', 'regression_l1'],
    'max_depth': [2, 3, 4, 5],
    'num_leaves': [20, 25, 30, 35],
    'n_estimators': [20, 25, 30, 35, 40, 50],
    'min_child_samples': [15, 20, 25, 30],
    # 'subsample_freq': [0, 2, 5],
    # 'subsample': [0.7, 0.8, 0.9, 1],
    # 'colsample_bytree': [0.8, 0.9, 1]
}
...
default_params = model.get_params()  # get the parameters predefined earlier
best_score = np.inf
best_params = None
best_idx = 0
param_grid = list(ParameterGrid(parameters))  # list of all parameter combinations
for idx, param_ in enumerate(param_grid):
    param = default_params.copy()
    param.update(param_)
    model = LGBMRegressor(**param)
    model.fit(
        x_train,
        train_y,
        init_score=y_train_base_avg1,
        eval_metric=['mape'],
        eval_set=[(x_val, val_y)],
        early_stopping_rounds=20,
        eval_init_score=[y_val_base_avg1],
        verbose=False
    )
    score_ = model.best_score_['valid_0']['mape']  # best score of the current model on the val set
    print('for %d/%d, score: %.6f, best idx in %d/%d' % (idx + 1, len(param_grid), score_, best_idx, len(param_grid)))
    if score_ < best_score:
        best_params = param
        best_score = score_
        best_idx = idx + 1
        print('find best score: {}, \nbest params: {}'.format(best_score, best_params))
print('\nbest score: {}, \nbest params: {}\n'.format(best_score, best_params))
model = LGBMRegressor(**best_params)
model.fit(
    x_train,
    train_y,
    init_score=y_train_base_avg1,
    eval_metric=['mape'],
    eval_set=[(x_val, val_y)],
    early_stopping_rounds=20,
    eval_init_score=[y_val_base_avg1],
    verbose=True
)
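Note: in LightGBM >= 4.0 the early_stopping_rounds and verbose arguments were removed from the sklearn fit method; on recent versions the equivalent goes through callbacks (a sketch reusing the variables above):

import lightgbm as lgb

model.fit(
    x_train,
    train_y,
    init_score=y_train_base_avg1,
    eval_metric=['mape'],
    eval_set=[(x_val, val_y)],
    eval_init_score=[y_val_base_avg1],
    callbacks=[lgb.early_stopping(stopping_rounds=20), lgb.log_evaluation(period=1)],
)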