If you have watched others use XGBoost, you may have noticed that its Python package actually exposes two APIs: the native (learning) API, and a scikit-learn-compatible wrapper. This post briefly summarizes the differences between the two.
For reference, the XGBoost Chinese documentation and the XGBoost Python API documentation are both handy to keep open while reading.
1. Testing both versions on the same dataset
1.1 Preparing the dataset
Here we generate binary-classification data directly with scikit-learn's make_hastie_10_2 (the dataset from Hastie et al., Example 10.2).
from sklearn.model_selection import train_test_split
from pandas import DataFrame
from sklearn import metrics
from sklearn.datasets import make_hastie_10_2
from xgboost.sklearn import XGBClassifier
import xgboost as xgb
import pandas as pd
# Prepare the data. y is originally in {-1, 1}; the native XGBoost interface requires labels in {0, 1}, so map -1 to 0.
X, y = make_hastie_10_2(random_state=0)
X = DataFrame(X)
y = DataFrame(y)
y.columns = ["label"]  # use a list, not a set: a set has no defined order for column names
label={-1:0,1:1}
y.label=y.label.map(label)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)  # split into train/test sets
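As a quick sanity check, here is a standalone sketch (independent of the training code in this post) confirming the shapes and label mapping above: make_hastie_10_2 produces 12,000 samples with 10 features and labels in {-1, 1}, and the 80/20 split yields 9,600 training and 2,400 test rows.

```python
import pandas as pd
from sklearn.datasets import make_hastie_10_2
from sklearn.model_selection import train_test_split

X, y = make_hastie_10_2(random_state=0)
print(X.shape)              # (12000, 10)
print(sorted(set(y)))       # [-1.0, 1.0]

# Map {-1, 1} -> {0, 1}, as the native binary:logistic objective expects
y01 = pd.Series(y).map({-1: 0, 1: 1})

X_train, X_test, y_train, y_test = train_test_split(X, y01, test_size=0.2, random_state=0)
print(X_train.shape, X_test.shape)  # (9600, 10) (2400, 10)
```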
1.2 Training on the dataset with identical parameters in both versions
# XGBoost native interface
params={
'eta': 0.3,
'max_depth':3,
'min_child_weight':1,
'gamma':0.3,
'subsample':0.8,
'colsample_bytree':0.8,
'booster':'gbtree',
'objective': 'binary:logistic',
'nthread':12,
'scale_pos_weight': 1,
'lambda':1,
'seed':27,
'silent': 0,  # note: deprecated in recent XGBoost releases; 'verbosity' replaces it
'eval_metric': 'auc'
}
d_train = xgb.DMatrix(X_train, label=y_train)
d_valid = xgb.DMatrix(X_test, label=y_test)
d_test = xgb.DMatrix(X_test)
watchlist = [(d_train, 'train'), (d_valid, 'valid')]
# sklearn wrapper interface
clf = XGBClassifier(
n_estimators=30,  # 30 trees
learning_rate =0.3,
max_depth=3,
min_child_weight=1,
gamma=0.3,
subsample=0.8,
colsample_bytree=0.8,
objective= 'binary:logistic',
nthread=12,
scale_pos_weight=1,
reg_lambda=1,
seed=27)
watchlist2 = [(X_train,y_train),(X_test,y_test)]
print("Training with the XGBoost native interface:")
# 30 boosting rounds; with only 30 rounds, early_stopping_rounds=500 can never trigger
model_bst = xgb.train(params, d_train, 30, watchlist, early_stopping_rounds=500, verbose_eval=10)
print("Training with the XGBoost sklearn interface:")
model_sklearn = clf.fit(X_train, y_train, eval_set=watchlist2, eval_metric='auc', verbose=10, early_stopping_rounds=500)
y_bst = model_bst.predict(d_test)            # native interface: positive-class probabilities
y_sklearn = clf.predict_proba(X_test)[:, 1]  # sklearn interface: take the P(y=1) column
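Note why the two predictions are directly comparable: under objective='binary:logistic', the native Booster.predict() already returns positive-class probabilities, while the sklearn wrapper's predict_proba() returns one column per class, so we take column 1. A small numpy sketch of that relationship (the probabilities here are made up, not model output):

```python
import numpy as np

# Hypothetical positive-class probabilities, as the native interface would return
p = np.array([0.2, 0.7, 0.9])

# predict_proba on the sklearn wrapper returns both columns: [P(y=0), P(y=1)]
proba = np.column_stack([1 - p, p])

print(proba[:, 1])  # [0.2 0.7 0.9] -- identical to the native output
```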
Training with the XGBoost native interface:
[0] train-auc:0.608992 valid-auc:0.579947
Multiple eval metrics have been passed: 'valid-auc' will be used for early stopping.
Will train until valid-auc hasn't improved in 500 rounds.
[10] train-auc:0.940251 valid-auc:0.920879
[20] train-auc:0.973669 valid-auc:0.959898
[29] train-auc:0.983232 valid-auc:0.970292
Training with the XGBoost sklearn interface:
[0] validation_0-auc:0.608992 validation_1-auc:0.579947
Multiple eval metrics have been passed: 'validation_1-auc' will be used for early stopping.
Will train until validation_1-auc hasn't improved in 500 rounds.
[10] validation_0-auc:0.940251 validation_1-auc:0.920879
[20] validation_0-auc:0.973669 validation_1-auc:0.959898
[29] validation_0-auc:0.983232 validation_1-auc:0.970292
1.3 Printing the evaluation results
print("XGBoost native interface AUC Score : %f" % metrics.roc_auc_score(y_test, y_bst))
print("XGBoost sklearn interface AUC Score : %f" % metrics.roc_auc_score(y_test, y_sklearn))
# Convert the predicted probabilities to 0/1 labels with a 0.5 threshold
y_bst = (y_bst >= 0.5).astype(int)
y_sklearn = (y_sklearn >= 0.5).astype(int)
# These are accuracy scores, not AUC (the original labels here were misleading)
print("XGBoost native interface Accuracy : %f" % metrics.accuracy_score(y_test, y_bst))
print("XGBoost sklearn interface Accuracy : %f" % metrics.accuracy_score(y_test, y_sklearn))
XGBoost native interface AUC Score : 0.970292
XGBoost sklearn interface AUC Score : 0.970292
XGBoost native interface Accuracy : 0.897917
XGBoost sklearn interface Accuracy : 0.897917
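The probability-to-label step above is nothing more than a 0.5 threshold followed by a mean-of-matches accuracy. A minimal numpy sketch of exactly that arithmetic, with made-up probabilities and labels rather than the model output above:

```python
import numpy as np

# Hypothetical positive-class probabilities and ground-truth labels
probs = np.array([0.12, 0.55, 0.50, 0.49, 0.93])
true = np.array([0, 1, 0, 0, 1])

# Same rule as in the text: >= 0.5 -> class 1, otherwise class 0
labels = (probs >= 0.5).astype(int)
print(labels)  # [0 1 1 0 1]

# Accuracy is the fraction of positions where prediction == truth
acc = (labels == true).mean()
print(acc)     # 0.8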
2. Differences between the two versions
| Difference | Native API | sklearn wrapper |
|---|---|---|
| Data must be converted to DMatrix | Yes | No |
| Learning rate (step size) | eta | learning_rate |
| Number of boosting rounds (trees) | num_boost_round argument of xgb.train() | n_estimators |
| Training call | xgb.train() | clf.fit() |
| Form of the watchlist | watchlist = [(d_train, 'train'), (d_valid, 'valid')] | watchlist2 = [(X_train, y_train), (X_test, y_test)] |
| L2 regularization weight | lambda | reg_lambda |
| L1 regularization weight | alpha | reg_alpha |
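The parameter renames in the table can be captured in a small lookup dict. This is just a sketch covering the names discussed above (the helper name is made up, not part of XGBoost); parameters shared by both APIs, like max_depth, pass through unchanged.

```python
# Native-API parameter name -> sklearn-wrapper equivalent (from the table above)
NATIVE_TO_SKLEARN = {
    "eta": "learning_rate",
    "num_boost_round": "n_estimators",
    "lambda": "reg_lambda",  # L2 regularization weight
    "alpha": "reg_alpha",    # L1 regularization weight
}

def to_sklearn_name(name):
    """Translate a native parameter name; shared names pass through unchanged."""
    return NATIVE_TO_SKLEARN.get(name, name)

print(to_sklearn_name("eta"))        # learning_rate
print(to_sklearn_name("max_depth"))  # max_depth
```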