XGBoost version comparison (native interface vs. sklearn interface)


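If you have read other people's XGBoost code, you will have noticed that the Python package comes in two flavors: the native interface, and a sklearn-style interface designed to fit into scikit-learn workflows. This post briefly summarizes the differences between the two.
For reference, see the XGBoost Chinese documentation and the XGBoost Python documentation.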

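1. Testing both versions on the same dataset

1.1 Preparing the dataset

Here we generate binary classification data directly with make_hastie_10_2, the data generator based on the example from Hastie et al.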

from sklearn.model_selection import train_test_split
from pandas import DataFrame
from sklearn import metrics
from sklearn.datasets  import  make_hastie_10_2
from xgboost.sklearn import XGBClassifier
import xgboost as xgb
import pandas as pd
 
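# Prepare the data. y is originally in {-1, 1}; the native XGBoost interface requires labels in {0, 1}, so -1 is mapped to 0.
X, y = make_hastie_10_2(random_state=0)
X = DataFrame(X)
y = DataFrame(y)
y.columns = ["label"]  # use a list here; a set is unordered and is not a valid column index
label = {-1: 0, 1: 1}
y.label = y.label.map(label)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)  # split into training and test sets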
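One caveat with the dictionary-based mapping above: pandas `Series.map` turns any value missing from the dictionary into NaN without raising. A minimal pure-Python sketch of a stricter conversion follows; `to_binary_labels` is a hypothetical helper written for illustration, not part of XGBoost or pandas.

```python
def to_binary_labels(labels, negative=-1, positive=1):
    """Map labels in {negative, positive} to {0, 1}, raising on anything else."""
    mapping = {negative: 0, positive: 1}
    converted = []
    for value in labels:
        if value not in mapping:
            # Fail loudly instead of silently producing a missing value.
            raise ValueError(f"unexpected label: {value!r}")
        converted.append(mapping[value])
    return converted


# make_hastie_10_2 returns floats, and -1.0 == -1 hashes to the same dict key.
example = to_binary_labels([-1.0, 1.0, 1, -1])
```

With pandas data, the result could be assigned back with something like `y["label"] = to_binary_labels(y["label"])`, assuming the column is named `label` as in the code above.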

1.2 用两个版本设定相同的参数,对数据集进行训练

# Native XGBoost interface
params={
    'eta': 0.3,
    'max_depth':3,   
    'min_child_weight':1,
    'gamma':0.3, 
    'subsample':0.8,
    'colsample_bytree':0.8,
    'booster':'gbtree',
    'objective': 'binary:logistic',
    'nthread':12,
    'scale_pos_weight': 1,
    'lambda':1,  
    'seed':27,
    'silent':0,  # deprecated in newer XGBoost releases, which use 'verbosity' instead
    'eval_metric': 'auc'
}
d_train = xgb.DMatrix(X_train, label=y_train)
d_valid = xgb.DMatrix(X_test, label=y_test)
d_test = xgb.DMatrix(X_test)
watchlist = [(d_train, 'train'), (d_valid, 'valid')]
 
# sklearn interface
clf = XGBClassifier(
    n_estimators=30,  # 30 trees
    learning_rate =0.3,
    max_depth=3,
    min_child_weight=1,
    gamma=0.3,
    subsample=0.8,
    colsample_bytree=0.8,
    objective= 'binary:logistic',
    nthread=12,
    scale_pos_weight=1,
    reg_lambda=1,
    seed=27)
watchlist2 = [(X_train,y_train),(X_test,y_test)]

print("Training with the native XGBoost interface:")
model_bst = xgb.train(params, d_train, 30, watchlist, early_stopping_rounds=500, verbose_eval=10)
print("Training with the XGBoost sklearn interface:")
# Note: newer XGBoost releases deprecate (and later remove) eval_metric and early_stopping_rounds
# in fit(); there they are passed to the XGBClassifier constructor instead.
model_sklearn=clf.fit(X_train, y_train, eval_set=watchlist2,eval_metric='auc',verbose=10, early_stopping_rounds=500)
 
y_bst= model_bst.predict(d_test)
y_sklearn= clf.predict_proba(X_test)[:,1]

Training with the native XGBoost interface:
[0] train-auc:0.608992 valid-auc:0.579947
Multiple eval metrics have been passed: 'valid-auc' will be used for early stopping.

Will train until valid-auc hasn't improved in 500 rounds.
[10] train-auc:0.940251 valid-auc:0.920879
[20] train-auc:0.973669 valid-auc:0.959898
[29] train-auc:0.983232 valid-auc:0.970292

Training with the XGBoost sklearn interface:
[0] validation_0-auc:0.608992 validation_1-auc:0.579947
Multiple eval metrics have been passed: 'validation_1-auc' will be used for early stopping.

Will train until validation_1-auc hasn't improved in 500 rounds.
[10] validation_0-auc:0.940251 validation_1-auc:0.920879
[20] validation_0-auc:0.973669 validation_1-auc:0.959898
[29] validation_0-auc:0.983232 validation_1-auc:0.970292
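The two logs above report the same numbers under different dataset names (train/valid versus validation_0/validation_1). One way to check this mechanically is to parse the lines and compare them after renaming. The sketch below is only an illustration: `parse_eval_line`, `normalise`, and the rename table are hypothetical helpers, and the log format is assumed to match the lines shown above.

```python
import re

# Matches one "name-metric:value" entry, e.g. "valid-auc:0.920879".
_ENTRY = re.compile(r"(\S+)-(\w+):([0-9.]+)")


def parse_eval_line(line):
    """Parse '[10] train-auc:0.94 valid-auc:0.92' into (10, {'train': 0.94, 'valid': 0.92})."""
    round_match = re.match(r"\[(\d+)\]", line)
    if round_match is None:
        raise ValueError(f"not an evaluation line: {line!r}")
    scores = {name: float(value) for name, _metric, value in _ENTRY.findall(line)}
    return int(round_match.group(1)), scores


def normalise(lines, rename=None):
    """Return {round: {dataset: score}}, optionally renaming dataset names."""
    rename = rename or {}
    result = {}
    for line in lines:
        round_no, scores = parse_eval_line(line)
        result[round_no] = {rename.get(name, name): s for name, s in scores.items()}
    return result


# Names the sklearn wrapper gives to eval_set entries, mapped to the watchlist names.
SKLEARN_TO_NATIVE = {"validation_0": "train", "validation_1": "valid"}

native_log = [
    "[0] train-auc:0.608992 valid-auc:0.579947",
    "[29] train-auc:0.983232 valid-auc:0.970292",
]
sklearn_log = [
    "[0] validation_0-auc:0.608992 validation_1-auc:0.579947",
    "[29] validation_0-auc:0.983232 validation_1-auc:0.970292",
]

logs_match = normalise(native_log) == normalise(sklearn_log, SKLEARN_TO_NATIVE)
```

Here `logs_match` compares only the lines pasted into the two lists; a fuller check would feed in every logged round.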

1.3 Printing the evaluation results

print("Native XGBoost interface  AUC Score : %f" % metrics.roc_auc_score(y_test, y_bst))
print("XGBoost sklearn interface AUC Score : %f" % metrics.roc_auc_score(y_test, y_sklearn))
 
# Convert the predicted probabilities to 0/1 labels with a 0.5 threshold
y_bst = (y_bst >= 0.5).astype(int)
y_sklearn = (y_sklearn >= 0.5).astype(int)
print("Native XGBoost interface  Accuracy : %f" % metrics.accuracy_score(y_test, y_bst))
print("XGBoost sklearn interface Accuracy : %f" % metrics.accuracy_score(y_test, y_sklearn))

Native XGBoost interface AUC Score : 0.970292
XGBoost sklearn interface AUC Score : 0.970292
Native XGBoost interface Accuracy : 0.897917
XGBoost sklearn interface Accuracy : 0.897917
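The code above calls both roc_auc_score and accuracy_score, and the two measure different things: AUC depends only on how the predicted probabilities rank positives against negatives, while accuracy depends on the 0.5 cut-off. The pure-Python sketch below illustrates the difference on a toy example. `pairwise_auc` uses the O(n²) pairwise definition of AUC for clarity and is not how scikit-learn computes it; both function names are hypothetical.

```python
def accuracy_at_threshold(y_true, y_prob, threshold=0.5):
    """Fraction of samples classified correctly after thresholding the probabilities."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    correct = sum(int(t == p) for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def pairwise_auc(y_true, y_prob):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: every positive is ranked above every negative,
# but all probabilities sit above the 0.5 threshold.
y_true = [0, 0, 1, 1]
y_prob = [0.6, 0.7, 0.8, 0.9]

toy_auc = pairwise_auc(y_true, y_prob)
toy_accuracy = accuracy_at_threshold(y_true, y_prob)
```

On this toy data the ranking is perfect while half of the thresholded predictions are wrong, which is why a model can show a higher AUC than accuracy.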

2. Differences between the two versions

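| Difference | Native interface | sklearn interface |
|---|---|---|
| Data must be converted to DMatrix | Yes | No |
| Learning rate (step size) | eta | learning_rate |
| Number of boosting rounds (number of trees) | num_boost_round, passed to xgb.train() | n_estimators |
| Training on the training set | xgb.train() | clf.fit() |
| Form of the watchlist | watchlist = [(d_train, 'train'), (d_valid, 'valid')] | watchlist2 = [(X_train, y_train), (X_test, y_test)] |
| L2 regularization weight | lambda | reg_lambda |
| L1 regularization weight | alpha | reg_alpha |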
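The parameter renames listed above can be applied mechanically when porting a native params dictionary to the sklearn interface. The sketch below is one way to do it; `native_to_sklearn_kwargs` is a hypothetical helper, and it only covers the renames listed in this post. Any other key is passed through unchanged, so native-only keys may still need manual handling.

```python
# Renames taken from the comparison above: native name -> sklearn name.
NATIVE_TO_SKLEARN = {
    "eta": "learning_rate",
    "lambda": "reg_lambda",
    "alpha": "reg_alpha",
}


def native_to_sklearn_kwargs(params, num_boost_round):
    """Translate a native params dict plus round count into sklearn-style keyword arguments."""
    kwargs = {NATIVE_TO_SKLEARN.get(key, key): value for key, value in params.items()}
    # In the native interface the number of trees is an argument of xgb.train().
    kwargs["n_estimators"] = num_boost_round
    return kwargs


native_params = {"eta": 0.3, "max_depth": 3, "gamma": 0.3, "lambda": 1}
sklearn_kwargs = native_to_sklearn_kwargs(native_params, num_boost_round=30)
```

The resulting dictionary could then be unpacked into the constructor, for example `XGBClassifier(**sklearn_kwargs)`, assuming every remaining key is accepted by the installed XGBoost version.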

