XGB Modeling Workflow Code: Personal Study Notes

Latest recommended article published 2024-05-01 09:36:18

Author: 数据厂商小伙

Category: 菜鸟数据建模 (Rookie Data Modeling). Tags: python, machine learning, deep learning, data modeling

Article link: https://blog.csdn.net/qq_42457415/article/details/114595537

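Most of what follows comes from around the web, and I do not know who the original author is. I modified the code for my own study.

80% of modeling comes down to the data. I have truly learned that the hard way: heartache and headaches ☠

This post turns XGBoost modeling into a routine workflow: once the data is prepared, throw it in and the job is done in about an hour. I mostly use it to test third-party data, which is a headache of its own. A vendor trial usually returns a score, a label, or a table of feature variables. When it returns many features, you can run a quick model, look at the initial performance, and then decide whether deeper analysis is worthwhile. Otherwise you spend time working through data from many vendors only to find that it does not fit.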

import pandas as pd
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import math
import pickle
import xgboost as xgb
import toad
from toad.metrics import KS, F1, AUC


def cal_ks_auc(model, x_train, y_train, x_test, y_test):
    """Print KS and AUC on the training set and on the test set."""
    # Predicted probability of the positive class (column 1)
    train_proba = model.predict_proba(x_train)[:, 1]
    print('Training cal')
    print('KS:', KS(train_proba, y_train))
    print('AUC:', AUC(train_proba, y_train))

    test_proba = model.predict_proba(x_test)[:, 1]
    print('\nTest cal')
    print('KS:', KS(test_proba, y_test))
    print('AUC:', AUC(test_proba, y_test))

# Fit a first-pass model. x_train, y_train, x_test and y_test are assumed to be
# prepared already (for example with train_test_split).
# The deprecated 'silent' and 'seed' keys are dropped in favor of 'verbosity' and
# 'random_state'; 'missing' is left at its default (NaN).
params = {'base_score': 0.5, 'booster': 'gbtree', 'colsample_bylevel': 0.8,
          'colsample_bynode': 1, 'colsample_bytree': 0.9, 'gamma': 3, 'learning_rate': 0.1,
          'max_delta_step': 0, 'max_depth': 3, 'min_child_weight': 5, 'n_estimators': 190,
          'n_jobs': 1, 'objective': 'binary:logistic', 'random_state': 0, 'reg_alpha': 0,
          'reg_lambda': 1, 'scale_pos_weight': 1, 'subsample': 0.8, 'verbosity': 1}
model = xgb.XGBClassifier(**params)
model.fit(x_train, y_train)

# KS and AUC of the first-pass model
cal_ks_auc(model=model, x_train=x_train, y_train=y_train, x_test=x_test, y_test=y_test)

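The first-pass parameters can all be defaults, or a set you have found yourself reusing across projects. Use whatever you are comfortable with.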
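
For readers who want to see what the KS number printed by cal_ks_auc measures, here is a minimal sketch, assuming the usual definition: the largest gap between the cumulative score distributions of the positive and negative classes. The name ks_statistic is made up for this illustration, and toad's own KS may differ in implementation details.

```python
import numpy as np


def ks_statistic(scores, target):
    """Largest gap between the score distributions of the two classes."""
    scores = np.asarray(scores, dtype=float)
    target = np.asarray(target).astype(bool)
    pos, neg = scores[target], scores[~target]
    thresholds = np.unique(scores)
    # Share of each class scoring at or above every threshold
    tpr = (pos[None, :] >= thresholds[:, None]).mean(axis=1)
    fpr = (neg[None, :] >= thresholds[:, None]).mean(axis=1)
    return float(np.max(np.abs(tpr - fpr)))


# Perfectly separated scores give the maximum KS
print(ks_statistic([0.1, 0.2, 0.3, 0.8, 0.9], [0, 0, 0, 1, 1]))  # 1.0
# One negative outranks one positive, so the gap shrinks
print(ks_statistic([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.5
```

The comparison builds a thresholds-by-samples matrix, which is fine for a quick check but should be replaced by a sorted cumulative count on large samples.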

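2. Tuning. Try not to tune more than 4 parameters in combination; beyond that it costs too much time and effort.
Controlling overfitting in xgboost:
1. Control model complexity directly: max_depth, min_child_weight and gamma
2. Add randomness to make training robust to noise: subsample, colsample_bytree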

# Tune subsample and colsample_bytree together
# *subsample=1  # lowering this value makes the algorithm more conservative and helps avoid overfitting. (0,1]
# colsample_bytree=0.8  # fraction of columns randomly sampled for each tree. (0,1]
# These two parameters usually perform best in the range 0.5-0.9
subsample = [0.6, 0.7, 0.8]
colsample_bytree = [0.6, 0.8, 0.9]
for i in subsample:
    for j in colsample_bytree:
        # Deprecated 'silent' and 'seed' keys are omitted; 'missing' keeps its default
        params = {'base_score': 0.5, 'booster': 'gbtree', 'colsample_bylevel': 0.8,
                  'colsample_bynode': 1, 'colsample_bytree': j, 'gamma': 13, 'learning_rate': 0.1,
                  'max_delta_step': 0, 'max_depth': 3, 'min_child_weight': 5, 'n_estimators': 190,
                  'n_jobs': 1, 'objective': 'binary:logistic', 'random_state': 0, 'reg_alpha': 0,
                  'reg_lambda': 1, 'scale_pos_weight': 1, 'subsample': i, 'verbosity': 1}
        model = xgb.XGBClassifier(**params)
        model.fit(x_train, y_train)
        print('i={},j={}:'.format(i, j))
        cal_ks_auc(model, x_train, y_train, x_test, y_test)

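*max_depth=3 # maximum tree depth, commonly 3-10
*min_child_weight=23 # minimum sum of instance weights (Hessian) required in a child node; larger values are more conservative

*subsample=1 # fraction of rows sampled for each tree; lowering it makes the algorithm more conservative and helps avoid overfitting. (0,1]
*colsample_bytree=0.8 # fraction of columns randomly sampled for each tree. (0,1]
These two parameters usually perform best in the range 0.5-0.9.
reg_alpha=18 # default 0; L1 regularization term on the weights. Increasing it makes the model more conservative
reg_lambda=5 # default 1; L2 regularization term on the weights, reduces overfitting. Increasing it makes the model more conservative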

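Tuning single parameters: gamma, n_estimators, learning_rate
Same code as above, with a single loop.

*gamma=13 # acts like pre-pruning: the minimum loss reduction required to split a node. The larger the value, the more conservative the algorithm, because a split must reduce the loss by more before it is accepted.
*n_estimators=190 # number of trees to fit, i.e. the maximum number of boosting rounds. n_estimators is tied to learning_rate; the two are roughly inversely proportional
*learning_rate=0.1 # step size of each boosting round. Too large hurts accuracy; too small makes training slow.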
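
The single-loop variant mentioned above could look roughly like the sketch below. To keep it runnable on its own, it uses synthetic data and scikit-learn's GradientBoostingClassifier as a stand-in for xgb.XGBClassifier; the helper names sweep_one_param and make_model are made up for this illustration, and AUC stands in for the KS/AUC printout.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def sweep_one_param(make_model, name, values, x_tr, y_tr, x_te, y_te):
    """Fit one model per candidate value and collect train/test AUC."""
    rows = []
    for value in values:
        model = make_model(**{name: value})
        model.fit(x_tr, y_tr)
        auc_tr = roc_auc_score(y_tr, model.predict_proba(x_tr)[:, 1])
        auc_te = roc_auc_score(y_te, model.predict_proba(x_te)[:, 1])
        rows.append((value, auc_tr, auc_te))
    return rows


# Synthetic data and a scikit-learn booster are stand-ins (assumptions, not the post's data)
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)


def make_model(**override):
    # All other parameters stay fixed; only the swept one changes
    fixed = {'max_depth': 3, 'learning_rate': 0.1, 'random_state': 0}
    return GradientBoostingClassifier(**{**fixed, **override})


results = sweep_one_param(make_model, 'n_estimators', [20, 50, 100], x_tr, y_tr, x_te, y_te)
for value, auc_tr, auc_te in results:
    print('n_estimators={}: train AUC={:.3f}, test AUC={:.3f}'.format(value, auc_tr, auc_te))
```

With the setup in this post, make_model would instead return xgb.XGBClassifier(**{**params, **override}), and name could be 'gamma', 'n_estimators' or 'learning_rate'. A widening gap between train and test AUC as the value changes is the sign of overfitting to watch for.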

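Test scores after tuning

There are generally two ways to control overfitting in xgboost.

The first is to control model complexity directly.
This includes max_depth, min_child_weight and gamma.
The second is to add randomness to make training robust to noise.
This includes subsample and colsample_bytree.
You can also reduce the step size eta (learning_rate in the sklearn wrapper), but remember to increase num_round (n_estimators) when you do.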
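
As a rough illustration of the last point, the sketch below rescales the number of rounds when the step size changes, assuming the "roughly inversely proportional" rule of thumb from these notes. The function name rescale_rounds is made up here, and the result is only a starting point that still has to be checked against test KS and AUC.

```python
def rescale_rounds(n_estimators, old_lr, new_lr):
    """Keep n_estimators * learning_rate roughly constant (rule of thumb only)."""
    return max(1, round(n_estimators * old_lr / new_lr))


# Halving the step size from 0.1 to 0.05 suggests about twice as many trees
print(rescale_rounds(190, 0.1, 0.05))  # 380
# Doubling the step size suggests about half as many
print(rescale_rounds(190, 0.1, 0.2))  # 95
```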

from sklearn.model_selection import GridSearchCV

# Search max_depth and min_child_weight together; every other parameter stays fixed
param_dist = {'max_depth': [2, 3, 4], 'min_child_weight': [4, 5, 6]}
params = {'base_score': 0.5, 'booster': 'gbtree', 'colsample_bylevel': 0.8,
          'colsample_bynode': 1, 'colsample_bytree': 0.3, 'gamma': 13, 'learning_rate': 0.1,
          'max_delta_step': 0, 'n_estimators': 190,
          'n_jobs': 1, 'objective': 'binary:logistic', 'random_state': 0, 'reg_alpha': 0,
          'reg_lambda': 1, 'scale_pos_weight': 1, 'subsample': 0.8, 'verbosity': 1}
model = xgb.XGBClassifier(**params)
optimized_GBM = GridSearchCV(estimator=model, param_grid=param_dist, scoring='roc_auc', cv=5, verbose=1, n_jobs=-1)
optimized_GBM.fit(x_train, y_train)
best_estimator = optimized_GBM.best_estimator_
print(best_estimator)  # the estimator refitted with the best parameters
print(optimized_GBM.best_params_, optimized_GBM.best_score_)  # best parameters and their mean CV AUC

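Grid search works for tuning, but I had always looped over values by hand and got used to it. I did try grid search; perhaps I gave it too many values, but it took so long that it kept hanging, so I gave up on it. The code is provided here, with one serious recommendation: if you use grid search, do not search all parameters at once. Go two at a time as above. Don't ask why; it was a lesson learned the hard way.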
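
The arithmetic behind that advice is easy to check. The sketch below counts the model fits a grid search performs, which is the product of the grid sizes times the number of folds (GridSearchCV adds one final refit when refit is left on). The six-parameter grid and the name n_fits are made up for illustration.

```python
from math import prod


def n_fits(param_grid, cv):
    """Number of cross-validation fits for an exhaustive grid."""
    return prod(len(values) for values in param_grid.values()) * cv


two_params = {'max_depth': [2, 3, 4], 'min_child_weight': [4, 5, 6]}
# Hypothetical grid: six parameters with three candidates each
six_params = {**two_params,
              'gamma': [3, 8, 13],
              'subsample': [0.6, 0.7, 0.8],
              'colsample_bytree': [0.6, 0.8, 0.9],
              'reg_alpha': [0, 1, 10]}

print(n_fits(two_params, cv=5))  # 45
print(n_fits(six_params, cv=5))  # 3645
```

Each extra three-value parameter triples the cost, so going from two parameters to six multiplies the number of 190-tree fits by 81.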

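# Pair each feature name with its importance (the original call passed the importances as the index)
feature_importance = pd.DataFrame({'feature': model.get_booster().feature_names, 'importance': model.feature_importances_})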

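Train the model with the best parameters and pick the important features; how you select them is up to you. My suggestion: when few features enter the model, keep every feature with importance above 0, since with so few anything useful is worth keeping. When there are hundreds or a thousand, take the top N, guided by the importance values: something like 0.00002 can be dropped.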
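
Both selection rules can be sketched in a few lines of pandas, assuming an importance table with `feature` and `importance` columns. The helper name select_features, the toy feature names and the toy importance values are made up for this illustration.

```python
import pandas as pd


def select_features(importance, top_n=None, min_importance=0.0):
    """Keep features above a threshold, optionally limited to the top N."""
    kept = importance[importance['importance'] > min_importance]
    kept = kept.sort_values('importance', ascending=False)
    if top_n is not None:
        kept = kept.head(top_n)
    return kept['feature'].tolist()


# Toy importance table (made-up values)
fi = pd.DataFrame({'feature': ['f1', 'f2', 'f3', 'f4'],
                   'importance': [0.50, 0.0, 0.00002, 0.30]})

print(select_features(fi))  # ['f1', 'f4', 'f3']
print(select_features(fi, top_n=2))  # ['f1', 'f4']
print(select_features(fi, min_importance=0.001))  # ['f1', 'f4']
```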

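'''5. Cross-validate on the base data restricted to the selected variables'''
# feature_all is the full modeling table; 'target' and 'cid' are taken to be its label and ID columns.
# kfold_xgb_lgbm is a helper that is not shown in this post.
columns = list(feature_importance['feature'])
columns.extend(['target', 'cid'])
df_data_filter = feature_all[columns]
ks_ave_train, ks_ave_test = kfold_xgb_lgbm(df_data_filter, params, 5, 'xgb', 'cid')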

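Cross-validation is mainly a rough check of model stability; the number of folds is up to you. If train and test KS and AUC come out about the same on every fold, the model is in decent shape. If they differ widely, I suggest rebuilding. The comparison covers both fold against fold and train against test within each fold.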
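
Since the cross-validation helper itself is not shown, here is a minimal sketch of the same kind of check. It is not the kfold_xgb_lgbm function used above: kfold_ks and ks_score are made-up names, the data is synthetic, and scikit-learn's GradientBoostingClassifier stands in for the XGBoost model so the block runs on its own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_curve
from sklearn.model_selection import StratifiedKFold


def ks_score(y_true, proba):
    """KS as the largest gap between TPR and FPR along the ROC curve."""
    fpr, tpr, _ = roc_curve(y_true, proba)
    return float(np.max(tpr - fpr))


def kfold_ks(make_model, X, y, n_splits=5, seed=0):
    """Return per-fold train KS and test KS lists."""
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    ks_train, ks_test = [], []
    for tr_idx, te_idx in folds.split(X, y):
        model = make_model()  # fresh model for every fold
        model.fit(X[tr_idx], y[tr_idx])
        ks_train.append(ks_score(y[tr_idx], model.predict_proba(X[tr_idx])[:, 1]))
        ks_test.append(ks_score(y[te_idx], model.predict_proba(X[te_idx])[:, 1]))
    return ks_train, ks_test


# Synthetic stand-in data (an assumption, not the post's data)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
ks_train, ks_test = kfold_ks(
    lambda: GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=0), X, y)

for fold, (tr, te) in enumerate(zip(ks_train, ks_test), start=1):
    print('fold {}: train KS={:.3f}, test KS={:.3f}, gap={:.3f}'.format(fold, tr, te, tr - te))
print('mean train KS={:.3f}, mean test KS={:.3f}'.format(np.mean(ks_train), np.mean(ks_test)))
```

The per-fold gap column is the within-fold comparison, and the spread of the test KS values across folds is the fold-against-fold comparison.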

If everything above looks fine, run the final version and check the results. The model is then ready for output.

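'''Convert probabilities to scores and output the final result'''
# sample_train / sample_test and predprob_train / predprob_test are assumed to be built earlier;
# the predicted probabilities sit in a 'predprob' column.
predprob1 = pd.concat([sample_train, predprob_train], axis=1)
predprob1['label'] = 'train'
predprob2 = pd.concat([sample_test, predprob_test], axis=1)
predprob2['label'] = 'test'
predprob = pd.concat([predprob1, predprob2])
min_p = 1e-9
max_p = 1 - min_p
def p2score(p):
    # Clip p away from 0 and 1 so the log-odds stay finite
    p = min(max(p, min_p), max_p)
    return 500 + 75 * np.log(p / (1 - p))
predprob['score'] = predprob['predprob'].apply(p2score)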

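This converts the model output into the familiar score form for overall evaluation. Always keep in mind that a model is not usable or suitable just because its KS and AUC are high. There is a whole set of business metrics as well, and those matter more. I will organize and post them later.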
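
To get a feel for this scale, the sketch below applies the same formula, 500 + 75 * ln(p / (1 - p)), to a few probabilities. A probability of 0.5 maps to 500, and each doubling of the odds adds 75 * ln(2), about 52 points. The vectorized helper p2score_vec and its keyword names are made up for this illustration.

```python
import numpy as np


def p2score_vec(p, base=500.0, factor=75.0, eps=1e-9):
    """Vectorized log-odds score with clipping so p=0 and p=1 stay finite."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    return base + factor * np.log(p / (1 - p))


# Odds of 1:2, 1:1 and 2:1, plus the two clipped extremes
probs = np.array([0.0, 1 / 3, 0.5, 2 / 3, 1.0])
print(np.round(p2score_vec(probs), 2))

# Points gained when the odds double
print(round(float(p2score_vec(2 / 3) - p2score_vec(0.5)), 2))
```

The score rises with the predicted probability, so its direction follows whatever the positive class of the target means.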

from sklearn2pmml import sklearn2pmml, PMMLPipeline

save_file_path = r'D:\JYY.pmml'
# Wrap the classifier in a PMML pipeline, refit on the selected features, then export.
# The output path is passed positionally.
pipeline = PMMLPipeline([('classifier', model)])
pipeline.fit(x_train_filter, y_train_filter)
sklearn2pmml(pipeline, save_file_path)

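How is the model used? Our usual practice at present is joint modeling with the data vendor: we hand them a pmml file to deploy, converted as above.

Saving and loading the model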

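# Save the model to a file
with open('xgbModel.model', 'wb') as f:  # 'wb' opens the file for binary writing
    pickle.dump(model, f)

# Load the model back (the file name must match the one written above)
with open('xgbModel.model', 'rb') as f:  # 'rb' opens the file for binary reading
    model = pickle.load(f)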

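Saved here so I don't lose track of it. For study purposes only; contact me and I will take it down.
The code was modified somewhat after being used in practice.
#############################################
Divider
#############################################
See the latest post on XGB modeling workflow code, which was streamlined through day-to-day business use.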
