Today I'd like to share the ensemble algorithms used in data mining. I won't go into the underlying theory here; the code examples are on my GitHub.
These are also the two blog posts I mainly referenced:
1. Overview of the XGBoost Library
Besides Python, XGBoost also supports R, Java, and other languages. This post focuses on the Python XGBoost library; it can be installed with "pip install xgboost", and version 0.90 is used here. Beyond decision trees as weak learners, XGBoost also supports linear models and DART, decision trees with dropout. In most cases, though, the default decision-tree weak learner is sufficient, and it is the only one discussed in this post.
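For reference, the weak-learner type is selected through the booster parameter. A minimal sketch (the values shown are illustrative, not taken from this post's experiments):

import xgboost as xgb
params_tree = {'booster': 'gbtree'}                  # default: gradient-boosted decision trees
params_linear = {'booster': 'gblinear'}              # linear models as weak learners
params_dart = {'booster': 'dart', 'rate_drop': 0.1}  # DART: trees with dropout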
XGBoost has two Python interface styles. One is XGBoost's own native Python API; the other is an sklearn-style API. The two implementations are essentially the same, differing only in minor API details, mainly parameter naming and how the dataset is initialized. A quick side-by-side sketch follows.
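A hedged illustration of that difference (the toy data and parameter values below are made up purely for demonstration):

import numpy as np
import xgboost as xgb
X = np.random.rand(100, 5)  # toy feature matrix
y = np.random.rand(100)     # toy regression targets
# Native API: wrap the data in a DMatrix and pass parameters as a dict
dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({'eta': 0.1, 'max_depth': 3}, dtrain, num_boost_round=10)
# sklearn-style API: an estimator with fit/predict; note the renaming, e.g. eta -> learning_rate
model = xgb.XGBRegressor(learning_rate=0.1, max_depth=3, n_estimators=10)
model.fit(X, y)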
2. Using the Native Python API
With the native XGBoost API, the dataset first has to be split into its input-feature part and its output part, and then loaded into a DMatrix data structure. We don't need to care about DMatrix internals; initializing it with our training set X and y is enough.
import xgboost as xgb
import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error, make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split
import warnings
warnings.filterwarnings('ignore')  ## suppress warnings
train = pd.read_csv(r'./data/insurance/train.csv')  ## load the data
features = [x for x in train.columns if x not in ['id', 'loss', 'log_loss']]  ## feature columns
cat_features = [x for x in train.select_dtypes(
    include=['object']).columns if x not in ['id', 'loss', 'log_loss']]  ## categorical features
num_features = [x for x in train.select_dtypes(
    exclude=['object']).columns if x not in ['id', 'loss', 'log_loss']]  ## numerical features
print("Categorical features:", len(cat_features))
print("Numerical features:", len(num_features))
ntrain = train.shape[0]
train_x = train[features]  ## X: feature matrix
train_y = train['log_loss']  ## y: labels (log-transformed loss)
# Map every categorical feature from strings to integer codes
for c in cat_features:
    train_x[c] = train_x[c].astype('category').cat.codes
print("Xtrain:", train_x.shape)
print("ytrain:", train_y.shape)
xtrain, xtest, ytrain, ytest = train_test_split(train_x, train_y, test_size=0.33, random_state=42)
def xg_eval_mae(yhat, dtrain):
    """
    Custom evaluation metric: mean absolute error on the original scale.
    :param yhat: predicted values
    :param dtrain: DMatrix holding the true labels
    :return: (metric name, metric value)
    """
    y = dtrain.get_label()  ## true label values
    # np.exp assumes log_loss = np.log(loss); if the labels were built with
    # np.log1p, np.expm1 would be the matching inverse (as used further below)
    return 'mae', mean_absolute_error(np.exp(y), np.exp(yhat))
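A custom feval receives the raw predictions and the DMatrix, and must return a (name, value) pair. A quick sanity check on toy data (purely illustrative):

toy = xgb.DMatrix(np.random.rand(10, 3), label=np.random.rand(10))  # toy DMatrix
name, value = xg_eval_mae(np.random.rand(10), toy)  # call the metric directly
print(name, value)  # -> 'mae' and a float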
dtrain = xgb.DMatrix(xtrain, ytrain)  ## wrap the training split in a DMatrix
dtest = xgb.DMatrix(xtest, ytest)  ## wrap the test split in a DMatrix
xgb_params = {
    'seed': 0,                  ## random seed
    'eta': 0.07,                ## shrinkage, similar to a learning rate
    'colsample_bytree': 0.6,    ## fraction of features sampled per tree
    'silent': 1,                ## suppress logging ('verbosity' in newer versions)
    'subsample': 0.9,           ## fraction of training samples used per tree
    'objective': 'reg:linear',  ## squared-error regression ('reg:squarederror' in newer versions)
    'max_depth': 8,             ## maximum depth of each tree
    'min_child_weight': 6,      ## minimum child-node weight; a node whose weight
                                ## falls below this threshold becomes a leaf
    'gamma': 0.2                ## minimum loss reduction required to split a node
}  ## parameter dictionary
## Cross-validation
bst_cv1 = xgb.cv(xgb_params, dtrain, num_boost_round=50, nfold=3, seed=0,
                 feval=xg_eval_mae, maximize=False, early_stopping_rounds=10)
print(bst_cv1)
print('CV score:', bst_cv1.iloc[-1, :]['test-mae-mean'])  ## CV score: 1223.1750083333334
bst_cv1[['train-mae-mean', 'test-mae-mean']].plot()  ## learning curves from the CV run
plt.show()
Figure: train-mae-mean and test-mae-mean over boosting rounds.
raw_model = xgb.train(xgb_params, dtrain, num_boost_round=200, feval=xg_eval_mae)  ## train 200 weak-learner trees on the training data
with open('./save/insurance/xgboost_insurance.pickle', 'wb') as f:  ## save the trained model
    pickle.dump(raw_model, f)
with open('./save/insurance/xgboost_insurance.pickle', 'rb') as f:  ## load the model back
    raw_model = pickle.load(f)
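As an aside, pickle is not the only option: the Booster object has its own serialization methods. A minimal sketch (the .model file path is illustrative):

raw_model.save_model('./save/insurance/xgboost_insurance.model')  # native binary format
loaded = xgb.Booster(model_file='./save/insurance/xgboost_insurance.model')  # load it back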
pred_test_raw = np.expm1(raw_model.predict(dtest))  ## predict on the test split, back-transforming from the log scale
print(mean_absolute_error(np.expm1(dtest.get_label()), pred_test_raw))  ## mean absolute error
predictions = pd.DataFrame({"predict": pred_test_raw, "true": np.expm1(dtest.get_label())})
predictions.plot(x="predict", y="true", kind="scatter")
predictions.to_csv('./save/insurance/output.csv')
plt.show()
Ideally, the true and predicted values in this scatter plot would lie along the diagonal; since no parameter tuning or feature engineering has been done yet, the error is still somewhat large!
3. Using the sklearn-Style API with sklearn-Style Parameters
The sklearn-style interface offers two main classes: XGBClassifier for classification and XGBRegressor for regression (a minimal classifier sketch follows below).
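This post only demonstrates XGBRegressor; for completeness, here is a small XGBClassifier sketch on toy data (all values are illustrative):

Xc = np.random.rand(100, 4)             # toy features
yc = np.random.randint(0, 2, size=100)  # toy binary labels
clf = xgb.XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=50)
clf.fit(Xc, yc)
print(clf.predict_proba(Xc[:5]))  # class probabilities for the first 5 rows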
XGBoost can be combined with sklearn's grid-search class GridSearchCV for parameter tuning, and it is then used exactly like any ordinary sklearn classifier or regressor. A concrete example of the workflow:
xgb_reg = xgb.XGBRegressor(min_child_weight=3, subsample=0.8, learning_rate=0.5, verbosity=1)
def mae_score(y_true, y_pred):
    # MAE on the original scale (np.exp assumes log-transformed labels)
    return mean_absolute_error(np.exp(y_true), np.exp(y_pred))
mae_scorer = make_scorer(mae_score, greater_is_better=False)  # sklearn maximizes scores, so the MAE is negated
xgb_param_grid = {'max_depth': list(range(5, 8)), 'min_child_weight': [1, 3, 6]}  # parameter grid
grid = GridSearchCV(xgb_reg, param_grid=xgb_param_grid, cv=4, scoring=mae_scorer)
grid.fit(xtrain, ytrain)
print(grid.scorer_)
print(grid.best_params_)
print(grid.best_score_)  # negative because greater_is_better=False
model = grid.best_estimator_
with open('./save/insurance/xgboost_insurance1.pickle', 'wb') as f:  # save the best model
    pickle.dump(model, f)
with open('./save/insurance/xgboost_insurance1.pickle', 'rb') as f:  # load it back
    xgb_reg = pickle.load(f)
xgb_preds = np.expm1(xgb_reg.predict(xtest))  # back-transform predictions
predictions = pd.DataFrame({"predict": xgb_preds, "true": np.expm1(ytest)})
predictions.plot(x="predict", y="true", kind="scatter")
predictions.to_csv('./save/insurance/output1.csv')
plt.show()
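The native-API section used early stopping inside xgb.cv; the sklearn-style interface supports it as well. A hedged sketch reusing the split from above (eval_set, eval_metric and early_stopping_rounds are standard fit arguments in this XGBoost version; the parameter values are illustrative):

es_model = xgb.XGBRegressor(max_depth=7, learning_rate=0.07, n_estimators=500)
es_model.fit(xtrain, ytrain,
             eval_set=[(xtest, ytest)],  # held-out set monitored during training
             eval_metric='mae',          # built-in MAE on the log scale
             early_stopping_rounds=10,   # stop after 10 rounds without improvement
             verbose=False)
print(es_model.best_iteration)  # index of the best boosting round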