XGBoost Made Simple: From Getting Started to Hands-On Practice, in Detail

Latest recommended article published on 2024-07-16 12:38:33

萌到了

Article link: https://blog.csdn.net/weixin_33731135/article/details/112434000

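This article gives an accessible introduction to the core ideas of XGBoost, including the difference between Bagging and Boosting, how boosted trees are built, and how the objective function is optimized. It approximates the objective with a Taylor series and discusses how to build and optimize the tree structure. It also provides hands-on guidance covering data preprocessing and model tuning strategies, with an example of applying XGBoost to a real problem.

Summary generated by CSDN using AI.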

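Paper: XGBoost: A Scalable Tree Boosting System

If you have never studied XGBoost, or do not understand the math behind the framework, this article of roughly 10,000 characters should help. It contains many formulas, but a careful read is enough to follow them. The article covers XGBoost from theory to practice. Once you understand it, reading the original paper and the XGBoost slides should be much easier.

Some readers have asked for a PDF version of these notes; send me a private message if you want one.

This article was written entirely by hand and draws on explanations by many predecessors (for example @李文哲); links are given at the end. If it helps you, please follow, like, and bookmark. If you spot any mistakes, please point them out in the comments.

XGBoost is an open-source machine learning project from Tianqi Chen and collaborators. It implements the GBDT algorithm efficiently, with many algorithmic and engineering improvements, and is widely used in Kaggle and many other machine learning competitions with good results. Among the 29 winning solutions published on Kaggle's blog during 2015, 17 used XGBoost.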

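Bagging vs. Boosting

Bagging: leverages unstable base learners that are weak because of overfitting.

Boosting: leverages stable base learners that are weak because of underfitting.

Bagging is short for Bootstrap Aggregating: resample the data (bootstrap), train a model on each resample, and average the models' predictions, which lowers the variance of the model. Naturally parallel algorithms such as Random Forest work this way.

Boosting is an iterative algorithm. Each iteration concentrates on the errors left by the previous iterations (AdaBoost reweights the samples, gradient boosting fits the residuals), so the error shrinks as iterations proceed and the model's bias keeps decreasing. Adaptive Boosting (AdaBoost) and XGBoost are both Boosting algorithms.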

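Boosted Trees

Suppose we are given a prediction problem and Zhang San trains a model on the data, Model 1, but it performs poorly and its error is fairly large.

Question: if we must keep using this model and cannot change its architecture, what should we do next?

As the figure above shows, feeding the data on the left into Model 1 yields a predicted income. The difference between the true income and the predicted income is the residual. Model 1 has some predictive power but is weak, so it leaves part of the problem unsolved, and the residual captures that leftover part.

Next, train Model 2 on the same samples, but with the residual just obtained as the target. In the figure above, what Model 2 predicts is no longer the income but the residual of Model 1. Model 2 leaves residuals of its own, but the residual of the first row is already zero: for the first sample, Model 1 and Model 2 together predict the income correctly. The other rows still have residuals, so we can train a Model 3 on top of them.

As the figure above shows, Model 3 is fit to the residual left by Model 2 (more precisely, the joint result of Model 1 and Model 2). This residual is what the first two models left unsolved. Model 3 predicts that residual, and after the first three models the remaining residual is the "latest residual" column in the figure.

By now the latest residuals are all very small. If they meet our standard, we can stop, having obtained three different models. As the figure below shows, the final prediction is the sum of the three models' predictions.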
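The residual-fitting loop described above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the ages and incomes are made up, and each "model" is a one-split stump learner (`fit_stump`, a hypothetical helper), not the trees XGBoost actually grows.

```python
# Toy illustration of residual fitting: each "model" is a one-split stump.
# The data and the stump learner are made up for this sketch.

def fit_stump(xs, targets):
    """Find the threshold on x that minimizes squared error; return a predictor."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

ages = [22, 25, 31, 38, 45, 52]                 # hypothetical feature
income = [8.0, 9.0, 12.0, 15.0, 14.0, 20.0]     # hypothetical target

models = []
residual = income[:]          # Model 1 is fit to the raw target
sse_history = []              # sum of squared residuals after each model
for _ in range(3):
    stump = fit_stump(ages, residual)
    models.append(stump)
    # The next model is trained on what the models so far left unexplained
    residual = [r - stump(x) for x, r in zip(ages, residual)]
    sse_history.append(sum(r * r for r in residual))

def predict(x):
    # Final prediction is the sum of all three models' outputs
    return sum(m(x) for m in models)
```

Each entry of `sse_history` is no larger than the previous one, which is the "residuals keep shrinking" behaviour the walkthrough describes.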

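The concrete questions are: how do we construct these models, how do we build the objective function, and how do we optimize it? We can work through them in the following order:

How to construct the objective function -> The objective is hard to optimize directly, so how do we approximate it? -> How do we bring the tree structure into the objective? -> It is still hard to optimize, so should we use a greedy algorithm?

If this flow does not make sense yet, that is fine. Just keep reading, and it will become clear when you look back.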

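Building the Objective Function

Start with an example: use several trees to predict the salaries of Zhang San and Li Si. As the figure below shows, the tree built on age predicts 12 for Zhang San and the tree built on years of work experience predicts 2. Their sum is the predicted salary for Zhang San: 12 + 2 = 14.

Suppose $K$ trees have been trained. The final prediction for the $i$-th sample is:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$$

Here $x_i$ is the feature vector of the sample and $f_k(x_i)$ is the prediction of the $k$-th tree for sample $x_i$. Adding the results gives the final prediction $\hat{y}_i$, while the true label of the sample is $y_i$. With these we can build the loss function.

The objective function is:

$$Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

The first term measures the loss between the model's predictions and the true values, where $l$ is the loss function, for example MSE or cross entropy. The second term is a regularization term that controls model complexity and prevents overfitting; it is analogous to L2 regularization.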
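As a small sketch of "prediction = sum of trees" and "objective = loss + regularization": the two toy trees below reproduce the 12 + 2 = 14 salary example, while the other leaf values, the true salary of 15 and the per-tree penalties of 0.5 are assumptions made up for illustration.

```python
# Sketch: prediction as a sum of trees, and objective = loss + regularization.
# Only the values 12 and 2 come from the salary example; the rest is hypothetical.

def tree_age(person):
    return 12.0 if person["age"] < 30 else 15.0    # hypothetical leaf values

def tree_years(person):
    return 2.0 if person["years"] < 5 else 5.0     # hypothetical leaf values

trees = [tree_age, tree_years]
zhang_san = {"age": 25, "years": 3}

# y_hat_i = sum_k f_k(x_i): 12 + 2 = 14
y_hat = sum(f(zhang_san) for f in trees)

def objective(y_true, y_pred, penalties):
    """Squared-error loss over the samples plus one complexity penalty per tree."""
    loss = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return loss + sum(penalties)

# (15 - 14)^2 + (0.5 + 0.5) = 2.0
obj = objective([15.0], [y_hat], penalties=[0.5, 0.5])
```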

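Additive Training

As the figure below shows, putting sample $x_i$ into the first tree yields a prediction $f_1(x_i)$, putting it into the second tree yields $f_2(x_i)$, and so on.

Given a sample $x_i$, and with $\hat{y}_i^{(0)} = 0$:

$$\hat{y}_i^{(1)} = f_1(x_i) = \hat{y}_i^{(0)} + f_1(x_i)$$

$$\hat{y}_i^{(2)} = f_1(x_i) + f_2(x_i) = \hat{y}_i^{(1)} + f_2(x_i)$$

$$\hat{y}_i^{(k)} = \sum_{m=1}^{k} f_m(x_i) = \hat{y}_i^{(k-1)} + f_k(x_i)$$

Here $\hat{y}_i^{(m)}$ is the prediction accumulated up to the $m$-th tree. It follows that $\hat{y}_i^{(k)} = \hat{y}_i^{(k-1)} + f_k(x_i)$: the accumulated result at the $k$-th tree is the accumulated result of the first $k-1$ trees plus the output of the $k$-th tree. With this in hand, look at the objective function again, now with $k$ trees:

$$Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{j=1}^{k} \Omega(f_j)$$

Because the final prediction is the accumulated result of all models (trees), we can write $\hat{y}_i$ as $\hat{y}_i^{(k)}$ (the prediction accumulated up to the $k$-th tree). When training the $k$-th tree, we minimize the following loss:

$$Obj^{(k)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(k-1)} + f_k(x_i)\right) + \Omega(f_k)$$

Compared with the previous expression, this one drops the term $\sum_{j=1}^{k-1} \Omega(f_j)$. When the $k$-th tree is trained, the complexity of the first $k-1$ trees is already known, so that term is a constant and the earlier trees can be ignored. This gives the objective function:

$$Obj^{(k)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(k-1)} + f_k(x_i)\right) + \Omega(f_k)$$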

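Approximating the Objective with a Taylor Series

This objective function is very complex, so we approximate it with a Taylor series.

Objective function:

$$Obj^{(k)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(k-1)} + f_k(x_i)\right) + \Omega(f_k)$$

Taylor expansion:

$$f(x + \Delta x) \approx f(x) + f'(x)\,\Delta x + \frac{1}{2} f''(x)\,\Delta x^2$$

Next, treat $\hat{y}_i^{(k-1)}$ as $x$ and $f_k(x_i)$ as $\Delta x$, so that $l(y_i, \hat{y}_i^{(k-1)})$ plays the role of $f(x)$. By the Taylor expansion (the formula is a bit long but not messy; read it carefully and it is easy to follow):

$$Obj^{(k)} \approx \sum_{i=1}^{n} \left[ l(y_i, \hat{y}_i^{(k-1)}) + \partial_{\hat{y}^{(k-1)}} l(y_i, \hat{y}_i^{(k-1)})\, f_k(x_i) + \frac{1}{2}\, \partial^2_{\hat{y}^{(k-1)}} l(y_i, \hat{y}_i^{(k-1)})\, f_k^2(x_i) \right] + \Omega(f_k)$$

This is the objective for training the $k$-th tree. The term $l(y_i, \hat{y}_i^{(k-1)})$ is the loss between the true value and the prediction accumulated up to the $(k-1)$-th tree; it is known and takes no part in the optimization. The derivatives $\partial_{\hat{y}^{(k-1)}} l(y_i, \hat{y}_i^{(k-1)})$ and $\partial^2_{\hat{y}^{(k-1)}} l(y_i, \hat{y}_i^{(k-1)})$ are known as well. Let

$$g_i = \partial_{\hat{y}^{(k-1)}} l(y_i, \hat{y}_i^{(k-1)}), \qquad h_i = \partial^2_{\hat{y}^{(k-1)}} l(y_i, \hat{y}_i^{(k-1)})$$

so that:

$$Obj^{(k)} \approx \sum_{i=1}^{n} \left[ l(y_i, \hat{y}_i^{(k-1)}) + g_i f_k(x_i) + \frac{1}{2} h_i f_k^2(x_i) \right] + \Omega(f_k)$$

Dropping the constant term, the objective simplifies to:

$$Obj^{(k)} \approx \sum_{i=1}^{n} \left[ g_i f_k(x_i) + \frac{1}{2} h_i f_k^2(x_i) \right] + \Omega(f_k)$$

When training the $k$-th tree, $\{g_i, h_i\}$ are known; they can be viewed as the residual information left by the first $k-1$ trees. To optimize this objective, we next need to parameterize $f_k(x_i)$ and $\Omega(f_k)$.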
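To make the gradient statistics concrete, here is a minimal sketch for one sample, assuming a squared-error loss $l(y, p) = (y - p)^2$ (the article leaves the loss generic). The numbers `y`, `prev_pred` and `step` are hypothetical; `step` plays the role of $f_k(x_i)$.

```python
# Sketch: first- and second-order derivatives of the loss with respect to the
# previous prediction, for squared error l(y, p) = (y - p)^2.

def loss(y, p):
    return (y - p) ** 2

def grad(y, p):
    # g_i = dl/dp, evaluated at the previous prediction
    return 2.0 * (p - y)

def hess(y, p):
    # h_i = d^2 l / dp^2, constant for squared error
    return 2.0

y, prev_pred, step = 10.0, 7.0, 1.5    # hypothetical values

exact = loss(y, prev_pred + step)
g, h = grad(y, prev_pred), hess(y, prev_pred)
# Second-order Taylor approximation around the previous prediction
approx = loss(y, prev_pred) + g * step + 0.5 * h * step ** 2
```

For squared error the second-order expansion is exact, so `exact` and `approx` agree. Note also that `g` is proportional to the negative residual `prev_pred - y`, which is why the gradient statistics can be read as residual information.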

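How to Represent a Tree with Parameters

Leaf values are denoted by $w$. Suppose the leaf with value 15 is $w_1$, the leaf with value 12 is $w_2$, and the leaf with value 20 is $w_3$.

Here $w$ is a parameter.

The next goal is to parameterize $f_k(x_i)$ and $\Omega(f_k)$.

What is $f_k(x_i)$? Simply put, $f_k(x_i)$ is the prediction of the $k$-th tree for sample $x_i$. More concretely, it is determined by which leaf the $i$-th sample is assigned to.

Define a function $q(x_i)$: the position (leaf index) of sample $x_i$. Suppose samples [1, 3] fall on the first leaf (the one with value 15), sample [4] falls on the second leaf, and samples [2, 5] fall on the third leaf:

$$q(x_1) = 1,\quad q(x_2) = 3,\quad q(x_3) = 1,\quad q(x_4) = 2,\quad q(x_5) = 3$$

Once $q$ tells us where a sample falls, $f_k(x_i)$ can be expressed with parameters. Sample $x_i$ falls on leaf $q(x_i)$, so the prediction for $x_i$ is $w_{q(x_i)}$, which parameterizes $f_k(x_i)$:

$$f_k(x_i) = w_{q(x_i)}$$

$w$ is a parameter, and the subscript $q(x_i)$ says which leaf the sample falls on. But the subscript is still a function, so we define:

$$I_j = \{\, i \mid q(x_i) = j \,\}$$

that is, the set of samples $x_i$ that fall on the $j$-th leaf. For example, $I_1 = \{1, 3\}$ means samples 1 and 3 fall on the first leaf. The purpose of this notation is to reorganize the samples by leaf position.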
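The three pieces of notation map directly onto small dictionaries. The sketch below uses the leaf values 15, 12, 20 and the sample-to-leaf assignment from the example above; the variable names are just illustrative.

```python
# Leaf values w_1..w_3 from the example tree
w = {1: 15.0, 2: 12.0, 3: 20.0}

# q maps a sample index to its leaf:
# samples 1, 3 -> leaf 1; sample 4 -> leaf 2; samples 2, 5 -> leaf 3
q = {1: 1, 2: 3, 3: 1, 4: 2, 5: 3}

def f_k(i):
    """Prediction of the tree for sample i: the value of the leaf it falls into."""
    return w[q[i]]

# I_j = { i | q(x_i) = j }: group sample indices by leaf
I = {}
for i, leaf in q.items():
    I.setdefault(leaf, []).append(i)
```

Here `I[1]` is `[1, 3]`, matching the example, and `f_k(2)` returns the third leaf's value because sample 2 falls on leaf 3.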

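Defining the Complexity of a Tree

Having parameterized $f_k(x_i)$, the next goal is to parameterize $\Omega(f_k)$.

The complexity of a tree can be described by its number of leaves and its leaf values, as in the following expression, where $T$ is the number of leaves and the second term involves the leaf values:

$$\Omega(f_k) = T + \sum_{j=1}^{T} w_j^2$$

Complexity has two parts, so we give each part a hyperparameter to control it:

$$\Omega(f_k) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$$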
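As a quick numeric check, the complexity $\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_j w_j^2$ can be computed for the three-leaf example tree. The values of `gamma` and `lam` below are arbitrary choices for illustration.

```python
def tree_complexity(leaf_values, gamma, lam):
    """Omega(f) = gamma * T + 0.5 * lambda * sum_j w_j^2."""
    T = len(leaf_values)                      # number of leaves
    return gamma * T + 0.5 * lam * sum(w * w for w in leaf_values)

# Leaves 15, 12, 20 from the example tree; gamma and lam are hypothetical
omega = tree_complexity([15.0, 12.0, 20.0], gamma=1.0, lam=0.01)
# 1.0 * 3 + 0.5 * 0.01 * (225 + 144 + 400) = 6.845
```

Raising `gamma` penalizes every extra leaf, while raising `lam` penalizes large leaf values.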

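New Objective Function

Through the step-by-step simplification above, we turned the initial objective:

$$Obj^{(k)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(k-1)} + f_k(x_i)\right) + \Omega(f_k)$$

into:

$$Obj^{(k)} \approx \sum_{i=1}^{n} \left[ g_i f_k(x_i) + \frac{1}{2} h_i f_k^2(x_i) \right] + \Omega(f_k)$$

Next, recall the parameters just defined:

$w$: the leaf values.
$q(x_i)$: the leaf on which sample $x_i$ falls.
$I_j$: the samples on the $j$-th leaf.

The prediction for $x_i$ can be written as $w_{q(x_i)}$, which gives:

$$Obj^{(k)} \approx \sum_{i=1}^{n} \left[ g_i w_{q(x_i)} + \frac{1}{2} h_i w_{q(x_i)}^2 \right] + \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$$

Next, look at the figure below. Suppose samples [1, 3] fall on the first leaf (the one with value 15), sample [2] falls on the second leaf, and samples [4, 5] fall on the third leaf. Then:

$$\sum_{i=1}^{5} g_i w_{q(x_i)} = g_1 w_1 + g_2 w_2 + g_3 w_1 + g_4 w_3 + g_5 w_3 = (g_1 + g_3) w_1 + g_2 w_2 + (g_4 + g_5) w_3$$

The term $(g_4 + g_5) w_3$ can in turn be written as (because samples [4, 5] fall on the third leaf):

$$(g_4 + g_5) w_3 = \Big( \sum_{i \in I_3} g_i \Big) w_3$$

Because $\sum_{i=1}^{n} g_i w_{q(x_i)} = \sum_{j=1}^{T} \big( \sum_{i \in I_j} g_i \big) w_j$, and likewise for the $h_i$ term, we can construct the new objective:

$$Obj^{(k)} \approx \sum_{j=1}^{T} \left[ \Big( \sum_{i \in I_j} g_i \Big) w_j + \frac{1}{2} \Big( \sum_{i \in I_j} h_i + \lambda \Big) w_j^2 \right] + \gamma T$$

In this expression $\sum_{i \in I_j} g_i$ and $\sum_{i \in I_j} h_i$ are known; denote them $G_j$ and $H_j$. The parameter is $w_j$, so this is the problem of minimizing a quadratic function of $w_j$.

Recall a typical quadratic function:

$$f(x) = a x^2 + b x + c, \quad a > 0$$

Its minimum is attained at:

$$x^* = -\frac{b}{2a}, \qquad f(x^*) = c - \frac{b^2}{4a}$$

Therefore, when the tree structure is fixed, that is, when $q(x)$ is fixed, the optimal $w_j$ inside the brackets is:

$$w_j^* = -\frac{G_j}{H_j + \lambda}$$

Substituting $w_j^*$ into $Obj^{(k)}$ gives the best objective value under the current tree structure:

$$Obj^* = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$$

Once we know the minimal objective value $Obj^*$ for training the $k$-th tree, then for any given tree (with known structure) we can compute the minimal objective value under that tree. But there may be many candidate trees, so we need the one with the smallest objective value. How do we find that tree? Enumerating all possible trees is very expensive, so we need a greedy method.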
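The closed-form leaf weights $w_j^* = -G_j / (H_j + \lambda)$ and the resulting structure score can be checked numerically. In the sketch below the per-sample `g` and `h` values and the settings `lam` and `gamma` are hypothetical; the leaf assignment follows the example (samples 1, 3 on leaf 1, sample 2 on leaf 2, samples 4, 5 on leaf 3).

```python
# Hypothetical per-sample gradient statistics
g = {1: -2.0, 2: 1.0, 3: -4.0, 4: 3.0, 5: 1.0}
h = {1: 2.0, 2: 2.0, 3: 2.0, 4: 2.0, 5: 2.0}
I = {1: [1, 3], 2: [2], 3: [4, 5]}      # samples on each leaf
lam, gamma = 1.0, 0.5                   # hypothetical hyperparameters

# G_j and H_j: sums of g_i and h_i over the samples on leaf j
G = {j: sum(g[i] for i in idx) for j, idx in I.items()}
H = {j: sum(h[i] for i in idx) for j, idx in I.items()}

# Optimal leaf weights and the best objective value for this structure
w_star = {j: -G[j] / (H[j] + lam) for j in I}
obj_star = -0.5 * sum(G[j] ** 2 / (H[j] + lam) for j in I) + gamma * len(I)

def objective(w):
    """Approximate objective for arbitrary leaf weights w (same tree structure)."""
    return sum(G[j] * w[j] + 0.5 * (H[j] + lam) * w[j] ** 2 for j in I) + gamma * len(I)
```

Evaluating `objective(w_star)` reproduces `obj_star`, and nudging any single weight away from `w_star` gives a larger value, as expected for a minimum of a quadratic.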

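How to Find the Shape of the Tree?

We look for the smallest $Obj^*$. Starting from an existing tree, we can compute its minimal objective value. Then we split the samples in a leaf according to some feature, which changes the tree structure, and the objective value of the new tree can be computed as well. So we proceed greedily and keep the new tree whose objective value is smaller.

Take the following example: we have a set of samples, and the first tree splits them into two parts, a left leaf and a right leaf.

With the tree structure known, the minimal objective value of this tree follows from the formula:

$$Obj^*_{old} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$$

Next, we split a leaf again according to a new feature, which gives a tree of the following shape:

This yields a new $Obj^*_{new}$. Then compute the difference between the minimal objective values of the two trees, where $L$ and $R$ denote the two children created by splitting the leaf:

$$Gain = Obj^*_{old} - Obj^*_{new} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$$

When $Gain$ is maximized, $Obj^*_{new}$ is minimized. In this way we greedily keep building and growing the tree. The tree construction step is very important and worth studying closely. Once you have the general idea, go read the original paper: XGBoost: A Scalable Tree Boosting System.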
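A greedy split search on a single feature can be sketched as follows. This is a simplified exact scan over sorted feature values using the gain $\frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma$; the function names and the toy `xs`, `g`, `h` values are assumptions for illustration, and this is not XGBoost's actual implementation.

```python
def leaf_score(G, H, lam):
    return G * G / (H + lam)

def best_split(xs, g, h, lam=1.0, gamma=0.0):
    """Scan thresholds on one feature; return (gain, threshold) of the best split."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    G_total, H_total = sum(g), sum(h)
    G_left = H_left = 0.0
    best = (float("-inf"), None)
    for pos in range(len(order) - 1):
        i = order[pos]
        G_left += g[i]          # move sample i from the right child to the left
        H_left += h[i]
        nxt = xs[order[pos + 1]]
        if xs[i] == nxt:
            continue            # cannot split between equal feature values
        G_right, H_right = G_total - G_left, H_total - H_left
        gain = 0.5 * (leaf_score(G_left, H_left, lam)
                      + leaf_score(G_right, H_right, lam)
                      - leaf_score(G_total, H_total, lam)) - gamma
        if gain > best[0]:
            best = (gain, (xs[i] + nxt) / 2.0)
    return best

xs = [1.0, 2.0, 3.0, 4.0]      # hypothetical feature values
g = [-4.0, -4.0, 4.0, 4.0]     # hypothetical gradients
h = [2.0, 2.0, 2.0, 2.0]       # hypothetical hessians
gain, threshold = best_split(xs, g, h, lam=1.0, gamma=0.0)
```

With these numbers the best split separates the two negative-gradient samples from the two positive-gradient ones. A large enough `gamma` makes every candidate gain negative, which is how the penalty discourages extra leaves.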

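Hands-On Practice

The hands-on part uses the Allstate Claims Severity dataset (a 2016 Kaggle competition):

Competition page: https://www.kaggle.com/c/allstate-claims-severity/overview

The task is to predict insurance claim amounts from the given data. The training data has 116 categorical columns (cat1-cat116) and 14 continuous columns (cont1-cont14).

Data Distribution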

Import dependencies

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score as AUC
from sklearn.metrics import mean_absolute_error
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder,LabelBinarizer
from sklearn.model_selection import cross_val_score

from scipy import stats
import seaborn as sns
from copy import deepcopy

%matplotlib inline

%config InlineBackend.figure_format = 'retina'

Load the data

train = pd.read_csv('allstate-claims-severity/train.csv')
test = pd.read_csv('allstate-claims-severity/test.csv')

Inspect the data to see what it looks like

train.shape  # (188318, 132)

Display the training data to view its contents

train

print('First 20 columns:',list(train.columns[:20]))
print('Last 20 columns:',list(train.columns[-20:]))

First 20 columns: ['id', 'cat1', 'cat2', 'cat3', 'cat4', 'cat5', 'cat6', 'cat7', 'cat8', 'cat9', 'cat10', 'cat11', 'cat12', 'cat13', 'cat14', 'cat15', 'cat16', 'cat17', 'cat18', 'cat19']
Last 20 columns: ['cat112', 'cat113', 'cat114', 'cat115', 'cat116', 'cont1', 'cont2', 'cont3', 'cont4', 'cont5', 'cont6', 'cont7', 'cont8', 'cont9', 'cont10', 'cont11', 'cont12', 'cont13', 'cont14', 'loss']

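Observation: there are 116 object-type attributes, 15 float64 attributes, and 1 int64 attribute; id is int64 and loss (the claim amount) is float64.

train.describe()

The data has clearly been preprocessed already: the means are all roughly 0.5.

Check for missing values

In most cases we need to handle missing values in the data.

pd.isnull(train).values.any()  # False means there are no missing values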

Continuous and Categorical Values

train.info()

# dtypes and counts: float64(15), int64(1), object(116)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 188318 entries, 0 to 188317
Columns: 132 entries, id to loss
dtypes: float64(15), int64(1), object(116)
memory usage: 189.7+ MB

Count the categorical and continuous features

cat_features = list(train.select_dtypes(include=['object']))
print('Categorical: {} features'.format(len(cat_features)))

Categorical: 116 features

cont_features = [cont for cont in list(train.select_dtypes(include=['float64','int64'])) if cont not in ['loss','id']]
print('Continuous: {} features'.format(len(cont_features)))

Continuous: 14 features

id_col = list(train.select_dtypes(include=['int64']))
print('A column of int64:{}'.format(id_col))

A column of int64:['id']

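Number of Distinct Values per Categorical Feature

# Count the number of distinct categories in each categorical feature
cat_uniques = [train[cat].nunique() for cat in cat_features]
# Build a two-column frame: feature name and its number of distinct values
uniq_values_in_categories = pd.DataFrame({'cat_name': cat_features, 'unique_values': cat_uniques})

uniq_values_in_categories.head()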

fig, (ax1, ax2) = plt.subplots(1,2)
fig.set_size_inches(16,5)

ax1.hist(uniq_values_in_categories.unique_values, bins=50)
ax1.set_title('Amount of categorical features with X distinct values')  # distribution of distinct-value counts
ax1.set_xlabel('Distinct values in a feature')
ax1.set_ylabel('Features')
ax1.annotate('A feature with 326 vals', xy=(322, 2), xytext=(200, 38), arrowprops=dict(facecolor='black'))

ax2.set_xlim(2,30)
ax2.set_title('Zooming in the [0,30] part of left histogram')
ax2.set_xlabel('Distinct values in a feature')
ax2.set_ylabel('Features')
ax2.grid(True)
ax2.hist(uniq_values_in_categories[uniq_values_in_categories.unique_values<=30].unique_values, bins=30)
ax2.annotate('Binary features', xy=(3, 71), xytext=(7, 71), arrowprops=dict(facecolor='black'))

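As we can see, most of the categorical features (72/116) are binary, the vast majority (88/116) have at most four values, and one feature has 326 distinct values (a number of days, perhaps?).

Claim Amount

plt.figure(figsize=(16,8))
plt.plot(train['id'], train['loss'], label='loss')
print("Number of ids:", len(train['id']))
plt.title('Loss values per id')
plt.xlabel('id')
plt.ylabel('loss')
plt.legend()
plt.show()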

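As the figure above shows, the loss values have several pronounced peaks, which correspond to severe accidents. A distribution like this makes the target highly skewed and hurts regression performance.

Skewness measures the asymmetry of the distribution of a real-valued random variable about its mean. Let us compute the skewness of loss:

# scipy.stats provides statistical measures
stats.mstats.skew(train['loss']).data
# Output: array(3.79492815)

The skewness is greater than 1, which means the data is skewed and not well suited to modeling. We apply the log transform np.log to reduce the skew.

stats.mstats.skew(np.log(train['loss'])).data
# Output: array(0.0929738)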
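To see why the log transform helps, here is a standalone sketch that computes skewness by hand (third central moment divided by the variance to the power 1.5) on synthetic right-skewed data. The log-normal sample and its parameters are assumptions for illustration, not the Allstate data.

```python
import math
import random

def skewness(values):
    """Biased sample skewness: third central moment / variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

random.seed(0)
# Synthetic right-skewed "loss" values (log-normal), made up for this sketch
losses = [random.lognormvariate(7.7, 0.8) for _ in range(20000)]

skew_raw = skewness(losses)
skew_log = skewness([math.log(v) for v in losses])
```

On data like this, `skew_raw` is well above 1 while `skew_log` is close to 0, the same pattern seen with the real loss column.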

Comparing the two loss distributions:

fig, (ax1, ax2) = plt.subplots(1,2)
fig.set_size_inches(16,5)
ax1.hist(train['loss'], bins=50)
ax1.set_title('Train Loss target histogram')
ax1.grid(True)
ax2.hist(np.log(train['loss']), bins=50, color='g')
ax2.set_title("Train Log Loss target histogram")
ax2.grid(True)
plt.show()

After taking the log, loss has the kind of distribution we like.

Continuous Features

train[cont_features].hist(bins=50,figsize=(16,12))

Correlation Between Features

plt.subplots(figsize=(16,9))
correlation_mat = train[cont_features].corr()
sns.heatmap(correlation_mat,annot=True)

XGBoost Tuning Strategy

Import dependencies

import pandas as pd
import numpy as np
import xgboost as xgb
import pickle
import sys
import matplotlib.pyplot as plt
from sklearn.metrics import make_scorer
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import LabelEncoder,LabelBinarizer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold,train_test_split

from xgboost import XGBRegressor

import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

%config InlineBackend.figure_format = 'retina'

Data Preprocessing

train = pd.read_csv('allstate-claims-severity/train.csv')
train['log_loss'] = np.log(train['loss'])

features = [x for x in train.columns if x not in ['id','loss','log_loss']]

cat_features = [x for x in train.select_dtypes(include = ['object']) if x not in ['id','loss','log_loss']]

num_features = [x for x in train.select_dtypes(exclude = ['object']) if x not in ['id','loss','log_loss']]

print('Categorical: {} features'.format(len(cat_features)))
print('Numerical:{} features'.format(len(num_features)))
features

Categorical: 116 features
Numerical:14 features
['cat1',
 'cat2',
 'cat3',
 'cat4',
 'cat5',
 'cat6',
 'cat7',
 'cat8',
 'cat9',
 'cat10',
 'cat11',
 'cat12',
 'cat13',
 'cat14',
 'cat15',
 'cat16',
 'cat17',
 'cat18',
 'cat19',
 'cat20',
 'cat21',
 'cat22',
 'cat23',
 'cat24',
 'cat25',
 'cat26',
 'cat27',
 'cat28',
 'cat29',
 'cat30',
 'cat31',
 'cat32',
 'cat33',
 'cat34',
 'cat35',
 'cat36',
 'cat37',
 'cat38',
 'cat39',
 'cat40',
 'cat41',
 'cat42',
 'cat43',
 'cat44',
 'cat45',
 'cat46',
 'cat47',
 'cat48',
 'cat49',
 'cat50',
 'cat51',
 'cat52',
 'cat53',
 'cat54',
 'cat55',
 'cat56',
 'cat57',
 'cat58',
 'cat59',
 'cat60',
 'cat61',
 'cat62',
 'cat63',
 'cat64',
 'cat65',
 'cat66',
 'cat67',
 'cat68',
 'cat69',
 'cat70',
 'cat71',
 'cat72',
 'cat73',
 'cat74',
 'cat75',
 'cat76',
 'cat77',
 'cat78',
 'cat79',
 'cat80',
 'cat81',
 'cat82',
 'cat83',
 'cat84',
 'cat85',
 'cat86',
 'cat87',
 'cat88',
 'cat89',
 'cat90',
 'cat91',
 'cat92',
 'cat93',
 'cat94',
 'cat95',
 'cat96',
 'cat97',
 'cat98',
 'cat99',
 'cat100',
 'cat101',
 'cat102',
 'cat103',
 'cat104',
 'cat105',
 'cat106',
 'cat107',
 'cat108',
 'cat109',
 'cat110',
 'cat111',
 'cat112',
 'cat113',
 'cat114',
 'cat115',
 'cat116',
 'cont1',
 'cont2',
 'cont3',
 'cont4',
 'cont5',
 'cont6',
 'cont7',
 'cont8',
 'cont9',
 'cont10',
 'cont11',
 'cont12',
 'cont13',
 'cont14']

ntrain = train.shape[0]
# ntrain = 188318

train_x = train[features].copy()  # copy, so the encoding below does not write into a view of train
train_y = train['log_loss']

# Replace each category with its integer code
for col in cat_features:
    train_x[col] = train_x[col].astype('category').cat.codes
print('Xtrain:',train_x.shape) # Xtrain: (188318, 130)
print('ytrain:',train_y.shape) # ytrain: (188318,)

train_x

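Simple XGBoost Model

First we train a basic xgboost model, then tune its parameters and use cross-validation to observe how the results change. The metric is mean absolute error: mean_absolute_error(np.exp(y), np.exp(yhat)).

xgboost defines its own data matrix class, DMatrix, which preprocesses the data once at the start of training to make each later iteration more efficient.

Evaluation Metric

# Evaluation: the model works on the log scale, so apply exp() before scoring.
# The metric is mean absolute error:
# mean_absolute_error(np.exp(y), np.exp(yhat))
# Custom evaluation function
def xg_eval_mae(yhat,dtrain):
    y = dtrain.get_label()
    return 'mae',mean_absolute_error(np.exp(y),np.exp(yhat))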
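The point of applying exp() before scoring is that the model is trained on log(loss), while the error we care about is in the original units. A dependency-free sketch of the same metric, with made-up claim amounts and a hypothetical helper name:

```python
import math

def mae_on_original_scale(y_log, yhat_log):
    """MAE after undoing the log transform with exp()."""
    errors = [abs(math.exp(y) - math.exp(p)) for y, p in zip(y_log, yhat_log)]
    return sum(errors) / len(errors)

true_loss = [1000.0, 2000.0, 4000.0]    # hypothetical claim amounts
pred_loss = [1100.0, 1800.0, 4000.0]    # hypothetical predictions
y_log = [math.log(v) for v in true_loss]
yhat_log = [math.log(v) for v in pred_loss]

# (100 + 200 + 0) / 3 = 100 in the original units
mae = mae_on_original_scale(y_log, yhat_log)
```

Scoring the log values directly would give a much smaller number that is hard to interpret as a claim amount.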

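Model

# Convert the data into the low-level format the library uses
dtrain = xgb.DMatrix(train_x,train['log_loss'])
dtrain

XGBoost Parameters

booster : gbtree, the type of booster; gbtree uses gradient-boosted trees.
objective : multi:softmax, the loss function to use; softmax is for multiclass classification.
num_class : 10, the number of classes, used together with multi:softmax.
gamma : the minimum loss reduction required to make a split.
max_depth : 12, the depth of the trees; larger values overfit more easily.
lambda : 2, the L2 regularization weight on the leaf values, which controls model complexity; the larger it is, the less prone the model is to overfitting.
subsample : 0.7, row subsampling; trains on 70% of the samples.
colsample_bytree : 0.7, column subsampling when building each tree.
min_child_weight : 3, the minimum sum of sample weights in a child node; if a leaf's sum of sample weights falls below min_child_weight, splitting stops.
silent : 0, setting it to 1 suppresses run-time messages; 0 is recommended.
eta : 0.007, acts like a learning rate; earlier trees stay fixed, and eta scales how much each newly added tree contributes to the result.
seed : 1000
nthread : 7, the number of CPU threads.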

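xgb_params = {
    'seed': 0,
    'eta': 0.1,
    'colsample_bytree': 0.5,
    'silent': 1,                 # newer XGBoost releases use 'verbosity' instead
    'subsample': 0.5,
    'objective': 'reg:linear',   # older name; newer releases call it 'reg:squarederror'
    'max_depth': 5,
    'min_child_weight': 3
}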

Cross-Validation with xgb.cv

%%time

# feval: custom evaluation function
bst_cv1 = xgb.cv(xgb_params, dtrain, num_boost_round=50, nfold=3, seed=0,
                feval=xg_eval_mae, maximize=False, early_stopping_rounds=10)

print('CV score:', bst_cv1.iloc[-1,:]['test-mae-mean'])

CV score: 1220.110026
Wall time: 26 s

plt.figure()
bst_cv1[['train-mae-mean', 'test-mae-mean']].plot()

The above is our first model.

It does not overfit.
It builds only 50 trees.

%%time
# Build 100 trees
bst_cv2 = xgb.cv(xgb_params, dtrain, num_boost_round=100, 
                nfold=3, seed=0, feval=xg_eval_mae, maximize=False, 
                early_stopping_rounds=10)

print ('CV score:', bst_cv2.iloc[-1,:]['test-mae-mean'])

CV score: 1172.059570333333
Wall time: 50.9 s

fig, (ax1, ax2) = plt.subplots(1,2)
fig.set_size_inches(16,4)

ax1.set_title('100 rounds of training')
ax1.set_xlabel('Rounds')
ax1.set_ylabel('Loss')
ax1.grid(True)
ax1.plot(bst_cv2[['train-mae-mean', 'test-mae-mean']])
ax1.legend(['Training Loss', 'Test Loss'])

ax2.set_title('60 last rounds of training')
ax2.set_xlabel('Rounds')
ax2.set_ylabel('Loss')
ax2.grid(True)
ax2.plot(bst_cv2.iloc[40:][['train-mae-mean', 'test-mae-mean']])
ax2.legend(['Training Loss', 'Test Loss'])

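We increased the number of trees to 100. The effect is not dramatic. Looking at the last 60 rounds, the test error is only slightly higher than the training error, so there is a small amount of overfitting.

Still, our CV score is lower. Next we change the other parameters.

XGBoost Parameter Tuning

Step 1: choose an initial set of parameters.
Step 2: vary max_depth and min_child_weight.
Step 3: tune gamma to reduce the risk of overfitting.
Step 4: tune subsample and colsample_bytree to change the data sampling strategy.
Step 5: tune the learning rate eta.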
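The stage-wise idea (tune one group of parameters, freeze the winners, move on) can be sketched independently of XGBoost. In the sketch below `cv_score` is a made-up stand-in for a real cross-validated MAE and `tune_stage` is a hypothetical helper; only the parameter names and grids mirror the steps above.

```python
# Sketch of stage-wise tuning: search one group of parameters at a time
# while keeping the others fixed. Lower scores are better.
from itertools import product

def cv_score(params):
    # Made-up stand-in for a cross-validated MAE
    return (abs(params["max_depth"] - 8) * 10
            + abs(params["min_child_weight"] - 6)
            + abs(params["gamma"] - 0.2) * 50
            + 1100)

def tune_stage(base, grid, score):
    """Grid-search the keys in `grid`; return `base` updated with the best combination."""
    keys = list(grid)
    best = min(product(*(grid[k] for k in keys)),
               key=lambda combo: score({**base, **dict(zip(keys, combo))}))
    return {**base, **dict(zip(keys, best))}

# Step 1: initial parameters
params = {"max_depth": 5, "min_child_weight": 3, "gamma": 0.0}
# Step 2: tree depth and child weight
params = tune_stage(params, {"max_depth": [4, 5, 6, 7, 8],
                             "min_child_weight": [1, 3, 6]}, cv_score)
# Step 3: gamma, with the Step 2 winners frozen
params = tune_stage(params, {"gamma": [0.0, 0.1, 0.2, 0.3, 0.4]}, cv_score)
```

Tuning in stages costs far fewer model fits than one joint grid over all parameters, at the price of possibly missing interactions between groups.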

class XGBoostRegressor(object):
    def __init__(self, **kwargs):
        self.params = kwargs
        if 'num_boost_round' in self.params:
            self.num_boost_round = self.params['num_boost_round']
        self.params.update({'silent': 1, 'objective': 'reg:linear', 'seed': 0})  # default parameters

    def fit(self, x_train, y_train):
        '''
        Convert the data to a DMatrix and train an xgboost model with self.params
        '''
        dtrain = xgb.DMatrix(x_train, y_train)
        self.bst = xgb.train(params=self.params, dtrain=dtrain, num_boost_round=self.num_boost_round,
                             feval=xg_eval_mae, maximize=False)
        return self

    def predict(self, x_pred):
        # Use the booster trained in fit(). Retraining here on the global dtrain
        # would let the model see the validation folds during grid search.
        dpred = xgb.DMatrix(x_pred)
        return self.bst.predict(dpred)

    def kfold(self, x_train, y_train, nfold=5):
        dtrain = xgb.DMatrix(x_train, y_train)
        cv_rounds = xgb.cv(params=self.params, dtrain=dtrain, num_boost_round=self.num_boost_round,
                           nfold=nfold, feval=xg_eval_mae, maximize=False, early_stopping_rounds=10)
        return cv_rounds.iloc[-1,:]

    def plot_feature_importances(self):
        feat_imp = pd.Series(self.bst.get_fscore()).sort_values(ascending=False)
        feat_imp.plot(title='Feature Importances')
        plt.ylabel('Feature Importance Score')

    def get_params(self, deep=True):
        return self.params

    def set_params(self, **params):
        self.params.update(params)
        return self

# Evaluation metric
def mae_score(y_true, y_pred):
    return mean_absolute_error(np.exp(y_true), np.exp(y_pred))

mae_scorer = make_scorer(mae_score, greater_is_better=False)

bst = XGBoostRegressor(eta=0.1, colsample_bytree=0.5, subsample=0.5,
                       max_depth=5, min_child_weight=3, num_boost_round=50)
bst.fit(train_x, train_y)  # train once on the full training set so predict() can be used later

bst.kfold(train_x, train_y, nfold=5)

train-rmse-mean       0.558938
train-rmse-std        0.001005
test-rmse-mean        0.562665
test-rmse-std         0.002445
train-mae-mean     1209.707324
train-mae-std         3.004207
test-mae-mean      1218.884204
test-mae-std          8.982969
Name: 49, dtype: float64

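Process the test set the same way as the training set

test = pd.read_csv('allstate-claims-severity/test.csv')

test # no loss column; loss is what we need to predict

#features_test = [x for x in test.columns if x not in ['id']]

test_x = test[features].copy()

# Replace each category with its integer code.
# Note: the codes are assigned independently of the training set here, so they
# only match the training codes when both sets contain the same categories.
for col in cat_features:
    test_x[col] = test_x[col].astype('category').cat.codes

test_x.head()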
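One caveat with encoding the test set on its own: integer codes are assigned from the categories present in that particular column, so the same label can receive different codes in train and test. A minimal pure-Python sketch of the issue and of reusing the training mapping instead; the helper names and the tiny columns are hypothetical:

```python
def fit_codes(values):
    """Map each distinct category to an integer code, in sorted order."""
    return {cat: code for code, cat in enumerate(sorted(set(values)))}

def encode(values, mapping, unseen=-1):
    # Categories that never appeared when the mapping was built get `unseen`
    return [mapping.get(v, unseen) for v in values]

train_col = ["A", "B", "C", "A"]     # hypothetical categorical column
test_col = ["B", "C", "D"]           # "A" is absent, "D" is new

mapping = fit_codes(train_col)                      # {"A": 0, "B": 1, "C": 2}
shared = encode(test_col, mapping)                  # codes consistent with train
separate = encode(test_col, fit_codes(test_col))    # codes from the test column alone
```

With the shared mapping, "B" keeps code 1 as in training; encoded separately it becomes 0, which would silently shift the meaning of the feature for the trained trees.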

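# Convert the data into the low-level format the library uses
dtest_x = xgb.DMatrix(test_x)
#dtest_x
# This is the test set in the form we need


# Predict; XGBoostRegressor.predict wraps test_x in a DMatrix itself
test_y = bst.predict(test_x)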

test_y[1],len(test_y)

(7.450635, 125546)

# Undo the log transform to get predictions on the original loss scale
test_exp_y = np.exp(test_y)
test_exp_y.shape

(125546,)

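Step 1: Baseline Model

Step 2: Tree Depth and Node Weight

These parameters have the largest effect on xgboost performance, so they should be tuned first. Briefly:
max_depth: the maximum depth of a tree. Increasing it makes the model more complex and more prone to overfitting; depths of 3-10 are reasonable.
min_child_weight: a regularization parameter. If the sum of instance weights in a tree partition is below this value, the tree stops splitting there.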
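The min_child_weight rule can be illustrated with a tiny check. The "instance weight" here is the sum of second-order gradients (hessians) in each child; the helper `split_allowed` and the numbers are hypothetical, and the check is a sketch of the idea, not XGBoost's code.

```python
def split_allowed(h_left, h_right, min_child_weight):
    """Keep a split only if both children have enough total hessian weight."""
    return sum(h_left) >= min_child_weight and sum(h_right) >= min_child_weight

# With the loss 0.5 * (y - p)^2 every sample has hessian 1,
# so the sum is simply the number of samples in the child.
ok = split_allowed([1.0] * 5, [1.0] * 4, min_child_weight=3)
rejected = split_allowed([1.0] * 5, [1.0] * 2, min_child_weight=3)
```

Larger values of min_child_weight therefore forbid leaves backed by very few samples, which is why the parameter acts as a regularizer.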

xgb_param_grid = {'max_depth': list(range(4,9)), 'min_child_weight': list((1,3,6))}
xgb_param_grid['max_depth']

[4, 5, 6, 7, 8]

%%time
from sklearn.model_selection import GridSearchCV

# Cross-validated grid search
grid = GridSearchCV(XGBoostRegressor(eta=0.1, num_boost_round=50, colsample_bytree=0.5, subsample=0.5),
                param_grid=xgb_param_grid, cv=5, scoring = mae_scorer)

grid.fit(train_x, train_y.values)

Wall time: 18min 30s

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=<__main__.XGBoostRegressor object at 0x000001EAF043AAC8>,
             iid='warn', n_jobs=None,
             param_grid={'max_depth': [4, 5, 6, 7, 8],
                         'min_child_weight': [1, 3, 6]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=make_scorer(mae_score, greater_is_better=False),
             verbose=0)

#grid.grid_scores_, grid.best_params_, grid.best_score_  # older scikit-learn versions

#print(grid.cv_results_)  # newer versions
print(grid.cv_results_['mean_test_score'])
print(grid.cv_results_['params'])
print('************************************')
print(grid.best_params_)
print('************************************')
print(grid.best_score_ )

[-1243.45883654 -1243.43443586 -1243.74273371 -1219.45710638
 -1219.53921994 -1219.59817312 -1205.01504313 -1203.69057513
 -1203.64737075 -1194.79949665 -1194.31288936 -1193.77942991
 -1189.22162241 -1188.18115104 -1187.53612533]
[{'max_depth': 4, 'min_child_weight': 1}, {'max_depth': 4, 'min_child_weight': 3}, {'max_depth': 4, 'min_child_weight': 6}, {'max_depth': 5, 'min_child_weight': 1}, {'max_depth': 5, 'min_child_weight': 3}, {'max_depth': 5, 'min_child_weight': 6}, {'max_depth': 6, 'min_child_weight': 1}, {'max_depth': 6, 'min_child_weight': 3}, {'max_depth': 6, 'min_child_weight': 6}, {'max_depth': 7, 'min_child_weight': 1}, {'max_depth': 7, 'min_child_weight': 3}, {'max_depth': 7, 'min_child_weight': 6}, {'max_depth': 8, 'min_child_weight': 1}, {'max_depth': 8, 'min_child_weight': 3}, {'max_depth': 8, 'min_child_weight': 6}]
************************************
{'max_depth': 8, 'min_child_weight': 6}
************************************
-1187.5361253348233

Best result found by the grid search: {'max_depth': 8, 'min_child_weight': 6}, -1187.5361253348233

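def convert_grid_scores(scores):
    '''Return one array per tuned parameter, followed by the mean test scores.

    Accepts grid.cv_results_ (a dict) or the older grid.grid_scores_ list.'''
    if isinstance(scores, dict):
        param_dicts, maes = scores['params'], scores['mean_test_score']
    else:
        param_dicts, maes = [s[0] for s in scores], [s[1] for s in scores]
    names = sorted(param_dicts[0])
    columns = [np.array([p[n] for p in param_dicts]) for n in names]
    columns.append(np.asarray(maes))
    return columns

# grid_scores_ was removed from newer scikit-learn; use cv_results_ instead
*_, scores = convert_grid_scores(grid.cv_results_)
scores = scores.reshape(5,3)  # rows: max_depth, columns: min_child_weight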

plt.figure(figsize=(10,5))
cp = plt.contourf(xgb_param_grid['min_child_weight'], xgb_param_grid['max_depth'], scores, cmap='BrBG')
plt.colorbar(cp)
plt.title('Depth / min_child_weight optimization')
plt.annotate('We use this', xy=(5.95, 7.95), xytext=(4, 7.5), arrowprops=dict(facecolor='white'), color='white')
plt.annotate('Good for depth=7', xy=(5.98, 7.05), 
             xytext=(4, 6.5), arrowprops=dict(facecolor='white'), color='white')
plt.xlabel('min_child_weight')
plt.ylabel('max_depth')
plt.grid(True)
plt.show()

Step 3: Tune gamma to Reduce the Risk of Overfitting

%%time

xgb_param_grid = {'gamma':[ 0.1 * i for i in range(0,5)]}

grid = GridSearchCV(XGBoostRegressor(eta=0.1, num_boost_round=50, max_depth=8, min_child_weight=6,
                                        colsample_bytree=0.5, subsample=0.5),
                    param_grid=xgb_param_grid, cv=5, scoring=mae_scorer)

grid.fit(train_x, train_y.values)

#Wall time: 13min 45s
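How gamma prunes splits can be seen by evaluating one candidate split over the same gamma grid. The gradient statistics below are hypothetical, and `split_gain` is an illustrative helper implementing the gain formula with a gamma penalty.

```python
def split_gain(G_left, H_left, G_right, H_right, lam, gamma):
    """Gain of a split: improvement in the structure score minus the gamma penalty."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_left, H_left) + score(G_right, H_right)
                  - score(G_left + G_right, H_left + H_right)) - gamma

# Hypothetical gradient statistics for one candidate split
stats = dict(G_left=-1.0, H_left=4.0, G_right=1.2, H_right=4.0, lam=1.0)

gammas = [0.1 * i for i in range(0, 5)]     # same grid as in the search above
# The split is worth making only while its gain stays positive
accepted = [gm for gm in gammas if split_gain(gamma=gm, **stats) > 0]
```

For this weak split, only the three smallest gamma values keep it; larger values reject it, so the tree stays smaller.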

Step 4: Tune the Sampling Parameters subsample and colsample_bytree

%%time

xgb_param_grid = {'subsample':[ 0.1 * i for i in range(6,9)],
                      'colsample_bytree':[ 0.1 * i for i in range(6,9)]}


grid = GridSearchCV(XGBoostRegressor(eta=0.1, gamma=0.2, num_boost_round=50, max_depth=8, min_child_weight=6),
                    param_grid=xgb_param_grid, cv=5, scoring=mae_scorer)
grid.fit(train_x, train_y.values)

# Wall time: 28min 26s

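grid.cv_results_['mean_test_score'], grid.best_params_, grid.best_score_

# grid_scores_ was removed from newer scikit-learn; read the scores from cv_results_
scores = np.asarray(grid.cv_results_['mean_test_score'])
scores = scores.reshape(3,3)  # rows: colsample_bytree, columns: subsample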

plt.figure(figsize=(10,5))
cp = plt.contourf(xgb_param_grid['subsample'], xgb_param_grid['colsample_bytree'], scores, cmap='BrBG')
plt.colorbar(cp)
plt.title('Subsampling params tuning')
plt.annotate('Optimum', xy=(0.895, 0.6), xytext=(0.8, 0.695), arrowprops=dict(facecolor='black'))
plt.xlabel('subsample')
plt.ylabel('colsample_bytree')
plt.grid(True)
plt.show()

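For the specific case of the current pre-trained model, I obtained the following result: {'colsample_bytree': 0.8, 'subsample': 0.8}, -1182.9309918891634

Step 5: Lower the Learning Rate and Increase the Number of Trees

(Alternatively, raise the learning rate and reduce the number of trees.)

The last step of parameter optimization is to lower the learning rate while adding more estimators.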
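The trade-off between learning rate and number of rounds shows up even in a deliberately trivial booster. In the sketch below each "tree" is just the mean of the current residuals, and the targets are made up; it illustrates shrinkage only and is not how XGBoost fits trees.

```python
def boost_constant(targets, eta, rounds):
    """Each round fits the mean residual and adds eta times it to the prediction."""
    pred = 0.0
    for _ in range(rounds):
        residual_mean = sum(t - pred for t in targets) / len(targets)
        pred += eta * residual_mean     # shrink the new model's contribution by eta
    return pred

targets = [7.0, 8.0, 9.0]      # hypothetical log-scale targets, mean 8.0

# Distance from the best constant prediction after 50 and 200 rounds
gap_50 = abs(8.0 - boost_constant(targets, eta=0.07, rounds=50))
gap_200 = abs(8.0 - boost_constant(targets, eta=0.07, rounds=200))
```

With a small eta, 50 rounds still leave a visible gap while 200 rounds close it, which is why lowering the learning rate goes together with adding trees.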

First, we plot different learning rates for a simpler model (50 trees):

%%time
    
xgb_param_grid = {'eta':[0.5,0.4,0.3,0.2,0.1,0.075,0.05,0.04,0.03]}
grid = GridSearchCV(XGBoostRegressor(num_boost_round=50, gamma=0.2, max_depth=8, min_child_weight=6,
                                        colsample_bytree=0.6, subsample=0.9),
                    param_grid=xgb_param_grid, cv=5, scoring=mae_scorer)

grid.fit(train_x, train_y.values) 

#CPU times: user 6.69 ms, sys: 0 ns, total: 6.69 ms
#Wall time: 6.55 ms

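grid.cv_results_['mean_test_score'], grid.best_params_, grid.best_score_

# grid_scores_ was removed from newer scikit-learn; read the results from cv_results_
eta = [p['eta'] for p in grid.cv_results_['params']]
y = np.asarray(grid.cv_results_['mean_test_score'])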
plt.figure(figsize=(10,4))
plt.title('MAE and ETA, 50 trees')
plt.xlabel('eta')
plt.ylabel('score')
plt.plot(eta, -y)
plt.grid(True)
plt.show()

{'eta': 0.2}, -1160.9736284869114 is the best result so far. Now we increase the number of trees to 100.

xgb_param_grid = {'eta':[0.5,0.4,0.3,0.2,0.1,0.075,0.05,0.04,0.03]}
grid = GridSearchCV(XGBoostRegressor(num_boost_round=100, gamma=0.2, max_depth=8, min_child_weight=6,
                                        colsample_bytree=0.6, subsample=0.9),
                    param_grid=xgb_param_grid, cv=5, scoring=mae_scorer)

grid.fit(train_x, train_y.values)

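grid.cv_results_['mean_test_score'], grid.best_params_, grid.best_score_

# grid_scores_ was removed from newer scikit-learn; read the results from cv_results_
eta = [p['eta'] for p in grid.cv_results_['params']]
y = np.asarray(grid.cv_results_['mean_test_score'])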
plt.figure(figsize=(10,4))
plt.title('MAE and ETA, 100 trees')
plt.xlabel('eta')
plt.ylabel('score')
plt.plot(eta, -y)
plt.grid(True)
plt.show()

A lower learning rate works better.

We keep increasing the number of trees.

%%time

xgb_param_grid = {'eta':[0.09,0.08,0.07,0.06,0.05,0.04]}
grid = GridSearchCV(XGBoostRegressor(num_boost_round=200, gamma=0.2, max_depth=8, min_child_weight=6,
                                        colsample_bytree=0.6, subsample=0.9),
                    param_grid=xgb_param_grid, cv=5, scoring=mae_scorer)

grid.fit(train_x, train_y.values)

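grid.cv_results_['mean_test_score'], grid.best_params_, grid.best_score_

# grid_scores_ was removed from newer scikit-learn; read the results from cv_results_
eta = [p['eta'] for p in grid.cv_results_['params']]
y = np.asarray(grid.cv_results_['mean_test_score'])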
plt.figure(figsize=(10,4))
plt.title('MAE and ETA, 200 trees')
plt.xlabel('eta')
plt.ylabel('score')
plt.plot(eta, -y)
plt.grid(True)
plt.show()

%%time

# Final XGBoost model

bst = XGBoostRegressor(num_boost_round=200, eta=0.07, gamma=0.2, max_depth=8, min_child_weight=6,
                                        colsample_bytree=0.6, subsample=0.9)
cv = bst.kfold(train_x, train_y, nfold=5)