XGBoost Regression for Industrial Control Data: A Practical Case (Native Interface)

Article link: https://blog.csdn.net/xiaoyw71/article/details/107900651

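1. About XGBoost

XGBoost is often described as an essential tool for winning competitions. It has been used in many top entries on Kaggle, Tianchi, DataCastle, Kesci and other data competitions in China and abroad, and it has a large user base in industry.

For prediction problems on unstructured data (images, text and so on), artificial neural networks generally outperform other algorithms and frameworks. For small and medium-sized structured or tabular data, however, decision-tree-based algorithms are currently regarded as the best approach, and XGBoost is the most prominent of them.

XGBoost was originally developed by Tianqi Chen. At the time, Chen was a PhD student in computer science at the University of Washington working on large-scale machine learning. He won first place in KDD CUP 2012 Track 1, developed well-known machine learning tools such as SVDFeature, XGBoost and cxxnet, and is one of the initiators of Distributed (Deep) Machine Learning Common.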

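1.1. XGBoost Application Tasks

XGBoost offers two kinds of interfaces, the native XGBoost interface and the sklearn interface, and it handles both classification and regression tasks. The rest of this article works through a regression task on industrial control data, using the native interface.

1.2. Advantages of XGBoost

XGBoost can improve the performance of a predictive model. A closer look at how it works shows the following advantages: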

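Regularization
XGBoost adds a regularization term to the cost function to control model complexity. The term contains the number of leaf nodes of the tree and the sum of squares (squared L2 norm) of the scores output at the leaves. From the bias-variance tradeoff point of view, the regularization term lowers the variance of the model, makes the learned model simpler and guards against overfitting. This is one of the features that sets XGBoost apart from traditional GBDT.
Parallel processing
XGBoost is parallel at the feature level, not at the tree level: each tree still depends on the tree before it, so trees are built one after another. The most time-consuming step in learning a decision tree is usually sorting the feature values (needed to find the best split point). XGBoost sorts the data once before training and stores it in a block structure, which is reused in later iterations and greatly reduces the amount of computation. The block structure is also what makes parallelism possible: when a node is split, the gain of every feature has to be computed and the feature with the largest gain is chosen, so the per-feature gain computations can run in multiple threads.
Flexibility
XGBoost supports user-defined objective functions and evaluation functions; the only requirement on the objective is that it has a second derivative. This makes it possible to adapt the model to a wide range of tasks.
Missing value handling
For samples with missing feature values, XGBoost learns the split direction automatically. It has built-in rules for missing values: the user supplies a value that differs from all regular values and passes it in as a parameter to mark missing entries. XGBoost tries different options when it meets a missing value at a node and learns which path to take for missing values seen later.
Pruning
XGBoost first grows all possible subtrees from top to bottom and then prunes backwards from bottom to top. Compared with GBM, this is less likely to get stuck in a local optimum.
Built-in cross-validation
XGBoost can run cross-validation at every boosting iteration, so the optimal number of boosting iterations is easy to obtain. GBM relies on grid search and can only test a limited number of values.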

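2. XGBoost Regression Prediction in Practice

2.1. Overview of the Industrial Control Case

The liquid level of a tank is monitored, with the volume of liquid in the tank (the level) measured every 10 minutes. A regression model is fitted to these readings to estimate the level at an arbitrary moment, including moments that fall outside the regular measurement schedule. Viewed as a whole, this amounts to solving a missing-data problem.
The predictions can also be cross-checked against safety alarms such as tank leakage, which provides a warning from a second, independent angle.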

2.2. Regression with the Native XGBoost Interface

import xgboost as xgb

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, KFold
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
from xgboost import plot_importance
from sklearn.metrics import r2_score

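def model_train_reg(x_train,x_test,y_train,y_test):
    # Only native-interface parameter names are used here. The number of boosting
    # rounds is passed to xgb.cv / xgb.train directly, so num_boost_round and
    # n_estimators do not belong in params.
    params ={'eta': 0.1,                      # shrinkage step size (alias: learning_rate), range [0,1], default 0.3; smaller values make boosting more conservative
              'max_depth': 10,                # maximum tree depth; larger values overfit more easily
              'objective':'reg:squarederror',  # regression with squared error
              #'objective': 'reg:linear',      # name used by earlier versions, replaced by reg:squarederror
              'seed': 7,
              'gamma':0,
              'subsample':0.8,
              'colsample_bytree':0.8,
              'alpha':0.005,                  # L1 regularization (alias: reg_alpha)
              'eval_metric':['rmse','mae'],   # regression metrics; logloss and auc are classification metrics
            }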
    dtrain = xgb.DMatrix(x_train, label=y_train)
    dtest = xgb.DMatrix(x_test,label=y_test)

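    res = xgb.cv(params,dtrain,num_boost_round=5000,metrics='rmse',early_stopping_rounds=25)
    # Best number of boosting rounds: with early stopping, the returned history
    # ends at the best iteration, so its row count is the number of rounds to train
    best_nround = res.shape[0]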
    
    watchlist = [(dtrain,'train'),(dtest,'eval')]
    evals_result = {}
    
    model = xgb.train(params,dtrain,num_boost_round=best_nround,evals = watchlist,evals_result=evals_result)
    y_pred=model.predict(xgb.DMatrix(x_test))

    y_pred = np.clip(y_pred, 0, None)   # a liquid level cannot be negative
    RMSE = np.sqrt(np.mean((np.array(y_test) - y_pred) ** 2))   # root mean squared error
    R2  = r2_score(y_test,y_pred)


    print ('RMSE: %f' % RMSE)
    print ('r2_score: %.2f' %R2)
    
    print('Plotting predictions and the RMSE training curves')

    # evals_result holds one curve per data set in the watchlist (train, eval)
    # and per metric in params['eval_metric']; flatten it to keys like 'train_rmse'
    curves = {}
    for e_name,e_mtrs in evals_result.items():
        for e_mtr_name, e_mtr_vals in e_mtrs.items():
            curves[e_name + '_' + e_mtr_name] = e_mtr_vals

    plt.figure(12)
    plt.subplot(121)
    plt.grid()
    plt.scatter(y_test,y_pred,s=20)
    # reference line y = x: points on this line are predicted exactly
    plt.plot([min(y_test),max(y_test)],[min(y_test),max(y_test)])
    plt.xlabel('Actual level')
    plt.ylabel('Predicted level')

    plt.subplot(122)
    plt.plot(curves['train_rmse'],label = 'train_rmse',color='blue')
    plt.plot(curves['eval_rmse'],label = 'eval_rmse',color='deeppink')
    plt.xlabel('Boosting round')
    plt.ylabel('RMSE')
    plt.legend()

    plt.show()

    return model

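# Read the data from Excel. get_DataFromExcel, feature_datatime and init_train_data
# are the author's own helpers; their definitions are not included in this article.
df0 = get_DataFromExcel()

df0 = pd.concat([df0,feature_datatime(df0)],axis=1)
print(df0.dtypes)

x_train,x_test,y_train,y_test = init_train_data(df0)

model= model_train_reg(x_train,x_test,y_train,y_test)

#model.save_model('OilCanXGbLinear.model')  # save the trained model
# Show feature importance
plot_importance(model)
plt.show()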
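The script above relies on three helpers whose code is not shown. The following is only a guess at what such helpers might look like, written so that the data flow can be tried out. Everything in it is an assumption: the column names datetime and level, the synthetic readings that stand in for the Excel file, the chosen time features and the 80/20 random split. The names carry a _sketch suffix to keep them apart from the author's functions.

```python
import numpy as np
import pandas as pd

def make_demo_data(n=500):
    """Hypothetical stand-in for get_DataFromExcel(): one reading every 10 minutes."""
    times = pd.date_range('2020-01-01', periods=n, freq='10min')
    minutes = np.arange(n) * 10
    level = 50 + 20 * np.sin(2 * np.pi * minutes / (24 * 60))  # synthetic daily cycle
    return pd.DataFrame({'datetime': times, 'level': level})

def feature_datetime_sketch(df):
    """Derive numeric time features from the timestamp column."""
    ts = df['datetime']
    return pd.DataFrame({
        'month': ts.dt.month,
        'day': ts.dt.day,
        'weekday': ts.dt.weekday,
        'hour': ts.dt.hour,
        'minute': ts.dt.minute,
    }, index=df.index)

def init_train_data_sketch(df, target='level', test_frac=0.2, seed=7):
    """Randomly hold out a share of the rows as the test set."""
    test = df.sample(frac=test_frac, random_state=seed)
    train = df.drop(test.index)
    feature_cols = [c for c in df.columns if c not in (target, 'datetime')]
    return train[feature_cols], test[feature_cols], train[target], test[target]

df_demo = make_demo_data()
df_demo = pd.concat([df_demo, feature_datetime_sketch(df_demo)], axis=1)
x_tr, x_te, y_tr, y_te = init_train_data_sketch(df_demo)
print(x_tr.shape, x_te.shape)
```

A random hold-out matches the stated goal of estimating the level at moments that were not measured; forecasting future levels would call for a split by time instead.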

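2.3. Analysis of the Output

(1) Monitoring during training
The key code for monitoring the training process is shown below. The watchlist is only used for evaluation and does not affect model training: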

    watchlist = [(dtrain,'train'),(dtest,'eval')]
    evals_result = {}

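RMSE drops quickly. Logloss, which was also listed in eval_metric for this run, barely changes: it is a classification metric defined for labels and predicted probabilities in [0,1], so it carries no useful information for a regression target such as a liquid level.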

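(2) Predicted versus actual values, and the decline of RMSE

(3) Feature importance
How does XGBoost obtain feature importance? With the default setting of plot_importance, the importance of a feature (its feature score) is the total number of times it is chosen as the split feature of a tree node. For example, if feature A is chosen once in the first iteration (the first tree), twice in the second iteration, and so on, its final feature score is 1 + 2 + ...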

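2.4. XGBoost CV (Cross-Validation) and Finding the Best Number of Trees

XGBoost models are usually evaluated with cross-validation (CV): the data set is divided into k equal parts, and each part in turn serves as the validation set while the remaining k-1 parts are used for training.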
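The k-fold scheme described above can be sketched in a few lines of NumPy. The function name kfold_indices and the sample sizes are assumptions for illustration; in practice xgb.cv or sklearn's KFold does this work.

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Yield (train_idx, valid_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)      # shuffle the sample indices once
    folds = np.array_split(idx, k)        # k parts of nearly equal size
    for i in range(k):
        valid_idx = folds[i]              # one part is held out for validation
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, valid_idx

splits = list(kfold_indices(n_samples=10, k=5))
for train_idx, valid_idx in splits:
    print('train:', sorted(train_idx.tolist()), 'valid:', sorted(valid_idx.tolist()))
```

Every sample lands in the validation set exactly once, so the averaged validation score uses all of the data.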

xgb.cv can be used to find the best number of trees. The relevant lines from the code above are:

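    res = xgb.cv(params,dtrain,num_boost_round=5000,metrics='rmse',early_stopping_rounds=25)
    # Best number of boosting rounds
    best_nround = res.shape[0]

Here early_stopping_rounds is the number of rounds without improvement in the metric after which training stops early. The history returned by xgb.cv then ends at the best iteration, so its length is the best number of boosting rounds.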

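3. XGBoost Parameters

XGBoost parameters fall into three types: general parameters, booster parameters and learning task parameters.

General parameters: control which booster is used in the boosting process; the common boosters are the tree model (tree) and the linear model (linear model).
Booster parameters: depend on which booster is chosen.
Learning Task parameters: control the learning scenario; for example, a regression task may use different parameters from a ranking task.
Besides these there may be further parameters that are used on the command line.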

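3.1. General Parameters

booster [default=gbtree]
Selects the model: gbtree or gblinear. gbtree boosts tree-based models, gblinear boosts linear models.
silent [default=0]
0 prints runtime messages, 1 runs in silent mode without them.
0 is recommended, because the output helps in understanding the model and in tuning. In the author's experience, setting it to 1 often does not silence the output anyway. Newer releases replace silent with the verbosity parameter.
nthread [default to maximum number of threads available if not set]
Number of threads XGBoost runs with. The default is the maximum number of threads available on the system.
To run at full speed, leave this parameter unset and XGBoost will use all available threads.
num_pbuffer [set automatically by xgboost, no need to be set by user]
Size of the prediction buffer, normally set to the number of training instances. The buffers are used to save the prediction results of the last boosting step.
num_feature [set automatically by xgboost, no need to be set by user]
Feature dimension used in boosting, set to the number of features. XGBoost sets it automatically.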

3.2. Booster Parameters

From xgboost-unity, the bst: prefix is no longer needed for booster parameters. Parameters with or without the bst: prefix are equivalent (i.e. both bst:eta and eta are valid parameter settings).

3.3. Parameters for Tree Booster

eta [default=0.3]
Step size shrinkage used in the update to prevent overfitting. After each boosting step the weights of the new features are obtained directly, and eta shrinks these weights to make the boosting process more conservative.
Range: [0,1]
Typical final values: 0.01 to 0.2
gamma [default=0]
Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger it is, the more conservative the algorithm will be.
Range: [0,∞]
By default a node is split only when the split reduces the loss function by more than 0; gamma raises this threshold to the minimum loss reduction required for a split.
A larger gamma makes the algorithm more conservative. A suitable value depends on the loss function and should be tuned.
max_depth [default=6]
Maximum depth of a tree.
Range: [1,∞]
The deeper the tree, the more closely it fits the data and the more it tends to overfit, so this parameter is also a control on overfitting.
Tuning through cross-validation (xgb.cv) is recommended.
Typical values: 3 to 10
min_child_weight [default=1]
Minimum sum of instance weights required in a child node. If a split would produce a leaf whose sum of instance weights is below min_child_weight, the splitting process stops there. In a linear regression task this simply corresponds to the minimum number of instances needed in each node. The larger the value, the more conservative the algorithm, so raising it helps to control overfitting.
Range: [0,∞]
max_delta_step [default=0]
Maximum delta step allowed for each tree's weight estimation. A value of 0 means no constraint; a positive value makes the update step more conservative.
Range: [0,∞]
This parameter is usually not needed, but it may help in logistic regression when the classes are extremely imbalanced; values of 1 to 10 may help control the update.
subsample [default=1]
Fraction of the training samples used to build each tree. With 0.5, XGBoost randomly draws 50% of the samples to build the tree model, which helps prevent overfitting.
Range: (0,1]
colsample_bytree [default=1]
Fraction of the features sampled at random when a tree is built.
Range: (0,1]
colsample_bylevel [default=1]
Fraction of the features sampled for each split, at each level of the tree.
Usually not used, because subsample and colsample_bytree already serve the same purpose.
scale_pos_weight [default=1]
A value greater than 0 can be used in case of high class imbalance, as it helps the model converge faster.
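How gamma, lambda and min_child_weight act on a single split can be written out with the split gain formula from the XGBoost paper. This is a simplified sketch, not the library's implementation: the function names are made up for illustration, and the real tree builder adds further details such as pruning after the tree is grown. G and H denote the sums of the first and second derivatives of the loss over the samples in a node.

```python
def leaf_score(g, h, reg_lambda):
    """Structure score G^2 / (H + lambda) of a node."""
    return g * g / (h + reg_lambda)

def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction of a split minus the complexity penalty gamma."""
    parent = leaf_score(g_left + g_right, h_left + h_right, reg_lambda)
    children = (leaf_score(g_left, h_left, reg_lambda)
                + leaf_score(g_right, h_right, reg_lambda))
    return 0.5 * (children - parent) - gamma

def split_allowed(g_left, h_left, g_right, h_right,
                  reg_lambda=1.0, gamma=0.0, min_child_weight=1.0):
    """A split needs enough hessian weight in both children and a positive gain."""
    if min(h_left, h_right) < min_child_weight:
        return False
    return split_gain(g_left, h_left, g_right, h_right, reg_lambda, gamma) > 0

# With squared error every sample has hessian 1, so h equals the sample count.
# Example: 4 samples per child, gradient sums of -8 and +8.
gain = split_gain(-8.0, 4.0, 8.0, 4.0)
print('gain with gamma=0:', gain)
print('allowed with gamma=15:', split_allowed(-8.0, 4.0, 8.0, 4.0, gamma=15.0))
print('allowed with min_child_weight=5:',
      split_allowed(-8.0, 4.0, 8.0, 4.0, min_child_weight=5.0))
```

Raising gamma or min_child_weight rejects more candidate splits, which is why both make the model more conservative.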

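3.4. Parameters for Linear Booster

lambda [default=0]
L2 regularization penalty coefficient.
Handles the regularization part of XGBoost. It is not used often, but it can reduce overfitting.
alpha [default=0]
L1 regularization penalty coefficient.
Can be used when the data dimension is very high, so that the algorithm runs faster.
lambda_bias
L2 regularization on the bias. The default is 0 (there is no L1 regularization on the bias, because the bias is not important under L1).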

3.5. Task Parameters

objective [ default=reg:squarederror ]
Defines the learning task and the corresponding learning objective. The available objectives include:
- “reg:squarederror”: regression with squared loss; it replaces “reg:linear” in the new version (1.1.0).
- “reg:logistic”: logistic regression.
- “binary:logistic”: logistic regression for binary classification, outputs a probability.
- “binary:logitraw”: logistic regression for binary classification, outputs the raw score wTx before the logistic transformation.
- “count:poisson”: Poisson regression for count data, outputs the mean of the Poisson distribution.
  In Poisson regression the default of max_delta_step is 0.7 (used to safeguard optimization).
- “multi:softmax”: makes XGBoost handle multiclass classification with the softmax objective; the parameter num_class (number of classes) must also be set.
- “multi:softprob”: same as softmax, but the output is a vector of ndata * nclass values that can be reshaped into a matrix with ndata rows and nclass columns. Each row holds the probabilities of the sample belonging to each class.
- “rank:pairwise”: sets XGBoost to do a ranking task by minimizing the pairwise loss.
base_score [ default=0.5 ]
The initial prediction score of all instances, global bias.
eval_metric [ default according to objective ]
Evaluation metrics for the validation data. Each objective has a default metric (rmse for regression, error for classification, mean average precision for ranking).
Several metrics can be added. Python users should pass the parameters as a list of pairs rather than a map, so that a later 'eval_metric' entry does not override an earlier one.
The choices are listed below:
- “rmse”: root mean square error
- “mae”: mean absolute error
- “logloss”: negative log-likelihood
- “error”: Binary classification error rate. It is calculated as #(wrong cases)/#(all cases). For the predictions, the evaluation will regard the instances with prediction value larger than 0.5 as positive instances, and the others as negative instances.
- “merror”: Multiclass classification error rate. It is calculated as #(wrong cases)/#(all cases).
- “mlogloss”: Multiclass logloss
- “auc”: Area under the curve for ranking evaluation.
- “ndcg”: Normalized Discounted Cumulative Gain
- “map”: Mean average precision
- “ndcg@n”, “map@n”: n can be assigned as an integer to cut off the top positions in the lists for evaluation.
- “ndcg-”, “map-”, “ndcg@n-”, “map@n-”: In XGBoost, NDCG and MAP evaluate the score of a list without any positive samples as 1. Adding “-” to the metric name makes XGBoost evaluate these scores as 0, to be consistent under some conditions.
seed [ default=0 ]
Random number seed.
Can be used to produce reproducible results (the same seed gives the same random splits on every run).
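For the regression case in this article the relevant metrics are rmse and mae. As a quick reference, the sketch below computes both with NumPy on four made-up readings; the numbers are illustrative and not taken from the tank data.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error: penalizes large errors more strongly."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error: average size of the error in the unit of the target."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

y_true = [10.0, 12.0, 14.0, 16.0]
y_pred = [11.0, 12.0, 13.0, 18.0]   # errors: 1, 0, -1, 2

rmse_value = rmse(y_true, y_pred)   # sqrt((1 + 0 + 1 + 4) / 4) = sqrt(1.5)
mae_value = mae(y_true, y_pred)     # (1 + 0 + 1 + 2) / 4 = 1.0
print('rmse:', rmse_value, 'mae:', mae_value)
```

Because of the squaring, rmse is never smaller than mae, and a large gap between the two points to a few large errors.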
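The main arguments of xgb.train are:
dtrain: the training data.
num_boost_round: the number of boosting iterations, i.e. how many base models are built.
evals: a list of the data sets to evaluate during training.
It has the form evals = [(dtrain,'train'),(dval,'val')] or evals = [(dtrain,'train')]. The first form makes it possible to watch the performance on the validation set during training.
obj: custom objective function.
feval: custom evaluation function.
maximize: whether the evaluation function should be maximized.
early_stopping_rounds: number of rounds for early stopping.
With a value of 100, training stops once the validation error has not decreased for 100 consecutive iterations. This requires at least one element in evals; if there are several, the last one is used. The returned model is the one from the last iteration, not the best one. When early_stopping_rounds is set, the model has three attributes: bst.best_score, bst.best_iteration and bst.best_ntree_limit.
evals_result: a dictionary that stores the evaluation results of the elements in the watchlist.
verbose_eval: (boolean or number) also requires at least one element in evals. If True, the evaluation results are printed at every iteration; if a number, say 5, they are printed every 5 iterations.
learning_rates: a list with the learning rate for each boosting round.
xgb_model: an xgb model to load before training.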