Wrong Common Knowledge? How XGBoost Handles Missing Values (A Correction)

Latest recommended article published on 2024-09-15 10:45:29

Alpha_GoGo

Article link: https://blog.csdn.net/Alpha_GoGo/article/details/118101429

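Here is how the story goes...

Our team has been hiring interns recently, and below is a conversation from one of the interviews.

I searched online, and sure enough, there are plenty of articles dissecting XGBoost...

Why would missing values go to the right subtree by default? What is the basis for that?

I went through every explainer I could find, but none of them gave a concrete basis or a proof, and their wording was suspiciously similar...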

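On Zhihu:

On CSDN:

To find the answer, I asked the original poster directly and got the following reply:

Let me work through this step by step...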

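Approach 1: Read the paper

I carefully read the paper by Tianqi Chen, the creator of XGBoost,

and nowhere does it say that missing values go to the right subtree by default...

Paper link: http://www.kdd.org/kdd2016/subtopic/view/xgboost-a-scalable-tree-boosting-system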
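
For what it's worth, the paper's section on sparsity-aware split finding (Section 3.4, Algorithm 3) treats the default direction as something learned per node: for each candidate split, the missing rows are tried on one side and then on the other, and the side with the larger gain wins. Below is a minimal sketch of that idea for a single feature and a single fixed threshold. The function names, the None/NaN convention for missing values, and the tie-break toward the left are my own assumptions for illustration, not XGBoost's actual implementation.

```python
import math


def split_gain(g_left, h_left, g_right, h_right, lam=1.0):
    """Gain of a split from gradient/hessian sums (the gamma penalty is omitted)."""
    g, h = g_left + g_right, h_left + h_right
    return 0.5 * (
        g_left ** 2 / (h_left + lam)
        + g_right ** 2 / (h_right + lam)
        - g ** 2 / (h + lam)
    )


def best_default_direction(values, grads, hess, threshold, lam=1.0):
    """Try the missing rows on each side of the split and keep the better side."""
    gl = hl = gr = hr = gm = hm = 0.0
    for x, g, h in zip(values, grads, hess):
        if x is None or math.isnan(x):  # missing value
            gm += g
            hm += h
        elif x < threshold:             # present, goes left
            gl += g
            hl += h
        else:                           # present, goes right
            gr += g
            hr += h
    gain_if_left = split_gain(gl + gm, hl + hm, gr, hr, lam)
    gain_if_right = split_gain(gl, hl, gr + gm, hr + hm, lam)
    # Assumption: ties are broken toward the left.
    if gain_if_left >= gain_if_right:
        return "left", gain_if_left
    return "right", gain_if_right


nan = float("nan")
values = [0.1, 0.2, 0.8, 0.9, nan, nan]
hess = [1.0] * 6
# Missing rows share the gradient sign of the rows on the right...
print(best_default_direction(values, [-1, -1, 1, 1, 1, 1], hess, threshold=0.5))
# ...or of the rows on the left.
print(best_default_direction(values, [-1, -1, 1, 1, -1, -1], hess, threshold=0.5))
```

In the first call the missing rows look like the right-hand rows, so the sketch sends them right; in the second call they look like the left-hand rows and go left. Under this reading of the paper, neither side is fixed in advance.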

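Approach 2: Read the source code

If the paper does not give the answer, the source code surely will!

I read the C++ source and found that missing values are routed to the left subtree whenever a node's DefaultLeft() flag is set, rather than always to the right!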

// Source: https://github.com/dmlc/xgboost
// include/xgboost/tree_model.h

/*! \brief index of default child when feature is missing */
inline int DefaultChild() const {
  return this->DefaultLeft() ? this->LeftChild() : this->RightChild();
}
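
To make the role of DefaultChild() concrete, here is a small Python sketch of how a prediction could walk a tree that stores a default direction in every node. The Node class, its field names, and predict_one are hypothetical and only loosely mirror the C++ accessors above; real XGBoost nodes are stored differently.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    # Hypothetical fields, named after the C++ accessors for readability.
    feature: int = -1                   # index of the feature tested here
    threshold: float = 0.0              # go left when x[feature] < threshold
    default_left: bool = True           # where a missing value is sent
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    leaf_value: Optional[float] = None  # set only on leaves

    def default_child(self) -> "Node":
        """Child used when the tested feature is missing."""
        return self.left if self.default_left else self.right


def predict_one(node: Node, x) -> float:
    """Walk from the root to a leaf, using the default child for NaN features."""
    while node.leaf_value is None:
        v = x[node.feature]
        if math.isnan(v):
            node = node.default_child()
        elif v < node.threshold:
            node = node.left
        else:
            node = node.right
    return node.leaf_value


tree = Node(feature=0, threshold=0.5, default_left=True,
            left=Node(leaf_value=-1.0), right=Node(leaf_value=1.0))
print(predict_one(tree, [float("nan")]))  # follows the root's default direction
tree.default_left = False
print(predict_one(tree, [float("nan")]))  # same input, opposite default direction
```

The point of the sketch is that the direction is a per-node flag consulted at prediction time, so whether it says left or right has to come from somewhere else, namely training.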

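With the source in hand, the next step has to be a hands-on check!

Let's run a test case and look at the result first, as shown in the figure ("missing" marks the branch taken by missing values).

In this test, the default direction for missing values is the left subtree, not the right subtree!

Here is the code: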

import matplotlib.pyplot as plt
import numpy as np
import xgboost as xgb
from xgboost import plot_tree

# Random features and random binary labels; note that none of the values are missing.
train_x = np.random.rand(100, 5)
train_y = np.random.randint(0, 2, 100)
test_x = np.random.rand(20, 5)
test_y = np.random.randint(0, 2, 20)

dtrain = xgb.DMatrix(train_x, label=train_y)
dtest = xgb.DMatrix(test_x, label=test_y)

print('*' * 25, 'Training started', '*' * 25)
model = xgb.train(
    params={
        'booster': 'gbtree',
        'objective': 'binary:logistic',
        'eval_metric': ['logloss', 'auc'],
        'max_depth': 4,
    },
    dtrain=dtrain,
    num_boost_round=1000,
    evals=[(dtrain, 'train'), (dtest, 'valid')],
    early_stopping_rounds=10,
    verbose_eval=True,
)
print('*' * 25, 'Training finished', '*' * 25)
print('*' * 25, 'Plotting trees', '*' * 25)
# best_ntree_limit was removed in XGBoost 2.0; best_iteration is 0-based, hence the + 1.
for i in range(model.best_iteration + 1):
    plot_tree(model, num_trees=i)  # plotting requires graphviz
plt.show()
print('*' * 25, 'Plotting finished', '*' * 25)

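XGBoost is widely used in both academia and industry, and it is often called a secret weapon for competitions.

This is just one small detail of XGBoost.

But I hope

that we approach every piece of knowledge with a sense of reverence.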