Do xgboost and Other Tree Models Need One-Hot Encoded Features? An Analysis

Michael_Shentu

Last modified 2023-12-19 17:19:51


First published 2018-09-30 23:29:59

Article link: https://blog.csdn.net/shenxiaoming77/article/details/82914293


Reference:

Data Preprocessing: One-Hot Encoding (CSDN blog)

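xgboost treats every input feature as numeric, so the data you pass in must already be in a numeric type. (Newer releases add optional native categorical handling via `enable_categorical`; this article concerns the classic numeric-only behavior.)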

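xgboost can handle missing values and sparse data on its own.

The question of one-hot encoding is essentially about converting categorical information into fixed-length binary indicator vectors.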

Suppose the feature is animal = {'panda', 'cat', 'dog'}

After one-hot encoding it might become

[[1,0,0],
[0,1,0],
[0,0,1]]

This is a 3×3 matrix with one row per category.
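The animal example can be reproduced with a few lines of plain Python. This is a minimal sketch: the helper name `one_hot` and the column order (panda, cat, dog) are assumptions made for illustration, not anything xgboost prescribes.

```python
def one_hot(values, categories):
    """Return one row per value, with a 1 in the column of its category."""
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return rows


# Column order is an arbitrary choice made for this illustration.
categories = ["panda", "cat", "dog"]
encoded = one_hot(["panda", "cat", "dog"], categories)
for row in encoded:
    print(row)
```

Each category becomes its own 0/1 column, which is why a tree model later sees three independent features rather than one.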

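xgboost interprets this as three separate feature variables, animal0, animal1, and animal2, which together represent animal.

As a result, the feature importance computed by get_fscore is also reported per column, so animal0 on its own might, for example, turn out to carry more weight than the others.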
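If a single importance figure for the original animal feature is wanted, the per-column counts can be summed back together. The sketch below assumes a dictionary shaped like the output of `Booster.get_fscore()` (feature name mapped to split count). The numbers, the `age` feature, and the helper `group_fscore` are hypothetical and only illustrate the bookkeeping.

```python
from collections import defaultdict


def group_fscore(fscore, prefixes):
    """Sum per-column split counts back onto the original feature names."""
    grouped = defaultdict(int)
    for name, count in fscore.items():
        # Attribute the column to the first prefix it starts with,
        # or keep its own name if it is not a one-hot column.
        owner = next((p for p in prefixes if name.startswith(p)), name)
        grouped[owner] += count
    return dict(grouped)


# Hypothetical values, shaped like {feature_name: split_count}.
fscore = {"animal0": 5, "animal1": 2, "animal2": 1, "age": 7}
grouped = group_fscore(fscore, prefixes=["animal"])
print(grouped)
```

Prefix matching is a simplification: it breaks down if two original features share a prefix, in which case an explicit column-to-feature mapping would be safer.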

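One-hot encoding is in fact not the recommended default for xgboost tree models. The topic has come up in the xgboost issue tracker, where the relevant comments read: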

I do not know what you mean by vector. xgboost treat every input feature as numerical, with support for missing values and sparsity. The decision is at the user

So if you want ordered variables, you can transform the variables into numerical levels(say age). Or if you prefer treat it as categorical variable, do one hot encoding.

Another issue makes a related point (tqchen commented on 8 May 2015):

One-hot encoding could be helpful when the number of categories are small( in level of 10 to 100). In such case one-hot encoding can discover interesting interactions like (gender=male) AND (job = teacher).

While ordering them makes it harder to be discovered(need two split on job). However, indeed there is not a unified way handling categorical features in trees, and usually what tree was really good at was ordered continuous features anyway..
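The "need two split on job" remark can be checked with a small brute-force sketch. It assumes three hypothetical job categories encoded as ordered integers, with teacher in the middle, and asks whether any single threshold split of the form `x < t` can put teacher alone on one side. The function name and the category list are illustrative assumptions.

```python
jobs = ["doctor", "teacher", "engineer"]

# Ordered (ordinal) encoding: doctor=0, teacher=1, engineer=2.
ordinal = {job: i for i, job in enumerate(jobs)}

# One-hot column for a single category: 1 for teacher, 0 otherwise.
one_hot_teacher = {job: int(job == "teacher") for job in jobs}


def single_split_isolates(values, target):
    """True if some threshold t puts exactly `target` on one side of x < t."""
    codes = sorted(set(values.values()))
    # Candidate thresholds are the midpoints between consecutive codes.
    thresholds = [(a + b) / 2 for a, b in zip(codes, codes[1:])]
    for t in thresholds:
        left = {k for k, x in values.items() if x < t}
        right = set(values) - left
        if left == {target} or right == {target}:
            return True
    return False


print(single_split_isolates(ordinal, "teacher"))
print(single_split_isolates(one_hot_teacher, "teacher"))
```

With the ordered encoding the check prints False, because a middle category needs two thresholds to be fenced off. With the one-hot column it prints True, which is what lets a tree pair (job = teacher) with another condition in fewer splits.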

In summary, the conclusions come down to roughly two points: