Why LightGBM can take categorical features directly, without one-hot encoding
The LightGBM official documentation's explanation of how categorical features are handled:
Optimal Split for Categorical Features
It is common to represent categorical features with one-hot encoding, but this approach is suboptimal for tree learners. Particularly for high-cardinality categorical features, a tree built on one-hot features tends to be unbalanced and needs to grow very deep to achieve good accuracy.
Instead of one-hot encoding, the optimal solution is to split on a categorical feature by partitioning its categories into 2 subsets. If the feature has k categories, there are 2^(k-1) - 1 possible partitions. But there is an efficient solution for regression trees[8]. It needs about O(k * log(k)) to find the optimal partition.
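To make the 2^(k-1) - 1 count concrete, here is a small illustrative sketch (not LightGBM code) that enumerates every way to split a set of categories into two non-empty subsets; fixing one category on the left side avoids counting mirror-image partitions twice:

```python
from itertools import combinations

def two_subset_partitions(categories):
    """Enumerate all partitions of a category set into two non-empty
    subsets, where the order of the two sides does not matter."""
    cats = list(categories)
    first = cats[0]       # pin one category to the left to skip mirror duplicates
    rest = cats[1:]
    partitions = []
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {first, *combo}
            right = set(cats) - left
            if right:     # both sides must be non-empty
                partitions.append((left, right))
    return partitions

parts = two_subset_partitions(["red", "yellow", "blue", "green"])
print(len(parts))  # 2^(4-1) - 1 = 7
```

With k = 4 colors this yields exactly 7 candidate partitions; the count doubles with every additional category, which is why naive enumeration is infeasible for high-cardinality features.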
The basic idea is to sort the categories according to the training objective at each split. More specifically, LightGBM sorts the histogram (for a categorical feature) according to its accumulated values (sum_gradient / sum_hessian) and then finds the best split on the sorted histogram.
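The sorted-histogram trick described above can be sketched as follows. This is a simplified illustration, not LightGBM's actual implementation: it accumulates per-category gradient and hessian sums, sorts categories by sum_gradient / sum_hessian, and then scans only the k - 1 prefix splits of that order, scoring each with the standard second-order gain formula (the regularizer `lam` is an assumed small constant added here to avoid division by zero):

```python
import numpy as np

def best_categorical_split(categories, gradients, hessians):
    """Simplified sketch of LightGBM's categorical split search:
    sort categories by sum_gradient / sum_hessian, then evaluate only
    the k-1 prefix splits instead of all 2^(k-1)-1 partitions."""
    cats = np.unique(categories)
    # per-category accumulated statistics (the "histogram")
    g = np.array([gradients[categories == c].sum() for c in cats])
    h = np.array([hessians[categories == c].sum() for c in cats])
    order = np.argsort(g / h)           # sort categories by sum_g / sum_h
    g, h, cats = g[order], h[order], cats[order]

    g_total, h_total = g.sum(), h.sum()
    lam = 1e-3                          # assumed small regularizer
    best_gain, best_left = -np.inf, None
    gl = hl = 0.0
    for i in range(len(cats) - 1):      # each prefix is a candidate left subset
        gl += g[i]; hl += h[i]
        gr, hr = g_total - gl, h_total - hl
        # second-order split gain, same general form as in GBDT libraries
        gain = gl**2 / (hl + lam) + gr**2 / (hr + lam) - g_total**2 / (h_total + lam)
        if gain > best_gain:
            best_gain, best_left = gain, set(cats[:i + 1])
    return best_left, best_gain

# Toy data: red/green samples have negative gradients, yellow/blue positive
categories = np.array(["red", "red", "yellow", "blue", "green", "green"])
gradients  = np.array([-1.0, -1.0, 1.0, 1.0, -1.0, -1.0])
hessians   = np.ones(6)
left, gain = best_categorical_split(categories, gradients, hessians)
print(left)  # groups the categories whose gradient statistics are similar
```

Because sorting dominates the cost, finding the best of these prefix splits takes about O(k log k), matching the complexity quoted in the documentation.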
A synthesis of explanations from around the web
As an example, suppose we have a color feature, and each sample's color is one of the four categories {red, yellow, blue, green}. Let's compare how one-hot encoding and LightGBM handle this feature, and see exactly what differs:
One-hot encoding
This encoding is very common: it expands the single color feature into four features, one each for red, yellow, blue, and green (for each sample exactly one of the four dimensions is 1 and the rest are 0).
When the decision tree splits a node, it can only choose one of these dimensions at a time. For example, if it splits on the yellow dimension, the split condition is simply "is this sample yellow or not".
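A minimal pure-Python sketch of what one-hot encoding does to the color feature, and why a split on any one encoded column can only ask a one-vs-rest question (the sample values here are made up for illustration):

```python
# The single "color" column becomes four 0/1 columns.
COLORS = ["red", "yellow", "blue", "green"]

def one_hot(color):
    """Encode one color as a 4-dimensional 0/1 vector."""
    return [1 if color == c else 0 for c in COLORS]

samples = ["yellow", "red", "green", "yellow"]
encoded = [one_hot(s) for s in samples]
print(encoded[0])  # yellow -> [0, 1, 0, 0]

# A tree split on the "yellow" column is exactly the test color == "yellow":
is_yellow_split = [row[COLORS.index("yellow")] == 1 for row in encoded]
print(is_yellow_split)  # [True, False, False, True]
```

No combination like "red or green vs. the rest" can be expressed in a single split on these columns; the tree has to stack several one-vs-rest splits to approximate it, which is what drives the deep, unbalanced trees the documentation warns about.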
Looking back at the original color feature, we can see that this is really just a one-vs-rest split strategy: for the original color feature, we only have four split strategies to choose from (