Why LightGBM can take categorical features directly, without one-hot encoding
The LightGBM official documentation's explanation of how categorical features are handled:
Optimal Split for Categorical Features
It is common to represent categorical features with one-hot encoding, but this approach is suboptimal for tree learners. Particularly for high-cardinality categorical features, a tree built on one-hot features tends to be unbalanced and needs to grow very deep to achieve good accuracy.
Instead of one-hot encoding, the optimal solution is to split on a categorical feature by partitioning its categories into 2 subsets. If the feature has k categories, there are 2^(k-1) - 1 possible partitions. But there is an efficient solution for regression trees[8]. It needs about O(k * log(k)) to find the optimal partition.
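To make the 2^(k-1) - 1 count concrete, here is a small illustrative sketch (not LightGBM code) that enumerates every way to split a set of categories into two non-empty subsets; fixing one category on the left side avoids counting mirror-image partitions twice:

```python
from itertools import combinations

def two_subset_partitions(categories):
    """Enumerate all partitions of a category set into two non-empty
    subsets, where the order of the two sides does not matter."""
    cats = list(categories)
    first = cats[0]       # pin one category to the left to skip mirror duplicates
    rest = cats[1:]
    partitions = []
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {first, *combo}
            right = set(cats) - left
            if right:     # both sides must be non-empty
                partitions.append((left, right))
    return partitions

parts = two_subset_partitions(["red", "yellow", "blue", "green"])
print(len(parts))  # 2^(4-1) - 1 = 7
```

With k = 4 colors this yields exactly 7 candidate partitions; the count doubles with every additional category, which is why naive enumeration is infeasible for high-cardinality features.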
The basic idea is to sort the categories according to the training objective at each split. More specifically, LightGBM sorts the histogram (for a categorical feature) according to its accumulated values (sum_gradient / sum_hessian) and then finds the best split on the sorted histogram.
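The sorted-histogram trick described above can be sketched as follows. This is a simplified illustration, not LightGBM's actual implementation: it accumulates per-category gradient and hessian sums, sorts categories by sum_gradient / sum_hessian, and then scans only the k - 1 prefix splits of that order, scoring each with the standard second-order gain formula (the regularizer `lam` is an assumed small constant added here to avoid division by zero):

```python
import numpy as np

def best_categorical_split(categories, gradients, hessians):
    """Simplified sketch of LightGBM's categorical split search:
    sort categories by sum_gradient / sum_hessian, then evaluate only
    the k-1 prefix splits instead of all 2^(k-1)-1 partitions."""
    cats = np.unique(categories)
    # per-category accumulated statistics (the "histogram")
    g = np.array([gradients[categories == c].sum() for c in cats])
    h = np.array([hessians[categories == c].sum() for c in cats])
    order = np.argsort(g / h)           # sort categories by sum_g / sum_h
    g, h, cats = g[order], h[order], cats[order]

    g_total, h_total = g.sum(), h.sum()
    lam = 1e-3                          # assumed small regularizer
    best_gain, best_left = -np.inf, None
    gl = hl = 0.0
    for i in range(len(cats) - 1):      # each prefix is a candidate left subset
        gl += g[i]; hl += h[i]
        gr, hr = g_total - gl, h_total - hl
        # second-order split gain, same general form as in GBDT libraries
        gain = gl**2 / (hl + lam) + gr**2 / (hr + lam) - g_total**2 / (h_total + lam)
        if gain > best_gain:
            best_gain, best_left = gain, set(cats[:i + 1])
    return best_left, best_gain

# Toy data: red/green samples have negative gradients, yellow/blue positive
categories = np.array(["red", "red", "yellow", "blue", "green", "green"])
gradients  = np.array([-1.0, -1.0, 1.0, 1.0, -1.0, -1.0])
hessians   = np.ones(6)
left, gain = best_categorical_split(categories, gradients, hessians)
print(left)  # groups the categories whose gradient statistics are similar
```

Because sorting dominates the cost, finding the best of these prefix splits takes about O(k log k), matching the complexity quoted in the documentation.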
A synthesis of explanations from around the web
As an example, suppose we have a color feature, and each sample's color is one of the four categories {red, yellow, blue, green}. Let's compare how one-hot encoding and LightGBM handle this feature, and see exactly what differs:
One-hot encoding
This encoding is very common: it expands the single color feature into four features, one each for red, yellow, blue, and green (for each sample exactly one of the four dimensions is 1 and the rest are 0).
When the decision tree splits a node, it can only choose one of these dimensions at a time. For example, if it splits on the yellow dimension, the split condition is simply "is this sample yellow or not".
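A minimal pure-Python sketch of what one-hot encoding does to the color feature, and why a split on any one encoded column can only ask a one-vs-rest question (the sample values here are made up for illustration):

```python
# The single "color" column becomes four 0/1 columns.
COLORS = ["red", "yellow", "blue", "green"]

def one_hot(color):
    """Encode one color as a 4-dimensional 0/1 vector."""
    return [1 if color == c else 0 for c in COLORS]

samples = ["yellow", "red", "green", "yellow"]
encoded = [one_hot(s) for s in samples]
print(encoded[0])  # yellow -> [0, 1, 0, 0]

# A tree split on the "yellow" column is exactly the test color == "yellow":
is_yellow_split = [row[COLORS.index("yellow")] == 1 for row in encoded]
print(is_yellow_split)  # [True, False, False, True]
```

No combination like "red or green vs. the rest" can be expressed in a single split on these columns; the tree has to stack several one-vs-rest splits to approximate it, which is what drives the deep, unbalanced trees the documentation warns about.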
Looking back at the original color feature, we can see that this is really just a one-vs-rest split strategy: for the original color feature, we only have four split strategies to choose from (