This book uses CART trees as its example.
1. How a CART tree splits a node
1. For regression problems, the split point is chosen to minimize the variance of y. The variance tells us how much the y values in a node are spread around their mean value.
2. For classification problems, the split point is chosen to minimize the Gini index of y. The Gini index tells us how "impure" a node is: if all classes occur with the same frequency, the node is maximally impure; if only one class is present, it is maximally pure.
For a continuous numeric feature, candidate splits are threshold values; for a categorical feature, the tree tries combinations of the categories within that single feature.
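The regression case above can be sketched as a small split-search: for each candidate threshold (a midpoint between consecutive sorted feature values), measure how much the split reduces the weighted variance of y. This is a minimal sketch of the idea, not the actual CART implementation; the function names are my own.

```python
import numpy as np

def variance_reduction(y, y_left, y_right):
    """Weighted decrease in variance achieved by a candidate split."""
    n = len(y)
    return (np.var(y)
            - (len(y_left) / n) * np.var(y_left)
            - (len(y_right) / n) * np.var(y_right))

def best_split(x, y):
    """Scan midpoints between consecutive sorted unique values of a
    continuous feature; return the threshold with the largest gain."""
    vals = np.unique(x)
    thresholds = (vals[:-1] + vals[1:]) / 2
    best_t, best_gain = None, -np.inf
    for t in thresholds:
        left, right = y[x <= t], y[x > t]
        gain = variance_reduction(y, left, right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain
```

For a clearly bimodal toy sample, the search lands between the two clusters and the gain equals the full variance, since each side becomes perfectly pure.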
2. Interpretation
2.1 Feature importance
In a decision tree, the importance of a feature is computed by traversing all split points that use it and summing how much each split reduces the variance (or the Gini index) relative to its parent node. The importances of all features sum to 100, which means each feature's importance can be read as a percentage of the whole model's importance.
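A quick check of this with scikit-learn (on hypothetical toy data): `feature_importances_` is normalized to sum to 1, so multiplying by 100 gives the percentage view described above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = 5 * X[:, 0] + 0.1 * rng.rand(200)  # only feature 0 carries signal

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
pct = tree.feature_importances_ * 100  # importances as percentages
print(pct.sum())  # importances are normalized, so this is 100
```

Because feature 0 dominates the target, nearly all of the importance mass lands on it.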
2.2 Tree decomposition
A single prediction can be decomposed along the decision path: start from the mean of y at the root (the bias), and each split on the path adds a contribution attributable to the feature it splits on, so prediction = bias + sum of feature contributions.
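A minimal sketch of tree decomposition, assuming scikit-learn's internal `tree_` arrays (`children_left`, `feature`, `threshold`, `value`): walk the decision path for one sample and attribute each change in the node mean to the feature that was split on. The function name `decompose` is my own.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def decompose(tree, x):
    """Decompose one prediction into bias + per-feature contributions
    by walking the decision path and recording how the node mean changes."""
    t = tree.tree_
    node = 0
    bias = t.value[0][0][0]            # mean of y at the root node
    contrib = np.zeros(x.shape[0])
    while t.children_left[node] != -1:  # -1 marks a leaf
        f = t.feature[node]
        if x[f] <= t.threshold[node]:
            nxt = t.children_left[node]
        else:
            nxt = t.children_right[node]
        # credit the mean shift caused by this split to feature f
        contrib[f] += t.value[nxt][0][0] - t.value[node][0][0]
        node = nxt
    return bias, contrib

rng = np.random.RandomState(0)
X = rng.rand(100, 2)
y = 3 * X[:, 0] + X[:, 1]
reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

bias, contrib = decompose(reg, X[0])
# The decomposition reconstructs the prediction exactly:
# reg.predict(X[:1])[0] == bias + contrib.sum()
```

This is the same idea the treeinterpreter package implements for trees and forests.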
2.3 Pros and cons
Pros: can capture interactions between features, is easy to visualize, has good interpretability, etc.
There is no need to transform features. In linear models, it is sometimes necessary to take the logarithm of a feature. A decision tree works equally well with any monotonic transformation of a feature.
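The invariance to monotonic transformations can be demonstrated directly (a small sketch on synthetic data): fitting a tree on a feature and on its logarithm should yield identical predictions, because the log preserves the ordering of the values and therefore the available splits.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(100, 1) + 0.1          # keep values positive so log is defined
y = np.sin(5 * X[:, 0])

raw = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
logged = DecisionTreeRegressor(max_depth=4, random_state=0).fit(np.log(X), y)

# log is monotonic, so both trees partition the samples identically
same = np.allclose(raw.predict(X), logged.predict(np.log(X)))
print(same)
```

A linear model, by contrast, would fit these two representations very differently.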
Cons:
Cannot model linear relationships well; lacks smoothness (slight changes in the input feature can have a big impact on the predicted outcome, which is usually not desirable, because the prediction is a step function that is sensitive to the split points); unstable (small changes in the training data can produce a very different tree).
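The lack of smoothness is easy to see on a toy example: a depth-1 tree places one split, and two inputs that differ only slightly but straddle the threshold get very different predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([1.0, 1.0, 5.0, 5.0])

tree = DecisionTreeRegressor(max_depth=1).fit(X, y)  # single split at 2.5
# Two nearly identical inputs on opposite sides of the threshold:
preds = tree.predict([[2.49], [2.51]])
print(preds)  # a tiny input change flips the output from 1.0 to 5.0
```

This step-function behavior is exactly why small perturbations near a split point can flip a prediction entirely.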