Decision Trees: An Overview

Latest recommended article published on 2024-07-13 17:34:46

luv_dusk


Tags: decision tree

Article link: https://blog.csdn.net/weixin_43269174/article/details/92065764


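I. Concepts

Algorithm	Characteristics
ID3	Measures impurity reduction with information gain; handles discrete features; classification only; each node may have multiple branches
C4.5	Measures impurity reduction with the information gain ratio; handles discrete and continuous features; classification only; each node may have multiple branches
CART	Measures impurity with the Gini index; handles discrete and continuous features; classification and regression; each node has exactly two branches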

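In information theory, the disorder of a set of samples is measured by its entropy. Suppose the sample set $X$ contains $N$ classes and $p_j$ is the fraction of samples that belong to class $j$. Then

$$H(X)=-\sum_{j=1}^N p_j \log p_j$$

Information gain is the difference in entropy before and after the data in a node are partitioned into subsets $X_i$. It measures how much the impurity decreases:

$$info\_gain=H(X)-\sum_i \frac{|X_i|}{|X|}H(X_i)$$

The weakness of information gain is easy to see. When a feature takes many distinct values (a person's name, for example), each value may correspond to a single record, so splitting on that feature yields a very large information gain even though the resulting tree generalizes very poorly. For this reason ID3, which relies on information gain, is suited only to discrete features with few distinct values. The information gain ratio corrects for this by normalizing the information gain:

$$info\_gain\_ratio=\frac{info\_gain}{H(A)}$$

where $H(A)=-\sum_i \frac{|X_i|}{|X|}\log\frac{|X_i|}{|X|}$ is the entropy of the values of attribute $A$, which is large when $A$ splits the data into many small subsets.

The Gini index is an alternative impurity measure to entropy:

$$Gini(X)=1-\sum_{j=1}^Np_j^2$$

In CART we want to maximize the Gini gain of a split:

$$Gini\_gain=Gini(X)-\sum_i\frac{|X_i|}{|X|}Gini(X_i)$$

Many other impurity measures have been proposed in the literature; they are not covered here.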
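The measures defined above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper names (`entropy`, `gini`, `partition`, `impurity_gain`, `info_gain_ratio`) and the toy data are assumptions made for this example, and logarithms are taken base 2.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(X) = -sum_j p_j * log2(p_j) over the classes present in labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini(X) = 1 - sum_j p_j^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def partition(values, labels):
    """Group the labels by the attribute value of each sample."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return list(groups.values())

def impurity_gain(values, labels, impurity):
    """impurity(X) - sum_i |X_i|/|X| * impurity(X_i); pass entropy or gini."""
    n = len(labels)
    weighted = sum(len(g) / n * impurity(g) for g in partition(values, labels))
    return impurity(labels) - weighted

def info_gain_ratio(values, labels):
    """Information gain divided by H(A), the entropy of the attribute values."""
    split_info = entropy(values)
    if split_info == 0:          # attribute has a single value: no usable split
        return 0.0
    return impurity_gain(values, labels, entropy) / split_info

# Hypothetical toy data: "name" is unique per record, "weather" is not.
labels = ["yes", "yes", "no", "no"]
name = ["ann", "bob", "cat", "dan"]
weather = ["sun", "sun", "rain", "rain"]

gain_name = impurity_gain(name, labels, entropy)
gain_weather = impurity_gain(weather, labels, entropy)
ratio_name = info_gain_ratio(name, labels)
ratio_weather = info_gain_ratio(weather, labels)
```

On this toy data both attributes reach the same information gain, so information gain alone cannot tell the ID-like `name` column from the useful `weather` column, while the gain ratio of `name` is halved by its larger $H(A)$.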

II. Algorithms

ID3

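ID3 is the simplest of the decision tree algorithms discussed here. The pseudocode below includes a pre-pruning step (stop when the information gain is too small, or when performance on a validation set no longer improves); the same step applies to C4.5 and CART. The drawback of using information gain as the splitting criterion was described above.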

Algorithm ID3(Node):
    Input: Object Node containing sample data.
    Output: N/A.
    if the depth exceeds the specified maximum depth then label Node with its majority class y and terminate the branch
    if all samples in Node belong to the same class y then label Node with y and terminate the branch
    if no unused attribute remains then label Node with the class y that has the most samples and terminate the branch
    for each unused attribute A do
        calculate the information gain of splitting on A
    select the attribute A* that maximizes information gain as the branching attribute at Node
    call pruning()  # pre-pruning: terminate the branch early, e.g. when the information gain is too small
    segment the samples of Node into M subsets based on their values of A*
    for each subset Di do
        initialize child node Node_i and feed Di to it
        recursively call ID3(Node_i)
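The ID3 pseudocode can be turned into a short runnable sketch. The version below is one possible reading of it, under stated assumptions: samples are dictionaries of discrete attributes, the tree is a nested `dict`, pre-pruning is reduced to a single `min_gain` threshold, and the names `id3`, `predict`, `min_gain` and the toy data are all hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Entropy of labels minus the weighted entropy after splitting on attr."""
    n = len(labels)
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append(y)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())

def id3(rows, labels, attrs, depth=0, max_depth=5, min_gain=1e-6):
    """Return a class label (leaf) or a dict describing an internal node."""
    majority = Counter(labels).most_common(1)[0][0]
    # Stopping rules: depth limit, pure node, or no unused attribute left.
    if depth >= max_depth or len(set(labels)) == 1 or not attrs:
        return majority
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    # Pre-pruning (assumed rule): stop when the best gain is too small.
    if info_gain(rows, labels, best) < min_gain:
        return majority
    node = {"attr": best, "default": majority, "children": {}}
    remaining = [a for a in attrs if a != best]
    for value in sorted({row[best] for row in rows}):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node["children"][value] = id3(
            [rows[i] for i in idx], [labels[i] for i in idx],
            remaining, depth + 1, max_depth, min_gain)
    return node

def predict(tree, row):
    """Walk the tree; unseen attribute values fall back to the node's majority class."""
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["attr"]], tree["default"])
    return tree

# Hypothetical toy data set.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
    {"outlook": "rainy", "windy": "yes"},
    {"outlook": "overcast", "windy": "yes"},
]
labels = ["play", "play", "play", "stay", "stay", "play"]
tree = id3(rows, labels, ["outlook", "windy"])
```

Here the root splits on `outlook`, and only the `rainy` branch needs a second split on `windy`; each attribute is used at most once along a path, as in the pseudocode.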

C4.5

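To keep feature selection from favoring features with many distinct values, as happens in ID3, C4.5 uses the information gain ratio as its splitting criterion. C4.5 also adds a bisection procedure for continuous features: the feature values are sorted, the midpoints of adjacent values form a list of candidate thresholds, and the best split point is chosen from that list.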

Algorithm C4.5(Node):
    Input: Object Node containing sample data.
    Output: N/A.
    if the depth exceeds the specified maximum depth then label Node with its majority class y and terminate the branch
    if all samples in Node belong to the same class y then label Node with y and terminate the branch
    if no unused attribute remains then label Node with the class y that has the most samples and terminate the branch
    for each unused attribute A do
        if A is discrete then
            calculate the information gain ratio of splitting on A
        else
            select the threshold on A that maximizes the information gain ratio
    select the attribute A* that maximizes information gain ratio as the branching attribute at Node
    call pruning()  # pre-pruning: terminate the branch early, e.g. when the information gain ratio is too small
    segment the samples of Node into M subsets based on the branching rule
    for each subset Di do
        initialize child node Node_i and feed Di to it
        recursively call C4.5(Node_i)
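The bisection step for continuous features can be illustrated as follows. This sketch follows the description in this article (midpoints of adjacent sorted values as candidates, scored by information gain ratio); published C4.5 implementations differ in some details, and the function name `best_threshold` and the sample data are assumptions for this example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, gain_ratio) for the best binary split 'value <= t'.

    Returns (None, 0.0) when the feature has fewer than two distinct values
    or no candidate yields a positive gain ratio.
    """
    n = len(labels)
    base = entropy(labels)
    distinct = sorted(set(values))
    best_t, best_ratio = None, 0.0
    # Candidate thresholds: midpoints between adjacent distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        gain = base - (len(left) / n * entropy(left)
                       + len(right) / n * entropy(right))
        # H(A) for the two-valued attribute "value <= t"; never zero here
        # because both sides of a midpoint split are non-empty.
        split_info = entropy([v <= t for v in values])
        ratio = gain / split_info
        if ratio > best_ratio:
            best_t, best_ratio = t, ratio
    return best_t, best_ratio

# Hypothetical data: the classes separate cleanly between 3.0 and 8.0.
values = [1.0, 2.0, 3.0, 8.0, 9.0, 10.0]
labels = ["a", "a", "a", "b", "b", "b"]
threshold, ratio = best_threshold(values, labels)
```

For this data the candidates are 1.5, 2.5, 5.5, 8.5 and 9.5, and the midpoint 5.5 separates the two classes perfectly.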

CART

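Compared with C4.5, CART also applies binary splitting to discrete features, which constrains the tree to be binary, and it adds support for regression tasks.
For classification, CART chooses the best split point $\rho$ by Gini gain (the procedure is similar to that of C4.5):

$$\rho^*=\arg\max_\rho \big[Gini(X)-\sum_i\frac{|X_i|}{|X|}Gini(X_i)\big]$$

For regression, CART uses least squares instead:

$$\rho^*=\arg\min_\rho \big[\sum_{x_i<\rho}(y_i-\bar{y}_{x_i<\rho})^2+\sum_{x_i\ge\rho}(y_i-\bar{y}_{x_i \ge\rho})^2\big]$$

where $\bar{y}_{x_i<\rho}$ and $\bar{y}_{x_i\ge\rho}$ are the mean target values on each side of the threshold.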
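Both CART criteria can be expressed as a search over candidate thresholds with a pluggable cost. The sketch below is an illustration under assumptions: candidates are midpoints of adjacent distinct values, and the names `best_split`, `sse_cost` and `gini_cost` are hypothetical. Because $Gini(X)$ does not depend on $\rho$, maximizing the Gini gain is the same as minimizing the weighted Gini of the two children, which is what `gini_cost` computes.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def sse(ys):
    """Sum of squared deviations from the mean of ys."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def sse_cost(left, right):
    """Regression criterion: total squared error of the two children."""
    return sse(left) + sse(right)

def gini_cost(left, right):
    """Classification criterion: size-weighted Gini of the two children."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_split(xs, ys, cost):
    """Return (rho, cost) minimizing cost over the split x < rho / x >= rho."""
    distinct = sorted(set(xs))
    best_rho, best_cost = None, float("inf")
    for lo, hi in zip(distinct, distinct[1:]):
        rho = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x < rho]
        right = [y for x, y in zip(xs, ys) if x >= rho]
        c = cost(left, right)
        if c < best_cost:
            best_rho, best_cost = rho, c
    return best_rho, best_cost

# Hypothetical data for one feature.
xs = [1, 2, 3, 10, 11, 12]
targets = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]       # regression targets
classes = ["a", "a", "a", "b", "b", "b"]        # class labels

rho_reg, cost_reg = best_split(xs, targets, sse_cost)
rho_clf, cost_clf = best_split(xs, classes, gini_cost)
```

Passing the criterion as a function keeps the threshold search identical for classification and regression, which mirrors how the CART pseudocode treats both tasks with one procedure.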

Algorithm CART(Node):
    Input: Object Node containing sample data.
    Output: N/A.
    if the depth exceeds the specified maximum depth then terminate the branch
    if all samples in Node belong to the same class, or their target values span a range smaller than required, then terminate the branch
    if no unused attribute remains then terminate the branch
    for each unused attribute A do
        if A is discrete then
            select the value of A that maximizes Gini gain or minimizes the sum of squared errors
        else
            select the threshold on A that maximizes Gini gain or minimizes the sum of squared errors
    select the attribute A* that maximizes Gini gain or minimizes the sum of squared errors as the branching attribute at Node
    call pruning()  # pre-pruning: terminate the branch early, e.g. when the Gini gain is too small
    segment the samples of Node into two subsets based on the branching rule
    for each subset Di (i=1,2) do
        initialize child node Node_i and feed Di to it
        recursively call CART(Node_i)

III. Pruning

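To prevent a decision tree from overfitting, two approaches are commonly used: pre-pruning and post-pruning. Pre-pruning sets early-stopping conditions that are checked while the branches are being grown; this is the pruning call in the pseudocode above. Common conditions include "the impurity decreases by less than a threshold" and "performance on the validation set can no longer be improved". Post-pruning trims the tree after it has been fully grown, and again there are two usual approaches: "use a validation set to find nodes that do not improve accuracy" and "apply the idea of regularization and define a new loss function that combines sample impurity with model complexity". Taking CART classification as an example, the loss function in the second approach takes the form

$$L=\sum_i\frac{|X_i|}{|X|}Gini(X_i)+\alpha|N|$$

where $\alpha$ is the penalty factor (the larger it is, the more heavily model complexity is penalized) and $|N|$ is the number of descendant nodes below the node in question. If the loss before pruning is greater than the loss after pruning, the node is pruned.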
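The regularized pruning test can be sketched as below. This is one interpretation of the loss above, with loudly stated assumptions: the sum runs over the leaves $X_i$ under the node, $|N|$ counts every descendant node, and a pruned node becomes a single leaf with loss $Gini(X)$ and no descendants. The tree representation (a leaf is a list of labels, an internal node is a `dict` with a `"children"` list) and all function names are hypothetical.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def leaves(node):
    """Return the label lists of all leaves under node."""
    if isinstance(node, list):
        return [node]
    return [leaf for child in node["children"] for leaf in leaves(child)]

def count_descendants(node):
    """Number of nodes strictly below node (|N| in the loss)."""
    if isinstance(node, list):
        return 0
    return sum(1 + count_descendants(child) for child in node["children"])

def subtree_loss(node, alpha):
    """Weighted leaf Gini plus alpha times the number of descendants."""
    leaf_sets = leaves(node)
    n = sum(len(leaf) for leaf in leaf_sets)
    impurity = sum(len(leaf) / n * gini(leaf) for leaf in leaf_sets)
    return impurity + alpha * count_descendants(node)

def should_prune(node, alpha):
    """Prune when the loss before pruning exceeds the loss after pruning."""
    merged = [y for leaf in leaves(node) for y in leaf]
    loss_after = gini(merged)        # node collapsed into one leaf, |N| = 0
    return subtree_loss(node, alpha) > loss_after

# Hypothetical subtree: one split that only mildly reduces impurity.
node = {"children": [["a", "a", "a", "b"], ["b", "b", "b", "a"]]}
keep = should_prune(node, alpha=0.01)
drop = should_prune(node, alpha=0.1)
```

In this example the split lowers the weighted Gini from 0.5 to 0.375, so it survives a small penalty but is pruned once $\alpha$ exceeds 0.0625.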
