Machine Learning Techniques: Decision Tree

Latest recommended article published on 2024-01-29 00:15:00

FlameAlpha


Article link: https://blog.csdn.net/Flame_alone/article/details/105734298


$\begin{array} { c | | c | c } \text { aggregation type } & \text { blending } & \text { learning } \\ \hline \hline \text { uniform } & \text { voting/averaging } & \text { Bagging } \\ \hline \text { non-uniform } & \text { linear } & \text { AdaBoost } \\ \hline \text { conditional } & \text { stacking } & \text { Decision Tree } \end{array}$

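An earlier post introduced blending, which is really a meta-algorithm: it only combines hypotheses that have already been obtained.
Bagging, by contrast, is a form of learning: the hypotheses are obtained during the aggregation process itself. Bagging and AdaBoost differ in that the former aggregates uniformly and the latter does not. Stacking feeds the outputs of an existing set of hypotheses as features into another machine learning model and trains an aggregation model on them; that model can be something as simple as PLA. A decision tree differs from stacking in that the $g_t$ do not need to be trained in advance; they are obtained during learning.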

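The figure below sketches a decision tree for deciding whether to watch a video:

[Figure: decision tree for deciding whether to watch a video]
The decision tree can be written mathematically as: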

$G ( \mathbf { x } ) = \sum _ { t = 1 } ^ { T } q _ { t } ( \mathbf { x } ) \cdot g _ { t } ( \mathbf { x } )$

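Here $g_t(\mathbf{x})$ is the base hypothesis at the leaf that ends path $t$ (itself a trivial decision tree), and $q_t(\mathbf{x})$ indicates whether $\mathbf{x}$ lies on path $t$ of $G$. Seen this way, a decision tree is a human-mimicking model.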
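
As a minimal sketch of the path view, assume a hypothetical two-level tree with three paths. The feature names `has_time` and `is_interesting` and the leaf values are made up for illustration and are not taken from the figure. For any input exactly one $q_t(\mathbf{x})$ equals 1, so the sum simply picks out one leaf.

```python
# Path view: G(x) = sum_t q_t(x) * g_t(x).
# The tree below is hypothetical: features and leaf values are illustrative only.

def G(x):
    # Each entry pairs a path condition q_t with a constant leaf value g_t.
    paths = [
        (lambda x: not x["has_time"], -1),                          # path 1
        (lambda x: x["has_time"] and x["is_interesting"], +1),      # path 2
        (lambda x: x["has_time"] and not x["is_interesting"], -1),  # path 3
    ]
    # q_t(x) is 1 on exactly one path, so the sum returns that path's leaf value.
    return sum(int(q(x)) * g for q, g in paths)
```

Calling `G({"has_time": True, "is_interesting": True})` follows path 2 and returns `1`.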

From the recursive (tree) point of view:

$G ( \mathbf { x } ) = \sum _ { c = 1 } ^ { C } \left[ \kern-0.15em \left[ b ( \mathbf { x } ) = c \right]\kern-0.15em \right]\cdot G _ { c } ( \mathbf { x } )$

where

$G (\mathbf{x})$ : full-tree hypothesis (the whole tree rooted at the current node)
$b (\mathbf{x})$ : branching criteria (decides which branch $\mathbf{x}$ takes)
$G_c(\mathbf{x})$ : sub-tree hypothesis at the $c$-th branch

The training procedure of a decision tree is then:

$\begin{array} { l } \text { function Decision Tree } \left( \text { data } \mathcal { D } = \left\{ \left( \mathbf { x } _ { n } , y _ { n } \right) \right\} _ { n = 1 } ^ { N } \right) \\ \text { if termination criteria met } \\ \text { return base hypothesis } g _ { t } ( \mathbf { x } ) \\ \text { else } \\ \qquad \begin{array} { l } \text { learn branching criteria } b ( \mathbf { x } ) \\ \text { split } \mathcal { D } \text { to } \text {C parts } \mathcal { D } _ { c } = \left\{ \left( \mathbf { x } _ { n } , y _ { n } \right) : b \left( \mathbf { x } _ { n } \right) = c \right\} \\ \text { build sub-tree } G _ { c } \leftarrow \text { Decision Tree } \left( \mathcal { D } _ { c } \right) \\ \text { return } G ( \mathbf { x } ) = \sum _ { c = 1 } ^ { C } \left[ \kern-0.15em \left[ b ( \mathbf { x } ) = c \right]\kern-0.15em \right]\cdot G _ { c } ( \mathbf { x } ) \end{array} \end{array}$

The training procedure makes it clear that four choices have to be made:

number of branches $C$
branching criteria $b ( \mathbf { x } )$
termination criteria
base hypothesis $g_t( \mathbf { x } )$

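Because each of these can be chosen in many ways, there are many ways to implement a decision tree. The model rests largely on heuristics devised by earlier practitioners, yet it works well in practice (decision tree: mostly heuristic but useful on its own).

The rest of this post covers one widely used decision tree model: Classification and Regression Tree (C&RT).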

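The C&RT Model

Four Choices

C&RT settles the four choices above as follows:

The number of branches is 2 (a binary tree); each split is made by a decision stump, which is how $b ( \mathbf { x } )$ is implemented.
The branching criteria $b ( \mathbf { x } )$, i.e. which stump to split with, is chosen by how "pure" the two resulting parts are: compute the impurity of each part, weight it by the size of that part, and pick the decision stump with the smallest total.
$b ( \mathbf { x } ) = \underset { \text { decision stumps } h ( \mathbf { x } ) } { \operatorname { argmin } } \sum _ { c = 1 } ^ { 2 } | \mathcal { D } _ { c } \text { with } h | \cdot \text { impurity } \left( \mathcal { D } _ { c } \text { with } h \right)$
The base hypothesis $g_t$ is a constant.
The termination criteria is that no further branching is possible: either all $y_n$ are the same (the impurity is zero) or all $\mathbf{x}_n$ are the same (no decision stump can separate them).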
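
The branching step can be sketched as an exhaustive search over decision stumps. The following is a minimal sketch, assuming numerical features, NumPy arrays, Gini impurity for classification, and candidate thresholds at the midpoints between consecutive distinct feature values; the names `gini` and `best_stump` are illustrative choices, not part of C&RT itself.

```python
import numpy as np

def gini(y):
    """Gini impurity: 1 - sum_k (fraction of class k)^2."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return float(1.0 - np.sum(p ** 2))

def best_stump(X, y, impurity=gini):
    """Return (feature index, threshold, score) minimizing
    sum_c |D_c| * impurity(D_c) over all stumps of the form x_i <= theta."""
    best = (None, None, np.inf)
    for i in range(X.shape[1]):
        values = np.unique(X[:, i])  # sorted distinct values of feature i
        # Candidate thresholds: midpoints between consecutive distinct values.
        for theta in (values[:-1] + values[1:]) / 2:
            left = X[:, i] <= theta
            score = (left.sum() * impurity(y[left])
                     + (~left).sum() * impurity(y[~left]))
            if score < best[2]:
                best = (i, theta, score)
    return best

# Tiny illustrative dataset: feature 0 separates the two classes perfectly.
X = np.array([[1.0, 5.0], [2.0, 3.0], [3.0, 4.0], [4.0, 1.0]])
y = np.array([0, 0, 1, 1])
i, theta, score = best_stump(X, y)
```

On this toy data the search selects feature 0 with threshold 2.5, where both parts are pure and the weighted impurity is 0.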

Detailed Implementation

The concrete C&RT procedure:

$\begin{array} { l } \text { function Decision Tree } \left( \text { data } \mathcal { D } = \left\{ \left( \mathbf { x } _ { n } , y _ { n } \right) \right\} _ { n = 1 } ^ { N } \right) \\ \text { if cannot branch anymore } \\ \qquad \text { return } g _ { t } ( \mathbf { x } ) = E _ { \text {in } } \text { -optimal constant } \\ \text { else learn branching criteria } \\ \qquad \begin{aligned} & b ( \mathbf { x } ) = \underset { \text { decision stumps } h ( \mathbf { x } ) } { \operatorname { argmin } } \sum _ { c = 1 } ^ { 2 } | \mathcal { D } _ { c } \text { with } h | \cdot \text { impurity } \left( \mathcal { D } _ { c } \text { with } h \right) \\ & \text { split } \mathcal { D } \text { to } 2 \text { parts } \mathcal { D } _ { c } = \left\{ \left( \mathbf { x } _ { n } , y _ { n } \right) : b \left( \mathbf { x } _ { n } \right) = c \right\} \\ & \text { build sub-tree } G _ { c } \leftarrow \text { Decision Tree } \left( \mathcal { D } _ { c } \right) \\ & \text { return } G ( \mathbf { x } ) = \sum _ { c = 1 } ^ { 2 } \left[ \kern-0.15em \left[ b ( \mathbf { x } ) = c \right]\kern-0.15em \right]\cdot G _ { c } ( \mathbf { x } ) \end{aligned} \end{array}$
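
The recursion can be sketched in a few lines of Python. This is a minimal, unpruned sketch for regression, assuming numerical features, squared-error impurity, the mean as the $E_{in}$-optimal constant, and a nested-dict tree representation; the function names and the dict layout are assumptions made for illustration.

```python
import numpy as np

def sq_impurity(y):
    """Regression impurity: mean squared deviation from the mean."""
    return float(np.mean((y - y.mean()) ** 2)) if len(y) else 0.0

def learn_branch(X, y):
    """Search all stumps x_i <= theta; return the (i, theta) with the lowest
    size-weighted impurity, or None if no feature can be split."""
    best, best_score = None, np.inf
    for i in range(X.shape[1]):
        values = np.unique(X[:, i])
        for theta in (values[:-1] + values[1:]) / 2:
            left = X[:, i] <= theta
            score = (left.sum() * sq_impurity(y[left])
                     + (~left).sum() * sq_impurity(y[~left]))
            if score < best_score:
                best, best_score = (i, theta), score
    return best

def build_tree(X, y):
    """Recursive C&RT: leaves hold the E_in-optimal constant (the mean)."""
    # Termination: all y_n equal, or all x_n equal (no stump exists).
    branch = None if np.all(y == y[0]) else learn_branch(X, y)
    if branch is None:
        return {"leaf": float(y.mean())}
    i, theta = branch
    left = X[:, i] <= theta
    return {"feature": i, "theta": theta,
            "left": build_tree(X[left], y[left]),
            "right": build_tree(X[~left], y[~left])}

def predict(tree, x):
    """Follow the branch that x falls into until a leaf is reached."""
    while "leaf" not in tree:
        tree = tree["left"] if x[tree["feature"]] <= tree["theta"] else tree["right"]
    return tree["leaf"]

# Toy data: a step function of a single feature.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([1.0, 1.0, 5.0, 5.0])
tree = build_tree(X, y)
```

On the toy data the root splits at 2.5 and both children become leaves, so `predict(tree, np.array([1.5]))` gives 1.0 and `predict(tree, np.array([3.5]))` gives 5.0.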

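This decision tree model handles regression, binary classification, and multiclass classification with ease.

Impurity Functions

Regression error

Used for regression, where it is the usual choice.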

$\begin{array} { l } \text { impurity } ( \mathcal { D } ) = \frac { 1 } { N } \sum _ { n = 1 } ^ { N } \left( y _ { n } - \bar { y } \right) ^ { 2 } \\ \text { with } \bar { y } = \text { average of } \left\{ y _ { n } \right\} \end{array}$

The mean $\bar{y}$ is the reference value against which every $y_n$ is compared.

Classification error

Used for classification.

$\begin{array} { l } \text { impurity } ( \mathcal { D } ) = \frac { 1 } { N } \sum _ { n = 1 } ^ { N } \left[ \kern-0.15em \left[ y _ { n } \neq y ^ { * } \right] \kern-0.15em \right] \\ \text { with } y ^ { * } = \text { majority of } \left\{ y _ { n } \right\} \end{array}$

The majority class $y^{*}$ is the reference value.

Gini index

Used for classification, where it is the usual choice.

$\text { impurity } ( \mathcal { D } ) = 1 - \sum _ { k = 1 } ^ { K } \left( \frac { \sum _ { n = 1 } ^ { N } \left[ \kern-0.15em \left[ { y } _ { n } = k \right] \kern-0.15em \right] } { N } \right) ^ { 2 }$

It takes all classes into account when computing the impurity.
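
The three impurity measures are easy to compare side by side. A minimal sketch with NumPy follows; the function names are illustrative, and the Gini index is written in its usual form, one minus the sum of squared class fractions.

```python
import numpy as np

def regression_error(y):
    """Mean squared deviation from the average of y."""
    y = np.asarray(y, dtype=float)
    return float(np.mean((y - y.mean()) ** 2))

def classification_error(y):
    """Fraction of labels that differ from the majority label y*."""
    _, counts = np.unique(y, return_counts=True)
    return float(1.0 - counts.max() / counts.sum())

def gini_index(y):
    """1 - sum_k (fraction of class k)^2."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))
```

For the labels `[1, 1, 1, 2]` the classification error is 0.25 and the Gini index is 0.375; for the targets `[1.0, 3.0]` the regression error is 1.0. A pure set gives 0 under all three measures.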

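Regularization by Pruning

When all $\mathbf{x}_n$ are distinct, a fully-grown tree reaches $E_{in}(G) = 0$. Such a tree tends to overfit, because the subtrees near the bottom are built from very little data. Pruning provides the regularization: put simply, it limits the number of leaves of the tree.

Denote the number of leaves by $\Omega ( G )$: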

$\Omega ( G ) = \text { NumberOfLeaves } ( G )$

The regularized optimization objective then becomes:

$\underset { \text { all possible } G } { \operatorname { argmin } } E _ { \text {in } } ( G ) + \lambda \Omega ( G )$

The tree obtained this way is called a pruned decision tree.

The hard part is $\text { all possible } G$: enumerating every tree is infeasible, so C&RT uses the following strategy:

$\begin{array} { l } G ^ { ( 0 ) } = \text { fully-grown tree } \\ G ^ { ( i ) } = \operatorname { argmin } _ { G } E _ { \text {in } } ( G ) \text { such that } G \text { is one-leaf removed from } G ^ { ( i - 1 ) } \end{array}$

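In other words, the best tree with $i$ leaves removed is obtained by taking the best tree with $i - 1$ leaves removed and removing one more leaf from it.

If the fully-grown tree has $I$ leaves, this yields the sequence:
$G ^ { ( 0 ) } , G ^ { ( 1 ) } , \cdots , G ^ { ( I^- ) } \quad \text{where } I^{-} \leq I$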

The final tree is the $G^{(i)}$ in this sequence that minimizes the regularized objective. The remaining parameter $\lambda$ can be chosen by validation.
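
The final selection step amounts to a one-line argmin once $E_{in}$ and the leaf count of every candidate are known. The sketch below assumes the candidates $G^{(0)}, G^{(1)}, \ldots$ have already been built; the numbers in `candidates` are hypothetical and serve only to show how $\lambda$ shifts the choice.

```python
def select_pruned(candidates, lam):
    """candidates[i] = (E_in of G^(i), number of leaves of G^(i)).
    Return the index i that minimizes E_in + lam * Omega."""
    return min(range(len(candidates)),
               key=lambda i: candidates[i][0] + lam * candidates[i][1])

# Hypothetical sequence: each step removes one leaf, so E_in can only grow.
candidates = [(0.00, 8), (0.01, 7), (0.02, 6), (0.05, 5), (0.15, 4), (0.30, 3)]
```

With `lam = 0.0` the fully-grown tree (index 0) wins, with `lam = 0.02` the choice moves to index 2, and with `lam = 1.0` the smallest tree (index 5) wins. In practice each $\lambda$ would be scored on a validation set rather than fixed by hand.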

Categorical Features

For a numerical (continuous) feature, the branching criteria is:

$\begin{aligned} & b ( \mathbf { x } ) = \left[ \kern-0.15em \left[ x _ { i } \leq \theta \right] \kern-0.15em \right] + 1 \\ \text{with } & \theta \in R \end{aligned}$

For a categorical (discrete) feature, the branching criteria is analogous:
$\begin{aligned} & b ( \mathbf { x } ) = \left[ \kern-0.15em \left[ x _ { i } \in S \right] \kern-0.15em \right] + 1 \\ \text{with }& S \subset \{ 1,2 , \ldots , K \} \end{aligned}$
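
Both branching criteria translate directly into small closures. This is a minimal sketch, assuming `x` is an indexable sequence of feature values; the helper names `numerical_branch` and `categorical_branch` are illustrative.

```python
def numerical_branch(i, theta):
    """b(x) = [[x_i <= theta]] + 1: branch 2 when x_i <= theta, else branch 1."""
    return lambda x: int(x[i] <= theta) + 1

def categorical_branch(i, S):
    """b(x) = [[x_i in S]] + 1: branch 2 when x_i is in S, else branch 1."""
    S = frozenset(S)
    return lambda x: int(x[i] in S) + 1

b_num = numerical_branch(0, 2.5)
b_cat = categorical_branch(1, {1, 3})
```

Here `b_num([1.0, 3])` returns 2 because $1.0 \leq 2.5$, and `b_cat([1.0, 2])` returns 1 because category 2 is not in $S = \{1, 3\}$. The only difference between the two cases is the test applied to $x_i$; the rest of the tree-building procedure is unchanged.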