# [西瓜书 Notes] 3. Decision Trees

## 3.1 The Basic Decision-Tree Flow

1. Root node: contains the full set of samples.
2. Internal nodes: each corresponds to an attribute test.
3. Leaf nodes: each corresponds to a decision outcome.
4. The samples at a node are partitioned into its child nodes according to the attribute test.

[Source: 西瓜书, p. 74]

The recursion stops (the current node becomes a leaf) in three cases:

1. All samples at the current node belong to the same class: no split is needed.
2. The attribute set is empty, or all samples take identical values on every remaining attribute: no split is possible. Mark the node as a leaf labelled with the majority class among its samples. This effectively uses the node's posterior distribution.
3. The sample set at the current node is empty: no split can be made. Mark the node as a leaf labelled with the majority class of its parent node. This effectively uses the parent's sample distribution as the current node's prior.
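The flow and stopping conditions above can be sketched as a recursive procedure. This is a minimal sketch, not the book's pseudocode: `tree_generate` and the `choose_attr` hook (which stands in for whichever selection criterion Section 3.2 supplies) are names of my own choosing.

```python
from collections import Counter

def tree_generate(samples, labels, attrs, choose_attr, parent_majority=None):
    """Recursive skeleton of the basic flow. `choose_attr` is a hypothetical
    hook that returns the attribute to test at this node."""
    # Case 3: empty sample set -> leaf labelled with the parent's majority class
    if not samples:
        return {"leaf": parent_majority}
    # Case 1: all samples belong to one class -> leaf, no split needed
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}
    majority = Counter(labels).most_common(1)[0][0]
    # Case 2: no attributes left, or identical values on all remaining attributes
    if not attrs or all(all(s[a] == samples[0][a] for a in attrs) for s in samples):
        return {"leaf": majority}
    best = choose_attr(samples, labels, attrs)
    node = {"attr": best, "children": {}}
    rest = [a for a in attrs if a != best]
    for v in {s[best] for s in samples}:  # one child per observed value
        subset = [(s, l) for s, l in zip(samples, labels) if s[best] == v]
        node["children"][v] = tree_generate(
            [s for s, _ in subset], [l for _, l in subset],
            rest, choose_attr, majority)
    return node

# Toy demo: the class equals attribute "x", so a single split suffices
samples = [{"x": 0, "y": 0}, {"x": 0, "y": 1}, {"x": 1, "y": 0}, {"x": 1, "y": 1}]
labels = [0, 0, 1, 1]
tree = tree_generate(samples, labels, ["x", "y"], lambda s, l, a: a[0])
print(tree)
```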

## 3.2 Split Selection

### 3.2.1 The ID3 Decision Tree

#### 3.2.1.1 Information Entropy

$$\operatorname{Ent}(D)=-\sum_{k=1}^{|\mathcal{Y}|} p_{k} \log _{2} p_{k}$$

Let $|\mathcal{Y}|=n$ and $p_{k}=x_{k}$; the entropy $\operatorname{Ent}(D)$ can then be viewed as an $n$-variable real-valued function:

$$\operatorname{Ent}(D)=f\left(x_{1}, \ldots, x_{n}\right)=-\sum_{k=1}^{n} x_{k} \log _{2} x_{k}$$

Maximising $f$ under the probability constraint is equivalent to the following problem, which can be solved with a Lagrange multiplier:

$$\begin{aligned} \min\ & \sum_{k=1}^{n} x_{k} \log _{2} x_{k} \\ \text { s.t. }\ & \sum_{k=1}^{n} x_{k}=1 \end{aligned}$$

$$L\left(x_{1}, \ldots, x_{n}, \lambda\right)=\sum_{k=1}^{n} x_{k} \log _{2} x_{k}+\lambda\left(\sum_{k=1}^{n} x_{k}-1\right)$$

Setting the partial derivative with respect to $x_{1}$ to zero:

$$\begin{aligned} \frac{\partial L\left(x_{1}, \ldots, x_{n}, \lambda\right)}{\partial x_{1}} &=\frac{\partial}{\partial x_{1}}\left[\sum_{k=1}^{n} x_{k} \log _{2} x_{k}+\lambda\left(\sum_{k=1}^{n} x_{k}-1\right)\right] \\ &=\log _{2} x_{1}+x_{1} \cdot \frac{1}{x_{1} \ln 2}+\lambda=\log _{2} x_{1}+\frac{1}{\ln 2}+\lambda=0 \\ & \Rightarrow \lambda=-\log _{2} x_{1}-\frac{1}{\ln 2} \end{aligned}$$

By symmetry, the same computation for every $x_{k}$ gives

$$\lambda=-\log _{2} x_{1}-\frac{1}{\ln 2}=-\log _{2} x_{2}-\frac{1}{\ln 2}=\ldots=-\log _{2} x_{n}-\frac{1}{\ln 2}$$

while the derivative with respect to $\lambda$ recovers the constraint:

$$\frac{\partial L\left(x_{1}, \ldots, x_{n}, \lambda\right)}{\partial \lambda}=\frac{\partial}{\partial \lambda}\left[\sum_{k=1}^{n} x_{k} \log _{2} x_{k}+\lambda\left(\sum_{k=1}^{n} x_{k}-1\right)\right]=0 \Rightarrow \sum_{k=1}^{n} x_{k}=1$$

Together these yield the uniform distribution

$$x_{1}=x_{2}=\ldots=x_{n}=\frac{1}{n}$$

at which the entropy attains its maximum:

$$f\left(\frac{1}{n}, \ldots, \frac{1}{n}\right)=-\sum_{k=1}^{n} \frac{1}{n} \log _{2} \frac{1}{n}=-n \cdot \frac{1}{n} \log _{2} \frac{1}{n}=\log _{2} n$$

For the minimum, note that $f$ decomposes into identical one-variable terms:

$$f\left(x_{1}, \ldots, x_{n}\right)=\sum_{k=1}^{n} g\left(x_{k}\right), \qquad g\left(x_{k}\right)=-x_{k} \log _{2} x_{k}, \quad 0 \leq x_{k} \leq 1$$

$$\begin{aligned} g^{\prime}\left(x_{1}\right)&=\frac{d\left(-x_{1} \log _{2} x_{1}\right)}{d x_{1}}=-\log _{2} x_{1}-x_{1} \cdot \frac{1}{x_{1} \ln 2}=-\log _{2} x_{1}-\frac{1}{\ln 2} \\ g^{\prime \prime}\left(x_{1}\right)&=\frac{d\left(g^{\prime}\left(x_{1}\right)\right)}{d x_{1}}=\frac{d\left(-\log _{2} x_{1}-\frac{1}{\ln 2}\right)}{d x_{1}}=-\frac{1}{x_{1} \ln 2} \end{aligned}$$

Since $g^{\prime \prime}(x)<0$ on $(0,1]$, $g$ is concave, so its minimum on $[0,1]$ is attained at an endpoint, and both endpoints give $0$ (with the convention $0 \log _{2} 0=0$):

$$g(0)=-0 \log _{2} 0=0, \qquad g(1)=-1 \log _{2} 1=0$$

Hence the minimum of $f$ is $0$, attained exactly when one $x_{k}=1$ and all the others are $0$:

$$f(0,0, \ldots, 0,1,0, \ldots, 0)=-0 \log _{2} 0-0 \log _{2} 0 \ldots-1 \log _{2} 1 \ldots-0 \log _{2} 0=0$$
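The two extremes can be checked numerically. A minimal sketch (the function name `entropy` is mine):

```python
import math

def entropy(probs):
    """Information entropy Ent(D) = -sum p_k log2 p_k, with 0*log2(0) := 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# The uniform distribution attains the maximum log2(n); a one-hot
# distribution attains the minimum 0, matching the derivation above.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 == log2(4)
print(entropy([1, 0, 0, 0]))              # 0.0
```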

#### 3.2.1.2 Conditional Entropy

The conditional entropy of $D$ given attribute $a$ weights each branch's entropy by its share of the samples:

$$H(D \mid a)=\sum_{v=1}^{V} \frac{\left|D^{v}\right|}{|D|} \operatorname{Ent}\left(D^{v}\right)$$

Information gain is the reduction in entropy obtained by splitting on $a$:

$$\begin{aligned} \operatorname{Gain}(D, a) &=\operatorname{Ent}(D)-\sum_{v=1}^{V} \frac{\left|D^{v}\right|}{|D|} \operatorname{Ent}\left(D^{v}\right) \\ &=\operatorname{Ent}(D)-H(D \mid a) \end{aligned}$$

Expanded in terms of class counts:

$$\begin{aligned} \operatorname{Gain}(D, a) &=\operatorname{Ent}(D)-\sum_{v=1}^{V} \frac{\left|D^{v}\right|}{|D|} \operatorname{Ent}\left(D^{v}\right) \\ &=\operatorname{Ent}(D)-\sum_{v=1}^{V} \frac{\left|D^{v}\right|}{|D|}\left(-\sum_{k=1}^{|\mathcal{Y}|} p_{k}^{v}\log _{2} p_{k}^{v}\right) \\ &=\operatorname{Ent}(D)-\sum_{v=1}^{V} \frac{\left|D^{v}\right|}{|D|}\left(-\sum_{k=1}^{|\mathcal{Y}|} \frac{\left|D_{k}^{v}\right|}{\left|D^{v}\right|} \log _{2} \frac{\left|D_{k}^{v}\right|}{\left|D^{v}\right|}\right) \end{aligned}$$
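A minimal sketch of the gain computation (function names `ent` and `info_gain` are mine; `values` holds each sample's value of attribute $a$):

```python
import math
from collections import Counter

def ent(labels):
    """Ent(D) from empirical class frequencies."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """Gain(D, a) = Ent(D) - sum_v |D^v|/|D| * Ent(D^v)."""
    n = len(labels)
    cond = 0.0  # the conditional entropy H(D | a)
    for v in set(values):
        sub = [l for x, l in zip(values, labels) if x == v]
        cond += len(sub) / n * ent(sub)
    return ent(labels) - cond

# An attribute that perfectly separates two balanced classes
# yields a gain equal to Ent(D) = 1 bit.
print(info_gain(["a", "a", "b", "b"], [0, 0, 1, 1]))  # 1.0
```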

### 3.2.2 The C4.5 Decision Tree

$$\text{Gain\_ratio}(D, a)=\frac{\operatorname{Gain}(D, a)}{\operatorname{IV}(a)}$$

$$\mathrm{IV}(a)=-\sum_{v=1}^{V} \frac{\left|D^{v}\right|}{|D|} \log _{2} \frac{\left|D^{v}\right|}{|D|}$$

C4.5 does not simply pick the attribute with the highest gain ratio; it uses a two-step heuristic:

1. First, select from the candidate attributes those whose information gain is above average.
2. Then, from that subset, choose the attribute with the highest gain ratio.
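The intrinsic value $\mathrm{IV}(a)$ grows with the number of distinct attribute values, which is why dividing by it counteracts information gain's bias toward many-valued attributes. A minimal sketch (the function name `iv` is mine):

```python
import math
from collections import Counter

def iv(values):
    """Intrinsic value IV(a) = -sum |D^v|/|D| * log2(|D^v|/|D|)."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

# More (equally sized) branches -> larger IV, hence a smaller gain ratio
# for the same raw information gain.
print(iv(["a", "a", "b", "b"]))  # 1.0  (2 equal branches)
print(iv(["a", "b", "c", "d"]))  # 2.0  (4 equal branches)
```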

### 3.2.3 The CART Decision Tree

#### 3.2.3.1 Definition

$$\operatorname{Gini}(D)=\sum_{k=1}^{|\mathcal{Y}|} \sum_{k^{\prime} \neq k} p_{k} p_{k^{\prime}}=\sum_{k=1}^{|\mathcal{Y}|} p_{k} \sum_{k^{\prime} \neq k} p_{k^{\prime}}=\sum_{k=1}^{|\mathcal{Y}|} p_{k}\left(1-p_{k}\right)=1-\sum_{k=1}^{|\mathcal{Y}|} p_{k}^{2}$$

$$\text{Gini\_index}(D, a)=\sum_{v=1}^{V} \frac{\left|D^{v}\right|}{|D|} \operatorname{Gini}\left(D^{v}\right)$$
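The Gini value is the probability that two samples drawn at random from $D$ carry different class labels. A minimal sketch (the function name `gini` is mine):

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum p_k^2: chance two random samples differ in class."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini([0, 0, 1, 1]))  # 0.5  (maximally impure for two classes)
print(gini([0, 0, 0, 0]))  # 0.0  (pure node)
```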

#### 3.2.3.2 The CART Classification Algorithm

1. Use the Gini-index formula to find the attribute with the smallest Gini index (line 8 of the pseudocode becomes $a_{*}=\underset{a \in A}{\arg \min }\ \operatorname{Gini\_index}(D, a)$).
2. Compute the Gini value $\operatorname{Gini}\left(D^{v}\right)$ for every possible value $v=1,2, \ldots, V$ of attribute $a_{*}$, choose the value $a_{*}^{v}$ with the smallest Gini index as the split point, and partition $D$ into two sets (branch nodes) $D_{1}$ and $D_{2}$, where $D_{1}$ holds the samples with $a_{*}=a_{*}^{v}$ and $D_{2}$ those with $a_{*} \neq a_{*}^{v}$.
3. Recurse on $D_{1}$ and $D_{2}$ with steps 1–2 until a stopping condition is met.

#### 3.2.3.3 The CART Regression Algorithm

1. Find the optimal splitting feature $a_{*}$ and optimal split point $a_{*}^{v}$ according to

$$a_{*}, a_{*}^{v}=\underset{a, a^{v}}{\arg \min }\left[\min _{c_{1}} \sum_{x_{i} \in D_{1}\left(a, a^{v}\right)}\left(y_{i}-c_{1}\right)^{2}+\min _{c_{2}} \sum_{x_{i} \in D_{2}\left(a, a^{v}\right)}\left(y_{i}-c_{2}\right)^{2}\right]$$

2. Use the split point $a_{*}^{v}$ to partition $D$ into two sets (branch nodes) $D_{1}$ and $D_{2}$.
3. Recurse on $D_{1}$ and $D_{2}$ with steps 1–2 until a stopping condition is met.
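For a single feature, the inner minimisations are solved by taking each side's mean as $c_{1}, c_{2}$, so the search reduces to scanning split points. A minimal single-feature sketch (the function name `best_regression_split` is mine; it tries each observed value as a threshold rather than midpoints):

```python
def best_regression_split(x, y):
    """Scan thresholds s, predicting each side by its mean (the optimal
    c1, c2), and return the (s, cost) minimising the summed squared error."""
    def sse(ys):  # squared error around the mean
        if not ys:
            return 0.0
        m = sum(ys) / len(ys)
        return sum((v - m) ** 2 for v in ys)
    best = None
    for s in sorted(set(x))[:-1]:  # last value cannot split the set
        left = [yi for xi, yi in zip(x, y) if xi <= s]
        right = [yi for xi, yi in zip(x, y) if xi > s]
        cost = sse(left) + sse(right)
        if best is None or cost < best[1]:
            best = (s, cost)
    return best

# y jumps at x = 2, so the best split separates {1, 2} from {3, 4}
print(best_regression_split([1, 2, 3, 4], [0.0, 0.0, 1.0, 1.0]))  # (2, 0.0)
```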

## 3.4 Handling Continuous and Missing Values

### 3.4.1 Continuous Values

Given $n$ distinct values of a continuous attribute $a$ in $D$, the candidate split points are the midpoints of adjacent sorted values:

$$T_{a}=\left\{\frac{a^{i}+a^{i+1}}{2} \,\middle|\, 1 \leqslant i \leqslant n-1\right\}$$

The split point $t$ is then chosen to maximise the bi-partition information gain:

$$\begin{aligned} \operatorname{Gain}(D, a) &=\max _{t \in T_{a}} \operatorname{Gain}(D, a, t) \\ &=\max _{t \in T_{a}}\left(\operatorname{Ent}(D)-\sum_{\lambda \in\{-,+\}} \frac{\left|D_{t}^{\lambda}\right|}{|D|} \operatorname{Ent}\left(D_{t}^{\lambda}\right)\right) \end{aligned}$$
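Building $T_a$ is a one-liner over the sorted distinct values. A minimal sketch (the function name `candidate_splits` is mine; the sample numbers are illustrative):

```python
def candidate_splits(values):
    """T_a: midpoints of adjacent sorted distinct values, the candidate
    bi-partition thresholds tried when maximising Gain(D, a, t)."""
    v = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(v, v[1:])]

# Three distinct values yield n - 1 = 2 candidate thresholds
print(candidate_splits([0.697, 0.774, 0.634]))
```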

### 3.4.2 Missing Values

Two questions must be answered:

1. How do we choose the splitting attribute when some samples have missing values on an attribute?
2. Given a chosen splitting attribute, how do we partition a sample whose value on that attribute is missing?

$$\begin{aligned} \rho &=\frac{\sum_{\boldsymbol{x} \in \tilde{D}} w_{\boldsymbol{x}}}{\sum_{\boldsymbol{x} \in D} w_{\boldsymbol{x}}} \\ \tilde{p}_{k} &=\frac{\sum_{\boldsymbol{x} \in \tilde{D}_{k}} w_{\boldsymbol{x}}}{\sum_{\boldsymbol{x} \in \tilde{D}} w_{\boldsymbol{x}}} \quad(1 \leqslant k \leqslant|\mathcal{Y}|) \\ \tilde{r}_{v} &=\frac{\sum_{\boldsymbol{x} \in \tilde{D}^{v}} w_{\boldsymbol{x}}}{\sum_{\boldsymbol{x} \in \tilde{D}} w_{\boldsymbol{x}}} \quad(1 \leqslant v \leqslant V) \end{aligned}$$

1. $\rho$ is the proportion of samples without missing values.
2. $\tilde{p}_{k}$ is the proportion of class $k$ among the samples without missing values.
3. $\tilde{r}_{v}$ is the proportion, among the samples without missing values, of samples taking value $a^{v}$ on attribute $a$.

$$\begin{aligned} \operatorname{Gain}(D, a) &=\rho \times \operatorname{Gain}(\tilde{D}, a) \\ &=\rho \times\left(\operatorname{Ent}(\tilde{D})-\sum_{v=1}^{V} \tilde{r}_{v} \operatorname{Ent}\left(\tilde{D}^{v}\right)\right) \end{aligned}$$

$$\operatorname{Ent}(\tilde{D})=-\sum_{k=1}^{|\mathcal{Y}|} \tilde{p}_{k} \log _{2} \tilde{p}_{k}$$
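A minimal sketch of the weighted gain $\rho \times \operatorname{Gain}(\tilde{D}, a)$, answering the first question above. The function name `weighted_gain` and the convention that `None` marks a missing value are mine:

```python
import math
from collections import defaultdict

def weighted_gain(values, labels, weights):
    """rho * Gain(D~, a): information gain computed on the non-missing
    subset D~, with per-sample weights w_x (None marks a missing value)."""
    def w_ent(pairs):  # entropy of a weighted multiset of (label, weight)
        total = sum(w for _, w in pairs)
        by_class = defaultdict(float)
        for l, w in pairs:
            by_class[l] += w
        return -sum((w / total) * math.log2(w / total)
                    for w in by_class.values())
    known = [(v, l, w) for v, l, w in zip(values, labels, weights)
             if v is not None]
    rho = sum(w for _, _, w in known) / sum(weights)
    known_pairs = [(l, w) for _, l, w in known]
    total_known = sum(w for _, w in known_pairs)
    cond = 0.0  # weighted conditional entropy sum_v r~_v * Ent(D~^v)
    for val in {v for v, _, _ in known}:
        sub = [(l, w) for v, l, w in known if v == val]
        r_v = sum(w for _, w in sub) / total_known
        cond += r_v * w_ent(sub)
    return rho * (w_ent(known_pairs) - cond)

# One of five unit-weight samples is missing -> rho = 0.8; the attribute
# perfectly separates the known samples, so the gain is 0.8 * 1 bit.
print(weighted_gain(["a", "a", "b", "b", None], [0, 0, 1, 1, 0], [1.0] * 5))
```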
