Derivation of Decision Tree Formulas
(1) Information entropy: the most commonly used measure of the purity of a sample set, defined as
$$\operatorname{Ent}(D)=-\sum_{k=1}^{\vert\mathcal{Y}\vert}p_k\log_2p_k\tag{1}$$
where $D=\{(x_1,y_1),(x_2,y_2),\cdots,(x_m,y_m)\}$ denotes the sample set, $|\mathcal{Y}|$ the number of classes, and $p_k$ the proportion of samples belonging to class $k$, with $0\le p_k\le 1$ and $\sum_{k=1}^{|\mathcal{Y}|}p_k=1$. The smaller the value of $\operatorname{Ent}(D)$, the higher the purity (the more the samples concentrate in a single class, the "purer" the set).
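To make the definition concrete, here is a minimal NumPy sketch (the helper name `entropy` is mine, not from the text) that computes $\operatorname{Ent}(D)$ from a label vector:

```python
import numpy as np

def entropy(y):
    """Empirical information entropy Ent(D) of a label vector, in bits (Eq. 1)."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(entropy([0, 0, 0, 0]))  # 0.0: a pure set, highest purity
print(entropy([0, 0, 1, 1]))  # 1.0: a balanced two-class set, lowest purity for |Y|=2
```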
[Proof] We show that $0\leq\operatorname{Ent}(D)\leq\log_{2}|\mathcal{Y}|$.
By definition, the information entropy of the set $D$ is
$$\operatorname{Ent}(D)=-\sum_{k=1}^{|\mathcal{Y}|} p_{k} \log _{2} p_{k}\tag{2}$$
where $|\mathcal{Y}|$ is the number of classes and $p_k$ is the proportion of class-$k$ samples, with $0 \leq p_k \leq 1$ and $\sum_{k=1}^{|\mathcal{Y}|}p_k=1$. Setting $|\mathcal{Y}|=n$ and $p_k=x_k$, the entropy $\operatorname{Ent}(D)$ can be viewed as an $n$-variable real-valued function:
$$\operatorname{Ent}(D)=f(x_1,...,x_n)=-\sum_{k=1}^{n} x_{k} \log _{2} x_{k} \tag{3}$$
where $0 \leq x_k \leq 1$ and $\sum_{k=1}^{n}x_k=1$. We now find the extrema of this multivariate function.
Maximum:
If we ignore the constraint $0 \leq x_k \leq 1$ and keep only $\sum_{k=1}^{n}x_k=1$, maximizing $f(x_1,...,x_n)$ is equivalent to the following minimization problem:
$$\begin{array}{ll}{\min} & {\sum\limits_{k=1}^{n} x_{k} \log _{2} x_{k} } \\ {\text { s.t. }} & {\sum\limits_{k=1}^{n}x_k=1} \end{array}\tag{4}$$
(Figure omitted: the graph of $y=x\log x$, a convex curve on $(0,1]$.)
For $0 \leq x_k \leq 1$ the objective is convex, so this is a convex optimization problem, and for a convex problem any point satisfying the KKT conditions is a global optimum. Since this minimization has only an equality constraint, the KKT points are exactly the points where the first-order partial derivatives of the Lagrangian vanish. By the method of Lagrange multipliers, the Lagrangian of the problem is:
$$L(x_1,...,x_n,\lambda)=\sum_{k=1}^{n} x_{k} \log _{2} x_{k}+\lambda\left(\sum_{k=1}^{n}x_k-1\right)\tag{5}$$
where $\lambda$ is the Lagrange multiplier. Taking the first-order partial derivatives of $L(x_1,...,x_n,\lambda)$ with respect to $x_1,...,x_n,\lambda$ and setting them to zero gives:
$$\begin{aligned} \cfrac{\partial L(x_1,...,x_n,\lambda)}{\partial x_1}&=\cfrac{\partial }{\partial x_1}\left[\sum_{k=1}^{n} x_{k} \log _{2} x_{k}+\lambda\left(\sum_{k=1}^{n}x_k-1\right)\right]\\ &=\log _{2} x_{1}+x_1\cdot \cfrac{1}{x_1\ln2}+\lambda=\log _{2} x_{1}+\cfrac{1}{\ln2}+\lambda=0\\ &\Rightarrow \lambda=-\log _{2} x_{1}-\cfrac{1}{\ln2}\\ \cfrac{\partial L(x_1,...,x_n,\lambda)}{\partial x_2}&=\cfrac{\partial }{\partial x_2}\left[\sum_{k=1}^{n} x_{k} \log _{2} x_{k}+\lambda\left(\sum_{k=1}^{n}x_k-1\right)\right]=0\\ &\Rightarrow \lambda=-\log _{2} x_{2}-\cfrac{1}{\ln2}\\ &\;\;\vdots\\ \cfrac{\partial L(x_1,...,x_n,\lambda)}{\partial x_n}&=\cfrac{\partial }{\partial x_n}\left[\sum_{k=1}^{n} x_{k} \log _{2} x_{k}+\lambda\left(\sum_{k=1}^{n}x_k-1\right)\right]=0\\ &\Rightarrow \lambda=-\log _{2} x_{n}-\cfrac{1}{\ln2}\\ \cfrac{\partial L(x_1,...,x_n,\lambda)}{\partial \lambda}&=\cfrac{\partial }{\partial \lambda}\left[\sum_{k=1}^{n} x_{k} \log _{2} x_{k}+\lambda\left(\sum_{k=1}^{n}x_k-1\right)\right]=0\\ &\Rightarrow \sum_{k=1}^{n}x_k=1 \quad \text{(which is exactly the equality constraint)} \end{aligned}\tag{6}$$
Collecting these results:
$$\left\{ \begin{array}{l} \lambda=-\log _{2} x_{1}-\cfrac{1}{\ln2}=-\log _{2} x_{2}-\cfrac{1}{\ln2}=\cdots=-\log _{2} x_{n}-\cfrac{1}{\ln2} \\ \sum\limits_{k=1}^{n}x_k=1 \end{array}\right.\tag{7}$$
Solving these two equations yields:
$$x_1=x_2=\cdots=x_n=\cfrac{1}{n}\tag{8}$$
Since each $x_k$ must also satisfy $0 \leq x_k \leq 1$, and indeed $0 \leq\cfrac{1}{n}\leq 1$, the point $x_1=x_2=\cdots=x_n=\cfrac{1}{n}$ satisfies all the constraints. It is therefore the minimizer of the minimization problem, and hence the maximizer of $f(x_1,...,x_n)$. Substituting it into $f(x_1,...,x_n)$ gives:
$$f\left(\cfrac{1}{n},...,\cfrac{1}{n}\right)=-\sum_{k=1}^{n} \cfrac{1}{n} \log _{2} \cfrac{1}{n}=-n\cdot\cfrac{1}{n} \log _{2} \cfrac{1}{n}=\log _{2} n\tag{9}$$
So under the constraints $0 \leq x_k \leq 1$ and $\sum_{k=1}^{n}x_k=1$, the maximum of $f(x_1,...,x_n)$ is $\log _{2} n$.
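As a numerical sanity check of this conclusion, one can hand problem (4) to a generic constrained solver. A sketch using scipy (assumed available alongside the sklearn stack used later in this post):

```python
import numpy as np
from scipy.optimize import minimize

n = 4
# Minimize sum_k x_k log2 x_k subject to sum_k x_k = 1, 0 <= x_k <= 1 (Eq. 4);
# clip avoids log2(0) at the boundary.
obj = lambda x: np.sum(x * np.log2(np.clip(x, 1e-12, 1.0)))
x0 = np.random.dirichlet(np.ones(n))   # a random feasible starting point
res = minimize(obj, x0,
               bounds=[(1e-12, 1.0)] * n,
               constraints={"type": "eq", "fun": lambda x: x.sum() - 1.0})
print(res.x)                  # ~[0.25 0.25 0.25 0.25]: the uniform distribution
print(-res.fun, np.log2(n))   # both ~2.0 = log2(4), matching Eq. 9
```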
Minimum:
If we instead ignore the constraint $\sum_{k=1}^{n}x_k=1$ and keep only $0 \leq x_k \leq 1$, then $f(x_1,...,x_n)$ can be viewed as a sum of $n$ mutually independent single-variable functions:
$$f(x_1,...,x_n)=\sum_{k=1}^{n} g(x_k) \tag{10}$$
where $g(x_k)=-x_{k} \log _{2} x_{k}$ with $0 \leq x_k \leq 1$. Then $f(x_1,...,x_n)$ attains its minimum exactly when each of $g(x_1),g(x_2),...,g(x_n)$ attains its minimum. Since $g(x_1),g(x_2),...,g(x_n)$ all share the same expression and the same domain, minimizing $g(x_1)$ also gives the minima of $g(x_2),...,g(x_n)$. To minimize $g(x_1)$, first compute its first and second derivatives with respect to $x_1$:
$$g^{\prime}(x_1)=\cfrac{d(-x_{1} \log _{2} x_{1})}{d x_1}=-\log _{2} x_{1}-x_1\cdot \cfrac{1}{x_1\ln2}=-\log _{2} x_{1}-\cfrac{1}{\ln2}\tag{11}$$
$$g^{\prime\prime}(x_1)=\cfrac{d\left(g^{\prime}(x_1)\right)}{d x_1}=\cfrac{d\left(-\log _{2} x_{1}-\cfrac{1}{\ln2}\right)}{d x_1}=-\cfrac{1}{x_{1}\ln2}\tag{12}$$
Since $g^{\prime\prime}(x_1)=-\cfrac{1}{x_{1}\ln2}$ is strictly negative for $0 < x_1 \leq 1$, $g(x_1)$ is concave (opens downward) on its domain, so its minimum must be attained on the boundary. Evaluating $g(x_1)$ at $x_1=0$ and $x_1=1$ (with the convention $0\log_2 0=0$):
$$g(0)=-0\log _{2} 0=0\tag{13}$$
$$g(1)=-1\log _{2} 1=0\tag{14}$$
So the minimum of $g(x_1)$ is 0; by the same argument the minima of $g(x_2),...,g(x_n)$ are all 0, and hence the minimum of $f(x_1,...,x_n)$ is 0 as well.
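A quick grid evaluation (my own check, not part of the original derivation) confirms the shape of $g$: nonnegative on $(0,1]$, vanishing at both endpoints, with its peak at $x=1/e$:

```python
import numpy as np

x = np.linspace(1e-6, 1.0, 100001)
g = -x * np.log2(x)              # g(x) = -x log2 x
print(g.min())                   # ~0, attained at the boundary x = 1
print(g.max(), x[g.argmax()])    # ~0.5307 at x ~ 0.3679 = 1/e
```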
However, this minimum was obtained while ignoring the constraint $\sum_{k=1}^{n}x_k=1$ and keeping only $0 \leq x_k \leq 1$; once $\sum_{k=1}^{n}x_k=1$ is imposed as well, the minimum of $f(x_1,...,x_n)$ must be greater than or equal to 0. If we set some $x_k=1$, the constraint $\sum_{k=1}^{n}x_k=1$ forces $x_1=x_2=...=x_{k-1}=x_{k+1}=...=x_n=0$. Substituting into $f(x_1,...,x_n)$ gives:
$$f(0,...,0,1,0,...,0)=-0 \log _{2}0-\cdots-0 \log _{2}0-1 \log _{2}1-0 \log _{2}0-\cdots-0 \log _{2}0=0\tag{15}$$
So $x_k=1,\ x_1=x_2=...=x_{k-1}=x_{k+1}=...=x_n=0$ is a minimizer of $f(x_1,...,x_n)$ under both constraints $\sum_{k=1}^{n}x_k=1$ and $0 \leq x_k \leq 1$, with minimum value 0.
In summary: $f(x_1,...,x_n)$ attains its maximum at $x_1=x_2=...=x_n=\cfrac{1}{n}$, where the purity of the sample set is lowest; it attains its minimum at $x_k=1,\ x_1=x_2=...=x_{k-1}=x_{k+1}=...=x_n=0$, where the purity is highest. This establishes $0\leq\operatorname{Ent}(D)\leq\log_{2}|\mathcal{Y}|$. $\blacksquare$
(2) Conditional entropy: a measure of the purity of a sample set given the value of an attribute $a$:
$$H(D|a)=\sum_{v=1}^{V}\cfrac{|D^v|}{|D|}\operatorname{Ent}(D^v)\tag{16}$$
where $a$ is an attribute of the samples, assumed to take $V$ possible values $\{a^1,a^2,\cdots,a^V\}$; $D^v$ denotes the subset of $D$ whose samples take value $a^v$ on attribute $a$; and $\operatorname{Ent}(D^v)$ is the information entropy of $D^v$. The smaller $H(D|a)$, the higher the purity.
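Continuing the earlier sketch (reusing the hypothetical `entropy` helper from above), the conditional entropy of (16) is just a size-weighted average of subset entropies:

```python
import numpy as np

def cond_entropy(a, y):
    """H(D|a) per Eq. 16: size-weighted entropy of the subsets D^v induced by attribute a."""
    a, y = np.asarray(a), np.asarray(y)
    # (a == v).mean() is |D^v| / |D|
    return sum((a == v).mean() * entropy(y[a == v]) for v in np.unique(a))
```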
1. ID3 Decision Trees
An ID3 decision tree selects the splitting attribute by the information gain criterion. Information gain is defined as:
$$\begin{aligned} \operatorname{Gain}(D,a) &=\operatorname{Ent}(D)-\sum_{v=1}^{V}\cfrac{|D^v|}{|D|}\operatorname{Ent}(D^v)\\ &=\operatorname{Ent}(D)-H(D|a) \end{aligned}\tag{17}$$
The attribute with the largest information gain is chosen as the splitting attribute, because a larger information gain means a larger "purity gain" from splitting on that attribute.
Expanding (17) further:
$$\begin{aligned} \operatorname{Gain}(D,a) &=\operatorname{Ent}(D)-\sum_{v=1}^{V}\cfrac{|D^v|}{|D|}\operatorname{Ent}(D^v)\\ &=\operatorname{Ent}(D)-\sum_{v=1}^{V}\cfrac{|D^v|}{|D|}\left(-\sum_{k=1}^{\mathcal{|Y|}}p_k\log_2p_k\right)\\ &=\operatorname{Ent}(D)-\sum_{v=1}^{V}\cfrac{|D^v|}{|D|}\left(-\sum_{k=1}^{\mathcal{|Y|}}{\cfrac{|D_k^v|}{|D^v|}}\log_2{\cfrac{|D_k^v|}{|D^v|}}\right) \end{aligned}\tag{18}$$
where $D_k^v$ is the set of samples in $D$ that take value $a^v$ on attribute $a$ and belong to class $k$.
$\Longrightarrow$ With information gain as the splitting criterion, ID3 is biased toward attributes with many possible values.
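A small illustration of both the definition and the bias, again using the hypothetical helpers above (the toy labels are made up for illustration):

```python
def info_gain(a, y):
    """Gain(D, a) per Eq. 17."""
    return entropy(y) - cond_entropy(a, y)

y        = [0, 0, 0, 1, 1, 1]
a_coarse = [0, 0, 1, 0, 1, 1]   # an ordinary two-valued attribute
a_id     = [0, 1, 2, 3, 4, 5]   # an ID-like attribute: one value per sample
print(info_gain(a_coarse, y))   # modest gain (~0.08 bits here)
print(info_gain(a_id, y))       # = Ent(D) = 1 bit: every subset is pure, maximal gain
```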
2. C4.5 Decision Trees
A C4.5 decision tree selects the splitting attribute by the gain ratio criterion. The gain ratio is defined as:
$$\operatorname{Gain-ratio}(D,a)=\cfrac{\operatorname{Gain}(D,a)}{\operatorname{IV}(a)},\quad\text{where}\quad \operatorname{IV}(a)=-\sum_{v=1}^{V}{\cfrac{|D^v|}{|D|}}\log_2{\cfrac{|D^v|}{|D|}}\tag{19}$$
The gain-ratio criterion is biased toward attributes with few possible values, so C4.5 does not simply pick the attribute with the largest gain ratio; instead it uses a heuristic: first select, from the candidate splitting attributes, those whose information gain is above average, then choose from these the one with the highest gain ratio.
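A matching sketch of (19), built on the same hypothetical helpers (note that $\operatorname{IV}(a)=0$ when $a$ takes a single value, so a real implementation must guard that case):

```python
import numpy as np

def gain_ratio(a, y):
    """Gain-ratio(D, a) per Eq. 19; assumes a takes at least two values so IV(a) > 0."""
    weights = np.unique(np.asarray(a), return_counts=True)[1] / len(a)
    iv = -np.sum(weights * np.log2(weights))   # IV(a), the intrinsic value of a
    return info_gain(a, y) / iv
```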
3. CART Decision Trees
A CART decision tree selects the splitting attribute by the Gini index criterion. The Gini value is defined as:
$$\begin{aligned} \operatorname{Gini}(D)&=\sum_{k=1}^{|\mathcal{Y}|}\sum_{k^{\prime}\neq{k}}p_{k^{\prime}}p_k=\sum_{k=1}^{|\mathcal{Y}|}p_k\sum_{k^{\prime}\neq{k}}p_{k^{\prime}}\\ &=\sum_{k=1}^{|\mathcal{Y}|}p_k(1-p_k)=1-\sum_{k=1}^{|\mathcal{Y}|}p_{k}^2 \end{aligned}\tag{20}$$
The Gini index is then:
$$\operatorname{Gini-index}(D,a)=\sum_{v=1}^{V}{\cfrac{|D^v|}{|D|}}\operatorname{Gini}(D^v)\tag{21}$$
The smaller the Gini value and the Gini index, the higher the purity of the sample set.
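Equations (20) and (21) can be sketched the same way (helper names mine):

```python
import numpy as np

def gini(y):
    """Gini(D) per Eq. 20."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_index(a, y):
    """Gini-index(D, a) per Eq. 21."""
    a, y = np.asarray(a), np.asarray(y)
    return sum((a == v).mean() * gini(y[a == v]) for v in np.unique(a))
```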
The CART classification algorithm:
1. Using the Gini-index formula, find the attribute $a_*$ with the smallest Gini index.
2. For attribute $a_*$, compute the Gini value $\operatorname{Gini}(D^v)$ of each possible value and pick the value $a_*^v$ with the smallest Gini value as the split point; split the set $D$ into two subsets (nodes) $D_1$ and $D_2$, where $D_1$ contains the samples with $a_*=a_*^v$ and $D_2$ the samples with $a_*\neq{a_*^v}$ (see the sketch after this list).
3. Repeat steps 1 and 2 on $D_1$ and $D_2$ until the stopping condition is met.
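A minimal sketch of step 2, reusing the `gini` helper above; it scores each candidate binary split by the size-weighted Gini of the two branches. Real CART also handles continuous attributes via thresholds, which this toy version ignores:

```python
import numpy as np

def best_binary_split(a, y):
    """Try each value v of attribute a as the binary split (a == v) vs (a != v)
    and return the v whose partition has the smallest size-weighted Gini."""
    a, y = np.asarray(a), np.asarray(y)
    best_v, best_g = None, float("inf")
    for v in np.unique(a):
        mask = a == v
        g = mask.mean() * gini(y[mask]) + (~mask).mean() * gini(y[~mask])
        if g < best_g:
            best_v, best_g = v, g
    return best_v, best_g
```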
4. Building a Decision Tree
Author: xiaoyao. Here I use the wine dataset to demonstrate building a decision tree.
```python
# Import libraries
import numpy as np
# Plotting tools
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
# The tree module and the dataset loader
from sklearn import tree, datasets
# Train/test splitting utility
from sklearn.model_selection import train_test_split

wine = datasets.load_wine()
# Keep only the first two features of the dataset
X = wine.data[:,:2]
y = wine.target
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Suppress warnings
import warnings
warnings.filterwarnings("ignore")

# Decision tree classifier with maximum depth 1
clf = tree.DecisionTreeClassifier(max_depth=1)
# Fit the training data
clf.fit(X_train, y_train)
```
```
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=1,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
```
```python
# Colors for the decision regions and for the scatter points
cmap_light = ListedColormap(["#FFAAAA", "#AAFFAA", "#AAAAFF"])
cmap_bold = ListedColormap(["#FF0000", "#00FF00", "#0000FF"])
# Build the plotting grid from the two features
x_min, x_max = X_train[:,0].min() - 1, X_train[:,0].max() + 1
y_min, y_max = X_train[:,1].min() - 1, X_train[:,1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, .02), np.arange(y_min, y_max, .02))
z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
# Color each predicted region differently
z = z.reshape(xx.shape)
plt.figure()
plt.pcolormesh(xx, yy, z, cmap=cmap_light)
# Overlay the samples as a scatter plot
plt.scatter(X[:,0], X[:,1], c=y, cmap=cmap_bold, edgecolor="k", s=20)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("Classifier:(max_depth = 1)")
plt.show()
```
With a maximum depth of 1 the classifier does not perform well, so let's increase the depth.
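To put a number on "does not perform well", one can check accuracy on the held-out split (my own addition; the exact values depend on the random train/test split):

```python
# Train and test accuracy of the depth-1 tree
print(clf.score(X_train, y_train), clf.score(X_test, y_test))
```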
```python
# Decision tree classifier with maximum depth 3
clf2 = tree.DecisionTreeClassifier(max_depth=3)
# Fit the training data
clf2.fit(X_train, y_train)
```
```
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
```
```python
# Colors for the decision regions and for the scatter points
cmap_light = ListedColormap(["#FFAAAA", "#AAFFAA", "#AAAAFF"])
cmap_bold = ListedColormap(["#FF0000", "#00FF00", "#0000FF"])
# Build the plotting grid from the two features
x_min, x_max = X_train[:,0].min() - 1, X_train[:,0].max() + 1
y_min, y_max = X_train[:,1].min() - 1, X_train[:,1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, .02), np.arange(y_min, y_max, .02))
z = clf2.predict(np.c_[xx.ravel(), yy.ravel()])
# Color each predicted region differently
z = z.reshape(xx.shape)
plt.figure()
plt.pcolormesh(xx, yy, z, cmap=cmap_light)
# Overlay the samples as a scatter plot
plt.scatter(X[:,0], X[:,1], c=y, cmap=cmap_bold, edgecolor="k", s=20)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("Classifier:(max_depth = 3)")
plt.show()
```
Now the classifier can distinguish all three classes, and most of the data points fall in the correct region. Next, increase the depth further.
```python
# Decision tree classifier with maximum depth 5
clf3 = tree.DecisionTreeClassifier(max_depth=5)
# Fit the training data
clf3.fit(X_train, y_train)
```
```
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
```
```python
# Colors for the decision regions and for the scatter points
cmap_light = ListedColormap(["#FFAAAA", "#AAFFAA", "#AAAAFF"])
cmap_bold = ListedColormap(["#FF0000", "#00FF00", "#0000FF"])
# Build the plotting grid from the two features
x_min, x_max = X_train[:,0].min() - 1, X_train[:,0].max() + 1
y_min, y_max = X_train[:,1].min() - 1, X_train[:,1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, .02), np.arange(y_min, y_max, .02))
z = clf3.predict(np.c_[xx.ravel(), yy.ravel()])
# Color each predicted region differently
z = z.reshape(xx.shape)
plt.figure()
plt.pcolormesh(xx, yy, z, cmap=cmap_light)
# Overlay the samples as a scatter plot
plt.scatter(X[:,0], X[:,1], c=y, cmap=cmap_bold, edgecolor="k", s=20)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("Classifier:(max_depth = 5)")
plt.show()
```
The performance improves further. Next I use the graphviz library to visualize the fitted tree.
Install with: `pip install -i https://pypi.tuna.tsinghua.edu.cn/simple graphviz`
```python
%pwd
```
```
'D:\\python code\\8messy'
```
```python
# Import the graphviz package
import graphviz
# scikit-learn's exporter to the graphviz format
from sklearn.tree import export_graphviz
# Export the max_depth=3 model
export_graphviz(clf2, out_file="./wine.dot", class_names=wine.target_names,
                feature_names=wine.feature_names[:2], impurity=False, filled=True)
# Open the dot file
with open('./wine.dot') as f:
    dot_graph = f.read()
# Render the graph described by the dot file
graphviz.Source(dot_graph)
```