Decision Trees
Lei_ZM
2019-09-21
1. Information Entropy and Conditional Entropy
1.1. Information Entropy
Information entropy is the most commonly used measure of the purity of a sample set. It is defined as:

$$\operatorname{Ent}(D)=-\sum_{k=1}^{|\mathcal{Y}|} p_{k} \log _{2} p_{k}$$

where $D=\{(x_{1},y_{1}),(x_{2},y_{2}),\cdots,(x_{m},y_{m})\}$ denotes the sample set, $|\mathcal{Y}|$ the total number of classes, and $p_{k}$ the proportion of samples belonging to class $k$, with $0\leq p_{k}\leq 1$ and $\sum_{k=1}^{|\mathcal{Y}|} p_{k}=1$. The smaller $\operatorname{Ent}(D)$, the higher the purity of $D$.
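As a quick numerical companion to the definition, the entropy of a label list can be computed directly from class counts. This is a minimal sketch; the helper name `ent` and the toy label lists are ours, not from the original text:

```python
from collections import Counter
from math import log2

def ent(labels):
    """Ent(D): information entropy of a list of class labels."""
    m = len(labels)
    return -sum((c / m) * log2(c / m) for c in Counter(labels).values())

assert ent(["a", "a", "a", "a"]) == 0    # a pure set has entropy 0
assert ent(["a", "a", "b", "b"]) == 1.0  # a 50/50 two-class split has entropy 1
```

The convention $0\log_{2}0=0$ is handled implicitly: classes absent from the list contribute no term to the sum.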
Proof that $0\leq \operatorname{Ent}(D)\leq \log_{2}|\mathcal{Y}|$:

Maximum of $\operatorname{Ent}(D)$:

Let $|\mathcal{Y}|=n$ and $p_{k}=x_{k}$. The information entropy $\operatorname{Ent}(D)$ can then be viewed as an $n$-variable real-valued function:

$$\operatorname{Ent}(D)=f\left(x_{1}, \ldots, x_{n}\right)=-\sum_{k=1}^{n} x_{k} \log _{2} x_{k}$$

where $0\leq x_{k}\leq 1$ and $\sum_{k=1}^{n} x_{k}=1$. We now find the extrema of this multivariate function.
If we drop the constraint $0\leq x_{k}\leq 1$ and keep only $\sum_{k=1}^{n} x_{k}=1$, maximizing $f(x_{1},x_{2},\cdots,x_{n})$ is equivalent to the following minimization problem:

$$\begin{array}{ll} \min & \sum_{k=1}^{n} x_{k} \log _{2} x_{k} \\ \text{s.t.} & \sum_{k=1}^{n} x_{k}=1 \end{array}$$
For $0\leq x_{k}\leq 1$ the objective is convex, so this is a convex optimization problem, and for a convex problem any point satisfying the KKT conditions is a global optimum. Since this minimization problem has only an equality constraint, the KKT points are exactly the points at which all first-order partial derivatives of the Lagrangian vanish.
By the method of Lagrange multipliers, the Lagrangian of this problem is:

$$L\left(x_{1}, \ldots, x_{n}, \lambda\right)=\sum_{k=1}^{n} x_{k} \log _{2} x_{k}+\lambda\left(\sum_{k=1}^{n} x_{k}-1\right)$$
Taking the first-order partial derivatives of the Lagrangian with respect to $x_{1}, \ldots, x_{n}, \lambda$ and setting them to zero gives:

$$\begin{aligned} \frac{\partial L\left(x_{1}, \ldots, x_{n}, \lambda\right)}{\partial x_{1}} &=\frac{\partial}{\partial x_{1}}\left[\sum_{k=1}^{n} x_{k} \log _{2} x_{k}+\lambda\left(\sum_{k=1}^{n} x_{k}-1\right)\right] \\ &=\log_{2} x_{1} + x_{1} \cdot \frac{1}{x_{1} \ln 2} + \lambda \\ &=\log_{2} x_{1} + \frac{1}{\ln 2} + \lambda = 0 \\ &\Rightarrow \lambda=-\log_{2} x_{1} - \frac{1}{\ln 2} \end{aligned}$$
Similarly:

$$\lambda=-\log_{2} x_{1} - \frac{1}{\ln 2}=-\log_{2} x_{2} - \frac{1}{\ln 2}=\cdots=-\log_{2} x_{n} - \frac{1}{\ln 2}$$
Moreover, for the partial derivative with respect to $\lambda$:

$$\frac{\partial L\left(x_{1}, \ldots, x_{n}, \lambda\right)}{\partial \lambda} =\frac{\partial}{\partial \lambda}\left[\sum_{k=1}^{n} x_{k} \log _{2} x_{k}+\lambda\left(\sum_{k=1}^{n} x_{k}-1\right)\right]=0 \Rightarrow \sum_{k=1}^{n} x_{k}=1$$
Solving these equations gives:

$$x_{1}=x_{2}=\cdots=x_{n}=\frac{1}{n}$$
Since each $x_{k}$ must also satisfy $0\leq x_{k}\leq 1$, and clearly $0\leq \frac{1}{n}\leq 1$, the point $x_{1}=x_{2}=\cdots=x_{n}=\frac{1}{n}$ satisfies all the constraints: it is the minimizer of the current minimization problem, and hence the maximizer of $f(x_{1},x_{2},\cdots,x_{n})$. Substituting it into $f(x_{1},x_{2},\cdots,x_{n})$ gives:

$$f\left(\frac{1}{n}, \ldots, \frac{1}{n}\right) =-\sum_{k=1}^{n} \frac{1}{n} \log _{2} \frac{1}{n} =-n \cdot \frac{1}{n} \log _{2} \frac{1}{n}=\log _{2} n$$

So the maximum of $f(x_{1},x_{2},\cdots,x_{n})$ subject to $0\leq x_{k}\leq 1$ and $\sum_{k=1}^{n} x_{k}=1$ is $\log_{2} n$.
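This maximum can be sanity-checked numerically: the uniform distribution attains $\log_2 n$, and randomly sampled probability vectors never exceed it. A sketch (all names and the random-sampling check are ours):

```python
import random
from math import log2

def f(p):
    # entropy of a probability vector; 0*log2(0) is treated as 0
    return -sum(x * log2(x) for x in p if x > 0)

n = 4
assert abs(f([1 / n] * n) - log2(n)) < 1e-12  # uniform attains log2(n)

random.seed(0)
for _ in range(10_000):
    w = [random.random() for _ in range(n)]
    p = [x / sum(w) for x in w]
    assert f(p) <= log2(n) + 1e-12            # no distribution exceeds log2(n)
```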
Minimum of $\operatorname{Ent}(D)$:

If we drop the constraint $\sum_{k=1}^{n} x_{k}=1$ and keep only $0\leq x_{k}\leq 1$, then $f(x_{1},x_{2},\cdots,x_{n})$ can be viewed as the sum of $n$ independent one-variable functions:

$$f\left(x_{1},x_{2}, \ldots, x_{n}\right)=\sum_{k=1}^{n} g\left(x_{k}\right)$$

where $g(x_{k})=-x_{k}\log_{2} x_{k}$ with $0\leq x_{k}\leq 1$. When $g(x_{1}),g(x_{2}),\cdots,g(x_{n})$ each attain their minimum, $f(x_{1},x_{2},\cdots,x_{n})$ attains its minimum as well. Since $g(x_{1}),g(x_{2}),\cdots,g(x_{n})$ all have the same domain and the same expression, finding the minimum of $g(x_{1})$ also gives the minima of $g(x_{2}),\cdots,g(x_{n})$. We therefore consider the minimum of $g(x_{1})$.
First take the first and second derivatives of $g(x_{1})$ with respect to $x_{1}$:

$$g^{\prime}\left(x_{1}\right)=\frac{d\left(-x_{1} \log _{2} x_{1}\right)}{d x_{1}}=-\log _{2} x_{1}-x_{1} \cdot \frac{1}{x_{1} \ln 2}=-\log _{2} x_{1}-\frac{1}{\ln 2}$$

$$g^{\prime \prime}\left(x_{1}\right)=\frac{d\left(g^{\prime}\left(x_{1}\right)\right)}{d x_{1}}=\frac{d\left(-\log _{2} x_{1}-\frac{1}{\ln 2}\right)}{d x_{1}}=-\frac{1}{x_{1} \ln 2}$$
Clearly, for $0 < x_{1}\leq 1$, $g^{\prime\prime}(x_{1})=-\frac{1}{x_{1} \ln 2}$ is always negative, so $g(x_{1})$ is concave on its domain, and its minimum must be attained at a boundary point. Evaluating $g$ at $x_{1}=0$ and $x_{1}=1$ (with the convention $0\log_{2}0=0$) gives:

$$g(0)=-0\log_{2} 0=0, \qquad g(1)=-1\log_{2} 1=0$$
So the minimum of $g(x_{1})$ is 0, and likewise the minima of $g(x_{2}),\cdots,g(x_{n})$ are 0, so the minimum of $f(x_{1},x_{2},\ldots,x_{n})$ is 0 as well. However, this minimum was obtained considering only $0\leq x_{k}\leq 1$; adding the constraint $\sum_{k=1}^{n} x_{k}=1$ can only make the minimum of $f(x_{1},x_{2},\ldots,x_{n})$ greater than or equal to 0. If we set some $x_{k}=1$, then by the constraint $\sum_{k=1}^{n} x_{k}=1$ we must have $x_{1}=x_{2}=\cdots=x_{k-1}=x_{k+1}=\cdots=x_{n}=0$. Substituting into $f(x_{1},x_{2},\ldots,x_{n})$:

$$f(0,\cdots,0,1,0,\cdots,0)=-0\log_{2} 0-\cdots-0\log_{2} 0-1\log_{2} 1-0\log_{2} 0-\cdots-0\log_{2} 0=0$$

So $x_{k}=1$ together with $x_{1}=\cdots=x_{k-1}=x_{k+1}=\cdots=x_{n}=0$ is a minimizer of $f(x_{1},x_{2},\ldots,x_{n})$ under the constraints $0\leq x_{k}\leq 1$ and $\sum_{k=1}^{n} x_{k}=1$, and the minimum value is 0.
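And the minimum end of the bound: any one-hot probability vector gives entropy 0, under the same convention $0\log_{2}0=0$ (names ours):

```python
from math import log2

def f(p):
    # entropy of a probability vector; 0*log2(0) is treated as 0
    return -sum(x * log2(x) for x in p if x > 0)

assert f([0.0, 1.0, 0.0, 0.0]) == 0  # one-hot vector: the minimum, 0
assert f([0.25] * 4) == 2.0          # uniform vector: the maximum, log2(4)
```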
1.2. Conditional Entropy

Conditional entropy measures the purity of a sample set given the value of an attribute $a$:

$$H(D \mid a)=\sum_{v=1}^{V} \frac{\left|D^{v}\right|}{|D|} \operatorname{Ent}\left(D^{v}\right)$$

Here $a$ is an attribute of the samples. Suppose $a$ has $V$ possible values $\{a^{1},a^{2},\cdots,a^{V}\}$; then $D^{v}$ denotes the subset of $D$ whose samples take the value $a^{v}$ on attribute $a$, and $\operatorname{Ent}(D^{v})$ is the information entropy of $D^{v}$. The smaller $H(D \mid a)$, the higher the purity.
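Conditional entropy is just the per-subset entropy weighted by subset size. A minimal sketch (the helper names and the toy attribute/label arrays are ours):

```python
from collections import Counter, defaultdict
from math import log2

def ent(labels):
    m = len(labels)
    return -sum((c / m) * log2(c / m) for c in Counter(labels).values())

def cond_ent(attr_values, labels):
    """H(D|a): entropy of each subset D^v, weighted by |D^v|/|D|."""
    groups = defaultdict(list)
    for v, y in zip(attr_values, labels):
        groups[v].append(y)
    m = len(labels)
    return sum(len(g) / m * ent(g) for g in groups.values())

# toy data: attribute a with values {0, 1}, binary class labels
a = [0, 0, 0, 1, 1, 1]
y = [1, 1, 0, 0, 0, 0]
print(cond_ent(a, y))  # ~0.459: one subset is mixed, the other is pure
```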
2. ID3 Decision Tree

A decision tree that selects the splitting attribute by the information-gain criterion.

Information gain:

$$\begin{aligned} \operatorname{Gain}(D, a) &=\operatorname{Ent}(D)-\sum_{v=1}^{V} \frac{\left|D^{v}\right|}{|D|} \operatorname{Ent}\left(D^{v}\right) \\ &=\operatorname{Ent}(D)-H(D \mid a) \end{aligned}$$

The attribute with the largest information gain is chosen as the splitting attribute, because a larger information gain means a larger purity improvement from splitting on that attribute.
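The ID3 selection step is then simply an arg-max of the gain over the candidate attributes. A sketch (all helper names and toy data are ours):

```python
from collections import Counter, defaultdict
from math import log2

def ent(labels):
    m = len(labels)
    return -sum((c / m) * log2(c / m) for c in Counter(labels).values())

def cond_ent(attr_values, labels):
    groups = defaultdict(list)
    for v, y in zip(attr_values, labels):
        groups[v].append(y)
    m = len(labels)
    return sum(len(g) / m * ent(g) for g in groups.values())

def gain(attr_values, labels):
    """Gain(D, a) = Ent(D) - H(D|a)."""
    return ent(labels) - cond_ent(attr_values, labels)

# a1 separates the classes perfectly; a2 carries no information
data = {"a1": [0, 0, 1, 1], "a2": [0, 1, 0, 1]}
y = [0, 0, 1, 1]
best = max(data, key=lambda name: gain(data[name], y))
print(best)  # a1
```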
An analogy:

Information gain $\operatorname{Gain}(D, a)$: how much weight you lost.

Information entropy $\operatorname{Ent}(D)=-\sum_{k=1}^{|\mathcal{Y}|} p_{k} \log _{2} p_{k}$: your weight before losing weight.

Conditional entropy $H(D \mid a)=\sum_{v=1}^{V} \frac{\left|D^{v}\right|}{|D|} \operatorname{Ent}\left(D^{v}\right)$: your weight after doing a particular exercise $a$.

The larger the information gain, the more weight that exercise made you lose.

An ID3 tree built with the information-gain criterion is biased toward attributes with many possible values:
$$\begin{aligned} \operatorname{Gain}(D, a) &=\operatorname{Ent}(D)-\sum_{v=1}^{V} \frac{\left|D^{v}\right|}{|D|} \operatorname{Ent}\left(D^{v}\right) \\ &=\operatorname{Ent}(D)-\sum_{v=1}^{V} \frac{\left|D^{v}\right|}{|D|}\left(-\sum_{k=1}^{|\mathcal{Y}|} p_{k} \log _{2} p_{k}\right) \\ &=\operatorname{Ent}(D)-\sum_{v=1}^{V} \frac{\left|D^{v}\right|}{|D|}\left(-\sum_{k=1}^{|\mathcal{Y}|} \frac{\left|D_{k}^{v}\right|}{\left|D^{v}\right|} \log _{2} \frac{\left|D_{k}^{v}\right|}{\left|D^{v}\right|}\right) \end{aligned}$$

Here $D_{k}^{v}$ denotes the samples in $D$ that take the value $a^{v}$ on attribute $a$ and belong to class $k$. The more values an attribute has, the smaller and purer each subset $D^{v}$ tends to be, which inflates the gain.
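The bias is easy to reproduce: an ID-like attribute that is unique per sample makes every $D^{v}$ a pure singleton, so $H(D \mid a)=0$ and the gain reaches its ceiling $\operatorname{Ent}(D)$. A sketch (data and names ours):

```python
from collections import Counter, defaultdict
from math import log2

def ent(labels):
    m = len(labels)
    return -sum((c / m) * log2(c / m) for c in Counter(labels).values())

def cond_ent(attr_values, labels):
    groups = defaultdict(list)
    for v, y in zip(attr_values, labels):
        groups[v].append(y)
    m = len(labels)
    return sum(len(g) / m * ent(g) for g in groups.values())

def gain(attr_values, labels):
    return ent(labels) - cond_ent(attr_values, labels)

ids = list(range(6))      # a useless "sample ID" attribute, unique per sample
a   = [0, 0, 0, 1, 1, 1]  # a genuinely informative binary attribute
y   = [0, 0, 1, 1, 1, 0]

assert gain(ids, y) == ent(y)     # the ID attribute gets the maximal gain
assert gain(a, y) < gain(ids, y)  # and beats the sensible attribute
```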
3. C4.5 Decision Tree

A decision tree that selects the splitting attribute by the gain-ratio criterion.

Gain ratio:

$$\begin{aligned} \text{Gain-ratio}(D, a) &=\frac{\operatorname{Gain}(D, a)}{\operatorname{IV}(a)} \\ &=\frac{\operatorname{Ent}(D)-\sum_{v=1}^{V} \frac{\left|D^{v}\right|}{|D|} \operatorname{Ent}\left(D^{v}\right)}{-\sum_{v=1}^{V} \frac{\left|D^{v}\right|}{|D|} \log _{2} \frac{\left|D^{v}\right|}{|D|}} \\ &=\frac{-\sum_{k=1}^{|\mathcal{Y}|} p_{k_{D}} \log _{2} p_{k_{D}}-\sum_{v=1}^{V} \frac{\left|D^{v}\right|}{|D|}\left(-\sum_{k=1}^{|\mathcal{Y}|} p_{k_{D^{v}}} \log _{2} p_{k_{D^{v}}}\right)}{-\sum_{v=1}^{V} \frac{\left|D^{v}\right|}{|D|} \log _{2} \frac{\left|D^{v}\right|}{|D|}} \\ &=\frac{-\sum_{k=1}^{|\mathcal{Y}|} \frac{\left|D_{k}\right|}{|D|} \log _{2} \frac{\left|D_{k}\right|}{|D|}-\sum_{v=1}^{V} \frac{\left|D^{v}\right|}{|D|}\left(-\sum_{k=1}^{|\mathcal{Y}|} \frac{\left|D_{k}^{v}\right|}{\left|D^{v}\right|} \log _{2} \frac{\left|D_{k}^{v}\right|}{\left|D^{v}\right|}\right)}{-\sum_{v=1}^{V} \frac{\left|D^{v}\right|}{|D|} \log _{2} \frac{\left|D^{v}\right|}{|D|}} \end{aligned}$$

where $\operatorname{IV}(a)=-\sum_{v=1}^{V} \frac{\left|D^{v}\right|}{|D|} \log _{2} \frac{\left|D^{v}\right|}{|D|}$ is the intrinsic value of attribute $a$: the more possible values $a$ has, the larger $\operatorname{IV}(a)$ tends to be, which counteracts the bias of plain information gain.
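C4.5's normalization can be sketched by dividing the gain by the intrinsic value $\operatorname{IV}(a)$, which is the entropy of the split proportions themselves. In the toy case below (helpers and data ours), a many-valued ID attribute ties a binary attribute on raw gain but loses on gain ratio:

```python
from collections import Counter, defaultdict
from math import log2

def ent(labels):
    m = len(labels)
    return -sum((c / m) * log2(c / m) for c in Counter(labels).values())

def cond_ent(attr_values, labels):
    groups = defaultdict(list)
    for v, y in zip(attr_values, labels):
        groups[v].append(y)
    m = len(labels)
    return sum(len(g) / m * ent(g) for g in groups.values())

def gain(attr_values, labels):
    return ent(labels) - cond_ent(attr_values, labels)

def iv(attr_values):
    """IV(a): entropy of the proportions |D^v|/|D|."""
    m = len(attr_values)
    return -sum((c / m) * log2(c / m) for c in Counter(attr_values).values())

def gain_ratio(attr_values, labels):
    return gain(attr_values, labels) / iv(attr_values)

y   = [0, 0, 0, 1, 1, 1]
ids = list(range(6))      # many-valued ID attribute
a   = [0, 0, 0, 1, 1, 1]  # binary attribute with the same (maximal) gain
assert gain(ids, y) == gain(a, y) == 1.0
print(gain_ratio(ids, y))  # ~0.387: penalized by IV(ids) = log2(6)
print(gain_ratio(a, y))    # 1.0: the binary attribute is now preferred
```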
4. CART Decision Tree

A decision tree that selects the splitting attribute by the Gini-index criterion.

Gini value:

$$\operatorname{Gini}(D) =\sum_{k=1}^{|\mathcal{Y}|} \sum_{k^{\prime} \neq k} p_{k} p_{k^{\prime}} =\sum_{k=1}^{|\mathcal{Y}|} p_{k} \sum_{k^{\prime} \neq k} p_{k^{\prime}} =\sum_{k=1}^{|\mathcal{Y}|} p_{k}\left(1-p_{k}\right) =1-\sum_{k=1}^{|\mathcal{Y}|} p_{k}^{2}$$

Gini index:

$$\operatorname{Gini\_index}(D, a) =\sum_{v=1}^{V} \frac{\left|D^{v}\right|}{|D|} \operatorname{Gini}\left(D^{v}\right)$$

The smaller the Gini value and Gini index, the higher the purity of the sample set.
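Both quantities mirror the entropy versions, with $1-\sum_k p_k^2$ in place of the log term. A minimal sketch (helper names and data ours):

```python
from collections import Counter, defaultdict

def gini(labels):
    """Gini(D) = 1 - sum_k p_k^2."""
    m = len(labels)
    return 1.0 - sum((c / m) ** 2 for c in Counter(labels).values())

def gini_index(attr_values, labels):
    """Gini_index(D, a): Gini of each D^v, weighted by |D^v|/|D|."""
    groups = defaultdict(list)
    for v, y in zip(attr_values, labels):
        groups[v].append(y)
    m = len(labels)
    return sum(len(g) / m * gini(g) for g in groups.values())

assert gini([0, 0, 0, 0]) == 0.0  # pure set
assert gini([0, 0, 1, 1]) == 0.5  # 50/50 two-class split
print(gini_index([0, 0, 1, 1], [0, 0, 1, 1]))  # 0.0: a perfect split
```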
CART classification algorithm:

- Using the Gini-index formula $\operatorname{Gini\_index}(D, a)=\sum_{v=1}^{V} \frac{\left|D^{v}\right|}{|D|} \operatorname{Gini}\left(D^{v}\right)$, find the attribute $a_{*}$ with the smallest Gini index.
- Compute the Gini value $\operatorname{Gini}(D^{v})$, $v=1,2,\cdots,V$, for each possible value of $a_{*}$, and choose the value $a_{*}^{v}$ with the smallest Gini value as the split point. Split $D$ into two sets (nodes) $D_{1}$ and $D_{2}$, where $D_{1}$ contains the samples with $a_{*}=a_{*}^{v}$ and $D_{2}$ the samples with $a_{*}\neq a_{*}^{v}$.
- Repeat steps 1 and 2 on $D_{1}$ and $D_{2}$ until a stopping condition is met.
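The classification steps can be sketched as a search over (attribute, value) pairs. For brevity this sketch scores every binary split $a=v$ versus $a\neq v$ directly by its weighted Gini, rather than picking the attribute first and the value second; helper names and toy data are ours:

```python
from collections import Counter

def gini(labels):
    m = len(labels)
    return 1.0 - sum((c / m) ** 2 for c in Counter(labels).values())

def best_cart_split(data, labels):
    """Smallest weighted Gini over all binary splits D1 (a = v) / D2 (a != v)."""
    m = len(labels)
    best, best_score = None, float("inf")
    for name, values in data.items():
        for v in set(values):
            d1 = [y for x, y in zip(values, labels) if x == v]
            d2 = [y for x, y in zip(values, labels) if x != v]
            if not d1 or not d2:
                continue  # skip degenerate splits
            score = len(d1) / m * gini(d1) + len(d2) / m * gini(d2)
            if score < best_score:
                best, best_score = (name, v), score
    return best, best_score

data = {"a1": [0, 0, 1, 1], "a2": [0, 1, 0, 1]}
y = [0, 0, 1, 1]
print(best_cart_split(data, y))  # splitting on a1 separates the classes
```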
CART regression algorithm:

- Find the optimal splitting attribute $a_{*}$ and split point $a_{*}^{v}$ according to:

$$a_{*}, a_{*}^{v} =\underset{a, a^{v}}{\arg \min }\left[\min _{c_{1}} \sum_{\boldsymbol{x}_{i} \in D_{1}\left(a, a^{v}\right)}\left(y_{i}-c_{1}\right)^{2}+\min _{c_{2}} \sum_{\boldsymbol{x}_{i} \in D_{2}\left(a, a^{v}\right)}\left(y_{i}-c_{2}\right)^{2}\right]$$

  where $D_{1}(a, a^{v})$ is the set of samples whose value on attribute $a$ is at most $a^{v}$, $D_{2}(a, a^{v})$ the set whose value on $a$ exceeds $a^{v}$, and $c_{1}$, $c_{2}$ are the mean outputs of $D_{1}$ and $D_{2}$, respectively.
- Split $D$ into two sets (nodes) $D_{1}$ and $D_{2}$ at the split point $a_{*}^{v}$.
- Repeat steps 1 and 2 on $D_{1}$ and $D_{2}$ until a stopping condition is met.
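The regression criterion can be sketched directly: for a fixed $(a, a^{v})$ the inner minimizations are solved by the subset means, so the search just tries every candidate split point. Helper names and toy data are ours:

```python
def best_regression_split(data, targets):
    """arg min over (a, a^v) of the two squared-error terms, with
    c1, c2 fixed to the subset means (which minimize the inner sums)."""
    best, best_loss = None, float("inf")
    for name, values in data.items():
        for v in sorted(set(values))[:-1]:  # candidate split points
            d1 = [t for x, t in zip(values, targets) if x <= v]
            d2 = [t for x, t in zip(values, targets) if x > v]
            c1, c2 = sum(d1) / len(d1), sum(d2) / len(d2)
            loss = (sum((t - c1) ** 2 for t in d1)
                    + sum((t - c2) ** 2 for t in d2))
            if loss < best_loss:
                best, best_loss = (name, v), loss
    return best, best_loss

data = {"x": [1.0, 2.0, 3.0, 4.0]}
t = [1.1, 0.9, 3.0, 3.2]
print(best_regression_split(data, t))  # the best split is at x <= 2.0
```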