ID3 Decision Tree
Information entropy is the most commonly used measure of the purity of a sample set. It is defined as

$$Ent(D) = -\sum_{k=1}^{|\mathcal{Y}|}p_{k}\log_{2}p_{k}$$

where $D=\left \{ (x_{1},y_{1}),(x_{2},y_{2}),\cdots,(x_{n},y_{n}) \right \}$ is the sample set, $|\mathcal{Y}|$ is the total number of classes (2 for binary classification), and $p_{k}$ is the proportion of samples belonging to class $k$, with $0 \leq p_{k}\leq 1$ and $\sum_{k=1}^{|\mathcal{Y}|}p_{k} = 1$. The smaller $Ent(D)$, the higher the purity.
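As a concrete illustration, here is a minimal sketch of this definition in code (the helper name `entropy` and the toy numbers are our own, not from the text), using the usual convention that $0\log_{2}0 = 0$:

```python
import math

def entropy(p):
    """Ent(D) = -sum_k p_k * log2(p_k); terms with p_k = 0 contribute 0 by convention."""
    return -sum(x * math.log2(x) for x in p if x > 0)

print(entropy([0.5, 0.5]))  # 1.0: a 50/50 binary split is maximally impure
print(entropy([0.9, 0.1]))  # ~0.469: mostly one class, purer, so lower entropy
```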
Proof: $0 \leq Ent(D) \leq \log_{2}|\mathcal{Y}|$
To find the maximum of $Ent(D)$, let $|\mathcal{Y}|=n$ and $p_{k}=x_{k}$, so the problem becomes an $n$-class problem and the entropy $Ent(D)$ can be viewed as an $n$-variable real-valued function:

$$Ent(D) = f(x_{1},x_{2},\cdots,x_{n}) = -\sum_{k=1}^{n}x_{k}\log_{2}x_{k}$$

with $0 \leq x_{k}\leq 1$ and $\sum_{k=1}^{n}x_{k} = 1$. We now find the extrema of this multivariate function.
If we drop the constraint $0 \leq x_{k}\leq 1$ and keep only $\sum_{k=1}^{n}x_{k} = 1$, maximizing $f(x_{1},x_{2},\cdots,x_{n})$ is equivalent to the following minimization problem:

$$\min \sum_{k=1}^{n}x_{k}\log_{2}x_{k} \quad \text{s.t.} \quad \sum_{k=1}^{n}x_{k} = 1$$
$\sum_{k=1}^{n}x_{k}\log_{2}x_{k}$ can be viewed as the sum of $n$ copies of $x\log_{2}x$. Consider one such term on its own and let $f(x) = x\log_{2}x$; then

$$f'(x) = \log_{2}x + x\cdot \frac{1}{x\ln 2} = \log_{2}x + \frac{1}{\ln 2}, \qquad f''(x) = \frac{1}{x\ln 2}$$

For $0 < x \leq 1$ we have $f''(x)>0$, so $f(x)$ is convex, and $\sum_{k=1}^{n}x_{k}\log_{2}x_{k}$, being a sum of $n$ such convex terms, is also convex.
On $0 \leq x_{k}\leq 1$ this is therefore a convex optimization problem, and for a convex problem any point satisfying the KKT conditions is a global optimum. Since this minimization has only an equality constraint, the KKT points are exactly the points at which the first-order partial derivatives of its Lagrangian vanish.
By the method of Lagrange multipliers, the Lagrangian of this problem is

$$L(x_{1},\cdots,x_{n},\lambda ) = \sum_{k=1}^{n}x_{k}\log_{2}x_{k} + \lambda\left(\sum_{k=1}^{n}x_{k} - 1\right)$$

Take the first-order partial derivatives of the Lagrangian with respect to $x_{1},\cdots,x_{n},\lambda$ and set them to zero.
First, set the partial derivative with respect to $x_{1}$ to zero:

$$\begin{aligned} \frac{\partial L(x_{1},\cdots,x_{n},\lambda )}{\partial x_{1}} &= \frac{\partial }{\partial x_{1}}\left [ \sum_{k=1}^{n}x_{k}\log_{2}x_{k} + \lambda\left(\sum_{k=1}^{n}x_{k} - 1\right) \right ] \\ &= \log_{2}x_{1} + x_{1}\cdot \frac{1}{x_{1}\ln 2} + \lambda \\ &= \log_{2}x_{1} + \frac{1}{\ln 2} + \lambda = 0 \end{aligned}$$
which gives

$$\lambda = -\log_{2}x_{1} - \frac{1}{\ln 2}$$
Taking the partial derivatives with respect to $x_{2},\cdots,x_{n}$ in the same way yields

$$\lambda = -\log_{2}x_{1} - \frac{1}{\ln 2} = -\log_{2}x_{2} - \frac{1}{\ln 2} = \cdots = -\log_{2}x_{n} - \frac{1}{\ln 2}$$
Taking the partial derivative with respect to $\lambda$:

$$\frac{\partial L(x_{1},\cdots,x_{n},\lambda )}{\partial \lambda} = \frac{\partial }{\partial \lambda}\left [ \sum_{k=1}^{n}x_{k}\log_{2}x_{k} + \lambda\left(\sum_{k=1}^{n}x_{k} - 1\right) \right ] = \sum_{k=1}^{n}x_{k} - 1$$

and setting it to zero gives

$$\sum_{k=1}^{n}x_{k} = 1$$
Solving the system, we get

$$x_{1} = x_{2} = \cdots = x_{n} = \frac{1}{n}$$

(since $x_{1} = x_{2} = \cdots = x_{n}$ and $\sum_{k=1}^{n}x_{k} = 1$). Each $x_{k}$ must also satisfy $0 \leq x_{k}\leq 1$, and clearly $0 \leq \frac{1}{n} \leq 1$, so $x_{1} = x_{2} = \cdots = x_{n} = \frac{1}{n}$ satisfies all the constraints. It is therefore the minimizer of the minimization problem, and hence the maximizer of $f(x_{1},x_{2},\cdots,x_{n})$. Substituting $x_{1} = x_{2} = \cdots = x_{n} = \frac{1}{n}$ into $f(x_{1},x_{2},\cdots,x_{n})$ gives
$$f\left(\frac{1}{n},\frac{1}{n},\cdots,\frac{1}{n}\right) = -\sum_{k=1}^{n}\frac{1}{n}\log_{2}\frac{1}{n} = -n\cdot \frac{1}{n} \cdot \log_{2}\frac{1}{n} = \log_{2}n$$
So the maximum of $f(x_{1},x_{2},\cdots,x_{n})$ subject to $0 \leq x_{k}\leq 1$ and $\sum_{k=1}^{n}x_{k} = 1$ is $\log_{2}n$.
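A quick numerical check of this maximum, reusing the `entropy` sketch above: the uniform distribution attains $\log_{2}n$ and other distributions fall below it.

```python
n = 4
print(entropy([1 / n] * n), math.log2(n))  # both 2.0: uniform attains log2(n)
print(entropy([0.7, 0.1, 0.1, 0.1]))       # ~1.357: strictly below log2(n)
```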
Finding the minimum of $Ent(D)$
If we instead drop $\sum_{k=1}^{n}x_{k} = 1$ and keep only $0 \leq x_{k}\leq 1$, then $f(x_{1},x_{2},\cdots,x_{n})$ can be viewed as the sum of $n$ mutually independent single-variable functions:

$$f(x_{1},x_{2},\cdots,x_{n}) =\sum_{k=1}^{n}g(x_{k})$$
where $g(x_{k}) = -x_{k}\log_{2}x_{k}$ and $0 \leq x_{k}\leq 1$. When $g(x_{1}),g(x_{2}),\cdots,g(x_{n})$ each attain their minimum, $f(x_{1},x_{2},\cdots,x_{n})$ attains its minimum as well. Since $g(x_{1}),g(x_{2}),\cdots,g(x_{n})$ all share the same domain and the same expression, finding the minimum of $g(x_{1})$ also gives the minima of $g(x_{2}),\cdots,g(x_{n})$. We now find the minimum of $g(x_{1})$.
First take the first and second derivatives of $g(x_{1})$ with respect to $x_{1}$:

$$g'(x_{1}) = -\log_{2}x_{1} - x_{1}\cdot \frac{1}{x_{1}\ln 2} = -\log_{2}x_{1} - \frac{1}{\ln 2}, \qquad g''(x_{1}) = -\frac{1}{x_{1}\ln 2}$$
Clearly, for $0 < x_{1} \leq 1$, $g''(x_{1}) = -\frac{1}{x_{1}\ln 2}$ is always negative, so $g(x_{1})$ is concave (opening downward) on its domain, and its minimum must be attained on the boundary. Substituting $x_{1} = 0$ and $x_{1}=1$ into $g(x_{1})$, with the convention $0\log_{2}0 = 0$, gives

$$g(0) = -0\log_{2}0 = 0, \qquad g(1) = -\log_{2}1 = 0$$
So the minimum of $g(x_{1})$ is 0, and by the same argument the minima of $g(x_{2}),\cdots,g(x_{n})$ are also 0, so the minimum of $f(x_{1},x_{2},\cdots,x_{n})$ is 0. However, this minimum was obtained under the constraint $0 \leq x_{k}\leq 1$ alone; if we also impose $\sum_{k=1}^{n}x_{k} = 1$, the minimum of $f(x_{1},x_{2},\cdots,x_{n})$ must be greater than or equal to 0. If we set some $x_{k}=1$, then by the constraint $\sum_{k=1}^{n}x_{k} = 1$ we have $x_{1} = x_{2} = \cdots = x_{k-1} = x_{k+1} = \cdots = x_{n} = 0$, and substituting this into $f(x_{1},x_{2},\cdots,x_{n})$ gives
$$f(0,0,\cdots,1,0,\cdots,0) = -0\log_{2}0 - 0\log_{2}0 - \cdots - 1\log_{2}1 - 0\log_{2}0 - \cdots - 0\log_{2}0 = 0$$
So $x_{k} = 1$, $x_{1} = x_{2} = \cdots = x_{k-1} = x_{k+1} = \cdots = x_{n} = 0$ attains the minimum of $f(x_{1},x_{2},\cdots,x_{n})$ subject to $0 \leq x_{k}\leq 1$ and $\sum_{k=1}^{n}x_{k} = 1$, and that minimum is 0.
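And the lower bound, again with the same sketch:

```python
print(entropy([1.0, 0.0, 0.0, 0.0]))  # 0.0: a one-hot distribution (a pure node) attains the minimum
```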
Conditional entropy: a measure of the purity of a sample set given the value of an attribute $a$:

$$H(D|a) = \sum_{v=1}^{V}\frac{|D^v|}{|D|}Ent(D^v)$$

where $a$ is an attribute of the samples. Suppose $a$ has $V$ possible values $\left \{ a^1,a^2,\cdots,a^V\right \}$; the subset of samples in $D$ that take value $a^v$ on attribute $a$ is denoted $D^v$, and $Ent(D^v)$ is the information entropy of $D^v$. The smaller $H(D|a)$, the higher the purity.
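A minimal sketch of this computation, reusing the `entropy` helper above (the name `conditional_entropy` and the list-of-dicts sample layout are our own assumptions): group the samples by their value on attribute $a$, then take the size-weighted average of the subsets' entropies.

```python
from collections import defaultdict

def conditional_entropy(xs, ys, attr):
    """H(D|a): size-weighted average entropy of the subsets D^v induced by attr."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x[attr]].append(y)
    h = 0.0
    for labels in groups.values():
        p = [labels.count(c) / len(labels) for c in set(labels)]
        h += len(labels) / len(ys) * entropy(p)
    return h
```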
An ID3 decision tree selects the splitting attribute by the information gain criterion. The information gain is

$$\begin{aligned} Gain(D,a) &= Ent(D) - \sum_{v=1}^{V}\frac{|D^v|}{|D|}Ent(D^v) \\ &= Ent(D) - H(D|a) \end{aligned}$$

The attribute with the largest information gain is chosen as the splitting attribute, because a larger information gain means a larger "purity gain" obtained by splitting on that attribute.
An ID3 tree that splits by information gain is biased toward attributes with many possible values:

$$\begin{aligned} Gain(D,a) &= Ent(D) - \sum_{v=1}^{V}\frac{|D^v|}{|D|}Ent(D^v) \\ &= Ent(D) - \sum_{v=1}^{V}\frac{|D^v|}{|D|}\left(-\sum_{k=1}^{|\mathcal{Y}|}p_{k}\log_{2}p_{k}\right) \\ &= Ent(D) - \sum_{v=1}^{V}\frac{|D^v|}{|D|}\left(-\sum_{k=1}^{|\mathcal{Y}|}\frac{|D_{k}^{v}|}{|D^v|}\log_{2}\frac{|D_{k}^{v}|}{|D^v|}\right) \end{aligned}$$

where $D_{k}^{v}$ denotes the samples in $D$ that take value $a^{v}$ on attribute $a$ and belong to class $k$, so within $D^v$ the class proportions are $p_{k} = |D_{k}^{v}|/|D^v|$.
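Putting the pieces together, a sketch of the gain computation with the hypothetical helpers above:

```python
def information_gain(xs, ys, attr):
    """Gain(D, a) = Ent(D) - H(D|a)."""
    p = [ys.count(c) / len(ys) for c in set(ys)]
    return entropy(p) - conditional_entropy(xs, ys, attr)

# Toy usage: "color" separates the classes perfectly here.
xs = [{"color": "green"}, {"color": "green"}, {"color": "red"}]
ys = ["good", "good", "bad"]
print(information_gain(xs, ys, "color"))  # ~0.918, i.e. all of Ent(D)
```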
C4.5 Decision Tree
A C4.5 decision tree selects the splitting attribute by the gain ratio criterion. The gain ratio is

$$Gain\_ratio(D,a) = \frac{Gain(D,a)}{IV(a)}$$
where

$$IV(a) = -\sum_{v=1}^{V}\frac{|D^v|}{|D|}\log_{2}\frac{|D^v|}{|D|}$$

$IV(a)$ grows with the number of values attribute $a$ can take, which counteracts information gain's bias toward many-valued attributes.
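A sketch under the same assumptions; note that $IV(a)$ is just the entropy of the attribute's value distribution, so the earlier helper can be reused (`intrinsic_value` is our illustrative name):

```python
def intrinsic_value(xs, attr):
    """IV(a): entropy of the distribution of attribute values."""
    values = [x[attr] for x in xs]
    return entropy([values.count(v) / len(values) for v in set(values)])

def gain_ratio(xs, ys, attr):
    return information_gain(xs, ys, attr) / intrinsic_value(xs, attr)
```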
CART Decision Tree
A CART decision tree selects the splitting attribute by the Gini index criterion.
Gini value:

$$Gini(D) = \sum_{k=1}^{|\mathcal{Y}|}\sum_{k'\neq k}p_{k}p_{k'} = \sum_{k=1}^{|\mathcal{Y}|}p_{k}\sum_{k'\neq k}p_{k'} = \sum_{k=1}^{|\mathcal{Y}|}p_{k}(1-p_{k}) = 1-\sum_{k=1}^{|\mathcal{Y}|}p_{k}^2$$
Gini index:

$$Gini\_index(D,a) =\sum_{v=1}^{V}\frac{|D^v|}{|D|}Gini(D^v)$$

The smaller the Gini value and the Gini index, the higher the purity of the sample set.
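A sketch of both quantities under the same assumed data layout (`gini` and `gini_index` are our own names):

```python
def gini(labels):
    """Gini(D) = 1 - sum_k p_k^2."""
    return 1.0 - sum((labels.count(c) / len(labels)) ** 2 for c in set(labels))

def gini_index(xs, ys, attr):
    """Gini_index(D, a): size-weighted Gini value of the subsets D^v."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x[attr]].append(y)
    return sum(len(g) / len(ys) * gini(g) for g in groups.values())
```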
CART classification algorithm
- Using the Gini index formula $Gini\_index(D,a) =\sum_{v=1}^{V}\frac{|D^v|}{|D|}Gini(D^v)$, find the attribute $a_{*}$ with the smallest Gini index.
- Compute the Gini value $Gini(D^v)$, $v=1,2,\cdots,V$, for every possible value of attribute $a_{*}$; choose the value $a_{*}^{v}$ with the smallest Gini value as the split point, and partition the set $D$ into two sets (nodes) $D_{1}$ and $D_{2}$, where $D_{1}$ holds the samples with $a_{*}=a_{*}^{v}$ and $D_{2}$ the samples with $a_{*}\neq a_{*}^{v}$.
- Repeat steps 1 and 2 on $D_{1}$ and $D_{2}$ until a stopping condition is met (one split of this procedure is sketched below).
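A minimal sketch of a single split, following the two steps above literally and reusing the hypothetical helpers:

```python
def cart_classification_split(xs, ys, attrs):
    """One CART classification split: pick a_* by Gini index, then the value whose D^v is purest."""
    # Step 1: attribute with the smallest Gini index.
    a_star = min(attrs, key=lambda a: gini_index(xs, ys, a))
    # Step 2: value of a_star whose subset D^v has the smallest Gini value.
    v_star = min({x[a_star] for x in xs},
                 key=lambda v: gini([y for x, y in zip(xs, ys) if x[a_star] == v]))
    d1 = [(x, y) for x, y in zip(xs, ys) if x[a_star] == v_star]
    d2 = [(x, y) for x, y in zip(xs, ys) if x[a_star] != v_star]
    return a_star, v_star, d1, d2
```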
CART regression algorithm
- Using the following formula, find the optimal splitting attribute $a_{*}$ and optimal split point $a_{*}^{v}$:

$$a_{*},a_{*}^{v} = \underset{a,a^v}{\arg\min}\left [\min_{c_{1}} \sum_{x_{i} \in D_{1}(a,a^v)}(y_{i}-c_{1})^2 + \min_{c_{2}} \sum_{x_{i} \in D_{2}(a,a^v)}(y_{i}-c_{2})^2 \right ]$$

where $D_{1}(a,a^v)$ denotes the samples whose value on attribute $a$ is at most $a^v$, $D_{2}(a,a^v)$ the samples whose value on attribute $a$ is greater than $a^v$, $c_{1}$ the mean output of the samples in $D_{1}$, and $c_{2}$ the mean output of the samples in $D_{2}$.
- Partition the set $D$ into two sets (nodes) $D_{1}$ and $D_{2}$ at the split point $a_{*}^{v}$.
- Repeat steps 1 and 2 on $D_{1}$ and $D_{2}$ until a stopping condition is met (a sketch of the split search follows the list).
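A minimal sketch of this split search under stated assumptions (numeric attributes stored in dicts; `best_regression_split` is our illustrative name). For a fixed split, each inner minimization over $c$ is attained at the subset mean, so the code plugs the means in directly:

```python
def squared_error(ys):
    """min_c sum_i (y_i - c)^2, attained at c = mean(ys)."""
    if not ys:
        return 0.0
    c = sum(ys) / len(ys)
    return sum((y - c) ** 2 for y in ys)

def best_regression_split(xs, ys, attrs):
    """Search every (attribute, threshold) pair; return the pair minimizing the total squared error."""
    best = None
    for attr in attrs:
        for t in sorted({x[attr] for x in xs}):
            left = [y for x, y in zip(xs, ys) if x[attr] <= t]
            right = [y for x, y in zip(xs, ys) if x[attr] > t]
            score = squared_error(left) + squared_error(right)
            if best is None or score < best[0]:
                best = (score, attr, t)
    return best  # (score, a_*, a_*^v)
```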