Why Build Multiple Trees
Gradient Descent in Function Space
Let the samples be

$$(x^j,\ y^j),\quad j=1,2,\cdots,n$$
For a regression problem, the loss function is

$$L=\sum_{j=1}^n l(x^j,y^j,f^j)=L(f^1,f^2,\cdots,f^n)=L(F)$$
For a binary classification problem, the loss function is

$$L=\sum_{j=1}^n l(x^j,y^j,\sigma(f^j))=L(f^1,f^2,\cdots,f^n)=L(F)$$
Here $F$ is a point in function space, the multi-dimensional vector of per-sample function values $(f^1,f^2,\cdots,f^n)$.
The goal is to find

$$F=\underset{F}{\arg\min}\ L(F)$$
Starting from an initial point $F_0$, gradient descent gives

$$F_1=F_0-\eta\nabla L\vert_{F=F_0},\quad\cdots,\quad F_i=F_{i-1}-\eta\nabla L\vert_{F=F_{i-1}}$$
Therefore, after $m+1$ steps,

$$F=F_0+\eta\sum_{i=0}^m-\nabla L\vert_{F=F_i}$$
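As a sanity check, the update above can be run directly on the vector of per-sample function values. A minimal sketch with made-up targets, using the squared loss so that the gradient has components $2(f^j-y^j)$:

```python
# Gradient descent on the function values F = (f^1, ..., f^n) themselves,
# with squared loss L = sum_j (f^j - y^j)^2 (toy targets, not from the text).
y = [3.0, -1.0, 2.0]   # targets y^j
F = [0.0, 0.0, 0.0]    # initial point F_0
eta = 0.1              # learning rate

for _ in range(100):
    grad = [2.0 * (f - t) for f, t in zip(F, y)]   # coordinates of grad L
    F = [f - eta * g for f, g in zip(F, grad)]

print([round(f, 4) for f in F])  # each f^j converges to its target y^j
```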
Taking each coordinate separately,

$$f^j=f_0^j+\eta\sum_{i=0}^m-\frac{\partial L}{\partial f^j}\Big\vert_{f^j=f_i^j}=f_0^j+\eta\sum_{i=0}^m-\frac{\partial l}{\partial f^j}\Big\vert_{f^j=f_i^j},\quad j=1,2,\cdots,n$$

where the second equality holds because only the $j$-th summand of $L$ depends on $f^j$.
Regard the values $f^j$ as a single function $f$ of $x$, i.e.

$$f=\begin{cases}f^1, & x=x^1\\ f^2, & x=x^2\\ \cdots\\ f^n, & x=x^n\end{cases}=f(x)$$
Therefore

$$f=f_0+\eta\sum_{i=0}^m-\frac{\partial l}{\partial f}\Big\vert_{f=f_i}$$
Let

$$T_0=f_0,\quad T_1=-\frac{\partial l}{\partial f}\Big\vert_{f=f_0},\quad\cdots,\quad T_{i+1}=-\frac{\partial l}{\partial f}\Big\vert_{f=f_i}$$
Then

$$f=T_0+\eta T_1+\cdots+\eta T_{m+1}$$

Each correction term is one tree, which is why boosting builds many trees.
$T_0$ is the initial function and may be chosen freely.
For the base learner $T_{i+1}$, we then have

$$T_{i+1}(x)=\begin{cases}-\frac{\partial l}{\partial f}\big\vert_{f=f_i^1}, & x=x^1\\ -\frac{\partial l}{\partial f}\big\vert_{f=f_i^2}, & x=x^2\\ \cdots\\ -\frac{\partial l}{\partial f}\big\vert_{f=f_i^n}, & x=x^n\end{cases}$$
Residual Derivation
For a regression problem, take the squared loss $l=(y^j-f)^2$. Then

$$T_{i+1}=-\frac{\partial l}{\partial f}\Big\vert_{f=f_i}=2(y^j-f_i)$$

i.e. each new tree fits (twice) the residual $y^j-f_i$.
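This sign can be confirmed numerically. A small sketch with toy values, using a central finite difference:

```python
# For l = (y - f)^2 the negative gradient -dl/df equals 2*(y - f),
# i.e. twice the residual; verify with a central finite difference.
def l(f, y):
    return (y - f) ** 2

f, y, eps = 0.7, 1.0, 1e-6
neg_grad = -(l(f + eps, y) - l(f - eps, y)) / (2 * eps)
print(neg_grad, 2 * (y - f))  # both approximately 0.6
```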
For a binary classification problem, take the cross-entropy loss $l=-\left[y^j\ln(\sigma(f))+(1-y^j)\ln(1-\sigma(f))\right]$ (with a leading minus sign, so that minimizing $l$ maximizes the likelihood). Then

$$T_{i+1}=-\frac{\partial l}{\partial f}\Big\vert_{f=f_i}=y^j-\sigma(f_i)$$
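The same numerical check for the classification case (toy values; `sigmoid` plays the role of $\sigma$):

```python
import math

# For l = -[y ln(sigma(f)) + (1-y) ln(1-sigma(f))], the negative gradient
# -dl/df equals the residual y - sigma(f); verify numerically (toy values).
def sigmoid(f):
    return 1.0 / (1.0 + math.exp(-f))

def l(f, y):
    p = sigmoid(f)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

f, y, eps = 0.3, 1.0, 1e-6
neg_grad = -(l(f + eps, y) - l(f - eps, y)) / (2 * eps)
print(neg_grad, y - sigmoid(f))  # both approximately 0.4256
```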
Tree-Building Process
The First Tree
The first tree is a single root node: every $x^j$ gets the same output $c$.
For a regression problem,

$$T_0=c=\underset{c}{\arg\min}\ L(c)=\underset{c}{\arg\min}\sum_{j=1}^n l(x^j,y^j,c)=\underset{c}{\arg\min}\sum_{j=1}^n(c-y^j)^2$$
Setting the derivative to zero,

$$\frac{\partial L}{\partial c}=2\sum_{j=1}^n(c-y^j)=0$$
which gives

$$c=\frac1n\sum_{j=1}^n y^j$$
For a binary classification problem,

$$T_0=c=\underset{c}{\arg\min}\ L(c)=\underset{c}{\arg\min}\sum_{j=1}^n l(x^j,y^j,\sigma(c))=\underset{c}{\arg\min}\sum_{j=1}^n-\left[y^j\ln(\sigma(c))+(1-y^j)\ln(1-\sigma(c))\right]$$
Setting the derivative to zero,

$$\frac{\partial L}{\partial c}=\sum_{j=1}^n(\sigma(c)-y^j)=0$$
which gives

$$c=\sigma^{-1}\left(\frac1n\sum_{j=1}^n y^j\right)$$

i.e. the log-odds of the positive-class frequency.
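Both root-node constants are one-liners. A sketch on made-up labels:

```python
import math

# The two root-node constants derived above, on made-up labels:
# regression root = mean of targets; classification root = log-odds,
# i.e. the sigmoid inverse sigma^{-1}(p) = ln(p / (1 - p)) of the positive rate p.
y_reg = [2.0, 4.0, 6.0]
c_reg = sum(y_reg) / len(y_reg)

y_cls = [1, 0, 1, 1]
p = sum(y_cls) / len(y_cls)       # positive rate 0.75
c_cls = math.log(p / (1 - p))     # ln(0.75 / 0.25) = ln 3

print(c_reg, round(c_cls, 4))
```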
The $i$-th Tree
How to Split
Use a CART regression tree: sort each feature, sweep over the candidate split points, and partition the samples into two groups $L$ and $R$ with

$$G=\min\left(\sum_{y^j\in L}(y^j-c_1)^2+\sum_{y^j\in R}(y^j-c_2)^2\right)$$
The optimal group values are

$$c_1=\frac1{|L|}\sum_{y^j\in L}y^j,\qquad c_2=\frac1{|R|}\sum_{y^j\in R}y^j$$
Take the feature and split point that minimize $G$.
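The sweep over sorted split points can be sketched as follows (single feature, hypothetical data; a full implementation repeats this per feature):

```python
# A minimal sketch of the split search on one feature (hypothetical data):
# sort by the feature, try every boundary, keep the split minimizing
# G = SSE(left) + SSE(right), where each group predicts its mean.
def sse(ys):
    c = sum(ys) / len(ys)                 # optimal group value c1 or c2
    return sum((y - c) ** 2 for y in ys)

def best_split(xs, ys):
    pairs = sorted(zip(xs, ys))
    best_g, best_thr = float("inf"), None
    for i in range(1, len(pairs)):
        g = sse([y for _, y in pairs[:i]]) + sse([y for _, y in pairs[i:]])
        if g < best_g:
            best_g = g
            best_thr = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_g, best_thr

g, thr = best_split([1, 2, 3, 10, 11, 12], [1.0, 1.1, 0.9, 5.0, 5.1, 4.9])
print(g, thr)  # the gap between x=3 and x=10 gives the cleanest split
```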
How to Assign Leaf Values
For a regression problem, for the samples falling in a leaf region $R$,

$$c=\underset{c}{\arg\min}\sum_{y^j\in R}l(x^j,y^j,f_i^j+c)=\underset{c}{\arg\min}\sum_{y^j\in R}(f_i^j+c-y^j)^2$$
which gives

$$c=\frac1{|R|}\sum_{y^j\in R}(y^j-f_i^j)$$

the mean residual within the leaf.
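Putting the regression pieces together as a toy boosting loop (made-up data, a fixed split threshold instead of the full split search, and depth-1 trees):

```python
# T_0 is the mean; each later tree is a depth-1 stump whose leaf value is
# the mean residual y^j - f_i^j of the samples falling in that leaf.
def fit_stump(xs, fs, ys, thr):
    left = [y - f for x, f, y in zip(xs, fs, ys) if x <= thr]
    right = [y - f for x, f, y in zip(xs, fs, ys) if x > thr]
    cl, cr = sum(left) / len(left), sum(right) / len(right)
    return lambda x: cl if x <= thr else cr

xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 1.0, 3.0, 3.0]
fs = [sum(ys) / len(ys)] * len(xs)   # T_0 = mean = 2.0
eta = 1.0                            # learning rate (1.0 for clarity)

for _ in range(3):                   # three boosting rounds
    stump = fit_stump(xs, fs, ys, thr=2.5)   # fixed split for simplicity
    fs = [f + eta * stump(x) for f, x in zip(fs, xs)]

print(fs)  # converges to [1.0, 1.0, 3.0, 3.0]
```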
For a classification problem,

$$c=\underset{c}{\arg\min}\sum_{y^j\in R}l(x^j,y^j,\sigma(f_i^j+c))=\underset{c}{\arg\min}\sum_{y^j\in R}-\left[y^j\ln(\sigma(f_i^j+c))+(1-y^j)\ln(1-\sigma(f_i^j+c))\right]$$
Setting the derivative to zero,

$$\frac{\partial L}{\partial c}=\sum_{j\in R}\left(\sigma(f_i^j+c)-y^j\right)=0$$
This equation has no closed-form solution. The derivative of its left-hand side is

$$\frac{\partial^2L}{\partial c^2}=\sum_{j\in R}\sigma(f_i^j+c)(1-\sigma(f_i^j+c))>0$$

so $\frac{\partial L}{\partial c}$ is monotonically increasing in $c$ and has a unique zero.
Approximate the zero with a single Newton step: linearize $\frac{\partial L}{\partial c}$ around $c=0$ as the line $y=kc+b$ with

$$k=\frac{\partial^2L}{\partial c^2}\Big\vert_{c=0}=\sum_{j\in R}\sigma(f_i^j)(1-\sigma(f_i^j)),\qquad b=\frac{\partial L}{\partial c}\Big\vert_{c=0}=\sum_{j\in R}\left(\sigma(f_i^j)-y^j\right)$$
Setting this linear approximation to zero gives the leaf value

$$\widehat c=-\frac bk=\frac{\sum_{j\in R}\left(y^j-\sigma(f_i^j)\right)}{\sum_{j\in R}\sigma(f_i^j)(1-\sigma(f_i^j))}$$
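The quality of this single Newton step can be checked against the exact zero (toy scores and labels; bisection stands in for an exact solver):

```python
import math

# One-step Newton leaf value for classification (toy scores and labels):
# c_hat = sum(y - sigma(f)) / sum(sigma(f) * (1 - sigma(f))); compare it
# against the exact zero of sum(sigma(f + c) - y) found by bisection.
def sigmoid(f):
    return 1.0 / (1.0 + math.exp(-f))

fs = [0.2, -0.1, 0.4]        # current scores f_i^j in the leaf
ys = [1, 1, 0]               # labels y^j

b = sum(sigmoid(f) - y for f, y in zip(fs, ys))
k = sum(sigmoid(f) * (1 - sigmoid(f)) for f in fs)
c_hat = -b / k               # single Newton step from c = 0

lo, hi = -10.0, 10.0         # bisection for the exact root
for _ in range(60):
    mid = (lo + hi) / 2
    if sum(sigmoid(f + mid) - y for f, y in zip(fs, ys)) > 0:
        hi = mid
    else:
        lo = mid

print(round(c_hat, 4), round(lo, 4))  # one Newton step lands close to the root
```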
Multi-class Problems
If there are $K\geq3$ classes, GBDT builds $K$ chains of trees; each chain grows by splitting on its own gradient, but the gradient computations are mutually dependent through the softmax.
The loss function is

$$L=\sum_{j=1}^n l(x^j,y^j,s(f^j))=-\sum_{j=1}^n\sum_{k=1}^K y^{jk}\ln\frac{e^{f_i^{jk}}}{\sum_{t=1}^K e^{f_i^{jt}}}$$

where $s(\cdot)$ is the softmax and $y^{jk}$ is the one-hot encoding of the label (again with a leading minus sign, so that $L$ is minimized).
If $y^{jk}=1$, then

$$-\frac{\partial L}{\partial f_i^{jk}}=\frac{\sum_{t=1}^Ke^{f_i^{jt}}}{e^{f_i^{jk}}}\cdot\frac{e^{f_i^{jk}}\sum_{t=1}^Ke^{f_i^{jt}}-\left(e^{f_i^{jk}}\right)^2}{\left(\sum_{t=1}^Ke^{f_i^{jt}}\right)^2}=1-\frac{e^{f_i^{jk}}}{\sum_{t=1}^Ke^{f_i^{jt}}}$$
If $y^{jk}=0$, then

$$-\frac{\partial L}{\partial f_i^{jk}}=-\frac{e^{f_i^{jk}}}{\sum_{t=1}^Ke^{f_i^{jt}}}$$
Combining the two cases,

$$-\frac{\partial L}{\partial f_i^{jk}}=y^{jk}-\frac{e^{f_i^{jk}}}{\sum_{t=1}^Ke^{f_i^{jt}}}$$

so the tree for class $k$ fits the residual between the one-hot label and the softmax probability.
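A numerical check of the combined formula on toy scores:

```python
import math

# Verify -dL/df^k = y^k - softmax(f)^k numerically for toy scores,
# with L = -sum_k y^k ln(softmax(f)^k) and a one-hot label y.
def softmax(fs):
    es = [math.exp(f) for f in fs]
    s = sum(es)
    return [e / s for e in es]

def loss(fs, ys):
    return -sum(y * math.log(p) for y, p in zip(ys, softmax(fs)))

fs = [0.5, 1.0, -0.2]
ys = [0, 1, 0]               # one-hot: the true class is the second one
eps = 1e-6

p = softmax(fs)
for k in range(3):
    bumped = list(fs)
    bumped[k] += eps
    num = -(loss(bumped, ys) - loss(fs, ys)) / eps   # forward difference
    print(round(num, 4), round(ys[k] - p[k], 4))     # the two columns agree
```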
Equation editor used: http://www.wiris.com/editor/demo/en/developers