Implementing XGBoost
Contents
- Deriving the XGBoost algorithm
- Implementing the XGBoost algorithm
- Testing on a hand-made dataset
- Testing on the Boston housing dataset, with comparisons against sklearn's AdaBoost and RandomForest
Deriving the XGBoost algorithm
XGBoost is an improved member of the gradient boosting tree family, so it shares the common traits of gradient boosting trees while adding ideas of its own.
For any gradient boosting tree algorithm, three key parts must be specified:
- a means of combining the weak learners into an ensemble (iterative updates, sampling schemes, sample weighting)
- a loss function that measures how well the ensemble performs
- the weak learner $f_k(x)$
1. The ensemble update
$$\hat y_i^{(k)} = \hat y_i^{(k-1)} + \eta f_{k}(X_i)$$
where $\eta$ is the shrinkage factor, also understood as the learning rate. It is usually set well below 1 (the xgboost library defaults to eta=0.3, which the implementation below also uses).
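To make the update rule concrete, here is a minimal sketch of how an ensemble's predictions accumulate; the `trees` list and its `.predict` method are hypothetical stand-ins for fitted weak learners, not part of the implementation below.

```python
import numpy as np

# Minimal sketch of the additive update: \hat y^{(k)} = \hat y^{(k-1)} + eta * f_k(X).
# `trees` is a hypothetical list of fitted weak learners with a .predict(X) method.
def ensemble_predict(trees, X, eta=0.3):
    y_hat = np.zeros(len(X))          # \hat y^{(0)} = 0
    for f_k in trees:
        y_hat += eta * f_k.predict(X)
    return y_hat
```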
2. A loss function that measures how well the ensemble performs
$$Obj^{(k)} = \sum\limits_{i=1}^m l(y_i,\hat y_i^{(k)}) + \sum\limits_{k=1}^K \Omega(f_k)$$
The term to the left of the plus sign is the traditional loss, which measures how well the model fits the training data.
Common loss functions:
reg:linear --> squared error
$$l(y_i,\hat y_i) = (y_i - \hat y_i)^2$$
binary:logistic -->
$$l(y_i,\hat y_i) = y_i \ln(1+e^{-\hat y_i})+(1-y_i)\ln(1+e^{\hat y_i})$$
binary:hinge
multi:softmax
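As a quick reference, both losses can be written directly in numpy (a sketch for illustration; in the logistic case $\hat y_i$ is the raw margin before the sigmoid):

```python
import numpy as np

def squared_error(y, y_hat):
    # reg:linear objective: l = (y - y_hat)^2
    return (y - y_hat) ** 2

def logistic_loss(y, y_hat):
    # binary:logistic objective on the raw margin y_hat:
    # l = y*ln(1 + e^{-y_hat}) + (1 - y)*ln(1 + e^{y_hat})
    return y * np.log1p(np.exp(-y_hat)) + (1 - y) * np.log1p(np.exp(y_hat))
```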
The algorithm can be understood as follows: the left term reduces the bias of the algorithm (the model's accuracy), while the right term reduces its variance (the model's stability across different datasets).
In the derivation of Obj that follows, the objective function is rewritten in a form directly tied to the tree structure, establishing a direct link between a tree's structure and the model's quality. For this reason the objective function is also called the "structure score".
$$Obj^{(k)} = \sum\limits_{i=1}^ml(y_i,\hat y_i^{(k-1)}+f_k(x_i))+\Omega(f_k)+constant$$
By the second-order Taylor expansion
$$f(x+\Delta x) \approx f(x)+f'(x)\Delta x + \frac{1}{2}f''(x)\Delta x^2$$
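A quick numerical check of this approximation, purely for illustration, with $f(x)=x^3$ at $x=2$ and a small step:

```python
# Numerical check of the second-order Taylor expansion for f(x) = x**3 at x = 2.
f  = lambda x: x ** 3
f1 = lambda x: 3 * x ** 2   # f'(x)
f2 = lambda x: 6 * x        # f''(x)

x, dx = 2.0, 0.1
exact  = f(x + dx)                                 # 9.261
taylor = f(x) + f1(x) * dx + 0.5 * f2(x) * dx**2   # 9.26
print(exact, taylor)  # the two differ only by the O(dx^3) remainder
```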
Expanding the objective accordingly, with $\Delta x = f_k(x_i)$, gives
$$Obj^{(k)} = \sum \limits_{i=1}^m \big[ l(y_i,\hat y_i^{(k-1)}) + g_i f_k(x_i) + \frac{1}{2}h_i f_k^2(x_i) \big] +\Omega(f_k)+constant$$
where
$$g_i=\frac{\partial l(y_i,\hat y ^{(k-1)})}{\partial \hat y^{(k-1)}}$$
$$h_i = \frac{\partial^2 l(y_i,\hat y ^{(k-1)})}{\partial {\hat y^{(k-1)}}^2}$$
$g_i$ and $h_i$ can be understood as per-sample gradient statistics.
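For the two losses listed earlier, the gradient statistics work out as below (a sketch; for the logistic case $\hat y$ is again the raw margin):

```python
import numpy as np

def grad_hess_squared_error(y, y_hat):
    # l = (y - y_hat)^2  ->  g = -2*(y - y_hat),  h = 2
    return -2 * (y - y_hat), 2 * np.ones_like(y_hat)

def grad_hess_logistic(y, y_hat):
    # l = y*ln(1+e^{-y_hat}) + (1-y)*ln(1+e^{y_hat})
    # differentiating twice gives g = sigmoid(y_hat) - y, h = sigmoid*(1 - sigmoid)
    p = 1.0 / (1.0 + np.exp(-y_hat))
    return p - y, p * (1.0 - p)
```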
As for $\Omega{(f_k)}$, the complexity of the tree, it can be measured by the number of leaves, plus L1 and L2 regularization terms on the leaf weights.
Hence
$$\Omega{(f_k)} = \gamma T + \frac{1}{2}\alpha\sum\limits_{j=1}^T|w_j|+\frac{1}{2}\lambda\sum\limits_{j=1}^Tw_j^2$$
Usually only L2 regularization is kept, so we set $\alpha=0$ and $\lambda=1$, and take $\gamma=0$ for now.
Here $T$ is the number of leaves.
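Written out, the complexity term is a one-liner over a tree's leaf weights (an illustrative sketch):

```python
import numpy as np

def omega(w, gamma=0.0, alpha=0.0, lam=1.0):
    # Omega(f_k) = gamma*T + 0.5*alpha*sum|w_j| + 0.5*lam*sum w_j^2,
    # where w holds the T leaf weights of one tree.
    w = np.asarray(w, dtype=float)
    return gamma * len(w) + 0.5 * alpha * np.abs(w).sum() + 0.5 * lam * (w ** 2).sum()
```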
Mapping each sample to its corresponding leaf, i.e.
$$f_k(x) = w_{q(x)}$$
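In code, a fitted tree can therefore be viewed as a pair $(q, w)$: a routing function from a sample to a leaf index plus an array of leaf weights. A hypothetical one-split tree for illustration (the split and weights are made up):

```python
import numpy as np

# A tree viewed as (q, w): q routes a sample to a leaf index, w stores the leaf weights.
w = np.array([3.6, 1.43])                # leaf weights w_1, w_2 (made up)
q = lambda x: 0 if x[0] <= 13 else 1     # leaf index q(x), split on feature 0 at 13
f_k = lambda x: w[q(x)]                  # f_k(x) = w_{q(x)}

print(f_k(np.array([12, 1])))            # -> 3.6
```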
So the Obj formula can be developed further:
$$Obj^{(k)} = \sum \limits_{i=1}^m \big[ w_{q(x_i)}g_i + \frac{1}{2}w_{q(x_i)}^2h_i \big] + \gamma T + \frac{1}{2}\lambda \sum\limits_{j=1}^Tw_j^2 + constant$$
Define $I_j$ as the set of samples that fall on the leaf with index $j$.
The expression above can then be regrouped by leaf:
$$Obj^{(k)} = \sum \limits_{j=1}^T \big[ w_j \sum\limits_{i \in I_j }g_i + \frac{1}{2} w_j^2 \sum\limits_{i \in I_j}h_i \big] + \gamma T +\frac{1}{2}\lambda \sum\limits_{j=1}^T w_j^2$$
Merging terms, this simplifies to
$$Obj^{(k)} = \sum\limits_{j=1}^T \big[ w_jG_j+\frac{1}{2}w_j^2(H_j+\lambda) \big]+\gamma T$$
where
$$G_j = \sum\limits_{i \in I_j}g_i \qquad H_j = \sum\limits_{i \in I_j}h_i$$
To find the optimal weight $w_j^*$ of each leaf,
take the derivative of $Obj^{(k)}$ with respect to $w_j$ and set it to zero:
$$G_j +w_j (H_j+\lambda)=0$$
Hence
$$w_j^* = -\frac{G_j}{H_j+\lambda}$$
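A tiny worked example with made-up numbers: suppose one leaf receives two samples with $g = (-4, -2)$ and $h = (2, 2)$, and $\lambda = 1$:

```python
import numpy as np

g = np.array([-4.0, -2.0])   # made-up gradient statistics on one leaf
h = np.array([2.0, 2.0])
lam = 1.0

w_star = -g.sum() / (h.sum() + lam)   # w* = -G_j / (H_j + lambda)
print(w_star)                         # -> 6 / 5 = 1.2
```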
Substituting $w_j^*$ back into $Obj^{(k)}$ gives
$$Obj^{(k)} = - \frac{1}{2}\sum\limits_{j=1}^T \frac{G_j^2}{H_j+\lambda}+\gamma T$$
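This is the structure score. As a function of the per-leaf statistics it is easy to compute directly (sketch):

```python
import numpy as np

def structure_score(G, H, gamma=0.0, lam=1.0):
    # Obj = -1/2 * sum_j G_j^2 / (H_j + lambda) + gamma * T,
    # where G and H hold the per-leaf sums of g_i and h_i.
    G, H = np.asarray(G, dtype=float), np.asarray(H, dtype=float)
    return -0.5 * np.sum(G ** 2 / (H + lam)) + gamma * len(G)
```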
So once the tree structure is fixed, the weight $w_j^*$ of every leaf is determined.
The next question is how to determine the tree structure.
The approach is a greedy algorithm: search for the best split.
How is "best" measured?
Compute Gain, the difference between the structure score before the split (the unsplit node, $Score_{middle}$) and after it, and split at the feature and split point with the largest Gain. If even the largest Gain is negative, stop growing the tree.
$$Gain = Score_{middle} - (Score_{left}+Score_{right})$$
$$=\frac{1}{2}\big[ \frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+ H_R +\lambda} \big]-\gamma$$
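In code the Gain is a small function of the left/right gradient sums; this sketch mirrors the `_Gain` method implemented later:

```python
def split_gain(GL, HL, GR, HR, gamma=0.0, lam=1.0):
    # Gain = 1/2 * [ GL^2/(HL+lam) + GR^2/(HR+lam) - (GL+GR)^2/(HL+HR+lam) ] - gamma
    return 0.5 * (GL ** 2 / (HL + lam)
                  + GR ** 2 / (HR + lam)
                  - (GL + GR) ** 2 / (HL + HR + lam)) - gamma
```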
How do we find the best split?
Compute the Gain of every candidate split point of every feature,
and choose the split that makes the objective decrease the fastest.
We allow the tree to keep splitting as long as the difference in structure scores satisfies Gain > 0, i.e. as long as the objective can still decrease:
$$\frac{1}{2}\big[ \frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+ H_R +\lambda} \big]>\gamma$$
So the trees in XGBoost can be made to stop growing by setting $\gamma$;
$\gamma$ can thus be defined as the minimum reduction in the objective required to make a further split at a node.
Summary of the algorithm:
- Initialize $f_0(x)=0,\hat y^{(0)}=0$
- For each round k:
  a. Use $\hat y^{(k-1)}$ to update $g_i$ and $h_i$
  b. Build a tree with the greedy algorithm, searching for the best split at each node
  c. From the finished tree, compute each leaf weight $w_j$
  d. Update $f_k(x)$ and set $\hat y^{(k)} = \hat y^{(k-1)}+\eta f_k(x)$
- Stop once num_round trees have been built
Implementing the XGBoost algorithm
We implement an XGBoost regressor with the squared error as the loss function, so the procedure above specializes as follows:
- Initialize $f_0(x)=0,\hat y^{(0)}=0$
- For each round k:
  a. Use $\hat y^{(k-1)}$ to update $g_i$ and $h_i$: $g_i = -2(y_i - \hat y^{(k-1)})$ and $h_i = 2$
  b. Build a tree with the greedy algorithm, splitting where $Gain = \frac{1}{2}\big[ \frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+ H_R +\lambda} \big]-\gamma$ is largest
  c. From the finished tree, compute each leaf weight $w_j^*=-\frac{G_j}{H_j+\lambda}$
  d. Update $f_k(x)$ and set $\hat y^{(k)} = \hat y^{(k-1)}+\eta f_k(x)$
- Stop once num_round trees have been built
A quick symbolic check of the derivatives in step a is shown below.
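Using sympy, purely as a sanity check, the squared-error loss indeed yields the $g_i$ and $h_i$ above:

```python
import sympy

# Symbolic check of the gradient statistics for the squared-error loss.
y, y_hat = sympy.symbols("y y_hat")
l = (y - y_hat) ** 2

print(sympy.diff(l, y_hat))     # -> 2*y_hat - 2*y, i.e. -2*(y - y_hat)
print(sympy.diff(l, y_hat, 2))  # -> 2
```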
```python
import numpy as np

class Try_XGBoost():
    def __init__(self, num_round=10, eta=0.3, gamma=0, Lambda=1, scoring="mse"):
        self.scoring = scoring        # evaluation metric used by score(): "mse" or "r2"
        self.num_round = num_round    # number of trees to build
        self.eta = eta                # shrinkage (learning rate)
        self.gamma = gamma            # minimum gain required to keep a split
        self.Lambda = Lambda          # L2 regularization on leaf weights
        self.ensemble = []            # fitted trees, each stored as a nested dict
        self.g = None                 # per-sample first derivatives g_i
        self.h = None                 # per-sample second derivatives h_i
        self.haty = None              # current predictions \hat y^{(k)}
        self.f = None                 # leaf scores of the tree currently being built

    def _G(self, y_train):
        """Compute g_i for the squared-error loss: g_i = -2(y_i - haty)."""
        return -2 * (y_train - self.haty)

    def _Gain(self, listL, listR):
        """Compute the Gain of a candidate split, used to find the best branch."""
        GL = self.g[listL].sum()
        GR = self.g[listR].sum()
        HL = self.h[listL].sum()
        HR = self.h[listR].sum()
        return (GL**2 / (HL + self.Lambda)
                + GR**2 / (HR + self.Lambda)
                - (GR + GL)**2 / (HL + HR + self.Lambda)) / 2 - self.gamma

    def _w(self, indexlist):
        """Optimal leaf weight w* = -G_j / (H_j + lambda)."""
        return -np.sum(self.g[indexlist]) / (np.sum(self.h[indexlist]) + self.Lambda)

    def fit(self, X_train, y_train):
        def BestSplit(X_train, y_train, indexlist):
            """Search for the best split; return (feature, value), or (None, None) if no split has positive gain."""
            bestGain = 0
            bestSplitFeature = -1
            bestSplitValue = -1
            for Feature in range(X_train.shape[1]):
                ValueSet = set(X_train[:, Feature])
                for Val in ValueSet:
                    boolindexLeft = X_train[:, Feature] <= Val
                    boolindexRight = ~boolindexLeft
                    indexLeft = indexlist[boolindexLeft]
                    indexRight = indexlist[boolindexRight]
                    gain = self._Gain(indexLeft, indexRight)
                    if gain > bestGain:
                        bestGain = gain
                        bestSplitFeature = Feature
                        bestSplitValue = Val
            if bestSplitFeature == -1:
                return None, None
            else:
                return bestSplitFeature, bestSplitValue

        def create_tree(X_train, y_train, indexlist=np.arange(len(X_train))):
            """Build a new tree as a nested dict, updating self.f (each sample's score under this tree)."""
            bestSplitFeature, bestSplitValue = BestSplit(X_train, y_train, indexlist)
            if bestSplitFeature is None:
                # No worthwhile split: make a leaf and record its weight.
                w = self._w(indexlist)
                self.f[indexlist] = w
                return w
            else:
                left_index = X_train[:, bestSplitFeature] <= bestSplitValue
                sub_X_train_left, sub_y_train_left = X_train[left_index], y_train[left_index]
                sub_X_train_right, sub_y_train_right = X_train[~left_index], y_train[~left_index]
                indexlist_left = indexlist[left_index]
                indexlist_right = indexlist[~left_index]
                leftchild = create_tree(sub_X_train_left, sub_y_train_left, indexlist_left)
                rightchild = create_tree(sub_X_train_right, sub_y_train_right, indexlist_right)
                return {bestSplitFeature: {"<={}".format(bestSplitValue): leftchild,
                                           ">{}".format(bestSplitValue): rightchild}}

        self.haty = np.zeros(len(X_train))   # \hat y^{(0)} = 0
        self.h = np.ones(len(X_train)) * 2   # h_i = 2 for the squared-error loss
        for _ in range(self.num_round):
            self.g = self._G(y_train)        # refresh g_i from the current predictions
            self.f = np.empty(len(X_train))
            newtree = create_tree(X_train, y_train)
            self.ensemble.append(newtree)
            self.haty = self.haty + self.eta * self.f   # \hat y^{(k)} = \hat y^{(k-1)} + eta * f_k
        return

    def draw_one_tree(self, index):
        from graphviz import Digraph

        def export_graphviz(tree, root_index):
            root = next(iter(tree))
            text_node.append([str(root_index), "feature:{}".format(root)])
            secondDic = tree[root]
            for key in secondDic:
                if type(secondDic[key]) == dict:
                    i[0] += 1
                    secondrootindex = i[0]
                    text_edge.append([str(root_index), str(secondrootindex), str(key)])
                    export_graphviz(secondDic[key], secondrootindex)
                else:
                    i[0] += 1
                    text_node.append([str(i[0]), str(secondDic[key])])
                    text_edge.append([str(root_index), str(i[0]), str(key)])

        tree = self.ensemble[index]
        text_node = []
        text_edge = []
        i = [1]
        export_graphviz(tree, i[0])
        dot = Digraph()
        for line in text_node:
            dot.node(line[0], line[1])
        for line in text_edge:
            dot.edge(line[0], line[1], line[2])
        dot.view()

    def predict(self, X_test):
        return np.array([self._predict(test) for test in X_test])

    def _predict(self, test):
        """Predict a single test sample by summing the eta-scaled leaf scores of all trees."""
        def __predict(tree, test):
            feature = next(iter(tree))
            secondDic = tree[feature]
            content = test[feature]
            for key in secondDic:
                # key looks like "<=13" or ">13"; eval compares the feature value against it
                if eval(str(content) + key):
                    if type(secondDic[key]) == dict:
                        return __predict(secondDic[key], test)
                    else:
                        return secondDic[key]

        assert len(self.ensemble) != 0, "fit before predict"
        res = 0
        for i in range(len(self.ensemble)):
            tree = self.ensemble[i]
            res_temp = __predict(tree, test)
            res += res_temp * self.eta
        return res

    def score(self, X_test, y_test):
        y_pre = self.predict(X_test)
        if self.scoring == "mse":
            return sum((y_test - y_pre) ** 2) / len(X_test)
        elif self.scoring == "r2":
            return 1 - sum((y_test - y_pre) ** 2) / sum((y_test - y_test.mean()) ** 2)

    def get_params(self, deep=False):
        dic = {}
        dic["num_round"] = self.num_round
        dic["eta"] = self.eta
        dic["gamma"] = self.gamma
        dic["Lambda"] = self.Lambda
        dic["scoring"] = self.scoring
        return dic
```
Functional test on a hand-made dataset
Data description
The dataset imitates the example in Tianqi Chen's slides: the first feature is age, and the second is whether the person is male.
The target is the degree to which the person likes computer games.
There are 5 samples: 1 boy, 2 mother, 3 grandpa, 4 girl, 5 grandma.
```python
X = np.array([
    [13, 1],
    [30, 0],
    [63, 1],
    [12, 0],
    [63, 0]
])
y = np.array([5, 2, 1, 4, 2])

import pandas as pd
pd.concat([pd.DataFrame(X, columns=["age", "is male"]),
           pd.DataFrame(y, columns=["the degree of favor in computer games"])], axis=1)
```
|   | age | is male | the degree of favor in computer games |
|---|-----|---------|---------------------------------------|
| 0 | 13  | 1       | 5                                     |
| 1 | 30  | 0       | 2                                     |
| 2 | 63  | 1       | 1                                     |
| 3 | 12  | 0       | 4                                     |
| 4 | 63  | 0       | 2                                     |
```python
reg = Try_XGBoost()
reg.fit(X, y)
reg.ensemble
```
```
[{0: {'<=13': 3.6, '>13': 1.4285714285714286}},
 {0: {'<=13': 2.7359999999999998,
   '>13': {1: {'<=0': 1.2571428571428571, '>0': 0.38095238095238093}}}},
 {0: {'<=13': 2.0793600000000003,
   '>13': {1: {'<=0': 0.9554285714285715, '>0': 0.3047619047619048}}}},
 {0: {'<=13': 1.5803136000000002,
   '>13': {1: {'<=0': 0.7261257142857144, '>0': 0.24380952380952384}}}},
 {0: {'<=13': 1.2010383360000003,
   '>13': {1: {'<=0': 0.5518555428571428, '>0': 0.19504761904761905}}}},
 {0: {'<=13': 0.9127891353600003,
   '>13': {1: {'<=0': 0.41941021257142863, '>0': 0.15603809523809523}}}},
 {0: {'<=13': {0: {'<=12': 0.24476645239466688, '>12': 0.9114331190613335}},
   '>13': {1: {'<=0': 0.31875176155428575, '>0': 0.12483047619047616}}}},
 {0: {'<=13': {0: {'<=12': 0.1958131619157335, '>12': 0.7291464952490667}},
   '>13': 0.21583569096620409}},
 {0: {'<=13': {0: {'<=12': 0.15665052953258693, '>12': 0.5833171961992534}},
   '>13': {1: {'<=0': 0.1904507729493682, '>0': 0.05669724275914012}}}},
 {0: {'<=13': {0: {'<=12': 0.12532042362606965, '>12': 0.4666537569594027}},
   '>13': {1: {'<=0': 0.14474258744151988, '>0': 0.045357794207312084}}}}]
```
Let's draw one of the trees:
```python
reg.draw_one_tree(0)
```
```python
reg.score(X, y)  # mean squared error
```
```
0.07539524128331372
```
```python
reg = Try_XGBoost(scoring="r2")
reg.fit(X, y)
reg.score(X, y)  # R^2
```
```
0.9650947957021696
```
Compare with the random forest regressor from sklearn:
```python
from sklearn.ensemble import RandomForestRegressor
reg_sklearn_rf = RandomForestRegressor(n_estimators=5)
reg_sklearn_rf.fit(X, y)
reg_sklearn_rf.score(X, y)
```
```
0.9333333333333332
```
```python
from sklearn.ensemble import AdaBoostRegressor
reg_sklearn_ad = AdaBoostRegressor(n_estimators=5, learning_rate=0.3)
reg_sklearn_ad.fit(X, y)
reg_sklearn_ad.score(X, y)
```
```
1.0
```
Testing on the Boston housing dataset
```python
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

boston = load_boston()
X = boston.data
y = boston.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2, test_size=0.3)

XGB = Try_XGBoost(scoring='r2', num_round=50, eta=0.1, gamma=0.2)
XGB.fit(X_train, y_train)
XGB_score = XGB.score(X_test, y_test)

sklearn_Ada = AdaBoostRegressor(n_estimators=50, random_state=22)
sklearn_Ada.fit(X_train, y_train)
Ada_score = sklearn_Ada.score(X_test, y_test)  # evaluate on the same held-out test set as XGB

sklearn_RF = RandomForestRegressor(n_estimators=50, random_state=22)
sklearn_RF.fit(X_train, y_train)
RF_score = sklearn_RF.score(X_test, y_test)  # evaluate on the same held-out test set as XGB
```
```python
import matplotlib.pyplot as plt

score = [XGB_score, Ada_score, RF_score]
plt.barh(range(3), score, color=["orange", "gold", "yellowgreen"])
plt.yticks(range(3), ["Try_XGBoost_score", "AdaBoostRegressor_score", "RandomForest_score"])
plt.xlim([0.8, 1])
for loc1, loc2 in enumerate(score):
    plt.text(loc2, loc1, "{:.3f}".format(loc2), va='center')
plt.show()
```
Random forest really is outstanding, far ahead of the pack.
Apart from being slow, my hand-written XGBoost can still hold its own in accuracy against the other ensemble methods.
XGBoost's big selling point is its speed, so when I get the chance I will read the paper and study how it accelerates the computation.
My parameter choices are probably not ideal either; the model still overfits quite a bit.
For example, look at one of the trees:
quite a luxuriant growth...