A Python Source-Code Implementation of XGBoost

1. Introduction

This article walks through the details of a Python implementation of XGBoost. If you want to understand the theory behind XGBoost, I recommend Tianqi Chen's paper; below is a Chinese-language article through which I got a good grasp of how XGBoost works: http://www.52cs.org/?p=429

At the end I will briefly compare the implementation differences between XGBoost and GBDT.

2. Source Code Walkthrough

Like GBDT and random forests, XGBoost builds on a decision-tree subclass; the code of that decision-tree subclass is explained in my previous article.

2.1 Building XGBoostRegressionTree

XGBoostRegressionTree inherits from the DecisionTree class described in that article.

class XGBoostRegressionTree(DecisionTree):
    """
    Regression tree for XGBoost
    - Reference -
    http://xgboost.readthedocs.io/en/latest/model.html
    """

    def _split(self, y):
        """ y contains y_true in left half of the middle column and
        y_pred in the right half. Split and return the two matrices """
        col = int(np.shape(y)[1] / 2)
        y, y_pred = y[:, :col], y[:, col:]
        return y, y_pred

    def _gain(self, y, y_pred):
        # 0.5 * G^2 / H for one node (regularisation term omitted)
        nominator = np.power((self.loss.gradient(y, y_pred)).sum(), 2)
        denominator = self.loss.hess(y, y_pred).sum()
        return 0.5 * (nominator / denominator)

    def _gain_by_taylor(self, y, y1, y2):
        # Split each stacked matrix back into y_true and y_pred
        y, y_pred = self._split(y)
        y1, y1_pred = self._split(y1)
        y2, y2_pred = self._split(y2)

        true_gain = self._gain(y1, y1_pred)
        false_gain = self._gain(y2, y2_pred)
        gain = self._gain(y, y_pred)
        return true_gain + false_gain - gain

    def _approximate_update(self, y):
        # y split into y, y_pred
        y, y_pred = self._split(y)
        # Newton step: summed gradient divided by summed hessian
        gradient = np.sum(self.loss.gradient(y, y_pred), axis=0)
        hessian = np.sum(self.loss.hess(y, y_pred), axis=0)
        update_approximation = gradient / hessian
        return update_approximation

    def fit(self, X, y):
        self._impurity_calculation = self._gain_by_taylor
        self._leaf_value_calculation = self._approximate_update
        super(XGBoostRegressionTree, self).fit(X, y)
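
The tree above only assumes that the loss object handed to it exposes two methods, gradient() and hess(). The actual LeastSquaresLoss lives elsewhere in the project, so the following is only a minimal sketch; in particular, defining the gradient as the residual y - y_pred is an assumption (it is, however, the sign that makes fit() below add the tree outputs onto y_pred):

import numpy as np

class LeastSquaresLoss:
    """Sketch of a squared-error loss: l(y, y_pred) = 0.5 * (y - y_pred) ** 2."""

    def gradient(self, y, y_pred):
        # First-order term; the sign convention (y - y_pred) is an assumption
        return y - y_pred

    def hess(self, y, y_pred):
        # The second derivative of the squared error w.r.t. y_pred is constant
        return np.ones_like(y)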

2.1.1 _gain():

This function computes the gain (the structure score) for the set of samples that falls into a node. Note that I have omitted the regularisation parameters here.
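
For reference, the quantity computed by _gain() is the structure score of a node from the XGBoost paper, with the regularisation term dropped to match the code:

\text{score} = \frac{1}{2}\,\frac{G^2}{H},\qquad G = \sum_{i\in\text{node}} g_i,\quad H = \sum_{i\in\text{node}} h_i

where g_i and h_i are the first- and second-order derivatives of the loss at the current prediction (loss.gradient and loss.hess in the code). With regularisation, the denominator would be H + \lambda.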

2.1.2 _gain_by_taylor():

This function calls _gain() to score a tree node's candidate split, and that score is used as the criterion for deciding whether to split the node.
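
Written out (again with \lambda and \gamma omitted, as the code does), the value returned by _gain_by_taylor() is the classic split gain:

\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L} + \frac{G_R^2}{H_R} - \frac{(G_L + G_R)^2}{H_L + H_R}\right]

where the subscripts L and R refer to the left child (y1) and the right child (y2). The tree splits on whichever feature and threshold maximise this quantity.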

2.1.3 _approximate_update():

Once the splitting of an XGBoost tree is finished, the value of each leaf node is already determined. The specific value is the summed first-order gradient of the leaf's samples divided by the summed second-order gradient (G / H in the notation above; the paper writes it as -G/H, the sign depending on how the loss defines its gradient). Here, again, I have omitted the regularisation parameter.
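
The leaf value falls out of minimising the second-order Taylor approximation of the objective for a single leaf; a one-line derivation (regularisation kept for completeness, then dropped as in the code):

\min_w\; G\,w + \tfrac{1}{2}\,(H+\lambda)\,w^2 \;\Rightarrow\; w^\ast = -\frac{G}{H+\lambda} \;\xrightarrow{\;\lambda=0\;}\; -\frac{G}{H}

The code returns gradient / hessian without an explicit minus sign; whether that equals -G/H depends on the sign convention of loss.gradient (y - y_pred versus y_pred - y), which is a detail of the loss class rather than of this tree.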

2.1.4 fit():

It sets _gain_by_taylor() as the criterion for splitting the tree,

sets _approximate_update() as the method for estimating the value of each leaf node,

and passes both back to DecisionTree, which uses them to build the tree (a minimal illustration of this hook pattern follows).
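
Below is a minimal illustration of that hook pattern. TinyStump is a hypothetical stand-in for the real DecisionTree base class (which recursively searches every feature and threshold); it makes a single split on one feature, purely to show where the two injected functions are called:

import numpy as np

class TinyStump:
    """Hypothetical stand-in for DecisionTree: one split, to show the hooks."""

    def fit(self, X, y):
        # y is the stacked matrix [y_true | y_pred], exactly as in the real code
        threshold = np.median(X[:, 0])
        mask = X[:, 0] < threshold
        # Hook 1: splitting criterion (XGBoostRegressionTree sets _gain_by_taylor)
        self.gain = self._impurity_calculation(y, y[mask], y[~mask])
        # Hook 2: leaf values (XGBoostRegressionTree sets _approximate_update)
        self.left_value = self._leaf_value_calculation(y[mask])
        self.right_value = self._leaf_value_calculation(y[~mask])
        self.threshold = threshold

stump = TinyStump()
stump._impurity_calculation = lambda y, y1, y2: 0.0         # dummy criterion
stump._leaf_value_calculation = lambda y: y[:, :1].mean()   # dummy leaf value
stump.fit(np.random.rand(10, 2), np.random.rand(10, 4))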

2.2 Building XGBoost

class XGBoost(object):
    """The XGBoost classifier.

    Reference: http://xgboost.readthedocs.io/en/latest/model.html

    Parameters:
    -----------
    n_estimators: int
        The number of classification trees that are used.
    learning_rate: float
        The step length that will be taken when following the negative gradient
        during training.
    min_samples_split: int
        The minimum number of samples needed to make a split when building a tree.
    min_impurity: float
        The minimum impurity required to split the tree further.
    max_depth: int
        The maximum depth of a tree.
    """

    def __init__(self, n_estimators=200, learning_rate=0.01, min_samples_split=2,
                 min_impurity=1e-7, max_depth=2):
        self.n_estimators = n_estimators            # Number of trees
        self.learning_rate = learning_rate          # Step size for weight update
        self.min_samples_split = min_samples_split  # The minimum number of samples to justify a split
        self.min_impurity = min_impurity            # Minimum variance reduction to continue
        self.max_depth = max_depth                  # Maximum depth for a tree

        self.bar = progressbar.ProgressBar(widgets=bar_widgets)

        # Squared-error loss supplying the gradients and hessians
        self.loss = LeastSquaresLoss()

        # Initialize regression trees
        self.trees = []
        for _ in range(n_estimators):
            tree = XGBoostRegressionTree(
                min_samples_split=self.min_samples_split,
                min_impurity=min_impurity,
                max_depth=self.max_depth,
                loss=self.loss)
            self.trees.append(tree)

    def fit(self, X, y):
        # y = to_categorical(y)
        m = X.shape[0]
        y = np.reshape(y, (m, -1))
        y_pred = np.zeros(np.shape(y))
        for i in self.bar(range(self.n_estimators)):
            tree = self.trees[i]
            # Stack y_true and the current prediction side by side so the tree
            # can recover both halves via _split()
            y_and_pred = np.concatenate((y, y_pred), axis=1)
            tree.fit(X, y_and_pred)
            update_pred = tree.predict(X)
            update_pred = np.reshape(update_pred, (m, -1))
            y_pred += update_pred

    def predict(self, X):
        y_pred = None
        m = X.shape[0]
        # Make predictions by summing the contributions of all trees
        for tree in self.trees:
            # Estimate gradient and update prediction
            update_pred = tree.predict(X)
            update_pred = np.reshape(update_pred, (m, -1))
            if y_pred is None:
                y_pred = np.zeros_like(update_pred)
            y_pred += update_pred
        return y_pred

2.2.1 __init__()

Builds a model containing n_estimators XGBoostRegressionTree instances.

If any part of this is hard to follow, I suggest working through the link mentioned above again; once you have really digested the theory of XGBoost, the source code should be easy to understand.
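
To tie the pieces together, here is a minimal usage sketch on made-up data (it assumes numpy, progressbar and the two classes above are importable from one module):

import numpy as np

# Toy regression data, purely for illustration
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = 3 * X[:, 0] + 2 * X[:, 1] + 0.1 * rng.randn(200)

model = XGBoost(n_estimators=50, learning_rate=0.01, max_depth=3)
model.fit(X, y)
y_hat = model.predict(X).flatten()

print("train MSE:", np.mean((y_hat - y) ** 2))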

3. Source Code

The project includes concise implementations of many machine learning algorithms.

4. Implementation Differences Between XGBoost and GBDT

In GBDT, when each tree is fit it does not care what the tree structure or the leaf values turn out to be; in other words, GBDT does not interfere with the tree-building process itself.

In XGBoost, by contrast, fitting each tree redefines both the splitting criterion used while the tree is built and the exact values assigned to the leaf nodes.
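
One way to see the connection in formulas (squared-error loss, regularisation omitted): when every h_i = 1, XGBoost's leaf weight collapses to the mean residual, which is exactly the leaf value a GBDT regression tree would use,

w^\ast_{\text{XGB}} = -\frac{\sum_i g_i}{\sum_i h_i} \;\overset{h_i = 1,\; g_i = \hat y_i - y_i}{=}\; \frac{1}{n}\sum_i (y_i - \hat y_i),

so the two methods diverge as soon as the loss has a non-constant second derivative, and in the split criterion, where GBDT relies on plain variance reduction while XGBoost uses the gain from Section 2.1.2.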

I wrote this article to record my own learning, and I hope it can also be of a little help to beginners. If you find mistakes or have questions, I would be happy to discuss them.
