1. Preface
This article focuses on the details of an XGBoost code implementation. Readers who want to understand the theory behind XGBoost should read Tianqi Chen's paper. Below is a Chinese-language article through which I got a good grasp of how XGBoost works, shared here as a link: http://www.52cs.org/?p=429
At the end I will briefly compare the implementation differences between XGBoost and GBDT.
2. Source Code Walkthrough
Like GBDT and random forests, XGBoost relies on a decision tree subclass; that subclass is explained in my previous article.
2.1 Building XGBoostRegressionTree
XGBoostRegressionTree inherits from the DecisionTree class explained in my earlier article.
```python
import numpy as np

# DecisionTree comes from the previous article in this series.
class XGBoostRegressionTree(DecisionTree):
    """Regression tree for XGBoost.
    Reference: http://xgboost.readthedocs.io/en/latest/model.html
    """

    def _split(self, y):
        """y contains y_true in the left half of the matrix and y_pred in
        the right half. Split and return the two matrices."""
        col = int(np.shape(y)[1] / 2)
        y, y_pred = y[:, :col], y[:, col:]
        return y, y_pred

    def _gain(self, y, y_pred):
        # Structure score 0.5 * G^2 / H (regularization terms omitted)
        numerator = np.power((self.loss.gradient(y, y_pred)).sum(), 2)
        denominator = self.loss.hess(y, y_pred).sum()
        return 0.5 * (numerator / denominator)

    def _gain_by_taylor(self, y, y1, y2):
        # Split each matrix into true values and current predictions
        y, y_pred = self._split(y)
        y1, y1_pred = self._split(y1)
        y2, y2_pred = self._split(y2)

        true_gain = self._gain(y1, y1_pred)
        false_gain = self._gain(y2, y2_pred)
        gain = self._gain(y, y_pred)
        return true_gain + false_gain - gain

    def _approximate_update(self, y):
        # y is split into y_true and y_pred
        y, y_pred = self._split(y)
        # Leaf value: sum of gradients / sum of hessians
        gradient = np.sum(self.loss.gradient(y, y_pred), axis=0)
        hessian = np.sum(self.loss.hess(y, y_pred), axis=0)
        update_approximation = gradient / hessian
        return update_approximation

    def fit(self, X, y):
        self._impurity_calculation = self._gain_by_taylor
        self._leaf_value_calculation = self._approximate_update
        super(XGBoostRegressionTree, self).fit(X, y)
```
2.1.1 _gain():
This function computes the gain of the data in a node, i.e. the structure score gain = (1/2) · G²/H, where G is the sum of the first derivatives (gradients) of the loss over the node's samples and H is the sum of the second derivatives (hessians).
Here I have ignored the regularization parameters.
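As a minimal numeric sketch, assuming a squared-error loss whose gradient method returns the residual (y - y_pred) and whose hessian is 1 per sample (an assumption about LeastSquaresLoss, not shown in this article), the structure score reduces to:

```python
import numpy as np

def gain(y, y_pred):
    """Structure score 0.5 * G^2 / H under the assumed squared-error convention."""
    G = np.sum(y - y_pred)           # sum of first-order terms (residuals)
    H = np.sum(np.ones_like(y))      # sum of second-order terms (all 1 here)
    return 0.5 * G ** 2 / H

y = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.zeros(4)
print(gain(y, y_pred))  # G = 10, H = 4, so 0.5 * 100 / 4 = 12.5
```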
2.1.2 _gain_by_taylor():
This function calls _gain() to compute the purity of the parent node and the two candidate child nodes, and uses gain(left) + gain(right) - gain(parent) as the criterion for whether to split.
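The criterion can be illustrated numerically. A minimal sketch, again assuming a squared-error loss where the gradient is the residual (y - y_pred) and the hessian is 1 per sample; the gain() helper below is a hypothetical stand-in for the class's _gain():

```python
import numpy as np

def gain(y, y_pred):
    G = np.sum(y - y_pred)
    H = len(y)
    return 0.5 * G ** 2 / H

# Two clearly separated groups of targets; splitting between them should pay off.
y = np.array([1.0, 1.0, 10.0, 10.0])
y_pred = np.zeros(4)

parent = gain(y, y_pred)        # 0.5 * 22^2 / 4 = 60.5
left = gain(y[:2], y_pred[:2])  # 0.5 * 2^2  / 2 = 1.0
right = gain(y[2:], y_pred[2:]) # 0.5 * 20^2 / 2 = 100.0
print(left + right - parent)    # 40.5 > 0, so this split improves the score
```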
2.1.3 _approximate_update():
Once XGBoost has finished splitting, the value of each leaf node is already determined.
Specifically, the leaf value is the sum of the gradients divided by the sum of the hessians over the leaf's samples, w = G/H (in the paper this is written w* = -G/(H + λ); the sign here follows the gradient convention of the loss class). Here I have ignored the regularization parameters.
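Under the same assumed squared-error convention (gradient = residual, hessian = 1 per sample), the leaf value G/H is simply the mean residual of the leaf's samples, which is the familiar regression-tree leaf value:

```python
import numpy as np

# Leaf value sketch mirroring _approximate_update: gradient sum / hessian sum.
y = np.array([2.0, 4.0, 6.0])
y_pred = np.zeros(3)
leaf_value = np.sum(y - y_pred) / np.sum(np.ones_like(y))
print(leaf_value)  # 12 / 3 = 4.0, i.e. the mean residual in the leaf
```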
2.1.4 fit():
Sets _gain_by_taylor() as the tree-splitting criterion and _approximate_update() as the method for estimating leaf values, then hands both back to DecisionTree, which builds the tree on that basis.
2.2 Building XGBoost
```python
import numpy as np
import progressbar

# bar_widgets, LeastSquaresLoss and XGBoostRegressionTree come from the
# surrounding project (bar_widgets is a list of progressbar widgets).
class XGBoost(object):
    """The XGBoost classifier.
    Reference: http://xgboost.readthedocs.io/en/latest/model.html

    Parameters:
    -----------
    n_estimators: int
        The number of trees that are used.
    learning_rate: float
        The step length that will be taken when following the negative
        gradient during training.
    min_samples_split: int
        The minimum number of samples needed to make a split when building
        a tree (no further splitting below this).
    min_impurity: float
        The minimum impurity required to split the tree further.
    max_depth: int
        The maximum depth of a tree (no further splitting beyond this).
    """
    def __init__(self, n_estimators=200, learning_rate=0.01, min_samples_split=2,
                 min_impurity=1e-7, max_depth=2):
        self.n_estimators = n_estimators            # Number of trees
        self.learning_rate = learning_rate          # Step size for weight update
        self.min_samples_split = min_samples_split  # The minimum number of samples to justify a split
        self.min_impurity = min_impurity            # Minimum impurity reduction to continue
        self.max_depth = max_depth                  # Maximum depth for a tree

        self.bar = progressbar.ProgressBar(widgets=bar_widgets)

        # Squared-error loss for regression
        self.loss = LeastSquaresLoss()

        # Initialize regression trees
        self.trees = []
        for _ in range(n_estimators):
            tree = XGBoostRegressionTree(
                min_samples_split=self.min_samples_split,
                min_impurity=min_impurity,
                max_depth=self.max_depth,
                loss=self.loss)
            self.trees.append(tree)

    def fit(self, X, y):
        m = X.shape[0]
        y = np.reshape(y, (m, -1))
        y_pred = np.zeros(np.shape(y))
        for i in self.bar(range(self.n_estimators)):
            tree = self.trees[i]
            # Pack y_true and the running prediction into one matrix
            y_and_pred = np.concatenate((y, y_pred), axis=1)
            tree.fit(X, y_and_pred)
            update_pred = tree.predict(X)
            update_pred = np.reshape(update_pred, (m, -1))
            y_pred += update_pred

    def predict(self, X):
        y_pred = None
        m = X.shape[0]
        # Accumulate the prediction of every tree
        for tree in self.trees:
            update_pred = tree.predict(X)
            update_pred = np.reshape(update_pred, (m, -1))
            if y_pred is None:
                y_pred = np.zeros_like(update_pred)
            y_pred += update_pred
        return y_pred
```
2.2.1 __init__()
Builds an ensemble containing n_estimators XGBoostRegressionTree instances.
If any part of this is hard to follow, I suggest going back over the link I mentioned at the beginning; once you have thoroughly understood how XGBoost works, the source code should be easy to read.
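One detail worth seeing in isolation is how fit() threads y and y_pred through a single matrix so that _split() can recover them at every node. A small round-trip sketch in plain NumPy, independent of the classes above:

```python
import numpy as np

m = 4
y = np.arange(1.0, 5.0).reshape(m, -1)  # y_true, shape (4, 1)
y_pred = np.zeros_like(y)               # running prediction, shape (4, 1)

# fit() concatenates the two side by side before handing them to the tree
y_and_pred = np.concatenate((y, y_pred), axis=1)  # shape (4, 2)

# _split() recovers them by cutting the matrix in half column-wise
col = y_and_pred.shape[1] // 2
y_true, y_hat = y_and_pred[:, :col], y_and_pred[:, col:]
assert np.allclose(y_true, y) and np.allclose(y_hat, y_pred)
```

This trick lets the gain computation at any subset of rows see both the targets and the current predictions without changing the DecisionTree interface.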
3. Source Code Repository
The project includes concise implementations of many machine learning algorithms.
4. Implementation Differences Between XGBoost and GBDT
In GBDT, each tree is fit as an ordinary regression tree: the boosting procedure does not care how the tree is structured internally or how its leaf values are computed, i.e. GBDT does not customize the tree-building process.
In XGBoost, by contrast, fitting each tree redefines both the splitting criterion used during construction and the concrete values of the leaf nodes.
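The practical consequence is that GBDT takes a plain gradient step scaled by a learning rate, while XGBoost's leaf value G/H is a Newton step that also uses curvature. A hypothetical one-dimensional example (not from the article) on the convex loss f(w) = exp(w) - 2w, whose minimum is at w = ln 2 ≈ 0.693:

```python
import math

w = 0.0
g = math.exp(w) - 2   # first derivative at w = 0: -1.0
h = math.exp(w)       # second derivative at w = 0: 1.0

gbdt_step = -0.1 * g  # gradient step with learning rate 0.1 -> 0.1
xgb_step = -g / h     # Newton step (XGBoost leaf value)   -> 1.0

print(gbdt_step, xgb_step)  # 0.1 1.0; the Newton step lands nearer ln 2
```

For squared-error loss the hessian is constant, so the two updates differ only by the learning rate; for losses with varying curvature, the Newton step adapts the step size per leaf.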
This article is a record of my own learning journey, and I hope it can be of some help to beginners. If you spot mistakes or have questions, you are welcome to discuss them with me.