1. Preface
This article focuses on the details of an XGBoost code implementation. Readers who want to understand the theory behind XGBoost should read Tianqi Chen's paper. Below is a Chinese-language article through which I got a good grasp of how XGBoost works, shared here as a link: http://www.52cs.org/?p=429
At the end I will briefly compare the implementation differences between XGBoost and GBDT.
2. Source Code Walkthrough
Like GBDT and random forests, XGBoost relies on a decision tree subclass; that subclass is explained in my previous article.
2.1 Building XGBoostRegressionTree
XGBoostRegressionTree inherits from the DecisionTree class explained in my earlier article.
```python
import numpy as np

# DecisionTree comes from the previous article in this series.
class XGBoostRegressionTree(DecisionTree):
    """Regression tree for XGBoost.
    Reference: http://xgboost.readthedocs.io/en/latest/model.html
    """

    def _split(self, y):
        """y contains y_true in the left half of the matrix and y_pred in
        the right half. Split and return the two matrices."""
        col = int(np.shape(y)[1] / 2)
        y, y_pred = y[:, :col], y[:, col:]
        return y, y_pred

    def _gain(self, y, y_pred):
        # Structure score 0.5 * G^2 / H (regularization terms omitted)
        numerator = np.power((self.loss.gradient(y, y_pred)).sum(), 2)
        denominator = self.loss.hess(y, y_pred).sum()
        return 0.5 * (numerator / denominator)

    def _gain_by_taylor(self, y, y1, y2):
        # Split each matrix into true values and current predictions
        y, y_pred = self._split(y)
        y1, y1_pred = self._split(y1)
        y2, y2_pred = self._split(y2)

        true_gain = self._gain(y1, y1_pred)
        false_gain = self._gain(y2, y2_pred)
        gain = self._gain(y, y_pred)
        return true_gain + false_gain - gain

    def _approximate_update(self, y):
        # y is split into y_true and y_pred
        y, y_pred = self._split(y)
        # Leaf value: sum of gradients / sum of hessians
        gradient = np.sum(self.loss.gradient(y, y_pred), axis=0)
        hessian = np.sum(self.loss.hess(y, y_pred), axis=0)
        update_approximation = gradient / hessian
        return update_approximation

    def fit(self, X, y):
        self._impurity_calculation = self._gain_by_taylor
        self._leaf_value_calculation = self._approximate_update
        super(XGBoostRegressionTree, self).fit(X, y)
```
2.1.1 _gain():
This function computes the gain of the data in a node, i.e. the structure score gain = (1/2) · G²/H, where G is the sum of the first derivatives (gradients) of the loss over the node's samples and H is the sum of the second derivatives (hessians).
Here I have ignored the regularization parameters.
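As a minimal numeric sketch, assuming a squared-error loss whose gradient method returns the residual (y - y_pred) and whose hessian is 1 per sample (an assumption about LeastSquaresLoss, not shown in this article), the structure score reduces to:

```python
import numpy as np

def gain(y, y_pred):
    """Structure score 0.5 * G^2 / H under the assumed squared-error convention."""
    G = np.sum(y - y_pred)           # sum of first-order terms (residuals)
    H = np.sum(np.ones_like(y))      # sum of second-order terms (all 1 here)
    return 0.5 * G ** 2 / H

y = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.zeros(4)
print(gain(y, y_pred))  # G = 10, H = 4, so 0.5 * 100 / 4 = 12.5
```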
2.1.2 _gain_by_taylor():
This function calls _gain() to compute the purity of the parent node and the two candidate child nodes, and uses gain(left) + gain(right) - gain(parent) as the criterion for whether to split.
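The criterion can be illustrated numerically. A minimal sketch, again assuming a squared-error loss where the gradient is the residual (y - y_pred) and the hessian is 1 per sample; the gain() helper below is a hypothetical stand-in for the class's _gain():

```python
import numpy as np

def gain(y, y_pred):
    G = np.sum(y - y_pred)
    H = len(y)
    return 0.5 * G ** 2 / H

# Two clearly separated groups of targets; splitting between them should pay off.
y = np.array([1.0, 1.0, 10.0, 10.0])
y_pred = np.zeros(4)

parent = gain(y, y_pred)        # 0.5 * 22^2 / 4 = 60.5
left = gain(y[:2], y_pred[:2])  # 0.5 * 2^2  / 2 = 1.0
right = gain(y[2:], y_pred[2:]) # 0.5 * 20^2 / 2 = 100.0
print(left + right - parent)    # 40.5 > 0, so this split improves the score
```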
2.1.3 _approximate_update():
Once XGBoost has finished splitting, the value of each leaf node is already determined.
Specifically, the leaf value is the sum of the gradients divided by the sum of the hessians over the leaf's samples, w = G/H (in the paper this is written w* = -G/(H + λ); the sign here follows the gradient convention of the loss class). Here I have ignored the regularization parameters.
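Under the same assumed squared-error convention (gradient = residual, hessian = 1 per sample), the leaf value G/H is simply the mean residual of the leaf's samples, which is the familiar regression-tree leaf value:

```python
import numpy as np

# Leaf value sketch mirroring _approximate_update: gradient sum / hessian sum.
y = np.array([2.0, 4.0, 6.0])
y_pred = np.zeros(3)
leaf_value = np.sum(y - y_pred) / np.sum(np.ones_like(y))
print(leaf_value)  # 12 / 3 = 4.0, i.e. the mean residual in the leaf
```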
2.1.4 fit():
Sets _gain_by_taylor() as the tree-splitting criterion and _approximate_update() as the method for estimating leaf values, then hands both back to DecisionTree, which builds the tree on that basis.
2.2 Building XGBoost
```python
import numpy as np
import progressbar

# bar_widgets, LeastSquaresLoss and XGBoostRegressionTree come from the
# surrounding project (bar_widgets is a list of progressbar widgets).
class XGBoost(object):
    """The XGBoost classifier.
    Reference: http://xgboost.readthedocs.io/en/latest/model.html

    Parameters:
    -----------
    n_estimators: int
        The number of trees that are used.
    learning_rate: float
        The step length that will be taken when following the negative
        gradient during training.
    min_samples_split: int
        The minimum number of samples needed to make a split when building
        a tree (no further splitting below this).
    min_impurity: float
        The minimum impurity required to split the tree further.
    max_depth: int
        The maximum depth of a tree (no further splitting beyond this).
    """
    def __init__(self, n_estimators=200, learning_rate=0.01, min_samples_split=2,
                 min_impurity=1e-7, max_depth=2):
        self.n_estimators = n_estimators            # Number of trees
        self.learning_rate = learning_rate          # Step size for weight update
        self.min_samples_split = min_samples_split  # The minimum number of samples to justify a split
        self.min_impurity = min_impurity            # Minimum impurity reduction to continue
        self.max_depth = max_depth                  # Maximum depth for a tree

        self.bar = progressbar.ProgressBar(widgets=bar_widgets)

        # Squared-error loss for regression
        self.loss = LeastSquaresLoss()

        # Initialize regression trees
        self.trees = []
        for _ in range(n_estimators):
            tree = XGBoostRegressionTree(
                min_samples_split=self.min_samples_split,
                min_impurity=min_impurity,
                max_depth=self.max_depth,
                loss=self.loss)
            self.trees.append(tree)

    def fit(self, X, y):
        m = X.shape[0]
        y = np.reshape(y, (m, -1))
        y_pred = np.zeros(np.shape(y))
        for i in self.bar(range(self.n_estimators)):
            tree = self.trees[i]
            # Pack y_true and the running prediction into one matrix
            y_and_pred = np.concatenate((y, y_pred), axis=1)
            tree.fit(X, y_and_pred)
            update_pred = tree.predict(X)
            update_pred = np.reshape(update_pred, (m, -1))
            y_pred += update_pred

    def predict(self, X):
        y_pred = None
        m = X.shape[0]
        # Accumulate the prediction of every tree
        for tree in self.trees:
            update_pred = tree.predict(X)
            update_pred = np.reshape(update_pred, (m, -1))
            if y_pred is None:
                y_pred = np.zeros_like(update_pred)
            y_pred += update_pred
        return y_pred
```
2.2.1 __init__()
Builds an ensemble containing n_estimators XGBoostRegressionTree instances.
If any part of this is hard to follow, I suggest going back over the link I mentioned at the beginning; once you have thoroughly understood how XGBoost works, the source code should be easy to read.
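One detail worth seeing in isolation is how fit() threads y and y_pred through a single matrix so that _split() can recover them at every node. A small round-trip sketch in plain NumPy, independent of the classes above:

```python
import numpy as np

m = 4
y = np.arange(1.0, 5.0).reshape(m, -1)  # y_true, shape (4, 1)
y_pred = np.zeros_like(y)               # running prediction, shape (4, 1)

# fit() concatenates the two side by side before handing them to the tree
y_and_pred = np.concatenate((y, y_pred), axis=1)  # shape (4, 2)

# _split() recovers them by cutting the matrix in half column-wise
col = y_and_pred.shape[1] // 2
y_true, y_hat = y_and_pred[:, :col], y_and_pred[:, col:]
assert np.allclose(y_true, y) and np.allclose(y_hat, y_pred)
```

This trick lets the gain computation at any subset of rows see both the targets and the current predictions without changing the DecisionTree interface.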
3. Source Code Repository
The project includes concise implementations of many machine learning algorithms.
4. Implementation Differences Between XGBoost and GBDT
In GBDT, each tree is fit as an ordinary regression tree: the boosting procedure does not care how the tree is structured internally or how its leaf values are computed, i.e. GBDT does not customize the tree-building process.
In XGBoost, by contrast, fitting each tree redefines both the splitting criterion used during construction and the concrete values of the leaf nodes.
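The practical consequence is that GBDT takes a plain gradient step scaled by a learning rate, while XGBoost's leaf value G/H is a Newton step that also uses curvature. A hypothetical one-dimensional example (not from the article) on the convex loss f(w) = exp(w) - 2w, whose minimum is at w = ln 2 ≈ 0.693:

```python
import math

w = 0.0
g = math.exp(w) - 2   # first derivative at w = 0: -1.0
h = math.exp(w)       # second derivative at w = 0: 1.0

gbdt_step = -0.1 * g  # gradient step with learning rate 0.1 -> 0.1
xgb_step = -g / h     # Newton step (XGBoost leaf value)   -> 1.0

print(gbdt_step, xgb_step)  # 0.1 1.0; the Newton step lands nearer ln 2
```

For squared-error loss the hessian is constant, so the two updates differ only by the learning rate; for losses with varying curvature, the Newton step adapts the step size per leaf.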
This article is a record of my own learning journey, and I hope it can be of some help to beginners. If you spot mistakes or have questions, you are welcome to discuss them with me.