Decision Trees vs. Boosted Trees: Decision Tree Boosting Techniques Compared

This article explores the differences between decision trees and boosted trees. Based on the translation of an English-language article, it takes a closer look at how these two kinds of machine learning models differ in algorithmic principles and applications.

Decision Trees are popular Machine Learning algorithms used for both regression and classification tasks. Their popularity mainly arises from their interpretability and ease of representation, as they mimic the way the human brain makes decisions.

However, interpretability comes at a price in terms of prediction accuracy. To overcome this limitation, several techniques have been developed with the goal of building strong, robust models starting from 'weak' ones. These techniques are known as 'ensemble' methods (I discussed some of them in my previous article here).

In this article, I'm going to focus on four different ensemble techniques, all using a Decision Tree as base learner, with the aim of comparing their performances in terms of accuracy and training time. The four algorithms I'm going to use are:

  • Random Forest
  • Gradient Boosting
  • XGBoost
  • LightGBM

To compare these methods’ performances, I initialized an artificial dataset as follows:

from sklearn.datasets import make_blobs
from matplotlib import pyplot
from pandas import DataFrame
# generate 2d classification dataset
X, y = make_blobs(n_samples=10000, centers=3, n_features=2)
df = DataFrame(dict(x1=X[:,0], x2=X[:,1], label=y))
df.head()
[Table: first five rows of the generated DataFrame (x1, x2, label)]

As you can see, our dataset contains observations with a vector of two predictors [x1, x2] and a categorical output with 3 classes [0, 1, 2].

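Since pyplot and the DataFrame are already imported above, a quick way to inspect the three clusters is a coloured scatter plot. This is a minimal sketch, assuming the df built above; the colour mapping is arbitrary:

# scatter plot of the three classes of the artificial dataset
colors = {0: 'red', 1: 'blue', 2: 'green'}
fig, ax = pyplot.subplots()
for label, group in df.groupby('label'):
    group.plot(ax=ax, kind='scatter', x='x1', y='x2', label=label, color=colors[label])
pyplot.show()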

The final aim is to show how LightGBM outperforms (by far) the other algorithms.

Random Forest

Random Forest relies on the concept of bagging: if we could train multiple trees on different datasets and then use the average of their outputs (or, in the case of classification, the majority vote) to predict the label of a new observation, we would get more accurate results. We can achieve that by creating a series of datasets, each obtained as a bootstrapped version of the original one, and then training a classifier on each of them.

Plus, the Random Forest algorithm adds a further constraint: every time a tree is grown from a bootstrapped sample, it is allowed to consider only a random subset of size m of the full covariate space of size p (with m < p). By doing so, the trees are decorrelated from one another.

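To make the bagging idea concrete, below is a minimal from-scratch sketch (an illustration only, not the scikit-learn implementation used next): each tree is fit on a bootstrap sample restricted to m randomly chosen features, and the final label is the majority vote across trees. The function name and default parameters are purely illustrative.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_trees_predict(X, y, X_new, n_trees=50, m_features=1, seed=0):
    # toy bagging: bootstrap the rows, pick m of the p features for each tree,
    # then take the majority vote over all trees
    rng = np.random.RandomState(seed)
    n, p = X.shape
    votes = []
    for _ in range(n_trees):
        rows = rng.choice(n, size=n, replace=True)            # bootstrap sample
        cols = rng.choice(p, size=m_features, replace=False)  # feature subset (m < p)
        tree = DecisionTreeClassifier().fit(X[rows][:, cols], y[rows])
        votes.append(tree.predict(X_new[:, cols]))
    # majority vote across trees for each new observation
    return np.array([np.bincount(v).argmax() for v in np.array(votes).T])

For instance, bagged_trees_predict(X, y, X[:5]) would return the majority-vote labels for the first five observations.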

Let us see how it performs on our artificial data:

# Random Forest: accuracy via 5-fold cross-validation, timing the whole procedure
import time
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

start = time.time()
clf_rf = RandomForestClassifier()
scores = cross_val_score(clf_rf, X, y, cv=5)
acc_rf = scores.mean()
end = time.time()
temp_rf = end - start

To evaluate model performance in terms of accuracy, I will use the cross-validation approach on the training set, partitioning it into 5 folds. I've stored the time and accuracy results in variables which will be revealed at the end of this article.

Gradient Boosting

The idea behind boosting is to build a series of trees, each being an updated version of the previous one. Basically, at each iteration a tree is built on the dataset (X, r) rather than (X, y), where "r" denotes the residuals left by the previous trees. Then a shrunken version of this new tree is added to the current model, and the procedure continues until the end of the loop (or until a stopping condition is reached).

With Gradient Boosting, the update of the classifier is carried out via a gradient descent optimization procedure, which is used to approximate the residuals (to learn more about how this algorithm works, there is a very intuitive article here).

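To make that loop concrete, here is a minimal regression sketch with squared-error loss (an illustration only, not the multi-class classifier trained below): at each iteration a tree is fit on the current residuals and a shrunken copy of its predictions is added to the model. The function name and default parameters are purely illustrative.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def toy_gradient_boosting(X, y, n_trees=100, learning_rate=0.1, max_depth=2):
    # start from a constant prediction, then repeatedly fit trees on the residuals
    prediction = np.full(len(y), y.mean(), dtype=float)
    trees = []
    for _ in range(n_trees):
        residuals = y - prediction                            # the dataset becomes (X, r)
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        prediction += learning_rate * tree.predict(X)         # add a shrunken version of the tree
        trees.append(tree)
    return trees, prediction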

So let's initialize this algorithm as well and train it on our data:

# Gradient Boosting: same 5-fold cross-validation and timing procedure
from sklearn.ensemble import GradientBoostingClassifier

start = time.time()
clf_gb = GradientBoostingClassifier(max_depth=2, random_state=0)
scores = cross_val_score(clf_gb, X, y, cv=5)
acc_gb = scores.mean()
end = time.time()
temp_gb = end - start

XGBoost

XGBoost is an "extreme" version of Gradient Boosting, in the sense that it is more efficient, flexible, and portable. Among the features that make this algorithm so performant, we can cite its parallelized tree construction (using all CPU cores) and its ability to be distributed across different machines to train on very large datasets. Plus, it is generally more accurate than standard Gradient Boosting.

All these features made it the favourite algorithm on Kaggle for a very long time, until newer implementations of Gradient Boosting entered the market (among them, LightGBM).

So let’s train our XGBoost and store its results.

# installing the package for xgboost
!pip install xgboost

import xgboost as xgb

start = time.time()
clf_xgb = xgb.XGBClassifier()
scores = cross_val_score(clf_xgb, X, y, cv=5)
acc_xgb = scores.mean()
end = time.time()
temp_xgb = end - start
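
If you want to exploit the parallel tree construction mentioned above explicitly, the scikit-learn wrapper exposes it through its constructor arguments. This is only a hedged sketch: the hyperparameter values and the variable name are arbitrary, and defaults may differ across xgboost versions.

# an XGBoost classifier configured to use all CPU cores and
# the fast histogram-based tree construction method
clf_xgb_fast = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=2,
    learning_rate=0.1,
    tree_method='hist',
    n_jobs=-1
)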

Note: unlike Random Forest and the Gradient Boosting Classifier, which are part of scikit-learn, XGBoost and, later on, LightGBM have to be treated as individual packages. Hence, we can easily install them via pip.

Now it’s time to train and evaluate the last ensemble algorithm and then compare all the obtained results.

LightGBM

LightGBM is yet another gradient boosting framework that uses tree-based learning algorithms. Like its counterpart XGBoost, it focuses on computational efficiency and high predictive performance.

In recent times, LightGBM has gathered incredible success in many Kaggle competitions, outperforming XGBoost in terms of both training speed and prediction accuracy.

Let's see whether that also holds for our artificial dataset.

# install and import the LightGBM package
! pip install lightgbm
import lightgbm as lgb

# the artificial dataset has 3 classes, so a multiclass objective is used here
params = {
    'boosting_type': 'gbdt',
    'objective': 'multiclass',
    'metric': 'multi_error',
    'num_class': 3,
}

lgb_train = lgb.Dataset(X, y, free_raw_data=False)

start = time.time()
scores = lgb.cv(
    params,
    lgb_train,
    num_boost_round=100,
    nfold=5,
    stratified=False
)
# cross-validated accuracy (%) from the error of the last boosting round;
# the key is 'multi_error-mean' or 'valid multi_error-mean' depending on the LightGBM version
error_key = [k for k in scores if k.endswith('multi_error-mean')][0]
acc_lgb = 100 * (1 - scores[error_key][-1])
end = time.time()
temp_lgb = end - start
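
Note that, unlike the other three models, the snippet above evaluates LightGBM through its native lgb.cv API. If you prefer to reuse the exact same cross_val_score pipeline as before, LightGBM also ships a scikit-learn wrapper; this is a minimal sketch under that assumption (variable names are illustrative, and results and timings may differ slightly from lgb.cv):

# alternative: evaluate LightGBM with the same scikit-learn pipeline used for the other models
from lightgbm import LGBMClassifier

start = time.time()
clf_lgb = LGBMClassifier(n_estimators=100)
scores_sk = cross_val_score(clf_lgb, X, y, cv=5)
acc_lgb_sk = scores_sk.mean() * 100   # accuracy as a percentage, like acc_lgb above
end = time.time()
temp_lgb_sk = end - start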

Conclusions

Now let us see the results of our algorithms:

import pandas as pd

# collect accuracy (%) and training time (s) for each model
data = dict([('LightGBM', [acc_lgb, temp_lgb]), ('XGBoost', [acc_xgb*100, temp_xgb]),
             ('Random Forest', [acc_rf*100, temp_rf]), ('Gradient Boosting', [acc_gb*100, temp_gb])])
df = pd.DataFrame(data).T.rename(columns={0: 'Accuracy', 1: 'Training Time'})
df
[Table: accuracy and training time for each of the four models]
import matplotlib.pyplot as plt

fig, axes = plt.subplots(figsize=(12,8), nrows=1, ncols=2, sharey=True)

# helper function for a more readable display of the training time
def ret_time(temp):
    minutes = round(temp//60, 0)
    seconds = round(temp - 60*minutes, 0)
    return [int(minutes), int(seconds)]

# left panel: accuracy of each model on 5-fold CV
ax1 = df.sort_values(by='Accuracy', ascending=True)["Accuracy"].plot(ax=axes[0], kind='barh', logx=True, xlim=(0,102))
ax1.text(96.15, 0, str(str(round(acc_rf*100,2))+'%'), fontsize=15)
ax1.text(96.4, 1, str(str(round(acc_gb*100,2))+'%'), fontsize=15)
ax1.text(96.44, 2, str(str(round(acc_xgb*100,2))+'%'), fontsize=15)
ax1.text(100, 3, str(str(round(acc_lgb,2))+'%'), fontsize=15)
ax1.set_title('Accuracy on 5-fold CV')

# right panel: training time of each model
ax2 = df.sort_values(by='Accuracy', ascending=True)["Training Time"].plot(ax=axes[1], kind='barh', xlim=(0,900), colormap='viridis')
ax2.text(400, 0, str(str(ret_time(temp_rf)[0]) + 'm:'+str(ret_time(temp_rf)[1])+'s'), fontsize=15)
ax2.text(700, 1, str(str(ret_time(temp_gb)[0]) + 'm:'+str(ret_time(temp_gb)[1])+'s'), fontsize=15)
ax2.text(400, 2, str(str(ret_time(temp_xgb)[0]) + 'm:'+str(ret_time(temp_xgb)[1])+'s'), fontsize=15)
ax2.text(100, 3, str(str(ret_time(temp_lgb)[0]) + 'm:'+str(ret_time(temp_lgb)[1])+'s'), fontsize=15)
ax2.set_title('Training Time')
[Figure: bar charts of 5-fold CV accuracy and training time for the four models]

As you can see, not only is LightGBM the model with the highest accuracy, but it is also the one with the lowest training time (by far).

Of course, a single experiment is not enough to draw robust conclusions. Plus, picking the best algorithm really depends on the task you are carrying out, as well as on the size (and nature) of the dataset.

Nevertheless, LightGBM has achieved great performance in many Kaggle competitions and is today one of the preferred classifiers in practice.

I hope you enjoyed this read! If you are interested in the topic, as well as in further "extreme" versions of Gradient Boosting, I suggest the references below:

Translated from: https://medium.com/dataseries/decision-tree-boosting-techniques-compared-5667bb2087ab
