The Difference Between Decision Trees and Boosted Trees
Decision Trees are popular Machine Learning algorithms used for both regression and classification tasks. Their popularity mainly arises from their interpretability and intuitive representation, as they mimic the way humans make decisions.
However, this interpretability comes at a price in terms of prediction accuracy. To overcome this drawback, several techniques have been developed with the goal of building strong, robust models starting from 'weak' ones. These techniques are known as 'ensemble' methods (I discussed some of them in my previous article here).
In this article, I’m going to examine four different ensemble techniques, all using a Decision Tree as the base learner, with the aim of comparing their performance in terms of accuracy and training time. The four algorithms I’m going to use are:
- Random Forest
- Gradient Boosting
- XGBoost
- LightGBM
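All four expose the same scikit-learn-style `fit`/`predict` API, which is what makes a side-by-side comparison straightforward. A minimal sketch of instantiating them on a toy dataset — the hyperparameters shown are illustrative defaults, not the article's settings, and `XGBClassifier`/`LGBMClassifier` assume the separate `xgboost` and `lightgbm` packages are installed:

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# small toy dataset, same shape as the one generated below in the article
X, y = make_blobs(n_samples=200, centers=3, n_features=2, random_state=0)

# the two ensembles that ship with scikit-learn
models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=0),
}

# XGBoost and LightGBM live in separate packages but follow the same API
try:
    from xgboost import XGBClassifier
    models["XGBoost"] = XGBClassifier(n_estimators=100)
except ImportError:
    pass
try:
    from lightgbm import LGBMClassifier
    models["LightGBM"] = LGBMClassifier(n_estimators=100)
except ImportError:
    pass

# every model is trained and scored the same way
for name, model in models.items():
    model.fit(X, y)
    print(name, round(model.score(X, y), 3))
```

Because the interface is identical, swapping one algorithm for another only changes the constructor line.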
To compare these methods’ performances, I initialized an artificial dataset as follows:
```python
from sklearn.datasets import make_blobs
from matplotlib import pyplot
from pandas import DataFrame

# generate 2d classification dataset
X, y = make_blobs(n_samples=10000, centers=3, n_features=2)
df = DataFrame(dict(x1=X[:,0], x2=X[:,1], label=y))
df.head()
```
![Image for post](https://miro.medium.com/max/9999/1*irRgGwjP8MfsauOb8Ap2RA.png)
As you can see, our dataset contains observations with a vector of two predictors [x1, x2] and a categorical output with 3 classes [0, 1, 2].
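Before fitting any of the models, such a dataset would typically be split into train and test sets so that accuracy is measured on data the model has not seen. A minimal sketch, assuming a `train_test_split` with an illustrative 80/20 ratio (the split ratio is my assumption, not taken from the article):

```python
from pandas import DataFrame
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# recreate the dataset from above
X, y = make_blobs(n_samples=10000, centers=3, n_features=2)
df = DataFrame(dict(x1=X[:, 0], x2=X[:, 1], label=y))

# hold out 20% of the observations for evaluation (illustrative ratio)
X_train, X_test, y_train, y_test = train_test_split(
    df[["x1", "x2"]], df["label"], test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)  # → (8000, 2) (2000, 2)
```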