A Hands-On Explanation of Gradient Boosting Regression

Introduction

One of the most powerful ways of training models is to train multiple models and aggregate their predictions. This is the main concept of Ensemble Learning. While many flavours of Ensemble Learning exist, some of the most powerful algorithms are Boosting Algorithms. In my previous article, I broke down one of the most popular Boosting Algorithms: Adaptive Boosting. Today, I want to talk about its equally powerful twin: Gradient Boosting.

Boosting & Adaptive Boosting vs Gradient Boosting

Boosting refers to any Ensemble Method that can combine several weak learners (predictors with poor accuracy) to make a strong learner (a predictor with high accuracy). The idea behind boosting is to train models sequentially, each trying to correct its predecessor.

An Overview of Adaptive Boosting

In Adaptive Boosting, the model starts by assigning a certain weight to each instance and training a weak learner. Based on the predictor’s performance, that predictor is then assigned its own separate weight, computed from its weighted error rate. The higher the accuracy of the predictor, the higher its weight, and the more “say” it will have in the final prediction.

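As a rough sketch of that weighting (one common formulation, close to AdaBoost's SAMME update; the helper below is an illustration, not scikit-learn's code), a predictor's “say” can be derived from its weighted error rate like this:

import numpy as np

def predictor_say(instance_weights, misclassified, eta=1.0):
    # weighted error rate: the share of total instance weight the predictor got wrong
    r = instance_weights[misclassified].sum() / instance_weights.sum()
    # the lower the error, the larger the "say"; eta plays the role of a learning rate
    return eta * np.log((1 - r) / r)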

Once the predictor has made its predictions, AdaBoost looks at the misclassified instances and boosts their instance weights. After normalising the instance weights so that they all sum to 1, a new predictor is trained, and the process is repeated until a desirable output is reached or a threshold is hit.

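A minimal sketch of that boost-and-normalise step, under the same assumptions as the snippet above (the exponential bump and the helper name are illustrative, not scikit-learn's exact implementation):

import numpy as np

def boost_instance_weights(instance_weights, misclassified, say):
    # bump the weights of the instances the predictor got wrong,
    # in proportion to how much "say" that predictor earned
    new_weights = instance_weights.copy()
    new_weights[misclassified] *= np.exp(say)
    # normalise so the weights sum to 1 again
    return new_weights / new_weights.sum()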

The final classification is done by taking a weighted vote. In other words, if we were predicting heart disease for a patient, and 60 stumps predicted 1 while 40 predicted 0, but the predictors in the 0 class had a higher cumulative weight (i.e. those predictors had more “say”), then the final prediction would be 0.

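Concretely, that weighted vote could be tallied along these lines (a toy example with made-up numbers, just to mirror the scenario above):

import numpy as np

# 60 stumps vote 1 but carry little "say"; 40 stumps vote 0 and carry more
stump_predictions = np.array([1] * 60 + [0] * 40)
stump_says = np.array([0.2] * 60 + [0.5] * 40)

say_for_1 = stump_says[stump_predictions == 1].sum()  # 12.0
say_for_0 = stump_says[stump_predictions == 0].sum()  # 20.0
final_prediction = 1 if say_for_1 > say_for_0 else 0  # 0, despite 1 getting more votes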

Gradient Boosting

In contrast to Adaptive Boosting, instead of sequentially boosting misclassified instance weights, Gradient Boosting actually makes predictions on its predecessor's residuals. Woah, hold it. What?

Ok, so let’s break down the model’s steps:

  1. The first thing Gradient Boosting does is start off with a Dummy Estimator. Basically, it calculates the mean of the target values and uses that as its initial prediction for every instance. Using these predictions, it calculates the difference between the actual value and the predicted value. These differences are called the residuals.

  2. Next, instead of training a new estimator on the data to predict the target, it trains an estimator to predict the residuals of the first predictor. This predictor is usually a Decision Tree with certain limits, such as the maximum number of leaf nodes allowed. If multiple instances’ residuals fall in the same leaf node, it takes their average and uses that as the leaf node’s value.

  3. Next, to make predictions, for each instance it adds the Decision Tree’s predicted residual value onto the base estimator’s value to form a new prediction. It then calculates the residuals again between the actual and predicted values.

  4. This process is repeated until a certain threshold is reached or the residuals become very small.

  5. To make a prediction for an unseen instance, it gives the instance to each and every decision tree that was built, sums their predicted residuals, and adds the base estimator’s value (the whole loop is sketched right after this list).
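Here is what those five steps look like strung together, as a bare-bones illustrative sketch built on scikit-learn's DecisionTreeRegressor (it is not the library's actual GradientBoostingRegressor, and the hyperparameter defaults are just placeholders):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boosting(X, y, n_trees=100, learning_rate=0.1, max_leaf_nodes=8):
    base_prediction = y.mean()                               # step 1: the dummy estimator
    current_predictions = np.full(len(y), base_prediction)
    trees = []
    for _ in range(n_trees):                                 # step 4: repeat
        residuals = y - current_predictions                  # actual minus predicted
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)
        tree.fit(X, residuals)                               # step 2: fit a tree to the residuals
        current_predictions = current_predictions + learning_rate * tree.predict(X)  # step 3
        trees.append(tree)
    return base_prediction, trees

def predict_gradient_boosting(X, base_prediction, trees, learning_rate=0.1):
    # step 5: base value plus the (scaled) sum of every tree's predicted residuals
    return base_prediction + learning_rate * sum(tree.predict(X) for tree in trees)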

Learning Rate

An important hyperparameter to take note of here is the learning rate. It scales the contribution of each tree, essentially increasing bias in exchange for lower variance. So in steps 3 and 5 above, each tree's predicted value is actually multiplied by the learning rate to achieve better generalisation on unseen data.

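In pseudocode (reusing the names from the sketch above), each boosting step therefore only adds a fraction of the new tree's output to the running prediction:

current_predictions = current_predictions + learning_rate * tree.predict(X)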

A Hands-On Example of Gradient Boosting Regression with Python & Scikit-Learn

Some of these concepts might still feel a little unfamiliar, so, in order to learn, one must apply! Let's build a Gradient Boosting Regressor to predict house prices using the infamous Boston Housing Dataset. Without further ado, let's get started!

import pandas as pd
import numpy as np

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.datasets import load_boston
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor

Ok, so we do some basic imports, along with our dataset, which is conveniently built into scikit-learn, and KFold cross-validation for splitting our data into a train set and a validation set. We also import the DecisionTreeRegressor as well as the GradientBoostingRegressor.

df = pd.DataFrame(load_boston()['data'],columns=load_boston()['feature_names'])
df['y'] = load_boston()['target']
df.head(5)

Here, we just convert our data into a DataFrame for convenience.

X, y = df.drop('y', axis=1), df['y']

kf = KFold(n_splits=5, random_state=42, shuffle=True)
for train_index, val_index in kf.split(X):
    X_train, X_val = X.iloc[train_index], X.iloc[val_index]
    y_train, y_val = y.iloc[train_index], y.iloc[val_index]

Here, we initialise our features and our target, and use 5-fold cross-validation to split our dataset into a training set and a validation set.

Before I go ahead and implement scikit-learn’s GradientBoostingRegressor, I would like to make a custom one of my own, just to help illustrate the concepts I wrote about earlier.

First, we create our initial predictions to be just the average of the training label values and assign our learning rate to be 0.1:

base = [y_train.mean()] * len(y_train)
learning_rate = 0.1

Then, we calculate the residuals and get the MSE:

residuals_1 = y_train - base   # residual = actual value - predicted value
mean_squared_error(y_val, base[:len(y_val)])

OUT:
71.92521322606885

Well, not so bad, considering we just predicted the mean value the whole time!

After that, we create our first tree and train it on the residuals. This time we will get the MSE of our predictions on the training set:

dtree_1 = DecisionTreeRegressor(random_state=42)
dtree_1.fit(X_train, residuals_1)
predictions_dtree_1 = base + learning_rate * dtree_1.predict(X_train)
mean_squared_error(y_train, predictions_dtree_1)

OUT:
70.90445609876541

Ok, so a slight improvement, already showing the power of Gradient Boosting! Again, we get the residuals:

residuals_2 = y_train - predictions_dtree_1
dtree_2 = DecisionTreeRegressor(random_state=42)
dtree_2.fit(X_train,residuals_2)

And we get the MSE by making predictions, but note how we combined the predicted value of the first predictor with the new predictor’s values:

predictions_dtree_2 = (dtree_2.predict(X_train) * learning_rate) + predictions_dtree_1
mean_squared_error(y_train, predictions_dtree_2)

OUT:
57.43260944000001

Wow, that was a big leap indeed! Now, let’s keep training a few more predictors to see what we can achieve:

residuals_3 = y_train - predictions_dtree_2
dtree_3 = DecisionTreeRegressor(random_state=42)
dtree_3.fit(X_train, residuals_3)
predictions_dtree_3 = (dtree_3.predict(X_train) * learning_rate) + predictions_dtree_2

residuals_4 = y_train - predictions_dtree_3
dtree_4 = DecisionTreeRegressor(random_state=42)
dtree_4.fit(X_train, residuals_4)
predictions_dtree_4 = (dtree_4.predict(X_train) * learning_rate) + predictions_dtree_3

mean_squared_error(y_train, predictions_dtree_4)

OUT:
43.90388561846081

So we definitely improved our score, but now it’s time for the ultimate test: the validation set!

To make a final prediction, we do the following:

take the initial prediction (the mean of the target values), then add the learning rate times the sum of:

  • the predicted residual values from tree 1
  • the predicted residual values from tree 2
  • the predicted residual values from tree 3
  • the predicted residual values from tree 4

In code:

y_pred = (base[:len(X_val)] +
          (dtree_1.predict(X_val) * learning_rate) +
          (dtree_2.predict(X_val) * learning_rate) +
          (dtree_3.predict(X_val) * learning_rate) +
          (dtree_4.predict(X_val) * learning_rate))

And the result (Drumroll please…):

mean_squared_error(y_val, y_pred)

OUT:
42.32013345535233

Fantastic! Not only did the ensemble fit the training set well, it also generalised smoothly to the validation set! Why? Because we used a learning rate to control each tree's contribution, making sure that the ensemble did not overfit the data.

Gradient Boosting with Scikit-Learn's GradientBoostingRegressor

We have now manually built a basic gradient boosting algorithm, so now let us code out a Gradient Boosting Regressor using scikit-learn!

Using the same data as above:

gradient_booster = GradientBoostingRegressor(loss='ls', learning_rate=0.1)
gradient_booster.get_params()

OUT:
{'alpha': 0.9,
'ccp_alpha': 0.0,
'criterion': 'friedman_mse',
'init': None,
'learning_rate': 0.1,
'loss': 'ls',
'max_depth': 3,
'max_features': None,
'max_leaf_nodes': None,
'min_impurity_decrease': 0.0,
'min_impurity_split': None,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'n_estimators': 100,
'n_iter_no_change': None,
'presort': 'deprecated',
'random_state': None,
'subsample': 1.0,
'tol': 0.0001,
'validation_fraction': 0.1,
'verbose': 0,
'warm_start': False}

Now, this model has a lot of parameters, so it is worth mentioning the most important ones (a small configuration sketch follows this list):

learning_rate: exactly the same parameter as we discussed above; it scales the contribution of each tree.

init: the initial estimator, which defaults to a DummyEstimator (i.e. it predicts the mean for everything).

max_depth: the maximum depth you want your trees to grow to.

n_estimators: the number of trees you want to create.

criterion: the loss function you would like each decision tree to minimise when it is searching for the best feature and threshold to split the data on.

loss: the loss to use for calculating the residuals (the default is “ls”, or least squares).

max_leaf_nodes: the maximum number of leaf nodes you want to have for each tree. If this number is smaller than the number of training instances, and two or more instances end up in the same leaf, then the leaf’s value will be the average of all the training instance values in that leaf.
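To make those knobs concrete, here is a hypothetical configuration that sets several of them at once (the values are illustrative rather than tuned, and demo_booster is not used anywhere else in this walkthrough):

from sklearn.ensemble import GradientBoostingRegressor

demo_booster = GradientBoostingRegressor(
    loss='ls',                 # least squares residuals
    learning_rate=0.1,         # scale each tree's contribution
    n_estimators=200,          # number of trees to build
    max_depth=3,               # depth limit for every tree
    max_leaf_nodes=8,          # cap on leaf nodes per tree
    criterion='friedman_mse',  # split-quality measure used inside each tree
)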

Let’s fit our model to the dataset and get its R2 score:

gradient_booster.fit(X_train, y_train)
gradient_booster.score(X_train, y_train)

OUT:
0.9791009142174039

And let’s get its R2 Score on the validation set:

gradient_booster.score(X_val, y_val)

OUT:
0.8847454683496595

Ok, so our algorithm is slightly overfitting. Let's adjust the learning_rate parameter to see if we can get better results:

gradient_booster = GradientBoostingRegressor(loss='ls', learning_rate=0.25)
gradient_booster.fit(X_train, y_train)
gradient_booster.score(X_train, y_train)

OUT:
0.994857818295815

gradient_booster.score(X_val, y_val)

OUT:
0.9082261292781879

This is a much better result: increasing the learning rate lowered the bias, and in this case the validation score improved as well. Finally, let's get the ensemble's MSE on the validation set:

predictions = gradient_booster.predict(X_val)
mean_squared_error(y_val, predictions)

OUT:
6.599129139886324

Ok, so that wraps up this one, folks! I hope you enjoyed it, but my Gradient Boosting days are not over yet! I will be back, and in the next articles I will be talking about Gradient Boosting for Classification and, more importantly, the big boy: XTREME GRADIENT BOOSTING! But, for now:

Translated from: https://medium.com/@vagifaliyev/a-hands-on-explanation-of-gradient-boosting-regression-4cfe7cfdf9e
