Using Machine Learning to Predict Fitbit Sleep Scores

In Part 1 of this article I explained how we can obtain sleep data from Fitbit, load it into Python and preprocess it for further analysis. In this part I will explain how and why we split the data into training, validation and test sets, how we can select features for our Machine Learning models, and then train three different models: Multiple Linear Regression, Random Forest Regressor and Extreme Gradient Boosting Regressor. I will briefly explain how these models work and define performance measures to compare them. Let’s get started.


Separating the data into training, validation and test set

Before we do any further analysis using our data we need to split the entire data set into three different subsets: training set, validation set and test set. The following image displays this process well:


[Figure: Training, validation and test data]

The test set is also referred to as the hold-out set. Once we split it off from the remaining data, we do not touch it again until we have trained and tweaked our Machine Learning models to a point where we think they will perform well on data that they have never seen before.


We split the remaining data into a training and a validation set. This allows us to train our models on the training data and then evaluate their performance on the validation data. In theory, we can then tweak our models, evaluate them on the validation data again, and thereby find ways to improve model performance. This process often leads to overfitting: we focus so much on tuning our model to perform well on the validation set that it performs poorly on data it has never seen before (such as the test set).


In part 3 of this article I explain how we can reduce overfitting while making sure that the models still perform well. For now, we will follow the above approach of a simple split of the data set into training, validation and test set.


I want to split the data in a way that the training set is made up of 60% of the total data set and the validation and test set are both made up of 20%. This code achieves the correct percentage splits:

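A minimal sketch of this two-step split with scikit-learn’s train_test_split (assuming the features and Sleep Score labels are stored in X and y):

```python
from sklearn.model_selection import train_test_split

# First split off 20% of the data as the test set, then split the remaining
# 80% into training (60% of the total) and validation (20% of the total).
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)
```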

In the first split, the test_size parameter is set to 0.2, which splits the data into 80% training data and 20% test data. In order to split that 80% into training and validation data while ensuring that the validation data is 20% of the size of the original data set, the test_size parameter needs to be 0.25 (20% is one quarter, or 0.25, of 80%).


Before moving on I want to emphasise one important thing here. It is crucial to split the data before performing any further transformations, such as scaling, because we want to prevent any information about the test set from spilling over into our training and validation sets. Data scaling is often done using statistics about the data set as a whole, such as the mean and standard deviation. Because we want to measure how well our Machine Learning models perform on data they have never seen before, we have to make sure that no information from the test data influences how the scaling or any other transformation is done.


Scaling features, defining performance metrics and a baseline

Although feature scaling is not required for the Machine Learning models in this project, it is considered best practice to scale features when comparing different models and their performance.


In this code, I use MinMaxScaler, which I fit on the training data and then use to scale the training, validation and test data:

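A sketch of that scaling step (assuming the splits from above):

```python
from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on the training data only, then apply the same
# transformation to the validation and test sets.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)
```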

Performance measures

Next, let’s define some performance measures that we can use to evaluate our models and compare them. Because Sleep Score is a continuous variable (although only integer Sleep Scores are possible), the problem at hand is a regression problem. There are many different measures of performance for regression problems; in this analysis I will use Mean Absolute Error, Mean Squared Error and R-squared. Additionally, I compute an accuracy measure for the models’ predictions.


Accuracy is typically used as a performance measure in classification problems rather than regression problems because it refers to the proportion of correct predictions that the model makes. The way I use accuracy for the regression models in this analysis is different: here, accuracy measures how far off (in percentage terms) the predicted Sleep Score is from the actual Sleep Score, on average. For example, if the actual Sleep Score is 80 and the model has an accuracy of 96%, meaning that on average it is 4% off, the model is expected to predict a Sleep Score in the range of 76.8 (80 − (80 × 0.04)) to 83.2 (80 + (80 × 0.04)).


Here is the function that evaluates a model’s performance, taking as inputs the model at hand, the test features and the test labels:

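A minimal sketch of such an evaluation function, computing the four measures described above (the accuracy is 100% minus the mean absolute percentage error):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate(model, features, labels):
    predictions = model.predict(features)
    mae = mean_absolute_error(labels, predictions)
    mse = mean_squared_error(labels, predictions)
    r2 = r2_score(labels, predictions)
    # Accuracy: how far off, in percentage terms, the predictions are on average.
    mape = 100 * np.mean(np.abs(labels - predictions) / labels)
    accuracy = 100 - mape
    print(f"MAE: {mae:.2f}, MSE: {mse:.2f}, R-squared: {r2:.3f}, Accuracy: {accuracy:.2f}%")
    return accuracy
```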

But how do we know what scores are good or bad for these different measures? For example, is an accuracy of 90% good or bad? What about R-squared? In order to have a reference point, we will first come up with a baseline model that we can compare all later models and their performance to.


Baseline performance

In order to evaluate the Machine Learning models we are about to build, we want to have a baseline that we can compare their performance to. Generally, a baseline is a simplistic approach that generates predictions based on a simple rule. For our analysis, the baseline model always predicts the median Sleep Score of the training set. If our Machine Learning models are not able to outperform this simple baseline, they would be rather useless.


Let’s see what the performance of the baseline looks like:

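One way to build such a baseline is scikit-learn’s DummyRegressor, which can be set to always predict the training median (a sketch, reusing the evaluate function from above):

```python
from sklearn.dummy import DummyRegressor

# The baseline always predicts the median Sleep Score of the training set.
baseline = DummyRegressor(strategy="median")
baseline.fit(X_train_scaled, y_train)
evaluate(baseline, X_val_scaled, y_val)
```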

[Output: baseline performance on the validation set]

While the accuracy may seem decent, looking at the other performance measures tells a very different story. The R-squared is negative, which is a strong indication of extremely poor model performance.


Now that we have split our data into different subsets, scaled the features, defined performance metrics and come up with a baseline model, we are almost ready to start training and evaluating our Machine Learning models. Before we move on to our models, let’s first select the features that we want to use in those models.


Feature Selection using Lasso Regression

There are two questions that you might have after reading that heading: Why do we need to select features and what the hell is Lasso Regression?


Feature Selection

There are multiple reasons for selecting only a subset of the available features.


Firstly, feature selection enables the Machine Learning algorithm to train faster because it is using less data. Secondly, it reduces model complexity and makes it easier to interpret the model. In our case this will be important because apart from predicting Sleep Scores accurately we also want to be able to understand how the different features impact the Sleep Score. Thirdly, feature selection can reduce overfitting and thereby improve the prediction performance of the model.


In part 1 of this article we saw that many of the features in the sleep data set are highly correlated, meaning that the more features we use, the more multicollinearity will be present in the model. Generally speaking, this is not an issue if we only care about the prediction performance of the model, but it is an issue if we want to be able to interpret the model. Feature selection will also help reduce some of that multicollinearity.


For more information on feature selection see this article.


Lasso Regression

Before we move on to Lasso Regression let’s briefly recap what a linear regression does. Fitting a linear regression minimises a loss function by choosing coefficients for each feature variable. One problem with that is that large coefficients can lead to overfitting, meaning that the model will perform well on the training data but poorly on data it has never seen before. This is where regularisation comes in.


Lasso Regression is a type of regularisation regression that penalises the absolute size of the regression coefficients through an additional term in the loss function. The loss function for a Lasso regression can be written like this:


Loss = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² + λ Σⱼ₌₁ᵖ |βⱼ|

The first part of the loss function is equivalent to the loss function of a linear regression, which minimises the sum of squared residuals. The additional part is the penalty term, which penalises the absolute value of the coefficients. Mathematically, this is equivalent to minimising the sum of squared residuals with the constraint that the sum of absolute coefficient values has to be less than a prespecified parameter. This parameter determines the amount of regularisation and causes some coefficients to be shrunk to close to, or exactly, zero.


In the above equation, λ is the tuning parameter which determines the strength of the penalty, i.e. the amount of shrinkage. Setting λ=0 would result in the loss function for a linear regression and as λ increases, more and more coefficients are set to zero and the remaining coefficients are therefore “selected” by the Lasso Regression as being important.


Fitting a Lasso regression on the training data and plotting the resulting coefficients looks like this:

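A sketch of the fit and plot (the regularisation strength alpha and the feature_names list are assumptions; the article’s exact tuning value is not shown):

```python
from sklearn.linear_model import Lasso
import matplotlib.pyplot as plt

# Fit a Lasso regression on the scaled training data.
lasso = Lasso(alpha=0.1)
lasso.fit(X_train_scaled, y_train)

# Plot the resulting coefficient for each feature.
plt.bar(feature_names, lasso.coef_)
plt.xticks(rotation=45, ha="right")
plt.ylabel("Coefficient")
plt.tight_layout()
plt.show()
```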

[Figure: Lasso regression coefficients by feature]

The Lasso Regression algorithm has reduced the coefficients of Time in Bed and Minutes Light Sleep to close to zero, deeming them less important than the other four features. This comes in handy as we would face major multicollinearity issues if we included all of the features in our models. Let’s drop these two features from our data sets:

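A sketch of that step (assuming the splits are still held as pandas DataFrames with these column names; the scaling can then be re-applied as above):

```python
# Drop the two features that Lasso shrank to (close to) zero.
drop_cols = ["Time in Bed", "Minutes Light Sleep"]
X_train = X_train.drop(columns=drop_cols)
X_val = X_val.drop(columns=drop_cols)
X_test = X_test.drop(columns=drop_cols)
```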

Now that we have selected a set of four features we can move on to building some Machine Learning models that will use those four features to predict Sleep Scores.


Multiple Linear Regression

In summary, Multiple Linear Regression (MLR) is used to estimate the relationship between one dependent variable and two or more independent variables. In our case, it will be used to estimate the relationship between Sleep Score and Minutes Asleep, Minutes Awake, Minutes REM Sleep and Minutes Deep Sleep. Note that MLR assumes that the relationship between these variables is linear.


Let’s train a MLR model and evaluate its performance:

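A sketch using scikit-learn’s LinearRegression, evaluated on the validation set with the function defined earlier:

```python
from sklearn.linear_model import LinearRegression

mlr = LinearRegression()
mlr.fit(X_train_scaled, y_train)
evaluate(mlr, X_val_scaled, y_val)
```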

[Output: Multiple Linear Regression performance on the validation set]

All performance measures are substantially better than those of the baseline model (thank god). Especially the accuracy seems to be really high, but this can be misleading, which is why it is important to consider multiple measures. One of the most important measures of regression performance is the R-squared. Generally speaking, the R-squared measures the proportion of the variance of the dependent variable that is explained by the independent variables. Hence, in our case it is a measure of how much of the variance in Sleep Scores is explained by our features. A value of roughly 0.76 is decent already, but let’s see if we can do better by using different models.


Regression statistics

Before we move on to other Machine Learning models I would like to take a look at the regression output for the Multiple Linear Regression on our training data:

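One way to obtain the classical regression summary (coefficients, p-values, R-squared) is statsmodels (a sketch; the original setup may differ):

```python
import statsmodels.api as sm

# Add an intercept term and fit an ordinary least squares regression.
X_sm = sm.add_constant(X_train_scaled)
ols = sm.OLS(y_train, X_sm).fit()
print(ols.summary())
```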

[Output: OLS regression summary for the training data]

A few things to note regarding the regression output:


  1. All p-values are statistically significant.

  2. Minutes Asleep, Minutes REM Sleep and Minutes Deep Sleep have positive coefficients, meaning that an increase in these variables increases Sleep Scores.

  3. Minutes Awake has a negative coefficient, indicating that more time awake decreases the sleep score.

  4. Based on the magnitude of the coefficients, REM sleep seems to have a bigger positive impact on Sleep Score than Deep sleep.


The regression output provides a good starting point for understanding how the different sleep statistics may affect Sleep Score. More time asleep increases Sleep Score. This makes sense because more sleep (up until a certain point) will generally be beneficial. Similarly, more time spent in REM and Deep Sleep increase the Sleep Score as well. This also makes sense because both of these sleep stages provide important restorative benefits. For the computation of Sleep Score, Fitbit seems to consider REM sleep to be more important than Deep sleep (higher magnitude of the coefficient), which to me is one of the most interesting outcomes of the regression analysis. Finally, more time awake decreases Sleep Score. Again, that makes perfect sense because spending more time awake during one’s sleep window indicates restlessness and takes away from the restorative powers that time spent asleep provides.


For those people that are interested in understanding the importance of different sleep stages and of sleep in general, I highly recommend “Why We Sleep” by Matthew Walker. It is a brilliantly written book with fascinating experiments and insights!


All that being said, it is important to note that the interpretability of the above output is somewhat limited because of the correlation that is present between features. In Multiple Linear Regression, the coefficient tells you how much the dependent variable is expected to increase when that independent variable increases by one unit, holding all the other independent variables constant. In our case, because the independent variables are correlated, we could not expect one variable to change without the others changing and therefore cannot reliably interpret the coefficients in this way. Always look out for multicollinearity when interpreting your models!


Let’s see if other Machine Learning models perform better than Multiple Linear Regression.


Random Forest Regressor

Random Forests are one of the most popular Machine Learning models because of their ability to perform well on both classification and regression problems. In summary, a Random Forest is an ensemble technique that leverages multiple decision trees through Bootstrap Aggregation, also called “bagging”. What exactly does that mean?


In order to understand this better we first need to understand how Decision Tree Regression works.


Decision Tree Regression

As the name suggests, decision trees build prediction models in form of a tree structure that may look like this:


[Figure: Decision tree for predicting hours played]

In the above example the decision tree iteratively splits the data set based on various features in order to come up with a prediction of how many hours will be spent playing. But how does the tree know what features to split on first and which ones to split on further down the tree? After all, the predictions could be different if we change the sequence of the features used to make the split.


In a regression problem, the most common way to decide what feature to split the dataset on at a specific node is Mean Squared Error (MSE). The decision tree tries out different features that it can use to split the data set and computes the resulting MSEs. The feature that leads to the lowest MSE is chosen for the split at hand. This process is continued until the tree reaches a leaf (an end point) or a predetermined maximum depth. Maximum depths can be used to reduce overfitting because if a decision tree is allowed to continue until it finds a leaf, it may strongly overfit to the training data. Using maximum depths in this way is referred to as “pruning” of the tree.

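To make the split criterion concrete, here is a minimal sketch (an illustrative helper, not from the original article) of the weighted MSE that a candidate split produces; the tree picks the feature and threshold with the lowest value:

```python
import numpy as np

def split_mse(y_left, y_right):
    """Weighted MSE of a candidate split, where each side is predicted by its mean."""
    def mse(y):
        return np.mean((y - np.mean(y)) ** 2)
    n = len(y_left) + len(y_right)
    return (len(y_left) * mse(y_left) + len(y_right) * mse(y_right)) / n
```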

There are two major limitations with decision trees:


  1. Greediness — Decision trees are not always globally optimal because we assume that the best way to create the tree is to find the feature that will result in the largest reduction in MSE now without considering whether a suboptimal split now could lead to an even better split further down the line (“greedy” strategy).

  2. Overfitting — The structure of the tree is often too dependent on the training data, and pruning the tree is often not enough to overcome this issue.


Random Forests address both of those limitations.


Random Forests

As the “Forest” in Random Forest suggests, they are made up of many decision trees, and their predictions are made by averaging the predictions of each decision tree in the forest. Think of this as a democracy: having only one person vote on an important issue may not be representative of how the entire community really feels, but collecting votes from many randomly selected members of the community may provide an accurate representation.


But what exactly does the “Random” in Random Forest represent?


In a Random Forest, every decision tree is created using a randomly chosen subset of the data points in the training set. This way every tree is different but all trees are still created from a portion of the same training data. Subsets are randomly selected with replacement, meaning that data points are “put back in the bag” and can be picked again for another decision tree.


In addition to choosing a different random subset of data for each tree, the decision trees in a Random Forest consider only a subset of randomly selected features at each split. The best of these features is chosen for the split at hand; at the next node, a new set of random features is evaluated, and so on.


By constructing decision trees using these “bagging” techniques, Random Forests address the limitations of individual decision trees well and manage to turn what would be a weak predictor in isolation into a strong predictor in a group, similar to the voting example.



Random Forest Regression in Python

Using the scikit-learn library in Python, most Machine Learning models are built in the same way. First, you initiate the model, then you train it on the training set and then evaluate it on the validation set. Here is the code:

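A sketch with default hyperparameters, evaluated with the same function as before:

```python
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(random_state=42)
rf.fit(X_train_scaled, y_train)
evaluate(rf, X_val_scaled, y_val)
```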

[Output: Random Forest performance on the validation set]

Similar to the Multiple Linear Regression, the Random Forest performs vastly better than the baseline model. That being said, its R-squared and accuracy are lower than that of the MLR. So, what is all the hype around Random Forests about?


The answer to that question can be found here (hint: Hyperparameter Optimisation):


Extreme Gradient Boosting Regressor

Similar to Random Forests, Gradient Boosting is an ensemble learner, meaning that it creates a final model based on a collection of individual models, usually decision trees. What is different in the case of Gradient Boosting compared to Random Forests is the type of ensemble method. Random Forests use “Bagging” (described previously) and Gradient Boosting uses “Boosting”.


Gradient Boosting

The general idea behind Gradient Boosting is that the individual models are built sequentially by putting more weight on instances with wrong predictions and high errors. The model therefore “learns from its past mistakes”.


The model minimises a cost function through gradient descent. In each round of training, the weak learner (decision tree) makes a prediction, which is compared to the actual outcome. The distance between prediction and actual outcome represents the error of the model. The errors can then be used to calculate the gradient, i.e. the partial derivative of the loss function, to figure out in which direction to change the model parameters in order to reduce the error. The below graph visualises how this works:


[Figure: Gradient descent]

The rate at which these adjustments are made (“Incremental Step” in the above graph) can be set through the hyperparameter “learning rate”.


Extreme Gradient Boosting

Extreme Gradient Boosting improves upon Gradient Boosting by computing the second partial derivative of the cost function, which helps in reaching the minimum of the cost function, and by using advanced regularisation similar to the Lasso penalty described above, which improves model generalisation.


In Python, training and evaluating Extreme Gradient Boosting Regressor follows the same fitting and scoring process as the Random Forest Regressor:

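A sketch using the xgboost library’s scikit-learn API with default hyperparameters:

```python
from xgboost import XGBRegressor

xgb = XGBRegressor(random_state=42)
xgb.fit(X_train_scaled, y_train)
evaluate(xgb, X_val_scaled, y_val)
```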

[Output: Extreme Gradient Boosting performance on the validation set]

The performance metrics are extremely close to those of the Random Forest, i.e. it performs decently but still not as well as our good old Multiple Linear Regression.


Where to go from here?

So far, we have not provided any hyperparameters to the Random Forest or the Extreme Gradient Boosting Regressor. The respective libraries provide sensible default values for the hyperparameters of each model, but there is no one-size-fits-all. By tweaking some of the hyperparameters we could potentially greatly improve the performance of these two models.


Furthermore, for our performance evaluation so far we have only relied on the models’ performances on one relatively small validation set. The performance is therefore highly dependent on how representative this validation set is of sleep data as a whole.


In the third part of this article I address both of these issues and boost the performance of the Random Forest and the Extreme Gradient Boosting Regressor. See here:


Translated from: https://towardsdatascience.com/using-machine-learning-to-predict-fitbit-sleep-scores-496a7d9ec48
