Better Features for a Tree-Based Model

When you understand how a model works, it becomes much easier to create successful features. This is because you can reason about the model's strengths and weaknesses and prepare features accordingly. Let's take a look together at which features a tree-based model can understand well, which features are harder for it to use, and how we can help the model in such cases.


How a tree-based model uses features

We'll start by taking a closer look at what's inside a tree-based model.


The main building block of tree-based models is a binary decision. It takes the value of a specific feature and makes a split depending on it. Many such decisions form a decision tree, and averaging the predictions of many trees gives the model's prediction. Of course, this is a rather simplified explanation of tree-based models, but it works very well for understanding which properties successful features need.

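To make that mechanic concrete, here is a deliberately tiny sketch; the thresholds and leaf values are made up for illustration, and real libraries (scikit-learn, XGBoost, LightGBM) are of course far more sophisticated.

```python
def tiny_tree_a(x):
    # a single binary decision: split the feature at 3.5
    return 10.0 if x < 3.5 else 20.0

def tiny_tree_b(x):
    # two nested binary decisions
    if x < 2.0:
        return 8.0
    return 18.0 if x < 6.0 else 25.0

def ensemble_predict(x, trees=(tiny_tree_a, tiny_tree_b)):
    # many tree predictions averaged give the model prediction
    return sum(tree(x) for tree in trees) / len(trees)

print(ensemble_predict(1.0))  # 9.0
print(ensemble_predict(5.0))  # 19.0
```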

The number of required splits (or required binary decisions) is an important property of the feature


Given this binary nature, tree-based models are good at handling features where the main signal can be extracted with a few splits. Let's take a look at some features with this property.


In order to reason better about required splits, we'll use "feature vs target" graphs. On the X axis, we have feature values sorted from smallest to largest. On the Y axis, we have the corresponding target value for each feature value. Additionally, we'll use a split line to separate the target values we consider low from the ones we consider high.

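If you want to draw such a graph yourself, a minimal sketch could look like this; the DataFrame and the "feature"/"target" column names are placeholders, not from any specific dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_feature_vs_target(df: pd.DataFrame, feature_col: str, target_col: str, split: float):
    ordered = df.sort_values(feature_col)      # X axis: feature values, smallest to largest
    plt.plot(ordered[feature_col], ordered[target_col], marker=".")
    # horizontal line separating "low" and "high" target values
    plt.axhline(split, color="red", linestyle="--", label="low/high target split")
    plt.xlabel(feature_col)
    plt.ylabel(target_col)
    plt.legend()
    plt.show()
```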

Here are some examples of “feature vs target” graphs for features requiring one or a few splits, which are handled well by tree-based models.


[Figure: Easy features requiring one/few splits (visualization by the author)]

As you can see, tree-based models are good at handling features where the graph doesn't change direction from upward to downward (and vice versa) very often. Features don't need to be linear; the main requirement is that they don't require many split points to separate low and high target values.


Another strong side of tree-based models is their ability to handle feature interactions. This is when feature A behaves one way for small values of feature B but changes its behavior for large values of feature B. Not all models can capture such an interaction, but tree-based models handle it very naturally.


Now let's take a look at some features that are harder for tree-based models to handle.


[Figure: Hard features requiring many splits (visualization by the author)]

The more splits a feature needs to capture the full signal, the deeper the trees have to be. But deeper trees with many leaves mean a higher risk of over-fitting. And allowing the model to grow deeper trees doesn't mean it will make splits only on the features we want: it can use the extra splits to unnecessarily divide other features, capturing noise along the way.


Feature engineering examples

Now that we have some intuition about what type of features we want for our tree-based model, let's look at some examples of how to actually transform features to make them better.


Capture a repeating pattern

[Figure: Feature with a repeating pattern (visualization by the author)]

Here we can see a feature with a repeating pattern. One real example of a feature with a similar pattern is a date. In many datasets, there are one or several date features. They can specify the date of registration, date of birth, date of measurement, and so on. One typical property of date features is that they sometimes have a seasonality component. For example, the target variable can depend on the weekday: on weekends the target variable behaves differently than on working days.


It will be hard for our model to extract such information, as it will need two splits for each weekend to separate it from both sides. An additional pitfall is that making splits on exact date values won't help when predicting on unseen data in the future, since dates don't repeat.


The best way to help our model is to extract the pattern into a much simpler binary feature, "is_weekend", which has the value 1 for weekends and 0 for working days. Or maybe it will be more effective to use a "weekday" feature with values from 1 to 7. Looking at what this means visually for our previous example, we are reducing the splits necessary to fully capture the signal in the data, thus making the feature friendlier for tree-based models.

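In code, assuming a pandas DataFrame with a date column named "date" (an illustrative name, not from the original example), the transformation could look roughly like this:

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(["2021-01-04", "2021-01-09", "2021-01-10"])})

# "weekday" feature with values 1..7 (Monday=1, ..., Sunday=7)
df["weekday"] = df["date"].dt.dayofweek + 1

# binary "is_weekend" feature: 1 for Saturday/Sunday, 0 for working days
df["is_weekend"] = (df["date"].dt.dayofweek >= 5).astype(int)

print(df)
```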

[Figure: Reduced required splits for a tree-based model to capture the feature signal (visualization by the author)]

Basically, we push all repeating intervals (e.g. weeks) together and let the model use an average value for making decisions on splits. In the example above this allows us to reduce the number of required splits from 8 to 2. If you wonder how this is done in code: we simply took x mod 25 as the new feature value, because the original feature had repeating intervals of length 25.

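A minimal sketch of that folding step, assuming the raw feature lives in a column called "x" (an illustrative name) and the repeating interval has length 25:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(200)})
period = 25                          # length of the repeating interval
df["x_folded"] = df["x"] % period    # new feature: position inside the repeating interval
```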

We can improve features by reducing the number of required splits for the model to capture the signal


Remove noise

Another situation where a tree-based model might need some help is very noisy data. Consider a feature whose values are very noisy, but where, in reality, only one point is meaningful. If this point is known to us, we can help the model by making the necessary split ourselves. This can be achieved by transforming the raw feature into a binary one.


[Figure: Example of a noisy feature (visualization by the author)]

For example, we are given the number of purchases at a store on different dates, and there is a known date when this store was moved to a different location. If we check the average value before and after the move, the corresponding means are 25 and 30. But given the high variance, with values ranging from 0 to 60, it is hard for the model to determine the correct split point.


In this particular example, given that the mean value changed, the model will most likely manage to make a split somewhere near the correct point. However, it may choose a point slightly to one side or the other, resulting in worse predictions around this moving point. We can help the model by introducing a new feature that specifies this point explicitly. For example, this could be a "new_location" feature with value 1 for all rows after the move and value 0 before it.

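A small sketch of how such a flag could be created; the move date, column names, and values are made up purely for illustration.

```python
import pandas as pd

move_date = pd.Timestamp("2020-06-01")   # known date when the store moved

df = pd.DataFrame({
    "date": pd.to_datetime(["2020-05-30", "2020-05-31", "2020-06-01", "2020-06-02"]),
    "purchases": [22, 28, 31, 29],
})

# 1 for all rows from the move onward, 0 before it
df["new_location"] = (df["date"] >= move_date).astype(int)
```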

[Figure: New feature without noise (visualization by the author)]

Of course, such a transformation requires some domain knowledge and an "expert decision" to choose the correct split point. But when this is possible, such a feature can greatly help the model avoid unnecessary wrong splits, which could cause over-fitting to noise.


More on what a tree-based model can/can't do

So far we have explored how tree-based models use features and looked at some examples of how to use that knowledge to fix or engineer better features.


Now let’s talk about other considerations that can help us when working with tree-based models.


Beware of feature values outside the known interval

One weak spot of tree-based models is their inability to extrapolate. Simply put, a tree-based model will typically fail to predict a value that is smaller than the smallest value in the training data, or bigger than the biggest value in the training set.


Consider the feature X values and corresponding target Y values as follows:


  • X=1, Y=2
  • X=2, Y=3
  • X=3, Y=4
  • X=4, Y=?

Any simple linear model will capture the linear relationship between X and Y and will guess the last Y value correctly. But a tree-based model will most likely predict a value around 4, and not 5, which is obviously the correct value.


To understand the reason, we should recall how tree-based models use features: by making splits on them. If the model has never seen a feature X value bigger than 3, there is no way for it to make splits at bigger values. As a result, all X values bigger than 3 will be treated exactly the same by the tree-based model.

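A minimal sketch of this behavior on the toy data above, using scikit-learn's DecisionTreeRegressor:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X_train = np.array([[1.0], [2.0], [3.0]])
y_train = np.array([2.0, 3.0, 4.0])

tree = DecisionTreeRegressor().fit(X_train, y_train)

# Every X beyond the training range falls into the same (rightmost) leaf,
# so the prediction stays at 4 instead of continuing the linear trend.
print(tree.predict([[4.0], [10.0], [100.0]]))  # [4. 4. 4.]
```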

Now, can we help a model to overcome this? In general, no: we can't in any way force a model to make splits outside the known feature interval.


A tree-based model can't make splits for feature values outside the interval seen in training data, so it can't distinguish between them.


There actually exist some approaches around this limitation, but they typically involve a transformation of the target value, so I won't consider them here, since in this article I want to focus only on transforming and creating features. I can, however, give you a clue about the direction of those approaches. For example, sometimes it is possible to predict not the actual target value but the difference from some previous value, and at the end reconstruct the real target values by incrementally adding the predicted differences one by one.

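As a rough illustration of that differencing idea (all numbers here are made up), the reconstruction step could look like this:

```python
import numpy as np

y_history = np.array([100.0, 102.0, 105.0, 109.0])  # known target values
predicted_diffs = np.array([4.0, 5.0, 6.0])          # hypothetical model output: predicted differences

# rebuild the future target levels from the last known value
reconstructed = y_history[-1] + np.cumsum(predicted_diffs)
print(reconstructed)  # [113. 118. 124.]
```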

No need to scale or normalize

When working with tree-based models, there is no need to scale or normalize the data. Why? Because making a split at 0.8 in the interval from 0 to 1 is equally hard (or easy) as making a split at, say, 700 in an interval from -100 to 900.


To see this visually, let's plot the "feature vs target" graph for some random feature with values ranging from 30 to 130, and then plot the same graph for the same feature after normalizing it to the interval (0, 1).


[Figure: Normalization doesn't affect the number of splits needed (visualization by the author)]

Notice that the only thing that changed is the scale of the X axis. The model will need exactly the same splits as before normalization to separate low and high target values.


Similarly, tree-based models are not sensitive to outliers in features (in contrast to, for example, linear models).

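A quick sketch to convince yourself of the scaling point: a tree fitted on a raw feature and a tree fitted on its min-max normalized version produce identical predictions, because the ordering of the values (and hence the available splits) is unchanged. The data below is synthetic.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
x = rng.uniform(30, 130, size=200)                 # feature roughly in [30, 130]
y = (x > 80).astype(float) + rng.normal(0, 0.1, size=200)

x_norm = (x - x.min()) / (x.max() - x.min())       # same feature normalized to [0, 1]

X_raw, X_norm = x.reshape(-1, 1), x_norm.reshape(-1, 1)
tree_raw = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_raw, y)
tree_norm = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_norm, y)

print(np.allclose(tree_raw.predict(X_raw), tree_norm.predict(X_norm)))  # True
```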

Conclusions

Despite the undeniable advancements of deep learning lately, tree-based models are still very competitive. And when it comes to tabular data, in many (if not most) cases tree-based models with careful feature engineering can still outperform deep learning approaches, as many Kaggle competition results show. For tree-based models, feature engineering is the key to success. And understanding the model's underlying structure and operation is the key to successful feature engineering.


[Photo by Franki Chamaki on Unsplash]

Hope you found something interesting and useful in this article, and thanks for reading! Follow me so you don't miss further articles about machine learning.


Translated from: https://towardsdatascience.com/better-features-for-a-tree-based-model-d3b21247cdf2
