Multiple Linear Regression with Interactions Unveiled by Genetic Programming


We have all had some experience with linear regression. It is one of the most widely used regression techniques. Why? Because it is simple to explain and easy to implement. But what happens when you have more than one variable? How can you deal with this increased complexity and still use an easy-to-understand regression like this? And what happens if the system is even more complicated, for example when there is an interaction between two variables?


Here is where multiple linear regression kicks in, and we will see how to deal with interactions using some handy libraries in Python. Finally, we will tackle the same problem with symbolic regression and enjoy the benefits that come with it!


If you want a refresher on linear regression there are plenty of resources available, and I also wrote a brief introduction with coding. What about symbolic regression? In this article we will be using gplearn. See its documentation for more information or, if you like, see my other article about how to use it with complex functions in Python here.


Data preparation

We will explore two use cases of regression. In the first case we will have just four variables (x1 to x4) which add up, plus some predetermined interactions: x1*x2, x3*x2 and x4*x2.

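The data-generation code did not survive in this copy of the article, so here is a minimal sketch consistent with the description above. The sample size, the ranges of x1 to x3, and the choice of a discrete x4 (so that it can serve as the hue in figure 1) are assumptions; the target uses the coefficients 0.5 and 2 that appear in the formula recovered later in the text.

import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
n = 1000  # assumed sample size

x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 10, n)
x3 = rng.uniform(0, 10, n)
x4 = rng.choice([1, 2, 3], n)   # assumed discrete so it can act as the hue in figure 1

# Target with the predetermined first-order interactions x1*x2, x3*x2 and x4*x2
# (the 0.5 and 2 coefficients match the formula recovered later in the article)
y_true = x1 + 0.5*x2 + 2*x3 + x4 + x1*x2 - x3*x2 + x4*x2

# The dataframe holds only the raw variables and the target, not the interaction terms
out_df = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3, 'x4': x4})
out_df['y'] = y_true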

Note that our dataset “out_df” does not contain the interaction terms; what we will do with our tools is try to discover those relationships. This is how the variables look when we plot them with seaborn, using x4 as hue (figure 1):

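The plotting call itself is also missing from this copy; assuming the out_df sketched above, a one-liner like the following reproduces the kind of pairplot shown in figure 1:

import seaborn as sns
import matplotlib.pyplot as plt

# Pairplot of the dataframe variables, coloured by x4 (figure 1)
sns.pairplot(out_df, hue='x4')
plt.show()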

Figure 1: 1st order interactions: dataframe variables pairplot.
Figure 2: 2nd order interactions: dataframe pairplot.

The y of the second case (figure 2) is given by:


y_true = x1+x2+x3+x4+ (x1*x2)*x2 - x3*x2 + x4*x2*x3*x2 + x1**2

Pretty complex scenario!


Case 1: Multiple Linear Regression

The first step is to get a better understanding of the relationships, so we will try our standard approach and fit a multiple linear regression to this dataset. We will be using statsmodels for that. Figure 3 shows the OLS regression results.


import statsmodels.api as sm

Xb = sm.add_constant(out_df[['x1','x2','x3','x4']])
mod = sm.OLS(y_true, Xb)
res = mod.fit()
res.summary()
Figure 3: Fit summary from statsmodels.

Ouch, this is clearly not the result we were hoping for. R² is just 0.567 and, moreover, I am surprised to see that the p-values for x1 and x4 are incredibly high. We need a different strategy.


Polynomial Features

What we can do is import the PolynomialFeatures class from sklearn, which generates polynomial and interaction features. For example, if an input sample is two dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a², ab, b²].


from sklearn.preprocessing import PolynomialFeatures
import scipy.special
import pandas as pd

poly = PolynomialFeatures(interaction_only=True)
X_tr = poly.fit_transform(Xb)
Xt = pd.concat([Xb, pd.DataFrame(X_tr, columns=poly.get_feature_names()).drop(['1', 'x0', 'x1', 'x2', 'x3', 'x4'], axis=1)], axis=1)

With “interaction_only=True” only interaction features are produced: features that are products of at most degree distinct input features (so not x[1] ** 2, x[0] * x[2] ** 3, etc.). The default degree parameter is 2.

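As a quick illustration (not part of the original article), here is what the two settings produce for a single sample [a, b] = [2, 3]:

from sklearn.preprocessing import PolynomialFeatures
import numpy as np

sample = np.array([[2.0, 3.0]])  # one sample of the form [a, b]

# Full degree-2 expansion: [1, a, b, a^2, a*b, b^2]
print(PolynomialFeatures(degree=2).fit_transform(sample))
# [[1. 2. 3. 4. 6. 9.]]

# interaction_only=True drops the pure powers a^2 and b^2: [1, a, b, a*b]
print(PolynomialFeatures(degree=2, interaction_only=True).fit_transform(sample))
# [[1. 2. 3. 6.]]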

Running the same code as before, but using Xt this time, yields the results below.


mod = sm.OLS(y_true, Xt)
res = mod.fit()
res.summary()
Figure 4: statsmodels regression results with interactions.

Now R² in Figure 4 is 1, which is perfect. Too perfect to be good? In fact there are a lot of interaction terms in the summary statistics, some that we were not even aware of. Our generating equation was: y = x₁ + 0.5*x₂ + 2*x₃ + x₄ + x₁*x₂ - x₃*x₂ + x₄*x₂. So our fit introduces interactions that we did not explicitly use in the function. Even if we remove the terms with high p-values (x₁, x₄), we are left with a complex scenario. This might be a problem for generalization. We can exploit genetic programming to get some advice here.


Genetic Programming: GPlearn

With genetic programming we are basically telling the system to do its best to find relationships in our data in an analytical form. If you have read the other tutorial, some functions I will call here will be clearer. What we basically want to do is import SymbolicRegressor from gplearn.genetic, and we will use sympy to pretty-format our equations. While we are at it, we will also import the RandomForest and DecisionTree regressors to compare the results between all of these tools later on. Below is the code to get it working:

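The embedded code block did not survive in this copy of the article. Based on the description that follows, a sketch of the setup could look like the one below; the converter entries and every hyper-parameter (population size, parsimony coefficient, the 40-generation budget) are assumptions rather than the author's exact values.

from gplearn.genetic import SymbolicRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sympy import sympify

# Map gplearn's function names onto sympy operations so that the winning
# program can be rendered as a readable formula (these entries are an assumption,
# chosen to cover the standard function set used below)
converter = {
    'add': lambda x, y: x + y,
    'sub': lambda x, y: x - y,
    'mul': lambda x, y: x * y,
    'div': lambda x, y: x / y,
}

X_train, X_test, y_train, y_test = train_test_split(
    out_df[['x1', 'x2', 'x3', 'x4']], y_true, test_size=0.30, random_state=42)

# Standard functions from gplearn's built-in set
function_set = ['add', 'sub', 'mul', 'div']
est_gp = SymbolicRegressor(population_size=5000,
                           generations=40,
                           function_set=function_set,
                           parsimony_coefficient=0.001,
                           feature_names=['x1', 'x2', 'x3', 'x4'],
                           random_state=42)
est_gp.fit(X_train, y_train)

print('R2:', est_gp.score(X_test, y_test))
sympify(str(est_gp._program), locals=converter)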

The converter dictionary is there to help us map the equation onto its corresponding Python functions, so that sympy can do its work. We also do a train_test split of our data so that we compare our predictions on the test data alone. We define a function set in which we use standard functions from gplearn's set. At the 40th generation the code stops, and we see that R² is almost 1, while the generated formula is now pretty easy to read.


Figure 5: gplearn results.

If you compare it with the formula we actually used, you will see that it is a close match; refactoring, our formula becomes:


y = -x₃(x₂ - 2) + x₂(x₁ + x₄ + 0.5) + x₁ + x₄

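As a quick check (not from the original article), sympy confirms that this refactored form is algebraically identical to the generating formula:

from sympy import symbols, simplify

s1, s2, s3, s4 = symbols('x1 x2 x3 x4')   # sympy symbols, kept separate from the data arrays

generating = s1 + 0.5*s2 + 2*s3 + s4 + s1*s2 - s3*s2 + s4*s2
refactored = -s3*(s2 - 2) + s2*(s1 + s4 + 0.5) + s1 + s4

print(simplify(generating - refactored))   # prints 0: the two forms are identical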

All the algorithms performed well on this task; here are the R² scores:


statsmodels OLS with polynomial features: 1.0
random forest: 0.9964436147653762
decision tree: 0.9939005077996459
gplearn regression: 0.9999946996993035
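The comparison code for case 1 is not shown in this copy; a sketch along the following lines reproduces it. The tree hyper-parameters and the choice to fit the trees on the raw variables (rather than on the interaction features) are assumptions.

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Same train/test split as used for gplearn; the trees are fit on the raw variables
est_tree = DecisionTreeRegressor(max_depth=5).fit(X_train, y_train)
est_rf = RandomForestRegressor(n_estimators=100, max_depth=5).fit(X_train, y_train)

print('decision tree:', est_tree.score(X_test, y_test))
print('random forest:', est_rf.score(X_test, y_test))
print('gplearn:', est_gp.score(X_test, y_test))
print('OLS + polynomial features:', res.rsquared)   # in-sample R² from the fit in figure 4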

Case 2: 2nd order interactions

In this case the relationship is more complex as the interaction order is increased:


X = np.column_stack((x1, x2, x3, x4))
y_true = x1 + x2 + x3 + x4 + (x1*x2)*x2 - x3*x2 + x4*x2*x3*x2 + x1**2
out_df['y'] = y_true

We follow basically the same steps as in the first case, but here we start directly with polynomial features:


poly = PolynomialFeatures(interaction_only=True)
X_tr = poly.fit_transform(out_df.drop('y', axis=1))
Xt = pd.concat([out_df.drop('y', axis=1), pd.DataFrame(X_tr, columns=poly.get_feature_names()).drop(['1', 'x0', 'x1', 'x2', 'x3'], axis=1)], axis=1)
Xt = sm.add_constant(Xt)
mod = sm.OLS(y_true, Xt)
res = mod.fit()
res.summary()
Figure 6: statsmodels summary for case 2.

In this scenario our approach is no longer rewarding. It is clear that we do not have the correct predictors in our dataset. We could use PolynomialFeatures to investigate higher orders of interaction, but the dimensionality would likely increase too much and we would be left with little more knowledge than before. Besides, if you had a real dataset and did not know the formula of the target, would you increase the interaction order? I guess not!

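To put a number on the dimensionality concern (a small illustration, not from the original article), here is how quickly the feature count grows for our four raw variables:

from sklearn.preprocessing import PolynomialFeatures
import numpy as np

X_raw = np.zeros((1, 4))   # four input variables, as in our dataset

for degree in (2, 3, 4):
    n_full = PolynomialFeatures(degree=degree).fit(X_raw).n_output_features_
    n_inter = PolynomialFeatures(degree=degree, interaction_only=True).fit(X_raw).n_output_features_
    print(f'degree {degree}: {n_full} features (full), {n_inter} (interaction only)')

# degree 2: 15 features (full), 11 (interaction only)
# degree 3: 35 features (full), 15 (interaction only)
# degree 4: 70 features (full), 16 (interaction only)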

In the code below we again fit and predict our dataset with the decision tree and random forest algorithms, but we also employ gplearn.


X_train, X_test, y_train, y_test = train_test_split(out_df.drop('y', axis=1), out_df['y'], test_size=0.30, random_state=42)

# Fit the tree-based baselines
est_tree = DecisionTreeRegressor(max_depth=5)
est_tree.fit(X_train, y_train)
est_rf = RandomForestRegressor(n_estimators=100, max_depth=5)
est_rf.fit(X_train, y_train)

# Fit the symbolic regressor on the new target before predicting with it
est_gp.fit(X_train, y_train)

# Predictions and R² scores on the held-out test set
y_gp = est_gp.predict(X_test)
score_gp = est_gp.score(X_test, y_test)
y_tree = est_tree.predict(X_test)
score_tree = est_tree.score(X_test, y_test)
y_rf = est_rf.predict(X_test)
score_rf = est_rf.score(X_test, y_test)
y_sm = res.predict(Xt)   # statsmodels OLS predictions on the interaction features

print('R2:', est_gp.score(X_test, y_test))
next_e = sympify(str(est_gp._program), locals=converter)
next_e

The result is incredible: again, after 40 generations we are left with an extremely high R² and, even better, a simple analytical equation.


Figure 7: last generation, R² and analytical formula.

The original formula is:


y = x₁ + x₂ + x₃ + x₄ + x₁x₂² - x₃x₂ + x₄x₃x₂² + x₁²

So we see that there are indeed differences in the terms that involve x1 and its interactions, while the terms that do not depend on it are recovered exactly. Nevertheless, compared with the PolynomialFeatures approach, we are dealing with a much less complicated formula here.


What about the error of the different systems? Well, for gplearn it is incredibly low compared with the others. In figure 8 the error is reported against the actual y. While the x axis is shared, you can notice how different the y axes become: the maximum error with GPlearn is around 4, while the other methods can show spikes up to 1000.

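The plotting code for figure 8 is not included in this copy; a sketch that produces a comparable set of shared-x error plots (panel layout and styling are assumptions) could be:

import matplotlib.pyplot as plt
import numpy as np

# Prediction error of each model on the test set, plotted against the true y
errors = {
    'gplearn': np.asarray(y_test) - y_gp,
    'decision tree': np.asarray(y_test) - y_tree,
    'random forest': np.asarray(y_test) - y_rf,
}

fig, axes = plt.subplots(len(errors), 1, sharex=True, figsize=(8, 9))
for ax, (name, err) in zip(axes, errors.items()):
    ax.scatter(np.asarray(y_test), err, s=5)
    ax.set_ylabel('error')
    ax.set_title(name)
axes[-1].set_xlabel('y (true)')
plt.tight_layout()
plt.show()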

Figure 8: error plots for the different methods as a function of y. The legend shows the method.

Conclusion

In the first part of this article we saw how to deal with multiple linear regression in the presence of interactions. We used statsmodels OLS for multiple linear regression and scikit-learn's PolynomialFeatures to generate the interactions. We then approached the same problem with a different class of algorithm, namely genetic programming, which is easy to import and implement and gives an analytical expression.


In the second part we saw that when things get messy, we are left with some uncertainty using standard tools, even those from traditional machine learning. This class of problems, however, is easier to face with gplearn: with this library we obtained an analytical formula for our problem directly.


Translated from: https://towardsdatascience.com/multiple-linear-regression-with-interactions-unveiled-by-genetic-programming-4cc325ac1b65
