Univariate Linear Regression Models

by Björn Hartmann

Find out which linear regression model is the best fit for your data

Inspired by a question after my previous article, I want to tackle an issue that often comes up after trying different linear models: you need to choose which model to use. More specifically, Khalifa Ardi Sidqi asked:

“How to determine which model suits best to my data? Do I just look at the R square, SSE, etc.?
As the interpretation of that model (quadratic, root, etc.) will be very different, won’t it be an issue?”

The second part of the question can be answered easily. First, find a model that best suits your data, and then interpret its results. It is good to have ideas about how your data might be explained. However, interpret only the best model.

The rest of this article will address the first part of his question. Please note that I will share my approach on how to select a model. There are multiple ways, and others might do it differently. But I will describe the way that works best for me.

In addition, this approach only applies to univariate models. Univariate models have just one input variable. I am planning a further article, where I will show you how to assess multivariate models with more input variables. For today, however, let us focus on the basics and univariate models.

To practice and get a feeling for this, I wrote a small ShinyApp. Use it and play around with different datasets and models. Notice how parameters change and become more confident with assessing simple linear models. Finally, you can also use the app as a framework for your data. Just copy it from Github.

Use the Adjusted R2 for univariate models

If you only use one input variable, the adjusted R2 value gives you a good indication of how well your model performs. It illustrates how much variation is explained by your model.

In contrast to the simple R2, the adjusted R2 takes the number of input factors into account. It penalizes too many input factors and favors parsimonious models.
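
To make the penalty concrete, here is a minimal sketch (not from the article's ShinyApp) of the standard adjusted R2 formula, where `n` is the number of observations and `p` the number of input variables:

```python
# Sketch: the standard adjusted R-squared formula,
# adj R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1).
def adjusted_r2(r2, n, p):
    """Adjusted R2 from the plain R2, n observations, and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# With one input variable (p = 1) and 100 observations, the penalty is tiny:
print(round(adjusted_r2(0.8432, 100, 1), 4))   # stays close to 0.8432
# Pretending we had added nine irrelevant predictors shrinks the value:
print(round(adjusted_r2(0.8432, 100, 10), 4))  # noticeably lower
```

The values 0.8432, 100, and 10 here are made-up illustrations; only the formula itself is standard.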

In the screenshot above, you can see two models with values of 71.3% and 84.32%. Apparently, the second model is better than the first one. Models with low values, however, can still be useful, because the adjusted R2 is sensitive to the amount of noise in your data. As such, only compare this indicator between models for the same dataset, rather than across different datasets.

Usually, there is little need for the SSE

Before you read on, let’s make sure we are talking about the same SSE. On Wikipedia, SSE refers to the sum of squared errors. In some statistic textbooks, however, SSE can refer to the explained sum of squares (the exact opposite). So for now, suppose SSE refers to the sum of squared errors.

Hence, the adjusted R2 is approximately 1 - SSE/SST, with SST referring to the total sum of squares.

I do not want to dive deeper into the math behind this. What I want to show you is that the adjusted R2 is computed with the SSE. So the SSE usually does not give you any additional information.
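
A short sketch with simulated (made-up) data shows the relationship: once you have the R2, the SSE carries no extra information, because the plain R2 is exactly 1 - SSE/SST.

```python
import numpy as np

# Sketch with simulated data: R2 = 1 - SSE / SST for an OLS fit.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=x.size)

slope, intercept = np.polyfit(x, y, 1)  # ordinary least-squares line
predicted = slope * x + intercept

sse = np.sum((y - predicted) ** 2)      # sum of squared errors
sst = np.sum((y - y.mean()) ** 2)       # total sum of squares
r2 = 1 - sse / sst
print(f"SSE={sse:.1f}  SST={sst:.1f}  R2={r2:.3f}")
```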

Furthermore, the adjusted R2 is normalized such that it is always between zero and one. So it is easier for you and others to interpret an unfamiliar model with an adjusted R2 of 75% than one with an SSE of 394, even though both figures might describe the same model.

Have a look at the residuals or error terms!

What is often ignored are the error terms, or so-called residuals. They often tell you more than you might think.

The residuals are the difference between your predicted values and the actual values.

Their benefit is that they can show you both the magnitude as well as the direction of your errors. Let’s have a look at an example:

Here, I tried to predict a polynomial dataset with a linear function. Analyzing the residuals shows that there are areas where the model has an upward or downward bias.

For 50 < x < 100, the residuals are above zero. So in this area, the actual values have been higher than the predicted values — our model has a downward bias.

For 100 < x < 150, however, the residuals are below zero. Thus, the actual values have been lower than the predicted values: the model has an upward bias.

It is always good to know whether your model suggests too high or too low values. But you usually do not want to have patterns like this.

The residuals should be zero on average (as indicated by the mean), and they should be equally distributed around zero. Predicting the same dataset with a third-degree polynomial suggests a much better fit:
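
This pattern is easy to reproduce. In this sketch with simulated (made-up) data, a straight line fitted to a cubic dataset leaves a clear local bias in the residuals, while a third-degree polynomial does not:

```python
import numpy as np

# Sketch with simulated data: compare residuals of a straight line and a
# third-degree polynomial fitted to the same curved dataset.
rng = np.random.default_rng(1)
x = np.linspace(0, 150, 150)
y = 0.0001 * x ** 3 + rng.normal(0, 5, size=x.size)

mid_bias = {}
for degree in (1, 3):
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    # Average residual over a sub-range: far from zero signals a local bias.
    mid_bias[degree] = residuals[(x > 50) & (x < 100)].mean()
    print(f"degree {degree}: mean residual for 50 < x < 100 = "
          f"{mid_bias[degree]:+.1f}")
```

With the line, the mid-range residuals are systematically far from zero; with the cubic, they hover around zero as they should.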

In addition, you can observe whether the variance of your errors increases. In statistics, this is called Heteroscedasticity. You can fix this easily with robust standard errors. Otherwise, your hypothesis tests are likely to be wrong.
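
As a sketch of what robust standard errors involve (with made-up heteroscedastic data; the HC0 "White" estimator shown here is one common choice, computed by hand rather than via a library):

```python
import numpy as np

# Sketch with simulated data: heteroscedasticity-robust (HC0) standard
# errors for a univariate OLS fit, alongside the classic ones.
rng = np.random.default_rng(2)
x = np.linspace(1, 100, 200)
y = 3.0 + 0.5 * x + rng.normal(0, 0.05 * x)  # error variance grows with x

X = np.column_stack([np.ones_like(x), x])    # design matrix with intercept
beta = np.linalg.lstsq(X, y, rcond=None)[0]  # OLS coefficients
e = y - X @ beta                             # residuals

XtX_inv = np.linalg.inv(X.T @ X)
classic_se = np.sqrt(np.diag(XtX_inv) * (e @ e) / (len(y) - 2))
# HC0 sandwich: (X'X)^-1 X' diag(e^2) X (X'X)^-1
robust_cov = XtX_inv @ (X.T * e ** 2) @ X @ XtX_inv
robust_se = np.sqrt(np.diag(robust_cov))
print("classic SEs:", classic_se, " robust SEs:", robust_se)
```

The coefficients are unchanged; only the standard errors (and hence your hypothesis tests) differ under heteroscedasticity.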

Histogram of residuals

Finally, the histogram summarizes the magnitude of your error terms. It provides information about the bandwidth of errors and indicates how frequently errors of each size occurred.

The above screenshots show two models for the same dataset. In the left histogram, errors occur within a range of -338 to 520.

In the right histogram, errors occur within -293 and 401, so the outliers are much smaller. Furthermore, most errors in the right-hand model are closer to zero. So I would favor the right model.
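
A small sketch with simulated residuals (made-up numbers, not the article's) shows how a histogram-style summary exposes both the error bandwidth and how concentrated the errors are near zero:

```python
import numpy as np

# Sketch with simulated residuals: summarize two models' error distributions.
rng = np.random.default_rng(3)
residuals_a = rng.normal(0, 150, 500)  # wider errors, like the left histogram
residuals_b = rng.normal(0, 100, 500)  # tighter errors, like the right one

for name, res in (("model A", residuals_a), ("model B", residuals_b)):
    counts, edges = np.histogram(res, bins=10)
    print(f"{name}: range [{res.min():.0f}, {res.max():.0f}], "
          f"fullest bin holds {counts.max()} of {res.size} errors")
```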

Summary

When choosing a linear model, these are factors to keep in mind:

  • Only compare linear models for the same dataset.
  • Find a model with a high adjusted R2.
  • Make sure this model has equally distributed residuals around zero.
  • Make sure the errors of this model are within a small bandwidth.

If you have any questions, write a comment below or contact me. I appreciate your feedback.

Translated from: https://www.freecodecamp.org/news/learn-how-to-select-the-best-performing-linear-regression-for-univariate-models-e9d429c40581/
