The Consequences of Violating Linear Regression Assumptions

Motivation

Recently, a friend learning linear regression asked me what happens when assumptions such as no multicollinearity are violated. Despite being a former statistics student, I could only give him general answers like “you won’t be able to trust the estimates of your model.” Unsatisfied with my response, I decided to build a realistic example, via simulation, to show what can happen to prediction and inference when certain assumptions are violated.

Simulation

Suppose researchers are interested in understanding what drives the price of a house. Let’s pretend that housing prices are determined by just two variables: the size and age of the house. While age holds a negative, linear relationship with price, the size of the house has a positive, quadratic (non-linear) relationship with price. Mathematically, we can model this relationship like so:

Priceᵢ = β₀ + β₁*sqftᵢ + β₂*sqftᵢ² − β₃*age_yearsᵢ + eᵢ

where Price is the price of a house in thousands of dollars, sqft is the size of the house in thousands of square feet, and age_years is the age of the house in years. The errors e are normally distributed with mean 0 and variance σ². Let’s call this the true model since it accounts for everything that drives housing prices (apart from the random error). Since researchers don’t have a crystal ball telling them what the true model is, they test out a few linear regression models. Here’s what they came up with, in no particular order:

(1) Priceᵢ = β₀ + β₁*sqftᵢ + β₂*sqftᵢ² − β₃*age_yearsᵢ + eᵢ

(2) Priceᵢ = β₀ + β₁*sqftᵢ + β₂*sqftᵢ² − β₃*age_yearsᵢ − β₄*age_monthsᵢ + eᵢ

(3) Priceᵢ = β₀ + β₁*sqftᵢ − β₂*age_yearsᵢ + eᵢ

(4) Priceᵢ = β₀ − β₁*age_yearsᵢ + eᵢ

The researchers were smart and nailed the true model (Model 1), but the other models (Models 2, 3, and 4) violate certain OLS assumptions. Lastly, let’s say that there were 10K researchers who conducted the same study. Each took 50 independent observations from the population of houses and fit the above models to the data. By examining the results of these 10K models, we can see how these different models behave. The table below shows key parameters used to simulate the data (the full code can be found here):
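
The parameter table is not reproduced in this copy, so the following is only a rough sketch of how one researcher's sample could be drawn and fitted with Model 1. The age coefficient of -7, the 0.2 correlation between size and age, and the sample size of 50 come from the text; every other number (intercept, size coefficients, means, standard deviations, error SD) is a hypothetical placeholder, and the function names `simulate_houses` and `fit_model_1` are illustrative rather than taken from the author's code.

```python
import numpy as np

def simulate_houses(n, rng, corr=0.2):
    """Draw n houses from the true model (Model 1)."""
    # Correlated standard-normal drivers for size and age (correlation 0.2 per the text).
    cov = [[1.0, corr], [corr, 1.0]]
    z = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    sqft = 2.0 + 0.5 * z[:, 0]            # hypothetical: mean 2k sq ft, SD 0.5k
    age_years = 20.0 + 5.0 * z[:, 1]      # hypothetical: mean 20 years, SD 5
    error = rng.normal(0.0, 9.5, size=n)  # hypothetical error SD, in thousands of dollars
    # Only the age coefficient (-7) comes from the article; the rest are placeholders.
    price = 200.0 + 50.0 * sqft + 20.0 * sqft**2 - 7.0 * age_years + error
    return sqft, age_years, price

def fit_model_1(sqft, age_years, price):
    """OLS fit of Model 1; returns [intercept, sqft, sqft^2, age_years] coefficients."""
    X = np.column_stack([np.ones_like(sqft), sqft, sqft**2, age_years])
    beta, *_ = np.linalg.lstsq(X, price, rcond=None)
    return beta

rng = np.random.default_rng(42)
sqft, age_years, price = simulate_houses(50, rng)  # one researcher's sample of 50 houses
print(fit_model_1(sqft, age_years, price).round(2))
```

Repeating the last three lines with fresh draws, once per researcher, gives the kind of sampling distribution discussed in the results.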

Results

No Multicollinearity Violation

(2) Priceᵢ = β₀ + β₁*sqftᵢ + β₂*sqftᵢ² − β₃*age_yearsᵢ − β₄*age_monthsᵢ + eᵢ

The researchers were very tired when putting together Model 2 and didn’t realize that they included two measures for the age of the house: age_years and age_months. A good way to check for multicollinearity is by looking at the variance inflation factor (VIF). As a rule of thumb, a VIF above 5 indicates multicollinearity, which is the case for both age_years and age_months. Let’s start off by comparing the predictive ability of Model 2 to Model 1 (true model). Mean squared error (MSE) is a good metric for prediction and tells you how close a model’s predictions are to the actual values. The plot below shows the distribution of MSE collected from all 10K researchers.
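
The VIF check mentioned above can be sketched with plain NumPy: regress each predictor on the others and compute 1 / (1 − R²). This is a minimal illustration, not the author's code. It assumes that age_months is a noisy copy of 12 × age_years (with exact copies the model could not be estimated at all), and the noise level and other numbers are hypothetical. The squared size term is left out because a variable and its own square are collinear by construction, and the size and age columns are drawn independently for brevity.

```python
import numpy as np

def vif(X):
    """VIF of each column of X (predictors only; an intercept is added internally)."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

def make_predictors(n, rng):
    """Hypothetical predictors: columns are sqft, age_years, age_months."""
    sqft = rng.normal(2.0, 0.5, size=n)
    age_years = rng.normal(20.0, 5.0, size=n)
    # Assumption: age in months is recorded with an error of about 6 months (SD).
    age_months = 12.0 * age_years + rng.normal(0.0, 6.0, size=n)
    return np.column_stack([sqft, age_years, age_months])

X = make_predictors(50, np.random.default_rng(0))
for name, value in zip(("sqft", "age_years", "age_months"), vif(X)):
    print(f"{name}: VIF = {value:.1f}")
```

Under these assumptions the two age columns should land far above the rule-of-thumb threshold of 5, while sqft stays close to 1.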

[Figure: distribution of MSE for Model 1 and Model 2 across the 10K researchers. Image by author.]

The MSE distributions of the two models are very similar, meaning that violating the no multicollinearity assumption does not really affect prediction. How about inference or, in other words, the model’s ability to explain? It turns out that the coefficient estimates for age_years, β₃, are quite different between Model 2 and Model 1:

[Figure: distribution of age_years coefficient estimates for Model 1 and Model 2.]

On average, the coefficient estimates are unbiased at -7 for both models. However, it’s clear that the estimates vary much more from sample to sample for Model 2. What does this mean? It means that multicollinearity weakens the statistical power of Model 2. For example, in Model 2, age_years is found to be statistically significant in only 54% of the 10K models. This is problematic because almost half of the researchers would conclude that age_years is not statistically significant. On the other hand, in Model 1, age_years is statistically significant in all 10K models.
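
The loss of power described above can be reproduced in a small Monte Carlo sketch. As before, only the -7 age coefficient, the 0.2 correlation and the sample size of 50 come from the text; the remaining parameters, the noisy definition of age_months, and the use of |t| > 2 as a stand-in for significance at roughly the 5% level are assumptions made for illustration, so the printed shares will not match the article's 54%.

```python
import numpy as np

def simulate_houses(n, rng, corr=0.2):
    """Hypothetical data-generating process; only the -7 age effect is from the text."""
    cov = [[1.0, corr], [corr, 1.0]]
    z = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    sqft = 2.0 + 0.5 * z[:, 0]
    age_years = 20.0 + 5.0 * z[:, 1]
    # Assumption: age_months is a noisy copy of age_years (error SD of 6 months).
    age_months = 12.0 * age_years + rng.normal(0.0, 6.0, size=n)
    price = (200.0 + 50.0 * sqft + 20.0 * sqft**2 - 7.0 * age_years
             + rng.normal(0.0, 9.5, size=n))
    return sqft, age_years, age_months, price

def ols(X, y):
    """Return OLS coefficients and their classical standard errors."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (X.shape[0] - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return beta, se

def compare_models(n_sims=2000, n=50, seed=0):
    """Age coefficient and t-statistic for Model 1 (column 0) and Model 2 (column 1)."""
    rng = np.random.default_rng(seed)
    est = np.empty((n_sims, 2))
    tstat = np.empty((n_sims, 2))
    for i in range(n_sims):
        sqft, age_y, age_m, price = simulate_houses(n, rng)
        X1 = np.column_stack([np.ones(n), sqft, sqft**2, age_y])
        X2 = np.column_stack([X1, age_m])  # Model 2 adds the redundant age_months
        for j, X in enumerate((X1, X2)):
            beta, se = ols(X, price)
            est[i, j] = beta[3]            # age_years is column 3 in both designs
            tstat[i, j] = beta[3] / se[3]
    return est, tstat

est, tstat = compare_models()
for j, name in enumerate(("Model 1", "Model 2")):
    share = np.mean(np.abs(tstat[:, j]) > 2.0)
    print(f"{name}: mean={est[:, j].mean():.2f}  sd={est[:, j].std():.2f}  "
          f"share with |t|>2: {share:.2f}")
```

Both means should sit near -7, while the spread for Model 2 is several times larger, which is what drives the drop in the share of significant results.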

Linearity Violation

(3) Priceᵢ = β₀ + β₁*sqftᵢ − β₂*age_yearsᵢ + eᵢ

Recall that the true relationship between Price and sqft is non-linear. Model 1 accounts for this, but Model 3 does not, since the researchers left out the second-order term for sqft. One telltale sign of this violation is a distinctive pattern when residuals are plotted against fitted values. As can be seen below, Model 3 produces a parabolic shape since the linear function does not adequately capture the relationship between Price and sqft:

[Figure: residuals plotted against fitted values for Model 3. Image by author.]
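
For readers without the plot, a numeric stand-in for the same diagnostic is sketched below. Instead of drawing residuals against fitted values, it measures how strongly the residuals correlate with the squared, centred sqft, which is a substitute check chosen here for simplicity and not the author's method. All parameters other than the -7 age coefficient and the 0.2 correlation are hypothetical, and a sample larger than 50 is used so that the pattern is easy to see.

```python
import numpy as np

def simulate_houses(n, rng, corr=0.2):
    """Hypothetical data-generating process; only the -7 age effect is from the text."""
    cov = [[1.0, corr], [corr, 1.0]]
    z = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    sqft = 2.0 + 0.5 * z[:, 0]
    age_years = 20.0 + 5.0 * z[:, 1]
    price = (200.0 + 50.0 * sqft + 20.0 * sqft**2 - 7.0 * age_years
             + rng.normal(0.0, 9.5, size=n))
    return sqft, age_years, price

def residuals(X, y):
    """OLS residuals of y regressed on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

def curvature_signal(sqft, resid):
    """Correlation between the residuals and the squared, centred sqft."""
    centred_sq = (sqft - sqft.mean()) ** 2
    return np.corrcoef(centred_sq, resid)[0, 1]

def check_linearity(n=500, seed=0):
    """Return the curvature signal for Model 1 and for Model 3."""
    rng = np.random.default_rng(seed)
    sqft, age, price = simulate_houses(n, rng)
    ones = np.ones(n)
    X1 = np.column_stack([ones, sqft, sqft**2, age])  # Model 1: includes sqft^2
    X3 = np.column_stack([ones, sqft, age])           # Model 3: sqft^2 left out
    return (curvature_signal(sqft, residuals(X1, price)),
            curvature_signal(sqft, residuals(X3, price)))

m1, m3 = check_linearity()
print(f"Model 1 curvature signal: {m1:.3f}")
print(f"Model 3 curvature signal: {m3:.3f}")
```

Model 1's residuals are orthogonal to the squared term by construction, so its signal is essentially zero; a clearly positive value for Model 3 is the numeric counterpart of the parabolic shape.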

Now that we confirmed that linearity is violated, let’s compare predictions across all 10K models by looking at the MSE:

[Figure: distribution of MSE for Model 1 and Model 3 across the 10K researchers. Image by author.]

The average MSE for Model 1 is 84 compared to 113 for Model 3. To make the interpretation clearer, we can take the root mean squared error (RMSE), the square root of MSE, to say that housing price predictions for Model 1 are typically about $9,167 (≈ √84 × 1,000) away from true prices, while they are about $10,614 away for Model 3. Lastly, let’s dive into inference and compare the coefficient estimates for age_years between Model 1 and Model 3. The plot below shows the distribution of age_years coefficient estimates obtained from the 10K researchers:

[Figure: distribution of age_years coefficient estimates for Model 1 and Model 3. Image by author.]

Although both models obtain the correct result of -7 on average, Model 3 is less precise since its estimates span a slightly larger range of values. While this issue is not as severe for Model 3 as it is for Model 2, it is exacerbated when stronger levels of non-linearity are unaccounted for.
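
The Model 1 versus Model 3 comparison, including the MSE-to-RMSE conversion used above, can be sketched as a small Monte Carlo run. This assumes that MSE means the in-sample mean of squared residuals, which the text does not spell out, and again uses hypothetical parameters apart from the -7 age coefficient, the 0.2 correlation and the sample size of 50. The numbers it prints therefore illustrate the pattern only and are not expected to match 84 and 113.

```python
import numpy as np

def simulate_houses(n, rng, corr=0.2):
    """Hypothetical data-generating process; only the -7 age effect is from the text."""
    cov = [[1.0, corr], [corr, 1.0]]
    z = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    sqft = 2.0 + 0.5 * z[:, 0]
    age_years = 20.0 + 5.0 * z[:, 1]
    price = (200.0 + 50.0 * sqft + 20.0 * sqft**2 - 7.0 * age_years
             + rng.normal(0.0, 9.5, size=n))
    return sqft, age_years, price

def compare_models(n_sims=2000, n=50, seed=0):
    """In-sample MSE and age coefficient for Model 1 (column 0) and Model 3 (column 1)."""
    rng = np.random.default_rng(seed)
    mse = np.empty((n_sims, 2))
    age_coef = np.empty((n_sims, 2))
    for i in range(n_sims):
        sqft, age, price = simulate_houses(n, rng)
        ones = np.ones(n)
        X1 = np.column_stack([ones, sqft, sqft**2, age])
        X3 = np.column_stack([ones, sqft, age])
        for j, X in enumerate((X1, X3)):
            beta, *_ = np.linalg.lstsq(X, price, rcond=None)
            resid = price - X @ beta
            mse[i, j] = np.mean(resid**2)
            age_coef[i, j] = beta[-1]  # age_years is the last column in both designs
    return mse, age_coef

mse, age_coef = compare_models()
for j, name in enumerate(("Model 1", "Model 3")):
    avg_mse = mse[:, j].mean()
    # Price is in thousands of dollars, so RMSE * 1000 is the typical error in dollars.
    rmse_dollars = np.sqrt(avg_mse) * 1000
    print(f"{name}: avg MSE={avg_mse:.1f}  RMSE=${rmse_dollars:,.0f}  "
          f"age coef mean={age_coef[:, j].mean():.2f} sd={age_coef[:, j].std():.2f}")
```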

No Endogeneity Violation

(4) Priceᵢ = β₀ − β₁*age_yearsᵢ + eᵢ

Endogeneity occurs when there is a link between the independent variables and the error term. Model 4 violates the no endogeneity assumption because the researchers omitted sqft from the model. Remember, when relevant variables are omitted from the model, they are absorbed by the error term. Since sqft and age_years are slightly correlated (I set the correlation to 0.2 in the simulation), omitting sqft from the model causes the error term to be correlated with age_years. Let’s first compare the predictive abilities of Model 1 and Model 4 by examining MSE:

[Figure: distribution of MSE for Model 1 and Model 4 across the 10K researchers. Image by author.]

Compared to Model 1, predictions for Model 4 are considerably worse, largely because sqft explains a lot of the variation in housing prices. RMSE tells us that Model 4’s predictions were typically about $29,099 away from true housing prices, compared to $9,167 for Model 1. Next, let’s focus on inference. The plot below shows what the distribution of age_years coefficients, β₁, for Model 4 looks like across the 10K researchers:

[Figure: distribution of age_years coefficient estimates for Model 4.]

The average coefficient estimate is biased (hence the term omitted variable bias), since we know that the true coefficient value for age_years is -7, not -4.1. Furthermore, we can see that for 9.5K out of the 10K researchers, coefficient estimates for age_years ranged from -5.5 to -2.8. This would lead the majority of researchers to underestimate the effect of age_years on Price.
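
The size of this bias follows the standard omitted variable bias result: the expected Model 4 estimate equals the true age coefficient plus the slope obtained when the omitted terms' contribution to price is regressed on age_years. The sketch below checks that relationship by simulation. The size coefficients `B_SQFT` and `B_SQFT2` and the other parameters are hypothetical, so the biased average it produces will differ from the article's -4.1, although the direction of the bias is the same.

```python
import numpy as np

B_SQFT, B_SQFT2, B_AGE = 50.0, 20.0, -7.0  # only B_AGE = -7 comes from the text

def simulate_houses(n, rng, corr=0.2):
    """Hypothetical data-generating process with size and age correlated at 0.2."""
    cov = [[1.0, corr], [corr, 1.0]]
    z = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    sqft = 2.0 + 0.5 * z[:, 0]
    age_years = 20.0 + 5.0 * z[:, 1]
    price = (200.0 + B_SQFT * sqft + B_SQFT2 * sqft**2 + B_AGE * age_years
             + rng.normal(0.0, 9.5, size=n))
    return sqft, age_years, price

def slope_on_age(age_years, y):
    """Slope from a simple regression of y on age_years (with intercept)."""
    X = np.column_stack([np.ones_like(age_years), age_years])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

def model_4_age_coefs(n_sims=2000, n=50, seed=0):
    """Age coefficient from Model 4 (price on age only) in each simulated study."""
    rng = np.random.default_rng(seed)
    coefs = np.empty(n_sims)
    for i in range(n_sims):
        _, age, price = simulate_houses(n, rng)
        coefs[i] = slope_on_age(age, price)
    return coefs

def implied_bias(n=200_000, seed=1):
    """Bias term: slope of the omitted size contribution on age, from one large sample."""
    sqft, age, _ = simulate_houses(n, np.random.default_rng(seed))
    return slope_on_age(age, B_SQFT * sqft + B_SQFT2 * sqft**2)

coefs = model_4_age_coefs()
bias = implied_bias()
print(f"average Model 4 estimate: {coefs.mean():.2f}")
print(f"true coefficient + implied bias: {B_AGE + bias:.2f}")
```

Because size and age are positively correlated and larger houses cost more, the bias term is positive, which pulls the estimated age effect toward zero.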

Conclusion

  • Violating the no multicollinearity assumption does not impact prediction, but can impact inference. For example, p-values typically become larger for highly correlated covariates, which can cause variables that truly matter to appear statistically insignificant.
  • Violating linearity can affect prediction and inference. For Model 3, we saw that prediction and the precision of the coefficient estimates were only hindered slightly. However, these problems are exacerbated when stronger levels of non-linearity are unaccounted for.
  • The no endogeneity assumption was violated in Model 4 due to an omitted variable. This created biased coefficient estimates, which lead to misleading conclusions. Prediction was also poor since the omitted variable explained a good deal of the variation in housing prices.

This simulation gives a flavor of what can happen when assumptions are violated. Depending on a multitude of factors (e.g. the variance of the errors or the number of observations), the model’s ability to predict and infer will vary. Of course, it’s also possible for a model to violate multiple assumptions. If there’s interest, I’ll cover the other assumptions in the future (homoskedasticity, normality of the error term, and no autocorrelation), but the three I covered should give you a good idea of the consequences of violating assumptions.

Translated from: https://towardsdatascience.com/the-consequences-of-violating-linear-regression-assumptions-4f0513dd3160
