Linear Regression Coefficients Are Probably Lying to You

Latest recommended article published on 2024-03-25 08:00:00

weixin_26704853

Article tags: python java algorithm

Original link: https://towardsdatascience.com/linear-regression-coefficients-are-probably-lying-to-you-457a57aaf288

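The coefficients in a linear regression may not reflect the relationships between variables as accurately as we assume. This article explores how linear regression coefficients can mislead and reminds data scientists to interpret results with caution.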

Summary generated by CSDN using AI technology

Linear regression coefficients

Interpreting linear regression coefficients is common because it is so easy. A model can be trained in a few lines of code, and the results yield statistics that can be stated matter-of-factly: “each additional point on the SAT increases your chances of admission by 0.002%”.

Whenever you train a linear regression (or logistic regression) model with this intent, be wary: you are treading in dangerous waters.

What is linear regression even doing? It multiplies each of the inputs by a value and adds them up; as an additional degree of freedom, an ‘intercept’ can be added. The result should approximate the y-variable.
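
As a rough illustration of that description, here is a minimal sketch using NumPy. The data is synthetic and the helper names `fit_linear_regression` and `predict` are made up for this example; nothing here comes from the admissions dataset.

```python
import numpy as np

def fit_linear_regression(X, y):
    """Return (weights, intercept) that minimize squared error via least squares."""
    # Append a column of ones so the intercept is learned as one more coefficient.
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1], coef[-1]

def predict(X, weights, intercept):
    # Multiply each input by its coefficient, add them up, then add the intercept.
    return X @ weights + intercept

# Synthetic, noise-free data (an assumption for illustration): y = 2*x1 - 3*x2 + 5.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + 5.0

weights, intercept = fit_linear_regression(X, y)
print(weights, intercept)
```

Because the synthetic target is exactly linear, the fitted weights and intercept should recover the values used to generate it.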

Let’s put linear regression in context. Take the following dataset, which contains several attributes of students’ graduate school applications, like the GRE or college GPA, along with their chance of admission.

Then, the linear regression equation becomes:

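Chance of Admit = b0 + b1 × GRE Score + b2 × TOEFL Score + b3 × University Rating + b4 × SOP + b5 × LOR + b6 × CGPA + b7 × Research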

When we train a linear regression model on the dataset, we find the coefficients to be:

[Table: fitted coefficient for each feature]

By default, then, we may be inclined to make the following statements:

‘For each point on the GRE, your chances of admission go up by 0.2%.’
‘For each point on the TOEFL, your chances go up by 0.3%.’
‘Performing research increases your chances of admission by 2.3%.’

If only it were that simple.

A truly accurate and rigorous interpretation of a coefficient reads: ‘this is the change in the y-variable that will occur if the x-variable is increased by one unit, holding all other x-variables fixed.’ The final clause is usually omitted from interpretations of coefficients because it makes the statement too long.

For instance, a truly accurate interpretation would be:

For each point on the GRE, your chances increase by 0.2%, given:
- Your TOEFL score remains fixed (standard scale)
- Your University Rating remains fixed (scale of 1–5)
- Your SOP remains fixed (scale of 1–5)
- Your LOR remains fixed (scale of 1–5)
- Your CGPA remains fixed (scale of 5–10)
- Your Research remains fixed (0 or 1)
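
The ‘holding all other x-variables fixed’ condition can be checked directly on a fitted model. The sketch below uses three synthetic features as hypothetical stand-ins for application attributes (not the real dataset): it raises one feature by a single unit, leaves the rest untouched, and compares the change in the prediction with that feature's coefficient.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical stand-ins for three application features; values are synthetic.
X = rng.normal(size=(200, 3))
y = 0.3 * X[:, 0] + 0.1 * X[:, 1] - 0.2 * X[:, 2] + rng.normal(scale=0.05, size=200)

# Ordinary least squares with an intercept column.
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
weights, intercept = coef[:-1], coef[-1]

applicant = X[0].copy()
bumped = applicant.copy()
bumped[0] += 1.0  # raise feature 0 by one unit, leave the others untouched

change = (bumped @ weights + intercept) - (applicant @ weights + intercept)
print(change, weights[0])
```

The two printed numbers should agree, which is all the coefficient promises: it says nothing about what happens when the other features move along with the one being changed.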

The scale of each x-variable needs to be specified, because linear regression coefficients depend heavily on scale. If we were to train one model on a dataset where length is measured in feet and another where it is measured in miles, both would perform the same but their coefficients would be different.
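
A small sketch of the feet-versus-miles point, on synthetic data with a made-up linear target: the same lengths are expressed in two units, a one-feature model is fitted to each, and the predictions and slopes are compared.

```python
import numpy as np

FEET_PER_MILE = 5280.0

rng = np.random.default_rng(2)
length_miles = rng.uniform(0.5, 10.0, size=100)
length_feet = length_miles * FEET_PER_MILE
# Hypothetical target that depends linearly on length, plus a little noise.
y = 3.0 * length_miles + 1.0 + rng.normal(scale=0.1, size=100)

def fit(x, y):
    """Least-squares slope and intercept for a single feature."""
    A = np.column_stack([x, np.ones(len(x))])
    (slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)
    return slope, intercept

slope_miles, icpt_miles = fit(length_miles, y)
slope_feet, icpt_feet = fit(length_feet, y)

pred_miles = slope_miles * length_miles + icpt_miles
pred_feet = slope_feet * length_feet + icpt_feet
print(slope_miles, slope_feet)
```

The predictions match, while the slope per foot is the slope per mile divided by 5280. Quoting either number without its unit would be meaningless.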

In theory, a change of scale in one variable shouldn’t affect the other coefficients: an exact least-squares solution simply divides that variable’s coefficient by the scale factor. Practical implementations of linear regression don’t always behave like the theory, however. When the fit is iterative or regularized, a large difference in scale can shift multiple coefficients at once.

For instance, consider how the linear regression coefficients differ when the research column takes the values 0 and 100 instead of 0 and 1.

The coefficient for research decreases to address this increase in scale, but many other coefficients change as well, notably:

GRE score from 0.0019 to 0.0012 (63%)
University Rating from 0.0085 to 0.0047 (55%)
SOP from 0.00046 to 0.0056 (1,375%)

It’s completely valid to train one model with the research column on a scale of 0 to 100 and another with it on a scale of 0 to 1, but the resulting coefficients are very different. Likewise, you may find discrepancies in coefficient interpretations between different measurements of length, weight, price, scoring, etc.
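
One possible source of such shifts is regularization. The sketch below is an assumption for illustration, not a reconstruction of how the numbers above were produced: it fits synthetic data with a closed-form ridge penalty, where `lam=0` reduces to ordinary least squares, once with a 0/1 research flag and once with the same flag coded 0/100. The helper `ridge_fit` and all data are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression on centered data; lam=0 gives ordinary least squares."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)

rng = np.random.default_rng(3)
n = 500
gre = rng.normal(size=n)                                 # standardized score (synthetic)
research = (gre + rng.normal(size=n) > 0).astype(float)  # 0/1 flag, correlated with gre
y = 0.05 * gre + 0.10 * research + rng.normal(scale=0.02, size=n)

X_01 = np.column_stack([gre, research])         # research coded 0 or 1
X_100 = np.column_stack([gre, 100 * research])  # research coded 0 or 100

# Without a penalty, only the rescaled column's coefficient changes.
ols_01, ols_100 = ridge_fit(X_01, y, 0.0), ridge_fit(X_100, y, 0.0)
# With a penalty, the gre coefficient moves as well.
ridge_01, ridge_100 = ridge_fit(X_01, y, 100.0), ridge_fit(X_100, y, 100.0)

print("OLS   gre coefficient:", ols_01[0], ols_100[0])
print("Ridge gre coefficient:", ridge_01[0], ridge_100[0])
```

In the unpenalized fit the GRE coefficient is identical under both codings. In the penalized fit it is not, because the penalty weighs the research coefficient differently depending on the column's scale and the correlated GRE column absorbs the difference.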

The only way to ensure your coefficients make sense in the context of all the other variables is to specify the scales of each column.

Additionally, adding and removing features can have different effects on the coefficients. For instance, if we remove the research column altogether, the other coefficients need to each increase to address the absence of a substantial value adder (in the view of linear regression, features only serve to add or subtract from the predicted y-value).
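
A minimal sketch of that effect on synthetic data (the feature names are borrowed from the article, the values are invented): the same target is fitted once with a research flag and once without it, and the coefficient on the remaining feature is compared.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients (intercept fitted but not returned)."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1]

rng = np.random.default_rng(4)
n = 500
gre = rng.normal(size=n)                                 # standardized score (synthetic)
research = (gre + rng.normal(size=n) > 0).astype(float)  # 0/1 flag, correlated with gre
y = 0.05 * gre + 0.10 * research + rng.normal(scale=0.02, size=n)

with_research = ols(np.column_stack([gre, research]), y)
without_research = ols(gre.reshape(-1, 1), y)

print("gre coefficient with research column:   ", with_research[0])
print("gre coefficient without research column:", without_research[0])
```

Because the research flag is positively correlated with the score in this setup, dropping it pushes the score's coefficient upward: the score takes credit for the effect of the missing column.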

So let’s think about this in reverse — what if the starting dataset didn’t have a Research column? We would have gotten different results if we expanded our dataset, and an expansion of data usually means getting closer to the truth. Hence, unless you have ‘all the data’, at least philosophically, your linear regression coefficients will never be perfect.

Practically, however, this means that if you have control over data collection, you should always make an effort to collect as much of it as possible, knowing that coefficients can suffer from limited data.

Multicollinearity, or the high correlation between different features, also screws with the empirical interpretation of coefficients. For instance, we may be inclined to say that ‘one point on the GRE will increase chances by x%’, under the assumption that all other factors remain fixed. Realistically, someone who scores well on the GRE probably also has a good TOEFL score and is applying to a university with a better rating.

This means that, because of correlated factors, gaining ‘one point on the GRE’ will in practice probably coincide with an increase larger than x%. Similarly, ‘losing one point on the GRE’ will most likely see a decrease larger than x%. One can imagine how damaging this could be in a business context.

Most of the problems described can be traced to multicollinearity, which is inevitable in every real-world dataset. If you try to transform your dataset to get rid of multicollinearity, the coefficients are no longer interpretable (or are, at least, very difficult to interpret and of questionable meaning).

Really, the only sure-fire solution is not to use linear regression in the first place. Plenty of great explanation methods like SHAP exist, and are much less vulnerable to the sorts of problems the simple linear regression is prone to.

But if you insist on using linear regression coefficients for interpretation, make sure there is little multicollinearity, and specify the scales and features used, so that the results are reproducible and the coefficients better resemble the truth.
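
One common way to check for multicollinearity before reading coefficients is the variance inflation factor (VIF), which regresses each feature on all the others. The function below is a hand-rolled sketch on synthetic data, not a library API, and the feature names are placeholders.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing column j on the others."""
    n, p = X.shape
    vifs = []
    for j in range(p):
        target = X[:, j]
        # Regress column j on the remaining columns plus an intercept.
        others = np.column_stack([np.delete(X, j, axis=1), np.ones(n)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residual = target - others @ coef
        r_squared = 1.0 - residual.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

rng = np.random.default_rng(5)
n = 500
gre = rng.normal(size=n)
toefl = 0.9 * gre + 0.3 * rng.normal(size=n)  # strongly correlated with gre (synthetic)
lor = rng.normal(size=n)                      # independent of the other two

vifs = variance_inflation_factors(np.column_stack([gre, toefl, lor]))
print(dict(zip(["gre", "toefl", "lor"], vifs.round(2))))
```

A VIF near 1 means a feature is nearly uncorrelated with the rest. A frequently quoted rule of thumb treats values above roughly 5 to 10 as a warning sign, though where to draw the line is a judgment call.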
