A Beginner's Guide to Multicollinearity

Regression is a way of describing the relationship between a dependent variable and one or more independent variables. What if the independent variables are related to each other? This situation is called multicollinearity.

Multicollinearity is a state of very high correlation among the independent variables, i.e. one predictor variable can be used to predict another predictor variable.

When two independent features carry nearly the same information, they have a similar impact on the dependent variable, so the regression model cannot isolate the individual effect of each independent variable on the dependent variable.
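
One quick way to see whether two predictors move together is to compute their Pearson correlation. The sketch below is only an illustration: the `years_experience` and `age` values are invented for this example and do not come from the article's data.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values, invented for illustration only.
years_experience = [1, 3, 5, 7, 10, 12, 15]
age = [23, 26, 27, 31, 33, 36, 40]

r = pearson(years_experience, age)
print(f"correlation between the two predictors: {r:.3f}")
```

A value close to 1 or -1 suggests that one predictor can largely be predicted from the other, which is the situation described above.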

Reasons for multicollinearity

  • It can be caused by the incorrect use of dummy variables.
  • It can be caused by including a variable that is computed from other variables in the data set.
  • It can also result from repeating the same kind of variable, e.g. Sex and Gender.
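
The first two causes can be illustrated with the rank of the design matrix. This is a minimal sketch assuming NumPy is available; the column names (`is_male`, `base_pay`, `bonus`, `total_pay`) and all values are hypothetical. When one column is an exact combination of the others, the rank is lower than the number of columns.

```python
import numpy as np

# Dummy variable misuse: an intercept plus one dummy per category.
# Hypothetical category labels.
is_male = np.array([1, 0, 1, 1, 0, 0])
is_female = 1 - is_male
intercept = np.ones_like(is_male)

X_trap = np.column_stack([intercept, is_male, is_female])  # is_male + is_female == intercept
X_ok = np.column_stack([intercept, is_male])               # one dummy dropped

rank_trap = np.linalg.matrix_rank(X_trap)
rank_ok = np.linalg.matrix_rank(X_ok)
print("with both dummies:", rank_trap, "of", X_trap.shape[1], "columns")
print("with one dummy dropped:", rank_ok, "of", X_ok.shape[1], "columns")

# Computed variable: total_pay is derived from two other columns.
# Hypothetical values.
base_pay = np.array([30, 40, 50, 60, 45, 55])
bonus = np.array([5, 3, 8, 2, 6, 4])
total_pay = base_pay + bonus

X_derived = np.column_stack([base_pay, bonus, total_pay])
rank_derived = np.linalg.matrix_rank(X_derived)
print("with derived column:", rank_derived, "of", X_derived.shape[1], "columns")
```

In both cases the redundant column adds no new information, so dropping it restores a full-rank design.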

Let us consider the following scenario:

The salary of a person in an organization is a function of “Years of experience”, “Age”, “X3”, “X4”…

Salary = β0 + β1(“Years of experience”) + β2(“Age”) + …

β1: the marginal effect on salary of one additional unit of Years of experience, holding the other variables constant

β2: the marginal effect on salary of one additional unit of Age, holding the other variables constant
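
To make "holding the other variables constant" concrete, here is a small sketch. The coefficient values `BETA_0`, `BETA_1` and `BETA_2` are hypothetical numbers chosen only for illustration; they are not estimates from the article.

```python
# Hypothetical coefficients, chosen only for illustration.
BETA_0 = 20_000.0  # intercept
BETA_1 = 1_500.0   # change in salary per extra year of experience
BETA_2 = 300.0     # change in salary per extra year of age


def predicted_salary(years_experience, age):
    """Linear model: Salary = β0 + β1 * experience + β2 * age."""
    return BETA_0 + BETA_1 * years_experience + BETA_2 * age


base = predicted_salary(5, 30)

# Raise experience by one unit while age stays fixed at 30.
marginal_experience = predicted_salary(6, 30) - base

# Raise age by one unit while experience stays fixed at 5.
marginal_age = predicted_salary(5, 31) - base

print(marginal_experience, marginal_age)
```

Each difference recovers exactly one coefficient, but only because the other variable could be frozen while the first one changed.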

Multicollinearity occurs when the independent variables are correlated with each other, so that their individual effects become obscured.

What regression does is tease apart the individual effects of “Years of experience” (β1) and “Age” (β2) on “Salary”.

The problem is that the more experienced a person is, the older they are likely to be. So regression cannot differentiate the impact of “Years of experience” from the impact of “Age” on “Salary”. It cannot tell whether an increase in “Age” or an increase in “Years of experience” has led to the increased “Salary”.
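
This can be seen in a small simulation. The sketch below assumes NumPy and uses synthetic data: the relationship between `age` and `experience`, the "true" coefficients, and the noise levels are all invented for the demonstration. It refits the model on bootstrap resamples and compares how much β1, β2 and their sum vary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic data: age tracks experience almost exactly (an assumption of this demo).
experience = rng.uniform(0, 30, n)
age = 22 + experience + rng.normal(0, 1, n)
# Assumed "true" relationship, invented for the demo.
salary = 20_000 + 1_000 * experience + 500 * age + rng.normal(0, 5_000, n)


def fit(exp_, age_, sal_):
    """Ordinary least squares; returns [b0, b1, b2]."""
    X = np.column_stack([np.ones(len(sal_)), exp_, age_])
    coef, *_ = np.linalg.lstsq(X, sal_, rcond=None)
    return coef


# Refit on bootstrap resamples to see how stable each estimate is.
boot = []
for _ in range(500):
    idx = rng.integers(0, n, n)
    boot.append(fit(experience[idx], age[idx], salary[idx]))
boot = np.array(boot)

sd_b1 = boot[:, 1].std()
sd_b2 = boot[:, 2].std()
sd_sum = (boot[:, 1] + boot[:, 2]).std()

print(f"spread of b1 across resamples:      {sd_b1:.1f}")
print(f"spread of b2 across resamples:      {sd_b2:.1f}")
print(f"spread of b1 + b2 across resamples: {sd_sum:.1f}")
```

In this setup the combined effect of ageing and gaining experience together is estimated much more steadily than either coefficient alone, which is what it means for the individual effects to be obscured.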

So can we hold the other variables constant?

Let's look at an example.

Consider the following DataFrame:

[Image: DataFrame]

From this data, we will build a regression model to predict salary.
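
As a rough idea of what such a model can look like, here is a minimal sketch using NumPy's least-squares solver. It is not the article's own code, and the small table of `years`, `age` and `salary` values (salary in thousands) is hypothetical.

```python
import numpy as np

# Hypothetical data, invented for illustration (salary in thousands).
years = np.array([1, 2, 3, 5, 7, 8, 10, 12, 15, 20], dtype=float)
age = np.array([22, 25, 24, 28, 30, 33, 34, 38, 40, 46], dtype=float)
salary = np.array([30, 35, 37, 45, 52, 55, 63, 70, 80, 100], dtype=float)

# Design matrix with an intercept column: Salary = b0 + b1 * years + b2 * age.
X = np.column_stack([np.ones(len(salary)), years, age])
coef, *_ = np.linalg.lstsq(X, salary, rcond=None)

# Goodness of fit (R squared) on the same data.
predicted = X @ coef
ss_res = np.sum((salary - predicted) ** 2)
ss_tot = np.sum((salary - salary.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("b0, b1, b2:", np.round(coef, 3))
print("R squared:", round(r2, 4))
```

A high R squared here says the model predicts salary well overall; it says nothing about whether the individual coefficients for `years` and `age` can be trusted.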

CODE:

[Image: code]

OUTPUT:

[Image: output]

In this case, the independent variables (Years
