Linear Regression: Multiple Linear Regression

The previous post covered simple linear regression. This post looks at multiple linear regression: how to check for multicollinearity, and how to perform feature selection.

Multiple linear regression, as the name suggests, fits a response with several predictor variables. Its use cases mirror those of simple linear regression. One common use is explanatory: quantifying the relationship between the response and the predictors. For example, sales may depend on inbound traffic, SKU breadth, price, discounts, and so on; if we want to know which of these variables influences sales most significantly, the coefficients of a multiple regression can tell us. The other use is predictive: using several predictors to forecast the value of the response. This post focuses on the first scenario, although from a technical standpoint the two scenarios differ very little.

import pandas as pd
import numpy as np
import seaborn as sns
from sklearn import linear_model as lm
import statsmodels.api as sm
from sklearn.datasets import fetch_california_housing
data_set = fetch_california_housing(as_frame=True)
data = data_set.frame
data.describe()
|       | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|-------|--------|----------|----------|-----------|------------|----------|----------|-----------|-------------|
| count | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 |
| mean  | 3.870671 | 28.639486 | 5.429000 | 1.096675 | 1425.476744 | 3.070655 | 35.631861 | -119.569704 | 2.068558 |
| std   | 1.899822 | 12.585558 | 2.474173 | 0.473911 | 1132.462122 | 10.386050 | 2.135952 | 2.003532 | 1.153956 |
| min   | 0.499900 | 1.000000 | 0.846154 | 0.333333 | 3.000000 | 0.692308 | 32.540000 | -124.350000 | 0.149990 |
| 25%   | 2.563400 | 18.000000 | 4.440716 | 1.006079 | 787.000000 | 2.429741 | 33.930000 | -121.800000 | 1.196000 |
| 50%   | 3.534800 | 29.000000 | 5.229129 | 1.048780 | 1166.000000 | 2.818116 | 34.260000 | -118.490000 | 1.797000 |
| 75%   | 4.743250 | 37.000000 | 6.052381 | 1.099526 | 1725.000000 | 3.282261 | 37.710000 | -118.010000 | 2.647250 |
| max   | 15.000100 | 52.000000 | 141.909091 | 34.066667 | 35682.000000 | 1243.333333 | 41.950000 | -114.310000 | 5.000010 |

MedHouseVal is the response we want to predict. There are eight predictors: MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, and Longitude.
Let's first check for multicollinearity.

X = data[['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup',
       'Latitude', 'Longitude']]
X.corr()
|            | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|------------|--------|----------|----------|-----------|------------|----------|----------|-----------|
| MedInc     | 1.000000 | -0.119034 | 0.326895 | -0.062040 | 0.004834 | 0.018766 | -0.079809 | -0.015176 |
| HouseAge   | -0.119034 | 1.000000 | -0.153277 | -0.077747 | -0.296244 | 0.013191 | 0.011173 | -0.108197 |
| AveRooms   | 0.326895 | -0.153277 | 1.000000 | 0.847621 | -0.072213 | -0.004852 | 0.106389 | -0.027540 |
| AveBedrms  | -0.062040 | -0.077747 | 0.847621 | 1.000000 | -0.066197 | -0.006181 | 0.069721 | 0.013344 |
| Population | 0.004834 | -0.296244 | -0.072213 | -0.066197 | 1.000000 | 0.069863 | -0.108785 | 0.099773 |
| AveOccup   | 0.018766 | 0.013191 | -0.004852 | -0.006181 | 0.069863 | 1.000000 | 0.002366 | 0.002476 |
| Latitude   | -0.079809 | 0.011173 | 0.106389 | 0.069721 | -0.108785 | 0.002366 | 1.000000 | -0.924664 |
| Longitude  | -0.015176 | -0.108197 | -0.027540 | 0.013344 | 0.099773 | 0.002476 | -0.924664 | 1.000000 |

A heatmap gives a more intuitive view of the collinearity. A common rule of thumb is that a correlation coefficient above 0.6 in absolute value indicates fairly strong correlation between two variables.

import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(15,8))

# mask the upper triangle so each pair of variables appears only once
mask = np.triu(np.ones_like(X.corr(), dtype=bool))

sns.heatmap(X.corr(), annot=True, mask=mask, vmin=-1, vmax=1)
plt.title('Correlation Coefficient Of Predictors')
plt.show()

[Figure: correlation heatmap of the predictors]

The heatmap clearly shows two pairs of strongly correlated predictors:

  • AveBedrms and AveRooms
  • Longitude and Latitude

In multiple linear regression, multicollinearity among the predictors makes it hard to judge how much each variable influences the response. If the model is used purely to predict the response, however, multicollinearity can usually be ignored.
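Rather than reading correlated pairs off the heatmap by eye, the same |r| > 0.6 screen can be done programmatically. Below is a minimal sketch on toy data (the DataFrame and column names here are illustrative, not the housing data):

```python
import numpy as np
import pandas as pd

# Toy data with one deliberately collinear pair: x2 is x1 plus small noise.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.1, size=200),  # strongly correlated with x1
    "x3": rng.normal(size=200),                  # independent
})

corr = df.corr()
# Scan the upper triangle for pairs whose |r| exceeds the 0.6 rule of thumb.
pairs = [
    (corr.columns[i], corr.columns[j], corr.iloc[i, j])
    for i in range(len(corr.columns))
    for j in range(i + 1, len(corr.columns))
    if abs(corr.iloc[i, j]) > 0.6
]
print(pairs)  # only the (x1, x2) pair is flagged
```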

So, when predictors are found to be correlated, which one should we delete? We can compute each variable's VIF (variance inflation factor) and remove variables one at a time:

$VIF_i = \frac{1}{1 - R_i^2}$

where $R_i^2$ is the $R^2$ obtained by regressing predictor $i$ on all the other predictors.

| VIF  | Description          |
|------|----------------------|
| 1    | no collinearity      |
| < 5  | weak collinearity    |
| >= 5 | strong collinearity  |
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm

def compute_vif(considered_features):
    X = data[considered_features]
    # the calculation of variance inflation requires a constant
    X = sm.add_constant(X)

    # create a dataframe to store the VIF values (skipping the constant)
    vif = pd.DataFrame({
        'variables': X.columns[1:],
        'VIF': [variance_inflation_factor(X.values, i + 1)
                for i in range(len(X.columns[1:]))]
    })
    return vif
considered_features = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup',
       'Latitude', 'Longitude']
compute_vif(considered_features).sort_values('VIF', ascending=False)
| variables  | VIF      |
|------------|----------|
| Latitude   | 9.297624 |
| Longitude  | 8.962263 |
| AveRooms   | 8.342786 |
| AveBedrms  | 6.994995 |
| MedInc     | 2.501295 |
| HouseAge   | 1.241254 |
| Population | 1.138125 |
| AveOccup   | 1.008324 |

Latitude has the highest VIF, so delete it first and recompute:

considered_features.remove('Latitude')
compute_vif(considered_features).sort_values('VIF', ascending=False)
| variables  | VIF      |
|------------|----------|
| AveRooms   | 7.513381 |
| AveBedrms  | 6.644835 |
| MedInc     | 2.072289 |
| HouseAge   | 1.158398 |
| Population | 1.126271 |
| Longitude  | 1.027751 |
| AveOccup   | 1.006831 |

AveRooms now has the highest VIF and is still above 5, so delete it as well and recompute:

considered_features.remove('AveRooms')
compute_vif(considered_features).sort_values('VIF', ascending=False)
| variables  | VIF      |
|------------|----------|
| HouseAge   | 1.136532 |
| Population | 1.120021 |
| MedInc     | 1.022435 |
| AveBedrms  | 1.020911 |
| Longitude  | 1.017817 |
| AveOccup   | 1.006728 |

All VIFs are now below 5. Fit a multiple linear regression with the remaining six predictors and inspect the result:

X = data[considered_features]
X = sm.add_constant(X)
y = data['MedHouseVal']
model = sm.OLS(y, X).fit()
model.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:            MedHouseVal   R-squared:                       0.512
Model:                            OLS   Adj. R-squared:                  0.512
Method:                 Least Squares   F-statistic:                     3612.
Date:                Sun, 03 Jul 2022   Prob (F-statistic):               0.00
Time:                        14:42:43   Log-Likelihood:                -24832.
No. Observations:               20640   AIC:                         4.968e+04
Df Residuals:                   20633   BIC:                         4.973e+04
Df Model:                           6
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -1.4014      0.338     -4.144      0.000      -2.064      -0.739
MedInc         0.4333      0.003    145.113      0.000       0.427       0.439
HouseAge       0.0185      0.000     38.878      0.000       0.018       0.019
AveBedrms      0.0381      0.012      3.188      0.001       0.015       0.062
Population  3.801e-05   5.24e-06      7.249      0.000    2.77e-05    4.83e-05
AveOccup      -0.0047      0.001     -8.661      0.000      -0.006      -0.004
Longitude     -0.0099      0.003     -3.499      0.000      -0.015      -0.004
==============================================================================
Omnibus:                     4203.765   Durbin-Watson:                   0.791
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            10072.811
Skew:                           1.140   Prob(JB):                         0.00
Kurtosis:                       5.552   Cond. No.                     1.10e+05
==============================================================================


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.1e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

$y = -1.4014 + 0.4333\,\mathrm{MedInc} + 0.0185\,\mathrm{HouseAge} + 0.0381\,\mathrm{AveBedrms} + 3.801\times10^{-5}\,\mathrm{Population} - 0.0047\,\mathrm{AveOccup} - 0.0099\,\mathrm{Longitude}$
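As a sanity check on the equation, we can wrap the rounded coefficients in a plain function and evaluate it at the predictor means from `data.describe()` above. With an intercept, an OLS fit predicts approximately the mean of MedHouseVal (2.0686) when given the means of its predictors, so the result should land close to that value:

```python
def predict_med_house_val(MedInc, HouseAge, AveBedrms, Population, AveOccup, Longitude):
    """Fitted equation from the OLS summary above (coefficients rounded)."""
    return (-1.4014
            + 0.4333 * MedInc
            + 0.0185 * HouseAge
            + 0.0381 * AveBedrms
            + 3.801e-5 * Population
            - 0.0047 * AveOccup
            - 0.0099 * Longitude)

# Evaluate at the dataset means reported by data.describe().
print(predict_med_house_val(3.8707, 28.6395, 1.0967, 1425.48, 3.0707, -119.5697))
# ~2.07: close to the mean MedHouseVal of 2.0686 (the small gap comes from rounding)
```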

Interpreting the coefficients: holding the other predictors fixed, a one-unit increase in MedInc raises MedHouseVal by 0.4333, and a one-unit increase in HouseAge raises it by 0.0185. The overall fit is mediocre: $R^2$ is only 0.51. Some data preprocessing could improve it; that will be the topic of a future post.
