Linear Regression: Multiple Linear Regression
Last time we covered simple linear regression. This post looks at multiple linear regression: how to check for multicollinearity, and how to do feature selection.
Multiple linear regression, as the name suggests, fits a response with multiple predictors. Its use cases are the same as simple linear regression's. One is explaining the relationship between the response and the predictors: for example, sales may be related to incoming traffic, SKU breadth, price, discounts, and so on, and if you want to know which of these variables affects sales most significantly, the coefficients of a multiple regression can tell you. The other is using multiple predictors to predict the value of the response. This post focuses on the first scenario; technically speaking, the two differ very little anyway.
```python
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.datasets import fetch_california_housing

# Load the California housing dataset as a pandas DataFrame
data_set = fetch_california_housing(as_frame=True)
data = data_set.frame
data.describe()
```
| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 |
| mean | 3.870671 | 28.639486 | 5.429000 | 1.096675 | 1425.476744 | 3.070655 | 35.631861 | -119.569704 | 2.068558 |
| std | 1.899822 | 12.585558 | 2.474173 | 0.473911 | 1132.462122 | 10.386050 | 2.135952 | 2.003532 | 1.153956 |
| min | 0.499900 | 1.000000 | 0.846154 | 0.333333 | 3.000000 | 0.692308 | 32.540000 | -124.350000 | 0.149990 |
| 25% | 2.563400 | 18.000000 | 4.440716 | 1.006079 | 787.000000 | 2.429741 | 33.930000 | -121.800000 | 1.196000 |
| 50% | 3.534800 | 29.000000 | 5.229129 | 1.048780 | 1166.000000 | 2.818116 | 34.260000 | -118.490000 | 1.797000 |
| 75% | 4.743250 | 37.000000 | 6.052381 | 1.099526 | 1725.000000 | 3.282261 | 37.710000 | -118.010000 | 2.647250 |
| max | 15.000100 | 52.000000 | 141.909091 | 34.066667 | 35682.000000 | 1243.333333 | 41.950000 | -114.310000 | 5.000010 |
MedHouseVal is the response we want to predict. There are eight predictors: MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, and Longitude.

Let's start by checking for multicollinearity.
```python
# Pairwise correlations between the eight predictors
X = data[['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup',
          'Latitude', 'Longitude']]
X.corr()
```
| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MedInc | 1.000000 | -0.119034 | 0.326895 | -0.062040 | 0.004834 | 0.018766 | -0.079809 | -0.015176 |
| HouseAge | -0.119034 | 1.000000 | -0.153277 | -0.077747 | -0.296244 | 0.013191 | 0.011173 | -0.108197 |
| AveRooms | 0.326895 | -0.153277 | 1.000000 | 0.847621 | -0.072213 | -0.004852 | 0.106389 | -0.027540 |
| AveBedrms | -0.062040 | -0.077747 | 0.847621 | 1.000000 | -0.066197 | -0.006181 | 0.069721 | 0.013344 |
| Population | 0.004834 | -0.296244 | -0.072213 | -0.066197 | 1.000000 | 0.069863 | -0.108785 | 0.099773 |
| AveOccup | 0.018766 | 0.013191 | -0.004852 | -0.006181 | 0.069863 | 1.000000 | 0.002366 | 0.002476 |
| Latitude | -0.079809 | 0.011173 | 0.106389 | 0.069721 | -0.108785 | 0.002366 | 1.000000 | -0.924664 |
| Longitude | -0.015176 | -0.108197 | -0.027540 | 0.013344 | 0.099773 | 0.002476 | -0.924664 | 1.000000 |
Let's visualize the collinearity with a heatmap. As a rule of thumb, when the absolute value of the correlation coefficient exceeds 0.6, we consider the two variables strongly correlated.
```python
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(15, 8))
# Mask the upper triangle so each pair of predictors is shown only once
mask = np.triu(np.ones_like(X.corr(), dtype=bool))
sns.heatmap(X.corr(), annot=True, mask=mask, vmin=-1, vmax=1)
plt.title('Correlation Coefficient Of Predictors')
plt.show()
```
The heatmap clearly shows two pairs of strongly correlated predictors:

- AveBedrms and AveRooms
- Longitude and Latitude

In multiple linear regression, multicollinearity among the predictors makes it hard to judge how much each variable contributes to the response. If the model is used purely for prediction, however, multicollinearity can largely be ignored.
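To see why, here is a minimal sketch with simulated data (an illustration, not part of the housing analysis): two nearly collinear predictors are fit on two bootstrap resamples, and the individual coefficients swing noticeably between fits even though the overall fit quality barely changes.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)  # x2 is nearly a copy of x1
y = 1.0 + 2.0 * x1 + rng.normal(size=n)   # the true model only involves x1

for seed in (1, 2):
    # Bootstrap resample the rows and refit
    idx = np.random.default_rng(seed).integers(0, n, size=n)
    X_sim = sm.add_constant(np.column_stack([x1[idx], x2[idx]]))
    res = sm.OLS(y[idx], X_sim).fit()
    print(res.params.round(2), 'R^2 =', round(res.rsquared, 3))
# The x1/x2 coefficients vary a lot across resamples, but R^2 stays stable:
# the model predicts well even though the individual coefficients are unreliable.
```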
So the question becomes: when predictors are correlated, which one should be dropped? We can compute the VIF (variance inflation factor) of each predictor and remove them one at a time:

$$VIF = \frac{1}{1 - R_i^2}$$

where $R_i^2$ is the $R^2$ obtained by regressing the $i$-th predictor on all the other predictors. The usual interpretation (a hand computation is sketched after the table below):
| VIF | Description |
| --- | --- |
| 1 | No collinearity |
| < 5 | Weak collinearity |
| >= 5 | Strong collinearity |
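To make the formula concrete, here is a minimal sketch that computes the VIF of AveRooms by hand on the X defined above: run the auxiliary regression of AveRooms on the remaining predictors, take its $R_i^2$, and plug it into the formula. The result should match the variance_inflation_factor output used below.

```python
# Auxiliary regression: AveRooms on the other predictors (with a constant)
others = [c for c in X.columns if c != 'AveRooms']
aux = sm.OLS(X['AveRooms'], sm.add_constant(X[others])).fit()

# VIF = 1 / (1 - R_i^2)
vif_ave_rooms = 1.0 / (1.0 - aux.rsquared)
print(vif_ave_rooms)  # ~8.34, matching the VIF table below
```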
```python
from statsmodels.stats.outliers_influence import variance_inflation_factor

def compute_vif(considered_features):
    X = data[considered_features]
    # the calculation of variance inflation requires a constant
    X = sm.add_constant(X)
    # create a dataframe holding the VIF of each feature (skipping the constant)
    vif = pd.DataFrame({
        'variables': X.columns[1:],
        'VIF': [variance_inflation_factor(X.values, i + 1)
                for i in range(len(X.columns) - 1)]
    })
    return vif
```
```python
considered_features = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population',
                       'AveOccup', 'Latitude', 'Longitude']
compute_vif(considered_features).sort_values('VIF', ascending=False)
```
| | variables | VIF |
| --- | --- | --- |
| 6 | Latitude | 9.297624 |
| 7 | Longitude | 8.962263 |
| 2 | AveRooms | 8.342786 |
| 3 | AveBedrms | 6.994995 |
| 0 | MedInc | 2.501295 |
| 1 | HouseAge | 1.241254 |
| 4 | Population | 1.138125 |
| 5 | AveOccup | 1.008324 |
Latitude has the highest VIF, so drop it first and recompute the VIFs:
```python
considered_features.remove('Latitude')
compute_vif(considered_features).sort_values('VIF', ascending=False)
```
| | variables | VIF |
| --- | --- | --- |
| 2 | AveRooms | 7.513381 |
| 3 | AveBedrms | 6.644835 |
| 0 | MedInc | 2.072289 |
| 1 | HouseAge | 1.158398 |
| 4 | Population | 1.126271 |
| 6 | Longitude | 1.027751 |
| 5 | AveOccup | 1.006831 |
AveRooms is still above 5, so drop it as well and recompute the VIFs:
```python
considered_features.remove('AveRooms')
compute_vif(considered_features).sort_values('VIF', ascending=False)
```
| | variables | VIF |
| --- | --- | --- |
| 1 | HouseAge | 1.136532 |
| 3 | Population | 1.120021 |
| 0 | MedInc | 1.022435 |
| 2 | AveBedrms | 1.020911 |
| 5 | Longitude | 1.017817 |
| 4 | AveOccup | 1.006728 |
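All VIFs are now below 5. The drop-and-recompute procedure above can also be automated; here is a minimal sketch built on the compute_vif helper defined earlier (the threshold of 5 follows the table above):

```python
def drop_high_vif(features, threshold=5.0):
    """Iteratively drop the feature with the highest VIF until all VIFs < threshold."""
    features = list(features)
    while True:
        vif = compute_vif(features).sort_values('VIF', ascending=False)
        if vif['VIF'].iloc[0] < threshold:
            return features
        features.remove(vif['variables'].iloc[0])

selected = drop_high_vif(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population',
                          'AveOccup', 'Latitude', 'Longitude'])
print(selected)  # ends with the same six features as the manual steps above
```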
Now fit a multiple linear regression on the remaining six predictors and look at the results:
```python
X = data[considered_features]
X = sm.add_constant(X)
y = data['MedHouseVal']
model = sm.OLS(y, X).fit()
model.summary()
```
| Dep. Variable: | MedHouseVal | R-squared: | 0.512 |
| --- | --- | --- | --- |
| Model: | OLS | Adj. R-squared: | 0.512 |
| Method: | Least Squares | F-statistic: | 3612. |
| Date: | Sun, 03 Jul 2022 | Prob (F-statistic): | 0.00 |
| Time: | 14:42:43 | Log-Likelihood: | -24832. |
| No. Observations: | 20640 | AIC: | 4.968e+04 |
| Df Residuals: | 20633 | BIC: | 4.973e+04 |
| Df Model: | 6 | | |
| Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
| --- | --- | --- | --- | --- | --- | --- |
| const | -1.4014 | 0.338 | -4.144 | 0.000 | -2.064 | -0.739 |
| MedInc | 0.4333 | 0.003 | 145.113 | 0.000 | 0.427 | 0.439 |
| HouseAge | 0.0185 | 0.000 | 38.878 | 0.000 | 0.018 | 0.019 |
| AveBedrms | 0.0381 | 0.012 | 3.188 | 0.001 | 0.015 | 0.062 |
| Population | 3.801e-05 | 5.24e-06 | 7.249 | 0.000 | 2.77e-05 | 4.83e-05 |
| AveOccup | -0.0047 | 0.001 | -8.661 | 0.000 | -0.006 | -0.004 |
| Longitude | -0.0099 | 0.003 | -3.499 | 0.000 | -0.015 | -0.004 |
| Omnibus: | 4203.765 | Durbin-Watson: | 0.791 |
| --- | --- | --- | --- |
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 10072.811 |
| Skew: | 1.140 | Prob(JB): | 0.00 |
| Kurtosis: | 5.552 | Cond. No. | 1.10e+05 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.1e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
That is,

$$y = -1.4014 + 0.4333\,\mathrm{MedInc} + 0.0185\,\mathrm{HouseAge} + 0.0381\,\mathrm{AveBedrms} + 3.801\times 10^{-5}\,\mathrm{Population} - 0.0047\,\mathrm{AveOccup} - 0.0099\,\mathrm{Longitude}$$
Taking MedInc as an example: a one-unit increase in MedInc is associated with a 0.4333 increase in MedHouseVal, holding the other predictors fixed. The fit itself is mediocre, with an $R^2$ of only 0.51; some data preprocessing could improve it. That will be the topic of the next post.
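For completeness, a short sketch of how this fitted model is used for prediction; here it simply predicts on the first few training rows rather than on new data:

```python
# model.predict expects the same design-matrix layout used for fitting,
# including the constant column added by sm.add_constant
preds = model.predict(X.head())
print(preds)
print(y.head())  # compare against the actual MedHouseVal values
```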