Linear Regression: Multiple Linear Regression
Last time we covered simple linear regression. This post looks at multiple linear regression: how to check for multicollinearity, and how to do feature selection.
Multiple linear regression, as the name suggests, fits a response with multiple predictors. Its use cases are the same as simple linear regression's. One is explaining the relationship between the response and the predictors: for example, sales may be related to incoming traffic, SKU breadth, price, discounts, and so on, and if you want to know which of these variables affects sales most significantly, the coefficients of a multiple regression can tell you. The other is using multiple predictors to predict the value of the response. This post focuses on the first scenario; technically speaking, the two differ very little anyway.
```python
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.datasets import fetch_california_housing

# Load the California housing dataset as a pandas DataFrame
data_set = fetch_california_housing(as_frame=True)
data = data_set.frame
data.describe()
```
| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 |
| mean | 3.870671 | 28.639486 | 5.429000 | 1.096675 | 1425.476744 | 3.070655 | 35.631861 | -119.569704 | 2.068558 |
| std | 1.899822 | 12.585558 | 2.474173 | 0.473911 | 1132.462122 | 10.386050 | 2.135952 | 2.003532 | 1.153956 |
| min | 0.499900 | 1.000000 | 0.846154 | 0.333333 | 3.000000 | 0.692308 | 32.540000 | -124.350000 | 0.149990 |
| 25% | 2.563400 | 18.000000 | 4.440716 | 1.006079 | 787.000000 | 2.429741 | 33.930000 | -121.800000 | 1.196000 |
| 50% | 3.534800 | 29.000000 | 5.229129 | 1.048780 | 1166.000000 | 2.818116 | 34.260000 | -118.490000 | 1.797000 |
| 75% | 4.743250 | 37.000000 | 6.052381 | 1.099526 | 1725.000000 | 3.282261 | 37.710000 | -118.010000 | 2.647250 |
| max | 15.000100 | 52.000000 | 141.909091 | 34.066667 | 35682.000000 | 1243.333333 | 41.950000 | -114.310000 | 5.000010 |
MedHouseVal is the response we want to predict. There are eight predictors: MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, and Longitude.

Let's start by checking for multicollinearity.
```python
# Pairwise correlations between the eight predictors
X = data[['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup',
          'Latitude', 'Longitude']]
X.corr()
```
| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MedInc | 1.000000 | -0.119034 | 0.326895 | -0.062040 | 0.004834 | 0.018766 | -0.079809 | -0.015176 |
| HouseAge | -0.119034 | 1.000000 | -0.153277 | -0.077747 | -0.296244 | 0.013191 | 0.011173 | -0.108197 |
| AveRooms | 0.326895 | -0.153277 | 1.000000 | 0.847621 | -0.072213 | -0.004852 | 0.106389 | -0.027540 |
| AveBedrms | -0.062040 | -0.077747 | 0.847621 | 1.000000 | -0.066197 | -0.006181 | 0.069721 | 0.013344 |
| Population | 0.004834 | -0.296244 | -0.072213 | -0.066197 | 1.000000 | 0.069863 | -0.108785 | 0.099773 |
| AveOccup | 0.018766 | 0.013191 | -0.004852 | -0.006181 | 0.069863 | 1.000000 | 0.002366 | 0.002476 |
| Latitude | -0.079809 | 0.011173 | 0.106389 | 0.069721 | -0.108785 | 0.002366 | 1.000000 | -0.924664 |
| Longitude | -0.015176 | -0.108197 | -0.027540 | 0.013344 | 0.099773 | 0.002476 | -0.924664 | 1.000000 |
Let's visualize the collinearity with a heatmap. As a rule of thumb, when the absolute value of the correlation coefficient exceeds 0.6, we consider the two variables strongly correlated.
```python
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(15, 8))
# Mask the upper triangle so each pair of predictors is shown only once
mask = np.triu(np.ones_like(X.corr(), dtype=bool))
sns.heatmap(X.corr(), annot=True, mask=mask, vmin=-1, vmax=1)
plt.title('Correlation Coefficient Of Predictors')
plt.show()
```
The heatmap clearly shows two pairs of strongly correlated predictors:

- AveBedrms and AveRooms
- Longitude and Latitude

In multiple linear regression, multicollinearity among the predictors makes it hard to judge how much each variable contributes to the response. If the model is used purely for prediction, however, multicollinearity can largely be ignored.
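To see why, here is a minimal sketch with simulated data (an illustration, not part of the housing analysis): two nearly collinear predictors are fit on two bootstrap resamples, and the individual coefficients swing noticeably between fits even though the overall fit quality barely changes.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)  # x2 is nearly a copy of x1
y = 1.0 + 2.0 * x1 + rng.normal(size=n)   # the true model only involves x1

for seed in (1, 2):
    # Bootstrap resample the rows and refit
    idx = np.random.default_rng(seed).integers(0, n, size=n)
    X_sim = sm.add_constant(np.column_stack([x1[idx], x2[idx]]))
    res = sm.OLS(y[idx], X_sim).fit()
    print(res.params.round(2), 'R^2 =', round(res.rsquared, 3))
# The x1/x2 coefficients vary a lot across resamples, but R^2 stays stable:
# the model predicts well even though the individual coefficients are unreliable.
```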
So the question becomes: when predictors are correlated, which one should be dropped? We can compute the VIF (variance inflation factor) of each predictor and remove them one at a time:

$$VIF = \frac{1}{1 - R_i^2}$$

where $R_i^2$ is the $R^2$ obtained by regressing the $i$-th predictor on all the other predictors. The usual interpretation (a hand computation is sketched after the table below):
| VIF | Description |
| --- | --- |
| 1 | No collinearity |
| < 5 | Weak collinearity |
| >= 5 | Strong collinearity |
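To make the formula concrete, here is a minimal sketch that computes the VIF of AveRooms by hand on the X defined above: run the auxiliary regression of AveRooms on the remaining predictors, take its $R_i^2$, and plug it into the formula. The result should match the variance_inflation_factor output used below.

```python
# Auxiliary regression: AveRooms on the other predictors (with a constant)
others = [c for c in X.columns if c != 'AveRooms']
aux = sm.OLS(X['AveRooms'], sm.add_constant(X[others])).fit()

# VIF = 1 / (1 - R_i^2)
vif_ave_rooms = 1.0 / (1.0 - aux.rsquared)
print(vif_ave_rooms)  # ~8.34, matching the VIF table below
```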
```python
from statsmodels.stats.outliers_influence import variance_inflation_factor

def compute_vif(considered_features):
    X = data[considered_features]
    # the calculation of variance inflation requires a constant
    X = sm.add_constant(X)
    # create a dataframe holding the VIF of each feature (skipping the constant)
    vif = pd.DataFrame({
        'variables': X.columns[1:],
        'VIF': [variance_inflation_factor(X.values, i + 1)
                for i in range(len(X.columns) - 1)]
    })
    return vif
```
```python
considered_features = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population',
                       'AveOccup', 'Latitude', 'Longitude']
compute_vif(considered_features).sort_values('VIF', ascending=False)
```
| | variables | VIF |
| --- | --- | --- |
| 6 | Latitude | 9.297624 |
| 7 | Longitude | 8.962263 |
| 2 | AveRooms | 8.342786 |
| 3 | AveBedrms | 6.994995 |
| 0 | MedInc | 2.501295 |
| 1 | HouseAge | 1.241254 |
| 4 | Population | 1.138125 |
| 5 | AveOccup | 1.008324 |
Latitude has the highest VIF, so drop it first and recompute the VIFs:
```python
considered_features.remove('Latitude')
compute_vif(considered_features).sort_values('VIF', ascending=False)
```
| | variables | VIF |
| --- | --- | --- |
| 2 | AveRooms | 7.513381 |
| 3 | AveBedrms | 6.644835 |
| 0 | MedInc | 2.072289 |
| 1 | HouseAge | 1.158398 |
| 4 | Population | 1.126271 |
| 6 | Longitude | 1.027751 |
| 5 | AveOccup | 1.006831 |
AveRooms is still above 5, so drop it as well and recompute the VIFs:
```python
considered_features.remove('AveRooms')
compute_vif(considered_features).sort_values('VIF', ascending=False)
```
| | variables | VIF |
| --- | --- | --- |
| 1 | HouseAge | 1.136532 |
| 3 | Population | 1.120021 |
| 0 | MedInc | 1.022435 |
| 2 | AveBedrms | 1.020911 |
| 5 | Longitude | 1.017817 |
| 4 | AveOccup | 1.006728 |
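All VIFs are now below 5. The drop-and-recompute procedure above can also be automated; here is a minimal sketch built on the compute_vif helper defined earlier (the threshold of 5 follows the table above):

```python
def drop_high_vif(features, threshold=5.0):
    """Iteratively drop the feature with the highest VIF until all VIFs < threshold."""
    features = list(features)
    while True:
        vif = compute_vif(features).sort_values('VIF', ascending=False)
        if vif['VIF'].iloc[0] < threshold:
            return features
        features.remove(vif['variables'].iloc[0])

selected = drop_high_vif(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population',
                          'AveOccup', 'Latitude', 'Longitude'])
print(selected)  # ends with the same six features as the manual steps above
```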
Now fit a multiple linear regression on the remaining six predictors and look at the results:
```python
X = data[considered_features]
X = sm.add_constant(X)
y = data['MedHouseVal']
model = sm.OLS(y, X).fit()
model.summary()
```
| Dep. Variable: | MedHouseVal | R-squared: | 0.512 |
| --- | --- | --- | --- |
| Model: | OLS | Adj. R-squared: | 0.512 |
| Method: | Least Squares | F-statistic: | 3612. |
| Date: | Sun, 03 Jul 2022 | Prob (F-statistic): | 0.00 |
| Time: | 14:42:43 | Log-Likelihood: | -24832. |
| No. Observations: | 20640 | AIC: | 4.968e+04 |
| Df Residuals: | 20633 | BIC: | 4.973e+04 |
| Df Model: | 6 | | |
| Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
| --- | --- | --- | --- | --- | --- | --- |
| const | -1.4014 | 0.338 | -4.144 | 0.000 | -2.064 | -0.739 |
| MedInc | 0.4333 | 0.003 | 145.113 | 0.000 | 0.427 | 0.439 |
| HouseAge | 0.0185 | 0.000 | 38.878 | 0.000 | 0.018 | 0.019 |
| AveBedrms | 0.0381 | 0.012 | 3.188 | 0.001 | 0.015 | 0.062 |
| Population | 3.801e-05 | 5.24e-06 | 7.249 | 0.000 | 2.77e-05 | 4.83e-05 |
| AveOccup | -0.0047 | 0.001 | -8.661 | 0.000 | -0.006 | -0.004 |
| Longitude | -0.0099 | 0.003 | -3.499 | 0.000 | -0.015 | -0.004 |
| Omnibus: | 4203.765 | Durbin-Watson: | 0.791 |
| --- | --- | --- | --- |
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 10072.811 |
| Skew: | 1.140 | Prob(JB): | 0.00 |
| Kurtosis: | 5.552 | Cond. No. | 1.10e+05 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.1e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
That is,

$$y = -1.4014 + 0.4333\,\mathrm{MedInc} + 0.0185\,\mathrm{HouseAge} + 0.0381\,\mathrm{AveBedrms} + 3.801\times 10^{-5}\,\mathrm{Population} - 0.0047\,\mathrm{AveOccup} - 0.0099\,\mathrm{Longitude}$$
Taking MedInc as an example: a one-unit increase in MedInc is associated with a 0.4333 increase in MedHouseVal, holding the other predictors fixed. The fit itself is mediocre, with an $R^2$ of only 0.51; some data preprocessing could improve it. That will be the topic of the next post.
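For completeness, a short sketch of how this fitted model is used for prediction; here it simply predicts on the first few training rows rather than on new data:

```python
# model.predict expects the same design-matrix layout used for fitting,
# including the constant column added by sm.add_constant
preds = model.predict(X.head())
print(preds)
print(y.head())  # compare against the actual MedHouseVal values
```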