Multiple Linear Regression Practice: Predicting House Prices

Latest recommended article published on 2024-05-10 13:39:50

NeverthelessEnd


Article link: https://blog.csdn.net/sinat_41177607/article/details/106661109


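Goal:

Review the feature descriptions for the dataset, then use the other variables in the dataset to build the best model for predicting the median home price.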

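Dataset description:

The dataset contains 506 cases in total.

Each case has 14 attributes (MedianHomePrice is the name the code below gives to MEDV, so those two rows describe the same column):

Feature	Description
MedianHomePrice	Median home price (the prediction target)
CRIM	Per capita crime rate by town
ZN	Proportion of residential land zoned for lots over 25,000 sq. ft.
INDUS	Proportion of non-retail business acres per town
CHAS	Charles River dummy variable (1 if the tract bounds the river; 0 otherwise)
NOX	Nitric oxides concentration (parts per 10 million)
RM	Average number of rooms per dwelling
AGE	Proportion of owner-occupied units built before 1940
DIS	Weighted distances to five Boston employment centres
RAD	Index of accessibility to radial highways
TAX	Full-value property-tax rate per $10,000
PTRATIO	Pupil-teacher ratio by town
B	1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
LSTAT	Percentage of lower-status population
MEDV	Median value of owner-occupied homes (in $1000s)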

Set up the libraries and data.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score
from patsy import dmatrices
import matplotlib.pyplot as plt
%matplotlib inline

np.random.seed(42)

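# Load the built-in dataset (for reference only). Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this cell needs an older scikit-learn.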
boston_data = load_boston()
df = pd.DataFrame()
df['MedianHomePrice'] = boston_data.target
df2 = pd.DataFrame(boston_data.data)
df2.columns = boston_data.feature_names
df = df.join(df2)
df.head()

	MedianHomePrice	CRIM	ZN	INDUS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT
0	24.0	0.00632	18.0	2.31	0.538	6.575	65.2	4.0900	1.0	296.0	15.3	396.90	4.98
1	21.6	0.02731	0.0	7.07	0.469	6.421	78.9	4.9671	2.0	242.0	17.8	396.90	9.14
2	34.7	0.02729	0.0	7.07	0.469	7.185	61.1	4.9671	2.0	242.0	17.8	392.83	4.03
3	33.4	0.03237	0.0	2.18	0.458	6.998	45.8	6.0622	3.0	222.0	18.7	394.63	2.94
4	36.2	0.06905	0.0	2.18	0.458	7.147	54.2	6.0622	3.0	222.0	18.7	396.90	5.33

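1. Get a summary of each feature in the dataset

Use the corr method to compute the pairwise correlations between variables and check for multicollinearity.

# Draw a heatmap
import seaborn as sns
plt.subplots(figsize=(10,10))  # set the figure size
sns.heatmap(df.corr(), annot = True, vmax = 1, square = True, cmap='RdPu')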

[Figure: heatmap of the correlation matrix df.corr()]
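
Reading strongly correlated pairs off a heatmap by eye is error-prone, so a short helper that lists them can be handy. The following is only a sketch under assumptions: the function name high_corr_pairs, the 0.7 threshold, and the synthetic columns a, b, c are illustrative and are not part of the original notebook.

```python
import numpy as np

def high_corr_pairs(data, names, threshold=0.7):
    """Return (name_a, name_b, r) for column pairs with |r| >= threshold."""
    corr = np.corrcoef(data, rowvar=False)  # columns are the variables
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    # strongest relationships first
    return sorted(pairs, key=lambda p: -abs(p[2]))

# Synthetic stand-in for the housing features (not the real data)
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + 0.1 * rng.normal(size=200)   # nearly a copy of a
c = rng.normal(size=200)             # unrelated to a and b
demo = np.column_stack([a, b, c])

pairs = high_corr_pairs(demo, ["a", "b", "c"])
print(pairs)
```

With the real data, one could pass df.values and list(df.columns) instead of the synthetic array.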

2. Split the dataset

Create a training set and a test set, with 20% of the data in the test set. Store the results in X_train, X_test, y_train, y_test.

X = df.drop('MedianHomePrice' , axis=1, inplace=False)
y = df['MedianHomePrice']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42 )
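
As a quick sanity check on the split sizes, the arithmetic can be sketched as follows. The assumption here is that the test count is rounded up and the remainder goes to training, which is consistent with the 404 observations reported in the regression summaries in this article.

```python
import math

n_samples = 506      # number of cases stated in the dataset description
test_size = 0.2

# Assumption: the test count is rounded up and the rest goes to training
n_test = math.ceil(n_samples * test_size)
n_train = n_samples - n_test
print(n_train, n_test)
```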

3. Standardization

Use StandardScaler to scale all x variables in the dataset. Store the result in X_scaled_train.

# Reset y_train's index to start at 0; its original index does not match the index of training_data below, so the column assignment would misalign the values
y_train = pd.Series(y_train.values)

# Use StandardScaler to scale all x variables and store the result in X_scaled_train.
scaler = StandardScaler()
X_scaled_train = scaler.fit_transform(X_train)

# Create a pandas DataFrame holding the scaled x variables together with y_train. Name it training_data.
training_data = pd.DataFrame(X_scaled_train, columns = X_train.columns)

training_data['MedianHomePrice'] = y_train
training_data.head()

	CRIM	ZN	INDUS	CHAS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT	MedianHomePrice
0	1.287702	-0.500320	1.033237	-0.278089	0.489252	-1.428069	1.028015	-0.802173	1.706891	1.578434	0.845343	-0.074337	1.753505	12.0
1	-0.336384	-0.500320	-0.413160	-0.278089	-0.157233	-0.680087	-0.431199	0.324349	-0.624360	-0.584648	1.204741	0.430184	-0.561474	19.9
2	-0.403253	1.013271	-0.715218	-0.278089	-1.008723	-0.402063	-1.618599	1.330697	-0.974048	-0.602724	-0.637176	0.065297	-0.651595	19.4
3	0.388230	-0.500320	1.033237	-0.278089	0.489252	-0.300450	0.591681	-0.839240	1.706891	1.578434	0.845343	-3.868193	1.525387	13.4
4	-0.325282	-0.500320	-0.413160	-0.278089	-0.157233	-0.831094	0.033747	-0.005494	-0.624360	-0.584648	1.204741	0.379119	-0.165787	18.2
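
To make the scaling step concrete, here is a minimal NumPy sketch of the same idea: subtract each column's training mean and divide by its training standard deviation (population form, ddof=0). The helper names fit_standardizer and apply_standardizer and the synthetic arrays are assumptions for illustration only. The sketch also shows reusing the training statistics on held-out data, which is the usual way to avoid leaking test information into the scaling.

```python
import numpy as np

def fit_standardizer(train):
    """Return per-column mean and population std (ddof=0) of the training data."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std = np.where(std == 0, 1.0, std)   # leave constant columns unscaled
    return mean, std

def apply_standardizer(x, mean, std):
    """Scale x with statistics that were computed on the training data."""
    return (x - mean) / std

# Synthetic stand-in data (not the housing data)
rng = np.random.default_rng(1)
train = rng.normal(loc=50, scale=10, size=(100, 3))
test = rng.normal(loc=50, scale=10, size=(20, 3))

mean, std = fit_standardizer(train)
scaled_train = apply_standardizer(train, mean, std)
scaled_test = apply_standardizer(test, mean, std)   # reuses the training statistics
print(scaled_train.mean(axis=0).round(6), scaled_train.std(axis=0).round(6))
```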

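4. Model 1: all features

Fit a linear model on the training set training_data and check the p-values to judge significance.

# Fit a linear model with all scaled features to predict the response (median home price). Don't forget to add an intercept.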
training_data['intercept'] = 1
X_train1= training_data.drop('MedianHomePrice' , axis=1, inplace=False)
lm = sm.OLS(training_data['MedianHomePrice'], X_train1)
result = lm.fit()
result.summary()

OLS Regression Results
Dep. Variable:	MedianHomePrice	R-squared:	0.751
Model:	OLS	Adj. R-squared:	0.743
Method:	Least Squares	F-statistic:	90.43
Date:	Sun, 10 May 2020	Prob (F-statistic):	6.21e-109
Time:	20:22:27	Log-Likelihood:	-1194.3
No. Observations:	404	AIC:	2417.
Df Residuals:	390	BIC:	2473.
Df Model:	13
Covariance Type:	nonrobust

	coef	std err	t	P>|t|	[0.025	0.975]
CRIM	-1.0021	0.308	-3.250	0.001	-1.608	-0.396
ZN	0.6963	0.370	1.882	0.061	-0.031	1.423
INDUS	0.2781	0.464	0.599	0.549	-0.634	1.190
CHAS	0.7187	0.247	2.914	0.004	0.234	1.204
NOX	-2.0223	0.498	-4.061	0.000	-3.001	-1.043
RM	3.1452	0.329	9.567	0.000	2.499	3.792
AGE	-0.1760	0.407	-0.432	0.666	-0.977	0.625
DIS	-3.0819	0.481	-6.408	0.000	-4.027	-2.136
RAD	2.2514	0.652	3.454	0.001	0.970	3.533
TAX	-1.7670	0.704	-2.508	0.013	-3.152	-0.382
PTRATIO	-2.0378	0.321	-6.357	0.000	-2.668	-1.408
B	1.1296	0.271	4.166	0.000	0.596	1.663
LSTAT	-3.6117	0.395	-9.133	0.000	-4.389	-2.834
intercept	22.7965	0.236	96.774	0.000	22.333	23.260

Omnibus:	133.052	Durbin-Watson:	2.114
Prob(Omnibus):	0.000	Jarque-Bera (JB):	579.817
Skew:	1.379	Prob(JB):	1.24e-126
Kurtosis:	8.181	Cond. No.	9.74

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
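
One detail worth noticing in a fit on standardized features: because every scaled column has mean zero, the least-squares intercept equals the mean of the response. That is consistent with the intercept staying at 22.7965 in every summary in this article. The sketch below checks the property on synthetic data; the data and variable names are assumptions, and np.linalg.lstsq stands in for sm.OLS.

```python
import numpy as np

# Synthetic data (not the housing data)
rng = np.random.default_rng(2)
n = 300
X = rng.normal(loc=[5.0, -2.0], scale=[3.0, 0.5], size=(n, 2))
y = 20.0 + 1.5 * X[:, 0] - 4.0 * X[:, 1] + rng.normal(scale=2.0, size=n)

# Standardize the features, then append a column of ones as the intercept
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
design = np.column_stack([X_scaled, np.ones(n)])

# Ordinary least squares via NumPy's least-squares solver
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
intercept = coef[-1]
print(intercept, y.mean())
```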

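5. Check whether the explanatory variables are correlated:

Compute the VIFs on the training set.

# Compute the VIF of each x variable in the dataset
def vif_calculator(df, response):
    '''
    INPUT:
    df - dataset containing both x and y
    response - column name of the response variable (string)
    OUTPUT:
    vif - a dataframe of the vifs
    '''
    df2 = df.drop(response, axis = 1, inplace=False)  # drop the response column
    features = "+".join(df2.columns)
    y, X = dmatrices(response + ' ~' + features, df, return_type='dataframe')
    vif = pd.DataFrame()
    vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    vif["features"] = X.columns
    vif = vif.round(1)
    return vif

# Note: training_data still holds the constant 'intercept' column added for Model 1, and dmatrices
# adds its own 'Intercept'. The output therefore lists two constant columns with a VIF of 0.0,
# and statsmodels emits a divide-by-zero RuntimeWarning.
vif = vif_calculator(training_data, 'MedianHomePrice')
vif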

C:\ProgramData\Anaconda3\lib\site-packages\statsmodels\regression\linear_model.py:1685: RuntimeWarning: divide by zero encountered in double_scalars
  return 1 - self.ssr/self.centered_tss

	VIF Factor	features
0	0.0	Intercept
1	1.7	CRIM
2	2.5	ZN
3	3.9	INDUS
4	1.1	CHAS
5	4.5	NOX
6	1.9	RM
7	3.0	AGE
8	4.2	DIS
9	7.7	RAD
10	8.9	TAX
11	1.9	PTRATIO
12	1.3	B
13	2.8	LSTAT
14	0.0	intercept
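
For reference, the VIF of feature j is 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all the other features. The sketch below computes it directly with NumPy on synthetic data; the function name vif_values and the columns x1, x2, x3 are assumptions for illustration, not the statsmodels implementation.

```python
import numpy as np

def vif_values(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the others."""
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]
        # remaining columns plus a column of ones for the intercept
        others = np.column_stack([np.delete(X, j, axis=1), np.ones(n)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic features (not the housing data)
rng = np.random.default_rng(3)
x1 = rng.normal(size=500)
x2 = x1 + 0.3 * rng.normal(size=500)   # strongly related to x1
x3 = rng.normal(size=500)              # independent of both
vifs = vif_values(np.column_stack([x1, x2, x3]))
print(vifs.round(1))
```

The two related columns get large VIFs while the independent one stays close to 1, which is the pattern to look for in the table above.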

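Combine the VIFs, the correlations, and the p-values to decide which variables to drop:

Keep VIFs at or below 4. RAD (7.7), TAX (8.9), and NOX (4.5) are clearly above this limit, DIS (4.2) is slightly above it, and INDUS (3.9) sits right at the edge.

TAX and RAD are strongly correlated, and so are INDUS and NOX. Within each highly correlated pair, dropping one variable is enough to noticeably reduce the VIF of the other.

Keep p-values at or below 0.05. AGE (0.666) and INDUS (0.549) have large p-values.

Given the p-values and VIFs, if we choose to keep RAD and INDUS, we drop AGE, NOX, and TAX, and then fit a new linear model on the remaining features.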

6. Model 2: drop AGE, NOX, and TAX

X_train1 = training_data.drop(['AGE','NOX','TAX','MedianHomePrice'] , axis=1, inplace=False)
lm1 = sm.OLS(training_data['MedianHomePrice'], X_train1)
result1 = lm1.fit()
result1.summary()

OLS Regression Results
Dep. Variable:	MedianHomePrice	R-squared:	0.733
Model:	OLS	Adj. R-squared:	0.727
Method:	Least Squares	F-statistic:	108.1
Date:	Sun, 10 May 2020	Prob (F-statistic):	2.77e-106
Time:	21:02:41	Log-Likelihood:	-1208.0
No. Observations:	404	AIC:	2438.
Df Residuals:	393	BIC:	2482.
Df Model:	10
Covariance Type:	nonrobust

	coef	std err	t	P>|t|	[0.025	0.975]
CRIM	-0.9116	0.317	-2.876	0.004	-1.535	-0.289
ZN	0.5622	0.363	1.548	0.123	-0.152	1.276
INDUS	-0.8746	0.411	-2.128	0.034	-1.683	-0.067
CHAS	0.6896	0.252	2.738	0.006	0.194	1.185
RM	3.2406	0.330	9.818	0.000	2.592	3.889
DIS	-2.1728	0.434	-5.010	0.000	-3.025	-1.320
RAD	0.4380	0.389	1.126	0.261	-0.327	1.202
PTRATIO	-1.6369	0.310	-5.288	0.000	-2.246	-1.028
B	1.2106	0.279	4.345	0.000	0.663	1.758
LSTAT	-3.9851	0.381	-10.470	0.000	-4.733	-3.237
intercept	22.7965	0.243	93.916	0.000	22.319	23.274

Omnibus:	126.568	Durbin-Watson:	2.033
Prob(Omnibus):	0.000	Jarque-Bera (JB):	542.197
Skew:	1.310	Prob(JB):	1.83e-118
Kurtosis:	8.034	Cond. No.	4.66

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

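Based on the p-values, RAD (p = 0.261) should be dropped and the other variables kept. ZN (p = 0.123) is also above 0.05 but is retained here.

7. Model 3: drop AGE, NOX, TAX, and RAD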

X_train2 = training_data.drop(['AGE','NOX','TAX','RAD', 'MedianHomePrice'] , axis=1, inplace=False)
lm2 = sm.OLS(training_data['MedianHomePrice'], X_train2)
result2 = lm2.fit()
result2.summary()

OLS Regression Results
Dep. Variable:	MedianHomePrice	R-squared:	0.733
Model:	OLS	Adj. R-squared:	0.726
Method:	Least Squares	F-statistic:	119.9
Date:	Sun, 10 May 2020	Prob (F-statistic):	4.60e-107
Time:	21:02:09	Log-Likelihood:	-1208.6
No. Observations:	404	AIC:	2437.
Df Residuals:	394	BIC:	2477.
Df Model:	9
Covariance Type:	nonrobust

	coef	std err	t	P>|t|	[0.025	0.975]
CRIM	-0.7616	0.288	-2.647	0.008	-1.327	-0.196
ZN	0.6151	0.360	1.707	0.089	-0.093	1.323
INDUS	-0.7544	0.397	-1.900	0.058	-1.535	0.026
CHAS	0.7067	0.252	2.810	0.005	0.212	1.201
RM	3.3022	0.326	10.142	0.000	2.662	3.942
DIS	-2.2235	0.432	-5.153	0.000	-3.072	-1.375
PTRATIO	-1.5090	0.288	-5.239	0.000	-2.075	-0.943
B	1.1502	0.273	4.206	0.000	0.613	1.688
LSTAT	-3.9413	0.379	-10.406	0.000	-4.686	-3.197
intercept	22.7965	0.243	93.884	0.000	22.319	23.274

Omnibus:	134.948	Durbin-Watson:	2.028
Prob(Omnibus):	0.000	Jarque-Bera (JB):	619.161
Skew:	1.381	Prob(JB):	3.56e-135
Kurtosis:	8.399	Cond. No.	4.36

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

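Double-check that all VIFs are below 4. Compared with the previous model, R-squared is unchanged (0.733), while adjusted R-squared moves from 0.727 to 0.726.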

training_data2 = training_data.drop(['AGE','NOX','TAX','RAD'] , axis=1, inplace=False)
vif = vif_calculator(training_data2, 'MedianHomePrice')
vif

	VIF Factor	features
0	0.0	Intercept
1	1.4	CRIM
2	2.2	ZN
3	2.7	INDUS
4	1.1	CHAS
5	1.8	RM
6	3.2	DIS
7	1.4	PTRATIO
8	1.3	B
9	2.4	LSTAT
10	0.0	intercept
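
When comparing models with different numbers of features, adjusted R-squared is often the more useful figure because it penalizes extra predictors. The sketch below applies the standard formula to the rounded values printed in the Model 1 summary (R-squared 0.751, 404 observations, 13 features) and reproduces the reported 0.743. Because the inputs are rounded, the last digit may differ for the other models. The function name adjusted_r2 is an assumption for illustration.

```python
def adjusted_r2(r2, n_obs, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_features - 1)

# Model 1 summary: R-squared 0.751, 404 observations, 13 features
adj = adjusted_r2(0.751, 404, 13)
print(round(adj, 3))
```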

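8. Model evaluation

Score each model on how well its test-set predictions match the actual test values (score returns R-squared).

# Model with all variables
lm_full = LinearRegression()
lm_full.fit(X_train, y_train)
lm_full.score(X_test, y_test)  # R-squared on the test set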

0.66848257539715972
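
The score here is the coefficient of determination, R^2 = 1 - SS_res / SS_tot. A minimal pure-Python sketch of that formula follows; the function name r2_score_manual and the toy numbers are assumptions for illustration.

```python
def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]
perfect = r2_score_manual(y_true, y_true)        # predictions match exactly
baseline = r2_score_manual(y_true, [6.0] * 4)    # always predict the mean
print(perfect, baseline)
```

A perfect prediction scores 1, and always predicting the mean scores 0, so a test score of about 0.67 sits between those two reference points.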

# Drop AGE, NOX, TAX
X_train_red = X_train.drop(['AGE','NOX','TAX'] , axis=1, inplace=False)
X_test_red = X_test.drop(['AGE','NOX','TAX'] , axis=1, inplace=False)

# Drop AGE, NOX, TAX, RAD
X_train_red2 = X_train.drop(['AGE','NOX','TAX','RAD'] , axis=1, inplace=False)
X_test_red2 = X_test.drop(['AGE','NOX','TAX','RAD'] , axis=1, inplace=False)

lm_red = LinearRegression()  # model without AGE, NOX, TAX
lm_red.fit(X_train_red, y_train)
print(lm_red.score(X_test_red, y_test))  # R-squared on the test set

lm_red2 = LinearRegression()  # model without AGE, NOX, TAX, RAD
lm_red2.fit(X_train_red2, y_train)
print(lm_red2.score(X_test_red2, y_test))  # R-squared on the test set

0.639421781821
0.63441065636

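Judging by these scores, the model with all variables performs best on this test set. As a follow-up, cross-validation (repeating this procedure over multiple training and test splits) can be used to check whether the model's performance is stable.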
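
The cross-validation idea mentioned above can be sketched as follows. This is a hand-rolled k-fold loop using NumPy least squares on synthetic data, not the procedure used in this article; the function name kfold_r2, the fold count, and the data are assumptions for illustration.

```python
import numpy as np

def kfold_r2(X, y, k=5, seed=0):
    """Return the test R^2 of a least-squares fit on each of k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # add a column of ones so the fit includes an intercept
        A_train = np.column_stack([X[train_idx], np.ones(len(train_idx))])
        A_test = np.column_stack([X[test_idx], np.ones(len(test_idx))])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        resid = y[test_idx] - A_test @ coef
        ss_tot = np.sum((y[test_idx] - y[test_idx].mean()) ** 2)
        scores.append(1.0 - resid @ resid / ss_tot)
    return np.array(scores)

# Synthetic regression problem (not the housing data)
rng = np.random.default_rng(4)
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 + X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)
scores = kfold_r2(X_demo, y_demo, k=5)
print(scores.mean(), scores.std())
```

A small spread across folds suggests the score is stable. Running the same loop for the full and the reduced feature sets would show whether the ranking seen on the single test split holds up.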
