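This post uses pandas to explore the Ames house price dataset, then fits a linear regression on the ten features most strongly correlated with sale price in order to predict prices.
1. Import the dataset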
import pandas as pd
train_data=pd.read_csv('./train.csv')
train_data
| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1455 | 1456 | 60 | RL | 62.0 | 7917 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 8 | 2007 | WD | Normal | 175000 |
1456 | 1457 | 20 | RL | 85.0 | 13175 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | MnPrv | NaN | 0 | 2 | 2010 | WD | Normal | 210000 |
1457 | 1458 | 70 | RL | 66.0 | 9042 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | GdPrv | Shed | 2500 | 5 | 2010 | WD | Normal | 266500 |
1458 | 1459 | 20 | RL | 68.0 | 9717 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 4 | 2010 | WD | Normal | 142125 |
1459 | 1460 | 20 | RL | 75.0 | 9937 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 6 | 2008 | WD | Normal | 147500 |
1460 rows × 81 columns
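Since `Id` is only a row identifier, one option is to load it as the index so that it does not later appear among the numeric features. A minimal sketch, using a small in-memory CSV in place of `./train.csv` (the `sample_csv` text and the `sample` name are made up for illustration):

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for train.csv (values taken from the preview above).
sample_csv = io.StringIO(
    "Id,MSSubClass,LotArea,SalePrice\n"
    "1,60,8450,208500\n"
    "2,20,9600,181500\n"
    "3,60,11250,223500\n"
)

# index_col="Id" makes Id the row index instead of a data column.
sample = pd.read_csv(sample_csv, index_col="Id")

print(sample.shape)
print(list(sample.columns))
```

The rest of this post keeps `Id` as an ordinary column, which is why it shows up in the correlation table further down.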
2. Inspect the basic properties of the data
train_data.describe()
| | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SalePrice |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1460.000000 | 1460.000000 | 1201.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1452.000000 | 1460.000000 | ... | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 |
mean | 730.500000 | 56.897260 | 70.049958 | 10516.828082 | 6.099315 | 5.575342 | 1971.267808 | 1984.865753 | 103.685262 | 443.639726 | ... | 94.244521 | 46.660274 | 21.954110 | 3.409589 | 15.060959 | 2.758904 | 43.489041 | 6.321918 | 2007.815753 | 180921.195890 |
std | 421.610009 | 42.300571 | 24.284752 | 9981.264932 | 1.382997 | 1.112799 | 30.202904 | 20.645407 | 181.066207 | 456.098091 | ... | 125.338794 | 66.256028 | 61.119149 | 29.317331 | 55.757415 | 40.177307 | 496.123024 | 2.703626 | 1.328095 | 79442.502883 |
min | 1.000000 | 20.000000 | 21.000000 | 1300.000000 | 1.000000 | 1.000000 | 1872.000000 | 1950.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 2006.000000 | 34900.000000 |
25% | 365.750000 | 20.000000 | 59.000000 | 7553.500000 | 5.000000 | 5.000000 | 1954.000000 | 1967.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 5.000000 | 2007.000000 | 129975.000000 |
50% | 730.500000 | 50.000000 | 69.000000 | 9478.500000 | 6.000000 | 5.000000 | 1973.000000 | 1994.000000 | 0.000000 | 383.500000 | ... | 0.000000 | 25.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 6.000000 | 2008.000000 | 163000.000000 |
75% | 1095.250000 | 70.000000 | 80.000000 | 11601.500000 | 7.000000 | 6.000000 | 2000.000000 | 2004.000000 | 166.000000 | 712.250000 | ... | 168.000000 | 68.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 8.000000 | 2009.000000 | 214000.000000 |
max | 1460.000000 | 190.000000 | 313.000000 | 215245.000000 | 10.000000 | 9.000000 | 2010.000000 | 2010.000000 | 1600.000000 | 5644.000000 | ... | 857.000000 | 547.000000 | 552.000000 | 508.000000 | 480.000000 | 738.000000 | 15500.000000 | 12.000000 | 2010.000000 | 755000.000000 |
8 rows × 38 columns
train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id 1460 non-null int64
MSSubClass 1460 non-null int64
MSZoning 1460 non-null object
LotFrontage 1201 non-null float64
LotArea 1460 non-null int64
Street 1460 non-null object
Alley 91 non-null object
LotShape 1460 non-null object
LandContour 1460 non-null object
Utilities 1460 non-null object
LotConfig 1460 non-null object
LandSlope 1460 non-null object
Neighborhood 1460 non-null object
Condition1 1460 non-null object
Condition2 1460 non-null object
BldgType 1460 non-null object
HouseStyle 1460 non-null object
OverallQual 1460 non-null int64
OverallCond 1460 non-null int64
YearBuilt 1460 non-null int64
YearRemodAdd 1460 non-null int64
RoofStyle 1460 non-null object
RoofMatl 1460 non-null object
Exterior1st 1460 non-null object
Exterior2nd 1460 non-null object
MasVnrType 1452 non-null object
MasVnrArea 1452 non-null float64
ExterQual 1460 non-null object
ExterCond 1460 non-null object
Foundation 1460 non-null object
BsmtQual 1423 non-null object
BsmtCond 1423 non-null object
BsmtExposure 1422 non-null object
BsmtFinType1 1423 non-null object
BsmtFinSF1 1460 non-null int64
BsmtFinType2 1422 non-null object
BsmtFinSF2 1460 non-null int64
BsmtUnfSF 1460 non-null int64
TotalBsmtSF 1460 non-null int64
Heating 1460 non-null object
HeatingQC 1460 non-null object
CentralAir 1460 non-null object
Electrical 1459 non-null object
1stFlrSF 1460 non-null int64
2ndFlrSF 1460 non-null int64
LowQualFinSF 1460 non-null int64
GrLivArea 1460 non-null int64
BsmtFullBath 1460 non-null int64
BsmtHalfBath 1460 non-null int64
FullBath 1460 non-null int64
HalfBath 1460 non-null int64
BedroomAbvGr 1460 non-null int64
KitchenAbvGr 1460 non-null int64
KitchenQual 1460 non-null object
TotRmsAbvGrd 1460 non-null int64
Functional 1460 non-null object
Fireplaces 1460 non-null int64
FireplaceQu 770 non-null object
GarageType 1379 non-null object
GarageYrBlt 1379 non-null float64
GarageFinish 1379 non-null object
GarageCars 1460 non-null int64
GarageArea 1460 non-null int64
GarageQual 1379 non-null object
GarageCond 1379 non-null object
PavedDrive 1460 non-null object
WoodDeckSF 1460 non-null int64
OpenPorchSF 1460 non-null int64
EnclosedPorch 1460 non-null int64
3SsnPorch 1460 non-null int64
ScreenPorch 1460 non-null int64
PoolArea 1460 non-null int64
PoolQC 7 non-null object
Fence 281 non-null object
MiscFeature 54 non-null object
MiscVal 1460 non-null int64
MoSold 1460 non-null int64
YrSold 1460 non-null int64
SaleType 1460 non-null object
SaleCondition 1460 non-null object
SalePrice 1460 non-null int64
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB
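The `info()` listing shows that several columns are mostly empty (for example, `PoolQC` has only 7 non-null values out of 1460). One way to rank columns by how much is missing is sketched below; `missing_summary` is a hypothetical helper name, and the small `demo` frame stands in for `train_data`:

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value count and fraction per column, worst first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "fraction": counts / len(df),
    })
    # Keep only the columns that actually have gaps.
    summary = summary[summary["missing"] > 0]
    return summary.sort_values("missing", ascending=False)


# Made-up data for illustration only.
demo = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "Alley": [np.nan, np.nan, np.nan, "Grvl"],
})

print(missing_summary(demo))
```

On the real data the call would be `missing_summary(train_data)`.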
train_data.hist(figsize=(20,20))
[Figure: grid of histograms, one for each numeric column of train_data]
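3. Compute the correlation coefficient between each feature and the sale price
# numeric_only=True skips the object (string) columns; pandas 2.0+ raises an error without it
corr_df=train_data.corr(numeric_only=True)['SalePrice'].sort_values(ascending=False)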
corr_df
SalePrice 1.000000
OverallQual 0.790982
GrLivArea 0.708624
GarageCars 0.640409
GarageArea 0.623431
TotalBsmtSF 0.613581
1stFlrSF 0.605852
FullBath 0.560664
TotRmsAbvGrd 0.533723
YearBuilt 0.522897
YearRemodAdd 0.507101
GarageYrBlt 0.486362
MasVnrArea 0.477493
Fireplaces 0.466929
BsmtFinSF1 0.386420
LotFrontage 0.351799
WoodDeckSF 0.324413
2ndFlrSF 0.319334
OpenPorchSF 0.315856
HalfBath 0.284108
LotArea 0.263843
BsmtFullBath 0.227122
BsmtUnfSF 0.214479
BedroomAbvGr 0.168213
ScreenPorch 0.111447
PoolArea 0.092404
MoSold 0.046432
3SsnPorch 0.044584
BsmtFinSF2 -0.011378
BsmtHalfBath -0.016844
MiscVal -0.021190
Id -0.021917
LowQualFinSF -0.025606
YrSold -0.028923
OverallCond -0.077856
MSSubClass -0.084284
EnclosedPorch -0.128578
KitchenAbvGr -0.135907
Name: SalePrice, dtype: float64
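Sorting the signed coefficients works here because the strongest relationships all happen to be positive. A slightly more general sketch ranks by absolute value, so that a strongly negative feature would not be missed. `top_correlated` is a hypothetical helper, and the data below is synthetic:

```python
import numpy as np
import pandas as pd


def top_correlated(df: pd.DataFrame, target: str, k: int) -> pd.Series:
    """Return the k numeric columns with the largest |Pearson r| against target."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order[:k]]


# Synthetic example: y rises with a, falls with b, and ignores c.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": b,
    "c": c,
    "label": ["x"] * 200,  # non-numeric column, dropped by select_dtypes
    "y": 3 * a - 2 * b + 0.1 * rng.normal(size=200),
})

print(top_correlated(demo, "y", 2))
```

The returned Series keeps the sign of each coefficient, so the direction of the relationship is still visible after ranking.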
4. Visualize the ten features most correlated with sale price (scatter plots)
import matplotlib.pyplot as plt
import math
plt.figure(figsize=(10, 10))
col_names = corr_df.index[1:11]  # skip SalePrice itself and keep the next ten
n = len(col_names)
cols = 3
rows = math.ceil(n / cols)
for i in range(n):
    plt.subplot(rows, cols, i + 1)
    plt.scatter(train_data[col_names[i]], train_data['SalePrice'])
    plt.title(col_names[i])
plt.tight_layout()
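The same grid can also be drawn with `plt.subplots`, which returns the axes as an array and makes it easy to hide unused cells (ten plots in a 4×3 grid leave two empty). The `scatter_grid` helper and the random data below are illustrative assumptions, not part of the original notebook:

```python
import math

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display; drop in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def scatter_grid(df, features, target, cols=3):
    """Scatter each feature against target on a grid and hide leftover axes."""
    rows = math.ceil(len(features) / cols)
    fig, axes = plt.subplots(rows, cols, figsize=(10, 10), squeeze=False)
    flat = axes.ravel()
    for ax, name in zip(flat, features):
        ax.scatter(df[name], df[target], s=5)
        ax.set_title(name)
    # Any axes beyond the last feature stay in the figure but are not drawn.
    for ax in flat[len(features):]:
        ax.set_visible(False)
    fig.tight_layout()
    return fig


# Random stand-in data; with the real dataset pass train_data and col_names.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(50, 5)), columns=["f1", "f2", "f3", "f4", "y"])
fig = scatter_grid(demo, ["f1", "f2", "f3", "f4"], "y")
print(len(fig.axes))
```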
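5. Initialize the regression weights and define mean squared error as the evaluation metric
import numpy as np
theta = np.random.rand(len(col_names) + 1)  # ten feature weights plus one bias weight
print(theta)
def f(x):
    # Linear prediction: dot product of the weights with each row of x
    return np.dot(theta, x.T)
def mse(x, y):
    # Mean squared error between predictions and targets
    return np.sum((f(x) - y) ** 2) / len(y)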
[0.14013852 0.70306677 0.43516643 0.57665082 0.19052374 0.21492676
0.61924449 0.24905321 0.06569707 0.07960751 0.95983703]
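In symbols, the two functions defined in this section compute the following, where θ is the 11-element weight vector, x_i is the feature vector of sample i with a trailing 1 for the bias, and m is the number of samples (this notation is added here for clarity and does not appear in the original code):

```latex
f_\theta(x_i) = \theta^\top x_i,
\qquad
\mathrm{MSE}(\theta) = \frac{1}{m} \sum_{i=1}^{m} \bigl( f_\theta(x_i) - y_i \bigr)^2
```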
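6. Set the learning rate and train the model
learning_rate = 0.0001
train_x = train_data[col_names]
train_x = (train_x - train_x.mean()) / train_x.std()  # standardize each feature
train_x['b'] = 1  # append the bias term
train_y = train_data['SalePrice']
for i in range(100):
    # Batch gradient descent step using the gradient summed over all rows
    theta = theta - learning_rate * np.dot(f(train_x) - train_y, train_x)
    metrics = mse(train_x, train_y)
    print(f'iter:{i},mse:{metrics}')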
iter:0,mse:25817371843.540993
iter:1,mse:19070779958.704334
iter:2,mse:14319710788.67177
iter:3,mse:10861663959.744366
iter:4,mse:8337269406.214424
iter:5,mse:6493530020.578001
iter:6,mse:5146444679.488889
iter:7,mse:4161822192.191155
iter:8,mse:3441770537.6153336
iter:9,mse:2914871989.9557214
iter:10,mse:2529019275.7782683
iter:11,mse:2246190793.872005
iter:12,mse:2038639905.947897
iter:13,mse:1886115523.5835316
iter:14,mse:1773834843.9879074
iter:15,mse:1691004653.6156518
iter:16,mse:1629742726.9989412
iter:17,mse:1584291037.725107
iter:18,mse:1550441808.3864832
iter:19,mse:1525118802.0107481
iter:20,mse:1506071846.899198
iter:21,mse:1491653956.3700118
iter:22,mse:1480658696.958829
iter:23,mse:1472201506.2484279
iter:24,mse:1465633072.234535
iter:25,mse:1460476103.0681505
iter:26,mse:1456379162.2852933
iter:27,mse:1453082955.9092643
iter:28,mse:1450395705.962204
iter:29,mse:1448175155.3018467
iter:30,mse:1446315412.7260683
iter:31,mse:1444737331.6370451
iter:32,mse:1443381468.856506
iter:33,mse:1442202927.8996868
iter:34,mse:1441167579.0150025
iter:35,mse:1440249285.4449391
iter:36,mse:1439427865.4210844
iter:37,mse:1438687592.4084334
iter:38,mse:1438016089.381014
iter:39,mse:1437403511.781369
iter:40,mse:1436841942.1847355
iter:41,mse:1436324940.3953006
iter:42,mse:1435847207.818527
iter:43,mse:1435404335.9915845
iter:44,mse:1434992617.215805
iter:45,mse:1434608901.1250343
iter:46,mse:1434250485.328487
iter:47,mse:1433915031.4142568
iter:48,mse:1433600499.9022899
iter:49,mse:1433305099.4212935
iter:50,mse:1433027246.6189718
iter:51,mse:1432765534.220597
iter:52,mse:1432518705.3157291
iter:53,mse:1432285632.44166
iter:54,mse:1432065300.3920307
iter:55,mse:1431856791.9445975
iter:56,mse:1431659275.898459
iter:57,mse:1431471996.956677
iter:58,mse:1431294267.0985336
iter:59,mse:1431125458.166575
iter:60,mse:1430964995.4542654
iter:61,mse:1430812352.1258454
iter:62,mse:1430667044.3346698
iter:63,mse:1430528626.9328084
iter:64,mse:1430396689.6850588
iter:65,mse:1430270853.9163136
iter:66,mse:1430150769.533587
iter:67,mse:1430036112.3737445
iter:68,mse:1429926581.8357728
iter:69,mse:1429821898.7626548
iter:70,mse:1429721803.5430362
iter:71,mse:1429626054.4070165
iter:72,mse:1429534425.8938847
iter:73,mse:1429446707.4724996
iter:74,mse:1429362702.2974653
iter:75,mse:1429282226.0863287
iter:76,mse:1429205106.1047971
iter:77,mse:1429131180.2485106
iter:78,mse:1429060296.2112334
iter:79,mse:1428992310.730462
iter:80,mse:1428927088.9024827
iter:81,mse:1428864503.559771
iter:82,mse:1428804434.7044203
iter:83,mse:1428746768.991961
iter:84,mse:1428691399.260538
iter:85,mse:1428638224.1009636
iter:86,mse:1428587147.4636173
iter:87,mse:1428538078.29861
iter:88,mse:1428490930.2259846
iter:89,mse:1428445621.233077
iter:90,mse:1428402073.3964493
iter:91,mse:1428360212.6260748
iter:92,mse:1428319968.429696
iter:93,mse:1428281273.6954813
iter:94,mse:1428244064.4913058
iter:95,mse:1428208279.8791373
iter:96,mse:1428173861.7431722
iter:97,mse:1428140754.6304965
iter:98,mse:1428108905.6031668
iter:99,mse:1428078264.1007133
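The update in the training loop is θ ← θ − α·X^T(Xθ − y), a gradient step on half the summed squared error. Relative to the per-sample average, the effective step is therefore about 0.0001 × 1460 ≈ 0.146. One way to check whether the iterations have converged is to compare against the closed-form least-squares solution. The sketch below does this on synthetic data, assuming the same standardize-then-append-bias layout; all names and constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 500, 3

# Synthetic features and target with known weights plus noise.
raw = rng.normal(loc=5.0, scale=2.0, size=(m, n))
y = raw @ np.array([4.0, -2.0, 1.0]) + 10.0 + rng.normal(scale=0.5, size=m)

# Standardize columns (ddof=1 matches the pandas default for std) and append a bias column.
X = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)
X = np.hstack([X, np.ones((m, 1))])


def mse(theta):
    return np.mean((X @ theta - y) ** 2)


# Batch gradient descent with the summed (not averaged) gradient.
theta = rng.random(n + 1)
learning_rate = 0.0005
for _ in range(200):
    theta = theta - learning_rate * (X.T @ (X @ theta - y))

# Closed-form least squares for reference.
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

print("gradient descent MSE:", mse(theta))
print("least squares MSE:   ", mse(theta_ls))
```

If the two MSE values agree, further iterations cannot improve the linear fit on the training data.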
7. Compare predicted and actual values for the first 200 samples
plt.figure(figsize=(20,10))
plt.plot(np.arange(200),train_y[:200],c='red',linestyle='--')
plt.plot(np.arange(200),f(train_x[:200]),c='green',linestyle='-.')
[Figure: actual SalePrice (red dashed line) and predicted price (green dash-dot line) for the first 200 samples]
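A plot of 200 points gives a visual impression, but a numeric summary is easier to compare across models. The final training MSE of about 1.428e9 corresponds to an RMSE of roughly 37,800 in the units of SalePrice. The sketch below computes RMSE and R² for arbitrary arrays; `regression_report` is a hypothetical helper and the numbers are made up:

```python
import numpy as np


def regression_report(y_true, y_pred):
    """Return (rmse, r2) for two equal-length 1-D arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = y_true - y_pred
    rmse = np.sqrt(np.mean(residual ** 2))
    # R^2 = 1 - SS_res / SS_tot, with SS_tot measured around the mean of y_true.
    ss_res = np.sum(residual ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return rmse, 1.0 - ss_res / ss_tot


# Made-up prices and predictions for illustration.
actual = [200000, 150000, 250000, 180000]
predicted = [210000, 140000, 240000, 190000]
rmse, r2 = regression_report(actual, predicted)
print(f"RMSE={rmse:.1f}, R2={r2:.3f}")
```

With the data from this post, the call would be `regression_report(train_y, f(train_x))`.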
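8. Conclusion
Because this post uses the simplest linear regression model and trains on only ten of the roughly eighty available feature columns, predictions are poor for some extreme values, but overall they follow the actual prices reasonably well.
Two options for better predictions are a higher-capacity polynomial regression model or a multi-layer neural network.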
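As a rough illustration of the polynomial regression suggestion, the sketch below fits a straight line and a degree-2 polynomial to synthetic curved data and compares their training MSE. It uses `np.polyfit` on a single made-up feature rather than the Ames data, so the numbers say nothing about actual house prices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic one-feature data with a quadratic trend plus noise.
x = np.linspace(-3, 3, 200)
y = 2.0 * x ** 2 + x + 5.0 + rng.normal(scale=1.0, size=x.size)


def fit_mse(degree):
    """Least-squares polynomial fit of the given degree; returns training MSE."""
    coeffs = np.polyfit(x, y, degree)
    pred = np.polyval(coeffs, x)
    return np.mean((pred - y) ** 2)


mse_linear = fit_mse(1)
mse_quadratic = fit_mse(2)
print("degree 1 MSE:", mse_linear)
print("degree 2 MSE:", mse_quadratic)
```

A higher degree never raises the training error, so a held-out validation set is needed to judge whether the extra capacity actually helps.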