A First Look at House Price Prediction

StayRealMon

Article link: https://blog.csdn.net/sinat_34391940/article/details/82912629

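For my second learning exercise I again picked a classic Kaggle competition: house price prediction. Right after downloading the dataset I was a little stunned by its seventy-plus attributes. It is somewhat more complex than the Titanic dataset, and predicting a house price is not a simple dead-or-alive binary classification problem. Since this is my first time working with this kind of predictive model, I again start from a high-ranking script to get a feel for the overall workflow, and then study each module in turn.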

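The structure is similar to the previous Titanic post: getting to know the dataset -> handling missing values -> feature engineering -> base models -> model stacking -> training and prediction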

Getting to Know the Dataset

import pandas as pd
train = pd.read_csv('D:\\Dataset\\HousePrices\\train.csv')
test = pd.read_csv('D:\\Dataset\\HousePrices\\test.csv')

print(train.head(10))
print('*'*10)
train.info()

[Figure: output of train.head(10) and train.info()]

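The output shows each attribute's type and its number of missing values, among other information. Missing-value handling comes next, after a few preparatory steps.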
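Since `train.info()` reports non-null counts column by column, a ranked table of missing values can be easier to scan. The following is a minimal sketch on a tiny made-up frame: only the column names come from the Kaggle data, the values are invented, and `missing_summary` is a hypothetical helper name rather than something from the referenced script.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return the count and percentage of missing values per column, worst first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        'missing': counts,
        'percent': 100 * counts / len(df),
    })
    # Keep only the columns that actually have gaps
    summary = summary[summary['missing'] > 0]
    return summary.sort_values('missing', ascending=False)

# Toy data standing in for train.csv (values are invented)
toy = pd.DataFrame({
    'LotFrontage': [65.0, np.nan, 68.0, np.nan],
    'Alley': [np.nan, np.nan, np.nan, 'Pave'],
    'GrLivArea': [1710, 1262, 1786, 1717],
})
print(missing_summary(toy))
```

On the real data the same call would be `missing_summary(train)`, which should make the columns with the most gaps stand out immediately.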

The Id column has no effect on the final house price, so it can be saved to a variable first and then dropped:

train_Id = train['Id']
test_Id = test['Id']
train.drop("Id", axis = 1, inplace = True)
test.drop("Id", axis = 1, inplace = True)

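Common sense says that the larger a house is, the higher its price should be. The plot below shows the relationship between living area and sale price: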

import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()
sns.set_style('darkgrid')

fig, ax = plt.subplots()
ax.scatter(x = train['GrLivArea'], y = train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()

[Figure: scatter plot of SalePrice against GrLivArea]
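As the plot shows, the vast majority of points match this intuition, but a handful have an unusually large living area and an unusually low price. These outliers need to be removed before handling missing values: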

train = train.drop(train[(train['GrLivArea']>4000) & (train['SalePrice']<300000)].index)
fig, ax = plt.subplots()
ax.scatter(train['GrLivArea'], train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()

[Figure: scatter plot of SalePrice against GrLivArea after removing the outliers]
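The outlier filter combines two boolean masks with `&`. A small sketch on invented rows, using the same thresholds as the code above, shows which rows end up flagged. The numbers are illustrative and are not taken from the Kaggle data.

```python
import pandas as pd

# Invented rows; only the column names and thresholds match the post
toy = pd.DataFrame({
    'GrLivArea': [1500, 2200, 4700, 5600, 4300],
    'SalePrice': [180000, 250000, 185000, 160000, 750000],
})

# Large living area AND low price: the suspicious corner of the scatter plot
is_outlier = (toy['GrLivArea'] > 4000) & (toy['SalePrice'] < 300000)
cleaned = toy[~is_outlier].reset_index(drop=True)

print(is_outlier.sum(), 'rows flagged;', len(cleaned), 'rows kept')
```

Note that the large and expensive house (4300, 750000) is kept, because it still follows the area-price trend. Only rows that satisfy both conditions are dropped.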

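There is one more detail here: linear models train better when the target values are approximately normally distributed, because this reduces the influence of skew on training. A key step is therefore to transform the target toward a normal distribution. The distribution of sale prices before the transformation is plotted as follows: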

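from scipy import stats
from scipy.stats import norm, skew

# Histogram of SalePrice with a fitted normal curve.
# Note: distplot is deprecated since seaborn 0.11; this call assumes a seaborn version that still ships it.
sns.distplot(train['SalePrice'], fit=norm)
# Fit a normal distribution to get its mean and standard deviation
(mu, sigma) = norm.fit(train['SalePrice'])
print('\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

# Raw string so that \mu and \sigma reach matplotlib's mathtext unchanged
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f})'.format(mu, sigma)], loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')

# Q-Q plot of SalePrice against the normal distribution
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()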
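Besides reading the plots, the degree of skew can be put into a single number with `scipy.stats.skew`, which the code above imports. The sketch below uses synthetic samples as stand-ins (a log-normal sample for a price-like column and a normal sample for comparison). It is not the Kaggle data, and the distribution parameters are assumptions chosen only for illustration.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
# Synthetic stand-ins, not the Kaggle data
skewed = rng.lognormal(mean=12.0, sigma=0.4, size=5000)      # long right tail, like prices
symmetric = rng.normal(loc=180000, scale=50000, size=5000)   # roughly bell-shaped

print('skewed sample:    {:.2f}'.format(skew(skewed)))
print('symmetric sample: {:.2f}'.format(skew(symmetric)))
```

A value near zero indicates a roughly symmetric distribution, while a clearly positive value indicates a long right tail. On the real data, `skew(train['SalePrice'])` would give the corresponding figure.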

[Figure: SalePrice distribution with the fitted normal curve, and its probability plot]
The transformation used here is NumPy's log1p() function, which computes log(1+x) and reduces the skew of the data:

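import numpy as np

# Apply the log1p() transformation to the target
train["SalePrice"] = np.log1p(train["SalePrice"])
# Plot the distribution of the transformed target
# (distplot is deprecated since seaborn 0.11; this assumes a version that still ships it)
sns.distplot(train['SalePrice'], fit=norm)
(mu, sigma) = norm.fit(train['SalePrice'])
print('\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

# Raw string so that \mu and \sigma reach matplotlib's mathtext unchanged
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f})'.format(mu, sigma)], loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')

# Q-Q plot of the transformed target
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()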
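One consequence of training on the log1p scale is that any model output is also on that scale, so it has to be mapped back with the inverse function `np.expm1` before being reported as a price. A minimal round-trip sketch, with illustrative price values that are not taken from the dataset:

```python
import numpy as np

prices = np.array([129500.0, 181000.0, 755000.0])  # illustrative values

log_prices = np.log1p(prices)      # the scale the model is trained on
restored = np.expm1(log_prices)    # back to the original price scale

print(np.allclose(restored, prices))
```

In a full pipeline the same call would presumably wrap the model's predictions, for example `np.expm1(model.predict(test))`, where `model` is a hypothetical fitted estimator not shown in this post.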
